<<

Using the Propensity Score Method to Estimate Causal Effects: A Review and Practical Guide

Organizational Research Methods 00(0) 1-39. © The Author(s) 2012. Reprints and permission: sagepub.com/journalsPermissions.nav. DOI: 10.1177/1094428112447816. http://orm.sagepub.com

Mingxiang Li1

Abstract

Evidence-based management requires management scholars to draw causal inferences. Researchers generally rely on observational data sets and regression models, in which the independent variables have not been exogenously manipulated, to estimate causal effects; however, using such models on observational data sets can produce a biased estimate of the effect of a treatment intervention. This article introduces the propensity score method (PSM), which has previously been widely employed in social science disciplines such as public health, to the management field. This research reviews the PSM literature, develops a procedure for applying the PSM to estimate the causal effects of an intervention, elaborates on the procedure using an empirical example, and discusses the potential application of the PSM in different management fields. The implementation of the PSM in the management field will increase researchers' ability to draw causal inferences using observational data sets.

Keywords: causal effect, propensity score method

Management scholars are interested in drawing causal inferences (Mellor & Mark, 1998). One example of a causal inference that researchers might try to determine is whether a specific manage- ment practice, such as group training or a stock option plan, increases organizational performance. Typically, management scholars rely on observational data sets to estimate causal effects of the management practice. Yet, endogeneity—which occurs when a predictor variable correlates with the error term—prevents scholars from drawing correct inferences (Antonakis, Bendahan, Jacquart, & Lalive, 2010; Wooldridge, 2002). Econometricians have proposed a number of techniques to deal

1Department of Management and Human Resources, University of Wisconsin-Madison, Madison, WI, USA

Corresponding Author: Mingxiang Li, Department of Management and Human Resources, University of Wisconsin-Madison, 975 University Avenue, 5268 Grainger Hall, Madison, WI 53706, USA Email: [email protected]

with endogeneity—including selection models, fixed effects models, and instrumental variables—all of which have been used by management scholars. In this article, I introduce the propensity score method (PSM) as another technique that can be used to calculate causal effects. In management research, many scholars are interested in evidence-based management (Rynes, Giluk, & Brown, 2007), which "derives principles from research evidence and translates them into practices that solve organizational problems" (Rousseau, 2006, p. 256). To contribute to evidence-based management, scholars must be able to draw correct causal inferences. Cox (1992) defined a cause as an intervention that brings about a change in the variable of interest, compared with the baseline control model. A causal effect can be simply defined as the average effect due to a certain intervention or treatment. For example, researchers might be interested in the extent to which training influences future earnings. While the field experiment is one approach that can be used to correctly estimate causal effects, in many situations field experiments are impractical. This has prompted scholars to rely on observational data, which makes it difficult to gauge unbiased causal effects. The PSM is a technique that, if used appropriately, can increase scholars' ability to draw causal inferences using observational data. Though widely implemented in other social science fields, the PSM has generally been overlooked by management scholars. Since it was introduced by Rosenbaum and Rubin (1983), the PSM has been widely used by economists (Dehejia & Wahba, 1999) and medical scientists (Wolfe & Michaud, 2004) to estimate causal effects.
Recently, financial scholars (Campello, Graham, & Harvey, 2010), sociologists (Gangl, 2006; Grodsky, 2007), and political scientists (Arceneaux, Gerber, & Green, 2006) have implemented the PSM in their empirical studies. A Google Scholar search in early 2012 showed that over 7,300 publications cited Rosenbaum and Rubin's classic 1983 article that introduced the PSM. An additional Web of Science analysis indicated that over 3,000 academic articles cited this influential article. Of these citations, 20% of the publications were in economics, 14% were in other social science disciplines, 10% were in methodological journals, and the remaining 56% were in health-related fields. Despite the widespread use of the PSM across a variety of disciplines, it has not been employed by management scholars, prompting Gerhart's (2007) conclusion that "to date, there appear to be no applications of propensity score in the management literature" (p. 563). This article begins with an overview of the counterfactual model, experiments, regression, and endogeneity. This section illustrates why the counterfactual model is important for estimating causal effects and why regression models sometimes cannot successfully reconstruct counterfactuals. This is followed by a short review of the PSM and a discussion of the reasons for using it. The third section employs a detailed example to illustrate how a treatment effect can be estimated using the PSM. The following section presents a short summary of the empirical studies that have used the PSM in other social science fields, along with a description of potential applications of the PSM in the management field. Finally, this article concludes with a discussion of the pros and cons of using the PSM to estimate causal effects.

Estimating Causal Effects Without the Propensity Score Method

Evidence-based practices use quantitative methods to find reliable effects that can be implemented by practitioners and administrators to develop and adopt effective policy interventions. Because applying specific recommendations derived from evidence-based research is not costless, it is crucial for social scientists to draw correct causal inferences. As pointed out by King, Keohane, and Verba (1994), "we should draw causal inferences where they seem appropriate but also provide the reader with the best and most honest estimate of the uncertainty of that inference" (p. 76).


Counterfactual Model

To better understand causal effects, it is important to discuss counterfactuals. In Rubin's causal model (see Rubin, 2004, for a summary), Y1i and Y0i are the potential earnings for individual i when i receives training (Y1i) or does not receive training (Y0i). The fundamental problem of making a causal inference is how to reconstruct the outcomes that are not observed, sometimes called counterfactuals, because they are not what happened. Conceptually, either the treatment outcome or the nontreatment outcome is not observed and hence is "missing" (Morgan & Winship, 2007). Specifically, if i received training at time t, the earnings for i at t + 1 are Y1i. But if i had not received training at time t, the potential earnings for i at t + 1 would be Y0i. The effect of training can then be simply expressed as Y1i − Y0i. Yet, because it is impossible for i to simultaneously receive (Y1i) and not receive (Y0i) the training, scholars need to find other ways to overcome this fundamental problem. One can also understand this fundamental issue as the "what-if" problem: What if individual i does not receive training? Hence, reconstructing the counterfactuals is crucial for estimating unbiased causal effects. The counterfactual model shows that it is impossible to calculate individual-level treatment effects, and therefore scholars have to calculate aggregated treatment effects (Morgan & Winship, 2007). There are two major versions of aggregated treatment effects: the average treatment effect (ATE) and the average treatment effect on the treated group (ATT). A simple definition of the ATE can be written as

ATE = E(Y1i | Ti = 1, 0) − E(Y0i | Ti = 1, 0),   (1.1a)

where E(.) represents the expectation in the population and Ti denotes the treatment, taking the value 1 for the treated group and 0 for the control group. In other words, the ATE can be defined as the average effect that would be observed if everyone in the treated and the control groups received treatment, compared with if no one in either group received treatment (Harder, Stuart, & Anthony, 2010). The definition of the ATT can be expressed as

ATT = E(Y1i | Ti = 1) − E(Y0i | Ti = 1).   (1.1b)

In contrast to the ATE, the ATT refers to the average difference that would be found if everyone in the treated group received treatment, compared with if none of these individuals received treatment. The value of the ATE will be the same as that of the ATT when the research design is experimental.1
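To make the two estimands concrete, the following is an illustrative simulation (not from the article) in which both potential outcomes are generated for every unit, so the ATE and the ATT can be computed directly; in real data only one of Y1i and Y0i is ever observed:

```python
import random

random.seed(0)

# Hypothetical training example: earnings with (y1) and without (y0) training.
# Training raises earnings by 3 on average and helps high-ability people more;
# people also self-select into training on ability.
n = 10_000
units = []
for _ in range(n):
    ability = random.gauss(0, 1)
    y0 = 10 + 2 * ability                         # earnings without training
    y1 = y0 + 3 + ability                         # individual effect is 3 + ability
    treated = ability + random.gauss(0, 1) > 0    # self-selection on ability
    units.append((y1, y0, treated))

# ATE: average effect over everyone; ATT: average effect over the treated only.
ate = sum(y1 - y0 for y1, y0, _ in units) / n
att = sum(y1 - y0 for y1, y0, t in units if t) / sum(1 for *_, t in units if t)
print(round(ate, 2), round(att, 2))
```

Because the treated self-select on ability, the ATT comes out larger than the ATE (roughly 3.6 versus 3.0 under these assumptions); with an experimental design the two would coincide, as the text notes.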

Experiment

There are ways to estimate treatment effects other than the PSM. Of these, the randomized experiment is the gold standard (Antonakis et al., 2010). If participants are randomly assigned to the treated or the control group, then the treatment effect can simply be estimated by comparing the difference between these two groups. Experimental data can generate an unbiased estimator of causal effects because the randomized design ensures equivalent distributions of the treated and the control groups on all observed and unobserved characteristics. Thus, any observed difference in outcomes can be caused only by the treatment difference. Because randomized experiments can successfully reconstruct counterfactuals, the causal effect generated by an experiment is unbiased.

Regression

In situations when the causal effects of training cannot be studied using an experimental design, scholars still want to examine whether receiving training (T) has any effect on future earnings (Y). In this case, scholars generally rely on potentially biased observational data sets to investigate the causal

effect. For example, one can use a simple regression model by regressing future earnings (Y) on training (T) and demographic variables such as age (x1) and race (x2).

Y = β0 + β1x1 + β2x2 + τT + e.   (1.2)

Scholars then interpret the results by saying "ceteris paribus, the effect due to training is τ," and they typically assume τ is the causal effect of the management intervention. Indeed, regression and structural equation models (SEM) (cf. Duncan, 1975; James, Mulaik, & Brett, 1982) are still the dominant approaches for estimating treatment effects.2 Yet, regression cannot detect whether the cases are comparable in terms of distribution overlap on observed characteristics, and thus regression models are unable to reconstruct counterfactuals. One can easily find many empirical studies that seek to estimate causal effects by regressing an outcome variable on an intervention dummy variable. The findings of these studies, which used observational data sets, could be wrong because they did not adjust for the distribution differences between the treated and control groups.

Endogeneity

In addition to the nonequivalence of distributions between the control and treated groups, another severe problem that prevents scholars from calculating unbiased causal effects is endogeneity. This occurs when the predictor T correlates with the error term e in Equation 1.2. A number of review articles have described the endogeneity problem and warned management scholars of its biasing effects (e.g., Antonakis et al., 2010; Hamilton & Nickerson, 2003). As these reviews discuss, endogeneity arises from measurement error, simultaneity, and omitted variables. Measurement error in explanatory variables typically attenuates the effect size of regression estimators. Simultaneity happens when at least one of the predictors is determined simultaneously along with the dependent variable; an example of simultaneity is the estimation of price in a supply and demand model (Greene, 2008). An omitted variable problem appears when one does not control for additional variables that correlate with the explanatory as well as the dependent variables. Of these three sources of endogeneity, omitted variable bias has probably received the most attention from management scholars. Returning to the earlier training example, suppose the researcher controls only for demographic variables but not for an individual's ability. If training correlates with ability and ability correlates with future earnings, omitting ability will cause a correlation between the training dummy T and the residual e. This violates the assumption of strict exogeneity for regression models, and thus the estimated causal effect (τ) in Equation 1.2 will be biased. If the omitted variable is time-invariant, one can use the fixed effects model to deal with endogeneity (Allison, 2009). Beck, Brüderl, and Woywode's (2008) simulation showed that the fixed effects model corrects for biased estimation due to omitted variables.
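The omitted ability example can be reproduced with a short simulation (an illustrative sketch, not from the article): the same OLS routine is run with and without the ability covariate, and only the full model recovers the true training effect:

```python
import random

random.seed(1)

def ols(X, y):
    """Least squares via the normal equations (X'X)b = X'y, Gauss-Jordan solve."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):
        p = A[i][i]
        A[i] = [v / p for v in A[i]]
        b[i] /= p
        for m in range(k):
            if m != i:
                f = A[m][i]
                A[m] = [vm - f * vi for vm, vi in zip(A[m], A[i])]
                b[m] -= f * b[i]
    return b

n = 20_000
rows, y = [], []
for _ in range(n):
    ability = random.gauss(0, 1)
    t = 1.0 if ability + random.gauss(0, 1) > 0 else 0.0    # selection on ability
    rows.append((1.0, t, ability))
    y.append(2.0 * t + 1.5 * ability + random.gauss(0, 1))  # true effect of T is 2

naive = ols([r[:2] for r in rows], y)[1]   # omits ability: T is endogenous
full = ols(rows, y)[1]                     # controls for ability
print(round(naive, 2), round(full, 2))
```

The naive model attributes part of ability's effect to training (an estimate near 3.7 here), while controlling for ability returns an estimate near the true value of 2.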
One can also view nonrandom sample selection as a special case of the omitted variable problem. Taking the effect of training on earnings as an example, one can only observe earnings for individuals who are employed, and employed individuals could be a nonrandom subset of the population. One can write the nonrandom selection process as Equation 1.3,

D = αZ + u,   (1.3)

where D is the latent selection variable (1 for employed individuals), Z represents a vector of variables (e.g., education level) that predict selection, and u denotes the disturbance. One can call Equation 1.2 the substantive equation and Equation 1.3 the selection equation. Sample selection bias is likely to materialize when there is correlation between the disturbances of the substantive equation (e) and the selection equation (u) (Antonakis et al., 2010, p. 1094; Berk, 1983; Heckman, 1979). When there is a correlation between e and u, the Heckman selection model, rather than the PSM, should be used to calculate

the causal effect (Antonakis et al., 2010). To correct for sample selection bias, one first fits the selection model using a probit or logit model. The predicted values from the selection model are then used to compute the density and distribution values, from which the inverse Mills ratio (λ), the ratio of the density value to the distribution value, is calculated. Finally, the inverse Mills ratio is included in the substantive Equation 1.2 to correct for the bias in τ due to selection. For more information on two-stage selection models, readers can consult Berk (1983).
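As a concrete sketch of the correction term, the inverse Mills ratio is just the standard normal density divided by the standard normal CDF, evaluated at each observation's predicted selection index (the function below is illustrative, not from the article):

```python
import math

def inverse_mills(z: float) -> float:
    """lambda(z) = phi(z) / Phi(z): standard normal density over the
    standard normal CDF at the first-stage (probit) index z."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return pdf / cdf

# The ratio is large for units selected despite a low predicted probability
# of selection, and near zero for units almost certain to be selected.
print(round(inverse_mills(-2.0), 3), round(inverse_mills(0.0), 3),
      round(inverse_mills(2.0), 3))
```

In the second stage, this value is appended as an extra regressor in Equation 1.2.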

The Propensity Score Method

Having briefly reviewed existing techniques for estimating causal effects, I now discuss how the PSM can help scholars draw correct causal inferences. The PSM is a technique that allows researchers to reconstruct counterfactuals using observational data. It does this by reducing two sources of bias in observational data: bias due to lack of distribution overlap and bias due to different density weighting (Heckman, Ichimura, Smith, & Todd, 1998). A propensity score can be defined as the probability of a study participant receiving a treatment based on observed characteristics. The PSM refers to a procedure that uses propensity scores and a matching algorithm to calculate the causal effect. Before moving on, it is useful to conceptually differentiate the PSM from Heckman's (1979) selection model. His selection model deals with the probability of treatment assignment indirectly through instrumental variables. Thus, the probability calculated using the selection model requires one or more variables that are not censored or truncated and that can predict the selection. For example, if one wanted to study how training affects future earnings, one must consider the self-selection problem, because wages can only be observed for individuals who are already employed. Using the predicted probability calculated from the first stage (Equation 1.3), one can compute the inverse Mills ratio and insert this variable into the wage prediction model to correct for selection bias. In contrast to the predicted probability calculated in the Heckman selection model, propensity scores are calculated directly from observed predictors alone.
Furthermore, the propensity scores and the predicted probabilities calculated using the Heckman selection model serve different purposes in estimating causal effects: The probabilities estimated from the Heckman model generate an inverse Mills ratio that adjusts for bias due to censoring or truncation, whereas the probabilities calculated in the PSM are used to adjust the covariate distributions between the treated group and the control group.

Reasons for Using the PSM

Because there are many methods that can estimate causal effects, why should management scholars care about the PSM? One reason is that most publications in the management field rely on observational data. Such large data sets can be relatively inexpensive to obtain, yet they are almost always observational rather than experimental. By adjusting covariates between the treated and control groups, the PSM allows scholars to reconstruct counterfactuals using observational data. If the strongly ignorable assumption that will be discussed in the next section is satisfied, then the PSM can produce an unbiased causal effect using observational data sets. Second, misspecified econometric models using observational data sometimes produce biased estimators. One source of such bias is that the two samples lack distribution overlap, and regression cannot tell researchers the degree of distribution overlap between two samples. Cochran (1957, pp. 265-266) illustrated this problem using the following example: "Suppose that we were adjusting for differences in parents' income in a comparison of private and public school children, and that the private-school incomes ranged from $10,000–$12,000, while the public-school incomes ranged from $4,000–$6,000. The [analysis of covariance] would adjust results so that they allegedly applied to a mean income of $8,000 in each group, although neither group has any observations in which incomes are

at or near this level." The PSM can easily detect the lack of covariate distribution overlap between two groups and adjust the distribution accordingly. Third, linear or logistic models have been used to adjust for covariates, but such models rely on assumptions regarding functional form. For example, one assumption required for a regression model to produce an unbiased estimator is that it does not suffer from the aforementioned problem of endogeneity. Although the procedure to calculate propensity scores is parametric, using propensity scores to compute the causal effect is largely nonparametric. Thus, using the PSM to calculate the causal effect is less susceptible to violations of model assumptions. Overall, when one is interested in investigating the effectiveness of a certain management practice but is unable to collect experimental data, the PSM should be used, at least as a robustness test to corroborate the findings estimated by parametric models.

Overview of the PSM

The concept of subclassification is helpful for understanding the PSM. Simply comparing the mean difference of the outcome variable in two groups typically leads to biased estimators, because the distributions of the observational variables in the two groups may differ. Cochran's (1968) subclassification method first divides an observational variable into n subclasses and then estimates the treatment effect by comparing the weighted averages of the outcome variable across subclasses. He used two approaches to demonstrate the effectiveness of subclassification in reducing bias in observational studies. First, he used an empirical example (death rates for smoking groups, with country of origin and age as covariates) to show that when age was divided into two classes, more than half the effect of the age bias was removed. Second, he used a mathematical model to derive the proportion of bias that can be removed through subclassification. For different distribution functions, using five or six subclasses will typically remove 90% or more of the bias shown in the raw comparison; with more than six subclasses, only small amounts of additional bias can be removed. Yet, subclassification is difficult to utilize if many confounding covariates exist (Rubin, 1997). To overcome the difficulty of estimating treatment effects using Cochran's technique, Rosenbaum and Rubin (1983) developed the PSM. The key objective of the PSM is to replace the many confounding covariates with a single function of these covariates. This function (the propensity score) captures the likelihood of study participants receiving a treatment based on observed covariates. The estimated propensity score is then used as the only confounding covariate to adjust for all of the covariates that go into its estimation.
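Cochran's subclassification idea can be sketched with a small simulation (illustrative, not the article's data): units are sorted on a single confounder, cut into five equal subclasses, and the treated-control outcome difference is averaged across subclasses with weights proportional to subclass size:

```python
import random

random.seed(2)

n = 20_000
data = []
for _ in range(n):
    x = random.gauss(0, 1)                        # single confounder
    t = 1 if x + random.gauss(0, 1) > 0 else 0    # selection on x
    y = 2.0 * t + 1.5 * x + random.gauss(0, 1)    # true treatment effect = 2
    data.append((x, t, y))

# Raw comparison: biased upward because treated units have higher x.
treated_y = [y for _, t, y in data if t]
control_y = [y for _, t, y in data if not t]
raw = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

# Subclassify on x: sort, cut into 5 equal blocks, and average within-block
# treated-control differences weighted by block size (Cochran-style).
data.sort()
k = 5
effect, used = 0.0, 0
for s in range(k):
    block = data[s * n // k:(s + 1) * n // k]
    tr = [y for _, t, y in block if t]
    co = [y for _, t, y in block if not t]
    if tr and co:
        effect += len(block) * (sum(tr) / len(tr) - sum(co) / len(co))
        used += len(block)
effect /= used
print(round(raw, 2), round(effect, 2))
```

Consistent with Cochran's analysis, five subclasses remove most, though not all, of the bias: the raw comparison here lands near 3.7 while the subclassified estimate falls close to the true effect of 2.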
Since the propensity score adjusts for all covariates through a single variable, and Cochran found that five blocks can remove roughly 90% of the bias in a raw comparison, stratifying the propensity score into five blocks can generally remove much of the difference due to the non-overlap of all observed covariates between the treated group and the control group. Central to understanding the PSM is the balancing score. Rosenbaum and Rubin (1983) defined a balancing score as a function of the observable covariates such that the conditional distribution of X given the balancing score is the same for the treated group and the control group. Formally, a balancing score b(X) satisfies X ⊥ T | b(X), where X is a vector of the observed covariates, T represents the treatment assignment, and ⊥ refers to independence. Rosenbaum and Rubin argued that the propensity score is a type of balancing score. They further proved that the finest balancing score is b(X) = X, the coarsest balancing score is the propensity score, and any score that is finer than the propensity score is a balancing score. Rosenbaum and Rubin (1983) also introduced the strongly ignorable assumption, which implies that given the balancing scores, the distributions of the covariates between the treated and the control groups are the same. They further showed that treatment assignment is strongly ignorable if it satisfies the conditions of unconfoundedness and overlap. Unconfoundedness means that conditional on

Downloaded from orm.sagepub.com at Vrije Universiteit 34820 on January 30, 2014 Li 7

the observational covariates X, the potential outcomes (Y1 and Y0) are not influenced by treatment assignment ((Y1, Y0) ⊥ T | X). This assumption simply asserts that the researcher can observe all variables that need to be adjusted. The overlap assumption means that given covariates X, a person with those X values has a positive probability of being assigned to either the treated group or the control group (0 < Pr(T = 1 | X) < 1).

Estimating Causal Effects With the Propensity Score Method

If the treatment assignment is strongly ignorable, scholars can use the PSM to remove the difference in the covariates' distributions between the treated and the control groups (Imbens, 2004). This section details how scholars can apply the PSM to compute causal effects. Generally speaking, four major steps need to take place to estimate a causal effect (Figure 1): (1) determine the observational covariates and estimate the propensity scores, (2) stratify the propensity scores into different strata and test the balance of each stratum, (3) calculate the treatment effect by selecting appropriate methods such as matched sampling (or matching) and covariance adjustment, and (4) conduct a sensitivity test to establish that the estimated ATT is robust. To demonstrate how scholars can use the procedure in Figure 1 to gauge a causal effect, I analyze three sources of data sets that have been widely used by economists (Dehejia & Wahba, 1999, 2002; Heckman & Hotz, 1989; Lalonde, 1986; Smith & Todd, 2005). These data sets include both experimental and observational data. Given that the unbiased treatment effect can be computed from the experimental design, it is possible to compare the discrepancy between the estimated ATT using observational data and the unbiased ATT calculated from the experimental design. The National Supported Work Demonstration (NSW) data were collected using an experimental design in which individuals were randomly chosen to provide data on work experience for a period of around 6 to 18 months in the years from 1975 to 1977. This federally funded program randomly selected qualified individuals for training positions so that they could get paying jobs and accumulate work experience. The other set of qualified individuals was randomly assigned to the control group, where they had no opportunity to receive the benefits of the NSW program.
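As a minimal sketch of Step 3's matched sampling (toy numbers, not the NSW data), 1-nearest-neighbor matching pairs each treated unit with the control unit whose propensity score is closest and averages the outcome differences:

```python
def att_nearest_neighbor(treated, control):
    """ATT by 1-nearest-neighbor matching (with replacement) on the
    propensity score: each treated unit is paired with the control unit
    whose score is closest, and outcome differences are averaged."""
    diffs = []
    for p_t, y_t in treated:
        p_c, y_c = min(control, key=lambda c: abs(c[0] - p_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)

# Tiny hypothetical sample: (propensity score, outcome) pairs.
treated = [(0.8, 10.0), (0.6, 9.0), (0.4, 7.0)]
control = [(0.75, 8.0), (0.5, 7.5), (0.2, 5.0), (0.35, 6.0)]
print(att_nearest_neighbor(treated, control))
```

Production implementations add refinements such as calipers (discarding matches beyond a maximum score distance) and matching with or without replacement.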
To ensure that the earnings information from the experiment included calendar year 1975 earnings, Lalonde (1986) chose participants who were assigned to treatment after December 1975. This procedure reduced the NSW sample to 297 treated individuals and 425 control individuals for the male

participants. Dehejia and Wahba (1999, 2002) reconstructed Lalonde's original NSW data by including individuals who attended the program early enough to obtain retrospective 1974 earnings information. The final NSW sample includes 185 treated and 260 control individuals.

Figure 1. Steps for estimating treatment effects. Note: PScore = propensity scores. The flowchart comprises:
Step 1, estimating the propensity score: determine the observational covariates, including high-order covariates, and estimate the PScore using (1) a logit/probit model, (2) an ordinal probit model, (3) a multinomial logit model, or (4) a hazard model.
Step 2, stratifying and balancing the propensity score: stratify the PScore into different strata and test the balance of the covariates in each stratum; if a covariate is not balanced, return to Step 1 and respecify the model.
Step 3, estimating the causal effect: (1) matched sampling (stratified matching, nearest neighbor matching, radius matching, or kernel matching) or (2) covariate adjustment.
Step 4, sensitivity test: (1) multiple comparison groups, (2) specification tests, (3) instrumental variables, or (4) Rosenbaum bounds.

Lalonde's (1986) observational data consisted of two distinct comparison groups covering the years 1975 to 1979: the Panel Study of Income Dynamics (PSID-1) and the Current Population Survey–Social Security Administration File (CPS-1). Initiated in 1968, the PSID is a nationally representative longitudinal database that interviews individuals and families about the dynamics of employment, income, and earnings. The CPS, a monthly survey conducted by the Bureau of the Census for the Bureau of Labor Statistics, provides comprehensive information on the unemployment, income, and poverty of the nation's population. Lalonde further extracted four data sets (denoted PSID-2, PSID-3, CPS-2, and CPS-3) that represent the treatment group based on


simple pre-intervention characteristics (e.g., age or employment status; see Table 1a for details). Table 1a reports the details of the data sets and the definitions of the variables.

Table 1a. Description of Data Sets and Definition of Variables

Data sets (sample size):
NSW Treated (185): National Supported Work Demonstration (NSW) data were collected using an experimental design in which qualified individuals were randomly assigned to training positions to receive pay and accumulate experience.
NSW Control (260): Experimental control group; a set of qualified individuals randomly assigned to this group, with no opportunity to receive the benefits of the NSW program.
PSID-1 (2,490): Nonexperimental comparison group from the 1975-1979 Panel Study of Income Dynamics (PSID); all male household heads under age 55 who did not classify themselves as retired in 1975.
PSID-2 (253): Subset of PSID-1 comprising men who were not working in the spring of 1976.
PSID-3 (128): Subset of PSID-2 comprising men who were also not working in the spring of 1975.
CPS-1 (15,992): Nonexperimental comparison group from the 1975-1979 Current Population Survey (CPS); all participants under age 55.
CPS-2 (2,369): Subset of CPS-1 comprising men who were not working when surveyed in March 1976.
CPS-3 (429): Subset of CPS-2 comprising men unemployed in 1976 whose income in 1975 was below the poverty line.

Variables:
Treatment: Set to 1 if the participant comes from the NSW treated data set, 0 otherwise
Age: Age of the participant (in years)
Education: Number of years of schooling
Black: Set to 1 for Black participants, 0 otherwise
Hispanic: Set to 1 for Hispanic participants, 0 otherwise
Married: Set to 1 for married participants, 0 otherwise
Nodegree: Set to 1 for participants with no high school degree, 0 otherwise
RE74: Earnings in 1974
RE75: Earnings in 1975
RE78: Earnings in 1978 (the outcome variable)

Step 1: Estimating the Propensity Scores

To calculate a propensity score, one first needs to determine the covariates. Heckman, Ichimura, and Todd (1997) demonstrated that the quality of the observational variables has a significant impact on the estimated results. Knowledge of the relevant theory, institutional settings, and previous research helps scholars specify which variables should be included in the model (Smith & Todd, 2005). To appropriately represent the theory, scholars need to specify not only the observational covariates but also high-order covariates such as quadratic and interaction effects. From a methodological perspective, researchers need to add high-order covariates to achieve strata balance. The process of adding high-order covariates will be discussed in the section detailing how to obtain a balance of propensity scores in each stratum. A recent development called boosted regression can also be implemented to calculate propensity scores (McCaffrey, Ridgeway, &


Table 1b. Sample Statistics

                     Age     Education  Black   Hispanic  Married  Nodegree  RE74       RE75       N
NSW Treated    M     25.82   10.35      0.84    0.06      0.19     0.71      2,095.57   1,532.06   185
               SD    7.16    2.01       0.36    0.24      0.39     0.46      4,886.62   3,219.25
NSW Control    M     25.05   10.09      0.83    0.11      0.15     0.83      2,107.03   1,266.91   260
               SD    7.06    1.61       0.38    0.31      0.36     0.37      5,687.91   3,102.98
               SB    10.73   14.12      4.39    17.46     9.36     30.40     0.22       8.39
PSID-1(a)      M     34.85   12.12      0.25    0.03      0.87     0.31      19,428.75  19,063.34  2,490
               SD    10.44   3.08       0.43    0.18      0.34     0.46      13,406.88  13,596.95
               SB    100.94  68.05      147.98  12.86     184.23   87.92     171.78     177.44
PSID-1M(b)     M     30.96   11.14      0.70    0.05      0.45     0.41      11,386.48  9,528.64   1,103
               SD    9.46    2.59       0.49    0.22      0.42     0.49      9,326.64   8,222.72
               SB    61.35   34.29      33.13   3.48      64.37    62.44     124.79     128.07
% reduction in SB    39.22   49.61      77.61   72.94     65.06    28.98     27.36      27.82

Note: SB = standardized bias estimated using Formula 2.1; N = number of cases.
(a) PSID-1: All male household heads under age 55 who did not classify themselves as retired.
(b) PSID-1M is the subsample of PSID-1 that is matched to the treatment group (NSW treated).
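The SB column of Table 1b can be reproduced, up to rounding of the reported means and standard deviations, with the usual standardized bias formula: the mean difference scaled by the average standard deviation of the two groups. This sketch assumes that definition corresponds to the article's Formula 2.1:

```python
import math

def standardized_bias(mean_t, sd_t, mean_c, sd_c):
    """Standardized bias in percent: covariate mean difference divided by
    the square root of the average of the two group variances."""
    return 100 * (mean_t - mean_c) / math.sqrt((sd_t ** 2 + sd_c ** 2) / 2)

# Age column, NSW treated versus the full PSID-1 comparison group (Table 1b).
sb_age = standardized_bias(25.82, 7.16, 34.85, 10.44)
print(round(abs(sb_age), 2))  # close to the 100.94 reported in the table
```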

Morral, 2004). Boosted regression can simplify the process of achieving balance in each stratum; Appendix A provides further discussion of this technique. Steiner, Cook, Shadish, and Clark (2010) replicated a prior study to show the importance of appropriately selecting covariates. They summarized three strategies for covariate selection: First, select covariates that are correctly measured and modeled. Second, choose covariates that reduce selection bias; these will be covariates that are highly correlated with the treatment and with the outcomes. Finally, if there is no theoretically or empirically sound guidance for covariate selection (e.g., the research question is very new), scholars can measure a rich set of covariates to increase the likelihood of including covariates that satisfy the strongly ignorable assumption. After specifying the observational covariates, the propensity scores can be estimated using these observational variables. This article summarizes four different approaches that can be used to estimate the propensity scores. If there is only one treatment (e.g., training), then one can use a logistic model, a probit model, or a prepared program.3 If the treatment has more than two versions (e.g., individuals receive several doses of medicine), then an ordinal logistic model can be used (Joffe & Rosenbaum, 1999); the treatments must be ordered based on certain threshold values. If there is more than one treatment and the treatments are discrete choices (e.g., Group 1 receives payment, Group 2 receives training), the propensity scores can be estimated using a multinomial logistic model. Receiving treatment does not need to happen at the same time: For many treatments, a decision needs to be made about whether to treat now or to wait and treat later, and that decision is driven by the participants' preferences.
Under this condition, one can use the Cox proportional hazard model to compute the propensity scores. Li, Propert, and Rosenbaum (2001) demonstrated that the hazard model has properties similar to those of propensity scores. Except for the Cox model, which uses partial likelihood (PL) and does not require one to specify the baseline hazard function, the estimating technique used in the aforementioned models is maximum likelihood estimation (MLE) (see Greene, 2008, Chapter 16, for more information on MLE). The logistic models and the hazard model all assume a latent variable (Y*) that represents an underlying propensity or probability to receive treatment. Long (1997) argues that one can view a binary outcome variable as a latent variable. When the estimated probability is greater than a certain threshold or cut point (t), one observes the treatment (Y* > t; T = 1). For an ordinal logistic model, one can

understand the latent variable as having multiple thresholds and observe the treatment according to those thresholds (e.g., t1 < Y* < t2; T = 2). The multinomial logistic model can simply be viewed as a model that simultaneously estimates a binary model for all possible comparisons among the outcome categories (Long, 1997), but it is more efficient to use a multinomial logistic model than multiple binary models. It is somewhat tricky to generate the predicted probability from the Cox model because the model is semiparametric, with no assumption about the baseline distribution. Two alternatives can be used to derive probabilities from a survival model: (1) One can rely on a parametric survival model that specifies the baseline hazard; (2) one can transform the data in order to use a discrete-time model.
To illustrate how to calculate propensity scores, this study employed treatment group data from the NSW and control group data from the observational data extracted from the PSID-2. Following Dehejia and Wahba (1999), I selected age, education, no degree, Black, Hispanic, RE74, RE75, age square, RE74 square, RE75 square, and RE74 × Black as covariates to calculate the propensity scores. To compute propensity scores, one can first run a logistic or probit model using a treatment dummy (whether an individual received training) as the dependent variable and the aforementioned covariates as the independent variables. The propensity scores are then obtained as the fitted values from the logistic or probit model (use -predict mypscore, p- in STATA). Readers can refer to Hoetker (2007) for more information on calculating probabilities from logit or probit models. Appendix B includes a randomly selected sample (n = 50), with calculated propensity scores, from the combined NSW and PSID-2 data set. Readers can obtain the data for Appendix B, NSW Treated, and PSID-2 from the author.
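To make this step concrete, the following Python sketch (not the article's code) fits a logistic propensity model by Newton-Raphson and recovers the fitted values as propensity scores. The data are simulated, and the covariate names age and re75 merely echo the NSW example; in practice one would use a prepared routine such as -pscore- in STATA as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observational sample; the selection model is invented for this sketch.
n = 500
age = rng.normal(30, 8, n)
re75 = rng.normal(15, 5, n)
p_true = 1 / (1 + np.exp(-(1.0 - 0.05 * age - 0.08 * re75)))
treat = (rng.random(n) < p_true).astype(float)

# Standardize covariates (helps Newton-Raphson converge) and add an intercept.
Z = np.column_stack([(age - age.mean()) / age.std(),
                     (re75 - re75.mean()) / re75.std()])
X = np.column_stack([np.ones(n), Z])

def fit_logit(X, y, iters=50):
    """Fit a logistic model by Newton-Raphson (equivalent to MLE)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

beta = fit_logit(X, treat)
pscore = 1 / (1 + np.exp(-X @ beta))  # fitted values = estimated propensity scores
```

Because treatment truly depends on the covariates here, the estimated propensity scores are higher, on average, for the treated cases.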

Step 2: Stratifying and Balancing the Propensity Scores

After estimating the propensity scores, the next step is to subclassify them into different strata (blocks) such that these blocks are balanced on propensity scores. The number of balanced propensity score blocks depends on the number of observations in the data set. As discussed previously, five blocks are a good starting point for stratifying the propensity scores (Rosenbaum & Rubin, 1983). One can then test the balance of each block by examining the distribution of the covariates and the means of the propensity scores. The t-test and the test for standardized bias (SB) are two widely used techniques for ensuring the balance of the strata (Rosenbaum & Rubin, 1985). The t-test compares whether the means of the covariates differ between the treated and the matched control groups. The SB approach calculates the difference of sample means in the treated and the matched control groups as a percentage of the square root of the average sample variance in both groups. To conduct the SB test, scholars need to compare values calculated before and after matching. The formula used to calculate the SB value can be written as

SB_{match} = 100 \times \frac{ |\bar{X}_{1M} - \bar{X}_{0M}| }{ \sqrt{ 0.5 ( V_{1M}(X) + V_{0M}(X) ) } },    (2.1)

where \bar{X}_{1M} (V_{1M}) and \bar{X}_{0M} (V_{0M}) are the means (variances) for the treated group and the matched control group, respectively. In addition to these two widely used tests, the Kolmogorov-Smirnov two-sample test can also be used to investigate the overlap of the covariates between the treated and the control groups.
Balanced strata between the treated and the matched control groups ensure minimal distance in the marginal distributions of the covariates. If any pretreatment variable is not balanced in a particular block, one needs to subclassify the block into additional blocks until all blocks are balanced. To obtain strata balance, researchers sometimes need to add higher-order covariates and recalculate the propensity scores. Rosenbaum and Rubin (1984) detailed the process of cycling between checking for balance within strata and reformulating the propensity model. Two guidelines for adding higher-order covariates have been proposed: (1) When the variances of a critical covariate are found

to differ dramatically between the treatment and the control group, the squared term of the covariate needs to be included in the revised propensity score model, and (2) when the correlation between two important covariates differs greatly between the groups, the interaction of the covariates can be added to the propensity score model.
Appendix B shows a simple example of stratifying data into five blocks after calculating the propensity scores. For this illustration, I stratified the 50 cases into five groups. I first identified the cases with propensity scores smaller than 0.05, which were classified as unmatched. When the propensity scores were smaller than 0.2 but larger than 0.05, I coded this as block 1 (Block ID = 1). When the propensity scores were smaller than 0.4 but larger than 0.2, this was coded as block 2. This process was repeated until I had created five blocks, and then I conducted the t-test within each block to detect any significant difference in propensity scores between the treated and control groups. T-values for each block appear in the columns next to the Block ID column. Overall, the t-test reveals that the difference in propensity scores between the treated and control groups is statistically insignificant. If the t-test shows statistically significant differences in propensity scores, one should either change the threshold values of the propensity scores in each block or change the covariates and recalculate the propensity scores. When the propensity scores in each stratum are balanced, all covariates in each stratum should also achieve equivalence of distribution. To confirm this, one can conduct the t-test for each observational variable.
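As an illustration of Formula 2.1, the following Python sketch computes the standardized bias for one covariate within a single block; the age values are made up for the example.

```python
import numpy as np

def standardized_bias(x_treated, x_control):
    """Formula 2.1: absolute mean difference as a percentage of the square
    root of the average sample variance in the two groups."""
    diff = abs(x_treated.mean() - x_control.mean())
    pooled = np.sqrt(0.5 * (x_treated.var(ddof=1) + x_control.var(ddof=1)))
    return 100 * diff / pooled

# Made-up 'age' values for treated and matched control cases in one block.
age_treated = np.array([24.0, 26.0, 29.0, 31.0])
age_control = np.array([23.0, 27.0, 30.0, 32.0, 25.0])

sb = standardized_bias(age_treated, age_control)
# Small SB values indicate good within-block balance on this covariate.
```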
To illustrate how balance of the propensity scores within strata helps to achieve distribution overlap for the other covariates, Appendix B reports the values for one continuous variable, age. One can conduct the t-test to ensure that there is no age difference between the treated and control groups within each stratum. The column Tage reports the t-test for age within the strata. After balancing each block's propensity scores, the age difference between the treated and control groups in each block became statistically insignificant. I recommend that readers use a prepared package to stratify propensity scores, as a program can simultaneously categorize propensity scores and conduct balance tests. For instance, one can use the -pscore- program in STATA (Becker & Ichino, 2002) to estimate, stratify, and test the balance of propensity scores.
To further illustrate how the PSM can achieve strata balance, I replicated the aforementioned two procedures for the combined experimental data set and each of the observational data sets in Table 1a. Following Dehejia and Wahba's (1999) suggestions on the choice of covariates, I first computed propensity scores for each data set. Then, the propensity scores were stratified and tested for balance within each stratum. When the propensity scores achieved balance within each stratum, I plotted the means of the propensity scores in each stratum for each matched data set. Figure 2 provides evidence that the means of the propensity scores are almost the same for each sample within each balanced block. To demonstrate the effectiveness of the PSM in adjusting for the balance of the other covariates, Table 1b summarizes the means, standard errors, and SB of the matched samples. Comparing the results between the matched and unmatched samples, one can see that the differences in most observed characteristics between the experimental design and the nonexperimental design are reduced dramatically.
For instance, PSID-1 in Table 1b reports that the absolute SB values range from 12.86 to 184.23 (before propensity score matching), whereas PSID-1M in Table 1b shows that the absolute minimum SB value is 3.48 and the absolute maximum SB value is 128.07. Furthermore, the t-test and the Kolmogorov-Smirnov two-sample test were conducted to examine the balance of each variable. As reported in Table 2, for the PSID-1 sample, except for RE74 in Block 3, one cannot see a p value smaller than 0.1. For simplicity, Table 2 uses only the continuous variables included in estimating the propensity scores to illustrate the effectiveness of the PSM in increasing the distribution overlap between the treated group and the matched control group. Overall, Table 2 shows strong evidence that after obtaining balance of the propensity scores within a stratum, the covariates achieve overlap in terms of distribution. To preserve space, Table 1b and

[Figure 2 appears here: six panels (PSID-1, PSID-2, PSID-3, CPS-1, CPS-2, CPS-3), each plotting the mean propensity score (0 to 1) against Block ID for the control group and the treated group.]

Figure 2. Means of propensity scores in balanced strata
Note: PSID = Panel Study of Income Dynamics (PSID-1); CPS = Current Population Survey-Social Security Administration File (CPS-1).

Table 2. Test of Strata Balance

                      t-test for Matched                 KS Test for Matched(a)
Sample    Block ID    Age     Education  RE74    RE75    Age     Education  RE74    RE75
PSID-1    1           0.800   0.995      0.283   0.685   0.566   1.000      0.697   0.984
          2           0.856   0.319      0.632   0.627   0.998   0.894      0.983   0.998
          3           0.834   0.765      0.077   0.641   0.832   1.000      0.044   0.851
          4           0.853   0.378      0.744   0.874   0.954   0.999      0.949   0.754
          5           0.341   0.816      0.711   0.113   0.613   0.844      0.512   0.026
          6           0.353   0.196      0.888   0.956   0.950   0.942      0.466   0.878
          7           0.603   0.574      0.791   0.747   0.280   0.828      1.000   1.000

Note: The table reports the p value of each variable for each stratum between the National Supported Work Demonstration (NSW) Treated group and the matched control group. PSID-1 = 1975-1979 Panel Study of Income Dynamics (PSID): all male household heads under age 55 who were not classified as retired in 1975.
a. KS (Kolmogorov-Smirnov) two-sample test between NSW Treated and the matched control group.

Table 2 report statistics only for PSID-1. Readers can obtain a full version of these two tables by contacting the author. The aforementioned evidence generally supports that the covariates are balanced between the treated and control groups.


Step 3: Estimating the Causal Effect

Because the data sets include an experimental design, one can compute the unbiased causal effect. Table 3 shows the estimated results of training on earnings in 1978 (RE78). The first row of Table 3 reports the benchmark values calculated using the experimental data. The unadjusted result ($1,794.34) was calculated by subtracting the mean of RE78 in the control group (NSW Control) from the mean of RE78 in the treated group (NSW Treated). The adjusted estimate ($1,676.34) was computed using regression, controlling for all observational covariates. Because the experimental data compiled by Lalonde (1986) do not achieve the same covariate distribution between the treated and control groups (Table 1b), this article uses the causal effect calculated by the adjusted estimation as the benchmark value. From Table 3, column 1, it is obvious that if there are substantial differences among the pretreatment variables (as shown in Table 1b), using the mean difference to estimate the causal effect is strongly biased (it ranges from -$15,204.78 to $1,069.85). In Table 3, column 2, a regression model was used to gauge the adjusted training effects. Column 2 shows that the estimated treatment effects (ranging from $699.13 to $1,873.77) are more reliable than those calculated using the mean differences.
In addition to the mean difference and regression, the PSM can also be used to effectively estimate the ATT. When the propensity scores are balanced in all strata, one can use two standard techniques to compute the ATT: matched sampling (e.g., stratified matching, nearest neighbor matching, radius matching, and kernel matching) and covariance adjustment.
Matched sampling, or matching, is a technique used to sample cases from the treated group and the control group so as to obtain a sample with similar distributions of covariates between the two groups.4 Rosenbaum (2004) concluded that propensity score matching can increase the robustness of the model-based adjustment and avoid unnecessarily detailed description. The quality of the matched samples depends on the covariate balance and the structure of the matched sets (Gu & Rosenbaum, 1993). Ideally, exact matching on all confounding variables is the best matching approach, because the sample distribution of all confounding variables would be identical in the treated and control groups. Unfortunately, exact matching on even a single confounding variable will reduce the number of final matched cases. Supposing that there are k confounding variables and each variable has three levels, there will be 3^k patterns of levels on which to find perfectly matched samples. Thus, it is impractical to use the exact matching technique to obtain identical distributions of the confounding variables between the two groups. The PSM is more appropriate than exact matching because it reduces the covariates from k dimensions to one dimension. Rosenbaum and Rubin (1983) also showed that the PSM not only simplifies the matching algorithm but also increases the quality of the matches.

Stratified Matching

After achieving strata balance, one can apply stratified matching to calculate the ATT. In each balanced block, the average difference in the outcomes of the treated group and the matched control group is calculated. The ATT is then estimated by the mean difference weighted by the number of treated cases in each block. The ATT can be expressed as

ATT = \sum_{q=1}^{Q} \left( \frac{ \sum_{i \in I(q)} Y_i^T }{ N_q^T } - \frac{ \sum_{j \in I(q)} Y_j^C }{ N_q^C } \right) \frac{ N_q^T }{ N^T },    (2.2)

where Q denotes the number of blocks with balanced propensity scores, N_q^T and N_q^C refer to the number of cases in the treated and the control groups for matched block q, Y_i^T and Y_j^C represent the observed outcomes for case i in the matched treated group q and case j in the matched control group q, respectively, and N^T stands for the total number of cases in the treated group.

Table 3. Estimation Results

                                        Stratified          Neighbor           Radius             Kernel            Covariate Adjustment
Sample      Unadjusted(a)  Adjusted(b)  ATT(c)      N(d)    ATT        N(d)    ATT(e)      N(d)   ATT(f)      N(d)  ATT(g)       N(d)
Column         1              2           3           4       5           6      7            8      9          10    11           12

NSW          1,794.34       1,676.34
                             (638.68)
PSID-1(h)  -15,204.78         751.95    1,637.43    1,288   1,654.57     248   1,871.44       37  1,507.10   1,153  1,952.23     1,288
                             (915.26)    (805.43)          (1,174.63)         (5,837.10)           (826.11)          (791.45)
PSID-2(h)   -3,646.81       1,873.77    1,467.04      308   1,604.09     231   1,519.60       77  1,712.18     297  1,593.32       308
                           (1,060.56)  (1,461.75)          (1,092.40)         (2,110.71)         (1,226.90)        (1,476.54)
PSID-3(h)    1,069.85       1,833.13    1,843.20      250   1,522.23     217   1,632.74      167  1,776.37     245  1,583.41       250
                           (1,159.78)    (981.42)          (1,920.24)         (1,598.12)         (1,425.32)        (1,866.46)
CPS-1(i)    -8,497.52         699.13    1,488.29    4,563   1,600.74     280   1,890.13      102  1,513.78   4,144  1,634.81     4,563
                             (547.64)    (716.79)            (957.05)         (1,993.50)           (726.47)          (515.58)
CPS-2(i)    -3,821.97       1,172.70    1,676.43    1,438   1,638.74     271   1,775.99       79  1,590.49   1,416  1,550.90     1,438
                             (645.86)    (796.62)          (1,014.64)         (2,286.23)           (736.85)          (625.04)
CPS-3(i)      -635.03       1,548.24    1,505.49      508   1,376.65     273   1,307.63       53  1,166.93     493  1,572.09       508
                             (781.28)  (1,065.52)          (1,129.24)         (2,821.56)           (864.38)          (943.65)
Mean(j)     -5,122.71       1,313.15    1,602.98            1,566.17           1,666.26           1,544.47          1,647.80
Variance(j) 35,078,950.9   270,327.32   21,084.82           10,712.11          51,101.52          45,779.09         23,016.46

Note: Bootstrap with 100 replications was used to estimate the standard errors for the propensity score matching; standard errors in parentheses.
a. The mean difference between the treatment group (NSW Treated) and the corresponding control group (NSW Control, PSID-1 to CPS-3).
b. Least squares regression: regress RE78 (earnings in 1978) on treatment dummy, age, education, no degree, Black, Hispanic, RE74 (earnings in 1974), and RE75 (earnings in 1975).
c. Stratify blocks based on propensity scores, and then use Formula 2.2 to estimate the ATT (average treatment effect on the treated).
d. The total number of observations, including observations in NSW Treated and the corresponding matched control group.
e. Radius value ranges from .0001 to .0000025.
f. For kernel matching, when the number of cases is small, a narrower bandwidth (.01) is used instead of .06.
g. Regression weighted by the number of treated observations in each balanced propensity score block.
h. Observational covariates: treatment dummy, age, education, no degree, Black, Hispanic, RE74, and RE75; higher-order covariates: age^2, RE74^2, RE75^2, RE74 × Black.
i. Observational covariates: same as h; higher-order covariates: age^2, education^2, RE74^2, RE75^2, education × RE74.
j. Mean and variance are calculated over the estimated ATTs for each technique.

After stratifying the data into different blocks, one can calculate the ATT using the data listed in Appendix B. First, one can compute \sum_{i \in I(q)} Y_i^T (the within-block summation of the outcome variable over the treated cases, denoted Y_i^T in Appendix B) and \sum_{j \in I(q)} Y_j^C (the within-block summation of the outcome variable over the control cases, denoted Y_j^C in Appendix B). For example, in block 1 the summation of the outcome for the two treated cases is 49,237.66, and the summation of the outcome for the five control cases is 31,301.69. The numbers of cases in the treatment group (N_1^T) and the control group (N_1^C) for matched block 1 are 2 and 5, respectively. One can then calculate the ATT for each block. For instance, ATT_{q=1} (for block 1) = 49,237.66/2 - 31,301.69/5 = 18,358.49. After computing the ATT for each block, one weights the block-level ATTs by the fraction of treated cases in each block. For example, the weight for block 1 is 0.08 (two treated cases in block 1 divided by 25 treated cases in total). The final ATT is estimated by summing the weighted ATTs ($1,702.321), which means that individuals who received training will, on average, earn around $1,702 more per year than their counterparts who did not obtain governmental training. The estimated ATT using simple regression is $2,316.414. Comparing this with the true treatment effect in Table 3 ($1,676.34), one can see that the PSM produces an ATT substantively similar to the actual causal effect, given that the propensity scores of every block are balanced.
I also conducted another simulation, drawing 200 randomly selected cases from NSW and PSID-2 50 times. The average ATT calculated by the PSM is $1,376.713, whereas the average ATT computed by regression analysis is $709.039. Clearly, the PSM produces an ATT closer to the true causal effect than does ordinary least squares (OLS). I further examined the balance test for each of these 50 randomly drawn data sets.
Thirteen of the 50 data sets did not achieve strata balance. For those, the average ATT calculated by the PSM was $979.612, and the average ATT calculated by OLS was $697.626. For the remaining 37 data sets that achieved strata balance, the average ATT calculated by the PSM was $1,516.23, and the average ATT calculated by OLS was $713.04. Therefore, achieving balance of propensity scores in each stratum is very important for obtaining a less biased estimator of the causal effect.
I also provide SPSS code in Appendix C and STATA code in Appendix D, which readers can adapt to other statistical packages for stratified matching. The code shows how to fit the logit model, calculate propensity scores, stratify propensity scores, conduct the balance test, and compute the ATT using stratified matching. It is also convenient to implement the procedure in Excel after calculating the propensity scores in another statistical package. Readers who are interested in the Excel calculation can contact the author directly to obtain the original file for the calculation in Appendix B. Moreover, Appendix E presents a table of prewritten PSM software in R, SAS, SPSS, and STATA so that readers can conveniently find an appropriate statistical package.
Combining NSW Treated with the other observational data sets, column 3 of Table 3 further details the estimated ATT using stratified matching. Column 3 shows that the lowest estimate is $1,467.04 (PSID-2) and the highest estimate of the treatment effect is $1,843.20 (PSID-3). Overall, stratified matching produces an ATT relatively close to the unbiased ATT ($1,676.34).
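Formula 2.2 can be sketched in a few lines of Python; the block assignments, treatment flags, and outcomes below are invented for illustration and are not the Appendix B data.

```python
import numpy as np

# Invented matched sample: block id, treatment flag, and outcome.
block = np.array([1, 1, 1, 2, 2, 2, 2])
treat = np.array([1, 0, 0, 1, 1, 0, 0])
y = np.array([9000.0, 7000.0, 6500.0, 11000.0, 10500.0, 9800.0, 9200.0])

def stratified_att(block, treat, y):
    """Formula 2.2: within-block mean differences in outcomes, weighted by
    the fraction of treated cases falling in each block."""
    n_treated = treat.sum()
    att = 0.0
    for q in np.unique(block):
        in_q = block == q
        y_t = y[in_q & (treat == 1)]
        y_c = y[in_q & (treat == 0)]
        att += (y_t.mean() - y_c.mean()) * len(y_t) / n_treated
    return att

att = stratified_att(block, treat, y)
# Block 1 contributes (9000 - 6750) * 1/3; block 2 contributes (10750 - 9500) * 2/3.
```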

Nearest Neighbor and Radius Matching

Nearest neighbor (NN) matching computes the ATT by selecting the n comparison units whose propensity scores are nearest to the treated unit in question. In radius matching, the outcome of a control unit is matched with the outcome of a treated unit only when its propensity score falls within a predefined radius of the treated unit. A simplified formula to compute the estimated treatment effect using the nearest neighbor matching or the radius matching technique can be written as


ATT = \frac{1}{N^T} \sum_{i \in T} \left( Y_i^T - \frac{1}{N_i^C} \sum_{j \in C} Y_j^C \right),    (2.3)

where N^T is the number of cases in the treated group and N_i^C is a weighting scheme that equals the number of control cases matched to treated case i under a specific algorithm (e.g., for nearest neighbor matching, N_i^C will be the n comparison units with the closest propensity scores). For more information, readers can consult Heckman et al. (1997).
For NN matching, one can randomly draw either backward or forward matches. For example, in Appendix B, for case 7 (propensity score = 0.101), one can draw forward matches and find the control case (case 2) with the closest propensity score (0.109). Drawing backward matches, one finds case 1 with the closest propensity score (0.076). After repeating this for each treated case, one can calculate the ATT using Formula 2.3. For radius matching, one needs to specify the radius first. For example, suppose one sets the radius at 0.01; then the only matched case for case 7 is case 2, because the absolute difference of the propensity scores between case 7 and case 2 is 0.008 (|0.101 - 0.109|), smaller than the radius value of 0.01. One can repeat this matching procedure for each of the treated cases and use Formula 2.3 to estimate the ATT. In Table 3, column 5 reports the estimated ATT using NN matching, which produced an ATT ranging from $1,376.65 (CPS-3) to $1,654.57 (PSID-1). Column 7 describes the estimated ATT using radius matching, which generated an ATT ranging from $1,307.63 (CPS-3) to $1,890.13 (CPS-1).
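A minimal Python sketch of both matching rules follows; the propensity scores and outcomes are made up (the low scores loosely echo the cases discussed above), so this is an illustration, not the article's computation.

```python
import numpy as np

# Made-up propensity scores and outcomes; treat flags the treated cases.
ps = np.array([0.076, 0.109, 0.101, 0.300, 0.305, 0.295])
y = np.array([6000.0, 6500.0, 9000.0, 11000.0, 11500.0, 10800.0])
treat = np.array([0, 0, 1, 0, 1, 1])

t_idx = np.where(treat == 1)[0]
c_idx = np.where(treat == 0)[0]

def nn_att(ps, y, t_idx, c_idx):
    """1-nearest-neighbor matching on the propensity score (Formula 2.3)."""
    diffs = [y[i] - y[c_idx[np.argmin(np.abs(ps[c_idx] - ps[i]))]]
             for i in t_idx]
    return float(np.mean(diffs))

def radius_att(ps, y, t_idx, c_idx, radius=0.01):
    """Radius matching: average all controls inside the radius; treated
    cases with no control inside the radius are dropped."""
    diffs = []
    for i in t_idx:
        inside = c_idx[np.abs(ps[c_idx] - ps[i]) <= radius]
        if inside.size:
            diffs.append(y[i] - y[inside].mean())
    return float(np.mean(diffs))

att_nn = nn_att(ps, y, t_idx, c_idx)
att_radius = radius_att(ps, y, t_idx, c_idx)
```

With these toy values each treated case happens to have exactly one control within the radius, so the two estimators coincide; in general they differ.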

Kernel Matching

Kernel matching is another nonparametric estimation technique that matches each treated unit with a weighted average of all controls. The weighting value is determined by the distance of the propensity scores, the bandwidth parameter h_n, and a kernel function K(·). Scholars can specify the Gaussian kernel and an appropriate bandwidth parameter to estimate the treatment effect using Formula 2.4:

ATT = \frac{1}{N^T} \sum_{i \in T} \left\{ Y_i^T - \sum_{j \in C} Y_j^C \, K\left( \frac{ e_j(x) - e_i(x) }{ h_n } \right) \Big/ \sum_{k \in C} K\left( \frac{ e_k(x) - e_i(x) }{ h_n } \right) \right\},    (2.4)

where e_j(x) denotes the propensity score of case j in the control group, e_i(x) denotes the propensity score of case i in the treated group, and e_j(x) - e_i(x) represents the distance between the propensity scores. When one applies kernel matching, one downweights cases in the control group that are far from the case in the treated group: the weight function K(·) in Equation 2.4 takes large values when e_j(x) is close to e_i(x). To show how this happens, suppose one chooses the Gaussian density function K(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, where z = \frac{e_j(x) - e_i(x)}{h_n} and h_n = 0.05, and wants to match treated case 14 with control cases 10 and 11 (Appendix B). One can then compute the z values for case 10 ([0.282 - 0.312]/0.05 = -0.6) and case 11 ([0.313 - 0.312]/0.05 = 0.02). The weights for cases 10 and 11 are 0.33 (K(-0.6)) and 0.40 (K(0.02)), respectively. Clearly, the weight is low for case 10 (0.33), which has a long propensity score distance from treated case 14 (0.282 - 0.312 = -0.03), whereas the weight is relatively large for case 11 (0.40), which has a short distance from case 14 (0.313 - 0.312 = 0.001). For more information on kernel matching, readers can refer to Heckman et al. (1998). In Table 3, column 9 shows the results for kernel matching. The estimated ATT using the kernel matching technique ranges from $1,166.93 (CPS-3) to $1,776.37 (PSID-3).
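The worked example can be checked numerically. This Python fragment reproduces the two Gaussian kernel weights (approximately 0.33 and 0.40) for control cases 10 and 11, using h_n = 0.05 (the bandwidth implied by the z values in the example).

```python
import math

def gaussian_kernel(z):
    """K(z) = exp(-z**2 / 2) / sqrt(2*pi), the Gaussian density."""
    return math.exp(-z ** 2 / 2) / math.sqrt(2 * math.pi)

h_n = 0.05            # bandwidth used in the worked example
e_treated = 0.312     # propensity score of treated case 14
e_controls = [0.282, 0.313]  # propensity scores of control cases 10 and 11

weights = [gaussian_kernel((e - e_treated) / h_n) for e in e_controls]
# The distant control (case 10) gets about 0.33; the close one (case 11) about 0.40.
```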


Covariance Adjustment

Covariance adjustment is a type of regression adjustment that weights the regression using propensity scores. The matching process does not consider the variance in the observational variables, because the PSM can balance the differences in the pretreatment variables in each block; the observational variables in the balanced strata therefore do not contribute to the treatment assignment and the potential outcome. Although each block has balanced propensity scores, the pretreatment variables may not have exactly the same distributions in the treatment group and the control group. Table 2 provides evidence that although the propensity scores are balanced in each stratum, the distributions of some variables do not fully overlap. For example, RE74 is statistically different between the treated and the matched control group for PSID-1. Covariate adjustment is achieved by using a matched sample to regress the treatment outcome on the covariates, with appropriate weights for unmatched cases and duplicated cases. Dehejia and Wahba (1999) estimated the causal effect by conducting within-stratum regressions and taking a weighted sum over the strata. Imbens (2000) proposed that one can use the inverse of one minus the propensity score as the weight for each control case and the inverse of the propensity score as the weight for each treated case. Rubin (2001) provided additional discussion of covariate adjustment. Unlike matched sampling, covariance adjustment is a hybrid technique that combines nonparametric propensity matching with parametric regression. Column 11 of Table 3 reports the results of the covariance adjustment, which were produced by regressing RE78 on all observational variables, weighted by the number of treated cases in each block. This approach generates an ATT ranging from $1,550.90 (CPS-2) to $1,952.23 (PSID-1).
Researchers have suggested two ways to calculate the variance of the nonparametric estimators of the ATT.
First, Imbens (2004) suggested that one can estimate the variance by calculating each of the five components5 included in the variance formula. The asymptotic variance can generally be estimated consistently using kernel methods, which can consistently compute each of these five components. The bootstrap is a second, nonparametric approach to calculating the variance (Efron & Tibshirani, 1997). Efron and Tibshirani (1997) argued that 50 bootstrap replications can produce a good estimator for standard errors, yet a much larger number of replications is needed to determine bootstrap confidence intervals. In Table 3, 100 bootstrap replications were used to calculate the standard errors for the matching techniques. In addition to calculating the variance nonparametrically, one can also compute it parametrically if covariance adjustment is used to produce the ATT: for the covariate adjustment technique, the standard errors in column 11 of Table 3 were generated by linear regression.
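As a sketch of the bootstrap approach, the following Python fragment resamples a simulated matched sample 100 times and takes the standard deviation of the replicated ATT estimates as the standard error; all data here are artificial.

```python
import numpy as np

rng = np.random.default_rng(1)

# Artificial matched sample: treatment flag and outcome with a built-in
# treatment effect of 1,500.
n = 200
treat = rng.integers(0, 2, n)
y = 5000 + 1500 * treat + rng.normal(0, 2000, n)

def att_estimate(treat, y):
    """Mean-difference ATT on an (already matched) sample."""
    return y[treat == 1].mean() - y[treat == 0].mean()

# Bootstrap standard error with 100 replications, as used for Table 3.
boot = []
for _ in range(100):
    idx = rng.integers(0, n, n)  # resample cases with replacement
    boot.append(att_estimate(treat[idx], y[idx]))
se = float(np.std(boot, ddof=1))
```

For a nonparametric matching estimator, the same loop would simply call the matching routine instead of the mean difference inside each replication.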

Choosing Techniques

This article has reviewed different techniques for gauging the ATT. The performance of these strategies differs case by case and depends on the data structure. Dehejia and Wahba (2002) demonstrated that when there is substantial overlap in the distribution of propensity scores (or balanced strata) between the treated and control groups, most matching techniques will produce similar results. Imbens (2004) remarked that there are no fully applicable versions of these tools that do not require applied researchers to specify smoothing parameters; specifically, little is known about the optimal bandwidth, radius, and number of matches. That being said, scholars still need to consider particular issues in choosing the techniques that their research will employ.
For nearest neighbor matching, it is important to determine how many comparison units match each treated unit. Increasing the number of comparison units decreases the variance of the estimator but increases its bias. Furthermore, one needs to choose between matching with replacement and


matching without replacement (Dehejia & Wahba, 2002). When there are few comparison units, matching without replacement forces one to match treated units to comparison units that are quite different in propensity scores. This enhances the likelihood of bad matches (increasing the bias of the estimator), but it also decreases the variance of the estimator. Thus, matching without replacement decreases the variance of the estimator at the cost of increasing the estimation bias. In contrast, because matching with replacement allows one comparison unit to be matched more than once with its nearest treatment unit, matching with replacement can minimize the distance between the treatment unit and the matched comparison unit. This reduces the bias of the estimator but increases its variance.
In regard to radius matching, it is important to choose the maximum value of the radius. The larger the radius, the more matches can be found. More matches typically increase the likelihood of finding bad matches, which raises the bias of the estimator but decreases its variance. As far as kernel matching is concerned, choosing an appropriate bandwidth is also crucial, because a wider bandwidth produces a smoother function at the cost of tracking the data less closely. Typically, a wider bandwidth increases the chance of bad matches, so the bias of the estimator will also be high; yet the additional comparison units brought in by a wider bandwidth will also decrease the variance of the estimator. Figure 3 summarizes the issues that scholars need to consider before choosing appropriate techniques.

Figure 3. Choosing techniques (the chart's content, summarized as an outline):
Matched sampling:
  Nearest neighbor: number of matched neighbors ↑ (bias ↑, variance ↓); matching without replacement (bias ↑, variance ↓)
  Radius matching: maximum value of radius ↑ (bias ↑, variance ↓)
  Kernel matching: weighting by a kernel function (e.g., Gaussian); bandwidth ↑ (bias ↑, variance ↓)
Balanced strata:
  Stratified matching: weighting by the fraction of treated cases within each stratum
  Covariate adjustment: weighting by the number of treated cases in each stratum (inverse of the propensity score for treated cases)

For organizational scholars, I recommend using stratified matching and covariate adjustment for the following reasons: First, these two techniques do not require scholars to choose specific smoothing parameters; estimating the ATT with them requires minimal statistical knowledge.
Second, the weighting parameters can be easily constructed from the data. One can use a similar version of weighting parameters (the number of treated cases in each block) for both tech- niques. For stratified matching, one calculates the number of treated cases in each stratum, and then the proportion of treated cases will be computed. For covariate adjustment, one can use the number of treated cases as weights in the regression model. Finally, the performance of these two approaches (Table 3) is relatively close to other matching techniques. Overall, these two techniques are not only relatively simple, but can also produce a reliable ATT.
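As a concrete illustration of the recommended stratified approach, the sketch below computes a stratified-matching ATT on a tiny invented data set. The function name, cut points, and all numbers are hypothetical; a real analysis would use estimated propensity scores and the balancing checks described earlier.

```python
# Stratified matching on the propensity score: a minimal sketch on a
# tiny invented data set. The scores are assumed to have been estimated
# already; cut points, names, and numbers are all hypothetical.

def stratified_att(pscore, treated, outcome, cutpoints):
    """Weighted average of within-stratum mean outcome differences,
    weighted by each stratum's share of the treated cases."""
    n_treated_total = sum(treated)
    att = 0.0
    for lo, hi in zip(cutpoints[:-1], cutpoints[1:]):
        idx = [i for i, p in enumerate(pscore) if lo <= p < hi]
        t = [outcome[i] for i in idx if treated[i] == 1]
        c = [outcome[i] for i in idx if treated[i] == 0]
        if not t or not c:
            continue  # no common support in this stratum; skip it
        diff = sum(t) / len(t) - sum(c) / len(c)
        att += diff * (len(t) / n_treated_total)  # weight = treated share
    return att

# Toy example: two strata, one treated and one control unit in each.
pscore = [0.10, 0.15, 0.70, 0.75]
treated = [1, 0, 1, 0]
outcome = [110.0, 100.0, 130.0, 90.0]
print(stratified_att(pscore, treated, outcome, [0.0, 0.5, 1.01]))
# within-stratum diffs are 10 and 40; each stratum holds half of the
# treated cases, so the ATT is 0.5 * 10 + 0.5 * 40 = 25.0
```

The stratum weights here mirror the article's scheme: each stratum contributes its mean treated-control difference, weighted by the proportion of treated cases that fall in it.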

Downloaded from orm.sagepub.com at Vrije Universiteit 34820 on January 30, 2014 20 Organizational Research Methods 00(0)

Step 4: Sensitivity Test

The sensitivity test is the final step, used to investigate whether the causal effect estimated from the PSM is susceptible to the influence of unobserved covariates. Ideally, when an unbiased causal effect is available (e.g., a benchmark ATT estimated from an experimental design), scholars can compare the ATT generated by the PSM with the unbiased ATT to assess the accuracy of the PSM. In most empirical settings, however, an unbiased ATT is not available. Rosenbaum (1987) proposed that multiple comparison groups are valuable for detecting the existence of important unobserved variables. For example, one can match the treated group to multiple control groups and calculate multiple treatment effects; comparing the effect sizes of these treatment effects gives a sense of the reliability of the estimated ATT. Table 3 reports the results of such a sensitivity test drawing on multiple groups: one can compare the ATT for PSID-1 with those for the other data sets to confirm the effectiveness of stratified matching. Alternatively, one can match two control groups to each other. If the results show that the causal effects are statistically different between the two control groups, then one can conclude that the strongly ignorable assumption is violated.

In practice, however, scholars will ordinarily have neither multiple comparison groups nor an unbiased causal effect gauged from experimental data. How then can one conduct a sensitivity test? Three approaches can be implemented: changing the specification of the equation, using an instrumental variable, and Rosenbaum Bounding (RB). To conduct a sensitivity test by changing the specification, scholars first drop or add higher-order covariates such as quadratic or interaction terms. After changing the specification, scholars should recalculate the propensity scores and the causal effect.
Comparing the newly calculated causal effect with the originally computed one reveals how reliable the original estimate is. This technique is similar to Dehejia and Wahba's (1999) suggestion of selecting on observables, which informs researchers whether the treatment assignment is strongly ignorable, the precondition for the PSM to produce an unbiased estimate. Table 4a shows the sensitivity analysis when I dropped the higher-order pretreatment variables. Using only the observational variables, column 1 shows that the estimated results of stratified matching range from $813.20 (PSID-2) to $1,348.56 (CPS-1). Column 3 summarizes the estimated results of the nearest neighbor technique: the lowest estimated causal effect is $996.59 (PSID-2) and the highest is $1,855.61 (PSID-3). Column 5 reports the results of radius matching, with a range from $835.68 (PSID-1) to $2,110.03 (PSID-2). In column 7 of Table 4a, the estimated ATTs range from $831.12 (PSID-1) to $1,778.12 (PSID-2). Finally, covariate adjustment shows treatment effects ranging from $1,342.50 (CPS-1) to $2,328.20 (PSID-1). It is important to emphasize that after dropping the higher-order covariates, the balancing property is not satisfied for all the matched control samples.

When one lacks an unbiased estimator and multiple comparison groups, the instrumental variable (IV) method is another technique that can be used to assess the bias of the causal effects estimated by the PSM. DiPrete and Gangl (2004) argued that IV estimation can produce a consistent and unbiased estimate of the causal effect when the IVs are appropriately chosen, but this method generally reduces the efficiency of the causal estimators and introduces some uncertainty because of its reliance on additional assumptions.
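The respecification check described above (drop or add higher-order covariates, re-estimate the propensity scores, recompute the causal effect) can be sketched as follows. The hand-rolled gradient-ascent logistic fit merely stands in for whatever SPSS/STATA logit routine one actually uses, and the data and variable names are invented:

```python
# Sensitivity test by changing the propensity-score specification:
# estimate the scores with and without a quadratic term, then feed each
# score vector into the Step 3 estimator (e.g., stratified matching)
# and compare the resulting ATTs. Everything here is illustrative.
import math

def fit_logistic(X, y, lr=0.5, steps=3000):
    """Logistic regression by plain gradient ascent (no libraries)."""
    k = len(X[0])
    w = [0.0] * (k + 1)                      # intercept + slopes
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            grad[0] += yi - p
            for j, xj in enumerate(xi):
                grad[j + 1] += (yi - p) * xj
        w = [wj + lr * g / len(y) for wj, g in zip(w, grad)]
    return w

def pscores(X, w):
    """Predicted propensity scores under fitted weights w."""
    return [1.0 / (1.0 + math.exp(-(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))))
            for xi in X]

# Invented sample: treatment becomes likelier with (standardized) age.
age = [20, 25, 30, 35, 40, 45, 50, 55]
treated = [0, 0, 0, 1, 0, 1, 1, 1]
z = [(a - 37.5) / 10.0 for a in age]
spec1 = [[zi] for zi in z]                   # baseline specification
spec2 = [[zi, zi * zi] for zi in z]          # adds a quadratic term
p1 = pscores(spec1, fit_logistic(spec1, treated))
p2 = pscores(spec2, fit_logistic(spec2, treated))
# If the two score vectors (and the ATTs computed from them) diverge
# sharply, the original estimate is fragile to the specification.
print(max(abs(a - b) for a, b in zip(p1, p2)))
```

The diagnostic of interest is not the scores themselves but the gap between the ATTs recomputed under each specification.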
Usually, for public policy studies, a grouping variable that divides the sample into a number of disjoint groups can be selected as an instrumental variable.6 For example, Angrist, Imbens, and Rubin (1996) used draft lottery numbers as the instrumental variable to estimate the causal effect of Vietnam War veteran status on mortality. The rationale is that lottery numbers correlate with the treatment variable (whether one serves in the military) because individuals with low lottery numbers were more likely to be called to serve. On the other hand, a lottery number is randomly assigned and therefore does not correlate with the error term. Thus, the lottery number serves as a good instrument for the endogenous variable, serving in the Vietnam War.


Table 4a. Sensitivity Test

| Sample | (1) Stratified ATT (SE) | (2) N | (3) Neighbor ATT (SE) | (4) N | (5) Radius ATT (SE) | (6) N | (7) Kernel ATT (SE) | (8) N | (9) Covariate Adjustment ATT (SE) | (10) N |
|---|---|---|---|---|---|---|---|---|---|---|
| PSID-1 | 1,342.40 (763.09) | 1,345 | 1,545.52 (1,093.77) | 257 | 835.68 (3,877.08) | 21 | 831.12 (805.65) | 1,260 | 2,328.20 (693.69) | 1,345 |
| PSID-2 | 813.20 (1,081.68) | 369 | 996.59 (1,643.11) | 232 | 2,110.03 (2,999.31) | 17 | 1,778.12 (1,000.81) | 357 | 2,145.41 (1,143.55) | 369 |
| PSID-3 | 1,035.09 (1,091.28) | 270 | 1,855.61 (1,703.87) | 229 | 1,764.55 (1,269.51) | 219 | 1,724.97 (1,283.44) | 269 | 1,535.83 (1,400.24) | 270 |
| CPS-1 | 1,348.56 (651.14) | 5,961 | 1,765.35 (869.69) | 380 | 1,194.55 (1,855.94) | 129 | 1,186.89 (578.68) | 5,851 | 1,342.50 (470.60) | 5,961 |
| CPS-2 | 1,301.86 (714.36) | 1,747 | 1,108.86 (995.48) | 297 | 1,296.92 (2,341.93) | 79 | 1,049.00 (654.90) | 1,742 | 1,570.37 (478.94) | 1,747 |
| CPS-3 | 1,077.56 (707.68) | 557 | 1,346.78 (1,019.54) | 284 | 868.22 (2,752.29) | 53 | 1,269.21 (704.80) | 554 | 1,357.84 (685.77) | 557 |
| Mean | 1,153.11 | | 1,436.45 | | 1,306.55 | | 1,344.99 | | 1,713.36 | |
| Variance | 46,267.12 | | 120,918.64 | | 141,108.36 | | 254,592.59 | | 176,117.80 | |

Note: All the sensitivity tests used only observational covariates: age, education, no degree (no high school degree), Black, Hispanic, RE74 (earning in 1974), and RE75 (earning in 1975). No high-order covariates are included; bootstrap with 100 replications was used to estimate standard errors for the propensity score matching; ATT: average treatment effect on treated. Standard errors in parentheses.

Table 4b. Sensitivity Test

| G | PSID-1 p-critical(a) | PSID-1 Lower Bound | PSID-1 Upper Bound | CPS-2 p-critical(a) | CPS-2 Lower Bound | CPS-2 Upper Bound |
|---|---|---|---|---|---|---|
| 1.00 | 0.042 | 216.997 | 1,752.880 | 0.006 | 641.387 | 2,089.060 |
| 1.05 | 0.074 | 57.226 | 1,941.530 | 0.013 | 468.296 | 2,262.150 |
| 1.10 | 0.119 | –26.215 | 2,090.720 | 0.025 | 320.627 | 2,413.840 |
| 1.15 | 0.177 | –188.640 | 2,293.670 | 0.044 | 196.642 | 2,545.930 |
| 1.20 | 0.246 | –343.541 | 2,478.540 | 0.072 | 43.579 | 2,741.260 |
| 1.25 | 0.325 | –455.599 | 2,627.530 | 0.110 | –4.340 | 2,894.800 |
| 1.30 | 0.409 | –621.988 | 2,778.500 | 0.157 | –112.684 | 3,039.860 |

Note: G = the odds ratio that individuals will receive treatment.
a. The Wilcoxon signed-rank test gives the significance test for the upper bound.

One can compare the estimate of the causal effect from the PSM with the IV estimators to determine the accuracy of the estimators calculated by the PSM. Unfortunately, the limited number of covariates in these data sets prevents me from using the IV approach to conduct the sensitivity analysis. Readers who are interested in this topic can find examples in Angrist et al. (1996) and DiPrete and Gangl (2004). Wooldridge (2002) provides further theoretical background on how IVs can be used when one suspects the failure of the strongly ignorable assumption.

Finally, Rosenbaum (2002, Chapter 4) proposed a bounding approach to test for hidden bias, which can make the estimated treatment effect biased. Suppose u1i and u0j are unobserved characteristics for individuals i and j in the treated group and the control group, respectively. G

refers to the effect of these unobserved variables on treatment assignment. The odds ratio that individuals receive treatment can be written simply as G = exp(u1i – u0j). If the unobserved variables u1i and u0j are uninformative, then the assignment process is random (G = 1) and the estimated ATT and confidence intervals are unbiased. When the unobserved variables are informative, the confidence intervals of the ATT become wider and the likelihood of finding support for the null hypothesis increases. The Rosenbaum Bounding sensitivity test varies the effect of the unobserved variables on the treatment assignment to determine the point at which the significance test leads one to accept the null hypothesis. DiPrete and Gangl (2004) implemented the procedure in STATA for testing continuous outcomes; however, their program works only for one-to-one matching. Becker and Caliendo (2007) also implemented this method in STATA, but for testing dichotomous outcomes.

Table 4b presents an example of the RB test. The table reports only the tests for PSID-1 and CPS-2 because the t-values for the ATT estimated using stratified matching show strong evidence of a treatment effect. By varying the value of G, Table 4b reports the p value as well as the upper and lower bounds of the ATT. The Wilcoxon signed-rank test generates a significance test at a given level of hidden bias specified by the parameter G (DiPrete & Gangl, 2004). As Table 4b reports, the estimated ATT is very sensitive to hidden bias. For PSID-1, when the critical value of G is between 1.05 and 1.10 (i.e., the unobserved variables change the odds of assignment to the treated group rather than the control group by a factor of about 1.10), one needs to question the conclusion of a positive effect of training on salary in the year 1978.
With regard to the CPS-2 sample, when the critical value of G is between 1.20 and 1.25, one should question the positive effect of training on future salary. Yet a value of G of 1.25 in CPS-2 does not mean that one will not observe a positive effect of training on future earnings; it only means that unobserved variables strong enough to shift the odds of treatment assignment by a factor of 1.25 would widen the confidence interval of the salary effect to include zero, and such variables would have to almost perfectly determine future salary in each matched case. RB thus presents a worst-case scenario in which treatment assignment is assumed to be influenced by unobserved covariates. This sensitivity test conveys important information about how the level of uncertainty in the matching estimators would undermine the conclusions of matched-sampling analyses. The simple test in Table 4b reveals that the estimated causal effect of training is very sensitive to hidden biases that could influence the odds of treatment assignment.
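To make the logic of the bounds concrete, the sketch below computes a Rosenbaum-style upper-bound p value for a matched-pairs sign test, a simpler cousin of the Wilcoxon signed-rank bounds reported in Table 4b. The pair counts are hypothetical:

```python
# A Rosenbaum-style bound for a matched-pairs sign test. Under hidden
# bias G, the worst-case probability that the treated unit of a pair
# shows the higher outcome is G / (1 + G); the upper-bound p value is
# the corresponding binomial tail. This is a simplified stand-in for
# the Wilcoxon signed-rank bounds in Table 4b; pair counts are invented.
from math import comb

def sign_test_upper_p(n_pairs, n_treated_higher, gamma):
    """Upper-bound p value for the sign test at hidden bias gamma."""
    p_plus = gamma / (1.0 + gamma)           # worst-case success probability
    return sum(comb(n_pairs, k) * p_plus ** k * (1 - p_plus) ** (n_pairs - k)
               for k in range(n_treated_higher, n_pairs + 1))

# Hypothetical study: 100 matched pairs; the treated unit earns more in 65.
for gamma in (1.0, 1.2, 1.5):
    print(gamma, round(sign_test_upper_p(100, 65, gamma), 4))
# the upper-bound p value climbs with gamma; the gamma at which it
# crosses .05 marks where the conclusion becomes questionable
```

At G = 1 this reduces to the ordinary sign test; as G grows, the bound shows how much hidden bias the significant result can tolerate.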

Future Applications of the Propensity Score Method

To my knowledge, no publications in the management field have implemented the PSM in an empirical setting, yet other social science fields have applied it empirically. Thus, before offering suggestions for applying the PSM to the field, I provide an overview of how scholars in relevant social science fields (e.g., economics, finance, and sociology) employ the PSM in their empirical studies. Most applications of the PSM come from economists' evaluations of public policy (e.g., Dehejia & Wahba, 1999; Lechner, 2002). Early implementations of the PSM examined whether this technique effectively reduces bias stemming from the heterogeneity of participants, and economists generally agreed that the PSM is appropriate for examining causal effects using observational data. A recent application by Couch and Placzek (2010), for example, used the PSM to calculate the ATT without any concern regarding the legitimacy of the technique. Combining the PSM with average difference-in-difference approaches, Couch and Placzek (2010) found that mass layoffs decreased earnings by 33%.

To provide a concise overview of the PSM in other social science fields, I conducted a Web of Science search for articles citing Rosenbaum and Rubin's 1983 paper. Because most citations came from health-related fields, I limited the search to fields relevant to management: economics, sociology, and business finance. Overall, in early 2012, I found 674 articles in these three fields that cited Rosenbaum and Rubin's article. Fewer than 100 articles were

published before 2002, yet around 300 were published between 2009 and 2011. I first randomly selected one to two empirical studies from these top economics journals: American Economic Review, Econometrica, Quarterly Journal of Economics, and Review of Economic Studies. I then randomly selected one to two empirical articles from two top sociology journals: American Journal of Sociology and American Sociological Review. I finally randomly selected one to two studies from three top finance journals: Journal of Finance, Journal of Financial Economics, and Review of Financial Studies. Table 5 summarizes the data, analytical techniques, and key findings of these empirical articles employing the PSM in their fields.

Given that management scholars rely heavily on observational data sets, the PSM can be fundamentally helpful in discovering the effectiveness of management interventions in areas such as strategy, entrepreneurship, and human resource management. For strategy scholars, future research can use the PSM to examine whether firms that adopt long-term incentive plans (e.g., stock options and stock ownership) increase overall performance. The data used in this type of study are clearly not experimental; future research can use the PSM to adjust the covariate distributions between firms that have adopted long-term incentive policies and those that have not. Indeed, the PSM can be widely used by strategy scholars who want to examine the outcomes of particular strategies. For example, one can examine whether duality (the practice of the CEO also serving as the chairman of the board) has real implications for stock price and long-term performance. The PSM can also be used in entrepreneurship research.
Wasserman (2003) documented the paradox of success: founders are more likely to be replaced by professional managers when they have led their firms to an important breakthrough (e.g., the receipt of additional funding from an external source). Future research can further explore this question by investigating which types of funding lead to turnover in the top management teams of newly founded firms. For example, scholars can examine whether funding received from venture capitalists (VCs) has a different effect on executive turnover than funding obtained from a Small Business Innovative Research (SBIR) program. Similarly, using the PSM, scholars can examine how other interventions, such as a business plan, affect entrepreneurial performance. Like strategy scholars, entrepreneurship researchers can apply the PSM to many other questions.

The PSM can also be widely implemented by strategic human resource management (SHRM) scholars. A major interest in the SHRM literature is whether HR practices contribute to firm performance, and one can implement the PSM to investigate whether specific HR practices (e.g., downsizing) do so. When the strongly ignorable assumption is satisfied, the PSM gives HR scholars an opportunity to document a less biased effect size for the relationship between HR practices and firm performance: researchers can adjust the distributions of the observational variables and then estimate the ATT of the HR practices on firm performance. In conclusion, the PSM is an effective technique for reconstructing counterfactuals using observational data sets.

Discussion

Research in other academic fields has documented the effectiveness of the PSM. Yet, like any method, the PSM has its strengths and weaknesses. The first advantage of the PSM is that it simplifies the matching procedure: it reduces k-dimensional observable variables to a single dimension, so scholars can match observational data sets with k-dimensional covariates without sacrificing many observations or worrying about computational complexity. Second, the PSM eliminates two sources of bias (Heckman et al., 1998): bias from nonoverlapping supports and bias from different density weighting. The PSM increases the likelihood of achieving distributional overlap between the treated and control groups. Moreover, this technique reweights nonparticipant


Table 5. Empirical Studies Applying the Propensity Score Method (PSM)

Angrist (1998), Econometrica.
Data: Military data from the Defense Manpower Data Center; earnings data from the Social Security Administration.
Analytical technique: Because of nonrandom selection in the labor market, propensity score matching and instrumental variables were used to examine the effect of voluntary military service on earnings.
Key findings: Soldiers serving in the military in the early 1980s were paid more than comparable civilians. Military service increased the employment rate for veterans after service, but it led to only a modest long-run increase in earnings for non-White veterans and reduced the civilian earnings of White veterans.

Campello, Graham, and Harvey (2010), Journal of Financial Economics.
Data: Survey of 1,050 chief financial officers (CFOs).
Analytical technique: CFOs reported whether their firms were credit constrained. Asset size, ownership form, and credit ratings were used to predict propensity scores; average treatment effects of credit constraints were estimated by comparing the spending of constrained and unconstrained firms.
Key findings: Credit-constrained firms burned more cash, sold more assets to fund their operations, drew more heavily on lines of credit, and planned deeper cuts in spending. The inability to borrow forced many firms to bypass lucrative investment opportunities.

Couch and Placzek (2010), American Economic Review.
Data: State administrative files from Connecticut.
Analytical technique: Propensity score matching on observable variables to reduce individual heterogeneity.
Key findings: Propensity score estimators of the average treatment effect on the treated (ATT) and the average difference-in-difference showed earnings losses of 33% at the time of mass layoff and 12% six years later.

Drucker and Puri (2005), Journal of Finance.
Data: Combined data sets from multiple databases on seasoned equity issuers, including credit rating, stock return, lending history, and insurance history.
Analytical technique: Propensity score matching was used to match noncurrent loans to current loans; propensity scores were calculated from observational variables including credit rating and firm industry.
Key findings: Underwriters (commercial banks and investment banks) engaged in concurrent lending and provided discounts. Concurrent lending helped underwriters build relationships, which increased their probability of receiving current and future business.

Frank, Akresh, and Lu (2010), American Sociological Review.
Data: New Immigrant Survey, around 1,000 cases.
Analytical technique: An ordinal logistic model was used to calculate propensity scores, which were used to estimate the effect of skin color on earnings.
Key findings: An average difference of $2,435.63 between lighter and darker skinned individuals; in other words, darker skinned individuals earn around $2,500 less per year than their counterparts.

Gangl (2006), American Sociological Review.
Data: Survey of Income and Program Participation (SIPP) and European Community Household Panel (ECHP).
Analytical technique: Difference-in-difference propensity score matching.
Key findings: Strong evidence that post-unemployment earnings losses are largely permanent, particularly for older and high-wage workers as well as for female employees.

Grodsky (2007), American Journal of Sociology.
Data: Multiple sources, including representative samples of students who completed high school in 1972, 1982, and 1992.
Analytical technique: In the first stage, propensity scores adjusted for selection on observational variables; in the second stage, the author examined the type of college a student attends, controlling for propensity scores.
Key findings: A wide range of institutions engage in affirmative action for African American as well as Hispanic students.

Heckman, Ichimura, and Todd (1997), Review of Economic Studies.
Data: The National Job Training Partnership Act (JTPA) study and the Survey of Income and Program Participation (SIPP).
Analytical technique: Propensity score matching and nonparametric conditional difference-in-difference.
Key findings: After decomposing program evaluation bias into a number of components, selection bias due to unobservable variables was found to be less important than the other components; the matching technique can potentially eliminate much of the bias.

Lechner (2002), Review of Economic Studies.
Data: Unemployed individuals in Zurich, Switzerland, 1997-1999.
Analytical technique: A multinomial model estimated propensity scores for discrete choices (basic training, further training, employment program, and temporary wage subsidy).
Key findings: Propensity score matching can be an informative tool for adjusting for individual heterogeneity when individuals can select among multiple programs.

Malmendier and Tate (2009), Quarterly Journal of Economics.
Data: Hand-collected list of the winners of CEO awards between 1975 and 2002.
Analytical technique: Propensity score matching created a counterfactual sample of nonwinning CEOs; nearest neighbor matching, both with and without bias adjustment, identified the counterfactual sample.
Key findings: Award-winning CEOs underperform over the 3 years following the award; relative underperformance is between 15% and 26%.

Xuan (2009), Review of Financial Studies.
Data: S&P executive compensation data between 1993 and 2002.
Analytical technique: Ordinary least squares was the major technique; the propensity score method was used as a robustness check to address the endogenous selection of CEOs.
Key findings: Specialist CEOs, defined as CEOs promoted from a particular division of their firm, negatively affect segment investment efficiency.

data to obtain an equal distribution between the treated and control groups. Third, if treatment assignment is strongly ignorable, scholars can use the PSM on observational data sets to estimate an ATT that is reasonably close to the ATT calculated from experiments. Fourth, the matching technique is, by its nature, nonparametric. Like other nonparametric approaches, it does not suffer from problems that are prevalent in most parametric models, such as distributional assumptions, and it generally outperforms simple regression analysis when the true functional form of the regression is nonlinear (Morgan & Harding, 2006). Finally, the PSM is an intuitively sounder method for dealing with covariates than traditional regression analysis.
For example, the idea that the covariates in the treated group and the control group have the same distributions is much easier to understand than interpretations such as "controlling all other variables at their means" or "ceteris paribus." Moreover, without appropriately adjusting for the covariate distributions, a regression can yield an ATT even when no meaningful ATT exists.

Despite its many advantages, the PSM also has limitations. Like other nonparametric techniques, the PSM generally has no test statistics. Although the bootstrap can be used to estimate the variance, such techniques are not fully justified or widely accepted by researchers (Imbens, 2004). Hence, the use of the PSM may be limited because, while it can help scholars draw causal inferences, it cannot help with drawing statistical inferences. Another key hurdle is that there are currently no established procedures for investigating whether treatment assignment is strongly ignorable. Heckman et al. (1998) demonstrated that the PSM cannot eliminate bias due to unobservable differences across groups: the PSM can reweight observational covariates, but it cannot deal with unobservable variables, and some unobservable variables (e.g., environmental context, region) can increase the bias of the estimated ATT. Third, even when treatment assignment is strongly ignorable, the accuracy of the ATT estimated by the PSM depends on the quality of the observational data; measurement error (cf. Gerhart, Wright, & McMahan, 2000) and nonrandom missing values can affect the estimated ATT. Finally, although there are several propensity score matching techniques, there is little guidance on which types of matching techniques work best for different applications. Overall, despite its shortcomings, the PSM can be employed by management scholars to investigate the ATT of management interventions.
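The bootstrap variance estimate mentioned above can be sketched as follows. The naive difference-in-means estimator and the ten (outcome, treatment) pairs are invented stand-ins for a real matching estimator and data set:

```python
# Bootstrap standard error for a matching estimator: resample the units
# with replacement, recompute the ATT each time, and take the standard
# deviation of the replicates. Estimator and data are illustrative.
import random
import statistics

def naive_att(data):
    """Difference in mean outcomes, treated minus control."""
    t = [y for y, d in data if d == 1]
    c = [y for y, d in data if d == 0]
    return sum(t) / len(t) - sum(c) / len(c)

def bootstrap_se(data, estimator, reps=100, seed=0):
    rng = random.Random(seed)
    replicates = []
    while len(replicates) < reps:
        sample = [rng.choice(data) for _ in data]
        try:
            replicates.append(estimator(sample))
        except ZeroDivisionError:
            continue  # resample lacked treated (or control) units; redraw
    return statistics.stdev(replicates)

data = [(112.0, 1), (108.0, 1), (115.0, 1), (109.0, 1), (111.0, 1),
        (100.0, 0), (97.0, 0), (103.0, 0), (99.0, 0), (101.0, 0)]
print(naive_att(data))                # -> 11.0
print(bootstrap_se(data, naive_att))  # spread of the 100 replicates
```

This mirrors the 100-replication bootstrap used for the standard errors in Table 4a, with the caveat, noted above, that bootstrap inference for matching estimators is not fully justified (Imbens, 2004).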
Appropriately used, the PSM can eliminate bias due to nonoverlapping distributions between the treatment and control groups, and it can reduce the problem of unfair comparison. However, scholars must pay careful attention to the quality of the data, because the effectiveness of the PSM depends on the observational covariates. Research using objective measures is an optimal setting for the PSM. In empirical settings with low-quality data, scholars can implement the nonparametric PSM as a robustness test of the parametric findings generated by traditional econometric models.

To draw meaningful and honest causal inferences, one must choose the technique that works best for testing the causal relationship at hand. When one has collected panel data and believes that the omitted variable is time-invariant, the fixed effects model is the best choice for addressing bias due to that omitted variable (Allison, 2009; Beck et al., 2008). When one finds one or more perfect instrumental variables, two-stage least squares (2SLS) can also address the bias of causal effects calculated through conventional regression techniques. When the endogenous variable suffers only from measurement error and the reliability coefficient is known, one can use regression analysis and correct the bias using the reliability coefficient. Almost no technique is perfect for drawing an unbiased causal inference, including experimental design: Heckman and Vytlacil (2007) remarked that explicitly manipulating treatment assignment cannot always represent the real-world problem, because experimentation naturally discards information contained in a real-world context that includes dropout, self-selection, and noncompliance.

Sometimes a combination of techniques is also recommended. For example, to alleviate extrapolation bias in regression models, Imbens and Wooldridge (2009) recommend using matching to

generate a balanced sample. Similarly, Rosenbaum and Rubin (1983) suggested that differences due to unobserved heterogeneity should be addressed after balancing the observed covariates. Additionally, the PSM can be incorporated into studies using longitudinal designs: readers interested in estimating the ATT using longitudinal data can refer to the nonparametric conditional difference-in-difference model (Heckman et al., 1997) and the semiparametric conditional difference-in-difference model (Heckman et al., 1998). To conclude, drawing the best causal inference requires choosing the appropriate method, and among the various techniques, the PSM should be a potential choice.
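The reliability-based correction mentioned earlier amounts to dividing the observed slope by the known reliability coefficient, the classic disattenuation formula for a single error-laden predictor. A minimal sketch with illustrative numbers:

```python
# Correcting attenuation bias with a known reliability coefficient: for
# a single predictor measured with error, the observed slope shrinks by
# the factor r_xx, so the corrected slope is b_observed / r_xx.
# The numbers below are illustrative, not from any real study.

def disattenuate(b_observed, reliability):
    """Classic correction for a single error-laden predictor."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    return b_observed / reliability

print(disattenuate(0.30, 0.75))  # -> 0.4
```

Note that this simple form holds only for one mismeasured predictor; with multiple covariates the correction requires the full reliability structure.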

Conclusion

The purpose of this article is to introduce the PSM to the management field. The article makes several contributions to the organizational research methods literature. First, it not only advances management scholars' understanding of a neglected method for estimating causal effects but also discusses the technique's limitations. Second, by integrating previous work on the PSM, it provides a step-by-step flowchart that management scholars can easily implement in their empirical studies; the attached data set, with SPSS and STATA stratified matching code, helps management scholars calculate the ATT, and readers can make context-dependent decisions and choose the matching algorithm most beneficial for their objectives. Finally, a brief review of applications of the PSM in other social science fields, together with a discussion of its potential usage in management, provides an overview of how management scholars can employ the PSM in future empirical studies.

Appendix A

Boosted Regression

Boosted regression (or boosting) is a general, automated, data-mining technique that has shown considerable success in using a large number of covariates to predict treatment assignment and fit a nonlinear surface (McCaffrey, Ridgeway, & Morral, 2004). Boosting relies on regression trees, using a recursive algorithm to estimate the function that describes the relationship between a set of covariates and the dependent variable. A regression tree begins with the complete data set and then partitions it into two regions through a series of if-then statements (Schonlau, 2005). For example, if age and race are covariates, the algorithm can first split the data set into two regions based on a condition on either variable. The splitting continues recursively until the regression tree reaches the allowable number of splits. Friedman (2001) has shown that boosted regression outperforms other methods in reducing prediction error.

McCaffrey et al. (2004) summarized three important advantages of the boosting technique. First, regression trees are easy and fast to fit. Second, regression trees can handle different types of covariates, including continuous, nominal, ordinal, and missing variables; when boosting is used to predict propensity scores, different forms of the covariates generally produce exactly the same propensity score adjustment. Finally, the boosting technique is capable of handling many covariates, even those unrelated to treatment assignment or correlated with one another. Schonlau (2005) listed factors that favor the use of boosting, including a large data set, suspected nonlinearities, more variables than observations, suspected interactions, correlated data, and ordered categorical covariates. He concludes that the boosting technique does not require scholars to specify interactions and nonlinearities.
Thus, the boosting technique can simplify the procedure of computing propensity scores by reducing the burden of adding high-order covariates such as interactions.
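The recursive stump-splitting idea can be sketched in a few lines. This toy implements boosting for a squared-error loss with depth-1 trees on an invented step-function target; a production propensity-score application (e.g., McCaffrey et al., 2004) would instead boost a logistic loss with shrinkage, subsampling, and deeper trees:

```python
# Boosting with depth-1 regression stumps: a bare-bones sketch of the
# recursive partitioning idea behind boosted regression. The covariate
# and the step-function target below are invented for illustration.

def fit_stump(x, residual):
    """Best single split on one covariate, minimizing squared error."""
    best = None
    for split in sorted(set(x)):
        left = [r for xi, r in zip(x, residual) if xi <= split]
        right = [r for xi, r in zip(x, residual) if xi > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    return best[1:]                           # (split, left mean, right mean)

def boost(x, y, rounds=50, lr=0.1):
    """Fit stumps to residuals, shrinking each step by the rate lr."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        split, lmean, rmean = fit_stump(x, resid)
        pred = [pi + lr * (lmean if xi <= split else rmean)
                for pi, xi in zip(pred, x)]
    return pred

age = [20, 25, 30, 35, 40, 45, 50, 55]
y = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print([round(p, 2) for p in boost(age, y)])
# -> [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```

The repeated small corrections (the learning rate lr) are what let boosting track a nonlinear surface without the analyst specifying interactions or higher-order terms.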

Downloaded from orm.sagepub.com at Vrije Universiteit 34820 on January 30, 2014

Appendix B. A Small Data Set for Manually Calculating Average Treatment Effect on the Treated Group (ATT)

Step 1 (case-level data):

Case ID   Outcome      Treatment   Age   PScore   Block ID
1         10,048.54    0           50    0.076    1
2         0            0           19    0.109    1
3         20,688.17    0           26    0.128    1
4         0            0           44    0.141    1
5         664.977      0           39    0.177    1
6         36,646.95    1           35    0.075    1
7         12,590.71    1           33    0.101    1
8         24,642.57    0           32    0.265    2
9         10,344.09    0           44    0.268    2
10        9,788.461    0           41    0.282    2
11        0            0           33    0.313    2
12        0            0           20    0.365    2
13        13,167.52    1           22    0.261    2
14        4,321.705    1           26    0.312    2
15        12,558.02    1           46    0.361    2
16        12,418.07    1           46    0.392    2
17        0            0           40    0.412    3
18        17,732.72    0           26    0.456    3
19        4,433.18     0           30    0.481    3
20        0            0           21    0.513    3
21        17,732.72    0           20    0.558    3
22        7,284.986    1           41    0.511    3
23        5,522.788    1           17    0.525    3
24        20,505.93    1           24    0.547    3
25        0            1           27    0.590    3
26        2,364.363    0           41    0.678    4
27        22,165.90    0           23    0.727    4
28        7,447.742    0           24    0.746    4
29        2,164.022    1           21    0.654    4
30        11,141.39    1           23    0.739    4
31        3,462.564    1           29    0.758    4
32        559.443      1           20    0.759    4
33        4,279.613    1           19    0.764    4
34        0            1           23    0.768    4
35        5,615.361    0           28    0.923    5
36        0            0           23    0.954    5
37        13,385.86    1           18    0.913    5
38        8,472.158    1           27    0.948    5
39        0            1           18    0.954    5
40        6,181.88     1           17    0.959    5
41        289.790      1           21    0.961    5
42        17,814.98    1           37    0.965    5
43        9,265.788    1           17    0.966    5
44        1,923.938    1           25    0.970    5
45        8,124.715    1           25    0.987    5
46        11,821.81    0           53    0.001    Unmatched
47        24,825.81    0           52    0.003    Unmatched
48        33,987.71    0           28    0.009    Unmatched
49        33,987.71    0           41    0.013    Unmatched
50        54,675.88    0           38    0.016    Unmatched

Steps 2 and 3 (block-level summaries used to estimate the causal effect):

Block   Tpscore   Tage    YiT          YjC          NqT   NqC   ATTq           Weight   ATTq × Weight
1       1.32      0.167   49,237.660   31,401.687   2     5     18,338.4926    0.08     1,467.079
2       1.00      0.136   42,465.315   44,775.121   4     5     1,661.30455    0.16     265.809
3       1.86      0.025   33,313.704   39,898.620   4     5     348.702        0.16     55.792
4       0.80      1.561   21,607.032   31,978.005   6     3     –7,058.163     0.24     –1,693.959
5       1.23      0.552   65,459.109   5,615.361    9     2     4,465.553833   0.36     1,607.599

ATT = 1,702.321

Note: PScore = propensity score; Tage/Tpscore = t test for age and propensity scores in each balanced block; YiT = summation of the outcome variable for treated cases in each block; YjC = summation of the outcome variable for control cases in each block; NqT = total number of treated cases in each block; NqC = total number of control cases in each block; ATTq = YiT/NqT – YjC/NqC, the average treatment effect for each balanced block; weight = total number of treated cases in each block divided by the total number of treated cases in the sample.
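The block-level arithmetic in Appendix B can be checked in a few lines. The sketch below is illustrative Python (not part of the original appendices) that recomputes each ATTq = YiT/NqT - YjC/NqC, weights it by the block's share of treated cases, and sums across blocks.

```python
# Reproduce the ATT computation of Appendix B from the per-block summaries.
# Each tuple: (sum of treated outcomes YiT, sum of control outcomes YjC,
#              number of treated cases NqT, number of control cases NqC).
blocks = [
    (49237.660, 31401.687, 2, 5),
    (42465.315, 44775.121, 4, 5),
    (33313.704, 39898.620, 4, 5),
    (21607.032, 31978.005, 6, 3),
    (65459.109,  5615.361, 9, 2),
]

n_treated = sum(nt for _, _, nt, _ in blocks)  # 25 treated cases in total
att = 0.0
for yt, yc, nt, nc in blocks:
    att_q = yt / nt - yc / nc      # block-level treatment effect
    weight = nt / n_treated        # block's share of all treated cases
    att += att_q * weight

print(round(att, 3))  # 1702.321, matching the ATT reported in Appendix B
```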

Appendix C. SPSS Code for Stratified Matching

*Step 1: Calculate propensity scores.
LOGISTIC REGRESSION VARIABLES TREATMENT
  /METHOD=ENTER X1 X2 X3
  /SAVE=PRED
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
RENAME VARIABLES (PRE_1=pscore).

The above code calculates the predicted probability using a number of observed variables (e.g., X1, X2, and X3). Readers can substitute their own variables accordingly.

*Step 2: Stratify into five blocks.
COMPUTE blockid=$SYSMIS.
IF (pscore <= .2) & (pscore > .05) blockid=1.
IF (pscore <= .4) & (pscore > .2) blockid=2.
IF (pscore <= .6) & (pscore > .4) blockid=3.
IF (pscore <= .8) & (pscore > .6) blockid=4.
IF (pscore > .8) blockid=5.
EXECUTE.

*Perform a t test for each block.
*Split the file first, and then execute the t test.
SORT CASES BY blockid.
SPLIT FILE SEPARATE BY blockid.
T-TEST GROUPS=treatment(0 1)
  /MISSING=ANALYSIS
  /VARIABLES=age pscore
  /CRITERIA=CI(.95).

The above code first stratifies cases into five blocks and then carries out the t test for each block. SPSS has no "if" option for T-TEST; thus, it is important to split the data based on block ID and then conduct the t test.
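For readers working outside SPSS, the same block assignment can be sketched in a few lines. The Python below is illustrative only (not part of the original appendices; the function name assign_block is hypothetical) and uses the same cutpoints as the SPSS code.

```python
# Assign each propensity score to one of five blocks, mirroring the
# cutpoints in the SPSS code above. Scores at or below .05 are left
# unmatched (block None), as in Appendix B.
def assign_block(pscore):
    cutpoints = [0.05, 0.2, 0.4, 0.6, 0.8, 1.0]
    for block, (lo, hi) in enumerate(zip(cutpoints, cutpoints[1:]), start=1):
        if lo < pscore <= hi:
            return block
    return None  # unmatched case

scores = [0.076, 0.265, 0.412, 0.678, 0.923, 0.016]
print([assign_block(p) for p in scores])  # [1, 2, 3, 4, 5, None]
```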

*Step 3: Perform the stratification matching procedure.
*Calculate YiT and YjC in Appendix B.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=blockid treatment
  /outcome_sum=SUM(outcome).

*Calculate NqT and NqC in Appendix B.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=blockid treatment
  /N_BREAK=N.

*Calculate the total number of treated cases.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=
  /N_Treatment=SUM(treatment).


COMPUTE ATTQ=outcome_sum/N_BREAK.
EXECUTE.

DATASET DECLARE agg_all.
AGGREGATE
  /OUTFILE='agg_all'
  /BREAK=treatment blockid
  /N_Block_T=MEAN(N_BREAK)
  /ATTQ_T=MEAN(ATTQ)
  /N_Treatment=MEAN(N_Treatment).

DATASET ACTIVATE agg_all.
DATASET COPY agg_treat.
DATASET ACTIVATE agg_treat.
FILTER OFF.
USE ALL.
SELECT IF (treatment = 1).
EXECUTE.
DATASET ACTIVATE agg_all.

DATASET COPY agg_control.
DATASET ACTIVATE agg_control.
FILTER OFF.
USE ALL.
SELECT IF (treatment = 0 & blockid < 6).
EXECUTE.
DATASET ACTIVATE agg_control.

RENAME VARIABLES (N_Block_T ATTQ_T = N_Block_C ATTQ_C).

MATCH FILES
  /FILE=*
  /FILE='agg_treat'
  /RENAME (blockid N_Treatment treatment = d0 d1 d2)
  /DROP= d0 d1 d2.
EXECUTE.

COMPUTE ATTQ=ATTQ_T-ATTQ_C.
EXECUTE.
COMPUTE weight=N_Block_T/N_Treatment.
EXECUTE.
COMPUTE ATTxweight=ATTQ*weight.
EXECUTE.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES OVERWRITEVARS=YES
  /BREAK=
  /ATTxweight_sum=SUM(ATTxweight).

DATASET CLOSE agg_all.
DATASET CLOSE agg_control.
DATASET CLOSE agg_treat.


This step computes each of the components in Equation 2.2. It first calculates the number of treated cases and the number of control cases in each matched block, and then the sum of the outcome variable in each balanced block. The code next extracts the necessary components into two data sets, agg_control and agg_treat. Finally, the code matches these two data sets based on block ID and estimates the ATT. The final result is displayed in the variable ATTxweight_sum.

Appendix D. STATA Code for Stratified Matching

*STEP 1: Get the propensity scores using logistic regression.
*Choose covariates appropriately.
logit treatment X1 X2 X3
*Calculate propensity scores.
predict pscore, p

*STEP 2: Subclassification.
gen blockid=.
replace blockid=1 if pscore<=.2 & pscore>.05
replace blockid=2 if pscore<=.4 & pscore>.2
replace blockid=3 if pscore<=.6 & pscore>.4
replace blockid=4 if pscore<=.8 & pscore>.6
replace blockid=5 if pscore>.8

*STEP 2: t test for balance in each block.
foreach var of varlist age pscore {
    forvalues i=1/5 {
        ttest `var' if blockid==`i', by(treatment)
    }
}
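The balance check that this loop performs can be replicated outside Stata. Below is an illustrative stdlib-Python sketch (not part of the original appendices; the function name pooled_t is hypothetical) of the pooled two-sample t statistic that Stata's ttest reports by default. Applied to the block 1 ages from Appendix B, it reproduces the reported Tage of 0.167.

```python
import math

def pooled_t(sample1, sample2):
    """Two-sample t statistic with pooled variance, as ttest computes
    by default (equal variances assumed)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Balance check for one block: compare control vs. treated ages.
control_age = [50, 19, 26, 44, 39]  # block 1 controls in Appendix B
treated_age = [35, 33]              # block 1 treated cases
t = pooled_t(control_age, treated_age)
# |t| is about 0.167, matching the Tage reported for block 1; values
# well below 1.96 suggest age is balanced within the block.
```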

*STEP 3: Estimate causal effects using stratified matching.
sort blockid treatment
*YTQ holds YiT, TTN holds NqT, YCQ holds YjC, and TCN holds NqC in the Appendix B table.
gen YTQ=.
gen TTN=1
gen YCQ=.
gen TCN=1
forvalues i=1/5 {
    *Get the sum of the outcome in each treated block
    sum outcome if treatment==1 & blockid==`i'
    replace YTQ=r(sum) if blockid==`i'
    *Number of treated cases in each block
    sum TTN if treatment==1 & blockid==`i'
    replace TTN=r(sum) if blockid==`i'
    *Get the sum of the outcome in each control block
    sum outcome if treatment==0 & blockid==`i'
    replace YCQ=r(sum) if blockid==`i'
    *Number of control cases in each block
    sum TCN if treatment==0 & blockid==`i'
    replace TCN=r(sum) if blockid==`i'
}
gen ATTQ=YTQ/TTN-YCQ/TCN
*Weights for the ATT
sum treatment
gen W=TTN/r(sum)
*Weighted ATT
gen ATT=ATTQ*W
bysort blockid: gen id=_n
sum ATT if id==1
display "The ATT is `r(sum)'"

Appendix E. Software Packages for Applying the Propensity Score Method (PSM)

R
- Matching (Sekhon, 2007). Relies on an automated procedure to detect matches based on a number of univariate and multivariate metrics. It performs propensity matching, primarily 1:M matching, and allows matching with and without replacement.
  Download source: http://sekhon.berkeley.edu/matching/
  Documentation: http://cran.r-project.org/web/packages/Matching/Matching.pdf
- PSAgraphics (Helmreich & Pruzek, 2009). Provides enriched graphical tools to test within-strata balance, as well as graphical tools to inspect covariate distributions across strata.
  Download source: http://cran.r-project.org/web/packages/PSAgraphics/index.html
- twang (Ridgeway, McCaffrey, & Morral, 2006). Includes propensity score estimation and weighting. Generalized boosted regression is used to estimate propensity scores, which simplifies the estimation procedure.
  Download source: http://cran.r-project.org/web/packages/twang/index.html

SAS
- Greedy matching (Kosanke & Bergstralh, 2004). Performs 1:1 nearest neighbor matching.
  Download source: http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas
- OneToManyMTCH (Parsons, 2004). Allows users to specify propensity score matching as either 1:1 or 1:M.
  Download source: http://www2.sas.com/proceedings/sugi29/165-29.pdf

SPSS
- SPSS Macro for propensity score matching (Painter, 2004). Performs nearest neighbor propensity score matching. It appears to do only 1:1 matching without replacement.
  Download source: http://www.unc.edu/~painter/SPSSsyntax/propen.txt

STATA
- pscore (Becker & Ichino, 2002). Estimates propensity scores and conducts a number of matching procedures such as radius, nearest neighbor, kernel, and stratified matching.
  Download source: http://www.lrz.de/~sobecker/pscore.html
- psmatch2 (Leuven & Sianesi, 2003). Allows a number of matching procedures, including kernel matching and k:1 matching. It also supports common support graphs and balance testing.
  Download source: http://ideas.repec.org/c/boc/bocode/s432001.html

Acknowledgments

Special thanks to Barry Gerhart for his invaluable support and to Associate Editor James LeBreton and anonymous reviewers for their constructive feedback. This article has also benefited from suggestions by Russ Coff, Jose Cortina, Cindy Devers, Jon Eckhardt, Phil Kim, and seminar participants at the 2011 AOM conference.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. Harder, Stuart, and Anthony (2010) argued that the propensity score method (PSM) can be used to estimate the average treatment effect on the treated group (ATT) and that subclassifying the propensity score can be used to calculate the average treatment effect (ATE). However, economists have typically viewed the PSM as a technique to estimate the ATT (Dehejia & Wahba, 1999, 2002). Following Dehejia and Wahba (1999, 2002), the remaining sections regard the PSM as a way to calculate the ATT and use causal effects, treatment effects, and ATT interchangeably.
2. Psychology scholars also extended this to develop the causal steps approach to draw mediating causal inference (e.g., Baron & Kenny, 1986). It is beyond the scope of this article to fully discuss mediation. Interested readers can see LeBreton, Wu, and Bing (2008) and Wood, Goodman, Beckmann, and Cook (2008) for surveys.
3. Becker and Ichino (2002) have written a nice STATA program (pscore) to estimate the propensity score. The convenience of using pscore is that the program can stratify propensity scores into a specified number of blocks and test the balance of propensity scores in each block. However, when there is more than one treatment, it is inappropriate to use pscore to estimate the propensity score.


4. Propensity score matching is one of many matched sampling techniques. One can use exact matching based simply on one or more covariates. For example, scholars may match samples based on standard industry classification (SIC) and firm size rather than matching using propensity scores.
5. These components are: the variance of the covariates in the control groups, the variance of the covariates in the treated groups, the mean of the covariates in the control groups, the mean of the covariates in the treated groups, and the estimated propensity score. The variances of the covariates in the treated and control groups are weighted by the propensity score.
6. An instrumental variable (IV) is typically used by scholars under the condition of simultaneity. Because of the difficulty in finding an IV, it is not viewed as a general remedy for endogeneity issues.

References

Allison, P. (2009). Fixed effects regression models. Newbury Park, CA: Sage.
Angrist, J. (1998). Estimating the labor market impact of voluntary military service using social security data on military applicants. Econometrica, 66, 249-288.
Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91, 444-455.
Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly, 21(6), 1086-1120.
Arceneaux, K., Gerber, A., & Green, D. (2006). Comparing experimental and matching methods using a large-scale voter mobilization experiment. Political Analysis, 14, 1-26.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51(6), 1173-1182.
Beck, N., Brüderl, J., & Woywode, M. (2008). Momentum or deceleration? Theoretical and methodological reflections on the analysis of organizational change. Academy of Management Journal, 51(3), 413-435.
Becker, S., & Caliendo, M. (2007). Sensitivity analysis for average treatment effects. Stata Journal, 7(1), 71-83.
Becker, S., & Ichino, A. (2002). Estimation of average treatment effects based on propensity scores. The Stata Journal, 2, 358-377.
Berk, R. A. (1983). An introduction to sample selection bias in sociological data. American Sociological Review, 48(3), 386-398.
Campello, M., Graham, J., & Harvey, C. (2010). The real effects of financial constraints: Evidence from a financial crisis. Journal of Financial Economics, 97, 470-487.
Cochran, W. (1957). Analysis of covariance: Its nature and uses. Biometrics, 13(3), 261-281.
Cochran, W. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24, 295-313.
Couch, K. A., & Placzek, D. W. (2010). Earnings losses of displaced workers revisited. American Economic Review, 100, 572-589.
Cox, D. (1992). Causality: Some statistical aspects. Journal of the Royal Statistical Society, Series A (Statistics in Society), 155, 291-301.
Dehejia, R., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94, 1053-1062.
Dehejia, R., & Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics, 84, 151-161.
DiPrete, T. A., & Gangl, M. (2004). Assessing bias in the estimation of causal effects: Rosenbaum bounds on matching estimators and instrumental variables estimation with imperfect instruments. Sociological Methodology, 34, 271-310.
Duncan, O. D. (1975). Introduction to structural equation models. San Diego, CA: Academic Press.


Drucker, S., & Puri, M. (2005). On the benefits of concurrent lending and underwriting. Journal of Finance, 60(6), 2763-2799.
Efron, B., & Tibshirani, R. (1997). An introduction to the bootstrap. London: Chapman & Hall.
Frank, R., Akresh, I. R., & Lu, B. (2010). Latino immigrants and the U.S. racial order: How and where do they fit in? American Sociological Review, 75(3), 378-401.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29, 1189-1232.
Gangl, M. (2006). Scar effects of unemployment: An assessment of institutional complementarities. American Sociological Review, 71(6), 986-1013.
Gerhart, B. (2007). Modeling human resource management and performance linkages. In P. Boxall, J. Purcell, & P. Wright (Eds.), The Oxford handbook of human resource management (pp. 552-580). Oxford: Oxford University Press.
Gerhart, B., Wright, P., & McMahan, G. (2000). Measurement error in research on the human resources and firm performance relationship: Further evidence and analysis. Personnel Psychology, 53, 855-872.
Greene, W. (2008). Econometric analysis (6th ed.). Upper Saddle River, NJ: Prentice Hall.
Grodsky, E. (2007). Compensatory sponsorship in higher education. American Journal of Sociology, 112(6), 1662-1712.
Gu, X., & Rosenbaum, P. (1993). Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2, 405-420.
Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15, 234-249.
Hamilton, B. H., & Nickerson, J. A. (2003). Correcting for endogeneity in strategic management research. Strategic Organization, 1, 51-78.
Heckman, J. (1979). Sample selection bias as a specification error. Econometrica, 47, 153-161.
Heckman, J., & Hotz, V. (1989). Choosing among alternative nonexperimental methods for estimating the impact of social programs: The case of manpower training. Journal of the American Statistical Association, 84, 862-874.
Heckman, J., Ichimura, H., Smith, J., & Todd, P. (1998). Characterizing selection bias using experimental data. Econometrica, 66, 1017-1098.
Heckman, J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training program. Review of Economic Studies, 64, 605-654.
Heckman, J. J., & Vytlacil, E. J. (2007). Econometric evaluation of social programs, part II: Using the marginal treatment effect to organize alternative econometric estimators to evaluate social programs, and to forecast their effects in new environments. Handbook of Econometrics, 6, 4875-5143.
Helmreich, J. E., & Pruzek, R. M. (2009). PSAgraphics: An R package to support propensity score analysis. Journal of Statistical Software, 29, 1-23.
Hoetker, G. (2007). The use of logit and probit models in strategic management research: Critical issues. Strategic Management Journal, 28(4), 331-343.
Imbens, G. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3), 706-710.
Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. The Review of Economics and Statistics, 86, 4-29.
Imbens, G. W., & Wooldridge, J. M. (2009). Recent developments in the econometrics of program evaluation. Journal of Economic Literature, 47(1), 5-86.
James, L. R. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65(4), 415-421.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models, and data. Thousand Oaks, CA: Sage.


Joffe, M. M., & Rosenbaum, P. R. (1999). Invited commentary: Propensity scores. American Journal of Epidemiology, 150, 327-333.
King, G., Keohane, R. O., & Verba, S. (1994). Designing social inquiry: Scientific inference in qualitative research. Princeton, NJ: Princeton University Press.
Kosanke, J., & Bergstralh, E. (2004). gmatch: Match 1 or more controls to cases using the GREEDY algorithm. Retrieved May 15, 2012, from http://mayoresearch.mayo.edu/mayo/research/biostat/upload/gmatch.sas
Lalonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76, 604-620.
LeBreton, J. M., Wu, J., & Bing, M. N. (2008). The truth(s) on testing for mediation in the social and organizational sciences. In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends (pp. 107-140). New York, NY: Routledge.
Lechner, M. (2002). Program heterogeneity and propensity score matching: An application to the evaluation of active labor market policies. Review of Economics and Statistics, 84, 205-220.
Leuven, E., & Sianesi, B. (2003). PSMATCH2: Stata module to perform full Mahalanobis and propensity score matching, common support graphing, and covariate imbalance testing [Statistical software components]. Boston, MA: Boston College.
Li, Y., Propert, K., & Rosenbaum, P. (2001). Balanced risk set matching. Journal of the American Statistical Association, 96, 870-882.
Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
Malmendier, U., & Tate, G. (2009). Superstar CEOs. The Quarterly Journal of Economics, 124(4), 1593-1638.
McCaffrey, D. F., Ridgeway, G., & Morral, A. R. (2004). Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9, 403-425.
Mellor, S., & Mark, M. M. (1998). A quasi-experimental design for studies on the impact of administrative decisions: Applications and extensions of the regression-discontinuity design. Organizational Research Methods, 1(3), 315-333.
Morgan, S. L., & Harding, D. J. (2006). Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research, 35, 3-60.
Morgan, S. L., & Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge, UK: Cambridge University Press.
Painter, J. (2004). SPSS syntax for nearest neighbor propensity score matching. Retrieved May 15, 2012, from http://www.unc.edu/~painter/SPSSsyntax/propen.txt
Parsons, L. (2004). Performing a 1:N case-control match on propensity score. Proceedings of the 29th Annual SAS Users Group International Conference, SAS Institute, Montreal, Canada.
Ridgeway, G., McCaffrey, D., & Morral, A. (2006). Toolkit for weighting and analysis of nonequivalent groups: A tutorial for the twang package. Santa Monica, CA: RAND Corporation.
Rosenbaum, P. (1987). The role of a second control group in an observational study. Statistical Science, 2, 292-306.
Rosenbaum, P. (2002). Observational studies. New York, NY: Springer-Verlag.
Rosenbaum, P. (2004). Matching in observational studies. In A. Gelman & X. Meng (Eds.), Applied Bayesian modeling and causal inference from an incomplete-data perspective (pp. 15-24). New York, NY: Wiley.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
Rosenbaum, P., & Rubin, D. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516-524.
Rosenbaum, P., & Rubin, D. (1985). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39, 33-38.


Rousseau, D. (2006). Is there such a thing as evidence-based management? Academy of Management Review, 31, 256-269.
Rubin, D. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757-763.
Rubin, D. (2001). Using propensity scores to help design observational studies: Application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3), 169-188.
Rubin, D. (2004). Teaching statistical inference for causal effects in experiments and observational studies. Journal of Educational and Behavioral Statistics, 29, 343-367.
Rynes, S., Giluk, T., & Brown, K. (2007). The very separate worlds of academic and practitioner periodicals in human resource management: Implications for evidence-based management. Academy of Management Journal, 50(5), 987-1008.
Schonlau, M. (2005). Boosted regression (boosting): An introductory tutorial and a Stata plugin. Stata Journal, 5, 330-354.
Sekhon, J. S. (2007). Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. Journal of Statistical Software, 10(2), 1-51.
Smith, J., & Todd, P. E. (2005). Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125, 305-353.
Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15, 250-267.
Wasserman, N. (2003). Founder-CEO succession and the paradox of entrepreneurial success. Organization Science, 14(2), 149-172.
Wolfe, F., & Michaud, K. (2004). Heart failure in rheumatoid arthritis: Rates, predictors, and the effect of anti-tumor necrosis factor therapy. American Journal of Medicine, 116, 305-311.
Wood, R. E., Goodman, J. S., Beckmann, N., & Cook, A. (2008). Mediation testing in management research: A review and proposals. Organizational Research Methods, 11(2), 270-295.
Wooldridge, J. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
Xuan, Y. (2009). Empire-building or bridge-building? Evidence from new CEOs' internal capital allocation decisions. Review of Financial Studies, 22, 4919-4948.

Bio

Mingxiang Li is a doctoral candidate at the Wisconsin School of Business, University of Wisconsin-Madison. In addition to research methods, his current research interests include corporate governance, social networks, and entrepreneurship.
