THE IMPACT OF TEACH FOR AMERICA ON NON-TEST ACADEMIC OUTCOMES

Ben Backes (corresponding author)
American Institutes for Research
Washington, DC 20007
[email protected]

Michael Hansen
The Brookings Institution
Washington, DC 20036
[email protected]

Abstract
Recent evidence on teacher productivity suggests that teachers meaningfully influence non-test academic student outcomes that are commonly overlooked by narrowly focusing on test scores. Despite a large number of studies investigating the Teach For America (TFA) effect on math and English achievement, little is known about non-tested academic outcomes. Using administrative data from Miami-Dade County Public Schools, we investigate the relationship between being in a TFA classroom and five non-test student outcomes commonly found in administrative datasets: days absent, days suspended, GPA, classes failed, and grade repetition. We validate our use of non-test student academic outcomes to assess differences in teacher productivity using the quasi-experimental teacher switching methods of Chetty, Friedman, and Rockoff (2014) and fail to reject the null hypothesis of unbiasedness in most cases in elementary and middle school, although in some cases standard errors are large. We find suggestive evidence that students taught by TFA teachers in elementary and middle schools were less likely to miss school due to unexcused absences and suspensions compared with students taught by non-TFA teachers in the same school, although point estimates are very small. Other outcomes were found to be forecast-unbiased but showed no evidence of a TFA effect.

doi:10.1162/EDFP_a_00231 © 2018 Association for Education Finance and Policy


1. INTRODUCTION
Teach For America (TFA) is an alternative certification program that selects, trains, and places recent college graduates or other young professionals into high-need schools with a two-year commitment to teach.1 Much of the prior empirical research on TFA has focused on the impacts of TFA corps members on students' standardized test scores. In general, these studies have shown a positive TFA effect in math (and science, where available) relative to comparison teachers in the same schools teaching similar students, but no significant effect is generally detected in reading (e.g., Glazerman, Mayer, and Decker 2006; Xu, Hannaway, and Taylor 2011; Clark et al. 2013; Backes et al. 2016).

Focusing on test score gains alone, however, is a narrow view of TFA's effects on students and the schools where they teach. TFA has a service-oriented mission that describes corps members as leaders and mentors to disadvantaged youth who help students set high expectations and change their learning and life trajectories in meaningful ways to overcome the challenges of generational poverty. Given the careful selection of TFA corps members based on dimensions of leadership and character (e.g., Dobbie 2011), one might hypothesize that TFA corps members could have broader impacts on their students beyond test scores by influencing other student behaviors in a meaningful way. Recent evidence on teacher productivity suggests teachers meaningfully influence non-test academic outcomes that are commonly overlooked by narrowly focusing on student test scores (e.g., Jackson 2014).

This paper uses longitudinal data from recent years in the Miami-Dade County Public Schools (M-DCPS), spanning 2008–09 through 2013–14, to examine the impact of TFA corps members on five non-test student academic outcomes commonly found in administrative datasets: (1) days absent, (2) days suspended, (3) grade point average (GPA), (4) classes failed, and (5) grade repetition. We do this by constructing estimates of TFA effectiveness in these non-tested academic outcomes (we refer to these value-added estimates of teacher effectiveness in non-tested academic outcomes as N-VAMs).

During the time span of the longitudinal data, TFA experimented with a new placement strategy in which exceptionally low-performing schools in the district were specifically targeted for intensive TFA placements. At the same time, the size of the TFA corps in the district more than tripled. Together, these two changes produced large clusters of TFA teachers in the targeted schools, where on average more than 10 percent of the teacher workforce comprised TFA corps members. The large infusion of the TFA corps in these schools yields large sample sizes and relatively precise estimates of TFA effects.

This paper presents two main findings. The first relates to N-VAMs generally, where we demonstrate that forecasts of teacher N-VAMs are able to make unbiased predictions of non-tested student academic outcomes. We then use a teacher switching quasi-experiment adopted from Chetty, Friedman, and Rockoff (2014) to provide suggestive evidence that differences in performance across teachers as measured by N-VAMs are not driven by biasing factors, such as student–teacher sorting. This first step is

1. In most regions where it operates, TFA is an alternative certification program and is widely recognized as such. In Miami, however, TFA does not set teachers up for permanent certification and cannot be technically considered an alternative certification program.


important because, in contrast to VAMs based on student test scores, current evidence on whether N-VAMs represent causal effects on non-tested student academic outcomes is, to our knowledge, limited to one paper: Jackson (2014). However, Jackson (2014) is largely devoted to showing that N-VAMs are predictive of long-run outcomes. Thus, Jackson's analysis is narrowly tailored: The sample includes only students in grade 9, it examines only one composite index of noncognitive outcomes, and it does not examine consistency of N-VAMs within teachers or across outcomes. To our knowledge, no other attempt to validate N-VAMs exists.

Our second main finding links TFA teachers with students' non-tested academic outcomes, where we estimate TFA effects in a value-added framework. We find suggestive evidence that student behavior in elementary and middle school—as measured by days missed due to unexcused absences and suspensions—improves by a small degree when students are placed in a TFA classroom, compared with non-TFA teachers in the same schools with similar students. We also find a small increase in GPA for elementary school students in TFA classrooms.

This paper proceeds as follows. First, we discuss the background research on attempts to measure teachers' contributions to non-test academic student outcomes. Next, we address TFA placement in M-DCPS and the data we use. Following this, we describe how we construct estimates of TFA effectiveness at improving student non-test academic outcomes and how we forecast N-VAMs for our validation procedure. We then explore the properties of our forecasted N-VAMs and estimate TFA effectiveness on these measures. Finally, we conclude.

2. TEACHERS' CONTRIBUTIONS TO STUDENT NON-TEST ACADEMIC OUTCOMES
The last fifteen years in U.S. public schools have been an era characterized by test-based school accountability, as mandated by 2001's No Child Left Behind Act. From the time of the Department of Education's Race to the Top initiative, first announced in 2009, test-based accountability pressures have also trickled down to teachers, where student growth measures have increasingly become a part of teacher evaluation practices (Doherty and Jacobs 2015). Though high-stakes testing has long been unpopular, resistance to the culture of high-stakes testing has recently reached a boiling point among a diverse coalition of educators, parent groups, and other stakeholders (Taylor and Rich 2015).

In response to this growing resistance, public attention has shifted toward the viability of creating school and teacher performance measures based on non-tested student academic outcomes. This focus has been inspired in part by the growing consensus around the importance of developing noncognitive or socioemotional skills among young people and their influence on long-term outcomes, with important work done over the last decade in both economics (Heckman, Stixrud, and Urzua 2006; Kautz et al. 2014) and psychology (Duckworth et al. 2007; Duckworth, Quinn, and Seligman 2009; Eskreis-Winkler et al. 2014; Robertson-Kraft and Duckworth 2014). On the policy side, state and local leaders have begun to explore and experiment with non-tested student academic outcomes as alternates to standardized tests, and last year's reauthorization of the Elementary and Secondary Education Act (ESSA) included language requiring states to adopt new school accountability systems that


include alternate school performance indicators not directly tied to test scores (Klein 2015).2

Education policy researchers have also shown a growing interest in exploring whether and how school factors, and particularly teachers, may influence these types of soft skills. For example, studies from Chetty et al. (2011) and Dee and West (2011) find that students assigned to smaller classrooms have persistent gains in noncognitive outcomes that explain future earnings increases. Strands of research examining teacher contributions to non-tested academic outcomes (i.e., N-VAMs) include showing that some teachers are more effective than others at improving these outcomes (Garcia 2013; Jackson 2014; Gershenson 2016) and documenting differential returns to teacher experience for various non-test academic outcomes (Ladd and Sorensen 2017).

It is important to highlight that the conceptualization of these non-tested student academic outcomes in the education policy literature is quite distinct from the socioemotional skills measured in the psychology literature, even if these investigations may be similarly motivated by an interest in looking beyond typical assessments of cognitive ability. Education policy research has been largely constrained to administrative databases or longitudinal surveys that were not designed to capture psychological personality factors. As a result, the non-tested academic outcomes captured in this literature are those non-test variables recorded for other (typically administrative) purposes. In spite of this distinction in the actual outcomes measured, we feel that the focus on the non-tested academic outcomes we present here is important, and perhaps even more critical for practitioners and policy makers for pragmatic reasons. Though the new law is not prescriptive on the form of these new accountability measures, we speculate that using administrative data on students' non-tested academic outcomes is a likely next step for many states, as these data constitute the lowest-cost mode of developing new non-test measures at scale. In fact, to our knowledge, a consortium of districts has already begun adopting these measures into its school accountability system, and we expect to soon see even more interest in these types of measures as states redesign their systems in response to the recent enactment of ESSA.3

In this paper, we examine five different student outcomes: (1) days absent, (2) days suspended, (3) GPA, (4) classes failed, and (5) grade repetition. These outcomes are important for two reasons. First, they are simple to measure and found in most administrative databases,4 indicating that if teachers are found to meaningfully influence these outcomes, research examining teacher effects could easily be extended to include them.

2. Note that school effects on non-test student academic outcomes like the ones we examine here are one of several plausible types of non-test school performance measures that could be adopted under the new law. These other types of non-test school measures will be discussed further in the Conclusion.
3. Last year, a consortium of nine large school districts in California, known as the CORE Districts, announced the development of its School Quality Improvement Index, which will include multiple measures of social and emotional learning in students and culture-climate indicators for schools. Several of the non-tested academic outcomes we explore in this paper are included as measures in this new index, and in the interim period leading up to the rollout of school climate instruments these administrative measures will exclusively be used to stand in for the non-test school performance measures. See the announcement available at http://coredistricts.org/our-work/school-quality-improvement-system/. We also found the inclusion of student absence measures for elementary and middle schools in a proposal for new accountability measures in Texas (see "2016 Accountability and Planning for House Bill 2804 Implementation," available at https://tea.texas.gov).
4. We acknowledge that some of these measures may also be easily manipulated by teachers or administrators if accountability stakes were attached to them. We further address this issue in our discussion of the findings.


And second, each of these outcomes is an indicator of a student's academic well-being. For absences and suspensions, time out of school leads to depressed academic performance (Goodman 2014), especially for low-income students (Gershenson, Jacknowitz, and Brannegan 2017). GPA is strongly predictive of success in college (Antonovics and Backes 2014). And grade repetition is predicted by GPA and test scores (McCoy and Reynolds 1999) and possibly harmful in and of itself (Jacob and Lefgren 2009). More generally, if student improvements along these measures are driven in part by noncognitive skills (Jackson 2014), then one may expect eventual gains in college outcomes and the labor market (Heckman, Stixrud, and Urzua 2006).

Our examination here focuses specifically on whether TFA teachers impact these non-test academic outcomes. As discussed above, TFA teachers tend to be effective at improving students' test scores in math. However, previous studies of teacher effects on non-test academic outcomes have found that the ability of a teacher to raise test scores and the ability to improve other student outcomes are largely unrelated (Jackson 2014; Gershenson 2016), meaning that a priori it is unclear whether one should expect TFA teachers to be better or worse (or no different) than the average non-TFA teacher along these measures.

Two prior studies have explored TFA impacts on these types of non-test student academic outcomes (such as student absences and grade retention) but have not found any significant differences. Both Decker, Mayer, and Glazerman (2004) and Clark et al. (2013) examine the number of days absent and suspended for students randomly assigned to TFA classrooms relative to those assigned to control classrooms, with the former study conducted in elementary schools and the latter in secondary schools. Neither study finds a statistically significant relationship between TFA assignment and days absent or suspended, although their samples are substantially smaller than what is available in our M-DCPS data. In addition, our data provide information on a broader set of student non-test academic outcomes across a larger span of grade levels.

3. TFA BACKGROUND IN M-DCPS
TFA has been placing corps members in M-DCPS since 2003, beginning with thirty-five initial placements and aiming to place corps members in schools with high levels of student poverty (student bodies exceeding 70 percent eligibility for free or reduced-price lunch). Beginning with the 2009–10 school year, TFA began a clustering strategy in which new placements were purposely assigned to schools within designated high-need communities, which contained the district's lowest-performing schools.

TFA's clustering placement strategy grew out of an interest in accelerating TFA's impact on student outcomes. The growth of the TFA corps and its density is readily apparent in the placement numbers during the six school years of data used for this analysis. Table 1 presents TFA presence in the district over time. In the 2008–09 school year, the year immediately preceding the clustering strategy, there was an average of slightly fewer than two TFA teachers in each school where they were placed. In the years following, the number of schools containing any TFA corps members dropped by nearly half and the number of TFA members in the district more than tripled, resulting in a peak of nearly ten corps members per school where there was any presence. The net result was a jump in the proportion of TFA corps members in placement schools, going from


Table 1. Teach For America (TFA) Sample Sizes

                                        2008–09  2009–10  2010–11  2011–12  2012–13  2013–14
Total TFA (active and alumni)                91      110      163      244      301      354
Total schools containing any TFA             50       47       35       31       36       43
TFA as proportion of school teachers by school type, conditional on containing TFA
  Elementary                               3.6%     4.4%     8.6%    20.4%    13.8%    11.8%
  Middle                                   1.4%     7.6%     8.5%    16.9%    16.9%    13.6%
  High                                     1.7%     4.0%    13.6%    15.9%    14.9%    12.0%

Notes: Proportions of school teachers by school type are calculated among any schools containing any TFA during that school year. TFA: Teach For America.

2 to 4 percent in 2008–09 to as high as 14 to 17 percent in 2012–13. The concentrations of TFA in placement schools decreased slightly in 2013–14, due to expansion into more target schools.

4. DATA
We use detailed student-level administrative data that cover M-DCPS students linked to their teachers for six school years (2008–09 through 2013–14) from kindergarten through twelfth grade. M-DCPS is the largest school district in Florida and the fourth largest in the United States. The district has large minority and disadvantaged student populations, typical of regions TFA has historically targeted; about 60 percent of its students are Hispanic, 30 percent black, and 10 percent white, and more than 60 percent of students qualify for free or reduced-price lunch (FRPL).

Our set of non-test academic outcomes includes five variables: (1) the number of unexcused absences, (2) days absent due to suspension, (3) GPA in a given year, (4) percent of classes failed, and (5) grade repetition.5 In addition to these outcomes, we observe a variety of student characteristics: race, gender, FRPL eligibility, limited English proficiency (LEP) status, and whether a student is flagged as having a mental, physical, or emotional disability. All students are linked to teachers through data files that contain information on course membership.

Teacher personnel files in the M-DCPS data contain information on teachers' experience levels and demographics. These are used as covariates for the analysis that follows. One variable included in the data is a flag on TFA teachers (representing both active corps members and TFA alumni); given the importance of this variable in the analysis, we externally validated this variable against historical corps member lists from TFA.

One of the potential benefits of N-VAMs is that they can be estimated on a larger set of teachers because they are not restricted to teachers in tested grades and subjects—the number of teachers in the N-VAM sample is about twice as large as it would be if we were to restrict the sample to those teaching students with test scores. By necessity, when estimating VAMs, we restrict the sample to those in tested grades and subjects.

Basic summary statistics for TFA and non-TFA teachers by school level are presented in table 2. TFA teachers are more likely to be concentrated in schools with high

5. Grade point average is calculated using course grades from transcripts, where earning an A corresponds to 4 grade points, earning a B equals 3 grade points, etc.


Table 2. Sample Descriptive Statistics

                          Elementary            Middle                High
                       TFA      Non-TFA     TFA      Non-TFA     TFA      Non-TFA
Student descriptives
Male                   0.49     0.48        0.49     0.49        0.51     0.5
Black                  0.79     0.24        0.71     0.24        0.80     0.24
Hispanic               0.20     0.67        0.27     0.66        0.19     0.66
FRPL                   0.97     0.75        0.95     0.74        0.86     0.68
ELL                    0.16     0.25        0.12     0.11        0.10     0.10
Mental disability      0.03     0.06        0.05     0.07        0.05     0.07
Physical disability    0.03     0.03        0.01     0.02        0.01     0.02
Emotional disability   0        0.02        0.01     0.02        0.01     0.02
Unexcused absences     6.51     3.67        8.37     4.75        12.3     7.36
                       (7.14)   (4.98)      (10.55)  (7.40)      (12.76)  (9.92)
Suspension absences    0.21     0.08        2.01     0.77        1.15     0.51
                       (1.20)   (0.81)      (5.52)   (3.45)      (3.74)   (2.50)
GPA                    2.96     3.19        2.29     2.64        2.34     2.57
                       (0.67)   (0.63)      (0.76)   (0.78)      (0.82)   (0.82)
% classes failed       0.04     0.03        0.07     0.04        0.12     0.08
                       (0.13)   (0.11)      (0.16)   (0.12)      (0.20)   (0.17)
Repeated               0.04     0.02        0.02     0.01        0.01     0.02
Unique students        6,856    322,358     16,262   227,607     23,298   282,746
Teacher descriptives
Black                  0.09     0.21        0.09     0.23        0.03     0.21
Hispanic               0.08     0.44        0.04     0.35        0.02     0.37
Class size             13.06    12.9        19.15    17.76       16.83    18.14
                       (4.96)   (5.66)      (4.93)   (7.01)      (4.13)   (19.30)
Race matches student   0.07     0.46        0.08     0.4         0.03     0.37
No experience          0.41     0.01        0.44     0.01        0.48     0
1 year experience      0.35     0           0.35     0           0.37     0.02
2 years experience     0.12     0.02        0.09     0.02        0.08     0.03
3–4 years experience   0.03     0.06        0.03     0.06        0        0.06
5–9 years experience   0.01     0.19        0        0.18        0.02     0.19
10+ years experience   0        0.59        0        0.54        0        0.6
Missing experience     0        0.11        0.08     0.18        0.03     0.08
Unique teachers        175      17,141      199      10,059      501      25,727

Notes: Averages at the individual-year level, with standard deviations in parentheses. For example, a teacher observed in the first and second years of teaching would have one observation from the first year and one from the second year. ELL: English language learner; FRPL: eligible for free or reduced-price lunch; GPA: grade point average; TFA: Teach For America.

shares of black and FRPL-eligible students, and they are less likely to be minority or experienced. About 80 percent of TFA teachers are in their first or second year of teaching because of high attrition rates after the second year (see Hansen, Backes, and Brady 2016).

5. EMPIRICAL STRATEGY
Our analysis is motivated by an interest in quantifying TFA corps members' influence on students' non-tested academic outcomes commonly found in administrative databases. In our survey of the prior literature, we do not believe the question of whether teachers have a causal role in influencing these non-tested academic outcomes (versus


being related to socioeconomic factors and/or nonrandom assignment between students and teachers) has been sufficiently validated. To that end, our paper has two parts: The first derives and validates the use of N-VAMs for teachers, and the second then explores the relationship between TFA teachers and these non-test academic outcomes.

Conventional VAMs and Derivative N-VAMs
Researchers typically estimate a VAM similar to the following:

$$Y_{ijt} = \beta_0 + Y_{it-1}\beta_1 + X_{it}\beta_2 + \sum_{j \in J} T_j \phi_j + \varepsilon_{ijt}, \qquad (1)$$

where $Y_{ijt}$ represents test scores of student $i$ taught by teacher $j$ in year $t$, $Y_{it-1}$ prior year achievement, $X_{it}$ demographic characteristics, $T_j$ an indicator variable identifying the $j$th teacher of $J$ total teachers, and $\varepsilon_{ijt}$ an error term. The coefficient on teacher indicator $j$, $\phi_j$, is meant to capture teacher $j$'s contribution to growth in student achievement.

Recently developed N-VAMs are close derivatives of the conventional VAM in equation 1, simply substituting some non-test outcome (notated here as $Y^{OUTCOME}_{ijt}$) as the dependent variable, along with its lagged observation as an explanatory variable:6

$$Y^{OUTCOME}_{ijt} = \beta_0 + Y^{OUTCOME}_{ijt-1}\beta_1 + X_{it}\beta_2 + \sum_{j \in J} T_j \phi_j + \varepsilon_{ijt}. \qquad (2)$$

Models of this type are calculated in Gershenson (2016), Ladd and Sorensen (2017), and others. N-VAMs produced using this conventional methodology are assumed to take on the interpretation and analogous statistical properties of VAMs. For example, estimates are intended to be interpreted as teachers' causal contributions to the corresponding non-test outcome in students—our validity tests described below will help evaluate this claim. N-VAMs may be estimated across various time spans in the data, producing one-year or multiyear teacher estimates, as the case may be. The reliability and variability of these estimates over time within teachers can be calculated following methods developed for test-based VAMs in McCaffrey et al. (2009) and Goldhaber and Hansen (2013).
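For concreteness, the following is a minimal sketch in Python of the N-VAM specification in equation 2, using a toy synthetic panel with hypothetical column names. It illustrates the specification only; it is not the authors' code, and the paper's actual estimates come from M-DCPS administrative data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy student-year panel with hypothetical column names (not the M-DCPS schema).
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "outcome": rng.normal(size=n),          # e.g., standardized log absences
    "outcome_lag": rng.normal(size=n),      # same outcome in year t-1
    "frpl": rng.integers(0, 2, n),
    "ell": rng.integers(0, 2, n),
    "male": rng.integers(0, 2, n),
    "teacher_id": rng.integers(0, 200, n),  # 200 hypothetical teachers
})

# Equation 2: outcome on its lag, demographics, and teacher fixed effects.
model = smf.ols(
    "outcome ~ outcome_lag + frpl + ell + male + C(teacher_id)", data=df
).fit()

# The coefficient on each teacher dummy is that teacher's phi_j for this
# non-test outcome, relative to the omitted teacher.
phi = model.params.filter(like="C(teacher_id)")
```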

Estimating a TFA Effect on Non-Test Academic Outcomes
To address the study's research question regarding the influence of TFA corps members on student non-test academic outcomes, we replace the teacher fixed effects in equation 2 with a TFA indicator and a vector of other teacher characteristics ($X_{jt}$), school fixed effects ($\gamma_s$), and classroom average characteristics ($X_{ct}$) in order to estimate the average change in student outcomes associated with being in a TFA classroom:

$$Y^{OUTCOME}_{ijt} = \beta_0 + Y^{OUTCOME}_{ijt-1}\beta_1 + X_{it}\beta_2 + TFA_{it}\beta_3 + X_{jt}\beta_4 + X_{ct}\beta_5 + \gamma_s + \varepsilon_{ijt}. \qquad (3)$$

6. For ease of notation, we write $Y^{OUTCOME}_{ijt}$ as $Y_{ijt}$ in the remainder of the text, and it is meant to apply to any candidate non-test outcome. The procedure will be performed separately for each outcome.


Equation 3 is similar to existing studies of TFA (e.g., Boyd et al. 2006; Glazerman, Mayer, and Decker 2006; Kane, Rockoff, and Staiger 2008; Clark et al. 2013; Backes et al. 2016). However, where these studies use student test scores as the outcome variable and control for prior year scores, we measure non-test academic outcomes: unexcused absences, absences due to suspension, GPA, percent of classes failed, and grade repetition. Thus, we estimate equation 3 for each of these five outcome variables. Regardless of the outcome variable $Y_{ijt}$, we control for a student's lagged value of all of these outcome variables in all regressions.7

Because TFA corps members are placed nonrandomly across schools in the district (at minimum, we know selected schools had at least 70 percent FRPL-eligible students; other characteristics may have also played into the selection decision), the point estimates associated with the TFA effect may be downward-biased because schools chosen to receive TFA corps members were probably targeted precisely because they were likely to have low student performance (both on tests and other non-test outcomes). As a result, we include both school fixed effects ($\gamma_s$) and controls for time-varying averages within classrooms such as student demographic characteristics. The inclusion of school fixed effects ensures that TFA teachers are compared to non-TFA teachers within a given school, serving similar classrooms.
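A corresponding sketch of equation 3 appears below, extending the toy DataFrame from the earlier sketch with a TFA flag, school identifiers, and teacher and classroom controls (all hypothetical stand-ins for the controls described in the text). School fixed effects enter as dummies, and standard errors are clustered at the school level, as in the paper.

```python
import statsmodels.formula.api as smf

# Hypothetical additional columns appended to the toy df from the previous sketch.
df["tfa"] = rng.integers(0, 2, len(df))
df["school_id"] = rng.integers(0, 30, len(df))
df["teacher_exp"] = rng.integers(0, 10, len(df))
df["class_size"] = rng.integers(10, 25, len(df))
df["class_frpl_share"] = rng.random(len(df))

# Equation 3: a TFA indicator plus teacher/classroom controls replaces the
# teacher fixed effects; school fixed effects absorb across-school placement.
m = smf.ols(
    "outcome ~ outcome_lag + frpl + ell + male"       # student controls X_it
    " + tfa"                                          # beta_3: the TFA effect
    " + teacher_exp + class_size + class_frpl_share"  # teacher/class controls
    " + C(school_id)",                                # school fixed effects
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})

print(m.params["tfa"], m.bse["tfa"])
```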

Forecasting Non-Test Value Added
Estimating equation 3 will yield the average change in a given outcome associated with being in a TFA classroom. However, in contrast to teachers' value added (VA) on test scores, which has been thoroughly vetted in both experimental (Kane, Rockoff, and Staiger 2008; Kane et al. 2013) and nonexperimental (Bacher-Hicks, Kane, and Staiger 2014; Chetty, Friedman, and Rockoff 2014) studies, there is little evidence regarding whether N-VAMs represent causal estimates or merely reflect other factors, such as student–teacher sorting. For example, a positive TFA coefficient in equation 3 could arise because sitting in a TFA classroom genuinely leads to better outcomes or because TFA teachers are systematically assigned to students who would have had better outcomes whether or not they were taught by TFA teachers, in ways that the control variables cannot account for. Thus, in this section, we describe how we make out-of-sample predictions of teacher effectiveness to assess the extent to which we can think of the TFA estimates as meaningful contributions to changes in student outcomes.

To both validate the causal nature of N-VAMs and explore the variability of these measures within and across teachers, we take a different approach from equation 2 for two reasons. First, equation 2, when estimated directly on the full sample, assumes the teacher contribution to the corresponding outcome is fixed over time by estimating a constant $\hat{\phi}^{OUTCOME}_j$ for each teacher. However, using test-based VAMs, Goldhaber and Hansen (2013) and Chetty, Friedman, and Rockoff (2014) provide evidence that this is

7. The vector of student characteristics includes the following: race; gender; FRPL eligibility; LEP status; and mental, physical, or emotional disability status. The vector of classroom characteristics includes class size and classroom-level averages of each of the student characteristics listed above. Teacher controls include teacher race, gender, experience, and whether the race of the teacher matches that of the student. The student characteristic, class average, and teacher demographic controls are interacted with grade indicator variables to allow differences in the influence of these variables across grades. The estimating equation additionally includes indicator variables for grades and years. Standard errors are clustered at the school level.


not the case: Teacher effectiveness drifts over time. Second, to conduct our validation procedure, we need to forecast teacher effectiveness in year $t$, not estimate it directly as in equation 2. As a result, we follow the three-step process used by Chetty, Friedman, and Rockoff (2014).8

We follow this procedure for each of our five non-test outcomes—absences, suspensions, grade retention, classes failed, and GPA—to obtain a forecast of each teacher's value-added effect on each outcome. First, we residualize the outcome variable $Y^{OUTCOME}_{ijt}$ by regressing it on the student's prior outcome value ($Y_{ijt-1}$), student demographics ($X_{ijt}$), and a vector of teacher fixed effects ($T_j$):9

$$Y_{ijt} = \beta_0 + Y_{ijt-1}\beta_1 + X_{ijt}\beta_2 + T_j\phi + \varepsilon_{ijt}. \qquad (4)$$

Using the coefficient estimates from the regression in equation 4, we obtain residualized student outcomes:

$$\hat{Y}_{ijt} = Y_{ijt} - \hat{\beta}_0 - Y_{ijt-1}\hat{\beta}_1 - X_{ijt}\hat{\beta}_2 = T_j\phi + \varepsilon_{ijt}. \qquad (5)$$

A teacher's average residuals in year $t$ can then be written as $\bar{Y}_{jt} = \frac{1}{n}\sum_{i=1}^{n} \hat{Y}_{ijt}$, where $n$ is the number of students taught in year $t$ by teacher $j$. We then obtain coefficients from the estimation of the best linear predictor of $\bar{Y}_{jt}$ given average residuals in all other years, both past and future, $\bar{Y}^{-t}_{j}$.10 Finally, we use the relationship between teachers' average residuals across different years to forecast a teacher's value-added contribution for outcome OUTCOME in year $t$:

$$\hat{\mu}^{OUTCOME}_{jt} = \psi \bar{Y}^{-t}_{j}. \qquad (6)$$

The result of this process is a forecast of each teacher's value-added contribution in each year for each outcome. In all that follows, we take this forecast to be a teacher's value-added effect in $t$. Thus, for example, when calculating the correlation between teacher effects across two outcome measures $Y^1$ and $Y^2$ in a given year, we would calculate the

8. The discussion that follows is meant to provide a conceptual overview of how forecasts of teachers' value-added contributions are constructed. In practice, we use the vam.ado program developed for use by Chetty, Friedman, and Rockoff (2014). See their online Appendix A, available at https://assets.aeaweb.org/assets/production/articles-attachments/aer/app/10409/20120127_app.pdf, for a step-by-step guide to implementing the methods developed in the paper.
9. In all regressions, $X_{ijt}$ includes FRPL eligibility, English language learner (ELL) status, gender, race, special education status, and an indicator of whether the student has an identified learning disability. $Y_{ijt-1}$ includes a cubic function of the lagged outcome variable (unless the outcome is binary, in which case it is simply one indicator variable), and sometimes will include lagged values of other outcomes, as discussed below. For all outcomes other than GPA, for the validation procedure, we first transform each non-test outcome and its lag by adding 1, taking the log, and then standardizing the values to be mean zero and standard deviation 1.
10. Specifically, we choose the vector $\psi$ to minimize the mean-squared error of forecasts of test scores across all teacher-year observations in the sample:

$$\psi = \arg\min_{\{\psi_1,\ldots,\psi_{t-1}\}} \sum_j \left( \bar{Y}_{jt} - \sum_{s=1}^{t-1} \psi_s \bar{Y}_{js} \right)^2.$$

This is equivalent to regressing $\bar{Y}_{jt}$ on the observations contained in the vector $\bar{Y}^{-t}_{j}$ and obtaining the coefficients.


correlation between $\hat{\mu}^{Y^1}_{jt}$ and $\hat{\mu}^{Y^2}_{jt}$. Although this does not actually use data from $t$, $\hat{\mu}_{jt}$ represents the best linear prediction of teacher performance in $t$, given data from all other years. Chetty, Friedman, and Rockoff (2014) find that these forecasts of teacher performance are predictive of student performance in math and reading. Below, we assess whether forecasts of non-test academic outcomes exhibit similar properties.
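Conceptually, the three-step procedure can be sketched as follows, continuing the toy example (a `year` column is appended, with hypothetical labels). The authors implement this with the vam.ado Stata program of Chetty, Friedman, and Rockoff (2014); this Python version is only illustrative and simplifies by using teachers observed in every year and omitting the intercept in the best linear predictor.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df["year"] = rng.integers(2009, 2014, len(df))  # hypothetical year labels

# Step 1 (equations 4-5): estimate the coefficients with teacher fixed effects
# included, then keep Y-hat = Y - b0 - Y_lag*b1 - X*b2 (equal to T_j*phi + eps).
fe = smf.ols("outcome ~ outcome_lag + frpl + ell + male + C(teacher_id)",
             data=df).fit()
controls = ["outcome_lag", "frpl", "ell", "male"]
df["resid"] = df["outcome"] - fe.params["Intercept"] - sum(
    fe.params[c] * df[c] for c in controls
)

# Step 2: teacher-year average residuals Y-bar_jt (wide: teacher x year).
ybar = df.groupby(["teacher_id", "year"])["resid"].mean().unstack("year")

# Step 3 (equation 6): for each year t, fit the best linear predictor of
# Y-bar_jt from the other years' averages and apply it to get mu-hat_jt.
mu_hat = {}
s = ybar.dropna()  # simplification: keep teachers observed in every year
for t in ybar.columns:
    others = [c for c in ybar.columns if c != t]
    psi, *_ = np.linalg.lstsq(s[others].values, s[t].values, rcond=None)
    mu_hat[t] = pd.Series(s[others].values @ psi, index=s.index)

# Long format: one forecasted VA value per teacher-year.
forecasts = (pd.DataFrame(mu_hat)
               .rename_axis(columns="year")
               .stack()
               .rename("mu_hat")
               .reset_index())
```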

Definition of Bias
To define bias, consider the following regression of outcome residuals on forecasted value-added when students are randomly assigned to teachers:

$$\hat{Y}_{ijt} = \alpha_t + \theta \hat{\mu}_{jt} + \zeta_{ijt}. \qquad (7)$$

We adopt the Chetty, Friedman, and Rockoff (2014) definition of forecast bias as $1 - \theta$. Chetty, Friedman, and Rockoff use two tests for bias to present evidence that the estimates of teacher effectiveness $\hat{\mu}_{jt}$ are forecast-unbiased predictors of student test scores in $t$ if the vector of prior achievement $Y_{ijt-1}$ (from equation 2) contains lagged test scores. As we describe in the following section, we will replicate one of these tests—a teacher switching quasi-experiment—using the study's five non-test academic outcomes to determine whether value-added estimates based on these outcomes contain forecast bias.
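In code, the regression in equation 7 and the implied bias test amount to the following minimal sketch, continuing the toy example; the grade and school-cohort labels are hypothetical constructions added for the cluster variable, not the authors' definitions.

```python
import statsmodels.formula.api as smf

# Merge forecasts onto student records and build a hypothetical cluster key.
d = df.merge(forecasts, on=["teacher_id", "year"], how="inner")
d["grade"] = rng.integers(0, 6, len(d))  # hypothetical grade levels
d["school_cohort"] = (d["school_id"].astype(str) + "-"
                      + (d["year"] - d["grade"]).astype(str))

# Equation 7: residuals on forecasted VA with year intercepts alpha_t.
m = smf.ols("resid ~ mu_hat + C(year)", data=d).fit(
    cov_type="cluster", cov_kwds={"groups": d["school_cohort"]}
)
print("theta:", m.params["mu_hat"], " forecast bias:", 1 - m.params["mu_hat"])
print(m.t_test("mu_hat = 1"))  # t-test of the null of zero forecast bias
```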

Assessing Validity
If we take a VAM framework and apply it to non-tested academic outcomes, do the estimates of teacher performance correspond to causal changes in actual student outcomes, as with tested outcomes? The idea behind our test is simple: If we know how students of a given teacher performed in the past, can we make out-of-sample predictions about the outcomes of future students taught by that teacher? Thus, our tests of validity are designed to generate predictive evidence about whether VAMs applied to non-test academic outcomes truly measure a teacher's ability to improve those outcomes for the students they teach. We believe predictive evidence on N-VAMs is an important first step in assessing whether they should be used in research and policy.

A regression of predicted student outcomes ($\hat{Y}_{ijt}$ from equation 5) on the forecast value-added estimates ($\hat{\mu}_{jt}$ from equation 6) is not directly informative about whether $\hat{\mu}_{jt}$ represents a causal relationship between $\hat{Y}_{ijt}$ and $\hat{\mu}_{jt}$ because students are not randomly assigned to teachers. In other words, a coefficient of 1 could reflect either a causal relationship or persistent differences in $\zeta_{ijt}$ in equation 7 (e.g., certain teachers being assigned to students with high- or low-income parents). Thus, as in Chetty, Friedman, and Rockoff (2014), we leverage variation in student exposure to teacher N-VAM contributions caused by staffing changes at the grade-school level to test whether student non-test academic outcomes change, corresponding to a 1:1 relationship.

Because we forecast student achievement in multiple years, we can use the predictions to implement the teacher switching research design of Chetty, Friedman, and Rockoff (2014) and Bacher-Hicks, Kane, and Staiger (2014). We use variation in student exposure to teacher quality at the school-by-grade level induced by teacher staffing


changes to estimate the predictive validity of teacher value-added contributions on student non-test academic outcomes. Letting $\Delta Y_{sgt}$ and $\Delta\hat{\mu}_{sgt}$ be first-differenced student outcomes and forecasted teacher value-added contributions aggregated to the school-grade-year level, respectively, we run the following regression:

$$\Delta Y_{sgt} = \mu + \Delta\hat{\mu}_{sgt}\alpha + \epsilon_{sgt}. \qquad (8)$$

To construct $\Delta\hat{\mu}_{sgt}$, the differenced average teacher value-added contributions at the school-grade-year level, we weight teachers by the number of students they taught. To perform this test, when estimating value-added contributions for teachers in $t$, we omit observations from $t$ and $t-1$ so as not to introduce bias when using differenced outcomes on the left-hand side. In addition, we include year fixed effects and cluster at the school-cohort level.

The key driver of changes in teacher value-added contributions at the school-grade-year level is school staffing changes, where teachers are moving across schools or grades. The crucial identifying assumption is that changes in aggregate student outcomes are driven only by these changes in school staffing decisions and not other factors that influence test scores. Importantly, this test avoids the problem of nonrandom student–teacher sorting within schools because it aggregates teacher value-added contributions and student performance to school–grade cells.

In equation 8, forecast bias is defined as $1 - \alpha$. Bias is 0 when $\alpha$ fails to reject the null hypothesis of being equal to one, which would imply that any changes in N-VAM estimates correspond to equivalent changes in actual non-tested academic outcomes. If $\alpha$ deviates significantly from one, it would indicate that the predictive validity of the teacher estimate $\hat{\mu}_{jt}$ does not hold across schools or grades. Values less (greater) than one imply that estimated N-VAMs overpredict (underpredict) the corresponding changes in actual student outcomes as teachers flow across different schools. We test for the presence of forecast bias across the five candidate non-test academic outcomes, first as a pooled sample and then separately by level (elementary, middle, and high school grades).

In any given year, students may be exposed to multiple teachers, especially in middle school and high school. When estimating TFA effects, we include each student–teacher link as its own observation and dosage weight each of these records.11 When conducting the test for forecast bias, we obtain each teacher's school-grade-year weight by dividing the sum of his or her dosages by the total sum of dosages.12 The mean outcome in the school-grade-year cell is simply the mean across students in that cell.
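A sketch of this switching test follows, continuing the toy example. It glosses over the leave-out step just described (omitting years $t$ and $t-1$ when constructing forecasts), uses placeholder equal dosage weights, and defines a school cohort as year minus grade, which is a labeling assumption rather than the authors' stated implementation.

```python
import pandas as pd
import statsmodels.formula.api as smf

d["dosage"] = 1.0  # placeholder; the paper weights links by instructional dosage

# Aggregate outcome and dosage-weighted forecasted VA to school-grade-year cells.
cells = (
    d.groupby(["school_id", "grade", "year"])
     .apply(lambda g: pd.Series({
         "y": g["outcome"].mean(),
         "mu": (g["mu_hat"] * g["dosage"]).sum() / g["dosage"].sum(),
     }))
     .reset_index()
     .sort_values(["school_id", "grade", "year"])
)

# First-difference within school-grade cells (equation 8).
grp = cells.groupby(["school_id", "grade"])
cells["d_y"] = grp["y"].diff()
cells["d_mu"] = grp["mu"].diff()
cells["cohort"] = cells["year"] - cells["grade"]  # assumed cohort definition
est = cells.dropna(subset=["d_y", "d_mu"])

m = smf.ols("d_y ~ d_mu + C(year)", data=est).fit(
    cov_type="cluster",
    cov_kwds={"groups": est["school_id"].astype(str) + "-"
                        + est["cohort"].astype(str)},
)
print(m.t_test("d_mu = 1"))  # rejection signals forecast bias (1 - alpha != 0)
```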

11. This method is referred to as the Full Roster Method (FRM) by Hock and Isenberg (2012).
12. In the paper where Chetty, Friedman, and Rockoff (2014) develop their test for forecast bias, they keep one observation per student per year; thus there are no instances of multiple teachers per student in their data. In our case, students are linked to multiple teachers in a given year. As pointed out by Isenberg and Walsh (2015), because some students are linked to more teachers than others, they will have a larger contribution in estimating the coefficients on explanatory variables in equation 4. Isenberg and Walsh (2015) recommend the Full Roster-plus Method (FRM+), which involves creating duplicate replications. However, due to the very large number of student–teacher links in our data, this suggestion is computationally infeasible. Thus, we do not implement this method; however, results are similar when restricting the sample to remove students linked to very few or very many teachers. When the sample is restricted to students linked to between 7 and 12 teachers in a given year (about 80 percent of the sample), results for the quasi-experimental tests for forecast bias are similar. Finally, Isenberg and Walsh (2015) note that results using FRM and FRM+ are very similar.


6. RESULTS
We begin by presenting properties of forecasted N-VAMs and then conduct the validation procedure, showing that these N-VAMs represent persistent, meaningful differences across teachers and fail to reject forecast-unbiasedness for most measures in elementary and middle school. After finding these measures to be consequential and valid, we turn to the question of whether TFA teachers perform better on these measures than comparison non-TFA teachers.

Basic Properties of Forecasted VA
In order to generate out-of-sample forecasts of VA for each outcome (both tested and non-tested outcomes), it must be the case that the outcomes of students in a teacher's classroom in one year are correlated with the outcomes of students of that teacher in a different year. Otherwise, we would have nothing on which to base forecasts, and every teacher would have forecasted VA of exactly zero. Throughout this section, we consider elementary, middle, and high school teachers as separate groups due to differences in results across school types, though not all outcomes of interest can be estimated at each level. Because students in high school grades are not annually tested, due to lack of observations we do not include tested outcomes at the high school level. In addition, because very few students repeat grades in middle school, we do not consider grade repetition at that level.

We verify that there is a persistent component of each non-test academic outcome within teachers by plotting the autocorrelation vectors by outcome and school type, shown in figure 1. The maximum years of separation is four: of our six years of data, one year must be used as prior controls in the residualization process, leaving five years of data, the maximum of which are four years apart. For teachers of elementary school students, the tested outcomes (math and reading), as well as GPA and unexcused absences, have the highest year-to-year correlations, whereas suspensions and grade repetition tail off quickly. For middle school, all correlations tend to be higher, especially suspensions and the percent of classes failed.

We next turn to properties of the VA forecasts themselves. Table 3 displays the standard deviations of the VA forecasts for each outcome at each school level. For reference, the standard deviations of VA forecasts for reading and math from Chetty, Friedman, and Rockoff (2014) are also presented; our comparable measures are slightly larger than theirs. As with Chetty, Friedman, and Rockoff, for tested outcomes, the dispersion of teacher effects is higher in math than English and higher in elementary school than middle school. However, in contrast, many of the non-tested academic outcomes have higher variance in middle school than elementary school. Across all grade levels, the magnitudes of the standard deviations of N-VAM forecasts are slightly larger than the standard deviations of the tested outcomes (both in our data and in Chetty, Friedman, and Rockoff 2014).13

Even if we find that N-VAM estimates are valid and reliable, they are only useful for teachers in tested grades and subjects to the extent that they provide new information

13. The correlations presented here are likely larger than the true correlation in underlying teacher skills due to factors such as unobserved student heterogeneity across outcomes. However, as the outcomes of interest in this paper are the forecasts of teacher effectiveness, estimating these true correlations is beyond the scope of the paper.


Figure 1. Teacher Autocovariance Vectors by Years of Separation between Classes and School Level: a. Elementary school. b. Middle school. c. High school.
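As a rough illustration, autocorrelations of the kind plotted in figure 1 can be computed from the teacher-by-year matrix of average residuals (`ybar` in the earlier sketch); integer year labels are assumed, and this is only a schematic version of the figure's construction.

```python
import numpy as np

# Mean correlation of teacher average residuals by years of separation (1-4).
for sep in range(1, 5):
    pairs = [(y, y + sep) for y in ybar.columns if (y + sep) in ybar.columns]
    corrs = [ybar[a].corr(ybar[b]) for a, b in pairs]
    print(f"{sep} year(s) apart: mean corr = {np.mean(corrs):.2f}")
```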

Table 3. Standard Deviation of Value-Added Forecasts by School Type

                        Elementary   Middle   High
CFR: Math                  0.12       0.09
CFR: English               0.08       0.04
Math                       0.15       0.12
ELA                        0.10       0.08
Unexcused absences         0.11       0.16    0.17
Suspension absences        0.09       0.15    0.11
% classes failed           0.14       0.15    0.16
GPA                        0.17       0.17    0.16
Repeated                   0.10        —      0.26

Notes: CFR values (provided from Chetty, Friedman, and Rockoff 2014) are included for comparison. ELA: English language arts; GPA: grade point average.

relative to VAMs. As an extreme example, if VAMs for math teachers were perfectly correlated with their N-VAMs, then there would be no reason to include both measures when measuring teacher effectiveness or designing a teacher evaluation system.14 One of the motivations for measuring N-VAMs is the assumption that they measure, at least in part, something other than VAMs. To explore the relationship between VAMs and

14. Of course, the measures could still be useful for teachers not in tested grades and subjects.


Table 4. Cross-subject Correlation of Value-Added Forecasts by School Type

                       Math    ELA    Unexcused   Suspension    GPA    % Classes
                                      Absences    Absences             Failed
ELA                    0.59
Unexcused absences    −0.17   −0.32
Suspension absences   −0.17   −0.24     0.24
GPA                    0.09    0.11    −0.07        −0.20
% classes failed      −0.00    0.02     0.07         0.17     −0.71
Repeated              −0.09   −0.07     0.10         0.03     −0.02     0.08

Notes: ELA: English language arts; GPA: grade point average.

N-VAMs, we calculate the correlation between math and reading VAMs and each of our five N-VAMs. The higher the correlation, the larger the estimated overlap between teacher effectiveness on tested and non-tested academic outcomes. If, for example, the correlation between VA in absences and grade retention were high, it would provide evidence that certain teachers consistently reach their students in a manner that is reflected across multiple measures.

We then turn to the degree to which estimated N-VAMs for a given teacher are correlated. Table 4 displays the correlation across outcomes for a given teacher. Perhaps unsurprisingly, for teachers who taught both math and reading, forecasted teacher VA was highly correlated across the two subjects, at nearly 0.60. At the same time, math and reading VA have a notable negative correlation with unexcused absences and suspensions. Gershenson (2016) also finds, in a different dataset, that teacher effects on student absences are negatively correlated with effects on student math and reading scores, although the correlations are larger here. The implication of this result is that teacher evaluation systems that use test-based VA as the only student outcome measure (along with observational ratings) may be inadvertently overlooking gains in these non-test academic outcomes that may still be important for students.

Looking specifically at the correlations among the N-VAMs, most correlations are relatively low (correlation coefficients with absolute value less than 0.20), with two notable exceptions. Both absence types (unexcused and suspension absences)15 are modestly correlated (0.24), and, unsurprisingly, GPA and the percent of classes failed are highly negatively correlated (−0.71). Accordingly, states could consider the inclusion of these N-VAMs in evaluation systems (after further investigation) to better reflect the multidimensional nature of teaching.
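Computationally, table 4 amounts to a pairwise correlation matrix over teacher-year forecasts. A one-line sketch, assuming a DataFrame `fc` with one forecast column per outcome (hypothetical column names):

```python
# Cross-outcome correlation of VA forecasts, as in table 4.
cols = ["math", "ela", "unexc_abs", "susp_abs", "gpa", "pct_failed", "repeated"]
print(fc[cols].corr().round(2))
```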

Results Using Forecasted VA
To verify that our forecasts were constructed properly, we first regress student-level residuals on the forecasted VA of their teachers:

$$\hat{Y}_{ijt} = \alpha_t + \theta \hat{\mu}_{jt} + \zeta_{ijt}. \qquad (9)$$

15. According to the M-DCPS handbook, suspensions count as excused absences, not unexcused. See www.dadeschools.net/schoolboard/rules/Chapt5/5a-1.041.pdf.


Table 5. Coefficients from Student-Level Regression of Outcome on Value-Added Forecast

                     Tests     M      ELA    Unexcused  Suspension   GPA    % Classes  Repeater
                                             Absences   Absences            Failed
Panel A: All          1.01    1.01    1.02     1.04       1.05       1.00     1.04
                     (0.02)  (0.02)  (0.02)   (0.02)     (0.03)     (0.01)   (0.01)
  Reject = 1                                                                   x
Panel B: Elementary   1.01    1.00    1.02     1.01       1.05       0.96     0.99       0.95
                     (0.02)  (0.02)  (0.02)   (0.02)     (0.08)     (0.01)   (0.02)     (0.04)
  Reject = 1                                                         x
Panel C: Middle       1.02    1.01    1.02     1.02       1.04       0.99     1.03
                     (0.03)  (0.04)  (0.02)   (0.03)     (0.04)     (0.02)   (0.03)
  Reject = 1
Panel D: High                                  1.06       1.06       1.05     1.08       0.92
                                              (0.03)     (0.05)     (0.03)   (0.02)     (0.06)
  Reject = 1                                    x                              x

Notes: Coefficient is from the regression of residualized student outcome on forecasted student outcome, based on teacher performance in other years. Standard errors clustered at the school-cohort level in parentheses. Sample sizes are not provided because they are very large, as the unit of observation is the student–teacher link. ELA: English language arts; GPA: grade point average.

Results are shown in table 5. Panel A pools all observations across all levels, and panels B, C, and D are executed separately on elementary, middle, and high school samples, respectively. At the elementary and middle school levels, coefficients are indeed very close to 1. Although not surprising, it is still reassuring that for non-tested academic outcomes, the outcomes of students in a given year can be predicted based on the outcomes of students taught by their teacher in other years. Conversely, the forecasts in high school are not as closely related to current student outcomes, and reject the null hypothesis of being equal to one in two of the outcome variables (unexcused absences and the percentage of failed classes). These values exceeding 1 suggest N-VAM estimates based on other teacher years underpredict the actual outcomes observed among students assigned to a teacher in a given year, perhaps due to student tracking or a similar sorting bias in the data. The remaining outcomes fail to reject the null hypothesis, but the deviations from 1 are notable, especially for grade repetition.16 This may be due to the low degree of variation in grade repetition in high school, making it difficult to forecast.

Finally, we implement the Chetty, Friedman, and Rockoff (2014) quasi-experimental estimate of forecast bias described above: Changes in student outcomes at the school-grade-subject-year level are regressed on average forecasted teacher performance at that level. Results are shown in table 6. As in table 5, panel A presents the results on a pooled sample, and panels B, C, and D conduct the test by school levels. At the elementary and middle school levels, only for unexcused absences is the coefficient estimate significantly different from 1. However, due to having a short panel, standard errors are generally large and we cannot rule out somewhat large degrees of bias. Thus, rather than

16. We also experimented with estimating teacher effects for whether students ever graduate. We do not include results in this paper for two reasons. First, controlling for lagged graduation outcomes is not feasible (students who graduate exit the data). And second, our panel is not sufficiently long to observe graduation outcomes for many students. When applying our methods to graduation outcomes, we find very high degrees of forecast bias and no relationship between TFA and graduation.


Table 6. Quasi-Experimental Estimates of Forecast Bias

                     Tests     M      ELA    Unexcused  Suspension   GPA    % Classes  Repeater
                                             Absences   Absences            Failed
Panel A: All          0.93    0.86    1.20     1.12       0.92       1.23     0.94
                     (0.07)  (0.08)  (0.16)   (0.09)     (0.12)     (0.07)   (0.08)
                      8689    4306    4383    11435      11435      10029    10140
  Reject = 1                                                         x
Panel B: Elementary   0.98    0.89    1.27     0.50       0.88       1.10     1.00       0.88
                     (0.10)  (0.12)  (0.18)   (0.09)     (0.17)     (0.06)   (0.08)     (0.13)
                      4857    2410    2447     7737       7737       6406     6459       7737
  Reject = 1                                    x
Panel C: Middle       0.88    0.83    1.10     1.04       0.80       0.83     0.88
                     (0.10)  (0.10)  (0.29)   (0.16)     (0.19)     (0.12)   (0.19)
                      3832    1896    1936     2056       2056       2010     2056
  Reject = 1
Panel D: High                                  1.50       1.24       1.88     0.78       2.49
                                              (0.17)     (0.36)     (0.25)   (0.16)     (0.41)
                                               1642       1642       1613     1625       1642
  Reject = 1                                    x                     x                   x

Notes: Coefficient from the regression of actual student outcomes on forecasted student outcomes at the school-grade-year cell, with forecasts based on teacher performance in other years. Standard errors clustered at the school-cohort level in parentheses. Number of cells provided below standard errors. See text for further details. ELA: English language arts; GPA: grade point average.

viewing this as a definitive test of whether N-VAMs can be thought of as valid, causal estimates of teacher effectiveness, we use this to justify measuring TFA effects on these outcomes, at least in elementary and middle schools. For high school, many outcomes are far from 1 (the standard errors are also much larger on these high school outcomes, relative to other levels). As previously seen, four of these five values exceed one, signifying that N-VAM forecasts underpredict actual changes in student outcomes; yet, in this table the estimates are constructed to remove the possibility of tracking within schools as a potential driver. The presence of forecast bias here may suggest that these outcomes are driven more by school environment or policy than by teacher-specific factors. Given the rejection of these bias tests in high school grades, we do not estimate TFA effects for these teachers in our main results section (they are presented in Appendix table A.1 for completeness). Even for the high school outcomes which do not fail the test of statistically significant difference from one—classes failed and suspensions—the point estimates for bias are greater than 20 percent.

Taken together, our results from tables 5 and 6 suggest the non-test academic outcomes of students in elementary school and middle school can be systematically explained by the teachers to whom they are assigned. The one exception is unexcused absences in elementary school, which fails the quasi-experimental test of forecast bias with a degree of bias of 50 percent.

TFA Estimates on Non-Test Academic Outcomes
Having validated—at least to the extent possible given our data—most non-test academic outcomes in elementary and middle school grades, we now move to estimating TFA effects based on a value-added specification with these non-test variables as dependent variables. Results are shown in table 7, with columns representing separate regressions for each outcome variable. In elementary school, students in classrooms taught


Table 7. Relationship Between TFA and Non-Test Academic Outcomes

                                         Unexcused   Suspension              % Classes
                                         Absences    Absences      GPA       Failed      Repeater
Elementary
  TFA coefficient                        −0.044a     −0.054*       0.050**    0.001       0.001
                                         (0.139)     (0.032)      (0.023)    (0.003)     (0.005)
  Observations                           4,661,344   4,661,344    4,622,273  4,661,344   4,661,344
  R²                                     0.438       0.127        0.645      0.242       0.115
  Dependent variable mean, full sample   4.0         0.07         3.19       0.028       0.026
  Dependent variable mean, students
    in TFA classrooms                    7.2         0.17         2.96       0.041       0.038
Middle
  TFA coefficient                        −0.347**    −0.075       −0.009      0.001
                                         (0.154)     (0.058)      (0.011)    (0.002)
  Observations                           3,174,990   3,174,990    3,159,288  3,174,990
  R²                                     0.488       0.301        0.659      0.266
  Dependent variable mean, full sample   4.8         0.79         2.64       0.038       0.012
  Dependent variable mean, students
    in TFA classrooms                    8.4         2.01         2.29       0.067       0.022

Notes: Coefficients represent the average change in outcome associated with being in a classroom taught by a TFA teacher. Regressions control for student-level and class average demographics and their interactions with grade. Other controls include class size and teacher race and their interactions with grade. All models include school fixed effects and cluster standard errors at the school level, with standard errors shown in parentheses. GPA: grade point average; TFA: Teach For America.
aUnexcused absences in elementary school fails the quasi-experimental forecast bias test (see table 6). We display the coefficient here for completeness but urge caution in interpreting this result.
*Significant at the 10% level; **significant at the 5% level.

In elementary school, students in classrooms taught by a TFA teacher tended to have fewer unexcused absences, fewer days of suspension, and higher GPAs than observationally similar students in non-TFA classrooms, with the latter two effects at least marginally significant.17 However, recall that unexcused absences in elementary school failed our test of forecast bias. In middle school, students in TFA classrooms continue to be less likely to have unexcused absences or suspensions (only the former is statistically significant), and the GPA effect is no longer present.18

In general, point estimates tend to be modest, with only one coefficient representing more than a 10 percent change relative to baseline values. For example, a reduction of unexcused absences in middle school by 0.347 per student corresponds to about 7 percent of the average number of unexcused absences across all students in the sample (4.8). When taking into account that TFA teachers tend to be assigned relatively disadvantaged students (who tend to have more unexcused absences), the percentage change in outcomes becomes even smaller: a reduction of 0.347 absences per year corresponds to only 4 percent of the average number of absences of students taught by TFA teachers (8.4).

17. Consistent with Ladd and Sorensen (2017), in results available from the authors, we find that teacher experience is associated with a reduction in students' unexcused absences, especially in elementary school. We do not find a relationship between teacher experience and student outcomes for the other non-test academic outcomes that we consider.

18. Many TFA corps members in M-DCPS are placed in schools under the direction of the Educational Transformation Office (ETO), which oversees the district's implementation of school turnaround efforts in targeted schools. Results are very similar when adding a year × ETO interaction term, which should not be surprising because, in the presence of school fixed effects, estimates are identified from within-school variation.


For GPA in elementary school, an increase of 0.05 grade points corresponds to about 1.7 percent of mean GPA in TFA classrooms (2.96) and about 7 percent of a standard deviation of GPA.19

The one coefficient estimate representing a large percentage change relative to baseline values is suspensions in elementary school, where the reduction of 0.05 days per year corresponds to about one-third of the average number of days suspended for students in TFA classrooms. However, the baseline value is extremely small (the average is about 0.17 days, roughly equivalent to one student in six being suspended for one day per year), and the coefficient is only marginally significant because of the relatively large standard error, so it is hard to know whether this particular result would replicate in different data.

Overall, these results provide suggestive evidence that student behavior, as measured by days missed due to unexcused absences and suspensions, improves by a small degree when students are placed in a TFA classroom. In addition, students in TFA classrooms in elementary school had modest increases in GPA.
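The percentage calculations above are simple ratios of the table 7 coefficients to the baseline means reported there; the following lines reproduce them (values copied from table 7, nothing new is estimated):

```python
# Effect sizes as a share of baseline means, using values from table 7.
absence_effect = -0.347          # middle school unexcused absences coefficient
print(abs(absence_effect) / 4.8) # ~0.072: about 7% of the full-sample mean
print(abs(absence_effect) / 8.4) # ~0.041: about 4% of the TFA-classroom mean

gpa_effect = 0.050               # elementary GPA coefficient
print(gpa_effect / 2.96)         # ~0.017: about 1.7% of TFA-classroom mean GPA
```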

Robustness Check: Aggregate Grade-School Effects

In the presence of systematic student–teacher sorting within schools, it could be the case that the tests for forecast bias used in this paper fail to detect bias even when such bias exists (see, e.g., Rothstein 2017 and Horvath 2015). Thus, we supplement our TFA estimates from equation 3 with a model that replaces the classroom-level TFA indicator with the share of (dosage-weighted) students in TFA classrooms at the grade-school-year level:

$$
Y^{OUTCOME}_{ijt} = \beta_0 + \beta_1 Y^{OUTCOME}_{ijt-1} + \beta_2 X_{it} + \beta_3 TFA_{gst} + \beta_4 X_{jt} + \beta_5 X_{ct} + \gamma_s + \varepsilon_{ijt} \qquad (10)
$$
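To illustrate how a specification like equation 10 can be taken to data, the sketch below constructs a dosage-weighted TFA share for each grade-school-year cell and runs the regression with school fixed effects and school-clustered standard errors. It is a minimal sketch on synthetic data; every column name (in_tfa_class, dosage, outcome_lag, and so on) is a hypothetical stand-in for the richer controls in X_it, X_jt, and X_ct, not one of the authors' actual variables.

```python
# Minimal sketch of an equation-10-style regression on synthetic data.
# All names below are hypothetical; real estimation would use the paper's
# full sets of student, classroom, and teacher controls.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5000
df = pd.DataFrame({
    "school_id": rng.integers(0, 40, n),
    "grade": rng.integers(6, 9, n),
    "year": rng.integers(2009, 2015, n),
    "in_tfa_class": (rng.random(n) < 0.1).astype(int),  # classroom TFA indicator
    "dosage": rng.uniform(0.5, 1.0, n),                 # share of instruction received
    "outcome_lag": rng.normal(0.0, 1.0, n),             # prior-year outcome
    "x_student": rng.normal(0.0, 1.0, n),               # stand-in for student controls
})
df["outcome"] = 0.5 * df["outcome_lag"] + rng.normal(0.0, 1.0, n)

# TFA_gst: dosage-weighted share of students in TFA classrooms in each
# grade-school-year cell.
share = (
    df.groupby(["school_id", "grade", "year"])
    .apply(lambda g: np.average(g["in_tfa_class"], weights=g["dosage"]))
    .rename("tfa_share")
    .reset_index()
)
df = df.merge(share, on=["school_id", "grade", "year"])

# School fixed effects via C(school_id); standard errors clustered by school.
fit = smf.ols(
    "outcome ~ outcome_lag + x_student + tfa_share + C(school_id)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(fit.params["tfa_share"])
```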

Equation 10 does not utilize classroom-specific variation in TFA assignment, but rather the intensity of TFA presence in a given school-grade-year cell. Because of the high level of within-school variation over time induced by the clustering strategy, we are able to obtain more precise estimates for these TFA intensity variables than would otherwise be possible under typical TFA assignment practices.

Results are presented in table 8. In general, where TFA effects in table 7 were found to be associated with student outcomes, these effects are also present when TFA is measured at the grade level. Furthermore, most of the outcomes have signs in the same direction in both tables 7 and 8, the exceptions being percent of classes failed (now significantly negative) and grade repetition (negative, but not significant) in elementary school. Although the direction of the TFA effect is generally consistent across tables 7 and 8, the magnitudes are not.

19. One could also use the estimates of forecast bias presented in table 6 to scale the estimates in table 7 accordingly. For example, the estimated coefficient on elementary GPA is 1.10, representing 10 percent forecast bias beyond what we would attribute to the teacher's input; reducing the TFA point estimate on elementary GPA in table 7 by 10 percent would yield a value of 0.045 rather than 0.050 points on the GPA scale. Scaling this way would not qualitatively change the results presented here.


Table 8. TFA at Grade-School-Year Level and Non-Test Academic Outcomes

                  Unexcused    Suspension               % Classes
                  Absences     Absences     GPA         Failed       Repeater

Elementary        −0.571a      −0.179       0.301***    −0.030**     −0.018
                  (0.654)      (0.109)      (0.087)     (0.012)      (0.016)
Middle            −4.300**     −0.189       −0.063      −0.005
                  (2.080)      (0.623)      (0.126)     (0.031)

Notes: Coefficients displayed are for the share of TFA teachers at the grade-school-year level. Regression controls for student-level and class average demographics and their interactions with grade. Other controls include class size and teacher race and their interactions with grade. Standard errors clustered at the school level are shown in parentheses. GPA: grade point average; TFA: Teach For America.
aUnexcused absences in elementary school fails the forecast bias test (see table 5). We display the coefficient here for completeness but urge caution in interpreting this result.
**Significant at the 5% level; ***significant at the 1% level.

Taking the estimates at face value, the results from table 7 indicate that replacing every teacher with a TFA corps member would lower unexcused absences by 0.347 days per student, whereas the results from table 8 imply that replacing an entire grade with TFA teachers would lower unexcused absences by 4.3 days per student. However, there are two important differences between tables 7 and 8 that complicate this comparison. First, a TFA share of 1 is well outside the typical TFA density of schools in the sample: Of school–grade cells with any TFA presence in a given year, the median TFA density is about 0.1 and the 90th percentile is about 0.3. Second, tables 7 and 8 are estimated from two different sources of variation. In table 7, the TFA coefficient represents the average change in student outcomes associated with being in a TFA classroom relative to a non-TFA classroom in a given school. On the other hand, table 8 compares school–grade outcomes in years with high dosage to outcomes in the same school–grade in years with low dosage. Thus, although both specifications are intended to measure the TFA contribution to student outcomes, there is little reason to expect them to produce results consistent in magnitude.

Another potential explanation for the differences between tables 7 and 8 is spillover effects, whereby high concentrations of TFA corps members lead to school improvements beyond their impacts in their own classrooms. In a companion paper (Backes et al. 2016), we find little evidence of spillover in math or reading. However, one interpretation of tables 7 and 8 is that the overall school–grade effect in table 8 is larger than the individual effect in table 7. In results available from the authors, we find that in a hybrid model controlling for both an individual student's teacher and the TFA share, the coefficient on TFA share remains similar to that in table 8, consistent with the spillover hypothesis. Of course, this finding could also be driven by possible unobserved correlates of TFA share, such as other school investments made at the same time that schools added TFA corps members.20
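The hybrid model described above can be illustrated by simply entering the student's own classroom indicator alongside the cell-level share. The fragment below continues the hypothetical equation-10 sketch shown earlier (it reuses that sketch's df and smf) and is again only a schematic under those assumptions, not the authors' actual specification:

```python
# Hybrid model: both the classroom TFA indicator and the grade-school-year
# TFA share enter the regression (reuses df and smf from the earlier sketch).
hybrid = smf.ols(
    "outcome ~ outcome_lag + x_student + in_tfa_class + tfa_share + C(school_id)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(hybrid.params[["in_tfa_class", "tfa_share"]])
```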

7. RECONCILIATION WITH PRIOR EVIDENCE OF TFA IMPACTS ON NON-TEST ACADEMIC OUTCOMES

20. When estimating these additional specifications, we included year × ETO interaction terms (see footnote 18) as additional control variables. Hence, any school-level investments that may bias these results would have to be separate from the ETO interventions.


We find suggestive evidence that students taught by TFA teachers in elementary and middle school were less likely to miss school due to unexcused absences and suspensions, and that students in elementary school had slightly higher GPAs.21 Among the outcomes that passed the validity tests, we do not find any evidence that assignment to TFA teachers has any adverse consequences for students.22

Our results stand in contrast to prior studies of TFA corps members, which have not found any significant differences in student absences and suspensions (e.g., Decker, Mayer, and Glazerman 2004; Clark et al. 2013). Decker, Mayer, and Glazerman (2004) report that students randomly assigned to TFA classrooms averaged more days absent (by 0.52 days) and days suspended (by 0.04 days) than control students, although the result is not statistically significant, and Clark et al. (2013) find minimal differences in student absences in secondary school. Although this study does not benefit from random assignment, we do have a substantially larger sample of TFA teachers, resulting in very precise estimates. In fact, our point estimates for absences and suspensions fall well within the confidence intervals of both Decker, Mayer, and Glazerman and Clark et al.; the former has standard errors large enough that a 0.5-day difference in days absent between students in TFA and non-TFA classrooms is not close to significant (p-value = 0.415). On the other hand, our estimate of a difference of 0.35 days absent at the middle school level is statistically significant at the 5 percent level. Thus, although Decker, Mayer, and Glazerman and Clark et al. are able to rule out large TFA effects on non-test academic outcomes, their estimates are not precise enough to detect small effects.

8. CONCLUSION

The question of whether teachers influence non-test student academic outcomes has recently attracted the attention of both policy makers and researchers, with many states now considering new ways to evaluate school and teacher performance separately from test scores. This study specifically investigates whether TFA teachers influence non-tested academic outcomes commonly found in administrative databases, a likely starting point for many states as they move to experiment outside of test scores. In our review of the literature we find limited validation of the use of these non-tested student academic outcomes for teachers; hence, this paper first seeks to validate the use of such outcomes and then examines the influence of TFA teachers on them.

The analysis of N-VAMs revealed persistent differences across teachers in their influence on students' non-test academic outcomes. For all but one outcome in elementary and middle schools, we cannot reject that changes in student outcomes at the school-grade-subject-year level are fully explained by changes in the teachers assigned to those students.

21. Our TFA results are estimated on the full sample of TFA corps members and alumni. About 80 percent of the TFA sample consists of corps members (i.e., teachers in their first or second year); estimating TFA impacts for alumni only yields very large standard errors.

22. The exception is the percent of classes failed for high school students, for which we find a statistically significant increase for students in TFA classrooms. However, although we cannot reject forecast unbiasedness of this outcome due to large standard errors, the point estimate for bias in table 6 is 22 percent, and in table 5 we reject equality with 1. Thus, we do not place a great deal of weight on this finding, although it deserves future exploration.


Although the short panel and large standard errors prevent us from making strong statements about the causality of estimated N-VAMs, most of the coefficient estimates are close enough to 1 that we feel comfortable estimating TFA effects on non-test academic outcomes in a value-added framework for elementary and middle school. When estimating TFA effects, we find evidence of small reductions in class time missed due to absences and suspensions in elementary and middle school.

The cases in which our tests reject the validity of N-VAMs also raise some caution for prior studies of teacher effects on non-tested outcomes. For example, where Jackson's (2014) analysis presents evidence of teacher effects on an index of non-tested student academic outcomes in ninth grade, we are able to reject forecast unbiasedness of many of these same non-test outcomes in high school grades using our data. Although Jackson does conduct validity tests similar to those found in this paper, the tests are noisy enough to be unable to rule out substantial levels of bias (which is consistent with this paper). Another example is Gershenson's (2016) study analyzing teacher effects on unexcused absences of elementary school students; this is the one elementary outcome for which the validity tests were rejected in our data. Although our rejection of the validity of these particular outcomes does not invalidate these prior studies' findings (neither of which uses the data we use here), it does underscore the need to thoroughly vet new outcomes before using them as dependent variables in a value-added framework. Further research validating new student outcomes across a variety of contexts would be highly valuable before considering policy applications of these N-VAMs.

An open question regarding measures of a teacher's contribution to non-tested student academic outcomes is the extent to which differences across teachers represent true changes in student behavior rather than differences in how teachers record and respond to that behavior. Taking student suspensions as an example, we find that teachers who tend to lead classrooms with fewer (or more) students who receive suspensions in one year tend to lead classrooms with fewer suspended students in other years, even when they switch schools or grade levels. Because these effects follow teachers when they switch schools, this pattern is unlikely to be driven by the sorting of teachers to students; however, it could be driven by certain teachers inducing better behavior in their students (a true positive) or by certain teachers being more lenient with student discipline (a false positive). The only evidence that we are aware of is Jackson (2014), who finds that students assigned to teachers estimated to raise an index of non-test academic student outcomes (a weighted average of suspensions, absences, on-time grade progression, and GPA) have improved future outcomes, such as lower dropout rates and higher rates of SAT-taking, suggesting that differences across teachers in estimated N-VAM measures translate to real effects on students. Future research could extend Jackson's work to examine each of the components in the index and how the relationship varies across school levels, rather than relying on only a sample of ninth graders.

On a related note, there is an additional element of caution warranted in using the non-test academic outcomes we analyze here for evaluation purposes, because some of these outcomes could be easily manipulated by teachers or other administrative staff in the school. For example, if states began to attach stakes to students' unexcused absences and suspensions, unscrupulous administrators could simply record them as excused absences without penalty.
Though we know of several jurisdictions that are beginning to consider student absences as performance measures, we urge caution in using these data because we do not know whether these measures will remain equally predictive once stakes are attached to them.


Moreover, outcomes like GPA, course failure, and grade repetition are primarily (if not exclusively) determined by the teachers themselves, and therefore are not likely to be useful for evaluation purposes, whether for teachers or for schools. Yet, as long as policy makers do not apply stakes directly to them, the evidence that value-added measures based on many of these outcomes are valid in elementary and middle school grades should give some confidence to researchers (for research use) and personnel managers (for monitoring staff performance) in using these outcomes to gain new insights into teacher performance.

Finally, it is worth noting that the array of potential non-test measures that could plausibly be used for policy purposes is far wider than those we examined here. We focus on this set of non-test student academic outcomes because they are likely convenient starting points for states beginning to experiment with non-test measures, but the new federal law is not prescriptive about what the new measures should look like. Other types of performance measures based on student or teacher surveys, psychological evaluations, or school inspections by external raters could all theoretically be used to meet this provision in the law, and each has its own literature supporting it.23 As of this writing, though, we do not know of states that have formally embraced any of these other types of measures.

ACKNOWLEDGMENTS This work was produced under financial support from the John S. and James L. Knight Foun- dation. We gratefully acknowledge the cooperation of Miami-Dade County Public Schools, espe- cially Gisela Feild, in providing data access and research support and allowing us to interview district and school personnel regarding Teach For America (TFA). We acknowledge the coopera- tion of TFA’s national and Miami regional offices in conducting this research. We thank Victoria Brady and Jeff Nelson for excellent research assistance, and appreciate the helpful comments from Cory Koedel, Raegen Miller, Jason Atwood, Rachel Perera, James Cowan, and participants in presentations of findings at the 2015 AEFP Conference. Any and all remaining errors are our own.

DISCLOSURE The Knight Foundation funded both the TFA clustering strategy and this external evaluation of that strategy. The Knight Foundation, TFA, and Miami-Dade County Public Schools were coop- erative in facilitating this evaluation and received early drafts of these findings, but none had editorial privilege over the actual content or findings presented here. CALDER has a current con- tract with TFA to conduct a separate evaluation of corps member impacts in a different region of the country, which TFA procured through a competitive bid process; these evaluations are in- dependent of each other, and TFA has no editorial influence in the presentation of findings in either the current study or this separate contract.

REFERENCES

Antonovics, Kate, and Ben Backes. 2014. The effect of banning affirmative action on college admissions policies and student quality. Journal of Human Resources 49(2):295–322.

23. For example, see Ferguson (2012) or Kane and Staiger (2012) for evidence and discussion of the use of student surveys or Kraft and Papay (2014) for the use of teacher surveys; Duckworth et al. (2007) for psychological assessments; Hussain (2015) for school inspections; and Schwartz, Stiefel, and Wiswall (2016) for school climate-based measures.


Bacher-Hicks, Andrew, Thomas J. Kane, and Douglas O. Staiger. 2014. Validating teacher effect estimates using changes in teacher assignments in Los Angeles. NBER Working Paper No. 20657.

Backes, Ben, Michael Hansen, Zeyu Xu, and Victoria Brady. 2016. Examining spillover effects from Teach For America corps members in Miami-Dade County Public Schools. CALDER Working Paper No. 113, American Institutes for Research.

Boyd, Donald, Pamela Grossman, Hamilton Lankford, Susanna Loeb, and James Wyckoff. 2006. How changes in entry requirements alter the teacher workforce and affect student achievement. Education Finance and Policy 1(2):176–216.

Chetty, Raj, John N. Friedman, Nathaniel Hilger, Emmanuel Saez, Diane Whitmore Schanzenbach, and Danny Yagan. 2011. How does your kindergarten classroom affect your earnings? Evidence from Project STAR. The Quarterly Journal of Economics 126(4):1593–1660.

Chetty, Raj, John N. Friedman, and Jonah E. Rockoff. 2014. Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates. American Economic Review 104(9):2593–2632.

Clark, Melissa A., Hanley S. Chiang, Tim Silva, Sheena McConnell, Kathy Sonnenfeld, Anastasia Erbe, and Michael Puma. 2013. The effectiveness of secondary math teachers from Teach For America and the Teaching Fellows programs (NCEE 2013-4015). Washington, DC: U.S. Department of Education.

Decker, Paul T., Daniel P. Mayer, and Steven Glazerman. 2004. The effects of Teach For America on students: Findings from a national evaluation. Princeton, NJ: Mathematica Policy Research, Reference No. 8792-750.

Dee, Thomas S., and Martin R. West. 2011. The non-cognitive returns to class size. Educational Evaluation and Policy Analysis 33(1):23–46.

Dobbie, Will. 2011. Teacher characteristics and student achievement: Evidence from Teach For America. Unpublished paper, Harvard University.

Doherty, Kathryn M., and Sandi Jacobs. 2015. State of the states 2015: Evaluating teaching, leading and learning. Available https://www.nctq.org/dmsView/StateofStates2015. Accessed 13 November 2017.

Duckworth, Angela L., Christopher Peterson, Michael D. Matthews, and Dennis R. Kelly. 2007. Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology 92(6):1087–1101.

Duckworth, Angela Lee, Patrick D. Quinn, and Martin E. P. Seligman. 2009. Positive predictors of teacher effectiveness. The Journal of Positive Psychology 4(6):540–547.

Eskreis-Winkler, Lauren, Elizabeth P. Shulman, Scott A. Beal, and Angela L. Duckworth. 2014. The grit effect: Predicting retention in the military, the workplace, school and marriage. Frontiers in Psychology 5:36.

Ferguson, Ronald F. 2012. Can student surveys measure teaching quality? Phi Delta Kappan 94(3):24–28.

García, Emma. 2013. What we learn in school: Cognitive and non-cognitive skills in the educational production function. Unpublished paper, Columbia University.


Gershenson, Seth. 2016. Linking teacher quality, student attendance, and student achievement. Education Finance and Policy 11(2):125–149.

Gershenson, Seth, Alison Jacknowitz, and Andrew Brannegan. 2017. Are student absences worth the worry in U.S. primary schools? Education Finance and Policy 12(2):137–165.

Glazerman, Steven, Daniel Mayer, and Paul Decker. 2006. Alternative routes to teaching: The impacts of Teach For America on student achievement and other outcomes. Journal of Policy Analysis and Management 25(1):75–96.

Goldhaber, Dan, and Michael Hansen. 2013. Is it just a bad class? Assessing the long-term sta- bility of estimated teacher performance. Economica 80(319):589–612.

Goodman, Joshua. 2014. Flaking out: Student absences and snow days as disruptions of instructional time. NBER Working Paper No. 20221.

Hansen, Michael, Ben Backes, and Victoria Brady. 2016. Teacher attrition and mobility during the Teach For America clustering strategy in Miami-Dade County Public Schools. Educational Evaluation and Policy Analysis 38(3):495–516.

Heckman, James, Jora Stixrud, and Sergio Urzua. 2006. The effects of cognitive and noncognitive abilities on labor market outcomes and social behavior. Journal of Labor Economics 24(3):411–482.

Hock, Heinrich, and Eric Isenberg. 2012. Methods for accounting for co-teaching in value-added models. Princeton, NJ: Mathematica Policy Research Working Paper No. 6.

Horvath, Hedvig. 2015. Classroom assignment policies and implications for teacher value-added estimation. Unpublished paper, University of California, Berkeley.

Hussain, Iftikhar. 2015. Subjective performance evaluation in the public sector: Evidence from school inspections. Journal of Human Resources 50(1):189–221.

Isenberg, Eric, and Elias Walsh. 2015. Accounting for co-teaching: A guide for policymakers and developers of value-added models. Journal of Research on Educational Effectiveness 8(1):112–119.

Jackson, C. Kirabo. 2014. Non-cognitive ability, test scores, and teacher quality: Evidence from 9th grade teachers in North Carolina. NBER Working Paper No. 18624.

Jacob, Brian A., and Lars Lefgren. 2009. The effect of grade retention on high school completion. American Economic Journal: Applied Economics 1(3):33–58.

Kane, Thomas J., Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. 2013. Have we identified effective teachers? Validating measures of effective teaching using random assignment. Available http://files.eric.ed.gov/fulltext/ED540959.pdf. Accessed 13 November 2017.

Kane, Thomas J., Jonah E. Rockoff, and Douglas O. Staiger. 2008. What does certification tell us about teacher effectiveness? Evidence from New York City. Economics of Education Review 27(6):615–631.

Kane, Thomas J., and Douglas O. Staiger. 2012. Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Available http://files.eric.ed.gov/fulltext/ED540961.pdf. Accessed 1 November 2017.

Kautz, Tim, James J. Heckman, Ron Diris, Bas Ter Weel, and Lex Borghans. 2014. Fostering and measuring skills: Improving cognitive and non-cognitive skills to promote lifetime success. NBER Working Paper No. 20749.


Klein, Alyson. 2015. ESEA reauthorization: The Every Student Succeeds Act explained. Available http://blogs.edweek.org/edweek/campaign-k-12/2015/11/esea_reauthorization_the_every.html. Accessed 13 November 2017.

Kraft, Matthew A., and John P. Papay. 2014. Can professional environments in schools promote teacher development? Explaining heterogeneity in returns to teaching experience. Educational Evaluation and Policy Analysis 36(4):476–500.

Ladd, Helen F., and Lucy C. Sorensen. 2017. Returns to teacher experience: Student achievement and motivation in middle school. Education Finance and Policy 12(2):241–279.

McCaffrey, Daniel F., Tim R. Sass, J. R. Lockwood, and Kata Mihaly. 2009. The intertemporal variability of teacher effect estimates. Education Finance and Policy 4(4):572–606.

McCoy, Ann R., and Arthur J. Reynolds. 1999. Grade retention and school performance: An extended investigation. Journal of School Psychology 37(3):273–298.

Robertson-Kraft, Claire, and Angela Lee Duckworth. 2014. True grit: Trait-level perseverance and passion for long-term goals predicts effectiveness and retention among novice teachers. Teachers College Record 116(3):1–27.

Rothstein, Jesse. 2017. Measuring the impacts of teachers: Comment. American Economic Review 107(6):1656–1684.

Schwartz, Amy Ellen, Leanna Stiefel, and Matthew Wiswall. 2016. Are all schools created equal? Learning environments in small and large public high schools in New York City. Economics of Education Review 52:272–290.

Taylor, Kate, and Motoko Rich. 2015. Teachers' unions fight standardized testing, and find diverse allies. The New York Times, 21 April.

Xu, Zeyu, Jane Hannaway, and Colin Taylor. 2011. Making a difference? The effects of Teach For America in high school. Journal of Policy Analysis and Management 30(3):447–469.

APPENDIX A: ADDITIONAL DATA

Table A.1. Relationship Between TFA and Non-Test Outcomes in High School

                                                Unexcused    Suspension               % Classes
                                                Absences     Absences     GPA         Failed       Repeater

High school
  TFA coefficient                               0.126        0.007        −0.025      0.008**      0.003***
                                                (0.204)      (0.028)      (0.018)     (0.004)      (0.001)
  Observations                                  4,242,829    4,242,829    4,226,106   4,242,791    4,242,829
  R2                                            0.512        0.175        0.577       0.285        0.132
  Dependent variable mean, full sample          7.59         0.53         2.57        0.08         0.028
  Dependent variable mean, students in
    TFA classrooms                              12.3         1.14         2.33        0.12         0.012

Notes: Regression controls for student-level and class average demographics and their interactions with grade. Other controls include class size and teacher race and their interactions with grade. GPA: grade point average; TFA: Teach For America. Because all outcomes shown here have an estimated forecast bias greater than 20 percent (see table 5), these estimates should not be taken as credible estimates of TFA effectiveness.
**Significant at the 5% level; ***significant at the 1% level.
