research

A LASTING IMPACT
High-stakes teacher evaluations drive student success in Washington, D.C.

by THOMAS DEE and JAMES WYCKOFF

EDUCATION NEXT / FALL 2017 educationnext.org

TEACHERS MATTER, and some matter more than others. That recognition has driven a tidal wave of controversial policy reforms over the past decade, rooted in new evaluation systems that link teachers’ ratings and, in some cases, their pay and advancement to evidence of classroom practice and student learning. Two out of three U.S. states overhauled teacher evaluations between 2009 and 2015, supported by federal incentives such as Race to the Top and Teacher Incentive Fund grants, as well as No Child Left Behind Act waivers.

What is the impact, so far, of these reforms? One common narrative would indicate a flop. Most states adopting new evaluation systems saw little change in the share of teachers deemed less than effective, arguably limiting their potential to address underperformance. Meanwhile, a broader backlash against reform, fueled by concerns about overreliance on standardized tests, the accuracy of new evaluations, and the efficacy of performance-based incentives, has led some states to reverse course. Congress ultimately chose to exclude any requirements about teacher evaluation policies from the Every Student Succeeds Act of 2015, dashing some reformers’ hopes for a federal mandate.

A closer look at one high-stakes evaluation system, however, shows the positive consequences such systems can have for students. Since 2012, we have been studying IMPACT, a seminal effort by the District of Columbia Public Schools (DCPS) to link teachers’ retention and pay to their performance. Under IMPACT, the district sets detailed standards for high-quality instruction, conducts multiple observations, assesses individual performance based on evidence of student progress, and retains and rewards teachers based on annual ratings.

Looking across our analyses, we see that under IMPACT, DCPS has dramatically improved the quality of teaching in its schools, likely contributing to its status as the fastest-improving large urban school system in the country as measured by the National Assessment of Educational Progress.

Such reforms are often considered politically impossible, and the effort in DCPS shows the potential fallout. The district’s controversial chancellor, Michelle Rhee, resigned a year after IMPACT launched, when the mayor who appointed her, Adrian Fenty, lost a reelection bid in a campaign focused on school reform. But IMPACT outlasted them both, to the benefit of students. DCPS dismissed the majority of very low-performing teachers and replaced them with teachers whose students did better, especially in math. Other low-performing teachers were 50 percent more likely to leave their jobs voluntarily, and those who opted to stay improved significantly, on average, the following year. High-performing teachers improved their performance as well, especially those within reach of the significant financial incentive created by the system. Certainly, improvement was not universal, and some very good teachers decided to leave the district. Nonetheless, our analysis finds that improved teaching was common and that student achievement increased as a result.

The DCPS story shows that it may be politically challenging to adopt high-stakes evaluation systems, but it is not impossible. And it shows that well-designed and carefully implemented teacher evaluations can serve as an important district improvement strategy, so long as states and districts are also willing to make tough, performance-based decisions about teacher retention, development, and pay.

[Photograph: Kaya Henderson was chancellor of the District of Columbia Public Schools from 2010 to 2016. PHOTOGRAPH / SARAH L. VOISIN, THE WASHINGTON POST]

A More-Rigorous Approach

Teacher evaluations came into focus in 2009 alongside a growing research consensus that teacher quality has dramatic, long-lasting impacts on student success. DCPS was in the vanguard of these efforts, and was one of the first school districts in the country to link individual performance evaluations to high-stakes decisions about pay and retention when it launched IMPACT in the 2009–10 school year.

IMPACT articulates clear standards for effective instruction, provides instructional coaches to help teachers meet those standards, and evaluates teacher performance based in large part on direct, structured observations of teachers’ classroom practices in addition to evidence of student learning. IMPACT’s features are broadly consistent with emerging best-practice design principles informed by the Measures of Effective Teaching project, and are intended to drive improvements in teacher quality and student achievement (see “Capturing the Dimensions of Effective Teaching,” features, Fall 2012).

Under IMPACT, all teachers receive a single score ranging from 100 to 400 points at the end of each school year based on classroom observations, measures of student learning, and commitment to the school community. However, the components of the scores and the weights assigned to them vary, depending on what measures of student performance are available, and DCPS has adjusted their composition over time. Below, we describe the components of IMPACT as it existed in its first three years.

All teachers were evaluated by five structured classroom observations aligned to the district’s Teaching and Learning Framework, which defined domains of effective instruction, such as leading well-organized, objective-driven lessons; checking for student understanding; explaining content clearly; and maximizing instructional time. Of the five observations, three were conducted by an administrator and two by “master educators” who traveled between several schools; only the first observation was announced in advance.

The weight of these observation scores varied, depending on whether a teacher also had a value-added score based on her students’ performance on standardized tests. For those teachers, who led reading or math classrooms in grades 4–8 and accounted for less than one in five DCPS teachers, observations were worth 35 percent and value-added was worth 50 percent. For the majority of DCPS teachers, who did not have value-added scores, observations counted for 75 percent of the total IMPACT score. An additional 10 percent was based on progress toward goals for student learning, which teachers and administrators set each year.

A teacher’s contribution to a school’s community, as assessed by the principal, was worth 10 percent of the overall evaluation score, while the final 5 percent was based on a measure of the value-added to student achievement for the school as a whole. Teachers could also have points deducted for failing to meet expectations for attendance, punctuality, and adherence to other policies and procedures.

Telling Teachers Apart

Unlike typical teacher-evaluation systems, IMPACT creates substantial differentiation in ratings. In 2009–10 and 2010–11, 14 percent of teachers were rated “highly effective,” 69 percent of teachers were rated “effective,” 14 percent were judged “minimally effective,” and another 2 percent were deemed “ineffective.”

The system also sets swift consequences for teachers with very low or very high scores. Teachers rated “ineffective” are dismissed with rare exception, a rule that caused more than 200 teachers to exit DCPS in IMPACT’s first year. Those rated “minimally effective” are given one year to raise their score above the “effective” threshold; if they do not, they are dismissed.

On the other side of the spectrum, teachers are eligible for a bonus of up to $25,000 in each year they are rated “highly effective”; those with “highly effective” scores two or more years in a row can earn increases in their base pay of up to $27,000. Teachers working in high-poverty schools teaching high-need subjects are eligible for the largest pay increases.

These design features create sharp incentive contrasts for teachers with scores on either side of the “effective” threshold, because scoring above it removes the threat of dismissal. And they create an incentive for teachers with scores just below the “highly effective” threshold, because scoring above it makes them eligible for a significant increase in pay.

Our research on IMPACT sought to understand whether those incentives impact teachers’ performance and retention, and whether those impacts, if any, improve student outcomes. We based our analysis on administrative data on all DCPS teachers and their students over the first three years of IMPACT: 2009–10, 2010–11, and 2011–12. We limited our sample to general-education teachers working in schools serving K–12 students. We also drew on an additional year of data, from the 2012–13 school year, in assessing IMPACT’s effects on student achievement in tested grades and subjects.

Teacher Retention and Performance

DCPS teachers were retained differently depending on their IMPACT ratings. By the program’s third year, 95 percent of teachers deemed “ineffective” in the system’s first two years had been dismissed (see Figure 1). Overall, 3.8 percent of all teachers in the district were let go as a result of being rated “ineffective” once or after earning two consecutive “minimally effective” ratings under IMPACT between 2009–10 and 2011–12.

Differential Teacher Retention Under IMPACT (Figure 1)
Ninety-five percent of teachers identified as “ineffective” under IMPACT were dismissed. Among those identified as “minimally effective,” 14 percent were dismissed and 27 percent voluntarily left DCPS. Among “effective” and “highly effective” teachers, the majority stayed at DCPS.
[Stacked bar chart. Horizontal axis: percentage of teachers (0 to 100). Rows: Ineffective (2%), Minimally effective (14%), Effective (69%), Highly effective (14%). Segments: Stay, Transfer, Voluntary exit, Dismissed.]
NOTE: Percentages accompanying rating categories represent the percentage of DCPS teachers in each category. IMPACT ratings are based on performance during the 2009–10 and 2010–11 school years, with retention outcomes observed in the 2010–11 and 2011–12 school years. Units of observation are teacher years, so teachers may be observed more than once. An “other” retention category, which is always less than 2 percent of any IMPACT rating group, is omitted.
SOURCE: Authors’ calculations

Under IMPACT, lower-rated teachers were also far more likely to exit DCPS voluntarily. Among all teachers rated “minimally effective,” 27 percent voluntarily left the district, compared to 14 percent of teachers rated “effective” and 9 percent of teachers rated “highly effective.” “Minimally effective” teachers whose scores were closest to the “effective” threshold were less likely to leave than those with lower scores; about one in four teachers whose scores were within 25 points of the “effective” threshold chose to leave their jobs, compared to about one in three whose scores were more than 25 points below. In other words, teachers under threat of dismissal were more likely to voluntarily leave than teachers not subject to this threat, and those who scored furthest from the “effective” threshold were even more likely to go.

These patterns are consistent with a restructuring of the teacher workforce in response to the incentives embedded in IMPACT, but other explanations are possible. For example, some studies have found that less-effective early-career teachers are more likely to exit than more-effective novice teachers, even in the absence of high-stakes evaluations. In addition, IMPACT scores for teachers in their first two years of teaching average 17 points less than those of teachers with three or more years of experience. Such considerations raise doubts about how to interpret the relationship between IMPACT and retention. Are we observing the effects of incentives or merely behavior that would have occurred in the absence of IMPACT?

To isolate the causal impact of performance-based incentives on teacher quality, we compare outcomes among teachers whose initial IMPACT scores placed them near the thresholds between categories: “minimally effective/effective” and “effective/highly effective.” We assume that whether a teacher scores just above or just below a certain threshold is essentially random, which is supported both by our knowledge of how DCPS calculates IMPACT scores and by additional analyses confirming that teachers on either side of each threshold are quite similar in terms of experience and other characteristics. We look at teacher retention during the second year of IMPACT, when the threat of dismissal for “minimally effective” teachers became newly credible and financial incentives for “highly effective” teachers could be made permanent. We also examine performance among returning teachers in the system’s third year. By comparing teacher attrition and performance on each side of the performance cutoffs, we can get a better sense of how the threat of dismissal or prospect of a raise affects teachers’ behavior.

We first use this method to measure the effects on teachers scoring directly above or below the “minimally effective/effective” threshold in 2010–11. We find that teachers near the threshold who received their first “minimally effective” rating at this time were considerably more likely to exit DCPS voluntarily, with retention dropping by 10 percentage points in 2011–12 (see Figure 2). In other words, IMPACT’s “minimally effective” rating increased the attrition of lower-performing teachers from 20 percent to 30 percent, an increment of 50 percent.

Teachers Rated “Minimally Effective” More Likely to Leave DCPS (Figure 2)
At the threshold where IMPACT scores result in a rating of “minimally effective,” teacher retention drops by more than 10 percentage points.
[Chart. Vertical axis: probability of being retained in DCPS in 2011–12 (0.2 to 1). Horizontal axis: IMPACT score in 2010–11 (−75 to 100). Series: teachers rated minimally effective; teachers rated effective.]
SOURCE: Authors’ calculations

Meanwhile, the teachers in this “minimally effective” group who did return to DCPS the following school year improved their performance by roughly 11 points on the IMPACT scale. This sharp increase in teacher performance is about 0.27 of a standard deviation (see Figure 3) and suggests that previously low-performing teachers who opted to remain on the job undertook successful efforts to improve.

Notably, the effects of a “minimally effective” rating on retention and performance occurred at the end of IMPACT’s second year, when the political credibility of the reform had been affirmed by the appointment of Kaya Henderson as chancellor and by the first instance in which teachers (roughly 140) were fired for having two consecutive “minimally effective” ratings. In contrast, following its first year of implementation, IMPACT had no statistically detectable effect on teacher retention or performance. Given the contentiousness of these policies and the political volatility associated with Mayor Fenty’s loss and Chancellor Rhee’s resignation, teachers may have reasonably doubted the staying power of IMPACT’s performance-based dismissal threats, opting not to respond to its incentives. We are just speculating regarding the cause, but the important lesson is that policies intended to induce significant behavioral responses may need time to take hold.

In contrast, the financial incentives for consistently high-performing teachers (i.e., jumping ahead on the salary schedule) appear to have been immediately credible. Among highly rated teachers who scored very close to the eligibility threshold for a permanent pay increase, retention increased by roughly 3 percentage points, though this effect was not statistically significant. Their performance in 2011–12 improved by a statistically significant 10.9 points, or 0.24 of a standard deviation.

Teacher Performance Boost (Figure 3)
A “minimally effective” rating in IMPACT’s first year had small and statistically insignificant effects on subsequent teacher performance (relative to an “effective” rating). However, in IMPACT’s second year, teachers rated “minimally effective” who chose to remain in DCPS improved their performance by 0.27 standard deviations compared with teachers rated “effective.” Meanwhile, observed in IMPACT’s first year, the financial incentives available to teachers on the “highly effective” side of the threshold improved subsequent teacher performance by 0.24 standard deviations.
[Bar chart: IMPACT’s effect on teacher performance, in standard deviations. Minimally effective teachers: first-year effect −0.06, second-year effect 0.27*. Highly effective teachers: first-year effect 0.24**.]
* Statistically significant at the 95% confidence level
** Statistically significant at the 99% confidence level
NOTE: First-year effects are the 2009–10 IMPACT rating effects on outcomes observed in 2010–11, and second-year effects are the 2010–11 IMPACT rating effects on outcomes observed in 2011–12. This analysis includes only the first-year effects to assess the effect of receiving a rating of “highly effective” for the first time, which triggered a bonus of up to $25,000. Including teachers rated “highly effective” in the second year could also include teachers receiving a permanent salary boost triggered by repeated high ratings. However, including data from IMPACT’s second year leaves our results qualitatively unchanged.
SOURCE: Authors’ calculations

These performance gains imply increases of approximately 5 percentile points in the distribution of teacher performance among lower-rated teachers, and 7 percentile points among highly rated teachers. Comparing these figures to the typical improvement of a novice teacher during her first three years in the classroom, we find that the gain of 12.6 points for lower-rated teachers is 52 percent of this three-year gain, and the gain of 10.9 points for highly rated teachers is 41 percent of the three-year gain.
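
The threshold comparison described in this section, in which teachers just below a rating cutoff are compared with teachers just above it, is the kind of analysis usually called a regression-discontinuity comparison. The sketch below illustrates the idea only and is not the authors' estimation code: it fits a separate straight line to the outcome on each side of a cutoff and reports the gap between the two lines at the cutoff. The function names, the 250-point cutoff, the 25-point bandwidth, and the retention data are all hypothetical; the synthetic data are constructed to echo the reported drop in retention from about 80 percent to about 70 percent.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def threshold_gap(scores, outcomes, cutoff, bandwidth):
    """Estimate the jump in the outcome at the cutoff (above minus below).

    Scores are re-centered on the cutoff, so each fitted intercept is the
    predicted outcome exactly at the cutoff on that side.
    """
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff <= s <= cutoff + bandwidth]
    intercept_below, _ = fit_line(*zip(*below))
    intercept_above, _ = fit_line(*zip(*above))
    return intercept_above - intercept_below


# Synthetic, illustrative data (NOT the DCPS data). The cutoff value is a
# hypothetical stand-in for the "effective" threshold.
CUTOFF = 250
scores = list(range(200, 301))
# Retention probability rises gently with the score and steps up by
# 10 percentage points for teachers at or above the cutoff.
retained = [(0.80 if s >= CUTOFF else 0.70) + 0.001 * (s - CUTOFF)
            for s in scores]

gap = threshold_gap(scores, retained, cutoff=CUTOFF, bandwidth=25)
print(f"Estimated jump in retention at the cutoff: {gap:.2f}")
```

Because only teachers within the bandwidth are used, the comparison rests on the assumption stated above: that landing just above or just below the cutoff is essentially random. A real analysis would also need standard errors and checks that the result is not sensitive to the choice of bandwidth.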

Teacher Turnover and Student Achievement

IMPACT’s effects also depend on the direct impact of teacher turnover and the quality of newly hired teachers. Teacher turnover is often assumed to have a universally negative influence on school quality, and replacing teachers in schools with high rates of turnover can place strong demands on district recruitment efforts. Not all turnover is the same, however. Under IMPACT, a substantial fraction of teacher turnover consists of lower-performing teachers who were purposefully compelled or encouraged to leave, which potentially alters the distribution of teacher effectiveness among exiting teachers.

To determine the effect of teacher turnover on student achievement under IMPACT, we examine the year-to-year changes in school-grade combinations with and without teacher turnover. In other words, what was the change in test scores for 4th graders from year to year at a school that had teacher turnover in that grade compared to the change in test scores between 4th graders at a school that did not have teacher turnover in that grade?

We find that teacher turnover in DCPS, taken as a whole, at worst had no adverse effect on student achievement and, under reasonable assumptions, improved it. This average combines the negative, but statistically insignificant, effects of exits of high-performing teachers with the very large improvements in student achievement resulting from the departures of low-performing teachers. Figure 4 illustrates these patterns by comparing the average value-added scores of new DCPS teachers entering the district with those who had exited the previous year.

Tracking Teacher Quality (Figure 4)
Teacher turnover under IMPACT improved teacher quality. Although high-performing teachers who left DCPS were replaced by somewhat less-effective teachers, on average, when low-performing teachers left DCPS, they were replaced by much more effective teachers, on average.
[Bar chart: average value-added of exiting and entering teachers. Vertical axis: IMPACT individual value-added points (0 to 3.5). Groups: all teachers, high-performing teachers, and low-performing teachers, each shown for 2011, 2012, and 2013. Series: exiting teachers; entering teachers.]
NOTE: Results for 2011, for example, indicate the average score for all teachers who exited at the end of 2009–10 compared with all those entering in 2010–11. Exits include teachers who retired, resigned, or were terminated. Teachers leaving schools that closed are excluded. Average teacher value-added was converted to a 0–4 scale using a conversion table. Data are for teachers who are matched to students with math achievement scores.
SOURCE: Authors’ calculations

Looking directly at impacts on student achievement, and carefully controlling for a variety of potentially confounding factors, we find results quite consistent with these averages (see Figure 5). For example, turnover among low-performing teachers leads to improved student achievement in both math and reading. In math, the exit of low-performing teachers is estimated to improve student achievement by 0.21 of a standard deviation, or between one-third and two-thirds of a year of learning, depending on grade level. In reading, student achievement is estimated to increase by 0.14 of a standard deviation.

We also compare differences in turnover and their impact on student achievement at high- and low-poverty schools. Overall, we find that high-poverty schools appear to improve as a result of teacher turnover, though as in all schools, not all turnover is the same.

In high-poverty schools, we estimate that the overall effect of all teacher turnover on student achievement is 0.08 of a standard deviation in math and 0.05 of a standard deviation in reading. In comparison, in low-poverty schools, the estimated effects of turnover are close to zero.

However, 40 percent of teacher turnover in high-poverty schools is among low-performing teachers, and our estimates indicate consistently large gains for students when these teachers exit these schools. In math, student achievement improves by 0.21 of a standard deviation, with teacher quality improving by 1.3 standard deviations. In reading, student achievement improves by 0.14 of a standard deviation, with an improvement of 1.0 standard deviations in teacher quality.

Achievement Gains from Turnover (Figure 5)
(5a) The exit of DCPS teachers under IMPACT improved student achievement in both math and reading. The turnover of high-performing teachers had a small, negative, but statistically insignificant effect on achievement in both subjects, while the departure of low-performing teachers substantially increased student achievement, especially in math but also in reading.
[Bar chart: impact of teacher turnover on student achievement, in standard deviations. Groups: all exits, exits of high-performing teachers, exits of low-performing teachers. Series: math; reading.]
(5b) The improvement in student achievement from teacher turnover under IMPACT was primarily concentrated at high-poverty schools.
[Bar chart: impact of teacher turnover on student achievement, by school poverty, in standard deviations. High-poverty schools: all exits, exits of high-performing teachers, exits of low-performing teachers. Low-poverty schools: all exits, exits of high-performing teachers. Series: math; reading.]
† Statistically significant at the 90% confidence level
* Statistically significant at the 95% confidence level
** Statistically significant at the 99% confidence level
NOTE: “High-poverty” schools are those where the percentage of students eligible for free and reduced price lunch is at least 60 percent. “High-performing” teachers are those rated effective or highly effective under IMPACT; “low-performing” teachers are those rated ineffective or minimally effective under IMPACT. Data are insufficient to generate estimates for the effect of low-performer exits in low-poverty schools.
SOURCE: Authors’ calculations

In sum, DCPS has been able to replace low-performing teachers in high-poverty schools with teachers who are substantially more effective. DCPS also appears to be quite capable of replacing exiting high-performing teachers in low-poverty schools with comparable teachers.


With the exception of math achievement in one year (2011–12), however, the effect of this turnover on student achievement is not statistically significant.

Implications

The high stakes associated with IMPACT have been a source of contention, both within the District of Columbia and in broader discussions of education policy. While the importance of highly effective teachers is uncontroversial, the manner in which schools and districts identify, reward, and retain these educators is.

Our studies provide clear evidence that IMPACT’s exceptionally high-powered, individually targeted incentives linked to performance influence retention and performance in desirable ways. This multiple-measures system boosts performance among teachers most immediately facing consequences for their ratings, and promotes higher rates of turnover among the lowest-performing teachers, with positive consequences for student achievement. Importantly, more than 90 percent of the turnover of low-performing teachers occurs in high-poverty schools, which constitute 75 percent of all schools. The proportion of exiting teachers who are low performers is twice as high in these high-poverty schools as in low-poverty schools. In comparison with almost any other intervention, these are very large improvements for some of the school district’s neediest students.

We do not claim that IMPACT caused all of the teacher turnover we observe. Although IMPACT did result in some teachers being dismissed, some attrition from teaching in DCPS as the system was implemented was surely voluntary. Nor do we know whether our results showing that teacher turnover benefited students in tested grades and subjects generalize to turnover for other teachers.

Policymakers looking to build on the DCPS model should also understand the implementation challenges and the potential for errors. Any teacher-evaluation system will make some number of objectionable mistakes in assigning ratings and consequences, and minimizing those instances will require expensive, sophisticated tools. Policymakers must weigh these costs against the substantial educational and economic benefits such systems can create for successive cohorts of students, both through avoiding the career-long retention of the lowest-performing teachers and through broad increases in performance in the overall teaching workforce.

They should also expect continual change and improvement. DCPS has made modifications to IMPACT nearly every year, such as reducing the number of teacher observations, shifting access to bonus and base-pay increases toward high-poverty schools, and reducing the weight of value-added performance measures. Most recently, the district announced it would rely on school principals for all observations and incorporate student-survey results in teachers’ evaluations. DCPS also introduced LEAP, an intensive professional development program to help teachers improve their skills.

The challenge of improving the composition of teachers in DCPS is increasing. As the least-effective teachers exit, there are fewer such teachers who will leave over time. Whether DCPS can reap further performance benefits from compositional change in its workforce as it increases performance standards remains to be seen.

Regardless, our results indicate that, under a robust system of performance evaluation, the turnover of teachers can generate meaningful gains in student outcomes, particularly for the most disadvantaged students. The eight-year history of IMPACT shows that such efforts may incur political consequences, but are not politically impossible.

Thomas Dee is a professor of education and director of the Center for Education Policy Analysis at , and a research associate at the National Bureau of Economic Research. James Wyckoff is the Curry Memorial Professor of Education and Policy and director of EdPolicyWorks at the . This article is based on research published in the March 2017 issue of Educational Evaluation and Policy Analysis and in the March 2015 issue of the Journal of Policy Analysis and Management.
