
ROTTEN APPLES: AN INVESTIGATION OF THE PREVALENCE AND PREDICTORS OF TEACHER CHEATING*

BRIAN A. JACOB AND STEVEN D. LEVITT

We develop an algorithm for detecting teacher cheating that combines information on unexpected test score fluctuations and suspicious patterns of answers for students in a classroom. Using data from the Chicago public schools, we estimate that serious cases of teacher or administrator cheating on standardized tests occur in a minimum of 4–5 percent of elementary school classrooms annually. The observed frequency of cheating appears to respond strongly to relatively minor changes in incentives. Our results highlight the fact that high-powered incentive systems, especially those with bright line rules, may induce unexpected behavioral distortions such as cheating. Statistical analysis, however, may provide a means of detecting illicit acts, despite the best attempts of perpetrators to keep them clandestine.

I. INTRODUCTION

High-powered incentive schemes are designed to align the behavior of agents with the interests of the principal implementing the system. A shortcoming of such schemes, however, is that they are likely to induce behavior distortions along other dimensions as agents seek to game the rules (see, for instance, Holmstrom and Milgrom [1991] and Baker [1992]). The distortions may be particularly pronounced in systems with bright line rules [Glaeser and Shleifer 2001]. It may be impossible to anticipate the many ways in which a particular incentive scheme may be gamed.

Test-based accountability systems in education provide an excellent example of the costs and benefits of high-powered incentive schemes. In an effort to improve student achievement, a number of states and districts have recently implemented programs that use student test scores to punish or reward schools.

* We would like to thank Suzanne Cooper, Mark Duggan, Susan Dynarski, Arne Duncan, Michael Greenstone, James Heckman, Lars Lefgren, two anonymous referees, the editor, Edward Glaeser, and seminar participants too numerous to mention for helpful comments and discussions. We also thank Arne Duncan, Philip Hansen, Carol Perlman, and Jessie Qualles of the Chicago public schools for their help and cooperation on the project. Financial support was provided by the National Science Foundation and the Sloan Foundation. All remaining errors are our own. Addresses: Brian Jacob, Kennedy School of Government, 79 JFK Street, Cambridge, MA 02138; Department of , , 1126 E. 59th Street, Chicago, IL 60637.

© 2003 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, August 2003


Recent federal legislation institutionalizes this practice, requiring states to test elementary students each year, rate schools on the basis of student performance, and intervene in schools that do not make sufficient improvement.1 Several prior studies suggest that such accountability policies may be effective at raising student achievement [Richards and Sheu 1992; Grissmer, Flanagan, et al. 2000; Deere and Strayer 2001; Jacob 2002; Carnoy and Loeb 2002; Hanushek and Raymond 2002]. At the same time, however, researchers have documented instances of manipulation, including documented shifts away from nontested areas or "teaching to the test" [Klein et al. 2002; Jacob 2002], and increasing placement in special education [Jacob 2002; Figlio and Getzler 2002; Cullen and Reback 2002].

In this paper we explore a very different mechanism for inflating test scores: outright cheating on the part of teachers and administrators.2 As incentives for high test scores increase, unscrupulous teachers may be more likely to engage in a range of illicit activities, including changing student responses on answer sheets, providing correct answers to students, or obtaining copies of an exam illegitimately prior to the test date and teaching students using knowledge of the precise exam questions.3 While such allegations may seem far-fetched, documented cases of cheating have recently been uncovered in California [May 2000], Massachusetts [Marcus 2000], New York [Loughran and Comiskey 1999], Texas [Kolker 1999], and Great Britain [Hofkins 1995; Tysome 1994].

There has been very little previous empirical analysis of teacher cheating.4 The few studies that do exist involve investigations of specific instances of cheating and generally rely on the analysis of erasure patterns and the controlled retesting of students.5 While this earlier research provides convincing evidence of isolated cheating incidents, our paper represents the first systematic attempt to (1) identify the overall prevalence of teacher cheating empirically and (2) analyze the factors that predict cheating. To address these questions, we use detailed administrative data from the Chicago public schools (CPS) that includes the question-by-question answers given by every student in grades 3 to 8 who took the Iowa Test of Basic Skills (ITBS) from 1993 to 2000.6 In addition to the test responses, we also have access to each student's full academic record, including past test scores, the school and room to which a student was assigned, and extensive demographic and socioeconomic characteristics.

1. The federal legislation, No Child Left Behind, was passed in 2001. Prior to this legislation, virtually every state had linked test-score outcomes to school funding or required students to pass an exit examination to graduate high school. In the state of California, a policy providing for merit pay bonuses of as much as $25,000 per teacher in schools with large test score gains was recently put into place.

2. Hereinafter, we use the phrase "teacher cheating" to encompass cheating done by either teachers or administrators.

3. We have no way of knowing whether the patterns we observe arise because a teacher explicitly alters students' answer sheets, directly provides answers to students during a test, or perhaps makes test materials available to students in advance of the exam (for instance, by teaching a reading passage that is on the test). If we had access to the actual exams, it might be possible to distinguish between these scenarios through an analysis of erasure patterns.

4. In contrast, there is a well-developed statistics literature for identifying whether one student has copied answers from another student [Wollack 1997; Holland 1996; Frary 1993; Bellezza and Bellezza 1989; Frary, Tideman, and Watts 1977; Angoff 1974]. These methods involve the identification of unusual patterns of agreement in student responses and, for the most part, are only effective in identifying the most egregious cases of copying.

5. In the mid-eighties, Perlman [1985] investigated suspected cheating in a number of Chicago public schools (CPS). The study included 23 suspect schools (identified on the basis of a high percentage of erasures, unusual patterns of score increases, unnecessarily large orders of blank answer sheets for the ITBS, and tips to the CPS Office of Research) along with 17 comparison schools. When a second form of the test was administered to the 40 schools under more controlled conditions, the suspect schools did much worse than the comparison schools. An analysis of several dozen Los Angeles schools where the percentage of erasures and changed answers was unusually high revealed evidence of teacher cheating [Aiken 1991]. One of the most highly publicized cheating scandals involved Stratfield elementary, an award-winning school in Connecticut. In 1996 the firm that developed and scored the exam found that the rate of erasures at Stratfield was up to five times greater than other schools in the same district and that 89 percent of erasures at Stratfield were from an incorrect to a correct response. Subsequent retesting resulted in significantly lower scores [Lindsay 1996].

6. We do not, however, have access to the actual test forms that students filled out, so we are unable to analyze these tests for evidence of suspicious patterns of erasures.

Our approach to detecting classroom cheating combines two types of indicators: unexpected test score fluctuations and unusual patterns of answers for students within a classroom. Teacher cheating increases the likelihood that students in a classroom will experience large, unexpected increases in test scores one year, followed by very small test score gains (or even declines) the following year. Teacher cheating, especially if done in an unsophisticated manner, is also likely to leave tell-tale signs in the form of blocks of identical answers, unusual patterns of correlations across student answers within the classroom, or unusual response patterns within a student's exam (e.g., a student who answers a number of very difficult questions correctly while missing many simple questions). Our identification strategy exploits the fact that these two types of indicators are very weakly correlated in classrooms unlikely to have cheated, but very highly correlated in situations where cheating likely occurred. That allows us to credibly estimate the prevalence of cheating without having to invoke arbitrary cutoffs as to what constitutes cheating.

Empirically, we detect cheating in approximately 4 to 5 percent of the classes in our sample.
This estimate is likely to understate the true incidence of cheating for two reasons. First, we focus only on the most egregious type of cheating, where teachers systematically alter student test forms. There are other more subtle ways in which teachers can cheat, such as providing extra time to students, that our algorithm is unlikely to detect. Second, even when test forms are altered, our approach is only partially successful in detecting illicit behavior. As discussed later, when we ourselves simulate cheating by altering student answer strings and then testing for cheating in the artificially manipulated classrooms, many instances of moderate cheating go undetected by our methods.

A number of patterns in the results reinforce our confidence that what we measure is indeed cheating. First, simulation results demonstrate that there is nothing mechanical about our identification approach that automatically generates patterns like those observed in the data. When we randomly assign students to classrooms and search for cheating in these simulated classes, our methods find little evidence of cheating. Second, cheating on one part of the test (e.g., math) is a strong predictor of cheating on other sections of the test (e.g., reading). Third, cheating is also correlated within classrooms over time and across classrooms in a particular school. Finally, and perhaps most convincingly, with the cooperation of the Chicago public schools we were able to conduct a prospective test of our methods in which we retested a subset of classrooms under controlled conditions that precluded teacher cheating. Classrooms identified as likely cheaters experienced large declines in test scores on the retest, whereas classrooms not suspected of cheating maintained their test score gains.
The prevalence of cheating is also shown to respond to relatively minor changes in teacher incentives. The importance of standardized tests in the Chicago public schools increased substantially with a change in leadership in 1996, particularly for low-achieving students. Following the introduction of these policies, the prevalence of cheating rose sharply in low-achieving classrooms, whereas classes with average or higher-achieving students showed no increase in cheating. Cheating prevalence also appears to be systematically lower in cases where the costs of cheating are higher (e.g., in mixed-grade classrooms in which two different exams are administered simultaneously), or the benefits of cheating are lower (e.g., in classrooms with more special education or bilingual students who take the standardized tests, but whose scores are excluded from official calculations).

The remainder of the paper is structured as follows. Section II discusses the set of indicators we use to capture cheating behavior. Section III describes our identification strategy. Section IV provides a brief overview of the institutional details of the Chicago public schools and the data set that we use. Section V reports the basic empirical results on the prevalence of cheating and presents a wide range of evidence supporting the interpretation of these results as cheating. Section VI analyzes how teacher cheating responds to incentives. Section VII discusses the results and the implications for increasing reliance on high-stakes testing.

II. INDICATORS OF TEACHER CHEATING

Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs. In motivating our discussion of the indicators we employ for detecting cheating, it is instructive to compare two actual classrooms taking the same test (see Figure I). Each row in Figure I represents one student's answers to each item on the test. Columns correspond to the different questions asked. The letter "A," "B," "C," or "D" means a student provided the correct answer. If a number is entered, the student answered the question incorrectly, with "1" corresponding to a wrong answer of "A," "2" corresponding to a wrong answer of "B," etc. On the right-hand side of the table, we also present student test scores for the preceding, current, and following year. Test scores are in units of "grade equivalents." The grading scale is normed so that a student at the national average for sixth graders taking the test in the eighth month of the school year would score 6.8. A typical student would be expected to gain one grade equivalent for each year of school.

FIGURE I
Sample Answer Strings and Test Scores from Two Classrooms
The data in the figure represent actual answer strings and test scores from two CPS classrooms taking the same exam. The top classroom is suspected of cheating; the bottom classroom is not. Each row corresponds to an individual student. Each column represents a particular question on the exam. A letter indicates that the student gave that answer and the answer was correct. A number means that the student gave the corresponding letter answer (e.g., 1 = "A"), but the answer was incorrect. A value of "0" means the question was left blank. Student test scores, in grade equivalents, are shown in the last three columns of the figure. The test year for which the answer strings are presented is denoted year t. The scores from years t - 1 and t + 1 correspond to the preceding and following years' examinations.

The top panel of data shows a class in which we suspect teacher cheating took place; the bottom panel corresponds to a typical classroom. Two striking differences between the classrooms are readily apparent. First, in the cheating classroom, a large block of students provided identical answers on consecutive questions in the middle of the test, as indicated by the boxed area in the figure. For the other classroom, no such pattern exists. Second, looking at the pattern of test scores, students in the cheating classroom experienced large increases in test scores from the previous to the current year (1.7 grade equivalents on average), and actually experienced declines on average the following year. In contrast, the students in the typical classroom gained roughly one grade equivalent each year, as would be expected.

The indicators we use as evidence of cheating formalize and extend the basic picture that emerges from Figure I. We divide these indicators into two distinct groups that, respectively, capture unusual test score fluctuations and suspicious patterns of answer strings. In this section we describe informally the measures that we use. A more rigorous treatment of their construction is provided in the Appendix.

II.A. Indicator One: Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores, one signal of teacher cheating is an unusually large gain in test scores relative to how those same students tested in the previous year. Since test score gains that result from cheating do not represent real gains in knowledge, there is no reason to expect the gains to be sustained on future exams taken by these students (unless, of course, next year's teachers also cheat on behalf of the students). Thus, large gains due to cheating should be followed by unusually small test score gains for these students in the following year. In contrast, if large test score gains are due to a talented teacher, the student gains are likely to have a greater permanent component, even if some regression to the mean occurs. We construct a summary measure of how unusual the test score fluctuations are by ranking each classroom's average test score gains relative to all other classrooms in that same subject, grade, and year,7 and computing the following statistic:

(3)   SCORE_cbt = (rank_gain_c,b,t)^2 + (1 - rank_gain_c,b,t+1)^2,

where rank_gain_c,b,t is the percentile rank for class c in subject b in year t. Classes with relatively big gains on this year's test and relatively small gains on next year's test will have high values of SCORE. Squaring the individual terms gives relatively more weight to big test score gains this year and big test score declines the following year.8 In the empirical analysis we consider three possible cutoffs for what it means to have a "high" value on SCORE, corresponding to the eightieth, ninetieth, and ninety-fifth percentiles among all classrooms in the sample.

7. We also experimented with more complicated mechanisms for defining large or small test score gains (e.g., predicting each student's expected test score gain as a function of past test scores and background characteristics and computing a deviation measure for each student which was then aggregated to the classroom level), but because the results were similar we elected to use the simpler method. We have also defined gains and losses using an absolute metric (e.g., where gains in excess of 1.5 or 2 grade equivalents are considered unusually large), and obtain comparable results.

8. In the following year the students who were in a particular classroom are typically scattered across multiple classrooms. We base all calculations off of the composition of this year's class.

II.B. Indicator Two: Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alter the same block of consecutive questions for a substantial portion of students in the class, as was apparently done in the classroom in the top panel of Figure I. More sophisticated interventions might involve skipping some questions so as to avoid a large block of identical answers, or altering different blocks of questions for different students.

We combine four different measures of how suspicious a classroom's answer strings are in determining whether a classroom may be cheating. The first measure focuses on the most unlikely block of identical answers given by students on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each possible answer (A, B, C, or D) on every question using a multinomial logit. Each student's predicted probability of choosing a particular response is identified by the likelihood that other students (in the same year, grade, and subject) with similar background characteristics will choose that response. We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9 The more unusual is the most unusual block of test responses (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.

9. Note that we do not require the answers to be correct. Indeed, in many classrooms, the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly, but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures:10

(4)   ANSWERS_cbt = (rank_m1_c,b,t)^2 + (rank_m2_c,b,t)^2 + (rank_m3_c,b,t)^2 + (rank_m4_c,b,t)^2.

In the empirical work, we again use three possible cutoffs for potential cheating: eightieth, ninetieth, and ninety-fifth percentiles.
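For concreteness, the aggregation in equations (3) and (4) can be sketched in a few lines of Python. This is our own illustration, not the authors' code: the function names and toy data are invented, and real use would compute ranks within each subject-grade-year cell.

```python
import numpy as np

def percentile_rank(x):
    """Percentile rank in (0, 1] of each value among its peers.

    In the paper's setting, ranks are taken within a subject-grade-year
    cell; here we simply rank within the array given."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1
    return ranks / len(x)

def score_statistic(gain_t, gain_t_plus_1):
    """Equation (3): SCORE = rank_gain_t^2 + (1 - rank_gain_{t+1})^2.

    Large gains this year followed by small gains (or declines)
    next year yield high values."""
    return percentile_rank(gain_t) ** 2 + (1 - percentile_rank(gain_t_plus_1)) ** 2

def answers_statistic(m1, m2, m3, m4):
    """Equation (4): sum of squared percentile ranks of the four
    suspicious-answer-string indicators."""
    return sum(percentile_rank(m) ** 2 for m in (m1, m2, m3, m4))

# Toy cell of five classrooms: the fourth has a big gain in year t
# followed by a decline in year t+1, so it receives the highest SCORE.
gain_t = np.array([0.9, 1.0, 1.1, 2.5, 0.8])
gain_t1 = np.array([1.0, 0.9, 1.1, -0.4, 1.0])
score = score_statistic(gain_t, gain_t1)
print(score.argmax())  # index 3, the suspicious classroom
```

Squaring the ranks, as in the paper, puts most of the weight on classrooms that sit at the extremes of both distributions.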

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because sometimes such patterns arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

To identify the number of cheating classrooms in a given year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings, (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer strings patterns would be identical to noncheating classrooms, and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12 The graph combines all classrooms and all subjects in our data.13 Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings
The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
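The tail-projection logic lends itself to simulation. The sketch below is our own simplified linear version (the paper fits a probit with seventh-order polynomials); the function names, cutoffs, and synthetic data are illustrative assumptions, not the authors' specification.

```python
import numpy as np

def estimate_prevalence(answers_rank, high_score, fit_lo=0.50, fit_hi=0.75, tail=0.80):
    """Estimate cheating prevalence by projecting the mid-distribution
    relationship into the right tail.

    answers_rank : percentile rank (0-1) of each class on the suspicious-strings measure
    high_score   : boolean, class is above the 95th percentile of the SCORE measure
    """
    mid = (answers_rank >= fit_lo) & (answers_rank <= fit_hi)
    # Linear fit of P(high SCORE | rank) on the part of the distribution
    # thought to contain few cheaters.
    slope, intercept = np.polyfit(answers_rank[mid], high_score[mid].astype(float), 1)
    in_tail = answers_rank > tail
    predicted = slope * answers_rank[in_tail] + intercept
    observed = high_score[in_tail].astype(float)
    # Excess classes in the tail that are extreme on BOTH measures,
    # expressed as a share of all classes.
    return max(0.0, (observed - predicted).sum()) / len(answers_rank)

# Synthetic data: 10,000 honest classes with no relationship between the
# two measures, plus 500 "cheaters" pushed into the tail of both.
rng = np.random.default_rng(1)
n = 10000
rank = rng.uniform(0, 1, n)
high = rng.uniform(0, 1, n) < 0.05            # 5 percent baseline, rank-independent
cheat_rank = rng.uniform(0.95, 1.0, 500)      # cheaters: suspicious strings...
cheat_high = rng.uniform(0, 1, 500) < 0.8     # ...and usually big score swings
rank = np.concatenate([rank, cheat_rank])
high = np.concatenate([high, cheat_high])
print(estimate_prevalence(rank, high))        # close to 500 * 0.75 / 10500
```

Because the planted cheaters sit above the fitting window, they do not contaminate the counterfactual fit, which is the point of assumption (3).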

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized, multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national, norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993–2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables including age, sex, race, and free lunch eligibility. We also have information about a wide range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students (presumably multiple classrooms not separately identified in the data) are also dropped. Our final data set contains roughly 20,000 students per grade per year distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.

14. There are also other parts of the test which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year and the ITBS test is not administered to ninth graders.
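The sample restrictions described above can be expressed as a short filtering step. A sketch using pandas; the column names (`class_id`, `score_prev`, `score_curr`, `score_next`) are hypothetical stand-ins, not the actual CPS variable names.

```python
import pandas as pd

def build_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the sample restrictions to student-year records.

    Hypothetical columns: 'class_id', plus test scores for the
    preceding, baseline, and following years."""
    # Drop students missing scores in the baseline, preceding,
    # or following year.
    df = df.dropna(subset=["score_prev", "score_curr", "score_next"])
    # Keep classrooms with 10-40 valid students; more than 40 likely
    # indicates multiple classrooms sharing one recorded identifier.
    sizes = df.groupby("class_id")["score_curr"].transform("size")
    return df[(sizes >= 10) & (sizes <= 40)]
```

A full replication would also drop the small share of students with missing demographic data before counting valid students per classroom.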
Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is at the level of class*subject*year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black, and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams. The exams are generally delivered to the schools one to two weeks before testing, and are supposed to be

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.

TABLE I
SUMMARY STATISTICS

                                                                Mean (sd)

Classroom characteristics
  Mixed grade classroom                                           0.073
  Teacher administers exams to her own students (3rd grade)       0.206
  Percent of students who were tested and included
    in official reporting                                         0.883
  Average prior achievement                                      -0.004
    (as deviation from year*grade*subject mean)                  (0.661)
  % Black                                                         0.595
  % Hispanic                                                      0.263
  % Male                                                          0.495
  % Old for grade                                                 0.086
  % Living in foster care                                         0.044
  % Living with nonparental relative                              0.104
  Cheater (95th percentile cutoff)                                0.013

School characteristics
  Average quality of teachers' undergraduate institution         -2.550
    in the school                                                (0.877)
  Percent of teachers who live in Chicago                         0.712
  Percent of teachers who have an MA or a PhD                     0.475
  Percent of teachers who majored in education                    0.712
  Percent of teachers under 30 years of age                       0.114
  Percent of teachers at the school less than 3 years             0.547
  % students at national norms in reading last year               28.8
  % students receiving free lunch in school                       84.7
  Predominantly Black school                                      0.522
  Predominantly Hispanic school                                   0.205
  Mobility rate in school                                         28.6
  Attendance rate in school                                       92.6
  School size                                                     722
                                                                 (317)

Accountability policy
  Social promotion policy                                         0.215
  School probation policy                                         0.127
  Test form offered for the first time                            0.371

Number of observations                                          163,474

The data summarized above include one observation per classroom*year*subject. The sample includes all students in third to seventh grade for the years 1993–2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.

kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator). Each section of the exam consists of 30 to 60 multiple choice questions which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18
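The scoring rule just described (a raw score equal to the count of correct responses, with no guessing penalty, then a grade-equivalent lookup) can be sketched in a few lines. The grade-equivalent table below is purely illustrative; the actual ITBS norming tables are not reproduced in the paper.

```python
def raw_score(answers, key):
    """Raw score = number of correct responses; wrong answers and
    blanks count zero, so there is no penalty for guessing."""
    return sum(1 for a, k in zip(answers, key) if a == k)

# Purely illustrative grade-equivalent mapping (hypothetical values).
GRADE_EQUIVALENT = {0: 1.0, 1: 1.4, 2: 1.8, 3: 2.3, 4: 2.9}

def grade_equivalent(raw):
    """Translate a raw score into a grade equivalent via a norming table."""
    return GRADE_EQUIVALENT[raw]
```

The absence of a guessing penalty is what makes the blank-filling behavior discussed later (footnote 19) strictly score-increasing.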

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th–75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating on average on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19 Because

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so that it is likely that over the years, teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms, almost every student will end the test with a long string of identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do

TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                     Cutoff for test score fluctuations (SCORE):
Cutoff for suspicious answer
strings (ANSWERS)                 80th percentile   90th percentile   95th percentile

Percent cheating on a particular test
  80th percentile                       2.1               2.1               1.8
  90th percentile                       1.8               1.8               1.5
  95th percentile                       1.3               1.3               1.1

Percent cheating on at least one of the four tests given
  80th percentile                       4.5               5.6               5.3
  90th percentile                       4.2               4.9               4.4
  95th percentile                       3.5               3.8               3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd–7th grade classrooms in the Chicago public schools for the years 1993–1999.

the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 × 3 matrix of estimates using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs, the range of estimates is relatively tight.

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so only on one subject test, then the results in the bottom panel

so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds it is categorically different than a teacher who systematically changes student responses to the correct answer.

would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4–5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21
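The excess-mass logic behind Table II can be sketched as follows. This is our own minimal reconstruction, not the authors' code: `answers` and `score` hold each classroom's percentile rank on the two indicators, and the expected number of jointly high classrooms is projected from the rate of high SCORE values among classes in the 50th–75th percentiles of ANSWERS.

```python
# Hedged sketch of the excess-mass prevalence calculation; variable
# names and the exact normalization are assumptions.

def estimated_cheating_share(answers, score, cutoff_pct=95.0):
    """answers, score: parallel lists of percentile ranks (0-100) per
    classroom on the suspicious-strings and score-fluctuation measures."""
    n = len(answers)
    # Classrooms above the cutoff on BOTH indicators.
    both_high = sum(1 for a, s in zip(answers, score)
                    if a >= cutoff_pct and s >= cutoff_pct)
    n_high_answers = sum(1 for a in answers if a >= cutoff_pct)
    # Baseline rate of high SCORE among classes in the 50th-75th
    # percentiles of ANSWERS, i.e., classes unlikely to have cheated.
    baseline = [s for a, s in zip(answers, score) if 50 <= a < 75]
    baseline_rate = sum(1 for s in baseline if s >= cutoff_pct) / len(baseline)
    # Excess mass: observed joint highs minus the count expected if high
    # SCORE values occurred at the baseline rate among suspicious classes.
    excess = both_high - baseline_rate * n_high_answers
    return max(excess, 0.0) / n
```

Varying `cutoff_pct` over the eightieth, ninetieth, and ninety-fifth percentiles reproduces the kind of 3 × 3 sensitivity grid shown in Table II.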

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.

A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating" and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence, we also simulate the impact of a good teacher, changing certain answers for certain students from incorrect to correct responses.
Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year; and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.

that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating: there is little evidence of false positives, but we frequently miss moderate cases of cheating.
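The "naive cheater" manipulation used in the simulations above can be sketched directly. The function below is our own illustration (names and data layout are assumed, not the authors' code); it produces exactly the block of identical, correct responses that the string-based detector looks for.

```python
# Hedged sketch of the naive-cheating manipulation: starting at the same
# question, change a run of consecutive answers to the correct response
# for some fraction of the class.
import copy

def inject_naive_cheating(class_answers, key, start, run_length, share):
    """Return a copy of class_answers (a list of per-student answer
    lists) with questions start..start+run_length-1 set to the answer
    key for the first `share` fraction of students."""
    altered = copy.deepcopy(class_answers)  # leave the originals intact
    n_students = max(1, int(round(share * len(altered))))
    for student in altered[:n_students]:
        for q in range(start, start + run_length):
            student[q] = key[q]  # block of identical, correct responses
    return altered
```

The sophisticated variant in the text would instead flip randomly chosen incorrect answers to correct ones per student, leaving no shared block for the answer-string measure to detect, which is why it is harder to catch.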

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                         Sample of classes and
                                                         schools that existed
                                         Full sample       in the prior year
Independent variables                   (1)       (2)       (3)       (4)

Classroom cheated on exactly one       0.105     0.103     0.101     0.101
  other subject this year             (0.008)   (0.008)   (0.009)   (0.009)
Classroom cheated on exactly two       0.289     0.285     0.243     0.243
  other subjects this year            (0.027)   (0.027)   (0.031)   (0.031)
Classroom cheated on all three         0.627     0.622     0.595     0.595
  other subjects this year            (0.051)   (0.051)   (0.054)   (0.054)
Cheating rate among all other            —       0.166     0.134     0.129
  classes in the school this year               (0.030)   (0.027)   (0.027)
  on this subject
Cheating rate among all other            —       0.023     0.059     0.045
  classes in the school this year               (0.024)   (0.026)   (0.029)
  on other subjects
Cheating in this classroom in this       —         —       0.096     0.091
  subject last year                                       (0.012)   (0.012)
Number of other subjects this            —         —       0.023     0.018
  classroom cheated on last year                          (0.004)   (0.004)
Cheating in this classroom ever          —         —         —       0.006
  in the past                                                       (0.002)
Cheating rate among other                —         —         —       0.090
  classrooms in this school in                                      (0.040)
  past years

Fixed effects for grade*subject*year    Yes       Yes       Yes       Yes
R2                                     0.090     0.093     0.109     0.109
Number of observations                165,578   165,578   94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom*grade*year*subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.

Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the enormity of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 1.66 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.
When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.
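A stripped-down version of the estimation in Table III, a bivariate linear probability model with standard errors clustered at the school level, can be sketched as below. This is a pedagogical reconstruction under our own assumptions (one regressor, CR0 clustering with no small-sample correction), not the authors' specification, which includes many regressors and grade*subject*year fixed effects.

```python
# Minimal linear probability model y = a + b*x with school-clustered
# standard errors.  Illustrative only; the paper's regressions are far
# richer than this.

def lpm_clustered(y, x, school):
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Sum the scores (x_i - xbar) * e_i within each school before
    # squaring, so that errors correlated within a school are not
    # treated as independent observations.
    cluster_scores = {}
    for xi, ei, g in zip(x, resid, school):
        cluster_scores[g] = cluster_scores.get(g, 0.0) + (xi - xbar) * ei
    var_b = sum(s ** 2 for s in cluster_scores.values()) / sxx ** 2
    return a, b, var_b ** 0.5  # intercept, slope, clustered SE of slope
```

With a 0/1 dependent variable, the slope `b` reads directly as a change in the probability of being flagged as a cheater, which is why the paper favors this functional form for interpretation.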

V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test; (ii) from the initial spring 2002 test to the retest a few weeks later; and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4).
On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.

TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                 Reading gains between ...     Math gains between ...
                               Spring    Spring    Spring    Spring    Spring    Spring
                               2001 &    2001 &    2002 &    2001 &    2001 &    2002 &
                               spring     2002      2002     spring     2002      2002
Category of classroom           2002     retest    retest     2002     retest    retest

All classrooms in the
  Chicago public schools        14.3       —         —        16.9       —         —
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math,
  N = 39 on reading)            28.8      12.6     -16.2      30.0      19.3     -10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating
  (N = 16 on math,
  N = 20 on reading)            16.6       7.8      -8.8      17.3       6.8     -10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat
  (N = 17 on math,
  N = 17 on reading)            20.6      21.1       0.5      28.8      25.5      -3.3
Randomly selected classrooms
  (N = 24 overall, but only
  one test per classroom)       14.5      12.2      -2.3      14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.

spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996 with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].

mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom*subject*grade*year. The dependent variable is an indicator of whether we suspect the classroom cheated using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial.
The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students, but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.

TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                            (1)        (2)        (3)        (4)

Social promotion policy                        0.0011     0.0011     0.0015     0.0023
                                              (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                        0.0020     0.0019     0.0021     0.0029
                                              (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                   -0.0047    -0.0028    -0.0016    -0.0028
                                              (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion*classroom achievement            —      -0.0049    -0.0051    -0.0046
                                                         (0.0014)   (0.0014)   (0.0012)
School probation*classroom achievement            —      -0.0070    -0.0070    -0.0064
                                                         (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                         -0.0084    -0.0085    -0.0089    -0.0089
                                              (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting   0.0252     0.0249     0.0141     0.0131
                                              (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students       0.0067     0.0067     0.0066     0.0061
                                              (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)      -0.0007    -0.0007    -0.0011       —
                                              (0.0011)   (0.0011)   (0.0010)
Average quality of teachers'                      —          —      -0.0026       —
  undergraduate institution                                         (0.0007)
Percent of teachers who have worked at            —          —      -0.0045       —
  the school less than 3 years                                      (0.0031)
Percent of teachers under 30 years of age         —          —       0.0156       —
                                                                    (0.0065)
Percent of students in the school meeting         —          —       0.0001       —
  national norms in reading last year                               (0.0000)
Percent free lunch in school                      —          —       0.0001       —
                                                                    (0.0000)
Predominantly Black school                        —          —       0.0068       —
                                                                    (0.0019)
Predominantly Hispanic school                     —          —      -0.0009       —
                                                                    (0.0016)

School*Year fixed effects                        No         No         No        Yes
Number of observations                        163,474    163,474    163,474    163,474

The unit of observation is classroom*grade*year*subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school*year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, grade, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and indicators of the percent of teachers in the school. Other variables include the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable will drop out of the analysis when school*year fixed effects are included.

adding a number of classroom and school characteristics (column (3)) as well as school*year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade.
This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.[27] Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.[28]

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms, teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.

VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.[29]

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly or even impossible to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item, a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

$$\Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \qquad (1)$$

where $Y_{isc}$ indicates the response of student $s$ in class $c$ on item $i$, the number of possible responses ($J$) is four, and $X_s$ is a vector that includes measures of prior and future student achievement in math and reading as well as demographic variables (such as race, gender, and free lunch status) for student $s$. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

$$\hat{p}_{isc} = \frac{e^{\hat{\beta}_k X_s}}{\sum_{j=1}^{J} e^{\hat{\beta}_j X_s}}, \qquad (2)$$

where $k$ is the response actually chosen by student $s$ on item $i$.
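Equations (1) and (2) can be sketched in a few lines of numpy. This is a minimal illustration rather than the paper's estimation code: it assumes a coefficient matrix `beta` (one row per answer choice) has already been estimated by a multinomial logit for the item, and the function and variable names are hypothetical.

```python
import numpy as np

def response_probabilities(beta, x):
    """Multinomial-logit choice probabilities for one item (equation (1)).

    beta : (J, p) array of fitted coefficients, one row per answer choice A-D.
    x    : (p,) vector of student covariates (prior/future scores, demographics).
    Returns a (J,) vector of predicted probabilities that sums to one.
    """
    z = beta @ x
    z = z - z.max()            # subtract max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

def prob_of_chosen(beta, x, k):
    """Equation (2): predicted probability of the response the student chose."""
    return response_probabilities(beta, x)[k]
```

With all coefficients equal (e.g., zero), every choice gets probability 1/4, matching the uninformative baseline for a four-option item.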

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item $m$ to item $n$ as he or she did:

$$p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}. \qquad (3)$$

We then take the product across all students in the classroom who had identical responses in the string. If we define $z$ as a student, $S_{zc}^{mn}$ as the string of responses for student $z$ from item $m$ to item $n$, and $\bar{S}_{sc}^{mn}$ as the string for student $s$, then we can express the product as

$$\tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = \bar{S}_{sc}^{mn}\}} p_{sc}^{mn}. \qquad (4)$$

Note that if there are $n_s$ students in class $c$, and each student has a unique set of responses to these particular items, then $\tilde{p}_{sc}^{mn}$ collapses to $p_{sc}^{mn}$ for each student, and there will be $n_s$ distinct values within the class. On the other extreme, if all of the students in class $c$ have identical responses, then there is only one distinct value of $\tilde{p}_{sc}^{mn}$. We repeat this calculation for all possible consecutive strings of length three to seven; that is, for all $S^{mn}$ such that $3 \le n - m \le 7$. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom.

MEASURE 1. $M1_c = \min_s (\tilde{p}_{sc}^{mn})$.
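A minimal sketch of how Measure 1 could be computed, assuming the predicted probabilities $p_{isc}$ from equation (2) are already in hand as an array; the function name, array layout, and window lengths are illustrative, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict

def measure_one(responses, probs, min_len=3, max_len=7):
    """Least-likely block of identical consecutive answers (Measure 1).

    responses : (S, I) array of each student's chosen answers (coded 0-3).
    probs     : (S, I) array of the predicted probability of each student's
                chosen answer, i.e. p_isc from equation (2).
    Returns the minimum over students and consecutive windows of the
    group probability p~ from equation (4).
    """
    n_students, n_items = responses.shape
    best = 1.0
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            window = slice(m, m + length)
            # p_sc^{mn}: probability each student gives his or her own string
            p_str = probs[:, window].prod(axis=1)
            # group students who share an identical answer string in the window
            groups = defaultdict(list)
            for s in range(n_students):
                groups[tuple(responses[s, window])].append(s)
            # p~ is the product of p_sc^{mn} over all students in the group
            for members in groups.values():
                p_tilde = np.prod([p_str[s] for s in members])
                best = min(best, p_tilde)
    return best
```

When every student's string in a window is unique, each group has one member and the statistic collapses to the individual probabilities, exactly as the text describes.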

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

$$e_{jisc} = \begin{cases} 0 - \dfrac{e^{\hat{\beta}_j X_s}}{\sum_{j=1}^{J} e^{\hat{\beta}_j X_s}} & \text{if } j \ne k; \\[2ex] 1 - \dfrac{e^{\hat{\beta}_j X_s}}{\sum_{j=1}^{J} e^{\hat{\beta}_j X_s}} & \text{if } j = k, \end{cases} \qquad (5)$$

where $e_{jisc}$ is the residual for response $j$ on item $i$ by student $s$ in classroom $c$. We thus have four separate residuals per student per item.

To create a classroom level measure of the response to item $i$, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

$$e_{jic} = \sum_s e_{jisc}. \qquad (6)$$

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class ($n_{sc}$) to normalize by class size:

$$v_{ic} = \frac{\sum_j e_{jic}^2}{n_{sc}}. \qquad (7)$$

The statistic $v_{ic}$ captures something like the variance of student responses on item $i$ within classroom $c$. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items.

MEASURE 2. $M2_c = \bar{v}_c = \sum_i v_{ic} / n_i$, where $n_i$ is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3. $M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i$.

Our final indicator focuses on the extent to which a student's response pattern was different from other students' with the same aggregate score that year. Let $q_{isc}$ equal one if student $s$ in classroom $c$ answered item $i$ correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student $s$ on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of $A$, then this fraction, $\bar{q}_i^A$, can be expressed as

$$\bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}. \qquad (8)$$

We then calculate a measure of how much the response pattern of student $s$ differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item $i$ from the mean response of all students with aggregate score $A$, squaring these deviations, and then summing across all items on the exam:

$$Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2. \qquad (9)$$

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}^A$, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4. $M4_c = \sum_s (Z_{sc} - \bar{Z}^A)$.
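The residual-based measures can likewise be sketched in numpy. The code below is an illustrative reading of equations (5) through (9), not the authors' implementation; it assumes the residuals $e_{jisc}$ and the right/wrong indicators are precomputed, and all function names and array layouts are hypothetical.

```python
import numpy as np

def measures_two_three(resid):
    """Measures 2 and 3 from classroom residuals.

    resid : (J, I, S) array of e_{jisc}: residual for choice j, item i,
            student s (equation (5)); J = 4 answer choices.
    Returns (M2, M3): the mean and variance across items of v_ic.
    """
    n_choices, n_items, n_students = resid.shape
    e_jic = resid.sum(axis=2)                       # equation (6): sum over students
    v_ic = (e_jic ** 2).sum(axis=0) / n_students    # equation (7)
    m2 = v_ic.mean()                                # Measure 2: mean over items
    m3 = ((v_ic - v_ic.mean()) ** 2).mean()         # Measure 3: variance over items
    return m2, m3

def measure_four(correct, agg_scores):
    """Per-student deviations behind Measure 4 (equations (8)-(9)).

    correct    : (S, I) 0/1 array over all students in the grade/year.
    agg_scores : (S,) aggregate score of each student.
    Returns Z_s - Zbar^{A_s} for each student; a classroom's M4 is the sum
    of these values over the students it contains.
    """
    correct = np.asarray(correct, dtype=float)
    scores = np.asarray(agg_scores)
    z = np.empty(len(scores))
    for a in np.unique(scores):
        mask = scores == a
        qbar = correct[mask].mean(axis=0)                 # equation (8)
        dev = ((correct[mask] - qbar) ** 2).sum(axis=1)   # equation (9)
        z[mask] = dev - dev.mean()                        # subtract Zbar^A
    return z
```

A student whose right/wrong pattern matches the typical pattern for his or her aggregate score contributes roughly zero; unusual patterns (e.g., missing easy items while acing hard ones) contribute positively.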

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY
UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.
Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.
Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.
Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.
Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.
Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.
Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.
Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.
Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.
Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.
Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.
Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.
Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System?" Working Paper, University of Florida, 2002.
Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.
Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.
Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.
Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.
Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.
Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.
Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.
Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.
Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership and Job Design," Journal of Law, Economics and Organization, VII (1991), 24–51.
Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.
Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.
Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.
Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.
Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.
Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.
Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.
Marcus, John, "Faking the Grade," Boston Magazine, February 2000.
May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.
Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.
Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.
Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.
Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," Unpublished manuscript, University of Chicago, 2002.
Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.
Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.