Defensible Explanations for Algorithmic Decisions about Writing in Education

Elijah Mayfield

CMU-LTI-20-012

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213

Thesis Committee:

Alan W Black (Chair), Carnegie Mellon University
Yulia Tsvetkov, Carnegie Mellon University
Alexandra Chouldechova, Carnegie Mellon University
Anita Williams Woolley, Carnegie Mellon University
Ezekiel Dixon-Román, University of Pennsylvania

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies.

Copyright © 2020 Elijah Mayfield

Elijah Mayfield

Defensible Explanations for Algorithmic Decisions about Writing in Education

AUGUST 24, 2020

Carnegie Mellon University Copyright © 2020 Elijah Mayfield published by carnegie mellon university www.treeforts.org

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “as is” basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License.

First printing, August 2020

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Alan W Black, Language Technologies Institute (Chair)
Yulia Tsvetkov, Language Technologies Institute
Alexandra Chouldechova, Heinz College of Public Policy
Anita Williams Woolley, Tepper School of Business
Ezekiel Dixon-Román, School of Social Policy & Practice (University of Pennsylvania)


Abstract

This dissertation is a call for collaboration at the interdisciplinary intersection of natural language processing, explainable machine learning, philosophy of science, and education technology. If we want algorithmic decision-making to be explainable, those decisions must be defensible by practitioners in a social context, rather than merely transparent about their technical and mathematical details. Moreover, I argue that a narrow view of explanation, specifically one focused on causal reasoning about deep neural networks, is unsuccessful even on its own terms. To that end, the rest of the thesis aims to build alternate, non-causal tools for explaining the behavior of classification models. My technical contributions study human judgments in two distinct domains. First, I study group decision-making, releasing a large-scale corpus of structured data from Wikipedia’s deletion debates. I show how decisions can be predicted and debate outcomes explained based on social and discursive norms. Next, in automated essay scoring, I study a dataset of student writing, collected through an ongoing cross-institutional tool for academic advising and diagnosis of college readiness. Here, I explore the characteristics of essays that receive disparate scores, focusing on several topics including genre norms, fairness audits across race and gender, and investigative topic modeling. In both cases, I show how to evaluate and choose the most straightforward tools that effectively make predictions, advocating for classical approaches over deep neural methods when appropriate. In my conclusion, I advance a new framework for building defensible explanations for trained models. Recognizing that explanations are constructed within a scientific discourse, and that automated systems must be trustworthy for both developers and users, I develop success criteria for earning that trust. I conclude by connecting to critical theory, arguing that truly defensible algorithmic decision-making must not only be explainable, but must be held accountable for the power structures it enables and extends.

Contents

Part I: Goals 17

— Introduction and Overview 17

— The Philosophy of Explanation 29

Part II: Wikipedia Deletion Debates 49

— Context and Background 51

— Learning to Predict Decisions 65

— Exploring and Explaining Decisions 73

— Future Directions 85

Part III: Automated Essay Scoring 89

— Context and Background 91

— Evaluating Neural Methods 101

— Training and Auditing DAACS 113

— Explaining Essay Structure 133

— Explaining Essay Content 145

— Future Directions 165

Part IV: Takeaways 171

— Defensible Explanations 173

— Confronting Inequity 185

— List of Publications 201

Bibliography 203

List of Figures

1 Amid the 2020 coronavirus shutdown, New York’s state government declared a plan to redesign their curriculum around online learning in collaboration with the Bill & Melinda Gates Foundation. 19
2 Contemporary news articles covered the deletion controversy around the recent Nobel laureate Donna Strickland. From The Guardian. 21
3 The New York Times’ coverage of the edX EASE announcement drove much of the press attention to automated essay scoring in 2013. 23
4 Homepage of the DAACS support tool for first-year college students. 25
5 Network diagrams of causal systems. The system on the right resists surgical intervention between D and H. 35
6 Researchers often use attention weights (top attention layer) to generate explanations. Jain & Wallace (middle) scramble weights and show that output remains stable; a similar result is obtained by Serrano & Smith (bottom) omitting highly-weighted nodes entirely. 39

7 Top: Header of the No original research policy, which can be linked using aliases (OR, NOR, and ORIGINAL). Bottom: one specific subsection of that policy, which can be linked directly (WP:OI). 53
8 Excerpt from a single AfD discussion, with a nominating statement, five votes, and four comments displayed. Votes labeled in bold are explicit preferences (or stances), which are masked in our tasks. 54
9 Distributions by year for votes (left) and outcomes (right) over Wikipedia’s history. 62
10 Counts of discussions per year (blue) and of votes, comments, and citations per discussion in each year. 63
11 Log-log plot of user rank and contributions. The top 36,440 users, all with at least five contributions, are displayed. Collectively, these 22.6% of all users account for 94.3% of all contributions. 64
12 Probability of a Delete outcome as voting margin varies. Administrators almost never overrule Delete majorities with a margin of at least 2 votes, or Keep majorities with a margin of at least 4 votes. 69
13 Real-Time BERT model accuracy mid-discussion, split by final debate length: short (5 or fewer), medium (6-10), and long (over 10). 70

14 Success rates (left) and forecast shifts (right) for votes that were the Nth contribution to a discussion, for different values of N. I measure these values first for any vote with that label at that ordinal location in the debate, then for discussions where the first vote for a particular label appeared at rank N. 75
15 Large forecast shifts arise from initial votes for Keep followed by response votes for Delete. Here, a user successfully cites the Notability (geographic features) policy to keep an article. 77
16 Highly successful votes that also shift the forecast model often come from the narrow use of established policies for notability in specific subtopics. 78
17 One-time voters are more successful than more active voters; however, the first contribution from more active voters has greater forecast shift than the votes from one-time contributors. 79
18 Example of highly successful editor behavior with minimal forecast shift. For each of the later votes, the probability of a Delete outcome is already well over 99%. 80
19 Citations in low-success rate votes that cause little change in forecasts come late in discussions, often citing detailed technical policies rather than focusing on persuasion or notability. 81
20 Summary of success rates and forecast shifts for various policies. Scatter plot shows all policy pages with at least 25 citations in either Keep or Delete votes. Dotted lines mark baseline success rates. 82

21 An August 2019 Vice report brought renewed attention to automated essay scoring, this time in the context of implementations for Common Core standardized testing. 90
22 An example of rubric traits designed for use in automated essay scoring, from my previous work on Turnitin Revision Assistant. 95
23 Illustration of cyclical (top), two-period cyclical (middle, log y-scale), and 1-cycle (bottom) learning rate curricula over N epochs. 105
24 QWK (top) and training time (bottom, in seconds) for 5-fold cross-validation of 1-cycle neural fine-tuning on ASAP datasets 2-6, for BERT (left) and DistilBERT (right). 111
25 Screenshot from DAACS including the writing prompt students responded to for this dataset. 114
26 Shift in population mean scores when using AES, compared to hand-scoring. 124
27 Accuracy of automated scoring by trait, broken out by race and gender. 127
28 Comparison of human inter-rater reliability, in QWK, from 2017 to 2020 datasets, with changes made to rubric and process design. 131

29 Mean score of essays in each category of five-paragraph form, marked with *** when there is a statistically significant relationship between form and score. 139
30 Accuracy of automated scoring by trait, broken out by 5PE form. 140
31 Breakdown of five-paragraph essay frequency by race and gender intersection. Dashed lines indicate whole-population frequency. 140
32 Reliability of automated essay scoring before and after 5PE encoding. Grey shaded area indicates human inter-rater reliability. 142
33 Sidebar menu for the DAACS self-regulated learning survey, which organizes results into a hierarchy. 149
34 Distribution of topic assignments to paragraphs from the LDA model. 151
35 Subgroup differences for document structure topics. 154
36 Subgroup differences for body paragraph topics. 155
37 Subgroup differences for non-adherent paragraph topics. 156
38 Data from Table 33, including exact matches only. 159
39 Relationship between topics as number of topics increases from 4 to 20, following the hierarchical method. Values between cells indicate correlation coefficient between topics. Topics with stable relationships over time are highlighted. 163

List of Tables

1 List of publication venue abbreviations used in this work. 15

2 High-level overview of philosophical theories of explanation. 32

3 Summary of key findings from prior AfD studies. Our released corpus of 423k debates 2005-2018 contains a superset of all data in these papers, except early debates from 2003-04 in Taraborelli & Ciampaglia (2010). 56
4 Overall breakdowns of labels across all data. 59
5 Accuracy of stance classification models for individual contributions, based on rationale text alone. 67
6 Accuracy of forecasting for full discussions and incremental predictions. 71
7 Accuracy of outcome prediction, split by final outcome and total debate length (as in Figure 13). 71
8 Success and forecast shift for Notability citations, split by vote label (Keep or Delete). 77
9 Policies sorted by the ordinal rank of when they appear in discussion, and the mean forecast shift of votes where that citation appears, split by vote label. Many early-appearing policies overlap with the influential notability policies from Table 4. 81

10 Performance on each of ASAP datasets 2-6, in QWK, and execution time, in seconds. The final row shows the gap in QWK between the best-performing neural model and the n-gram baseline. 108
11 Cumulative experiment runtime, in seconds, of feature extraction (F), model training (T), and predicting on test sets (P), for ASAP datasets 2-6 with 5-fold cross-validation. Models with 1-cycle fine-tuning are measured at 5 epochs. 110
12 Rubric used for scoring the 2017 DAACS data (part 1). 115
13 Rubric used for scoring the 2017 DAACS data (part 2). 116
14 Rubric used for scoring the 2017 DAACS data (part 3). 117
15 Inter-rater reliability based on human judgment. 119

16 Comparison of automated essay scoring to human baseline inter-rater reliability (in QWK), and resulting gap between humans and the best-performing automated method. 123
17 Race and gender demographics for all essays and the subset of essays assessed by humans using rubric scoring. 125
18 Breakdown of mean rubric scores, by race and gender intersection. 126
19 Race and gender demographics for essays in the 2020 dataset. International students are not included in race statistics. 129
20 Changes to the Conventions traits in the 2020 revised rubric. 130
21 Percent distribution of each score point for each rubric trait, across datasets, in human scoring. 131
22 Results of tuned models with new 2020 data. 132
23 Comparison of automated reliability between 2017 and 2020 datasets. 132
24 Example first sentences of each paragraph from essays exactly or partially matching the 5PE heuristic search functions. 136
25 Heuristic keywords used for matching five-paragraph essay components. 137
26 Overall prevalence of five-paragraph essays in the total dataset. 138
27 Mean score of essays in each category of five-paragraph form, marked with *** when there is a statistically significant relationship between form and score. 138
28 Reliability of automated essay scoring before and after 5PE encoding, in QWK. 142
29 Accuracy of automated scoring, broken out by race and gender; only traits where reliability was improved by adding 5PE features are shown. In modified models, there is no longer any significant difference in accuracy by demographic subgroup. 143
30 Differences for intersections of topic and trait, from mean population scores. Only significant effects are shown. 152
31 Race and gender counts for essays in the DAACS dataset, specifically among unscored essays. 153
32 Location of labeled paragraphs within essays, by distance from the beginning and end of the text. Highlighting in blue represents topics where the median appearance is at the beginning or end of essay texts. 158
33 Percentage of paragraphs labeled with each topic that appear in exact- and partial-match five-paragraph essays. 159
34 Percent of paragraphs containing exact pasted text from the DAACS interface and reference to the word count minimum, by topic; type-token ratio of documents containing each topic. 160
35 Relationship between topics as number of topics increases from 4 to 20 using the percent-overlap method. Values in cells for k = 4 represent overlap in paragraphs compared to the topic at k = 20 in each row. 164

List of Abbreviations

For brevity in the sidenotes of this dissertation, my bibliography uses abbreviated acronyms for several of the most common publication venues in my field. In particular, you’ll find the following shorthand:

Shorthand: Full citation
Proceedings of ACL: Proceedings of the Annual Meeting of the Association for Computational Linguistics
Proceedings of AERA: Proceedings of the Annual Meeting of the American Educational Research Association
Proceedings of AIED: Proceedings of the International Conference on Artificial Intelligence in Education
Proceedings of AIES: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
Proceedings of BEA: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications
Proceedings of CHI: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems
Proceedings of CONLL: Proceedings of the Conference on Computational Natural Language Learning
Proceedings of CSCW: (before 2017) Proceedings of the ACM Conference on Computer Supported Cooperative Work; (2017-present) Proceedings of the ACM on Human-Computer Interaction: CSCW
Proceedings of DIS: Proceedings of the ACM Conference on Designing Interactive Systems
Proceedings of EMNLP: Proceedings of the Conference on Empirical Methods in Natural Language Processing
Proceedings of FAccT: Proceedings of the ACM Conference on Fairness, Accountability, and Transparency
Proceedings of Group: Proceedings of the ACM International Conference on Supporting Group Work
Proceedings of HICSS: Proceedings of the Hawaii International Conference on System Sciences
Proceedings of ICLR: Proceedings of the International Conference on Learning Representations
Proceedings of ICWSM: Proceedings of the International AAAI Conference on Web and Social Media
Proceedings of LAK: Proceedings of the International Conference on Learning Analytics and Knowledge
Proceedings of NAACL: Proceedings of the Annual Meeting of the North American Association for Computational Linguistics - Human Language Technologies
Proceedings of NCME: Proceedings of the National Conference on Measurement in Education
Proceedings of NeurIPS: Advances in Neural Information Processing Systems
Proceedings of WikiSym: Proceedings of the International Symposium on Wikis and Open Collaboration

Table 1: List of publication venue abbreviations used in this work.

Part I: Goals

Introduction and Overview

"Any teacher that can be replaced by a machine should be."

— Arthur C Clarke1
1 Sian Bayne. “Teacherbot: interventions in automated teaching”. In: Teaching in Higher Education 20.4 (2015), pp. 455–467

"We think of it like a robot tutor in the sky that can semi-read your mind."

— Jose Ferreira, CEO, Knewton2
2 Eric Westervelt. Meet The Mind-Reading Robo Tutor In The Sky. https://bit.ly/318Tj4b. NPR Morning Edition. Accessed 2020-08-01. 2015

The plan had been to write this thesis in coffee shops. Or airports, or my office on campus. But instead, I wrote the last few articles for this dissertation while my coauthors and I were confined to our homes. For most of these last few months, the in-person economy of the United States has essentially been shut down. At the peak this spring, 165 countries had entirely closed their schools, with nearly 1.25 billion students impacted across primary, secondary, and tertiary education systems3.
3 World Bank. World Bank Education and COVID-19. https://bit.ly/2yZwGFa. Accessed 2020-08-01.

Although schools are closed, many are using technology to provide for continuity of learning. In higher education, many institutions shifted to distance learning quickly in the wake of campus closures, while K-12 systems have adapted in a variety of ways4.
4 Justin Reich et al. “Remote Learning Guidance From State Education Agencies During the COVID-19 Pandemic: A First Look”. In: EdArXiv (2020). https://doi.org/10.35542/osf.io/437e2

This is a critical time: Students will be impacted in enduring ways as the response of schools will likely exacerbate existing inequities. Staff will lose their jobs and institutional revenues will fall. Some campuses that have closed will never reopen.

What an opportunity for technologists. For a century or more, science fiction authors have imagined robots replacing teachers. Educational technology researchers and entrepreneurs have been eager to pick up this mantra and goal of replacing traditional teachers and schools. Over the last few years, the tone of these proclamations had started to die down. Yes, the promise of machine learning for enabling the "pedagogical troika" of teaching, learning, and assessment

has held steady5. But airy buzz around superhuman predictive analytics in classrooms, textbooks, and school administrative offices had just started to fade into more reasoned discourse about capabilities and drawbacks, limitations, and implementation needs. That may now change. Some – perhaps most – school administrators will choose to confront this uncertain new reality by prioritizing the role of instructional technology in their students’ learning. And my peers, the machine learning researchers that develop groundbreaking new algorithms for educational technology, will be eager to respond to the call.
5 Edmund W Gordon and Kavitha Rajagopalan. “Assessment for Teaching and Learning, Not Just Accountability”. In: The Testing and Learning Revolution. Springer, 2016, pp. 9–34

In response to Hurricane Katrina’s impact on education in New Orleans, Naomi Klein wrote about how "disaster capitalists” such as Milton Friedman saw in that crisis an opportunity to "radically reform the educational system” through immense cuts to public education in favor of subsidies to for-profit private charter schools6. This spring, Klein wrote a reprise of this line of thinking, which she called the "Screen New Deal"7. The coronavirus pandemic is a global crisis in nearly every sector, and the subsequent protests against police brutality are bringing anti-racist ideals to the forefront of our national discourse. The opportunity for rapid change in our social contract dwarfs even the furthest-reaching impacts of a localized natural disaster. This crisis is a once-in-a-generation moment to see new technologies adopted in sweeping fashion.
6 Naomi Klein. The shock doctrine: The rise of disaster capitalism. Penguin Books, 2007
7 Naomi Klein. “Screen New Deal”. In: The Intercept (2020). Accessed 2020-08-01. https://bit.ly/3dZJhXw

As I watch my colleagues in education and in technology, I see a disconnect between research and practice. In my career, I have worked with researchers, educators, funders, and policy-makers seeking to reshape schooling around educational technology. Along the way, it has been critical to defend the use of machine learning and natural language processing technologies. But the research on explainable machine learning in academic work today seems to struggle even on its own terms, much less survive contact with the outside world. This work answers few of the questions that users ask, while also missing the bigger picture of how tools are built in industry and used by downstream users (in education, this means teachers, students, staff, and parents). We are not well-equipped to explain algorithmic decision-making today, and I believe the reason is that questions about what makes a "good" automated decision are not just technical, they are epistemological.

Figure 1: Amid the 2020 coronavirus shutdown, New York’s state government declared a plan to redesign their curriculum around online learning in collaboration with the Bill & Melinda Gates Foundation.

Problem Statement

How should developers and users understand algorithmic decision-making, making sense of automated choices and labels with stakes attached? There are options at every stage of implementation. Software developers will have to make detailed judgments of whether large-scale neural models are worth the cost and time investment over more straightforward baseline methods. In applied collaborations, domain experts will collect datasets; in doing so, they will have to decide on which training data to collect and how it should be labeled. Academics and corporations will try to defend the behavior of the models they train, and they’ll need to be able to make clear and rational justifications. This requires good explanations of why the algorithms they trained are making trustworthy judgment calls. Here are my guiding concerns about these tasks:

1. NLP research relies on narrow, deeply technical explanations of models and model architectures. A rich, data-driven understanding of real-world behaviors informed by subject matter expertise is lacking in the literature, in favor of a search for causal explanation that I fear may be fruitless.

2. We do not have a good bridge between machine learning research and software development. Even builders of machine learning systems have few options for understanding the models we train; users have even fewer, and people impacted downstream by automated decisions often have no points of access at all.

3. In education specifically, good intentions of machine learning researchers do not always translate to equitable impact. Technical work is not steeped in the history of education reform, and as a result it is decontextualized from the consequences and fallout of automation.

I am not a skeptic of the entire field of educational technology. There’s a long list of people who are, seeing the discipline as intrinsically harmful to schools8,9,10. I don’t always agree with them – but leaning on algorithmic decision-making does require trusting the system from which that algorithm came. To build that trust, researchers must learn from their data, grappling with the human context of the domain they’re working in. Rather than formal, causal guarantees, I argue for time-consuming work embedded in a broader social understanding of the actors that produced their data, understanding the culture you want to build, and training models that reflect behaviors you want to see replicated into the future.
8 Audrey Watters et al. “The problem with ‘personalisation’”. In: Australian Educational Leader 36.4 (2014), p. 55
9 Ben Williamson. “Decoding ClassDojo: psycho-policy, social-emotional learning and persuasive educational technologies”. In: Learning, Media and Technology 42.4 (2017), pp. 440–453
10 Sharon Slade and Paul Prinsloo. “Learning analytics: Ethical issues and dilemmas”. In: American Behavioral Scientist 57.10 (2013), pp. 1510–1529

Thesis Structure

Part I: Defining the Gap

I begin with an investigation of "explainability," as it is published in NLP. In partnership with philosophers of science, I dive into the specific scientific discourse of 2019, in the wake of the introduction of Transformers – particularly BERT11. We focus specifically on the findings of Jain & Wallace12 and Serrano & Smith13, and the rebuttal by Wiegreffe & Pinter14. These arguments are nuanced and highly technical, and I try to take a step back and make a case from the humanities. I argue that attempts at interpreting deep neural models like BERT are categorically aiming for the wrong type of scientific explanation, in a way that is bound to get stuck in "traps" at the wrong level of target and abstraction15. The section ends with my reasoning for why we should lean instead on non-causal explanation of model behavior, and my goal for the thesis: to look at how we study and explain various applied domains that are relevant to education; in so doing, to begin looking for criteria for what defensible explanations should look like; and to tie these into a framework that can guide future NLP work.
11 Jacob Devlin et al. “BERT: Pre-training of deep bidirectional transformers for language understanding”. In: Proceedings of NAACL. 2019
12 Sarthak Jain and Byron C Wallace. “Attention is not Explanation”. In: Proceedings of NAACL. 2019
13 Sofia Serrano and Noah A Smith. “Is Attention Interpretable?” In: Proceedings of ACL. 2019
14 Sarah Wiegreffe and Yuval Pinter. “Attention is not not explanation”. In: Proceedings of EMNLP. 2019
15 Andrew D Selbst et al. “Fairness and abstraction in sociotechnical systems”. In: Proceedings of FAccT. ACM. 2019, pp. 59–68

Part II: Wikipedia Deletion Debates

As a first domain for building an approach to explainable decision-making, I spent time in 2019 understanding how predictions can be used to study the domain of real-world group debate. For this project, I did not start with an education problem that occurs directly in schools. Instead, I chose to study the rich discourse on deletion that rages behind the scenes of Wikipedia, the world’s largest encyclopedia and source of open knowledge. I investigate the classification problem of predicting the outcome of text-based debates.

Figure 2: Contemporary news articles covered the deletion controversy around the recent Nobel laureate Donna Strickland. From The Guardian.

As a motivating example for why this domain is interesting, consider the popular, highly circulated story from the months I was coming back to CMU. At that time, news spread around the internet that Donna Strickland, an acclaimed scientist who had just won the Nobel Prize in Physics, had no Wikipedia presence; this was not only an omission but an intentional decision. She had, in fact, been the subject of an article; but that article had been deleted due to falling below the standards of notability enforced by the site’s editor community16. In this I found a highly inflammatory topic that comes up frequently for marginally famous individuals, and particularly people whose identity doesn’t mirror that of Wikipedia’s editor population. The site can be toxic to newcomers17 and has an especially problematic and well-documented relationship with gender18. The impression was that the deck was stacked: Wikipedia’s rules and culture tended to
16 Leyland Cecco. “Female Nobel prize winner deemed not important enough for Wikipedia entry”. In: The Guardian (2018). Accessed 2020-08-01. url: https://bit.ly/38YxvMt
17 Brian Keegan and Darren Gergle. “Egalitarians at the gate: One-sided gatekeeping practices in social media”. In: Proceedings of CSCW. 2010, pp. 131–134
18 Shyong K Lam et al. “WP: clubhouse?: an exploration of Wikipedia’s gender imbalance”. In: Proceedings of WikiSym. ACM. 2011, pp. 1–10

shut out newcomers at the expense of longer-tenured members, and to prioritize the value of work from white, American men. But how do we describe and explain those social mechanisms?

I collected a massive corpus: every deletion debate in the history of Wikipedia since 2005, in total a little over four hundred thousand debates. My input is nearly-synchronous discussions where individual contributions are short but the discussion as a whole consists of contributions from multiple participants. I start by showing that this data can be used to train accurate machine learning models with BERT - we can do far better than guessing, and can build a useful machine learning tool for predicting the future. Given this ability to predict outcomes, I demonstrate that relatively simple models for classification tasks give us an avenue for reflection on the social phenomena that occur in the Wikipedia domain. I specifically show how trained models give us a rich quantitative set of tools for the human side of data science. The outputs of predictions give us a way to excavate insights from the group decision-making that generated the dataset. This process helps us to find out where decisions are actually coming from, and shows us what actions are predictive of success.

My specific investigation ends up focusing on the "calcified" rules and regulations that dominate debates over notability on Wikipedia19. These policies come from a limited set of a few dozen pages, mostly written over a decade ago, that drive debate among Wikipedia’s editors20. Subjective biases and unfair actions, when encoded in algorithms, "become" objective, rule-based, almost robotic actions21. I’ll use the predictions to explain how these policies are tied to group decisions, based on what outcomes the model forecasts.
19 Brian Keegan and Casey Fiesler. “The Evolution and Consequences of Peer Producing Wikipedia’s Rules”. In: Proceedings of ICWSM (2017)
20 Simon DeDeo. “Group minds and the case of Wikipedia”. In: Human Computation (2014)
21 Alexandra Chouldechova and Aaron Roth. “The Frontiers of Fairness in Machine Learning”. In: Workshop on Fair Representations and Fair Interactive Learning at the Computing Community Consortium (2018)

While Wikipedia is not a school and does not have students or teachers, it intersects frequently with issues of curriculum design and learning. Open resources like Wikipedia undergird both the classroom and independent learning22,23. The discussions that editors have with each other about what belongs in those pages shape knowledge itself – what gets to count as truth in their curriculum. These choices about content impact students directly. Students are perfectly aware of the cultural identity represented in the technology they use24, and we have ample research going back decades that when students see (or do not see) themselves in the content of the books they read and the tests they take, it impacts their own sense of belongingness and identity in the school setting25. Furthermore, evidence continues to show that incorporating a better and broader representation accrues benefits for learning directly26,27. Given these factors, the impact of Wikipedia’s editorial policy on teaching and learning is vitally important, and research on those policies has a clear connection to education more broadly.
22 David Wiley and John Levi Hilton III. “Defining OER-enabled pedagogy”. In: International Review of Research in Open and Distributed Learning 19.4 (2018)
23 Matthew A Vetter, Zachary J McDowell, and Mahala Stewart. “From opportunities to outcomes: the Wikipedia-based writing assignment”. In: Computers and Composition 52 (2019), pp. 53–64
24 Magnus Haake and Agneta Gulz. “Visual stereotypes and virtual pedagogical agents”. In: Journal of Educational Technology & Society 11.4 (2008)
25 Signithia Fordham and John U Ogbu. “Black students’ school success: Coping with the “burden of ‘acting white’””. In: The Urban Review 18.3 (1986), pp. 176–206
26 Ernest Morrell. Critical literacy and urban youth: Pedagogies of access, dissent, and liberation. Routledge, 2015
27 Django Paris and H Samy Alim. Culturally sustaining pedagogies: Teaching and learning for justice in a changing world. Teachers College Press, 2017
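To make the forecasting idea above concrete, the sketch below shows one way a per-vote "forecast shift" could be computed: score each growing prefix of a debate with an outcome classifier and record how much each new contribution moves the predicted probability of a Delete outcome. This is an illustrative stand-in, not the pipeline from Part II; it substitutes a TF-IDF logistic regression for the fine-tuned BERT forecaster described above, and the data-loading names (load_debates, the "contributions" and "outcome" fields) are hypothetical.

```python
# Illustrative sketch only: a TF-IDF + logistic regression stand-in for the
# BERT-based outcome forecaster described in Part II. The loader and field
# names (load_debates, "contributions", "outcome") are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def prefix_texts(debate):
    """Concatenate contributions 1..k for every k, yielding growing prefixes."""
    contribs = debate["contributions"]
    return [" ".join(c["text"] for c in contribs[:k]) for k in range(1, len(contribs) + 1)]

def forecast_shifts(model, debate):
    """Change in P(Delete) attributable to each successive contribution."""
    probs = [model.predict_proba([t])[0][1] for t in prefix_texts(debate)]
    prior = 0.5  # uninformative starting forecast; an assumption of this sketch
    return [p - q for p, q in zip(probs, [prior] + probs[:-1])]

# Train on full debates labeled with their closing outcome (Delete = 1, Keep = 0).
train = load_debates("afd_train.jsonl")  # hypothetical loader
model = make_pipeline(TfidfVectorizer(min_df=5), LogisticRegression(max_iter=1000))
model.fit(
    [" ".join(c["text"] for c in d["contributions"]) for d in train],
    [1 if d["outcome"] == "Delete" else 0 for d in train],
)

for debate in load_debates("afd_test.jsonl"):
    shifts = forecast_shifts(model, debate)
    # Large positive shifts flag contributions that moved the forecast toward Delete.
    print(debate["title"], [round(s, 3) for s in shifts])
```

Reading the shifts alongside vote labels and policy citations is the kind of exploratory, non-causal analysis the figures in Part II summarize; the specific modeling choices here are only placeholders.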

Part III: Automated Essay Scoring

Helping educators understand the role of machine learning in their work has been my job for a long time, with several different affiliations. The particular work I focused on was the commercialization of automated essay scoring (AES). This is a big industry: each year, millions of essays are scored automatically with models trained by machine learning, on exams like the GRE and GMAT28. Students write short essays of a few hundred words, usually on a specific writing activity with predefined content; an algorithm evaluates their work on a rubric based on past scoring data. My involvement in the field began in earnest in 2012 and 2013 with the publication of a Hewlett Foundation white paper29, which circulated widely in the media and in policy-making circles. The evidence from that study established a belief, which has now been stable for several years: automated scoring is tractable with no more than standard machine learning methods30.
28 Yigal Attali and Jill Burstein. “Automated Essay Scoring with e-Rater® V. 2.0”. In: ETS Research Report Series 2 (2004)
29 Mark D Shermis and Ben Hamner. “Contrasting state-of-the-art automated scoring of essays: Analysis”. In: Proceedings of NCME. 2012, pp. 14–16
30 Mark D Shermis. “State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration”. In: Assessing Writing 20 (2014), pp. 53–76

Figure 3: The New York Times’ coverage of the edX EASE announcement drove much of the press attention to automated essay scoring in 2013.

But educators want more than scores on standardized tests. They want the reasoning behind those scores, and actionable advice for students to take away and use on future work. Maybe most importantly, they want to understand the role of the automated system in fundamentally human-human interactions like academic advising, teaching, and peer collaboration.

These extensions are not easy! The AES industry has historically

relied on multivariate regression models constructed by researchers with psychometrics expertise, defining and measuring only a few dozen intuitively satisfying and justifiable variables. The goal in building models has been not to attain the highest accuracy, but to weigh the tradeoffs between precision and recall, on one hand, against defensibility on the other. More modern NLP, including neural methods, is not amenable to this debate. Composition scholars who wish to explore AES are immediately overwhelmed with millions of parameters, hosted on GPU-enabled cloud compute, with no clear connection between the model’s performance and any interpretable judgment system. The learning curve is simply insurmountable; today, practitioners must trust a black box from a hands-off developer.

So Part III of this thesis describes an approach to building the explainability case for an AES system. I start by questioning the importance of deep learning methods. The teams I led in industry used classical methods; they were based on ordinal logistic regressions and bag-of-words features, with minimal NLP technology; I describe the practical lessons learned from those years, including measurement of how well our software performed at improving student outcomes. In those same years, though, a massive influx of new tools became available to researchers, including neural networks, contextual word embeddings, and of course, attention-based Transformer architectures. It feels, intuitively, like those tools should be useful for essay scoring – but with serious technological requirements, a large carbon footprint, and no easy pathway to explainability, are they worth it? I explore that question, probing whether classical methods are good enough for AES tasks and whether full neural models are a necessary component for state-of-the-art research (a minimal sketch of this style of classical baseline appears at the end of this section).

This section then moves on to a partnership with researchers implementing DAACS, the Diagnostic Assessment and Achievement of College Skills31. This support tool for first-time college students gives feedback on curriculum readiness for college as well as a wide range of self-regulation skills. Our goal was to build a good system for automated scoring, with behavior that could make sense to practitioners in live deployments at universities. I trained a series of classifiers for this product, in two waves of training data.
31 Diana Akhmedjanova et al. “Validity and Reliability of the DAACS Writing Assessment”. In: Proceedings of NCME. 2019

Thousands of students have used DAACS, but the population is unlike the target of most AES systems. Students are often mid-career, coming back to college in their 30s after a decade or more in the workforce, often in the military. From reviewing the text of their essays, I know that many have spouses, children, and jobs. Because of the potential for negative impact due to model bias32, in addition to measuring performance of the models in terms of accuracy, I audit

the systems for demographic fairness by race and gender, using demographic data available through the DAACS partnership.
32 Safiya Umoja Noble. Algorithms of oppression: How search engines reinforce racism. NYU Press, 2018

Figure 4: Homepage of the DAACS support tool for first-year college students.

I also begin looking at the factors and features that might be part of a non-causal explanation for the scores that AES models produce. After all, an outstanding and highly controversial question in the essay scoring literature is exactly what patterns are prioritized by machine learning classifiers, and what implications that has for institutional adoption of technology, instructional pedagogy, and student test prep. Long-time critic of automation and former director of Writing Across the Curriculum at MIT, Les Perelman argues that algorithmic models learn trivial correlations with scores33, like word count. Composition scholar Bill Condon argues a more subtle point, that the systems force students and institutions alike into valuing superficial writing styles, narrowing writing to an easily testable format34.
33 Les Perelman. “When “the state of the art" is counting words”. In: Assessing Writing 21 (2014), pp. 104–111
34 William Condon. “Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings?” In: Assessing Writing 18.1 (2013), pp. 100–108

So I take the methods used in the Wikipedia study and apply them to essay scoring. I try to explain the behavior observed in the live data from students using the system, in ways that might be pedagogically useful and productive for setting campus policy. I break this into two chapters: the first focusing on essay structure and in particular the use of the five-paragraph essay, and the second on essay content. I show how student writing can be a window into the social context of writing assessment and the normative contract between teachers, students, and institutions. Additionally, my content and topic analysis shows how personal expression, self-identity, and engagement with educational norms appear in the essays students submit to technological systems.
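As referenced above, the sketch below illustrates the kind of classical scoring baseline that Part III weighs against neural methods: bag-of-words features, a simple regression-style classifier, and evaluation by quadratically weighted kappa (QWK). It is a minimal illustration under assumed data (the load_essays helper and its fields are hypothetical), and a plain multinomial logistic regression stands in for the ordinal regressions mentioned above; it is not the DAACS or Revision Assistant implementation.

```python
# Minimal sketch of a classical AES baseline: bag-of-words features with a
# logistic regression scorer, evaluated by quadratically weighted kappa (QWK).
# Illustrative only; load_essays and its fields are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

essays, scores = load_essays("essays.csv")  # hypothetical: texts and rubric score points

X_train, X_test, y_train, y_test = train_test_split(
    essays, scores, test_size=0.2, random_state=0)

# Word and bigram counts feed a multinomial logistic regression over score points.
scorer = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=5),
    LogisticRegression(max_iter=1000),
)
scorer.fit(X_train, y_train)
predicted = scorer.predict(X_test)

# QWK penalizes disagreements between distant score points more heavily than
# adjacent ones, which is why it is the standard agreement metric in this literature.
qwk = cohen_kappa_score(y_test, predicted, weights="quadratic")
print(f"QWK against held-out human scores: {qwk:.3f}")
```

A pipeline like this is transparent in exactly the sense practitioners ask for: the features are countable words and phrases, and the same QWK metric used for human inter-rater reliability can be reported for the model.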

Part IV: Toward Successful Explanation

With these two investigations, I spend the closing sections of the thesis defining success criteria for defensible machine learning decision-making. We can train models that reliably predict behavior, sure – but can we then explain the domain these models learn to emulate, the decision-making that our classifiers will bring into that domain, and the limits on the circumstances that our explanations will hold for? The final section of this thesis brings these questions together into a working conceptual frame for how future research should explain the human angle in machine learning data.

I lay out the case for non-causal explanations, again leaning on philosophy of science. I show how contemporary philosophers have developed the idea of minimal models, meant to define the shared features that, collectively, establish the circumstances under which a non-causal explanation will hold. I then describe how a justificatory step can carry the load for these explanations even when causality has not been established. And then, relying on work in computational social science, I describe how the data in both of my domains can be characterized. My justification for model explanations relies on the idea of individual humans as explainable social actors, with complex interactions and goals that algorithms learn to replicate. This gives a way forward for explainability research that eschews narrow, introspective explanations based on models alone in a vacuum.

But a condition of these explanations is that we know the systems that they can explain and the systems they cannot. In both of my domains of study, the social phenomena under analysis for algorithmic replication are based on systems of long-held structural biases; replicating and even explaining them may do more harm than good. So I conclude by veering into more radical territory. The moment in 2020 is pivotal, and the stakes are incredibly high, but we have a problem. So far I’ve discussed how explainable machine learning researchers have focused on model internals rather than social context. Similarly, developers of education technology have focused on in-system analytics as the core of learning science35. But a natural extension of my work on social context is to critique the nature of how we measure algorithmic systems in principle. This final chapter, coming out of interdisciplinary reading on critical theory and queer feminist scholarship, further probes the goals and power structures of explainable machine learning, asking who benefits from automation in these domains, and what truths our explanations are actually revealing and reifying. I point out assumptions that are made when we build our systems in the first place. Now of all moments, we should question those assumptions.
35 Ryan Shaun Baker and Paul Salvador Inventado. “Educational data mining and learning analytics”. In: Learning Analytics. Springer, 2014, pp. 61–75

Core Contributions

This dissertation is an opinionated call for proactive and social explainability in natural language processing systems. I argue for an explanatory strategy that can give usable insight and knowledge about algorithmic systems that is practical for real-world conversations about the consequences of automated decision-making.

• I make the case against a narrow view of explanation. I demonstrate that current approaches are not successful at interventionist causal explanation, but suggest hope that philosophy of science offers paths to successful non-causal accounts.

• I make contributions to two separate fields, using a broad toolbox for data-driven explanation of both human and model behavior.

– Wikipedia deletion debates, where I present new data on group decision-making, release a large-scale corpus of structured conversations, and connect threads from prior work to gain insight about successful discourse strategies online.

– Automated essay scoring, where I conduct an investigation of technical methods for automated essay scoring, measuring the tradeoffs of neural methods with classical approaches. Applying this approach to a dataset from first-year college students, I ask how we can describe and explain automated scoring behavior. My approach combines classification tasks with demographic fairness measurement and a mixed-methods investigation of both the structure and content of student writing.

• I propose a framework for developing defensible explanations for algorithmic decision-making, connecting the explainable machine learning literature to the social sciences and humanities. I also use this opportunity to connect to gender studies and recognize where an explainable algorithm must also be held accountable for the power structures it reifies and reinforces.

As researchers, we have a role to play in shaping the decisions that are being made about designing, developing, and disseminating automated decision-making systems. To responsibly fulfill that role requires a broader picture of how our systems are used and defended in practice. I set a course for knowing the behaviors of our models and the sources of those behaviors. Practitioners adopting these methods should be able to anticipate the consequences of adopting our technologies, not be surprised at first collision with the outside world. I hope to bring a new perspective on what it means to successfully build machine learning systems in education.


The Philosophy of Explanation

In the natural language processing community, we’ve reached a consensus that explainability in trained models is a positive attribute. When performing model selection, the less complex and more explainable model should be preferred (holding all else - for instance, classification accuracy or training time - equal). Part of this is purely intuitive and based on logistical ease for software developers; the preference for explainable models is also spurred on by regulation, led by the European Union’s "right to explanation" in the 2016 enactment of the GDPR36. Yet so far, most researchers in NLP that describe their models as explainable have treated explanation the way that U.S. Supreme Court Justice Potter Stewart famously treated obscenity: "I know it when I see it"37. It is challenging to evaluate the success of an explainable neural model without defining a criterion for evaluating what counts as an explanation, and moreover what counts as a good explanation. So let’s start this thesis by asking what actually makes a good explanation for machine learning behavior.
36 Bryce Goodman and Seth Flaxman. “European Union regulations on algorithmic decision-making and a “right to explanation"”. In: AI Magazine 38.3 (2017), pp. 50–57
37 Potter Stewart, concurring. “Jacobellis v Ohio”. In: United States Supreme Court 378 (1964), p. 184

In language technologies, when publications aim to characterize or define explanation, authors tend to begin by assuming that there is some singular thing that is an explanation, and that other researchers will know it when they see it in the same way that the paper’s authors do. Moreover, they assume that explainability can be measured against other measurable quantities in machine learning models, in the context of tradeoffs and holistic assessment of model quality. There is a long list of explainable AI overview articles proposing desiderata, or desirable characteristics, of what such an explanation should look like38. But a central contention of other fields, most importantly philosophy of science, is that no such obvious phenomenon exists. This chapter represents a collaboration with philosophers of science that we published at LREC this year39. What I learned is that philosophers tend to look for boundaries or generalizable limits in science, more robust than what any one researcher or group might see from any one experiment or paper. When cross-referencing recent results in machine learning with accounts of scientific explanation from philosophy, we found that purely model-based quantitative explanations are unable to give causal explanations on their own.
38 Zachary C Lipton. “The mythos of model interpretability”. In: ICML Workshop on Human Interpretability in Machine Learning (2016)
39 Christopher Grimsley, Elijah Mayfield, and Julia R.S. Bursten. “Why Attention is Not Explanation: Surgical Intervention and Causal Reasoning about Neural Models”. In: Proceedings of LREC. 2020

This is a problem for computer scientists! We typically believe that causal explanations are the pinnacle of good explanation. The mantra that "correlation is not causation" looms over our scientific

inquiry, description, and every discussion of machine learning model behaviors. In collaboration with philosophers at the University of Kentucky, I began to build a theoretical grounding of explanation, specifically by critiquing the explanations I tend to read at ACL. For that critique, I choose to use one particular theory of explanation: James Woodward’s manipulability through intervention40.
40 James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005

For philosophers, an account is an application of a philosophical theory to a scientific process, making explicit the set of assumptions and worldviews that are embedded in the actions, writing, or conclusions of the scientists under scrutiny. I begin this chapter with an overview of philosophical theories of scientific explanation, including brief summaries of multiple competing and complementary perspectives. This overview is itself a new contribution, both in the taxonomy of theories it develops and in its presentation for computer scientists in the explainable NLP community.

Next, from Woodward I’ll specifically walk through the interventionist account, evaluate the isolated type of explanations in the NLP literature, and present a philosophical foundation for a new and different style of explanation. Due to its emphasis on causal reasoning via counterfactual analysis and its historical development at the interface of computer science and philosophy of science, this account is a good tool for analyzing a class of recent findings on explanation of neural networks through causal reasoning.

Then I’ll walk through some context on explanations in machine learning, stepping from rule-based systems to linear models and most recently to deep models. Then I’ll dive in deep, describing one widespread and highly popular approach for explaining neural networks for tasks using text or speech inputs: the use of attention mechanisms as a functional basis for generating explanations. I then give a detailed summary of two recent findings on the limitations of attention mechanisms for explainable NLP.

• Jain & Wallace41, which finds that attention layers in neural networks can be subject to adversarial reweightings, undermining their use for explanation.
41 Sarthak Jain and Byron C Wallace. “Attention is not Explanation”. In: Proceedings of NAACL. 2019

• Serrano & Smith42, which finds that a very large number of attention weights can be zeroed out entirely, again undermining the use of these layers for identifying the importance of intermediate representations within a deep neural classifier.
42 Sofia Serrano and Noah A Smith. “Is Attention Interpretable?” In: Proceedings of ACL. 2019

With an established background from both philosophy of science and deep learning, this chapter then gets to the meat of my critique. I apply the interventionist account to the study of attention mechanisms for explaining neural network behavior, focusing on three primary research questions:

RQ1: Is it appropriate to analyze deep learning explanations using the interventionist account?
RQ2: The account requires that interventions be surgical in order to make causal claims. Do the studies succeed at surgical intervention?
RQ3: If attention weights cannot be manipulated surgically, what are the consequences for explanation through attention?
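For readers unfamiliar with the experiments these questions target, the sketch below shows the general shape of an attention-perturbation test in the spirit of Jain & Wallace and Serrano & Smith: replace a trained attention distribution with a permuted (or partially zeroed) one and check whether the model's prediction moves. It is a simplified, self-contained illustration of the idea, not the authors' released code; the toy additive-attention classifier here is invented for this example.

```python
# Simplified illustration of an attention-perturbation check in the spirit of
# Jain & Wallace (2019) and Serrano & Smith (2019): permute or zero out the
# attention distribution and see whether the prediction changes. The toy
# additive-attention classifier below is invented for this example.
import torch
import torch.nn as nn

class ToyAttentionClassifier(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)        # additive attention scorer
        self.out = nn.Linear(dim, n_classes)

    def forward(self, tokens, attention=None):
        h = self.embed(tokens)                # (seq, dim) token representations
        if attention is None:
            attention = torch.softmax(self.score(h).squeeze(-1), dim=0)  # (seq,)
        context = attention @ h               # attention-weighted sum of token states
        return torch.softmax(self.out(context), dim=-1), attention

torch.manual_seed(0)
model = ToyAttentionClassifier()
tokens = torch.randint(0, 1000, (12,))        # one fake 12-token input

with torch.no_grad():
    original_probs, weights = model(tokens)
    # Intervention 1: randomly permute the learned attention distribution.
    permuted = weights[torch.randperm(len(weights))]
    permuted_probs, _ = model(tokens, attention=permuted)
    # Intervention 2: zero out the highest-weighted token and renormalize.
    zeroed = weights.clone()
    zeroed[weights.argmax()] = 0.0
    zeroed = zeroed / zeroed.sum()
    zeroed_probs, _ = model(tokens, attention=zeroed)

print("original:", original_probs)
print("permuted:", permuted_probs)
print("zeroed:  ", zeroed_probs)
# If predictions barely move under these interventions, the attention weights
# are doing little of the work that an explanation would attribute to them.
```

Whether such a manipulation counts as a surgical intervention in Woodward's sense is exactly what RQ2 interrogates; the sketch only shows the mechanics of the perturbation.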

To spoil the ending, I find narrow, model-only explanations lacking. But I end this chapter with some hope. The interventionist account deals only in causal explanation in the sciences. An important corollary of my analysis is that while a network will not and cannot produce causal explanations, it can still render alternate, non-causal types of explanations. I conclude the chapter by arguing that these types of explanations should be sought in the production of explainable neural models. I summarize these alternate types of explanation, making the key claim that non-causal explanations are the only types of explanation that can be derived from machine learning at the current state-of-the-art. I walk through the implications of these findings, pointing at multiple alternate accounts that suggest templates for success conditions for explainability with non-causal methods. Late in this thesis, after gathering evidence and examples from my two domains of interest, I’ll use this background to suggest a path forward for explainability research.

Philosophy and Theories of Explanation

Philosophy of science research identifies and analyzes the conceptual and logical foundations of scientific reasoning using methodology including conceptual analysis, simulation modeling, formal methods, ethnography, and case studies on historical and contemporary instances of scientific research. The nature of scientific explanation has long been a topic of central concern in philosophy of science. Philosophical research on explanation seeks to identify what explanations are — whether they are instantiated patterns of logical inference, generators of a psychological sensation of understanding, ways of encoding similar patterns of information observed in disparate systems, traces of causes and effects, or something else entirely. Along with research on laws of nature, the structure of scientific theories, the aims of science, and the role of causation in the sciences, research on scientific explanation is one of the central subdisciplines within the philosophy of science.

The aim of philosophy is not consensus-building, so a wide variety of philosophical theories of explanation continue to coexist. Some are in direct competition with one another, while others serve as complements or limiting cases of one another. This brief review, summarized in Table 2, highlights a few of the most common sorts of theories of explanation, with emphasis on the varieties that are most central to explainability in neural models. A more comprehensive overview is available in Woodward’s encyclopedia overview43.
43 James Woodward. “Scientific Explanation”. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Fall 2017. Metaphysics Research Lab, Stanford University, 2017

Table 2: High-level overview of philosophical theories of explanation.

Theory (type): Explananda (things to be explained) / Explanantia (things doing the explaining)
Deductive-Nomological (logical): Observed phenomenon or pattern of phenomena / Laws of nature, empirical observations, and deductive syllogistic pattern of reasoning
Unification (logical): Observed phenomenon or pattern of phenomena / Logical argument class
Transmission (causal): Observed output of causal process / Observed or inferred trace of causal process
Interventionist (causal): Variables representing output of causal process / Variables representing input of causal process and invariant pattern of counterfactual dependence between variables
Pragmatic (functional): Answers to why-questions / True propositions defined by their relevance relation to the explanandum they explain and the contrast class against which the demand for explanation is made
Psychological (functional): Observed phenomenon or pattern of phenomena / True propositions defined by their relation to the user’s knowledge base and to the explanandum

Generally, theories of explanation may be understood as either logical, causal, or functional. Logical theories aim to characterize the logical structure of a cogent scientific explanation and typically emphasize the relations between explanation, laws of nature or scientific theory, and specific empirical observations. Causal theories aim to characterize explanation as an accounting of observed or expected patterns of cause and effect and are often accompanied by philosophical theories of causation itself. Functional theories, which typically focus on either the psychological or pragmatic functions of explanation, aim to characterize explanation in virtue of the function it accomplishes in scientific reasoning, rather than identifying the logical or causal structure of an explanation.

Some basic tenets of canonical theories of explanation are summarized below. Standard philosophical vocabulary for the parts of an explanation is employed: the explanandum, pl. explananda, is the thing to be explained, i.e. the target or object of an explanation; the explanans, pl. explanantia, is the thing doing the explaining.

• Deductive-Nomological Theories44,45, one of the oldest logical theories of explanation, hold that explanations are deductive syllogisms. The explanantia are the premises of the syllogism, and the explanandum is the conclusion. Among the explanantia, laws of nature are always taken as the major premise, and specific empirical conditions as the minor premise.

44 Carl G Hempel and Paul Oppenheim. “Studies in the Logic of Explanation”. In: Philosophy of Science 15.2 (1948), pp. 135–175.
45 Peter Railton. “A deductive-nomological model of probabilistic explanation”. In: Philosophy of Science 45.2 (1978), pp. 206–226.

• Unification Theories46,47, another logical theory, hold that explanations are not syllogistic; instead, they inhabit a more finely-structured logical space in which disparate phenomena exhibit similar patterns of behavior. In this account, explanation consists in identifying the classification of a given argument pattern from among the accepted patterns of argument. The argument class is an explanans. Classes typically align with systems of natural laws.

46 Michael Friedman. “Explanation and scientific understanding”. In: The Journal of Philosophy 71.1 (1974), pp. 5–19.
47 Philip Kitcher. “Explanatory unification”. In: Philosophy of Science 48.4 (1981), pp. 507–531.

• Transmission Causal Theories48,49 characterize explanantia not as logical structures but as causal processes. These processes generate a product, which is the explanandum. Distinguishing genuine from merely apparent causal processes is a central concern of these theories and is accomplished by tracking the transmission of a signal, impulse, or mark over the course of the explanation.

48 Wesley C Salmon. Scientific Explanation and the Causal Structure of the World. Princeton University Press, 1984.
49 Phil Dowe. “An empiricist defence of the causal account of explanation”. In: International Studies in the Philosophy of Science 6.2 (1992), pp. 123–128.

• Interventionist Theories of causation introduce graph theory to the representation of causal relations and emphasize the identification of invariance relations between causes and effects as the target of causal claims50,51,52. Applied to explanation, the interventionist account53,54 produces theories of causal explanation in which explananda and explanantia are connected by counterfactual causal dependence, which indicates invariant relations between purported causes and effects.

• The Pragmatic Theory55,56 contrasts itself with logical theories by characterizing explanation not as generation of a particular logical argument structure, and with transmission theories by not requiring the relay of a causal mark. Instead, the theory defines explanation functionally as answering a why-question about a phenomenon. Explanantia consist of meta-level logical structures that index explananda to an explanatory context and define relevance relations to contrasting phenomena.

• Psychological Theories57,58 and their critics investigate explanation not as the satisfaction of any particular argumentative structure, but rather as acts or pieces of information that generate a sense of understanding in the agents (real or ideal) who interact with them. Significant attention is then given to characterizing what constitutes understanding. Like some of the work on explanation in neural models59, this approach has significant overlap with the social sciences, including psychology and behavioral science.

50 James Woodward. “Capacities and invariance”. In: Philosophical Problems of the Internal and External Worlds: Essays on the Philosophy of Adolf Grünbaum (1994), p. 283.
51 Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993.
52 Judea Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.
53 James Woodward. “Explanation, invariance, and intervention”. In: Philosophy of Science 64 (1997), S26–S41.
54 James Woodward. Making Things Happen: A Theory of Causal Explanation. Oxford University Press, 2005.
55 Bas C Van Fraassen. “The pragmatics of explanation”. In: American Philosophical Quarterly 14.2 (1977), pp. 143–150.
56 Bas C Van Fraassen. The Scientific Image. Oxford University Press, 1980.
57 Henk W De Regt. “The epistemic value of understanding”. In: Philosophy of Science 76.5 (2009), pp. 585–597.
58 Kareem Khalifa. “The role of explanation in understanding”. In: The British Journal for the Philosophy of Science 64.1 (2012), pp. 161–187.
59 Tim Miller, Piers Howe, and Liz Sonenberg. “Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences”. In: Proceedings of IJCAI Workshop on Explainable AI. 2017.

The classification system here is meant to capture commonly-acknowledged divisions within philosophical research on explanation. Each of the theories identified above has benefits and drawbacks, and some are more appropriate than others for capturing the sort of explanation sought in the construction of explainable neural models.

My collaboration on the philosophical paradigms here was initially motivated by the observation that a significant source of confusion in explainable machine learning arises from failing to clarify what sort of explanation is being generated. For instance, an explanation for a model’s output that is designed to be causal fails if it only generates non-causal, nomological explanations. So we felt a need to define our terms and our conditions for success. When looking at the current state of most published machine learning research, we decided that the interventionist account was the best fit for what researchers in my field have been trying to do.

The Interventionist Account

This account focuses on those phenomena which can be explained in terms of the relationship between particular outcomes and the factors which gave rise to those outcomes. An explanation in this account relies on establishing the existence of manipulability through intervention. Some key features distinguish this account from others, such as logical explanations. First, the relationships between circumstances and outcomes are empirical, subject to data-driven verification through manipulation of those circumstances and collection of evidence. Next, that evidence is evaluated for causality: the dependence of the outcomes on other variables must be not merely conceptual but a direct relationship.

The challenge for explanation here lies in concretely determining the existence of such a causal relationship. In order to do so, the relationship between relevant variables in a system must be subjected to manipulation, where the values of those variables are changed. The theory thus lends itself well to explanations of systems which have quantifiable components, such that the quantity or value of the component can be easily denoted and modified as a variable. Performing a quantitative manipulation on a variable and then observing the changes in the output of the system as a whole, recording the overall changes through observed cause and effect, is called an intervention. If manipulation of quantifiable components leads to similarly quantifiable changes in output, researchers have established the first necessary, though not sufficient, elements of causal explanation.

The last piece of a successful causal explanation for a system requires that an intervention on system components is surgical.

To define this, philosophers lean on one final concept: invariance. In a multivariate system, it is frequently the case that a single effect has multiple causes. In Figure 5, the diagram on the left demonstrates a simple toy system with three variables: variable A is a causal factor for both B and C, and B is also a causal factor in C. To perform a surgical intervention explaining the relationship between A and C, holding B invariant is necessary. But the system on the right, with only a handful of variables and relationships, demonstrates that some cases resist surgical intervention. The only path from D to H, for instance, is indirect, passing through other causal factors. We cannot hold those variables all invariant while intervening on D and still cause a change in H. According to Woodward, a surgical intervention is an intervention that makes strategic use of invariance; a successful explanation, finally, is only one that is generated empirically through the use of surgical interventions. The relationship between D and H therefore cannot be explained through surgical intervention.

Figure 5: Network diagrams of causal systems. The system on the right resists surgical intervention between D and H.
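To make this vocabulary concrete, the following is a minimal sketch (not drawn from the studies discussed in this thesis) of a surgical intervention on a toy structural causal model shaped like the left-hand system in Figure 5; the functional forms and coefficients are illustrative assumptions.

```python
import random

# Toy structural causal model for the left-hand system in Figure 5:
# A -> B, A -> C, and B -> C. All functional forms here are made up.

def sample(do_A=None, do_B=None):
    """Draw one observation, optionally forcing (intervening on) A and/or B."""
    A = do_A if do_A is not None else random.gauss(0, 1)
    B = do_B if do_B is not None else 2.0 * A + random.gauss(0, 0.1)
    C = 1.0 * A - 0.5 * B + random.gauss(0, 0.1)
    return A, B, C

def mean_C(do_A, do_B=None, n=5000):
    return sum(sample(do_A, do_B)[2] for _ in range(n)) / n

random.seed(0)

# Surgical intervention: wiggle A while holding B invariant at a fixed value.
# The change in C then reflects only the direct A -> C dependence (about +1.0).
print("direct A->C effect, B held fixed:", round(mean_C(1.0, 0.0) - mean_C(0.0, 0.0), 2))

# Without invariance the manipulation is no longer surgical: part of the change
# flows through A -> B -> C and cancels the direct path (about 0.0 here).
print("total effect, B left alone:      ", round(mean_C(1.0) - mean_C(0.0), 2))
```

The what-if-things-had-been-different reading discussed below is explicit in this sketch: the explanatory work is done by how C would differ under alternative settings of A while B is held invariant, and dropping that invariance lets the indirect path wash the direct effect out.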

The interventionist account is fundamentally modal: it relies upon counterfactuals in order to work:

"...an explanation ought to be such that it can be used to answer what I call a what-if-things-had-been-different question: the explanation must enable us to see what sort of difference it would have made for the explanandum if the factors cited in the explanans had been different in various possible ways."60 60 James Woodward. Making things happen: A theory of causal explanation. What the outcome of a manipulation would be, were it to occur, Oxford university press, 2005 is what matters for a successful explanation: a pattern of counterfac- tual dependence between elements of the system of variables and the output of the system. In cases where we cannot track the pattern of counterfactual dependence among variables, no amount of manip- ulation is sufficient to successfully explain behavior. This has clear implications for neural models, which can have hundreds of millions of parameters, interdependent in complex ways. The interventionist account is a good fit for probing the bound- aries of causal explanation in machine learning, where inputs and outputs are quantifiable as vectors, tensor elements in neural mod- els, or probability distributions in Bayesian models. Indeed, this definition of a successful causal explanation will be familiar to re- searchers with experience in Bayesian statistical machine learning. 36 defensible explanations for algorithmic decisions about writing in education

Woodward’s philosophical account was entwined with the development of Bayesian networks under his colleague Judea Pearl61, and their shared research agenda led to mathematical definitions like d-separation of variables in machine learning. But while Pearl-style approaches to causality have been applied extensively in Bayesian models, their application remains daunting in the context of deep neural models.

Overall, the interventionist account has a great deal of appeal for computer scientists. Causal relationships are intuitive and align with how humans learn to interact with the world; if successful explanation requires human understanding on some level, there is great potential in an account that leverages natural inclinations to modify and test existing systems. But as we shall see, causality cannot always provide adequate explanatory power for large and complex systems.

61 Dan Geiger, Thomas Verma, and Judea Pearl. “Identifying independence in Bayesian networks”. In: Networks 20.5 (1990), pp. 507–534.

Explainable Machine Learning

In machine learning, it is generally accepted that rule-based systems are easier to interpret by both amateurs and experts compared to linear models62; that linear models are in turn easier to interpret relative to generalized additive models63 or Bayesian networks64; and so on until reaching the almost entirely black-box behavior of deep neural models65. There is room for translation, for instance by extracting simpler proxies that can mostly replicate more complex model behavior with simple rules66; but in general, explanation in machine learning has only gotten harder over more than forty years67. While this hierarchy is rudimentary and fails to account for all the various dimensions of model interpretability68, it is broadly perceived to be accurate in practice. Nevertheless, a lack of explainability is rarely a barrier to implementation and widespread use. Neural models’ performance continues to outpace other approaches to machine learning, and what they lack in explanation, they make up for in performance.

62 Himabindu Lakkaraju et al. “Interpretable & explorable approximations of black box models”. In: Proceedings of KDD Workshop on Fairness, Accountability, and Transparency in Machine Learning (2017).
63 Yin Lou, Rich Caruana, and Johannes Gehrke. “Intelligible models for classification and regression”. In: Proceedings of the ACM SIGKDD. ACM, 2012, pp. 150–158.
64 Carmen Lacave and Francisco J Díez. “A review of explanation methods for Bayesian networks”. In: The Knowledge Engineering Review 17.2 (2002), pp. 107–127.
65 Tim Miller. “Explanation in artificial intelligence: Insights from the social sciences”. In: Artificial Intelligence (2018).
66 Longfei Han et al. “Rule extraction from support vector machines using ensemble learning approach: an application for diagnosis of diabetes”. In: IEEE Journal of Biomedical and Health Informatics 19.2 (2014), pp. 728–734.
67 Or Biran and Courtenay Cotton. “Explanation and justification in machine learning: A survey”. In: IJCAI-17 Workshop on Explainable AI (XAI). Vol. 8. 2017, p. 1.
68 Zachary C Lipton. “The mythos of model interpretability”. In: ICML Workshop on Human Interpretability in Machine Learning (2016).
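As a hedged illustration of the proxy-extraction idea above, the sketch below fits a shallow decision tree to imitate a more opaque model’s predictions and prints its rules; the dataset, the choice of gradient boosting as the black box, and the tree depth are all illustrative assumptions, not choices made anywhere in this thesis.

```python
# Surrogate-model sketch: train a shallow, readable tree on the *predictions*
# of a less interpretable model, then report how faithfully it mimics them.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

black_box = GradientBoostingClassifier(random_state=0).fit(X, y)
y_hat = black_box.predict(X)  # the behavior we want to approximate

# The surrogate is trained on the black box's outputs, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y_hat)

fidelity = accuracy_score(y_hat, surrogate.predict(X))
print(f"surrogate fidelity to black-box predictions: {fidelity:.2f}")
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(10)]))
```

The fidelity score is the standing caveat for this style of translation: the extracted rules explain the proxy, and only explain the original model to the extent that the proxy tracks it.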

But the difficulty of explaining neural models has not stopped researchers from trying. Much work has tried to grasp the structure of a network and what aspects of language are encoded where69,70. Others aim to measure how predictions change incrementally with newly added information, evaluating the impact of each particular new input token in a text71. Additionally, in human-computer interaction, researchers have worked to determine what users want from explanations72, for instance by showing users only a subset of text highlighted as important (as a simplifying step)73. These rationales, also known as attributions, can also be used directly at training time to encourage models to focus on or selectively ignore particular subsections of text74,75. This approach can be used without supervised span annotations. For instance, in my own prior work on essay scoring we observed how a model responded to word deletion76 (and others have done similar things77). The results can be visualized using a variety of methods: heatmaps that perform highlighting or live editing78; generated plaintext, partially or entirely independent of the actual classification or factual content but facially plausible79,80; or direct exposure of structure in the underlying model, such as traversals through a graph81.

69 Ian Tenney, Dipanjan Das, and Ellie Pavlick. “BERT rediscovers the classical NLP pipeline”. In: Proceedings of ACL. 2019.
70 Kevin Clark et al. “What Does BERT Look At? An Analysis of BERT’s Attention”. In: BlackboxNLP Workshop at ACL. 2019.
71 Jiwei Li et al. “Visualizing and Understanding Neural Models in NLP”. In: Proceedings of NAACL. 2016, pp. 681–691.
72 Brian Y Lim and Anind K Dey. “Assessing demand for intelligibility in context-aware applications”. In: Proceedings of the International Conference on Ubiquitous Computing. ACM, 2009, pp. 195–204.
73 Joost Bastings, Wilker Aziz, and Ivan Titov. “Interpretable Neural Predictions with Differentiable Binary Variables”. In: Proceedings of ACL. 2019.
74 Lucas Dixon et al. “Measuring and mitigating unintended bias in text classification”. In: Proceedings of AIES. ACM, 2018, pp. 67–73.
75 Frederick Liu and Besim Avci. “Incorporating Priors with Feature Attribution on Text Classification”. In: Proceedings of ACL. 2019.
76 Bronwyn Woods et al. “Formative essay feedback using predictive scoring models”. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 2071–2080.
77 Dong Nguyen. “Comparing automatic and human evaluation of local explanations for text classification”. In: Proceedings of NAACL. 2018, pp. 1069–1078.
78 Shusen Liu et al. “Visual interrogation of attention-based models for natural language inference and machine comprehension”. In: Proceedings of EMNLP. 2018.
79 Wei Xu, Chris Callison-Burch, and Courtney Napoles. “Problems in current text simplification research: New data can help”. In: Transactions of the Association for Computational Linguistics 3 (2015), pp. 283–297.
80 Hui Liu, Qingyu Yin, and William Yang Wang. “Towards Explainable NLP: A Generative Explanation Framework for Text Classification”. In: Proceedings of NAACL. 2019.
81 Zhilin Yang et al. “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. In: Proceedings of EMNLP. 2018, pp. 2369–2380.
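In the spirit of the word-deletion probes mentioned above, here is a minimal, generic sketch of occlusion-style attribution; the scoring function is a stand-in for any trained model that maps a text to a real-valued prediction, not the essay-scoring model from my prior work.

```python
# Occlusion / word-deletion attribution: delete one token at a time and treat
# the drop in the model's score as that token's importance.
from typing import Callable, List, Tuple

def deletion_attributions(text: str,
                          score: Callable[[str], float]) -> List[Tuple[str, float]]:
    tokens = text.split()
    base = score(text)
    results = []
    for i, token in enumerate(tokens):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        results.append((token, base - score(reduced)))
    return results

# Toy scorer (counts hedge words) purely so the example runs end to end.
HEDGES = {"perhaps", "maybe", "arguably"}
toy_score = lambda t: float(sum(w.lower() in HEDGES for w in t.split()))

for token, delta in deletion_attributions("This claim is perhaps arguably true", toy_score):
    print(f"{token:>10s}  {delta:+.1f}")
```

Swapping the toy scorer for a real classifier’s probability output yields the kind of heatmap-ready token scores described above.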

Explanation through Attention

In recent years, much of the hope for explanation has been pinned on attention mechanisms. This innovation, first introduced by Bahdanau et al.82, allows neural models to be trained to automatically focus on small portions of inputs, like individual sentences or even words, while making predictions. This focusing allows neural models to outperform the state of the art83 and has led to sophisticated modern architectures like the Transformer, which has currently produced the most accurate models on a wide range of tasks84,85. In addition to performance gains, these layers appear to provide human-interpretable explanations of model behavior "for free." Because the model was trained to focus on specific subsets of information at inference time, the logic goes, it is reasonable to assume those dimensions are "most important" for the rationale of the resulting output. This approach resulted in a variety of visualizations and other attempts at model explanation being explored based, in part or in whole, on attention weights86,87,88.

The past year has seen skepticism emerge about this indirect, downstream use of attention. The layer was designed to facilitate increased accuracy of models — in fact, the original paper makes no claims of its human interpretability — but the field saw wide proliferation of attention’s use for explanatory purposes. In this section I briefly describe the two parallel studies, and one response paper, that serve as a foundation for our analysis.

"Attention is not Explanation" is an application of adversarial learning to the problem of explanation in machine learning systems. Jain & Wallace89 take issue with the widespread direct extraction of attention weights into visualization tools like heatmaps. They show that counterfactual attention weights can be discovered for a given, trained neural network. First, as a proof of concept, the authors show that attention weights can, in some cases, be randomly scrambled without loss of performance, suggesting that "explanations" derived from those weights have little meaning. They then demonstrate an optimization problem that moves attention weights as far as possible away from the original attention weights of a model, without changing the model’s output behavior.
The authors’ critique of the use of attention for explanation describes these counterfactual configurations as equally plausible, from a modeling perspective, compared to other configurations which present far more intuitive explanations. Further, they assert a strong conclusion: because these adversarial configurations can be created without changing the outputs for given inputs, we cannot rely upon attention as a means of explanation.

82 Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: Proceedings of ICLR. 2015.
83 Diyi Yang et al. “Who Did What: Editor Role Identification in Wikipedia”. In: ICWSM. 2016, pp. 446–455.
84 Ashish Vaswani et al. “Attention is all you need”. In: Proceedings of NeurIPS. 2017, pp. 5998–6008.
85 Zihang Dai et al. “Transformer-XL: Attentive language models beyond a fixed-length context”. In: Proceedings of ACL. 2019.
86 Kelvin Xu et al. “Show, attend and tell: Neural image caption generation with visual attention”. In: Proceedings of the International Conference on Machine Learning. 2015, pp. 2048–2057.
87 James Mullenbach et al. “Explainable prediction of medical codes from clinical text”. In: Proceedings of NAACL. 2018.
88 Diyi Yang et al. “Let’s Make Your Request More Persuasive: Modeling Persuasive Strategies via Semi-Supervised Neural Nets on Crowdfunding Platforms”. In: Proceedings of NAACL. 2019, pp. 3620–3630.
89 Sarthak Jain and Byron C Wallace. “Attention is not Explanation”. In: Proceedings of NAACL. 2019.

"Is Attention Interpretable?", written independently and contem- poraneously, makes similar claims. Here, Serrano & Smith90 test 90 Sofia Serrano and Noah A Smith. “Is attention mechanisms by manipulating the layer’s weights, and show Attention Interpretable?” In: Proceedings of ACL. 2019 that these weights do not impact output of the model. Rather than alter weights adversarially, the authors omit nodes entirely. They show that a surprising number of attention weights can be zeroed out (in some cases, more than 90% of nodes), without impacting per- formance of the model itself. The authors test this approach with a variety of ranking methods, from random choice to sophisticated sorting based on the gradient of nodes with respect to the classifier’s decision boundary. Though their results show sensitivity to these dif- ferent approaches, the core finding remains: Neural models are highly robust to change at the supposedly crucial attention layer, produc- ing identical outputs in a large fraction of cases even after significant alterations to the model.

Figure 6: Researchers often use atten- tion weights (top attention layer) to generate explanations. Jain & Wallace (middle) scramble weights and show that output remains stable; a similar result is obtained by Serrano & Smith (bottom) omitting highly-weighted nodes entirely.

But the picture is not simple. In "Attention is not not Explanation", Wiegreffe & Pinter91 produce several empirical results limiting the scope of the claims from the first two papers. The first result of that work shows that some of the classification tasks in the initial work are simply too easy for attention to matter — eliminating the layer by setting all values uniformly does not result in loss of performance. This suggests one practical boundary for attention as explanation: any task where the additional complexity of an attention layer is wholly unnecessary to achieve state-of-the-art performance. For those tasks where attention is valuable for performance, the authors show that part of the vulnerability of the attention layer to manipulation comes from holding the rest of the model fixed.

91 Sarah Wiegreffe and Yuval Pinter. “Attention is not not explanation”. In: Proceedings of EMNLP. 2019.

Attention as explanation, they argue, only makes sense in the context of a model that has jointly trained inner representation layers and the final attention layer. By constraining the adversary from Jain & Wallace to model-consistent behavior, they show that the resulting attention weights have much less room for modification without resulting in changes to the output.

As this debate continues to accumulate empirical results, we find that computer science researchers are hampered by a lack of shared vocabulary and a lack of theoretical basis for the success criteria of explanation92. So next, I advance the discussion by leaning explicitly on philosophy of science to build a more rigorous vocabulary and re-evaluate these results.

92 Sam Corbett-Davies and Sharad Goel. “The measure and mismeasure of fairness: A critical review of fair machine learning”. In: Synthesis of tutorial presented at ICML (2018).

Applying the Interventionist Account

In what follows, I recast the findings of the adversarial attention experiments from the studies highlighted above in terms of Woodward’s interventionist account, using the research questions introduced at the beginning of this chapter.

Does the Account Apply?

Woodward’s interventionist framework may be a useful way to analyze attention mechanisms in NLP; but not every philosophical theory is an appropriate fit for every scientific experiment. Before proceeding, I worked with my collaborators to confirm that the problem fits the conditions of a manipulation-based approach. In this case, we were looking for experiments that (1) produce empirical data, (2) hinge on causality as the core of their explanatory argument, and (3) rely on reasoning via counterfactual dependence to make causal claims.

In fact, adversarial attention configurations fit Woodward’s description of intervention well. Woodward calls for modifying targeted variables in order to observe the changes to the output of the whole system, and adjustments to weights in the attention layer attempt just that: precisely shifting the focus of the algorithm to a particular segment of the input data in order to cause changes in the output, with the intent of measuring a causal effect (the outputs changing). When the system of variables represented by the attention configuration is manipulated, if there is a causal relation, the prediction generated by the model should change.

Jain & Wallace show that a vastly different attention layer which results in the same output can be found by either searching through weights for nodes, or even by scrambling the weights of the network at random. The work from Serrano & Smith is similar. Here, rather than reweighting to create an adversarial layer, an enormous number of nodes in the attention layer can be zeroed out entirely, functionally removing them from the network. Again, they test whether outputs of the model differ based on this process. Both experiments evaluate the quantitative outputs of their models on a large corpus of data, meeting the requirement for empirical evidence as part of the explanation. Both also reason about counterfactuals to develop causal claims: the adversarial configurations of attention weights, and the adversarially zeroed-out attention nodes, are both directly evaluated for explanatory value compared to the attention layer that was actually learned from data (the base, to use Wiegreffe & Pinter’s terminology).

A1. Yes, these studies make arguments that fit the interventionist account.

Is Attention Manipulation Surgical?

Jain & Wallace’s central proposition is clear from their title: attention is not explanation. They make a causal argument, discovering an adversarial attention configuration which produces the same effect, resulting in a loss of uniqueness of explanation. Furthermore, the resulting adversarial weights contradict intuitions about the sources of a model’s judgment. Similarly, Serrano & Smith ask whether at- tention is interpretable. By showing that highly-ranked attention weights can be zeroed out without affecting model performance, they argue that the answer is no. These processes initially appear to be surgical intervention: researchers assert that the initial configuration is a plausible explanation, and after manipulation, the new config- uration is implausible. Two major philosophical problems appear here. First, it is possible that there actually is a true causal relationship in both the adversarial and non-adversarial attention configurations, and the model predictions. In this case, the original scientists would be right to conclude that attention is not explanation: a surprising, counterintuitive causal link between two variables that should not be linked is perplexing to users. In the colloquial sense, it explains nothing. On the other hand, an explanation of a causal relationship is not necessarily non-explanatory just because it is unintuitive. If manip- ulations reveal the existence of adversarial attention configurations that nevertheless produce accurate predictions, it may be the result of another causal relationship between inputs and the output class being predicted. If such a causal relationship does exist but was 42 defensible explanations for algorithmic decisions about writing in education

To get at the deeper problem, the interventionist account offers a second and more problematic observation on these experiments. In both original studies, attention is only one part of a larger system of variables; in fact, it is the final layer, receiving as input the result of a complex series of calculations on the initial inputs. But invariance in all non-target variables is what makes manipulations qualify as surgical. Both highlighted papers show an unsteady relationship between input tokens and the corresponding attention weights; the relevant variables for targets of manipulation lie outside of the attention layer. The relevant system in this case is not attention alone, but attention in addition to, and in connection with, the neural model’s prior layers. If the generation of adversarial attention configurations is possible, the interventionist account argues, then there is more at work than attention in the learned model. This is quite a big problem to overcome, as the scope of the changes to the network’s output may not match the scope of attempted interventions. Additionally, engaging in interventions on selected weights does not result in continuous, smooth changes to model outputs, especially in discrete classification tasks.

Only some manipulations are surgical, and these example studies do not meet that standard; only some sets of variables are held invariant. Wiegreffe & Pinter make this argument implicitly in their response paper, arguing that severing the attention layer from the broader training of the overall model renders the experiment less meaningful; they argue that similar interventions on the remainder of the system are only possible with joint training between the attention layer and the rest of the model. Their critique is appropriate, and can be strengthened with philosophical vocabulary. Surgical interventions require explanations that depend on variables being held constant, and Woodward’s conditions for successful explanation are not met here.

Failing this requirement has major consequences. The identification of causal relationships allows researchers to infer important details about the nature of the system which can serve in an explanation. But as the interventionist account cannot be meaningfully applied to systems which resist surgical interventions, causal explanations of the type that Woodward describes are not possible. Consequently, we will agree that attention is not explanation, but for reasons apart from, and broader than, the intuition-based arguments from Jain & Wallace. Instead we must argue that, by definition, attention is not explanation.

The manipulation of attention mechanisms cannot meet the preconditions laid out as part of the boundaries of successful causal explanation.

A2. No, manipulating attention weights fails the conditions of surgical intervention.

Consequences of Failed Causal Explanation

So we cannot trace the causal chain through a neural model at the level of complexity found in modern NLP. Woodward’s framework demands the establishment of a pattern of counterfactual dependence through the elements of a system, and this can only be demonstrated through the use of surgical interventions while tracking changes in output. Without surgical intervention, we cannot determine whether a pattern of counterfactual dependence exists in the first place, or if it does, how it is constituted. Two options present themselves:

• For some reason, Woodward’s causal theory is inapplicable to attention-based manipulations, and the approach is not causally problematic.

• Woodward is correct, and the absence of the possibility for surgical intervention on attention mechanisms means that a causal link between attention and model output cannot be established.

In the first case, NLP researchers are faced with a difficult question: what distinguishes our circumstances from other quantitative systems of interacting variables in science, which are adequately explained by the interventionist account? But if we choose the latter, a more practical problem emerges: in order to be explainable, an algorithm must be manipulable via surgical intervention. The results of these papers suggest a failure in principle of modern NLP networks to allow for the testing of counterfactual manipulations. This amounts to a categorical judgment that attention-based causal explanations are destined to fail on neural models. We do not have the ability to engage in surgical intervention on attention systems at all, and consequently cannot determine the nature of the causal relationships between attention layers and the output of the system. Though we cannot determine these relationships, we can still conclude that attention is not explanation — but only due to the broader claim that without access to, and the ability to intentionally manipulate (or hold constant), all relevant system variables, explanation is ruled out entirely.

A3. Attention weights alone cannot be used as causal explanations for model behavior.

A constant in computer science, from calculating π with greater precision to mathematically complex but deterministic tasks like cryptography, has been that programs are constrained by the logic of their code, reliant on underlying notions of cause and effect. Deep learning cannot generate these causal explanations. But methods based in attention mechanisms will generate apparently causal explanations even where such reasoning is not possible.

Philosophical research has a grounding for these types of explanations: the psychological account. The success criteria for such explanations are not grounded in explanantia based in cause and effect, but in whether they produce a sense of understanding in the researcher or user of a system. The apparently causal nature of these explanations is in fact a hindrance to scientific understanding. In the context of manipulation where surgical intervention is not possible and psychological accounts take priority, the apparent causal stories are not reliable; they are causal fake news.

This undermines the central goals of explainable machine learning: to provide justification of why and how an algorithm made a decision, to hold the algorithm and its developers accountable for decisions that violate laws, and to give the subjects of those decisions actionable steps to alter the decision that the algorithm has made. If causal explanation is based in a failed methodology, all of the protections of explainability are suspect. Laws to protect members of marginalized classes will be enforced based on false understanding of model behavior; users seeking recompense will work in vain to alter their outcomes based on factors that will not produce change; and developers will allocate resources wastefully to improve model performance based on a misguided understanding of model behaviors.

But not all philosophical theories require true explanantia. While many theories do expect true explanantia and a testable, robust connection between explanantia and explananda, the door is opened for false but psychologically satisfying explanations. A strand of research in NLP has explicitly aimed not to generate true explanantia, but instead to produce false but cognizable explanantia: natural language generation, especially work using sequence-to-sequence modeling93. This direction of research would benefit from deeper vocabulary on the relation between truth and reasoning; as is, these explanations have no theoretical grounding in functional, logical, or causal theories. The challenge for such research will be to articulate the conditions for success of apparently causal explanations that are known to be false.

93 Hui Liu, Qingyu Yin, and William Yang Wang. “Towards Explainable NLP: A Generative Explanation Framework for Text Classification”. In: Proceedings of NAACL. 2019.

Setting the Terms for Non-Causal Explanation

Philosophical Guidance

As models with interdependent relationships among a large number of variables grow, it becomes less likely that surgical intervention on variables can be performed. Moreover, even if such surgical interventions were still possible in principle within the model, Wiegreffe & Pinter offer compelling concerns about the connection between attention layers and the broader model during training. To put it bluntly: the "deep" structure of contemporary NLP is exactly what prevents causal explanation from manipulation of its parts. Nevertheless, the user affordances attached to many explanations employ causal vocabulary, despite such research being limited to purely psychological accounts of success. If meeting the success criteria for generating an explanation means producing human-cognizable systems of causal relations between variables, the point at which explanation becomes impossible is co-extensive with the point at which the number of variables in a causal chain exceeds the maximum number of relations between variables which can, in principle, be tracked by a human to whom an explanation is directed.

As the conclusion of our research, we argued that while researchers have defined their explanations informally using the constraints and success conditions of causal explanation, they are evaluating their success instead on non-causal theories of explanation, particularly either pragmatic or psychological bases.

The practical recommendation for NLP researchers is to disentangle explainability from cause-tracking. Rather than generating causally faulty (but psychologically satisfying) explanations that pattern themselves after causal explanation, as researchers have done in the past94, we came to the conclusion that technical researchers of explanation should pattern their success conditions on non-causal accounts. Earlier in this chapter, I briefly identified non-causal accounts like the logical and functional types. These canonical philosophical theories of explanation should be part of explanation researchers’ basic vocabularies.

94 Upol Ehsan et al. “Rationalization: A neural machine translation approach to generating natural language explanations”. In: Proceedings of AIES. ACM, 2018, pp. 81–87.

But an even more promising body of non-causal accounts comes from contemporary philosophical research on explanation in mathematics and physics. These accounts of explanation are robust, under active study by philosophers, and are still available to NLP research:

• Mathematical Explanations95,96 iterate on logical theories of explanation. Geometric explanations use mathematical principles as the explanantia, which are taken to be modally stronger than mere causal principles or even natural laws of physics. A classic example of this type of explanation is the use of graph representations of the Bridges of Königsberg as the explanans for one’s inability to cross all of the bridges exactly once in succession (the degree-parity argument behind this example is sketched at the end of this subsection).

95 Christopher Pincock. “A role for mathematics in the physical sciences”. In: Noûs 41.2 (2007), pp. 253–275.
96 Marc Lange. “What makes a scientific explanation distinctively mathematical?” In: The British Journal for the Philosophy of Science 64.3 (2013), pp. 485–511.

• Structural Model Explanations97,98 identify scientific or mathematical models of systems as explanantia of the phenomena they represent. Explanations are built by connecting models to phenomena via a "justificatory step," whose details will be particular to the case at hand. This is a useful alternative framework for thinking about how explainable neural models will connect to the phenomena they aim to model. In early work, Sullivan99 has begun evaluating the current prospects for deriving understanding from machine learning.

97 Alisa Bokulich. “How scientific models can explain”. In: Synthese 180.1 (2011), pp. 33–45.
98 Alisa Bokulich. “Searching for Noncausal Explanations in a Sea of Causes”. In: Explanation Beyond Causation: Philosophical Perspectives on Non-Causal Explanations (2018), p. 141.
99 Emily Sullivan. “Understanding from Machine Learning Models”. In: British Journal for the Philosophy of Science (2019).

• Minimal-Model Explanations100,101,102,103,104, drawing from work on the renormalization group in applied mathematics105, generate a framework for justifying an explanation by using mathematical details to illuminate why differences between systems modeled via the same mathematics are irrelevant. By focusing on explaining away irrelevance, rather than articulating a relevance relation, these accounts flip the script for justification of purported explanations and produce a new theory of explanation of the functional sort. Due to its explicit engagement with explanations whose mathematics do not map cleanly onto represented features of the system being modeled, this approach may be especially promising for us in the context of text-based data and machine learning.

100 Robert W Batterman. The Devil in the Details: Asymptotic Reasoning in Explanation, Reduction, and Emergence. Oxford University Press, 2001.
101 Robert W Batterman and Collin C Rice. “Minimal model explanations”. In: Philosophy of Science 81.3 (2014), pp. 349–376.
102 Collin Rice. “Moving beyond causes: Optimality models and scientific explanation”. In: Noûs 49.3 (2015), pp. 589–615.
103 Collin Rice. “Models Don’t Decompose That Way: A Holistic View of Idealized Models”. In: The British Journal for the Philosophy of Science 70.1 (2017), pp. 179–208.
104 Collin Rice. “Idealized models, holistic distortions, and universality”. In: Synthese 195.6 (2018), pp. 2795–2819.
105 Kenneth G Wilson. “Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture”. In: Physical Review B 4.9 (1971), p. 3174.

When causal reasoning is taken off the table, some of the existing constraints from a causal conception are also placed at risk. Any non-interpretable neural network defies the sort of individuation of explanantia into a set of humanly-cognizable statements, premises, or causes which is required for most of these theories. Because deep learning models defy this sort of individuation, recognizing which accounts of explanation work within deep learning will clarify evaluation criteria for explanation moving forward.

For the rest of this thesis, I aim to evaluate the success of a machine learning explanation on these justification-based accounts, rather than a causal account. The details of a good explanation are based on the justificatory step of domain expertise, followed by a structural model that covers the phenomena under examination. What does that look like in practice? The only way to know is to dig into the data.
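To make the Königsberg example from the Mathematical Explanations entry above concrete, the degree-parity argument can be written out in a few lines; the bridge list below is the standard seven-bridge configuration, and the parity fact the code checks is the explanans.

```python
# Königsberg, made concrete: a walk crossing every bridge exactly once exists
# only if zero or two land masses touch an odd number of bridges.
from collections import Counter

# The classical seven bridges, as pairs of land masses.
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]

degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = [node for node, d in degree.items() if d % 2 == 1]
print("bridge counts per land mass:", dict(degree))   # {'A': 5, 'B': 3, 'C': 3, 'D': 3}
print("odd-degree land masses:", odd)                  # all four
print("walk crossing each bridge once is possible:", len(odd) in (0, 2))  # False
```

No causal process is traced anywhere here: the impossibility follows from the parity structure alone, which is exactly the modal strength these accounts attribute to mathematical explanantia.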

Technical Innovations

Finding a new philosophical account by which to justify our explanations is one solution to the problems in this chapter, but it is not the only way out. An alternate approach is to reframe the technical problem being studied, and to seek out understanding and explanation of deep neural model behavior through alternate technical means. My work is hardly the only critique of attention-based explanation; adversarial actors can generate incoherent explanations with maliciously designed examples106, sometimes requiring alterations to input text as small as a 1-character misspelling107, and are capable of actively deceiving users into believing explanations that are untrue or that omit key information in a purposefully deceitful manner108.

As a result, researchers are pushing forward on alternate paths of explanation that rely less on standalone, interventionist analysis of attention weights, and instead use alternate paths toward what is now being referred to as faithful explanation. One common approach is detailed by joint work from the original authors of the articles analyzed in this chapter. In this work109 and contemporaneous work by other researchers110,111, researchers have sought to better define the goal of a faithful explanation. Stepping aside from a purely interventionist account, these works rely more on a psychological account, asking whether the use of highlighted rationales in texts can lead to more faithful explanations based on the original intent of document labeling annotators. They suggest in their discussion that the next step for this branch of research, rather than formal proof of causality, is user study to evaluate whether the resulting explanations are sufficient for human understanding. In so doing they may make the case for moving the field toward a trust-based mindset; late in this thesis, I address this possibility and review what prior work from the human-computer interaction community may support such a turn.

An alternate approach to this problem would attempt to keep the causal rationale but move the unit of analysis away from individual tokens or neurons in a network. For instance, the approach of explanation through influence functions, first presented by Koh & Liang112, would alter the fundamental problem definition of explanation. Rather than focus on specific tokens or activation weights in a neural network, whether in attention layers or elsewhere, they argue for a focus on which training examples are most responsible for a given prediction. This premise was recently expanded by Han et al.113 as a specific response to the theoretical gaps in an attention-based explanation strategy. In this work, they argue that not only are strictly attention-based models unsound for causal reasoning, they are nonsensical in more complex semantic tasks beyond document labeling.

106 Hila Gonen and Yoav Goldberg. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them”. In: Proceedings of NAACL. 2019, pp. 609–614.
107 Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. “Learning The Difference That Makes A Difference With Counterfactually-Augmented Data”. In: Proceedings of ICLR. 2020.
108 Danish Pruthi et al. “Learning to Deceive with Attention-Based Explanations”. In: Proceedings of ACL. 2019.
109 Sarthak Jain et al. “Learning to faithfully rationalize by construction”. In: Proceedings of ACL. 2020.
110 Julia Strout, Ye Zhang, and Raymond Mooney. “Do Human Rationales Improve Machine Explanations?” In: Proceedings of the BlackboxNLP Workshop at ACL. 2019, pp. 56–62.
111 Ruiqi Zhong, Steven Shao, and Kathleen McKeown. “Fine-grained sentiment analysis with faithful attention”. In: arXiv preprint arXiv:1908.06870 (2019).
112 Pang Wei Koh and Percy Liang. “Understanding Black-box Predictions via Influence Functions”. In: International Conference on Machine Learning. 2017, pp. 1885–1894.
113 Xiaochuang Han, Byron C Wallace, and Yulia Tsvetkov. “Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions”. In: Proceedings of ACL. 2020.
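To make the training-example-centric view concrete, here is a rough sketch under strong simplifying assumptions: instead of the inverse-Hessian influence functions of Koh & Liang, it ranks training points by the similarity of their loss gradients to a test point’s gradient in a plain logistic regression, which is only a cheap proxy for influence and not the method used in the cited papers.

```python
# Gradient-similarity proxy for "which training examples matter most here?"
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)
w = model.coef_.ravel()  # ignore the intercept for simplicity

def loss_gradient(x, label):
    """Gradient of the logistic loss with respect to the weights, one example."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - label) * x

x_test, y_test = X[0], y[0]
g_test = loss_gradient(x_test, y_test)

# Training points whose gradients align with the test gradient are treated as
# the most "responsible" for this prediction under the proxy.
scores = np.array([g_test @ loss_gradient(X[i], y[i]) for i in range(len(X))])
print("most influential training indices:", np.argsort(-scores)[:5])
```

The unit of explanation here is a training document rather than a token or an attention weight, which is the shift in problem definition that this line of work argues for.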

Han et al.’s results show that this approach is not only effective at surfacing insights about training datasets, but that manually altered examples based on the training set produce predictable patterns of output that suggest the possibility of successful surgical intervention. This approach seeks to maintain the interventionist account while avoiding the pitfalls of attention-based explanation.

A final approach makes the smallest theoretical shift, arguing not that the interventionist approach is inherently flawed but that networks are merely overly dense in their relationships between nodes. Researchers such as Correia et al.114 argue that explainable models may be achievable through intentionally induced sparsity, reducing the number of potentially divergent or malicious weightings that are available for spurious explanation. This approach focuses on practicality, culminating in models like DistilBERT115, which is used later in this thesis.

114 Gonçalo M Correia, Vlad Niculae, and André FT Martins. “Adaptively Sparse Transformers”. In: Proceedings of EMNLP. 2019, pp. 2174–2184.
115 Victor Sanh et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. In: arXiv preprint arXiv:1910.01108 (2019).

Ethical Limitations

Finally, a major, categorical limitation applies to all of the approaches I describe above. This definition of a successful explanation has been entirely about epistemology, not ethics. Many adjacent subfields of philosophy of science exist, and only occasional interactions between computer scientists and philosophers have taken place to date116,117. In the chapters to come, I provide some foundation for how to build a good explanation. But a good explanation does not mean that a system has made a good decision; it certainly does not cover whether the system is using a good algorithm. Explainability research frequently studies tasks with high stakes, including notoriously biased tasks like recidivism prediction, financial risk modeling, and facial recognition for surveillance. Our work does not absolve researchers from a broader social responsibility: the presence of a successful explanation will not help if a loan is denied because of race118, if an accused criminal is wrongly identified because of their gender presentation119, or if algorithms persecute ethnic groups120 or misdiagnose mental health121.

Late in this dissertation, I argue that to build algorithmic decision-making in a truly socially responsible way, a successful non-causal explanation must be a component piece. But explainability, as a goal, must be incorporated into a much broader research agenda that accounts not only for explanation but also for ethical software development. This chapter is a good start: it provides a vocabulary for me to unify the informal language that proliferates across explainability research today, and allows the investigations that follow to maintain rigor even while avoiding making specifically causal arguments.

116 Tim Miller. “Explanation in artificial intelligence: Insights from the social sciences”. In: Artificial Intelligence (2018).
117 John Zerilli et al. “Transparency in Algorithmic and Human Decision-Making: Is There a Double Standard?” In: Philosophy & Technology (2018), pp. 1–23.
118 Andreas Fuster et al. “Predictably unequal? The effects of machine learning on credit markets”. In: The Effects of Machine Learning on Credit Markets (November 6, 2018) (2018).
119 Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification”. In: Proceedings of FAccT. 2018, pp. 77–91.
120 Wei Wang, Feixiang He, and Qijun Zhao. “Facial ethnicity classification with deep convolutional neural networks”. In: Chinese Conference on Biometric Recognition. Springer, 2016, pp. 176–185.
121 Cynthia L Bennett and Os Keyes. “What is the Point of Fairness? Disability, AI and The Complexity of Justice”. In: Workshop on AI Fairness for People with Disabilities at ACM SIGACCESS Conference on Computers and Accessibility. 2019.

Part II: Wikipedia Deletion Debates

Who gets to define what content is true and important? The answer to this question is often skipped in the learning sciences, but this underlying argument over the qualities of "good" knowledge and information eventually defines all the content that students end up seeing in the classroom.

In this first domain area for research, I dig into group decision-making as it occurs in the Articles for Deletion debates from Wikipedia’s editor community. Far out of sight of most educators today, discursive processes between online users have crucial ramifications for students downstream, as these users act as the curators of open knowledge. Central to a political, critical understanding of education is the idea that knowledge is not static, but is instead created through a discourse, and that the power dynamics that give structure to that discourse also directly impact the definition and creation of the knowledge that is produced. I show not only that those decisions can be predicted based on written debates, but that there is a better way of explaining the community’s behaviors. I choose not to look introspectively at the features that are used by the model, trying to trace a causal path, but instead explain the human-to-human systemic behavior that generated the data on which those models are trained.

These sections of the dissertation were published in a pair of papers, first at the Computational Social Science workshop at NAACL122 and then in more extended form at CSCW123. The latter publication received a Best Paper Honourable Mention. I start with an overview of Wikipedia’s context and the corpus that I built, then move on to the basic structure of the prediction task that I define. In the final chapter, I show how we can explain community dynamics using those predictive models. This has immediate value for the Wikipedia research community, and it creates a template for non-causal explanation in machine learning.

122 Elijah Mayfield and Alan W Black. “Stance Classification, Outcome Prediction, and Impact Assessment: NLP Tasks for Studying Group Decision-Making”. In: Workshop on Natural Language Processing + Computational Social Science at NAACL. 2019.
123 Elijah Mayfield and Alan W Black. “Analyzing Wikipedia Deletion Debates with a Group Decision-Making Forecast Model”. In: Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019), pp. 1–26.


Context and Background

The decision of what counts as ideal and prestigious content, teaching, or learning is a high-leverage place for technologists to contribute. The rapid expansion of open educational resources shows signs of displacing the previous, publisher-driven model of textbook publishing124. These open practices appear to reduce gaps in student outcomes based on pre-existing income disparities125, and students perceive the resources and curriculum as equal to, or better than, traditional textbooks126. Core to many of these resources is the use of material from Wikipedia, the free encyclopedia: assigning articles as readings and also actively writing content as part of course activities127.

124 David Wiley and John Levi Hilton III. “Defining OER-enabled pedagogy”. In: International Review of Research in Open and Distributed Learning 19.4 (2018).
125 Nicholas B Colvard, C Edward Watson, and Hyojin Park. “The Impact of Open Educational Resources on Various Student Success Metrics”. In: International Journal of Teaching and Learning in Higher Education 30.2 (2018), pp. 262–276.
126 Rajiv S Jhangiani et al. “As Good or Better than Commercial Textbooks: Students’ Perceptions and Outcomes from Using Open Digital and Open Print Textbooks”. In: Canadian Journal for the Scholarship of Teaching and Learning 9.1 (2018), n1.
127 Robin DeRosa and Scott Robison. “From OER to open pedagogy: Harnessing the power of open”. In: Open: The philosophy and practices that are revolutionizing education and science. London: Ubiquity Press. 4.1 (2017), p. 0.

A Brief History of Wikipedia

Countless papers have studied Wikipedia (see Mesgari et al.128 for a thorough survey), and a subset have studied editor interactions as a "model organism" for decision-making in online communities generally. This term, used in the context of social media analysis, originates with Tufekci129. Though Wikipedia was founded in 2001, it took a few years for research interest in the community of editors to begin in earnest. The earliest published research on Wikipedia was likely Lih’s130; shortly thereafter, numerous articles appeared in the following year. Of these, the comparison of Wikipedia’s accuracy to Encyclopaedia Britannica by Giles131 was most widely disseminated.

128 Mostafa Mesgari et al. “"The sum of all human knowledge": A systematic review of scholarly research on the content of Wikipedia”. In: Journal of the Association for Information Science and Technology 66.2 (2015), pp. 219–245.
129 Zeynep Tufekci. “Big questions for social media big data: Representativeness, validity and other methodological pitfalls”. In: Proceedings of ICWSM. 2014.
130 Andrew Lih. “Wikipedia as participatory journalism: Reliable sources? Metrics for evaluating collaborative media as a news resource”. In: Proceedings of the International Symposium on Online Journalism. 2004.
131 Jim Giles. Internet encyclopaedias go head to head. 2005.

Much of this early work focused on editor motivation and hierarchy formation, trying to determine how high-quality writing could reliably emerge from spontaneous editor communities [132]; this early work built a foundational understanding of editor incentives and stratification that still informs much of today's research [133,134]. The growth in traffic caused a shift to maintenance work and internal debate rather than creation of new content [135]. This radical refocusing from content authoring to bureaucracy led to a "rational effort to organize" through policies and guidelines [136]. While our work focuses on deletion debates, prior work has also studied deliberation and argument in other administrative venues, like Requests for Comment [137] and on talk pages [138].

After this expansion period, the site experienced a long, steady decline in the following decade. The slowdown was noted almost immediately and attributed to three factors: an increase in overhead necessary for "maintenance" and administrative tasks for the larger community; newcomers turning away due to exclusion and gatekeeping from existing editors; and structural resistance to new edits through page protection and reverts [139]. The pattern of dropping activity continued for several more years, as the site matured and newcomer participation became even more difficult. While early decisions were made "by fiat" from user leaders or site founders, this was replaced over time by a decentralized network of editor committees, administrators, policies, and decision-making forums [140].

Today, much authority on the site remains grounded in a small network of policies shaped early in the site's history, written originally in response to a period of heavy growth and necessary crowd control. Some of the earliest policies, like Notability (N), Verifiability (V), and No Original Research (NOR, see Figure 7), originated many years ago but continue to dominate user revision activity and discussion, and drive group decision-making [141], while newer rules are comparatively obscure [142]. Wikipedia's norms for editor interaction are "highly conservative" and long-lived in comparison to most other online communities. These policies have "calcified," with newer policies failing to thrive and see broad adoption in later years; editors have instead favored iteration and refinement of a core set of pages [143]. This effect is partially due to newcomer behavior, which has trended away from spending time editing the site's core content in favor of time spent in discussion on talk pages, administrative disputes, and bureaucracy.

[132] William Emigh and Susan C Herring. "Collaborative authoring on the web: A genre analysis of online encyclopedias". In: Proceedings of HICSS. IEEE. 2005.
[133] Andrea Forte and Amy Bruckman. "Why do people write for Wikipedia? Incentives to contribute to open-content publishing". In: Proceedings of Group (2005), pp. 6–9.
[134] Fernanda B Viegas et al. "Talk before you type: Coordination in Wikipedia". In: Proceedings of HICSS. IEEE. 2007, p. 78.
[135] Aniket Kittur et al. "He says, she says: Conflict and coordination in Wikipedia". In: Proceedings of CHI. ACM. 2007, pp. 453–462.
[136] Brian Butler, Elisabeth Joyce, and Jacqueline Pike. "Don't look now, but we've created a bureaucracy: The nature and roles of policies and rules in Wikipedia". In: Proceedings of CHI. ACM. 2008, pp. 1101–1110.
[137] Jane Im et al. "Deliberation and Resolution on Wikipedia: A Case Study of Requests for Comments". In: Proceedings of CSCW (2018), p. 74.
[138] Khalid Al Khatib et al. "Modeling Deliberative Argumentation Strategies on Wikipedia". In: Proceedings of ACL. Vol. 1. 2018, pp. 2545–2555.
[139] Bongwon Suh et al. "The singularity is not near: Slowing growth of Wikipedia". In: Proceedings of WikiSym. ACM. 2009, p. 8.
[140] Andrea Forte, Vanesa Larco, and Amy Bruckman. "Decentralization in Wikipedia governance". In: Journal of Management Information Systems 26.1 (2009), pp. 49–72.
[141] Simon DeDeo. "Group minds and the case of Wikipedia". In: Human Computation (2014).
[142] Bradi Heaberlin and Simon DeDeo. "The evolution of Wikipedia's norm network". In: Future Internet 8.2 (2016), p. 14.
[143] Aaron Halfaker et al. "The rise and decline of an open collaboration system: How Wikipedia's reaction to popularity is causing its decline". In: American Behavioral Scientist 57.5 (2013).

Figure 7: Top: Header of the No original research policy, which can be linked using aliases (OR, NOR, and ORIGINAL). Bottom: one specific subsection of that policy, which can be linked directly (WP:OI).

Articles for Deletion

In pursuit of better models for group decision-making, I chose to analyze Wikipedia's Articles for Deletion (AfD) discussion domain. Editors at AfD nominate pages to these discussions when they believe they should be removed from the wiki, and usually include a nominating statement giving a rationale for deletion. After nomination, a discussion is held open for at least seven days. Exceptions to this timeline exist and allow "speedy" resolution of discussions, for instance for libelous pages or plagiarism of copyrighted material. When a page is nominated to AfD, any user (including unregistered users, provided they sign their post with an IP address) can place a vote, which must include a rationale for why they believe an article should be kept or removed from the wiki. These votes are public, signed, and timestamped. Users can also make non-voting comments, either in direct reply to the nomination or in reply to a vote or other comments. The structure of these comments follows the standard "reply tree" model of online discussion forums [144]. AfD is highly active, with more than one third of all articles in the English-language administrative namespace Wikipedia: related to deletion debates.

Discussions are aggregated by an administrator, who determines the discussion outcome. This is not a popular vote; the final tally of a debate is not the deciding factor, though administrators rarely deviate from consensus. Administrators may also hold debates open for a longer period of time, or close discussions with a verdict of No consensus. If no consensus is reached, nominated articles are kept by default; deleting articles requires an unambiguous outcome.

As with the rest of Wikipedia, AfD is subject to a broad set of written and unwritten norms for social behavior. Many of these norms have been encoded into hundreds of written and highly visible policies, guidelines, or essays. Note that these are terms of art, clearly denoted by page templates. Policies reflect broad, mandatory consensus, while guidelines contain generally accepted principles and essays are advice without broad acceptance (for more detail, see Forte & Bruckman [145]). A long-running ideological divide in these debates exists on a spectrum between "deletionist" and "inclusionist." The former stance prefers high standards for material, culling less broadly relevant content and emulating the historical role of encyclopedias as gatekeepers. The latter stance argues for a reshaped role of information sources online, including, at its most extreme, any potentially valuable information that can be independently verified.

[144] Pablo Aragón et al. "Generative models of online discussion threads: State of the art and research challenges". In: Journal of Internet Services and Applications 8.1 (2017), p. 15.
[145] Andrea Forte and Amy Bruckman. "Scaling consensus: Increasing decentralization in Wikipedia governance". In: Proceedings of HICSS. IEEE. 2008, pp. 157–157.

Figure 8 gives an example of how these dynamics play out in practice for the article "Missed Call." The nominating statement cites the "Wikipedia is not a dictionary" policy and lack of sources to open the debate. This statement is followed by votes and comments, which also contain rationale texts. User preferences for Delete and Keep are given in bold, with some users voting to remove the page, some to keep, and discussion occurring through followup comments. After a long discussion and a total of eight votes and thirteen comments from ten total participants, the decision was made in favor of Keep.

Figure 8: Excerpt from a single AfD discussion, with a nominating statement, five votes, and four comments displayed. Votes labeled in bold are explicit preferences (or stances), which are masked in our tasks.

Prior work on AfD debates

Substantial work on AfD has already taken place. The first detailed study of deletion decisions was conducted by Taraborelli & Ciampaglia [146]. This work found a herding effect among participants, where later votes were highly influenced by the early tally of votes. It also found that user voting patterns could be well described with a clustering model containing only two clusters, which coarsely corresponded to "inclusionist" and "deletionist" users. The findings suggested significant biases in user behavior and made recommendations for more sophisticated analyses.

Next, a comprehensive early study attempted to directly quantify the quality of AfD debates [147]. They approached this problem by looking for articles that were deleted but later re-created, or kept but later re-nominated for deletion. They found a number of factors that led to good decision quality, like larger group size, groups that were diverse in experience level (but not groups heavy on recruited users or newcomers), and decisions made by unbiased administrators.

Geiger and Ford [148] later analyzed debates and found a deep disconnect between the participants in debates and the authors that produced content. In particular, they found that an overwhelming majority of debates included no first-time participants at all, and that it was rare for article authors to participate in the discussion about their own article (under 20% of discussions). Later, Schneider et al. performed a qualitative review of 72 debates [149], conducting a close read of specific debates, recommending further research on the divide between readers and editors and on the obscure requirements and norms placed upon newcomer editors, and suggesting that the order of votes had a significant effect on debate outcomes.

Following this work, Joyce et al. [150] tested a series of hypotheses on how rules and hierarchies interact with success in AfD. They replicated the prior finding that votes do predict outcomes, but added nuance on the use of seniority and policy, making several observations and doing a close study of two policy categories in particular, Notability and Ignore All Rules; the former was found to be universally predictive of successful votes, while the latter was correlated with success for Keep votes, but not Delete.

More recently, beginning in 2014, Xiao et al. undertook a series of mixed-methods studies of rationales in AfD votes [151]. They again replicated the finding that vote counts did significantly predict outcomes. Contradicting Joyce et al., they found that Notability topics were still the most common topic of argument in the domain but found no significant correlation between outcomes and the percentage of notability citations in discussions. They also surveyed topics for likely outcomes, and found significant relationships: biographies and for-profit companies were more likely to be deleted than other topics, while locations and events were more likely to be kept. Later work by the same researchers has avoided making outcome prediction a central research question, instead prioritizing discourse analysis: their most recent work has studied sentiment analysis [152], imperatives [153], and tree-style data visualization [154] for AfD, among other topics.

I situate my investigation in this body of work, with key results summarized in Table 3. To develop a more sophisticated model for analyzing the context of online debate, I begin my analysis with replication of specific findings from this prior work. I then move on to new observations. Our corpus contains a more comprehensive set of debates, both more recent and more thorough, nearly doubling the raw size of the largest prior studies. As a result, replication or failure to replicate may be a product of the larger sample size rather than a direct contradiction of past findings.

[146] Dario Taraborelli and Giovanni Luca Ciampaglia. "Beyond notability. Collective deliberation on content inclusion in Wikipedia". In: IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshop. 2010, pp. 122–125.
[147] Shyong K Lam, Jawed Karim, and John Riedl. "The effects of group composition on decision quality in a social production community". In: Proceedings of Group. ACM. 2010, pp. 55–64.
[148] R Stuart Geiger and Heather Ford. "Participation in Wikipedia's article deletion processes". In: Proceedings of WikiSym. 2011, pp. 201–202.
[149] Jodi Schneider, Alexandre Passant, and Stefan Decker. "Deletion discussions in Wikipedia: Decision factors and outcomes". In: Proceedings of WikiSym. ACM. 2012, p. 17.
[150] Elisabeth Joyce, Jacqueline C Pike, and Brian S Butler. "Rules and roles vs. consensus: Self-governed deliberative mass collaboration bureaucracies". In: American Behavioral Scientist 57.5 (2013), pp. 576–594.
[151] Lu Xiao and Nicole Askin. "What influences online deliberation? A Wikipedia study". In: Journal of the Association for Information Science and Technology 65.5 (2014), pp. 898–910.
[152] Lu Xiao and Niraj Sitaula. "Sentiments in Wikipedia Articles for Deletion Discussions". In: International Conference on Information. Springer. 2018, pp. 81–86.
[153] Lu Xiao and Jeffrey Nickerson. "Imperatives in Past Online Discussions: Another Helpful Source for Community Newcomers?" In: Proceedings of HICSS. 2019.
[154] Ali Javanmardi and Lu Xiao. "What's in the Content of Wikipedia's Article for Deletion Discussions?" In: Proceedings of The Web Conference (WWW). 2019, pp. 1215–1223.

Table 3: Summary of key findings from prior AfD studies. Our released corpus of 423k debates (2005-2018) contains a superset of all data in these papers, except early debates from 2003-04 in Taraborelli & Ciampaglia (2010).

Prior Work | Corpus | Key Findings
Taraborelli & Ciampaglia (2010) | 223k debates, 2003-10 | Early voters cause "herding." Individual users maintain Delete/Keep preferences across debates.
Lam et al. (2010) | 158k debates, 2005-09 | Larger groups with a diversity of tenure produce better decisions. Recruiting creates biased groups but does not hurt decision quality. Bias of individual administrators can lower quality.
Geiger & Ford (2011) | 120k debates, 2007-11 | Small groups dominate AfD. Article creators rarely participate. 96% of participation comes from repeat editors and 74% of debates have no newcomers.
Joyce et al. (2013) | 588 debates, pre-2012 | Vote tallies and comment activity predict outcomes. Admin influence on outcomes is not significant. Citing the WP:IAR policy helps Keep votes.
Schneider et al. (2012-13) | 72 debates, Jan. 2011 | Novices and experts use different arguments. Both can be ineffective: novices make ineffective use of policy, while experts lean too much on boilerplate.
Xiao et al. (2014-19) | Subsets from 2010-2015 (229, 5k, 39k debates) | Notability dominates AfD rationales. Some topics, like biographies, have more unanimous outcomes than others. Keep votes have more positive sentiment. Expert editors frequently give imperative commands to newcomers.

Relevant Prior Work on Group Decision-Making

In group decision-making tasks, members participate in a constrained discussion, where they must choose from a fixed set of possible outcomes and there is no objective right answer [155]. Participants must debate the merits of the different choices and correctness is a judgment call, with persuasive arguments for multiple options [156]. In these tasks, dysfunction leads to poor outcomes, with low-quality discussion that fails to effectively fit together information from different group members [157]. High-performing groups, by contrast, have consistent characteristics like shared values, mental models, and communication styles, with nuanced patterns of conflict and consensus-building [158].

Group decision-making extends to online settings, where users in online production communities want to make good choices that will improve their collaboration over time. They accomplish this through intricate systems of social norms and cues for resolving conflicts [159]. But the details of how these decisions are made can be difficult to analyze or measure quantitatively. While much of online decision-making happens in free-form text discussions, much quantitative research ignores the details of this practice, "observing change from before to after the deliberation without considering what has happened during the discussion" [160]. Detailed analysis of discourse practices in the texts of online decision-making has seen less fruitful research activity compared to study of easier-to-quantify metadata or social network ties [161]. In behavioral science, questions are often explored through structural equation modeling and multivariate regressions, allowing behavioral scientists sophisticated control over exogenous (fixed, external) variables, like demographics and task conditions [162], as well as process variables that describe observable behaviors in the groups being studied. Reducing team dynamics from text transcripts to quantitative process variables is computationally complex; in practice, text data is often ignored in favor of proxies like count statistics or, more frequently, participant survey responses [163].

A key reason is that getting at more sophisticated patterns is complex; most social science research avoids the question of extracting structure directly from text, relying instead on directly observable variables and survey data, or on simulation [164] with explicit preferences encoded in modeled agents. In most work on group decision-making, automated discourse analysis is rare (with some notable exceptions, like work in some collaborative learning settings [165] and my own earlier work prior to working on automated essay scoring [166,167]). As a result, the proxies for explanation that come from much of the research on groups are reliable stand-ins, but put a limit on the types of questions that can be asked. Scientists studying teams may wish to evaluate which voices truly influenced a conversation, gauge the diversity of people or ideas represented in those influential roles, and measure observed conflicts and consensus-building. They may also want to assess whether any particular participant impacted the discussion and use these variables in aggregate to find which processes impact quality. This data is difficult to extract from discussion transcripts.

[155] Joseph Edward McGrath. Groups: Interaction and Performance. Vol. 14. Prentice-Hall, Englewood Cliffs, NJ, 1984.
[156] Ignacio J Pérez et al. "On dynamic consensus processes in group decision making problems". In: Information Sciences 459 (2018), pp. 20–35.
[157] Wendy P Van Ginkel and Daan van Knippenberg. "Group information elaboration and group decision making: The role of shared task representations". In: Organizational Behavior and Human Decision Processes 105.1 (2008), pp. 82–97.
[158] Garold Stasser and William Titus. "Pooling of unshared information in group decision making: Biased information sampling during discussion." In: Journal of Personality and Social Psychology 48.6 (1985), p. 1467.
[159] Brian Keegan and Casey Fiesler. "The Evolution and Consequences of Peer Producing Wikipedia's Rules". In: Proceedings of ICWSM (2017).
[160] Jennifer Stromer-Galley and Peter Muhlberger. "Agreement and disagreement in group deliberation: Effects on deliberation satisfaction, future engagement, and decision legitimacy". In: Political Communication 26.2 (2009), pp. 173–192.
[161] Dennis Friess and Christiane Eilders. "A systematic review of online deliberation research". In: Policy & Internet 7.3 (2015), pp. 319–339.
[162] Gordon W Cheung and Rebecca S Lau. "Testing mediation and suppression effects of latent variables: Bootstrapping with structural equation models". In: Organizational Research Methods 11.2 (2008), pp. 296–325.
[163] Daniel J Beal et al. "Cohesion and performance in groups: A meta-analytic clarification of construct relations." In: Journal of Applied Psychology 88.6 (2003), p. 989.
[164] Francisco Chiclana et al. "A statistical comparative study of different similarity measures of consensus in group decision making". In: Information Sciences 221 (2013), pp. 110–123.
[165] Carolyn Rosé et al. "Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning". In: International Journal of Computer-Supported Collaborative Learning 3.3 (2008), pp. 237–271.
[166] Elijah Mayfield et al. "Computational representation of discourse practices across populations in task-based dialogue". In: Proceedings of the International Conference on Intercultural Collaboration. ACM. 2012, pp. 67–76.
[167] Elijah Mayfield, David Adamson, and Carolyn Penstein Rosé. "Recognizing rare social phenomena in conversation: Empowerment detection in support group chatrooms". In: Proceedings of ACL. Vol. 1. 2013, pp. 104–113.

In the study of groups and teams, measuring discussion quality – plainly, what makes a group debate good? – is an open research area. Controlled behavioral studies have shown [168], for instance, that creativity, diversity, and conflict have major roles to play in the quality of teamwork. But the value of diverse discussion and open conflict is complicated, with a long history of positive, negative, and null results, depending on the narrow construct being studied [169]. What is clear is that the particulars of how teams are composed and how teammates interact with each other matter a great deal for effective group work [170,171].

Of course, large-scale corpus analysis is common in natural language processing, with many efficient representations of the complex underlying meaning of texts. Group discussion data is commonly used in NLP research. Datasets include the multiparty in-person group work of the AMI meeting corpus [172] and the pair task-based dialogues in the MapTask corpora [173]. In online contexts, group debates have been analyzed for tasks like argument mining [174] and stance classification [175], among others. Outside of NLP venues, though, most studies of groups and organizations do not perform sophisticated text mining or analysis. Methods vary: some research focuses on fuzzy logic or economic agent modeling [176], while others focus on social factors, network analysis, and the interactive aspects of teams [177,178].

Corpus Development

To explore this domain in detail, as part of the work for this proposal I collected a large offline corpus of Articles for Deletion discussions. This snapshot contains the full text of all AfD debates in the English-language Wikipedia from January 1, 2005 to December 31, 2018. Prior to 2005, community norms, discussion formatting, and the deletion process were more erratic, making automated extraction difficult and limiting any findings even if the data were successfully extracted (for this same reason, the corpus study in Lam et al. [179] also chose the January 2005 starting point). In addition to the raw text, this corpus is structured with extracted metadata, specifically timestamps, outcomes, nominations, votes, users, and policy citations. A total of 402,440 discussions were extracted. For the analyses that follow, I then filtered out two categories of discussions, mostly from earlier years in our corpus, when formatting norms were less standardized:

• Discussions without an outcome label from an administrator (20,669 instances, or 5.1%).

• Discussions that received no votes after nomination (12,179 instances, or 3.0%).

[168] Heather M Caruso and Anita Williams Woolley. "Harnessing the power of emergent interdependence to promote diverse team collaboration". In: Diversity and Groups. Emerald Group Publishing Limited, 2008, pp. 245–266.
[169] Karen A Jehn, Gregory B Northcraft, and Margaret A Neale. "Why differences make a difference: A field study of diversity, conflict and performance in workgroups". In: Administrative Science Quarterly 44.4 (1999), pp. 741–763.
[170] Frances J Milliken, Caroline A Bartel, and Terri R Kurtzberg. "Diversity and creativity in work groups". In: Group Creativity: Innovation through Collaboration (2003), pp. 32–62.
[171] Steve WJ Kozlowski and Daniel R Ilgen. "Enhancing the effectiveness of work groups and teams". In: Psychological Science in the Public Interest 7.3 (2006), pp. 77–124.
[172] Iain McCowan et al. "The AMI meeting corpus". In: Proceedings of the Conference on Methods and Techniques in Behavioral Research. Vol. 88. 2005, p. 100.
[173] Anne H Anderson et al. "The HCRC map task corpus". In: Language and Speech 34.4 (1991), pp. 351–366.
[174] Fiona Mao, Robert Mercer, and Lu Xiao. "Extracting imperatives from Wikipedia article for deletion discussions". In: Proceedings of the Workshop on Argumentation Mining at ACL. 2014, pp. 106–107.
[175] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. "From argumentation mining to stance classification". In: Proceedings of the Workshop on Argumentation Mining at NAACL. 2015, pp. 67–77.
[176] Ignacio J Pérez et al. "On dynamic consensus processes in group decision making problems". In: Information Sciences 459 (2018), pp. 20–35.
[177] John M Levine, Lauren B Resnick, and E Tory Higgins. "Social foundations of cognition". In: Annual Review of Psychology 44.1 (1993), pp. 585–612.
[178] J Richard Hackman. Collaborative Intelligence: Using Teams to Solve Hard Problems. Berrett-Koehler Publishers, 2011.
[179] Shyong K Lam, Jawed Karim, and John Riedl. "The effects of group composition on decision quality in a social production community". In: Proceedings of Group. ACM. 2010, pp. 55–64.

Table 4: Overall breakdowns of labels across all data (percentages).

 | Delete | Keep | Merge | Redirect | Other
Votes (2005-2018) | 54.9 | 28.4 | 3.6 | 3.8 | 9.3
Outcomes (2005-2018) | 63.9 | 20.7 | 3.2 | 6.0 | 6.2
Prior work (Taraborelli & Ciampaglia, 2003-2010) | 63.6 | 23.6 | 3.9 | 1.9 | 7.0

After these exclusions, our analysis covers 369,592 debates. Table 4 shows percentages for each label for vote and outcome distributions in the analyzed subset. To analyze policy norms, I manually assembled a list of frequently cited links in AfD discussions. Editors can link to overall policy pages or directly to subsections; additionally, many pages and subpages can be linked using any of a number of shortcut aliases. The taxonomy I built includes 37 policy pages with 377 sections, 44 guideline pages with 398 sections, and 71 essay pages with 201 sections, all linked by a total of 2,111 aliases. For each contribution, I extract all hyperlinks to any one of the aliases in our taxonomy (a sketch of this extraction follows the list below). While this is a reasonable proxy, there are three reasons why this is not comprehensive:

• Most citations are added intentionally by the editor who signs the contribution; however, some are added after the fact (like links to the SIGNATURES policy, appended by bots to unsigned posts, along with the username or IP address logged for the contribution).

• While this taxonomy includes all official policies and the vast majority of guideline and essay citations in AfD, there is a long tail of rarely-cited essays and pages that are not comprehensively included in our taxonomy and were not extracted.

• I do not capture citations to policies without MediaWiki links to those pages (merely writing "NBIO" to refer to the notability pol- icy on biographies, for instance, instead of writing "[[WP:NBIO]]" to include a link). Editors generally follow formatting conventions and include links when appropriate, but this is a source of missing data.
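As a rough illustration of this alias-based extraction, the following sketch uses a hypothetical, heavily truncated alias table and an illustrative regular expression; it is not the released taxonomy or library, only an example of the general approach:

```python
import re

# Hypothetical excerpt of the alias taxonomy described above: each shortcut
# alias maps to the canonical policy, guideline, or essay page it points to.
POLICY_ALIASES = {
    "WP:N": "Notability",
    "WP:V": "Verifiability",
    "WP:NOR": "No original research",
    "WP:OR": "No original research",
    "WP:OI": "No original research#Original images",
}

# MediaWiki links look like [[WP:NOR]] or [[WP:NOR|piped display text]].
LINK_PATTERN = re.compile(r"\[\[(WP:[^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]", re.IGNORECASE)

def extract_policy_citations(wikitext: str) -> list[str]:
    """Return canonical page names for every aliased link found in one contribution."""
    citations = []
    for match in LINK_PATTERN.finditer(wikitext):
        alias = match.group(1).upper()
        if alias in POLICY_ALIASES:
            citations.append(POLICY_ALIASES[alias])
    return citations

# Example: a Delete vote citing the notability and verifiability policies.
print(extract_policy_citations(
    "'''Delete''' - fails [[WP:N]] and has no sources per [[WP:V|verifiability]]."
))
# ['Notability', 'Verifiability']
```

As the third bullet above notes, bare mentions such as "NBIO" without a MediaWiki link would not be captured by an approach like this.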

Corpus Preprocessing

Compared to the broader internet, Wikipedia is simpler to preprocess due to the rigid formality of the archival process, the MediaWiki markup language, and enforced community standards. For most tasks, this approach is able to extract names, timestamps, and labels with only regular expressions.

Extracting timestamps: AfD discussion norms require that all contributions are signed using a standard format, which includes the contributor's username or IP address and a timestamp in UTC format. These signatures are highly formulaic and easy to extract, because they can be automatically generated by MediaWiki's ~~~~ shorthand. When users do not sign contributions, bots add signatures, along with a citation to the SIGNATURES policy. All lines following the outcome are checked for timestamps in Wikipedia standard format:

\d\d:\d\d, \w+ \d+, 20\d\d (UTC)

In regular expressions, \w matches any letter and \d matches any numeric character. A + suffix captures one or more consecutive characters of that type.

Extracting outcomes: AfD discussions are archived in a specific format with only minor variation, and can be easily extracted for structured representation. I define a discussion as having an outcome if its archival page includes a header line with one of three fixed phrases (ignoring whitespace):

The result of the debate was [x]
The result was [x]
The result of this discussion was [x]

This pipeline saves the captured string [x] as the debate outcome. When these lines are timestamped, the user and timestamp of the outcome are also logged as available metadata for analysis.

Extracting nominations, votes, and comments: If a timestamped contribution appears at the top of the discussion, prior to any votes, it is treated as a nomination. These statements have become more common over time: while they occur in only 67% of nominations in 2005, they were rapidly adopted and are present in 98% of nominations since 2008; under present policy, omitting a nominating statement is an acceptable reason for "speedy" dismissal and a default Keep outcome for an AfD nomination.

Following the nominating statement, any timestamped line is captured as either a vote or a comment. I define votes as any timestamped line beginning with a bolded phrase, following Wikipedia convention for contributions:

* '''[y]'''

Posts beginning with one or more leading asterisks create a bulleted, threaded discussion. Words or phrases surrounded with three apostrophes create '''bolded''' text. The value of this bolded text [y] is captured and stored. If no bolded phrase is present, but the line is still signed and timestamped, that line is treated as a comment. Lines beginning with the bolded phrase "Comment" are also treated as comments. Lines beginning with "Note" are automatically generated, typically for categorizing discussions by topic, and are discarded. Lines with "Relist" bolded are administrative notes to keep the discussion open for longer than the typical seven days, and are also discarded. Lines with no timestamped signature are discarded.
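A minimal sketch of this line-level extraction, assuming the regular expressions above and simplifying the handling of edge cases; the function and pattern names are illustrative, not the released library:

```python
import re

# Illustrative patterns approximating the rules described above.
TIMESTAMP_RE = re.compile(r"\d\d:\d\d, \w+ \d+, 20\d\d \(UTC\)")
OUTCOME_RE = re.compile(
    r"The result (?:of the debate |of this discussion )?was\s*(.+)", re.IGNORECASE
)
VOTE_RE = re.compile(r"^\*+\s*'''(.+?)'''")  # leading asterisks, then a bolded label

def parse_line(line: str):
    """Classify one archived AfD line as an outcome, vote, or comment (or discard it)."""
    outcome = OUTCOME_RE.search(line)
    if outcome:
        return {"type": "outcome", "label": outcome.group(1).strip()}
    if not TIMESTAMP_RE.search(line):
        return None  # unsigned, untimestamped lines are discarded
    vote = VOTE_RE.match(line)
    if vote:
        label = vote.group(1)
        if label in {"Note", "Relist"}:
            return None  # automatically generated or administrative lines are discarded
        if label == "Comment":
            return {"type": "comment"}
        return {"type": "vote", "label": label}
    return {"type": "comment"}

print(parse_line("*'''Delete''' Fails [[WP:N]]. --Example 01:23, January 5, 2015 (UTC)"))
# {'type': 'vote', 'label': 'Delete'}
```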

Several alternative solutions to deletion exist; each maintains the content of the page while deleting the page itself. In the five-label case, Merge and Redirect, the two most common alternate outcomes, are represented separately, in line with prior work; in the two-label case they are merged in with Delete. All other values ("Userfy", "Transwiki", "Move", and "Incubate") are grouped together as Other in the five-label case; in the two-label case they are merged in with Keep. Votes and outcomes of "Close", "Withdraw", and "Cancel" are treated as Keep outcomes, as the page as well as its content is fully maintained. Copyright violations are treated as a Delete outcome, as the content is deleted as a result of the outcome. Any given vote or outcome is represented as a set that can contain zero or more normalized labels. Therefore, the probability of a vote for a particular label is not drawn from a distribution; probabilities of each label in L are disjoint.

Extracting users: For each nomination, outcome, vote, or comment, I log the user whose signature immediately appears before the timestamp, either with a MediaWiki link to their User page or their User Talk page:

[[User:[z]
[[User Talk:[z]

I extract [z] as a username and associate it with the nomination, outcome, vote, or comment where it was captured. When user signatures link to both User and User Talk pages and those usernames differ, the Talk page's username is prioritized.

The public release of this corpus includes designated fold assignments for reproducible results and future comparisons against baselines on the 5% subset used in this work. It also includes two formats for experimenting with the full corpus: a 10-fold cross-validation split, as well as a single train/validation/test split for use with more resource-intensive classifiers, especially neural methods. The library that I developed for producing these variables is written in Python and compatible with standard implementations of BERT and a standard JSON format for representing group discussions. After publication of the papers associated with this work, I also released an update to make the corpus compatible with standard Pandas dataframes.

The public release of this data includes the full corpus, including the 8.1% of filtered nominations with no discussion or missing outcomes, and labels for all votes and outcomes in three levels of granularity. The released corpus can be normalized to a two-label model using only Keep and Delete, or to a five-label model that also maintains separate categories for Merge, Redirect, and Other. The corpus also preserves the raw text of votes and outcomes, which includes a very long tail of free-form inputs. For all analyses in this work, I use the two-label model, but the released source code includes options for using the five-label variant. Finally, I have released the manually constructed taxonomy of policies, guidelines, essays, and aliases, and a sample of the format used for this data in JSON format.
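To make the normalization rules concrete, here is a minimal sketch assuming a hypothetical mapping table; the released code may handle the long tail of raw outcome strings differently:

```python
# Raw outcome strings and groupings follow the rules described above; the
# function name and exact string matching are illustrative only.
FIVE_LABEL = {
    "delete": "Delete", "keep": "Keep", "merge": "Merge", "redirect": "Redirect",
    "userfy": "Other", "transwiki": "Other", "move": "Other", "incubate": "Other",
    # Close/Withdraw/Cancel leave the page and its content intact, so they count as Keep;
    # No consensus defaults to Keep; copyright violations remove content, so they count as Delete.
    "close": "Keep", "withdraw": "Keep", "cancel": "Keep", "no consensus": "Keep",
    "copyright violation": "Delete",
}

TWO_LABEL = {"Delete": "Delete", "Merge": "Delete", "Redirect": "Delete",
             "Keep": "Keep", "Other": "Keep"}

def normalize(raw_outcome: str, scheme: str = "two"):
    label = FIVE_LABEL.get(raw_outcome.strip().lower())
    if label is None:
        return None  # long tail of free-form outcomes, not covered by this sketch
    return label if scheme == "five" else TWO_LABEL[label]

print(normalize("Redirect"))        # Delete (two-label case)
print(normalize("Userfy", "five"))  # Other
```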

Figure 9: Distributions by year for votes (left) and outcomes (right) over Wikipedia's history.

Corpus Overview

This corpus enables the first comprehensive review of activity statistics in AfD since 2010 [180]. Figure 9 shows distributions of voting preferences over time, separating votes and outcomes. Vote totals approximately match reported distributions from work early in Wikipedia's history; however, I find a much narrower spread between Delete and Keep votes compared to early work. While that work showed a 40-point margin in favor of Delete (64% to 24%), I only observe a 26.5-point margin. Additionally, I measure distributions of final administrative outcomes, and find that outcomes are more deletionist than votes, with Keep comprising over 28% of votes but fewer than 21% of final outcomes. Part of this is driven by the increased length and controversy of discussions that lead to Keep outcomes: more votes are cast per debate than in uncontroversial Delete decisions. Additionally, the presence of long-tail rare labels is more common in outcomes than in votes, such as references to multiple outcomes ("Merge and Delete," for instance), or outcomes resulting in No consensus (which defaults to a Keep outcome, functionally).

This gap is partially explained by the difference in time period observed in our dataset. Delete votes were already becoming less common in the later years of that study's window of observation, a pattern that has since been maintained. The decline in site activity was linked to a continued decrease in Delete votes, falling from a peak of 64.1% in 2006 to a low of 47.0% in 2013, then seeing a modest resurgence but mostly stabilizing over the last decade at levels lower than the early peak.

[180] Dario Taraborelli and Giovanni Luca Ciampaglia. "Beyond notability. Collective deliberation on content inclusion in Wikipedia". In: IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshop. 2010, pp. 122–125.

Figure 10: Counts of discussions per year (blue) and of votes, comments, and citations per discussion in each year.

Mirroring the overall drop in editor activity, voting activity in debates has declined over time. After reaching a peak of 6.9 votes per discussion in 2006, activity declined, and discussions have averaged 4.3 votes over the past ten years. Figure 10 gives volume counts for discussions over time. In contrast to the decline in voting activity, the data shows a slow and steady growth of citation to policy. In early years of the site, votes outnumbered citations to policy by a ratio of more than ten to one. In the most recent year, vote and citation counts are near parity. Throughout this analysis, I also measure success rates for votes, defining a user's vote as successful if its label matches the outcome decided by an administrator. Across all votes in AfD's history, 67.9% have been successful (matching the final outcome of the debate). This number rises to 75.6% when only considering votes for Keep or Delete outcomes. Overall, deletionism is a "safer" bet: Delete votes are successful 82.0% of the time, while Keep votes have a 64.0% success rate.

User Distributions

The full set of contributors to our corpus is made up of 161,266 editors whose activity follows a log-normal distribution; by log-likelihood ratio, log-normal more closely fits contribution counts than other heavy-tailed distributions, p < 0.01. This distribution is visualized in Figure 11.

Half of all contributions are made by 1,218 users, or just under 0.8% of editors present in our corpus. In contrast, 124,826 observed users (77.4%) contributed fewer than 5 edits; cumulatively, they account for only 5.7% of the observed data. Most frequently, users enter AfD to participate in a single debate and never return. These results replicate the observation from early work by Schneider et al. [181] and Geiger & Ford [182] that AfD is dominated by long-time members rather than newcomers; in fact, as this trend has increased in recent years, the distributions observed are more extreme than what has been previously reported.

With this corpus in hand and these initial analyses as a context and sanity check on the dataset, we are now ready to move on to our machine learning tasks in the domain.

[181] Jodi Schneider, Alexandre Passant, and Stefan Decker. "Deletion discussions in Wikipedia: Decision factors and outcomes". In: Proceedings of WikiSym. ACM. 2012, p. 17.
[182] R Stuart Geiger and Heather Ford. "Participation in Wikipedia's article deletion processes". In: Proceedings of WikiSym. 2011, pp. 201–202.
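For readers who want to run a comparable heavy-tailed comparison, a log-likelihood ratio test of this kind can be sketched with the powerlaw package; the data below is a synthetic stand-in for the per-editor contribution counts from the corpus:

```python
import numpy as np
import powerlaw  # Alstott et al.'s package for fitting heavy-tailed distributions

# Stand-in data: one contribution count per editor. Replace with the per-editor
# counts derived from the released corpus.
rng = np.random.default_rng(0)
contribution_counts = np.maximum(1, rng.lognormal(mean=0.3, sigma=1.4, size=50_000).astype(int))

fit = powerlaw.Fit(contribution_counts, discrete=True)
# Positive R favors the first-named distribution; p gives the significance of the ratio test.
for alternative in ("power_law", "exponential", "truncated_power_law"):
    R, p = fit.distribution_compare("lognormal", alternative)
    print(f"log-normal vs. {alternative}: R={R:.2f}, p={p:.4f}")
```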

Figure 11: Log-log plot of user rank and contributions. The top 36,440 users, all with at least five contributions, are displayed. Collectively, these 22.6% of all users account for 94.3% of all contributions.

Predictions in Wikipedia

I’ll be modeling the discourse of AfD with two classification tasks for studying group decision-making processes.

• Stance classification, a fine-grained, fully supervised classification task for individual contributions to a discussion.

• Outcome prediction, a distantly supervised task requiring far less annotated training data for new domains.

In other fields, the term "preference" is often used where NLP researchers would say "stance." Throughout this work, I use these terms mostly interchangeably. To begin, I'll demonstrate that these tasks are tractable for NLP researchers today, especially with modern language representations like BERT [183]. This contextual representation is highly accurate in both supervised tasks and produces interpretable results for the unsupervised task, suggesting it is ready for immediate application in social science research. Throughout the analyses to follow, I use the following notation:

[183] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: Proceedings of NAACL. 2019.

• A single deletion discussion is labeled d. It has a series of contributions [c_0, c_1, ..., c_N].

• Each contribution c_i has a corresponding username u_i, vote label l_i (null for comments), timestamp t_i, and a rationale text r_i (which might be empty).

• From a discussion d, a machine learning classifier extracts representative features f and produces a posterior probability distribution P(l | f), where the total probability of all labels sums to 1. The features of a discussion at the moment contribution c_i was posted (all comments up to timestamp t_i) are represented as f_i. (A minimal data-structure sketch of this notation follows the list.)
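A minimal data-structure sketch of this notation; field names are illustrative and are not the schema of the released JSON format:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Contribution:                 # one c_i
    user: str                       # u_i
    timestamp: str                  # t_i (UTC string as extracted)
    rationale: str                  # r_i, possibly empty
    label: Optional[str] = None     # l_i: "Keep"/"Delete" for votes, None for comments

@dataclass
class Discussion:                   # one d
    title: str
    contributions: list = field(default_factory=list)
    outcome: Optional[str] = None   # label recorded by the closing administrator

    def prefix(self, i: int) -> "Discussion":
        """The discussion as observed at the moment c_i was posted (used to build f_i)."""
        return Discussion(self.title, self.contributions[: i + 1], None)
```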

In the corpus as released for public use, I provide two possible labeling schemes L: a 2-label case for binary classification, used for new experiments and analysis, and a 5-label case for direct comparison with prior work like Lam et al. [184].

L_2 = {Delete, Keep}
L_5 = {Delete, Keep, Merge, Redirect, Other}

Machine learning is performed as described in Chapter 2, with text representations varying between a bag-of-words model, a GloVe dense embedding, and a BERT contextual embedding.

[184] Shyong K Lam, Jawed Karim, and John Riedl. "The effects of group composition on decision quality in a social production community". In: Proceedings of Group. ACM. 2010, pp. 55–64.

The BERT model was already trained on Wikipedia texts (among other sources), so I perform no fine-tuning. This may mean text from the corpus is included in the BERT-Base training data, causing a minuscule exposure to test data in my experimental setup; I do not investigate this question here, but note it as a complicating factor. Experiments represent average results of 10-fold cross-validation. All instances from a particular discussion appear in only one fold; there is never crossover from the same debate between train and test data. I report results on a randomized subset of 5% of the corpus, approximately 20,000 discussions. In preliminary evaluation, a 20x growth in training data increased computational resources beyond what is practical for social scientists, for model accuracy improvements of less than 1%; I provide training splits (for potential future approaches that benefit from larger corpora) in the released corpus.
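The constraint that no discussion crosses folds can be expressed with scikit-learn's grouped splitter; this is a sketch of the idea, not the released pipeline:

```python
from sklearn.model_selection import GroupKFold

# `instances` is assumed to be a list of (feature_vector, label, discussion_id) triples.
def grouped_folds(instances, n_splits=10):
    X = [features for features, _, _ in instances]
    y = [label for _, label, _ in instances]
    groups = [disc_id for _, _, disc_id in instances]
    splitter = GroupKFold(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(X, y, groups):
        # No discussion_id appears in both train_idx and test_idx.
        yield train_idx, test_idx

# usage: for train_idx, test_idx in grouped_folds(instances): ...
```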

Turn-Level Stance Classification

In most other collaborative team decision-making contexts, opinions are expressed but explicit stances are latent. Because of the unique format of Wikipedia discussions, those stances are easily extracted from '''bolded''' votes. I use this as a test case for building supervised classifiers which elicit participant stance based on their statements alone. All bolded text is masked from rationales and models must predict what vote is associated with a given rationale.

Fundamentally this is a test of how closely the Wikipedia domain hews to other decision-making contexts. If rationales are not sufficient to predict stances accurately, it means one of two things. Either rationales do not carry information about user preferences, and so are not comparable to group decision-making in contexts where those preferences are not explicitly labeled with votes; or the rationales do carry this information, but they are not tractable with current NLP methods. To evaluate this, I define a task to label each vote in each AfD discussion:

• Possible Labels: L = {Delete, Keep}

• Input: Rationale text r_i from a single vote.

• Features: A representation vector f(c_i).

• Output: A predicted stance l ∈ L.

I exclude non-voting comments from this analysis, as no gold labels are available for supervised training. Expansion to distant supervision, where user stances from votes are used as gold labels for that user's comments, is a possibility for future work.
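A minimal illustrative baseline for this task, assuming rationale texts with bolded votes already masked; the toy data and pipeline below are stand-ins, not the experimental setup reported next:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in data; the real task uses masked AfD vote rationales and labels.
rationales = ["Fails WP:N, no independent sources.",
              "Clearly notable, coverage in two national papers."]
labels = ["Delete", "Keep"]

stance_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),  # unigram and bigram features
    LogisticRegression(max_iter=1000),
)
stance_clf.fit(rationales, labels)
print(stance_clf.predict(["Delete per nom, fails notability."]))
```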

Table 5: Accuracy of stance classification models for individual contributions, based on rationale text alone.

Representation | Accuracy % | κ
Majority Class | 63.8 | 0.00
GloVe | 76.0 | 0.45
Bag-of-Words | 81.8 | 0.59
BERT | 82.0 | 0.60

User stances are explicitly given by users in the original corpus and there is no ambiguity; the upper bound for this task is 100% accuracy and κ = 1.0. Individual votes or comments have short rationales, however, typically only a sentence or a few words. Despite this, n-gram models provide a robust baseline, and while the BERT model outperforms a unigram baseline, the difference is small. Comparing embeddings, the newer contextualized BERT model outperforms GloVe by more than 6% absolute and 10% relative. Overall, this result shows that the stance classification task is tractable, with good accuracy. The positive result mirrors similar tasks that have been effective at labeling turns in prose text (see work by Wilson et al. [185] and other work with their MPQA corpus), in open-ended group dialogues [186] (including my own prior work [187]), and in stance classification for more open-ended contexts like social media [188]. The result gives a useful proof of concept that text rationales carry recognizable stance information and can be reliably classified by a machine learning model.

[185] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. "Recognizing contextual polarity in phrase-level sentiment analysis". In: Proceedings of EMNLP. 2005.
[186] Andreas Stolcke et al. "Dialogue act modeling for automatic tagging and recognition of conversational speech". In: Computational Linguistics 26.3 (2000), pp. 339–373.
[187] Jin Mu et al. "The ACODEA framework: Developing segmentation and classification schemes for fully automatic analysis of online discussions". In: International Journal of Computer-Supported Collaborative Learning 7.2 (2012), pp. 285–305.
[188] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. "From argumentation mining to stance classification". In: Proceedings of the Workshop on Argumentation Mining at NAACL. 2015, pp. 67–77.

Outcome Forecasting

The stance classification task above has limitations for practical use in other group decision-making research. Foremost, it requires training data with labeled votes, which is difficult to obtain in many cases. Moreover, the stances of individual votes in a discussion are too granular for process variables that aim to represent discussion dynamics overall.

A more relevant goal for social scientists is analysis of group discussions where the preferences of individuals are unlabeled, even in training data. Next, I aim to predict the consensus preference of a group, after discussion. This task measures whether language representations can model the many turns in a discussion and mimic the behavior of administrators. To do this, I give as input the rationale texts of nominations, votes, and comments throughout a discussion, and treat the label from administrative closure of a debate as the only supervised label of group consensus.

• Possible Labels: L = {Delete, Keep}

• Input: Discussion d, with nomination c_0, followed by votes and comments c_1 ... c_N. Each contribution c_i consists of:

– User ID u_i.

– Timestamp t_i.

– Rationale text r_i.

– Stance label l_i ∈ L, or for comments, l_i = ∅ (null). In experiments other than gold-label comparison, l_i is masked.

• Features: A representation vector f(d).

• Output: An outcome label l ∈ L.

For text embedding, I again extract features f_GloVe and f_BERT, but in this case there is a need to combine vectors from multiple contributions [c_0, c_1, ..., c_N] into a single vector for discussion d. To do so, I encode each contribution's rationale r_i separately (again removing all occurrences of '''bolded''' text to mask votes). I then average each contribution's vector, normalized for length:

f(d) = \frac{1}{N} \sum_{i=0}^{N} \frac{f(c_i)}{\ln(\operatorname{len}(r_i))}

Unlike in the first task, outcome prediction is distantly supervised and the task is sometimes undecidable; as discussed previously, administrators occasionally close conversations with results of No consensus. To evaluate an upper bound on model accuracy with masked preferences, I include a gold feature vector f*(d) where gold-standard user preference labels are made available for modeling. Specifically, for each possible l ∈ L, this vector includes the raw count and percent of votes that label received. While Wikipedia is not a direct democracy, administrators rarely deviate from consensus; this represents a good approximation of an upper bound on meaning representation from rationales alone.

I first train a machine learning model to forecast the outcome of debates from observable characteristics. This model is an estimate of the expressed preferences of a group; the underlying challenge for machine learning is to model the text from many turns in a discussion and mimic the consensus-forming judgment calls of administrators. For each possible vote label, I extract features including the raw count and percent of votes for that label. While Wikipedia is not a direct democracy, administrators rarely deviate from vote counts, so these vote counts are highly informative.

In addition to voting features, I again encode the rationale of each contribution from each discussion separately using the method described above, then average across the contribution vectors, normalized for length:

f_i = \frac{1}{i} \sum_{j=0}^{i} \frac{f(c_j)}{\ln(\operatorname{len}(r_j))}

The representation of a full discussion is simply the special case where all contributions are included, that is, i = N. Importantly, in this case the actual vote is stripped from the text of the rationale.

Model performance varies by a large margin depending on whether explicit votes are accessible to the model. Specifically, the model given access to stances of group members is able to predict outcomes with a Cohen's κ = 0.83 for full discussions. The BERT model also reaches good levels of agreement, outperforming other representations by at least 1.6% accuracy in absolute terms. Short discussions are more predictable, with the best-performing model reaching accuracy of 97.3% for short discussions of 5 or fewer total contributions that resulted in a Delete outcome, compared to 85.3% accuracy for long discussions of more than 10 contributions that resulted in a Keep outcome.

I then test the ability to make incremental predictions, observing only early contributions to a debate. Models are trained identically in this set of experiments; however, in the test set, I create a new instance for classification after each contribution to each discussion. Note that reported accuracy in this setup overweights more contentious debates: with more contributions, there are more instances from that discussion to classify in each test set. This slight bias results in over-representation of debates that ended in a Keep outcome, as those debates tend to have more contributions, and therefore increases the difficulty of the problem (Keep is a minority label and more challenging to predict). In this evaluation, all models see significant performance degradation, with lower accuracy when forecasting early in the debate. GloVe and bag-of-words models are more competitive, but BERT maintains the highest accuracy, with an overall accuracy of 79.7% across all instances and κ = 0.55.
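A minimal sketch of this length-normalized averaging, assuming some encoder function that maps a rationale string to a fixed-size vector (e.g. a GloVe average or a BERT sentence embedding); the encoder and example data here are stand-ins:

```python
import numpy as np

def discussion_vector(rationales, encode):
    """Compute f_i over the given prefix of contributions; pass all rationales for f(d)."""
    vectors = []
    for r in rationales:
        length = max(len(r.split()), 2)        # guard: ln(1) = 0, and empty rationales exist
        vectors.append(encode(r) / np.log(length))
    return np.mean(vectors, axis=0)

# Example with a toy encoder (character counts) standing in for GloVe/BERT:
toy_encode = lambda text: np.array([text.count("e"), text.count("a"), len(text)], dtype=float)
print(discussion_vector(["Delete per WP:N.", "Keep, sources were added."], toy_encode))
```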

Figure 12: Probability of a Delete outcome as voting margin varies. Administrators almost never overrule Delete majorities with a margin of at least 2 votes, or Keep majorities with a margin of at least 4 votes.

Let's pause quickly to ask: why are models with direct access to votes so accurate? As shown in Figure 12, this result is unsurprising given actual behaviors by administrators. Only 7.6% of votes end in ties; administrators choose Delete in 66.9% of these cases. Outside of ties, administrators follow the majority vote in 94.8% of discussions. But while the forecast model that takes language into account does not differ in accuracy from the model using only gold labels, the text input is far more granular, supporting quantitative analysis tasks that study individual contributions. From now on, I use the model that is given access to all observable information at training time, including both the gold labels and the text of individual contributions.

Figure 13: Real-Time BERT model accuracy mid-discussion, split by final debate length: short (5 or fewer), medium (6-10), and long (over 10).

Table 6 shows the full comparison of models. As expected, the model given access to stances of group members is highly accurate. That model is able to predict outcomes with a Cohen's κ = 0.84 for full discussions. The BERT model also reaches good levels of agreement, outperforming other language representations by at least 1.6% absolute. In the real-time evaluation, GloVe and bag-of-words models are more competitive, but BERT maintains the highest accuracy. All models (including the gold-standard) see significant performance degradation, suggesting that discussions are not foregone conclusions after early contributions. To demonstrate this more clearly, see Figure 13: in conversations of any length, outcome prediction early in the debate is less reliable, then improves in accuracy steadily over time as more contributions are made visible to the classifier.

Error analysis shows that, on top of support for the social sciences, the remaining errors in classification are an opportunity for improved NLP methods. For instance, in stance classification, there are some cases where individual contributions simply lack the content that is necessary to classify them accurately (e.g. "Per all the above.").

Table 6: Accuracy of forecasting for full discussions and incremental predictions.

Representation | Full Debate % | Full Debate κ | Incremental % | Incremental κ
Majority Class Baseline | 74.0 | 0.00 | 62.1 | 0.00
GloVe | 81.7 | 0.49 | 69.1 | 0.31
Bag-of-Words | 84.2 | 0.58 | 72.4 | 0.39
BERT | 85.8 | 0.62 | 73.4 | 0.41
BERT + Vote Labels | 93.5 | 0.83 | 79.7 | 0.55

These cases would benefit from a more detailed awareness of threads of conversation [189]. Even more often, classification errors occur when users themselves express uncertainty:

(voting for Delete) "[...] as I said, I am not really qualified to assess these sources in a deeper way, other than to indicate their existence, and "apparent" reliability under our usual sourcing guidelines."

[189] Justine Zhang et al. "Characterizing online public discussions through patterns of participant interactions". In: Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018), pp. 1–27.

Instances like these require not just classification for stance but also for uncertainty [190]. Multi-task learning is a particularly fruitful domain for neural methods, and the public release of the full corpus should be a resource for development of that field.

In outcome prediction, text-only models underperform the gold-labels model when predicting an outcome of Keep, particularly for short debates. As seen in Table 7, when predicting Delete in short discussions, the BERT model is almost always accurate; as conversations grow, Delete predictions become less reliable, at just over 75% for debates longer than 10 contributions. By contrast, when BERT predicts Keep it becomes more accurate as conversations grow. In short discussions where the final outcome was Keep, performance is at its worst, with a gap in accuracy over 22% compared to the gold model. This suggests that there is significant opportunity to better identify persuasive early Keep votes, which are elusive in existing representations. Further technological advances may also focus on recognizing short discussions that ought to be enhanced with additional evidence, either through intelligent routing to potential participants or direct intervention with relevant content.

[190] Kate Forbes-Riley and Diane Litman. "Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor". In: Speech Communication 53.9-10 (2011), pp. 1115–1136.

Table 7: Accuracy of outcome prediction, split by final outcome and total debate length (as in Figure 13).

Final Outcome | Model | Short | Medium | Long
Delete | BERT | 92.9 | 85.6 | 74.7
Delete | Gold | 97.3 | 92.9 | 85.4
Delete | Gap (BERT vs. Gold) | -4.4 | -7.3 | -10.7
Keep | BERT | 71.9 | 80.6 | 75.0
Keep | Gold | 91.8 | 92.2 | 85.3
Keep | Gap (BERT vs. Gold) | -19.9 | -11.6 | -10.3


Explaining Wikipedia Decisions

In the chapter that follows I now put these predictive models to work. Let's see what we can learn about this domain by leaning on the trained classifiers from the previous chapter.

As I discussed in the first portion of this thesis, lack of causality is frequently cast as a limitation in studies of online communities, certainly including AfD. Similar limitations applied to the early analysis by Taraborelli & Ciampaglia [191] that showed a "herding" effect, where later votes in a discussion tended to follow early votes; this too may have been because votes are an accurate proxy of article quality rather than a rhetorical impact of early voters on those who participate later. Notably, the studies that have focused on rhetoric in debates have avoided tying these analyses to success rates. For instance, Xiao & Sitaula measured sentiment of votes and found more positive affect in Keep votes, but did not correlate sentiment to outcomes [192]; similarly, Schneider et al. investigated the rhetorical argument strategies of editors in AfD but did not measure how those strategies affected success rates or influenced future decisions [193]. But as I have argued, causality is not necessary to effectively describe the constraints of the Wikipedia AfD domain. In the next section I demonstrate the explanatory value of these models even when causality is not part of an analysis.

[191] Dario Taraborelli and Giovanni Luca Ciampaglia. "Beyond notability. Collective deliberation on content inclusion in Wikipedia". In: IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshop. 2010, pp. 122–125.
[192] Lu Xiao and Niraj Sitaula. "Sentiments in Wikipedia Articles for Deletion Discussions". In: International Conference on Information. Springer. 2018, pp. 81–86.
[193] Jodi Schneider et al. "Arguments about deletion: How experience improves the acceptability of arguments in ad-hoc online task groups". In: Proceedings of CSCW. ACM. 2013, pp. 1069–1080.

Forecast Shifts

I measure shifts in the probability output of the forecast model at each of these incremental predictions: the change in the probability distribution over outcomes immediately after each contribution is posted. For the nomination itself (i = 0), the prior probability of each outcome l ∈ L, as measured from the training data, is subtracted instead.

\[ \Delta(l, c_i) = P(l \mid f_i) - P(l \mid f_{i-1}) \]

Here f_i denotes the forecast input after contribution c_i has been posted. An increase in the forecast probability of one label is mirrored by a simultaneous decrease for another, which would double-count the cumulative impact of changes; therefore, the absolute changes in probability across labels are summed and multiplied by a normalizing factor of 1/2 to produce a measure of forecast shift, ranging over [0, 1] per contribution.

\[ \text{Forecast Shift} = FS(c_i) = \frac{1}{2} \sum_{l \in L} \left| \Delta(l, c_i) \right| \]

For the corpus study, I measure the forecast shift of contributions through 10-fold cross-validation. For each fold, I train a model on the training set, then make incremental probability forecasts using that model for each discussion in the test set. Applying this method across all folds yields forecast shifts for the entire corpus, with no discussion receiving forecasts from a model that included that discussion in its training set.

This approach to explanation has significant limitations for making causal claims about the data in the corpus. Forecast shift is a descriptive measure of how a predictive model alters its prediction based on new evidence. I have only inspected what is predictive given limited information, not what is rhetorically influential in the debates themselves. Thus, while the features identified as shifting forecasts are informative, this explanatory work does not make causal claims. As a crucial example of where this limits analysis, I do not include the article texts themselves in this work. The predictive model has found that debate-initial Keep votes are predictive of Keep outcomes. This has at least two possible explanations. The first is that early Keep votes are persuasive or influential in the debate itself and lead to articles being preserved. The second is that articles worth preserving attract early Keep votes, and the rhetorical strategy of the voter is unimportant compared to the voter as a proxy measure of article quality.
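To make the metric concrete, here is a minimal sketch of how forecast shift could be computed from a model's incremental probability outputs. The function and variable names (incremental_probs, prior) are illustrative assumptions, not part of the released corpus or code.

```python
from typing import Dict, List

def forecast_shift(
    incremental_probs: List[Dict[str, float]],
    prior: Dict[str, float],
) -> List[float]:
    """Compute forecast shift FS(c_i) for each contribution in one discussion.

    incremental_probs[i] maps each outcome label to P(label | f_i), the
    model's forecast after contribution i has been posted. prior maps each
    label to its training-data prior, used as the reference point for the
    nomination (i = 0).
    """
    shifts = []
    previous = prior
    for current in incremental_probs:
        # FS(c_i) = 1/2 * sum over labels of |P(l | f_i) - P(l | f_{i-1})|
        shift = 0.5 * sum(
            abs(current[label] - previous[label]) for label in current
        )
        shifts.append(shift)
        previous = current
    return shifts

# Example: a debate where an early Keep vote moves the forecast sharply.
prior = {"keep": 0.26, "delete": 0.74}
probs = [
    {"keep": 0.60, "delete": 0.40},  # nomination plus an initial Keep vote
    {"keep": 0.45, "delete": 0.55},  # a later Delete vote pushes back
]
print(forecast_shift(probs, prior))  # [0.34, 0.15]
```

On real data, incremental_probs would come from the trained forecasting model applied to successively longer prefixes of each discussion.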

Explanations from Forecast Shifts

Within the analyses that follow, I will begin each section with a reference to the key prior work that informs a particular question. Along with measuring any particular user behavior, like arriving early or posting frequently, I will also point out specific policies that are often cited in exemplar cases of that behavior. This is a useful analytic lens; policies have consistently been a focus area of AfD research, from the close study of the Ignore All Rules policy [194] to the study of notability subpolicies [195]. But in this work, I find the relationship between policies, success, and forecast shift is nuanced, and rather than treating policy citation as a monolithic phenomenon, I choose to name specific policies as they become relevant to other aspects of editor behaviors.

[194] Elisabeth Joyce, Jacqueline C Pike, and Brian S Butler. "Rules and roles vs. consensus: Self-governed deliberative mass collaboration bureaucracies". In: American Behavioral Scientist 57.5 (2013), pp. 576–594.
[195] Lu Xiao and Nicole Askin. "What influences online deliberation? A Wikipedia study". In: Journal of the Association for Information Science and Technology 65.5 (2014), pp. 898–910.

For instance, I find that there is no overall correlation between the success rate of votes in which a policy has appeared and the mean forecast shift from those votes; the slope is in fact slightly negative (r = -0.04). Many successful policies do not tend to appear in contributions that changed the model's forecasted outcome, and many contributions that changed the model's predicted outcome dramatically do not end up on the winning side of debates. While this disconnect is important for investigation, it also serves as a limit on the strength of the statements I can make about these findings.
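The check described here amounts to a correlation over per-policy aggregates; a sketch is below, with a hypothetical data frame standing in for the per-policy success rates and mean forecast shifts in the corpus.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical per-policy aggregates: one row per policy page, with the
# success rate of votes citing it and the mean forecast shift of those votes.
policies = pd.DataFrame({
    "policy": ["WP:N (Geography)", "WP:BLPPROD", "WP:SNOW", "WP:CIVIL"],
    "success_rate": [0.86, 0.62, 0.95, 0.41],
    "mean_forecast_shift": [0.21, 0.18, 0.03, 0.11],
})

# A near-zero correlation indicates that policies which tend to win debates
# are not the same ones that move the model's forecast.
r, p = pearsonr(policies["success_rate"], policies["mean_forecast_shift"])
print(f"Pearson r = {r:.2f} (p = {p:.2f})")
```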

Early Voters

I first evaluate whether early contributions are correlated with discussion outcomes, following the observation of a "herding" effect in early work [196]. In that work, the authors found that later votes in discussions were more likely to mirror early votes. My results replicate this finding: early votes are highly predictive of outcomes. Debate-initial Delete votes are successful 84.5% of the time, compared to a 63.9% baseline. The effect is even greater for early Keep votes, which result in a Keep outcome 62.2% of the time, compared to a 20.7% baseline.

[196] Dario Taraborelli and Giovanni Luca Ciampaglia. "Beyond notability. Collective deliberation on content inclusion in Wikipedia". In: IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshop. 2010, pp. 122–125.
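The aggregation behind numbers like these can be sketched as follows; the contribution-level schema and the toy records are assumptions for illustration, not the released corpus format.

```python
import pandas as pd

# Hypothetical vote-level records: one row per vote, with its ordinal position
# in the discussion, its stance label, and the debate's final outcome.
votes = pd.DataFrame({
    "discussion_id": [1, 1, 2, 2, 3, 3],
    "position":      [1, 2, 1, 2, 1, 2],
    "label":         ["delete", "keep", "keep", "keep", "delete", "delete"],
    "outcome":       ["delete", "delete", "keep", "keep", "delete", "delete"],
})

votes["successful"] = votes["label"] == votes["outcome"]

# Overall baseline success rate per vote label.
baseline = votes.groupby("label")["successful"].mean()

# Success rate of debate-initial votes (position == 1), split by label.
initial = votes[votes["position"] == 1].groupby("label")["successful"].mean()

print(baseline, initial, sep="\n")
```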

Figure 14: Success rates (left) and forecast shifts (right) for votes that were the Nth contribution to a discussion, for different values of N. I measure these values first for any vote with that label at that ordinal location in the debate, then for discussions where the first vote for a particular label appeared at rank N.

Trends over the course of a discussion, for both outcome success and forecast shift, are visualized in Figure 14. I separate the values for all votes that appear as the Nth contribution to a discussion from contributions at that point that were the first vote for a particular outcome. Success rates rapidly decline for both Delete and Keep votes when they arrive late in a discussion. When measuring the forecast shift of votes, similar declines are observed for votes overall, regardless of whether they are for Delete or Keep. Differentiation appears in the forecast shift associated with the first Keep and Delete vote in a discussion. Early Keep votes are highly informative for the forecast model, and produce the greatest shift in forecast probabilities.

But forecast shift from Delete votes initially increases as the first such vote appears later in the discussion, peaking at the third contribution and declining only slowly when the first vote appears even later than that.

These results are intuitive. The default outcome when discussions do not reach consensus is Keep; however, the momentum in AfD is toward deletion. For inclusionist voters, a key factor in highly predictive votes is simply being early to arrive in a debate, either shifting the tenor of the discussion that follows or signaling clear article quality or fulfillment of criteria for inclusion. When voting Delete, on the other hand, forecasts do not shift when votes arrive early; a Delete outcome was already likely. Instead, Delete voters shift forecasts when they arrive in the middle of conversations and contradict earlier votes. The Delete voter shifts forecasts more significantly when acting as a "devil's advocate" and reducing certainty about a particular outcome; this is not possible in debates where deletion is obvious, and those Delete votes result in low values of forecast shift.

Research has shown that priming effects are able to shape risk profiles, preferences, and topics under scrutiny in decision-making tasks [197, 198]. In that context it makes sense that early votes can set the stage for later discussion, and that this impact is larger when the first vote is contrary to the most common outcome (initial Keep votes are more influential than Delete). The activity of late-comers to discussions, adding votes even though they have no impact on the discussion, is also a result that is validated and justified by prior work.

An example of this pattern is shown in Figure 15, where one user defends a page for a sparsely populated island in the Indian state of Kerala. In the figure, I omit the (lengthy) discussion; to summarize, the first voter produces a large forecast shift, beginning the debate with an initial Keep vote only two hours after nomination. When later users argue for Delete, the model shifts back to predicting a Delete outcome, but with low certainty. The next non-voting comment from the initial Keep voter gives detailed responses and further citations to policies, which tilts the forecast toward an eventual Keep outcome.

[197] Hans-Peter Erb, Antoine Bioy, and Denis J Hilton. "Choice preferences without inferences: Subconscious priming of risk attitudes". In: Journal of Behavioral Decision Making 15.3 (2002), pp. 251–262.
[198] Eric J Johnson et al. "Beyond nudges: Tools of a choice architecture". In: Marketing Letters 23.2 (2012), pp. 487–504.

Notability Policies

Figure 15: Large forecast shifts arise from initial votes for Keep followed by response votes for Delete. Here, a user successfully cites the Notability (geographic features) policy to keep an article.

A differentiating feature of the voter in the previous example is the citation of a relevant and targeted policy, Notability (geographic features). This type of behavior was previously flagged in preliminary results from prior work by Xiao and Askin [199], which noted that locations, biographies, and corporate pages were each deleted at different rates compared to pages in general.

[199] Lu Xiao and Nicole Askin. "What influences online deliberation? A Wikipedia study". In: Journal of the Association for Information Science and Technology 65.5 (2014), pp. 898–910.

My research extends that finding: Notability policies are among the most informative votes in the forecast model, appear early in debates (particularly often in Keep votes), and are more successful in general than both other policies and votes overall. This is shown in the ranked list of the Notability policies associated with the greatest average forecast shift, in Table 8.

Table 8: Success and forecast shift for Notability citations, split by vote label (Keep or Delete).

Notability Subcategory          Success (Keep)   Success (Delete)   Forecast Shift (Keep)   Forecast Shift (Delete)
Biographies of Living People    54.5             89.5               0.31                    0.11
Astronomical Objects            78.6             60.3               0.29                    0.12
Martial Arts                    56.6             92.5               0.27                    0.09
Software                        41.7             92.6               0.251                   0.08
Media                           75.9             87.1               0.246                   0.06
Films                           81.3             84.3               0.237                   0.07
Academic Journals               75.2             77.6               0.23                    0.04
Professional Football Leagues   82.5             93.4               0.22                    0.07
Music                           72.1             85.7               0.21                    0.08
Geographic Landmarks            85.7             74.5               0.21                    0.12

Successful policy citations that also have high forecast shift are narrowly scoped. The most successful inclusionist Notability policies are on topics like astronomical objects, geographic landmarks, and local high schools. Enthusiasts wrote these policies to clearly define notability for an area where the average editor may not know the inclusion criteria, and cite these policies effectively, first to shift the focus of discussions and then to win those debates.

Note that some communities reverse this trend and maintain highly selective standards to prevent an influx of articles; this phenomenon is most prevalent in sports, with highly successful citations in favor of Delete for topics like regional football (soccer) leagues and martial arts. For a prototypical example of highly successful citations, see the actions in Figure 16. This user is one of the top five most consistently successful users in the corpus, by average success and forecast shift. Their contributions are early, short, clear, and uncontroversial. By referencing criteria in a pre-existing, relevant policy (in this case, Notability (Badminton)), debate is closed quickly. In fact, the overwhelming majority of this user's votes are for badminton players, all of whom meet the officially written policy's standards, and this user maintains a success rate in excess of 90%.

Figure 16: Highly successful votes that also shift the forecast model often come from the narrow use of established policies for notability in specific subtopics.

But as Notability policies become broader, their trends in both success rates and forecast shifts revert to the broader mean of all votes that cite policies. The very broadly scoped policy on proposed deletion of biographies of living people (WP:BLPPROD) is noteworthy: among all policies, it has the greatest difference in success and forecast shift metrics depending on whether it appears in Delete or Keep votes. When used in inclusionist arguments, the policy is usually cited early in the discussion and causes significant uncertainty in the model, shifting probable outcomes from being weighted toward Delete to a tossup. However, those Keep citations of the biography policy are among the least successful votes in the corpus. By contrast, when cited as part of Delete arguments, this broad policy does much less to shift forecast probabilities, but is successful well above the baseline success rate for deletionist votes. Another way of seeing the role of Notability policies in debate is to look at the Delete policy citations with high average forecast shifts. While Keep votes have disproportionately high forecast shift values, the top two Delete citations in forecast shift are Trivial Mentions and Existence ≠ Notability. Both of these policies are used as responses to notability arguments from Keep voters.

Taken in aggregate, these results match the findings from Xiao in her series of papers [200], showing that Notability policies shape the discourse of this domain, while giving substantial additional detail. Notability citations are the most interesting and valuable target for future research on AfD.

[200] Lu Xiao and Nicole Askin. "What influences online deliberation? A Wikipedia study". In: Journal of the Association for Information Science and Technology 65.5 (2014), pp. 898–910.

Active Voters

I next evaluate whether votes from more frequent posters, who take the time to reply to other users and participate actively in discussion, are more predictive of eventual outcomes. This effect has been previously suggested in small-scale, mixed-methods analyses of dozens or hundreds of discussions [201, 202]. I find mixed support for these findings in this larger-scale study. In the data, 45.0% of votes and comments in discussions come from editors who made more than one contribution to that discussion. Single-contribution voters are substantially more likely to cast a successful vote, winning 84.8% of Delete votes and 66.7% of Keep votes. Users who post more than twice to a discussion are successful in fewer than half of their votes, and success rates continue to decline as users post more and more. This seems to contradict the topline finding from past work. Again, as with all findings, this result is not causal: there is no way with the data gathered here to discern whether editors with weak arguments tend to add more comments to discussions, or whether their heavy participation in debates causes them to lose debates. But in either case, there is no evidence to suggest that active users are more likely to win debates.

[201] Elisabeth Joyce, Jacqueline C Pike, and Brian S Butler. "Rules and roles vs. consensus: Self-governed deliberative mass collaboration bureaucracies". In: American Behavioral Scientist 57.5 (2013), pp. 576–594.
[202] Lu Xiao and Nicole Askin. "What influences online deliberation? A Wikipedia study". In: Journal of the Association for Information Science and Technology 65.5 (2014), pp. 898–910.
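A sketch of the per-editor activity analysis is below; again, the column names and records are hypothetical stand-ins rather than the corpus schema.

```python
import pandas as pd

# Hypothetical contribution-level records (votes and non-voting comments).
contribs = pd.DataFrame({
    "discussion_id": [1, 1, 1, 2, 2, 2, 2],
    "user":          ["a", "b", "b", "c", "d", "d", "d"],
    "is_vote":       [True, True, False, True, True, False, False],
    "label":         ["delete", "keep", None, "keep", "delete", None, None],
    "outcome":       ["delete"] * 3 + ["keep"] * 4,
})

# Count how many contributions each editor made within each discussion.
activity = (
    contribs.groupby(["discussion_id", "user"])
    .size()
    .rename("n_contributions")
)
contribs = contribs.join(activity, on=["discussion_id", "user"])

# Success rate of votes, split by how active the voter was in that discussion.
votes = contribs[contribs["is_vote"]].copy()
votes["successful"] = votes["label"] == votes["outcome"]
print(votes.groupby(votes["n_contributions"] > 1)["successful"].mean())
```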

Figure 17: One-time voters are more successful than more active voters; however, the first contribution from more active voters has greater forecast shift than the votes from one-time contributors.

I find that the forecast shift measure matches the smaller-scale observations from those earlier studies more closely than actual success rates do. As shown in Figure 17, while success rates go down as users are more active in debates, the average forecast shift attributable to the first vote from those users is much higher.

Forecast shifts are greater for the first post by editors who will eventually follow up with more activity, though their additional contributions do not maintain that level, suggesting diminishing returns. The first post by these highly active users (the lighter-shaded bars) shifts forecasts by almost twice as much as the first post by one-time contributors (the dark leftmost line).

As an example of this dynamic, see the debate activity in Figure 18, arguing about a Canadian magician. In this debate, several users are successful in their vote, but do not meaningfully contribute to the decision-making process; in the forecast model, only one vote shifts the predicted outcome by more than 0.05, the very first by Jack Cox. By the time votes appear from later users, the discussion is a foregone conclusion for Delete. The late citation of Vanispamcruftisement, a lighthearted anti-spam policy, has no bearing on the clear consensus of the group. While single-vote users are highly successful, they are not changing the outcome of debates; instead, those late arrivals are getting credit for participation in a debate that has essentially concluded. Other citations to policies about spam and hoaxes follow a similar pattern: they are among the top success-rate policies in the dataset, but appear in votes that provide among the least new information to the forecasting model.

Figure 18: Example of highly successful editor behavior with minimal forecast shift. For each of the later votes, the probability of a Delete outcome is already well over 99%.

Discussion Breakdowns

So late arrivers had high success rates, but did little to shift probabilities in the forecast model. I replicate that finding, not for users but for a policy, by singling out the Snowball Clause, summarized as:

Table 9: Policies sorted by the ordinal rank at which they appear in discussion, and the mean forecast shift of votes where that citation appears, split by vote label. Many early-appearing policies overlap with the influential notability policies from Table 8.

Latest Citations      Rank   Avg. Forecast Shift (Keep)   Avg. Forecast Shift (Delete)
Civility              25.9   0.11                         0.06
No Personal Attacks   24.8   0.10                         0.06
Attack Pages          23.6   N/A                          0.07
Disruptive Editing    22.6   0.12                         0.04
Gaming the System     21.1   0.09                         0.06
Arguments to Avoid    20.4   0.13                         0.08
Ignore All Rules      19.4   0.13                         0.07

Earliest Citations          Rank   Avg. Forecast Shift (Keep)   Avg. Forecast Shift (Delete)
Living Person Biographies   2.7    0.31                         0.12
"Garage Bands"              3.3    N/A                          0.09
Notability (Media)          3.3    0.25                         0.06
Notability (Astronomy)      3.4    0.29                         0.12
Notability (Martial Arts)   4.0    0.27                         0.09
Notability (Music)          4.1    0.21                         0.08
No Hoaxes                   4.6    0.16                         0.05

"If an issue does not have a snowball's chance in hell of being accepted by a certain process, there's no need to run it through the entire process." This policy is cited once it is clear that consensus has been reached and that there is no need to hold discussion open for the full seven days. Indeed, votes citing this policy have the highest success rate and lowest forecast shift of any policy in the taxonomy. Citing the Snowball policy in Keep votes is similarly indicative of a contribution that will not change the forecast probabilities.

It is also possible to examine other policies that appear very late, with little forecast shift. Unlike Snowball, which almost always appears in successful votes, let's now focus on policy citations that appear late and are not successful. I sort policies by the mean ordinal rank of the post in which they appear; in Table 9, I present the top-ranked policies on each end of this measure (for clarity, referencing that table: WP:Civility citations appear in the 26th contribution to a discussion, on average). These policies are procedural and often indicate a breakdown in debate, with little information for the model to shift the likely outcome of debate. Instead, they are indicators that the debate's content-focused discussion has ended, an outcome is highly likely, and debate decorum has now broken down entirely. This includes citations to policies like No Personal Attacks, Gaming the System, and No Legal Threats. Voters that cite these policies are on the losing side of debates, posting very late, and are not changing the direction of the debates in which they appear.

Figure 19: Citations in low-success rate votes that cause little change in forecasts come late in discussions, often citing detailed technical policies rather than focusing on persuasion or notability.

Previous work from Joyce et al. [203], studying a randomly selected set of 588 debates, focused on the WP:IAR (Ignore All Rules) policy, which also has a significant effect in the present study's much larger corpus. That policy has been cited a total of 1,361 times in this corpus; among the votes in which the policy was cited, Keep votes were successful 10.3% less often than in the corpus overall (53.7% compared to a 64.0% baseline), and Delete votes dropped in success rate by 12.2%.

Less contentious but also prevalent is procedural citation of editing guidelines, such as Editing Policy and Article Size. These votes, typically used in debates about lists that have been separated out of main articles and into standalone pages, tend to come very late in discussions. An example appears in Figure 19, arguing about whether a subset of information about a regional university merits its own article. In the example, I again omit some of the discussion: earlier votes that trended toward Delete. The late arrival of a Keep voter citing structural policy about the preferable length of Wikipedia articles was essentially futile; the citation to policy on spinning out articles is made, but consensus has already been reached.

[203] Elisabeth Joyce, Jacqueline C Pike, and Brian S Butler. "Rules and roles vs. consensus: Self-governed deliberative mass collaboration bureaucracies". In: American Behavioral Scientist 57.5 (2013), pp. 576–594.

Figure 20: Summary of success rates and forecast shifts for various policies. The scatter plot shows all policy pages with at least 25 citations in either Keep or Delete votes. Dotted lines mark baseline success rates.

A scatter plot showing the full distribution of policies analyzed for this study appears in Figure 20. I separate policies by their appearance in Delete and Keep votes. Policies that I highlighted earlier in this analysis are labeled. This is a busy figure, so let's take the time below to analyze its component pieces in detail. Overall, I find that because Delete votes are more successful, so too are the citations that appear in those votes, but that observing any one of these votes does not tend to produce a large shift in probable outcomes in the forecast model.

As a result, policy citations from Delete votes cluster in the top left of the scatter plot. Citations in Keep votes cause a much greater shift in the forecast model, as seen by the nearly clean partitioning of blue and orange clusters in the scatter plot. These policy citations are not necessarily successful, but do make the final outcome far less certain for the forecast model.


Future Directions

Tools for Decision Support

The potential for algorithmic decision-making as a support aid and tool in AfD is high. The use of machine learning tools powered by NLP has precedent in that community: tools already exist and are in widespread use for numerous behind-the-scenes tasks like vandalism detection [204], bot detection [205], and article quality assessment [206]. But it is not obvious which action is appropriate to take when administrator decisions disagree with predictions from forecasts, opening a broader question of trust in machine learning systems. Can this model be used to recognize when a poor decision is being made, or when participating editors are missing key experience levels or subject matter expertise? Future implementations of forecast models for Wikipedia may be able to notice maladaptive behaviors in groups, and recommend either a pause in decision-making when a "surprising" outcome is being chosen by an administrator, or could even be extended to active recruiting of new voices that are potentially underrepresented in the existing discussion.

Tempering any optimism about a technology-centered intervention, though, we must also consider the role of stakeholders and participants within the AfD process. Predicting the relative effectiveness of technological interventions is complicated; some approaches to improving retention have worked well [207], while other attempts have backfired and been shown to decrease the productivity of new users [208]. Any discussion of algorithmic interventions will need to be tempered by the unsteady reception to bots in the Wikipedia editorial system in general [209]. This work is not meant to propose technology-first solutions for the broader systemic and structural challenges with inequity in AfD, but to open a quantitative discussion of the existing norms and outcomes for marginalized users in this context. I hope it opens a fruitful avenue for future work to explore and a promising way to turn explanation into real-world action based not necessarily on bots or algorithms, but on clear-headed understanding of the data that describes the status quo and the actions that may effect change.

[204] Amir Sarabadani, Aaron Halfaker, and Dario Taraborelli. "Building automated vandalism detection tools for Wikidata". In: Proceedings of the International Conference on the World Wide Web. 2017, pp. 1647–1654.
[205] Andrew Hall, Loren Terveen, and Aaron Halfaker. "Bot Detection in Wikidata Using Behavioral and Other Informal Cues". In: Proceedings of CSCW (2018), p. 64.
[206] Aaron Halfaker. "Interpolating quality dynamics in wikipedia and demonstrating the keilana effect". In: Proceedings of WikiSym. ACM. 2017, p. 19.
[207] Jonathan T Morgan et al. "Tea and sympathy: crafting positive new user experiences on wikipedia". In: Proceedings of CSCW. ACM. 2013, pp. 839–848.
[208] Jodi Schneider, Bluma S Gelley, and Aaron Halfaker. "Accept, decline, postpone: How newcomer productivity is reduced in English Wikipedia by pre-publication review". In: Proceedings of the international symposium on open collaboration. ACM. 2014, p. 26.
[209] Richard Stuart Geiger II. "Robots.txt: An Ethnographic Investigation of Automated Software Agents in User-Generated Content Platforms". PhD thesis. University of California, Berkeley, 2015.

Of course, future explanatory research will undoubtedly benefit from extension beyond just forecast shift, the primary tool I used in my citation analysis.

This metric was just one way of recognizing what is going on in discussions. It aligns neatly with findings from past work, and does not require any explicit labeling of preferences or votes at the granularity of turns or even individuals. This makes the metric well-suited to discussion contexts where no votes may be explicitly recorded. But in the AfD context, with the advantage of full discussion logs and explicit votes and outcomes, metrics that take more advantage of discussion structure may be appropriate. The highly structured hypergraph representation from Hua et al. [210] may serve as inspiration here. We can also use this model for deeper temporal analysis of Wikipedia's evolution over time, including a test of which policies have risen and fallen in prominence throughout the duration of the site's rise and decline.

[210] Yiqing Hua et al. "WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community". In: Proceedings of EMNLP. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 2818–2823. URL: http://aclweb.org/anthology/D18-1305.

Influence on Group Decision-Making

There's a potentially more impactful future direction for this work – but impact is not always benevolent.

By focusing on the social, we have the potential to build interventions for group debate. One next step for this research, and for the measurement of forecast shift in debates generally, is to recognize, describe, and make visible the role of gatekeeping and identity-driven discourse behaviors in enforcing or even intensifying existing advantages for strategic long-term users. My existing work demonstrates that forecasts shift based on immediately observable characteristics, like how early a user is to arrive at a debate, how many posts they make, or how they cite policy. Interventions based on these features might measure and attempt to alter outcomes based on the factors that are known from our explanation of the domain.

Who would want to do this? For an optimistic view, consider the role of newcomers in debate. Over the last several years, Wikipedia has worked to bring newcomers into its community, but predicting what will be effective in such a complex domain is difficult. I want to know who gets a voice and a vote when arguing about what gets included in Wikipedia. Distinguishing the relative role of individuals will enable deeper process analysis of factors like diversity on teams [211], the interplay between individual participants and the process of resolving conflicts or disputes [212], and the granular habits that lead to effective outcomes.

These habits are often process-oriented, small-scale, and not adequately captured by survey or demographic variables [213], opening exciting new dimensions for behavioral science research. Specific, highly salient personal attributes include gender, as well as tenure, a user's prior experience based on their time since registration and initial participation in the community.

A third factor, prestige, measured by administrator privileges or user profile awards, may also be relevant for study and intervention. These interventions based on identity may also be affected by the topic, context, and individuals participating in a specific discussion.

Future work may also benefit from measuring decision quality, a topic so far studied only by Lam et al. [214]. In that study, researchers identified poor decisions as those that were reversed at a later date: they flagged a poor decision either when a previously kept article was successfully re-nominated for deletion, or when a page that had previously been deleted as part of the AfD process was recreated. The magician from Figure 18, for instance, now has a recreated Wikipedia page with additional content. Another experiment would be to limit our analysis to close decisions with narrow margins, such as the 7.6% of votes ending in ties and the 5.2% of votes where administrators overruled the majority vote.

This focus on identity as a positive factor for group decision-making is not merely theoretical or aspirational. Lam et al. have already shown that diverse groups of decision-makers improve quality [215]. By going beyond raw statistics like edit count and into more granular, informed explanations of the strategies used in AfD, like policy citation, I have already shown there is nuance to be found in this domain. Next, by acknowledging the role of identity and diversity in group decision-making, I believe Wikipedia has an opportunity to greatly improve the quality of decision-making, breadth of representation, and level of participation in its community. This thesis leaves ample room for further study.

But let's also acknowledge the adversarial uses of machine learning and data analytics for influencing decision-making. Years ago, Facebook researchers manipulated News Feed contents to test their ability to influence emotions [216]; just as I experienced in the essay scoring domain, this produced a flurry of news controversy when the implications of the study dawned on the broader media [217]. This was just a shadow of the controversy to come shortly thereafter, though – the Cambridge Analytica scandal in the lead-up to the 2016 election [218]. The results showed what can be done with a data-driven approach to recognizing successful influences on decision-making and injecting algorithmic behavior into the mix.

And so this work on Wikipedia can't be left on its own, looking only to beneficial uses. Late in this dissertation I'll engage in discussion of the potential for analysis of algorithmic tools evaluated not just in a vacuum, but in a power hierarchy, focusing on the funders of the work we do.

[211] Julia B Bear and Anita Williams Woolley. "The role of gender in team collaboration and performance". In: Interdisciplinary Science Reviews 36.2 (2011), pp. 146–153.
[212] Karen A Jehn, Gregory B Northcraft, and Margaret A Neale. "Why differences make a difference: A field study of diversity, conflict and performance in workgroups". In: Administrative Science Quarterly 44.4 (1999), pp. 741–763.
[213] Christoph Riedl and Anita Williams Woolley. "Teams vs. crowds: A field test of the relative contribution of incentives, member ability, and emergent collaboration to crowd-based problem solving performance". In: Academy of Management Discoveries 3.4 (2017), pp. 382–403.
[214] Shyong K Lam et al. "WP: clubhouse?: an exploration of Wikipedia's gender imbalance". In: Proceedings of WikiSym. ACM. 2011, pp. 1–10.
[215] Shyong K Lam et al. "WP: clubhouse?: an exploration of Wikipedia's gender imbalance". In: Proceedings of WikiSym. ACM. 2011, pp. 1–10.
[216] Adam DI Kramer, Jamie E Guillory, and Jeffrey T Hancock. "Experimental evidence of massive-scale emotional contagion through social networks". In: Proceedings of the National Academy of Sciences 111.24 (2014), pp. 8788–8790.
[217] Robinson Meyer. "Everything We Know About Facebook's Secret Mood Manipulation Experiment". In: The Atlantic (2014). Accessed 2020-08-01. URL: https://bit.ly/30l57kA.
[218] Adrien Chen. "Cambridge Analytica and our lives inside the surveillance machine". In: The New Yorker 21 (2018), pp. 8–10.

Part III: Automated Essay Scoring

Automated essay scoring (AES) mimics the judgment of educators evaluating the quality of student writing. There is a lot of reasonable concern, both practically and pedagogically, about what that mimicry means for schools, teachers, and students. The work in this section is inspired by nearly a decade of research. After starting this section of the thesis with a brief overview of the historical context of AES, I report the results of an investigation of neural methods compared to classical machine learning, published at an ACL workshop this year [219].

The section continues with the specific project I spent the most time on during this thesis research, a partnership with the University at Albany on a series of evaluations of DAACS. The results of that study comprise two publications, in preparation for later this year:

• The first demonstrates the efficacy of the baseline machine learning models and investigates one particular explanatory phenomenon, the five-paragraph essay [220].

• The second investigates the potential for non-scoring-related topic models to explain student content choices, then digs deeper into the specific phenomenon of non-adherent writing.

[219] Elijah Mayfield and Alan W Black. "Should you Fine-Tune BERT for Automated Essay Scoring?" In: Proceedings of BEA. 2020.
[220] Elijah Mayfield et al. "Five-Paragraph Essays and Fair Automated Scoring in Online Higher Education". In: Assessing Writing. Under review.

Figure 21: An August 2019 Vice report brought renewed attention to automated essay scoring, this time in the context of implementations for Common Core standardized testing.

Context and Background

Writing, though central to education, is labor intensive to grade. Teachers must balance giving students iterative practice and feedback with their own scarce time. So schools turn to automation.

Originally used for summative purposes in standardized testing [221], automated essay scoring (AES) systems are now frequently found in classrooms [222], typically enabled by training data scored on reliable rubrics to give consistent and clear goals for writers [223]. Essay scoring has been an intensely studied and debated field, the subject of Kaggle competitions [224] and mainstream media press [225], particularly for assessment purposes in high-stakes testing. A large body of research has shown that these systems have error rates equivalent to, or even slightly lower than, scores from classroom teachers or hired scorers in high-volume testing settings like the GRE and TOEFL [226, 227]. But skepticism toward algorithmic scoring of essays continues for a variety of reasons. Many composition pedagogy experts have expressed concerns about its role in writing education [228, 229], and some scholars have also disputed the validity of the scores themselves, noting the high correlation between scores in standardized testing and raw features like word count [230].

These are grave allegations worth taking seriously; and another level of concern exists not in the deployment of systems but in their very design. Cultural bias creeps into AES through rubric writing and the scoring of training data, unless extensive countermeasures are taken to maintain reliability across student backgrounds and varied response types [231]. AES also limits flexibility in task choice and response type, limiting students to writing styles that mirror the norms of the dominant school culture. But of course, it is this focused domain and well-formulated set of categorical output labels that makes the task so enticing for researchers.

Before going through the steps of building, evaluating, and defending an AES system, this chapter attempts to make sense of the history of this debate and the broader educational context of writing assessment in which these systems are used.

[221] Jing Chen et al. "Building e-rater® Scoring Models Using Machine Learning Methods". In: ETS Research Report Series 2016.1 (2016), pp. 1–12.
[222] Joshua Wilson and Rod D Roscoe. "Automated Writing Evaluation and Feedback: Multiple Metrics of Efficacy". In: Journal of Educational Computing Research (2019), p. 0735633119830764.
[223] Y Malini Reddy and Heidi Andrade. "A review of rubric use in higher education". In: Assessment & Evaluation in Higher Education 35.4 (2010), pp. 435–448.
[224] Mark D Shermis and Ben Hamner. "Contrasting state-of-the-art automated scoring of essays: Analysis". In: Proceedings of NCME. 2012, pp. 14–16.
[225] John Markoff. "Essay-Grading Software Offers Professors a Break". In: The New York Times (Apr. 2013). Accessed 2020-06-30. URL: https://nyti.ms/2BoUaoF.
[226] Yigal Attali and Jill Burstein. "Automated Essay Scoring with e-Rater® V. 2.0". In: ETS Research Report Series 2 (2004).
[227] Mark Shermis and Jill Burstein. Handbook of automated essay evaluation: Current applications and new directions. Routledge, 2013.
[228] NCTE. NCTE Position Statement on Machine Scoring. 2013. Accessed 2020-06-30. URL: https://bit.ly/3dQHaVY.
[229] John Warner. Why They Can't Write: Killing the Five-Paragraph Essay and Other Necessities. JHU Press, 2018.
[230] Les Perelman. "When "the state of the art" is counting words". In: Assessing Writing 21 (2014), pp. 104–111.
[231] Anastassia Loukina et al. "Using exemplar responses for training and evaluating automated speech scoring systems". In: Proceedings of BEA. 2018, pp. 1–12.

Overview of Writing Assessment

Students who learn to think of writing as a process that includes iterative improvement demonstrate large gains in transferable skills [232, 233]. Unfortunately, this process is difficult to learn and complex to teach, needing differentiated instruction across students and incorporating strategies that may vary across tasks [234]. Teachers tend to view this element of instruction as difficult and time-consuming, and rarely teach the revision process in depth [235].

An open question is whether biases in assessment are more strongly embedded in the requirements of the assignments themselves, or in the minds and preferences of the individual instructors doing the grading. On one hand, the influence of individual teachers' biases is large. Biases of instructors are well established: grading of writing is significantly influenced by mechanical errors like punctuation and capitalization, even when using a content-specific rubric, and research has shown these personal preferences extend to student gender, physical attractiveness, and even penmanship [236].

On the other, the shaping of assignments and of grading supports like rubrics has a large effect on how students are measured. Teachers feel more confident when using rubrics, and the use of those rubrics increases inter-rater reliability in scoring [237, 238]. Rubrics can be useful in higher education, with their effectiveness primarily tied to how well raters are trained and the appropriateness of the rubric design to the task being studied. Additionally, there is significant potential for formative use by students.

This reliance on rubrics has the potential impact, though, of narrowing the definition of good writing. This narrowing toward a limited range of essay structures pushes students away from culturally relevant writing styles and forms more representative of the writing they encounter in their day-to-day life [239]. For students of color, schools systematically devalue their home language use [240, 241], creating a "double consciousness" where students use language one way at home and another way at school [242]. These expectations in practice grant extra privilege to affluent White students, both for the language they use and their norms for behaviors like help-seeking [243, 244].

[232] Stephanie Dix. ""What did I change and why did I do it?" Young writers' revision practices". In: Literacy 40.1 (2006), pp. 3–10.
[233] Marion Tillema et al. "Relating self reports of writing behaviour and online task execution using a temporal model". In: Metacognition and Learning 6.3 (2011), pp. 229–253.
[234] John Hayes and Linda Flower. "Identifying the Organization of Writing Processes". In: Cognitive Processes in Writing. Ed. by L Gregg and E Steinberg. Erlbaum, 1980.
[235] Steve Graham and Karen Harris. "Writing Better: Effective Strategies for Teaching Students with Learning Difficulties." In: Brookes Publishing Company (2005).
[236] Ali Reza Rezaei and Michael Lovorn. "Reliability and validity of rubrics for assessment through writing". In: Assessing Writing 15.1 (2010), pp. 18–39.
[237] Heidi L Andrade, Ying Du, and Xiaolei Wang. "Putting rubrics to the test: The effect of a model, criteria generation, and rubric-referenced self-assessment on elementary school students' writing". In: Educational Measurement: Issues and Practice 27.2 (2008), pp. 3–13.
[238] Y Malini Reddy and Heidi Andrade. "A review of rubric use in higher education". In: Assessment & Evaluation in Higher Education 35.4 (2010), pp. 435–448.
[239] Ernest Morrell. Critical literacy and urban youth: Pedagogies of access, dissent, and liberation. Routledge, 2015.
[240] John R Rickford and Russell John Rickford. Spoken Soul: The Story of Black English. Wiley, New York, 2000.
[241] David Holbrook. "Native American ELL Students, Indian English, and the Title III Formula Grant". In: Annual Bilingual/Multicultural Education Conference. 2011.
[242] Anne H Charity Hudley and Christine Mallinson. We Do Language: English Variation in the Secondary English Classroom. Teachers College Press, 2013.
[243] H Samy Alim and Geneva Smitherman. Articulate while Black: Barack Obama, language, and race in the US. Oxford University Press, 2012.
[244] Jessica McCrory Calarco. ""I need help!" Social class and children's help-seeking in elementary school". In: American Sociological Review 76.6 (2011), pp. 862–882.

Writing Assessment, Gender, and Race

Part of the reason that educators are skeptical of algorithms for scoring essays is that so many of the challenges students face in improving their writing scores are entrenched at levels far beyond what can be addressed with algorithmic interventions.

For instance, the culture of school writing is gendered. Gender is also a known predictor of educational outcomes: male students tend to underperform female students, and the gap tends to increase with age [245]. Many of these studies do not include transgender students, who are systematically under-supported and under-represented in educational research, either erased entirely or receiving limited support grouped with broader LGBT student needs [246]. In early (high school) years, girls' writing is preferred because it is more descriptive and empathetic. In university writing, bold, assertive, self-confident writing is preferred, which tilts the scales toward men, especially those from historically advantaged backgrounds [247]. All of these results mean that student writing exists in a social context full of discriminatory beliefs about "proper" school language [248, 249]. Among school-age children, gender differences in student scores on writing tasks exist: at adolescent ages, girls have an advantage in measured writing skill, while boys struggle with writing, particularly personal narratives [250]. The gender gap increases with age, peaking at the oldest ages tested in the K-12 education literature, typically around age 21. Scores among girls are also more "bunched" around a mean than among boys, who have higher variance; as a result, girls' scores benefit from a curriculum focus on regular coursework more than boys' [251].

Language and race is an even more heated topic. Race is the single strongest predictor of educational outcomes in the United States [252]. Black and Native students underperform relative to their White and Asian peers, for reasons tied to systemic failures to support those populations. The same is true of Hispanic students and recent immigrants from Asia, Latin America, and the Caribbean [253].

[245] Caroline Scheiber et al. "Gender differences in achievement in a large, nationally representative sample of children and adolescents". In: Psychology in the Schools 52.4 (2015), pp. 335–348.
[246] John P Dugan, Michelle L Kusel, and Dawn M Simounet. "Transgender college students: An exploratory study of perceptions, engagement, and educational outcomes". In: Journal of College Student Development 53.5 (2012), pp. 719–736.
[247] Becky Francis et al. "University lecturers' perceptions of gender and undergraduate writing". In: British Journal of Sociology of Education 24.3 (2003), pp. 357–373.
[248] Anne Curzan. "Teaching the politics of standard English". In: Journal of English Linguistics 30.4 (2002), pp. 339–352.
[249] Rosina Lippi-Green. English with an Accent: Language, Ideology, and Discrimination in the United States. Psychology Press, 1997.
[250] Caroline Scheiber et al. "Gender differences in achievement in a large, nationally representative sample of children and adolescents". In: Psychology in the Schools 52.4 (2015), pp. 335–348.
[251] Jannette Elwood. "Equity issues in performance assessment: The contribution of teacher-assessed coursework to gender-related differences in examination performance". In: Educational Research and Evaluation 5.4 (1999), pp. 321–344.
[252] Gloria Ladson-Billings. "From the achievement gap to the education debt: Understanding achievement in US schools". In: Educational Researcher 35.7 (2006), pp. 3–12.
[253] Sita G Patel et al. "The achievement gap among newcomer immigrant adolescents: Life stressors hinder Latina/o academic success". In: Journal of Latinos and Education 15.2 (2016), pp. 121–133.

Differences in gender also intersect with race – for instance, in transitions between schools, male performance declines more than female performance, and Black males see the largest declines of all [254]. Finally, we have ample evidence to suggest that non-native English speakers struggle with writing in English compared to native-speaking peers. Schools are not well-equipped to teach those students [255], setting them up to lose interest in developing writing skills [256].

[254] April Sutton et al. "Who Gets Ahead and Who Falls Behind During the Transition to High School? Academic Performance at the Intersection of Race/Ethnicity and Gender". In: Social Problems 65.2 (2018), pp. 154–173.
[255] Christina Ortmeier-Hooper and Kerry Anne Enright. Mapping new territory: Toward an understanding of adolescent L2 writers and writing in US contexts. 2011.
[256] Bonny Norton Peirce. "Social identity, investment, and language learning". In: TESOL Quarterly 29.1 (1995), pp. 9–31.

Writing Assessment in Higher Education

My work with DAACS was situated in university settings, not in K-12. But many of these challenges follow those students into higher education, tied to preparedness from high school and to economic factors like additional financial burden and lack of support [257, 258]. Factors at that level are more complex, though, and higher education serves an increasingly nontraditional student population [259]. Income, job status, military service, age, and first-generation status may also be associated with differential outcomes relative to peers. However, it is not clear how such demographics would interact with writing ability, because of the limited research literature on adult learners. Expectations about writing differ between college students and K-12 students [260]; there is reason to believe many of the findings from middle and high schools will not transfer to our population, as the population of college-attending students represents significant selection bias [261]; the students that comprise the statistics pointing to an achievement gap in K-12 never make it to college of any kind, changing the population being studied. These pathways and drop-off points produce an entirely different socioeconomic stratum in higher education [262], meaning we cannot be sure which results will transfer.

[257] Ben Backes, Harry J Holzer, and Erin Dunlop Velez. "Is it worth it? Postsecondary education and labor market outcomes for the disadvantaged". In: IZA Journal of Labor Policy 4.1 (2015), p. 1.
[258] Walter R Allen et al. "From Bakke to Fisher: African American Students in US Higher Education over Forty Years". In: RSF: The Russell Sage Foundation Journal of the Social Sciences 4.6 (2018), pp. 41–72.
[259] Tressie McMillan Cottom. Lower Ed: The Troubling Rise of For-Profit Colleges in the New Economy. The New Press, 2017.
[260] Arthur N Applebee and Judith A Langer. "The state of writing instruction in America's schools: What existing data tell us". In: Albany, NY: Center on English Learning and Achievement (2006).
[261] Jennie E Brand and Yu Xie. "Who benefits most from college? Evidence for negative selection in heterogeneous economic returns to higher education". In: American Sociological Review 75.2 (2010), pp. 273–302.
[262] Sara Goldrick-Rab. "Following their every move: An investigation of social-class differences in college pathways". In: Sociology of Education 79.1 (2006), pp. 67–79.

History of Automated Essay Scoring

In 1966 Ellis Page developed Project Essay Grade (PEG), kicking off 50 years of research into AES [263]. Methodology and computational power have advanced substantially, but the task has remained largely consistent. In traditional non-automated, large-scale essay scoring, two trained experts typically score each essay, and a third expert resolves disagreements between them. The task of an AES system is to use scores from this traditional scoring process to train models that can score new essays as reliably as any individual rater.

[263] Ellis B Page. "The imminence of... grading essays by computer". In: The Phi Delta Kappan 47.5 (1966), pp. 238–243.

Let's define the task of an automated essay scoring (or AES) system, because it is narrower than the overall job of a writing instructor. Student writing is scored following a rubric broken down into subscores ("traits").

These scores are almost always integer-valued, usually with fewer than 10 possible score points. In most contexts, students respond to "prompts," a specific writing task with a limited range of potential answers, often tied to a specific source document or material to write about or analyze. Because of the narrowed vocabulary, consistent length, and relatively homogenous substance of these essays, copious prior research has shown that prompt-specific writing can be scored reliably on a common rubric by trained annotators, and that this scoring can be replicated reliably by machine learning methods [264].

It has typically been the guidance of researchers in the field, when working in collaboration with practitioners, to recommend relatively small training set sizes. In typical cases, automated essay scoring systems are trained on hundreds of essays; in high-stakes tests like the GRE or TOEFL, between 1,000 and 5,000 essays might be used for supervised learning. These numbers are large relative to a high school educator's classroom size, making personalization to individual instructors infeasible; but relative to many tasks in natural language processing, the numbers are positively tiny. Neural approaches to question answering, reading comprehension, and sentiment analysis are routinely trained on corpora consisting of hundreds of thousands or even millions of texts.

[264] Mark D Shermis. "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration". In: Assessing Writing 20 (2014), pp. 53–76.
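As a concrete illustration of the scale and evaluation setup described here, the sketch below trains a deliberately simple scoring model on synthetic stand-in data and evaluates it with quadratic weighted kappa, the agreement statistic most commonly reported in AES work. The data and feature choices are placeholders, not the models evaluated later in this thesis.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data so the example runs end to end; in practice this would be
# a few hundred to a few thousand rubric-scored essays with integer labels.
random.seed(0)
vocab = ["claim", "evidence", "because", "however", "conclusion", "the", "a"]
essays = [" ".join(random.choices(vocab, k=random.randint(50, 400))) for _ in range(200)]
scores = [random.randint(1, 5) for _ in range(200)]

X_train, X_test, y_train, y_test = train_test_split(
    essays, scores, test_size=0.2, random_state=0
)

# A classical baseline: bag-of-words features and a linear classifier that
# predicts one of the integer score points.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Quadratic weighted kappa penalizes large disagreements (predicting a 1 when
# the human score was a 5) more heavily than near-misses, unlike raw accuracy.
qwk = cohen_kappa_score(y_test, model.predict(X_test), weights="quadratic")
print(f"QWK against human scores: {qwk:.2f}")
```

On the synthetic data above the kappa will hover near zero; the point of the sketch is the pipeline and the agreement metric, not the score itself.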

Figure 22: An example of rubric traits designed for use in automated essay scoring, from my previous work on Turnitin Revision Assistant.

Automated essay scoring has been built up out of the field of psychometrics, more so than learning science or machine learning. That field focuses on building tools that use machine learning to mimic the judgment of educators evaluating the quality of student writing.

Originally used for summative purposes in standardized testing and the GRE [265], these systems are typically enabled by rubrics, which give consistent and clear goals for writers [266]. Student essays are scored either on a single holistic scale, or analytically following a rubric that breaks out subscores based on "traits" (as in Figure 22). These scores are almost always integer-valued, and typically have fewer than 10 possible score points, though scales with as many as 60 points exist. In most contexts, students respond to "prompts," a specific writing activity with predefined content, and only receive feedback on valid attempts to respond to the prompt. Applications typically include a "library" of many prompts that students can be assigned, at instructor discretion.

AES has focused historically on replicating expert readers for large-scale scoring of thousands of essays, either for end-of-year standardized assessments or entrance exams like the GRE or TOEFL [267]. While the industry vendors that provide these products have their own datasets, academic AES research has been dominated for the last five years by the dataset from the 2012 Automated Student Assessment Prize (ASAP) competition [268]. This competition used essays written to eight prompts, scored on a variety of scales. The competition had a private phase with companies as competitors, followed by a public Kaggle competition with anonymized data. Shermis [269] provides a summary of the competition, and most recent research papers report their results using the same public dataset [270, 271].

This usage favors interpretable model features informed by psychometrics, often representing high-level characteristics of writing like coherence or lexical sophistication. The primary goal is defensibility of the underlying model, known as construct validity. This construct validity through feature choice has been emphasized over measuring the ability to provide actionable guidance to writers based on the scoring. Let's look at how that might be changing.

[265] Jing Chen et al. "Building e-rater® Scoring Models Using Machine Learning Methods". In: ETS Research Report Series 2016.1 (2016), pp. 1–12.
[266] Y Malini Reddy and Heidi Andrade. "A review of rubric use in higher education". In: Assessment & Evaluation in Higher Education 35.4 (2010), pp. 435–448.
[267] Yigal Attali and Jill Burstein. "Automated Essay Scoring with e-Rater® V. 2.0". In: ETS Research Report Series 2 (2004).
[268] https://www.kaggle.com/c/asap-aes
[269] Mark D Shermis. "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration". In: Assessing Writing 20 (2014), pp. 53–76.
[270] Peter Phandi, Kian Ming A Chai, and Hwee Tou Ng. "Flexible domain adaptation for automated essay scoring using correlated linear regression". In: Proceedings of EMNLP. 2015, pp. 431–439.
[271] Kaveh Taghipour and Hwee Tou Ng. "A neural approach to automated essay scoring". In: Proceedings of EMNLP. 2016, pp. 1882–1891.

Feedback

Instruction is still (hopefully!) at the center of education, and so direct instructional technologies using algorithmic decision-making are a core part of this thesis. Throughout the history of AES research there has been a recognition that formative feedback is an essential goal for automated tools to improve student learning [272],[273]. Automated scoring brings value to the classroom, but targeted formative feedback alongside those scores is vital to the development of writing proficiency. Research in writing education demonstrates that localized, actionable feedback, presented as part of an iterative writing process, is effective. By connecting comments to the rubric's evaluation criteria, students can use the feedback to foster their ability to reflect and self-assess. Self-reported responses to teacher feedback confirm [274] that students value a combination of positive and critical comments that are specific to their own writing, and connected to the evaluation criteria. Such feedback is especially valuable on preliminary drafts, instead of later in the writing process [275]. We also see new techniques aimed at improving AES having a ripple effect of advancing fields like argument mining [276] and rhetorical structure detection [277]. For writers who are proficient or already working in professional settings, language technologies provide scaffolds like grammatical error detection and correction [278].

In the 1990s and early 2000s, classroom technology was released based on this approach, including ETS Criterion, Pearson WriteToLearn, and Vantage MyAccess. Classroom reviews of these products were mixed at best. While their use positively impacted student writing [279], students felt negative about the experience [280]. Teachers using earlier tools stated that automated scoring must be paired with actionable next steps for writers [281]. Building on this, academic work has used AES to provide formative writing instruction and feedback that students perceive as "informative, valuable, and enjoyable" [282], providing more efficient learning gains than practice alone [283].

Alongside the emergence of that research, a newer generation of tools has refocused AES to prioritize feedback to students. These include TenMarks Writing, WriteLab, Grammarly, PEG Writing, and Turnitin Revision Assistant. AES feedback's impact on writing quality varies by product. For instance, PEG Writing has been shown to save teachers time and let them focus on higher-level writing skills, but not to improve writing quality [284]. To date, there is little work discussing the longitudinal effect of AES on classroom instruction during the school year.

In many cases, automated feedback is not directly driven by the scoring algorithms themselves. For instance, ETS Criterion uses the scoring models from its e-rater system, but provides the student with feedback based on a series of separate algorithms that detect usage and mechanics errors, particular aspects of style (e.g. passive voice), and discourse elements. Arizona State's Writing Pal is an intelligent tutoring system that scaffolds writing and feedback within learning tasks. Its feedback focuses on things like structure and relevance, though it uses engineered essay features for each feedback type, divorced from the scoring model itself.

The exact approaches toward providing students with feedback vary greatly across systems, but immediacy for student viewing and reduction of teacher workload are almost universally the primary goals for such classroom systems. Broadly speaking, teachers have been supportive of this technology as deployed in schools today. These applications follow the same algorithmic approaches as high-stakes scoring, in most cases; the key difference is in the use of the tool for practice rather than measurement.

[272] Semire Dikli. "An overview of automated scoring of essays". In: The Journal of Technology, Learning and Assessment 5.1 (2006).
[273] Rod D Roscoe and Danielle S McNamara. "Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom." In: Journal of Educational Psychology 105.4 (2013), p. 1010.
[274] Melanie R. Weaver. "Do students value feedback? Student perceptions of tutors' written responses". In: Assessment & Evaluation in Higher Education 31.3 (2006), pp. 379–394.
[275] Dana R. Ferris. "Student Reactions to Teacher Response in Multiple-Draft Composition Classrooms". In: TESOL Quarterly 29.1 (1995), pp. 33–53.
[276] Huy V Nguyen and Diane J Litman. "Argument mining for improving the automated scoring of persuasive essays". In: AAAI Conference on Artificial Intelligence. 2018.
[277] James Fiacco, Elena Cotos, and Carolyn Rosé. "Towards Enabling Feedback on Rhetorical Structure with Neural Sequence Models". In: Proceedings of LAK. ACM. 2019, pp. 310–319.
[278] Hwee Tou Ng et al. "The CoNLL-2014 shared task on grammatical error correction". In: Proceedings of CoNLL. 2014, pp. 1–14.
[279] Mark D Shermis, Cynthia Wilson Garvan, and Yanbo Diao. "The Impact of Automated Essay Scoring on Writing Outcomes". In: Annual Meeting of the National Council on Measurement in Education (NCME). 2008.
[280] Cassandra Scharber, Sara Dexter, and Eric Riedel. "Students' Experiences with an Automated Essay Scorer." In: Journal of Technology, Learning, and Assessment 7.1 (2008), n1.
[281] Eric Riedel et al. "Experimental evidence on the effectiveness of automated essay scoring in teacher education cases". In: Journal of Educational Computing Research 35.3 (2006), pp. 267–287.
[282] Rod D Roscoe and Danielle S McNamara. "Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom." In: Journal of Educational Psychology 105.4 (2013), p. 1010.
[283] Scott Crossley et al. "Using automated indices of cohesion to evaluate an intelligent tutoring system and an automated writing evaluation system". In: Proceedings of AIED. Springer. 2013.
[284] Joshua Wilson and Amanda Czik. "Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality". In: Computers & Education 100 (2016), pp. 94–109.

Efficacy of AES for Learning

Up until recently, the only major study of a deployed system was from Grimes & Warschauer, studying Vantage MyAccess in a high school setting. In that work, teachers expressed significant reservations and viewed the AES models as "fallible," stating that automated scoring must be paired with feedback that gave meaningful next steps for writers [285]; nevertheless, the teachers expressed optimism for future development.

While more limited in scope, automated writing evaluation algorithms have seen some use in higher education. Cotos et al., for instance, have shown that students can use feedback based on AES to better structure their introductions to writing in scientific genres [286]. Johnson et al. showed that students had nuanced understandings of the feedback they received from such systems, and were able to engage with the revision process during writing with AES support [287]. But overall, composition scholars have resisted the introduction of automated tools into their classrooms, instead recommending local, contextual solutions that are much more expensive to manage, but that produce more fine-grained and useful information about student ability [288].

All of the inertia in real-world use is moving AES technology away from use exclusively for saving money in high-volume, high-stakes psychometric assessments. Instead, the future appears to be a formative one for AES: a focus on feedback, student agency and growth, and lower-stakes recommendations, in contexts that also include better infrastructure for valuing pre-existing faculty expertise, institutional support for writing, and professional development.

[285] Douglas Grimes and Mark Warschauer. "Utility in a fallible tool: A multi-site case study of automated writing evaluation." In: Journal of Technology, Learning, and Assessment 8.6 (2010).
[286] Elena Cotos. Genre-based automated writing evaluation for L2 research writing: From design to evaluation and enhancement. Springer, 2014.
[287] Adam C Johnson, Joshua Wilson, and Rod D Roscoe. "College student perceptions of writing errors, text quality, and author characteristics". In: Assessing Writing 34 (2017), pp. 72–87.
[288] William Condon. "Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings?" In: Assessing Writing 18.1 (2013), pp. 100–108.

Why Explain AES?

The AES field grew out of a need to replicate the work of expert essay readers at minimal cost, as the scale of assessments grew and expenses expanded past what was financially feasible. In this pursuit, many approaches try to directly incorporate insights from the experts being replaced. Many systems, including Ellis Page's PEG, ETS's e-rater [289], and more, focus on feature engineering. They create a small to moderate number of expert-designed features meant to represent high-level characteristics of writing [290]. These may include measures such as coherence or lexical sophistication [291]. The connection between the constructs used by human experts and the AES system is generally emphasized as a central feature.

An increasingly influential body of work attempts to avoid laborious feature engineering by using large numbers of low-level textual features [292] or neural network derived word or paragraph embeddings [293],[294]. These systems use high-dimensional modeling techniques, and relax the constraint that model features should mimic human reasoning. We use this approach, demonstrating with our feedback system that expert-derived features are not required for interpretable output. Our results in this area are parallel to recent work in the deep learning domain on creating textual rationales for network predictions [295].

AES is potentially a powerful tool for supporting student writing, but prior results suggest that the effect was at least partially, and perhaps wholly, mediated by a broader cultural shift that supported teachers in their instructional role, rather than being an independent "silver bullet" that produced results on its own. One serious flaw in the way AES (and in some ways, technologies developed by learning science in general) has been defended is its narrow reliance on construct validity as a path to pedagogical defensibility. Arguments have focused on the expert judgment in feature engineering of AES models. In 2004, a defense of the then-leading automated scoring model, ETS e-rater, argued of its 12 features:

"[they] reflect essential characteristics in essay writing and are aligned with human scoring criteria [...] Validity here refers to the degree to which the system actually does what is intended, in this case, measuring the quality of writing." [296]

This approach, while not without its flaws, has nevertheless been taken seriously and led to use of these systems in high-profile standardized exams. But it does not speak to the primary worry of skeptical educators: not whether the system is reliably performing the job of standardized testing, but whether students are being treated with equity in mind throughout the education process, and whether the system furthers the goals of those students in their academic journey. These are hard questions, to be sure, but they aren't impossible questions. The only reason they appear intractable to psychometricians is that they fall outside the kinds of justifications those scientists are accustomed to making. I am interested in changing that approach.

Traditional models used in psychometrics for standardized tests rely on construct validity as a defense of their application to high-stakes decision-making like university admissions; as a result, automated models are routinely deployed with fewer than 100 features, each of which is carefully tuned to align to an intuitive principle of a text's quality. More modern approaches, meanwhile, use standard NLP methods like n-grams or word embeddings to represent text using a less interpretable feature space.

[289] Martin Chodorow and Jill Burstein. "Beyond essay length: evaluating e-rater®'s performance on TOEFL® essays". In: ETS Research Report Series 2004.1 (2004), pp. i–38.
[290] Danielle S McNamara et al. "A hierarchical classification approach to automated essay scoring". In: Assessing Writing 23 (2015), pp. 35–59.
[291] Torsten Zesch and Oren Melamud. "Automatic generation of challenging distractors using context-sensitive inference rules". In: Proceedings of BEA. 2014, pp. 143–148.
[292] Peter Phandi, Kian Ming A Chai, and Hwee Tou Ng. "Flexible domain adaptation for automated essay scoring using correlated linear regression". In: Proceedings of EMNLP. 2015, pp. 431–439.
[293] Kaveh Taghipour and Hwee Tou Ng. "A neural approach to automated essay scoring". In: Proceedings of EMNLP. 2016, pp. 1882–1891.
[294] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. "Automatic Text Scoring Using Neural Networks". In: Proceedings of ACL. 2016, pp. 715–725.
[295] Tao Lei, Regina Barzilay, and Tommi Jaakkola. "Rationalizing Neural Predictions". In: Proceedings of EMNLP. 2016, pp. 107–117.
[296] Jill Burstein, Martin Chodorow, and Claudia Leacock. "Automated essay evaluation: The Criterion online writing service". In: AI Magazine 25.3 (2004), p. 27.

Implications of a Defensible AES

An outstanding and controversial question in the AES literature is exactly what is learned by AES classifiers on these small datasets, and what implications that has for instructional pedagogy and test prep. Are the models learning overly superficial features of writing, forcing students into a narrow task of replicating an idealized, simplistic essay form like the five-paragraph essay? Or are AES models capable of accurately evaluating and giving reliable scores even to more nontraditional texts that eschew the structure taught by tutors?

The answer to this question has ramifications for educational equity. Access to tutors and test prep for students is not distributed evenly. Affluent families from well-resourced suburbs and western countries have enormous advantages in educational attainment already, and an enormous body of work in education is based on closing "the achievement gap" produced by this inequity. On the other hand, much of that work comes in the form of intensive test-prep courses for students from marginalized backgrounds, which "teaches to the test" by having students memorize specific structural elements of essay texts, potentially at the expense of creativity and individual expression.

Developers of AES software have an opportunity for social change here. As technologists driving the policy conversation around the future use of algorithmic tools, we have enormous leverage in defining the tasks and training data that will feed into machine learning systems; we are listened to in a way that many other stakeholders are not [297],[298]. But in order to do so, we need to understand whether our approaches work, and if so, why.

[297] David Lehr and Paul Ohm. "Playing with the Data: What Legal Scholars Should Learn About Machine Learning". In: UCDL Rev. 51 (2017), p. 653.
[298] Kenneth Holstein et al. "Improving fairness in machine learning systems: What do industry practitioners need?" In: Proceedings of CHI. 2018.

Evaluating Neural Methods

Deep neural networks dominate today's NLP research. In particular, publishing research in NLP today almost requires interacting with the Transformer architecture popularized by BERT [299]. These models use large volumes of existing text data to pre-train multilayer neural networks with context-sensitive meanings of, and relations between, words. The models, which often consist of over 100 million parameters, are then fine-tuned to a specific new labeled dataset and used for classification.

Automation of writing assessment has historically relied on simpler models, like multivariate regression from a small set of justifiable variables chosen by psychometricians [300]. This produces models that retain direct mappings between variables and recognizable characteristics of writing, like coherence or lexical sophistication [301],[302]. In psychometrics more generally, they have a term for this defense of a model, "construct validity," built on rigorously defined alignment of model features to recognizable skills [303].

This thesis has already laid out how perilous it is to work with deep neural models for causal explanation of machine learning predictions. Add on the need for construct validity, which is outsizedly important in AES, acknowledge the results from the first part of this thesis on causal explanation [304],[305], and you get to an incredibly tight spot for model defensibility. As I set forth to build a defensible, explainable machine learning system for AES in this dissertation, one core question, and the main focus of this chapter, is whether a move to Transformers is worth the cost.

The chief technical contribution of this chapter is to measure the results of using BERT, when fine-tuned, for AES tasks. I describe an experimental setup with multiple levels of technical difficulty, from bag-of-words models to fine-tuned Transformers, and show that the approaches perform similarly. While Transformers do match state-of-the-art accuracy, they do so at the expense of hardware constraints, including an up to 100x slowdown in training time. My data shows that Transformer models improve on n-gram baselines by no more than 5%. Training a full Transformer architecture requires major hardware and energy expenditure during training, increases the carbon footprint of machine learning [306], and opens questions of fairness and explanation that are extremely hard to answer; as I'll discuss later in the dissertation, they are also hard to prioritize in organizational settings [307].

Relative to the rest of this thesis, this chapter is narrowly focused. My point is this: in AES, the payoff for the extra effort of a fine-tuned Transformer is minimal, as human inter-rater reliability often creates a ceiling for model performance. I intend to set the stage for my work on DAACS with methods that make use of neural methods only when helpful, and only with the knowledge that a better, alternate road to explanation and defensibility will be necessary. But I don't want to leave any reader with the impression that deep learning has no potential to improve performance far beyond current baseline levels from classical methods; and so I conclude the chapter with discussion of some of the directions where we might see rapid gains from that more state-of-the-art approach, even if they are not prioritized in the rest of my own work here.

[299] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: Proceedings of NAACL. 2019.
[300] Yigal Attali and Jill Burstein. "Automated Essay Scoring with e-Rater® V. 2.0". In: ETS Research Report Series 2 (2004).
[301] Helen Yannakoudakis and Ted Briscoe. "Modeling coherence in ESOL learner texts". In: Proceedings of BEA. Association for Computational Linguistics. 2012, pp. 33–43.
[302] Sowmya Vajjala. "Automated assessment of non-native learner essays: Investigating the role of linguistic features". In: International Journal of Artificial Intelligence in Education 28.1 (2018), pp. 79–105.
[303] Yigal Attali. "Validity and Reliability of Automated Essay Scoring". In: Handbook of automated essay evaluation: Current applications and new directions (2013), p. 181.
[304] Sarthak Jain and Byron C Wallace. "Attention is not Explanation". In: Proceedings of NAACL. 2019.
[305] Sofia Serrano and Noah A Smith. "Is Attention Interpretable?" In: Proceedings of ACL. 2019.
[306] Emma Strubell, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning in NLP". In: Proceedings of ACL. 2019.
[307] Michael A Madaio et al. "Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI". In: Proceedings of CHI. 2020, pp. 1–14.

Background

Natural language processing has historically used n-gram bag-of-words features to predict labels for documents. These were the standard representation of text data for decades and are still in widespread use [308]. More recently, the field moved to word embeddings, where words are represented not as a single feature but as dense vectors learned from large unsupervised corpora. Early approaches to dense representations using latent semantic analysis have been a major part of the literature on AES [309],[310], but these were corpus-specific representations; more recent work was general-purpose and allowed broader similarities between words to be captured, resulting in popular off-the-shelf representations like GloVe [311]. This allows similar words to have approximately similar representations, effectively managing feature sparsity.

But the greatest recent innovation has been contextual word embeddings, based on deep neural networks and, in particular, Transformers. Rather than encoding a word's semantics as a static vector, these models adjust the representation of words based on their context in new documents. With multiple layers and sophisticated use of attention mechanisms [312], these newer models have outperformed the state-of-the-art on numerous tasks, and are currently the most accurate machine learning models on a very wide range of tasks [313],[314]. The most popular architecture, BERT, produces a 768-dimensional final embedding based on a network with over 100 million total parameters in 12 layers.

These neural models are just starting to be used in machine learning for AES. This is especially the case for research identifying constructs as an intermediate representation for automated essay scoring and feedback [315],[316]. But end-to-end models, where texts are the only input and scores are learned directly, are in their infancy in AES and have been used only in exploratory studies in the past year, particularly in work by Rodriguez et al. [317].

BERT for Automated Essay Scoring

For document classification, BERT is "fine-tuned" by adding a final layer at the end of the Transformer architecture, with one output neuron per class label. When learning from a new set of labeled training data, BERT evaluates the training set multiple times; each pass is termed an epoch. A loss function, propagating backward to the model parameters, allows the model to learn relationships between the class labels in the new data and the contextual meaning of the words in the text. A learning rate determines the amount of change to a model's parameters. Extensive results have shown that careful control of the learning rate in a curriculum can produce an effective fine-tuning process [318]. While remarkably effective on many tasks, the NLP community is only just beginning to identify exactly what is learned in this process; research in "BERT-ology" is ongoing [319],[320],[321].

To date, there are no best practices on fine-tuning Transformers for AES; in this section I present options. I begin with classical machine learning, starting with traditional bag-of-words approaches and non-contextual word embeddings, used with Naïve Bayes and logistic regression classifiers, respectively. I then describe multiple curriculum learning options for fine-tuning BERT using AES data and best practices on fine-tuning. I end with two approaches based on BERT but with reduced hardware requirements.

Bag-of-Words Representations

The simplest and most longstanding feature set for document classification tasks represents documents as a "bag-of-words." In the simplest version of this approach, surface n-grams of length 1-2 are extracted and given "one-hot" binary values indicating presence in a document. In prior AES results, extending this baseline with n-grams over part-of-speech tags (of length 2-3), to capture syntax independent of content, and character n-grams of length 3-4, to provide robustness to misspellings, further improves AES performance [322],[323]. This high-dimensional representation typically has a cutoff threshold below which rare tokens are excluded: in this implementation, all n-grams without at least 5 token occurrences in the training data are dropped. Even still, this is a sparse feature space with thousands of dimensions.

[308] Dan Jurafsky and James H Martin. Speech and Language Processing. Vol. 3. Pearson London, 2014.
[309] Peter W Foltz, Sara Gilliam, and Scott Kendall. "Supporting content-based feedback in on-line writing evaluation with LSA". In: Interactive Learning Environments 8.2 (2000), pp. 111–127.
[310] Tristan Miller. "Essay assessment with latent semantic analysis". In: Journal of Educational Computing Research 29.4 (2003), pp. 495–512.
[311] Jeffrey Pennington, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation". In: Proceedings of EMNLP. 2014, pp. 1532–1543.
[312] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate". In: Proceedings of ICLR. 2015.
[313] Ashish Vaswani et al. "Attention is all you need". In: Proceedings of NeurIPS. 2017, pp. 5998–6008.
[314] Zihang Dai et al. "Transformer-XL: Attentive language models beyond a fixed-length context". In: Proceedings of ACL. 2019.
[315] James Fiacco, Elena Cotos, and Carolyn Rosé. "Towards Enabling Feedback on Rhetorical Structure with Neural Sequence Models". In: Proceedings of LAK. ACM. 2019, pp. 310–319.
[316] Farah Nadeem et al. "Automated Essay Scoring with Discourse-Aware Neural Models". In: Proceedings of BEA. 2019, pp. 484–493.
[317] Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. "Language models and Automated Essay Scoring". In: arXiv preprint arXiv:1909.09482 (2019).
[318] Leslie N Smith. "A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay". In: arXiv preprint arXiv:1803.09820 (2018).
[319] Olga Kovaleva et al. "Revealing the Dark Secrets of BERT". In: Proceedings of EMNLP. Vol. 1. 2019, pp. 2465–2475.
[320] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. "What Does BERT Learn about the Structure of Language?" In: Proceedings of ACL. 2019, pp. 3651–3657.
[321] Ian Tenney, Dipanjan Das, and Ellie Pavlick. "BERT rediscovers the classical NLP pipeline". In: Proceedings of ACL. 2019.
[322] Bronwyn Woods et al. "Formative essay feedback using predictive scoring models". In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2017, pp. 2071–2080.
[323] Brian Riordan, Michael Flor, and Robert Pugh. "How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models". In: Proceedings of BEA. 2019, pp. 116–126.
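As an illustration only, the following is a minimal sketch of how the features described above (word 1-2 grams, part-of-speech 2-3 grams, character 3-4 grams, one-hot encoded with a minimum-occurrence cutoff) could be assembled with scikit-learn and spaCy. It is not the original implementation; tokenization details and exact thresholds in the thesis's system may differ.

    # Illustrative sketch of the bag-of-words features described above, not the
    # author's original code: one-hot word, POS, and character n-grams with a
    # minimum of 5 occurrences in the training data.
    import spacy
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.preprocessing import FunctionTransformer

    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def pos_sequences(texts):
        # Replace each essay with its sequence of POS tags, e.g. "DET NOUN VERB ..."
        return [" ".join(tok.pos_ for tok in nlp(t)) for t in texts]

    features = FeatureUnion([
        ("word_ngrams", CountVectorizer(ngram_range=(1, 2), binary=True, min_df=5)),
        ("char_ngrams", CountVectorizer(analyzer="char_wb", ngram_range=(3, 4),
                                        binary=True, min_df=5)),
        ("pos_ngrams", Pipeline([
            ("to_pos", FunctionTransformer(pos_sequences)),
            ("vec", CountVectorizer(ngram_range=(2, 3), binary=True, min_df=5)),
        ])),
    ])

    # X_ngrams is a sparse one-hot matrix with thousands of columns:
    # X_ngrams = features.fit_transform(train_essays)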

For learning from bag-of-words representations, I use a Naïve Bayes classifier with Laplace smoothing, as implemented in Scikit-learn [324], with part-of-speech tagging from SpaCy [325].

Word Embeddings

A more modern representation of text uses word-level embeddings. This produces a dense vector, typically of up to 300 dimensions, representing each word in a document. In this implementation, I represent each document as the term-frequency-weighted mean of word-level embedding vectors from GloVe [326]. Unlike one-hot bag-of-words features, this representation is real-valued and Naïve Bayes models are inappropriate, so I instead train a logistic regression classifier, with the LibLinear solver [327] and L2 regularization, as implemented in Scikit-learn.

[324] Fabian Pedregosa et al. "Scikit-learn: Machine learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[325] Matthew Honnibal and Ines Montani. "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing". In: To appear (2017).
[326] Jeffrey Pennington, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation". In: Proceedings of EMNLP. 2014, pp. 1532–1543.
[327] Rong-En Fan et al. "LIBLINEAR: A library for large linear classification". In: Journal of Machine Learning Research 9.Aug (2008), pp. 1871–1874.
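The two classical pipelines just described can be sketched as follows. This is an illustrative approximation, assuming GloVe vectors have already been loaded into a Python dict and reusing the X_ngrams matrix from the earlier sketch; it is not the thesis's original implementation.

    # Sketch of the two classical classifiers described above.
    # Assumes: X_ngrams is the sparse one-hot matrix from the previous sketch,
    # `glove` is a dict mapping token -> 300-dim numpy vector (e.g. loaded from
    # glove.6B.300d.txt), and y holds integer rubric scores.
    from collections import Counter
    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression

    # Bag-of-words: Naive Bayes with Laplace (add-one) smoothing.
    nb = MultinomialNB(alpha=1.0)
    # nb.fit(X_ngrams, y)

    def glove_doc_vector(text, glove, dim=300):
        """Term-frequency-weighted mean of the GloVe vectors for a document."""
        counts = Counter(text.lower().split())
        total = sum(c for tok, c in counts.items() if tok in glove)
        if total == 0:
            return np.zeros(dim)
        return sum(c * glove[tok] for tok, c in counts.items() if tok in glove) / total

    # Embeddings: logistic regression with the LibLinear solver and L2 penalty.
    lr = LogisticRegression(solver="liblinear", penalty="l2")
    # X_glove = np.vstack([glove_doc_vector(t, glove) for t in train_essays])
    # lr.fit(X_glove, y)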

Fine-Tuning BERT

In this work, I fine-tune BERT using the Fast.ai library. I selected this library because of its visibility to first-time users of deep learning and its accessible online learning materials [328]. For new practitioners, its default settings are likely the most straightforward route to using deep learning.

Fast.ai recommends use of cyclical learning rate curricula for fine-tuning. In this policy, an upper and lower bound on learning rates are established. lr_max is a hyperparameter defining the maximum learning rate in one epoch of learning; I set lr_max = 0.00001. A lower bound is then derived from the upper bound, lr_min = 0.04 * lr_max. In cyclical learning, the learning rate for fine-tuning begins at the lower bound, rises to the upper bound, then descends back to the lower bound. A high learning rate midway through training acts as regularization, allowing the model to avoid overfitting and avoid local optima. Lower learning rates at the beginning and end of cycles allow for optimization within a local optimum, giving the model an opportunity to discover fine-grained new information again.

Here I assess three different curricula for cyclical learning rates, visualized in Figure 23. In the default approach, a maximum learning rate is set and cycles are repeated until reaching a threshold; for this work's halting criterion, I directly measure validation set accuracy of the model. Because of noise in deep learning training, halting at any decrease can lead to premature stopping; it is preferable to allow some small drop in performance at times. My implementation halts when accuracy on a validation set, measured in quadratic weighted kappa, decreases by more than 0.01. In the second, "two-rate" approach [329], I follow the same algorithm, but at the first halting epoch, I instead backtrack by one epoch to a previously saved version of the network, then restart training with a learning rate of 1 × 10⁻⁶ (one order of magnitude smaller). Finally, in the "1-cycle" policy, training is condensed into a single rise-and-fall pattern, spread over N epochs; the exact training length N is a hyperparameter tuned on validation data.

Figure 23: Illustration of cyclical (top), two-period cyclical (middle, log y-scale), and 1-cycle (bottom) learning rate curricula over N epochs.

Lastly, while BERT is optimized for sentence encoding, it is able to process documents up to 512 tokens long. I truncate the small number of essays longer than this maximum, mostly in ASAP dataset #2, where essays were much longer than in other datasets.

[328] https://course.fast.ai/
[329] Leslie N Smith. "A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay". In: arXiv preprint arXiv:1803.09820 (2018).
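To make the three curricula concrete, here is a small, self-contained sketch of the learning-rate schedules themselves (the per-step values they would produce). It mirrors the shapes in Figure 23 under the lr_min = 0.04 * lr_max rule above, and is not the Fast.ai implementation actually used in these experiments.

    # Illustrative learning-rate schedules for the three curricula in Figure 23.
    # These generate per-step learning rates only; they are not the Fast.ai
    # implementation used in the thesis.
    import numpy as np

    LR_MAX = 1e-5
    LR_MIN = 0.04 * LR_MAX  # lower bound derived from the upper bound

    def one_cycle(n_steps, lr_max=LR_MAX, lr_min=LR_MIN):
        """A single rise-and-fall cycle from lr_min up to lr_max and back down."""
        half = n_steps // 2
        up = np.linspace(lr_min, lr_max, half)
        down = np.linspace(lr_max, lr_min, n_steps - half)
        return np.concatenate([up, down])

    def cyclical(n_cycles, steps_per_cycle, lr_max=LR_MAX, lr_min=LR_MIN):
        """Repeat the cycle; an external criterion (e.g. validation QWK) halts it."""
        return np.concatenate([one_cycle(steps_per_cycle, lr_max, lr_min)
                               for _ in range(n_cycles)])

    def two_rate(n_cycles, steps_per_cycle, restart_cycles, lr_max=LR_MAX):
        """Cyclical schedule that restarts at a 10x smaller maximum learning rate."""
        first = cyclical(n_cycles, steps_per_cycle, lr_max, 0.04 * lr_max)
        second = cyclical(restart_cycles, steps_per_cycle, lr_max / 10, 0.04 * lr_max / 10)
        return np.concatenate([first, second])

    # Example: a 1-cycle schedule over 5 epochs of 100 batches each.
    schedule = one_cycle(5 * 100)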

Feature Extraction from BERT

Fine-tuning is computationally expensive and can only run on GPU-enabled devices. Many practitioners in low-resource settings may not have access to appropriate cloud computing environments for these techniques. Previous work has described a compromise approach for using Transformer models without fine-tuning. Peters et al. [330] describe a new pipeline: document texts are processed with an untuned BERT model, and the final activations from the network on the [CLS] token are then used directly as contextual word embeddings. This produces a 768-dimensional feature vector representing the full document. The values of this representation are then used as inputs for a traditional linear classifier (in this case, a logistic regression). In the context of education technology, a similar approach was described in [331] as a baseline for comparison in evaluating language-learner essays. This process allows us to use the world knowledge embedded in BERT without requiring fine-tuning of the model itself, and without need for GPUs at training or prediction time. For training, I use a logistic regression classifier as described in the section on GloVe.

[330] Matthew E Peters, Sebastian Ruder, and Noah A Smith. "To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks". In: Proceedings of the Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019, pp. 7–14.
[331] Farah Nadeem et al. "Automated Essay Scoring with Discourse-Aware Neural Models". In: Proceedings of BEA. 2019, pp. 484–493.
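As a concrete sketch of this pipeline, the following pulls the untuned final-layer [CLS] activation for each essay and feeds it to the same logistic regression configuration used for GloVe. The use of the HuggingFace Transformers library and its checkpoint name is an assumption for illustration; the thesis does not specify the extraction code.

    # Sketch of feature extraction from an untuned BERT model, following the
    # pipeline described above. HuggingFace Transformers is an assumed
    # implementation detail; the original extraction code may differ.
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")
    bert.eval()

    def cls_features(texts, batch_size=16):
        """Return an (n_docs, 768) matrix of final-layer [CLS] activations."""
        vectors = []
        with torch.no_grad():
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]
                enc = tokenizer(batch, padding=True, truncation=True,
                                max_length=512, return_tensors="pt")
                out = bert(**enc)
                vectors.append(out.last_hidden_state[:, 0, :].numpy())
        return np.vstack(vectors)

    # Same linear classifier as in the GloVe setup.
    clf = LogisticRegression(solver="liblinear", penalty="l2")
    # clf.fit(cls_features(train_essays), y_train)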

DistilBERT

I am not the first researcher to question the value of full-scale Transformer models. Particularly in response to the carbon concerns raised by Strubell et al. [332] and the desire for Transformer-based prediction on-device, without access to cloud compute, Sanh et al. [333] introduce DistilBERT, which they argue is equivalent to BERT in most practical aspects while reducing parameter count by 40%, to 66 million, and decreasing model inference time by 60%. This is accomplished using a distillation method [334] in which a new, smaller "student" network is trained to reproduce the behavior of a pretrained "teacher" network. Once the smaller model is pretrained, interacting with it for the purposes of fine-tuning is identical to interacting with BERT directly. In this work, I only present results for DistilBERT with the "1-cycle" learning rate policy.

[332] Emma Strubell, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning in NLP". In: Proceedings of ACL. 2019.
[333] Victor Sanh et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". In: arXiv preprint arXiv:1910.01108 (2019).
[334] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network". In: stat 1050 (2015), p. 9.
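Because the interface is identical, swapping DistilBERT in for BERT amounts to changing the model checkpoint; the sketch below again assumes HuggingFace checkpoint names rather than the exact setup used in these experiments.

    # DistilBERT as a drop-in replacement for BERT (assumed HuggingFace checkpoints).
    from transformers import AutoModel, AutoTokenizer

    # BERT: ~110M parameters, 12 layers.
    # bert = AutoModel.from_pretrained("bert-base-uncased")

    # DistilBERT: ~66M parameters, 6 layers; same tokenization and fine-tuning workflow.
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")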

Experiments

To test the overall impact of fine-tuning in the AES domain, I use five English-language datasets from the ASAP competition, jointly hosted by the Hewlett Foundation and Kaggle.com. This set of essay prompts was the subject of intense public attention and scrutiny in 2012, and its public release has shaped the discourse on AES ever since [335]. I discard the three datasets (prompts 1, 7, and 8) scored on scales with 10 or more possible points, which is not typical of most AES contexts. Prompts 2-6 are scored on smaller rubric scales with 4-6 points, a much more common scenario. I use the original, deanonymized data from [336]; an anonymized version of these datasets is available through Kaggle.com for public reproduction of results [337]. In all cases, human inter-rater reliability (IRR) is an approximate upper bound on performance. Reliability above human IRR is possible, as all models are trained on resolved scores that represent two scores plus a resolution process for disagreements between annotators.

[335] Mark D Shermis. "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration". In: Assessing Writing 20 (2014), pp. 53–76.
[336] Mark D Shermis and Ben Hamner. "Contrasting state-of-the-art automated scoring of essays: Analysis". In: Proceedings of NCME. 2012, pp. 14–16.
[337] https://www.kaggle.com/c/asap-aes

Metrics and Baselines

For measuring reliability of automated assessments, I use a variant of Cohen's kappa with quadratic weights for "near-miss" predictions on an ordinal scale (QWK). This metric is standard in the AES community [338]. High-stakes testing organizations differ on exact cutoffs for acceptable performance, but threshold values between 0.6 and 0.8 QWK are typically used as a floor for testing purposes; human reliability below this threshold is generally not fit for summative student assessment.

[338] David M Williamson, Xiaoming Xi, and F Jay Breyer. "A framework for evaluation and use of automated scoring". In: Educational Measurement: Issues and Practice 31.1 (2012), pp. 2–13.

In addition to measuring reliability, I also measure training and prediction time, in seconds. As this work seeks to evaluate the practical tradeoffs of the move to deep neural methods, this is an important secondary metric. For all experiments, training was performed on Google Colab Pro cloud servers with 32 GB of RAM and an NVIDIA Tesla P100 GPU.

I compare the results of BERT against several previously published benchmarks and results:

• Human IRR as initially reported in the Hewlett Foundation study [339].

• Industry best performance, as reported by eight commercial vendors and one open-source research team, also from that initial release of the Hewlett Foundation study.

• An early deep learning approach using a combination CNN+LSTM architecture that outperformed most reported results at that time [340].

• Two recent results using traditional non-neural models: our own from my time at Turnitin [341], which uses n-gram features in an ordinal logistic regression, and Cozma et al. [342], which uses a mix of string kernels and word2vec embeddings in a support vector regression.

• Rodriguez et al. [343], the one previously-published work that uses BERT as an AES classifier, along with comparisons to the similar XLNet architecture [344].

[339] Mark D Shermis. "State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration". In: Assessing Writing 20 (2014), pp. 53–76.
[340] Kaveh Taghipour and Hwee Tou Ng. "A neural approach to automated essay scoring". In: Proceedings of EMNLP. 2016, pp. 1882–1891.
[341] Bronwyn Woods et al. "Formative essay feedback using predictive scoring models". In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2017, pp. 2071–2080.
[342] Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. "Automated essay scoring with string kernels and word embeddings". In: Proceedings of ACL. 2018.
[343] Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. "Language models and Automated Essay Scoring". In: arXiv preprint arXiv:1909.09482 (2019).
[344] Zhilin Yang et al. "XLNet: Generalized autoregressive pretraining for language understanding". In: Proceedings of NeurIPS. 2019, pp. 5753–5763.

Experimental Setup

Following past publications, I evaluate all datasets using 5-fold cross-validation. Each of the five datasets contains approximately 1,800 essays, resulting in folds of 360 essays each. Additionally, for measuring loss when fine-tuning BERT, I hold out an additional 20% of each training fold as a validation set, meaning that each fold has approximately 1,150 essays used for training and 300 essays used for validation. I report mean QWK across the five folds. For measurement of training and prediction time, I report the sum of training time across all five folds and all datasets. For slow-running feature extraction, like n-gram part-of-speech features and word embedding-based features, I tag each sentence in the dataset only once and cache the results, rather than re-tagging each sentence on each fold. Finally, for models where distinguishing extraction time from training time is meaningful, I present those times separately.
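A minimal sketch of this evaluation protocol follows, assuming a fit/predict-style model interface and using scikit-learn's quadratically weighted kappa; fold construction details (stratification, the extra validation split used for fine-tuning) are simplified here and are not the thesis's exact code.

    # Sketch of the evaluation protocol: 5-fold cross-validation scored with
    # quadratically weighted kappa (QWK). Simplified relative to the experiments.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.metrics import cohen_kappa_score

    def cross_validated_qwk(model, X, y, seed=0):
        """Return the mean QWK across 5 folds for any fit/predict-style model."""
        fold_scores = []
        for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
            model.fit(X[train_idx], y[train_idx])
            preds = model.predict(X[test_idx])
            fold_scores.append(cohen_kappa_score(y[test_idx], preds, weights="quadratic"))
        return float(np.mean(fold_scores))

    # Example: mean_qwk = cross_validated_qwk(nb, X_ngrams, y)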

Results

Accuracy Evaluation

My primary results are presented in Table 10. I find, broadly, that all approaches to machine learning replicate human-level IRR as measured by QWK. Nearly eight years after the publication of the original study, no published results have exceeded vendor performance on three of the five prompt datasets; in all cases, a naive n-gram approach underperforms the state-of-the-art in industry and academia by 0.03-0.06 QWK. Fine-tuning with BERT also reaches approximately this performance, slightly underperforming previous results.

Table 10: Performance on each of ASAP datasets 2-6, in QWK. The final row shows the gap in QWK between the best-performing neural model and the n-gram baseline.

Model            2      3      4      5      6
Human IRR       .80    .77    .85    .74    .74
Hewlett         .74    .75    .82    .83    .78
Taghipour       .69    .69    .81    .81    .82
Woods           .71    .71    .81    .82    .83
Cozma           .73    .68    .83    .83    .83
Rodriguez       .70    .72    .82    .82    .82
n-grams         .71    .71    .78    .80    .79
Embeddings      .42    .41    .60    .49    .36
BERT-CLR        .66    .70    .80    .80    .79
BERT-1CYC       .64    .71    .82    .81    .79
BERT Features   .61    .59    .75    .75    .74
DistilBERT      .65    .70    .82    .81    .79
n-gram Gap     -.05    .00    .04    .01    .00

Of particular note is the low performance of GloVe embeddings relative to either neural or n-gram representations. This is surprising: while word embeddings are less popular now than deep neural methods, they still perform well on a wide range of tasks [345]. Few publications have noted this negative result for GloVe in the AES domain; only Dong et al. [346] use GloVe as the primary representation of ASAP texts in an LSTM model, reporting lower QWK results than any baseline I presented here. One simple explanation may be that individual keywords matter a great deal for model performance. It is well established that vocabulary-based approaches are effective in AES tasks [347], and the lack of access to specific word-based features may hinder semantic vector representations. Indeed, only one competitive recent paper on AES uses non-contextual word vectors: Cozma et al. [348]. In this implementation, they do use word2vec, but rather than use word embeddings directly they first cluster words into a set of 500 "embedding clusters." Words that appear in texts are then counted in the feature vector as the centroid of that cluster, in effect creating a 500-dimensional bag-of-words model.

Rodriguez et al. [349] demonstrate that it is possible to improve the performance of BERT slightly through hyperparameter optimization and a full grid search of possible settings. Sophisticated approaches like gradual unfreezing, discriminative fine-tuning, or increased parameter counts through newer deep learning models consistently produce slight upticks in performance. I do not claim these results are the best that could be achieved with BERT fine-tuning, but instead argue that the ceiling of results at inter-rater reliability makes the optimization questionable.

[345] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors". In: Proceedings of ACL. 2014, pp. 238–247.
[346] Fei Dong, Yue Zhang, and Jie Yang. "Attention-based recurrent convolutional neural network for automatic essay scoring". In: Proceedings of CoNLL. 2017, pp. 153–162.
[347] Derrick Higgins et al. "Is getting the right answer just about choosing the right words? The role of syntactically-informed features in short answer scoring". In: arXiv preprint arXiv:1403.0801 (2014).
[348] Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. "Automated essay scoring with string kernels and word embeddings". In: Proceedings of ACL. 2018.
[349] Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. "Language models and Automated Essay Scoring". In: arXiv preprint arXiv:1909.09482 (2019).

Runtime Evaluation

My secondary evaluation of models is based on training time and resource usage; those results are reported in Table 11. Here, deep learning approaches on GPU-enabled cloud compute produce an approximately 30-100 fold increase in end-to-end training time compared to a naive approach. In fact, this understates the gap, as approximately 75% of feature extraction and model training time in the naive approach is due to part-of-speech tagging rather than learning. Using BERT features as inputs to a linear classifier is an interesting compromise option, producing slightly lower performance on these datasets but with only a 2x slowdown at training time, all in feature extraction, and potentially retaining some of the semantic knowledge of the full BERT model. Further investigation should test whether adding features from intermediate layers, as explored in Peters et al. [350], is merited for AES.

[350] Matthew E Peters, Sebastian Ruder, and Noah A Smith. "To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks". In: Proceedings of the Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019, pp. 7–14.

Table 11: Cumulative experiment runtime, in seconds, of feature extraction (F), model training (T), and predicting on test sets (P), for ASAP datasets 2-6 with 5-fold cross-validation. Models with 1-cycle fine-tuning are measured at 5 epochs.

Model            F        T      P    Total
Embeddings      93        6      1      100
n-grams         82       27      2      111
BERT Features  213       10      1      224
DistilBERT       –    1,972    108    2,080
BERT-1CYC        –    2,956    192    3,148
BERT-CLR         –   11,309    210   11,519

Figure 24 explores this gap in training runtime more closely. Essays in the prompt 2 dataset are longer persuasive essays and are on average 378 words long, while datasets 3-6 correspond to shorter, source-based content knowledge prompts and are on average 98-152 words long. The need for truncation in dataset #2 for BERT, but not for other approaches, may explain the underperformance of the model on that dataset. Additionally, differences across datasets highlight two key considerations for fine-tuning a BERT model:

• Training time increases linearly with the number of epochs and with average document length. As seen in Figure 24, this leads to longer training for the longer essays of dataset #2, nearly as long as the other datasets combined.

• Performance converges on human inter-rater reliability more quickly for short content-based prompts, and performance begins to decrease due to overfitting in as few as 4 epochs. By comparison, in the longer, persuasive arguments of dataset 2, very small performance gains on held-out data continued even at the end of these experiments.

Figure 24 also presents results for DistilBERT. This work verifies prior published claims of speed improvements both in fine-tuning and at prediction time, relative to the baseline BERT model: training time was reduced by 33% and prediction time was reduced by 44%. This still represents at least a 20x increase in runtime relative to n-gram baselines, both for training and prediction.

Figure 24: QWK (top) and training time (bottom, in seconds) for 5-fold cross-validation of 1-cycle neural fine-tuning on ASAP datasets 2-6, for BERT (left) and DistilBERT (right).

Discussion on Neural Methods

The results in this chapter suggest that for prompt-specific AES scoring with reliable training sets, all approaches described, with the exception of non-contextual word embeddings, produce similar reliability, at approximately the level of human inter-rater reliability. There is a substantial increase in technical overhead required to implement Transformers and fine-tune them to reach this performance, with questionable practical gain compared to simpler baselines. The policy lesson for NLP researchers is that using deep learning for scoring alone is unlikely to be justifiable, given the slowdowns at both training and inference time and the additional hardware requirements. In AES, at least, Transformer architectures are a hammer in search of a nail.

As I move to work on DAACS data specifically, BERT does come up and is used for feature extraction. But it is used in conjunction with traditional methods, as my data here shows that they cannot be ruled out as effective and simple methods for good performance. With DAACS data in particular, the small total training set sizes and the relatively low reliability compared to industry datasets make those methods a good fit. So let's get started.


Training and Auditing DAACS

I now focus on a partnership for studying these problems with real-world data. My study investigates an open source tool, DAACS, that has been used at two private, non-profit online universities since 2017 to provide automated online support for incoming students. Up until 2019, these were the only sites where the tool had been used; these universities focus on non-traditional students, mid-career professionals, and military veterans. This system is currently in use by thousands of students annually. Later, in 2019, the tool was extended to use at a traditional brick-and-mortar school, as I'll investigate near the end of this chapter.

There are several tasks to investigate in this first section on DAACS, but much like in the Wikipedia analysis from earlier, the fundamental goal is to demonstrate that the machine learning system I am building is accurate enough to reliably make predictions that are worth explaining. Is the system making predictions that are reliable and informative enough to be worth the time to build a good explanation?

Alongside evaluation of overall human inter-rater reliability and automation reliability, I also conduct a fairness audit of the DAACS classifier. This approach to evaluating bias is now standard in the NLP literature: we'll measure disparate outcomes for different user groups based on demographics. Since the ProPublica investigative journalism on COMPAS in 2016, this has become a standard way to measure models for equitable use [351]. That analysis, plus the explanatory work that follows, sets up a conversation about how an automated writing assessment system can be defensible and fair.

[351] Julia Angwin et al. "Machine bias". In: ProPublica, May 23 (2016).

Setting Description

DAACS was developed in partnership between the University at Albany and Excelsior College as part of a FIPSE "First in the World" grant. Students use DAACS in a multi-stage process, taking a series of four assessments: mathematics, reading, self-regulated learning (SRL), and writing. The assessments provide feedback and guidance for students directly, and provide online dashboards to academic advisors, with summaries of students' individualized needs and available resources.

When used by students and integrated into existing advising structures at participating universities, initial results show that DAACS is associated with students finishing more credits and staying in school in their first year [352]. At minimum, these results make DAACS a predictive tool for recognizing risk in first-time students; they are also potentially causally linked to students' long-term persistence, directly improving support for students.

[352] Jason Bryer and Angela M. Lui. "Efficacy of the Diagnostic Assessment and Achievement of College Students on Multiple Success Indicators". In: Proceedings of AERA (2019).

Figure 25: Screenshot from DAACS including the writing prompt students responded to for this dataset.

Within this context, the writing prompt seen in Figure 25 is available immediately after students complete their self-regulated learning assessment. It asks students to compose a brief essay of at least 350 words, in which they reflect on their SRL survey results and select and commit to using the strategies recommended for managing and improving their learning strategies, metacognition, and academic motivation. The dataset consists of 6,243 English-language essays submitted to this prompt from within the DAACS platform in a live implementation, collected between April 2017 and February 2018. In addition to essay text, I also have access to demographic information for students, including age, race, and gender, which will come into play later.

Using the scored essays, I train AES models on the rubric shown in Tables 12-14, and establish the baseline ability of machine learning to reproduce human judgment. I show which technical approaches work most effectively at reproducing different rubric traits, such as high-level content traits, paragraph-level organization traits, and low-level sentence complexity and conventions traits.

Content / Summary
  Developing (1): The discussion of the survey and feedback is vague, poorly grounded in the survey results and feedback, and/or simplistic.
  Emerging (2): The essay uses evidence from survey results and feedback to summarize the student's strengths and weaknesses in terms of self-regulated learning. The summary lacks sufficient detail; might be under-developed in places, e.g., strengths or weaknesses might get short shrift.
  Mastering (3): The essay uses relevant survey results and feedback to provide a detailed summary of both the student's strengths and weaknesses in terms of self-regulated learning.

Content / Suggestions
  Developing (1): Choices of suggestions to which to commit are vague, if present at all, and/or only loosely connected to the survey results and feedback, if at all. The essay might refer to the continued use of current strategies but not to anything new related to the SRL feedback.
  Emerging (2): Choices of suggestions to which to commit are discussed. The connections to the survey and feedback are present but might not always be explicit.
  Mastering (3): The discussion of suggestions for improvement in SRL is logically and explicitly related to the survey results and feedback, and developed in sufficient depth.

Organization / Structure
  Developing (1): The structure and order of the essay is weak, unclear, and/or illogical.
  Emerging (2): The essay has a general structure and order but may not have a clear overall organization that enables a reader to follow the progression of one idea to another. Although the structure is logical, it might seem haphazard at times. Note: One-sentence paragraphs do not necessarily reflect a problem with organization, but numerous such paragraphs might signal a weak or haphazard structure.
  Mastering (3): The essay is well-organized, with an order and structure that present the discussion in a clear, logical manner.

Table 12: Rubric used for scoring the 2017 DAACS data (part 1).

Organization / Transitions
  Developing (1): Transitions between paragraphs are missing or ineffective; paragraphs tend to abruptly shift from one idea to the next. Note: One-paragraph essays receive a 1 for this criterion.
  Emerging (2): Paragraphs are usually linked with transitions, as needed. The transitions might be implied or strained, but the reader can follow along.
  Mastering (3): Transitions between paragraphs are appropriate and effective, and strengthen the progression of the essay (e.g. "The second aspect . . ." "The last aspect . . ." and/or the repetition of important ideas and terms to connect paragraphs).

Paragraphs / Focus
  Developing (1): Most or all paragraphs lack one clear, main point; might have several topics. Note: Numerous brief paragraphs of one or two sentences each might indicate a problem with paragraph focus and warrant a score of 1.
  Emerging (2): Paragraphs are generally but not consistently focused on a main idea or point. Some paragraphs might lack a clear focus in an essay in which the majority of paragraphs maintain a clear focus on a main idea.
  Mastering (3): Paragraphs are consistently and clearly focused on a main idea or point.

Paragraphs / Cohesion
  Developing (1): The connections between ideas in sentences within paragraphs are unclear. Little effective use of linking words and phrases.
  Emerging (2): The ideas or information in each sentence within a paragraph are generally but not consistently linked together, if only loosely. Additional or better choices of linking words and phrases would clarify the connections between ideas within paragraphs.
  Mastering (3): Within paragraphs, the individual sentences are seamlessly linked together; the reader can see the relationship between the ideas or information in one sentence and those in another sentence. The writing explicitly links sentences and ideas using adverbs (e.g., similarly, also, therefore), relative pronouns (e.g., who, that, which), conjunctions (e.g., and, or, while, whereas), and/or the repetition of key words, as appropriate.

Table 13: Rubric used for scoring the 2017 DAACS data (part 2).

Sentences / Correct
  Developing (1): Significant syntax problems, such as fragments, run-on sentences, missing/extra words, awkward constructions, dangling modifiers, and/or transposed words, are present and numerous enough to distract readers and impede meaning.
  Emerging (2): Grammatically incorrect sentences, when present, are minor and do not interfere with meaning.
  Mastering (3): There are very few or no significant syntax problems. The writer is capable of managing even complex syntactic structures correctly.

Sentences / Complex
  Developing (1): The sentences lack syntactic complexity and vary little, if at all, in structure. The sentences tend to be relatively simple in structure, following a basic subject-verb-object pattern perhaps with a few additional elements, such as brief introductory phrases, prepositional phrases, or modifiers.
  Emerging (2): Complex syntactic structures are present but may not always be managed effectively; sentence structures may be varied but are not often sophisticated.
  Mastering (3): Consistent and appropriate use of a variety of sentence structures, including sophisticated sentence structures, such as complex, compound, or compound-complex sentences, and other complex syntactic forms, such as extended participial phrases and relative clauses.

Conventions
  Developing (1): A pattern of errors in spelling, punctuation, usage (such as incorrect word forms or subject-verb agreement), and/or capitalization suggest that the writer struggles with the rules for conventions.
  Emerging (2): Spelling, punctuation, usage, and capitalization are generally correct. There may be errors but there is no pattern that suggests that the writer struggles with the basic rules.
  Mastering (3): Spelling, punctuation, and capitalization are correct to the extent that almost no editing is needed. There are very few, if any, very minor errors of usage.

Table 14: Rubric used for scoring the 2017 DAACS data (part 3).

Hand-Scoring

Rubric

The DAACS rubric has five broad categories. First, two content dimensions measure the quality of the ideas present in a text: the clarity of the student's SUMMARY [1] of their self-regulated learning assessment results, which are presented to them in terms of metacognition, learning strategies, and motivation sections; and SUGGESTIONS [2], measuring how well they describe a forward-looking plan based on those results. Next, four traits assess larger-scale, document-level measures of writing quality: the STRUCTURE [3] of the text's organization, the TRANSITIONS [4] of the document from one paragraph to another, the COHESION [5] of the sentences within paragraphs, and the FOCUS ON A MAIN IDEA [6] within each paragraph. Finally, three traits measure students' sentence-level composition. Sentences are evaluated for their COMPLEXITY [7] and their grammatical CORRECTNESS [8]. The final trait measures student ability to use the CONVENTIONS [9] of written English spelling, capitalization, and punctuation. This instrument was developed with a number of validity goals, and the criteria for each score point were written in collaboration with the director of an undergraduate writing program, who also directs a local site of the National Writing Project. The full rubric as used in scoring the 2017 data is given in Tables 12-14.

Inter-Rater Reliability

Scoring for a training dataset was completed on a subset of 540 essays. This data was double-scored following industry norms and achieved acceptable levels of inter-rater reliability; the scoring was completed prior to my involvement in the project. Statistics are shown in Table 15. Four metrics are given: exact accuracy, adjacent accuracy (which only counts disagreements as errors if they are off by more than 1 point), as well as Cohen's Kappa with both linear and quadratic weights. The last of these (QWK) is the preferred metric in the automated essay scoring literature, accounting for chance agreement and giving "partial credit" for close disagreements.

Raters perform with moderate reliability on most high-level content and structural features. For lower-level scoring of sentence-level traits, usage, and conventions, raters achieve only fair inter-rater reliability. The observed level of reliability would not be suitable for high-stakes testing, where industry vendors recommend QWK of at least 0.70.353 However, as a low-stakes diagnostic tool, the reliability was deemed sufficient for preliminary use with student data.

353 David M Williamson, Xiaoming Xi, and F Jay Breyer. "A framework for evaluation and use of automated scoring". In: Educational Measurement: Issues and Practice 31.1 (2012), pp. 2–13
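All four reliability statistics are straightforward to compute from a pair of score vectors. The sketch below is a minimal illustration, not the DAACS analysis code, assuming scikit-learn and NumPy are available; the function name and example scores are my own.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def rater_agreement(scores_a, scores_b):
    """Exact accuracy, adjacent accuracy, and weighted kappas for two raters
    scoring the same essays on a 1-3 rubric trait."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    return {
        "exact": float(np.mean(a == b)),
        "adjacent": float(np.mean(np.abs(a - b) <= 1)),  # off-by-one still agrees
        "kappa_linear": cohen_kappa_score(a, b, weights="linear"),
        "qwk": cohen_kappa_score(a, b, weights="quadratic"),
    }

# Toy example: two raters, five essays
print(rater_agreement([1, 2, 3, 2, 3], [1, 3, 3, 2, 2]))
```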

Table 15 also presents the correlation between word count and score on each trait. Perelman354 found that the correlation on timed writing samples from the well-known Hewlett Foundation study ranged from 0.434 to 0.785; on average, almost half of all variance in scores was explained by word count alone. In our data, this pattern is not observed. Instead, the highest correlation is between word count and sentence complexity, at 0.199, ranging all the way down to a very small inverse correlation between word count and adherence to grammatical conventions, at -0.037. Across all traits there is low correlation between word count and score.

354 Les Perelman. "When "the state of the art" is counting words". In: Assessing Writing 21 (2014), pp. 104–111

                       Exact   Exact+Adjacent   κ       QWK      Length Correlation
Content
  Summary              58.0    94.9             0.325   0.485     0.081
  Suggestions          58.6    94.8             0.341   0.547     0.072
Organization
  Structure            62.9    99.0             0.282   0.409     0.164
  Transitions          57.8    96.9             0.343   0.526     0.197
Paragraph
  Focus                61.9    96.3             0.304   0.522     0.136
  Cohesion             59.7    98.0             0.227   0.338     0.133
Sentence
  Correctness          55.2    96.6             0.240   0.403     0.050
  Complexity           58.6    98.8             0.237   0.338     0.199
Conventions            55.2    98.8             0.248   0.403    -0.037

Table 15: Inter-rater reliability based on human judgment.

Automated Scoring Methods

Machine learning for automated essay scoring relies on a meaningful representation of text as quantifiable, low-level features. These features are then aggregated to identify statistical patterns that are most common at each score point on a rubric. The previous chapter showed that blindly adopting the state of the art is not an ideal option, and that simpler approaches can often perform similarly or even better; for DAACS in particular, I leaned on those results to define a small set of more straightforward modeling options to optimize within. Based on what we now know about which machine learning methods work in this domain, I implemented three different machine learning paradigms, each of which is well established in the machine learning literature. First, I use a basic set of features relying on superficial text characteristics, which require only minimal automated processing of texts without any content extraction. Second, I extract features using classical "bag-of-words" natural language processing. Finally, I implement a feature extraction technique using deep neural networks, which are the current state of the art in the broader field of natural language processing.

Surface Methods: In the most simplistic approach to automated essay scoring, content is ignored completely. Instead, straightforward, easily calculated metrics like word count and average sentence length are made directly available to a logistic regression classifier. While composition scholars have historically chafed at the use of methods like these, they serve as a useful minimum baseline precisely because many writing rubrics, past and present, correlate closely with these surface-level features. For this method, I make four count features available to all machine learning models (a minimal sketch of the extraction follows the list below):

• The number of total characters in the text.

• The number of total words, separated by whitespace.

• The number of total paragraphs, separated by line breaks.

• A binary feature with a value of 1 when an essay contains at least one paragraph break.
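As a rough illustration of this representation, the sketch below computes the four surface counts from raw essay text. It is a minimal stand-in for the production feature extractor, with function and key names of my own choosing.

```python
def surface_features(essay: str) -> dict:
    """Four surface features: character count, word count, paragraph count,
    and a binary flag for whether the essay has any paragraph break."""
    paragraphs = [p for p in essay.split("\n") if p.strip()]
    return {
        "n_characters": len(essay),
        "n_words": len(essay.split()),
        "n_paragraphs": len(paragraphs),
        "has_paragraph_break": int(len(paragraphs) > 1),
    }

print(surface_features("First paragraph here.\n\nA second paragraph."))
```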

Note that I allow both the classical and neural models below to have access to these surface features, in addition to their much more sophisticated content-based representations. This is based on copious evidence that word count is a factor in essay scoring, and on prior published research, including my own and the results earlier in this thesis, showing that explicitly modeling length as a feature actually isolates and reduces machine learning's dependency on length for scoring.

Classical Methods: The most common historical representation, called "bag-of-words," was the standard representation of text data for decades and is still in widespread use today. As my previous chapter showed, this approach is still more than viable in the context of AES. In this approach, used by most commercial vendors, hundreds or thousands of very simple binary features are extracted, each representing a single vocabulary word or phrase of length 1, 2, or 3 (unigrams, bigrams, and trigrams; collectively, n-grams). The value of each feature is set to 1 only if the word or phrase appears in the text. These representations are "sparse": they consist of an enormous number of relatively rare features, and most features have a value of 0 for most essays. A Naïve Bayes classifier is most commonly used to then make a prediction about an essay's score based on how many of those features were observed in a given text. In this classical implementation, I use a categorical Naïve Bayes classifier with surface n-grams of length 1 and 2, as well as n-grams based on part-of-speech tags of length 2 and 3. Part-of-speech tags are well known in the automated assessment literature as a useful set of proxies for syntactic structure in student writing, allowing models to abstract away from the specific vocabulary used by students. This implementation closely follows those best practices.
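A minimal version of this classical pipeline is sketched below using scikit-learn. The dissertation's implementation uses a categorical Naïve Bayes over binary word and part-of-speech n-gram features; this sketch substitutes BernoulliNB (scikit-learn's Naïve Bayes variant for binary presence features) and omits the part-of-speech n-grams for brevity, so it should be read as an approximation rather than the exact model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Binary presence features for word unigrams and bigrams; part-of-speech
# bigrams/trigrams would be concatenated as additional binary columns.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),
    BernoulliNB(),
)

essays = [
    "Metacognition is an area where I can improve.",
    "I plan to manage my time better this semester.",
]
trait_scores = [2, 3]  # human rubric scores for one trait

model.fit(essays, trait_scores)
print(model.predict(["I will set a study schedule to improve my metacognition."]))
```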

Neural Methods: Over the last decade, best practices in state-of-the-art natural language processing have shifted to contextual word embeddings, based on deep neural networks. Rather than encoding an essay as a collection of observed vocabulary words, these models generate large-scale numeric vectors that represent text in context. Additionally, rather than learning solely on the basis of a single training set, they make use of enormous amounts of background data scraped from websites, novels, and other sources of language. As a result, they are able to "embed" pre-existing world knowledge of the meaning of words and phrases, quantifying how similar or different those words and phrases are. This makes the models better at identifying patterns based on synonyms, implicit relationships between words, and stylistic tendencies, while making them less reliant on matching exact keywords for automated classification.

The most popular architecture, the Transformer, is based on a network with hundreds of millions of total parameters in a series of a dozen or more interconnected layers; state-of-the-art models are hundreds of megabytes in size, pre-trained by large tech companies like Google and Facebook, and released for open source use. While the models are still opaque to interpretability,355 they are currently the most accurate machine learning models on a very wide range of tasks356,357 and are used in all major technology fields.

Nevertheless, adoption of deep neural models in automated essay scoring is nascent. Early approaches to dense representations using latent semantic analysis have been a major part of the literature on AES,358,359 but these were dataset-specific representations without pre-training. Zhang & Litman, for instance, demonstrated the feasibility of neural methods for evidence extraction,360 but determined the models were too difficult to interpret at this time for practical use in their followup study working directly with students.361

Part of the reason for the relative weakness of neural methods here, as I showed in the previous chapter, may be the relative simplicity of the task: as shown in past studies, classical machine learning techniques are sufficient for reaching human levels of reliability on rubric scoring of content traits, and automation typically cannot produce more reliable scoring than the innate subjectivity of the underlying human judgment allows. Additionally, the use of pre-training adds even more concerns around the "black box" nature of automated scoring, as neural models have been shown to learn and even amplify the stereotypes and biases found in the data used for pre-training.362,363 Finally, using these neural methods requires substantially more sophisticated hardware, including cloud servers and GPU processing, while taking an order of magnitude more processing power and time to train and make predictions.

For the implementation of deep learning in this work, I extract dense contextual representations from the most popular Transformer neural model, BERT.364 The representation of each essay is extracted from the final output layer for each sentence in each essay. A sentence's encoding from the pretrained BERT model is a vector with 768 real-valued dimensions, representing the contextual meaning of that sentence as a point in a high-dimensional space. I average these sentence-level encodings, weighted by the natural logarithm of the number of tokens in each sentence, resulting in one shared vector representation of the entire essay. This parallels recommendations from the natural language processing literature for feature extraction from neural networks on relatively straightforward tasks with small datasets.365 Evidence from the previous chapter has already shown that this stripped-down version of BERT may have some value but that the full-scale BERT implementation is likely not worth the extensive extra technical effort and maintenance cost.

355 Anna Rogers, Olga Kovaleva, and Anna Rumshisky. "A Primer in BERTology: What we know about how BERT works". In: arXiv preprint arXiv:2002.12327 (2020)
356 Ashish Vaswani et al. "Attention is all you need". In: Proceedings of NeurIPS. 2017, pp. 5998–6008
357 Zihang Dai et al. "Transformer-XL: Attentive language models beyond a fixed-length context". In: Proceedings of ACL. 2019
358 Peter W Foltz, Sara Gilliam, and Scott Kendall. "Supporting content-based feedback in on-line writing evaluation with LSA". In: Interactive Learning Environments 8.2 (2000), pp. 111–127
359 Tristan Miller. "Essay assessment with latent semantic analysis". In: Journal of Educational Computing Research 29.4 (2003), pp. 495–512
360 Justine Zhang et al. "Characterizing online public discussions through patterns of participant interactions". In: Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018), pp. 1–27
361 Elaine Lin Wang et al. "eRevis(ing): Students' revision of text evidence use in an automated writing evaluation system". In: Assessing Writing (2020), p. 100449
362 Tolga Bolukbasi et al. "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings". In: Proceedings of NeurIPS. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 4349–4357
363 Hila Gonen and Yoav Goldberg. "Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them". In: Proceedings of NAACL. 2019, pp. 609–614
364 Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: Proceedings of NAACL. 2019
365 Matthew E Peters, Sebastian Ruder, and Noah A Smith. "To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks". In: Proceedings of the Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019, pp. 7–14
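The sketch below illustrates this feature extraction step with the Hugging Face transformers library. The dissertation does not specify how each sentence vector is pooled from BERT's final layer, so this version mean-pools over tokens as one common choice; the model name, function name, and example sentences are illustrative assumptions, not the exact DAACS implementation.

```python
import math
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

def essay_vector(sentences):
    """Encode each sentence with BERT, pool its final hidden states, then
    average the sentence vectors weighted by log(token count)."""
    vectors, weights = [], []
    with torch.no_grad():
        for sentence in sentences:
            inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
            hidden = bert(**inputs).last_hidden_state       # (1, n_tokens, 768)
            vectors.append(hidden.mean(dim=1).squeeze(0))   # mean-pool tokens
            weights.append(math.log(hidden.shape[1]))       # ln(token count)
    weights = torch.tensor(weights)
    stacked = torch.stack(vectors)                          # (n_sentences, 768)
    return (stacked * weights.unsqueeze(1)).sum(dim=0) / weights.sum()

vec = essay_vector(["I scored well on metacognition.",
                    "My plan is to build better study strategies."])
print(vec.shape)  # torch.Size([768])
```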

                       Human    Surface   Classical   Neural   QWK Gap
Content
  Summary              0.485    0.097     0.472       0.372    -0.013
  Suggestions          0.547    0.091     0.505       0.463    -0.042
Organization
  Structure            0.409    0.220     0.265       0.287    -0.122
  Transitions          0.526    0.247     0.234       0.333    -0.193
Paragraph
  Focus                0.522    0.559     0.160       0.547    +0.037
  Cohesion             0.338    0.235     0.216       0.319    -0.019
Sentence
  Correctness          0.403    0.022     0.149       0.377    -0.026
  Complexity           0.338    0.137     0.247       0.366    +0.028
Conventions            0.403    0.049     0.167       0.336    -0.067

Table 16: Comparison of automated essay scoring to human baseline inter-rater reliability (in QWK), and the resulting gap between humans and the best-performing automated method.

Automated Scoring Reliability

Automated essay scoring cannot be measured for inter-rater reliability in quite the same way as reliability between two human raters. Machine learning that includes an essay in training data has overfit to that essay: the quality of the model cannot be evaluated on that essay, because it has already "seen" the correct answers. Instead, the standard methodology is to use cross-validation.

As seen in Table 16, tuned scoring with machine learning performs close to human reliability, but with significant variability in which representations are most effective. For content-based scoring, the classical methods used in traditional NLP systems outperform newer neural models, and come very close to human inter-rater reliability. For all other traits with one exception, newer neural methods that have the pre-training advantage of real-world language use outperform those older techniques. This matches the observation from prior work that, while classical methods struggle on grammatical features of writing, newer deep learning approaches are "unreasonably" effective.366

There are a few traits where performance does not fit this overall pattern. On the document-level organization traits, while neural models still outperform older methods, they significantly underperform human inter-rater reliability. The most concerning statistic comes on the "Focus" trait, where simple surface features not only outperform other automated methods, they also outperform the inter-rater reliability between two trained humans. Neural methods, which do take content into account, are also able to match human inter-rater reliability. One consistent pattern of note is that automation across most traits has a specific tendency to under-report the minority class, whichever label appears least frequently in the data.

366 Dimitris Alikaniotis and Vipul Raheja. "The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction". In: Proceedings of BEA. 2019, pp. 127–133

Figure 26: Shift in population mean scores when using AES, compared to hand-scoring.

Across the DAACS dataset, this is always the lowest score (1) on each trait. As a result, when moving to automation, the population mean is inflated slightly on all but one trait, as shown in Figure 26. Williamson et al. recommend that automated scoring degrade inter-rater reliability by no more than 0.10 QWK relative to human inter-rater reliability.367 My results show that the best-performing models in all cases other than the document-level organizational traits approximately match human performance; automated performance on organizational traits with these models would not be ready to replace human judgment, even for summative-only scoring.

367 David M Williamson, Xiaoming Xi, and F Jay Breyer. "A framework for evaluation and use of automated scoring". In: Educational Measurement: Issues and Practice 31.1 (2012), pp. 2–13

Demographic Fairness

In the years leading up to the Hewlett Foundation study, researchers at ETS suggested that bias is worth checking.368 But in that work, they failed to provide a methodology for the evaluation, and they certainly never provided data. Years later, some of those same researchers did build and release an open source tool for evaluating fairness in automated scoring.369,370 Yet again, though, they declined to provide any hard numbers on how their systems perform in their own audits, instead providing simulations. And while I single out ETS here, other vendors have made no effort at all to publish fairness audits of their systems, and do not even mention that such a problem might exist. By that standard, in fact, ETS is doing the best among their peers by acknowledging that such a problem may exist at all. Even that acknowledgment is hard to find in the automated essay scoring community.

In industry, product developers are averse to studying these problems directly. In many cases it is more prudent to not check for a problem too closely, to stay ignorant and uncertain of the effects of algorithmic decision-making on your students; the alternative might be to dig in, find the inequity that you fear might be present, and then fail to gather the development resources and budget to fix the problem in a timely fashion. In that hypothetical snare, your company has moved from ignorance to willful harm, using algorithmic systems you know to exaggerate unfair outcomes for users, without a roadmap for correction.

368 David M Williamson, Xiaoming Xi, and F Jay Breyer. "A framework for evaluation and use of automated scoring". In: Educational Measurement: Issues and Practice 31.1 (2012), pp. 2–13
369 Nitin Madnani et al. "Building better open-source tools to support fairness in automated scoring". In: Proceedings of the Workshop on Ethics in Natural Language Processing at ACL. 2017, pp. 41–52
370 Anastassia Loukina, Nitin Madnani, and Klaus Zechner. "The many dimensions of algorithmic fairness in educational applications". In: Proceedings of BEA. 2019, pp. 1–10

This is a truly unacceptable position to put a company in, and so the questions remain largely unstudied. Over the years, I have tried to find an avenue for investigating questions of bias and equity in the context of automated essay scoring. But I have consistently struggled to get projects started in earnest, and when I did see things get underway, I have had projects closed suddenly, prior to confirmation and publication of results. Outside of my own earlier work, which hints at these questions but never addresses them head-on, very few of the developers building automated scoring systems have been willing to ask questions about fairness in the tools they build.

In addition to essay text, I also have access to demographic information for students, including age, race, and gender. Among those essays, the population is 77% White and 58% women, with a median age of 32; a demographic breakdown by race and gender is presented in Table 17. In the 2017 DAACS data, the median student is mid-career, coming back to college in their 30s after a decade or more in the workforce. The courses that they are about to enroll in are fully online.

                         All               Scored
                         #       %         #       %
Race
  White                  4803    77.0      412     78.3
  Black                  725     11.6      53      10.1
  Hispanic               193     3.1       14      2.7
  Asian                  184     2.9       17      3.2
  Native                 96      1.5       11      2.1
  Multiple Races         240     3.8       19      3.6
Gender
  Women                  3672    57.5      295     55.1
  Men                    2711    42.5      240     44.9

Table 17: Race and gender demographics for all essays and the subset of essays assessed by humans using rubric scoring.

Methods

For the fairness audit here, I discard most demographic data and subdivide the student population into four groups, intersecting two variables: race and gender. For race, I group together self-identified White students and compare to all other students, collectively referred to as persons of color, or "POC." In this data, this includes all people identifying as Black, Native American, Hawaiian Native, Alaska Native, Asian, and Hispanic, who collectively make up 23% of the data. For gender, I compare self-identified men to women. As is common in technological settings,371 limitations in the underlying data result in erasure of transgender identities, which therefore cannot be evaluated. Additional variables like age, military status, and first language status are also available for future secondary analyses.

371 Os Keyes. "The misgendering machines: Trans/HCI implications of automatic gender recognition". In: Proceedings of CSCW (2018)

For all statistical significance tests, unless specified otherwise, I perform χ² tests on frequency tables of the joint occurrence of score points and the factor of interest. When calculating statistical significance per trait on the rubric, I apply Bonferroni correction, dividing the significance threshold by 9 to account for multiple comparisons. For this and all following analyses of demographic fairness, when referring to automated scoring, I use the automated predictions of the best-performing of the three models above on each trait.
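A minimal sketch of this testing procedure, assuming SciPy, is shown below; the data, group labels, and function name are hypothetical and only illustrate the contingency-table test and the Bonferroni-adjusted threshold.

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05
N_TRAITS = 9  # Bonferroni: compare each p-value against ALPHA / N_TRAITS

def trait_significance(scores, groups):
    """Chi-squared test on the contingency table of score point x group."""
    score_levels = sorted(set(scores))
    group_levels = sorted(set(groups))
    table = np.zeros((len(score_levels), len(group_levels)))
    for s, g in zip(scores, groups):
        table[score_levels.index(s), group_levels.index(g)] += 1
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, p < ALPHA / N_TRAITS

# Hypothetical scores on one trait and the race/gender group of each writer
scores = [1, 2, 3, 2, 3, 3, 1, 2, 3, 2, 1, 3]
groups = ["White-Women", "POC-Women", "White-Men", "POC-Men"] * 3
print(trait_significance(scores, groups))
```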

Results

As shown in Table 18, only small differences in population perfor- mance are observed in this dataset; those differences are not signif- icant after correcting for multiple comparisons (prior to correction, significant differences are found in Organization-Transitions and Paragraph-Focus).

                       White - Women   POC - Men   White - Men   POC - Women
Content
  Summary              2.29            2.14        2.22          2.14
  Suggestions          2.28            2.12        2.15          2.23
Organization
  Structure            2.52            2.51        2.62          2.42
  Transitions*         2.05            2.12        2.24          1.89
Paragraph
  Focus*               2.42            2.45        2.57          2.23
  Cohesion             2.52            2.49        2.62          2.40
Sentence
  Correctness          2.46            2.31        2.41          2.17
  Complexity           2.35            2.31        2.39          2.29
Conventions            2.35            2.33        2.31          2.32

Table 18: Breakdown of mean rubric scores, by race and gender intersection.

When I investigate whether automation is more or less accurate for different subgroups of students, there are no significant differences in accuracy for seven of nine traits. For two traits, however, there is a difference: in both Organization-Transitions and Paragraph-Focus, there is a statistically significant difference in accuracy for the best automated model; these results are presented in Figure 27. Specifically, I observe more accurate (and consistently lower) scores for people of color on the Organization-Transitions trait, and more accurate (and consistently higher) scores for White students on the Paragraph-Focus trait. This difference sets up some of our explanatory questions in the chapters to come.

Figure 27: Accuracy of automated scoring by trait, broken out by race and gender.

New Data Collection

The basic results of this set of analyses were available to the DAACS team in summer 2019, as they were about to embark on a new project: brick-and-mortar schools, filled with traditional-age college students. In the 2019-2020 school year, for the first time, the DAACS system moved from use in online-only universities with mid-career, nontraditional students to a brick-and-mortar state school with a younger population of students, most of whom are coming directly out of high school. Adapting the system for this use case required several changes to the system as the team had developed it:

1. A change in training data, evaluating essays from traditional college-age students instead of mid-career adults.

2. A change in professional development as the system is used more proactively by academic advisors.

3. A change in expectations for retention, motivation, and on-time progress for their student population.

All of these changes have been happening in parallel, as part of a series of pilots and rollouts as the DAACS system matures. As part of this dissertation work, I was able to assist a little. My work in auditing the original round of labeled data, and my experience in industry, revealed several areas for further refinement and improvement of the automated essay scoring system and the training data as it would be used in the new context. In particular, changes were made to the rubrics and to the way that the development team interacted with the raters as they assembled the training set. A key portion of the DAACS work is describing whether changes in best practices can change the resulting model behavior. Our goal is not only to update the system to account for the shift in student populations, but also to intentionally correct for any warning signs that are uncovered in the existing training set.

The primary warning signs were the score distribution and the demographic shift. First, on several traits in the original dataset, the full three-point scale was not used: on the Organization-Structure, Paragraphs-Cohesion, and Sentence-Complex traits, 5% or fewer of the training instances received a score of 1. In this newer round of data labeling, annotators were explicitly encouraged to use the full range of score points, avoiding any urge to "clump" in the middle of the score range for the sake of inter-rater reliability.

Next, content in the original corpus of essays came from primarily online students in their mid-30s. The topics they chose to write about were likely to differ from those in new texts from traditional brick-and-mortar institutions: students in these life circumstances have different narratives and are likely to respond differently to the DAACS prompt. Domain transfer is not at all a solved problem in natural language processing,372 and so a new dataset is necessary in order to provide reliable scores for this new student population. Differences in age and context may also help our goal of explainable NLP, illuminating differences in student populations and what they choose to write about.

372 Sebastian Ruder et al. "Transfer learning in natural language processing". In: Proceedings of NAACL: Tutorials. 2019, pp. 15–18

Immediate research questions on this new data are:

• RQ1: Did human inter-rater reliability improve when conducting scoring in a more hands-on, local way?

• RQ2: Are the scores from the new human process equally predictable by machine learning methods?

Methods

We constructed a new set of 500 essays to use for building the updated dataset and training the updated model. This set is composed of a subset of students directly from the University at Albany, who used DAACS as part of an on-campus pilot program, as well as a subset of students from the previous online domain who are explicitly in the "traditional" college age band of 18-22. Overall, because of this, the student population for this new dataset skews much younger than the original data. Race and gender representation was similar, as shown in Table 19. The rubric was changed in several small ways:

• Additional detail was given on the Content:Suggestions trait. The new wording specifically clarifies the differences between tiers, to look for language of commitment in answers that receive a 3, while not expecting that commitment language at lower score points.

                    Women            Men              All
                    #      %         #      %         #      %
Race
  White             199    42.7      119    25.5      318    68.2
  Black             35     7.5       23     4.9       58     12.4
  Hispanic          25     5.4       23     4.9       48     10.3
  Asian             8      1.7       2      0.4       10     2.1
  Native            1      0.2       3      0.6       4      0.8
  Multiple Races    19     4.1       9      1.9       28     6.0

Table 19: Race and gender demographics for essays in the 2020 dataset. International students are not included in race statistics.

• In Organization:Structure, new language was added to make it clearer when to give the lowest possible score.

• In Paragraph:Focus, additional language was added to encourage diversity in scores, especially around short essays with only one paragraph.

• In Sentence:Correct, explanation of the different score points was streamlined to make it clearer how to distinguish this category from conventions or usage errors for individual words, instead focusing on syntax errors like run-on sentences.

• The Conventions trait was broken into two separate scores, as described in Table 20.

A major change was the shift in who actually conducted the rating and how they were trained. The initial round of data was labeled by a set of 11 raters who were trained remotely, then received guidance from follow-up sessions. This newer dataset, however, was labeled in a much more hands-on way. While most work was still performed remotely due to COVID-19 restrictions, raters had more regular check-ins and extensive guidance on labeling. Data was still double-scored, and in this iteration, raters were encouraged to discuss with the DAACS research team and with each other after most (but not all) batches of labeled data. This produced a much more time-consuming process, with much greater fidelity to the types of processes valued by rhetoric and composition scholars.

As a result of this much more conversational process for establishing inter-rater reliability, the DAACS team did not stay as distant from the rater discussion process as in the 2017 data. Instead, scores were assigned in small batches, maintaining independence between raters during scoring but with check-ins and discussion after each batch. The raters themselves noted in these meetings that this led to alterations in how they interpreted rubric traits from batch to batch. The goal of this process was to improve the overall quality of the rating at the cost of time and confidence from the raters. This differs from the process enacted at high-stakes scoring vendors, where raters go through a more standardized and hands-off calibration process that has remained largely stable for decades.373

373 Carol M Myford and Edward W Wolfe. "Detecting and measuring rater effects using many-facet Rasch measurement: Part I". In: Journal of Applied Measurement 4.4 (2003), pp. 386–422

Conventions: Usage
  Developing (1): Usage errors (such as incorrect word forms, subject-verb agreement, unaccountable shifts in POV) are numerous enough to distract a reader and/or interfere with meaning. Patterns of usage errors may be evident, suggesting that the writer lacks an understanding of basic usage rules and conventions.
  Emerging (2): Usage is generally correct. There may be errors but they are neither numerous enough nor serious enough to indicate that the writer lacks a basic understanding of the rules for usage.
  Mastering (3): Usage is correct. Usage errors, if any, are uncommon and very minor.

Conventions: Punctuation
  Developing (1): Errors in punctuation are numerous enough to distract a reader and/or interfere with meaning. Patterns of punctuation errors may be evident, suggesting that the writer lacks an understanding of key rules for punctuation.
  Emerging (2): Punctuation is generally correct. There may be errors but they are neither numerous enough nor serious enough to indicate that the writer lacks a basic understanding of the rules for punctuation.
  Mastering (3): Punctuation is correct. Punctuation errors, if any, are uncommon and very minor.

Table 20: Changes to the Conventions traits in the 2020 revised rubric.

Human Scoring Results

Human inter-annotator reliability differences between the two datasets are presented in Figure 28. In almost all traits, inter-rater reliability between humans has increased in the new domain with the new process, though the overall effect is small. The most important change between the two datasets was the change in distribution of scores. Whereas the 2017 data had multiple traits where scores of 1 were rare, appearing in 5% or fewer of all datapoints, across all traits no individual score point in the 2020 data occurred in fewer than 7% of essays (exact distributions are given in Table 21). The DAACS team attributed this to a mix of lower-quality essays overall, making scores of 1 more frequent, as well as encouragement to raters to use the full scale, rather than focusing on making "safe" scores that would artificially inflate inter-rater reliability.

Figure 28: Comparison of human inter-rater reliability, in QWK, from 2017 to 2020 datasets, with changes made to rubric and process design.

                            2017                   2020
Trait                       1      2      3        1      2      3
Content Summary             18.9   39.6   41.6     21.3   43.3   35.4
Content Suggestions         22.6   33.3   44.2     17.5   32.3   50.2
Organization Structure      4.6    35.5   59.9     12.6   43.3   44.1
Organization Transition     20.9   46.8   32.3     39.2   32.7   28.0
Paragraphs Focus            13.1   28.3   58.6     14.8   38.0   47.2
Paragraphs Cohesion         3.5    38.6   57.9     7.3    37.0   55.7
Sentences Correct           10.5   39.9   49.5     9.3    26.4   64.2
Sentences Complex           5.0    54.7   40.3     9.1    49.8   41.1
Conventions Usage           12.6   41.6   45.8     15.7   31.1   53.3
Conventions Punctuation     —      —      —        21.1   44.9   33.9

Table 21: Percent distribution of each score point for each rubric trait, across datasets, in human scoring.


Automation Results

The results of our machine learning pipeline are presented in Table 22. Here, the results are more mixed. While the model is able to effectively reproduce the high human inter-rater reliability on content traits, the effectiveness of the BERT-based neural model on sentence-level and conventions-level traits has all but vanished. Performance now drastically lags behind human inter-rater reliability. For a side-by-side comparison with the 2017 results, Table 23 gives a detailed breakdown.

These results leave us with a mixed bag of conclusions about the shift in rater behaviors. The improvement in reliability for content traits is valuable, especially as the choice of content for these students will almost certainly be more impacted by domain transfer than changes in grammar or sentence-level traits. But the gap between automated and human scoring reliability on more low-level traits means that more work remains before these models can be used in a reliable way in the new domain. Future work might incorporate

data from both the 2017 and 2020 datasets into a single training set, especially on traits where the rubric criteria did not change materially. By nearly doubling the available data for a single trained model while broadening the range of essays available in the data, a better result might be achieved than with models trained on either the 2017 or 2020 data independently.

                Human                                    Automation
Trait           Exact   Exact+Adjacent   κ       QWK     Exact   QWK      QWK Gap
Summary         55.51   96.12            0.311   0.501   60.2    0.562    -0.061
Suggestions     64.90   95.31            0.426   0.58    59.18   0.442     0.138
Structure       60.41   97.76            0.359   0.535   58.16   0.374     0.161
Transitions     64.9    97.96            0.469   0.692   53.88   0.435     0.257
Focus           57.55   96.94            0.305   0.511   62.86   0.452     0.059
Cohesion        58.37   98.16            0.247   0.437   62.65   0.32      0.117
Correct         63.47   95.92            0.329   0.486   60.2    0.143     0.343
Complex         58.16   97.76            0.291   0.444   57.55   0.292     0.152
Usage           50.61   94.29            0.187   0.412   52.45   0.187     0.225
Punctuation     56.73   96.94            0.320   0.509   45.1    0.188     0.321

Table 22: Results of tuned models with new 2020 data.

                2017 Human   2020 Human   2017 Automated   2020 Automated
Summary         0.485        0.501        0.472            0.562
Suggestions     0.547        0.58         0.505            0.442
Structure       0.409        0.535        0.287            0.374
Transitions     0.526        0.692        0.247            0.435
Focus           0.522        0.511        0.559            0.452
Cohesion        0.338        0.437        0.319            0.32
Correct         0.403        0.486        0.377            0.143
Complex         0.338        0.444        0.366            0.292
Usage           0.403        0.412        0.336            0.187
Punctuation     —            0.509        —                0.188

Table 23: Comparison of automated reliability between 2017 and 2020 datasets.

Explaining Essay Structure

The previous chapter showed that levels of reliability approaching human-human agreement could be achieved through automated methods, using structural features, n-gram features, and neural methods. I used the optimized model from these experiments to build an automated essay scoring model that approached human inter-rater reliability on most traits.

I have argued for the value of holistic explanation through quantitative but non-causal means; with Wikipedia deletion debates, I showed one approach for explaining model behavior with these goals in mind. But for DAACS, our explanation of model behavior in the standard NLP tradition has so far focused narrowly on fairness, and group fairness at that. So now, I will build on the preliminary findings from the group fairness audit and study a few specific research questions about student variation, with an eye toward explaining where the model's decision-making comes from.

In the next two chapters I'll study the style of writing. Specifically, I'll look at how the AES system makes predictions on essays containing recognizable behaviors from students, predictable writing "moves" that scholars in the field would recognize. I measure how these moves alter the scores students receive and the accuracy of the automated models; I also look at how an author's identity, as defined by their race and gender, shapes which of those choices they make.

Regardless of the literal instructions provided to raters or the calibration process for scoring, students may have textual characteristics in their writing that indicate the text was written by a person of a certain background, steeped in a certain set of cultural or academic traditions. These differences have the potential to influence rater scores, whether human or algorithmic, and whether subconsciously or intentionally. The presence or absence of such cues – especially those that indicate an academic, affluent background – may lead to differential outcomes for students even on a low-stakes assessment like the DAACS. This self-reinforcing tradition of measuring academic achievement is well established in the critical theory literature,374,375 but discussion of its impact on AES is virtually absent even after decades of research.

In this first explanatory chapter, we're going to study the structure of essays. Specifically, I'm zooming in on one genre form, the five-paragraph essay (hereafter, the 5PE), and the way it is scored by humans and by machines.

374 Pierre Bourdieu. Homo academicus. Stanford University Press, 1988
375 Sara Delamont, Odette Parry, and Paul Atkinson. "Critical mass and pedagogic continuity: studies in academic habitus". In: British Journal of Sociology of Education 18.4 (1997), pp. 533–549

This structural form has been widely taught in high school for the purposes of basic writing competency and test prep since the late 19th century.376 It consists of an introduction paragraph with a thesis statement; three body paragraphs focusing on distinct topics, each providing exactly one support for the thesis; and a conclusion paragraph that summarizes the argument of the essay. This structure is consistently taught in United States high schools and subsequently un-taught in higher education.377 By focusing on five-paragraph essays, this chapter opens up an initial exploration of a new question about what standard AES models actually evaluate. Do they give reliable scores both to texts that follow a formula taught by well-intentioned English teachers and test prep tutors, and to more nontraditional texts that eschew that structure? My specific questions are:

376 Matthew J Nunes. "The five-paragraph essay: Its evolution and roots in theme-writing". In: Rhetoric Review 32.3 (2013), pp. 295–313
377 John Warner. Why They Can't Write: Killing the Five-Paragraph Essay and Other Necessities. JHU Press, 2018

• RQ1: How prevalent is the five-paragraph form in incoming first-year college student writing?

• RQ2: How are 5PEs scored when following traditional, rubric-based essay assessment?

• RQ3: Are 5PEs more or less prone to error due to automation, compared to other essay forms?

In the DAACS data, I'll construct a heuristic that identifies 5PEs automatically. With these essays identified, I measure pre-existing prevalence statistics. Then, I focus on a subset of essays in the dataset that were manually scored by trained educators on a rubric, and I measure how these scores are associated with 5PE form.

An interaction between use of the five-paragraph essay form, race, and gender would not be surprising. Five-paragraph essays might show up in the essays of students from marginalized backgrounds. For schools struggling to maintain funding based on their results on standardized tests, much of their focus on schooling comes in the form of intensive drill-based exercises, which "teach to the test" by having students memorize specific structural elements of essay texts. This phenomenon disproportionately takes up instructional time for students attending low-income schools.378 But the opposite may also be true: adherence to school expectations on assignments (and awareness of what those expectations even are) is associated with students from highly educated, affluent families,379 and girls in particular tend to outperform boys in writing assessments, an effect that increases with age.380 This effect is driven at least partially by boys' demotivation toward academic achievement.

378 Louie F Rodríguez. "Moving beyond test-prep pedagogy: Dialoguing with multicultural preservice teachers for a quality education". In: Multicultural Perspectives 15.3 (2013), pp. 133–140
379 Ronny Högberg. "Cheating as subversive and strategic resistance: vocational students' resistance and conformity towards academic subjects in a Swedish upper secondary school". In: Ethnography and Education 6.3 (2011), pp. 341–355
380 Caroline Scheiber et al. "Gender differences in achievement in a large, nationally representative sample of children and adolescents". In: Psychology in the Schools 52.4 (2015), pp. 335–348

Research suggests that success in educational tasks, especially those that are low-stakes and not tied to a specific prestige-based outcome, has socially constructed feminine connotations.381,382 Investigating how this plays out quantitatively in student writing is a gap in the AES literature, but even more so, it is a gap in the higher ed composition literature more broadly. To my knowledge, no large-scale study has been conducted on the prevalence of five-paragraph essays among demographic subgroups in higher education.

I show for the first time the ways that use of this form differs at the intersection of race and gender, and that this shapes the demographic fairness results of the previous chapter. I'll then conclude the investigation by using the explanation to mitigate this bias, encoding awareness of the use of that formula directly in the machine learning model's features. The resulting representation, with more domain knowledge embedded, improves reliability of the models for all students. Furthermore, after the improvement, underlying racial and gender disparities in model reliability are no longer present.

381 Stephen Frosh, Ann Phoenix, and Rob Pattman. "The trouble with boys". In: The Psychologist 16.2 (2003), pp. 84–87
382 Andrew Wilkins. "Push and pull in the classroom: competition, gender and the neoliberal subject". In: Gender and Education 24.7 (2012), pp. 765–781

Methods

This dataset comes from live usage of DAACS. After completing the DAACS self-regulated learning surveys, students complete a writing task that asks them to reflect on the content of their results and make plans for their upcoming college experience. The corpus consists of the training set from the previous chapter, as well as 5,712 essays submitted to this writing prompt in the DAACS platform. All essays were collected between April 2017 and February 2018. For all analyses of performance on this dataset, because human labels were not available, essays were scored with the best-performing models from the previous chapter.

Finding Five-Paragraph Essays

Any results will rely heavily on knowing what a five-paragraph essay actually is, and on automatically finding those essays in a large-scale corpus. To help readers grasp this intuitively, I provide examples of the actual essays described in this work in Table 24. These essays are divided into structural essays that use function words like "First," "second," and "finally" to signal their five-paragraph structure, and topical essays that label each body paragraph with specific content topic words like "metacognition," "motivation," or "strategies."

To identify 5PEs automatically in the dataset, I define a generic function for pattern matching based on heuristic keywords. This function takes as input a paragraph p and two sets of keywords, the first set labeled IN and the second labeled OUT.

Essay Text (first sentences from each paragraph)

Topical (exact)
  ¶1 My self-regulated learning survey was definitely an eye-opener for me. [...]
  ¶2 Metacognition is an area that I knew I had some strengths in but did not realize on how strong I am. [...]
  ¶3 Strategizing has always been one of my strong suits. [...]
  ¶4 Motivation is something that I have never lacked. [...]
  ¶5 This survey has enabled me to focus on my strengths to make sure that I use all the tools I have for myself to get through these courses proficiently. [...]

Topical (partial)
  ¶1 The human brain is a unique thing. [...]
  ¶2 Metacognition is defined, as the understanding of ones own thoughts. [...]
  ¶3 I am the first of my family to seek out an education higher than a high school diploma. [...]
  ¶4 Working on busy ambulances and staying awake for 24 hours every fourth day of my life has a way of motivating myself to improve my life that is indescribable. [...]
  ¶5 Using the focus of improving my time management and decreasing my anxiety toward education, I will be off the ambulance. [...]

Structural (exact)
  ¶1 The SRL assessment survey composed an evaluation of the strengths and weaknesses in self-regulated learning skills, additionally, it garnered recommendations on favorable strategies to develop self-regulated learning techniques. [...]
  ¶2 Metacognition is the first component of the SRL survey. [...]
  ¶3 The next important category in the SRL assessment survey was developing or enhancing particular strategies to foster a fruitful learning experience. [...]
  ¶4 The final section of the SRL assessment survey was motivation. [...]
  ¶5 In conclusion, the SRL assessment survey was a valuable insight to my learning abilities and rendered imperative strategies for success. [...]

Structural (partial)
  ¶1 The DAACS self-regulated learning survey results indicated that there are some areas in which I can improve, although I am, on the whole, a skilled learner. [...]
  ¶2 Planning is what I do, or should do, before I begin a study session or assignment. [...]
  ¶3 Monitoring is a second area of potential improvement. [...]
  ¶4 After the completion of an assignment or study session, it's valuable for me to engage in evaluation, another suggested realm of growth for me. [...]
  ¶5 The last recommendation I'd like to cover is managing time. [...]
  ¶6 In closing, I believe that the feedback provided by the DAACS self-regulated learning survey will help me to become both a more effective and a more efficient student. [...]

Table 24: Example first sentences of each paragraph from essays exactly or partially matching the 5PE heuristic search functions.

Table 25: Heuristic keywords used for matching five-paragraph essay components.

            IN                    OUT
STRUCT1     first, begin          second, third, next, last, final
STRUCT2     second, next          first, begin, third, last, final
STRUCT3     third, last, final    first, begin, second, next
TOPIC1      metacognition         strategy, motivation
TOPIC2      strategy              metacognition, motivation
TOPIC3      motivation            metacognition, strategy

Each keyword searches for stemmed, case-insensitive matches in p, which means that any conjugation or word ending is a match; for instance, a search for [metacognition] would also match on [metacognitive] or [Metacognition]. I say a particular pattern matches p if both of the following conditions hold (a minimal sketch of this matching function follows the list below):

• p contains at least one instance of any term from the first set IN.

• p contains exactly zero of the terms from the second set OUT.
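The sketch below is one way to implement this matching function. The dissertation does not name the stemmer it used, so this version assumes NLTK's Porter stemmer as a stand-in; the function name and example calls are mine.

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def match(paragraph, in_terms, out_terms):
    """True if the paragraph contains at least one IN keyword and none of
    the OUT keywords, using stemmed, case-insensitive matching."""
    stems = {stemmer.stem(w) for w in re.findall(r"[a-z]+", paragraph.lower())}
    has_in = any(stemmer.stem(t) in stems for t in in_terms)
    has_out = any(stemmer.stem(t) in stems for t in out_terms)
    return has_in and not has_out

# TOPIC1 from Table 25: IN = {metacognition}, OUT = {strategy, motivation}
print(match("My metacognitive skills surprised me.",
            {"metacognition"}, {"strategy", "motivation"}))   # True
print(match("Motivation and metacognition were both low.",
            {"metacognition"}, {"strategy", "motivation"}))   # False
```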

The exact set of keywords used for heuristic search is given in Table 25. For each essay in the dataset, I divide the text into paragraphs based on line breaks, and separate the text into the initial paragraph (the introduction), the final paragraph (the conclusion), and the middle "body" paragraphs. I use the pattern-matching function above to define two composite search heuristics over the body paragraphs, structural and topical, described in the list below (a sketch of the structural composite matcher follows the list).

• Structural Matching: For structural five-paragraph matching, I define an exact structural match as any essay where the STRUCT1, STRUCT2, and STRUCT3 matching functions are matched consecutively in the first three body paragraphs of the essay. I define a partial structural match as any essay where the three matching functions are matched, in order, in the body paragraphs, but with one or more additional paragraphs in between (allowing for essays that contain one or more interstitial or elaboration paragraphs for each topic).

• Topical Matching: I next look for topical five-paragraph essays, based on the three topics of the self-regulated learning survey that the students are asked to write about: metacognition, strategies, and motivation. I define an essay as an exact topical match if each of TOPIC1, TOPIC2, and TOPIC3 appears in exactly one body paragraph, and within that paragraph, the other two patterns do not appear. These topics match the three categories of results from the DAACS self-regulated learning results. I next say an essay is a partial topical match if the condition above is true for two of the three topic functions, and the third either never matches any paragraph in the text, or matches exactly two other paragraphs.
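As an illustration, the sketch below builds the structural composite matcher on top of the match() helper sketched earlier, using the STRUCT keyword sets from Table 25. The greedy in-order search is my own approximation of the exact/partial definitions above, not the dissertation's implementation; the topical matcher would follow the same shape with the TOPIC sets and the exactly-one-paragraph condition.

```python
# (IN, OUT) keyword pairs for STRUCT1-3, from Table 25
STRUCT = [
    ({"first", "begin"}, {"second", "third", "next", "last", "final"}),
    ({"second", "next"}, {"first", "begin", "third", "last", "final"}),
    ({"third", "last", "final"}, {"first", "begin", "second", "next"}),
]

def body_paragraphs(essay):
    """Paragraphs between the introduction and the conclusion."""
    paragraphs = [p for p in essay.split("\n") if p.strip()]
    return paragraphs[1:-1] if len(paragraphs) >= 3 else []

def structural_match(essay):
    """'exact' if STRUCT1-3 match the first three body paragraphs in a row,
    'partial' if they match in order with gaps, otherwise 'none'."""
    body = body_paragraphs(essay)
    hits, start = [], 0
    for in_terms, out_terms in STRUCT:
        found = next((i for i in range(start, len(body))
                      if match(body[i], in_terms, out_terms)), None)
        if found is None:
            return "none"
        hits.append(found)
        start = found + 1
    return "exact" if hits == [0, 1, 2] else "partial"
```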

Demographics

I use the same demographic methods as in the first chapter on DAACS. I group together self-identified non-Hispanic White students and compare to all other students, collectively referred to as people of color, or "POC." In gender, I compare self-identified men to women. As in the prior analysis, this approach results in erasure of various identities including transgender students.383

383 Os Keyes. "The misgendering machines: Trans/HCI implications of automatic gender recognition". In: Proceedings of CSCW (2018)

Results

                        Structural
                        No      Partial   Exact
Topical   No            74.6    0.3       0.5
          Partial       14.4    0.1       0.5
          Exact         8.7     0.1       0.8

Table 26: Overall prevalence of five-paragraph essays in the total dataset.

Prevalence and Impact of Five-Paragraph Essays

My first question asks how many students write five-paragraph essays unprompted when given an open-ended writing task in the online setting of DAACS. The results are displayed in Table 26. In total, 10.6% of essays are exact matches to one or both five-paragraph essay heuristics, and an additional 9.9% of essays are partial matches to one or both heuristics; collectively, five-paragraph essays make up one-fifth of all essays submitted to DAACS. This means that 79.5% of essays are not matched to either heuristic.

                       No      Partial   Exact
Content
  Summary***           2.14    2.44      2.61
  Suggestions          2.24    2.16      2.18
Organization
  Structure***         2.47    2.70      2.91
  Transitions***       2.05    2.23      2.54
Paragraph
  Focus***             2.36    2.71      2.82
  Cohesion             2.51    2.62      2.72
Sentence
  Correctness          2.35    2.49      2.49
  Complexity           2.34    2.36      2.39
Conventions            2.34    2.32      2.32

Table 27: Mean score of essays in each category of five-paragraph form, marked with *** when there is a statistically significant relationship between form and score.

Table 27 shows the human scoring in the dataset, divided between essays that are exact, partial, and non-matches to the five-paragraph form; the same data is visualized in Figure 29. A significant relationship exists in four traits, though the magnitude of the relationship varies. Where the form is a significant influence on scoring,

we would expect to see a "stairstep" pattern: exact matches performing higher than partial matches, which in turn perform higher than non-matches. This exact pattern does appear across five traits (Content-Summary, both document-level organization traits, and both paragraph-level traits, though the pattern in Cohesion is not statistically significant after correcting for multiple comparisons). Adherence to the five-paragraph form has no significant relationship with Content-Suggestions scores, sentence-level traits, or grammatical conventions.

Figure 29: Mean score of essays in each category of five-paragraph form, marked with *** when there is a statistically significant relationship between form and score.

I next show the difference in exact accuracy of those automated scoring methods, divided between essays that are exact, partial, and non-matches to the five-paragraph form. My results match very closely to the results above: in Figure 30, a similar stairstep pattern appears in automation accuracy on the same traits. On Content-Summary, Organization-Structure, Paragraph-Focus, and Paragraph-Cohesion, automated accuracy is higher for 5PEs than for all other forms. In these cases, models are effective at recognizing the presence of indicators for five-paragraph essays, which are strongly associated with higher scores, and incorporating those signals into accurate reproduction of high scores. On Organization-Transitions and Sentence-Correctness, the opposite pattern is observed: some other indicators of low-scoring essays are more easily recognized, while essays following the five-paragraph form actually receive less accurate scoring from the machine learning classifier. No relationship between automation accuracy and the five-paragraph form is observed in the Content-Suggestions, Sentence-Complexity, or Conventions traits.

Figure 30: Accuracy of automated scoring by trait, broken out by 5PE form.

Essay Scoring Trends by Demographic Group

The following series of results attempts to measure whether the automation results above have any unequal distribution between individuals based on demographics of race and gender. I find that use of the five-paragraph essay structure is not distributed evenly across those demographics. As seen in Figure 31, there is no effect for men, but there is a highly significant effect for women, mediated by race (χ² = 26.7, p < 0.001). White women are significantly more likely to follow the exact five-paragraph form compared to any other group, including a relative increase of about 30% compared to women of color.

Figure 31: Breakdown of five-paragraph essay frequency by race and gender intersection. Dashed lines indicate whole-population frequency.

This difference in form does not correspond directly to a significant difference in overall performance by race and gender, as measured by rubric scoring. As shown in Table 18, only small differences in population performance are observed in the dataset; those differences are not significant after correcting for multiple comparisons

those differences are not significant after correcting for multiple comparisons (prior to correction, significant differences are found in Organization-Transitions and Paragraph-Focus).

Summary of Results

To summarize our results so far:

• Use of the five-paragraph essay form is significantly associated with rubric scoring on document-level organization traits as well as paragraph-level cohesion and focus traits.

• While automated models generally are able to match human inter-rater reliability, there is a gap in reliability for AES when scoring document-level organization traits.

• The five-paragraph form appears disproportionately often in the essays written by White women, and disproportionately rarely in essays written by women of color; however, this difference does not correspond to higher scores for demographic populations.

• Racial discrepancy exists in the accuracy of automated scoring, on those same organization and paragraph traits. No such racial discrepancy for automated scoring is observed on content-level traits, sentence-level traits, or scoring for conventions.

Modeling the Five-Paragraph Essay

The data shows that the presence of 5PE structure does have an impact on automated writing assessment reliability. Next, I test how reliability changes when directly encoding features representing whether an essay follows the 5PE structure. In this final result for the chapter, I show that this explicit labeling of essays improves the accuracy of automated essay scoring. Furthermore, this improvement is shared across demographic groups, rather than being concentrated in any one subpopulation of students. The improvement is greatest in the assessment of document-level organization traits that were previously least reliable, and the resulting models no longer differ significantly in accuracy across demographic subgroups. This leads us to question whether the accuracy of the automated essay scoring models can be improved specifically based on the patterns above, and the disparate performance of the models across race and gender mitigated. I attempt to do so by explicit modeling of the 5PE form. In this experiment, for each automated scoring model, I make eight new quantitative features available.

Table 28: Reliability of automated essay scoring before and after 5PE encoding, in QWK.

Trait                       Human   Automated (base)   Automated (5PE)   Difference
Content: Summary            0.473        0.472              0.465          -0.007
Content: Suggestions        0.532        0.505              0.508           0.003
Organization: Structure     0.392        0.287              0.353           0.066
Organization: Transitions   0.516        0.333              0.403           0.070
Paragraph: Focus            0.519        0.559              0.601           0.042
Paragraph: Cohesion         0.339        0.319              0.340           0.021
Sentence: Correctness       0.396        0.377              0.390           0.013
Sentence: Complexity        0.320        0.366              0.327          -0.039
Conventions                 0.388        0.336              0.322          -0.014

First, six count features are added, each measuring the number of paragraphs matching one of the six pattern-matching functions described in my initial five-paragraph essay extraction. Then, I add two categorical features indicating whether an essay is, overall, an exact, partial, or non-match for each of the structural and topical five-paragraph essay heuristic functions. My rationale for this inclusion is that explicitly informing the machine learning algorithm of whether an essay fits the five-paragraph form is beneficial; doing so allows the model to not only directly weight the appearance of those forms, but to stop giving weight to indirect proxy features that were stand-in evidence of those forms in the initial representation.
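As a rough illustration of what this encoding might look like in practice, the sketch below appends the eight features to an existing essay-level feature matrix: six paragraph-match counts and two one-hot-encoded categorical match labels. The helper functions count_matching_paragraphs and match_label are hypothetical stand-ins for the pattern-matching heuristics from the earlier extraction, and the pattern names are assumptions, not the dissertation's actual feature set.

```python
# Hedged sketch of the 5PE feature encoding; helper functions and pattern names
# are placeholders for the heuristics described in the text.
import pandas as pd

PATTERNS = ["intro", "body_1", "body_2", "body_3", "conclusion", "transition"]  # assumed names

def five_pe_features(essay, count_matching_paragraphs, match_label):
    # Six count features: paragraphs matching each pattern-matching function
    feats = {f"n_{p}": count_matching_paragraphs(essay, p) for p in PATTERNS}
    # Two categorical features: exact / partial / none, structural and topical
    feats["structural_match"] = match_label(essay, kind="structural")
    feats["topical_match"] = match_label(essay, kind="topical")
    return feats

def augment(X_base: pd.DataFrame, essays, count_fn, label_fn) -> pd.DataFrame:
    extra = pd.DataFrame([five_pe_features(e, count_fn, label_fn) for e in essays])
    extra = pd.get_dummies(extra, columns=["structural_match", "topical_match"])
    return pd.concat([X_base.reset_index(drop=True), extra], axis=1)
```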

Figure 32: Reliability of automated essay scoring before and after 5PE encoding. Grey shaded area indicates human inter-rater reliability.

Table 28 and Figure 32 demonstrate the impact of encoding this information for the machine learning classifier. Performance improves for automation of scoring for three traits: Organization-Structure, Organization-Transitions, and Paragraphs-Focus. Not coincidentally, these overlap closely with the traits where essay scores were correlated with five-paragraph use in the original, human scoring data.

Table 29: Accuracy of automated scoring, broken out by race and gender; only traits where reliability was improved by adding 5PE features are shown. In modified models, there is no longer any significant difference in accuracy by demographic subgroup.

                               White Women   POC Men   White Men   POC Women
Organization Structure  Base       64.6        61.2       66.5        55.4
                        +5PE       66.8        55.1       69.1        63.1
Transitions             Base       49.6        61.2       41.5        63.1
                        +5PE       54.4        61.2       46.8        64.6
Paragraph Focus         Base       69.5        61.2       75.5        50.8
                        +5PE       71.2        65.3       72.3        56.9

The demographic impact of this improvement in performance is broken out, finally, in Table 29. Across the board, QWKs go up as the models are able to factor in the additional encoded information as part of their predictions; and after this modification is made, there is no longer any statistically significant difference in the accuracy of the automated models. For Organization-Structure, automated scoring is now within the threshold that Williamson et al.384 recommend for reliability matching human judgment; only the Organization-Transitions dimension, with a gap of 0.12, is still unreliable using industry reliability norms for automated assessment.

384 David M Williamson, Xiaoming Xi, and F Jay Breyer. “A framework for evaluation and use of automated scoring”. In: Educational Measurement: Issues and Practice 31.1 (2012), pp. 2–13
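For readers who want to reproduce the reliability metric itself, quadratic weighted kappa can be computed with scikit-learn's Cohen's kappa using quadratic weights; the toy scores below are illustrative only and are not drawn from the DAACS data.

```python
# Minimal QWK sketch with toy scores; not the dissertation's evaluation harness.
from sklearn.metrics import cohen_kappa_score

human_scores   = [2, 3, 1, 2, 4, 3, 2]
machine_scores = [2, 3, 2, 2, 3, 3, 1]
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(round(qwk, 3))
```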


Explaining Essay Content

Obviously, an enormous amount of research has gone into automated scoring of student essays. But let's remember the context of most of that work: the essays are written to get high-stakes scores, measured on rubrics for environments like the GRE, TOEFL, and, in the United States, Common Core standardized testing. The current state of the research field is optimistic: when scoring student writing, composed in a timed environment, with a clearly defined and relatively objective rubric, automation is effective at reliably reproducing the decision-making that would occur at large-scale scoring vendors like ETS, ACT, Pearson, or the College Board. Much of this research assumes student writing shares some basic, uniform characteristics. Students are assumed to be school-savvy385, with a shared understanding of disciplinary norms around what kind of text they're supposed to write. They're also assumed to be adherent, complying with the instructions of the assignment and doing their best to receive a high score. To the extent that there is non-adherence, it's assumed that it's in the direction of over-performance due to cheating. The primary outcome variable, writing ability as measured by the test construct, is privileged above other dimensions of student variation. Researchers have relied on the belief that the noise of a student's technical literacy, motivation, or the circumstances of their home life will be drowned out by the signal of measurable writing skill, in the aggregate.

Perhaps this is an appropriate set of assumptions for high-stakes testing environments. But as AES and AWE systems move into environments like classroom instruction and academic advising, cracks begin to form. A large and growing body of research has shown that students are neither homogenous in content knowledge nor in goals and intentions when interacting day-to-day with education technology. Students and schools do not have aligned beliefs about technology use, and the most significant aspect of school rules is not what they make you do but how they dictate norms of how things should be done386. Additionally, software developers and pedagogy scholars alike now acknowledge that students bring their whole self to school, belying attempts to measure students solely on academic skill387. Practically speaking, what this means is that NLP researchers have to recognize that non-adherent, even adversarial writing in education comes in many forms388.

385 Neil Selwyn. “Exploring the ‘digital disconnect’ between net-savvy students and their schools”. In: Learning, Media and Technology 31.1 (2006), pp. 5–17
386 Neil Selwyn and Scott Bulfin. “Exploring school regulation of students’ technology use – rules that are made to be broken?” In: Educational Review 68.3 (2016), pp. 274–290
387 Linda Darling-Hammond and Channa M Cook-Harvey. “Educating the whole child: Improving school climate to support student success”. In: Palo Alto, CA: Learning Policy Institute (2018)
388 Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. “Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input”. In: Proceedings of NAACL. 2018, pp. 263–271

In the financial sector, regulatory compliance prevents banking institutions from fully automating their relationship with customers, under a broad suite of "Know Your Customer" laws389. These laws require banks to be able to answer straightforward questions about their clients, like their identity and general purpose for their banking accounts, and also require ongoing monitoring of transactions in and out of accounts. Banks must know some bare minimum facts about the actual use of their automated systems. Education technology has no such requirement; in many cases, essays are written but never read, and whole corpora of student writing can be collected and trained on without inspection or review of their contents, the identity of their authors, or the students' goals in interacting with our own fully automated systems. I believe this is a missed opportunity to recognize the "whole student" that interacts with automated systems in education390.

389 Philip J Ruce. “Anti-money laundering: The challenges of know your customer legislation for private bankers and the hidden benefits for relationship management (the bright side of knowing your customer)”. In: Banking LJ 128 (2011), p. 548
390 Nicholas Yoder. “Teaching the Whole Child: Instructional Practices That Support Social-Emotional Learning in Three Teacher Evaluation Frameworks. Research-to-Practice Brief”. In: Center on Great Teachers and Leaders (2014)

Building on the last chapter's approach to five-paragraph essay analysis, this chapter goes wider, building more understanding of the actual text of the 2017 DAACS dataset. This investigation shows a wide variation in student response strategies and topic choices; a range of presuppositions of disciplinary norms and conventions for academic writing; and differences in who follows system instructions at all. Instead of a population of students varying primarily by traits that can be scored on an assessment rubric, the dataset consists of a wide spectrum of personal narratives and priorities. As a result, AES systems are subject to much more divergent inputs than is assumed by most machine learning and natural language processing research, calling into question the applicability of essentially all AES research in the transfer from standardized testing to other domains. In particular, this investigation is structured to explain automated scoring, and so it produces the following research questions:

• RQ1: What topics/themes do students write about, and do they align to design expectations?

• RQ2: How do topics/themes affect the scores students receive from automated assessment?

• RQ3: How do topics/themes differ by student demographics, specifically race and gender?

Once these topics are defined and given labels that make sense, I turn to explanation based on the findings. I include related student behaviors motivated by the research literature, like use of personal narrative and non-adherent behavior that indicates student disengagement.

I show that these behaviors are closely tied to the topics that were automatically identified and can tell us more, quantitatively, about the scores and distributions that we see in the dataset. My goal, as in the Wikipedia data, is to show that algorithmic decision-making can serve as a source of new insight, coming at the intersection of statistical evidence and more social science-driven observations and hypotheses.

Topic Modeling Methods

In the previous chapter, I found wide prevalence of the five-paragraph form; 20.5% of essays met a simple heuristic-based matching algorithm for identifying five-paragraph essays. These essays received significantly higher scores on four of the nine traits on the rubric, concentrated in content, organization, and paragraph-level scores, while no relationship between the form and scores was observed in sentence-level traits or grammatical conventions. Furthermore, I found that these essays were concentrated in significantly larger numbers among essays by White women students. But up to this point, I haven't spent any real time describing what the students wrote about, beyond the structural constraints of the five-paragraph essay form. To fix this and make sense of the large set of student writing available to me, I turn to topic modeling.

Technical Approach

To perform topic modeling in a rigorous yet unsupervised way, avoiding manual annotation of each paragraph across thousands of student documents, I perform Latent Dirichlet Allocation topic modeling391. This approach is standard in the NLP community, and has been extended to many adjacent fields when performing large-scale text analysis392,393. LDA takes as input a set of texts and a predefined number of latent topics to discover, which is a hyperparameter that must be tuned for the specific purpose and domain. The model then infers a distribution of vocabulary terms for each topic, supposing that each input text should be composed of only a small number of topics. Topic modeling using LDA has no direct access to student variables like demographics or essay scores; student text is the only input. The model I describe in depth here has 20 topics; later in my analysis I describe the robustness of the explanation to changes in the number of topics.

391 David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In: Journal of Machine Learning Research 3.Jan (2003), pp. 993–1022
392 Ashraf Abdul et al. “Trends and trajectories for explainable, accountable and intelligible systems: An hci research agenda”. In: Proceedings of CHI. 2018, pp. 1–18
393 Qian Yang, Nikola Banovic, and John Zimmerman. “Mapping machine learning advances from hci research to reveal starting places for design innovation”. In: Proceedings of CHI. 2018, pp. 1–11

In this analysis, I treat each paragraph as a separate input document, rather than each full essay. Using paragraphs as the base unit of analysis for student writing is consistent both practically for analysis, and qualitatively based on the structure of student writing.

By calculating topic distributions on a per-paragraph rather than per-essay level, I am able to assign more specific labels to subsections of student text in the analysis. At even smaller granularities like sentences, however, vocabulary sparsity makes topic labeling imprecise and difficult to evaluate or interpret.
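A minimal sketch of this setup, under assumed file and column names rather than the actual DAACS pipeline, is shown below: paragraphs are the input documents, the topic count is an explicit hyperparameter (20 for the model analyzed here), and no demographic or score information is passed to the model.

```python
# Hedged sketch of paragraph-level LDA; filenames and preprocessing are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

paragraphs = pd.read_csv("daacs_paragraphs.csv")["text"]   # hypothetical input

vectorizer = CountVectorizer(stop_words="english", min_df=5)
counts = vectorizer.fit_transform(paragraphs)               # bag-of-words counts only

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topic = lda.fit_transform(counts)    # per-paragraph topic distribution
term_topic = lda.components_             # per-topic term weights, used for labeling
```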

Naming Topics

Topic modeling is an inherently mixed-methods process, as each output topic is a probability distribution over vocabularies, and topics do not come with labels but must be interpreted based on subject matter expertise. To determine the appropriate label for topics, I calculate standard saliency metrics from Chuang et al.394 to find key terms that appear primarily in only one topic, and additionally performed qualitative review of the text of paragraphs that were labeled with a given topic. Future work may improve on this method by working with expert raters to assign names to topics; we did this, for instance, in my previous work with Diyi Yang on cancer support communities395. In Part IV of this dissertation, I discuss what this approach might look like as a means of further strengthening the explanatory power of an analysis. For the purposes of this chapter, though, no external annotators were involved.

394 Jason Chuang, Christopher D Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models”. In: Proceedings of the International Working Conference on Advanced Visual Interfaces. 2012, pp. 74–77
395 Diyi Yang et al. “Seekers, Providers, Welcomers, and Storytellers: Modeling Social Roles in Online Health Communities”. In: Proceedings of CHI. ACM. 2019, p. 344
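As I read the Termite saliency metric, it weights each term's overall probability by how sharply its topic distribution diverges from the corpus-wide topic distribution; the sketch below implements that reading from an LDA term-topic matrix and should be treated as an approximation of, not a substitute for, the original formulation.

```python
# Hedged sketch: saliency(w) = P(w) * KL( P(topic | w) || P(topic) ),
# my reading of the Chuang et al. metric, computed from lda.components_.
import numpy as np

def term_saliency(term_topic: np.ndarray) -> np.ndarray:
    """term_topic: unnormalized (n_topics, n_terms) weights, e.g. lda.components_."""
    joint = term_topic / term_topic.sum()             # P(topic, w)
    p_w = joint.sum(axis=0)                           # P(w)
    p_t = joint.sum(axis=1, keepdims=True)            # P(topic)
    p_t_given_w = joint / p_w                         # P(topic | w)
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.where(p_t_given_w > 0,
                      p_t_given_w * np.log(p_t_given_w / p_t), 0.0).sum(axis=0)
    return p_w * kl                                   # one saliency value per term
```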

Topic Results

Structural Paragraphs

Three topics emerged that were typically used as structural paragraphs in five-paragraph essays. Of these, one represented paradigmatic Structure:Introduction paragraphs, and two represented paradigmatic "conclusions". The two topics that were extracted as conclusions show very different stylistic approaches to closing off an essay. The first, which I label Structure:Conclusion:Excitement, emphasizes the following terms: excited, new, journey, and forward. Concluding statements like the following were typical of this style:

Being able to clearly see the way I learn, and the ways I can improve my learning is a great way to get off on the right foot. I feel more confident then ever to get started on my WGU journey, thanks to the results and feedback I received from the SRL.

The alternative topic I label Structure:Conclusion:Commitment. These essays focused not on enthusiasm but on specific goal-based reasons to complete their degree; keywords included degree, bachelor, earning, goal, and dream.

While this is a motivating factor for college enrollment in general, diploma completion was not an explicit topic of the essay writing prompt itself, making the salience of this theme in conclusion paragraphs noteworthy.

I am really looking forward to completing my classes and gaining a degree. This would be a major accomplishment in both my personal life and profes- sional career. This could take me a lot further in life as well as be supporting my family.

Body Paragraphs

The DAACS SRL survey results are the primary topic of essays written in this dataset; these results are structured into three broad topic areas, each of which is broken down into 3-4 subsections for students to browse. The previous chapter's results showed a high prevalence of five-paragraph essays that used this structure from the user interface as a proxy for how they should structure their work, with one body paragraph for each section. One question that I anticipated going into this analysis was whether those sections and subsections would be identified as latent topics by LDA. The menu, as displayed to students, is shown in Figure 33. I rapidly identified three topics that each corresponded to a general overview of one of the three main sections of the SRL survey – the kind of one-paragraph summaries typical of five-paragraph essays in the previous chapter. I labeled these topics as Strategies:5PE, Motivation:5PE, and Metacognition:5PE. Representative paragraphs labeled with these topics, for instance, look like this:

Figure 33: Sidebar menu for the DAACS self-regulated learning survey, which organizes results into a hierarchy.

The last category was Motivation which I also scored in the high-range for. The category was broken down into four sections which were; self-efficacy, mastery orientation, anxiety, and mindset. My results indicated that I am confident in my learning abilities, that I find learning enjoyable, keep my anxiety levels low, and exhibit a growth mindset. These skills all contribute to my motivational level and overall capabilities to be a successful student.

The next skill set is strategies which I have a good handle on. Managing environment and seeking help were were two categories that I know that I have a good grasp on. I was worried about time management and understanding. The survey shows that I have good skills in those areas. Even though I have strong skills in those area It still shows you what you can do to improve or stay strong in those areas.

Beyond the three generic topics, eight additional topics matched closely to subtopics within the SRL survey structure. Three of the four subcategories for strategies – Strategies:Help-Seeking, Strategies:Time, and Strategies:Environment – were easily identifiable.

Topics identified three of the four subsections in Motivation – Motivation:Goals, Motivation:Mindset, and Motivation:Anxiety – but no separate topic emerged for self-efficacy. Two topics identified sub-areas for metacognition, one specific to Metacognition:Planning and one that grouped together the other two topics, Metacognition:Monitoring. An example paragraph, labeled with the help-seeking topic, looks as follows:

Finally, asking for help. After speaking with my mentor today, he strongly suggested I utilize “course mentors” for any and all subjects. These mentors can help save time with my studies, and provide very good feedback. Having subject experts at my disposal is a very powerful strategy, and one which I plan on taking advantage of.

Between overview topics and subsection-specific topics, these identifiable, section-based topics make up 11 of the 20 topics that were surfaced by LDA; after accounting for the three structural topics, this leaves six topics that do not adhere closely to a structural part of the five-paragraph essay or the subject matter of the prompt.

Non-Adherent Topics

The three smallest topics by volume all related to various types of personal backstory and narrative for students. Less cohesive than the other topics related to specific five-paragraph essay themes, I group these three topics together as a single unit of analysis, Narrative:Past. One additional topic, labeled Narrative:Future, was also narrative-driven, but rather than focusing backward, looked forward to specific actionable plans for enacting strategies from the DAACS suggestions. Here, students described their intent for their upcoming college experience. While they do not necessarily adhere to a topic-focused essay form itself, they are clearly good-faith attempts to respond to the prompt. An example is:

I am currently working on my initial orientation for Western Governors Uni- versity. I’m proud of the fact I have the opportunity to further my education as well as my role at work. The last time I attended college was about ten years ago. I attended a local private junior college, I felt I didn’t have the confidence I needed. I attended everyday and completed all the course work that was as- signed to me. I gradated with higher grades than I ever dreamed of. My goal is to become a dedicated student for the next two years while attending WGU.

The model finally discovered two behavioral categories that did not correspond to topics from the survey, structural paragraphs from the five-paragraph form, or personal narratives. These represent a break from the "school-ratified" norm of the previous two subsections. Instead they represent transgressive or non-adherent behavior. The first category, which I refer to as Non-adherent:Pasting, identified paragraphs that were copied from elsewhere in the DAACS interface but were not the writing prompt itself.

Pasting in prompt text as part of a student essay is not expected behavior in any AES system's training data; its relatively high prevalence in real-world writing is noteworthy as a structuring strategy for student writers. The final topic is perhaps the most confrontational and least adherent to "school rules." In this topic, the system identified a pattern of students directly confronting the system itself, describing the DAACS assessment in metalanguage, whether out of concern for privacy, academic preparedness, or a variety of other reasons. One typical example of this Non-adherent:Skeptic topic is given below:

I find that no matter what the results of the test report, they cannot factor in all the required information that is to say that the questions are lacking in substance. The test has flaws in that the person taking the test is answering questions based on feeling; there is no real guarantee that questions answered, are honest. If the person taking the test is factoring in other information that makes the exception to a rule then answers will skew the results.

To conduct further analyses by paragraph, for each topic, I assigned a value to each paragraph between 0 and 100, representing the integer percentage of tokens that were drawn from that topic by the LDA model. This produces a distribution over topics for each paragraph. I assigned category labels to paragraphs based on the topic with the highest (plurality) score, where the three smallest topics were combined into a single label. A full breakdown at the paragraph level for these eighteen topics of analysis, after grouping the three smallest topics, is visualized in Figure 34.
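The labeling rule just described can be stated in a few lines of code; in this sketch the topic names and the indices of the three smallest topics are hypothetical, since they depend on the fitted model.

```python
# Hedged sketch of the plurality-topic labeling rule, with the three smallest
# topics merged into a single Narrative:Past group. Indices and names are assumed.
import numpy as np

SMALLEST_THREE = {7, 11, 19}     # hypothetical indices of the three smallest topics

def label_paragraphs(doc_topic: np.ndarray, topic_names: list) -> list:
    percents = np.rint(doc_topic * 100).astype(int)   # integer percentage of tokens per topic
    labels = []
    for row in percents:
        top = int(row.argmax())                       # plurality (highest-scoring) topic
        labels.append("Narrative:Past" if top in SMALLEST_THREE else topic_names[top])
    return labels
```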

Figure 34: Distribution of topic assignments to paragraphs from the LDA model.

Table 30: Differences for intersections of topic and trait, from mean population scores. Only significant effects are shown.

Topic: Subtopic              Significant differences across the nine traits
Structure: Introduction      +0.16, +0.16, +0.17, +0.20, +0.29, +0.16, +0.10, +0.08, +0.07
Structure: Excitement        -0.18
Structure: Commitment        -0.21, -0.23
Strategies: 5PE              (none)
Strategies: Help-Seeking     (none)
Strategies: Environment      +0.16, +0.09, +0.09, +0.17, +0.07
Strategies: Time             (none)
Motivation: 5PE              +0.46, +0.21, +0.22, +0.21, +0.29, +0.16
Motivation: Anxiety          (none)
Motivation: Mindset          (none)
Motivation: Goals            (none)
Metacognition: 5PE           +0.38, +0.25, +0.21, +0.25, +0.30, +0.21
Metacognition: Monitoring    +0.12, +0.16
Metacognition: Planning      -0.27
Narrative: Past              (none)
Narrative: Future            +0.24, +0.20, +0.24, +0.30, +0.16
Non-adherent: Pasting        (none)
Non-adherent: Skeptic        -0.58

Topics and Scores

Next I study the intersection of topic appearance in essays with scores that those essays receive from automated essay scoring. These scores are the output of the trained model as described at the end of the previous chapter, including five-paragraph essay features. For this analysis, I am only interested in automated scoring, not human judgment, so I include only the data that was automatically scored, not data points or labels from the training set. I include one row for each of the eighteen topics for analysis. Because of the very large number of possible comparisons (18 topic rows and 9 trait subscores), I apply Bonferroni correction for multiple comparisons.
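The dissertation does not spell out the exact test statistic behind Table 30, so the sketch below uses a two-sample Welch t-test as a stand-in: for each topic and trait, essays containing the topic are compared to essays that do not, and the Bonferroni correction is applied across all 18 × 9 comparisons. Column names are hypothetical.

```python
# Hedged sketch of topic-by-trait comparisons with Bonferroni correction; the
# t-test is a stand-in for whatever test the original analysis used.
from itertools import product
import pandas as pd
from scipy.stats import ttest_ind

def topic_score_effects(df: pd.DataFrame, topics, traits, alpha=0.05):
    n_tests = len(topics) * len(traits)               # 18 x 9 = 162 comparisons
    rows = []
    for topic, trait in product(topics, traits):
        has, lacks = df[df[topic]], df[~df[topic]]    # boolean topic-presence columns assumed
        stat, p = ttest_ind(has[trait], lacks[trait], equal_var=False)
        if p * n_tests < alpha:                       # Bonferroni-corrected significance
            rows.append((topic, trait, has[trait].mean() - df[trait].mean(), p))
    return pd.DataFrame(rows, columns=["topic", "trait", "diff_from_mean", "p"])
```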

Results

The significant relationships between topics and automated trait scores are presented in Table 30. The single most consistent significant relationship is the presence of an identifiable introduction paragraph. Essays where at least one paragraph is labeled with the Structure:Introduction topic receive significantly higher scores on all nine traits – the only topic for which this is true. A few of the topics focused on specific body paragraphs predict stronger scores on high-level scoring traits.

In particular, essays that include paragraphs labeled as part of five-paragraph essays in both motivation and metacognition receive higher scores. In strategies, those overview paragraphs are not a significant predictor of higher scores; instead, paragraphs focused on environment management are the subtopic associated with significantly higher scores. Outside of body paragraph topics, paragraphs labeled as Narrative:Future, typically reflecting commitment to future plans, receive significantly higher scores on five out of nine traits. Meanwhile, the single largest effect is for essays containing paragraphs that are labeled as Non-adherent:Skeptic. These essays receive lower scores specifically on the trait measuring how well students expressed plans for following suggestions from DAACS.

Topics and Demographics

We can now move on to study how these behaviors differed by the demographic groups of the students in the dataset. I will divide these findings by topic subsets, using the same groupings as before (structural, body, narrative, and non-adherent topics).

Table 31: Race and gender counts for essays in the DAACS dataset, specifically among unscored essays.

        White         Black        Hispanic     Asian        Native       Multiple
        W      M      W     M      W     M      W     M      W     M      W      M
#       2535   1854   428   243    88    91     81    86     53    32     130    91
%       44.4   32.5   7.5   4.3    1.5   1.6    1.4   1.5    0.9   0.6    2.3    1.6

I use the same demographic groups as in previous chapters on DAACS; on the larger unsupervised dataset, the distribution of demographic labels is given in Table 31. I group together self-identified non-Hispanic White students and compare to all other students, collectively referred to as people of color, or "POC." In gender, I compare self-identified men to women. As in the prior analysis, this approach results in erasure of various identities including transgender students396. For each of the eighteen topics, I perform a χ² test of observed use of the topic in essays for the binary split by race and by gender, then for each of the four intersecting subpopulations. Bonferroni correction was applied to correct for multiple comparisons. Each results section will begin with a visualization of the relative difference in occurrences of documents where each topic appears, compared to the overall population mean. Statistically significant differences (p < 0.05 after correction) will be indicated with asterisks.

396 Os Keyes. “The misgendering machines: Trans/HCI implications of automatic gender recognition”. In: Proceedings of CSCW (2018)

Race and Structure

Figure 35: Subgroup differences for document structure topics.

Results for the three topics related to five-paragraph essay structure are visualized in Figure 35. The largest gap is present for introduction paragraphs, which students of color were significantly less likely to use – and again, significantly less likely for women of color specifically compared to men. These results align with my prior findings that White women were most likely to follow standard five-paragraph essay forms. Small racial effects were seen for both types of conclusion topic, but the differences were not significant.

These results align with the hypothesis of a specifically racialized gap in the use of a standard introduction paragraph, a gap which, given the previous section's findings, translates into lower scores. The educational system favors pupils who have academically educated parents, students that are socially and culturally close to school culture397. In writing, genre norms are a dynamic and locally mediated idea, not an "unmoving, absolutely knowable rule"398; here I find a place where scores are defensible on a rubric and fairly applied by those guidelines, yet lead to disparate racial outcomes based not on bias but on normative definition of what makes a good essay.

397 Guangwei Hu and Jun Lei. “Chinese university students’ perceptions of plagiarism”. In: Ethics & Behavior 25.3 (2015), pp. 233–255
398 Margaret Price. “Beyond ‘gotcha!’: Situating plagiarism in policy and pedagogy”. In: College Composition and Communication (2002), pp. 88–115

Gender, Race, and Topics

Results for body paragraph topics related to the SRL survey subsections are presented in a large graph in Figure 36. Several topics, including all three five-paragraph essay body paragraph topics, show no significant effects, and are not discussed in this section. One of the most striking and highly significant variations focused on strategies for time and environment management. Men, and White men in particular, were significantly more likely to include time management paragraphs in their essays. By contrast, White women in particular were significantly more likely to discuss environment management.

Figure 36: Subgroup differences for body paragraph topics.

When I investigate this difference, the contents of the essays reflect a much more frequent focus on family life, childcare, and coordination with husbands and work responsibilities:

Another subject the survey brought up that I struggled in was managing my environment. As I talked about before, my husband and I already discussed having uninterrupted study time every day. During this time, my husband will be responsible for the kids for me to utilize that time I have set aside every day. This will be a learning process for us all, especially my daughter who is 7, but we will manage. I especially respect the suggestion of turning off or silencing my cell phone and other technology. It is too easy to get distracted and distraction is just wasted time.

There was no such pattern of men talking about childcare or coordination with their wives to make space for degree completion. In the Motivation category, a significant difference by gender was observed for discussion of anxiety. White women were significantly more likely than any other subgroup to talk about their anxiety in taking tests or returning to school, and men were significantly less likely. When students wrote about the topic of anxiety, they did so in good faith based on survey feedback:

There is nothing that I want more than to pass all my classes and keep moving to graduate. I will always stay motivated to do my best by staying positive and using the relaxing techniques like the survey suggested.

If I plan things out more, or do my work at a steady pace, I may have less anxiety for a test. It also talks about practicing relaxation techniques. They talk about breathing exercises that may help to calm before a test. I have done these breathing exercises before and I do feel they are very relaxing. I also listen to some music before a test to reduce my anxiety and so I have a clearer head.

Finally, among topics related to body paragraphs, writing about Monitoring is significantly less common among women of color, but significantly more common among White women.

Gender and Non-Adherence

Figure 37: Subgroup differences for non-adherent paragraph topics.

The final set of results is displayed in Figure 37. Narrative writing that does not follow a topic-based five-paragraph form is significantly less common among White women, both for the past-oriented and future-oriented narratives. Backstory paragraphs were significantly more common among men, and White men in particular. Finally, non-adherent topics were significantly more likely to occur in essays written by White men, both for non-adherent pasting of prompt text, and especially for metalanguage expressing skepticism of the task itself.

This finding is complex, and cannot be boiled down to labeling White men as cheaters; Högberg399 argues that writing behavior is more about students' intertextual practices than their morality (especially based on differences in moral frame of reference and conceptions). Transgressive writing behaviors are again social, defined according to social expectations; a single judgment of whether non-adherent behavior is appropriate or inappropriate masks the agency that students feel in their use of educational interventions400. Given this, we must look to additional work in sex-role socialization theory to describe and explain this finding. That work argues that women are socialized to obey conventional norms401.

399 Ronny Högberg. “Cheating as subversive and strategic resistance: vocational students’ resistance and conformity towards academic subjects in a Swedish upper secondary school”. In: Ethnography and Education 6.3 (2011), pp. 341–355
400 Erik Borg. “Local plagiarisms”. In: Assessment & Evaluation in Higher Education 34.4 (2009), pp. 415–426
401 Carol Gilligan. In a Different Voice: Psychological Theory and Women’s Development. Cambridge, MA (1993)

In education, male students plagiarize more than female students402; more broadly, men are more okay with small, seemingly inconsequential academic offenses403. In short, "male role norms are characterized by greater tolerance of minor transgressions." Mac an Ghaill was an early proponent of the now more-developed evidence that boys reject schoolwork as feminine busywork404, demotivating academic adherence as in conflict with performative masculinity in educational settings405. Being willing to "go along with" educational technology systems correlates with student learning outcomes406, which ties to the broader findings from DAACS earlier in my work. This effect is also gendered, varying not only by student gender but by the perceived gender of the automated system as well; female-presenting agents are more abused than male agents407, with threats and even violent or sexual content408.

My results are also backed up by a broader set of findings outside of schools: non-adherence also emerges when users interact with an algorithmic system outside of education. Drivers for Uber and Lyft, for instance, believe that the system is efficient, but low levels of transparency drive users to work together to learn to resist and abuse it409. Drivers engage in a variety of behaviors to resist and game the system410. It would not be surprising to see similar behavior emerge in the educational setting.

402 Bernard E Whitley, Amanda Bichlmeier Nelson, and Curtis J Jones. “Gender differences in cheating attitudes and classroom cheating behavior: A meta-analysis”. In: Sex Roles 41.9-10 (1999), pp. 657–680
403 Jean Underwood and Attila Szabo. “Academic offences and e-learning: individual propensities in cheating”. In: British Journal of Educational Technology 34.4 (2003), pp. 467–477
404 Máirtín Mac an Ghaill. “‘What about the boys?’: schooling, class and crisis masculinity”. In: The Sociological Review 44.3 (1996), pp. 381–397
405 Ursula Kessels et al. “How gender differences in academic engagement relate to students’ gender identity”. In: Educational Research 56.2 (2014), pp. 220–229
406 Amy Ogan et al. “Oh dear stacy!: social interaction, elaboration, and learning with teachable agents”. In: Proceedings of CHI. ACM. 2012, pp. 39–48
407 Annika Silvervarg et al. “The effect of visual gender on abuse in conversation with ECAs”. In: International Conference on Intelligent Virtual Agents. Springer. 2012, pp. 153–160
408 DA Angeli, Sheryl Brahnam, and Peter Wallis. “Abuse: The darker side of human computer interaction”. In: Interact 2005. 2005, pp. 91–92
409 Min Kyung Lee et al. “Working with machines: The impact of algorithmic and data-driven management on human workers”. In: Proceedings of CHI. 2015, pp. 1603–1612
410 M Möhlmann and L Zalmanson. “Hands on the wheel: Navigating algorithmic management and Uber drivers’”. In: Proceedings of the International Conference on Information Systems. 2017

Further Explanations from Essay Topics

The descriptions above corresponding to each topic were based on qualitative methods: a thematic analysis of reading the text of paragraphs as well as an inspection of salient words. But several of the claims that I make about the topics are testable quantitatively as well. So this section proceeds with a followup analysis on several specific claims that I make about the topics.

Position in Text
In my description of structural paragraph topics, I claimed that the primary purpose of paragraphs with these topics was to start or end five-paragraph essays. By labeling the distance from the beginning and end of each essay, I confirm this interpretation in Table 32. For paragraphs I labeled as introductions based on the output of topic modeling, 64.1% appeared as the first paragraph in the document they were drawn from. Only one other topic, non-adherent pasting, appeared as the first paragraph more than 30% of the time; and in the cases where pasting occurred in the essay-initial position, students often used a restatement of the prompt as a structuring tool for writing their essay.

For the two topics I labeled as conclusions, 65.6% and 57.8% appeared as the final paragraph, while no other topic appeared in the final paragraph more than 30% of the time. The median distance from the beginning was 0 for introduction-labeled paragraphs, and the median distance from the end was 0 for conclusion-labeled paragraphs; this was not the case for any of the other topics measured.

Table 32: Location of labeled paragraphs within essays, by median distance (in paragraphs) from the beginning and end of the text.

Topic: Subtopic               From Beginning   From End
Structure: Introduction             0               4
Structure: Excitement               3               0
Structure: Commitment               3               0
Strategies: 5PE                     2               3
Strategies: Help-Seeking            3               2
Strategies: Environment             2               2
Strategies: Time                    2               2
Motivation: 5PE                     3               1
Motivation: Anxiety                 3               1
Motivation: Mindset                 3               1
Motivation: Goals                   3               1
Metacognition: 5PE                  1               3
Metacognition: Monitoring           2               2
Metacognition: Planning             1               3
Narrative: Past                     3               3
Narrative: Future                   3               2
Non-adherent: Pasting               1               4
Non-adherent: Skeptic               2               2
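The position analysis itself is simple to reproduce in outline; the sketch below assumes a per-paragraph table with essay_id, paragraph_index, and topic_label columns, which are illustrative names rather than the actual schema.

```python
# Hedged sketch of the Table 32 computation: per-topic median paragraph distance
# from the beginning and end of each essay. Column names are assumptions.
import pandas as pd

def position_by_topic(paras: pd.DataFrame) -> pd.DataFrame:
    last_index = paras.groupby("essay_id")["paragraph_index"].transform("max")
    paras = paras.assign(from_beginning=paras["paragraph_index"],
                         from_end=last_index - paras["paragraph_index"])
    return paras.groupby("topic_label")[["from_beginning", "from_end"]].median()
```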

Body Paragraphs in 5PEs

I next test whether the paragraphs that I label "five-paragraph" based on their contents actually appear disproportionately in five-paragraph essays, relative to the other topics. As seen in Table 33 and Figure 38, this is the case. The topic for introduction paragraphs and the three summary or overview topics are the four topics most likely to appear in five-paragraph essays as labeled in the previous chapter, while the Skeptic topic representing non-adherent writing is least likely to appear in essays matching the five-paragraph form.

Recognizable Non-adherent Behaviors

One key outcome of this topic-based investigation was insight into the variety of non-adherent responses, texts that did not follow the standard essay genre that is a mainstay of student writing, especially at the high school and early college level.

Table 33: Percentage of paragraphs labeled with each topic that appear in exact- and partial-match five-paragraph essays.

Topic: Subtopic               Exact    Exact+Partial
Structure: Introduction       29.09        56.22
Structure: Excitement         21.41        44.03
Structure: Commitment         19.06        44.49
Strategies: 5PE               31.69        57.75
Strategies: Help-Seeking      19.95        45.62
Strategies: Environment       25.90        51.42
Strategies: Time              16.71        44.81
Motivation: 5PE               42.44        71.37
Motivation: Anxiety           19.05        46.35
Motivation: Mindset           22.33        55.09
Motivation: Goals             24.13        52.28
Metacognition: 5PE            38.34        68.24
Metacognition: Monitoring     21.25        46.88
Metacognition: Planning       21.09        46.49
Narrative: Past               19.10        41.89
Narrative: Future             25.45        53.12
Non-adherent: Pasting         28.57        48.38
Non-adherent: Skeptic         16.27        39.37

Figure 38: Data from Table 33, including exact matches only.

Students did not always hew closely to the instructions or expectations of the educational technology system, and this variance expressed itself in multiple ways. Based on this, I used a few simple methods to test whether the topic model was identifying these essays in a systematic way. To recognize essays responding to the prompt question by pasting it into the text box, as a structuring technique, I searched for exact phrases from the writing prompt and tagged essays if those lines appeared in the text. Essays were tagged as using the duplication tactic if any of these strings matched exactly in the paragraph. Next, the Skeptic category is intrinsically oppositional to the system, but uses well-formed text to write in response to the system itself. I use a heuristic to identify essays in this mode by searching for reference to the 350-word minimum, which is a useful proxy for metalanguage about the DAACS task itself:


"How much longer must I type? It seems as though this has to be at least 350 words at this point, but I will copy and paste this into word to check. Can you believe I have only typed 267 words as of the last sentence? We are getting close but I must draw this out in order to pass the assessment, foolish right? I am wondering if anyone actually reads these or if it is graded by a computer. My guess would be it is computerized as it would be as time consuming to grade these as it would be to type them. I am a huge advocate of efficiency so tasks such as this drive me crazy. Oh look! 350 words!"

Finally, many students wrote authentic text but fell short of the 350-word minimum required by the DAACS system to proceed. This behavior was identified by vocabulary extraction into types (number of unique words) and tokens (number of total words). Essay type/token ratio was calculated to determine how much unique text actually appeared in a student essay; a lower ratio implies fewer unique words given the length of the essay, and below a threshold of 3:1, a corpus inspection indicated 100% precision at identifying essays with duplicated text.
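The three heuristics can be sketched as below. The prompt phrases are placeholders (not the real DAACS prompt text), and I read the 3:1 threshold as roughly three total tokens per unique word type, matching the token-to-type ratios reported in Table 34; that reading is an assumption.

```python
# Hedged sketch of the non-adherence heuristics: prompt pasting, reference to the
# 350-word minimum, and a duplicated-text flag via a tokens-per-type ratio.
import re

PROMPT_PHRASES = ["reflect on your survey results",                 # placeholder phrases,
                  "describe the strategies you will use"]           # not the real prompt
MINIMUM_RE = re.compile(r"\b350\s*words?\b", re.IGNORECASE)

def pasted_prompt(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in PROMPT_PHRASES)

def references_minimum(text: str) -> bool:
    return bool(MINIMUM_RE.search(text))

def duplicated_text(text: str, threshold: float = 3.0) -> bool:
    tokens = text.lower().split()
    types = set(tokens)
    return bool(types) and (len(tokens) / len(types)) > threshold    # ~3 tokens per type
```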

Table 34: Percent of paragraphs containing exact pasted text from the DAACS interface and reference to the word count minimum, by topic; type-token ratio of documents containing each topic.

Topic: Subtopic               Pasting   Minimum Reference   Type-Token
Structure: Introduction         2.7           0                2.28
Structure: Excitement           0             0.1              2.25
Structure: Commitment           0             0.2              2.23
Strategies: 5PE                 0             0.2              2.32
Strategies: Help-Seeking        0             0.4              2.26
Strategies: Environment         0             0                2.25
Strategies: Time                0             0                2.21
Motivation: 5PE                 0             0.1              2.29
Motivation: Anxiety             0             0.2              2.27
Motivation: Mindset             0             1.2              2.24
Motivation: Goals               0.1           1.2              2.19
Metacognition: 5PE              0.0           0                2.28
Metacognition: Monitoring       0             0.1              2.28
Metacognition: Planning         0             0                2.27
Narrative: Past                 0             0                2.20
Narrative: Future               0             0                2.29
Non-adherent: Pasting           4.4           6.3              2.52
Non-adherent: Skeptic           0.2           0.8              2.21

My results are shown in Table 34. The table shows that both heuristic behaviors from the text occur disproportionately in non-adherent topics, and that documents labeled with the Pasting topic in particular have the highest type-token ratio of all topics.

Notably, the introduction topic is the next most likely topic to be assigned to paragraphs with exact excerpts from DAACS, as many students use those lines as part of their essay in otherwise good-faith paragraphs.

Sensitivity to Number of Topics

All of the results above are presented on a learned model with 20 unsupervised topics. But LDA is sensitive to the number of topics learned; the results may not generalize when the specific number changes. To test this, I conducted a reanalysis on different model sizes, with a step size of 4, creating new LDA models with 4, 8, 12, and 16 topics to compare to the model above. I then generate a hierarchy of topics using two different methods, one that produces a tree and one that produces a directed graph. In the first approach, the hierarchical method, I take the following steps:

1. For each learned model starting at k=4 topics, I generate the paragraph-level distribution of topics and assign paragraphs a value for each topic based on those distributions.

2. For each topic in a learned model, I calculate the correlation coefficient between that model's values for each paragraph and each topic in the prior, coarser learned LDA model.

3. Beginning with k=8 topics, I assign a topic’s "parent" in the prior model as the topic with the highest correlation coefficient across paragraphs.

Following this method I generate a hierarchy with 60 total topics in five layers; each layer represents a splitting into 4 additional topics. This method is similar in concept to hierarchical clustering, but allows each layer to be generated independently and only post-hoc aligned to topics in the previous layer. However, one downside of this approach is that it makes the assumption that topics group together over time, like constituents in a tree. This is not borne out in practice: in some cases, topics appear and reappear in subsequent runs on the same data, especially at different granularities. So my second approach, the percent overlap method, proceeds as follows: For each learned model starting at k=4 topics, I assign paragraphs a label for each topic as before. Then, for each topic from the original 20-topic model, I measure the percent overlap of paragraphs that were also assigned each topic in the less granular models. This approach allows me to "anchor" at the original model and then study the stability of those topics as the granularity decreases. The advantage of this is that it acknowledges the non-hierarchical nature of topic modeling.

One downside is that it is harder to visualize; another is that the analysis is only suitable for evaluating all other topics against a reference topic model, in this case the original 20-topic model, and is no longer agnostic to the "correct" granularity. My goal in the analysis that follows will be to evaluate the stability of topic hierarchies that appear in both of the methods described, to attempt to define a coherent and convergent story that is stable as topic count changes.
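A compact sketch of the two alignment strategies is given below, operating on per-paragraph topic value matrices from models of different sizes; the function names and the assumption that labels are integer topic indices are mine, not the original implementation.

```python
# Hedged sketch of the two topic-stability methods described above.
import numpy as np

def parent_by_correlation(fine: np.ndarray, coarse: np.ndarray) -> np.ndarray:
    """Hierarchical method: for each topic in the finer model, choose the coarser
    topic whose per-paragraph values correlate with it most strongly."""
    n_fine = fine.shape[1]
    corr = np.corrcoef(fine.T, coarse.T)[:n_fine, n_fine:]   # cross-correlation block
    return corr.argmax(axis=1)                               # parent topic per fine topic

def percent_overlap(ref_labels: np.ndarray, other_labels: np.ndarray, ref_topic: int) -> np.ndarray:
    """Overlap method: share of a reference (k=20) topic's paragraphs assigned to
    each topic of a less granular model."""
    mask = ref_labels == ref_topic
    counts = np.bincount(other_labels[mask], minlength=int(other_labels.max()) + 1)
    return 100.0 * counts / max(int(mask.sum()), 1)
```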

Results

The results of the hierarchical method are visualized in Figure 39; the agglomeration means that a tree is formed and can be drawn in a straightforward fashion. The non-constituency of the overlap method results in the topic assignments in Table 35. In both cases, the specific names that are used for topics in less-granular tiers are not as important as the grouping of topics from the k = 20 model and the consistency from one model to the next. From this data, several patterns are immediately identifiable for their consistency.

• Time management and environment management are grouped together for simpler models but consistently separated from all other topics all the way back to k = 8 in both models; they are only separated at k = 20.

• Motivation topics are grouped together very early and only diverge into separate topics over time. Topics related to Anxiety diverge early from this group and maintain a stable topic separate from other Motivation topics beginning at k = 12.

• The help-seeking topic, though it references material from the Strategies section of DAACS, is more closely aligned with metacognition topics like planning and evaluation.

• Introduction paragraphs are identifiable very early as a unique topic, but use such formulaic language that they overlap with the non-adherent pasting of DAACS interface text all the way through the model at k = 16.

• Non-adherent skeptic language is grouped in with metacognition and help-seeking behavior in both analyses.

• Narrative language, whether past- or future-oriented, is separated out from the topics related to body paragraph subsections. Past-oriented backstory narratives are more closely associated with topics related to conclusion paragraphs, in particular, in both analyses of the topic hierarchies.

Figure 39: Relationship between topics as number of topics increases from 4 to 20, following the hierarchical method. Values between cells indicate correlation coefficient between topics. Topics with stable relationships over time are highlighted.

• The overview "5PE" topics are grouped together at the k = 4 level in the overlap analysis only; I label this topic the "Form Language" topic in this analysis. Because of the tree shape constraints in the hierarchical analysis, this grouping is not discoverable there.

Table 35: Relationship between topics as the number of topics increases from 4 to 20, using the percent-overlap method. Values in cells for k = 4 through k = 16 represent overlap in paragraphs compared to the topic at k = 20 in each row.

k = 4               k = 8                  k = 12                 k = 16                 k = 20
61 Form Language    64 Introductions       55 Introductions       49 Introductions       Non-adherent:Pasting
93 Form Language    75 Introductions       91 Introductions       90 Introductions       Structure:Introduction
51 Form Language    37 Introductions       35 Narrative:Future    56 Strategies:5PE      Strategies:5PE
53 Form Language    89 Metacognition       86 Metacognition       89 Monitoring          Metacognition:5PE
90 Form Language    85 Motivation          68 Mindset             63 Motivation:5PE      Motivation:5PE
64 Form Language    79 Motivation          78 Mindset             79 Mindset             Motivation:Mindset
56 Narrative        29 Motivation          58 Goals/Commitment    59 Motivation:Goals    Motivation:Goals
40 Narrative        32 Narrative:Future    69 Excitement          57 Excitement          Conclusion:Excitement
78 Narrative        59 Narrative:Past      74 Goals/Commitment    71 Commitment          Conclusion:Commitment
63 Narrative        48 Narrative:Past      24 Narrative:Past      23 Narrative:Past      Narrative:Past
58 Metacognition    70 Help-Seeking        56 Help-Seeking        66 Help-Seeking        Strategies:Help-Seeking
92 Metacognition    50 Help-Seeking        32 Help-Seeking        25 Monitoring          Metacognition:Monitoring
46 Metacognition    33 Metacognition       24 Planning            67 Planning            Metacognition:Planning
47 Metacognition    66 Skepticism          59 Skepticism          68 Skepticism          Non-adherent:Skeptic
95 Strategies       95 Time/Environment    75 Time/Environment    90 Time/Environment    Strategies:Environment
74 Strategies       77 Time/Environment    65 Time/Environment    52 Time/Environment    Strategies:Time
59 Strategies       47 Motivation          57 Anxiety             82 Anxiety             Motivation:Anxiety
50 Strategies       72 Narrative:Future    58 Narrative:Future    68 Narrative:Future    Narrative:Future

I conclude that the most consistent topics, those which are least susceptible to variation based on topic granularity, are those related to anxiety, time and environment management, and introduction paragraphs. These strongest signals also tend to be those that had significant relationships with scores and demographics in the prior analyses, suggesting that the cleanly identifiable topics are, in general, those that have significant external relationships with other factors as well.

Future Directions

My work on Wikipedia was, for the most part, hypothetical. No automated system for deletion decisions is in place; the automation is largely explanatory for understanding a domain. The ideas for decision support tools are prospective and aspirational. Using these explanations is somewhat thornier for automated essay scoring, though, because the system is live as a learning analytics tool today, and directly interacting with students in real-world environments; the stakes are raised.

Improvements in Understanding Students

Researchers in natural language processing always claim to be interested in greater insight into the datasets they work on. Better identification of errors based on genuine understanding should lead to performance gains in reliability, rather than blind improvement by an algorithm on the dataset as a whole. But finding out what those categories are is a technically difficult challenge. Clustering methods, whether using LDA as I did here or using more modern neural methods in embedding spaces411 (which accomplish similar things), nevertheless require a great deal of human insight and judgment to make sense. So I might ask next what it would take to discover these differences between subsets of the corpus automatically. While my work here looked at a few, specific things that students might do, like follow a five-paragraph essay structure, there is room for much more expansion. An expansion into more sophisticated automated topic discovery might also lead to discovery of personas that students take on in their writing412; this work might also extend to insight into how students express their own biography413, a crucial step for understanding the student responses that do not follow the five-paragraph essay form.

411 Adji B Dieng, Francisco JR Ruiz, and David M Blei. “Topic modeling in embedding spaces”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 439–453
412 David Bamman, Brendan O’Connor, and Noah A Smith. “Learning latent personas of film characters”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, pp. 352–361
413 David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. “Gender identity and lexical variation in social media”. In: Journal of Sociolinguistics 18.2 (2014), pp. 135–160

This is important for the prospect of genuine personalized feedback that does not simply shunt students to a preordained writing form. Formative AES tools make claims of supporting student agency and growth; here, adapting to writer individuality is a major current gap. But recent commentary by Dixon-Román raises a host of questions about these topics specifically in the context of AES, asking how algorithmic intervention can produce strong writers rather than merely good essays. The critique, specifically, argues that:

"revision, as adjudicated by the platform, is [...] a re-direction toward the predetermined shape of the ideal written form [...] a puzzle-doer recursively consulting the image on the puzzle-box, not that of author returning to their words to make them more lucid, descriptive, or forceful."

This critique is valid: research on machine translation, for instance, has shown that writer style is not preserved across languages when an algorithmic system intervenes[414]. So for AES to adapt to individual writer styles and give feedback based on individual writing, rather than on exemplars of a particular form, is uncharted territory. Natural language understanding researchers now argue that "...style is formed by a complex combination of different stylistic factors"[415]; style-specific natural language generation has shown promise in other domains[416],[417] and has been extended not just to individual preferences but also to overlapping identities based on attitudes like sentiment and personal attributes like gender[418]. For assessment, "authorial voice" has measurable effects on writing outcomes[419], while individual expression is central to decades of pedagogy[420]. Moving the field toward individual expression and away from those preordained forms may be a path to lending legitimacy to AES.

[414] Ella Rabinovich et al. "Personalized Machine Translation: Preserving Original Author Traits". In: Proceedings of the European Chapter of the Association for Computational Linguistics. 2017, pp. 1074–1084.
[415] Dongyeop Kang and Eduard Hovy. "xSLUE: A Benchmark and Analysis Platform for Cross-Style Language Understanding and Evaluation". In: arXiv preprint arXiv:1911.03663 (2019).
[416] hu17.
[417] Shrimai Prabhumoye et al. "Style Transfer Through Back-Translation". In: Proceedings of the Association for Computational Linguistics. 2018, pp. 866–876.
[418] Sandeep Subramanian et al. "Multiple-attribute text style transfer". In: Age 18.24 (), p. 65.
[419] Paul Kei Matsuda and Christine M Tardy. "Voice in academic writing: The rhetorical construction of author identity in blind manuscript review". In: English for Specific Purposes 26.2 (2007), pp. 235–249.
[420] Peter Elbow. "Closing my eyes as I speak: An argument for ignoring audience". In: College English 49.1 (1987), pp. 50–69.

I might also suggest that the field move on to build specific tools that would make use of all of these findings on genre norms, demographic skews of topics, and non-adherence; but this would receive pushback in practice. In composition studies, concerns over the implementation of AES have largely been pedagogical, rather than driven by any empirical insight about the underlying models or what they learn. This is not the approach taken toward most other digital learning tools. Writing program administrators have written detailed, thoughtful practical guides on tools like e-Portfolios and digital instruction[421]. A burgeoning field of writing analytics is beginning to develop[422]. Even similarity checkers and anti-plagiarism software have been engaged with thoughtfully by composition scholars[423]. Yet a "discourse of rejection" has prevented similar engagement with AES development from many composition scholars[424].

[421] Edward M White, Norbert Elliot, and Irvin Peckham. Very like a whale: The assessment of writing programs. University Press of Colorado, 2015.
[422] Joe Moxley et al. "Writing analytics: Conceptualization of a multidisciplinary field". In: Journal of Writing Analytics 1 (2017).
[423] Sandra Jamieson. "Is it plagiarism or patchwriting? Toward a nuanced definition". In: Handbook of Academic Integrity (2016), pp. 503–518.
[424] Carl Whithaus. "Always already: Automated essay scoring and grammar checkers in college writing courses". In: Machine Scoring of Student Essays: Truth and Consequences (2006), pp. 166–176.

Despite this, there is a growing evidence base that AES has a narrow role as a useful pedagogical tool[425],[426]. Even critical pedagogy scholars, long the most skeptical of technological change, have recommended an integrated approach that acknowledges the value of new technologies when used alongside an empowering curriculum centered on student voices and needs[427]. The opportunity is open, but an understanding of the policy landscape is crucial for technology developers, lest they miss the broader cultural space that they are attempting to intervene in.

[425] Elena Cotos. "Automated Writing Analysis for writing pedagogy: From healthy tension to tangible prospects". In: Writing and Pedagogy 6 (2015), p. 1.
[426] Alex Helberg et al. "Teaching textual awareness with DocuScope: Using corpus-driven tools and reflection to support students' written decision-making". In: Assessing Writing 38 (2018), pp. 40–45.
[427] Nadia Behizadeh. "Realizing powerful writing pedagogy in US public schools". In: Pedagogies: An International Journal 14.4 (2019), pp. 261–279.

Education Policy

In that bigger picture, my data is a jumping-off point for the decision-making of writing centers, first-year writing instructors, and teachers broadly. My results are especially striking for the differences they showcase between White women and other students as part of their introduction into college writing. Across a variety of analyses, my work showed that those women are culturally prepared for performing the compositional acts that receive high scores from both human raters and the automated systems that learn from them; college preparedness here may simply mean signaling membership in the genre norms that are taught at privileged high schools.

Yet this is not necessarily a sign of intrinsic skill. "The genders are more alike than they are different" in writing assessment more broadly[428], and student skill measurement through grading is subjective and negotiated between students, instructors, and the school environment that they are in. The current consensus of the research community is that while cognitive factors play some part in gender differences in writing, the stronger effect is often tied to attitude and motivation, rather than any strict biological or developmental difference[429]. Assessment is "a social technique which has social consequences"[430]; environmental factors, like parental education level and the personal preferences of instructors, have a stronger effect on students' measured writing ability than innate characteristics like biological sex. Modern scholarship now recognizes the performative elements of racial language variation, including code-switching between dialects, as a discursive and performative practice for signaling prestige, group membership, and other social factors[431].

[428] Janet Shibley Hyde. "The gender similarities hypothesis." In: American Psychologist 60.6 (2005), p. 581.
[429] Diana Raufelder, Sandra Scherber, and Megan A Wood. "The interplay between adolescents' perceptions of teacher-student relationships and their academic self-regulation: Does liking a specific teacher matter?" In: Psychology in the Schools 53.7 (2016), pp. 736–750.
[430] Barbara Read, Becky Francis, and Jocelyn Robson. "Gender, 'bias', assessment and feedback: Analyzing the written assessment of undergraduate history essays". In: Assessment & Evaluation in Higher Education 30.3 (2005), pp. 241–260.
[431] Ramon Antonio Martinez. Spanglish is spoken here: Making sense of Spanish-English code-switching and language ideologies in a sixth-grade English language arts classroom. University of California, Los Angeles, 2009.

My focus on adherence, meanwhile, is not just about knowledge of genre norms, but also about permission to abuse a system or to refuse it entirely. The issue is not one of license and cheating per se, but of permission to refuse compliance with an automated system – permission that is potentially linked to identity.

In automated, interactive systems elsewhere, abuse or non-compliance can make up from 10% to almost half of interactions[432]. In schools, there exist informal or implicit "negotiated understandings" about how students should conduct themselves and what rules they are allowed to break[433]. In this context of studying non-adherence, one thread to unravel in my work is the evidence that this skeptical behavior appears among men, and White men specifically.

This challenges, or at least gives texture to, a broader and older consensus that acting in defiance of school settings is a characteristic of marginalized groups, specifically Black students[434]. Yes, these understandings can often be implicit and coded, and they can create barriers to community participation; for instance, on Stack Overflow, fear of hostile feedback for improperly meeting the expectations of information seekers can prevent new users from asking questions or joining the community in the first place[435]. But it is hard to take my data and align it to the thesis of that older work, that non-adherent student behavior is a feature of marginalized cultures. Instead, all of the data I found on commitment narratives and personal backstory suggests a highly motivated school culture among the POC students in my data. This supports the more contemporary research hypothesizing a more important role for self-determination than for oppositional defiance[436].

Given all this, we should take seriously the findings in my work on the differences in what students write about and how they focus their essays. I've found that students vary both structurally and in the choices they make about what to write about. This should frame the future of research on AES in real-world, deployed contexts, where students are not simply supplying us their text for the sake of measurement of a skill; they are expressing themselves in their writing, and that writing hints at the broader systems those students are writing within.

For educators, this investigation boils down to a study of how students engage with academic culture, both in and out of the classroom. I don't take a specific stance in this dissertation on whether AES systems are an appropriate fit for any particular college campus or student population; in the absence of a complete failure or overwhelming evidence of fairness issues in a particular dataset and trained model, contextual factors for each school remain the most important deciding factor. In Western countries, affluent families from well-resourced suburbs teach a rubric-friendly way of writing[437]. While the five-paragraph form is limited, it establishes a baseline structural style that scores well on DAACS, and this scoring trend is reproduced and even extended with automation. This is an important finding, and having this quantitative set of evidence should inform decision-making in writing program administration. While self-disclosure in writing is not necessarily a better or worse sign of strong writing, the students who choose to write about it receive different scores than their peers. Personal narratives and backstories appear in text differently than five-paragraph, topic-focused paragraphs. All of this is important to know about and will only get clearer as more advanced technical methods are developed.

My hope in all of this is that better, socially aware explanations will give composition scholars a handhold, some way to grapple with automated essay scoring systems in an open dialogue with technical researchers. In so doing, they may have a chance to use those tools to shape their own work, in a positive way. But they may also find that the explanations are a component of a better dialogue on campus, and produce a more informed, more accessible advocacy around how to teach and measure writing.

[432] Hyojin Chin, Lebogang Wame Molefi, and Mun Yong Yi. "Empathy Is All You Need: How a Conversational Agent Should Respond to Verbal Abuse". In: Proceedings of CHI. 2020, pp. 1–13.
[433] Neil Selwyn. "Exploring the 'digital disconnect' between net-savvy students and their schools". In: Learning, Media and Technology 31.1 (2006), pp. 5–17.
[434] Signithia Fordham and John U Ogbu. "Black students' school success: Coping with the 'burden of acting white'". In: The Urban Review 18.3 (1986), pp. 176–206.
[435] Denae Ford et al. "Paradise unplugged: Identifying barriers for female participation on Stack Overflow". In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM. 2016, pp. 846–857.
[436] Kevin Cokley. "What do we know about the motivation of African American students? Challenging the 'anti-intellectual' myth". In: Harvard Educational Review 73.4 (2003), pp. 524–558.
[437] Martin Carnoy and Emma Garcia. "Five Key Trends in US Student Performance: Progress by Blacks and Hispanics, the Takeoff of Asians, the Stall of Non-English Speakers, the Persistence of Socioeconomic Gaps, and the Damaging Effect of Highly Segregated Schools." In: Economic Policy Institute (2017).

Part IV: Takeaways

I've now shown that a rich, domain-specific and non-causal approach to explaining model behavior is effective for illuminating two domains: group decision-making debates and writing assessment. My goal now is to generalize, to describe a conceptual model for how technical researchers should give domain experts plenty to chew on and understand. The goal is to produce trust and confidence, rather than leaving users feeling vulnerable in the hands of a black box machine learning algorithm.

I next build up an abstraction for how we should think about model explanations. Leaning on a non-causal account of scientific explanation, I argue that the work I've done in this thesis shows how we can build justified stories about our models. Based on my work with Diyi Yang[438], I show one way to manage the needs of these explanations using a framework built around actions, intentions, goals, and circumstances. I tie this together with practical recommendations not just for machine learning researchers but also for the non-technical partners that help build algorithmic decision-making.

In so doing, I find that the context in which an algorithmic system is built matters. Part of this context is the purpose of the system itself, and our ultimate goal is to build systems that make the right decisions. Many of the applications that we develop in natural language processing and machine learning fill fundamentally problematic roles in the education ecosystem. For better or worse, in building explanations based on social and cultural context, the explanation ends up highlighting power hierarchies and relationships that the algorithmic system is designed to reinforce. And so I end this dissertation by confronting this tension. Based on my work with Michael Madaio and colleagues early this year[439], I leave with suggestions on how to build systems whose actions are not only explainable but just.

[438] Diyi Yang et al. "Seekers, Providers, Welcomers, and Storytellers: Modeling Social Roles in Online Health Communities". In: Proceedings of CHI. ACM. 2019, p. 344.
[439] Michael Madaio et al. "Confronting Inherent Inequities in AI for Education". In: International Journal of Artificial Intelligence in Education (Under review).


Defensible Explanations

We have now learned a lot about our two domains. In both, the explanations for the human behaviors in the data came from a sufficiently reliable model; in neither case did I need to dive deep into model internals or specific weights in order to make discoveries about the population under study. Instead I looked at the outward behavior of the predictions, and investigated specific topics based on ties into the domain literature. But what commonalities are there between these two parts of the research? Perhaps a better question is: what was unique about these approaches to understanding my domains? Which of our insights could not come from a narrow, model-introspective, causal explanation strategy? And can we generalize this to best practices and a theoretical foundation for researchers working in new domains?

To answer these questions, let's go back to the philosophy of science. We have shown already that causal explanations fail for machine learning, but this new approach, acknowledging the social nature of people, their interaction with computational agents, and algorithmic decision-making, is something new altogether.

Return to Philosophy of Explanation

In their work on non-causal explanations, Robert Batterman and Collin Rice ask:

"How can a model that really looks nothing like any system it is supposed to “represent” play a role in allowing us to understand and explain the behavior of that system?" 440 440 Robert W Batterman and Collin C Rice. “Minimal model explanations”. This question is important and relevant for our approach to ex- In: Philosophy of Science 81.3 (2014), pp. 349–376 plaining phenomena and learning about both of our target domains. Editors on Wikipedia argue the importance of their debate discourse and the reasoning and problem solving that it produces. Are they wrong to push back against automated tools that could replicate that judgment? Even more fiercely, educators will push back against automated scoring of essays. What does it mean to receive a score for an essay that was never read? These feel alien to us - and so any explanatory value that we can gain from them must be strictly scru- tinized, especially if it does not come from the standard scientific playbook of intervention, causal relationships, and controlled trials. 174 defensible explanations for algorithmic decisions about writing in education

Minimal Models

Batterman and Rice, who were quoted above, have a suggestion for us. Their goal is to give an account of models that are explanatory, giving new insight and scientific knowledge, without causal mechanisms or interventions. They argue that as scientists we do this by isolating recurring, co-occurring features of a phenomenon, and connecting them using analogies. Many accounts of explanation only allow us to do this if we have a specific and accurate understanding of the exact relational links between entities. Their minimal model account, on the other hand, describes the scientific process in cases where we cannot isolate the exact relations, and have no way to tease them apart. In these cases, the specifics of causal mechanisms are not always important and the stories that result are not always perfectly representative or factual. But non-causal accounts argue that a model meeting "extremely minimal" accuracy conditions, conditions that causal explanations would find "not terribly important", can still give a successful explanation.

Consider two examples from the hard sciences: population biology and fluid dynamics. In naturally occurring settings, we cannot give full causal stories, only caricatures. We can list off the features that are relevant in those populations – fitness functions, optimizations, tendencies of variables to correlate and phenomena to co-occur – but we can't draw causal links, again because things are so entwined. If we relied on pure causal models in these settings, we couldn't make sense of the world. The evidence we have is simply "insufficient for deriving the target explanandum." Yet in both settings, scientists know a set of factors that must be in place in order for phenomena that we see happening in real life to occur, like an equal ratio of male to female individuals in a species population. But the specific factors are all tangled up – it "stretches the imagination" to think of any one observed phenomenon as a causal factor. Instead, there are merely common features, co-occurring.

Despite natural scientists' models being "minimally accurate", they are robust and explanatory! Natural scientists, in their non-causal research, learn something important about how the world might actually work[441]. Justifying these models involves iteratively discovering certain features of a population and determining whether those features are relevant, or whether they can be ignored. Scientific models in these circumstances take the first step toward being explanatory by idealizing away the factors that are not relevant to the final outcome, labeling them irrelevant. In the minimal model account, scientists tell a story of why large classes of other features are irrelevant to the explanation. Philosophers argue that non-causal explanations answer those questions by constructing a space of possible systems, defining the boundaries of a class of systems that our factors do explain. The power of the explanation comes from being able to clearly delimit the scope of the types of systems that fit our explanations.

[441] Jon Seger and JW Stubblefield. "Theoretical Evolutionary Ecology". In: Bulletin of Mathematical Biology 4.58 (1996), pp. 813–814.

But we need to then answer further questions. Can we determine the class of systems that will follow these behaviors? How stable are the behaviors under certain changes in the system? To answer these questions, science needs to tell a story of which populations will exhibit the tendencies described – populations where the set of phenomena, together, are subject to the same constraints. This involves defining what entities are being modeled, what interactions between those entities are being described, and the context in which those interactions occurred. The resulting model is as minimal as we can define it, producing a certain "universality class," common circumstances in which our explanation and features apply.

From there, we can use the knowledge we gained: if we make a claim about a class of entangled circumstances, and define how to test whether a new set of data comes from that class, then claims made about the class as a whole can tell a story about any constituent member of the class. We then argue that other details – including details we get wrong! – are less relevant, because they aren't needed to define the boundaries, and do not prevent the defining characteristics of the class from generating new insight.

These models, and the lessons learned from them, explain how heterogeneous systems end up adopting similar strategies or following similar patterns. The hard task left before us is to justify what the constraints of our universality class are. How do we get that justification?

Structural Models

The work of Alisa Bokulich is useful for understanding the justification process behind a non-causal model. Working in parallel to the account above, she also argues that strict, Woodward-style interventionist research makes unsustainable claims:

[causal-only explanation] "...would suggest that scientists rarely - if ever - succeed in offering explanations – even when there is a consensus in the scientific community that an adequate explanation has been given."[442]

[442] Alisa Bokulich. "Distinguishing explanatory from nonexplanatory fictions". In: Philosophy of Science 79.5 (2012), pp. 725–737.

This presents a non-causal, model-based account of explanation similar to the proposal by Batterman and Rice, above. But it goes a little further: in this alternate view of non-causal explanation, it is not just the "true parts" that do the explaining, but the fictions as well. Models meeting these criteria are not "merely phenomenological models, useful tools for making predictions." Instead, they are capable of "generating real knowledge and genuine insight." But in order for a non-causal explanatory model to be successful, Bokulich argues it must meet three criteria:

• The explanans makes reference to a scientific model, which is to some degree idealized or fictitious.

• The model shows how the elements correctly capture patterns of counterfactual dependence - they "reproduce" the relevant features.

• A "justificatory step" specifies the domain of applicability of the model, and the extent to which the model can be trusted, for which purposes.

The first of these steps, in our case, refers to the trained classifier that attempts to accurately capture the decision-making of a particular domain. The second step here is captured by the features that I have been documenting throughout the last several chapters of this thesis - phenomena and insight about the domain that the model is meant to automate. But what remains is the third step. Bokulich's account of this step aligns to Batterman and Rice's idea of a universality class, but gives some additional instruction on how to actually make this argument.

Bokulich specifically addresses the problem of being overly permissive of explanations. If anyone can tell a good story for why a set of variables occurring together are explanatory, and claim it extends to any related domain, then we are in danger. Relevance relations are asymmetric, and causality is real, even if it is sometimes hopelessly entangled in variables that cannot be surgically intervened on.

To answer this, Bokulich suggests one needs to turn to the nitty-gritty details of the science in question. When evaluating whether an explanation applies to a domain and is successful, we must openly acknowledge the current state of the scientific field, as part of the explanation itself. Given the knowledge that we had at the time the explanation was made, would the scientific community give the arguments credence? This approach allows us to acknowledge that scientific practice is fundamentally discursive, collectively built just as the Wikipedia debates we study in the data itself. What counts as an adequate explanatory representation is something that has to be negotiated by people, working together in a domain and aided by the features captured in the model. Bokulich calls the process of building this context, in defense of an explanation that does not make causal claims, the justificatory step.

This step is the process of building up the universality class of an explanation. It is this justification that must do the heavy lifting, and it is fundamentally a negotiation, rather than a causal chain. In this negotiation, we must, as scientists, account for the circumstances of the study, the details of the domain, and the purposes for which the scientists are building and deploying a model. An explanation detached from these contexts is insufficient; but an explanation using non-causal evidence and entangled variables, contextualized with this evidence, scientific discourse, and limits on applicability, can be successful.

Negotiating a Justified Explanation

So Bokulich, Batterman, and Rice have given us a set of tools to make non-causal arguments. We must work from exclusion, defining the bounds of the universality class, the circumstances where our explanations hold water. We must clarify explicitly that our model is minimal, establishing a set of patterns that, collectively, explain behaviors, while claiming that other factors that make different cases distinct are irrelevant to the mechanisms that we've described as part of our explanation. And we must justify why the set of factors we've identified are relevant and important to the explanation, leaning on something other than the internals of our dataset for this justification.

This aligns with prior work in explainable machine learning. While other authors in this space have never specifically aligned their findings to the accounts from philosophy, they have come to similar conclusions about the importance of a justification that fits in the contemporary discourse. Deciding on and subsequently defending the factors of an algorithm is a subjective, "editorial" process[443], designed by human operators to automate human judgment. For users, introducing an unknown technology here is a "leap of faith," as the internal processes can't be observed or understood[444]. And so a suitable explanation must not so much be technically correct as it must be trusted.

A social approach to explainable machine learning supports the case for "why" questions, downplaying the importance of specific technical implementation details. Intriguingly, corporations seem to know this in their own attempts at explanation. Most information from Facebook, for instance, targets the motivating "why" questions behind their algorithms rather than technical "how" questions[445]. But corporations attempt to tie trust in automated decision-making to the reputation of the firm itself, as well as emphasizing future usability and benefits, and it is unclear to whom that approach is satisfying or successful[446].

[443] Tarleton Gillespie. "The relevance of algorithms". In: Media technologies: Essays on communication, materiality, and society 167.2014 (2014), p. 167.
[444] Kevin Anthony Hoff and Masooda Bashir. "Trust in automation: Integrating empirical evidence on factors that influence trust". In: Human Factors 57.3 (2015), pp. 407–434.
[445] Kelley Cotter, Janghee Cho, and Emilee Rader. "Explaining the news feed algorithm: An analysis of the "News Feed FYI" blog". In: Proceedings of CHI Extended Abstracts. 2017.
[446] Monika Hengstler, Ellen Enkel, and Selina Duelli. "Applied artificial intelligence and trust—The case of autonomous vehicles and medical assistance devices". In: Technological Forecasting and Social Change 105 (2016).

A Framework for Establishing Boundaries

In my work with Diyi Yang, she used data-driven methods to separate user intent and to explain observed phenomena around tenure in the context of online health communities[447]. The resulting framework conceptualizes roles using five facets, reproduced here and reordered for relevance.

[447] Diyi Yang. "Computational Social Roles". PhD thesis. Carnegie Mellon University, 2019.

1. Person. Attributes of individuals, like their age, race, or gender. In the context of any one interaction, these attributes are relatively static, particularly in how they are perceived by others in computer-mediated interactions[448].

[448] Charles G Hill et al. "Gender-Inclusiveness Personas vs. Stereotyping: Can We Have it Both Ways?" In: Proceedings of CHI. ACM. 2017, pp. 6658–6671.

2. Goal. Individuals participating in decision-making processes are not doing so irrationally or neutrally; they have some intention during the interaction. Identifying and describing these goals allows us to better understand how they interact with each other, in a group context, and what criteria they use when they make a decision.

3. Interaction. While personal aspects of identity above are sometimes "innate" and sometimes mapped to biological or physical characteristics, most are performative[449]. This means that attributes of an individual's identity and how they approach decision-making can change over time and based on who they are interacting with. A key aspect of explanation is recognizing how otherwise-static attributes of decision-makers change depending on the target of a decision and the audience for the decision's outcome.

[449] Candace West and Don H Zimmerman. "Doing gender". In: Gender & Society 1.2 (1987), pp. 125–151.

4. Expectation. Individuals do not arrive at a decision-making discussion teleported from a vacuum; they have experience throughout their lives in similar contexts, for "how decision-making happens." This includes the particular domain of the decision, if they are frequent participants or have a history of performing a choice to be automated.

5. Context. The specific topic under review, the venue and timing of a decision, and the recent history of past decisions all may alter how other aspects above are expressed in a particular decision.

I suggest that we can understand our models using this framework, and use it to situate our findings from an explanatory investigation. We can use each of these framing questions to define a wall or boundary, a limitation for where we think our findings can apply as successful explanations. When a new dataset or domain is brought in, we must ask whether it fits all of the criteria above before we can trust that our previous evidence holds. When it does not, in one or more dimensions, it will be our responsibility to test our assumptions. But instead of testing the causal factors and measuring from a strictly interventionist account, we must instead test whether the feature that differs is a member of the same class of decisions as those in which our original explanatory features were discovered. If it is not, we can excise it from our definition; if it is, then we cannot make forecasts or predictions of how that decision will be made – our explanations simply do not apply.
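To make that boundary check concrete, here is a small, purely hypothetical sketch (not an implementation from this thesis) of recording the five facets alongside an explanation and comparing them, facet by facet, against a new domain before the explanation is reused; the field names and example values are illustrative.

```python
# A purely hypothetical sketch of making the boundary check explicit: the five
# facets are recorded next to an explanation and compared, facet by facet,
# against a new domain before that explanation is reused. Field names and the
# example values are illustrative, not artifacts from this thesis.
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DecisionContext:
    person: str       # who the decision-makers or writers are
    goal: str         # what they are trying to accomplish
    interaction: str  # how identity is performed toward an audience
    expectation: str  # prior experience with "how decision-making happens"
    context: str      # topic, venue, and timing of the decision

def boundary_mismatches(scope: DecisionContext, new_domain: DecisionContext):
    """Facets on which the new domain falls outside the scope in which an
    explanation was originally justified."""
    return [f.name for f in fields(DecisionContext)
            if getattr(scope, f.name) != getattr(new_domain, f.name)]

# Example: an explanation justified on Wikipedia deletion debates should not be
# assumed to carry over to DAACS essays without re-justification.
afd = DecisionContext("veteran editors", "reach a delete/keep outcome",
                      "public threaded debate", "policy precedent",
                      "an article nominated for deletion")
daacs = DecisionContext("incoming college students", "receive a diagnostic score",
                        "solo writing for an automated reader",
                        "school genre norms", "a self-assessment essay prompt")
print(boundary_mismatches(afd, daacs))  # every facet differs here
```

The point of the sketch is only that the justificatory step can be made explicit and checkable rather than left implicit; each mismatched facet marks an assumption that must be re-tested before the old explanation is trusted.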

Using the Framework: Wikipedia

This thesis does not attempt to rigorously test the framework for use in explanation, but we can look at the areas of focus in the chapters so far and attempt to apply this model to them. Research using early versions of this framework in the context of Wikipedia editors has already shown promise, discovering the granular intentions of individual edits in order to more intelligently categorize editor actions[450],[451]. It has been useful for understanding editor roles over time as well, describing for instance the way women are more likely to take part in emotional labor roles that are necessary to maintain basic community functioning despite lower associated prestige[452].

Explaining behavior by focusing on user goals, in the case of Wikipedia, might largely mean studying their stance on a debate. Certainly that has been the focus of most sentiment analysis or stance classification work in NLP. We can assume that a user's goals for a particular page align with their vote, typically Delete or Keep. But just as with our DAACS results, the goals of users may diverge from good-faith debate content. Early work on administrator promotion[453], for instance, showed that raw edit counts and basic forms of politeness were sufficient predictors of promotion; as a result, users seeking promotion sometimes go out of their way to "pad their stats" with relatively minor edits and other easily measurable social moves[454]. These self-centered goals may be distinct from having strong opinions on the outcome of any one debate, and may be separated from a user's goals as defined by their stance, depending on the direction of the research. Just as in DAACS, these transgressive behaviors offer rich opportunities for explaining how decisions are made in a real, rather than idealized, debate setting.

[450] Diyi Yang et al. "Who Did What: Editor Role Identification in Wikipedia." In: ICWSM. 2016, pp. 446–455.
[451] Diyi Yang et al. "Identifying semantic edit intentions from revisions in Wikipedia". In: Proceedings of EMNLP. 2017, pp. 2000–2010.
[452] Amanda Menking and Ingrid Erickson. "The heart work of Wikipedia: Gendered, emotional labor in the world's largest online encyclopedia". In: Proceedings of CHI. ACM. 2015, pp. 207–210.
[453] Moira Burke and Robert Kraut. "Mopping up: modeling wikipedia promotion decisions". In: Proceedings of the ACM Conference on Computer Supported Cooperative Work. ACM. 2008, pp. 27–36.
[454] Katie Derthick et al. "Collaborative sensemaking during admin permission granting in Wikipedia". In: International Conference on Online Communities and Social Computing. Springer. 2011, pp. 100–109.

So these three attributes are relatively well-defined immediately: a user's personal attributes, their stance, and the nominated article as the topic of discussion. Using the framework above tells us where we need to fill in gaps in future work. For this context, this means finding out more about the boundaries of interaction and expectation. My findings on policy tie into expectation – users are citing policies from the past as part of their expectation of what ought to happen. The work I've done in those chapters serves not only to make predictions about outcomes but to indicate what contexts those predictions ought to hold in. Explanation through policy citation has immediate value, and it is validated by its alignment to prior work: Pavalanathan et al.[455] demonstrated, for instance, that citing policies on talk pages does influence editing behavior.

[455] Umashanthi Pavalanathan, Xiaochuang Han, and Jacob Eisenstein. "Mind Your POV: Convergence of Articles and Editors Towards Wikipedia's Neutrality Norm". In: Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018), p. 137.

But work remains in the final category, interaction. A good explainable model would recognize that behaviors like the quantity of posts, the timeliness of posts, and reply patterns between users are all candidates for inclusion in an explanation. Narrowing down the set that has explanatory value is part of the work of justifying a minimal model of the AfD domain.
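As a concrete illustration of those candidates, the sketch below computes per-user post counts, time to first post, and directed reply pairs for a single debate. The `Post` record is hypothetical; the released AfD corpus has its own schema and would need to be parsed into this shape first.

```python
# A hedged sketch of the candidate interaction features named above (post
# quantity, timeliness, reply patterns) for a single AfD discussion. The Post
# record is hypothetical; the released corpus would need to be parsed into it.
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional

@dataclass
class Post:
    user: str
    timestamp: datetime
    reply_to: Optional[str] = None  # user being replied to, if any

def interaction_features(thread: List[Post]) -> Dict[str, dict]:
    # Quantity: how many comments each participant leaves.
    posts_per_user = Counter(p.user for p in thread)
    # Timeliness: hours from the first post (the nomination) to each user's entry.
    start = min(p.timestamp for p in thread)
    hours_to_first_post: Dict[str, float] = {}
    for p in sorted(thread, key=lambda post: post.timestamp):
        hours_to_first_post.setdefault(
            p.user, (p.timestamp - start).total_seconds() / 3600)
    # Reply patterns: directed (replier, replied-to) counts.
    reply_pairs = Counter((p.user, p.reply_to) for p in thread if p.reply_to)
    return {
        "posts_per_user": dict(posts_per_user),
        "hours_to_first_post": hours_to_first_post,
        "reply_pairs": dict(reply_pairs),
    }
```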

Using the Framework: AES

In my essay scoring explanation, we have fit our two categorical demographic variables into the person aspect of this framework. My results implicitly test disparate outcomes for White against non-White students, and men against women, without accounting for transgender or non-binary identities. The data exists in DAACS to evaluate other personal identities, like age and military status; however, I have not yet explored what insights we can learn from those explanations or how they can shape the bounds of our explanations.

Student goals and expectations are perhaps the most interesting aspect of my AES research. My study of content, in particular, highlighted that goals vary for students based on their willingness to adhere to the format and genre of the automated system. Intersecting with identity, I showed that these goals and expectations can be very different for men, and White men in particular. Their intention with DAACS was not necessarily to achieve the highest possible score but merely to meet the bare minimum; future attempts at explaining the behavior of AES systems must account for these differential goals from students as part of any system description.

Student expectations can be thought of in terms of genre norms, and whether students believe they understand the "right" way to answer an essay prompt. This expectation around genre norms was reified for the model, though, not by students or by the rubric but by the annotators themselves. Their scores, after all, served as labels for the model where these differences surfaced. My structure analysis, focused on the five-paragraph essay, showed that this form is a substantial underpinning of the entire corpus, but is not explicitly encoded anywhere in the rubric; students are not told to write a "five-paragraph essay" but some do – predominantly White women – and are rewarded for it with higher scores. Explanations of scoring model behavior in the future must be able to give an account of how the model interacts with those structural expectations and who is advantaged by the hidden expectations of the system.

My content analysis also showed that students are coming to DAACS from startlingly different contexts. Some students, predominantly White women again, are coming from home environments that are difficult to manage, and are willing to disclose substantial vulnerability about test anxiety. Other students have no such additional constraints on their education experience. Future exploration, especially in the younger population of the 2020 dataset, will have to explore whether this context actually changes the way students interact with and write for the automated system.
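The person facet lends itself to a simple first-pass audit. The sketch below, which uses placeholder column names rather than the actual DAACS schema, shows the kind of group-level score comparison that the disparity analyses referenced here build on; a real audit would add significance testing and checks of rater-model agreement.

```python
# A minimal first-pass audit sketch for the person facet: mean human and model
# scores per demographic group, and each group's gap from the overall mean.
# Column names ("gender", "race", "human_score", "model_score") are placeholders,
# not the actual DAACS schema, and this is not a significance test.
import pandas as pd

def score_gaps(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    means = df.groupby(group_col)[["human_score", "model_score"]].mean()
    overall = df[["human_score", "model_score"]].mean()
    gaps = means - overall
    return means.join(gaps, rsuffix="_gap_from_overall")

# Hypothetical usage on a scored-essay table:
# essays = pd.read_csv("scored_essays.csv")
# print(score_gaps(essays, "gender"))
# print(score_gaps(essays, "race"))
```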

Verification of the Framework

There are many ways to verify that this framework is useful for building explanations that are defensible and trustworthy in a real-world context. This will be an important future direction. First, of course, is the NLP standby: technical evaluation and quantitative measurement. In my work with Diyi Yang, for instance, we codified the aspects and associated features in this framework into a Gaussian mixture model and were able to automatically define whole roles for community members at the intersection of all of these features. This enabled a measurable level of coherence for the features themselves and how they co-occurred. But this focus on the technical, the first place that quantitative data scientists turn, may be too closely tailored to the needs of the data scientists themselves[456].

[456] Harmanpreet Kaur et al. "Interpreting Interpretability: Understanding Data Scientists' Use of Interpretability Tools for Machine Learning". In: Proceedings of CHI. 2020, pp. 1–14.

Also in that work, user studies were used as a tool to evaluate whether the roles that emerged in that technical model were interpretable and matched the intuitions of domain experts. This involved multi-stage interviews and mixed-methods interpretations that led to the labels for role types. This method generated subject matter expert buy-in, rather than focusing on a narrow view of interpretability, and allowed domain knowledge from praxis to drive the presentation of results and the explanation for the domain under study. Observation of real-world phenomena is another tool for evaluating whether the explanations arising from this framework are defensible. We might actually deploy the systems that we propose above, and see whether they interact in the ways that we expect.

And from a final, more theoretical perspective, further engagement with philosophy of science is also going to be valuable, allowing us to establish more formal bounds and expectations for what counts as a good explanation. This will clarify the role of a minimal model and a justificatory step, allowing the field to further advance a humanities-based understanding of whether our models are useful reflections of the domains they attempt to automate, and whether our explanations are succeeding at their goals.
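To ground the quantitative route described above, here is a schematic sketch of clustering users into roles with a Gaussian mixture over framework-derived features. It illustrates the technique only; the feature matrix, scaling choice, and number of roles are assumptions, not the model from the cited work.

```python
# A schematic sketch of the quantitative-verification route described above:
# clustering users into roles with a Gaussian mixture over framework-derived
# features. The feature matrix, scaling choice, and number of roles are all
# placeholder assumptions, not the model from the cited work.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

def fit_roles(feature_matrix: np.ndarray, n_roles: int = 5):
    """Fit a Gaussian mixture over per-user features; return role labels and the model."""
    scaled = StandardScaler().fit_transform(feature_matrix)
    gmm = GaussianMixture(n_components=n_roles, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(scaled)
    return labels, gmm

# Hypothetical usage: rows are users, columns are features drawn from the five
# facets (post counts, reply latency, policy citations, stance, and so on).
# X = np.load("user_features.npy")
# labels, gmm = fit_roles(X, n_roles=5)
```

In practice the number of roles would be chosen by comparing BIC or held-out likelihood across candidate values, and the resulting clusters would still need the expert validation described above.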

Algorithms as Social Actors

But there's something funny about the account we're giving right now. In most accounts of explanation, the model we describe is simply that - a model. It is used for explanation and understanding, the progression of scientific knowledge; the model is a minimal form of the behavior we observe and it lets us learn more. But in applied machine learning specifically, these models have a dual purpose. Not only are the models explanatory for the human processes we're learning about; they are also directly active in the process being observed. This is uncharted territory - a simulacrum of the process joining in as a co-participant is unlikely to be a problem that comes up in other fields. So let's investigate what this means.

As in the philosophical account, the context, goals, and interaction style of an automated decision-making system also matter. Algorithmic decision support in groups produces better outcomes when the system explicitly expresses vulnerability[457]. And social, proactive robots are more likeable and viewed as more productive[458]. But this does have the result of lowering the effect of automation: warning about uncertainty produces lower trust and more frequent manual interventions[459]. Users working with algorithmic systems do not trust all behaviors to be automated equally. Studies have shown, for instance, that people trust robotic ability in specific tasks like scheduling and workflow automation over others[460]. This is certainly the case in education technology, as we've seen in the numerous debates in both of our target domains.

More than two decades ago, Reeves & Nass showed that people can "team up" with a computer when they believe their performance relies on the computer's performance. They argued that computer systems are treated as social actors in how users interact with them[461]. In this social setting, explainable machine learning has the opportunity to address either the "how" questions or the "why" questions about an algorithm in use[462]. But people are not focused on the formal details of explainability so much as on a definition based on trust – to the point that robots making erroneous explanations are sometimes more likable than accurate ones[463].

[457] Sarah Strohkorb Sebo et al. "The ripple effects of vulnerability: The effects of a robot's vulnerable behavior on trust in human-robot teams". In: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 2018.
[458] Guy Hoffman and Cynthia Breazeal. "Effects of anticipatory action on human-robot teamwork efficiency, fluency, and perception of team". In: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 2007, pp. 1–8.
[459] Tove Helldin et al. "Presenting system uncertainty in automotive UIs for supporting trust calibration in autonomous driving". In: Proceedings of the International Conference on Automotive User Interfaces and Interactive Vehicular Applications. 2013, pp. 210–217.
[460] Matthew C Gombolay et al. "Decision-making authority, team efficiency and human worker satisfaction in mixed human–robot teams". In: Autonomous Robots 39.3 (2015).
[461] Byron Reeves and Clifford Nass. How people treat computers, television, and new media like real people and places. 1996.
[462] Wolter Pieters. "Explanation and trust: what to tell the user in security and AI?". In: Ethics and Information Technology 13.1 (2011), pp. 53–64.
[463] Nicole Mirnig et al. "To err is robot: How humans assess and act toward an erroneous social robot". In: Frontiers in Robotics and AI 4 (2017), p. 21.

Let's look at a few specific factors about trust in automated decision-making[464]. The literature here tells us that increasing the trustworthiness of an algorithmic system is not always tied to accuracy or reliability. Instead, the most successful trustworthy systems rely on calibration between user expectations and performance. High initial trust in a system can decrease following interaction[465], especially after seeing algorithmic decision-making commit errors or behave in ways that do not match the norms of the human decision-makers being emulated. In reality, people base their trust in a system on its actual behavior, their perception of the algorithm's accuracy, and how that perceived accuracy aligns to the system's self-stated accuracy[466]. Facilitating trust comes from setting correct expectations about what an automated system is going to do[467],[468].

Still, there is a lot of theory-building work to do. For practical reasons, it's rare that we can actually be fully descriptive about the tradeoffs that went into the design of a system; we also can't be fully clear to all users about the expectations for a system[469]. Our primary pointer for future directions is that while transparency in functionality is important, it is not everything. Explanation that succeeds in building trust is largely based on calibrating user expectations to actual performance, rather than on walking through the specifics of the decision-making process. That practical guidance will hopefully shape the work still to come in the field, providing a unified account of the machine learning classifier as explanatory scientific model, and of that same classifier as social co-participant in the decision-making process itself.

Unifying these two views of the classifier, though, will take us one step further up from the data than we've been so far. By allowing ourselves to intervene in decision-making directly, we give license to our algorithmic systems – permission to alter the course of human judgment. Before we do that, we should take a long look at what team we are supporting, and ensure we're advancing the causes that we believe in – and with that, we can move on.

[464] Ella Glikson and Anita Williams Woolley. "Human trust in Artificial Intelligence: Review of empirical research". In: Academy of Management Annals ja (2020).
[465] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. "Algorithm aversion: People erroneously avoid algorithms after seeing them err." In: Journal of Experimental Psychology: General 144.1 (2015), p. 114.
[466] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. "Understanding the effect of accuracy on trust in machine learning models". In: Proceedings of CHI. 2019, pp. 1–12.
[467] Bo Xiao and Izak Benbasat. "E-commerce product recommendation agents: use, characteristics, and impact". In: MIS Quarterly 31.1 (2007).
[468] Alyssa Glass, Deborah L McGuinness, and Michael Wolverton. "Toward establishing trust in adaptive agents". In: Proceedings of the International Conference on Intelligent User Interfaces. 2008.
[469] Mike Ananny and Kate Crawford. "Seeing without knowing: Limitations of the transparency ideal and its application to algorithmic accountability". In: New Media & Society 20.3 (2018), pp. 973–989.


Confronting Inequity

Early in this thesis, I made the point that explanation is about epistemology, not ethics. The tools that I developed in the previous chapter for thinking about explanation stay in that mindset, focusing on the justification of an explanation as successful. But the approach that I take to describing successful non-causal explanation requires an understanding of disciplinary discourses. An explanation is only successful if your justification, as a scientist, is constrained to the circumstances where it fits based on what your contemporary scientific discourse believes.

This is one point where the line between explanation and ethics blurs. It's also a jumping-off point for a broader and more skeptical look at our scientific process of explanation. This is because the choice of who is a member of that "scientific discourse" is itself fraught with societal biases and preferences for some groups over others; this is true throughout natural language processing and is certainly true in education, meaning we should expect it in how we define contemporary thought in education technology, too.

Our field is now waking up to the impact of our work, but our notions of power structures as they relate to algorithmic systems are still maturing[470]. Given the rapid scramble to technology-based learning platforms in the last academic year due to the coronavirus pandemic, it is now more urgent than ever for the fields of algorithmic decision-making, natural language processing, and education technology – as well as those who interact with that research – to catch up to the broader discourse about equity and justice in the application of automation to schools. So let's lean on scholarship from critical pedagogy in this final chapter, to interrogate the question of who gets to decide what makes an explanation successful.

[470] Su Lin Blodgett et al. "Language (technology) is power: The need to be explicit about NLP harms". In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 2020.

Disciplinary Norms and Power

Technology and Good Intentions

Recently, critical theorist Neil Selwyn engaged with the question of automation technology, especially under the banner of "learning analytics," and how it fits into the ongoing work to improve equity in education[471]. He argued that this research puts the institution ahead of the individual, ignoring the broader social context of technology.

[471] Neil Selwyn. "What's the Problem with Learning Analytics?" In: Journal of Learning Analytics 6.3 (2019), pp. 11–19.

Work in predictive modeling and decision-making views algorithms as a means of surveillance rather than support, leading to limits on free choice and expression. This line of thinking follows a common critique against what theorists call techno-solutionism[472]. Fundamentally, work in this style assumes that individuals and organizations can add well-intentioned technology to their existing processes and, in doing so, solve deep-seated, complex social issues.

Later, Selwyn extended his argument with one further step[473]. Educational data mining is not apolitical, he argued, and cannot be; every system, and every artifact of that system, like a dataset or a trained model, has politics embedded. Even "social good" targets like learning analytics require you to make some normative decisions (like "staying enrolled in a university course is good"). Professional codes of conduct among scientists and engineers rarely make explicit these normative judgments about what projects should be doing, but his provocation was that the field must evolve learning analytics toward a deliberate politics of social justice. This might require a rebuilding of the whole field.

Crucial to this argument is the "master's tools" analysis borrowed from Audre Lorde[474], drawing a distinction about whether algorithmic technologies are "reformist" or not (do they work within an existing system?). He takes as an example a transgender student with a disjointed home life, meaning they have disrupted attendance, parental engagement, and medical records. Because those fields have gaps, omissions, or blanks, by definition this will lead to reduced fidelity in predictions. These gaps are built into the very database schema of algorithmic interventions; this makes it structurally impossible, at a technical level, to address inequalities head-on.

Even in this thesis, we've seen this in the limited way I was able to engage with race and gender in my analysis of DAACS. When I was in industry, I did not have the room in my schedule to ask those questions, nor did I have any incentives to make that room. After our product had been in the market for a few years, I started being exposed to more outside experts in education, like Selwyn and his peers. Rather than coming at problems like AES from a machine learning perspective, hoping to help educators from the outside, they were starting from the perspective of students, educators, and community members watching technology seep into their daily lives. Interrogating the culture of education has an extremely long history, of course, going back decades[475],[476]. Researcher-activists have built rich culturally sustaining pedagogies to accomplish goals of social justice for students[477]. And yet in my original years of academic and industry work on AES I got only fleeting pointers to this work, and no formal training.

[472] Evgeny Morozov. To save everything, click here: The folly of technological solutionism. Public Affairs, 2013.
[473] Neil Selwyn. "Re-imagining 'Learning Analytics'... a case for starting again?" In: The Internet and Higher Education (2020), p. 100745.
[474] Audre Lorde. "The master's tools will never dismantle the master's house". In: Sister Outsider: Essays and Speeches 1 (1984), pp. 10–14.
[475] John Dewey. Democracy and education: An introduction to the philosophy of education. Macmillan, 1923.
[476] Paulo Freire. Pedagogy of the oppressed. Bloomsbury Publishing USA, 1970.
[477] Gloria Ladson-Billings. "Toward a theory of culturally relevant pedagogy". In: American Educational Research Journal 32.3 (1995), pp. 465–491.

What I have found now is that the question at this point is not whether inequities exist in educational technology. The unequal treatment of students in today's educational settings is well-established. The broader field of algorithmic decision-making for education has been slow to respond to this, though. While I was working on automated essay scoring, we certainly felt the backlash to our work[478],[479]. But as a developer, the reaction in the moment is to be immediately defensive of the system, arguing that machine learning only encodes preexisting biases, and does not create new ones or cause additional harm. It certainly did not feel right to be held accountable for the state of the world we found going in, embodied and crystallized in our data[480]. This is typical; with only a few exceptions, the fairness debate has not yet penetrated the mainstream discourse of learning analytics or predictive education technology[481].

[478] NCTE. NCTE Position Statement on Machine Scoring. 2013. Accessed 2020-06-30. url: https://bit.ly/3dQHaVY.
[479] John Warner. "The Ed Tech Garbage Hype Machine: Behind the Scenes". In: Inside Higher Ed (2014). Accessed 2019-09-24. url: https://bit.ly/1w3NdwS.
[480] Brent Daniel Mittelstadt et al. "The ethics of algorithms: Mapping the debate". In: Big Data & Society 3.2 (2016).
[481] Kenneth Holstein et al. "Improving fairness in machine learning systems: What do industry practitioners need?" In: Proceedings of CHI. 2018.

Funding for Algorithms in Education

Part of the reason is that confronting systemic inequities through action is difficult. Individual researchers may feel unable to make significant changes to their research agenda, much of which has been built up over years and is hard to pivot. Even if they were to change an individual project to prioritize equity, doing so may feel small in the grand scheme of the learning science landscape, unlikely to effect change at a larger scale while putting their research funding at risk.

This challenge in making systemic change is precisely the issue. Identifying differential performance or group fairness metrics for individual systems or for particular subpopulations is insufficient if those systems are built on, and further reproduce, existing systems of oppression. Change requires collective impact rather than individual action, which in turn requires mapping the network of power and funding that produces that impact[482]. In this section, I look closer at the funding incentives and structures that have shaped the current state of the learning sciences.

Funding for education research follows much the same cycle as other academic research in general. Calls for proposals are put out annually, workshops are funded, conferences are held, and shared tasks are developed for collaboration across institutions. Funding is allocated to individual PIs and teams at the scale of hundreds of thousands of dollars at a time. But by their very nature, these grants determine rigid boundaries as to which forms of learning science they can support, and which evidence counts as success, providing further examples of bell hooks's argument that our "ways of knowing" – the types of research that drive the field forward – are circumscribed by historical relations of power[483].

[482] Brian D Christens and Paula Tran Inzeo. "Widening the view: situating collective impact among frameworks for community-led change". In: Community Development 46.4 (2015), pp. 420–435.
[483] bell hooks. Teaching community: A pedagogy of hope. Vol. 36. Psychology Press, 2003.

to grantwriters, a tenured professor and reviewer for IES makes clear which types of educational research are valued: "If you can’t provide a reasonable estimate of what your minimum sample size will be, and its power, then you are already dead in the water."484 484 Stephen Porter. Applying for an A successful application for later-stage grants in the funding IES Grant. 2015. url: https : / / stephenporter . org / research - pipeline, meanwhile, requires demonstrated outcomes from a pre- methods/applying-for-an-ies-grant/ existing intervention at an earlier stage; the new work must be made up of relatively small, incremental changes to that existing result485. 485 National Center for Special Educa- Progressing to the next tier of funding requires evidence that the pre- tion Research. Building Evidence: What Comes After an Efficacy Study? 2016 vious step worked. Radical changes at any later point of a research program are difficult to justify—and indeed, radical re-visioning of the fundamental aims and goals of one’s research line is made diffi- cult by these processes. The publication cycle for machine learning and learning sciences also reinforces this structure of iterative change to existing interven- tions. Performing small studies with clearly-defined alterations to prior work lends itself well to designing experimental conditions and results that fit within the bounds of conference submissions. This work produces discrete and clean findings that can answer narrow questions rigorously; such work can also be cited in time for submis- sion of the next cycle of grant proposals. This structure means that researchers are strongly incentivized to maintain and build upon ex- isting systems and make isolated changes to a deployed platform for experimentation — as in widely-used platforms like ASSISTMents486, 486 Neil T Heffernan and Cristina Betty’s Brain487, or Cognitive Tutor488. In this light, it is understand- Lindquist Heffernan. “The ASSIST- ments ecosystem: Building a platform able why the bulk of work on fairness and ethics in educational tech- that brings scientists and teachers to- nology and the machine learning field more broadly seems intent on gether for minimally invasive research on human learning and teaching”. In: thinking of, and operationalizing, fairness as a technical evaluation International Journal of Artificial Intelli- of models and datasets, rather than as systemic problems. Opportu- gence in Education 24.4 (2014), pp. 470– nities for a drastic rethinking of the goals of those research lines and 497 487 Krittaya Leelawong and Gautam educational platforms are rare and risky. Confronting inherent in- Biswas. “Designing learning by teach- equities requires a leap of faith for a PI – altering course and hoping ing agents: The Betty’s Brain system”. that the funding is there to catch you when you jump. In: International Journal of Artificial Intel- ligence in Education 18.3 (2008), pp. 181– 208 488 Steven Ritter et al. “Cognitive Tutor: Conditions for Scientific Change Applied research in mathematics education”. In: Psychonomic bulletin & review 14.2 (2007), pp. 249–255 The Structure of Scientific Practice

To understand this cycle from a theoretical foundation, I would draw on Thomas Kuhn’s paradigm for understanding scientific practice489. Scientific study, in his model, acts as a feedback loop. The norms and values of a particular scientific paradigm shape what is considered to be worthwhile science. Then, opportunities for funding, reflecting those paradigmatic values, inform the goals, contexts, and conditions in which scientists (here, learning scientists) do their work, which shapes the directions of the tools they build. These research findings, instantiated in particular models, theories, and tools, make their way into practice—including by researchers embedded in practice (i.e., the self-described "learning engineers"490). Eventually, successful learning science projects implement their technology at scale, which then shapes the behaviors of other participants in this ecosystem – schools, teachers, and peer researchers. Those success stories are then fed back into what becomes valued as legitimate science—and what becomes seen as necessary requirements for both government and philanthropic funding in the next cycle. The result is a feedback loop that is constantly informing, shaping, and reinforcing a set of values and methods among learning scientists.

One can see this resistance to systemic change play out in the field of machine learning and its responses to critiques of ethics and fairness. For nearly six decades, humanists and critical theorists have levied critiques against techno-solutionist approaches to artificial intelligence491,492. Along the way, the field of machine learning labored on much as it always had, with funding calls and conference reviews continuing to ignore the harmful impacts of algorithmic systems. Following a confluence of high-profile revelations around discriminatory outcomes of machine learning systems493,494,495,496 coinciding with these systems’ widespread adoption in consumer products—and perhaps, broader societal changes—the conversation around ethics finally began to shift in the mid-2010s. But it wasn’t until March 2019 that the National Science Foundation put out its first solicitation for grant proposals on fairness in artificial intelligence497. However, this call for grants falls squarely within the current techno-solutionist paradigm of thinking about fairness—making the assumption that such systems themselves may be fundamentally beneficial, and we must simply, as the NSF call puts it, "ensure benefits are broadly available across all segments of society.” In their call, they specifically call for grants that develop novel technical methods for "detecting bias in systems" and "ensur[ing] fairness," without calling for research into interrogating or resisting the broader societal inequities that are instantiated and reproduced by algorithmic systems.

Kuhn’s critique of scientific paradigms describes how hard it is for a scientific field to get out of this loop498. Disciplines are built on particular ways of knowing—think of bell hooks and how our ways of knowing are forged through inequitable relations of power. As a result, this normalizes what becomes defined as science at all. Conversely, these definitions also shape what becomes known as not science. Consequently, researchers are expected to produce results within the normalized paradigm, doing the same work in the same structure of scientific practice. Unfortunately, anyone who tries to move outside of those boundaries, to produce radically different norms of knowledge, is immediately questioned, marginalized, and excluded from funding.

The self-reinforcing cycle is not produced by the people impacted by these systems, but by the powerful. Classroom teachers, for instance, even if they are given options about the types of educational technology to use in their classrooms, lack access to the levers of influence necessary to make fundamental, meaningful changes to the equity of learning science research. Indeed, teachers are often most subject to accountability for existing standards regardless of their much lower authority over institutional decision-making499. As Amy Ogan reports500, while so-called "classroom sensing" technologies may ostensibly be designed to support teachers’ self-reflective professional development, teachers have a "deep fear" that those tools will instead be co-opted by school administrators eager for awareness and control over teachers’ behaviors. For graduate student researchers, the same applies. While they may be interested in changing fundamental power structures of educational technology research, to do so would put them at substantial professional risk. Taking on such risk is, itself, a privilege of those with safety nets and systemic power roles to draw on.

489 Thomas S Kuhn. The structure of scientific revolutions. University of Chicago Press, 1962; reprinted 2012
490 Bror Saxberg. “Learning engineering: the art of applying learning science at scale”. In: Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale. ACM, 2017, pp. 1–1
491 Norbert Wiener. The human use of human beings: Cybernetics and society. Da Capo Press, 1988
492 Terry Winograd and Fernando Flores. Understanding computers and cognition: A new foundation for design. Intellect Books, 1986
493 Safiya Umoja Noble. Algorithms of oppression: How search engines reinforce racism. NYU Press, 2018
494 Cathy O’Neil. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016
495 Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification”. In: Proceedings of FAccT. 2018, pp. 77–91
496 Virginia Eubanks. Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press, 2018
497 National Science Foundation. Program on Fairness in Artificial Intelligence in Collaboration with Amazon. Accessed 2020-07-25. url: https://bit.ly/2BvaTXo
498 Thomas S Kuhn. The structure of scientific revolutions. University of Chicago Press, 1962; reprinted 2012
499 Adam Kirk Edgerton and Laura M Desimone. “Mind the gaps: Differences in how teachers, principals, and districts experience college- and career-readiness policies”. In: American Journal of Education 125.4 (2019), pp. 593–619
500 Amy Ogan. “Reframing classroom sensing: promise and peril”. In: interactions 26.6 (2019), pp. 26–32

Funding in Machine Learning for Education

Therefore, I focus my analysis on sources of funding that shape the nature of the research conducted, and highlight the choices that researchers may have in shaping these calls. In particular, I consider government agencies, corporations, and philanthropies.

Research funding is accessed primarily through the National Science Foundation (NSF) and the Department of Education, particularly the Institute of Education Sciences (IES). Smaller players in the learning sciences in particular include the National Institutes of Health (NIH) and DARPA, which is funded through the Department of Defense. While these organizations are much larger sources of grant funding for the sciences in general, they play a comparatively smaller role in shaping the landscape of education. The US Department of Education’s Institute of Education Sciences, for instance, separates its grants into a series of Goals with incremental increases in the size of funding available. Early, exploratory proposals receive an order of magnitude less funding, capped at $700,000, but ostensibly do not require a track record of success to receive awards. Full-scale efficacy studies, on the other hand, can receive grants of up to $5,000,000 at a time, involving collaborations across multiple institutions, numerous researchers and stakeholders, and many years of work.

While the federal funding pipeline is slow and relies on measurable evidence of each incremental change, the corporate world of educational technology software has historically needed little evidence at all. This results in extremely high variance between products: some have strong findings from research; others, by the time they make it to schools, bear little resemblance to the original experiments. Many products don’t come from a research base to begin with, but are evaluated for success based on idiosyncratic measures, by district administration with informal, ad hoc training in data analytics501.

Another major source of funding for education research comes from the private sector. In capitalist economies, extremely high-net-worth captains of industry often use their wealth to fund philanthropic ventures. In the early twentieth century these projects included the funding of the arts by Rockefeller and the funding of public libraries by Carnegie. In modern times, this role is often played by organizations funded from the technology industry. Most visible among these organizations are the Bill & Melinda Gates Foundation and, more recently, the Chan Zuckerberg Initiative.

While these organizations have sprawling operations that impact many different industries, because of all the factors listed above, they have a uniquely powerful role in defining the path of learning sciences research. In a generous reading, this role serves as a compromise between federal government funding and corporate investment in the education sector. Like government funders, philanthropies demand measurable results from their investments. Today there is an emphasis from essentially every philanthropic funder to combine a focus on measurable outcomes with an unusually tight timetable for experimentation, and in particular there is a need for grantees to demonstrate measurable efficacy results rapidly, often within a year or two of receiving funding. These "short-term wins" are often paired with longer-term strategic objectives of multi-year research agendas.

But unlike government research, these philanthropies fundamentally seek to turn research results into scalable, self-funded businesses that will turn into the corporate entities that sell their results to schools. This is a unique twist on the role of grant agencies in support of education. Treating corporate products as the primary driver of dissemination of research takes a fundamentally neoliberal view of how learning science and education reform can have an impact, focused on private sector investment502,503,504. This worldview has been described as "philanthro-capitalism" and defined as:

"the openness of personally profiting from charitable initiatives, an openness that deliberately collapses the distinction between public and private interests in order to justify increasingly concentrated levels of private gain."505

501 Alex J Bowers et al. “Education Leadership Data Analytics (ELDA): A White Paper Report”. In: ELDA Summit (2019)
502 Joanne Barkan. “Plutocrats at work: How big philanthropy undermines democracy”. In: social research 80.2 (2013), pp. 635–652
503 Linsey McGoey. “Philanthrocapitalism and its critics”. In: Poetics 40.2 (2012), pp. 185–199
504 Ben Williamson. Code Acts in Education: Re-Engineering Education. 2020. url: https://nepc.colorado.edu/blog/re-engineering-education
505 Linsey McGoey. “Philanthrocapitalism and its critics”. In: Poetics 40.2 (2012), pp. 185–199

This attitude is evident at every level of philanthropy in the learning sciences. The Chan Zuckerberg Initiative, for instance, is not a nonprofit organization but an LLC, inventing a new category of "for-profit philanthropy"506. This has blurred lines between the charitable and corporate roles of funders in education. In journalism, research has identified the lines of influence that philanthropic funders have over the journalistic organizations they fund507. In the learning sciences, the same organizations that fund research (such as the Chan Zuckerberg Initiative) often also fund the conferences, journalism outlets (e.g., EdSurge), and publications that present the most optimistic and friendliest narrative around press releases and news of learning sciences innovations and product launches. This complex, deeply embedded role in the broader ecosystem gives philanthropies a remarkable influence over educational technology, having "unprecedented power to shape the direction of research and development in education, by selecting and investing in programs that fit their personal vision."508

In fact, this undue influence of philanthro-capitalists in funding learning science research can be read as part of a larger tradition of disinvestment in public education in favor of the privatization of education (following Klein, and returning to the very beginning of this dissertation509). This trend further entrenches educational disparities along socioeconomic class lines.

506 Ben Williamson. Code Acts in Education: Re-Engineering Education. 2020. url: https://nepc.colorado.edu/blog/re-engineering-education
507 Patrick Ferrucci and Jacob L Nelson. “The new advertisers: How foundation funding impacts journalism”. In: Media and Communication 7.4 (2019), pp. 45–55
508 Ben Williamson. Code Acts in Education: Re-Engineering Education. 2020. url: https://nepc.colorado.edu/blog/re-engineering-education
509 Naomi Klein. The shock doctrine: The rise of disaster capitalism. Penguin Books, 2007

The Limits of Representation

One common response to these concerns is to acknowledge their validity, then shift to argue for their resolution through more representation and community voices in the decision-making process. This push for representation as a solution is a common and important step toward addressing issues of access and opportunity to the knowledge, practices, and spaces that are shaping algorithmic decision-making. However, in order to address concerns of socio-technical systems of oppression and inequities, this is not enough. The politics of representation rests on a strong assumption: that shifting identity or representation will shift the epistemology and logic of socio-technical systems. In fact, a focus on identity or resources is not enough to address the non-material processes of power and oppression510. Unfortunately, these politics of representation do not address the normative disciplinary practices of producing knowledge and the hegemonic shaping of educational systems. Data science, and training in it, is done within Western epistemological paradigms. Why should we expect that shifting representations will shift this hegemony?

Part of the limitation of this inclusive design philosophy lies in a critique of the focus on the politics of representation. On one hand, there is a push to train more data scientists who come from underrepresented backgrounds; the theory of change here is that if those voices were part of the data science process, their perspectives would be included in the design of these algorithms and platforms. This also gets deployed in the way that communities and organizations are engaged: maybe research needs community members in the design process as a voice, as an attempt to shift representation, to make systems more equitable and less biased. One can see similar arguments in public policy511 and community-engaged health research512, as well as in the tradition of participatory design and community-driven co-design in human-computer interaction513,514.

Parisi and Dixon-Román argue that this approach is based on a politics of inclusion515. Adding diversity to who is included at the table of decision making is an important initiative, but as Benjamin argues, "so much of what is routine, reasonable, intuitive, and codified reproduces unjust social arrangements, without ever burning a cross to shine light on the problem"516. These politics of inclusion fundamentally do not transform the norms and logic of reason that make up the epistemology of the system. Thus, a politics of inclusion in fact maintains and reifies a logic inherited by science and technology from a deeper, colonialist past. In other words, the issue at hand is not just about representation in voice and designers of existing systems, it is about recognizing the heritage of where those systems came from. Thus, adding representation (e.g., diverse team members) does not shift or change the underlying ways of thinking and knowing.

After all, who qualifies as an expert or a qualified voice at the table? The necessity for short-term outcomes in a particular format of efficacy study limits the researchers that are credibly able to apply for such funding: in particular, qualified applicants must have pre-existing research programs, participants for studies, and software ready to put into classrooms in a matter of months. For substantial later-stage support, that software must have been tested for a fairly narrow definition of successful learning gain. This introduces a hidden list of requirements into the top tier of funding: pre-existing relationships with a complex web of scientists, philanthropists, and educational settings, and a pre-existing agreement that those learning gains, measured typically on summative assessments of testable knowledge, are the appropriate metric of success. In today’s philanthropic funding landscape for educational technology, a researcher cannot receive grants unless they are already enmeshed within the loop, having received grants before and built up social capital to receive further support while measuring what those with power have agreed to measure, using methods they have approved.

510 Iris Marion Young. Justice and the Politics of Difference. Princeton University Press, 1990
511 Eric Corbett and Christopher A Le Dantec. “The problem of community engagement: Disentangling the practices of municipal government”. In: Proceedings of CHI. 2018, pp. 1–13
512 Joyce E Balls-Berry and Edna Acosta-Perez. “The use of community engaged research principles to improve health: community academic partnerships for research”. In: Puerto Rico health sciences journal 36.2 (2017), p. 84
513 Christina Harrington, Sheena Erete, and Anne Marie Piper. “Deconstructing Community-Based Collaborative Design: Towards More Equitable Participatory Design Engagements”. In: Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019), pp. 1–25
514 Michael J Muller. “Participatory design: the third space in HCI”. In: The human-computer interaction handbook. CRC press, 2007, pp. 1087–1108
515 Ezekiel Dixon-Román and Luciana Parisi. “Data Capitalism, Sociogenic Prediction and Recursive Indeterminacies”. In: Public Plurality in an Era of Data Determinacy: Data Publics. 2020
516 Ruha Benjamin. Race After Technology: Abolitionist Tools for the New Jim Code. John Wiley & Sons, 2019

To make the point more explicit: the requirements for funding and, therefore, ability to execute on a vision are structured for exclusion. They reinforce existing hierarchies in learning science and technology. The researchers that are qualified for new grants—at all but the earliest exploratory stages—are the researchers that were previously funded, and the projects with evidence that fits the funding agencies’ targets are the projects they themselves have previously funded. Inclusion of a few select voices into this process is insufficient to change this feedback loop. Funding agencies pride themselves on their goals of funding innovation, and on finding research that is cutting-edge. But the template for innovation has been public and highly visible for years now. Funders do alter their criteria nominally from year to year, but the underlying premises have not changed. For at least two decades, there has been a focus on measurable outcomes and efficacy on standardized, summative assessments. Funders have made clear that the promising, scalable modalities for these approaches—those worthy of funding—are technological solutions, developed by companies that match the goals of philanthro-capitalists. If these solutions are where we place our optimism for deep-seated social issues, then we are asserting that those same organizations that led us to this point ought to choose the way to unwind their own historical impact on curriculum and learning.

Alternate Futures

So let’s end with positive possibilities. What choices are available for active, inclusive, equitable change to how we do our work? All of these scholars I’ve cited have given us tools for explanation even deeper and broader than the social framework I’ve advanced. Only through engaging with these frameworks am I able to acknowledge the influence of external goals on the broader direction of the communities, products, and technologies I study. Do they tell us anything about how we might re-imagine a better future, one where defensible explanations are made on behalf of algorithms and tools that do not buy into these inequitable structures?

Policies and practices present particular visions of the world. Technologies then instantiate these visions and values into algorithmic systems that further entrench them. Theorists and critics have called to remake the financial world of machine learning research in a more just and more equitable way. I will not suddenly, at the end of this dissertation, be offering solutions to the core challenges of systemic oppression. But I will offer two potential ways forward: one that would entail changes to existing systems, and another that calls for transformative change.

New Research Priorities

First, I propose a change in current funding models, to better prioritize research that investigates how to make systemic social change, rather than relying on techno-solutionist thinking or on narrow experimental studies reporting incremental results. This is a natural extension of my more narrow call for a social, non-causal account of explanation over a technical, introspective, causal account. This can be done through funding to researchers from marginalized groups. But allocation of funding to those individuals cannot work in isolation; allocation of resources within the existing feedback loop cannot be the extent of funding agencies’ efforts towards addressing equity. It must be accompanied by changed expectations for the types of research that are publishable, towards research that is more participatory and more community-based.

This would mean requiring that research teams involve members of marginalized stakeholder groups as meaningful, co-equal members of the research team, or as advisors to the research projects, whose voices are given equal weight. That is, these stakeholders should be involved in framing the goals right from the start. When appropriate, they should be able to say that the research should not continue, if there is no way to make it equitable. This is in contrast to simply offering tokenized feedback to an already finished project, or a project on a fixed trajectory that can only be altered at the margins517. Addressing inequity through participatory design will require reconsideration of what it means for stakeholders to participate in research. Such work will also need to avoid overburdening marginalized communities, by compensating them adequately for the full scope of their newly expanded participation.

The field of machine learning is grappling with the complex contradictions in this path forward. Despite over 90 values statements518 for ethical machine learning produced by large technology companies and government agencies, machine learning systems that discriminate and perpetuate injustice continue to be developed. While there may be legitimate organizational reasons why technology companies have been unable to put these principles into practice519,520,521,522, others have been skeptical of intentions.

517 Sherry R Arnstein. “A ladder of citizen participation”. In: Journal of the American Institute of planners 35.4 (1969), pp. 216–224
518 Anna Jobin, Marcello Ienca, and Effy Vayena. “The global landscape of AI ethics guidelines”. In: Nature Machine Intelligence 1 (2019), pp. 389–399
519 Kenneth Holstein et al. “Improving fairness in machine learning systems: What do industry practitioners need?” In: Proceedings of CHI. 2018
520 Michael Veale, Max Van Kleek, and Reuben Binns. “Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making”. In: Proceedings of CHI. 2018, pp. 1–14
521 Luke Stark and Anna Lauren Hoffmann. “Data Is the New What? Popular Metaphors & Professional Ethics in Emerging Data Culture”. In: Journal of Cultural Analytics (May 2019)
522 Michael A Madaio et al. “Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI”. In: Proceedings of CHI. 2020, pp. 1–14

Some critics describe these ethical statements as "fairwashing" or "ethics-washing," developing principles and value statements for ethical machine learning while doing little to change the nature of such technology, or the organizational processes that led to their design523. Mona Sloane has described this as a "smokescreen" for business as usual524. Thus, while it may be worthwhile to begin with changing the priorities and funding calls from the usual suspects of funding agencies to prioritize marginalized community interests and other ways of knowing, it will not be enough to transform fundamentally unjust systems.

One direction to look is what Mariam Asad et al. have referred to as "academic accomplices”, or scholars whose research is designed to support the justice work already ongoing in communities525. Truly empowered participation by marginalized groups in these research agendas requires the ability to change ways of knowing what "evidence" looks like in learning science research. It will require a willingness to change the definition of efficacy and success in an intervention, based on what communities want and need. It will require looking at different disciplines, outside of the techno-solutionist mindset, and instead towards the voices of community organizers, social justice scholars, critical theorists, and others. This will require a humility on the part of researchers and funders, a willingness to be part of a change that may leave them with less of a grip on power at the end of the funding process than they had at the start.

What we may learn from such engagement is that for some tasks, the current approach is inherently unjust, and in these cases, the research should simply cease – "the implication is not to design"526. Os Keyes describes this situation in their critique of data science as rigid quantification that is intrinsically at odds with safety for those at the margins527. In such cases, injustices are inextricably embedded in the proposed technical solutions. For these types of tasks or methods, no amount of good explanation will uproot the underlying inequity.

523 Elettra Bietti. “From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy”. In: Proceedings of FAccT. 2020, pp. 210–219
524 Mona Sloane. “Inequality Is the Name of the Game: Thoughts on the Emerging Field of Technology, Ethics and Social Justice”. In: Weizenbaum Conference 2019 “Challenges of Digital Inequality — Digital Education, Digital Work, Digital Life”. 2019
525 Mariam Asad et al. “Academic Accomplices: Practical Strategies for Research Justice”. In: Proceedings of DIS. 2019, pp. 353–356
526 Eric PS Baumer and M Six Silberman. “When the implication is not to design (technology)”. In: Proceedings of CHI. 2011, pp. 2271–2274
527 Os Keyes. “Counting the Countless: Why data science is a profound threat for queer people”. In: Real Life 2 (2019). url: https://reallifemag.com/counting-the-countless/

Justice in Algorithmic Decision-Making Research

My final argument would be for machine learning researchers to adopt a design justice approach to algorithmic decision-making research. As proposed by Sasha Costanza-Chock, design justice is a "framework for analysis of how design distributes benefits and burdens between various groups of people"528. This involves interrogating the values encoded into designed systems, meaningfully involving members of marginalized and impacted communities in the design of systems, and questioning the narratives, sites, and pedagogies around design.

528 Sasha Costanza-Chock. Design justice: Community-led practices to build the worlds we need. MIT Press, 2020

To do this well, though, will require the communities I am a part of, in NLP, machine learning, and education technology, to do some work. It requires us to learn, adopt, and promote theories and methods that critique the very power structures that have enabled our own research. Thus, a turn towards design justice may need to start with researchers, not funders. Changing a research community means changing practices, but also public engagement, calls for participation, and the review criteria and awards of the journals and conferences they support. One may look to the medical field as an example here, with recent calls for community-engaged research leveraging journal reviews, grant funding, and universities’ tenure and promotion as mechanisms for promoting a change in research values and methods529.

The most robust version of this would involve the field as a whole changing our incentives. These incentives include the kinds of grants and funding that learning scientists pursue and accept, the collaborators they choose, the role of stakeholders in their research, and their engagement with the impact and legacy of their research. This engagement would continue after its dissemination and deployment in school systems, and necessarily prioritize a longer-term view over the current, more incremental status quo.

As a field, our vision for the kinds of technologies that can be built is shaped by what Sang-Hyun Kim and Sheila Jasanoff have called "socio-technical imaginaries":

"collectively held, institutionally stabilized, and publicly performed visions of desirable futures, animated by shared understandings of forms of social life and social order attainable through, and supportive of, advances in science and technology"530.

The socio-technical imaginaries of our current field are shaped and circumscribed by the history of technology as it is today – but they can be remade. To do this, we are inspired by Keyes et al.’s call for counterpower, or what they call emancipatory autonomy in human-computer interaction, or, more simply, anarchist HCI531. This involves fostering community-appropriate and community-determined research and design, both between researchers and stakeholders as well as within the technical research community itself. This move for counterpower would involve giving everyone, not simply a privileged few, the means to shape the forms of socio-technical educational systems. There are echoes of this in the calls to "democratize" machine learning from corporations532 and researchers533, which mostly seem to mean providing open-source tools for developing machine learning models (e.g., TensorFlow534). However, in practice, while platforms like TensorFlow may be freely available, simply giving more people access to them will do little to change the fundamentally inequitable paradigms in which they are used. Instead, a more radical version of the democratization of learning technologies might reflect Paulo Freire’s vision of a liberatory pedagogy that allows learners to pose the problems that are essential for their lives, and learn in ways that are meaningful and effective for them535. There is a rich lineage of this resistance to centralized educational hierarchy in education, including the critical theorist Ivan Illich, who famously called for "de-schooling society"536, or Eli Meyerhoff’s call for new "modes of study" beyond education537.

In artificial intelligence more broadly, acts of resistance or refusal are becoming ever more common. For instance, communities have organized against computer vision used in public housing projects538, and the Stop LAPD Spying Coalition539 has successfully organized to ban predictive policing algorithms in Los Angeles. One can also look to individual acts of resistance to harmful algorithmic systems, such as masks designed to resist facial recognition540 and tools to obfuscate advertising algorithms541. However, these are acts of resistance and refusal to systems that are already designed and deployed, and likely already causing harm in the world.

Research methods that fit cleanly into the existing body of literature have resulted in reproductions of existing, inequitable brick-and-mortar education systems. We have built algorithms that reflect back the inequities that are already in place. This is not sufficient for creating new socio-technical imaginaries, or more liberatory forms of algorithmic decision-makers. As a field, we might look to methods from feminist speculative design542,543 and critical design544,545, which endeavor to provoke and problematize546, to envision possible futures, both positive and negative, and to have bold visions for bringing these futures about. This might involve using critical design methods to provoke and problematize foundational assumptions in machine learning, including what should be learned and what it means to have learned it – to challenge these legacy views of what makes up a "good" model and a good decision.

529 Joyce E Balls-Berry and Edna Acosta-Perez. “The use of community engaged research principles to improve health: community academic partnerships for research”. In: Puerto Rico health sciences journal 36.2 (2017), p. 84
530 Sheila Jasanoff and Sang-Hyun Kim. Dreamscapes of modernity: Sociotechnical imaginaries and the fabrication of power. University of Chicago Press, 2015
531 Os Keyes, Josephine Hoy, and Margaret Drouhard. “Human-Computer Insurrection: Notes on an Anarchist HCI”. In: Proceedings of CHI. 2019, pp. 1–13
532 Microsoft. Democratizing AI - Stories. https://news.microsoft.com/features/democratizing-ai/. (Accessed on 05/01/2020). 2016
533 Erwan Moreau, Carl Vogel, and Marguerite Barry. “A paradigm for democratizing artificial intelligence research”. In: Innovations in Big Data Mining and Embedded Knowledge. Springer, 2019, pp. 137–166
534 Martín Abadi et al. “Tensorflow: A system for large-scale machine learning”. In: USENIX Symposium on Operating Systems Design and Implementation. 2016, pp. 265–283
535 Paulo Freire. Pedagogy of the oppressed. Bloomsbury publishing USA, 1970
536 Ivan Illich. Deschooling society. Penguin Group Limited, 1973
537 Eli Meyerhoff. Beyond Education: Radical Studying for Another World. U of Minnesota Press, 2019
538 Michele Gilman. Voices of the Poor Must Be Heard in the Data Privacy Debate - JURIST - Commentary - Legal News & Commentary. https://www.jurist.org/commentary/2019/05/voices-of-the-poor-must-be-heard-in-the-data-privacy-debate/. (Accessed on 05/01/2020). 2019
539 https://stoplapdspying.org/
540 Elise Thomas. How to hack your face to dodge the rise of facial recognition tech | WIRED UK. https://www.wired.co.uk/article/avoid-facial-recognition-software. (Accessed on 05/01/2020). 2019
541 https://adnauseam.io/
542 Anthony Dunne and Fiona Raby. Speculative everything: design, fiction, and social dreaming. MIT press, 2013
543 Luiza Prado de O Martins. “Privilege and oppression: Towards a feminist speculative design”. In: Proceedings of DRS (2014), pp. 980–990
544 Shaowen Bardzell et al. “Critical design and critical theory: the challenge of designing for provocation”. In: Proceedings of DIS. 2012, pp. 288–297
545 Jeffrey Bardzell and Shaowen Bardzell. “What is "critical" about critical design?” In: Proceedings of CHI. 2013, pp. 3297–3306
546 Laura Forlano and Anijo Mathew. “From design fiction to design friction: Speculative and participatory design of values-embedded urban technology”. In: Journal of Urban Technology 21.4 (2014), pp. 7–24
Conclusion

All of this may require rethinking the standard machine learning research and design lifecycle to prioritize equity and justice547 and to involve (and empower) stakeholders from marginalized communities throughout this lifecycle. This may entail a radical transformation of the "institutionally stabilized"548 structures and incentives of the current system. It may require teaching students about the history and legacies of educational injustice, critical theory, and broader societal systems of oppression. And, it may require machine learning researchers to become accomplices549 with members of marginalized communities, to avoid reproducing existing power dynamics in their participatory design methods and move towards fostering counterpower in machine learning systems deployed in and for schools. All of this would require a radical re-envisioning of a more just, equitable, and liberatory system for learning science.

These problems are complex and interwoven and resist straightforward analyses and answers. Thinking about how your work is situated within a socio-technical system of self-reinforcing feedback loops is exhausting. Acknowledging that system and getting back to work is harder still. There’s just no time to question the epistemologies of existing work and grants, much less find the space to actively support an activist, radical transformation.

But research on the underlying technologies and decision-making processes in education technology is a hint at the future of education policy itself. As I have shown throughout this dissertation, natural language processing is at the heart of many of these technologies. Language technologies research shapes real schools. Right now, as the very notion of public education is in flux in America, we are on the precipice of rapid and fundamental change and reform that will place technology in a front and center role. We should expect a rapid acceleration of the speed at which our research is dropped into the classroom (or, in today’s remote world, the Zoom room).

Teachers, principals, and parents will have needs. Education companies looking to scale rapidly will have goals. The students themselves will have questions! We have so much opportunity to do the right kind of work as we introduce our technology, its benefits, its drawbacks, and its requirements. And the crisis of the COVID-19 pandemic, if nothing else, brings out a potential to break norms. Maybe this crisis can be a reset point. Coming from positions of affluence and prestige in the tech industry, we are the ones with the privilege to break the cycle of technologies that do not really communicate what they do, or whose power they reinforce. As a community of researchers, there’s so much we can collectively do to transform the trajectory of the field. The barriers to participation in the field of natural language processing and education technology are created by and reinforced by us, the researchers. The definition of what it means to build a "good" algorithm is made up by us. This year our society has been undergoing an enormous rethinking of what it means to be technology-centric and enabled in the classroom, how quickly we should adopt tools, and who we should look to for help. Those shifts have all put us in an even greater position of responsibility than we had before. I believe it’s necessary that we look up from our models and look around at the social environment where our tools are being deployed. It’s necessary that we take that opportunity to reconnect with the real-world context of our work, and intentionally choose the priorities of our research. This choice includes picking the conflicts and discourses where algorithmic decision-making will step in, reshaping the path of the human decision-making it seeks to model. When the model does step in, we’ll have made our choice: whose data we’ve included, whose decision-making we’re emulating, what kind of intervention we’re letting our automated tools make, and how much we know about that intervention and the ripple effects it is likely to have. Other disciplines have grappled with the responsibility they shoulder with their technological advances; I hope we do the same.

547 Sasha Costanza-Chock. Design justice: Community-led practices to build the worlds we need. MIT Press, 2020
548 Sheila Jasanoff and Sang-Hyun Kim. Dreamscapes of modernity: Sociotechnical imaginaries and the fabrication of power. University of Chicago Press, 2015
549 Mariam Asad et al. “Academic Accomplices: Practical Strategies for Research Justice”. In: Proceedings of DIS. 2019, pp. 353–356

List of Publications

The work presented in this thesis spans a variety of interdisciplinary fields, and major components of each chapter have already been published in peer-reviewed venues; those that remain are either under review or ready for development into publishable work in the 2020-2021 academic year.

The Philosophy of Explanation is based largely on a collaboration with researchers at the University of Kentucky, and was first published in LREC 2020 as "Why Attention is Not Explanation: Surgical Intervention and Causal Reasoning about Neural Models", coauthored with Christopher Grimsley and Julia R.S. Bursten.

All sections of Wikipedia Deletion Debates draw from two publications from 2019, coauthored with my advisor, Alan Black. First, at the ACL Workshop on Computational Social Science, we published details of the classification model described in Learning to Predict Decisions, as "Stance Classification, Outcome Prediction, and Impact Assessment: NLP Tasks for Studying Group Decision-Making". The broader literature review from Context and Background, as well as the contents of Exploring and Explaining Decisions, were published at the ACM SIGCHI Conference on Computer-Supported Cooperative Work (CSCW) as "Analyzing Wikipedia Deletion Debates with a Group Decision-Making Forecast Model".

Most of my work on automated essay scoring was completed in spring and summer of 2020 and has not yet been published in peer-reviewed venues, with one exception. Evaluating Neural Models was originally published at the ACL Workshop on Innovative Uses of NLP for Building Educational Applications, as "Should You Fine-Tune BERT for Automated Essay Scoring?" The contents of the next two chapters, Training and Auditing DAACS and Explaining Essay Structure, have been prepared for submission to the journal Assessing Writing, as "Five-Paragraph Essays and Fair Automated Scoring in Online Higher Education". I expect this review process to be conducted in fall 2020. The topic modeling analysis of Explaining Essay Content is suitable for publication in a computer science venue focused on human-computer interaction, such as CSCW, and will likely be prepared for submission in the 2020-2021 academic year. Each of these publications is coauthored with Heidi Andrade, Jason Bryer, and Angela Lui, the leaders of the DAACS project, and with Alan Black.

The final section, Takeaways, draws on multiple prior publications. Defensible Explanations draws on my work with Diyi Yang, first published as "Seekers, Providers, Welcomers, and Storytellers: Modeling Social Roles in Online Health Communities" at the ACM CHI conference, coauthored with Robert Kraut, Tenbroeck Smith, and Dan Jurafsky. It also proposes future work in philosophy of science that will likely be part of Christopher Grimsley’s doctoral dissertation, and which will result in future publications on which I hope to collaborate. Confronting Inequity draws on work with multiple collaborators. First, in the 2019 BEA workshop at ACL, I published "Equity Beyond Bias in Language Technologies for Education", a nascent form of the chapter, in collaboration with Michael Madaio, Shrimai Prabhumoye, David Gerritsen, Brittany McLaughlin, and Ezekiel Dixon-Román. A subset of us then continued discussing the implications of that workshop paper, and in early 2020, Michael Madaio, Ezekiel Dixon-Román, and I partnered with new contributor Su Lin Blodgett to submit "Confronting Inherent Inequities in AI for Education" to the International Journal on AI for Education. That article is currently under revision for a second round of peer review in fall 2020, and I expect it to be published after revisions in 2021.

Bibliography

[1] Martín Abadi et al. “Tensorflow: A system for large-scale machine learning”. In: USENIX Symposium on Operating Systems Design and Implementation. 2016, pp. 265–283.
[2] Ashraf Abdul et al. “Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda”. In: Proceedings of CHI. 2018, pp. 1–18.
[3] Diana Akhmedjanova et al. “Validity and Reliability of the DAACS Writing Assessment”. In: Proceedings of NCME. 2019.
[4] Khalid Al Khatib et al. “Modeling Deliberative Argumentation Strategies on Wikipedia”. In: Proceedings of ACL. Vol. 1. 2018, pp. 2545–2555.
[5] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. “Automatic Text Scoring Using Neural Networks”. In: Proceedings of ACL. 2016, pp. 715–725.
[6] Dimitris Alikaniotis and Vipul Raheja. “The Unreasonable Effectiveness of Transformer Language Models in Grammatical Error Correction”. In: Proceedings of BEA. 2019, pp. 127–133.
[7] H Samy Alim and Geneva Smitherman. Articulate while Black: Barack Obama, language, and race in the US. Oxford University Press, 2012.
[8] Walter R Allen et al. “From Bakke to Fisher: African American Students in US Higher Education over Forty Years”. In: RSF: The Russell Sage Foundation Journal of the Social Sciences 4.6 (2018), pp. 41–72.
[9] Mike Ananny and Kate Crawford. “Seeing without knowing: Limitations of the transparency ideal and its application to algorithmic accountability”. In: New Media & Society 20.3 (2018), pp. 973–989.
[10] Anne H Anderson et al. “The HCRC map task corpus”. In: Language and speech 34.4 (1991), pp. 351–366.

[11] Heidi L Andrade, Ying Du, and Xiaolei Wang. “Putting rubrics to the test: The effect of a model, criteria generation, and rubric-referenced self-assessment on elementary school students’ writing”. In: Educational Measurement: Issues and Practice 27.2 (2008), pp. 3–13.
[12] DA Angeli, Sheryl Brahnam, and Peter Wallis. “Abuse: The darker side of human computer interaction”. In: Interact 2005. 2005, pp. 91–92.
[13] Julia Angwin et al. “Machine bias”. In: ProPublica, May 23 (2016).
[14] Arthur N Applebee and Judith A Langer. “The state of writing instruction in America’s schools: What existing data tell us”. In: Albany, NY: Center on English Learning and Achievement (2006).
[15] Pablo Aragón et al. “Generative models of online discussion threads: state of the art and research challenges”. In: Journal of Internet Services and Applications 8.1 (2017), p. 15.
[16] Sherry R Arnstein. “A ladder of citizen participation”. In: Journal of the American Institute of planners 35.4 (1969), pp. 216–224.
[17] Mariam Asad et al. “Academic Accomplices: Practical Strategies for Research Justice”. In: Proceedings of DIS. 2019, pp. 353–356.
[18] Yigal Attali. “Validity and Reliability of Automated Essay Scoring”. In: Handbook of automated essay evaluation: Current applications and new directions (2013), p. 181.
[19] Yigal Attali and Jill Burstein. “Automated Essay Scoring with e-Rater® V. 2.0”. In: ETS Research Report Series 2 (2004).
[20] Ben Backes, Harry J Holzer, and Erin Dunlop Velez. “Is it worth it? Postsecondary education and labor market outcomes for the disadvantaged”. In: IZA Journal of Labor Policy 4.1 (2015), p. 1.
[21] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate”. In: Proceedings of ICLR. 2015.
[22] Ryan Shaun Baker and Paul Salvador Inventado. “Educational data mining and learning analytics”. In: Learning analytics. Springer, 2014, pp. 61–75.

[23] Joyce E Balls-Berry and Edna Acosta-Perez. “The use of community engaged research principles to improve health: community academic partnerships for research”. In: Puerto Rico health sciences journal 36.2 (2017), p. 84.
[24] David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. “Gender identity and lexical variation in social media”. In: Journal of Sociolinguistics 18.2 (2014), pp. 135–160.
[25] David Bamman, Brendan O’Connor, and Noah A Smith. “Learning latent personas of film characters”. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2013, pp. 352–361.
[26] World Bank. World Bank Education and COVID-19. https://bit.ly/2yZwGFa. Accessed 2020-08-01.
[27] Jeffrey Bardzell and Shaowen Bardzell. “What is "critical" about critical design?” In: Proceedings of CHI. 2013, pp. 3297–3306.
[28] Shaowen Bardzell et al. “Critical design and critical theory: the challenge of designing for provocation”. In: Proceedings of DIS. 2012, pp. 288–297.
[29] Joanne Barkan. “Plutocrats at work: How big philanthropy undermines democracy”. In: social research 80.2 (2013), pp. 635–652.
[30] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. “Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors”. In: Proceedings of ACL. 2014, pp. 238–247.
[31] Joost Bastings, Wilker Aziz, and Ivan Titov. “Interpretable Neural Predictions with Differentiable Binary Variables”. In: Proceedings of ACL. 2019.
[32] Robert W Batterman. The devil in the details: Asymptotic reasoning in explanation, reduction, and emergence. Oxford University Press, 2001.
[33] Robert W Batterman and Collin C Rice. “Minimal model explanations”. In: Philosophy of Science 81.3 (2014), pp. 349–376.
[34] Eric PS Baumer and M Six Silberman. “When the implication is not to design (technology)”. In: Proceedings of CHI. 2011, pp. 2271–2274.
[35] Sian Bayne. “Teacherbot: interventions in automated teaching”. In: Teaching in Higher Education 20.4 (2015), pp. 455–467.

[36] Daniel J Beal et al. “Cohesion and performance in groups: A meta-analytic clarification of construct relations.” In: Journal of applied psychology 88.6 (2003), p. 989.
[37] Julia B Bear and Anita Williams Woolley. “The role of gender in team collaboration and performance”. In: Interdisciplinary science reviews 36.2 (2011), pp. 146–153.
[38] Nadia Behizadeh. “Realizing powerful writing pedagogy in US public schools”. In: Pedagogies: An International Journal 14.4 (2019), pp. 261–279.
[39] bell hooks. Teaching community: A pedagogy of hope. Vol. 36. Psychology Press, 2003.
[40] Ruha Benjamin. Race After Technology: Abolitionist Tools for the New Jim Code. John Wiley & Sons, 2019.
[41] Cynthia L Bennett and Os Keyes. “What is the Point of Fairness? Disability, AI and The Complexity of Justice”. In: Workshop on AI Fairness for People with Disabilities at ACM SIGACCESS Conference on Computers and Accessibility. 2019.
[42] Elettra Bietti. “From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy”. In: Proceedings of FAccT. 2020, pp. 210–219.
[43] Or Biran and Courtenay Cotton. “Explanation and justification in machine learning: A survey”. In: IJCAI-17 workshop on explainable AI (XAI). Vol. 8. 2017, p. 1.
[44] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In: Journal of Machine Learning Research 3.Jan (2003), pp. 993–1022.
[45] Su Lin Blodgett et al. “Language (technology) is power: The need to be explicit about NLP harms”. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 2020.
[46] Alisa Bokulich. “Distinguishing explanatory from non-explanatory fictions”. In: Philosophy of Science 79.5 (2012), pp. 725–737.
[47] Alisa Bokulich. “How scientific models can explain”. In: Synthese 180.1 (2011), pp. 33–45.
[48] Alisa Bokulich. “Searching for Noncausal Explanations in a Sea of Causes”. In: Explanation Beyond Causation: Philosophical Perspectives on Non-Causal Explanations (2018), p. 141.

[49] Tolga Bolukbasi et al. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings”. In: Proceedings of NeurIPS. Ed. by D. D. Lee et al. Curran Associates, Inc., 2016, pp. 4349–4357.
[50] Erik Borg. “Local plagiarisms”. In: Assessment & Evaluation in Higher Education 34.4 (2009), pp. 415–426.
[51] Pierre Bourdieu. Homo academicus. Stanford University Press, 1988.
[52] Alex J Bowers et al. “Education Leadership Data Analytics (ELDA): A White Paper Report”. In: ELDA Summit (2019).
[53] Jennie E Brand and Yu Xie. “Who benefits most from college? Evidence for negative selection in heterogeneous economic returns to higher education”. In: American sociological review 75.2 (2010), pp. 273–302.
[54] Jason Bryer and Angela M. Lui. “Efficacy of the Diagnostic Assessment and Achievement of College Students on Multiple Success Indicators”. In: Proceedings of AERA (2019).
[55] Joy Buolamwini and Timnit Gebru. “Gender shades: Intersectional accuracy disparities in commercial gender classification”. In: Proceedings of FAccT. 2018, pp. 77–91.
[56] Moira Burke and Robert Kraut. “Mopping up: modeling wikipedia promotion decisions”. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work. ACM. 2008, pp. 27–36.
[57] Jill Burstein, Martin Chodorow, and Claudia Leacock. “Automated essay evaluation: The Criterion online writing service”. In: AI Magazine 25.3 (2004), p. 27.
[58] Brian Butler, Elisabeth Joyce, and Jacqueline Pike. “Don’t look now, but we’ve created a bureaucracy: the nature and roles of policies and rules in wikipedia”. In: Proceedings of CHI. ACM. 2008, pp. 1101–1110.
[59] Jessica McCrory Calarco. ““I need help!” Social class and children’s help-seeking in elementary school”. In: American Sociological Review 76.6 (2011), pp. 862–882.
[60] Martin Carnoy and Emma Garcia. “Five Key Trends in US Student Performance: Progress by Blacks and Hispanics, the Takeoff of Asians, the Stall of Non-English Speakers, the Persistence of Socioeconomic Gaps, and the Damaging Effect of Highly Segregated Schools.” In: Economic Policy Institute (2017).

[61] Heather M Caruso and Anita Williams Woolley. “Harnessing the power of emergent interdependence to promote diverse team collaboration”. In: Diversity and Groups. Emerald Group Publishing Limited, 2008, pp. 245–266.
[62] Leyland Cecco. “Female Nobel prize winner deemed not important enough for Wikipedia entry”. In: The Guardian (2018). Accessed 2020-08-01. url: https://bit.ly/38YxvMt.
[63] Adrien Chen. “Cambridge Analytica and our lives inside the surveillance machine”. In: The New Yorker 21 (2018), pp. 8–10.
[64] Jing Chen et al. “Building e-rater® Scoring Models Using Machine Learning Methods”. In: ETS Research Report Series 2016.1 (2016), pp. 1–12.
[65] Gordon W Cheung and Rebecca S Lau. “Testing mediation and suppression effects of latent variables: Bootstrapping with structural equation models”. In: Organizational Research Methods 11.2 (2008), pp. 296–325.
[66] Francisco Chiclana et al. “A statistical comparative study of different similarity measures of consensus in group decision making”. In: Information Sciences 221 (2013), pp. 110–123.
[67] Hyojin Chin, Lebogang Wame Molefi, and Mun Yong Yi. “Empathy Is All You Need: How a Conversational Agent Should Respond to Verbal Abuse”. In: Proceedings of CHI. 2020, pp. 1–13.
[68] Martin Chodorow and Jill Burstein. “Beyond essay length: evaluating e-rater®’s performance on TOEFL® essays”. In: ETS Research Report Series 2004.1 (2004), pp. i–38.
[69] Alexandra Chouldechova and Aaron Roth. “The Frontiers of Fairness in Machine Learning”. In: Workshop on Fair Representations and Fair Interactive Learning at the Computing Community Consortium (2018).
[70] Brian D Christens and Paula Tran Inzeo. “Widening the view: situating collective impact among frameworks for community-led change”. In: Community Development 46.4 (2015), pp. 420–435.
[71] Jason Chuang, Christopher D Manning, and Jeffrey Heer. “Termite: Visualization techniques for assessing textual topic models”. In: Proceedings of the International Working Conference on Advanced Visual Interfaces. 2012, pp. 74–77.
[72] Kevin Clark et al. “What Does BERT Look At? An Analysis of BERT’s Attention”. In: Workshop on BlackboxNLP at ACL. 2019.

[73] Kevin Cokley. “What do we know about the motivation of African American students? Challenging the “anti-intellectual” myth”. In: Harvard Educational Review 73.4 (2003), pp. 524–558.
[74] Nicholas B Colvard, C Edward Watson, and Hyojin Park. “The Impact of Open Educational Resources on Various Student Success Metrics.” In: International Journal of Teaching and Learning in Higher Education 30.2 (2018), pp. 262–276.
[75] William Condon. “Large-scale assessment, locally-developed measures, and automated scoring of essays: Fishing for red herrings?” In: Assessing Writing 18.1 (2013), pp. 100–108.
[76] Sam Corbett-Davies and Sharad Goel. “The measure and mismeasure of fairness: A critical review of fair machine learning”. In: Synthesis of tutorial presented at ICML (2018).
[77] Eric Corbett and Christopher A Le Dantec. “The problem of community engagement: Disentangling the practices of municipal government”. In: Proceedings of CHI. 2018, pp. 1–13.
[78] Gonçalo M Correia, Vlad Niculae, and André FT Martins. “Adaptively Sparse Transformers”. In: Proceedings of EMNLP. 2019, pp. 2174–2184.
[79] Sasha Costanza-Chock. Design justice: Community-led practices to build the worlds we need. MIT Press, 2020.
[80] Elena Cotos. “Automated Writing Analysis for writing pedagogy: From healthy tension to tangible prospects”. In: Writing and Pedagogy 6 (2015), p. 1.
[81] Elena Cotos. Genre-based automated writing evaluation for L2 research writing: From design to evaluation and enhancement. Springer, 2014.
[82] Kelley Cotter, Janghee Cho, and Emilee Rader. “Explaining the news feed algorithm: An analysis of the “News Feed FYI” blog”. In: Proceedings of CHI Extended Abstracts. 2017.
[83] Tressie McMillan Cottom. Lower ed: The troubling rise of for-profit colleges in the new economy. The New Press, 2017.
[84] Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. “Automated essay scoring with string kernels and word embeddings”. In: Proceedings of ACL. 2018.
[85] Scott Crossley et al. “Using automated indices of cohesion to evaluate an intelligent tutoring system and an automated writing evaluation system”. In: Proceedings of AIED. Springer. 2013.
[86] Anne Curzan. “Teaching the politics of standard English”. In: Journal of English Linguistics 30.4 (2002), pp. 339–352.

[87] Zihang Dai et al. “Transformer-XL: Attentive language models beyond a fixed-length context”. In: Proceedings of ACL. 2019.
[88] Linda Darling-Hammond and Channa M Cook-Harvey. “Educating the whole child: Improving school climate to support student success”. In: Palo Alto, CA: Learning Policy Institute (2018).
[89] Henk W De Regt. “The epistemic value of understanding”. In: Philosophy of Science 76.5 (2009), pp. 585–597.
[90] Simon DeDeo. “Group minds and the case of Wikipedia”. In: Human Computation (2014).
[91] Sara Delamont, Odette Parry, and Paul Atkinson. “Critical mass and pedagogic continuity: studies in academic habitus”. In: British Journal of Sociology of Education 18.4 (1997), pp. 533–549.
[92] Robin DeRosa and Scott Robison. “From OER to open pedagogy: Harnessing the power of open”. In: Open: The philosophy and practices that are revolutionizing education and science. London: Ubiquity Press, 2017.
[93] Katie Derthick et al. “Collaborative sensemaking during admin permission granting in Wikipedia”. In: International Conference on Online Communities and Social Computing. Springer. 2011, pp. 100–109.
[94] Jacob Devlin et al. “BERT: Pre-training of deep bidirectional transformers for language understanding”. In: Proceedings of NAACL. 2019.
[95] John Dewey. Democracy and education: An introduction to the philosophy of education. Macmillan, 1923.
[96] Adji B Dieng, Francisco JR Ruiz, and David M Blei. “Topic modeling in embedding spaces”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 439–453.
[97] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. “Algorithm aversion: People erroneously avoid algorithms after seeing them err.” In: Journal of Experimental Psychology: General 144.1 (2015), p. 114.
[98] Semire Dikli. “An overview of automated scoring of essays”. In: The Journal of Technology, Learning and Assessment 5.1 (2006).
[99] Stephanie Dix. ““What did I change and why did I do it?” Young writers’ revision practices”. In: Literacy 40.1 (2006), pp. 3–10.

[100] Ezekiel Dixon-Román and Luciana Parisi. “Data Capitalism, Sociogenic Prediction and Recursive Indeterminacies”. In: Public Plurality in an Era of Data Determinacy: Data Publics. 2020.
[101] Lucas Dixon et al. “Measuring and mitigating unintended bias in text classification”. In: Proceedings of AIES. ACM. 2018, pp. 67–73.
[102] Fei Dong, Yue Zhang, and Jie Yang. “Attention-based recurrent convolutional neural network for automatic essay scoring”. In: Proceedings of CoNLL. 2017, pp. 153–162.
[103] Phil Dowe. “An empiricist defence of the causal account of explanation”. In: International Studies in the Philosophy of Science 6.2 (1992), pp. 123–128.
[104] John P Dugan, Michelle L Kusel, and Dawn M Simounet. “Transgender college students: An exploratory study of perceptions, engagement, and educational outcomes”. In: Journal of College Student Development 53.5 (2012), pp. 719–736.
[105] Anthony Dunne and Fiona Raby. Speculative everything: design, fiction, and social dreaming. MIT Press, 2013.
[106] Adam Kirk Edgerton and Laura M Desimone. “Mind the gaps: Differences in how teachers, principals, and districts experience college- and career-readiness policies”. In: American Journal of Education 125.4 (2019), pp. 593–619.
[107] Upol Ehsan et al. “Rationalization: A neural machine translation approach to generating natural language explanations”. In: Proceedings of AIES. ACM. 2018, pp. 81–87.
[108] Peter Elbow. “Closing my eyes as I speak: An argument for ignoring audience”. In: College English 49.1 (1987), pp. 50–69.
[109] Jannette Elwood. “Equity issues in performance assessment: The contribution of teacher-assessed coursework to gender-related differences in examination performance”. In: Educational Research and Evaluation 5.4 (1999), pp. 321–344.
[110] William Emigh and Susan C Herring. “Collaborative authoring on the web: A genre analysis of online encyclopedias”. In: Proceedings of HICSS. IEEE. 2005.
[111] Hans-Peter Erb, Antoine Bioy, and Denis J Hilton. “Choice preferences without inferences: Subconscious priming of risk attitudes”. In: Journal of Behavioral Decision Making 15.3 (2002), pp. 251–262.
[112] Virginia Eubanks. Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press, 2018.

[113] Rong-En Fan et al. “LIBLINEAR: A library for large linear classification”. In: Journal of Machine Learning Research 9.Aug (2008), pp. 1871–1874.
[114] Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. “Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input”. In: Proceedings of NAACL. 2018, pp. 263–271.
[115] Dana R. Ferris. “Student Reactions to Teacher Response in Multiple-Draft Composition Classrooms”. In: TESOL Quarterly 29.1 (1995), pp. 33–53.
[116] Patrick Ferrucci and Jacob L Nelson. “The new advertisers: How foundation funding impacts journalism”. In: Media and Communication 7.4 (2019), pp. 45–55.
[117] James Fiacco, Elena Cotos, and Carolyn Rosé. “Towards Enabling Feedback on Rhetorical Structure with Neural Sequence Models”. In: Proceedings of LAK. ACM. 2019, pp. 310–319.
[118] Peter W Foltz, Sara Gilliam, and Scott Kendall. “Supporting content-based feedback in on-line writing evaluation with LSA”. In: Interactive Learning Environments 8.2 (2000), pp. 111–127.
[119] Kate Forbes-Riley and Diane Litman. “Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor”. In: Speech Communication 53.9-10 (2011), pp. 1115–1136.
[120] Denae Ford et al. “Paradise unplugged: Identifying barriers for female participation on Stack Overflow”. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM. 2016, pp. 846–857.
[121] Signithia Fordham and John U Ogbu. “Black students’ school success: Coping with the “burden of ‘acting white’””. In: The Urban Review 18.3 (1986), pp. 176–206.
[122] Laura Forlano and Anijo Mathew. “From design fiction to design friction: Speculative and participatory design of values-embedded urban technology”. In: Journal of Urban Technology 21.4 (2014), pp. 7–24.
[123] Andrea Forte and Amy Bruckman. “Scaling consensus: Increasing decentralization in Wikipedia governance”. In: Proceedings of HICSS. IEEE. 2008, pp. 157–157.

[124] Andrea Forte and Amy Bruckman. “Why do people write for Wikipedia? Incentives to contribute to open-content publishing”. In: Proceedings of Group (2005), pp. 6–9.
[125] Andrea Forte, Vanesa Larco, and Amy Bruckman. “Decentralization in Wikipedia governance”. In: Journal of Management Information Systems 26.1 (2009), pp. 49–72.
[126] Becky Francis et al. “University lecturers’ perceptions of gender and undergraduate writing”. In: British Journal of Sociology of Education 24.3 (2003), pp. 357–373.
[127] Paulo Freire. Pedagogy of the oppressed. Bloomsbury Publishing USA, 1970.
[128] Michael Friedman. “Explanation and scientific understanding”. In: The Journal of Philosophy 71.1 (1974), pp. 5–19.
[129] Dennis Friess and Christiane Eilders. “A systematic review of online deliberation research”. In: Policy & Internet 7.3 (2015), pp. 319–339.
[130] Stephen Frosh, Ann Phoenix, and Rob Pattman. “The trouble with boys”. In: The Psychologist 16.2 (2003), pp. 84–87.
[131] Andreas Fuster et al. “Predictably unequal? The effects of machine learning on credit markets”. In: The Effects of Machine Learning on Credit Markets (November 6, 2018) (2018).
[132] Richard Stuart Geiger II. “Robots.txt: An Ethnographic Investigation of Automated Software Agents in User-Generated Content Platforms”. PhD thesis. University of California, Berkeley, 2015.
[133] Dan Geiger, Thomas Verma, and Judea Pearl. “Identifying independence in Bayesian networks”. In: Networks 20.5 (1990), pp. 507–534.
[134] R Stuart Geiger and Heather Ford. “Participation in Wikipedia’s article deletion processes”. In: Proceedings of WikiSym. 2011, pp. 201–202.
[135] Máirtín Mac an Ghaill. “‘What about the boys?’: schooling, class and crisis masculinity”. In: The Sociological Review 44.3 (1996), pp. 381–397.
[136] Jim Giles. Internet encyclopaedias go head to head. 2005.
[137] Tarleton Gillespie. “The relevance of algorithms”. In: Media technologies: Essays on communication, materiality, and society 167.2014 (2014), p. 167.
[138] Carol Gilligan. In a Different Voice: Psychological theory and women’s development. Cambridge, MA, 1993.

[139] Michele Gilman. Voices of the Poor Must Be Heard in the Data Privacy Debate - JURIST - Commentary - Legal News & Commentary. https://www.jurist.org/commentary/2019/05/voices-of-the-poor-must-be-heard-in-the-data-privacy-debate/. (Accessed on 05/01/2020). 2019.
[140] Alyssa Glass, Deborah L McGuinness, and Michael Wolverton. “Toward establishing trust in adaptive agents”. In: Proceedings of the International Conference on Intelligent User Interfaces. 2008.
[141] Ella Glikson and Anita Williams Woolley. “Human trust in Artificial Intelligence: Review of empirical research”. In: Academy of Management Annals (2020).
[142] Sara Goldrick-Rab. “Following their every move: An investigation of social-class differences in college pathways”. In: Sociology of Education 79.1 (2006), pp. 67–79.
[143] Matthew C Gombolay et al. “Decision-making authority, team efficiency and human worker satisfaction in mixed human–robot teams”. In: Autonomous Robots 39.3 (2015).
[144] Hila Gonen and Yoav Goldberg. “Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them”. In: Proceedings of NAACL. 2019, pp. 609–614.
[145] Bryce Goodman and Seth Flaxman. “European Union regulations on algorithmic decision-making and a “right to explanation””. In: AI Magazine 38.3 (2017), pp. 50–57.
[146] Edmund W Gordon and Kavitha Rajagopalan. “Assessment for Teaching and Learning, Not Just Accountability”. In: The Testing and Learning Revolution. Springer, 2016, pp. 9–34.
[147] Steve Graham and Karen Harris. “Writing Better: Effective Strategies for Teaching Students with Learning Difficulties.” In: Brookes Publishing Company (2005).
[148] Douglas Grimes and Mark Warschauer. “Utility in a fallible tool: A multi-site case study of automated writing evaluation.” In: Journal of Technology, Learning, and Assessment 8.6 (2010).
[149] Christopher Grimsley, Elijah Mayfield, and Julia R.S. Bursten. “Why Attention is Not Explanation: Surgical Intervention and Causal Reasoning about Neural Models”. In: Proceedings of LREC. 2020.

[150] Magnus Haake and Agneta Gulz. “Visual stereotypes and virtual pedagogical agents”. In: Journal of Educational Technology & Society 11.4 (2008).
[151] J Richard Hackman. Collaborative intelligence: Using teams to solve hard problems. Berrett-Koehler Publishers, 2011.
[152] Aaron Halfaker. “Interpolating quality dynamics in Wikipedia and demonstrating the Keilana effect”. In: Proceedings of WikiSym. ACM. 2017, p. 19.
[153] Aaron Halfaker et al. “The rise and decline of an open collaboration system: How Wikipedia’s reaction to popularity is causing its decline”. In: American Behavioral Scientist 57.5 (2013).
[154] Andrew Hall, Loren Terveen, and Aaron Halfaker. “Bot Detection in Wikidata Using Behavioral and Other Informal Cues”. In: Proceedings of CSCW (2018), p. 64.
[155] Longfei Han et al. “Rule extraction from support vector machines using ensemble learning approach: an application for diagnosis of diabetes”. In: IEEE Journal of Biomedical and Health Informatics 19.2 (2014), pp. 728–734.
[156] Xiaochuang Han, Byron C Wallace, and Yulia Tsvetkov. “Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions”. In: Proceedings of ACL. 2020.
[157] Christina Harrington, Sheena Erete, and Anne Marie Piper. “Deconstructing Community-Based Collaborative Design: Towards More Equitable Participatory Design Engagements”. In: Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019), pp. 1–25.
[158] John Hayes and Linda Flower. “Identifying the Organization of Writing Processes”. In: Cognitive Processes in Writing. Ed. by L Gregg and E Steinberg. Erlbaum, 1980.
[159] Bradi Heaberlin and Simon DeDeo. “The evolution of Wikipedia’s norm network”. In: Future Internet 8.2 (2016), p. 14.
[160] Neil T Heffernan and Cristina Lindquist Heffernan. “The ASSISTments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching”. In: International Journal of Artificial Intelligence in Education 24.4 (2014), pp. 470–497.
[161] Alex Helberg et al. “Teaching textual awareness with DocuScope: Using corpus-driven tools and reflection to support students’ written decision-making”. In: Assessing Writing 38 (2018), pp. 40–45.

[162] Tove Helldin et al. “Presenting system uncertainty in automotive UIs for supporting trust calibration in autonomous driving”. In: Proceedings of the International Conference on Automotive User Interfaces and Interactive Vehicular Applications. 2013, pp. 210–217.
[163] Carl G Hempel and Paul Oppenheim. “Studies in the Logic of Explanation”. In: Philosophy of Science 15.2 (1948), pp. 135–175.
[164] Monika Hengstler, Ellen Enkel, and Selina Duelli. “Applied artificial intelligence and trust—The case of autonomous vehicles and medical assistance devices”. In: Technological Forecasting and Social Change 105 (2016).
[165] Derrick Higgins et al. “Is getting the right answer just about choosing the right words? The role of syntactically-informed features in short answer scoring”. In: arXiv preprint arXiv:1403.0801 (2014).
[166] Charles G Hill et al. “Gender-Inclusiveness Personas vs. Stereotyping: Can We Have it Both Ways?” In: Proceedings of CHI. ACM. 2017, pp. 6658–6671.
[167] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network”. In: arXiv preprint arXiv:1503.02531 (2015).
[168] Kevin Anthony Hoff and Masooda Bashir. “Trust in automation: Integrating empirical evidence on factors that influence trust”. In: Human Factors 57.3 (2015), pp. 407–434.
[169] Guy Hoffman and Cynthia Breazeal. “Effects of anticipatory action on human-robot teamwork efficiency, fluency, and perception of team”. In: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 2007, pp. 1–8.
[170] Ronny Högberg. “Cheating as subversive and strategic resistance: vocational students’ resistance and conformity towards academic subjects in a Swedish upper secondary school”. In: Ethnography and Education 6.3 (2011), pp. 341–355.
[171] David Holbrook. “Native American ELL Students, Indian English, and the Title III Formula Grant”. In: Annual Bilingual/Multicultural Education Conference. 2011.
[172] Kenneth Holstein et al. “Improving fairness in machine learning systems: What do industry practitioners need?” In: Proceedings of CHI. 2018.
[173] Matthew Honnibal and Ines Montani. “spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing”. In: To appear (2017).

[174] Guangwei Hu and Jun Lei. “Chinese university students’ perceptions of plagiarism”. In: Ethics & Behavior 25.3 (2015), pp. 233–255.
[175] Yiqing Hua et al. “WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community”. In: Proceedings of EMNLP. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 2818–2823. url: http://aclweb.org/anthology/D18-1305.
[176] Anne H Charity Hudley and Christine Mallinson. We Do Language: English Variation in the Secondary English Classroom. Teachers College Press, 2013.
[177] Janet Shibley Hyde. “The gender similarities hypothesis.” In: American Psychologist 60.6 (2005), p. 581.
[178] Ivan Illich. Deschooling society. Penguin Group Limited, 1973.
[179] Jane Im et al. “Deliberation and Resolution on Wikipedia: A Case Study of Requests for Comments”. In: Proceedings of CSCW (2018), p. 74.
[180] Sarthak Jain and Byron C Wallace. “Attention is not Explanation”. In: Proceedings of NAACL. 2019.
[181] Sarthak Jain et al. “Learning to faithfully rationalize by construction”. In: Proceedings of ACL. 2020.
[182] Sandra Jamieson. “Is it plagiarism or patchwriting? Toward a nuanced definition”. In: Handbook of Academic Integrity (2016), pp. 503–518.
[183] Sheila Jasanoff and Sang-Hyun Kim. Dreamscapes of modernity: Sociotechnical imaginaries and the fabrication of power. University of Chicago Press, 2015.
[184] Ali Javanmardi and Lu Xiao. “What’s in the Content of Wikipedia’s Article for Deletion Discussions?” In: Proceedings of The Web Conference (WWW). 2019, pp. 1215–1223.
[185] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. “What Does BERT Learn about the Structure of Language?” In: Proceedings of ACL. 2019, pp. 3651–3657.
[186] Karen A Jehn, Gregory B Northcraft, and Margaret A Neale. “Why differences make a difference: A field study of diversity, conflict and performance in workgroups”. In: Administrative Science Quarterly 44.4 (1999), pp. 741–763.
[187] Rajiv S Jhangiani et al. “As Good or Better than Commercial Textbooks: Students’ Perceptions and Outcomes from Using Open Digital and Open Print Textbooks.” In: Canadian Journal for the Scholarship of Teaching and Learning 9.1 (2018), n1.

[188] Anna Jobin, Marcello Ienca, and Effy Vayena. “The global landscape of AI ethics guidelines”. In: Nature Machine Intelligence 1 (2019), pp. 389–399.
[189] Adam C Johnson, Joshua Wilson, and Rod D Roscoe. “College student perceptions of writing errors, text quality, and author characteristics”. In: Assessing Writing 34 (2017), pp. 72–87.
[190] Eric J Johnson et al. “Beyond nudges: Tools of a choice architecture”. In: Marketing Letters 23.2 (2012), pp. 487–504.
[191] Elisabeth Joyce, Jacqueline C Pike, and Brian S Butler. “Rules and roles vs. consensus: Self-governed deliberative mass collaboration bureaucracies”. In: American Behavioral Scientist 57.5 (2013), pp. 576–594.
[192] Dan Jurafsky and James H Martin. Speech and language processing. Vol. 3. Pearson London, 2014.
[193] Dongyeop Kang and Eduard Hovy. “xSLUE: A Benchmark and Analysis Platform for Cross-Style Language Understanding and Evaluation”. In: arXiv preprint arXiv:1911.03663 (2019).
[194] Harmanpreet Kaur et al. “Interpreting Interpretability: Understanding Data Scientists’ Use of Interpretability Tools for Machine Learning”. In: Proceedings of CHI. 2020, pp. 1–14.
[195] Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. “Learning The Difference That Makes A Difference With Counterfactually-Augmented Data”. In: Proceedings of ICLR. 2020.
[196] Brian Keegan and Casey Fiesler. “The Evolution and Consequences of Peer Producing Wikipedia’s Rules”. In: Proceedings of ICWSM (2017).
[197] Brian Keegan and Darren Gergle. “Egalitarians at the gate: One-sided gatekeeping practices in social media”. In: Proceedings of CSCW. 2010, pp. 131–134.
[198] Ursula Kessels et al. “How gender differences in academic engagement relate to students’ gender identity”. In: Educational Research 56.2 (2014), pp. 220–229.
[199] Os Keyes. “Counting the Countless: Why data science is a profound threat for queer people”. In: Real Life 2 (2019). url: https://reallifemag.com/counting-the-countless/.
[200] Os Keyes. “The misgendering machines: Trans/HCI implications of automatic gender recognition”. In: Proceedings of CSCW (2018).

[201] Os Keyes, Josephine Hoy, and Margaret Drouhard. “Human-Computer Insurrection: Notes on an Anarchist HCI”. In: Proceedings of CHI. 2019, pp. 1–13.
[202] Kareem Khalifa. “The role of explanation in understanding”. In: The British Journal for the Philosophy of Science 64.1 (2012), pp. 161–187.
[203] Philip Kitcher. “Explanatory unification”. In: Philosophy of Science 48.4 (1981), pp. 507–531.
[204] Aniket Kittur et al. “He says, she says: conflict and coordination in Wikipedia”. In: Proceedings of CHI. ACM. 2007, pp. 453–462.
[205] Naomi Klein. “Screen New Deal”. In: The Intercept (2020). Accessed 2020-08-01. url: https://bit.ly/3dZJhXw.
[206] Naomi Klein. The shock doctrine: The rise of disaster capitalism. Penguin Books, 2007.
[207] Pang Wei Koh and Percy Liang. “Understanding Black-box Predictions via Influence Functions”. In: International Conference on Machine Learning. 2017, pp. 1885–1894.
[208] Olga Kovaleva et al. “Revealing the Dark Secrets of BERT”. In: Proceedings of EMNLP. Vol. 1. 2019, pp. 2465–2475.
[209] Steve WJ Kozlowski and Daniel R Ilgen. “Enhancing the effectiveness of work groups and teams”. In: Psychological Science in the Public Interest 7.3 (2006), pp. 77–124.
[210] Adam DI Kramer, Jamie E Guillory, and Jeffrey T Hancock. “Experimental evidence of massive-scale emotional contagion through social networks”. In: Proceedings of the National Academy of Sciences 111.24 (2014), pp. 8788–8790.
[211] Thomas S Kuhn. The structure of scientific revolutions. University of Chicago Press, 1962; reprinted 2012.
[212] Carmen Lacave and Francisco J Díez. “A review of explanation methods for Bayesian networks”. In: The Knowledge Engineering Review 17.2 (2002), pp. 107–127.
[213] Gloria Ladson-Billings. “From the achievement gap to the education debt: Understanding achievement in US schools”. In: Educational Researcher 35.7 (2006), pp. 3–12.
[214] Gloria Ladson-Billings. “Toward a theory of culturally relevant pedagogy”. In: American Educational Research Journal 32.3 (1995), pp. 465–491.

[215] Himabindu Lakkaraju et al. “Interpretable & explorable approximations of black box models”. In: Proceedings of KDD Workshop on Fairness, Accountability, and Transparency in Machine Learning (2017).
[216] Shyong K Lam, Jawed Karim, and John Riedl. “The effects of group composition on decision quality in a social production community”. In: Proceedings of Group. ACM. 2010, pp. 55–64.
[217] Shyong K Lam et al. “WP:Clubhouse? An exploration of Wikipedia’s gender imbalance”. In: Proceedings of WikiSym. ACM. 2011, pp. 1–10.
[218] Marc Lange. “What makes a scientific explanation distinctively mathematical?” In: The British Journal for the Philosophy of Science 64.3 (2013), pp. 485–511.
[219] Min Kyung Lee et al. “Working with machines: The impact of algorithmic and data-driven management on human workers”. In: Proceedings of CHI. 2015, pp. 1603–1612.
[220] Krittaya Leelawong and Gautam Biswas. “Designing learning by teaching agents: The Betty’s Brain system”. In: International Journal of Artificial Intelligence in Education 18.3 (2008), pp. 181–208.
[221] David Lehr and Paul Ohm. “Playing with the Data: What Legal Scholars Should Learn About Machine Learning”. In: UC Davis Law Review 51 (2017), p. 653.
[222] Tao Lei, Regina Barzilay, and Tommi Jaakkola. “Rationalizing Neural Predictions”. In: Proceedings of EMNLP. 2016, pp. 107–117.
[223] John M Levine, Lauren B Resnick, and E Tory Higgins. “Social foundations of cognition”. In: Annual Review of Psychology 44.1 (1993), pp. 585–612.
[224] Jiwei Li et al. “Visualizing and Understanding Neural Models in NLP”. In: Proceedings of NAACL. 2016, pp. 681–691.
[225] Andrew Lih. “Wikipedia as participatory journalism: Reliable sources? Metrics for evaluating collaborative media as a news resource”. In: Proceedings of the International Symposium on Online Journalism. 2004.
[226] Brian Y Lim and Anind K Dey. “Assessing demand for intelligibility in context-aware applications”. In: Proceedings of the International Conference on Ubiquitous Computing. ACM. 2009, pp. 195–204.

[227] Zachary C Lipton. “The mythos of model interpretability”. In: ICML Workshop on Human Interpretability in Machine Learning (2016).
[228] Frederick Liu and Besim Avci. “Incorporating Priors with Feature Attribution on Text Classification”. In: Proceedings of ACL. 2019.
[229] Hui Liu, Qingyu Yin, and William Yang Wang. “Towards Explainable NLP: A Generative Explanation Framework for Text Classification”. In: Proceedings of NAACL. 2019.
[230] Shusen Liu et al. “Visual interrogation of attention-based models for natural language inference and machine comprehension”. In: Proceedings of EMNLP. 2018.
[231] Audre Lorde. “The master’s tools will never dismantle the master’s house”. In: Sister Outsider: Essays and Speeches 1 (1984), pp. 10–14.
[232] Yin Lou, Rich Caruana, and Johannes Gehrke. “Intelligible models for classification and regression”. In: Proceedings of the ACM SIGKDD. ACM. 2012, pp. 150–158.
[233] Anastassia Loukina, Nitin Madnani, and Klaus Zechner. “The many dimensions of algorithmic fairness in educational applications”. In: Proceedings of BEA. 2019, pp. 1–10.
[234] Anastassia Loukina et al. “Using exemplar responses for training and evaluating automated speech scoring systems”. In: Proceedings of BEA. 2018, pp. 1–12.
[235] Michael A Madaio et al. “Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI”. In: Proceedings of CHI. 2020, pp. 1–14.
[236] Michael Madaio et al. “Confronting Inherent Inequities in AI for Education”. In: International Journal of Artificial Intelligence in Education (Under review).
[237] Nitin Madnani et al. “Building better open-source tools to support fairness in automated scoring”. In: Proceedings of the Workshop on Ethics in Natural Language Processing at ACL. 2017, pp. 41–52.
[238] Fiona Mao, Robert Mercer, and Lu Xiao. “Extracting imperatives from Wikipedia article for deletion discussions”. In: Proceedings of the Workshop on Argumentation Mining at ACL. 2014, pp. 106–107.
[239] John Markoff. “Essay-Grading Software Offers Professors a Break”. In: The New York Times (Apr. 2013). Accessed 2020-06-30. url: https://nyti.ms/2BoUaoF.

[240] Ramon Antonio Martinez. Spanglish is spoken here: Making sense of Spanish-English code-switching and language ideologies in a sixth-grade English language arts classroom. University of California, Los Angeles, 2009.
[241] Luiza Prado de O Martins. “Privilege and oppression: Towards a feminist speculative design”. In: Proceedings of DRS (2014), pp. 980–990.
[242] Paul Kei Matsuda and Christine M Tardy. “Voice in academic writing: The rhetorical construction of author identity in blind manuscript review”. In: English for Specific Purposes 26.2 (2007), pp. 235–249.
[243] Elijah Mayfield, David Adamson, and Carolyn Penstein Rosé. “Recognizing rare social phenomena in conversation: Empowerment detection in support group chatrooms”. In: Proceedings of ACL. Vol. 1. 2013, pp. 104–113.
[244] Elijah Mayfield and Alan W Black. “Analyzing Wikipedia Deletion Debates with a Group Decision-Making Forecast Model”. In: Proceedings of the ACM on Human-Computer Interaction 3.CSCW (2019), pp. 1–26.
[245] Elijah Mayfield and Alan W Black. “Should you Fine-Tune BERT for Automated Essay Scoring?” In: Proceedings of BEA. 2020.
[246] Elijah Mayfield and Alan W Black. “Stance Classification, Outcome Prediction, and Impact Assessment: NLP Tasks for Studying Group Decision-Making”. In: Workshop on Natural Language Processing + Computational Social Science at NAACL. 2019.
[247] Elijah Mayfield et al. “Computational representation of discourse practices across populations in task-based dialogue”. In: Proceedings of the International Conference on Intercultural Collaboration. ACM. 2012, pp. 67–76.
[248] Elijah Mayfield et al. “Five-Paragraph Essays and Fair Automated Scoring in Online Higher Education”. In: Assessing Writing. Under review.
[249] Iain McCowan et al. “The AMI meeting corpus”. In: Proceedings of the Conference on Methods and Techniques in Behavioral Research. Vol. 88. 2005, p. 100.
[250] Linsey McGoey. “Philanthrocapitalism and its critics”. In: Poetics 40.2 (2012), pp. 185–199.
[251] Joseph Edward McGrath. Groups: Interaction and performance. Vol. 14. Prentice-Hall, Englewood Cliffs, NJ, 1984.

[252] Danielle S McNamara et al. “A hierarchical classification approach to automated essay scoring”. In: Assessing Writing 23 (2015), pp. 35–59.
[253] Amanda Menking and Ingrid Erickson. “The heart work of Wikipedia: Gendered, emotional labor in the world’s largest online encyclopedia”. In: Proceedings of CHI. ACM. 2015, pp. 207–210.
[254] Mostafa Mesgari et al. ““The sum of all human knowledge”: A systematic review of scholarly research on the content of Wikipedia”. In: Journal of the Association for Information Science and Technology 66.2 (2015), pp. 219–245.
[255] Robinson Meyer. “Everything We Know About Facebook’s Secret Mood Manipulation Experiment”. In: The Atlantic (2014). Accessed 2020-08-01. url: https://bit.ly/30l57kA.
[256] Eli Meyerhoff. Beyond Education: Radical Studying for Another World. University of Minnesota Press, 2019.
[257] Microsoft. Democratizing AI - Stories. https://news.microsoft.com/features/democratizing-ai/. (Accessed on 05/01/2020). 2016.
[258] Tim Miller. “Explanation in artificial intelligence: Insights from the social sciences”. In: Artificial Intelligence (2018).
[259] Tim Miller, Piers Howe, and Liz Sonenberg. “Explainable AI: Beware of inmates running the asylum or: How I learnt to stop worrying and love the social and behavioural sciences”. In: Proceedings of the IJCAI Workshop on Explainable AI. 2017.
[260] Tristan Miller. “Essay assessment with latent semantic analysis”. In: Journal of Educational Computing Research 29.4 (2003), pp. 495–512.
[261] Frances J Milliken, Caroline A Bartel, and Terri R Kurtzberg. “Diversity and creativity in work groups”. In: Group Creativity: Innovation through Collaboration (2003), pp. 32–62.
[262] Nicole Mirnig et al. “To err is robot: How humans assess and act toward an erroneous social robot”. In: Frontiers in Robotics and AI 4 (2017), p. 21.
[263] Brent Daniel Mittelstadt et al. “The ethics of algorithms: Mapping the debate”. In: Big Data & Society 3.2 (2016).
[264] M Möhlmann and L Zalmanson. “Hands on the wheel: Navigating algorithmic management and Uber drivers’”. In: Proceedings of the International Conference on Information Systems. 2017.

[265] Erwan Moreau, Carl Vogel, and Marguerite Barry. “A paradigm for democratizing artificial intelligence research”. In: Innovations in Big Data Mining and Embedded Knowledge. Springer, 2019, pp. 137–166.
[266] Jonathan T Morgan et al. “Tea and sympathy: crafting positive new user experiences on Wikipedia”. In: Proceedings of CSCW. ACM. 2013, pp. 839–848.
[267] Evgeny Morozov. To save everything, click here: The folly of technological solutionism. Public Affairs, 2013.
[268] Ernest Morrell. Critical literacy and urban youth: Pedagogies of access, dissent, and liberation. Routledge, 2015.
[269] Joe Moxley et al. “Writing analytics: Conceptualization of a multidisciplinary field”. In: Journal of Writing Analytics 1 (2017).
[270] Jin Mu et al. “The ACODEA framework: Developing segmentation and classification schemes for fully automatic analysis of online discussions”. In: International Journal of Computer-Supported Collaborative Learning 7.2 (2012), pp. 285–305.
[271] James Mullenbach et al. “Explainable prediction of medical codes from clinical text”. In: Proceedings of NAACL (2018).
[272] Michael J Muller. “Participatory design: the third space in HCI”. In: The Human-Computer Interaction Handbook. CRC Press, 2007, pp. 1087–1108.
[273] Carol M Myford and Edward W Wolfe. “Detecting and measuring rater effects using many-facet Rasch measurement: Part I”. In: Journal of Applied Measurement 4.4 (2003), pp. 386–422.
[274] Farah Nadeem et al. “Automated Essay Scoring with Discourse-Aware Neural Models”. In: Proceedings of BEA. 2019, pp. 484–493.
[275] National Science Foundation. Program on Fairness in Artificial Intelligence in Collaboration with Amazon. Accessed 2020-07-25. url: https://bit.ly/2BvaTXo.
[276] NCTE. NCTE Position Statement on Machine Scoring. Accessed 2020-06-30. 2013. url: https://bit.ly/3dQHaVY.
[277] Hwee Tou Ng et al. “The CoNLL-2014 shared task on grammatical error correction”. In: Proceedings of CoNLL. 2014, pp. 1–14.
[278] Dong Nguyen. “Comparing automatic and human evaluation of local explanations for text classification”. In: Proceedings of NAACL. 2018, pp. 1069–1078.

[279] Huy V Nguyen and Diane J Litman. “Argument mining for improving the automated scoring of persuasive essays”. In: AAAI Conference on Artificial Intelligence. 2018.
[280] Safiya Umoja Noble. Algorithms of oppression: How search engines reinforce racism. NYU Press, 2018.
[281] Matthew J Nunes. “The five-paragraph essay: Its evolution and roots in theme-writing”. In: Rhetoric Review 32.3 (2013), pp. 295–313.
[282] Cathy O’Neil. Weapons of math destruction: How big data increases inequality and threatens democracy. Broadway Books, 2016.
[283] Amy Ogan. “Reframing classroom sensing: promise and peril”. In: interactions 26.6 (2019), pp. 26–32.
[284] Amy Ogan et al. “Oh dear Stacy!: social interaction, elaboration, and learning with teachable agents”. In: Proceedings of CHI. ACM. 2012, pp. 39–48.
[285] Christina Ortmeier-Hooper and Kerry Anne Enright. Mapping new territory: Toward an understanding of adolescent L2 writers and writing in US contexts. 2011.
[286] Ellis B Page. “The imminence of... grading essays by computer”. In: The Phi Delta Kappan 47.5 (1966), pp. 238–243.
[287] Django Paris and H Samy Alim. Culturally sustaining pedagogies: Teaching and learning for justice in a changing world. Teachers College Press, 2017.
[288] Sita G Patel et al. “The achievement gap among newcomer immigrant adolescents: Life stressors hinder Latina/o academic success”. In: Journal of Latinos and Education 15.2 (2016), pp. 121–133.
[289] Umashanthi Pavalanathan, Xiaochuang Han, and Jacob Eisenstein. “Mind Your POV: Convergence of Articles and Editors Towards Wikipedia’s Neutrality Norm”. In: Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018), p. 137.
[290] Judea Pearl. Causality: models, reasoning and inference. Cambridge University Press, 2000.
[291] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[292] Bonny Norton Peirce. “Social identity, investment, and language learning”. In: TESOL Quarterly 29.1 (1995), pp. 9–31.

[293] Jeffrey Pennington, Richard Socher, and Christopher Manning. “GloVe: Global vectors for word representation”. In: Proceedings of EMNLP. 2014, pp. 1532–1543.
[294] Les Perelman. “When “the state of the art” is counting words”. In: Assessing Writing 21 (2014), pp. 104–111.
[295] Ignacio J Pérez et al. “On dynamic consensus processes in group decision making problems”. In: Information Sciences 459 (2018), pp. 20–35.
[296] Matthew E Peters, Sebastian Ruder, and Noah A Smith. “To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks”. In: Proceedings of the Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019, pp. 7–14.
[297] Peter Phandi, Kian Ming A Chai, and Hwee Tou Ng. “Flexible domain adaptation for automated essay scoring using correlated linear regression”. In: Proceedings of EMNLP. 2015, pp. 431–439.
[298] Wolter Pieters. “Explanation and trust: what to tell the user in security and AI?” In: Ethics and Information Technology 13.1 (2011), pp. 53–64.
[299] Christopher Pincock. “A role for mathematics in the physical sciences”. In: Noûs 41.2 (2007), pp. 253–275.
[300] Stephen Porter. Applying for an IES Grant. 2015. url: https://stephenporter.org/research-methods/applying-for-an-ies-grant/.
[301] Shrimai Prabhumoye et al. “Style Transfer Through Back-Translation”. In: Proceedings of the Association for Computational Linguistics. 2018, pp. 866–876.
[302] Margaret Price. “Beyond “gotcha!”: Situating plagiarism in policy and pedagogy”. In: College Composition and Communication (2002), pp. 88–115.
[303] Danish Pruthi et al. “Learning to Deceive with Attention-Based Explanations”. In: Proceedings of ACL. 2019.
[304] Ella Rabinovich et al. “Personalized Machine Translation: Preserving Original Author Traits”. In: Proceedings of the European Chapter of the Association for Computational Linguistics. 2017, pp. 1074–1084.
[305] Peter Railton. “A deductive-nomological model of probabilistic explanation”. In: Philosophy of Science 45.2 (1978), pp. 206–226.

[306] Diana Raufelder, Sandra Scherber, and Megan A Wood. “The interplay between adolescents’ perceptions of teacher-student relationships and their academic self-regulation: Does liking a specific teacher matter?” In: Psychology in the Schools 53.7 (2016), pp. 736–750.
[307] Barbara Read, Becky Francis, and Jocelyn Robson. “Gender, ‘bias’, assessment and feedback: Analyzing the written assessment of undergraduate history essays”. In: Assessment & Evaluation in Higher Education 30.3 (2005), pp. 241–260.
[308] Y Malini Reddy and Heidi Andrade. “A review of rubric use in higher education”. In: Assessment & Evaluation in Higher Education 35.4 (2010), pp. 435–448.
[309] Byron Reeves and Clifford Nass. How people treat computers, television, and new media like real people and places. 1996.
[310] Justin Reich et al. “Remote Learning Guidance From State Education Agencies During the COVID-19 Pandemic: A First Look”. In: EdArXiv (2020). https://doi.org/10.35542/osf.io/437e2.
[311] Ali Reza Rezaei and Michael Lovorn. “Reliability and validity of rubrics for assessment through writing”. In: Assessing Writing 15.1 (2010), pp. 18–39.
[312] Collin Rice. “Idealized models, holistic distortions, and universality”. In: Synthese 195.6 (2018), pp. 2795–2819.
[313] Collin Rice. “Models Don’t Decompose That Way: A Holistic View of Idealized Models”. In: The British Journal for the Philosophy of Science 70.1 (2017), pp. 179–208.
[314] Collin Rice. “Moving beyond causes: Optimality models and scientific explanation”. In: Noûs 49.3 (2015), pp. 589–615.
[315] John R Rickford and Russell John Rickford. Spoken soul: The story of black English. Wiley, New York, 2000.
[316] Eric Riedel et al. “Experimental evidence on the effectiveness of automated essay scoring in teacher education cases”. In: Journal of Educational Computing Research 35.3 (2006), pp. 267–287.
[317] Christoph Riedl and Anita Williams Woolley. “Teams vs. crowds: A field test of the relative contribution of incentives, member ability, and emergent collaboration to crowd-based problem solving performance”. In: Academy of Management Discoveries 3.4 (2017), pp. 382–403.

[318] Brian Riordan, Michael Flor, and Robert Pugh. “How to account for mispellings: Quantifying the benefit of character representations in neural content scoring models”. In: Proceedings of BEA. 2019, pp. 116–126.
[319] Steven Ritter et al. “Cognitive Tutor: Applied research in mathematics education”. In: Psychonomic Bulletin & Review 14.2 (2007), pp. 249–255.
[320] Pedro Uria Rodriguez, Amir Jafari, and Christopher M Ormerod. “Language models and Automated Essay Scoring”. In: arXiv preprint arXiv:1909.09482 (2019).
[321] Louie F Rodríguez. “Moving beyond test-prep pedagogy: Dialoguing with multicultural preservice teachers for a quality education”. In: Multicultural Perspectives 15.3 (2013), pp. 133–140.
[322] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. “A Primer in BERTology: What we know about how BERT works”. In: arXiv preprint arXiv:2002.12327 (2020).
[323] Rod D Roscoe and Danielle S McNamara. “Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom.” In: Journal of Educational Psychology 105.4 (2013), p. 1010.
[324] Carolyn Rosé et al. “Analyzing collaborative learning processes automatically: Exploiting the advances of computational linguistics in computer-supported collaborative learning”. In: International Journal of Computer-Supported Collaborative Learning 3.3 (2008), pp. 237–271.
[325] Rosina Lippi-Green. English with an accent: Language, ideology, and discrimination in the United States. Psychology Press, 1997.
[326] Philip J Ruce. “Anti-money laundering: The challenges of know your customer legislation for private bankers and the hidden benefits for relationship management (the bright side of knowing your customer)”. In: Banking Law Journal 128 (2011), p. 548.
[327] Sebastian Ruder et al. “Transfer learning in natural language processing”. In: Proceedings of NAACL: Tutorials. 2019, pp. 15–18.
[328] Wesley C Salmon. Scientific explanation and the causal structure of the world. Princeton University Press, 1984.
[329] Victor Sanh et al. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter”. In: arXiv preprint arXiv:1910.01108 (2019).

[330] Amir Sarabadani, Aaron Halfaker, and Dario Taraborelli. “Building automated vandalism detection tools for Wikidata”. In: Proceedings of the International Conference on the World Wide Web. 2017, pp. 1647–1654.
[331] Bror Saxberg. “Learning engineering: the art of applying learning science at scale”. In: Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale. ACM. 2017, pp. 1–1.
[332] Cassandra Scharber, Sara Dexter, and Eric Riedel. “Students’ Experiences with an Automated Essay Scorer.” In: Journal of Technology, Learning, and Assessment 7.1 (2008), n1.
[333] Caroline Scheiber et al. “Gender differences in achievement in a large, nationally representative sample of children and adolescents”. In: Psychology in the Schools 52.4 (2015), pp. 335–348.
[334] Jodi Schneider, Bluma S Gelley, and Aaron Halfaker. “Accept, decline, postpone: How newcomer productivity is reduced in English Wikipedia by pre-publication review”. In: Proceedings of the International Symposium on Open Collaboration. ACM. 2014, p. 26.
[335] Jodi Schneider, Alexandre Passant, and Stefan Decker. “Deletion discussions in Wikipedia: Decision factors and outcomes”. In: Proceedings of WikiSym. ACM. 2012, p. 17.
[336] Jodi Schneider et al. “Arguments about deletion: How experience improves the acceptability of arguments in ad-hoc online task groups”. In: Proceedings of CSCW. ACM. 2013, pp. 1069–1080.
[337] Jon Seger and JW Stubblefield. “Theoretical Evolutionary Ecology”. In: Bulletin of Mathematical Biology 4.58 (1996), pp. 813–814.
[338] Andrew D Selbst et al. “Fairness and abstraction in sociotechnical systems”. In: Proceedings of FAccT. ACM. 2019, pp. 59–68.
[339] Neil Selwyn. “Exploring the ‘digital disconnect’ between net-savvy students and their schools”. In: Learning, Media and Technology 31.1 (2006), pp. 5–17.
[340] Neil Selwyn. “Re-imagining ‘Learning Analytics’... a case for starting again?” In: The Internet and Higher Education (2020), p. 100745.
[341] Neil Selwyn. “What’s the Problem with Learning Analytics?” In: Journal of Learning Analytics 6.3 (2019), pp. 11–19.

[342] Neil Selwyn and Scott Bulfin. “Exploring school regulation of students’ technology use – rules that are made to be broken?” In: Educational Review 68.3 (2016), pp. 274–290.
[343] Sofia Serrano and Noah A Smith. “Is Attention Interpretable?” In: Proceedings of ACL. 2019.
[344] Mark D Shermis. “State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration”. In: Assessing Writing 20 (2014), pp. 53–76.
[345] Mark D Shermis, Cynthia Wilson Garvan, and Yanbo Diao. “The Impact of Automated Essay Scoring on Writing Outcomes”. In: Annual Meeting of the National Council on Measurement in Education (NCME). 2008.
[346] Mark D Shermis and Ben Hamner. “Contrasting state-of-the-art automated scoring of essays: Analysis”. In: Proceedings of NCME. 2012, pp. 14–16.
[347] Mark Shermis and Jill Burstein. Handbook of automated essay evaluation: Current applications and new directions. Routledge, 2013.
[348] Annika Silvervarg et al. “The effect of visual gender on abuse in conversation with ECAs”. In: International Conference on Intelligent Virtual Agents. Springer. 2012, pp. 153–160.
[349] Sharon Slade and Paul Prinsloo. “Learning analytics: Ethical issues and dilemmas”. In: American Behavioral Scientist 57.10 (2013), pp. 1510–1529.
[350] Mona Sloane. “Inequality Is the Name of the Game: Thoughts on the Emerging Field of Technology, Ethics and Social Justice”. In: Weizenbaum Conference 2019 “Challenges of Digital Inequality — Digital Education, Digital Work, Digital Life”. 2019.
[351] Leslie N Smith. “A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay”. In: arXiv preprint arXiv:1803.09820 (2018).
[352] Parinaz Sobhani, Diana Inkpen, and Stan Matwin. “From argumentation mining to stance classification”. In: Proceedings of the Workshop on Argumentation Mining at NAACL. 2015, pp. 67–77.
[353] National Center for Special Education Research. Building Evidence: What Comes After an Efficacy Study? 2016.
[354] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. Springer-Verlag, 1993.

[355] Luke Stark and Anna Lauren Hoffmann. “Data Is the New What? Popular Metaphors & Professional Ethics in Emerging Data Culture”. In: Journal of Cultural Analytics (May 2019).
[356] Garold Stasser and William Titus. “Pooling of unshared information in group decision making: Biased information sampling during discussion.” In: Journal of Personality and Social Psychology 48.6 (1985), p. 1467.
[357] Potter Stewart, concurring. “Jacobellis v. Ohio”. In: United States Supreme Court 378 (1964), p. 184.
[358] Andreas Stolcke et al. “Dialogue act modeling for automatic tagging and recognition of conversational speech”. In: Computational Linguistics 26.3 (2000), pp. 339–373.
[359] Sarah Strohkorb Sebo et al. “The ripple effects of vulnerability: The effects of a robot’s vulnerable behavior on trust in human-robot teams”. In: Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction. 2018.
[360] Jennifer Stromer-Galley and Peter Muhlberger. “Agreement and disagreement in group deliberation: Effects on deliberation satisfaction, future engagement, and decision legitimacy”. In: Political Communication 26.2 (2009), pp. 173–192.
[361] Julia Strout, Ye Zhang, and Raymond Mooney. “Do Human Rationales Improve Machine Explanations?” In: Proceedings of the BlackboxNLP Workshop at ACL. 2019, pp. 56–62.
[362] Emma Strubell, Ananya Ganesh, and Andrew McCallum. “Energy and Policy Considerations for Deep Learning in NLP”. In: Proceedings of ACL (2019).
[363] Sandeep Subramanian et al. “Multiple-attribute text style transfer”. In: Age 18.24 (), p. 65.
[364] Bongwon Suh et al. “The singularity is not near: slowing growth of Wikipedia”. In: Proceedings of WikiSym. ACM. 2009, p. 8.
[365] Emily Sullivan. “Understanding from Machine Learning Models”. In: British Journal for the Philosophy of Science (2019).
[366] April Sutton et al. “Who Gets Ahead and Who Falls Behind During the Transition to High School? Academic Performance at the Intersection of Race/Ethnicity and Gender”. In: Social Problems 65.2 (2018), pp. 154–173.
[367] Kaveh Taghipour and Hwee Tou Ng. “A neural approach to automated essay scoring”. In: Proceedings of EMNLP. 2016, pp. 1882–1891.

[368] Dario Taraborelli and Giovanni Luca Ciampaglia. “Beyond notability. Collective deliberation on content inclusion in Wikipedia”. In: IEEE International Conference on Self-Adaptive and Self-Organizing Systems Workshop. 2010, pp. 122–125.
[369] Ian Tenney, Dipanjan Das, and Ellie Pavlick. “BERT rediscovers the classical NLP pipeline”. In: Proceedings of ACL. 2019.
[370] Elise Thomas. How to hack your face to dodge the rise of facial recognition tech. WIRED UK. https://www.wired.co.uk/article/avoid-facial-recognition-software. (Accessed on 05/01/2020). 2019.
[371] Marion Tillema et al. “Relating self reports of writing behaviour and online task execution using a temporal model”. In: Metacognition and Learning 6.3 (2011), pp. 229–253.
[372] Zeynep Tufekci. “Big questions for social media big data: Representativeness, validity and other methodological pitfalls”. In: Proceedings of ICWSM. 2014.
[373] Jean Underwood and Attila Szabo. “Academic offences and e-learning: individual propensities in cheating”. In: British Journal of Educational Technology 34.4 (2003), pp. 467–477.
[374] Sowmya Vajjala. “Automated assessment of non-native learner essays: Investigating the role of linguistic features”. In: International Journal of Artificial Intelligence in Education 28.1 (2018), pp. 79–105.
[375] Bas C Van Fraassen. “The pragmatics of explanation”. In: American Philosophical Quarterly 14.2 (1977), pp. 143–150.
[376] Bas C Van Fraassen. The scientific image. Oxford University Press, 1980.
[377] Wendy P Van Ginkel and Daan van Knippenberg. “Group information elaboration and group decision making: The role of shared task representations”. In: Organizational Behavior and Human Decision Processes 105.1 (2008), pp. 82–97.
[378] Ashish Vaswani et al. “Attention is all you need”. In: Proceedings of NeurIPS. 2017, pp. 5998–6008.
[379] Michael Veale, Max Van Kleek, and Reuben Binns. “Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making”. In: Proceedings of CHI. 2018, pp. 1–14.
[380] Matthew A Vetter, Zachary J McDowell, and Mahala Stewart. “From opportunities to outcomes: the Wikipedia-based writing assignment”. In: Computers and Composition 52 (2019), pp. 53–64.

[381] Fernanda B Viegas et al. “Talk before you type: Coordination in Wikipedia”. In: Proceedings of HICSS. IEEE. 2007, p. 78.
[382] Elaine Lin Wang et al. “eRevis(ing): Students’ revision of text evidence use in an automated writing evaluation system”. In: Assessing Writing (2020), p. 100449.
[383] Wei Wang, Feixiang He, and Qijun Zhao. “Facial ethnicity classification with deep convolutional neural networks”. In: Chinese Conference on Biometric Recognition. Springer. 2016, pp. 176–185.
[384] John Warner. “The Ed Tech Garbage Hype Machine: Behind the Scenes”. In: Inside Higher Ed (2014). Accessed 2019-09-24. url: https://bit.ly/1w3NdwS.
[385] John Warner. Why They Can’t Write: Killing the Five-Paragraph Essay and Other Necessities. JHU Press, 2018.
[386] Audrey Watters et al. “The problem with ‘personalisation’”. In: Australian Educational Leader 36.4 (2014), p. 55.
[387] Melanie R. Weaver. “Do students value feedback? Student perceptions of tutors’ written responses”. In: Assessment & Evaluation in Higher Education 31.3 (2006), pp. 379–394.
[388] Candace West and Don H Zimmerman. “Doing gender”. In: Gender & Society 1.2 (1987), pp. 125–151.
[389] Eric Westervelt. Meet The Mind-Reading Robo Tutor In The Sky. NPR Morning Edition. Accessed 2020-08-01. 2015. url: https://bit.ly/318Tj4b.
[390] Edward M White, Norbert Elliot, and Irvin Peckham. Very like a whale: The assessment of writing programs. University Press of Colorado, 2015.
[391] Carl Whithaus. “Always already: Automated essay scoring and grammar checkers in college writing courses”. In: Machine Scoring of Student Essays: Truth and Consequences (2006), pp. 166–176.
[392] Bernard E Whitley, Amanda Bichlmeier Nelson, and Curtis J Jones. “Gender differences in cheating attitudes and classroom cheating behavior: A meta-analysis”. In: Sex Roles 41.9-10 (1999), pp. 657–680.
[393] Sarah Wiegreffe and Yuval Pinter. “Attention is not not explanation”. In: Proceedings of EMNLP. 2019.
[394] Norbert Wiener. The human use of human beings: Cybernetics and society. 320. Da Capo Press, 1988.

[395] David Wiley and John Levi Hilton III. “Defining OER-enabled pedagogy”. In: International Review of Research in Open and Distributed Learning 19.4 (2018).
[396] Andrew Wilkins. “Push and pull in the classroom: competition, gender and the neoliberal subject”. In: Gender and Education 24.7 (2012), pp. 765–781.
[397] Ben Williamson. Code Acts in Education: Re-Engineering Education. 2020. url: https://nepc.colorado.edu/blog/re-engineering-education.
[398] Ben Williamson. “Decoding ClassDojo: psycho-policy, social-emotional learning and persuasive educational technologies”. In: Learning, Media and Technology 42.4 (2017), pp. 440–453.
[399] David M Williamson, Xiaoming Xi, and F Jay Breyer. “A framework for evaluation and use of automated scoring”. In: Educational Measurement: Issues and Practice 31.1 (2012), pp. 2–13.
[400] Joshua Wilson and Amanda Czik. “Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality”. In: Computers & Education 100 (2016), pp. 94–109.
[401] Joshua Wilson and Rod D Roscoe. “Automated Writing Evaluation and Feedback: Multiple Metrics of Efficacy”. In: Journal of Educational Computing Research (2019), p. 0735633119830764.
[402] Kenneth G Wilson. “Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture”. In: Physical Review B 4.9 (1971), p. 3174.
[403] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. “Recognizing contextual polarity in phrase-level sentiment analysis”. In: Proceedings of EMNLP. 2005.
[404] Terry Winograd, Fernando Flores, and Fernando F Flores. Understanding computers and cognition: A new foundation for design. Intellect Books, 1986.
[405] Bronwyn Woods et al. “Formative essay feedback using predictive scoring models”. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2017, pp. 2071–2080.
[406] James Woodward. “Capacities and invariance”. In: Philosophical Problems of the Internal and External Worlds: Essays on the Philosophy of Adolf Grunbaum (1994), p. 283.
[407] James Woodward. “Explanation, invariance, and intervention”. In: Philosophy of Science 64 (1997), S26–S41.

[408] James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005.
[409] James Woodward. “Scientific Explanation”. In: The Stanford Encyclopedia of Philosophy. Ed. by Edward N. Zalta. Fall 2017. Metaphysics Research Lab, Stanford University, 2017.
[410] Bo Xiao and Izak Benbasat. “E-commerce product recommendation agents: use, characteristics, and impact”. In: MIS Quarterly 31.1 (2007).
[411] Lu Xiao and Nicole Askin. “What influences online deliberation? A Wikipedia study”. In: Journal of the Association for Information Science and Technology 65.5 (2014), pp. 898–910.
[412] Lu Xiao and Jeffrey Nickerson. “Imperatives in Past Online Discussions: Another Helpful Source for Community Newcomers?” In: Proceedings of HICSS. 2019.
[413] Lu Xiao and Niraj Sitaula. “Sentiments in Wikipedia Articles for Deletion Discussions”. In: International Conference on Information. Springer. 2018, pp. 81–86.
[414] Kelvin Xu et al. “Show, attend and tell: Neural image caption generation with visual attention”. In: Proceedings of the International Conference on Machine Learning. 2015, pp. 2048–2057.
[415] Wei Xu, Chris Callison-Burch, and Courtney Napoles. “Problems in current text simplification research: New data can help”. In: Transactions of the Association for Computational Linguistics 3 (2015), pp. 283–297.
[416] Diyi Yang. “Computational Social Roles”. PhD thesis. Carnegie Mellon University, 2019.
[417] Diyi Yang et al. “Identifying semantic edit intentions from revisions in Wikipedia”. In: Proceedings of EMNLP. 2017, pp. 2000–2010.
[418] Diyi Yang et al. “Let’s Make Your Request More Persuasive: Modeling Persuasive Strategies via Semi-Supervised Neural Nets on Crowdfunding Platforms”. In: Proceedings of NAACL. 2019, pp. 3620–3630.
[419] Diyi Yang et al. “Seekers, Providers, Welcomers, and Storytellers: Modeling Social Roles in Online Health Communities”. In: Proceedings of CHI. ACM. 2019, p. 344.
[420] Diyi Yang et al. “Who Did What: Editor Role Identification in Wikipedia”. In: Proceedings of ICWSM. 2016, pp. 446–455.

[421] Qian Yang, Nikola Banovic, and John Zimmerman. “Mapping machine learning advances from HCI research to reveal starting places for design innovation”. In: Proceedings of CHI. 2018, pp. 1–11.
[422] Zhilin Yang et al. “HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering”. In: Proceedings of EMNLP. 2018, pp. 2369–2380.
[423] Zhilin Yang et al. “XLNet: Generalized autoregressive pretraining for language understanding”. In: Proceedings of NeurIPS. 2019, pp. 5753–5763.
[424] Helen Yannakoudakis and Ted Briscoe. “Modeling coherence in ESOL learner texts”. In: Proceedings of BEA. Association for Computational Linguistics. 2012, pp. 33–43.
[425] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. “Understanding the effect of accuracy on trust in machine learning models”. In: Proceedings of CHI. 2019, pp. 1–12.
[426] Nicholas Yoder. “Teaching the Whole Child: Instructional Practices That Support Social-Emotional Learning in Three Teacher Evaluation Frameworks. Research-to-Practice Brief”. In: Center on Great Teachers and Leaders (2014).
[427] Iris Marion Young. Justice and the Politics of Difference. Princeton University Press, 1990.
[428] John Zerilli et al. “Transparency in Algorithmic and Human Decision-Making: Is There a Double Standard?” In: Philosophy & Technology (2018), pp. 1–23.
[429] Torsten Zesch and Oren Melamud. “Automatic generation of challenging distractors using context-sensitive inference rules”. In: Proceedings of BEA. 2014, pp. 143–148.
[430] Justine Zhang et al. “Characterizing online public discussions through patterns of participant interactions”. In: Proceedings of the ACM on Human-Computer Interaction 2.CSCW (2018), pp. 1–27.
[431] Ruiqi Zhong, Steven Shao, and Kathleen McKeown. “Fine-grained sentiment analysis with faithful attention”. In: arXiv preprint arXiv:1908.06870 (2019).