Defensible Explanations for Algorithmic Decisions About Writing in Education


Defensible Explanations for Algorithmic Decisions about Writing in Education

Elijah Mayfield
CMU-LTI-20-012

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Ave., Pittsburgh, PA 15213
www.lti.cs.cmu.edu

Thesis Committee:
Alan W Black (Chair), Carnegie Mellon University
Yulia Tsvetkov, Carnegie Mellon University
Alexandra Chouldechova, Carnegie Mellon University
Anita Williams Woolley, Carnegie Mellon University
Ezekiel Dixon-Román, University of Pennsylvania

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Language and Information Technologies.

Copyright © 2020 Elijah Mayfield

Elijah Mayfield, Defensible Explanations for Algorithmic Decisions about Writing in Education. August 24, 2020. Carnegie Mellon University.

Published by Carnegie Mellon University, www.treeforts.org. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0. Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "as is" basis, without warranties or conditions of any kind, either express or implied. See the License for the specific language governing permissions and limitations under the License. First printing, August 2020.

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Thesis Committee:
Alan W Black, Language Technologies Institute (Chair)
Yulia Tsvetkov, Language Technologies Institute
Alexandra Chouldechova, Heinz College of Public Policy
Anita Williams Woolley, Tepper School of Business
Ezekiel Dixon-Román, School of Social Policy & Practice (University of Pennsylvania)

Abstract

This dissertation is a call for collaboration at the interdisciplinary intersection of natural language processing, explainable machine learning, philosophy of science, and education technology. If we want algorithmic decision-making to be explainable, those decisions must be defensible by practitioners in a social context, rather than transparent about their technical and mathematical details. Moreover, I argue that a narrow view of explanation, specifically one focused on causal reasoning about deep neural networks, is unsuccessful even on its own terms. To that end, the rest of the thesis aims to build alternate, non-causal tools for explaining the behavior of classification models.

My technical contributions study human judgments in two distinct domains. First, I study group decision-making, releasing a large-scale corpus of structured data from Wikipedia's deletion debates. I show how decisions can be predicted and debate outcomes explained based on social and discursive norms. Next, in automated essay scoring, I study a dataset of student writing, collected through an ongoing cross-institutional tool for academic advising and diagnostic for college readiness. Here, I explore the characteristics of essays that receive disparate scores, focusing on several topics including genre norms, fairness audits across race and gender, and investigative topic modeling. In both cases, I show how to evaluate and choose the most straightforward tools that effectively make predictions, advocating for classical approaches over deep neural methods when appropriate.
In my conclusion, I advance a new framework for building defensible explanations for trained models. Recognizing that explanations are constructed based on a scientific discourse, and that automated systems must be trustworthy for both developers and users, I develop success criteria for earning that trust. I conclude by connecting to critical theory, arguing that truly defensible algorithmic decision-making must not only be explainable, but must be held accountable for the power structures it enables and extends.

Contents

Part I: Goals 17
  Introduction and Overview 17
  The Philosophy of Explanation 29
Part II: Wikipedia Deletion Debates 49
  Context and Background 51
  Learning to Predict Decisions 65
  Exploring and Explaining Decisions 73
  Future Directions 85
Part III: Automated Essay Scoring 89
  Context and Background 91
  Evaluating Neural Methods 101
  Training and Auditing DAACS 113
  Explaining Essay Structure 133
  Explaining Essay Content 145
  Future Directions 165
Part IV: Takeaways 171
  Defensible Explanations 173
  Confronting Inequity 185
  List of Publications 201
Bibliography 203

List of Figures

1 Amid the 2020 coronavirus shutdown, New York's state government declared a plan to redesign their curriculum around online learning in collaboration with the Bill & Melinda Gates Foundation. 19
2 Contemporary news articles covered the deletion controversy around the recent Nobel laureate Donna Strickland. From The Guardian. 21
3 The New York Times' coverage of the edX EASE announcement drove much of the press attention to automated essay scoring in 2013. 23
4 Homepage of the DAACS support tool for first-year college students. 25
5 Network diagrams of causal systems. The system on the right resists surgical intervention between D and H. 35
6 Researchers often use attention weights (top attention layer) to generate explanations. Jain & Wallace (middle) scramble weights and show that output remains stable; a similar result is obtained by Serrano & Smith (bottom) omitting highly-weighted nodes entirely. 39
7 Top: Header of the No original research policy, which can be linked using aliases (OR, NOR, and ORIGINAL). Bottom: one specific subsection of that policy, which can be linked directly (WP:OI). 53
8 Excerpt from a single AfD discussion, with a nominating statement, five votes, and four comments displayed. Votes labeled in bold are explicit preferences (or stances), which are masked in our tasks. 54
9 Distributions by year for votes (left) and outcomes (right) over Wikipedia's history. 62
10 Counts of discussions per year (blue) and of votes, comments, and citations per discussion in each year. 63
11 Log-log plot of user rank and contributions. The top 36,440 users, all with at least five contributions, are displayed. Collectively, these 22.6% of all users account for 94.3% of all contributions. 64
12 Probability of a Delete outcome as voting margin varies. Administrators almost never overrule Delete majorities with a margin of at least 2 votes, or Keep majorities with a margin of at least 4 votes. 69
13 Real-time BERT model accuracy mid-discussion, split by final debate length: short (5 or fewer), medium (6-10), and long (over 10). 70
14 Success rates (left) and forecast shifts (right) for votes that were the Nth contribution to a discussion, for different values of N. I measure these values first for any vote with that label at that ordinal location in the debate, then for discussions where the first vote for a particular label appeared at rank N. 75
15 Large forecast shifts arise from initial votes for Keep followed by response votes for Delete. Here, a user successfully cites the Notability (geographic features) policy to keep an article. 77
16 Highly successful votes that also shift the forecast model often come from the narrow use of established policies for notability in specific subtopics. 78
17 One-time voters are more successful than more active voters; however, the first contribution from more active voters has greater forecast shift than the votes from one-time contributors. 79
18 Example of highly successful editor behavior with minimal forecast shift. For each of the later votes, the probability of a Delete outcome is already well over 99%. 80
19 Citations in low-success-rate votes that cause little change in forecasts come late in discussions, often citing detailed technical policies rather than focusing on persuasion or notability. 81
20 Summary of success rates and forecast shifts for various policies. Scatter plot shows all policy pages with at least 25 citations in either Keep or Delete votes. Dotted lines mark baseline success rates. 82
21 An August 2019 Vice report brought renewed attention to automated essay scoring, this time in the context of implementations for Common Core standardized testing. 90
22 An example of rubric traits designed for use in automated essay scoring, from my previous work on Turnitin Revision Assistant. 95
23 Illustration of cyclical (top), two-period cyclical (middle, log y-scale), and 1-cycle (bottom) learning rate curricula over N epochs. 105
24 QWK (top) and training time (bottom, in seconds) for 5-fold cross-validation of 1-cycle neural fine-tuning on ASAP datasets 2-6, for BERT (left) and DistilBERT (right). 111
25 Screenshot from DAACS including the writing prompt students responded to for this dataset. 114
26 Shift in population mean scores when using AES, compared to hand-scoring. 124
27 Accuracy of automated scoring by trait, broken out by race and gender. 127
28 Comparison of human inter-rater reliability, in QWK, from 2017 to 2020 datasets, with changes made to rubric and process design. 131
29 Mean score of essays in each category of five-paragraph form, marked with *** when there is a statistically significant relationship between form and score. 139
30 Accuracy of automated scoring by trait, broken out by 5PE form. 140
31 Breakdown of five-paragraph essay frequency by race and gender intersection. Dashed lines indicate whole-population frequency. 140
32 Reliability of automated essay scoring before and after 5PE encoding. Grey shaded area indicates human inter-rater reliability. 142
33 Sidebar menu for the DAACS self-regulated learning survey, which organizes results into a hierarchy. 149
34 Distribution of topic assignments to paragraphs from the LDA model. 151
35 Subgroup differences for document structure topics. 154
36 Subgroup differences for body paragraph topics. 155
37 Subgroup differences for non-adherent paragraph topics. 156
38 Data from Table 33, including exact matches only. 159
39 Relationship between topics as the number of topics increases from 4 to 20, following the hierarchical method. Values between cells indicate correlation coefficients between topics.