
Themis: Automatically Testing Software for Discrimination


Rico Angell, Brittany Johnson, Yuriy Brun, and Alexandra Meliou
University of Massachusetts Amherst, Amherst, Massachusetts, USA
{rangell, bjohnson, brun, ameli}@cs.umass.edu

ABSTRACT

Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY

CCS CONCEPTS

• Software and its engineering → Software testing and debugging;

KEYWORDS

Software fairness, discrimination testing, fairness testing, software bias, testing, Themis, automated test generation

ACM Reference Format:
Rico Angell, Brittany Johnson, Yuriy Brun, and Alexandra Meliou. 2018. Themis: Automatically Testing Software for Discrimination. In Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '18), November 4–9, 2018, Lake Buena Vista, FL, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3236024.3264590

1 INTRODUCTION

Software plays an important role in making decisions that shape our society. Software decides what products we are led to buy [36]; who gets financial loans [43]; what a self-driving car does, which may lead to property damage or human injury [24]; how medical patients are diagnosed and treated [48]; and who gets bail and which criminal sentence [4]. Unfortunately, there are countless examples of bias in software. Translation engines inject societal biases, e.g., "She is a doctor" translated into Turkish and back into English becomes "He is a doctor" [19]. YouTube is more accurate when automatically generating closed captions for videos with male than female voices [50]. Facial recognition systems often underperform on female and black faces [32]. In 2016, Amazon software decided not to offer same-day delivery to predominantly minority neighborhoods [35]. And the software US courts use to assess the risk of a criminal repeating a crime exhibits racial bias [4].

Bias in software can come from learning from biased data, implementation bugs, design decisions, unexpected component interactions, or societal phenomena. Thus, software discrimination is a challenging problem and addressing it is integral to the entire software development cycle, from requirements elicitation, to architectural design, to testing, verification, and validation [14].

Even defining what it means for software to discriminate is not straightforward. Many definitions of algorithmic discrimination have emerged, including the correlation or mutual information between inputs and outputs [52], discrepancies in the fractions of inputs that produce a given output [17, 26, 56, 58] (known as group discrimination [23]), or discrepancies in output probability distributions [34]. These definitions do not capture causality and can miss some forms of discrimination.

To address this, our recent work developed a new measure called causal discrimination and described a technique for automated fairness test generation [23]. This tool demonstration paper implements that technique for the group and causal definitions of discrimination in a tool called Themis v2.0 (building on an early prototype [23]). This paper focuses on the tool's architecture, test suite generation workflow, and efficiency optimizations (Section 2), and its user interface (Section 3). Section 4 places Themis in the context of related research and Section 5 summarizes our contributions.

2 THEMIS: AUTOMATED FAIRNESS TEST GENERATION

Figure 1 describes the Themis architecture and fairness test-suite generation workflow. Themis consists of four major components: input generator, cache, error-bound confidence calculator, and discrimination score calculator. Themis uses the input schema of the system under test to generate test suites for group or causal discrimination. Themis generates values for non-sensitive attributes uniformly at random, and then iterates over the values for sensitive attributes. This process samples equivalence classes of test inputs. Themis later uses the system under test's behavior on these sampled subsets of the equivalence classes to compute the discrimination score. Using a cache to ensure no test is executed multiple times, Themis executes the system under test on the generated tests. Themis iterates this test generation process, generating more tests within the equivalence classes, until the confidence in the error bound satisfies the user-specified threshold, and then outputs the discrimination score. The final confidence bound is within the specified threshold, though Themis can also produce its exact measure.
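
As a concrete illustration of this sampling step, the following minimal Python sketch fixes randomly chosen values for the non-sensitive attributes and iterates over every combination of sensitive-attribute values, yielding one equivalence class per call. The schema, attribute names, and function here are illustrative assumptions, not Themis' actual API.

import itertools
import random

# Hypothetical categorical input schema (illustrative; not Themis' schema format).
SCHEMA = {
    "race": ["green", "purple"],
    "age_bracket": ["<=40", ">40"],
    "income_bracket": ["0-50k", "50-100k", ">100k"],
    "employed": ["yes", "no"],
}

def sample_equivalence_class(schema, sensitive):
    # Fix the non-sensitive attributes uniformly at random, then iterate over
    # every combination of sensitive-attribute values.
    fixed = {a: random.choice(vals) for a, vals in schema.items() if a not in sensitive}
    equivalence_class = []
    for combo in itertools.product(*(schema[a] for a in sensitive)):
        test_input = dict(fixed)
        test_input.update(zip(sensitive, combo))
        equivalence_class.append(test_input)
    return equivalence_class  # inputs identical except for their sensitive attributes

# Example: one equivalence class varying race and age_bracket (four test inputs).
print(sample_equivalence_class(SCHEMA, ["race", "age_bracket"]))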


[Figure 1: The Themis architecture and fairness test-suite generation workflow. Inputs: input schema, type of discrimination to measure, acceptable error bound, desired confidence bound. The Themis engine's input generator (non-sensitive attribute assignment and a sensitive attribute iterator over the input space) produces subsets of equivalence classes of test inputs for the system under test; a cache, a discrimination score calculator, and an error-bound confidence calculator either trigger generation of more test inputs to reach the confidence or emit the outputs: the discrimination score and the confidence in the error bound.]

Themis focuses on two measures of discrimination, group and causal. To help explain the two measures, consider a simple loan program that decides if loan applicants should be given loans. loan inputs are each applicant's name, age bracket (≤40 or >40), race (green, purple), income bracket, savings, employment status, and requested loan amount; the output is "approve" or "deny".

Group discrimination is the maximum difference in the fractions of software outputs for each sensitive input group. For example, loan's group discrimination with respect to race compares the fractions of green and purple applicants who get loans. If 35% of green and 20% of purple applicants get loans, then loan's group discrimination with respect to race is 35% − 20% = 15%. With more than two races, the measure would be the difference between the largest and smallest fractions. Group discrimination with respect to multiple input attributes compares the cross-product of the attributes, e.g., for race and age bracket, there are four groups: [purple, ≤40], [purple, >40], [green, ≤40], and [green, >40].

Software testing enables a unique opportunity to conduct hypothesis testing to determine statistical causation [45] between inputs and outputs. It is possible to execute loan on two individuals identical in every way except race to verify if the race causes an output change. Causal discrimination is the frequency with which equivalence classes of inputs (recall Figure 1) contain at least two inputs on which the software under test produces different outputs. For causal discrimination, each equivalence class contains inputs with identical non-sensitive attribute values but varied sensitive attribute values. For example, loan's causal discrimination with respect to age and race is the fraction of equivalence classes that contain a pair of individuals with identical name, income, savings, employment status, and requested loan amount, but different race or age, for which loan approves a loan for one but not the other.
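
To ground the two definitions, here is a self-contained sketch that computes both scores exhaustively over the tiny schema from the earlier sketch; the loan decision logic is invented for this illustration and is not Themis code. The invented loan favors young green applicants and older purple applicants, so its group discrimination with respect to race is 0 while its causal discrimination is about 0.67, a preview of why causal discrimination can catch bias that group discrimination misses.

import itertools

SCHEMA = {
    "race": ["green", "purple"],
    "age_bracket": ["<=40", ">40"],
    "income_bracket": ["0-50k", "50-100k", ">100k"],
    "employed": ["yes", "no"],
}

def loan(applicant):
    # Invented decision logic: high income is always approved; otherwise green
    # applicants are approved only if young, purple applicants only if older.
    if applicant["income_bracket"] == ">100k":
        return "approve"
    if applicant["race"] == "green":
        return "approve" if applicant["age_bracket"] == "<=40" else "deny"
    return "approve" if applicant["age_bracket"] == ">40" else "deny"

def all_inputs(schema):
    keys = list(schema)
    for combo in itertools.product(*(schema[k] for k in keys)):
        yield dict(zip(keys, combo))

def group_discrimination(schema, program, sensitive):
    # Maximum difference in approval fractions across the cross-product of sensitive values.
    rates = {}
    for group in itertools.product(*(schema[a] for a in sensitive)):
        members = [i for i in all_inputs(schema)
                   if tuple(i[a] for a in sensitive) == group]
        rates[group] = sum(program(i) == "approve" for i in members) / len(members)
    return max(rates.values()) - min(rates.values())

def causal_discrimination(schema, program, sensitive):
    # Fraction of equivalence classes (inputs identical except for the sensitive
    # attributes) containing at least two inputs with different outputs.
    non_sensitive = [a for a in schema if a not in sensitive]
    discriminating = total = 0
    for fixed in itertools.product(*(schema[a] for a in non_sensitive)):
        outputs = set()
        for varied in itertools.product(*(schema[a] for a in sensitive)):
            applicant = dict(zip(non_sensitive, fixed))
            applicant.update(dict(zip(sensitive, varied)))
            outputs.add(program(applicant))
        total += 1
        discriminating += len(outputs) > 1
    return discriminating / total

print(group_discrimination(SCHEMA, loan, ["race"]))   # 0.0: both race groups approve at equal rates
print(causal_discrimination(SCHEMA, loan, ["race"]))  # ~0.67: race flips the decision in 8 of 12 classes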

Exhaustively testing software can measure its group and causal discrimination, but it is infeasible in practice. The rest of this section describes three optimizations (previously proved sound [23]) Themis uses to reduce test suite size. Applying these optimizations to real-world software (see Section 3) reduced test suite sizes by, on average, 2,849 times for group discrimination and 148 times for causal discrimination [23]. The more software discriminates, the greater the reduction in test suite size.

Sound pruning. The number of possible executions grows exponentially with the number of input attributes being tested for discrimination. However, the group and causal discrimination definitions are monotonic: if software discriminates over threshold θ with respect to a set of attributes X, then the software also discriminates over θ with respect to all supersets of X (see Theorems 4.1 and 4.2 and their proofs in [23]). This allows Themis to prune its test input space. Once Themis discovers that software discriminates against X, it can prune testing all supersets of X. Further, causal discrimination always exceeds group discrimination with respect to the same set of attributes (see Theorem 4.3 and its proof in [23]), so Themis can prune its test input space when measuring both kinds of discrimination: if software group discriminates with respect to a set of attributes, it must causally discriminate with respect to that set at least as much. These observations and their formal proofs allow Themis to employ a provably sound pruning strategy (Algorithm 3 in [23]).

Adaptive sampling. Themis approximates group and causal discrimination scores through sampling done via iterative test generation (recall Figure 1). Sampling in Themis is adaptive, using the ongoing score computation to determine if a specified bound of error with a desired confidence level has been reached. Themis generates inputs uniformly at random using an input schema, and maintains the proportion of samples evidencing discrimination, computing the bound of error for that proportion.

Test caching. Themis may generate repetitive tests: tests relevant to group discrimination are also relevant to causal discrimination, and tests relevant to one set of attributes can also be relevant to another set. This redundancy in fairness testing allows Themis to exploit caching to reuse test results without re-executing tests.
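
A sketch of the pruning idea only (a simplification, not Algorithm 3 from [23]): enumerate attribute subsets smallest-first and skip any superset of a subset already found to discriminate above the threshold.

from itertools import combinations

def prune_test_plan(attributes, discrimination_score, threshold):
    # Enumerate attribute subsets smallest-first; by monotonicity, any superset of
    # a subset that already discriminates above the threshold can be skipped.
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            if any(set(f) <= set(subset) for f in found):
                continue  # pruned: a smaller discriminating subset covers this one
            if discrimination_score(subset) > threshold:
                found.append(subset)
    return found  # minimal discriminating attribute sets

With the earlier sketch, discrimination_score could be, for example, lambda s: group_discrimination(SCHEMA, loan, list(s)).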

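A rough sketch of the adaptive stopping rule, under the simplifying assumption that a normal-approximation margin of error for the sampled proportion is acceptable (Themis' exact statistics may differ); the dictionary cache stands in for Themis' test cache.

import itertools
import math
import random

Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}  # normal-approximation z-values

def estimate_causal_discrimination(schema, program, sensitive,
                                    error_bound=0.05, confidence=0.95,
                                    min_samples=100, max_samples=100_000):
    # Sample equivalence classes until the margin of error is within the bound.
    cache = {}  # test caching: never execute the same input twice

    def run(applicant):
        key = tuple(sorted(applicant.items()))
        if key not in cache:
            cache[key] = program(applicant)
        return cache[key]

    z = Z[confidence]
    discriminating = samples = 0
    p = margin = 0.0
    while samples < max_samples:
        # One equivalence class: random non-sensitive values, all sensitive combinations.
        fixed = {a: random.choice(v) for a, v in schema.items() if a not in sensitive}
        outputs = {run(dict(fixed, **dict(zip(sensitive, combo))))
                   for combo in itertools.product(*(schema[a] for a in sensitive))}
        samples += 1
        discriminating += len(outputs) > 1

        p = discriminating / samples
        margin = z * math.sqrt(p * (1 - p) / samples)  # margin of error for the proportion
        if samples >= min_samples and margin <= error_bound:
            break
    return p, margin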

3 USING THEMIS TO DISCOVER AND DEBUG DISCRIMINATION

Themis is a standalone application written in Python. Themis is open-source: http://fairness.cs.umass.edu/. This paper describes Themis version 2.0. This section uses a simple example loan implementation (Figure 2a) to demonstrate Themis.

[Figure 2: Themis tests a loan implementation (a) using an input schema (b) and finds group (c) and causal (d) discrimination.
(a) Sample loan implementation:
    if (race is green or orange)
        if (income > $50,000) approve
        else deny
    else
        if ($50,000 < income ≤ $100,000) deny
        else approve
(b) loan input schema. (c) Themis view of group discrimination results. (d) Themis view of causal discrimination results.]

Themis automatically generates tests to detect and measure group and causal discrimination. Themis' inputs are a path to the executable for the software to be tested, an input schema, and what to measure. To test loan for discrimination, the user specifies the loan input schema either via Themis' GUI (Figure 2b) or via a configuration file. The input schema includes every input attribute to the software to be tested and the range of values it can take on, e.g., sex → {male, female}. Currently, Themis handles categorical inputs, such as race, income brackets, etc. To specify what to measure, the user selects group or causal discrimination (or both), a set of input attributes with respect to which to test, an acceptable error bound, the desired confidence in that bound, and a maximum acceptable discrimination threshold. Themis allows saving (and loading) these settings for later use.

Themis generates and executes a test suite to perform the specified measurements. If it finds discrimination, the user can choose to explore that discrimination further. Themis displays the details of observed group and causal discrimination differently. Figure 2c shows the group discrimination details Themis found for loan when asked to measure discrimination with respect to sex, race, and income. loan group discriminates with respect to income 50.4% of the time (above the 20.0% threshold): loan approved 49.6% of loan applicants who make between $50,000 and $100,000, and 100% of applicants who make more than $100,000. As this example illustrates, not all discrimination may be undesirable. Approving loans based on income may very well be a desirable behavior. It is up to the Themis user to decide which input attributes need to be tested for discrimination.

Figure 2d shows the causal discrimination details Themis found for loan. In addition to discrimination similar to the group measurements for income, Themis also found 66.9% causal discrimination with respect to race. Themis lists the tests that exhibit the causal discrimination. For example, loan approves a loan for a male, orange candidate with income between $50,000 and $100,000, but denies a loan for a blue candidate who is otherwise identical. These causal pairs allow investigating the source of discrimination by tracing through the executions on two (or more) relevant inputs that exhibit the different behavior, and to potentially debug the problem.
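
To illustrate how a causal pair can seed debugging, the following sketch re-implements the Figure 2a logic in Python, using a concrete numeric income in place of the categorical income bracket (our illustration, not Themis output), and replays the pair reported above.

def loan(applicant):
    # Re-implementation of the sample loan logic from Figure 2a (illustrative only).
    if applicant["race"] in ("green", "orange"):
        return "approve" if applicant["income"] > 50_000 else "deny"
    return "deny" if 50_000 < applicant["income"] <= 100_000 else "approve"

# The causal pair reported in the text: identical except for race.
a = {"sex": "male", "race": "orange", "income": 75_000}
b = dict(a, race="blue")

assert loan(a) != loan(b)  # only race differs, yet the outputs differ
diff = [k for k in a if a[k] != b[k]]
print(f"loan approves {a['race']} but not {b['race']}; differing attribute(s): {diff}")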

As Themis implements an existing automated fairness test suite generation technique [23], we now briefly summarize the earlier evaluation of that technique. Evaluated on a benchmark of eight open-source software systems and two financial datasets (available at http://fairness.cs.umass.edu/), automated fairness test suite generation was effective at discovering both group and causal discrimination. Causal discrimination sometimes captured bias that group discrimination missed. For example, one of the discrimination-aware decision-tree implementations [26] trained not to discriminate with respect to gender exhibited causal gender discrimination of 11.3%. Learning-based systems can find ways to discriminate against certain individuals in one way, and certain other individuals in another way, such that those two discrimination instances cancel out with respect to group discrimination. Causal discrimination, however, properly captures such bias. Sometimes, avoiding discrimination against one attribute may increase discrimination against another. For example, the evaluation showed that training not to discriminate with respect to gender can lead to a significant increase in discrimination against race. Forcing additional constraints on machine learning may have unexpected consequences, making Themis' discrimination testing even more important. Themis' optimizations (recall Section 2) reduced test suite sizes by, on average, 2,849 times for group discrimination and 148 times for causal discrimination [23].

4 RELATED WORK

Discrimination shows up in many software applications, e.g., advertisements [49], hotel bookings [36], and image search [29]. Meanwhile, software is entering domains in which discrimination could result in serious negative consequences, including criminal justice [4], finance [43], and hiring [46]. Software discrimination may occur unintentionally, e.g., as a result of implementation bugs, as an unintended property of self-organizing systems [5, 10, 12, 13], as an emergent property of component interaction [6, 11, 16, 33], as conflicting logic from multiple developers' changes [7–9, 15], or as an automatically learned property from biased data [17, 18, 25–28, 56–58].

Themis focuses on two measures of discrimination, group and causal. Group discrimination is a generalization of the Calders-Verwer (CV) score [18], used frequently in prior work on algorithmic fairness, particularly in the context of fair machine learning [17, 26, 56, 58]. Many other definitions exist. The group discrimination definition can be generalized to rich subgroups [30, 31]. Another definition considers software fair if a "better" input is never deprived of the "better" output [20]; that definition requires a domain expert to create a distance function for comparing inputs. Causal discrimination [23] (which Themis measures) goes beyond prior work by measuring causality [45]. Fairness in machine learning research is similarly moving toward causal measures of discrimination, e.g., counterfactual fairness [34]. FairML [1] uses orthogonal projection to co-perturb attributes, which can mask some discrimination, but finds discrimination that is more likely to be observed in real-world scenarios.

FairTest [52] uses manually written tests to measure four kinds of discrimination scores: the CV score and a related ratio, mutual information, Pearson correlation, and a regression between the output and sensitive inputs. By contrast, Themis generates tests automatically and also measures causal discrimination.

Reducing discrimination in machine learning classifiers [2, 17, 26, 30, 31, 51, 56, 58] and selection algorithms [47] is important work that is complementary to ours. We focus on measuring discrimination via software testing, not on developing methods for removing it. Themis can be used to manually debug discrimination bugs and thus remove discrimination. Work on formal verification of non-discrimination [3] is similarly complementary to our testing approach.

Causal relationships in data management systems [22, 37, 38] can help explain query results [40] and debug errors [53–55] by tracking and using data provenance [39]. For software systems that use data management, such provenance-based reasoning may aid testing for causal relationships between input attributes and outputs. Our prior work on testing software that relies on data management systems has focused on data errors [41, 42], whereas this work focuses on testing fairness.

Unlike other automated test generation tools, e.g., Randoop [44] and EvoSuite [21], Themis processes test results to compute discrimination scores. Prior tools are not designed for measuring discrimination, instead satisfying testing goals such as maximizing coverage. This leads to diverse test suites [21, 44]. By contrast, e.g., to measure causal discrimination, Themis has to generate pairs of similar, not diverse, inputs unlikely to be produced by other tools.

5 CONTRIBUTIONS

We have demonstrated Themis, an open-source implementation of an automated test generation technique [23] for testing software for causal and group discrimination. Themis employs three optimizations, which significantly reduce test suite size. Such test suite generation is effective at finding discrimination in open-source software. Overall, Themis is the first automated test generator for discrimination testing and serves both as a useful tool for practitioners and a baseline for future research.

ACKNOWLEDGMENT

This work is supported by the National Science Foundation under grants no. CCF-1453474, IIS-1453543, CNS-1744471, and CCF-1763423.

REFERENCES

[1] Julius Adebayo and Lalana Kagal. Iterative orthogonal feature projection for diagnosing bias in black-box models. CoRR, abs/1611.04967, 2016.
[2] Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna Wallach. A reductions approach to fair classification. In International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.
[3] Aws Albarghouthi, Loris D'Antoni, Samuel Drews, and Aditya Nori. FairSquare: Probabilistic verification for program fairness. In ACM International Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), 2017.
[4] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. Machine bias. ProPublica, May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
[5] Yuriy Brun, Ron Desmarais, Kurt Geihs, Marin Litoiu, Antonia Lopes, Mary Shaw, and Mike Smit. A design space for adaptive systems. In Rogério de Lemos, Holger Giese, Hausi A. Müller, and Mary Shaw, editors, Software Engineering for Self-Adaptive Systems II, volume 7475, pages 33–50. Springer-Verlag, 2013.
[6] Yuriy Brun, George Edwards, Jae young Bang, and Nenad Medvidovic. Smart redundancy for distributed computation. In International Conference on Distributed Computing Systems (ICDCS), pages 665–676, Minneapolis, MN, USA, June 2011.
[7] Yuriy Brun, Reid Holmes, Michael D. Ernst, and David Notkin. Crystal: Precise and unobtrusive conflict warnings. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE) Tool Demonstrations track, pages 444–447, Szeged, Hungary, September 2011.
[8] Yuriy Brun, Reid Holmes, Michael D. Ernst, and David Notkin. Proactive detection of collaboration conflicts. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 168–178, Szeged, Hungary, September 2011.
[9] Yuriy Brun, Reid Holmes, Michael D. Ernst, and David Notkin. Early detection of collaboration conflicts and risks. IEEE Transactions on Software Engineering (TSE), 39(10):1358–1375, October 2013.
[10] Yuriy Brun and Nenad Medvidovic. An architectural style for solving computationally intensive problems on large networks. In Software Engineering for Adaptive and Self-Managing Systems (SEAMS), Minneapolis, MN, USA, May 2007.


[11] Yuriy Brun and Nenad Medvidovic. Fault and adversary tolerance as an emergent property of distributed systems' software architectures. In International Workshop on Engineering Fault Tolerant Systems (EFTS), pages 38–43, Dubrovnik, Croatia, September 2007.
[12] Yuriy Brun and Nenad Medvidovic. Keeping data private while computing in the cloud. In International Conference on Cloud Computing (CLOUD), pages 285–294, Honolulu, HI, USA, June 2012.
[13] Yuriy Brun and Nenad Medvidovic. Entrusting private computation and data to untrusted networks. IEEE Transactions on Dependable and Secure Computing (TDSC), 10(4):225–238, July/August 2013.
[14] Yuriy Brun and Alexandra Meliou. Software fairness. In Proceedings of the New Ideas and Emerging Results Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Lake Buena Vista, FL, USA, November 2018.
[15] Yuriy Brun, Kıvanç Muşlu, Reid Holmes, Michael D. Ernst, and David Notkin. Predicting development trajectories to prevent collaboration conflicts. In the Future of Collaborative Software Development (FCSD), Seattle, WA, USA, February 2012.
[16] Yuriy Brun, Jae young Bang, George Edwards, and Nenad Medvidovic. Self-adapting reliability in distributed software systems. IEEE Transactions on Software Engineering (TSE), 41(8):764–780, August 2015.
[17] Toon Calders, Faisal Kamiran, and Mykola Pechenizkiy. Building classifiers with independency constraints. In IEEE International Conference on Data Mining (ICDM) Workshops, pages 13–18, December 2009.
[18] Toon Calders and Sicco Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2):277–292, 2010.
[19] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
[20] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference (ITCS), pages 214–226, August 2012.
[21] Gordon Fraser and Andrea Arcuri. Whole test suite generation. IEEE Transactions on Software Engineering (TSE), 39(2):276–291, February 2013.
[22] Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, and Alexandra Meliou. A characterization of the complexity of resilience and responsibility for self-join-free conjunctive queries. Proceedings of the VLDB Endowment (PVLDB), 9(3):180–191, 2015.
[23] Sainyam Galhotra, Yuriy Brun, and Alexandra Meliou. Fairness testing: Testing software for discrimination. In Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE), pages 498–510, September 2017.
[24] Noah J. Goodall. Can you program ethics into a self-driving car? IEEE Spectrum, 53(6):28–58, June 2016.
[25] Faisal Kamiran and Toon Calders. Classifying without discriminating. In International Conference on Computer, Control, and Communication (IC4), pages 1–6, February 2009.
[26] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy. Discrimination aware decision tree learning. In International Conference on Data Mining (ICDM), pages 869–874, December 2010.
[27] Faisal Kamiran, Asim Karim, and Xiangliang Zhang. Decision theory for discrimination-aware classification. In International Conference on Data Mining (ICDM), pages 924–929, December 2012.
[28] Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, and Jun Sakuma. Fairness-aware classifier with prejudice remover regularizer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 35–50, September 2012.
[29] Matthew Kay, Cynthia Matuszek, and Sean A. Munson. Unequal representation and gender stereotypes in image search results for occupations. In Conference on Human Factors in Computing Systems (CHI), pages 3819–3828, 2015.
[30] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. An empirical study of rich subgroup fairness for machine learning. CoRR, abs/1808.08166, 2018.
[31] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018.
[32] Brendan F. Klare, Mark J. Burge, Joshua C. Klontz, Richard W. Vorder Bruegge, and Anil K. Jain. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security (TIFS), 7(6):1789–1801, December 2012.
[33] Ivo Krka, Yuriy Brun, George Edwards, and Nenad Medvidovic. Synthesizing partial component-level behavior models from system specifications. In European Software Engineering Conference and ACM SIGSOFT International Symposium on Foundations of Software Engineering (ESEC/FSE), pages 305–314, Amsterdam, The Netherlands, August 2009.
[34] Matt J. Kusner, Joshua R. Loftus, Chris Russell, and Ricardo Silva. Counterfactual fairness. In Annual Conference on Neural Information Processing Systems (NIPS), December 2017.
[35] Rafi Letzter. Amazon just showed us that 'unbiased' algorithms can be inadvertently racist. TECH Insider, April 21, 2016. http://www.techinsider.io/how-algorithms-can-be-racist-2016-4.
[36] Dana Mattioli. On Orbitz, Mac users steered to pricier hotels. The Wall Street Journal, August 23, 2012. http://www.wsj.com/articles/SB10001424052702304458604577488822667325882.
[37] Alexandra Meliou, Wolfgang Gatterbauer, Joseph Y. Halpern, Christoph Koch, Katherine F. Moore, and Dan Suciu. Causality in databases. IEEE Data Engineering Bulletin, 33(3):59–67, 2010.
[38] Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, and Dan Suciu. The complexity of causality and responsibility for query answers and non-answers. Proceedings of the VLDB Endowment (PVLDB), 4(1):34–45, 2010.
[39] Alexandra Meliou, Wolfgang Gatterbauer, and Dan Suciu. Bringing provenance to its full potential using causal reasoning. In 3rd USENIX Workshop on the Theory and Practice of Provenance (TaPP), June 2011.
[40] Alexandra Meliou, Sudeepa Roy, and Dan Suciu. Causality and explanations in databases. Proceedings of the VLDB Endowment (PVLDB) tutorial, 7(13):1715–1716, 2014.
[41] Kıvanç Muşlu, Yuriy Brun, and Alexandra Meliou. Data debugging with continuous testing. In Joint Meeting of the European Software Engineering Conference and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE) NIER track, pages 631–634, August 2013.
[42] Kıvanç Muşlu, Yuriy Brun, and Alexandra Meliou. Preventing data errors with continuous testing. In ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), pages 373–384, July 2015.
[43] Parmy Olson. The algorithm that beats your bank manager. CNN Money, March 15, 2011. http://www.forbes.com/sites/parmyolson/2011/03/15/the-algorithm-that-beats-your-bank-manager/#cd84e4f77ca8.
[44] Carlos Pacheco, Shuvendu K. Lahiri, Michael D. Ernst, and Thomas Ball. Feedback-directed random test generation. In ACM/IEEE International Conference on Software Engineering (ICSE), pages 75–84, May 2007.
[45] Judea Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
[46] Aarti Shahani. Now algorithms are deciding whom to hire, based on voice. NPR All Things Considered, March 2015.
[47] Julia Stoyanovich, Ke Yang, and HV Jagadish. Online set selection with fairness and diversity constraints. In International Conference on Extending Database Technology (EDBT), pages 241–252, Vienna, Austria, 2018.
[48] Eliza Strickland. Doc bot preps for the O.R. IEEE Spectrum, 53(6):32–60, June 2016.
[49] Latanya Sweeney. Discrimination in online ad delivery. Communications of the ACM (CACM), 56(5):44–54, May 2013.
[50] Rachael Tatman. Gender and dialect bias in YouTube's automatic captions. In Workshop on Ethics in Natural Language Processing, 2017.
[51] Philip S. Thomas, Bruno Castro da Silva, Andrew G. Barto, and Emma Brunskill. On ensuring that intelligent machines are well-behaved. CoRR, abs/1708.05448, 2017.
[52] Florian Tramer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, Jean-Pierre Hubaux, Mathias Humbert, Ari Juels, and Huang Lin. FairTest: Discovering unwarranted associations in data-driven applications. In IEEE European Symposium on Security and Privacy (EuroS&P), April 2017.
[53] Xiaolan Wang, Xin Luna Dong, and Alexandra Meliou. Data X-Ray: A diagnostic tool for data errors. In International Conference on Management of Data (SIGMOD), 2015.
[54] Xiaolan Wang, Alexandra Meliou, and Eugene Wu. QFix: Demonstrating error diagnosis in query histories. In International Conference on Management of Data (SIGMOD), pages 2177–2180, 2016. (demonstration paper).
[55] Xiaolan Wang, Alexandra Meliou, and Eugene Wu. QFix: Diagnosing errors through query histories. In International Conference on Management of Data (SIGMOD), 2017.
[56] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Learning fair classifiers. CoRR, abs/1507.05259, 2015.
[57] Richard Zemel, Yu (Ledell) Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning (ICML), published in JMLR W&CP 28(3):325–333, June 2013.
[58] Indre Žliobaite, Faisal Kamiran, and Toon Calders. Handling conditional discrimination. In International Conference on Data Mining (ICDM), pages 992–1001, December 2011.
