Datasheets for Datasets

Timnit Gebru¹  Jamie Morgenstern²  Briana Vecchione³  Jennifer Wortman Vaughan¹  Hanna Wallach¹  Hal Daumé III¹ ⁴  Kate Crawford¹ ⁵

¹Microsoft Research, New York, NY  ²Georgia Institute of Technology, Atlanta, GA  ³Cornell University, Ithaca, NY  ⁴University of Maryland, College Park, MD  ⁵AI Now Institute, New York, NY. Correspondence to: Timnit Gebru.

Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

The machine learning community has no standardized way to document how and why a dataset was created, what information it contains, what tasks it should and should not be used for, and whether it might raise any ethical or legal concerns. To address this gap, we propose the concept of datasheets for datasets. In the electronics industry, it is standard to accompany every component with a datasheet providing standard operating characteristics, test results, recommended usage, and other information. Similarly, we recommend that every dataset be accompanied with a datasheet documenting its creation, composition, intended uses, maintenance, and other properties. Datasheets for datasets will facilitate better communication between dataset creators and users, and encourage the machine learning community to prioritize transparency and accountability.

1. Introduction

Machine learning is no longer a purely academic discipline. Domains such as criminal justice (Garvie et al., 2016; Systems, 2017; Andrews et al., 2006), hiring and employment (Mann & O'Neil, 2016), critical infrastructure (O'Connor, 2017; Chui, 2017), and finance (Lin, 2012) all increasingly depend on machine learning methods.

By definition, machine learning models are trained using data; the choice of data fundamentally influences a model's behavior. However, there is no standardized way to document how and why a dataset was created, what information it contains, what tasks it should and shouldn't be used for, and whether it might raise any ethical or legal concerns. This lack of documentation is especially problematic when datasets are used to train models for high-stakes applications.

We therefore propose the concept of datasheets for datasets. In the electronics industry, every component is accompanied by a datasheet describing standard operating characteristics, test results, and recommended usage. By analogy, we recommend that every dataset be accompanied with a datasheet documenting its motivation, creation, composition, intended uses, distribution, maintenance, and other information. We anticipate that such datasheets will increase transparency and accountability in the machine learning community.

Section 2 provides context for our proposal. Section 3 discusses the evolution of safety standards in other industries, and outlines the concept of datasheets in electronics. We give examples of questions that should be answered in datasheets for datasets in Section 4, and discuss challenges and future work in Section 5. The appendix includes a more complete proposal along with prototype datasheets for two well-known datasets: Labeled Faces in the Wild (Huang et al., 2007) and Pang and Lee's polarity dataset (2004).

2. Context

A foundational challenge in the use of machine learning is the risk of deploying systems in unsuitable environments. A model's behavior on some benchmark may say very little about its performance in the wild. Of particular concern are recent examples showing that machine learning systems can amplify existing societal biases. For example, Buolamwini & Gebru (2018) showed that commercial gender classification APIs have near perfect performance for lighter-skinned males, while error rates for darker-skinned females can be as high as 33%.¹ Bolukbasi et al. (2016) showed that word embeddings trained on news articles exhibit gender biases, finishing the analogy "man is to computer programmer as woman is to X" with "homemaker," a stereotypical role for women.

¹ The evaluated APIs also provided the labels of female and male, failing to address the complexities of gender beyond binary.
Caliskan et al. (2017) showed these embeddings also contain racial biases: traditional European-American names are closer to positive words like "joy," while African-American names are closer to words like "agony."

These biases can have dire consequences that might not be easily discovered. Much like a faulty resistor or capacitor in a circuit, the effects of a biased machine learning component, such as a dataset, can propagate throughout a system, making them difficult to track down. For example, biases in word embeddings can result in hiring discrimination (Bolukbasi et al., 2016). For these and other reasons, the World Economic Forum lists tracking the provenance, development, and use of training datasets as a best practice that all companies should follow in order to prevent discriminatory outcomes (World Economic Forum Global Future Council on Human Rights 2016–2018, 2018). But while provenance has been extensively studied in the database literature (Cheney et al., 2009; Bhardwaj et al., 2014), it has received relatively little attention in machine learning.

The risk of unintentional misuse of datasets can increase when developers are not domain experts. This concern is particularly important with the movement toward "democratizing AI" and toolboxes that provide publicly available datasets and off-the-shelf models to be trained by those with little-to-no domain knowledge or machine learning expertise. As these powerful tools become available to a broader set of developers, it is increasingly important to enable these developers to understand the implications of their work.

We believe this problem can be partially mitigated by accompanying datasets with datasheets that describe their creation, their strengths, and their limitations. While this is not the same as making everyone an expert, it gives an opportunity for domain experts to communicate what they know about a dataset and its limitations to the developers who might use it. The use of datasheets will be more effective if coupled with educational efforts around interpreting and applying machine learning models. Such efforts are happening both within traditional "ivory tower" institutions (e.g., the new ethics in computing course at Harvard) and in new educational organizations. For instance, one of Fast.ai's missions is "to get deep learning into the hands of as many people as possible, from as many diverse backgrounds as possible" (Fast.ai, 2017); their educational program includes explicit training in dataset biases and ethics. Combining better education and datasheets will more quickly enable progress by both domain experts and machine learning experts.

3. Safety Standards in Other Industries

To put our proposal into context, we discuss the evolution of safety standards for automobiles, drugs, and electronics. Lessons learned from the historical dangers of new technologies, and the safety measures put in place to combat them, can help define a path forward for machine learning.

3.1. The Automobile Industry

Similar to current hopes that machine learning will positively transform society, the introduction of automobiles promised to expand mobility and provide additional recreational, social, and economic opportunities. However, much like current machine learning technology, automobiles were introduced with few safety checks or regulations. When cars first became available in the US, there were no speed limits, stop signs, traffic lights, driver education, or regulations pertaining to seat belts or drunk driving (Canis, 2017). This resulted in many deaths and injuries due to collisions, speeding, and reckless driving (Hingson et al., 1988). Reminiscent of current debates about machine learning, courtrooms and newspaper editorials argued the possibility that the automobile was inherently evil (Lewis v. Amorous, 1907).

The US and the rest of the world have gradually enacted driver education, drivers licenses (Department of Transportation Federal Highway Administration, 1997), and safety systems like four-wheel hydraulic brakes, shatter-resistant windshields, all-steel bodies (McShane, 2018), padded dashboards, and seat belts (Peltzman, 1975). Motorists' slow adoption of seat belts spurred safety campaigns promoting their adoption. By analogy, machine learning will likely require laws and regulations (especially in high-stakes environments), as well as social campaigns to promote best practices.
The automobile industry routinely uses crash-test dummies to develop and test safety systems. This practice led to problems similar to the "biased dataset" problems currently faced by the machine learning community: almost all crash-test dummies were designed with prototypical male physiology; only in 2011 did US safety standards require frontal crash tests with "female" crash-test dummies (National Highway Traffic Safety Administration, 2006), following evidence that women sustained more serious injuries than men in similar accidents (Bose et al., 2011).

3.2. Clinical Trials in Medicine

Like data collection and experimentation for machine learning, clinical trials play an important role in drug development. When the US justice system stopped viewing clinical trials as a form of medical malpractice (Dowling, 1975), standards for clinical trials were put in place, often spurred by gross mistreatment committed in the name of science. For example, the US government ran experiments on citizens without their consent, including a study of patients with syphilis who were not told they were sick (Curran, 1973) and radiation experiments (Faden et al., 1996; Moreno, 2013). The poor, the imprisoned, minority groups, pregnant women, and children comprised a majority of these study groups.

The US now requires drug trials to inform participants that drugs are experimental and not proven to be effective (and participants must consent). Prior to the start of a clinical trial, an Institutional Review Board and the Food and Drug Administration (FDA) review evidence of the drug's relative safety (including the drug's chemical composition and results of animal testing) and the trial design (including participant demographics) (Food and Drug Administration, 2018).
Machine learning’s closest legal analog to these from the cheapest and most ubiquitous resistor to highly safeguards is the EU’s General Data Protection Regulation complex integrated circuits like CPUs, are accompanied (GDPR), which aims to ensure that users’ personal data are by datasheets that are prominently displayed on the compo- not used without their explicit consent. Standards for data nents’ product webpages. A component’s datasheet contains collection, storage, and sharing are now a central topic of a description of the component’s function, features, and concern for scientific research in general, and clinical trials other specifications (such as absolute maximum and mini- are no exception (National Institutes of Health, 2018). mum operating voltages); physical details of the component (such as size and pin connections) and lists of available Finally, the lack of diversity in clinical trial participants packages; liability disclaimers to protect the manufacturer has led to the development of drugs with more danger and (e.g., in case the component is used in high-stakes environ- less efficacy for many groups. In the late 1980s, the FDA ments like nuclear power plants or life-support systems); mandated that clinical trial participants should be composed and compliance with relevant (e.g., IEC) standards. of populations from different age groups (Food and Drug Administration, 1989). Regulation stating that safety and For example, the datasheet for a miniature aluminum elec- efficacy data be broken down by sex, race, and age was only trolytic capacitor (Passive Components, 2005) contains passed in 1998. As late as 2013, a majority of federally a description of the component’s function; funded clinical trials still did not break down their results by • notable features like the component’s compliance with sex (Nolan & Nguyen, 2013). In 2014, the FDA promoted • an action plan to make results of clinical trials broken down the Restriction of Hazardous Substances Directive (Eu- by subpopulation more easily available (Food and Drug ropean Parliament and the Council of the EU, 2003); standard operating characteristics, including operating Administration, 1985). These regulations and policies fol- • lowed evidence of high risk-to-reward trade-offs for drugs temperature range and capacitance tolerance; a diagram of its dimensions and pin connections; treating these populations. For example, eight out of ten • plots of time vs. temperature and frequency. drugs recalled between 1997 and 2001 had more adverse • effects for women (Liu & Dipietro Mager, 2016). These progressions parallel recent examples showing that various 4. Datasheets for Datasets machine learning systems exhibit accuracy disparities be- tween subpopulations, and calls for more diverse datasets, In the appendix, we propose a set of questions that a inclusive testing, and standards to address these disparities. datasheet for a dataset should arguably contain. We also include prototype datasheets for two well-known datasets 3.3. Electrical and Electronic Technologies that illustrate how these questions might be answered in practice, one for Labeled Faces in the Wild (Huang et al., Like datasets, electronic components are incorporated into 2007) and one for Pang and Lee’s polarity dataset (2004). a system whose larger goal may be far removed from the We chose these two datasets in part because their authors tasks of specific components. 
Thus, small deviations that may seem insignificant while studying a component in isolation can have serious consequences for the system as a whole. For instance, while all types of capacitors can be abstracted into an idealized mathematical model, different non-idealities are significant depending on the context. As an example, having a low equivalent series resistance is important in certain power supply and radio frequency applications, while this parameter is lower priority in most other designs (Smith et al., 1999). Thus, the electronics industry has developed standards specifying ideal operating characteristics, tests, and manufacturing conditions for components manufactured with different tasks in mind.

Many of these standards are specified by the International Electrotechnical Commission (IEC). According to the IEC, "Close to 20,000 experts from industry, commerce, government, test and research labs, academia and consumer groups participate in IEC Standardization work" (International Electrotechnical Commission, 2017). In addition to international standards, all electronic components, ranging from the cheapest and most ubiquitous resistor to highly complex integrated circuits like CPUs, are accompanied by datasheets that are prominently displayed on the components' product webpages. A component's datasheet contains a description of the component's function, features, and other specifications (such as absolute maximum and minimum operating voltages); physical details of the component (such as size and pin connections) and lists of available packages; liability disclaimers to protect the manufacturer (e.g., in case the component is used in high-stakes environments like nuclear power plants or life-support systems); and compliance with relevant (e.g., IEC) standards.

For example, the datasheet for a miniature aluminum electrolytic capacitor (Passive Components, 2005) contains:

• a description of the component's function;
• notable features like the component's compliance with the Restriction of Hazardous Substances Directive (European Parliament and the Council of the EU, 2003);
• standard operating characteristics, including operating temperature range and capacitance tolerance;
• a diagram of its dimensions and pin connections;
• plots of time vs. temperature and frequency.

4. Datasheets for Datasets

In the appendix, we propose a set of questions that a datasheet for a dataset should arguably contain. We also include prototype datasheets for two well-known datasets that illustrate how these questions might be answered in practice, one for Labeled Faces in the Wild (Huang et al., 2007) and one for Pang and Lee's polarity dataset (2004). We chose these two datasets in part because their authors provided exemplary documentation, allowing us to easily find the answers to many of our proposed questions.

Our development of these questions was driven by several fundamental objectives. First, a practitioner should be able to decide, from reading a datasheet, how appropriate the corresponding dataset is for a task, what its strengths and limitations are, and how it fits into the broader ecosystem. Second, the creators of a dataset should be able to use our proposed questions to prompt thought about aspects of dataset creation that may not have otherwise occurred to them.

The questions are divided into seven categories: motivation for dataset creation; dataset composition; data collection process; data preprocessing; dataset distribution; dataset maintenance; and legal and ethical considerations. Not all questions will be applicable to all datasets, in which case they can be omitted. The questions are not intended to be definitive. Instead, we hope that they will initiate a broader conversation about how data provenance, ethics, privacy, and documentation might be handled by the machine learning community. Below are a few examples of the questions:

• Why was the dataset created? (e.g., was there a specific intended task or gap that needed to be filled?)
• Who funded the creation of the dataset?
• What preprocessing/cleaning was done? (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances)
• If it relates to people, were they told what the dataset would be used for and did they consent? If so, how? Were they provided with any mechanism to revoke their consent in the future or for certain uses?
• Will the dataset be updated? How often, by whom?
5. Challenges and Future Work

Our proposal faces a number of implementation challenges: how to converge on standard formats and content for datasheets, how to identify incentives that will encourage datasheet creation and overcome inertia, and how to communicate with outside experts to properly address the complex ethical considerations that relate to data about people. We describe these challenges below, and urge the machine learning community to make progress on them in future work.

Researchers, practitioners, and other enthusiasts that create and use datasets must come to a consensus about what information should be included in a datasheet, and how best to gather and share that information. Just as different categories of electronic components have different characteristics, the most relevant information about a dataset will likely be context-specific. A dataset of photographs of human faces will have different relevant information than datasets about health or weather. Some domains, including geoscience, medicine, and information science, have existing standards and methods for gathering metadata (National Electrical Manufacturers Association, 2018; Gunter & Terry, 2005; Brazma et al., 2001; National Library of Medicine, 2018; SDMX, 2018; Manduca et al., 2006; Di et al., 2013; Padilla, 2016; Rauber et al., 2016). Because this line of work is new to machine learning, we should not expect consensus to happen easily. Experts in a particular domain might first agree on a small number of critical domain-specific attributes for their own datasets. For example, recent work focuses on some of these attributes in natural language processing (Anonymous, 2018).
There will also be questions that are relevant to all datasets (e.g., Why was the dataset created? Who funded the creation of the dataset?) about which the field should eventually come to some agreement. We are heartened to see other groups working on projects similar to datasheets for datasets (Holland et al., 2018).

When designing datasheets, it will be necessary to communicate with experts in other areas. For example, researchers in fields like anthropology are well-versed in the collection of demographic information about people. There are additional complex social, historical, and geographical contextual questions regarding how best to address ethical issues such as biases and privacy. Questions should be framed to encourage practitioners to follow ethical procedures while gathering data, without discouraging them from providing relevant information about the process (or creating a datasheet).

Although this paper proposes the concept of datasheets for datasets, datasheets are also needed for pretrained models and their APIs. What questions should be asked about the behavior of models and APIs, and how should the answers be measured and communicated given that models are often developed using multiple datasets, expert knowledge, and other sources of input? Organizations that produce pretrained models and APIs should iterate with developers and customers to arrive at appropriate questions and guidelines for "datasheets for models" that would parallel our proposal.

There will necessarily be overhead in creating datasheets, some of which we are working to mitigate by designing an interactive datasheet creation tool. Although carefully crafted datasheets might, in the long run, reduce the amount of time that dataset creators need to spend answering one-off questions about their datasets, organizations can face hurdles when creating datasheets. For instance, details in a datasheet may result in exposure to legal or PR risks or the inadvertent release of proprietary information. Developers may delay releasing any datasheet—even a useful but incomplete one—in order to "perfect" it. Although the overhead in creating datasheets might be more costly for small organizations, they have an opportunity to differentiate themselves as more transparent than larger, more established, slower-to-change organizations. Publication venues can incentivize academics to release datasheets along with their datasets, while negative media attention for datasets without datasheets might drive companies to adopt the concept. Ultimately, we believe that the work involved in creating a dataset far exceeds the work involved in creating a datasheet. Moreover, a datasheet can dramatically improve the utility of a dataset for others and even mitigate potential harms.

Finally, it is important to note that machine learning datasets are rarely created "from the ground up" in a way that makes it possible to gather additional information about them. Instead, they are often scraped from some source, with no way of acquiring demographic information, consent, or other features. Some contextual information might still be available, however. For the Enron email dataset, for example, per-employee demographic information is not available, but demographic information about Enron as a whole is available. Ultimately, we see efforts aimed at more detailed annotation of datasets as a key step in strengthening the fairness, accountability, and transparency of machine learning systems.

6. Acknowledgements

We thank Peter Bailey, Sarah Brown, Steven Bowles, Amanda Casari, Eric Charran, Alain Couillault, Lukas Dauterman, Leigh Dodds, Miroslav Dudík, Michael Ekstrand, Noémie Elhadad, Michael Golebiewski, Michael Hoffman, Eric Horvitz, Mingjing Huang, Surya Kallumadi, Ece Kamar, Krishnaram Kenthapadi, Emre Kiciman, Jacquelyn Krones, Lillian Lee, Jochen Leidner, Rob Mauceri, Brian Mcfee, Erik Learned-Miller, Bogdan Micu, Margaret Mitchell, Brendan O'Connor, Thomas Padilla, Bo Pang, Anjali Parikh, Alessandro Perina, Michael Philips, Barton Place, Sudha Rao, David Van Riper, Cynthia Rudin, Ben Schneiderman, Biplav Srivastava, Ankur Teredesai, Rachel Thomas, Martin Tomko, Panagiotis Tziachris, Meredith Whittaker, Hans Wolters, and Lu Zhang for valuable discussion and feedback.

References

Andrews, Don A, Bonta, James, and Wormith, J Stephen. The recent past and near future of risk and/or need assessment. Crime & Delinquency, 52(1):7–27, 2006.

Anonymous. Data statements for NLP: Toward mitigating system bias and enabling better science. https://openreview.net/forum?id=By4oPeX9f, 2018.

Bhardwaj, Anant, Bhattacherjee, Souvik, Chavan, Amit, Deshpande, Amol, Elmore, Aaron J, Madden, Samuel, and Parameswaran, Aditya G. DataHub: Collaborative data science & dataset version management at scale. arXiv preprint arXiv:1409.0798, 2014.

Bolukbasi, Tolga, Chang, Kai-Wei, Zou, James Y, Saligrama, Venkatesh, and Kalai, Adam T. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pp. 4349–4357, 2016.

Bose, Dipan, Segui-Gomez, Maria, and Crandall, Jeff R. Vulnerability of female drivers involved in motor vehicle crashes: An analysis of US population at risk. American Journal of Public Health, 2011.

Brazma, Alvis, Hingamp, Pascal, Quackenbush, John, Sherlock, Gavin, Spellman, Paul, Stoeckert, Chris, Aach, John, Ansorge, Wilhelm, Ball, Catherine A, Causton, Helen C, et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nature Genetics, 29(4):365, 2001.

Buolamwini, Joy and Gebru, Timnit. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability, and Transparency (FAT*), pp. 77–91, 2018.

Caliskan, Aylin, Bryson, Joanna J, and Narayanan, Arvind. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.

Canis, Bill. Issues with federal motor vehicle safety standards, 2017. [Online; accessed 18-March-2018].

Cheney, James, Chiticariu, Laura, and Tan, Wang-Chiew. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.

Chui, Glennda. Project will use AI to prevent or minimize electric grid failures, 2017. [Online; accessed 14-March-2018].

Curran, William J. The Tuskegee syphilis study. The New England Journal of Medicine, 289(14), 1973.

Department of Transportation Federal Highway Administration. Year of first state driver license law and first driver examination, 1997. [Online; accessed 18-March-2018].

Di, Liping, Shao, Yuanzheng, and Kang, Lingjun. Implementation of geospatial data provenance in a web service workflow environment with ISO 19115 and ISO 19115-2 lineage model. IEEE Transactions on Geoscience and Remote Sensing, 51(11):5082–5089, 2013.

Dowling, Harry F. The emergence of the cooperative clinical trial. Transactions & Studies of the College of Physicians of Philadelphia, 43(1):20–29, 1975.

European Parliament and the Council of the EU. Directive 2002/95/EC of the European Parliament and of the Council of 27 January 2003 on the restriction of the use of certain hazardous substances in electrical and electronic equipment, 2003.

Faden, Ruth R, Lederer, Susan E, and Moreno, Jonathan D. US medical researchers, the Nuremberg Doctors Trial, and the Nuremberg Code. A review of findings of the Advisory Committee on Human Radiation Experiments. JAMA, 276(20):1667–1671, 1996.

Fast.ai. Fast.ai. http://www.fast.ai/, 2017.

Food and Drug Administration. Content and format of a new drug application (21 CFR 314.50 (d)(5)(v)), 1985.

Food and Drug Administration. Guidance for the study of drugs likely to be used in the elderly, 1989.

Food and Drug Administration. FDA clinical trials guidance documents, 2018.

Garvie, Clare, Bedoya, Alvaro, and Frankle, Jonathan. The Perpetual Line-Up: Unregulated Police Face Recognition in America. Georgetown Law, Center on Privacy & Technology, 2016.

Gunter, Tracy D and Terry, Nicolas P. The emergence of national electronic health record architectures in the United States and Australia: Models, costs, and questions. Journal of Medical Internet Research, 7(1), 2005.
Hingson, Ralph, Howland, Jonathan, and Levenson, Suzette. Effects of legislative reform to reduce drunken driving and alcohol-related traffic fatalities. Public Health Reports, 1988.

Holland, Sarah, Hosny, Ahmed, Newman, Sarah, Joseph, Joshua, and Chmielinski, Kasia. The dataset nutrition label: A framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677, 2018.

Huang, Gary B, Ramesh, Manu, Berg, Tamara, and Learned-Miller, Erik. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts Amherst, 2007.

International Electrotechnical Commission. About the IEC: Overview, 2017.

Lewis v. Amorous. 3 Ga.App. 50, 59 S.E. 338, 1907. [Online; accessed 18-March-2018].

Lin, Tom CW. The new investor. UCLA Law Review, 60:678, 2012.

Liu, Katherine A and Dipietro Mager, Natalie A. Women's involvement in clinical trials: Historical perspective and future implications. Pharmacy Practice (Granada), 14(1), 2016.

Manduca, CA, Fox, S, and Rissler, H. Datasheets: Making geoscience data easier to find and use. In AGU Fall Meeting Abstracts, 2006.

Mann, G and O'Neil, C. Hiring algorithms are not neutral. Harvard Business Review, 2016.

McShane, Clay. The Automobile. Routledge, 2018.

Moreno, Jonathan D. Undue Risk: Secret State Experiments on Humans. Routledge, 2013.

National Electrical Manufacturers Association. Digital imaging and communications in medicine. https://www.dicomstandard.org/, 2018.

National Highway Traffic Safety Administration. Final Regulatory Evaluation, Amendment to Federal Motor Vehicle Safety Standards 208, 2006. [Online; accessed 18-March-2018].

National Institutes of Health. NIH sharing policies and related guidance on NIH-funded research resources, 2018.

National Library of Medicine. National library of medicine. https://www.nlm.nih.gov/, 2018.

Nolan, Martha R and Nguyen, Thuy-Linh. Analysis and reporting of sex differences in phase III medical device clinical trials—how are we doing? Journal of Women's Health, 22(5):399–401, 2013.

O'Connor, Mary Catherine. How AI could smarten up our water system, 2017. [Online; accessed 14-March-2018].

Padilla, Thomas. Humanities data in the library: Integrity, form, access. D-Lib Magazine, 22(3):1, 2016.

Pang, Bo and Lee, Lillian. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004.

Passive Components, XICON. Miniature aluminum electrolytic capacitors: XRL series, 2005. [Online; accessed 14-March-2018].

Peltzman, Sam. The effects of automobile safety regulation. Journal of Political Economy, 1975.

Rauber, Andreas, Asmi, Ari, van Uytvanck, Dieter, and Proell, Stefan. Identification of reproducible subsets for data citation, sharing and re-use. Bulletin of IEEE Technical Committee on Digital Libraries, 12(1):6–15, 2016.

SDMX. Statistical data and metadata exchange. https://sdmx.org/?page_id=3425, 2018.

Smith, Larry D, Anderson, Raymond E, Forehand, Douglas W, Pelc, Thomas J, and Roy, Tanmoy. Power distribution system design methodology and capacitor selection for modern CMOS technology. IEEE Transactions on Advanced Packaging, 22(3):284–291, 1999.

Systems, Doha Supply. Facial recognition, 2017. [Online; accessed 14-March-2018].

World Economic Forum Global Future Council on Human Rights 2016–2018. How to prevent discriminatory outcomes in machine learning. https://www.weforum.org/whitepapers/how-to-prevent-discriminatory-outcomes-in-machine-learning, 2018.

Motivation for Dataset Creation

Why was the dataset created? (e.g., were there specific tasks in mind, or a specific gap that needed to be filled?)

What (other) tasks could the dataset be used for? Are there obvious tasks for which it should not be used?

Has the dataset been used for any tasks already? If so, where are the results so others can compare (e.g., links to published papers)?

Who funded the creation of the dataset? If there is an associated grant, provide the grant number.

Any other comments?

Dataset Composition

What are the instances? (that is, examples; e.g., documents, images, people, countries) Are there multiple types of instances? (e.g., movies, users, ratings; people, interactions between them; nodes, edges)

Are relationships between instances made explicit in the data (e.g., social network links, user/movie ratings, etc.)?

How many instances of each type are there?

What data does each instance consist of? "Raw" data (e.g., unprocessed text or images)? Features/attributes? Is there a label/target associated with instances? If the instances are related to people, are subpopulations identified (e.g., by age, gender, etc.) and what is their distribution?

Is everything included or does the data rely on external resources? (e.g., websites, tweets, datasets) If external resources, a) are there guarantees that they will exist, and remain constant, over time; b) is there an official archival version. Are there licenses, fees or rights associated with any of the data?

Are there recommended data splits or evaluation measures? (e.g., training, development, testing; accuracy/AUC)

What experiments were initially run on this dataset? Have a summary of those results and, if available, provide the link to a paper with more information here.

Any other comments?

Data Collection Process

How was the data collected? (e.g., hardware apparatus/sensor, manual human curation, software program, software interface/API; how were these constructs/measures/methods validated?)

Who was involved in the data collection process? (e.g., students, crowdworkers) How were they compensated? (e.g., how much were crowdworkers paid?)

Over what time-frame was the data collected? Does the collection time-frame match the creation time-frame?

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part of speech tags; model-based guesses for age or language)? If the latter two, were they validated/verified and if so how?

Does the dataset contain all possible instances? Or is it, for instance, a sample (not necessarily random) from a larger set of instances?

If the dataset is a sample, then what is the population? What was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Is the sample representative of the larger set (e.g., geographic coverage)? If not, why not (e.g., to cover a more diverse range of instances)? How does this affect possible uses?

Is there information missing from the dataset and why? (this does not include intentionally dropped instances; it might include, e.g., redacted text, withheld documents) Is this data missing because it was unavailable?

Are there any known errors, sources of noise, or redundancies in the data?

Any other comments?

Data Preprocessing

What preprocessing/cleaning was done? (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, etc.)

Was the "raw" data saved in addition to the preprocessed/cleaned data? (e.g., to support unanticipated future uses)

Is the preprocessing software available?

Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet?

Any other comments?

Dataset Distribution

How is the dataset distributed? (e.g., website, API, etc.; does the data have a DOI; is it archived redundantly?)

When will the dataset be released/first distributed? (Is there a canonical paper/reference for this dataset?)

What license (if any) is it distributed under? Are there any copyrights on the data?

Are there any fees or access/export restrictions?

Any other comments?

Dataset Maintenance

Who is supporting/hosting/maintaining the dataset? How does one contact the owner/curator/manager of the dataset (e.g. email address, or other contact info)?

Will the dataset be updated? How often and by whom? How will updates/revisions be documented and communicated (e.g., mailing list, GitHub)? Is there an erratum?

If the dataset becomes obsolete how will this be communicated?

Is there a repository to link to any/all papers/systems that use this dataset?

If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? If so, is there a process for tracking/assessing the quality of those contributions. What is the process for communicating/distributing these contributions to users?

Any other comments?

Legal & Ethical Considerations

If the dataset relates to people (e.g., their attributes) or was generated by people, were they informed about the data collection? (e.g., datasets that collect writing, photos, interactions, transactions, etc.)

If it relates to other ethically protected subjects, have appropriate obligations been met? (e.g., medical data might include information collected from animals)

If it relates to people, were there any ethical review applications/reviews/approvals? (e.g. Institutional Review Board applications)

If it relates to people, were they told what the dataset would be used for and did they consent? What community norms exist for data collected from human communications? If consent was obtained, how? Were the people provided with any mechanism to revoke their consent in the future or for certain uses?

If it relates to people, could this dataset expose people to harm or legal action? (e.g., financial, social or otherwise) What was done to mitigate or reduce the potential for harm?

If it relates to people, does it unfairly advantage or disadvantage a particular social group? In what ways? How was this mitigated?

If it relates to people, were they provided with privacy guarantees? If so, what guarantees and how are these ensured?

Does the dataset comply with the EU General Data Protection Regulation (GDPR)? Does it comply with any other standards, such as the US Equal Employment Opportunity Act?

Does the dataset contain information that might be considered sensitive or confidential? (e.g., personally identifying information)

Does the dataset contain information that might be considered inappropriate or offensive?

Any other comments?
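The proposal above is written as prose questions and does not prescribe a serialization format. Purely as a hypothetical illustration of how the seven sections might be captured in machine-readable form (the field names and the use of a Python dict below are ours, not part of the proposal):

# Hypothetical machine-readable skeleton for the seven-section datasheet
# template above. The section names follow the proposal; the field names
# are our own illustration, not a prescribed format.
datasheet = {
    "motivation": {
        "why_created": "...",
        "other_tasks": "...",
        "prior_uses": "...",
        "funding": "...",
    },
    "composition": {
        "instances": "...",
        "counts": "...",
        "external_resources": "...",
        "recommended_splits": "...",
    },
    "collection_process": {
        "how_collected": "...",
        "who_collected": "...",
        "timeframe": "...",
        "sampling_strategy": "...",
    },
    "preprocessing": {
        "cleaning_done": "...",
        "raw_data_saved": "...",
        "software_available": "...",
    },
    "distribution": {"how": "...", "license": "...", "fees": "..."},
    "maintenance": {"maintainer": "...", "update_plan": "...", "erratum": "..."},
    "legal_ethical": {"informed_consent": "...", "privacy": "...", "gdpr": "..."},
}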

Prototypes of Datasheets for Datasets

Labeled Faces in the Wild
A Database for Studying Face Recognition in Unconstrained Environments

Motivation for Dataset Creation

Why was the dataset created? (e.g., were there specific tasks in mind, or a specific gap that needed to be filled?)
Labeled Faces in the Wild was created to provide images that can be used to study face recognition in the unconstrained setting where image characteristics (such as pose, illumination, resolution, focus), subject demographic makeup (such as age, gender, race) or appearance (such as hairstyle, makeup, clothing) cannot be controlled. The dataset was created for the specific task of pair matching: given a pair of images each containing a face, determine whether or not the images are of the same person.¹

What (other) tasks could the dataset be used for? Are there obvious tasks for which it should not be used?
The LFW dataset can be used for the face identification problem. Some researchers have developed protocols to use the images in the LFW dataset for face identification.²

Has the dataset been used for any tasks already? If so, where are the results so others can compare (e.g., links to published papers)?
Papers using this dataset and the specified evaluation protocol are listed in http://vis-www.cs.umass.edu/lfw/results.html

Who funded the creation of the dataset? If there is an associated grant, provide the grant number.
The building of the LFW database was supported by a United States National Science Foundation CAREER Award.

Dataset Composition

What are the instances? (that is, examples; e.g., documents, images, people, countries) Are there multiple types of instances? (e.g., movies, users, ratings; people, interactions between them; nodes, edges)
Each instance is a pair of images labeled with the name of the person in the image. Some images contain more than one face. The labeled face is the one containing the central pixel of the image—other faces should be ignored as "background".

Are relationships between instances made explicit in the data (e.g., social network links, user/movie ratings, etc.)?
There are no known relationships between instances except for the fact that they are all individuals who appeared in news sources online, and some individuals appear in multiple pairs.

How many instances of each type are there?
The dataset consists of 13,233 face images in total of 5749 unique individuals. 1680 of these subjects have two or more images and 4069 have single ones.

What data does each instance consist of? "Raw" data (e.g., unprocessed text or images)? Features/attributes? Is there a label/target associated with instances? If the instances are related to people, are subpopulations identified (e.g., by age, gender, etc.) and what is their distribution?
Each instance contains a pair of images that are 250 by 250 pixels in JPEG 2.0 format. Each image is accompanied by a label indicating the name of the person in the image. While subpopulation data was not available at the initial release of the dataset, a subsequent paper³ reports the distribution of images by age, race and gender. Table 2 lists these results.

Is everything included or does the data rely on external resources? (e.g., websites, tweets, datasets) If external resources, a) are there guarantees that they will exist, and remain constant, over time; b) is there an official archival version. Are there licenses, fees or rights associated with any of the data?
Everything is included in the dataset.

Are there recommended data splits or evaluation measures? (e.g., training, development, testing; accuracy/AUC)
The dataset comes with specified train/test splits such that none of the people in the training split are in the test split and vice versa. The data is split into two views, View 1 and View 2. View 1 consists of a training subset (pairsDevTrain.txt) with 1100 pairs of matched and 1100 pairs of mismatched images, and a test subset (pairsDevTest.txt) with 500 pairs of matched and mismatched images. Practitioners can train an algorithm on the training set and test on the test set, repeating as often as necessary. Final performance results should be reported on View 2, which consists of 10 subsets of the dataset. View 2 should only be used to test the performance of the final model. We recommend reporting performance on View 2 by using leave-one-out cross validation, performing 10 experiments. That is, in each experiment, 9 subsets should be used as a training set and the 10th subset should be used for testing. At a minimum, we recommend reporting the estimated mean accuracy, $\hat{\mu}$, and the standard error of the mean, $S_E$, for View 2.

$\hat{\mu}$ is given by:

$$\hat{\mu} = \frac{\sum_{i=1}^{10} p_i}{10} \qquad (1)$$

where $p_i$ is the percentage of correct classifications on View 2 using subset $i$ for testing. $S_E$ is given as:

$$S_E = \frac{\hat{\sigma}}{\sqrt{10}} \qquad (2)$$

where $\hat{\sigma}$ is the estimate of the standard deviation, given by:

$$\hat{\sigma} = \sqrt{\frac{\sum_{i=1}^{10} (p_i - \hat{\mu})^2}{9}} \qquad (3)$$

The multiple-view approach is used instead of a traditional train/validation/test split in order to maximize the amount of data available for training and testing.
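As a concrete illustration of Equations (1)-(3), the following minimal sketch (our own, not part of the original datasheet) computes the recommended View 2 summary statistics from ten hypothetical per-experiment accuracies:

# Sketch of the recommended View 2 reporting (Equations 1-3): estimated
# mean accuracy and standard error over the 10 leave-one-subset-out
# experiments. Illustrative code, not part of the original datasheet.
import math

def view2_report(p):
    """p: accuracies p_1..p_10, one per experiment (subset i held out)."""
    assert len(p) == 10
    mu_hat = sum(p) / 10                                          # Equation (1)
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in p) / 9)  # Equation (3)
    se = sigma_hat / math.sqrt(10)                                # Equation (2)
    return mu_hat, se

# Example with ten hypothetical per-subset accuracies:
mu_hat, se = view2_report([0.91, 0.89, 0.93, 0.90, 0.92,
                           0.88, 0.94, 0.90, 0.91, 0.92])
print("estimated mean accuracy: %.4f (SE %.4f)" % (mu_hat, se))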
¹ All information in this datasheet is taken from one of five sources. Any errors that were introduced from these sources are our fault. Original paper: http://www.cs.cornell.edu/people/pabo/movie-review-data/; LFW survey: http://vis-www.cs.umass.edu/lfw/lfw.pdf; paper measuring LFW demographic characteristics: http://biometrics.cse.msu.edu/Publications/Face/HanJain_UnconstrainedAgeGenderRaceEstimation_MSUTechReport2014.pdf; LFW website: http://vis-www.cs.umass.edu/lfw/.
² Unconstrained face recognition: Identifying a person of interest from a media collection: http://biometrics.cse.msu.edu/Publications/Face/BestRowdenetal_UnconstrainedFaceRecognition_TechReport_MSU-CSE-14-1.pdf
³ http://biometrics.cse.msu.edu/Publications/Face/HanJain_UnconstrainedAgeGenderRaceEstimation_MSUTechReport2014.pdf

Training Paradigms: There are two training paradigms that can be used with our dataset. Practitioners should specify the training paradigm they used while reporting results.

• Image-Restricted Training: This setting prevents the experimenter from using the name associated with each image during training and testing. That is, the only available information is whether or not a pair of images consist of the same person, not who that person is. This means that there would be no simple way of knowing if there are multiple pairs of images in the train/test set that belong to the same person. Such inferences, however, might be made by comparing image similarity/equivalence (rather than comparing names). Thus, to form training pairs of matched and mismatched images for the same person, one can use image equivalence to add images that consist of the same person. The files pairsDevTrain.txt and pairsDevTest.txt support image-restricted uses of train/test data. The file pairs.txt in View 2 supports the image-restricted use of training data.

• Unrestricted Training: In this setting, one can use the names associated with images to form pairs of matched and mismatched images for the same person. The file people.txt in View 2 of the dataset contains subsets of people along with images for each subset. To use this paradigm, matched and mismatched pairs of images should be formed from images in the same subset. In View 1, the files peopleDevTrain.txt and peopleDevTest.txt can be used to create arbitrary pairs of matched/mismatched images for each person. The unrestricted paradigm should only be used to create training data and not for performance reporting. The test data, which is detailed in the file pairs.txt, should be used to report performance. We recommend that experimenters first use the image-restricted paradigm and move to the unrestricted paradigm if they believe that their algorithm's performance would significantly improve with more training data. While reporting performance, it should be made clear which of these two training paradigms was used for a particular test result.

Table 1. A summary of dataset statistics extracted from the original paper: Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.

Database Release Year: 2007
Number of Unique Subjects: 5749
Number of total images: 13,233
Number of individuals with 2 or more images: 1680
Number of individuals with single images: 4069
Image Size: 250 by 250 pixels
Image format: JPEG
Average number of images per person: 2.30

Table 2. Demographic characteristics of the LFW dataset as measured by Han, Hu, and Anil K. Jain. Age, gender and race estimation from unconstrained face images. Dept. Comput. Sci. Eng., Michigan State Univ., East Lansing, MI, USA, MSU Tech. Rep. (MSU-CSE-14-5) (2014).

Percentage of female subjects: 22.5%
Percentage of male subjects: 77.5%
Percentage of White subjects: 83.5%
Percentage of Black subjects: 8.47%
Percentage of Asian subjects: 8.03%
Percentage of people between 0-20 years old: 1.57%
Percentage of people between 21-40 years old: 31.63%
Percentage of people between 41-60 years old: 45.58%
Percentage of people over 61 years old: 21.2%

What experiments were initially run on this dataset? Have a summary of those results and, if available, provide the link to a paper with more information here.
The dataset was originally released without reported experimental results, but many experiments have been run on it since then.

Any other comments?
Table 1 summarizes some dataset statistics and Figure 1 shows examples of images. Most images in the dataset are color, a few are black and white.

Data Collection Process

How was the data collected? (e.g., hardware apparatus/sensor, manual human curation, software program, software interface/API; how were these constructs/measures/methods validated?)
The raw images for this dataset were obtained from the Faces in the Wild database collected by Tamara Berg at Berkeley.⁴ The images in this database were gathered from news articles on the web using software to crawl news articles.

Who was involved in the data collection process? (e.g., students, crowdworkers) How were they compensated? (e.g., how much were crowdworkers paid?)
Unknown

Over what time-frame was the data collected? Does the collection time-frame match the creation time-frame?
Unknown

⁴ Faces in the Wild: http://tamaraberg.com/faceDataset/

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part of speech tags; model-based guesses for age or language)? If the latter two, were they validated/verified and if so how?
The names for each person in the dataset were determined by an operator by looking at the caption associated with the person's photograph. Some people could have been given incorrect names, particularly if the original caption was incorrect.

Does the dataset contain all possible instances? Or is it, for instance, a sample (not necessarily random) from a larger set of instances?
The dataset does not contain all possible instances.

If the dataset is a sample, then what is the population? What was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Is the sample representative of the larger set (e.g., geographic coverage)? If not, why not (e.g., to cover a more diverse range of instances)? How does this affect possible uses?
The original Faces in the Wild dataset is a sample of pictures of people appearing in the news on the web. Labeled Faces in the Wild is thus also a sample of images of people found in the news online. While the intention of the dataset is to have a wide range of demographic (e.g. age, race, ethnicity) and image (e.g. pose, illumination, lighting) characteristics, there are many groups that have few instances (e.g. only 1.57% of the dataset consists of individuals under 20 years old).

Is there information missing from the dataset and why? (this does not include intentionally dropped instances; it might include, e.g., redacted text, withheld documents) Is this data missing because it was unavailable?
Unknown

Are there any known errors, sources of noise, or redundancies in the data?

Data Preprocessing

What preprocessing/cleaning was done? (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, etc.)
The following steps were taken to process the data:

1. Gathering raw images: First the raw images for this dataset were obtained from the Faces in the Wild dataset consisting of images and associated captions gathered from news articles found on the web.

2. Running the Viola-Jones face detector:⁵ The OpenCV version 1.0.0 release 1 implementation of the Viola-Jones face detector was used to detect faces in each of these images, using the function cvHaarDetectObjects, with the provided Haar classifier cascade, haarcascade_frontalface_default.xml. The scale factor was set to 1.2, min neighbors was set to 2, and the flag was set to CV_HAAR_DO_CANNY_PRUNING.

3. Manually eliminating false positives: If a face was detected and the specified region was determined not to be a face (by the operator), or the name of the person with the detected face could not be identified (using step 5 below), the face was omitted from the dataset.

4. Eliminating duplicate images: If images were determined to have a common original source photograph, they are defined to be duplicates of each other. An attempt was made to remove all duplicates but a very small number (that were not initially found) might still exist in the dataset. The number of remaining duplicates should be small enough so as not to significantly impact training/testing. The dataset contains distinct images that are not defined to be duplicates but are extremely similar. For example, there are pictures of celebrities that appear to be taken almost at the same time by different photographers from slightly different angles. These images were not removed.

5. Labeling (naming) the detected people: The name associated with each person was extracted from the associated news caption. This can be a source of error if the original news caption was incorrect. Photos of the same person were combined into a single group associated with one name. This was a challenging process as photos of some people were associated with multiple names in the news captions (e.g. "Bob McNamara" and "Robert McNamara"). In this scenario, an attempt was made to use the most common name. Some people have a single name (e.g. "Madonna" or "Abdullah"). For Chinese and some other Asian names, the common Chinese ordering (family name followed by given name) was used (e.g. "Hu Jintao").
6. Cropping and rescaling the detected faces: Each detected region denoting a face was first expanded by 2.2 in each dimension. If the expanded region falls outside of the image, a new image was created by padding the original pixels with black pixels to fill the area outside of the original image. This expanded region was then resized to 250 pixels by 250 pixels using the function cvResize, and cvSetImageROI as necessary. Images were saved in JPEG 2.0 format.

7. Forming pairs of training and testing pairs for View 1 and View 2 of the dataset: Each person in the dataset was randomly assigned to a set (with 0.7 probability of being in a training set in View 1 and uniform probability of being in any set in View 2). Matched pairs were formed by picking a person uniformly at random from the set of people who had two or more images in the dataset. Then, two images were drawn uniformly at random from the set of images of each chosen person, repeating the process if the images are identical or if they were already chosen as a matched pair. Mismatched pairs were formed by first choosing two people uniformly at random, repeating the sampling process if the same person was chosen twice. For each chosen person, one image was picked uniformly at random from their set of images. The process is repeated if both images are already contained in a mismatched pair.

⁵ Paul Viola and Michael Jones. Robust real-time face detection. IJCV, 2004.
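The steps above reference OpenCV 1.0.0's C API (cvHaarDetectObjects, cvSetImageROI, cvResize). The following sketch approximates steps 2 and 6 with the modern cv2 Python API; the parameter values come from the text, but the code is an illustrative approximation rather than the authors' actual pipeline:

# Illustrative re-implementation of preprocessing steps 2 and 6 using the
# modern cv2 Python API. Parameter values are taken from the datasheet
# text; the code itself is our approximation, not the original scripts.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(image_path):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Step 2: Viola-Jones detection, scale factor 1.2, min neighbors 2.
    faces = detector.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=2)
    crops = []
    for (x, y, w, h) in faces:
        # Step 6: expand the detected region by 2.2 in each dimension,
        # keeping it centered on the detection.
        cx, cy = x + w / 2.0, y + h / 2.0
        half_w, half_h = 1.1 * w, 1.1 * h
        x0, y0 = int(cx - half_w), int(cy - half_h)
        x1, y1 = int(cx + half_w), int(cy + half_h)
        # Pad with black pixels if the expanded region leaves the image.
        top, left = max(0, -y0), max(0, -x0)
        bottom = max(0, y1 - img.shape[0])
        right = max(0, x1 - img.shape[1])
        padded = cv2.copyMakeBorder(img, top, bottom, left, right,
                                    cv2.BORDER_CONSTANT, value=(0, 0, 0))
        crop = padded[y0 + top:y1 + top, x0 + left:x1 + left]
        # Resize the expanded region to 250 by 250 pixels.
        crops.append(cv2.resize(crop, (250, 250)))
    return crops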

Was the "raw" data saved in addition to the preprocessed/cleaned data? (e.g., to support unanticipated future uses)
The raw unprocessed data (consisting of images of faces and names of the corresponding people in the images) is saved.

Is the preprocessing software available?
While a script running a sequence of commands is not available, all software used to process the data is open source and has been specified above.
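Since no end-to-end script was released, the pair-formation logic of step 7 can also be sketched; the following is a hypothetical rendering of the sampling procedure described above (the function and its inputs are our own invention, not the authors' code):

# Hypothetical sketch of step 7's pair formation, following the sampling
# procedure described above; illustrative only.
import random

def form_pairs(people, n_pairs, seed=0):
    """people: dict mapping each person's name to a list of image ids."""
    rng = random.Random(seed)
    multi = [name for name, imgs in people.items() if len(imgs) >= 2]
    matched, mismatched = set(), set()
    while len(matched) < n_pairs:
        # Matched pair: a person with >= 2 images, two distinct images,
        # re-sampling if the pair was already chosen.
        name = rng.choice(multi)
        img_a, img_b = rng.sample(people[name], 2)
        matched.add((name, tuple(sorted((img_a, img_b)))))
    while len(mismatched) < n_pairs:
        # Mismatched pair: two distinct people, one image each,
        # re-sampling if both images are already in a mismatched pair.
        p1, p2 = rng.sample(sorted(people), 2)
        pair = tuple(sorted([(p1, rng.choice(people[p1])),
                             (p2, rng.choice(people[p2]))]))
        mismatched.add(pair)
    return matched, mismatched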

Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet?
There are some potential limitations in the dataset which might bias the data towards a particular demographic, pose, image characteristics, etc.:

• The Viola-Jones detector can have systematic errors by race, gender, age or other categories.

• Due to the Viola-Jones detector, there are only a small number of side views of faces, and only a few views from either above or below.

• The dataset does not contain many images that occur under extreme (or very low) lighting conditions.

• The original images were collected from newspaper articles. These articles could cover subjects in limited geographical locations, specific genders, age, race, etc. The dataset does not provide information on the types of garments worn by the individuals, whether they have glasses on, etc.

• The majority of the dataset consists of White males.

• There are very few images of people who are under 20 years old.

• The proposed train/test protocol allows reuse of data between View 1 and View 2 in the dataset. This could potentially introduce very small biases into the results.

Dataset Distribution

How is the dataset distributed? (e.g., website, API, etc.; does the data have a DOI; is it archived redundantly?)
The dataset can be downloaded from http://vis-www.cs.umass.edu/lfw/index.html#download. The images can be downloaded as a gzipped tar file.

When will the dataset be released/first distributed? (Is there a canonical paper/reference for this dataset?)
The dataset was released in October, 2007.

What license (if any) is it distributed under? Are there any copyrights on the data?
The crawled data copyright belongs to the newspapers that the data originally appeared in. There is no license, but there is a request to cite the corresponding paper if the dataset is used: Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.

Are there any fees or access/export restrictions?
There are no fees or restrictions.

Dataset Maintenance

Who is supporting/hosting/maintaining the dataset? How does one contact the owner/curator/manager of the dataset (e.g. email address, or other contact info)?
The dataset is hosted at the University of Massachusetts and comments can be sent to Gary Huang ([email protected]).

Will the dataset be updated? How often and by whom? How will updates/revisions be documented and communicated (e.g., mailing list, GitHub)? Is there an erratum?
All changes to the dataset will be announced through the LFW mailing list. Those who would like to sign up should send an email to [email protected]. Errata are listed under the "Errata" section of http://vis-www.cs.umass.edu/lfw/index.html

If the dataset becomes obsolete how will this be communicated?
All changes to the dataset will be announced through the LFW mailing list.

Is there a repository to link to any/all papers/systems that use this dataset?
Papers using this dataset and the specified training/evaluation protocols are listed under the "Methods" section of http://vis-www.cs.umass.edu/lfw/results.html

If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? If so, is there a process for tracking/assessing the quality of those contributions. What is the process for communicating/distributing these contributions to users?
Unknown

When will the dataset be released/first distributed? (Is there a canonical paper/reference for this dataset?)
The dataset was released in October, 2007.

What license (if any) is it distributed under? Are there any copyrights on the data?
The crawled data copyright belongs to the newspapers that the data originally appeared in. There is no license, but there is a request to cite the corresponding paper if the dataset is used: Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.

Are there any fees or access/export restrictions?
There are no fees or restrictions.

Dataset Maintenance

Who is supporting/hosting/maintaining the dataset? How does one contact the owner/curator/manager of the dataset (e.g. email address, or other contact info)?
The dataset is hosted at the University of Massachusetts, and comments can be sent to Gary Huang: [email protected].

Will the dataset be updated? How often and by whom? How will updates/revisions be documented (e.g., mailing list, GitHub)? Is there an erratum?
All changes to the dataset will be announced through the LFW mailing list. Those who would like to sign up should send an email to [email protected]. Errata are listed under the “Errata” section of http://vis-www.cs.umass.edu/lfw/index.html.

If the dataset becomes obsolete how will this be communicated?
All changes to the dataset will be announced through the LFW mailing list.

Is there a repository to link to any/all papers/systems that use this dataset?
Papers using this dataset and the specified training/evaluation protocols are listed under the “Methods” section of http://vis-www.cs.umass.edu/lfw/results.html.

If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? If so, is there a process for tracking/assessing the quality of those contributions. What is the process for communicating/distributing these contributions to users?
Unknown

[Figure 1. Examples of images from our dataset (matched pairs).]

Legal & Ethical Considerations

If the dataset relates to people (e.g., their attributes) or was generated by people, were they informed about the data collection? (e.g., datasets that collect writing, photos, interactions, transactions, etc.)
No. The data was crawled from public web sources, and the individuals appeared in news stories. But there was no explicit informing of these individuals that their images were being assembled into a dataset.

If it relates to other ethically protected subjects, have appropriate obligations been met? (e.g., medical data might include information collected from animals)
Not applicable.

If it relates to people, were there any ethical review applications/reviews/approvals? (e.g. Institutional Review Board applications)
Unknown

If it relates to people, were they told what the dataset would be used for and did they consent? What community norms exist for data collected from human communications? If consent was obtained, how? Were the people provided with any mechanism to revoke their consent in the future or for certain uses?
No (see first question).

If it relates to people, could this dataset expose people to harm or legal action? (e.g., financial, social, or otherwise) What was done to mitigate or reduce the potential for harm?
There is minimal risk for harm: the data was already public.

If it relates to people, does it unfairly advantage or disadvantage a particular social group? In what ways? How was this mitigated?
Unknown

If it relates to people, were they provided with privacy guarantees? If so, what guarantees and how are these ensured?
No. All subjects in the dataset appeared in news sources, so the images that we used, along with the captions, are already public.

Does the dataset comply with the EU General Data Protection Regulation (GDPR)? Does it comply with any other standards, such as the US Equal Employment Opportunity Act?
The dataset does not comply with GDPR because subjects were not asked for their consent.

Does the dataset contain information that might be considered sensitive or confidential? (e.g., personally identifying information)
The dataset does not contain confidential information, since all information was scraped from news stories.

Does the dataset contain information that might be considered inappropriate or offensive?
No. The dataset only consists of faces and associated names.

Movie Review Polarity
Thumbs Up? Sentiment Classification using Machine Learning Techniques

Motivation for Dataset Creation

Why was the dataset created? (e.g., were there specific tasks in mind, or a specific gap that needed to be filled?)
The dataset was created to enable research on predicting sentiment polarity: given a piece of (English) text, predict whether it has a positive or negative affect or stance toward its topic. It was created intentionally with that task in mind, focusing on movie reviews as a place where affect/sentiment is frequently expressed.¹

[Figure 1. An example “negative polarity” instance, taken from the file neg/cv452_tok-18656.txt: “these are words that could be used to describe the emotions of john sayles’ characters in his latest , limbo . but no , i use them to describe myself after sitting through his latest little exercise in indie egomania . i can forgive many things . but using some hackneyed , whacked-out , screwed-up * non * - ending on a movie is unforgivable . i walked a half-mile in the rain and sat through two hours of typical , plodding sayles melodrama to get cheated by a complete and total copout finale . does sayles think he’s roger corman ?”]

What (other) tasks could the dataset be used for? Are there obvious tasks for which it should not be used?
The dataset could be used for anything related to modeling or understanding movie reviews. For instance, one may induce a lexicon of words/phrases that are highly indicative of sentiment polarity, or learn to automatically generate movie reviews.

Has the dataset been used for any tasks already? If so, where are the results so others can compare (e.g., links to published papers)?
At the time of publication, only the original paper: http://xxx.lanl.gov/pdf/cs/0409058v1. Between then and 2012, a collection of papers that used this dataset was maintained at http://www.cs.cornell.edu/people/pabo/movie%2Dreview%2Ddata/otherexperiments.html.

Who funded the creation of the dataset? If there is an associated grant, provide the grant number.
Funding was provided through five distinct sources: the National Science Foundation, the Department of the Interior, the National Business Center, Cornell University, and the Sloan Foundation.

Dataset Composition

What are the instances? (that is, examples; e.g., documents, images, people, countries) Are there multiple types of instances? (e.g., movies, users, ratings; people, interactions between them; nodes, edges)
The instances are movie reviews extracted from newsgroup postings, together with a sentiment rating for whether the text corresponds to a review with a rating that is either strongly positive (high number of stars) or strongly negative (low number of stars). The polarity rating is binary {positive, negative}. An example instance is shown in Figure 1.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images)? Features/attributes? Is there a label/target associated with instances? If the instances are related to people, are subpopulations identified (e.g., by age, gender, etc.) and what is their distribution?
Each instance consists of the text associated with the review, with obvious ratings information removed from that text (some errors were found and later fixed). The text was down-cased, and HTML tags and boilerplate newsgroup header/footer text were removed. Some additional unspecified automatic filtering was done. Each instance also has an associated target value: a positive (+1) or negative (-1) rating based on the number of stars the review gave (details on the mapping from number of stars to polarity are given below in “Data Preprocessing”).

Is everything included or does the data rely on external resources? (e.g., websites, tweets, datasets) If external resources, a) are there guarantees that they will exist, and remain constant, over time; b) is there an official archival version. Are there licenses, fees or rights associated with any of the data?
Everything is included.

Are there recommended data splits or evaluation measures? (e.g., training, development, testing; accuracy/AUC)
The instances come with a “cross-validation tag” to enable replication of cross-validation experiments; results are measured in classification accuracy.

What experiments were initially run on this dataset? Have a summary of those results and, if available, provide the link to a paper with more information here.
Several experiments are reported in the README for baselines on this data, both on the original dataset (Table 1) and the cleaned version (Table 2).
In these results, NB = Naive Bayes, ME = Maximum Entropy, and SVM = Support Vector Machine. The feature sets include unigrams (with and without counts), bigrams, part-of-speech features, and adjectives-only.

Are relationships between instances made explicit in the data (e.g., social network links, user/movie ratings, etc.)?
None explicitly, though the original newsgroup postings include poster name and email address, so some information could be extracted if needed.

How many instances of each type are there?
There are 1400 instances in total in the original (v1.x versions) and 2000 instances in total in v2.0 (from 2004).

Data Collection Process

How was the data collected? (e.g., hardware apparatus/sensor, manual human curation, software program, software interface/API; how were these constructs/measures/methods validated?)
The data was collected by downloading reviews from the IMDb archive of the rec.arts.movies.reviews newsgroup, at http://reviews.imdb.com/Reviews.

Who was involved in the data collection process? (e.g., students, crowdworkers) How were they compensated? (e.g., how much were crowdworkers paid?)
Unknown

Over what time-frame was the data collected? Does the collection time-frame match the creation time-frame?
Unknown

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part of speech tags; model-based guesses for age or language)? If the latter two, were they validated/verified and if so how?
Yes. The data was mostly observable as raw text, except the labels, which were extracted by the process described below.

¹ Information in this datasheet is taken from one of five sources; any errors that were introduced are our fault: http://www.cs.cornell.edu/people/pabo/movie-review-data/; http://xxx.lanl.gov/pdf/cs/0409058v1; http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt; http://www.cs.cornell.edu/people/pabo/movie-review-data/poldata.README.2.0.txt.

Data Preprocessing

What preprocessing/cleaning was done? (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values, etc.)
Instances for which an explicit rating could not be found were discarded. Also, only instances with strongly-positive or strongly-negative ratings were retained. Star ratings were extracted by automatically looking for text like “**** out of *****” in the review, using that as a label, and then removing the corresponding text. When the star rating was out of five stars, anything at least four was considered positive and anything at most two negative; when out of four, three and up is considered positive, and one or less is considered negative. Occasionally half stars are missed, which affects the labeling of negative examples. Everything in the middle was discarded. In order to ensure that sufficiently many authors are represented, at most 20 reviews (per positive/negative label) per author are included. (A sketch of this extraction step appears at the end of this section.)

In a later version of the dataset (v1.1), non-English reviews were also removed.

Table 1. Results on the original dataset. The “NB (corrected)” column is after the data repair specified in the erratum (discussed later); the remaining columns are as reported in the paper.

Features           | NB (corrected) | NB   | ME   | SVM
unigrams (freq.)   | 79.0           | 78.7 | n/a  | 72.8
unigrams           | 81.5           | 81.0 | 80.4 | 82.9
unigrams+bigrams   | 80.5           | 80.6 | 80.8 | 82.7
bigrams            | 77.3           | 77.3 | 77.4 | 77.1
unigrams+POS       | 81.5           | 81.5 | 80.4 | 81.9
adjectives         | 76.8           | 77.0 | 77.7 | 75.1
top 2633 unigrams  | 80.2           | 80.3 | 81.0 | 81.4
unigrams+position  | 80.8           | 81.0 | 80.1 | 81.6

Table 2. Results on the cleaned dataset (the “# features” column is the number of unique features).

Features           | # features | NB   | ME   | SVM
unigrams (freq.)   | 16162      | 79.0 | n/a  | 73.0
unigrams           | 16162      | 81.0 | 80.2 | 82.9
unigrams+bigrams   | 32324      | 80.7 | 80.7 | 82.8
bigrams            | 16162      | 77.3 | 77.5 | 76.5
unigrams+POS       | 16688      | 81.3 | 80.3 | 82.0
adjectives         | 2631       | 76.6 | 77.6 | 75.3
top 2631 unigrams  | 2631       | 80.9 | 81.3 | 81.2
unigrams+position  | 22407      | 80.8 | 79.8 | 81.8
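For orientation, the unigram rows of Tables 1 and 2 can be approximated with off-the-shelf tools. The sketch below is not the authors' implementation: it assumes the v2.0 directory layout (txt_sentoken/pos and txt_sentoken/neg, with the cross-validation tag encoded in each file name as cvNNN), substitutes scikit-learn's multinomial Naive Bayes with binary presence features, and groups consecutive blocks of 100 cv indices into ten folds, which is one common reading of the tags (check the README for the recommended splits).

    import glob, os, re
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import PredefinedSplit, cross_val_score
    from sklearn.naive_bayes import MultinomialNB

    def load_polarity(root="txt_sentoken"):
        texts, labels, folds = [], [], []
        for label, subdir in ((1, "pos"), (-1, "neg")):
            for path in sorted(glob.glob(os.path.join(root, subdir, "cv*"))):
                # The cvNNN prefix is the cross-validation tag.
                cv_id = int(re.match(r"cv(\d+)", os.path.basename(path)).group(1))
                with open(path, encoding="latin-1") as f:
                    texts.append(f.read())
                labels.append(label)
                folds.append(cv_id // 100)   # assumed ten-fold grouping
        return texts, np.array(labels), np.array(folds)

    texts, y, folds = load_polarity()
    # Binary presence features (the "unigrams" rows) rather than counts ("freq.").
    X = CountVectorizer(binary=True).fit_transform(texts)
    scores = cross_val_score(MultinomialNB(), X, y, cv=PredefinedSplit(folds))
    print("10-fold accuracy: %.3f" % scores.mean())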

Some preprocessing errors were caught in later versions. The following fixes were made: (1) Some reviews had rating information in several places that was missed by the initial filters; these are removed. (2) Some reviews had unexpected/unparsed ranges, and these were fixed. (3) Sometimes the boilerplate removal removed too much of the text.

Was the “raw” data saved in addition to the preprocessed/cleaned data? (e.g., to support unanticipated future uses)

Is the preprocessing software available?
No.
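Because the preprocessing scripts are not available, the following is only a hedged reconstruction of the star-rating extraction and author capping described above. The regular expression and the helper names (star_polarity, keep_author) are illustrative assumptions, not the original filters.

    import re
    from collections import Counter

    # Illustrative pattern for explicit ratings like "*** out of ****".
    # As noted above, a filter of this kind occasionally misses half-star
    # notations such as "* 1/2 out of ****", mislabeling some negatives.
    STAR_RE = re.compile(r"(\*+) ?out ?of ?(\*+)")

    def star_polarity(text):
        """Map an explicit star rating to +1 / -1, or None to discard."""
        m = STAR_RE.search(text)
        if m is None:
            return None                  # no extractable rating: discard
        stars, scale = len(m.group(1)), len(m.group(2))
        if scale == 5:                   # out of five: >=4 pos, <=2 neg
            return 1 if stars >= 4 else (-1 if stars <= 2 else None)
        if scale == 4:                   # out of four: >=3 pos, <=1 neg
            return 1 if stars >= 3 else (-1 if stars <= 1 else None)
        return None                      # unexpected scale: discard

    _counts = Counter()
    def keep_author(author, label, cap=20):
        """At most 20 reviews per author per polarity label, as described."""
        _counts[(author, label)] += 1
        return _counts[(author, label)] <= cap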

Does the dataset contain all possible instances? Or is it, for instance, a sample (not necessarily random) from a larger set of instances?
The dataset is a sample of instances.

If the dataset is a sample, then what is the population? What was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? Is the sample representative of the larger set (e.g., geographic coverage)? If not, why not (e.g., to cover a more diverse range of instances)? How does this affect possible uses?
The sample of instances collected is English movie reviews from the rec.arts.movies.reviews newsgroup, from which a “number of stars” rating could be extracted. The sample is limited to at most forty reviews per unique author in order to achieve broader coverage by authorship. Beyond that, the sample is arbitrary.

Is there information missing from the dataset and why? (this does not include intentionally dropped instances; it might include, e.g., redacted text, withheld documents) Is this data missing because it was unavailable?
No data is missing.

Does this dataset collection/processing procedure achieve the motivation for creating the dataset stated in the first section of this datasheet?
The overarching goal of this dataset is to study the task of sentiment analysis. From this perspective, the current dataset represents a highly biased sample of all texts that express affect. In particular: the genre is movie reviews (as opposed to other affective texts); the reviews are all in English; they are all from the IMDb archive of the rec.arts.movies.reviews newsgroup; and all are from a limited time frame. As mentioned above, at most forty reviews were retained per author to ensure better coverage of authors. Due to all of these sampling biases, it is unclear whether models trained on this dataset should be expected to generalize to other review domains (e.g., books, hotels, etc.) or to domains where affect may be present but is not the main point of the text (e.g., personal emails).

Are there any known errors, sources of noise, or redundancies in the data?

Dataset Distribution

How is the dataset distributed? (e.g., website, API, etc.; does the data have a DOI; is it archived redundantly?)
The dataset is distributed on Bo Pang’s webpage at Cornell: http://www.cs.cornell.edu/people/pabo/movie-review-data. The dataset does not have a DOI and there is no redundant archive.

When will the dataset be released/first distributed? (Is there a canonical paper/reference for this dataset?)
The dataset was first released in 2002.

What license (if any) is it distributed under? Are there any copyrights on the data?
The crawled data copyright belongs to the authors of the reviews unless otherwise stated. There is no license, but there is a request to cite the corresponding paper if the dataset is used: Thumbs up? Sentiment classification using machine learning techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, 2002.

Are there any fees or access/export restrictions?
No.

Dataset Maintenance

Who is supporting/hosting/maintaining the dataset? How does one contact the owner/curator/manager of the dataset (e.g. email address, or other contact info)?
Bo Pang is supporting/maintaining the dataset.

Will the dataset be updated? How often and by whom? How will updates/revisions be documented (e.g., mailing list, GitHub)? Is there an erratum?
Since its initial release (v0.9) there have been three later releases (v1.0, v1.1, and v2.0). There is not an explicit erratum, but updates and known errors are specified in higher-version README and diff files. There are several versions of these: v1.0: http://www.cs.cornell.edu/people/pabo/movie-review-data/README; v1.1: http://www.cs.cornell.edu/people/pabo/movie%2Dreview%2Ddata/README.1.1 and http://www.cs.cornell.edu/people/pabo/movie-review-data/diff.txt; v2.0: http://www.cs.cornell.edu/people/pabo/movie%2Dreview%2Ddata/poldata.README.2.0.txt. Updates are listed on the dataset web page. (This datasheet largely summarizes these sources.)

If the dataset becomes obsolete how will this be communicated?
This will be posted on the dataset webpage.

Is there a repository to link to any/all papers/systems that use this dataset?
There is a repository, maintained by Pang/Lee through April 2012, at http://www.cs.cornell.edu/people/pabo/movie%2Dreview%2Ddata/otherexperiments.html.

If others want to extend/augment/build on this dataset, is there a mechanism for them to do so? If so, is there a process for tracking/assessing the quality of those contributions. What is the process for communicating/distributing these contributions to users?
Others may do so and should contact the original authors about incorporating fixes/extensions.

Legal & Ethical Considerations

If the dataset relates to people (e.g., their attributes) or was generated by people, were they informed about the data collection? (e.g., datasets that collect writing, photos, interactions, transactions, etc.)
No. The data was crawled from public web sources, and the authors of the posts presumably knew that their posts would be public, but there was no explicit informing of these authors that their posts were to be used in this way.

If it relates to other ethically protected subjects, have appropriate obligations been met? (e.g., medical data might include information collected from animals)
Not applicable.

If it relates to people, were there any ethical review applications/reviews/approvals? (e.g. Institutional Review Board applications)
Unknown

If it relates to people, were they told what the dataset would be used for and did they consent? What community norms exist for data collected from human communications? If consent was obtained, how? Were the people provided with any mechanism to revoke their consent in the future or for certain uses?
No (see first question).

If it relates to people, could this dataset expose people to harm or legal action? (e.g., financial, social, or otherwise) What was done to mitigate or reduce the potential for harm?
There is minimal risk for harm: the data was already public, and in the preprocessed version, names and email addresses were removed.

If it relates to people, does it unfairly advantage or disadvantage a particular social group? In what ways? How was this mitigated?
Unknown

If it relates to people, were they provided with privacy guarantees? If so, what guarantees and how are these ensured?
No; however, while most names have been removed from the preprocessed/tokenized versions of the data, the original data includes names and email addresses, which were also present on the IMDb archive.

Does the dataset comply with the EU General Data Protection Regulation (GDPR)? Does it comply with any other standards, such as the US Equal Employment Opportunity Act?
The preprocessed dataset may comply with GDPR; the raw data does not because it contains personally identifying information.

Does the dataset contain information that might be considered sensitive or confidential? (e.g., personally identifying information)
The raw form of the dataset contains names and email addresses, but these are already public on the internet newsgroup.

Does the dataset contain information that might be considered inappropriate or offensive?
Some movie reviews might contain moderately inappropriate or offensive language, but we do not expect this to be the norm.