Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud
Michaela Hardt1, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai2, Keerthan Vasist, Pinar Yilmaz, M. Bilal Zafar, Sanjiv Das3, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi
Amazon Web Services

ABSTRACT
Understanding the predictions made by machine learning (ML) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. We present Amazon SageMaker Clarify, an explainability feature for Amazon SageMaker that launched in December 2020, providing insights into data and ML models by identifying biases and explaining predictions. It is deeply integrated into Amazon SageMaker, a fully managed service that enables data scientists and developers to build, train, and deploy ML models at any scale. Clarify supports bias detection and feature importance computation across the ML lifecycle, during data preparation, model evaluation, and post-deployment monitoring. We outline the desiderata derived from customer input, the modular architecture, and the methodology for bias and explanation computations. Further, we describe the technical challenges encountered and the tradeoffs we had to make. For illustration, we discuss two customer use cases. We present our deployment results, including qualitative customer feedback and a quantitative evaluation. Finally, we summarize lessons learned and discuss best practices for the successful adoption of fairness and explanation tools in practice.

KEYWORDS
Machine learning, Fairness, Explainability

ACM Reference Format:
Michaela Hardt, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai, Keerthan Vasist, Pinar Yilmaz, M. Bilal Zafar, Sanjiv Das, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi. 2021. Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3447548.3467177

1 INTRODUCTION
Machine learning (ML) models and data-driven systems are increasingly used to assist in decision-making across domains such as financial services, healthcare, education, and human resources. Benefits of using ML include improved accuracy, increased productivity, and cost savings. The increasing adoption of ML is the result of multiple factors, most notably ubiquitous connectivity, the ability to collect, aggregate, and process large amounts of data using cloud computing, and improved access to increasingly sophisticated ML models. In high-stakes settings, tools for bias detection and explainability in the ML lifecycle are of particular importance.

Regulatory [1, 2]: Many ML scenarios require an understanding of why the ML model made a specific prediction or whether its prediction was biased. Recently, policymakers, regulators, and advocates have expressed concerns about the potentially discriminatory impact of ML and data-driven systems, for example, due to inadvertent encoding of bias into automated decisions. Informally, biases can be viewed as imbalances in the training data or the prediction behavior of the model across different groups, such as age or income bracket.

Business [3, 17]: The adoption of AI systems in regulated domains requires explanations of how models make predictions. Model explainability is particularly important in industries with reliability, safety, and compliance requirements, such as financial services, human resources, healthcare, and automated transportation. Using a financial example of lending applications, loan officers, customer service representatives, compliance officers, and end users often require explanations detailing why a model made certain predictions.

Data Science: Data scientists and ML engineers need tools to debug and improve ML models, to ensure a diverse and unbiased dataset during feature engineering, to determine whether a model is making inferences based on noisy or irrelevant features, and to understand the limitations and failure modes of their model.

While there are several open-source tools for fairness and explainability, we learned from our customers that developers often find it tedious to incorporate these tools across the ML workflow,

1 Correspondence to: Michaela Hardt,
DD = n_d^(0)/n^(0) − n_d^(1)/n^(1), and compute it across all strata i on a user-supplied attribute that we want to control for, yielding CDDL = (1/n) Σ_i n_i · DD_i, where i subscripts each stratum and n_i is the number of examples in it.

4.1.3 Post-Training Metrics.

DPPL Difference in positive proportions in predicted labels: consider the difference in positive prediction rates: DPPL = q̂_a − q̂_d; related is "statistical parity" [7, 10].

DI Disparate Impact: Instead of the difference as in DPPL, we can consider the ratio: DI = q̂_d / q̂_a.

DCA Difference in Conditional Acceptance / Rejection: DCA considers the ratio of the number of positive labels to that of positive predictions, assessing calibration: DCA = n_a^(1)/n̂_a^(1) − n_d^(1)/n̂_d^(1). Analogously, DCR considers negative labels and predictions.

AD Accuracy Difference: We compare the accuracy, i.e., the fraction of examples for which the prediction is equal to the label, across groups. We define AD = (TP_a + TN_a)/n_a − (TP_d + TN_d)/n_d.

RD Recall Difference: We compare the recall (the fraction of positive examples that receive a positive prediction) across groups: RD = TP_a/n_a^(1) − TP_d/n_d^(1) (aka "equal opportunity" [19]).

DAR Difference in Acceptance / Rejection Rates: This metric considers precision, i.e., the fraction of positive predictions that are correct: DAR = TP_a/n̂_a^(1) − TP_d/n̂_d^(1). Analogously, we define DRR for negative predictions.

TE Treatment Equality: We define TE = FN_d/FP_d − FN_a/FP_a [7].

CDDPL Conditional Demographic Disparity of Predicted Labels: This metric is analogous to the CDDL, computed on predictions.

See §A.1 for additional metrics.

4.2 Explainability
We implemented the KernelSHAP [29] algorithm due to its good performance [36] and wide applicability. The KernelSHAP algorithm [29] works as follows: Given an input x ∈ R^M and the model output f(x) ∈ R, SHAP, building on the Shapley value framework [38], designates the prediction task as an M-player cooperative game, with individual features being players and the predicted output f(x) being the total payout of the game. The framework decides how important each player is to the game's outcome. Specifically, the importance of the i-th player x_i is defined as the average marginal contribution of the player to a game not containing x_i. Formally, the contribution φ_i is defined as:

    φ_i(x) = Σ_S [ |S|! (M − |S| − 1)! / M! ] · ( f(x_S ∪ x_i) − f(x_S) ),

where the sum is computed over all the player coalitions S not containing x_i. The computation involves iterating over 2^M feature coalitions, which does not scale to even moderate values of M, e.g., 50. To overcome these scalability issues, KernelSHAP uses approximations [29].

4.3 Monitoring bias and explanations
Measuring bias metrics and feature importance provides insights into the model at the time of evaluation. However, a change in data distribution can occur over time, resulting in different bias metrics and feature importance. Our goal is to continuously monitor for such changes and raise alerts if bias metrics or feature importance drift from reference values established during an earlier evaluation (before the model launch). Based on a user-specified schedule (e.g., hourly), newly accumulated data is analyzed and compared against the reference values.

Bias. Given a bias metric from §4.1, we let the user specify a range of reference values A = (a_min, a_max) in which the new bias value b computed on the live data should lie. However, if little data was accumulated, simply checking for containment can lead to noisy results. To ensure that the conclusions drawn from the observed live data are statistically significant, we check for overlap of A with the 95% bootstrap confidence interval of b [15].

Explainability. For feature importance we consider the change in ranking of features by their importance. Specifically, we use the Normalized Discounted Cumulative Gain (nDCG) score for comparing the ranking by feature importance of the reference data with that based on the live data. nDCG penalizes changes further down in the ranking less. Also, swapping the position of two values is penalized relative to the difference of their original importance. Let F = [f_1, ..., f_M] denote the list of M features sorted w.r.t. their importance scores in the reference data, and F' = [f'_1, ..., f'_M] denote the new ranking of the features based on the live data. We use a(f) to describe a function that returns the feature importance score on the reference data of a feature f. Then, nDCG is defined as: nDCG = DCG/iDCG, where DCG = Σ_{i=1}^{M} a(f'_i)/log2(i + 1) and iDCG = Σ_{i=1}^{M} a(f_i)/log2(i + 1). An nDCG value of 1 indicates no change in the ranking. In our implementation, an nDCG value below 0.90 triggers an alert.

5 ALTERNATIVE DESIGN AND METHODS
On the architectural side, our design provisions a separate model server, which adds latency. Alternatively, a design that runs the model locally avoids network traffic. We decided to use a model server so that we can re-use the existing SageMaker model hosting primitives. Our model-agnostic approach enables support of a wide range of models, including those that cannot be executed locally. This way we can scale and tune the instance type selection and counts for the processing job and the model serving instances independently, based on the dataset and model characteristics.

For the bias metrics, we considered limiting the number of metrics to avoid overwhelming users. However, the appropriate bias metric is application dependent. To support customers from different industries with various use cases, we decided to implement a wide range of metrics (see §4.1).

For explainability we decided to implement the KernelSHAP [29] algorithm (see §4.2) rather than alternatives discussed in §2 for a number of reasons: (i) it offers wide applicability to interpret predictions of any score-based model; (ii) it works in a model-agnostic manner and does not require access to model internals such as gradients, allowing us to build on top of the SageMaker hosting primitives; (iii) it provides attractive theoretical properties, for instance, all the feature importance scores sum to the total class score; and (iv) benchmarking shows that it performs favorably compared to related methods such as LIME [36]. More efficient model-specific implementations like TreeSHAP and DeepSHAP [29] leverage the model architecture for improved efficiency. By using the model-agnostic KernelSHAP, we traded efficiency for wider applicability. In the future we will prioritize additional explainability methods based on customer requests.

6 DEVELOPMENT
Timeline: We started by collecting customer requirements to derive Clarify's goals and design the system (§3). Our first prototype implemented the methods on a single node, packaged into a container that could be run with a bare-bones configuration file. After internal tests, we deployed this version for beta customers to solicit feedback. This feedback helped us to prioritize scalability (§8.2) and ease-of-use improvements (§8.3). Afterwards, we expanded support of models and datasets, conducted more end-to-end tests, and obtained additional rounds of customer feedback. We embarked on the integration across the model lifecycle (§3.3), emphasizing the associated UI components and documentation.12 Prior to public launch, we set up pipelines and dashboards for health monitoring of our system, including alarms for too many failed jobs. Throughout, multiple reviews took place for the design, usability, security and privacy, and operational readiness.

Implementation Details: For wide applicability, we implemented flexible parsers to support various data formats (JSONLines/CSV, with(-out) header) and model signatures; see §A.2 for examples. The first bias metrics were implemented in numpy.13 For scalability, all subsequent methods were implemented using Spark via SageMaker's Spark Processing job. Spark provided us with convenient abstractions (e.g., distributed median computations) and managed the cluster for us. Our bias metrics mostly require counts that are easy to distribute. Distributing SHAP only requires one map operation to calculate feature importance for each example, followed by an aggregation for global model feature importance.

6.1 Challenges and Resolutions
A challenge to fault tolerance is that Clarify's offline batch processing job hits a live endpoint. This requires a careful implementation to avoid overwhelming the endpoint with requests. To limit the potential impact, Clarify spins up separate model servers and does not interfere with production servers. To avoid overwhelming these separate endpoints, we dynamically configured the batch size based on experimentation, starting with the largest possible batch size based on the maximum payload14 in order to achieve high throughput at the model endpoints. Furthermore, we relied on the boto clients' retry policies15 with exponential back-off. This is necessary to tolerate intermittent inference call failures. We further expanded the retry mechanism, since not all errors are considered retriable in the boto client.

A key to scalability was to utilize resources well by carefully tuning the number of partitions of the dataset in Spark: If tasks took very long for SHAP computation (we started out with tasks taking an hour), failures were costly and utilization during the last few tasks was low. If tasks were too small for bias metrics computation, throttling limitations when instantiating a SageMaker Predictor led to job failures. The solution was to carefully tune the number of tasks separately for explainability and bias computations.

7 ILLUSTRATIVE CASE STUDY
For illustration we present a simple case study with the German Credit Data dataset from the UCI repository [13] that contains 1000 labeled credit applications. Features include credit purpose, credit amount, housing status, and employment history. Categorical attributes are converted to indicator variables, e.g., A30 means "no credits taken". We mapped the labels (Class1Good2Bad) to 0 for poor credit and 1 for good.

12 https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html, https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-explainability.html
13 Our open source implementation is available at https://github.com/aws/amazon-sagemaker-clarify.
14 https://github.com/boto/botocore/blob/develop/botocore/data/sagemaker-runtime/2017-05-13/service-2.json#L35
15 https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

Figure 5: SHAP feature importance.
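To make the combinatorial cost of the Shapley formula in §4.2 concrete, the exact computation can be sketched in a few lines of Python. This is an illustrative sketch, not Clarify's implementation: in particular, it assumes that a feature absent from a coalition S is replaced by a baseline value, which is one common way to define f(x_S).

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values phi_i for a model f over M features.

    Features outside a coalition S are replaced by `baseline` values
    (one common way to define f(x_S)). Cost is O(2^M) model
    evaluations, which is why KernelSHAP resorts to approximations
    for realistic M.
    """
    M = len(x)

    def f_coalition(S):
        # Evaluate f with features outside S set to the baseline.
        return f([x[j] if j in S else baseline[j] for j in range(M)])

    phi = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        contrib = 0.0
        for size in range(M):
            for S in combinations(others, size):
                # Weight |S|! (M - |S| - 1)! / M! from the formula in Sec. 4.2.
                weight = factorial(size) * factorial(M - size - 1) / factorial(M)
                contrib += weight * (f_coalition(set(S) | {i}) - f_coalition(set(S)))
        phi.append(contrib)
    return phi

# Toy linear model with an all-zeros baseline: phi_i recovers w_i * x_i,
# and the attributions sum to f(x) - f(baseline) (local accuracy).
f = lambda z: 2.0 * z[0] + 3.0 * z[1] + 1.0 * z[2]
phi = exact_shapley(f, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

KernelSHAP approximates the same quantity by sampling coalitions and solving a weighted regression instead of enumerating all 2^M subsets.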
Figure 3: Pre-training bias results in Data Wrangler.

                  evaluation   monitoring   overall
bias                 0.15         0.21       0.36
explainability       0.43         0.21       0.64
overall              0.58         0.42       1.00

Table 1: Break-down of Clarify jobs across method (bias / explainability) and phase (evaluation / monitoring).
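The post-training bias metrics of §4.1.3 used in this analysis reduce to simple counting over groups. A minimal sketch of three of them (DPPL, DI, and AD) for binary labels follows; the variable names and the tiny example are illustrative, not Clarify's implementation.

```python
def post_training_bias(y_true, y_pred, group):
    """Illustrative sketch of three Sec. 4.1.3 metrics for binary labels.

    group[i] is True for the advantaged group (a) and False for the
    disadvantaged group (d), mirroring the paper's subscripts.
    """
    def rate(vals):
        return sum(vals) / len(vals)

    pred_a = [p for p, g in zip(y_pred, group) if g]
    pred_d = [p for p, g in zip(y_pred, group) if not g]
    q_a, q_d = rate(pred_a), rate(pred_d)  # positive prediction rates

    dppl = q_a - q_d   # DPPL: difference in positive proportions in predicted labels
    di = q_d / q_a     # DI: disparate impact (ratio form)

    acc_a = rate([int(p == t) for p, t, g in zip(y_pred, y_true, group) if g])
    acc_d = rate([int(p == t) for p, t, g in zip(y_pred, y_true, group) if not g])
    ad = acc_a - acc_d  # AD: accuracy difference
    return dppl, di, ad

# Eight hand-made examples: first four in group a, last four in group d.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
group  = [True, True, True, True, False, False, False, False]
dppl, di, ad = post_training_bias(y_true, y_pred, group)
```

A DPPL or AD of 0 and a DI of 1 would indicate parity for these three metrics; as §8.3 notes, which metric matters is application dependent.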
Bias analysis. Before we train the model, we seek to explore biases in the dataset. Figure 3 shows Clarify integrated into Data Wrangler. Convenient drop-downs are available to select the label column and the group. Also, by default a small set of metrics is pre-selected in order to not overwhelm customers with too many metrics. No code needs to be written to conduct this analysis. From the class imbalance value of -0.94, we infer that the dataset is very imbalanced, with most examples representing foreign workers. However, as DPL is close to 0, there is not much difference in the label distributions across the two groups. With these encouraging results we proceed with the training of an XGBoost model. We submit the configuration for the post-training bias analysis (see Appendix A) as input, along with the validation data, to the Clarify job. The results shown in Figure 4 reveal a mixed picture with positive and negative metrics. We start by inspecting the disparate impact DI with a value of 0.73, much below 1. Since the label rates are almost the same across the two groups (DPL), DI is driven by differences in precision and recall. We see that foreign workers have a lower recall (RD of 0.13), but a higher precision (DAR of -0.19). Corrective steps can mitigate this bias prior to deployment.

Figure 4: Post-training bias results in Studio Trials.

Feature importance. To understand which features are affecting the model's predictions the most, we look at the global SHAP values that are averaged across all examples after taking the absolute value. The three most important features are A61 (having very little savings), the loan amount, and credit duration; see Figure 5.

8 DEPLOYMENT RESULTS
In the first few months after launch, 36% of all Clarify jobs computed bias metrics while the other 64% computed feature importance; see Table 1. Monitoring jobs made up 42% of jobs, illustrating that customers reached a mature stage in the ML lifecycle in which they incorporated bias and feature importance monitoring. This is expected in a healthy development cycle: after a few iterations on model training and evaluation, a model passes the launch criteria and is continuously monitored after.

Below, we evaluate our design goals and highlight two use cases.

8.1 Wide Applicability
Customers across various business domains use Clarify, including digital publishing, education, electronics, finance, government, hardware, healthcare, IT, manufacturing, news, sports, staffing, and telecommunications. Customers use Clarify for binary predictions, multi-class predictions, and ranking tasks. They are located in the U.S., Japan, UK, and Norway, among other countries. For two specific customer use cases see §8.4.

8.2 Scalability
We conducted experiments measuring the time it takes to compute bias metrics and feature importance. We oversampled the German credit data set described in §7, creating up to 1 million examples to study scalability. We ran experiments on ml.c5.xlarge machines, which are suitable for this workload, and provide cost estimates based on AWS pricing in the us-west-2 region.

Bias metrics. Table 2 shows that the bias metrics computation is fast even for 1M examples.

method               processing time in minutes   cost in dollars
pre-training bias                1                     $0.03
post-training bias              14                     $0.13

Table 2: Processing time and cost of bias metrics computation for 1 million examples.

Post-training bias metrics are considerably slower than pre-training metrics since they require model inference. The reported time includes the 3-5 minutes it takes to spin up the model endpoint. Scalability issues may arise when the dataset does not fit in memory (in which case a larger instance type can be chosen) or when the model inference is much slower (in which case the instance count can be increased).

Explainability. Table 3 shows that time and cost scale roughly linearly with the dataset size.

number of examples   processing time in minutes   cost in dollars
1000                            16                     $0.13
10,000                          90                     $0.71
100,000                        811                     $6.43

Table 3: Processing time and cost of SHAP with varying dataset sizes using one processing / model instance.

The implementation of SHAP iterates over all examples in a for-loop and the execution at each step dominates the time and cost. For 100,000 examples the running time of more than 13h is so long that it can be a point of frustration for customers. We parallelized SHAP to reduce the running time. Figure 6 shows the scaling behavior of SHAP on 100,000 examples. We see that increasing the number of instances (both for the processing job and the model endpoint) to two reduces the time down to 3h and 17 minutes. This super-linear speedup is surprising. It is due to the fact that our single-instance implementation does not fully utilize all cores. Our Spark implementation not only splits the workload across machines but also across cores within those machines, thereby reducing not just time but even cost. We observe that 1 model endpoint instance for every 1 to 2 processing instances provides a good ratio: A higher ratio underutilizes model endpoint instances, increasing cost without a speedup. With a lower ratio (e.g., 1 model endpoint instance for 10 processing instances), the model becomes the bottleneck, and cost as well as running time can be reduced by increasing the number of model endpoint instances.

Figure 6: Scaling of SHAP on 100,000 examples, measuring time (left) and cost (right) for a varying number of instances.

Customer results. Half of the Clarify jobs of customers take at most 4 minutes. Table 4 shows percentiles of processing times of bias as well as explainability jobs.

method           time in minutes
                 P50    P90    P99
bias              1      8      11
explainability    5     12     103
overall           4     12     101

Table 4: Percentiles of processing times in minutes.

We see that the running time is less than two hours for 99% of the jobs. Across all jobs, 35% use multiple processing instances, thereby reducing running time. At most 10 instances were used by customers so far.

8.3 Ease of use
We obtained qualitative customer feedback on usability. For customers it was not always easy to get the configuration right despite documentation. Difficulties included setting the right fields with the right values, and figuring out which fields are optional or need to go together for a specific use case. These difficulties were reduced through tooling, documentation, and examples. In terms of tooling, customers can use the explanatory UI or programmatically create the configuration. Regarding documentation, customers requested additional documentation covering specifics of multi-class prediction. Regarding examples, multiple customers with strict security requirements who used advanced configuration options to enable encryption of data at rest and in transit with fine-grained access control benefited from Clarify examples for such settings. Other customers noted that our notebooks focus on the CSV format and requested examples with the JSONLines data format. These examples illustrate the tension between the goals of wide applicability and usability: Ease of use in one setting (e.g., CSV data, binary predictions) does not extend to all settings (e.g., JSONLines data, multi-class prediction).

We received several inquiries about formulating the development, business, and regulatory needs into the framework that Clarify provides. While we were able to find a fit for many different use cases, these questions are hard and answers differ case-by-case. For example, the question of which bias metric is important depends on the application. There is no way to get around this question, as it is impossible to remove all biases [11, 24]. Here again, documentation helps customers to address this hard question.

Quantitative customer results. SageMaker AutoPilot, which automatically builds, trains, and tunes models, integrated Clarify in March 2021. AutoPilot contributes 54% of the explainability jobs. The automatic configuration functionality offered by AutoPilot, along with the other improvements listed above, helped reduce failure rates by 75%.

8.4 Two Customer Use Case Examples
Next, we illustrate the use of Clarify in production by providing two real customer use cases.

Fraud detection. Zopa's data scientists use several dozens of input features such as application details, device information, interaction behaviour, and demographics. For model building, they extract the training dataset from their Amazon Redshift data warehouse, perform data cleaning, and store it into Amazon S3. As Zopa has their own in-house ML library for both feature engineering and ML framework support, they use the "Bring Your Own Container" (BYOC) approach to leverage SageMaker's managed services
and advanced functionalities such as hyperparameter optimization. The optimized models are then deployed through a Jenkins CI/CD pipeline to the production system and serve as a microservice for real-time fraud detection as part of their customer-facing platform.

Explanations are obtained both during model training for validation, and after deployment for model monitoring and generating insights for underwriters. These are done in a non-customer-facing analytical environment due to a heavy computational requirement and a high tolerance of latency. The SageMaker Multi Model Server stack is used in a similar BYOC fashion to register the models for the Clarify processing job. Clarify spins up an ephemeral model endpoint and invokes it for millions of predictions on synthetic contrastive data. These predictions are used to compute SHAP values for each individual example, which are stored in S3.

For model validation and monitoring, Clarify aggregates these local feature importance values across examples in the training/monitoring data to obtain global feature importance, and visualizes them in SageMaker Studio. To give insights to operations, the data scientists select the features with the highest SHAP values that contributed most to a positive fraud score (i.e., likely fraudster) for each individual example, and report them to the underwriter.

Explanation of a soccer scoring probability model. In the context of soccer matches, the DFL (Deutsche Fußball Liga) applied Clarify to their xGoal model. The model predicts the likelihood that an attempted shot will result in a goal. They found that overall the features AngleToGoal and DistanceToGoal have the highest importance across all examples. Let us consider one of the most interesting games of the 2019/2020 Bundesliga season, where Borussia Dortmund beat Augsburg in a 5-3 thriller on January 18, 2020. Taking a look at the sixth goal of the game, scored by Jadon Sancho, the model predicted the goal with a high xGoal prediction of 0.18, compared to the average prediction of 0.0871 across the past three seasons. What explains this high prediction? The SHAP analysis reveals that PressureSum, DistanceToGoalClosest, and AmountofPlayers contribute the most to this prediction. This is a noticeable deviation from the features that are globally the most important. What makes this explanation convincing is the unusual nature of this goal: Sancho received a highly offensive through ball well into Augsburg's half (past all defenders, hence a low PressureSum), only to dart round the keeper and tap it into the goal near the posts (low AngleToGoal).

9 LESSONS LEARNED IN PRACTICE
Wide Applicability: We learned that wide applicability across datasets and model endpoints not only requires flexible parsing logic (see §6), but also creates a tension with ease of use: Our code tries to infer whether a file contains a header, or whether a feature is categorical or numerical. While it is convenient for most customers not to have to specify all these details, for some customers our inference does not work well. This tension is further explored in §8.3.

Scalability: When working towards our goal of creating a scalable application, we learned that bias and explainability analyses have different resource demands (SHAP can be compute intensive, while bias metrics can be memory intensive), and hence require separate tuning of the parallelism. For explainability we had to work on small chunks of the dataset, while for bias reports we had to work on large chunks. For details and other technical lessons, see §6.1.

Usability: In retrospect, our biggest lesson concerning usability is one we should have recognized prior to the usability studies: we cannot rely heavily on documentation. Instead we need to enable users to try out the feature without reading documentation. Early implementations of Clarify required the user to create their JSON configuration file manually. This was prone to both schema errors caused by users creating malformed JSON, and soft errors caused by users defining metrics using incorrect fields or labels. To avoid such errors we developed an explanatory UI. Now, we ask the customer to "Select the column to analyze for bias" and provide a drop-down menu, or programmatically construct the configuration for them. We are further revising the UI; see §8.3.

Best Practices: We recognize that the notions of bias and fairness are highly dependent on the application and that the choice of the attribute(s) for which to measure bias, as well as the choice of the bias metrics, can be guided by social, legal, and other non-technical considerations. Our experience suggests that the successful adoption of fairness-aware ML approaches in practice requires building consensus and achieving collaboration across key stakeholders (such as product, policy, legal, engineering, and AI/ML teams, as well as end users and communities). Further, fairness and explainability related ethical considerations need to be taken into account during each stage of the ML lifecycle, for example, by asking the questions stated in Figure 2.

10 CONCLUSION
Motivated by the need for bias detection and explainability from regulatory, business, and data science perspectives, we presented Amazon SageMaker Clarify, which helps to detect bias in data and models and to explain predictions made by models. We described the design considerations, methodology, and technical architecture of Clarify, as well as how it is integrated across the ML lifecycle (data preparation, model evaluation, and monitoring post deployment). We discussed the technical challenges and resolutions, a case study, customer use cases, deployment results, and lessons learned in practice. Considering that Clarify is a scalable, cloud-based bias and explainability service designed to address the needs of customers from multiple industries, the insights and experience from our work are likely to be of interest to researchers and practitioners working on responsible AI systems.

While we have focused on providing tools for bias and explainability in the context of an ML lifecycle, it is important to realize that this is an ongoing effort that is part of a larger process. Another key challenge is to navigate the trade-offs in the context of an application between model accuracy, interpretability, various bias metrics, and related dimensions such as privacy and robustness (see, for example, [11, 23, 24, 32]). It cannot be overemphasized that developing AI solutions needs to be thought of more broadly as a process involving iterated inputs from and discussions with key stakeholders such as product, policy, legal, engineering, and AI/ML teams, as well as end users and communities, and asking relevant questions during all stages of the ML lifecycle.

REFERENCES
[1] Big data: Seizing opportunities, preserving values. https://obamawhitehouse.archives.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf, 2014.
[2] Big data: A report on algorithmic systems, opportunity, and civil rights. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf, 2016.
[3] Big data: A tool for inclusion or exclusion? Understanding the issues (FTC report). https://www.ftc.gov/reports/big-data-tool-inclusion-or-exclusion-understanding-issues-ftc-report, 2016.
[4] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In ICML, 2018.
[5] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019.
[6] S. Barocas, A. D. Selbst, and M. Raghavan. The hidden assumptions behind counterfactual explanations and principal reasons. In FAccT, 2020.
[7] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 2018.
[8] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. Moura, and P. Eckersley. Explainable machine learning in deployment. In FAccT, 2020.
[9] E. Black, S. Yeom, and M. Fredrikson. FlipTest: Fairness testing via optimal transport. In FAccT, 2020.
[10] T. Calders and S. Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2), Sept. 2010.
[11] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 2017.
[12] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In NeurIPS, 2018.
[13] D. Dua and C. Graff. UCI machine learning repository, 2017.
[14] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In ITCS, 2012.
[15] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
[16] S. C. Geyik, S. Ambler, and K. Kenthapadi.
[18] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 2018.
[19] M. Hardt, E. Price, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NeurIPS, 2016.
[20] K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach. Improving fairness in machine learning systems: What do industry practitioners need? In CHI, 2019.
[21] D. Janzing, L. Minorics, and P. Bloebaum. Feature relevance quantification in explainable AI: A causal problem. In AISTATS, 2020.
[22] N. Kilbertus, M. Rojas Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In NeurIPS, 2017.
[23] J. Kleinberg and S. Mullainathan. Simplicity creates inequity: Implications for fairness, stereotypes, and interpretability. In EC, 2019.
[24] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In ITCS, 2017.
[25] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In NeurIPS, 2017.
[26] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec. Faithful and customizable explanations of black box models. In AIES, 2019.
[27] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms in Amazon SageMaker. In SIGMOD, 2020.
[28] Z. C. Lipton. The mythos of model interpretability. ACM Queue, 16(3), 2018.
[29] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017.
[30] S. M. Lundberg, B. Nair, M. S. Vavilala, M. Horibe, M. J. Eisses, T. Adams, D. E. Liston, D. K.-W. Low, S.-F. Newman, J. Kim, and S.-I. Lee. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10), Oct. 2018.
[31] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
[32] S. Milli, L. Schmidt, A. D. Dragan, and M. Hardt. Model reconstruction from model explanations. In FAT*, 2019.
[33] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru. Model cards for model reporting. In FAccT, 2019.
[34] J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3, 2009.
[35] V. Perrone, M. Donini, K. Kenthapadi, and C. Archambeau. Fair Bayesian optimization. In ICML AutoML Workshop, 2020.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In KDD, 2016.
[37] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), May 2019.
[38] L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28), 1953.
[39] S. Sharifi-Malvajerdi, M. Kearns, and A. Roth. Average individual fairness: Algorithms, generalization and experiments. In NeurIPS, 2019.
[40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.
[41] A. Singh and T. Joachims. Fairness of exposure in rankings. In KDD, 2018.
[42] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, 2017.
[43] S. Vasudevan and K. Kenthapadi. LiFT: A scalable framework for measuring fairness in ML applications. In CIKM, 2020.
[44] S. Wachter, B. Mittelstadt, and C. Russell. Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. SSRN Scholarly Paper ID 3547922, Social Science Research Network, Mar. 2020.
[45] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 26(1), 2020.
[46] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In WWW, 2017.
[47] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with adversarial learning. In AIES, 2018.
Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In KDD, 2019. [17] B. Goodman and S. Flaxman. European union regulations on algorithmic decision- making and a “right to explanation”. AI magazine, 2017. ACKNOWLEDGMENTS A.3 Detailed Example We would like to thank our colleagues at AWS for their contri- Below we show an example analysis config. It is minimal and ad- butions to Clarify. We thank Edi Razum, Luuk Figdor, Hasan ditional configuration options (e.g. to support datasets without Poonawala, the DFL and Zopa for contributions to the customer headers or model endpoints using JSON format) exist. use case. 1
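Since the analysis configuration is plain JSON, it can also be assembled and sanity-checked programmatically before a job is launched. A minimal standard-library sketch (the SHAP baseline S3 URI is truncated in the printed example, so a placeholder bucket name is used here; the variable names are ours):

```python
import json

# Assemble the example analysis configuration as a Python dict.
# NOTE: the S3 baseline URI below is a placeholder, not a real bucket.
analysis_config = {
    "dataset_type": "text/csv",
    "label": "Class1Good2Bad",
    "label_values_or_threshold": [1],
    "facet": [
        {"name_or_index": "ForeignWorker", "value_or_threshold": [1]},
    ],
    "group_variable": "A151",
    "methods": {
        "shap": {
            "baseline": "s3://my-bucket/baseline.csv",  # placeholder URI
            "agg_method": "mean_abs",
        },
        "pre_training_bias": {"methods": "all"},
        "post_training_bias": {"methods": "all"},
        "report": {"name": "report", "title": "Clarify Analysis Report"},
    },
    "predictor": {
        "model_name": "german-xgb",
        "instance_type": "ml.c5.xlarge",
        "initial_instance_count": 1,
    },
}

# Serialize; a round-trip check guards against non-JSON values sneaking in.
config_json = json.dumps(analysis_config, indent=2)
assert json.loads(config_json) == analysis_config
```

Generating the configuration this way keeps dataset-specific values (label, facet, predictor) in one place and catches malformed structures before any compute is spent.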
{
  "dataset_type": "text/csv",
  "label": "Class1Good2Bad",
  "label_values_or_threshold": [1],
  "facet": [
    {
      "name_or_index": "ForeignWorker",
      "value_or_threshold": [1]
    }
  ],
  "group_variable": "A151",
  "methods": {
    "shap": {
      "baseline": "s3://...",
      "agg_method": "mean_abs"
    },
    "pre_training_bias": {
      "methods": "all"
    },
    "post_training_bias": {
      "methods": "all"
    },
    "report": {
      "name": "report",
      "title": "Clarify Analysis Report"
    }
  },
  "predictor": {
    "model_name": "german-xgb",
    "instance_type": "ml.c5.xlarge",
    "initial_instance_count": 1
  }
}

A DETAILED CONFIGURATION

A.1 Additional Bias Metrics
Recall that we defined the fraction of positively labeled examples in group a as q_a = n_a^(1) / n_a (and q_d = n_d^(1) / n_d for group d, respectively). These fractions define a probability distribution P_a over labels for group a, assigning q_a to the positive label and 1 − q_a to the negative label, and likewise P_d with q_d and 1 − q_d. Differences between these probability distributions yield additional pre-training bias metrics:

KL: The Kullback–Leibler divergence measures how much information is needed to move from one probability distribution P_a to another P_d:
  KL(P_a, P_d) = q_a log(q_a / q_d) + (1 − q_a) log((1 − q_a) / (1 − q_d)).

JS: The Jensen–Shannon divergence is a symmetrized version of the KL divergence. Denoting by P the average of the label distributions of the two groups, we define
  JS(P_a, P_d, P) = (1/2) [KL(P_a, P) + KL(P_d, P)].

L_p: We can also consider the L_p norm between the probability distributions of the labels for the two groups. For p ≥ 1, we have
  L_p(P_a, P_d) = [Σ_y |P_a(y) − P_d(y)|^p]^(1/p).

TVD: The total variation distance is half the L_1 distance:
  TVD = (1/2) L_1(P_a, P_d).

KS: The Kolmogorov–Smirnov metric considers the maximum difference across label values:
  KS = max_y |P_a(y) − P_d(y)|.

For the post-training bias metrics we additionally have a flip test.

FT: Counterfactual fliptest: FT assesses whether similar examples across the two groups receive similar predictions. For each example in d, we count how often its k nearest neighbors in a receive a different prediction. Simplifying [9], we define FT = (F+ − F−)/n_d, where F+ (F−, resp.) is the number of examples in d with a negative (positive, resp.) prediction whose nearest neighbors in a have a positive (negative, resp.) prediction.

A.2 Configuration Options
Table 5 illustrates the flexible support for model outputs.

Example model output                  | Configuration of predictor
C3,"[0.1,0.3,0.6]"                    | "label": 0, "probability": 1
"[0.1, 0.3, 0.6]"                     | "probability": 0, "label_headers": ["C1", "C2", "C3"]
{"pred": "C3", "sc": [0.1, 0.3, 0.6]} | "label": "pred", "probability": "sc", "content_type": "application/jsonlines"

Table 5: Illustration of multi-class model outputs supported by Clarify. The model predicts scores for classes C1, C2, C3.
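To make the predictor configurations in Table 5 concrete, the sketch below (our own helper functions, not Clarify's internal parser) shows how each configuration locates the predicted label and the class scores in the corresponding model output:

```python
import csv
import io
import json

def parse_csv_output(line, label_index, prob_index):
    """Row 1 of Table 5: CSV output such as C3,"[0.1,0.3,0.6]" with the
    label at one column index and the score array at another."""
    fields = next(csv.reader(io.StringIO(line)))
    return fields[label_index], json.loads(fields[prob_index])

def parse_scores_with_headers(line, prob_index, label_headers):
    """Row 2 of Table 5: the output contains only scores; the predicted
    label is the header of the highest-scoring class."""
    fields = next(csv.reader(io.StringIO(line)))
    scores = json.loads(fields[prob_index])
    return label_headers[scores.index(max(scores))], scores

def parse_jsonlines_output(line, label_key, prob_key):
    """Row 3 of Table 5: application/jsonlines output addressed by keys."""
    record = json.loads(line)
    return record[label_key], record[prob_key]
```

For example, `parse_csv_output('C3,"[0.1,0.3,0.6]"', 0, 1)` yields the label `"C3"` and the scores `[0.1, 0.3, 0.6]`; the other two helpers recover the same pair from the score-only and JSON Lines formats.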
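As a concrete illustration of the distribution metrics in A.1 and the simplified flip test, here is a minimal sketch for binary labels (function names are ours; for brevity the flip test uses a brute-force single nearest neighbor rather than k):

```python
import math

def label_distribution(q):
    """Binary label distribution: probability q for the positive label."""
    return [q, 1.0 - q]

def kl(p, q):
    """KL divergence: sum_y P_a(y) log(P_a(y) / P_d(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture P."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def lp(p, q, order=2):
    """L_p norm between the two label distributions, for order >= 1."""
    return sum(abs(pi - qi) ** order for pi, qi in zip(p, q)) ** (1.0 / order)

def tvd(p, q):
    """Total variation distance: half the L_1 distance."""
    return 0.5 * lp(p, q, order=1)

def ks(p, q):
    """Kolmogorov-Smirnov: maximum difference across label values."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

def flip_test(d_examples, d_preds, a_examples, a_preds):
    """Simplified flip test FT = (F+ - F-) / n_d using one nearest
    neighbor (squared Euclidean distance) in group a per example in d."""
    f_plus = f_minus = 0
    for x, pred in zip(d_examples, d_preds):
        nn = min(range(len(a_examples)),
                 key=lambda i: sum((xj - aj) ** 2
                                   for xj, aj in zip(x, a_examples[i])))
        if pred == 0 and a_preds[nn] == 1:
            f_plus += 1    # negative prediction, but neighbor is positive
        elif pred == 1 and a_preds[nn] == 0:
            f_minus += 1   # positive prediction, but neighbor is negative
    return (f_plus - f_minus) / len(d_examples)
```

For instance, with q_a = 0.6 and q_d = 0.3 both TVD and KS equal 0.3, since for a binary distribution the positive- and negative-label differences have the same magnitude.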