Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud
Michaela Hardt1, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai2, Keerthan Vasist, Pinar Yilmaz, M. Bilal Zafar, Sanjiv Das3, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi
Amazon Web Services

ABSTRACT
Understanding the predictions made by machine learning (ML) models and their potential biases remains a challenging and labor-intensive task that depends on the application, the dataset, and the specific model. We present Amazon SageMaker Clarify, an explainability feature for Amazon SageMaker that launched in December 2020, providing insights into data and ML models by identifying biases and explaining predictions. It is deeply integrated into Amazon SageMaker, a fully managed service that enables data scientists and developers to build, train, and deploy ML models at any scale. Clarify supports bias detection and feature importance computation across the ML lifecycle, during data preparation, model evaluation, and post-deployment monitoring. We outline the desiderata derived from customer input, the modular architecture, and the methodology for bias and explanation computations. Further, we describe the technical challenges encountered and the tradeoffs we had to make. For illustration, we discuss two customer use cases. We present our deployment results, including qualitative customer feedback and a quantitative evaluation. Finally, we summarize lessons learned and discuss best practices for the successful adoption of fairness and explanation tools in practice.

KEYWORDS
Machine learning, Fairness, Explainability

ACM Reference Format:
Michaela Hardt, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai, Keerthan Vasist, Pinar Yilmaz, M. Bilal Zafar, Sanjiv Das, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi. 2021. Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '21), August 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3447548.3467177

1 INTRODUCTION
Machine learning (ML) models and data-driven systems are increasingly used to assist in decision-making across domains such as financial services, healthcare, education, and human resources. Benefits of using ML include improved accuracy, increased productivity, and cost savings. The increasing adoption of ML is the result of multiple factors, most notably ubiquitous connectivity, the ability to collect, aggregate, and process large amounts of data using cloud computing, and improved access to increasingly sophisticated ML models. In high-stakes settings, tools for bias detection and explainability in the ML lifecycle are of particular importance.

Regulatory [1, 2]: Many ML scenarios require an understanding of why the ML model made a specific prediction or whether its prediction was biased. Recently, policymakers, regulators, and advocates have expressed concerns about the potentially discriminatory impact of ML and data-driven systems, for example, due to inadvertent encoding of bias into automated decisions. Informally, biases can be viewed as imbalances in the training data or the prediction behavior of the model across different groups, such as age or income bracket.

Business [3, 17]: The adoption of AI systems in regulated domains requires explanations of how models make predictions. Model explainability is particularly important in industries with reliability, safety, and compliance requirements, such as financial services, human resources, healthcare, and automated transportation. Using a financial example of lending applications, loan officers, customer service representatives, compliance officers, and end users often require explanations detailing why a model made certain predictions.

Data Science: Data scientists and ML engineers need tools to debug and improve ML models, to ensure a diverse and unbiased dataset during feature engineering, to determine whether a model is making inferences based on noisy or irrelevant features, and to understand the limitations and failure modes of their model.

While there are several open-source tools for fairness and explainability, we learned from our customers that developers often find it tedious to incorporate these tools across the ML workflow,

1 Correspondence to: Michaela Hardt,
DD = n_d^(0)/n^(0) − n_d^(1)/n^(1), and compute it across all strata i on a user-supplied attribute that we want to control for, yielding CDDL = (1/n) Σ_i n_i · DD_i, where i subscripts each stratum and n_i is the number of examples in it.

4.1.3 Post-Training Metrics.

DPPL Difference in positive proportions in predicted labels: consider the difference in positive prediction rates: DPPL = q̂_a − q̂_d; related is "statistical parity" [7, 10].

DI Disparate Impact: Instead of the difference as in DPPL, we can consider the ratio: DI = q̂_d / q̂_a.

DCA Difference in Conditional Acceptance / Rejection: DCA considers the ratio of the number of positive labels to that of positive predictions, assessing calibration: DCA = n_a^(1)/n̂_a^(1) − n_d^(1)/n̂_d^(1). Analogously, DCR considers negative labels and predictions.

AD Accuracy Difference: We compare the accuracy, i.e., the fraction of examples for which the prediction is equal to the label, across groups. We define AD = (TP_a + TN_a)/n_a − (TP_d + TN_d)/n_d.

RD Recall Difference: We compare the recall (the fraction of positive examples that receive a positive prediction) across groups: RD = TP_a/n_a^(1) − TP_d/n_d^(1) (aka "equal opportunity" [19]).

DAR Difference in Acceptance / Rejection Rates: This metric considers precision, i.e., the fraction of positive predictions that are correct: DAR = TP_a/n̂_a^(1) − TP_d/n̂_d^(1). Analogously, we define DRR for negative predictions.

TE Treatment Equality: We define TE = FN_d/FP_d − FN_a/FP_a [7].

CDDPL Conditional Demographic Disparity of Predicted Labels: This metric is analogous to the CDDL, computed on predictions.

See §A.1 for additional metrics.

4.2 Explainability
We implemented the KernelSHAP [29] algorithm due to its good performance [36] and wide applicability. The KernelSHAP algorithm [29] works as follows: Given an input x ∈ R^M and the model output f(x) ∈ R, SHAP, building on the Shapley value framework [38], designates the prediction task as an M-player cooperative game, with individual features being players and the predicted output f(x) being the total payout of the game. The framework decides how important each player is to the game's outcome. Specifically, the importance of the i-th player x_i is defined as the average marginal contribution of the player to a game not containing x_i. Formally, the contribution φ_i is defined as:

    φ_i(x) = Σ_S [ |S|! (M − |S| − 1)! / M! ] · ( f(x_S ∪ x_i) − f(x_S) ),

where the sum is computed over all the player coalitions S not containing x_i. The computation involves iterating over 2^M feature coalitions, which does not scale to even moderate values of M, e.g., 50. To overcome these scalability issues, KernelSHAP uses approximations [29].

4.3 Monitoring bias and explanations
Measuring bias metrics and feature importance provides insights into the model at the time of evaluation. However, a change in data distribution can occur over time, resulting in different bias metrics and feature importance. Our goal is to continuously monitor for such changes and raise alerts if bias metrics or feature importance drift from reference values established during an earlier evaluation (before the model launch). Based on a user-specified schedule (e.g., hourly), newly accumulated data is analyzed and compared against the reference values.

Bias. Given a bias metric from §4.1, we let the user specify a range of reference values A = (a_min, a_max) in which the new bias value b computed on the live data should lie. However, if little data was accumulated, simply checking for containment can lead to noisy results. To ensure that the conclusions drawn from the observed live data are statistically significant, we check for overlap of A with the 95% bootstrap confidence interval of b [15].

Explainability. For feature importance we consider the change in ranking of features by their importance. Specifically, we use the Normalized Discounted Cumulative Gain (nDCG) score for comparing the ranking by feature importance of the reference data with that based on the live data. nDCG penalizes changes further down in the ranking less. Also, swapping the position of two values is penalized relative to the difference of their original importance. Let F = [f_1, ..., f_M] denote the list of M features sorted w.r.t. their importance scores in the reference data, and F' = [f'_1, ..., f'_M] denote the new ranking of the features based on the live data. We use a(f) to describe a function that returns the feature importance score on the reference data of a feature f. Then, nDCG is defined as: nDCG = DCG/iDCG, where DCG = Σ_{i=1}^{M} a(f'_i)/log2(i + 1) and iDCG = Σ_{i=1}^{M} a(f_i)/log2(i + 1). An nDCG value of 1 indicates no change in the ranking. In our implementation, an nDCG value below 0.90 triggers an alert.

5 ALTERNATIVE DESIGN AND METHODS
On the architectural side, our design provisions a separate model server, which adds latency. Alternatively, a design that runs the model locally avoids network traffic. We decided to use a model server so that we can re-use the existing SageMaker model hosting primitives. Our model-agnostic approach enables support of a wide range of models, including those that cannot be executed locally. This way we can scale and tune the instance type selection and counts for the processing job and the model serving instances independently, based on the dataset and model characteristics.

For the bias metrics, we considered limiting the number of metrics to avoid overwhelming users. However, the appropriate bias metric is application dependent. To support customers from different industries with various use cases, we decided to implement a wide range of metrics (see §4.1).

For explainability we decided to implement the KernelSHAP [29] algorithm (see §4.2) rather than alternatives discussed in §2 for a number of reasons: (i) it offers wide applicability to interpret predictions of any score-based model; (ii) it works in a model-agnostic manner and does not require access to model internals such as gradients, allowing us to build on top of the SageMaker hosting primitives; (iii) it provides attractive theoretical properties, for instance, all the feature importance scores sum to the total class score; and (iv) benchmarking shows that it performs favorably compared to related methods such as LIME [36]. More efficient model-specific implementations like TreeSHAP and DeepSHAP [29] leverage the model architecture for improved efficiency. By using the model-agnostic KernelSHAP, we traded efficiency for wider applicability. In the future we will prioritize additional explainability methods based on customer requests.

6 DEVELOPMENT
Timeline: We started by collecting customer requirements to derive Clarify's goals and design the system (§3). Our first prototype implemented the methods on a single node, packaged into a container that could be run with a bare-bones configuration file. After internal tests, we deployed this version for beta customers to solicit feedback. This feedback helped us to prioritize scalability (§8.2) and ease-of-use improvements (§8.3). Afterwards, we expanded support of models and datasets, conducted more end-to-end tests, and obtained additional rounds of customer feedback. We embarked on the integration across the model lifecycle (§3.3), emphasizing the associated UI components and documentation.12 Prior to public launch, we set up pipelines and dashboards for health monitoring of our system, including alarms for too many failed jobs. Throughout, multiple reviews took place for the design, usability, security and privacy, and operational readiness.

Implementation Details: For wide applicability, we implemented flexible parsers to support various data formats (JSONLines/CSV, with(-out) header) and model signatures; see §A.2 for examples. The first bias metrics were implemented in numpy.13 For scalability, all subsequent methods were implemented using Spark via SageMaker's Spark Processing job. Spark provided us with convenient abstractions (e.g., distributed median computations) and managed the cluster for us. Our bias metrics mostly require counts that are easy to distribute. Distributing SHAP only requires one map operation to calculate feature importance for each example, followed by an aggregation for global model feature importance.

6.1 Challenges and Resolutions
A challenge to fault tolerance is that Clarify's offline batch processing job hits a live endpoint. This requires a careful implementation to avoid overwhelming the endpoint with requests. To limit the potential impact, Clarify spins up separate model servers and does not interfere with production servers. To avoid overwhelming these separate endpoints, we dynamically configured the batch size based on experimentation, starting with the largest possible batch size based on the maximum payload14 in order to achieve high throughput at the model endpoints. Furthermore, we relied on the boto clients' retry policies15 with exponential back-off. This is necessary to tolerate intermittent inference call failures. We further expanded the retry mechanism, since not all errors are considered retriable in the boto client.

A key to scalability was to utilize resources well by carefully tuning the number of partitions of the dataset in Spark: If tasks took very long for SHAP computation (we started out with tasks taking an hour), failures were costly and utilization during the last few tasks was low. If tasks were too small for bias metrics computation, throttling limitations when instantiating a SageMaker Predictor led to job failures. The solution was to carefully tune the number of tasks separately for explainability and bias computations.

7 ILLUSTRATIVE CASE STUDY
For illustration we present a simple case study with the German Credit Data dataset from the UCI repository [13] that contains 1000 labeled credit applications. Features include credit purpose, credit amount, housing status, and employment history. Categorical attributes are converted to indicator variables, e.g., A30 means "no credits taken". We mapped the labels (Class1Good2Bad) to 0 for poor credit and 1 for good.

12 https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training-bias.html, https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model-explainability.html
13 Our open source implementation is available at https://github.com/aws/amazon-sagemaker-clarify.
14 https://github.com/boto/botocore/blob/develop/botocore/data/sagemaker-runtime/2017-05-13/service-2.json#L35
15 https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

Figure 5: SHAP feature importance.
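To make the combinatorial cost of the Shapley formula in §4.2 concrete, the exact computation can be sketched in a few lines of Python. This is an illustrative sketch, not Clarify's implementation: in particular, it assumes that a feature absent from a coalition S is replaced by a baseline value, which is one common way to define f(x_S).

```python
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values phi_i for a model f over M features.

    Features outside a coalition S are replaced by `baseline` values
    (one common way to define f(x_S)). Cost is O(2^M) model
    evaluations, which is why KernelSHAP resorts to approximations
    for realistic M.
    """
    M = len(x)

    def f_coalition(S):
        # Evaluate f with features outside S set to the baseline.
        return f([x[j] if j in S else baseline[j] for j in range(M)])

    phi = []
    for i in range(M):
        others = [j for j in range(M) if j != i]
        contrib = 0.0
        for size in range(M):
            for S in combinations(others, size):
                # Weight |S|! (M - |S| - 1)! / M! from the formula in Sec. 4.2.
                weight = factorial(size) * factorial(M - size - 1) / factorial(M)
                contrib += weight * (f_coalition(set(S) | {i}) - f_coalition(set(S)))
        phi.append(contrib)
    return phi

# Toy linear model with an all-zeros baseline: phi_i recovers w_i * x_i,
# and the attributions sum to f(x) - f(baseline) (local accuracy).
f = lambda z: 2.0 * z[0] + 3.0 * z[1] + 1.0 * z[2]
phi = exact_shapley(f, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
```

KernelSHAP approximates the same quantity by sampling coalitions and solving a weighted regression instead of enumerating all 2^M subsets.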
Figure 3: Pre-training bias results in Data Wrangler.

                  evaluation   monitoring   overall
bias                 0.15         0.21       0.36
explainability       0.43         0.21       0.64
overall              0.58         0.42       1.00

Table 1: Break-down of Clarify jobs across method (bias / explainability) and phase (evaluation / monitoring).
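The post-training bias metrics of §4.1.3 used in this analysis reduce to simple counting over groups. A minimal sketch of three of them (DPPL, DI, and AD) for binary labels follows; the variable names and the tiny example are illustrative, not Clarify's implementation.

```python
def post_training_bias(y_true, y_pred, group):
    """Illustrative sketch of three Sec. 4.1.3 metrics for binary labels.

    group[i] is True for the advantaged group (a) and False for the
    disadvantaged group (d), mirroring the paper's subscripts.
    """
    def rate(vals):
        return sum(vals) / len(vals)

    pred_a = [p for p, g in zip(y_pred, group) if g]
    pred_d = [p for p, g in zip(y_pred, group) if not g]
    q_a, q_d = rate(pred_a), rate(pred_d)  # positive prediction rates

    dppl = q_a - q_d   # DPPL: difference in positive proportions in predicted labels
    di = q_d / q_a     # DI: disparate impact (ratio form)

    acc_a = rate([int(p == t) for p, t, g in zip(y_pred, y_true, group) if g])
    acc_d = rate([int(p == t) for p, t, g in zip(y_pred, y_true, group) if not g])
    ad = acc_a - acc_d  # AD: accuracy difference
    return dppl, di, ad

# Eight hand-made examples: first four in group a, last four in group d.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
group  = [True, True, True, True, False, False, False, False]
dppl, di, ad = post_training_bias(y_true, y_pred, group)
```

A DPPL or AD of 0 and a DI of 1 would indicate parity for these three metrics; as §8.3 notes, which metric matters is application dependent.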
Bias analysis. Before we train the model, we seek to explore biases in the dataset. Figure 3 shows Clarify integrated into Data Wrangler. Convenient drop-downs are available to select the label column and the group. Also, by default a small set of metrics is pre-selected in order to not overwhelm customers with too many metrics. No code needs to be written to conduct this analysis. From the class imbalance value of -0.94, we infer that the dataset is very imbalanced, with most examples representing foreign workers. However, as DPL is close to 0, there is not much difference in the label distributions across the two groups. With these encouraging results we proceed with the training of an XGBoost model. We submit the configuration for the post-training bias analysis (see Appendix A) as input, along with the validation data, to the Clarify job. The results shown in Figure 4 reveal a mixed picture with positive and negative metrics. We start by inspecting the disparate impact DI with a value of 0.73, much below 1. Since the label rates are almost the same across the two groups (DPL), DI is driven by differences in precision and recall. We see that foreign workers have a lower recall (RD of 0.13), but a higher precision (DAR of -0.19). Corrective steps can mitigate this bias prior to deployment.

Figure 4: Post-training bias results in Studio Trials.

Feature importance. To understand which features are affecting the model's predictions the most, we look at the global SHAP values that are averaged across all examples after taking the absolute value. The three most important features are A61 (having very little savings), the loan amount, and credit duration; see Figure 5.

8 DEPLOYMENT RESULTS
In the first few months after launch, 36% of all Clarify jobs computed bias metrics while the other 64% computed feature importance; see Table 1. Monitoring jobs made up 42% of jobs, illustrating that customers reached a mature stage in the ML lifecycle in which they incorporated bias and feature importance monitoring. This is expected in a healthy development cycle: after a few iterations on model training and evaluation, a model passes the launch criteria and is continuously monitored after.

Below, we evaluate our design goals and highlight two use cases.

8.1 Wide Applicability
Customers across various business domains use Clarify, including digital publishing, education, electronics, finance, government, hardware, healthcare, IT, manufacturing, news, sports, staffing, and telecommunications. Customers use Clarify for binary predictions, multi-class predictions, and ranking tasks. They are located in the U.S., Japan, UK, and Norway, among other countries. For two specific customer use cases see §8.4.

8.2 Scalability
We conducted experiments measuring the time it takes to compute bias metrics and feature importance. We oversampled the German credit data set described in §7, creating up to 1 million examples to study scalability. We ran experiments on ml.c5.xlarge machines, which are suitable for this workload, and provide cost estimates based on AWS pricing in the us-west-2 region.

Bias metrics. Table 2 shows that the bias metrics computation is fast even for 1M examples.

method               processing time in minutes   cost in dollars
pre-training bias                1                     $0.03
post-training bias              14                     $0.13

Table 2: Processing time and cost of bias metrics computation for 1 million examples.

Post-training bias metrics are considerably slower than pre-training metrics since they require model inference. The reported time includes the 3-5 minutes it takes to spin up the model endpoint. Scalability issues may arise when the dataset does not fit in memory (in which case a larger instance type can be chosen) or when the model inference is much slower (in which case the instance count can be increased).

Explainability. Table 3 shows that time and cost scale roughly linearly with the dataset size.

number of examples   processing time in minutes   cost in dollars
1000                            16                     $0.13
10,000                          90                     $0.71
100,000                        811                     $6.43

Table 3: Processing time and cost of SHAP with varying dataset sizes using one processing / model instance.

The implementation of SHAP iterates over all examples in a for-loop and the execution at each step dominates the time and cost. For 100,000 examples the running time of more than 13h is so long that it can be a point of frustration for customers. We parallelized SHAP to reduce the running time. Figure 6 shows the scaling behavior of SHAP on 100,000 examples. We see that increasing the number of instances (both for the processing job and the model endpoint) to two reduces the time down to 3h and 17 minutes. This super-linear speedup is surprising. It is due to the fact that our single-instance implementation does not fully utilize all cores. Our Spark implementation not only splits the workload across machines but also across cores within those machines, thereby reducing not just time but even cost. We observe that 1 model endpoint instance for every 1 to 2 processing instances provides a good ratio: A higher ratio underutilizes model endpoint instances, increasing cost without a speedup. With a lower ratio (e.g., 1 model endpoint instance for 10 processing instances), the model becomes the bottleneck, and cost as well as running time can be reduced by increasing the number of model endpoint instances.

Figure 6: Scaling of SHAP on 100,000 examples, measuring time (left) and cost (right) for a varying number of instances.

Customer results. Half of the Clarify jobs of customers take at most 4 minutes. Table 4 shows percentiles of processing times of bias as well as explainability jobs.

method           time in minutes
                 P50    P90    P99
bias              1      8      11
explainability    5     12     103
overall           4     12     101

Table 4: Percentiles of processing times in minutes.

We see that the running time is less than two hours for 99% of the jobs. Across all jobs, 35% use multiple processing instances, thereby reducing running time. At most 10 instances were used by customers so far.

8.3 Ease of use
We obtained qualitative customer feedback on usability. For customers it was not always easy to get the configuration right despite documentation. Difficulties included setting the right fields with the right values, and figuring out which fields are optional or need to go together for a specific use case. These difficulties were reduced through tooling, documentation, and examples. In terms of tooling, customers can use the explanatory UI or programmatically create the configuration. Regarding documentation, customers requested additional documentation covering specifics of multi-class prediction. Regarding examples, multiple customers with strict security requirements who used advanced configuration options to enable encryption of data at rest and in transit with fine-grained access control benefited from Clarify examples for such settings. Other customers noted that our notebooks focus on the CSV format and requested examples with the JSONLines data format. These examples illustrate the tension between the goals of wide applicability and usability: Ease of use in one setting (e.g., CSV data, binary predictions) does not extend to all settings (e.g., JSONLines data, multi-class prediction).

We received several inquiries about formulating the development, business, and regulatory needs into the framework that Clarify provides. While we were able to find a fit for many different use cases, these questions are hard and answers differ case-by-case. For example, the question of which bias metric is important depends on the application. There is no way to get around this question, as it is impossible to remove all biases [11, 24]. Here again, documentation helps customers to address this hard question.

Quantitative customer results. SageMaker AutoPilot, which automatically builds, trains, and tunes models, integrated Clarify in March 2021. AutoPilot contributes 54% of the explainability jobs. The automatic configuration functionality offered by AutoPilot, along with the other improvements listed above, helped reduce failure rates by 75%.

8.4 Two Customer Use Case Examples
Next, we illustrate the use of Clarify in production by providing two real customer use cases.

Fraud detection. Zopa's data scientists use several dozens of input features such as application details, device information, interaction behaviour, and demographics. For model building, they extract the training dataset from their Amazon Redshift data warehouse, perform data cleaning, and store it into Amazon S3. As Zopa has their own in-house ML library for both feature engineering and ML framework support, they use the "Bring Your Own Container" (BYOC) approach to leverage SageMaker's managed services
and advanced functionalities such as hyperparameter optimization. The optimized models are then deployed through a Jenkins CI/CD pipeline to the production system and serve as a microservice for real-time fraud detection as part of their customer-facing platform.

Explanations are obtained both during model training for validation, and after deployment for model monitoring and generating insights for underwriters. These are done in a non-customer-facing analytical environment due to a heavy computational requirement and a high tolerance of latency. The SageMaker Multi Model Server stack is used in a similar BYOC fashion to register the models for the Clarify processing job. Clarify spins up an ephemeral model endpoint and invokes it for millions of predictions on synthetic contrastive data. These predictions are used to compute SHAP values for each individual example, which are stored in S3.

For model validation and monitoring, Clarify aggregates these local feature importance values across examples in the training/monitoring data to obtain global feature importance, and visualizes them in SageMaker Studio. To give insights to operations, the data scientists select the features with the highest SHAP values that contributed most to a positive fraud score (i.e., likely fraudster) for each individual example, and report them to the underwriter.

Explanation of a soccer scoring probability model. In the context of soccer matches, the DFL (Deutsche Fußball Liga) applied Clarify to their xGoal model. The model predicts the likelihood that an attempted shot will result in a goal. They found that overall the features AngleToGoal and DistanceToGoal have the highest importance across all examples. Let us consider one of the most interesting games of the 2019/2020 Bundesliga season, where Borussia Dortmund beat Augsburg in a 5-3 thriller on January 18, 2020. Taking a look at the sixth goal of the game, scored by Jadon Sancho, the model predicted the goal with a high xGoal prediction of 0.18, compared to the average prediction of 0.0871 across the past three seasons. What explains this high prediction? The SHAP analysis reveals that PressureSum, DistanceToGoalClosest, and AmountofPlayers contribute the most to this prediction. This is a noticeable deviation from the features that are globally the most important. What makes this explanation convincing is the unusual nature of this goal: Sancho received a highly offensive through ball well into Augsburg's half (past all defenders, hence a low PressureSum), only to dart round the keeper and tap it into the goal near the posts (low AngleToGoal).

9 LESSONS LEARNED IN PRACTICE
Wide Applicability: We learned that wide applicability across datasets and model endpoints not only requires flexible parsing logic (see §6), but also creates a tension with ease of use: Our code tries to infer whether a file contains a header, or whether a feature is categorical or numerical. While it is convenient for most customers not to have to specify all these details, for some customers our inference does not work well. This tension is further explored in §8.3.

Scalability: When working towards our goal of creating a scalable application, we learned that bias and explainability analyses have different resource demands (SHAP can be compute intensive, while bias metrics can be memory intensive), and hence require separate tuning of the parallelism. For explainability we had to work on small chunks of the dataset, while for bias reports we had to work on large chunks. For details and other technical lessons, see §6.1.

Usability: In retrospect, our biggest lesson concerning usability is one we should have recognized prior to the usability studies: we cannot rely heavily on documentation. Instead we need to enable users to try out the feature without reading documentation. Early implementations of Clarify required the user to create their JSON configuration file manually. This was prone to both schema errors caused by users creating malformed JSON, and soft errors caused by users defining metrics using incorrect fields or labels. To avoid such errors we developed an explanatory UI. Now, we ask the customer to "Select the column to analyze for bias" and provide a drop-down menu, or programmatically construct the configuration for them. We are further revising the UI; see §8.3.

Best Practices: We recognize that the notions of bias and fairness are highly dependent on the application and that the choice of the attribute(s) for which to measure bias, as well as the choice of the bias metrics, can be guided by social, legal, and other non-technical considerations. Our experience suggests that the successful adoption of fairness-aware ML approaches in practice requires building consensus and achieving collaboration across key stakeholders (such as product, policy, legal, engineering, and AI/ML teams, as well as end users and communities). Further, fairness and explainability related ethical considerations need to be taken into account during each stage of the ML lifecycle, for example, by asking the questions stated in Figure 2.

10 CONCLUSION
Motivated by the need for bias detection and explainability from regulatory, business, and data science perspectives, we presented Amazon SageMaker Clarify, which helps to detect bias in data and models and to explain predictions made by models. We described the design considerations, methodology, and technical architecture of Clarify, as well as how it is integrated across the ML lifecycle (data preparation, model evaluation, and monitoring post deployment). We discussed the technical challenges and resolutions, a case study, customer use cases, deployment results, and lessons learned in practice. Considering that Clarify is a scalable, cloud-based bias and explainability service designed to address the needs of customers from multiple industries, the insights and experience from our work are likely to be of interest to researchers and practitioners working on responsible AI systems.

While we have focused on providing tools for bias and explainability in the context of an ML lifecycle, it is important to realize that this is an ongoing effort that is part of a larger process. Another key challenge is to navigate the trade-offs in the context of an application between model accuracy, interpretability, various bias metrics, and related dimensions such as privacy and robustness (see, for example, [11, 23, 24, 32]). It cannot be overemphasized that developing AI solutions needs to be thought of more broadly as a process involving iterated inputs from and discussions with key stakeholders such as product, policy, legal, engineering, and AI/ML teams, as well as end users and communities, and asking relevant questions during all stages of the ML lifecycle.

REFERENCES
[1] Big data: Seizing opportunities, preserving values. https://obamawhitehouse.archives.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf, 2014.
[2] Big data: A report on algorithmic systems, opportunity, and civil rights. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf, 2016.
[3] Big data: A tool for inclusion or exclusion? Understanding the issues (FTC report). https://www.ftc.gov/reports/big-data-tool-inclusion-or-exclusion-understanding-issues-ftc-report, 2016.
[4] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions approach to fair classification. In ICML, 2018.
[5] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairmlbook.org, 2019.
[6] S. Barocas, A. D. Selbst, and M. Raghavan. The hidden assumptions behind counterfactual explanations and principal reasons. In FAccT, 2020.
[7] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice risk assessments: The state of the art. Sociological Methods & Research, 2018.
[8] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. Moura, and P. Eckersley. Explainable machine learning in deployment. In FAccT, 2020.
[9] E. Black, S. Yeom, and M. Fredrikson. FlipTest: Fairness testing via optimal transport. In FAccT, 2020.
[10] T. Calders and S. Verwer. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21(2), Sept. 2010.
[11] A. Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 2017.
[12] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In NeurIPS, 2018.
[13] D. Dua and C. Graff. UCI machine learning repository, 2017.
[14] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In ITCS, 2012.
[15] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
[16] S. C. Geyik, S. Ambler, and K. Kenthapadi.
[18] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, 51(5), 2018.
[19] M. Hardt, E. Price, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In NeurIPS, 2016.
[20] K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach. Improving fairness in machine learning systems: What do industry practitioners need? In CHI, 2019.
[21] D. Janzing, L. Minorics, and P. Bloebaum. Feature relevance quantification in explainable AI: A causal problem. In AISTATS, 2020.
[22] N. Kilbertus, M. Rojas Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In NeurIPS, 2017.
[23] J. Kleinberg and S. Mullainathan. Simplicity creates inequity: Implications for fairness, stereotypes, and interpretability. In EC, 2019.
[24] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In ITCS, 2017.
[25] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In NeurIPS, 2017.
[26] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec. Faithful and customizable explanations of black box models. In AIES, 2019.
[27] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Rangapuram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms in Amazon SageMaker. In SIGMOD, 2020.
[28] Z. C. Lipton. The mythos of model interpretability. ACM Queue, 16(3), 2018.
[29] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In NeurIPS, 2017.
[30] S. M. Lundberg, B. Nair, M. S. Vavilala, M. Horibe, M. J. Eisses, T. Adams, D. E. Liston, D. K.-W. Low, S.-F. Newman, J. Kim, and S.-I. Lee. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering, 2(10), Oct. 2018.
[31] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019.
[32] S. Milli, L. Schmidt, A. D. Dragan, and M. Hardt. Model reconstruction from model explanations. In FAT*, 2019.
[33] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru. Model cards for model reporting. In FAccT, 2019.
[34] J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3, 2009.
[35] V. Perrone, M. Donini, K. Kenthapadi, and C. Archambeau. Fair Bayesian optimization. In ICML AutoML Workshop, 2020.
[36] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predictions of any classifier. In KDD, 2016.
[37] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), May 2019.
[38] L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, 2(28), 1953.
[39] S. Sharifi-Malvajerdi, M. Kearns, and A. Roth. Average individual fairness: Algorithms, generalization and experiments. In NeurIPS, 2019.
[40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In ICLR Workshop, 2014.
[41] A. Singh and T. Joachims. Fairness of exposure in rankings. In KDD, 2018.
[42] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, 2017.
[43] S. Vasudevan and K. Kenthapadi. LiFT: A scalable framework for measuring fairness in ML applications. In CIKM, 2020.
[44] S. Wachter, B. Mittelstadt, and C. Russell. Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. SSRN Scholarly Paper ID 3547922, Social Science Research Network, Mar. 2020.
[45] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wilson. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics, 26(1), 2020.
[46] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In WWW, 2017.
[47] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with adversarial learning. In AIES, 2018.
Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In KDD, 2019. [17] B. Goodman and S. Flaxman. European union regulations on algorithmic decision- making and a “right to explanation”. AI magazine, 2017. ACKNOWLEDGMENTS A.3 Detailed Example We would like to thank our colleagues at AWS for their contri- Below we show an example analysis config. It is minimal and ad- butions to Clarify. We thank Edi Razum, Luuk Figdor, Hasan ditional configuration options (e.g. to support datasets without Poonawala, the DFL and Zopa for contributions to the customer headers or model endpoints using JSON format) exist. use case. 1
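Since the analysis configuration is plain JSON, it can also be assembled and sanity-checked programmatically before a job is launched. A minimal standard-library sketch (the SHAP baseline S3 URI is truncated in the printed example, so a placeholder bucket name is used here; the variable names are ours):

```python
import json

# Assemble the example analysis configuration as a Python dict.
# NOTE: the S3 baseline URI below is a placeholder, not a real bucket.
analysis_config = {
    "dataset_type": "text/csv",
    "label": "Class1Good2Bad",
    "label_values_or_threshold": [1],
    "facet": [
        {"name_or_index": "ForeignWorker", "value_or_threshold": [1]},
    ],
    "group_variable": "A151",
    "methods": {
        "shap": {
            "baseline": "s3://my-bucket/baseline.csv",  # placeholder URI
            "agg_method": "mean_abs",
        },
        "pre_training_bias": {"methods": "all"},
        "post_training_bias": {"methods": "all"},
        "report": {"name": "report", "title": "Clarify Analysis Report"},
    },
    "predictor": {
        "model_name": "german-xgb",
        "instance_type": "ml.c5.xlarge",
        "initial_instance_count": 1,
    },
}

# Serialize; a round-trip check guards against non-JSON values sneaking in.
config_json = json.dumps(analysis_config, indent=2)
assert json.loads(config_json) == analysis_config
```

Generating the configuration this way keeps dataset-specific values (label, facet, predictor) in one place and catches malformed structures before any compute is spent.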
{
  "dataset_type": "text/csv",
  "label": "Class1Good2Bad",
  "label_values_or_threshold": [1],
  "facet": [
    {
      "name_or_index": "ForeignWorker",
      "value_or_threshold": [1]
    }
  ],
  "group_variable": "A151",
  "methods": {
    "shap": {
      "baseline": "s3://...",
      "agg_method": "mean_abs"
    },
    "pre_training_bias": {
      "methods": "all"
    },
    "post_training_bias": {
      "methods": "all"
    },
    "report": {
      "name": "report",
      "title": "Clarify Analysis Report"
    }
  },
  "predictor": {
    "model_name": "german-xgb",
    "instance_type": "ml.c5.xlarge",
    "initial_instance_count": 1
  }
}

A DETAILED CONFIGURATION

A.1 Additional Bias Metrics
Recall that we defined the fraction of positively labeled examples in group a as q_a = n_a^(1) / n_a (and q_d = n_d^(1) / n_d for group d, respectively). These fractions define a probability distribution P_a over labels for group a, assigning q_a to the positive label and 1 − q_a to the negative label, and likewise P_d with q_d and 1 − q_d. Differences between these probability distributions yield additional pre-training bias metrics:

KL: The Kullback–Leibler divergence measures how much information is needed to move from one probability distribution P_a to another P_d:
  KL(P_a, P_d) = q_a log(q_a / q_d) + (1 − q_a) log((1 − q_a) / (1 − q_d)).

JS: The Jensen–Shannon divergence is a symmetrized version of the KL divergence. Denoting by P the average of the label distributions of the two groups, we define
  JS(P_a, P_d, P) = (1/2) [KL(P_a, P) + KL(P_d, P)].

L_p: We can also consider the L_p norm between the probability distributions of the labels for the two groups. For p ≥ 1, we have
  L_p(P_a, P_d) = [Σ_y |P_a(y) − P_d(y)|^p]^(1/p).

TVD: The total variation distance is half the L_1 distance:
  TVD = (1/2) L_1(P_a, P_d).

KS: The Kolmogorov–Smirnov metric considers the maximum difference across label values:
  KS = max_y |P_a(y) − P_d(y)|.

For the post-training bias metrics we additionally have a flip test.

FT: Counterfactual fliptest: FT assesses whether similar examples across the two groups receive similar predictions. For each example in d, we count how often its k nearest neighbors in a receive a different prediction. Simplifying [9], we define FT = (F+ − F−)/n_d, where F+ (F−, resp.) is the number of examples in d with a negative (positive, resp.) prediction whose nearest neighbors in a have a positive (negative, resp.) prediction.

A.2 Configuration Options
Table 5 illustrates the flexible support for model outputs.

Example model output                  | Configuration of predictor
C3,"[0.1,0.3,0.6]"                    | "label": 0, "probability": 1
"[0.1, 0.3, 0.6]"                     | "probability": 0, "label_headers": ["C1", "C2", "C3"]
{"pred": "C3", "sc": [0.1, 0.3, 0.6]} | "label": "pred", "probability": "sc", "content_type": "application/jsonlines"

Table 5: Illustration of multi-class model outputs supported by Clarify. The model predicts scores for classes C1, C2, C3.
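To make the predictor configurations in Table 5 concrete, the sketch below (our own helper functions, not Clarify's internal parser) shows how each configuration locates the predicted label and the class scores in the corresponding model output:

```python
import csv
import io
import json

def parse_csv_output(line, label_index, prob_index):
    """Row 1 of Table 5: CSV output such as C3,"[0.1,0.3,0.6]" with the
    label at one column index and the score array at another."""
    fields = next(csv.reader(io.StringIO(line)))
    return fields[label_index], json.loads(fields[prob_index])

def parse_scores_with_headers(line, prob_index, label_headers):
    """Row 2 of Table 5: the output contains only scores; the predicted
    label is the header of the highest-scoring class."""
    fields = next(csv.reader(io.StringIO(line)))
    scores = json.loads(fields[prob_index])
    return label_headers[scores.index(max(scores))], scores

def parse_jsonlines_output(line, label_key, prob_key):
    """Row 3 of Table 5: application/jsonlines output addressed by keys."""
    record = json.loads(line)
    return record[label_key], record[prob_key]
```

For example, `parse_csv_output('C3,"[0.1,0.3,0.6]"', 0, 1)` yields the label `"C3"` and the scores `[0.1, 0.3, 0.6]`; the other two helpers recover the same pair from the score-only and JSON Lines formats.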
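As a concrete illustration of the distribution metrics in A.1 and the simplified flip test, here is a minimal sketch for binary labels (function names are ours; for brevity the flip test uses a brute-force single nearest neighbor rather than k):

```python
import math

def label_distribution(q):
    """Binary label distribution: probability q for the positive label."""
    return [q, 1.0 - q]

def kl(p, q):
    """KL divergence: sum_y P_a(y) log(P_a(y) / P_d(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL to the mixture P."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * (kl(p, m) + kl(q, m))

def lp(p, q, order=2):
    """L_p norm between the two label distributions, for order >= 1."""
    return sum(abs(pi - qi) ** order for pi, qi in zip(p, q)) ** (1.0 / order)

def tvd(p, q):
    """Total variation distance: half the L_1 distance."""
    return 0.5 * lp(p, q, order=1)

def ks(p, q):
    """Kolmogorov-Smirnov: maximum difference across label values."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

def flip_test(d_examples, d_preds, a_examples, a_preds):
    """Simplified flip test FT = (F+ - F-) / n_d using one nearest
    neighbor (squared Euclidean distance) in group a per example in d."""
    f_plus = f_minus = 0
    for x, pred in zip(d_examples, d_preds):
        nn = min(range(len(a_examples)),
                 key=lambda i: sum((xj - aj) ** 2
                                   for xj, aj in zip(x, a_examples[i])))
        if pred == 0 and a_preds[nn] == 1:
            f_plus += 1    # negative prediction, but neighbor is positive
        elif pred == 1 and a_preds[nn] == 0:
            f_minus += 1   # positive prediction, but neighbor is negative
    return (f_plus - f_minus) / len(d_examples)
```

For instance, with q_a = 0.6 and q_d = 0.3 both TVD and KS equal 0.3, since for a binary distribution the positive- and negative-label differences have the same magnitude.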