SageMaker Clarify: Bias Detection and Explainability in the Cloud

Michaela Hardt1, Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gelman, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai2, Keerthan Vasist, Pinar Yilmaz, M. Bilal Zafar, Sanjiv Das3, Kevin Haas, Tyler Hill, Krishnaram Kenthapadi ABSTRACT 1 INTRODUCTION Understanding the predictions made by machine learning (ML) Machine learning (ML) models and data-driven systems are in- models and their potential biases remains a challenging and labor- creasingly used to assist in decision-making across domains such intensive task that depends on the application, the dataset, and the as financial services, healthcare, education, and human resources. specific model. We present Amazon SageMaker Clarify, an explain- Benefits of using ML include improved accuracy, increased produc- ability feature for Amazon SageMaker that launched in December tivity, and cost savings. The increasing adoption of ML is the result 2020, providing insights into data and ML models by identifying of multiple factors, most notably ubiquitous connectivity, the ability biases and explaining predictions. It is deeply integrated into Ama- to collect, aggregate, and process large amounts of data using cloud zon SageMaker, a fully managed service that enables data scientists computing, and improved access to increasingly sophisticated ML and developers to build, train, and deploy ML models at any scale. models. In high-stakes settings, tools for bias and explainability in Clarify supports bias detection and feature importance computation the ML lifecycle are of particular importance. across the ML lifecycle, during data preparation, model evaluation, Regulatory [1, 2]: Many ML scenarios require an understanding of and post-deployment monitoring. We outline the desiderata derived why the ML model made a specific prediction or whether its predic- from customer input, the modular architecture, and the methodol- tion was biased. Recently, policymakers, regulators, and advocates ogy for bias and explanation computations. Further, we describe the have expressed concerns about the potentially discriminatory im- technical challenges encountered and the tradeoffs we had to make. pact of ML and data-driven systems, for example, due to inadvertent For illustration, we discuss two customer use cases. We present our encoding of bias into automated decisions. Informally, biases can be deployment results including qualitative customer feedback and viewed as imbalances in the training data or the prediction behavior a quantitative evaluation. Finally, we summarize lessons learned, of the model across different groups, such as age or income bracket. and discuss best practices for the successful adoption of fairness Business [3, 17]: The adoption of AI systems in regulated domains and explanation tools in practice. requires explanations of how models make predictions. Model ex- plainability is particularly important in industries with reliability, KEYWORDS safety, and compliance requirements, such as financial services, Machine learning, Fairness, Explainability human resources, healthcare, and automated transportation. Using a financial example of lending applications, loan officers, customer ACM Reference Format: service representatives, compliance officers, and end users often re- 1 Michaela Hardt , Xiaoguang Chen, Xiaoyi Cheng, Michele Donini, Jason Gel- quire explanations detailing why a model made certain predictions. man, Satish Gollaprolu, John He, Pedro Larroy, Xinyu Liu, Nick McCarthy, 2 Data Science: Data scientists and ML engineers need tools to debug Ashish Rathi, Scott Rees, Ankit Siva, ErhYuan Tsai , Keerthan Vasist, Pinar and improve ML models, to ensure a diverse and unbiased dataset Yilmaz, M. Bilal Zafar, Sanjiv Das3, Kevin Haas, Tyler Hill, Krishnaram Ken- during feature engineering, to determine whether a model is making thapadi. 2021. Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability in the Cloud. In Proceedings of the 27th ACM SIGKDD inferences based on noisy or irrelevant features, and to understand Conference on Knowledge Discovery and (KDD ’21), August the limitations and failure modes of their model. 14–18, 2021, Virtual Event, Singapore. ACM, New York, NY, USA, 10 pages. While there are several open-source tools for fairness and ex- https://doi.org/10.1145/3447548.3467177 plainability, we learned from our customers that developers often find it tedious to incorporate these tools across the ML workflow, 1Correspondence to: Michaela Hardt, . as this requires a sequence of complex, manual steps: (1) Data scien- 2Work done while at Amazon Web Services. tists and ML developers first have to translate internal compliance 3 Santa Clara University, CA, USA. requirements into code that can measure bias in their data and ML models. To do this, they often must exit their ML workflow, Permission to make digital or hard copies of part or all of this work for personal or and write custom code to piece together open source packages. (2) classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation Next, they need to test datasets and models for bias across all steps on the first page. Copyrights for third-party components of this work must be honored. of the ML workflow and write code for this integration. (3)Once For all other uses, contact the owner/author(s). the tests are complete, data scientists and ML developers create KDD ’21, August 14–18, 2021, Virtual Event, Singapore © 2021 Copyright held by the owner/author(s). reports showing the factors that contribute to the model prediction ACM ISBN 978-1-4503-8332-5/21/08. by manually collecting a series of screenshots for internal risk and https://doi.org/10.1145/3447548.3467177 compliance teams, auditors, and external regulators. (4) Finally, af- 3606, Microsoft’s Fairlearn7, and TensorFlow’s What-If-Tool [45] ter the models are in production, data scientists and ML developers are products built around bias metrics and mitigation strategies. must periodically run through a series of manual tests to ensure Explainabilty. The approaches for model explainability can be that no bias is introduced as a result of a drift in the data distribu- dissected in various ways. For a comprehensive survey, see e.g. [18]. tion (e.g., changes in income for a given geographic region). Similar We distinguish local vs. global explainability approaches. Local ap- challenges have also been pointed out in prior studies [8, 20, 33]. proaches [30, 36] explain model behavior in the neighborhood of a Our contributions: To address the need for generating bias and given instance. Global explainability methods aim to explain the explainability insights across the ML workflow, we present Ama- behavior of the model at a higher-level (e.g. [26]). Explainability zon SageMaker Clarify1, a new functionality in Amazon Sage- methods can also be divided into model-agnostic (e.g., [30]) and Maker [27] that helps detect bias in data and models, helps explain model-specific approaches which leverage internals of models such predictions made by models, and generates associated reports. We as gradients (e.g., [40, 42]). There is growing impetus to base expla- describe the design considerations, and technical architecture of nations on counterfactuals [6, 21]. At the same time concern about Clarify and its integration across the ML lifecycle (§3). We then overtrusting explanations grows [28, 37]. discuss the methodology for bias and explainability computations In comparison to previous work, Clarify is deeply integrated (§4). We implemented several measures of bias in Clarify, consid- into SageMaker and couples scalable fairness and explainability ering the dataset (pre-training) and the model predictions (post- analyses. For explainability, its implementation of SHAP offers training), to support various use cases of SageMaker’s customers. model-agnostic feature importance at the local and global level. We integrated pre-training bias detection with the data preparation functionality and the post-training bias detection with the model 3 DESIGN, ARCHITECTURE, INTEGRATION evaluation functionality in SageMaker. These biases can be moni- 3.1 Design Desiderata tored continuously for deployed models and alerts can notify users to changes. For model explainability, Clarify offers a scalable im- Conversations with customers in need of detecting biases in models plementation of SHAP (SHapley Additive exPlanations) [29], based and explaining their predictions shaped the following desiderata. on the concept of the Shapley value [38]. In particular, we imple- • Wide Applicability. Our offerings should be widely appli- mented the KernelSHAP [29] variant as it is model agnostic, so that cable across (1) industries (e.g., finance, human resources, Clarify can be applied to the broad range of models trained by and healthcare), (2) use cases (e.g., click-through-rate predic- SageMaker’s customers. For deployed models, Clarify monitors tion, fraud detection, and credit risk modeling), (3) prediction for changes in feature importance and issues alerts. tasks (e.g., classification and regression), (4) ML models (e.g., We discuss the implementation and technical challenges in §6 logistic regression, decision trees, deep neural networks), and illustrate the graphical representation in SageMaker Studio in a and (5) the ML lifecycle (i.e., from dataset preparation to case study in §7. In §8 we present the deployment results, including model training to monitoring deployed models). a quantitative evaluation of the scalability of our system, a report on • Ease of Use. Our customers should not have to be experts usage across industries, qualitative customer feedback on usability, in fairness and explainability, or even in ML. They should be and two customer use cases. Finally, we summarize lessons learned able to analyze bias and obtain feature importance within in §9 and conclude in §10. their existing workflow having to write little or no code. • Scalability. Our methods should scale to both complex mod- 2 RELATED WORK els (i.e., billions of parameters) and large datasets. Clarify offers bias detection and explainability of a model. Both The latter two desiderata were the top two non-functional require- are active research areas that we review briefly below. ments requested as customers evaluated the early prototypes. Cus- Bias. Proposed bias metrics for machine learning include indi- tomers described frustratingly long running times with existing vidual fairness [14, 39] as well as group fairness. Group fairness open source techniques, which motivated us to scale these existing considers different aspects of model quality (e.g., calibration24 [ ], solutions to large datasets and models (c.f. §8.2). Moreover, feedback error rates [19]), and prediction tasks (e.g., ranking [41] or classifica- from early prototypes included frequent erroneous configurations tion [12, 46]). Techniques based on causality [34] provide promising due to the free-form nature of the configuration file, prompting us alternatives at the individual [25] and group level [22]. For a thor- to improve usability (c.f. §8.3). ough overview see [5, 31]. Techniques to mitigate bias either change the model, its optimiza- 3.2 Architecture of Clarify tion, (e.g., [4, 35, 47]), or post-process its predictions (e.g., [16, 19]). Clarify is implemented as part of Amazon SageMaker [27], a There are several open source repositories that offer bias man- fully managed service that enables data scientists and developers agement approaches, including fairness-comparison2, Aequitas3, to build, train, and deploy ML models at any scale. It leverages Themis4, responsibly5, and LiFT [43]. Moreover, IBM’s AI Fairness SageMaker’s processing job APIs, and executes batch processing jobs on a cluster of AWS compute instances in order to process large amounts of data in a scalable manner. Each processing job has an 1https://aws.amazon.com/sagemaker/clarify 2https://github.com/algofairness/fairness-comparison associated cluster of fully managed SageMaker compute instances 3https://github.com/dssg/aequitas 4https://github.com/LASER-UMASS/Themis 6https://aif360.mybluemix.net/ 5https://github.com/ResponsiblyAI/responsibly 7https://fairlearn.github.io/ 4 METHODS 4.1 Bias We implemented several bias metrics to support a wide range of use cases. The selection of the appropriate bias metric for an application requires human judgement. Some of the metrics only require the dataset and thus can be computed early in the model lifecycle, that we call pre-training bias metrics. They can shed light onto how different groups are represented in the dataset and if there isa difference in label distribution across groups. At this point some Figure 1: High-level system architecture of Clarify.A biases may require further investigation before training a model. dataset and a configuration file serve as input to abatch Once a model is trained, we can compute post-training bias metrics processing job that produces bias metrics and feature impor- that take into account how its predictions differ across groups. We tance. These outputs are stored in a filesystem, exported as introduce the notation before defining the metrics. metrics, and visualized in a web-based IDE for ML. 4.1.1 Setup and Notation. Without loss of generality, we use 푑 to running the specified container image and provisioned specifically denote the group that is potentially disadvantaged by the bias and for the processing job. Clarify provides a first-party container compare it to the reminder of the examples in the dataset 푎 (for ad- image which bundles efficient and optimized implementations of vantaged group). The selection of the groups often requires domain core libraries for bias and explainability computation together with information9 and application specific considerations. Denote the flexible parsing code to support a variety of formats of datasets and number of examples of each group by 푛푎, 푛푑 , with 푛 denoting the the model input and output. total. We assume that each example in the dataset has an associ- Inputs to Clarify are the dataset stored in the S3 object store and ated binary label 푦 ∈ {0, 1} (with 1 denoting a positive outcome), a configuration file specifying the bias and explainability metrics to and that the ML model predicts 푦ˆ ∈ {0, 1}. While these assump- compute and which model to use. At a high level, the processing job tions seem restrictive, different prediction tasks can be re-framed completes the following steps: (1) validate inputs and parameters; accordingly. For example, for regression tasks we can select one (2) compute pre-training bias metrics; (3) compute post-training or more thresholds of interest. For multi-class predictions we can bias-metrics; (4) compute local and global feature attributions for treat various subsets of the classes as positive outcomes. explainability; (5) generate output files. The job results are saved We denote by 푛(0), 푛(1) , respectively, the number of labels of (0) (1) in the S3 object store and can be consumed in multiple ways: (1) value 0, 1 that we can further break down by group into 푛푎 , 푛푎 ( ) ( ) download and programmatically inspect using a script or notebook; and 푛 0 , 푛 1 . Analogously, we can count the positive / negative (2) visualize and inspect in SageMaker Studio IDE through custom 푑 푑 (0) (1) (0) (1) visualizations and tooltips to describe and help interpret the results. predictions 푛ˆ , 푛ˆ and break them down by group into 푛ˆ푎 , 푛ˆ푎 (0) (1) The cluster of compute instances carries out the computation and 푛ˆ푑 , 푛ˆ푑 . These counts can be normalized by group size as in a distributed manner (c.f. §6). To obtain predictions from the = (1) / = (1) / = (1) / 푞푎 푛푎 푛푎 and 푞푑 푛푑 푛푑 for the labels and 푞ˆ푎 푛ˆ푎 푛푎 model, the Clarify job spins up model serving instances that take ( ) and 푞ˆ = 푛ˆ 1 /푛 for the predictions. Additionally, we consider the inference requests. This is a deliberate choice made to avoid in- 푑 푑 푑 푇푃: true positives; 퐹푃: false positives; 푇 푁 : true negatives; 퐹푁 : false terfering with production traffic to the already deployed models negatives; and compute them for each group separately. and polluting the performance and quality metrics of the models being monitored in production. This resulted in a modular system 4.1.2 Pre-Training Metrics. allowing customers to use our product without any changes to their 퐶퐼 Class Imbalance: Bias can stem from an under-representation deployed models; see Figure 1. of the disadvantaged group in the dataset, for example, due to Security. Built on top of existing SageMaker primitives, Clarify − a sampling bias. We define 퐶퐼 = 푛푎 푛푑 . supports access control and encryption for the data and the net- 푛 퐷푃퐿 Difference in positive proportions in observed labels: Wede- work. Specifically, roles, virtual private cloud configurations, and fine 퐷푃퐿 = 푞푎 − 푞푑 , to measures bias in the label distribution. encryption keys can be specified. 10 Additional metrics derived from 푞푎 and 푞푑 are defined in §A.1 . 3.3 Deep Integration in SageMaker 퐶퐷퐷퐿 Conditional Demographic Disparity in Labels: Can the differ- ence in 푞푎 and 푞푑 be explained by another feature? An in- For generating bias and explainability insights across the ML life- stance of Simpson’s paradox11 arose in the case of Berkeley cycle, we integrated Clarify with the following SageMaker com- admissions where men were admitted at a higher rate than ponents: Data Wrangler to visualize dataset biases during data women overall, but for each department women were admitted preparation, Studio & Experiments to explore biases and expla- at a higher rate. In this case women applied more to depart- nations of trained models, and Model Monitor to monitor these ments with lower acceptance rates. Following [44], we define metrics.8 Figure 2 illustrates the stages during ML lifecycle and the integration points of Clarify within SageMaker. 9https://plato.stanford.edu/entries/discrimination/ 10https://pages.awscloud.com/rs/112-TZM-766/images/Amazon.AI.Fairness.and. 8Data Wrangler: aws.amazon.com/sagemaker/data-wrangler; Studio: aws.amazon. Explainability.Whitepaper.pdf com/sagemaker/studio; Model Monitor: aws.amazon.com/sagemaker/model-monitor. 11https://www.britannica.com/topic/Simpsons-paradox Figure 2: The stages in a ML lifecycle from problem formulation to deployment and monitoring with their respective fairness considerations in orange. Clarify can be used at dataset construction and model testing and monitoring to investigate them.

= (0) / (0) − (1) / (1) 퐷퐷 푛푑 푛 푛푑 푛 and compute it across all strata 푖 on 푀-player cooperative game, with individual features being players a user-supplied attribute that we want to control for, yielding and the predicted output 푓 (x) being the total payout of the game. 1 Í 퐷퐷푖 . 퐶퐷퐷퐿 = 푛 푖 푛푖 · 퐷퐷푖 where 푖 subscripts each strata and The framework decides how important each player is to the game’s 푡ℎ 푛푖 is the number of examples in it. outcome. Specifically, the importance of the 푖 player x푖 is defined as the average marginal contribution of the player to a game not 4.1.3 Post-Training Metrics. containing x푖 . Formally, the contribution 휙푖 is defined as: 퐷푃푃퐿 Difference in positive proportions in predicted labels: 퐷푃푃퐿 Õ |푆|!(푀 − |푆| − 1)!  consider the difference in positive predictions: 퐷푃푃퐿 = 푞ˆ −푞ˆ , 휙푖 (x) = 푓 (x푆 ∪ x푖 ) − 푓 (x푆 ) , 푎 푑 푀! related is “statistical parity” [7, 10]. 푆 퐷퐼 Disparate Impact: Instead of the difference as in 퐷푃푃퐿 we can where the sum is computed over all the player coalitions 푆 not / 푀 consider the ratio: 퐷퐼 = 푞ˆ푑 푞ˆ푎. containing x푖 . The computation involves iterating over 2 feature 퐷퐶퐴 Difference in Conditional Acceptance / Rejection: 퐷퐶퐴 consid- coalitions, which does not scale to even moderate values of 푀, ers the ratio of the number of positive labels to that of positive e.g., 50. To overcome these scalability issues, KernelSHAP uses = (1) / (1) − (1) / (1) predictions assessing calibration: 퐷퐶퐴 푛푎 푛ˆ푎 푛푑 푛ˆ푑 . approximations [29]. Analogously, 퐷퐶푅 considers negative labels and predictions. 퐴퐷 Accuracy Difference: We compare the accuracy, i.e. the fraction 4.3 Monitoring bias and explanations of examples for which the predictions is equal to the label, across Measuring bias metrics and feature importance provide insights groups. We define 퐴퐷 = (푇푃푎 + 푇 푁푎)/푛푎 − (푇푃푑 + 푇 푁푑 )/푛푑 . into the model at time of evaluation. However, a change in data 푅퐷 Recall Difference: We compare the recall (the fraction of positive distribution can occur over time, resulting in different bias metrics examples that receive a positive prediction) across groups. 푅퐷 = and feature importance. Our goal is to continuously monitor for / (1) − / (1) 푇푃푎 푛푎 푇푃푑 푛푑 (aka “equal opportunity” [19]). such changes and raise alerts if bias metrics or feature importance 퐷퐴푅 Difference in Acceptance / Rejection Rates: This metric consid- drift from reference values established during an earlier evaluation ers precision, i.e. the fraction of positive predictions that are (before the model launch). Based on a user-specified schedule (e.g., = / (1) − / (1) hourly), newly accumulated data is analyzed and compared against correct: 퐷퐴푅 푇푃푎 푛ˆ푎 푇푃푑 푛ˆ푑 . Analogously, we define 퐷푅푅 for negative predictions. the reference values. 푇 퐸 Treatment Equality: We define 푇 퐸 = 퐹푁푑 /퐹푃푑 − 퐹푁푎/퐹푃푎 [7]. Bias. Given a bias metric from §4.1, we let the user specify a range 퐶퐷퐷푃퐿 Conditional Demographic Disparity of Predicted Labels: This of reference values 퐴 = (푎min, 푎max) in which the new bias value metric is analogous to the 퐶퐷퐷퐿 for predictions. 푏 computed on the live data should lie. However, if little data was accumulated simply checking for containment can lead to noisy See §A.1 for additional metrics. results. To ensure that the conclusions drawn from the observed 4.2 Explainabilty live data are statistically significant, we check for overlap of 퐴 with We implemented the KernelSHAP [29] algorithm due to its good the 95% bootstrap confidence interval of 푏 [15]. performance [36] and wide applicability. Explainability. For feature importance we consider the change The KernelSHAP algorithm [29] works as follows: Given an input in ranking of features by their importance. Specifically, we use x ∈ R푀 and the model output 푓 (x) ∈ R, SHAP, building on the the Normalized Discounted Cumulative Gain (nDCG) score for Shapley value framework [38], designates the prediction task as a comparing the ranking by feature importance of the reference data with that based on the live data. nDCG penalizes changes further associated UI components and documentation.12 Prior to public down in the ranking less. Also, swapping the position of two values launch, we set up pipelines and dashboards for health monitoring of is penalized relative to the difference of their original importance. our system, including alarms for too many failed jobs. Throughout, Let 퐹 = [푓1, . . . , 푓푀 ] denote the list of 푀 features sorted w.r.t. multiple reviews took place for the design, usability, security and ′ = [ ′ ′ ] their importance scores in the reference data, and 퐹 푓1 , . . . , 푓푀 privacy, and operational readiness. denote the new ranking of the features based on the live data. We Implementation Details: For wide applicability, we implemented use 푎(푓 ) to describe a function that returns the feature importance flexible parsers to support various data formats (JSONLines/CSV, score on the reference data of a feature 푓 . Then, nDCG is defined with(-out) header) and model signatures, see §A.2 for examples. Í푀 ′ 13 as: nDCG = DCG/iDCG, where DCG = 푖=1 푎(푓푖 )/log2 (푖 + 1) and The first bias metrics were implemented in numpy. For scala- Í푀 bility, all subsequent methods were implemented using Spark via iDCG = 푖=1 푎(푓푖 )/log2 (푖 + 1). An nDCG value of 1 indicates no change in the ranking. In our SageMaker’s Spark Processing job. Spark provided us with con- implementation, an nDCG value below 0.90 triggers an alert. venient abstractions (e.g., distributed median computations) and managed the cluster for us. Our bias metrics mostly require counts that are easy to distribute. Distributing SHAP only requires one 5 ALTERNATIVE DESIGN AND METHODS map operation to calculate feature importance for each example On the architectural side, our design provisions a separate model followed by an aggregation for global model feature importance. server which adds latency. Alternatively, a design that runs the model locally avoids network traffic. We decided to use a model 6.1 Challenges and Resolutions server so that we can re-use the existing SageMaker model hosting A challenge to fault-tolerance is that Clarify’s offline batch pro- primitives. Our model-agnostic approach enables support of a wide cessing job hits a live endpoint. This requires a careful implemen- range of models including those that cannot be executed locally. tation to avoid overwhelming the endpoint with requests. To limit This way we can scale and tune the instance type selection and the potential impact, Clarify spins up separate model servers and counts for the processing job and the model serving instances does not interfere with production servers. To avoid overwhelming independently based on the dataset and model characteristics. these separate endpoints, we dynamically configured the batch For the bias metrics, we considered limiting the number of met- size based on experimentation starting with the largest possible rics, to avoid overwhelming users. However, the appropriate bias batch size based on the maximum payload14 in order to achieve metric is application dependent. To support of customers from dif- high throughput at the model endpoints. Furthermore, we relied on ferent industries with various use cases, we decided to implement the boto clients’ retry policies15 with exponential back-off. This is a wide range of metrics (see §4.1). necessary to tolerate intermittent inference call failures. We further For explainability we decided to implement the KernelSHAP [29] expanded the retry mechanism, since not all errors are considered algorithm (see §4.2) rather than alternatives discussed in §2 for a retriable in the boto client. number of reasons: (i) it offers wide applicability to interpret pre- A key to scalability was to utilize resources well by carefully dictions of any score-based model; (ii) it works in a model-agnostic tuning the number of partitions of the dataset in Spark: If tasks manner and does not require access to model internals such as took very long (we started out with tasks taking an hour) for SHAP gradients allowing us to build on top of the the SageMaker hosting computation, failures were costly and utilization during the last few primitives; (iii) it provides attractive theoretical properties, for in- tasks was low. If tasks were too small for bias metrics computation, stance, all the feature importance scores sum to the total class score; throttling limitations when instantiating a SageMaker Predictor led and (iv) benchmarking shows that it performs favorably compared to job failures. The solution was to carefully tune the number of to related methods such as LIME [36]. More efficient model-specific tasks separately for explainability and bias computations. implementations like TreeSHAP and DeepSHAP [29] leverage the model architecture for improved efficiency. By using the model- 7 ILLUSTRATIVE CASE STUDY agnostic KernelSHAP, we traded efficiency for wider availability. In the future we will prioritize additional explainability methods For illustration we present a simple case study with the German based on customer requests. Credit Data dataset from the UCI repository [13] that contains 1000 labeled credit applications. Features include credit purpose, credit amount, housing status, and employment history. Categorical 6 DEVELOPMENT attributes are converted to indicator variables, e.g., A30 means “no Timeline: We started by collecting customer requirements to de- credits taken”. We mapped the labels (Class1Good2Bad) to 0 for rive Clarify’s goals and design the system (§3). Our first prototype poor credit and 1 for good. implemented the methods on a single node packaged into a con- tainer that could be run with a bare-bone configuration file. After 12 internal tests, we deployed this version for beta customers to solicit https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-detect-post-training- bias.html, https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-model- feedback. This feedback helped us to prioritize scalability (§8.2) and explainability.html ease of use improvements (§8.3). Afterwards, we expanded support 13Our open source implementation is available at https://github.com/aws/amazon- of models and datasets, conducted more end-to-end tests, and ob- sagemaker-clarify. 14https://github.com/boto/botocore/blob/develop/botocore/data/sagemaker- tained additional rounds of customer feedback. We embarked on runtime/2017-05-13/service-2.json#L35 the integration across the model lifecycle (§3.3), emphasizing the 15https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html Figure 5: SHAP feature importance.

evaluation monitoring overall bias 0.15 0.21 0.36 Figure 3: Pre-training bias results in Data Wrangler. explainability 0.43 0.21 0.64 overall 0.58 0.42 1 Table 1: Break-down of Clarify jobs across method (bias / explainability) and phase (evaluation / monitoring).

submit the configuration for the post-training bias analysis (see Appendix A) as input along with the validation data to the Clarify job. The results shown in Figure 4 reveal a mixed picture with pos- itive and negative metrics. We start by inspecting the disparate impact 퐷퐼 with a value of 0.73 much below 1. Since the label rates are almost the same across the two groups (퐷푃퐿), 퐷퐼 is driven by differences in precision and recall. We see that foreign workers have a lower recall (푅퐷 of 0.13), but a higher precision (퐷퐴푅 of -0.19). Corrective steps can mitigate this bias prior to deployment. Feature importance. To understand which features are affecting the model’s predictions the most we look at the global SHAP values that are averaged across all examples after taking the absolute value. The three most important features are A61 (having very little savings), the loan amount, and credit duration, see Figure 5.

8 DEPLOYMENT RESULTS In the first few months after launch 36% ofall Clarify jobs com- puted bias metrics while the 64% computed feature importance, Figure 4: Post-training bias results in Studio Trials. see Table 1. Monitoring jobs made up 42% of jobs, illustrating that customers reached a mature stage in the ML lifecycle in which they incorporated bias and feature importance monitoring. This is Bias analysis. Before we train the model we seek to explore bi- expected in a healthy development cycle, after a few iterations on ases in the dataset. Figure 3 shows Clarify integrated into Data model training and evaluation, a model passes the launch criteria Wrangler. Convenient drop-downs are available to select the la- and is continuously monitored after. bel column and the group. Also, by default a small set of metrics Below, we evaluate our design goals and highlight two use cases. are pre-selected in order to not overwhelm customers with too many metrics. No code needs to be written to conduct this analysis. 8.1 Wide Applicability From the class imbalance value of -0.94, we infer that the dataset is Customers across various business domains use Clarify, includ- very imbalanced, with most examples representing foreign workers. ing digital publishing, education, electronics, finance, government, However, as 퐷푃퐿 is close to 0, there is not much difference in the hardware, healthcare, IT, manufacturing, news, sports, staffing, and label distributions across the two groups. With these encouraging telecommunications. Customers use Clarify for binary predic- results we proceed with the training of an XGBoost model. We tions, multi-class predictions, and ranking tasks. They are located method processing time in minutes cost in dollars method time in minutes P50 P90 P99 pre-training bias 1 $0.03 post-training bias 14 $0.13 bias 1 8 11 Table 2: Processing time and cost of bias metrics computa- explainability 5 12 103 tion for 1 million examples. overall 4 12 101 Table 4: Percentiles of processing times in minutes. number of examples processing time in minutes cost in dollars multiple processing instances thereby reducing running time. At 1000 16 $0.13 most 10 instances were used by customers so far. 10,000 90 $0.71 100,000 811 $6.43 8.3 Ease of use Table 3: Processing time and cost of SHAP with varying We obtained qualitative customer feedback on usability. For cus- dataset sizes using one processing / model instance. tomers it was not always easy to get the configuration right despite documentation. Difficulties included setting the right fields with in the U.S., Japan, UK, and Norway among other countries. For two the right values, and figuring out fields that are optional, or needto specific customer use cases see §8.4. go together for a specific use case. These difficulties were reduced through tooling, documentation, and examples. In terms of tooling, 8.2 Scalability customers can use the explanatory UI or programmatically create We conducted experiments measuring the time it takes to compute the configuration. Regarding documentation, customers requested bias metrics and feature importance. We oversampled the German additional documentation covering specifics of multi-class predic- credit data set described in §7 creating up to 1 million examples to tion. Regarding examples, multiple customers with strict security study scalability. We ran experiments on ml.c5.xlarge machines requirements who used advanced configuration options to enable which are suitable for this workload and provide cost estimates encryption of data at rest and in transit with fine-grained access based on AWS pricing in the us-west-2 region. control benefited from Clarify examples for such settings. Other Bias metrics. Table 2 shows that the bias metrics computation is customers noted that our notebooks focus on the CSV format and fast even for 1M examples. Post-training bias metrics are consid- requested examples with the JSONLines data format. These ex- erably slower than pre-training metrics since they require model amples illustrate tension between the goals of wide applicability inference. The reported time includes the 3-5 minutes it takes to and usability: Ease of use in for one setting (e.g., CSV data, binary spin up the model endpoint. Scalability issues may arise when the predictions) does not extend to all settings (e.g., JSONLines data, dataset does not fit in memory (in which case a larger instance type multi-class prediction). can be chosen) or when the model inference is much slower (in We received several inquiries about formulating the develop- which case the instance count can be increased). ment, business, and regulatory needs into the framework that Explainability. Table 3 shows that time and cost scale roughly Clarify provides. While we were able to find a fit for many differ- linearly with the dataset size. The implementation of SHAP iterates ent use cases, these questions are hard and answers differ case-by- over all examples in a for-loop and the execution at each step case. For example, the question of which bias metric is important, dominates the time and cost. For 100,000 examples the running depends on the application. There is no way to get around this ques- time of more than 13h is so long that it can be a point of frustration tion, as it is impossible to remove all biases [11, 24]. Here again, for customers. We parallelized SHAP to reduce the running time. documentation helps customers to address this hard question. Figure 6 shows the scaling behavior of SHAP on 100,000 exam- Quantitative customer results. SageMaker AutoPilot, that auto- ples. We see that increasing the number of instances (both for the matically builds, trains, and tunes models, integrated Clarify in processing job and the model endpoint) to two reduces the time March 2021. AutoPilot contributes 54% of the explainability jobs. down to 3h and 17 minutes. This super-linear speedup is surprising. The automatic configuration functionality offered by AutoPilot, It is due to the fact that our single instance implementation does along with the other improvements listed above, helped reduce not fully utilize all cores. Our Spark implementation not only splits failure rates by 75%. the workload across machines but also across cores within those machines thereby reducing not just time but even cost. We observe 8.4 Two Customer Use Case Examples that 1 model endpoint instance for every 1 to 2 processing instances Next, we illustrate the use of Clarify in production by providing provides a good ratio: A higher ratio underutilizes model endpoint two real customer use cases. instances, increasing cost without a speed up. With a lower ratio Fraud detection. Zopa’s data scientists use several dozens of in- (e.g., 1 model endpoint instance for 10 processing instances), the put features such as application details, device information, in- model becomes the bottleneck, and cost as well as running time can teraction behaviour, and demographics. For model building, they be reduced by increasing the number of model endpoint instances. extract the training dataset from their data ware- Customer results. Half of the Clarify jobs of customers take at house, perform data cleaning, and store into . As Zopa most 4 minutes. Table 4 shows percentiles of processing times of has their own in-house ML library for both feature engineering bias as well as explainability jobs. We see that the running time and ML framework support, they use the “Bring Your Own Con- is less than two hours for 99% of the jobs. Across all jobs 35% use tainer” (BYOC) approach to leverage SageMaker’s managed services Figure 6: Scaling of SHAP on 100,000 examples and measuring time (left) and cost (right) for a varying number of instances. and advanced functionalities such as hyperparameter optimization. 9 LESSONS LEARNED IN PRACTICE The optimized models are then deployed through a Jenkins CI/CD pipeline to the production system and serve as a microservice for Wide Applicability: We learned that wide applicability across real-time fraud detection as part of their customer facing platform. datasets and model endpoints not only requires flexible parsing Explanations are obtained both during model training for vali- logic (see §6), but also creates a tension with ease of use: Our dation, and after deployment for model monitoring and generating code tries to infer whether a file contains a header, or a feature is insights for underwriters. These are done in a non-customer facing categorical or numerical. While it is convenient for most not having analytical environment due to a heavy computational requirement to specify all these details, for some customers our inference does and a high tolerance of latency. SageMaker Multi Model Server not work well. This tension is further explored in §8.3. stack is used in a similar BYOC fashion, to register the models for Scalability: When working towards our goal of creating a scalable the Clarify processing job. Clarify spins up an ephemeral model application, we learned that bias and explainability analyses have endpoint and invokes it for millions of predictions on synthetic con- different resource demands (SHAP can be compute intensive, while trastive data. These predictions are used to compute SHAP values bias metrics can be memory intensive), and hence require separate for each individual example, which are stored in S3. tuning of the parallelism. For explainability we had to work on small For model validation and monitoring, Clarify aggregates these chunks, while for bias reports, we had to work on large chunks of local feature importance values across examples in the training/- the dataset. For details and other technical lessons, see §6.1. monitoring data, to obtain global feature importance, and visualizes Usability: In retrospect, our biggest lesson concerning usability is them in SageMaker Studio. To give insights to operation, the data one we should have recognized prior to the usability studies – we scientists select the features with the highest SHAP values that cannot rely heavily on documentation. Instead we need to enable contributed most to a positive fraud score (i.e., likely fraudster) for users to try out the feature without reading documentation. Early each individual example, and report them to the underwriter. implementations of Clarify required the user to create their JSON Explanation of a soccer scoring probability model. In the con- configuration file manually. This was prone to both schema errors text of soccer matches the DFL (Deutsche Fußball Liga) applied caused by users creating malformed JSON, and soft errors caused by Clarify to their xGoal model. The model predicts the likelihood users defining metrics using incorrect fields or labels. To avoid such that an attempted shot will result in a goal. They found that over- errors we developed an explanatory UI. Now, we ask the customer all the features AngleToGoal and DistanceToGoal have the highest to “Select the column to analyze for bias” and provide a drop down importance across all examples. Let us consider one of the most menu or programmatically construct the configuration for them. interesting games of the 2019/2020 Bundesliga season, where Borus- We are further revising the UI, see §8.3. sia Dortmund beat Augsburg in a 5-3 thriller on January 18, 2020. Best Practices: We recognize that the notions of bias and fairness Taking a look at the sixth goal of the game, scored by Jadon Sancho, are highly dependent on the application and that the choice of the at- the model predicted the goal with a high xGoal prediction of 0.18 tribute(s) for which to measure bias, as well as the choice of the bias compared to the average of prediction of 0.0871 across the past three metrics, can be guided by social, legal, and other non-technical con- seasons. What explains this high prediction? The SHAP analysis re- siderations. Our experience suggests that the successful adoption veals that PressureSum, DistanceToGoalClosest, and AmountofPlayers of fairness-aware ML approaches in practice requires building con- contribute the most to this prediction. This is a noticeable deviation sensus and achieving collaboration across key stakeholders (such from the features that are globally the most important. What makes as product, policy, legal, engineering, and AI/ML teams, as well as this explanation convincing is the unusual nature of this goal: San- end users and communities). Further, fairness and explainability cho received a highly offensive through ball well into Augsburg’s related ethical considerations need to be taken into account during half (past all defenders – hence low PressureSum), only to dart round each stage of the ML lifecycle, for example, by asking questions the keeper and tap it into the goal near the posts (low AngleToGoal). stated in Figure 2. 10 CONCLUSION [18] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys, Motivated by the need for bias detection and explainability from 51(5), 2018. regulatory, business, and data science perspectives, we presented [19] M. Hardt, E. Price, E. Price, and N. Srebro. Equality of opportunity in supervised Amazon SageMaker Clarify, that helps to detect bias in data and learning. In NeurIPS, 2016. [20] K. Holstein, J. Wortman Vaughan, H. Daumé III, M. Dudik, and H. Wallach. models and to explain predictions made by models. We described Improving fairness in machine learning systems: What do industry practitioners the design considerations, methodology, and technical architecture need? In CHI, 2019. [21] D. Janzing, L. Minorics, and P. Bloebaum. Feature relevance quantification in of Clarify, as well as how it is integrated across the ML lifecycle explainable AI: A causal problem. In AISTATS, 2020. (data preparation, model evaluation, and monitoring post deploy- [22] N. Kilbertus, M. Rojas Carulla, G. Parascandolo, M. Hardt, D. Janzing, and ment). We discussed the technical challenges and resolutions, a case B. Schölkopf. Avoiding discrimination through causal reasoning. In NeurIPS, 2017. study, customer use cases, deployment results, and lessons learned [23] J. Kleinberg and S. Mullainathan. Simplicity creates inequity: Implications for in practice. Considering that Clarify is a scalable, cloud-based fairness, stereotypes, and interpretability. In EC, 2019. bias and explainability service designed to address the needs of cus- [24] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. In ITCS, 2017. tomers from multiple industries, the insights and experience from [25] M. J. Kusner, J. Loftus, C. Russell, and R. Silva. Counterfactual fairness. In NeurIPS, our work are likely to be of interest for researchers and practitioners 2017. [26] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec. Faithful and customizable working on responsible AI systems. explanations of black box models. In AIES, 2019. While we have focused on providing tools for bias and explain- [27] E. Liberty, Z. Karnin, B. Xiang, L. Rouesnel, B. Coskun, R. Nallapati, J. Delgado, ability in the context of an ML lifecycle, it is important to realize A. Sadoughi, Y. Astashonok, P. Das, C. Balioglu, S. Chakravarty, M. Jha, P. Gautier, D. Arpin, T. Januschowski, V. Flunkert, Y. Wang, J. Gasthaus, L. Stella, S. Ranga- that this is an ongoing effort that is part of a larger process. An- puram, D. Salinas, S. Schelter, and A. Smola. Elastic machine learning algorithms other key challenge is to navigate the trade-offs in the context ofan in Amazon SageMaker. In SIGMOD, 2020. application between model accuracy, interpretability, various bias [28] Z. C. Lipton. The mythos of model interpretability. ACM Queue, 16(3), 2018. [29] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. metrics, and related dimensions such as privacy and robustness In NeurIPS, 2017. (see, for example, [11, 23, 24, 32]). It cannot be overemphasized that [30] S. M. Lundberg, B. Nair, M. S. Vavilala, M. Horibe, M. J. Eisses, T. Adams, D. E. Liston, D. K.-W. Low, S.-F. Newman, J. Kim, and S.-I. Lee. Explainable machine- developing AI solutions needs to be thought of more broadly as a learning predictions for the prevention of hypoxaemia during surgery. Nature process involving iterated inputs from and discussions with key Biomedical Engineering, 2(10), Oct. 2018. stakeholders such as product, policy, legal, engineering, and AI/ML [31] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635, 2019. teams as well as end users and communities, and asking relevant [32] S. Milli, L. Schmidt, A. D. Dragan, and M. Hardt. Model reconstruction from questions during all stages of the ML lifecycle. model explanations. In Proc. FAT*, pages 1–9. ACM, 2019. [33] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru. Model cards for model reporting. In FAccT, 2019. [34] J. Pearl. Causal inference in statistics: An overview. Statistics Surveys, 3, 2009. REFERENCES [35] V. Perrone, M. Donini, K. Kenthapadi, and C. Archambeau. Fair Bayesian opti- [1] Big data: Seizing opportunities, preserving values. https://obamawhitehouse. mization. In ICML AutoML Workshop, 2020. archives.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf, [36] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining 2014. the predictions of any classifier. In KDD, 2016. [2] Big data: A report on algorithmic systems, opportunity, and civil rights. [37] C. Rudin. Stop explaining black box machine learning models for high stakes https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/2016_ decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 0504_data_discrimination.pdf, 2016. May 2019. [3] Big data: A tool for inclusion or exclusion? Understanding the issues (FTC [38] L. S. Shapley. A value for n-person games. Contributions to the Theory of Games, report). https://www.ftc.gov/reports/big-data-tool-inclusion-or-exclusion- 2(28), 1953. understanding-issues-ftc-report, 2016. [39] S. Sharifi-Malvajerdi, M. Kearns, and A. Roth. Average individual fairness: Algo- [4] A. Agarwal, A. Beygelzimer, M. Dudik, J. Langford, and H. Wallach. A reductions rithms, generalization and experiments. In NeurIPS, 2019. approach to fair classification. In ICML, 2018. [40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: [5] S. Barocas, M. Hardt, and A. Narayanan. Fairness and Machine Learning. fairml- Visualising image classification models and saliency maps. In ICLR Workshop, book.org, 2019. 2014. [6] S. Barocas, A. D. Selbst, and M. Raghavan. The hidden assumptions behind [41] A. Singh and T. Joachims. Fairness of exposure in rankings. In KDD, 2018. counterfactual explanations and principal reasons. In FAccT, 2020. [42] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. [7] R. Berk, H. Heidari, S. Jabbari, M. Kearns, and A. Roth. Fairness in criminal justice In ICML, 2017. risk assessments: The state of the art. Sociological Methods & Research, 2018. [43] S. Vasudevan and K. Kenthapadi. LiFT: A scalable framework for measuring [8] U. Bhatt, A. Xiang, S. Sharma, A. Weller, A. Taly, Y. Jia, J. Ghosh, R. Puri, J. M. fairness in ML applications. In CIKM, 2020. Moura, and P. Eckersley. Explainable machine learning in deployment. In FAccT, [44] S. Wachter, B. Mittelstadt, and C. Russell. Why fairness cannot be automated: 2020. Bridging the gap between EU non-discrimination law and AI. SSRN Scholarly [9] E. Black, S. Yeom, and M. Fredrikson. FlipTest: Fairness testing via optimal Paper ID 3547922, Social Science Research Network, Mar. 2020. transport. In FAccT, 2020. [45] J. Wexler, M. Pushkarna, T. Bolukbasi, M. Wattenberg, F. Viégas, and J. Wil- [10] T. Calders and S. Verwer. Three naive Bayes approaches for discrimination-free son. The What-If Tool: Interactive probing of machine learning models. IEEE classification. Data Mining and Knowledge Discovery, 21(2), Sept. 2010. Transactions on Visualization and Computer Graphics, 26(1), 2020. [11] A. Chouldechova. Fair prediction with disparate impact: A study of bias in [46] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond recidivism prediction instruments. Big data, 5 2, 2017. disparate treatment & disparate impact: Learning classification without disparate [12] M. Donini, L. Oneto, S. Ben-David, J. Shawe-Taylor, and M. Pontil. Empirical risk mistreatment. In WWW, 2017. minimization under fairness constraints. In NeurIPS, 2018. [47] B. H. Zhang, B. Lemoine, and M. Mitchell. Mitigating unwanted biases with [13] D. Dua and C. Graff. UCI machine learning repository, 2017. adversarial learning. In AIES, 2018. [14] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In ITCS, 2012. [15] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994. [16] S. C. Geyik, S. Ambler, and K. Kenthapadi. Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In KDD, 2019. [17] B. Goodman and S. Flaxman. European union regulations on algorithmic decision- making and a “right to explanation”. AI magazine, 2017. ACKNOWLEDGMENTS A.3 Detailed Example We would like to thank our colleagues at AWS for their contri- Below we show an example analysis config. It is minimal and ad- butions to Clarify. We thank Edi Razum, Luuk Figdor, Hasan ditional configuration options (e.g. to support datasets without Poonawala, the DFL and Zopa for contributions to the customer headers or model endpoints using JSON format) exist. use case. 1

2 "dataset_type": "text/csv", A DETAILED CONFIGURATION 3 "label": "Class1Good2Bad", A.1 Additional Bias Metrics 4 "label_values_or_threshold":[1], Recall that we defined the fraction of positively labeled examples 5 "facet":[ (1) (1) 6 { in group 푎 as 푞푎 = 푛푎 /푛푎 (푞푑 = 푛 /푛푑 for group 푑, respectively). 푑 7 "name_or_index" : "ForeignWorker", They define a probability distribution 푃푎 for group 푎 over labels 8 "value_or_threshold":[1] as 푞푎 for positive labels and 1 − 푞푎 for negative labels and 푃푑 as 9 } 푞푑 and 1 − 푞푑 . Differences in these probability distributions yield 10 ], additional pre-training bias metrics: 11 "group_variable": "A151", 퐾퐿 KL divergence measures how much information is needed to 12 "methods":{ move from one probability distribution 푃푎 to another 푃푑 : 퐾퐿 = − 13 "shap":{ 푞 log 푞푎 + (1 − 푞 ) log 1 푞푎 . 푎 푞푑 푎 1−푞푑 14 "baseline": "s3:////", 퐽푆 Jenson-Shannon divergence provides a symmetric form if KL 15 "num_samples": 3000,

divergence. Denoting by 푃 the average of the label distributions 16 "agg_method": "mean_abs"

of the two groups, we define 17 },

1 18 "pre_training_bias":{ 퐽푆 (푃푎, 푃푑, 푃) = 2 [퐾퐿(푃푎, 푃) + 퐾퐿(푃푑, 푃)] . 19 "methods": "all" 퐿푃 We can also consider the norm between the probability distri- 20 }, butions of the labels for the two groups. For 푝 ≥ 1, we have 21 "post_training_bias":{ h i1/푝 Í 푝 22 "methods": "all" 퐿푝 (푃푎, 푃푑 ) = 푦 |푃푎 (푦) − 푃푑 (푦)| . 23 },

푇푉 퐷 : The total variation distance is half the 퐿1 distance: 푇푉 퐷 = 24 "report":{ 1 ( ) 2 퐿1 푃푎, 푃푑 . 25 "name": "report", 퐾푆 Kolmogorov-Smirnov considers the maximum difference across 26 "title": "Clarify Analysis Report" label values: 퐾푆 = max(|푃푎 − 푃푑 |). 27 } For the post-training bias metric we additionally have a flip test. 28 }, 퐹푇 Counterfactual Fliptest: 퐹푇 assesses whether similar examples 29 "predictor":{ across the groups receive similar predictions. For each example 30 "model_name": "german-xgb", in 푑, we count how often its k-nearest in 푎 receive 31 "instance_type": "ml.c5.xlarge", a different prediction. Simplifying [9], we define 퐹푇 = (퐹 + − 32 "initial_instance_count":1 − + − 33 } 퐹 )/푛푑 where 퐹 (퐹 resp.) is the number of examples in 푑 with a negative (positive, resp.) prediction whose nearest neighbors 34 in 푎 have a positive (negative, resp.) prediction. A.2 Configuration Options Table 5 illustrates the flexible support for model outputs.

Example output Configuration of predictor C3,"[0.1,0.3,0.6]" "label": 0, "probability": 1 ”[0.1, 0.3, 0.6]” "probability": 0, ”label_headers”: [”C1”, ”C2”, ”C3”] {”pred”: C3, "label": "pred", ”sc”: [0.1, 0.3, 0.6]} "probability": "sc", "content_type": "application/jsonlines" Table 5: Illustration of multi-class model outputs supported by Clarify. The model predicts scores for classes C1, C2, C3.