Managing Model Risk

May 13, 2019

Agus Sudjianto, Harsh Singhal and Jie Chen

2019 Wells Fargo Bank, N.A. All rights reserved. For public use.

Master Class Agenda

• Introduction (15 minutes) – Agus
• Overview of Machine Learning
  – Ensemble Model Methodology and Examples: Random Forest and GBM (60 minutes) – Jie
  – Deep Learning Methodology and Examples: Feedforward, Recurrent, and Generative Adversarial Network (60 minutes) – Jie
• Natural Language Processing (45 minutes) – Harsh
  – Language Models
  – Neural Architecture
• Machine Learning Interpretability (90 minutes) – Jie
  – Post-hoc methodology
  – Model distillation
• Structured-Interpretable Models – Agus
• Validation of Machine Learning Models (90 min) – Harsh
  – Inputs/Data: bias and privacy test
  – Model specification: interpretability
  – Performance: fairness and performance testing
  – Model Monitoring and change control
  – Fail safe and disclosure

Optional Lunch Time Bonus: Deep Learning Techniques for Derivatives Pricing – Bernhard

Machine Learning Methodology: Ensemble Model Methodology and Examples

May 13, 2019 Jie Chen, Ph.D. MD, Head of Statistics and Machine Learning, Corporate Model Risk

Outline

• Statistics vs Machine learning
• Introduction to machine learning
  – Supervised and unsupervised learning
  – Semi-supervised learning
  – Reinforcement learning
• Decision Tree and CART
• Ensemble algorithms
  – Bagging
  – Random forest
  – Boosting
• Probability calibration
• Classification Example

Statistics vs ML

• Leo Breiman: Two modelling paradigms: data model and algorithmic model
  – Breiman (2001) Statistical Modeling: The Two Cultures, Statistical Science
• Traditional Statistics (data model)
  – View: Data are generated by some underlying parametric model; the goal is inference and interpretation of the model
  – Extensive interaction between data and data analyst
    o Summary, visualization, identification of outliers, shapes of distributions, transformation, …
  – Parameter estimation, testing, confidence intervals, asymptotic theory → based on model assumptions and theory
  – Dimensionality is a curse → variable selection
  – Model validation: goodness of fit tests, residual diagnostics
  – Tailored for small data sets, a small number of variables, structured data
  – Driven by statisticians
• Criticism
  – A simple parametric model is imposed on data generated by a complex system. Information obtained may be questionable.
  – Omnibus goodness-of-fit (GOF) tests, which test in many directions, have low power and will not reject until the lack of fit is large.
  – Feature engineering has to be done manually, which involves a lot of hand crafting and is impractical for a large number of variables.

Statistics vs ML

• Leo Breiman: Two modelling paradigms: data model and algorithmic model
  – Breiman (2001) Statistical Modeling: The Two Cultures, Statistical Science
• Machine Learning (algorithmic model)
  – View: Data mechanism is unknown and there is no intrinsic interest in the data generation process. The goal is to get the most accurate model, however complicated.
  – Very little direct interaction with the data
  – Emphasis on better algorithms, speed, efficiency of computing, parameter tuning
    o – exploratory data analysis on steroids
    o Neural networks, Boosting algorithms, etc.
  – Algorithms are black boxes → hard to interpret
  – Dimensionality is a blessing → variable selection is not needed, feature creation is encouraged (SVM)
  – Model validation: check prediction accuracy on a testing set
  – Tailored for large data sets, with a large number of variables, unstructured data
  – Driven by computer scientists, engineers, and a few statisticians
• Criticism
  – Lack of interpretability

Statistics vs ML

• Michael Jordan: the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.
• Distinction is blurring …
• Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning
• Data Science has emerged as an alternative term to combine both fields… but includes DBM and computing

Machine Learning vs Artificial Intelligence (wiki and other sources)

§ Machine Learning:
  – Term coined by Arthur Samuel (IBM) in 1959
  – Gives "computers the ability to learn without being explicitly programmed"
  – Study and construction of algorithms that can learn from data, summarize features, recognize patterns, make predictions, and take actions …
  – Related to statistics ("computational statistics") but different paradigms
  – A key pathway to AI

§ Artificial Intelligence: concerned with making computers behave like humans
  – Term coined in 1956 by John McCarthy (MIT)
  – Study of "intelligent agents": devices that perceive the environment and take actions that maximize their chance of success at some goal
  – Long history: formal reasoning in philosophy, logic, …
  – Resurgence of AI techniques in the last decade: advances in computing power, computing and data architectures, sizes of training data, and theoretical understanding
  – Deep Learning Neural Networks: at the core of recent advancements in AI, specifically for certain classes of ML tasks (Reinforcement Learning and Representation Learning)
  – Applications:
    • Pattern recognition: speech (Siri), image (DeepFace), handwriting, …
    • Autonomous systems: drones, self-driving cars
    • Recommender systems, drug discovery, marketing, …

Machine Learning: Tasks and Techniques

• Tasks:
  • Supervised Learning:
    • Regression and classification
  • Unsupervised Learning:
    • Discover underlying structure
    • Dimension reduction, clustering, …
  • Semi-supervised learning
  • Reinforcement Learning:
    • Identifying how to make good decisions from context: observe, learn, and optimize
    • Deep reinforcement learning
  • Representation Learning:
    • Feature selection and engineering

Supervised Machine Learning

§ Supervised learning means the desired outcome is known, i.e., the response variable is given.
§ Learning is supervised by the response: the algorithm minimizes the error between the prediction and the response.
§ Algorithms that fall under this category:
  – K-nearest neighbor
  – LASSO, Elastic Net
  – Support vector machine
  – Decision trees
  – Ensemble methods
  – Neural networks
    • Artificial Feed Forward NN
    • More complex NN for DL

Supervised Machine Learning

§ Machine learning algorithms usually come with hyper-parameters that control the complexity of the algorithm.
  – For example, trees have depth, number of terminal nodes, etc. to define the tree structure.
  – Neural networks have number of layers, number of neurons per layer, activation function, etc. to define the network structure.
§ Complexity is related to the bias-variance trade-off. Prediction error can be decomposed into bias and variance.

Bias and variance trade-off

§ Bias: $f(x) - E[\hat{f}(x)]$. Simpler models have larger bias, and vice versa.
§ Variance: $\mathrm{Var}(\hat{f}(x))$. Simpler models have smaller variance, and vice versa.
§ The best model is the one that achieves a good balance between bias and variance → hyper-parameter tuning
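The decomposition mentioned on this slide can be written out explicitly. The identity below is the standard one for squared error loss; it assumes (this is not stated on the slide) that the response is $y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and noise independent of the fitted model $\hat{f}$.

```latex
E\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - E[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{Var}\big(\hat{f}(x)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on model complexity, which is why tuning trades one against the other.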

Supervised Machine Learning: Tuning

§ Hyper-parameter tuning finds the hyper-parameters that give the most accurate machine learning algorithm. It is key to the success of machine learning algorithms.
§ A simple model structure with small data calls for a less complex algorithm, while a more complicated model structure with large data calls for a more complex one. The best hyper-parameters are therefore data dependent and need to be tuned.
§ Tuning involves a search routine and an evaluation routine. For each hyper-parameter setting, fit the model and evaluate its performance; use the search routine to find the hyper-parameter setting/model that optimizes the model performance.

Supervised Machine Learning: Tuning

§ Search routine. Some popular ones are
  – Grid search: define a grid of parameters and search the entire grid.
  – Randomized search: randomly select parameters from a distribution to search.
  – Bayesian hyper-parameter optimization: model the prediction performance as a Gaussian Process.
§ Evaluation routine. The model performance is measured by
  – Continuous response: mean squared error
  – Categorical response: AUC/Gini (binary response), error rate, logloss
§ It is well known that a model that minimizes the loss/error on the training data is likely to overfit. To avoid this, the performance is measured on separate validation data, or using cross-validation.
§ Cross-validation. The typical K-fold cross-validation works as follows:
  1. Randomly divide the data into K folds. Stratification may be needed for imbalanced data.
  2. For each i = 1, …, K
     1. Leave the ith fold out and build a model using the remaining K-1 folds.
     2. Predict on the ith fold.
  3. After obtaining the cross-validation predictions for the entire data, compute the loss/error. This is the cross-validation model performance.
§ Since both training data and validation data are used in the construction of the best model, the model performance has to be evaluated on a separate test set.
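The K-fold procedure described above can be sketched in a few lines of plain Python. This is only an illustration: the function names (`k_fold_indices`, `cross_val_mse`) and the toy "predict the training mean" learner are hypothetical, chosen to keep the example self-contained, and stratification is not handled.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Step 1: randomly divide indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_mse(x, y, fit, predict, k=5, seed=0):
    """Steps 2-3: collect out-of-fold predictions, then compute the MSE once."""
    n = len(y)
    cv_pred = [None] * n
    for fold in k_fold_indices(n, k, seed):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in fold:                       # predict on the left-out fold only
            cv_pred[i] = predict(model, x[i])
    return sum((yi - pi) ** 2 for yi, pi in zip(y, cv_pred)) / n

# Hypothetical toy learner: always predict the mean of the training responses.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, x_new):
    return model

x = list(range(10))
y = [2.0 * v for v in x]
cv_error = cross_val_mse(x, y, fit_mean, predict_mean, k=5)
```

A grid search would call `cross_val_mse` once per hyper-parameter setting and keep the setting with the smallest cross-validated error.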

Unsupervised learning

§ Unsupervised learning means there is no response. The observations are unlabeled.
§ It is used for clustering, dimension reduction, etc.
§ Algorithms that fall under this category:
  – Clustering
    • K-Means
    • Mixture models
  – Visualization and dimensionality reduction
    • PCA
    • Kernel PCA
    • Locally-linear embedding
    • T-distributed stochastic neighbor embedding (t-SNE)
  – Association rule learning

Semi-supervised learning

§ Sometimes it is very expensive or hard to obtain labels, so only part of the data is labeled.
  – Unlabeled data contains both 1's and 0's.
  – Labeled data contains only 1's → PU learning
§ Train using only labeled data? Not accurate.
§ Unlabeled data gives the "background" information.
§ Background information can increase the accuracy.

§ Algorithms:
  – Self training: label the unlabeled data by training a supervised algorithm on the labeled data, and iterate. It is a heuristic algorithm, but in some cases it is equivalent to the EM algorithm.
    • Only add the most confident predictions; or
    • Add all but weight by confidence.
  – Generative models: assume a probabilistic generative model (e.g., Gaussian mixture model, NB, HMM) and maximize the likelihood using EM.
    • If the model is correct, it is very effective; otherwise, unlabeled data can hurt.
  – Cluster and label: use any clustering algorithm for clustering and assign labels using the majority of labeled points.
  – Graph-based methods: a graph is given on the labeled and unlabeled data; instances connected by a heavy edge tend to have the same label.

Reinforcement Learning

Distinct arm of ML:
• Do not directly observe the 'right decision'
• Observe context (environment), make decision, and see outcome
• Learn from decisions over time – reward
• Search over context space, learn, and identify how to optimize
• Explore-exploit trade-off: decisions that improve the estimated model vs decisions that appear to be optimal under the current model
• Mathematical framework: Markov decision process or partially observed MDP

Canonical applications:
• Precision medicine – right treatment for right patients at right time
• Robotics: agents interacting with environment to learn how to perform a task optimally
• Recommendation systems: which advertisements or products to display given past browser or purchase history

Image credit: Megajuice, https://commons.wikimedia.org/w/index.php?curid=57895741

Decision Tree and CART

• A decision tree partitions the feature space into a set of rectangles and fits a simple model (e.g., a constant) in each one.
• Advantages:
  – Fast, intuitive
  – Able to handle both numeric and categorical data
  – Robust to outliers in predictors
  – Models interaction and nonlinearity automatically (little data transformation)
• Disadvantage: weak learner
  – High bias for shallow trees, for example when trying to model linear relationships
  – Unstable, high variance for deep trees. A small change in the data can result in a completely different tree.

Decision Tree and CART

• There are many different decision tree algorithms
• Ross Quinlan invented three implementations: ID3, C4.5 and C5.0
  – ID3 (iterative dichotomiser 3) is the first generation invented by Quinlan (1986)
  – C4.5 improves upon ID3 by allowing both discrete and continuous variables, tree pruning, missing value handling, etc. C5.0 further improves on speed and memory.
  – Splitting is based on minimum entropy (or maximum information gain). Only supports a categorical response.
• CART (classification and regression tree) is similar to C4.5, first introduced by Breiman.
  – Starts from the root node with all data
  – Splits into several child nodes based on a certain variable; the goal is to make each child node as homogeneous as possible
  – The heterogeneity of each node is measured by squared error for regression and Gini/entropy impurity for classification
    • Gini impurity: $1 - \sum_{k=1}^{K} p_k^2$, where $p_k$ is the probability of class $k$.
    • Entropy impurity: $-\sum_{k=1}^{K} p_k \log p_k$
    • When the node is pure (a single class), Gini impurity and entropy impurity are both 0
  – Pruning: the tree is grown large and then pruned to minimize the cost complexity function: each leaf incurs a penalty set by the complexity parameter
  – Some other features: missing value handling, surrogate splits.
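As a small illustration of the two impurity measures used for classification splits, the sketch below evaluates both on a list of class probabilities. The function names are hypothetical, and the natural logarithm is assumed for entropy (the slide does not fix the base).

```python
import math

def gini_impurity(p):
    """Gini impurity 1 - sum_k p_k^2 for class probabilities p."""
    return 1.0 - sum(pk ** 2 for pk in p)

def entropy_impurity(p):
    """Entropy impurity -sum_k p_k log p_k, treating 0 * log(0) as 0."""
    return sum(-pk * math.log(pk) for pk in p if pk > 0)

balanced = [0.5, 0.5]   # most heterogeneous two-class node
pure = [1.0, 0.0]       # all observations in one class

gini_balanced = gini_impurity(balanced)
gini_pure = gini_impurity(pure)
entropy_balanced = entropy_impurity(balanced)
entropy_pure = entropy_impurity(pure)
```

Both measures are largest for the balanced node and vanish for the pure node, which is what a split search exploits when it looks for homogeneous child nodes.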

Decision Tree and CART

• Tuning parameters for CART
  – Splitting criterion (gini or entropy)
  – Max tree depth
  – Min leaf size
  – Complexity parameter for pruning
  – …
• Implementation
  – scikit-learn: DecisionTreeRegressor and DecisionTreeClassifier
  – R: rpart package
  – Spark: MLlib library

Ensemble Algorithms

Improve performance by combining the outputs of several individual predictors:

Examples:
• Bagging
• Boosting
• Model Averaging
• Majority Voting
• Ensemble Stacking

web.engr.oregonstate.edu/~xfern/classes/cs534/notes/ensemble-11.pdf

Bagging

• Bagging is an early ensemble method invented by Breiman in 1994.
• Bagging works as follows:
  – Take a bootstrap sample at each iteration $b$, $b = 1, 2, \dots, n$.
  – Fit the base learner to the bootstrap sample to get a base model → $\hat{f}_b(x)$
  – Combine all base model predictions by averaging (regression) or majority voting (classification)
• Tuning parameters: base learner parameters plus n (number of base learners).
  – For n, the more, the better as long as computation allows.
• A deep decision tree is a good choice for the base learner.
• Bagging leads to "improvements for unstable procedures" (Breiman 1996), for example, deep decision trees.
  – Averaging reduces variance. In the independent case, $\mathrm{Var}\big(\frac{1}{n}\sum_{b=1}^{n} \hat{f}_b(x)\big) = \frac{\mathrm{Var}(\hat{f}_b(x))}{n}$. However, base model predictions are not independent because the bootstrap samples have overlapping data. The variance will flatten off instead of going to 0.
  – Correlation limits the reduction of variance, hence de-correlating the base models is important. How can the base models be further de-correlated?
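The flattening of the variance can be made precise. The expression below is a standard result, stated under the simplifying assumption (not on the slide) that all $n$ base predictions share the same variance $\sigma^2$ and the same pairwise correlation $\rho$.

```latex
\mathrm{Var}\!\left(\frac{1}{n}\sum_{b=1}^{n}\hat{f}_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{n}\,\sigma^2
  \;\xrightarrow[\,n\to\infty\,]{}\; \rho\,\sigma^2
```

With $\rho = 0$ this reduces to the independent case $\sigma^2 / n$; with $\rho > 0$ adding more base learners cannot push the variance below $\rho\,\sigma^2$, which is why reducing the correlation matters.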

Random Forest

• Random forest is a modified version of bagging.
• Popularized by Breiman (2001), it combines
  – Bagging applied to a tree algorithm
  – Random selection of features
• It builds deep trees, which have high variance but low bias, and reduces the variance through bagging.
  – A variant is to sample without replacement instead of taking bootstrap samples.
• To achieve the maximum amount of variance reduction, different trees need to be as uncorrelated as possible. This is done through
  – Random feature sampling: for each split, use a random subset of features as candidate split variables, instead of the entire feature set.
• Tuning parameters: n (number of trees), mtries (number of variables to sample in each split), tree depth, …
  – n: the more, the better as long as computation allows
  – mtries: there are some default values, e.g., $\sqrt{p}$ for the classification case and $p/3$ for the regression case, where $p$ is the number of features. Too small is generally not good (you may not be able to include any important variables in your random selection); too big is also bad as it leads to higher correlation.
  – tree depth: deep trees. Breiman suggested fully grown trees, but this is rarely a good idea for large data (storage and computation). In addition, fully grown trees can result in too rich a model and incur overfitting. Tuning the depth can improve model performance.

Random Forest

• Implementation:
  – scikit-learn: RandomForestRegressor and RandomForestClassifier
  – R: randomForest package
  – Spark: MLlib library
  – H2O: h2o.randomForest
• Random forest can be used off the shelf with default parameter settings.
• Other features: OOB (out-of-bag) error. It can be used in place of validation data to tune the algorithm.
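To see where out-of-bag observations come from, the sketch below draws one bootstrap sample and lists the observations it never picked. The helper name `bootstrap_with_oob` and the sample size are illustrative assumptions; a real random forest repeats this per tree and scores each observation only with the trees for which it was out of bag.

```python
import random

def bootstrap_with_oob(n, rng):
    """Draw n indices with replacement; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

rng = random.Random(0)
n_obs = 1000
in_bag, oob = bootstrap_with_oob(n_obs, rng)

# Each observation is left out with probability (1 - 1/n)^n, roughly exp(-1).
oob_fraction = len(oob) / n_obs
```

Because roughly a third of the data is left out of every bootstrap sample, each tree has a built-in hold-out set, which is what allows the OOB error to stand in for a validation set.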

Boosting

• Boosting is a different type of ensemble algorithm, based on removing the bias of a simple learner.
• Given a simple learner, can you improve it to be a strong learner? (Kearns and Valiant 1988)
• Schapire (1989): Yes → by a technique called "boosting"
• Freund and Schapire (1995): AdaBoost for classification
• "Base learner": simple rectangular classification regions at each stage
• Reweighting at each stage – more weight to data that are misclassified
• Fit an additive model (ensemble)

Gradient Boosting

• Breiman (1998+): Boosting is actually an optimization algorithm
• Friedman (2000+): Extended concept to gradient boosting (gradient descent)
• First, define your loss function to minimize: $L(y, f)$.
  – Different types of loss functions → different gradient descents
  – AdaBoost corresponds to exponential loss
  – Commonly used ones: squared error loss $(y - f)^2$ for regression and deviance loss $\log(1 + e^{f}) - yf$ for classification ($f$ is the log-odds)
  – Exponential loss is less robust than deviance loss when the data is noisy or the class labels are misspecified
  – Other loss functions: absolute error loss, partial likelihood, etc.
• For the given loss function, find the prediction function $f(x)$ that minimizes the total loss $\sum_{i=1}^{N} L(y_i, f(x_i))$.
  – The best function $f(x)$ is found in an additive, stage-wise way: $f(x) = h_0(x) + \sum_{m=1}^{M} \rho_m h_m(x)$, where $h_0(x)$ is the baseline (e.g., the overall mean in regression).
  – In each stage $m$, update the prediction function in the direction $h_m(x)$ in which the total loss decreases, with step size/learn rate $\rho_m$.
  – The good direction to go is the negative gradient (gradient descent). Hence each base learner $h_m(x_i)$ is fit to the negative gradient.
    o For squared error loss, the negative gradient is (up to a factor of 2) the error $r_{mi} = y_i - f_{m-1}(x_i)$ from the previous stage, where $f_{m-1}(x) = h_0(x) + \sum_{\ell=1}^{m-1} \rho_\ell h_\ell(x)$.
    o For deviance loss, the negative gradient is the error $r_{mi} = y_i - p_{m-1}(x_i)$, where $p = \frac{e^{f}}{1 + e^{f}}$ is the probability.
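For the squared error case, the stage-wise procedure can be sketched from scratch. The code below is a minimal illustration, not a production GBM: it assumes a single numeric feature, uses one-split regression stumps as the base learner, and keeps a constant learn rate; the names `fit_stump` and `gradient_boost` are hypothetical.

```python
def fit_stump(x, r):
    """Best single-split regression stump for residuals r on a 1-D feature x."""
    best = None
    for s in sorted(set(x))[:-1]:                      # candidate thresholds
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    _, s, ml, mr = best
    return lambda xi: ml if xi <= s else mr

def gradient_boost(x, y, n_stages=50, learn_rate=0.1):
    """Additive model: baseline mean plus learn_rate times a sum of stumps."""
    f0 = sum(y) / len(y)                               # baseline: overall mean
    stumps = []
    pred = [f0] * len(y)
    for _ in range(n_stages):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # negative gradient (errors)
        h = fit_stump(x, resid)                        # base learner fit to errors
        stumps.append(h)
        pred = [pi + learn_rate * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + learn_rate * sum(h(xi) for h in stumps)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

x = [i / 10 for i in range(40)]
y = [xi ** 2 for xi in x]
mse_few = mse(gradient_boost(x, y, n_stages=5), x, y)
mse_many = mse(gradient_boost(x, y, n_stages=100), x, y)
```

On the training data the error keeps falling as stages are added, which is exactly why the number of trees has to be tuned on validation data rather than on the training fit.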

Gradient Boosting

• As an illustration for the regression case

• Stochastic gradient boosting (Friedman 1999): fit each tree with a subsample instead of the entire data. This can be more robust and less prone to overfitting.
• Tuning parameters: number of trees, learn rate, tree depth, …
  – Number of trees: needs to be tuned. Too many cause overfitting (in contrast to random forest) and too few result in underfitting.
  – Learn rate: smaller is generally better, but it requires more trees to be built
  – Sample rate: for stochastic gradient boosting, the default is 0.5 but it depends on data size
  – Tree depth: shallow, in contrast to random forest
• Implementation:
  – scikit-learn: GradientBoostingClassifier and GradientBoostingRegressor
  – R: gbm package
  – Spark: MLlib library
  – H2O: h2o.gbm
  – XGBoost, LightGBM, CatBoost, …

XGBoost

• XGBoost stems from GBM but is different in several ways:
  – Includes regularization (L1, L2 penalties) and column sampling to better control overfitting
  – Uses a different optimization algorithm (Newton boosting rather than gradient boosting)
  – Supports a fast algorithm for tree splits
  – Usually has better prediction performance (leading algorithm in Kaggle competitions)
• Key parameters for tuning:
  – Number of trees (n_estimators): needs to be tuned. Too many cause overfitting (in contrast to random forest) and too few result in underfitting.
  – Learning rate: smaller is generally better, but it requires more trees to be built.
  – Tree depth (max_depth): shallow, in contrast to random forest.
  – L1 regularization term on weights (reg_alpha): regularization parameter specific to XGBoost.
  – L2 regularization term on weights (reg_lambda): regularization parameter specific to XGBoost.

Comparison: GBM and Random Forest

• Random forest is "practically tuning free" and is less prone to overfitting than GBM.
• Random forest is embarrassingly parallel. GBM builds one tree at a time.
• Random forest is slower to score and can take more time to train due to its tree depth.
• Several empirical comparisons are done in the literature to compare the performance of GBM and random forest.
  – They have similar prediction performance, but generally a well-tuned GBM performs slightly better than random forest (Caruana et al. 2005).
• The internal mechanics are different: one focuses on reducing variance and the other focuses on reducing bias.

Probability calibration

• One challenge for ML classification: the probability scores from many classifiers are not well calibrated. The rankings of the observations are usually good, but the scores themselves do not align well with the predicted probabilities that one may get from, say, logistic regression models. E.g.,
  – Naïve Bayes tends to push scores to 0 or 1 due to the conditional independence assumption;
  – Support vector classifiers use the distance from a point to the decision boundary, which is not on the probability scale;
  – Bagging and random forests, which average predictions from a base set of models, can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values.

• A reliability plot can be used to visualize such bias in the scores. A perfectly calibrated model will show approximately a 45 degree straight line, whereas SVC usually shows a sigmoid-shaped curve and Naïve Bayes shows the opposite.
• To correct the bias in probability scores, there are three main calibration methods:
  – Platt scaling
  – Isotonic regression
  – Spline calibration using natural cubic splines
• Based on our experience, a well-tuned XGBoost model can produce quite accurate probabilities even without calibration, so calibration may not change much. On the other hand, a random forest model is less accurate and you may see it over-predict significantly on in-time test/out-of-time test data.
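The points of a reliability plot can be computed by binning the scores. The sketch below is one simple way to do it, assuming equal-width bins on [0, 1]; the function name `reliability_curve` and the toy scores are made up for illustration.

```python
def reliability_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for yt, p in zip(y_true, y_prob):
        b = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        bins[b].append((yt, p))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            rate = sum(yt for yt, _ in members) / len(members)
            curve.append((mean_p, rate, len(members)))
    return curve

# Toy scores: plotting rate against mean_p and comparing to the 45 degree line
# shows how far each bin is from perfect calibration.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_prob = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
curve = reliability_curve(y_true, y_prob, n_bins=5)
```

Bins whose observed rate sits above the mean predicted probability indicate under-prediction, and bins below it indicate over-prediction.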

Documentation: https://scikit-learn.org/stable/modules/calibration.html

Classification Example

• Auto loan level loss forecast model
  – Objective: conditional probability of default
  – Model segment: Specific delinquency segment
  – Dependent variable: charged-off
  – Independent variables: Raw LOB independent variables
  – In-time data set: June 2004 – March 2016
  – OOT test set: April 2016 – May 2017
  – Training/Validation splitting
    o Clustered by customer ID
    o 2/3 in-time data set for training
    o 1/3 in-time data set for validation
• Hyperparameter tuning: Grid search
  – Hyperparameter tuning grids can be different for small data, medium data and big data. More advanced users can adjust the tuning grids according to their own needs and computation time budget.
• XGBoost:
  – 'learning_rate': 0.05
  – 'max_depth': 5
  – 'n_estimators': 300
  – 'reg_alpha': 0
  – 'reg_lambda': 1
• Random Forest:
  – 'max_depth': 15
  – 'max_features': 6
  – 'n_estimators': 300

Example: Probability Calibration

• Reliability plot by XGBoost before/after calibration – No significant improvement

• Reliability plot by Random Forest before/after calibration – Probability calibration is needed

Example: Account level performance metrics

• In-time train
  – Over-fitting by RandomForest
• In-time test
  – XGBoost > RandomForest > Logistic Regression
• OOT test
  – XGBoost > RandomForest > Logistic Regression

Example: Aggregated level performance metrics

• In-time test set, over date: XGBoost > Logistic > random forest

Metric      XGBoost_whole_time   H2oLogist_whole_time   RandomForest_whole_time
MAE         0.012                0.015                  0.015
pRMSE (%)   15.18                18.34                  16.67
MAPE (%)    5.00                 5.50                   6.24
CPE (%)     1.76                 -1.24                  2.81

• OOT test set, over date: (MAPE) random forest ~ XGBoost > Logistic

Metric      XGBoost_whole_time   H2oLogist_whole_time   RandomForest_whole_time
MAE         0.006                0.008                  0.005
pRMSE (%)   7.72                 9.27                   7.59
MAPE (%)    6.95                 8.86                   6.80
CPE (%)     6.50                 8.51                   5.88

Machine Learning Methodology: Deep Learning Methodology and Examples

May 13, 2019 Jie Chen, Ph.D. MD, Head of Statistics and Machine Learning, Corporate Model Risk

Outline

§ Introduction
§ Artificial Neural Networks
§ Training Neural Networks
§ Advanced Network Architectures
§ Practical Considerations
§ Neural Networks vs Ensemble Methods
§ Classification Example
§ Time Series Simulation by Conditional Generative Adversarial Net

Introduction

• Machine Learning Model inspired by neuroscience
• Cyclical in Popularity; Recent Boom
• Recent Wins: Unstructured Data, such as images, text, and speech.
• Advantages:
  – Flexibility
  – Batch Training for Large Data
  – Unstructured and Hybrid Data: Automatic

Artificial Neural Networks

Artificial Neurons

• Inputs: $x_1, x_2, \dots, x_n$
• Weights: $w_1, w_2, \dots, w_n$ and Bias $b$
• Calculates a linear combination of the inputs: $z = \sum_i w_i x_i + b$
• The output applies an activation function to $z$: $a = \sigma(z)$

Activation Functions

• Introduce nonlinearities into the network.
• Popular Choices:
  – Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  – Hyperbolic Tangent: $\sigma(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  – Identity Function: $\sigma(z) = z$
  – Rectified Linear Units (ReLU): $\sigma(z) = \max(0, z)$
  – Leaky ReLU: $\sigma(z) = \max(\alpha z, z)$; $\alpha < 1$
  – Other specialized options
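A single artificial neuron with a pluggable activation function can be sketched as follows. The function names and the default leak `alpha=0.01` are illustrative choices, not taken from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z) with alpha < 1, so negative inputs keep a small slope."""
    return max(alpha * z, z)

def neuron(x, w, b, activation):
    """Linear combination of the inputs plus bias, passed through the activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

# Here z = 0.5 * 1.0 - 0.25 * 2.0 + 0.0 = 0, and sigmoid(0) = 0.5.
out = neuron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
```

Swapping `sigmoid` for `math.tanh`, `relu`, or `leaky_relu` changes only the nonlinearity, while the weighted sum stays the same.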

Example: Single Neuron Networks

• Linear Regression
  – Activation function: Identity Function
  – Therefore: $\hat{y}_i = \sum_j w_j x_{j,i} + b$
• Logistic Regression
  – Activation function: Sigmoid
  – Therefore: $\hat{y}_i = \frac{1}{1 + \exp\left(-\left(\sum_j w_j x_{j,i} + b\right)\right)}$

Network with Single Hidden Layer

• Using a set of neurons, or "hidden units", between the input and output allows the network to represent more complex functions of the input.
• "Universal Approximation Theorem": With a wide enough hidden layer and a squashing activation function, a neural network can approximate any well-behaved function arbitrarily well.
• The catch: potential overfitting and computational issues.

Deep Neural Networks: Multiple Hidden Layers

• Adding layers of hidden units increases the representation power of the ANN without as large a computational cost.
• Empirically, deep networks seem less prone to overfitting than wide, shallow networks.

(not really a “deep” network)

Output Layers

• Neural Networks may be adapted to different machine learning tasks by appropriately choosing the output layer. For example:
  – A single node with an identity activation function represents a univariate regression task.
  – A single node with a sigmoid activation function can be used for binary classification, when the target $y$ takes values of 0 or 1: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  – A set of K output nodes can be used for K-class classification tasks using the "softmax" activation function. Each output node gives the probability that the corresponding observation belongs to one of the K classes: $\sigma(z_j) = \frac{e^{z_j}}{\sum_k e^{z_k}}$
  – Something more customized to a specific task, such as a sequence.
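The softmax output layer can be sketched directly from its formula. Subtracting the maximum before exponentiating is a common numerical-stability device added here as an assumption; it does not change the result because softmax is invariant to shifting all inputs by a constant.

```python
import math

def softmax(z):
    """Map K real-valued outputs to K class probabilities that sum to 1."""
    m = max(z)                              # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
large = softmax([1000.0, 1000.0])           # would overflow without the shift
```

The largest input always receives the largest probability, and equal inputs receive equal probabilities.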

Training Neural Networks

Fitting a Neural Network to Data: Learning the Weights

• The weights (and bias) of each neuron are the unknown parameters in an ANN that need to be learned from data.
• To do so, we define an appropriate cost function. The cost function should:
  – Represent an average of the cost of individual observations in the training set.
  – Be a function of the outputs from the ANN and the response y.
  – Example: $\frac{1}{n}\sum_{i} (y_i - \hat{y}_i)^2$
• Choose the weights and biases that minimize the cost function.
  – In principle, this can be achieved using calculus.
  – Numerical solutions are a well-studied field.
  – However, many of these solutions are not easily implemented in ANNs.

Choice of Cost Functions

• The cost (or loss) function provides a global measure comparing the output of a network with the true response for a set of data.
• The choice of loss function depends heavily on the task. For example:
  – For continuous responses, we use squared error loss: $\frac{1}{n}\sum_{i} (y_i - \hat{y}_i)^2$
  – For binary responses, we use cross entropy, or log loss: $-\frac{1}{n}\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$
  – For multinomial responses, we use a generalization of cross-entropy, with j indexing category: $-\frac{1}{n}\sum_{i}\sum_{j} y_i^{(j)} \log \hat{y}_i^{(j)}$
  – For other tasks, other loss functions.
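The two most common cost functions above can be written as short functions. This is a plain-Python sketch; the clipping constant `eps` is an added safeguard against `log(0)` and is an assumption, not part of the slide.

```python
import math

def squared_error_loss(y, y_hat):
    """Average squared error for continuous responses."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat)) / len(y)

def log_loss(y, p_hat, eps=1e-15):
    """Average cross entropy for binary responses y in {0, 1}."""
    total = 0.0
    for yi, pi in zip(y, p_hat):
        pi = min(max(pi, eps), 1.0 - eps)   # clip to avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -total / len(y)

mse_example = squared_error_loss([1.0, 2.0], [1.0, 4.0])
uninformative = log_loss([1, 0], [0.5, 0.5])   # equals log(2)
```

Log loss penalizes confident wrong predictions heavily: predicting 0.6 for a true 1 costs more than predicting 0.9.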

Gradient Descent

• Iterative algorithm to minimize a function $f(\vec{x})$.

1. Start with an initial point $\vec{x}_0$.

2. Propose a new point via: $\vec{x}_{t+1} = \vec{x}_t - \eta \nabla f(\vec{x}_t)$, where $\eta$ is a small constant called the learning rate.

3. Repeat until $\vec{x}_t$ converges.

• However, computing the gradient in neural networks can be challenging.
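The three steps above translate almost line by line into code. The sketch below minimizes a simple quadratic whose gradient is known in closed form; the function name, the stopping tolerance, and the test function are illustrative assumptions.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tol=1e-8, max_iter=10_000):
    """Repeat x <- x - learning_rate * grad(x) until the update is smaller than tol."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        x_new = [xi - learning_rate * gi for xi, gi in zip(x, g)]
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x

# f(x) = (x1 - 3)^2 + (x2 + 1)^2 has its minimum at (3, -1).
def grad_f(x):
    return [2 * (x[0] - 3), 2 * (x[1] + 1)]

x_min = gradient_descent(grad_f, [0.0, 0.0])
```

For this function each step shrinks the distance to the minimum by a constant factor; a learning rate that is too large would instead make the iterates diverge.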

Back Propagation Algorithm

• Preparation: Input the data x, and initialize all weights in the network.
• The algorithm:
  1. Feedforward: Feed the data through the network, computing the output of each node based on the current weights.
  2. Gradient: Compute the gradient of the cost function with respect to the last hidden layer.
  3. Backward Propagation: Work backwards through the network, computing the gradient of the cost function w.r.t. the weights in layer l-1 using the chain rule and the gradient w.r.t. the weights in layer l.
  4. Update the weights using gradient descent, and return to step 1.

Back Propagation Algorithm

• The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:
  – Input $x$: Set the corresponding activation $a^{1}$ for the input layer.
  – Feedforward: For each $l = 2, 3, \dots, L$ compute $z^{l} = w^{l} a^{l-1} + b^{l}$ and $a^{l} = \sigma(z^{l})$.
  – Output error $\delta^{L}$: Compute the vector $\delta^{L} = \frac{\partial C}{\partial z^{L}} = \nabla_a C \odot \sigma'(z^{L})$.
  – Back propagate the error: For each $l = L-1, L-2, \dots, 2$ compute $\delta^{l} = \frac{\partial C}{\partial z^{l}} = \left((w^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^{l})$.
  – Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k} \delta^{l}_{j}$ and $\frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j}$.
• In particular, given a mini-batch of m training examples, apply a gradient descent learning step based on that mini-batch.
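The backpropagation steps can be checked on a tiny network. The sketch below assumes sigmoid activations in every layer and a quadratic cost $C = \frac{1}{2}\lVert a^{L} - y \rVert^2$ for a single example (both are assumptions made for illustration), and compares one backpropagated derivative with a finite-difference estimate. Layer indices in the code start at 0 rather than 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return weighted inputs z and activations a per layer (acts[0] is the input)."""
    a, zs, acts = x, [], [x]
    for W, b in zip(weights, biases):
        z = [sum(wjk * ak for wjk, ak in zip(row, a)) + bj for row, bj in zip(W, b)]
        a = [sigmoid(zj) for zj in z]
        zs.append(z)
        acts.append(a)
    return zs, acts

def cost(weights, biases, x, y):
    _, acts = forward(weights, biases, x)
    return 0.5 * sum((aj - yj) ** 2 for aj, yj in zip(acts[-1], y))

def backprop(weights, biases, x, y):
    """Gradients of the quadratic cost w.r.t. all weights and biases for one example."""
    zs, acts = forward(weights, biases, x)
    sp = lambda z: sigmoid(z) * (1.0 - sigmoid(z))           # derivative of sigmoid
    # Output error: (a - y) times sigmoid'(z) at the last layer
    delta = [(aj - yj) * sp(zj) for aj, yj, zj in zip(acts[-1], y, zs[-1])]
    grads_w, grads_b = [None] * len(weights), [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads_b[l] = delta
        grads_w[l] = [[dj * ak for ak in acts[l]] for dj in delta]
        if l > 0:
            # Propagate the error one layer back through the transposed weights
            W = weights[l]
            delta = [sum(W[j][k] * delta[j] for j in range(len(delta))) * sp(zs[l - 1][k])
                     for k in range(len(W[0]))]
    return grads_w, grads_b

# Arbitrary small network: 2 inputs, 2 hidden units, 1 output.
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
biases = [[0.0, 0.1], [-0.1]]
x, y = [1.0, 0.5], [1.0]
grads_w, grads_b = backprop(weights, biases, x, y)

# Central finite-difference check on one hidden-layer weight.
eps = 1e-6
weights[0][1][0] += eps
c_plus = cost(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[0][1][0] += eps
numeric = (c_plus - c_minus) / (2 * eps)
```

Agreement between `numeric` and `grads_w[0][1][0]` is a quick sanity check that the backward pass implements the chain rule correctly.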

Other Optimization Concerns

• Variety of sophisticated methods to improve learning in ANNs:
  – Stochastic Gradient Descent
  – RMSProp
  – Adadelta
  – Adam
• These algorithms improve learning by using:
  – momentum, to prevent the gradients from changing too rapidly/overcorrecting
  – adaptive learning rates, to balance speed with accuracy
• In practice, this is done in batches of training data, called "minibatch learning".

Advanced Network Architectures

Convolutional Neural Networks

• Useful when observations consist of uniformly sized arrays of measurements of the same quantity, such as images or time series.
• Key Features:
  – Convolutional layers, where inputs are convolved with their neighbors.
  – Often use local pooling to reduce model parameters.
• Each output is a weighted average of the inputs: $\sum_{i,j} w_{i,j}\, x_{i,j}$
• The weights remain constant as the convolution is applied to successive windows of data.

[Figure: convolution applied to the first window and the second window]
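Weight sharing across successive windows is easiest to see in one dimension. The sketch below slides a fixed set of weights along a sequence; it assumes no padding and a stride of 1, and (as is usual in deep learning) does not flip the kernel. The function name `conv1d` is illustrative.

```python
def conv1d(x, w):
    """Apply the same weights w to every window of len(w) consecutive inputs."""
    k = len(w)
    return [sum(wi * xi for wi, xi in zip(w, x[s:s + k]))
            for s in range(len(x) - k + 1)]

# First window (1, 2) -> 1.5, second window (2, 3) -> 2.5, third window (3, 4) -> 3.5
out = conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
```

Only `len(w)` parameters are learned no matter how long the input is, which is the source of the parameter savings compared with a fully connected layer.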

Recurrent Neural Networks

• Useful in studying sequences, such as in a natural language context.
• RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations.
• RNNs have a "memory" by taking the previous output or hidden states as inputs.
• Several variations exist; the "Long Short-Term Memory" (LSTM) variation, which addresses the vanishing gradient problem, is currently the most successful.

Hybrid Networks

• Architectures can be adapted for a variety of purposes. For example:
  – Combining Feed-Forward, Recurrent, or Convolutional Elements
  – Adding layer-skipping connections, or reducing the number of connections between layers
  – Merging different input layers, or splitting into separate output layers, for different tasks.

Autoencoders

• Dimension Reduction Network
• The network learns to predict the input from the input using smaller hidden layers.
• The bottleneck layer engineers lower-dimensional features. After training, these features may be extracted as a lower-dimensional representation of the data.
• Relationship to PCA:
  – If linear activations are used, the weights span the same vector subspace as the corresponding set of principal components.
  – Not guaranteed to be equal to the PCs, nor orthogonal.

55 Generative/Adversarial Networks (GANs)

• Unsupervised technique • Pair of ANNs, trained with simultaneous backpropagation • A Generator Network, which produces candidate data examples • A Discriminator Network, which learns to distinguish the generated data from the real data. (Classification) • Simultaneous training improves the performance of both networks.

56 Practical Considerations

57 Using ANNs in Practice

• Determine Network Structure and Properties • Train the network effectively • Avoid overfitting • ANNs vs Other Machine Learning Techniques

Note: Much of the literature available gives advice in the context of unstructured data (images/text/speech).

Such advice may not be useful in banking problems.

58 Network Structure and Properties

• Very flexible choices for: – Number of hidden layers – Number of nodes on each hidden layer – Activation functions for each hidden layer – Regularization strategy/parameters – Additional features: o Skip connections o Batch normalization o Dropout o Constraints • Can make exhaustive search difficult

59 Training Effectively

• Training can be challenging; saddle points and local minima can result in a sub-optimal model. • Tips: – Standardize or normalize data (X) before training. This helps avoid the vanishing gradient problem. o Min/Max scaling is often used in the literature. o Gaussian standardization may perform better when large outliers are present. – Use Batch Normalization between hidden layers. – Consider using an optimization routine with adaptive learning rates or learning rate decay (e.g. Adam). – Consider adjusting the batch size used in training. Smaller batches can be slower and more volatile, but can help escape local minima/saddle points. – Use early stopping to determine the number of training epochs.

60 Overfitting

• ANNs are flexible models, with a (potentially) large number of parameters, therefore overfitting is a concern. • Strategies to avoid overfitting in ANNs include: – Multiple narrow layers vs. Single wide layer – Data augmentation – Weight Regularization: Penalizing large weights in the cost function. – Dropout: Randomly drop out units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. – Early Stopping via a validation set.

61 Neural Networks vs Ensemble Methods

• Ensemble Methods (Gradient Boosting, Random Forest): – Better predictive performance for structured data – Easier hyperparameter tuning: Smaller search space; less optimization tuning. – More natural handling of categorical variables (depends on implementation) • Artificial Neural Networks – More flexible data types; hybrid data – Analytical partial derivatives for more derivative based diagnostic tools – High performance for unstructured data – Larger hyperparameter space – Categorical variables need to be “dummy coded” or “one-hot encoded.” Many extra variables if large number of categories.

62 Examples

63 Classification Example

§ Response: 0s and 1s § Handcrafted predictors: 75 – some are highly correlated § Develop a good classifier § Techniques: o Logistic regression o Gradient boosting machine o Random forest o Convolutional Neural Net (with original time series data) o Naïve Bayes o SVM o Adaboost

§ Logistic regression – Top 10 variables + interactions – Fit Lasso (regularized regression) – Also does variable selection § Compare performance with ML algorithms: – Accuracy on cross-validated data

64 Comparison of predictive performances

• Logistic regression vs GBM and RF – GBM and RF are generally better in terms of accuracy

• Logistic regression vs GBM, RF and Deep Learning (CNN) – CNN based on “raw” data

65 Time Series Simulation by Conditional Generative Adversarial Net

66 Outline

• Motivation • Introduction of GAN, GAN Variants and CGAN • Rationale of how GAN works • Simulation results • VaR and ES application for • Macroeconomic time series simulation application

67 Motivation

• Traditional time series models are strongly dependent on model assumptions and estimation of the model parameters – Statistical time series models: AR, VAR, VECM, GARCH, … – stochastic process models: Hull White model, Ornstein-Uhlenbeck process,… – Complicated correlation modeling: Copula,…

• Therefore, traditional time series models are less effective in modeling – non-Gaussian, skewed, heavy-tailed distributions – complicated time-varying dependence and cross-correlation structure

• Generative Adversarial Net (GAN) and Conditional Generative Adversarial Net (CGAN) have proven to be powerful machine learning tools in image data analysis and generation. • We propose to use CGAN to learn and simulate time series for the bank’s applications.

68 GAN and its Variants

• GAN training is a minimax game on a cost function between a generator (G) and a discriminator (D), where both G and D are neural network models:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_g}[\log(1 - D(\tilde{x}))] \qquad (1)$$

Here $p_r$ denotes the distribution of the real data and $p_g$ the distribution of the generated samples. – Both D and G are trained simultaneously, where D receives either a generated sample $\tilde{x}$ or real data $x$, and D is trained to distinguish them by maximizing the cost function. – Meanwhile, G is trained to generate more and more realistic samples by minimizing the cost function. – The training stops when D and G achieve the Nash equilibrium, where neither of them can be further improved through training. • Issues with GAN: – Mode collapse issue: the generator collapses to a parameter setting where it always generates a small range of outputs – Diminished gradient issue: the discriminator becomes so successful that the generator gradients vanish and the generator learns nothing. • Solutions: – WGAN (Wasserstein GAN): a new cost function using the Wasserstein distance – DRAGAN: a gradient penalty applied directly to GAN
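The cost function in (1) can be estimated from discriminator outputs on a batch of real and generated samples. The sketch below assumes the discriminator outputs are probabilities in (0, 1); the function name `gan_value` and the example numbers are hypothetical.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Sample estimate of E[log D(x)] + E[log(1 - D(x_tilde))]."""
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# D outputs 0.5 everywhere: it cannot tell real from generated samples.
v_confused = gan_value([0.5, 0.5], [0.5, 0.5])
# D is confident and correct: high on real data, low on generated data.
v_confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator pushes this value up and the generator pushes it down. When D outputs 0.5 for every sample the value equals -2 log 2, which is the value at the equilibrium of the original GAN formulation.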

69 Conditional GAN (CGAN)

• A conditional version of GAN was introduced by Mirza & Osindero in 2014. • CGAN enables GAN to generate specific samples given the conditions, where the same auxiliary condition, usually denoted by $y$, is applied to both generator and discriminator as an additional input layer. • The cost function of CGAN is

$$\min_G \max_D \; \mathbb{E}_{x \sim p_r}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z, y), y))] \qquad (2)$$

• The cost function of conditional WGAN (CWGAN) is

$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_r}[D(x, y)] - \mathbb{E}_{z \sim p_z}[D(G(z, y), y)] \qquad (3)$$

where $z$ is the random noise fed to the generator and $\mathcal{D}$ is the constrained class of discriminator functions (in WGAN, the 1-Lipschitz functions). • Condition for CGAN can be categorical or continuous – For image generation, categorical conditions like image categories are common. – For time series generation, continuous conditions based on the past information are more common for future prediction generation.

70 Rationale of GAN

Consider simulating a single random variable X from uniformly distributed random noise $U \sim U(0,1)$. • According to inverse transform sampling, simulate $X \sim F_X^{-1}(U)$. • The GAN generator builds a nonlinear mapping $F_X^{-1}$ from U to X, given one-dimensional random noise. • A single layer of NN with ReLU activation is “actually” a piecewise linear spline:

$$f(x) = \beta_0 + \sum_{j=1}^{J} \beta_j B_j(w_j x) \qquad (4)$$

where J is the number of nodes in the layer. – $B_j(\cdot)$ with simple hinge functions are called ReLU (Rectified Linear Units), $\max(0, w_j x - c_j)$ – The $c_j$ "knot locations" are called “bias weights”

• Unlike the spline approach, – Knot locations are optimized simultaneously among all input variables – The knot location is optimized on scaled x, the $(w_j x)$, instead of x
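The link between inverse transform sampling and a single ReLU layer can be illustrated with a small sketch. Here the hinge coefficients are not trained by a GAN; they are set by hand so that the ReLU sum interpolates the inverse normal CDF at 8 fixed knots (7 hinge nodes). The function name `relu_spline`, the knot grid, and the restriction of the noise to [0.01, 0.99] are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def relu_spline(u, beta0, beta, w, c):
    """f(u) = beta0 + sum_j beta_j * max(0, w_j * u - c_j), for a 1-D array u."""
    u = np.asarray(u, dtype=float)[:, None]          # shape (n, 1)
    return beta0 + np.maximum(0.0, w * u - c) @ beta

inv_cdf = np.vectorize(NormalDist().inv_cdf)         # inverse CDF of N(0, 1)

# Hand-built piecewise linear interpolant of the inverse CDF (hypothetical setup).
knots = np.linspace(0.01, 0.99, 8)
vals = inv_cdf(knots)
slopes = np.diff(vals) / np.diff(knots)
w = np.ones(len(slopes))                             # unit input weights
c = knots[:-1]                                       # hinge j switches on at u = c_j
beta = np.diff(np.concatenate(([0.0], slopes)))      # change of slope at each knot
beta0 = vals[0]

rng = np.random.default_rng(0)
u = rng.uniform(0.01, 0.99, size=10_000)             # truncated uniform noise
x_exact = inv_cdf(u)                                 # inverse transform sampling
x_spline = relu_spline(u, beta0, beta, w, c)         # ReLU-layer approximation
```

With more nodes the piecewise linear map follows the inverse CDF more closely, which is the behaviour the plots on the next slide compare across network sizes.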

71 Rationale of GAN (Cont)

• Let us simulate a N(0,1) distribution from uniform U(0,1) random noise with GAN • In the following plots, the blue line is inverse CDF from U(0,1) (x-axis) to N(0, 1) (y-axis), and the green line is the spline approximation trained by GAN. – Left :1-layer with 7 nodes with single dimensional noise – Middle: 1-layer with 100 nodes with single dimensional noise – Right: 2-layer with 100*100 nodes with single dimensional noise

Note that some nodes collapse together. • For multivariate simulation from multi-dimensional random noise, a more complicated nonlinear mapping is established by GAN

72 Simulation study I: Gaussian Mixture Model with categorical or continuous conditions

• Gaussian Mixture Model with nominal categorical conditions – Four clusters of 2-dimensional Gaussian distributions with various means and variances

• Gaussian Mixture Model with continuous conditions – Mean: along the circle with center at (0, 0) and radius = 2. – Variance: linearly increases along the circle in an anticlockwise direction. – Final data: Gaussian distributions with condition on means and the corresponding variances

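73 Simulation study II: VAR time series with time varying volatilities

• Simulation model

$$X_t = A X_{t-1} + \varepsilon_t, \quad X_t \in \mathbb{R}^2, \quad A = [0.8,\ 0.6]^T, \quad \varepsilon_t \sim t(\text{mean} = 0,\ \text{cov} = \Sigma(X_{t-1}),\ \text{DF} = 20)$$

where the covariance $\Sigma(X_{t-1})$ depends on the previous observation $X_{t-1}$ and involves the 2×2 identity matrix; its exact form is garbled in the extracted text.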

• We take 10000 1-time-lag sliding windows. The condition is the past time-lag $X_{t-1}$. • We compare mean, variance, skewness, and kurtosis between conditional distributions by CGAN (y axis) and the corresponding true conditional distributions (x axis) given 500 randomly selected conditions. Each conditional distribution has 10000 samples.

[Figure panels: 1st time series; 2nd time series]

74 VaR and ES for equity 1-day returns

• Equity spot prices for WFC and JPM from 11/1/2007 to 11/1/2011 are downloaded from Yahoo. 1-day absolute returns are calculated and used as training data. • Stressed (11/2007-11/2009) and normal (11/2009-11/2011) periods are separated by using an indicator of periods as a categorical condition. • The Historical Simulation (HS) method is one of the most popular methods used by major financial institutions. This method is usually based on a relatively small number of actual historical observations and may lead to a jumpy and non-smooth tail distribution and poor VaR and ES output. • We use CGAN to learn the historical data for both the stressed and normal periods, and generate a simulated sample set with a sample size 50 times larger than the original one. • We calculate the VaR and ES (see Table 1). The plots show that the large data set generated by CGAN produces a clear and smooth tail of the distribution.

[Table 1: VaR and ES. Figure panels: Historical tail data; CGAN simulated tail data]
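The VaR and ES calculation on an empirical sample can be sketched as below. The returns here are synthetic heavy-tailed draws that stand in for real data; they are not the WFC/JPM series and not CGAN output. The function name `var_es`, the 99% level, and the loss convention (loss = negative return) are assumptions for illustration.

```python
import numpy as np

def var_es(returns, level=0.99):
    """Empirical VaR and ES of losses (loss = -return) at the given level."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, level)          # loss exceeded with probability 1 - level
    es = losses[losses >= var].mean()         # average loss at or beyond VaR
    return var, es

# Synthetic returns (NOT from the slides): a small sample and one 50 times larger.
rng = np.random.default_rng(0)
small_sample = 0.02 * rng.standard_t(df=4, size=500)
large_sample = 0.02 * rng.standard_t(df=4, size=25_000)
var_s, es_s = var_es(small_sample)
var_l, es_l = var_es(large_sample)
```

With 500 observations the 99% tail contains only a handful of points, so the tail estimates are jumpy; the larger sample gives a smoother tail, which is the motivation for generating more data with CGAN.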

75 VaR and ES backtesting for equity 1-day returns

• Additional historical data for WFC and JPM stock prices from 11/1/2011- 11/1/2015 (around 1000 business days) is downloaded to implement the backtesting. • Since there has been no major financial crisis in this period, we use the VaR and ES from the normal period (in Table 1) as our measurement in the backtesting. • The expected breaches over 1000 days for 1-day 99% VaR is 10 days. Table 2 shows that the HS method may lead to an underestimated measurement of the portfolio loss, and CGAN outperformed the HS method in the calculation of VaR and ES for this example.
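The breach count used in the backtest can be sketched as follows. The function name `count_breaches` and the toy numbers are hypothetical; the actual backtest results are in Table 2 of the slides and are not reproduced here.

```python
import numpy as np

def count_breaches(returns, var):
    """Number of days whose loss (-return) exceeds the VaR estimate."""
    return int(np.sum(-np.asarray(returns, dtype=float) > var))

n_days, level = 1000, 0.99
expected_breaches = n_days * (1 - level)      # about 10 for a 1-day 99% VaR

# Tiny made-up example, only to show the counting convention.
toy_returns = [0.01, -0.03, 0.002, -0.05, -0.01]
toy_breaches = count_breaches(toy_returns, var=0.025)   # losses 0.03 and 0.05 exceed 0.025
```

A breach count well above the expected number suggests that the VaR estimate understates the loss.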

76 Economic Forecasting Model

• CCAR requires multiple economic forecasts and different capital requirements during different hypothetical economic projections. • A CGAN-based economic model provides an alternative approach to produce multi-quarter forecasts at once, and to assess the distributions of the forecast paths. • Data for five popular macroeconomic indices from 1956 quarter 1 to 2016 quarter 3 from the U.S. Census Bureau: real Gross Domestic Product (GDP), unemployment rate (Unemp), Federal funds rate (Fedrate), Consumer Price Index (CPI) and 10-year treasury rate – Time series data: 5 variables x 242 quarters – Output data for CGAN: 230 samples x 5 variables x 9 quarters – Conditional data for CGAN: 230 samples x 5 variables x 4 quarters • (Top Plot) Forecast distribution: 100 forecasting paths of GDP generated by CGAN using the most recent four-quarter historical values as conditions. • (Bottom Plot) Shock analysis: the Federal funds rate is shocked upward by one standard deviation in the last quarter. The average forecast is used to assess the impact. A positive shock to the Federal funds rate suppresses economic activity and leads to a higher unemployment rate (red) compared to baseline (black).
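The sample counts above are consistent with sliding a 13-quarter window (4 condition quarters followed by 9 output quarters) one quarter at a time over 242 quarters, since 242 - 13 + 1 = 230. The one-quarter step is an inference from those counts, not something the slide states. A sketch with a placeholder array, where `make_windows` is a hypothetical helper name:

```python
import numpy as np

def make_windows(series, n_cond=4, n_out=9):
    """Split a (time, variables) array into condition and output windows."""
    n_time = series.shape[0]
    n_samples = n_time - (n_cond + n_out) + 1
    cond = np.stack([series[i:i + n_cond] for i in range(n_samples)])
    out = np.stack([series[i + n_cond:i + n_cond + n_out] for i in range(n_samples)])
    # Reorder axes to sample x variable x quarter, the layout listed on the slide.
    return cond.transpose(0, 2, 1), out.transpose(0, 2, 1)

placeholder = np.zeros((242, 5))     # stands in for the 242-quarter, 5-variable data
cond, out = make_windows(placeholder)
```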

77 Introduction to Natural Language Processing

May 13, 2019 Presenter: Harsh Singhal Contributors: Jian Sun, Suhas Sreehari, Tarun Joshi, Eric Wang, Wayne Shoumaker

Agenda

• Pre-processing • Simple Text Classifier • Unsupervised Learning: LSA and LDA • Language Models: GloVe • Language Models: Key Properties • Neural Architectures for Text Classification • Interpretability • Transfer Learning • Bonus 1: Advanced Neural Architectures • Bonus 2: More Language Models

79 Pre-processing: From Text to Feature Vector

• The purpose of pre-processing is to transform text into data that can be digested by an algorithm, and to reduce the amount of information to core set for clarity and efficiency

Example input: An integral part of model development is testing

• Lower Case, Remove Numbers and Punctuations: an integral part of model development is testing
• Tokenization: Split paragraphs and sentences into words
• Stemming: Reduce words to their root by dropping unnecessary characters, such as suffixes
• Lemmatization: Alternative approach to stemming, using WordNet’s lexical database of English
• Spelling Corrections, N-grams, POS Tagging, NER, Collocation Extraction
• Result after these steps: integr part model develop model_develop test
• Indexing and one-hot encoding: each document becomes a 0/1 vector over the vocabulary (Bag of Words), with terms such as risk, test, valid, rigor, model, control, credible, risk_manage, model_develop

80 Simple Classifier: From Word Vectors to Classification

Linear SVM and Logistic Based Classification – Pre-processing converts words to features and creates new features based on word count – Text vectorization outputs the features as numerical vectors, e.g. count vectors, TF-IDF vectors – For classification problems, vector space based ML methods can be applied to find the decision boundary between two classes. A notable example is SVM. – Linear SVM defines the criterion that maximally separates the two classes, allowing users to adjust cost and penalty parameters on misclassification to suit business problems.

TF-IDF • Term Frequency – Inverse Document Frequency. • TF denotes the number of times that a term occurs in a given document • IDF is the logarithmically scaled inverse fraction of the documents that contain the term. • TF-IDF are frequency scores that try to highlight words that are more frequent in a document but not across documents.

Count vectors:

Document                       Had  Little  Lamb  Twinkle  Star  Light  Bright  Old  Farm
Mary had a little lamb          1     1      1      0       0     0      0      0    0
Twinkle twinkle little star     0     1      0      2       1     0      0      0    0
Star light star bright          0     0      0      0       2     1      1      0    0
Old McDonald had a farm         1     0      0      0       0     0      0      1    1
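The count vectors for the four example documents, and TF-IDF scores built from them, can be reproduced with a short sketch. The vocabulary is the nine column terms from the slide. The IDF formula log(N / df) is one common variant and is an assumption here; libraries often use smoothed versions.

```python
import math

docs = [
    "Mary had a little lamb",
    "Twinkle twinkle little star",
    "Star light star bright",
    "Old McDonald had a farm",
]
vocab = ["had", "little", "lamb", "twinkle", "star", "light", "bright", "old", "farm"]

def count_vector(doc):
    """Term frequency of each vocabulary term in one document."""
    tokens = doc.lower().split()
    return [tokens.count(term) for term in vocab]

counts = [count_vector(d) for d in docs]

# Document frequency: number of documents that contain each term.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / d) for d in df]                 # assumed IDF variant
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in counts]
```

"twinkle" appears twice in one document and nowhere else, so it gets the largest score in that document, while "had" and "little" are shared across documents and are down-weighted.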

81 Unsupervised Learning: LSA and LDA

Distributional Hypothesis: Words occurring in similar contexts tend to have similar meaning

• Latent Semantic Analysis (LSA) – Objective: Provide a Euclidean lower dimensional representation of words and documents – Essentially an SVD (or PCA) on the term-document matrix (or term co-occurrence matrix) • Latent Dirichlet Allocation (LDA) Topic Model – Objective: Infer a collection of topics (each a set of words) – Objective: Assign a topic (or set of topics) to each document and word – Essentially a mixture model based clustering approach – A hierarchical generative model is proposed to explain the observed term-document matrix – Statistical inference uses EM or MCMC techniques

[Figure panels: Word and Document Vectors; Probability Distributions]
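The LSA step, a truncated SVD of the term-document matrix, can be sketched with NumPy. The matrix below reuses the count vectors from the nursery-rhyme example on the Simple Classifier slide, transposed so that rows are terms; the choice k = 2 is arbitrary.

```python
import numpy as np

# Term-document matrix: rows are terms, columns are the four example documents.
terms = ["had", "little", "lamb", "twinkle", "star", "light", "bright", "old", "farm"]
A = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 1, 2, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
], dtype=float)

k = 2                                          # number of latent dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]                # one k-dimensional vector per term
doc_vectors = Vt[:k].T * s[:k]                 # one k-dimensional vector per document
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]       # best rank-k approximation of A
```

Terms and documents can then be compared by distance or cosine similarity in the shared k-dimensional space.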

M. Steyvers, and T. Griffiths, “Probabilistic Topic Models,” Handbook of Latent Semantic Analysis, 2007. [Figure label: Polysemy]

82 Unsupervised Language Models: GloVe

• An unsupervised learning algorithm for obtaining vector representations for words • Training is performed on aggregated global word-word co-occurrence statistics from a corpus • Essentially a log-bilinear model with a weighted least-squares objective • Main intuition: ratios of word-word co-occurrence probabilities have the potential to encode meaning

Illustration: Consider a large corpus of words. We want to find the co-occurrence probabilities for the words ice and steam with various words. • ice co-occurs more frequently with solid than it does with gas • steam co-occurs more frequently with gas than it does with solid • Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently • The ratio of probabilities cancels less useful words like water and fashion • Large values (>> 1) correlate well with properties specific to ice • Small values (<< 1) correlate well with properties specific to steam
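In the notation of the cited GloVe paper (Pennington, Socher, and Manning, 2014), with $X_{ij}$ the number of times word $j$ occurs in the context of word $i$ and $X_i = \sum_k X_{ik}$, the quantities in the illustration can be written as follows. The symbols are taken from that paper and do not appear on the slide.

```latex
% Co-occurrence probability and the ratio used in the ice/steam illustration
P_{ik} = \frac{X_{ik}}{X_i}, \qquad
\frac{P_{\text{ice},k}}{P_{\text{steam},k}}
\begin{cases}
\gg 1 & k \text{ related to ice only (e.g. solid)} \\
\ll 1 & k \text{ related to steam only (e.g. gas)} \\
\approx 1 & k \text{ related to both or neither (e.g. water, fashion)}
\end{cases}

% Weighted least-squares objective of the log-bilinear model
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{T} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Here $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that limits the influence of very frequent co-occurrences.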

J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” EMNLP, 2014.

83 Working with Language Models

Linguistic Structure in Word Representations

Many linguistic patterns are captured in the Euclidean geometry of word representations.

https://www.tensorflow.org/tutorials/representation/word2vec

Bias and Discrimination

• Word representations are based on co-occurrence frequency and will pick up naturally occurring biases in language corpora

• (In)famous examples include association of gender with occupations

• Bias Amplification: For bivariate prediction problems (e.g. joint prediction of gender and occupation) the bias in model output can be worse than the bias in model training data

• Various solutions have been proposed – Adjusting training data – Post processing raw word representations – None of the solutions “work perfectly”. Key is to be aware of possible discrimination and take it into account.

Bias in Word Embedding

https://blog.conceptnet.io/posts/2017/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/

Caliskan, A., Bryson, J. J., and Narayanan, A. "Semantics derived automatically from language corpora contain human-like biases." Science 356, no. 6334 (2017).

84 Supervised Learning: Neural Architectures for Text Classification

Recipe for Text Classification using Neural Architectures – Preprocessing: Align text preprocessing rules with rules used for the underlying word embedding to maximize vocabulary coverage. Avoid traditional methods like stemming, lemmatization, and stop-word removal. – Embedding: Represent words as embeddings using word2vec, GloVe, fastText etc. Pre-trained or custom. – Representation: Design an intermediate representation (i.e. encoding) of the document using a DNN architecture. – Training: Train the representation & label using a dense feed forward neural network with a softmax layer.

Key Architectures for representing documents – Convolutional Neural Networks (CNN): Captures local (i.e. unigrams, bigrams, trigrams etc.) dependencies in document representation using convolution and pooling. Fast to train, works well, but fails to capture longer dependencies. – Recurrent Neural Networks (RNN): Captures longer dependencies in document representation. However, vanishing gradients make the network forget long-term information. Generally expensive to train. – RNN with Long Short Term Memory (LSTM)/ Gated Recurrent Units (GRU): Replace RNN cell with LSTM or GRU cell to preserve long term dependencies. – Bidirectional RNN with LSTM/GRU: Preserve long term dependencies and preserve contextual information in both directions by stacking two RNNs in parallel. – Auto-Encoder Networks: For longer documents, train an LSTM auto-encoder to encode sentences. Use the encoded sentences as input to the DNN to represent documents. – Attention Networks: Augment RNNs to represent documents with a focus on key parts of their input. – Hierarchical networks with attention: Attention can be applied at both word and sentence level to focus more on important content when constructing document representation.

85 Supervised Learning: Neural Architectures for Text Classification (CNN)

Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

86 Supervised Learning: Neural Architectures for Text Classification (RNN)

Unrolled

As the sequence grows, RNNs become unable to learn to connect information.

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

87 Supervised Learning: Neural Architectures for Text Classification (RNN)

[Figure panels: Standard RNN; LSTM; GRU]

Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

88 Supervised Learning: Neural Architectures for Text Classification (RNN)

Bidirectional RNN reads the text in forward as well as reverse fashion. Two RNNs are stacked in parallel to learn the output vector per word. These vectors are concatenated and used as input to FFNN. In practice, RNN cells are typically replaced with LSTM/GRU to model long term dependencies.

Image source: https://towardsdatascience.com/nlp-learning-series-part-3-attention-cnn-and-what-not-for-text-classification-4313930ed566

89 Interpretability: Rationales

Interpretability via providing concise evidence from input

Rationales must be: short and coherent pieces sufficient for correct prediction

Combines two modular components, generator and encoder, which are trained to operate well together

The generator specifies a distribution over text fragments as candidate rationales.

The candidate rationales are passed through the encoder for prediction.

[Figure labels: rationale; label]

T. Lei, “Interpretable Neural Models for Natural Language Processing,” MIT CSAIL PhD Thesis, 2017.

90 Transfer Learning

• Background: – Transfer learning is the process of training a model on a large-scale dataset, and then using this pre-trained model for a downstream task – It saves a tremendous amount of computation time/power by pre-training on billions of words – Recent frameworks include ULMFiT, ELMo, BERT, etc. – Take BERT as an example – It was trained using 3.3 billion words in total, with 2.5B from Wikipedia and 0.8B from BookCorpus – ELMo, for comparison, has 93.6 million parameters with 4096 LSTM hidden size and 512 output size (BERT itself is Transformer-based and has no LSTM layers) – The training takes 50-70 days for 8 GPUs, while it was actually trained for 4 days with 16 TPUs by Google • Variations: – Feature based (generate word embedding) – Fine tuning

Custom Word Representations • When working in specialized domains (e.g. customer complaints or loan documents) general purpose word representations may not be adequate • Custom word representations built from domain specific corpora provide performance gains even when corpora may be smaller • Options include building language models from scratch or post-processing general purpose representations – Techniques for incorporating specialized sources of knowledge such as glossaries

91 Bonus 1: Advanced Neural Architectures

- Sentence Encoding using Auto Encoders - Sequence to Sequence - Transformer

92 Auto-Encoder for Sentence Encoding

J. Li, M. T. Luong, and D. Jurafsky, “A Hierarchical Neural Autoencoder for Paragraphs and Documents”, 2015.

93 Sequence to Sequence Modeling

• Encoder Decoder Architecture • Applications: – Speech recognition (many to many) – Machine translation (many to many) • Provides sentence representations

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

94 Transformer and Attention

§ RNN and CNN: Sequential – Word position aligns with computation step § Transformer architecture: Fully Connected - input sequences are transformed simultaneously into output – Shorter path length between long range dependencies – Lower computational complexity and more parallelizable – Positional encoding – Stacking helps: Syntactic information is derived in lower layers, semantic information is derived in higher layers – Multiple Attention heads

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is All You Need. In Advances in Neural Information Processing Systems.
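The attention operation at the core of the Transformer, as defined in the cited paper, is softmax(Q K^T / sqrt(d_k)) V. A minimal single-head sketch in NumPy is given below; it omits the learned projection matrices, multiple heads, and positional encoding, and the toy sizes and random inputs are arbitrary.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_queries, n_keys); each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8                       # toy sizes, chosen arbitrarily
X = rng.normal(size=(n_tokens, d_model))       # stand-in token embeddings
out, weights = scaled_dot_product_attention(X, X, X)   # self-attention, one head
```

Every token attends to every other token in a single matrix product, which is why the path between distant positions is short and the computation parallelizes well.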

95 Bonus 2: More Language Models

- BERT and ELMo - Near Synonym generation

96 BERT and ELMo

Masked Language Model (or MLM): A small percentage (10-15%) of the tokens are masked for training, a.k.a. cloze deletion test • BERT: Bidirectional Encoder Representations from Transformers – BERT is a BiLM and MLM – Each token is related to its own transformer. – The BERT process is jointly conditioned on both left and right contexts for all layers – BERT is easily available (TF Hub) • ELMo: Embeddings from Language Models – ELMo is a bidirectional language model rather than an MLM: it is trained to predict the next token in each direction, not masked tokens – Unlike BERT, ELMo uses a bidirectional LSTM (BiLSTM) with Cross-View Training (CVT), to examine a sentence before assigning an embedding to each word – In addition, ELMo concatenates independently trained left-to-right and right-to-left LSTMs to generate features for use downstream. – ELMo is easily available (TF Hub)

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Google AI Language, 2018.

97 Unsupervised Language Models: Near-Synonym System (NeSS)

An unsupervised corpus-based conditional model – for finding phrasal and near synonyms – requires only a large monolingual corpus

Based on maximizing information-theoretic combinations of shared contexts

Parallelizable for large-scale processing

D. Gupta, J. Carbonell, and A. Gershman, “Unsupervised Phrasal Near-Synonym Generation from Text Corpora,” AAAI, 2015.

98 Deep Learning and Computational Graph Techniques for Derivatives Pricing and Analytics

May 13, 2019 Bernhard Hientzsch, Ph.D., Managing Director, Head of (Markets) Model, Library, and Tools Development (M2LTD) Advanced Technologies of Modeling (AToM) Corporate Model Risk (CMoR) Wells Fargo & Company

Work with/by team members of M2LTD and CMoR

Outline of the Talk and Idea

• Standard Martingale Pricing approach with MC or PDE – challenges in higher dimensions and otherwise. • FBSDE combine SDEs for risk factors and for value. Given an initial value and a replicating strategy (IVRS), try to hit given final value as well as possible. IVRS satisfy this minimization & control problem. Can generate many paths to train. • Use DNNs to represent RS, TF computational graphs to simulate and solve FBSDE given RS -> DL problem. • For forward approach, objective function is how well final value replicated. TF and its optimization methods (including SGD) will give RS and IV and also value along paths. • Also will cover other approaches and applications that can be similarly expressed or solved or take advantage of this. • Pointers to examples and results and some numerical results in the presentation. • Conclusion and References.

100 Martingale Pricing

Assume we have one or several risky underliers X (potentially a vector) satisfying

$$dX_t = \mu_{LVM}(t)\, X(t)\, dt + \sigma_{LVM}(t, X(t))\, X(t)\, dW_t^Q$$

under some measure Q (for risk-neutral, $\mu_{LVM}(t)$ will be $r(t)$ or $r(t) - q(t)$; for Black-Scholes, $\sigma_{LVM}(t, X)$ will be $\sigma_{GBM}(t)$ or a constant). Will simulate many paths for X. If looking at discounted values, assume a money market account B as risk free:

$$dB(t) = r(t)\, B(t)\, dt$$

Only consider European exercise (i.e. payoff at maturity). According to the martingale approach, the value is:

$$u(t, x) = B(t)\, \mathbb{E}^Q\!\left[ \frac{g(X_T)}{B(T)} \,\middle|\, X_t = x \right]$$

In our case, $B(t)/B(T) = \exp\!\left(-\int_t^T r(s)\, ds\right)$. Taken at face value, that means a different expectation for each $(t, X(t))$. We are not determining some function representing $u(\cdot, \cdot)$ or $u(t, \cdot)$, just one point value at $(t, X(t))$.

101 Martingale Pricing – Monte Carlo

To approximate this by Monte-Carlo methods, simulate M independent copies (“paths”) of $X$ called $X^{(i)}$ with $X^{(i)}(t) = x$:

$$u(t, x) \approx \exp\!\left(-\int_t^T r(s)\, ds\right) \frac{1}{M} \sum_{i=1}^{M} g\!\left(X^{(i)}(T)\right)$$

Taken at face value, that means a different simulation to run for each $(t, X(t))$. We are not determining some function representing $u(\cdot, \cdot)$ or $u(t, \cdot)$, just one point value at $(t, X(t))$. Could try to find an approximate representation by least-square regression (LSM) or similar methodologies, but that requires good basis functions, good basis function underliers, and large effort and intuition. The simulation effort itself, though, depends only relatively weakly on the dimension d of $X$. Convergence of Monte-Carlo in general is slow at $O(M^{-0.5})$, but independent of the dimension d of $X$.
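The Monte-Carlo estimate can be sketched for the simplest case: a European call under Black-Scholes dynamics with constant rate and volatility, where a closed-form price is available as a reference. The parameter values, function names, and the choice of a call payoff are assumptions for illustration.

```python
import math
import numpy as np

def mc_call_price(x0, strike, r, sigma, T, n_paths=200_000, seed=0):
    """u(0, x0) ~ exp(-r T) * mean(g(X_T)) with g(x) = max(x - K, 0)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    # Exact terminal value of geometric Brownian motion under the risk-neutral measure.
    x_T = x0 * np.exp((r - 0.5 * sigma**2) * T + sigma * math.sqrt(T) * z)
    payoff = np.maximum(x_T - strike, 0.0)
    disc = math.exp(-r * T)
    price = disc * payoff.mean()
    std_err = disc * payoff.std(ddof=1) / math.sqrt(n_paths)   # shrinks like M^-0.5
    return price, std_err

def bs_call_price(x0, strike, r, sigma, T):
    """Black-Scholes closed form, used only as a reference value."""
    d1 = (math.log(x0 / strike) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda d: 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
    return x0 * N(d1) - strike * math.exp(-r * T) * N(d2)

price, std_err = mc_call_price(100.0, 100.0, 0.05, 0.2, 1.0)
reference = bs_call_price(100.0, 100.0, 0.05, 0.2, 1.0)
```

The run gives one number for one starting point (t = 0, x = 100); a different starting point needs a new simulation, which is the limitation noted above.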

102 Martingale Pricing – PDE

To compute u(t,x) on a grid through a Black-Scholes type PDE

$$u_t(t,x) + \mu(t)\, x\, u_x(t,x) + \tfrac{1}{2}\, \sigma^2(t,x)\, x^2\, u_{xx}(t,x) - r(t)\, u(t,x) = 0, \qquad u(T, x) = g(x)$$

(with $\mu$ and $\sigma$ the coefficients $\mu_{LVM}$ and $\sigma_{LVM}$ from the Martingale Pricing slide). Use Feynman-Kac to obtain this PDE from the definition of $u(t,x)$. For instance, use finite differences in time and asset directions, solving the PDE by time-stepping and applying difference operators (explicit) and also solving linear systems (implicit).

The computational grid for $u(t,x)$ will typically have $O(n_t n_x^d)$ points, and requires $O(n_t n_x^d)$ memory and $O(n_t n_x^{de})$ time (e=1 for explicit, 2-3 for implicit). This is too much memory and time for large d (“curse of dimensionality”). Of course, the equivalence of expectation and PDE needs to be proven (Feynman-Kac).

103 SDE/Expectation and PDE Pricing (Feynman-Kac)

More generally, the following are equivalent under appropriate conditions:

PDE

$$u_t(t,x) + \mu(t,x)\, u_x(t,x) + \tfrac{1}{2}\, \sigma^2(t,x)\, u_{xx}(t,x) - r(t,x)\, u(t,x) + f(t,x) = 0, \qquad u(T, x) = g(x)$$

Expectation

$$u(t,x) = \mathbb{E}^Q\!\left[ \int_t^T \exp\!\left(-\int_t^s r(\tau, X_\tau)\, d\tau\right) f(s, X_s)\, ds + \exp\!\left(-\int_t^T r(\tau, X_\tau)\, d\tau\right) g(X_T) \,\middle|\, X_t = x \right]$$

under $dX = \mu(t, X)\, dt + \sigma(t, X)\, dW^Q$.

The earlier GBM/LVM setting corresponds to setting $\mu(t, X) = \mu_{LVM}(t)\, X$, $\sigma(t, X) = \sigma_{LVM}(t, X)\, X$, $r(t, x) = r(t)$, $f(t, x) = 0$.

104 Other Equations for Value ?

• Would be nice if there were a direct equation or formula for $u(t, x)$ along some simulated (or otherwise given) path for X, so that values of $u(t, x)$ can be computed along all paths, and then maybe a formula/expression for $u(t, x)$ can be learned from those values. • If $u(t, x)$ is characterized by a no-arbitrage condition, then for instance the self-financing condition for the replicating strategy (assuming we know or will later determine the replicating strategy) will give such a (stochastic) equation • Alternatively, if we somehow know that $u(t, x)$ is a function of t and X(t) that satisfies all necessary conditions and a PDE (or that the discounted value is a martingale), Ito’s lemma also gives an SDE • For arbitrary instruments and replication strategies, we might not know what u will depend on, so write it as a stochastic process Y(t) rather than $u(t, X(t))$

105 Self-financing Condition

Given the value Y(t) and a replication strategy Z(t) (representing the amount of each risky security scaled by vol), the self-financing condition reads

$$dY(t) = r(t)\, Y(t)\, dt + Z(t)\, dW_t$$

or with discretized time:

$$Y(t + \Delta t) - Y(t) = r(t)\, Y(t)\, \Delta t + Z(t)\, \Delta W_t$$

and the final value $Y_T = g(X_T)$ is known. That kind of SDE with given final value is typically called a “backward” SDE (BSDE). The system of SDEs for X (given initial values, determined forward) and for Y (given final values, determined “backward”) is called a forward-backward stochastic differential equation (“FBSDE”). Under appropriate conditions, it can be proven that $Y(t) = u(t, X(t))$ and $Z(t) = \sigma^T(t, X(t))\, \pi(t, X(t))$, where $\pi$ denotes the replicating portfolio amounts, that u will follow a (in general) nonlinear PDE, and under further conditions $\pi(t, X(t)) = u_x(t, x)$ (or, in general, $\nabla_x u(t, x)$): nonlinear Feynman-Kac. Notice that $\pi(t, X(t))$ acts as a scaling to get the random part of the Y SDE from the random part of the X SDE. Often, the BSDE is written as the negative of the above.

106 FBSDE and PDE Pricing

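More generally, the following are equivalent (“nonlinear Feynman-Kac”):

FBSDE

$$dX_t = \mu(t, X_t)\, dt + \sigma(t, X_t)\, dW_t, \qquad X_0 = x$$

$$-dY_t = f(t, X_t, Y_t, Z_t)\, dt - Z_t^T\, dW_t, \qquad Y_T = g(X_T)$$

PDE

$$u_t(t,x) + \tfrac{1}{2}\, \mathrm{Tr}\!\left[\sigma \sigma^T(t,x)\, \mathrm{Hess}_x u(t,x)\right] + \mu(t,x)\, \nabla u(t,x) + f\!\left(t, x, u(t,x), \sigma^T(t,x)\, \nabla u(t,x)\right) = 0$$

with $u(T, x) = g(x)$.

For our example: $f(t, X_t, Y_t, Z_t) = -r(t)\, Y_t$, others as on the Feynman-Kac slide.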

107 Using BSDE – Pressing Forward

Going back, assuming simulated copies of X(t) and recorded values of X and $\Delta W_t$ along those paths, the time-discretized BSDE looks as follows:

$$\Delta Y(t) = Y(t + \Delta t) - Y(t) = r(t)\, Y(t)\, \Delta t + \sigma(t, X(t))\, \pi(t, X(t))\, \Delta W_t$$

(On the original slide, everything that is already known is colored green.) Assume that a replication strategy $\pi(t_i, x)$ for each time $t_i$ is known as some parametrized function $\pi_{t_i}(X_{t_i}, \Theta_{t_i})$, such as a DNN. Then the SDE can be used in either direction.

Forward (Weinan): Guessing a $Y(0)$, we can compute $Y^F(t_i; Y(0), \Theta_{t_\cdot})$ and then finally $Y^F(T) = Y(T; Y(0), \Theta_{t_\cdot})$ by using the BSDE forward. To find the exact solution, we need $Y^F(T; Y(0), \Theta_{t_\cdot}) = g(X_T)$. To find an approximate solution, make $\mathbb{E}\!\left[\left(g(X_T) - Y^F(T; Y(0), \Theta_{t_\cdot})\right)^2\right]$ as small as possible: we try to determine $Y(0)$ and $\Theta_{t_\cdot}$, and thereby the replicating strategy $\pi_{t_i}(X_{t_i}, \Theta_{t_i})$, so that the expectation is as small as possible. For a nonrandom ODE, this is called the shooting method. Possibly other norms could be used. Can use deterministic or stochastic optimization methods such as stochastic gradient descent.
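The forward objective can be illustrated without any neural network by plugging in a strategy that is already known. In the sketch below, the closed-form Black-Scholes delta of a European call stands in for the trained strategy, and the terminal mismatch is compared for the Black-Scholes price and for a deliberately wrong initial value. All parameter values and function names are assumptions for illustration; this is not the implementation referred to in the talk.

```python
import math
import numpy as np

r, sigma, T, K, x0 = 0.05, 0.2, 1.0, 100.0, 100.0     # illustrative values
n_steps, n_paths = 50, 2_000
dt = T / n_steps
norm_cdf = np.vectorize(lambda d: 0.5 * (1.0 + math.erf(d / math.sqrt(2.0))))

def bs_delta(t, x):
    """Black-Scholes u_x(t, x) for a call; a stand-in for a trained strategy."""
    tau = T - t
    d1 = (np.log(x / K) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    return norm_cdf(d1)

def terminal_mismatch(y0, seed=0):
    """Roll the discretized BSDE forward from Y(0) = y0; return mean (g(X_T) - Y_T)^2."""
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, x0)
    y = np.full(n_paths, y0)
    for i in range(n_steps):
        dw = rng.standard_normal(n_paths) * math.sqrt(dt)
        pi = bs_delta(i * dt, x)
        y = y + r * y * dt + sigma * x * pi * dw       # BSDE step, sigma(t, X) = sigma * X
        x = x + r * x * dt + sigma * x * dw            # Euler step for X
    return np.mean((np.maximum(x - K, 0.0) - y) ** 2)

# Black-Scholes price of the call at t = 0, used as the "right" guess for Y(0).
N = lambda d: 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))
d1 = (math.log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
y0_bs = x0 * N(d1) - K * math.exp(-r * T) * N(d1 - sigma * math.sqrt(T))

loss_good = terminal_mismatch(y0_bs)          # Y(0) at the Black-Scholes price
loss_bad = terminal_mismatch(y0_bs + 2.0)     # deliberately wrong initial value
```

In the deep-learning setting the delta would be replaced by a parametrized network, and Y(0) and the network weights would be optimized jointly to drive this mismatch down.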

Let the Tensors Flow

The way $Y^F(t_i; Y(0), \Theta_{t_\cdot})$ is computed from $Y(0)$ and $\pi_{t_i}(x_{t_i}; \Theta_{t_i})$ and all other things that it depends on can be expressed as a TensorFlow computation graph.

The DNN for $\pi_{t_i}(\cdot\,; \Theta_{t_i})$ can also be expressed as a TensorFlow graph.

Altogether, we obtain $Y^F(t_i; Y(0), \Theta_{t_\cdot})$ as a TensorFlow graph and so can use stochastic optimization methods and other algorithms implemented in TensorFlow to determine $Y(0)$ and $\pi_{t_i}(\cdot\,; \Theta_{t_i})$. $\pi_{t_i}(\cdot\,; \Theta_{t_i})$ will be the replicating portfolio amounts for the different underliers X.

$\pi_{t_i}(\cdot\,; \Theta_{t_i})$ could be represented by different networks for different $t_i$ or as a single network $\pi(t_i, \cdot\,; \Theta_{t_\cdot}) = \pi(\cdot, \cdot\,; \Theta_{t_\cdot})$. Those networks could use different architectures. Note that the Y/u values along each path and expressions for the gradient of u are known, which can be used for CVA/DVA (see the She and Grecu article).

TensorFlow as an Intermediate Representation/Language

• Various ways to run TensorFlow (TF): serial, multi-core, (multi-)GPU, and/or distributed
• IBM and NVIDIA are very interested and willing to support work in that area
• TF is a powerful intermediate representation (in the CS sense) which comes with parallelization, (A)AD, visualization, …
• For instance, if one implements MC simulation pricing in TF, one can compute greeks by adding a few lines
• Similarly, if an implied volatility surface representation is given as a TF graph, Dupire in various forms is "automatic"

Using BSDE – Looking Backward

Backward (Wang et al): Starting from $Y(T) = g(X_T)$, use the BSDE backwards to compute $Y^B(t_i; \Theta_{t_\cdot})$ and then finally $Y^B(0; \Theta_{t_\cdot})$.

Assuming the instrument value under the replicating strategy is given as a function $u$ of $t$ and $X_t$ and of no other arguments (at least in a neighborhood of $t=0$ and $x=X_0$), an exact solution $Y^B(0)$ should be the same along all paths. A good approximation should minimize $\mathrm{Var}(Y^B(0))$ (i.e., the size of the range of $Y^B(0)$). Given replications of X, approximate replications of $Y^B$ and the mean of $Y^B(0)$, $\overline{Y^B}(0)$, can be computed for those replications. Variance of $Y^B(0)$ as an objective function:

$$E\left[\left(Y^B(0; \Theta_{t_\cdot}) - \overline{Y^B}(0; \Theta_{t_\cdot})\right)^2\right]$$

$\overline{Y^B}(0)$ would also be the desired approximation of $Y(0)$. This allows $\Theta_{t_\cdot}$ to be determined so that the expectation is as small as possible. One can use deterministic or stochastic optimization methods such as stochastic gradient descent, just as in the forward case. The computation is also a TensorFlow computational graph, just as in the forward case. Of course, one needs to prove that $Y^B$ is unique and the same as the other characterizations in the limit.
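A small numerical sketch of the backward roll, again under illustrative assumptions only (one Black-Scholes underlier, a European call, hypothetical function names and parameters): paths of X are simulated forward, then the discretized BSDE is solved for $Y(t)$ from $Y(t+\Delta t)$ on each path. With the analytic delta as strategy the spread of $Y^B(0)$ across paths is small, while with no hedging it is large, which is the signal the variance objective uses.

```python
import math
import numpy as np

def backward_initial_values(strategy, n_paths=5000, n_steps=50, x0=100.0,
                            strike=100.0, r=0.05, sigma=0.2, T=1.0, seed=0):
    """Return Y^B(0) on every path for a given strategy pi(t, x)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    dw = rng.normal(0.0, math.sqrt(dt), (n_steps, n_paths))
    x = np.empty((n_steps + 1, n_paths))
    x[0] = x0
    for n in range(n_steps):                          # forward SDE for X (exact GBM step)
        x[n + 1] = x[n] * np.exp((r - 0.5 * sigma**2) * dt + sigma * dw[n])
    y = np.maximum(x[-1] - strike, 0.0)               # terminal condition Y(T) = g(X_T)
    for n in reversed(range(n_steps)):                # discretized BSDE, backward in time
        z = sigma * x[n] * strategy(n * dt, x[n])
        y = (y - z * dw[n]) / (1.0 + r * dt)          # invert Y(t+dt) = Y(t)(1 + r dt) + Z dW
    return y

def bs_delta(t, x, strike=100.0, r=0.05, sigma=0.2, T=1.0):
    """Black-Scholes call delta, assumed here in place of a trained network."""
    tau = T - t
    d1 = (np.log(x / strike) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    return 0.5 * (1.0 + np.vectorize(math.erf)(d1 / math.sqrt(2.0)))

y0_hedged = backward_initial_values(bs_delta)
y0_unhedged = backward_initial_values(lambda t, x: np.zeros_like(x))
var_hedged, var_unhedged = y0_hedged.var(), y0_unhedged.var()
```

The sample mean of `y0_hedged` plays the role of $\overline{Y^B}(0)$, the price estimate.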

Using BSDE – Looking Backward

• Running the BSDE backwards also allows Bermudan exercise to be taken into account
• The exercise decision is made by comparing the value if not exercised (given by the BSDE) with the value if exercised (given by the exercise condition)
• Similarly, barrier options or similar instruments could be treated, since in those circumstances values of the solution are known and could be propagated backwards
• Of course, for exercisable instruments or barriers, it is important to determine and record on every path whether the instrument has been exercised or has touched the barrier
• For different states of the instrument (exercised/knocked-in/…), one needs to train different representations, possibly with different BSDEs

Other Formulations/Approaches - BSDE

• Instead of the self-financing condition or replicating portfolio set-up, one can use the specification of underliers/risk factors under some consistent measure and try to approximate (for some computable numeraire N)

$$u(t, x) = E\left[g(X_T) \mid X_t = x\right]$$

$$u(t, x) = N(t)\,E\left[\frac{g(X_T)}{N(T)} \,\Big|\, X_t = x\right]$$

• Given this functional form and some assumptions, Ito's lemma will give a BSDE for $Y(t) = u(t, X(t))$. Use that BSDE similarly to the other BSDE.
• Assume more realistic assumptions for the replicating portfolio, such as different rates for borrowing and lending. The self-financing condition will change to give a BSDE with the f term:

$$f(t, x, y, z) = -r_l\,y + (r_b - r_l)\left[z^{\top}\mathbf{1} - y\right]^{+}$$

• Similarly, one can handle other FBSDEs for XVA etc. (Weinan and others have many examples).
• If underliers or instruments have dividends/fees, make appropriate changes in the FSDE or BSDE

Want to Know the Solution?

• Want to take advantage of $\pi(t, X(t)) = u_x(t, x)$ and determine some approximation for $u(t, X(t))$ directly (instead of point values of Y)
• This means the BSDE is no longer used to determine point values forward or backward but is used as a constraint to evaluate how well the current guess of the form of u satisfies the BSDE, for instance with the following term in the loss function ($Y^n = u(t^n, X^n)$, $Z^n = \nabla u(t^n, X^n)$, $\Phi^n$ and $\Sigma^n$ are the f and $\sigma$ terms in the FBSDE):

$$\sum_{m}\sum_{n}\left| Y_m^{n+1} - Y_m^{n} - \Phi_m^{n}\,\Delta t^{n} - (Z_m^{n})^{\top}\,\Sigma_m^{n}\,\Delta W_m^{n} \right|^2$$

• Could use it in a step-wise, rolling-back fashion to determine "slices" of u going backwards, or use it globally to judge how well the current global guess satisfies the BSDE – Raissi's FBSNN
• For a replicating portfolio, delta-hedging is not necessarily optimal for long times between hedging times, so it stands to reason that this will work better for smaller time-step sizes.
• Once the solution is known, it can be used for CVA/DVA/PFE/DIMM, collateralized CVA/DVA/..

Other Instruments

• For instance, for barrier options or Asian options, there is some relatively “weak” path-dependency • Standard approach is to extend state space (add elements to X) so as to make X Markovian • Open question exactly how far this can be treated with the standard FBSDE approach – currently working on some areas. In general volatility matrix of that extended X is degenerate or non-square and drift is not differentiable/continuous • For some cases (Black-Scholes with constant parameters), can be rewritten as final value problem with barrier breach probability by Brownian bridge approach (Bing Yu et al) – treat as before

Quantitative Finance Examples in Raissi/E

• BS with default and/or (1st order, E) • BS with differential rates – different rates for borrowing and lending (1st order, E) • BSB (Black-Scholes-Barenblatt) related to uncertain volatility model (2nd order, E) • BSB (R) • Also some synthetic test examples with explicit solution • In high dimensions – up to 100 • Their examples only show uncorrelated underliers

Some of our Extensions

• Implemented Local Volatility Model and Heston Model in TensorFlow • Implemented some interest rate models • Implemented correlated cases and various payoffs • Implementing geometric combination of time-dependent geometric Brownian Motion as test case • Tests of approaches against each other and against established approaches (MC, PDE) for lower-dimensional problems • Using learned solution for other analytics (XVA, …) • Extensions to extended state spaces etc.

Other Quantitative Finance Examples in CMoR

• LMM for caps/Europeans – Wang et al • LMM for Bermudans – Wang et al • CVA/DVA for forward and backward approach for BSDE – She et al • Barriers – Yu et al

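Example with Explicit Analytical Solution (E)

d=15, N=5

| | Weinan approach | Weinan approach | Raissi approach | Raissi approach |
| $X_0$ | 0 | 1 | 0 | 1 |
| Loss | 6.80E-05 | 4.89E-07 | 6.40E-03 | 3.28E-04 |
| $Y_0$ (computed) | 0.503 | 0.999945 | 0.49705 | 0.99985 |
| $Y_0$ (exact) | 0.5 | 0.999999694 | 0.5 | 0.999999694 |
| Relative Error | 0.596% | 0.005% | 0.594% | 0.015% |

d=15, N=50

| | Weinan approach | Weinan approach | Raissi approach | Raissi approach |
| $X_0$ | 0 | 1 | 0 | 1 |
| Loss | 6.80E-05 | 4.15E-05 | 8.32E-04 | 6.20E-04 |
| $Y_0$ (computed) | 0.503337 | 1.00004 | 0.49987 | 0.99938 |
| $Y_0$ (exact) | 0.5 | 0.999999694 | 0.5 | 0.999999694 |
| Relative Error | 0.663% | 0.004% | 0.026% | 0.062% |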

Heston - Different ! and # Time Steps (N)

Heston – # Time Steps (N)

Heston – # of SDE Simulations (M)

Heston – Learned vs Exact (FBSNN)

Conclusion

• The deep FBSDE approach, which consists of:
– Changing the solution characterization to an FBSDE
– Formulating the pricing problem as a minimization of replication accuracy (or minimum spread of initial values) given a replicating strategy
– Representing the replicating strategy by a DNN
– Determining the replication strategy by DL (TensorFlow/PyTorch/…)
– Computing the initial value and the value along each path given the optimized replicating strategy by DL/TensorFlow
• is a powerful new approach to solve high-dimensional pricing problems
• This approach can be extended to other problems and settings (such as LMM, Bermudan Options, Barriers) and its results can be used for other analytics (XVA etc.)

References I

• E, W., Han, J., & Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4), 349-380. arXiv preprint arXiv:1706.04702.
• Beck, C., E, W., & Jentzen, A. (2017). Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. arXiv preprint arXiv:1709.05963.
• Raissi, M., Perdikaris, P., & Karniadakis, G.E. (2017). Physics Informed Deep Learning (Part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561.
• Raissi, M. (2018). Forward-Backward Stochastic Neural Networks: Deep Learning of High-dimensional Partial Differential Equations. arXiv preprint arXiv:1804.07010.

References II

• Wang, H., Chen, H., Sudjianto, A., Liu, R., & Shen, Q. (2018). Deep Learning-Based BSDE Solver for Libor Market Model with Application to Bermudan Swaption pricing and hedging. arXiv preprint arXiv:1807.06622. • She, J.-H., Grecu, D. (2018). Neural Network for CVA: Learning Future Values. arXiv preprint arXiv:1811.08726. • Yu, B., Xing, X. (2019). Deep Learning Based Numerical BSDE Method for Barrier Options. CMoR internal whitepaper.

Deep insights into interpretability of machine learning algorithms and applications to risk management

May 13, 2019 Jie Chen, Ph.D. MD, Head of Statistics and Machine Learning, Corporate Model Risk

Interpreting machine learning models

• Machine learning gives very good predictive performance

• But the biggest criticism of machine learning algorithms is their interpretability: the predictor $\hat f(x)$ is a "black box" that is hard to interpret
• True of all ensemble methods, SVM, neural network
• We need to understand the internals of a machine learning algorithm:
– Required by regulation
– Get insights from the model and make scientific/business findings
• Some main questions to answer are
– Which variables are important?
– What does the input-output relationship look like for each important variable/a subset of important variables? Nonlinearity? Interaction?
– How do correlations among variables impact the response surface?
– How can we ensure the relationships from ML are consistent with historical and business understanding?
• Machine learning interpretation is an active research area now.

Approaches for Interpreting machine learning models

• Diagnostic tools
– Variable importance
o Local importance
o Global importance
– Effects of inputs to outputs
o 1D PDP
o 2D PDP and H-statistics for interactions
o ICE plot and ICE ANOVA
o Derivative based diagnostic tools
• Model distillation:
– Global surrogate tree
– KLIME
– LIME-SUP
• Structured interpretable model: explainable neural network

Model Explainability Approaches

• Global Diagnostics (effects of inputs to outputs, impact of correlations): Derivative-Based Approach and Variance Analysis
– Liu, Chen, Vaughan, Nair, Sudjianto (2018), Model Interpretation: A Unified Derivative-based Framework for Nonparametric Regression and Supervised Machine Learning, arXiv:1808.07216
• Local diagnostics and Model Distillation: Locally Interpretable Model
– Hu, Chen, Nair, Sudjianto (2018), Locally Interpretable Models and Effects based on Supervised Partitioning (LIME-SUP), arXiv:1806.00663
• Structured-Interpretable Model: Explainable Neural Networks
– Vaughan, Sudjianto, Brahimi, Chen, Nair (2018), Explainable Neural Networks based on Additive Index Models, arXiv:1806.01933

Global and local diagnostics

• Global interpretation is aimed at interpreting the overall relationship between input and output over the entire space. • Local interpretation is aimed at interpreting the relationship between input and output over local region, with the idea that – a simple parametric model may be used to approximate the input-output relationship – local variable importance and input-output relationships are easily interpretable from the simple local model.

A real data example: home lending case

• This dataset is based on a retired home lending residential mortgage model.
• We used a randomly selected subset of 1 million observations, divided into training, validation and testing sets.
• Response is an indicator variable indicating if the loan is in trouble; there are 7 raw explanatory variables listed in the table below.

| Variable | Explanation |
| fico0 | fico at snapshot |
| ltv_fcast | ltv forecasted |
| dlq_new | delinquency status, 1 if clean and 0 otherwise |
| unemprt | unemployment rate |
| totpersincyy | total personal income year to year ratio |
| h | horizon 1, 2, …, 9 quarters |
| premod_ind | indicator before recession Q2 2007 |

Diagnostic tools

Local importance

• Describe how individual observation’s attributes affect model prediction for that observation. • Important for providing reason codes for credit decisions • Approaches – LIME (Local Interpretable Model-Agnostic Explanations) – KLIME – LIME-SUP – LOCO(leave one covariate out) – SHAP explanation – Tree interpreter – Quantitative input influence(QII) – Integrated gradients – DeepLIFT – Layer-wise Relevance Propagation (LRP) – Derivative based sensitivity analysis

LIME

• LIME (Local Interpretable Model-Agnostic Explanations) is perhaps the first local interpretation method, proposed in Ribeiro et al. (2016). • The idea is to approximate the model around a given instance/observation in order to explain the prediction: – Simulate new instances – Predict on the new instances using the machine learning model – Pick a kernel and fit a linear model using the kernel as weight; penalize the complexity of the linear model, for example, fit ridge regression.

• Available in python (lime package) and R (lime package)
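The three LIME steps above can be sketched without the `lime` package. This is a minimal illustration, not the package's implementation: it assumes Gaussian perturbations around the instance, a Gaussian kernel on Euclidean distance, and a closed-form weighted ridge fit; `lime_explain`, `black_box` and all parameter values are hypothetical.

```python
import numpy as np

def lime_explain(predict, x0, n_samples=2000, scale=0.1, kernel_width=0.2,
                 ridge=1e-3, seed=0):
    """Fit a kernel-weighted ridge regression to `predict` around the instance x0."""
    rng = np.random.default_rng(seed)
    X = x0 + scale * rng.standard_normal((n_samples, x0.size))   # 1. simulate new instances
    y = predict(X)                                               # 2. predict with the ML model
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / kernel_width**2) # 3. kernel weights
    A = np.hstack([np.ones((n_samples, 1)), X - x0])             # intercept + local offsets
    penalty = ridge * np.eye(A.shape[1])
    penalty[0, 0] = 0.0                                          # do not penalize the intercept
    beta = np.linalg.solve(A.T @ (w[:, None] * A) + penalty, A.T @ (w * y))
    return beta[0], beta[1:]                                     # local intercept, local slopes

def black_box(X):
    """Stand-in for a fitted ML model (hypothetical)."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([0.0, 1.0])
intercept, slopes = lime_explain(black_box, x0)
```

The local slopes approximate the gradient of the black box at `x0`, which is what makes them usable as a local explanation.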

Global importance

• Measures the overall impact of an input feature on the model predictions
• Important for variable selection
• Approaches
– tree-based importance (e.g. relative influence)
– permutation test based importance
– Sobol' indices global sensitivity analysis
– ANOVA decomposition based on ICE plots
– derivative-based importance
– Shapley effects
– …

Permutation test and tree based Importance

• Permutation based importance
– Randomly permute the corresponding column in the data set while keeping other columns unchanged
– Compute the decrease in prediction performance as the measure of importance
• Tree based importance for XGBoost
– For a single tree, compute the importance of a variable $x_j$ by the total reduction of impurity at nodes where $x_j$ is used as a splitting variable
– For ensemble methods like random forest or GBM, the importance of $x_j$ is summed or averaged over all trees

LTV_fcast and fico0 are the top important variables
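A minimal sketch of permutation based importance, assuming mean squared error as the performance measure (so the "decrease in performance" is the increase in MSE) and a toy linear function standing in for the fitted model; the data and names are illustrative only and unrelated to the home lending example.

```python
import numpy as np

def permutation_importance(predict, X, y, seed=0):
    """Increase in MSE when one column is permuted and the others are left unchanged."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])      # break the link between column j and y
        importance[j] = np.mean((y - predict(Xp)) ** 2) - base_mse
    return importance

rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 3))
model = lambda X: 3.0 * X[:, 0] + 0.5 * X[:, 1]   # hypothetical fitted model; ignores column 2
y = model(X) + 0.1 * rng.standard_normal(2000)
imp = permutation_importance(model, X, y)
```

A variable the model never uses gets importance exactly zero, since permuting it leaves every prediction unchanged.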

Global sensitivity analysis

Sensitivity analysis studies how the variation in the model output can be apportioned to variation in the model inputs. Global sensitivity analysis based on Sobol' indices (Sobol 1993):
• Any function $f(x)$ can be represented as a sum of main effects, interaction effects, etc.

$$f(x) = f_0 + \sum_{i=1}^{p} f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j) + \cdots + f_{12\ldots p}(x)$$

• A functional decomposition of the variance is available, referred to as functional ANOVA

$$\mathrm{Var}(f(x)) = \sum_{i=1}^{p} \mathrm{Var}(f_i(x_i)) + \sum_{i<j} \mathrm{Var}(f_{ij}(x_i, x_j)) + \cdots + \mathrm{Var}(f_{12\ldots p}(x))$$

$$D = \sum_{i=1}^{p} D_i + \sum_{i<j} D_{ij} + \cdots + D_{12\ldots p}$$

• Sobol' indices are defined as
– $S_j = D_j / D$. It measures the main effect of $x_j$.
– $S_j^{T} = \sum_{u \ni j} D_u / D$

Effects of inputs to outputs

– 1D PDP
– 2D PDP and H-statistics for interactions
– ICE plot and ICE ANOVA
– Unified derivative based framework
o Marginal and ALE plot
o Accumulated total derivative effect (ATDEV) plots
o Scattered partial derivative plots and LE plots

1D Partial dependence plot

• Partial dependence plots are used to visualize the input-output relationship, proposed in Friedman (2001).
• It removes the effect of other variables by marginal integration, so you get the "partial effect" of a variable.
• Consider a single input variable $x_j$ first; partition $x$ into $(x_j, x_{\setminus j})$ where $x_{\setminus j}$ is the complementary set.

$$f_{PD}(x_j) = \int f(x_j, x_{\setminus j})\,p(x_{\setminus j})\,dx_{\setminus j}$$

• How to compute this from data?
– Let $\hat f(x) = \hat f(x_j, x_{\setminus j})$ be the fitted model
– For each grid value $c$, fix $x_j$ at $c$ and compute the average value of $\hat f$ over the entire data

$$\hat f_{PD}(x_j = c) = \frac{1}{n}\sum_{i=1}^{n} \hat f(x_j = c, x_{\setminus j, i})$$

– Plot $\hat f_{PD}(x_j)$ against $x_j$ over the grid.

The original plot is from Apley, D. W. (2016), "Visualizing the effects of predictor variables in black box supervised learning models".

• One-dimensional plots show possible nonlinearity.
• Drawback: extrapolation. When covariates are correlated, the grid points for the PDP calculation could be far from the data distribution, so the prediction involves extrapolation.
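The data-based PDP recipe above translates directly into code. The sketch below uses a toy function in place of the fitted model (an assumption for illustration; `partial_dependence` and `model` are hypothetical names), chosen so that the exact partial dependence is known in closed form.

```python
import numpy as np

def partial_dependence(predict, X, j, grid):
    """For each grid value, fix column j at that value and average predictions over the data."""
    pd_values = np.empty(len(grid))
    for g, value in enumerate(grid):
        Xg = X.copy()
        Xg[:, j] = value
        pd_values[g] = predict(Xg).mean()
    return pd_values

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
model = lambda X: X[:, 0] ** 2 + X[:, 1]     # hypothetical fitted model
grid = np.linspace(-2.0, 2.0, 9)
pd_curve = partial_dependence(model, X, 0, grid)
```

For this additive toy model the curve equals the grid value squared plus the sample mean of the second column, so the nonlinearity in the first input is recovered exactly.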

2D PDP and Interaction statistics

• Similarly, partial dependence plot can also be defined for two and higher dimensions – For two dimensional plots, we can check two-way interaction effects – However it is computationally expensive.

• Friedman and Popescu (2005) defined the following H-statistic $H_{jk}$ to measure the interaction between $x_j$ and $x_k$

$$H_{jk}^2 = \frac{\sum_{i=1}^{n}\left[f_{PD}(x_{j,i}, x_{k,i}) - f_{PD}(x_{j,i}) - f_{PD}(x_{k,i})\right]^2}{\sum_{i=1}^{n} f_{PD}^2(x_{j,i}, x_{k,i})}, \qquad H_{jk} = \sqrt{H_{jk}^2}$$

– $f_{PD}(x_{j,i}, x_{k,i})$, $f_{PD}(x_{j,i})$, $f_{PD}(x_{k,i})$ are the centered partial dependence functions
– $H_{jk}^2$ is the proportion of variation in $f_{PD}(x_{j,i}, x_{k,i})$ unexplained by an additive model; it is a relative measure.
– When two variables are irrelevant, both the denominator and the numerator are small and $H_{jk}$ can be high due to instability
• An absolute version of the H-statistic is:

$$\tilde H_{jk}^2 = \frac{1}{n}\sum_{i=1}^{n}\left[f_{PD}(x_{j,i}, x_{k,i}) - f_{PD}(x_{j,i}) - f_{PD}(x_{k,i})\right]^2, \qquad \tilde H_{jk} = \sqrt{\tilde H_{jk}^2}$$

Similarly, it can be computed faster on a grid.
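A brute-force sketch of both H-statistics, assuming the centered partial dependence functions are evaluated at the observed data points (cost grows with the square of the sample size, so a small sample is used). The toy model and the names `centered_pd` and `h_statistic` are hypothetical.

```python
import numpy as np

def centered_pd(predict, X, cols):
    """Centered partial dependence on columns `cols`, evaluated at each data point."""
    values = np.empty(len(X))
    for i in range(len(X)):
        Xi = X.copy()
        Xi[:, cols] = X[i, cols]          # fix the chosen columns at observation i's values
        values[i] = predict(Xi).mean()    # average over the remaining columns
    return values - values.mean()

def h_statistic(predict, X, j, k):
    """Return (relative H, absolute H) for the pair (j, k)."""
    pd_jk = centered_pd(predict, X, [j, k])
    resid = pd_jk - centered_pd(predict, X, [j]) - centered_pd(predict, X, [k])
    h_rel = np.sqrt(np.sum(resid**2) / np.sum(pd_jk**2))
    h_abs = np.sqrt(np.mean(resid**2))
    return h_rel, h_abs

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
model = lambda X: X[:, 0] * X[:, 1] + X[:, 2]    # interaction only between columns 0 and 1
h01, h01_abs = h_statistic(model, X, 0, 1)
h02, h02_abs = h_statistic(model, X, 0, 2)
```

The pair with a genuine product term gets a relative H close to one, while the additive pair gets a value that is zero up to floating-point error.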

1D PDP for home lending case

• We can see some nonlinear trend for snapshot FICO, unemployment rate, prediction horizon h, and ltv_fcast.

H-statistics

• Unscaled H-statistics shows that there are strong interactions between h and dlq_new_clean, and between fico0 and LTV_fcast.

fico0 ltv_fcast dlq_new_clean unemprt totpersincyy h premod_ind

fico0 NaN 0.1630 0.1224 0.0820 0.0360 0.0339 0.1107

ltv_fcast 0.1630 NaN 0.0518 0.0291 0.0286 0.0186 0.0843

dlq_new_clean 0.1224 0.0518 NaN 0.0101 0.0071 0.2296 0.0003

unemprt 0.0820 0.0291 0.0101 NaN 0.0232 0.0122 0.0094

totpersincyy 0.0360 0.0286 0.0071 0.0232 NaN 0.0068 0.0661

h 0.0339 0.0186 0.2296 0.0122 0.0068 NaN 0.0192

premod_ind 0.1107 0.0843 0.0003 0.0094 0.0661 0.0192 NaN

2D PDPs for home lending case

• 2D PDPs further verify the interactions between FICO and LTV_fcast (left), and between h and dlq_new_clean (right).

ICE plots

• The ICE (individual conditional expectation) plot was proposed in Goldstein et al. (2013); it is a localized adaptation of the partial dependence plot
– The 1-d partial dependence plot $\hat f_{PD}(x_j)$ shows the average of $\hat f(x_j, x_{\setminus j,i})$ over the entire data. When there are interaction effects, we expect $\hat f(x_j, x_{\setminus j,i})$ to have different patterns for different $x_{\setminus j,i}$. Such averaging will lose the interaction information.
– The ICE plot is a plot of all the $n$ curves $\hat f(x_j, x_{\setminus j,i})$, $i = 1, 2, \ldots, n$, conditional on $x_{\setminus j,i}$. Each curve is localized for a single $i$th observation.
– It allows us to see if there is any change in the input-output relationship for $x_j$ across observations, and thus to see any interaction effect.

• Centered ICE (CICE) Plots
– Given the sample, the CICE plot for $x_j$ subtracts from each ICE curve the weighted mean of that curve.

• Normalized CICE plot
– The normalized CICE plot subtracts from the CICE curves the corresponding centered partial dependence curve.

ICE plot for home lending case

• The dots are the sample points
• The black curves are the ICE curves over a grid.
• The red curves are the PDP, centered PDP and zero curves, respectively.
• The total variance of data, total sensitivity, first order sensitivity and interaction sensitivity can be visualized by the variance of the ICE plot, CICE plot, centered PDP and normalized CICE plot, respectively.
• The strongly divergent slopes show the strong interactions

ICE ANOVA decomposition

• Three types of variances
– Total effect for $x_j$: weighted variance of the CICE plot
– The first order effect for $x_j$: weighted variance of the centered PDP
– Pure interaction effect where $x_j$ is involved: weighted variance of the normalized CICE
• ICE ANOVA decomposition

Total effect for $x_j$ = the first order effect for $x_j$ + the interaction effect where $x_j$ is involved

• Chen et al. (2018) give a more rigorous theoretical formulation of the ICE plot, CICE plot, and normalized CICE plot. If we choose the reference point as $E(Y \mid x_{\sim j,i})$, which is the mean of each ICE curve, we are able to prove that ICE plots and Sobol' indices global sensitivity (Saltelli, 2002; Saltelli, et al., 2010) are equivalent for independent cases.
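The decomposition can be checked numerically. The sketch below makes two simplifying assumptions that are not stated on the slide: the grid is the set of observed values of $x_j$, and all weights are equal. The toy model and the name `ice_anova` are hypothetical.

```python
import numpy as np

def ice_anova(predict, X, j):
    """Return (total, first order, interaction) variances from ICE curves for column j."""
    grid = X[:, j]                                    # assumption: grid = observed values
    ice = np.empty((len(X), len(grid)))               # ice[i, g] = f(grid[g], other inputs of obs i)
    for g, value in enumerate(grid):
        Xg = X.copy()
        Xg[:, j] = value
        ice[:, g] = predict(Xg)
    cice = ice - ice.mean(axis=1, keepdims=True)      # CICE: subtract each curve's own mean
    pdp = ice.mean(axis=0)                            # PDP: average of the ICE curves
    cpdp = pdp - pdp.mean()                           # centered PDP
    ncice = cice - cpdp                               # normalized CICE
    return np.mean(cice**2), np.mean(cpdp**2), np.mean(ncice**2)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
model = lambda X: X[:, 0] + X[:, 0] * X[:, 1]         # main effect plus interaction
total, first_order, interaction = ice_anova(model, X, 0)
```

Because the normalized CICE curves average to zero at every grid point, the cross term vanishes and the total effect equals the first order effect plus the interaction effect.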

ICE ANOVA for home lending case

• These interaction effects can also be quantified using an ANOVA variance decomposition approach.
• The ICE ANOVA results are consistent with variable importance and PDP results.
• Interaction effects for ICE ANOVA include high order interactions.

S_total S_firstorder S_interaction

fico0 0.3417 0.3123 0.0294

ltv_fcast 0.3736 0.3529 0.0208

dlq_new_clean 0.0216 0.0120 0.0095

unemprt 0.0375 0.0322 0.0053

totpersincyy 0.0074 0.0038 0.0036

h 0.0196 0.0111 0.0085

premod_ind 0.0732 0.0664 0.0068

Marginal plots and ALE plots

• Marginal plots
– For exploratory analysis, people often plot the response variable against each covariate to understand the pairwise input-output relationship.
– When $E(Y \mid x) = f(x) = f(x_1, \ldots, x_p)$, the marginal function is

$$f_M(x_j) = E\left[f(x) \mid x_j\right] = \int f(x_j, x_{\setminus j})\,p(x_{\setminus j} \mid x_j)\,dx_{\setminus j}, \qquad j = 1, \ldots, p.$$

– Techniques such as binning, LOESS and regression splines for nonparametric regression can be used for the empirical estimation of $f_M(x_j)$.
– Problem: for correlated data, the response is projected onto a single dimension of the input space, so it reflects the effects of both $x_j$ and its correlated variables
• Accumulated Local Effects (ALE) Plots
– The ALE plots, proposed by Apley (2016), eliminate the drawbacks of partial dependence plots and marginal plots.

$$f_{ALE}(x_j) = \int_{z_{j,0}}^{x_j} E\left[\frac{\partial f(x)}{\partial x_j} \,\Big|\, x_j = z_j\right] dz_j$$

– Partial derivative: to remove the effects of correlated variables (in additive models).
– Conditional expectation: to avoid the extrapolation issue in estimation.

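The ALE integral from the slide above can be estimated by binning, in the spirit of Apley's estimator: within each bin of $x_j$, the partial derivative is replaced by the finite difference of predictions between the bin edges, averaged over the observations falling in that bin, and the averages are then accumulated. This is a simplified, uncentered sketch with quantile bins; `ale_1d` and the toy data are hypothetical.

```python
import numpy as np

def ale_1d(predict, X, j, n_bins=10):
    """Uncentered ALE curve for column j, evaluated at quantile bin edges."""
    edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, n_bins + 1))
    # bin index in 0..n_bins-1 for each observation
    idx = np.clip(np.searchsorted(edges, X[:, j], side="right") - 1, 0, n_bins - 1)
    local = np.zeros(n_bins)
    for b in range(n_bins):
        rows = X[idx == b]
        lo, hi = rows.copy(), rows.copy()
        lo[:, j], hi[:, j] = edges[b], edges[b + 1]
        local[b] = np.mean(predict(hi) - predict(lo))   # average local effect within bin b
    return edges, np.concatenate([[0.0], np.cumsum(local)])

rng = np.random.default_rng(0)
x0 = rng.standard_normal(2000)
x1 = x0 + 0.3 * rng.standard_normal(2000)               # strongly correlated with x0
X = np.column_stack([x0, x1])
model = lambda X: X[:, 0] + X[:, 1]                     # hypothetical fitted model
edges, ale = ale_1d(model, X, 0)
```

For this additive model the ALE curve has slope one, the direct effect of the first input, even though a marginal plot of the response against that input would show a slope near two because of the correlation.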

Marginal plot decomposition

– We propose a new interpretation technique called the accumulated total derivative effects (ATDEV) plot, which is based on the total derivative of the fitted response surface.

$$f_{Tot}(x_j) = \int_{z_{j,0}}^{x_j} E\left[\frac{df(x)}{dx_j} \,\Big|\, x_j = z_j\right] dz_j, \qquad \frac{df(x)}{dx_j} = \frac{\partial f(x)}{\partial x_j} + \sum_{k \neq j} \frac{\partial f(x)}{\partial x_k}\,\frac{dh_k(x_j)}{dx_j}$$

where $h_k(x_j)$ describes the dependence of $x_k$ on $x_j$.
– ATDEV plots can be proved to be equivalent to marginal plots, up to a constant difference.
– ATDEV plots (or equivalently marginal plots) can be decomposed into the Accumulated Local Effect function (ALE) and Accumulated Cross Effects (ACE)

$$f_{Tot}(x_j) = f_{ALE}(x_j) + \sum_{k \neq j} f_{k,j}(x_j)$$

– The ATDEV decomposition is consistent with the sensitivity analysis in the fitted model from the econometric perspective. ATDEV plots capture the total sensitivity of the response to a specific covariate.
– 1D-ALE represents a variable's direct 1st order effect through its own partial derivatives;
– 1D-ACE represents a variable's indirect 1st order effect through the partial derivatives of its correlated variables

[Figure: ATDEV (marginal plot) = ALE + ACE. ATDEV: when the unemployment rate is bumped by 1%, how will PD be impacted? ALE: direct impact on PD due to the unemployment bump. ACE: the unemployment bump correlates with the bumps of other correlated variables (e.g., loan characteristic variables), resulting in the change in PD.]

Unified derivative based framework

• The derivative-based approach leads to a unified framework for the PDP, ALE and Marginal functions (ATDEV plots)
• The PDP can also be rewritten in a partial derivative based form, but the expectation is based on the marginal distribution

$$f_{PD}(x_j) = \int_{z_{j,0}}^{x_j} E_{x_{\setminus j}}\left[\frac{\partial f(z_j, x_{\setminus j})}{\partial z_j}\right] dz_j + C$$

• The three are equivalent in the independent cases, but different in the correlated cases.

How to obtain derivatives for ML algorithms

• Even when the prediction model itself does not have closed-form gradients (such as Gradient Boosting), one can fit a NN surrogate model to the prediction model scores, and get model performance comparable to the original prediction models. • The concept of surrogate models is known as emulators in the field of computer experiments (Bastos & O'Hagan, 2009), and is referred to as model distillation (Tan, et al., 2018) or model compression (Bucilua, et al., 2006) in machine learning literature, with “born again trees” (Breiman & Shang, 1997) as one of the earliest implementations. • Machine learning performance

| Model | NN structure | Test AUC |
| XGBoost | -- | 0.8192 |
| Neural network surrogate model | 64 | 0.8193 |

ALE/ACE matrix plot

• The diagonal plots are ALE, showing the "direct" 1D effect of each variable on the response surface.
• The off-diagonal plots are ACE, i.e., $f_{k,j}(x_j)$ for subplot (k,j), showing the "indirect" 1D effect of each variable $x_j$ passed through its correlated variables $x_k$ onto the response surface.
• The sum of each column is the marginal plot (or ATDEV plot).
• A part of the sensitivity of unemployment is taken by premod_ind and LTV_forecast.

ATDEV variance heat map and correlation matrix

• The ATDEV variance heat map is more useful in a regression context because it combines the dependence (correlations) among predictors and their influence on the response

$$V_{kj} = \begin{cases} \mathrm{Var}\left(f_{k,j}(x_j)\right), & k \neq j \\ \mathrm{Var}\left(f_{ALE}(x_j)\right), & k = j \end{cases}$$

• the diagonal cells represent the individual marginal contribution of each predictor on the response (i.e., ALE); • the off-diagonal cells represent the magnitude of the cross marginal effect of the column variable on response transferred through the row variable (i.e., ACE).

More ATDEV plots

• ATDEV and marginal plots overlay
– ATDEV plots (blue curves) are the sum of each column in the ALE/ACE matrix plot.
– ATDEV and marginal plots have good overlap

• ALE, PDP and marginal overlay for PDP validity check
– With correlations, marginal plots tend to deviate from the PDP/ALE for unemployment, totpersincyy, premod_ind and dlq_new_clean.
– It also confirms that marginal plots are often "misleading" in a high correlation setting.
– ALE and PDP generally overlap well

Scatter partial derivative and LE plots

• Scatter partial derivative plots:
– for each pair (k, j), plot $\partial f(x) / \partial x_k$ vs $x_j$
– For diagonal plots, the scattering is caused by interactions.
• Local effect (LE) matrix plot:

$$f^{LE}_{k,j}(x_j) = E\left[\frac{\partial f(x)}{\partial x_k} \,\Big|\, x_j = z_j\right]$$

• A variant of the LE plot that scales the local effect by the dependence of $x_k$ on $x_j$:

$$E\left[\frac{\partial f(x)}{\partial x_k} \,\Big|\, x_j = z_j\right] \frac{dh_k(x_j)}{dx_j}$$

• Both global and local view
• We can check
– Potential data issue
– Monotonic constraint
– Interaction effects

Model distillation

Model distillation

• Model distillation:
– Model distillation was originally designed to distill knowledge from a large, complex teacher model to a faster, simpler student model without significant loss in prediction accuracy.
– We investigate model distillation for another goal, transparency: e.g., investigating if fully-connected neural networks can be distilled into models that are transparent or interpretable.
– The purpose is to approximate the predictions of the underlying model as closely as possible while retaining interpretability.
– Literature: emulators, surrogate models, model distillation, model compression, "born again trees"

• Related tools include:
– Global decision tree surrogate model
– KLIME (H2O)
– LIME-SUP (Wells Fargo CMoR)
– …

KLIME

• KLIME is a variant of LIME proposed in H2O Driverless AI. It divides the input space into regions and fits a linear model in each region.
– Cluster the input space using a K-Means algorithm
– Fit a linear model to the machine learning prediction in each cluster
– The number of clusters is chosen by maximizing R-squared
• KLIME can be used as a surrogate model (a less accurate but more interpretable substitute for the machine learning model). However, it has some disadvantages:
– The unsupervised partitioning approach can be unstable, yielding different partitions with different initial locations.
– The unsupervised partitioning does not incorporate any model information, which seems critical to preserving the underlying model structure. It is less accurate (see Figure below).
– K-means partitions the input space according to Voronoi diagrams, which is less intuitive in a business environment where modelers are more used to rectangular partitioning (segmentation).
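The KLIME recipe (cluster, then fit one linear model per cluster to the ML predictions) can be sketched with NumPy alone. This is an illustration of the idea, not H2O's implementation: the K-means routine is a bare-bones Lloyd iteration, the number of clusters is fixed rather than tuned, and the "ML predictions" come from a toy function. All names are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dist2, axis=1)
        for c in range(k):
            if np.any(labels == c):                    # keep the old center if a cluster is empty
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def linear_fit(X, y):
    """Least-squares linear model with intercept; returns fitted values."""
    A = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

def klime_fit(X, scores, k):
    """Fit one linear model per K-means cluster to the ML model scores."""
    labels = kmeans_labels(X, k)
    fitted = np.empty_like(scores)
    for c in np.unique(labels):
        mask = labels == c
        fitted[mask] = linear_fit(X[mask], scores[mask])
    return fitted

def r_squared(y, fitted):
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-2.0, 2.0, size=(1000, 2))
scores = np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2          # stand-in for ML model predictions
r2_global = r_squared(scores, linear_fit(X, scores))
r2_klime = r_squared(scores, klime_fit(X, scores, k=8))
```

Changing `seed` in `kmeans_labels` changes the partition, which is the instability noted in the first disadvantage.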

LIME-SUP

• Locally Interpretable Models and Effects based on Supervised Partitioning (LIME-SUP) is a local interpretation method developed by CMoR. It is a supervised partitioning method using information from the machine learning model. • The goal is to use supervised partitioning to achieve a more stable, more accurate and more interpretable surrogate model than KLIME. • There are two implementations of LIME-SUP. One uses model based tree (LIME-SUP-R) and the other uses partial derivatives (LIME-SUP-D).

LIME-SUP-R algorithm

1. Let $\{x_{1i}, \ldots, x_{Ki}, i = 1, \ldots, n\}$ be the set of K predictor variables used to train the original ML algorithm. We will use them as both modeling variables and partitioning variables for illustration purposes.
2. For the specified class of parametric model (say, a linear regression model with no interactions), fit a model-based tree to the ML predictions obtained on the training dataset.
1) Fit an overall parametric model at the root node to the ML predictions and modeling variables.
2) Find the best split to partition the root node into two child nodes. This is done by (again) fitting the same class of parametric models to all possible pairs of child nodes and determining the "best" partition. This involves searching over all partitioning variables and possible splits within each variable and optimizing a specified fit criterion such as MSE or logloss.
3) Continue splitting until a specified stop criterion is met; for example, max depth or minimum number of observations in the child node is reached, or the fit is satisfactory.
3. Prune back the tree using appropriate model fit statistics such as improvement in $R^2$, improvement in SSE, etc. on the validation dataset, to cut off splits that have small impact.
4. Once the final tree is determined, use a regularized regression algorithm (such as LASSO) to fit a sparse version of the parametric model at each node.
5. Assess the fit on the testing dataset.
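Step 2.2, the search for the best split, can be sketched as follows. The sketch covers a single split only and makes simplifying assumptions that are not part of the algorithm as stated: candidate split points are a fixed set of quantiles, the fit criterion is SSE, and there is no pruning or LASSO step. `best_split` and the toy scores are hypothetical.

```python
import numpy as np

def linear_sse(X, y):
    """SSE of a least-squares linear model (with intercept) fitted to (X, y)."""
    A = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta) ** 2))

def best_split(X, scores, min_leaf=50):
    """Search partitioning variables and split points; fit linear models in both children."""
    best = (None, None, linear_sse(X, scores))        # (variable, threshold, SSE); root fit as baseline
    for j in range(X.shape[1]):
        for threshold in np.quantile(X[:, j], np.linspace(0.1, 0.9, 17)):
            left = X[:, j] <= threshold
            if left.sum() < min_leaf or (~left).sum() < min_leaf:
                continue                              # stop criterion: minimum node size
            sse = linear_sse(X[left], scores[left]) + linear_sse(X[~left], scores[~left])
            if sse < best[2]:
                best = (j, threshold, sse)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))
scores = np.abs(X[:, 0]) + 0.5 * X[:, 1]              # stand-in for ML predictions, kink at 0
split_var, split_point, split_sse = best_split(X, scores)
```

Each candidate split requires two linear fits, which is why the cost is much higher than for a regular decision tree with constant leaves.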

Notes

Some notes:
• The search for the best split for a model based tree is very different from a regular decision tree. It requires fitting linear models in the child nodes instead of a constant, and doing so for each possible split point of each splitting variable. This is much more computationally expensive than fitting a regular decision tree.
• To reduce the amount of computation, the M-Fluctuation test in Zeileis et al. (2008) is used as a fast way to screen the partitioning variables.
– The M-Fluctuation test is a test for parameter instability. Its purpose is to test whether the coefficients of a parametric model change across different segments of a certain variable;
– If the test is insignificant, a global model will fit well; otherwise, it is necessary to divide the data according to that variable and fit different child models.
• We rank the partitioning variables by the M-Fluctuation test result, and for the top variables, we do an exhaustive search over the combinations of splitting variable and splitting point. This greatly reduces the amount of computation.

A real data example-home lending case

• The figure shows the tree structure and the coefficients in the terminal nodes.
• The strongest patterns in the coefficients exist for ltv_fcast, fico0 and delinquency status.
• For example, the highest coefficients of ltv_fcast at nodes 11 and 13 indicate the steepest slope for ltv_fcast in [61.4, 92.5], and a flatter slope at the two ends.

A real data example: home lending case

• Similarly, we fit KLIME with 8 clusters.
• The table below shows the MSE, $R^2$ and AUC for the 5 methods.
• LIME-SUP is better than KLIME. In addition, LIME-SUP-R fits slightly better than LIME-SUP-D, which is expected.

        LIME-SUP-R   LIME-SUP-D   KLIME-E   KLIME-M   KLIME-P
MSE     0.0419       0.0485       0.0662    0.0677    0.0648
$R^2$   0.975        0.970        0.960     0.959     0.960
AUC     0.817        0.817        0.816     0.816     0.816
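For readers who want to reproduce fit statistics of this kind, the sketch below computes MSE and $R^2$ between a surrogate's output and a reference vector. It assumes the reference is the ML model's prediction (the quantity the surrogate is fitted to); the slide does not state the reference explicitly, and the numbers here are made up for illustration.

```python
def mse(y_ref, y_hat):
    """Mean squared error between reference values and surrogate output."""
    return sum((a - b) ** 2 for a, b in zip(y_ref, y_hat)) / len(y_ref)


def r_squared(y_ref, y_hat):
    """Share of the reference variance reproduced by the surrogate."""
    mean_ref = sum(y_ref) / len(y_ref)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_ref, y_hat))
    ss_tot = sum((a - mean_ref) ** 2 for a in y_ref)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: ML predictions and a surrogate that misses one point.
ml_pred = [0.0, 1.0, 2.0, 3.0]
surrogate_pred = [0.0, 1.0, 2.0, 4.0]

fidelity_mse = mse(ml_pred, surrogate_pred)
fidelity_r2 = r_squared(ml_pred, surrogate_pred)
```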

A real data example: home lending case

• The figure below provides a different view of the comparisons: values of MSE and $R^2$ computed within each of eight local regions.
• The conclusions are similar to before. LIME-SUP does better in almost all local regions, except that LIME-SUP-D occasionally has slightly larger MSE than KLIME.

Explainable Neural Networks (xNN)

May 13, 2019 Agus Sudjianto, Ph.D. EVP, Head of Enterprise Model Risk

Splines and Neural Networks

Linear Model: $f(x) = w_0 + w_1 x$

Nonlinear f(x), Splines: $f(x) = w_0 + \sum_{j=1}^{k} w_j B_j(x)$
• $B_j(\cdot)$ are basis functions, such as a constant or the simple 'hinge' function $\max(0, x - c_j)$
• $c_j$: knot locations
• $k$: number of knots

Nonlinear f(x), Neural Networks: $f(x) = w_0 + \sum_{j=1}^{k} w_j B_j(\beta_j x)$
• $B_j(\cdot)$ with simple hinge functions $\max(0, \beta_j x - c_j)$ are called ReLU (Rectified Linear Units)
• The "knot locations" $c_j$ are called "bias weights"
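The correspondence between hinge-basis splines and ReLU networks can be checked numerically. The sketch below is illustrative only: the knots and weights are made-up values, and the function names are hypothetical. With unit input weights, a one-hidden-layer ReLU network and a linear spline with hinge bases compute the same function.

```python
def relu(t):
    """Hinge function max(0, t)."""
    return max(0.0, t)


def linear_spline(x, w0, weights, knots):
    """f(x) = w0 + sum_j w_j * max(0, x - c_j), a spline with hinge bases."""
    return w0 + sum(w * relu(x - c) for w, c in zip(weights, knots))


def relu_network(x, w0, out_weights, in_weights, biases):
    """f(x) = w0 + sum_j w_j * max(0, beta_j * x - c_j), one hidden ReLU layer."""
    return w0 + sum(
        w * relu(beta * x - c)
        for w, beta, c in zip(out_weights, in_weights, biases)
    )


# Hypothetical example: a "tent" function built from three hinges.
knots = [-1.0, 0.0, 1.0]
weights = [1.0, -2.0, 1.0]

grid = [i / 10.0 for i in range(-20, 21)]
spline_values = [linear_spline(x, 0.0, weights, knots) for x in grid]
# With unit input weights, the network's bias weights play the role of knots.
network_values = [
    relu_network(x, 0.0, weights, [1.0, 1.0, 1.0], knots) for x in grid
]
```

The difference in practice is that a spline fit usually fixes the knots in advance, whereas network training learns the input weights and biases jointly with the output weights.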

Higher Dimension: Projection Approach

Single Index Model: $f(x) = w_0 + \sum_{j=1}^{k} w_j B_j(\beta^T x)$
Projection vector: single projection

Single Hidden Layer Neural Networks: $f(x) = w_0 + \sum_{j=1}^{k} w_j B_j(\beta_j^T x)$
Projection matrix: multiple projections

• The Spline Index Model and Single Hidden Layer Neural Networks have the same form
• Too many projections create interpretation difficulty
• Need to enrich the basis function B(.) for richer ridge functions: deeper network

$h_l(x) = w_{l,0} + \sum_{j=1}^{k} w_{l,j} B_j(\beta_j^T x)$
…
Deep Network: Splines on Splines

$h_{m,l}(z_{m-1}) = w_{m,l,0} + \sum_{j=1}^{k_m} w_{m,l,j} B_{m,j}(\beta_j^T z_{m-1})$

Additive Index Model: Explainable Neural Networks

Additive 'Index' Model: $f(x) = \gamma_1 h_1(\beta_1^T x) + \gamma_2 h_2(\beta_2^T x) + \dots + \gamma_k h_k(\beta_k^T x)$

The model is inherently interpretable:
• Projection layer: linear projections are understandable
• Subnetworks: nonlinear ridge functions are easily graphed

Subnetworks:
• Internally, they are fully connected, multi-layer, and use nonlinear activation functions.
• Externally, they are connected to the rest of the network only through a univariate input and a univariate output.
• They are used to learn the ridge functions $h_j(\cdot)$.
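The forward pass of an additive index model is short enough to write out. In this sketch the ridge functions are plain Python callables standing in for trained subnetworks, and the projection vectors, outer weights and intercept `mu` are made-up values; none of them come from the slides.

```python
import math


def aim_predict(x, betas, ridge_fns, gammas, mu=0.0):
    """Additive index model: mu + sum_j gamma_j * h_j(beta_j . x)."""
    total = mu
    for beta, h, gamma in zip(betas, ridge_fns, gammas):
        # Projection layer: a linear combination reduces x to a scalar index.
        z = sum(b * xi for b, xi in zip(beta, x))
        # Subnetwork: univariate input, univariate output, then outer weight.
        total += gamma * h(z)
    return total


# Hypothetical two-subnetwork model on two inputs.
betas = [[1.0, 0.0], [0.6, 0.8]]
ridge_fns = [lambda z: z * z, math.sin]
gammas = [1.0, 2.0]

y_hat = aim_predict([1.0, 0.0], betas, ridge_fns, gammas, mu=0.5)
```

Interpretation follows the same structure: each `beta` can be tabulated as projection coefficients, and each ridge function can be plotted against its scalar index.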

Simulation 1: Simple Example

• Simulate from Legendre polynomials: $y = x_1 + \frac{1}{2}(3x_2^2 - 1) + \frac{1}{2}(5x_3^3 - 3x_3) + \varepsilon$

• Illustrate features of xNN
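A data generator for this simulation might look as follows. The signal is the sum of the first three Legendre polynomials, one per input; the input distribution (uniform on [-1, 1]), the noise standard deviation and the sample size are assumptions made for this sketch, since the slide does not state them.

```python
import random


def legendre_signal(x1, x2, x3):
    """Noise-free part: P1(x1) + P2(x2) + P3(x3)."""
    p1 = x1
    p2 = 0.5 * (3.0 * x2 ** 2 - 1.0)
    p3 = 0.5 * (5.0 * x3 ** 3 - 3.0 * x3)
    return p1 + p2 + p3


def simulate(n, noise_sd=0.1, seed=0):
    """Draw n rows of (inputs, response); noise_sd and seed are assumed values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        y = legendre_signal(*x) + rng.gauss(0.0, noise_sd)
        rows.append((x, y))
    return rows


data = simulate(1000)
```

Since each term depends on a single input, the true model is an additive index model with axis-aligned projections, so a well-fitted xNN would be expected to show one ridge function per input.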

xNN Interpretation Tools: Subnet View

$y = x_1 + \frac{1}{2}(3x_2^2 - 1) + \frac{1}{2}(5x_3^3 - 3x_3) + \varepsilon$

[Figure: learned ridge functions and learned projection coefficients]

xNN Interpretation Tools: Variable View

$y = x_1 + \frac{1}{2}(3x_2^2 - 1) + \frac{1}{2}(5x_3^3 - 3x_3) + \varepsilon$

[Figure: conditional dependence functions and learned projection coefficients]

Simulation 2: Illustrate Interactions

• Simulate from: $y = 0.5x_1^2 + 0.5x_2^2 + 0.5x_3x_4 + 0.3x_5 + \varepsilon$

• We will see that xNN captures the multiplicative interaction, but the representation is not unique.
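One way to see why the representation is not unique: a product of two inputs can be written as a difference of two quadratic ridge functions, and this can be done with more than one pair of projections. The sketch below checks the identity numerically; the inputs are called `x3` and `x4` here, and the scale parameter `a` is introduced only for illustration.

```python
def interaction(x3, x4):
    """The multiplicative interaction term."""
    return x3 * x4


def ridge_representation(x3, x4, a=1.0):
    """x3*x4 written as a difference of two quadratic ridge functions.

    Uses projections (a*x3 + x4/a) and (a*x3 - x4/a); any a != 0 works,
    so the additive-index representation is not unique.
    """
    z_plus = a * x3 + x4 / a
    z_minus = a * x3 - x4 / a
    return 0.25 * z_plus ** 2 - 0.25 * z_minus ** 2


points = [(0.3, -1.2), (1.5, 2.0), (-0.7, 0.4)]
max_gap = max(
    abs(interaction(x3, x4) - ridge_representation(x3, x4, a))
    for x3, x4 in points
    for a in (1.0, 2.0, 0.5)
)
```

Different training runs may therefore land on different projection pairs that fit the interaction equally well.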

xNN Simulation 2: Subnetwork View

$y = 0.5x_1^2 + 0.5x_2^2 + 0.5x_3x_4 + 0.3x_5 + \varepsilon$

[Figure: learned ridge functions and learned projection coefficients]

xNN Simulation 2: Variable View

$y = 0.5x_1^2 + 0.5x_2^2 + 0.5x_3x_4 + 0.3x_5 + \varepsilon$

[Figure: conditional dependence functions and learned projection coefficients]

Simulation 3: Misspecified Model

• Simulate from: $y = \exp(x_1) \cdot \sin(x_2) + \varepsilon$
• Not an AIM: the xNN specification does not match the data-generating model
• The xNN does not recover the model form …
• But the learned model is interpretable.

xNN Simulation 3: Subnetwork View

$y = \exp(x_1) \cdot \sin(x_2) + \varepsilon$

[Figure: learned ridge functions and learned projection coefficients]

xNN Simulation 3: Variable View

$y = \exp(x_1) \cdot \sin(x_2) + \varepsilon$

[Figure: conditional dependence functions and learned projection coefficients]

Sparse Orthogonal and Smooth Explainable Neural Networks (SOSxNN)

SOSxNN Architecture

Simulation Scenario 1: Additive function with orthogonal projection

[Figure: ground truth vs. SOSxNN estimate]

Simulation Scenario 2: Additive function with near-orthogonal projection

[Figure: ground truth vs. SOSxNN estimate]

Simulation Scenario 3: Non-additive function with orthogonal projection

[Figure: SOSxNN approximation]

Simulation Scenario 4: Non-additive function with non-orthogonal projection

[Figure: SOSxNN approximation]

Simulation Results

The proposed SOSxNN retains the flexibility to pursue prediction accuracy while improving model interpretability.

Real data example: Lending Club acquisition

Data Source: https://www.lendingclub.com/info/download-data.action

After cleaning: 1,433,770 accepted and 4,059,452 declined cases

SOSxNN Modeling

Data are split into 40% (training), 10% (validation) and 50% (testing).

Test accuracy in comparison with benchmark methods:

SOSxNN Model Interpretation

[Figure: raw marginal rates vs. SOSxNN estimate]

Other Structured Neural Networks: Fast and scalable algorithms for fitting nonparametric regression models

Other Structured Neural Networks

• Can build structured neural networks (SNNs) to learn other semiparametric or nonparametric models.
• SNNs are computationally fast and scalable algorithms for fitting these models.
• Implementation:
  – Use nodes with linear activation functions to learn linear combinations
  – Use subnetworks to learn nonlinear transformations of univariate input

Range of Flexibility in Models

[Diagram: models arranged along two axes, explainability and generalization]

• Linear Model: $f(x) = \beta^T x$
• Generalized Additive Model: $f(x) = \sum_{j=1}^{p} g_j(x_j)$
• Single Index Model: $f(x) = g(\beta^T x)$
• Partial Linear Model: $f(x, z) = \beta^T x + h(z)$
• Additive Index Model: $f(x) = \sum_{j=1}^{k} h_j(\beta_j^T x)$

(Generalized) Additive Model

• Functional form: $f(x) = g_1(x_1) + \dots + g_p(x_p)$
• Nonlinear function for each of the predictors
• Generalization of the linear model; special case of AIM

[Diagram: each input $x_j$ feeds its own subnetwork $g_j(x_j)$; the subnetwork outputs are summed to give $\sum_{j=1}^{p} g_j(x_j)$]

192 GAM Simulation: Functional components30

The following functional components serve as the ridge functions:

$ Generate ! ∈ # , %&, … , %$ are not part of the model; y = 2%, + 3/0 %0 + /1 %1 + 3/2 %2 + 3

GAM: Simulation Results

[Figure: GAM fit and GAM-Net fit; MSE comparisons of GAM-Net, GAM and MARS]

Computation time (s)   GAM-Net   GAM
n = 10000              38.7      46.0

Single Index Model / SIM-Net

$f(x) = g(\beta^T x)$

• Another generalization of the linear model
• Special case of AIM
• Continues to be a popular model in the economics and statistics communities
• Fitting with R package: mgcv

[Diagram: inputs $x$ feed a linear projection $\beta^T x$, which feeds a single subnetwork $g(\beta^T x)$]

Single Index Model: Simulation

Simulation:
• Generate data from $y = g(\beta^T x) + \varepsilon$

[Figure: ridge function and projection coefficients for SIM-Net and SIM; MSE comparison of SIM-Net and SIM]

Computation time (s)   SIM-Net   SIM
n = 10000              5.88      423.07

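Partial linear model / PLM-Net

$f(x; z) = \beta^T x + h(z)$

• $x$ is p-dimensional and $z$ is q-dimensional
• Additive linear and nonlinear components

[Diagram: inputs $x$ feed the linear part $\beta^T x$; inputs $z$ feed the subnetwork $h(z)$; the two are summed to give $\beta^T x + h(z)$]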

197 Ridge Function: Partial linear model

y = #$ + 2#' + ( #) + *, Simulation: with + = 1.5 PLM - Net:

Computation time (s)   PLM-Net   PLM
n = 2,000              2.50      48.36

Structured Neural Networks Summary

• Leverage structure in neural networks to build interpretable models.
• Advantages:
  – Retains the interpretability of the corresponding semiparametric and nonparametric models
  – Scales better to large data
  – Provides symbolic partial derivatives for further diagnostics
  – Additional flexibility in adding regularization
  – Consistent approach to nonlinear function estimation across different models
• Notes:
  – The best SNN models require tuning regularization parameters for interpretability.
  – Traditional methods may be more appropriate for small samples.

Managing Machine Learning Model Risk: Model Validation

May 13th , 2019 Harsh Singhal Decision Science and Artificial Intelligence Validation Corporate Model Risk

Overview of Machine Learning (ML) Use Cases and Techniques

ML Type: Unsupervised
• Approach: Discovering Structure (Clustering)
  – Use cases: Market segmentation; Pattern discovery; Network analysis; Image analysis
  – ML techniques: K-means; Density-based (DBSCAN); Agglomerative clustering
• Approach: Irregularity Identification (Anomaly Detection)
  – Use cases: Transaction monitoring; Anti-money laundering; Fraud detection; Cybersecurity
  – ML techniques: Clustering based; K-nearest neighbor; Bayesian networks

ML Type: Supervised
• Approach: Value Estimation (Regression)
  – Use cases: Sales forecasting; Trend analysis; Cost analysis
  – ML techniques: Regression tree; Gradient Boosting (GBM); Regression Neural Network
• Approach: Categorical Prediction (Classification)
  – Use cases: Customer behavior analysis; Know your customer ongoing customer analysis; Customer expansion
  – ML techniques: Decision tree / forests; Support Vector Machine (SVM); Neural Networks

ML Type: Reinforcement
• Approach: Recommending relevant products (Recommender System)
  – Use cases: Online sales recommending; Customer experience enhancement
  – ML techniques: Collaborative filtering (e.g., matrix factorization); Content-based filtering; Knowledge-based filtering; Hybrid system

ML Type: NLP*
• Approach: Information retrieval and understanding
  – Use cases: Text understanding and semantic understanding; Legal contract processing; Sentiment analysis & chatbot
  – ML techniques: Word embeddings; LSI / LSA / LDA; Sequential NN (RNN / LSTM)

* NLP is an application of all types of ML methods, i.e. supervised, unsupervised and reinforcement learning

Model Risk and Model Validation

• Model Validation is a key control for managing model risk
• General principles of model validation translate well to Machine Learning (ML) models
• Special aspects of ML models create risks that require special emphasis during validation

Headlines:
• "Transamerica Entities to Pay $97 Million to Investors Relating to Errors in Quantitative Investment Models" (US Securities and Exchange Commission)
• "Fed flags BofA, 2 others": "weaknesses in certain aspects of Bank of America's loss and revenue modeling practices…" (CNBC)
• "Amazon Echo Secretly Recorded a Family's Conversation and Sent It to a Random Person on Their Contact List" (CNBC)

ML Special Features:
• Optimized for predictive performance
• Highly automated / online learning
• Complex representation
• Stochastic training
• Several hyper-parameters
• Open source and vendor ecosystem
• Process automation applications

Model Validation Emphasis:
• Data bias and limitations
• Explainability or interpretability
• Replicability and stability
• Ongoing monitoring
• Safety in autonomous mode
• Fairness and ethics

Model Suitability, Model Robustness, Model Safety

Model Suitability: From Intent to Problem Formulation

• Initial problem formulation should be based on deep understanding of the problem domain
  – Model performance objectives, unintended consequences, compliance and ethical considerations
• Why Machine Learning?
  – Model validation should assess if ML is a reasonable approach given the problem at hand.
• High Level Framework:
  – Supervised vs. unsupervised, source of data, choice of features, objective function
• Appropriate Objective Function for Reinforcement Learning
  – Prevent reward gaming

Watch Items
• Should this model even be built/implemented? Were relevant stakeholders (e.g. Privacy, Compliance etc.) consulted and controls in place?
• Is there a specific structural specification or simple rule motivated by sound theory that might be preferable?
• If supervised learning formulation: is high quality labelled data generally available?
• Is there a symmetric cost to False Positives and False Negatives? What is the minimum acceptable performance?
• Was an effort made to incorporate expert features?
• For Natural Language Processing (NLP) problems: was an effort made to incorporate contextual and non-language features?

Cyber Security Model
• Model: The model intends to detect sophisticated attacks that are distributed across IP addresses. Model to be built using supervised learning.
• Data Collection: No way to label and identify distributed attacks in training data. Labels were created using simple detectors that only looked at anomalous traffic at single IP address level.
• Result: Supervised approach fails to detect sophisticated attacks and ends up being an emulator for simple detectors.

Model Suitability: Data Bias

• Data bias due to un-representative training data
  – Traditional statistical models are also impacted
  – ML models are more at risk due to high dimensional nature of input data, local and complex specification and automated feature selection
• Data bias during label creation
  – Labelling is often expensive
  – Pre-screening: In many situations a specific mechanism selects observations which are then labelled through the expensive process
  – Labeled data set is biased as a result

Watch Items
• Is the training data relevant for intended use?
• Check for errors related to validity, measurement, processing, coverage, sampling and non-response [Groves 2004]
• Representativeness can be a particular concern for vendor models

Headlines:
• "Amazon Reportedly Killed an AI Recruitment System Because It Couldn't Stop the Tool from Discriminating Against Women" (Fortune)
• "AI-Driven Dermatology Could Leave Dark-Skinned Patients Behind" (The Atlantic)

Customer Complaint Analytics Model
• Model: The model parses through customer complaints to identify higher risk customer complaints.
• Data Collection (Develop, Pilot, Production): Label definition (concept of high risk) continued to evolve as more data is collected and model results are factored into the labeling process.
• Observation: Out of time performance appears to be substantially better than in-time performance.

Model Suitability: Data Limitations

• Appropriateness and adequacy of features
• Insufficient sample size of training data
  – Multi-class problems and class imbalance
• Data Augmentation and Transfer Learning
  – Innovative approaches to handle small size of training data
  – Create their own model risk challenges

Watch Items
• Is the training data sample size adequate? How many samples from the minority class?
• Are there high cardinality categorical features? Do we have enough samples from each class? How will the model handle unseen categories?
• Are samples split appropriately between training and test and across folds in cross validation?
• For NLP: Ensure features are appropriately engineered (stemming, spelling correction) and not overly specific or un-informative.
• For NLP: Watch out for bias in word representations.

[Figures: Data Augmentation; Transfer Learning] [Geron 2017]

Model Suitability: Interpretability and Explain-ability Regulatory Requirement

SR 11-7

Model testing includes checking the model's accuracy, demonstrating that the model is robust and stable, assessing potential limitations, and evaluating the model's behavior over a range of input values. It should also assess the impact of assumptions and identify situations where the model performs poorly or becomes unreliable. Testing should be applied to actual circumstances under a variety of market conditions, including scenarios that are outside the range of ordinary expectations, and should encompass the variety of products or applications for which the model is intended. Extreme values for inputs should be evaluated to identify any boundaries of model effectiveness.

Where appropriate to the particular model, banks should employ sensitivity analysis in model development and validation to check the impact of small changes in inputs and parameter values on model outputs to make sure they fall within an expected range.

Banks benefit from conducting model stress testing to check performance over a wide range of inputs and parameter values, including extreme values, to verify that the model is robust.

Regulation B

In the event of adverse action, notice is required and should include: Either a statement of the specific reasons for the action taken or a disclosure of the applicant's right to a statement of specific reasons and the name, address, and telephone number of the person or office from which this information can be obtained

GDPR

In the case of automated decision-making, the data subject possesses the right to access "meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject."

Automated processing "should be subject to suitable safeguards, which should include specific information to the data subject and the right to obtain human intervention, to express his or her point of view, to obtain an explanation of the decision reached after such assessment and to challenge the decision."

Required to manage model risk
• Performance evaluation alone is not sufficient
• Understanding input-output relationships is critical
• Fairness and Accountability: decision attribution and adverse impact
• Trustworthiness: safe operating region, output uncertainty and generalization ability
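The sensitivity analysis described in the SR 11-7 excerpt (checking the impact of small input changes on model outputs) can be sketched as a one-at-a-time perturbation. Everything named here is hypothetical: `toy_score` is a stand-in for a fitted model, the feature names are illustrative, and the 1% relative step is an arbitrary choice.

```python
def toy_score(x):
    """Hypothetical stand-in for a fitted model; not from the source."""
    return 0.02 * x["ltv"] - 0.005 * x["fico"] + 0.5 * x["delinquent"]


def one_at_a_time_sensitivity(model, x, rel_step=0.01):
    """Finite-difference effect of a small bump in each input, others held fixed."""
    base = model(x)
    effects = {}
    for name, value in x.items():
        # Relative step for non-zero inputs, absolute step when the input is zero.
        step = rel_step * abs(value) if value != 0 else rel_step
        bumped = dict(x)
        bumped[name] = value + step
        effects[name] = (model(bumped) - base) / step
    return effects


applicant = {"ltv": 80.0, "fico": 700.0, "delinquent": 0.0}
effects = one_at_a_time_sensitivity(toy_score, applicant)
```

For a real ML model the same loop would be run at many input points, including extreme values, and the resulting effects compared against the expected sign and range for each variable.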

Model Suitability: Fairness and Accountability

• Complex model specification makes it challenging to explain individual decisions ("reason codes")
• Typical approaches utilize the contribution of input variables in local models
• LIME, LOCO, Anchors, Shapley explanations [Hall 2018]
• Unstructured data (e.g. NLP) makes it even more challenging to ensure fairness

Customer Complaint Analytics Model
• Model: The model parses through text to sort customer complaints. Semantic features in NLP models can often be biased, and appropriate emphasis should be placed on assessing fairness.
• Model Explainability Methods for NLP: Assess bias in embeddings [Caliskan 2017, Speer 2017]; assess word association and importance [Ribeiro 2016]
• Result: Words related to protected classes were not in the top 100 most important features.

Customer Call Fraud Detection Model
• Model: The model analyzes various features of the call, including speaker location and language, to determine the likelihood of fraudulent behavior. Thus, variables that could potentially result in disparate impact should be checked.
• Explainability and Reason Codes: Emulation with logistic regression (global surrogate); LIME and Shapley values [Ribeiro 2016, Molnar]; ANOVA and interaction plots
• Result: Predictive power is based on highly coarse features that are strongly correlated with demographic variables.

Model Suitability: Trustworthiness

• Do we know what the model does? How well will it generalize?

California Home Price Model

[Hastie 2009]

Model Suitability: Trustworthiness

• Can we trust the model? Detect un-intended errors, sensitivity to contextual features, artifacts, irrelevant or adversarial changes

Reliance on Contextual Features or Artifacts

ML models are data driven and will tend to pick up all correlations that exist in data, whether or not they make sense [Lapuschkin 2016]

Watch Items
• Do the top features based on global importance make business sense?
• What is the explanation for model errors? Can model errors and their explanation be perceived as discriminatory?
• Is there appropriate balance between variable importance of the top feature versus the rest?
• Are the less important variables contributing enough to justify the additional complexity from their inclusion?
• Is the direction of impact for each input variable reasonable?
• Does the magnitude and direction of interaction make sense?

Adversarial and Transplanting Errors

Deep Learning models can be fooled by "optical illusions" such as small changes like image distortion or changing a background object [Rosenfeld 2018, Ilyas 2018]

Model Robustness: Replicability and Stability

• Several implicit or explicit choices:
  – Treatment of missing values, feature engineering, scaling, multi-collinearity, transfer learning
  – Choice of technique (e.g. SVM versus RF), regularization, optimization, default settings and software library
  – Hyper-parameters and selection approach
  – Random numbers for stochastic algorithms (e.g. RF)
  – Training/Testing partition

• It is important for validators to develop benchmark models using different choices/implementations

Watch Items
• If default settings were used for certain influential parameters or hyper-parameters, do they make sense?
• When assessing stability, compare both performance and specification (variable effects, feature importance, reason codes etc.)

[Figure: Impact of hyper-parameters on GBM] [Hastie 2009]
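A stability check along these lines can be sketched by refitting under different random training/testing partitions and summarizing both a performance measure and a specification measure. In this sketch a least-squares line stands in for the ML model, test MSE is the performance measure and the fitted slope is the specification measure; the data, the number of runs and the test fraction are assumptions chosen for illustration.

```python
import random
import statistics


def fit_line(pairs):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    return my - slope * mx, slope


def stability_report(data, n_runs=20, test_frac=0.3):
    """Refit under different random partitions; summarize slope and test MSE."""
    slopes, test_mses = [], []
    n_test = int(len(data) * test_frac)
    for seed in range(n_runs):
        shuffled = data[:]
        random.Random(seed).shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        intercept, slope = fit_line(train)
        mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
        slopes.append(slope)
        test_mses.append(mse)
    return {
        "slope_mean": statistics.mean(slopes),
        "slope_sd": statistics.stdev(slopes),
        "mse_mean": statistics.mean(test_mses),
        "mse_sd": statistics.stdev(test_mses),
    }


# Hypothetical data: true slope 2.0 with Gaussian noise.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))

report = stability_report(data)
```

A large spread in the specification measure with a small spread in performance is the case worth flagging: the model predicts equally well across runs but explains its predictions differently.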

Model Robustness: Ongoing Monitoring

• Ongoing Monitoring is important for all models
• For ML models, additional complications include:
  – Adversarial Environments
  – Online Learning
  – High dimensional inputs
  – Limited availability of manual labels

Watch Items
• Controls on model updates include use expansion
• Frequency of monitoring versus frequency of anticipated changes
• How will the system assess data quality of retraining data? What will cause exclusions or cause retraining to abort?
• How will retrained variable effects be tested to ensure reasonableness?
• What kind and magnitude of changes can be pre-approved versus trigger a revalidation?
• For NLP: How will changes in input text be quantified?
• For alert generation models, what kind of below the line testing will be employed?
• For vendor models: Is there sufficient disclosure from the vendor side and is there adequate independent internal testing?

Adversarial Environments: Network Intrusion Detection [Fogla 2006]

Online Learning: Social Chat-bot [Geron 2017]
• "Microsoft's Chat Bot Was Fun for Awhile, Until it Turned into a Racist" (Fortune)

Model Safety: Cautious Generalization and Fail Safes

• AI/ML is increasingly being used for automated decision making

Watch Items
• Failure mode analysis
• Cautious Generalization
• Fail Safes

[Figures: Failure Mode Analysis [Treml 2017]; Generalization Uncertainty [Settles 2012]]

Chat-bot Routing Model
• Model: The model classifies a customer question into one of several dozen categories and offers a web-link or routes to the correct customer agent based on category.
• Cautious Generalization: When the model's predicted category is associated with low confidence, offer multiple re-phrasings of the question ("Did you mean …") or request a re-phrasing instead of proceeding with the low confidence option.
• Result: Better customer experience and fewer mis-routed queries.
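The fail-safe in the chat-bot example amounts to a confidence threshold in front of the routing step. The sketch below is a minimal illustration; the category names, the 0.6 threshold and the choice of three clarification options are hypothetical values, not from the source.

```python
def route(probabilities, threshold=0.6, top_k=3):
    """Route on a confident prediction; otherwise ask the customer to clarify."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    best_category, best_prob = ranked[0]
    if best_prob >= threshold:
        return {"action": "route", "category": best_category}
    # Low confidence: offer the top candidates as "Did you mean ..." options.
    options = [category for category, _ in ranked[:top_k]]
    return {"action": "clarify", "options": options}


confident = route({"card_dispute": 0.85, "mortgage": 0.10, "wire": 0.05})
uncertain = route({"card_dispute": 0.40, "mortgage": 0.35, "wire": 0.25})
```

The threshold is a tuning choice: set too low, mis-routes pass through; set too high, customers are asked to re-phrase too often.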

ML Fairness

• Ethical and Fair Artificial Intelligence has been in focus in recent times
  – Given the enormous and expanding impact of automated decisions in modern life, this is a topic of critical importance

• Fairness considerations should be taken into account at each step in the model lifecycle: problem formulation, data collection, feature engineering, labeling, training, testing, validation and ongoing monitoring

• Several technical definitions and solutions have also recently been proposed
  – The goal of these techniques is typically to achieve similar error rates across various demographic groups through data pre-processing, training constraints or output post-processing
  – These efforts are still in their infancy and some of the technical fixes may rely on unrealistic assumptions, suffer from limitations and logistical problems

• Most researchers agree that fairness is highly contextual
  – It may be virtually impossible to guarantee fairness (or even come up with a universal definition)
  – It is important to have robust processes to ensure transparency and credible challenge

• In financial institutions:
  – Some critical decisions may already have well defined regulatory and policy requirements (e.g. Fair Lending) and are best handled within corresponding frameworks
  – In all cases applicable data privacy and compliance procedures should be followed
  – Clarity in problem formulation, data limitations and strong explain-ability analysis provide all stakeholders with transparency into potential discriminatory impacts of the model
  – Model developers and validators should use judgement to conduct additional testing to ensure adequate performance across all relevant demographic groups
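Testing performance across demographic groups usually starts with group-wise error rates. The sketch below computes false positive and false negative rates per group for a binary classifier; the labels, predictions and group names are made-up values, and which rates matter (and how large a gap is acceptable) depends on the context and the applicable policy framework.

```python
def error_rates_by_group(y_true, y_pred, groups):
    """False positive rate and false negative rate for each group."""
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
        if truth == 1:
            c["tp" if pred == 1 else "fn"] += 1
        else:
            c["fp" if pred == 1 else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        rates[group] = {
            # None marks a rate that is undefined for lack of cases.
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return rates


# Hypothetical labels, predictions and group membership.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = error_rates_by_group(y_true, y_pred, groups)
```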

Conclusion

• Building ML applications is easier than ever
• Validation of ML models is critical
• All principles of Model Risk Management apply directly to ML models
• Given the special risks of ML models and typical use, certain areas deserve special emphasis
  – Model Suitability
  – Model Robustness
  – Model Safety

References

[Apley 2016] Apley, Daniel W. "Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models." arXiv preprint arXiv:1612.08468 (2016).

[Caliskan 2017] Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. "Semantics derived automatically from language corpora contain human-like biases." Science 356, no. 6334 (2017): 183-186.

[Dries 2009] Dries, Anton, and Ulrich Rückert. "Adaptive concept drift detection." Statistical Analysis and Data Mining: The ASA Data Science Journal 2, no. 5-6 (2009): 311-327.

[FATML] http://www.fatml.org/

[Fogla 2006] Fogla, Prahlad, and Wenke Lee. "Evading network anomaly detection systems: formal reasoning and practical techniques." In Proceedings of the 13th ACM conference on Computer and communications security, pp. 59-68. ACM, 2006.

[Geron 2017] Géron, Aurélien. “Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems”. O'Reilly Media, Inc., 2017.

[Groves 2004] Groves, Robert, Floyd Fowler, Mick Couper, Eleanor Singer, and Roger Tourangeau. “Survey Methodology”. New York: Wiley. 2004.

[Hall 2018] Hall, Patrick, and Gill, Navdeep. “An Introduction to Machine Learning Interpretability”. O'Reilly Media, Inc., 2018.

[Hastie 2009] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. “The Elements of Statistical Learning: Data Mining, Inference and Prediction”. Second Ed. Springer, New York, NY, 2009.

[IEEE] https://standards.ieee.org/industry-connections/ec/autonomous-systems.html

[Ilyas 2018] Ilyas, Andrew, Logan Engstrom, Anish Athalye, and Jessy Lin. "Black-box Adversarial Attacks with Limited Queries and Information." arXiv preprint arXiv:1804.08598 (2018).

[Lapuschkin 2016] Lapuschkin, Sebastian, Alexander Binder, Grégoire Montavon, Klaus-Robert Muller, and Wojciech Samek. "Analyzing classifiers: Fisher vectors and deep neural networks." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2912-2920. 2016.

[Lipiec 2017] Lipiec, Maciej. “What we learned designing a chatbot for banking”. https://chatbotsmagazine.com/what-we-learned-designing-a-chatbot-for-banking-2dd2c51d7c2c

[Rosenfeld 2018] Rosenfeld, Amir, Richard Zemel, and John K. Tsotsos. "The elephant in the room." arXiv preprint arXiv:1808.03305 (2018).

[Settles 2012] Settles, Burr. "Active learning." Synthesis Lectures on Artificial Intelligence and Machine Learning 6, no. 1 (2012): 1-114.

[Speer 2017] Speer, Robyn. “ConceptNet Numberbatch 17.04: better, less-stereotyped word vectors”. https://blog.conceptnet.io/posts/2017/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/

[Treml 2017] Treml, Florian. “3 Things Your Chatbot Fails At (But Shouldn’t)”. https://chatbotsmagazine.com/3-things-your-chatbot-fails-at-but-shouldnt-1be87e81f6f7

[XAI] https://www.darpa.mil/program/explainable-artificial-intelligence

[Zliobaite 2010] Žliobaitė, Indrė. "Learning under concept drift: an overview." arXiv preprint arXiv:1010.4784 (2010).

Thank you