Managing Machine Learning Model Risk
May 13, 2019
Agus Sudjianto, Harsh Singhal and Jie Chen
© 2019 Wells Fargo Bank, N.A. All rights reserved. For public use.
Master Class Agenda
• Introduction (15 minutes) – Agus
• Overview of Machine Learning
  – Ensemble Model Methodology and Examples: Random Forest and GBM (60 minutes) – Jie
  – Deep Learning Methodology and Examples: Feedforward, Recurrent, and Generative Adversarial Network (60 minutes) – Jie
• Natural Language Processing (45 minutes) – Harsh
  – Language Models
  – Neural Architecture
• Machine Learning Interpretability (90 minutes) – Jie
  – Post-hoc methodology
  – Model distillation
• Structured-Interpretable Models – Agus
• Validation of Machine Learning Models (90 min) – Harsh
  – Inputs/Data: bias and privacy test
  – Model specification: interpretability
  – Performance: fairness and performance testing
  – Model Monitoring and change control
  – Fail safe and disclosure
Optional Lunch Time Bonus: Deep Learning Techniques for Derivatives Pricing – Bernhard
Machine Learning Methodology: Ensemble Model Methodology and Examples
May 13, 2019
Jie Chen, Ph.D. MD, Head of Statistics and Machine Learning, Corporate Model Risk
Outline
• Statistics vs Machine learning
• Introduction to machine learning
  – Supervised Learning
  – Unsupervised Learning
  – Semi-supervised learning
  – Reinforcement Learning
• Decision Tree and CART
• Ensemble algorithms
  – Bagging
  – Random forest
  – Boosting
• Probability Calibration
• Classification Example
Statistics vs ML
• Leo Breiman: Two modelling paradigms: data model and algorithmic model
  – Breiman (2001) Statistical Modeling: The Two Cultures, Statistical Science
• Traditional Statistics (data model)
  – View: Data are generated by some underlying parametric model; the goal is inference and interpretation of the model
  – Extensive interaction between data and data analyst
    o Summary, visualization, identification of outliers, shapes of distributions, transformation, …
  – Parameter estimation, testing, confidence intervals, asymptotic theory → based on model assumptions and theory
  – Dimensionality is a curse → variable selection
  – Model validation: goodness-of-fit (GOF) tests, residual diagnostics
  – Tailored for small data sets, few variables, structured data.
  – Driven by statisticians
• Criticism
  – A simple parametric model is imposed on data generated by a complex system. Information obtained may be questionable.
  – Omnibus GOF tests, which test in many directions, have low power and will not reject until the lack of fit is large.
  – Feature engineering has to be done manually, which involves a lot of hand crafting and is impractical for a large number of variables.
Statistics vs ML
• Leo Breiman: Two modelling paradigms: data model and algorithmic model
  – Breiman (2001) Statistical Modeling: The Two Cultures, Statistical Science
• Machine Learning (algorithmic model)
  – View: Data mechanism is unknown and there is no intrinsic interest in the data generation process. The goal is to get the most accurate model, however complicated.
  – Very little direct interaction with the data
  – Emphasis on better algorithms, speed, efficiency of computing, parameter tuning
    o Data mining – exploratory data analysis on steroids
    o Neural networks, Boosting algorithms, etc.
  – Algorithms are black boxes → hard to interpret
  – Dimensionality is a blessing → variable selection is not needed, feature creation is encouraged (SVM).
  – Model validation: check prediction accuracy on a testing set
  – Tailored for large data sets, with a large number of variables, unstructured data.
  – Driven by computer scientists, engineers, and a few statisticians
• Criticism
  – Lack of interpretability.
Statistics vs ML
• Michael Jordan: the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.
• Distinction is blurring …
• Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning
• Data Science has emerged as an alternative term to combine both fields… but includes DBM and computing
Machine Learning vs Artificial Intelligence (wiki and other sources)
§ Machine Learning:
  – Term coined by Arthur Samuel (IBM) in 1959
  – Gives "computers the ability to learn without being explicitly programmed"
  – Study and construction of algorithms that can learn from data, summarize features, recognize patterns, make predictions, and take actions …
  – Related to statistics ("computational statistics") but different paradigms
  – A key pathway to AI
§ Artificial Intelligence: concerned with making computers behave like humans
  – Term coined in 1956 by John McCarthy (MIT)
  – Study of "intelligent agents": devices that perceive the environment and take actions that maximize their chance of success at some goal.
  – Long history: formal reasoning in philosophy, logic, …
  – Resurgence of AI techniques in the last decade: advances in computing power, computing and data architectures, sizes of training data, and theoretical understanding
  – Deep Learning Neural Networks: At the core of recent advancements in AI, specifically for certain classes of ML tasks (Reinforcement Learning and Representation Learning)
  – Applications:
    • Pattern recognition: speech (Siri), image (DeepFace), handwriting, …
    • Autonomous systems: drones, self-driving cars
    • Recommender systems, drug discovery, marketing, …
Machine Learning: Tasks and Techniques
• Tasks:
  • Supervised Learning:
    • Regression and classification
  • Unsupervised Learning:
    • Discover underlying structure
    • Dimension reduction, clustering, …
  • Semi-supervised learning
  • Reinforcement Learning:
    • Identifying how to make good decisions from context: observe, learn, and optimize
    • Deep reinforcement learning
  • Representation Learning:
    • Feature selection and engineering
Supervised Machine Learning
§ Supervised learning means the desired outcome is known, i.e., the response variable is given.
§ Learning is supervised by the response: minimizing the error between the prediction and the response.
§ Algorithms that fall under this category:
  – K-nearest neighbor
  – LASSO, Elastic Net
  – Support vector machine
  – Decision trees
  – Ensemble methods
  – Neural networks
    • Artificial Feed Forward NN
    • More complex NN for DL
Supervised Machine Learning
§ Machine learning algorithms usually come with hyper-parameters, which control the complexity of the algorithm.
  – For example, trees have depth, number of terminal nodes, etc. to define the tree structure.
  – Neural networks have number of layers, number of neurons per layer, activation function, etc. to define the network structure.
§ Complexity is related to the bias-variance trade-off. Prediction error can be decomposed into bias and variance.
Bias and variance trade-off
§ Bias: $f(x) - \mathbb{E}[\hat f(x)]$. Simpler models have larger bias, and vice versa.
§ Variance: $\operatorname{Var}(\hat f(x))$. Simpler models have smaller variance, and vice versa.
§ The best model is the one that achieves a good balance between bias and variance → hyper-parameter tuning
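The decomposition of prediction error can be written out explicitly. The following is a sketch under assumptions not stated on the slide: the response is $y = f(x) + \varepsilon$, the noise $\varepsilon$ has mean zero and variance $\sigma^2$, and it is independent of the fitted model $\hat f$. The expectation is taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat f(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on the model, which is why tuning model complexity trades one against the other.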
Supervised Machine Learning: Tuning
§ Hyper-parameter tuning finds the hyper-parameters that give the most accurate machine learning algorithm. It is the key to the success of machine learning algorithms.
§ A simple model structure with small data calls for a less complicated algorithm, while a more complicated model structure with large data calls for a more complicated algorithm. The hyper-parameters are therefore data dependent, and they need to be tuned to get the best model.
§ Tuning involves a search routine and an evaluation routine. For each hyper-parameter setting, fit the model and evaluate the model performance; use the search routine to find the hyper-parameter setting/model that optimizes the model performance.
Supervised Machine Learning: Tuning
§ Search routine. Some popular ones are
  – Grid search: define a grid of parameters and search this entire grid.
  – Randomized search: randomly select parameters from a distribution to search.
  – Bayesian hyper-parameter optimization: model the prediction performance as a Gaussian Process.
§ Evaluation routine. The model performance is measured by
  – Continuous response: mean squared error
  – Categorical response: AUC/Gini (binary response), error rate, logloss
§ It is well known that a model that minimizes the loss/error on the training data is likely to overfit. To avoid this, the performance is measured on separate validation data, or using cross-validation.
§ Cross-validation. The typical K-fold cross-validation works as follows:
  1. Randomly divide the data into K folds. Stratification may be needed for imbalanced data.
  2. For each i = 1, …, K:
     1. Leave the ith fold out and build a model using the remaining K-1 folds.
     2. Predict on the ith fold.
  3. After obtaining the cross-validation predictions for the entire data, compute the loss/error. This is the cross-validation model performance.
§ Since both training data and validation data are used in the construction of the best model, the model performance has to be evaluated on a separate test set.
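The search and evaluation routines can be combined in a few lines with scikit-learn. This is a minimal sketch rather than a recommended setup: the synthetic data, the decision-tree learner, the grid values, and the 75/25 split are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for a real modeling set (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Hold out a test set that plays no part in tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Search routine: an exhaustive grid over two complexity hyper-parameters.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 10, 30]}

# Evaluation routine: stratified 5-fold cross-validation scored by AUC.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=cv)
search.fit(X_dev, y_dev)

best_params = search.best_params_         # setting with the best cross-validated AUC
cv_auc = search.best_score_               # cross-validation performance of that setting
test_auc = search.score(X_test, y_test)   # AUC of the refitted model on the untouched test set
```

The cross-validated AUC is used only to pick the setting; the test AUC is the number to report, since the test rows took no part in the search.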
Unsupervised learning
§ Unsupervised learning means there is no response. The observations are unlabeled.
§ It is used for clustering, dimension reduction, anomaly detection, etc.
§ Algorithms that fall under this category:
  – Clustering
    • K-Means
    • Hierarchical clustering
    • Mixture models
  – Visualization and dimensionality reduction
    • PCA
    • Kernel PCA
    • Locally-linear embedding
    • T-distributed stochastic neighbor embedding (t-SNE)
  – Association rule learning
Semi-supervised learning
§ Sometimes it is very expensive or hard to obtain labels, so only part of the data is labeled.
  – Unlabeled data contains both 1's and 0's.
  – Labeled data contains only 1's → PU (positive-unlabeled) learning
§ Train only using labeled data? Not accurate.
§ Unlabeled data gives the "background" information
§ Background information can increase the accuracy
§ Algorithms:
  – Self training: label the unlabeled data by training a supervised algorithm on the labeled data, and iterate. It is a heuristic algorithm, but in some cases it is equivalent to the EM algorithm.
    • only add the most confident predictions;
    • add all but weight by confidence.
  – Generative models: assume a probabilistic generative model (e.g., Gaussian mixture model, NB, HMM) and maximize the likelihood using EM.
    • If the model is correct, it is very effective; otherwise, unlabeled data can hurt
  – Cluster and label: use any clustering algorithm for clustering and assign labels using the majority of labeled points
  – Graph-based methods: a graph is given on the labeled and unlabeled data; instances connected by a heavy edge tend to have the same label.
Reinforcement Learning
Distinct arm of ML:
• Do not directly observe the 'right decision'
• Observe context (environment), make decision, and see outcome
• Learn from decision over time – reward
• Search over context space, learn, and identify how to optimize
• Explore and exploit trade-off: decisions that improve the estimated model vs decisions that appear to be optimal under the current model
• Mathematical framework: Markov decision process (MDP) or partially observable MDP
Megajuice, https://commons.wikimedia.org/w/index.php?curid=57895741
Canonical applications:
• Precision medicine – right treatment for right patients at right time
• Robotics: agents interacting with environment to learn how to perform a task optimally
• Recommendation systems: which advertisements or products to display given past browser or purchase history
Decision Tree and CART
• A decision tree partitions the feature space into a set of rectangles and fits a simple model (e.g., a constant) in each one.
• Advantages:
  – Fast, intuitive
  – Able to handle both numeric and categorical data
  – Robust to outliers in predictors
  – Models interaction and nonlinearity automatically (little data transformation)
• Disadvantage: weak learner
  – High bias for shallow trees, for example when trying to model linear relationships
  – Unstable, high variance for deep trees. A small change in the data can result in a completely different tree
Decision Tree and CART
• There are many different decision tree algorithms
• Ross Quinlan invented three implementations: ID3, C4.5 and C5.0
  – ID3 (iterative dichotomiser 3) is the first generation, invented by Quinlan (1986)
  – C4.5 improves upon ID3 by allowing both discrete and continuous variables, tree pruning, missing value handling, etc. C5.0 further improves on speed and memory.
  – Splitting is based on minimum entropy (or maximum information gain). Only supports a categorical response.
• CART (classification and regression tree) is similar to C4.5, first introduced by Breiman.
  – Starts from the root node with all data
  – Splits into several child nodes based on a certain variable; the goal is to make each child node as homogeneous as possible
  – The heterogeneity of each node is measured by squared error for regression and Gini/entropy impurity for classification
    • Gini impurity: $1 - \sum_{k=1}^{K} p_k^2$, where $p_k$ is the probability of class $k$.
    • Entropy impurity: $-\sum_{k=1}^{K} p_k \log p_k$
    • When the class is pure, Gini impurity and entropy impurity = 0
  – Pruning: the tree is grown large and pruned to minimize the cost complexity function: each leaf incurs a penalty set by the complexity parameter
  – Some other features: missing value handling, surrogate split.
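The two impurity measures are easy to compute directly from the class labels in a node. The sketch below is illustrative only; the helper names (`class_proportions`, `split_impurity`) and the toy label lists are assumptions, and the size-weighted average used to score a split is one common convention.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Proportion p_k of each class present in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini_impurity(labels):
    """1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy_impurity(labels):
    """-sum_k p_k log p_k (natural log)."""
    return -sum(p * math.log(p) for p in class_proportions(labels))

def split_impurity(left, right, impurity=gini_impurity):
    """Size-weighted impurity of two child nodes; lower means more homogeneous."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

pure_node = [1, 1, 1, 1]    # a single class: both impurities are 0
mixed_node = [0, 0, 1, 1]   # a 50/50 node: Gini = 0.5, entropy = log 2

perfect_split = split_impurity([0, 0], [1, 1])   # separates the classes completely
useless_split = split_impurity([0, 1], [0, 1])   # children as mixed as the parent
```

A tree-growing routine would evaluate `split_impurity` for every candidate variable and threshold and keep the split with the lowest value.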
Decision Tree and CART
• Tuning parameters for CART
  – Splitting criterion (gini or entropy)
  – Max tree depth
  – Min leaf size
  – Complexity parameter for pruning
  – …
• Implementation
  – Scikit-learn: DecisionTreeRegressor and DecisionTreeClassifier
  – R: rpart package
  – Spark: mllib library
Ensemble Algorithms
Improve performance by combining the outputs of several individual predictors.
Examples:
• Bagging
• Boosting
• Model Averaging
• Majority Voting
• Ensemble Stacking
web.engr.oregonstate.edu/~xfern/classes/cs534/notes/ensemble-11.pdf
Bagging
• Bagging (bootstrap aggregating) is an early ensemble method invented by Breiman in 1994.
• Bagging works by
  – Taking a bootstrap sample at each iteration $b$, $b = 1, 2, \dots, n$.
  – Fitting the base learner to the bootstrap sample to get a base model → $\hat f_b(x)$
  – Combining all base model predictions by averaging (regression) or majority voting (classification)
• Tuning parameters: base learner parameters plus n (number of base learners).
  – For n, the more the better, as long as computation allows.
• A deep decision tree is a good choice of base learner
• Bagging leads to "improvements for unstable procedures" (Breiman 1996), for example, deep decision trees.
  – Averaging reduces variance. In the independent case, $\operatorname{Var}\big(\frac{1}{n}\sum_{b} \hat f_b(x)\big) = \frac{\operatorname{Var}(\hat f_b(x))}{n}$. However, base model predictions are not independent because the bootstrap samples have overlapping data. The variance will flatten off instead of going to 0.
  – Correlation limits the reduction of variance, hence de-correlating the base models is important. How to further de-correlate the base models?
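The three bagging steps (bootstrap, fit, average) can be sketched for the regression case as follows. This is an illustration under stated assumptions: the noisy sine data, the helper names `fit_bagged_trees` and `predict_bagged`, and the choice of 50 fully grown scikit-learn trees are made up for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    """Hypothetical smooth signal used to generate toy data."""
    return np.sin(2 * np.pi * x)

# Noisy training data and a clean evaluation grid (assumptions for illustration).
x_train = rng.uniform(0, 1, 200)
y_train = true_f(x_train) + rng.normal(0, 0.3, 200)
x_test = np.linspace(0, 1, 200)
X_train, X_test = x_train.reshape(-1, 1), x_test.reshape(-1, 1)

def fit_bagged_trees(X, y, n_learners, rng):
    """Fit one fully grown tree per bootstrap sample."""
    models = []
    n = len(y)
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)   # draw n rows with replacement
        models.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Combine base model predictions by averaging (regression case)."""
    return np.mean([m.predict(X) for m in models], axis=0)

single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
bagged = fit_bagged_trees(X_train, y_train, n_learners=50, rng=rng)

# Error against the noise-free signal, to expose the variance of each predictor.
mse_single = np.mean((single_tree.predict(X_test) - true_f(x_test)) ** 2)
mse_bagged = np.mean((predict_bagged(bagged, X_test) - true_f(x_test)) ** 2)
```

A single deep tree chases the noise in its training sample, so averaging many bootstrap trees should give a visibly lower `mse_bagged` on data like this.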
Random Forest
• Random forest is a modified version of bagging.
• It was popularized by Breiman (2001) and combines
  – Bagging applied to the tree algorithm
  – Random selection of features
• It builds deep trees, which have high variance but low bias, and reduces the variance through bagging.
  – A variant is to sample without replacement instead of bootstrap sampling.
• To achieve the maximum amount of variance reduction, different trees need to be as uncorrelated as possible. This is done through
  – Random feature sampling: for each split, use a random subset of features as candidate split variables, instead of the entire feature set.
• Tuning parameters: n (number of trees), mtries (number of variables to sample in each split), tree depth, …
  – n: the more the better, as long as computation allows
  – mtries: there are some default values, e.g., $\sqrt{p}$ for the classification case and $p/3$ for the regression case, where $p$ is the number of features. Too small is generally not good (you may not be able to include any important variables in your random selection); too big is also bad as it leads to higher correlation.
  – tree depth: deep trees. Breiman suggested fully grown trees, but this is rarely a good idea for large data (storage and computation). In addition, fully grown trees can result in too rich a model and incur overfitting. Tuning the depth can improve model performance.
Random Forest
• Implementation:
  – Scikit-learn: RandomForestRegressor and RandomForestClassifier
  – R: randomForest package
  – Spark: mllib library
  – H2o: h2o.randomForest
• Random forest can be used off the shelf with default parameter settings
• Other features: OOB (out-of-bag) error. It can be used in place of validation data to tune the algorithm.
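As a hedged illustration of the out-of-bag idea, scikit-learn's `RandomForestClassifier` can report accuracy on the rows each tree did not see during its bootstrap draw. The synthetic data and the parameter values below are assumptions chosen only to make the example run quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for a real modeling set (assumption).
X, y = make_classification(n_samples=500, n_features=12, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of trees: more is generally better
    max_features="sqrt",   # features sampled per split (sqrt(p) for classification)
    max_depth=None,        # fully grown trees; worth tuning on large data
    bootstrap=True,        # each tree sees a bootstrap sample
    oob_score=True,        # score each row using only trees that did not train on it
    random_state=0,
)
forest.fit(X, y)

oob_accuracy = forest.oob_score_   # out-of-bag accuracy, no separate validation set needed
```

Comparing `oob_accuracy` across a few values of `max_features` or `max_depth` is one way to tune the forest without setting aside validation data.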
Boosting
• Boosting is a different type of ensemble algorithm, based on removing the bias of a simple learner.
• Given a simple learner, can you improve it to be a strong learner? (Kearns and Valiant 1988)
• Schapire (1989): Yes → by a technique called "boosting"
• Freund and Schapire (1995): AdaBoost for classification
• "Base learner": simple rectangular classification regions at each stage
• Reweighting at each stage – more weight to data that are misclassified
• Fit an additive model (ensemble)
Gradient Boosting
• Breiman (1998+): Boosting is actually an optimization algorithm
• Friedman (2000+): Extended the concept to gradient boosting (gradient descent)
• First, define the loss function to minimize: $L(y, f)$.
  – Different types of loss functions → different gradient descents
  – AdaBoost corresponds to exponential loss
  – Commonly used ones: squared error loss $(y - f)^2$ for regression and deviance (negative log-likelihood) $\log(1 + e^{f}) - yf$ for binary classification ($f$ is the log-odds)
  – Exponential loss is less robust than deviance loss when the data is noisy or the class labels are misspecified
  – Other loss functions: absolute error loss, partial likelihood, etc.
• For the given loss function, find the prediction function $f(x)$ that minimizes the total loss $\sum_{i=1}^{N} L(y_i, f(x_i))$.
  – The best function $f(x)$ is found in an additive, stage-wise way: $f(x) = h_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$, where $h_0(x)$ is the baseline (e.g., overall mean in regression).
  – In each stage $m$, update the prediction function in the direction $h_m(x)$ where the total loss decreases, with a step size/learn rate of $\gamma_m$.
  – A good direction to go is the negative gradient (gradient descent). Hence each base learner $h_m(x_i)$ is fit to the negative gradient.
    • For squared error loss, the negative gradient is (up to a constant factor) simply the error $r_{mi} = y_i - f_{m-1}(x_i)$ from the previous stage, where $f_{m-1}(x) = h_0(x) + \sum_{\ell=1}^{m-1} \gamma_\ell h_\ell(x)$
    • For deviance loss, the negative gradient is the error $r_{mi} = y_i - p_{m-1}(x_i)$, where $p = \frac{e^{f}}{1 + e^{f}}$ is the probability
Gradient Boosting
• Illustration for the regression case (figure not reproduced)
• Stochastic gradient boosting (Friedman 1999): fit each tree with a subsample instead of the entire data. This can be more robust and less prone to overfitting.
• Tuning parameters: number of trees, learn rate, tree depth, …
  – Number of trees: needs to be tuned. Too many cause overfitting (in contrast to random forest) and too few result in underfitting.
  – Learn rate: smaller is generally better, but it will require more trees to be built
  – Sample rate: for stochastic gradient boosting, default 0.5 but depends on data size.
  – Tree depth: shallow, in contrast to random forest
• Implementation:
  – Scikit-learn: GradientBoostingClassifier and GradientBoostingRegressor
  – R: gbm package
  – Spark: mllib library
  – H2o: h2o.gbm
  – XGBoost, LightGBM, CatBoost…
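For the regression case with squared error loss, the stage-wise procedure reduces to repeatedly fitting a shallow tree to the current residuals. The sketch below is a toy version under stated assumptions: a single numeric feature, depth-one trees (stumps), a fixed learn rate for every stage, and hypothetical helper names `fit_stump` and `gradient_boost`. It is not how the libraries listed above are implemented.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:          # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Stage-wise additive fit: baseline plus learn_rate times each stump."""
    baseline = sum(ys) / len(ys)                    # overall mean as the baseline
    learners = []
    preds = [baseline] * len(ys)
    for _ in range(n_trees):
        # Negative gradient of squared error loss = residual from the previous stage.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + learn_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: baseline + learn_rate * sum(h(x) for h in learners)

# Toy data (assumption): a noiseless quadratic on [0, 1].
xs = [i / 20 for i in range(21)]
ys = [x * x for x in xs]

def training_mse(model):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

baseline_only = gradient_boost(xs, ys, n_trees=0)
boosted = gradient_boost(xs, ys, n_trees=100, learn_rate=0.1)
```

Each stage can only lower the training error, which is why the number of trees has to be chosen on validation data rather than training data.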
XGBoost
• XGBoost stems from GBM but is different in several ways:
  – Includes regularization (L1, L2 penalties) and column sampling to better control overfitting
  – Uses a different optimization algorithm (Newton boosting rather than gradient boosting)
  – Supports a fast algorithm for tree splitting
  – Usually has better prediction performance (leading algorithm in Kaggle competitions)
• Key parameters for tuning:
  – Number of trees (n_estimators): needs to be tuned. Too many cause overfitting (in contrast to random forest) and too few result in underfitting.
  – Learning rate: smaller is generally better, but it will require more trees to be built.
  – Tree depth (max_depth): shallow, in contrast to random forest.
  – L1 regularization term on weights (reg_alpha): regularization parameter specific to XGBoost.
  – L2 regularization term on weights (reg_lambda): regularization parameter specific to XGBoost.
Comparison: GBM and Random Forest
• Random forest is "practically tuning free" and is less prone to overfitting than GBM
• Random forest is embarrassingly parallel. GBM builds one tree at a time.
• Random forest is slower to score and can take more time to train due to its tree depth.
• Several empirical comparisons of the performance of GBM and random forest have been done in the literature.
  – They have similar prediction performance, but generally a well-tuned GBM performs slightly better than random forest (Caruana et al. 2005).
• The internal mechanics are different: one focuses on reducing variance and the other focuses on reducing bias.
Probability calibration
• One challenge for ML classification: the probability scores from binary response regression are not well calibrated. The rankings of the observations are usually good, but the scores themselves do not align well with the predicted probabilities that one may get from, say, logistic regression models. E.g.,
  – Naïve Bayes tends to push scores to 0 or 1 due to the conditional independence assumption;
  – Support vector classifiers use the distance from a point to the decision boundary, which is not on the probability scale.
  – Bagging and random forests, which average predictions from a base set of models, can have difficulty making predictions near 0 and 1 because variance in the underlying base models will bias predictions that should be near zero or one away from these values.
• A reliability plot can be used to visualize such bias in the scores. A perfectly calibrated model will show approximately a 45-degree straight line, whereas SVC usually shows a sigmoid-shaped curve and Naïve Bayes shows the opposite.
• To correct the bias in probability scores, there are three main calibration methods:
  – Platt scaling
  – Isotonic regression
  – Spline calibration using natural cubic splines.
• Based on our experience, a well-tuned XGBoost model can produce quite accurate probabilities even without calibration, so calibration may not change much. On the other hand, a random forest model is less accurate, and you may see it over-predict significantly on in-time test and out-of-time test data.
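Platt scaling and isotonic regression are both available in scikit-learn through `CalibratedClassifierCV`, and `calibration_curve` returns the points of a reliability plot. The sketch below is illustrative only: the synthetic data, the Naïve Bayes base model, and the helper name `reliability` are assumptions, and spline calibration is not shown.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Redundant features violate the independence assumption of Naive Bayes,
# which tends to push its scores toward 0 and 1 (synthetic data, assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
# method="sigmoid" is Platt scaling; method="isotonic" is isotonic regression.
platt = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_train, y_train)
isotonic = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)

def reliability(model, n_bins=10):
    """Points of a reliability plot: mean predicted score vs observed event rate per bin."""
    prob = model.predict_proba(X_test)[:, 1]
    observed_rate, mean_score = calibration_curve(y_test, prob, n_bins=n_bins)
    return mean_score, observed_rate

curves = {name: reliability(model)
          for name, model in [("raw", raw), ("platt", platt), ("isotonic", isotonic)]}
```

Plotting each pair in `curves` against the 45-degree line shows how far the raw scores sit from the diagonal and how much each calibration method pulls them back.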
Documentation: https://scikit-learn.org/stable/modules/calibration.html
Classification Example
• Auto loan level loss forecast model
  – Objective: conditional probability of default
  – Model segment: Specific delinquency segment
  – Dependent variable: charged-off
  – Independent variables: Raw LOB independent variables
  – In-time data set: June 2004 – March 2016
  – OOT test set: April 2016 – May 2017
  – Training/Validation splitting
    o Clustered by customer ID
    o 2/3 in-time data set for training
    o 1/3 in-time data set for validation
• Hyperparameter tuning: grid search
  – Hyperparameter tuning grids can be different for small data, medium data and big data. More advanced users can adjust the tuning grids according to their own needs and computation time budget.
• XGBoost:
  – 'learning_rate': 0.05
  – 'max_depth': 5
  – 'n_estimators': 300
  – 'reg_alpha': 0
  – 'reg_lambda': 1
• Random Forest:
  – 'max_depth': 15
  – 'max_features': 6
  – 'n_estimators': 300
Example – Probability Calibration
• Reliability plot by XGBoost before/after calibration – No significant improvement
• Reliability plot by Random Forest before/after calibration – Probability calibration is needed
Example: Account level performance metrics
• In-time train
  – Over-fitting by Random Forest
• In-time test
  – XGBoost > Random Forest > Logistic Regression
• OOT test
  – XGBoost > Random Forest > Logistic Regression
Example: Aggregated level performance metrics
• In-time test set, over date: XGBoost > Logistic > Random Forest
              XGBoost_whole_time   H2oLogist_whole_time   RandomForest_whole_time
  MAE         0.012                0.015                  0.015
  pRMSE(%)    15.18                18.34                  16.67
  MAPE(%)     5.00                 5.50                   6.24
  CPE(%)      1.76                 -1.24                  2.81
• OOT test set, over date: (MAPE) Random Forest ~ XGBoost > Logistic
              XGBoost_whole_time   H2oLogist_whole_time   RandomForest_whole_time
  MAE         0.006                0.008                  0.005
  pRMSE(%)    7.72                 9.27                   7.59
  MAPE(%)     6.95                 8.86                   6.80
  CPE(%)      6.50                 8.51                   5.88
Machine Learning Methodology: Deep Learning Methodology and Examples
May 13, 2019
Jie Chen, Ph.D. MD, Head of Statistics and Machine Learning, Corporate Model Risk
Outline
§ Introduction
§ Artificial Neural Networks
§ Training Neural Networks
§ Advanced Network Architectures
§ Practical Considerations
§ Neural Networks vs Ensemble Methods
§ Classification Example
§ Time Series Simulation by Conditional Generative Adversarial Net
Introduction
• Machine learning model inspired by neuroscience
• Cyclical in popularity; recent boom
• Recent wins: unstructured data, such as images, text, and speech.
• Advantages:
  – Flexibility
  – Batch training for large data
  – Unstructured and hybrid data: automatic feature engineering
Artificial Neural Networks
Artificial Neurons
• Inputs: $x_1, x_2, \dots, x_n$
• Weights: $w_1, w_2, \dots, w_n$ and bias $b$
• Calculates a linear combination of the inputs: $z = \sum_i w_i x_i + b$
• Output applies an activation function to $z$: $a = \sigma(z)$
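A single artificial neuron is only a weighted sum followed by an activation function. The sketch below uses made-up inputs and weights; the function name `neuron` and the choice of sigmoid are assumptions for illustration.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Linear combination of the inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0, 3.0]      # inputs (illustrative values)
w = [0.5, -0.25, 0.0]    # weights (illustrative values)
b = 0.0                  # bias

# Here z = 0.5*1 - 0.25*2 + 0*3 + 0 = 0.
linear_out = neuron(x, w, b, activation=lambda z: z)   # identity activation
sigmoid_out = neuron(x, w, b, activation=sigmoid)      # sigmoid activation
```

With the identity activation the neuron is a linear regression unit; with the sigmoid it is a logistic regression unit.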
Activation Functions
• Introduce nonlinearities into the network.
• Popular Choices:
  – Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  – Hyperbolic Tangent: $\sigma(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
  – Identity Function: $\sigma(z) = z$
  – Rectified Linear Units (ReLU): $\sigma(z) = \max(0, z)$
  – Leaky ReLU: $\sigma(z) = \max(\alpha z, z)$; $\alpha < 1$
  – Other specialized options
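The popular activation functions translate directly into code. This is a plain-Python sketch for scalar inputs; the default slope of 0.01 for the leaky ReLU is an assumption (any value below 1 fits the definition).

```python
import math

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1)."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def identity(z):
    return z

def relu(z):
    """Zero for negative inputs, linear for positive inputs."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU but with a small slope alpha < 1 for negative inputs."""
    return max(alpha * z, z)

# Evaluate each activation at a negative, zero, and positive input.
values = {f.__name__: [f(z) for z in (-2.0, 0.0, 3.0)]
          for f in (sigmoid, tanh, identity, relu, leaky_relu)}
```

The table in `values` makes the differences concrete: ReLU zeroes the negative input, the leaky ReLU keeps a small fraction of it, and the sigmoid and tanh saturate.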
Example: Single Neuron Networks
• Linear Regression
  – Activation function: Identity function
  – Therefore: $\hat y_i = \sum_j w_j x_{j,i} + b$
• Logistic Regression
  – Activation function: Sigmoid
  – Therefore: $\hat y_i = \frac{1}{1 + \exp\left(-\left(\sum_j w_j x_{j,i} + b\right)\right)}$
Perceptrons: Network with Single Hidden Layer
• Using a set of neurons, or "hidden units", between the input and output allows the network to represent more complex functions of the input.
• "Universal Approximation Theorem": With a wide enough hidden layer and a squashing activation function, a neural network can approximate any well-behaved function arbitrarily well.
• The catch: potential overfitting and computational issues.
Deep Neural Networks: Multiple Hidden Layers
• Adding layers of hidden units increases the representation power of the ANN at a lower computational cost than widening a single hidden layer.
• Empirically, deep networks seem less prone to overfitting than wide, shallow networks.
(not really a “deep” network)
Output Layers:
• Neural networks may be adapted to different machine learning tasks by appropriately choosing the output layer. For example:
  – A single node with an identity activation function represents a univariate regression task.
  – A single node with a sigmoid activation function can be used for binary classification, when the target $y$ takes values of 0 or 1: $\sigma(z) = \frac{1}{1 + e^{-z}}$
  – A set of K output nodes can be used for K-class classification tasks using the "softmax" activation function. Each output node gives the probability that the corresponding observation belongs to one of the K classes: $\sigma(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}}$
  – Something more customized to a specific task, such as a sequence.
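The softmax output layer turns K raw node values into K probabilities that sum to one. The sketch below subtracts the largest value before exponentiating; that shift is a standard numerical-stability step added here as an assumption, and it does not change the result.

```python
import math

def softmax(z):
    """Map raw output-node values to class probabilities that sum to 1."""
    shift = max(z)                               # stability shift; cancels in the ratio
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])                 # larger raw value -> larger probability
uniform = softmax([5.0, 5.0, 5.0, 5.0])          # equal raw values -> equal probabilities
large = softmax([1000.0, 1000.0])                # would overflow without the shift
```

With two classes, softmax reduces to the sigmoid of the difference between the two raw values, which is why a single sigmoid node is enough for binary classification.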
Training Neural Networks
Fitting a Neural Network to Data: Learning the Weights
• The weights (and bias) of each neuron are the unknown parameters in an ANN that need to be learned from data.
• To do so, we define an appropriate cost function. The cost function should:
  – Represent an average of the cost of individual observations in the training set.
  – Be a function of the outputs from the ANN and the response y.
  – Example: $\frac{1}{2n} \sum_i (y_i - \hat y_i)^2$
• Choose the weights and biases that minimize the cost function.
  – In principle, this can be achieved using calculus.
  – Numerical solutions are a well-studied field.
  – However, many of these solutions are not easily implemented in ANNs.
Choice of Cost Functions
• The cost (or loss) function provides a global measure comparing the output of a network with the true response for a set of data.
• The choice of loss function depends heavily on the task. For example:
  – For continuous responses, we use squared error loss: $\frac{1}{n} \sum_i (y_i - \hat y_i)^2$
  – For binary responses, we use cross entropy, or log loss: $-\frac{1}{n} \sum_i \left[ y_i \log \hat y_i + (1 - y_i) \log(1 - \hat y_i) \right]$
  – For multinomial responses, we use a generalization of cross-entropy, with $j$ indexing category: $-\frac{1}{n} \sum_i \sum_j y_i^{(j)} \log \hat y_i^{(j)}$
  – For other tasks, other loss functions.
Gradient Descent
• Iterative algorithm to minimize a function $f(\vec{x})$.
  1. Start with an initial point $\vec{x}_0$.
  2. Propose a new point via: $\vec{x}_{t+1} = \vec{x}_t - \eta \nabla f(\vec{x}_t)$, where $\eta$ is a small constant called the learning rate.
  3. Repeat until $\vec{x}_t$ converges.
• However, computing the gradient in neural networks can be challenging.
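The three steps of gradient descent map onto a short loop. In this sketch the objective is a hypothetical two-variable quadratic with a known minimum at (3, -1), and the stopping rule (largest coordinate change below a tolerance) is one assumed way of deciding that the iterates have converged.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tol=1e-8, max_iter=10_000):
    """Repeat x <- x - learning_rate * grad(x) until the update is tiny."""
    x = list(x0)                                        # step 1: initial point
    for _ in range(max_iter):
        g = grad(x)
        x_new = [xi - learning_rate * gi for xi, gi in zip(x, g)]   # step 2: update
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:         # step 3: converged?
            return x_new
        x = x_new
    return x

def grad_f(x):
    """Gradient of the toy objective f(x) = (x[0] - 3)^2 + 2 * (x[1] + 1)^2."""
    return [2 * (x[0] - 3), 4 * (x[1] + 1)]

x_min = gradient_descent(grad_f, x0=[0.0, 0.0])
```

A learning rate that is too large makes the iterates overshoot and diverge on this objective (anything above 0.5 here), while a very small one converges slowly.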
Back Propagation Algorithm
• Preparation: Input the data x, and initialize all weights in the network.
• The algorithm:
  1. Feedforward: Feed the data through the network, computing the output of each node based on the current weights.
  2. Gradient: Compute the gradient of the cost function with respect to the last hidden layer.
  3. Backward Propagation: Work backwards through the network, computing the gradient of the cost function w.r.t. the weights in layer l-1 using the chain rule and the gradient w.r.t. the weights in layer l.
  4. Update the weights using gradient descent, and return to step 1.
Back Propagation Algorithm
• The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:
  – Input x: Set the corresponding activation $a^{1}$ for the input layer.
  – Feedforward: For each $l = 2, 3, \dots, L$ compute $z^{l} = w^{l} a^{l-1} + b^{l}$ and $a^{l} = \sigma(z^{l})$.
  – Output error $\delta^{L}$: Compute the vector $\delta^{L} = \frac{\partial C}{\partial z^{L}} = \nabla_a C \odot \sigma'(z^{L})$.
  – Back propagate the error: For each $l = L-1, L-2, \dots, 2$ compute $\delta^{l} = \frac{\partial C}{\partial z^{l}} = \left( (w^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l})$.
  – Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k} \delta^{l}_{j}$ and $\frac{\partial C}{\partial b^{l}_{j}} = \delta^{l}_{j}$.
• In particular, given a mini-batch of m training examples, apply a gradient descent learning step based on that mini-batch.
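The backpropagation steps can be sketched for one training example and checked against a finite-difference gradient. The assumptions here are illustrative and not taken from the slides: sigmoid activations in every layer, the quadratic cost C = 0.5 * ||a - y||^2 (so the gradient of C with respect to the output activation is a - y), a tiny 3-4-2 network, and random weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def cost(weights, biases, x, y):
    """Quadratic cost for one example: 0.5 * ||a_L - y||^2."""
    a = x
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return 0.5 * np.sum((a - y) ** 2)

def backprop(weights, biases, x, y):
    # Feedforward: store every weighted input z and activation a.
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    # Output error: (gradient of C w.r.t. output activation) * sigma'(z_L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_w[-1] = delta @ activations[-2].T
    grad_b[-1] = delta
    # Back propagate the error through the earlier layers.
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_w[l] = delta @ activations[l].T
        grad_b[l] = delta
    return grad_w, grad_b

rng = np.random.default_rng(0)
sizes = [3, 4, 2]   # input, hidden, output layer widths (assumption)
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(n_out, 1)) for n_out in sizes[1:]]
x = rng.normal(size=(3, 1))
y = np.array([[0.0], [1.0]])

grad_w, grad_b = backprop(weights, biases, x, y)

# Finite-difference check on a single first-layer weight.
eps = 1e-6
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][1, 2] += eps
w_minus[0][1, 2] -= eps
numeric = (cost(w_plus, biases, x, y) - cost(w_minus, biases, x, y)) / (2 * eps)
```

For a mini-batch, the same function would be called per example and the gradients averaged before taking one gradient descent step.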
Other Optimization Concerns
• Variety of sophisticated methods to improve learning in ANNs:
  – Stochastic Gradient Descent
  – RMSProp
  – Adadelta
  – Adam
• These algorithms improve learning by using:
  – momentum, to prevent the gradients from changing too rapidly/overcorrecting
  – adaptive learning rates, to balance speed with accuracy
• In practice, this is done in batches of training data, called "minibatch learning".
Advanced Network Architectures
Convolutional Neural Networks
• Useful when observations consist of uniformly sized arrays of measurements of the same quantity, such as images or time series.
• Key Features:
  – Convolutional layers, where inputs are convolved with their neighbors.
  – Often use local pooling to reduce model parameters.
• Each output is a weighted average of the inputs: $\sum_{i,j} w_{i,j}\, x_{i,j}$
• The weights remain constant as the convolution is applied to successive windows of data (first window, second window, and so on).
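The sliding-window idea is easiest to see in one dimension. The sketch below applies the same kernel weights to successive windows of a short series and then applies non-overlapping max pooling. The series, the kernel, and the pool size are made-up values, and, as in most deep learning libraries, the kernel is not flipped (strictly a cross-correlation).

```python
def conv1d(signal, kernel):
    """Weighted sum of each window of the signal, with shared kernel weights."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, signal[i:i + k]))
            for i in range(len(signal) - k + 1)]

def max_pool1d(values, size):
    """Keep the largest value in each non-overlapping block of the given size."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # illustrative time series
kernel = [0.25, 0.5, 0.25]                # same three weights for every window

features = conv1d(signal, kernel)         # first window [1, 2, 3] -> 2.0, second [2, 3, 4] -> 3.0, ...
pooled = max_pool1d(features, 2)          # halves the length of the feature sequence
```

Because the three kernel weights are reused at every position, the layer has three parameters regardless of how long the input series is.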
Recurrent Neural Networks
• Useful in studying sequences, such as in a natural language context.
• RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations.
• RNNs have a "memory" by taking the previous output or hidden states as inputs.
• Several variations; the "Long Short-Term Memory" (LSTM) variation, which addresses the vanishing gradient problem, is currently the most successful
Hybrid Networks
• Architectures can be adapted for a variety of purposes. For example:
  – Combining Feed-Forward, Recurrent, or Convolutional Elements
  – Adding layer-skipping connections, or reducing the number of connections between layers
  – Merging different input layers, or splitting into separate output layers, for different tasks.
Autoencoders
• Dimension Reduction Network
• Network learns to predict the input from the input using smaller hidden layers.
• Bottleneck layer engineers lower-dimensional features. After training, these features may be extracted as a lower-dimensional representation of the data.
• Relationship to PCA:
  – If linear activations are used, the autoencoder weights span the same vector subspace as the corresponding set of principal components.
  – Not guaranteed to be equal to the PCs, nor orthogonal.
55 Generative/Adversarial Networks (GANs)
• Unsupervised technique • Pair of ANNs, trained with simultaneous backpropagation • A Generator Network, which produces candidate data examples • A Discriminator Network, which learns to distinguish the generated data from the real data. (Classification) • Simultaneous training improves the performance of both networks.
56 Practical Considerations
57 Using ANNs in Practice
• Determine Network Structure and Properties • Train the network effectively • Avoid overfitting • ANNs vs Other Machine Learning Techniques
Note: Much of the literature available gives advice in the context of unstructured data (images/text/speech).
Such advice may not be useful in banking problems.
58 Network Structure and Properties
• Very flexible choices for: – Number of hidden layers – Number of nodes on each hidden layer – Activation functions for each hidden layer – Regularization strategy/parameters – Additional features: o Skip connections o Batch normalization o Dropout o Constraints • Can make exhaustive search difficult
59 Training Effectively
• Training can be challenging; saddle points and local minima can result in a sub-optimal model. • Tips: – Standardize or Normalize data (X) before training. Avoids vanishing gradient problem. o Min/Max scaling is often used in the literature. o Gaussian standardization may perform better when large outliers are present. – Use Batch Normalization between hidden layers. – Consider using an optimization routine with learning rate decay (e.g. Adam). – Consider adjusting the batch size used in training. Smaller batches can be slower and more volatile, but can help escape local minima/saddle points. – Use early stopping to determine number of training epochs.
60 Overfitting
• ANNs are flexible models, with a (potentially) large number of parameters, therefore overfitting is a concern. • Strategies to avoid overfitting in ANNs include: – Multiple narrow layers vs. Single wide layer – Data augmentation – Weight Regularization: Penalizing large weights in the cost function. – Dropout: Randomly drop out units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. – Early Stopping via a validation set.
61 Neural Networks vs Ensemble Methods
• Ensemble Methods (Gradient Boosting, Random Forest): – Better predictive performance for structured data – Easier hyperparameter tuning: Smaller search space; less optimization tuning. – More natural handling of categorical variables (depends on implementation) • Artificial Neural Networks – More flexible data types; hybrid data – Analytical partial derivatives for more derivative based diagnostic tools – High performance for unstructured data – Larger hyperparameter space – Categorical variables need to be “dummy coded” or “one-hot encoded.” Many extra variables if large number of categories.
62 Examples
63 Classification Example
§ Response: 0s and 1s § Handcrafted predictors: 75 – some are highly correlated § Develop a good classifier. Techniques: o Logistic regression o Gradient boosting machine o Random forest o Convolutional Neural Net (with original time series data) o Naïve Bayes o SVM o Adaboost
§ Logistic regression – Top 10 variables + interactions – Fit Lasso – regularized regression – Also does variable selection § Compare performance with ML algorithms: – Accuracy on cross-validated data
64 Comparison of predictive performances
• Logistic regression vs GBM and RF – GBM and RF are generally better in terms of accuracy
• Logistic regression vs GBM, RF and Deep Learning (CNN) – CNN based on “raw” data
65 Time Series Simulation by Conditional Generative Adversarial Net
66 Outline
• Motivation • Introduction of GAN, GAN Variants and CGAN • Rationale of how GAN works • Simulation results • VaR and ES application for market risk • Macroeconomic time series simulation application
67 Motivation
• Traditional time series models are strongly dependent on model assumptions and estimation of the model parameters – Statistical time series models: AR, VAR, VECM, GARCH, … – stochastic process models: Hull White model, Ornstein-Uhlenbeck process,… – Complicated correlation modeling: Copula,…
• Therefore, traditional time series models are less effective in modeling – non-Gaussian, skewed, heavy-tailed distributions – complicated time-varying dependence and cross-correlation structure
• Generative Adversarial Net (GAN) and Conditional Generative Adversarial Net (CGAN) have proven to be powerful machine learning tools in image data analysis and generation. • We propose to use CGAN to learn and simulate time series for the bank’s applications.
68 GAN and its Variants
• GAN training is a minimax game on a cost function between generator (G) and discriminator (D), where both G and D are neural network models:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{\tilde{x} \sim p_g}[\log(1 - D(\tilde{x}))] \quad (1)$$
– Both D and G are trained simultaneously, where D receives either a generated sample $\tilde{x}$ or real data $x$, and D is trained to distinguish them by maximizing the cost function. – Meanwhile, G is trained to generate more and more realistic samples by minimizing the cost function. – The training stops when D and G achieve the Nash equilibrium, where neither of them can be further improved through training. • Issues with GAN: – Mode collapse: the generator collapses to a parameter setting where it always generates a small range of outputs. – Diminished gradient: the discriminator becomes so successful that the generator gradients vanish and the generator learns nothing. • Solutions – WGAN (Wasserstein GAN): a new cost function using Wasserstein distance – DRAGAN: a gradient penalty applied directly to GAN
69 Conditional GAN (CGAN)
• A conditional version of GAN was introduced by Mirza & Osindero in 2014. • CGAN enables GAN to generate specific samples given the conditions, where the same auxiliary condition, usually denoted by $y$, is applied to both generator and discriminator as additional input layers. • The cost function of CGAN is
$$\min_G \max_D \; \mathbb{E}_{x \sim p_r}[\log D(x, y)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z, y)))] \quad (2)$$
• The cost function of conditional WGAN (CWGAN) is
$$\min_G \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim p_r}[D(x, y)] - \mathbb{E}_{z \sim p_z}[D(G(z, y))] \quad (3)$$
where $\mathcal{D}$ is the set of 1-Lipschitz functions. • Condition for CGAN can be categorical or continuous – For image generating, categorical conditions like image categories are common. – For time series generating, continuous conditions based on the past information are more common for future prediction generation.
70 Rationale of GAN
Consider simulating a single random variable $X$ from uniformly distributed random noise $U \sim U(0,1)$. • According to inverse transform sampling, simulate $X \sim F_X^{-1}(U)$. • The GAN generator is building a nonlinear mapping $F_X^{-1}$ from $U$ to $X$, given one-dimensional random noise. • A single layer of NN with ReLU activation is “actually” a piecewise linear spline:
$$f(x) = b_0 + \sum_{j=1}^{J} b_j B_j(w_j x) \quad (4)$$
– $B_j(\cdot)$ with simple hinge functions are called ReLU (Rectifier Linear Units), $\max(0, w_j x - c_j)$
– $c_j$, the "knot locations", are called “bias weights”
• Unlike the spline approach, – Knot locations are optimized simultaneously among all input variables
– the knot location is optimized on scaled $x$, i.e. $(w_j x)$, instead of $x$
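The spline view can be checked numerically. The sketch below fits a one-layer ReLU expansion of the form (4) to the inverse CDF of N(0,1). It is a simplification under stated assumptions: the knots are fixed on a hypothetical grid and only the output weights are fitted by least squares, whereas a GAN generator learns all weights and knot locations through adversarial training.

```python
import numpy as np
from statistics import NormalDist

def relu_spline(u, b0, b, w, c):
    """f(u) = b0 + sum_j b_j * max(0, w_j * u - c_j): piecewise linear in u."""
    hinges = np.maximum(0.0, np.outer(u, w) - c)  # shape (len(u), J)
    return b0 + hinges @ b

# Target: inverse CDF of N(0, 1), i.e. the exact inverse-transform map.
u_grid = np.linspace(0.01, 0.99, 199)
target = np.array([NormalDist().inv_cdf(p) for p in u_grid])

# Hypothetical fixed knots c_j with unit input weights w_j.
knots = np.linspace(0.0, 0.95, 20)
w = np.ones_like(knots)
design = np.column_stack(
    [np.ones_like(u_grid), np.maximum(0.0, np.outer(u_grid, w) - knots)]
)
coef, *_ = np.linalg.lstsq(design, target, rcond=None)

fit = relu_spline(u_grid, coef[0], coef[1:], w, knots)
max_abs_err = np.max(np.abs(fit - target))

# Simulate X by pushing uniform noise through the fitted piecewise linear map.
noise = np.random.default_rng(0).uniform(0.01, 0.99, size=10_000)
samples = relu_spline(noise, coef[0], coef[1:], w, knots)
```

The largest errors sit in the tails, where the inverse CDF bends most sharply between knots.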
71 Rationale of GAN (Cont)
• Let us simulate a N(0,1) distribution from uniform U(0,1) random noise with GAN • In the following plots, the blue line is inverse CDF from U(0,1) (x-axis) to N(0, 1) (y-axis), and the green line is the spline approximation trained by GAN. – Left :1-layer with 7 nodes with single dimensional noise – Middle: 1-layer with 100 nodes with single dimensional noise – Right: 2-layer with 100*100 nodes with single dimensional noise
Note that some nodes collapse together. • For multivariate simulation from multi-dimensional random noise, a more complicated nonlinear mapping is established by GAN
72 Simulation study I: Gaussian Mixture Model with categorical or continuous conditions
• Gaussian Mixture Model with nominal categorical conditions – Four clusters of 2-dimensional Gaussian distributions with various means and variance
• Gaussian Mixture Model with Continuous Conditions – Mean: along the circle with center at (0, 0) and radius = 2. – Variance: linearly increase along the circle in an anticlockwise direction. – Final data: Gaussian distributions with condition on means and the corresponding variances
73 Simulation study II: VAR time series with time-varying volatilities
• Simulation model: $x_t = A x_{t-1} + \epsilon_t$, $x_t \in \mathbb{R}^2$, with $A = [0.8, 0.6]^T$ and $\epsilon_t \sim T(\text{mean} = \mathbf{0},\ \text{cov} = [\text{function of } x_{t-1}] \cdot I_2,\ \text{DF} = 20)$, where $I_2$ is the 2×2 identity matrix and $T$ is a Student-t distribution.
• We take 10000 1-time-lag sliding windows. The condition is the past time-lag $x_{t-1}$. • We compare mean, variance, skewness, and kurtosis between conditional distributions by CGAN (y axis) and the corresponding true conditional distributions (x axis) given 500 randomly selected conditions. Each conditional distribution has 10000 samples.
[Figure panels: 1st time series; 2nd time series]
74 VaR and ES for equity 1-day returns
• Equity spot prices for WFC and JPM from 11/1/2007 to 11/1/2011 are downloaded from Yahoo Finance. 1-day absolute returns are calculated and used as training data. • Stressed (11/2007-11/2009) and normal (11/2009-11/2011) periods are separated by using an indicator of periods as a categorical condition. • The Historic Simulation method is one of the most popular methods used by major financial institutions. This method is usually based on a relatively small number of actual historical observations and may lead to jumpy and non-smooth tail distribution and poor VaR and ES output. • We use CGAN to learn the historical data for both the stressed and normal periods, and generate a simulated sample set with a sample size 50 times larger than the original one. • We calculate the VaR and ES (see Table 1). The plots show that the large data set generated by CGAN gives a clear and smooth tail of the distribution.
[Table 1: VaR and ES] [Figures: Historical tail data; CGAN simulated tail data]
75 VaR and ES backtesting for equity 1-day returns
• Additional historical data for WFC and JPM stock prices from 11/1/2011- 11/1/2015 (around 1000 business days) is downloaded to implement the backtesting. • Since there has been no major financial crisis in this period, we use the VaR and ES from the normal period (in Table 1) as our measurement in the backtesting. • The expected breaches over 1000 days for 1-day 99% VaR is 10 days. Table 2 shows that the HS method may lead to an underestimated measurement of the portfolio loss, and CGAN outperformed the HS method in the calculation of VaR and ES for this example.
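The backtesting arithmetic above can be sketched as follows. The return series are synthetic Student-t draws standing in for the WFC/JPM data, which is not reproduced here, and the function names are hypothetical. The sketch shows a historical-simulation VaR and ES estimate and the breach count compared with the expected (1 − 0.99) × 1000 = 10 breaches.

```python
import numpy as np

def hist_var_es(returns, level=0.99):
    """Historical-simulation VaR and ES of the loss distribution (loss = -return)."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, level)
    es = losses[losses >= var].mean()  # average loss at or beyond the VaR level
    return var, es

def count_breaches(returns, var):
    """Number of days on which the realized loss exceeds the VaR estimate."""
    return int(np.sum(-np.asarray(returns, dtype=float) > var))

rng = np.random.default_rng(0)
train = rng.standard_t(df=4, size=500) * 0.01      # synthetic stand-in for the estimation window
backtest = rng.standard_t(df=4, size=1000) * 0.01  # synthetic stand-in for about 1000 business days

var99, es99 = hist_var_es(train)
breaches = count_breaches(backtest, var99)
expected_breaches = (1 - 0.99) * len(backtest)     # 10 days for 1000 days at 99%
```

A breach count well above the expected value suggests that the VaR estimate understates the loss, which is the pattern the slide reports for the HS method.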
76 Economic Forecasting Model
• CCAR requires multiple economic forecasts and different capital requirements during different hypothetical economic projections. • CGAN-based economic model provides an alternative approach to produce multi- quarter forecasts at once, and assess the distributions of the forecast paths. • Five popular macroeconomic index data from 1956 quarter 1 to 2016 quarter 3 from the U.S. Census Bureau: real Gross Domestic Product (GDP), unemployment rate (Unemp), Federal fund rate (Fedrate), Consumer Price Index (CPI) and 10-year treasury rate – Time series data : 5 variable x 242 quarter – Output data for CGAN: 230 sample x 5 variable x 9 quarter – Conditional data for CGAN: 230 sample x 5 variable x 4 quarter • (Top Plot) Forecast distribution: 100 forecasting paths of GDP generated by CGAN using the most recent four-quarter historical values as conditions. • (Bottom Plot) Shock analysis: Federal fund rate is shocked upward by one standard deviation in the last quarter. Average forecast is used to assess the impact. A positive shock to the Federal fund rate suppresses the economic activity and leads to a higher unemployment rate (red) compared to baseline (black).
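The sample counts above are consistent with overlapping windows of 4 conditioning quarters followed by 9 output quarters, since 242 − 4 − 9 + 1 = 230. The sketch below assumes that construction, which the slide does not spell out, and uses random numbers as a placeholder for the five macroeconomic series.

```python
import numpy as np

def make_cgan_windows(series, n_cond=4, n_out=9):
    """Split a (quarter, variable) array into overlapping (condition, output) pairs."""
    n_time = series.shape[0]
    n_samples = n_time - n_cond - n_out + 1
    cond = np.stack([series[i:i + n_cond] for i in range(n_samples)])
    out = np.stack([series[i + n_cond:i + n_cond + n_out] for i in range(n_samples)])
    # Reorder to (sample, variable, quarter), the layout quoted on the slide.
    return cond.transpose(0, 2, 1), out.transpose(0, 2, 1)

series = np.random.default_rng(0).normal(size=(242, 5))  # placeholder for 5 variables x 242 quarters
cond, out = make_cgan_windows(series)
```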
77 Introduction to Natural Language Processing
May 13, 2019 Presenter: Harsh Singhal Contributors: Jian Sun, Suhas Sreehari, Tarun Joshi, Eric Wang, Wayne Shoumaker
Agenda
• Pre-processing • Simple Text Classifier • Unsupervised Learning: LSA and LDA • Language Models: GloVe • Language Models: Key Properties • Neural Architectures for Text Classification • Interpretability • Transfer Learning • Bonus 1: Advanced Neural Architectures • Bonus 2: More Language Models
79 Pre-processing: From Text to Feature Vector
• The purpose of pre-processing is to transform text into data that can be digested by an algorithm, and to reduce the amount of information to core set for clarity and efficiency
Example: “An integral part of model development is testing”
– Lower Case, Remove Numbers and Punctuations; Tokenization: Split paragraphs and sentences into words → “an integral part of model development is testing”
– Stemming: Reduce words to their root by dropping unnecessary characters, such as suffix; Lemmatization: Alternative approach to stemming, using WordNet’s lexical database of English; Spelling Corrections, N-grams, POS Tagging, NER, Collocation Extraction → “integr part model develop model_develop test”
– Indexing and one-hot encoding → Bag of Words vector over vocabulary terms such as risk, test, valid, rigor, model, control, credible, risk_manage, model_develop
80 Simple Classifier: From Word Vectors to Classification
Linear SVM and Logistic Based Classification – Pre-processing converts words to features and creates new features based on word count – Text vectorization outputs the features to numerical vectors, e.g. count vectors, TF-IDF vectors – For classification problems, vector space based ML methods can be applied to find the decision boundary between two classes. Notable example: SVM. – Linear SVM defines the criterion that maximally separates the two classes, allowing users to adjust cost and penalty parameters on misclassification to suit business problems.
TF-IDF • Term Frequency – Inverse Document Frequency. • TF denotes the number of times that a term occurs in a given document • IDF is the logarithmically scaled inverse fraction of the documents that contain the term. • TF-IDF are frequency scores that try to highlight words that are more frequent in a document but not across documents.
Count vectors:
Document                      Had  Little  Lamb  Twinkle  Star  Light  Bright  Old  Farm
Mary had a little lamb         1     1      1       0      0      0      0      0     0
Twinkle twinkle little star    0     1      0       2      1      0      0      0     0
Star light star bright         0     0      0       0      2      1      1      0     0
Old McDonald had a farm        1     0      0       0      0      0      0      1     1
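The count vectors in the example, and one common TF-IDF weighting, can be reproduced with the standard library alone. The vocabulary is fixed to the nine column terms used on the slide, and the IDF variant log(N / document frequency) is an assumption, since libraries often add smoothing terms.

```python
import math
from collections import Counter

docs = [
    "Mary had a little lamb",
    "Twinkle twinkle little star",
    "Star light star bright",
    "Old McDonald had a farm",
]
vocab = ["had", "little", "lamb", "twinkle", "star", "light", "bright", "old", "farm"]

def count_vector(doc):
    """Term counts of a lower-cased, whitespace-tokenized document over vocab."""
    term_counts = Counter(doc.lower().split())
    return [term_counts[term] for term in vocab]

counts = [count_vector(d) for d in docs]

# IDF as log(N / document frequency); terms found in fewer documents weigh more.
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]
tfidf = [[tf * weight for tf, weight in zip(row, idf)] for row in counts]
```

In the first document, "lamb" (found in one document) receives a higher TF-IDF score than "had" (found in two), although both occur once.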
81 Unsupervised Learning: LSA and LDA
Distributional Hypothesis: Words occurring in similar contexts tend to have similar meaning
• Latent Semantic Analysis (LSA) – Objective: Provide a Euclidean lower dimensional representation of words and documents – Essentially a SVD (or PCA) on term-document matrix (or term co-occurrence matrix)
• Latent Dirichlet Allocation (LDA) Topic Model – Objective: Infer a collection of topics (each a set of words) – Objective: assign a topic (or set of topics) to each document and word – Essentially a mixture model based clustering approach – A hierarchical generative model is proposed to explain the observed term-document matrix – Statistical inference uses EM or MCMC techniques
[Figure labels: Word and Document Vectors; Probability Distributions; Polysemy]
M. Steyvers and T. Griffiths, “Probabilistic Topic Models,” Handbook of Latent Semantic Analysis, 2007.
82 Unsupervised Language Models: GloVe
An unsupervised learning algorithm for obtaining vector representations for words
Training is performed on aggregated global word-word co-occurrence statistics from a corpus
Essentially a log-bilinear model with a weighted least-squares objective
Main intuition: ratios of word-word co-occurrence probabilities have the potential to encode meaning
Illustration: Consider a large corpus of words. We want to find the co-occurrence probabilities for words ice and steam with various words
• ice co-occurs more frequently with solid than it does with gas
• steam co-occurs more frequently with gas than it does with solid
• Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently
• The ratio of probabilities cancels less useful words like water and fashion – large values (>> 1) correlate well with properties specific to ice – small values (<< 1) correlate well with properties specific to steam
J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global Vectors for Word Representation,” EMNLP, 2014.
83 Working with Language Models
Linguistic Structure in Word Representations
Many linguistic patterns are captured in the Euclidean geometry of word representation
https://www.tensorflow.org/tutorials/representation/word2vec
Bias and Discrimination
• Word representations are based on co-occurrence frequency and will pick up naturally occurring biases in language corpora
• (In)famous examples include association of gender with occupations
• Bias Amplification: For bivariate prediction problems (e.g. joint prediction of gender and occupation) the bias in model output can be worse than the bias in model training data
• Various solutions have been proposed – Adjusting training data – Post processing raw word representations – None of the solutions “work perfectly”. Key is to be aware of possible discrimination and take it into account.
Bias in Word Embedding
https://blog.conceptnet.io/posts/2017/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/
Caliskan, A., Bryson, J. J., and Narayanan, A. "Semantics derived automatically from language corpora contain human-like biases." Science 356, no. 6334 (2017).
84 Supervised Learning: Neural Architectures for Text Classification
Recipe for Text Classification using Neural Architectures – Preprocessing: Align text preprocessing rules with rules used for the underlying word embedding to maximize vocabulary coverage. Avoid traditional methods like stemming, lemmatization, and stop-word removal. – Embedding: Represent words as embeddings using word2vec, GloVe, fastText etc. Pre-trained or custom. – Representation: Design an intermediate representation (i.e. encoding) of the document using a DNN architecture. – Training: Train the representation & label using a dense feed forward neural network with a softmax layer.
Key Architectures for representing documents – Convolutional Neural Networks (CNN): Capture local (i.e. unigrams, bigrams, trigrams etc.) dependencies in document representation using convolution and pooling. Fast to train, work well, but fail to capture longer dependencies. – Recurrent Neural Networks (RNN): Capture longer dependencies in document representation. However, vanishing gradients make the network forget long-term information. Generally expensive to train. – RNN with Long Short Term Memory (LSTM)/Gated Recurrent Units (GRU): Replace the RNN cell with an LSTM or GRU cell to preserve long term dependencies. – Bidirectional RNN with LSTM/GRU: Preserve long term dependencies and preserve contextual information in both directions by stacking two RNNs in parallel. – Auto-Encoder Networks: For longer documents, train an LSTM auto-encoder to encode sentences. Use the encoded sentences as input to the DNN to represent documents. – Attention Networks: Augment RNNs to represent documents with a focus on key parts of their input. – Hierarchical networks with attention: Attention can be applied at both word and sentence level to focus more on important content when constructing the document representation.
85 Supervised Learning: Neural Architectures for Text Classification (CNN)
Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.
86 Supervised Learning: Neural Architectures for Text Classification (RNN)
Unrolled recurrent neural network
As the sequence grows, RNNs become unable to learn to connect information.
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 87 Supervised Learning: Neural Architectures for Text Classification (RNN)
Standard RNN
LSTM GRU
Image source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 88 Supervised Learning: Neural Architectures for Text Classification (RNN)
Bidirectional RNN reads the text in forward as well as reverse fashion. Two RNNs are stacked in parallel to learn the output vector per word. These vectors are concatenated and used as input to FFNN. In practice, RNN cells are typically replaced with LSTM/GRU to model long term dependencies.
Image source: https://towardsdatascience.com/nlp-learning-series-part-3-attention-cnn-and-what-not-for-text-classification-4313930ed566
89 Interpretability: Rationales
Interpretability via providing concise evidence from input
Rationales must be: short and coherent pieces sufficient for correct prediction
Combines two modular components, generator and encoder, which are trained to operate well together
The generator specifies a distribution over text fragments as candidate rationales
The candidate rationales are passed through the encoder for prediction
[Figure labels: rationale; label]
T. Lei, “Interpretable Neural Models for Natural Language Processing,” MIT CSAIL PhD Thesis, 2017.
90 Transfer Learning
• Background: – Transfer learning is the process of training a model on a large scale dataset, and then using this pre-trained model for a downstream task – It saves a tremendous amount of computation time/power by pre-training on billions of words – Recent frameworks include ULMFit, ELMo, BERT, etc. – Take BERT as an example: o It was trained using 3.3 billion words in total, with 2.5B from Wikipedia and 0.8B from BooksCorpus o For comparison, the original ELMo model has 93.6 million parameters with 4096 LSTM hidden size and 512 output size o The training takes 50-70 days for 8 GPUs, while it was actually trained for 4 days with 16 TPUs by Google • Variations: – Feature based (generate word embedding) – Fine tuning
Custom Word Representations • When working in specialized domains (e.g. customer complaints or loan documents) general purpose word representations may not be adequate • Custom word representations built from domain specific corpora provide performance gains even when corpora may be smaller • Options include building language models from scratch or post-processing general purpose representations – Techniques for incorporating specialized sources of knowledge such as Glossaries
91 Bonus 1: Advanced Neural Architectures
- Sentence Encoding using Auto Encoders - Sequence to Sequence - Transformer
92 Auto-Encoder for Sentence Encoding
J. Li, M. T. Luong, and D. Jurafsky, “A Hierarchical Neural Autoencoder for Paragraphs and Documents”, 2015.
93 Sequence to Sequence Modeling
• Encoder Decoder Architecture • Applications: – Speech recognition (many to many) – Machine translation (many to many) • Provides sentence representations
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
94 Transformer and Attention
§ RNN and CNN: Sequential – Word position aligns with computation step § Transformer architecture: Fully Connected - input sequences are transformed simultaneously into output – Shorter path length between long range dependencies – Lower computational complexity and more parallelizable – Positional encoding – Stacking helps: Syntactic information is derived in lower layers, semantic information is derived in higher layers – Multiple Attention heads
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is All You Need. In Advances in Neural Information Processing Systems.
95 Bonus 2: More Language Models
- BERT and ELMo - Near Synonym generation
96 BERT and ELMo
Masked Language Model (or MLM): A small percentage (10-15%) of the tokens are masked for training, a.k.a. cloze deletion test • BERT: Bidirectional Encoder Representations from Transformers – BERT is a BiLM and MLM – Each token is related to its own transformer. – The BERT process is jointly conditioned on both left and right contexts for all layers – BERT is easily available (TF Hub) • ELMo: Embeddings from Language Models – ELMo is also a bidirectional language model but, unlike BERT, is not trained with a masked-LM objective – Unlike BERT, ELMo uses a bidirectional LSTM (BiLSTM) with Cross-View Training (CVT), to examine a sentence before assigning an embedding to each word – In addition, ELMo concatenates independently trained left-to-right and right-to-left LSTMs to generate features for use downstream. – ELMo is easily available (TF Hub)
Devlin, J., Chang, M., Lee, K., and Toutanova, K.: BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Google AI Language, 2018.
97 Unsupervised Language Models: Near-Synonym System (NeSS)
An unsupervised corpus-based conditional model – for finding phrasal and near synonyms – requires only a large monolingual corpus
Based on maximizing information-theoretic combinations of shared contexts
Parallelizable for large-scale processing
D. Gupta, J. Carbonell, and A. Gershman, “Unsupervised Phrasal Near-Synonym Generation from Text Corpora,” AAAI, 2015.
98 Deep Learning and Computational Graph Techniques for Derivatives Pricing and Analytics
May 13, 2019 Bernhard Hientzsch, Ph.D. , Managing Director, Head of (Markets) Model, Library, and Tools Development (M2LTD) Advanced Technologies of Modeling (AToM) Corporate Model Risk Management (CMoR) Wells Fargo & Company
Work with/by team members of M2LTD and CMoR
Outline of the Talk and Idea
• Standard Martingale Pricing approach with MC or PDE – challenges in higher dimensions and otherwise. • FBSDE combine SDEs for risk factors and for value. Given an initial value and a replicating strategy (IVRS), try to hit given final value as well as possible. IVRS satisfy this minimization & control problem. Can generate many paths to train. • Use DNNs to represent RS, TF computational graphs to simulate and solve FBSDE given RS -> DL problem. • For forward approach, objective function is how well final value replicated. TF and its optimization methods (including SGD) will give RS and IV and also value along paths. • Also will cover other approaches and applications that can be similarly expressed or solved or take advantage of this. • Pointers to examples and results and some numerical results in the presentation. • Conclusion and References.
100 Martingale Pricing
Assume we have one or several risky underliers X (potentially vector) satisfying
$$dX_t = \mu_{LVM}(t)\, X(t)\, dt + \sigma_{LVM}(t, X(t))\, X(t)\, dW^Q_t$$
under some measure (for risk-neutral, $\mu_{LVM}(t)$ will be $r(t)$ or $r(t) - q(t)$; for Black-Scholes, $\sigma_{LVM}(t, X)$ will be $\sigma_{GBM}(t)$ or a constant). Will simulate many paths for X. If looking at discounted values, assume a money market account as risk free security:
$$dB = r(t)\, B(t)\, dt$$
Only consider European exercise (i.e. payoff at maturity). According to the martingale approach, the value is:
$$u(t, x) = B(t)\, \mathbb{E}^Q\left[ g(X_T) / B(T) \mid X_t = x \right]$$
In our case, $B(t)/B(T) = \exp(-\int_t^T r(s)\, ds)$. Taken at face value, that means that for each $(t, X(t))$ there is a different expectation. Not determining some function representing $u(\cdot, \cdot)$ or $u(t, \cdot)$, just one point value at $(t, X(t))$.
101 Martingale Pricing – Monte Carlo
To approximate this by Monte-Carlo methods, simulate M independent copies (“paths”) of $X$ called $X^{(i)}$ with $X^{(i)}(t) = x$:
$$u(t, x) \approx \exp\left(-\int_t^T r(s)\, ds\right) \frac{1}{M} \sum_{i=1}^{M} g\left(X^{(i)}(T)\right)$$
Taken at face value, that means that for each $(t, X(t))$ there is a different simulation to run. Not determining some function representing $u(\cdot, \cdot)$ or $u(t, \cdot)$, just one point value at $(t, X(t))$. Could try to find an approximate representation by least-squares regression (LSM) or similar methodologies, but that requires good basis functions, good basis function underliers, and large effort and intuition. Simulation effort itself though only depends relatively weakly on the dimension d of $X$. Convergence of Monte-Carlo in general is slow at $O(M^{-0.5})$, but independent of the dimension d of $X$.
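A minimal sketch of the Monte Carlo estimator for one point value, assuming a single underlier under GBM with constant r and sigma and a European call payoff. The parameter values and function names are hypothetical, and the Black-Scholes closed form is included only as a reference for the error.

```python
import math
import numpy as np

def mc_european_call(x0, strike, r, sigma, maturity, n_paths, seed=0):
    """Monte Carlo value and standard error of a European call under GBM."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_paths)
    x_T = x0 * np.exp((r - 0.5 * sigma**2) * maturity + sigma * math.sqrt(maturity) * z)
    disc_payoff = math.exp(-r * maturity) * np.maximum(x_T - strike, 0.0)
    return disc_payoff.mean(), disc_payoff.std(ddof=1) / math.sqrt(n_paths)

def bs_call(x0, strike, r, sigma, maturity):
    """Black-Scholes closed form, used only as a reference value."""
    d1 = (math.log(x0 / strike) + (r + 0.5 * sigma**2) * maturity) / (sigma * math.sqrt(maturity))
    d2 = d1 - sigma * math.sqrt(maturity)
    cdf = lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    return x0 * cdf(d1) - strike * math.exp(-r * maturity) * cdf(d2)

price, std_err = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=200_000)
reference = bs_call(100.0, 100.0, 0.05, 0.2, 1.0)
```

The standard error shrinks like M^(-0.5): quadrupling `n_paths` roughly halves `std_err`, whatever the dimension of the underlier.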
102 Martingale Pricing – PDE
To compute u(t,x) on a grid through a Black-Scholes type PDE
$$u_t(t, x) + \mu(t)\, x\, u_x(t, x) + \frac{1}{2} \sigma^2(t, x)\, x^2\, u_{xx}(t, x) - r(t)\, u(t, x) = 0, \qquad u(T, x) = g(x)$$
Use Feynman-Kac to obtain this PDE from the definition of $u(t, x)$. For instance, use finite differences in time and asset directions, solving the PDE by time-stepping and applying difference operators (explicit) and also solving linear systems (implicit).
The computational grid for $u(t, x)$ will typically have $O(n_t n_x^d)$ points, requires $O(n_t n_x^d)$ memory, and $O(n_t n_x^{d e})$ time (e=1 for explicit, 2-3 for implicit). This is too much memory and time for large d (“curse of dimensionality”). Of course, equivalence of expectation and PDE needs to be proven (Feynman-Kac).
103 SDE/Expectation and PDE Pricing (Feynman-Kac)
More generally, the following are equivalent under appropriate conditions:
PDE
$$u_t(t, x) + \mu(t, x)\, u_x(t, x) + \frac{1}{2} \sigma^2(t, x)\, u_{xx}(t, x) - r(t, x)\, u(t, x) + f(t, x) = 0, \qquad u(T, x) = g(x)$$
Expectation
$$u(t, x) = \mathbb{E}^Q\left[ \int_t^T \exp\left(-\int_t^s r(\tau, X_\tau)\, d\tau\right) f(s, X_s)\, ds + \exp\left(-\int_t^T r(\tau, X_\tau)\, d\tau\right) g(X_T) \;\middle|\; X_t = x \right]$$
under $dX = \mu(t, X)\, dt + \sigma(t, X)\, dW^Q$
The earlier GBM/LVM setting corresponds to setting $\mu(t, X) = \mu_{LVM}(t)\, X$, $\sigma(t, X) = \sigma_{LVM}(t, X)\, X$, $r(t, x) = r(t)$, $f(t, x) = 0$.
104 Other Equations for Value ?
• Would be nice if there is a direct equation or formula for $u(t, x)$ along some simulated (or otherwise given) path for X so that values of $u(t, x)$ can be computed along all paths, and then maybe a formula/expression for $u(t, x)$ can be learned from those values. • If $u(t, x)$ is characterized by a no-arbitrage condition, then for instance the self-financing condition for the replicating strategy (assuming we know or will later determine the replicating strategy) will give such a (stochastic) equation • Alternatively, if we somehow know that $u(t, x)$ is a function of t and X(t) that satisfies all necessary conditions and a PDE (or that the discounted value is a martingale), Ito’s lemma also gives an SDE • For arbitrary instruments and replication strategies, might not know what u will depend on, so write it as a stochastic process Y(t) rather than $u(t, X(t))$
105 Self-financing Condition
Given the value Y(t) and a replication strategy Z(t) (representing the amount of each risky security scaled by vol), the self-financing condition reads
$$dY(t) = r(t)\, Y(t)\, dt + Z(t)\, dW_t$$
or with discretized time:
$$Y(t + \Delta t) - Y(t) = r(t)\, Y(t)\, \Delta t + Z(t)\, \Delta W_t$$
and the final value $Y_T = g(X_T)$ is known. That kind of SDE with given final value is typically called a “backward” SDE (BSDE). The system of SDEs for X (given initial values, determined forward) and for Y (given final values, determined “backward”) is called a forward-backward stochastic differential equation (“FBSDE”). Under appropriate conditions, it can be proven that $Y(t) = u(t, X(t))$ and $Z(t) = \sigma^T(t, X(t))\, \pi(t, X(t))$, that u will follow an (in general) nonlinear PDE, and under further conditions $\pi(t, X(t)) = u_x(t, x)$ (or, in general, $\nabla_x u(t, x)$): nonlinear Feynman-Kac. Notice that $\pi(t, X(t))$ acts as a scaling to get the random part of the Y SDE from the random part of the X SDE. Often, the BSDE is written as the negative of the above.
106 FBSDE and PDE Pricing
More generally, equivalent (“nonlinear Feynman-Kac”)
FBSDE
$$dX_t = \mu(t, X_t)\, dt + \sigma(t, X_t)\, dW_t, \qquad X_0 = x$$
$$-dY_t = f(t, X_t, Y_t, Z_t)\, dt - Z_t^T\, dW_t, \qquad Y_T = g(X_T)$$
PDE
$$u_t(t, x) + \frac{1}{2} \mathrm{Tr}\left[\sigma \sigma^T(t, x)\, \mathrm{Hess}_x u(t, x)\right] + \mu(t, x)\, \nabla u(t, x) + f\left(t, x, u(t, x), \sigma^T(t, x)\, \nabla u(t, x)\right) = 0$$
with $u(T, x) = g(x)$
For our example: $f(t, X_t, Y_t, Z_t) = -r(t)\, Y_t$, others as on the Feynman-Kac slide.
107 Using BSDE – Pressing Forward
Going back, assuming simulated copies of X(t) and recorded values of X and $\Delta W_t$ along those paths, the time-discretized BSDE looks as follows:
$$\Delta Y(t) = Y(t + \Delta t) - Y(t) = r(t)\, Y(t)\, \Delta t + \sigma(t, X(t))\, \pi(t, X(t))\, \Delta W_t$$
(Everything that is already known is colored green). Assume that a replication strategy $\pi(t_i, x)$ for each time $t_i$ is known as some parametrized function $\pi_{t_i}(X_{t_i}, \Theta_{t_i})$, such as a DNN. The SDE can then be used in either direction.
Forward (Weinan): Guessing a $Y(0)$, can compute $Y^F(t_i; Y(0), \Theta_{t_\cdot})$ and then finally $Y^F(T) = Y(T; Y(0), \Theta_{t_\cdot})$ by using the BSDE forward. To find the exact solution, need $Y^F(T; Y(0), \Theta_{t_\cdot}) = g(X_T)$. To find an approximate solution, make $\mathbb{E}[(g(X_T) - Y^F(T; Y(0), \Theta_{t_\cdot}))^2]$ as small as possible: we try to determine $Y(0)$ and $\Theta_{t_\cdot}$, and thereby the replicating strategy $\pi_{t_i}(X_{t_i}, \Theta_{t_i})$, so that the expectation is as small as possible. For a nonrandom ODE, this is called the shooting method. Possibly other norms could be used. Can use deterministic or stochastic optimization methods such as stochastic gradient descent.
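The forward objective can be evaluated in a few lines of NumPy. This sketch makes strong simplifying assumptions: one underlier under GBM with constant r and sigma, a call payoff, and the closed-form Black-Scholes delta standing in for the trained DNN strategy, so no training is performed. All function names and parameter values are hypothetical.

```python
import math
import numpy as np

def norm_cdf(v):
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

def bs_price_delta(x, strike, r, sigma, tau):
    """Black-Scholes call price and delta for time to maturity tau > 0."""
    d1 = (np.log(x / strike) + (r + 0.5 * sigma**2) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    price = x * norm_cdf(d1) - strike * math.exp(-r * tau) * norm_cdf(d2)
    return price, norm_cdf(d1)

def forward_loss(y0, strategy, x0, strike, r, sigma, maturity, n_steps, n_paths, seed=0):
    """Roll Y forward with dY = r Y dt + sigma X pi dW and return E[(g(X_T) - Y_T)^2]."""
    rng = np.random.default_rng(seed)
    dt = maturity / n_steps
    x = np.full(n_paths, x0, dtype=float)
    y = np.full(n_paths, y0, dtype=float)
    for i in range(n_steps):
        dw = rng.standard_normal(n_paths) * math.sqrt(dt)
        pi = strategy(i * dt, x)                   # amount held in the underlier
        y = y + r * y * dt + sigma * x * pi * dw   # time-discretized BSDE, forward
        x = x * np.exp((r - 0.5 * sigma**2) * dt + sigma * dw)
    payoff = np.maximum(x - strike, 0.0)           # g(X_T)
    return float(np.mean((payoff - y) ** 2))

K, R, SIG, T = 100.0, 0.05, 0.2, 1.0
delta = lambda t, x: bs_price_delta(x, K, R, SIG, T - t)[1]
y0_exact = float(bs_price_delta(100.0, K, R, SIG, T)[0])

loss_good = forward_loss(y0_exact, delta, 100.0, K, R, SIG, T, n_steps=100, n_paths=5000)
loss_bad = forward_loss(y0_exact + 2.0, delta, 100.0, K, R, SIG, T, n_steps=100, n_paths=5000)
```

With the correct initial value the loss is limited to time-discretization error, while a wrong guess for Y(0) raises it. In the approach described above, this loss would be minimized over both Y(0) and the network parameters.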
108 Let the Tensors Flow
The way $Y^F(t_i; Y(0), \Theta_{t_\cdot})$ is computed from $Y(0)$ and $\pi_{t_i}$, $X_{t_i}$, $\Theta_{t_i}$ and everything else it depends on can be expressed as a TensorFlow computation graph.
The DNN for $\pi_{t_i}(\cdot, \Theta_{t_i})$ can also be expressed as a TensorFlow graph.
Altogether, one obtains $Y^F(t_i; Y(0), \Theta_{t_\cdot})$ as a TensorFlow graph and so can use stochastic optimization methods and other algorithms implemented in TensorFlow to determine $Y(0)$ and $\pi_{t_i}(\cdot, \Theta_{t_i})$. $\pi_{t_i}(\cdot, \Theta_{t_i})$ will be the replicating portfolio amounts for the different underliers X.
$\pi_{t_i}(\cdot, \Theta_{t_i})$ could be represented by different networks for different $t_i$ or as a single network $\pi(t_i, \cdot\,; \Theta_{t_\cdot}) = \pi(\cdot, \cdot\,; \Theta_{t_\cdot})$. Those networks could use different architectures. Note that the Y/u values along each path and expressions for the gradient of u are known, which can be used for CVA/DVA (see the She and Grecu article).
109 TensorFlow as an Intermediate Representation/Language
• Various ways to run TensorFlow (TF): serial, multi-core, (multi-)GPU, and/or distributed
• IBM and NVIDIA are very interested and willing to support work in that area
• TF is a powerful intermediate representation (in the CS sense) which comes with parallelization, (A)AD, visualization, …
• For instance, if one implements MC simulation pricing in TF, one can compute greeks by adding a few lines
• Similarly, if the implied volatility surface representation is given as a TF graph, Dupire in various forms is “automatic”
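The remark about greeks coming almost for free from algorithmic differentiation can be illustrated without TensorFlow. The sketch below is a stand-in, not TF code: it uses a tiny hand-written forward-mode dual-number class (TF itself uses reverse-mode differentiation on its graph), and the `Dual` class, `mc_call` function, and all parameter values are hypothetical. The Monte Carlo pricer is written once; seeding the derivative of the spot with 1 makes the same run return the price and the pathwise delta.

```python
import math
import random


class Dual:
    """A value together with its derivative with respect to one chosen input."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)


def positive_part(a):
    """max(a, 0) with its (almost everywhere) derivative."""
    return a if a.val > 0.0 else Dual(0.0, 0.0)


def mc_call(s0, k=100.0, r=0.02, vol=0.2, T=1.0, n_paths=20000, seed=1):
    """Black-Scholes Monte Carlo call price; .der carries d(price)/d(s0)."""
    rng = random.Random(seed)
    spot = Dual(s0, 1.0)  # seed: d(spot)/d(spot) = 1
    disc = math.exp(-r * T)
    total = Dual(0.0)
    for _ in range(n_paths):
        w = rng.gauss(0.0, math.sqrt(T))
        s_T = spot * math.exp((r - 0.5 * vol * vol) * T + vol * w)
        total = total + positive_part(s_T - k) * disc
    return total * (1.0 / n_paths)


result = mc_call(100.0)
print("price:", result.val, "delta:", result.der)
```

The pricing loop contains no delta-specific code; the sensitivity comes entirely from the overloaded arithmetic, which is the same principle that makes greeks cheap to add in a framework with built-in differentiation.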
110 Using BSDE – Looking Backward
Backward (Wang et al): Starting from $Y(T) = g(X_T)$, use the BSDE backwards to compute $Y^B(t_i; \Theta_{t_\cdot})$ and then finally $Y^B(0; \Theta_{t_\cdot})$.
Assuming the instrument value under the replicating strategy is given as a function $u$ of $t$ and $X_t$ and of no other arguments (at least in a neighborhood of $t=0$ and $x=X_0$), an exact solution $Y^B(0)$ should be the same along all paths. A good approximation should minimize $\mathrm{Var}(Y^B(0))$ (i.e., the size of the range of $Y^B(0)$). Given replications of $X$, approximate replications of $Y^B$ and the mean of $Y^B(0)$, $\overline{Y^B}(0)$, can be computed for those replications. The variance of $Y^B(0)$ serves as the objective function:
$E\left[\left(Y^B(0; \Theta_{t_\cdot}) - \overline{Y^B}(0; \Theta_{t_\cdot})\right)^2\right]$
$\overline{Y^B}(0)$ would also be the desired approximation of $Y(0)$. This allows $\Theta_{t_\cdot}$ to be determined so that the expectation is as small as possible. Deterministic or stochastic optimization methods such as stochastic gradient descent can be used, just as in the forward case. This is also a TensorFlow computational graph, just as in the forward case. Of course, one needs to prove that $Y^B$ is unique and the same as the other characterizations in the limit.
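A minimal pure-Python sketch of the backward objective follows. It makes the same simplifying assumptions as any toy example would and is not the solver of Wang et al: one risk-neutral geometric Brownian motion, a call payoff, illustrative constants, and two fixed hypothetical strategies (the closed-form Black-Scholes delta and "hold nothing") in place of a trained DNN. The discretized BSDE is solved for Y(t) at each step and rolled back from the payoff, and the spread of the resulting Y^B(0) values is compared.

```python
import math
import random
import statistics

K, R, VOL, T = 100.0, 0.02, 0.2, 1.0  # illustrative parameters (assumptions)


def bs_delta(x, tau):
    """Closed-form Black-Scholes call delta, used as a stand-in strategy."""
    d1 = (math.log(x / K) + (R + 0.5 * VOL * VOL) * tau) / (VOL * math.sqrt(tau))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))


def backward_y0(strategy, n_paths=1000, n_steps=50, s0=100.0, seed=0):
    """Return mean and variance of Y^B(0) over simulated paths."""
    rng = random.Random(seed)
    dt = T / n_steps
    y0_values = []
    for _ in range(n_paths):
        x, steps = s0, []
        for i in range(n_steps):  # simulate X forward, record Z and dW
            dw = rng.gauss(0.0, math.sqrt(dt))
            steps.append((VOL * x * strategy(x, T - i * dt), dw))
            x += R * x * dt + VOL * x * dw
        y = max(x - K, 0.0)  # terminal condition Y(T) = g(X_T)
        for z, dw in reversed(steps):
            # Y(t+dt) - Y(t) = r Y(t) dt + Z dW, solved for Y(t)
            y = (y - z * dw) / (1.0 + R * dt)
        y0_values.append(y)
    return statistics.mean(y0_values), statistics.pvariance(y0_values)


mean_hedged, var_hedged = backward_y0(bs_delta)
mean_unhedged, var_unhedged = backward_y0(lambda x, tau: 0.0)
print("delta strategy:", mean_hedged, var_hedged)
print("zero strategy: ", mean_unhedged, var_unhedged)
```

A strategy close to the true replication should give a much smaller variance of Y^B(0), and the mean then serves as the price estimate; training would adjust the strategy parameters to reduce that variance.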
111 Using BSDE – Looking Backward
• Running the BSDE backwards also allows Bermudan exercise to be taken into account
• The exercise decision is made by comparing the value if not exercised (given by the BSDE) with the value if exercised (given by the exercise condition)
• Similarly, barrier options and similar instruments could be treated, since in those circumstances the values of the solution are known and can be propagated backwards.
• Of course, for exercisable instruments or barriers, it is important to determine and record on every path whether the instrument has been exercised or has touched the barrier.
• For different states of the instrument (exercised/knocked-in/…), different representations need to be trained, possibly with different BSDEs.
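The exercise comparison in the backward roll can be sketched on a single path as follows. This is a simplified illustration with hypothetical names and inputs: it compares the pathwise rolled-back value with the exercise value, whereas a practical scheme would compare against the learned continuation value, and it assumes a constant rate r and step size dt.

```python
def roll_back_with_exercise(terminal_value, steps, exercise_values, r, dt):
    """Roll the discretized BSDE backward along one path with early exercise.

    steps           -- list of (z, dw) pairs, one per time step, in forward order
    exercise_values -- value if exercised at the start of each step,
                       or None where exercise is not allowed
    Returns the value at time 0 and the earliest step at which exercise
    was preferred (None if never).
    """
    y = terminal_value
    exercised_at = None
    for i in range(len(steps) - 1, -1, -1):
        z, dw = steps[i]
        # Value if not exercised, from the BSDE step solved for Y(t)
        y = (y - z * dw) / (1.0 + r * dt)
        exercise = exercise_values[i]
        if exercise is not None and exercise > y:
            y = exercise          # value if exercised is larger
            exercised_at = i      # record the state change on this path
    return y, exercised_at


# Toy usage with zero rate and zero hedge, purely to show the mechanics
value, when = roll_back_with_exercise(
    terminal_value=1.0,
    steps=[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
    exercise_values=[None, 3.0, 2.0],
    r=0.0,
    dt=0.5,
)
print(value, when)
```

Recording `exercised_at` corresponds to tracking, on every path, whether and when the instrument changed state.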
112 Other Formulations/Approaches - BSDE
• Instead of the self-financing condition or replicating portfolio set-up, one can use a specification of the underliers/risk factors under some consistent measure $Q$ and try to approximate (for some computable numeraire $N$)
  $u(t, x) = E^Q\left[g(X_T) \mid X_t = x\right]$
  $u(t, x) = N(t)\,E^Q\left[g(X_T)/N(T) \mid X_t = x\right]$
• Given this functional form and some assumptions, Ito’s lemma will give a BSDE for $Y(t) = u(t, X(t))$. Use that BSDE similarly to the other BSDE.
• Assume more realistic assumptions for the replicating portfolio, such as different rates for borrowing and lending: the self-financing condition will change to give a BSDE with the f term
  $f(t, x, y, z) = -r^l\,y + (r^b - r^l)\,[z^T \mathbf{1} - y]^+$
• Similarly, one can handle other FBSDEs for XVA etc. (Weinan and others have many examples).
• If underliers or instruments have dividends/fees, appropriate changes are made in the FSDE or BSDE
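The driver with different borrowing and lending rates can be written as a small function. This is a sketch assuming the driver has the form f = -r_lend * y + (r_borrow - r_lend) * max(sum(z) - y, 0), with z holding the (scaled) amounts in each risky security; the function and argument names are hypothetical.

```python
def driver_diff_rates(y, z, r_lend, r_borrow):
    """f term of the BSDE when borrowing and lending rates differ.

    y        -- current portfolio value
    z        -- sequence of (scaled) amounts held in each risky security
    r_lend   -- rate earned on cash lent
    r_borrow -- rate paid on cash borrowed
    """
    # Cash that must be borrowed when risky holdings exceed the portfolio value
    borrowed = max(sum(z) - y, 0.0)
    return -r_lend * y + (r_borrow - r_lend) * borrowed


# With equal rates the extra term vanishes and the driver reduces to -r * y
print(driver_diff_rates(10.0, [4.0, 3.0], 0.02, 0.02))
# With a higher borrowing rate and borrowed cash, the driver picks up the spread
print(driver_diff_rates(10.0, [9.0, 6.0], 0.02, 0.05))
```

The positive-part term is what makes the resulting PDE nonlinear: it switches on only in states where the replicating portfolio needs to borrow.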
113 Want to Know the Solution?
• We want to take advantage of $\pi(t, X(t)) = u_x(t, x)$ and determine some approximation for $u(t, X(t))$ directly (instead of point values of Y)
• This means the BSDE is no longer used to determine point values forward or backward but is used as a constraint to evaluate how well the current guess for the form of $u$ satisfies the BSDE, for instance with the following term in the loss function ($Y_n^m = u(t^n, X_n^m)$, $Z_n^m = \nabla u(t^n, X_n^m)$, and $\Phi_n^m$ and $\Sigma_n^m$ are the $f$ and $\sigma$ terms in the FBSDE):
  $\sum_{m=1}^{M} \sum_{n=0}^{N-1} \left| Y_{n+1}^m - Y_n^m - \Phi_n^m\,\Delta t^n - (Z_n^m)^T\,\Sigma_n^m\,\Delta W_n^m \right|^2$
• One could use it in a step-wise, rolling-back fashion to determine “slices” of u going backwards, or use it globally to judge how well the current global guess satisfies the BSDE (Raissi’s FBSNN)
• For a replicating portfolio, delta-hedging is not necessarily optimal for long times between hedging times, so it stands to reason that this approach will work better for smaller time-step sizes.
• Once the solution is known, it can be used for CVA/DVA/PFE/DIMM, collateralized CVA/DVA/..
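The residual term can be evaluated for any candidate function u. The sketch below uses a deliberately trivial test problem chosen for illustration, not one of the examples in the cited papers: X is a one-dimensional Brownian motion (mu = 0, sigma = 1) and the f term is zero, so any solution must make Y a martingale. Function names are hypothetical, and the candidate and its gradient are supplied as plain Python callables rather than a neural network.

```python
import math
import random


def bsde_residual_loss(u, grad_u, n_paths=200, n_steps=20, T=1.0, x0=0.0, seed=0):
    """Sum over paths m and steps n of
    |Y_{n+1} - Y_n - Phi * dt - Z * Sigma * dW|^2
    for dX = dW (Sigma = 1) and Phi = 0."""
    rng = random.Random(seed)
    dt = T / n_steps
    phi, sigma = 0.0, 1.0
    loss = 0.0
    for _ in range(n_paths):
        x = x0
        for n in range(n_steps):
            t = n * dt
            dw = rng.gauss(0.0, math.sqrt(dt))
            x_next = x + sigma * dw
            y, z = u(t, x), grad_u(t, x)     # Y_n = u(t_n, X_n), Z_n = grad u
            y_next = u(t + dt, x_next)       # Y_{n+1}
            loss += (y_next - y - phi * dt - z * sigma * dw) ** 2
            x = x_next
    return loss


# u(t, x) = x satisfies the BSDE exactly; u(t, x) = x^2 does not
loss_linear = bsde_residual_loss(lambda t, x: x, lambda t, x: 1.0)
loss_square = bsde_residual_loss(lambda t, x: x * x, lambda t, x: 2.0 * x)
print(loss_linear, loss_square)
```

In the FBSNN setting u would be a network and its gradient would come from automatic differentiation; a terminal-condition penalty is normally added to the loss as well.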
114 Other Instruments
• For instance, for barrier options or Asian options, there is some relatively “weak” path-dependency
• The standard approach is to extend the state space (add elements to X) so as to make X Markovian
• It is an open question exactly how far this can be treated with the standard FBSDE approach; we are currently working on some areas. In general, the volatility matrix of that extended X is degenerate or non-square and the drift is not differentiable/continuous
• For some cases (Black-Scholes with constant parameters), the problem can be rewritten as a final value problem with barrier breach probability by the Brownian bridge approach (Bing Yu et al) and then treated as before
115 Quantitative Finance Examples in Raissi/E
• BS with default and/or credit risk (1st order, E)
• BS with differential rates – different rates for borrowing and lending (1st order, E)
• BSB (Black-Scholes-Barenblatt) related to uncertain volatility model (2nd order, E)
• BSB (R)
• Also some synthetic test examples with explicit solution
• In high dimensions – up to 100
• Their examples only show uncorrelated underliers
116 Some of our Extensions
• Implemented Local Volatility Model and Heston Model in TensorFlow
• Implemented some interest rate models
• Implemented correlated cases and various payoffs
• Implementing geometric combination of time-dependent geometric Brownian Motion as test case
• Tests of approaches against each other and against established approaches (MC, PDE) for lower-dimensional problems
• Using learned solution for other analytics (XVA, …)
• Extensions to extended state spaces etc.
117 Other Quantitative Finance Examples in CMoR
• LMM for caps/Europeans – Wang et al
• LMM for Bermudans – Wang et al
• CVA/DVA for forward and backward approach for BSDE – She et al
• Barriers – Yu et al
118 Example with Explicit Analytical Solution (E)
d=15, N=5         Weinan approach            Raissi approach
X_0               0          1               0          1
Loss              6.80E-05   4.89E-07        6.40E-03   3.28E-04
Y_0 (learned)     0.503      0.999945        0.49705    0.99985
Y_0 (exact)       0.5        0.999999694     0.5        0.999999694
Relative Error    0.596%     0.005%          0.594%     0.015%

d=15, N=50        Weinan approach            Raissi approach
X_0               0          1               0          1
Loss              6.80E-05   4.15E-05        8.32E-04   6.20E-04
Y_0 (learned)     0.503337   1.00004         0.49987    0.99938
Y_0 (exact)       0.5        0.999999694     0.5        0.999999694
Relative Error    0.663%     0.004%          0.026%     0.062%
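The relative-error rows can be reproduced from the two value rows. A small check, assuming that the first value row is the learned initial value, the second is the exact one, and that relative error means |learned - exact| / |exact| (the slide does not state the definition); the function name is hypothetical, and small differences from the table are expected where the displayed values are rounded.

```python
def relative_error_pct(learned, exact):
    """Relative error in percent: 100 * |learned - exact| / |exact|."""
    return 100.0 * abs(learned - exact) / abs(exact)


# Values taken from the N=5 and N=50 tables (second and fourth columns)
pairs = [
    (0.999945, 0.999999694),
    (0.99985, 0.999999694),
    (1.00004, 0.999999694),
    (0.99938, 0.999999694),
]
for learned, exact in pairs:
    print(f"{learned}: {relative_error_pct(learned, exact):.3f}%")
```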
119 Heston - Different ! and # Time Steps (N)
120 Heston – # Time Steps (N)
121 Heston – # of SDE Simulations (M)
122 Heston – Learned vs Exact (FBSNN)
123 Conclusion
• The deep FBSDE approach consists of:
  – changing the solution characterization to an FBSDE
  – formulating the pricing problem as a minimization of replication accuracy (or minimum spread of initial values) given a replicating strategy
  – representing the replicating strategy by a DNN
  – determining the replication strategy by DL (TensorFlow/PyTorch/…)
  – computing the initial value and the value along each path given the optimized replicating strategy by DL/TensorFlow
• It is a powerful new approach for solving high-dimensional pricing problems
• This approach can be extended to other problems and settings (such as LMM, Bermudan Options, Barriers) and its results can be used for other analytics (XVA etc.)
124 References I
• E, W., Han, J., & Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4), 349-380. arXiv preprint arXiv:1706.04702.
• Beck, C., E, W., & Jentzen, A. (2017). Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. arXiv preprint arXiv:1709.05963.
• Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017). Physics Informed Deep Learning (Part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561.
• Raissi, M. (2018). Forward-Backward Stochastic Neural Networks: Deep Learning of High-dimensional Partial Differential Equations. arXiv preprint arXiv:1804.07010.
125 References II
• Wang, H., Chen, H., Sudjianto, A., Liu, R., & Shen, Q. (2018). Deep Learning-Based BSDE Solver for Libor Market Model with Application to Bermudan Swaption Pricing and Hedging. arXiv preprint arXiv:1807.06622.
• She, J.-H., & Grecu, D. (2018). Neural Network for CVA: Learning Future Values. arXiv preprint arXiv:1811.08726.
• Yu, B., & Xing, X. (2019). Deep Learning Based Numerical BSDE Method for Barrier Options. CMoR internal whitepaper.
126 Deep insights into interpretability of machine learning algorithms and applications to risk management
May 13, 2019 Jie Chen, Ph.D. MD, Head of Statistics and Machine Learning, Corporate Model Risk
© 2019 Wells Fargo Bank, N.A. All rights reserved. Internal use.
Interpreting machine learning models
• Machine learning gives very good predictive performance
• But the biggest criticism of machine learning algorithms is their interpretability: the predictor $\hat{f}(x)$ is a ‘black box’ that is hard to interpret
• This is true of all ensemble methods, SVMs, and neural networks
• We need to understand the internals of a machine learning algorithm:
  – Required by regulation
  – Get insights from the model and make scientific/business findings
• Some main questions to answer are
  – Which variables are important?
  – What does the input-output relationship look like for each important variable or a subset of important variables? Nonlinearity? Interaction?
  – How do correlations among variables impact the response surface?
  – How can we ensure the relationships from ML are consistent with historical and business understanding?
• Machine learning interpretation is an active research area now.
128 Approaches for Interpreting machine learning models
• Diagnostic tools
  – Variable importance
    o Local importance
    o Global importance
  – Effects of inputs to outputs
    o 1D PDP
    o 2D PDP and H-statistics for interactions
    o ICE plot and ICE ANOVA
    o Derivative based diagnostic tools
• Model distillation:
  – Global surrogate tree
  – KLIME
  – LIME-SUP
• Structured interpretable model: explainable neural network
129 Model Explainability Approaches
• Derivative-Based Approach and Variance Analysis – Global Diagnostics: Effects of Inputs to Outputs, Impact of correlations
  – Liu, Chen, Vaughan, Nair, Sudjianto (2018), Model Interpretation: A Unified Derivative-based Framework for Nonparametric Regression and Supervised Machine Learning, arXiv:1808.07216
• Locally Interpretable Model – Local diagnostics and Model Distillation
  – Hu, Chen, Nair, Sudjianto (2018), Locally Interpretable Models and Effects based on Supervised Partitioning (LIME-SUP), arXiv:1806.00663
• Explainable Neural Networks – Structured-Interpretable Model
  – Vaughan, Sudjianto, Brahimi, Chen, Nair (2018), Explainable Neural Networks based on Additive Index Models, arXiv:1806.01933
130 Global and local diagnostics
• Global interpretation is aimed at interpreting the overall relationship between input and output over the entire space.
• Local interpretation is aimed at interpreting the relationship between input and output over a local region, with the idea that
  – a simple parametric model may be used to approximate the input-output relationship
  – local variable importance and input-output relationships are easily interpretable from the simple local model.
131 A real data example—home lending case
• This dataset is based on a retired home lending residential mortgage model.
• We used a randomly selected subset of 1 million observations, divided into training, validation and testing sets.
• The response is an indicator variable indicating whether the loan is in trouble; there are 7 raw explanatory variables, listed in the table below.

Variable        Explanation
fico0           FICO at snapshot
ltv_fcast       LTV forecasted
dlq_new         delinquency status, 1 if clean and 0 otherwise
unemprt         unemployment rate
totpersincyy    total personal income year to year ratio
h               horizon 1, 2, …, 9 quarters
premod_ind      indicator before recession Q2 2007
132 Diagnostic tools
133 Local importance
• Describes how an individual observation’s attributes affect the model prediction for that observation.
• Important for providing reason codes for credit decisions
• Approaches
  – LIME (Local Interpretable Model-Agnostic Explanations)
  – KLIME
  – LIME-SUP
  – LOCO (leave one covariate out)
  – SHAP explanation
  – Tree interpreter
  – Quantitative input influence (QII)
  – Integrated gradients
  – DeepLIFT
  – Layer-wise Relevance Propagation (LRP)
  – Derivative based sensitivity analysis
134 LIME
• LIME (Local Interpretable Model-Agnostic Explanations) is perhaps the first local interpretation method, proposed in Ribeiro et al. (2016).
• The idea is to approximate the model around a given instance/observation in order to explain the prediction:
  – Simulate new instances
  – Predict on the new instances using the machine learning model
  – Pick a kernel and fit a linear model using the kernel as weight; penalize the complexity of the linear model, for example by fitting ridge regression.
• Available in python (lime package) and R (lime package)
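The three steps (simulate, predict, fit a kernel-weighted penalized linear model) can be sketched in plain Python. This is an illustrative re-implementation of the idea and not the API of the lime package: the function names, the Gaussian sampling and kernel with a shared width, the ridge penalty value, and the toy black-box model are all assumptions.

```python
import math
import random


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            fac = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= fac * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def lime_explain(model, x0, n_samples=2000, width=0.3, ridge=1e-3, seed=0):
    """Return [intercept, slope_1, ..., slope_d] of a local weighted ridge fit."""
    rng = random.Random(seed)
    d = len(x0)
    a = [[0.0] * (d + 1) for _ in range(d + 1)]  # weighted normal equations
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, width) for _ in range(d)]    # 1. simulate
        y = model([xi + di for xi, di in zip(x0, delta)])     # 2. predict
        w = math.exp(-sum(di * di for di in delta) / (2.0 * width * width))
        feats = [1.0] + delta                                 # centered at x0
        for i in range(d + 1):                                # 3. weighted fit
            b[i] += w * feats[i] * y
            for j in range(d + 1):
                a[i][j] += w * feats[i] * feats[j]
    for i in range(1, d + 1):
        a[i][i] += ridge  # ridge penalty on slopes only
    return solve_linear(a, b)


def black_box(x):
    """Toy stand-in for a fitted machine learning model."""
    return x[0] ** 2 + 3.0 * x[1]


coef = lime_explain(black_box, [1.0, 0.0])
print("intercept and local slopes:", coef)
```

Near the point (1, 0) the toy model has gradient (2, 3), so the fitted slopes should be close to those values; they are the local importances LIME would report for this observation.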
135 Global importance
• Measures the overall impact of an input feature on the model predictions
• Important for variable selection
• Approaches
  – tree-based importance (e.g. relative influence)
  – permutation test based importance
  – Sobol’ indices global sensitivity analysis
  – ANOVA decomposition based on ICE plots
  – derivative-based importance
  – Shapley effects
  – …
136 Permutation test and tree based Importance
• Permutation based importance
  – Randomly permute the corresponding column in the data set while keeping other columns unchanged
  – Compute the decrease in prediction performance as the measure of importance.
• Tree based importance for XGBoost
  – For a single tree, compute the importance of a variable $x_j$ by the total reduction of impurity at nodes where $x_j$ is used as a splitting variable.
  – For ensemble methods like random forest or GBM, the importance of $x_j$ is summed or averaged over all trees.
ltv_fcast and fico0 are the most important variables
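Permutation based importance takes only a few lines. The sketch below uses a made-up two-feature model and synthetic data rather than the home lending model; the function names, the choice of mean squared error as the performance measure, and the data are assumptions for illustration.

```python
import random


def mse(model, rows, ys):
    """Mean squared error of the model on the given rows."""
    return sum((model(r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)


def permutation_importance(model, rows, ys, col, seed=0):
    """Increase in MSE after randomly permuting one column."""
    rng = random.Random(seed)
    baseline = mse(model, rows, ys)
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)  # permute this column, keep the others unchanged
    permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, ys) - baseline


def toy_model(row):
    """Stand-in for a fitted model: depends strongly on column 0, weakly on 1."""
    return 3.0 * row[0] + 0.1 * row[1]


data_rng = random.Random(42)
rows = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(500)]
ys = [toy_model(r) for r in rows]

importances = [permutation_importance(toy_model, rows, ys, c) for c in range(2)]
print("importance per column:", importances)
```

The column the model relies on most shows the largest loss in performance when permuted. With correlated inputs, permuting one column creates unrealistic combinations, which is one reason the other global measures listed earlier are also used.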
137 Global sensitivity analysis