Managing Machine Learning Model Risk
Managing Machine Learning Model Risk
May 13, 2019
Agus Sudjianto, Harsh Singhal and Jie Chen
© 2019 Wells Fargo Bank, N.A. All rights reserved. For public use.

Master Class Agenda
• Introduction (15 minutes) – Agus
• Overview of Machine Learning
– Ensemble Model Methodology and Examples: Random Forest and GBM (60 minutes) – Jie
– Deep Learning Methodology and Examples: Feedforward, Recurrent, and Generative Adversarial Network (60 minutes) – Jie
• Machine Learning Interpretability (90 minutes) – Jie
– Post-hoc methodology
– Model distillation
• Structured-Interpretable Models – Agus
• Validation of Machine Learning Models (90 minutes) – Harsh
– Inputs/Data: bias and privacy test
– Model specification: interpretability
– Performance: fairness and performance testing
– Model monitoring and change control
– Fail safe and disclosure
• Natural Language Processing (45 minutes) – Harsh
– Language models
– Neural architecture
Optional Lunch Time Bonus: Deep Learning Techniques for Derivatives Pricing – Bernhard

Machine Learning Methodology: Ensemble Model Methodology and Examples
May 13, 2019
Jie Chen, Ph.D.
MD, Head of Statistics and Machine Learning, Corporate Model Risk

Outline
• Statistics vs Machine Learning
• Introduction to machine learning
– Supervised learning
– Unsupervised learning
– Semi-supervised learning
– Reinforcement learning
• Decision Tree and CART
• Ensemble algorithms
– Bagging
– Random forest
– Boosting
• Probability Calibration
• Classification Example

Statistics vs ML
• Leo Breiman: two modelling paradigms, the data model and the algorithmic model
– Breiman (2001), "Statistical Modeling: The Two Cultures," Statistical Science
• Traditional Statistics (data model)
– View: data are generated by some underlying parametric model; the goal is inference and interpretation of the model
– Extensive interaction between data and data analyst
o Summary, visualization, identification of outliers, shapes of distributions, transformation, …
– Parameter estimation, testing, confidence intervals, asymptotic theory → based on model assumptions and theory
– Dimensionality is a curse → variable selection
– Model validation: goodness-of-fit tests, residual diagnostics
– Tailored for small data sets with a small number of variables, structured data
– Driven by statisticians
• Criticism
– A simple parametric model is imposed on data generated by a complex system; the information obtained may be questionable.
– Omnibus goodness-of-fit tests that test in many directions have low power and will not reject until the lack of fit is large.
– Feature engineering has to be done manually, which involves a lot of hand crafting and is impractical for a large number of variables.

Statistics vs ML
• Machine Learning (algorithmic model)
– View: the data mechanism is unknown and there is no intrinsic interest in the data generation process. The goal is to get the most accurate model, however complicated.
– Very little direct interaction with the data
– Emphasis on better algorithms, speed, efficiency of computing, parameter tuning
o Data mining – exploratory data analysis on steroids
o Neural networks, boosting algorithms, etc.
– Algorithms are black boxes → hard to interpret
– Dimensionality is a blessing → variable selection is not needed; feature creation is encouraged (SVM)
– Model validation: check prediction accuracy on a testing set
– Tailored for large data sets, with a large number of variables, unstructured data
– Driven by computer scientists, engineers, and a few statisticians
• Criticism
– Lack of interpretability.

Statistics vs ML
• Michael Jordan: the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics.
• The distinction is blurring …
• Some statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning.
• Data Science has emerged as an alternative term to combine both fields … but includes DBM and computing.

Machine Learning vs Artificial Intelligence (wiki and other sources)
§ Machine Learning:
– Term coined by Arthur Samuel (IBM) in 1959 – gives "computers the ability to learn without being explicitly programmed"
– Study and construction of algorithms that can learn from data, summarize features, recognize patterns, make predictions, and take actions …
– Related to statistics (`computational statistics') but with different paradigms
– A key pathway to AI
§ Artificial Intelligence: concerned with making computers behave like humans
– Term coined in 1956 by John McCarthy (MIT) – the study of "intelligent agents": devices that perceive the environment and take actions that maximize their chance of success at some goal.
– Long history: formal reasoning in philosophy, logic, …
– Resurgence of AI techniques in the last decade: advances in computing power, computing and data architectures, sizes of training data, and theoretical understanding
– Deep learning neural networks: at the core of recent advancements in AI, specifically for certain classes of ML tasks (reinforcement learning and representation learning)
– Applications:
• Pattern recognition: speech (Siri), image (DeepFace), handwriting, …
• Autonomous systems: drones, self-driving cars
• Recommender systems, drug discovery, marketing, …

Machine Learning: Tasks and Techniques
• Tasks:
• Supervised learning:
– Regression and classification
• Unsupervised learning:
– Discover underlying structure
– Dimension reduction, clustering, …
• Semi-supervised learning
• Reinforcement learning:
– Identifying how to make good decisions from context: observe, learn, and optimize
– Deep reinforcement learning
• Representation learning:
– Feature selection and engineering

Supervised Machine Learning
§ Supervised learning means the desired outcome is known, i.e., the response variable is given.
§ Learning is supervised by the response: minimize the error between the prediction and the response.
§ Algorithms that fall under this category:
– K-nearest neighbors
– LASSO, Elastic Net
– Support vector machines
– Decision trees
– Ensemble methods
– Neural networks
• Artificial feed-forward NN
• More complex NN for DL

Supervised Machine Learning
§ Machine learning algorithms usually come with hyper-parameters that control the complexity of the algorithm.
– For example, trees have depth, number of terminal nodes, etc., to define the tree structure.
– Neural networks have number of layers, number of neurons per layer, activation function, etc., to define the network structure.
§ Complexity is related to the bias-variance trade-off. Prediction error can be decomposed into bias and variance.

Bias and variance trade-off
§ Bias: f(x) − E[f̂(x)].
Simpler models have larger bias, and vice versa.
§ Variance: Var(f̂(x)). Simpler models have smaller variance, and vice versa.
§ The best model is the one that achieves a good balance between bias and variance → hyper-parameter tuning

Supervised Machine Learning: Tuning
§ Hyper-parameter tuning is finding the hyper-parameters that give the most accurate machine learning algorithm. It is key to the success of machine learning algorithms.
§ A simple model structure or a small data set requires a less complicated algorithm; a more complicated model structure with large data requires a more complicated algorithm. So the hyper-parameters are data dependent, and they need to be tuned to get the best model.
§ Tuning involves a search routine and an evaluation routine. For each hyper-parameter setting, fit the model and evaluate the model performance; use the search routine to find the hyper-parameter setting/model that optimizes the model performance.

Supervised Machine Learning: Tuning
§ Search routine; some popular ones are:
– Grid search: define a grid of parameters and search this entire grid.
– Randomized search: randomly select parameters from a distribution to search.
– Bayesian hyper-parameter optimization: model the prediction performance as a Gaussian process.
§ Evaluation routine. The model performance is measured by:
– Continuous response: mean squared error
– Categorical response: AUC/Gini (binary response), error rate, log-loss
§ It is well known that a model that minimizes the loss/error on the training data is likely to overfit. To avoid this, the performance is measured on a separate validation data set, or using cross-validation.
§ Cross-validation. The typical K-fold cross-validation works as follows:
1. Randomly divide the data into K folds. Stratification may be needed for imbalanced data.
2. For each i = 1, …, K:
a. Leave the ith fold out; build a model using the remaining K−1 folds.
b. Predict on the ith fold.
3.
After obtaining the cross-validation predictions for the entire data, compute the loss/error. This is the cross-validation model performance.
§ Since both the training data and the validation data are used in construction of the best model, the model performance has to be evaluated on a separate test set.

Unsupervised learning
§ Unsupervised learning means there is no response. The observations are unlabeled.
§ It is used for clustering, dimension reduction, anomaly detection, etc.
§ Algorithms that fall under this category:
– Clustering
• K-means
• Hierarchical clustering
• Mixture models
– Visualization and dimensionality reduction
• PCA
• Kernel PCA
• Locally linear embedding
• t-distributed stochastic neighbor embedding (t-SNE)
– Association rule learning

Semi-supervised learning
§ Sometimes it is very expensive or hard to obtain labels, so only part of the data are labeled.
– Unlabeled data
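The tuning loop on the preceding slides (a grid search whose evaluation routine is K-fold cross-validation) can be sketched in plain Python. This is a minimal illustration, not anything from the slides: the one-parameter "shrunken mean" model and the names `k_fold_indices`, `cv_mse`, `grid_search`, and `lam` are all hypothetical stand-ins for a real learner and its hyper-parameters.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly divide indices 0..n-1 into k folds (step 1 of the procedure)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_mse(y, lam, k=5):
    """K-fold cross-validation error for a toy 'shrunken mean' model.

    For each fold: fit on the other K-1 folds and predict on the held-out
    fold (step 2); then pool all held-out predictions and compute the
    mean squared error (step 3)."""
    folds = k_fold_indices(len(y), k)
    sq_err = []
    for fold in folds:
        train = [j for j in range(len(y)) if j not in fold]
        # "Fitting" the toy model: its single prediction is the training
        # mean shrunk toward 0, with hyper-parameter lam as the shrinkage.
        pred = sum(y[j] for j in train) / (len(train) + lam)
        sq_err += [(y[j] - pred) ** 2 for j in fold]
    return sum(sq_err) / len(sq_err)

def grid_search(y, grid):
    """Search routine: evaluate every candidate lam, keep the lowest CV MSE."""
    return min(grid, key=lambda lam: cv_mse(y, lam))

# Toy data: noisy observations around 3.0
random.seed(1)
y = [3.0 + random.gauss(0, 0.5) for _ in range(100)]
best = grid_search(y, grid=[0.0, 1.0, 10.0, 100.0])
```

As the slides note, the winning `lam` would still need to be confirmed on a separate test set, since both the training folds and the held-out folds were used to select it.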