
DEMYSTIFYING RANDOM FORESTS
Antoni Dzieciolowski, SAS Canada

Copyright © 2016, SAS Institute Inc. All rights reserved.

RANDOM FOREST MOTIVATION

“With excellent performance on all eight metrics, calibrated boosted trees were the best learning algorithm overall. Random forests are a close second, followed by uncalibrated bagged trees, calibrated SVMs, and uncalibrated neural nets.”

Rich Caruana, Alexandru Niculescu-Mizil. An Empirical Comparison of Supervised Learning Algorithms. ICML 2006

DECISION TREE DEFINITION

A decision tree is a schematic, tree-shaped diagram used to determine a course of action or to show a statistical probability. Each branch of the decision tree represents a possible decision, occurrence, or reaction. The tree is structured to show how and why one choice may lead to the next, with the branches indicating that each option is mutually exclusive.
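As a minimal illustration (the loan-approval rules here are hypothetical, not from the slides), a decision tree is just nested, mutually exclusive branches:

```python
def decide(income, debt):
    """A tiny hand-coded decision tree: each branch is mutually exclusive."""
    if income > 50_000:
        if debt < 10_000:
            return "approve"
        return "review"       # high income but high debt
    return "decline"          # low income

print(decide(60_000, 5_000))   # approve
print(decide(60_000, 20_000))  # review
print(decide(30_000, 0))       # decline
```

Fitting a tree to data amounts to choosing these split variables and thresholds automatically, which is what the splitting criteria on the following slides are for.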

DECISION TREE DEFINITION

[Figure: example decision tree splitting on X1 = 2]

DECISION TREE BINARY SPLIT EXAMPLE

Splitting criteria:
• Information Gain
• Variance
• Gini Index (binary only)
• Chi-Square
• Etc.

Julie Grisanti, Decision Trees: An Overview, http://www.aunalytics.com/decision-trees-an-overview/
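As a concrete example of one criterion from the list above, Gini impurity is 1 minus the sum of squared class proportions in a node; a split is chosen to make the daughter nodes as pure as possible (this sketch is illustrative, not from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A 50/50 two-class node is maximally impure:
print(gini([0, 0, 1, 1]))  # 0.5
# A pure node has zero impurity:
print(gini([1, 1, 1, 1]))  # 0.0
```

A binary split is scored by the weighted impurity of the two daughter nodes; the variable/threshold pair with the lowest weighted impurity wins.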

RANDOM FORESTS

RANDOM FOREST: LEO BREIMAN

• Responsible in part for bridging the gap between statistics and computer science in machine learning.

• Contributed foundational work on classification and regression trees and on ensembles of trees fit to bootstrap samples (bagging).

• Focused on computationally intensive multivariate analysis, especially the use of nonlinear methods for pattern recognition and prediction in high-dimensional spaces.

• Developed decision-tree ensembles (random forests) as computationally efficient alternatives to neural nets.

Leo Breiman, 1928–2005

https://www.stat.berkeley.edu/~breiman/

WHAT IS A RANDOM FOREST?

“Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.”

Breiman, Leo. Random Forests. Statistics Department, University of California, Berkeley, 2001.

RANDOM FOREST

D = ((x1, y1), …, (xN, yN)) (observed data points)
m < M features (variables)

Algorithm: Random Forest for Regression or Classification

1. For b = 1 to B (construct B trees):
   (a) Draw a bootstrap sample Db of size N from the training data D.
   (b) Grow a random-forest tree Tb on the bootstrapped data by recursively repeating the following steps for each leaf node of the tree, until the minimum node size nmin is reached:
      i. Select m variables at random from the M variables.
      ii. Pick the best variable/split-point among the m.
      iii. Split the node into two daughter nodes.
2. Output the ensemble of trees {Tb}, b = 1, …, B.

[Hastie, Tibshirani, Friedman. The Elements of Statistical Learning]
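The procedure above can be sketched in Python, using scikit-learn's DecisionTreeClassifier to handle the per-node split search (steps i–iii); function names and defaults here are illustrative, not the slides' code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=25, m=None, n_min=1, seed=0):
    """Fit B trees, each on a bootstrap sample of size N,
    trying m random features at every split."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    m = m or max(1, int(np.sqrt(M)))  # common default for classification
    trees = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)  # bootstrap sample D_b of size N
        tree = DecisionTreeClassifier(
            max_features=m,               # step i: m variables per split
            min_samples_leaf=n_min,       # grow until minimum node size
            random_state=int(rng.integers(1 << 31)),
        )
        trees.append(tree.fit(X[idx], y[idx]))  # steps ii-iii done internally
    return trees

def predict(trees, X):
    """Majority vote over the ensemble {T_b}."""
    votes = np.stack([t.predict(X) for t in trees])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

For regression, the same loop would use a regression tree and average the B predictions instead of voting.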

VISUALIZATION OF BAGGING

HOW TO BUILD A RANDOM TREE (BOOTSTRAPPING)

Data space (inputs):

        Feat 1  Feat 2  Feat 3  …  Feat M
Obs 1   2       3       5       …  3
Obs 2   6       1       4       …  4
Obs 3   3       5       9       …  5
Obs 4   5       7       8       …  8
Obs 5   0       8       2       …  2
…
Obs N   7       1       3       …  5

Response space (outputs):

        Target
Obs 1   0
Obs 2   1
Obs 3   1
Obs 4   0
Obs 5   0
…
Obs N   1

Pick m features from M (e.g., Feat 1 and Feat 3) and n observations from N at random.
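The random picking step can be sketched with NumPy (array sizes here are illustrative): features are drawn without replacement, observations with replacement (the bootstrap):

```python
import numpy as np

rng = np.random.default_rng(42)
N, M = 6, 4                     # observations, features
X = rng.integers(0, 10, size=(N, M))
m, n = 2, N                     # features per tree, bootstrap sample size

feat_idx = rng.choice(M, size=m, replace=False)  # m of M features, no repeats
obs_idx = rng.choice(N, size=n, replace=True)    # bootstrap: with replacement
X_boot = X[np.ix_(obs_idx, feat_idx)]            # the tree's training matrix
print(X_boot.shape)  # (6, 2)
```

Because the bootstrap samples with replacement, some observations appear more than once and roughly a third are left out; those left-out rows are the "out-of-bag" cases used later for error estimation.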

BAGGING OR BOOTSTRAP AGGREGATION

Average many noisy but approximately unbiased models to reduce the variance of the estimated prediction function.

[Hastie, Tibshirani, Friedman. The Elements of Statistical Learning]
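A quick numerical illustration of why averaging reduces variance (independent noisy estimates; the numbers are illustrative, not from the slides): the mean of B independent estimates with variance σ² has variance σ²/B.

```python
import numpy as np

rng = np.random.default_rng(0)
truth, B = 3.0, 100

# each "model" is an unbiased but noisy estimate of the truth
single = truth + rng.normal(0, 1, size=10_000)
bagged = truth + rng.normal(0, 1, size=(10_000, B)).mean(axis=1)

print(single.var())  # close to 1.0
print(bagged.var())  # close to 1/B = 0.01
```

Real trees grown on overlapping bootstrap samples are correlated, so the reduction is smaller than 1/B; sampling only m of the M features at each split exists precisely to decorrelate the trees and recover more of this benefit.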

BUILDING A FOREST (ENSEMBLE)

RANDOM FOREST ADVANTAGES

• Can solve both types of problems: classification and regression
• Random forests generalize well to new data
• It is unexcelled in accuracy among current algorithms*
• It runs efficiently on large databases and can handle thousands of input variables without variable deletion
• It gives estimates of which variables are important in the classification
• It generates an internal unbiased estimate of the generalization error as the forest building progresses
• It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing
• It computes proximities between pairs of cases that can be used in clustering, locating outliers, or giving interesting views of the data
• The out-of-bag error estimate removes the need for a set-aside test set
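The out-of-bag error estimate mentioned above is available directly in scikit-learn; this is an illustration on the Iris data, not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True scores each observation using only the trees
# whose bootstrap sample did NOT contain it
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=0)
forest.fit(X, y)
print(forest.oob_score_)  # OOB accuracy, close to a held-out test estimate
```

Because each tree sees only its own bootstrap sample, every observation is out-of-bag for roughly a third of the trees, which is what makes this internal estimate possible without a separate test set.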

DISADVANTAGES

• The results are less actionable because forests are not easily interpreted. They are considered a black-box approach by statistical modelers, with little control over what the model does, similar to a neural network.

• It does a good job at classification but is not as strong for regression, since it does not give precise continuous predictions. For regression it cannot predict beyond the range of the training data, and forests may over-fit data sets that are particularly noisy.

RANDOM FOREST IN SAS ENTERPRISE MINER: PROC HPFOREST

proc hpforest data=mydata;                     /* dataset name is illustrative */
   target targetname / level=nominal;          /* or level=interval for regression */
   input categorical_variables / level=nominal;
   input numerical_variables / level=interval;
run;

OUTPUT OF PROC HPFOREST

THANK YOU!
