
Why and how to use random forest variable importance measures
(and how you shouldn’t)

Carolin Strobl (LMU München) and Achim Zeileis (WU Wien)
[email protected]
useR! 2008, Dortmund

Outline
- Introduction
- Construction
- R functions
- Variable importance
- Tests for variable importance
- Conditional importance
- Summary
- References

Random forests
- have become increasingly popular in, e.g., genetics and the neurosciences
- can deal with “small n large p” problems, high-order interactions, and correlated predictor variables
- are used not only for prediction, but also to assess variable importance

(Small) random forest

[Figure: a grid of small classification trees forming a random forest. Each tree splits on the variables Start, Age, and Number; each inner node reports the split variable and its p-value (p < 0.001 throughout), and each terminal node reports its size n and the class proportions y.]
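The panels of this figure are individual conditional inference trees. As a minimal sketch, assuming the kyphosis data from the rpart package (an assumption that matches the Start, Age, and Number splits shown), one such tree can be grown and plotted with party:

  library(party)   # ctree() for conditional inference trees
  library(rpart)   # provides the kyphosis data (our assumption here)

  ## grow and plot a single classification tree,
  ## as in one panel of the figure above
  kyph_tree <- ctree(Kyphosis ~ Age + Number + Start, data = kyphosis)
  plot(kyph_tree)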
Construction of a random forest
- draw ntree bootstrap samples from the original sample
- fit a classification tree to each bootstrap sample ⇒ ntree trees
- this creates a diverse set of trees, because
  - trees are unstable w.r.t. changes in the learning data ⇒ ntree different-looking trees (bagging)
  - mtry splitting variables are randomly preselected in each split ⇒ ntree even more different-looking trees (random forest)
(see the hand-rolled sketch after this list)
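A minimal sketch of these steps, again assuming the kyphosis data from rpart; in practice cforest() performs all of this internally:

  library(party)   # ctree() and ctree_control()
  library(rpart)   # kyphosis data (our assumption)

  set.seed(2008)
  ntree <- 5       # number of trees in the (small) forest
  mtry  <- 2       # splitting variables randomly preselected per split

  forest <- lapply(seq_len(ntree), function(i) {
    ## draw a bootstrap sample from the original sample
    boot <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]
    ## fit a classification tree to the bootstrap sample;
    ## mtry makes the ntree trees look even more different
    ctree(Kyphosis ~ Age + Number + Start, data = boot,
          controls = ctree_control(mtry = mtry))
  })

Each element of forest is one tree; ntree and mtry are the two tuning constants named in the list above.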
Random forests in R
- randomForest (pkg: randomForest)
  - reference implementation based on CART trees (Breiman, 2001; Liaw and Wiener, 2008)
  – for variables of different types: biased in favor of continuous variables and variables with many categories (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)
- cforest (pkg: party)
  - based on unbiased conditional inference trees (Hothorn, Hornik, and Zeileis, 2006)
  + for variables of different types: unbiased when subsampling, instead of bootstrap sampling, is used (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)
(both calls are sketched below)
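A sketch of both calls on the kyphosis data (the data choice is our assumption; the function and argument names are the packages' documented interfaces):

  library(randomForest)   # CART-based reference implementation
  library(party)          # conditional inference forests
  library(rpart)          # kyphosis data (our assumption)

  set.seed(2008)
  ## CART-based forest: variable selection is biased
  ## when predictors are of different types
  rf <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                     ntree = 500, mtry = 2)

  ## cforest_unbiased() combines unbiased conditional inference trees
  ## with subsampling (replace = FALSE) instead of bootstrap sampling
  cf <- cforest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                controls = cforest_unbiased(ntree = 500, mtry = 2))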