Why and how to use random forest variable importance measures
(and how you shouldn't)

Carolin Strobl (LMU München) and Achim Zeileis (WU Wien)
[email protected]

useR! 2008, Dortmund

Random forests

- have become increasingly popular in, e.g., genetics and the neurosciences
- can deal with "small n, large p" problems, high-order interactions, and correlated predictor variables
- are used not only for prediction, but also to assess variable importance

(Small) random forest

[Figure: a grid of example classification trees from a small random forest; each tree splits on the predictors Start, Age, and Number at p < 0.001, with different cutpoints and class proportions.]

Construction of a random forest

- draw ntree bootstrap samples from the original sample
- fit a classification tree to each bootstrap sample ⇒ ntree trees
- this creates a diverse set of trees, because:
  - trees are unstable w.r.t. changes in the learning data ⇒ ntree different-looking trees (bagging)
  - mtry splitting variables are randomly preselected in each split ⇒ ntree even more different-looking trees (random forest)
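The steps above can be sketched in a few lines of R. This is only an illustration of the construction idea, not the actual randomForest or cforest implementation; the kyphosis data (shipped with rpart, matching the predictors Age, Number, Start in the trees above) and mtry = 2 are assumptions, and for simplicity the mtry variables are preselected once per tree rather than anew at every split:

```r
library(rpart)  # ships with R; provides rpart() and the kyphosis data

set.seed(1)
ntree <- 25
mtry  <- 2
vars  <- c("Age", "Number", "Start")

forest <- lapply(seq_len(ntree), function(t) {
  # step 1: draw a bootstrap sample from the original sample
  boot <- kyphosis[sample(nrow(kyphosis), replace = TRUE), ]
  # step 3 (simplified): preselect mtry candidate splitting variables
  use  <- sample(vars, mtry)
  # step 2: fit a classification tree to the bootstrap sample
  rpart(reformulate(use, response = "Kyphosis"), data = boot,
        method = "class")
})
length(forest)  # ntree trees
```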

Random forests in R

- randomForest (pkg: randomForest)
  - the reference implementation, based on CART trees (Breiman, 2001; Liaw and Wiener, 2008)
  − for variables of different types: biased in favor of continuous variables and variables with many categories (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)

- cforest (pkg: party)
  - based on unbiased conditional inference trees (Hothorn, Hornik, and Zeileis, 2006)
  + for variables of different types: unbiased when subsampling, instead of the bootstrap, is used (Strobl, Boulesteix, Zeileis, and Hothorn, 2007)
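A minimal sketch of the two interfaces (assumes both packages are installed; the kyphosis data from rpart is used purely as a stand-in example, since the slide does not name a data set):

```r
library(randomForest)
library(party)

set.seed(1)
d <- rpart::kyphosis

# CART-based reference implementation (bootstrap sampling by default)
rf <- randomForest(Kyphosis ~ Age + Number + Start, data = d,
                   importance = TRUE)

# conditional inference trees; cforest_unbiased() uses subsampling
# instead of the bootstrap, giving unbiased variable selection
cf <- cforest(Kyphosis ~ Age + Number + Start, data = d,
              controls = cforest_unbiased(ntree = 500, mtry = 2))
```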

Measuring variable importance

- Gini importance:
  mean Gini gain produced by X_j over all trees
  - obj <- randomForest(..., importance=TRUE)
    obj$importance, column MeanDecreaseGini
    importance(obj, type=2)
  − for variables of different types: biased in favor of continuous variables and variables with many categories

- permutation importance:
  mean decrease in classification accuracy after permuting X_j, over all trees
  - obj <- randomForest(..., importance=TRUE)
    obj$importance, column MeanDecreaseAccuracy
    importance(obj, type=1)
  - obj <- cforest(...)
    varimp(obj)
  + for variables of different types: unbiased only when subsampling is used, as in cforest(..., controls = cforest_unbiased())

The permutation importance

within each tree t:

VI^(t)(x_j) = Σ_{i ∈ B̄^(t)} I(y_i = ŷ_i^(t)) / |B̄^(t)|  −  Σ_{i ∈ B̄^(t)} I(y_i = ŷ_{i,π_j}^(t)) / |B̄^(t)|

where B̄^(t) is the out-of-bag sample of tree t,

ŷ_i^(t) = f^(t)(x_i) = predicted class before permuting X_j,

ŷ_{i,π_j}^(t) = f^(t)(x_{i,π_j}) = predicted class after permuting X_j, with

x_{i,π_j} = (x_{i,1}, …, x_{i,j−1}, x_{π_j(i),j}, x_{i,j+1}, …, x_{i,p})

Note: VI^(t)(x_j) = 0 by definition if X_j is not in tree t.
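The per-tree formula is a one-liner in base R. A toy sketch with hand-made predictions (the vectors below are invented for illustration, not the output of a real tree):

```r
# accuracy on the out-of-bag sample before minus after permuting X_j
vi_tree <- function(y, yhat, yhat_perm) {
  mean(y == yhat) - mean(y == yhat_perm)
}

y         <- c(1, 0, 1, 1, 0)   # true classes of the OOB observations
yhat      <- c(1, 0, 1, 0, 0)   # predictions before permuting X_j: 4/5 correct
yhat_perm <- c(0, 0, 1, 0, 1)   # predictions after permuting X_j:  2/5 correct
vi_tree(y, yhat, yhat_perm)     # 0.8 - 0.4 = 0.4
```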

over all trees:

1. raw importance

VI(x_j) = Σ_{t=1}^{ntree} VI^(t)(x_j) / ntree

obj <- randomForest(..., importance=TRUE)
importance(obj, type=1, scale=FALSE)

2. scaled importance (z-score)

z_j = VI(x_j) / (σ̂ / √ntree)

obj <- randomForest(..., importance=TRUE)
importance(obj, type=1, scale=TRUE)  (the default)
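Aggregation and scaling over the trees, again as a base-R sketch with invented per-tree importances:

```r
vi_t  <- c(0.02, 0.00, 0.05, 0.01, 0.03)  # toy per-tree values VI^(t)(x_j)
ntree <- length(vi_t)

raw <- sum(vi_t) / ntree                  # 1. raw importance (scale = FALSE)
z   <- raw / (sd(vi_t) / sqrt(ntree))     # 2. scaled importance (scale = TRUE)
raw                                       # 0.022
```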

Tests for variable importance
for variable selection purposes

- Breiman and Cutler (2008): simple significance test based on normality of the z-score
  randomForest, scale=TRUE + α-quantile of N(0,1)

- Diaz-Uriarte and Alvarez de Andrés (2006): backward elimination (throw out the least important variables until the out-of-bag prediction accuracy drops)
  varSelRF (pkg: varSelRF), depends on randomForest

- Diaz-Uriarte (2007) and Rodenburg et al. (2008): plots and significance test (randomly permute the response values to mimic the overall null hypothesis that none of the predictor variables is relevant = baseline)

Tests for variable importance

problems of these approaches:

- (at least) Breiman and Cutler (2008): strange statistical properties (Strobl and Zeileis, 2008)

- all: preference of correlated predictor variables (see also Nicodemus and Shugart, 2007; Archer and Kimes, 2008)

Breiman and Cutler's test

under the null hypothesis of zero importance:

z_j is asymptotically N(0, 1)

if z_j exceeds the α-quantile of N(0,1) ⇒ reject the null hypothesis of zero importance for variable X_j

Raw importance

[Figure: mean raw importance against relevance (0.0–0.4), panels for ntree = 100, 200, 500 and sample sizes 100, 200, 500.]

z-score and power

[Figure: z-score and power against relevance (0.0–0.4), panels for ntree = 100, 200, 500 and sample sizes 100, 200, 500.]

Findings

z-score and power

- increase with ntree
- decrease with sample size

⇒ rather use the raw, unscaled permutation importance!

importance(obj, type=1, scale=FALSE)
varimp(obj)

What null hypothesis were we testing in the first place?

obs   Y      X_j              Z
1     y_1    x_{π_j(1),j}     z_1
⋮     ⋮      ⋮                ⋮
i     y_i    x_{π_j(i),j}     z_i
⋮     ⋮      ⋮                ⋮
n     y_n    x_{π_j(n),j}     z_n

the current null hypothesis reflects independence of X_j from both Y and the remaining predictor variables Z:

H_0: X_j ⊥ Y, Z   or   X_j ⊥ Y ∧ X_j ⊥ Z

under H_0: P(Y, X_j, Z) = P(Y, Z) · P(X_j)

⇒ a high variable importance can result from violation of either one!

Suggestion: Conditional permutation scheme

obs   Y       X_j                   Z
1     y_1     x_{π_j|Z=a(1),j}      z_1 = a
3     y_3     x_{π_j|Z=a(3),j}      z_3 = a
27    y_27    x_{π_j|Z=a(27),j}     z_27 = a
6     y_6     x_{π_j|Z=b(6),j}      z_6 = b
14    y_14    x_{π_j|Z=b(14),j}     z_14 = b
33    y_33    x_{π_j|Z=b(33),j}     z_33 = b
⋮     ⋮       ⋮                     ⋮

i.e., X_j is permuted only within groups of observations with the same value of Z

H_0: X_j ⊥ Y | Z

under H_0: P(Y, X_j | Z) = P(Y | Z) · P(X_j | Z), or P(Y | X_j, Z) = P(Y | Z)

Technically

- use any partition of the feature space for conditioning
- here: use the binary partition already learned by the tree (use the cutpoints as bisectors of the feature space)
- condition on all correlated variables, or select some

Strobl et al. (2008); available in cforest from version 0.9-994:
varimp(obj, conditional = TRUE)

Simulation study

- dgp: y_i = β_1 · x_{i,1} + ⋯ + β_12 · x_{i,12} + ε_i,  with ε_i i.i.d. N(0, 0.5)
- X_1, …, X_12 ~ N(0, Σ) with

      ⎛ 1    0.9  0.9  0.9  0  ⋯  0 ⎞
      ⎜ 0.9  1    0.9  0.9  0  ⋯  0 ⎟
      ⎜ 0.9  0.9  1    0.9  0  ⋯  0 ⎟
  Σ = ⎜ 0.9  0.9  0.9  1    0  ⋯  0 ⎟
      ⎜ 0    0    0    0    1  ⋯  0 ⎟
      ⎜ ⋮                      ⋱  ⋮ ⎟
      ⎝ 0    0    0    0    0  ⋯  1 ⎠

  i.e., X_1, …, X_4 are block-correlated, X_5, …, X_12 independent

X_j:  X_1  X_2  X_3  X_4  X_5  X_6  X_7  X_8  ⋯  X_12
β_j:   5    5    2    0   −5   −5   −2    0   ⋯   0

Results
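The data-generating process above in base R (a sketch; n = 100 is an assumption, since the slide does not state the sample size):

```r
set.seed(1)
n <- 100   # assumed sample size (not stated on the slide)
p <- 12

# block structure: X1..X4 pairwise correlated with 0.9, X5..X12 independent
Sigma <- diag(p)
Sigma[1:4, 1:4] <- 0.9
diag(Sigma) <- 1

X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)   # rows ~ N(0, Sigma)

beta <- c(5, 5, 2, 0, -5, -5, -2, rep(0, 5))
y    <- drop(X %*% beta) + rnorm(n, sd = sqrt(0.5))  # eps ~ N(0, 0.5)
```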

[Figure: variable importance for variables 1–12, one panel per mtry = 1, 3, 8; importance scales reach roughly 25, 50, and 80, respectively.]

Peptide-binding data

[Figure: unconditional vs. conditional permutation importance (0 to 0.005) for the peptide-binding data, with the variables h2y8, flex8, and pol3 marked.]

Summary

if your predictor variables are of different types:
use cforest (pkg: party) with the default option controls = cforest_unbiased()
and the permutation importance varimp(obj)

otherwise: feel free to use
cforest (pkg: party) with the permutation importance varimp(obj), or
randomForest (pkg: randomForest) with the permutation importance importance(obj, type=1) or the Gini importance importance(obj, type=2)
– but don't fall for the z-score! (i.e., set scale=FALSE)

if your predictor variables are highly correlated:
use the conditional importance in cforest (pkg: party)

References

Archer, K. J. and R. V. Kimes (2008). Empirical characterization of random forest variable importance measures. Computational Statistics & Data Analysis 52(4), 2249–2260.

Breiman, L. (2001). Random forests. Machine Learning 45(1), 5–32.

Breiman, L. and A. Cutler (2008). Random forests – Classification manual. Website accessed in 1/2008; http://www.math.usu.edu/~adele/forests.

Breiman, L., A. Cutler, A. Liaw, and M. Wiener (2006). Breiman and Cutler's Random Forests for Classification and Regression. R package version 4.5-16.

Diaz-Uriarte, R. (2007). GeneSrF and varSelRF: A web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics 8:328.

Hothorn, T., K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics 15(3), 651–674.

Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25.

Strobl, C. and A. Zeileis (2008). Danger: High power! – Exploring the statistical properties of a test for random forest variable importance. In Proceedings of the 18th International Conference on Computational Statistics, Porto, Portugal.

Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9:307.