Feature Power: a New Variable Importance Measure For

FEATURE POWER: A NEW VARIABLE IMPORTANCE MEASURE FOR

RANDOM FORESTS

A thesis presented to the faculty of San Francisco State University In partial fulfilment of 3 ^ The Requirements for ZD 19 The Degree C ^ f T t / P'

Master of Science In Computer Science

Katie Fotion

San Francisco, California

May 2018 Copyright by Katie Fotion 2018 CERTIFICATION OF APPROVAL

I certify that I have read FEATURE POWER: A NEW VARIABLE IM

PORTANCE MEASURE FOR RANDOM FORESTS by Katie Fotion and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree:

Master of Science in Computer Science at San Francisco State University.

Kazunori Okada Associate Professor of Computer

tin Petkovic and Associate Chair of Computer Science

Associate Professor of Computer Science FEATURE POWER: A NEW VARIABLE IMPORTANCE MEASURE FOR

RANDOM FORESTS

Katie Fotion San Francisco State University 2018

Variable importance and interaction measures are crucial to breaking open the

“black box” of machine-learned classifiers. The existing metrics, however, are data- driven and lack a solid mathematical foundation, resulting in misleading conclusions on certain types of data. We propose feature power: a new variable importance mea sure based on the Shapley value of cooperative game theory. We evaluate the validity of this new measure and the behavior of feature power in comparison to existing variable importance metrics. We also introduce coalition power: a methodology for quantifying the power of a group of features collectively. We demonstrate that both methods produce consistent, correct results on toy data and gain interesting insights by applying feature power to real data sets. We discuss the extensibility of both power measures to other tree-based ensembles and neural networks.

I certify that the Abstract is a correct representation of the content of this thesis.

Chair, Thesis Committee Date ACKNOWLEDGMENTS

I thank the members of the Biomedical Imaging and Data Analysis Lab oratory (BIDAL) at San Francisco State University for acting as a sound ing board. I am particularly grateful for the many technical dialogues with Andrew Scott and thesis advisor Dr. Kazunori Okada. Thank you

Dr. Okada for providing the seed of an idea for this thesis as well as constant guidance. You are wonderful and wise.

I express gratitude to my brilliant thesis committee members, whose feedback has been invaluable in this process. I appreciate my coworkers at the SLAC National Accelerator Laboratory for reminding me why I am completing this work through the lab’s own needs for a better vari able importance measure.

Finally, I would like to thank my parents and friends for their undying support and ability to seamlessly remind me both of their unconditional love and expectation of my excellence. TABLE OF CONTENTS

1 Introduction...... 1

1.1 Motivation...... 2

1.2 Proposed Measure...... 6

1.3 Experimental Details...... 7

1.4 List of Contributions...... 8

2 Prior W o r k ...... 9

3 Random Forests...... 18

3.1 The Random Forest Algorithm ...... 19

3.2 Variable Im portance...... 20

3.3 Variable Interaction...... 24

4 Voting Games...... 26

4.1 Introduction to Voting Theory...... 28

4.1.1 Definitions...... 28

4.1.2 The Shapley-Shubik power in d e x ...... 29

4.1.3 The Banzhaf power in d e x ...... 30

4.1.4 Toy example...... 31

4.2 Voting Theory in Random Forests...... 33

4.3 Introduction to Cooperative Game T heory...... 35

4.3.1 Definitions...... 36

vi 4.3.2 The Shapley value ...... 36

4.3.3 Continuation of toy example...... 38

5 Feature Power...... 40

5.1 Definitions...... 41

5.1.1 Decision trees...... 41

5.1.2 Random forest as a voting game...... 42

5.2 Derivation of Feature Pow er...... 45

5.2.1 Feature power method 1: path iteration ...... 46

5.2.2 Feature power method 2: cumulative node iteration...... 51

5.2.3 Feature power method 3: strict node iteration...... 55

5.3 Extension of Feature Power to Random Forests ...... 58

5.4 Extension of Feature Power to Multi-Class Problems ...... 59

6 A Comparative Study ...... 61

6.1 Data Description...... 62

6.1.1 Toy data s e t...... 62

6.1.2 Real world data s e t s ...... 63

6.2 Results...... 64

6.2.1 V alidity...... 64

6.2.2 Stability...... 72

6.2.3 Sensitivity to hyperparameters...... 77

vii 6.2.4 Summary of results...... 84

6.3 Discussion...... 85

7 Theoretical Implications of Feature P o w e r...... 87

7.1 A Mapping to Voting T h e o r y ...... 88

7.1.1 The Procrastinator: pathPow ...... 88

7.1.2 The Dreamer: cumNodePow...... 89

7.1.3 The Literalist: strictNodePow...... 90

7.2 The Overweighting of Low-Depth Features...... 91

7.3 An Axiomatic A pproach...... 93

7.3.1 The Shapley Value Axioms...... 93

7.3.2 Feature Power A xiom s...... 96

7.4 Variable Importance and Feature Power R anges...... 102

7.4.1 Range of Gini im portance...... 103

7.4.2 Range of permutation importance...... 104

7.4.3 Range of count importance...... 105

7.4.4 Range of p ath P ow ...... 105

7.4.5 Range of cumNodePow...... 106

7.4.6 Range of strictNodePow...... 108

7.4.7 Semantics of Importance Values...... 109

7.5 An Alternative Probabilistic Characteristic Function ...... I ll

viii 8 Coalition Power...... 114

8.1 Derivation of Coalition P ow er...... 116

8.1.1 Coalition power by pathPow...... 117

8.1.2 Coalition power by cumNodePow...... 117

8.1.3 Notes on coalition pow er...... 118

8.2 Preliminary Results...... 120

9 Conclusion...... 131

9.1 Extension to Other Classifiers...... 134

9.1.1 Tree-based ensembles...... 134

9.1.2 Neural networks...... 135

9.2 Future W ork...... 135

Appendix A: Mathematical S ym bols...... 137

Appendix B: Additional P lo ts...... 143

Appendix C: Complete Tables...... 154

Bibliography ...... 158

ix LIST OF TABLES

Table Page

6.1 Data set details...... 62

6.2 Summary of results. Note that “pa” corresponds to pathPow, “st”

to strictNodePow and “cu” to cumNodePow...... 86

9.1 Enumeration of mathematical symbols, as consistent throughout the

sis and in the order in which they appear...... 142

9.2 Class-specific pairwise rank-order correlation on top 7 ranked features

of toys data...... 155

9.3 Class-specific pairwise rank-order correlation on top 7 ranked features

of breast cancer data...... 156

9.4 Class-specific pairwise rank-order correlation on top 7 ranked features

of wine data...... 157

x LIST OF FIGURES

Figure Page

5.1 Toy decision tree...... 43

5.2 Toy tree for pathPow methodology...... 49

5.3 Toy tree for cumNodePow methodology...... 53

5.4 Toy tree for strictNodePow methodology...... 57

6.1 Importance values of first ten features of toys data. Top 6 plots

correspond to the three feature power methods, while bottom 5 plots

correspond to existing variable importance measures...... 66

6.2 Importance values of features of image segmentation data. Note

that the values for the top 3 plots (titled pathPow, strictNodePow

and cumNodePow) are obtained by applying the three feature power

methods...... 67

6.3 Pairwise rank-order correlation on top 7 ranked features of toys data.

Headings “path,” “strict,” and “cum” correspond to the three feature

power methods...... 70

6.4 Pairwise rank-order correlation on top 7 ranked features of breast

cancer data, averaged across classes. Headings “path,” “strict,” and

“cum” correspond to the three feature power methods...... 71

6.5 Pairwise rank-order correlation on top 7 ranked features of wine data,

averaged across classes. Headings “path,” “strict,” and “cum” corre

spond to the three feature power m eth od s...... 72

xi 6.6 Pairwise rank-order correlation on top 7 ranked features of image seg

mentation data, averaged across classes. Headings “path,” “strict,”

and “cum” correspond to the three feature power m ethods...... 73

6.7 Average rank-order correlation on top 7 ranked features over 50 RF

instances on toys data. Headings “path,” “strictNode,” and “cumNode”

correspond to the three feature power methods...... 74

6.8 Average rank-order correlation on top 7 ranked features over 50 RF

instances on wine data. Headings “path,” “strictNode,” and “cumNode”

correspond to the three feature power methods...... 75

6.9 Importance values of features of wine data. Note that the values for

the top 3 plots (titled pathPow, strictNodePow and cumNodePow)

are obtained by applying the three feature power methods...... 76

6.10 Rank-order Spearman’s rho sensitivity to ntrees on toys data. Note

that all lines defined on the left side of the legend correspond to