FEATURE POWER: A NEW VARIABLE IMPORTANCE MEASURE FOR

RANDOM FORESTS

A thesis presented to the faculty of San Francisco State University in partial fulfillment of the requirements for the degree

Master of Science In Computer Science

by

Katie Fotion

San Francisco, California

May 2018

Copyright by Katie Fotion 2018

CERTIFICATION OF APPROVAL

I certify that I have read FEATURE POWER: A NEW VARIABLE IMPORTANCE MEASURE FOR RANDOM FORESTS by Katie Fotion and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree:

Master of Science in Computer Science at San Francisco State University.

Kazunori Okada
Associate Professor of Computer Science

Dragutin Petkovic
Professor and Associate Chair of Computer Science

Associate Professor of Computer Science

FEATURE POWER: A NEW VARIABLE IMPORTANCE MEASURE FOR

RANDOM FORESTS

Katie Fotion San Francisco State University 2018

Variable importance and interaction measures are crucial to breaking open the “black box” of machine-learned classifiers. The existing metrics, however, are data-driven and lack a solid mathematical foundation, resulting in misleading conclusions on certain types of data. We propose feature power: a new variable importance measure based on the Shapley value of cooperative game theory. We evaluate the validity of this new measure and the behavior of feature power in comparison to existing variable importance metrics. We also introduce coalition power: a methodology for quantifying the power of a group of features collectively. We demonstrate that both methods produce consistent, correct results on toy data and gain interesting insights by applying feature power to real data sets. We discuss the extensibility of both power measures to other tree-based ensembles and neural networks.

I certify that the Abstract is a correct representation of the content of this thesis.

Chair, Thesis Committee Date

ACKNOWLEDGMENTS

I thank the members of the Biomedical Imaging and Data Analysis Laboratory (BIDAL) at San Francisco State University for acting as a sounding board. I am particularly grateful for the many technical dialogues with Andrew Scott and thesis advisor Dr. Kazunori Okada. Thank you, Dr. Okada, for providing the seed of an idea for this thesis as well as constant guidance. You are wonderful and wise.

I express gratitude to my brilliant thesis committee members, whose feedback has been invaluable in this process. I appreciate my coworkers at the SLAC National Accelerator Laboratory for reminding me, through the lab’s own needs for a better variable importance measure, why I am completing this work.

Finally, I would like to thank my parents and friends for their undying support and ability to seamlessly remind me both of their unconditional love and expectation of my excellence.

TABLE OF CONTENTS

1 Introduction ...... 1
1.1 Motivation ...... 2
1.2 Proposed Measure ...... 6
1.3 Experimental Details ...... 7
1.4 List of Contributions ...... 8
2 Prior Work ...... 9
3 Random Forests ...... 18
3.1 The Algorithm ...... 19
3.2 Variable Importance ...... 20
3.3 Variable Interaction ...... 24
4 Voting Games ...... 26
4.1 Introduction to Voting Theory ...... 28
4.1.1 Definitions ...... 28
4.1.2 The Shapley-Shubik power index ...... 29
4.1.3 The Banzhaf power index ...... 30
4.1.4 Toy example ...... 31
4.2 Voting Theory in Random Forests ...... 33
4.3 Introduction to Cooperative Game Theory ...... 35
4.3.1 Definitions ...... 36
4.3.2 The Shapley value ...... 36
4.3.3 Continuation of toy example ...... 38
5 Feature Power ...... 40
5.1 Definitions ...... 41
5.1.1 Decision trees ...... 41
5.1.2 Random forest as a voting game ...... 42
5.2 Derivation of Feature Power ...... 45
5.2.1 Feature power method 1: path iteration ...... 46
5.2.2 Feature power method 2: cumulative node iteration ...... 51
5.2.3 Feature power method 3: strict node iteration ...... 55
5.3 Extension of Feature Power to Random Forests ...... 58
5.4 Extension of Feature Power to Multi-Class Problems ...... 59
6 A Comparative Study ...... 61
6.1 Data Description ...... 62
6.1.1 Toy data set ...... 62
6.1.2 Real world data sets ...... 63
6.2 Results ...... 64
6.2.1 Validity ...... 64
6.2.2 Stability ...... 72
6.2.3 Sensitivity to hyperparameters ...... 77
6.2.4 Summary of results ...... 84
6.3 Discussion ...... 85
7 Theoretical Implications of Feature Power ...... 87
7.1 A Mapping to Voting Theory ...... 88
7.1.1 The Procrastinator: pathPow ...... 88
7.1.2 The Dreamer: cumNodePow ...... 89
7.1.3 The Literalist: strictNodePow ...... 90
7.2 The Overweighting of Low-Depth Features ...... 91
7.3 An Axiomatic Approach ...... 93
7.3.1 The Shapley Value Axioms ...... 93
7.3.2 Feature Power Axioms ...... 96
7.4 Variable Importance and Feature Power Ranges ...... 102
7.4.1 Range of Gini importance ...... 103
7.4.2 Range of permutation importance ...... 104
7.4.3 Range of count importance ...... 105
7.4.4 Range of pathPow ...... 105
7.4.5 Range of cumNodePow ...... 106
7.4.6 Range of strictNodePow ...... 108
7.4.7 Semantics of Importance Values ...... 109
7.5 An Alternative Probabilistic Characteristic Function ...... 111
8 Coalition Power ...... 114
8.1 Derivation of Coalition Power ...... 116
8.1.1 Coalition power by pathPow ...... 117
8.1.2 Coalition power by cumNodePow ...... 117
8.1.3 Notes on coalition power ...... 118
8.2 Preliminary Results ...... 120
9 Conclusion ...... 131
9.1 Extension to Other Classifiers ...... 134
9.1.1 Tree-based ensembles ...... 134
9.1.2 Neural networks ...... 135
9.2 Future Work ...... 135
Appendix A: Mathematical Symbols ...... 137
Appendix B: Additional Plots ...... 143
Appendix C: Complete Tables ...... 154
Bibliography ...... 158

LIST OF TABLES

Table Page

6.1 Data set details ...... 62
6.2 Summary of results. Note that “pa” corresponds to pathPow, “st” to strictNodePow and “cu” to cumNodePow ...... 86
9.1 Enumeration of mathematical symbols, as consistent throughout the thesis and in the order in which they appear ...... 142
9.2 Class-specific pairwise rank-order correlation on top 7 ranked features of toys data ...... 155
9.3 Class-specific pairwise rank-order correlation on top 7 ranked features of breast cancer data ...... 156
9.4 Class-specific pairwise rank-order correlation on top 7 ranked features of wine data ...... 157

LIST OF FIGURES

Figure Page

5.1 Toy decision tree ...... 43
5.2 Toy tree for pathPow methodology ...... 49
5.3 Toy tree for cumNodePow methodology ...... 53
5.4 Toy tree for strictNodePow methodology ...... 57
6.1 Importance values of first ten features of toys data. Top 6 plots correspond to the three feature power methods, while bottom 5 plots correspond to existing variable importance measures ...... 66
6.2 Importance values of features of image segmentation data. Note that the values for the top 3 plots (titled pathPow, strictNodePow and cumNodePow) are obtained by applying the three feature power methods ...... 67
6.3 Pairwise rank-order correlation on top 7 ranked features of toys data. Headings “path,” “strict,” and “cum” correspond to the three feature power methods ...... 70
6.4 Pairwise rank-order correlation on top 7 ranked features of breast cancer data, averaged across classes. Headings “path,” “strict,” and “cum” correspond to the three feature power methods ...... 71
6.5 Pairwise rank-order correlation on top 7 ranked features of wine data, averaged across classes. Headings “path,” “strict,” and “cum” correspond to the three feature power methods ...... 72
6.6 Pairwise rank-order correlation on top 7 ranked features of image segmentation data, averaged across classes. Headings “path,” “strict,” and “cum” correspond to the three feature power methods ...... 73
6.7 Average rank-order correlation on top 7 ranked features over 50 RF instances on toys data. Headings “path,” “strictNode,” and “cumNode” correspond to the three feature power methods ...... 74
6.8 Average rank-order correlation on top 7 ranked features over 50 RF instances on wine data. Headings “path,” “strictNode,” and “cumNode” correspond to the three feature power methods ...... 75
6.9 Importance values of features of wine data. Note that the values for the top 3 plots (titled pathPow, strictNodePow and cumNodePow) are obtained by applying the three feature power methods ...... 76
6.10 Rank-order Spearman’s rho sensitivity to ntree on toys data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures ...... 78
6.11 Rank-order Spearman’s rho sensitivity to ntree on breast cancer data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures ...... 79
6.12 Rank-order Spearman’s rho sensitivity to ntree on wine data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures ...... 80
6.13 Rank-order Spearman’s rho sensitivity to ntree on segmentation data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures ...... 81
6.14 Rank-order Spearman’s rho sensitivity to mtry on toys data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures ...... 82
6.15 Rank-order Spearman’s rho sensitivity to mtry on breast cancer data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures ...... 83
7.1 Logscale of Shapley value coefficient growth based on coalition size ...... 92
7.2 Toy tree for alternative probabilistic characteristic function w ...... 113
8.1 Results of path coalition power evaluation on all coalitions of size 2 on features 1-8 ...... 121
8.2 Class distribution of features 3 and 6, as identified as important by path coalition power ...... 122
8.3 Class distribution of features 3 and 5, as identified as important by path coalition power ...... 123
8.4 Results of cumNode coalition power evaluation on all coalitions of size 2 on features 1-8 ...... 124
8.5 Class distribution of features 2 and 5, as identified as important by cumNode coalition power ...... 125
8.6 Class distribution of features 7 and 8, as identified as unimportant by both coalition power methods ...... 126
8.7 Class distribution of features 1 and 5, as identified as unimportant by both coalition power methods ...... 127
8.8 Results of coalition count evaluation on all coalitions of size 2 on features 1-8 ...... 128
9.1 Aggregate importance values of first ten features of toys data ...... 144
9.2 Class-specific importance values of features in breast cancer data ...... 145
9.3 Aggregate importance values of features in breast cancer data ...... 146
9.4 Class-specific importance values of features in wine data ...... 147
9.5 Class-specific importance values of features in segmentation data ...... 148
9.6 Average rank-order correlation on top 7 ranked features over 50 RF instances on breast cancer data ...... 149
9.7 Average rank-order correlation on top 7 ranked features over 50 RF instances on image segmentation data ...... 150
9.8 Effect of OOB accuracy on rank-order correlation on top 7 ranked features over 50 RF instances on toys data ...... 151
9.9 Effect of OOB accuracy on rank-order correlation on top 7 ranked features over 50 RF instances on breast cancer data ...... 151
9.10 Effect of OOB accuracy on aggregate rank-order correlation on top 7 ranked features over 50 RF instances on wine data ...... 152
9.11 Effect of OOB accuracy on aggregate rank-order correlation on top 7 ranked features over 50 RF instances on image segmentation data ...... 152
9.12 Rank-order Spearman’s rho sensitivity to mtry on wine data ...... 153
9.13 Rank-order Spearman’s rho sensitivity to mtry on segmentation data ...... 153


Chapter 1

Introduction

Without making a conscious effort, the modern individual is in constant contact with machine learning (ML). Whether that be through targeted advertising, automated customer service phone calls, or mobile navigation apps, ML has undoubtedly infiltrated our society. Despite the ubiquity of ML in daily life, the vast majority of the population does not fully understand how these algorithms make their decisions. Furthermore, most experts, including the engineers who design these systems, are unable to fully interpret them [54, 76]. The ability of an algorithm to explain itself has become a heavily researched topic of late because many of the top performing models, such as neural networks (NN), random forests (RF)1, and support vector machines (SVM) [26], fall into the category of “black boxes” [21, 59, 61]. The term refers to any model that is not transparent—or in other words, is uninterpretable.

1Due to a lack of consistency in the literature, this thesis chooses to ascribe to Leo Breiman’s nomenclature, using the plural “random forests” in reference to the algorithm in general and the singular “random forest” when discussing a particular trained instance.

The problem with black boxes is that they are difficult for users to trust and thus difficult for users to justify using. Research shows that the more a user understands a system, the more likely they are to use it [66]. A prime example of the importance of explainability is found in the medical domain; a doctor will accept an ML-predicted diagnosis for a patient only if the model is able to justify how it arrived at that prediction and the doctor agrees with such a justification. Should the model fail to explain itself, the doctor will likely resort to other methods of selecting a diagnosis. To ensure the continued usage of ML, it is essential to develop techniques for understanding predictive systems.

1.1 Motivation

There are various approaches to enhancing the explainability of an ML model. One can select an algorithm that is inherently interpretable, such as a single decision tree [35, 68]. Since a decision tree is simply a set of decision paths leading to predictions, the justification for a single prediction can be extracted from the tree with ease. In many cases, however, the complexity of modern classification problems demands a more advanced algorithm than a decision tree, due to the latter’s tendency toward high variance and consequent poor predictive performance [53]. In fact, there is a direct tradeoff between selecting the model that is the most interpretable and the one that is the best performing [33]. As such, researchers use more sophisticated algorithms—typically black boxes—and subsequently perform ad-hoc interpretability analysis to understand the system [74]. Such analyses can be geared toward either understanding the justification for a particular prediction instance or extracting the rationale of the model in a more global sense [5]. This thesis focuses on the latter by developing a methodology for quantifying the influence of features in decision tree ensembles. We do so by performing a thorough study in the context of random forests, accompanied by remarks regarding the extension to other decision tree ensembles (such as gradient boosting) and the possibility of extension to neural networks.

A random forest instance bases its predictions on the aggregation of results from multiple, independently and semi-randomly constructed CART decision trees [15, 16, 19, 17, 18]. A given node of a tree splits the data into left and right partitions based on the satisfaction of an inequality on a feature f_i. That is to say that all data with f_i < a is directed down the left subtree and all data with f_i ≥ a follows a path within the right subtree, where a is the split value selected in the tree-growing process. Due to the elements of randomness injected into the training of each tree (via bagging and random feature selection, as defined in Chapter 3), a balance is struck between optimization on the training data set and a level of generalization that allows for high predictive performance without overfitting the data. See Section 3.1 for a more detailed introduction to random forests.
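The split rule above can be sketched in a few lines. This is a minimal illustration with a hypothetical dict-based node layout (not the thesis's data structures): a sample goes left when f_i < a and right when f_i ≥ a, until a leaf label is reached.

```python
def predict(node, x):
    """Recursively route sample x down a tree of dict nodes to a leaf label."""
    if "label" in node:
        return node["label"]
    # Left subtree when x[f_i] < a, right subtree when x[f_i] >= a.
    child = "left" if x[node["feature"]] < node["split"] else "right"
    return predict(node[child], x)

# Tiny hand-built tree: split on feature 0 at a = 2.5, then feature 1 at a = 0.0.
tree = {
    "feature": 0, "split": 2.5,
    "left": {"label": "A"},
    "right": {"feature": 1, "split": 0.0,
              "left": {"label": "B"},
              "right": {"label": "C"}},
}
```

A sample with feature 0 below 2.5 lands in leaf "A"; otherwise the second split on feature 1 decides between "B" and "C".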

Motivated by the algorithm’s staggering predictive power, the theory of random forests has developed rapidly since Leo Breiman’s full introduction in 2001 [17], specifically in the realm of defining and quantifying the factors that govern a model’s prediction process. Regardless of these efforts, random forests remain black boxes to some extent, lacking a deep theoretical understanding [11, 27]. Variable importance and interaction measures are most commonly used to understand the behavior of a trained random forest.2 Variable importance is a measure of the relative influence a feature has on any prediction, whereas variable interaction is a pair-wise measure of the relationship between features. Examples of variable importance formulations are Gini importance (or MDG) and permutation importance (or MDA). Gini importance is computed by measuring the change in the Gini index at each node split. Permutation importance is computed by randomly permuting the values of a given feature across a data set and measuring the effect on misclassification rate. Gini interaction and permutation interaction are each based on their respective importance formulations. See Sections 3.2 and 3.3 for the full derivation of these measures. Unfortunately, both metrics lack a solid mathematical foundation [53].
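As a concrete illustration of the Gini computation (a generic sketch, not the thesis's derivation in Section 3.2), the decrease in the Gini index at a single node split can be computed as follows; Gini importance then accumulates such decreases over every node that splits on a given feature, across the forest.

```python
import numpy as np

def gini(labels):
    """Gini index of a label array: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Impurity decrease of one node split, weighting children by partition size."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# A perfectly separating split of a 50/50 parent removes all impurity.
parent = np.array([0] * 5 + [1] * 5)
dec = gini_decrease(parent, parent[:5], parent[5:])  # 0.5 - 0 - 0 = 0.5
```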

Current interpretability measures are severely limited. Not only are variable importance and variable interaction separate measures, making it difficult to gain a full understanding of the behavior of the features within a forest, but both are also data-driven, requiring a data set to probe the trained random forest and gain insight from the output. Moreover, permutation importance and permutation interaction

2In the context of ML, the terms “variable” and “feature” are interchangeable and refer to the input dimensions of the data. Therefore, variable importance refers to the impact that each individual feature has on predictions and variable interaction quantifies the way the features affect one another.

rely on random permutations of the data, resulting in differing values with each calculation. With respect to Gini importance and Gini interaction, the application of these measures makes sense in the context of random forests, but the rationale surrounding their usage breaks down when applied to other decision tree ensembles or other classifiers. In addition, Gini importance demonstrates a bias in favor of features with a large range and numeric values (as opposed to binary features) due to the same bias in the CART tree-growing procedure. Finally, there does not exist a standard governing when it is appropriate to use each metric, resulting in a level of ambiguity surrounding what is actually being measured when calculating variable importance.

These problems stem from the fact that both variable importance and variable interaction are loosely defined. The closest thing to a theoretical definition of variable importance lies in the feature subset selection domain. This is because the influence of a feature is considered when trying to form the most compact set of relevant, non-redundant features [50, 75]. Though some intuition can be gained by studying the theory of feature subset selection, there is no mathematically rigorous definition of what it means to be an important feature. Furthermore, feature subset selection algorithms compute the optimal feature set prior to training the model. Thus the results of feature subset selection are based on the information hidden in the data itself, as opposed to extracting the importance of features within a trained random forest structure.3 We therefore attempt to borrow concepts from cooperative game theory to build out the theoretical foundation and create a new variable importance measure in the context of post-training analysis.

1.2 Proposed Measure

The proposed feature power measure is an effort to develop a theoretical basis for variable importance and provide a consistent metric that overcomes the downsides of existing interpretability measures. We do so by modeling a trained random forest as a formal voting structure, in which features act as voters who have the ability to sway electoral decisions. An introduction to random forests and existing interpretability measures is given in Chapter 3, while a detailed description of voting theory and its superset, cooperative game theory, can be found in Chapter 4. In voting theory, voters have a certain amount of power to sway votes based on the setup of the voting system. Similarly, we treat the trained forest as a voting game in which the features have varying degrees of power to influence classifications. Following this analogy, the Shapley value from cooperative game theory [70] lends itself nicely to the quantification of feature power, which is derived in Chapter 5 and explored in Chapters 6 and 7. We introduce a novel method for measuring the power of groups of features in Chapter 8. To conclude, Chapter 9 discusses possible extensions to other domains and suggests ways to further this work.

3It is important to note that the purpose of computing variable importance is to evaluate the influence of features within a trained random forest, not to determine the features that are most important in the data. This difference becomes evident when a trained random forest is unable to capture all relevant information present in the data.
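To make the voting analogy concrete, the standard Shapley value can be computed by brute force for a small game. The sketch below uses generic game-theoretic definitions, not the thesis's derivation; the players, weights, and quota are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, v):
    """Shapley value of each player, where v maps a frozenset of players to a payoff."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                # Weight |S|!(n-|S|-1)!/n! times i's marginal contribution to S.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (v(S | {i}) - v(S))
        phi[i] = total
    return phi

# Toy weighted voting game: weights 3, 2, 1 and quota 4; a coalition "wins"
# (payoff 1) when its combined weight reaches the quota.
weights = {'a': 3, 'b': 2, 'c': 1}
v = lambda S: 1.0 if sum(weights[p] for p in S) >= 4 else 0.0
phi = shapley_values(['a', 'b', 'c'], v)
# Player a is pivotal in 4 of the 6 orderings: phi == {'a': 2/3, 'b': 1/6, 'c': 1/6}
```

For a 0/1 characteristic function like this one, the Shapley value coincides with the Shapley-Shubik power index discussed in Chapter 4.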

1.3 Experimental Details

When modeling a random forest as a voting game, certain assumptions must be made that directly affect the derivation of feature power. By varying these assumptions, one is able to produce three separate methodologies corresponding to three different versions of feature power. The derivation of each of these measures is followed by multiple experiments to determine correctness and gain understanding regarding the nature of each measure. The validity of all measures is deduced by comparing feature power evaluations on a toy data set to the relative importance of the features in the data set, which is known a priori. In addition, each feature power method is compared to existing measures like Gini importance and permutation importance with respect to behavioral qualities such as stability, sensitivity to hyperparameters, and dependency on data set properties. We discover that feature power achieves accurate results on simulated data and behaves most similarly to Gini importance when applied to real-world data (from the UC Irvine Machine Learning Repository). Furthermore, we find that coalition power produces a correct evaluation of group importance, though more time needs to be devoted to mapping the behavioral tendencies of this measure.

1.4 List of Contributions

The contributions of this thesis are the following:

i) A critique of existing variable importance and interaction measures

ii) A formulation of feature power: a variable importance measure for classification problems

iii) A formulation of coalition power: the importance of a group of features with respect to classification problems

iv) The empirical validation of correctness of these proposed measures

v) An analysis of the behavior of these proposed measures in comparison to existing measures

vi) A theoretical exploration into the nature of feature power

vii) A discussion of the extensibility of both measures to other classifiers

Chapter 2

Prior Work

Since the rise in popularity of black box algorithms and other automated systems, research regarding the dangers of opaque algorithms has received thorough attention. One example is [31], which presents a set of psychological studies about the relationship between system reliability, user reliance, and user trust. The authors argue that uninterpretable systems are more likely to be misused and disused than transparent algorithms. One of their studies demonstrates that users expect an automated system to outperform humans; and, even if the system makes fewer errors than a human, the user will stop trusting the system and choose to rely on their own methods. On the other hand, another study in [31] shows that if humans are able to justify why the system erred, they will trust it even if it makes mistakes. As a result of their findings, Dzindolet et al. propose that a system should both explain why it might err on decisions about which it is uncertain and explain the justification for correct decisions. The consequences of an uninterpretable system, however, can be more extreme than simply a decline in trust. A workshop of panelists in [63] discuss the need for explainability research due to the types of applications to which ML is now being applied. The participants point out that many models are used by non-ML experts in the biomedical domain and are applied to “high stake situations” like patient care, drug development, etc. As such, it is essential that a non-ML expert is capable of understanding the system in order to avoid placing too much or too little reliance on the model.

In response to the need for explainable systems, an area of research has emerged on the development of interpretable algorithms. While some develop entirely new interpretable classifiers, others choose to develop techniques to be applied on top of existing models, or what the authors of [74] call “augmentation through explanation.” The work in [52] is an example of the development of a new interpretable classifier, namely Bayesian Rule Lists (BRL). The authors develop the generative model and demonstrate its competitive performance with respect to other classifiers, particularly in the predictive medicine domain. They claim BRL strikes the perfect balance between predictive accuracy, interpretability, and tractability. On the other hand, the work in [12] suggests that the emphasis needs to be on defining and measuring the interpretability of models that are traditionally black boxes. The authors discuss the importance of answering the questions “what is interpretability?” and “how do we measure it?” as opposed to avoiding black box models altogether. This thesis will focus on answering the “how do we measure it?” question in the context of decision tree ensembles.

In [20], Brinton argues that interpretability is especially important in “safety- or life-critical applications” or when the chance of user distrust and consequent misuse is high. The authors of [66] agree that lack of user trust is the root of all misapplications of use. They attempt to build both types of trust—trust of an individual prediction and trust of the model—by developing two separate techniques to perform instance-based and model-based interpretability analysis. The work in [46] attempts to enhance user trust by introducing a technique for providing justifications of predictions made by collaborative filtering. The authors point out that they are limited to instance-based explanations because of the ad-hoc nature of collaborative filtering. They go on to mention that more rule-based systems like random forests are better suited for providing explanations regarding the entire model. The work in [5] argues that most classifiers answer what the label should be but not why the label was predicted and what features informed that instance of a decision. They further discuss the fact that decision trees are rare in that they can answer the “why.”

In studying the development of random forests, it is clear that the algorithm was designed with interpretability in mind. As [36] points out, the decision tree is the only classifier that automatically provides instance-based explanations with each prediction. Though these explanations may be very long and relatively uninterpretable to a human reader, they still provide insight into the decision-making process. Since random forests are made up of decision trees, the algorithm is able to leverage the inherent interpretability of decision trees, while boosting predictive power via the injection of randomness into the growing process.

Since Breiman’s introduction of random forests, numerous attempts have been made at interpreting the black box algorithm. For example, the authors of [27] introduce a new variant of the random forest algorithm that they argue is more theoretically sound and achieves empirical results competitive with Breiman’s algorithm. Since the model is more strongly based in theory, they claim that the variant is more interpretable. On the other hand, the authors of [45] and [62] develop frameworks for post-training analysis of a trained random forest in an attempt to explain the model’s behavior. More specifically, the authors of [62] take a user-centered approach by producing a one-page “explainability summary report” to supplement a random forest classifier. After formal usability studies involving both experts and non-experts, they found that the reports dramatically boosted the user’s understanding of the model and trust in the system. In [45], the authors take a more technical approach by using inherently interpretable algorithms to approximate a trained random forest. They then exploit the interpretability of the simpler model to gain access to information about the forest structure.

Despite being grouped with other black box algorithms, the random forest algorithm is unique because interpretability measures were developed alongside the original algorithm. In [17], Breiman develops a variable importance measure based on permutation in an attempt to theorize the “black box” of random forests, which he further extends in [16]. More specifically, Breiman gives a concise definition of permutation variable importance. He explains, “after each tree is constructed, the values of the variable in the out-of-bag examples are randomly permuted and the out-of-bag data is run down the corresponding tree” [17]. This process is repeated for all features and the misclassification rate is calculated. The variable importance is then the percent increase in misclassification rate in comparison to the rate when the out-of-bag samples (without feature permutation) are run down the decision tree. It is important to note that Breiman selects a data-driven approach to interpretability analysis. He argues in [16] that this is similar to the way that statistics are calculated, using data to gain insight into some underlying distribution.
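The permute-and-remeasure loop at the heart of Breiman's recipe can be sketched as follows. This is a deliberate simplification under stated assumptions: it probes a single fixed classifier on one data set rather than per-tree out-of-bag samples, and reports the absolute (not percent) increase in misclassification rate; the toy data and stump classifier are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: the label depends only on feature 0; feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def stump_predict(X):
    """A fixed, already-'trained' decision stump: split on feature 0 at 0."""
    return (X[:, 0] > 0).astype(int)

def permutation_importance(predict, X, y, n_repeats=20):
    """For each feature: permute its column, re-measure the error, and report
    the mean increase in misclassification rate over the baseline."""
    base_err = np.mean(predict(X) != y)
    imps = []
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature/label link
            increases.append(np.mean(predict(Xp) != y) - base_err)
        imps.append(float(np.mean(increases)))
    return imps

imps = permutation_importance(stump_predict, X, y)
# Permuting feature 0 destroys the signal and raises the error sharply;
# permuting the noise feature leaves predictions unchanged.
```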

Since Breiman’s work in 2001, research has exploded in the area of variable importance. The value of using a model with built-in interpretability measures is proven in [3], where the random forest algorithm is applied to microarray data. The authors claim that if a scientist wants an ML algorithm with predictive power and some level of interpretability, random forests is the best selection. Also in this domain, the authors of [57] and [72] examine the stability and reliability of existing variable importance measures. Nicodemus discusses the fact that Mean Decrease Gini (MDG, or Gini importance) is typically more stable than Mean Decrease Accuracy (MDA, or permutation importance) unless there are strong within-predictor correlations [57]. In addition, Strobl et al. discover in [72] that variable importance calculations can be misleading when features are on very different scales or have an inconsistent number of categories. The authors of [40] thoroughly analyze the behavior of permutation importance for both regression and classification, using a similar approach to the one that will be employed for this thesis. They observe the measure’s sensitivity to the number of data observations, the number of features, and the hyperparameters associated with the random forest training algorithm. Their key discovery is that the results of permutation importance on highly correlated variables can be misleading. The authors of [44] perform a similar behavior analysis but focus exclusively on random forest regressors. More specifically, they compare and contrast a linear model to two forms of random forests, one based on CART decision trees [19] and the other based on conditional inference trees [47], and evaluate them via the appropriate variable importance measures (permutation importance for the random forest models). Grömping finds that, in the context of random forests, the CART decision tree structure results in the most interpretable and accurate permutation importance results.

Similar to this thesis, several researchers have furthered the literature on variable importance by developing new variable importance measures. In [53], the authors introduce Mean Decrease Impurity (MDI), inspired by Breiman’s Gini importance [17, 18]. The difference between Gini importance and MDI lies in the fact that the work of Louppe et al. makes MDI generalizable to all impurity measures, not just the Gini index. The authors also provide a proof that the MDI of a feature is zero if and only if the variable is irrelevant, and that its removal does not affect the MDI calculated on any of the other, relevant features. The authors of [71] develop a new variable importance measure based on a “conditional permutation scheme.” They claim that the new measure handles highly correlated variables and overcomes some of traditional permutation importance’s tendencies to produce misleading results.

Though variable importance is clearly a well-researched topic, the literature on variable interaction is not as abundant. The authors of [65] introduce the idea that a discriminating classifier can be developed by measuring the variable interactions within a data sample, and show that this algorithm works particularly well for biological data. In [23], a variant of SVM called Binarized SVM (BSVM) is proposed, which is able to automatically detect which individual features are most important to an SVM model. The authors of [22] build upon this by creating another variant, Non-linear Binarized SVM (NBSVM), that is also able to detect the most important interactions between features. In the context of random forests, the evaluation of variable interaction began with Breiman when, in [17], he observes some unusual behavior in variable importance. He notices that two variables may carry the same information: either one is important in the absence of the other, but when the other is already present, adding the first does not further decrease the error rate. He and Cutler go on to define a Gini-impurity-based variable interaction measure in [18]. Similarly, in [49], Kelly and Okada develop a new variable interaction measure by borrowing Breiman’s original concept of permuting variables over out-of-bag samples and applying it to pairs of features, resulting in the calculation of pairwise variable interaction. They find that this formulation allows an easy extension to measuring n-ary interactions. In addition, such an approach can be applied to any supervised multi-class classifier based on bagging.

Our work is not the first time that a game-theoretic approach has been applied to machine learning. In [14], Bowling and Veloso extend the usage of Markov decision processes (MDPs) for reinforcement learning to the multi-agent case by applying the properties of multi-player stochastic games to the reinforcement learning domain. The authors of [24] use the Shapley value to develop an algorithm for feature subset selection. The algorithm elects to add or remove a given feature from the optimal feature set by evaluating the Shapley value of that feature. The Shapley value of a feature is calculated by considering the contribution of that feature to a performance measure (or a linear combination of multiple performance measures) with respect to the features in the optimal feature set. Note that this feature selection method is independent of any classifier.

Definitively the most similar to our work is the application of voting theory to random forests in [73]. In this work, the authors apply the Banzhaf power index to develop a new feature selection method for growing a random forest. Note that this paper is written with the intent to strategically build a random forest that extracts the greatest amount of information about a sample in the fewest number of nodes. The authors acknowledge that the power indices of voting theory work well due to their ability to consider the dependencies of features within groups, not just each feature’s importance as an individual. This helps justify our use of voting power as a variable importance evaluation technique. Furthermore, voting theory has never been applied to post-training analysis of random forests, making this work the first of its kind.

The research surveyed above shows that explainability of machine learning is key, and that many attempts have been made both at developing new interpretable ML models and at building tools to analyze existing black-box models. Variable importance and variable interaction are used for random forests in particular and have been thoroughly researched empirically, though they come in many forms and lack a strong theoretical foundation. The works of [49] and [62] are the only known instances of explainability analysis performed on a group of features, though neither quantifies the importance of these groups. Game theory has been applied to machine learning, though never in the context of interpretability analysis. From this overview of prior work, it is clear that i) variable importance and variable interaction are not well understood from a theoretical perspective, ii) voting theory has never been applied to post-training analysis of a trained random forest, and iii) there does not exist a measure for quantifying the importance of a group of features collectively.

Throughout this thesis, each of these three points will be addressed.

Chapter 3

Random Forests

Random forests are among the most powerful machine learning algorithms, in part due to their ability to handle data sets with a large number of variables and few observations [29, 71]. Developed in the early 2000s by Leo Breiman, the ensemble method has proven its effectiveness repeatedly on problems ranging from computer vision to pattern recognition. Primarily due to the randomness injected into the training procedure, random forests compete with popular algorithms like AdaBoost [9, 34] and Support Vector Machines [26], while preserving some amount of interpretability in the decision tree structure. The sections below introduce the random forest training and inference algorithms and derive existing variable importance and variable interaction measures.

3.1 The Random Forest Algorithm

Random forests consist of multiple binary CART-like decision trees [19], each independently constructed via the bagging meta-algorithm, i.e., using a bootstrapped1 subset of training data. Since the training of a random forest employs bagging, the samples that were not used in the training of a particular tree are called the out-of-bag (OOB) set with respect to that tree. Each tree is grown by considering a set of candidate features and selecting the feature and split value that produce the “purest” division of the respective training data, as defined by Gini impurity, entropy, or some other performance metric. The resulting node consists of an inequality on a feature and is added to the tree. Note that random forests can handle both continuous and discrete features. Randomness is injected into the forest in two ways: via the bootstrapped subset of observations used to construct each tree, and via the split of each node being based on a randomly selected subset of candidate variables. Once each tree in the forest is constructed, the classification of a given sample is predicted by following the decision path for the sample across all trees and allowing the trees to vote on the output, where majority rules. To be clear, a decision path is a path from root to leaf of a single decision tree, where the leaf is a class prediction. The decision path can also be extracted from the tree, becoming the conjunction of multiple inequalities that results in a particular class prediction.

1 Bootstrapping is a statistical method and refers to the random selection of samples from a data set with replacement. The result is a data set of the same size but with a high likelihood of omitted and/or repeated samples [32].
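The bootstrap step described above can be sketched in a few lines; the function name below is illustrative and not part of the thesis:

```python
import random

def bootstrap_split(n_samples, seed=0):
    """Draw a bootstrap sample (with replacement) of the index set
    {0, ..., n_samples - 1}; return in-bag indices and the out-of-bag set."""
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_split(10)
# in_bag has 10 indices, likely with repeats; oob contains the omitted
# samples -- roughly (1 - 1/n)^n, about 37% of the data for large n.
```

Each tree in the forest would be trained on its own in-bag sample, with the corresponding OOB set reserved for the error estimates used in Section 3.2.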

Hidden in the trained structure of the forest is information regarding the relative importance of the features in any decision the forest makes (regardless of the input); variable importance aims at quantifying that influence.

3.2 Variable Importance

There are many ways in which variable importance can be calculated, ranging from a simple count of the number of times a feature appears in a random forest (count importance) to more advanced techniques. The reason for such a variety of formulations is the fact that the meaning of an important feature is loosely defined and, to some extent, subjective. As Gromping points out, “there is no theoretically defined variable importance metric in the sense of a parametric quantity that a variable importance estimator should try to estimate” [44]. Here lies the motivation for developing a theoretically sound measure. In order to compare our proposed methods to existing measures, we describe two of the most widely used formulations of variable importance: Gini importance and permutation importance.2

Gini importance, also known as Mean Decrease Gini (MDG)3, is defined as the sum of the weighted Gini impurity decreases over all nodes in the forest containing the

2 Equations 3.1-3.9 are the result of an aggregation of [48], [53], and [49], whose notations were altered in some cases in order to allow for a fluid derivation of both variable importance and variable interaction measures.

3 Note that Gini importance is also referred to as Mean Decrease Impurity (MDI), though this term is an overgeneralization: MDI refers to the mean decrease of any impurity measure, whereas MDG refers specifically to a mean decrease in the Gini index.

feature of interest. The weighting is given by the proportion of observations from the randomly selected subset of training data that reach each corresponding node in the tree. Gini impurity is a measure of the probability that two samples selected at random, with replacement, belong to different classes; thus the Gini impurity of a data set S with K total classes is given by:

Gini(S) = \sum_{k=1}^{K} P(c(s) = c_k)\,\big(1 - P(c(s) = c_k)\big) = 1 - \sum_{k=1}^{K} P(c(s) = c_k)^2, \qquad (3.1)

where c(s) is the class label of a randomly selected observation s from set S and c_k is the k-th class. In addition, the Gini impurity reduction of a particular node n_i is defined as:

\Delta Gini(n_i) = Gini(P_{n_i}) - \frac{|L_{n_i}|}{|P_{n_i}|}\, Gini(L_{n_i}) - \frac{|R_{n_i}|}{|P_{n_i}|}\, Gini(R_{n_i}), \qquad (3.2)

where P_{n_i} is the partition of the training set for a given tree that reaches node n_i, and L_{n_i} and R_{n_i} are the data sets that reach the left and right subtrees resulting from the split at node n_i, respectively. Given that f_i is the feature of interest in a random forest, the aggregate Gini impurity reduction of feature f_i within one tree t can be calculated by:

agg\Delta Gini(t, f_i) = \sum_{n_j \in t \,:\, f(n_j) = f_i} |P_{n_j}|\, \Delta Gini(n_j), \qquad (3.3)

where f(n_j) represents the feature employed for the split at node n_j. From

Equation (3.3) it is simple to derive the Gini importance. Given a feature of interest f_i and the set of all trees in the random forest T, the Gini importance is given by:

Imp_{Gini}(f_i) = \frac{1}{|T|} \sum_{t \in T} \frac{agg\Delta Gini(t, f_i)}{|D_t|}, \qquad (3.4)

where D_t is the entire training set for tree t. It is important to point out that Gini impurity is typically what is used to determine the best node split when constructing random forests; therefore, it seems logical to use the same metric when determining the relative importance of the features post-training. As will be discussed in later chapters, however, this approach to variable importance limits the measure to random forests trained with the Gini index. Should adjustments be made to the classifier in an effort to boost performance, the justification for using Gini importance breaks down. Perhaps with this in mind, Breiman first proposed permutation importance as the primary variable importance measure for random forests.
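Equations (3.1) through (3.4) can be illustrated with a small sketch. The representation of a tree as a flat list of (feature, parent labels, left labels, right labels) node records is a simplification invented here for illustration; a real implementation would traverse an actual tree structure:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (Eq. 3.1)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def delta_gini(parent, left, right):
    """Impurity reduction of one split, children weighted by size (Eq. 3.2)."""
    n = len(parent)
    return gini(parent) - len(left) / n * gini(left) - len(right) / n * gini(right)

def gini_importance(forest, n_train):
    """Eqs. 3.3-3.4: sum |P_n| * delta_gini over the nodes split on each
    feature, normalise by the training-set size, and average over trees."""
    imp = Counter()
    for tree in forest:
        for feature, parent, left, right in tree:
            imp[feature] += len(parent) * delta_gini(parent, left, right) / n_train
    return {f: v / len(forest) for f, v in imp.items()}

# One-tree toy forest: the root node splits on feature "x" and separates
# the two classes perfectly, so all importance accrues to "x".
tree = [("x", ["a", "a", "b", "b"], ["a", "a"], ["b", "b"])]
print(gini_importance([tree], n_train=4))  # {'x': 0.5}
```

A pure node contributes an impurity of zero, so a split that separates the classes perfectly realises the full impurity of its parent as importance.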

Permutation importance, also known as Mean Decrease Accuracy (MDA), is based on randomly permuting the values of a feature over all out-of-bag (OOB) samples and considering the mean difference in accuracy. Let us define the following function, which counts the number of incorrect classifications of tree t on an arbitrary data set S:

CumErr(t, S) = \sum_{s \in S \,:\, t(s) \neq c(s)} 1, \qquad (3.5)

where t(s) is the resulting class label of tree t on sample s and c(s) is the correct

classification for sample s. Then, given that the OOB set for tree t is represented as O_t and the OOB set with randomly permuted values of feature f_i is represented as O_{t,i}, the difference in accuracy with respect to feature f_i is:

\Delta Err(t, f_i) = CumErr(t, O_{t,i}) - CumErr(t, O_t). \qquad (3.6)

From this formulation, the permutation importance formula can be derived as:

Imp_{prm}(f_i) = \frac{1}{|T|} \sum_{t \in T} \Delta Err(t, f_i). \qquad (3.7)

Despite its stochastic nature, the measure tends to produce relatively consistent results. Since it relies on stochastic permutations of the OOB samples, however, the measure can be considered arbitrarily reliant on particular instances of the data, rather than attempting to compute an underlying, unknown constant. The calculation of permutation importance also becomes prohibitive as the number of features in a data set grows large. Note that permutation importance is the only variable importance measure that is computable in a class-specific manner; that is, permutation importance can evaluate the importance of features with respect to a given class. This is particularly important when the data set is unbalanced and averaging across classes causes misleading variable importance conclusions.
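Equations (3.5) through (3.7) can be sketched as follows; the (predict_fn, X_oob, y_oob) triples are a hypothetical stand-in for a trained tree and its out-of-bag set:

```python
import random

def cum_err(predict, X, y):
    """Eq. 3.5: the number of misclassified samples."""
    return sum(1 for x, target in zip(X, y) if predict(x) != target)

def permutation_importance(trees_with_oob, feature_idx, seed=0):
    """Eqs. 3.6-3.7: average increase in OOB error count when the values
    of one feature are randomly shuffled."""
    rng = random.Random(seed)
    total = 0.0
    for predict, X, y in trees_with_oob:
        column = [row[feature_idx] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, column)]
        total += cum_err(predict, X_perm, y) - cum_err(predict, X, y)
    return total / len(trees_with_oob)

# Toy "tree" that thresholds feature 0 and ignores feature 1, so permuting
# feature 1 cannot change any prediction and its importance is 0.
predict = lambda x: "a" if x[0] < 0.5 else "b"
X = [[0.1, 9], [0.2, 8], [0.9, 7], [0.8, 6]]
y = ["a", "a", "b", "b"]
print(permutation_importance([(predict, X, y)], feature_idx=1))  # 0.0
```

The stochastic dependence on the particular shuffle drawn is visible here: repeating the call with different seeds can change the estimate for a feature that the trees actually use.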

Regardless of the measure employed, variable importance calculations have the potential to result in misleading conclusions, particularly when a set of features is highly correlated [40]. Therefore, computing variable interaction, in addition to variable importance, is crucial to building a complete understanding of the behavior of features in a trained forest.

3.3 Variable Interaction

The driving force for calculating variable interaction is the consideration that perhaps a single feature is relatively unimportant alone, but when grouped with other features, the coalition4 becomes disproportionately more powerful. Conversely, it is possible for a coalition to be less powerful than the sum of its members. Suppose two features demonstrate high variable importance values individually but provide the same information for classifications; then the importance of the two grouped together does not increase by any significant amount. Just as for variable importance, we describe the two most commonly used variable interaction measures: one based on Gini impurity and the other on random permutations of the OOB samples.

The variable interaction measure proposed by Breiman in [18] relies on the difference in the ranks of features after calculating the aggregate Gini impurity reductions from Equation (3.3). That is, for a given tree t \in T, one generates the sequence of

4 The term “coalition” is borrowed from cooperative game theory and refers to a group of features. The details regarding the exact way a coalition of features can be defined in the context of random forests will be discussed in Chapter 5.

features

(f_{t_1}, f_{t_2}, \ldots, f_{t_M}) \,:\; agg\Delta Gini(t, f_{t_1}) > agg\Delta Gini(t, f_{t_2}) > \cdots > agg\Delta Gini(t, f_{t_M}),

where M is the total number of features, and the function rank(t, f_i) refers to the index of feature f_i in the resulting sequence. Thus Gini interaction is given by:

Int_{Gini}(f_i, f_j) = \frac{1}{|T|} \sum_{t \in T} \big|\, rank(t, f_i) - rank(t, f_j) \,\big|. \qquad (3.8)
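A sketch of Equation (3.8), assuming the per-tree aggregate Gini reductions of Equation (3.3) have already been computed into one dict per tree (a hypothetical representation):

```python
def gini_interaction(per_tree_scores, fi, fj):
    """Eq. 3.8: mean absolute difference of the within-tree importance
    ranks of two features. `per_tree_scores` maps, for each tree, feature
    name -> aggregate Gini reduction in that tree."""
    total = 0
    for scores in per_tree_scores:
        order = sorted(scores, key=scores.get, reverse=True)
        rank = {f: i for i, f in enumerate(order)}
        total += abs(rank[fi] - rank[fj])
    return total / len(per_tree_scores)

# Two trees in which "x" and "y" swap ranks: their rank distance is 1 in
# each tree, so the interaction value is 1.0.
trees = [{"x": 0.9, "y": 0.5, "z": 0.1}, {"x": 0.2, "y": 0.7, "z": 0.1}]
print(gini_interaction(trees, "x", "y"))  # 1.0
```

Features that occupy nearby ranks in every tree yield a small value, which is the sense in which the measure captures interaction.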

Since permutation interaction revolves around the random permutation of not just one feature but two, we must first denote O_{t,i,j} as the set containing the OOB samples with both features f_i and f_j permuted. Directly following, as adopted from [49], we can define permutation interaction as:

Int_{prm}(f_i, f_j) = \frac{1}{|T|} \sum_{t \in T} \big( CumErr(t, O_{t,i}) + CumErr(t, O_{t,j}) - CumErr(t, O_{t,i,j}) - CumErr(t, O_t) \big), \qquad (3.9)

where CumErr is defined above in Equation (3.5). Similar to variable importance, the calculation of variable interaction is stochastic and highly dependent on the OOB samples. In contrast, the proposed feature power is a non-stochastic interpretability measure that relies exclusively on the structure of the random forest, by modeling the random forest as a voting game.5

5 The formal definition of a voting game will be given in the next chapter, after the introduction of sufficient terms from voting theory and cooperative game theory.
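Equation (3.9) can be sketched in the same style, again with hypothetical (predict_fn, X_oob, y_oob) triples standing in for trained trees and their OOB sets:

```python
import random

def cum_err(predict, X, y):
    """Eq. 3.5: number of misclassified samples."""
    return sum(1 for x, t in zip(X, y) if predict(x) != t)

def permute(X, idx, rng):
    """Return a copy of X with the values in column idx shuffled."""
    col = [row[idx] for row in X]
    rng.shuffle(col)
    out = [list(row) for row in X]
    for row, v in zip(out, col):
        row[idx] = v
    return out

def permutation_interaction(trees_with_oob, i, j, seed=0):
    """Eq. 3.9: per-tree average of CumErr(O_{t,i}) + CumErr(O_{t,j})
    - CumErr(O_{t,i,j}) - CumErr(O_t)."""
    rng = random.Random(seed)
    total = 0.0
    for predict, X, y in trees_with_oob:
        Xi = permute(X, i, rng)
        Xj = permute(X, j, rng)
        Xij = permute(Xi, j, rng)  # both features permuted
        total += (cum_err(predict, Xi, y) + cum_err(predict, Xj, y)
                  - cum_err(predict, Xij, y) - cum_err(predict, X, y))
    return total / len(trees_with_oob)

# For a classifier that ignores both features, every error count is equal
# and the interaction is exactly zero.
predict = lambda x: "a"
X = [[0, 1], [1, 0], [2, 2]]
y = ["a", "a", "b"]
print(permutation_interaction([(predict, X, y)], 0, 1))  # 0.0
```

The four error counts cancel whenever permuting the pair jointly costs no more than permuting each feature separately, matching the intuition behind Equation (3.9).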

Chapter 4

Voting Games

For the purpose of extending voting theory to random forests, two categories of voting systems are relevant: direct election and weighted voting. Direct election, where one person casts one vote for a candidate, is common in the real world1 since it guarantees that every voter has equal influence on the results. This influence is called voting power and is formally defined as “the probability that a single vote is decisive” [38]. When there are more than two candidates running in a direct election, however, it is possible that the candidate who receives the largest number of votes is not the most popular amongst the majority [58]. For example, suppose the majority of a community favors a set of political views and three candidates run: two with the same political views as the majority and one with a differing standpoint. The latter may end up winning because the two candidates with similar platforms divide the majority of votes amongst themselves.

1 One example of direct election is the selection process of U.S. Senators, where voters from a given state vote directly for the representative they prefer [1].

A weighted voting system, on the other hand, is one in which the number of votes belonging to a single voter is allocated based on the size of the population he or she represents, the number of shares of the company he or she owns, etc. In this type of voting structure, voters no longer share equal voting power (unless, of course, each voter is allocated an equal number of votes). A common misconception is that voting power is directly proportional to the number of apportioned votes [7]. For a detailed justification of why this is not the case, see the toy example in Section 4.1.4.

A weighted voting system presents several challenges that have been overlooked in history, resulting in unfair voting systems. One such example is the 1980 Japanese House of Representatives election, where votes were not reapportioned after World War II, failing to capture the movement of the population from rural to urban areas. Regardless of the fact that the majority supported urban candidates, the weights of the voting system did not reflect the country’s evolution and made it nearly impossible to elect urban representatives [42]. Even if the allocation process is justified and up to date, a weighted voting system is still not guaranteed to be fair. In Nassau County, New York, the number of votes allocated to each county to elect the Board of Supervisors was proportional to the county’s population. Nevertheless, the evaluation of voting power showed that two of the six counties had zero voting power [43]. In other historical examples, a voting-theoretic approach was used a priori to

ensure the creation of a fair voting system. One of the most significant examples is the formation of the Council of the European Union [51], which remains intact to this day. Ultimately, since voting power in a weighted voting system is not as simple as considering the weights associated with each voter, evaluating voting power is crucial both to ensure fairness and to formulate strategy [38]. The following sections introduce the theory necessary to evaluate voting power.

4.1 Introduction to Voting Theory

4.1.1 Definitions

As noted previously, the power of a voter relies on the decisiveness of their vote within a defined voting structure. A voting structure made up of n voters is typically denoted [q;\, w_1, w_2, \ldots, w_n], where q is the quota, or the threshold number of votes required to pass a motion, and w_i is the number of votes apportioned to voter v_i [51].
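A voting structure [q; w_1, ..., w_n] translates directly into code; the helper names below are illustrative, and a coalition is treated as winning when its combined votes reach the quota (the rule formalized in the next paragraph):

```python
from itertools import chain, combinations

def is_winning(coalition, quota, weights):
    """A coalition wins when its members' combined votes reach the quota."""
    return sum(weights[v] for v in coalition) >= quota

def winning_coalitions(quota, weights):
    """Enumerate every winning coalition of the structure [q; w_1, ..., w_n]."""
    voters = list(weights)
    subsets = chain.from_iterable(
        combinations(voters, r) for r in range(len(voters) + 1))
    return [s for s in subsets if is_winning(s, quota, weights)]

# A three-voter structure [4; 3, 2, 1]:
print(winning_coalitions(4, {"A": 3, "B": 2, "C": 1}))
# [('A', 'B'), ('A', 'C'), ('A', 'B', 'C')]
```

Enumerating coalitions this way is the basic subroutine behind both power indices introduced below.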

Each voter may cast his or her vote in favor of or against a particular bill, ultimately forming two groups of voters based on preferences. Any subset of voters is called a coalition, and, depending on the quota and the number of votes belonging to the coalition collectively, the coalition is either winning or losing. A winning coalition refers to a group of voters whose sum of weights is greater than or equal to the quota, whereas a losing coalition is a group whose sum of votes falls short of that threshold [69]. A pivotal voter is an individual whose joining of a losing coalition converts it to a winning one [69]. A voter is critical when that voter’s abandoning of a winning coalition makes it a losing one [6, 8, 67]. The subtlety that distinguishes between when a voter is pivotal and when a voter is critical motivates the formulation of two commonly used voting power measures: the Shapley-Shubik power index and the Banzhaf power index.

4.1.2 The Shapley-Shubik power index

The Shapley-Shubik power index was first introduced by L.S. Shapley and Martin Shubik in [69] for the purpose of evaluating voting power. Inspired by the Shapley value of cooperative game theory [70], the index was applied to political science by specializing the Shapley value to the context of a voting system. The Shapley-Shubik power index relies on the assumption that the order in which individuals cast their votes matters. The reasoning behind this is the fact that once sufficient votes are obtained to pass a bill, those who vote subsequently are unable to affect the outcome [69]. The index preserves sequential order by evaluating power with respect to all possible orderings in which the voters can cast their votes. More specifically, given a set of voters N, such that |N| = n, the voting power of a given voter v_i is given by:

\phi_{v_i} = \frac{piv(v_i)}{n!}, \qquad (4.1)

where piv(v_i) is the total number of times voter v_i is pivotal in all of the n! orderings of voters in N. In other words, the Shapley-Shubik power index is simply the

frequency with which a voter is pivotal [67].
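Equation (4.1) can be computed by brute-force enumeration of all n! orderings, which is feasible only for small n; the function below is an illustrative sketch:

```python
from itertools import permutations
from fractions import Fraction
from math import factorial

def shapley_shubik(quota, weights):
    """Eq. 4.1: phi_v = piv(v) / n!, where piv(v) counts the orderings in
    which voter v is pivotal, i.e. tips the running total past the quota."""
    voters = list(weights)
    pivotal = {v: 0 for v in voters}
    for order in permutations(voters):
        running = 0
        for v in order:
            running += weights[v]
            if running >= quota:      # v's vote is decisive in this ordering
                pivotal[v] += 1
                break
    n_fact = factorial(len(voters))
    return {v: Fraction(count, n_fact) for v, count in pivotal.items()}

# For the weighted structure [4; 3, 2, 1]:
print(shapley_shubik(4, {"A": 3, "B": 2, "C": 1}))
# {'A': Fraction(2, 3), 'B': Fraction(1, 6), 'C': Fraction(1, 6)}
```

Exactly one voter is pivotal in each ordering, so the indices always sum to one.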

4.1.3 The Banzhaf power index

John Banzhaf III countered the Shapley-Shubik power index ten years later, criticizing the index’s reliance on ordering. He argued that “it seems unreasonable to credit a legislator with different amounts of voting power depending on when and for what reasons he joins a particular voting coalition. His joining is a use of his voting power—not a measure of it” [6]. This contrasting perspective is seen in the formulation of the Banzhaf power index, in which the number of times a voter is critical is examined not across all orderings of voters, but rather over all coalitions of voters. Thus, given a set of voters N, such that |N| = n, the voting power of a given voter is given by:

\beta_{v_i} = \frac{crit(v_i)}{\sum_{v_j \in N} crit(v_j)}, \qquad (4.2)

where crit(v_i) is the total number of times that voter v_i is critical over all 2^n coalitions.2 The following section provides a simple example and walks through the calculation of both power indices to build intuition.

2 Note that it is common practice to divide \beta_{v_i} by the maximum possible number of times any voter could be critical (2^{n-1}), forming a nonnormalized Banzhaf power index [67]. Since this thesis focuses primarily on the Shapley-Shubik formulation, however, it is not necessary to derive this detail.
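The Banzhaf index can likewise be computed by enumerating all 2^n coalitions; again an illustrative sketch, feasible only for small n:

```python
from itertools import combinations
from fractions import Fraction

def banzhaf(quota, weights):
    """Eq. 4.2: beta_v = crit(v) / sum_w crit(w), where crit(v) counts the
    winning coalitions that turn losing when v leaves."""
    voters = list(weights)
    wins = lambda S: sum(weights[v] for v in S) >= quota
    crit = {v: 0 for v in voters}
    for r in range(len(voters) + 1):
        for S in combinations(voters, r):
            if wins(S):
                for v in S:
                    if not wins(set(S) - {v}):
                        crit[v] += 1
    total = sum(crit.values())
    return {v: Fraction(c, total) for v, c in crit.items()}

# For the weighted structure [4; 3, 2, 1]:
print(banzhaf(4, {"A": 3, "B": 2, "C": 1}))
# {'A': Fraction(3, 5), 'B': Fraction(1, 5), 'C': Fraction(1, 5)}
```

Unlike the ordering-based count in Equation (4.1), several voters can be critical in the same coalition, which is why the raw counts are normalized by their sum.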

4.1.4 Toy example

Consider the following simple voting structure: [4; 3, 2, 1]. This is a system with three voters (call them A, B, and C) who have been allocated 3, 2, and 1 votes, respectively. A bill is passed only when it is supported by 4 votes or more. From initial inspection, it is clear that no single voter has all the power3, since at least two voters are required to pass any bill. Recall that the voting power of an individual is not necessarily proportional to the number of votes he or she is apportioned. To illustrate this point, we calculate the voting power of A, B, and C using both the Shapley-Shubik power index and the Banzhaf power index, similar to the work in [67].

Based on the Shapley-Shubik methodology, the first step in calculating voting power is to list all possible orderings of the voters in N. For each ordering, the pivotal voter should be noted, that is, the voter whose addition to the sequence of voters turns a losing coalition into a winning one. In our particular example, there are 3 voters and thus 3! = 6 possible orderings. The pivotal voter of each ordering is marked with brackets, as seen below.

(A, [B], C)  (A, [C], B)  (B, [A], C)  (B, C, [A])  (C, [A], B)  (C, B, [A])

Counting the number of times each voter is pivotal yields 4, 1, and 1 for voters A, B, and C, respectively. Now, by applying Equation (4.1), the power indices for the three voters can be calculated as follows:

\phi_A = \frac{4}{6} = \frac{2}{3}, \qquad \phi_B = \frac{1}{6}, \qquad \phi_C = \frac{1}{6}.

3 If it is ever the case that one voter has sufficient votes to pass or block any bill no matter how everyone else votes, that voter is called a dictator [67].

Note that though B has more votes than C, the two voters share the same voting power by the Shapley-Shubik power index, supporting the claim that voting power is not proportional to the associated weights in a weighted voting system.

The voting power distribution of A, B, and C under the Banzhaf power index will prove different from that of the Shapley-Shubik power index, in part because the calculation requires listing not every ordering of voters, but rather the winning coalitions. For our example, there are 2^3 = 8 possible coalitions, 3 of which are winning. Within these winning coalitions, the critical voters (i.e., the voters whose desertion of a winning coalition makes it losing) are marked with brackets. Note that although there can be only one pivotal voter in a voter ordering, there can be multiple critical voters in a winning coalition. See below:

{[A], [B]}  {[A], [C]}  {[A], B, C}

Within this example, all members of the first two coalitions are critical, whereas only voter A is critical in the final coalition. Therefore, voters A, B, and C are critical 3, 1, and 1 times, respectively. By Equation (4.2), the voting powers are calculated as:

\beta_A = \frac{3}{5}, \qquad \beta_B = \frac{1}{5}, \qquad \beta_C = \frac{1}{5}.

Again, despite the fact that B has twice as many votes as C, the two have the same voting power via the Banzhaf power index. It is also important to point out that though the relative power ranking of voters A, B, and C remains consistent across the two measures (\phi_A > \phi_B = \phi_C and \beta_A > \beta_B = \beta_C), this will not always be the case.

4.2 Voting Theory in Random Forests

Random forest inference is sometimes described as a voting process amongst independent trees, where the class with the maximum number of votes is the resulting classification. In this setup, each tree has an equal amount of influence on the final output. More specifically, it is easy to see that in a forest of n_{tree} trees, each tree has 1/n_{tree} of the voting power. This is an example of a direct election system, as explained earlier. Perhaps less obvious is the fact that random forest inference is actually a two-tiered voting system, where the first tier governs the decision made by each individual tree and the second tier is direct election amongst trees. Just as the trees are voters in the direct election at the second tier, the features serve as voters in the first tier. This is very similar to the U.S. Senate, in which the first tier is made up of the people voting directly for a state senator and the second tier

is the equal representation of each state in the U.S. Senate. In this analogy, the people are the features of a single decision tree and the senators are the decision trees voting in the entire random forest. Features are not guaranteed to share equal voting power within a decision tree; if they were, each feature would be equally important, and the motivation for calculating variable importance would break down. Thus the voting structure in a single decision tree can be viewed as a weighted voting system, where the method for weight assignment is yet to be defined.

Considering that the primary difference between the Banzhaf power index and the Shapley-Shubik power index is the dependency on order, the selection of a voting power measure for any application should be based on the same principle. In a random forest, the order of features within the forest matters, since each subsequent node results in an additional split of the remaining training data.4 Therefore, it is reasonable to conclude that the Shapley-Shubik power index, which preserves voting order, is best fit to evaluate random forests as a voting structure. As suggested by Equation (4.1), which poses the Shapley-Shubik power index as proportional to the number of times the voter of interest is pivotal, it becomes essential to define what it means to be a pivotal feature; this leads to the necessity of determining the meaning of winning and losing coalitions in the context of random forests as well. Since there is no obvious way to draw this analogy, we turn to L.S. Shapley’s 1953 paper on the

4 The subtleties of this point will be revisited when evaluating coalition power in Chapter 8.

Shapley value [70] for the generalized version of the Shapley-Shubik power index in the context of cooperative game theory.

4.3 Introduction to Cooperative Game Theory

Game theory, or “the study of multiperson decision problems” [41], is often applied to economics, political science, computer science, psychology, and other fields. Dating back as far as 1713, game theory developed from a period of correspondence between Montmort and Waldegrave, in which the two discussed the strategy behind certain probability-based card games [10]. The field continued to develop from Cournot in the 1800s through Nash, who established the theory of non-cooperative games [30]. In a non-cooperative game, players form strategy by evaluating the individual payoff of every possible move (think: decision-making in chess). This is in contrast to a cooperative game, in which the result can only be evaluated by considering a group of players and their collective contributions [79]. Cooperative game theory was first proposed by von Neumann and Morgenstern in 1944 [77], when they chose to abandon the format of non-cooperative games in favor of a more coalition-based setup. With the exception of initial development by von Neumann and Morgenstern, the first major contribution to the evaluation of cooperative games was made by L.S. Shapley in the form of the Shapley value [70]. Though applicable to any cooperative game of n players, much of the measure’s popularity stems from a specialized version in the voting theory domain.

4.3.1 Definitions

The first distinction that must be made is between a game and an abstract game.

Where a game consists of rules and players, an abstract game consists of rules and roles, or placeholders for players. To be clear, a player is an actual person while a role could be “dealer” or “pitcher” [70]. It is typically the abstract game that is analyzed in game theory. An abstract game is denoted by the characteristic function v and can be evaluated as a real number in the context of a particular set of players.

In the case of a simple game, the value of v is constrained to lie in {0, 1}. It is common to model a voting system as a simple abstract game in which voters are players of the game and the game is evaluated based on the outcome of the vote. The properties of the game are defined by the predefined quota, the concept of winning and losing coalitions, and the weights assigned to each role. This is called a voting game [67, 79]. Note that in a voting system the only outcomes are winning or losing (i.e., pass the bill or block the bill), making it a simple game. Evaluating v for all possible scenarios (or “prospects”) provides insight into optimal strategy. In fact, a player can even evaluate the prospect of their participation in the game and use it to determine whether they want to play the game in the first place [70].

4.3.2 The Shapley value

The Shapley value is an index of a single player’s power within an instance of an abstract game [79]. The measure can be used to help a player develop strategy

and/or determine the game’s level of fairness. Since the value is with respect to a particular abstract game, the notation makes reference to the game v, where v is a set function that maps from a set of players to a real number representing the value of that coalition in the context of v. Consequently, the Shapley value of player p_i with respect to game v will be denoted as φ_i[v]. As derived in [70], the

Shapley value is defined as:

φ_i[v] = Σ_{S ⊆ N : p_i ∈ S} [(|S| − 1)!(n − |S|)! / n!] v(S) − Σ_{S ⊆ N : p_i ∉ S} [|S|!(n − |S| − 1)! / n!] v(S)    (4.3)

where v(S) is determined by the properties of the abstract game. As mentioned earlier, a voting game is an example of an n-person simple game, constraining v(S) to take a value of either 0 or 1 for any S. More specifically, in a voting game, the function v is given by:

v(S) = 1 if S is a winning coalition, or 0 if S is a losing coalition,    (4.4)

where winning and losing are defined by the quota q and the weights associated with each member of S. To put it simply, the Shapley value is a weighted difference between the number of winning coalitions to which player p_i belongs and the number of winning coalitions that exclude player p_i.

4.3.3 Continuation of toy example

Further extending the same example as in Section 4.1.4, where the voting structure is defined as [4; 3, 2, 1], the Shapley value for voter A can be calculated by first listing all coalitions of voters that include A and those that do not. Of the 2³ − 1 = 7 non-empty coalitions of voters, 4 include voter A and 3 do not. In particular, marking winning coalitions with an asterisk, the coalitions that include voter A are:

{A}   {A, B}*   {A, C}*   {A, B, C}*

and the coalitions that do not include voter A are:

{B}   {C}   {B, C}.

By applying Equation (4.3), it follows that:

φ_A[v] = (2 − 1)!(3 − 2)!/3! + (2 − 1)!(3 − 2)!/3! + (3 − 1)!(3 − 3)!/3! = 1/6 + 1/6 + 2/6 = 2/3,

(every coalition that excludes A is losing, so the second sum of Equation (4.3) vanishes)

which is the same voting power for A as calculated by the Shapley-Shubik power index. Similarly, the Shapley values for B and C coincide with the result of the

Shapley-Shubik formulation, as expected.
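The toy calculation above can be verified mechanically. Below is a minimal Python sketch of Equation (4.3) for a weighted voting game; the function name and encoding are our own:

```python
from itertools import combinations
from math import factorial

def shapley_voting(weights, quota):
    """Shapley value of each voter via Equation (4.3): winning coalitions
    containing p_i add (|S|-1)!(n-|S|)!/n!, winning coalitions excluding
    p_i subtract |S|!(n-|S|-1)!/n!; losing coalitions have v(S) = 0."""
    n = len(weights)
    phi = [0.0] * n
    for size in range(1, n + 1):
        for S in combinations(range(n), size):
            if sum(weights[j] for j in S) < quota:
                continue  # losing coalition: v(S) = 0 contributes nothing
            s = len(S)
            for i in range(n):
                if i in S:
                    phi[i] += factorial(s - 1) * factorial(n - s) / factorial(n)
                else:
                    phi[i] -= factorial(s) * factorial(n - s - 1) / factorial(n)
    return phi

# The [4; 3, 2, 1] voting structure: quota 4, weights 3, 2, 1
print(shapley_voting([3, 2, 1], quota=4))  # → [0.666..., 0.166..., 0.166...]
```

The output reproduces the voting powers of A, B and C computed above.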

The beauty of the Shapley value lies in the fact that it observes marginalism5 [79].

More specifically, the measure requires only a way to compute the marginal contribution of every player to a given coalition, i.e. the definition of the set function v; the remainder of the terms are based exclusively on the cardinality of subsets of players. This is in contrast to the Shapley-Shubik power index, which requires the definition of what it means to be pivotal. Based on the definition of v in Equation

(4.4), the game is evaluated by considering whether a set of players forms a winning or losing coalition. But as long as the system remains an abstract cooperative game, the Shapley value still applies, regardless of whether it maintains the same v. As such, in the feature power formulation, the function v will be redefined and the meaning of coalitions in random forests will be clarified.

5 Marginalism is a term borrowed from economics, referring to the measure of a commodity’s marginal utility [77].

Chapter 5

Feature Power

This chapter will derive feature power: a method for evaluating variable importance within a trained random forest. Feature power is computed by modeling each trained

CART decision tree as an individual voting game in which features are voters with a certain amount of power to influence classifications. The feature power results are then averaged over all CART decision trees to get a final evaluation with respect to the entire random forest. The motivation for developing this measure is to introduce a variable importance metric that is non-stochastic, data independent1 and strongly rooted in mathematical theory. Feature power will also be class-specific but can be aggregated across classes as well.

In an effort to apply the Shapley value to the evaluation of variable importance, the random forest algorithm must be perfectly modeled as an abstract voting game.

1 By the term “data independent,” we are not implying that the training process of a model is independent of data. We are emphasizing the fact that while Gini importance and permutation importance require a data set to probe the trained decision tree, feature power is able to calculate variable importance by simply measuring the structure of the model itself.

This process requires the rigorous definition of a voter, a coalition, and the function v in the context of random forests. Since the goal is to evaluate the power of a feature, it is obvious that a single feature in a forest should be represented as a voter in a voting structure. The definition of the remaining two terms, however, can be somewhat ambiguous. In fact, this ambiguity has led to the development of three separate feature power formulations, varying in computational complexity, theoretical accuracy to the Shapley value methodology, and general utility in the context of variable importance evaluation. The following section will introduce some critical decision tree terminology, followed by the derivation of each of the three feature power measures.

5.1 Definitions

5.1.1 Decision trees

A path in a decision tree is defined by graph theory to be a finite non-null sequence of vertices such that any two adjacent vertices in the sequence share an edge in the graph [13]. Consequently, a path p consisting of nodes n_0, n_1, and n_2 can be denoted as (n_0, n_2), a tuple of the starting and ending vertices of the path. The function d(p) gives the length of a path p. For example, if p is a path visiting nodes n_0, n_1, and n_2, then d(p) = 2, since there are two edges in total in this path. Thus d(p) + 1 is the number of nodes in path p (or the “cardinality” of the set of

nodes).

Though graph theory uses the term “vertices,” trees are typically described in terms of “nodes.” An internal node in a CART decision tree is a vertex that consists of both a feature and a split value. In other words, the node produces an inequality on a particular feature: data is partitioned and sent to the left or right subtree of that node depending on whether the inequality is satisfied. A leaf node in a CART decision tree is a vertex that represents a class label resulting from a particular decision path. In order to denote the feature corresponding to a particular node n_i, we use a function f(·), originally introduced in Chapter 3, that maps from an arbitrary node n_i to the feature governing that split. We assume this function maps to the corresponding class label if n_i is a leaf node. The distinction between a node and a feature is important because the same feature can appear in multiple nodes within the same forest, tree, or even decision path.

5.1.2 Random forest as a voting game

Though we have been speaking thus far of modeling a random forest as a voting game, the problem is really the modeling of a single decision tree as a voting game. In other words, the methods proposed are ways of calculating feature power on a single decision tree. In the context of random forests, the final feature power evaluation can be obtained by aggregating the results over all decision trees in the forest.

This distinction, however, provides a simple extension to other tree-based ensemble

classifiers. Recall that the Shapley value in Equation (4.3) sums over all possible

coalitions of players. Since the assumption is that features are voters in a random forest, it follows that a coalition must be a group of features. This would imply that feature power be calculated by summing over the powerset of the set of all features.

Figure 5.1: Toy decision tree.

But in order to conserve information about the structure of the trained decision tree, the coalition space must be more carefully defined with respect to each individual tree. To illustrate this point, consider the decision tree in Figure 5.1. It is clear that nodes n_1 and n_2 never co-occur in the same decision path, so it would be meaningless to evaluate a coalition whose members did not co-occur in the same path. In other words, the structure of the tree is needed to determine the value of a given coalition. In the context of voting theory, this can be thought of as a voting structure making it impossible for two voters to join a coalition together. For example, perhaps two shareholders of a company are competitors in their own businesses and thus will never vote in agreement with one another out of spite. As a result, there does not exist a value v for any coalition in which the two are both members, because such a coalition would never form.

Not only should the coalition space consist exclusively of coalitions of features that co-occur in the same paths, but coalitions should also be defined more rigorously.

This is because the coalition space uses set notation (since it is the set of all possible coalitions). By the mathematical definition of sets, no duplicates are permitted.

If coalitions were defined simply as the set of features that co-occur in the same path, information would be lost every time the same set of features diverges into two different class labels. For example, in Figure 5.1, consider the paths (n_0, n_7) and
(n_0, n_8). If coalitions were defined as sets of features, both would be denoted by { f(n_0), f(n_1), f(n_3) }. To preserve the fact that these are, in fact, two separate paths with two separate evaluations of v, we must ensure that both appear in the coalition space. As such, we define the coalition space as a set of paths (or sequences of nodes). Note that this is in contrast to the previous statement that coalitions should consist of features; the notation must be adjusted to ensure that all

information is conserved. Addressing these constraints on the coalition space, the following claim is made:

Claim 5.1. Given n_i and n_j, two arbitrary nodes of a decision tree t,

∃ a coalition C ∈ U_t | n_i, n_j ∈ C  ⟺  ∃ a path p ∈ t | n_i, n_j ∈ p.

This limitation placed on the coalition space of tree t, denoted by Ut (i.e. the universal set in the context of coalitions in t), is crucial to building the structure of a decision tree into the feature power formulation.

5.2 Derivation of Feature Power

Though we have given a general definition of the coalition space of a decision tree, one of the key differences between each of the three feature power methodologies is the severity of a length constraint on coalitions. More specifically, the first method

(pathPow) will require coalitions to consist of root-to-leaf paths only, whereas the remaining two methods (cumNodePow and strictNodePow) define a coalition as a path from the root to any internal node. Directly following from these assumptions, v is defined in both a binary way (i.e. a simple game of strictly winning and losing coalitions) and a probabilistic way (i.e. each coalition has a probability of winning and losing). Since a root-to-leaf path is deterministic in nature, already corresponding definitively to a particular class label, it makes most sense to apply

a binary v to this formulation. In contrast, a path that terminates at an internal node will have an associated probability of the path ending in a particular class label, given by the fraction of leaf nodes that have that class label. As such, a probabilistic v will be used for cumNodePow and strictNodePow.

Throughout the derivation of each of the three methods, assume that there exists a trained random forest T, where an arbitrary decision tree t ∈ T has root node n_0.

Let a path p = (n_0, n_i) begin at the root node of t and end at an arbitrary node n_i, which will be defined specifically for each method. This node can either be a leaf node representing a final class label or an internal node representing an inequality on a feature. Recall that both the class label and the feature can be acquired with the function call f(n_i). Suppose feature power is being evaluated with respect to the kth class, namely c_k.

5.2.1 Feature power method 1: path iteration

In the formulation of the path iteration method (pathPow), the coalition space U_t is limited to the root-to-leaf paths in t. In other words,

U_t = { (n_0, n_i) ∈ t | n_0 is the root node of t and n_i is a leaf node of t }.

Then, given a root-to-leaf path p = (n_0, n_i), v_{c_k}(p) is defined as the value of path p with respect to the abstract game v in which c_k is the winning class:

v_{c_k}(p) = 1 if f(n_i) = c_k, or 0 otherwise.    (5.1)

Recall that c_k is simply an arbitrary placeholder representing the class of interest.

Ultimately, feature power will be calculated with respect to every class label. With this in mind, let U_{t_i} be the set of coalitions that contain at least one node corresponding to feature f_i. Then U_t − U_{t_i} is the complement set. The feature power of f_i is given by

φ_{f_i}[v_{c_k}] = Σ_{p ∈ U_{t_i}} [(d(p) − 1)!(M − d(p))! / M!] v_{c_k}(p) − Σ_{p ∈ (U_t − U_{t_i})} [d(p)!(M − d(p) − 1)! / M!] v_{c_k}(p)    (5.2)

where M is the total number of features. Observe the similarity to Equation (4.3).

The primary difference in methods is the need to define a coalition space (instead of all subsets of the set of features) and change v from a set function to a “path function.”

Below is the pseudocode for calculating feature power for every feature in a single, trained tree t.

M = total number of features
for each feature f:
    power[f] = 0
    for each root-to-leaf path p in t:
        s = d(p)
        if ∃ a node n_i ∈ p | f(n_i) = f:
            power[f] += ((s − 1)!(M − s)!/M!) · v(p)
        else:
            power[f] −= (s!(M − s − 1)!/M!) · v(p)

We will now go through a toy example to demonstrate the methodology. Suppose we have the same tree from Figure 5.1. For the sake of simplicity, assume that there are a total of four features in the problem (i.e. M = 4). The first step is to define the coalition space for this particular decision tree. The coalition space is generated from the paths in the tree, color-coded in Figure 5.2.

Figure 5.2: Toy tree for pathPow methodology.

Thus the coalition space becomes:

U_t = { (n_0, n_7), (n_0, n_8), (n_0, n_4), (n_0, n_5), (n_0, n_6) },

where the elements of the set are written in order of the corresponding leaf nodes in the tree from left to right (i.e. the first element is from the left-most path of the tree and the last element is from the right-most path of the tree). Suppose we are calculating feature power with respect to class A. Then the only coalitions with a non-zero v are (n_0, n_7), (n_0, n_4) and (n_0, n_6). Assume each feature in the nodes of this tree is distinct and f(n_1) is the feature of interest. Then we can calculate

φ_{f(n_1)}[v_A] = 2!1!/4! + 1!2!/4! − 2!1!/4! ≈ 0.0833,

where the first term is calculated from (n_0, n_7), the second term from (n_0, n_4) and the last term from (n_0, n_6). Note that the last term is subtracted because there does not exist a node in path (n_0, n_6) whose split feature is f(n_1).
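The worked example can be reproduced in code. Below is a minimal Python sketch of the pathPow calculation, assuming a hypothetical encoding in which each root-to-leaf path is reduced to a (feature set, length, leaf label) triple and the four distinct features at nodes n_0, n_1, n_3, n_2 are numbered 0 through 3:

```python
from math import factorial

def path_pow(paths, M, target_class):
    """pathPow sketch (Equation 5.2): each path is a (features, d(p), leaf
    label) triple. The encoding of the tree as a path list is our own."""
    power = [0.0] * M
    for feats, d, label in paths:
        v = 1.0 if label == target_class else 0.0  # binary v of Eq. (5.1)
        for f in range(M):
            if f in feats:
                power[f] += factorial(d - 1) * factorial(M - d) / factorial(M) * v
            else:
                power[f] -= factorial(d) * factorial(M - d - 1) / factorial(M) * v
    return power

# Figure 5.1 toy tree, features 0-3 at nodes n0, n1, n3, n2 respectively:
paths = [({0, 1, 2}, 3, 'A'),   # (n0, n7)
         ({0, 1, 2}, 3, 'B'),   # (n0, n8)
         ({0, 1},    2, 'A'),   # (n0, n4)
         ({0, 3},    2, 'B'),   # (n0, n5)
         ({0, 3},    2, 'A')]   # (n0, n6)
print(round(path_pow(paths, 4, 'A')[1], 4))  # feature f(n1) → 0.0833
```

The printed value matches the hand calculation above.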

The advantage of pathPow is its simplicity and interpretability. The binary nature of the v function keeps the voting analogy tangible by maintaining a concept of winning and losing. The root-to-leaf constraint on coalitions may be a naive approach, however, since voting theory defines coalitions as any subset of voters, regardless of size. As will be seen in the following sections, the computational complexity of pathPow lies in the middle of the spectrum. More specifically, given m paths of average length l, the average case for the pathPow method is O(m · l).² The other consideration with the pathPow method is the fact that nodes at low depths get iterated over most often. Consider the root node, for example. The feature f(n_0) will be considered in every path iteration because n_0 is included in every path. As such, the power of f(n_0) will be boosted in comparison to other features that reside in nodes farther down the tree. Whether this should be considered an advantage or a disadvantage is unknown. On one hand, every data point passes through the root node, so the decision made at the root influences every classification. On the other hand, due to the random feature selection, it is not guaranteed that the root node holds the most influential feature in the data set. In fact, if this were the case,

² Note that in order for the term (M − d(p) − 1)! to exist, the longest path must be no more than length M − 1, where M is the total number of features in the data set. We enforce this by setting the maxdepth parameter in the random forest implementation.

every decision tree would look identical. More details regarding how the power of features at low depths is overweighted can be found in Chapter 7.

5.2.2 Feature power method 2: cumulative node iteration

In response to the criticism of a binary v function, the cumulative node iteration method (cumNodePow) defines v from a probabilistic perspective. If the same root-to-leaf coalition definition were applied to a probabilistic v, however, there would be no difference in methodology, since the class arrived at by a root-to-leaf path is deterministic. This leads us to iterate over all paths in the tree originating at the root and ending at an internal node. For clarification, these root-to-internal paths will be called subpaths. Let the coalition space for cumNodePow then be defined as the set of all subpaths:

U_t = { (n_0, n_i) ∈ t | n_0 is the root node of t and n_i is an internal node of t }.

In other words, the coalition space is the set of all subpaths that begin at the root and end at a node that does not correspond explicitly to a class label. Then, given a root-to-node path p that ends with node n_i and the subtree originating at node n_i, denoted by t'_{n_i}, v(p) is the fraction of class labels in subtree t'_{n_i} that match c_k. More formally, given a function L(t, c_k), which counts the number of leaf nodes in t that

match class label c_k,

v_{c_k}(p) = L(t'_{n_i}, c_k) / Σ_{j=1}^{K} L(t'_{n_i}, c_j),    (5.3)

where K is the total number of classes. Note that this is one way to define a probabilistic characteristic function in the context of decision trees. The reasoning for the selection of this formulation, as well as an alternative, is provided in Chapter 7. Given this new definition of v and U_t, feature power is defined similarly to pathPow in Equation (5.2). The only difference is the need for d(p) to become d(p) + 1, since the class label is no longer included in the path. With this adjustment, feature power for cumNodePow is defined as:

φ_{f_i}[v_{c_k}] = Σ_{p ∈ U_{t_i}} [d(p)!(M − d(p) − 1)! / M!] v_{c_k}(p) − Σ_{p ∈ (U_t − U_{t_i})} [(d(p) + 1)!(M − d(p) − 2)! / M!] v_{c_k}(p)    (5.4)
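The characteristic function of Equation (5.3) reduces to counting leaf labels beneath a node. Below is a minimal Python sketch, assuming a hypothetical nested-tuple tree encoding in which a leaf is a label string and an internal node is a (feature, left, right) triple:

```python
def leaf_fraction(node, c_k):
    """v_{c_k} of Equation (5.3): the fraction of leaves beneath a node
    labelled c_k. Assumes a hypothetical encoding where a leaf is a label
    string and an internal node is a (feature, left, right) triple."""
    def leaves(n):
        if isinstance(n, str):
            return [n]
        _feature, left, right = n
        return leaves(left) + leaves(right)
    labels = leaves(node)
    return labels.count(c_k) / len(labels)

# Subtree rooted at n1 in the toy tree: leaves A, B, A
n1 = (1, (2, 'A', 'B'), 'A')
print(round(leaf_fraction(n1, 'A'), 4))  # → 0.6667
```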

The pseudocode for computing the feature power using the cumulative node method is as follows:

M = total number of features
for each feature f:
    power[f] = 0
    for each subpath p in t:
        if ∃ a node n_i ∈ p | f(n_i) = f:
            s = d(p[n_0 : n_i]) + 1
            power[f] += ((s − 1)!(M − s)!/M!) · v(p)
        else:
            s = d(p) + 1
            power[f] −= (s!(M − s − 1)!/M!) · v(p)

To illustrate the usage of this algorithm, we will run through the same toy example, again calculating feature power with respect to class A.

Figure 5.3: Toy tree for cumNodePow methodology (the root is annotated with P(A) = 0.6).

The consequent coalition space is:

U_t = { (n_0), (n_0, n_1), (n_0, n_3), (n_0, n_2) },

where the elements are written in the order in which the last node of each subpath appears in a pre-order traversal of the tree, excluding leaf nodes. Since we are now considering subpaths, which are non-deterministic by nature, there are no coalitions that are definitively “winning” or “losing”; rather, each has an associated probabilistic value. Suppose again that each feature in the nodes of this tree is distinct and f(n_1) is the feature of interest. Then

φ_{f(n_1)}[v_A] = (1!2!/4!) · 0.66 + (2!1!/4!) · 0.5 − (0!3!/4!) · 0.6 − (1!2!/4!) · 0.5 ≈ −0.0944.

In this calculation, the first term is obtained from the path (n_0, n_1), the second term from (n_0, n_3), the third term from (n_0) and the last term from (n_0, n_2). The key advantage of this method is its strong alignment with the theoretical methodology laid out by the Shapley value. By considering all subpaths within a tree (of varying sizes), it is most similar to considering all subsets of voters in a voting structure.
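The toy calculation can be reproduced in code. Below is a minimal Python sketch of cumNodePow that uses the weights appearing in the calculation above, namely (s − 1)!(M − s)!/M! with s the number of nodes in the subpath; the subpath encoding is our own:

```python
from math import factorial

def cum_node_pow(subpaths, M, target_feature):
    """cumNodePow sketch. Each subpath is a (features, node count s, v)
    triple; both sums are weighted by (s-1)!(M-s)!/M!, the weights that
    reproduce the toy calculation (an assumption of this sketch)."""
    power = 0.0
    for feats, s, v in subpaths:
        w = factorial(s - 1) * factorial(M - s) / factorial(M)
        power += w * v if target_feature in feats else -w * v
    return power

# Subpaths of the Figure 5.3 toy tree: (features on subpath, node count, v_A)
subpaths = [({0},       1, 0.6),  # (n0)
            ({0, 1},    2, 2/3),  # (n0, n1)
            ({0, 1, 2}, 3, 0.5),  # (n0, n3)
            ({0, 3},    2, 0.5)]  # (n0, n2)
print(round(cum_node_pow(subpaths, 4, target_feature=1), 4))  # → -0.0944
```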

Unfortunately, as a result of this increase in iterations, the computational complexity of cumNodePow is significantly worse. Again suppose there are m paths of average length l. Then the average case for cumNodePow is O(m · l²), a worse computational complexity than that of pathPow. In addition, the possibility of overweighting features at low depths remains a concern for cumNodePow. This is because all subpaths originating at the root are searched. As a result, features close to the root are iterated over most often and have more terms added to their power evaluation. In fact, this overweighting phenomenon is even more exaggerated

in cumNodePow due to the consideration of all subpaths as opposed to just root-to-leaf paths.

5.2.3 Feature power method 3: strict node iteration

The strict node iteration method (strictNodePow) is the algorithm that minimizes computational complexity and avoids the overweighting of features at low depths.

It does this by iterating over nodes, like the cumulative node approach, but without rewarding features that appear earlier in the subpath. The first change is to the coalition space. From the perspective of strictNodePow, the constraint stated in Claim 5.1 remains logically correct but may leave the wrong impression. More specifically, a coalition no longer consists of a path (or subpath) but simply of a single node. The coalition space for strictNodePow then becomes:

U_t = { n_i | f(n_i) ∉ {c_1, ..., c_K} },

where K is the total number of classes. The coalition space, in other words, is the set of all non-leaf nodes in the tree. Directly following, v takes a form very similar to Equation (5.3), except that it now maps from a node, as opposed to a path (or subpath), to a value. Given an arbitrary node n_i,

v_{c_k}(n_i) = L(t'_{n_i}, c_k) / Σ_{j=1}^{K} L(t'_{n_i}, c_j).    (5.5)

Though a coalition consists of simply a single node n_i, the depth of the node in the tree is still used in the calculation of feature power. As a result, the calculation of feature power becomes:

φ_{f_i}[v_{c_k}] = Σ_{n_j ∈ U_t : f(n_j) = f_i} [(s_j − 1)!(M − s_j)! / M!] v_{c_k}(n_j) − Σ_{n_j ∈ U_t : f(n_j) ≠ f_i} [(s_j − 1)!(M − s_j)! / M!] v_{c_k}(n_j),    (5.6)

where s_j = d((n_0, n_j)) + 1 represents the number of nodes in the path from the root to the node of interest.

The pseudocode is as follows:

M = total number of features
for each feature f:
    power[f] = 0
for each node n_i in t:
    f = f(n_i)
    s = d(p[n_0 : n_i]) + 1
    power[f] += ((s − 1)!(M − s)!/M!) · v(n_i)
    for each feature g ≠ f:
        power[g] −= ((s − 1)!(M − s)!/M!) · v(n_i)

Consider the same toy example from before, where we assume every node corresponds to a unique feature.

Figure 5.4: Toy tree for strictNodePow methodology (the root is annotated with P(A) = 0.6).

Applying strictNodePow, the coalition space is given by:

U_t = { n_0, n_1, n_3, n_2 },

where the elements of U_t are written in the order obtained by pre-order traversal (the colors in Figure 5.4 correspond to these coalitions). From this, the feature power of f(n_1) with respect to class A is calculated by:

φ_{f(n_1)}[v_A] = (1!2!/4!) · 0.66 − (0!3!/4!) · 0.6 − (2!1!/4!) · 0.5 − (1!2!/4!) · 0.5 ≈ −0.1778.

In this example, the first term is obtained from node n_1, the second term from n_0, the third term from n_3 and the last term from n_2. As designed, strictNodePow successfully avoids overweighting features at low depths by iteration. As will be discussed in Chapter 7, however, there may be other aspects of the Shapley value that ensure an overweighting of both low- and high-depth features. The key selling point of the strict node feature power calculation is the improvement in computational complexity. For example, suppose that there are again m paths of average length l. Then, assuming that there are k nodes in the tree, it follows that k ≤ m · l, since every path shares nodes with other paths. Thus the average case for strictNodePow is O(k), which is less expensive than the previous two methods. The disadvantage of strictNodePow, however, is its disconnect from theory. The application of a cooperative game theory measure requires the use of coalitions, defined as groups of players. The strictNodePow method uses the size and value of a coalition, but fails to reward any member of a coalition other than the one that appeared last in the path.
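The strictNodePow calculation can be sketched the same way. The node encoding below is our own, and the weights are those used in the worked example, (s − 1)!(M − s)!/M! for every node:

```python
from math import factorial

def strict_node_pow(nodes, M, target_feature):
    """strictNodePow sketch: each internal node is a (feature, s, v) triple,
    where s is the number of nodes on its root path. The node's own feature
    gains (s-1)!(M-s)!/M! * v; every other feature loses the same weighted
    term (an assumption matching the worked example's weights)."""
    power = 0.0
    for f, s, v in nodes:
        w = factorial(s - 1) * factorial(M - s) / factorial(M)
        power += w * v if f == target_feature else -w * v
    return power

# Internal nodes of the Figure 5.4 toy tree: (feature, nodes on path, v_A)
nodes = [(0, 1, 0.6), (1, 2, 2/3), (2, 3, 0.5), (3, 2, 0.5)]
print(round(strict_node_pow(nodes, 4, target_feature=1), 4))  # → -0.1778
```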

5.3 Extension of Feature Power to Random Forests

All methods described above are in the context of a single decision tree. Conventional variable importance measures, however, provide an importance value for each feature that is the result of an aggregation over all decision trees in the trained forest. As such, we provide such a methodology. Let φ^t_{f_i}[v_{c_k}] represent any of the

three feature power calculations on decision tree t. Then the feature power of f_i with respect to class c_k, denoted by Φ_{f_i}[v_{c_k}] (note the capitalization of Φ), is given by

Φ_{f_i}[v_{c_k}] = (1/|T|) Σ_{t ∈ T} φ^t_{f_i}[v_{c_k}].    (5.7)

In other words, the final feature power is the average power of the feature across every decision tree in the forest.

The pseudocode below uses the function call feature_power in reference to any of the three feature power methodologies.

for each tree t in forest T:
    powers += feature_power(t)
powers = powers / |T|
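A minimal Python sketch of this aggregation, with tree_power standing in for any of the three per-tree methods (all names hypothetical):

```python
def forest_feature_power(trees, tree_power):
    """Equation (5.7) sketch: average the per-tree feature power vectors.
    `tree_power` is any of the three per-tree methods (hypothetical callable
    returning one power value per feature)."""
    per_tree = [tree_power(t) for t in trees]
    M = len(per_tree[0])
    return [sum(p[f] for p in per_tree) / len(per_tree) for f in range(M)]

# Two stand-in "trees" whose per-tree powers are already known:
powers = forest_feature_power([[0.2, -0.1], [0.4, 0.3]], tree_power=lambda t: t)
print(powers)  # ≈ [0.3, 0.1]
```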

5.4 Extension of Feature Power to Multi-Class Problems

Though voting results in a strictly binary outcome (winning or losing), random forests can be applied to multi-class problems. To extend feature power to handle more than two classes, we must ensure that the characteristic function v is able to handle more than two outcomes. The good news is that we have already developed feature power in such a way that it remains valid for the multi-class case. To explain this point further, we will describe the characteristic function v as a set of functions

whose cardinality is equal to the number of classes. More formally,

v = { v_{c_1}, v_{c_2}, ..., v_{c_K} },    (5.8)

where K = 2 for binary classification and K > 2 for all multi-class problems.

Following the same logic, the power of a given feature can be represented as a set of values, again corresponding to each of the K classes.

Φ_{f_i} = { Φ_{f_i}[v_{c_1}], Φ_{f_i}[v_{c_2}], ..., Φ_{f_i}[v_{c_K}] }.    (5.9)

In the following chapter, when displaying results for multi-class problems, the feature powers are aggregated across classes. By that we mean that the aggregate power of feature f_i is given by:

Φ_{f_i} = Σ_{k=1}^{K} Φ_{f_i}[v_{c_k}].    (5.10)

Note that the number of classes does affect the expected range of feature power values. That is to say, as the number of classes increases, the number of paths (or subpaths) that lead to the class of interest will likely decrease. As such, though the functions developed earlier for each method remain mathematically sound, the way in which the resulting values are interpreted changes, potentially leading to unexpected results for multi-class problems.

Chapter 6

A Comparative Study

Despite the careful derivation of feature power through the lens of cooperative game theory, the new measure’s correctness still needs to be validated experimentally. Doing so requires the application of all measures to a simulated toy data set, in which information is available about the true importance of the features. Note that this is in contrast to the real-world data sets from the UC-Irvine Machine Learning
Repository, in which the only information about variable importance is obtained by applying existing measures and using them as baselines. Since there are three separate formulations of feature power, we seek to find the differentiating qualities of each, along with their similarities to and differences from Gini importance, permutation importance and count importance. Supplemental experiments are performed on both toy and real-world data to measure the methods’ responses to random variance across individual forest instances and their sensitivity to hyperparameters.

6.1 Data Description

This study will use four separate data sets, both simulated and real, binary and multi-class, as described in Table 6.1. Note that the default hyperparameters used throughout this study are included in the table, chosen based on preliminary parameter tuning. Since the purpose of this study is not to achieve maximal predictive performance, but rather to explore the behavior of various measures, we do not need to perform an exhaustive grid search for optimal default hyperparameters. Assume that these hyperparameters are applied to every experiment unless otherwise specified. The data sets are described in more detail in the sections below.

Name of data set                      | n   | p   | feature types | # classes | ntrees | mtry
toys                                  | 100 | 200 | real          | 2         | 50     | 66
Wisconsin breast cancer (diagnostic)  | 569 | 30  | real          | 2         | 250    | 10
Wine                                  | 178 | 13  | int, real     | 3         | 100    | 4
Image segmentation                    | 210 | 19  | real          | 7         | 100    | 7

Table 6.1: Data set details.

6.1.1 Toy data set

Originally introduced by Weston et al. in [78] and used by Genuer et al. in [40],

the “toys data set” is a simulated data set, constructed for the purpose of having

a few features that are known to be predictive and many features that are not

informative.¹ Furthermore, in the spirit of a random forest, the toys data set is a binary classification example of when n ≪ p, or in other words, when the number of observations is far smaller than the number of features. In this particular case, n = 100 and p = 200, but only 6 of those 200 features are informative. More specifically, as described in [40], features 1 through 3 are very important to the prediction, while features 4 through 6 are moderately important. The remaining 194 features are noise. This data set will be used to benchmark the feature power methods against what is known about the data and against the results that Gini importance, permutation importance, and count importance achieve.

6.1.2 Real world data sets

The three real world data sets used were found in the UC-Irvine Machine Learning

Repository, available at [28]. The binary classification example is the “breast cancer

Wisconsin (diagnostic) data set”, which is a compilation of measurements made on 30 different characteristics found in breast cancer tumor images. The resulting class labels determine whether the mass is malignant or benign. Serving as the first multi-class problem is the “wine data set”. It is a 3-class problem in which the 13 features relate to different chemical attributes of the wine (i.e. alcohol, magnesium, hue, etc.) and the class label is the cultivar in Italy. The other multi-class data set is the “image segmentation data set,” which consists of randomly selected images of

¹ The toys data set is available at [39].

the outdoors. The images have been hand-segmented to extract information about each pixel. The class labels are 7 outdoor objects (sky, foliage, etc.).

6.2 Results

Throughout this chapter the following metrics will be used to evaluate the proposed feature power methods in comparison to existing variable importance measures: stability and sensitivity. We define stability as the mean Spearman’s rho of the top 7 ranked features between all pairs of 50 independently constructed random

forest instances. This metric will allow us to see a variable importance measure’s sensitivity to changes in the trained random forest instances. We define sensitivity

as the change in stability by varying hyperparameter values (ntrees or mtry). We

also use the term validity to refer to the qualitative correctness of feature power and

other variable importance measures when applied to data sets with ground truth variable importance (toys data set).
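The stability metric can be sketched as a mean pairwise rank correlation. Below is a minimal Python version, assuming no ties among importance values; for simplicity it correlates full rank vectors rather than restricting to the top 7 ranked features:

```python
from itertools import combinations

def spearman(a, b):
    """Spearman's rho for two importance vectors, assuming no ties:
    1 - 6*sum(d_i^2) / (n*(n^2 - 1)), where d_i is the rank difference."""
    n = len(a)
    def ranks(x):
        order = sorted(range(n), key=lambda i: x[i])
        r = [0] * n
        for pos, i in enumerate(order):
            r[i] = pos
        return r
    ra, rb = ranks(a), ranks(b)
    d2 = sum((ra[i] - rb[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

def stability(importance_vectors):
    """Mean pairwise Spearman's rho across independently trained forests."""
    pairs = list(combinations(importance_vectors, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical importance vectors from three forest instances:
runs = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2], [0.7, 0.1, 0.4]]
print(round(stability(runs), 3))  # → 0.667
```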

6.2.1 Validity

Upon learning about the nature of the 200 features in the toys data set, one can easily

build a model of what to expect from any accurate variable importance evaluation.

Provided that the random forest is constructed in a manner that takes advantage

of the underlying information in the data, the importance of features 1-3 should be

higher than the importance of features 4-6, while the importance of features 7-200 should be minimal. Figure 6.1 demonstrates the importance values calculated by each measure on the first ten features in the data set (i.e. variables 1-10) over 50 random instances of a trained forest. Observe that permutation importance, like feature power, can be calculated in a class-specific manner but is also available in an aggregate form. It is clear from the set of boxplots that the pathPow method is an outlier: it places nearly all importance on features 1 and 2, while feature 3 is rated essentially unimportant. This is in contrast to all other measures, which place feature 3 within the top 2 important features. This unusual behavior indicates that the binary nature of pathPow’s v function may limit its ability to identify numerous important features. On the other hand, strictNodePow and cumNodePow behave very similarly to one another and to Gini importance. When applied to the breast cancer and wine data sets, similar conclusions were drawn; the boxplots of strictNodePow and cumNodePow were most similar to those of Gini importance, while pathPow demonstrated more erratic results, often attributing most or all of the power to a small number of features.

The difference between pathPow and the other feature power methods is again obvious when applied to the image segmentation data, as seen in Figure 6.2. Note that since the image segmentation data has 7 classes, the results for each measure have been aggregated across all classes. The full class-specific boxplots for all data sets can be found in Appendix B. The fact that pathPow seems to undervalue the


Figure 6.1: Importance values of first ten features of toys data. Top 6 plots correspond to the three feature power methods, while bottom 5 plots correspond to existing variable importance measures.


Figure 6.2: Importance values of features of image segmentation data. Note that the values for the top 3 plots (titled pathPow, strictNodePow and cumNodePow) are obtained by applying the three feature power methods.

importance of features 2 and 19, which all other measures identify as important, indicates that pathPow may be unreliable. Also observe that strictNodePow and cumNodePow agree less strongly with Gini importance on this data. As will be seen in the following sections, the image segmentation data set is likely an example of where the feature power methodology breaks down. In addition, observe that pathPow assigns significantly smaller feature power values than strictNodePow and cumNodePow. Further discussion on the ranges of all variable importance and feature power measures can be found in Chapter 7.

To investigate the agreement between methods further, the pairwise correlation

between all measures was evaluated. More specifically, given the top seven ranked features by each method, a pairwise rank-order Spearman’s rho was computed and averaged over 50 random forest instances to obtain a mean rank-order correlation

between each measure. That is, given a trained decision tree $t$ and a ranking of features within that tree $f_{t_1}, \ldots, f_{t_M}$ such that

$$Imp(f_{t_1}) > Imp(f_{t_2}) > \ldots > Imp(f_{t_M}),$$

where $Imp \in \{Imp_{pow}, Imp_{Gini}, Imp_{perm}, Imp_{cnt}\}$, the top 7 ranked features are $\{f_{t_1}, f_{t_2}, \ldots, f_{t_7}\}$. The correlation results when applying all methods to the toys data set are depicted in Figure 6.3. Upon inspection, it is clear that the pathPow method stands on its own, not significantly correlated

with any measure other than itself. In addition, it is evident that the red block between the strictNode and cumNode approaches indicates high pairwise correlation, particularly when measured with respect to the same class. These two feature power methods are also most correlated with Gini importance. The fact that none of the feature power methods is highly correlated with count importance indicates that the combinatorial nature of feature power is not a hindrance that makes it too reliant on the count of feature presence in the forest. In fact, Gini importance is most correlated with count importance, despite Gini’s data-driven approach. As expected, the class-specific permutation importance measures are most correlated with the aggregate form of permutation importance. When applied to the breast cancer and wine data sets, the results remained consistent with the above conclusions. That is, strictNodePow and cumNodePow were most correlated with one another and somewhat correlated with Gini importance. In Figures 6.4 and 6.5, it is clear that the same relationship between strictNodePow, cumNodePow and Gini remains. Note that, similar to Figure 6.2 but unlike Figure 6.3, the correlation values are again averaged across classes for ease of viewing. This explains why the diagonals in Figures 6.4 and 6.5 are not solid red. In other words, less red indicates larger variation across classes. When applied to the image segmentation data set, the correlation results reinforce the fact that strictNodePow and cumNodePow are most correlated to Gini but also indicate that pathPow may be more closely correlated to Gini importance. Whereas in the other data sets pathPow seemed relatively isolated, the


Figure 6.3: Pairwise rank-order correlation on top 7 ranked features of toys data. Headings “path,” “strict,” and “cum” correspond to the three feature power methods.

method now appears just as related to existing variable importance measures as the other two feature power methodologies. See Figure 6.6 below. When examining Figures 6.3-6.6, there are obvious inconsistencies in the behavior of Gini, permutation and count importance. Both the toys and image segmentation data sets suggest that count importance and Gini importance are most correlated, whereas breast cancer results indicate that count importance is most similar to permutation importance.

Further complicating matters, the wine data set demonstrates the strongest correlation between Gini and permutation importance. This inconsistency in existing and


Figure 6.4: Pairwise rank-order correlation on top 7 ranked features of breast cancer data, averaged across classes. Headings “path,” “strict,” and “cum” correspond to the three feature power methods.

widely used measures indicates that the behavior of variable importance metrics can change across data sets. As pointed out by [72], existing variable importance measures often produce misleading results when a data set consists of multiple types of features. The wine data set contains both integer and real number features, potentially explaining unexpected results when comparing feature power to existing measures. Tables containing all values (before averaging across classes) for the toys, breast cancer, and wine data sets can be found in Appendix C.
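One plausible reading of the top-7 pairwise comparison described above can be sketched with SciPy; the function and data below are my own illustrative assumptions, not the thesis code.

```python
import numpy as np
from scipy.stats import spearmanr

def top_k_rho(imp_a, imp_b, k=7):
    # Restrict attention to the k features ranked highest by measure A,
    # then compute Spearman's rho between the two measures' scores there.
    top = np.argsort(imp_a)[::-1][:k]
    rho, _ = spearmanr(imp_a[top], imp_b[top])
    return rho

# two hypothetical importance vectors over ten features
a = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])
```

Averaging this quantity over the 50 forest instances would yield the reported mean rank-order correlation between a pair of measures.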


Figure 6.5: Pairwise rank-order correlation on top 7 ranked features of wine data, averaged across classes. Headings “path,” “strict,” and “cum” correspond to the three feature power methods.

6.2.2 Stability

Though the feature power methods achieve reasonable results, there is still not substantial evidence to prove feature power a viable variable importance measure.

As discussed in previous chapters, each existing variable importance formulation has its strengths and weaknesses. The next two sections will explore the behavior of feature power under varying conditions to determine its utility and applicability.

We begin by evaluating the stability of each measure, where stability is defined as the mean rank-order correlation of the top 7 ranked features between all pairs of 50 independently constructed random forest instances. When applying all methods to


Figure 6.6: Pairwise rank-order correlation on top 7 ranked features of image segmentation data, averaged across classes. Headings “path,” “strict,” and “cum” correspond to the three feature power methods.

the toys data set, we find that feature power is more stable than the existing variable importance measures. As seen in Figure 6.7, both strictNodePow and cumNodePow demonstrate a high inter-instance correlation, with Gini and pathPow following closely behind. Both permutation and count importance demonstrate a rank-order correlation of less than 0.5.
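Stability as defined above can be sketched as the mean of the top-7 Spearman's rho over all pairs of forest instances; this is a sketch under one reading of that definition, and the names are my own assumptions.

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def stability(importances, k=7):
    # importances: one importance vector per independently trained forest.
    # Mean Spearman's rho, over all pairs of instances, computed on each
    # pair's top-k features (ranked by the first instance of the pair).
    rhos = []
    for a, b in combinations(importances, 2):
        top = np.argsort(a)[::-1][:k]
        rho, _ = spearmanr(a[top], b[top])
        rhos.append(rho)
    return float(np.mean(rhos))
```

In this reading, 50 forests yield 50 importance vectors and 1225 pairwise rho values whose mean is the reported stability.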

The implication of these results is that strictNodePow and cumNodePow are least sensitive to the randomness that occurs as a result of the random forest tree growing process. This ensures a level of reliability in the feature power evaluation, minimizing the chance of a variable importance measure that is simply an artifact of the

randomness, and not of the underlying information in the data. When applied to the


Figure 6.7: Average rank-order correlation on top 7 ranked features over 50 RF instances on toys data. Headings “path,” “strictNode,” and “cumNode” correspond to the three feature power methods.

breast cancer data set, the same conclusions about strictNodePow and cumNodePow persist, though the stability of pathPow falls far below all other measures. See Appendix B for more detailed results. Both the wine and the image segmentation data sets, on the other hand, produce contradicting results. As seen in Figure 6.8, pathPow demonstrated highest stability in comparison to its counterparts, with strictNodePow and cumNodePow behaving similarly to the other measures. Note,


Figure 6.8: Average rank-order correlation on top 7 ranked features over 50 RF instances on wine data. Headings “path,” “strictNode,” and “cumNode” correspond to the three feature power methods.

however, that high stability means nothing if a method is converging to the wrong answer. In other words, given the boxplot results on the wine data set in Figure

6.9, it is clear that strictNodePow and cumNodePow still achieve results most consistent with Gini importance (likely correct), though pathPow demonstrates highest stability. These results may indicate that the nature of the feature power methods changes when applied to a multi-class problem, making strictNodePow and


Figure 6.9: Importance values of features of wine data. Note that the values for the top 3 plots (titled pathPow, strictNodePow and cumNodePow) are obtained by applying the three feature power methods.

cumNodePow highly unstable, though still more correct than pathPow. See Appendix B for stability results with respect to the image segmentation data set.

6.2.3 Sensitivity to hyperparameters

Hyperparameters are essential to the performance of any parametric machine learning model. Random forests, in particular, are primarily controlled by ntrees, the number of trees in the forest, and mtry, the number of candidate features to consider at each node split. Though previous results were obtained using default parameters, in this section we abandon the defaults and iterate over the ntrees and mtry space to evaluate the effect on stability.
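The sweep can be sketched with scikit-learn, whose `n_estimators` and `max_features` play the roles of ntrees and mtry; the data, grid, and spread-based stability proxy below are illustrative assumptions, not the thesis setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data standing in for a real data set
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

spreads = {}
for ntrees in (10, 50, 100):                     # the ntrees axis of the sweep
    imps = [RandomForestClassifier(n_estimators=ntrees, max_features="sqrt",
                                   random_state=seed).fit(X, y).feature_importances_
            for seed in range(5)]                # independent forest instances
    # crude stability proxy: spread of importances across the instances
    spreads[ntrees] = float(np.std(imps, axis=0).mean())
```

The same loop over `max_features` values would give the mtry axis of the sweep.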

Sensitivity to ntrees

To evaluate each method’s sensitivity to ntrees, we let mtry maintain its default value and perturb ntrees to determine the effect on stability as ntrees grows large. It is clear from Figure 6.10 that, with respect to the toys data, count importance demonstrates the largest stability when ntrees is small, but its stability never increases as ntrees grows large. In contrast,

Gini, pathPow, strictNodePow, and cumNodePow all experience a dramatic spike in stability once ntrees reaches a value of 45, leveling off and stratifying into the groups [strictNodePow, cumNodePow], [pathPow, Gini] and [Perm, Count]. This level of reliability demonstrated by both the strictNodePow and cumNodePow methodologies

is extremely desirable, provided that ntrees is sufficiently large to experience such a boost in stability. Perhaps the most shocking of these results, however, is the


Figure 6.10: Rank-order Spearman’s rho sensitivity to ntrees on toys data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures.

uncanny similarity in behavior between cumNodePow and strictNodePow. Despite strictNodePow’s relative theoretical inconsistency in the context of voting theory and its decreased computational complexity in comparison to both other measures, it appears to approximate cumNodePow extremely well. More on this can be found in the following chapter. When applied to the breast cancer data, the same trend was seen in strictNodePow and cumNodePow but pathPow remained completely


Figure 6.11: Rank-order Spearman’s rho sensitivity to ntrees on breast cancer data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures.

insensitive to varying values in ntrees. See Figure 6.11.

At this point, it appears that given a forest with sufficient trees, cumNodePow and strictNodePow have a higher tendency to converge to some feature ranking than the other measures. The results from the wine and image segmentation data sets, however, question this claim. As seen on the wine data set in Figure 6.12, pathPow experiences the largest spike in stability as a result of incrementing ntrees, leaving cumNodePow and strictNodePow at a ranking correlation around 0.2. This remains


Figure 6.12: Rank-order Spearman’s rho sensitivity to ntrees on wine data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures.

consistent with our stability analysis on this data from the previous section. When applied to the image segmentation data, the same low stability of strictNodePow and cumNodePow is seen. In addition, count importance, which is typically considered a naive approach to variable importance, appears to converge to some feature ranking around ntrees = 150 (see Figure 6.13). Again, it is unknown whether these contradicting results are a product of the extension to multi-class problems in the wine and image segmentation data sets, or whether there is something else inherently


Figure 6.13: Rank-order Spearman’s rho sensitivity to ntrees on segmentation data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures.

different about the data that makes the feature power methods behave unexpectedly.

Sensitivity to mtry

In a similar spirit, stability analysis was performed on random forest instances constructed with the default ntrees and varying mtry values. As seen in Figure

6.14, pathPow experiences the greatest increase in stability as mtry is incremented.

In fact, if mtry > 130, pathPow actually surpasses all other measures in stability. Again, it is surprising how closely the behavior of strictNodePow matches cumNodePow considering the inherent differences in their methodologies. One possible explanation is to consider the fact that both strictNodePow and cumNodePow share the same probabilistic characteristic function. The two measures differ dramatically in the method for computing the number of combinatorial chances. As such, it is reasonable to conclude the Shapley value relies more heavily on the definition of the characteristic function v than on the combinatorial coefficient. When


Figure 6.14: Rank-order Spearman’s rho sensitivity to mtry on toys data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures.

applied to real data, the same behavior is no longer witnessed with respect to the

pathPow method. As seen in Figure 6.15, the stability of pathPow remains relatively low, whereas cumNodePow and strictNodePow both experience a spike in stability around the default mtry value, as specified by Breiman in [17]. Note the difference in the plots depicting sensitivity to ntrees and mtry. It appears that after ntrees reaches a certain value, the stability over varying ntrees is more consistent than the stability over varying mtry. This may be an indication that variable importance measures are more sensitive to the randomness induced by mtry than to that induced by ntrees. When applied to the wine and image segmentation data, however,


Figure 6.15: Rank-order Spearman’s rho sensitivity to mtry on breast cancer data. Note that all lines defined on the left side of the legend correspond to feature power, whereas the lines on the right correspond to existing variable importance measures.

there appeared to be little dependence of stability on increasing mtry values. See

Appendix B for these plots.

6.2.4 Summary of results

Further extending Table 6.1, we now include the qualitative results from the previous sections in Table 6.2. Observe that image segmentation remains the outlier in terms of behavior, though the other multi-class data set, wine, appears to agree in terms of the most stable method. This could be an indication that the binary v definition in the pathPow method produces more stable results than the probabilistic v definition on multiple classes. A proposed justification for this is the fact that as the number of classes grows, the probability of arriving at a particular class at a given node is likely divided among multiple classes, resulting in smaller overall values. This is an area that needs to be researched more thoroughly before making a decisive conclusion. The other interesting point is that there does not seem to be a relationship between the qualitative results and the predictive accuracy shown in Table 6.2. In other words, the image segmentation problem, which produced the most unusual results, was not the most difficult data set for the random forest. In fact, the toys data set, which caused behavior in line with both the breast cancer and wine results in most categories, experienced the random forest’s worst predictive performance. See Appendix B for a visualization of the effect of OOB accuracy on each method.

6.3 Discussion

Though cumNodePow is most theoretically elegant, strictNodePow seems to be a decent approximation of that measure while significantly reducing computational complexity. There are particular data sets, however, in which both methods produce inconsistent results across random instances of a trained forest. These are the scenarios in which pathPow appears to converge to some feature ranking, though likely not the correct ranking. Many questions remain regarding the qualities of data sets that make each method behave in particular ways. This is an area for future research, as will be discussed in Chapter 9. Feature power in general is unique in that it is not data-driven, meaning that, assuming we begin with a trained decision tree, no further data is needed to compute the power of features. Both Gini and permutation importance require a data set to calculate variable importance, lending some amount of dependency to the selection of data samples. Feature power, on the other hand, is a completely non-stochastic, data-independent measure that is guaranteed to achieve the same results upon each calculation.

Name of data set                      n    # features  Feature types  # classes  Acc   Most Gini-like  Most stable
toys                                  100  200         real           2          0.87  st, cu          st, cu
Wisconsin breast cancer (diagnostic)  569  30          real           2          0.95  st, cu          st, cu
Wine                                  178  13          int, real      3          0.96  st, cu          pa
Image segmentation                    210  19          real           7          0.90  st, cu          pa

Table 6.2: Summary of results. Note that “pa” corresponds to pathPow, “st” to strictNodePow and “cu” to cumNodePow.

Chapter 7

Theoretical Implications of Feature Power

Despite the theoretical foundation that feature power has been built upon, the fact that three separate measures could be formed from the same equation demonstrates some level of ambiguity. In addition, the results illustrated in the previous chapter pose some interesting questions regarding the mathematical properties of each of the three measures. We hope to answer some of these questions by returning to theory. In this chapter, we explore the theoretical implications of each of the three feature power measures, first by mapping back to the voting theory domain and then by exploring the overweighting of low-depth features, the axiomatic properties of feature power, the range of values that can result from variable importance and feature power evaluation, and another form of the probabilistic characteristic function.

7.1 A Mapping to Voting Theory

Due to the fact that all three feature power measures were derived from the same equation in cooperative game theory—namely the Shapley value—it becomes an interesting challenge to attempt to map the methodology for each of the three measures back to the voting theory domain and determine their differences with respect to the voter’s power calculation. We do so by developing stereotypes for each measure that describe the type of voter that each method would theoretically favor.

7.1.1 The Procrastinator: pathPow

PathPow is the measure that favors the procrastinating voter. By iterating over every root-to-leaf path and not every subpath, it is equivalent to assigning a v value of 0 to every subpath. In other words, it is not until every feature is present in the path that the path gets a classification. This is realistic in the context of random forests, though the concept breaks down when mapped back to voting theory. This is essentially stating that every voter in the coalition is pivotal in every ordering, which is mathematically impossible. As such, pathPow in some ways bears more resemblance to the Banzhaf power index and its use of the concept of a critical voter. By definition, there can be multiple critical voters in the same coalition (see toy example in Section 4.1.4).

Borrowing the concept of being a critical voter, if in the pathPow methodology we

assumed that every feature is critical to a path, the mapping back to voting theory becomes quite elegant. More specifically, if pathPow is applied to a structure of voters, it measures the difference of the following two terms: (1) the number of times a critical voter is able to vote last in that coalition and (2) the number of times the members of a winning coalition excluding that voter are able to cast their votes. In other words, pathPow is a measure of a voter’s chance of exercising a

“procrastination” technique on a group that depends on them minus the number of times they are not needed.

7.1.2 The Dreamer: cumNodePow

Like pathPow, cumNodePow also iterates over features at low depths most often. In fact, the only emphasis that either method places on the order of features witnessed in a path comes from this repeated iteration. The actual ordering of the features in the path (or subpath) is not considered when calculating feature power. This is because the combinatorial nature of the Shapley value makes the measure evaluate the marginal contribution of each feature with respect to every possible ordering of features within a given subpath. Note that this method is most faithful to the methodology laid out by the Shapley value in [70]. Just as in the calculation of the

Shapley value, where all possible orderings are exhaustively evaluated, the same is true for cumNodePow.

The methodology of cumNodePow diverges from the Shapley value formulation

in the evaluation of the characteristic function v. The cumulative node iteration approach uses a probabilistic definition of v and thus ends up measuring a fuzzier “pivotal-ness” as opposed to forcing a voter to be pivotal to a sequence or not. More specifically, cumNodePow measures the degree of “pivotal-ness” of a feature over all hypothetical permutations of the features in the same path, though those orderings may not have been witnessed in the path explicitly. Here lies the justification for why cumNodePow is the dreamer; it measures orderings that could have been. This is in contrast to strictNodePow, which, as we will see, places the emphasis on the actual order witnessed within a given subpath. When mapping back to a voting game, the function v is constrained to values in {0, 1} and thus becomes simply a count of the number of times a voter is pivotal over all possible permutations. In other words, this is simply the Shapley-Shubik power index.
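Since the Shapley-Shubik index is just this pivotal count normalized by the number of orderings, it can be computed by brute force for a small weighted-voting game; the weights and quota below are a textbook-style assumption, not from the thesis.

```python
from itertools import permutations

def shapley_shubik(weights, quota):
    # For each ordering of voters, credit the voter whose vote first pushes
    # the running total past the quota (the pivotal voter).
    n = len(weights)
    counts = [0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        total = 0
        for voter in order:
            total += weights[voter]
            if total >= quota:
                counts[voter] += 1
                break
    return [c / len(orderings) for c in counts]

# illustrative game: two 49-vote voters and one 2-vote voter, quota 51
index = shapley_shubik([49, 49, 2], quota=51)
```

In this game every voter is pivotal in exactly two of the six orderings, so all three receive power 1/3 despite their very different vote weights.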

7.1.3 The Literalist: strictNodePow

Again measuring the degree of “pivotal-ness” due to the probabilistic definition of v, strictNodePow places a stronger constraint on the ordering of features seen within a path. Instead of considering all possible orderings of the features and determining the pivotal voter, strictNodePow only rewards the feature that truly appeared last in the subpath with the number of times it could be pivotal. Furthermore, any feature that was not the last feature is penalized for the number of scenarios in which it could have been pivotal but failed to be so in this particular subpath.

In the context of voting, this is as if a judge has access to additional information

(say, the history of the way the members of this voting structure have voted in the past). The judge will not reward a voter for being pivotal in a hypothetical ordering if she or he has never voted in that position in the past. In other words, the judge is a literalist because she or he will not reward behavior that did not actually occur. Note, however, that the introduction of chronology into this analogy is a little misleading since in a random forest, the structure of the paths is current and directly relevant, whereas history of people’s behavior is not always a definitive predictor of what is to happen in the future.

7.2 The Overweighting of Low-Depth Features

As mentioned throughout Chapter 5 and in the section above, two of the three feature power methodologies are destined to overweight the power of features that appear in nodes at low depths in a decision tree due to the sheer number of iterations that consider said features. In response to this concern, strictNodePow was developed, which iterates over each node only once and thus avoids giving preferential treatment to low-depth features. When examining this issue further, however, it becomes clear that features at both low and high depths are preferred in both cumNodePow and strictNodePow due to the nature of the Shapley value. To illustrate this point, Figure 7.1 demonstrates how the size of the coalition (i.e. the path for feature power) affects the coefficient of v. More specifically, as the cardinality


Figure 7.1: Logscale of Shapley value coefficient growth based on coalition size (n = 200).

of a coalition $S$ becomes very small or very large, the coefficients $\frac{(s-1)!\,(n-s)!}{n!}$ and $\frac{s!\,(n-s-1)!}{n!}$ grow dramatically. This is due to the tradeoff between the $s!$ term and the $(n-s)!$ term and the disproportionate growth of the factorial function. Therefore, the coefficient of the Shapley value is minimal when a coalition contains roughly half of the voters and is maximal when the coalition contains nearly zero or nearly all of the voters. Similarly, for strictNodePow and cumNodePow, the maximum coefficient of power gets attributed to a feature that appears at low or high depths in a decision tree.

The reason that pathPow is not included in the previous statement is the fact that

the “size” of a coalition is always the length of a root-to-leaf path. Assuming that the decision tree is balanced, these lengths will remain relatively consistent across all paths. As such, it is reasonable to assume that the tendency of pathPow to pick up a small set of important features and set the rest to zero can be both attributed to the binary v definition as well as this relatively stagnant coefficient.
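The U-shape of this coefficient can be checked directly. The identity $(s-1)!\,(n-s)!/n! = 1/(n\binom{n-1}{s-1})$ used below is standard algebra, and $n = 200$ matches Figure 7.1.

```python
from math import comb

def shapley_coeff(s, n):
    # (s-1)! (n-s)! / n!  ==  1 / (n * C(n-1, s-1))
    return 1.0 / (n * comb(n - 1, s - 1))

n = 200
w = [shapley_coeff(s, n) for s in range(1, n + 1)]
```

The weight is largest (1/n) for the smallest and largest coalitions and smallest for coalitions containing roughly half of the voters, which is exactly the overweighting of extreme depths discussed above.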

7.3 An Axiomatic Approach

As originally introduced in [70] and adapted in [4, 56, 79], the Shapley value is accompanied by four key axioms that characterize the measure. In fact, L.S. Shapley proves that the Shapley value is the unique solution that satisfies all four axioms.

The following sections will introduce these four axioms in the context of cooperative game theory, followed by a discussion regarding the consequence of the extension to decision trees. In other words, we will explain which axioms remain valid for feature power and which axioms are broken.

7.3.1 The Shapley Value Axioms

The first axiom introduced by Shapley in [70] is that of symmetry. The formulation below is the adaptation seen in [79]. Recall that N is the set of all players.

Axiom 7.1. Symmetry. Given two players $p_i, p_j \in N$, for every $S \subseteq N \setminus \{p_i, p_j\}$,

$$v(S \cup \{p_i\}) = v(S \cup \{p_j\}) \implies \phi_{p_i}[v] = \phi_{p_j}[v].$$

In other words, two players that make the same marginal contribution to every coalition should be allocated equal power. Axiom 2 in [70] is the efficiency axiom.

Axiom 7.2. Efficiency.

$$\sum_{p_i \in N} \phi_{p_i}[v] = v(N).$$

This axiom guarantees that players distribute the resources available to the grand coalition amongst one another (i.e. resources cannot disappear or appear from nowhere). The next axiom contains two parts: the additivity part introduced as

Axiom 3 in [70], and the second a consequence of various Shapley value properties and the above axioms. In totality, this axiom refers to the linearity of the Shapley value.

Axiom 7.3. Linearity.

i. Given two games $v$ and $w$, where the game $v + w$ is defined such that $(v + w)(S) = v(S) + w(S)$ for all $S \subseteq N$, then for every player $p_i \in N$,

$$\phi_{p_i}[v + w] = \phi_{p_i}[v] + \phi_{p_i}[w].$$

ii. Given game $v$ and an arbitrary $a \in \mathbb{R}$, where the game $av$ is defined such that $(av)(S) = a\,v(S)$ for all $S \subseteq N$,

$$\phi_{p_i}[av] = a\,\phi_{p_i}[v].$$

This axiom defines the Shapley value operations on the space of all games. Finally, the concept of a null player or a dummy player is introduced in Lemma 1 of [70] and further explained in [79].

Axiom 7.4. Dummy property. For every player $p_i$ for which the following is satisfied:

$$v(S \cup \{p_i\}) = v(S) \quad \forall S \subseteq N,$$

$p_i$ is considered a null or dummy player and $\phi_{p_i}[v] = 0$.

That is to say that any player that does not contribute to the value of any coalition must be allocated a power of zero. To demonstrate this point further, suppose we have a dummy player $p_i$. Then by the definition of a dummy player, we know that $v(S \cup \{p_i\}) = v(S)$ for all $S \subseteq N$. Recall that the Shapley value in Equation (4.3) is the difference of two terms, the first of which sums over all coalitions including $p_i$ and the second of which sums over all coalitions excluding $p_i$. Note that $S \cup \{p_i\}$ will be found in the first term and $S$ will be found in the second term for every $S$.

Furthermore, notice that the coefficients of each will be equal since $|S \cup \{p_i\}| = |S| + 1$.

Because the value of the characteristic function for each coalition is equal, the two terms will cancel each other out in the calculation of the Shapley value. When

looking at the entire space $S \subseteq N$, it becomes clear that each coalition including $p_i$ has a counterpart excluding $p_i$ with the same value. As such, the calculation of the Shapley value will yield $\phi_{p_i}[v] = 0$ when $p_i$ is a dummy player. The reason this works in the context of cooperative games is the fact that a given player is found

in exactly half of all coalitions. As we will see in the following section, the dummy

property as well as others are not maintained in the extension to decision trees.
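To make these axioms concrete, the following brute-force sketch (our illustration, not code from the thesis; all names are hypothetical) computes Shapley values for a small voting game directly from the defining formula and numerically exhibits symmetry, efficiency, and the dummy property:

```python
from itertools import combinations
from math import factorial

def shapley(players, v):
    """Brute-force Shapley value:
    phi_i = sum over S subset of N\{i} of |S|!(n-|S|-1)!/n! * (v(S+{i}) - v(S))."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                coef = factorial(r) * factorial(n - r - 1) / factorial(n)
                total += coef * (v(frozenset(S) | {i}) - v(frozenset(S)))
        phi[i] = total
    return phi

# Toy 3-player majority game: a coalition has value 1 iff it has >= 2 members.
majority = lambda S: 1.0 if len(S) >= 2 else 0.0
phi = shapley(["p1", "p2", "p3"], majority)
# Symmetry: interchangeable players receive equal power.
# Efficiency: the powers sum to v(N) = 1.

# Dummy: p4 never changes the value of any coalition, so its power is zero.
majority4 = lambda S: 1.0 if len(S - {"p4"}) >= 2 else 0.0
phi4 = shapley(["p1", "p2", "p3", "p4"], majority4)
```

Enumerating all $2^{n-1}$ coalitions per player is exponential, which is exactly why the thesis restricts the coalition space to structures observed in a tree.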

7.3.2 Feature Power Axioms

As seen in the derivation of feature power in Chapter 5, all feature power methods

require the construction of coalition space $U_t$. That is, we change the definition of a coalition from a set to a path, subpath, or node and build the coalition space from the coalitions observed in a given decision tree. This is an attempt to build the structure of the decision tree into the calculation of feature power. This approach does, however, present a definite departure from the Shapley value assumptions.

The Shapley value’s four above axioms apply in an ideal world. The real world,

however, is messy and thus more difficult to generalize in such elegant axioms. More

formally, the way we transferred the Shapley value to the decision tree domain replaced the set of all possible coalitions with a coalition space that carries a particular bias, as induced by the structure of the tree. We will see below that this bias in the

coalition space falsifies three of the four axioms.

Suppose we are given a trained decision tree $t$ and pre-defined characteristic function $v_{c_k}$ calculated with respect to class label $c_k$. Then the following criteria are the extensions from the previous section to the feature power (FP) domain. If we were to translate Axiom 7.1 to feature power terminology, it would state the following.

Criterion 7.5. FP-symmetry. Given features $f_1$ and $f_2$, suppose we consider all pairs of paths (or subpaths) whose sequences of features are identical with the exception of the feature corresponding to a single node $n_i$, such that in the first path $f(n_i) = f_1$ and in the second path $f(n_i) = f_2$. Then if for every such pair of paths $p^{(1)}$ and $p^{(2)}$, $v_{c_k}(p^{(1)}) = v_{c_k}(p^{(2)})$, $f_1$ and $f_2$ are considered FP-symmetric features. If this is the case, then

$$\phi_{f_1}[v_{c_k}] = \phi_{f_2}[v_{c_k}].$$

Counterexample to FP-symmetry. Assume that all conditions listed above are met. In other words, assume there exist two FP-symmetric features $f_1$ and $f_2$. Take one example of the pair of paths described above that differ by only these two features and call these two paths $p^{(1)}$ and $p^{(2)}$. Suppose, however, that the path $p^{(1)}$ is made up of a sequence of features that appears many times throughout the same decision tree. The path $p^{(2)}$, on the other hand, presents the only such sequence of features. In this situation, it is likely that $f_1$ and $f_2$ will not be attributed the same feature power due to the fact that $f_1$ gets the same term added multiple times to its power, whereas $f_2$ only gets one such occurrence. This points to the fact that one of the differences between the Shapley value formulation and feature power is that the coalition space of feature power allows duplicates. The above example is only one such case where FP-symmetry is broken by feature power.

The efficiency axiom, found in Axiom 7.2, cannot be directly translated to the feature power measure because the grand coalition is not defined in the context of a decision tree. It is possible that a variant of this axiom could be introduced to better fit the decision tree structure.

Linearity is the one axiom that remains intact after the extension to decision trees and is actually quite useful for our purposes. Below is the redefinition of Axiom 7.3 in the context of feature power.

Criterion 7.6. FP-linearity.

i. Given two characteristic functions $v_{c_k}$ and $w_{c_j}$ calculated with respect to class labels $c_k$ and $c_j$, respectively, where the characteristic function $(v_{c_k} + w_{c_j})$ is defined such that $(v_{c_k} + w_{c_j})(p) = v_{c_k}(p) + w_{c_j}(p)$ for every path or subpath $p$, then for every feature $f_i$,

$$\phi_{f_i}[v_{c_k} + w_{c_j}] = \phi_{f_i}[v_{c_k}] + \phi_{f_i}[w_{c_j}].$$

ii. Given characteristic function $v_{c_k}$ and an arbitrary $\alpha \in \mathbb{R}$, where the characteristic function $\alpha v_{c_k}$ is defined such that $(\alpha v_{c_k})(p) = \alpha v_{c_k}(p)$ for every $p$, then

$$\phi_{f_i}[\alpha v_{c_k}] = \alpha\,\phi_{f_i}[v_{c_k}].$$

Proof. We will prove FP-linearity in two parts. Let $\Psi_1(p)$ represent the coefficient of the first term of feature power and $\Psi_2(p)$ represent the coefficient of the second term of feature power. This allows for a proof that is independent of the feature power method.

i. We will first show that given two characteristic functions $v_{c_k}$ and $w_{c_j}$ and an arbitrary feature $f_i$,

$$\phi_{f_i}[v_{c_k} + w_{c_j}] = \phi_{f_i}[v_{c_k}] + \phi_{f_i}[w_{c_j}].$$

Treating $v_{c_k} + w_{c_j}$ as a single characteristic function, we find that feature power is defined as:

$$\begin{aligned}
\phi_{f_i}[v_{c_k} + w_{c_j}] &= \sum_{p \in U_{t_i}} \Psi_1(p)\,(v_{c_k} + w_{c_j})(p) - \sum_{p \in (U_t - U_{t_i})} \Psi_2(p)\,(v_{c_k} + w_{c_j})(p) \\
&= \sum_{p \in U_{t_i}} \Psi_1(p)\,v_{c_k}(p) + \sum_{p \in U_{t_i}} \Psi_1(p)\,w_{c_j}(p) \\
&\quad - \sum_{p \in (U_t - U_{t_i})} \Psi_2(p)\,v_{c_k}(p) - \sum_{p \in (U_t - U_{t_i})} \Psi_2(p)\,w_{c_j}(p) \\
&= \phi_{f_i}[v_{c_k}] + \phi_{f_i}[w_{c_j}],
\end{aligned}$$

as was to be shown.

ii. We will now show that given characteristic function $v_{c_k}$, an arbitrary feature $f_i$, and $\alpha \in \mathbb{R}$,

$$\phi_{f_i}[\alpha v_{c_k}] = \alpha\,\phi_{f_i}[v_{c_k}].$$

Treating $\alpha v_{c_k}$ as its own characteristic function, calculating feature power yields:

$$\begin{aligned}
\phi_{f_i}[\alpha v_{c_k}] &= \sum_{p \in U_{t_i}} \Psi_1(p)\,(\alpha v_{c_k})(p) - \sum_{p \in (U_t - U_{t_i})} \Psi_2(p)\,(\alpha v_{c_k})(p) \\
&= \alpha \cdot \sum_{p \in U_{t_i}} \Psi_1(p)\,v_{c_k}(p) - \alpha \cdot \sum_{p \in (U_t - U_{t_i})} \Psi_2(p)\,v_{c_k}(p) \\
&= \alpha \cdot \phi_{f_i}[v_{c_k}],
\end{aligned}$$

as was to be shown.

Observe that characteristic functions $v$ and $w$ can be defined differently. In other words, $v$ could be the probabilistic formulation and $w$ could be the binary formulation, making the function $(v + w)$ an aggregation of both value methodologies.

Furthermore, the two characteristic functions do not have to be computed with respect to the same class label. This provides an easy way to aggregate across classes for binary classification problems by defining a new characteristic function as the sum of the class-specific functions. In addition, FP-linearity can easily be extended to the sum of any number of characteristic functions, resulting in a method for aggregating across more than two classes as well (see Section 5.4 for derivation).
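FP-linearity can also be checked numerically. The sketch below is our illustration, not the thesis implementation: `feature_power`, the coefficient functions, and the toy coalition space are hypothetical stand-ins. It treats feature power as a fixed weighted sum over a path-based coalition space, which is linear in the characteristic function by construction:

```python
# phi_f[v] = sum_{p in U_f} W1(p) v(p) - sum_{p in U \ U_f} W2(p) v(p)
def feature_power(U, U_f, W1, W2, v):
    inc = sum(W1(p) * v(p) for p in U_f)       # paths containing the feature
    exc = sum(W2(p) * v(p) for p in U - U_f)   # paths excluding the feature
    return inc - exc

# Toy coalition space: paths encoded as frozensets of feature names.
U = {frozenset({"f1"}), frozenset({"f1", "f2"}), frozenset({"f2", "f3"})}
U_f1 = {p for p in U if "f1" in p}
W1 = lambda p: 1.0 / (len(p) + 1)   # stand-in coefficients, fixed per path
W2 = lambda p: 1.0 / (len(p) + 2)

v_ck = lambda p: 1.0 if "f2" in p else 0.0   # characteristic fn, class c_k
w_cj = lambda p: 0.5                          # characteristic fn, class c_j
v_plus_w = lambda p: v_ck(p) + w_cj(p)        # aggregated across two classes

lhs = feature_power(U, U_f1, W1, W2, v_plus_w)
rhs = (feature_power(U, U_f1, W1, W2, v_ck)
       + feature_power(U, U_f1, W1, W2, w_cj))
# FP-linearity: lhs == rhs, so class-wise powers can be summed directly.
```

The aggregation across classes described above is exactly this identity: summing class-specific characteristic functions and summing class-specific powers give the same result.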

We will now extend the dummy property introduced in Axiom 7.4 to the feature power domain.

Criterion 7.7. FP-dummy property. Let us consider feature $f_i$. Consider every pair of paths such that the sequence of features of the first path does not contain $f_i$ and the sequence of features of the second path is identical to that of the first path except for the addition of feature $f_i$. Then if for every such pair of paths $p^{(1)}$ and $p^{(2)}$, $v(p^{(1)}) = v(p^{(2)})$, $f_i$ is considered an FP-dummy feature and $\phi_{f_i}[v] = 0$.

Counterexample to FP-dummy property. Recall our discussion of the dummy property in the previous section. The main reason the overall power of a dummy player sums to zero is the fact that all players are guaranteed to be in exactly half of all coalitions. As such, since the marginal contribution of a dummy player is always zero, the inclusion and exclusion terms in Equation 4.3 end up canceling each other out. In the feature power formulation, however, the artificial construction of the coalition space based on the decision tree structure no longer guarantees that a feature will be found in half of the coalitions. As such, the feature power of an

FP-dummy feature is not guaranteed to be zero.

As is seen above, feature power only satisfies FP-linearity. All other criteria become false statements when extended to the decision tree domain. This is not the first time that a measure has been introduced that systematically disproves the axioms used to derive the original Shapley value. The authors of [55] introduce two separate variations of the Shapley value that intentionally disobey specific axioms. They name a Shapley variant that is efficient but not symmetric a quasivalue and a variant that is symmetric but not efficient a semivalue. In addition, work in [4] and [56]

examine the possibility of imposing a “social structure” on the set of players, directly affecting the evaluation of characteristic function v. In fact, [56] represents this social structure by a graph, where players are represented as vertices and the set of possible coalitions is defined by the set of connected subgraphs. In both [4] and [56], the authors introduce a new set of axioms that are derivatives of Shapley’s original four, tweaked to ensure validity in light of the new formulation of the Shapley value.

For example, [56] defines what it means to be “relatively efficient” in the subgraph problem as a substitute for Shapley’s efficiency axiom. It is possible that Criteria

7.5-7.7 can be more carefully defined in the context of decision trees to ensure validity, as opposed to the brute-force axiomatization seen above. Further efforts need to be devoted to this area, as well as to determining the advantages of an inherently linear measure.

7.4 Variable Importance and Feature Power Ranges

As seen in the results from Chapter 6—specifically the boxplot representations (e.g., Figure 6.2)—the various feature power and variable importance evaluations result in different ranges of values. For example, all feature power methods can attribute negative feature power, whereas Gini importance is always positive. The following sections will examine the ranges of Gini importance, permutation importance, count importance, and the three feature power methods in an attempt to better understand the properties of feature power.

7.4.1 Range of Gini importance

Referring back to Equations (3.1)-(3.4) in Section 3.2, the heart of Gini importance is the Gini index. As described in [48], this is a measure of the probability that two randomly selected samples of a set belong to different classes. Jasso adds that the range of the Gini index (Gini(S)) is [0,1], where 0 is assigned if S is a homogenous set of data and 1 is assigned if S is completely diverse (i.e. a set containing no more than one sample of each class). With this in mind, we can conclude that the range of ΔGini(n_i) is [0,1] as well, where a value of 0 is obtained when the partition of the data that reaches node n_i is already homogenous and a value of 1 is obtained when the split at node n_i converts a completely diverse partition into two completely homogenous partitions.

To calculate the range of possible values of aggΔGini(t, f_i), we need to consider two extreme situations. The first is the case where feature f_i is not found in any of the nodes of tree t. In this case, aggΔGini(t, f_i) takes the minimum value of 0 since the summation condition will never yield a true case. On the other hand, the maximum value of aggΔGini(t, f_i) occurs when every node in t corresponds to feature f_i. Note that this is possible since a node split consists of both a feature and a split value.

In other words, each node could consist of an inequality on the same feature but with varying split values to partition the data. Let k be the total number of nodes in tree t and D_t be the bootstrapped data set used to train tree t. Then the upper bound of aggΔGini(t, f_i) is k · |D_t|. Observe that this maximal value will only be

achieved when the tree consists of a single node split, because once a node split is completely homogenous, the remaining ΔGini values will be 0. Finally, considering Equation (3.4), we can conclude that the range of Gini importance is [0, k_avg], where k_avg is the average number of nodes in a given tree across the entire forest T.
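The pairwise reading of the Gini index described above can be sketched directly (our illustration, not the thesis code, using the "two distinct samples" interpretation under which a one-sample-per-class set attains the value 1):

```python
from itertools import combinations

def gini_pairwise(labels):
    """Probability that two distinct, randomly chosen samples of the set
    belong to different classes (the diversity reading used in the text)."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)

gini_homogeneous = gini_pairwise(["A"] * 5)         # homogenous set -> 0.0
gini_diverse = gini_pairwise(["A", "B", "C", "D"])  # one per class  -> 1.0
```

Mixed sets fall strictly between the two endpoints, e.g. `gini_pairwise(["A", "A", "B"])` gives 2/3.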

7.4.2 Range of permutation importance

Referencing Equations (3.5)-(3.7) in Section 3.2, a similar analysis as above can be performed. Since CumErr(t, S) is simply a count of the misclassifications of samples in S when run down tree t, and we know that in the context of permutation importance we use the OOB data set for tree t (O_t), we can conclude that the range of CumErr(t, O_t) is [0, |O_t|], where a value of 0 indicates 100% accuracy and a value of |O_t| represents 0% accuracy. The next step is to evaluate the range of ΔErr(t, f_i), which is the difference in misclassifications after randomly permuting the values of feature f_i in the OOB set. Thus ΔErr(t, f_i) will be minimal when the permutation converts a 0% accuracy data set to a 100% accuracy data set and will be maximal when the permutation causes a 100% accuracy data set to become a 0% accuracy data set. From this, we find that the range of ΔErr(t, f_i) is [−|O_t|, |O_t|].

Similarly, the range of permutation importance is [−|O_t|_avg, |O_t|_avg], where |O_t|_avg is the average cardinality of all OOB sets of forest T.

7.4.3 Range of count importance

The possible values that count importance can take on are much easier to calculate.

Since count importance is simply a raw count of the number of times a feature appears in a forest, the minimum will be obtained when a forest does not contain the feature of interest. On the other hand, count importance will be maximal when every node in the forest corresponds to the feature of interest. Thus the range of count importance is [0, k_avg · |T|], where k_avg is the average number of nodes in each decision tree across the entire forest T and |T| represents the number of trees.

7.4.4 Range of pathPow

Recall the derivation of pathPow in Section 5.2.1. To obtain the range of pathPow, we must consider the two extreme cases where the feature of interest is in every node of the tree and where the feature of interest is not found in the tree. Let us first consider the maximal case. Let L(t, c_k) represent the number of leaf nodes of tree t that have class label c_k (or the number of paths that have v_{c_k} equal to 1). Recall from Section 7.2 that the Shapley coefficient grows large when a coalition contains nearly all or nearly none of the voters. In the context of a decision tree, this occurs when a path is of length 1 or M − 1. Thus the upper bound of pathPow in a single decision tree is obtained by letting d(p) = 1 and is given by:

$$L(t, c_k) \cdot (M-1) \cdot \frac{(1-1)!\,(M-1)!}{M!} = \frac{L(t, c_k)\,(M-1)}{M},$$

where M is the total number of features in the problem. Similarly, the minimum of pathPow will be achieved when the feature of interest is not found in the decision tree. By letting d(p) = M − 1, we discover a lower bound of:

$$-L(t, c_k) \cdot (M-1) \cdot \frac{(M-1)!\,(M-(M-1)-1)!}{M!} = -\frac{L(t, c_k)\,(M-1)}{M}.$$

After aggregating over all decision trees in the forest, pathPow will fall in the following range:

$$\left[-\frac{L(t, c_k)_{avg}\,(M-1)}{M},\ \frac{L(t, c_k)_{avg}\,(M-1)}{M}\right],$$

where L(t, c_k)_avg is the average number of leaf nodes corresponding to class label c_k in a given decision tree across the entire forest.
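The observation that the Shapley coefficient is largest at the extremes of path length can be checked numerically. The sketch below assumes a coefficient of the form (d−1)!(M−d)!/M! (our reading of the pathPow first term, not code from the thesis):

```python
from math import factorial

def path_coeff(d, M):
    """Shapley-style coefficient for a path using d of the M features,
    of the assumed (d-1)!(M-d)!/M! form."""
    return factorial(d - 1) * factorial(M - d) / factorial(M)

M = 6
coeffs = {d: path_coeff(d, M) for d in range(1, M)}
# Largest weight for the shortest paths (d = 1), smallest for mid-length
# paths, matching the "nearly all or nearly none" observation in the text.
```

For M = 6 the coefficients are 1/6, 1/30, 1/60, 1/60, 1/30 for d = 1, …, 5, so both endpoints outweigh the mid-length paths.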

7.4.5 Range of cumNodePow

Referring to the derivation of cumNodePow in Section 5.2.2, we notice that the use of the probabilistic characteristic function v complicates matters when determining the range of possible values. For the sake of simplicity, we say that the maximum value of $v_{c_k}$ is 1 and thus use it for the calculation of the cumNodePow minimum and maximum. Note, however, that it would never be the case that every subtree in a decision tree has $v_{c_k} = 1$, because this would imply that every leaf node of the tree corresponds to class label $c_k$. In this case, the purpose of a decision tree breaks down since the tree structure is unnecessary for the classification. Nevertheless, we consider this type of decision tree. Given $m$ paths in tree $t$ of maximum length $M - 1$, we find that the nodes of the tree are collectively iterated over a maximum of $m \cdot \frac{M(M-1)}{2}$ times. Let us first assume that the feature of interest is in every node of the tree. Then letting $d(p) + 1 = 1$,¹ we obtain the maximum possible value for cumNodePow, given by:

$$m \cdot \frac{M(M-1)}{2} \cdot \frac{(1-1)!\,(M-1)!}{M!} = \frac{m\,(M-1)}{2}.$$

In a similar fashion, we compute the minimum possible value for cumNodePow by considering the case where the feature of interest is not in the tree. The lower bound is given by:

$$-m \cdot \frac{M(M-1)}{2} \cdot \frac{((M-2)+1)!\,(M-(M-2)-2)!}{M!} = -\frac{m\,(M-1)}{2}.$$

In the context of an entire random forest, the only change is the use of the average number of paths in a given tree across the entire forest, denoted by $m_{avg}$. Thus the range of cumNodePow is given by:

$$\left[-\frac{m_{avg}(M-1)}{2},\ \frac{m_{avg}(M-1)}{2}\right].$$

Note the similarity between the ranges of pathPow and cumNodePow. Observe that the dependency on the number of leaf nodes that correspond to the class of interest strongly controls the upper and lower bounds of pathPow. More specifically, after some manipulation of the upper bounds of both methods, we find that

$$\frac{L(t, c_k)}{m} < \frac{M}{2} \implies UB(\text{pathPow}) < UB(\text{cumNodePow}),$$

where UB denotes the upper bound of the given method. In other words, the upper bound of pathPow exceeds that of cumNodePow when the ratio of leaf nodes corresponding to class label $c_k$ to the total number of leaf nodes is more than half the number of features. This relationship is unexpected and very interesting. Note, however, that there is not a simple way to model the probabilistic characteristic function for cumNodePow and thus the upper bound calculated above is likely much larger than what will actually be observed in practice.

¹Note the 1 is added to d(p) because path p no longer consists of a class label leaf node.

7.4.6 Range of strictNodePow

Finally, referencing the derivation of strictNodePow in Section 5.2.3, we again assume that $v_{c_k} = 1$ with respect to every node, since this will result in the largest positive and largest negative power evaluations. Consider the case in which the feature of interest is in every node of a given decision tree. Then given a total of $k$ nodes in $t$, we can calculate the maximum possible strictNodePow value by considering the case where all nodes are at depth 1 (i.e. $d((n_0, n_i)) + 1 = 1$):

$$k \cdot \frac{0!\,(M-0-1)!}{M!} = \frac{k}{M}.$$

Similarly, the minimum strictNodePow value is given when nodes are at maximum depth (i.e. $d((n_0, n_i)) + 1 = M - 1$):

$$-k \cdot \frac{((M-2)+1)!\,(M-(M-2)-2)!}{M!} = -\frac{k}{M}.$$

Within the random forest, we again let $k_{avg}$ denote the average number of nodes in a given tree across the entire forest. Then the values of strictNodePow are guaranteed to fall in the following range:

$$\left[-\frac{k_{avg}}{M},\ \frac{k_{avg}}{M}\right].$$

Again, these values are unlikely to be obtained since we assumed that the probabilis­ tic characteristic function always evaluates to 1. This implies that the tree consists of leaf nodes for only one unique class label, rendering the decision tree useless for classification.

7.4.7 Semantics of Importance Values

It is important to note that the lower bound of both Gini importance and count importance is zero, whereas permutation importance and feature power values can be negative. Nevertheless, the semantics of a negative permutation importance value are very different from those of a negative feature power value. To be clear, let us discuss the meaning of negative, zero, and positive values for each of the importance measures. When either Gini importance or count importance evaluates to zero,

the feature of interest is entirely unimportant. On the other hand, a large, positive value indicates high importance. With respect to permutation importance, a value of zero indicates a lack of importance because the permutation of values over the feature of interest did not affect the classification accuracy. In the same spirit, a positive value indicates high importance due to the permutation negatively impacting performance. The unique aspect of permutation importance is the fact that a negative evaluation indicates that the permutation of the feature of interest actually improved performance. This can be thought of similarly to correlation, where the value can be either positive or negative.

Finally, large, positive values for feature power indicate high importance due to that feature's contribution to many paths or subpaths that relate to the class of interest. On the other hand, large, negative feature power values are a sign of very low importance because that feature rarely or never contributes to any path corresponding to the class of interest. In other words, feature power is negative when the feature of interest is excluded from more relevant paths than it is included in. By the same logic, a value close to zero indicates that the feature contributes to nearly the same number of "winning" paths as it is excluded from and is thus moderately important. Note this is different from the original Shapley value, which guarantees a positive value. This is because a voter will always be in exactly one half of all coalitions. When we artificially construct the coalition space to reflect a trained decision tree, however, this assumption no longer holds, resulting in both

positive and negative values.

7.5 An Alternative Probabilistic Characteristic Function

As mentioned in Section 5.2.2, the probabilistic characteristic function employed

by both cumNodePow and strictNodePow is not the only possible probabilistic v

formulation. Recall that the previously used characteristic function bases its value

on the count of leaf node class labels of a given subtree. That is to say, when considering a subpath from root to a given node $n_i$, the function $v$ is computed based on the class labels at the leaf nodes of the subtree whose root is $n_i$ (see toy examples in Sections 5.2.2 and 5.2.3 for more detail). The underlying assumption of this calculation is that each node is equally reachable by an arbitrary input. Given the structure of a decision tree, however, we know that this is not true. For example, every input passes through the root node but there is no guarantee regarding the

reachability of any other node. Based on this observation, we develop an alternative

probabilistic characteristic function that takes the reachability of each node of a

decision tree into account.

In the development of this new probabilistic characteristic function (call it w), we

assume that there is no significant bias in the distribution of the inputs. This allows

us to assume that it is equally likely for an arbitrary input at a given node to be directed down the left subtree as it is to be sent down the right subtree. In other

words, the probability of reaching the left subtree of a given node is 0.5, as is the probability of reaching the right subtree. By applying the product rule, we can compute the likelihood of reaching any node from any other node. For example, in Figure 7.2, we can conclude that the probability of reaching node $n_7$ from $n_1$ is 0.5 · 0.5 = 0.25. Similarly, the probability of reaching a direct child of $n_1$ from $n_1$ is 0.5.

Applying the sum rule, it can be computed that the probability of reaching a leaf node corresponding to class A from node $n_1$ is 0.25 + 0.5 = 0.75. Alternatively, the probability of reaching a leaf node corresponding to class B from node $n_1$ is 0.5 · 0.5 = 0.25. The values shown at each node of Figure 7.2 represent the probability of reaching a leaf node corresponding to class A from that node, computed as described above. These probabilities will be used for the evaluation of characteristic function $w$ with respect to a given subpath. For example, provided the subpath $p = (n_0, n_1)$, the value of this path with respect to class A will be $w_A(p) = 0.75$.

Though this method makes more sense from a mathematical perspective, the effect on the calculation of the Shapley value is an additional layer of overweighting features at low depths. By attributing larger values to more reachable subpaths, the shorter paths will contribute more to the value than longer paths. As mentioned in

Section 7.2, it is unknown whether this is warranted or not. The justification for using the original characteristic function $v$ based on a simple count of the leaf node class labels is the fact that the reachability of each node is somewhat considered by the coefficient of the Shapley value. In other words, the coefficient of the Shapley value can be thought of as how many possible ways there are to achieve such a path (or subpath), and the characteristic function as the value of such a path (or subpath). This is, in a way, using the product rule from probability, though it never directly considers the reachability of each node in the tree. As will be discussed in Chapter 9, one interesting direction for future work is the empirical and theoretical comparison of the two proposed probabilistic characteristic functions $v$ and $w$.

Figure 7.2: Toy tree for alternative probabilistic characteristic function w.

Chapter 8

Coalition Power

Until this point, we have been dealing exclusively with a variable importance equivalent: feature power. We have not, however, addressed the connection to variable interaction. As introduced in Chapter 3.1, the purpose of variable interaction is to identify the nature of pairwise relationships between features. In the case of permutation interaction, n-ary interactions are also quantifiable. These relationships need to be measured to account for three types of situations: (i) when a feature is relatively unimportant individually but provides a certain small piece of differentiating information that, when paired with another feature, allows for strong predictive power, (ii) when two important features actually provide the same information and thus their collective importance does not increase, and (iii) when two important features contradict one another, yielding a reduced collective importance. The problem with existing variable interaction measures, however, is that they do not provide direct insight into the importance of groups of features.

To be clear, it is possible to examine both variable importance and interaction evaluations and draw certain conclusions about a given group of features. For example, if the variable interaction between two important features is high, then the two features are likely correlated and thus would be an example of situation (ii). On the other hand, if the variable interaction between two important features is a large negative value, then the two features likely provide contradicting information and are an example of situation (iii). If variable interaction between two unimportant features is close to zero, it is unknown whether it is situation (i) or simply an unimportant group of two features.¹ This is the case that only coalition power can solve.

By measuring the power of a group of features in a decision tree, aggregated over all trees in the forest, there is no ambiguity regarding the importance of the group.

Though Guillermo Owen developed a method for quantifying the Shapley value of coalitions in [60], this formulation is not easily extensible to the context of random forests due to several operations that do not have a direct equivalent in the analogy we have drawn (for example, the union of two coalitions). We have, however, developed the feature power methodologies in such a way that allows for an easy extension to the evaluation of coalition power. Because we are borrowing the feature power formulation, there are multiple ways to define coalition power based on the definition of the coalition space and the function v. In fact, there is both a path iteration approach and a cumulative node iteration approach to coalition power. There

¹These scenarios describe what would happen theoretically, though variable interaction measures are approximate and thus will not always obtain the same results.

is not, however, a way to calculate coalition power using the strict node iteration

methodology. This is because strictNodePow credits only the feature that resides at

the last node of the subpath. Since it is impossible to have a set of features at the

last node, strictNodePow cannot be used to evaluate the power of a set of features.

Thus there are a total of two methodologies for calculating coalition power, which will be derived below.

8.1 Derivation of Coalition Power

When deriving coalition power, it is important to distinguish between the meaning of the word “coalition” in coalition power and coalition space. With respect to coalition power, a coalition is defined as a set of features. Thus when we evaluate coalition power, we are evaluating the power of a given set of features with respect

to a decision tree or a random forest. When discussing the coalition space, however, the word “coalition” maintains the same mathematical definition as discussed in

Chapter 5, referring to either a path or subpath originating at the root of a given decision tree. For clarification, we let $F$ denote a set of features whose power we are evaluating and $U_t$ denote the coalition space, such that an element of $U_t$ is denoted by $p$.

8.1.1 Coalition power by pathPow

The formulation of coalition power by path iteration borrows the definition of the coalition space and the function v from the pathPow methodology. In other words, the coalition space is the set of all root-to-leaf paths and v is defined by Equation

(5.1). The only difference between coalition power by path iteration and feature power by path iteration is the definition of the sets that are being summed over in function $\phi$. More specifically, given a set of features $F$, let $U_{t_F}$ be the set of all root-to-leaf paths that contain at least one node corresponding to each feature $f \in F$. The power of set $F$ is then given by:

$$\phi_F[v_{c_k}] = \sum_{p \in U_{t_F}} \frac{(d(p)-1)!\,(M-d(p))!}{M!}\, v_{c_k}(p) \;-\; \sum_{p \in (U_t - U_{t_F})} \frac{d(p)!\,(M-d(p)-1)!}{M!}\, v_{c_k}(p). \tag{8.1}$$

Note that, in the definition of $U_{t_F}$, the nodes containing the features of interest do not have to be contiguous in the path. This point will be further discussed in

Section 8.1.3.
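A minimal sketch of coalition power by path iteration follows, assuming Shapley-style coefficients of the form (d−1)!(M−d)!/M! and d!(M−d−1)!/M! (our reading of Eq. (8.1)); the function and toy data are our illustration, not the thesis implementation:

```python
from math import factorial

def coalition_power(paths, values, F, M):
    """Coalition power by path iteration: paths containing every feature of
    F (order and contiguity ignored) contribute positively; all other paths
    contribute negatively, using the assumed coefficient forms."""
    F = set(F)
    power = 0.0
    for p, v in zip(paths, values):
        d = len(p)  # number of internal nodes on the path
        if F <= set(p):  # non-contiguous, unordered containment check
            power += factorial(d - 1) * factorial(M - d) / factorial(M) * v
        else:
            power -= factorial(d) * factorial(M - d - 1) / factorial(M) * v
    return power

# Toy tree over M = 4 features: root-to-leaf paths and v_ck per path.
paths  = [["f1", "f2"], ["f1", "f3", "f4"], ["f2", "f3"]]
values = [1.0, 1.0, 0.0]
cp = coalition_power(paths, values, {"f1", "f2"}, M=4)  # 1/12 - 1/4 < 0
```

Note that only one of the three toy paths contains both features, so the negative exclusion term dominates, in line with the predominantly negative sign of coalition power discussed in Section 8.1.3.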

8.1.2 Coalition power by cumNodePow

Similarly, the derivation of coalition power by cumNodePow requires only a change in the definition of $U_{t_F}$. That is to say that, given a set of features $F$, let $U_{t_F}$ be the set of all root-to-internal-node paths that contain at least one node corresponding to each feature $f \in F$ (note again that these nodes do not have to be contiguous).

Directly following, the power of the group of features F is given by:

$$\phi_F[v_{c_k}] = \sum_{p \in U_{t_F}} \frac{d(p)!\,(M-d(p)-1)!}{M!}\, v_{c_k}(p) \;-\; \sum_{p \in (U_t - U_{t_F})} \frac{(d(p)+1)!\,(M-d(p)-2)!}{M!}\, v_{c_k}(p). \tag{8.2}$$

Note that the only difference between Equation (8.1) and Equation (8.2) is the fact that, since $p$ no longer contains a class label leaf node, it is necessary to calculate $d(p) + 1$ instead of $d(p)$.

8.1.3 Notes on coalition power

Unlike the Shapley value and feature power, the sign of coalition power will be predominantly negative. To justify this, let us first consider the voting theory situation.

Given $n$ voters, there are $2^n$ possible coalitions (including the empty set), $2^{n-1} = \frac{2^n}{2}$ of which a given voter is a member. Thus the two terms in the Shapley value in

Equation (4.3) are each summing over half of the coalitions (those of which the voter of interest is and is not a member, respectively). As a result, the power of a given voter is always positive and corresponds to the fraction of power belonging to that voter. When developing feature power, however, certain constraints are placed on the coalition space, limiting it to the paths found in a given decision tree. Consequently, the power evaluation does not represent the fraction of power belonging to a feature

since not all possible coalitions are included in the coalition space and thus the power of all features does not sum to one. Furthermore, the fraction of coalitions in which a given feature will be found is not directly calculable since it is dependent on the structure of the tree. From the results in the previous section, however, feature power was often found to be positive or a small negative number. This leads to the conclusion that an arbitrary feature is typically found in half or more of the coalitions. With respect to coalition power, however, the chance that a group of features is found in a path is relatively small in comparison to the chance that a single feature is found in that path. This is particularly true for the cumNode case.

For example, consider any subpath with fewer nodes than the cardinality of the set of features under consideration: it is impossible for all features of the group to be found in such a subpath. As such, it is reasonable to conclude that coalition power in Equations (8.1) and (8.2) will often be negative, particularly in the cumNode formulation (Equation (8.2)). We will see in the following section that this is indeed the case.

The development of coalition power raises the question of whether the power of a group of features should take order into consideration. Note that in the above formulation, we calculate the power of an unordered set of features, as opposed to an ordered sequence of features. Since no order is specified in set F, there is no order requirement when searching the paths either. If, however, we were to evaluate the power of sequences of voters, it would be most logical to build U_tF from paths in which the sequence of interest is a subsequence, i.e. paths in which the order of the features is preserved. The justification for considering only sets of features is that, though the order of features matters in the scope of the entire forest, when extracting a single decision path, the rule can be boiled down to the conjunction of many inequalities [2]. Due to the commutativity of the logical "and," reordering the inequalities has no effect on the logic of the rule. Therefore, given a particular decision path, we could reorder the nodes in the path to match any order of interest without affecting the logic of the path. As such, we have chosen to ignore order in the path and thus to ignore order in the groups of features we evaluate as well. The same reasoning applies to the lack of a contiguity constraint on the nodes, as mentioned in the previous section.

8.2 Preliminary Results

To check the validity of the above methodologies, we employ the same “toys data set” that was used for the evaluation of feature power. Figure 8.1 shows the path coalition power results for all possible coalitions of size two containing features 1-8.

Note that only coalitions of size two are shown, though values are associated with coalitions of size 3, 4, ... 200. As the size of a coalition increases, however, the total power will decrease, because a set of features is less likely to appear in a given path than any of its subsets. Features 1-8 were evaluated because only features 1-6 are important to the classification, while features 7 and 8 are used for comparison.


Figure 8.1: Results of path coalition power evaluation on all coalitions of size 2 on features 1-8.

These results indicate that the top three coalitions of features are {3,6}, {3,5}, and {2,5}, respectively. When examining the class distribution of the samples in the toys data set with respect to features 3 and 6 in Figure 8.2, it is clear that though the two features are not entirely discriminative alone, when working together they have the ability to divide the data into nearly homogeneous groups of class 1 and class -1. Similarly, when examining the class distribution of samples with respect to features 3 and 5 in Figure 8.3, the same trend is witnessed, perhaps resulting in a slightly more homogeneous divide (though it is difficult to conclude definitively).

Note that path coalition power places every coalition containing noisy features 7 and/or 8 at the minimal power value. This suggests that the method successfully identifies the lack of discriminative information these features are able to contribute.


Figure 8.2: Class distribution of features 3 and 6, as identified as important by path coalition power.

Figure 8.3: Class distribution of features 3 and 5, as identified as important by path coalition power.

Figure 8.4: Results of cumNode coalition power evaluation on all coalitions of size 2 on features 1-8.

Figure 8.4 depicts the coalition power evaluation by the cumNode methodology. The most obvious observation is that all values are negative, as anticipated. Based on these results, the top three coalitions are {3,6}, {2,5}, and {3,5}, overlapping with the path coalition power results on all three most powerful coalitions, though in a different order. Since the class distribution of features 3 and 6 was already examined in Figure 8.2 above, we look exclusively at the distribution of features 2 and 5 in Figure 8.5 below.

Figure 8.5: Class distribution of features 2 and 5, as identified as important by cumNode coalition power.

This class distribution plot suggests that, despite the fact that cumNode coalition power placed {2,5} ahead of {3,5}, features 3 and 5 together are clearly more discriminative than features 2 and 5 seen here. Whether this indicates a breakdown in cumNode coalition power is unknown. Regardless, features 2 and 5 still have strong predictive power, as opposed to, for example, the coalition {7,8}, whose class distribution is seen in Figure 8.6. A group of these two noisy features should not be attributed any power, as is the case with both the path coalition power and cumNode coalition power methodologies.

Figure 8.6: Class distribution of features 7 and 8, as identified as unimportant by both coalition power methods.

Note, however, that both path coalition power and cumNode coalition power assign coalition {1,5} the same minimal power as {7,8}, though we can see in Figure 8.7 that this is not evident in the data. This type of result is an artifact of the combinatorial nature of coalition power. For example, assume we are evaluating the coalition power of a given pair of features that have collective discriminative power. Suppose that, due to the random feature selection of the forest training process, no paths were constructed containing both members of the set. As a result, the coalition power will be evaluated at the minimal value, despite the fact that, were the coalition to exist in the tree, it would be important. This raises an interesting point about whether we want to measure information hidden in the data or information that has been extracted and stored in the forest. This type of distinction between data evaluation and model evaluation has not been researched in the past and introduces some interesting subtleties to the variable importance and interaction quantification problem.


Figure 8.7: Class distribution of features 1 and 5, as identified as unimportant by both coalition power methods.

Figure 8.8: Results of coalition count evaluation on all coalitions of size 2 on features 1-8.

Finally, the coalition power results from both measures are compared to the coalition count measure on the same forest, where coalition count is defined as the simple count of the number of times each coalition is found in all root-to-leaf paths. Notice in Figure 8.8 that {1,7} is found in the random forest often but is not identified by either power method as an important coalition. This indicates that coalition power is able to recognize that feature 7 holds no discriminative power, despite being selected alongside feature 1 frequently. Also, {1,2} appears in the forest more often than {1,3}, though {1,3} is attributed a larger power value by both measures. As discussed in the feature power results as well, this suggests that the combinatorial nature of the Shapley value does not create a measure that is too closely tied to a naive count of the number of times an entity is found in the random forest. Observe that {1,5} is not found in the random forest, explaining why both coalition power evaluations set this coalition's importance to the minimal value.²
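Coalition count, as defined above, is straightforward to compute once each root-to-leaf path is reduced to its set of features; the path sets below are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coalition_count(paths, size=2):
    """Count how often each feature coalition of a given size occurs
    across root-to-leaf paths, each path given as a set of feature indices."""
    counts = Counter()
    for p in paths:
        for combo in combinations(sorted(p), size):
            counts[frozenset(combo)] += 1
    return counts

paths = [{1, 2, 3}, {1, 3}, {2, 5}]
c = coalition_count(paths)
assert c[frozenset({1, 3})] == 2
assert c[frozenset({1, 5})] == 0  # never co-occurs, like {1,5} in the text
```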

Upon initial inspection, it appears that the path coalition power methodology may produce more interpretable results. Recall that in feature power, pathPow often picked out a few important features and ignored the rest. For the purpose of ranking features this is of little use, because it sets the importance of the majority of features to an equally small value. This ability may, however, actually be useful for coalition power, because the goal is to identify a few groups of features that are powerful. This notion and many others regarding coalition power need to be more thoroughly researched on a variety of data sets to gain a full understanding of the nature of this measure. Note that the calculation of coalition power for the power set of all features is computationally prohibitive. The proposed coalition power measure can, however, be calculated on a specified set of coalitions, where the size of the set dictates the complexity of the computation. An interesting extension to the above measure is the development of an accompanying method for narrowing down the power set of all features to candidate coalitions with a high probability of being important. This would allow coalition power to be computed at a lower computational cost without risking missing an important coalition of features.

²Note that the computation of coalition power is designed to evaluate the important relationships of features within a trained decision tree. This is not to be confused with feature selection, which is performed as a data preprocessing step. The distinction between the two boils down to the fact that coalition power makes measurements on the trained random forest, whereas feature selection methods measure the data itself, independent of any model.
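One plausible narrowing heuristic, offered here as a hypothetical sketch rather than a method from this thesis, is to form candidate coalitions only from the individually most powerful features, on the assumption that a powerful group usually contains at least some individually powerful members.

```python
from itertools import combinations

def candidate_coalitions(feature_power, k=5, size=2):
    """Hypothetical pruning heuristic: restrict coalition power evaluation
    to coalitions drawn from the k individually most powerful features.

    feature_power -- dict mapping feature index -> individual feature power
    """
    top = sorted(feature_power, key=feature_power.get, reverse=True)[:k]
    return list(combinations(sorted(top), size))

# Toy individual feature power values (features 7 and 8 are noisy).
fp = {1: 0.9, 2: 0.7, 3: 0.8, 7: -0.1, 8: -0.2}
cands = candidate_coalitions(fp, k=3)
assert cands == [(1, 2), (1, 3), (2, 3)]
```

This reduces the evaluation from the full power set to C(k, size) coalitions, at the risk of missing a group whose members are individually weak.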

Chapter 9

Conclusion

This thesis addressed three main gaps in related prior work: the development of a theoretically sound variable importance measure, the application of voting theory to post-training analysis of tree-based ensembles, and the derivation of a method for evaluating the power of a group of features. We were able to tackle these problems by modeling a trained random forest structure as a voting game. This allowed for the evaluation of the power of features and groups of features by treating them as voters within the random forest voting game. When drawing the analogy from voting games to random forests, there was some uncertainty regarding proper assumptions, ultimately leading to the development of three feature power and two coalition power methods. Each method was checked for correctness by application to a simulated toy data set in which ground truth was known. Feature power was also applied to real-world data sets to observe the feature rankings obtained from each power metric as well as from existing variable importance measures. The application of all feature power methods to real-world data sets was also used in an attempt to map the behavioral similarities and differences between one another and existing variable importance measures under varying conditions.

The feature power results indicated that pathPow often attributed all power to a small number of features and showed low stability on both binary classification data sets. On the other hand, strictNodePow and cumNodePow behaved extremely similarly to one another and were most correlated with Gini importance. They also demonstrated the highest stability on binary classification data sets and achieved reasonable importance values. On the multi-class data sets, pathPow demonstrated higher stability. It is important to note, however, that pathPow did not achieve reasonable importance results on these data sets, suggesting that the high stability is simply convergence to an incorrect feature ranking. StrictNodePow and cumNodePow suffered in terms of both correctness and stability on the multi-class data sets, particularly on image segmentation. It is unknown whether this is because of the extension to multi-class problems or an inherent quality of the data itself.

The coalition power results suggested that the path iteration approach may be best suited for coalition power, whereas cumNodePow (and strictNodePow) may be a better choice for feature power. This is because of pathPow's tendency to pick out a small number of important features. This property is particularly useful for the evaluation of a group of features, where the goal is to obtain a short list of important groups. Note, however, that these results are based on the evaluation of pairs of features only (not coalitions of size 3 or more). Nonetheless, both coalition power methods achieved reasonable results on the toy data set, providing a promising direction for future work.

Despite the ubiquity and robustness of existing variable importance and variable interaction measures, they have many flaws. Below we state explicitly how feature power overcomes them. The advantages of the feature power and coalition power methodologies are:

1. Feature power is non-stochastic, allowing for guaranteed repeatability of results given the same forest structure.

2. Feature power is data independent, removing any reliance of results on the choice of data samples (assuming the model has already been trained).

3. Feature power is easily extensible to neural networks and any tree-based classifier (see section below).

4. Coalition power is the only way to evaluate the importance of a group of features, with the exception of simply counting its occurrence within a forest.

5. Both feature power and coalition power are deeply rooted in theory, ensuring future researchers the ability to explore the mathematical properties of the measures.

6. The modeling of random forests as a voting game opens opportunities to apply other mathematical principles of cooperative games to random forests, further enhancing the interpretability of random forests in general.

9.1 Extension to Other Classifiers

9.1.1 Tree-based ensembles

Recall the formulation of each feature power method. The power of a given feature was defined as the difference of two terms, one reliant on the number of times the feature was found in a path (or subpath) and the other reliant on the number of times a path existed that did not include the feature. There is no reference either to the CART training process or to the impurity measure used to construct the tree. By this logic, it follows that feature power and coalition power can be applied to any tree-based ensemble classifier, regardless of the training procedure. Examples include gradient boosting [37] and trees trained by C4.5 [64]. Gini importance, for example, cannot be applied to either of these classifiers, since the justification for using the Gini index breaks down when the tree is not constructed using Gini impurity.
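The training-agnostic nature of the measures can be seen in the path-extraction step itself: nothing below depends on how the tree was grown (CART, C4.5, or a boosted learner). The dict-based tree encoding is an assumption for illustration.

```python
def extract_paths(node, prefix=()):
    """Collect the feature set along every root-to-leaf path of a tree.

    Trees are plain nested dicts; the training procedure that produced
    them is irrelevant, which is the point of the argument above.
    """
    if "leaf" in node:  # leaf node: emit the accumulated features and label
        yield set(prefix), node["leaf"]
    else:
        for child in (node["left"], node["right"]):
            yield from extract_paths(child, prefix + (node["feature"],))

# Toy tree with made-up feature indices and class labels.
tree = {"feature": 0,
        "left": {"leaf": "A"},
        "right": {"feature": 2,
                  "left": {"leaf": "B"},
                  "right": {"leaf": "A"}}}

paths = list(extract_paths(tree))
assert ({0}, "A") in paths
assert ({0, 2}, "B") in paths
```

The resulting path sets are exactly the input the power computations consume, so swapping in a differently trained tree changes nothing downstream.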

9.1.2 Neural networks

As [68] points out, several works have been devoted to bridging the gap between neural networks and decision trees. Some researchers have turned decision trees into trees of neural networks, while others have treated entire decision trees as neural networks. The authors of [25], however, choose to extract the information in a neural network and represent it as a tree-based structure. Using this tool, one can easily leverage the inherent interpretability of a decision tree to understand the network it was generated from. As such, the application of feature power and coalition power to a network-generated tree can be used to measure the importance of features within the tree and thus within the neural network.

9.2 Future Work

Due to the complexity of this work, there are endless prospects for the future. The most obvious is the further evaluation of both feature power and coalition power on various data sets. A thorough behavioral analysis, as seen in [40], is necessary to build an empirically based understanding of both measures. These results may also lead to method improvements. Though the theoretical foundation of feature power is well understood, the theoretical implications discussed in Chapter 7 are the tip of the iceberg. Both the overweighting of features at low depths and the dependency on ordering need to be more thoroughly researched. In addition, the majority of the axiomatic properties of the Shapley value were broken by feature power. It is possible that this fact has caused the misleading results for multi-class problems. Finally, it would be interesting to explore the alternative form of a probabilistic characteristic function by performing experiments similar to those in this thesis. Though the extension to other classifiers is discussed above, the application of these methodologies to other classifiers may present new challenges. Future research should include the extension of these measures to other ML classifiers, as this may provide interesting insights regarding the nature of the measures. In the scope of random forests, it would be an interesting problem to extend feature power to regression. Since the current definition of the characteristic function v depends on the winning or losing state of a coalition (or the probability of winning or losing), this will require careful consideration. Ultimately, the application of voting theory to decision trees opens up countless opportunities for future research in this newly created domain of ML voting games.

Appendix A: Mathematical Symbols

Symbol              Meaning
S                   an arbitrary set
K                   the number of class labels
c(s)                the correct class label of data point s
c_k                 the kth class label
Gini(S)             the Gini impurity of data set S
n_i, n_j            two distinct nodes of a trained tree
P_{n_i}             the partition of D_t that reaches node n_i
L_{n_i}             the partition of D_t that reaches the left subtree of node n_i
R_{n_i}             the partition of D_t that reaches the right subtree of node n_i
ΔGini(n_i)          the Gini impurity reduction at node n_i
f_i, f_j            two distinct features
t                   a tree in forest T
f(n_j)              the feature governing the split at node n_j (unless n_j is a leaf node, in which case the function maps to the class label)
aggΔGini(t, f_i)    the aggregate Gini impurity reduction of feature f_i within tree t
T                   a trained random forest (a set of trees)
D                   the set of training data
D_t                 the bootstrapped training data set for tree t
Imp_Gini(f_i)       the Gini importance of feature f_i across the entire random forest
t(s)                the resulting classification of sample s by tree t
CumErr(t, S)        a helper function that calculates the total number of misclassifications for tree t on arbitrary data set S
O_t                 the out-of-bag samples for tree t, such that O_t ∪ D_t = D
O_{t,f_j}           the out-of-bag samples for tree t, permuted on feature f_j
ΔErr(t, f_i)        the difference in accuracy for tree t by permutation of feature f_i
Imp_perm(f_i)       the permutation importance of feature f_i across the entire random forest
f_{ti}              the ith feature when sorted in nonincreasing order by aggregate Gini impurity reduction for tree t
M                   the total number of features
rank(t, f_i)        the index of feature f_i when sorted in nonincreasing order by aggregate Gini impurity reduction for tree t
Int_Gini(f_i, f_j)  the Gini interaction of features f_i and f_j
O_{t,f_i,f_j}       the out-of-bag samples for tree t, permuted on features f_i and f_j
Int_perm(f_i, f_j)  the permutation interaction of features f_i and f_j
q                   the quota, or threshold number of votes required to pass a law
w_i                 the number of votes apportioned to voter v_i
N                   the set of all voters/players in a voting structure/game
n                   the cardinality of set N
v_i                 an arbitrary voter belonging to set N
σ_{v_i}             the Shapley-Shubik power index of voter v_i
piv(v_i)            the total number of times voter v_i is pivotal in all n! permutations of N
β_{v_i}             the Banzhaf power index of voter v_i
crit(v_i)           the total number of times voter v_i is critical in all 2^n subsets of N
ntree               the number of trees (or estimators) in a random forest
v                   the characteristic set function of an abstract game
φ_{p_i}(v)          the Shapley value of player p_i with respect to abstract game v
n_i                 an arbitrary node in a decision tree
p                   a path in a decision tree
d(p)                the number of edges in path p
U_t                 the coalition space for decision tree t
c                   a coalition in the coalition space
n_0                 the root node
U_{tf_i}            the set of all coalitions of U_t in which feature f_i is a member
v_{c_k}(p)          the value of path p in abstract game v with respect to class c_k
φ_{f_i}[c_k]        the feature power of feature f_i
m                   the number of paths in an arbitrary decision tree
l                   the average length of a path in a decision tree
t_{n_i}             a subtree of tree t originating at node n_i
L(t, c_k)           a function that counts the number of leaf nodes in tree t that match class label c_k
k                   the number of nodes in an arbitrary decision tree
φ^t_{f_i}           the feature power of f_i in tree t
φ_{f_i}             the feature power of f_i in a forest
φ_{f_i}^{agg}       the feature power of f_i aggregated across classes
Imp_cnt             the count importance function
w                   the characteristic function of an abstract game
p^{(i)}             a path that contains a node corresponding to feature f_i
k_avg               the average number of nodes in all trees in a forest
|O|_avg             the average cardinality of all OOB sets of a forest
L(t, c_k)_avg       the average of L(t, c_k) for every t ∈ T
m_avg               the average number of paths in a given tree across a forest
UB                  a function mapping from a feature power method to its upper bound
F                   a set of features
U_{tF}              the set of all coalitions of U_t in which every element of F is a member

Table 9.1: Enumeration of mathematical symbols, consistent throughout the thesis and in the order in which they appear.

Appendix B: Additional Plots

Figure 9.1: Aggregate importance values of first ten features of toys data.

Figure 9.2: Class-specific importance values of features in breast cancer data.

Figure 9.3: Aggregate importance values of features in breast cancer data.

Figure 9.4: Class-specific importance values of features in wine data.

Figure 9.5: Class-specific importance values of features in segmentation data.

Figure 9.6: Average rank-order correlation on top 7 ranked features over 50 RF instances on breast cancer data.

Figure 9.7: Average rank-order correlation on top 7 ranked features over 50 RF instances on image segmentation data.


Figure 9.8: Effect of OOB accuracy on rank-order correlation on top 7 ranked features over 50 RF instances on toys data.

Figure 9.9: Effect of OOB accuracy on rank-order correlation on top 7 ranked features over 50 RF instances on breast cancer data.

Figure 9.10: Effect of OOB accuracy on aggregate rank-order correlation on top 7 ranked features over 50 RF instances on wine data.

Figure 9.11: Effect of OOB accuracy on aggregate rank-order correlation on top 7 ranked features over 50 RF instances on image segmentation data.


Figure 9.12: Rank-order Spearman’s rho sensitivity to mtry on wine data.

Figure 9.13: Rank-order Spearman's rho sensitivity to mtry on segmentation data.

Appendix C: Complete Tables

          pa1    pa2    st1    st2    cu1    cu2    Pe1    Pe2    G      Pe     Co
path1     1      0.48   0.45   0.48   0.45   0.47   0.48   0.43   0.49   0.42   0.42
path2     -      1      0.56   0.54   0.56   0.53   0.52   0.53   0.52   0.51   0.52
strict1   -      -      1      0.75   0.99   0.75   0.45   0.43   0.64   0.45   0.42
strict2   -      -      -      1      0.75   0.99   0.46   0.47   0.69   0.51   0.51
cum1      -      -      -      -      1      0.75   0.46   0.42   0.63   0.45   0.41
cum2      -      -      -      -      -      1      0.45   0.46   0.69   0.50   0.50
Perm1     -      -      -      -      -      -      1      0.58   0.59   0.70   0.66
Perm2     -      -      -      -      -      -      -      1      0.47   0.66   0.50
Gini      -      -      -      -      -      -      -      -      1      0.58   0.68
Perm      -      -      -      -      -      -      -      -      -      1      0.67
Count     -      -      -      -      -      -      -      -      -      -      1

Table 9.2: Class-specific pairwise rank-order correlation on top 7 ranked features of toys data.

          pa1    pa2    st1    st2    cu1    cu2    Pe1    Pe2    G      Pe     Co
path1     1      0.09   0.20   0.17   0.19   0.17   0.18   0.09   0.16   0.17   0.20
path2     -      1      0.04   0.08   0.05   0.08   0.04   0.02   0.01   -0.01  -0.01
strict1   -      -      1      0.87   0.98   0.87   0.44   0.17   0.75   0.33   0.17
strict2   -      -      -      1      0.88   1      0.44   0.16   0.72   0.29   0.18
cum1      -      -      -      -      1      0.88   0.45   0.20   0.75   0.33   0.19
cum2      -      -      -      -      -      1      0.44   0.16   0.71   0.29   0.18
Perm1     -      -      -      -      -      -      1      0.25   0.47   0.50   0.35
Perm2     -      -      -      -      -      -      -      1      0.19   0.20   0.13
Gini      -      -      -      -      -      -      -      -      1      0.30   0.24
Perm      -      -      -      -      -      -      -      -      -      1      0.41
Count     -      -      -      -      -      -      -      -      -      -      1

Table 9.3: Class-specific pairwise rank-order correlation on top 7 ranked features of breast cancer data.

          pa1   pa2   pa3   st1   st2   st3   cu1   cu2   cu3   Pe1   Pe2   Pe3   G     Pe    Co
path1     1     0.54  0.42  0.20  0.17  0.15  0.21  0.19  0.16  0.07  0.12  0.34  0.17  0.18  0.19
path2     -     1     0.54  0.12  0.12  0.08  0.14  0.10  0.11  0.13  -0.01 0.43  0.06  0.08  0.10
path3     -     -     1     0.07  0.03  0.05  0.06  0.01  0.03  0.06  0.06  0.48  -0.05 -0.01 0.09
strict1   -     -     -     1     0.54  0.37  0.94  0.54  0.33  0.16  0.05  0.08  0.40  0.30  0.22
strict2   -     -     -     -     1     0.44  0.53  0.93  0.42  0.11  0.10  0.10  0.45  0.36  0.28
strict3   -     -     -     -     -     1     0.40  0.46  0.88  0.17  0.16  0.11  0.44  0.32  0.22
cum1      -     -     -     -     -     -     1     0.53  0.38  0.18  0.06  0.08  0.41  0.31  0.22
cum2      -     -     -     -     -     -     -     1     0.46  0.12  0.11  0.10  0.47  0.33  0.30
cum3      -     -     -     -     -     -     -     -     1     0.20  0.12  0.10  0.41  0.28  0.21
Perm1     -     -     -     -     -     -     -     -     -     1     0.12  0.06  0.11  0.18  0.04
Perm2     -     -     -     -     -     -     -     -     -     -     1     -0.03 0.06  0.15  0.01
Perm3     -     -     -     -     -     -     -     -     -     -     -     1     0.08  0.01  0.15
Gini      -     -     -     -     -     -     -     -     -     -     -     -     1     0.46  0.31
Perm      -     -     -     -     -     -     -     -     -     -     -     -     -     1     0.34
Count     -     -     -     -     -     -     -     -     -     -     -     -     -     -     1

Table 9.4: Class-specific pairwise rank-order correlation on top 7 ranked features of wine data.

Bibliography

[1] 17th Amendment to the U.S. Constitution: Direct election of U.S. Senators, The Constitution of the United States, 1912-1913.

[2] Md Nasim Adnan and Md Zahidul Islam, Forex++: A new framework for knowledge discovery from decision forests, Australasian Journal of Information Systems 21 (2017), no. 1539, 1-20.

[3] Kellie J. Archer and Ryan V. Kimes, Empirical characterization of random for­ est variable importance measures, Computational Statistics and Data Analysis 52 (2008), no. 4, 2249 - 2260.

[4] R. J. Aumann and J. H. Dreze, Cooperative games with coalition structures, International Journal of Game Theory 3 (1974), no. 4, 217—237.

[5] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Muller, How to explain individual classifi­ cation decisions, Journal of Machine Learning Research 11 (2010), no. Jun, 1803-1831.

[6] John F. Banzhaf III, Weighted voting doesn’t work: A mathematical analysis, The Rutgers Law Review 19 (1964), 317-343.

[7] ______, Mathematics, voting and the law: The quest for equal representation, Journal 8 (1968), no. 4, 69-76.

[8] , One man, 3.312 votes: Mathematical analysis of voting power and effective representation, Villanova Law Review 13 (1968), 303-332. 159

[9] Eric Bauer and Ron Kohavi, An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, Machine Learning 36 (1999), no. 1- 2, 105-139.

[10] David Bellhouse, The problem of Waldegrave, Electronic Journal for the History of Probability and Statistics 3 (2007), no. 2, 1-12.

[11] Gerard Biau, Analysis of a random forests model, Journal of Machine Learning Research 13 (2012), no. Apr, 1063-1095.

[12] Adrien Bibal and Benoit Frenay, Interpretability of machine learning models and representations: an introduction, 24th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN) (Bruges, Belgium), April-May 2016, pp. 77-82.

[13] John A. Bondy and Uppaluri Siva Ramachandra Murty, Graph theory with applications, The Macmillan Press Ltd., 1976.

[14] Michael Bowling and Manuela Veloso, An analysis of stochastic game theory for multiagent reinforcement learning, Tech. report, Carnegie-Mellon University, School of Computer Science, 2000.

[15] Leo Breiman, Bagging predictors, Machine Learning 24 (1996), no. 2, 123-140.

[16] ______, Looking inside the black box, Wald Lecture II, Berkeley University, 2001.

[17] ______, Random forests, Machine Learning 45 (2001), no. 1, 5-32.

[18] Leo Breiman and Adele Cutler, Random forests, Available: http://www.stat.berkeley.edu/users/breiman/randomforests/, 2011.

[19] Leo Breiman, Jerome Friedman, R.A. Olshen, and Charles J. Stone, Classification and regression trees, Chapman and Hall/CRC, 1984.

[20] Chris Brinton, A framework for explanation of machine learning decisions, International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Explainable AI (XAI) Proceedings (Melbourne, Australia), 2017, pp. 14-18.

[21] Lars Carlsson, Ernst Ahlberg Helgee, and Scott Boyer, Interpretation of nonlinear QSAR models applied to Ames mutagenicity data, Journal of Chemical Information and Modeling 49 (2009), no. 11, 2551-2558.

[22] Emilio Carrizosa, Belén Martín-Barragán, and Dolores Romero Morales, Detecting relevant variables and interactions for classification in support vector machines, Tech. report, Citeseer, 2006.

[23] Emilio Carrizosa, Belén Martín-Barragán, and Dolores Romero Morales, A column generation approach for support vector machines, Tech. report, Universidad de Sevilla, 2006.

[24] Shay Cohen, Eytan Ruppin, and Gideon Dror, Feature selection based on the Shapley value, Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann Publishers Inc., July-August 2005, pp. 665-670.

[25] Mark W. Craven and Jude W. Shavlik, Extracting tree-structured representations of trained networks, Proceedings of the 8th International Conference on Neural Information Processing Systems (NIPS) (Cambridge, MA, USA) (David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, eds.), NIPS'95, MIT Press, 1995, pp. 24-30.

[26] Nello Cristianini and John Shawe-Taylor, An introduction to support vector machines: and other kernel-based learning methods, Cambridge University Press, 1999.

[27] Misha Denil, David Matheson, and Nando De Freitas, Narrowing the gap: Random forests in theory and in practice, Proceedings of the 31st International Conference on Machine Learning (Beijing, China) (Eric P. Xing and Tony Jebara, eds.), vol. 32, Proceedings of Machine Learning Research (PMLR), no. 1, PMLR, 22-24 Jun 2014, pp. 665-673.

[28] Dua Dheeru and Efi Karra Taniskidou, University of California, Irvine, School of Information and Computer Sciences Machine Learning Repository, [http://archive.ics.uci.edu/ml], 2017.

[29] Ramón Díaz-Uriarte and Sara Alvarez de Andrés, Gene selection and classification of microarray data using random forest, BioMed Central (BMC) Bioinformatics 7 (2006), no. 3, 1-13.

[30] Robert W. Dimand and Mary Ann Dimand, The early history of the theory of strategic games from Waldegrave to Borel, History of Political Economy 24 (1992), 15-27.

[31] Mary T. Dzindolet, Scott A. Peterson, Regina A. Pomranky, Linda G. Pierce, and Hall P. Beck, The role of trust in automation reliance, International Journal of Human-Computer Studies - Special issue: Trust and technology 58 (2003), no. 6, 697-718.

[32] Bradley Efron and R.J. Tibshirani, An introduction to the bootstrap, Chapman & Hall/CRC Monographs on Statistics & Applied Probability, CRC Press LLC, Boca Raton, Florida, 1994.

[33] Riki Eto, Ryohei Fujimaki, Satoshi Morinaga, and Hiroshi Tamano, Fully-automatic Bayesian piecewise sparse linear models, Artificial Intelligence and Statistics, April 2014, pp. 238-246.

[34] Yoav Freund and Robert E. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence 14 (1999), no. 5, 771-780.

[35] Mark A. Friedl and Carla E. Brodley, Decision tree classification of land cover from remotely sensed data, Remote Sensing of Environment 61 (1997), no. 3, 399-409.

[36] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, The elements of statistical learning, Springer Series in Statistics, Springer, New York, 2001.

[37] Jerome H. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29 (2001), no. 5, 1189-1232.

[38] Andrew Gelman, Jonathan N. Katz, and Francis Tuerlinckx, The mathematics and statistics of voting power, Statistical Science 17 (2002), no. 4, 420-435.

[39] Robin Genuer, Toys dataset, Available: https://github.com/robingenuer.

[40] Robin Genuer, Jean-Michel Poggi, and Christine Tuleau, Random forests: some methodological insights, arXiv preprint arXiv:0811.3619 (2008), 1-35.

[41] Robert Gibbons, A primer in game theory, Harvester Wheatsheaf, 1992.

[42] Bernard Grofman and Arend Lijphart, Electoral laws and their political consequences, vol. 1, Algora Publishing, 1986.

[43] Bernard Grofman and Howard Scarrow, Weighted voting in New York, Legislative Studies Quarterly 6 (1981), no. 2, 287-304.

[44] Ulrike Grömping, Variable importance assessment in regression: Linear regression versus random forest, The American Statistician 63 (2009), no. 4, 308-319.

[45] Satoshi Hara and Kohei Hayashi, Making tree ensembles interpretable, arXiv preprint arXiv:1606.05390 (2016), 81-85.

[46] Jonathan L. Herlocker, Joseph A. Konstan, and John Riedl, Explaining collaborative filtering recommendations, Proceedings of the 2000 Association for Computing Machinery (ACM) conference on Computer supported cooperative work (New York, NY, USA), CSCW '00, ACM, 2000, pp. 241-250.

[47] Torsten Hothorn, Kurt Hornik, and Achim Zeileis, Unbiased recursive partitioning: a conditional inference framework, Journal of Computational and Graphical Statistics 15 (2006), 651-674.

[48] Guillermina Jasso, On Gini's mean difference and Gini's index of concentration, American Sociological Review 44 (1979), no. 5, 867-870.

[49] Cassidy Kelly and Kazunori Okada, Variable interaction measures with random forest classifiers, 9th Institute of Electrical and Electronics Engineers (IEEE) International Symposium on Biomedical Imaging (ISBI), IEEE, May 2012, pp. 154-157.

[50] Daphne Koller and Mehran Sahami, Toward optimal feature selection, Tech. report, Stanford InfoLab, 1996.

[51] Dennis Leech, Designing the voting system for the Council of the European Union, Public Choice 113 (2002), no. 3-4, 437-464.

[52] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, David Madigan, et al., Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model, The Annals of Applied Statistics 9 (2015), no. 3, 1350-1371.

[53] Gilles Louppe, Louis Wehenkel, Antonio Sutera, and Pierre Geurts, Understanding variable importances in forests of randomized trees, Advances in Neural Information Processing Systems (NIPS) (C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds.), vol. 26, Curran Associates, Inc., 2013, pp. 431-439.

[54] Brent Daniel Mittelstadt, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, and Luciano Floridi, The ethics of algorithms: Mapping the debate, Big Data & Society 3 (2016), no. 2, 1-21.

[55] Dov Monderer and Dov Samet, Handbook of game theory with economic appli­ cations, vol. 3, pp. 2055-2076, Elsevier, B.V., 2002.

[56] Roger B. Myerson, Graphs and cooperation in games, Mathematics of Operations Research 2 (1977), no. 3, 225-229.

[57] Kristin K. Nicodemus, Letter to the editor: On the stability and ranking of predictors from random forest variable importance measures, Briefings in Bioinformatics 12 (2011), no. 4, 369-373.

[58] Richard G. Niemi and William H. Riker, The choice of voting systems, Scientific American 234 (1976), no. 6, 21-27.

[59] Julian D. Olden and Donald A. Jackson, Illuminating the "black box": a randomization approach for understanding variable contributions in artificial neural networks, Ecological Modelling 154 (2002), no. 1-2, 135-150.

[60] Guillermo Owen, Multilinear extensions of games, Management Science 18 (1972), no. 5, Theory Series, Part 2, Game Theory and Gaming, 64-79.

[61] Anna Palczewska, Jan Palczewski, Richard Marchese Robinson, and Daniel Neagu, Interpreting random forest classification models using a feature contribution method, Integration of Reusable Systems, Springer, 2014, pp. 193-218.

[62] Dragutin Petkovic, Russ Altman, Mike Wong, and Arthur Vigil, Improving the explainability of random forest classifier - user centered approach, Pacific Symposium on Biocomputing, vol. 23, World Scientific, 2018, pp. 204-215.

[63] Dragutin Petkovic, Lester Kobzik, and Christopher Re, Workshop on "machine learning and deep analytics for biocomputing: call for better explainability", Pacific Symposium on Biocomputing, 2018.

[64] J. R. Quinlan, Bagging, boosting, and C4.5, Innovative Applications of Artificial Intelligence (IAAI), vol. 1, 1996, pp. 725-730.

[65] Rao Raghuraj and Samavedham Lakshminarayanan, VPMCD: Variable interaction modeling approach for class discrimination in biological systems, FEBS Letters 581 (2007), no. 5, 826-830.

[66] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, “Why should I trust you?” Explaining the predictions of any classifier, Proceedings of the 22nd Association for Computing Machinery (ACM) Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 1135-1144.

[67] Alvin E. Roth, The Shapley value: essays in honor of Lloyd S. Shapley, Cambridge University Press, 1988.

[68] S. Rasoul Safavian and David Landgrebe, A survey of decision tree classifier methodology, Institute of Electrical and Electronics Engineers (IEEE) Transactions on Systems, Man, and Cybernetics 21 (1991), no. 3, 660-674.

[69] L. S. Shapley and Martin Shubik, A method for evaluating the distribution of power in a committee system, The American Political Science Review 48 (1954), no. 3, 787-792.

[70] Lloyd S. Shapley, A value for n-person games, Contributions to the Theory of Games 2 (1953), no. 28, 307-317.

[71] Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis, Conditional variable importance for random forests, BioMed Central (BMC) Bioinformatics 9 (2008), no. 1, 307-318.

[72] Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn, Bias in random forest variable importance measures: Illustrations, sources and a solution, BioMed Central (BMC) Bioinformatics 8 (2007), no. 1, 25-46.

[73] Jianyuan Sun, Guoqiang Zhong, Junyu Dong, and Yajuan Cai, Banzhaf random forests, arXiv preprint arXiv:1507.06105 (2015), 1-15.

[74] Ryan Turner, A model explanation system, Institute of Electrical and Electronics Engineers (IEEE) 26th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2016, pp. 1-6.

[75] Eugene Tuv, Alexander Borisov, George Runger, and Kari Torkkola, Feature selection with ensembles, artificial variables, and redundancy elimination, Journal of Machine Learning Research 10 (2009), no. Jul, 1341-1366.

[76] Alfredo Vellido, José D. Martín-Guerrero, and Paulo J.G. Lisboa, Making machine learning models interpretable, European Symposium on Artificial Neural Networks (ESANN), Computational Intelligence and Machine Learning, vol. 12, Citeseer, 2012, pp. 163-172.

[77] John Von Neumann and Oskar Morgenstern, Theory of games and economic behavior, Princeton University Press, 1944.

[78] Jason Weston, André Elisseeff, Bernhard Schölkopf, and Mike Tipping, Use of the zero norm with linear models and kernel methods, Journal of Machine Learning Research 3 (2003), 1439-1461.

[79] Eyal Winter, The Shapley value, Handbook of game theory with economic applications 3 (2002), 2025-2054.