Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)
Bernoulli Random Forests: Closing the Gap between Theoretical Consistency and Empirical Soundness
Yisen Wang†‡, Qingtao Tang†‡, Shu-Tao Xia†‡, Jia Wu⋆, Xingquan Zhu⇧
† Dept. of Computer Science and Technology, Tsinghua University, China
‡ Graduate School at Shenzhen, Tsinghua University, China
⋆ Quantum Computation & Intelligent Systems Centre, University of Technology Sydney, Australia
⇧ Dept. of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, USA
{wangys14, tqt15}@mails.tsinghua.edu.cn; [email protected]; [email protected]; [email protected]
Abstract

Random forests are among the most effective ensemble learning methods. In spite of their sound empirical performance, the study of their theoretical properties has been left far behind. Recently, several random forests variants with a nice theoretical basis have been proposed, but they all suffer from poor empirical performance. In this paper, we propose a Bernoulli random forests model (BRF), which intends to close the gap between the theoretical consistency and the empirical soundness of random forests classification. Compared to Breiman's original random forests, BRF makes two simplifications in tree construction by using two independent Bernoulli distributions. The first Bernoulli distribution is used to control the selection of candidate attributes for each node of the tree, and the second one controls the splitting point used by each node. As a result, BRF enjoys proved theoretical consistency, so its accuracy will converge to the optimum (i.e., the Bayes risk) as the training data grow infinitely large. Empirically, BRF demonstrates the best performance among all theoretically consistent random forests, and is very comparable to Breiman's original random forests (which do not have proved consistency yet). The theoretical and experimental studies advance the research one step further towards closing the gap between the theory and the practical performance of random forests classification.

[Figure: panel (a) Breiman RF — random bootstrap sampling, random attribute bagging, traditional (deterministic) tree node splitting; panel (b) BRF — random structure/estimation points splitting, Bernoulli-trial-controlled attribute bagging, Bernoulli-trial-controlled tree node splitting.]

Figure 1: Comparisons between Breiman RF (left panel) and the proposed BRF (right panel). The tree node splitting of Breiman RF is deterministic, so the final trees are highly data-dependent. Instead, BRF employs two Bernoulli distributions to control the tree construction. BRF ensures a certain degree of randomness in the trees, so the final trees are less data-dependent but still maintain a high classification accuracy.

1 Introduction

Random forests (RF) represent a class of ensemble learning methods that construct a large number of randomized decision trees and combine the results from all trees for classification or regression. The training of random forests, at least for the original Breiman version [Breiman, 2001], is extremely easy and efficient. Because predictions are derived from a large number of trees, random forests are very robust, and have demonstrated superb empirical performance for many real-world learning tasks (e.g., [Svetnik et al., 2003; Prasad et al., 2006; Cutler et al., 2007; Shotton et al., 2011; Xiong et al., 2012]).

Different from their well-recognized empirical reputation across numerous domain applications, the theoretical properties of random forests have not yet been fully established and are still under active research investigation. In particular, one of the most fundamental theoretical properties of a learning algorithm is consistency, which guarantees that the algorithm converges to the optimum (i.e., the Bayes risk) as the data grow infinitely large. Due to the inherent bootstrap randomization and the attribute bagging process employed by random forests, as well as the highly data-dependent tree structure, it is very difficult to prove the theoretical consistency of random forests. As shown in Fig. 1(a), Breiman's original random forests have two random processes, bootstrap instance sampling and attribute bagging, which intend to make the trees less data-dependent. However, when constructing each internal node of a tree, the attribute selected for splitting and the splitting point used by the attribute are controlled by a deterministic criterion (such as the Gini index [Breiman et al., 1984]). As a result, the constructed trees eventually become data-dependent, making theoretical analysis difficult.

Noticing the difficulty of proving the consistency of random forests, several random forests variants [Breiman, 2004; Biau et al., 2008; Genuer, 2010; 2012; Biau, 2012; Denil et al., 2014] relax or simplify the deterministic tree construction process by employing two randomized approaches: (1) replacing attribute bagging by randomly sampling one attribute; or (2) splitting the tree node with a more elementary split protocol in place of the complicated impurity criterion. For both approaches, the objective is to make the tree less data-dependent, so that the consistency analysis can be carried out. Unfortunately, although the properties of such approaches can be theoretically analyzed, they all suffer from poor empirical performance, and are mostly incomparable to Breiman's original random forests in terms of classification accuracy. The dilemma of theoretical consistency vs. empirical soundness persists and continuously motivates research to close the gap.

Motivated by the above observations, in this paper we propose a novel Bernoulli random forests model (BRF) with proved theoretical consistency and performance comparable to Breiman's original random forests. As shown in Fig. 1(b), our key idea is to introduce two Bernoulli distributions to drive/control the tree construction process, so that we can ensure a certain degree of randomness in the trees while still keeping the quality of the trees for classification. Because each Bernoulli trial involves a probability-controlled random process, BRF can ensure that the tree construction is random with respect to a probability value, or deterministic otherwise. As a result, the trees constructed by BRF are much less data-dependent than those of Breiman's original random forests, yet still perform much better than all existing theoretically consistent random forests.

The main contribution of the paper is threefold: (1) BRF has the least simplification changes compared to Breiman's original random forests, and its theoretical consistency is fully proved; (2) the two independent Bernoulli-distribution-controlled attribute and splitting point selection processes provide a solution to resolve the dilemma of theoretical consistency vs. empirical soundness; and (3) empirically, a series of experimental comparisons demonstrate that BRF achieves the best performance among all theoretically consistent random forests.

2 Bernoulli Random Forests

Compared to Breiman's original random forests, the proposed BRF makes three alterations, as shown in Fig. 1, to close the gap between theoretical consistency and empirical soundness.

2.1 Data point partitioning

Given a data set $\mathcal{D}_n$ with $n$ instances, for each instance $(X, Y)$ we have $X \in \mathbb{R}^D$ with $D$ being the number of attributes and $Y \in \{1, 2, \ldots, C\}$ with $C$ being the number of classes. Before the construction of each individual tree, we randomly partition the entire data points into two parts, i.e., Structure points and Estimation points. The two parts perform different roles in the individual tree construction, which helps establish the consistency property of the proposed BRF.

Structure points are used to construct the tree. They are only used to determine the attributes and splitting points in each internal node of the tree, but are not allowed to be used for estimating class labels in tree leaves.

Estimation points are only used to fit leaf predictors (estimating class labels in tree leaves). Note that these points are also split obeying the rules created by the structure points along the construction of the tree, but they have no effect on the structure of the tree.

For each tree, the points are partitioned randomly and independently. The ratio of the two parts is parameterized by Ratio = |Structure points| / |Entire points|.

2.2 Tree construction

In the proposed BRF, firstly, as stated above, the data point partitioning process replaces the bootstrap technique in sampling training instances. Secondly, two independent Bernoulli distributions are introduced into the strategies of selecting attributes and splitting points.

The first novel alteration in BRF is to choose candidate attributes with a probability satisfying a Bernoulli distribution. Let $B_1 \in \{0, 1\}$ be a binary random variable with "success" probability $p_1$; then $B_1$ has a Bernoulli distribution which takes 1 with probability $p_1$. We define $B_1 = 1$ if 1 candidate attribute is chosen and $B_1 = 0$ if $\sqrt{D}$ candidate attributes are chosen. To be specific, for each internal node, we choose 1 or $\sqrt{D}$ candidate attributes with probability $p_1$ or $1 - p_1$, respectively.

The second novel alteration is to choose splitting points using two different methods, with a probability satisfying another Bernoulli distribution. Similar to $B_1$, we assume $B_2 \in \{0, 1\}$ satisfies another Bernoulli distribution which takes 1 with probability $p_2$. We define $B_2 = 1$ if the random sampling method is used and $B_2 = 0$ if the impurity-optimizing method is used. Specifically, for each candidate attribute, we choose the splitting point through random sampling or by optimizing the impurity criterion with probability $p_2$ or $1 - p_2$, respectively.

The impurity criterion is denoted by:

$$I(v) = T(\mathcal{D}^S) - \frac{|\mathcal{D}'^S|}{|\mathcal{D}^S|} T(\mathcal{D}'^S) - \frac{|\mathcal{D}''^S|}{|\mathcal{D}^S|} T(\mathcal{D}''^S). \quad (1)$$

Here $v$ is the splitting point, which is chosen by maximizing $I(v)$. $\mathcal{D}$ is the cell belonging to the node to be split, which contains structure points $\mathcal{D}^S$ and estimation points $\mathcal{D}^E$. $\mathcal{D}'$ and $\mathcal{D}''$ are the two children that would be created if $\mathcal{D}$ were split at $v$. The function $T(\mathcal{D}^S)$ is the impurity criterion, e.g., Shannon entropy or Gini index, computed over the labels of the structure points $\mathcal{D}^S$. In BRF, the impurity criterion is the Gini index, though Shannon entropy is also acceptable.

Through the above two steps, for each internal node of the tree, one attribute and its corresponding splitting point are selected to split the data and grow the tree. Note that only the structure points are involved in the tree construction. The process repeats recursively until the stopping condition is met. Similar to Breiman's original random forests, BRF's stopping condition is also based on the minimum leaf size. However, we only restrict the estimation points rather than the entire data points. In other words, in each leaf, we require that the number of estimation points be larger than $k_n$, which depends on the number of training instances, i.e., $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$.
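The Bernoulli-controlled node split described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: helper names such as `choose_split` are ours, ties and the split search grid are handled naively, and the random splitting point is drawn over the attribute's observed range rather than a pre-scaled $[0, 1]$ interval.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_gain(xs, ys, v):
    """I(v) as in Eq. (1): parent impurity minus weighted child impurities."""
    left = [y for x, y in zip(xs, ys) if x <= v]
    right = [y for x, y in zip(xs, ys) if x > v]
    if not left or not right:
        return 0.0
    n = len(ys)
    return gini(ys) - len(left) / n * gini(left) - len(right) / n * gini(right)

def choose_split(X, y, p1, p2):
    """One BRF internal-node split over structure points X (rows) with labels y.

    First Bernoulli trial (prob. p1): pick 1 candidate attribute, else sqrt(D).
    Second Bernoulli trial (prob. p2): sample the splitting point at random,
    else optimise the impurity criterion over the observed values.
    """
    D = len(X[0])
    k = 1 if random.random() < p1 else max(1, round(math.sqrt(D)))
    candidates = random.sample(range(D), k)
    best = None
    for d in candidates:
        xs = [row[d] for row in X]
        if random.random() < p2:                 # random-sampling method
            v = random.uniform(min(xs), max(xs))
        else:                                    # impurity-optimising method
            v = max(xs, key=lambda t: impurity_gain(xs, y, t))
        gain = impurity_gain(xs, y, v)
        if best is None or gain > best[2]:
            best = (d, v, gain)
    return best  # (attribute index, splitting point, impurity gain)

random.seed(7)
X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]]
y = [0, 0, 1, 1]
print(choose_split(X, y, p1=0.5, p2=0.5))
```

In a full tree, this step would recurse on the two children until fewer than $k_n$ estimation points remain in a leaf.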
2.3 Prediction

Once the trees are trained on the structure points and the leaf predictors are fitted on the estimation points, BRF can be used to predict the labels of new unlabeled instances.

When making predictions, each individual tree predicts independently. We denote a base decision tree classifier created by our algorithm by $g$. Given an unlabeled instance $x$, the estimated probability of each class $c \in \{1, 2, \ldots, C\}$ is

$$\eta^{(c)}(x) = \frac{1}{N(A^E(x))} \sum_{(X, Y) \in A^E(x)} I\{Y = c\}, \quad (2)$$

and the prediction of the tree is the class that maximizes $\eta^{(c)}(x)$:

$$\hat{y} = g(x) = \arg\max_c \{\eta^{(c)}(x)\}, \quad (3)$$

where $N(A^E(x))$ denotes the number of estimation points in the leaf containing the instance $x$, and $I(e)$ is the indicator function, which takes 1 if $e$ is true and 0 otherwise.

Second, the prediction of the forests is the class which receives the most votes from the individual trees:

$$\hat{y} = g^{(M)}(x) = \arg\max_c \sum_{i=1}^{M} I\{g^{(i)}(x) = c\}, \quad (4)$$

where $M$ is the number of individual trees in the random forests.

3 Consistency

In this section, we first give an outline of the lemmas for establishing the consistency of random forests. Then we prove the consistency of the proposed BRF. We use a variable $Z$ to denote the randomness involved in the construction of the tree, including the selection of attributes and splitting points.

3.1 Preliminaries

Definition 1. Given the data set $\mathcal{D}_n$, a sequence of classifiers $\{g\}$ is consistent for a certain distribution of $(X, Y)$ if the error probability $L$ satisfies

$$\mathbb{E}[L] = P(g(X, Z, \mathcal{D}_n) \neq Y) \to L^* \quad \text{as } n \to \infty, \quad (5)$$

where $L^*$ is the Bayes risk, i.e., the minimum achievable risk of any classifier for the distribution of $(X, Y)$.

The consistency of random forests is implied by the consistency of the trees they are comprised of, as shown in the following two lemmas.

Lemma 1. Suppose a sequence of classifiers $\{g\}$ is consistent; then the voting classifier $g^{(M)}$, obtained by taking the majority vote over $M$ copies of $g$ with different randomizing variables, is also consistent.

Lemma 2. Suppose each class posterior estimate of $\eta^{(c)}(x) = P(Y = c \mid X = x)$ is consistent. Then the classifier

$$g(x) = \arg\max_c \{\eta^{(c)}(x)\} \quad (6)$$

is consistent for the corresponding multi-class classification problem.

Lemma 1 shows that the consistency of random forests is determined by the individual trees [Biau et al., 2008]. Lemma 2 allows us to reduce the consistency of multi-class classifiers to the consistency of the posterior estimates for each class [Denil et al., 2013].

Lemma 3. Suppose a sequence of classifiers $\{g\}$ is conditionally consistent for a specified distribution on $(X, Y)$, i.e.,

$$P(g(X, Z, I) \neq Y \mid I) \to L^*, \quad (7)$$

where $I$ represents the randomness in the data point partitioning. If the random partitioning produces acceptable structure and estimation parts with probability 1, then $\{g\}$ is unconditionally consistent, i.e.,

$$P(g(X, Z, I) \neq Y) \to L^*. \quad (8)$$

Lemma 3 shows that the data point partitioning procedure of the tree construction does not affect the consistency of the base decision tree [Denil et al., 2013].

To prove the consistency of the base decision tree, we employ a general consistency lemma for decision rules, as follows:

Lemma 4. Consider a classification rule which builds a prediction by averaging the labels in each leaf node. If the labels of the voting data do not influence the structure of the classification rule, then

$$\mathbb{E}[L] \to L^* \quad \text{as } n \to \infty \quad (9)$$

provided that

1. $\mathrm{diam}(A(X)) \to 0$ in probability,
2. $N(A^E(X)) \to \infty$ in probability,

where $A(X)$ denotes the leaf containing $X$ and $N(A^E(X))$ denotes the number of estimation points in $A(X)$.

Generally, the construction of decision trees can be viewed as a partitioning of the original instance space. Thus, each node of the tree corresponds to a rectangular subset/cell of $\mathbb{R}^D$, and the tree root corresponds to all of $\mathbb{R}^D$. Therefore, $\mathrm{diam}(A(X)) \to 0$ means that the size of the hypercube $A(X)$ approaches 0.

Lemma 4 shows that the consistency of the tree construction can be proved on the condition that the hypercubes/cells belonging to the leaves are sufficiently small but contain an infinite number of estimation points [Devroye et al., 2013].

3.2 Consistency theorem

With these preliminary results in hand, we are equipped to prove the main consistency theorem.

Theorem 1. Suppose that $X$ is supported on $[0, 1]^D$ and has non-zero density almost everywhere. Moreover, the cumulative distribution function (CDF) of the splitting points is right-continuous at 0 and left-continuous at 1. Then BRF is consistent provided that $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$.
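Before turning to the proof, note that the two-level voting rule of Section 2.3 — the leaf posterior of Eq. (2), the per-tree arg-max of Eq. (3), and the forest majority vote of Eq. (4) — is simple enough to sketch directly. In this illustrative sketch (ours, not the authors' code), each "tree" is modelled as a function mapping an instance to the list of estimation-point labels in its leaf.

```python
from collections import Counter

def tree_predict(leaf_labels):
    """Eqs. (2)-(3): the leaf posterior is the label frequency among the
    estimation points in the leaf; the tree predicts the arg-max class."""
    counts = Counter(leaf_labels)
    return max(counts, key=counts.get)

def forest_predict(trees, x):
    """Eq. (4): each tree votes independently; return the majority class."""
    votes = Counter(tree_predict(t(x)) for t in trees)
    return max(votes, key=votes.get)

# Toy forest: each "tree" maps x to the estimation-point labels of its leaf.
t1 = lambda x: [0, 0, 1]   # leaf posterior favours class 0
t2 = lambda x: [1, 1]      # favours class 1
t3 = lambda x: [0, 0]      # favours class 0
print(forest_predict([t1, t2, t3], x=None))  # majority vote -> 0
```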
According to Subsection 3.1, we know that the consistency of random forests is determined by the consistency of the base decision tree (Lemmas 1 and 2), which in turn depends on the consistency of the tree construction (Lemmas 3 and 4). More specifically, the proof of Theorem 1 amounts to proving the two conditions of Lemma 4, i.e., $\mathrm{diam}(A(X)) \to 0$ and $N(A^E(X)) \to \infty$ in probability.

Proof. Firstly, since BRF requires $N(A^E(X)) \geq k_n$, $N(A^E(X)) \to \infty$ is trivial when $k_n \to \infty$.

Secondly, we prove $\mathrm{diam}(A(X)) \to 0$ in probability. Let $V(d)$ be the size of the $d$-th attribute of $A(X)$. It suffices to show that $\mathbb{E}[V(d)] \to 0$ for all $d \in \{1, 2, \ldots, D\}$.

For any given $d$, the largest size of the child node for the $d$-th attribute is denoted by $V^*(d)$. Recalling that the splitting point is chosen either by random sampling in $[0, 1]$ with probability $p_2$ or by optimizing the impurity criterion with probability $1 - p_2$, we have

$$\mathbb{E}[V^*(d)] \leq (1 - p_2) \times 1 + p_2 \times \mathbb{E}[\max(U, 1 - U)] = (1 - p_2) \times 1 + p_2 \times \frac{3}{4} = 1 - \frac{p_2}{4}, \quad (10)$$

where $U$ is the splitting point for the random sampling method and $U \sim \mathrm{Uniform}[0, 1]$.

Revisiting that 1 or $\sqrt{D}$ candidate attributes are selected with probability $p_1$ or $1 - p_1$ respectively, we define the following events:

$E_1$ = {One candidate attribute is split},
$E_2$ = {The splitting attribute is exactly the $d$-th one}.

Denoting the size of the child node for the $d$-th attribute by $V'(d)$, we have

$$\begin{aligned}
\mathbb{E}[V'(d)] &= P(E_1)\,\mathbb{E}[V'(d) \mid E_1] + P(\bar{E}_1)\,\mathbb{E}\left[V'(d) \mid \bar{E}_1\right] \\
&\leq p_1\,\mathbb{E}[V'(d) \mid E_1] + (1 - p_1) \times 1 \\
&= p_1 \left( P(E_2 \mid E_1)\,\mathbb{E}[V'(d) \mid E_1, E_2] + P(\bar{E}_2 \mid E_1)\,\mathbb{E}\left[V'(d) \mid E_1, \bar{E}_2\right] \right) + (1 - p_1) \\
&\leq p_1 \left( \frac{1}{D}\,\mathbb{E}[V^*(d)] + 1 - \frac{1}{D} \right) + (1 - p_1) \\
&\leq 1 - \frac{p_1 p_2}{4D}. \quad (11)
\end{aligned}$$

Assuming $K$ is the distance from the tree root to a leaf and iterating (11) over the $K$ splits, we have

$$\mathbb{E}[V(d)] \leq \left( 1 - \frac{p_1 p_2}{4D} \right)^K, \quad (12)$$

which tends to 0 as $K \to \infty$.

Whichever of the two methods is used, the final selected splitting point at level $i$ can be viewed as a random variable $W_i$ ($i \in \{1, 2, \ldots, K\}$) with CDF $F_{W_i}$, from the tree root to a leaf. For any given $K$ and a constant $\delta > 0$, the smallest child node of the root has size $M_1 = \min(W_1, 1 - W_1)$ of at least $\delta^{1/K}$ with probability

$$P\left( M_1 \geq \delta^{1/K} \right) = P\left( \delta^{1/K} \leq W_1 \leq 1 - \delta^{1/K} \right) = F_{W_1}(1 - \delta^{1/K}) - F_{W_1}(\delta^{1/K}). \quad (13)$$

Without loss of generality, we scale the values of the attributes to the range $[0, 1]$ for each node; then after $K$ splits, the smallest child at the $K$-th level has size at least $\delta$ with probability at least

$$\prod_{i=1}^{K} \left( F_{W_i}(1 - \delta^{1/K}) - F_{W_i}(\delta^{1/K}) \right), \quad (14)$$

which is derived by assuming that the same attribute is split at each level of the tree. If different attributes are split at different levels, the bound (14) also holds. Since $F_{W_i}$ is right-continuous at 0 and left-continuous at 1, $F_{W_i}(1 - \delta^{1/K}) - F_{W_i}(\delta^{1/K}) \to 1$ as $\delta \to 0$. Thus, $\forall \varepsilon_1 > 0$, $\exists \delta_1 > 0$, such that

$$\prod_{i=1}^{K} \left( F_{W_i}(1 - \delta_1^{1/K}) - F_{W_i}(\delta_1^{1/K}) \right) > (1 - \varepsilon_1)^K. \quad (15)$$

Besides, $\forall \varepsilon > 0$, $\exists \varepsilon_1 > 0$, such that

$$(1 - \varepsilon_1)^K > 1 - \varepsilon. \quad (16)$$

The above (15) and (16) show that each node at the $K$-th level of the tree has size at least $\delta_1$ with probability at least $1 - \varepsilon$. Because the distribution of $X$ has a non-zero density, each of these nodes has a positive measure with respect to $\mu_X$. Defining

$$p = \min_{l:\ \text{a leaf at the } K\text{-th level}} \mu_X(l), \quad (17)$$

we know $p > 0$ since the minimum is over finitely many leaves and each leaf contains a set of positive measure.

In a data set of size $n$, the number of data points falling into the leaf $A$ is $\mathrm{Binomial}(n, p)$. Then the number of estimation points in $A$ is $np/2$, assuming without loss of generality that the Ratio is 0.5. According to Chebyshev's inequality, $N(A^E)$ can then be bounded from below accordingly.
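The constant $3/4$ appearing in (10) is just $\mathbb{E}[\max(U, 1 - U)]$ for $U \sim \mathrm{Uniform}[0, 1]$, since $\int_0^1 \max(u, 1 - u)\,du = 2 \int_{1/2}^{1} u\,du = 3/4$. A quick simulation (ours, purely illustrative and not part of the proof) confirms the value:

```python
import random

random.seed(42)
n = 200_000
# Monte Carlo estimate of E[max(U, 1-U)] for U ~ Uniform[0, 1]
est = sum(max(u, 1 - u) for u in (random.random() for _ in range(n))) / n
print(round(est, 2))  # close to 0.75 = E[max(U, 1-U)]
```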