Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)
Bernoulli Random Forests: Closing the Gap between Theoretical Consistency and Empirical Soundness
Yisen Wang†‡, Qingtao Tang†‡, Shu-Tao Xia†‡, Jia Wu⋆, Xingquan Zhu⇧
† Dept. of Computer Science and Technology, Tsinghua University, China
‡ Graduate School at Shenzhen, Tsinghua University, China
⋆ Quantum Computation & Intelligent Systems Centre, University of Technology Sydney, Australia
⇧ Dept. of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, USA
{wangys14, tqt15}@mails.tsinghua.edu.cn; [email protected]; [email protected]; [email protected]
Abstract

Random forests are among the most effective ensemble learning methods. In spite of their sound empirical performance, the study of their theoretical properties has been left far behind. Recently, several random forests variants with a nice theoretical basis have been proposed, but they all suffer from poor empirical performance. In this paper, we propose a Bernoulli random forests model (BRF), which intends to close the gap between the theoretical consistency and the empirical soundness of random forests classification. Compared to Breiman's original random forests, BRF makes two simplifications in tree construction by using two independent Bernoulli distributions. The first Bernoulli distribution is used to control the selection of candidate attributes for each node of the tree, and the second one controls the splitting point used by each node. As a result, BRF enjoys proved theoretical consistency, so its accuracy will converge to the optimum (i.e., the Bayes risk) as the training data grow infinitely large. Empirically, BRF demonstrates the best performance among all theoretically consistent random forests, and is very comparable to Breiman's original random forests (which do not have proved consistency yet). The theoretical and experimental studies advance the research one step further towards closing the gap between the theory and the practical performance of random forests classification.

[Figure: panel (a) Breiman RF — random bootstrap sampling, random attribute bagging, traditional (deterministic) tree node splitting; panel (b) BRF — random structure/estimation points splitting, Bernoulli-trial-controlled attribute bagging, Bernoulli-trial-controlled tree node splitting.]

Figure 1: Comparisons between Breiman RF (left panel) and the proposed BRF (right panel). The tree node splitting of Breiman RF is deterministic, so the final trees are highly data-dependent. Instead, BRF employs two Bernoulli distributions to control the tree construction. BRF ensures a certain degree of randomness in the trees, so the final trees are less data-dependent but still maintain a high classification accuracy.

1 Introduction

Random forests (RF) represent a class of ensemble learning methods that construct a large number of randomized decision trees and combine the results from all trees for classification or regression. The training of random forests, at least for the original Breiman version [Breiman, 2001], is extremely easy and efficient. Because predictions are derived from a large number of trees, random forests are very robust, and have demonstrated superb empirical performance for many real-world learning tasks (e.g., [Svetnik et al., 2003; Prasad et al., 2006; Cutler et al., 2007; Shotton et al., 2011; Xiong et al., 2012]).

Different from their well-recognized empirical reputation across numerous domain applications, the theoretical properties of random forests have not yet been fully established and are still under active research investigation. In particular, one of the most fundamental theoretical properties of a learning algorithm is consistency, which guarantees that the algorithm converges to the optimum (i.e., the Bayes risk) as the data grow infinitely large. Due to the inherent bootstrap randomization and the attribute bagging process employed by random forests, as well as the highly data-dependent tree structure, it is very difficult to prove the theoretical consistency of random forests. As shown in Fig. 1(a), Breiman's original random forests have two random processes, bootstrap instance sampling and attribute bagging, which intend to make the trees less data-dependent. However, when constructing each internal node of a tree, the attribute selected for splitting and the splitting point used by the attribute are controlled by a deterministic criterion (such as the Gini index [Breiman et al., 1984]). As a result, the constructed trees eventually become data-dependent, making theoretical analysis difficult.

Noticing the difficulty of proving the consistency of random forests, several random forests variants [Breiman, 2004; Biau et al., 2008; Genuer, 2010; 2012; Biau, 2012; Denil et al., 2014] relax or simplify the deterministic tree construction process by employing two randomized approaches: (1) replacing attribute bagging by randomly sampling one attribute; or (2) splitting the tree node with a more elementary split protocol in place of the complicated impurity criterion. For both approaches, the objective is to make the tree less data-dependent, so that the consistency analysis can be carried out. Unfortunately, although the properties of such approaches can be theoretically analyzed, they all suffer from poor empirical performance, and are mostly incomparable to Breiman's original random forests in terms of classification accuracy. The dilemma of theoretical consistency vs. empirical soundness persists and continuously motivates research to close the gap.

Motivated by the above observations, in this paper we propose a novel Bernoulli random forests model (BRF) with proved theoretical consistency and performance comparable to Breiman's original random forests. As shown in Fig. 1(b), our key idea is to introduce two Bernoulli distributions to drive/control the tree construction process, so that we can ensure a certain degree of randomness in the trees while still keeping the quality of the trees for classification. Because each Bernoulli trial involves a probability-controlled random process, BRF can ensure that the tree construction is random with respect to a probability value, or deterministic otherwise. As a result, the trees constructed by BRF are much less data-dependent than those of Breiman's original random forests, yet still perform much better than all existing theoretically consistent random forests.

The main contribution of the paper is threefold: (1) BRF has the least simplification changes compared to Breiman's original random forests, and its theoretical consistency is fully proved; (2) the two independent Bernoulli-distribution-controlled attribute and splitting point selection processes provide a solution to resolve the dilemma of theoretical consistency vs. empirical soundness; and (3) empirically, a series of experimental comparisons demonstrate that BRF achieves the best performance among all theoretically consistent random forests.

2 Bernoulli Random Forests

Compared to Breiman's original random forests, the proposed BRF makes three alterations, as shown in Fig. 1, to close the gap between theoretical consistency and empirical soundness.

2.1 Data point partitioning

Given a data set $\mathcal{D}_n$ with $n$ instances, for each instance $(X, Y)$ we have $X \in \mathbb{R}^D$ with $D$ being the number of attributes and $Y \in \{1, 2, \ldots, C\}$ with $C$ being the number of classes. Before the construction of each individual tree, we randomly partition the entire data points into two parts, i.e., Structure points and Estimation points. The two parts perform different roles in the individual tree construction, which helps establish the consistency property of the proposed BRF.

Structure points are used to construct the tree. They are only used to determine the attributes and splitting points in each internal node of the tree, but are not allowed to be used for estimating class labels in tree leaves.

Estimation points are only used to fit leaf predictors (estimating class labels in tree leaves). Note that these points are also split obeying the rules created by the structure points along the construction of the tree, but they have no effect on the structure of the tree.

For each tree, the points are partitioned randomly and independently. The ratio of the two parts is parameterized by Ratio = |Structure points| / |Entire points|.

2.2 Tree construction

In the proposed BRF, firstly, as stated above, the data point partitioning process replaces the bootstrap technique in sampling training instances. Secondly, two independent Bernoulli distributions are introduced into the strategies of selecting attributes and splitting points.

The first novel alteration in BRF is to choose candidate attributes with a probability satisfying a Bernoulli distribution. Let $B_1 \in \{0, 1\}$ be a binary random variable with "success" probability $p_1$; then $B_1$ has a Bernoulli distribution which takes 1 with probability $p_1$. We define $B_1 = 1$ if 1 candidate attribute is chosen and $B_1 = 0$ if $\sqrt{D}$ candidate attributes are chosen. To be specific, for each internal node, we choose 1 or $\sqrt{D}$ candidate attributes with probability $p_1$ or $1 - p_1$, respectively.

The second novel alteration is to choose splitting points using two different methods, with a probability satisfying another Bernoulli distribution. Similar to $B_1$, we assume $B_2 \in \{0, 1\}$ satisfies another Bernoulli distribution which takes 1 with probability $p_2$. We define $B_2 = 1$ if the random sampling method is used and $B_2 = 0$ if the impurity-optimizing method is used. Specifically, for each candidate attribute, we choose the splitting point through random sampling or by optimizing the impurity criterion with probability $p_2$ or $1 - p_2$, respectively.

The impurity criterion is denoted by:

$$I(v) = T(\mathcal{D}^S) - \frac{|\mathcal{D}'^S|}{|\mathcal{D}^S|} T(\mathcal{D}'^S) - \frac{|\mathcal{D}''^S|}{|\mathcal{D}^S|} T(\mathcal{D}''^S). \quad (1)$$

Here $v$ is the splitting point, which is chosen by maximizing $I(v)$. $\mathcal{D}$ is the cell belonging to the node to be split, which contains structure points $\mathcal{D}^S$ and estimation points $\mathcal{D}^E$. $\mathcal{D}'$ and $\mathcal{D}''$ are the two children that would be created if $\mathcal{D}$ were split at $v$. The function $T(\mathcal{D}^S)$ is the impurity criterion, e.g., Shannon entropy or Gini index, computed over the labels of the structure points $\mathcal{D}^S$. In BRF, the impurity criterion is the Gini index, though Shannon entropy is also acceptable.

Through the above two steps, for each internal node of the tree, one attribute and its corresponding splitting point are selected to split the data and grow the tree. Note that only the structure points are involved in the tree construction. The process repeats recursively until the stopping condition is met. Similar to Breiman's original random forests, BRF's stopping condition is also based on the minimum leaf size. However, we only restrict the estimation points rather than the entire data points. In other words, in each leaf, we require that the number of estimation points be larger than $k_n$, which depends on the number of training instances, i.e., $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$.
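The Bernoulli-controlled node split described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: helper names such as `choose_split` are ours, ties and the split search grid are handled naively, and the random splitting point is drawn over the attribute's observed range rather than a pre-scaled $[0, 1]$ interval.

```python
import math
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_gain(xs, ys, v):
    """I(v) as in Eq. (1): parent impurity minus weighted child impurities."""
    left = [y for x, y in zip(xs, ys) if x <= v]
    right = [y for x, y in zip(xs, ys) if x > v]
    if not left or not right:
        return 0.0
    n = len(ys)
    return gini(ys) - len(left) / n * gini(left) - len(right) / n * gini(right)

def choose_split(X, y, p1, p2):
    """One BRF internal-node split over structure points X (rows) with labels y.

    First Bernoulli trial (prob. p1): pick 1 candidate attribute, else sqrt(D).
    Second Bernoulli trial (prob. p2): sample the splitting point at random,
    else optimise the impurity criterion over the observed values.
    """
    D = len(X[0])
    k = 1 if random.random() < p1 else max(1, round(math.sqrt(D)))
    candidates = random.sample(range(D), k)
    best = None
    for d in candidates:
        xs = [row[d] for row in X]
        if random.random() < p2:                 # random-sampling method
            v = random.uniform(min(xs), max(xs))
        else:                                    # impurity-optimising method
            v = max(xs, key=lambda t: impurity_gain(xs, y, t))
        gain = impurity_gain(xs, y, v)
        if best is None or gain > best[2]:
            best = (d, v, gain)
    return best  # (attribute index, splitting point, impurity gain)

random.seed(7)
X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1]]
y = [0, 0, 1, 1]
print(choose_split(X, y, p1=0.5, p2=0.5))
```

In a full tree, this step would recurse on the two children until fewer than $k_n$ estimation points remain in a leaf.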
2.3 Prediction

Once the trees are trained on the structure points and the leaf predictors are fitted on the estimation points, BRF can be used to predict the labels of new unlabeled instances.

When making predictions, each individual tree predicts independently. We denote a base decision tree classifier created by our algorithm by $g$. Given an unlabeled instance $x$, the estimated probability of each class $c \in \{1, 2, \ldots, C\}$ is

$$\eta^{(c)}(x) = \frac{1}{N(A^E(x))} \sum_{(X, Y) \in A^E(x)} I\{Y = c\}, \quad (2)$$

and the prediction of the tree is the class that maximizes $\eta^{(c)}(x)$:

$$\hat{y} = g(x) = \arg\max_c \{\eta^{(c)}(x)\}, \quad (3)$$

where $N(A^E(x))$ denotes the number of estimation points in the leaf containing the instance $x$, and $I(e)$ is the indicator function, which takes 1 if $e$ is true and 0 otherwise.

Second, the prediction of the forests is the class which receives the most votes from the individual trees:

$$\hat{y} = g^{(M)}(x) = \arg\max_c \sum_{i=1}^{M} I\{g^{(i)}(x) = c\}, \quad (4)$$

where $M$ is the number of individual trees in the random forests.

3 Consistency

In this section, we first give an outline of the lemmas for establishing the consistency of random forests. Then we prove the consistency of the proposed BRF. We use a variable $Z$ to denote the randomness involved in the construction of the tree, including the selection of attributes and splitting points.

3.1 Preliminaries

Definition 1. Given the data set $\mathcal{D}_n$, a sequence of classifiers $\{g\}$ is consistent for a certain distribution of $(X, Y)$ if the error probability $L$ satisfies

$$\mathbb{E}[L] = P(g(X, Z, \mathcal{D}_n) \neq Y) \to L^* \quad \text{as } n \to \infty, \quad (5)$$

where $L^*$ is the Bayes risk, i.e., the minimum achievable risk of any classifier for the distribution of $(X, Y)$.

The consistency of random forests is implied by the consistency of the trees they are comprised of, as shown in the following two lemmas.

Lemma 1. Suppose a sequence of classifiers $\{g\}$ is consistent; then the voting classifier $g^{(M)}$, obtained by taking the majority vote over $M$ copies of $g$ with different randomizing variables, is also consistent.

Lemma 2. Suppose each class posterior estimate of $\eta^{(c)}(x) = P(Y = c \mid X = x)$ is consistent. Then the classifier

$$g(x) = \arg\max_c \{\eta^{(c)}(x)\} \quad (6)$$

is consistent for the corresponding multi-class classification problem.

Lemma 1 shows that the consistency of random forests is determined by the individual trees [Biau et al., 2008]. Lemma 2 allows us to reduce the consistency of multi-class classifiers to the consistency of the posterior estimates for each class [Denil et al., 2013].

Lemma 3. Suppose a sequence of classifiers $\{g\}$ is conditionally consistent for a specified distribution on $(X, Y)$, i.e.,

$$P(g(X, Z, I) \neq Y \mid I) \to L^*, \quad (7)$$

where $I$ represents the randomness in the data point partitioning. If the random partitioning produces acceptable structure and estimation parts with probability 1, then $\{g\}$ is unconditionally consistent, i.e.,

$$P(g(X, Z, I) \neq Y) \to L^*. \quad (8)$$

Lemma 3 shows that the data point partitioning procedure of the tree construction does not affect the consistency of the base decision tree [Denil et al., 2013].

To prove the consistency of the base decision tree, we employ a general consistency lemma for decision rules, as follows:

Lemma 4. Consider a classification rule which builds a prediction by averaging the labels in each leaf node. If the labels of the voting data do not influence the structure of the classification rule, then

$$\mathbb{E}[L] \to L^* \quad \text{as } n \to \infty \quad (9)$$

provided that

1. $\mathrm{diam}(A(X)) \to 0$ in probability,
2. $N(A^E(X)) \to \infty$ in probability,

where $A(X)$ denotes the leaf containing $X$ and $N(A^E(X))$ denotes the number of estimation points in $A(X)$.

Generally, the construction of decision trees can be viewed as a partitioning of the original instance space. Thus, each node of the tree corresponds to a rectangular subset/cell of $\mathbb{R}^D$, and the tree root corresponds to all of $\mathbb{R}^D$. Therefore, $\mathrm{diam}(A(X)) \to 0$ means that the size of the hypercube $A(X)$ approaches 0.

Lemma 4 shows that the consistency of the tree construction can be proved on the condition that the hypercubes/cells belonging to the leaves are sufficiently small but contain an infinite number of estimation points [Devroye et al., 2013].

3.2 Consistency theorem

With these preliminary results in hand, we are equipped to prove the main consistency theorem.

Theorem 1. Suppose that $X$ is supported on $[0, 1]^D$ and has non-zero density almost everywhere. Moreover, the cumulative distribution function (CDF) of the splitting points is right-continuous at 0 and left-continuous at 1. Then BRF is consistent provided that $k_n \to \infty$ and $k_n / n \to 0$ as $n \to \infty$.
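Before turning to the proof, note that the two-level voting rule of Section 2.3 — the leaf posterior of Eq. (2), the per-tree arg-max of Eq. (3), and the forest majority vote of Eq. (4) — is simple enough to sketch directly. In this illustrative sketch (ours, not the authors' code), each "tree" is modelled as a function mapping an instance to the list of estimation-point labels in its leaf.

```python
from collections import Counter

def tree_predict(leaf_labels):
    """Eqs. (2)-(3): the leaf posterior is the label frequency among the
    estimation points in the leaf; the tree predicts the arg-max class."""
    counts = Counter(leaf_labels)
    return max(counts, key=counts.get)

def forest_predict(trees, x):
    """Eq. (4): each tree votes independently; return the majority class."""
    votes = Counter(tree_predict(t(x)) for t in trees)
    return max(votes, key=votes.get)

# Toy forest: each "tree" maps x to the estimation-point labels of its leaf.
t1 = lambda x: [0, 0, 1]   # leaf posterior favours class 0
t2 = lambda x: [1, 1]      # favours class 1
t3 = lambda x: [0, 0]      # favours class 0
print(forest_predict([t1, t2, t3], x=None))  # majority vote -> 0
```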
According to Subsection 3.1, we know that the consistency of random forests is determined by the consistency of the base decision tree (Lemmas 1 and 2), which in turn depends on the consistency of the tree construction (Lemmas 3 and 4). More specifically, the proof of Theorem 1 amounts to proving the two conditions of Lemma 4, i.e., $\mathrm{diam}(A(X)) \to 0$ and $N(A^E(X)) \to \infty$ in probability.

Proof. Firstly, since BRF requires $N(A^E(X)) \geq k_n$, $N(A^E(X)) \to \infty$ is trivial when $k_n \to \infty$.

Secondly, we prove $\mathrm{diam}(A(X)) \to 0$ in probability. Let $V(d)$ be the size of the $d$-th attribute of $A(X)$. It suffices to show that $\mathbb{E}[V(d)] \to 0$ for all $d \in \{1, 2, \ldots, D\}$.

For any given $d$, the largest size of the child node for the $d$-th attribute is denoted by $V^*(d)$. Recalling that the splitting point is chosen either by random sampling in $[0, 1]$ with probability $p_2$ or by optimizing the impurity criterion with probability $1 - p_2$, we have

$$\mathbb{E}[V^*(d)] \leq (1 - p_2) \times 1 + p_2 \times \mathbb{E}[\max(U, 1 - U)] = (1 - p_2) \times 1 + p_2 \times \frac{3}{4} = 1 - \frac{p_2}{4}, \quad (10)$$

where $U$ is the splitting point for the random sampling method and $U \sim \mathrm{Uniform}[0, 1]$.

Revisiting that 1 or $\sqrt{D}$ candidate attributes are selected with probability $p_1$ or $1 - p_1$ respectively, we define the following events:

$E_1$ = {One candidate attribute is split},
$E_2$ = {The splitting attribute is exactly the $d$-th one}.

Denoting the size of the child node for the $d$-th attribute by $V'(d)$, we have

$$\begin{aligned}
\mathbb{E}[V'(d)] &= P(E_1)\,\mathbb{E}[V'(d) \mid E_1] + P(\bar{E}_1)\,\mathbb{E}\left[V'(d) \mid \bar{E}_1\right] \\
&\leq p_1\,\mathbb{E}[V'(d) \mid E_1] + (1 - p_1) \times 1 \\
&= p_1 \left( P(E_2 \mid E_1)\,\mathbb{E}[V'(d) \mid E_1, E_2] + P(\bar{E}_2 \mid E_1)\,\mathbb{E}\left[V'(d) \mid E_1, \bar{E}_2\right] \right) + (1 - p_1) \\
&\leq p_1 \left( \frac{1}{D}\,\mathbb{E}[V^*(d)] + 1 - \frac{1}{D} \right) + (1 - p_1) \\
&\leq 1 - \frac{p_1 p_2}{4D}. \quad (11)
\end{aligned}$$

Assuming $K$ is the distance from the tree root to a leaf and iterating (11) over the $K$ splits, we have

$$\mathbb{E}[V(d)] \leq \left( 1 - \frac{p_1 p_2}{4D} \right)^K, \quad (12)$$

which tends to 0 as $K \to \infty$.

Whichever of the two methods is used, the final selected splitting point at level $i$ can be viewed as a random variable $W_i$ ($i \in \{1, 2, \ldots, K\}$) with CDF $F_{W_i}$, from the tree root to a leaf. For any given $K$ and a constant $\delta > 0$, the smallest child node of the root has size $M_1 = \min(W_1, 1 - W_1)$ of at least $\delta^{1/K}$ with probability

$$P\left( M_1 \geq \delta^{1/K} \right) = P\left( \delta^{1/K} \leq W_1 \leq 1 - \delta^{1/K} \right) = F_{W_1}(1 - \delta^{1/K}) - F_{W_1}(\delta^{1/K}). \quad (13)$$

Without loss of generality, we scale the values of the attributes to the range $[0, 1]$ for each node; then after $K$ splits, the smallest child at the $K$-th level has size at least $\delta$ with probability at least

$$\prod_{i=1}^{K} \left( F_{W_i}(1 - \delta^{1/K}) - F_{W_i}(\delta^{1/K}) \right), \quad (14)$$

which is derived by assuming that the same attribute is split at each level of the tree. If different attributes are split at different levels, the bound (14) also holds. Since $F_{W_i}$ is right-continuous at 0 and left-continuous at 1, $F_{W_i}(1 - \delta^{1/K}) - F_{W_i}(\delta^{1/K}) \to 1$ as $\delta \to 0$. Thus, $\forall \varepsilon_1 > 0$, $\exists \delta_1 > 0$, such that

$$\prod_{i=1}^{K} \left( F_{W_i}(1 - \delta_1^{1/K}) - F_{W_i}(\delta_1^{1/K}) \right) > (1 - \varepsilon_1)^K. \quad (15)$$

Besides, $\forall \varepsilon > 0$, $\exists \varepsilon_1 > 0$, such that

$$(1 - \varepsilon_1)^K > 1 - \varepsilon. \quad (16)$$

The above (15) and (16) show that each node at the $K$-th level of the tree has size at least $\delta_1$ with probability at least $1 - \varepsilon$. Because the distribution of $X$ has a non-zero density, each of these nodes has a positive measure with respect to $\mu_X$. Defining

$$p = \min_{l:\ \text{a leaf at the } K\text{-th level}} \mu_X(l), \quad (17)$$

we know $p > 0$ since the minimum is over finitely many leaves and each leaf contains a set of positive measure.

In a data set of size $n$, the number of data points falling into the leaf $A$ is $\mathrm{Binomial}(n, p)$. Then the number of estimation points in $A$ is $np/2$, assuming without loss of generality that the Ratio is 0.5. According to Chebyshev's inequality, $N(A^E)$ can then be bounded from below accordingly.
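The constant $3/4$ appearing in (10) is just $\mathbb{E}[\max(U, 1 - U)]$ for $U \sim \mathrm{Uniform}[0, 1]$, since $\int_0^1 \max(u, 1 - u)\,du = 2 \int_{1/2}^{1} u\,du = 3/4$. A quick simulation (ours, purely illustrative and not part of the proof) confirms the value:

```python
import random

random.seed(42)
n = 200_000
# Monte Carlo estimate of E[max(U, 1-U)] for U ~ Uniform[0, 1]
est = sum(max(u, 1 - u) for u in (random.random() for _ in range(n))) / n
print(round(est, 2))  # close to 0.75 = E[max(U, 1-U)]
```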