Residual Likelihood Forests

Yan Zuo ([email protected])
Tom Drummond ([email protected])
ARC Centre of Excellence for Robotic Vision, Monash University, Australia

arXiv:2011.02086v1 [stat.ML] 4 Nov 2020

Abstract

This paper presents a novel ensemble learning approach called Residual Likelihood Forests (RLF). Our weak learners produce conditional likelihoods that are sequentially optimized using a global loss in the context of previous learners within a boosting-like framework (rather than probability distributions that are measured from observed data), and are combined multiplicatively (rather than additively). This increases the efficiency of our strong classifier, allowing for the design of classifiers which are more compact in terms of model capacity. We apply our method to several machine learning classification tasks, showing significant improvements in performance. When compared against several ensemble approaches including Random Forests and Gradient Boosted Trees, RLFs offer a significant improvement in performance whilst concurrently reducing the required model size.

1 Introduction

Ensemble and boosting methods such as Random Forests [3] and AdaBoost [19] are often recognized as some of the best out-of-the-box classifiers, consistently achieving state-of-the-art performance across a wide range of computer vision tasks, including applications in image classification [1], semantic segmentation [22], object recognition [12] and data clustering [16]. The success of these methods is attributed to their ability to learn models (strong learners) which possess low bias and variance through the combination of weakly correlated learners (weak learners). Forests reduce variance by averaging their weak learners over the ensemble. Boosting, on the other hand, aims to reduce both bias and variance by sequentially optimizing under conditional constraints.

The commonality between both approaches is in the way each learner is constructed: both methods use a top-down induction algorithm (such as CART [4]) which greedily learns decision nodes in a recursive manner. This approach is known to be suboptimal in terms of objective maximization, as there is no guarantee that a global loss is being minimized [14]. In practice, this type of optimization requires the non-linearity offered by several (very) deep trees, which results in redundancy in learned models with large overlaps of information between weak learners.

To address these limitations, the ensemble approaches of [11, 20] have utilized gradient information within a boosting framework. This allows weak learners to be fit via pseudo-residuals or to a set of adaptive weights, and allows for the minimization of a global loss via gradient descent. Whilst these methods have demonstrated improved performance over their Random Forest counterpart, they are more susceptible to overfitting, making them harder to tune. An alternative approach proposed by [17] retains Random Forests as the base classifier, employing a global refinement technique combined with leaf pruning to improve performance and reduce model size.
However, this method still incurs the relatively large overhead of constructing the original Random Forest before refinement can be performed (with an additional overhead associated with global refinement and pruning).

In this work, we propose a method which constructs a decision forest that benefits from the boosting approach of creating complementary weak learners through minimizing a global loss, yet retains the simplicity in model tuning of Random Forests. Our approach monotonically decreases the loss function of the ensemble using a novel, top-down induction algorithm which, unlike CART, accounts for the mutual information between weak learners already in the ensemble. Empirically, we extensively evaluate our approach across multiple datasets against several ensemble and boosting baselines. Our results show that our method improves classification performance at a higher parameter efficiency.

The main contributions of this paper are as follows:

• We propose a novel ensemble learning framework that includes a novel induction algorithm which accounts for the mutual information between the weak learners it constructs.
• Theoretically, we show that our method allows optimization to occur in closed form (in contrast to existing gradient boosting methods), which simplifies model tuning.
• Empirically, we show that our method offers dramatic reductions in required model size. Further, we show that our approach outperforms several competing boosting and ensemble baselines across 5 standard machine learning datasets.

2 Related Work

2.1 Random Forests

RFs became popular in the literature due to their flexibility and robustness when dealing with data-driven learning problems [6]. RFs were first developed by [15] to overcome the variance and stability issues found in binary decision trees by randomly choosing the subspace of features for tree training. This method was refined by [3], where the bootstrap aggregating technique of [2] was combined with the random feature selection of RFs. [13] developed Extremely Randomized Trees, where both the attribute and cut-point of the input feature were randomized. [22] applied RFs to the semantic segmentation problem, using them as a global image classifier to quickly and efficiently provide priors to be pipelined into the system developed in [21].

2.2 Boosting

The work of [18] first introduced the concept of boosting using an algorithm that selectively combined several weak hypotheses into a single strong hypothesis. [8] extended this algorithm to combine the outputs of all weak hypotheses. These methods, however, did not take advantage of the adaptability of weak learners. This was addressed by [9] with Adaptive Boosting, where each subsequent weak learner introduced to the strong learner is biased to favor misclassified training samples using high-weight data points. [5] extended this concept of adaptive boosting by offering an alternative to using weights, formulating the method as a gradient descent problem with a specialized loss function.

3 Preliminaries

Conventional decision trees consist of internal decision nodes and terminal leaf nodes. The internal nodes of each decision tree are a set of $N_d$ decision nodes, $\mathcal{D} = \{d_0, \cdots, d_{N_d - 1}\}$, each of which holds a decision function $d_i(x; \theta_i)$, where $\theta_i$ are the parameters for decision node $i$ in the tree. For binary decision trees, the decision function for each decision node $i$ is defined as $d_i(x; \theta_i) : \mathcal{X} \rightarrow [0, 1]$, which routes a sample to a left or right subtree.
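For concreteness, the following minimal Python sketch (illustrative only; the class and function names are our own and not from the paper) shows the structures just described: hard, axis-aligned decision nodes implementing $d_i(x; \theta_i)$ with $\theta_i$ a (feature, threshold) pair, leaves holding class distributions, and a routing function that sends a sample to a terminal leaf.

```python
import numpy as np

class DecisionNode:
    """Axis-aligned binary split d_i(x; theta_i), with theta_i = (feature, threshold)."""
    def __init__(self, feature, threshold, left=None, right=None):
        self.feature = feature      # index of the feature to test
        self.threshold = threshold  # cut-point of the test
        self.left = left            # left child (DecisionNode or Leaf)
        self.right = right          # right child (DecisionNode or Leaf)

    def decide(self, x):
        # Hard decision in {0, 1}: 1 routes the sample right, 0 routes it left.
        return int(x[self.feature] > self.threshold)

class Leaf:
    """Terminal node storing a class probability distribution q = Q(l)."""
    def __init__(self, class_probs):
        self.class_probs = np.asarray(class_probs, dtype=float)

def route(node, x):
    """Routing function: follow decision nodes until a terminal leaf is reached."""
    while isinstance(node, DecisionNode):
        node = node.right if node.decide(x) else node.left
    return node

# Example: a stump that tests feature 0 against the threshold 0.5.
tree = DecisionNode(0, 0.5, left=Leaf([0.8, 0.2]), right=Leaf([0.1, 0.9]))
print(route(tree, np.array([0.7, 0.0])).class_probs)  # -> [0.1 0.9]
```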
Collectively, the decision nodes route data $x$ through the tree until a terminal leaf node is reached: $\ell = \delta(x; \Theta)$. Here, $\delta$ denotes the routing function which directs data $x$ to the terminal leaf node $\ell$, given the tree's decision node parameters $\Theta$.

For classification, the leaf nodes of a decision tree contain class label probability distributions, $q = Q(\ell)$. These distributions are formed from the training data and accompanying ground truth class labels. For a terminal leaf node $\ell$, the stored probability value for class $j$ is given as:

$$ q_j = Q(\ell)_j = \frac{\sum_{n}^{N_\ell} c_{n,j}}{N_\ell} \qquad (1) $$

where $N_\ell$ is the total number of samples routed into leaf node $\ell$ and $c_{n,j}$ indicates the observed class label for sample $n$ in class $j$:

$$ c_{n,j} = \begin{cases} 1, & \text{if sample } n \text{ has class label } j \\ 0, & \text{otherwise} \end{cases} \qquad (2) $$

Typically, the decision nodes of the tree are chosen such that the resulting distributions give the best split according to some purity measure (e.g. entropy or the Gini index [3]).

A conventional decision forest is an ensemble of $N_t$ decision trees. It delivers a final prediction for the input $x$ by routing $x$ through each decision tree in the ensemble to a leaf node and taking the average of the predictions delivered by the trees:

$$ P_j = \frac{1}{N_t} \sum_{t}^{N_t} Q^t\big(\delta^t(x; \Theta^t)\big)_j \qquad (3) $$

where $Q^t$, $\delta^t$ and $\Theta^t$ are the respective distributions, routing function and parameters of tree $t$ in the ensemble.

4 Residual Likelihood Forests

First, we motivate our proposed approach using the following simple classification example involving two weak learners evaluating some input data. To keep our notation concise, we denote by $p_j^1$ and $p_j^2$ the probabilities assigned by each weak learner to the underlying class label $j$ given the input. Given that the prior is uniform over classes, the correct approach to combining these two weak learners depends on the correlation between the assigned probabilities from each weak learner.

If $p_j^1$ and $p_j^2$ are independent of one another, the correct class probability should be given by the normalized product for class $j$:

$$ P_j = \frac{p_j^1 \, p_j^2}{\sum_{k}^{N_c} p_k^1 \, p_k^2} \qquad (4) $$

where $N_c$ is the number of classes. However, when the two weak learners are fully correlated, the assigned probabilities from each weak learner would be the same and $p_j^1 = p_j^2$. In this instance, applying Eq. 4 would result in an incorrect estimate of the class probabilities, namely the normalized squared distribution; we should instead simply average their marginal distributions.

4.1 Weak Learners Generating Likelihoods

In practice, the probability distributions $p^1$ and $p^2$ will rarely be fully independent or fully correlated. As such, deciding how much each weak learner should contribute to the strong classifier's prediction is a non-trivial task. Typically, ensemble methods combine their weak learners by averaging their marginal distributions [2, 8, 18], relying on the law of large numbers to obtain a good estimate of the underlying distribution of the data and to smooth out variance between learners. The correlation between learners is problematic if we wish to have a consistent approach for correctly combining weak learners within an ensemble, since the approach that should be adopted depends on this correlation.
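To make the distinction concrete, the short sketch below (illustrative Python, not code from the paper) contrasts the two combination rules: the normalized product of Eq. 4, which is appropriate for independent learners, and simple averaging, which is appropriate when the learners are fully correlated. Applying the product rule to two identical distributions exhibits the over-confident "normalized squared" estimate described above.

```python
import numpy as np

def combine_product(p1, p2):
    """Normalized product of two class distributions (Eq. 4):
    correct when the two weak learners are independent."""
    prod = p1 * p2
    return prod / prod.sum()

def combine_average(p1, p2):
    """Average of the two marginal distributions:
    correct when the two weak learners are fully correlated."""
    return 0.5 * (p1 + p2)

# Two weak learners that agree exactly (the fully correlated case).
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.6, 0.3, 0.1])

print(combine_average(p1, p2))  # [0.6 0.3 0.1]           -- unchanged, as desired
print(combine_product(p1, p2))  # [0.783 0.196 0.022]     -- over-confident squared estimate
```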