
Random Hinge Forest for Differentiable Learning

Nathan Lay, Adam P. Harrison, Sharon Schreiber, Gitesh Dawer¹, Adrian Barbu¹

¹Department of Statistics, Florida State University, Tallahassee, FL. Correspondence to: Nathan Lay.

Abstract

We propose random hinge forests, a simple, efficient, and novel variant of decision forests. Importantly, random hinge forests can be readily incorporated as a general component within arbitrary computation graphs that are optimized end-to-end with stochastic gradient descent or variants thereof. We derive random hinge forests and ferns, focusing on their sparse and efficient nature, their min-max margin property, strategies to initialize them for arbitrary network architectures, and the class of optimizers most suitable for optimizing random hinge forest. The performance and versatility of random hinge forests are demonstrated by experiments on a variety of small and large UCI machine learning data sets as well as the MNIST, Letter, and USPS image data sets. We compare random hinge forests with random forests and the more recent backpropagating neural decision forests.

1. Introduction

Random forest (Breiman, 2001) is a popular and widely used algorithm. With its constituent models being decision trees, its formulation offers some level of interpretability and intuition. It also tends to generalize well with modest parameter tuning for a variety of learning tasks and data sets. However, the recent success of deep artificial neural networks has revealed limitations of random forest and similar greedy learning algorithms. Deep artificial neural networks have exhibited state-of-the-art and even superhuman performance on some types of tasks.

The success of deep neural networks can at least partly be attributed to both their ability to learn parameters in an end-to-end fashion and their ability to scale to very large data sets. By contrast, random forest is limited to fixed features and can plateau on test performance regardless of the training set size. Deep neural networks, however, are difficult to develop, train and even interpret. As a consequence, many researchers shy away from exploring and developing their own network architectures and often use and tweak pre-trained off-the-shelf models (for example (Simonyan & Zisserman, 2014; Ronneberger et al., 2015; He et al., 2016)) for their purposes.

We propose a novel formulation of random forest, the random hinge forest, that addresses its predecessor's greedy learning limitations. Random hinge forest is a differentiable learning machine for use in arbitrary computation graphs. This enables it to learn in an end-to-end fashion, benefit from learnable feature representations, and operate in concert with other computation graph mechanisms. Random hinge forest also addresses some limitations present in other formulations of differentiable decision trees and forests, namely efficiency and numerical stability. We show that random hinge forest prediction and gradient evaluation are logarithmic in complexity, compared to the exponential complexity of previously described methods, and that random hinge forest is robust to activation/decision function saturation and loss of precision. Lastly, we show that random hinge forest is better than random forest and comparable to the state-of-the-art neural decision forest (Kontschieder et al., 2015).

This paper first describes a series of related works and how they differ, then formulates the random hinge tree, ferns and forest. A series of experiments and results are then presented on UCI data sets and MNIST. We compare the performance of the proposed methods with random forest (Breiman, 2001) and neural decision forest (Kontschieder et al., 2015), discuss the findings and finally conclude the paper.

2. Related Work

The forerunner of this work is random forest, first described by (Amit & Geman, 1997) and later by (Breiman, 2001). Random forest is an ensemble of random decision trees that are generally aggregated by voting or averaging. Each random tree is trained on a bootstrap-aggregated training sample in a greedy and randomized divide-and-conquer fashion. Random trees learn by optimizing a task-specific gain function as they divide and partition the training data. Gain functions have been defined for both classification and regression as well as for a plethora of specialized tasks such as object localization (Gall & Lempitsky, 2013; Criminisi et al., 2010). Trees are also found as components of other learning algorithms such as boosting (Friedman, 2001).

The works of (Suárez & Lutsko, 1999; Kontschieder et al., 2015) make partitioning thresholds fuzzy by representing the threshold operation as a differentiable sigmoid. The objective is to make a decision tree that is differentiable for end-to-end optimization. When traversing these fuzzy trees, each decision results in a degree of membership in the left and right partitioned sets. For example, if a decision produces a value of 0.4, then it results in 0.4 membership in one partition and 0.6 membership in the other. When the decision process is taken as a whole, each resulting partition has a membership that is generally defined as a path-specific product of these decisions, producing, however small, non-zero membership for all partitions. The work of (Kontschieder et al., 2015) also introduces an alternating step for training the predictions and decisions and goes one step further, describing how to simultaneously optimize all trees of a forest in an end-to-end fashion. The work of (Schulter et al., 2013) globally optimizes a forest against an arbitrary loss function in a gradient boosting framework by gradually building the constituent trees up from root to leaf in a breadth-first order.

The work of (Ozuysal et al., 2010) develops a constrained version of a decision tree known as a random fern. A single random fern behaves like a checklist of tests on features that each produce a yes/no answer. The combination of yes/no answers is then used to look up a prediction. Unlike random forests, which are aggregated by averaging or voting, random ferns are aggregated via a semi-naïve Bayes formulation. And unlike greedily learned decision trees, each random fern is trained by picking a random subset of binary features and then binning the training examples in the leaves. These leaves are then normalized to be used in a semi-naïve Bayes aggregation.

Random hinge forest also resembles the models learned by multivariate adaptive regression splines (MARS) (Friedman, 1991). MARS is a learning method that fits a linear combination of hinge functions in a greedy fashion. This is done in two passes: hinge functions are progressively added to fit the training data in the first pass, and then some hinge functions are removed in the second pass to help produce a generalized model.

Lastly, the work of (Krizhevsky et al., 2012), and an abundance of works that followed, very successfully used rectified linear units (ReLU) as activation functions. This was done for training efficiency and to keep non-linearity. It also, to some extent, helps prevent the saturation problems that lead to near-zero gradients with sigmoid activations. To the best of our knowledge, differentiable decision trees have always relied on sigmoid-like decision functions that resemble the activation functions of artificial neural networks. This reliance on sigmoid decision functions in fuzzy decision trees, the plethora of numerical problems with sigmoid activation functions, and the very successful use of ReLU in (Krizhevsky et al., 2012) provided the clues to formulate random hinge forest.

Random hinge forest differs from random forest in that constituent trees are all inferred simultaneously in an end-to-end fashion instead of with greedy optimization per tree. However, the trees of random forest can currently learn decision structures not easily described by random hinge trees as formulated here. And where random trees deliberately choose splitting features that optimize information gain, random hinge trees indirectly adjust learnable splitting features to fit their randomized decision structure. Random hinge forest is also different from the works of (Suárez & Lutsko, 1999; Kontschieder et al., 2015) and fuzzy decision trees in general in that random hinge trees use ReLUs for decision functions instead of sigmoids. This retains the desirable evaluation behavior of decision trees. Where fuzzy decision trees admit a degree of membership to several partitions of the feature space, random hinge trees admit membership to only one partition, as with crisp decision trees. This also implies that for a depth D tree, random hinge trees need only evaluate D decisions, while fuzzy trees need to evaluate on the order of 2^D decisions. Like (Kontschieder et al., 2015), random hinge forest can simultaneously learn all trees in an end-to-end fashion. However, random hinge forest is not limited to probability predictions or leaf purity losses, does not need alternating learning steps for leaf weights and thresholds, and can be trained in the usual forward-backward pass done in computation graphs. The work of (Schulter et al., 2013) builds up trees in an iterative fashion, while random hinge forest assumes complete trees and a randomized initial state. Both random hinge forest and alternating decision forest aim to optimize global loss functions, but the former can operate as part of a general computation graph. This work also presents the random hinge fern, which bears the same decision constraint as the random fern, although we have not explored semi-naïve Bayes aggregation. This work also provides a way to train random hinge ferns in an end-to-end fashion which, to the best of our knowledge, has never been done for random ferns.

3. Random Hinge Trees and Ferns
Decision trees represent piecewise constant functions where each piece is disjoint from the rest and is represented as a leaf of the tree. Each leaf is defined as a conjunction of conditions. Suppose each leaf ℓ ⊆ T, a set of vertices v ∈ T in tree graph T, represents a unique path to leaf ℓ. Then the conjunction to a given leaf ℓ is defined as an indicator function

    I_\ell(x) = \bigwedge_{(d,v) \in \ell} I(d(x_{f_v} > t_v))    (1)

where d(x) represents the direction of traversal

    d(x) = \begin{cases} \neg x & \text{if the next vertex is the left child} \\ x & \text{otherwise} \end{cases}    (2)

and f_v, t_v are the feature index and threshold for the decision at vertex v. Since all conjunctions are disjoint, a decision tree prediction can be represented as a summation over its leaves

    h_T(x) = \sum_{\ell \in T} w_\ell I_\ell(x)    (3)

We additionally abuse notation for ℓ and treat it as an integer index, where w_ℓ is an arbitrary task-specific prediction (e.g. a tensor) for leaf ℓ stored in component ℓ of w. Since I(x_{f_v} > t_v) has a derivative of 0 almost everywhere, this form is not useful for optimization with gradient-based methods. Past works (Suárez & Lutsko, 1999; Kontschieder et al., 2015) that optimize decision trees with gradient descent approximated I(·) with a sigmoid function so that it would have non-zero derivatives. Sigmoid functions are never zero and thus all the terms of (3) are non-zero when using sigmoids. Efficient implementations of decision trees have an evaluation operation that requires on the order of the depth of the tree operations, but such fuzzy approximations may require on the order of the total number of vertices to evaluate.

Random hinge trees draw inspiration from the curious Rectified Linear Unit (ReLU) usage of (Krizhevsky et al., 2012). However, ReLU cannot be used in the same manner as the sigmoids of fuzzy decision trees: ReLU is not bounded in [0, 1], which is generally required in fuzzy logic. Instead, we imagine a pretend logic where True = {x > 0 : x ∈ R} and False = {x < 0 : x ∈ R}. The special value x = 0 is ambiguous and belongs to neither the True nor the False set. From there, we can define consistent logic operations ∧, ∨, ¬ as

    A \wedge B = \min\{A, B\}    (4)
    A \vee B = \max\{A, B\}    (5)
    \neg A = -A    (6)

Then we define the binary relation A > B as

    A > B = \max\{A - B, 0\} = \mathrm{ReLU}(A - B)    (7)

We could choose other definitions for A > B, for example A > B = A − B; however, this definition would suffer the same efficiency problem as fuzzy decision trees since it is non-zero even when it expresses values in the False set. Additionally, its linear nature may limit the tasks it can solve. By casually substituting this logic and binary relation into (1) and (3), we get

    \hat{I}_\ell(x) = \min_{(d,v) \in \ell} \{\mathrm{ReLU}(d(x_{f_v} - t_v))\}    (8)

    h_T(x) = \sum_{\ell \in T} w_\ell \hat{I}_\ell(x)    (9)
           = \sum_{\ell \in T} w_\ell \min_{(d,v) \in \ell} \{\mathrm{ReLU}(d(x_{f_v} - t_v))\}    (10)

Here Î_ℓ(·) denotes this altered indicator function. We note that (8) does not actually use the logical complement ¬(A > B) = −ReLU(A − B) ∈ {0} ∪ False, but instead uses A < B = ReLU(B − A) ∈ {0} ∪ True. The summation over leaves in (10) is differentiable almost everywhere and preserves the efficiency of conventional decision trees in that at most a single term of (10) is ever non-zero. And as this form does not involve any product of sigmoid functions, it does not suffer saturation issues, nor does it suffer any potential loss of precision from multiplying many values in (0, 1). The traversal algorithm used for decision tree prediction can be reused with a small modification where the decision margin is also tracked during traversal. Such a traversal algorithm is described in Algorithm 3 and results in (10) reducing to h_T(x) = w_ℓ |r*|, where ℓ, r* are the results of the traversal algorithm.

In the context of classification, a random hinge tree can be interpreted as a min-max margin classifier. The smallest margin along the path to a leaf is a factor of the output prediction. When this margin is near zero, the tree produces a corresponding prediction near zero, indicating uncertainty. This reveals this type of tree to be extremely pessimistic. For regression tasks, a random hinge tree is a piecewise linear model comprised of hinge functions.

The derivatives of h_T(x) are simple to derive and are given below:

    \frac{\partial h_T}{\partial x_f} = \begin{cases} w_\ell \, \mathrm{sgn}(x_{f_{v^*}} - t_{v^*}) & \hat{I}_\ell(x) > 0,\ f_{v^*} = f \\ 0 & \text{otherwise} \end{cases}    (11)

    \frac{\partial h_T}{\partial w_\ell} = \hat{I}_\ell(x)    (12)

However, since only two gradient terms per tree are non-zero, it is inefficient to explicitly calculate the derivatives for all components. Instead, the gradient can be efficiently calculated from the results of Algorithm 3, where w_ℓ sgn(r*) gives (11) for gradient component f_{v*} and |r*| gives (12) for gradient component ℓ. All remaining components thus have partial derivatives of 0.

Random hinge ferns have a similar formulation to random hinge trees, except that where a depth D hinge tree has 2^D − 1 decisions, a random hinge fern has only D decisions. Random hinge ferns and trees both have 2^D leaves. The traversal algorithm used for evaluating a prediction or gradient is similar, except that the output vertex index v* ∈ {0, 1, ..., D − 1}.

4. Random Hinge Forest and Optimization

So far, the discussion has covered individual trees removed from learning tasks. A random hinge forest merely amounts to several random hinge trees and, unlike random forest and similar methods, the proposed machinery outputs the predictions of each individual tree. The practitioner decides how the predictions should be aggregated. Whether this is a mean, linear, semi-naïve Bayes, or other aggregation is left up to the user, thus making random hinge forest a generalized learning component of computation graphs. That said, random hinge forest does require some careful initialization and suitable optimizers.

Each random hinge tree of a random hinge forest is initialized with random input feature indices f_v and thresholds t_v. This is similar to using random subsets of features as was done in (Kontschieder et al., 2015) for some experiments. Since the inputs are assumed to be the output of some learnable input transformation and can evolve into any form during the optimization, the bootstrap aggregation performed in random forests seems less meaningful and we assume that it is not needed in end-to-end learning of random hinge forest. This assumption can actually be problematic and is discussed in Section 7. As such, it is still important to carefully pick decision thresholds since, for example, should thresholds be too far from the input features, the min-max margin predictions produced by trees could saturate the softmax used in the softmax loss calculation, resulting in small gradients and sluggish learning. Likewise, in the spirit of random forests, we desire a degree of variability between random hinge trees. To allow for arbitrary random initializations of random hinge trees, we use a modified form of batch normalization (Ioffe & Szegedy, 2015) to normalize the inputs of random hinge forest. Thus, thresholds can safely be randomly initialized as t ∼ U(−3, 3), where thresholds now represent standard deviations from the means of inputs. Our batch normalization keeps running mean and standard deviation estimates that yield identical computations in forward passes during both training and testing. Consequently, the backward pass treats the mean and standard deviation values used in the normalization as constants when calculating gradients. Likewise, the task-specific weights w_ℓ can be real numbers or even arbitrary tensors of real numbers. In the context of classification with a softmax classifier, it would make sense to initialize w_ℓ = 0, which would imply a random-guess probability prediction by the softmax classifier. However, we found it best to initialize w_ℓ to small random numbers; we use w_ℓ ∼ N(µ = 0, σ = 0.01) in this work. This random hinge forest initialization is specified in Algorithm 1.

Algorithm 1 InitializeHingeForest
  Input: number of trees M, tree depth D, number of input features F
  Output: decision thresholds t ∈ R^{M×(2^D−1)}, feature indices f ∈ {0, 1, ..., F−1}^{M×(2^D−1)}, leaf weights w ∈ R^{M×2^D}
  for m = 0 to M − 1 do
    for i = 0 to 2^D − 2 do
      Sample f' ∈ {0, 1, ..., F−1}, w' ∼ N(µ = 0, σ = 0.01), t' ∼ U(−3, 3)
      Set f_{m,i} ← f', w_{m,i} ← w', t_{m,i} ← t'
    end for
    Sample w' ∼ N(µ = 0, σ = 0.01)
    Set i ← 2^D − 1, w_{m,i} ← w'
  end for

Once the random hinge forest is initialized, a gradient-like method can be used to minimize a task-specific loss function L(x, y). Namely, the objective is to optimize the loss for the forest parameters H* = {(w*_m, t*_m)}_{m=1}^M as well as any other parameters θ of a computation graph g(x; H, θ)

    (H^*, \theta^*) = \operatorname*{argmin}_{H,\theta} \left\{ \frac{1}{N} \sum_{n=1}^{N} L(g(x_n; H, \theta), y_n) \right\}    (13)

where N is the number of examples in the training set and M is the number of trees, where each tree has leaf weights w*_m and decision thresholds t*_m. The feature indices f_m are fixed and not part of the optimization. We use stochastic gradient descent to optimize (13) and backpropagation to calculate the gradient. A generic training procedure for random hinge forest is illustrated in Figure 1, with details given in Algorithm 2.

The derivatives (11) and (12) produce extremely sparse gradients. Each training example results in only two non-zero derivatives per tree: one for the decision threshold t_{v*} and one for the weight (or weight tensor) w_ℓ. The consequences of sparse gradients are not explored in this work. But such sparse gradients imply that the tree can be a sluggish learner, and its backpropagated derivatives can make descendants of the computation graph sluggish as well. Therefore, it is important to use a suitable mini-batch size (i.e. > 1) and advisable to use an adaptive gradient optimization algorithm. This work makes use of both AdaGrad (Duchi et al., 2011) and Adam (Kingma & Ba, 2014) for fast convergence. The gradient histories of tree thresholds, weights, and parameters of descendants of the forest will be comparatively small, and both AdaGrad and Adam provide larger learning rates for such parameters with sparse derivatives.
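For concreteness, the initialization of Algorithm 1 can be sketched in a few lines of NumPy. The array layout and names below are our own illustration, not the paper's reference implementation (which is written in C++).

```python
import numpy as np

def initialize_hinge_forest(num_trees, depth, num_features, rng=None):
    """Random initialization in the spirit of Algorithm 1: random feature
    indices, thresholds ~ U(-3, 3) (interpreted as standard deviations of
    batch-normalized inputs), and small random leaf weights ~ N(0, 0.01)."""
    rng = np.random.default_rng() if rng is None else rng
    num_decisions = 2 ** depth - 1    # internal decision vertices per tree
    num_leaves = 2 ** depth           # leaves per tree
    f = rng.integers(0, num_features, size=(num_trees, num_decisions))
    t = rng.uniform(-3.0, 3.0, size=(num_trees, num_decisions))
    w = rng.normal(0.0, 0.01, size=(num_trees, num_leaves))
    return f, t, w
```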

Algorithm 3 HingeTreeTraversal
  Input: tree depth D, feature vector x ∈ R^F, decision thresholds t ∈ R^{2^D−1}, feature indices f ∈ {0, 1, ..., F−1}^{2^D−1}
  Output: leaf index ℓ ∈ {0, 1, ..., 2^D − 1}, signed margin r* ∈ R, vertex index v* ∈ {0, 1, ..., 2^D − 2}
  Initialize ℓ = 0, v = 0, v* = 0, r* = x_{f_0} − t_0
  for i = 0 to D − 1 do
    Set r = x_{f_v} − t_v
    if |r| < |r*| then
      Set r* ← r, v* ← v
    end if
    Set ℓ ← 2ℓ + I(r > 0)
    Set v ← 2v + I(r > 0) + 1
  end for
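To illustrate, here is a minimal NumPy sketch of Algorithm 3 together with the resulting prediction h_T(x) = w_ℓ|r*| of (10) and the sparse derivatives of (11)-(12). The function names and returned tuple layout are our own, not part of the paper's implementation.

```python
import numpy as np

def hinge_tree_traversal(depth, x, t, f):
    """Algorithm 3: descend a complete binary tree of the given depth, tracking
    the decision with the smallest absolute margin along the path."""
    leaf, v = 0, 0
    v_star, r_star = 0, x[f[0]] - t[0]
    for _ in range(depth):
        r = x[f[v]] - t[v]
        if abs(r) < abs(r_star):
            r_star, v_star = r, v
        go_right = int(r > 0)
        leaf = 2 * leaf + go_right
        v = 2 * v + go_right + 1
    return leaf, r_star, v_star

def hinge_tree_forward_backward(depth, x, t, f, w):
    """Prediction h_T(x) = w_leaf * |r*| (Eq. 10) and the only non-zero
    derivatives (Eqs. 11-12): w.r.t. the winning split feature, its
    threshold, and the winning leaf weight."""
    leaf, r_star, v_star = hinge_tree_traversal(depth, x, t, f)
    h = w[leaf] * abs(r_star)
    d_x_fv = w[leaf] * np.sign(r_star)   # dh/dx_{f_{v*}}  (Eq. 11)
    d_t_v = -w[leaf] * np.sign(r_star)   # dh/dt_{v*}
    d_w_leaf = abs(r_star)               # dh/dw_leaf      (Eq. 12)
    return h, (leaf, v_star, d_x_fv, d_t_v, d_w_leaf)
```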

Algorithm 2 TrainGraphContainingHingeForest
  Input: training set X = {x_n, y_n}_{n=1}^N, computation graph g(x; H, θ), batch size N_B, learning rate η, max number of gradient steps i_max, initial depth D random hinge forest parameters H = {(f_m, w_m, t_m)}_{m=1}^M, initial other parameters θ
  {Disclaimer: This is not a realistic procedure for stochastic gradient descent with backpropagation. The derivative calculation is decoupled from the other parameters to demonstrate how random hinge forest learns.}
  Output: learned parameters H, θ
  for i = 0 to i_max − 1 do
    Draw mini-batch B ⊆ X
    Forward propagate L = (1/N_B) Σ_{(x,y)∈B} L(g(x; H, θ), y)
    Compute derivatives w.r.t. inputs for θ
    {Compute derivatives w.r.t. inputs for H}
    for m = 0 to M − 1 do
      Set ∆w_m = 0, ∆t_m = 0
      for j = 0 to N_B − 1 do
        Set z = input_j
        Compute (ℓ, r*, v*) = HingeTreeTraversal(D, z, t_m, f_m)
        Set ∆input_{j, f_{m,v*}} ← ∆input_{j, f_{m,v*}} + w_{m,ℓ} sgn(r*)
        Set ∆t_{m,v*,j} ← −w_{m,ℓ} sgn(r*)
        Set ∆w_{m,ℓ,j} ← |r*|
      end for
    end for
    Backpropagate (∆θ, ∆w, ∆t) = ∇L
    Perform gradient step θ ← θ − η∆θ
    Perform gradient step w ← w − η∆w, t ← t − η∆t
  end for

Figure 1. Illustration of random hinge forest backpropagation from Algorithm 2.

5. Experiments

Two different sets of experiments were conducted to test and compare random hinge forest and ferns. The first set is run on several generic UCI (Lichman, 2013) data sets to show that both of the proposed solutions can generalize for a variety of learning tasks ranging from relatively small to large. We test configurations that vary the number of trees over {1, 10, 50, 100} and the tree depths over {1, 3, 5, 7, 10}. The best configuration is reported, and we compare against random forest (Breiman, 2001), configured with 100 depth 10 trees. The data sets used in the UCI experiments are summarized in Table 1, and Figure 2a depicts the network architecture used. All UCI experiments used AdaGrad (Duchi et al., 2011) with a single dataset-specific learning rate over all configurations.

The second set of experiments compares random hinge forest and ferns to (Kontschieder et al., 2015; Schulter et al., 2013) on the letter (Lichman, 2013), USPS (Hull, 1994), and MNIST (LeCun et al., 1998) data sets. Following the experiments of (Kontschieder et al., 2015; Schulter et al., 2013), the tree number and depth were limited to 100 and 10, respectively. The data sets, experiment parameters and architectures used in this experiment are summarized in Table 3 and Figures 2b and 2c. Experiments used the Adam (Kingma & Ba, 2014) optimizer with β₁ = 0.9, β₂ = 0.999, and η = 0.005. In contrast to the UCI experiments, these experiments report the best test error over model states, since (Kontschieder et al., 2015) does not describe the use of validation sets. Lastly, to maximize our performance over MNIST, we report results for 1000 trees/ferns.
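As a sanity check of the learning procedure in Algorithm 2, the following self-contained toy sketch trains a small random hinge forest on synthetic regression data with a squared loss, aggregating trees by a simple sum and using plain SGD in place of AdaGrad/Adam. The toy data, hyperparameters, and variable names are ours, and the input batch normalization is omitted because the toy inputs are already standardized.

```python
import numpy as np

# Toy run of Algorithm 2: squared loss, sum aggregation, no extra graph
# parameters theta, plain SGD instead of an adaptive optimizer.
rng = np.random.default_rng(0)
M, D, F = 10, 3, 2                       # trees, depth, input features
n_dec, n_leaf = 2 ** D - 1, 2 ** D
f = rng.integers(0, F, size=(M, n_dec))  # Algorithm 1 initialization
t = rng.uniform(-3.0, 3.0, size=(M, n_dec))
w = rng.normal(0.0, 0.01, size=(M, n_leaf))

def traverse(x, tm, fm):                 # Algorithm 3
    leaf, v = 0, 0
    v_star, r_star = 0, x[fm[0]] - tm[0]
    for _ in range(D):
        r = x[fm[v]] - tm[v]
        if abs(r) < abs(r_star):
            r_star, v_star = r, v
        b = int(r > 0)
        leaf, v = 2 * leaf + b, 2 * v + b + 1
    return leaf, r_star, v_star

X = rng.normal(size=(512, F))            # toy regression target
y = np.sin(X[:, 0] - 0.5 * X[:, 1])

eta, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(X), size=batch)
    dw, dt = np.zeros_like(w), np.zeros_like(t)
    for n in idx:
        x, target = X[n], y[n]
        pred, cache = 0.0, []
        for m in range(M):               # forest output = sum of tree outputs
            leaf, r_star, v_star = traverse(x, t[m], f[m])
            pred += w[m, leaf] * abs(r_star)
            cache.append((leaf, r_star, v_star))
        dloss = 2.0 * (pred - target)    # derivative of the squared loss
        for m, (leaf, r_star, v_star) in enumerate(cache):
            dw[m, leaf] += dloss * abs(r_star)                     # Eq. (12)
            dt[m, v_star] += dloss * (-w[m, leaf] * np.sign(r_star))
    w -= eta * dw / batch                # gradient steps as in Algorithm 2
    t -= eta * dt / batch
```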

[Figure 2: three network architectures, shown as layer sequences.]
(a) Architecture used for UCI data: Input → InnerProduct (100 outputs) → BatchNormalization → Hinge Forest/Fern (T outputs) → InnerProduct (K outputs) → SoftmaxLoss
(b) Architecture used for letter and USPS: Input → InnerProduct (100 outputs) → BatchNormalization → Hinge Forest/Fern (T × K outputs) → Sum (K outputs) → SoftmaxLoss
(c) Architecture used for MNIST: Input → Convolution (5 × 5, stride 3, padding 0, 80 outputs) → Flatten → BatchNormalization → Hinge Forest/Fern (T × K outputs) → Sum (K outputs) → SoftmaxLoss

Figure 2. Architectures used for the results reported in this work. Here T denotes the number of trees and K denotes the number of class labels. The abalone regression experiment uses an ℓ2 loss instead of the SoftmaxLoss.

Unlike the work of (Kontschieder et al., 2015), random hinge forest does not have learnable representations built into its decision nodes. Instead, random hinge forest relies on other layers to produce a pool of learnable feature representations. For both sets of experiments, except for MNIST, we use a pool of 100 inner product features that are randomly assigned to random hinge forest decision nodes. For MNIST, we use a single layer of 80 5 × 5 convolutions with stride 3 and no padding, which are then flattened before also being randomly assigned to decision nodes.

Both sets of experiments evaluated results over 10 runs and the testing performances were averaged. The exception is iris, which does not possess a test set. Instead, iris was shuffled 5 times and partitioned into 3 folds, each used in turn as a training, validation, or test fold, totaling 15 runs per configuration.

All experiments were conducted using an in-house and publicly available cross-platform C++14 framework (bleak); a link, along with experiment scripts and details, will be shared upon publication. Random hinge forest should also be easily portable to other frameworks. All experiments were conducted using 1 CPU on a Windows 10 workstation with an Intel Core i7-5930K (3.5 GHz) processor and 32 GB of RAM.

6. Results

The results of the UCI experiments are summarized in Table 2. In the three experiments involving abalone, abalone regression and human activity, the optimal random hinge forest and fern configurations amounted to depth 1 hinge trees/ferns, or stumps. This makes the reported errors for the two identical, owing to the fact that depth 1 ferns and forests are identical learning machines. Importantly, for 5 out of 6 experiments, random hinge forest and ferns outperform random forest. On the madelon data set, however, random forest outperformed random hinge forest and ferns by a substantial 11.5%. This failure is discussed in Section 7.

The second set of experiments is summarized in Table 3. Random hinge forest and ferns both outperform alternating decision forest and neural decision forest on letter, while random hinge forest is only slightly better than alternating decision forest on USPS but worse than neural decision forest. Random hinge forest and ferns are similar in performance to neural decision forest on MNIST, although worse than alternating decision forest.

To test the limits of random hinge forest and ferns, we tested each using 1000 depth 10 trees/ferns on MNIST, reducing the error to 1.81 (0.10) and 1.90 (0.06), respectively. This greatly outperforms alternating decision forest and neural decision forest. Importantly, despite the large number of trees, it only took ≈ 3.5 and ≈ 3 hours to train the 1000-tree/fern random hinge forest and ferns, respectively, over 40000 mini-batches. All other parameters were kept the same. We also experimented with two nested convolutions and random hinge forest stacking, but performance did not exceed that reported for 1000 trees/ferns.

Lastly, Figure 3 illustrates the effects of varying tree depths and numbers of trees for both random hinge forest and ferns. These plots were each generated from the first run of the UCI madelon data set over tree depths {1, 5, 10} and the first run of the MNIST data set over the number of trees {100, 1000}.

Table 1. UCI data set properties. All data sets had given test sets except for iris. Training sets were randomly shuffled and a small portion was kept for validation (and also testing for iris).

Data Set | Training | Validation | Testing | Dimension | Classes/Range
iris | 50 | 50 | 50 | 3 | 3
abalone | 2089 | 1044 | 1044 | 8 | 3
abalone regression | 2089 | 1044 | 1044 | 8 | [1, 29]
human activity | 5514 | 1838 | 2947 | 561 | 6
madelon | 1500 | 500 | 600 | 500 | 2
poker | 18757 | 6253 | 1000000 | 10 | 10

Table 2. Test errors or R² of the best performing models over numbers of trees {1, 10, 50, 100} and depths {1, 3, 5, 7, 10}, compared to Random Forest on UCI data. The best configuration is expressed as # trees / depth and test error is reported as misclassification rate (standard deviation). The best performance for each data set is presented in bold.

Data Set | Hinge Forest Test | Config. | Hinge Fern Test | Config. | Random Forest
iris | 2.13 (2.66) | 10/5 | 2.27 (2.25) | 10/5 | 4.13 (2.33)
abalone | 34.4 (0.56) | 50/1 | 34.4 (0.56) | 50/1 | 34.8 (0.40)
abalone regression (R²) | 0.57 (0.07) | 100/1 | 0.57 (0.07) | 100/1 | 0.54 (0.01)
human activity | 5.52 (0.51) | 50/1 | 5.52 (0.51) | 50/1 | 8.23 (0.50)
madelon | 40.8 (1.60) | 10/3 | 40.8 (1.41) | 50/7 | 29.3 (1.13)
poker | 32.6 (4.80) | 50/3 | 33.9 (4.70) | 100/3 | 40.4 (0.18)

Table 3. Data set sizes, tree inputs, and classification error of random hinge forest, ferns, alternating decision forest (Schulter et al., 2013) and neural decision forest (Kontschieder et al., 2015). All features for the random hinge forest/ferns were simultaneously learned via backpropagation as part of random hinge forest/fern training. Test errors are expressed as average misclassification rate (standard deviation). The best performance for each data set is presented in bold.

 | Letter (Lichman, 2013) | USPS (Hull, 1994) | MNIST (LeCun et al., 1998)
Training | 16000 | 7291 | 60000
Testing | 4000 | 2007 | 10000
Classes | 26 | 10 | 10
Dimension | 16 | 16 × 16 | 28 × 28
Features | 100 inner product | 100 inner product | 80 5 × 5 convolutions
Batch Size | 53 | 29 | 53
Random Hinge Forest | 2.56 (0.11) | 5.53 (0.15) | 2.79 (0.11)
Random Hinge Ferns | 2.78 (0.12) | 5.64 (0.24) | 2.85 (0.13)
Alternating Decision Forest | 3.42 (0.17) | 5.59 (0.16) | 2.71 (0.10)
Neural Decision Forest | 2.92 (0.17) | 5.01 (0.24) | 2.8 (0.12)

7. Discussion

The UCI experiments demonstrate that random hinge forest and ferns can perform well on both relatively large and tiny data sets for a variety of learning tasks not usually attributed to differentiable learning. While not tabulated, the results over the varying numbers of trees and tree depths were often similar. In all but one case, both random hinge forest and ferns outperform random forest. The exception, the madelon (Lichman, 2013) data set, is a synthetic binary classification problem generated from a number of Gaussian distributions as well as an abundance of useless noise features, with a few other complexities added. We tried a number of different configurations, quantities of learnable features, weight decay, using Adam instead of AdaGrad, and feature selection, but we consistently obtained misclassification rates of ≈ 40.0. In all tests, except when using weight decay, the random hinge forest typically overfit the training data to perfection. Therefore, the most likely cause of poor performance is that inner product features are not suitable learnable features for the madelon data set. Where trees in random forest can deliberately choose split features as they grow, ignoring ones that do not prove useful to their task, hinge trees are assigned randomized learnable split features. Although these may be adjusted to fit their randomized decision structure, they cannot be rejected. A hypothetical sparse learnable inner product feature might overcome this limitation, as it would behave like the decisions used in the trees of random forest. We leave this to future work.

[Figure 3: two line plots of loss versus training iteration (0 to 40000).]
(a) Validation loss evolution for madelon ("Varying Tree Depths for madelon"): curves for 100/1, 100/5, and 100/10 hinge forests and hinge ferns.
(b) Test loss evolution for MNIST ("Varying Number of Trees for MNIST"): curves for 100/10 and 1000/10 hinge forests and hinge ferns.

Figure 3. Validation/test loss behaviors for varying tree depth (3a) and varying numbers of trees (3b). Deeper trees and more trees both have the potential to generalize better, but can also severely overfit, resulting in increasing validation/test losses over training iterations. Legends are to be interpreted as # trees / depth.

The second set of experiments attempts to respect the constraints imposed on alternating decision forest and neural decision forest, where we limit ourselves to no more than 100 depth 10 trees. Importantly, while alternating decision forests are flexible enough to be employed with any loss function and neural decision forests can be incorporated within an arbitrary differentiable computation graph, random hinge forest and ferns are the first learnable decision tree machinery that can do both. In terms of performance, random hinge forest and ferns are comparable to alternating decision forest and neural decision forest. Our methods perform best on letter, second to neural decision forest on USPS, and second to alternating decision forest on MNIST, though random hinge forest probably has the same performance as neural decision forest on MNIST. Noteworthy is that we maintain competitive performance on these three data sets with only a pool of 100 learnable features for letter and USPS and 80 learnable convolution kernels that produce 5120 features when flattened on MNIST. By comparison, neural decision forest has a learnable inner-product decision at every vertex of every decision tree, often resulting in many more parameters than random hinge forest. For instance, neural decision forest uses ≈ 10× more parameters for USPS and ≈ 17× more parameters for MNIST, while random hinge forest uses about 10% more parameters than neural decision forest for letter. Additionally, the competitive ferns have approximately half as many parameters as random hinge forest.

Also, random hinge forest and ferns do not require mini-batch sizes as large as those used for neural decision forest, e.g. 1000 for the MNIST data set (Kontschieder et al., 2015). The large mini-batch size for the latter is likely needed for the prediction node optimization during training, as the histograms have to be adequately populated. However, random hinge forest can predict arbitrary weight vectors and is not limited to probability vectors and average leaf purity over the trees. Thus, random hinge forest can train with small batch sizes which, in combination with its logarithmic gradient calculation complexity, results in faster training and convergence. This efficient formulation, however, does limit random hinge forest to piecewise linear behavior, while neural decision forest is naturally non-linear owing to its use of sigmoid decision functions and multiple leaves in prediction. These efficiencies translate into practical improvements. For example, if not for this efficiency, it would be prohibitively expensive to train 1000 depth 10 trees with learnable features on a single CPU. With OpenMP/C++11 threads and/or GPU acceleration, this could be substantially faster.

8. Conclusion

Random hinge forest and ferns are efficient end-to-end learning machines that can be used to solve a variety of learning tasks on both relatively small and large data sets. Their ability to coexist in backpropagating computation graphs enables them to minimize a global loss function, be connected with any other differentiable component, and exploit learnable feature representations not afforded to the original random forest. This flexibility also comes with a robustness that tolerates a variety of configurations and parameter initializations. We hope that random hinge forest can complement current deep learning machinery and maybe provide a simple means for new or experienced machine learning practitioners to develop and explore their own network architectures.

References

Amit, Yali and Geman, Donald. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 1997.

Breiman, Leo. Random forests. Machine Learning, 45(1):5–32, 2001.

Criminisi, Antonio, Shotton, Jamie, Robertson, Duncan, and Konukoglu, Ender. Regression forests for efficient anatomy detection and localization in CT studies. In International MICCAI Workshop on Medical Computer Vision, pp. 106–117. Springer, 2010.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Friedman, Jerome H. Multivariate adaptive regression splines. The Annals of Statistics, pp. 1–67, 1991.

Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.

Gall, Juergen and Lempitsky, Victor. Class-specific Hough forests for object detection. In Decision Forests for Computer Vision and Medical Image Analysis, pp. 143–157. Springer, 2013.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hull, Jonathan J. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456, 2015.

Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kontschieder, Peter, Fiterau, Madalina, Criminisi, Antonio, and Bulo, Samuel Rota. Deep neural decision forests. In Computer Vision (ICCV), 2015 IEEE International Conference on, pp. 1467–1475. IEEE, 2015.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Ozuysal, Mustafa, Calonder, Michael, Lepetit, Vincent, and Fua, Pascal. Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3):448–461, 2010.

Ronneberger, Olaf, Fischer, Philipp, and Brox, Thomas. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer, 2015.

Schulter, Samuel, Wohlhart, Paul, Leistner, Christian, Saffari, Amir, Roth, Peter M, and Bischof, Horst. Alternating decision forests. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 508–515. IEEE, 2013.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Suárez, Alberto and Lutsko, James F. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, 1999.