End-to-end Learning of Deterministic Decision Trees

Thomas Hehn    Fred A. Hamprecht
HCI/IWR, Heidelberg University
{thomas.hehn, fred.hamprecht}@iwr.uni-heidelberg.de

Abstract

Conventional decision trees have a number of favorable properties, including interpretability, a small computational footprint and the ability to learn from little training data. However, they lack a key quality that has helped fuel the deep learning revolution: that of being end-to-end trainable, and to learn from scratch those features that best allow to solve a given problem. Recent work (Kontschieder 2015) has addressed this deficit, but at the cost of losing a main attractive trait of decision trees: the fact that each sample is routed along a small subset of tree nodes only. We here propose a model and Expectation-Maximization training scheme for decision trees that are fully probabilistic at train time, but after a deterministic annealing process become deterministic at test time. We also analyze the learned oblique split parameters on image datasets and show that Neural Networks can be trained at each split node. In summary, we present the first end-to-end learning scheme for deterministic decision trees and present results on par with or superior to published standard oblique decision tree algorithms.

1. Introduction

When selecting a supervised learning technique, we are led by multiple and often conflicting criteria. These include: how accurate is the resulting model? How much training data is needed to achieve a given level of accuracy? How interpretable is the model? How big is the computational effort at train time? And at test time? How well does the implementation map to the available hardware?

These days, neural networks have superseded all other approaches in terms of achievable accuracy of the predictions; but state of the art networks are not easy to interpret, are fairly hungry for training data, often require weeks of GPU training, and have a computational and memory footprint that rules out their use on small embedded devices. Decision trees achieve inferior accuracy, but are fundamentally more frugal.

Both neural networks and decision trees are composed of basic computational units, the neurons and nodes, respectively. A crucial difference between the two is that in a standard neural network, all units are evaluated for every input, while in a decision tree with I inner split nodes, only O(log I) split nodes are visited. That is, in a decision tree, a sample is routed along a single path from the root to a leaf, with the path conditioned on the sample's features. It is this sparsity of the sample-dependent computational graph that piques our interest in decision trees; but we also hope to profit from their ability to learn their comparatively few parameters from a small training set, and from their relative interpretability.

One hallmark of neural networks is their ability to learn a complex combination of many elementary decisions jointly, by end-to-end training using backpropagation. This is a feature that has so far been missing in deterministic decision trees, which are usually constructed greedily without subsequent tuning. We here propose a mechanism to remedy this deficit.

1.1. Contributions

• We propose a decision tree whose internal nodes are probabilistic and hence differentiable at train time. As a consequence, we are able to train the internal nodes jointly in an end-to-end fashion. This is true for linear nodes, but the property is maintained for more complex nodes, such as small Convolutional Neural Networks (CNNs) (section 3.4).

• We derive an expectation-maximization style algorithm for finding the optimal parameters in a split node (section 3.3). We develop a probabilistic split criterion that generalizes the long-established information gain [23]. The proposed criterion is asymptotically identical to information gain in the limit of very steep non-linearities, but allows to better model class overlap in the vicinity of a split decision boundary (section 3.2).

• We demonstrate good results by making the nodes deterministic at test time, sending each sample along a unique path of only O(log I) out of the I inner nodes in a tree. We evaluate the performance of the proposed method on the same datasets as used in related work [22] (section 4.1) and find steeper learning curves with respect to tree depth, as well as higher overall accuracy. We show the benefit of regularizing the spatial derivatives of learned features when samples are images or image patches (section 4.2). Finally, we report preliminary experiments with minimalistic trees with CNNs as split feature.

2. Related work

Decision trees and decision tree ensembles, such as random forests [1], are widely used for computer vision [4] and have proven effective on a variety of classification tasks [7]. In order to improve their performance for a specific task, it is common practice to engineer their features for that task [8, 14, 16]. Oblique linear and non-linear classifiers using more than one feature at a time have been benchmarked in [18], but the available algorithms are, in contrast to our approach, limited to binary classification problems.

There have been several attempts to train decision trees using gradient optimization techniques for more complex split functions. Similarly to our approach, [19] have successfully approximated information gain using a sigmoid function and a smoothness hyperparameter. However, that approach does not allow joint optimization of an entire tree.

In [22], the authors also propose an algorithm for optimization of an entire tree with a given structure. They show a connection between optimizing oblique splits and structured prediction with latent variables. As a result, they formulate a convex-concave upper bound on the tree's empirical loss. In order to find an initial tree structure, the work also relies on a greedy algorithm, which is based on the same upper bound approach [21]. Their method is restricted to linear splits and relies on the kernel trick to introduce higher order split features, as opposed to our optimization, which allows more complex split features.

Other advances towards gradient-based decision tree optimization rely on either fuzzy or probabilistic split functions [28, 13, 10]. In contrast to our approach, the assignment of a single sample to the leaves remains fuzzy, respectively probabilistic, during prediction. Consequently, all leaves and paths need to be evaluated for every sample, which annihilates the computational benefits of trees.

We build on reference [13], which is closest to our work. The authors use sigmoid functions to model the probabilistic routes and employ the same log-likelihood objective. In contrast to their work, we derive the alternating optimization using the Expectation-Maximization approach as in [11] and aim for a deterministic decision tree for prediction. Also, they start from a random, but balanced tree, because their algorithm does not learn the structure of the tree.

Finally, connections between neural networks and decision tree ensembles have been examined. In [27, 29] decision tree ensembles are cast to neural networks, which enables gradient descent training. As long as the structure of the trees is preserved, the optimized parameters of the neural network can also be mapped back to the decision trees. Subsequently, [25] cast stacked decision forests to convolutional neural networks and found an approximate mapping back. In [9, 17] several models of neural networks with separate, conditional data flows are discussed.

Our work builds on various ideas of previous work; however, none of these algorithms learn deterministic decision trees with arbitrary split functions in an end-to-end fashion.

3. Methods

Consider a classification problem with input space X ⊂ R^p and output space Y = {1, ..., K}. The training set is defined as {x_1, ..., x_N} = X_t ⊂ X with corresponding classes {y_1, ..., y_N} = Y_t ⊂ Y. We propose training a probabilistic decision tree model, which becomes deterministic at test time.

3.1. Standard decision tree and notation

In binary decision trees (figure 1c), split functions s : R → [0, 1] determine the routing of a sample through the tree, conditioned on that sample's features. The split function controls whether the splits are deterministic or probabilistic. The prediction is made by the leaf node that is reached by the sample.

Split nodes. Each split node i ∈ {1, ..., I} computes a split feature from a sample, and sends that feature into a split function. That function is a map f_{β_i} : R^p → R, parametrized by β_i.
For example, oblique splits are a linear combination of the input, as in f_{β_i}(x) = (x^T, 1) · β_i with β_i ∈ R^{p+1}. Similarly, an axis-aligned split perpendicular to axis a is represented by an oblique split whose only non-zero parameters are at index a and p + 1. We write θ_β = (β_1, ..., β_I) to denote the collection of all split parameters in the tree.

Leaf nodes. Each leaf ℓ ∈ {1, ..., L} stores a categorical distribution over classes k ∈ {1, ..., K} in a vector π_ℓ ∈ [0, 1]^K. These vectors are normalized such that the probabilities of all classes in a leaf sum to one, i.e. Σ_{k=1}^K (π_ℓ)_k = 1. We define θ_π = (π_1, ..., π_L) to include all leaf parameters in the tree.

Paths. For each leaf node there exists one unique set of split outcomes, called a path. We define the probability that a sample x takes the path to leaf ℓ as

μ_ℓ(x; s, θ_β) = ∏_{r ∈ R_ℓ} s(f_{β_r}(x)) · ∏_{l ∈ L_ℓ} (1 − s(f_{β_l}(x))).   (1)
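For illustration, a minimal NumPy sketch of the oblique split feature and its axis-aligned special case (the helper names and the 0-based indexing are ours, not from the paper):

```python
import numpy as np

def oblique_split_feature(x, beta):
    """Oblique split feature f_beta(x) = (x^T, 1) . beta for a sample x in R^p."""
    return float(np.append(x, 1.0) @ beta)

def axis_aligned_beta(a, threshold, p):
    """An axis-aligned split on feature axis `a` at `threshold`, expressed as an
    oblique split: only the entry for axis a and the bias entry are non-zero."""
    beta = np.zeros(p + 1)
    beta[a] = 1.0
    beta[-1] = -threshold   # bias entry (index p+1 in the paper's 1-based notation)
    return beta
```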


Figure 1: Probabilistic oblique decision trees. a) A feature space with a binary classification problem, tessellated by an example oblique decision tree. The oblique splits (1-4) partition the feature space into five different leaves (A-E). b) The predicted p(y = red | x) (eq. 2) of the oblique decision tree when a probabilistic split (eq. 3) is used. c) The corresponding tree diagram.

Here, R_ℓ ⊂ {1, ..., I} denotes the splits on the path which contain ℓ in the right subtree. Analogously, L_ℓ ⊂ {1, ..., I} denotes splits which contain ℓ in the left subtree. In figure 1c this means that R_B = {2} and L_B = {1, 4}. Also note that in the following, we will omit the s dependency whenever we do not consider a specific function.

The prediction of the entire decision tree is given by multiplying the path probability with the corresponding leaf prediction:

p(y | x; θ) = Σ_{ℓ=1}^L (π_ℓ)_y μ_ℓ(x; θ_β).   (2)

Here, θ = (θ_β, θ_π) comprises all parameters in the tree. This representation of a decision tree allows to choose between different split features and different split functions, by varying the functions f and s, respectively.

In standard deterministic decision trees as proposed in [2], the split function is a step function s(x) = Θ(x), with Θ(x) = 1 if x > 0 and Θ(x) = 0 otherwise.

3.2. Probabilistic decision tree

We now introduce a defining characteristic of the proposed method. Rather than sending a sample deterministically down the left or right subtree, depending on its features x, we send it left or right with a probability

s(f(x)) = σ(f(x)) = 1 / (1 + e^{−f(x)}).   (3)

This corresponds to regarding each split in the tree as a Bernoulli decision with mean σ(f(x)), and as a result equation 2 is the expected value over the possible outcomes. Figure 1b shows the prediction from equation 2 in the probabilistic case for a class y = "red" on the classification problem illustrated in figure 1a.

To train our probabilistic decision trees, we choose as objective the maximization of the empirical log-likelihood of the training data:

max_θ ℒ(θ; X_t, Y_t) = max_θ Σ_{n=1}^N log p(y_n | x_n; θ).   (4)

Importantly, while we propose to use a probabilistic decision tree for training, we use a deterministic decision tree for prediction. To better match the models used at train and test time, we introduce a hyperparameter γ, which steers the steepness of the split function by scaling the split feature [19]:

s(f(x)) = σ_γ(f(x)) = σ(γ f(x)).   (5)

Note that for γ → ∞ the model resembles a deterministic decision tree, since σ_∞(f(x)) = Θ(f(x)). During training, we iteratively increase γ, akin to a temperature cooling schedule in deterministic annealing [26].
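For concreteness, here is a minimal NumPy sketch of the probabilistic routing and prediction of equations (1)-(3) and (5); the variable names (theta_beta, theta_pi, paths) are ours and not taken from any released code.

```python
import numpy as np

def sigma_gamma(v, gamma=1.0):
    """Probabilistic split s(f(x)) of eqs. (3) and (5); gamma -> inf approaches a hard step."""
    return 1.0 / (1.0 + np.exp(-gamma * v))

def path_probability(x, theta_beta, right_ancestors, left_ancestors, gamma=1.0):
    """mu_l(x; sigma_gamma, theta_beta) of eq. (1) for one leaf, with oblique splits.
    right_ancestors / left_ancestors are the index sets R_l and L_l of that leaf."""
    x1 = np.append(x, 1.0)                         # (x^T, 1)
    mu = 1.0
    for r in right_ancestors:
        mu *= sigma_gamma(x1 @ theta_beta[r], gamma)
    for l in left_ancestors:
        mu *= 1.0 - sigma_gamma(x1 @ theta_beta[l], gamma)
    return mu

def predict_proba(x, theta_beta, theta_pi, paths, gamma=1.0):
    """p(y | x; theta) of eq. (2): leaf distributions weighted by path probabilities.
    theta_pi has shape (L, K); `paths` lists the pair (R_l, L_l) for each leaf."""
    p = np.zeros(theta_pi.shape[1])
    for pi_leaf, (right, left) in zip(theta_pi, paths):
        p += pi_leaf * path_probability(x, theta_beta, right, left, gamma)
    return p
```

Evaluating the same functions with a very large gamma (or with a hard step function) recovers the deterministic tree used at test time.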
3.3. Expectation-Maximization

For the optimization of the log-likelihood (equation 4), we propose a gradient-based, EM-style optimization strategy, which requires f and s to be differentiable with respect to the split parameters β_i. The derivation of the EM-algorithm for this model follows the spirit of [11]. We introduce additional latent random variables z_{n,ℓ}, which indicate that leaf ℓ generated the class label of a given data point x_n. Including these latent variables, the optimization objective (eq. 4) becomes the complete-data log-likelihood

ℒ(θ; X_t, Y_t, Z_t) = Σ_{n=1}^N Σ_{ℓ=1}^L z_{n,ℓ} log((π_ℓ)_{y_n} μ_ℓ(x_n; θ_β)).   (6)

E-Step. In the Expectation-Step, the expected value of the complete-data log-likelihood over the latent variables, given the previous parameters θ′, is computed:

Q(θ | θ′) = E_{Z_t | X_t, Y_t; θ′}[ℒ(θ; X_t, Y_t, Z_t)].   (7)

For this purpose, it is necessary to compute the probability that z_{n,ℓ} = 1 for each training sample n:

h_{n,ℓ} := p(z_{n,ℓ} = 1 | x_n, y_n; θ′)   (8)
        = p(y_n | z_{n,ℓ} = 1, x_n; θ′) p(z_{n,ℓ} = 1 | x_n; θ′) / p(y_n | x_n; θ′)   (9)
        = (π′_ℓ)_{y_n} μ_ℓ(x_n; θ′_β) / Σ_{ℓ′=1}^L (π′_{ℓ′})_{y_n} μ_{ℓ′}(x_n; θ′_β).   (10)

Thus, the expectation value of the complete-data log-likelihood yields

Q(θ | θ′) = Σ_{n=1}^N Σ_{ℓ=1}^L h_{n,ℓ} log((π_ℓ)_{y_n} μ_ℓ(x_n; θ_β)).   (11)

M-Step. In the Maximization-Step of the EM-Algorithm, the expectation value computed in the E-Step (eq. 11) is maximized to find updated parameters:

max_θ Q(θ | θ′).   (12)

Due to the latent variables we introduced, it is now possible to separate the parameter dependencies in the logarithm into a sum. As a result, the leaf predictions and split parameters are optimized separately. The optimization of the leaf predictions, including the normalization constraint, can be computed directly as

(π_ℓ)_k = Σ_{n=1}^N 1(y_n = k) h_{n,ℓ} / Σ_{n=1}^N h_{n,ℓ}.   (13)

Here, the indicator function 1(y_n = k) equals 1 if y_n = k and 0 otherwise.

The optimization of the split parameters in the M-Step is performed using gradient based optimization. The separated objective for the split parameters without the leaf predictions is

max_{θ_β} Σ_{n=1}^N Σ_{ℓ=1}^L h_{n,ℓ} log μ_ℓ(x_n; θ_β).   (14)

We use the first-order gradient-based stochastic optimization Adam [12] for the optimization of the split parameters.

In summary, each iteration of the algorithm requires evaluation of equations 10 and 13, as well as at least one update of the split parameters based on equation 14. This iterative algorithm can be applied to a binary decision tree of any given structure.
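A minimal NumPy sketch of one such iteration, assuming the path probabilities have already been evaluated for all samples and leaves (array and function names are ours); the split-parameter update of eq. (14) is only indicated, since in practice it is carried out by gradient ascent with Adam.

```python
import numpy as np

def em_iteration(mu, y, pi, num_classes):
    """One EM iteration on precomputed path probabilities.
    mu : (N, L) array with mu[n, l] = mu_l(x_n; theta_beta)
    y  : (N,) integer class labels
    pi : (L, K) current leaf distributions theta_pi."""
    # E-step, eq. (10): responsibilities h[n, l] = p(z_{n,l} = 1 | x_n, y_n)
    joint = pi[:, y].T * mu                        # (N, L): (pi_l)_{y_n} * mu_l(x_n)
    h = joint / joint.sum(axis=1, keepdims=True)
    # M-step for the leaves, eq. (13): responsibility-weighted class frequencies
    onehot = np.eye(num_classes)[y]                # (N, K)
    pi_new = h.T @ onehot                          # (L, K), unnormalized
    pi_new /= pi_new.sum(axis=1, keepdims=True) + 1e-12   # guard against empty leaves
    # M-step for the split parameters, eq. (14): maximize
    #   sum_n sum_l h[n, l] * log mu[n, l]
    # with respect to theta_beta by gradient ascent (Adam); omitted here.
    return h, pi_new
```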

3.4. Complex splits and spatial regularization

The proposed optimization procedure only requires the split features f to be differentiable with respect to the split parameters. As a result, it is possible to implement more complex splits than axis-aligned or oblique splits. For example, it is possible to use a small Convolutional Neural Network (CNN) as split feature extractor for f and learn its parameters (section 4.4).

Furthermore, the optimization objective can also include regularization constraints on the parameters. This is useful to avoid overfitting and to learn more robust patterns. When the inputs are from images, spatial regularization also reveals more discernible spatial structures in the learned parameters, without sacrificing accuracy (section 4.2). To encourage the learning of coherent spatial patterns at each split, we add the spatial regularization term

−λ Σ_{i=1}^I β_i^T M β_i   (15)

to the maximization objective of the split features (eq. 14) [5]. The matrix M denotes the Laplacian matrix when interpreting the image as a grid graph. For a single pixel, corresponding to weight β_i, the diagonal element M_ii contains the number of neighboring pixels. If pixels i and j are neighboring pixels, then M_ij = M_ji = −1. All remaining elements in M are 0. This regularization term penalizes spatial finite differences, encouraging similar parameters for neighboring pixels. The hyperparameter λ controls the regularization strength, with higher λ leading to stronger regularization.
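A sketch of this regularizer for a parameter image of size H × W (function names are ours; we apply the penalty to the pixel weights of one split and leave out the bias entry, a detail the paper does not specify):

```python
import numpy as np

def grid_laplacian(height, width):
    """Laplacian M of the 4-connected grid graph over an H x W parameter image:
    M[i, i] = number of neighbours of pixel i, M[i, j] = -1 for neighbouring pixels i, j."""
    n = height * width
    M = np.zeros((n, n))
    for row in range(height):
        for col in range(width):
            i = row * width + col
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                r2, c2 = row + dr, col + dc
                if 0 <= r2 < height and 0 <= c2 < width:
                    M[i, i] += 1.0
                    M[i, r2 * width + c2] = -1.0
    return M

def spatial_penalty(beta_pixels, M, lam):
    """lambda * beta^T M beta of eq. (15) for the pixel weights of a single split;
    subtracting it from the objective (eq. 14) penalizes spatial finite differences."""
    return lam * beta_pixels @ M @ beta_pixels
```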
3.5. Structure learning

The foregoing shows how to fit a decision tree to training data, given the tree topology (parameter learning). We now turn to the learning of the tree itself (structure learning). We recommend, and evaluate in section 4.1, a greedy strategy: starting at the root, each split is considered and trained as a tree stump, consisting of one split and two leaf nodes. Since there are only two leaves, the log-likelihood objective (eq. 4) then resembles an approximation of the widely popular information gain criterion [23, 24] (section 3.6).

The previously found splits, in the more shallow levels of the tree, deterministically route data to the split currently being trained. In particular this means that, at first, the root split is trained on the entire training data. After training of the first split, both leaves are discarded and replaced by new splits. According to the root split, the training data is deterministically divided into two subsets, which are now used to train the corresponding child nodes. This procedure is repeated until some stopping criterion, e.g. maximum depth, maximum number of leaves or leaf purity, is reached. After this greedy structure learning, the nodes in the entire resulting tree can be finetuned jointly as described in section 3.3, this time with probabilistic routing of all training data.

3.6. Relation to information gain and leaf entropies

We now show that maximization of the log-likelihood of the probabilistic decision tree model approximately minimizes the weighted entropies in the leaves. The steeper the splits become, the better the approximation.

To establish this connection we use the hyperparameter γ to control the steepness of the probabilistic split function (eq. 5). We introduce the function ℓ(x) that returns the index of the leaf that sample x reaches when the path is evaluated deterministically:

ℓ(x) = Σ_{ℓ=1}^L ℓ · lim_{γ→∞} μ_ℓ(x; σ_γ, θ_β).   (16)

This simplifies the log-likelihood objective (eq. 4) to

max_θ Σ_{n=1}^N log (π_{ℓ(x_n)})_{y_n}   (17)

because each sample reaches only one leaf. Let N_{ℓ,k} be the number of training samples in leaf ℓ with class k, and let N_ℓ = Σ_{k=1}^K N_{ℓ,k} denote all training samples in leaf ℓ. Since training samples with the same class and in the same leaf contribute the same term, the equation may be rearranged to

max_θ Σ_{ℓ=1}^L Σ_{k=1}^K N_{ℓ,k} log (π_ℓ)_k.   (18)

With γ → ∞, the optimal leaf predictions are the same as in a standard, deterministic decision tree, i.e. (π_ℓ)_k = N_{ℓ,k} / N_ℓ. Accordingly, the objective can be rewritten as

max_θ lim_{γ→∞} ℒ(θ; X_t, Y_t) = min_θ Σ_{ℓ=1}^L (N_ℓ / N) H_ℓ.   (19)

Here, H_ℓ = −Σ_{k=1}^K (π_ℓ)_k log (π_ℓ)_k denotes the entropy in leaf ℓ.

In conclusion, we have shown that for γ → ∞, maximizing the log-likelihood objective minimizes a weighted sum of leaf entropies. For the special case of a single split with two leaves, this is the same as maximizing the information gain. Consequently, the log-likelihood objective (eq. 4) can be regarded as a generalization of the information gain criterion [23] to an entire tree.

4. Experiments

We conduct experiments on data from very different domains: first, on the multivariate but unstructured datasets used in [22] (section 4.1). Next, we show that the proposed algorithm can learn meaningful spatial features on MNIST, FashionMNIST and ISBI, as has previously been demonstrated in neural networks but not in decision trees (section 4.2). Then, we demonstrate the same property on a real-world biological image processing task (section 4.3). Finally, we deliver proof of principle that a deterministic decision tree with complex split nodes can be trained end-to-end, by using a small neural network in each split node (section 4.4).

4.1. Performance of oblique decision trees

We compare the performance of our algorithm in terms of accuracy to all results reported in [22]. They only compare single, unpruned trees, since common ensemble methods such as bagging and boosting, as well as pruning, can be applied to all algorithms. In order to provide a fair comparison, we also refrain from pruning, ensembles and regularization.

Datasets. Reference [22] reports results on the following four datasets. The multi-class classification datasets SensIT (combined), Connect4, Protein and MNIST are obtained from the LIBSVM repository [6]. When a separate test set is not provided, we randomly split the data into a training set with 80% of the data and use 20% for testing. Likewise, when no validation set is provided, we randomly extract 20% of the training set as validation set. In a preprocessing step, we normalize the data to zero mean and unit variance of the training data.

Compared algorithms. The final model for prediction is always a deterministic decision tree with either oblique or axis-aligned splits. The following algorithms were evaluated in [22]. Axis-aligned: conventional axis-aligned splits based on information gain. OC1: oblique splits optimized with coordinate descent as proposed in [20]. Random: selects the best of randomly generated oblique splits based on information gain. CO2: greedy oblique tree algorithm based on structured learning [21]. Non-greedy: non-greedy oblique decision tree algorithm based on structured learning [22]. We compare the results of these algorithms with our proposed algorithms. Here, Greedy denotes a greedy initialization where each oblique split is computed using the EM optimization. For each depth, we apply the Finetune algorithm to the tree obtained from the Greedy algorithm at that depth.

Hyperparameters and initialization. We keep all hyperparameters fixed and conduct a grid search only over the number of training epochs in {20, 35, 50, 65}, using a train/validation split. The test data is only used to report the final performance.

[Figure 2 plots: test accuracy (top row) and training accuracy (bottom row) versus tree depth on (a) MNIST, (b) SensIT, (c) Connect4 and (d) Protein, for Finetuned, Greedy, Non-greedy, CO2, Axis-aligned, OC1 and Random.]

Figure 2: Performance of deterministic oblique decision trees. Accuracy of Greedy trained (solid, light red line) and Finetuned oblique decision trees (solid, dark red line) on test and training sets is compared against other algorithms. The maximum tree depth varies from 2 to 18 with step size 2. Dashed lines represent results reported in [22].

For gradient-based split parameter optimization, we use the Adam optimizer [12] with default parameters (α = 0.001, β_1 = 0.9, β_2 = 0.999, ε = 10^−8) and a batch size of 1000 with shuffled batches. The split steepness hyperparameter is set to γ = 1.0 initially and increased by 0.1 after each epoch (one epoch consists of the split parameter θ_β updates of all training batches as well as the update of the leaf predictions θ_π). Initial split directions are sampled from the unit sphere and the categorical leaf predictions are initialized uniformly at random.

Results. Figure 2 shows the test and training statistical accuracy of the different decision tree learning algorithms. The accuracy of a classifier is defined as the ratio of correctly classified samples in the respective set. It was evaluated for a single tree at various maximum depths. The red solid lines show the result of our proposed algorithm, the dashed lines represent results from [22].

Our algorithms achieve higher test accuracy than previous work, especially in extremely shallow trees. The highest increase in test accuracy is observed on the MNIST data set. Here, we significantly outperform previous approaches for oblique decision trees at all depths. In particular, an oblique decision tree of depth 4 is already sufficient to surpass all competitors. Likewise, on Connect4 our approach performs better at all depths. Notably, a decision tree of depth 2 is sufficient to exceed previous approaches. On SensIT and Protein we perform better than or on par with the Non-greedy approach proposed in [22]. However, experiments with regularization of leaf features have shown that with more hyperparameter tuning overfitting may be reduced, e.g. on the Protein dataset, and thus the results may be improved. We did not include this here, as we aimed to provide a fair comparison and show the performance given very little fine-tuning. Generally, our algorithm also trains more accurate oblique decision trees, regardless of the depth complexity, on the training data.

In conclusion, our experiments show that our proposed algorithm is able to learn more accurate deterministic oblique decision trees than previous approaches. Furthermore, we refrained from hyperparameter tuning to show that the approach works well even with default parameters.


Figure 3: Visualizations of oblique split parameters learned with and without spatial regularization (section 3.4). The pa- rameters are learned on different datasets, viz. MNIST [15] (top row), FashionMNIST [30] (center row) and ISBI [3] (bottom row). Parameters trained with spatial regularization show visible structures and patterns, whereas parameters learned without regularization appear noisy. In both cases, we selected the parameters that show the best visible structures.

4.2. Spatially regularized parameters

We now investigate the effects of spatial regularization (section 3.4) on the parameters of oblique decision trees learned with our algorithm. For this purpose, we train oblique decision trees on the MNIST digit dataset [15], the FashionMNIST fashion product dataset [30] and the ISBI image partitioning dataset [3], which comprises serial section Transmission Electron Microscopy images. In figure 3, we visualize selected parameters of the oblique splits at various depths, with and without regularization. In both cases, we selected the parameters that displayed the best visible structures. For MNIST and FashionMNIST, the parameters were reshaped to 28 × 28 images, such that each parameter pixel corresponds to the respective pixel in the training images. To solve the segmentation task on ISBI, we provide a sliding window of size 31 × 31 as features for each pixel in the center of the window. Moreover, we linearly normalized the parameters to the full grayscale range.

Results. Regularization penalizes differences in adjacent parameters. The parameters without regularization appear very noisy and it is difficult for the human eye to identify structures. In contrast, in the experiments with regularization the algorithm learns smoother parameter patterns, without decreasing the accuracy of the decision trees. The regularized parameters display structures and recognizable patterns. The patterns learned on MNIST show visible sigmoidal shapes and even recognizable digits. On the FashionMNIST dataset, the regularized parameters display the silhouettes of coats, pants and sneakers. Likewise, our algorithm is able to learn the structures of membranes on the real-world biological electron microscopy images from the ISBI dataset.

4.3. Image segmentation

We test the applicability of the proposed decision tree algorithm for image segmentation on the ISBI challenge dataset [3]. This image partitioning benchmark comprises serial section Transmission Electron Microscopy images and binary annotations of neurons and membranes (figure 4a). For every pixel, we provide a sliding window around the current pixel as input features to the oblique decision tree; a sketch of this feature extraction is shown below. Consequently, the learned parameters at each split node can be regarded as a spatial kernel. We learn an oblique decision tree of depth 8 with a maximum of 256 leaves with our greedy EM algorithm. We use the default parameters as described in section 4.1 and train each split for 40 epochs.
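A minimal sketch of that per-pixel feature extraction (our own helper, assuming NumPy ≥ 1.20 for sliding_window_view and a reflective border padding, neither of which is specified in the paper):

```python
import numpy as np

def pixel_windows(image, size=31):
    """Return one flattened size x size window per pixel of a 2-D grayscale image."""
    pad = size // 2
    padded = np.pad(image, pad, mode="reflect")   # border handling: our assumption
    windows = np.lib.stride_tricks.sliding_window_view(padded, (size, size))
    return windows.reshape(image.shape[0] * image.shape[1], size * size)
```

Each row of the returned matrix is then one sample x ∈ R^961 for the oblique tree, so the pixel weights of a split can be reshaped back into a 31 × 31 kernel for visualization.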

Results. Figure 4 shows a sample image of the input, the groundtruth labels, the predicted probability of our oblique decision tree and the color-coded leaf affiliation. The visualization of the prediction shows pixels more likely to be of class "membrane" in darker color. In the color-coded leaf affiliation, each grayscale value represents a leaf in the oblique decision tree. Darker pixels have reached a leaf further on the left side of the decision tree. In the prediction, most of the membranes are correctly identified. However, many mitochondria are falsely classified as membrane. Interestingly, the leaf affiliation is fairly regular, implying that most adjacent pixels are routed to the same leaf, in spite of some variation in appearance. The leaf affiliation could also be provided as an additional feature in order to stack classifiers.

Figure 4: Visualization of results of oblique decision trees on the ISBI binary segmentation dataset. Column a shows the input image and column b the corresponding groundtruth labels. Column c illustrates the probability estimate predicted by the oblique decision tree. Darker means higher probability for class "membrane". Column d shows the leaf affiliation in the oblique decision tree. The leaves from left to right are equidistantly assigned to grayscale values from black to white.

4.4. CNN split features

In a preliminary experiment, we test the effectiveness of Convolutional Neural Networks as split features on MNIST. At each split we trained a very simple CNN of the following architecture: Convolution 5 × 5 kernel @ 3 output channels → Max Pool 2 × 2 → ReLU → Convolution 5 × 5 @ 6 → Max Pool 2 × 2 → ReLU → Fully connected layer 96 × 50 → ReLU → Fully connected layer 50 × 1. The final scalar output is the split feature, which is the input to the split function.
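The described architecture translates directly into a small convolutional module; the following PyTorch sketch (the framework choice and class name are ours, not the authors') produces the scalar split feature, which is then passed through σ_γ to obtain the routing probability.

```python
import torch.nn as nn

class CNNSplitFeature(nn.Module):
    """Split feature f_beta: a 28x28 grayscale image -> one scalar split feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 3, kernel_size=5),   # 28x28 -> 24x24, 3 channels
            nn.MaxPool2d(2),                  # -> 12x12
            nn.ReLU(),
            nn.Conv2d(3, 6, kernel_size=5),   # -> 8x8, 6 channels
            nn.MaxPool2d(2),                  # -> 4x4
            nn.ReLU(),
            nn.Flatten(),                     # -> 96 features
            nn.Linear(96, 50),
            nn.ReLU(),
            nn.Linear(50, 1),                 # scalar split feature
        )

    def forward(self, x):                     # x: (batch, 1, 28, 28)
        return self.net(x).squeeze(-1)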
Again, we train greedily to initialize the tree, however we split nodes in a best-first manner, based on highest information gain. As a result, the trees can be fairly unbalanced despite impure leaves. We choose to stop at a maximum of 10 leaves, as we aim to increase interpretability and efficiency by having one expert leaf per class.

Results. In this setting, we achieve a test accuracy of 0.982 ± 0.003 with deterministic evaluation of the nodes. This model provides interesting benefits in interpretability and efficiency, which are the main advantages of decision trees. When a sample is misclassified, it is straightforward to find the split node that is responsible for the error. This offers interpretability as well as the possibility to improve the overall model. Other methods, such as OneVsOne or OneVsRest multi-class approaches, provide similar interpretability, however at a much higher cost at test time. This is due to the fact that in a binary decision tree with K leaves, i.e. a leaf for each class, it is sufficient to evaluate O(log K) split nodes. In OneVsOne and OneVsRest it is necessary to evaluate K(K − 1)/2 and respectively K different classifiers at test time.

5. Conclusion

We have presented a new approach to train deterministic decision trees with gradient-based optimization in an end-to-end manner. We show that this approach outperforms previous algorithms for oblique decision trees. The approach is not restricted in the complexity of the split features, and we have provided preliminary evidence of the effectiveness of more complex split features, such as convolutional neural networks. Moreover, our approach allows imposing additional regularization constraints on the learned split features. We have demonstrated these capabilities by visualizing spatially regularized parameters on image processing datasets. The overall approach provides high flexibility and the potential for accurate models that maintain interpretability and efficiency due to the conditional data flow.

References

[1] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[2] L. Breiman, J. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Chapman & Hall/CRC, 1984.
[3] A. Cardona, S. Saalfeld, S. Preibisch, B. Schmid, A. Cheng, J. Pulokas, P. Tomancak, and V. Hartenstein. An integrated micro- and macroarchitectural analysis of the drosophila brain by computer-assisted serial section electron microscopy. PLOS Biology, 8(10):1–17, 2010.
[4] A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.
[5] P. H. C. Eilers and B. D. Marx. Flexible smoothing with B-splines and penalties. Statistical Science, 11:89–121, 1996.
[6] R.-E. Fan and C.-J. Lin. LIBSVM data: Classification, regression and multi-label. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, 2011.
[7] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
[8] J. Gall and V. Lempitsky. Class-specific hough forests for object detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1022–1029, June 2009.
[9] Y. Ioannou, D. Robertson, D. Zikic, P. Kontschieder, J. Shotton, M. Brown, and A. Criminisi. Decision forests, convolutional networks and the models in-between. arXiv:1603.01250, March 2016.
[10] M. I. Jordan. A statistical approach to decision tree modeling. In Proceedings of the Seventh Annual Conference on Computational Learning Theory, COLT '94, pages 13–20, New York, NY, USA, 1994.
[11] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, March 1994.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[13] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulò. Deep neural decision forests. In ICCV, 2015.
[14] P. Kontschieder, P. Kohli, J. Shotton, and A. Criminisi. GeoF: Geodesic forests for learning coupled predictors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[16] V. Lepetit, P. Lagger, and P. Fua. Randomized trees for real-time keypoint recognition. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 2, pages 775–781, June 2005.
[17] M. McGill and P. Perona. Deciding how to decide: Dynamic routing in artificial neural networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2363–2372. PMLR, August 2017.
[18] B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht. On oblique random forests, pages 453–469. Springer, 2011.
[19] A. Montillo, J. Tu, J. Shotton, J. Winn, J. Iglesias, D. Metaxas, and A. Criminisi. Entanglement and differentiable information gain maximization. In Decision Forests for Computer Vision and Medical Image Analysis, chapter 19, pages 273–293. Springer, January 2013.
[20] K. V. S. Murthy. On Growing Better Decision Trees from Data. PhD thesis, The Johns Hopkins University, 1996.
[21] M. Norouzi, M. D. Collins, D. J. Fleet, and P. Kohli. CO2 forest: Improved random forest by continuous optimization of oblique splits. arXiv:1506.06155, 2015.
[22] M. Norouzi, M. D. Collins, M. Johnson, D. J. Fleet, and P. Kohli. Efficient non-greedy optimization of decision trees. In NIPS, December 2015.
[23] J. R. Quinlan. Induction of decision trees. In J. W. Shavlik and T. G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in Machine Learning 1:81–106, 1986.
[24] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[25] D. Richmond, D. Kainmueller, M. Yang, E. Myers, and C. Rother. Mapping auto-context decision forests to deep convnets for semantic segmentation. In E. R. Hancock, R. C. Wilson, and W. A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 144.1–144.12. BMVA Press, September 2016.
[26] K. Rose, E. Gurewitz, and G. C. Fox. Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65:945–948, August 1990.
[27] I. K. Sethi. Entropy nets: from decision trees to neural networks. Proceedings of the IEEE, 78(10):1605–1613, October 1990.
[28] A. Suárez and J. F. Lutsko. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, December 1999.
[29] J. Welbl. Casting random forests as artificial neural networks (and profiting from it). In GCPR, 2014.
[30] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.

Supplementary Material to End-to-end Learning of Deterministic Decision Trees

Thomas Hehn    Fred A. Hamprecht
HCI/IWR, Heidelberg University
{thomas.hehn, fred.hamprecht}@iwr.uni-heidelberg.de

1. Alternating optimization algorithm

Similarly to the EM-algorithm presented in section 3.3 of the main paper, the alternating strategy aims to optimize the split parameters and the leaf prediction parameters separately. However, instead of computing the exact leaf predictions, we approximate them using a deterministic decision tree.

Let N_{ℓ,k} be the number of training samples in leaf ℓ with class k, and let N_ℓ = Σ_{k=1}^K N_{ℓ,k} denote the sum of all training samples in leaf ℓ. The optimal leaf predictions are

(π_ℓ)_k = N_{ℓ,k} / N_ℓ.   (1)

Analogously to the EM-algorithm in the main paper, we use Adam [1] to optimize the split parameters. The log-likelihood objective given the current estimate of the leaf predictions is:

max_{θ_β} Σ_{n=1}^N log( Σ_{ℓ=1}^L (π_ℓ)_{y_n} μ_ℓ(x_n; σ_γ, θ_β) ).   (2)

This algorithm is applicable to optimize an entire tree, as well as to optimize tree stumps as required for the greedy structure learning.

1.1. Experiments

We compare this algorithm to the results of our EM algorithm for oblique decision trees reported in section 4.1 of the main paper. The hyperparameter tuning and the experimental setup are done in the same way as for the EM algorithm. Figure 1 illustrates the results of the experiments. The results show that both algorithms perform equally well in terms of accuracy.
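A minimal sketch of the approximate leaf update of equation (1), computed from hard (deterministic) leaf assignments; the helper name and the uniform fallback for empty leaves are our own choices, not from the paper.

```python
import numpy as np

def deterministic_leaf_update(leaf_index, y, num_leaves, num_classes):
    """(pi_l)_k = N_{l,k} / N_l from hard leaf assignments.
    leaf_index : (N,) leaf reached by each sample when routed deterministically
    y          : (N,) integer class labels."""
    counts = np.zeros((num_leaves, num_classes))
    np.add.at(counts, (leaf_index, y), 1.0)            # N_{l,k}
    totals = counts.sum(axis=1, keepdims=True)         # N_l
    # empty leaves fall back to a uniform distribution (our choice, not from the paper)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / num_classes)
```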
2. Visualization of oblique decision trees

Figures 2 and 3 show visualizations of entire oblique decision trees of depth 4 trained on MNIST [2] and FashionMNIST [3]. Both trees were trained with the EM algorithm and spatial regularization to obtain smooth visualizations. Leaf predictions are visualized as bar plots. The split parameters are visualized as in section 4.2 of the main paper. Intuitively, the better the input image matches the parameter image, the more likely it is to go to the right child. Likewise, the better the input image resembles the negative parameter image, the more likely it is to go to the left child. How many training samples follow a certain path is indicated by the thickness of the arrows. Thicker arrows mean more training samples follow the path when the decision tree is deterministic.

Some split parameters of the decision tree trained on MNIST (figure 2) reveal interesting structures. At depth 3 (row 4), the fifth split node from the left shows a dark silhouette of the number "6". Accordingly, the left child predicts class "6". In the same manner, the parameters of the sixth split node at depth 3 expose a dark stroke in the center surrounded by brighter pixels. This is used to distinguish class "1", which is predicted by the left child.

The decision tree in figure 3 was trained on the FashionMNIST dataset [3], a dataset comprising 28 × 28 grayscale images of fashion products. These fashion products can be discovered in the learned split parameters. The split parameters of the first split at depth 3 (row 4) show bright trousers and its right child predicts the class "trousers". The same holds for the second split at depth 3, showing a bright silhouette of a dress, and its right child, which predicts "dress". The parameters of the third split at depth 3 reveal some kind of upper body clothes, but it is difficult to determine the kind. Yet, these parameters separate samples of class "pullover" and "shirt" (left child) from class "coat" (right child). Similarly, the fifth split node also reveals only a slight silhouette of a shoe. Still, its right child is certain predicting "sneaker", whereas the left child is rather indecisively predicting "sandals", "sneaker" and "ankle boots".

The decision tree illustrations reveal the internal decisions being made to reach a final prediction and provide a useful tool to interpret our model.

[Figure 1 plots: test accuracy (top row) and training accuracy (bottom row) versus tree depth on (a) MNIST, (b) SensIT, (c) Connect4 and (d) Protein, for Finetuned (EM), Finetuned (alt), Greedy (EM) and Greedy (alt).]

Figure 1: Performance of alternating optimization (alt) compared to the EM optimization (EM) presented in the main paper. Accuracy of oblique decision trees at different maximum depths on test and training sets is compared. The maximum tree depth is varied from 2 to 18 with step size 2. At each depth, both algorithms (alt and EM) are used to learn an initial greedy tree and to finetune that initial tree.

References

[1] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[3] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.


Figure 2: Visualization of an oblique decision tree learned on MNIST [2] with spatial regularization. The split parameters are visualized as in section 4.2 of the main paper. As a rule of thumb: the better the input image matches the parameter image, the more likely the sample will go to the right child. If the input image better resembles the negative parameter image, the sample will go to the left. The thickness of an arrow indicates the number of training samples following the path when the decision tree is evaluated deterministically. Leaf predictions are visualized as bar plots, where the x-axis denotes classes 0 to 9 and the y-axis corresponds to the probability of each class. Axis labels were omitted due to space restrictions.


Figure 3: Visualization of an oblique decision tree learned on FashionMNIST [3] with spatial regularization. The split parameters are visualized as in section 4.2 of the main paper. As a rule of thumb: The better the input image matches the parameter image, the more likely the sample will go to the right child. If the input image better resembles the negative parameter image, the sample will go to the left. The thickness of an arrow indicates the number of training samples following the path when the decision tree is evaluated deterministically. Leaf predictions are visualized as bar plots. The x-axis denotes classes ( 0: T-shirt/top, 1: Trouser, 2: Pullover, 3: Dress, 4: Coat, 5: Sandal, 6: Shirt, 7: Sneaker, 8: Bag, 9: Ankle boot) and the y-axis corresponds to the probability of each class. Axis labels were omitted due to space restrictions.