
End-to-end Learning of Deterministic Decision Trees

Thomas Hehn    Fred A. Hamprecht
HCI/IWR, Heidelberg University
{thomas.hehn, fred.hamprecht}@iwr.uni-heidelberg.de

Abstract

Conventional decision trees have a number of favorable properties, including interpretability, a small computational footprint and the ability to learn from little training data. However, they lack a key quality that has helped fuel the deep learning revolution: that of being end-to-end trainable, and of learning from scratch those features that best allow solving a given supervised learning problem. Recent work (Kontschieder 2015) has addressed this deficit, but at the cost of losing a main attractive trait of decision trees: the fact that each sample is routed along a small subset of tree nodes only. We here propose a model and Expectation-Maximization training scheme for decision trees that are fully probabilistic at train time, but after a deterministic annealing process become deterministic at test time. We also analyze the learned oblique split parameters on image datasets and show that neural networks can be trained at each split node. In summary, we present the first end-to-end learning scheme for deterministic decision trees and present results on par with or superior to published standard oblique decision tree algorithms.

1. Introduction

When selecting a supervised machine learning technique, we are led by multiple and often conflicting criteria. These include: how accurate is the resulting model? How much training data is needed to achieve a given level of accuracy? How interpretable is the model? How big is the computational effort at train time? And at test time? How well does the implementation map to the available hardware?

These days, neural networks have superseded all other approaches in terms of achievable accuracy of the predictions; but state of the art networks are not easy to interpret, are fairly hungry for training data, often require weeks of GPU training and have a computational and memory footprint that rules out their use on small embedded devices. Decision trees achieve inferior accuracy, but are fundamentally more frugal.

Both neural networks and decision trees are composed of basic computational units, the perceptrons and nodes, respectively. A crucial difference between the two is that in a standard neural network, all units are evaluated for every input; while in a decision tree with I inner split nodes, only O(log I) split nodes are visited. That is, in a decision tree, a sample is routed along a single path from the root to a leaf, with the path conditioned on the sample's features. It is this sparsity of the sample-dependent computational graph that piques our interest in decision trees; but we also hope to profit from their ability to learn their comparatively few parameters from a small training set, and their relative interpretability.

One hallmark of neural networks is their ability to learn a complex combination of many elementary decisions jointly, by end-to-end training using backpropagation. This is a feature that has so far been missing in deterministic decision trees, which are usually constructed greedily without subsequent tuning. We here propose a mechanism to remedy this deficit.

1.1. Contributions

• We propose a decision tree whose internal nodes are probabilistic and hence differentiable at train time. As a consequence, we are able to train the internal nodes jointly in an end-to-end fashion. This is true for linear nodes, but the property is maintained for more complex nodes, such as small Convolutional Neural Networks (CNNs) (section 3.4).

• We derive an expectation-maximization style algorithm for finding the optimal parameters in a split node (section 3.3). We develop a probabilistic split criterion that generalizes the long-established information gain [23]. The proposed criterion is asymptotically identical to information gain in the limit of very steep non-linearities, but allows better modeling of class overlap in the vicinity of a split decision boundary (section 3.2).

• We demonstrate good results by making the nodes deterministic at test time, sending each sample along a unique path of only O(log I) out of the I inner nodes in a tree. We evaluate the performance of the proposed method on the same datasets as used in related work [22] (section 4.1) and find steeper learning curves with respect to tree depth, as well as higher overall accuracy. We show the benefit of regularizing the spatial derivatives of learned features when samples are images or image patches (section 4.2). Finally, we report preliminary experiments with minimalistic trees with CNNs as split feature.

2. Related work

Decision trees and decision tree ensembles, such as random forests [1], are widely used for computer vision [4] and have proven effective on a variety of classification tasks [7]. In order to improve their performance for a specific task, it is common practice to engineer the features for that task [8, 14, 16]. Oblique linear and non-linear classifiers using more than one feature at a time have been benchmarked in [18], but the available algorithms are, in contrast to our approach, limited to binary classification problems.

There have been several attempts to train decision trees using gradient optimization techniques for more complex split functions. Similarly to our approach, [19] have successfully approximated information gain using a sigmoid function and a smoothness hyperparameter. However, that approach does not allow joint optimization of an entire tree.

In [22], the authors also propose an algorithm for optimization of an entire tree with a given structure. They show a connection between optimizing oblique splits and structured prediction with latent variables. As a result, they formulate a convex-concave upper bound on the tree's empirical loss. In order to find an initial tree structure, the work also relies on a greedy algorithm, which is based on the same upper bound approach [21]. Their method is restricted to linear splits and relies on the kernel trick to introduce higher order split features, as opposed to our optimization, which allows more complex split features.

Other advances towards gradient-based decision tree optimization rely on either fuzzy or probabilistic split functions [28, 13, 10]. In contrast to our approach, the assignment of a single sample to the leaves remains fuzzy, respectively probabilistic, during prediction. Consequently, all leaves and paths need to be evaluated for every sample, which annihilates the computational benefits of trees.

We build on reference [13], which is closest to our work. The authors use sigmoid functions to model the probabilistic routes and employ the same log-likelihood objective. In contrast to their work, we derive the alternating optimization using the Expectation-Maximization approach as in [11] and aim for a deterministic decision tree for prediction. Also, they start from a random, but balanced tree.

Finally, connections between neural networks and decision tree ensembles have been examined. In [27, 29], decision tree ensembles are cast to neural networks, which enables gradient descent training. As long as the structure of the trees is preserved, the optimized parameters of the neural network can also be mapped back to the random forest. Subsequently, [25] cast stacked decision forests to convolutional neural networks and found an approximate mapping back. In [9, 17], several models of neural networks with separate, conditional data flows are discussed.

Our work builds on various ideas of previous work; however, none of these algorithms learn deterministic decision trees with arbitrary split functions in an end-to-end fashion.

3. Methods

Consider a classification problem with input space X ⊂ R^p and output space Y = {1, ..., K}. The training set is defined as {x_1, ..., x_N} = X_t ⊂ X with corresponding classes {y_1, ..., y_N} = Y_t ⊂ Y. We propose training a probabilistic decision tree model, which becomes deterministic at test time.

3.1. Standard decision tree and notation

In binary decision trees (figure 1c), split functions s : R → [0, 1] determine the routing of a sample through the tree, conditioned on that sample's features. The split function controls whether the splits are deterministic or probabilistic. The prediction is made by the leaf node that is reached by the sample.

Split nodes. Each split node i ∈ {1, ..., I} computes a split feature from a sample and sends that feature into a split function. That function is a map f_{β_i} : R^p → R parametrized by β_i. For example, oblique splits are a linear combination of the input, as in f_{β_i}(x) = (x^T, 1) · β_i with β_i ∈ R^{p+1}. Similarly, an axis-aligned split perpendicular to axis a is represented by an oblique split whose only non-zero parameters are at index a and p + 1. We write θ_β = (β_1, ..., β_I) to denote the collection of all split parameters in the tree.

Leaf nodes. Each leaf ℓ ∈ {1, ..., L} stores a categorical distribution over classes k ∈ {1, ..., K} in a vector π_ℓ ∈ [0, 1]^K. These vectors are normalized such that the class probabilities in each leaf sum to one: Σ_{k=1}^K (π_ℓ)_k = 1. We define θ_π = (π_1, ..., π_L) to include all leaf parameters in the tree.

Paths. For each leaf node there exists one unique set of split outcomes, called a path. We define the probability that a sample x takes the path to leaf ℓ as

μ_ℓ(x; s, θ_β) = ∏_{i ∈ R_ℓ} s(f_{β_i}(x)) · ∏_{i ∈ L_ℓ} (1 − s(f_{β_i}(x))),

where R_ℓ denotes the split nodes at which the path to leaf ℓ branches right and L_ℓ those at which it branches left.
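To make the notation above concrete, the following minimal Python sketch computes the oblique split feature f_β(x), a sigmoid choice for the split function s, and the path probability μ_ℓ for a small toy tree. It illustrates the notation only and is not the authors' implementation: the function names, the explicit (node index, goes-right) path bookkeeping, and the toy tree in the example are assumptions introduced here.

import numpy as np

def sigmoid(z):
    """A smooth split function s : R -> [0, 1]; the sigmoid is an assumption here."""
    return 1.0 / (1.0 + np.exp(-z))

def oblique_feature(x, beta):
    """Oblique split feature f_beta(x) = (x^T, 1) . beta, with x in R^p and beta in R^(p+1)."""
    return np.dot(np.append(x, 1.0), beta)

def path_probability(x, path, theta_beta, s=sigmoid):
    """mu_l(x; s, theta_beta): probability that sample x takes the path to leaf l.

    `path` lists the split nodes on the way to the leaf as (node_index, goes_right)
    pairs; this explicit bookkeeping is illustrative, not notation from the paper.
    """
    mu = 1.0
    for i, goes_right in path:
        p_right = s(oblique_feature(x, theta_beta[i]))    # probability of branching right
        mu *= p_right if goes_right else (1.0 - p_right)  # left branch contributes 1 - s(f(x))
    return mu

# Toy setup: p = 2 features, K = 3 classes, complete tree of depth 2
# with I = 3 split nodes and L = 4 leaves (hypothetical numbers).
rng = np.random.default_rng(0)
theta_beta = [rng.normal(size=3) for _ in range(3)]   # each beta_i in R^(p+1)
theta_pi = [np.full(3, 1.0 / 3) for _ in range(4)]    # each pi_l is normalized to sum to one

# Leaf paths: 0 = left/left, 1 = left/right, 2 = right/left, 3 = right/right.
paths = [
    [(0, False), (1, False)],
    [(0, False), (1, True)],
    [(0, True), (2, False)],
    [(0, True), (2, True)],
]

x = np.array([0.5, -1.2])
mus = [path_probability(x, path, theta_beta) for path in paths]
# For a complete binary tree, the path probabilities over all leaves sum to one.
assert np.isclose(sum(mus), 1.0)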