Using a Random Forest to Inspire a Neural Network and Improving on It

Using a Random Forest to Inspire a Neural Network and Improving on It Suhang Wang∗y Charu Aggarwalz Huan Liux Abstract tures. Such characteristics are certainly not true for Neural networks have become very popular in recent years all data domains, and in some cases a data set may be because of the astonishing success of deep learning in vari- drawn from an application with unknown characteristic ous domains such as image and speech recognition. In many behaviors. of these domains, specific architectures of neural networks, In spite of the successes of neural networks in spe- such as convolutional networks, seem to fit the particular cific domains, this success has not been replicated across structure of the problem domain very well, and can there- all domains. In fact, methods like random forests [8, 9] fore perform in an astonishingly effective way. However, the regularly outperform neural networks in arbitrary do- success of neural networks is not universal across all domains [10], especially when the underlying data sizes mains. Indeed, for learning problems without any special are small and no domain-specific insight has been used structure, or in cases where the data is somewhat limited, to arrange the architecture of the underlying neural net- neural networks are known not to perform well with respect work. This is because neural networks are highly prone to traditional machine learning methods such as random to overfitting, and the use of a generic layered architec- forests. In this paper, we show that a carefully designed ture of the computation units (without domain-specific neural network with random forest structure can have bet- insights) can lead to poor results. The performance of ter generalization ability. In fact, this architecture is more neural networks is often sensitive to the specific archi- powerful than random forests, because the back-propagation tectures used to arrange the computational units. Al- algorithm reduces to a more powerful and generalized way though the convolutional neural network architecture is of constructing a decision tree. Furthermore, the approach known to work well for the image domain, it is hard to is efficient to train and requires a small constant factor of expect an analyst to know which neural network archi- the number of training examples. This efficiency allows the tecture to use for a particular domain or for a specific training of multiple neural networks in order to improve the data set from a poorly studied application domain. generalization accuracy. Experimental results on 10 real- In contrast, methods like decision forests are con- world benchmark datasets demonstrate the effectiveness of sidered generalist methods in which one can take an the proposed enhancements. off-the-shelf package like caret [11], and often outperform [10] even the best of classifiers. A recent study [10] 1 Introduction evaluated 179 classifiers from 17 families on the entire UCI collection of data sets, and concluded that ran- Neural networks have become increasingly popular in re- dom forests were the best performing classifier among cent years because of their tremendous success in image these families, and in most cases, their performance was classification [1, 2], speech recognition [3, 4] and natural better than other classifiers in a statistically significant language processing tasks [5, 6]. In fact, deep learning way. In fact, multiple third-party implementations of methods have regularly won many recent challenges in random forests were tested by this study and virtually these domains [3, 7]. This success is, in part, because the all implementations provided better performance than special structure of these domains often allows the use of multiple implementations of other classifiers; these re- specialized neural network architectures such as convo- sults also suggest that the wins by the random forest lutional neural networks [3], which take advantage of the method were not a result of the specific implementa- aspects like spatial locality in images. Images, speech tions of the method, but are inherent to the merit of and natural language processing are rather specialized the approach. Furthermore, the data sets in the UCI data domains in which the attributes exhibit very char- repository are drawn from a vast variety of domains, acteristic spatial/temporal behavior, which can be ex- and are not specific to one narrow class of data such as ploited by carefully designed neural network architec- images or speech. This also suggests that the performance of random forests is quite robust irrespective of ∗Part of the work is done during first author's internship at IBM T.J Watson the data domain at hand. yArizona State University, [email protected] Random forests and neural networks share impor- zIBM T.J. Watson, [email protected] tant characteristics in common. Both have the ability to xArizona State University, [email protected] model arbitrary decision boundaries, and it can be ar- hand, a mapping exists from an arbitrary random forest gued that in this respect, neural networks are somewhat to such a neural network, and a mapping back exists as more powerful when a large amount of data is available. well. Interestingly, such a mapping has also been shown On the other hand, neural networks are highly prone in the case of convolutional neural networks [14, 15], al- to overfitting, whereas random forests are extremely though the resulting random forests have a specialized robust to overfitting because of their randomized en- structure that is suited to the image domain [15]. This semble approach. The overfitting of neural networks is paper will focus on designing a neural network architec- an artifact of the large number of parameters used to ture which has random forest structure such that it has construct the model. Methods like convolutional neu- better classification ability and reduced overfitting. The ral networks drastically reduce the number of parame- main contributions of the paper are listed as follows: ters to be learned by using specific insights about the data domain (e.g., images) at hand. This strongly sug- • We propose a novel architecture of decision-tree gests that the choice of a neural network architecture like neural networks, which has similar properties that drastically reduces the number of parameters with as a randomized decision tree, and an ensemble of domain-specific insights can help in improving accuracy. such neural networks forms the proposed frame- Domain-specific insights are not the only way in work called Neural Network with Random Forest which one can engineer the architecture of a neural Structure (NNRF); network in order to reduce the parameter footprint. In this paper, we show that one can use inspiration from • We design decision making functions of the neural successful classification methods like random forests networks, which results in forward and backward to engineer the architecture of the neural network. propagation with low time complexity and with Furthermore, starting with this basic architecture, one reduced possibility of overfitting for smaller data can improve on the basic random forest model by sets; and leveraging the inherent power in the neural network • We conduct extensive experiments to demonstrate architecture in a carefully controlled way. The reason the effectiveness of the proposed framework. is that models like random forests are also capable of approximating arbitrary decision boundaries but with less overfitting on smaller data sets. 2 A Random Forest-Inspired Architecture It is noteworthy that several methods have been In this section, we introduce the basic architecture proposed to simulate the output of a decision tree (or of the neural network used for the learning process. random forest) algorithm on a specific data set, once it Throughout this paper, matrices are written as boldface has already been constructed [12, 13]. In other words, capital letters such as M, Wij, and vectors are denoted such an approach first constructs the decision tree (or as boldface lowercase letters such as p and pij. M(i; j) random forests) on the data set up front, and then denotes the (i; j)-th entry of M while M(i; :) and M(:; j) tries to simulate this specific instantiation of the random denotes the i-th row and j-th column, respectively. forest with a neural network. Therefore, the constructed Similarly, p(i) denotes the i-th elements of p. random forest is itself an input to the algorithm. Such In conventional neural networks, the nodes in the an approach defeats the purpose of a neural network in input layer are cleanly separated from the hidden layer. the first place, because it now has to work with the strait However, in this case, we will propose a neural network jacket of a specific instantiation of the random forest. in which a clear separation does not exist between the In other words, it is hard to learn a model, which is nodes of the input and hidden layers. The internal nodes much better than the base random forest model, even are not completely hidden because they are allowed to with modifications. receive inputs from some of the features. Rather, the In this paper, we propose a fundamentally differ- neural network is designed with a hierarchical architec- ent approach to design a basic architecture of the neu- ture, much like a decision tree. Furthermore, just as ral network, so that it constructs a model with similar a random forest contains multiple independent decision properties as a randomized decision tree, although it trees, our approach will use multiple independent neural does not simulate a specific random forest. A different networks, each of which has a randomized architecture way of looking at this approach is that it constructs a based on the randomized choice of the inputs in the neural network first, which has the property of being in- \hidden" layers.

Load more