Adaptive Dropout for Training Deep Neural Networks

Lei Jimmy Ba, Brendan Frey
Department of Electrical and Computer Engineering, University of Toronto
jimmy, [email protected]

Abstract

Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures.

1 Introduction

For decades, deep networks with broad hidden layers and full connectivity could not be trained to produce useful results, because of overfitting, slow convergence and other issues. One approach that has proven to be successful for unsupervised learning of both probabilistic generative models and auto-encoders is to train a deep network layer by layer in a greedy fashion [7].
Each layer of connections is learnt using contrastive divergence in a restricted Boltzmann machine (RBM) [6] or backpropagation through a one-layer auto-encoder [1], and then the hidden activities are used to train the next layer. When the parameters of a deep network are initialized in this way, further fine tuning can be used to improve the model, e.g., for classification [2]. The unsupervised, pre-training stage is a crucial component for achieving competitive overall performance on classification tasks, e.g., Coates et al. [4] have achieved improved classification rates by using different unsupervised learning algorithms.

Recently, a technique called dropout was shown to significantly improve the performance of deep neural networks on various tasks [8], including vision problems [10]. Dropout randomly sets hidden unit activities to zero with a probability of 0.5 during training. Each training example can thus be viewed as providing gradients for a different, randomly sampled architecture, so that the final neural network efficiently represents a huge ensemble of neural networks, with good generalization capability. Experimental results on several tasks show that dropout frequently and significantly improves the classification performance of deep architectures. Injecting noise for the purpose of regularization has been studied previously, but in the context of adding noise to the inputs [3], [21] and to network components [16]. Unfortunately, when dropout is used to discriminatively train a deep fully connected neural network on input with high variation, e.g., in viewpoint and angle, little benefit is achieved (section 5.5), unless spatial structure is built in.

In this paper, we describe a generalization of dropout, where the dropout probability for each hidden variable is computed using a binary belief network that shares parameters with the deep network. Our method works well both for unsupervised and supervised learning of deep networks.
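As context for the generalization that follows, standard dropout can be sketched in a few lines; this is a minimal illustration of the masking step described above, with names chosen for clarity rather than taken from any reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p=0.5):
    """Standard dropout: zero each hidden activity independently with
    probability p during training (p = 0.5 in [8])."""
    mask = rng.random(h.shape) >= p  # keep each unit with probability 1 - p
    return mask * h

h = np.array([0.3, -1.2, 0.8, 2.0])
dropped = dropout_forward(h)  # each entry is either 0 or its original value
```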
We present results on the MNIST and NORB datasets showing that our ‘standout’ technique can learn better feature detectors for handwritten digit and object recognition tasks. Interestingly, we also find that our method enables the successful training of deep auto-encoders from scratch, i.e., without layer-by-layer pre-training.

2 The model

The original dropout technique [8] uses a constant probability for omitting a unit, so a natural question we considered is whether it may help to let this probability be different for different hidden units. In particular, there may be hidden units that can individually make confident predictions for the presence or absence of an important feature or combination of features. Dropout will ignore this confidence and drop the unit out 50% of the time. Viewed another way, suppose after dropout is applied, it is found that several hidden units are highly correlated in the pre-dropout activities. They could be combined into a single hidden unit with a lower dropout probability, freeing up hidden units for other purposes.

We denote the activity of unit $j$ in a deep neural network by $a_j$ and assume that its inputs are $\{a_i : i < j\}$. In dropout, $a_j$ is randomly set to zero with probability 0.5. Let $m_j$ be a binary variable that is used to mask the activity $a_j$, so that its value is

$$a_j = m_j \, g\Big(\sum_{i : i < j} w_{j,i} a_i\Big), \qquad (1)$$

where $w_{j,i}$ is the weight from unit $i$ to unit $j$, $g(\cdot)$ is the activation function, and $a_0 = 1$ accounts for biases. Whereas in standard dropout $m_j$ is Bernoulli with probability 0.5, here we use an adaptive dropout probability that depends on input activities:

$$P(m_j = 1 \mid \{a_i : i < j\}) = f\Big(\sum_{i : i < j} \pi_{j,i} a_i\Big), \qquad (2)$$

where $\pi_{j,i}$ is the weight from unit $i$ to unit $j$ in the standout network, or adaptive dropout network; $f(\cdot)$ is a sigmoidal function, $f : \mathbb{R} \to [0, 1]$. We use the logistic function, $f(z) = 1/(1 + \exp(-z))$.
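Equations 1 and 2 can be sketched together as a single stochastic layer. The following is an illustrative sketch, not the paper's implementation; in particular, the weight matrices `W` and `Pi` are drawn at random here, whereas the paper ties the standout parameters to the network parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standout_forward(a_in, W, Pi, g=np.tanh):
    """One standout layer. a_in includes a leading 1 for the bias (a_0 = 1);
    W holds the weights w_{j,i}, Pi the dropout-network weights pi_{j,i}."""
    pre = W @ a_in                    # sum_i w_{j,i} a_i
    keep_prob = sigmoid(Pi @ a_in)    # P(m_j = 1 | inputs), equation (2)
    m = (rng.random(keep_prob.shape) < keep_prob).astype(float)
    return m * g(pre)                 # a_j = m_j g(...), equation (1)

a_in = np.array([1.0, 0.5, -0.2])     # leading 1 is the bias input
W = rng.standard_normal((4, 3))
Pi = rng.standard_normal((4, 3))
a_out = standout_forward(a_in, W, Pi)
```

Each output unit is either zero (dropped) or its usual activation, with a drop probability that now depends on the layer's input.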
The standout network is an adaptive dropout network that can be viewed as a binary belief network that overlays the neural network and stochastically adapts its architecture, depending on the input. Unlike a traditional belief network, the distribution over the output variable is not obtained by marginalizing over the hidden mask variables. Instead, the distribution over the hidden mask variables should be viewed as specifying a Bayesian posterior distribution over models. Traditional Bayesian inference generates a posterior distribution that does not depend on the input at test time, whereas the posterior distribution described here does depend on the test input. At first, this may seem inappropriate. However, if we could exactly compute the Bayesian posterior distribution over neural networks (parameters and architectures), we would find strong correlations between components, such as the connectivity and weight magnitudes in one layer and the connectivity and weight magnitudes in the next layer. The standout network described above can be viewed as approximately taking into account these dependencies through the use of a parametric family of distributions.

The standout method described here can be simplified to obtain other dropout techniques. The original dropout method is obtained by clamping $\pi_{j,i} = 0$ for $0 \le i < j$. Another interesting setting is obtained by clamping $\pi_{j,i} = 0$ for $1 \le i < j$, but learning the input-independent dropout parameter $\pi_{j,0}$ for each unit $a_j$. As in standard dropout, to process an input at test time, the stochastic feedforward process is replaced by taking the expectation of equation 1:

$$E[a_j] = f\Big(\sum_{i : i < j} \pi_{j,i} a_i\Big) \, g\Big(\sum_{i : i < j} w_{j,i} a_i\Big). \qquad (3)$$

We found that this method provides very similar results to randomly simulating the stochastic process and computing the expected output of the neural network.
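The deterministic test-time pass of equation 3 replaces the sampled binary mask with its expectation. A minimal sketch, with the same illustrative shapes and randomly drawn weights as before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def standout_expected(a_in, W, Pi, g=np.tanh):
    """Test-time pass (equation 3): scale each activation by its
    keep probability instead of sampling a binary mask."""
    return sigmoid(Pi @ a_in) * g(W @ a_in)

rng = np.random.default_rng(1)
a_in = np.array([1.0, 0.5, -0.2])
W = rng.standard_normal((4, 3))
Pi = rng.standard_normal((4, 3))
e = standout_expected(a_in, W, Pi)  # E[a_j] for each hidden unit j
```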
3 Learning

For a specific configuration $m$ of the mask variables, let $L(m, w)$ denote the likelihood of a training set or a minibatch, where $w$ is the set of neural network parameters. It may include a prior as well. The dependence of $L$ on the input and output has been suppressed for notational simplicity. Given the current dropout parameters, $\pi$, the standout network acts like a binary belief network that generates a distribution over the mask variables for the training set or minibatch, denoted $P(m \mid \pi, w)$. Again, we have suppressed the dependence on the input to the neural network. As described above, this distribution should not be viewed as the distribution over hidden variables in a latent variable model, but as an approximation to a Bayesian posterior distribution over model architectures.

The goal is to adjust $\pi$ and $w$ to make $P(m \mid \pi, w)$ close to the true posterior over architectures as given by $L(m, w)$, while also adjusting $L(m, w)$ so as to maximize the data likelihood w.r.t. $w$. Since both the approximate posterior $P(m \mid \pi, w)$ and the likelihood $L(m, w)$ depend on the neural network parameters, we use a crude approximation that we found works well in practice. If the approximate posterior were as close as possible to the true posterior, then the derivative of the free energy $F(P, L)$ w.r.t. $P$ would be zero and we can ignore terms of the form $\partial P / \partial w$. So, we adjust the neural network parameters using the approximate derivative

$$-\sum_m P(m \mid \pi, w) \, \frac{\partial}{\partial w} \log L(m, w), \qquad (4)$$

which can be computed by sampling from $P(m \mid \pi, w)$.

For a given setting of the neural network parameters, the standout network can in principle be adjusted to be closer to the Bayesian posterior by following the derivative of the free energy $F(P, L)$ w.r.t. $\pi$. This is difficult in practice, so we use an approximation where we assume the approximate posterior is correct and sample a configuration of $m$ from it.
Then, for each hidden unit, we consider mj = 0 and mj = 1 and determine the partial contribution to the free energy. The standout network parameters are adjusted for that hidden unit so as to decrease the partial contribution to the free energy.
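The sampled-mask gradient step of equation 4 can be sketched on a toy one-hidden-layer classifier: sample a mask from the standout network, then take a gradient step on the negative log-likelihood for that sampled architecture. Everything below (shapes, the Bernoulli output, the learning rate) is an illustrative assumption, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W, v, Pi, lr=0.1):
    """One approximate update of the network weights (equation 4):
    sample m ~ P(m | pi, w), then descend the gradient of
    -log L(m, w) for that mask by back-propagation."""
    keep = sigmoid(Pi @ x)                          # approximate posterior
    m = (rng.random(keep.shape) < keep).astype(float)
    h = m * np.tanh(W @ x)                          # masked hidden layer, eq. (1)
    p = sigmoid(v @ h)                              # predicted P(y = 1)
    dlogit = p - y                                  # d(-log L)/d(output logit)
    v_new = v - lr * dlogit * h
    dh = dlogit * v * m * (1 - np.tanh(W @ x) ** 2)
    W_new = W - lr * np.outer(dh, x)
    return W_new, v_new

x = np.array([1.0, 0.4, -0.7]); y = 1.0
W = rng.standard_normal((5, 3)); v = rng.standard_normal(5)
Pi = rng.standard_normal((5, 3))
W2, v2 = train_step(x, y, W, v, Pi)
```

Averaging such steps over many sampled masks approximates the expectation in equation 4; the standout parameters $\pi$ are then updated by the per-unit free-energy comparison described above.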
