arXiv:1206.5533v2 [cs.LG] 16 Sep 2012

Practical Recommendations for Gradient-Based Training of Deep Architectures

Yoshua Bengio

Version 2, Sept. 16th, 2012

Abstract

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on back-propagated gradient and gradient-based optimization. It also discusses how to deal with the fact that more interesting results can be obtained when allowing one to adjust many hyper-parameters. Overall, it describes elements of the practice used to successfully and efficiently train and debug large-scale and often deep multi-layer neural networks. It closes with open questions about the training difficulties observed with deeper architectures.

1 Introduction

Following a decade of lower activity, research in artificial neural networks was revived after a 2006 breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) in the area of Deep Learning, based on greedy layer-wise unsupervised pre-training of each layer of features. See (Bengio, 2009) for a review. Many of the practical recommendations that justified the previous edition of this book are still valid, and new elements were added, while some survived longer by virtue of the practical advantages they provided. The panorama presented in this chapter regards some of these surviving or novel elements of practice, focusing on learning algorithms aiming at training deep neural networks, but leaving most of the material specific to the Boltzmann machine family to another chapter (Hinton, 2013).

Although such recommendations come out of a living practice that emerged from years of experimentation and to some extent mathematical justification, they should be challenged. They constitute a good starting point for the experimenter and user of learning algorithms but very often have not been formally validated, leaving open many questions that can be answered either by theoretical analysis or by solid comparative experimental work (ideally by both). A good indication of the need for such validation is that different researchers and research groups do not always agree on the practice of training neural networks.

Several of the recommendations presented here can be found implemented in the Deep Learning Tutorials¹ and in the related Pylearn2 library², all based on the Theano library (discussed below) written in the Python programming language.

The 2006 Deep Learning breakthrough (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007) centered on the use of unsupervised representation learning to help learning internal representations³ by providing a local training signal at each level of a hierarchy of features⁴.

¹ http://deeplearning.net/tutorial/
² http://deeplearning.net/software/pylearn2
³ A neural network computes a sequence of data transformations, each step encoding the raw input into an intermediate or internal representation, in principle to make the prediction or modeling task of interest easier.
Unsupervised representation learning algorithms can be applied several times to learn different layers of a deep model. Several unsupervised representation learning algorithms have been proposed since then. Those covered in this chapter (such as auto-encoder variants) retain many of the properties of artificial multi-layer neural networks, relying on the back-propagation algorithm to estimate stochastic gradients. Deep Learning algorithms such as those based on the Boltzmann machine and those based on auto-encoder or sparse coding variants often include a supervised fine-tuning stage. This supervised fine-tuning, as well as the pre-training performed with auto-encoder variants, also involves the back-propagation algorithm, just like when training deterministic feedforward or recurrent artificial neural networks. Hence this chapter also includes recommendations for training ordinary supervised deterministic neural networks or, more generally, most algorithms relying on iterative gradient-based optimization of a parametrized learner with respect to an explicit training criterion.

This chapter assumes that the reader already understands the standard algorithms for training supervised multi-layer neural networks, with the loss gradient computed thanks to the back-propagation algorithm (Rumelhart et al., 1986). It starts by explaining basic concepts behind Deep Learning and the greedy layer-wise pretraining strategy (Section 1.1), and recent unsupervised pre-training algorithms (denoising and contractive auto-encoders) that are closely related in the way they are trained to standard multi-layer neural networks (Section 1.2). It then reviews in Section 2 basic concepts in iterative gradient-based optimization and in particular the stochastic gradient method, gradient computation with a flow graph, and automatic differentiation. The main section of this chapter is Section 3, which explains hyper-parameters in general, their optimization, and specifically covers the main hyper-parameters of neural networks. Section 4 briefly describes simple ideas and methods to debug and visualize neural networks, while Section 5 covers parallelism, sparse high-dimensional inputs, symbolic inputs and embeddings, and multi-relational learning. The chapter closes (Section 6) with open questions on the difficulty of training deep architectures and improving the optimization methods for neural networks.

1.1 Deep Learning and Greedy Layer-Wise Pretraining

The notion of reuse, which explains the power of distributed representations (Bengio, 2009), is also at the heart of the theoretical advantages behind Deep Learning. Complexity theory of circuits, e.g. (Håstad, 1986; Håstad and Goldmann, 1991), (which include neural networks as special cases) has much preceded the recent research on deep learning. The depth of a circuit is the length of the longest path from an input node of the circuit to an output node of the circuit. Formally, one can change the depth of a given circuit by changing the definition of what each node can compute, but only by a constant factor (Bengio, 2009). The typical computations we allow in each node include: weighted sum, product, artificial neuron model (such as a monotone non-linearity on top of an affine transformation), computation of a kernel, or logic gates. Theoretical results (Håstad, 1986; Håstad and Goldmann, 1991; Bengio et al., 2006b; Bengio and LeCun, 2007; Bengio and Delalleau, 2011) clearly identify families of functions where a deep representation can be exponentially more efficient than one that is insufficiently deep. If the same set of functions can be represented from within a family of architectures associated with a smaller VC-dimension (e.g. less hidden units⁵), learning theory would suggest that it can be learned with fewer examples, yielding improvements in both computational efficiency and statistical efficiency.

⁴ In standard multi-layer neural networks trained using back-propagated gradients, the only signal that drives parameter updates is provided at the output of the network (and then propagated backwards). Some unsupervised learning algorithms provide a local source of guidance for the parameter update in each layer, based only on the inputs and outputs of that layer.
⁵ Note that in our experiments, deep architectures tend to generalize very well even when they have quite large numbers of parameters.

Another important motivation for feature learning and Deep Learning is that they can be done with unlabeled examples, so long as the factors (unobserved random variables explaining the data) relevant to the questions we will ask later (e.g. classes to be predicted) are somehow salient in the input distribution itself. This is true under the manifold hypothesis, which states that natural classes and other high-level concepts in which humans are interested are associated with low-dimensional regions in input space (manifolds) near which the distribution concentrates, and that different class manifolds are well-separated by regions of very low density. It means that a small semantic change around a particular example can be captured by changing only a few numbers in a high-level abstract representation space. As a consequence, feature learning and Deep Learning are intimately related to principles of unsupervised learning, and they can work in the semi-supervised setting (where only a few examples are labeled), as well as in the transfer learning and multi-task settings (where we aim to generalize to new classes or tasks). The underlying hypothesis is that many of the underlying factors are shared across classes or tasks. Since representation learning aims to extract and isolate these factors, representations can be shared across classes and tasks.

One of the most commonly used approaches for training deep neural networks is based on greedy layer-wise pre-training (Bengio et al., 2007). The idea, first introduced in Hinton et al. (2006), is to train one layer of a deep architecture at a time using unsupervised representation learning. Each level takes as input the representation learned at the previous level and learns a new representation. The learned representation(s) can then be used as input to predict variables of interest, for example to classify objects. After unsupervised pre-training, one can also perform supervised fine-tuning of the whole system⁶, i.e., optimize not just the classifier but also the lower levels of the feature hierarchy with respect to some objective of interest. Combining unsupervised pre-training and supervised fine-tuning usually gives better generalization than pure supervised learning from a purely random initialization. The unsupervised representation learning algorithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine or RBM (Hinton et al., 2006), the auto-encoder (Bengio et al., 2007) and a sparsifying form of auto-encoder similar to sparse coding (Ranzato et al., 2007).

1.2 Denoising and Contractive Auto-Encoders

An auto-encoder has two parts: an encoder function f that maps the input x to a representation h = f(x), and a decoder function g that maps h back in the space of x in order to reconstruct x. In the regular auto-encoder the reconstruction function r(·) = g(f(·)) is trained to minimize the average value of a reconstruction loss on the training examples. Note that reconstruction loss should be high for most other input configurations⁷. The regularization mechanism makes sure that reconstruction cannot be perfect everywhere, while minimizing the reconstruction loss at training examples digs a hole in reconstruction error where the density of training examples is large. Examples of reconstruction loss functions include ||x − r(x)||² (for real-valued inputs) and −Σ_i [x_i log r_i(x) + (1 − x_i) log(1 − r_i(x))] (when interpreting x_i as a bit or a probability of a binary event). Auto-encoders capture the input distribution by learning to better reconstruct more likely input configurations. The difference between the reconstruction vector and the input vector can be shown to be related to the log-density gradient as estimated by the learner (Vincent, 2011; Bengio et al., 2012), and the Jacobian matrix of the reconstruction with respect to the input gives information about the second derivative of the density, i.e., in which direction the density remains high when you are on a high-density manifold (Rifai et al., 2011a; Bengio et al., 2012).

⁶ The whole system composes the computation of the representation with computation of the predictor's output.
⁷ Different regularization mechanisms have been proposed to push reconstruction error up in low density areas: denoising criterion, contractive criterion, and code sparsity. It has been argued that such constraints play a role similar to the partition function for Boltzmann machines (Ranzato et al., 2008a).
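For illustration, a minimal NumPy sketch of such an auto-encoder could look as follows; the layer sizes, the sigmoid non-linearity and the tied decoder weights are arbitrary choices made only for this example, not prescriptions from the text.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # Arbitrary sizes for the example: 784-dimensional input, 100 hidden units.
    rng = np.random.RandomState(0)
    W = rng.uniform(-0.1, 0.1, size=(784, 100))   # encoder weights
    b = np.zeros(100)                             # encoder biases
    W_prime = W.T.copy()                          # decoder weights (tied here by choice)
    c = np.zeros(784)                             # decoder biases

    def f(x):                     # encoder: h = f(x)
        return sigmoid(np.dot(x, W) + b)

    def g(h):                     # decoder: maps h back into the space of x
        return sigmoid(np.dot(h, W_prime) + c)

    def r(x):                     # reconstruction function r(x) = g(f(x))
        return g(f(x))

    def squared_loss(x):          # ||x - r(x)||^2, for real-valued inputs
        return np.sum((x - r(x)) ** 2)

    def cross_entropy_loss(x):    # for x_i interpreted as bits or probabilities
        rx = r(x)
        return -np.sum(x * np.log(rx) + (1 - x) * np.log(1 - rx))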

In the Denoising Auto-Encoder (DAE) and the Contractive Auto-Encoder (CAE), the training procedure also introduces robustness (insensitivity to small variations), respectively in the reconstruction r(x) or in the representation f(x). In the DAE (Vincent et al., 2008, 2010), this is achieved by training with stochastically corrupted inputs, but trying to reconstruct the uncorrupted inputs. In the CAE (Rifai et al., 2011a), this is achieved by adding an explicit regularizing term in the training criterion, proportional to the norm of the Jacobian of the encoder, ||∂f(x)/∂x||². But the CAE and the DAE are very related (Bengio et al., 2012): when the noise is Gaussian and small, the denoising error minimized by the DAE is equivalent to minimizing the norm of the Jacobian of the reconstruction function r(·) = g(f(·)), whereas the CAE minimizes the norm of the Jacobian of the encoder f(·). Besides Gaussian noise, another interesting form of corruption has been very successful with DAEs: it is called the masking corruption and consists in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs, where the zeroed out subset is randomly selected for each example. In addition to the contractive effect, it forces the learned encoder to be able to rely only on an arbitrary subset of the input features.

Another way to prevent the auto-encoder from perfectly reconstructing everywhere is to introduce a sparsity penalty on h, discussed below (Section 3.1).
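A small sketch of the masking corruption (with an arbitrary corruption fraction, and reusing an encoder f and decoder g such as those above) makes the denoising criterion concrete: the corrupted input is encoded and decoded, but the loss compares the reconstruction to the uncorrupted input.

    import numpy as np

    rng = np.random.RandomState(1)

    def masking_corruption(x, fraction=0.25):
        # Zero out a random subset of the components, re-drawn for each example.
        mask = rng.binomial(n=1, p=1.0 - fraction, size=x.shape)
        return x * mask

    def denoising_loss(x, f, g):
        # Encode/decode the corrupted input, compare against the *uncorrupted* x.
        x_tilde = masking_corruption(x)
        rx = g(f(x_tilde))
        return np.sum((x - rx) ** 2)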

1.3 Online Learning and Optimization of Generalization Error

The objective of learning is not to minimize training error or even the training criterion. The latter is a surrogate for generalization error, i.e., performance on new (out-of-sample) examples, and there are no hard guarantees that minimizing the training criterion will yield good generalization error: it depends on the appropriateness of the parametrization and training criterion (with the corresponding prior they imply) for the task at hand.

Many learning tasks of interest will require huge quantities of data (most of which will be unlabeled) and as the number of examples increases, so long as capacity is limited (the number of parameters is small compared to the number of examples), training error and generalization approach each other. In the regime of such large datasets, we can consider that the learner sees an unending stream of examples (e.g., think about a process that harvests text and images from the web and feeds it to a machine learning algorithm). In that context, it is most efficient to simply update the parameters of the model after each example or few examples, as they arrive. This is the ideal online learning scenario, and in a simplified setting, we can even consider each new example z as being sampled i.i.d. from an unknown generating distribution with probability density p(z). More realistically, examples in online learning do not arrive i.i.d. but instead from an unknown stochastic process which exhibits serial correlation and other temporal dependencies. Many learning algorithms rely on gradient-based numerical optimization of a training criterion. Let L(z, θ) be the loss incurred on example z when the parameter vector takes value θ. The gradient vector for the loss associated with a single example is ∂L(z, θ)/∂θ.

If we consider the simplified case of i.i.d. data, there is an interesting observation to be made: the online learner is performing stochastic gradient descent on its generalization error. Indeed, the generalization error C of a learner with parameters θ and loss L is

    C = E[L(z, θ)] = ∫ p(z) L(z, θ) dz

while the stochastic gradient from sample z is

    ĝ = ∂L(z, θ)/∂θ

with z a random variable sampled from p. The gradient of generalization error is

    ∂C/∂θ = ∂/∂θ ∫ p(z) L(z, θ) dz = ∫ p(z) ∂L(z, θ)/∂θ dz = E[ĝ]

showing that the online gradient ĝ is an unbiased estimator of the generalization error gradient ∂C/∂θ. It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error.
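The linearity used in this derivation is easy to check numerically on a toy model; the sketch below (an illustration only, with a made-up linear predictor and squared loss) verifies that averaging the per-example gradients ĝ reproduces the gradient of the average loss over the same examples.

    import numpy as np

    rng = np.random.RandomState(2)
    theta = rng.randn(5)                 # parameters of a toy linear predictor
    X = rng.randn(1000, 5)               # inputs, standing in for samples z = (x, y)
    y = rng.randn(1000)                  # targets

    def per_example_grad(x, y, theta):   # gradient of L(z, theta) = (theta.x - y)^2
        return 2.0 * (np.dot(x, theta) - y) * x

    # Average of the per-example (online) gradients ...
    g_hat_mean = np.mean([per_example_grad(X[i], y[i], theta)
                          for i in range(len(y))], axis=0)
    # ... equals the gradient of the average loss on the same examples.
    g_full = 2.0 * np.dot(X.T, np.dot(X, theta) - y) / len(y)
    assert np.allclose(g_hat_mean, g_full)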

2 Gradients

2.1 Gradient Descent and Learning Rate

The gradient or an estimator of the gradient is used as the core part of the computation of parameter updates for gradient-based numerical optimization algorithms. For example, simple online (or stochastic) gradient descent (Robbins and Monro, 1951; Bottou and LeCun, 2004) updates the parameters after each example is seen, according to

    θ^(t) ← θ^(t−1) − ǫ_t ∂L(z_t, θ)/∂θ

where z_t is an example sampled at iteration t and where ǫ_t is a hyper-parameter that is called the learning rate and whose choice is crucial. If the learning rate is too large⁸, the average loss will increase. The optimal learning rate is usually close to (by a factor of 2) the largest learning rate that does not cause divergence of the training criterion, an observation that can guide heuristics for setting the learning rate (Bengio, 2011), e.g., start with a large learning rate and if the training criterion diverges, try again with 3 times smaller learning rate, etc., until no divergence is observed.

See Bottou (2013) for a deeper treatment of stochastic gradient descent, including suggestions to set the learning rate schedule and improve the asymptotic convergence through averaging.

In practice, we use mini-batch updates based on an average of the gradients⁹ inside each block of B examples:

    θ^(t) ← θ^(t−1) − ǫ_t (1/B) Σ_{t′=Bt+1}^{B(t+1)} ∂L(z_{t′}, θ)/∂θ.    (1)

With B = 1 we are back to ordinary online gradient descent, while with B equal to the training set size, this is standard (also called "batch") gradient descent. With intermediate values of B there is generally a sweet spot. When B increases we can get more multiply-add operations per second by taking advantage of parallelism or efficient matrix-matrix multiplications (instead of separate matrix-vector multiplications), often gaining a factor of 2 in practice in overall training time. On the other hand, as B increases, the number of updates per computation done decreases, which slows down convergence (in terms of error vs number of multiply-add operations performed) because less updates can be done in the same computing time. Combining these two opposing effects yields a typical U-curve with a sweet spot at an intermediate value of B.

Keep in mind that even the true gradient direction (averaging over the whole training set) is only the steepest descent direction locally but may not point in the right direction when considering larger steps. In particular, because the training criterion is not quadratic in the parameters, as one moves in parameter space the optimal descent direction keeps changing. Because the gradient direction is not quite the right direction of descent, there is no point in spending a lot of computation to estimate it precisely for gradient descent. Instead, doing more updates more frequently helps to explore more and faster, especially with large learning rates. In addition, smaller values of B may benefit from more exploration in parameter space and a form of regularization, both due to the "noise" injected in the gradient estimator, which may explain the better test results sometimes observed with smaller B.
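In code, the update of Eq. (1) is only a few lines; the sketch below (an illustration, with a generic per-example gradient function grad_loss(z, theta) assumed to be supplied by the user and a fixed learning rate) makes the roles of B and ǫ explicit.

    import numpy as np

    def minibatch_sgd_epoch(theta, examples, grad_loss, epsilon=0.01, B=32):
        # One epoch of mini-batch SGD, Eq. (1): subtract epsilon times the
        # average of the per-example gradients over each block of B examples.
        for start in range(0, len(examples), B):
            block = examples[start:start + B]
            g = np.mean([grad_loss(z, theta) for z in block], axis=0)
            theta = theta - epsilon * g
        return theta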
When the training set is finite, training proceeds by sweeps through the training set called an epoch, and full training usually requires many epochs (iterations through the training set). Note that stochastic gradient (either one example at a time or with mini-batches) is different from ordinary gradient descent, sometimes called "batch gradient descent", which corresponds to the case where B equals the training set size, i.e., there is one parameter update per epoch.

⁸ above a value which is approximately 2 times the largest eigenvalue of the average loss Hessian matrix
⁹ Compared to a sum, an average makes a small change in B have only a small effect on the optimal learning rate, with an increase in B generally allowing a small increase in the learning rate because of the reduced variance of the gradient.

The great advantage of stochastic gradient descent and other online or minibatch update methods is that their convergence does not depend on the size of the training set, only on the number of updates and the richness of the training distribution. In the limit of a large or infinite training set, a batch method (which updates only after seeing all the examples) is hopeless. In fact, even for ordinary datasets of tens or hundreds of thousands of examples (or more!), stochastic gradient descent converges much faster than ordinary (batch) gradient descent, and beyond some dataset sizes the speed-up is almost linear (i.e., doubling the size almost doubles the gain)¹⁰. It is really important to use the stochastic version in order to get reasonable clock-time convergence speeds.

As for any stochastic gradient descent method (including the mini-batch case), it is important for efficiency of the estimator that each example or mini-batch be sampled approximately independently. Because random access to memory (or even worse, to disk) is expensive, a good approximation, called incremental gradient (Bertsekas, 2010), is to visit the examples (or mini-batches) in a fixed order corresponding to their order in memory or disk (repeating the examples in the same order on a second epoch, if we are not in the pure online case where each example is visited only once). In this context, it is safer if the examples or mini-batches are first put in a random order (to make sure this is the case, it could be useful to first shuffle the examples). Faster convergence has been observed if the order in which the mini-batches are visited is changed for each epoch, which can be reasonably efficient if the training set holds in computer memory.

¹⁰ On the other hand, batch methods can be parallelized easily, which becomes an important advantage with currently available forms of computing power.
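A common way to implement this visiting order, sketched here for illustration, is to draw a random permutation of the example indices before training and, when memory allows, to re-draw it at each epoch.

    import numpy as np

    def epoch_order(n_examples, n_epochs, reshuffle_each_epoch=True, seed=0):
        # Yield, for each epoch, the order in which examples are visited.
        rng = np.random.RandomState(seed)
        order = rng.permutation(n_examples)      # initial random order
        for epoch in range(n_epochs):
            if reshuffle_each_epoch:
                order = rng.permutation(n_examples)
            yield order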
2.2 Gradient Computation and Automatic Differentiation

The gradient can be either computed manually or through automatic differentiation. Either way, it helps to structure this computation as a flow graph, in order to prevent mathematical mistakes and make sure an implementation is computationally efficient. The computation of the loss L(z, θ) as a function of θ is laid out in a graph whose nodes correspond to elementary operations such as addition, multiplication, and non-linear operations such as the neural network activation function (e.g., sigmoid or hyperbolic tangent), possibly at the level of vectors, matrices or tensors. The flow graph is directed and acyclic and has three types of nodes: input nodes, internal nodes, and output nodes. Each of its nodes is associated with a numerical output which is the result of the application of that computation (none in the case of input nodes), taking as input the output of previous nodes in a directed acyclic graph. Example z and parameter vector θ (or their elements) are the input nodes of the graph (i.e., they do not have inputs themselves) and L(z, θ) is a scalar output of the graph. Note that here, in the supervised case, z can include an input part x (e.g. an image) and a target part y (e.g. a target class associated with an object in the image). In the unsupervised case z = x. In a semi-supervised case, there is a mix of labeled and unlabeled examples, and z includes y on the labeled examples but not on the unlabeled ones.

In addition to associating a numerical output o_a to each node a of the flow graph, we can associate a gradient g_a = ∂L(z, θ)/∂o_a. The gradient will be defined and computed recursively in the graph, in the opposite direction of the computation of the nodes' outputs, i.e., whereas o_a is computed using outputs o_p of predecessor nodes p of a, g_a will be computed using the gradients g_s of successor nodes s of a. More precisely, the chain rule dictates

    g_a = Σ_s g_s ∂o_s/∂o_a

where the sum is over immediate successors of a. Only output nodes have no successor, and in particular for the output node that computes L, the gradient is set to 1 since ∂L/∂L = 1, thus initializing the recursion.

Manual or automatic differentiation then only requires to define the partial derivative associated with each type of operation performed by any node of the graph. When implementing gradient descent algorithms with manual differentiation the result tends to be verbose, brittle code that lacks modularity – all bad things in terms of software engineering. A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent. One can pre-define the operations of these objects (in a "forward propagation" or fprop method) and their partial derivatives (in a "backward propagation" or bprop method) and encapsulate these computations in an object that knows how to compute its output given its inputs, and how to compute the gradient with respect to its inputs given the gradient with respect to its output.
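A stripped-down version of this object-oriented strategy, written here only as an illustrative sketch (it is not the actual interface of any of the libraries cited below), could define an fprop and a bprop per operation:

    import numpy as np

    class Op(object):
        # An operation that knows how to compute its output (fprop) and how to map
        # the gradient w.r.t. its output into gradients w.r.t. its inputs (bprop).
        def fprop(self, *inputs):
            raise NotImplementedError
        def bprop(self, inputs, grad_output):
            raise NotImplementedError

    class Dot(Op):                        # y = x . W  (x a vector, W a matrix)
        def fprop(self, x, W):
            return np.dot(x, W)
        def bprop(self, inputs, grad_output):
            x, W = inputs
            return [np.dot(grad_output, W.T),      # gradient w.r.t. x
                    np.outer(x, grad_output)]      # gradient w.r.t. W

    class Tanh(Op):                       # elementwise non-linearity
        def fprop(self, a):
            return np.tanh(a)
        def bprop(self, inputs, grad_output):
            a, = inputs
            return [grad_output * (1.0 - np.tanh(a) ** 2)]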
This is the strategy adopted in the Theano library¹¹ with its Op objects (Bergstra et al., 2010), as well as in libraries such as Torch¹² (Collobert et al., 2011b) and Lush¹³. Compared to Torch and Lush, Theano adds an interesting ingredient which makes it a full-fledged automatic differentiation tool: symbolic computation. The flow graph itself (without the numerical values attached) can be viewed as a symbolic representation (in a data structure) of a numerical computation. In Theano, the gradient computation is first performed symbolically, i.e., each Op object knows how to create other Ops corresponding to the computation of the partial derivatives associated with that Op. Hence the symbolic differentiation of the output of a flow graph with respect to any or all of its input nodes can be performed easily in most cases, yielding another flow graph which specifies how to compute these gradients, given the input of the original graph. Since the gradient graph typically contains the original graph (mapping parameters to loss) as a sub-graph, in order to make computations efficient it is important to automate (as done in Theano) a number of simplifications which are graph transformations preserving the semantics of the output (given the input) but yielding smaller (or more numerically stable or more efficiently computed) graphs (e.g., removing redundant computations). To take advantage of the fact that computing the loss gradient includes as a first step computing the loss itself, it is advantageous to structure the code so that both the loss and its gradient are computed at once, with a single graph having multiple outputs.

The advantages of performing gradient computations symbolically are numerous. First of all, one can readily compute gradients over gradients, i.e., second derivatives, which are useful for some learning algorithms. Second, one can define algorithms or training criteria involving gradients themselves, as required for example in the Contractive Auto-Encoder (which uses the norm of a Jacobian matrix in its training criterion, i.e., really requires second derivatives, which here are cheap to compute). Third, it makes it easy to implement other useful graph transformations such as graph simplifications or numerical optimizations and transformations that help making the numerical results more robust and more efficient (such as working in the domain of logarithms of probabilities rather than in the domain of probabilities directly). Other potential beneficial applications of such symbolic manipulations include parallelization and additional differential operators (such as the R-operator, recently implemented in Theano, which is very useful to compute the product of a Jacobian matrix ∂f(x)/∂x or Hessian matrix ∂²L(x, θ)/∂θ² with a vector without ever having to actually compute and store the matrix itself (Pearlmutter, 1994)).

¹¹ http://deeplearning.net/software/theano/
¹² http://www.torch.ch
¹³ http://lush.sourceforge.net

3 Hyper-Parameters

A pure learning algorithm can be seen as a function taking training data as input and producing as output a function (e.g. a predictor) or model (i.e. a bunch of functions). However, in practice, many learning algorithms involve hyper-parameters, i.e., annoying knobs to be adjusted. In many algorithms such as Deep Learning algorithms the number of hyper-parameters (ten or more!) can make the idea of having to adjust all of them unappealing.

In addition, it has been shown that the use of computer clusters for hyper-parameter selection can have an important effect on results (Pinto et al., 2009). Choosing hyper-parameter values is formally equivalent to the question of model selection, i.e., given a family or set of learning algorithms, how to pick the most appropriate one inside the set? We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself. It is basically an outside control knob. It can be discrete (as in model selection) or continuous (such as the learning rate discussed above). Of course, one can hide these hyper-parameters by wrapping another learning algorithm, say B, around A, to select A's hyper-parameters (e.g. to minimize validation set error). We can then call B a hyper-learner, and if B has no hyper-parameters itself then the composition of B over A could be a "pure" learning algorithm, with no hyper-parameter. In the end, to apply a learner to training data, one has to have a pure learning algorithm. The hyper-parameters can be fixed by hand or tuned by an algorithm, but their value has to be selected. The value of some hyper-parameters can be selected based on the performance of A on its training data, but most cannot. For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error. Note that some learning algorithms (in particular unsupervised learning algorithms such as algorithms for training RBMs by approximate maximum likelihood) are problematic in this respect because we cannot directly measure the quantity that is to be optimized (e.g. the likelihood) because it is intractable. On the other hand, the expected denoising reconstruction error is easy to estimate (by just averaging the denoising error over a validation set).

Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation¹⁴, in the case of small datasets) to estimate generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).

¹⁴ Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split's training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.
3.1 Neural Network Hyper-Parameters

Different learning algorithms involve different sets of hyper-parameters, and it is useful to get a sense of the kinds of choices that practitioners have to make in choosing their values. We focus here mostly on those relevant to neural networks and Deep Learning algorithms.

3.1.1 Hyper-Parameters of the Approximate Optimization

First of all, several learning algorithms can be viewed as the combination of two elements: a training criterion and a model (e.g., a family of functions, a parametrization) on the one hand, and on the other hand, a particular procedure for approximately optimizing this criterion. Correspondingly, one should distinguish hyper-parameters associated with the optimizer from hyper-parameters associated with the model itself, i.e., typically the function class, regularizer and loss function. We have already mentioned above some of the hyper-parameters typically associated with gradient-based optimization. Here is a more extensive descriptive list, focusing on those used in stochastic (mini-batch) gradient descent (although the number of training iterations is used for all iterative optimization algorithms).

• The initial learning rate (ǫ_0 below, Eq. (2)). This is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2). Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than 10⁻⁶, but these should not be taken as strict ranges and greatly depend on the parametrization of the model. A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value. If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.

• The choice of strategy for decreasing or adapting the learning rate schedule (with hyper-parameters such as the time constant τ in Eq. (2) below). The default value of τ → ∞ means that the learning rate is constant over training iterations. In many cases the benefit of choosing other than this default value is small. An example of O(1/t) learning rate schedule, used in Bergstra and Bengio (2012) is

    ǫ_t = ǫ_0 τ / max(t, τ)    (2)

which keeps the learning rate constant for the first τ steps and then decreases it in O(1/t^α), with traditional recommendations (based on asymptotic analysis of the convex case) suggesting α = 1. See Bach and Moulines (2011) for a recent analysis of the rate of convergence for the general case of α ≤ 1, suggesting that smaller values of α should be used in the non-convex case, especially when using a gradient averaging or momentum technique (see below). An adaptive and heuristic way of automatically setting τ above is to keep ǫ_t constant until the training criterion stops decreasing significantly (by more than some relative improvement threshold) from epoch to epoch. That threshold is a less sensitive hyper-parameter than τ itself. An alternative to a fixed schedule with a couple of (global) free hyper-parameters like in the above formula is the use of an adaptive learning rate heuristic, e.g., the simple procedure proposed in Bottou (2013): at regular intervals during training, using a fixed small subset of the training set (what matters is only the number of examples used, not what fraction of the whole training set it represents), continue training with N different choices of learning rate (all in parallel), and keep the value that gave the best results until the next re-estimation of the optimal learning rate. Other examples of adaptive learning rate strategies are discussed below (Sec. 6.2).
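The fixed schedule of Eq. (2) is a one-line function; the sketch below is only an illustration, with arbitrary default values for ǫ_0 and τ.

    def learning_rate(t, epsilon_0=0.01, tau=10000):
        # Eq. (2): constant for the first tau updates, then O(1/t) decay.
        return epsilon_0 * tau / max(t, tau)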
• The mini-batch size (B in Eq. (1)) is typically chosen between 1 and a few hundreds, e.g. B = 32 is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products. The impact of B is mostly computational, i.e., larger B yield faster computation (with appropriate implementations) but requires visiting more examples in order to reach the same error, since there are less updates per epoch. In theory, this hyper-parameter should impact training time and not so much test performance, so it can be optimized separately of the other hyper-parameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected. B and ǫ_0 may slightly interact with other hyper-parameters so both should be re-optimized at the end. Once B is selected, it can generally be fixed while the other hyper-parameters can be further optimized (except for a momentum hyper-parameter, if one is used).

• Number of training iterations T (measured in mini-batch updates). This hyper-parameter is particular in that it can be optimized almost for free using the principle of early stopping: by keeping track of the out-of-sample error (as for example estimated on a validation set) as training progresses (every N updates), one can decide how long to train for any given setting of all the other hyper-parameters. Early stopping is an inexpensive way to avoid strong overfitting, i.e., even if the other hyper-parameters would yield to overfitting, early stopping will considerably reduce the overfitting damage that would otherwise ensue. It also means that it hides the overfitting effect of other hyper-parameters, possibly obscuring the analysis that one may want to do when trying to figure out the effect of individual hyper-parameters, i.e., it tends to even out the performance obtained by many otherwise overfitting configurations of hyper-parameters by compensating a too large capacity with a smaller training time. For this reason, it might be useful to turn early-stopping off when analyzing the effect of individual hyper-parameters.

  Now let us turn to implementation details. Practically, one needs to continue training beyond the selected number of training iterations T̂ (which should be the point of lowest validation error in the training run) in order to ascertain that validation error is unlikely to go lower than at the selected point. A heuristic introduced in the Deep Learning Tutorials¹⁵ is based on the idea of patience (set initially to 10000 examples in the MLP tutorial), which is a minimum number of training examples to see after the candidate selected point T̂ before deciding to stop training (i.e. before accepting this candidate as the final answer). As training proceeds and new candidate selected points T̂ (new minima of the validation error) are observed, the patience parameter is increased, either multiplicatively or additively on top of the last T̂ found. Hence, if we find a new minimum¹⁶ at t, we save the current best model, update T̂ ← t and we increase our patience up to t + constant or t × constant. Note that validation error should not be estimated after each training update (that would be really wasteful) but after every N examples, where N is at least as large as the validation set (ideally several times larger so that the early stopping overhead remains small)¹⁷.
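The patience heuristic can be sketched as follows; this is an illustration in the spirit of the procedure just described, not the tutorial code itself, and train_some, validation_error and save_model are placeholders for the user's own training, evaluation and checkpointing routines (the multiplicative variant of the patience increase is shown).

    def fit_with_patience(train_some, validation_error, save_model,
                          initial_patience=10000, patience_factor=2.0, N=5000):
        # Keep training as long as new validation minima keep being found
        # within the current patience (counted in training examples seen).
        patience = initial_patience
        best_error = float('inf')
        t = 0                                   # number of examples seen so far
        while t < patience:
            train_some(N)                       # train on the next N examples
            t += N
            err = validation_error()            # estimated only every N examples
            if err < best_error:                # new candidate point T_hat
                best_error = err
                save_model()                    # remember the current best model
                patience = max(patience, patience_factor * t)
        return best_error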

• Momentum β. It has long been advocated (Hinton, 1978, 2010) to temporally smooth out the stochastic gradient samples obtained during the stochastic gradient descent. For example, a moving average of the past gradients can be computed with ḡ ← (1 − β)ḡ + βg, where g is the instantaneous gradient ∂L(z_t, θ)/∂θ or a mini-batch average, and β is a small positive coefficient that controls how fast the old examples get downweighted in the moving average. The simplest momentum trick is to make the updates proportional to this smoothed gradient estimator ḡ instead of the instantaneous gradient g. The idea is that it removes some of the noise and oscillations that gradient descent has, in particular in the directions of high curvature of the loss function¹⁸. A default value of β = 1 (no momentum) works well in many cases but in some cases momentum seems to make a positive difference. Polyak averaging (Polyak and Juditsky, 1992) is a related form of parameter averaging¹⁹ that has theoretical advantages and has been advocated and shown to bring improvements on some unsupervised learning procedures such as RBMs (Swersky et al., 2010). More recently, several mathematically motivated algorithms (Nesterov, 2009; Le Roux et al., 2012) have been proposed that incorporate some form of momentum and that also ensure much faster convergence (linear rather than sublinear) compared to stochastic gradient descent, at least for convex optimization problems. See also Bottou (2013) for an example of averaged SGD with successful empirical speedups in the convex case. Note however that in the pure online case (stream of examples) and under some assumptions, the sublinear rate of convergence of stochastic gradient descent with O(1/t) decrease of learning rate is an optimal rate, at least for convex problems (Nemirovski and Yudin, 1983). That would suggest that for really large training sets it may not be possible to obtain better rates than ordinary stochastic gradient descent, albeit the constants in front (which depend on the condition number of the Hessian) may still be greatly reduced by using second-order information online (Bottou and LeCun, 2004; Bottou and Bousquet, 2008).
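The moving-average form of momentum described above amounts to two lines inside the update loop; the sketch below is an illustration only (the value of β is arbitrary), where gradients is any iterable of per-example or mini-batch gradients.

    import numpy as np

    def sgd_with_momentum(theta, gradients, epsilon=0.01, beta=0.05):
        # Update with the smoothed gradient g_bar <- (1 - beta)*g_bar + beta*g
        # instead of the instantaneous gradient g.
        g_bar = np.zeros_like(theta)
        for g in gradients:
            g_bar = (1.0 - beta) * g_bar + beta * g
            theta = theta - epsilon * g_bar
        return theta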

¹⁵ http://deeplearning.net/tutorial/
¹⁶ Ideally, we should use a statistical test of significance and accept a new minimum (over a longer training period) only if the improvement is statistically significant, based on the size and variance estimates one can compute for the validation set.
¹⁷ When an extra processor on the same machine is available, validation error can conveniently be recomputed by a processor different from the one performing the training updates, allowing more frequent computation of validation error.
¹⁸ Think about a ball coming down a valley. Since it has not started from the bottom of the valley it will oscillate between its sides as it settles deeper, forcing the learning rate to be small to avoid large oscillations that would kick it out of the valley. Averaging out the local gradients along the way will cancel the opposing forces from each side of the valley.
¹⁹ Polyak averaging uses for predictions a moving average of the parameters found in the trajectory of stochastic gradient descent.

• Layer-specific optimization hyper-parameters: although rarely done, it is possible to use different values of optimization hyper-parameters (such as the learning rate) on different layers of a multi-layer network. This is especially appropriate (and easier to do) in the context of layer-wise unsupervised pre-training, since each layer is trained separately (while the layers below are kept fixed). This would be particularly useful when the number of units per layer varies a lot from layer to layer. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4). Some researchers also advocate the use of different learning rates for the different types of parameters one finds in the model, such as biases and weights in the standard multi-layer network, but the issue becomes more important when parameters such as precision or variance are included in the lot (Courville et al., 2011).

Up to now we have only discussed the hyper-parameters in the setup where one trains a neural network by stochastic gradient descent. With other optimization algorithms, some hyper-parameters are typically different. For example, Conjugate Gradient (CG) algorithms typically have a number of line search steps (which is a hyper-parameter) and a tolerance for stopping each line search (another hyper-parameter).
An optimization algorithm like L-BFGS (limited-memory Broyden–Fletcher–Goldfarb–Shanno) also has a hyper-parameter controlling the memory usage of the algorithm, the rank of the Hessian approximation kept in memory, which also has an influence on the efficiency of each step. Both CG and L-BFGS are iterative (e.g., one line search per iteration), and the number of iterations can be optimized as described above for stochastic gradient descent, with early stopping.

3.2 Hyper-Parameters of the Model and Training Criterion

Let us now turn to "model" and "criterion" hyper-parameters typically found in neural networks, especially deep neural networks.

• Number of hidden units n_h. Each layer in a multi-layer neural network typically has a size that we are free to set and that controls capacity. Because of early stopping and possibly other regularizers (e.g., weight decay, discussed below), it is mostly important to choose n_h large enough. Larger than optimal values typically do not hurt generalization performance much, but of course they require proportionally more computation (in O(n_h²) if scaling all the layers at the same time in a fully connected architecture). Like for many other hyper-parameters, there is the option of allowing a different value of n_h for each hidden layer²⁰ of a deep architecture. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 3.3.4). In a large comparative study (Larochelle et al., 2009), we found that using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent. For most tasks that we worked on, we find that an overcomplete²¹ first hidden layer works better than an undercomplete one. Another even more often validated empirical observation is that the optimal n_h is much larger when using unsupervised pre-training in a supervised neural network, e.g., going from hundreds of units to thousands of units. A plausible explanation is that after unsupervised pre-training many of the hidden units are carrying information that is irrelevant to the specific supervised task of interest. In order to make sure that the information relevant to the task is captured, larger hidden layers are therefore necessary when using unsupervised pre-training.

²⁰ A hidden layer is a group of units that is neither an input layer nor an output layer.
²¹ larger than the input vector

• Weight decay regularization coefficient λ. A way to reduce overfitting is to add a regularization term to the training criterion, which limits the capacity of the learner. The parameters of machine learning models can be regularized by pushing them towards a prior value, which is typically 0. L2 regularization adds a term λ Σ_i θ_i² to the training criterion, while L1 regularization adds a term λ Σ_i |θ_i|. Both types of terms can be included. There is a clean Bayesian justification for such a regularization term: it is the negative log-prior − log P(θ) on the parameters θ. The training criterion then corresponds to the negative joint likelihood of data and parameters, − log P(data, θ) = − log P(data|θ) − log P(θ), with the loss function L(z, θ) being interpreted as − log P(z|θ) and − log P(data|θ) = Σ_{t=1}^T L(z_t, θ) if the data consists of T i.i.d. examples z_t. This detail is important to note because when one is doing stochastic gradient-based learning, it makes sense to use an unbiased estimator of the gradient of the total training criterion (including both the total loss and the regularizer), but one only considers a single mini-batch or example at a time. How should the regularizer be weighted in this sum, which is different from the sum of the regularizer and the total loss on all examples? On each mini-batch update, the gradient of the regularization penalty should be multiplied not just by λ but also by B/T, i.e., one over the number of updates needed to go once through the training set. When the training set size is not a multiple of B, the last mini-batch will have size B′

  ... between early stopping (see above, choosing the number of training iterations) and L2 regularization (Collobert and Bengio, 2004a), with one basically playing the same role as the other (but early stopping allowing a much more efficient selection of the hyper-parameter value, which suggests dropping L2 regularization altogether when early-stopping is used). However, L1 regularization behaves differently and can sometimes be useful, acting as a form of feature selection. L1 regularization makes sure that parameters that are not really very useful are driven to zero (i.e. encouraging sparsity of the parameter values), and corresponds to a Laplace density prior ∝ e^(−|θ|/s) with scale parameter s = 1/λ. L1 regularization often helps to make the input filters²² cleaner (more spatially localized) and easier to interpret. Stochastic gradient descent will not yield actual zeros but values hovering around zero. If both L1 and L2 regularization are used, a different coefficient (i.e. a different hyper-parameter) should be considered for each, and one may also use a different coefficient for different layers. In particular, the input weights and the output weights may be treated differently. One reason for treating output weights differently (i.e., not relying only on early stopping) is that we know that it is sufficient to regularize only the output weights in order to constrain capacity: in the limit case of the number of hidden units going to infinity, L2 regularization corresponds to Support Vector Machines (SVM) while L1 regularization corresponds to