
Learning Hyperparameters for Neural Network Models

Using Hamiltonian Dynamics

by

Kiam Choo

A thesis submitted in conformity with the requirements

for the degree of Master of Science

Graduate Department of Computer Science

University of Toronto

© Copyright by Kiam Choo

Abstract

Learning Hyperparameters for Neural Network Models Using Hamiltonian Dynamics

Kiam Choo

Master of Science

Graduate Department of Computer Science

University of Toronto

We consider a feedforward neural network model with hyperparameters controlling groups of weights. Given some training data, the posterior distribution of the weights and the hyperparameters can be obtained by alternately updating the weights with hybrid Monte Carlo and sampling the hyperparameters using Gibbs sampling. However, this method becomes slow for networks with large hidden layers. We address this problem by incorporating the hyperparameters into the hybrid Monte Carlo update. However, the region of state space under the posterior with large hyperparameters is huge and has low probability density, while the region with small hyperparameters is very small and has very high density. As hybrid Monte Carlo inherently does not move well between such regions, we reparameterize the weights to make the two regions more compatible, only to be hampered by the resulting inability to compute good stepsizes. No definite improvement results from our efforts, but we diagnose the reasons for that and suggest future directions of research.

Dedication

I dedicate this thesis to my family, who have accepted my wanderings over the years. I especially dedicate this to my mother, whose strength and will I carry on.

Acknowledgements

I thank Prof. Radford Neal for his invaluable guidance on this thesis. I also thank Faisal Qureshi for his helpful suggestions and friendship during this thesis. Thanks also to the other inhabitants of the Artificial Intelligence Laboratory who have helped me in some way to complete this thesis.

Contents

Introduction
Overview
The Neural Network Learning Problem
Bayesian Approach to Neural Net Learning
A Simple Example
Making Predictions
Determining the Hyperparameters From the Data
Motivation

The Hybrid Monte Carlo Method
Background on Markov Chain Monte Carlo Sampling
The Metropolis Algorithm with Simple Proposals
The Hybrid Monte Carlo Method
Leapfrog Proposals
Stepsize Selection
Convergence of Hybrid Monte Carlo

Hyperparameter Updates Using Gibbs Sampling
Neural Network Architecture
Neural Network Model of the Data
Posterior Distributions of the Parameters and the Hyperparameters
Sampling From the Posterior Distributions of the Parameters and the Hyperparameters
Convergence of the Algorithm
Inefficiency Due to Gibbs Sampling of Hyperparameters

Hyperparameter Updates Using Hamiltonian Dynamics
The New Scheme
The Idea
The New Scheme in Detail
Reparameterization of the Hyperparameters
Reparameterization of the Weights
First Derivatives of the Potential Energy
First Derivatives with Respect to the Parameters
First Derivatives with Respect to the Hyperparameters
Approximations to the Second Derivatives of the Potential Energy
Second Derivatives with Respect to the Parameters
Second Derivatives with Respect to the Hyperparameters
Summary of Compute Times
Computation of Stepsizes
Compute Times for the Dynamical A Method

Results
Training Data
Verification of the New Methods
Results From Old Method
Results of New Methods Compared with the Old
Methodology for Evaluating Performance
The Variance of Means Measurement of Performance
Error Estimation for Variance of Means
Geometric Mean of Variance of Means
Iterations Allowed for Each Method
Markov Chain Start States
Master Runs
Starting States Used
Modified Performance Measures Due to Stratification
Number of Leapfrog Steps Allowed
Results of Performance Evaluation
Pairwise Bootstrap Comparison

Discussion
Has the Reparameterization of the Network Weights Been Useful?
Making the Dynamical B Method Go Faster
Explanation for the Rising Rejection Rates
The Appropriateness of Stepsize Heuristics
Different Settings for h and p
Fine Splitting of Hyperparameter Updates
Why the Stepsize Heuristics are Bad
Other Implications of the Current Heuristics

Conclusion

A Preservation of Phase Space Volume Under Hamiltonian Dynamics

B Proof of Theorem: Deterministic Proposals for the Metropolis Algorithm

C Preservation of Phase Space Volume Under Leapfrog Updates

Bibliography

Chapter 1

Introduction

Overview

A feedforward neural network is a nonlinear model that maps an input to an output. It can be viewed as a nonparametric model in the sense that its parameters cannot easily be interpreted to provide insight into the problem that it is being used for. Nevertheless, feedforward neural networks are powerful: with sufficient hidden units, they can learn to approximate any nonlinear mapping arbitrarily closely (Cybenko, 1989). Partly because of this flexibility, they have become widespread tools used by many practitioners in the sciences and engineering. These practitioners typically use well-established learning techniques like backpropagation (Rumelhart et al., 1986) or its variants. But despite the multitude of learning methods already in existence, learning for feedforward networks remains an area of active research.

A recent approach to feedforward neural net learning is Bayesian learning (Buntine and Weigend, 1991; MacKay, 1992; Neal, 1996; Muller and Insua, 1998). This new approach can be viewed as a response to the problem of incorporating prior knowledge into neural networks. However, the computational problems in Bayesian learning are complex, and none of the existing techniques are perfect. In the interests of computational


tractability, both the works of MacKay (1992) and Buntine and Weigend (1991) assume Gaussian approximations to the posterior distribution over network weights.

A more general and flexible approach is to sample from the posterior distribution of the weights, as has been done by Neal (1996) and Muller and Insua (1998). Neal obtains samples by alternating hybrid Monte Carlo updates of the weights with Gibbs sampling updates of the hyperparameters. Muller and Insua also alternately update the weights and Gibbs-sample the hyperparameters, but in addition they observe that, given all weights except for the hidden-to-output ones, the posterior distribution of the latter is simply Gaussian when the data noise is Gaussian. While the other weights still need to be updated by a more complicated Metropolis step, this does allow them to sample directly from the Gaussian distribution of the hidden-to-output weights. However, as will be described later, both methods are expected to become slow for large networks, possibly to the point where they become unusable.

This thesis addresses the above inefficiency for large networks. Specifically, it is concerned with improving on the hybrid Monte Carlo technique used by Neal, so that both parameters and hyperparameters are updated using hybrid Monte Carlo.

The Neural Network Learning Problem

The rest of this thesis is about feedforward neural networks only, so we drop the "feedforward" for simplicity.

In this section we define the neural network learning problem that underlies this thesis.

Given a set of inputs X = \{x^c\} and targets Y = \{y^c\}, for c = 1, \ldots, N_c, a neural network can be used to model the relationship between them, so that

    y^c \approx f(x^c, W)

where f(\cdot, W) is the function computed by the neural network with weights W. This


modeling is achieved by training the weights W using the training data, consisting of the inputs X and the targets Y. Once training is complete, the neural net can be used to predict targets given previously unseen values of inputs.

Conventionally, the learning process is viewed as an optimization problem, where the weights are learned using some kind of gradient descent method on an error function such as the following:

    E(W) = \frac{1}{2} \sum_{c=1}^{N_c} \left| f(x^c, W) - y^c \right|^2

The result of this procedure is a single optimal set of weights, W_{\mathrm{opt}}, that minimizes the error. This single set of weights is then used for future predictions from a new input. The conventionally-trained network prediction is thus

    f_C(x) = f(x, W_{\mathrm{opt}})
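The optimization view above can be sketched in a few lines of code. The following toy example minimizes the squared-error function E(W) by plain gradient descent; the one-weight "network" f(x, w) = wx, the data, and the learning rate are all hypothetical choices for illustration, not part of the thesis.

```python
# Gradient descent on E(W) = (1/2) * sum_c (f(x_c, W) - y_c)^2,
# for the simplest possible "network": f(x, w) = w * x (one weight).

def f(x, w):
    return w * x

def error(w, xs, ys):
    # The squared-error function E(W) from the text.
    return 0.5 * sum((f(x, w) - y) ** 2 for x, y in zip(xs, ys))

def train(xs, ys, w0=0.0, rate=0.05, steps=200):
    w = w0
    for _ in range(steps):
        # dE/dw = sum_c (f(x_c, w) - y_c) * x_c
        grad = sum((f(x, w) - y) * x for x, y in zip(xs, ys))
        w -= rate * grad
    return w

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 2.1, 3.9, 6.0]   # roughly y = 2x, made-up data
w_opt = train(xs, ys)
```

The result, w_opt, plays the role of W_opt above: a single point estimate that all future predictions are made from.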

Bayesian Approach to Neural Net Learning

The Bayesian approach to neural network learning differs fundamentally from the conventional optimization approach in that, rather than obtaining a single best set of weights from the training process, a probability distribution over the weights is obtained instead.

Bayesian Inference

Generally speaking, Bayesian inference is a way by which unknown properties of a system may be inferred from observations. In the Bayesian inference framework, we model the observations z as being generated by some model with unobserved parameters \theta. Specifically, we come up with a likelihood function p(z | \theta), the probability of the observable state z given a particular setting of the parameters \theta. Next, we decide on p(\theta), the prior probability distribution over parameters.


With these two functions in hand, we use Bayes' rule to infer the posterior probability that the parameters have the value \theta when we observe the state z:

    p(\theta | z) = \frac{p(\theta) \, p(z | \theta)}{p(z)}

Bayesian learning can be applied to neural networks in the following way. We model the targets as the neural network output f(x, W) plus some noise, which defines the likelihood p(Y | W, X), and we assume some form for the prior distribution of the weights, p(W). The posterior distribution of the weights is then

    p(W | X, Y) = \frac{p(W) \, p(Y | W, X)}{p(Y | X)}

where we have set p(W | X) = p(W), since the prior distribution of the weights does not depend on the inputs. The posterior p(W | X, Y) above is the probability distribution over weights that we infer in the Bayesian framework.

A Simple Example

As a simple example, assuming that the noise in the output of each unit is Gaussian with fixed standard deviation \sigma, we get, for a net with N_y outputs,

    p(Y | W, X) = \left( 2\pi\sigma^2 \right)^{-N_c N_y / 2} \exp\left( -\frac{1}{2\sigma^2} \sum_{c=1}^{N_c} \left| f(x^c, W) - y^c \right|^2 \right)

And assuming a simple prior where all the weights W = \{w_i\}, i = 1, \ldots, N_w, have a Gaussian distribution of fixed inverse variance \gamma:

    p(W) = \left( \frac{\gamma}{2\pi} \right)^{N_w / 2} \exp\left( -\frac{\gamma}{2} \sum_{i=1}^{N_w} w_i^2 \right)

This gives the posterior distribution for W:

    p(W | X, Y) \propto p(W) \, p(Y | W, X)
                \propto \exp\left( -\frac{\gamma}{2} \sum_{i=1}^{N_w} w_i^2 - \frac{1}{2\sigma^2} \sum_{c=1}^{N_c} \left| f(x^c, W) - y^c \right|^2 \right)


Here the symbol \propto denotes proportionality. We have dropped the normalizing constant p(Y | X), as well as factors not dependent on the weights. This is because we are considering the posterior distribution over the weights only, in this simple model with \sigma and \gamma fixed.
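The simple example above can be made concrete by evaluating the log of the unnormalized posterior: the log prior term plus the log likelihood term, with constants dropped. The following sketch does this for a hypothetical toy network; the names `net`, `sigma`, and `gamma`, and the data, are illustrative assumptions, not from the thesis.

```python
# Unnormalized log posterior for the simple example:
#   log p(W | X, Y) = -(gamma/2) * sum_i w_i^2
#                     - (1/(2 sigma^2)) * sum_c (f(x_c, W) - y_c)^2 + const

def log_posterior(weights, xs, ys, net, sigma=0.5, gamma=1.0):
    log_prior = -0.5 * gamma * sum(w * w for w in weights)
    log_lik = -sum((net(x, weights) - y) ** 2
                   for x, y in zip(xs, ys)) / (2.0 * sigma ** 2)
    return log_prior + log_lik

def net(x, weights):
    # A stand-in for f(x, W): a tiny linear "network" with two weights.
    return weights[0] * x + weights[1]

xs, ys = [0.0, 1.0, 2.0], [0.2, 1.1, 1.9]
lp_good = log_posterior([1.0, 0.1], xs, ys, net)   # weights that fit well
lp_bad = log_posterior([5.0, 5.0], xs, ys, net)    # weights that fit badly
```

Well-fitting, moderate weights receive a higher (less negative) log posterior than large, badly fitting ones, which is exactly the trade-off between the prior and likelihood terms above.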

Making Predictions

The prediction of the net trained using Bayesian inference is obtained as the expected output over all possible weight settings, weighted by their posterior probabilities:

    f_B(x) = E_{W | X, Y}\left[ f(x, W) \right] = \int f(x, W) \, p(W | X, Y) \, dW

Compared to the conventional prediction, Bayesian prediction is clearly more complicated.

Determining the Hyp erparameters From the Data

The inverse variance \gamma of the weights is called a hyperparameter, because it is a parameter that controls the prior distribution of the parameters w_i. In practice, it is reasonable to let the hyperparameters be determined from the data. For instance, the input-to-hidden weights for one training set might need to be larger than for another training set because its outputs vary more rapidly. Evidently, it is possible to infer the hyperparameters from the training data.

But we infer the hyperparameters not just because it is possible, but because it is desirable as well. This is because it is difficult for a human operator to guess a good setting of the hyperparameters, but it is easier to guess a prior distribution for the hyperparameters, e.g. in terms of its mean and some measure of its spread. Moreover, allowing the precision \gamma to vary couples the weights in their prior distribution and allows for a richer prior, whereas all the weights would be independent in their prior if \gamma were fixed. Details of the incorporation of the hyperparameters into the sampling procedure will be given in later chapters.


Once we have the posterior distribution of the weights and the hyperparameters, the network has been trained. Predictions from the net now involve the joint posterior distribution of the weights and the hyperparameters:

    f_B(x) = E_{W, \gamma | X, Y}\left[ f(x, W) \right] = \int f(x, W) \, p(W, \gamma | X, Y) \, dW \, d\gamma

The above integral usually cannot be analytically obtained for neural network models.

Note that the variance of the data is often also regarded as a hyperparameter, because the role it plays in controlling the network error is similar to that played by other hyperparameters in controlling their respective weights.

Motivation

The practicality of the Bayesian framework hinges on the existence of computationally efficient ways to evaluate or approximate the predictive integral above. The main problem is that the posterior distribution p(W | X, Y) is often such that the integral cannot be performed analytically.

In the interests of computational feasibility, Buntine and Weigend (1991) and MacKay (1992) approximate the posterior distribution of the weights and hyperparameters as a Gaussian distribution. Unfortunately, it usually cannot be seen in advance from the training data whether a Gaussian distribution is a reasonable approximation to the posterior distribution. For instance, for a small network that has just enough hidden units to model some given data, we would expect that, ignoring multiple modes due to units swapping roles, the posterior distribution is peaked at a single mode, because each unit has a well-constrained role to play in the mapping. In such a case, one might reasonably expect the posterior to be approximately Gaussian. However, when there are more hidden units, units are no longer so constrained, and the posterior distribution will be broader, in ways that do not necessarily retain a Gaussian appearance.


The approach of Neal, and of Muller and Insua, to the problem is to sample from the posterior distribution using Markov chain Monte Carlo (MCMC) techniques. MCMC techniques do not approximate the posterior distribution as a Gaussian, but instead sample faithfully from its true form. With n samples from the posterior distribution, we can obtain the expected output of the net as

    \hat{f}_B(x) \approx \frac{1}{n} \sum_{i=1}^{n} f(x, W_i)
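The estimator above is just an average of network outputs over posterior samples of the weights. As a minimal sketch, the code below averages the same toy one-weight network used earlier over synthetic "posterior samples" drawn around w = 2; the sampling distribution is a made-up stand-in for a real MCMC output.

```python
import random

# Monte Carlo estimate f_B(x) ~ (1/n) * sum_i f(x, W_i):
# average the network output over samples of the weights.

def f(x, w):
    return w * x

def predict(x, weight_samples):
    return sum(f(x, w) for w in weight_samples) / len(weight_samples)

random.seed(0)
# Hypothetical "posterior samples" concentrated near w = 2.
samples = [random.gauss(2.0, 0.1) for _ in range(5000)]
y_hat = predict(1.5, samples)
```

With genuinely independent samples the estimate converges at the usual 1/sqrt(n) Monte Carlo rate; the point of the following discussion is that correlated MCMC samples converge much more slowly.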

In order for this method to be effective, each sample W_i must be as independent of the previous sample as possible. A common problem of MCMC techniques is that samples can be highly correlated, in which case, even though they are drawn from the correct distribution, they sample the distribution very slowly, and a huge number of samples might be needed for reliable estimates. In severe cases, the method becomes infeasible for practical use. The MCMC technique used by Neal faces this problem when the number of hidden units becomes large. His method alternates between using hybrid Monte Carlo to update the network parameters and Gibbs sampling to sample the hyperparameters. Unfortunately, it is this alternation between updating the parameters and the hyperparameters that causes high correlations from one sample to the next as the number of hidden units becomes large.

The root of this inefficiency is that, during each hybrid Monte Carlo process that yields one sample of the weights, the hyperparameters are held fixed. This would not be a problem if, to obtain the next sample of the weights, the hyperparameters could be shifted to an uncorrelated value. However, because the hyperparameters are updated using Gibbs sampling given the current values of the weights, they are pinned and unable to move much. The larger the number of hidden units, the greater the pinning effect is.

Muller and Insua's method suffers from the same inefficiency, as it also alternates Gibbs sampling of the hyperparameters with Markov chain updates of the network parameters.


This problem is the motivation for this thesis. In this thesis, we propose and investigate a modification to the hybrid Monte Carlo technique used by Neal. Specifically, rather than updating the weights by hybrid Monte Carlo and the hyperparameters by Gibbs sampling, we update both the weights and the hyperparameters using hybrid Monte Carlo. The idea is that, because both the weights and the hyperparameters are now changing at the same time, we no longer have the pinning effect, and hybrid Monte Carlo should then be able to produce samples that move much more efficiently through the posterior distribution.

Of course, one might ask why we would want to use large hidden layers in the first place. There are several reasons for this. Firstly, since a neural network is a nonparametric model, it makes sense when modelling some data to use a lot of hidden units in order to maximize the network's power of representation. That is, we want the function computed by the neural network to not be constrained by there being too few units, and to be determined instead by the data. Secondly, small numbers of hidden units often lead to local maxima in the posterior distribution of the weights, because the few available hidden units can get trapped into representing suboptimal features in the data, leaving no spare unused units to seek out the important features. Using a larger hidden layer tends to connect the multiple modes into ridges, and thus improves mobility. Finally, in his book, Neal (1996) has shown that the prior distribution of a neural network becomes tractable for infinite-sized hidden layers. So using many hidden units allows for a more precise specification of a neural network's prior distribution. Incidentally, overfitting is not a problem in the first justification given above, because Neal's results show how to assign appropriate priors for increasing network size.

This thesis is organized as follows. In an effort to make this a self-contained work, Chapter 2 lays down the background material on Markov chains and the hybrid Monte Carlo method that is necessary, and hopefully sufficient, to understand the rest of what follows. Chapter 3 describes the original method of Neal. Chapter 4 presents the new method that is the subject of this thesis. Results of the investigations of the new method are presented in Chapter 5, followed by the discussions and conclusions in the final chapter.

Chapter 2

The Hybrid Monte Carlo Method

In this chapter, we introduce the hybrid Monte Carlo method, which is the main method by which we sample from the posterior distribution of a neural network.

Background on Markov Chain Monte Carlo Sampling

In this section, we present some necessary background on Markov chains, at a level of technical detail sufficient to explain the rest of this work. More detailed presentations may be found elsewhere, such as Feller's book.

Definition (Markov chain): A Markov chain is a series of random variables X_0, X_1, X_2, etc., such that

    P(X_{i+1} | X_i, X_{i-1}, \ldots, X_0) = P(X_{i+1} | X_i)

That is, given X_i, X_{i+1} is independent of all earlier X's.

A Markov chain is defined by the state space S in which the X_i live, the distribution over the initial state P(X_0), and the transition probability function P(X_{i+1} | X_i).
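A minimal sketch of this definition: a two-state chain specified entirely by its transition probabilities. The particular transition matrix below is a made-up example, but it already exhibits the convergence property discussed next: the distribution of X_n settles to the same limit from either starting state.

```python
# A two-state Markov chain: state space {0, 1}, transition probabilities T.
T = {0: {0: 0.9, 1: 0.1},   # P(next state | current state = 0)
     1: {0: 0.3, 1: 0.7}}   # P(next state | current state = 1)

def step_distribution(dist):
    # Apply one transition to a distribution over states.
    return {j: sum(dist[i] * T[i][j] for i in T) for j in (0, 1)}

def run(dist, n):
    for _ in range(n):
        dist = step_distribution(dist)
    return dist

# Start from each of the two deterministic initial states.
from_0 = run({0: 1.0, 1: 0.0}, 200)
from_1 = run({0: 0.0, 1: 1.0}, 200)
```

For this matrix the invariant distribution solves pi(0) = 0.9 pi(0) + 0.3 pi(1), giving pi = (0.75, 0.25), and both runs converge to it.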


For us, the utility of Markov chains lies in the fact that, under the right conditions, they converge to some probability distribution regardless of starting state. In other words, independent of starting state, in the limit of large n, X_n will become a sample from a particular distribution Q(x). This allows us to obtain samples from posterior distributions that arise in probabilistic inference.

Below we present a theorem, obtained from Rosenthal, that tells us the conditions under which a Markov chain converges to a distribution. In this presentation, probability density functions will be used in two ways: with a state as an argument, or with a set as an argument. Thus p(x) will refer to the probability density at x, while p(A) will mean the total probability mass in the set A. First we need the following definitions. Let S be the state space of the Markov chain.

Definition (Multi-transition probability): For x \in S and A \subseteq S, we define the multi-transition probability T^n(x, A) as the probability of ending up in the set A after n transitions according to the Markov transition probabilities, given that we started at state x. T^n(x, A) is really P(X_n \in A | X_0 = x). For one transition, we also write T(x, A) rather than T^1(x, A).

Definition (Invariant distribution): \pi(x) is an invariant distribution of a Markov chain with transitions T(x, A) if, for all sets A \subseteq S,

    \pi(A) = \int \pi(dy) \, T(y, A)

where \pi(dy) is the probability mass in a set of infinitesimal volume at state y. We also say that the Markov chain leaves \pi(x) invariant.

Note that a Markov chain may not have an invariant distribution, and if it has one, it may not be unique.


Definition (Total variation distance): The total variation distance between two probability distributions p and q on S is given by

    \| p - q \| = \sup_{A \subseteq S} \left| p(A) - q(A) \right|

Suppose we start a Markov chain from state x \in S. Then, depending on the history of transitions it takes, it will take varying numbers of steps to enter a set A \subseteq S of nonzero volume, if it does so at all. Let \tau_A be the history-dependent random time that denotes the first time the Markov chain enters A, i.e., \tau_A = \inf\{ n : X_n \in A \}. Note that \tau_A could equal infinity. Then we have the following important definitions about the mixing properties of a Markov chain.

Definition (Irreducibility): A Markov chain is irreducible if, for any set A \subseteq S of nonzero volume, P_x(\tau_A < \infty) > 0 for all starting points x \in S. That is, any starting point x has some probability of going to any nonzero volume within a finite number of steps.

Definition (Aperiodicity): A Markov chain is aperiodic if there does not exist a partition of the state space S = S_1 \cup S_2 \cup \ldots \cup S_m, for some m \geq 2, such that T(x, S_{i+1}) = 1 for all x \in S_i with i = 1 to m-1, and T(x, S_1) = 1 for all x \in S_m.

i  m

If a Markov chain has an invariant distribution, and it is both irreducible and aperiodic, then it converges to that invariant distribution. This theorem, presented below without proof, is a slightly modified version of the one given by Rosenthal, who also proves it.

Theorem (Markov Chain Convergence): Let T(x, A) be the transition probabilities for an irreducible, aperiodic Markov chain having invariant distribution \pi(x) on a state space S. Then for all x \in S such that \pi(x) > 0,

    \lim_{n \to \infty} \| T^n(x, \cdot) - \pi \| = 0

That is, as the number of transitions goes to infinity, the total variation distance between the invariant distribution and the distribution of the Markov chain started from any state x such that \pi(x) > 0 goes to 0.

Using the above theorem, we can construct Markov chains that converge to a desired distribution by ensuring that the chain is aperiodic, irreducible, and has the target distribution as an invariant distribution. However, while the above theorem guarantees convergence in theory, it does not say anything about the speed with which convergence is achieved. This is important, as the initial portion of a Markov chain is typically not representative of the invariant distribution and needs to be discarded in order not to bias the results. Moreover, a badly-constructed Markov chain can converge far too slowly to be useful in practice. Nevertheless, having one that converges to the correct distribution, and knowing that it does, is a good start.

A Markov chain that is constructed to generate samples from some target distribution is known in the literature as a Markov chain Monte Carlo (MCMC) method. An example of an MCMC method that is commonly used for multivariate distributions is Gibbs sampling. Gibbs sampling consists of update steps where each variable is updated in turn. During each update, a variable is replaced by a sample from its target distribution conditional on all the other variables having their current values. Note that the new value of the variable is chosen without reference to the old value it replaces. This leaves the desired distribution invariant, because the resulting multivariate state is an outcome drawn according to the target distribution. Furthermore, because all values of a variable have nonzero probability of being generated, the method is irreducible and aperiodic, so long as all the variables get updated at some point.
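The update scheme just described can be sketched on a standard textbook target: a bivariate Gaussian with correlation rho, for which both conditional distributions are themselves Gaussian and easy to sample. This target and its settings are illustrative assumptions, not the neural-network posterior of this thesis (where, as noted below, such conditionals are unavailable).

```python
import random

# Gibbs sampling for a bivariate Gaussian with unit marginal variances and
# correlation rho: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y | x.
# Each variable is replaced in turn by a draw from its exact conditional.

def gibbs(n, rho=0.8, seed=1):
    random.seed(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    samples = []
    for _ in range(n):
        x = random.gauss(rho * y, sd)   # update x given current y
        y = random.gauss(rho * x, sd)   # update y given current x
        samples.append((x, y))
    return samples

samples = gibbs(20000)
mean_x = sum(s[0] for s in samples) / len(samples)
```

Note that consecutive samples are correlated (here with lag-one correlation roughly rho^2), which already hints at the slow-mixing problems discussed later for strongly coupled variables.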

Although it is conceptually simple, Gibbs sampling requires that one is able to sample from the conditional distribution of each variable. For complicated distributions, like the neural network posteriors in this thesis, this is usually not possible. Other schemes exist that do not have this requirement. Below we present the Metropolis algorithm with simple proposals, which requires only that we be able to evaluate the target probability density at a given state, but whose weaknesses will motivate the more sophisticated hybrid Monte Carlo method used in this thesis.

The Metropolis Algorithm with Simple Proposals

The Metropolis algorithm (Metropolis et al., 1953) is a well-known algorithm for constructing a Markov chain with a desired invariant distribution.

Let \pi(X) be the desired invariant distribution. Suppose our Markov chain currently has state X_i. The Metropolis algorithm amounts to the following Markov chain transition rule. First, propose a transition to a new state X^* from the current state X_i, where the proposal probability density M(X_i, X^*) must be symmetric; that is,

    M(X_i, X^*) = M(X^*, X_i)

M(X_i, X^*) is the probability density of going to X^* given that we were originally at X_i. Next, accept the proposed state as the next Markov chain state X_{i+1} with the following probability:

    P(\text{accept}) = \min\left( 1, \frac{\pi(X^*)}{\pi(X_i)} \right)

If we reject, the state X_{i+1} is set to be the previous state X_i.

One can show that the Metropolis algorithm guarantees that the target distribution \pi(X) is an invariant distribution of the Markov chain. However, it does not guarantee that the Markov chain is irreducible and aperiodic.
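The transition rule above fits in a few lines of code. The sketch below targets a one-dimensional standard Gaussian (an illustrative target, not a neural-network posterior); only the density ratio \pi(X^*)/\pi(X_i) is needed, so normalizing constants cancel, and the comparison is done in log space for numerical safety.

```python
import random
import math

# Metropolis algorithm with a symmetric Gaussian proposal, targeting a
# one-dimensional standard Gaussian pi(x) (unnormalized log density -x^2/2).

def metropolis(n, step=1.0, seed=2):
    random.seed(seed)
    log_pi = lambda x: -0.5 * x * x      # log of the unnormalized target
    x = 0.0
    chain = []
    for _ in range(n):
        proposal = random.gauss(x, step)   # symmetric proposal M(x, x*)
        # Accept with probability min(1, pi(x*) / pi(x)).
        if math.log(random.random()) < log_pi(proposal) - log_pi(x):
            x = proposal                    # accepted
        chain.append(x)                     # on rejection, x is repeated
    return chain

chain = metropolis(50000)
```

Both the mean and the variance of the chain approach those of the target; the next section examines why this simple scheme nevertheless explores awkwardly shaped distributions slowly.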

Let us consider the performance of the Metropolis algorithm in sampling from some target distribution when we use a simple Gaussian proposal with a fixed covariance \Sigma, centred on the current point X_i:

    P(X^*) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (X^* - X_i)^T \Sigma^{-1} (X^* - X_i) \right)

where d is the dimensionality of the state space. This example will illuminate the key issues in sampling a distribution with the Metropolis algorithm.

If the variance of the Gaussian proposals is too large compared with the width of the target distribution, the Metropolis algorithm almost always rejects, as proposals usually end up in regions of low target probability. Clearly, this may lead to a slow exploration of the state space, and a proposal with a smaller variance and higher acceptance rate may be better. Indeed, with the exception of special cases like two-dimensional Gaussian target distributions, fairly high acceptance rates are better than very low acceptance rates. But in order to keep the acceptance rate high, the standard deviation of the proposal distribution must be of a size comparable to the distribution's thinnest cross-section, and so the steps taken must be very small compared to the overall distribution if the distribution is very thin in one direction but very long in others.

Thus the first problem is that the presence of a long, thin region in a distribution constrains such a scheme to take steps which may be very small compared to the size of the overall distribution. This by itself is not so bad if the direction of the next step is somehow correlated with that of the first. However, it is not, and that is the second problem: the next step is chosen independently of the first, and because it has the possibility of doubling back on the first step, a random walk results. This effect is illustrated in the figure below.

We expect that the posterior distribution of a neural network's weights is complicated under most circumstances, and might potentially have long, narrow regions. Thus, to sample from the posterior distribution of a neural network using the Metropolis algorithm with Gaussian proposals, we would need to use proposal distributions with small variances. This leads to inefficient random walks, as described above. A method that is more appropriate for the complicated posteriors seen in neural


[Figure: a wandering two-dimensional trajectory between marked "start" and "finish" points, on x and y axes spanning -1 to 1.]

Figure: Illustration of random walk behaviour when using the Metropolis algorithm with simple Gaussian proposals to explore a two-dimensional Gaussian distribution. A fixed proposal standard deviation was used, and a number of the proposed samples were rejected. Large steps are actually more efficient for the special case of a two-dimensional Gaussian target distribution; this figure serves as an illustration of random walks only.


network models is the hybrid Monte Carlo algorithm, which addresses the random walk problem by having auxiliary momentum variables that allow it to keep going in the same direction for many steps. This means that, in the case of hybrid Monte Carlo, the Metropolis rejection test is applied only after many steps, to give it a chance at travelling a long distance.

The Hybrid Monte Carlo Metho d

The hybrid Monte Carlo method, first used in physics by Duane et al. (1987), can be thought of as a Metropolis algorithm with a sophisticated proposal. In this section, we describe how the hybrid Monte Carlo method works. Neal has written relevant expositions of this technique, but we include it here for completeness. We will use the symbols C and C' to denote normalizing constants.

In hybrid Monte Carlo, we associate a physical system with the distribution that we want to sample from. In essence, we simulate the movement of a particle moving in a potential energy well equal to the negative log of the probability density for the distribution that we want to sample from. Each iteration consists of randomizing the velocity of this particle, simulating its motion for some time, and then obtaining its position, which becomes a new sample.

Suppose that we wish to sample from the distribution P(q), where q \in \mathbb{R}^d; \mathbb{R}^d is then the state space of our associated physical system, and q is a state of the system. We augment each state variable q_i with a momentum variable p_i, and we define the Hamiltonian

    H(q, p) = E(q) + K(p)

where the potential energy E(q) is obtained from the desired distribution as

    E(q) = -\log P(q) - \log Z

for any choice of Z, and the kinetic energy K(p) is defined with a set of masses \{m_i\}, i = 1, \ldots, d, as

    K(p) = \frac{1}{2} \sum_{i=1}^{d} \frac{p_i^2}{m_i}

Hybrid Monte Carlo allows us to set up a Markov chain that converges to C \exp(-E(q) - K(p)) as its unique invariant distribution. By ignoring the values of p, we obtain samples of q drawn from the target distribution P(q), since this is the marginal distribution.

In the hybrid Monte Carlo method, we simulate the time evolution of the physical system with the above Hamiltonian using Hamiltonian dynamics, which is given by

    \frac{dp_i}{dt} = -\frac{\partial E(q)}{\partial q_i}, \qquad \frac{dq_i}{dt} = \frac{p_i}{m_i}

Let us assume for now that we have the ability to do the simulation with perfect accuracy. Since Hamiltonian dynamics leaves H invariant and keeps phase space volume constant (see Appendix A), simulating the system over any fixed length of time yields a new pair (q, p) that leaves any distribution that is a function of H invariant. In particular, it leaves C \exp(-H(q, p)) = C \exp(-E(q) - K(p)) invariant.

However, a Markov chain that consists of only this update is not irreducible, as all points generated from a starting point never leave a hypershell of constant H, thus violating the irreducibility requirement of the convergence theorem. To rectify this situation, we update the momentum variables in such a way that the Markov chain has some chance of reaching all the other values of H after some number of iterations. Specifically, before the simulation of Hamiltonian dynamics, we replace all the momentum variables by new values drawn from the distribution C' \exp(-K(p)). Again, this update leaves the distribution C \exp(-E(q) - K(p)) invariant, since it draws p from the correct conditional distribution, which happens to be independent of q.

So the joint update, consisting of the momentum update followed by the Hamiltonian dynamics simulation, is a Markov chain transition that leaves C exp(−E(q) − K(p)) invariant. If we can construct such a Markov chain and prove that it is irreducible and aperiodic, then we have a Markov chain that converges to the desired distribution C exp(−E(q) − K(p)), and we can obtain the desired samples by ignoring p. Moreover, if during the Hamiltonian simulation we follow the dynamical trajectory of a state for a long time, we might obtain a state that is much less correlated with the original state than one produced by a Metropolis algorithm with simple proposals.

The method presented thus far is not the actual hybrid Monte Carlo algorithm, but it contains all the essential ideas. What is different about hybrid Monte Carlo is that in reality we are unable to simulate Hamiltonian dynamics perfectly. Because neural network models are highly complex, and so lead to non-integrable Hamiltonians, we have to settle for an approximate, discretized simulation of Hamiltonian dynamics, followed by a Metropolis rejection test that ensures that C exp(−H) is kept invariant. As before, the update consisting of momentum resampling followed by the simulation keeps the desired distribution C exp(−H) invariant. The conditions under which this Markov chain is irreducible and aperiodic depend on its details, and we delay discussing this until the section on convergence below.

Finally, we note that the discretized simulation is now also a Metropolis proposal, with the probability of rejection increasing as the simulation error, measured by the rise in H, increases. When we do the simulation well, we keep H almost invariant over long trajectories, so it is in our interest to do the simulation as well as we can in order to have a high acceptance rate, even though simulation errors are corrected by the Metropolis rejection test to give the exact desired distribution.

Leapfrog Proposals

Because the discretized simulation used as a Metropolis proposal is deterministic, the standard reversibility condition for Metropolis proposals does not apply. Instead, the equivalent reversibility conditions for deterministic proposals are that the mapping that constitutes the Metropolis proposal is its own inverse, and that it has Jacobian 1. That is:

Theorem (Deterministic Proposals for the Metropolis Algorithm). If Y = M(X) is a deterministic mapping that satisfies the two conditions

|∂Y/∂X| = 1   (volume conservation)

M(M(X)) = X   (reversibility)

then by accepting the update X → M(X) with probability min[1, π(M(X))/π(X)] and rejecting it otherwise, the distribution π(X) is left invariant.

We give the proof of this in Appendix B.

To satisfy the two conditions of volume conservation and reversibility, we use a deterministic proposal composed of leapfrog updates to simulate the Hamiltonian dynamics, performing a trajectory of l steps, each lasting time ε. At the end of each trajectory, we negate the momentum p. Each leapfrog update consists of

p_i(t + ε/2) = p_i(t) − (ε/2) ∂E/∂q_i(q(t))              for each i = 1, …, d

q_i(t + ε) = q_i(t) + ε p_i(t + ε/2) / m_i               for each i = 1, …, d

p_i(t + ε) = p_i(t + ε/2) − (ε/2) ∂E/∂q_i(q(t + ε))      for each i = 1, …, d

Note that in the above scheme, all the components are updated before moving on to the next line. For instance, all the components of p(t + ε/2) are calculated before the update for q is computed.
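The half-step/full-step/half-step structure above can be sketched in Python (a minimal sketch; the function name and the one-dimensional quadratic example energy are our own, not from the thesis):

```python
import numpy as np

def leapfrog_step(q, p, grad_E, eps, m):
    """One leapfrog update: half-step in p, full step in q, half-step in p."""
    p = p - 0.5 * eps * grad_E(q)   # p(t + eps/2)
    q = q + eps * p / m             # q(t + eps)
    p = p - 0.5 * eps * grad_E(q)   # p(t + eps)
    return q, p

# Example: E(q) = q^2/2, so grad_E(q) = q (a one-dimensional harmonic oscillator).
grad_E = lambda q: q
q, p = np.array([1.0]), np.array([0.0])
for _ in range(1000):
    q, p = leapfrog_step(q, p, grad_E, eps=0.1, m=np.array([1.0]))
# H = E + K stays close to its initial value 0.5 over the whole trajectory.
H = 0.5 * q**2 + 0.5 * p**2
```

Even over many steps, H does not drift; it only oscillates within a band of width of order ε², which is what makes long trajectories usable as Metropolis proposals.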

To see that the leapfrog update satisfies the volume conservation condition, we note that the change in each component of each state variable does not depend on itself, so each component's update amounts to a shear, which is a volume-preserving transformation and which therefore has Jacobian 1. This is discussed in greater detail in Appendix C.

Also, it is easy to check that the negation of the momentum at the end of a leapfrog trajectory of multiple leapfrog steps means that if a trajectory takes us from point A to point B, then starting at B takes us back to A. Thus leapfrog trajectories also satisfy the reversibility condition.

We summarize hybrid Monte Carlo in the two algorithms below. Note that the momentum negation is not implemented, as the momentum is replaced by resampling before the next leapfrog trajectory anyway.

The algorithm discussed thus far avoids random walks by allowing long trajectory lengths. However, the actual algorithm implemented by Neal has an additional optimization of the stepsizes, which estimates the local second derivatives of the potential energy in order to take steps that are appropriately scaled in the various dimensions. This is discussed next.

Stepsize Selection

In using the hybrid Monte Carlo method, the question of what values to choose for the stepsize ε and for the masses m_i naturally arises. As we shall see, choosing the masses is equivalent to choosing different stepsizes in different dimensions of the state variable, and the careful choice of these stepsizes is necessary for hybrid Monte Carlo to perform well.

Since ε is the timestep of a discretized simulation of Hamiltonian dynamics, it is clear from the leapfrog equations that large values of ε cause an inaccurate simulation, so that H can wander far from its initial value. In particular, such an inaccurate simulation will typically land the proposed state in a region of low target probability. This is analogous to using a Gaussian proposal with too large a variance in the Metropolis algorithm example discussed earlier, and so having to reject frequently. Thus keeping rejection rates low requires careful selection of an ε that is low enough, and yet not so low that we explore the distribution laboriously.

Algorithm: HybridMonteCarlo(nsamples, l, ε, q_init)

  q^(0) ← q_init
  for each component j do
    p_j^(0) ~ N(mean 0, variance m_j)
  end for
  for i = 1 to nsamples do
    for each component j do
      p_j^(i−1) ~ N(mean 0, variance m_j)
    end for
    p^(i) ← p^(i−1)
    q^(i) ← q^(i−1)
    for j = 1 to l do
      (q^(i), p^(i)) ← LeapfrogUpdate(q^(i), p^(i), ε)
    end for
    if U(0,1) ≥ min{1, exp[−H(q^(i), p^(i)) + H(q^(i−1), p^(i−1))]} then
      q^(i) ← q^(i−1)      {reject: restore the previous state}
      p^(i) ← p^(i−1)
    end if
  end for
  Return {q^(i), p^(i)}_{i=1}^{nsamples}

Algorithm: LeapfrogUpdate(q, p, ε)

  for each component i do
    p_i ← p_i − (ε/2) ∂E(q)/∂q_i
  end for
  for each component i do
    q_i ← q_i + ε p_i / m_i
  end for
  for each component i do
    p_i ← p_i − (ε/2) ∂E(q)/∂q_i
  end for
  Return (q, p)
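Putting the two algorithms together, a complete sampler can be sketched in Python (a minimal sketch; the function names and the two-dimensional Gaussian target are our own illustration, not the thesis implementation):

```python
import numpy as np

def hmc(E, grad_E, q0, n_samples, l, eps, m, rng):
    """Hybrid Monte Carlo: resample momenta, run a leapfrog trajectory of
    l steps, then apply a Metropolis test on the change in H = E + K."""
    d = len(q0)
    q = q0.copy()
    samples = []
    for _ in range(n_samples):
        p = rng.normal(0.0, np.sqrt(m), size=d)   # momenta ~ N(0, m_j)
        q_new, p_new = q.copy(), p.copy()
        for _ in range(l):                        # leapfrog trajectory
            p_new = p_new - 0.5 * eps * grad_E(q_new)
            q_new = q_new + eps * p_new / m
            p_new = p_new - 0.5 * eps * grad_E(q_new)
        H_old = E(q) + np.sum(p**2 / (2 * m))
        H_new = E(q_new) + np.sum(p_new**2 / (2 * m))
        if rng.random() < np.exp(min(0.0, H_old - H_new)):  # accept
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Example target: a standard 2-D Gaussian, i.e. E(q) = |q|^2 / 2.
rng = np.random.default_rng(0)
E = lambda q: 0.5 * np.sum(q**2)
grad_E = lambda q: q
samples = hmc(E, grad_E, np.zeros(2), 2000, l=20, eps=0.2, m=np.ones(2), rng=rng)
```

With l = 20 steps of size 0.2, each trajectory moves far across the target, so successive samples are only weakly correlated and the empirical moments match the standard Gaussian closely.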

As described by Neal, for a toy quadratic Hamiltonian of the form

H = q²/(2σ²) + p²/2,

H diverges under the leapfrog discretization if a stepsize ε > 2σ is used, whereas H stays bounded if ε < 2σ. To transfer this result to a general non-quadratic H(q, p), we note that near equilibrium, samples are usually obtained near local minima of E(q), where E can be approximated by its Taylor expansion to second order. Thus we expect a stepsize

ε < 2 (∂²E/∂q²)^(−1/2)

to be appropriate near equilibrium.

In the case where the state space is multidimensional, but the Taylor expansion to second order has no correlations between its dimensions, we could set

ε = min_i (∂²E/∂q_i²)^(−1/2)

in order to prevent the leapfrog simulation from diverging. But if the Taylor expansion has correlations between its dimensions, the above might not be small enough, as the stability of the leapfrog updates is constrained by the narrowest cross-section of the energy bowl, which might not be axis-aligned at all. In general, the stepsize has to be adjusted downwards by different amounts, depending on the shape and orientation of the energy function. To take this into account, we introduce an operator-defined tuning parameter η, which we call the stepsize adjustment factor, and which controls the stepsize as follows:

ε = η min_i (∂²E/∂q_i²)^(−1/2)

However, choosing the same stepsize to use in all directions may cause slow, random-walk-like exploration in some directions unless we use very long trajectories, which may be unnecessarily computationally intensive. The underlying problem is that of making a move that is compatible with the local length scales of the distribution. It is the same problem that we encountered earlier in considering the Metropolis algorithm with simple proposals, but under a slightly different guise.

Clearly, it is preferable to use a different, appropriate stepsize for each direction, the values of which we choose by looking at the local length scales of the distribution. Ideally, we would like to set the stepsize ε_i for direction i based on the width of the potential energy bowl in the direction q_i:

ε_i = η (∂²E/∂q_i²)^(−1/2)

However, one may wonder whether the leapfrog update with different stepsizes for different components still simulates Hamiltonian dynamics. The answer is that the masses are the extra degrees of freedom that enable us to implement different stepsizes in different directions and still keep H approximately constant. To see this, we first note that if we rewrite the leapfrog equations in terms of p̃_i = p_i/√m_i, they become

p̃_i(t + ε/2) = p̃_i(t) − (ε/(2√m_i)) ∂E/∂q_i(q(t))

q_i(t + ε) = q_i(t) + (ε/√m_i) p̃_i(t + ε/2)

p̃_i(t + ε) = p̃_i(t + ε/2) − (ε/(2√m_i)) ∂E/∂q_i(q(t + ε))

We set the stepsizes ε_i = ε/√m_i and rewrite the leapfrog updates as

p̃_i(t + ε/2) = p̃_i(t) − (ε_i/2) ∂E/∂q_i(q(t))

q_i(t + ε) = q_i(t) + ε_i p̃_i(t + ε/2)

p̃_i(t + ε) = p̃_i(t + ε/2) − (ε_i/2) ∂E/∂q_i(q(t + ε))

These new updates are exactly equivalent to the original leapfrog updates, except that we work in terms of rescaled momenta. Should we choose to, we can always recover the old momenta after an update. But rather than using the original leapfrog updates, we can work in terms of p̃_i, using the new mass-absorbed updates. We note the following important facts about one mass-absorbed update of q and p̃:

Fact 1. Each mass-absorbed leapfrog update keeps H̃(q, p̃) = E(q) + (1/2) Σ_{i=1}^d p̃_i² approximately invariant. This is because it keeps H(q, p) = E(q) + Σ_{i=1}^d p_i²/(2 m_i) approximately invariant, and the two H's are equal.

Fact 2. Each mass-absorbed leapfrog update conserves phase space volume in the state space (q, p̃), since p̃_i is related to p_i merely by the scale factor 1/√m_i. This holds if the ε_i's are set independently of the current values of q or p̃.

Fact 3. Each mass-absorbed leapfrog update is reversible, so long as the ε_i's are set without using the current values of q and p̃, since these are different at the beginning and at the end of each step.
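The equivalence underlying these facts is easy to check numerically: one ordinary leapfrog step with masses m_i and one mass-absorbed step with ε_i = ε/√m_i produce the same q, and momenta related exactly by p̃_i = p_i/√m_i (the quadratic example energy and variable names below are our own):

```python
import numpy as np

grad_E = lambda q: q           # E(q) = |q|^2 / 2, for illustration
m = np.array([1.0, 4.0, 9.0])  # masses
eps = 0.1

# Ordinary leapfrog step in (q, p).
q = np.array([1.0, -0.5, 2.0]); p = np.array([0.3, 0.2, -0.1])
p1 = p - 0.5 * eps * grad_E(q)
q1 = q + eps * p1 / m
p1 = p1 - 0.5 * eps * grad_E(q1)

# Mass-absorbed step in (q, p~) with p~_i = p_i / sqrt(m_i), eps_i = eps / sqrt(m_i).
eps_i = eps / np.sqrt(m)
pt = p / np.sqrt(m)
pt1 = pt - 0.5 * eps_i * grad_E(q)
q2 = q + eps_i * pt1
pt1 = pt1 - 0.5 * eps_i * grad_E(q2)
# q2 equals q1, and pt1 equals p1 / sqrt(m): the two formulations agree.
```

Because the correspondence is an exact change of variables, all three facts about the mass-absorbed form follow directly from the corresponding properties of the ordinary leapfrog step.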

In view of these three facts, we have the following revised algorithm, which stores p̃ instead of p. Before the leapfrog updates, we estimate the stepsizes

ε_i = η (∂²E/∂q_i²)^(−1/2)

independently of the current state, using some problem-dependent heuristic. The leapfrog trajectory now consists of mass-absorbed updates, at the end of which we apply the Metropolis rejection test using the Hamiltonian H̃(q, p̃) = E(q) + (1/2) Σ_{i=1}^d p̃_i². Since the proposals in the (q, p̃) state space are reversible and conserve phase space volume, the Metropolis rejection test can be used to produce an update that keeps exp(−H̃(q, p̃)) invariant. Thus we have a Markov chain that leaves exp(−E(q) − (1/2) Σ_{i=1}^d p̃_i²) invariant.

We summarize hybrid Monte Carlo with stepsize selection in the algorithm below, where the function Stepsizes computes the appropriate stepsize to use for each component. We call the function LeapfrogUpdate with a vector for its stepsize parameter, unlike the scalar in the earlier algorithm, but what we mean here should be clear. The setting of the stepsizes depends on the exact problem at hand. For a neural network model, Neal sets them based on the current values of the hyperparameters. These do not change over the course of a leapfrog trajectory in the scheme presented in his book, so that leapfrog trajectories are reversible.

Convergence of Hybrid Monte Carlo

In this section, we discuss the conditions under which this Markov chain algorithm converges to a unique invariant distribution.

The convergence theorem cited earlier tells us that in order for hybrid Monte Carlo to converge to a unique distribution, it must be both irreducible and aperiodic. Whether or not this is true depends on the details of the Hamiltonian, the leapfrog trajectory length, and the stepsize adjustment factor. Although we have no formal proof, we have strong reasons to believe that both conditions are satisfied in most neural network applications, whose Hamiltonian dynamics are highly nonlinear and whose Hamiltonians have values that are finite for finite values of the state parameters.

Let us first discuss periodicity. For most problems involving complex nonlinear Hamiltonians, such as the ones we encounter in neural network applications, we expect that periodicity is unlikely and, if it should appear, is pathological rather than typical. An example of such an unlikely periodicity is a case where the Hamiltonian dynamics takes us exactly halfway or completely around a hypershell of constant H.

Algorithm: HybridMonteCarloWithStepsizes(nsamples, l, η, q_init)

  q^(0) ← q_init
  p̃^(0) ~ N(mean 0, variance I)
  for i = 1 to nsamples do
    p̃^(i−1) ~ N(mean 0, variance I)
    ε ← Stepsizes(η)
    p̃^(i) ← p̃^(i−1)
    q^(i) ← q^(i−1)
    for j = 1 to l do
      (q^(i), p̃^(i)) ← LeapfrogUpdate(q^(i), p̃^(i), ε)
    end for
    if U(0,1) ≥ min{1, exp[−H̃(q^(i), p̃^(i)) + H̃(q^(i−1), p̃^(i−1))]} then
      q^(i) ← q^(i−1)      {reject: restore the previous state}
      p̃^(i) ← p̃^(i−1)
    end if
  end for
  Return {q^(i), p̃^(i)}_{i=1}^{nsamples}

Such a periodicity would persist in hypershells of all values of H, so that momentum resampling does not avail us of an escape from it. This situation appears very unlikely for the highly nonlinear neural network Hamiltonians that we use, so we expect that we will probably always have aperiodicity in practice. Still, if one wishes to be on the safe side, one can vary the stepsize adjustment factors randomly over a small range, and that should remove any periodicities (Mackenzie). This modification was not implemented in our version of the algorithm.

Irreducibility depends on the exact shape of the potential energy surface. Let us assume for now that our hybrid Monte Carlo algorithm is able to simulate Hamiltonian dynamics perfectly. Then it seems intuitively clear that, so long as the potential energy does not become infinite for finite values of q, hybrid Monte Carlo should be irreducible. In particular, while moving around in a local minimum, the chain always has some probability of gaining a sufficiently large kinetic energy from momentum replacement to leave it and visit other parts of state space. This can only be prevented if that local minimum is bounded by walls of infinite potential energy. And since our neural network Hamiltonian is always finite for finite values of q, we expect that we will always have irreducibility. However, hybrid Monte Carlo really only simulates Hamiltonian dynamics approximately, so it is conceivable that a finite potential well could be a trap like an infinite one. Nevertheless, the fact that hybrid Monte Carlo does simulate Hamiltonian dynamics is reason to believe that the above argument for irreducibility should usually apply.

Chapter

Hyperparameter Updates Using Gibbs Sampling

Neural Network Architecture

In this thesis, we concern ourselves with a neural network with one hidden layer only. The techniques described here can readily be extended to networks with more hidden layers.

The neural network model used here has N_x input units, N_h hidden units, and N_y output units. The parameters of the neural network are referred to collectively as θ. As shown in the figure below, they are:

U = {u_i}_{i=1}^{N_u}   Input-to-hidden weights

V = {v_i}_{i=1}^{N_v}   Hidden-to-output weights

A = {a_i}_{i=1}^{N_a}   Hidden biases

B = {b_i}_{i=1}^{N_b}   Output biases

where N_u = N_x N_h, N_v = N_h N_y, N_a = N_h, and N_b = N_y.

We will sometimes use the alternative notation:

[Figure: Neural network architecture used in this work. N_x input units connect to N_h hidden units (biases a_j) through the weights u; the hidden units connect to N_y output units (biases b_k) through the weights v.]

U_ij   Input-to-hidden weight from input unit i to hidden unit j

V_jk   Hidden-to-output weight from hidden unit j to output unit k

A_j    Bias on hidden unit j

B_k    Bias on output unit k

In this neural network, we use the tanh nonlinearity, and it occurs only at the single hidden layer. Thus, for input vector {x_i}_{i=1}^{N_x}, the j-th hidden unit output is

h_j = tanh( Σ_{i=1}^{N_x} U_ij x_i + A_j ),

while the output at unit k has no nonlinearity:

f_k = Σ_{j=1}^{N_h} V_jk h_j + B_k.
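The two equations above translate directly into code (a minimal sketch; the function name and array layout, with U of shape (N_x, N_h) and V of shape (N_h, N_y), are our own conventions):

```python
import numpy as np

def forward(x, U, A, V, B):
    """Forward pass of the one-hidden-layer network:
    h_j = tanh(sum_i U_ij x_i + A_j),  f_k = sum_j V_jk h_j + B_k."""
    h = np.tanh(x @ U + A)   # x: (N_x,), U: (N_x, N_h), A: (N_h,)
    return h @ V + B         # V: (N_h, N_y), B: (N_y,)

# Tiny example with N_x = 2, N_h = 3, N_y = 1: with all weights zero,
# the hidden units output tanh(0) = 0 and the output is just its bias.
U = np.zeros((2, 3)); A = np.zeros(3)
V = np.zeros((3, 1)); B = np.array([0.7])
f = forward(np.array([1.0, -1.0]), U, A, V, B)
```

Because the output layer is linear, f is an unbounded real vector, which matches the Gaussian noise model introduced in the next section.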

Neural Network Model of the Data

The neural network model of the data is as follows. Let ε be the error in the output of the neural network for training input x and target y:

ε = f(x) − y.

We assume that each component of ε has precision (inverse variance) τ. Then, given an input case x, the parameters θ of the network, and τ, the probability of observing the target y is

P(y | x, θ, τ) = (τ/2π)^{N_y/2} exp( −τ |f(x) − y|² / 2 ).

The prior distributions of the four groups of parameters are normal, with means 0 and precisions τ_*. The asterisk indicates the corresponding group of network parameters, and may be u, v, a or b. For instance, for the input-to-hidden weights U,

P(U | τ_u) = (τ_u/2π)^{N_u/2} exp( −(τ_u/2) Σ_{i=1}^{N_u} u_i² ).

The prior distribution of the parameters θ is simply the joint distribution

P(θ | γ) = P(U | τ_u) P(V | τ_v) P(A | τ_a) P(B | τ_b),

where γ denotes {τ_u, τ_v, τ_a, τ_b, τ}, the set of hyperparameters; the first four control the prior distribution of each group of parameters. These prior distributions keep the network parameters small, and amount to a principled formulation of the weight-decay terms found in the neural network training literature (see Bishop).

Rather than fixing the hyperparameters, we allow them to vary also, and we let each have a gamma distribution. For instance, for τ_u:

P(τ_u) = (α_u/2ω_u)^{α_u/2} / Γ(α_u/2) · τ_u^{α_u/2 − 1} exp( −τ_u α_u / 2ω_u ),

which is a gamma distribution with mean ω_* and shape parameter α_* for each τ_*. The prior distribution of the hyperparameters, P(γ), is simply the joint distribution

P(γ) = P(τ_u) P(τ_v) P(τ_a) P(τ_b) P(τ).

In this work, the ω_*'s and α_*'s are fixed by hand.

Posterior Distributions of the Parameters and the Hyperparameters

We can infer the posterior distributions for the parameters θ = {U, V, A, B} and the hyperparameters γ = {τ_u, τ_v, τ_a, τ_b, τ} when the training data is observed. In the Monte Carlo approach, we do this by obtaining samples from the distribution P(θ, γ | x, y), where x = {x^c}_{c=1}^{N_c} and y = {y^c}_{c=1}^{N_c} are the training data.

Neal samples from P(θ, γ | x, y) by obtaining a series of Markov chain samples {θ_i, γ_i}_{i=1}^{N_s}, where γ_i is obtained from P(γ | θ_{i−1}, x, y) by Gibbs sampling, and θ_i is obtained from θ_{i−1} by a hybrid Monte Carlo update that leaves the distribution P(θ | γ_i, x, y) invariant. During the first iteration, the hyperparameters are set to some moderate values. We discuss the convergence properties of this Markov chain later in this chapter.

In the remainder of this section, we give the distributions P(θ | γ, x, y) and P(γ | θ, x, y), which are required for the above sampling scheme.

Using Bayes' Rule, we obtain the posterior distribution for θ as

P(θ | γ, x, y) = P(y | θ, γ, x) P(θ | γ, x) / P(y | γ, x)
∝ P(y | θ, γ, x) P(θ | γ)
= [ Π_{c=1}^{N_c} P(y^c | x^c, θ, τ) ] P(U | τ_u) P(V | τ_v) P(A | τ_a) P(B | τ_b).

The factors on the right-hand side are given in full by the likelihood and prior expressions of the previous section.

The posterior distribution P(γ | θ, x, y) is the probability of the hyperparameters conditioned on θ, x and y:

P(γ | θ, x, y) = P(τ_u | θ, x, y) P(τ_v | θ, x, y) P(τ_a | θ, x, y) P(τ_b | θ, x, y) P(τ | θ, x, y).

Consider the hyperparameter τ_u. Since each τ_* is the precision of its group of parameters, it can be inferred solely from those parameters, independently of x and y. For instance, for τ_u:

P(τ_u | θ, x, y) = P(τ_u | {u_i}_{i=1}^{N_u})
∝ P({u_i}_{i=1}^{N_u} | τ_u) P(τ_u)
∝ τ_u^{(α_u + N_u)/2 − 1} exp[ −(τ_u/2)( α_u/ω_u + Σ_{i=1}^{N_u} u_i² ) ].

τ, on the other hand, is the precision of the noise in each component of y^c, and is inferred from the errors ε^c = f(x^c) − y^c:

P(τ | θ, x, y) = P(τ | {ε^c}_{c=1}^{N_c}) ∝ P({y^c}_{c=1}^{N_c} | τ, θ, x) P(τ)
∝ τ^{(α + N_c N_y)/2 − 1} exp[ −(τ/2)( α/ω + Σ_{c=1}^{N_c} |f(x^c) − y^c|² ) ].

Note that both of these posterior distributions are again gamma distributions.
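A Gibbs update for one precision under the gamma-posterior form given above (our reconstruction) can be sketched as follows; note that NumPy's gamma sampler takes a shape and a scale, where scale = 1/rate:

```python
import numpy as np

def gibbs_precision(values, alpha, omega, rng):
    """Draw a precision tau from its gamma posterior given its group of
    zero-mean normal 'values': shape (alpha + n)/2, rate (alpha/omega + sum v^2)/2."""
    n = values.size
    shape = 0.5 * (alpha + n)
    rate = 0.5 * (alpha / omega + np.sum(values**2))
    return rng.gamma(shape, 1.0 / rate)   # numpy uses scale = 1/rate

rng = np.random.default_rng(0)
u = rng.normal(0.0, 0.5, size=5000)       # weights with true precision 1/0.25 = 4
taus = np.array([gibbs_precision(u, alpha=1.0, omega=1.0, rng=rng)
                 for _ in range(200)])
# With many weights, the posterior draws concentrate near the true precision 4.
```

This is the hyperparameter half of the alternating scheme: given the current weights, each precision is refreshed by one exact draw from its conditional distribution.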

Sampling From the Posterior Distributions of the Parameters and the Hyperparameters

Since the conditional distributions above are gamma distributions, independent samples for the hyperparameters can be drawn using well-known techniques (Devroye).

Drawing samples from the posterior distribution of the parameters is more difficult. For instance, standard Gibbs sampling cannot be used, because the conditional distribution of each network parameter can be a very complicated function, owing to the training error terms. A simple Metropolis method suffers from random walks, as previously described. So instead, Neal obtains samples of the network parameters using the hybrid Monte Carlo technique with stepsize selection, as described in the previous chapter. To sample from the posterior distribution for θ, the potential energy is set as follows:

E(θ) = −log P(θ | γ, x, y)
= (τ/2) Σ_{c=1}^{N_c} |f(x^c) − y^c|² + (τ_u/2) Σ_{i=1}^{N_u} u_i² + (τ_v/2) Σ_{i=1}^{N_v} v_i² + (τ_a/2) Σ_{i=1}^{N_a} a_i² + (τ_b/2) Σ_{i=1}^{N_b} b_i² + const,

where the constant is immaterial because it does not affect the dynamics of the system that gives rise to the desired probability distribution. This constant corresponds to prefactors in the distribution P(θ | γ, x, y) that do not depend on θ. The parameters θ correspond to q in the hybrid Monte Carlo algorithm.
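This potential energy is straightforward to compute from the network's forward pass (a minimal sketch, with the constant dropped; the function name and array shapes are our own conventions):

```python
import numpy as np

def potential_energy(params, taus, X, Y):
    """E = (tau/2) sum_c |f(x^c) - y^c|^2 plus (tau_*/2) sum w^2 for each
    weight group, dropping the additive constant.
    params = (U, A, V, B); taus = (tau, tau_u, tau_v, tau_a, tau_b)."""
    U, A, V, B = params
    tau, tau_u, tau_v, tau_a, tau_b = taus
    F = np.tanh(X @ U + A) @ V + B                  # outputs, one row per case
    E = 0.5 * tau * np.sum((F - Y)**2)              # likelihood (error) term
    E += 0.5 * (tau_u * np.sum(U**2) + tau_v * np.sum(V**2)
                + tau_a * np.sum(A**2) + tau_b * np.sum(B**2))  # prior terms
    return E

# Sanity check: zero weights and zero targets give zero energy.
U = np.zeros((2, 3)); A = np.zeros(3); V = np.zeros((3, 1)); B = np.zeros(1)
X = np.ones((4, 2)); Y = np.zeros((4, 1))
E = potential_energy((U, A, V, B), (1.0, 1.0, 1.0, 1.0, 1.0), X, Y)
```

The gradient of this E with respect to the weights, obtained by backpropagation, is what the leapfrog updates of the previous chapter consume.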

Convergence of the Algorithm

In this section, we discuss the conditions under which this Markov chain algorithm converges to a unique invariant distribution.

Recall that our algorithm alternately updates the hyperparameters by Gibbs sampling and the parameters using hybrid Monte Carlo. We have already discussed the reasons why hybrid Monte Carlo by itself should converge to a unique distribution. The question is: combined with the hyperparameter Gibbs sampling update step, does the resulting Markov chain converge?

A Markov chain update is periodic so long as it is periodic in one of the parameters of its state space, so the fact that Gibbs sampling of the hyperparameters has nonzero probability of producing any value does not immediately imply aperiodicity. However, when the hyperparameters are updated from one iteration to the next, hybrid Monte Carlo sees a random modification of the potential energy surface that, if anything, would prevent systematic behaviour like periodicity. Thus the Gibbs sampling step renders periodicity even more unlikely than ever.

When the hyperparameters change from one iteration to the next, hybrid Monte Carlo sees a modification of the potential energy surface that does not introduce any infinite barriers, so the earlier argument for irreducibility still applies, and we expect that, with sufficient iterations, all regions of the network parameter state space can be visited, regardless of the values the hyperparameters have. Thus we expect that all regions of the joint state space containing both parameters and hyperparameters can be visited after a sufficient number of iterations, and our algorithm should be irreducible.

Although we have no formal proof, based on the above arguments we expect that our Markov chain does converge to a unique distribution.

Inefficiency Due to Gibbs Sampling of Hyperparameters

Although we avoid random walks in network parameter space given the hyperparameters, Gibbs sampling of the hyperparameters can lead to a slow random walk in the joint state space of the hyperparameters and the network parameters. This is because the distribution of the parameters conditional on the hyperparameters is restricted by that conditioning; this restricts the possible values the parameters are likely to visit, and so the distribution of the hyperparameters conditional on the parameters is unlikely to change much from the previous iteration, in order to be consistent with the parameters.

This is especially apparent when there are many hidden units. Consider sampling the input-to-hidden weights u_i given τ_u. When there are many of them, they represent their distribution well, with a variance close to 1/τ_u. Thus when τ_u is Gibbs-sampled, we are likely to obtain a value close to τ_u again, and so the Markov chain becomes highly correlated from sample to sample.
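The pinning effect is easy to see numerically: repeat one Gibbs round-trip (weights drawn at a fixed precision, then a fresh precision drawn from its gamma posterior) and measure how tightly the new precision clusters around the old value as the group grows (the setup and prior values below are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_spread(n_weights, tau_old=4.0, n_iter=500):
    """One Gibbs round-trip: draw weights ~ N(0, 1/tau_old), then draw
    tau from its gamma posterior; return the relative spread of tau."""
    draws = []
    for _ in range(n_iter):
        u = rng.normal(0.0, 1.0 / np.sqrt(tau_old), size=n_weights)
        shape = 0.5 * (1.0 + n_weights)            # vague gamma prior
        rate = 0.5 * (1.0 + np.sum(u**2))
        draws.append(rng.gamma(shape, 1.0 / rate))
    draws = np.array(draws)
    return draws.std() / draws.mean()

small, large = relative_spread(10), relative_spread(1000)
# With 100x more weights, each new precision is pinned far more tightly
# around the old one, so the chain moves in much smaller hyperparameter steps.
```

The relative spread shrinks roughly as 1/√N_u, which is exactly the slow random walk described above: the more hidden units, the smaller each hyperparameter move.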

This problem can be alleviated when there are many training cases. In the potential energy above, the prior terms number N_u + N_v + N_a + N_b in total. When N_c N_y ≫ N_u + N_v + N_a + N_b, the likelihood term has a stronger effect in determining the shape of the potential energy bowl being sampled from, and so the weights move mostly to fit the data rather than to satisfy the prior constraints imposed by their precisions; in this case, the hyperparameters become slaved to the weights, which in turn are well determined by the data.

When we do not have the option of obtaining more training cases and we use large numbers of hidden units, the random walk described above becomes an issue, and may slow the method down so much as to render it impractical to use. The next chapter introduces the solution considered in this thesis: updating the hyperparameters using hybrid Monte Carlo, as opposed to Gibbs sampling.

Chapter

Hyperparameter Updates Using Hamiltonian Dynamics

The New Scheme

To overcome the slow movement of the hyperparameters when using Gibbs sampling, we propose to update the hyperparameters using Hamiltonian dynamics, in the same way as the parameters.

The Idea

The problem with the old scheme is that the alternating updates of each hyperparameter and its associated parameters cause them to pin each other down, resulting in Markov chain moves that are small compared to the overall distribution, and which can double back, since Markov chains have only state memory and no momentum memory. This doubling back is similar to the random walk behaviour of the Metropolis algorithm with simple proposals. Since hybrid Monte Carlo is our way of overcoming that, we hope that hybrid Monte Carlo, by producing trajectories that can keep going in the same general direction for long distances, may also allow the hyperparameters to travel long distances in a single leapfrog trajectory without doubling back. This should lead to gains in how rapidly the parameters and the hyperparameters sample the posterior distribution.

The New Scheme in Detail

Rather than updating the parameters conditional on the hyperparameters and vice versa, our aim is now to update the parameters θ and the hyperparameters γ jointly, according to the joint posterior distribution

P(θ, γ | x, y) = P(γ) P(θ | γ) P(y | θ, τ, x) / P(y | x)
∝ P(γ) P(θ | γ) P(y | θ, τ, x),

where we have dropped the normalizing constant P(y | x) and used the fact that, given θ, the dependence of y on the hyperparameters is restricted to just the noise hyperparameter τ.

From the expressions for these factors given in the previous chapter, we get

E(θ, γ) = −log P(θ, γ | x, y)
= (τ/2)( α/ω + Σ_{c=1}^{N_c} |f(x^c) − y^c|² ) − ( (N + α)/2 − 1 ) log τ + Σ_{*∈{u,v,a,b}} E_* + const,

where N = N_c N_y is the total number of target values in the training data, and

E_u = (τ_u/2)( α_u/ω_u + Σ_{i=1}^{N_u} u_i² ) − ( (N_u + α_u)/2 − 1 ) log τ_u,

with E_v, E_a and E_b similarly defined.

The above is the new potential energy that hybrid Monte Carlo must use in its simulation. Accordingly, we now expand the state space to include the hyperparameters. We denote the position and momentum variables corresponding to the parameters and to the hyperparameters with the subscripts θ and γ respectively:

q = (q_θ, q_γ)

p = (p_θ, p_γ)

However, some complications now arise, owing to the necessity for the leapfrog proposals to be symmetric. The main problem is that the hyperparameters at the beginning and at the end of a leapfrog trajectory are now different, so setting the parameter stepsizes based on the hyperparameters at the beginning does not lead to reversible dynamics; i.e., by reversing the momentum and following the leapfrog dynamics backwards from the finishing point, one does not arrive back at the starting point.

The solution is to first update just the parameters by one step, using stepsizes based on the current value of the hyperparameters. This is the same dynamical update with stepsize selection as before, and, as shown in the previous chapter, it is reversible, since the hyperparameters (and hence the stepsizes) have not changed over the step. Then we update the hyperparameters by one step, using stepsizes based only on the newly computed value of the parameters. This is also reversible, as the parameters do not change over the step. These two updates comprise one step in the new leapfrog trajectory, updating both parameters and hyperparameters. By repeating this l times, we obtain a leapfrog trajectory of length l that, being reversible in each step, is fully reversible end to end.

Each such step leaves H̃(q, p̃) = E(q) + (1/2) Σ_i p̃_i² approximately constant, and so H̃ is left approximately constant over the entire trajectory for small enough stepsizes. Also, phase space volume is conserved by each step, and so it is conserved over the entire trajectory. Thus we have a trajectory that keeps H̃ roughly constant and is a valid Metropolis proposal, owing to reversibility and phase space volume conservation.

There is actually a slight complication: if we update first the parameters and then the hyperparameters, the reversed trajectory is the one that updates first the hyperparameters and then the parameters, which is not actually the one we are using. To overcome this problem, at the beginning of a trajectory we choose, with equal probability, to update either the parameters or the hyperparameters first. Thus a trajectory that goes from point A to point B is proposed with probability 1/2, while one that goes in reverse from B to A is also proposed with probability 1/2, and so we have symmetric proposals. We summarize this in the algorithm below. There is a different stepsize for each component of q, including the expanded portion of the state; we denote the set of stepsizes for the parameters by ε_θ and the set of stepsizes for the hyperparameters by ε_γ.

Algorithm: Leapfrog trajectory that updates both parameters and hyperparameters using Hamiltonian dynamics

  r ~ U(0, 1)
  if r < 1/2 then
    for i = 1 to l do
      ε_θ ← ParamStepsizes(q_γ)
      (q_θ, p_θ) ← LeapfrogUpdate(q_θ, p_θ, q_γ, ε_θ)
      ε_γ ← HyperparamStepsizes(q_θ)
      (q_γ, p_γ) ← LeapfrogUpdate(q_γ, p_γ, q_θ, ε_γ)
    end for
  else
    for i = 1 to l do
      ε_γ ← HyperparamStepsizes(q_θ)
      (q_γ, p_γ) ← LeapfrogUpdate(q_γ, p_γ, q_θ, ε_γ)
      ε_θ ← ParamStepsizes(q_γ)
      (q_θ, p_θ) ← LeapfrogUpdate(q_θ, p_θ, q_γ, ε_θ)
    end for
  end if

An alternative way of achieving reversibility is to always both start and end a leapfrog trajectory with either a parameter update or a hyperparameter update. The exact method used should not significantly affect the performance of the algorithm.
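The structure of the alternating trajectory can be sketched as follows (a minimal sketch: the function names are our own, the stepsizes are fixed scalars rather than state-dependent vectors, and the toy energy is separable, whereas the real posterior couples the two blocks):

```python
import numpy as np

def leapfrog_block(q, p, grad, eps):
    """One leapfrog step for one block of variables, the other block held fixed."""
    p = p - 0.5 * eps * grad(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad(q)
    return q, p

def alternating_trajectory(q1, p1, q2, p2, grad1, grad2, eps1, eps2, l, rng):
    """l joint steps over (parameters, hyperparameters); a fair coin decides
    which block moves first, so a trajectory from A to B and its reverse
    from B to A are proposed with equal probability."""
    params_first = rng.random() < 0.5
    for _ in range(l):
        if params_first:
            q1, p1 = leapfrog_block(q1, p1, lambda q: grad1(q, q2), eps1)
            q2, p2 = leapfrog_block(q2, p2, lambda q: grad2(q1, q), eps2)
        else:
            q2, p2 = leapfrog_block(q2, p2, lambda q: grad2(q1, q), eps2)
            q1, p1 = leapfrog_block(q1, p1, lambda q: grad1(q, q2), eps1)
    return q1, p1, q2, p2

# Toy joint energy E = (q1^2 + q2^2)/2, so each block sees a unit quadratic bowl.
rng = np.random.default_rng(0)
q1, p1, q2, p2 = alternating_trajectory(
    np.array([1.0]), np.array([0.0]), np.array([0.5]), np.array([0.0]),
    grad1=lambda q, q2: q, grad2=lambda q1, q: q,
    eps1=0.05, eps2=0.05, l=100, rng=rng)
H = 0.5 * float(q1**2 + p1**2 + q2**2 + p2**2)   # initially 0.625
```

Each block update holds the other block fixed, so each is individually reversible and volume-preserving, and the joint Hamiltonian stays nearly constant over the whole trajectory.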

Reparameterization of the Hyperparameters

Numerically, it is inconvenient to work with the hyperparameters as precisions, as negative values of precisions are invalid. Thus we reparameterize the hyperparameters, working with log precisions instead:

λ = log τ
λ_u = log τ_u
λ_v = log τ_v
λ_a = log τ_a
λ_b = log τ_b

We let Λ denote the set of reparameterized hyperparameters:

Λ = {λ, λ_u, λ_v, λ_a, λ_b}.

The posterior probability density is changed by this reparameterization. The new density is obtained by multiplying by the appropriate Jacobian of each variable transformation in turn:

P(θ, Λ | x, y) = P(θ, γ | x, y) · τ · Π_{*∈{u,v,a,b}} τ_*
= P(θ, γ | x, y) exp( λ + Σ_{*∈{u,v,a,b}} λ_* ).

And so the new potential energy is

E(θ, Λ) = −log P(θ, Λ | x, y)
= E(θ, γ) − λ − Σ_{*∈{u,v,a,b}} λ_*
= (e^λ/2)( α/ω + Σ_{c=1}^{N_c} |f(x^c) − y^c|² ) − ((N + α)/2) λ + Σ_{*∈{u,v,a,b}} Ẽ_*,

where

Ẽ_u = (e^{λ_u}/2)( α_u/ω_u + Σ_{i=1}^{N_u} u_i² ) − ((N_u + α_u)/2) λ_u,

and Ẽ_v, Ẽ_a and Ẽ_b are similarly defined.

Chapter Hyperparameter Updates Using Hamiltonian Dynamics

Reparameterization of the Weights

The current parameterization scheme has a weakness that can b e seen by considering

the p otential energy as a function of the weights U and their hyp erparameter In

u

the absence of data the p otential energy dep ends on U and simply through E and

u

u

we see that it is shallow and broad for lowvalues of and narrow and deep for higher

u

values but not to o high This is b ecause for xed E is quadratic in u with width

u i

u

p p

p

u u

e e is the prior standard prop ortional to This makes sense as

u

deviation of u When there is data the landscap e will b e changed somewhat but the

i

tendencies imp osed by the priors will still b e there

The effect of this shape of the potential energy function is to make it unlikely for a sample that starts in the broad, shallow region to end up in the narrow, deep region. The reason is that, since $H$ is approximately conserved during a leapfrog trajectory, a particle that enters the narrow, deep region from the shallow region has enough energy to escape out to the shallow region again, and will indeed likely do so before we catch it in the deep region, since the latter has a comparatively small volume. Similarly, a particle that starts off in a narrow, deep region will likely not have enough total energy to escape, unless it acquired an unusually large amount of energy during momentum resampling.

This situation is suboptimal, as it increases autocorrelations. To move around more easily in state space, we introduce the following reparameterization of the weights:

$$\tilde u_i = u_i \sqrt{\tau_u} = u_i\, e^{\gamma_u/2}, \qquad \tilde v_i = v_i\, e^{\gamma_v/2}, \qquad \tilde a_i = a_i\, e^{\gamma_a/2}, \qquad \tilde b_i = b_i\, e^{\gamma_b/2}$$

and we continue to use $\gamma_\alpha = \log \tau_\alpha$, where $\alpha \in \{u, v, a, b\}$.

The notation of the previous chapter will continue to apply, except we will use a tilde to indicate a reparameterized parameter. For instance, $\tilde U$ will represent the group of reparameterized input-to-hidden weights $\{\tilde u_i\}_{i=1}^{N_u}$, and $\tilde U_{ij}$ will represent a reparameterized weight from input unit $i$ to hidden unit $j$. We will also use the following naming for all the reparameterized weights:

$$\tilde\theta = \{\tilde U, \tilde V, \tilde A, \tilde B\}$$

Due to the reparameterization, the posterior probability of the parameters and the hyperparameters must change accordingly. The complete reparameterization is obtained from the earlier expression as

$$P(\tilde\theta, \Gamma \mid x, y) = P(\theta, \Gamma \mid x, y) \prod_{i=1}^{N_u} \frac{\partial u_i}{\partial \tilde u_i} \prod_{i=1}^{N_v} \frac{\partial v_i}{\partial \tilde v_i} \prod_{i=1}^{N_a} \frac{\partial a_i}{\partial \tilde a_i} \prod_{i=1}^{N_b} \frac{\partial b_i}{\partial \tilde b_i} = P(\theta, \Gamma \mid x, y)\, \exp\Big(-\frac12 \sum_{\alpha \in \{u, v, a, b\}} N_\alpha \gamma_\alpha\Big)$$

And so, under the reparameterization, the potential energy becomes

$$\tilde E = -\log P(\tilde\theta, \Gamma \mid x, y) = E + \frac12 \sum_{\alpha \in \{u, v, a, b\}} N_\alpha \gamma_\alpha = \frac{e^{\gamma_\delta}}{2} \sum_{c=1}^{N_c} |f(x^c) - y^c|^2 - \frac{N_c}{2}\gamma_\delta + \sum_{\alpha \in \{u, v, a, b\}} \tilde E_{\gamma_\alpha}$$

where

$$\tilde E_{\gamma_u} = \frac12 \sum_{i=1}^{N_u} \tilde u_i^2 - \gamma_u + \cdots$$

(with the unwritten terms again coming from the gamma priors on the precisions), and $\tilde E_{\gamma_v}$, $\tilde E_{\gamma_a}$ and $\tilde E_{\gamma_b}$ are similarly defined.

It can be seen that the quadratic term $\tilde u_i^2$ in $\tilde E_{\gamma_u}$ now has a constant coefficient, so this reparameterization is effective at removing the variation, with its hyperparameter, of the width of $\tilde u_i$'s potential bowl.
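The effect of the weight reparameterization on the prior term can be illustrated with a small sketch (function names are illustrative; only the single-weight prior term is shown):

```python
import math

def prior_energy_original(u, gamma_u):
    """Prior part of the potential for one weight u: (e^gamma_u / 2) u^2.
    Its curvature in u depends on the hyperparameter gamma_u."""
    return 0.5 * math.exp(gamma_u) * u * u

def prior_energy_reparam(u_tilde):
    """After the substitution u_tilde = u * e^(gamma_u / 2), the same
    prior term becomes (1/2) u_tilde^2, with constant curvature."""
    return 0.5 * u_tilde * u_tilde

# The two parameterizations give the same energy value for the
# corresponding points:
u, gamma_u = 0.7, 3.0
u_tilde = u * math.exp(0.5 * gamma_u)
assert abs(prior_energy_original(u, gamma_u) - prior_energy_reparam(u_tilde)) < 1e-12
```

The constant curvature of the reparameterized bowl is what allows a single stepsize to work across all values of the hyperparameter.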

To distinguish between the new methods with and without the reparameterization of the weights, we will call the new method before the weight reparameterization the Dynamical A method, and the new method with the weight reparameterization the Dynamical B method.

First Derivatives of the Potential Energy

Each leapfrog step update requires the first derivatives of the Hamiltonian with respect to the parameters and the hyperparameters. To take steps that are appropriately scaled in the various directions, for stability and efficiency, we need the second derivatives as well, which we give below. Here we give the expressions for the first derivatives.

We first define

$$L^c = \frac{e^{\gamma_\delta}}{2} |f(x^c) - y^c|^2$$

which is the negative log likelihood of one training case, less the normalizing term. We then obtain:

$$\frac{\partial \tilde E}{\partial \gamma_\delta} = -\frac{N_c}{2} + \sum_{c=1}^{N_c} L^c$$

$$\frac{\partial \tilde E}{\partial \gamma_\alpha} = \frac{\partial \tilde E_{\gamma_\alpha}}{\partial \gamma_\alpha} + \sum_{c=1}^{N_c} \frac{\partial L^c}{\partial \gamma_\alpha}$$

where, as usual, $\alpha = u$, $v$, $a$ or $b$, and

$$\frac{\partial \tilde E}{\partial \tilde U_{ij}} = \tilde U_{ij} + \sum_{c=1}^{N_c} \frac{\partial L^c}{\partial \tilde U_{ij}}$$

Derivatives for the other weight types are obtained from the expression above by replacing $\tilde U_{ij}$ by the corresponding parameter.

To compute the derivative $\partial \tilde E/\partial \gamma_\delta$, we need to compute the network outputs for every training case. This involves performing a forward pass through

the net for each training case, each pass requiring compute time of order the number of connections in the network. The fact that we do a forward pass means that using backpropagation to compute other first derivatives will be efficient. We now explain how these other derivatives are computed.

First Derivatives with Respect to Parameters

To compute the derivatives with respect to the parameters, we need to find the corresponding derivatives of the output $L^c$. They are most efficiently computed using backpropagation, provided certain results are stored during a preceding forward pass, such as the one required by the above computation of $\partial \tilde E/\partial \gamma_\delta$.

The backpropagation works as follows. Consider the figure below, which represents two arbitrary adjacent layers in the network.

[Figure: Two adjacent layers. A source unit $i$ (superscript $S$), with pre-nonlinearity activation $g_i^S$ and output $h_i^S$, feeds a destination unit $j$ (superscript $D$) through the weight $\tilde V_{ij}\, e^{-\gamma/2}$.]

Here $\tilde V_{ij}\, e^{-\gamma/2}$ is the weight connecting a source unit $i$ to a destination unit $j$. We use $g_i$ to denote the total activation of unit $i$ before the tanh nonlinearity, and $h_i$ to denote the output of unit $i$ after the nonlinearity. Source unit values are denoted by superscript $S$, while destination unit values have superscript $D$. The total input into destination unit $j$ is

$$g_j^D = \sum_i \tilde V_{ij}\, e^{-\gamma/2}\, h_i^S + \tilde b_j\, e^{-\gamma'/2}$$

where $\gamma'$ is the hyperparameter controlling the biases $\tilde b_j$. Thus the first derivative of the potential energy with respect to $\tilde V_{ij}$ can be computed as follows:

$$\frac{\partial L^c}{\partial \tilde V_{ij}} = \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \tilde V_{ij}} = \frac{\partial L^c}{\partial g_j^D}\, e^{-\gamma/2}\, h_i^S$$

A similar expression holds for derivatives with respect to biases. In the above, if $j$ is actually an output unit, then $g_j^D = f_j$, so

$$\frac{\partial L^c}{\partial g_j^D} = e^{\gamma_\delta}\, \big(f_j(x^c) - y_j^c\big)$$

The outputs $f_j(x^c)$ can once again be considered to have already been obtained

free from the forward pass. As in standard backpropagation, we start with the above error derivatives at the output layer and propagate them backwards using

$$\frac{\partial L^c}{\partial g_i^S} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial h_i^S}\, \frac{\partial h_i^S}{\partial g_i^S} = \operatorname{sech}^2(g_i^S) \sum_j \frac{\partial L^c}{\partial g_j^D}\, \tilde V_{ij}\, e^{-\gamma/2}$$

If $g_i$ was stored for every unit during the forward pass, the above derivatives can be computed rapidly in a backward pass, taking time of order the number of connections in the network for each training case. Actually, we will see below that if we can save only one quantity, the most useful one is

$$\zeta_j = e^{-\gamma/2} \sum_i \tilde V_{ij}\, h_i^S$$

which is the total input going into unit $j$ from all the units feeding into it. From $\zeta_j$, $g_j$ can easily be obtained in constant time. Furthermore, it will be useful in other calculations that will be presented in the subsequent discussion.

Note that, thus far, we have seen how the computation of $\partial \tilde E/\partial \gamma_\delta$ and the first derivatives with respect to all the parameters is dominated by a forward pass and a backward pass through the network.
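As an illustration of the forward pass and first-derivative backpropagation just described, here is a minimal numpy sketch for a single tanh hidden layer with a linear output and reparameterized weights, checked against a finite difference (all names, sizes and hyperparameter values are illustrative assumptions, not the thesis code):

```python
import numpy as np

def forward(x, V_t, b_t, w_out, gamma, gamma_b):
    """Forward pass: hidden input g = (V~ e^{-gamma/2})^T x + b~ e^{-gamma_b/2},
    hidden output h = tanh(g), and a linear output f = w . h."""
    g = V_t.T @ x * np.exp(-0.5 * gamma) + b_t * np.exp(-0.5 * gamma_b)
    h = np.tanh(g)
    return g, h, w_out @ h

def grad_V(x, y, V_t, b_t, w_out, gamma, gamma_b, gamma_d):
    """Backpropagate dL/dV~ for L = (e^{gamma_d}/2)(f - y)^2."""
    g, h, f = forward(x, V_t, b_t, w_out, gamma, gamma_b)
    # dL/dg_j: output error times output weight times sech^2 = 1 - tanh^2.
    dL_dg = np.exp(gamma_d) * (f - y) * w_out * (1.0 - h**2)
    # dg_j/dV~_ij = e^{-gamma/2} x_i, so dL/dV~_ij = x_i dL/dg_j e^{-gamma/2}.
    return np.outer(x, dL_dg) * np.exp(-0.5 * gamma)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 0.5
V_t, b_t, w_out = rng.normal(size=(3, 4)), rng.normal(size=4), rng.normal(size=4)
gam, gam_b, gam_d = 0.3, -0.2, 0.1

# Finite-difference check of one weight's derivative.
G = grad_V(x, y, V_t, b_t, w_out, gam, gam_b, gam_d)
eps = 1e-6
Vp = V_t.copy(); Vp[1, 2] += eps
def loss(V):
    f = forward(x, V, b_t, w_out, gam, gam_b)[2]
    return 0.5 * np.exp(gam_d) * (f - y)**2
assert abs((loss(Vp) - loss(V_t)) / eps - G[1, 2]) < 1e-4
```

The finite-difference agreement is the standard sanity check for a backpropagation implementation of this kind.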

First Derivatives with Respect to the Hyperparameters

The computation of $\partial \tilde E/\partial \gamma_\delta$ has already been described. To compute $\partial \tilde E/\partial \gamma_\alpha$, we need to compute the corresponding derivatives of $L^c$. In a similar fashion as the weights computation, we can write, for the hyperparameter $\gamma$ that controls weights,

$$\frac{\partial L^c}{\partial \gamma} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma} = -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, e^{-\gamma/2} \sum_i \tilde V_{ij}\, h_i^S = -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, \zeta_j$$

while for the bias hyperparameter $\gamma'$,

$$\frac{\partial L^c}{\partial \gamma'} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma'} = -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, \tilde b_j\, e^{-\gamma'/2}$$

We already computed the derivatives $\partial L^c/\partial g_j^D$ during the backward pass for the derivatives of the parameters. By ensuring that we save $\zeta_j$ during the forward pass, the first derivatives of $L^c$ with respect to the $u$, $v$, $a$ and $b$ hyperparameters can be efficiently computed in time of order the number of units in the network. Summed over all cases, the computational cost is $O(N_c (N_h + N_y))$.

Because the number of units is considerably smaller than the number of parameters, calculating these first derivatives with respect to the $u$, $v$, $a$ and $b$ hyperparameters is considered to add negligible cost to the forward and backward passes we have already done. Therefore, the computation of all the first derivatives is dominated by the forward and backward passes through the net, each of which takes $O(N_c |\theta|)$.

Approximations to the Second Derivatives of the Potential Energy

The second derivatives of the potential energy with respect to the weights and the hyperparameters are needed to compute stepsizes for each leapfrog step update. Unlike the first derivatives, we cannot compute the second derivatives exactly, because in order to preserve phase space conservation and reversibility of leapfrog steps, the stepsizes used for, say, hyperparameters cannot depend on the current values of the hyperparameters. We will also use additional simplifications to make evaluation easier and faster.

We see that the problem is to obtain, for the parameters,

$$\frac{\partial^2 \tilde E}{\partial \tilde u_i^2} = 1 + \sum_{c=1}^{N_c} \frac{\partial^2 L^c}{\partial \tilde u_i^2}$$

and similarly for $\tilde v_i$, $\tilde a_i$ and $\tilde b_i$; and, for the hyperparameters,

$$\frac{\partial^2 \tilde E}{\partial \gamma_\delta^2} = \sum_{c=1}^{N_c} L^c$$

and, for $\gamma$ taking the values $\gamma_u$, $\gamma_v$, $\gamma_a$ and $\gamma_b$,

$$\frac{\partial^2 \tilde E}{\partial \gamma^2} = \frac{\partial^2 \tilde E_{\gamma}}{\partial \gamma^2} + \sum_{c=1}^{N_c} \frac{\partial^2 L^c}{\partial \gamma^2}$$

Second Derivatives with Respect to the Parameters

To obtain the derivative $\partial^2 L^c/\partial \tilde u_i^2$, we follow the heuristic given by Neal (Appendix A), which we include here for completeness. The heuristic operates by approximately backpropagating the 2nd derivative of $L^c$ with respect to the output units back through the net. Its details are as follows.

Referring to the two-layer figure above, Neal uses the following approximation:

$$\frac{\partial^2 L^c}{\partial \tilde V_{ij}^2} \approx \frac{\partial^2 L^c}{\partial (g_j^D)^2} \left(\frac{\partial g_j^D}{\partial \tilde V_{ij}}\right)^2 \approx \begin{cases} \dfrac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma}\, x_i^2 & \text{for } i \text{ an input unit} \\[2ex] \dfrac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma} & \text{otherwise} \end{cases}$$

Correspondingly, for biases,

$$\frac{\partial^2 L^c}{\partial \tilde b_j^2} \approx \frac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma'}$$

We see that it is necessary to compute the derivatives $\partial^2 L^c/\partial (g_j^D)^2$. We will do so in a way analogous to the backpropagation of the first derivative $\partial L^c/\partial g_j^D$ described above. In the case that unit $j$ is an output unit, the derivative is fixed, namely

$$\frac{\partial^2 L^c}{\partial (g_j^D)^2} = \frac{\partial^2 L^c}{\partial (f_j^c)^2} = e^{\gamma_\delta}$$

This is propagated backwards to obtain the second derivatives of $L^c$ with respect to all the inputs $g_i$, except for the input units, whose derivatives are not needed. Neal propagates the derivatives using

$$\frac{\partial^2 L^c}{\partial (g_i^S)^2} \approx \sum_j \frac{\partial^2 L^c}{\partial (g_j^D)^2}\, \tilde V_{ij}^2\, e^{-\gamma}$$

Because we are not allowed to use the current values of the parameters, we replace $\tilde V_{ij}^2$ in the above by the estimate $1$, since $\tilde V_{ij}$ has variance $1$ at equilibrium. Thus we actually use

$$\frac{\partial^2 L^c}{\partial (g_i^S)^2} \approx \sum_j \frac{\partial^2 L^c}{\partial (g_j^D)^2}\, e^{-\gamma}$$

From the two equations above, we see that the second derivatives are the same for all training cases. Thus the backpropagation pass is done only once, regardless of the number of training cases, after which the second derivatives with respect to the parameters can be estimated in time of order equal to the number of parameters.
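A sketch of the resulting approximation for a single hidden layer feeding one output unit: the second derivative is fixed at $e^{\gamma_\delta}$ at the output, and each backpropagation step multiplies by the estimate $1$ for $\tilde V_{ij}^2$ and by $e^{-\gamma}$ (illustrative sketch; as noted above, the result is independent of the training case):

```python
import numpy as np

def approx_second_derivs(n_hidden, gamma, gamma_d):
    """Approximate d2L/dg^2 at the hidden layer: each hidden unit receives
    e^{gamma_d} from the single output unit, with V~^2 replaced by its
    equilibrium estimate 1 and a factor e^{-gamma} from the
    reparameterized hidden-to-output weight."""
    d2L_df2 = np.exp(gamma_d)                      # fixed at the output unit
    d2L_dg2 = np.full(n_hidden, d2L_df2 * np.exp(-gamma))
    return d2L_dg2

# With one output unit, every hidden unit gets the same estimate.
d2 = approx_second_derivs(4, gamma=0.5, gamma_d=0.2)
assert d2.shape == (4,) and np.allclose(d2, np.exp(0.2) * np.exp(-0.5))
```

Because nothing here depends on a training case, this pass is done once per leapfrog update rather than once per case.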

Second Derivatives with Respect to the Hyperparameters

To obtain $\sum_{c=1}^{N_c} L^c$, we need to calculate the network output for each training case. Like the computation for $\partial \tilde E/\partial \gamma_\delta$, this can be done with a forward pass for each case. Unlike that computation, we cannot use the current values of the hyperparameters. In this thesis we replace $\Gamma$ by the prior hyperparameter means $\bar\Gamma$:

$$\bar\gamma_\delta = \overline{\log \tau_\delta}, \qquad \bar\gamma_u = \overline{\log \tau_u}, \qquad \bar\gamma_v = \overline{\log \tau_v}, \qquad \bar\gamma_a = \overline{\log \tau_a}, \qquad \bar\gamma_b = \overline{\log \tau_b}$$

Thus we really compute

$$\frac{\partial^2 \tilde E}{\partial \gamma_\delta^2} \approx \sum_{c=1}^{N_c} \bar L^c$$

where $\bar L^c$ denotes $L^c$ evaluated with the hyperparameters set to $\bar\Gamma$.

For the second derivatives with respect to the other hyperparameters, we similarly replace all occurrences of $\Gamma$ by $\bar\Gamma$. We see that we need to estimate $\partial^2 L^c/\partial \gamma^2$. For this we will use the same kind of backpropagation as when estimating the second derivatives with respect to the parameters.

Once again referring to the two-layer figure, for $\gamma$ being either a hyperparameter for the weights or for the biases that contribute to the calculation of the $g_j^D$,

$$\frac{\partial^2 L^c}{\partial \gamma^2} = \frac{\partial}{\partial \gamma} \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma} = \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial^2 g_j^D}{\partial \gamma^2} + \sum_j \sum_{j'} \frac{\partial^2 L^c}{\partial g_j^D\, \partial g_{j'}^D}\, \frac{\partial g_j^D}{\partial \gamma}\, \frac{\partial g_{j'}^D}{\partial \gamma}$$

where we can make the replacement $\partial^2 g_j^D/\partial \gamma^2 = -\frac12\, \partial g_j^D/\partial \gamma$, which can easily be checked. For the second term, following Neal (Appendix A), we ignore multiple connections from the same unit, which amounts to dropping all cross terms ($j \neq j'$). We end up with

$$\frac{\partial^2 L^c}{\partial \gamma^2} \approx -\frac12 \sum_j \frac{\partial L^c}{\partial g_j^D}\, \frac{\partial g_j^D}{\partial \gamma} + \sum_j \frac{\partial^2 L^c}{\partial (g_j^D)^2} \left(\frac{\partial g_j^D}{\partial \gamma}\right)^2$$

 c D 

Wehave already seen howwe can estimate L g using the approximate back

j

propagation Eqn of Section The dierence here is that wecanusethe

actual values of the parameters but not the hyp erparameters Thus we replace by its

estimate in the backpropagation equation

 c  c

X

L L



V e

ij

S D

 

g g

i j

j

As before, we compute the above quantity in a backward pass, only down to the first hidden layer, and these derivatives are all independent of the training case.

For the second factor in a summation term, evaluation is straightforward. From the expression for $g_j^D$ we have

$$\frac{\partial g_j^D}{\partial \gamma'} = -\frac12\, \tilde b_j\, e^{-\bar\gamma'/2}$$

for a hyperparameter controlling biases, and

$$\frac{\partial g_j^D}{\partial \gamma} = -\frac12\, e^{-\bar\gamma/2} \sum_i \tilde V_{ij}\, h_i^S = -\frac12\, \bar\zeta_j$$

for a hyperparameter controlling weights. Once again, if we save $\bar\zeta_j$ during the forward pass used to obtain the network outputs for estimating $\partial^2 \tilde E/\partial \gamma_\delta^2$, the above factors can be

obtained almost for free. We emphasize here that this $\bar\zeta_j$ differs from the $\zeta_j$ used in the first derivative calculations, in that the forward pass during which $\bar\zeta_j$ is saved uses the estimates $\bar\Gamma$ instead of $\Gamma$.

Thus far, the dominating computation in estimating the second derivative of $\tilde E$ with respect to the hyperparameters is the estimation of $\partial^2 \tilde E/\partial \gamma_\delta^2$, which requires a forward pass for every training case.

Let us now turn our attention to the first derivative $\partial L^c/\partial g_j^D$ appearing in the first term above. Its computation was already discussed earlier, except that here it uses the estimates $\bar\Gamma$ instead of $\Gamma$, and the error propagated backwards comes from the forward pass used in estimating $\partial^2 \tilde E/\partial \gamma_\delta^2$.

The estimation of this derivative requires one backward pass, which must be done for each training case. Thus, combined with the forward passes for $\partial^2 \tilde E/\partial \gamma_\delta^2$, two passes through the net are necessary to estimate the second derivatives of $\tilde E$ with respect to the hyperparameters.

Summary of Compute Times

In this section we summarize the compute time required by the Dynamical B method per leapfrog update of both the parameters and hyperparameters. Recall that the Dynamical B method includes the reparameterization of the weights.

First, we summarize the compute times required to calculate each group of derivatives in the first table below. The compute times are dominated by passes through the network for each training case, which take time $O(N_c |\theta|)$, and we consider other operations as essentially free.

Examining a leapfrog update in detail, we see that it looks like the second table below.

When the parameters are changed (step 2), the first derivatives of both parameters and hyperparameters need to be recalculated in order to update the momenta in steps 3 and 4.

Group of Derivatives                                    Compute time
1st, with respect to parameters and hyperparameters     O(N_c |theta|)
2nd, with respect to parameters                         free
2nd, with respect to hyperparameters                    O(N_c |theta|)

Table: Cost of computing various groups of derivatives for the Dynamical B method.

Step    Description
1       Update momentum of parameters
2       Update parameters
3       Update momentum of parameters
4       Update momentum of hyperparameters
5       Update hyperparameters
6       Update momentum of hyperparameters

Step 1 is then repeated for the next leapfrog update.

Table: The steps in one complete leapfrog update in the Dynamical B method. The Dynamical A method has the same sequence of steps comprising one leapfrog update.

The cost is one forward-backward pass pair. Also, at the end of step 2, the second derivatives with respect to the hyperparameters need to be recalculated for use in steps 4 through 6, taking a second forward-backward pass pair. After the update of the hyperparameters at step 5, the first derivatives need to be recalculated again for the momentum updates at step 6 and step 1 of the next complete leapfrog update. This requires a third forward-backward pass pair. Finally, the second derivatives with respect to the parameters also need to be recalculated for use in the next iteration, but this is essentially free.

Thus each leapfrog update costs three forward-backward pass pairs. This result will be used later in determining how long to let the reparameterized new method run when comparing its performance to the old method.
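The six steps can be sketched generically as follows, assuming the momentum updates are the usual leapfrog half-steps; the gradient functions and stepsizes here are stand-ins for the derivative computations and stepsize heuristics described in this chapter:

```python
def leapfrog_update(theta, p_theta, gamma, p_gamma,
                    grad_theta, grad_gamma, eta_p, eta_h):
    """One complete leapfrog update in the style of the Dynamical methods:
    half-step / full-step / half-step for the parameters, then the same
    pattern for the hyperparameters.  grad_theta(theta, gamma) and
    grad_gamma(theta, gamma) return first derivatives of the potential
    energy (stand-ins here); eta_p and eta_h are the stepsizes."""
    p_theta = p_theta - 0.5 * eta_p * grad_theta(theta, gamma)   # step 1
    theta = theta + eta_p * p_theta                              # step 2
    p_theta = p_theta - 0.5 * eta_p * grad_theta(theta, gamma)   # step 3
    p_gamma = p_gamma - 0.5 * eta_h * grad_gamma(theta, gamma)   # step 4
    gamma = gamma + eta_h * p_gamma                              # step 5
    p_gamma = p_gamma - 0.5 * eta_h * grad_gamma(theta, gamma)   # step 6
    return theta, p_theta, gamma, p_gamma
```

On a simple quadratic potential, a single update of this form approximately conserves the Hamiltonian, which is the property the discussion above relies on.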

Computation of Stepsizes

We have described our heuristic for approximating the second derivative of the potential energy with respect to the hyperparameters. The stepsize is then computed as the inverse square root of that second derivative. But because this heuristic uses the approximations above, second derivatives have the possibility of being negative, and square roots would then be imaginary. We note that when the second derivative becomes negative, it merely indicates that the potential energy surface is now concave downwards, but its magnitude should still be indicative of the length scale of the surface variations in the region. Thus the negative sign is really no problem, and we take absolute values to obtain the stepsize as

$$\epsilon_\gamma = \left| \frac{\partial^2 \tilde E}{\partial \gamma^2} \right|^{-1/2}$$

Similarly, the stepsizes for the parameters are obtained as

$$\epsilon_{\tilde u_i} = \left| \frac{\partial^2 \tilde E}{\partial \tilde u_i^2} \right|^{-1/2}$$

and similarly for the other parameter types.
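A one-line sketch of this stepsize heuristic, with the absolute value guarding against negative curvature (the adjustment factor is the tuning parameter discussed later):

```python
def stepsize(second_deriv, adjust=1.0):
    """Stepsize heuristic: epsilon = adjust * |d2E|^{-1/2}.  The absolute
    value handles negative curvature (concave-down regions), where the
    magnitude still sets the local length scale."""
    return adjust * abs(second_deriv) ** -0.5

assert stepsize(4.0) == 0.5
assert stepsize(-4.0) == 0.5   # negative curvature handled by abs()
```

The same formula is applied to every parameter and hyperparameter coordinate, each with its own approximate second derivative.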

Compute Times for the Dynamical A Method

We also give the compute time for one leapfrog update for the Dynamical A method, as this has to be taken into account later in the performance comparison. Recall that the Dynamical A method is the new method before the weight reparameterization.

The computation of the first and second derivatives with respect to the weights in this scheme is not significantly different from Dynamical B's. The algorithmically demanding portions of these computations, the derivatives of $L^c$ with respect to a weight and the backpropagation given above, work the same way, except that the factors of $e^{-\gamma/2}$ that always go with the reparameterized weights are missing. Thus first

derivatives with respect to the weights also require one forward and one backward pass for each training case, while second derivatives are essentially free.

The computation of the first and second derivatives with respect to the hyperparameters differs substantially, however, because the hyperparameters do not appear in the computation of the network output $f$ in this case. The first derivatives are

$$\frac{\partial E}{\partial \gamma_\delta} = -\frac{N_c}{2} + \frac{e^{\gamma_\delta}}{2} \sum_{c=1}^{N_c} |f(x^c) - y^c|^2 + \cdots$$

and

$$\frac{\partial E}{\partial \gamma_u} = -\frac{N_u}{2} + \frac{e^{\gamma_u}}{2} \sum_{i=1}^{N_u} u_i^2 + \cdots$$

(where the terms not written out come from the priors on the hyperparameters), and similarly for the hyperparameters of $v$, $a$ and $b$.

The first derivatives with respect to $\gamma_\alpha$, where $\alpha = u$, $v$, $a$ and $b$, take time of order the number of parameters, independent of the number of training cases. This makes them of lower order time complexity than the forward and backward passes for computing the first derivatives of the parameters. The computation of $\partial E/\partial \gamma_\delta$ requires all the network outputs for each training case, but these were already computed during the forward passes for the first derivatives with respect to the parameters, so we essentially get this derivative for free as well.

The second derivatives with respect to the hyperparameters are

$$\frac{\partial^2 E}{\partial \gamma_\delta^2} = \frac{e^{\bar\gamma_\delta}}{2} \sum_{c=1}^{N_c} |f(x^c) - y^c|^2 + \cdots$$

and

$$\frac{\partial^2 E}{\partial \gamma_u^2} = \frac{e^{\bar\gamma_u}}{2} \sum_{i=1}^{N_u} u_i^2 + \cdots$$

and similarly for the hyperparameters of $v$, $a$ and $b$; as before, the current hyperparameter values cannot be used, so the prior means $\bar\gamma$ appear.

The compute time for each second derivative is essentially the same as that for the first derivative with respect to the same hyperparameter. Once again, we can use the

network outputs already computed for the first derivatives with respect to the parameters. Note that this differs from the case of the reparameterized weights, because there the hyperparameters are involved in computing the network outputs, and so the outputs computed during the forward passes for the first derivatives cannot be used, as the second derivatives with respect to the hyperparameters cannot use the current values of the hyperparameters. That forced us to redo the passes through the network with estimates for the hyperparameters, but we do not have to do that here, thereby saving computation.

We summarize the various compute costs in the table below.

Group of Derivatives                                    Compute time
1st, with respect to parameters and hyperparameters     O(N_c |theta|)
2nd, with respect to parameters                         free
2nd, with respect to hyperparameters                    free

Table: Cost of computing various groups of derivatives for the Dynamical A method.

Like before, one leapfrog update requires the computation of all the first derivatives twice. Therefore, one leapfrog update requires two forward-backward pass pairs in this case.

Chapter

Results

Training Data

To verify the new methods and compare their performance with the old method, the synthetic data set shown in the table below was used. We use a small data set to reduce the compute time required to obtain the results for this thesis. Even then, months passed before all the necessary runs were completed.

[Table: The training data, which has two inputs (x, y) and one output (z); eight cases are listed. These data are plotted in the figure below.]

[Figure: The points comprising the synthetic data set used. There are 2 inputs, x and y, and target z. The training data are shown using asterisks. The circles represent the training data before the addition of Gaussian noise, while the crosses are the input data drawn on a plane of constant z.]

[Figure: The surface from which the synthetic data was taken.]

The training data was synthesized as follows. The inputs $(x_i, y_i)$ were drawn uniformly. The mapping used on each input pair was calculated using a fixed function $f(x, y)$ built from polynomial terms in $x$ and $y$. This mapping is illustrated in the figure above. Gaussian noise was then added to each function output to obtain the target.

Verification of the New Methods

To verify the correctness of the new methods, the posterior distribution was obtained using all the methods for a network of hidden units on the above training data. The old Gibbs sampling program, as implemented by Neal, was treated as the standard against which the new programs were compared.

The priors for the hyperparameters are specified as gamma distributions, using their two parameters. For the demonstration of the correctness of the new methods, they were set as in the table below.

[Table: Settings for the parameters specifying the priors of the hyperparameters (one pair of gamma-prior parameters for each of the u, v, a and b groups).]

Fairly long runs were done with all three methods at the settings in the table below.

[Table: Settings for the various methods (number of leapfrog steps l, stepsize adjustment factor, and how often samples were saved), used to verify the correctness of the new programs. For the new methods, the parameter and hyperparameter stepsize adjustment factors were set equal to each other, and their common value is indicated by a single column in this table.]

Results From the Old Method

The output surface as predicted from samples obtained using the Gibbs update method is shown in the figure below. We see that the data points are being fitted reasonably. The fact that the points are being fitted implies that the posterior distribution has changed from the prior. Indeed, from the marginal plots below, we see that the marginal posterior distributions of the hyperparameters differ from the marginal prior distributions.

We also show the correlation between the input-to-hidden and the hidden bias hyperparameters. As expected, when larger input-to-hidden weights are allowed, larger biases with the opposite sign are required to compensate. This is because the output function cannot be composed of hidden units that all saturate, so the input into at least some of the hidden units must be kept small.

Also, we show how, as the number of hidden units increases, the hidden bias hyperparameter becomes more pinned to the actual standard deviation of the hidden biases, and vice versa. This is manifested as an increased correlation between them. This is a direct demonstration of the problem we set out to solve.

Results of New Methods Compared with the Old

To verify the correctness of the programs implementing the new methods, we compare the posterior distributions obtained using the new programs against that from the old.

[Figure: The surface predicted by the samples obtained from the master run. The training data are shown using asterisks, while the crosses are the input data drawn on a plane of constant z.]

To obtain the posterior distribution, any samples near the beginning found by visual inspection not to be in equilibrium were first dropped; the samples from each method were then used to plot the histogram of each hyperparameter. Here we look at the histogram of $\log \sigma$, which is the log of each hyperparameter specified as a standard deviation. Each such histogram approximates the marginal distribution of a hyperparameter.

We need to be able to compare the joint distribution over hyperparameters for two methods. Because we are unable to plot a distribution over the joint five-dimensional space of all the hyperparameters, we compare marginal distributions instead. This comparison is valid because, if the marginal distributions of the hyperparameters match for two methods, then they almost certainly have the same joint distribution over hyperparameters. While it is true that, in principle, equal marginal distributions do not imply equal joint distributions, the fact that the marginals match would be too amazing a coincidence to be explained any other way than by concluding that the joint distributions match.

[Figure: Marginal prior and posterior densities of the hyperparameters $\tau_\delta$, $\tau_u$, $\tau_v$, $\tau_a$ and $\tau_b$. Dotted lines represent prior densities obtained analytically, while solid lines represent posterior marginal distributions obtained from the Gibbs sampling method. These plots show that the posterior distributions of the hyperparameters have changed from the priors.]

[Figure: Correlation between the input-to-hidden ($\sigma_u$) and hidden bias ($\sigma_a$) hyperparameters, obtained using the old method.]

[Figure: The correlation between the standard deviation of the hidden biases, $\log \mathrm{SD}(a_i)$, and their hyperparameter, $\log \sigma_a$, becomes stronger as the hidden layer size increases. Panel (a) shows the smaller number of hidden units; panel (b) the larger.]

As can be seen in the comparison figures below, the marginals for all the hyperparameters obtained by the programs running both the new methods match those of the original as implemented by Neal.

Methodology for Evaluating Performance

Having shown that the new methods have been correctly implemented, we are now ready to assess their performance.

A Markov chain Monte Carlo method typically goes through a burn-in phase before settling down to equilibrium. Before reaching equilibrium, its samples are not representative of its invariant distribution. As these non-representative samples should be discarded in order not to skew later estimates, the speed with which a Markov chain equilibrates is a matter of interest. However, due to time constraints, we will not be considering this question. Instead, we will only assess the relative performances of the old and the new methods in moving about in the posterior distributions of the hyperparameters once equilibrium has been reached. We do not consider the speed with which the posterior of the parameters is explored, as there can be large numbers of parameters, and it is difficult to know which parameters to compare, as they can sometimes take on different roles.

In addition to examining the performances of the original method and the Dynamical B method, we also look at the performance of the Dynamical A method, to ascertain whether the reparameterization of the weights is indeed beneficial.

The performance of the methods depends on the number of leapfrog steps allowed in one trajectory and the stepsize adjustment factor. These can be viewed as tuning parameters that affect the efficiency of each method. Since performance can vary dramatically depending on the setting of these tuning parameters, it is only fair to compare how well the methods work when optimally tuned.

For the old method, the tuning parameters are l, the number of leapfrog steps allowed

[Figure: Marginal posterior histograms of $\log \sigma_\delta$, $\log \sigma_u$, $\log \sigma_v$, $\log \sigma_a$ and $\log \sigma_b$. Crosses represent the distribution obtained from the original Gibbs update method, while circles represent that of the Dynamical A update method. Each distribution has been normalized to have unit area.]

[Figure: Marginal posterior histograms of $\log \sigma_\delta$, $\log \sigma_u$, $\log \sigma_v$, $\log \sigma_a$ and $\log \sigma_b$. Crosses represent the distribution obtained from the original Gibbs update method, while circles represent that of the Dynamical B update method. Each distribution has been normalized to have unit area. Note that the distribution for the original method looks slightly different from that of the plots for the previous comparison with the Dynamical A method, because the binning for the histograms is slightly different.]

in one trajectory, and the stepsize adjustment factor. For the new methods, l is also a tuning parameter, but we now have two stepsize adjustment factors: $\eta_p$ for the parameters and $\eta_h$ for the hyperparameters. We have two stepsize adjustment factors because we might wish to control how fast the parameters move compared to the hyperparameters, in order to obtain the best performance. Moreover, the different heuristics with which the stepsizes are computed for the parameters and the hyperparameters mean that their relative magnitudes might be quite different, which also suggests separate stepsize adjustment factors. However, due to time constraints, we will not explore the problem of how to set the two adjustment factors separately, and will instead set them equal to each other, $\eta_p = \eta_h$, and call the common value $\eta$, with the understanding that the performance of the new methods could be increased if the two factors were not set equal to each other.

The performance of a method on a hyperparameter is assessed using the variance of means measure, whose presentation we delay till the next section. For a given setting of the tuning parameters, this measure can be computed for each hyperparameter. The smaller the measure is, the more efficiently the method explores the marginal posterior distribution of that hyperparameter. Since we wish to compare the methods when they are operating optimally, we will compare their variance of means measures at optimal settings of l and $\eta$.

To find the optimal setting of l and $\eta$ for a given method, we run it over a grid of settings in tuning parameter space, measuring its variance of means performance at each setting. The geometric mean of the variance of means of the hyperparameters is computed to obtain a single measure that combines the performances over the different hyperparameters. In computing the geometric mean, we leave out the variance of means measure for the output bias hyperparameter, as that hyperparameter controls only one parameter for this network and is therefore not that meaningful. The optimal setting of the tuning parameters is then picked as the setting that minimizes the geometric mean.

It is not safe to directly use the variance of means at this optimal setting to compare

Chapter Results

the performances of the methods, however, as the measures are biased downwards as a result of this selection process. Instead, the programs are then rerun with different random seeds to re-obtain the variance of means measures, so as to avoid the bias. In the next section we describe the variance of means measure in greater detail.

The Variance of Means Measurement of Performance

The variance of means measure characterizes how well a method explores the posterior distribution of a hyperparameter at a fixed setting of the tuning parameters. To obtain this measure, we run several Markov chains, each started from an independent point drawn from the posterior distribution of the parameters and hyperparameters. Each chain is run for a fixed number N_f of leapfrog steps, regardless of what l is.

For a given setting of the tuning parameters, we obtain the means of the hyperparameters sampled by each chain. The entire chain of N_f leapfrog steps can be thought of as being divided up into a fixed number N_s of supertransitions, each comprised of N_f / N_s leapfrog steps. Although each leapfrog trajectory yields a sample from the correct distribution, we use only one sample per supertransition to compute the means of the hyperparameters for each chain. Thus, regardless of the value of l, the same number of samples N_s is used to compute the means, as shown in the table below. For the i-th chain, the hyperparameter means obtained are

    \overline{\log \sigma}^{(i)}, \quad \overline{\log \sigma_u}^{(i)}, \quad \overline{\log \sigma_v}^{(i)}, \quad \overline{\log \sigma_a}^{(i)}, \quad \overline{\log \sigma_b}^{(i)}

where, for instance, each mean \overline{\log \sigma_u}^{(i)} is computed as follows:

    \overline{\log \sigma_u}^{(i)} \;=\; \frac{1}{N_s} \sum_{j=1}^{N_s} \log \sigma_u^{(ij)}

Here σ refers to a hyperparameter expressed as a standard deviation. We take the log before computing the mean, as experience shows that the standard deviation can vary over several orders of magnitude, and yet variations on a small scale are as interesting as variations on a scale a few orders of magnitude larger.
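As a concrete sketch of this averaging, the per-chain means can be computed as follows (a minimal NumPy illustration; the array layout and the function name are ours, not the thesis's):

```python
import numpy as np

def chain_log_means(sigma_samples):
    """Per-chain means of log(sigma).

    sigma_samples has shape (n_chains, N_s): one hyperparameter value
    (expressed as a standard deviation) per supertransition per chain.
    Averaging in the log domain keeps small-scale and large-scale
    variation on an equal footing."""
    return np.log(np.asarray(sigma_samples)).mean(axis=1)

samples = [[1.0, 1.0, 1.0, 1.0],        # chain 0: sigma constant at 1
           [np.e, np.e, np.e, np.e]]    # chain 1: sigma constant at e
print(chain_log_means(samples))         # -> [0. 1.]
```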

[Table: for each chain i = 1, …, N_m, the supertransition samples \sigma^{(ij)}, \sigma_u^{(ij)}, \sigma_v^{(ij)}, \sigma_a^{(ij)}, \sigma_b^{(ij)}, j = 1, …, N_s, and the resulting per-chain means \overline{\log \sigma}^{(i)}, \overline{\log \sigma_u}^{(i)}, \overline{\log \sigma_v}^{(i)}, \overline{\log \sigma_a}^{(i)}, \overline{\log \sigma_b}^{(i)}.] Each chain is run for N_s supertransitions at some setting of l and η; the samples obtained from the supertransitions are used to compute the means of each hyperparameter.

If we run N_m Markov chains, we will have N_m values of \overline{\log \sigma}. The variance of means measure that we have been talking about is then just the variance of these values of \overline{\log \sigma}. We define the quantities \hat{\sigma}^2 to be these variance of means measures:

    \hat{\sigma}^2 = \mathrm{Var}[\overline{\log \sigma}], \quad
    \hat{\sigma}_u^2 = \mathrm{Var}[\overline{\log \sigma_u}], \quad
    \hat{\sigma}_v^2 = \mathrm{Var}[\overline{\log \sigma_v}], \quad
    \hat{\sigma}_a^2 = \mathrm{Var}[\overline{\log \sigma_a}], \quad
    \hat{\sigma}_b^2 = \mathrm{Var}[\overline{\log \sigma_b}]

Each variance is calculated using the mean estimated from a very long run of the old method, which we assume to be very close to the true mean. For example, if the mean for \log \sigma_u obtained from a very long run of the old method is \overline{\log \sigma_u}^{\,*}, then we compute the variance from the N_m Markov chains as

    \hat{\sigma}_u^2 \;=\; \frac{1}{N_m} \sum_{i=1}^{N_m} \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2

This is the variance of means measure of performance, with lower variance indicating better performance.
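A minimal sketch of this estimate (NumPy; the names are ours). Note that the deviations are taken about the externally supplied reference mean, not about the sample mean of the chains:

```python
import numpy as np

def variance_of_means(chain_log_means, ref_log_mean):
    """Variance-of-means measure: mean squared deviation of the
    per-chain means of log(sigma) about the reference mean estimated
    from a very long run of the old method (NOT about their own
    sample mean)."""
    d = np.asarray(chain_log_means) - ref_log_mean
    return float(np.mean(d ** 2))

print(variance_of_means([0.9, 1.1, 0.8, 1.2], 1.0))   # ≈ 0.025
```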

Let T_u be the inefficiency factor, in units of supertransitions, for the N_s samples of \log \sigma_u. T_u measures the worth of each supertransition sample of \log \sigma_u relative to one independent sample of \log \sigma_u. For example, if T_u is 2, then it takes twice as many supertransitions to obtain a given variance of \overline{\log \sigma_u} as would be needed using independent samples. Mathematically, the variance of means measures are related to inefficiency factors as follows:

    \hat{\sigma}_u^2 \;\approx\; \mathrm{Var}[\overline{\log \sigma_u}] \;=\; \frac{T_u \, \mathrm{Var}[\log \sigma_u]}{N_s}

Since Var[\log \sigma_u] is a constant property of the posterior, and we are using the same N_s for all tuning parameter settings, our estimate of Var[\overline{\log \sigma_u}] is proportional to T_u. Although this T_u is for this particular number of samples N_s only, we would expect that if a method has a lower T_u than another for this N_s, then it really is more efficient at exploring the posterior distribution, and so it should remain better for a different N_s. Thus this measure is indicative of performance in general.
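Rearranged, this relation lets the inefficiency factor be read off a variance-of-means estimate; a small illustrative helper (ours, not the thesis's code):

```python
def inefficiency_factor(var_of_means, posterior_var, n_super):
    """Invert  var_of_means = T * posterior_var / N_s  for T.

    T = 1 means supertransition samples behave like independent
    draws; T = 2 means twice as many supertransitions are needed to
    reach a given variance of the mean."""
    return n_super * var_of_means / posterior_var

print(inefficiency_factor(var_of_means=0.02, posterior_var=1.0, n_super=100))  # -> 2.0
```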

Error Estimation for Variance of Means

Error bars on the variance of the means \hat{\sigma}_u^2 are obtained as follows. Since \hat{\sigma}_u^2 is calculated as the mean of the squared deviations \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2, its variance is given by

    \mathrm{Var}[\hat{\sigma}_u^2] \;\approx\; \frac{1}{N_m} \, \mathrm{Var}\!\left[ \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2 \right]

which is valid so long as the chains are independent. We ensure this by picking their starting points sufficiently far apart from a long master run, and by using a different random number seed for each chain.

Geometric Mean of Variance of Means

Rather than characterizing performance by the variances of the hyperparameter means for all the hyperparameters separately, we define the following single geometric-mean scalar measure of performance:

    g \;=\; \left( \hat{\sigma}^2 \, \hat{\sigma}_u^2 \, \hat{\sigma}_v^2 \, \hat{\sigma}_a^2 \right)^{1/4}

We obtain error bars for g by bootstrapping (see Efron and Tibshirani) as follows. Let Z be the original set of N_m Markov chains; thus Z determines a single value of g. A bootstrap realization Z* is a set of N_m Markov chains sampled uniformly with replacement from the original chains. During bootstrapping, many bootstrap realizations Z* are generated from Z. The idea is that the empirical distribution represented by Z contains within it the natural variations that g has over the true distribution from which Z is drawn. So a value of g can be calculated from each Z*, and the histogram of these resulting g's gives an estimate for the actual distribution of g. We quote error bars as a central confidence interval of this histogram, i.e. we quote the interval (g_lo, g_hi), where g_lo and g_hi are such that equal fractions of the values of g obtained from the Z* fall below g_lo and above g_hi. We use many bootstrap samples to obtain these error bars.
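A sketch of this percentile bootstrap for g (NumPy; the toy data, the 5%/95% tails, and the 1000 realizations are illustrative assumptions, not the thesis's actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def geometric_mean_g(sq_dev):
    """g for one set of chains: the geometric mean, over
    hyperparameters, of the variance-of-means estimates (column means
    of the squared deviations)."""
    return float(np.exp(np.mean(np.log(sq_dev.mean(axis=0)))))

def bootstrap_g(sq_dev, n_boot=1000, tail=5.0):
    """Percentile-bootstrap interval for g: resample whole chains
    (rows) with replacement and recompute g each time."""
    n = len(sq_dev)
    gs = [geometric_mean_g(sq_dev[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.percentile(gs, [tail, 100.0 - tail])

# Toy data: squared deviations (log chain mean minus log reference)^2,
# 30 chains x 4 hyperparameters (output-bias hyperparameter excluded).
sq_dev = rng.gamma(2.0, 0.05, size=(30, 4))
lo, hi = bootstrap_g(sq_dev)
print(lo, hi)
```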

Iterations Allowed for Each Method

As shown above, \hat{\sigma}^2 is proportional to the inefficiency factor of the method in units of supertransitions. In order for comparisons of variances between two methods to make sense, the amount of compute time used per supertransition should be the same in both cases. However, the compute time of a program is tricky to calculate from things such as the number of multiplications, as programming style can affect it. Thus, in this thesis, the amount of work that goes into a supertransition is estimated in algorithmic terms, in which we assume that the number of training cases N_c is large and the number of hidden units N_h is also large, so that the number of weights |θ| is large. This results in terms of order N_c |θ| dominating the time complexity, which comes entirely from complete sweeps through the neural network for each training case.

In the old method, each trajectory is composed of the steps of sampling the hyperparameters, computing the stepsizes, and performing the leapfrog steps. First, we note that each leapfrog step requires the evaluation of the derivative of the potential energy with respect to each parameter, for instance

    \frac{\partial E}{\partial u_i} \;=\; \frac{1}{\sigma^2} \sum_{c=1}^{N_c} \left( f(x^c) - y^c \right) \frac{\partial f(x^c)}{\partial u_i}

which is most efficiently computed for each training case using a forward sweep through the network for f(x^c) - y^c, and a backward sweep for \partial f(x^c) / \partial u_i. Each of the sweeps takes O(N_c |θ|) compute time for all the training cases. By comparison, the sampling of the hyperparameters is dominated by the computation of the sum of the squares of the weights, and by the computation of the total squared error of the network output. The former takes O(|θ|), while the latter costs only O(N_c), since the network error has already been computed at the end of the last leapfrog step for the Metropolis rejection test. So both can be neglected when compared to the compute time required for a leapfrog step. The computation of the stepsizes is also free by comparison, because its summation over training cases (Neal, Appendix A) can be factored out and computed at the very beginning of the program. Thus each supertransition in the old method is approximately dominated by the leapfrog steps, each of which takes a pair of forward/backward sweeps, each of which takes O(N_c |θ|) compute time. We summarize this, along with the compute times for the new methods, in the table below.
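This bookkeeping can be phrased as a toy cost model (ours, not the thesis's); the per-leapfrog-step pass counts of 2, 4, and 6 are an assumption read off the stated ratios (Gibbs does one forward/backward pair, and the dynamical methods cost two and three times that per step), with only the dominant O(N_c |θ|) sweeps counted:

```python
def passes_per_supertransition(method, leapfrog_steps):
    """Network passes per supertransition, counting only the dominant
    O(N_c * |theta|) forward/backward sweeps.  The per-step counts
    (2, 4, 6) are an assumption implied by the stated compute ratios,
    not values quoted from the thesis's table."""
    per_step = {"gibbs": 2, "dynamical_a": 4, "dynamical_b": 6}
    return per_step[method] * leapfrog_steps

# Equalize compute: give every method the same pass budget.
budget = passes_per_supertransition("gibbs", 600)   # 1200 passes
print(budget // 4, budget // 6)                     # steps allowed for A and B -> 300 200
```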

Hyperparameter updates by    Network passes per leapfrog step
Gibbs                        2
Dynamical A                  4
Dynamical B                  6

Table: network passes required per leapfrog step under each method. For a detailed explanation of the number of passes required for the two dynamical methods, please refer to the corresponding earlier section.

Because of these results, the Dynamical B method is only allowed a third as many leapfrog steps per supertransition as the old method, while the Dynamical A method is allowed half. These measures ensure that each supertransition uses approximately the same amount of compute time regardless of method.

Markov chain start states

Master Runs

As mentioned earlier, each chain is started from a state chosen from equilibrium. To obtain the starting points for a particular network, a very long run was done using the old method, typically resulting in several hundred thousand to a million or more samples. Any initial portion of the run not in equilibrium was discarded, and starting points were then obtained from the remaining samples. These starting states were spaced many inefficiency factors apart, so that they are likely to be nearly independent points that represent the posterior distribution well.

The long run was done with a relatively small stepsize adjustment factor, so that the rejection rate is low. This is because chains using large stepsize adjustment factors may sometimes be unable to enter certain regions of state space where the rejection rate is high: once such a chain enters such a region, it is likely to stay there for a long time due to the high rejection rate, and because it would remain stuck there for a long time, it follows that it is unlikely to enter that region in the first place. Thus, using a low rejection rate reduces the risk of overlooking such regions.

Long runs were obtained for networks at each of the hidden layer sizes tested. The table below shows the prior settings used for these networks, while a further table summarizes the run settings used, the number of samples obtained, and the resulting rejection rates.

[Table: prior settings for the hyperparameters σ, σ_u, σ_v, σ_a, and σ_b, for each of the neural net architectures tested.]

In order to ensure that our start states are from the equilibrium distribution, the hyperparameters from the master runs were visually checked for the absence of long-term trends.

[Table: tuning parameter settings (l and η), number of samples collected, and rejection rates for the master run at each hidden layer size.] Rejection rates were chosen to be relatively low, to reduce the risk of not being able to enter regions where the rejection rates are high. For all the runs, only one sample was saved out of every several collected.

Starting States Used

Variance of means measures were obtained for networks at each hidden layer size. This requires several Markov chain starting states for each number of hidden units. This section shows in detail how these states were obtained.

The table below shows the inefficiency factors of \log \sigma_u, as that was found to be the slowest-moving hyperparameter, i.e. it had the highest inefficiency factor. We use σ_u to denote the hyperparameter expressed as a standard deviation, which is often easier to understand. We compute the inefficiency factor of \log \sigma_u, rather than that of σ_u itself, as the inefficiency factor is then the same regardless of what power σ_u is raised to, thus removing any question as to whether it is more appropriate to measure the inefficiency factor of the variance or of the standard deviation.

As each chain used in the variance of means measure should be started from an independent point, we use samples from the master run spaced many inefficiency factors apart.

All the variance of means measures were computed using the same number of chains. The start states of the first chains are evenly spaced, as shown in the table below. It was decided to add more chains because it was empirically observed that the different methods seem to do very differently on chains starting at large \log \sigma_u and \log \sigma_a. Specifically, it appeared from the first chains that the old method and the Dynamical A method do badly on chains starting at large values of these hyperparameters, whereas the Dynamical B method does well. Because such large hyperparameter values appear rarely in the posterior distribution, there are few of them among the first chains; so, assuming that they do have an important effect on the results, it is necessary to oversample the region with large values of \log \sigma_u and \log \sigma_a, and then downweight those points accordingly in the computation of the variance of the means. Otherwise, one might by chance compute the variance of means without any chain starting at large values of those hyperparameters, and erroneously conclude that all the methods perform similarly.

[Table: for each hidden layer size, the start-state separation in samples, the corresponding separation in units of T_u, and the number of initial samples dropped.] The first starting states are drawn so that they are many inefficiency factors T_u apart; T_u is the inefficiency factor of \log \sigma_u.

To ensure that we have a decent number of points from the region G, in which \log \sigma_u or \log \sigma_a (or both) is large, it was decided to stratify the starting points, with a fixed fraction of them outside G and the rest inside. Of the first points chosen according to the separations in the table above, some are already in G. So the choice of the additional points was done as follows: for each number of hidden units, we chose enough additional points in G to bring the points in G up to the desired total, and we also chose enough points outside of G so as to obtain the desired number of them.

To obtain these new points, a long sequence of many points was obtained from the master file at fixed separations of at least several inefficiency factors. For some hidden layer sizes, the new points were obtained past the end of the portion of the master run used to obtain the first starting states; for the others, the new points used the portion already used for the first states, and beyond if available, taking care to always maintain a separation of many inefficiency factors from the first points. This sequence was then uniformly sampled to obtain the desired numbers of points in G and outside of G.

Modified Performance Measures Due to Stratification

The stratification of the starting states does change the calculation of the variance of means measure and its error somewhat. With N_G equal to the number of points taken from region G, and N_{\bar G} = N_m - N_G the number of points taken from outside it, the variances of means within each stratum for hyperparameter σ_u are

    \hat{\sigma}_{u,G}^2 \;=\; \frac{1}{N_G} \sum_{i=1}^{N_G} \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2

    \hat{\sigma}_{u,\bar{G}}^2 \;=\; \frac{1}{N_{\bar G}} \sum_{i=N_G+1}^{N_m} \left( \overline{\log \sigma_u}^{(i)} - \overline{\log \sigma_u}^{\,*} \right)^2

so that the overall variance of means that takes stratification into account is

    \hat{\sigma}_u^2 \;=\; p \, \hat{\sigma}_{u,G}^2 + (1-p) \, \hat{\sigma}_{u,\bar{G}}^2

where p is the fraction of the posterior distribution in region G.

For error estimates of the variance of the means, we compute the variance of the above as before, which is similarly computed as the weighted sum of the error variances computed separately for each stratum:

    \mathrm{Var}[\hat{\sigma}_u^2] \;=\; p^2 \, \mathrm{Var}[\hat{\sigma}_{u,G}^2] + (1-p)^2 \, \mathrm{Var}[\hat{\sigma}_{u,\bar{G}}^2]
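A minimal sketch of this recombination (NumPy; the data and names are ours):

```python
import numpy as np

def stratified_var_of_means(dev_in_G, dev_outside_G, p):
    """Overall variance-of-means when start states are stratified:
    per-stratum mean squared deviations are recombined with weights
    p and 1-p, where p is the posterior probability of region G."""
    var_G = float(np.mean(np.square(dev_in_G)))
    var_not_G = float(np.mean(np.square(dev_outside_G)))
    return p * var_G + (1.0 - p) * var_not_G

# Chains started in G deviate more, but G holds only 10% of the posterior.
v = stratified_var_of_means([0.3, -0.3], [0.1, -0.1], p=0.1)
print(v)   # ≈ 0.1 * 0.09 + 0.9 * 0.01 = 0.018
```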

[Table: estimated fraction p of the posterior distribution in region G, for each number of hidden units. Region G is defined as the region in which \log \sigma_u or \log \sigma_a, or both, exceed the chosen threshold.]

The above expressions correct for the oversampling of G using p; the table above gives estimates of p from each master run.

It should be noted that we have assumed that p is known in the calculation of the error bars of the variance of means measures, but really all we have are estimates. This simplification introduces some inaccuracies into the error calculations, but it is not likely to change our conclusions much, as we will see later.

One might question why the various values of p in the table are quite different. The author has some empirical experience showing that G is a region of somewhat higher rejection rate than normal. It is possible that some of the master runs have η's that are high enough that they enter G rarely enough to make a difference in the estimates of p. This aspect of the experiment is difficult to control, as it is usually not possible to tell in advance what the regions with high rejection rates are going to be, or whether they will make a difference to the final variance of means estimates in the end. Furthermore, if such regions are identified only after obtaining the master runs, it can be very costly computation-wise to redo the master runs at a smaller setting of η.

Finally, the bootstrap procedure takes the stratification into account as follows: the chains in Z* corresponding to the first stratum are sampled uniformly with replacement from the first-stratum chains in Z, while the chains corresponding to the second stratum are sampled uniformly with replacement from the second-stratum chains in Z. This ensures that each realization Z* is obtained from the same empirical distribution represented by Z.
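The stratified resampling can be sketched as follows (NumPy; the integer chain identities stand in for actual Markov chains, and the 20/10 split is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def stratified_bootstrap(chains_out, chains_in):
    """One bootstrap realization Z*: resample each stratum from
    itself, so Z* keeps the same number of inside-G and outside-G
    chains as the original set Z."""
    out_idx = rng.integers(0, len(chains_out), len(chains_out))
    in_idx = rng.integers(0, len(chains_in), len(chains_in))
    return [chains_out[i] for i in out_idx], [chains_in[i] for i in in_idx]

# 20 chains started outside G (ids 0-19), 10 inside G (ids 20-29).
z_out, z_in = stratified_bootstrap(list(range(20)), list(range(20, 30)))
print(len(z_out), len(z_in))          # -> 20 10
print(all(c >= 20 for c in z_in))     # strata never mix -> True
```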

Number of Leapfrog Steps Allowed

In accordance with the deemed ratios of computation involved in each supertransition for the various methods, different numbers of leapfrog iterations were used for each Markov chain. These are listed below.

[Table: the varying numbers of leapfrog steps allowed per Markov chain under Gibbs, Dynamical A, and Dynamical B hyperparameter updates, in accordance with the ratios of computation involved in each supertransition for the various methods.]

Results of Performance Evaluation

The optimal tuning parameter settings for each method and for each number of hidden units were assessed from the figures below. The optimal tuning parameters thus obtained are given in a table, and the corresponding g's, variance of means measures, and rejection rates obtained at these optimal settings, but in new runs with new random seeds, are given in a further table.

As can be seen from the table of optimal settings, the optimal setting of η for the Gibbs and Dynamical A methods tends to decrease with increasing hidden layer size; for Dynamical B, interestingly, it increases with increasing hidden layer size.

[Figure: geometric mean of the variances of means as a function of l and η for the first of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Figure: geometric mean of the variances of means as a function of l and η for the second of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Figure: geometric mean of the variances of means as a function of l and η for the third of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Figure: geometric mean of the variances of means as a function of l and η for the fourth of the hidden layer sizes tested, under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. A circle is drawn at the optimum point, and the surface as a function of l and η is backprojected onto the walls at optimal η and l respectively. Note that the vertical scale is inverted.]

[Table: the optimal tuning parameters l and η, and the corresponding rejection rates, for each method (Gibbs, Dynamical A, Dynamical B) and each number of hidden units.]

The values of g are plotted for each number of hidden units in the accompanying figure. The error bars are big, and it is possible that there is essentially no difference among the three methods. However, there is some evidence that when the number of hidden units N_h is increased to the largest value tested, the Dynamical B method begins to work better than the old method.

Note that the error bars in the figure are confidence intervals of the performance measures and do not take into account inaccuracies in assessing the optimal settings of the tuning parameters. Thus the real error bars are actually bigger by some unknown amount.

From the graph, the Dynamical A method does not seem to perform that differently from the other methods, except at one number of hidden units. The cause of this seeming anomaly is discussed in the next chapter.

The graph also shows that the Dynamical B method does not show any improvement over the old method that is measurable given the size of the error bars. This is unfortunate. However, the old method does seem to show a noticeable upward trend, while the two new methods do not. This suggests that the old method is becoming more and more inefficient as the number of hidden units N_h increases, even though the compute time given to it is also increasing linearly with N_h. Thus the graph suggests that the compute time required to maintain the same level of performance, as measured by g, grows superlinearly with N_h for the old method. On the other hand, the two new methods appear more likely to be either linear or sublinear, although it is difficult to tell for certain with the limited number of data points and the noise.

The size of the error bars impedes our analysis of the results. In the next section we show how we can perform bootstrapping on pairwise comparisons to obtain clearer indications of how one method does compared to another.

Pairwise Bootstrap Comparison

To more sensitively compare how two methods perform, we can compute the ratio of the g's for two different methods, and use bootstrapping to obtain a confidence interval for that ratio. Here, a bootstrap realization is a choice of Markov chain start states, rather than of Markov chains, as the chains themselves differ for the two methods. This couples the g's for the two methods together, causing them to be evaluated at the same Markov chain start states for each bootstrap realization. In this way, we might be able to better distinguish between good and bad methods. For example, we might see that one method always has a higher g than another when started from the same point, even though their individual g's wander over a large range for different bootstrap realizations, so that the error bars on the g's overlap significantly.

In the figure below, we show these ratios, with confidence intervals obtained by bootstrapping.
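A sketch of this paired comparison (NumPy; the toy data and names are ours). Because each realization applies one shared choice of start states to both methods, a uniform factor-of-two difference between the methods is recovered exactly:

```python
import numpy as np

rng = np.random.default_rng(2)

def g_of(sq_dev):
    # geometric mean, over hyperparameters, of the variance-of-means
    return float(np.exp(np.mean(np.log(sq_dev.mean(axis=0)))))

def paired_ratio_ci(sq_dev_a, sq_dev_b, n_boot=1000, tail=5.0):
    """Bootstrap CI for g_A / g_B in which each realization picks ONE
    set of start states (row indices) and applies it to BOTH methods,
    coupling the two g's together."""
    n = len(sq_dev_a)
    ratios = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)      # shared choice of start states
        ratios.append(g_of(sq_dev_a[idx]) / g_of(sq_dev_b[idx]))
    return np.percentile(ratios, [tail, 100.0 - tail])

a = rng.gamma(2.0, 0.05, size=(30, 4))   # per-start-state squared deviations
b = 0.5 * a                              # a method uniformly twice as good
lo, hi = paired_ratio_ci(a, b)
print(np.allclose([lo, hi], 2.0))        # coupling pins the ratio at 2 -> True
```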

[Figure: comparison of g between the three methods at the optimal setting of the tuning parameters, as the number of hidden units increases. Crosses are the old method, triangles are the Dynamical A method, and circles are the Dynamical B method. Error bars terminate with the same symbol that represents each point. These error bars represent confidence intervals, and do not take into account the error in the estimation of the optimal tuning parameters; thus the true error bars are greater than those shown.]

[Figure: comparison of the optimal performance of the three methods as the number of hidden units increases, with one panel per hyperparameter — the data noise hyperparameter, the input-to-hidden weights' hyperparameter, the hidden-to-output weights' hyperparameter, the hidden biases' hyperparameter, and the output biases' hyperparameter — each panel plotting the variance of means of the corresponding log σ against the number of hidden units. Error bars terminate with the same symbol that represents each point. These error bars are the standard deviation of each variance of means measure, and do not take into account the error in the estimation of the optimal tuning parameters; thus the true error bars are greater than those shown.]

From panel (a) of the pairwise-comparison figure, it seems fairly convincing that Dynamical B works better than the old method at the largest number of hidden units. Panel (b) suggests that Dynamical B tends also to work better than Dynamical A, while panel (c) is inconclusive on the relative performances of Dynamical A and the old Gibbs method. However, we should note that there is some uncertainty in the error bars, due to the fact that we might not really have found the true optimal settings of the tuning parameters. Also, the Markov chains we used might not have been enough to capture all the important regions. Thus, even though the pairwise comparisons might suggest that Dynamical B works better than the Gibbs method at the largest number of hidden units, it is better to be cautious and conclude that Dynamical B may work better than Gibbs, and that if it does, it is not by much.

Now that it is clear that Dynamical B is not as good as one might hope, the questions are: why is that, and can it be made to go faster? We address these questions, as well as the question of Dynamical A's large error bars in g at the anomalous number of hidden units, in the next chapter.

[Figure: pairwise comparisons of the various methods, obtained by taking ratios of g: (a) Gibbs vs Dynamical B (g_Gibbs / g_B), (b) Dynamical B vs Dynamical A (g_B / g_A), and (c) Dynamical A vs Gibbs (g_A / g_Gibbs), each plotted against the number of hidden units. Error bars represent confidence intervals obtained using bootstrap realizations. Dashed lines have been drawn at the level of 1, which signifies equal performance.]

[Table: variance of means performances \hat{\sigma}^2, \hat{\sigma}_u^2, \hat{\sigma}_v^2, \hat{\sigma}_a^2, \hat{\sigma}_b^2 and g for each method (Gibbs, Dynamical A, Dynamical B) and each number of hidden units, obtained by rerunning the various methods with new random seeds at the optimal tuning parameter settings. Each error bar is the standard deviation of its variance of means measure.]

Chapter

Discussion

Has the Reparameterization of the Network Weights Been Useful?

As we saw in the last chapter, the Dynamical A method seems to perform comparably to Dynamical B, except at one number of hidden units. Upon closer examination, we see that the large value of g for the Dynamical A method there is due to its not performing well for some Markov chain starting states with large hyperparameter values for \log \sigma_u and \log \sigma_a, i.e. the second stratum. This can be seen in the figure below, and is a manifestation of Dynamical A's inability to move efficiently between large and small values of the hyperparameters. This effect is less pronounced at the other hidden layer sizes, probably because the starting states in the second stratum are there less extreme in value. As mentioned before, this is an aspect of the experiment that is difficult to control. Nevertheless, our present results indicate that the Dynamical A method should be avoided, because it may move extremely slowly from some starting states.

[Figure: how the various methods fare on the Markov chain starting states with large hyperparameters, shown as plots of log σ_a against log σ_u under (a) Gibbs, (b) Dynamical A, and (c) Dynamical B updates. The circles show the starting states with large hyperparameters, the crosses show the hyperparameter means of all chains, and the dots show the means for the Markov chains that started at the large hyperparameter values. Dashed lines connect each starting state with its resulting mean.]

Making the Dynamical B Method Go Faster

As we saw in the last chapter, the Dynamical B method offers only a small improvement over the old method, if any at all. The question then is: why?

The key may lie in the rejection rates. As can be seen in the figure below, the rejection rates of the Dynamical B method differ from those of the old method and the Dynamical A method in that they show a significant increase with the leapfrog trajectory length l.

The primary motivation for the new methods is to allow the hyperparameters to be updated along with the parameters during the course of a leapfrog trajectory. For this to explore the posterior distribution efficiently, long trajectory lengths are necessary. As we can see from the figure, the Dynamical B method ended up having rejection-rate characteristics that penalize long trajectory lengths with high rejection rates; thus the optimal setting of l is not as long as we might like.

With this observation in mind, if we can find out why the Dynamical B method has this behaviour, we might be able to fix it. The next section formulates a simple model that accounts for this increase in rejection rate with l.

Explanation for the Rising Rejection Rates

If the stepsizes were infinitesimally small, the leapfrog trajectories would simulate the Hamiltonian dynamics perfectly, and the rejection rate would be zero. However, they are not, and they are furthermore calculated by heuristics that may sometimes yield inappropriate values. This leads to rejection behaviour that affects the rate at which the Markov chain explores the state space.

There are two qualitatively different ways by which a leapfrog proposal under the Dynamical B method may be rejected. In the first way, the leapfrog simulation of the Hamiltonian dynamics is stable, and H varies over a small range due to the discrete nature of the simulation. At the end of the trajectory, the distribution of H over this

nature of the simulation At the end of the tra jectory the distribution of H over this

Chapter Discussion

0 0 10 10

−1 −1 10 10

−2 −2 10 10

−3 −3 10 10

η = 0.010 η = 0.010 Rejection rate η = 0.020 Rejection rate η = 0.020 η = 0.050 η = 0.040 −4 −4 10 η = 0.100 10 η = 0.080 η = 0.141 η = 0.113 η = 0.200 η = 0.160 η = 0.283 η = 0.226 −5 η = 0.400 −5 η = 0.320 10 10 η = 0.800 η = 0.640 η = 1.600 η = 1.280

−6 −6 10 10 0 1 2 3 4 5 0 1 2 3 4 5 10 10 10 10 10 10 10 10 10 10 10 10

l l

a Gibbs up date b Dynamical A up date

0 10

−1 10

−2 10

−3 10 η = 0.001 η = 0.003 η = 0.005 −4 η

Rejection rate 10 = 0.010 η = 0.020 η = 0.040 η = 0.080 −5 10 η = 0.160 η = 0.320 η = 0.640 η −6 = 1.280 10 η = 2.560

−7 10 0 1 2 3 4 5 10 10 10 10 10 10

l

c Dynamical B up date

Figure Rejection rates versus tra jectory length l for hidden units


small range determines the rejection rate. In the second way, the leapfrog simulation becomes unstable at some point during the trajectory, due to the heuristics yielding stepsizes inappropriate for that region of state space. The effect of this is catastrophic, because the simulation is almost never able to recover: the value of H either diverges or moves to a much higher value, resulting in near-certain rejection. H has a much greater tendency to move to a higher rather than a lower value, because stepsizes that are inappropriately large may still be stable in a wider orbit.
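This distinction between small stable fluctuations of H and catastrophic divergence can be reproduced on a toy problem. The following Python sketch is an illustration only, not part of the thesis's experiments: it runs the leapfrog discretization on the Hamiltonian H = q²/2 + p²/2 (for which the stability limit is ε = 2), once with a stable stepsize and once with a stepsize beyond the limit.

```python
def leapfrog(q, p, eps, n_steps, grad_e=lambda q: q):
    """Leapfrog integration of H(q, p) = q^2/2 + p^2/2 (unit mass)."""
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_e(q)   # half-step for momentum
        q += eps * p                 # full step for position
        p -= 0.5 * eps * grad_e(q)   # half-step for momentum
    return q, p

def hamiltonian(q, p):
    return 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 0.0
h0 = hamiltonian(q0, p0)

# Stable stepsize: H stays in a narrow band around its initial value,
# even over very many steps.
q, p = leapfrog(q0, p0, eps=0.1, n_steps=10_000)
stable_err = abs(hamiltonian(q, p) - h0)

# Unstable stepsize (eps > 2 for this Hamiltonian): H blows up, giving
# near-certain rejection of the proposal.
q, p = leapfrog(q0, p0, eps=2.1, n_steps=100)
unstable_err = abs(hamiltonian(q, p) - h0)

print(stable_err, unstable_err)
```

Note that the stable run deviates only slightly from the initial H even after ten thousand steps, while the unstable run diverges within a hundred.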

To illustrate this instability, H is plotted in Fig. for trajectories started from the same state but with different randomly-selected initial momenta. We see that an instability can occur at any time, and once it occurs, recovery is virtually impossible, leading to near-certain rejection at the end of the trajectory. The cumulative effect of risking a grossly wrong stepsize with each step is that, for very long trajectories, the probability that we manage to get to the end without once having experienced a catastrophic instability is tiny. Therefore the rejection rate should increase with trajectory length. This explains the observed increase of the rejection rate with l for Dynamical B.

Another view of the instability of H is shown in Fig. , which shows how the distribution of H broadens with increasing number of leapfrog steps.
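The cumulative effect described above can be captured by a simple model: if each leapfrog step independently triggers a catastrophic instability with some small probability, the chance of surviving an l-step trajectory decays geometrically in l. The sketch below uses purely illustrative numbers (a hypothetical per-step instability probability and baseline rejection rate, not values measured in the thesis).

```python
p_cat = 1e-4   # hypothetical per-step probability of catastrophic instability
r0 = 0.05      # hypothetical baseline rejection rate for stable trajectories

def rejection_rate(l):
    """Rejection rate for an l-step trajectory under the simple model."""
    survive = (1.0 - p_cat) ** l          # no instability in any of l steps
    return 1.0 - survive * (1.0 - r0)

for l in (10, 100, 1000, 10_000, 100_000):
    print(l, rejection_rate(l))
```

Under this model the rejection rate stays near the baseline for short trajectories but climbs towards one as l grows, matching the qualitative behaviour observed for the Dynamical B method.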

The old method, which updates the hyperparameters by Gibbs sampling, is not prone to this cumulative rejection effect, as its stepsizes are not being constantly recalculated during a trajectory. Even if its stepsize heuristic gives inappropriate stepsizes with too high a probability, the stepsizes are computed only once at the beginning of the trajectory, and a good stepsize will tend to lead to a stable trajectory no matter how long it is. Such long trajectories then become unstable only by entering a region for which their stepsizes are inappropriate. This effect leads to rejection rates that increase with l, as the longer the trajectory, the more likely such regions are to be encountered. However, the fact that rejection rates for the old method show very little dependency on l (see Fig. ) indicates that entry into such regions happens very rarely. Thus, for the old method, most rejections


[Figure : H plotted over trajectories started from the same state but with different initial momenta. Each trajectory has leapfrog steps. As each trajectory progresses it may become unstable; if it does, H usually rises catastrophically. A transition to infinite H is shown here as a transition to . η was set at , and a network of hidden units was used.]

are due to the normal deviation of H away from its initial value.

That the repeated stepsize calculations handicap the Dynamical B method with this cumulative rejection effect might be cause for pessimism. However, the fact that the Dynamical A method achieves fairly flat rejection rates (Fig. ) shows that the rise in rejection rates is not an inescapable cost of calculating the stepsizes before each step; rather, with appropriate stepsize heuristics, it might be possible to achieve flat rejection rates even in the Dynamical B method.

The stepsize heuristics used in the Dynamical B method for the parameters are the same as those used in the old method, so they are unlikely to be the cause of the catastrophically wrong stepsizes. We expect that it is the stepsize heuristics for the hyperparameters that are at fault. The next section seeks to confirm this.


[Figure : The distribution of H broadens as a trajectory progresses. Panels: (a) before any leapfrog steps, and (b), (c), (d) after successively more leapfrog steps; each panel is a histogram of H. Infinity is binned at in these histograms. Trajectories, all started at the same state but with different initial momenta, were used to generate these histograms. η was set at . A network of hidden units was used.]


The Appropriateness of Stepsize Heuristics

If a stepsize is inappropriately large and leads to instability, then perhaps a smaller stepsize, half as large, might not. If, for example, the parameter stepsizes are sometimes inappropriate, then under a leapfrog discretization where the hyperparameter update remains the same but each parameter update is split into two consecutive updates, each with half the stepsize adjustment factor, the acceptance probability should increase, since the parameter update is now closer to the true Hamiltonian dynamics. On the other hand, if the parameter stepsizes are usually appropriate and it is the hyperparameter stepsizes that are at fault, then the rejection rate should not change much.

As shown in Fig. , when the parameter updates were split, the rejection rates did not change, while they dropped when the hyperparameter updates were split. This indicates that the hyperparameter stepsizes calculated according to the current heuristics are often inappropriate. Fig. was obtained by averaging the rejection rates over a small number of Markov chains run at various settings of l, with η_h = η_p = . A network with hidden units was used, along with the training data from the last chapter. The Markov chain starting states were chosen from the ones used in the tests in the last chapter, with momenta randomly initialized from the unit normal distribution.

To remedy the inappropriateness of the hyperparameter stepsizes, it is possible that the stepsize heuristics for the hyperparameters need to be changed. On the other hand, it is also possible that some setting of η_h/η_p other than 1 is all that is necessary. In the next two sections we explore these two possibilities.

Different Settings for η_h/η_p

In this section we report the results of some experiments to test the possibility that some setting of η_h/η_p can flatten the rejection rate versus l curves, without us having to change the stepsize heuristics.


[Figure : Effect on rejection rates of splitting either the parameter updates or the hyperparameter updates into two, while using the Dynamical B method; the plot shows rejection rate against trajectory length l for the split-parameter, split-hyperparameter, and regular runs. Here η_h = η_p = . This plot was obtained using a network with hidden units.]

We used a network with hidden units and the training data from the last chapter. Markov chains were run at various trajectory lengths l, with η_p = and , and η_h/η_p = . The Markov chain starting states were taken from the states used for the tests in the last chapter, and the momenta were randomly initialized from a unit normal distribution. The resulting rejection rates, averaged over the chains, are plotted in Fig. . Compared to the case where η_h = η_p, the rejection rates do not rise as fast as the trajectory length increases.

This suggests that, by decreasing the ratio η_h/η_p, it may be possible to gain enough of the advantage of having long trajectory lengths to offset the smaller distances travelled in each step due to the smaller η_h.

The geometric mean performance measure g was also calculated at each setting of l. It was found that the best value of g is , which occurs at l = with η_p = . This value of g is considerably worse than the optimal one found for the Dynamical B


[Figure : Rejection rates for hidden units when different ratios of η_h/η_p are used; each panel plots rejection rate against trajectory length l, for η_p = 0.320 and η_p = 0.640. The rejection rates do not rise as fast with increasing trajectory length when η_h is set to be smaller than η_p. The first set of rejection rates were averaged over Markov chains, while the second were obtained using .]

method at this number of hidden units. So we see that even though we are able to take much longer trajectories with smaller η_h/η_p, the smallness of η_h may erase our advantage.

It is possible that some other setting of η_p, or some other ratio of η_h/η_p, does better than the settings tried here, but this question will not be explored further in this thesis due to time constraints. The key conclusion of this section is that smaller ratios of η_h/η_p can flatten the rejection curve, and may be more advantageous than simply setting η_h = η_p.

Fine Splitting of Hyperparameter Updates

Apart from setting a low ratio for η_h/η_p, we conjecture that with sufficiently good heuristics we should also be able to get the rejection rate to stay flat as l increases. To test this, we split the hyperparameter updates into fine updates, many more than the two before. The reason for doing this is that the more stable trajectory obtained by the splitting may roughly model what a good heuristic gives.
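The intuition that finer substeps track the true dynamics more closely can be checked on a toy Hamiltonian. The sketch below is an illustration only, not the thesis's stepsize heuristics: it covers a fixed time interval with k leapfrog substeps of size ε/k on H = q²/2 + p²/2, and measures the resulting error in H.

```python
def leapfrog_interval(q, p, eps, n):
    """Integrate H = q^2/2 + p^2/2 with n leapfrog steps of size eps."""
    for _ in range(n):
        p -= 0.5 * eps * q
        q += eps * p
        p -= 0.5 * eps * q
    return q, p

def energy_error(k, total_time=1.0):
    """Error in H after covering total_time with k substeps of size total_time/k."""
    q, p = 1.0, 0.0                      # initial state, so H0 = 0.5
    q, p = leapfrog_interval(q, p, total_time / k, k)
    return abs((0.5 * q * q + 0.5 * p * p) - 0.5)

print(energy_error(1), energy_error(2), energy_error(10))
```

The error in H shrinks as the number of substeps grows, which is the mechanism by which the finely split hyperparameter updates stabilize the trajectory.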


[Figure : Effect on rejection rates of splitting the hyperparameter updates into updates, while using the Dynamical B method with η_h = η_p = . Compared to the normal unsplit updates (shown for η = 0.080, 0.160, 0.200, and 0.320), rejection rates are now much lower. The network used here has hidden units.]

The split hyperparameter updates were tried on a network with hidden units, using the same training data as in the last chapter, and using Markov chain starting states taken also from the tests conducted in the previous chapter. Momenta were randomly initialized from a unit normal distribution. The Markov chains were run at η_h = η_p = with various values of l. Fig. shows that the resulting rejection rates for runs with split hyperparameter updates still increase with l, but the rate of increase is much gentler now, increasing about one order of magnitude from about . This is much better than the rejection rates obtained from the normal unsplit updates, which we contrast in the same figure. The unsplit updates do worse even for η less than half the size.

This suggests that, with sufficiently good heuristics, trajectories might be able to go far enough to truly reap the advantages of updating the hyperparameters dynamically.


Why the Stepsize Heuristics are Bad

A possible explanation for why the hyperparameter stepsize heuristics do not work well is that we cannot use the current values of the hyperparameters to compute their stepsizes. As we are unable to get an estimate of them from the weights, due to their reparameterization, we are forced to use their prior means, which are not necessarily very good estimates.

In retrospect, this should have been obvious. The backpropagation of the second derivative of the likelihoods multiplies together the hyperparameter variances of each layer of weights that it propagates derivatives through. For instance, the second derivative of the likelihood with respect to the input-to-hidden weight hyperparameter σ_u is proportional to the product of the prior variances of the hidden-to-output weights (σ_v) and the input-to-hidden weights (σ_u). In an -hidden-unit run, our settings were such that . Yet from Fig. it is clear that, as the Markov chain ranges over the posterior distribution, the product of these two variances can actually range up to e or more. Clearly, the prior value is a bad estimate of this product. This can lead to stepsizes which are e times larger than what they would have been if we had used the actual values of the hyperparameters.

The Dynamical A method does not suffer from this multiplicative effect of wrong hyperparameter estimates, as its neural network function does not depend on the hyperparameters, so there is no need to do backpropagation of second derivatives. Indeed, the second derivative of the potential energy in Eqn. is proportional to just the precision of the hyperparameter, so if our estimate of the hyperparameter is k times too small, the stepsize is only going to go up by a factor of √k. Furthermore, the parameters contain information about the value of the hyperparameter: the Dynamical A method estimates a hyperparameter as its posterior mean given its weights.
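The √k scaling can be made concrete with a stepsize heuristic of the generic form η ∝ (second derivative)^(−1/2). The precision value and error factor below are hypothetical, chosen only to illustrate the scaling.

```python
def stepsize_from_curvature(curvature):
    """Generic heuristic: stepsize proportional to curvature^(-1/2)."""
    return curvature ** -0.5

tau = 4.0    # hypothetical true hyperparameter precision
k = 9.0      # suppose our estimate of the precision is k times too small

eta_true = stepsize_from_curvature(tau)
eta_est = stepsize_from_curvature(tau / k)

# Underestimating the precision by a factor of k inflates the stepsize
# by only sqrt(k) -- here, a factor of 3.
print(eta_est / eta_true)
```

This mild √k sensitivity is why the Dynamical A heuristics tolerate poor hyperparameter estimates far better than the multiplicative Dynamical B heuristics do.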


Other Implications of the Current Heuristics

The fact that the current heuristics use the prior means of the hyperparameters as estimates of the hyperparameters themselves during stepsize calculations has the implication that, when the Dynamical B method is used, the prior means must be carefully selected to be close to values where the hyperparameters have high posterior probability. This allows the stepsizes to be accurate when moving about in areas of high posterior probability. If the prior means are badly set, exploration of the posterior is expected to become very inefficient.

A hybrid Monte Carlo method that computes bad stepsizes in some region of state space may usually be unable to enter that region, because it rejects once a trajectory enters it. Furthermore, once having entered that region, it is obliged to stay there a long time, through its high rejection rate in that region, in order to compensate for its inability to enter that region in the first place. This leads to high autocorrelations.

We might ask if the B method actually suffers from this problem. If it does, the effect is small, as nothing of the sort was noticed: the marginal posterior distributions of the hyperparameters obtained by the B method match those from the old method (Fig. ). However, this is no guarantee that it will work similarly well for other problems.

Conclusion

In this thesis we have introduced a new way of learning the hyperparameters in a neural network model using Hamiltonian dynamics. We have presented two versions of the new method: the Dynamical A method, and the Dynamical B method, which is the former with the weights reparameterized to enhance movement between large and small values of the hyperparameters. We have also developed performance evaluation methodologies that measure the rate of exploration of the posterior while accounting for the different compute times required by the different methods. The observation that some parts of


state space might have a large effect on the results then led us to a stratified version of the performance measures. After extensive testing, we have found that these same regions of state space can cause Dynamical A to become very inefficient, for the reasons that led us to formulate Dynamical B. However, we have also shown that Dynamical B does not show a measurable performance improvement over the old method.

The Dynamical A method as it currently stands suffers from expected inefficiencies, but it was hoped that Dynamical B would overcome them and yield better performance. Instead, there is currently no reason to recommend either new method over the old one.

We strongly believe that the Dynamical B method's Achilles heel is the fact that its rejection rate increases with trajectory length. If longer trajectories can be achieved while keeping rejection rates low, it is expected that the Dynamical B method can become significantly faster. To achieve this, future work might focus on an improved parameterization of the weights and/or more accurate hyperparameter stepsize heuristics. Also, the stepsize heuristics can potentially be simplified, to allow more leapfrog steps to be taken.

Ultimately, our efforts are being hampered by the fact that the volume of the state space under the posterior having large hyperparameter variance is huge and of low density, while the region having small hyperparameter variance is small and of very high density, and hybrid Monte Carlo does not move well between these two types of regions. Doing nothing about this leads to the Dynamical A method, which we have shown can have severe inefficiencies in certain regions. On the other hand, our effort to reparameterize the weights to tackle this problem leads to the hyperparameters being confounded with the weights in the computation of the network output, and that is the cause of our inability to have long trajectories while keeping rejection rates down. It may be that we are pushing against the inherent limitations of the hybrid Monte Carlo method here, but we nevertheless hope that further work will overcome the present difficulties.

Appendix A

Preservation of Phase Space Volume Under Hamiltonian Dynamics

Here we show the well-known result that Hamiltonian dynamics keeps the Hamiltonian H, as well as phase space volume, constant.

Hamiltonian dynamics is characterized by

    \dot{q} = \frac{\partial H}{\partial p}, \qquad \dot{p} = -\frac{\partial H}{\partial q},    (A.1)

where q and p are the state and the momentum variables respectively.

The following shows that Hamiltonian dynamics keeps H constant:

    \frac{dH}{dt} = \frac{\partial H}{\partial q}\dot{q} + \frac{\partial H}{\partial p}\dot{p}
                  = \frac{\partial H}{\partial q}\frac{\partial H}{\partial p} - \frac{\partial H}{\partial p}\frac{\partial H}{\partial q} = 0.    (A.2)

To show that Hamiltonian dynamics conserves phase space volume, consider the phase space flow (\dot{q}, \dot{p}), which defines a vector field in the phase space (q, p). The fact that phase space volume is conserved is due to the fact that the divergence of this vector field is zero:

    \nabla \cdot (\dot{q}, \dot{p}) = \frac{\partial \dot{q}}{\partial q} + \frac{\partial \dot{p}}{\partial p}
                                    = \frac{\partial^2 H}{\partial q \, \partial p} - \frac{\partial^2 H}{\partial p \, \partial q} = 0.    (A.3)

Appendix B

Proof of Theorem : Deterministic Proposals for the Metropolis Algorithm

Here we prove Theorem from Section .

First, we need the following definition.

Definition (Detailed Balance): We say that the transition probabilities T(x, A), defined for all points x and all sets A, satisfy detailed balance with respect to the density \pi(x) if, given any two sets A and B,

    \int_A \pi(x) \, T(x, B) \, dx = \int_B \pi(x) \, T(x, A) \, dx.    (B.1)

In words, the detailed balance condition says that, in equilibrium, the probability of starting in A and moving to B in one transition is exactly equal to the probability of starting in B and moving to A.

Recall that to say that a Markov update leaves a distribution \pi(x) invariant is to say that the total probability mass \pi(A) in some arbitrary set A is unchanged by the Markov transition. That is,

    \int_R \pi(x) \, T(x, A) \, dx = \pi(A),    (B.2)

where R is the state space.

[Figure B.: A maps to A' under M, and B maps to B' under M.]

It is easy to see that if a Markov transition satisfies detailed balance with respect to \pi(x), then it leaves \pi(x) invariant: we need only set B to R to see this. Thus we only need to prove that the Metropolis algorithm, with deterministic proposals satisfying the two conditions in Theorem , yields Markov updates that satisfy detailed balance with respect to the desired distribution \pi(x). That is our aim in the discussion below.

Consider the deterministic mapping M : R \to R. Let A \subset R and B \subset R, and write A' = M(A) and B' = M(B). Under M, the image of A might in general have some part outside B, and the image of B might have some part outside A, as shown in Fig. B.

Assume that the mapping M is the inverse of itself, so that M(M(x)) = x. We must then have that M(A \cap B') = B \cap A'. Suppose not. Then there is a point x \in A \cap B' whose image lies outside B \cap A', which we show as "M(x)" in Fig. B. M(x) must be in A', as x \in A. But the fact that x is in B' means that it is the image under M of some point y in B, so that M(y) = x. If indeed x maps onto the point "M(x)", then M(M(y)) \ne y, for y is in B but "M(x)" is not. Since this violates the assumption that M is its own

[Figure B.: x \in A \cap B' must map into B \cap A'.]

inverse, we conclude that M(A \cap B') = B \cap A'. Note that M(B \cap A') = A \cap B' also, due to the self-inverting assumption on M.

With this fact in hand, we are ready to show how to achieve detailed balance for the Metropolis algorithm. We assume self-inverting deterministic proposals. Since the total probability mass flowing from A to B takes place in the flow from A \cap B' to B, we can rewrite the left hand side of Eqn. (B.1):

    \int_A \pi(x) \, T(x, B) \, dx = \int_{A \cap B'} \pi(x) \, T(x, B) \, dx
                                   = \int_{A \cap B'} \pi(x) \min\left(1, \frac{\pi(M(x))}{\pi(x)}\right) dx    (B.3)
                                   = \int_{A \cap B'} \min(\pi(x), \pi(M(x))) \, dx,

where we have used the Metropolis transition probability T(x, B) = \min(1, \pi(M(x))/\pi(x)) for x \in A \cap B'.

We can then rewrite the right hand side of Eqn. (B.1):

    \int_B \pi(y) \, T(y, A) \, dy = \int_{B \cap A'} \pi(y) \, T(y, A) \, dy
                                   = \int_{B \cap A'} \pi(y) \min\left(1, \frac{\pi(M(y))}{\pi(y)}\right) dy
                                   = \int_{M(A \cap B')} \pi(y) \min\left(1, \frac{\pi(M(y))}{\pi(y)}\right) dy
                                   = \int_{M(A \cap B')} \min(\pi(y), \pi(M(y))) \, dy
                                   = \int_{A \cap B'} \min(\pi(M(x)), \pi(M(M(x)))) \left|\frac{\partial y}{\partial x}\right| dx  (using y = M(x))
                                   = \int_{A \cap B'} \min(\pi(M(x)), \pi(x)) \left|\frac{\partial y}{\partial x}\right| dx.    (B.4)

Comparing the expressions resulting from manipulating the left and the right-hand sides of the detailed balance condition, we see that they are equal if the Jacobian |\partial y / \partial x| = 1. Thus detailed balance with respect to \pi(x) is achieved if the Metropolis algorithm is used with deterministic proposals that are reversible (self-inverting) and that have Jacobian 1, and we are done.
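The pointwise identity underlying this proof, namely that \pi(x) times the acceptance probability at x equals \pi(M(x)) times the acceptance probability at M(x) for a self-inverting, unit-Jacobian proposal, can be checked numerically. The sketch below is an illustration only, using an arbitrary Gaussian target and a reflection proposal (both hypothetical choices, not from the thesis).

```python
import math

def pi(x):
    """Unnormalized target density: a standard Gaussian."""
    return math.exp(-0.5 * x * x)

def M(x):
    """A self-inverting proposal with unit Jacobian: reflection about c."""
    c = 1.7   # arbitrary illustrative reflection point
    return c - x

def accept_prob(x):
    """Metropolis acceptance probability for the deterministic proposal M."""
    return min(1.0, pi(M(x)) / pi(x))

for x in (-2.0, -0.3, 0.0, 0.9, 3.5):
    assert abs(M(M(x)) - x) < 1e-12               # M is its own inverse
    flow_xy = pi(x) * accept_prob(x)              # pi(x) T(x -> M(x))
    flow_yx = pi(M(x)) * accept_prob(M(x))        # pi(M(x)) T(M(x) -> x)
    assert abs(flow_xy - flow_yx) < 1e-12
print("detailed balance identity holds pointwise")
```

Both flows reduce to min(\pi(x), \pi(M(x))), which is symmetric in its arguments; this is the same cancellation that equates (B.3) and (B.4).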

Appendix C

Preservation of Phase Space Volume Under Leapfrog Updates

In this Appendix we show that leapfrog updates preserve phase space volume.

Consider the simultaneous update of two variables, such that each update does not depend on the variable it updates, but depends on the value of the other variable:

    x' = x + f(y)
    y' = y + g(x)    (C.1)

It can easily be shown that the Jacobian of such an update is 1 - f'(y) g'(x), so the update does not preserve phase space volume in general. On the other hand, consider two sequential updates, where each update also depends only on the variable updated by the other, but the updates are performed one after another:

    x' = x + f(y)
    y' = y + g(x') = y + g(x + f(y))    (C.2)

The Jacobian of such an update can be shown to be identically equal to 1, and so it preserves phase space volume. This is simply a demonstration of the fact that each sequential update of a variable that does not depend on itself amounts to a shear in the direction of that variable. Thus each sequential update preserves phase space volume, and indeed a chain of such updates does as well.
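The two Jacobians claimed above can be verified numerically for specific choices of f and g (the functions below are arbitrary illustrations). This finite-difference sketch compares the simultaneous and sequential update maps:

```python
import math

def f(y):
    return math.sin(y)   # arbitrary illustrative choice of f

def g(x):
    return x * x         # arbitrary illustrative choice of g

def simultaneous(x, y):
    # Both updates use the old values of (x, y).
    return x + f(y), y + g(x)

def sequential(x, y):
    # The second update sees the newly updated x.
    x_new = x + f(y)
    return x_new, y + g(x_new)

def jac_det(update, x, y, h=1e-6):
    """Jacobian determinant of a 2-D map, via central differences."""
    dxx = (update(x + h, y)[0] - update(x - h, y)[0]) / (2 * h)
    dyx = (update(x + h, y)[1] - update(x - h, y)[1]) / (2 * h)
    dxy = (update(x, y + h)[0] - update(x, y - h)[0]) / (2 * h)
    dyy = (update(x, y + h)[1] - update(x, y - h)[1]) / (2 * h)
    return dxx * dyy - dxy * dyx

x0, y0 = 0.3, 0.7
print(jac_det(simultaneous, x0, y0))  # 1 - f'(y0) g'(x0), generally != 1
print(jac_det(sequential, x0, y0))    # 1: the shear preserves volume
```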

The leapfrog updates are

    p_i(t + \epsilon/2) = p_i(t) - \frac{\epsilon}{2} \frac{\partial E}{\partial q_i}(q(t))                 for each i = 1, ..., d
    q_i(t + \epsilon) = q_i(t) + \epsilon \, \frac{p_i(t + \epsilon/2)}{m_i}                               for each i = 1, ..., d    (C.3)
    p_i(t + \epsilon) = p_i(t + \epsilon/2) - \frac{\epsilon}{2} \frac{\partial E}{\partial q_i}(q(t + \epsilon))    for each i = 1, ..., d

The update for p appears to be simultaneous, in that each p_i is updated without reference to any newly-updated components of p. However, in the leapfrog update, the update of each p_i does not actually depend on any other component of p, so the update of each p_i can be viewed as being sequential, and hence as conserving phase space volume. The same applies to each q_i.

Bibliography

C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press.

W. L. Buntine and A. S. Weigend. Bayesian back-propagation. Complex Systems.

G. Cybenko. Approximation by superpositions of a sigmoid function. Mathematics of Control, Signals and Systems.

L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag.

S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, September.

B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall.

W. Feller. An Introduction to Probability Theory and Its Applications. John Wiley and Sons, Inc.

D. J. C. MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology.

D. J. C. MacKay. Comparison of approximate methods for handling hyperparameters. Neural Computation.

P. B. Mackenzie. An improved hybrid Monte Carlo method. Physics Letters B, August.

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics.

P. Müller and D. R. Insua. Issues in Bayesian analysis of neural network models. Neural Computation.

R. M. Neal. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-, University of Toronto, September.

R. M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag.

J. S. Rosenthal. A review of asymptotic convergence of general state space Markov chains. March.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature.