
Fakultät für Informatik

Neural Network Architectures and Activation Functions: A Gaussian Process Approach

Sebastian Urban

Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Chair: Prof. Dr. rer. nat. Stephan Günnemann

Examiners of the dissertation:

1. Prof. Dr. Patrick van der Smagt
2. Prof. Dr. rer. nat. Daniel Cremers
3. Prof. Dr. Bernd Bischl, Ludwig-Maximilians-Universität München

The dissertation was submitted to the Technische Universität München on 22.11.2017 and accepted by the Fakultät für Informatik on 14.05.2018.

Neural Network Architectures and Activation Functions: A Gaussian Process Approach

Sebastian Urban
Technical University Munich

2017

Abstract

The success of applying neural networks crucially depends on the network architecture being appropriate for the task. Determining the right architecture is a computationally intensive process, requiring many trials with different candidate architectures. We show that the neural activation function, if allowed to change individually for each neuron, can implicitly control many aspects of the network architecture, such as the effective number of layers, the effective number of neurons in a layer, skip connections and whether a neuron is additive or multiplicative. Motivated by this observation we propose stochastic, non-parametric activation functions that are fully learnable and individual to each neuron. Complexity and the risk of overfitting are controlled by placing a Gaussian process prior over these functions. The result is the Gaussian process neuron, a probabilistic unit that can be used as the basic building block for probabilistic graphical models that resemble the structure of neural networks. The proposed model can intrinsically handle uncertainties in its inputs and self-estimate the confidence of its predictions. Using variational Bayesian inference and the central limit theorem, a fully deterministic loss function is derived, allowing it to be trained as efficiently as a conventional neural network using stochastic gradient descent. The posterior distribution of activation functions is inferred from the training data alongside the weights of the network. The proposed model compares favorably to deep Gaussian processes, both in model complexity and efficiency of inference. It can be directly applied to recurrent or convolutional network structures, allowing its use in audio and image processing tasks. As an empirical evaluation we present experiments on regression and classification tasks, in which our model achieves performance comparable to or better than a Dropout-regularized neural network. We further develop a novel method for automatic differentiation of elementwise-defined, tensor-valued functions that occur in the mathematical formulation of Gaussian processes. The proposed method allows efficient evaluation of the derivatives on modern GPUs and is used in our implementation of the Gaussian process neuron to achieve computational performance that is about 25% of a conventional neuron with a fixed logistic activation function.

Acknowledgements

First and foremost I would like to thank my advisor Patrick van der Smagt for the amount of freedom and encouragement he provided. It was the foundation that allowed me to handle the ambitious projects that culminated in this thesis.

I would also like to sincerely thank Wiebke Köpp for very helpful discussions and all the time, effort and energy she spent on building and testing an efficient implementation of the concepts presented in chapter 3.

Just as well I would like to extend my utmost gratitude to Marcus Basalla for very helpful discussions, his contributions to the concepts developed in chapter 4 and for the time and effort he spent on implementing and performing very detailed empirical evaluations of the models developed in chapter 5.

I would further like to extend many thanks to Marvin Ludersdorfer and Mark Hartenstein. Although they did not directly contribute to this work, we worked together on a predecessor project involving Gaussian processes that inspired many ideas used and developed within this thesis.

I would also like to show utmost appreciation to my parents Jolanta and Janusz Urban for the amount of support they offered during my studies and especially to my mother for taking the time to thoroughly proofread this thesis.

Finally, I would like to thank my lab colleagues Justin Bayer, Christian Osendorfer, Nutan Chen, Markus Kühne, Philip Häusser, Maximilian Sölch, Maximilian Karl, Benedikt Staffler, Grady Jensen, Johannes Langer, Leo Gissler and Daniela Korhammer for many inspiring discussions and their ability to tolerate the significant overlength of most of my presentations.

List of Publications

The following list shows the publications that originated during the course of this work.

Sebastian Urban, Patrick van der Smagt (2017). “Gaussian Process Neurons”. International Conference on Learning Representations (ICLR) 2018 (submitted).

Sebastian Urban, Patrick van der Smagt (2017). “Automatic Differentiation for Tensor Algebras”. arXiv preprint arXiv:1711.01348 [cs.SC].

Sebastian Urban, Patrick van der Smagt (2015). “A Neural Transfer Function for a Smooth and Differentiable Transition Between Additive and Multiplicative Interactions”. arXiv preprint arXiv:1503.05724 [stat.ML].

Sebastian Urban, Marvin Ludersdorfer, Patrick van der Smagt (2015). “Sensor Calibration and Hysteresis Compensation with Heteroscedastic Gaussian Processes”. IEEE Sensors Journal 2015.

Sebastian Urban, Justin Bayer, Christian Osendorfer, Goran Westling, Benoni Edin, Patrick van der Smagt (2013). “Computing Grip Force and Torque from Finger Nail Images using Gaussian Processes”. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2013.

Wiebke Köpp, Patrick van der Smagt, Sebastian Urban (2016). “A Differentiable Transition Between Additive and Multiplicative Neurons”. International Conference on Learning Representations (ICLR) 2016 (workshop track).

Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, Tim Cooijmans, Marc-Alexandre Côté, Myriam Côté, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien Demouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, Mélanie Ducoffe, Vincent Dumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain, Xavier Glorot, Ian Goodfellow, Matt Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet, Jean-Philippe Heng, Balázs Hidasi, Sina Honari, Arjun Jain, Sébastien Jean, Kai Jia, Mikhail Korobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, César Laurent, Sean Lee, Simon Lefrancois, Simon Lemieux, Nicholas Léonard, Zhouhan Lin, Jesse A. Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Christopher Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, Ying Zhang (2016). “Theano: A Python Framework for Fast Computation of Mathematical Expressions”. arXiv preprint arXiv:1605.02688 [cs.SC].

Nutan Chen, Justin Bayer, Sebastian Urban, Patrick van der Smagt (2015). “Efficient Movement Representation by Embedding Dynamic Movement Primitives in Deep Autoencoders”. IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids) 2015.

Nutan Chen, Sebastian Urban, Justin Bayer, Patrick van der Smagt (2015). “Measuring Fingertip Forces from Camera Images for Random Finger Poses”. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2015.

Andreas Vollmayr, Stefan Sosnowski, Sebastian Urban, Sandra Hirche, J Leo van Hemmen (2014). “Snookie: An Autonomous Underwater Vehicle with Artificial Lateral-Line System”. Sensing in Air and Water 2014.

Nutan Chen, Sebastian Urban, Christian Osendorfer, Justin Bayer, Patrick van der Smagt (2014). “Estimating Finger Grip Force from an Image of the Hand using Convolutional Neural Networks and Gaussian Processes”. IEEE International Conference on Robotics and Automation (ICRA) 2014.

Justin Bayer, Christian Osendorfer, Daniela Korhammer, Nutan Chen, Sebastian Urban, Patrick van der Smagt (2013). “On Fast Dropout and its Applicability to Recurrent Networks”. arXiv preprint arXiv:1311.0701 [stat.ML].

Justin Bayer, Christian Osendorfer, Sebastian Urban, Patrick van der Smagt (2013). “Training Neural Networks with Implicit Variance”. International Conference on Neural Information Processing (ICONIP) 2013.

Christian Osendorfer, Justin Bayer, Sebastian Urban, Patrick van der Smagt (2013). “Convolutional Neural Networks Learn Compact Local Image Descriptors”. International Conference on Neural Information Processing (ICONIP) 2013.

Christian Osendorfer, Justin Bayer, Sebastian Urban, Patrick van der Smagt (2013). “Unsupervised Feature Learning for Low-Level Local Image Descriptors”. arXiv preprint arXiv:1301.2840 [cs.CV].

Ulrich Rührmair, Christian Hilgers, Sebastian Urban, Agnes Weiershäuser, Elias Dinter, Brigitte Forster, Christian Jirauschek (2013). “Optical PUFs Reloaded”. IACR Cryptology ePrint Archive.

Ulrich Rührmair, Christian Hilgers, Sebastian Urban, Agnes Weiershäuser, Elias Dinter, Brigitte Forster, Christian Jirauschek (2013). “Revisiting Optical Physical Unclonable Functions”. IACR Cryptology ePrint Archive.

Contents

1 Introduction
  1.1 State of the Art
  1.2 Motivation for this Work
  1.3 Outline

2 Prerequisites
  2.1 Tensor Slicing
  2.2 Probability and Probability Distributions
    2.2.1 Average
    2.2.2 Jensen’s Inequality
    2.2.3 Cross Entropy
    2.2.4 Kullback-Leibler Divergence
    2.2.5 Central Limit Theorems
    2.2.6 Categorical Distribution
    2.2.7 Exponential Family Distributions
    2.2.8 Univariate Normal Distribution
    2.2.9 Multivariate Normal Distribution
    2.2.10 Probabilistic Graphical Models
    2.2.11 The Unscented Transform
    2.2.12 Variational Inference
    2.2.13 Markov Chain Monte Carlo Sampling
  2.3 Optimization
    2.3.1 Gradient Descent
    2.3.2 Stochastic Gradient Descent
    2.3.3 Momentum
    2.3.4 Optimization Methods for Neural Networks
  2.4 Gaussian Processes


    2.4.1 Gaussian Process Regression
    2.4.2 Marginal Likelihood
    2.4.3 Observations and Predictions
  2.5 Artificial Neural Networks
    2.5.1 Regression and Classification
    2.5.2 Universal Approximation Theorem

3 A Continuum between Addition and Multiplication
  3.1 Examples for the Utility of Multiplicative Interactions
  3.2 Additive and Multiplicative Neurons
  3.3 Iterates of the Exponential Function
    3.3.1 Abel’s Functional Equation
    3.3.2 Schröder’s Functional Equation
  3.4 Interpolation between Addition and Multiplication
  3.5 Neurons that can Add, Multiply and Everything In-Between
  3.6 Benchmarks and Experimental Results
    3.6.1 Recognition of Handwritten Digits
    3.6.2 Synthetic Regression
  3.7 Discussion

4 Elementwise Automatic Differentiation
  4.1 Symbolic Reverse Accumulation Automatic Differentiation
  4.2 Handling Multidimensional Functions
  4.3 Systems of Integer Equalities and Inequalities
    4.3.1 Systems of Linear Integer Equations
    4.3.2 Systems of Linear Inequalities
  4.4 Elementwise-Defined Functions and their Derivatives
    4.4.1 Computing Elementwise Derivative Expressions
    4.4.2 Handling Expressions Containing Sums
    4.4.3 Elementwise Derivation Algorithm
  4.5 Example and Numeric Verification
  4.6 Discussion

5 Gaussian Process Neurons
  5.1 The Gaussian Process Neuron
    5.1.1 Marginal Distribution
    5.1.2 Building Layers

  5.2 Probabilistic Feed-Forward Networks
    5.2.1 Regression
    5.2.2 Training and Inference
    5.2.3 Complexity of Non-Parametric Training and Inference
  5.3 The Parametric Gaussian Process Neuron
    5.3.1 Marginal Distribution and Layers
    5.3.2 Drawing Samples
    5.3.3 Loss Functions and Training Objectives
    5.3.4 Reparameterization and Stochastic Training
    5.3.5 Discussion
  5.4 Monotonic Activation Functions
    5.4.1 Increasing Activation Function Mean
    5.4.2 Increasing Activation Function Samples
  5.5 Central Limit Activations
    5.5.1 Propagation of Mean and Covariance
    5.5.2 Computational and Model Complexity
  5.6 Approximate Bayesian Inference
    5.6.1 Stochastic Variational Inference
    5.6.2 Variational Inference using a Marginalized Posterior Distribution
    5.6.3 Variational Inference using a Mean-Field Posterior Approximation
    5.6.4 Comparison and Discussion
  5.7 Benchmarks and Experimental Results
    5.7.1 Benchmark Datasets
    5.7.2 Training Procedures
    5.7.3 Preliminary Results

6 Conclusion
  6.1 The Relation to Deep Gaussian Processes
  6.2 Future Work

Chapter 1

Introduction

Artificial neurons (Haykin, 1994) are a model of the neurons found in biological brains. An artificial neuron receives inputs (real numbers) from other neurons, calculates a weighted sum of them and applies a non-linear activation function to the result, producing its output. Neurons are arranged in artificial neural networks (ANNs), which are graphs that describe how the output of each neuron connects to the inputs of other neurons. Of course, many different graphs are imaginable. One simple, yet common, type of ANN is the feed-forward network, where neurons are arranged in layers and all neurons of one layer are connected to all neurons of the next layer. The bottom and top layers are called input and output layer respectively, while the in-between layers are called hidden layers. By iteratively calculating weighted sums and applying non-linearities any function can be represented if the network is large enough. Changing the weights changes the function, and so we can “train” a network from data consisting of pairs of input and target samples by adjusting the weights so that for each input the output of the ANN resembles the target as much as possible. Since an ANN is nothing more than a function with many parameters, we can do this by calculating the gradient of some measure of how badly the network misses the targets w.r.t. all weights and then use some (advanced) form of gradient descent to iteratively minimize that measure. This algorithm is known as backpropagation (Rumelhart, G. E. Hinton, et al., 1988). “Deep learning” (Y. LeCun, Bengio, et al., 2015), which currently spawns a considerable amount of research interest, refers to the training of ANNs that have many layers and are thus “deep”.
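As an illustration, the following NumPy sketch evaluates such a feed-forward network with one hidden layer. The layer sizes, random weights and the logistic nonlinearity are arbitrary illustrative choices, not values taken from any experiment in this thesis.

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer(x, W, b, act):
    # a layer of neurons: weighted sum of the inputs plus bias, then nonlinearity
    return act(W @ x + b)

rng = np.random.default_rng(0)
# 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])          # input layer
h = layer(x, W1, b1, logistic)          # hidden layer
y = layer(h, W2, b2, lambda a: a)       # linear output layer
print(y)

Training would adjust W1, b1, W2, b2 by gradient descent on a loss comparing y to the targets.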

Besides feed-forward networks, two other important classes of ANNs exist. A recurrent neural network (RNN) is a neural network architecture for processing a (time) sequence of inputs. It is evaluated over a series of steps, each corresponding to an item in an input sequence. The structure of an RNN matches the structure of a feed-forward network but with additional, recurrent connections on the hidden units. These connections specify how the state of the hidden layer neurons is propagated from one step to the next. Like a feed-forward network it is trained by minimizing some loss measure with respect to its weights, using the backpropagation through time algorithm (Werbos, 1990). Thus, given an input sequence of, for example, a speech recording, an RNN can be trained to output a sequence of the spoken words in text form.

A convolutional neural network (CNN) is a network architecture developed for image processing and recognition (Y. LeCun, B. E. Boser, et al., 1990; Y. LeCun, B. Boser, et al., 1989). It is modeled after the visual cortex in biological brains and consists of a stack of alternating convolutional and pooling layers. The activations of the neurons in a convolutional layer are given by a convolution of the outputs of the previous layer with the weights, which are shared between all neurons in that layer. The motivation for this operation is that an object in an image should produce the same response regardless of its position. A pooling layer divides its inputs into blocks of a fixed size (for example 4x4) and applies a reduction operation on these blocks to compute its output. A common choice for this reduction operation is taking the maximum value, thus adding a certain amount of invariance to translations. Due to weight sharing CNNs have only a fraction of the weights a feed-forward network of the same size would have. This significantly reduces the amount of required training data, memory and training time, and thus CNNs have become the network architecture of choice for image processing tasks.
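The block-wise reduction performed by a pooling layer can be sketched in a few lines of NumPy; the 4x4 block size follows the example above and max is used as the reduction operation. This is an illustrative sketch, not code from this thesis.

import numpy as np

def max_pool(img, block=4):
    # non-overlapping max pooling: reduce each block x block patch to its maximum
    h, w = img.shape
    img = img[:h - h % block, :w - w % block]   # crop to a multiple of the block size
    patches = img.reshape(h // block, block, w // block, block)
    return patches.max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
print(max_pool(img, block=4).shape)  # (2, 2)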

1.1 State of the Art

We review the state of the art in neural activation functions, multiplicative interactions in neural networks, structure search, stochastic neural networks and their connection to regularization.

Activation Functions

Since the dawn of neural network research the most commonly used activation functions were the logistic function and other sigmoid functions. Sigmoid refers to “S”-shaped functions, for example the hyperbolic tangent (tanh). The use of the logistic function was originally inspired by the thresholding behavior of biological neurons and by the fact that its derivative is benign and can be computed very inexpensively. This choice of activation functions was not seriously challenged by researchers (except for special-purpose applications) until recently, when Nair et al. (2010) introduced the rectified linear unit (ReLU), a neuron with an activation function that is linear for positive inputs and zero for negative inputs. Krizhevsky, Sutskever, et al. (2012) showed that ReLUs produce significantly better results on image recognition tasks using deep networks than the common sigmoid-shaped activation functions.

Figure 1.1: Commonly used activation functions in neural networks: (a) hyperbolic tangent (sigmoidal), (b) exponential linear unit, (c) ReLU, (d) leaky ReLU.

One possible explanation for this success is that, when a sigmoid-shaped function is used in its saturated (flat) areas, its derivative is near zero, thus leading to a weak training signal for the weights; hence training becomes slow or gives worse results. This problem is amplified in deep neural networks. There, the training signal travels through many layers, and thus many potentially saturated neurons, until it reaches the weights in the lower layers. Since, by the chain rule, the derivative of a composed function is the product of the derivatives of its composing functions, the training signal can become exponentially small. The ReLU does not have this problem, at least in the positive range, because there its derivative is simply one. This achievement led to a wave of follow-up research on activation functions specifically tailored to deep networks. While the ReLU solved the problem of vanishing gradients for positive values, it completely cuts off the gradient for negative ones; thus once a neuron enters the negative regime (either through initialization or during training) for most samples, no training signal can pass through it. To mitigate this problem Maas et al. (2013) introduced the leaky ReLU, which is also linear for negative values but with a very small, although non-zero, slope; for positive values it behaves like the ReLU. Soon after, He et al. (2015) demonstrated that it is advantageous to make the slope of the negative part of the leaky ReLU an additional parameter of each neuron. This parameter was trained alongside the weights and biases of the neural network using gradient descent. A CNN using these so-called parametric ReLUs was the first to surpass human-level performance on the ImageNet classification task (Deng et al., 2009).

Commonly used activation functions are shown in fig. 1.1. It is thus natural to ask if even more flexible activation functions are beneficial. This question was answered affirmatively by Agostinelli et al. (2014) on the CIFAR-10 and CIFAR-100 benchmarks (Krizhevsky, Nair, et al., 2014). The authors introduced piecewise linear activation functions that have an arbitrary (but fixed) number of points where the function changes its slope. These points and the associated slopes are inferred from training data by stochastic gradient descent. Maxout networks (Goodfellow et al., 2013) consist of neurons that use no activation function but instead take the maximum over a set of different linear combinations of their inputs to compute their output. This, however, also results in piecewise linear segments in their outputs, and thus the slopes of the different segments of piecewise linear activation functions can be encoded in the weights of maxout networks. Instead of having a fixed parameter for the negative slope of the ReLU, Xu et al. (2015) introduced stochasticity into the activation function by sampling the value for the slope with each training iteration from a fixed uniform distribution. Clevert et al. (2015) and Klambauer et al. (2017) replaced the negative part of ReLUs with a scaled exponential function and showed that, under certain conditions, this leads to automatic renormalization of the inputs to the following layer and thereby simplifies the training of the neural networks, leading to accuracy improvements of deep feed-forward networks on tasks from the UCI repository (Lichman, 2013), amongst others. Nearly fully adaptable activation functions have been proposed by Eisenach et al. (2017). The authors use a Fourier basis expansion to represent the activation function; thus with enough coefficients any (periodic) activation function can be represented. The coefficients of this expansion are trained as network parameters using stochastic gradient descent. Similarly, Scardapane et al. (2017) also use a basis expansion, but with a set of Gaussian kernels that are equally distributed over a preset input range.
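For reference, the following sketch collects minimal NumPy definitions of the fixed-form activation functions discussed above; the negative-slope and scale constants are common default choices, not values prescribed by the cited papers.

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, slope=0.01):
    # linear for positive inputs, small non-zero slope for negative inputs
    return np.where(a >= 0, a, slope * a)

def prelu(a, slope):
    # like the leaky ReLU, but the negative slope is a learnable per-neuron parameter
    return np.where(a >= 0, a, slope * a)

def elu(a, alpha=1.0):
    # scaled exponential for negative inputs
    return np.where(a >= 0, a, alpha * (np.exp(a) - 1.0))

a = np.linspace(-3, 3, 7)
for f in (logistic, np.tanh, relu, leaky_relu, elu):
    print(f.__name__, f(a))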

Neurons that Multiply their Inputs

Let us introduce another type of artificial neuron. It has the same structure as the neuron introduced before, but instead of summing over its inputs, it calculates the product over them to compute its output. In such a multiplicative neuron the weights are not multiplied with the inputs but used as their exponents in the product. A unit that combines the calculation of a product followed by a summation (thus computing a polynomial over its inputs) is called a Sigma-Pi unit, where Sigma stands for sum and Pi stands for product; it was introduced by Rumelhart, McClelland, et al. (1987) and Shin et al. (1991). Networks containing units that perform multiplications are called higher-order neural networks. The power of Sigma-Pi units can be seen from Taylor series, which express every analytic function as a (possibly infinite) polynomial. Thus a Sigma-Pi unit with enough products can represent any function. Although intriguing in what they can represent, such units are also problematic with respect to stability, since any finite polynomial will approach infinity as the magnitudes of its variables increase. This makes higher-order neural networks significantly more challenging to train, which led to a focus of research on purely additive ANNs for many years.

Figure 1.2: Examples for multiplicative interactions in neural networks. (a) Tripartite graph consisting of two images and their relation vector; from (Memisevic, 2011). (b) Attention mechanism over hidden units of an RNN scanning over the input sequence; from (Bahdanau, J. Chorowski, et al., 2016).

Multiplicative interactions reappeared when Memisevic (2011) and Memisevic (2013) showed that they are necessary to represent the relationship between two images. The authors assume that one image can be transformed into the other by application of simple transformations such as translations and rotations to patches of the image. The proposed model is a tripartite graph (fig. 1.2a) consisting of both images and a transformation vector, the entries of which act multiplicatively on the pixels of one image to obtain the other image. Building upon this work, Alain et al. (2013) proposed an autoencoder (Bengio, 2009) containing multiplicative neurons to model the relation between two images with fewer weights and greater generalization performance. Multiplications also proved very beneficial in the context of speech recognition and machine translation using RNNs. J. K. Chorowski et al. (2015) and Bahdanau, J. Chorowski, et al. (2016) showed that adding an attention mechanism to a speech recognition model can significantly improve its prediction accuracies. The proposed model (fig. 1.2b) consists of two RNNs, an

encoder and a decoder, and a feed-forward network controlling the attention. The encoder is used to produce a sequence of latent state vectors from the input audio. The decoder uses this sequence to generate the output text. However, it does not process it sequentially; instead, its input in each step is given by a linear combination of the latent state vectors. The weights of this linear combination are computed by the attention feed-forward network, which takes the previous state of the decoder as input. Thus the decoder chooses which parts of the encoded input sequence it “attends” to for each output word. Because the linear combination is a multiplication of the outputs of two neural networks, the proposed model belongs to the class of higher-order neural networks. The same principle was applied by Bahdanau, Cho, et al. (2014) to machine translation and achieved significant improvements over models without an attention mechanism.
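A minimal sketch of the multiplicative neuron described at the start of this section, where the weights act as exponents. For positive inputs the product can equivalently be computed as an exponential of a weighted sum of logarithms (negative inputs would require the complex versions of exp and log, a point taken up again in section 1.2).

import numpy as np

def multiplicative_neuron(x, w):
    # weights act as exponents: prod_i x_i^{w_i} = exp(sum_i w_i log x_i),
    # valid for positive inputs only in the real domain
    return np.exp(w @ np.log(x))

x = np.array([2.0, 3.0, 0.5])
w = np.array([1.0, 2.0, -1.0])
print(multiplicative_neuron(x, w))   # 2 * 3^2 / 0.5 = 36.0
print(np.prod(x ** w))               # identical by definition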

The Search for Network Architectures

Choosing the structure of an ANN suitable for a particular task is a non-trivial problem. The universal approximation theorem (Gybenko, 1989) guarantees that a network with one hidden layer can approximate any continuous function arbitrarily well; however, it does not bound the number of neurons required, and thus such a flat network becomes too large to be practical when the function to approximate is high-dimensional (such as an image). In many cases ANNs consisting of multiple stacked layers are more efficient, i.e. they can learn the same function with fewer neurons and thus fewer weights. An explanation for this effect is that the lower layers build simple function approximations that are then combined into more powerful functions by the subsequent layers. However, when designing multi-layer architectures many questions arise about their structure. How many layers should the network have? How many neurons should there be in each layer? Should the weights be shared according to some scheme (e.g. CNN)? What activation function should be used for the neurons in each layer? Should only additive neurons be used or will multiplicative units help? A class of algorithms that can optimize the structure of a neural network are evolution strategies (Rechenberg, 1973). These optimization algorithms mimic the progress of natural evolution on a population of artificial genomes. Each genome represents a possible solution to the problem at hand; for instance, it can encode the structure of a neural network. An evolution strategy iteratively applies selection according to some fitness measure, mutation (random change) and crossover (combination of two genomes) to the population. Thus, over the progression of the algorithm the genomes will tend to produce individuals with higher fitness, although no gradient information is used. The success or failure of evolution strategies depends on the encoding of the genome; if it is unsuitably chosen, the operations of mutation and crossover will damage the population and no progress can be made. The first application of evolution strategies to feed-forward neural networks was proposed by Maniezzo (1994). The author encoded both the weights and structure in the genome, leading to a very large genome, which limited the applicability of this approach to small networks. On the other hand, NEAT (Stanley and Miikkulainen, 2002) evolves only the network structure by augmenting it from generation to generation. The fitness of each genome is evaluated by instantiating the corresponding network, training it using backpropagation on the data set until convergence and calculating the loss on a separate validation set. Since the weights are not part of the genome, NEAT can scale to very large networks. However, each round of evolution strategies needs considerable time to train all networks of the population before selection can take place. HyperNEAT (Stanley, D’Ambrosio, et al., 2009) follows the same principle as NEAT but uses Compositional Pattern Producing Networks (Stanley, 2007) to decode the network structure from the genome. Another approach is to search for ANN structures using reinforcement learning (Sutton and Barto, 1998). Zoph et al. (2016) propose a generator RNN that outputs a sequence of tokens that encodes the architecture of a CNN. As with NEAT, each network architecture is decoded from the token sequence, instantiated, trained, and its loss on a validation set evaluated.
The negated loss is then used as a reward for the generator RNN, which is optimized using the REINFORCE rule (R. J. Williams, 1992), with the derivative of the reward signal being approximated using the policy gradient method (Sutton, McAllester, et al., 2000). This method produced good results on the CIFAR-10 benchmark; however, like evolution strategies, it is computationally demanding, as many candidate networks must be trained.
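The selection-mutation loop shared by these structure-search methods can be illustrated with the following toy sketch. Everything here is a stand-in: genomes are simply lists of hidden layer sizes, and the fitness function replaces the expensive train-and-validate step of NEAT-style methods with an arbitrary analytic score.

import random

def fitness(hidden_sizes):
    # stand-in for "train the candidate network and return the negative
    # validation loss"; here we simply prefer layers near an arbitrary target size
    return -sum((h - 32) ** 2 for h in hidden_sizes) - 5 * len(hidden_sizes)

def mutate(genome):
    g = list(genome)
    op = random.choice(("grow", "shrink", "resize"))
    if op == "grow":
        g.insert(random.randrange(len(g) + 1), random.choice((8, 16, 32, 64)))
    elif op == "shrink" and len(g) > 1:
        g.pop(random.randrange(len(g)))
    else:
        i = random.randrange(len(g))
        g[i] = max(1, g[i] + random.choice((-8, 8)))
    return g

random.seed(0)
population = [[16], [64, 64]]                       # initial genomes
for _ in range(50):
    population += [mutate(g) for g in population]   # mutation
    population.sort(key=fitness, reverse=True)      # selection by fitness
    population = population[:2]                     # survivors
print(population[0])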

Uncertainty meets Regularization

ANNs as described so far are deterministic models. After being trained, the network produces a deterministic output for each given test input. This output corresponds to a best-guess estimate of what the network considers the appropriate target. There are mainly two motivations for extending ANNs into the probabilistic domain. First, performing Bayesian inference in a probabilistic model introduces a natural resiliency against overfitting when an appropriate prior is chosen. Second, it would be useful if the ANN could output how certain it is about its predictions, so that one knows how much trust can be put into them. Overfitting means that the model places too much importance on getting the predictions on the training set exactly right and thus lets random noise and outliers heavily influence its output, which leads to a significant decrease in accuracy on the test set.


Figure 1.3: Examples for uncertainty in machine learning models. (a) GP regression has lower confidence (shaded area) when no training point (cross) is nearby; from (Rasmussen et al., 2006). (b) Uncertain weights in a neural network parameterized using normal distributions; from (Blundell et al., 2015).

Regularization is a concept for preventing statistical models from overfitting. An example is L2-regularization, which was first proposed by Tikhonov et al. (1977) in the context of ill-posed problems. For neural networks this method works by adding a penalty given by the magnitude of the weight vector of each neuron to the loss function. Thus it “encourages” each neuron to keep its activation in or close to the linear range of the (usually) sigmoidal activation function. In consequence, using this augmented loss as the training objective will lead a multi-layer ANN to learn functions that have a more linear functional dependency on their inputs. Since this limits the classes of functions that can be learned, and thus the hypothesis space, learning theory (Vapnik, 2013) predicts that the risk of overfitting is thereby reduced. The probabilistic pendant of this method is to use Bayesian inference with a zero-mean Gaussian prior on the weights. A more advanced regularization method for neural networks is Dropout (G. E. Hinton, Srivastava, et al., 2012; Srivastava et al., 2014), which randomly disables neurons during training (by setting their outputs to zero) and thus prevents neurons in subsequent layers from co-adaptation by implicitly performing model averaging.
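Minimal sketches of the two regularizers follow, assuming the common "inverted dropout" formulation in which activations are rescaled at training time; the penalty weight and dropout probability are arbitrary illustrative values, not those of the cited papers.

import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-4):
    # L2 regularization: penalize the squared magnitude of every weight matrix
    return lam * sum(np.sum(W ** 2) for W in weights)

def dropout(h, p=0.5, train=True):
    # randomly disable neurons during training; rescale the survivors so that
    # the expected activation matches test time (inverted-dropout variant)
    if not train:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h, p=0.5))   # roughly half the outputs are zeroed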

For many applications of predictive models it is essential to have an estimate of the uncertainty of the prediction given an input. For example, for an automated medical diagnosis system it is imperative that not only the most probable diagnosis is output but also how much trust can be put into the result. Gaussian process regression and classification (Rasmussen et al., 2006) provide such estimates by performing probabilistic linear regression in a high-dimensional feature space computed from the inputs. Here the estimate of uncertainty depends on the test input. The model will be more certain about predictions it makes on inputs that are similar to samples from the training set, while inputs that have hardly any resemblance to the training set will lead to predictions afflicted with high levels of uncertainty, cf. fig. 1.3a. For example, Urban, Bayer, et al. (2013) use GPs to compute grip forces and their confidence from video sequences of fingernails. Furthermore, Urban, Ludersdorfer, et al. (2015) embed GPs within a time series filter to compensate for the hysteresis of a tactile sensor; in this work the confidence estimation is crucial for Bayesian optimal fusion of state predictions from a dynamical model and raw measurement data. For neural networks it has been proposed (Graves, 2011; G. E. Hinton and Van Camp, 1993) to treat the weights of the ANN as probabilistic variables and apply variational inference (Bishop, 2006) to approximate their posteriors. The common choice for the variational posterior of the weights (fig. 1.3b) is a Gaussian distribution with a diagonal covariance matrix. Thus, in such a probabilistic ANN the number of parameters is doubled compared to a deterministic network, since for each weight we now additionally store its variance. A further approximation is to model the output of each neuron as a normal distribution, the mean and variance being calculated from the means and variances of its inputs and the uncertainty of the weights. This method is similar to the technique of error propagation (Taylor, 1997) for scientific calculations using uncertain physical measurements (which are normally distributed in the majority of cases). Hence, the neurons in the output layer provide a prediction in the form of the mean and variance parameters of a normal distribution. Blundell et al. (2015) have shown that the backpropagation algorithm can be adapted to train such probabilistic models. In summary, treating neural networks probabilistically makes them less prone to overfitting and allows reasoning about the certainty of their predictions.
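The error-propagation-style approximation mentioned above can be sketched for a single weighted sum y = Σ_i w_i x_i, assuming all inputs and weights are mutually independent; this mean-field independence assumption is the usual simplification, not a statement about the exact models of the cited works.

import numpy as np

def propagate(mu_x, var_x, mu_w, var_w):
    # propagate mean and variance through y = sum_i w_i x_i; for independent
    # factors, Var(w x) = Var(w)Var(x) + Var(w)E[x]^2 + E[w]^2 Var(x)
    mu_y = mu_w @ mu_x
    var_y = np.sum(var_w * var_x + var_w * mu_x**2 + mu_w**2 * var_x)
    return mu_y, var_y

mu_x, var_x = np.array([1.0, -0.5]), np.array([0.1, 0.2])
mu_w, var_w = np.array([0.3, 0.8]), np.array([0.05, 0.05])
print(propagate(mu_x, var_x, mu_w, var_w))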

1.2 Motivation for this Work

It is well known that x·y = exp(log x + log y). This means that one can write a product as a sum with the help of the exponential function and the logarithm. This also works for negative numbers by using the complex versions of these functions. Now consider an (additive) neural network. By using the logarithm as the activation function in one layer and the exponential function in the next, we can perform multiplications without having to change the network architecture. Of course we can still apply the sigmoid after the exponential function, if we want to do some (soft) thresholding after the multiplication. Thus, solely by changing the activation functions we can turn an additive neural network into a higher-order neural network. In terms of searching for the optimal architecture not much has been gained yet, since we have just shifted the problem from how a neuron handles its inputs to the choice of its activation function. We would still have to instantiate many networks with different combinations of neurons using the exponential, logarithm and logistic activation functions. The reason for this is that the choice of activation function is discrete and thus it must be treated as a hyperparameter, since no derivative can be computed. However, if we could turn that choice into a continuous spectrum of infinitely many activation functions, the derivative with respect to the choice of activation function would be well-defined. Thus, we could make the activation function an additional parameter (besides the bias) of each neuron and train it alongside the existing network parameters using backpropagation. The activation function also determines the “existence” of a neuron. If it is constant zero, the network behaves as if that particular neuron and all its incoming and outgoing connections were not present at all. Furthermore, it also determines the effective depth of the ANN. If a neuron uses a strictly linear activation function, the network also behaves as if that neuron were not present, but with its incoming connections directly connected to its outgoing connections and the weights adjusted accordingly (using the matrix dot product). Thus if all neurons within a layer use a linear activation function, the whole layer can be folded into the next layer, reducing the effective depth of the ANN. In summary, the choice of activation functions determines the effective architecture of an ANN with respect to additive or multiplicative interactions, the number of neurons in a layer and the number of layers. If a (parametric) model should make predictions with confidence estimates, it needs to contain information about the certainty of its own parameters. Previous approaches to making the predictions of ANNs probabilistic have modeled the uncertainty of the network weights, for example by approximating their posterior with a normal distribution, as discussed before and shown in fig. 1.3b. In this work we follow a different approach. We treat the weights as deterministic network parameters, but propose to mediate the model uncertainty through a distribution over the activation functions of each neuron. Using activation functions which have an input-dependent variance, like the GP in fig. 1.3a, will make the certainty of the network depend on which region of the activation function is used. If areas of high confidence are used, the output of the neuron will have low variance.
On the other hand, if the activation falls into a region of low confidence, the neuron output will have high variance and thus indicate that it is unsure about its output. By propagating this uncertainty through the network, the outputs of the network become predictive distributions that can be interpreted as confidence intervals. The weights influence how the uncertainty is propagated from one neuron to a subsequent neuron. If a connection from a neuron with high-variance output to another neuron has a strong weight, then (by propagation of uncertainty) the receiving neuron will have an activation with high variance. However, if the weight is weak, then the high-variance output does not have a high impact on the variance of the activation of the receiving neuron. Consequently, uncertainty in such a neural network can also decrease as the inputs are propagated from layer to layer. Stochastic activation functions can also act as a powerful regularizer on the ANN if Bayesian inference is performed using an appropriate prior over these functions. If the chosen prior promotes functions with small magnitude, the resulting network will try to solve the learning task with a minimum number of active neurons, because for each active neuron (non-zero output) a penalty is imposed through that prior. If the chosen prior promotes linear functions, the effective depth of the resulting network will be minimized, thus keeping it as linear as possible for solving the given learning task. Since both priors try to keep the complexity of the ANN low, and thus also the size of the hypothesis space, they act as regularizers and counter overfitting. Research on the Dropout regularization method has shown that having neuron outputs that are “noisy” also leads to better generalization performance of the network. A neuron that receives noisy inputs will tend to base its predictions on an evenly spread mixture of these inputs in order to minimize its own output variance (assuming uncorrelated noise). Large weights are disadvantageous for the neuron in that case, since they will lead to a high output variance. The net effect is that the magnitudes of the weight vectors are kept small and each neuron uses redundant inputs for its predictions. This redundancy can be interpreted as an implicit form of model averaging (Hoeting et al., 1999), which is known to counteract overfitting and thus improve the generalization ability of the whole ANN. Motivated by the observations described in this section, this thesis proposes two new classes of activation functions for neural networks, both with the motivation to make the network architecture more dynamic and learnable while confronting overfitting in a Bayesian setting with appropriate priors. The first class of activation functions is a continuum between the logarithm, identity and exponential function in the real and complex domain. It is based on the mathematical theory of fractional functional iteration and is fully differentiable both with respect to its input and the iteration parameter, which selects a function within the continuum. The intended application of these functions is to allow the network to learn where to make use of multiplicative interactions and where to stick to purely additive neurons. This is done during standard backpropagation training with no discrete optimization steps necessary.
The second class of activation functions we propose consists of fully flexible, non-parametric functions that can be learned alongside the other parameters of the network. The only constraint on these functions is that they are smooth; thus this is a generalization of the first class. It is clear that this leads to a very expressive model, which must be regularized appropriately to avoid overfitting. We do so by treating the neural network as a probabilistic model and assuming a GP prior over the activation function of each neuron. Training is performed in a fully Bayesian setting by optimizing a variational posterior, which we derive in this work. The proposed model shares some commonalities with deep Gaussian processes but is more economical in terms of model parameters and considerably more efficient to train.

1.3 Outline

This thesis is structured as follows.

Chapter 2 is a summary of the necessary mathematical and technical prerequisites used within this thesis. It contains an introduction to neural networks, Gaussian processes, numeric optimization and elements of probability theory, including sampling algorithms and variational inference.

Chapter 3 introduces a family of activation functions that spans a smooth and differentiable continuum between the logarithm, identity and exponential function. Based on the concept of fractional functional iteration, real and complex versions of these families and their derivatives are derived. Experimental results of applying neural networks using these activation functions to real and synthetic datasets are presented.

Chapter 4 describes how to efficiently compute expressions for the derivatives of elementwise-defined tensor-valued functions. The development of this method within this thesis was motivated by the need for a high-performance implementation of the stochastic activation function that will be presented in chapter 5. However, since it is generally applicable to any tensor-valued function, with applications inside and outside the field of machine learning, we present it independently.

Chapter 5 introduces the Gaussian process neuron, which is a stochastic neuron that uses a non-parametric activation function with a Gaussian process prior. The activation function of each neuron is fully flexible and inferred from the training data. It is discussed how to build probabilistic graphical models resembling feed-forward networks from these units and how to perform inference using Monte-Carlo methods. An auxiliary parameterization of this model is proposed to enable more efficient training and inference. We describe the route from the auxiliary parameterization to approximate Bayesian inference using a variational posterior. For this purpose the applicability of the central limit theorem to activations of neurons and three different variational approaches are discussed. Preliminary experimental results on regression and classification datasets as well as performance benchmarks are shown.

Chapter 6 provides a summary of the path from the non-parametric model to efficient inference and establishes the relation to deep Gaussian processes. It further gives an overview of possible extensions and applications of the proposed model.

Chapter 2

Prerequisites

This chapter is a summary of technical and mathematical concepts required for the original part of this work. It also serves to establish the notations which will be used. It contains an introduction to neural networks, Gaussian Processes, numeric optimization and elements of probability theory, including sampling algorithms and variational inference. It introduces the subset of concepts necessary for this work and provides references to literature for more detailed information.

2.1 Tensor Slicing

In this work the need arises to slice tensors along one or more dimensions. A star ($\star$) will be used in place of an index to select all indices of that dimension.

Let us provide a few examples. Given a matrix $X \in \mathbb{R}^{N \times M}$, the notation $X_{i\star} \in \mathbb{R}^{M}$ denotes the $i$-th row of $X$. Similarly $X_{\star j} \in \mathbb{R}^{N}$ denotes the $j$-th column of $X$.

This can also be extended to tensors, and the star can be used multiple times. For example, consider the tensor $A \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}$. Here $A_{i \star j \star} \in \mathbb{R}^{N_2 \times N_4}$ denotes the matrix that is obtained by fixing the first dimension of $A$ to $i$ and the third dimension to $j$.
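In NumPy the star corresponds to the colon; a short illustration of the slicing notation (array contents are arbitrary):

import numpy as np

X = np.arange(12).reshape(3, 4)   # X in R^{3x4}
print(X[1, :])                    # X_{i*}: the i-th row (here i = 1)
print(X[:, 2])                    # X_{*j}: the j-th column (here j = 2)

A = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)
print(A[1, :, 2, :].shape)        # A_{i*j*}: a 3 x 5 matrix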

2.2 Probability and Probability Distributions

We will use the notation $\mathrm{P}(X) = f(X)$ to denote that the random variable $X$ is distributed according to the probability density function $f$. Sometimes we will abbreviate this with the notation $X \sim f$. The notation $\mathrm{P}(X \,|\, Y) = f(X, Y)$ expresses the conditional probability density of $X$ given $Y$ and may be abbreviated using $X \,|\, Y \sim f$.

We will use the notation $\mathrm{E}_{\mathrm{P}(X)}[g(X)] = \int \mathrm{P}(X)\, g(X)\, \mathrm{d}X$ to denote the expectation value of $g$ over $\mathrm{P}(X)$. The variance of $g(X)$ under distribution $\mathrm{P}$ will be written as $\mathrm{Var}_{\mathrm{P}(X)}(g(X)) = \mathrm{E}_{\mathrm{P}(X)}[g(X)^2] - \mathrm{E}_{\mathrm{P}(X)}[g(X)]^2$.

Consider three sets of random variables $A$, $B$ and $C$. We say that the set $A$ is conditionally independent of $B$ given $C$, if it holds that $\mathrm{P}(A \,|\, B, C) = \mathrm{P}(A \,|\, C)$ for all values of $A$, $B$ and $C$. Conditional independence can be written using the notation

$$A \perp\!\!\!\perp B \,|\, C\,.$$

If $A$ is conditionally independent of $B$ given no other variables, i.e. $\mathrm{P}(A \,|\, B) = \mathrm{P}(A)$, we write

$$A \perp\!\!\!\perp B \,|\, \emptyset\,.$$

Note that this does not imply that A is still conditionally independent of B if another variable, for example C, becomes observed. Checking for conditional independence can be done by writing out the joint distribution of A and B given C and checking that it factorizes,

$$\mathrm{P}(A, B \,|\, C) = \mathrm{P}(A \,|\, C)\, \mathrm{P}(B \,|\, C)\,.$$

However, this method quickly becomes cumbersome when more random variables are involved and the structure of the joint distribution is more complicated. A systematic method for check- ing for conditional independence using probabilistic graphical models will be discussed in section 2.2.10.

2.2.1 Average

Given a matrix $X \in \mathbb{R}^{N \times M}$ the notation
$$\langle X \rangle_{nm} \triangleq \frac{1}{N} \frac{1}{M} \sum_{n=1}^{N} \sum_{m=1}^{M} X_{nm} \tag{2.1}$$
will be used to denote the average over the indices specified in the subscript. Furthermore the notation
$$\langle X \rangle_{n \neq m} \triangleq \frac{1}{N} \frac{1}{M-1} \sum_{n=1}^{N} \sum_{\substack{m=1 \\ m \neq n}}^{M} X_{nm} \tag{2.2}$$
denotes the average over the off-diagonal elements of the matrix.

2.2.2 Jensen’s Inequality

Jensen’s inequality states that for every convex function ϕ(x) we have

$$\varphi(\mathrm{E}[X]) \leq \mathrm{E}[\varphi(X)]\,. \tag{2.3}$$

Consequently, for a concave function ψ(x), for example the logarithm, we have

$$\psi(\mathrm{E}[X]) \geq \mathrm{E}[\psi(X)]\,. \tag{2.4}$$

2.2.3 Cross Entropy

The cross entropy between two probability distributions $\mathrm{P}$ and $\mathrm{Q}$ is defined as
$$\mathrm{H}(\mathrm{P}, \mathrm{Q}) \triangleq \mathrm{E}_{\mathrm{P}(X)}[-\log \mathrm{Q}(X)] = -\int \mathrm{P}(X) \log \mathrm{Q}(X)\, \mathrm{d}X\,. \tag{2.5}$$

It measures the average number of nats (natural unit of information) required to encode an event from P when a coding scheme optimal for events distributed according to Q is used.

2.2.4 Kullback-Leibler Divergence

The Kullback-Leibler divergence from distribution Q(X) to distribution P(X) is defined as

$$\mathrm{KL}(\mathrm{P} \,\|\, \mathrm{Q}) \triangleq \mathrm{E}_{\mathrm{P}(X)}\left[-\log \frac{\mathrm{Q}(X)}{\mathrm{P}(X)}\right] = -\int \mathrm{P}(X) \log \frac{\mathrm{Q}(X)}{\mathrm{P}(X)}\, \mathrm{d}X\,. \tag{2.6}$$

It can be interpreted as the average additional number of nats necessary to encode a message $X$ from distribution $\mathrm{P}(X)$ using a code optimal for distribution $\mathrm{Q}(X)$. The Kullback-Leibler divergence is not symmetric with regard to swapping $\mathrm{P}(X)$ and $\mathrm{Q}(X)$. It can be shown that it is always non-negative, $\mathrm{KL}(\mathrm{P} \,\|\, \mathrm{Q}) \geq 0$.

2.2.5 Central Limit Theorems

The family of central limit theorems addresses how the distribution of a sum of random variables behaves in the limit of infinitely many summands. There exist variants of the central limit theorem for sums of independent and dependent random variables. 18 CHAPTER 2. PREREQUISITES

Lindeberg-Lévy Central Limit Theorem for Independent and Identically Distributed Random Variables

This is the simplest and most widely known version of the central limit theorem. We follow (Georgii, 2015) here. Given an infinite sequence $X_i$, $i \in \{1, 2, \dots\}$, of independent and identically distributed (iid.) random variables with mean $\mathrm{E}[X_i] = \mu$ and finite variance $\mathrm{Var}(X_i) = \sigma^2$. Then, for the sequence of sample averages
$$S_n \triangleq \frac{1}{n} \sum_{i=1}^{n} X_i\,, \tag{2.7}$$
the following holds,
$$\sqrt{n}\,(S_n - \mu) \xrightarrow{\text{dist}} \mathcal{N}(0, \sigma^2)\,, \tag{2.8}$$
where $\xrightarrow{\text{dist}}$ means “converges in distribution” (Billingsley, 2013). Thus an infinite sum of iid. random variables approaches a normal distribution.

Lyapunov Central Limit Theorem for Independent Random Variables

Given an infinite sequence $X_i$, $i \in \{1, 2, \dots\}$, of independent random variables, each with finite mean $\mathrm{E}[X_i] = \mu_i$ and finite variance $\mathrm{Var}(X_i) = \sigma_i^2$. If for some $\delta > 0$ the condition
$$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} \mathrm{E}\left[\,|X_i - \mu_i|^{2+\delta}\right] = 0 \tag{2.9}$$
with
$$s_n^2 \triangleq \sum_{i=1}^{n} \sigma_i^2 \tag{2.10}$$
is satisfied, then for the sequence of partial sums
$$S_n \triangleq \frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i) \tag{2.11}$$
it holds that $S_n \xrightarrow{\text{dist}} \mathcal{N}(0, 1)$ (Fischer, 2010). Thus it is not necessary for the random variables in a sum to be identically distributed; independence can be enough for the sum to approach a normal distribution.

Central Limit Theorem for Weakly Dependent Random Variables

We follow (Lehmann, 2004) here. If $X_1, \dots, X_N$ are $N$ dependent random variables with $\mathrm{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$, then
$$S_N \triangleq \sqrt{N} \left[\left(\frac{1}{N} \sum_{n=1}^{N} X_n\right) - \mu\right] \tag{2.12}$$
in the limit $N \to \infty$ has the distribution
$$\mathrm{P}(S_\infty) = \mathcal{N}\!\left(S_\infty \,\Big|\, 0,\, \lim_{N \to \infty} \tau_N^2\right) \tag{2.13}$$
with
$$\tau_N^2 = \sigma^2 + \frac{1}{N} \sum_{i \neq j} \mathrm{Cov}(X_i, X_j)\,, \tag{2.14}$$
provided that $\lim_{N \to \infty} \tau_N^2$ is finite. Thus for the central limit theorem to apply, the variables must be at most weakly correlated, in the sense that the sum in (2.14) is finite. Since this sum contains $N(N-1) = O(N^2)$ terms, a way to ensure this is to have $\mathrm{Cov}(X_i, X_j) \neq 0$ for at most $O(N)$ variable pairs $(X_i, X_j)$.

For random variables $X_1', \dots, X_N'$ with inhomogeneous means $\mathrm{E}[X_i'] = \mu_i$ but still $\mathrm{Var}(X_i') = \sigma^2$, we can set $X_i \triangleq X_i' - \mu_i$ and, by applying the above theorem to $X_i$, we obtain that
$$S_N' \triangleq \frac{1}{\sqrt{N}} \sum_{n=1}^{N} X_n' \tag{2.15}$$
in the limit $N \to \infty$ has the distribution
$$\mathrm{P}(S_\infty') = \mathcal{N}\!\left(S_\infty' \,\Big|\, \lim_{N \to \infty} \sum_{n=1}^{N} \frac{\mu_n}{\sqrt{N}},\, \lim_{N \to \infty} \tau_N^2\right)\,. \tag{2.16}$$

2.2.6 Categorical Distribution

The categorical distribution is a probability distribution that describes a discrete random variable that can take one out of $C$ possible values. The notation we use for a random variable $Z \in \{1, 2, \dots, C\}$ that is categorically distributed with $\rho \in [0, 1]^C$ is
$$Z \sim \mathrm{Cat}(\rho)\,. \tag{2.17}$$

This requires that $\sum_i \rho_i = 1$. The corresponding probability mass function is
$$\mathrm{P}(Z \,|\, \rho) = \mathrm{Cat}(Z \,|\, \rho) = \prod_i \rho_i^{[Z = i]}\,, \tag{2.18}$$
where $[a = b]$ is the Iverson bracket (Iverson, 1962), defined by
$$[a = b] \triangleq \begin{cases} 1 & \text{if } a = b \\ 0 & \text{otherwise} \end{cases}\,. \tag{2.19}$$
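A small sketch of sampling from a categorical distribution and evaluating the probability mass function (2.18); note that NumPy indexes the $C$ categories as $0, \dots, C-1$ rather than $1, \dots, C$.

import numpy as np

rho = np.array([0.2, 0.5, 0.3])   # event probabilities, summing to one

def pmf(z, rho):
    # eq. (2.18) with the Iverson bracket: only the factor with i = z survives
    return np.prod([rho[i] if z == i else 1.0 for i in range(len(rho))])

rng = np.random.default_rng(0)
samples = rng.choice(len(rho), size=100000, p=rho)
print(np.bincount(samples) / samples.size)   # empirical frequencies approximate rho
print(pmf(1, rho))                           # 0.5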

2.2.7 Exponential Family Distributions

An exponential family distribution is a probability distribution that has a density function of the form
$$\mathrm{P}_\theta(X) = h(X) \exp\big(\eta(\theta) \cdot T(X) - A(\theta)\big) \tag{2.20}$$
or equivalently
$$\mathrm{P}_\theta(X) = \exp\big[\eta(\theta) \cdot T(X) - A(\theta) + B(X)\big]\,, \tag{2.21}$$
which is sometimes called the canonical form. Here $h(X)$, $\eta(\theta)$, $T(X)$, $A(\theta)$ and $B(X)$ are known functions and $\eta(\theta) \cdot T(X) = \sum_i \eta_i(\theta)\, T_i(X)$ denotes the scalar product. The vector $\theta$ is called the parameter vector of the distribution. The function $A(\theta)$ is called the log-partition function and is automatically determined once the other functions are specified. It is given by the logarithm of the normalization factor, which must be chosen so that the resulting density is properly normalized.

2.2.8 Univariate Normal Distribution

The normal distribution is parameterized by the mean $\mu$ and the variance $\sigma^2 > 0$. The notation we use for a random variable $X$ that is normally distributed is
$$X \sim \mathcal{N}(\mu, \sigma^2)\,.$$
The corresponding probability density function (PDF) is
$$\mathrm{P}(X \,|\, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right) \tag{2.22}$$
and thus it is an exponential family distribution.

Product of Univariate Normal PDFs

As described by Bromiley (2003), the product of two Gaussian PDFs in the same random variable is a scaled Gaussian PDF. Given two normal PDFs
$$f_1(X) = \mathcal{N}(X \,|\, \mu_1, \sigma_1^2)\,, \qquad f_2(X) = \mathcal{N}(X \,|\, \mu_2, \sigma_2^2)\,,$$
we can verify using basic algebra that
$$g(X) \triangleq f_1(X)\, f_2(X) = S_g\, \mathcal{N}(X \,|\, \mu_g, \sigma_g^2) \tag{2.23}$$
with the following parameters,
$$\frac{1}{\sigma_g^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\,, \tag{2.24a}$$
$$\mu_g = \left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right) \sigma_g^2\,, \tag{2.24b}$$
$$S_g = \left(2\pi\, \frac{\sigma_1^2 \sigma_2^2}{\sigma_g^2}\right)^{-\frac{1}{2}} \exp\left(-\frac{(\mu_1 - \mu_2)^2}{2}\, \frac{\sigma_g^2}{\sigma_1^2 \sigma_2^2}\right)\,. \tag{2.24c}$$
By induction these formulas can be generalized to the product of an arbitrary number of normal PDFs. For the case of $N$ PDFs with $f_i(X) \triangleq \mathcal{N}(X \,|\, \mu_i, \sigma_i^2)$, $i \in \{1, \dots, N\}$, the function
$$h(X) \triangleq \prod_{i=1}^{N} f_i(X) = S_h\, \mathcal{N}(X \,|\, \mu_h, \sigma_h^2) \tag{2.25}$$
has the following parameters,
$$\frac{1}{\sigma_h^2} = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}\,, \tag{2.26a}$$
$$\mu_h = \sigma_h^2 \sum_{i=1}^{N} \frac{\mu_i}{\sigma_i^2}\,, \tag{2.26b}$$
$$S_h = (2\pi)^{-\frac{N-1}{2}} \sqrt{\frac{\sigma_h^2}{\prod_{i=1}^{N} \sigma_i^2}}\, \exp\left[-\frac{1}{2}\left(\sum_{i=1}^{N} \frac{\mu_i^2}{\sigma_i^2} - \frac{\mu_h^2}{\sigma_h^2}\right)\right]\,. \tag{2.26c}$$
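The parameters (2.24) can be verified numerically; the sketch below compares the pointwise product of two normal PDFs with the scaled normal PDF predicted by (2.23)-(2.24), using arbitrary example parameters.

import numpy as np

def npdf(x, mu, var):
    # univariate normal PDF, eq. (2.22)
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu1, v1 = 0.0, 1.0     # mean and variance of f1
mu2, v2 = 1.5, 0.25    # mean and variance of f2

# eq. (2.24): parameters of the scaled Gaussian g = f1 * f2
vg = 1.0 / (1.0 / v1 + 1.0 / v2)
mug = (mu1 / v1 + mu2 / v2) * vg
Sg = (2 * np.pi * v1 * v2 / vg) ** -0.5 \
     * np.exp(-0.5 * (mu1 - mu2) ** 2 * vg / (v1 * v2))

x = np.linspace(-3.0, 4.0, 15)
print(np.allclose(npdf(x, mu1, v1) * npdf(x, mu2, v2),
                  Sg * npdf(x, mug, vg)))   # True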

2.2.9 Multivariate Normal Distribution

The $d$-dimensional normal distribution is parameterized by the mean vector $\mu \in \mathbb{R}^d$ and the covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, where $\Sigma$ must be positive-definite. We write
$$X \sim \mathcal{N}(\mu, \Sigma)\,.$$
The PDF of the multivariate normal distribution is given by
$$\mathrm{P}(X \,|\, \mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left(-\frac{1}{2} (X - \mu)^T \Sigma^{-1} (X - \mu)\right) \tag{2.27}$$
or as
$$\mathrm{P}(X \,|\, \mu, \Sigma) = \exp\left(\zeta(\mu, \Sigma) + \mu^T \Sigma^{-1} X - \frac{1}{2} X^T \Sigma^{-1} X\right) \tag{2.28}$$
with the log-partition function
$$\zeta(\mu, \Sigma) = -\frac{1}{2} \left(d \log 2\pi + \log |\Sigma| + \mu^T \Sigma^{-1} \mu\right)\,.$$
The multivariate normal distribution is a member of the exponential family of distributions and can be written in the form of eq. (2.21) using
$$\eta(\theta) = \begin{pmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2}\, \mathrm{vec}(\Sigma^{-1}) \end{pmatrix}\,, \tag{2.29a}$$
$$T(X) = \begin{pmatrix} X \\ \mathrm{vec}(X X^T) \end{pmatrix}\,, \tag{2.29b}$$
$$A(\theta) = \frac{1}{2} \mu^T \Sigma^{-1} \mu + \frac{d}{2} \log 2\pi + \frac{1}{2} \log |\Sigma|\,, \tag{2.29c}$$
$$B(X) = 0\,. \tag{2.29d}$$
Here we have used the vectorization operator $\mathrm{vec}(A)$ that transforms a matrix into a vector by concatenating its rows. Given $A \in \mathbb{R}^{N \times M}$, applying $a = \mathrm{vec}(A)$ results in $a \in \mathbb{R}^{NM}$ with $a_{nM+m} = A_{n,m}$.

Marginal Normal Distribution

d Given a normally distributed random vector X R with X (µ, Σ), the random vector ∈ ∼ N 0 d−1 X R defined by removing element r from X, ∈

0 T X , (X1,...,Xr−1,Xr+1,...,Xd) , (2.30) 2.2. PROBABILITY AND PROBABILITY DISTRIBUTIONS 23 is normally distributed X0 (µ0, Σ0) with ∼ N

0 T µ = (µ1, . . . , µr−1, µr+1, . . . , µd) , (2.31a)   Σ1,1 Σ1,r−1 Σ1,r+1 Σ1,d ··· ···  ......   ......      0 Σr−1,1 Σr−1,r−1 Σr−1,r+1 Σr−1,d Σ =  ··· ···  . (2.31b) Σ Σ Σ Σ   r+1,1 r+1,r−1 r+1,r+1 r+1,d  . ···. . . ···. .   ......   . . . .  Σd,1 Σd,r−1 Σd,r+1 Σd,d ··· ··· Thus the marginal distribution Z 0 P(X ) = P(X) dXr is obtained by removing all entries from the mean vector and covariance matrix that correspond to the dimension that is marginalized out. The multivariate normal distribution is closed under marginalization and the process can be iterated to marginalize out more than one dimension.

Conditional Normal Distribution

Let the random block vector " # X1 X , (2.32) X2 be normally distributed, X (µ, Σ), with mean and variance given by the block matrices ∼ N " # " # µ1 Σ11 Σ12 µ , , Σ , . (2.33) µ2 Σ21 Σ22

Then the conditional is normally distributed, X1 X2 (µ , Σ1), with | ∼ N 1

−1 µ = µ + Σ12 Σ (X2 µ ) , (2.34a) 1 1 22 − 2 −1 Σ1 = Σ11 Σ12 Σ Σ21 . (2.34b) − 22

Note that the conditional covariance Σ1 does not depend on the observed values. The multi- variate normal distributed is closed under conditioning on a subset of its variables. 24 CHAPTER 2. PREREQUISITES

KL-divergence between Two Normal Distributions

The Kullback-Leibler divergence (2.6) can be evaluated explicitly for two multivariate normal

distributions. Let P(X) (X µ , Σp) and Q(X) (X µ , Σq) be two distribution over , N | p , N | q d X R . Then the Kullback-Leibler divergence from Q(X) to P(X) is ∈   1 −1 T −1 Σq KL(P Q) = tr(Σq Σp) + (µq µp) Σq (µq µp) d + log | | , (2.35) || 2 − − − Σp | | where denotes the determinant of a matrix and tr is its trace. |•| •

Affine Transformation

d b Given an affine function Y : R R with →

Y (X) , c + BX

b b×d where c R and B R , the distribution of Y is also multivariate normal with ∈ ∈

Y (c + Bµ,BΣBT ) . (2.36) ∼ N

Thus the multivariate normal distribution is closed under affine transformations. Furthermore, the standard normal distribution (0, 1) can be transformed into a normal distribution of any N mean and covariance by means of an affine transformation of the random variable.

Product of Multivariate Normal PDFs

The product of multiple multivariate normal PDFs in the same random variable is a scaled multivariate normal PDFs. Given N multivariate normal PDF,

fi(X) = (X µ , Σi) N | i

in d-dimensional space, their product

N Y h(X) fi(X) = Sh (X µ , Σh) (2.37) , N | h i=1 2.2. PROBABILITY AND PROBABILITY DISTRIBUTIONS 25 has the following parameters

N −1 X −1 Σh = Σi (2.38a) i=1 N X −1 µh = Σh (Σi) µi (2.38b) i=1 " N N !# 1 X T −1 X T −1 Sh = exp (1 N)d log 2π + log Σh log Σi + µ Σ µ µ Σ µ 2 − | | − | | h h h − i i i i=1 i=1 (2.38c) as can be seen by writing the PDFs in form (2.28) and summing the terms inside the exponential function. A detailed derivation is presented in (Bromiley, 2003).

2.2.10 Probabilistic Graphical Models

Probabilistic graphical models (Lauritzen, 1996) define a factorization of a probability distri- bution using a directed, acyclic graph. For example, the factorization of the joint distribution

P(X1,X2,X3,X4,X5,X6,X7) into

P(X1,X2,X3,X4,X5,X6,X7) = P(X1) P(X2) P(X3) P (X4 X1,X2,X3) P(X5 X1,X3) | | · P(X6 X4) P(X7 X4,X5) (2.39) | | corresponds to the probabilistic graphical model shown in fig. 2.1a. Distributions for the factors

P(X1), P(X2), P(X3), P (X4 X1,X2,X3), P(X5 X1,X3), P(X6 X4) and P(X7 X4,X5) must | | | | be specified separately. In general the joint distribution of a graphical model over random variables X is specified by Y P(X) = P(Xk parents(Xk)) (2.40) | k where parents(Xk) denotes the set of parents of node Xk in the directed, acyclic graph repre- senting the graphical model. Each random variable can be either discrete or continuous. In graphical models conditional distributions are directly available when all parents of a particular node are observed. To obtain marginal distributions for a node in the graph, all its ancestors have to be marginalized out. Probabilistic graphical models offer a method to directly check for conditional independence in the corresponding directed, acyclic graph without having to write down the joint distribution represented by the graph. To check whether A B C, where A, B and C are non-intersecting ⊥⊥ | 26 CHAPTER 2. PREREQUISITES

x 1 a f a f x2 x3

x4 x5 e b e b

x6 x7 c c (a) (b) (c)

Figure 2.1: Examples for probabilistic graphical models from (Bishop, 2006). Observed vari- ables are shown as filled nodes. (a) This corresponds to the factorization of the probability distribution in eq. (2.39). (b) Example for no d-separation between a and b by c. The path from a to b is not blocked by f because the arrows meet tail-to-tail and f is not observed. The path is neither blocked by e because the arrows meet head-to-head and its descendant c is observed. Thus a b c does not follow from this graphical model. (c) Example for d-separation between ⊥⊥ | a and b by f. The path from a to b is blocked by f because the arrows meet tail-to-tail and f is observed. Thus this graphical model implies a b f. ⊥⊥ | sets of random variables, holds in a graphical model perform the following routine. Consider all possible path from any node in set A to any node in set B. Any such path is said to be blocked if it includes a node such that

the arrows on the path meet either head-to-tail or tail-to-tail at the node, and the meeting • node is in the observed set C, or

the arrows meet head-to-head at the node, and neither the meeting node, nor any of its • descendants, are in the observed set C. A descendant is any node that can be reached by following the arrows in tail-to-head direction.

If all paths between A and B are blocked, then A is said to be d-separated from B by C and it holds that A B C. A proof is given in (Lauritzen, 1996). Two examples are shown in ⊥⊥ | fig. 2.1b and fig. 2.1c.

2.2.11 The Unscented Transform

The unscented transform (Julier et al., 1996, 1997) is a method for calculating the of a random variable that undergoes a transformation by a non-. It works by approximating a multivariate normal distribution by a finite and deterministic set of points that are called sigma points and have the same mean and covariance as the distribution. The sigma points are then propagated through the non-linear function and used to estimate the moments of the transformed distribution. No Taylor expansion of the non-linear function is necessary. 2.2. PROBABILITY AND PROBABILITY DISTRIBUTIONS 27

d Let X R be a d-dimensional random variable with distribution ∈

X (µX , ΣX ) ∼ N

d c c and let f : R R be an arbitrary function. The distribution of the random variable Y R → ∈ is determined by Y = f(X). Consider the 2d + 1 sigma points X? and the associated weights

W?, which are given by

κ X µX ,W , (2.41a) 0 , 0 , d + κ X 1 Xi µ + i? ,Wi , i 1, . . . , d , (2.41b) , S , 2 (d + κ) ∈ { } X 1 Xi+d µ i? ,Wi+d , i 1, . . . , d , (2.41c) , − S , 2 (d + κ) ∈ { } where q (d + κ)ΣX (2.42) S , denotes the matrix of the scaled covariance matrix; it can be calculated using an X eigendecomposition of Σ . The parameter κ R controls the spread of the sigma points and ∈ can be chosen freely; a standard heuristic is to set κ = 3 d. It can easily be verified that the − weighted average of the sigma points equals the mean,

2d X X µ = Wi Xi , (2.43) i=0 and the weighted outer product equals the covariance,

2d X X X X T Σ = Wi (Xi µ )(Xi µ ) . (2.44) − − i=0

Thus the sigma points can be taught of as samples from the distribution of x that capture its mean and covariance exactly, i.e. the clever choice of points eliminates the statistical variance that is usually induced by having a limited number of samples.

The mean and covariance of Y are obtained by transforming the sigma points though f and calculating the above statistics using f(Xi). Thus the mean of y is given by the weighted average of the transformed points,

2d Y X µ = Wi f(Xi) , (2.45) i=0 28 CHAPTER 2. PREREQUISITES and the covariance of y is given by the weighted outer product,

2d Y X Y Y T Σ = Wi (f(Xi) µ )(f(Xi) µ ) . (2.46) − − i=0

It can be shown that the estimates provided by the unscented transform are significantly better than when linearizing f(X) around the mean of the distribution and using the affine transfor- mation property of the normal distribution to calculate the transformed statistics.

Alternatively the Cholesky , see below, can be used instead of the matrix square root in (2.42). To do so we set

chol[(d + κ)ΣX ] (2.47) S , and notice that eqs. (2.43) and (2.44) still hold for the sigma points obtained by doing so; thus they exactly capture the mean and covariance of the distribution as before. The mean and covari- ance of the transformed distribution are calculated as above using eqs. (2.45) and (2.46). Using the Cholesky decomposition has the advantage that its derivative is available, see eq. (2.48), and thus µy and Σy can be differentiated w.r.t. parameters occurring in f by application of the chain rule.

Cholesky Decomposition

d×d The Cholesky decomposition of a symmetric, positive-definite matrix A R is the lower ∈ d×d triangular matrix chol(A) R , such that ∈

A = chol(A) chol(A)T .

It can be shown that the Cholesky decomposition is unique and the elements on the diagonal of chol(A) are positive, chol(A)ii 0. The Cholesky decomposition can be computed using a ≥ modified form of the Gaussian elimination algorithm (Press et al., 1992).

Following (Murray, 2016) the Jacobian of the Cholesky decomposition is given by

 d   d  ∂Lij X −1 1 −1 −1 X −1 1 −1 −1 =  Lim L + Lij L  L + (1 δkl)  Lim L + Lij L  L ∂A mk 2 jk jl − ml 2 jl jk kl m=j+1 m=j+1 (2.48) 2.2. PROBABILITY AND PROBABILITY DISTRIBUTIONS 29 with L , chol(A) and δkl is the Kronecker delta with ( 1 if k = l δkl , . (2.49) 0 otherwise

However, it is more efficient to adapt the Cholesky decomposition algorithm for numerical computation of the derivative using automatic differentiation as described in (S. P. Smith, 1995) than to employ the above formula.

2.2.12 Variational Inference

Consider a model that consists of two sets of random variables, X = X1,...,Xn and Z = { } Z1,...,ZN . Furthermore the joint distribution of these variables, { }

P(X, Z) = P(X Z) P(Z) , (2.50) | is specified. The random variables Z can be thought of as latent variables with a prior distri- bution P(Z). The variables X belong to the part of the model that can be observed. Their distribution P(X) could be obtained by marginalizing out Z in (2.50), which is, however, intractable in many cases.

Given observed values for X we want to perform Bayesian inference of the latent variables, i.e. calculate the posterior distribution

P(X Z) P(Z) P(Z X) = | , (2.51) | P(X) which is intractable due the occurrence of P(X). The technique of variational inference (Bishop, 2006; Fox et al., 2012) finds an approximative distribution Q(Z) P(Z X) for the posterior ≈ | instead. Variational inference uses the KL-divergence to measure how closely Q(Z) resembles P(Z), thus the best approximative distribution Q∗(Z) is given by

Q∗(Z) = arg min KLQ(Z) P(Z X) . Q || | 30 CHAPTER 2. PREREQUISITES

Expanding the KL-divergence and rewriting the posterior gives

Z Q(Z) KLQ(Z) P(Z X) = Q(Z) log dZ || | P(Z X) | Z Q(Z) P(X) = Q(Z) log dZ P(X, Z) Z Q(Z) = Q(Z) log dZ + log P(X) . P(X, Z)

Reordering gives log P(X) = (Q) + KLQ(Z) P(Z X) (2.52) L || | where Z P(X, Z) (Q) Q(Z) log dZ (2.53) L , Q(Z) is called the evidence lower bound (ELBO). Since log P(X) is constant with respect to Q(Z) and a KL-divergence is non-negative, the KL-divergence KLQ(Z) P(Z X) is minimized by || | maximizing (Q). Furthermore, this implies L

log P(X) (Q) ≥ L

and thus we obtain a lower bound for the log model evidence by maximizing (Q). Maximizing L (Q) is a much easier task, because usually the joint distribution P(X, Z) is available and L tractable while the posterior is not. There are a variety of methods to perform the maximization of (Q). We will briefly describe two of them. L

Mean-field Approach

The mean-field approach factors Q(Z) into independent partitions of random variables Z1, Z2,

... , ZM , so that M Y Q(Z) = Qp(Zp) . (2.54) p=1 It can be shown that (Q) is maximized by iteratively setting L  exp Ei6=j[log P(X, Z)] Qj(Zj) = R  (2.55) exp Ei6=j[log P(X, Z)] dZj 2.2. PROBABILITY AND PROBABILITY DISTRIBUTIONS 31 until convergence, where the occurring expectation is defined as

M Z Y Ei6=j[log P(X, Z)] , log P(X, Z) Qp(Zp) dZp . i=1 i6=j

In this approach the only aspect that must be specified by the user is the factorization of

Q(Z) in (2.54). The functional forms of the distribution factors Qp(Zp) follow automatically from (2.55). However, care must be taken to choose the factorization so that the occurring expectations remain analytically tractable, which can be challenging in practice.

Parameterized Approximative Distribution

Another approach is to choose a parameterized distribution for Qθ(Z) and perform the max- imization of (Qθ) numerically. For that purpose the derivatives ∂Qθ/∂θ are calculated and L a gradient-based method is used optimize θ step by step. The advantage of this approach is that it works even for problems where the expectations in (2.55) are intractable. However, it is prone to getting stuck in local minima and an unsuitable choice of the functional form of

Qθ(Z) will result in a poor approximation. Nevertheless, this approach has seen widespread adoption lately in the context of variational autoencoders (D. P. Kingma et al., 2013). In that instance Qθ(Z) was modeled as a neural network.

2.2.13 Markov Chain Monte Carlo Sampling

Markov chain Monte Carlo (MCMC) sampling is a class of methods for obtaining samples from probability distributions when the PDF is not available or prohibitively expensive to compute. All MCMC methods have in common that a Markov chain is formed to obtain samples from the distribution. These methods differ in the speed of convergence of the aforementioned Markov chain and the requirements on the availability of PDFs of distributions related to the distribution to be approximated.

Metropolis-Hastings Sampling

The Metropolis-Hastings algorithm (Metropolis et al., 1953) only requires a function that is proportional to the PDF of the target distribution and thus it can be used to sample from D unnormalized probability densities. Let X = (X1,X2,...,XD) R be a vector of random ∈ variables with density function P(X) f(X). Furthermore let Q(X∗ X) be a distribution ∝ | ∗ ∗ ∗ ∗ D (s) over X = (X1 ,X2 ,...,X ) R that can be efficiently sampled from. A Markov chain X , D ∈ 32 CHAPTER 2. PREREQUISITES

Algorithm 1: Metropolis-Hastings sampling Input: unnormalized density f(X) P(X); proposal distribution Q(X∗ X) ∝ | Output: Markov chain X(s), s 0, 1, 2,... , with stationary distribution P (X) ∈ { } (0) 1 X random state // start from random initial state ←− 2 s 1 ←− 3 while true do ∗ ∗ (s−1) 4 Sample X Q(X X ) // obtain proposal state ∼ | ! f(X∗) Q(X(s−1) X∗) 5 ρ min 1, | // calculate acceptance prob. ←− f(X(s−1)) Q(X∗ X(s−1)) ( | true with probability ρ 6 Sample a // accept or reject proposal ∼ false with probability 1 ρ − 7 if a then (s) ∗ 8 X X ←− 9 s s + 1 ←−

s 0, 1, 2,... , that has stationary distribution P(X) can then be obtained by algorithm1. It ∈ { } works by proposing state changes X∗ and accepting them with a probability ρ that is chosen so that the resulting Markov chain will converge to P(X). Since ρ is calculated from ratios of probability densities only, the algorithm can accept an unnormalized PDF. It is necessary to run the Markov chain for a number of iterations before samples from it are distributed according to the stationary distribution P(X). The necessary number of iterations

for that purpose is referred to as the burn-in period Sburnin. Also, nearby samples from the Markov chain will not be independent samples from P(X) but have autocorrelation. If this is

not desired, the Markov chain must be run for Scor intermediate steps between taking samples, where Scor should be of the order of the autocorrelation time of the Markov process. The rate with which we can obtain uncorrelated samples from the Metropolis-Hastings algorithm depends mainly on the choice of the proposal distribution Q(X∗ X). If proposed | values are too dissimilar from the current state, then the acceptance probability will be low, resulting in many trails before a new state is accepted and thus sample generation will be slow. This problem is intensified in high-dimensional spaces, since with increasing number of variables it becomes less likely that a sample from a proposal distribution moves the Markov chain into a region of higher probability. Thus, in high-dimensional spaces commonly a proposal distribution is used that only changes the state in a single dimension by a small values. This leads to an acceptable acceptance probability,but the autocorrelation between adjacent states of the Markov chain will be high, making it necessary to run it for a large number of intermediate steps between consuming samples to ensure that they are uncorrelated. 2.2. PROBABILITY AND PROBABILITY DISTRIBUTIONS 33

Algorithm 2: Gibbs sampling

Input: conditional distributions P(Xp X1,..., Xp−1, Xp+1,..., XP ), p 1, 2,...P , | ∈ { } that can be sampled from Output: Markov chain X(s), s 0, 1, 2,... , with stationary distribution ∈ { } P (X1,..., XP ) (0) 1 X random state // start from random initial state ←− 2 for s 1, 2,... do ∈ { } 3 for p 1, 2,...,P do ∈(s) { } (s) (s) (s) (s−1) (s−1) 4 Xp sample from P(Xp X ,..., X , X ,..., X ) ←− | 1 p−1 p+1 P

Gibbs Sampling

Gibbs Sampling (Casella et al., 1992) requires that conditional sampling of partitions of the random variables is efficiently possible. For example, if sampling from the target distribu- tion P(X1,X2,X3) is not directly possible, but sampling from the conditionals P(X1 X2,X3), | P(X2 X1,X3) and P(X3 X1,X2) is possible, then Gibbs Sampling can be applied to obtain | | samples from that target distribution.

Let the random variables X be split into P partitions, each denoted by Xp. Further assume that sampling from the conditionals

P(Xp X1,..., Xp−1, Xp+1,..., XP ) , p 1, 2,...P , (2.56) | ∈ { } is possible. Then algorithm2 generate a Markov chain with stationary distribution P (X1,

..., XP ). Provided that all conditional distribution in (2.56) have non-zero probabilities, this Markov chain is ergodic and will thus eventually converge to the target distribution.

Hamiltonian Monte Carlo

The Hamiltonian Monte Carlo (HMC) algorithm (Duane et al., 1987) is a variant of the Metropolis-Hastings sampling algorithm that employs a proposal distribution tailored to the distribution to sample from and thus obtains a high acceptance rate in high-dimensional state spaces while still proposing samples that are mostly uncorrelated. It is inspired by the Hamilto- nian function (Arnol’d, 2013), which measures the total energy of a physical system, consisting of kinetic and potential energy. By the laws of physics, energy is conserved in a closed system and thus the value of the Hamiltonian function is constant. This idea is translated into a sam- pling algorithm by treating the unnormalized PDF as potential energy and introducing auxiliary variables that represent impulses and thus contribute the kinetic energy. Movement is then 34 CHAPTER 2. PREREQUISITES

simulated using the Hamiltonian’s differential equations and samples from P(X) are proposed using snapshots of this system.

For each variable Xi of the distribution P(X) HMC introduces an auxiliary variable ρi. These variables are referred to as impulses. The joint distribution P(X, ρ) is given by

1  H(X, ρ) P(X, ρ) = exp (2.57) Z − T where H(X, ρ) , U(X) + K(ρ) (2.58)

is the so-called Hamiltonian, T > 0 is the temperature of the system and Z is the partition function, so that the joint probability is properly normalized. K(ρ) is called the kinetic energy of the system and is given by 1 K(ρ) ρT diag(m)−1ρ (2.59) , 2 where mi > 0 is called the mass of variable Xi. The so-called potential energy U(X) is deter- mined up to a constant by the distribution we wish to sample from,

U(X) log P(X) + const (2.60) , −

and thus an unnormalized density P(X) can be used.

HMC uses Gibbs sampling to alternatively sample the state X and impulse ρ from P(X, ρ). Since X and ρ are independent random variables under (2.57), we have

P(ρ X) = P(ρ) = (ρ 0, diag(m)) | N |

and can easily sample ρ. However, we cannot sample from P(X ρ) = P(X) directly, thus | here HMC uses the Metropolis-Hastings sampling method (algorithm1) to sample from that conditional. The employed proposal distribution Q is deterministic. The proposal state X∗, ρ∗ , { } where ρ∗ stands for the proposed impulse, is computed as follows: Start at the state X(t = 0) , X and ρ(t = 0) , ρ and evaluate it according to Hamilton’s differential equations

dX ∂H dρ ∂H d = , d = , d 1, 2,...,D (2.61) dt ∂ρd dt −∂Xd ∈ { }

for a fixed period of time T . This can be done by either finding an analytic solution of these differential equations or using a numerical simulation method. The proposed state is given by X∗ X(t = T ) and ρ∗ ρ(t = T ). Consequently, the Metropolis-Hasting acceptance , , − 2.3. OPTIMIZATION 35 probability in line5 of algorithm1 calculates to

ρ = min[1, exp( H(X∗, ρ∗) + H(X, ρ))] . (2.62) −

Note that the quotient of the probability of the proposed step and the reverse step Q(X(s−1) X∗) | / Q(X∗ X(s−1)) cancels out, since the proposal state is given deterministically by the Hamilto- | nian dynamics and the reverse step would also occur with probability one due to negating the impulse in the final state ρ∗. By calculating the temporal derivative of the Hamiltonian and applying the dynamics (2.61) we get D   D   dH X ∂H dXd ∂H dρd X ∂H ∂H ∂H dH = + = = 0 , dt ∂Xd dt ∂ρd dt ∂Xd ∂ρd − ∂ρd dXd d=1 d=1 and thus we see that the Hamiltonian is conserved when following Hamilton’s differential equations. Hence, if the dynamics are simulated perfectly the acceptance probability (2.62) becomes one. However, in practice we will seldom be able to find an analytic solution for eq. (2.61) and will thus have to revert to numerical simulation. The accuracy will be limited by the necessity to discretize the differential equations and compute the states at fixed time steps. Since each time step depends on the step before, errors accumulate over time and the invariance of the Hamiltonian will only be preserved approximately. In general any numerical integration method can be used to simulate the system dynamics eq. (2.61). One simple, yet numerically stable, method is the leapfrog method (Griebel et al., 2003). The leapfrog method works by evaluating the state variables X at discretized times that are integer multiples of a step size ε, while the impulse variables ρ are evaluated at times that are shifted by half the step size. Thus X(t) is evaluated for t 0, ε, 2ε, 3ε, . . . and ρ(t) is ∈ { } evaluated for t 0.5ε, 1.5ε, 2.5ε, 3.5ε, . . . . The full method is described in algorithm3. ∈ { } It can be shown (Neal et al., 2011) that the interleaved calculation of state and impulse ensures that H(X, ρ) is preserved exactly, despite the finite discretization of time, because each update step is a volume-preserving shear transformation. Therefore, using the leapfrog method within HMC ensures that the acceptance probability (2.62) is nearly one.

2.3 Optimization

Mathematical optimization (Lange, 2013) deals with finding minimum or maximum values of a function. Functions to be minimized are often called “cost” or “loss” functions. Given a loss d function L : R R we are usually concerned with the problem of finding optimal parameters → 36 CHAPTER 2. PREREQUISITES

Algorithm 3: HMC state proposal using leapfrog integration Input: current state X(0) , X and current impulse ρ(0) , ρ; potential energy U(X) ∗ ∗ Output: proposal state X , X(T ) and proposal impulse ρ , ρ(T ) Parameters: step size ; run time T ; masses m ε ∂U 1 i: ρi(ε/2) ρi // initial half-step for impulses ∀ ←− − 2 ∂Xi 2 for t 1, 2, . . . , T do ∈ { } ρi(t ε/2) 3 i: Xi(t) Xi(t ε) + ε − // full-step for states ∀ ←− − m i ∂U 4 i: ρi(t + ε/2) ρi(t ε/2) ε // full-step for impulses ∀ ←− − − ∂Xi X=X(t)

ε ∂U 5 i: ρi(T ) ρi(T ε/2) // final half-step for impulses ∀ ←− − − 2 ∂Xi X=X(T )

∗ d θ R determined by ∈ θ∗ = arg min L (θ) . (2.63) θ When dealing with neural networks, GPs and related models the loss function is usually non- convex, so an infinite number of local minima can exist and finding the global minimum is usually infeasible due to the high dimensionality of the parameter vector θ. However, in these applications L is differentiable at least almost everywhere, thus iterative optimization methods based on local first and second order derivative information are a good choice here. In machine learning the structure of the loss function is often a sum of losses over individual training samples, T 1 X L (θ) = Lt(θ) , (2.64) T t=1 where t is the sample index and T is the number of training samples.

2.3.1 Gradient Descent

Gradient descent uses the gradient of the loss function w.r.t. θ to perform incremental steps towards a local minimum. It uses the fact that the gradient of a function points to the direction of its steepest ascent. The parameters θ0 are initialized randomly. In each step s we set

θs+1 θs η L (θs) (2.65) ←− − ∇ where L denotes the gradient of L w.r.t. θ and η > 0 is called the step rate. It determines ∇ the size of each step and a good choice is crucial for the success of this method. If η is too large, 2.3. OPTIMIZATION 37 the algorithm becomes unstable and the sequence of θ diverges. If η is too small, the method takes more time than necessary and is more prone to getting stuck in small local minimal. Many heuristics for automatic adaption and scheduling of η exist. Assuming that L is a convex function and Lipschitz continuous with constant K > 0, i.e.

0 0 L (θ) L (θ ) K θ θ (2.66) ∇ − ∇ ≤ − for any θ and θ0, then gradient descent with fixed step size η 1/K satisfies ≤ 0 2 θ θ∗ L (θs) L (θ∗) − , (2.67) | − | ≤ 2 η s meaning that gradient descent converges for convex functions with a rate of (1/ηs) (Boyd O et al., 2004). This form of gradient descent is also called batch training, because the loss is taken over all training samples, i.e. the whole batch of training samples is consulted to perform an update of the parameters.

2.3.2 Stochastic Gradient Descent

Stochastic gradient descent (Y. A. LeCun et al., 2012) exploits the structure (2.64) of the loss functions and works as follows. Start with epoch r set to zero, r 0. Shuffle the training set ←− and then iterate over the training set performing the following update step iteratively,

rT +t+1 rT +t η rT +t θ θ Lt(θ ) . (2.68) ←− − T ∇

Increase epoch counter r, reshuffle the training set and repeat until convergence. Thus in contrast to batch learning this method performs an update of the parameters after each training sample. A compromise between these two extremes is mini-batch training. This method partitions the training set into random, disjoint subsets called mini-batches of size M. Training is performed by iterating over the mini-batch index b 1,...,B using the update ∈ { } rule Mb+M−1 rB+b+1 rB+b η X rB+b θ θ Lt(θ ) (2.69) ←− − T ∇ t=Mb where B , T/M is the number of mini-batches. After the iteration over all mini-batches b is completed, the repetition counter r is increment and the process restarted. Mini-batch training has two advantages over batch training. One, it is computationally more efficient, because often a subset of the training set is sufficient to calculate a good enough 38 CHAPTER 2. PREREQUISITES

gradient. Two, it leads to better optima since the randomness of the mini-batches adds a small amount of noise to the gradient signal, which can help escape local minima. However it is also more prone to instabilities as fewer training samples enter the gradient estimate; thus η must be reduced appropriately.

2.3.3 Momentum

The momentum technique (Rumelhart, G. E. Hinton, et al., 1988) mimics the physical behavior of a particle moving on the surface of the loss function. In this physical model the acceleration of the particle is determined by the gradient of the loss function and its position is updated according to its velocity. In the case of batch learning this leads to the following update rules for velocities θ˙ and parameters θ from step s to step s + 1,

s+1 s θ˙ αθ˙ η L (θs) , (2.70a) ←− − ∇ s+1 θs+1 θs + θ˙ , (2.70b) ←− where 0 α < 1 is the momentum factor. Momentum can also be applied to mini-batch ≤ learning by adapting the update rules in the same way.

2.3.4 Optimization Methods for Neural Networks

In multi-layer neural networks the gradient w.r.t. to the weights of the lower layers can get very small due to iterated application of the logistic function, leading to very small update steps on these weights. A method to avoid this is to just use the sign (positive or negative) of each gradient element to decide whether to increase or decrease the corresponding parameter. An optimization method based on this concept for batch learning is resilient backpropagation (Rprop) by Riedmiller et al. (1992). For mini-batch learning Tieleman et al. (2012) proposed the RMSProp (root mean square propagation) method that averages the signs of the gradient elements from multiple mini- batches of training data. It works by keeping a running average of the gradient in the auxiliary variable ψ. The method iterates over the mini-batch index b and uses the following update equations,

Mb+M−1 !2 rB+b+1 rB+b 1 γ X rB+b ψ γ ψ + − Lt(θ ) , (2.71a) ←− T ∇ t=Mb PMb+M−1 rB+b rB+b+1 rB+b η t=Mb Lt(θ ) θ θ p ∇ , (2.71b) ←− − T ψrB+b+1 2.4. GAUSSIAN PROCESSES 39 where the square in (2.71a) and the quotient in (2.71b) are to be taken elementwise. It divides each gradient element by its average magnitude, leading to a sign estimate being used for the parameter updates. The parameter γ controls the decay rate of the average and a usual setting is γ = 0.9. RMSProp leads to a significant improvement of the learning speed and results of deep neural networks. D. Kingma et al. (2014) combined the ideas of RMSProp and momentum leading to the development of the Adam (adaptive moment estimation) optimizer. It works in a mini-batch setting and keeps running averages for both the gradient and second moment of the gradient. Iterating over the mini-batch index b, Adam uses the following update equations,

Mb+M−1 rB+b+1 rB+b 1 α X rB+b θ˙ α θ˙ + − Lt(θ ) , (2.72a) ←− T ∇ t=Mb Mb+M−1 !2 rB+b+1 rB+b 1 γ X rB+b ψ γ ψ + − Lt(θ ) , (2.72b) ←− T ∇ t=Mb ˙rB+b+1 rB+b+1 rB+b θ /(1 α) θ θ η q − , (2.72c) ←− − ψrB+b+1/(1 γ) + ε − where the parameter α controls the decay rate of the gradient, γ controls the decay rate of the second moment of the gradient, η is the step rate and ε is a small number to prevent division by zero.

2.4 Gaussian Processes

A Gaussian process (Lawrence, 2005; Neal, 1997; O’Hagan et al., 1978; C. K. Williams et al., 1996, 2006) describes a distribution over scalar-valued functions f(x) with multivariate inputs N×D x. Consider a matrix X R of N input values with D dimensions. Any finite number of ∈ function values f(Xi?), i 1, 2,...N , has a joint multivariate normal distribution, which is ∈ { } D D D defined by the mean function m : R R and covariance function k : R R R of the → × → N GP. Let f R be the vector of function values given by fi f(Xi?). If the function f follows ∈ , a GP, f(x) (m(x), k(x, x0)) , (2.73) ∼ GP this means that the function values f are normally distributed,

f (m,K(X,X)) , (2.74) ∼ N 40 CHAPTER 2. PREREQUISITES with mean and covariance determined by the inputs X,

mi , m(Xi?) , (2.75a)

K(X,X)ij , k(Xi?,Xj?) . (2.75b)

This implies that k(x, x0) must by symmetric and positive-definite to be a valid covariance function. GPs can be defined with a variety of mean and covariance functions. A common choice is to use the zero mean function, m(x) = 0, thus regularizing the function magnitude. The squared exponential (SE) covariance function is given by

0 2 ! 0 2 x x kSE(x, x ) = σ exp | − | . (2.76) f − 2 l2

Since it can be differentiated infinitely many times, GPs using the SE covariance function produce smooth functions. The parameter l controls the characteristic length-scale of the co- , which can be understood as the increase in distance that leads to a decrease in covariance by factor 1/e. Function samples from a GP with SE covariance function using lengthscale l will have similar values within distance l with high probability. Another choice is the automatic relevance determination (ARD) covariance function given by " D  0 2# 0 2 1 X xd xd kARD(x, x ) = σf exp − . (2.77) −2 ld d=1

It is an extension of the SE covariance function and uses a separate lengthscale parameter ld for each input dimension. This makes it possible to automatically weight the importance of each input dimension in the prediction and ignore dimensions that are not correlated with the targets. 2 The parameter σf common to both covariance functions specifies the variance of the samples and thus controls the average deviation from the mean function.

2.4.1 Gaussian Process Regression

GPs can be used to perform regression by conditioning the GP on a set of observed function values. Consider a function f(x) distributed according to (2.73). Given a finite set of obser- T vation points Xi?, i 1, 2,...,N , with corresponding observed values y = (y1, y2, . . . , yN ) ∈ { } ∗ where yi = f(Xi?) the conditional distribution of the predicted function values f(x ) at some 2.4. GAUSSIAN PROCESSES 41 test points x∗ is another GP,

f(x∗) X, y (m∗(x∗), k∗(x∗, x∗0)) , (2.78) | ∼ GP with predictive mean and covariance functions given by

m∗(x∗) m(x∗) + K(x∗,X) K(X,X)−1 (y m) , (2.79a) , − k∗(x∗, x∗0) k(x∗, x∗0) K(x∗,X) K(X,X)−1 K(X, x∗0) , (2.79b) , −

∗ ∗ ∗0 ∗0 where K(x ,X)i , k(x ,Xi?), K(X, x ) , k(Xi?, x ) and mi , m(Xi?). This follows directly from the equations for the mean and covariance (2.34) of a conditional, multivariate normal distribution, since any finite number of points from the GP must be consistent with it. The above result assumes that the function values were observed exactly and may lead to suboptimal results because the GP will fit the observations without any tolerance, leading to overfitting. To avoid this, we can assume that each observation is afflicted with iid. Gaussian 2 2 noise, i.e. yi = f(Xi?) +  with  (0, σ ). This corresponds to adding σ to the diagonal of ∼ N n n the covariance matrix K(X,X). Thus, in the case of noisy observations we obtain for the mean and covariance functions of the predictive GP,

m∗(x∗) m(x∗) + K(x∗,X)[K(X,X) + σ21]−1 (y m) , (2.80a) , n − k∗(x∗, x∗0) k(x∗, x∗0) K(x∗,X)[K(X,X) + σ21]−1 K(X, x∗0) , (2.80b) , − n where 1 is the identity matrix. An example for GP regression in one dimensional space without and with observation noise is shown in fig. 2.2.

2.4.2 Marginal Likelihood

So far it was assumed that the parameters σf , σn and l are known in advance, for example by some domain knowledge of the underlying process generating the function. However, in many regression problems we are just given the observations without additional information. Thus it becomes necessary to estimate good values for these parameters from the data alone. Since a GP is a probabilistic model we can calculate the likelihood of the observed data and maximize it w.r.t. the parameters. The marginal log-likelihood of the observations is calculated using the PDF of the multivariate normal distribution (2.27) and evaluates to

1 T 2 −1 1 2 D log P(y X) = y (K(X,X) + σ 1) y log K(X,X) + σ 1 log 2π . (2.81) | −2 n − 2 n − 2 42 CHAPTER 2. PREREQUISITES

10 10

5 5

0 0

5 5

10 10 0 1 2 3 4 5 6 0 1 2 3 4 5 6

(a) without noise (b) with noise

Figure 2.2: Gaussian process regression using the squared exponential covariance function. Observations are shown as red dots. The black line shows the predictive mean and the grey lines are samples from the predictive GP. (a) The observations are treated as exact with σn = 0; thus all GP samples must pass through them. (b) The observations are treated as afflicted with noise (σn = 1.3); thus the predictive GP does not need to fit them exactly.

It is a measure of how well the GP fits the data. Maximizing it using an iterative optimization

method of our choice (section 2.3) w.r.t. σf , σn and l results in a maximum likelihood estimate of these parameters.

2.4.3 Derivative Observations and Predictions

Since the derivative is a linear operation, the derivative of a Gaussian process is also a Gaussian process (Riihimäki et al., 2010; Solak et al., 2003). It can be shown that given two points xi and xj with values yi = f(xi) and yj = f(xj) respectively, it holds for the covariances of the derivatives that

 i  ∂y j ∂ i j Cov i , y = i Cov(y , y ) , (2.82a) ∂xg ∂xg ! ∂yi ∂yj ∂2 Cov , = Cov(yi, yj) . (2.82b) ∂xi j i j g ∂xh ∂xg ∂xh

For the SE covariance function these expressions evaluate to

 i  i j 2 ! i j ∂y j ∂kSE 2 x x xg xg Cov i , y = i = σf exp − 2 −2 , (2.83a) ∂xg ∂xg − − 2 l l i j ! 2 i j 2 ! i j i j ! ∂y ∂y ∂ k x x δgh xg xg x x Cov , = SE = σ2 exp − − h − h , (2.83b) ∂xi j ∂xi ∂xi f 2 l2 l2 l2 l2 g ∂xh g h − − 2.5. ARTIFICIAL NEURAL NETWORKS 43

where δgh is the Kronecker delta function. Thus we can introduce observations of the derivatives into the training data of the GP. To do so the sets of training points and target values are extended with the derivative observations and the covariance matrix uses (2.82a) and (2.82b) for the appropriate entries. The joint vector of targets and target derivatives becomes

" # y y = b y0 where y0 denotes the derivative targets and the covariance matrix K is composed according to

" 0 # Kyy Kyy Kb = 0 0 0 (Kyy )T Ky y with 2 0 ∂k 0 0 ∂ k Kyy = k(xi, xj) ,Kyy = (xi, xj) ,Ky y = (xi, xj) . ij ij j ij i j ∂xg ∂xg ∂xh

Predictions are then obtained by using eq. (2.79) with y and K replaced by yb and Kb respec- tively. We can also calculate the mean and the variance of the derivative for a test point x∗. Since the mean of the derivative is equal to the derivative of the mean we obtain

 ∗  ∗  ∗ T ∂f ∂E[f ] ∂k 2 −1 E ∗ = ∗ = ∗ (K + σnI) y , (2.84) ∂xd ∂xd ∂x and for the variance of the derivative we get

 ∗  2 ∗ ∗  ∗ T ∗ ∂f ∂ k(x , x ) ∂k 2 −1 ∂k Var ∗ = ∗ 2 ∗ (K + σnI) ∗ . (2.85) ∂xd ∂(xd) − ∂xd ∂xd

A possible application of this technique is to sample random but smooth trajectories (for exam- ple for robots) with constraints on the positions, velocities and accelerations at the start and the end of the trajectory.

2.5 Artificial Neural Networks

An artificial neural network (Haykin, 1994) is a of a biological neural network, as it occurs in the brain. An ANN consists of units called neurons that receive inputs from other neurons, perform a computation on their inputs and output a single value. Neurons are usually arranged in so-called layers, which are groups of neurons that receive their inputs 44 CHAPTER 2. PREREQUISITES

3 x1 layer 3

W 3

2 2 2 layer 2 σ x1 x2 x3 W 2

1 1 1 layer 1 σ x1 x2 x3

W 1 inputs 0 0 x1 x2

Figure 2.3: A feed-forward neural network consisting of three layers. Each filled circle represents l a neuron xi. The directed connections show the input a neuron receives. With each connection l−1 l l from neuron xj to neuron xi there is a weight Wij associated.

from the same set of neurons. There are different neural network architectures. The most basic architecture is the feed-forward, in which one layer is stacked upon another and the data always flows from one layer to the next. For illustration let us consider the feed-forward neural network with multidimensional input and scalar output, as shown in fig. 2.3.

l Each neuron x , where l 1,...,L denotes the layer and i 1,...,Nl denotes the i ∈ { } ∈ { } index of the neuron within the layer, receives input from the neurons xl−1, j 1,...,M , j ∈ { } of the previous layer. The output of a neuron is given by a weighted sum of its inputs plus a so-called bias propagated through a non-linear activation function. The weights are stored in a l N ×N N weights matrix W R l l−1 and the biases are given by a vector b R l . The activation of ∈ ∈ a neuron is Nl−1 l l−1 l X l l−1 ai(x ) = bi + Wijxj . (2.86) j=1 and its output is given by   Nl−1 l l−1 l l X l l−1 xi(x ) = σ(ai) = σbi + Wijxj  . (2.87) j=1

The purpose of the activation function σ(t) is to introduce a non-linearity into the neural network. Otherwise the whole network could be described by one matrix multiplication and its output would thus be an affine function of its input. A typical choice for the activation function 2.5. ARTIFICIAL NEURAL NETWORKS 45

σ(t) is the logistic function 1 σ(t) = (2.88) 1 + e−t or another sigmoid-shaped function like the hyperbolic tangent σ(t) = tanh(t). This choice was historically inspired by the thresholding behavior of biological neurons and because the derivative of the logistic function is inexpensive to compute. Matrix multiplication can be used to jointly compute the activations of all neurons in one layer more efficiently, resulting in

al(xl−1) = W l xl−1 + bl (2.89) and xl(al) = σ(al) where the activation function is applied elementwise implicitly, that is σi(t) , σ(ti). For convenience, the parameters of a neural network are usually combined into a parameter vector θ W 1, b1,...,W L, bL . A feed-forward network defines a function f (x0) , { } θ depending on its inputs x0 and parameterized by θ, given by

f (x0) = xL(xL−1( x1(x0) )) . (2.90) θ ··· ···

2.5.1 Regression and Classification

Feed-forward networks can be used to approximate an arbitrary function g(z) from a training set (zs, ts): s 1,...,S of S samples, consisting of pairs of inputs zs and corresponding { ∈ { }} targets ts = g(zs). To make the output of the feed-forward network f θ(z) approximate the target values t, appropriate values for the parameters θ must be determined. To do so, a loss function is defined; for regression tasks a usual choice is the average mean-squared error over the training set, S 1 X 2 Lreg(θ) = (f (zs) ts) . (2.91) S θ − s=1 For classification problems it is desirable to obtain the predictions in the form of probabilities that a given input is in a particular class. However, the network outputs cannot be directly interpreted as class probabilities since they are not normalized. Thus the function softmax : C C R R , defined by → exp oi softmax(o)i , P , (2.92) j exp oj is used to translate raw network outputs o into probabilities. It can easily be verified that P 0 softmax(o)i 1 and softmax(o)i = 1; thus the softmax provides valid probabilities for ≤ ≤ i a categorical distribution as defined in section 2.2.6. We have for the probability that input z 46 CHAPTER 2. PREREQUISITES

belongs to class t 1, 2,...,C , ∈ { }

P(t z) = Cat(t softmax(f (z))) . (2.93) | | θ

Since this model gives a probabilistic interpretation of the predictions of the neural network, it is natural to train it by maximizing the likelihood of the parameters θ. Thus we use the negative log-likelihood as the loss,

S S 1 X 1 X Lclass(θ) = log P(ts zs) = log[softmax(f (zs))t ] , (2.94) −S | −S θ s s=1 s=1 where zs is input sample s and ts is the corresponding target class. A one-hot encoding T ∈ S×C 0, 1 encodes the target class ts as a vector of length C with Tsc = 1 if ts = c and Tsc = 0 { } otherwise. If such a one-hot encoding is used, then the loss can be written as

S 1 X Lclass(θ) = Ts? log softmax(f (zs)) , (2.95) −S · θ s=1 where the denotes the scalar product between two vectors. · For both regression and classification the optimal parameters θ∗ are found by minimizing L (θ), i.e. θ∗ = arg min L (θ) . (2.96) θ Since this minimization cannot be performed analytically due to the non-linearities induced by the activation functions, iterative optimization techniques are employed as described in section 2.3. Historically the gradient was calculated using a technique called backpropagation (Y. LeCun, B. Boser, et al., 1989) based on iterative applications of the chain rule. Modern software packages that implement neural networks use a more flexible approach. They interpret the neural network as a generic function and use reverse mode automatic differentiation, which is described in detail in section 4.1, to compute all its derivatives. This allows extending the basic ANN model with additional elements with the only requirement that each function element the model is composed of must be differentiable at least almost everywhere.

2.5.2 Universal Approximation Theorem

It was shown by Hornik et al. (1989) that feed-forward networks with at least one hidden layer and a sigmoidal activation function are able to approximate any function arbitrarily well given a sufficient number of hidden units. However, even though such a network is an universal 2.5. ARTIFICIAL NEURAL NETWORKS 47 function approximator, there is no guarantee that it can approximate a function efficiently. If the architecture of the neural network is not a good match for a particular problem, a very large number of neurons may be required to obtain acceptable results. Increasing the number of layer can make a feed-forward network more powerful, because upper layers can compose more complex functions from basic functions learned by neurons in the lower layers (Bengio, 2009). 48 CHAPTER 2. PREREQUISITES Chapter 3

A Continuum between Addition and Multiplication

Common feed-forward neural networks are widely successful at learning various datasets, but they struggle to learn certain classes of functions efficiently, which can be attributed to the lack of multiplicative interactions. Vice versa, networks solely build on multiplicative interactions excel at certain problems but fail at many problems for which common neural networks work well. Existing approaches to combine both additive and multiplicative networks either use a fixed assignment of operations (which may be inspired from domain knowledge) or require discrete optimization to determine what function a neuron should perform. This leads either to an inefficient distribution of computational resources or an extensive increase in the com- putational complexity of the training procedure due to the necessity of empirically evaluating different assignments of neurons to the additive or multiplicative regime.

In this chapter we introduce a novel, parameterizable activation function based on the mathematical concept of non-integer functional iteration. Non-integer functional iteration gen- eralizes the concept of iterated application of a function to non-integer number of iterations; for example the 1/3-iterate of the exponential function, is the function that needs to be applied iter- atively 3 times to calculate the exponential. We will show how this concept allows the operation each neuron performs to be smoothly and, most importantly, differentiablely adjusted between addition and multiplication. Finally, we show how this enables the decision between addition and multiplication to be integrated into the standard backpropagation training procedure of a neural network. Issues regarding numerical precision are discussed and the proposed family of activation functions is validated on a synthetic polynomial dataset and on the standard MNIST digit recognition task.

49 50 CHAPTER 3. A CONTINUUM BETWEEN ADDITION AND MULTIPLICATION

Contributions to this Chapter

The original proposal of this idea and the formulation of the concepts in this chapter were done by me and originally published in Urban and Smagt (2015). Implementation, benchmarking and experiments were done by Wiebke Köpp under my guidance leading to publications in Köpp (2015) and Köpp et al. (2016). The results of these experiments are reproduced here for completeness and explicitly marked as such.

3.1 Examples for the Utility of Multiplicative Interactions

We have already mentioned in chapter1 that multiplicative interactions have proven useful for attention mechanisms in RNNs and for modeling the relations between two images. Here we present a theoretic argument involving the representability of polynomials and show how multiplicative interactions can make the task of learning variable pattern shifts efficiently rep- resentable.

Variable Pattern Shift

As an example for why additive and multiplicative interactions in neural networks are useful, consider the dynamic pattern shift task shown in fig. 3.1a. For simplicity we consider this problem in one dimension, but this argument can be extended to two dimensions (representing images for example) without much effort. The input consists of a binary vector x of N elements N and an integer m 0, 1,...,N 1 . The desired output y R is x circularly shifted by m ∈ { − } ∈ elements to the right,

yn = x(n−m) mod N , where  n m if n m > 0 (n m) mod N , − − . − N + n m if n m 0 − − ≤ A method to implement this task efficiently in a neural architecture is based on the shift theorem

of the discrete Fourier transform (DFT)(Brigham, 1988). Let (x)k denote the k-th element F of the DFT of x. By definition we have

N−1 X −2πi (k−1)n (x)k = xn+1 e N F n=0 3.1. EXAMPLES FOR THE UTILITY OF MULTIPLICATIVE INTERACTIONS 51 and its inverse is given by

N−1 −1 1 X 2πi k(n−1) (X)n = Xk+1 e N . F N k=0

The shift theorem states that a shift by m elements in the time domain is equivalent to a multiplication by factor e−2πi(k−1)m/N in the frequency domain,

−2πi (k−1)m (y)k = (x)k e N . F F

Hence the shifted pattern can be calculated using

−1 −2πi (k−1)m y = ( (x)k e N ) . F F

Using the above definitions its v-th component of y is given by

N−1 " N−1 # 1 X 2πi k(v−1) −2πi km X −2πi kn y = e N e N x e N . v N n+1 k=0 n=0

If we encode the shift amount m as a one-hot vector s of length N, i.e. sj = 1 if j = m else sj = 0, we can further rewrite this as

N−1 1 X 2πi k(v−1) y = e N S X (3.1a) v N k+1 k+1 k=0 with

N−1 N−1 X −2πi (k−1)m X −2πi (k−1)n Sk = sm+1 e N ,Xk = xn+1 e N . (3.1b) m=0 n=0

This corresponds to a neural network with two hidden layers (one additive, one multiplicative) and an additive output layer as shown in fig. 3.1b. The optimal weights of this network are given by the corresponding coefficients from (3.1). This example can also be extended to patterns in two or more dimensions. A neural network without multiplications could still be trained on this task since any network with at least one hidden layer is an universal function approximator (Hornik et al., 1989). However, the number of neurons required might be significantly increased and it is unclear if good generalization would be possible, as no ideal weight assignment is known. 52 CHAPTER 3. A CONTINUUM BETWEEN ADDITION AND MULTIPLICATION

y y y pattern x + shift s (m) Y Y Y x X X X S S S + y x x x s s s

(a) (b)

N Figure 3.1: (a) Variable pattern shift problem. Given a random binary pattern x R and ∈ an integer m 0, 1,...,N 1 presented in one-hot encoding, the neural network should ∈ { − } output the pattern x circularly shifted to the right by m grid cells. (b) An ANN with two hidden layers can solve this problem by employing the Fourier shift theorem. The first hidden layer is additive, the second is multiplicative and the output layer is additive. All neurons use linear activation functions. The first hidden layer computes the DFT of the input pattern x and shift mount s. The second hidden layer applies the Fourier shift theorem by multiplying the DFTs of x and s. The output layer computes the inverse DFT of the shifted pattern.

Polynomials

Considering the representability of a polynomial in one or multiple variables is important because any analytic function can be written as an (infinite) polynomial by means of its Taylor series (Swokowski, 1979). Since in practice infinite polynomials cannot be handled, the Taylor series is cut off after a number of terms, leading to an approximation of the function. Given a polynomial of degree d in n variables with k non-zero coefficients, it can be shown (Andoni et al., 2014; Barron, 1993, 1994) that it is representable by an additive neural network with a single hidden layer and nkdO(d) hidden units. Note that dd grows exponentially and thus for a moderate polynomial of degree seven, which is for example the degree of the Taylor series necessary to represent one period of the function reasonably well, this already leads to a factor of 77 800 000. A neural network using product units in its hidden layer can represent ≈ the same polynomial with k coefficients, regardless of the dimensionality of the input space and the degree of the polynomial. 3.2. ADDITIVE AND MULTIPLICATIVE NEURONS 53

3.2 Additive and Multiplicative Neurons

As described in section 2.5 in standard ANNs the value of a neuron is given by a weighted sum of its inputs propagated through a non-linear activation function. Considering a single layer of neurons y receiving inputs from a preceding layer denoted by x, the value of y is given by

y = σ(W x) , (3.2) where we have absorbed the bias term into the weight matrix W by assuming that there exists an xb with xb = 1. In the context of this chapter we will call neural networks consisting of such layers additive ANNs. In Durbin et al. (1989) an alternative neural unit in which the weighted summation is replaced by a product, where each input is raised to a power determined by its corresponding weight, was proposed. The value of such a product unit is given by   Y Wij yi = σ xj  . (3.3) j

Using laws of the exponential function this can be written as    X yi = σexp Wij log xj j and thus the values of such a layer can be computed efficiently using matrix multiplication, i.e.

y = σ exp(W log x) (3.4) where exp and log are taken elementwise. If the incoming values x can be negative, the complex exponential and logarithm must be used. Often no non-linearity is applied to the output of a product unit, i.e. σ(t) = t. We have seen that having both additive and multiplicative interaction in a neural networks is useful to solve complex problems. Yet this poses the problem of how to distribute additive and multiplicative units over the network, i.e. how to determine whether a specific neuron should be an additive or multiplicative unit to obtain the best results. For the tasks described in section 3.1 an optimal allocation could be determined by of the problem; however such an analysis is not possible in general for complex problems. We propose an approach, in which the distinction between additive and multiplicative 54 CHAPTER 3. A CONTINUUM BETWEEN ADDITION AND MULTIPLICATION

neurons is not discrete but continuous and differentiable, resulting in a continuum between addition and multiplication. Hence, the optimal distribution of additive and multiplicative units can be determined during standard gradient-based optimization. Our approach is organized as follows. First, we introduce non-integer iterates of the exponential function in the real and com- plex domains. This is based on the mathematical theory of fractional functional iteration. We then use these iterates to smoothly interpolate between addition (3.2) and multiplication (3.4). Finally, we show how this interpolation can be integrated and implemented in neural networks by means of a novel class of activation functions.

3.3 Iterates of the Exponential Function

(n) Let f : C C be an invertible function. For integer n Z we write f for the n-times iterated → ∈ application of f, (n) f (z) , f f f(z) . (3.5) | ◦ ◦{z · · · ◦ } n times Further let f (−n) = (f −1)(n) where f −1 denotes the inverse of f. We set f (0)(z) = z to be the identity function. It can be easily verified that functional iteration with respect to the composition operator, i.e. f (n) f (m) = f (n+m) (3.6) ◦ for n, m Z, forms an Abelian . ∈ Equation (3.5) cannot be used to define functional iteration for non-integer n. Thus, in order to calculate non-integer iterations of a function, we have to find an alternative definition. The sought generalization should extend the additive property (3.6) of the composition operation to non-integer n, m R. ∈

3.3.1 Abel’s Functional Equation

Consider the following functional equation given by (Abel, 1826),

ψ(f(x)) = ψ(x) + β (3.7) with constant β C. We are concerned with f(x) = exp(x). A continuously differentiable ∈ solution (Kneser, 1950) for β = 1 and x R is given by ∈

$$\psi(x) = \log^{(k)}(x) + k \tag{3.8}$$

with $k \in \mathbb{Z}$ s.t. $0 \le \log^{(k)}(x) < 1$. Note that for $x < 0$ we have $k = -1$ and thus $\psi$ is well defined on the whole of $\mathbb{R}$.

The function $\psi$ is differentiable, as can be seen from the following argument (Köpp, 2015). Obviously $\psi$ is differentiable at points where $k$ does not change. At a point $x_0 = \exp^{(k)} 1$ where $k$ changes, we have for the left derivative

$$\frac{\partial_- \psi}{\partial x}\bigg|_{x=x_0} = \lim_{x \to x_0^-} \frac{\psi(x) - \psi(\exp^{(k)} 1)}{x - \exp^{(k)} 1} = \lim_{x \to x_0^-} \frac{\log^{(k)} x - 1}{x - \exp^{(k)} 1} . \tag{3.9}$$

Applying L'Hôpital's rule allows the limit to be resolved, resulting in

$$\frac{\partial_- \psi}{\partial x} = \lim_{x \to x_0^-} \prod_{j=0}^{k-1} \frac{1}{\log^{(j)} x} = \prod_{j=0}^{k-1} \frac{1}{\exp^{(k-j)} 1} . \tag{3.10}$$

For the right derivative we obtain

$$\frac{\partial_+ \psi}{\partial x}\bigg|_{x=x_0} = \lim_{x \to x_0^+} \frac{\psi(x) - \psi(\exp^{(k)} 1)}{x - \exp^{(k)} 1} = \lim_{x \to x_0^+} \frac{\log^{(k+1)} x}{x - \exp^{(k)} 1} , \tag{3.11}$$

and, again by application of L'Hôpital's rule, this evaluates to

$$\frac{\partial_+ \psi}{\partial x} = \lim_{x \to x_0^+} \prod_{j=0}^{k} \frac{1}{\log^{(j)} x} = \prod_{j=0}^{k} \frac{1}{\exp^{(k-j)} 1} . \tag{3.12}$$

Since $\exp^{(k-k)} = \exp^{(0)}$ is the identity function, the additional factor in (3.12) equals one; the left and right derivatives are thus identical and the function $\psi$ is differentiable at all points.

The function $\psi$ is shown in Fig. 3.2a. Since $\psi : \mathbb{R} \to (-1, \infty)$ is strictly increasing, the inverse $\psi^{-1} : (-1, \infty) \to \mathbb{R}$ exists and is given by

$$\psi^{-1}(\psi) = \exp^{(k)}(\psi - k) \tag{3.13}$$

with $k \in \mathbb{Z}$ s.t. $0 \le \psi - k < 1$. Furthermore it follows that $\psi^{-1}$ is differentiable. For practical reasons we set $\psi^{-1}(\psi) = -\infty$ for $\psi \le -1$. The functions $\psi$ and $\psi^{-1}$ can be computed numerically using algorithms 4 and 5 respectively.

The derivative of ψ is given by

$$\psi'(x) = \prod_{j=0}^{k-1} \frac{1}{\log^{(j)}(x)} \tag{3.14a}$$

Algorithm 4: Computation of Abel's solution ψ(x)
Input: x ∈ ℝ
Output: ψ ∈ ℝ
1:  if x < 0 then
2:      ψ ← exp(x) − 1
3:  else
4:      k ← 0
5:      while x > 1 do
6:          x ← log x
7:          k ← k + 1
8:      ψ ← x + k

Algorithm 5: Computation of Abel's solution inverse ψ⁻¹(ψ)
Input: ψ ∈ ℝ
Output: x ∈ ℝ ∪ {−∞}
1:  k ← ⌈ψ − 1⌉
2:  if k < 0 then
3:      x ← log(ψ − k)    // negative infinity can occur
4:  else
5:      x ← ψ − k
6:  while k > 0 do
7:      x ← exp x
8:      k ← k − 1

with $k \in \mathbb{N}$ s.t. $0 \le \log^{(k)}(x) < 1$, and the derivative of its inverse is
$$\big(\psi^{-1}\big){}'(\psi) = \prod_{j=0}^{k-1} \exp\!\Big(\psi^{-1}(\psi - j - 1)\Big) \tag{3.14b}$$
with $k \in \mathbb{N}$ s.t. $0 \le \psi - k < 1$. Using a proof analogous to that of the differentiability of $\psi$ it can be shown that $\psi'$ and $\big(\psi^{-1}\big){}'$ are continuous.
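A direct Python transcription of algorithms 4 and 5 could read as follows (a minimal scalar sketch of our own; an efficient implementation uses precomputed lookup tables, cf. section 3.6):

import math

def psi(x):
    # Abel's solution psi(x) = log^(k)(x) + k, cf. eq. (3.8) and algorithm 4.
    if x < 0:
        return math.exp(x) - 1.0      # the k = -1 branch
    k = 0
    while x > 1.0:
        x = math.log(x)
        k += 1
    return x + k

def psi_inv(p):
    # Inverse of Abel's solution, cf. eq. (3.13) and algorithm 5.
    if p <= -1.0:
        return -math.inf              # by the convention adopted above
    k = math.ceil(p - 1.0)
    x = math.log(p - k) if k < 0 else p - k
    while k > 0:
        x = math.exp(x)
        k -= 1
    return x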

Non-integer iterates using Abel’s equation

By inspection of Abel's equation (3.7), we see that the n-th iterate of the exponential function can be written as
$$\exp^{(n)}(x) = \psi^{-1}(\psi(x) + n) . \tag{3.15}$$


Figure 3.2: (a) A continuously differentiable solution $\psi(x)$ to Abel's equation (3.7) for the exponential function in the real domain. (b) Iterates of the exponential function $\exp^{(n)}(x)$ for $n \in \{-1, -0.9, \ldots, 0, \ldots, 0.9, 1\}$ obtained using the solution (3.15) of Abel's equation.

This family of functions is shown in fig. 3.2b. While the above equation is equivalent to (3.5) for integer $n$, we are now also free to choose $n \in \mathbb{R}$ and thus (3.15) can be seen as a generalization of functional iteration to non-integer iterates. It can easily be verified that the composition property (3.6) holds. Hence we can understand the function $\varphi(x) = \exp^{(1/2)}(x)$ as the function that gives the exponential function when applied to itself. $\varphi$ is called the functional square root of exp and we have $\varphi(\varphi(x)) = \exp(x)$ for all $x \in \mathbb{R}$. Likewise $\exp^{(1/N)}$ is the function that gives the exponential function when iterated $N$ times. Since $n$ is a continuous parameter in definition (3.15) we can take the derivative of $\exp^{(n)}$ with respect to its argument as well as $n$. They are given by

$$\exp'^{(n)}(x) \triangleq \frac{\partial \exp^{(n)}(x)}{\partial x} = \big(\psi^{-1}\big){}'(\psi(x) + n)\, \psi'(x) , \tag{3.16a}$$
$$\exp^{(n')}(x) \triangleq \frac{\partial \exp^{(n)}(x)}{\partial n} = \big(\psi^{-1}\big){}'(\psi(x) + n) . \tag{3.16b}$$

Thus the solution (3.8) of Abel's equation, via (3.15), provides a method to interpolate between the exponential function, the identity function and the logarithm in a continuous and differentiable way.
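On top of the psi and psi_inv sketches above, (3.15) then becomes a one-liner (again our own illustration):

def exp_n(x, n):
    # Fractional iterate exp^(n)(x) = psi_inv(psi(x) + n), cf. eq. (3.15).
    return psi_inv(psi(x) + n)

# The functional square root of exp applied twice gives exp itself:
phi = lambda x: exp_n(x, 0.5)
print(phi(phi(1.0)), math.exp(1.0))   # both approx. 2.71828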

3.3.2 Schröder’s Functional Equation

The method presented so far only works for x > 0 when n is arbitrary since the real logarithm is undefined for zero and negative arguments. Motivated by the necessity to calculate negative fractional iterates for negative arguments as well, we derive a solution of Abel's equation for

the complex exponential function. Applying the substitution

$$\psi(x) \triangleq \frac{\beta}{\log \gamma}\, \log \chi(x)$$

in Abel's equation (3.7) produces Schröder's functional equation, first examined by Schröder (1870) and analyzed in detail by Kneser (1950),

$$\chi(f(z)) = \gamma\, \chi(z) \tag{3.17}$$

with constant $\gamma \in \mathbb{C}$. As before we are interested in solutions of this equation for $f(x) = \exp(x)$. We have
$$\chi(\exp(z)) = \gamma\, \chi(z) \tag{3.18}$$

but now we are considering the complex $\exp : \mathbb{C} \to \mathbb{C}$. The complex exponential function is not injective, since
$$\exp(z + 2\pi n i) = \exp(z) , \quad n \in \mathbb{Z} ,$$
where $i \triangleq \sqrt{-1}$. Thus the imaginary part of the codomain of its inverse, i.e. the complex logarithm, must be restricted to an interval of size $2\pi$. Here we define $\log : \mathbb{C} \to \{z \in \mathbb{C} : \beta \le \operatorname{Im} z < \beta + 2\pi\}$ with $\beta \in \mathbb{R}$. For now let us consider the principal branch of the logarithm, that is $\beta = -\pi$.

To derive a solution, we examine the behavior of the exponential function around one of its fixed points. A fixed point of a function $f$ is a point $c$ with the property that $f(c) = c$. The exponential function has an infinite number of fixed points (Agarwal, 2001); here we select the fixed point closest to the real axis in the upper complex half-plane. Since log is a contraction mapping, according to the Banach fixed-point theorem (Khamsi et al., 2001) the fixed point of exp can be found by starting at an arbitrary point $z \in \mathbb{C}$ with $\operatorname{Im} z \ge 0$ and repetitively applying the logarithm until convergence. Numerically we find

$$\exp(c) = c \approx 0.318132 + 1.33724\, i .$$
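This value can be reproduced with a few lines of Python (our own sketch; the principal branch of the logarithm suffices for starting points in the upper half-plane):

import cmath

def exp_fixed_point(z=1j, tol=1e-12):
    # Iterate the contractive complex logarithm until convergence to the
    # fixed point c of exp in the upper half-plane.
    while True:
        z_next = cmath.log(z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next

print(exp_fixed_point())   # approx. (0.3181315 + 1.3372357j)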

Close enough to $c$ the exponential function behaves like an affine map. To show this, let $z' = z - c$ and consider
$$\exp(c + z') - c = \exp(c)\exp(z') - c = c\,[\exp(z') - 1] = c\,[1 + z' + O(|z'|^2) - 1] = c\, z' + O(|z'|^2) . \tag{3.19}$$

Here we used the Taylor expansion of the exponential function, $\exp(z') = 1 + z' + O(z'^2)$. This means that points on a circle centered on $c$ with radius $r \ll 1$ are mapped to a circle around $c$ with radius $|c|\, r$ and rotated around $c$ by $\operatorname{Im} c \approx 76.62°$. For any point $z$ in a circle of radius $r_0$ around $c$, we have
$$\exp(z) = c z + c - c^2 + O(r_0^2) . \tag{3.20}$$
By substituting this approximation into (3.18) it becomes apparent that a solution to Schröder's equation around $c$ is given by

$$\chi(z) = z - c \quad \text{for } |z - c| \le r_0 \tag{3.21}$$

where we have set $\gamma = c$.

We will now compute the continuation of the solution to points outside the circle around $c$. From (3.18) we obtain
$$\chi(z) = c\, \chi(\log(z)) . \tag{3.22}$$

If for a point $z \in \mathbb{C}$ repeated application of the logarithm leads to a point inside the circle of radius $r_0$ around $c$, we can obtain the function value of $\chi(z)$ from (3.21) via iterated application of (3.22). We will show later that this is indeed the case for nearly every $z \in \mathbb{C}$. Hence the solution to Schröder's equation is given by

$$\chi(z) = c^k \big(\log^{(k)}(z) - c\big) \tag{3.23}$$

with $k = \min_{k' \in \mathbb{N}} k'$ s.t. $\big|\log^{(k')}(z) - c\big| \le r_0$. Solving for $z$ gives

$$\chi^{-1}(\chi) = \exp^{(k)}\big(c^{-k} \chi + c\big) \tag{3.24}$$

with $k = \min_{k' \in \mathbb{N}} k'$ s.t. $|c^{-k'} \chi| \le r_0$. Obviously we have $\chi^{-1}(\chi(z)) = z$ for all $z \in \mathbb{C}$. However $\chi(\chi^{-1}(\xi)) = \xi$ only holds if $\operatorname{Im}(c^{-k}\xi + c) \in [\beta, \beta + 2\pi)$. The solution (3.23) is only defined for a point $z$ if iterated application of the logarithm starting from $z$ converges to the fixed point $c$. We will show below which conditions are necessary for that.

Kneser (1950) demonstrated that $\chi$ is holomorphic on $\mathbb{C} \setminus \{0, 1, e, e^e, \ldots\}$. The complex derivative of $\chi$ evaluates to
$$\chi'(z) = \prod_{j=0}^{k-1} \frac{c}{\log^{(j)} z} \tag{3.25a}$$

with $k = \min_{k' \in \mathbb{N}} k'$ s.t. $\big|\log^{(k')}(z) - c\big| \le r_0$, and we have
$$\big(\chi^{-1}\big){}'(\chi) = \prod_{j=1}^{k} \frac{1}{c}\, \exp\!\bigg(\chi^{-1}\Big(\frac{\chi}{c^{j}}\Big)\bigg) \tag{3.25b}$$

with $k = \min_{k' \in \mathbb{N}} k'$ s.t. $|c^{-k'} \chi| \le r_0$.

The solution χ is defined on almost all of ℂ

The principal branch of the logarithm, i.e. restricting its imaginary part to the interval $[-\pi, \pi)$, has the drawback that iterated application of log starting from a point in the lower complex half-plane will converge to the complex conjugate $\bar{c}$ instead of $c$. Thus $\chi(z)$ would be undefined for $\operatorname{Im} z < 0$.

To avoid this problem, we use the branch defined by $\log : \mathbb{C} \to \{z \in \mathbb{C} : \beta \le \operatorname{Im} z < \beta + 2\pi\}$ with $-1 < \beta < 0$. Using such a branch the series $z_n$, where

$$z_{n+1} \triangleq \log z_n ,$$

converges to $c$, provided that there is no $n$ such that $z_n = 0$. Thus $\chi$ is defined on $\mathbb{C} \setminus D$ where $D = \{0, 1, e, e^e, e^{e^e}, \ldots\}$.

If $\operatorname{Im} z_n \ge 0$ then $\arg z_n \in [0, \pi]$ and thus $\operatorname{Im} z_{n+1} \ge 0$. Hence, if we have $\operatorname{Im} z_n \ge 0$ for some $n \in \mathbb{N}$, then $\operatorname{Im} z_{n'} \ge 0$ for all $n' > n$. Now, consider the conformal map (Kneser, 1950)
$$\xi(z) \triangleq \frac{z - c}{z - \bar{c}} ,$$
which maps the upper complex half-plane to the unit disk, and define the series $\xi_{n+1} = \zeta(\xi_n)$ with
$$\zeta(t) \triangleq \xi\big(\log \xi^{-1}(t)\big) .$$

We have $\zeta : D_1 \to D_1$, where $D_1 = \{t \in \mathbb{C} : |t| < 1\}$ is the unit disk; furthermore $\zeta(0) = 0$. Thus by the Schwarz lemma $|\zeta(t)| < |t|$ for all $t \in D_1$ (since $\zeta(t) \ne \lambda t$ with $\lambda \in \mathbb{C}$) and hence $\lim_{n\to\infty} \xi_n = 0$. This implies $\lim_{n\to\infty} z_n = c$.

On the other hand, if $\operatorname{Im} z_n < 0$ and $\operatorname{Re} z_n < 0$, then $\operatorname{Im} \log z_n > 0$ and $z_n$ converges as above. Finally, if $\operatorname{Im} z_n < 0$ and $\operatorname{Re} z_n \ge 0$, then, using $-1 < \beta$, we have $\operatorname{Re} z_{n+1} \le \log|z_n| \le 1 + \log(\operatorname{Re} z_n) < \operatorname{Re} z_n$ and thus at some element $n'$ of the series we will have $\operatorname{Re} z_{n'} < 1$, which leads to $\operatorname{Re} z_{n'+1} < 0$.

Algorithm and error estimates for numeric computation

Numerical computation of $\chi$ and its inverse is possible using algorithms 6 and 7 respectively. The computation procedure is visualized in fig. 3.3a and $\chi(z)$ is plotted in fig. 3.4. The number of loop iterations needed to compute $\chi(z)$ for an arbitrary point $z \in \mathbb{C}$ is on average 30; a detailed plot of the iteration count is shown in fig. 3.3b. Since the algorithms are computationally intensive, it is advisable to precompute $\chi(z)$ and $\chi^{-1}(\chi)$ in the form of interpolation tables.

Algorithm 6: Computation of a solution χ(z) to Schröder's equation
Input: z ∈ ℂ \ {0, 1, e, e^e, ...}
Output: χ ∈ ℂ
1:  c ← 0.318132 + 1.33724 i    // fixed point of exp
2:  ρ ← 1
3:  while |z − c| ≥ r0 do    // r0 ≪ 1
4:      z ← log z
5:      ρ ← c ρ
6:  χ ← ρ (z − c)

Algorithm 7: Computation of the inverse χ⁻¹(χ) of a solution to Schröder's equation
Input: χ ∈ ℂ
Output: z ∈ ℂ
1:  c ← 0.318132 + 1.33724 i    // fixed point of exp
2:  k ← 0
3:  while |χ| ≥ r0 do    // r0 ≪ 1
4:      χ ← χ / c
5:      k ← k + 1
6:  z ← χ + c
7:  while k > 0 do
8:      z ← exp z
9:      k ← k − 1
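In Python, algorithms 6 and 7 could be sketched as follows (our own minimal version; log_branch implements a branch with −1 < β < 0 as discussed above, and the constants C and R0 are assumptions of this sketch):

import cmath

C = 0.3181315052047641 + 1.3372357014306895j   # fixed point of exp
R0 = 1e-3                                      # radius of the linearization disk
BETA = -0.5                                    # branch parameter, -1 < beta < 0

def log_branch(z):
    # Complex logarithm with Im(log z) restricted to [BETA, BETA + 2*pi).
    w = cmath.log(z)
    if w.imag < BETA:
        w += 2j * cmath.pi
    return w

def chi(z):
    # Schroeder solution chi(z), cf. eq. (3.23) and algorithm 6.
    rho = 1.0 + 0.0j
    while abs(z - C) >= R0:
        z = log_branch(z)      # contractive, so errors are not amplified
        rho *= C
    return rho * (z - C)

def chi_inv(x):
    # Inverse of the Schroeder solution, cf. eq. (3.24) and algorithm 7.
    k = 0
    while abs(x) >= R0:
        x /= C
        k += 1
    z = x + C
    for _ in range(k):
        z = cmath.exp(z)
    return z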

Computing $\chi(z)$ is unproblematic with regard to numeric accuracy since the logarithm, which is iteratively applied in line 4 of algorithm 6, is a contractive map and thus numeric errors are not amplified with each application. The computation of $\chi^{-1}(\chi)$, however, seems vulnerable to numeric instabilities since the exponential function is iteratively applied in line 8 of algorithm 7. We therefore derive an estimate of how much an error in $\chi$ is amplified when computing $z$ using algorithm 7. First we note that lines 4 and 6 do not increase an error on $\chi$


Figure 3.3: (a) Calculation of $\chi(z)$. Starting from point $z_0 = z$ the series $z_{n+1} = \log z_n$ is evaluated until $|z_n - c| \le r_0$ for some $n$. Inside this circle of radius $r_0$ the function value can then be evaluated using $\chi(z) = c^n (z_n - c)$. The contours are generated by iterative application of exp to the circle of radius $r_0$ around $c$. Near its fixed point the exponential behaves like a scaling by $|c| \approx 1.374$ and a rotation of $\operatorname{Im} c \approx 76.6°$ around $c$. (b) Number of iterated applications of exp required to cover a marked area when starting from a point very close to the fixed point $c$. This roughly equals the number of loop iterations in algorithm 6 necessary to compute $\chi(z)$ for a point $z$ on the complex plane. Starting with 30 iterations the areas start to intersect since the complex exponential is not injective. This part has been reproduced from (Köpp, 2016).

since $|c| > 1$. Thus the increase of error comes from the application of $\exp^{(n)}$ in line 8. It is given by

$$\Delta = \frac{\big|\exp^{(n)}(z + \delta) - \exp^{(n)}(z)\big|}{|\delta|}$$
where $\delta$ is the error of $z$ and for the initial $z$ we have $|z - c| \le r_0$. An estimate of the error factor $\Delta$ can be obtained by calculating how much area of the complex plane is covered

by iterative applications of the exponential to the points of a circle with radius $r_0$ around $c$, i.e.
$$\Delta \approx \sqrt{\frac{\operatorname{area}\big(\{\exp^{(n)} z \mid z \in \mathbb{C} \wedge |z - c| \le r_0\}\big)}{\pi r_0^2}} . \tag{3.26}$$

The assumption here is that an error will be scaled like the area of the complex plane it is contained in. As noted before, around c the exponential acts like an affine transform that


Figure 3.4: Domain coloring plot of $\chi(z)$. Domain coloring (Wegert, 2012) refers to the visualization of complex-valued functions based on the HSB color model. HSB stands for hue, saturation and brightness and each value can vary between 0 and 1. The hue is used to represent the phase $\varphi$ of the function value $f(z) = r\, e^{i\varphi}$, the saturation encodes the magnitude $r$ and for brightness the real and imaginary parts are combined. The exact computation for the latter two involves taking the sine of either $f(z)$ or its real or imaginary part, leading to repetitive patterns. For $\chi(z)$ the discontinuities that arise at 0, 1, e and stretch into the negative complex half-plane are clearly visible. They are caused by log being discontinuous at the polar angles $\beta$ and $\beta + 2\pi$. The fixed point $c$ of the exponential function can be seen as well as the affine behavior of the exponential around it.

n     ∆circ     ∆area     ∆num
0     1         1         —
1     1.375     1.375     —
2     1.889     1.889     —
⋮     ⋮         ⋮         ⋮
26    3 910     3 978     3 949
27    5 375     5 552     5 463
28    7 389     7 860     7 604
29    10 156    11 441    10 764
30    13 960    17 621    15 733
31    19 189    30 623    24 504
32    26 376    70 222    43 765
33    36 255    328 253   131 981

Table 3.1: Average relative errors of $\exp^{(n)} z$ for $|z - c| \le r_0$ and different numbers of iterations $n$, where the coverage of the iterations can be seen from fig. 3.3b. The values for ∆circ were computed using eq. (3.27) and the values for ∆area were computed using eq. (3.26), where the area was computed numerically. Starting with 30 iterations the circle approximation does not hold anymore. Numerical error estimates obtained using the arbitrary-precision calculation feature of Mathematica are shown in column ∆num. This table has been adapted from (Köpp, 2015).

multiplies distances $|z - c|$ with a factor of $|c| \approx 1.375$, cf. fig. 3.3a, and from fig. 3.3b we see that this approximation is reasonable for up to 30 iterations of exp. Thus for $n < 30$ we obtain the explicit formula

$$\Delta_{\mathrm{circ}} \approx \sqrt{\frac{\pi\, (|c|^n r_0)^2}{\pi r_0^2}} = |c|^n . \tag{3.27}$$

For $n \ge 30$ we turn to estimating the area by numerically computing the area of the set $\{\exp^{(n)} z \mid z \in \mathbb{C} \wedge |z - c| \le r_0\}$ on the complex plane. Results were calculated by Köpp (2015) and are shown as ∆area in table 3.1. For verification the error has also been calculated numerically using the arbitrary-precision calculation feature of the software package Mathematica (Wolfram Research, 2017).

To cover the interesting part of $\mathbb{C}$ we need 33 iterations and thus a conservative estimate for the relative error factor of $\chi^{-1}(\chi)$ is $10^6$. Since the machine epsilon (Quarteroni, 2000) for 32-bit single-precision floating-point operations is $10^{-7}$, we conclude that 64-bit double-precision arithmetic with a machine epsilon of $2 \cdot 10^{-16}$ must be used to calculate $\chi^{-1}$ with reasonable accuracy.


Figure 3.5: Structure of the function $\chi(z)$ and calculation of $\exp^{(n)}(1 + \pi i)$ for $n \in [0, 1]$ in z-space (left) and χ-space (right). The uniform grid with the cross at the origin is mapped using $\chi(z)$. Since the map is conformal, local angles are preserved. The points $1 + \pi i$ and $\xi = \chi(1 + \pi i)$ are shown as magenta circles. The black line shows $\chi^{-1}(c^n \xi)$ and $c^n \xi$ for $n \in [0, 1]$. The blue and purple points are placed at $n = 1/2$ and $n = 1$ respectively.

Non-integer iterates using Schröder’s equation

Repeated application of Schröder's equation (3.17) to the functional iterates (3.5) leads to

$$\chi\big(f^{(n)}(z)\big) = \gamma^n\, \chi(z) . \tag{3.28}$$

Thus the n-th iterate of the exponential function on the whole complex plane is given by

$$\exp^{(n)}(z) = \chi^{-1}\big(c^n\, \chi(z)\big) \tag{3.29}$$

where $\chi(z)$ and $\chi^{-1}(z)$ are given by (3.23) and (3.24) respectively. Since $\chi$ is injective, we can think of it as a mapping from the complex plane, called z-plane, to another complex plane, called χ-plane. By (3.29) the operation of calculating the exponential of a number $y$ in the z-plane corresponds to complex multiplication of $\chi(y)$ by the factor $c$ in the χ-plane. This is illustrated in fig. 3.5. Samples from $\exp^{(n)}$ are shown in fig. 3.6.
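Reusing the chi and chi_inv sketches from above, (3.29) can be written directly (our own illustration; the composition check is only meaningful for z in the set 𝓔 discussed below):

def exp_n_complex(z, n):
    # Fractional iterate exp^(n)(z) = chi_inv(c^n * chi(z)), cf. eq. (3.29).
    return chi_inv(C**n * chi(complex(z)))

# Sanity check: the functional square root applied twice gives exp.
z = 1.0 + 0.5j
print(exp_n_complex(exp_n_complex(z, 0.5), 0.5), cmath.exp(z))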

While the definition of $\exp^{(n)}$ given by (3.29) can be evaluated on the whole complex plane $\mathbb{C}$, it only has meaning as a non-integer iterate of exp if the composition


Figure 3.6: Iterates of the exponential function $\exp^{(n)}(x + 0.5i)$ for $n \in \{0, 0.1, \ldots, 0.9, 1\}$ (upper plots) and $n \in \{0, -0.1, \ldots, -0.9, -1\}$ (lower plots) obtained using the solution (3.29) of Schröder's equation. The exponential, logarithm and identity function are highlighted in orange.

$\exp^{(n)}\big[\exp^{(m)}(z)\big] = \exp^{(n+m)}(z)$ holds. Since this requires that $\chi(\chi^{-1}(\xi)) = \xi$, let us define the sets

$$\mathcal{E}' \triangleq \big\{\xi \in \mathbb{C} \;\big|\; \chi\big[\chi^{-1}(c^m \xi)\big] = c^m \xi \;\;\forall m \in [-1, 1]\big\}$$

and $\mathcal{E} \triangleq \chi^{-1}(\mathcal{E}')$. Then, for $z \in \mathcal{E}$, $n \in \mathbb{R}$ and $m \in [-1, 1]$, the composition of the iterated exponential function is given by

$$\exp^{(n)}\big[\exp^{(m)}(z)\big] = \chi^{-1}\Big(c^n\, \chi\big(\chi^{-1}(c^m\, \chi(z))\big)\Big) = \chi^{-1}\big(c^{n+m}\, \chi(z)\big) = \exp^{(n+m)}(z)$$

and the composition property is satisfied. The subset $\mathcal{E} \subset \mathbb{C}$ of the complex plane in which composition of $\exp^{(n)}$ holds for non-integer $n$ is shown in figure 3.7.

The derivatives of $\exp^{(n)} x$, defined using Schröder's equation, w.r.t. $x$ and $n$ can be calculated using the chain rule and evaluate to


Figure 3.7: Function composition holds in the set $\mathcal{E} \subset \mathbb{C}$ (gray-shaded area) for non-integer iteration numbers; there we have $\exp^{(n)} \circ \exp^{(m)}(z) = \exp^{(n+m)}(z)$ for $n \in \mathbb{R}$ and $m \in [-1, 1]$. We defined log such that $\operatorname{Im} \log z \in [-1, -1 + 2\pi)$. In the striped area function composition does not hold, but $\exp^{(n)}$ can still be computed.

$$\exp'^{(n)}(z) = c^n\, \big(\chi^{-1}\big){}'\big[c^n \chi(z)\big]\; \chi'(z) , \tag{3.30a}$$
$$\exp^{(n')}(z) = c^n\, \big(\chi^{-1}\big){}'\big[c^n \chi(z)\big]\; \chi(z) \log(c) . \tag{3.30b}$$

In conclusion, this section defined the continuously differentiable function $\exp^{(n)} : \mathbb{C} \setminus D \to \mathbb{C}$ on almost the whole complex plane and showed that it has the meaning of a non-integer iterate of the exponential function on the subset $\mathcal{E}$.

3.4 Interpolation between Addition and Multiplication

Using fundamental properties of the exponential function we can write every multiplication of two numbers $x, y \in \mathbb{R}$ as

$$x y = \exp(\log x + \log y) = \exp\big(\exp^{(-1)}(x) + \exp^{(-1)}(y)\big) .$$

If $x < 0$ or $y < 0$ then the complex logarithm and exponential function must be used. We define the operator $\oplus_n$ for $x, y \in \mathbb{R}$ and $n \in \mathbb{R}$ as

$$x \oplus_n y \triangleq \exp^{(n)}\big(\exp^{(-n)}(x) + \exp^{(-n)}(y)\big) . \tag{3.31}$$

Note that we have $x \oplus_0 y = x + y$ and $x \oplus_1 y = x y$. Thus for $0 < n < 1$ the above operator continuously interpolates between the elementary operations of addition and multiplication.

We will refer to $\oplus_n$ as the "addiplication" operator. Analogous to the n-ary sum and product we will employ the following notation for the n-ary addiplication operator,

$$\bigoplus\nolimits_{j=k}^{K,\, n}\, x_j \;\triangleq\; x_k \oplus_n x_{k+1} \oplus_n \cdots \oplus_n x_K . \tag{3.32}$$
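A numerical sketch of the addiplication operator, built on exp_n_complex from above (our own illustration):

def addiplication(x, y, n):
    # x (+)_n y = exp^(n)(exp^(-n)(x) + exp^(-n)(y)), cf. eq. (3.31).
    return exp_n_complex(exp_n_complex(x, -n) + exp_n_complex(y, -n), n)

for n in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(n, addiplication(2.0, 7.0, n))
# n = 0 yields approximately 2 + 7 = 9 and n = 1 yields 2 * 7 = 14; the
# intermediate values interpolate, though not monotonically (cf. fig. 3.8).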

The derivatives of the addiplication operator w.r.t. its operands and the interpolation parameter $n$ are calculated using the chain rule. Using the shorthand

$$E \triangleq \exp^{(-n)}(x) + \exp^{(-n)}(y) \tag{3.33a}$$

we have

$$\frac{\partial(x \oplus_n y)}{\partial x} = \exp'^{(n)}(E)\; \exp'^{(-n)}(x) , \tag{3.33b}$$
$$\frac{\partial(x \oplus_n y)}{\partial y} = \exp'^{(n)}(E)\; \exp'^{(-n)}(y) , \tag{3.33c}$$
$$\frac{\partial(x \oplus_n y)}{\partial n} = \exp^{(n')}(E) + \exp'^{(n)}(E) \cdot \Big[\exp^{(-n')}(x) + \exp^{(-n')}(y)\Big] . \tag{3.33d}$$

For positive arguments $x, y > 0$ we can use the iterates of exp based either on the solution of Abel's equation (3.15) or Schröder's equation (3.29). However, if we also want to deal with negative arguments, we must use iterates of exp based on Schröder's equation (3.29), since the real logarithm is only defined for positive arguments. Thus we will call solution (3.15) the "real fractional exponential" and solution (3.29) the "complex fractional exponential". From the exemplary "addiplication" shown in fig. 3.8 we can see that the interpolations produced by these two methods are not monotonic functions w.r.t. the interpolation parameter $n$. In both cases local maxima exist; however, interpolation based on Schröder's equation has higher extrema in this case and also in general (personal experiments). It is well known that the existence of local extrema can pose a problem for gradient-based optimizers.

3.5 Neurons that can Add, Multiply and Everything In-Between

We propose two methods to construct neural nets that have units the operation of which can be adjusted using a continuous parameter. The straightforward approach is to use neurons that


Figure 3.8: The interpolation between addition and multiplication using (3.31) with $x = 2$ and $y = 7$. The iterates of the exponential function are either calculated using Abel's equation (blue) or Schröder's equation (orange). In both cases the interpolated values exceed the range between $2 + 7 = 9$ and $2 \cdot 7 = 14$ and therefore a local maximum exists. However, in the Abel case the local maximum is not very pronounced.

employ “addiplication” instead of summation, i.e. the value of neuron yi is given by

$$y_i = \sigma\Bigg(\bigoplus\nolimits_{j}^{n_i}\, W_{ij}\, x_j\Bigg) = \sigma\Bigg(\exp^{(n_i)}\bigg(\sum_j W_{ij} \exp^{(-n_i)}(x_j)\bigg)\Bigg) , \tag{3.34}$$

where the operator $\oplus_n$ has been defined in section 3.4. For $n_i = 0$ the neuron behaves like an additive neuron and for $n_i = 1$ it computes the product of its inputs. Because we sum over $\exp^{(-n_i)}(x_j)$, which depends on the parameter $n_i$ of neuron $y_i$, this calculation corresponds to a network in which each neuron in layer $x$ has separate outputs for each neuron in the following layer $y$; see fig. 3.9a. Compared to conventional ANNs this architecture has only one additional real-valued parameter per neuron ($n_i$), but it also poses a significant increase in computational complexity due to the necessity of separate outputs. Since $\exp^{(-n_i)}(x_j)$ is complex it might be sensible (but is not required) to allow a complex weight matrix $W_{ij}$. The computational complexity of separate output units can be avoided by calculating the value of a neuron according to

$$y_i = \sigma\Bigg(\exp^{(\hat{n}_{y_i})}\bigg(\sum_j W_{ij} \exp^{(\tilde{n}_{x_j})}(x_j)\bigg)\Bigg) . \tag{3.35}$$

This corresponds to the architecture shown in fig. 3.9b. The interpolation parameter $n_i$ has been split into a pre-activation-function part $\hat{n}_{y_i}$ and a post-activation-function part $\tilde{n}_{x_j}$. Since $\hat{n}_{y_i}$ and $\tilde{n}_{x_j}$ are not tied together, the network is free to implement arbitrary combinations of iterates


Figure 3.9: Two proposed neural network architectures that can implement "addiplication". (a) Neuron $y_i$ calculates its value according to (3.34). We have $y_i = \sigma(\hat{y}_i)$ and the subunits compute $\tilde{x}_{ij} = \exp^{(-n_i)}(x_j)$ and $\hat{y}_i = \exp^{(n_i)}\big(\sum_j W_{ij}\, \tilde{x}_{ij}\big)$. The weights between the subunits are shared. (b) Neuron $y_i$ calculates its value according to (3.35). We have $y_i = \sigma(\hat{y}_i)$ and the subunits compute $\tilde{x}_j = \exp^{(\tilde{n}_{x_j})}(x_j)$ and $\hat{y}_i = \exp^{(\hat{n}_i)}\big(\sum_j W_{ij}\, \tilde{x}_j\big)$. This matches the conventional neural network architecture, the only difference being the parameterized activation function individual to each neuron.

of the exponential function. "Addiplication" occurs as the special case $\hat{n}_{y_i} = -\tilde{n}_{x_j}$. Compared to conventional neural nets each neuron has two additional parameters, namely $\hat{n}_{y_i}$ and $\tilde{n}_{x_j}$. However, the asymptotic computational complexity of the network is unchanged. In fact, this architecture corresponds to a conventional, additive neural net, as defined by (2.87), with a

neuron-dependent, parameterized activation function. For neuron $z_i$ the activation function is given by
$$\sigma_{z_i}(t) = \exp^{(\tilde{n}_{z_i})}\Big[\sigma\Big(\exp^{(\hat{n}_{z_i})}(t)\Big)\Big] , \tag{3.36}$$
where $\sigma$ is a standard activation function like the logistic function or just the identity. Using the

identity does not pose a problem, because even then $\sigma_{z_i}$ is not linear for $n_{z_i} \ne 0$. Consequently, implementation in existing neural network frameworks is possible by replacing the standard sigmoidal activation function with this function and optionally using a complex weight matrix.
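A sketch of the parameterized activation function (3.36), reusing exp_n_complex from above (our own illustration; in practice the fractional exponential is evaluated via lookup tables, cf. section 3.6):

def sigma_param(t, n_hat, n_tilde, sigma=lambda u: u):
    # Parameterized activation sigma_z(t) = exp^(n_tilde)(sigma(exp^(n_hat)(t))),
    # cf. eq. (3.36). With n_hat = n_tilde = 0 this reduces to sigma itself;
    # n_hat and n_tilde are per-neuron parameters trained with the weights.
    return exp_n_complex(sigma(exp_n_complex(t, n_hat)), n_tilde)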

3.6 Benchmarks and Experimental Results

Implementation and experiments were done by Köpp (2015) under my guidance; the corresponding results are reproduced here in excerpts. We investigate how ANNs using fractional iterates of the exponential function as activation functions, and therefore interpolating between summation and multiplication, perform on different datasets. Alongside standard gradient descent we apply adaptive optimizers such as Adam (D. Kingma et al., 2014), RMSProp (Tieleman et al., 2012) or Adadelta (Zeiler, 2012), which are provided by Climin (Bayer, 2013) for simple use with the Theano software package (Al-Rfou et al., 2016). An overview of these algorithms is given in section 2.3. To achieve good performance the functions $\psi$, $\chi$ and their inverses, which are used to calculate the fractional exponential, have been implemented using fast linear interpolation over a precomputed lookup table stored in texture memory on the graphics processing unit (GPU) and accessed through Compute Unified Device Architecture (CUDA)-provided hardware interpolation operations.

3.6.1 Recognition of Handwritten Digits

We perform multi-class logistic regression with one hidden layer on the MNIST dataset (Lecun et al., 1998). It contains 60 000 images of handwritten digits zero to nine in its training set and 10 000 images in its test set. Furthermore 10 000 samples of the training set are split off for use as a validation set. Each image has 784 grayscale pixels, which are used as raw values for the input layer. We compare a conventional additive ANN to ANNs using the real- and complex-valued fractional exponential function as follows. The ANN with the fractional exponential function uses complex weights, stored as real and imaginary parts. Thus, in order for all networks to have approximately the same number of trainable parameters, the additive network and the network using the real fractional exponential function use 200 hidden units while the ANN with the complex fractional exponential function has 100 hidden units. The additional parameters $\tilde{n}$ and $\hat{n}$ do not affect the total parameter count significantly, since their number scales only linearly with the number of neurons. The setup resembles the one used for benchmarking Theano (Bergstra et al., 2010). All networks are trained with stochastic gradient descent using an initial learning rate of $10^{-1}$, a momentum of 0.5, and a mini-batch size of 60 samples. The learning rate is reduced whenever the validation loss does not improve for 100 consecutive iterations. Initially, all weights are set to zero with the exception of the real part of the weights within the hidden layer. These are initialized by randomly sampling from the uniform distribution over the interval

$$\Bigg[{-4}\sqrt{\frac{6}{n_{\mathrm{in}} + n_{\mathrm{out}}}},\; 4\sqrt{\frac{6}{n_{\mathrm{in}} + n_{\mathrm{out}}}}\,\Bigg]$$

following a suggestion from Glorot, Bordes, et al. (2011) for weight initialization in ANNs

ANN type       num. parameters    samples per sec.    iterations    error rate
additive       159 010            32 387              283           0.0203
R frac. exp.   159 994            21 253              245           0.0253
C frac. exp.   159 894            17 147              263           0.0267

Table 3.2: Results of using ANNs of different types for classification of handwritten digits. Training was terminated when no improvement was seen on the validation set. The error rate denotes the fraction of digits in the test set that were wrongly classified after training.

with a sigmoid activation function, where $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ denote the number of input and output connections respectively for the considered neuron. Table 3.2 shows the results for training the three networks until no improvement happens on the validation set. Networks using interpolation between summation and multiplication achieve competitive though not surpassing performance on the MNIST dataset. Although their activation function is more complicated than the standard sigmoid, their execution performance (processed samples per second) only suffers by a factor of 1/2. Figure 3.10 displays a histogram of the iteration parameters $\tilde{n}$ and $\hat{n}$ in the interpolating complex-valued ANN at the iteration with the best validation loss. While many of the parameters remain close to their initialization value of zero, some have changed notably and thus the trained ANN performs operations beyond pure additive interactions, and even super-exponential computations for some neurons.

3.6.2 Synthetic Polynomial Regression

In order to analyze long-range generalization performance, we use a polynomial of fourth degree with two inputs and a dataset with a non-random split into training and test set. The function to be learned is of the form

$$f(x, y) = \sum_{k=0}^{4} \sum_{l=0}^{4-k} a_{kl}\, x^k y^l , \tag{3.37}$$

with two inputs $x$ and $y$ and randomly generated coefficients $a_{kl}$. The polynomial can be computed exactly by an ANN with one hidden layer. The hidden layer is multiplicative and calculates the products between inputs and the output layer is additive using the coefficients

$a_{kl}$ as weights. We randomly sample 600 input values $(x, y)$ from the uniform distribution with support $[0, 1]^2$ and compute targets according to (3.37). All points within a circle of radius 0.33 around $x = 0.5$, $y = 0.5$ are used as the test set and all points outside that circle make up the training set, see fig. 3.11a. This is done to test the long-range generalization ability of the ANN.

Figure 3.10: Histogram of the parameters $\hat{n}$ (top) and $\tilde{n}$ (bottom) within the hidden layer of an ANN using the fractional exponential function after being trained on the MNIST dataset. Both were initialized to zero before training. A value of 0 corresponds to the identity function, 1 is the exponential and $-1$ is the logarithm.

We train three different ANNs, each having one hidden layer. The loss is optimized using Adam and the output layer is additive using the identity activation function in all cases. The first neural network is purely additive with the hyperbolic tangent as the activation function for the input and hidden layers. The second network uses (3.36) in the input and hidden layer, thus allowing interpolation between addition and multiplication. The third and final network uses log and exp as activation functions in the input and hidden layer respectively, yielding fixed multiplicative interaction between the inputs and thus representing the structure of a polynomial. All weights of all networks, including the activation function parameters of our model, are initialized with zero mean and variance $10^{-2}$. Hence, our model starts with a mostly additive configuration of neurons. The initial learning rate is set to $10^{-2}$ and all hyperparameter combinations use a batch size of 100 training samples. The learning rate is reduced upon no improvement on the validation set until a minimum learning rate is reached; in that case training is terminated. The progression of training and test loss for all three structures is displayed in fig. 3.11b. Our proposed activation function generalizes best in this experiment. The area where no training data is provided is approximated well. Surprisingly, our model even surpasses the log-exp network, which perfectly resembles the data structure due to its fixed multiplicative interactions.


Figure 3.11: Training of an "addiplication" neural network on a two-dimensional polynomial of fourth degree. (a) The training samples are shown as red dots in x-y-space. The color of the background shows the magnitude of the prediction error after training is finished using an $\exp^{(n)}$ network. (b) The curves show the mean-squared error versus the training iteration for a standard tanh neural network, a network with neurons that can interpolate between addition and multiplication, and a fixed addition-multiplication network with a structure suitable for the given polynomial. While the tanh network overfits on this dataset, the proposed $\exp^{(n)}$ network generalizes well on the test set.

We hypothesize that training a neural network with multiplicative interactions but otherwise randomly initialized weights is hindered by very complex error landscapes that make it difficult for gradient-based optimizers to escape deep local minima. Our model seems unaffected by this issue, supposedly because it starts with additive interactions and can thus find reasonable values for the weights before moving into the multiplicative regime.

3.7 Discussion

We proposed a method to continuously and differentiably interpolate between addition and multiplication and showed how it can be integrated into neural networks by replacing the standard logistic activation function with a parameterized activation function. We presented the mathematical formulation of these concepts and showed how to integrate them into neural networks with a small increase in the number of parameters. It was further shown that efficient numerical implementation is possible with only a 35% increase in computational time compared to a purely additive neural network. On a synthetic dataset of polynomials we have seen that multiplicative units resulted in greater accuracy and better long-range generalization. However, on the standard digit classification benchmark MNIST a purely additive ANN slightly but consistently outperformed the proposed model. We suspect two reasons for this result. First, the transition from addition to multiplication is not monotonic, thus inducing local minima in the error surface, which will make a good minimum of the loss hard to find, especially when using first-order optimization methods as is common with neural networks. While second-order methods may help here, their use with large datasets is problematic due to memory consumption and the stochasticity of mini-batch training. Second, the expressive power introduced by multiplicative interactions may lead to overfitting because there is no regularizing force driving the neurons back into the additive regime. Especially the exponential function has very strong growth and thus a sigmoid function can quickly be driven into saturation by its outputs. This encourages overfitting. While an ad-hoc regularization would be possible, we believe it is better to use a theoretically founded Bayesian approach to prevent overfitting. For this purpose in chapter 5 we generalize the approach of a parameterized activation function to non-parametric stochastic activation functions with an appropriate prior that encourages smooth functions. These activation functions are a superset of the family of functions presented in this chapter and can thus also interpolate between addition and multiplication. However, they have two advantages. First, due to the prior, simple functions will be preferred, making the model resilient to overfitting. Second, the flexibility of non-parametric functions allows for many possibilities (paths in function space) to transition from addition to multiplication. This results in a smooth error surface with far fewer local minima in which the network can get trapped, thus leading to a better discoverability of good network parameters.

Chapter 4

Elementwise Automatic Differentiation

Traditionally the derivatives necessary for training ANNs, cf. section 2.5, have been computed using the backpropagation algorithm (Hecht-Nielsen et al., 1988; Y. A. LeCun et al., 2012; Y. LeCun, B. Boser, et al., 1989), which is a special case of reverse accumulation automatic differentiation (Griewank et al., 2008; Rall, 1981). While backpropagation is only applicable to standard ANNs, automatic differentiation can be used to calculate the derivatives of any model that provides a training objective in the form of a loss function, as "addiplication" and Gaussian process neuron (GPN) networks do. When applied to a standard ANN, automatic differentiation performs exactly the same steps as the backpropagation algorithm.

Currently automatic differentiation is implemented in many frameworks for deep learning, such as Theano (Al-Rfou et al., 2016) and TensorFlow (Martin Abadi et al., 2015). The implementation used in these software packages works at the tensor level; an example expression is $f(x, y) \triangleq W x + y$, where $W$ is a matrix of appropriate shape. When such an expression is evaluated on a GPU, a separate compute kernel is invoked for each operation. A compute kernel is a massively parallel routine that is executed on the GPU. In this case two invocations would take place: one for the dot product and one for the addition.

These frameworks also support operations like reshaping (to change the shape of a tensor) and broadcasting (to repeat a tensor over a specified axis). This is useful to perform operations that do not correspond to well-known operations in linear algebra such as the dot product, elementwise additions and norms. For example, the matrix of pairwise differences

$$D_{ij}(x, y) \triangleq x_i - y_j ,$$


where $x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$, could be written in tensor form as

$$D = \mathrm{broadcast}_2\big(\mathrm{reshape}_{N,1}(x)\big) - \mathrm{broadcast}_1\big(\mathrm{reshape}_{1,M}(y)\big) \tag{4.1}$$

where $\mathrm{broadcast}_d(\bullet)$ repeats its argument over dimension $d$ so that the resulting shape fits subsequent operations and $\mathrm{reshape}_{N_1,N_2,\ldots,N_D}(\bullet)$ reshapes a tensor into shape $\mathbb{R}^{N_1 \times N_2 \times \cdots \times N_D}$ without changing the total number of elements in the tensor. Working with expressions at the tensor level is a good fit for architectures like neural networks, which consist mostly of dot products and elementwise applications of (activation) functions. However, consider the following function, which occurs in section 5.5. The function is specified elementwise, i.e. we are given an expression for element $s, n$ of the matrix-valued function $f$ which explicitly depends on elements of its argument matrices and tensors. The expression is

$$f_{s,n}(V, U, \kappa, \mu, \Sigma) = \sum_{r=1}^{R} \sum_{t=1}^{R} \Bigg(\sum_{i=1}^{I} \kappa_{r,i,n} U_{i,n}\Bigg) \cdot \Bigg(\sum_{j=1}^{I} \kappa_{t,j,n} U_{j,n}\Bigg) \sqrt{\frac{1}{1 + 4\Sigma_{s,n,n}}}\; \exp\Bigg(-\frac{2\,\big[\mu_{s,n} - (V_{r,n} + V_{t,n})/2\big]^2}{1 + 4\Sigma_{s,n,n}} - \frac{(V_{r,n} - V_{t,n})^2}{2}\Bigg) . \tag{4.2}$$

It would be possible to rewrite it in tensor form, i.e. like (4.1), however this comes with four drawbacks. First, rewriting such complicated expressions into tensor form by hand is error-prone and hinders quick changes and experiments. Second, evaluating this expression in tensor form requires temporary memory of at least $R^2 \times S \times N$ elements to store the intermediate values of the square root before the summation over $r$ and $t$ is performed. This may become prohibitive for large tensors. The elementwise form (4.2) does not require that memory as the elements of the occurring sums can be evaluated on the fly. Third, depending on the structure and the sizes of the involved tensors, evaluating the tensor form can be computationally inefficient, especially on GPUs. Since cache management on CUDA GPUs is manual and local to a compute kernel, intermediate values (for example all elements of the square root) must be stored in and then retrieved from main graphics memory, because the computation of such an expression is split over multiple compute kernels. Fourth, if the involved tensors are small, then an expression like (4.2) will invoke many short-running compute kernels on GPUs. While the launch overhead of a compute kernel is generally small, it becomes significant in that case and can even become the performance-limiting factor. Thus in this case it is desirable to evaluate (4.2) directly as given in elementwise form, instead of translating it into tensor form. While describing a method for efficient evaluation of such expressions on a GPU is not in the scope of this thesis, we want to tackle a sub-problem that occurs when developing such an implementation. As already stated, training is usually performed using gradient descent and thus derivatives of all expressions of a model are necessary. Frameworks that work on the tensor level handle this by composing the derivative of an expression also at the tensor level, i.e. for each tensor-level operation like dot product, broadcast or reshape, a corresponding tensor-level operation to calculate the derivative is specified. If we work at the element level, we have to follow a different strategy and derive derivative expressions that are also specified elementwise. This might seem straightforward at first, but it is not, since the way the indices appear on the arguments of such an elementwise defined function affects the derivative expression. Thus let us first review the issues that occur when handling such expressions on a few toy examples. For our applications it is necessary to derive an expression for the gradient of a loss function, for example $l(a, b, c, d) = l(f(a, b, c, d))$, with respect to the model parameters $a, b, c, d$. These derivatives will be denoted by $(\mathrm{d}a)_{ik} \triangleq \partial l / \partial a_{ik}$. When the mapping between function indices and argument indices is not 1:1, special attention is required. For example, for the function $f_{ij}(x) = x_i^2$, the derivative of the loss w.r.t. $x$ is $(\mathrm{d}x)_i \triangleq \partial l / \partial x_i = \sum_j (\mathrm{d}f)_{ij}\, 2 x_i$; the sum is necessary because the index $j$ does not appear in the indices of $f$. Another example is $f_i(x) = x_{ii}^2$, where $x$ is a matrix; here we have $(\mathrm{d}x)_{ij} = \delta_{ij}\, (\mathrm{d}f)_i\, 2 x_{ii}$; the Kronecker delta is necessary because the derivative is zero for off-diagonal elements. Another indexing scheme is used by $f_{ij}(x) = \exp x_{i+j}$; here the correct derivative is $(\mathrm{d}x)_k = \sum_i (\mathrm{d}f)_{i,k-i} \exp x_k$, where the range of the sum must be chosen appropriately.
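These indexing rules can be checked numerically; the following NumPy snippet (our own toy verification, not part of the thesis implementation) confirms the first example against a finite-difference gradient:

import numpy as np

# For f_ij(x) = x_i**2 and a scalar loss l = sum(df * f), the rule above
# gives the gradient (dx)_i = sum_j (df)_ij * 2 x_i.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
df = rng.normal(size=(3, 4))            # dl/df, chosen arbitrarily here

dx_manual = (df * 2 * x[:, None]).sum(axis=1)

def loss(x):
    f = np.broadcast_to((x**2)[:, None], (3, 4))
    return (df * f).sum()

eps = 1e-6
dx_numeric = np.array([(loss(x + eps*e) - loss(x - eps*e)) / (2*eps)
                       for e in np.eye(3)])
print(np.allclose(dx_manual, dx_numeric, atol=1e-5))   # True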
In this chapter we present an algorithm that can handle any case in which the indices of an argument are an arbitrary linear combination of the indices of the function; thus all of the above examples are covered. Sums (and their ranges) and Kronecker deltas are automatically inserted into the derivatives as necessary. Additionally, the indices are transformed, if required (as in the last example). The algorithm outputs a symbolic expression that defines the derivative elementwise and can thus be subsequently evaluated like form (4.2). This work is also interesting outside the field of machine learning. For instance, recently Kjolstad et al. (2017) proposed a "tensor algebra compiler" that takes elementwise expressions of the form (4.2) and generates code to efficiently evaluate them. The algorithm presented in this chapter can thus be combined with such a tensor algebra compiler to automatically generate implementation code for an arbitrary elementwise expression and all its derivatives. We first review the basic automatic differentiation algorithm (sections 4.1 and 4.2) and necessary algorithms for integer matrix inversion and for solving systems of linear inequalities (section 4.3). Then, in section 4.4, we show how to extend automatic differentiation to generate derivative expressions for elementwise defined tensor-valued functions. An example and numeric verification of our algorithm are presented in section 4.5.

Contributions to this Chapter

The method presented in this chapter was developed by me in the context of improving the computational performance of GPNs. A prototype was implemented together with Marcus Basalla and extended by me.

4.1 Symbolic Reverse Accumulation Automatic Differentiation

Every function $f$ can be written as a composition of elementary functions such as addition, subtraction, multiplication, division, trigonometric functions, the exponential function, the logarithm and so on. For now let us assume that the elementary functions take one or more scalar arguments; thus $f$ will also be a function accepting scalar arguments. For example, the function

$f(x_1, x_2) = \exp(x_1 + x_2)$ can be written as $f(x_1, x_2) = f_1(f_2(x_1, x_2))$ with parts $f_1(t) = \exp(t)$

and $f_2(t_1, t_2) = t_1 + t_2$. It is also possible that parts appear more than once in a function. As an example, $f(x) = \sin(x^2) \cdot \cos(x^2)$ can be decomposed into $f(x) = f_1\big(f_2(f_4(x)),\, f_3(f_4(x))\big)$ where the parts $f_1(s, t) = s \cdot t$, $f_2(t) = \sin(t)$, $f_3(t) = \cos(t)$ are used once and $f_4(t) = t^2$ is used twice. A decomposition of a function into parts can be represented by a computational graph, that is

a directed acyclic graph in which each node represents a function part $f_i$ and an edge between two nodes represents that the target node is used as the value of an argument of the source node. An exemplary computational graph for the function $f(x) = f_1\Big(f_2\big(f_3(f_4(x), f_5(x))\big)\Big)$ is shown by the blue nodes in fig. 4.1. Automatic differentiation (Rall, 1981) is based on the well-known chain rule, which states that for a scalar function of the form $f(x) = g(h(x))$ the derivative can be written as

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial g}{\partial h}\, \frac{\partial h}{\partial x} .$$

Given a function f and its decomposition into parts fi, the following algorithm uses reverse accumulation automatic differentiation to obtain a computational graph for the derivatives of

f. Since f(x) = f1(... ) the derivative of f w.r.t. f1 is

$$\frac{\partial f}{\partial f_1} = 1 . \tag{4.3}$$

Then iteratively do the following: Find a part fi for which the derivative of all its consumers

is available but ∂f/∂fi is yet unknown. A part fc is a consumer of part fi, if fi occurs as a

direct argument to fc in the function f. Thus, in the graphical representation of f part fc is a

consumer of fi, if there exists an edge from fc to fi. Since the computational graph of a function

is acyclic, there will always exist a part fi for which this condition is fulfilled. Let csmr(fi) be


Figure 4.1: The blue nodes show a computational graph for the function $f(x) = f_1\Big(f_2\big(f_3(f_4(x), f_5(x))\big)\Big)$. Each node $f_1, f_2, f_3, f_4, f_5$ represents a part of the function and each edge represents an argument. By applying reverse accumulation automatic differentiation as described in section 4.1 the computational graph for the derivatives (shown in red) is obtained.

the set of consumers of part fi. Following the chain rule, the derivative of f w.r.t. fi is given by

$$\frac{\partial f}{\partial f_i} = \sum_{d \in \mathrm{csmr}(f_i)} \frac{\partial f}{\partial f_d}\, \frac{\partial f_d}{\partial f_i} . \tag{4.4}$$

Repeat this process until the derivatives w.r.t. all parts $\partial f / \partial f_i$ have been calculated. Once completed, the derivatives of $f$ w.r.t. its arguments $x_j$, $j \in \{1, \ldots, n\}$, follow immediately,

$$\frac{\partial f}{\partial x_j} = \sum_{d \in \mathrm{csmr}(x_j)} \frac{\partial f}{\partial f_d}\, \frac{\partial f_d}{\partial x_j} . \tag{4.5}$$

Note that this algorithm requires only a single pass to compute the derivatives of $f$ w.r.t. all of its arguments. By performing this algorithm on the computational graph shown in blue in fig. 4.1, the derivative represented by the red nodes and edges is obtained. The computation proceeds from top to bottom in breadth-first order of traversal. In general the partial derivatives of the function parts can depend on all of their arguments, as can be seen in the dependencies of the nodes for $\partial f_3 / \partial f_4$ and $\partial f_3 / \partial f_5$. Symbolic derivatives can be obtained from the resulting computational graph by starting from the node $\partial f / \partial x_i$ and following the dependencies until reaching the leaves of the graph. However, for numerical evaluation it is more efficient to insert numerical values for the parameters $x$ into the graph and then evaluate it node by node. This ensures that intermediate values are only computed once, and thus the possibility of an exponential blow-up of the number of terms, which can occur during classical symbolic differentiation, is avoided. To evaluate the derivative $\partial f / \partial x$ numerically, the function $f(x)$ must be evaluated followed by the derivatives of all parts. This corresponds to the forward and backward passes of the backpropagation algorithm for neural networks. An example of a computational graph and its derivative for the concrete function

$$f(x_1, x_2, x_3) = \sin\Big(\sin\big(x_1 \cdot (x_2 + x_3)\big) + \sinh(x_2 + x_3)\Big)$$
is shown in fig. 4.2.
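For illustration, the forward and backward passes for this concrete function can be written out in a few lines of Python (a hand-unrolled sketch of reverse accumulation, our own illustration):

import math

def f_and_grad(x1, x2, x3):
    # forward pass: evaluate the parts of the function
    s = x2 + x3            # the inner addition
    p = x1 * s             # the product
    u = math.sin(p)        # inner sine
    v = math.sinh(s)       # hyperbolic sine
    w = u + v              # outer addition
    f = math.sin(w)        # outer sine
    # backward pass: accumulate df/d(part) over all consumers,
    # following eqs. (4.3)-(4.5)
    dw = math.cos(w)
    du = dw
    dv = dw
    dp = du * math.cos(p)
    ds = dp * x1 + dv * math.cosh(s)   # s has two consumers
    dx1 = dp * s
    dx2 = ds
    dx3 = ds
    return f, (dx1, dx2, dx3)

print(f_and_grad(0.3, 0.5, -0.2))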


Figure 4.2: A practical example of reverse accumulation automatic differentiation. The computational graph for the function $f(x_1, x_2, x_3) = \sin\big(\sin\big(x_1 \cdot (x_2 + x_3)\big) + \sinh(x_2 + x_3)\big)$ is shown in blue. The computational graph for the derivatives obtained by automatic differentiation is shown in red. Note how intermediate values are reused automatically and the derivatives w.r.t. different $x_i$ share most parts of the computational graph. Symbolic derivatives can be extracted from the graph or it can be evaluated numerically by substituting values for $x_1$, $x_2$ and $x_3$.

4.2 Handling Multidimensional Functions

So far we have shown automatic differentiation for scalar functions. However, in the context of ANNs and GPs we will mostly be dealing with tensor-valued functions. While any tensor-valued function can be written as a scalar function by splitting it into separate functions for each element of the tensor, it may be beneficial to directly work with tensor-valued functions if the employed primitive operations work on matrices or tensors. For example an ANN is defined as a series of dot products and elementwise applications of an activation function. Also some operations benefit directly from calculating all elements of a matrix simultaneously. Matrix multiplication is such an operation. Calculating each element of $C = A \cdot B$ separately using $C_{ij} = \sum_k A_{ik} B_{kj}$ requires a total of $O(n^3)$ operations where $n$ is the size of the square matrices $A$ and $B$. Contrary to that, calculating the elements simultaneously can be done in $O(n^{2.807})$ using the Strassen algorithm (Strassen, 1969) or even more efficiently in $O(n^{2.375})$ using the Coppersmith-Winograd algorithm (Coppersmith et al., 1987).¹ Thus we will show how automatic differentiation is performed on multidimensional functions. To our knowledge this is the working principle of the derivation functions in frameworks like Theano and TensorFlow. For functions working in two- or higher-dimensional space, we use the vectorization operator vec to transform them into vector-valued functions. For a D-dimensional tensor $A \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_D}$ the vectorization operator is defined elementwise by

$$(\operatorname{vec} A)_{\sum_d s_d i_d} \triangleq A_{i_1, i_2, \ldots, i_D} , \quad i_d \in \{1, \ldots, N_d\} , \tag{4.6}$$
where the strides $s_d$ are given by
$$s_d \triangleq \prod_{b=2}^{d} N_{b-1} .$$
As an example, for a matrix $A \in \mathbb{R}^{N \times M}$ this operator takes the columns of the matrix and stacks them on top of one another,

$$\operatorname{vec} A = \big(A_{11}, A_{21}, \ldots, A_{N1}, A_{12}, A_{22}, \ldots, A_{N2}, \ldots, A_{1M}, A_{2M}, \ldots, A_{NM}\big)^{\mathsf{T}} .$$
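In NumPy terms (our own illustrative note), vec corresponds to column-major flattening:

import numpy as np

# The vec operator stacks columns, i.e. Fortran (column-major) order.
A = np.array([[1, 2, 3],
              [4, 5, 6]])      # shape (N, M) = (2, 3)
print(A.flatten(order="F"))    # [1 4 2 5 3 6]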

Thus the derivatives of a tensor-valued function $F : \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_D} \to \mathbb{R}^{M_1 \times M_2 \times \cdots \times M_{D'}}$ can be dealt with by defining a helper function $\widehat{F} : \mathbb{R}^{N_1 N_2 \cdots N_D} \to \mathbb{R}^{M_1 M_2 \cdots M_{D'}}$ with $\widehat{F}(\operatorname{vec} X) = \operatorname{vec} F(X)$ and considering the derivatives of this vector-valued function $\widehat{F}$ instead. It remains to show how to apply automatic differentiation to vector-valued functions. To do so, let us first see how the chain rule works on vector-valued functions. Consider two functions, $g : \mathbb{R}^K \to \mathbb{R}^M$ and $h : \mathbb{R}^N \to \mathbb{R}^K$, and a composite function $f : \mathbb{R}^N \to \mathbb{R}^M$ with $f(x) \triangleq g(h(x))$. By expanding $g(r)$ as $g(r_1, r_2, \ldots, r_K)$ and $h(x)$ as $\big(h_1(x), h_2(x), \ldots, h_K(x)\big)^{\mathsf{T}}$ we can write

¹ While these algorithms are asymptotically faster than naive matrix multiplication, they also have a larger constant factor in their running time that is not captured by the big O notation. Therefore in practice they are only beneficial for matrices larger than a certain size.

$$f_i(x) = g_i\big(h_1(x), h_2(x), \ldots, h_K(x)\big)$$
and apply the chain rule to each argument of $g_i$, resulting in

$$\frac{\partial f_i}{\partial x_j} = \sum_{k=1}^{K} \frac{\partial g_i}{\partial h_k}\, \frac{\partial h_k}{\partial x_j} . \tag{4.7}$$

By introducing the Jacobian
$$\bigg(\frac{\mathrm{d}f}{\mathrm{d}x}\bigg)_{ij} \triangleq \frac{\partial f_i}{\partial x_j}$$
we can rewrite (4.7) as a vectorized equation,

$$\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x} , \tag{4.8}$$

and thus obtain the chain rule for vector-valued functions. As we see, it is like the chain rule for scalars but with scalar multiplication replaced by matrix multiplication.

The algorithm for automatic differentiation of vector-valued functions is thus the same as the scalar automatic differentiation described in section 4.1, but with eq. (4.3) replaced by

$$\frac{\partial f}{\partial f_1} = \mathbf{1} \tag{4.9}$$
and eq. (4.4) replaced by
$$\frac{\partial f}{\partial f_i} = \sum_{d \in \mathrm{csmr}(f_i)} \frac{\partial f}{\partial f_d} \cdot \frac{\partial f_d}{\partial f_i} . \tag{4.10}$$

For many common operations the size of the Jacobian $\partial f_d / \partial f_i$ may become very large. For example, the Jacobian of a matrix multiplication is of size $n^4$ for two matrices of size $n \times n$. However, since most elements are indeed zero, it is possible and vastly more efficient to directly compute the product $(\partial f / \partial f_d) \cdot (\partial f_d / \partial f_i)$ without explicitly evaluating the Jacobian. This is also the case for all elementary operations that work elementwise, such as addition, subtraction and the Hadamard product, which result in a diagonal Jacobian matrix. Consequently the explicit form (4.10) should only be used as a fall-back when such a shortcut computation is not

available.

4.3 Systems of Integer Equalities and Inequalities

This section reviews methods to solve systems of integer equalities and inequalities. The algorithms presented here are a prerequisite for computing the elementwise derivative expressions of tensor-valued functions, which will be described in the following section.

4.3.1 Systems of Linear Integer Equations

Consider a system of linear equations

$$A_{11}\, x_1 + A_{12}\, x_2 + \cdots + A_{1m}\, x_m = b_1$$
$$A_{21}\, x_1 + A_{22}\, x_2 + \cdots + A_{2m}\, x_m = b_2$$
$$\vdots$$
$$A_{n1}\, x_1 + A_{n2}\, x_2 + \cdots + A_{nm}\, x_m = b_n ,$$

with integer coefficients $A \in \mathbb{Z}^{N \times M}$, integer variables $x \in \mathbb{Z}^M$ and integer targets $b \in \mathbb{Z}^N$. In matrix notation this system can be expressed much more briefly as

$$A x = b . \tag{4.11}$$

To determine the set of solutions the matrix A must be transformed into Smith normal form, which is a diagonal matrix of the form

$$S = \operatorname{diag}(\alpha_1, \alpha_2, \ldots, \alpha_R, 0, \ldots, 0) \tag{4.12}$$

with the property that

$$\alpha_i \mid \alpha_{i+1} , \quad 1 \le i < R , \tag{4.13}$$

where $a \mid b$ should be read as "$a$ divides $b$". Analogously $a \nmid b$ should be read as "$a$ does not divide $b$". The number of non-zero entries $R$ in the diagonal corresponds to the rank of $A$. It can be shown (Adkins et al., 1999) that for each non-zero matrix $A \in \mathbb{Z}^{N \times M}$ there exist invertible matrices $U \in \mathbb{Z}^{N \times N}$ and $V \in \mathbb{Z}^{M \times M}$ so that

$$S = U A V \tag{4.14}$$

where $S \in \mathbb{Z}^{N \times M}$ is the Smith normal form of $A$. Using the Smith normal form, the equation system (4.11) can be rewritten as
$$S x' = b' \tag{4.15}$$
with
$$x = V x' , \quad b' = U b . \tag{4.16}$$

Since $S$ is diagonal, the solutions can be read off from (4.15), as we describe in the following. For rows of $S$ that are zero the corresponding entries of $b'$ must also be zero, otherwise the equation system would be inconsistent and no solution exists. Thus for the system to be solvable we must have
$$C b = 0 \tag{4.17}$$

where $C \in \mathbb{Z}^{(N-R) \times N}$ with $C_{ij} = U_{R+i,j}$ is the sub-matrix consisting of the rows $R+1$ to $N$ of $U$. It is called the cokernel of $A$.

For each non-zero entry αi of S we must have

$$x_i' = \frac{b_i'}{\alpha_i} \tag{4.18}$$

and thus a solution exists only if $b_i'$ is divisible by $\alpha_i$. We can define a so-called pseudo-inverse $I \in \mathbb{Q}^{M \times N}$ by
$$I \triangleq V S^{\dagger} U \tag{4.19}$$

where $S^{\dagger} \in \mathbb{Q}^{M \times N}$ is defined by

$$S^{\dagger} = \operatorname{diag}(1/\alpha_1, 1/\alpha_2, \ldots, 1/\alpha_R, 0, \ldots, 0) , \tag{4.20}$$

with the factors $\alpha_i$ given by (4.12). This pseudo-inverse has the property that $A I A = A$. Thus, for every $b$ that is in the cokernel of $A$, we can obtain an $x$ so that $A x = b$ is fulfilled by setting $x = I b$. For the columns of $S$ that are zero the corresponding entries of $x'$ do not affect the value of $b'$. Consequently, the columns of the matrix $K \in \mathbb{Z}^{M \times (M-R)}$, with $K_{ij} = V_{i,R+j}$, are a basis for the kernel (also called null-space) of $A$. This means that $A K = 0$ and thus for every $b$ that is in the cokernel of $A$ we can write $b = A(I b + K z)$ where $z \in \mathbb{Z}^{M-R}$ is a vector of arbitrary integers. In summary, the equation system $A x = b$ has no integer solution for a particular $b$ if $C b \ne 0$ or $I b \notin \mathbb{Z}^M$. Otherwise, if $A$ has full rank, that is $R = N = M$, a unique integer solution exists, determined by $x = I b$. If $A$ has non-full rank, infinitely many integer solutions

M−R exist and are given by x = I b + K z where z Z is a vector of arbitrary integers. ∈
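As a concrete illustration, the following Python sketch applies this solution structure to a small full-rank example. The decomposition $S = UAV$ is hand-computed here (as algorithm 8 below would produce it); the specific matrices and names are ours, chosen for illustration:

import numpy as np
from fractions import Fraction

# Small full-rank example; S = U A V was computed by hand for this A.
A = np.array([[1, 2], [3, 4]], dtype=object)
U = np.array([[1, 0], [3, -1]], dtype=object)
V = np.array([[1, -2], [0, 1]], dtype=object)
S = U @ A @ V
assert S.tolist() == [[1, 0], [0, 2]]            # S = diag(1, 2), so R = 2

# pseudo-inverse I = V S^+ U with S^+ = diag(1/alpha_i), eqs. (4.19), (4.20)
S_pinv = np.array([[Fraction(1), 0], [0, Fraction(1, 2)]], dtype=object)
I = V @ S_pinv @ U
assert (A @ I @ A == A).all()                    # defining property A I A = A

# A has full rank (R = N = M), so kernel and cokernel are empty and the
# unique candidate solution is x = I b; it is valid iff it is integer.
def integer_solution(b):
    x = I @ np.array(b, dtype=object)
    if any(Fraction(xi).denominator != 1 for xi in x):
        return None                              # I b not integer: no solution
    return [int(xi) for xi in x]

print(integer_solution([3, 7]))                  # [1, 1], since A [1,1]^T = [3,7]^T
print(integer_solution([1, 0]))                  # None; only a rational solution exists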

Computation of the Smith Normal Form

An algorithm (H. S. Smith, 1860) that, given a matrix $A$, computes the Smith normal form $S$ and two matrices $U$ and $V$ such that $S = UAV$ is shown in algorithm 8. The algorithm transforms the matrix $A$ into Smith normal form by a series of elementary row and column operations. The matrices $U$ and $V$ are initialized to be identity matrices and the same row and column operations are applied to them, so that in the end the relation $S = UAV$ holds. Since all operations are elementary, it follows that $U$ and $V$ are invertible as required. By following the description of the algorithm it is clear that the resulting matrix $S$ will be diagonal and fulfill the divisibility property (4.13). To find the factors $\beta$, $\sigma$ and $\tau$ of Bézout's identity in steps 10 and 19 the extended Euclidean algorithm (Knuth, 1997) is used, which is presented in algorithm 9.

What remains to be shown is that the described algorithm terminates. With each iteration of the loop in step 9 the absolute value of the element $S_{aa}$ decreases, because it is replaced with the greatest common divisor (GCD) of itself and another element. Thus, this loop will terminate since, in the worst case, $S_{aa} = +1$ or $S_{aa} = -1$ will divide all following rows and columns. The same argument holds when the matrix must be rediagonalized due to the execution of step 36: it is easy to verify that the first diagonalization step executed thereafter will set $S_{aa} = \gcd(S_{aa}, S_{a+1,a+1})$ and thus the absolute value of $S_{aa}$ decreases. Hence, in the worst case, the loop terminates as soon as $S_{11} = S_{22} = \cdots = S_{R-1,R-1} = 1$, which then divides $S_{RR}$.

Algorithm 8: Smith normal form of an integer matrix
Input: non-zero matrix $A \in \mathbb{Z}^{N \times M}$
Output: Smith normal form $S \in \mathbb{Z}^{N \times M}$, invertible matrices $U \in \mathbb{Z}^{N \times N}$, $V \in \mathbb{Z}^{M \times M}$, rank $R$

1   $U \leftarrow \mathbb{1}_N$ ; $V \leftarrow \mathbb{1}_M$   // initialize U and V with identity matrices
2   $S \leftarrow A$   // initialize S with A
3   $a \leftarrow 1$   // initialize active row and column
4   while $\exists i, j : i \ge a \wedge j \ge a \wedge S_{ij} \neq 0$ do   // diagonalize S
      // bring non-zero pivot element into position $S_{aa}$
5     simultaneously $S_{\star a} \leftarrow S_{\star j}$ and $S_{\star j} \leftarrow S_{\star a}$
6     simultaneously $V_{\star a} \leftarrow V_{\star j}$ and $V_{\star j} \leftarrow V_{\star a}$
7     simultaneously $S_{a\star} \leftarrow S_{i\star}$ and $S_{i\star} \leftarrow S_{a\star}$
8     simultaneously $U_{a\star} \leftarrow U_{i\star}$ and $U_{i\star} \leftarrow U_{a\star}$
9     while $S$ is changing do   // zero all elements below and right of $S_{aa}$
10      while $\exists i : i > a \wedge S_{aa} \nmid S_{ia}$ do   // ensure divisibility of rows
11        find $\beta, \sigma, \tau$ so that $\beta = \gcd(S_{aa}, S_{ia}) = \sigma S_{aa} + \tau S_{ia}$   // algorithm 9
12        $\gamma \leftarrow S_{ia} / \beta$ ; $\alpha \leftarrow S_{aa} / \beta$
13        simultaneously $S_{a\star} \leftarrow \sigma S_{a\star} + \tau S_{i\star}$ and $S_{i\star} \leftarrow -\gamma S_{a\star} + \alpha S_{i\star}$
14        simultaneously $U_{a\star} \leftarrow \sigma U_{a\star} + \tau U_{i\star}$ and $U_{i\star} \leftarrow -\gamma U_{a\star} + \alpha U_{i\star}$
15      while $\exists i : i > a \wedge S_{ia} \neq 0$ do   // eliminate first element of rows
16        $f \leftarrow S_{ia} / S_{aa}$
17        $S_{i\star} \leftarrow S_{i\star} - f S_{a\star}$
18        $U_{i\star} \leftarrow U_{i\star} - f U_{a\star}$
19      while $\exists j : j > a \wedge S_{aa} \nmid S_{aj}$ do   // ensure divisibility of columns
20        find $\beta, \sigma, \tau$ so that $\beta = \gcd(S_{aa}, S_{aj}) = \sigma S_{aa} + \tau S_{aj}$   // algorithm 9
21        $\gamma \leftarrow S_{aj} / \beta$ ; $\alpha \leftarrow S_{aa} / \beta$
22        simultaneously $S_{\star a} \leftarrow \sigma S_{\star a} + \tau S_{\star j}$ and $S_{\star j} \leftarrow -\gamma S_{\star a} + \alpha S_{\star j}$
23        simultaneously $V_{\star a} \leftarrow \sigma V_{\star a} + \tau V_{\star j}$ and $V_{\star j} \leftarrow -\gamma V_{\star a} + \alpha V_{\star j}$
24      while $\exists j : j > a \wedge S_{aj} \neq 0$ do   // eliminate first element of columns
25        $f \leftarrow S_{aj} / S_{aa}$
26        $S_{\star j} \leftarrow S_{\star j} - f S_{\star a}$
27        $V_{\star j} \leftarrow V_{\star j} - f V_{\star a}$
28    $a \leftarrow a + 1$   // next diagonal element
29  $R \leftarrow a - 1$   // rank is the number of non-zero diagonal elements
30  for $a \in \{1, \ldots, R\}$ do
31    if $S_{aa} < 0$ then   // ensure positive diagonal
32      $S_{\star a} \leftarrow -S_{\star a}$
33      $V_{\star a} \leftarrow -V_{\star a}$
34    if $a \le R - 1 \wedge S_{aa} \nmid S_{a+1,a+1}$ then   // ensure divisibility constraints
35      $S_{\star a} \leftarrow S_{\star a} + S_{\star, a+1}$
36      $V_{\star a} \leftarrow V_{\star a} + V_{\star, a+1}$
37      go back to step 4

Algorithm 9: Extended Euclidean algorithm
Input: positive numbers $a \in \mathbb{Z}^+$, $b \in \mathbb{Z}^+$
Output: factors $x \in \mathbb{Z}$, $y \in \mathbb{Z}$, $z \in \mathbb{Z}$ fulfilling Bézout's identity $z = \gcd(a, b) = a x + b y$

1   $r_0 \leftarrow a$ ; $r_1 \leftarrow b$
2   $s_0 \leftarrow 1$ ; $s_1 \leftarrow 0$
3   $t_0 \leftarrow 0$ ; $t_1 \leftarrow 1$
4   $i \leftarrow 1$
5   while $r_i \neq 0$ do
6     $q \leftarrow \lfloor r_{i-1} / r_i \rfloor$   // integer division
7     $r_{i+1} \leftarrow r_{i-1} - q\, r_i$
8     $s_{i+1} \leftarrow s_{i-1} - q\, s_i$
9     $t_{i+1} \leftarrow t_{i-1} - q\, t_i$
10    $i \leftarrow i + 1$
11  $z \leftarrow r_{i-1}$ ; $x \leftarrow s_{i-1}$ ; $y \leftarrow t_{i-1}$
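A direct Python transcription of algorithm 9, with the index bookkeeping collapsed into rolling variables, might look as follows:

def extended_euclid(a: int, b: int):
    """Return (z, x, y) with z = gcd(a, b) = a*x + b*y (algorithm 9)."""
    r0, r1 = a, b
    s0, s1 = 1, 0
    t0, t1 = 0, 1
    while r1 != 0:
        q = r0 // r1                      # integer division
        r0, r1 = r1, r0 - q * r1
        s0, s1 = s1, s0 - q * s1
        t0, t1 = t1, t0 - q * t1
    return r0, s0, t0

z, x, y = extended_euclid(12, 42)
assert (z, 12 * x + 42 * y) == (6, 6)     # gcd(12, 42) = 6 = 12*(-3) + 42*1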

4.3.2 Systems of Linear Inequalities

Consider a system of linear inequalities

$$A_{11} x_1 + A_{12} x_2 + \cdots + A_{1M} x_M \ge b_1$$
$$A_{21} x_1 + A_{22} x_2 + \cdots + A_{2M} x_M \ge b_2$$
$$\vdots \qquad (4.21)$$
$$A_{N1} x_1 + A_{N2} x_2 + \cdots + A_{NM} x_M \ge b_N \,,$$

with coefficients $A \in \mathbb{R}^{N \times M}$, variables $x \in \mathbb{R}^M$ and biases $b \in \mathbb{R}^N$. Note that this notation can also describe equalities by including the same line twice, with one occurrence multiplied by $-1$ on both sides. In matrix notation this inequality system can be expressed more briefly as

$$A x \ge b \,. \qquad (4.22)$$

The objective is to transform the inequality system into the form

$$\max(L^M b) \le x_M \le \min(H^M b) \qquad (4.23a)$$
$$\max(L^{M-1} b + \widehat{L}^{M-1} x_M) \le x_{M-1} \le \min(H^{M-1} b + \widehat{H}^{M-1} x_M) \qquad (4.23b)$$
$$\max(L^{M-2} b + \widehat{L}^{M-2} x_{M-1 \ldots M}) \le x_{M-2} \le \min(H^{M-2} b + \widehat{H}^{M-2} x_{M-1 \ldots M}) \qquad (4.23c)$$
$$\vdots$$
$$\max(L^2 b + \widehat{L}^2 x_{3 \ldots M}) \le x_2 \le \min(H^2 b + \widehat{H}^2 x_{3 \ldots M}) \qquad (4.23d)$$
$$\max(L^1 b + \widehat{L}^1 x_{2 \ldots M}) \le x_1 \le \min(H^1 b + \widehat{H}^1 x_{2 \ldots M}) \,, \qquad (4.23e)$$

so that the range of each element $x_i$ can be determined sequentially. Here $x_{i \ldots j}$ should be read as the subvector of $x$ starting at element $i$ and including all elements up to and including element $j$. Furthermore $\min z$ and $\max z$ denote the minimum and maximum element of a vector $z$. The transformed system should be tight in the sense that, given a subvector $x_{M-s \ldots M}$ which satisfies the first $s+1$ inequalities, there must exist remaining elements $x_{1 \ldots M-s-1}$ so that $x$ satisfies all inequalities. This is equivalent to demanding that the transformed inequalities must not allow values for an element $x_i$ such that the ranges of allowed values for other elements of $x$ become empty. Obviously the matrices $L^i$, $\widehat{L}^i$, $H^i$ and $\widehat{H}^i$ depend on $A$ and must be determined.

Multiplying an inequality by a positive, non-zero factor results in an equivalent system, where equivalent means that it has exactly the same set of solutions as the original system.

Thus, by dividing each line $i$ with $A_{i1} \neq 0$ by the factor $|A_{i1}|$ and rearranging, we can bring a system of the form (4.21) into the equivalent form

$$x_1 + \sum_{j=2}^{M} D_{hj} x_j \ge d_h \,, \quad h \in \{1, \ldots, H\} \,, \qquad (4.24a)$$
$$-x_1 + \sum_{j=2}^{M} E_{kj} x_j \ge e_k \,, \quad k \in \{1, \ldots, K\} \,, \qquad (4.24b)$$
$$\sum_{j=2}^{M} F_{lj} x_j \ge f_l \,, \quad l \in \{1, \ldots, L\} \,, \qquad (4.24c)$$

with $H + K + L = N$. It is clear that adding two inequalities will not reduce the set of solutions, i.e. if $x$ is a solution to the inequalities $a^T x \ge \alpha$ and $b^T x \ge \beta$, then $x$ is also a solution to the inequality $(a + b)^T x \ge \alpha + \beta$. Consequently, by adding each inequality from (4.24a) to each inequality from (4.24b) and dropping the used inequalities, we arrive at the reduced system with $x_1$ eliminated,

$$\sum_{j=2}^{M} (D_{hj} + E_{kj})\, x_j \ge d_h + e_k \,, \quad h \in \{1, \ldots, H\} \,,\; k \in \{1, \ldots, K\} \,, \qquad (4.25a)$$
$$\sum_{j=2}^{M} F_{lj} x_j \ge f_l \,, \quad l \in \{1, \ldots, L\} \,, \qquad (4.25b)$$

which has at least the solutions $x$ of the original system (4.24). Fourier and Motzkin (G. B. Dantzig and Eaves, 1973) observed that both systems are indeed equivalent. To verify this, we have to show that for each solution $x_{2 \ldots M}$ of (4.25) there exists $x_1$ so that the combined $x$ satisfies (4.24). From (4.24a) and (4.24b) we see that an $x_1$ satisfying the original system is given by

$$\min_k \left( \sum_{j=2}^{M} E_{kj} x_j - e_k \right) \ge x_1 \ge \max_h \left( -\sum_{j=2}^{M} D_{hj} x_j + d_h \right) \qquad (4.26)$$

and rewriting (4.25a) as

$$\sum_{j=2}^{M} E_{kj} x_j - e_k \ge -\sum_{j=2}^{M} D_{hj} x_j + d_h \,, \quad h \in \{1, \ldots, H\} \,,\; k \in \{1, \ldots, K\} \qquad (4.27)$$

shows that an $x_1$ with this property exists if the reduced system is satisfied.

By iteratively applying the reduction method just described, we can sequentially eliminate $x_1$, $x_2$ and so on up to $x_M$, as long as there exists at least one pair of inequalities with opposite signs for a specific $x_i$. If this is not the case, the remaining $x_{i+1 \ldots M}$ are not affected by these inequalities, since a value for $x_i$ can always be found after determining $x_{i+1 \ldots M}$ because $x_i$ is bounded from one side only; consequently, when $x_i$ occurs with positive or negative sign only, all inequalities containing $x_i$ can be dropped to progress with the elimination. After $x_M$ has been eliminated, what remains is a system of constant inequalities of the form

$$0 \ge f_l \,, \quad l \in \{1, \ldots, L\} \,. \qquad (4.28)$$

If these inequalities contain a contradiction, i.e. if any $f_l$ is positive, the original system of inequalities is inconsistent and the set of solutions for $x$ is empty.

This elimination method gives rise to algorithm 10, which has been adapted from (G. Dantzig, 2016; G. B. Dantzig and Thapa, 2006a,b) to work on the matrix $A$ only and thus solves the system of inequalities for arbitrary $b$. The algorithm produces matrices $L^i$, $H^i$ and $\widehat{L}^i$, $\widehat{H}^i$ for $i \in \{1, \ldots, M\}$ that can be inserted into the inequalities (4.23) to subsequently obtain the ranges for each element of $x$. It also outputs the feasibility matrix $F$, with the property that if $F b \le 0$, then there exists a solution for a particular $b$.

Algorithm 10: Fourier-Motzkin elimination for a system of linear inequalities $A x \ge b$
Input: matrix $A \in \mathbb{R}^{N \times M}$
Output: matrices $L^i$, $H^i$ and $\widehat{L}^i$, $\widehat{H}^i$ for $i \in \{1, \ldots, M\}$ for use in (4.23); feasibility matrix $F$

1   $B \leftarrow \mathbb{1}_N$   // initialize B with identity matrix
2   for $k \in \{1, \ldots, M\}$ do   // loop over variables to eliminate
      // divide each row $i$ by $|A_{ik}|$
3     for $i \in \{1, \ldots, N\}$ do
4       if $A_{ik} \neq 0$ then
5         $A_{i\star} \leftarrow A_{i\star} / |A_{ik}|$
6         $B_{i\star} \leftarrow B_{i\star} / |A_{ik}|$
      // extract solution matrices
7     $\zeta \leftarrow \{i \mid A_{ik} = 0\}$ ; $\varphi \leftarrow \{i \mid A_{ik} = +1\}$ ; $\mu \leftarrow \{i \mid A_{ik} = -1\}$
8     $S^k \leftarrow -(\text{columns } k+1, \ldots, M \text{ of } A)$
9     $L^k \leftarrow \text{rows } \varphi \text{ of } B$ ; $H^k \leftarrow -(\text{rows } \mu \text{ of } B)$
10    $\widehat{L}^k \leftarrow \text{rows } \varphi \text{ of } S^k$ ; $\widehat{H}^k \leftarrow -(\text{rows } \mu \text{ of } S^k)$
      // eliminate $x_k$
11    if $\varphi = \emptyset \wedge \mu = \emptyset$ then
        // $x_k$ does not occur, nothing to eliminate
12    else if $\varphi = \emptyset \vee \mu = \emptyset$ then
        // $x_k$ occurs with coefficient $+1$ or $-1$ only
13      $A \leftarrow \text{rows } \zeta \text{ of } A$ ; $B \leftarrow \text{rows } \zeta \text{ of } B$
14    else
        // $x_k$ occurs with coefficients $+1$ and $-1$
15      $A' \leftarrow \text{rows } \zeta \text{ of } A$ ; $B' \leftarrow \text{rows } \zeta \text{ of } B$
16      for $p \in \varphi$ do
17        for $n \in \mu$ do
18          $A' \leftarrow A'$ with row $A_{p\star} + A_{n\star}$ appended
19          $B' \leftarrow B'$ with row $B_{p\star} + B_{n\star}$ appended
20      $A \leftarrow A'$ ; $B \leftarrow B'$
21  $F \leftarrow B$   // inequalities with no variables left
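For intuition, the following hedged Python sketch implements only the feasibility side of the elimination on concrete numeric data; unlike algorithm 10, it does not factor out $b$ or return the bound matrices:

import numpy as np

def fourier_motzkin_feasible(A, b):
    """Check feasibility of A x >= b by eliminating variables left to right."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    for _ in range(A.shape[1]):
        # normalize rows so the leading coefficient is +1, -1 or 0
        scale = np.where(A[:, 0] != 0, np.abs(A[:, 0]), 1.0)
        A, b = A / scale[:, None], b / scale
        pos, neg, zer = A[:, 0] > 0, A[:, 0] < 0, A[:, 0] == 0
        new_A, new_b = [A[zer, 1:]], [b[zer]]
        # pair each lower bound on x_k with each upper bound, cf. (4.25a)
        for p in np.where(pos)[0]:
            for n in np.where(neg)[0]:
                new_A.append(A[p, 1:][None, :] + A[n, 1:][None, :])
                new_b.append(b[p:p+1] + b[n:n+1])
        A, b = np.vstack(new_A), np.concatenate(new_b)
    return bool(np.all(0 >= b))     # remaining constant inequalities (4.28)

# 1 <= x <= 3 is feasible; additionally requiring x >= 5 is not
print(fourier_motzkin_feasible([[1.0], [-1.0]], [1.0, -3.0]))         # True
print(fourier_motzkin_feasible([[1.0], [-1.0], [1.0]], [1.0, -3.0, 5.0]))  # False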

4.4 Elementwise-Defined Functions and their Derivatives

We introduce the situations that can occur when calculating the derivatives of elementwise-defined tensors using the following set of examples. Then we will describe a general method to derive expressions for the derivatives of elementwise-defined tensors, where the indices of the arguments are an arbitrary linear combination of the indices of the function output. Summations within these expressions are allowed.

Consider the vector-valued function $f^1 : \mathbb{R}^N \to \mathbb{R}^N$ that is defined by specifying how each element of $f^1(x)$ depends on the elements of its argument $x$. A very simple example of such a function is
$$f^1_i(x) = \sin x_i \,.$$

Here it is straightforward to see that its Jacobian is given by

$$\frac{\partial f^1_i}{\partial x_{i'}} = \delta_{i,i'} \cos x_{i'}$$

since element $i$ of $f^1$ only depends on element $i$ of its argument $x$. Hence, the Kronecker delta was introduced in the above expression to make sure that $\partial f^1_i / \partial x_{i'} = 0$ for $i \neq i'$.

Further assume that $f$ is part of a scalar function $l$ with $l(x) = g(f(x))$ and that the derivatives of $l$ w.r.t. the elements of $x$ are to be derived. The derivatives $\partial g / \partial f_i$ are supposed to be known. Let us introduce the notation
$$\mathrm{d}\bullet_\alpha = \frac{\partial l}{\partial \bullet_\alpha}$$
for the derivative of $l$ w.r.t. an element of a variable or function. In the context of deep learning this is the derivative we are usually interested in, since it provides the gradient of a loss function $l$ and is thus used for minimization of the loss. The explicit computation of the Jacobians $\partial f_i / \partial x_j$ is usually not of interest since it wastes space. We obtain for our function $f^1(x)$,
$$\mathrm{d}x_{i'} = \sum_i \frac{\partial g}{\partial f^1_i} \frac{\partial f^1_i}{\partial x_{i'}} = \sum_i \mathrm{d}g_i\, \delta_{i,i'} \cos x_{i'} = \mathrm{d}g_{i'} \cos x_{i'} \,.$$

Let us now consider a slightly more complicated example given by the function $f^2 : \mathbb{R}^N \times \mathbb{R}^{N \times N} \to \mathbb{R}^{N \times N}$ of two arguments with the elementwise specification

$$f^2_{ij}(x, y) = x_i\, y_{ij} \,.$$

The (extended) Jacobians w.r.t. x and y are given by

$$\frac{\partial f^2_{ij}}{\partial x_{i'}} = \delta_{i,i'}\, y_{i'j} \,, \qquad \frac{\partial f^2_{ij}}{\partial y_{i'j'}} = \delta_{i,i'}\, \delta_{j,j'}\, x_{i'} \,,$$

where the derivative w.r.t. $x$ does not contain a Kronecker delta for index $j$, since $j$ is not used to index the variable $x$. Consequently, application of the chain rule gives the following derivatives of $l$,
$$\mathrm{d}x_{i'} = \sum_j \mathrm{d}g_{i'j}\, y_{i'j} \,, \qquad \mathrm{d}y_{i'j'} = \mathrm{d}g_{i'j'}\, x_{i'} \,,$$
where the lack of index $j$ on the variable $x$ has led to a summation over this index.

Another situation is demonstrated by the function $f^3 : \mathbb{R}^{N \times N} \to \mathbb{R}^N$ with

$$f^3_i(x) = x_{ii}^3 \,.$$

The Jacobian,
$$\frac{\partial f^3_i}{\partial x_{i'j'}} = \delta_{i,i'}\, \delta_{i,j'}\, 3 x_{i'j'}^2 \,,$$
now contains two Kronecker deltas for the index $i$ to express that $i = i' = j'$ must hold for the derivative to be non-zero. This leads to the derivative of $l$,

$$\mathrm{d}x_{i'j'} = \delta_{i',j'}\, \mathrm{d}g_{i'}\, 3 x_{i'j'}^2 \,,$$
which now contains a Kronecker delta itself, since it has not been canceled out by a corresponding summation.

A good example of a function containing a summation over its arguments is the matrix dot product,
$$f^4_{ij}(x, y) = \sum_k x_{ik}\, y_{kj} \,,$$
which has the (extended) Jacobians

$$\frac{\partial f^4_{ij}}{\partial x_{i'k'}} = \sum_k \delta_{i,i'}\, \delta_{k,k'}\, y_{kj} \,, \qquad \frac{\partial f^4_{ij}}{\partial y_{k'j'}} = \sum_k \delta_{j,j'}\, \delta_{k,k'}\, x_{ik} \,.$$

Thus the derivatives of l evaluate to

$$\mathrm{d}x_{i'k'} = \sum_i \sum_j \sum_k \mathrm{d}g_{ij}\, \delta_{i,i'}\, \delta_{k,k'}\, y_{kj} = \sum_j \mathrm{d}g_{i'j}\, y_{k'j} \,,$$
$$\mathrm{d}y_{k'j'} = \sum_i \sum_j \sum_k \mathrm{d}g_{ij}\, \delta_{j,j'}\, \delta_{k,k'}\, x_{ik} = \sum_i \mathrm{d}g_{ij'}\, x_{ik'} \,.$$
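The two contracted forms are easy to verify numerically. In this illustrative NumPy snippet (names are ours), $l = \sum_{ij} \mathrm{d}g_{ij} f^4_{ij}$ is chosen so that the incoming gradient is exactly $\mathrm{d}g$:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
y = rng.normal(size=(4, 5))
dg = rng.normal(size=(3, 5))          # given derivatives dl/df

dx = np.einsum('ij,kj->ik', dg, y)    # dx_{i'k'} = sum_j dg_{i'j} y_{k'j}
dy = np.einsum('ij,ik->kj', dg, x)    # dy_{k'j'} = sum_i dg_{ij'} x_{ik'}

# check dx against central finite differences of l = sum(dg * (x @ y))
eps = 1e-6
num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    e = np.zeros_like(x); e[idx] = eps
    num[idx] = (np.sum(dg * ((x + e) @ y)) - np.sum(dg * ((x - e) @ y))) / (2 * eps)
assert np.allclose(dx, num)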

Note that the summation indices of the derivatives have changed from $k$ to $j$ and $i$ respectively.

Finally, consider the situation where the indices of the argument are given by a linear combination of the function indices, as demonstrated by $f^5 : \mathbb{R}^{NM} \to \mathbb{R}^{N \times M}$ with

$$f^5_{ij}(x) = \exp x_{Mi+j} \,.$$

This indexing scheme could, for example, be used to reshape a vector organized in row-major form into a matrix. Its Jacobian is straightforward to express,

$$\frac{\partial f^5_{ij}}{\partial x_{i'}} = \delta_{Mi+j,\, i'} \exp x_{Mi+j} \,,$$
however, to efficiently express the derivative of $l$,

$$\mathrm{d}x_{i'} = \sum_i \sum_j \mathrm{d}g_{ij}\, \delta_{Mi+j,\, i'} \exp x_{Mi+j} \,,$$
the occurring Kronecker delta should be combined with one of the sums, because one of them is redundant. To do so it is necessary to solve the equation $Mi + j = i'$ for $j$, which is trivial in this example. The solution is given by $j = i' - Mi$ and after substitution this results in
$$\mathrm{d}x_{i'} = \sum_i \mathrm{d}g_{i,\, i'-Mi} \exp x_{i'} \,.$$
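In NumPy (zero-based indexing, shapes chosen for illustration) this reshape example and its derivative can be checked as follows; since every $x_{i'}$ feeds exactly one element of $f^5$, the remaining sum over $i$ collapses:

import numpy as np

N, M = 3, 4
rng = np.random.default_rng(1)
x = rng.normal(size=N * M)
dg = rng.normal(size=(N, M))                 # incoming gradient dl/df

# dx_{i'} = dg_{i, i'-M*i} exp(x_{i'}) with the single valid i = i' // M;
# for a row-major reshape this is just the flattened dg times exp(x)
dx = dg.reshape(-1) * np.exp(x)

# finite-difference check of l = sum(dg * f^5(x)) with f^5 = exp(x).reshape(N, M)
eps = 1e-6
for n in range(N * M):
    e = np.zeros(N * M); e[n] = eps
    num = (np.sum(dg * np.exp(x + e).reshape(N, M))
           - np.sum(dg * np.exp(x - e).reshape(N, M))) / (2 * eps)
    assert abs(num - dx[n]) < 1e-5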

We have seen that, depending on the constellation of indices of the arguments of an elementwise-defined function, the derivative may introduce additional summations, drop existing summations, introduce Kronecker deltas, require substituting the solution of a linear equation system into the indices, or any combination of these.

4.4.1 Computing Elementwise Derivative Expressions

We first describe the method without accounting for summations inside the function and reintroduce them later. Generally the problem of computing expressions for elementwise derivatives can be stated as follows. Let $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_{D_f})$ be a multi-index and let the tensor-valued function $f : \mathbb{R}^{N^1_1 \times \cdots \times N^1_{D_1}} \times \cdots \times \mathbb{R}^{N^P_1 \times \cdots \times N^P_{D_P}} \to \mathbb{R}^{N^f_1 \times \cdots \times N^f_{D_f}}$, taking $P$ tensor arguments called $x^1, x^2, \ldots, x^P$, be specified elementwise,

$$f_\alpha(x^1, x^2, \ldots, x^P) \triangleq f\big(x^1_{A^1\alpha},\, x^2_{A^2\alpha},\, \ldots,\, x^P_{A^P\alpha}\big) \,, \qquad (4.29)$$

where each matrix $A^p : \mathbb{Z}^{D_f} \to \mathbb{Z}^{D_p}$ maps from the indices of $f$ to the indices of its argument $x^p$. Such a linear transform covers all the cases shown in the introductory examples. If the same argument $x^p$ should appear multiple times with different indices, we treat it as different arguments (by renaming the different occurrences) and sum over the resulting expressions for the derivatives after they have been obtained. Note that $f : \mathbb{R} \times \cdots \times \mathbb{R} \to \mathbb{R}$ is a scalar function. Furthermore let $g : \mathbb{R}^{N^f_1 \times \cdots \times N^f_{D_f}} \to \mathbb{R}$ be a scalar-valued function and let $l = g \circ f$. Let $\mathrm{d}f \in \mathbb{R}^{N^f_1 \times \cdots \times N^f_{D_f}}$ be the tensor of derivatives of $l$ w.r.t. the elements of $f$, thus by the above definition
$$\mathrm{d}f_\alpha = \frac{\partial l}{\partial f_\alpha} = \frac{\partial g}{\partial f_\alpha} \,. \qquad (4.30)$$
The objective is to obtain expressions that specify the derivatives of $l$ w.r.t. the elements of each $x^p$ elementwise, i.e.
$$\mathrm{d}x^p_{\beta^p} = \frac{\partial l}{\partial x^p_{\beta^p}} \qquad (4.31)$$
where $\beta^p = (\beta^p_1, \beta^p_2, \ldots, \beta^p_{D_p})$ is a multi-index enumerating the elements of $x^p$.

Applying the chain rule to (4.31) gives

$$\mathrm{d}x^p_{\beta^p} = \frac{\partial l}{\partial x^p_{\beta^p}} = \sum_{\substack{1 \le \alpha \le N^f \\ A^p\alpha = \beta^p}} \frac{\partial l}{\partial f_\alpha}\, \frac{\partial f_\alpha}{\partial x^p_{\beta^p}} = \sum_{\alpha \in \Gamma(\beta^p)} \mathrm{d}f_\alpha\, \frac{\partial f}{\partial x^p_{A^p\alpha}} \qquad (4.32)$$

and since $f$ is a scalar function, computing the scalar derivative $\partial f / \partial x^p_{A^p\alpha}$ is straightforward using the strategy described in section 4.1. Thus the main challenge is to efficiently evaluate the summation over the set

$$\Gamma(\beta^p) \triangleq \{ \alpha \in \mathbb{Z}^{D_f} \mid 1 \le \alpha \le N^f \wedge A^p\alpha = \beta^p \} \,, \qquad (4.33)$$

i.e. find all integer vectors $\alpha$ that fulfill the relation $A^p\alpha = \beta^p$ and lie within the range $1 \le \alpha \le N^f$ determined by the shape of $f$.

An elementary approach is to rewrite eq. (4.32) as

$$\mathrm{d}x^p_{\beta^p} = \sum_{\alpha_1=1}^{N^f_1} \cdots \sum_{\alpha_{D_f}=1}^{N^f_{D_f}} \delta_{A^p\alpha - \beta^p}\, \mathrm{d}f_\alpha\, \frac{\partial f}{\partial x^p_{A^p\alpha}} \qquad (4.34)$$

where the single-argument Kronecker delta is given by $\delta_t = 0$ for $t \neq 0$ and $\delta_0 = 1$. Thus for each index $\alpha$ of $f$ we test explicitly whether it contributes to the derivative of index $\beta^p$ of argument $x^p$ and, if so, we include that element in the summation. By evaluating (4.34) for all $\beta^p$ in parallel, the cost of iterating over $\alpha$ can be amortized over all elements of $\mathrm{d}x^p$. However, if multiple threads are used to perform this iteration, as is required to obtain acceptable performance on modern GPUs, locking becomes necessary to serialize writes to the same element of $\mathrm{d}x^p$. If $A^p$ has low rank, write collisions on $\mathrm{d}x^p_{A^p\alpha}$ become likely, leading to serialization and thus considerable performance loss.² Another drawback of this approach is that even if only a subset of elements of the derivative $\mathrm{d}x^p$ are required, the summation must always be performed over the whole range of $\alpha$. Furthermore, while not being of great importance for the minimization of loss functions in machine learning, it is regrettable that no symbolic expression for $\mathrm{d}x^p_{\beta^p}$ is obtained using this method.

For these reasons it is advantageous to find a form of the set (4.33) that directly enumerates all $\alpha$ belonging to a particular $\beta^p$. This requires solving $A^p\alpha = \beta^p$ for $\alpha$. In general, a set of linear equations with integer coefficients over integer variables has either none, one or infinitely many solutions. The set of solutions can be fully described using the pseudo-inverse, cokernel and kernel. Thus, let $I$ be the pseudo-inverse, $C$ the cokernel and $K$ the kernel of the integer matrix $A^p$ as defined in section 4.3.1. Using these matrices we can rewrite (4.33) as

$$\Gamma(\beta^p) = \{ I\beta^p + Kz \mid C\beta^p = 0 \,\wedge\, I\beta^p \in \mathbb{Z}^{D_f} \,\wedge\, z \in \mathbb{Z}^{\kappa} \,\wedge\, 1 \le I\beta^p + Kz \le N^f \} \,, \qquad (4.35)$$

where $\kappa$ is the dimensionality of the kernel of $A^p$. The conditions $C\beta^p = 0$ and $I\beta^p \in \mathbb{Z}^{D_f}$ determine whether the set is empty or not for a particular $\beta^p$, and since they are independent of $z$, they only need to be checked once for each $\beta^p$. Thus if these conditions do not hold, we can immediately conclude that $\mathrm{d}x^p_{\beta^p} = 0$. Otherwise, in order to further simplify the set specification, we need to find the elements of the set

$$\Sigma(\beta^p) \triangleq \{ z \in \mathbb{Z}^{\kappa} \mid 1 \le I\beta^p + Kz \le N^f \} \qquad (4.36)$$

² The CUDA programming guide (Nvidia, 2017) is vague about the performance penalties associated with atomic addition to the same memory address from within multiple threads. Nonetheless, experiments (Farzad, 2013; yidiyidawu, 2012) show that performance can be degraded by up to a factor of 32 due to locking and the resulting serialization.

containing all $z$ that generate values for $\alpha$ within its valid range. Since $\alpha(z) = I\beta^p + Kz$ is an affine transformation, the set $\Sigma(\beta^p)$ must be convex. By rewriting the system of inequalities defining the set $\Sigma(\beta^p)$ as

$$K z \ge 1 - I\beta^p \qquad (4.37\text{a})$$
$$-K z \ge -N^f + I\beta^p \qquad (4.37\text{b})$$

we can apply the Fourier-Motzkin algorithm described in section 4.3.2 to obtain the boundaries of the convex set in closed form. The Fourier-Motzkin algorithm produces matrices $L^i$, $H^i$ and $\widehat{L}^i$, $\widehat{H}^i$ so that (4.36) can be written as

$$\Sigma(\beta^p) = \{ z \in \mathbb{Z}^{\kappa} \mid \lceil \max(L^{\kappa} b) \rceil \le z_{\kappa} \le \lfloor \min(H^{\kappa} b) \rfloor \;\wedge\; \lceil \max(L^{\kappa-1} b + \widehat{L}^{\kappa-1} z_{\kappa}) \rceil \le z_{\kappa-1} \le \lfloor \min(H^{\kappa-1} b + \widehat{H}^{\kappa-1} z_{\kappa}) \rfloor \;\wedge\; \cdots \;\wedge\; \lceil \max(L^1 b + \widehat{L}^1 z_{2 \ldots \kappa}) \rceil \le z_1 \le \lfloor \min(H^1 b + \widehat{H}^1 z_{2 \ldots \kappa}) \rfloor \} \qquad (4.38)$$

where
$$b(\beta^p) \triangleq \begin{bmatrix} 1 - I\beta^p \\ -N^f + I\beta^p \end{bmatrix}$$
and $\lfloor \bullet \rfloor$ and $\lceil \bullet \rceil$ denote the floor and ceiling functions respectively. Since the Fourier-Motzkin algorithm executes independently of the value of $b$, the computationally intensive procedure of computing the matrices $L^i$, $H^i$ and $\widehat{L}^i$, $\widehat{H}^i$ is only done once for each kernel matrix $K$. Afterwards, computing the boundaries for a particular index $\beta^p$ requires only four matrix multiplications per dimension and the determination of the minimum and maximum value of a vector.

An example for a one-dimensional kernel, i.e. a line, is shown in fig. 4.3. In this case (4.38) consists of only one condition for $z_1$ and describes the part of the line that is inside the range specified by $1 \le \alpha \le N^f$. Another example, this time for a two-dimensional kernel, i.e. a plane, is shown in fig. 4.4. Due to the choice of the kernel basis, the range specified by $1 \le \alpha \le N^f$ becomes a parallelogram in the domain of the kernel and thus the resulting ranges for $z_1$ and $z_2$ are dependent on each other.

[Figure 4.3: A one-dimensional parameter index $\beta$ driven by a two-dimensional function index $\alpha$, shown in $\alpha$-space. The one-dimensional index of $x$ is given by $\beta = A\alpha$ with $A = [1\;\; -2]$. This yields $\alpha = I\beta + Kz$ with the pseudo-inverse $I^T = [1\;\; 0]$ and the one-dimensional kernel $K^T = [2\;\; 1]$. For $\beta = 1$ the set of possible values for $\alpha$ lies on the marked line with direction vector given by the kernel $K$. This set is limited by the requirement that $\alpha$ must be integer, thus only the marked points on the line are valid values for $\alpha$. Furthermore the constraint (4.36) imposed by the range of $\alpha$ requires valid values to lie between the points marked $l$ and $h$. Thus the values for $z$ allowed by (4.38) are $\Sigma(1) = \{ z \in \mathbb{Z} \mid 1 \le z \le 3 \}$, corresponding to the three points on the line inside the rectangle.]

Since we have the convex set
$$\Gamma(\beta^p) = \{ I\beta^p + Kz \mid C\beta^p = 0 \wedge I\beta^p \in \mathbb{Z}^{D_f} \wedge z \in \Sigma(\beta^p) \} = \{ I\beta^p + Kz \mid C\beta^p = 0 \wedge z \in \Sigma(\beta^p) \} \,, \qquad (4.39)$$
where we were able to drop the integer condition on $I\beta^p$ because it is redundant to $\Sigma(\beta^p)$ being non-empty, we can now expand the sum (4.32) and thus write down an explicit expression for $\mathrm{d}x^p_{\beta^p}$. This gives

$$\mathrm{d}x^p_{\beta^p} = \delta_{C\beta^p} \sum_{z_{\kappa} = \lceil \max(L^{\kappa} b) \rceil}^{\lfloor \min(H^{\kappa} b) \rfloor} \;\; \sum_{z_{\kappa-1} = \lceil \max(L^{\kappa-1} b + \widehat{L}^{\kappa-1} z_{\kappa}) \rceil}^{\lfloor \min(H^{\kappa-1} b + \widehat{H}^{\kappa-1} z_{\kappa}) \rfloor} \cdots \sum_{z_1 = \lceil \max(L^1 b + \widehat{L}^1 z_{2 \ldots \kappa}) \rceil}^{\lfloor \min(H^1 b + \widehat{H}^1 z_{2 \ldots \kappa}) \rfloor} \mathrm{d}f_{I\beta^p + Kz} \left. \frac{\partial f}{\partial x^p_{A^p\alpha}} \right|_{\alpha = I\beta^p + Kz} \,, \qquad (4.40)$$

in which no Kronecker delta occurs within the sums and thus all iterations are utilized. The evaluation of the sums can be parallelized without difficulty and no synchronization is necessary for writes to $\mathrm{d}x^p_{\beta^p}$, since in this form one thread can be used per element of $\mathrm{d}x^p$.
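The following illustrative Python snippet enumerates $\Gamma(\beta)$ for the setting of fig. 4.3 ($A = [1\; -2]$, $I^T = [1\; 0]$, $K^T = [2\; 1]$); the function shape $N^f = (7, 5)$ is an assumption chosen to match the figure. The kernel-based enumeration must agree with brute-force search over $\alpha$:

# Setting of fig. 4.3: alpha(z) = I*beta + K*z = (beta + 2z, z)
Nf = (7, 5)

def gamma_brute(beta):
    return {(a1, a2) for a1 in range(1, Nf[0] + 1)
                     for a2 in range(1, Nf[1] + 1) if a1 - 2 * a2 == beta}

def gamma_kernel(beta):
    # 1 <= beta + 2z <= Nf[0] and 1 <= z <= Nf[1], solved for z
    lo = max(-(-(1 - beta) // 2), 1)         # ceil((1 - beta) / 2)
    hi = min((Nf[0] - beta) // 2, Nf[1])     # floor((Nf[0] - beta) / 2)
    return {(beta + 2 * z, z) for z in range(lo, hi + 1)}

for beta in range(-9, 6):
    assert gamma_brute(beta) == gamma_kernel(beta)
print(gamma_kernel(1))                       # {(3, 1), (5, 2), (7, 3)}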

4.4.2 Handling Expressions Containing Sums

As mentioned earlier we also want to handle expressions that contain summations over one or more indices. For this purpose consider a function containing a summation depending on

[Figure 4.4: A one-dimensional parameter index $\beta$ driven by a three-dimensional function index $\alpha$. (a) The $\alpha$-space as a cut through the $\alpha_1$-$\alpha_2$ plane, i.e. the $\alpha_3$-axis is perpendicular to the drawing. The one-dimensional index of $x$ is given by $\beta = A\alpha$ with $A = [1\;\; 2\;\; -2]$. This yields $\alpha = I\beta + Kz$ with the pseudo-inverse $I^T = [1\;\; 0\;\; 0]$. A possible choice for the two-dimensional kernel is $K = \big[\begin{smallmatrix} 2 & 0 \\ 1 & 1 \\ 2 & 1 \end{smallmatrix}\big]$. For $\beta = 3$ the set of possible values for $\alpha$ is given by the sum of $I\beta$ and integer linear combinations of the columns of the kernel matrix $K$. The constraint (4.36) imposed by the range of $\alpha$ requires valid values to lie inside the marked rectangle. (b) By mapping this rectangle into the domain of the kernel, i.e. $z$-space, we obtain a parallelogram. Thus the values for $z$ allowed by (4.38) are the integer points that lie within this parallelogram, i.e. $\Sigma(3) = \{ z \in \mathbb{Z}^2 \mid -1 \le z_2 \le 5 \wedge \max(-1, -z_2) \le z_1 \le \min(1, 4 - z_2) \}$, corresponding to the 15 points inside the rectangle in $\alpha$-space. This causes the range of $z_1$ to become dependent on the value of $z_2$.]

arguments $y^1, \ldots, y^{P'}$. It can be written in the form

$$f_\alpha(y^1, \ldots, y^{P'}, x^1, \ldots, x^P) = f\big( s(y^1, \ldots, y^{P'}),\, x^1_{A^1\alpha},\, \ldots,\, x^P_{A^P\alpha} \big) \qquad (4.41)$$

with

$$s(y^1, \ldots, y^{P'}) = \sum_{k \in \Psi} s\big( y^1_{\widehat{A}^1 \widehat{\alpha}},\, \ldots,\, y^{P'}_{\widehat{A}^{P'} \widehat{\alpha}} \big) \qquad (4.42)$$

where $s : \mathbb{R} \times \cdots \times \mathbb{R} \to \mathbb{R}$, $\widehat{\alpha} \triangleq \big[\begin{smallmatrix} \alpha \\ k \end{smallmatrix}\big]$, $\widehat{A}^{p'} : \mathbb{Z}^{D_f + 1} \to \mathbb{Z}^{D_{p'}}$ and $\Psi \subset \mathbb{Z}$ is a convex integer set. Using the chain rule to calculate the derivative of $l$ (defined as before) w.r.t. $y^{p'}_{\beta^{p'}}$ gives

$$\mathrm{d}y^{p'}_{\beta^{p'}} = \frac{\partial l}{\partial y^{p'}_{\beta^{p'}}} = \sum_{1 \le \alpha \le N^f} \frac{\partial l}{\partial f_\alpha}\, \frac{\partial f_\alpha}{\partial y^{p'}_{\beta^{p'}}} = \sum_{1 \le \alpha \le N^f} \mathrm{d}f_\alpha\, \frac{\partial f}{\partial s} \sum_{\substack{k \in \Psi \\ \widehat{A}^{p'} \widehat{\alpha} = \beta^{p'}}} \frac{\partial s}{\partial y^{p'}_{\widehat{A}^{p'} \widehat{\alpha}}} = \sum_{\widehat{\alpha} \in \widehat{\Gamma}} \mathrm{d}f_\alpha\, \frac{\partial \widehat{f}}{\partial y^{p'}_{\widehat{A}^{p'} \widehat{\alpha}}} \qquad (4.43)$$

with the "sum-liberated" scalar function

$$\widehat{f}\big( y^1_{\widehat{A}^1 \widehat{\alpha}}, \ldots, y^{P'}_{\widehat{A}^{P'} \widehat{\alpha}}, x^1_{A^1\alpha}, \ldots, x^P_{A^P\alpha} \big) \triangleq f\big( s(y^1_{\widehat{A}^1 \widehat{\alpha}}, \ldots, y^{P'}_{\widehat{A}^{P'} \widehat{\alpha}}),\, x^1_{A^1\alpha}, \ldots, x^P_{A^P\alpha} \big) \qquad (4.44)$$

and the "sum-extended" multi-index set

$$\widehat{\Gamma} \triangleq \left\{ \begin{bmatrix} \alpha \\ k \end{bmatrix} \;\middle|\; \alpha \in \mathbb{Z}^{D_f} \wedge k \in \Psi \wedge 1 \le \alpha \le N^f \wedge \widehat{A}^{p'} \begin{bmatrix} \alpha \\ k \end{bmatrix} = \beta^{p'} \right\} \,. \qquad (4.45)$$

Note that (4.43) equals the original expression for the derivative (4.32) but with $f$ replaced by $\widehat{f}$, which is the same as $f$ but with the sum symbol removed, and $\Gamma$ replaced by $\widehat{\Gamma}$, which additionally includes the conditions on $k$ from the summation range.

Thus handling summations can be done using the previously described strategy for derivation by extending it as follows. Each sum symbol ($\Sigma$) in the function $f$ to be derived is removed, its summation index is appended to the multi-index $\alpha$ of $f$ and its summation range is included as an additional constraint in the set $\Gamma$. This process is iterated for nested sums. When indexing into $\mathrm{d}f$ the additional indices in $\alpha$ introduced by sums are ignored.

4.4.3 Elementwise Derivation Algorithm

Combining the concepts described so far, algorithm 11 computes expressions for the derivatives $\mathrm{d}x^p_{\beta^p} = \partial l / \partial x^p_{\beta^p}$ of an elementwise-defined function $f$. Since the summation ranges (4.38) in the produced derivative are of the same form as the index ranges (4.36) of the input function, and we have shown in section 4.4.2 how to handle summations in the input function, we can iteratively apply the derivation algorithm to derivative functions to obtain second- and higher-order derivatives. Therefore the set of elementwise-defined functions using linear combinations of indices for their arguments is closed under the operation of derivation.

If an explicit expression for the Jacobian $\partial f_{\alpha'} / \partial x^p_{\beta^p}$ is desired, it can be obtained from $\mathrm{d}x^p_{\beta^p}$ by substituting
$$\mathrm{d}f_\alpha = \prod_d \delta_{\alpha_d, \alpha'_d}$$
into it.

Algorithm 11: Elementwise expression derivation
Input: elementwise-defined tensor-valued function $f$ taking $P$ tensor arguments $x^1, \ldots, x^P$; expression of the derivative $\mathrm{d}f_\alpha \triangleq \partial l / \partial f_\alpha$
Output: expressions of the derivatives w.r.t. the arguments $\mathrm{d}x^p_{\beta^p} \triangleq \partial l / \partial x^p_{\beta^p}$

1   for $p \in \{1, \ldots, P\}$ do   // loop over arguments $x^p$
2     $\mathrm{d}x^p_{\beta^p} \leftarrow 0$
3     for $q \in \{1, \ldots, Q_p\}$ do   // loop over index expressions for $x^p$
        // compute derivative expression w.r.t. $x^p_{A^{pq}\alpha}$ using reverse
        // accumulation automatic differentiation (sec. 4.1)
4       $\Delta \leftarrow \mathrm{d}f_\alpha \, \partial f_\alpha / \partial x^p_{A^{pq}\alpha}$   // ignore sum symbols within $f$
        // compute range constraints from the shape of $f$ and the limits of occurring sums
5       $\Omega \leftarrow \{$range constraints on $\alpha$ of the form $R\alpha \ge r\}$
        // compute the Smith normal form (sec. 4.3.1) to obtain the following
6       $I \leftarrow$ integer pseudo-inverse of $A^{pq}$
7       $K \leftarrow$ integer kernel of $A^{pq}$
8       $C \leftarrow$ integer cokernel of $A^{pq}$
        // rewrite constraints using $\beta^p$ and kernel factors $z$
9       $\Omega' \leftarrow \{ R K z \ge r - R I \beta^p \mid (R\alpha \ge r) \in \Omega \}$
        // solve $\Omega'$ for $z$ using Fourier-Motzkin elimination (sec. 4.3.2)
10      $\Sigma \leftarrow$ range constraints $\Omega'$ on $z$ transformed into form (4.38)
        // generate derivative expressions
11      $\mathrm{d}x^p_{\beta^p} \leftarrow \mathrm{d}x^p_{\beta^p} + \delta_{C\beta^p} \sum_{z \in \Sigma} \Delta|_{\alpha = I\beta^p + Kz}$   // use form (4.40) for the sum

4.5 Example and Numeric Verification

In our implementation, and thus in this example, we use zero-based indexing, i.e. a vector $x \in \mathbb{R}^N$ has indices $\{0, \ldots, N-1\}$, as is usual in modern programming languages. Given the function
$$f_{ij}(a, b, c, d) = \exp\left( -\sum_{k=0}^{4} \left[ (a_{ik} + b_{jk})^2\, c_{ii} + d_{i+k}^3 \right] \right)$$
where the shapes of the arguments are $a \in \mathbb{R}^{3 \times 5}$, $b \in \mathbb{R}^{4 \times 5}$, $c \in \mathbb{R}^{3 \times 3}$ and $d \in \mathbb{R}^{8}$ and the shape of the function is $f \in \mathbb{R}^{3 \times 4}$, the derivation algorithm produces the following output:

Input:
f[i; j] = exp (-sum{k}_0^4 (((a[i; k] + b[j; k]) ** 2 * c[i; i] + d[i + k] ** 3)))

Derivative of f wrt. a:
da[da_0; da_1] = sum{da_z0}_0^3 (((-(df[da_0; da_z0] * exp (-sum{k}_0^4 (((a[da_0; k] + b[da_z0; k]) ** 2 * c[da_0; da_0] + d[da_0 + k] ** 3))))) * c[da_0; da_0] * 2 * (a[da_0; da_1] + b[da_z0; da_1]) ** (2 - 1)))

Derivative of f wrt. b:
db[db_0; db_1] = sum{db_z0}_0^2 (((-(df[db_z0; db_0] * exp (-sum{k}_0^4 (((a[db_z0; k] + b[db_0; k]) ** 2 * c[db_z0; db_z0] + d[db_z0 + k] ** 3))))) * c[db_z0; db_z0] * 2 * (a[db_z0; db_1] + b[db_0; db_1]) ** (2 - 1)))

Derivative of f wrt. c:
dc[dc_0; dc_1] = if {dc_0 + -dc_1 = 0} then (sum{dc_z1}_0^4 (sum{dc_z0}_0^3 (((a[dc_1; dc_z1] + b[dc_z0; dc_z1]) ** 2 * (-(df[dc_1; dc_z0] * exp (-sum{k}_0^4 (((a[dc_1; k] + b[dc_z0; k]) ** 2 * c[dc_1; dc_1] + d[dc_1 + k] ** 3))))))))) else (0)

Derivative of f wrt. d:
dd[dd_0] = sum{dd_z1}_(max [0; -2 + dd_0])^(min [4; dd_0]) (sum{dd_z0}_0^3 (((-(df[dd_0 + -dd_z1; dd_z0] * exp (-sum{k}_0^4 (((a[dd_0 + -dd_z1; k] + b[dd_z0; k]) ** 2 * c[dd_0 + -dd_z1; dd_0 + -dd_z1] + d[dd_0 + -dd_z1 + k] ** 3))))) * 3 * d[dd_0] ** (3 - 1))))

The operator ** denotes exponentiation in this output. The Kronecker delta has been encoded as an "if x then y else z" expression for efficiency.

Internally these expressions are represented as graphs, thus subexpressions occurring multiple times are only stored and evaluated once and no expression blowup as with symbolic differentiation occurs. To clean up the expressions generated by the automatic differentiation algorithm, an expression optimization step, which pre-evaluates constant parts of the expressions, should be incorporated. However, since this is not part of the core derivation problem, it has not been performed for this demonstration.

These derivative expressions have been verified by using random numeric values for the arguments and comparing the resulting values for the Jacobians with results from numeric differentiation.
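A minimal NumPy re-enactment of this check for the derivative w.r.t. $d$ is sketched below; the vectorized forward function and the loop realizing the generated expression are illustrative, not the thesis implementation:

import numpy as np

rng = np.random.default_rng(0)
a = 0.1 * rng.normal(size=(3, 5)); b = 0.1 * rng.normal(size=(4, 5))
c = 0.1 * rng.normal(size=(3, 3)); d = 0.1 * rng.normal(size=(8,))
df = rng.normal(size=(3, 4))          # incoming gradient dl/df

def f(d_):
    i = np.arange(3)[:, None, None]
    j = np.arange(4)[None, :, None]
    k = np.arange(5)[None, None, :]
    inner = (a[i, k] + b[j, k]) ** 2 * c[i, i] + d_[i + k] ** 3
    return np.exp(-inner.sum(axis=2))  # shape (3, 4)

# derivative w.r.t. d, following the generated expression for dd above
fv = f(d)
dd = np.zeros(8)
for i in range(3):
    for k in range(5):
        dd[i + k] += -(df[i] * fv[i]).sum() * 3 * d[i + k] ** 2

# compare with central finite differences of l = sum(df * f(d))
eps = 1e-6
for n in range(8):
    e = np.zeros(8); e[n] = eps
    num = (np.sum(df * f(d + e)) - np.sum(df * f(d - e))) / (2 * eps)
    assert abs(num - dd[n]) < 1e-6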

4.6 Discussion

We have presented a method to compute symbolic expressions for derivatives of any order of elementwise-defined tensor-valued functions. These functions may contain summations and the indices of their arguments can be arbitrary linear combinations of the function indices. The output of our algorithm is an explicit symbolic expression for each element of the derivative. Thus the resulting expressions are very well suited for massively parallel evaluation in a lock- and synchronization-free computational kernel, which runs on a massively parallel GPU and computes one element of the derivative per thread. No temporary memory is necessary for the evaluation of the derivatives. The derivatives themselves may contain additional summations over indices which have become free in the derivative. The output of the algorithm specifies the ranges of these sums as a maximum or minimum over a set of linear combinations of the derivative indices; therefore computing the numerical range at evaluation time costs only two matrix multiplications per loop run (not iteration).

The method presented in this chapter is employed together with a code generator for elementwise-defined functions within an efficient implementation of the models that will be introduced in chapter 5.

Chapter 5

Gaussian Process Neurons

The objective of this chapter is to introduce a neural activation function that can be learned completely from training data alongside the weights of the neural network. The number of constraints on the form of the learnable functions should be kept as low as possible to allow the highest amount of flexibility. However, we want the neural network to be trainable using stochastic gradient descent, thus we require the activation function to be at least continuously differentiable. Furthermore, it is a fundamental principle of statistics that by increasing the flexibility of a model the risk of overfitting is also increased. To keep that risk in check the model needs to be regularized. Here we apply the Bayesian approach to regularization by putting a prior over the space of activation functions. We choose a GP with zero mean and SE covariance function for that prior, since it encourages smooth functions of small magnitude. This probabilistic treatment transforms a neuron into a probabilistic unit, which we call the Gaussian process neuron (GPN). Consequently a neural network built from GPNs becomes a probabilistic graphical model, which can intrinsically handle training and test data afflicted with uncertainties and estimate the confidence of its own predictions.

This chapter is structured as follows. First, we introduce the basic GPN unit. Then we show how integrating it into a feed-forward neural network transforms the network into a probabilistic graphical model. Since a GP is a non-parametric model, it introduces dependencies between samples into the model. Consequently, even after learning the network weights, the predictions on a test input directly depend on the training samples and inference has to be performed using Monte Carlo methods. Since this is impractical, we introduce the parametric GPN, an auxiliary model that applies methods from sparse GP regression to represent its activation function using a set of trainable parameters. The resulting model approximation can be trained using stochastic gradient descent by sampling the gradient. Then we will combine ideas from fast Dropout and sparse variational GPs to make a non-parametric GPN network trainable using variational Bayesian inference. The result is a fully


deterministic loss function that eliminates the need to sample the gradient and thus makes training vastly more efficient. Although the model is treated fully probabilistically, the loss function retains the functional structure of a neural network, making it directly applicable to network architectures such as RNNs and CNNs.

Since the literature shows that monotonicity has proven favorable for the generalization ability of a neural network in some cases, we present methods to constrain GPNs to monotonic activation functions. Preliminary benchmark and experimental results on regression and classification tasks are reported at the end of the chapter. A chart showing the family of GPN models, their relationships and inference methods is presented in fig. 6.1. Furthermore, the relation to deep Gaussian processes, a model that likewise combines GPs in a probabilistic graphical model, is established in section 6.1.

Contributions to this Chapter

The GPN model and methods for its efficient training were proposed and designed by me. Implementation was performed by Marcus Basalla and me. Benchmarking and experiments were done by Marcus Basalla and published in (Basalla, 2017).

5.1 The Gaussian Process Neuron

A Gaussian process neuron (GPN) is a probabilistic unit that receives multiple inputs and computes an output distribution conditioned on the values of its inputs. Given multiple input samples the output samples of a GPN become correlated, i.e. the output distribution is not iid. over the samples. A probabilistic graphical model corresponding to an instance of a GPN for a fixed number of inputs and samples is shown in fig. 5.1.

The input to a GPN is a set of random variables denoted by $Y_{sm}$, $s \in \{1, \ldots, S\}$, $m \in \{1, \ldots, M\}$, where $s$ is the sample index and $m$ is the index of the input. The random variables $A_s$, $s \in \{1, \ldots, S\}$, are called activations and are given by the dot product of the inputs with the weight vector $w \in \mathbb{R}^M$. The weight vector is not probabilistic. Furthermore, we do not specify a bias term explicitly, since it can be absorbed into the weight matrix and input distribution, if necessary. The conditional distribution of $A_s$ can be written as¹

$$\mathrm{P}(A_s \mid Y_{s\star}) = \delta(A_s - w^T Y_{s\star}) \,, \quad s \in \{1, \ldots, S\} \,, \qquad (5.1)$$

¹ The star ($\star$) performs tensor slicing as described in section 2.1.

[Figure 5.1: A GPN with three inputs is shown for three samples as a probabilistic graphical model. The inputs are represented by the random variables $Y_{sm}$, $s \in \{1, 2, 3\}$, $m \in \{1, 2, 3\}$. The activations for each sample are represented by the random variables $A_s$, $s \in \{1, 2, 3\}$, and depend deterministically on the inputs. The responses $F_\star$ are a GP over the samples conditioned on the activations; in the figure the GP is represented by a Markov random field shown as an undirected connection between all samples. The outputs are represented by the random variables $X_s$, $s \in \{1, 2, 3\}$.]

where δ(t) is the delta distribution, defined by

$$\int_A^B \delta(t)\, \mathrm{d}t \triangleq \begin{cases} 1 & \text{if } A \le 0 \le B \\ 0 & \text{otherwise} \end{cases} \,. \qquad (5.2)$$

The random variables $F_s$, $s \in \{1, \ldots, S\}$, represent the responses of the GPN to the activations. The GPN does not assume a fixed activation function; instead it imposes a GP prior on the activation function, thus allowing it to be inferred from the data. Therefore, the response of sample $s$ is given by $F_s = F(A_s)$, where the distribution over the activation functions $F(A)$ is given by
$$F \mid A \sim \mathcal{GP}\big(0, k(A, A')\big) \,, \qquad (5.3)$$
where $k : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a valid covariance function for a one-dimensional GP, cf. section 2.4. Note that the GP introduces correlations between the responses of different samples; thus the responses of a GPN are not independent. The GPN uses the SE covariance function,

$$k(a, a') \triangleq \exp\big( -(a - a')^2 \big) \,, \qquad (5.4)$$

where we have omitted the lengthscale parameter, as it is redundant to a rescaling of the weights $w$ in (5.1). To make the conditional dependencies introduced by the GP clearer, it is helpful to explicitly consider a set of samples for eq. (5.3). By the definition of a GP, for a finite number of samples that equation is equivalent to

$$F_\star \mid A_\star \sim \mathcal{N}(0, K) \,, \qquad (5.5)$$

where $F_\star$ and $A_\star$ denote vectors of random variables, i.e. $F_\star = (F_1, F_2, \ldots, F_S)^T$ and $A_\star = (A_1, A_2, \ldots, A_S)^T$ respectively. The covariance matrix $K \in \mathbb{R}^{S \times S}$ is given by

$$K_{ss'} \triangleq k(A_s, A_{s'}) \,. \qquad (5.6)$$

Thus, as shown in fig. 5.1, the responses $F_s$, $s \in \{1, \ldots, S\}$, form a Markov random field with full connectivity between all samples. All samples of each GPN share the same Gaussian process and hence the same activation function.²

Each output is represented by the random variable $X_s$, $s \in \{1, \ldots, S\}$, which is a noisy version of the corresponding response $F_s$. Its conditional distribution is given by

$$X_s \mid F_s \sim \mathcal{N}(F_s, \sigma^2) \,, \quad s \in \{1, \ldots, S\} \,, \qquad (5.7)$$

where $\sigma \ge 0$ specifies the standard deviation of the superimposed Gaussian noise. This can be used to introduce additional noise into the output of a GPN. If this is not desired, $\sigma$ is set to zero and the output equals the response. While the superimposed noise is iid. for each sample, the outputs $X$ are not iid. due to the propagation of the correlation of the responses $F$ between the samples. This concludes the definition of a Gaussian process neuron.

Let us briefly compare a GPN unit to a conventional neuron in an artificial neural network, as introduced in section 2.5. A conventional neuron computes its output $x$ using the deterministic function $x = \sigma(w^T y)$, where $y$ is the vector of the neuron's inputs and $w$ are its weights. The term $a \triangleq w^T y$ is called the activation and the (non-linear) function $\sigma(a)$ is called the activation function. Both the GPN and the conventional neuron compute their activations in the same way, with the difference that, being a probabilistic model, the GPN deals with an activation

distribution $\mathrm{P}(A_s)$. Note that the weights $w$ are fully deterministic in the GPN model and the uncertainty in $A_s$ arises fully from the uncertainty in the inputs $Y_{sm}$. The GPN replaces the fixed activation function $\sigma(a)$ of the conventional neuron by a distribution over functions

² If $F_\star \mid A_\star$ did not form a Markov random field over the samples, then each sample would use an independent and random activation function and no generalization to new samples would be possible.

$\mathrm{P}(F \mid A)$. This distribution is defined by a GP over the activations. Because the SE covariance function is used, each activation function sampled from $\mathrm{P}(F \mid A)$ is infinitely differentiable and thus smooth. Since a GP is non-parametric, it introduces dependencies between the samples, i.e. each GPN uses one GP to represent the activation function of all samples. Consequently, a GPN has components of a parametric model, e.g. the weights, and components of a non-parametric model, e.g. the distribution of activation functions.

5.1.1 Marginal Distribution

The GPN has been specified in terms of conditional probabilities for the activations, responses and outputs. In order to obtain a more convenient conditional distribution for the outputs X given the inputs Y , we marginalize over the random variables internal to the GPN, i.e. the activations A and responses F . Using (5.1) and (5.5) we obtain

$$\mathrm{P}(F_\star \mid Y_\star) = \int \mathrm{P}(F_\star \mid A_\star) \left( \prod_{s=1}^{S} \mathrm{P}(A_s \mid Y_{s\star}) \right) \mathrm{d}A_\star = \int \mathcal{N}(F_\star \mid 0, K) \left( \prod_{s=1}^{S} \delta(A_s - w^T Y_{s\star}) \right) \mathrm{d}A_\star = \mathcal{N}(F_\star \mid 0, \widetilde{K}) \,, \qquad (5.8)$$

where $\widetilde{K}_{ss'} \triangleq \tilde{k}(Y_{s\star}, Y_{s'\star})$ and the covariance function $\tilde{k}(y, y')$ is given by

$$\tilde{k}(y, y') \triangleq k(w^T y, w^T y') = \exp\big( -[w^T y - w^T y']^2 \big) \,. \qquad (5.9)$$

This function can be interpreted as follows. From the geometric definition of the dot product, $w^T y = |w|\,|y| \cos\theta$, where $|\bullet|$ is the 2-norm and $\theta$ is the angle between $w$ and $y$, we see that $w^T y$ in (5.9) is the length of the projection of $y$ onto $w$, scaled by $|w|$. Consequently the GP defined by (5.8) encourages similar response values $F_s$ for samples whose inputs have comparable projection lengths onto $w$. The length $1/|w|$ can be interpreted as the lengthscale of the SE covariance function; thus the magnitude of the weight vector determines how much samples of the activation function vary around the zero mean function.

[Figure 5.2: A probabilistic graphical model for the conditional distribution (5.10) of a GPN having three inputs, shown for three samples. The thick line between the output samples $X_\star$ indicates that they are all dependent and form a (conditional) Markov random field. The parameters of a GPN are its weights $w$ and the variance $\sigma^2$.]

We perform another marginalization using (5.7) to obtain the output distribution

$$\mathrm{P}(X_\star \mid Y_\star) = \int \left( \prod_{s=1}^{S} \mathrm{P}(X_s \mid F_s) \right) \mathrm{P}(F_\star \mid Y_\star) \, \mathrm{d}F_\star = \int \left( \prod_{s=1}^{S} \mathcal{N}(X_s \mid F_s, \sigma^2) \right) \mathcal{N}(F_\star \mid 0, \widetilde{K}) \, \mathrm{d}F_\star = \mathcal{N}(X_\star \mid 0, \widetilde{K} + \sigma^2 \mathbb{1}) \,, \qquad (5.10)$$

with $\widetilde{K}$ defined as above and $\mathbb{1}$ being the identity matrix. The outputs of a GPN given its inputs are thus normally distributed over the samples with zero mean and an input-dependent covariance matrix. The equivalent graphical model is shown in fig. 5.2.
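The output distribution (5.10) is straightforward to sample. The following Python sketch (all names illustrative) draws one joint output sample for $S$ inputs:

import numpy as np

rng = np.random.default_rng(0)
S, M = 4, 3
Y = rng.normal(size=(S, M))        # inputs, one row per sample
w = rng.normal(size=M)             # deterministic weights
sigma = 0.1                        # output noise standard deviation

a = Y @ w                                         # activations, eq. (5.1)
Kt = np.exp(-(a[:, None] - a[None, :]) ** 2)      # K~ from eq. (5.9)
X = rng.multivariate_normal(np.zeros(S), Kt + sigma**2 * np.eye(S))  # eq. (5.10)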

5.1.2 Building Layers

A set of GPNs that share the same inputs is called a GPN layer. This definition is equivalent to the definition of a layer of neurons in conventional artificial neural networks. While combining conventional neurons into a layer is straightforward, in the case of GPNs care must be taken because of the additional inter-sample dependencies. GPN layers can be stacked to form a multi-layer model similar to a feed-forward neural network.

Let the GPN layer $l$ receive $S$ samples of $M$-dimensional input $Y^l = X^{l-1}$ from the previous layer $l-1$ and let it produce $N$-dimensional output $X^l$. The input dimension is indexed using $m \in \{1, \ldots, N_{l-1}\}$ and the output dimension is indexed by $n \in \{1, \ldots, N_l\}$. As before the samples are enumerated by $s \in \{1, \ldots, S\}$. An instance of a GPN layer is shown as a graphical model in fig. 5.3a. Let the input that the layer receives be denoted by $Y^l_{sm} = X^{l-1}_{sm}$, $s \in \{1, \ldots, S\}$, $m \in \{1, \ldots, M\}$. Note that we make no assumption on the distribution of the inputs

and they may be correlated within or between samples or both. The outputs are represented by the random variables $X^l_{sn}$, $s \in \{1, \ldots, S\}$, $n \in \{1, \ldots, N\}$, where the distribution of each GPN $X^l_{\star n}$, $n \in \{1, \ldots, N\}$, is given by (5.10). Thus, by combining the individual weights $w$ of the GPNs into the weight matrix $W^l \in \mathbb{R}^{N \times M}$ and the standard deviations $\sigma$ into the vector $\sigma^l \in \mathbb{R}^N$, we have

$$\mathrm{P}(X^l_{\star\star} \mid X^{l-1}_{\star\star}) = \prod_{n=1}^{N} \mathcal{N}\big( X^l_{\star n} \mid 0,\, \widetilde{K}^l_n + (\sigma^l_n)^2 \mathbb{1} \big) \quad \text{with} \quad \widetilde{K}^l_{n,ss'} \triangleq k\big( W^l_{n\star} X^{l-1}_{s\star},\, W^l_{n\star} X^{l-1}_{s'\star} \big) \,. \qquad (5.11)$$

[Figure 5.3: The conditional distribution of a layer of three GPNs with two inputs is shown for two samples as a probabilistic graphical model. Each GPN output $X^l_{\star n}$, $n \in \{1, 2, 3\}$, forms a Markov random field (represented by the thick undirected connection) that depends on all inputs $X^{l-1}_{\star\star}$ of all samples. The GPN layer is parameterized by the weight matrix $W^l$ and the standard deviation vector $\sigma^l$. (a) All GPNs and samples are shown as separate nodes. (b) All samples are consolidated into one node for each GPN. (c) All GPNs of a layer and all samples are consolidated into one node.]

For clarity of visualization the graphical model of a GPN layer can be condensed: In fig. 5.3a each node represents a particular input/output dimension and sample. In fig. 5.3b each node represents all samples, thus the Markov random field is not visible in the graphical model. This is the representation most similar to a layer of a conventional neural network. In fig. 5.3c the graphical model is reduced to one input node and one output node, both representing all input/output dimensions and samples.

5.2 Probabilistic Feed-Forward Networks

Similar to conventional feed-forward ANNs, layers of GPNs can be stacked to form probabilistic models for regression and classification tasks. The resulting models are a hybrid of parametric and non-parametric models, because their predictions depend on learned parameters (weights), but also directly on the training samples due to the inter-sample dependencies of GPs.

5.2.1 Regression

Given a set of $\widehat{S}$ training inputs $\widehat{X}^0 = \{\widehat{x}^0_1, \ldots, \widehat{x}^0_{\widehat{S}}\}$ with $\widehat{x}^0_s \in \mathbb{R}^{N_0}$ and corresponding targets $\widehat{Z} = \{\widehat{z}_1, \ldots, \widehat{z}_{\widehat{S}}\}$ with $\widehat{z}_s \in \mathbb{R}^{N_Z}$, the regression problem is to find a function $f : \mathbb{R}^{N_0} \to \mathbb{R}^{N_Z}$ that assigns a prediction $y = f(x^0)$ to an input $x^0$. The prediction function $f$ should minimize the loss $\mathcal{L}(\widehat{X}^0, \widehat{Z})$ with
$$\mathcal{L}(X, Z) \triangleq \frac{1}{S} \sum_{s=1}^{S} L\big(f(x^0_s), z_s\big) \,, \qquad (5.12)$$
where the loss measure $L : \mathbb{R}^{N_Z} \times \mathbb{R}^{N_Z} \to \mathbb{R}$ assigns a penalty $L(y, z)$ based on the difference between the prediction $f(x^0)$ and the target value $z$. A good prediction function must generalize well to previously unseen test inputs $\widetilde{X}^0 = \{\widetilde{x}^0_1, \ldots, \widetilde{x}^0_{\widetilde{S}}\}$ by correctly predicting their target values $\widetilde{Z} = \{\widetilde{z}_1, \ldots, \widetilde{z}_{\widetilde{S}}\}$ and thus having a small loss $\mathcal{L}(\widetilde{X}^0, \widetilde{Z})$ on the test set as well. A common loss measure for regression problems is the squared $L_2$-norm given by

$$L(y, z) = |y - z|^2 = \sum_d (y_d - z_d)^2 \,. \qquad (5.13)$$

It can be shown that this loss measure is proportional to the negative log-likelihood of z under a normal distribution with mean y and unit variance, i.e.

$$L(y, z) \propto -\log \mathcal{N}(z \mid y, \mathbb{1}) \,. \qquad (5.14)$$

Consequently, minimizing the loss given by eqs. (5.12) and (5.13) corresponds to finding the maximum likelihood solution of a probabilistic model that assumes a normal distribution for $\mathrm{P}(y \mid x^0)$.

Here, we use a stack of GPN layers to model the predictive distribution $\mathrm{P}(X^L \mid X^0)$. Like in a conventional feed-forward artificial neural network, the output of a GPN layer is fed as the input into the next GPN layer. The first layer receives the sample inputs $X^0$ and the last layer outputs the predictions $X^L$. A graphical model corresponding to a feed-forward GPN network is shown in fig. 5.4.

The random variable $\widehat{X}^l_{sn}$ denotes the $n$-th GPN in layer $l$ corresponding to training sample $s$. Test samples are represented by the random variables $\widetilde{X}^l_{sn}$. The layer $l = 0$ corresponds to the inputs and its values are observed for both the training and test samples. The top-most layer $l = L$ represents the outputs. Since its values are observed for the training samples, we have $\widehat{X}^L_{s\star} = \widehat{z}_s$, $s \in \{1, \ldots, \widehat{S}\}$. For defining the distributions occurring in the model, it is convenient

[Figure 5.4: A feed-forward network of GPNs for a regression task with two-dimensional input and one-dimensional output, shown as a probabilistic graphical model for two training samples and one test sample. The network consists of an input layer and three layers of GPNs, with the top-most layer representing the predictions of the model. All inputs and the training targets are observed (represented by filled circles). For performing regression the distribution over the test output $\widetilde{X}^3_{11}$ must be inferred. Panel (a) shows one random variable $X^l_{sn}$ per GPN and sample. For a more compact representation, in panel (b) each random variable represents all GPNs $X^l_{s\star}$ of a layer and sample.]

to concatenate the training and test samples of all inputs and GPNs, i.e.

" # Xbl Xl = ?n , l 0,...,L , n 1,...,N . (5.15) ?n l l Xe?n ∈ { } ∈ { }

The joint distribution of the GPN feed-forward network for regression is given by

$$\mathrm{P}(X^1_{\star\star}, \ldots, X^L_{\star\star} \mid X^0_{\star\star}) = \prod_{l=1}^{L} \mathrm{P}(X^l_{\star\star} \mid X^{l-1}_{\star\star}) \qquad (5.16)$$

where the conditional layer probabilities $\mathrm{P}(X^l_{\star\star} \mid X^{l-1}_{\star\star})$ are given by (5.11).

5.2.2 Training and Inference

We have shown how to build a regression model using GPN layers. It is also possible to build a classification model using GPNs, by using the same structure as for regression, feeding the network output distribution through the softmax function (2.92) and interpreting the result as class probabilities of a categorical distribution (2.93).

As we see, GPN models for regression and classification are similar to conventional ANN models. Both contain parameters (weights) that must be fitted on the training set. However, the GPN models also introduce dependencies between samples, which does not occur in conventional ANNs, where the predictions for all samples are independently and identically distributed. Consequently, GPN models as presented above are a hybrid between parametric and non-parametric models and their predictions on test samples depend directly on the training samples. Later we will show how a GPN model can be approximated by a fully parameterized model, thus dropping the direct dependency on the training samples and making its predictions independently and identically distributed.

For now, however, we stay with the exact model and show how to obtain predictions on a test set. Naturally, predictions on the test set follow the distribution $\mathrm{P}_\theta(\widetilde{X}^L \mid \widetilde{X}^0, \widehat{X}^0, \widehat{X}^L)$, where we write $\mathrm{P}_\theta$ to express that this distribution depends on the model parameters $\theta \triangleq \{\theta^1, \ldots, \theta^L\}$, consisting of the layer parameters $\theta^l \triangleq \{W^l, \sigma^l\}$. To obtain these predictions, two steps are necessary.

First, the parameters $\theta$ of the model need to be estimated from the training data. A possible way to handle these parameters is to treat them as additional random variables and marginalize them out. To do so it would become necessary to introduce a prior for each of these parameters. While this is sound, it has the drawback that making the parameters uncertain introduces additional stochasticity that is redundant to that of the probabilistic activation function of each GPN. Also, it would introduce complexity into the model which we wish to avoid at this point. Thus we keep all parameters $\theta$ deterministic and find a point estimate for their best values by maximizing their likelihood $\mathrm{P}_\theta(\widehat{X}^L \mid \widehat{X}^0)$ under the training set. As we will see, this likelihood is intractable, since it involves marginalization over the inner layers $l \in \{1, \ldots, L-1\}$, which cannot be done analytically due to the occurrence of integrals that cannot be computed. However, it is possible to approximate the likelihood function and its derivatives w.r.t. $\theta$ by MCMC methods.

Second, the predictive distribution $\mathrm{P}_\theta(\widetilde{X}^L \mid \widetilde{X}^0, \widehat{X}^0, \widehat{X}^L)$ needs to be calculated. Unfortunately, it is also intractable for the same reason that the likelihood is intractable. Thus, to obtain predictions on the test set, we have to employ MCMC methods here as well. In particular, we will sample training values $\widehat{X}^l$ of the inner GPN layers from $\mathrm{P}_\theta(\widehat{X}^1, \ldots, \widehat{X}^{L-1} \mid \widehat{X}^0, \widehat{X}^L)$ and, equipped with these samples, we can sample the test values $\widetilde{X}^l$ of each layer conditioned on the training values $\widehat{X}^l$ from the distributions $\mathrm{P}_\theta(\widetilde{X}^l \mid \widetilde{X}^{l-1}, \widehat{X}^l, \widehat{X}^{l-1})$. We will show that this method produces unbiased samples from the predictive distribution.

Both steps and the employed sampling methods are explained in detail in the following sections.

Fitting the Model Parameters

To find a good point estimate for the model parameters $\theta$, we calculate their likelihood under the training data and maximize it w.r.t. the parameters. For the regression task the likelihood of the parameters on the training set is given by
$$\mathcal{L}(\theta) \triangleq \mathrm{P}_\theta(\widehat{X}^L \mid \widehat{X}^0) = \int \cdots \int \mathrm{P}_\theta(\widehat{X}^1, \ldots, \widehat{X}^L \mid \widehat{X}^0) \, \mathrm{d}\widehat{X}^1 \cdots \mathrm{d}\widehat{X}^{L-1} = \int \cdots \int \left( \prod_{l=1}^{L} \mathrm{P}_{\theta^l}(\widehat{X}^l \mid \widehat{X}^{l-1}) \right) \mathrm{d}\widehat{X}^1 \cdots \mathrm{d}\widehat{X}^{L-1} \,. \qquad (5.17)$$

with $\mathrm{P}_{\theta^l}(\widehat{X}^l \mid \widehat{X}^{l-1})$ given by (5.11). Applying the logarithm does not result in a simplified expression for the log-likelihood, since the marginalization over the latent variables prevents the likelihood from factorizing into a product over the training samples. The marginalizations in eq. (5.17) cannot be performed analytically, since $\widehat{X}^{l-1}$ enters the PDF through the highly non-linear inverse of the covariance matrix given by (5.11). Thus an analytic maximization of the likelihood is impossible and we will use an iterative optimization procedure, for example the gradient descent introduced in section 2.3.1, to maximize the likelihood. For that, it is necessary to compute the gradients of $\mathcal{L}$ w.r.t. the parameters $\theta$.

Using the form (2.28) for the PDF of the multivariate normal distribution, we obtain for the derivative w.r.t. a parameter $\varphi^l \in \{W^l, \sigma^l\}$ of the $l$-th layer:
$$\frac{\partial \mathcal{L}}{\partial \varphi^l} = \int \cdots \int \left( \prod_{l'=1}^{L} \mathrm{P}_{\theta^{l'}}(\widehat{X}^{l'} \mid \widehat{X}^{l'-1}) \right) \rho'_{\theta^l}(\widehat{X}^l \mid \widehat{X}^{l-1}) \, \mathrm{d}\widehat{X}^1 \cdots \mathrm{d}\widehat{X}^{L-1} \qquad (5.18)$$

where $\rho_{\theta^l}(\widehat{X}^l \mid \widehat{X}^{l-1}) \triangleq \log \mathrm{P}_{\theta^l}(\widehat{X}^l \mid \widehat{X}^{l-1})$ and $\rho'_{\theta^l} \triangleq \partial \rho_{\theta^l} / \partial \varphi^l$. Consequently the derivatives can be represented as expectations under the joint distribution of the feed-forward GPN network,

$$\frac{\partial \mathcal{L}}{\partial \varphi^l} = \mathbb{E}_{\mathrm{P}_\theta(\widehat{X}^1, \ldots, \widehat{X}^{L-1} \mid \widehat{X}^0)} \Big[ \rho'_{\theta^l}(\widehat{X}^l \mid \widehat{X}^{l-1})\, \mathrm{P}_{\theta^L}(\widehat{X}^L \mid \widehat{X}^{L-1}) \Big] \,. \qquad (5.19)$$

This expectation can be approximated by sampling from $\mathrm{P}_\theta(\widehat{X}^1, \ldots, \widehat{X}^{L-1} \mid \widehat{X}^0)$ using MCMC methods. Since by definition of the GPN feed-forward network (5.16) the GPN layers form a Markov chain, unbiased samples can be obtained by sequentially sampling from the conditional normal densities $\mathrm{P}(\widehat{X}^1 \mid \widehat{X}^0)$, followed by $\mathrm{P}(\widehat{X}^2 \mid \widehat{X}^1)$ and so on up to $\mathrm{P}(\widehat{X}^{L-1} \mid \widehat{X}^{L-2})$. Sampling these conditionals is straightforward since their distributions are multivariate normals. Note that this uses the whole training set to estimate the gradient w.r.t. the parameters; splitting the training set into mini-batches is not possible at this point, as this would neglect the dependencies the model introduces between the samples.
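The layer-wise ancestral sampling just described can be sketched as follows (illustrative Python; `W` and `sigma` are assumed to hold the per-layer parameters $W^l$ and $\sigma^l$):

import numpy as np

def sample_network(X0, W, sigma, rng):
    """Draw one joint sample of all layers given inputs X0 of shape (S, N_0),
    by ancestral sampling along the Markov chain of layers (eq. 5.16)."""
    X = X0
    S = X0.shape[0]
    for Wl, sl in zip(W, sigma):                 # layers l = 1, ..., L
        Xnext = np.empty((S, Wl.shape[0]))
        for n in range(Wl.shape[0]):             # one GP per GPN, eq. (5.11)
            a = X @ Wl[n]                        # activations of GPN n
            K = np.exp(-(a[:, None] - a[None, :]) ** 2) + sl[n] ** 2 * np.eye(S)
            Xnext[:, n] = rng.multivariate_normal(np.zeros(S), K)
        X = Xnext
    return X

rng = np.random.default_rng(0)
W = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]   # two GPN layers
sigma = [np.full(5, 0.1), np.full(1, 0.1)]
X_L = sample_network(rng.normal(size=(10, 3)), W, sigma, rng)  # shape (10, 1)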

Although layer-wise sampling of $\hat{X}^l$ to evaluate (5.19) correctly generates unbiased samples of the derivatives, it has the crucial drawback that for the majority of samples the factor $P_{\theta^L}(\hat{X}^L \mid \hat{X}^{L-1})$ is nearly zero, since the sampled values for the inner layers are conditioned only on the inputs $\hat{X}^0$ and not on the target values $\hat{X}^L$. Consequently, while this training procedure is sound, convergence to meaningful parameter values would be very slow.

The described problem can be avoided by observing that the joint density of the model can be decomposed into $P(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0) = P(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L) \, P(\hat{X}^L \mid \hat{X}^0)$. Using this alternative decomposition of the joint density, the derivative (5.18) can be rewritten as

$$\frac{\partial \mathcal{L}}{\partial \varphi^l} = \int \cdots \int P_\theta(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0) \, \rho'_{\theta^l}(\hat{X}^l \mid \hat{X}^{l-1}) \, d\hat{X}^1 \cdots d\hat{X}^{L-1}$$
$$= P_\theta(\hat{X}^L \mid \hat{X}^0) \int \cdots \int P_\theta(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L) \, \rho'_{\theta^l}(\hat{X}^l \mid \hat{X}^{l-1}) \, d\hat{X}^1 \cdots d\hat{X}^{L-1}$$
$$= \mathcal{L}(\theta) \; \mathbb{E}_{P_\theta(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L)} \left[ \rho'_{\theta^l}(\hat{X}^l \mid \hat{X}^{l-1}) \right]. \tag{5.20}$$

As stated before, analytic calculation of the likelihood $\mathcal{L}(\theta)$ is intractable. However, calculating its actual value is unnecessary, since $\mathcal{L}(\theta)$ has the same value no matter which parameter $\varphi^l$ the derivative is taken with respect to. Thus, by setting $\mathcal{L}(\theta) \triangleq 1$ while evaluating (5.20) for all parameters θ, we obtain a scaled version of the true gradient $\nabla_\theta \mathcal{L}$. Since $\mathcal{L}(\theta)$ is a probability density, the scaling factor is always non-negative and the scaled gradient points in the same direction as the true gradient. Maximization of $\mathcal{L}$ w.r.t. θ can then be performed using the gradient ascent algorithm. It does not matter that the obtained gradient is scaled, since only its direction is important, cf. section 2.3.1.

Contrary to (5.19), the expectation in (5.20) samples from the distribution $P(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L)$, thus avoiding the problem of vanishing probabilities. However, due to the additional dependency on the training targets $\hat{X}^L$, this distribution cannot be sampled from using a Markov chain over the layers. A method to sample from the GPN network, taking into account the observed targets, will be presented in the next section.

It remains to calculate the actual derivatives of the log conditional densities $\rho_{\theta^l}(\hat{X}^l \mid \hat{X}^{l-1})$ using (5.11). They are given by

$$\rho'_{\theta^l} \triangleq \frac{\partial \rho_{\theta^l}}{\partial \varphi^l} = \frac{1}{2} \sum_{n=1}^{N_l} \left[ (\hat{X}^l_{\star n})^T (\tilde{K}^l_n)^{-1} \frac{\partial \tilde{K}^l_n}{\partial \varphi^l} (\tilde{K}^l_n)^{-1} \hat{X}^l_{\star n} - \operatorname{tr}\left( (\tilde{K}^l_n)^{-1} \frac{\partial \tilde{K}^l_n}{\partial \varphi^l} \right) \right], \tag{5.21}$$

where the derivatives of the covariance matrix w.r.t. the weights and variances are given by

$$\frac{\partial \tilde{K}^l_{n,ss'}}{\partial W^l_{n\hat{m}}} = -2 \, (\hat{X}^{l-1}_{s\hat{m}} - \hat{X}^{l-1}_{s'\hat{m}}) \, S^l_{n,ss'} \exp\left( -(S^l_{n,ss'})^2 \right) \quad \text{with} \quad S^l_{n,ss'} \triangleq \sum_{m=1}^{N_{l-1}} W^l_{nm} (\hat{X}^{l-1}_{sm} - \hat{X}^{l-1}_{s'm}), \tag{5.22a}$$
$$\frac{\partial \tilde{K}^l_{n,ss'}}{\partial \sigma^l_n} = 2 \, \sigma^l_n \, \delta_{ss'}. \tag{5.22b}$$

Note that we do not optimize the parameters by first finding some point estimate of values $\tilde{X}^1, \dots, \tilde{X}^{L-1}$ for the inner layers and then optimizing the likelihood assuming fixed values for the latent variables. Instead, we use the distribution over the inner GPN layers for the calculation of the likelihood, which leads to more robust estimates of the GPN parameters and an inherent resilience to overfitting due to the probabilistic nature of the GPN model. Before a complete training algorithm can be specified, we need to develop a method to sample from $P(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L)$, which will be done in the following section.

Sampling from GPN Feed-Forward Networks

Consider the distribution of a GPN feed-forward network as given by (5.16). If only the inputs $X^0$ are observed then, as stated before, sampling from the distribution is straightforward. Since the distribution of each layer is defined conditioned on the values of the previous layer by (5.11), we can sequentially sample from the conditional distributions $P(X^l \mid X^{l-1})$ for $l \in \{1, 2, \dots, L\}$. Each conditional distribution is a multivariate normal distribution, thus efficient samplers are readily available. However, in the previous section the need arose to sample from a GPN feed-forward network having observed the inputs and the targets, i.e. we need to sample from the distribution

$$P(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L) = \frac{1}{Z} \, P(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0) \tag{5.23}$$

with the partition function $Z = P(\hat{X}^L \mid \hat{X}^0)$ being independent of the inner layers $\hat{X}^l$, $l \in \{1, \dots, L-1\}$. The sequential sampling procedure is no longer applicable, since no analytically tractable density function exists for the conditional $P(X^l \mid X^{l-1}, X^L)$. Under these circumstances we have to use a more elaborate sampling method. We consider the following, well-known MCMC sampling methods introduced in detail in section 2.2.13.

Gibbs sampling splits the random variables into multiple partitions and alternately samples from each partition conditioned on the values of the other partitions. It can be shown that this procedure draws samples from the joint distribution of all random variables if run for enough steps. Observed variables can be introduced by keeping their values fixed at the observed value. In the case of a GPN feed-forward network it is natural to partition the random variables per layer, i.e. each $X^l$ for $l \in \{1, \dots, L\}$ is a separate partition. Then Gibbs sampling consists of sampling from each conditional $P(X^l \mid X^{l-1}, X^{l+1})$ in a round-robin sequence over the layers $l \in \{1, 2, \dots, L\}$. Unfortunately, in the conditional density function $P(X^l \mid X^{l-1}, X^{l+1}) \propto P(X^{l-1}, X^l, X^{l+1})$ the random variable $X^l$ occurs inside the covariance matrix of the PDF of the normal distribution, thus introducing highly non-linear functional dependencies for which no efficient sampling method is available.

Metropolis-Hastings moves within the state space of a probabilistic model by drawing candidate moves from a proposal distribution and accepting or rejecting them based on the change of joint probability that would occur if the move were executed. As with Gibbs sampling, observed random variables can be introduced by keeping their values fixed. The algorithm requires two things: a function that is proportional to the density to sample from, and a distribution that proposes new potential states given the current state of the random variables. It is therefore possible to work with unnormalized probability densities, allowing us to sample from (5.23) without calculating the partition function Z. However, to work efficiently the Metropolis-Hastings algorithm requires a proposal distribution that is well-tailored to the joint distribution of the model.

Hamiltonian Monte-Carlo sampling describes a method to calculate good proposal distributions for the Metropolis-Hastings algorithm. In addition to a function that is proportional to the model's PDF, it requires the derivatives of this function w.r.t. all random variables. The gradient of the probability density is used to propose states of higher probability compared to the current state; thus the proposed states have a high acceptance probability, making HMC work efficiently even in a high-dimensional state space. Since the derivatives of (5.23) can be calculated in closed form, HMC is a good choice for sampling from a GPN feed-forward network and we will now describe its application.

The HMC potential energy corresponding to the joint distribution $P(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0)$ given by eqs. (5.11) and (5.16) is

$$U(X^1, \dots, X^L) = -\log P(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0) = -\sum_{l=1}^{L} \log P(X^l \mid X^{l-1}) \propto -\sum_{l=1}^{L} \sum_{n=1}^{N_l} \mathcal{L}_\mathcal{N}(X^l_{\star n} \mid 0, \tilde{K}^l_n) \tag{5.24}$$

where $\mathcal{L}_\mathcal{N}$ is the unnormalized log-density of the multivariate normal distribution given by

$$\mathcal{L}_\mathcal{N}(x \mid \mu, \Sigma) \triangleq -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu). \tag{5.25}$$

The HMC algorithm requires the derivatives of the potential energy w.r.t. the layer values $X^l_{sn}$. For the computation of these derivatives, let us split up the potential energy layer-wise, so that we obtain

$$U(X^1, \dots, X^L) = \sum_{l=1}^{L} U^l(X^l, X^{l-1}) \tag{5.26}$$

with

$$U^l(X^l, X^{l-1}) \triangleq -\sum_{n=1}^{N_l} \mathcal{L}_\mathcal{N}(X^l_{\star n} \mid 0, \tilde{K}^l_n). \tag{5.27}$$

The layer-wise gradients are straightforward to compute. For each layer $l$, we obtain

$$\frac{\partial U^l}{\partial X^l_{\star n}} = (\tilde{K}^l_n)^{-1} X^l_{\star n}, \tag{5.28a}$$
$$\frac{\partial U^l}{\partial X^{l-1}_{\hat{s}\hat{m}}} = -\frac{1}{2} \sum_{n=1}^{N_l} (X^l_{\star n})^T (\tilde{K}^l_n)^{-1} \frac{\partial \tilde{K}^l_n}{\partial X^{l-1}_{\hat{s}\hat{m}}} (\tilde{K}^l_n)^{-1} X^l_{\star n}, \tag{5.28b}$$

where we have defined

$$\frac{\partial \tilde{K}^l_{n,ss'}}{\partial X^{l-1}_{\hat{s}\hat{m}}} \triangleq \hat{K}^l_{n\hat{m},ss'} \, (\delta_{s\hat{s}} - \delta_{s'\hat{s}}), \tag{5.29a}$$
$$\hat{K}^l_{n\hat{m},ss'} \triangleq -2 \, W^l_{n\hat{m}} \exp\left( -\Big( \sum_{m=1}^{N_{l-1}} W^l_{nm} (X^{l-1}_{sm} - X^{l-1}_{s'm}) \Big)^2 \right) \sum_{m=1}^{N_{l-1}} W^l_{nm} (X^{l-1}_{sm} - X^{l-1}_{s'm}). \tag{5.29b}$$

Evaluating the Kronecker deltas in (5.29a) allows us to rewrite (5.28b) as

$$\frac{\partial U^l}{\partial X^{l-1}_{\hat{s}\hat{m}}} = -\sum_{n=1}^{N_l} \sum_{s=1}^{S} \hat{K}^l_{n\hat{m},s\hat{s}} \, z^l_{n,\hat{s}} \, z^l_{n,s} \quad \text{with} \quad z^l_n \triangleq (\tilde{K}^l_n)^{-1} X^l_{\star n}. \tag{5.30}$$

This completes the calculation of the derivatives and we can now apply HMC sampling with the leapfrog method as described in section 2.2.13. If the target values $X^L$ are observed, as in (5.23), they are kept fixed at their observed values during HMC sampling. In conclusion, we can now evaluate the expectation in (5.20) and optimize the likelihood function $\mathcal{L}(\theta)$ iteratively towards a maximum-likelihood estimate of the model parameters θ. The complete training procedure is described in algorithm 12.

Algorithm 12: Parameter estimation for a GPN feed-forward regression network
Input: training inputs $\hat{X}^0$ and corresponding targets $\hat{X}^L$
Output: local parameter optimum θ maximizing $\mathcal{L}(\theta)$
Parameters: learning rate η; number of gradient estimation samples R; initial variance $\sigma_{\text{init}}$

1: $\forall l, n, \hat{m}$: sample $W^l_{n\hat{m}} \sim \mathcal{N}(0, \sigma^2_{\text{init}})$   // random initialization of parameters
2: $\forall l, n$: sample $\sigma^l_n \sim \mathcal{U}(0, \sigma_{\text{init}})$
3: while $\mathcal{L}(\theta)$ increases do   // training loop
4:   $\forall l, n, \hat{m}$: $\Delta W^l_{n\hat{m}} \leftarrow 0$   // scaled gradient estimate using R gradient samples
5:   $\forall l, n$: $\Delta\sigma^l_n \leftarrow 0$
6:   for $r \in \{1, 2, \dots, R\}$ do
7:     sample $\{\hat{X}^1, \dots, \hat{X}^{L-1}\}$ with HMC using the potential energy (5.24) and its derivatives given by eqs. (5.28a) and (5.30), keeping $\hat{X}^L$ fixed to the targets
8:     $\forall l, n, \hat{m}$: $\Delta W^l_{n\hat{m}} \leftarrow \Delta W^l_{n\hat{m}} + \partial\rho_{\theta^l} / \partial W^l_{n\hat{m}}$   // eqs. (5.21) and (5.22a)
9:     $\forall l, n$: $\Delta\sigma^l_n \leftarrow \Delta\sigma^l_n + \partial\rho_{\theta^l} / \partial\sigma^l_n$   // eqs. (5.21) and (5.22b)
10:  $\omega \leftarrow \|\Delta W\|^2 + \|\Delta\sigma\|^2$   // use Frobenius norm
11:  $\forall l, n, \hat{m}$: $W^l_{n\hat{m}} \leftarrow W^l_{n\hat{m}} + \frac{\eta}{\sqrt{\omega}} \Delta W^l_{n\hat{m}}$   // perform parameter updates using gradient ascent
12:  $\forall l, n$: $\sigma^l_n \leftarrow \sigma^l_n + \frac{\eta}{\sqrt{\omega}} \Delta\sigma^l_n$

For subsequent evaluations of (5.20) it is beneficial to preserve the state of the random variables from evaluation to evaluation, since the state from the previous evaluation still represents an area of high probability even after a small change of the parameters, and thus the burn-in period of the HMC algorithm can be reduced significantly.
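For illustration, the HMC transition used in line 7 of algorithm 12 can be sketched as follows. This is a generic leapfrog step under the assumption of an identity mass matrix; U and grad_U stand for the potential (5.24) and the gradients (5.28a)/(5.30) evaluated with $\hat{X}^L$ held fixed, and the step size and trajectory length are illustrative choices, not values from this thesis.

```python
import numpy as np

def hmc_step(x, U, grad_U, eps=1e-2, n_leapfrog=20, rng=np.random.default_rng(0)):
    # One Metropolis-corrected leapfrog trajectory; `x` is the flattened state
    # of the inner layers X^1, ..., X^{L-1}
    p = rng.standard_normal(x.shape)                   # resample momenta
    x_new = x.copy()
    p_new = p - 0.5 * eps * grad_U(x)                  # initial half step
    for i in range(n_leapfrog):
        x_new = x_new + eps * p_new                    # full position step
        step = eps if i < n_leapfrog - 1 else 0.5 * eps
        p_new = p_new - step * grad_U(x_new)           # momentum step (half at end)
    # accept or reject based on the change of total energy
    dH = (U(x_new) + 0.5 * p_new @ p_new) - (U(x) + 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < -dH else x
```

Warm-starting, as discussed above, simply means feeding the returned state back in as `x` at the next gradient evaluation instead of re-initializing the chain.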

Care must be taken with the learning rate η. In standard gradient ascent the magnitude of the gradient vector goes to zero as a local optimum is approached, and thus training proceeds using smaller steps near the optimum. However, since our gradient is scaled by an arbitrary factor, this property does not hold. Thus we have to explicitly monitor the trend of the likelihood $\mathcal{L}(\theta)$ and decrease the learning rate η when the likelihood starts to oscillate near the optimum.

Obtaining Predictions on the Test Set

After fitting the model parameters θ we want to obtain predictions $\tilde{X}^L$ for previously unseen test inputs $\tilde{X}^0$ from the GPN feed-forward network. This means sampling from the distribution

$$P_\theta(\tilde{X}^L \mid \tilde{X}^0, \hat{X}^0, \hat{X}^L) = \frac{P_\theta(\hat{X}^0, \tilde{X}^0, \hat{X}^L, \tilde{X}^L)}{P_\theta(\hat{X}^0, \tilde{X}^0, \hat{X}^L)},$$

which cannot be calculated analytically due to the required marginalizations of (5.16) over the inner layer values $\hat{X}^l$ and $\tilde{X}^l$, $l \in \{1, \dots, L-1\}$, which are intractable. The situation is shown in fig. 5.5a for a GPN feed-forward network with three layers. The left side shows the random variables associated with the training set, with the inputs and outputs being observed. The right side shows the random variables associated with the test set, with only the inputs being observed and the outputs to be inferred. Training and test samples share the same Markov random field within each layer. Note that for clarity each node in this figure represents all GPNs and all samples of the respective partition within a layer.

We will now show how predictions can be obtained without having access to an explicit form of the predictive distribution. By expanding (5.11) over the training and test samples explicitly, we get the conditional GPN layer distribution in the form

$$P(\hat{X}^l, \tilde{X}^l \mid \hat{X}^{l-1}, \tilde{X}^{l-1}) = \prod_{n=1}^{N_l} P(\hat{X}^l_{\star n}, \tilde{X}^l_{\star n} \mid \hat{X}^{l-1}, \tilde{X}^{l-1}) \tag{5.31}$$

with the GPN layer normal distribution being defined jointly over training and test samples,

$$P(\hat{X}^l_{\star n}, \tilde{X}^l_{\star n} \mid \hat{X}^{l-1}, \tilde{X}^{l-1}) = \mathcal{N}\left( \begin{bmatrix} \hat{X}^l_{\star n} \\ \tilde{X}^l_{\star n} \end{bmatrix} \,\middle|\, 0, \begin{bmatrix} \Sigma^l_{n,11} & \Sigma^l_{n,12} \\ \Sigma^l_{n,21} & \Sigma^l_{n,22} \end{bmatrix} \right), \tag{5.32}$$

with the blocks of the covariance matrix given by

$$\Sigma^l_{n,11,\hat{s}\hat{s}'} \triangleq k(W^l_{n\star} \hat{X}^{l-1}_{\hat{s}\star}, W^l_{n\star} \hat{X}^{l-1}_{\hat{s}'\star}) + (\sigma^l_n)^2 \delta_{\hat{s}\hat{s}'}, \tag{5.33a}$$
$$\Sigma^l_{n,12,\hat{s}\tilde{s}} \triangleq k(W^l_{n\star} \hat{X}^{l-1}_{\hat{s}\star}, W^l_{n\star} \tilde{X}^{l-1}_{\tilde{s}\star}), \tag{5.33b}$$
$$\Sigma^l_{n,21,\tilde{s}\hat{s}} \triangleq k(W^l_{n\star} \tilde{X}^{l-1}_{\tilde{s}\star}, W^l_{n\star} \hat{X}^{l-1}_{\hat{s}\star}), \tag{5.33c}$$
$$\Sigma^l_{n,22,\tilde{s}\tilde{s}'} \triangleq k(W^l_{n\star} \tilde{X}^{l-1}_{\tilde{s}\star}, W^l_{n\star} \tilde{X}^{l-1}_{\tilde{s}'\star}) + (\sigma^l_n)^2 \delta_{\tilde{s}\tilde{s}'}. \tag{5.33d}$$

We decompose the GPN distribution (5.32) into

$$P(\hat{X}^l_{\star n}, \tilde{X}^l_{\star n} \mid \hat{X}^{l-1}, \tilde{X}^{l-1}) = P(\hat{X}^l_{\star n} \mid \hat{X}^{l-1}) \; P(\tilde{X}^l_{\star n} \mid \hat{X}^l_{\star n}, \hat{X}^{l-1}, \tilde{X}^{l-1}) \tag{5.34}$$


Figure 5.5: Obtaining predictions from a GPN feed-forward network. Observed values are shown as filled nodes. (a) The graphical model for a feed-forward GPN with separate nodes for training and test samples. The layer-wise conditional probabilities are specified jointly over the training and test set. (b) The undirected connections between training and test samples have been replaced by directed connections by using a different factorization of the joint probability distribution of the model shown in (a). Thus (a) and (b) represent the same probability distribution. Since all nodes in (b) representing training samples are d-separated from the observed test inputs $\tilde{X}^0$, the test part (on the right side) can be ignored while sampling the training part $\hat{X}^l$, $l \in \{1, \dots, L-1\}$ (on the left side), and then the test part can be inferred conditioned on sampled values of the training part.

and using the marginalization and conditioning properties of the multivariate normal distribution, cf. section 2.2.9, we obtain the following factors,

$$P(\hat{X}^l_{\star n} \mid \hat{X}^{l-1}) = \mathcal{N}(\hat{X}^l_{\star n} \mid 0, \Sigma^l_{n,11}), \tag{5.35a}$$
$$P(\tilde{X}^l_{\star n} \mid \hat{X}^l_{\star n}, \hat{X}^{l-1}, \tilde{X}^{l-1}) = \mathcal{N}\big( \tilde{X}^l_{\star n} \mid \Sigma^l_{n,21} (\Sigma^l_{n,11})^{-1} \hat{X}^l_{\star n},\; \Sigma^l_{n,22} - \Sigma^l_{n,21} (\Sigma^l_{n,11})^{-1} \Sigma^l_{n,12} \big). \tag{5.35b}$$
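The conditional (5.35b) is the standard conditioning of a zero-mean joint Gaussian on its first block. A minimal sketch of this computation, with names of our own choosing:

```python
import numpy as np

def condition_gaussian(S11, S12, S21, S22, x_train):
    # Mean and covariance of the test block given observed values x_train of
    # the training block, cf. eq. (5.35b)
    a = np.linalg.solve(S11, x_train)      # (Sigma_11)^{-1} x-hat
    B = np.linalg.solve(S11, S12)          # (Sigma_11)^{-1} Sigma_12
    mean = S21 @ a
    cov = S22 - S21 @ B
    return mean, cov
```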

This allows us to rewrite the joint distribution over training and test samples of the GPN feed-forward network (5.16) as

$$P(\hat{X}^1, \tilde{X}^1, \dots, \hat{X}^L, \tilde{X}^L \mid \hat{X}^0, \tilde{X}^0) = P(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0) \; P(\tilde{X}^1, \dots, \tilde{X}^L \mid \tilde{X}^0, \hat{X}^0, \dots, \hat{X}^L) \tag{5.36}$$

Algorithm 13: Obtaining predictions from a GPN feed-forward regression network
Input: test inputs $\tilde{X}^0$; training inputs $\hat{X}^0$ and corresponding targets $\hat{X}^L$; fitted model parameters θ
Output: test predictions $\tilde{X}^L$

1: sample $\{\hat{X}^1, \dots, \hat{X}^{L-1}\}$ with HMC using the potential energy (5.24) and its derivatives given by eqs. (5.28a) and (5.30), keeping $\hat{X}^L$ fixed to the targets
2: for $l \in \{1, 2, \dots, L\}$ do
3:   sample $\tilde{X}^l$ from $P(\tilde{X}^l \mid \hat{X}^l, \hat{X}^{l-1}, \tilde{X}^{l-1})$ given by eq. (5.35b)

with the factors

$$P(\hat{X}^1, \dots, \hat{X}^L \mid \hat{X}^0) = \prod_{l=1}^{L} \prod_{n=1}^{N_l} P(\hat{X}^l_{\star n} \mid \hat{X}^{l-1}), \tag{5.37a}$$
$$P(\tilde{X}^1, \dots, \tilde{X}^L \mid \tilde{X}^0, \hat{X}^0, \dots, \hat{X}^L) = \prod_{l=1}^{L} \prod_{n=1}^{N_l} P(\tilde{X}^l_{\star n} \mid \hat{X}^l_{\star n}, \hat{X}^{l-1}, \tilde{X}^{l-1}), \tag{5.37b}$$

and thus we obtain the probabilistic graphical model shown in fig. 5.5b. Both fig. 5.5a and fig. 5.5b represent the same joint distribution, but with different expansions of the distribution. In fig. 5.5b the alternative decomposition (5.34) replaced the undirected connection between training and test samples in each layer with a directed connection from training to test samples.

From (5.36) it follows that the GPN feed-forward network distribution with the training targets $\hat{X}^L$ observed is given by

$$P(\hat{X}^1, \tilde{X}^1, \dots, \hat{X}^{L-1}, \tilde{X}^{L-1}, \tilde{X}^L \mid \hat{X}^0, \tilde{X}^0, \hat{X}^L) = P(\hat{X}^1, \dots, \hat{X}^{L-1} \mid \hat{X}^0, \hat{X}^L) \cdot P(\tilde{X}^1, \dots, \tilde{X}^L \mid \tilde{X}^0, \hat{X}^0, \dots, \hat{X}^L). \tag{5.38}$$

From fig. 5.5b we see that all nodes representing training samples are d-separated (section 2.2.10) from the observed test inputs $\tilde{X}^0$ and thus they are unaffected by its observation. Hence, drawing samples from the distribution (5.38) is straightforward, as described in algorithm 13. Thus, finally, we can infer model predictions $\tilde{X}^L$ given the test inputs $\tilde{X}^0$. To calculate confidence intervals of the predictions, multiple predictive samples must be obtained and the mean and variance evaluated numerically.

5.2.3 Complexity of Non-Parametric Training and Inference

The main cost of training a GPN network is determined by HMC sampling (in line 7 of algorithm 12) to get values for the inner layers $\hat{X}^1, \dots, \hat{X}^{L-1}$ given training inputs and targets. Each step of HMC sampling computes derivatives w.r.t. all GPNs in the network and thus has a cost similar to one round of back-propagation in a conventional neural network. Consequently the cost factor of training a GPN network compared to the cost of training a conventional neural network is determined by how many HMC steps are necessary between drawing gradient samples. Provided that the learning rate η is reasonably small, it can be assumed that the change of the parameters is small enough that the final state of the Markov chain from the previous training step is a good initial state for the Markov chain of the next training step, and hence no burn-in period is required. Thus the number of required steps is solely driven by the need to avoid autocorrelation between samples of the gradient.

Contrary to a conventional neural network with a fixed activation function, the training set is also required to obtain predictions from a GPN network, even after the parameters θ have been fitted. This is due to the correlation between samples inherent to the model. Also, training can only be performed in batches. Mini-batch training, in which the gradient is estimated on alternating parts of the training set and used for intermediate updates of the parameters, is not possible. Thus, in its current form, a GPN network cannot scale to datasets that are beyond the size of available memory.

While the flexibility of a data-adaptable activation function for each GPN increases the expressive power of the model compared to a conventional neural network, the drawbacks listed above make the model impractical and inefficient to use in its current form. Especially the loss of the fully parametric nature of a neural network, and the associated benefit of being able to scale to datasets of arbitrary size, limits the applicability of the model. Thus, we will introduce an auxiliary parametric approximation of the GPN model that avoids the drawbacks listed above and retains the advantages of a fully parametric model, while allowing the activation functions to adapt to the training data. We will later show how the non-parametric GPN can be recovered from the auxiliary parametric model by use of an appropriate prior. This will allow us to perform Bayesian inference using an approximative variational posterior without the need for MCMC sampling methods and their associated costs.

5.3 The Parametric Gaussian Process Neuron

Let us revisit the definition of a GPN given in section 5.1. According to (5.3) the responses are given by a GP conditioned on the activations, and thus correlations between samples are introduced, as can be seen from (5.5). As demonstrated in the previous sections, these correlations force us to perform HMC sampling for training and inference, and require the preservation of the training set for inference. While it is desirable to break up the inter-sample dependencies, the activation function still has to be consistent across all samples.

A possibility to convey consistent information across all samples is to introduce additional random variables that are sampled independently, and to condition the response of each sample on them. In the context of GPs this method was first used to find a sparse approximation of the training data and is described in (Quiñonero-Candela and Rasmussen, 2005). Three random vectors $V \in \mathbb{R}^R$, $U \in \mathbb{R}^R$ and $S \in (\mathbb{R}^+)^R$ are introduced per GPN. Their purpose is to explicitly parameterize R points of the activation function representing the mode of the GP distribution. For each virtual observation point $r \in \{1, \dots, R\}$ this parameterization consists of an inducing point $V_r$, corresponding to the activation of the observation, the target $U_r$, corresponding to the response given that activation, and the variance $S_r$ around the response. Thus we assume that we are making observations of the activation function f(a). These observations are of the form

$$f(V_r) = U_r + \epsilon \quad \text{with} \quad \epsilon \sim \mathcal{N}(0, S_r), \quad r \in \{1, \dots, R\}. \tag{5.39}$$

We introduce these observations by replacing the GP prior $P(F \mid A)$, i.e. eqs. (5.3) and (5.5), with the conditional distribution

$$P(F_\star \mid A_\star, V_\star, U_\star, S_\star) = \mathcal{N}(F_\star \mid \mu^F_\star, \Sigma^F_{\star\star}) \tag{5.40}$$

where the mean and covariance are those obtained by using the virtual observations $V_\star$ and

$U_\star$ as "training" points for a GP regression evaluated at the "test" points $A_\star$. By calculating the conditional of the multivariate normal distribution (5.5), cf. section 2.2.9, we obtain

$$\mu^F_\star = K(A_\star, V_\star) \, [K(V_\star, V_\star) + \operatorname{diag}(S_\star)]^{-1} \, U_\star, \tag{5.41a}$$
$$\Sigma^F_{\star\star} = K(A_\star, A_\star) - K(A_\star, V_\star) \, [K(V_\star, V_\star) + \operatorname{diag}(S_\star)]^{-1} \, K(V_\star, A_\star), \tag{5.41b}$$

where the covariance matrices are defined by

$$[K(V_\star, V_\star)]_{rr'} \triangleq k(V_r, V_{r'}), \tag{5.42a}$$
$$[K(A_\star, A_\star)]_{ss'} \triangleq k(A_s, A_{s'}), \tag{5.42b}$$
$$[K(A_\star, V_\star)]_{sr} = [K(V_\star, A_\star)]_{rs} \triangleq k(A_s, V_r), \tag{5.42c}$$

using the covariance function $k(a, a')$ given by (5.4).
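The following is a minimal NumPy sketch of the mean and covariance (5.41) of the response at a set of activations, given the virtual observations of one GPN. It assumes the SE covariance $k(a, a') = \exp(-(a - a')^2)$ for (5.4); the function name is ours.

```python
import numpy as np

def virtual_obs_posterior(a, V, U, S):
    # Response mean and covariance (5.41a)/(5.41b) at activations `a`,
    # conditioned on virtual observations (V, U, S) of the activation function
    k = lambda x, y: np.exp(-(x[:, None] - y[None, :]) ** 2)
    Kvv = k(V, V) + np.diag(S)             # K(V,V) + diag(S)
    Kav = k(a, V)                          # K(A,V)
    mean = Kav @ np.linalg.solve(Kvv, U)
    cov = k(a, a) - Kav @ np.linalg.solve(Kvv, Kav.T)
    return mean, cov
```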

5.3.1 Marginal Distribution and Layers

Leaving the rest of the GPN model unchanged, we can perform the marginalization over A and F to obtain a collapsed distribution $P(X \mid Y)$ for the outputs given the inputs of the parametric model. By following the same procedure as in section 5.1.1 we obtain

$$P(X_\star \mid Y_{\star\star}, V_\star, U_\star, S_\star) = \mathcal{N}(X_\star \mid \mu^X, \Sigma^X) \tag{5.43}$$

with

$$\mu^X = K(Y_{\star\star} w, V_\star) \, [K(V_\star, V_\star) + \operatorname{diag}(S_\star)]^{-1} \, U_\star, \tag{5.44}$$
$$\Sigma^X = K(Y_{\star\star} w, Y_{\star\star} w) - K(Y_{\star\star} w, V_\star) \, [K(V_\star, V_\star) + \operatorname{diag}(S_\star)]^{-1} \, K(V_\star, Y_{\star\star} w) + \sigma^2 \mathbb{1}, \tag{5.45}$$

where w is the weight vector and σ² is the output variance of the GPN as before. Since the inducing points V, targets U and variances S will always be assumed to be observed in the following sections, we stop listing them explicitly from here on to maintain clarity of notation. We call this model a parametric Gaussian process neuron. As before when stacking layers, we set $Y^l = X^{l-1}$ for $l \geq 1$. Thus, for a layer l of GPNs as described in section 5.1.2 the parametric GPN distribution becomes

$$P(X^l_{\star n} \mid X^{l-1}_{\star\star}) = \mathcal{N}(X^l_{\star n} \mid \mu^{X^l}_{\star n}, \Sigma^{X^l}_{\star\star n}) \tag{5.46}$$

with

$$\mu^{X^l}_{\star n} = K(X^{l-1}_{\star\star} W^l_{n\star}, V^l_{\star n}) \, [K(V^l_{\star n}, V^l_{\star n}) + \operatorname{diag}(S^l_{\star n})]^{-1} \, U^l_{\star n}, \tag{5.47a}$$
$$\Sigma^{X^l}_{\star\star n} = K(X^{l-1}_{\star\star} W^l_{n\star}, X^{l-1}_{\star\star} W^l_{n\star}) - K(X^{l-1}_{\star\star} W^l_{n\star}, V^l_{\star n}) \, [K(V^l_{\star n}, V^l_{\star n}) + \operatorname{diag}(S^l_{\star n})]^{-1} \, K(V^l_{\star n}, X^{l-1}_{\star\star} W^l_{n\star}) + (\sigma^l_n)^2 \mathbb{1}, \tag{5.47b}$$

where $V^l_{rn}$, $U^l_{rn}$ and $S^l_{rn}$ respectively denote the r-th virtual observation point, target and variance of GPN n in layer l. Thus we have obtained a parametric GPN layer. An overview of the notation used is provided in table 5.1.

symbol purpose model hyper-parameters: Nl number of GPNs in layer l random variables: l Ysm input dimension m of sample s to GPN layer l l Asn activation of GPN n in layer l corresponding to sample s l Fsn response of GPN n in layer l corresponding to sample s l Xsn output of GPN n in layer l corresponding to sample s model parameters: l Wnm weight from input m to GPN n in layer l l 2 (σn) output variance of GPN n in layer l additional parameters for parametric GPN: l Vrn virtual observation point r of GPN n in layer l l Urn virtual observation target r of GPN n in layer l l Srn virtual observation variance r of GPN n in layer l

Table 5.1: Overview of the notation used for GPNs. When GPN layers are stacked a superscripted l is used to denote the layer index and the output of one layer is the input to the next, thus Y l = Xl−1. variance of GPN n in layer l. Thus we obtained a parametric GPN layer. An overview of the notation used is provided in table 5.1. As before we will use superscripts to denote the layer index when stacking GPN layers to build feed-forward networks. The l parameters of a layer now include the virtual observations besides the weights, thus θ , W l, σl,V l,U l,Sl are the parameters of each layer that need to be estimated during training. { }

5.3.2 Drawing Samples

Let us set aside the question of how to efficiently estimate the parameters of the model and first examine how sampling is performed in a parametric GPN feed-forward network. Thus, for now we assume that the values of the parameters θ have been learned in some way and we want to obtain predictions from the model. Since the GPN layers form a Markov chain, sampling is straightforward. Given the inputs $X^0$, the mean vectors and covariance matrices of all GPNs in the first layer are calculated using (5.47) and a sample is drawn from the resulting multivariate normal distribution for $X^1$. This process is repeated for $X^l$, $l \in \{2, \dots, L\}$, until reaching the top layer. The resulting value for $X^L$ is an unbiased sample from the distribution $P(X^L \mid X^0)$. Alternatively, we can understand this procedure as drawing samples in the space of activation functions.

Revisiting the response distribution (5.40), we note that it gives rise to a distribution over functions f(a), which can be specified in terms of a GP as

$$f(a) \mid V^l_{\star n}, U^l_{\star n}, S^l_{\star n} \sim \mathcal{GP}(m_p(a), k_p(a, a')) \tag{5.48}$$

with mean and covariance functions

$$m_p(a) = K(a, V^l_{\star n}) \, [K(V^l_{\star n}, V^l_{\star n}) + \operatorname{diag}(S^l_{\star n})]^{-1} \, U^l_{\star n}, \tag{5.49a}$$
$$k_p(a, a') = k(a, a') - K(a, V^l_{\star n}) \, [K(V^l_{\star n}, V^l_{\star n}) + \operatorname{diag}(S^l_{\star n})]^{-1} \, K(V^l_{\star n}, a'). \tag{5.49b}$$

As a consequence the sampling procedure described above can be interpreted as follows. First, for each GPN in the feed-forward network a function f(a) is drawn from the GP (5.48). This is depicted for four GPNs in fig. 5.6. Then, each GPN is treated as if it was a conventional neuron using the sampled function as its activation function. Thus the input samples are propagated through the network as if it was a conventional feed-forward neural network. To obtain a new batch of samples a new set of activation functions is drawn and the procedure repeats.
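To make this interpretation concrete, the following sketch draws one activation function per GPN from the GP (5.48) on a dense grid of activation values and then propagates a batch through one layer as if the GPNs were conventional neurons. It builds on the virtual_obs_posterior helper sketched above; the grid range, the interpolation and the jitter term are arbitrary illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-5.0, 5.0, 200)     # dense grid of activation values

def draw_activation_function(V, U, S):
    # Sample one function from the GP (5.48)/(5.49) and return it as an
    # interpolating callable; small jitter keeps the covariance PSD
    mean, cov = virtual_obs_posterior(grid, V, U, S)
    f = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(grid)))
    return lambda a: np.interp(a, grid, f)

def sample_layer_output(X_prev, W, sigma, V, U, S):
    # Propagate a batch through one layer with freshly drawn activation functions
    A = X_prev @ W.T                   # activations, shape (samples, N_l)
    cols = []
    for n in range(W.shape[0]):
        f = draw_activation_function(V[n], U[n], S[n])
        cols.append(f(A[:, n]) + sigma[n] * rng.standard_normal(A.shape[0]))
    return np.stack(cols, axis=1)
```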

5.3.3 Loss Functions and Training Objectives

The standard method for training a deterministic regression or classification model is to minimize the so-called loss on the training set. The loss is defined as a function of the model parameters θ,

$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{s=1}^{S} L(y_\theta(X_{s\star}), T_{s\star}), \tag{5.50}$$

where $X_{s\star}$ is a training sample with corresponding target $T_{s\star}$ and $y_\theta(x)$ is the prediction of the model given input x. The loss measure $L : \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ assigns a loss value to each sample based on the difference between the model's prediction and the ground truth. Many loss measures are imaginable. For classification tasks the cross-entropy loss as given by (2.95) is a common choice for the loss measure. It uses the softmax function (2.92) to transform the outputs of the model into class probabilities of a categorical distribution and calculates the log probability of the targets under this distribution. Since a GPN feed-forward network provides a predictive distribution $P(X^L \mid X^0)$ instead of a point estimate, training is performed by minimizing the expectation of the loss, i.e. the objective is to minimize

$$\mathcal{L}(\theta) = \mathbb{E}_{P(X^L \mid X^0)} \left[ \frac{1}{S} \sum_{s=1}^{S} L(X^L_{s\star}, T_{s\star}) \right] = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(X^L_{s\star} \mid X^0)} \left[ L(X^L_{s\star}, T_{s\star}) \right]. \tag{5.51}$$


Figure 5.6: Obtaining samples from a feed-forward parametric GPN network. The inducing points (V,U) of each GPN are shown as red circles with their variances S depicted by error bars. The resulting mean function is denoted by the dashed black line and the standard deviation is represented by the gray shaded area. For each GPN an activation function has been drawn from this distribution over functions and is shown as the blue line. The filled blue circles represent data samples that are propagated through the network using the sampled activation functions.

Due to the marginalization property of the multivariate normal distribution, and because $\mu^{X^l}_{s\star}$ and $\Sigma^{X^l}_{ss\star}$ depend on $X^{l-1}$ only via $X^{l-1}_{s\star}$, as can be seen from (5.47), we observe that $P(X^l_{s\star} \mid X^{l-1}_{\star\star}) = P(X^l_{s\star} \mid X^{l-1}_{s\star})$ for all l, and thus the objective becomes

$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(X^L_{s\star} \mid X^0_{s\star})} \left[ L(X^L_{s\star}, T_{s\star}) \right]. \tag{5.52}$$

Thus for a classification task we have the objective

$$\mathcal{L}_{\text{class}}(\theta) = -\frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(X^L_{s\star} \mid X^0_{s\star})} \left[ T_{s\star} \cdot \log \operatorname{softmax}(X^L_{s\star}) \right] \tag{5.53}$$

where $T_{s\star}$ uses a one-hot encoding for the target classes and $\cdot$ denotes the scalar product

between two vectors. For a regression task, we use the negative delta distribution (5.2) as the loss measure

$$L_{\text{reg}}(y, t) = -\delta(y - t). \tag{5.54}$$

Thus the loss is “minus infinity” if the prediction is exactly correct, otherwise it is zero. This seems unreasonable at first since this measure is not differentiable, but analytically evaluating parts of the expectation will result in a plausible loss function as we will show now. Inserting this loss into (5.52) and expanding the expectation over the last layer leads to

$$\mathcal{L}_{\text{reg}}(\theta) = -\frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(X^L_{s\star} \mid X^0_{s\star})} \left[ \delta(X^L_{s\star} - T_{s\star}) \right] = -\frac{1}{S} \sum_{s=1}^{S} \iint P(X^{L-1}_{s\star} \mid X^0_{s\star}) \, P(X^L_{s\star} \mid X^{L-1}_{s\star}) \, \delta(X^L_{s\star} - T_{s\star}) \, dX^{L-1}_{s\star} \, dX^L_{s\star} = -\frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(X^{L-1}_{s\star} \mid X^0_{s\star})} \left[ P(X^L_{s\star} = T_{s\star} \mid X^{L-1}_{s\star}) \right]. \tag{5.55}$$

We can further take the logarithm and apply Jensen’s inequality to obtain the standard negative log-likelihood objective function,

$$\mathcal{L}_{\text{ll}}(\theta) = -\frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(X^{L-1}_{s\star} \mid X^0_{s\star})} \left[ \log P(X^L_{s\star} = T_{s\star} \mid X^{L-1}_{s\star}) \right] \geq -\log(-\mathcal{L}_{\text{reg}}(\theta)), \tag{5.56}$$

which is an upper bound of the logarithmic regression loss. Note that by minimizing $\mathcal{L}_{\text{ll}}(\theta)$ the predicted variance of the outputs is taken into account: a prediction that is far off from the ground truth is penalized more strongly if the GPN network simultaneously predicts a low variance, i.e. high confidence. Consequently, during training the model not only learns to predict the targets but also to self-estimate the confidence of its predictions.
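Since the last-layer conditional is normal with mean and variance given by (5.47), the per-sample term inside (5.56) is just the Gaussian negative log-likelihood of the target. A minimal sketch (function name ours):

```python
import numpy as np

def gaussian_nll(mean, var, target):
    # -log N(target | mean, var): the per-sample term of objective (5.56)
    return 0.5 * (np.log(2.0 * np.pi * var) + (target - mean) ** 2 / var)
```

The confidence effect is visible directly: for a fixed error, shrinking var inflates the quadratic term, so confident but wrong predictions are punished hardest.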

5.3.4 Reparameterization and Stochastic Training

Analytically maximizing the objective function (5.52) would involve marginalization over the states of the intermediate layers $l \in \{1, \dots, L-1\}$. Unfortunately this is intractable, because each layer marginal $P(X^l_{s\star} \mid X^0_{s\star})$ for $l \geq 2$ is a distribution of arbitrary form, since the values of the previous layer appear non-linearly through the covariance function in its conditional mean and covariance. Instead we will approximate the expectations in eqs. (5.52) and (5.56) by sampling from $P(X^L_{s\star} \mid X^0_{s\star})$ or $P(X^{L-1}_{s\star} \mid X^0_{s\star})$ respectively.


Figure 5.7: Parametric approximation of a GPN with factorization over samples for training. The box indicates that the inside structure is replicated for each sample s. The response of every sample depends on a set of inducing points V , targets U and variances S that is shared between all samples. This parameterization allows for efficient training and inference.

The structure of this expectation leads to all samples being independent given the model parameters. Consequently, each sample can be propagated independently through the network. For this purpose the layer conditionals further simplify into univariate normal distributions for each sample s and GPN n in layer l,

$$P(X^l_{sn} \mid X^{l-1}_{s\star}) = \mathcal{N}(X^l_{sn} \mid \mu^{X^l}_{sn}, \Sigma^{X^l}_{ssn}) \tag{5.57}$$

with $\mu^{X^l}_{sn}$ and $\Sigma^{X^l}_{ssn}$ given by (5.47). Note that this significantly reduces the computational complexity, since only the diagonals of the matrices $\Sigma^{X^l}_{\star\star n}$ are required from now on. This factorization corresponds to the probabilistic graphical model shown in fig. 5.7. The Markov random field between the samples in the responses of the full GPN model (fig. 5.1) has been replaced by directed connections from the inducing points to each sample, with the virtual observations $\{V^l_{rn}, U^l_{rn}, S^l_{rn}\}$ being shared between all samples.

For optimization we need the derivative of $\mathcal{L}(\theta)$ w.r.t. θ, but it cannot be calculated directly because the parameters θ appear in the distribution $P(X^L_{s\star} \mid X^0_{s\star})$ the expectation is taken over. To solve this issue we reparameterize the model using the reparameterization trick (D. P. Kingma et al., 2013) as follows. For each random variable $X^l_{sn}$ representing a GPN we introduce

an auxiliary random variable $\chi^l_{sn}$ with standard normal distribution,

$$\chi^l_{sn} \sim \mathcal{N}(0, 1), \quad l \in \{1, \dots, L\}, \; s \in \{1, \dots, S\}, \; n \in \{1, \dots, N_l\}, \tag{5.58}$$

and replace each random variable $X^l_{sn}$ by

$$X^l_{sn}(X^{l-1}_{s\star}, \chi^l_{sn}) \triangleq \mu^{X^l}_{sn} + \sqrt{\Sigma^{X^l}_{ssn}} \, \chi^l_{sn} \tag{5.59}$$

where the dependency on $X^{l-1}_{s\star}$ is mediated through $\mu^{X^l}_{sn}$ and $\Sigma^{X^l}_{ssn}$. This is equivalent to defining the probability

$$P(X^l_{sn} \mid X^{l-1}_{s\star}, \chi^l_{sn}) = \delta\big( X^l_{sn} - \mu^{X^l}_{sn} - \sqrt{\Sigma^{X^l}_{ssn}} \, \chi^l_{sn} \big) \tag{5.60}$$

using the delta distribution. The affine transformation property of the normal distribution (2.36) enables us to immediately verify that the distribution of

$$P_{\theta^l}(X^l_{sn} \mid X^{l-1}_{s\star}) = \int_{-\infty}^{\infty} P(X^l_{sn} \mid X^{l-1}_{s\star}, \chi^l_{sn}) \, P(\chi^l_{sn}) \, d\chi^l_{sn} \tag{5.61}$$

is unchanged, i.e. we recover the original $P(X^l_{sn} \mid X^{l-1}_{s\star})$ given by (5.57). We insert (5.61) into (5.52) and obtain the loss of the reparameterized model,

$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{s=1}^{S} \int \cdots \int \left( \prod_{l=1}^{L} P_{\theta^l}(\hat{X}^l_{s\star} \mid \hat{X}^{l-1}_{s\star}, \chi^l_{s\star}) \, P(\chi^l_{s\star}) \right) L(X^L_{s\star}, T_{s\star}) \; d\hat{X}^1_{s\star} \, d\chi^1_{s\star} \cdots d\hat{X}^L_{s\star} \, d\chi^L_{s\star}, \tag{5.62}$$

where we have defined

$$P(\chi^l_{s\star}) \triangleq \prod_{n=1}^{N_l} P(\chi^l_{sn}), \qquad d\chi^l_{s\star} \triangleq d\chi^l_{s1} \, d\chi^l_{s2} \cdots d\chi^l_{sN_l}.$$

The integrals over $\hat{X}^1_{s\star}, \dots, \hat{X}^L_{s\star}$ can now be evaluated, leading to

$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{s=1}^{S} \int \cdots \int L\Big( X^L_{s\star}\big( X^{L-1}_{s\star}(\cdots X^1_{s\star}(X^0_{s\star}, \chi^1_{s\star}) \cdots), \chi^L_{s\star} \big), T_{s\star} \Big) \; P(\chi^1_{s\star}) \, d\chi^1_{s\star} \cdots P(\chi^L_{s\star}) \, d\chi^L_{s\star}.$$

Thus the reparameterized objective function is

$$\mathcal{L}(\theta) = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(\chi^1_{s\star}, \dots, \chi^L_{s\star})} \big[ \hat{L}_\chi(\theta, X^0_{s\star}, T_{s\star}) \big], \tag{5.63}$$

with the reparameterized loss measure given by

$$\hat{L}_\chi(\theta, X^0_{s\star}, T_{s\star}) \triangleq L\Big( X^L_{s\star}\big( X^{L-1}_{s\star}(\cdots X^1_{s\star}(X^0_{s\star}, \chi^1_{s\star}) \cdots), \chi^L_{s\star} \big), T_{s\star} \Big). \tag{5.64}$$

Note that this is a fully deterministic function given χ. By performing the same reparameterization for the regression log-likelihood (5.56) we obtain

$$\mathcal{L}_{\text{ll}}(\theta) = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(\chi^1_{s\star}, \dots, \chi^{L-1}_{s\star})} \big[ \hat{L}_{\text{ll},\chi}(\theta, X^0_{s\star}, T_{s\star}) \big] \tag{5.65}$$

with the reparameterized regression loss measure

$$\hat{L}_{\text{ll},\chi}(\theta, X^0_{s\star}, T_{s\star}) \triangleq -\log P_\theta\big( X^L_{s\star} = T_{s\star} \big), \tag{5.66}$$

where $X^{L-1}_{s\star} = X^{L-1}_{s\star}\big( X^{L-2}_{s\star}(\cdots X^1_{s\star}(X^0_{s\star}, \chi^1_{s\star}) \cdots), \chi^{L-1}_{s\star} \big)$.

Note that in both cases only the function inside the expectation depends on the model parameters θ, while the distribution the expectation is taken over is independent of them. Since the expectation is a linear operator, we can now calculate the gradients of the expected loss w.r.t. a parameter $\varphi \in \theta$. They are given by

$$\frac{\partial \mathcal{L}}{\partial \varphi} = \frac{1}{S} \sum_{s=1}^{S} \mathbb{E}_{P(\chi^\star_{s\star})} \left[ \frac{\partial \hat{L}_\chi}{\partial \varphi}(\theta, X^0_{s\star}, T_{s\star}) \right]. \tag{5.67}$$

Since $\hat{L}_\chi(\theta, X^0_{s\star}, T_{s\star})$ and $\hat{L}_{\text{ll},\chi}(\theta, X^0_{s\star}, T_{s\star})$ are deterministic functions provided that $\chi^\star_{s\star}$ is given, all their derivatives can be calculated by iterated applications of the chain rule. Furthermore, $P(\chi^\star_{s\star})$ is the product of standard, univariate normal distributions and thus sampling values for all $\chi^l_{sn}$ is trivial and can be performed fully in parallel. A training algorithm for parametric GPN feed-forward networks using mini-batches is shown in algorithm 14. The only difference to training a conventional neural network is in line 9, where the values for χ are sampled. During the forward and back-propagation passes in lines 10 and 12, χ is treated as a constant and the loss and gradient calculations are done using the standard forward- and back-propagation algorithms. The formulas for the derivatives are not given here; instead we rely on automatic differentiation in our implementation to calculate them.
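The core of the reparameterization is the deterministic forward pass (5.59): a layer output is computed from its mean, its variance and an externally drawn χ, so gradients can flow through the moments. A minimal sketch (names ours):

```python
import numpy as np

def reparameterized_layer(mu, var, rng=np.random.default_rng(0)):
    # Eq. (5.59): X = mu + sqrt(var) * chi with chi ~ N(0, 1); the randomness
    # is decoupled from the parameters, which only enter through mu and var
    chi = rng.standard_normal(mu.shape)
    return mu + np.sqrt(var) * chi
```

In a framework with automatic differentiation, chi is held fixed during the backward pass, exactly as in line 9 of algorithm 14.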

Algorithm 14: Training a parametric GPN feed-forward network using sampling
Input: training inputs $\hat{X}^0$ and corresponding targets $\hat{T}^L$
Output: local parameter optimum θ
Parameters: learning rate $\eta_p$; loss update rate $\eta_l$; mini-batch size $S_b$; initial variance $\sigma_{\text{init}}$; initial ranges $D_V$, $D_U$

1: $\forall l, n, m$: sample $W^l_{nm} \sim \mathcal{N}(0, \sigma^2_{\text{init}})$   // random initialization of parameters
2: $\forall l, n$: sample $\sigma^l_n \sim \mathcal{U}(0, \sigma_{\text{init}})$
3: $\forall l, r, n$: sample $V^l_{rn} \sim \mathcal{U}(-D_V, D_V)$
4: $\forall l, r, n$: sample $U^l_{rn} \sim \mathcal{U}(-D_U, D_U)$
5: $\forall l, r, n$: sample $S^l_{rn} \sim \mathcal{U}(0, \sigma_{\text{init}})$
6: $L_{\text{est}} \leftarrow 0$   // training loop using mini-batches
7: while $L_{\text{est}}$ decreases do
8:   draw $S_b$ samples from the training set and call the inputs $\bar{X}^0_{s\star}$ and targets $\bar{T}_{s\star}$ with $s \in \{1, \dots, S_b\}$
9:   $\forall l, s, n$: sample $\chi^l_{sn} \sim \mathcal{N}(0, 1)$
10:  $L_{\text{est}} \leftarrow (1 - \eta_l) L_{\text{est}} + \frac{\eta_l}{S_b} \sum_{s=1}^{S_b} \hat{L}_\chi(\theta, \bar{X}^0_{s\star}, \bar{T}_{s\star})$   // update loss estimate
11:  for $\varphi \in \{W^l_{nm}, \sigma^l_n, V^l_{rn}, U^l_{rn}, S^l_{rn}\}$ do
12:    $\Delta\varphi \leftarrow \frac{1}{S_b} \sum_{s=1}^{S_b} \frac{\partial \hat{L}_\chi}{\partial \varphi}(\theta, \bar{X}^0_{s\star}, \bar{T}_{s\star})$   // use backpropagation for derivatives
13:  for $\varphi \in \{W^l_{nm}, \sigma^l_n, V^l_{rn}, U^l_{rn}, S^l_{rn}\}$ do
14:    $\varphi \leftarrow \varphi - \eta_p \Delta\varphi$   // perform parameter updates

Further details about how this is performed are given in chapter 4.

5.3.5 Discussion

We have presented an auxiliary parametric approximation to the GPN model, which provides several benefits compared to the original model. Since the model is fully parametric, after training all information about the training samples is stored in the parameters of the model and hence it is not necessary to keep training samples for prediction. The number of parameters depends on the number of GPNs and how many virtual observations are used per GPN, but it is independent of the number of training samples. Furthermore, training can be performed in mini-batches; thus portions of the training set can be presented sequentially to the model during training. These properties taken together allow parametric GPN networks to scale to datasets of arbitrary size, just like conventional neural networks do. The parametric approximation makes no assumptions on the marginal distributions $P(X^l_{s\star} \mid X^0_{s\star})$, $l \in \{1, \dots, L\}$, and thus, although all conditional distributions are multivariate normals, a parametric GPN feed-forward network can learn to represent arbitrary data distributions $P(X^L_{s\star} \mid X^0_{s\star})$.

By sampling from the distribution of activation functions during training we are introducing a stochastic element into the training procedure. Consequently, parametric GPNs further up in a feed-forward network see a noisy version of the processed inputs and learn to perform predictions even in the presence of this noise, making the model resilient against overfitting on the training data. This is similar to the regularization technique Dropout (Srivastava et al., 2014), where the output of neurons is randomly set to zero with a given probability. Compared to Dropout, however, the amount of noise induced in a parametric GPN is not constant but data-dependent. This follows from the variance of the probabilistic activation function being dependent on the inputs to the parametric GPN, i.e. close to an inducing point the activation function has low variance, but far away from any inducing point it becomes almost completely random.

While stochastic training has been proven to find parameter optima that generalize better to unseen data samples, it also slows training down considerably, because it leads to noisy estimates of the gradient, which, in turn, require the learning rate to be kept at least a magnitude lower compared to noise-free algorithms to prevent the trajectory in parameter space from becoming unstable. This is also a drawback of the Dropout technique mentioned above. To increase the speed of training and make it comparable with that of a conventional neural network, we will therefore demonstrate how to analytically propagate distributions through a parametric GPN network in section 5.5. On the negative side, the maximum likelihood estimation of the parameters performed in this section increases the risk of overfitting. Section 5.6 will show how to prevent this by performing Bayesian inference using appropriate priors.

5.4 Monotonic Activation Functions

Many commonly used activation functions, such as the sigmoid, hyperbolic tangent and (soft) rectifier, are non-decreasing, i.e. their first derivative is non-negative everywhere. The idea behind using non-decreasing activation functions comes from treating a neuron as a thresholding unit that becomes active when the weighted sum of its inputs reaches a predefined threshold. This behavior would be best captured by using the step function as the activation function; however, since its derivative is zero almost everywhere, it is not a suitable choice for a model that needs to be differentiable for training. Thus softer versions of the step function, like the logistic function and its relatives, which all have in common that they are increasing functions, are used. Furthermore, it can be shown analytically (Wu, 2009) that the loss function of a neural network with a single layer is convex, and thus only has one minimum, when a monotonic activation function is used. Taking these arguments into account, it might be desirable to enforce that activation functions modeled by a GPN are also monotonic.

As can be seen from fig. 5.8a, monotonicity cannot be imposed by simply requiring the targets of the inducing points to be increasing, since the zero mean function of the GP pulls the mean back to zero in areas without virtual observations. Moreover, samples from a GP with an increasing mean function, fig. 5.8b, can be non-monotonic due to the variability of the sampled function between the virtual observations. Hence, there exist several methods of varying complexity that achieve different levels of monotonicity of the activation function. We start by describing how to ensure a monotonic mean function with little effort, and continue by developing a more elaborate method that ensures that nearly every activation function sample is monotonic.

5.4.1 Increasing Activation Function Mean

A non-decreasing mean function can be enforced by calculating the derivative of the activation function at a finite number of derivative check points and adding a penalty term for negative derivative values at these points to the loss function of the model. Since a GP is differentiable, the mean of the derivative can be calculated analytically from the inducing points and targets using (2.84), without the need for numerical differentiation. Let $c \in \mathbb{R}^D$ be a vector of D derivative check points, for example a set of evenly spaced points between the left-most and right-most inducing points of the GPN. For each parametric GPN with index n within a layer we obtain for the mean of the derivative

$$\mu^{X'}_{\star n} = K'(c, V_{\star n}) \, [K(V_{\star n}, V_{\star n}) + \operatorname{diag}(S_{\star n})]^{-1} \, U_{\star n} \tag{5.68}$$

Figure 5.8: (a) Increasing targets (red circles) for inducing points do not necessarily lead to an increasing mean function (red line). (b) Furthermore, even with an increasing mean function a sample (blue line) from the GP can still be non-monotonic.

where the derivative covariance matrix is given by

$$[K'(c, V_{\star n})]_{dr} \triangleq -2 \exp\big( -(c_d - V_{rn})^2 \big) \, (c_d - V_{rn}). \tag{5.69}$$

The penalty term is then given by

$$\mathcal{L}_m(\theta) = \frac{\alpha_m}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \sigma\big( -\beta_m \, \mu^{X'}_{dn} \big) \tag{5.70}$$

where σ(t) is the logistic function and $\alpha_m \gg 1$ and $\beta_m \gg 1$ are constants that control the strength of the penalty. Thus a value of approximately $\alpha_m / (N D)$ is added to the loss for each derivative check point with a negative derivative value. Adding $\mathcal{L}_m$ to the loss of the model ensures that the means of the activation functions of all parametric GPNs are increasing, provided the derivative check points are placed densely enough.
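A minimal sketch of this penalty for a single GPN, combining the analytic derivative mean (5.68)/(5.69) with the logistic barrier (5.70); the SE covariance with unit lengthscale, the default constants and all names are assumptions of this sketch:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def monotonicity_penalty(c, V, U, S, alpha_m=100.0, beta_m=100.0):
    # Mean of df/da at check points c via eq. (5.68) with K' from eq. (5.69),
    # then the logistic penalty (5.70) on negative derivative means
    Kvv = np.exp(-(V[:, None] - V[None, :]) ** 2) + np.diag(S)
    dK = -2.0 * np.exp(-(c[:, None] - V[None, :]) ** 2) * (c[:, None] - V[None, :])
    mu_prime = dK @ np.linalg.solve(Kvv, U)
    return alpha_m * np.mean(sigmoid(-beta_m * mu_prime))
```

For a whole layer the same expression is averaged over all N GPNs, matching the 1/(ND) normalization in (5.70).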

5.4.2 Increasing Activation Function Samples

As seen from fig. 5.8b, even a GP with an increasing mean can give rise to non-increasing function samples. In fact, due to the stochastic nature of a GP it is impossible to guarantee that every function sample is increasing, unless the observations are so closely packed that the GP effectively becomes deterministic. However, the probability of obtaining non-increasing function samples can be greatly reduced by introducing virtual observations of the derivative into the GP at locations where it shows a significant probability of becoming negative.

A method for choosing these virtual derivative observations has been suggested by Riihimäki et al. (2010). The authors place a monotonicity prior on the GP and use the expectation propagation algorithm (Minka, 2001) to infer the derivative observations in a Bayesian framework. While this technique works well, the use of expectation propagation makes it computationally expensive and particularly problematic to include in a model that is trained by gradient descent, such as a parametric GPN network. As an alternative we develop a method based on co-optimizing a secondary loss that penalizes non-increasing function samples w.r.t. the variances of the virtual derivative observations.

In addition to the derivative check points c we introduce R' virtual derivative observations per parametric GPN. The locations of the derivative observations are denoted by $V'_{r'n}$ and the derivative targets are $U'_{r'n}$ with $r' \in \{1, \dots, R'\}$. Each derivative observation is associated with a variance $S'_{r'n} \geq 0$ that controls the influence of that virtual observation on the activation function of the GPN. The locations $V'_{r'n}$ of the derivative observations can be chosen arbitrarily; however, it is most efficient to place them evenly between the left- and right-most inducing points of a GPN. The corresponding targets $U'_{r'n}$ are set to the slope between the targets of the neighboring inducing points; i.e. let $r^-$ be the index of the inducing point $V_{r^-n}$ that is nearest to the left of $V'_{r'n}$ and let $r^+$ be the index of the inducing point $V_{r^+n}$ that is nearest to the right of $V'_{r'n}$; then we set

$$U'_{r'n} = \frac{U_{r^+n} - U_{r^-n}}{V_{r^+n} - V_{r^-n}}. \tag{5.71}$$

The derivative targets $U'_{r'n}$ are updated accordingly when the targets U change. Once initialized, the locations $V'_{r'n}$ remain fixed during training of the model. Note that here we assume that the inducing points V of a parametric GPN do not move much during training; this is an observation we have made during empirical evaluations of parametric GPN models.

The variances $S'_{r'n}$ are variable and are used to control the influence of the derivative observations. For $S'_{r'n} \gg 1$ the derivative observation $r'$ becomes insignificant and the GP behaves as if it were not present. As $S'_{r'n}$ gets smaller the derivative observation becomes more important,

and for $S'_{r'n} = 0$ all function samples f obtained from the GP will have $f'(V'_{r'n}) = U'_{r'n}$. The idea is thus to use a low derivative variance $S'_{r'n}$ in regions where the GP has a high risk of producing non-increasing functions, and to otherwise let $S'_{r'n} \to \infty$ to keep the GP as unconstrained as possible. We do so by introducing a secondary loss function that penalizes function samples that are non-increasing at the check points c. It is given by

$$\mathcal{L}_s = \frac{\alpha_s}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \sigma\big( -\beta_s \, X'_{dn} \big) \tag{5.72}$$

where σ is the logistic function and $\alpha_s$ and $\beta_s$ serve the same function as $\alpha_m$ and $\beta_m$ respectively. The random variable $X'_{dn}$ is distributed according to

$$X'_{dn} \sim \mathcal{N}\big( \mu^{X'}_{dn}, (\sigma^{X'}_{dn})^2 \big) \tag{5.73}$$

with its mean and variance given by

$$\mu^{X'}_{dn} = K^*_d \, \tilde{K}^{-1} \, \tilde{U}, \tag{5.74a}$$
$$(\sigma^{X'}_{dn})^2 = K''(c_d, c_d) - K^*_d \, \tilde{K}^{-1} \, (K^*_d)^T, \tag{5.74b}$$

where the covariance matrices and targets are

$$K^*_d \triangleq \big[ \, K'(c_d, V_{\star n}) \;\; K''(c_d, V'_{\star n}) \, \big], \tag{5.75a}$$
$$\tilde{K} \triangleq \begin{bmatrix} K(V_{\star n}, V_{\star n}) + \operatorname{diag}(S_{\star n}) & K'(V'_{\star n}, V_{\star n})^T \\ K'(V'_{\star n}, V_{\star n}) & K''(V'_{\star n}, V'_{\star n}) + \operatorname{diag}(S'_{\star n}) \end{bmatrix}, \tag{5.75b}$$
$$\tilde{U} \triangleq \begin{bmatrix} U_{\star n} \\ U'_{\star n} \end{bmatrix}. \tag{5.75c}$$

The blocks of the covariance matrices are given by the corresponding derivatives (2.83) and thus evaluate to

$$[K'(x', y)]_{r't} = -2 \exp\big( -(x'_{r'} - y_t)^2 \big) \, (x'_{r'} - y_t), \tag{5.76a}$$
$$[K''(x', y')]_{r't'} = \exp\big( -(x'_{r'} - y'_{t'})^2 \big) \big( 2 - 4 (x'_{r'} - y'_{t'})^2 \big). \tag{5.76b}$$

The expectation of the monotonicity loss (5.72) can either be evaluated by sampling from (5.73) or, more efficiently, by using the unscented transform. Since the activation function and its derivative are one-dimensional, the unscented transform (section 2.2.11) can easily be applied without the need to perform a Cholesky decomposition. The approximate expectation

of the monotonicity loss is thus given by

$$\mathbb{E}_{P(X')}[\mathcal{L}_s] \approx \frac{\alpha_s}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \frac{1}{1+\kappa} \left[ \kappa \, \sigma\big( -\beta_s \mu^{X'}_{dn} \big) + \frac{1}{2} \sigma\Big( -\beta_s \mu^{X'}_{dn} - \beta_s \sqrt{(1+\kappa)(\sigma^{X'}_{dn})^2} \Big) + \frac{1}{2} \sigma\Big( -\beta_s \mu^{X'}_{dn} + \beta_s \sqrt{(1+\kappa)(\sigma^{X'}_{dn})^2} \Big) \right] \tag{5.77}$$

Since the virtual derivative observations should only influence the GPN when necessary to avoid a non-increasing activation function, the monotonicity loss shall include a term that drives the precisions $1/S'_{r'n}$ to zero. This can be achieved by

$$\mathcal{L}_p = \frac{\alpha_p}{N R'} \sum_{n=1}^{N} \sum_{r'=1}^{R'} \frac{1}{S'_{r'n}} \tag{5.78}$$

with $\alpha_p \ll \alpha_s$, since (5.77) quickly goes to zero when the derivative is positive, due to the use of the logistic function. The monotonicity loss (5.77) and the derivative precision regularization (5.78) are added to the model loss and minimized w.r.t. the variances $S'_{r'n}$.

Let us demonstrate the working principle of this method on a GPN with fixed inducing points U, V and S. An example is shown in fig. 5.9. Without virtual derivative observations (the derivative variance S' is infinite, fig. 5.9a), the lower panel shows that the derivative of the GP has areas where it becomes negative with high probability. After the sum of the losses (5.77) and (5.78) is minimized w.r.t. S' (fig. 5.9b), the virtual derivative observations make it unlikely for the GP to produce a non-increasing function. The variance of these observations remains as high as possible; thus we obtain a GP whose derivative is constrained only as much as necessary.
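For a one-dimensional Gaussian the unscented approximation (5.77) needs only three sigma points, so no Cholesky decomposition is required. A minimal sketch of the bracketed term for a single check point (names and default constants are ours):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def unscented_penalty(mu, var, beta_s=100.0, kappa=2.0):
    # Three-sigma-point approximation of E[sigmoid(-beta_s * X')] for
    # X' ~ N(mu, var), matching the bracketed term of eq. (5.77)
    spread = np.sqrt((1.0 + kappa) * var)
    return (kappa * sigmoid(-beta_s * mu)
            + 0.5 * sigmoid(-beta_s * (mu + spread))
            + 0.5 * sigmoid(-beta_s * (mu - spread))) / (1.0 + kappa)
```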

(a) initialization (b) after minimization of penalty

Figure 5.9: Encouraging a GP to produce increasing function samples by optimizing the sum of the monotonicity losses (5.77) and (5.78) w.r.t. the variances of the virtual derivative observations. The upper panel shows the GP with mean function (red) and two samples (blue and magenta); observations are shown as red circles. The lower panel shows the derivative of the GP with virtual derivative observations indicated by green crosses, with lines showing their standard deviation. (a) The variance of the virtual derivative observations is initialized with infinity and thus they do not affect the GP. The target derivative value is set to the slope between the two neighboring inducing points. (b) After optimizing the monotonicity loss, the virtual derivative observations have low variance where the GP would otherwise have a high risk of producing functions with negative derivative, for example around x ≈ 1. In regions where the GP produces increasing functions anyway, here between −1.5 and 0, the variance of the virtual derivative observations remains high.

5.5 Central Limit Activations

The reason for approximating the loss and its derivatives by sampling in section 5.3.4 is that a parametric GPN is able to completely change the shape of a distribution propagated through it. This can be seen as follows. Given enough virtual observations, the activation function of a parametric GPN can approximate an arbitrary function. Thus, taking for instance the step function as the activation function of the GPN, we see immediately that a normal input distribution will result in an output distribution that is not normal. Given the appropriate choice of inducing points and targets, it is even possible for a GPN to transform a, say, uniform distribution into a normal distribution or vice versa. Intuitively speaking, the incoming normal distribution passes through the activation function, which is smooth but arbitrary, and thus the distribution will be deformed arbitrarily. Hence, in general it is not possible to find a class of distributions that is closed under propagation through a GPN. However, in practice a multivariate normal is still a very good approximation of the occurring distributions, as we will show in this section.

Consider the excerpt of a GPN feed-forward network shown in fig. 5.10. Since the inputs are observed, the distribution of each GPN in layer 1 is a univariate normal distribution with all GPNs being pairwise independent. Thus their joint distribution is a multivariate normal with diagonal covariance,

$$X^1_{s\star} \sim \mathcal{N}\big( \mu^{X^1}_{s\star}, \operatorname{diag}(\sigma^{X^1}_{s\star}) \big), \tag{5.79}$$

and consequently the distribution over the activations of the GPNs in layer 2 is given by

$$A^2_{s\star} \sim \mathcal{N}\big( \mu^{A^2}_{s\star}, \Sigma^{A^2}_{s\star} \big)$$

with mean and covariance given by

$$\mu^{A^2}_{s\star} = W^2 \mu^{X^1}_{s\star}, \qquad \Sigma^{A^2}_{s\star} = W^2 \operatorname{diag}(\sigma^{X^1}_{s\star}) (W^2)^T.$$

We now have to discriminate between two cases. If the variances in layer 1 are small, $\sigma^{X^1}_{sn} \ll 1$, and the weights are equally distributed around zero³, $\langle W^2_{nm} \rangle_{nm} \approx 0$, we can assume that the variances of the activations in layer 2 will also be reasonably small, $\Sigma^{A^2}_{s,ii} \ll 1$, so that a first-order Taylor expansion⁴ of the activation

³ $\langle \bullet \rangle_{nm}$ denotes the average over the elements of a matrix as defined in eq. (2.1).
⁴ For the sake of this argument we ignore the additional variance imposed by the GP representing the activation function.


Figure 5.10: The principle of the normal approximation in a feed-forward GPN network. Propagating the distribution of the activations $A^2_{s\star}$ through the activation functions results in arbitrary distributions $X^2_{s\star}$. However, given enough GPNs in layers 1 and 2, the responses $X^2_{sn}$, $n \in \{1, 2, \dots, N_2\}$, are only weakly correlated and thus the distribution of the activations $A^3_{s\star}$ will again resemble a normal distribution due to the central limit theorem.

function of each GPN around the mean,

$$F_n(A) = F\big( \mu^{A^2}_{sn} \big) + \frac{dF}{dA}\bigg|_{A = \mu^{A^2}_{sn}} \big( A - \mu^{A^2}_{sn} \big) + O\big( (A - \mu^{A^2}_{sn})^2 \big),$$

can be used to propagate the distribution of $A^2_{sn}$ through the activation function $F_n(A)$. Hence, it follows that in this case the distribution of $X^2_{sn}$ must also be normal, since applying an affine transformation to a normal distribution results in another normal distribution.

If the variances $\Sigma^{A^2}_{s,ii}$ are not sufficiently small, the response distribution of $X^2_{sn}$ is arbitrary. However, if the activation functions are sufficiently uncorrelated, the covariance statistics remain approximately unchanged; thus it will hold that

$$\big\langle \operatorname{Cov}(X^2_{sn}, X^2_{sm}) \big\rangle_{n \neq m} \approx \big\langle \operatorname{Cov}(A^2_{sn}, A^2_{sm}) \big\rangle_{n \neq m}.$$

Assuming random layer 2 weights $W^2$, we see that by writing the covariance matrix as

$$\Sigma^{A^2}_{s,nm} = |V_{n\star}| \, |V_{m\star}| \cos \angle(V_{n\star}, V_{m\star}) \quad \text{with} \quad V \triangleq W^2 \operatorname{diag}(\sigma^{X^1}_{s\star})^{\frac{1}{2}},$$

(a) layer 1 and 2 both have 3 GPNs (b) layer 1 and 2 both have 10 GPNs

Figure 5.11: The figure shows the CDF of the distribution of activations of a GPN in layer 3 of the GPN feed-forward network shown in fig. 5.10. The red line shows the empirical distribution obtained by propagating 1000 draws of $X^1_{s\star}$ through layer 2 and applying the weights $W^3$. The dashed blue line is the CDF of a best-fit normal distribution on this data. Given random weights and activation functions, 10 GPNs seem to be enough for the activations $A^3_{s\star}$ to resemble a Gaussian.

the expected value of the off-diagonal elements is determined by the cosine of the angle between the rows of V. Since the weights are assumed to be i.i.d. random, this implies

$$\big\langle \cos \angle(V_{n\star}, V_{m\star}) \big\rangle_{n \neq m} \approx 0. \tag{5.80}$$

This makes the central limit theorem for dependent variables, cf. section 2.2.5, applicable to the layer 3 activations

$$A^3_{si} = \sum_n W^3_{in} X^2_{sn},$$

since (5.80) ensures that the variables $X^2_{sn}$ are at most weakly correlated, and thus the prerequisite for the central limit theorem that

$$\tau^2 = \text{const} + \frac{1}{N_2} \sum_{n=1}^{N_2} \sum_{\substack{m=1 \\ m \neq n}}^{N_2} \operatorname{Cov}\big( W^3_{in} X^2_{sn}, W^3_{im} X^2_{sm} \big)$$

2 As? stays finite for N2 is fulfilled. Consequently even for large variances Σ the central limit → ∞ ii 3 theorem suggests that the activations As? of the following layer will be normally distributed due to summation over a large number of weakly correlated variables. We can also test the claim of having normally distributed activations in each layer experi- 5.5. CENTRAL LIMIT ACTIVATIONS 149 mentally. For that purpose we instantiate three-layer GPN feed-forward networks (see fig. 5.10) with different numbers of GPNs in layer 1 and 2. Their weights are sampled from a standard normal distribution and the activation function of each GPN is randomly sampled from a zero- mean GP with the SE covariance function with unit lengthscale. A random input vector is 1 drawn and the parameters of the distribution for Xs? given by (5.79) are calculated. Then 1 3 1000 samples are drawn from Xs? and propagated through each network until reaching As?. 3 The resulting empirical CDFs of As? for two networks are shown in fig. 5.11 together with the best-fit normal CDFs. For a very small network with only 3 GPNs on both layers it is apparent that the activations are not normally distributed. As the number of GPNs increases the distribu- tion becomes more normal and having 10 GPNs in both layers is enough for the distribution to resemble a Gaussian very closely.

A similar analysis has been performed by (Wang et al., 2013) on a conventional neural network. In their work the authors show empirically that the activations of each neuron remain normally distributed even after the network has been trained to convergence. We can therefore deduce that for the weights in a trained network, eq. (5.80) still holds and thus our argument remains valid during and after training.

In conclusion, we have motivated analytically and shown empirically that assuming a mul- l tivariate normal distribution over As? for all layers l is a reasonable approximation for non- degenerate feed-forward networks of parametric GPNs. The best choice for the parameters of the approximating normal is to use the same mean and covariance as the true distribution, i.e.

l l P(e Al ) = (Al µA , ΣeA ) (5.81) s? N s? | es? s?? with

Al l µ E l l−1 l−1 [A ] (5.82) esn , P(As? | As? )P(e As? ) sn Al l l Σe Cov l l−1 l−1 (A ,A ) , (5.83) snm , P(As? | As? )P(e As? ) sn sm and we must now find how to propagate these statistics from layer to layer. It turns out that it can be done exactly using closed-form equations that will be derived in the following section. 150 CHAPTER 5. GAUSSIAN PROCESS NEURONS

5.5.1 Propagation of Mean and Covariance

l l l−1 Since As? = W Xs? we have for each layer

Al l Xl−1 µes? = W µes? (5.84a) Al l Xl−1 l T Σes?? = W Σes?? (W ) , (5.84b)

Xl−1 Xl−1 l−1 where µe and Σe are the mean and variance of X , which is not normally distributed. Xl Xl Al Al We now have to calculate µe and Σe given µe and Σe .

The first layer, l = 1, already follows a normal distribution, since its inputs and thus activa- tions are deterministic. Thus we set

X1 X1 µesn = µsn (5.85a) X1 X1 Σesnm = Σssn δnm (5.85b)

X1 X1 with µsn and Σssn given by (5.47). Note that in the normal approximation Σe describes the covariance between different GPNs within a layer while Σ from (5.47b) captures the covariance between different samples for the same parametric GPN.

For the following layers, l > 1, we proceed as follows. To calculate the mean vector we apply the law of total expectation and get

Xl l h l i Xl µ = E[X ] = E l E l l [X ] = E l [µ ] . (5.86) esn sn P(e As?) P(Xs? | As?) sn P(e As?) sn

Xl By inserting µsn from (5.47a) we further obtain

T Xl  l  l l µ = E l [α ] κ U (5.87) esn P(e As?) ?n ??n ?n with

l l l αrn , k(Asn,Vrn) (5.88) l l l l −1 κ??n , [K(V?n,V?n) + diag(S?n)] . (5.89)

The expectation becomes Z l l l l Al Al l E l [αrn] = k(Asn,Vrn) (Asn µsn, Σesnn) dAsn . (5.90) P(e Asn) N | e 5.5. CENTRAL LIMIT ACTIVATIONS 151

Noting that the SE covariance function can be written as the unnormalized PDF of a Gaussian,   0 0 1 k(a, a ) = √π a a , , N 2 we can further calculate

Z l l E[αl ] = √π (Al V l , 1/2) (Al µA , ΣeA ) dAl (5.91) rn N sn | rn N sn | esn snn sn and by applying the product formula for two Gaussian PDFs (2.23) we obtain another Gaussian PDF, l l √π (Al V l , 1/2) (Al µA , ΣeA ) = b (Al µ, σ2) (5.92) N sn | rn N sn | esn snn SN sn | b b with

1 1 = 2 + (5.93a) σ2 Al b Σesnn l ! µA µ = 2V l + esn σ2 (5.93b) b rn Al b Σesnn s Al l 2 ! 1 (µsn Vrn) b = exp e − (5.93c) Al Al S 1 + 2Σesnn − 1 + 2Σesnn

After substitution into (5.91) only the factor b remains since the integral over any PDF is one, S thus we obtain ψl E[αl ] = b (5.94) rn , rn S This concludes the calculation of the mean of Xl. For the covariance matrix we obtain by using the definition of the covariance and expanding the occurring expectations as above

Xl l l h l l i Xl Xl Σesnm = Cov(Xsn,Xsm) = E l EP(Xl | Al )[Xsn Xsm] µsn µsm . (5.95) P(e As?) s? s? − e e

We must now differentiate between on-diagonal and off-diagonal elements of the covariance matrix. For diagonal elements, n = m, we note that E[X2] = Var(X) + E[X]2 and obtain

Xl h Xl Xl 2i Xl 2 Σesnn = E l−1 Σssn + (µsn ) (µsn ) . (5.96) P(e As? ) − e

The first term of the sum in the expectation evaluates to

h Xl i h l T l l i l 2 E l Σssn = 1 E (αs?) κ??n αs? + (σn) . (5.97) P(e As?) − 152 CHAPTER 5. GAUSSIAN PROCESS NEURONS which can further be expanded into

h l T l l i X X l l E (αs?) κ??n αs? = κrtn Ωrtn . r t

l l l l l with Ωrtn , E[αrn αtn]. Rewriting αrn, αtn as Gaussian PDFs and combining them with the l normal distribution over Asn using the product formula for Gaussian PDFs (2.25) gives after simplification

l l √π (Al V l , 1/2) √π (Al V l , 1/2) (Al µA , ΣeA ) = e (Al µ, σ2) (5.98) N sn | rn N sn | tn N sn | esn snn SN sn | e e with

l ΣeA σ2 = snn (5.99a) e Al 1 + 4Σesnn l l ! ΣeA µA µ = snn 2V l + 2V l + esn (5.99b) e Al rn tn Al 1 + 4Σesnn Σesnn 2   l V l +V l   s 2 µA rn tn l l 2 1 esn 2 (Vrn Vtn) e = exp − −  . (5.99c) Al  Al 2  S 1 + 4Σesnn − 1 + 4Σesnn −

Thus we have Ωl = e due to integrating over a normalized PDF. The second term in the rtn S expectation evaluates to

h Xl 2i h l T l l T l i X X l l l E (µsn ) = E (β?n) α?n (α?n) β?n = βrnβtn Ωrtn (5.100) r t

l l l where we have defined β?n , κ??n U?n. Inserting the derived terms into (5.96) and simplifying gives

l h  i   ΣeX = 1 tr κl βl (βl )T Ωl tr ψl (ψl )T βl (βl )T + (σl )2 . (5.101) snn − ??n − ?n ?n ??n − ?n ?n ?n ?n n

An alternative derivation of the mean and variance giving the same result can be found in (Quiñonero-Candela, Girard, et al., 2002).

For off-diagonal elements, n = m, we observe that Xl and Xl are conditionally indepen- 6 sn sm l dent given As? because the activation functions of neurons n and m are represented by two different GPs. The correlation between GPNs within a layer l is induced by being conditioned 5.5. CENTRAL LIMIT ACTIVATIONS 153

l−1 on the same set of inputs Xs? . Exploiting this conditional independence gives

l h l l i l l h i l l ΣeX = E µX µX µX µX = E (βl )T αl (αl )T βl µX µX snm sn sm − esn esm ?n ?n ?m ?m − esn esm X X l l l Xl Xl = βrnβtm Λrtnm µesn µesm (5.102) r t −

l l l To evaluate Λrtnm , E[αrnαtm] we observe that

   1  αl αl = exp (Al V l )2 (Al V l )2 = exp (A V )T 2(A V ) rn tm − sn − rn − sm − tm −2 − − = 4π (A V ,L) (5.103) N | where we have defined ! ! ! Al V l 1/2 0 A sn , V rn ,L . , l , l , Asm Atm 0 1/2

We now have ZZ Λl = 4π (A V ,L) (A µ, Σ)b dA rtnm N | N | b with l ! l l ! µA ΣeA ΣeA µ = esn , Σ = snn snm b Al b Al Al µesm Σesmn Σesmm l due to the marginalization property of the multivariate normal distribution over As??. Applying the product formula for the PDFs of multivariate normal distributions (2.37) we obtain

l Λ = 4π exp( 1 + 2 3) (5.104) rtnm S S − S with

1 1 T −1 1 = log 2π log L V L V S − − 2 | | − 2 1 1 T −1 2 = log 2π log Σb µ Σb µ S − − 2 − 2 b b T −1 1 −1 −1 1  −1 −1   −1 −1  −1 −1  3 = log 2π + log L + Σb L V + Σb µ L + Σb L V + Σb µ . S − 2 − 2 b b

Using the general matrix inversion identity

−1 −1 (A + B)−1 = A−1 A−1 + B−1 B−1 = B−1 A−1 + B−1 A−1 154 CHAPTER 5. GAUSSIAN PROCESS NEURONS

the expression for 3 simplifies to S −1 1 −1 −1 T  −1 −1 3 = log 2π + log L + Σb V L + Σb µ S − 2 − b − −1 −1 1   1 T   V T L−1 + Σb−1 Σb L−1 V µ L−1 + Σb−1 L Σe−1 µ . 2 − 2 b e

All matrices that occur in 1, 2, 3 are of size 2 2, thus their inverses and determinants can S S S × be calculated analytically in closed form using

det M = M11 M22 M12 M21 , − ! 1 M22 M12 M −1 = − . det M M21 M11 −

Applying these formulas and substituting 1, 2, 3 into (5.104) gives after simplification S S S l l l exp(Artnm/Brtnm) Λrtnm = q (5.105) l−1 l−1 l−1 (1 + 2ΣeA ) (1 + 2ΣeA ) 4ΣeA snn smm − snm with

l l l l A l = (V l µA )2 (1 + 2ΣeA ) + (V l µA )2 (1 + 2ΣeA ) + rtnm tm − esm snn rn − esn smm l l l 4 (V l µA )(µA V l ) ΣeA (5.106a) rn − esn esm − tm snm l l l Bl = (1 + 2ΣeA ) (1 + 2ΣeA ) 4(ΣeA )2 (5.106b) rtnm nn mm − nm

This concludes the calculation of the covariance matrix of Xl.

Having made the approximations shown in this section, we are able to analytically propagate the mean and covariance of all data points through all layers of a feed-forward GPN stack.

Including Virtual Derivative Observations

If virtual derivative observations are used to ensure that the activation function of the GPN is increasing, cf. section 5.4, the propagation of the mean and covariance must be adapted to accommodate these observations. To propagate the mean using virtual derivative observations, eq. (5.87) is replaced by T Xl  l  l l µ = E l [α ] κ Ue (5.107) esn P(e As?) e?n e??n ?n 5.5. CENTRAL LIMIT ACTIVATIONS 155

l −1 where we have κe??n , Ke with Ke and Ue given by (5.75b) and (5.75c). Furthermore, " # αl αl ?n e?n , 0l α?n

l 0l 0 0l l with αrn as previously defined in (5.88) and αr0n = K (Vr0n,Asn) given by (5.76a). The corre- 0l sponding expectation value E[αr0n] is straightforward to calculate, Z 0l  0l l  l 0l l Al Al l E[α 0 ] = 2 V 0 A √π (A V 0 , 1/2) (A µ , Σe ) dA r n − r n − sn N sn | r n N sn | esn snn sn 0l Al 2 ! 0l Al  0l  (Vr0n µesn) Vr0n µesn = 2 b V 0 µ = 2 exp − − , (5.108) r n b Al Al 3/2 − S − − − 1 + Σesnn (1 + 2Σesnn) where b and µ are given by (5.93c) and (5.93b) respectively. S b

For the variance eq. (5.97) must be replaced by

h Xl i h l T l l i l 2 E l Σssn = 1 E (αs?) κ??n αs? + (σn) . (5.109) P(e As?) − e e e and the corresponding expectations evaluate to

Z l l E[αl αl ] = π (Al V l , 1/2) (Al V l , 1/2) (Al µA , ΣeA ) dAl rn tn N sn | rn N sn | tn N sn | esn snn sn = e (5.110a) S Z 0l l  0l l  l 0l l l E[α 0 α ] = 2π V 0 A (A V 0 , 1/2) (A V , 1/2) r n tn − r n − sn N sn | r n N sn | tn · l l (Al µA , ΣeA ) dAl N sn | esn snn sn  0l  = 2 e V 0 µ (5.110b) − S r n − e Z 0l 0l  0l l  l 0l  0l l  l 0l E[α 0 α 0 ] = 4π V 0 A (A V 0 , 1/2) V 0 A (A V 0 , 1/2) r n t n r n − sn N sn | r n t n − sn N sn | t n · l l (Al µA , ΣeA ) dAl N sn | esn snn sn h 0l 0l  0l 0l  2 2i = 4 e V 0 V 0 µ V 0 + V 0 + µ + σ (5.110c) S r n t n − e r n t n e e with µ, σ2, and e given by (5.99). Substituting these quantities into (5.101) propagates the e e S variance from layer to layer including the virtual derivative observation to ensure monotonicity of the activation function samples. 156 CHAPTER 5. GAUSSIAN PROCESS NEURONS

Expectation of the Loss

It remains to calculate the expectation of the loss over the approximative normal distribution of the top-most GPN layer,

S 1 X L L (θ) = E L [L(Xs?,Ts?)] . (5.111) S Peθ(Xs?) s=1

In the case of regression the associated loss measure (5.54) results in

S 1 X L Lreg(θ) = Peθ(X = Ts?) −S s? s=1

and we obtain the standard negative log-likelihood objective function by applying Jensen’s inequality over the sum,

S S 1 X L 1 X XL XL Lll(θ) = log Peθ(X = Ts?) = (Ts? µ , Σe ) . (5.112) −S s? −S LN | es? s?? s=1 s=1

XL XL 0 Note that µes? and Σes?? are functions of the inputs Xs? and model parameters θ through iterated application of eqs. (5.87), (5.101) and (5.102). Thus under the normal approximation developed in this section the training objective for regression is a fully deterministic function. For this loss we made the additional assumption that the (learned) activation function of the last layer is linear enough so that XL follows a normal distribution. This can easily be achieved without loss of expressive power by fixing the activation function of the last layer to the identity function. Then XL = AL and the central limit theorem ensures a normal distribution. For other loss functions, for example the cross-entropy loss for classification, we still need to evaluate the expectation in eq. (5.111). While we could compute the loss by sampling from L Peθ(Xs?) and taking the average over the loss values for these samples, this would require our training procedure to be stochastic and thus not as efficient as when minimizing a deterministic objective such as (5.112). A method of propagating the mean and covariance of a normal distribution through an arbitrary function is the unscented transform, that was introduced in section 2.2.11. It works by propagating deterministically chosen points that represent the distribution through the function and using the transformed points to estimate the mean and covariance of the transformed distribution. Using the unscented transform the loss becomes

S 2NL 1 X X s L (θ) = Wi L(x ,Ts?) (5.113) S i s=1 i=0 5.5. CENTRAL LIMIT ACTIVATIONS 157

s x XL where xi are the sigma points given by eqs. (2.41) and (2.47) for mean µ = µes? and x XL covariance Σ = Σes?? and Wi are the corresponding weights of the unscented transform. Since the sigma points are differentiable w.r.t. the mean and covariance, the derivative of (5.113) w.r.t. the model parameters θ can be calculated using the chain rule. This method also assume that XL follows a normal distribution thus it is sensible to fix the activation function of layer L to the identity function. In conclusion, we have derived completely deterministic training objectives for a feed- forward parametric GPN network using analytic propagation of means and covariances from layer to layer, resulting in loss functions that can be minimized using the classic technique of backpropagation.

5.5.2 Computational and Model Complexity

To represent the activation functions a parametric GPN layer requires 3R parameters per GPN, where R is the number of virtual observations. GPNs require no bias term, because it is equiva- lent to an offset in the targets corresponding to the inducing points. To reduce the number of parameters the locations of the inducing points V can be fixed, for example using equidistantly placed points, so that only 2R parameters are required. Furthermore, the standard deviations S of the targets U can be shared between all inducing points, resulting in only R + 1 parameters per GPN. Finally, to further reduce the number of required parameters we can apply the weight sharing idea from CNNs and use a common set of virtual observations and thus activation functions within a group of GPNs or even within a whole layer. Since the flexibility of the parametric GPN is controlled by R, care must be taken not to choose a too small R since this would limit the activation functions representable by the GPN and thus the power of a feed-forward network constructed from these GPNs. A good heuristic for choosing R is so that the GPN is able to represent the most common activation functions currently in use. For that purpose, we empirically modeled the hyperbolic tangent, rectifier and sine activation functions with a GPN using a varying number of inducing points and compared the resulting approximation to the original function. From fig. 5.12 we can see that R = 8 virtual observations with equidistant inducing points are enough to represent these activation functions with high accuracy. However, outside of the range of the inducing points the approximation of the activation function will return to zero by design and cannot model a periodic function like the sine. The computational complexities of propagating mean and covariance from layer to layer are shown in table 5.2. The complexity of calculating the responses can be significantly reduced 3 2 l l from (Nl R ) to (Nl R ) by keeping the inducing points V and variances S fixed, because in O O 158 CHAPTER 5. GAUSSIAN PROCESS NEURONS

identity tanh 5 1.5 GPN 1 true act. func. 0.5

0 0

-0.5

-1

-5 -1.5 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 rectifier sin 5 2

1

0 0

-1

-5 -2 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 (a) GPN using 5 virtual observations

identity tanh 5 1.5 GPN 1 true act. func. 0.5

0 0

-0.5

-1

-5 -1.5 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 rectifier sin 5 2

1

0 0

-1

-5 -2 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 (b) GPN using 8 virtual observations

Figure 5.12: Common activation functions approximated by a parametric GPN with 5 or 8 virtual observations (red circles) respectively. The dotted line shows the actual activation func- tions, while the red line is the approximation by the parametric GPN. The standard deviation is shown using gray shading. Using 8 virtual observations all functions can be closely approxi- mated within the range of the inducing points; outside this range the function values gradually return to zero. 5.5. CENTRAL LIMIT ACTIVATIONS 159

expression complexity Xl−1 Xl−1 activations given µes? , Σes? Al mean µes? (5.84a) (Nl Nl−1) Al O variance diag(Σes??) (5.84b) (Nl Nl−1) Al 2O 2 covariance Σes?? (5.84b) (Nl Nl−1 + Nl Nl−1) Al Al O responses given µes? , Σes? variable V,S fixed V,S Xl 3 2 mean µes? (5.87) (Nl R ) (Nl R ) Xl O 3 O 2 variance diag(Σes??) (5.101) (Nl R ) (Nl R ) l O O covariance ΣeX (5.102) (N 2 R3) (N 2 R2) s?? O l O l Table 5.2: Computational complexity of propagating the mean and variance or covariance from layer l 1 to layer l. The exponent (R3) comes from the complexity of matrix multiplication − O and inversion. For matrix multiplication it can be reduced to (R2.807) using the Strassen O algorithm (Strassen, 1969). this case the tensor κl given by (5.89) is fixed and can be precomputed. Another method to save computational complexity is to only propagate the variances, i.e. diagonal of the covariance matrix, through the GPN stack. This reduces the complexity of computing the activations from 2 2 (N Nl−1 + Nl N ) to (Nl Nl−1) and the complexity of computing the responses from O l l−1 O 2 3 3 (N R ) to (Nl R ). O l O Note, that the number of parameters and the computational complexity of propagating the means and covariances only depend on the number of virtual observations and parametric GPNs; therefore the memory requirement is independent of the number of training samples and the required training time per epoch scales linearly with the number of training samples. Thus, like a conventional neural network, a parametric GPN feed-forward can inherently be trained on datasets of unlimited size. 160 CHAPTER 5. GAUSSIAN PROCESS NEURONS

5.6 Approximate Bayesian Inference

In experiments with parametric GPNs it became apparent that the variances Sr of the virtual observations were driven to zero, thus leading to overfitting. This is to be expected since a para- metric GPN feed-forward network is a model with more expressive power than a conventional neural network and maximum-likelihood solutions are usually prone to overfitting if no care is taken to limit the expressiveness of the model or ensure regularization by some other means. A method to avoid these problems is to perform approximate Bayesian inference by means of an variational approximation instead of maximum likelihood estimation. Bayesian inference requires a prior distribution on the parameters.

Thus we cannot directly apply it on the parametric GPN as introduced in section 5.3, since

the inducing points Vr, targets Ur and variances Sr lack such a prior distribution. The original concept of the non-parametric GPN was to have a GP prior on the activation function of each unit, as given by (5.10); hence the natural choice for a prior on the virtual observations is such that the GP prior on the activation function is restored when the virtual observations are marginalized out. To our knowledge the first use of this prior-restoring technique was in finding inducing points for sparse GP regression (Titsias, 2009).

More formally, we want to find a prior P(V?,U?,S?) so that ZZZ P(X? A?) = P(X? A?,V?,U?,S?) P(V?,U?,S?) dV? dU? dS? | | where P(X? A?) is the GP prior given by (5.10) and P(X? A?,V?,U?,S?) is the parametric | | GPN distribution given by (5.43). We remember that S? was introduced to allow to control the

precision and thus influence of the virtual observations when V? and U? were deterministic

parameters. However, now that V? and U? have become random variables, it is redundant to

have an explicit parameter S? for the variance of the virtual observations and thus we remove

it by setting S? = 0. Thus we are looking for a prior P(V?,U?) so that ZZ P(X? A?) = P(X? A?,V?,U?,S? = 0) P(V?,U?) dV? dU? . | |

From section 2.4 we know that the GP regression distribution P(X? A?,V?,U?,S?) follows | from observing a subset of variables, here U?, that share a common GP prior with X?. Thus X?

and U? must follow a joint GP prior, " # " #! X K(A ,A ) K(A ,V ) ? ? ? ? ? P(X?,U? A?,V?) = 0, , (5.114) | N U? K(V?,A?) K(A?,A?) 5.6. APPROXIMATE BAYESIAN INFERENCE 161 and using the marginalization property of the normal distribution we obtain

 P(X? A?) = X? 0,K(A?,A?) , (5.115) | N |  P(U? V?) = U? 0,K(V?,V?) . (5.116) | N |

Consequently P(V?,U?) = P(U? V?) P(V?), where P(V?) can be chosen freely, results in a GP | prior on the activation function as desired. To avoid marginalization over V? we keep that variable deterministic by choosing a delta distribution for its prior,

P (V?) = δ(V? v?) . (5.117) −

For clarity of notation we will also drop V? from the conditioning set of probability distributions from now on.

The joint distribution of a parametric GPN feed-forward network with appropriate prior is thus given by

L Y P( X L, A L, U L, F L X0) = P(Al Xl−1) P(U l) P(F l Al,U l) P(Xl F l) , (5.118) { }1 { }1 { }1 { }1 | | | | l=1 where to notation L should be read as 1, 2,..., L . To keep notation brief we did not {•}1 {• • • } write out explicitly the dependencies of the occurring conditionals on samples and GPN units. Note that by marginalizing (5.118) over U L using the prior (5.116) we will obtain the distri- { }1 bution of a non-parametric GPN feed-forward network as given by (5.16). A graphical model corresponding to that distribution with training targets XL observed is shown in fig. 5.13a.

As described in section 5.2.2 exact inference in this model requires the use of Monte Carlo methods that come with additional computational complexity. Instead, here we turn to the technique of variational inference, cf. section 2.2.12, to approximate the posterior of the model. Performing variational inference requires planning the approximation. It has to be decided which latent variables of the model should be inferred and which should be marginalized out. Also the structure of the approximative distribution Q has to be chosen; this entails the approximative distributions for the posteriors of the latent variable and assumed independence relations. We analyze three different approaches for performing variational inference in a GPN feed-forward network and present them in the following sections. 162 CHAPTER 5. GAUSSIAN PROCESS NEURONS

U 1 U 2 U 3

(a) X0 A1 F 1 X1 A2 F 2 X2 A3 F 3 X3

U 1 U 2 U 3

(b) X0 A1 F 1 X1 A2 F 2 X2 A3 F 3 X3

U 1 U 2 U 3

(c) X0 A1 F 1 X1 A2 F 2 X2 A3 F 3 X3

U 1 U 2 U 3

(d) X0 A1 F 1 X1 A2 F 2 X2 A3 F 3 X3

Figure 5.13: A GPN feed-forward network distribution for three layers and three approximations of its posterior. Each node corresponds to all samples and GPN units within a layer. (a) Exact posterior distribution P( X L−1, A L, F L, U L X0,XL) that results in a non-parametric { }1 { }1 { }1 { }1 | GPN feed-forward network when marginalized over U L. (b) Variational approximation of { }1 the inducing targets U l. The remaining conditional distributions of the approximative posterior are the same as the prior. (c) Variational approximation of the inducing targets U l assuming that the central limit theorem holds for the marginals of the latent activations Al. This is the case when there is a sufficient number of GPNs per layer. (d) Variational approximation of the inducing targets U l and the latent values Xl factorizing over the layers (mean-field approach). 5.6. APPROXIMATE BAYESIAN INFERENCE 163

5.6.1 Stochastic Variational Inference

Since the information about the activation functions learned from the training data is mediated via the virtual observation targets U l, their posterior must be adaptable in order to store that information. Hence, we choose a normal distribution factorized over the GPN units within a layer with free mean and covariance for the approximative posterior of U l,

Nl Y l l Q(U l) Q(U l ) , Q(U l ) (U l µU , ΣbU ) . (5.119) , ?n ?n , N ?n | b?n ??n n=1

This allows the inducing targets of a GPN to be correlated, but the covariance matrix can be constrained to be diagonal, if it is desired to reduce the number of model parameters. We keep the rest of the model distribution unchanged from the prior; thus the overall approximating posterior is given by

L Y Q( U L, X L−1, A L, F L) P(Al Xl−1) Q(U l) P(F l Al,U l) P(Xl F l) , (5.120) { }1 { }1 { }1 { }1 , | | | l=1 where the dependency on X0 has been dropped from Q for clarity of notation. A graphical model corresponding to this approximative posterior is shown in fig. 5.13b.

U l U l The parameters µb and Σb are estimated using the method of variational inference, cf. section 2.2.12, which does that by minimizing a lower bound on the KL-divergence between the approximation and true posterior,

KLQ( U L, X L−1, A L, F L) P( X L−1, A L, U L, F L X0,XL) . { }1 { }1 { }1 { }1 || { }1 { }1 { }1 { }1 |

This is performed implicitly by maximizing the ELBO given by

Z Z Q( U L, X L−1, A L, F L) = Q( U L, X L−1, A L, F L) log { }1 { }1 { }1 { }1 L − ··· { }1 { }1 { }1 { }1 P( U L, X L, A L, F L X0) · { }1 { }1 { }1 { }1 | d U L d X L d A L d F L (5.121) { }1 { }1 { }1 { }1 w.r.t. to the parameters of interest. Substituting the distributions into this equation results in

= reg + pred (5.122) L −L L 164 CHAPTER 5. GAUSSIAN PROCESS NEURONS with the following terms after simplification,

L Z l X l Q(U ) l reg = Q(U ) log dU , (5.123) L P(U l) l=1 Z L L L L pred = Q(F ) log P(X F ) dF . (5.124) L |

The term reg can be identified as the sum of the KL-divergences of the virtual observation L targets between their prior and variational posterior. Since this term enters with a negative L sign, its purpose is to keep the approximative posterior close to the prior; thus it can be understood as a regularization term. It is evaluated using the formula for the KL-divergence between two normal distributions (2.35). Disregarding an additive constant, its value is given by

L L Nl X l l  X X l l  reg = KL Q(U ) P(U ) = KL Q(U ) P(U ) (5.125) L || ?n || ?n l=1 l=1 n=1   L Nl l l 1 X X  l l −1 U l  U l T l l −1 U l K(V?n,V?n) tr K(V?n,V?n) Σb??n + (µb?n) K(V?n,V?n) µb?n + log  . ∝ 2 U l l=1 n=1 Σb??n

L The term pred cannot be evaluated directly because the marginal Q(F ) is intractable in L general. However, expanding log P(XL F L) over the samples and writing Q(F L) as a marginal | over the layers, we note that pred can be written as an expectation, L " S #  L L  X L L pred = E L log P(X F ) = E L log P(X F ) L Q(F ) | Q(F ) s? | s=1 S X  L L  = E L log P(X F ) (5.126) Q(Fs?) s? | s? s=1 S X  L L  = E 1 0 2 1 L−1 L−2 L L L L−1 log P(Xs? Fs?) , Q(Xs? | Xs?) Q(Xs? | Xs?) ··· Q(Xs? | Xs? ) Q(Fs? | As?=W Xs? ) | s=1

and thus we can evaluate it by successively sampling from the chain of conditional distributions the expectation is taken over. For this purpose we need to evaluate the conditional distributions Z Q(Xl Xl−1) = Q(F l Al = W lXl−1) P(Xl F l) dF l , (5.127) | | | N Yl Z Q(F l Al) = Q(U l ) P(F l Al ,U l ) dU l . (5.128) | ?n ?n | ?n ?n ?n n=1 5.6. APPROXIMATE BAYESIAN INFERENCE 165

Since Q(F l Al) is the conditional of a GP with normally distributed observations, the joint | distribution Q(F l ,U l Al ) = Q(U l ) P(F l Al ,U l ) must itself be normal, ?n ?n | ?n ?n ?n | ?n ?n

" l # " F l # " F l #! F µ Σb ΣeFU Q(F l ,U l Al ) = ?n b?n , ??n , (5.129) ?n ?n | ?n N l U l U l U?n µb?n ΣeUF Σb??n

F l F l T and we can find the values for the unknown parameters µb?n, Σb??n and ΣeFU = ΣeUF by equat- ing the moments of its conditional distribution Q(F l U l ,Al ) with P(F l U l ,Al ). The ?n | ?n ?n ?n | ?n ?n conditional distribution is given by

Q(F l U l ,Al ) = (F l µ, Σ)e ?n | ?n ?n N ?n | e and thus we obtain the following set of equations by comparing mean and covariances,

F l U l −1 l U l l l l l −1 l µ µ + ΣeFU (Σb ) (U µ ) = K(A ,V ) K(V ,V ) U , (5.130) e , b?n ??n ?n − b?n ?n ?n ?n ?n ?n F l U l −1 l l l l l l −1 l l Σe Σb ΣeFU (Σb ) ΣeUF = K(A ,A ) K(A ,V ) K(V ,V ) K(V ,A ) , , ??n − ??n ?n ?n − ?n ?n ?n ?n ?n ?n

F l F l where the right sides are obtained from µ?n, Σ??n given by (5.41). Solving for the three un- knowns5 gives

F l l l l l −1 U l µb?n = K(A?n,V?n) K(V?n,V?n) µb?n (5.131a) l l ΣbF = K(Al ,Al ) K(Al ,V l ) Kb U K(V l ,Al ) (5.131b) ??n ?n ?n − ?n ?n ??n ?n ?n l l l l −1 U l T ΣeFU = K(A?n,V?n) K(V?n,V?n) Σb??n = (ΣeUF ) (5.131c) with l l Kb U K(V l ,V l )−1 K(V l ,V l )−1 ΣbU K(V l ,V l )−1 . (5.132) ??n , ?n ?n − ?n ?n ??n ?n ?n Finally, we obtain for the sought-after marginal of F l given Al

N l Y l l Q(F l Al) = (F l µF , ΣbF ) . (5.133) | N ?n | b?n ??n n=1

At this point it is instructive to verify that the obtained mean and covariance are consistent with U l the deterministic case and with the GP prior. For deterministic observations, that is Σb??n = 0, U l l l −1 we obtain Kb??n = K(V?n,V?n) and thus recover the standard GP regression distribution as l U l U l l l U l expected. If U follows its prior, that is µb?n = 0 and Σb??n = K(V?n,V?n), we obtain Kb??n = 0

5 l The first line of the system contains two equations, since it must be valid for all values of U?n. Thus a unique solution exists for three unknowns. 166 CHAPTER 5. GAUSSIAN PROCESS NEURONS and thus recover the GP prior on F l. In that case the virtual observations behave as if they were not present. Having Q(F l Al) immediately allows us to evaluate (5.127) since P(Xl F l) just provides | | additive Gaussian noise. Thus we obtain

N l Y l l Q(Xl Xl−1) = (F l µF , ΣbF + (σl )21) . (5.134) | N ?n | b?n ??n n n=1

As we can see all conditional distributions are multivariate normals and thus sampling is straightforward. In fact, it can be performed in the same way as described in section 5.3.4; only the means and covariances must be adapted due to the fact that U l is now governed by a distribution. Training is then done by reparameterizing the distributions as described in that section and performing gradient descent using the stochastic derivatives of w.r.t. the L U l U l l l variational parameters µb , Σb and model parameters W , σ . As before mini-batch training can be used since pred factorizes over the samples. L

5.6.2 Variational Inference using a Marginalized Posterior Distribution

The drawback of stochastic training procedure is the increase in the number of required training iterations due to the noise that is introduced into the model by the sampling procedure. As described in section 5.5, the activations Al will closely resemble a normal distribution due to the central limit theorem if there is a sufficient number of GPNs per layer and the weights W l have a sufficiently random distribution. In this case sampling becomes unnecessary, as the moments of P(Al) can be calculated exactly and propagated from layer to layer as described previously. Here we use the same approximative posterior as in (5.120) with the additional assumption that each marginal Q(Al) for l 1,...,L is indistinguishable from a normal ∈ { } distribution with appropriate mean and covariance, i.e.

S Y  l l  Q(Al) = Q(Al ) , Q(Al ) = Al µA , ΣA . (5.135) s? s? N s? s? s?? s=1

A graphical model corresponding to this approximate posterior is shown in fig. 5.13c.

Writing pred from (5.124) as L ZZZ L L L L L L L L L L pred = Q(A ) Q(U ) P(F A ,U ) log P(X F ) dA dU dF . (5.136) L | | shows that we first need to obtain the distribution Q(AL). This is done by iteratively calculating the marginals Q(Al) for l 1,...,L in a similar way as it was done in section 5.5.1 for fixed ∈ { } 5.6. APPROXIMATE BAYESIAN INFERENCE 167 values of U l. For l 1 we have ≥ ZZZ Q(Al+1) = Q(Al) Q(F l Al) P(Xl F l) P(Al+1 Xl) dAl dF l dXl (5.137) | | | where Q(F l Al) is given by (5.133). We first evaluate the mean and covariance of the marginal | l F L Q(F ) by following the course of action in section 5.5.1. For the mean of the response µesn =  l  EQ(F L) Fsn we obtain

T F l h l l i l l −1 U l µ = E l K(A ,V ) K(V ,V ) µ esn Q(As?) sn ?n ?n ?n b?n l T l l −1 U l = (ψ?n) K(V?n,V?n) µb?n (5.138)

l F L with ψrn as previously calculated in (5.94). Similarly for the response covariance matrix Σesnm = l l  CovQ(F L) Fsn,Fsm , we obtain that the diagonal is given by

l h l  i   ΣeF = 1 tr Kb U βl (βl )T Ωl tr ψl (ψl )T βl (βl )T (5.139) snn − ??n − ?n ?n ??n − ?n ?n ?n ?n

l l l −1 U l l with β?n , K(V?n,V?n) µb?n and Ωrtn from (5.99c). The off-diagonal elements of the covari- ance matrix evaluate to

F l X X l l l F l F l Σesnm = βrnβtm Λrtnm µesn µesm (5.140) r t −

l with Λrtnm given by (5.105). Finally,assuming that the central limit theorem holds, the marginal distribution of the approximative posterior of Al+1 is given by

l+1 l+1 Q(Al+1) = (Al+1 µA , ΣeA ) (5.141) N s? | es? s?? with

Al+1 l+1 F l µes? = W µes? (5.142a) Al+1 l+1  F l l 2 l+1 T Σes?? = W Σes?? + diag(σ ) (W ) . (5.142b)

The term pred, given by (5.124), can be identified as the expected log-probability of the L 168 CHAPTER 5. GAUSSIAN PROCESS NEURONS

observations under the marginal distribution Q(F L) and thus we can expand it as follows,

 L L  pred = E L log P(X F ) L Q(F ) | " Nl S NL L L 2 # X L 1 X X Xsn Fsn E L S log σ − , ∝ Q(F ) − n − 2 (σL)2 n=1 s=1 n=1 n

NL S NL L 2 L  L   L 2 X 1 X X (Xsn) 2 Xsn EQ(F L) Fsn + EQ(F L) (Fsn) = S log σL − . (5.143) − n − 2 (σL)2 n=1 s=1 n=1 n

The distribution Q(F L) itself is of arbitrary form, but as it can be seen from the above equation,

only its first and second moments are required to evaluate pred. For the first moment we obtain L  l  F L F L EQ(F L) Fsn = µesn with µesn given by (5.138) and the second moment evaluates to

 L 2 L   L 2 F L F L 2 EQ(F L) (Fsn) = VarQ(F L) Fsn + EQ(F L) Fsn = Σbsnn + µbsnn = 1 trκL βL (βL )T  ΩL  , (5.144) − ??n − ?n ?n ??n

L L L −1 U L L with β?n , K(V?n,V?n) µb?n and Ωrtn from (5.99c). This concludes the calculation of all terms of the variational lower bound (5.121). As with the normal approximation derived in section 5.5, the resulting objective is a fully deterministic

function of the parameters. Training of the model is performed by maximizing = reg+ pred, L −L L U l with reg given by (5.125) and pred given by (5.143), w.r.t. to the variational parameters µb , l L L ΣbU and the model parameters σl, W l. This can be performed using any gradient-descent based algorithm in a mini-batch training routine. As before, the necessary derivatives are not derived here and it is assumed that this can be performed automatically using symbolic or automatic differentiation in an appropriate framework.

5.6.3 Variational Inference using a Mean-Field Posterior Approximation

In the context of deep GPs it has been proposed (Damianou et al., 2013) to factorize the approximative posterior over the layer values, which in our case correspond to the variables Xl, l 1,...,L . We can use the same technique to perform approximative inference when the ∈ { } assumption of a normal distribution over the activations is not fulfilled, for example in the case of a feed-forward network that only has a few GPNs per layer. The approximative posterior for 5.6. APPROXIMATE BAYESIAN INFERENCE 169 this approach is given by

L ! L−1 ! Y Y Q( X L−1, A L, U L, F L) = P(Al Xl−1) Q(U l) P(F l Al,U l) Q(Xl) . { }1 { }1 { }1 { }1 | | l=1 l=1 (5.145)

The conditional distribution P(F l Al,U l) coming from the GP regression has been inherited | from the prior unchanged; thus, the approximative posterior must carry all information from the training data in Q(U l) and Q(Xl). The approximative distribution over the layer outputs Q(Xl) is unconditional; this leads to the approximative posterior being fully factorized over the layers and inference is performed using a mean-field of Xl. For the virtual observation targets U l we use the same form as in the non-factorized approximation, i.e. a multivariate normal distribution with free mean and covariance,

Nl Y l l Q(U l) = Q(U l ) , Q(U l ) = (U l µU , ΣbU ) . (5.146) ?n ?n N ?n | b?n ??n n=1

As before the virtual observations are factorized over the GPNs, but may be correlated within a GPN. For the latent layer values Xl a normal distribution factorized over the samples with free mean and covariances is used,

S Y l l Q(Xl) = Q(Xl ) , Q(Xl ) = (Xl µX , ΣbX ) . (5.147) s? s? N s? | bs? s?? s=1

This allows for covariance between different GPN units within a layer. As the occurring co- Xl variance matrix grows quadratically with the number of GPNs within a layer, Σs?? should be constrained to a diagonal matrix to limit the associated computational complexity. Note that in contrast to the non-factorizing approach, here we assume a normal distribution for the ap- proximative posterior Q(Xl) and not for the marginals P(Al) of the original model. A graphical model corresponding to this approximation is shown in fig. 5.13d.

The KL-divergence between the true posterior and the approximation Q is minimized im- plicitly by maximizing the ELBO given by

Z Z Q( X L−1, A L, U L, F L) = Q( X L−1, A L, U L, F L) log { }1 { }1 { }1 { }1 L − ··· { }1 { }1 { }1 { }1 P( X L, A L, U L, F L X0) · { }1 { }1 { }1 { }1 | d X L−1 d A L d U L d F L . (5.148) { }1 { }1 { }1 { }1 170 CHAPTER 5. GAUSSIAN PROCESS NEURONS

Substituting the distributions into that equation gives

= reg prop + pred (5.149) L −L − L L with

L Z l X l Q(U ) l reg = Q(U ) log dU (5.150) L P(U l) l=1 L−1 ZZZZ l X l−1 l l−1 l l l l l Q(X ) prop = Q(X ) P(A X ) Q(U ) P(F A ,U ) Q(X ) log L | | P(Xl F l) · l=1 | dAl dU l dF l dXl−1 dXl (5.151) ZZZZ L−1 L L−1 L L L L L L pred = Q(X ) P(A X ) Q(U ) P(F A ,U ) log P(X F ) L | | | · dXL−1 dAL dU L dF L . (5.152)

The term reg is unchanged from the stochastic and marginal inference approaches and as L before it measures the difference between the prior and posterior distribution of the virtual l observation targets U . Its value is given by (5.125). The term prop ensures the propagation L of values from layer to layer and can be interpreted as the expectation of the KL-divergences between exact and approximative latent layer distributions,

L L X h  l l l i X prop = E l KL Q(X ) P(X F ) = E l [ β] , (5.153) L Q(F ) || | Q(F ) L l=1 l=1 where ZZZ Q(F l) = Q(Xl−1) P(Al Xl−1) Q(U l) P(F l Al,U l) dXl−1 dAl dU l . (5.154) | |

Since the approximative posterior was chosen to factorize over the layers, we obtain a sum over

the layers for prop and no calculation of (intractable) marginals is necessary. The occurring L 5.6. APPROXIMATE BAYESIAN INFERENCE 171

KL-divergence β is between two normals and thus straightforward to calculate, L S l l l  X l l l  β = KL Q(X ) P(X F ) = KL Q(X ) P(X F ) L || | s? || s? | s? s=1   S l 2 1 X l −2 Xl  l Xl T l −2 l Xl diag(σ?) tr diag(σ?) Σbs?? + (Fs? µbs? ) diag(σ?) (Fs? µbs? ) + log  ∝ 2 − − Xl s=1 Σbs?? S " Nl Xl l Xl 2 ! # 1 X X Σbsnn + Fsn µsn l = − b + 2 log σl log ΣbX . (5.155) 2 (σl )2 n − s?? s=1 n=1 n

Thus we obtain for prop, L

l l l L S " Nl X  l 2 X  l  X 2 ! Σ + E l (F ) 2 µ E l F + (µ ) 1 X X X bsnn Q(F ) sn bsn Q(F ) sn bsn l prop = − + 2 log σ L 2 (σl )2 n − l=1 s=1 n=1 n # l X log Σbs?? . (5.156)

 l   l 2 In order to calculate the expectations EQ(F l) Fsn and EQ(F l) (Fsn) we need the distribu- tion Q(F l). This distribution cannot be calculated in closed form; however, by applying the law of total expectation we can transform the expectations according to

h i E l [ ] = E l E l l [ ] , Q(F ) • Q(A ) Q(F | A ) • where Q(Al) evaluates to

Z S l l−1 l l−1 l−1 Y l Al Al  Q(A ) = Q(X ) P(A X ) dX = A µ , Σe (5.157) | N s? | es? s?? s=1 with Al l Xl−1 Al l Xl−1 l T µes? , W µbs? , Σes?? , W Σbs?? (W ) . (5.158) This allows us to evaluate the expectations in a very similar way as we did in section 5.5.1, the main difference being that U l is now a random variable. We note that Q(F l Al), defined by | N Yl Z Q(F l Al) = Q(U l ) P(F l Al ,U l ) dU l , | ?n ?n | ?n ?n ?n n=1 has the same form as (5.133) from the stochastic inference approach and thus the marginal 172 CHAPTER 5. GAUSSIAN PROCESS NEURONS

can be computed in the same way as done in section 5.6.2, yielding

N l Y l l Q(F l Al) = (F l µF , ΣbF ) . (5.159) | N ?n | b?n ??n n=1

F l F l with µb?n and Σb??n given by (5.131).

Continuing with the calculation of the expectations in (5.156) gives the following results for the first moment,

l h l i h F l i EQ(F l)[Fsn] = EQ(Al) EQ(F l | Al)[Fsn] = EQ(Al) µbsn l l l −1 U l = ψ?n K(V?n,V?n) µb?n , (5.160)

l Al Al with ψrn given by (5.94) using µes? and Σes?? from (5.158). For the second moment we obtain

l 2 h l 2 i h F l i h F l 2i EQ(F l)[(Fsn) ] = EQ(Al) EQ(F l | Al)[(Fsn) ] = EQ(Al) Σbsnn + EQ(Al) µbsnn h  i = 1 tr κl βl (βl )T Ωl , (5.161) − ??n − ?n ?n ??n

l l l −1 l l l l Al where κ??n , [K(V?n,V?n)] and β?n , κ??n U?n and Ωrtn is given by (5.99c) using µes? and Al Σe from (5.158) as above. This concludes the calculation of prop. s?? L

The term pred, given by (5.152), is the expected log-probability of the observations under L the approximative distribution. As the latter factorizes over the layers, the expectation only needs to be calculated over the top-most layer L. This results in

 L L  pred = E L log P(X F ) L Q(F ) | NL S NL L 2 L  L   L 2 X 1 X X (Xsn) 2 Xsn EQ(F L) Fsn + EQ(F L) (Fsn) S log σL − (5.162) ∝ − n − 2 (σL)2 n=1 s=1 n=1 n

 L   L 2 where EQ(F L) Fsn and EQ(F L) (Fsn) are given by (5.160) and (5.161) respectively.

This concludes the calculation of all terms of the mean-field variational lower bound (5.149). Training of the model is performed by maximizing w.r.t. to the variational parameters and L the model parameters. In contrast to the non-factorizing approach presented in the previous Xl Xl section, the layer means µb and covariances Σb are now part of the variational parameters and we must include them in the optimization of . However, they do not enter the predictions, L which will be calculated from the approximative posterior of the inducing points U l and weights l Xl Xl W . Thus µb and Σb can be discarded after training. 5.6. APPROXIMATE BAYESIAN INFERENCE 173

5.6.4 Comparison and Discussion

Three methods for approximate Bayesian inference were presented in this section: one based on stochastic approximations (section 5.6.1), and two based on approximations of intermediate distributions. Out of these two, one relies on the central limit theorem (section 5.6.2) and the other prescribes a mean-field factorization over the posterior (section 5.6.3). Using sampling to approximate the marginal distribution Q(F L) puts no constraints on the distributions occurring in the model and thus gives a GPN feed-forward network of two or more layers the power to learn arbitrary distributions P(XL X0) for an input X0. However, | the stochasticity in the objective function introduced by sampling causes the training to require significantly more iterations than with a deterministic objective function and could therefore be prohibitive if GPNs are to be used as a drop-in replacement for conventional artificial neurons. The variational method proposed in section 5.6.2 uses the central limit theorem to approxi- mate the layer distributions with multivariate normals and thus avoids the stochasticity costs by allowing analytic calculation of the expected value of the objective function. The variational approach using the mean-field approximative posterior (section 5.6.3) is a method that was first proposed for training of deep GPs (Damianou et al., 2013) and, since a GPN feed-forward net- work can be treated as a deep GP when the calculation of the activations are absorbed into the GP covariance function, using a method developed for that model seems to be a sensible choice. Furthermore, the factorizing approximative posterior does not rely on having enough GPNs per layer and weights that are sufficiently random so that the central limit theorem holds. Indeed, the resulting variational optimization objective will penalize model parameters that lead to non-Gaussian posterior distributions on the intermediate layer values Xl and thus enforce a self-consistent solution. Thus, so far the usage of the mean-field approach proposed in section 5.6.3 seems advan- tageous; however, on closer examination the factorization of the posterior distribution over the layers is very problematic. The values of neighboring layers are highly correlated, since the values of layer l + 1 are determined by the weight matrix and activation function from the values of layer l, with only a minor amount of additional noise added from the GP regression l uncertainty of the activation function and the GPN standard deviations σn. Approximating such highly correlated values with a factorizing distribution, which assumes independence per definition, is known to lead to a gross underestimation of variances when variational inference is performed (Bishop, 2006). For a GPN stack this is particularly problematic, because it leads to the loss of input-dependent uncertainty of the activation functions in the GPNs, as we will show in the following illustrative example. Consider a very simple GPN layer consisting of a single unit and one incoming connection, 174 CHAPTER 5. GAUSSIAN PROCESS NEURONS

5 5

0 0

-5 -5 -4 -2 0 2 4 -4 -2 0 2 4

(a) GPN activation function (b) mean-field approximation

Figure 5.14: Effects of the mean-field approximation on the propagation of a sample through a GPN activation function. (a) An exemplary activation function of a GPN has varying amounts of uncertainty due to varying distances from the virtual observations. The total uncertainty 2 2 is the sum of the input-dependent variance σb (v) and the input-independent variance σ . (b) The mean-field variational approximation using a distribution factorizing over the layers leads to the loss of the input-dependent variance and thus a massive underestimation of the total uncertainty.

l l i.e. Nl−1 = Nl = 1. Let the weight matrix be the identity matrix (W11 = 1) and let U be chosen so that the activation function is any function but with varying amounts of noise due to uncertainty of the GP regression as shown in fig. 5.14a. Assume that we have a single sample and its incoming value Xl−1 = v is observed.6 Then the marginal of the response F l will be given by P(F l) = (F l v, σ2(v)) N | b 2 where σb (v) is the uncertainty of the GP representing the activation function at position v. The distribution of the layer value Xl is given by

P(Xl F l) = (Xl F l, σ2) | N |

2 with σ being the additional, input-independent noise of the GPN. For this layer prop from L eq. (5.151) can be written as a KL-divergence

ZZ l l l l P(F ) Q(X ) l l l l l l  prop = P(F ) Q(X ) log dF dX = KL Q(F ,X ) P(F ,X ) L P(F l) P(Xl F l) || | between the joint distribution P(F l,Xl) and the factorized approximation Q(F l,Xl) = P(F l) Q(Xl).

6Because we are dealing with a single sample and a single GPN per layer we omit the sample and unit indexing in this example. 5.6. APPROXIMATE BAYESIAN INFERENCE 175

Evaluating the exact joint distribution P(F l,Xl) = P(F l) P(Xl F l) gives | ! ! ! ! F l v σ2(v) σ2(v) P(F l,Xl) = µ, Σ with µ = , Σ = b b . l 2 2 2 N X v σb (v) σb (v) + σ

It can be shown analytically (Bishop, 2006) that for a two-dimensional Gaussian distribution

! ! Z 1 Z Z P(Z1,Z2) = µ , Σ N Z2 the best factorizing approximation Q(Z1,Z2) = Q(Z1) Q(Z2), in the sense that the KL-divergence between Q and P is minimized, is given by

Z −1 Z −1 Q(Z1) = Z1 µ , Λ , Q(Z2) = Z2 µ , Λ , N | 1 11 N | 2 22

Z −1 where Λ , (Σ ) is the precision matrix of P(Z). Using this result to calculate the optimal approximation Q(Xl) gives Q(Xl) = Xl v, σ2) . N | The mean matches the exact joint distribution, however we see that only the input-independent uncertainty σ2 of a GPN is kept and thus propagated to the next layer by the factorizing variational approximation. This can be seen in fig. 5.14b.

Consequently, the variance is not only underestimated by the prop term coming from L the mean-field variational approach, but a central element of our model, namely the input- 2 dependent uncertainty of the activation function σb (v) has been lost. The marginalizing ap- proach (section 5.6.2) does not suffer from this issue and thus should always be preferred when the number of GPNs is sufficient and the weights are initialized randomly. 176 CHAPTER 5. GAUSSIAN PROCESS NEURONS

5.7 Benchmarks and Experimental Results

Due to limitations of computational resources and time, the results presented here corre- spond only to the parametric GPN model as described in section 5.3. Experiments using variational Bayesian inference as presented in section 5.6 were still ongoing as this thesis was printed and thus their results could not be included with adequate statistical confi- dence. The experiments presented here were done by Basalla (2017) under my guidance and results are reproduced here in excerpts.

This section describes the experiments performed to estimate the performance of a GPN feed-forward network. The datasets on which the models are evaluated is introduced and the performance is compared to conventional feed-forward neural networks that were regularized using fast Dropout (Wang et al., 2013). Furthermore, this section discusses the activation func- tions learned by the GPNs in different layers of the feed-forward network and the computational and memory requirements.

5.7.1 Benchmark Datasets

To evaluate how well a parametric GPN model performs on real-world classification problems, we test it on three datasets from the UCI Machine Learning Repository (Lichman, 2013) as well as on the MNIST database of handwritten digits (Lecun et al., 1998). Hereby our aim is not yet to beat the current state of the art performance on these datasets, since the litera- ture (Agostinelli et al., 2014) shows that trainable activation function are mostly beneficial to large convolutional models for image classification. Instead, we focus on verifying the imple- mentation, efficiency and trade-offs of parametric GPN feed-forward models on different kinds of classification datasets compared to conventional neural networks with a fixed sigmoidal or hyperbolic tangent activation function regularized by the Dropout technique (Srivastava et al., 2014). The datasets were primarily chosen so that they do not only differ in size but also cover different kinds of features and targets. Consequently, successful training on this selection of datasets shows that GPNs are applicable to a variety of tasks and it is worthwhile to implement convolutional architectures based on GPNs to tackle current image classification problems on large datasets such as CIFAR-100 (Krizhevsky and G. Hinton, 2009) and ImageNet (Deng et al., 2009). We shortly describe the properties of the dataset before continuing with the training procedures. 5.7. BENCHMARKS AND EXPERIMENTAL RESULTS 177

UCI Letter Recognition Dataset

The UCI Letter Recognition dataset, first used in (Frey et al., 1991), consists of 20 000 samples with 16 continuous input features per sample. All features are calculated from pixel images of the 26 capital characters of the English alphabet, with each image showing a single letter in one of 20 different fonts. Furthermore the character images are randomly distorted to increase the variation of the dataset. The 16 precomputed features, consisting of statistical moments and edge counts, are used as input to the classifier. The objective is to identify the character.

UCI Adult Dataset

The UCI Adult dataset, introduced by (Kohavi, 1996), consists of 6 continuous and 8 categorical features containing census data taken from 48 848 U.S. citizens collected in the year 1994. The continuous features consists amongst others of the age, weight, work hours per week and years of educations. The categorical features include such information as highest obtained degree, martial status, race, sex and country of origin. The binary target objective is to predict whether a person’s income exceeded 50 000 USD or not. This dataset contains missing features for some samples, which we replaced by an additional “unknown” category for categorical features and by zero for continuous features.

UCI Connect-4 Dataset

In contrast to the previous datasets, the Connect-4 dataset (John Tromp, Lichman (2013)) consists only of categorical features. Each of its 42 features represents the state of a field on the board of a Connect-4 game with a board size of 6 7. The categories encode whether the × position is currently occupied by player 1, by player 2 or is free. The dataset contains all legal positions in which neither player has won yet and in which the next move is not forced; in total the dataset contains 67 557 samples. The data is used to classify the game result for player 1 if she plays optimally into one of the three classes: “win”, “loss” or “draw”.

MNIST Dataset

The MNIST database of handwritten digits (Lecun et al., 1998) is one of the most commonly used machine learning datasets for image classification. It consists of 60 000 training and 10 000 test examples. We further split the training examples into a training and validation set. Each input consists a 28 28 pixel image of a handwritten digit that has to be classified. × This task is similar to classification on the letter-recognition dataset, with the main difference being, that instead of using precomputed image features the model receives the raw, grayscale 178 CHAPTER 5. GAUSSIAN PROCESS NEURONS

image data as input. The MNIST dataset has been widely used to evaluate the performance of neural network based classifiers (G. E. Hinton, 2007; G. E. Hinton and R. R. Salakhutdinov, 2006; G. E. Hinton, Srivastava, et al., 2012; R. Salakhutdinov et al., 2009) and is thus a natural choice for evaluating trainable activation functions in a neural architecture.

5.7.2 Training Procedures

In our preliminary testing we want to demonstrate that parametric GPNs can be used as a drop-in replacement for conventional artificial neurons. Consequently, we perform the initial experiments using the parametric GPN normal approximation developed in section 5.5, since it results in a deterministic loss function, allowing to optimize the model parameters using gradient descent just like in a conventional neural network. As stated in theoretical analysis of the computational complexity in table 5.2, propagating mean, variance and the full covariance matrix from layer to layer comes with different compu- tational and memory requirements. To analyze the trade-offs between runtime and prediction accuracy of these different approximations we train parametric GPNs by only propagating the mean, by propagating the mean and only the diagonal of the covariance matrix and by propa- gating the mean and the full covariance matrix. To only propagate the mean, all layer variances Xl Σes?? are assumed to be zero and only eqs. (5.84a) and (5.87) are evaluated for each layer; thereby the probabilistic nature of the model is eliminated and the GPNs become conventional neurons with an activation function that is given by interpolating between their inducing points Xl and targets. Including the variance diag(Σes??) in the computations is done by setting all off- diagonal elements of the layer covariance matrices to zero and using eqs. (5.84b) and (5.101). Xl For the full model we propagate Σes?? from layer to layer by employing eqs. (5.84b) and (5.102). The wall clock time and memory requirements of each approach are measured. The virtual observations of the parametric GPN model can be shared between different GPNs resulting in GPNs that use the same activation function. Obviously this reduces the num-

ber of model parameters and furthermore the computational complexity, since [K(V?n,V?n) + −1 diag(S?n)] in eqs. (5.47a) and (5.47b) is only computed once per group of GPNs with shared virtual observations. To assess the impact of sharing on model accuracy, we train two variants of GPN feed-forward networks: an independent, where each GPN has its individual virtual observations, and a layer-shared, where all GPNs within a layer share one activation function. l Preparatory experiments showed that the inducing points Vrn of each GPN remained mostly unchanged during training; hence the 14 inducing points are initialized using linear spacing in the interval [ 2, 2] and kept fixed during training. The corresponding targets U l are either − rn l initialized from a standard normal distribution or set equal to Vrn, resulting in the identity 5.7. BENCHMARKS AND EXPERIMENTAL RESULTS 179 function. We also tried initializing the targets to values of a well-known activation function, such as the hyperbolic tangent or the rectifier, but found no significant benefit. All virtual l observation variances Srn are initialized to the constant value of √0.1 and optimized during training alongside with the targets. In preparatory experiments it became apparent that the observation variances were driven to zero. To avoid this a penalty of the form

In preparatory experiments it became apparent that the observation variances were driven to zero. To avoid this a penalty of the form
$$L_S(\theta) = \alpha_S \sum_{l=1}^{L} \frac{1}{N_l} \sum_{n=1}^{N_l} \frac{1}{R} \sum_{r=1}^{R} \sigma\!\left( \frac{\beta_S}{|S^l_{rn}|} \right) \qquad (5.163)$$

where $\sigma$ is the logistic function, $\alpha_S = 0.1$ and $\beta_S = 10^{-3}$, was added to the loss function. This will become unnecessary once the variational Bayesian training objective derived in section 5.6 is used.
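A direct transcription of this penalty (the function and argument names are ours, and the expression assumes the form of eq. (5.163) given above):

    import numpy as np

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    def variance_penalty(S_layers, alpha_S=0.1, beta_S=1e-3):
        # S_layers: list of (N_l, R) arrays of virtual observation variances.
        # Each term saturates at one as its variance approaches zero, so
        # collapsing variances are discouraged.
        total = 0.0
        for S in S_layers:
            N_l, R = S.shape
            total += logistic(beta_S / np.abs(S)).sum() / (N_l * R)
        return alpha_S * total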

The weights $W^l_{nm}$ of each layer $l$ are initialized using a uniform distribution with support $[-r, r]$ where $r = \sqrt{6 / (N_{l-1} + N_{l+1})}$. This initialization has been recommended by Glorot and Bengio (2010) for training deep neural networks using the hyperbolic tangent activation function. The motivation behind choosing $r$ as described is to ensure that at the beginning of training the activations of most neurons start in the linear range of the hyperbolic tangent function. Although we are not using this function, it is desirable for the activations of GPNs to fall within the range of their inducing points; thus this weight initialization method is applicable here. The continuous input features of all datasets are rescaled and shifted to lie in the interval $[0, 1]$; for categorical features the one-hot encoding scheme is used. It encodes a categorical feature as a vector having as many entries as there are categories, with the entry for the active category set to one and all other entries set to zero. The split between training and test set is kept as provided in the datasets; furthermore the original training set is randomly split into a smaller training set and a validation set consisting of 10% of the original training samples. The test set is only used to report final classification accuracies and is not used in any way during training.
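The weight initialization is a one-liner; a minimal sketch (the function name is ours):

    import numpy as np

    def init_weights(n_below, n_this, n_above, rng=np.random):
        # Uniform init on [-r, r] with r = sqrt(6 / (N_{l-1} + N_{l+1})),
        # following the recipe attributed to Glorot and Bengio (2010) above.
        r = np.sqrt(6.0 / (n_below + n_above))
        return rng.uniform(-r, r, size=(n_this, n_below))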

Training is performed by minimizing the expected loss $L(\theta)$ as calculated by the unscented transform (5.113) of the softmax cross-entropy loss (5.53) using the Adam optimizer (D. Kingma and Ba, 2014). The initial learning rate is $10^{-3}$ and is decreased by a factor of 10 each time the validation loss stops improving for a predefined number of training iterations. When the learning rate reaches $10^{-6}$ and no improvement is seen on the validation set, training is terminated and the model parameters of the best iteration as seen on the validation set are used to calculate the reported classification accuracies. Each experiment is repeated five times with different random seeds for the initialization.
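The learning rate schedule can be sketched as a small state machine (the function, the patience parameter and the return convention are ours):

    def update_learning_rate(lr, val_loss, best_loss, wait, patience):
        # Divide the learning rate by 10 whenever the validation loss has
        # not improved for `patience` iterations; stop once 1e-6 is reached
        # and the validation loss still does not improve.
        if val_loss < best_loss:
            return lr, val_loss, 0, False
        if wait + 1 < patience:
            return lr, best_loss, wait + 1, False
        if lr <= 1e-6:
            return lr, best_loss, 0, True   # terminate training
        return lr / 10.0, best_loss, 0, False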

For comparison we also train a conventional neural network of the same architecture with a fixed hyperbolic tangent activation function, regularized using the fast Dropout method (Wang and Manning, 2013). Preparatory experiments showed that enforcing increasing activation functions (section 5.4) resulted in slightly worse results than training without that constraint. Since this constraint imposes a significant increase in computational complexity due to the introduction of virtual derivative observations, the experiments presented here were all done without enforcing monotonicity.

5.7.3 Preliminary Results

An exemplary loss curve and learning rate schedule of a GPN feed-forward network is shown in fig. 5.15. In this example the initial learning rate of $10^{-3}$ is automatically decreased after 400 iterations to $10^{-4}$ and then again after 9 000 iterations to $10^{-6}$. Training is terminated after 20 000 iterations. Like in a conventional feed-forward neural network, the losses decrease smoothly due to the use of a fully deterministic loss.

The accuracies of all experiments are reported in table 5.3 and the resource usage of the different propagation approaches is shown in table 5.4. As expected, the conventional neural network with a fixed activation function is fastest; however, relative to that, using a GPN is only four times slower when propagating the means and the variances. This includes both the times for forward propagation and for calculating the gradient w.r.t. the weights and virtual observations using backpropagation. The propagation of the GPN mean and variance leads to significantly better results than the propagation of the mean alone on all datasets. However, the propagation of the full covariance matrix is about 12 times computationally more expensive than the propagation of the mean and variance and did not show significant benefits to the accuracy of the model in preparatory experiments. Sharing the GPN virtual observations over all GPNs within a layer does not provide any benefits on the test accuracy in our experiments, thus suggesting that the flexibility of having a separate activation function per GPN is beneficial to the model and that the increase in the number of parameters does not lead to overfitting. On the UCI Adult dataset GPNs profited from initializing their virtual observations so that the initial activation function is the identity function; however, doing the same on the UCI Letter Recognition dataset did not yield any significant improvement.

Figure 5.16 shows examples of activation functions that are commonly encountered in a GPN feed-forward network after training it on the UCI Connect-4 dataset. The activation functions in the first layer vary much more strongly than those in the upper two layers. Most commonly, functions in the first layer resemble sine-like functions and are approximately axis-symmetric w.r.t. the y-axis.
Table 5.3: Misclassification rates of different models on the benchmark datasets. The error is calculated as the number of misclassified samples divided by the number of total samples. The error rates are given as the average of five experiments with a confidence interval of 68%. For evaluation, the model parameters at the training iteration with the lowest validation loss were used. (Columns: Dataset, Layer sizes, Units, Sharing, Act. init., Train error, Val. error, Test error.)

Propagation using                 Memory usage   Iteration time
fixed act. function                30 MB            9 ms
GPN means only                     94 MB           29 ms
GPN means and variances           113 MB           36 ms
GPN means and full covariances    227 MB          118 ms

Table 5.4: Memory usage and time for performing one iteration of forward and backpropagation of a layer of 50 neurons or GPNs using different propagation methods. Values should only be used for relative comparisons within this table, since some irrelevant operations, such as data reading, are included in the memory usage and iteration time. Memory usage includes the memory used for storing intermediate results for the calculation of the derivatives using backpropagation.


Figure 5.15: Training, validation and test losses during training of a parametric GPN feed-forward network with mean and variance propagation on the UCI Connect-4 dataset. The lower panel shows the scheduling of the learning rate. The progression of the loss is smooth and stable due to the use of a fully deterministic objective.


Figure 5.16: Three activation functions from each layer of a parametric GPN feed-forward network that was trained on the UCI Connect-4 dataset. The virtual observations are shown as red dots together with their standard deviation. Activation functions in lower layers show oscillatory behavior, while the activation functions in the top-most layer resemble sigmoid-shaped functions.

As we move to the second and third layer, sigmoid-shaped and linear functions become more common. This might indicate that the first layer exploits a periodicity in the input data while the top two layers act as feature detectors by gating their inputs.

The preliminary results presented here show that GPNs have consistently better performance than a conventional neural network using the hyperbolic tangent activation function on real-world datasets of small size. When the conventional neural network is regularized using the Dropout method, the performance difference becomes marginal, with GPNs winning on some datasets while conventional neural networks perform better on others. The benefit of activation function initialization is dataset dependent and sharing virtual observations does not yield improvements. Enforcing monotonicity of the activation function has not proven helpful. Since the computational resources available for these experiments were limited, it was not possible to perform tuning of the network architecture, i.e. experimenting with the number of layers and their sizes.

Chapter 6

Conclusion

We proposed two novel classes of activation functions for artificial neural networks. The first class of activation functions, introduced in chapter 3, allows neural networks to automatically introduce multiplicative interactions during training and thus reduces the need to hand-engineer such interactions or use computationally expensive methods to search over different network architectures. Compared to conventional activation functions, the proposed family of activation functions shows favorable results on datasets based on a multiplicative structure but displays slightly worse performance on standard image recognition tasks. The reason for this behavior is believed to be a more complex error surface and a greater risk of overfitting due to the increased expressive power of a neural network using such activation functions.

To evade these drawbacks, in chapter 5 we proposed to place a Gaussian process prior over the activation function of each neuron. This has three consequences. First, the neuron using this activation function becomes a probabilistic unit, allowing it to handle uncertain inputs and estimate the confidence of its output. Second, the complexity of the activation function is penalized in a probabilistically sound Bayesian setting; this guards the model against overfitting. Third, the squared exponential covariance function ensures that all activation functions are smooth and therefore continuous derivatives are available. This resulted in the non-parametric GPN model, which shows these theoretically attractive properties, but performing inference in it is expensive due to its non-parametric nature. An overview of the course of action we took to make GPNs tractable is shown in fig. 6.1. Starting from a non-parametric model we derived a variational approximation of the posterior. Based on methods proposed for sparse GP regression, we introduced an auxiliary model, the parametric GPN, that provides inexpensive inference but is also less attractive since inference is performed by maximizing the likelihood. We then showed that it is possible to recover the non-parametric GPN model by placing an appropriate prior over the parameters of a parametric GPN.

Figure 6.1: Overview of the family of GPN models and their relationships as well as training and inference methods.

Furthermore, we showed that the distribution of activations in both randomly initialized and fully trained neural networks closely resembles a normal distribution due to the central limit theorem. Taken together, these two steps allowed us to derive a fully deterministic, variational objective to train a non-parametric GPN by stochastic gradient descent. This objective has the same functional structure as that of a conventional neural network and thus GPNs can be directly included in CNNs and RNNs or any other architecture that uses neurons.

Implementing this model with high computational performance made it necessary to develop new methods for the efficient evaluation of expressions on GPUs, as existing software packages are not very efficient when dealing with the large number of small matrices that occur in our model. We wish to mention that this thesis mostly focused on the mathematical point of view of the proposed models; nevertheless a significant amount of time and effort went into optimizing the implementation of fractional exponentiation and GPNs so that their computational performance becomes comparable to that of conventional artificial neurons. This technical work is necessary to make these models applicable to the large datasets that have become common in machine learning in the last few years. Although a full description of the employed implementation techniques and algorithms goes beyond the scope of this work, we believe that the methods developed in chapter 4 for the efficient derivation of elementwise defined tensors are applicable to a wide variety of problems inside and outside the field of machine learning and thus decided to present them here.

In preliminary experiments we showed that on classification and regression tasks the parametric GPN performs at least as well as a Dropout regularized neural network with comparable computational performance. Due to time constraints we were not yet able to perform experiments on large datasets or using CNNs, which have proven to benefit the most from novel activation functions over the course of the last years.

6.1 The Relation to Deep Gaussian Processes

Deep Gaussian processes (Damianou and Lawrence, 2013) are a framework for the hierarchical composition of GP functions. Similar to the GPN model, the outputs of one GP are used as the input for another one; thus graphical models resembling the structure of a feed-forward neural network can be formed. Deep GPs also employ the variational sparse GP method using inducing points developed by Titsias (2009) to make inference tractable. However, as we will show now, GPNs have a number of advantages over deep GPs, both in model complexity and in efficiency of inference. A GP within the deep GP framework takes multidimensional input, i.e. each input connection adds an input dimension to the GP it connects to. The connections in a deep GP do not use weights to compute a weighted sum as is done in the GPN model. Instead, each GP uses the ARD covariance function, which has an individual lengthscale parameter per input dimension.

By interpreting the lengthscale of the ARD covariance function as the inverse of a weight, we can write the covariance function of a deep GP as
$$k_{\mathrm{ARD}}(y, y') = \exp\left[ -\sum_d w_d^2 \, (y_d - y'_d)^2 \right].$$

Compared to that, the effective covariance function of a GPN is

$$k_{\mathrm{GPN}}(y, y') = \exp\left[ -\left( \sum_d w_d \, (y_d - y'_d) \right)^{\!2} \right].$$
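Numerically the distinction amounts to where the square is taken; a minimal sketch (the function names are ours):

    import numpy as np

    def k_ard(y, yp, w):
        # deep GP: square each weighted difference, then sum over dimensions
        return np.exp(-np.sum((w * (y - yp)) ** 2))

    def k_gpn(y, yp, w):
        # GPN: sum the weighted differences, then square the scalar projection
        return np.exp(-np.sum(w * (y - yp)) ** 2)

Since $k_{\mathrm{GPN}}$ depends on its arguments only through the projections $w^T y$ and $w^T y'$, it is constant along all directions orthogonal to $w$; this is exactly the inducing-hyperplane behavior discussed below.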

Thus, taking the square before or after the summation is what distinguishes GPNs from deep GPs in their essence. Although at first glance this seems to be a rather small difference, it leads to a series of consequences that clearly distinguishes both models.

The first consequence is that each GP in the deep GP framework works in a multidimensional function space. The dimensionality is determined by the number of input connections and thus in a feed-forward model it equals the number of units in the previous layer. Hence the inducing points of the virtual observations used for efficient inference are also multidimensional. This implies that the number of virtual observations required to evenly cover the input space scales exponentially with the number of input dimensions and thus incoming connections. Figure 6.2a shows the predictive mean of a two-dimensional GP with the ARD covariance function and four observations. As one moves further away from these observations, the predictive mean returns to zero. On the other hand, a GPN, like every artificial neuron, computes the projection of its inputs onto its weight vector, resulting in a scalar value. Thus, no matter how many input connections are present, a GPN always works in a one-dimensional function space. Hence the inducing points of the virtual observations are also one-dimensional and the number of virtual observations per GPN is unaffected by the number of incoming connections. Figure 6.2b shows the predictive mean of a GP using a projection of a two-dimensional input space and four observations. Thus inducing points become inducing lines, or inducing hyperplanes when more than two dimensions are concerned. It might be argued that the expressive power is vastly reduced by using a projection; however, this is not the case in a GPN feed-forward network, as the following argument demonstrates. Figure 6.2d also shows the predictive mean of a GP using a projection of a two-dimensional input space but with different weights. Assume that fig. 6.2b and fig. 6.2d are the outputs of two GPNs located in the same layer. For the sake of argument, further assume that this particular layer consists only of these two GPNs. Then the activation of a GPN in the subsequent layer is formed by a linear combination of the outputs of these two GPNs. The resulting activation (using equal weights) is shown in fig. 6.2c and, as can be seen, it produces functions varying in both input dimensions. Here, the number of virtual observations required to evenly cover the input space scales linearly with the number of input dimensions. Furthermore, the virtual observations can now be interpreted as a grid in input space, thus making it unlikely that an input point is located far away from all inducing hyperplanes.

The second consequence is that the inputs to a GP in a deep GP model cannot converge to a normal distribution because no linear combination (as in a neural network) is performed. This leaves two methods for training and inference of a deep GP: stochastic variational inference (section 5.6.1), which is computationally expensive due to sampling, or using a mean-field variational posterior (section 5.6.3), which is a bad fit for the model and thus leads to poor propagation of uncertainties as discussed in section 5.6.4.
Furthermore, the mean-field outputs of each GP within a deep GP must be inferred alongside the model parameters during minimization of the variational objective function, leading to many more parameters to optimize. Salimbeni and Deisenroth (2017) also observed that deep GPs are difficult to train for these reasons and reverted to a stochastic inference algorithm to avoid this problem, albeit at significantly higher computational costs. On the other hand, the central limit theorem is applicable to the activation of each GPN, guaranteeing that the activations will converge to a normal distribution if a GPN has a sufficient number of input connections. This leads to the marginally normal variational posterior that we derived in section 5.6.2. The variational objective function resembles the structure of a neural network, which makes GPNs directly usable in other network architectures such as RNNs and CNNs simply by adjusting eq. (5.142) accordingly.

Taken together, both consequences show that the design of a GPN leads to a more sound and efficient training procedure with fewer parameters to optimize compared to a deep GP. These advances were possible by using a weight projection inside a standard SE covariance function instead of an ARD covariance function. This crucial difference is what enabled us to derive the vastly more efficient variational objective. In summary, from a neural network viewpoint we have introduced a novel, stochastic, self-regularizing activation function that can be integrated into existing neural models with modest effort. From a GP viewpoint we introduced the idea of learnable projections into deep Gaussian processes, allowing us to derive a novel variational posterior that makes them as accessible and easy to train as neural networks.


Figure 6.2: GP with ARD covariance function as it occurs in a deep GP (top) versus GPNs using projections (bottom) in two-dimensional space. (a) The predictive mean of a GP in two-dimensional space using the ARD covariance function and four observations placed in two-dimensional space. (b) The predictive mean of a GP using a projection of two-dimensional inputs and four one-dimensional observations that are represented as lines in input space. (d) Shows the same as (b) but using a different projection. (c) A linear combination of the predictive means of (b) and (d) as it occurs in the activation of a GPN receiving inputs from the GPNs shown in (b) and (d). Taken together, their observations form a grid in two-dimensional space.

6.2 Future Work

Many extensions to and applications of the proposed model are imaginable.

The behavior of the activation function at infinity

Although the GPN can approximate all popular activation functions within a limited range, the value of the GPN activation function will always return to zero as one moves towards negative or positive infinity. This follows directly from using the zero mean function and the squared exponential covariance function. However, in CNNs for image processing a neuron often fulfills the function of a feature detector that measures how well the image patch in its receptive field matches a reference pattern encoded in its weights. The usually employed logistic function thus serves as a soft threshold that results in an output of one if the pattern is matched and zero if not. When using GPNs an issue could arise when a unit obtains a very good match. In this case the resulting very high activation value could be outside the range of the inducing points and thus the thresholding behavior may be impacted. Two possible solutions exist. One solution is to extend the covariance function of the employed GP. For this it should be remembered that a GP can also be interpreted as linear regression in a feature space with the weights following a normal prior. Thus, by using the identity function and the step function as the basis functions (feature maps) of this feature space, a GP with the appropriate covariance function, corresponding to the scalar product in feature space, will produce a linear combination of these basis functions. We can add the so-defined covariance function to the squared exponential covariance function to obtain a GPN that can represent any function within the range of its inducing points and also control its behavior at infinity by means of these basis functions. This approach requires adapting the equations for the propagation of uncertainty to the extended covariance function. Another solution is to change the mean function of the GP, as discussed below.
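Concretely, writing $\theta$ for the step function and $\sigma_1^2$, $\sigma_2^2$ for the prior variances of the two basis-function weights (our notation, not that of the main text), the extended covariance function would take the form
$$k_{\mathrm{ext}}(x, x') = k_{\mathrm{SE}}(x, x') + \sigma_1^2 \, x \, x' + \sigma_2^2 \, \theta(x) \, \theta(x'),$$
since a linear model $f(x) = a\,x + b\,\theta(x)$ with independent weights $a \sim \mathcal{N}(0, \sigma_1^2)$ and $b \sim \mathcal{N}(0, \sigma_2^2)$ has covariance $\mathrm{Cov}[f(x), f(x')] = \sigma_1^2 x x' + \sigma_2^2 \theta(x)\theta(x')$.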

Non-zero activation function means

The GPN has been developed using the zero mean function for the GP representing its activation function. Using a particular mean function leads to that function having the highest prior probability in the distribution over activation functions. The zero mean function used so far assigns the highest prior probability to a GPN which has constant zero output regardless of its inputs and thus behaves as if it was not present in the network. In consequence, the zero mean function regularizes GPN networks by keeping the number of active GPNs in each layer to a minimum.

(a) identity (b) rectifier (c) tanh

Figure 6.3: Using established activation functions as mean functions for the GPN. This figure shows samples from the resulting prior before training.

By using other mean functions we can impose different regularization goals. For instance, a GPN using the identity function as its mean function (fig. 6.3a) will assign the highest probability to a completely linear GPN, which thus behaves as if its incoming connections were directly connected to its outgoing connections. This mean function therefore regularizes GPN networks by keeping the number of non-linearities, which can be interpreted as the effective number of layers, to a minimum. We can also use an established activation function like the rectifier (fig. 6.3b) or the hyperbolic tangent (fig. 6.3c) as the GPN mean function. If the virtual observations are also initialized to be zero, the GPN network starts training as if it was a conventional neural network with that particular activation function. However, it has the power to modify the activation function to obtain a better fit for the given training data. But, as before, a deviation from the prescribed mean function is penalized by the prior. Using non-zero mean functions requires adjusting the equations for the propagation of uncertainties through the GPN network. The best method depends on the particular mean function, but in general the unscented transform can be used to propagate the marginal normal distributions through any mean function.
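Samples from such a prior, as shown in fig. 6.3, can be drawn with a few lines of numpy; a minimal sketch, assuming a unit lengthscale for the squared exponential covariance function:

    import numpy as np

    def se_kernel(x1, x2, lengthscale=1.0):
        d = x1[:, None] - x2[None, :]
        return np.exp(-d ** 2 / (2.0 * lengthscale ** 2))

    x = np.linspace(-3.0, 3.0, 200)
    K = se_kernel(x, x) + 1e-8 * np.eye(x.size)  # jitter for numerical stability
    mean = np.tanh(x)                            # tanh mean function (fig. 6.3c)
    samples = np.random.multivariate_normal(mean, K, size=3)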

Empirical evaluation of the variational posterior

The experiments shown in section 5.7 were performed using the parametric GPN with maximum likelihood inference. They showed that even the parametric model is already capable of beating the performance of Dropout regularized neural networks on some real-world datasets. A full set of experiments using the variational posterior could not be performed before the printing of this work due to limitations of available time and computational resources. Since the variational training objective is fully specified, a full evaluation of its empirical performance is the next logical step.

Application of GPNs in convolutional and recurrent networks

Novel activation functions like the ReLU and leaky ReLU have been developed for and were most successful in CNNs. Recurrent neural networks, on the other hand, have proven to benefit from an attention mechanism realized by multiplicative interactions, which can be implemented using a GPN. Thus it is natural to evaluate the proposed model on these architectures. Implementation is straightforward since a GPN retains the structure of a neuron. However, due to the large size of CNN and RNN models combined with our limited computational resources we were not yet able to perform these experiments.

Bibliography

Abel, N.H. (1826). “Untersuchung der Functionen zweier unabhängig veränderlichen Größen x und y, wie f(x, y), welche die Eigenschaft haben, daß f(z, f (x, y)) eine symmetrische Function von z, x und y ist.” In: Journal für die reine und angewandte Mathematik 1826.1, pp. 11–15.

Adkins, William A. and Steven H. Weintraub (1999). Algebra: An Approach via Module Theory (Graduate Texts in Mathematics). Springer.

Agarwal, Ravi (2001). Fixed point theory and applications. Cambridge, UK New York, N.Y., USA: Cambridge University Press.

Agostinelli, Forest, Matthew Hoffman, Peter Sadowski, and Pierre Baldi (2014). “Learning activation functions to improve deep neural networks”. In: arXiv preprint arXiv:1412.6830.

Alain, Droniou and Sigaud Olivier (2013). “Gated autoencoders with tied input weights”. In: International Conference on Machine Learning, pp. 154–162.

Andoni, Alexandr, Rina Panigrahy, Gregory Valiant, and Li Zhang (2014). “Learning Polynomials with Neural Networks”. In: Proceedings of the 31st International Conference on Machine Learning. Ed. by Eric P. Xing and Tony Jebara. Vol. 32. Proceedings of Machine Learning Research 2. Bejing, China: PMLR, pp. 1908–1916.

Arnol’d, Vladimir Igorevich (2013). Mathematical methods of classical mechanics. Vol. 60. Springer Science & Business Media.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio (2014). “Neural machine translation by jointly learning to align and translate”. In: arXiv preprint arXiv:1409.0473.

Bahdanau, Dzmitry, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio (2016). “End-to-end attention-based large vocabulary speech recognition”. In: Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, pp. 4945–4949.


Barron, Andrew R (1993). “Universal approximation bounds for superpositions of a sigmoidal function”. In: IEEE Transactions on Information theory 39.3, pp. 930–945. — (1994). “Approximation and estimation bounds for artificial neural networks”. In: Machine Learning 14.1, pp. 115–133.

Basalla, Marcus (2017). “Gaussian Process Activation Functions for Stochastic Neural Net- works”. Technical University Munich.

Bayer, Justin (2013). Climin: optimization, straight-forward. URL: http://climin.readthedocs.org.

Bengio, Yoshua (2009). “Learning deep architectures for AI”. In: Foundations and Trends in Machine Learning 2.1, pp. 1–127.

Bergstra, James, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio (2010). “Theano: A CPU and GPU math compiler in Python”. In: Proc. 9th Python in Science Conf, pp. 1–7.

Billingsley, Patrick (2013). Convergence of probability measures. John Wiley & Sons.

Bishop, Christopher M (2006). Pattern Recognition and Machine Learning. Ed. by Michael Jordan, Jon Kleinberg, and Bernhard Schölkopf. Springer Science and Media, LLC.

Blundell, Charles, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra (2015). “Weight uncertainty in neural networks”. In: arXiv preprint arXiv:1505.05424.

Boyd, Stephen and Lieven Vandenberghe (2004). Convex optimization. Cambridge University Press.

Brigham, E. (1988). Fast Fourier Transform and Its Applications. Prentice Hall.

Bromiley, Paul (2003). Products and convolutions of gaussian probability density functions. Tech. rep.

Casella, George and Edward I George (1992). “Explaining the Gibbs sampler”. In: The American Statistician 46.3, pp. 167–174.

Chorowski, Jan K, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio (2015). “Attention-based models for speech recognition”. In: Advances in Neural Information Processing Systems, pp. 577–585.

Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter (2015). “Fast and accurate deep network learning by exponential linear units (elus)”. In: arXiv preprint arXiv:1511.07289.

Coppersmith, Don and Shmuel Winograd (1987). “Matrix multiplication via arithmetic progressions”. In: Proceedings of the nineteenth annual ACM symposium on Theory of computing. ACM, pp. 1–6.

Damianou, Andreas C and Neil D Lawrence (2013). “Deep Gaussian Processes.” In: AISTATS, pp. 207–215.

Dantzig, George (2016). Linear programming and extensions. Princeton University Press.

Dantzig, George B and B Curtis Eaves (1973). “Fourier-Motzkin elimination and its dual”. In: Journal of Combinatorial Theory, Series A 14.3, pp. 288–297.

Dantzig, George B and Mukund N Thapa (2006a). Linear programming 1: introduction. Springer Science & Business Media. — (2006b). Linear programming 2: theory and extensions. Springer Science & Business Media.

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009). “Imagenet: A large-scale hierarchical image database”. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, pp. 248–255.

Duane, Simon, A.D. Kennedy, Brian J. Pendleton, and Duncan Roweth (1987). “Hybrid Monte Carlo”. In: Physics Letters B 195.2, pp. 216–222.

Durbin, Richard and David E Rumelhart (1989). “Product Units: A Computationally Powerful and Biologically Plausible Extension to Backpropagation Networks”. In: Neural Computation 1, pp. 133–142.

Eisenach, Carson, Han Liu, and Zhaoran Wang (2017). “Nonparametrically Learning Activation Functions in Deep Neural Nets”. In: International Conference on Learning Representations.

Farzad (2013). CUDA atomic operation performance in different scenarios. URL: https://stackoverflow.com/questions/22367238/cuda-atomic-operation-performance-in-different-scenarios.

Fischer, Hans (2010). A history of the central limit theorem: From classical to modern probability theory. Springer Science & Business Media.

Fox, Charles W. and Stephen J. Roberts (2012). “A tutorial on variational Bayesian inference”. In: Artificial Intelligence Review 38.2, pp. 85–95.

Frey, Peter W and David J Slate (1991). “Letter Recognition Using Holland-Style Adaptive Classifiers”. In: Machine Learning 6, p. 161.

Georgii, Hans-Otto (2015). Stochastik: Einführung in die Wahrscheinlichkeitstheorie und Statistik. Walter de Gruyter GmbH & Co KG.

Glorot, Xavier and Yoshua Bengio (2010). “Understanding the difficulty of training deep feedforward neural networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.

Glorot, Xavier, Antoine Bordes, and Yoshua Bengio (2011). “Deep Sparse Rectifier Neural Networks”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11). Ed. by Geoffrey J. Gordon and David B. Dunson. Vol. 15. Journal of Machine Learning Research - Workshop and Conference Proceedings, pp. 315–323.

Goodfellow, Ian J, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio (2013). “Maxout networks”. In: arXiv preprint arXiv:1302.4389.

Graves, Alex (2011). “Practical variational inference for neural networks”. In: Advances in Neural Information Processing Systems, pp. 2348–2356.

Griebel, Michael, Stephan Knapek, Gerhard Zumbusch, and Attila Caglar (2003). Numerische Simulation in der Moleküldynamik: Numerik, Algorithmen, Parallelisierung, Anwendungen (Springer-Lehrbuch) (German Edition). Springer.

Griewank, Andreas and Andrea Walther (2008). Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM.

Cybenko, G (1989). “Approximation by superposition of sigmoidal functions”. In: Mathematics of Control, Signals and Systems 2.4, pp. 303–314.

Haykin, Simon (1994). Neural networks: a comprehensive foundation. Prentice Hall PTR.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015). “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.

Hecht-Nielsen, Robert et al. (1988). “Theory of the backpropagation neural network.” In: Neural Networks 1.Supplement-1, pp. 445–448.

Hinton, Geoffrey E (2007). “To recognize shapes, first learn to generate images”. In: Progress in brain research 165, pp. 535–547.

Hinton, Geoffrey E and Ruslan R Salakhutdinov (2006). “Reducing the dimensionality of data with neural networks”. In: Science 313.5786, pp. 504–507.

Hinton, Geoffrey E, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov (2012). “Improving neural networks by preventing co-adaptation of feature detectors”. In: arXiv preprint arXiv:1207.0580.

Hinton, Geoffrey E and Drew Van Camp (1993). “Keeping the neural networks simple by minimizing the description length of the weights”. In: Proceedings of the sixth annual conference on Computational learning theory. ACM, pp. 5–13.

Hoeting, Jennifer A, David Madigan, Adrian E Raftery, and Chris T Volinsky (1999). “Bayesian model averaging: a tutorial”. In: Statistical science, pp. 382–401.

Hornik, Kurt, Maxwell Stinchcombe, and Halbert White (1989). “Multilayer feedforward networks are universal approximators”. In: Neural networks 2.5, pp. 359–366.

Wolfram Research, Inc. (2017). Mathematica Version 11.2.

Iverson, Kenneth E (1962). “A programming language”. In: Proceedings of the May 1-3, 1962, spring joint computer conference. ACM, pp. 345–351.

Julier, Simon J and Jeffrey K Uhlmann (1996). A general method for approximating nonlinear transformations of probability distributions. Tech. rep. Robotics Research Group, Department of Engineering Science, University of Oxford. — (1997). “New extension of the Kalman filter to nonlinear systems”. In: AeroSense’97. International Society for Optics and Photonics, pp. 182–193.

Khamsi, Mohamed A. and William A. Kirk (2001). An Introduction to Metric Spaces and Fixed Point Theory. Wiley-Interscience.

Kingma, Diederik P and Max Welling (2013). “Auto-encoding variational Bayes”. In: arXiv preprint arXiv:1312.6114.

Kingma, Diederik and Jimmy Ba (2014). “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980.

Kjolstad, Fredrik, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe (2017). “The Tensor Algebra Compiler”. In: Proc. ACM Program. Lang. 1.OOPSLA, 77:1–77:29.

Klambauer, Günter, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter (2017). “Self- Normalizing Neural Networks”. In: arXiv preprint arXiv:1706.02515.

Kneser, H (1950). “Reelle analytische Lösungen der Gleichung ϕ(ϕ(x)) = e^x und verwandter Funktionalgleichungen.” In: Journal für die reine und angewandte Mathematik, pp. 56–67.

Knuth, Donald E. (1997). The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.

Kohavi, Ron (1996). “Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-tree Hybrid”. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. KDD’96. Portland, Oregon: AAAI Press, pp. 202–207.

Köpp, Wiebke (2015). “A Novel Transfer Function for Continuous Interpolation between Summation and Multiplication in Neural Networks”. Technical University Munich. — (2016). “A Novel Transfer Function for Continuous Interpolation between Summation and Multiplication in Neural Networks”. KTH, School of Computer Science and Communication (CSC).

Köpp, Wiebke, Patrick van der Smagt, and Sebastian Urban (2016). “A Differentiable Transition Between Additive and Multiplicative Neurons”. In: International Conference on Learning Representations, ICLR 2016.

Krizhevsky, Alex and Geoffrey Hinton (2009). Learning multiple layers of features from tiny images. Tech. rep. University of Toronto.

Krizhevsky, Alex, Vinod Nair, and Geoffrey Hinton (2014). The CIFAR-10 dataset. URL: http://www.cs.toronto.edu/kriz/cifar.html.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012). “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., pp. 1097–1105.

Lange, Kenneth (2013). Optimization (Springer Texts in Statistics). Springer.

Lauritzen, Steffen L (1996). Graphical models. Vol. 17. Clarendon Press.

Lawrence, Neil (2005). “Probabilistic non-linear principal component analysis with Gaussian process latent variable models”. In: Journal of Machine Learning Research 6.Nov, pp. 1783–1816.

LeCun, Yann A, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller (2012). “Efficient backprop”. In: Neural networks: Tricks of the trade. Springer, pp. 9–48.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.

LeCun, Yann, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel (1990). “Handwritten digit recognition with a back-propagation network”. In: Advances in neural information processing systems, pp. 396–404.

LeCun, Yann, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel (1989). “Backpropagation applied to handwritten zip code recognition”. In: Neural computation 1.4, pp. 541–551.

Lecun, Yann and Corinna Cortes (1998). The MNIST database of handwritten digits. URL: http://yann.lecun.com/exdb/mnist/.

Lehmann, E.L. (2004). Elements of Large-Sample Theory (Springer Texts in Statistics). Corrected. Springer.

Lichman, M. (2013). UCI Machine Learning Repository. URL: http://archive.ics.uci.edu/ml.

Maas, Andrew L, Awni Y Hannun, and Andrew Y Ng (2013). “Rectifier nonlinearities improve neural network acoustic models”. In: Proc. ICML. Vol. 30. 1.

Maniezzo, Vittorio (1994). “Genetic evolution of the topology and weight distribution of neural networks”. In: IEEE Transactions on neural networks 5.1, pp. 39–53.

Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. URL: http://tensorflow.org/.

Memisevic, Roland (2011). “Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces”. In: arXiv preprint arXiv:1110.0107. — (2013). “Learning to relate images”. In: IEEE transactions on pattern analysis and machine intelligence 35.8, pp. 1829–1846.

Metropolis, Nicholas, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller (1953). “Equation of state calculations by fast computing machines”. In: The journal of chemical physics 21.6, pp. 1087–1092.

Minka, Thomas P (2001). “Expectation propagation for approximate Bayesian inference”. In: Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp. 362–369.

Murray, Iain (2016). “Differentiation of the Cholesky decomposition”. In: arXiv preprint arXiv:1602.07527.

Nair, Vinod and Geoffrey E. Hinton (2010). “Rectified Linear Units Improve Restricted Boltzmann Machines”. In: Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML’10. Haifa, Israel: Omnipress, pp. 807–814.

Neal, Radford M (1997). “Monte Carlo implementation of Gaussian process models for Bayesian regression and classification”. In: arXiv preprint physics/9701026.

Neal, Radford M et al. (2011). “MCMC using Hamiltonian dynamics”. In: Handbook of Markov Chain Monte Carlo 2, pp. 113–162.

Nvidia (2017). NVIDIA CUDA C Programming Guide. Version 9.0. URL: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.

O’Hagan, Anthony and JFC Kingman (1978). “Curve fitting and optimal design for prediction”. In: Journal of the Royal Statistical Society. Series B (Methodological), pp. 1–42.

Press, William H., Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling (1992). Numerical Recipes in C: The Art of Scientific Computing, Second Edition. Cambridge University Press.

Quarteroni, Alfio (2000). Numerical mathematics. New York: Springer.

Quiñonero-Candela, Joaquin, Agathe Girard, and Carl Edward Rasmussen (2002). Prediction at an uncertain input for Gaussian processes and relevance vector machines-application to multiple-step ahead time-series forecasting. Tech. rep. IMM, Danish Technical University.

Quiñonero-Candela, Joaquin and Carl Edward Rasmussen (2005). “A unifying view of sparse approximate Gaussian process regression”. In: Journal of Machine Learning Research 6.Dec, pp. 1939–1959.

Rall, Louis B. (1981). Automatic Differentiation: Techniques and Applications. Vol. 120. Lecture Notes in Computer Science. Berlin: Springer.

Rasmussen, Carl Edward and Christopher KI Williams (2006). Gaussian processes for machine learning. Vol. 1. MIT press Cambridge.

Rechenberg, I (1973). Evolutionsstrategie – Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Frommann-Holzboog.

Al-Rfou, Rami, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, Tim Cooijmans, Marc-Alexandre Côté, Myriam Côté, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien Demouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, Mélanie Ducoffe, Vincent Dumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain, Xavier Glorot, Ian Goodfellow, Matt Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet, Jean-Philippe Heng, Balázs Hidasi, Sina Honari, Arjun Jain, Sébastien Jean, Kai Jia, Mikhail Korobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, César Laurent, Sean Lee, Simon Lefrancois, Simon Lemieux, Nicholas Léonard, Zhouhan Lin, Jesse A. Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Christopher Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, and Ying Zhang (2016). “Theano: A Python framework for fast computation of mathematical expressions”. In: arXiv e-prints abs/1605.02688.

Riedmiller, Martin and Heinrich Braun (1992). “RPROP-A fast adaptive learning algorithm”. In: Proc. of ISCIS VII.

Riihimäki, Jaakko and Aki Vehtari (2010). “Gaussian processes with monotonicity information”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10). Ed. by Yee W. Teh and D. M. Titterington. Vol. 9, pp. 645–652.

Rumelhart, David E, Geoffrey E Hinton, Ronald J Williams, et al. (1988). “Learning representa- tions by back-propagating errors”. In: Cognitive modeling 5.3, p. 1.

Rumelhart, David E, James L McClelland, PDP Research Group, et al. (1987). Parallel distributed processing. Vol. 1. MIT press Cambridge, MA, USA.

Salakhutdinov, Ruslan and Geoffrey Hinton (2009). “Deep boltzmann machines”. In: Artificial Intelligence and Statistics, pp. 448–455.

Salimbeni, Hugh and Marc Deisenroth (2017). “Doubly Stochastic Variational Inference for Deep Gaussian Processes”. In: arXiv preprint arXiv:1705.08933.

Scardapane, Simone, Steven Van Vaerenbergh, and Aurelio Uncini (2017). “Kafnets: kernel-based non-parametric activation functions for neural networks”. In: arXiv preprint arXiv:1707.04035.

Schröder, Ernst (1870). “Ueber iterirte Functionen”. In: Mathematische Annalen 3, pp. 296–322.

Shin, Y. and J. Ghosh (1991). “The pi-sigma network: an efficient higher-order neural network for pattern classification and function approximation”. In: IJCNN-91-Seattle International Joint Conference on Neural Networks. Vol. i, 13–18 vol.1.

Smith, HJ Stephen (1860). “On Systems of Linear Indeterminate Equations and Congruences.” In: Proceedings of the Royal Society of London 11, pp. 86–89.

Smith, Stephen P (1995). “Differentiation of the Cholesky algorithm”. In: Journal of Computational and Graphical Statistics 4.2, pp. 134–147.

Solak, Ercan, Roderick Murray-Smith, William E Leithead, Douglas J Leith, and Carl E Rasmussen (2003). “Derivative observations in Gaussian process models of dynamic systems”. In: Advances in neural information processing systems, pp. 1057–1064.

Srivastava, Nitish, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014). “Dropout: a simple way to prevent neural networks from overfitting.” In: Journal of machine learning research 15.1, pp. 1929–1958.

Stanley, Kenneth O (2007). “Compositional pattern producing networks: A novel abstraction of development”. In: Genetic programming and evolvable machines 8.2, pp. 131–162.

Stanley, Kenneth O, David B D’Ambrosio, and Jason Gauci (2009). “A hypercube-based encoding for evolving large-scale neural networks”. In: Artificial life 15.2, pp. 185–212.

Stanley, Kenneth O and Risto Miikkulainen (2002). “Evolving neural networks through augmenting topologies”. In: Evolutionary computation 10.2, pp. 99–127.

Strassen, Volker (1969). “Gaussian elimination is not optimal”. In: Numerische mathematik 13.4, pp. 354–356.

Sutton, Richard S and Andrew G Barto (1998). Reinforcement learning: An introduction. Vol. 1. 1. MIT press Cambridge.

Sutton, Richard S, David A McAllester, Satinder P Singh, and Yishay Mansour (2000). “Policy gradient methods for reinforcement learning with function approximation”. In: Advances in neural information processing systems, pp. 1057–1063.

Swokowski, Earl William (1979). Calculus with analytic geometry. Taylor & Francis.

Taylor, John (1997). Introduction to error analysis, the study of uncertainties in physical measurements. University Science Books.

Tieleman, T. and G. Hinton (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Tikhonov, Andrei Nikolaevich, Vasilii Iakkovlevich Arsenin, and Fritz John (1977). Solutions of ill-posed problems. Vol. 14. Winston Washington, DC.

Titsias, Michalis K (2009). “Variational learning of inducing variables in sparse Gaussian processes”. In: International Conference on Artificial Intelligence and Statistics, pp. 567–574.

Urban, Sebastian, Justin Bayer, Christian Osendorfer, Goran Westling, Benoni B Edin, and Patrick Van Der Smagt (2013). “Computing grip force and torque from finger nail images using gaussian processes”. In: Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE, pp. 4034–4039.

Urban, Sebastian, Marvin Ludersdorfer, and Patrick Van Der Smagt (2015). “Sensor Calibration and Hysteresis Compensation with Heteroscedastic Gaussian Processes”. In: IEEE Sensors Journal 15.11, pp. 6498–6506.

Urban, Sebastian and Patrick van der Smagt (2015). “A Neural Transfer Function for a Smooth and Differentiable Transition Between Additive and Multiplicative Interactions”. In: arXiv preprint arXiv:1503.05724.

Vapnik, Vladimir (2013). The nature of statistical learning theory. Springer science & business media.

Wang, Sida I. and Christopher D. Manning (2013). “Fast dropout training”. In: Proceedings of the 30th International Conference on Machine Learning. Vol. 28, pp. 118–126.

Wegert, E. (2012). Visual Complex Functions: An Introduction with Phase Portraits. Springer Basel.

Werbos, Paul J (1990). “Backpropagation through time: what it does and how to do it”. In: Proceedings of the IEEE 78.10, pp. 1550–1560.

Williams, Christopher KI and Carl Edward Rasmussen (1996). “Gaussian processes for regression”. In: Advances in neural information processing systems, pp. 514–520. — (2006). Gaussian processes for machine learning. MIT Press.

Williams, Ronald J (1992). “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine learning 8.3-4, pp. 229–256.

Wu, Huaiqin (2009). “Global stability analysis of a general class of discontinuous neural networks with linear growth activation functions”. In: Information Sciences 179.19, pp. 3432–3441.

Xu, Bing, Naiyan Wang, Tianqi Chen, and Mu Li (2015). “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853.

yidiyidawu (2012). CUDA performance of atomic operation on different address in warp. URL: https://stackoverflow.com/questions/22342685/cuda-performance-of-atomic-operation-on-different-address-in-warp.

Zeiler, Matthew D (2012). “ADADELTA: an adaptive learning rate method”. In: arXiv preprint arXiv:1212.5701.

Zoph, Barret and Quoc V Le (2016). “Neural architecture search with reinforcement learning”. In: arXiv preprint arXiv:1611.01578.

List of Acronyms

ANN artificial neural network

ARD automatic relevance determination

CDF cumulative distribution function

CNN convolutional neural network

CUDA Compute Unified Device Architecture

DFT discrete Fourier transform

ELBO evidence lower bound

GCD greatest common divisor

GP Gaussian process

GPN Gaussian process neuron

GPU graphics processing unit

HMC Hamiltonian Monte Carlo

iid. independent and identically distributed

MCMC Markov chain Monte Carlo

PDF probability density function

RNN recurrent neural network

SE squared exponential
