The Effect of Hyperparameters in the Activation Layers of Deep Neural Networks
Copyright 2016 Clay McLeod
ALL RIGHTS RESERVED

ABSTRACT

Deep neural networks (DNNs), and artificial neural networks (ANNs) in general, have recently received a great amount of attention from both the media and the machine learning community at large. DNNs have been used to produce world-class results in a variety of domains, including image recognition, speech recognition, sequence modeling, and natural language processing. Many of the most exciting recent deep neural network studies have made improvements by hardcoding less about the network and giving the neural network more control over its own parameters, allowing greater flexibility and control within the network. Although much research has been done to introduce trainable hyperparameters into transformation layers (GRU [7], LSTM [13], etc.), the introduction of hyperparameters into the activation layers has been largely ignored. This paper serves several purposes: (1) to equip the reader with the background knowledge, including theory and best practices for DNNs, that contextualizes the contributions of this paper; (2) to describe and verify the effectiveness of current techniques in the literature that utilize hyperparameters in the activation layer; and (3) to introduce new activation layers that add hyperparameters to the model, including activation pools (APs) and parametric activation pools (PAPs), and to study the effectiveness of these new constructs on popular image recognition datasets.

ACKNOWLEDGEMENTS

I am thankful to several people for their crucial advice and unwavering support while writing this thesis. Thank you to Dr. Dawn Wilkins: you have been both an invaluable mentor and a great friend to me for the past four years. Thank you for supporting my wandering research ambitions through both my undergraduate and graduate theses, none of which would have been possible without your guidance and insight. Thank you to my fellow scholars in the Ole Miss Computer Science department, particularly Morgan Dock, Andrew Henning, and Ajay Sharma, for your feedback concerning the research methods and results included in this thesis. Thank you to the Mississippi Center for Supercomputing Research for the use of their supercomputers and their responsiveness in setting up the tools I needed to conduct this research. Thank you to my close friends Jordan Pearce and Joey Linn for keeping me accountable in my pursuits and encouraging me to continue growing as an individual. Lastly, I'd like to thank my family, and especially my loving girlfriend Jane Pittman: thank you for your patience during late nights and unmatched support during these last two years while working on this thesis.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
2 BACKGROUND
3 THEORETICAL CONCEPTS
4 EXPERIMENTS
5 FURTHER WORK/CONCLUSION
BIBLIOGRAPHY
VITA

LIST OF TABLES

1 Model architectures for MNIST experiment
2 Best accuracy for MNIST model
3 Model architectures for CIFAR10 experiment
4 Best accuracy for DeepCNet models using constant learning rates
5 Best accuracy for DeepCNet models using dynamic learning rates

LIST OF FIGURES

1 Typical hierarchical structure of neural networks
2 Example of a neural network with 3 layers
3 Sigmoid activation functions
4 Rectifier activation functions
5 Architecture for activation pools
6 Effect of MReLU coefficients on cost function
7 Cost function comparison: ReLU vs. MReLU
8 MNIST accuracy for different learning rates
9 CIFAR10 accuracy for increasing learning rates
10 CIFAR10 accuracy for dynamic learning rates

CHAPTER 1
INTRODUCTION

Best practice for neural network layer design follows a fairly formulaic, layer-based approach that is reminiscent of human biology. In this paradigm, layers in a neural network are constructed as the combination of smaller sub-nets. Each of these sub-nets performs a specific transformation of the input data, such as a recurrent or convolutional layer. After each transformation, the output signal is fed into some mapping (activation) function that normalizes the output signal. These sub-nets are layered on top of one another to construct a hierarchical knowledge representation. Figure 1 shows a typical hierarchical neural network architecture.

[Figure 1: Typical hierarchical structure of neural networks. The input X flows through three sub-nets, each a transformation layer followed by an activation layer, to produce the output Y.]
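To make the stacking pattern concrete, the short sketch below builds three such sub-nets in plain NumPy. It is illustrative only: the layer sizes, the batch size, and the choice of a rectifier activation are assumptions made for the example, not the architectures used in the experiments of later chapters.

    import numpy as np

    rng = np.random.RandomState(0)

    def relu(z):
        # Activation layer: elementwise rectifier, max(0, z)
        return np.maximum(0.0, z)

    def sub_net(x, W, b):
        # One sub-net: a transformation (affine map) followed by an activation
        return relu(x @ W + b)

    # Illustrative layer sizes: 784 input features, two hidden sub-nets, 10 outputs
    sizes = [784, 256, 128, 10]
    params = [(0.01 * rng.randn(m, n), np.zeros(n))
              for m, n in zip(sizes[:-1], sizes[1:])]

    X = rng.randn(32, 784)      # a batch of 32 example inputs
    h = X
    for W, b in params:         # layer the sub-nets on top of one another
        h = sub_net(h, W, b)
    Y = h
    print(Y.shape)              # (32, 10)

Each pass through the loop corresponds to one sub-net in Figure 1; only the transformation and activation choices change from architecture to architecture.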
Different variations of the transformation layers in these sub-nets have been heavily documented and studied in the literature. Furthermore, each transformation layer type has several hyperparameters that can be tuned to achieve optimal performance. Though many different types of activation layers have been proposed, most are simply mathematically convenient mapping functions that remain the same throughout the training process. In other words, the effect of hyperparameters within the activation layer has not been heavily studied.

Many of deep learning's world-class results are the product of improved performance gained by allowing the neural network to train its own hyperparameters [18, 13, 7]. The inclusion of trainable hyperparameters in the network adds computational complexity but gives the neural network more control over its own configuration, which often leads to better performance. It is clear that techniques that take advantage of adding hyperparameters could be important in achieving the best results for DNNs in the future.
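One such technique, examined later in this work, is the parametric rectified linear unit (PReLU), in which the slope applied to negative inputs is a trainable hyperparameter of the activation layer and is learned by backpropagation alongside the network's weights. The sketch below is illustrative only: it uses plain NumPy, and the function names, initial slope, and update step are assumptions made for the example rather than code from this thesis.

    import numpy as np

    def prelu_forward(z, a):
        # PReLU: f(z) = z for z > 0, and a * z otherwise, where the slope a
        # is a trainable hyperparameter of the activation layer.
        return np.where(z > 0, z, a * z)

    def prelu_grads(z, a, grad_out):
        # Gradients for backpropagation: with respect to the input z and
        # with respect to the activation slope a itself.
        grad_z = np.where(z > 0, 1.0, a) * grad_out
        grad_a = np.sum(np.where(z > 0, 0.0, z) * grad_out)
        return grad_z, grad_a

    z = np.array([-2.0, -0.5, 0.0, 1.5])
    a = 0.25                                  # illustrative initial slope
    out = prelu_forward(z, a)                 # [-0.5, -0.125, 0.0, 1.5]
    grad_z, grad_a = prelu_grads(z, a, np.ones_like(z))
    a -= 0.01 * grad_a                        # a is updated just like a weight

Because a is updated by the same gradient descent procedure as the weights, the shape of the activation function itself becomes part of what the network learns.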
The contributions of this paper are as follows:

• Equip the reader with the background knowledge, including theory and best practices for DNNs, that contextualizes the contributions of this paper. Readers who are familiar with the foundational concepts of DNNs can skip this section.

• Describe and verify the effectiveness of techniques already described in the literature that introduce hyperparameters into the activation layer (such as parametric rectified linear units) against current best practices.

• Introduce new activation layers that add hyperparameters to the model, including activation pools (APs) and parametric activation pools (PAPs), and study the effectiveness of these new constructs on popular image recognition datasets.

Our hope is that successful experiments on these image datasets would justify further experiments with much larger models applied to a variety of datasets and domains.

CHAPTER 2
BACKGROUND

Note: This section is intended to inform the reader of some basic concepts and best practices in deep learning. Readers who understand the foundational concepts of deep neural networks can skip to the next section.

In recent years, the field of Deep Neural Networks (DNNs), also known as deep learning, has received attention from both the media and the machine learning community at large. A deep neural network is a special type of Artificial Neural Network (ANN) in which the number of hidden layers is increased dramatically to form a much more expressive network. This technique is relatively new to the literature, although the exact number of layers that differentiates traditional shallow learning from deep learning is not yet agreed upon [32]. What is clear, however, is that deep neural networks possess several advantages over their ANN ancestors, including the auto-encoding of features, hierarchical construction for better interpretation by humans, and better sequence modeling. Concretely, DNNs have achieved state-of-the-art results in image recognition [18, 22, 12, 37, 35], speech recognition [27, 20, 31], natural language processing [11, 5, 8, 34], and many other interesting areas [25, 9, 19, 24, 38].

However, many researchers have their doubts about the claims that deep neural networks are the pathway to intelligent machines. In a recent survey of the machine learning community [1], approximately 61% of respondents said that deep learning only partially lived up to the hype surrounding it, 19% responded that deep learning did achieve its lofty goals, 13% said that deep learning did not achieve its goals, and 8% were undecided. Adding to this skepticism, close examination of what deep neural networks learn shows that they do not learn anything close to the same representations that humans learn [2, 15]. In fact, artificial neural networks (and specifically, deep neural networks) seem to simply be extremely efficient input-to-output mapping constructs with no real magic involved: training multiple deep neural networks with different initialization parameters on the same dataset yields vastly different final configurations, and there are no