Learning Activation Functions in Deep Neural Networks

UNIVERSITÉ DE MONTRÉAL LEARNING ACTIVATION FUNCTIONS IN DEEP NEURAL NETWORKS FARNOUSH FARHADI DÉPARTEMENT DE MATHÉMATIQUES ET DE GÉNIE INDUSTRIEL ÉCOLE POLYTECHNIQUE DE MONTRÉAL MÉMOIRE PRÉSENTÉ EN VUE DE L’OBTENTION DU DIPLÔME DE MAÎTRISE ÈS SCIENCES APPLIQUÉES (GÉNIE INDUSTRIEL) DÉCEMBRE 2017 c Farnoush Farhadi, 2017. UNIVERSITÉ DE MONTRÉAL ÉCOLE POLYTECHNIQUE DE MONTRÉAL Ce mémoire intitulé : LEARNING ACTIVATION FUNCTIONS IN DEEP NEURAL NETWORKS présenté par : FARHADI Farnoush en vue de l’obtention du diplôme de : Maîtrise ès sciences appliquées a été dûment accepté par le jury d’examen constitué de : M. ADJENGUE Luc-Désiré, Ph. D., président M. LODI Andrea, Ph. D., membre et directeur de recherche M. PARTOVI NIA Vahid, Doctorat, membre et codirecteur de recherche M. CHARLIN Laurent, Ph. D., membre iii DEDICATION This thesis is dedicated to my beloved parents, Ahmadreza and Sholeh, who are my first teachers and always love me unconditionally. This work is also dedicated to my love, Arash, who has been a great source of motivation and encouragement during the challenges of graduate studies and life. iv ACKNOWLEDGEMENTS I would like to express my gratitude to Prof. Andrea Lodi for his permission to be my supervisor at Ecole Polytechnique de Montreal and more importantly for his enthusiastic encouragements and invaluable continuous support during my research and education. I am very appreciated to him for introducing me to a MITACS internship in which I have developed myself both academically and professionally. I would express my deepest thanks to Dr. Vahid Partovi Nia, my co-supervisor at Ecole Poly- technique de Montreal, for his supportive and careful guidance and taking part in important decisions which were extremely precious for my research both theoretically and practically. I would also like to thank the examining committee for taking time out from their busy schedule to read and review my thesis and provide me with their insightful comments. Further acknowledgements go to the members of the Excellent Research Chair in Data Science for Real-time Decision-making who were eager to help me whenever needed. I would like to express my deepest gratitude and special thanks to Lane Cochrane, the Chief Analytics Officer at iPerceptions, for giving me the opportunity to do an internship within the company. For me it was a unique experience to be at iPerceptions and to do the research on machine learning. I would like to thank Sylvain Laporte, the Chief Technology Officer at iPerceptions, who in spite of being extraordinarily busy with his duties, took time out to give me necessary advice. I am especially thankful to all my collegues at R&D team at iPerceptions for their careful and valuable guidance to make life easier. And on a personal note, a special word of thanks goes to my parents who always stand behind me and support me spiritually throughout my life. My special thanks to Arshin, Mehrnoush and Siavash, my sisters and brother-in-law whose encourage and guides always help me in my personal and professional life. I am grateful for the love, encouragement, and tolerance of Arash, the man who has made all the difference in my life. Without his patience and love, I could not have completed this thesis. I am especially grateful to Atefeh, my best friend, who has been there whenever I needed her. She always has the time to hear about my successes and failures and has always been a faithful listener. I would like to thank my dear friend David Berger for helping me in translating the abstract into French and Chris Nael for his assistance in general editing. v RÉSUMÉ Récemment, la prédiction de l’intention des utilisateurs basée sur les données comportemen- tales est devenue plus attrayante dans les domaines de marketing numérique et publicité en ligne. La prédiction est souvent effectuée par analyse de l’information du comportement des visiteurs qui comprend des détails sur la façon dont chaque utilisateur a visité le site Web commercial. Dans ce travail, nous explorons les méthodes statistiques d’apprentissage auto- matique pour classifier les visiteurs d’un site Web de commerce électronique d’après des URL qu’ils ont parcourues. Plus précisément, on vise à étudier comment les réseaux de neurones profonds de pointe pourraient efficacement prédire le but des visiteurs dans une session de visite, en utilisant uniquement des séquences d’URL. Dans un réseau contenant des couches limitées, le choix de la fonction d’activation a un effet sur l’apprentissage de la représentation et les performances du réseau. Typiquement, les modèles de réseaux de neurones sont appris en utilisant des fonctions d’activation spécifiques comme la fonction sigmoïde symétrique standard ou relu linéaire. Dans cette mémoire, on vise à proposer une fonction sigmoïde asymetrique paramétrée dont la paramètre pourrait être controlé et ajusté dans chacun des neurons cachées avec d’autres paramètres de réseau profond en formation. De plus, nous visons à proposer une variante non-linéaire de la fonction relu pour les réseaux profonds. Comme le sigmoïde adaptatif, notre objectif est de régler les paramètres du relu adaptatif pour chaque unité individuelle- ment, dans l’étape de formation. Des méthodes et des algorithmes pour développer ces fonctions d’activation adaptatives sont discutés. En outre, une petite variante de MLP (Multi Layer Perceptron) et un modèle CNN (Convolutional Neural Network) applicant nos fonctions d’activation proposées sont utilisés pour prédire l’intention des utilisateurs selon les données d’URL. Quatre jeux de données différents ont été choisis, appelé les données simulées, les données MNIST, les données de revue de film, et les données d’URL pour démontrer l’effet de sélectionner différentes fonctions d’activation sur les modèles MLP et CNN proposés. vi ABSTRACT Recently, user intention prediction based on behavioral data has become more attractive in digital marketing and online advertisement. The prediction is often performed by analyzing the logged information which includes details on how each user visited the commercial website. In this work, we explore the machine learning methods for classification of visitors to an e-commerce website based on URLs they visited. We aim at studying on how effectively the state-of-the-art deep neural networks could predict the purpose of visitors in a session using only URL sequences. Typical deep neural networks employ a fixed nonlinear activation function for each hidden neuron. In this thesis, two adaptive activation functions for individ- ual hidden units are proposed such that the deep network learns these activation functions in training. Methods and algorithms for developing these adaptive activation functions are discussed. Furthermore, performance of the proposed activation functions in deep networks compared to state-of-the-art models are also evaluated on a real-world URL dataset and two well-known benchmarks, MNIST and Movie Review benchmarks as well. vii TABLE OF CONTENTS DEDICATION . iii ACKNOWLEDGEMENTS . iv RÉSUMÉ . v ABSTRACT . vi TABLE OF CONTENTS . vii LIST OF TABLES . x LIST OF FIGURES . xii LIST OF ABBREVIATIONS AND NOTATIONS . xvii LIST OF APPENDICES . xx CHAPTER 1 INTRODUCTION . 1 1.1 Deep Neural Networks . .3 1.2 Adaptive Activation Functions . .5 1.3 Research Objectives . .6 1.4 Our Contributions . .7 1.5 Thesis Structure . .7 CHAPTER 2 OVERVIEW . 8 2.1 Linear Regression . .8 2.1.1 Maximum Likelihood Estimation and Least Squares . 10 2.1.2 Multiple Outputs . 12 2.2 Generalized Linear Models (GLMs) . 13 2.2.1 Exponential Family . 14 2.2.2 Exponential Dispersion Family . 15 2.2.3 Mean and Variance . 16 2.2.4 Linear Structure . 18 2.2.5 GLMs Components . 19 2.2.6 Maximum Likelihood Estimates . 21 viii 2.2.7 Logistic Regression . 25 2.2.8 Maximum Likelihood Estimates in Logistic Regression . 27 CHAPTER 3 DEEP NEURAL NETWORKS . 33 3.1 Perceptron . 33 3.1.1 Perceptron Algorithm . 33 3.1.2 Model . 33 3.1.3 Basic Perceptron Learning Rule and Convergence . 36 3.1.4 Perceptron Learning Rule on Loss Function . 42 3.1.5 Perceptron with Sigmoid Function . 45 3.1.6 Perceptron Issues . 51 3.2 Multi Layer Perceptrons (MLPs) . 52 3.2.1 Activation Functions . 56 3.2.2 Loss Function . 59 3.2.3 Backpropagation Learning Algorithm . 60 3.2.4 Mini-batch Stochastic Gradient Descent (Mini-batch SGD) . 64 3.2.5 Regularization . 66 3.2.6 Dropout . 67 3.2.7 Parameters Initialization . 68 3.3 Convolutional Neural Networks (CNNs) . 68 3.3.1 Convolutional Layer . 69 3.3.2 Max Pooling Layer . 71 3.3.3 Loss Function and Gradients . 72 CHAPTER 4 ADAPTIVE ACTIVATION FUNCTIONS . 74 4.1 Latent Variable Formulation . 74 4.2 The complementary log-log transformation . 77 4.2.1 Maximum Likelihood Estimates with Complementary log-log . 79 4.3 Adaptive Sigmoid Link Function . 81 4.3.1 Identifiability . 83 4.3.2 Maximum Likelihood Estimates for Adaptive Sigmoid Link Function 86 4.3.3 ANNs with Adaptive-sigmoid : Loss Function and Optimization . 87 4.4 Adaptive ReLu Function . 90 4.4.1 ANNs with Adaptive-relu : Loss Function and Optimization . 92 CHAPTER 5 EXPERIMENTS AND RESULTS . 95 5.1 Performance Measures . 95 ix 5.2 Experiment 1 : Simulated Data . 97 5.2.1 Data . 98 5.2.2 Tools and Technologies . 98 5.2.3 Results and Evaluations . 98 5.2.4 Conclusions . 108 5.3 Experiment 2 : MNIST Benchmark . 110 5.3.1 data . 110 5.3.2 Tools and Technologies . 110 5.3.3 Models . 110 5.3.4 Results and Evaluations . 111 5.3.5 Conclusions . 114 5.4 Experiment 3 : Sentiment Analysis . 117 5.4.1 Data . 117 5.4.2 Data preparation . 117 5.4.3 Models . 118 5.4.4 Tools and Technologies .

Learning Activation Functions in Deep Neural Networks

6.5 Applications of Exponential and Logarithmic Functions 469

Chapter 10. Logistic Models

Neural Network Architectures and Activation Functions: a Gaussian Process Approach

Neural Networks

Lecture 5: Logistic Regression 1 MLE Derivation

An Introduction to Logistic Regression: from Basic Concepts to Interpretation with Particular Attention to Nursing Domain

Fm 3-34 Engineer Operations Headquarters, Department

Notes: Logistic Functions I. Use, Function, and General Graph If

Sigmoid Functions in Reliability Based Management 2007 15 2 67 2.1 the Nature of Performance Growth

2.161 Signal Processing: Continuous and Discrete Fall 2008

Applied Machine Learning

Multilayer Perceptron