A Responsible Softmax Layer in Deep Learning

A Responsible Softmax Layer in Deep Learning Item Type text; Electronic Dissertation Authors Coatney, Ryan Dean Publisher The University of Arizona. Rights Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author. Download date 02/10/2021 03:05:29 Link to Item http://hdl.handle.net/10150/642087 A Responsible Softmax Layer in Deep Learning by Ryan Coatney Copyright © Ryan Coatney 2020 A Dissertation Submitted to the Faculty of the Department of Mathematics In Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy In the Graduate College The University of Arizona 2 0 2 0 2 THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE As members of the Dissertation Committee, we certify that we have read the dissertation prepared by: Ryan Coatney titled: and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy. Marek Rychlik _________________________________________________________________ Date: ____________Jun 22, 2020 Marek Rychlik Robert S. Maier _________________________________________________________________ Date: ____________Jun 23, 2020 Robert S. Maier _________________________________________________________________ Date: ____________Jun 25, 2020 David A Glickenstein _________________________________________________________________ Date: ____________Jun 22, 2020 Clayton T Morrison Final approval and acceptance of this dissertation is contingent upon the candidate’s submission of the final copies of the dissertation to the Graduate College. I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement. Marek Rychlik _________________________________________________________________ Date: ____________Jun 22, 2020 Marek Rychlik Mathematics 3 Acknowledgements I would like to acknowledge everyone who had a very close hand in this work. My Advisor, Marek Rychlik, has been essential in his guidance and insight. Though it was a role he agreed to take on, I appreciate his patience in he;ping me finish. I would also like to thank my committee, David Glickenstein, Clayton Morrison, and Robert Maier for their time and input. This dissertation is much better for their input. My family also has been an excellent source of emotional support and motivation. My father, Thomas, has played a special role in helping me refine my thoughts during this process. Thanks for lending a listening ear and asking questions dad. My wife and children have been surviving with much less attention than they deserve for some time now. Their love and support has been demonstrably unconditional. I could not have come to this point without any of these people, and many others I have failed to mention. 4 Dedication To my family. My parents, my children, and most of all my wife. 5 Table of Contents List of Figures ................................. 7 List of Tables .................................. 8 Abstract ..................................... 9 Chapter 1. Introduction .......................... 11 1.1. Introduction . 11 Chapter 2. Background ........................... 14 2.1. Clustering and Classification . 14 2.1.1. Clustering . 14 2.1.2. Classification . 16 2.2. Unsupervised and Supervised Machine Learning . 16 2.3. Softmax and Logistic Regression . 20 2.4. Basic Neural Nets . 23 2.4.1. Single Layer Perceptron . 23 2.4.2. Multilayer Perceptron as a Multi-Class Classifier . 25 2.4.3. Backpropagation . 27 2.4.4. Derivative Notation . 29 2.5. Responsible Clustering Algorithms . 33 2.5.1. K-means algorithm . 33 2.5.2. Expectation Maximization . 35 2.6. Clustering with Mixture Models . 38 2.7. A Brief Introduction to Discrete Dynamical Systems . 41 Chapter 3. Dynamic Responsibility and Cluster Proportions .. 45 3.1. An Iterative Algorithm . 45 3.2. Further Examination of the K = 2 Case for Arbitrary N ....... 46 3.3. Some Basic Examples of the K = 2 Case . 49 3.3.1. An Example for a simple family of matrices Fα ........ 50 3.3.2. A GMM Example with an Unknown Mean . 51 3.4. Convergence of RF for Arbitrary K ................... 56 3.5. Relationship to the MLE . 67 Chapter 4. Responsible Softmax Layer ................. 71 4.1. Neural Net Learning of the Parameters for Dynamic Responsibility . 72 4.1.1. Choice of Loss Function . 73 6 Table of Contents|Continued 4.1.2. Method for determining F .................... 75 4.2. Proposed Neural Network Layer . 76 4.3. Backpropagation and Responsibility . 82 @Y 4.3.1. Computation of @F ........................ 82 4.4. Methods to compute DR ......................... 84 4.4.1. Calculating DR on SK ...................... 85 4.4.2. Calculating DR on Parameter Matrices . 89 @L 4.5. Using Derivatives of R to compute @Y ................. 90 Chapter 5. Applied Examples ....................... 97 5.1. Empirical Evidence of Convergence Rate . 98 5.2. Experimental Neural Network Setup . 103 5.3. Data Selection . 106 5.4. Evaluation Methods . 108 5.5. Results for GMM . 110 5.6. Results for MNIST . 116 5.7. Summary of Conclusions . 120 Appendices .................................... 121 Appendix A. Dynamic Responsibility Code ............... 121 A.1. Implementation of Responsibility Map . 121 A.2. Implementation of Algorithm 1 . 122 A.3. Implementation of Algorithm 2 . 123 A.4. Code for Experiments on Convergence . 124 Appendix B. Responsible Softmax Code ................. 128 B.1. Responsible Softmax Layer . 128 B.2. Fixed Responsibility Softmax layer . 135 Appendix C. Code for Examples on GMM Data ............ 138 Appendix D. Code for Example on MNIST Data ........... 142 References .................................... 145 7 List of Figures Figure 2.1. Network graph of a perceptron with N input units. 24 Figure 2.2. Network graph of a multilayer perceptron. 26 Figure 2.3. Conceptualized model of a single network layer. 28 Figure 3.1. A plot of fixed points for Fα .................... 53 Figure 3.2. Example with K = 2 and N = 12 . 54 Figure 3.3. Example GMM Histogram . 54 Figure 3.4. Fixed Point Estimates of GMM Mixing Probabilities . 55 Figure 4.1. Graphical Model of the Responsible Softmax Layer . 78 Figure 4.2. Computation Graph for Y ..................... 83 −1 Figure 5.1. Plot of residues σF − jbF j for various F ............. 101 Figure 5.2. Plots of distances and convergence errors for various F .... 102 Figure 5.3. A sample GMM data set used for training and testing . 107 Figure 5.4. Classification Regions for the MAP classifier 1 . 112 Figure 5.5. Classification Regions for GMM nets #1-#4 . 112 Figure 5.6. Classification regions for MAP estimator 2 . 115 Figure 5.7. Classification regions for RS GMM nets with large C ...... 116 Figure 5.8. Confusion matrices for MNIST nets #1-#4 . 118 8 List of Tables Table 3.1. Dynamic Responsibility Algorithm . 46 Table 3.2. A Newton Method version of Dynamic Responsibility . 68 Table 3.3. Experimental Evidence of Consistency . 69 Table 4.1. Forward prediction algorithm for responsible softmax. 81 Table 4.2. Backward propagation algorithm for responsible softmax. 81 Table 4.3. Intermediate Terms for Computation of @Y /@F ......... 84 Table 4.4. Computing gradients with backpropagation, iterative portion . 96 Table 5.1. General layer setup for GMM classification . 104 Table 5.2. Common convolutional layers for MNIST classification . 104 Table 5.3. A table of neural net setups used in numerical experiments . 106 Table 5.4. Confusion matrices with error estimates for GMM nets #1-#4 . 113 Table 5.5. Per class precision and recall for GMM nets #1-#4 . 114 Table 5.6. MLE of Class weights for GMM nets #2 and #3 . 115 Table 5.7. Accuracy of MNIST training with imbalanced data . 117 Table 5.8. Confusion matrix diagonal for MNIST nets #1-#4 . 119 Table 5.9. MLE of class weights for MNIST nets #2 and #3 . 119 9 Abstract Clustering algorithms are an important part of modern data analysis. The K-means and EM clustering algorithm both use an iterative process to find latent (or hidden) variables in a mixture distribution. These hidden variables may be interpreted as class label for the data points of a sample. In connection with these algorithms, I consider a family of nonlinear mappings called responsibility maps. The responsibility map is obtained as a gradient of the log likelihood of N independent samples, drawn from a mixture of K distributions. I look at the discrete dynamics of this family of maps and give a proof that iteration of responsibility converges to an estimate of the mixing coefficients. I also show that the convergence is consistent in the sense that the fixed point acts as a maximizer of the log likelihood. I call the process of determining class weight by iteration dynamic responsibility and show that it converges to a unique set of weights under mild assumptions. Dynamic responsibility (DR) is inspired by the expectation step of the expectation maximization (EM) algorithm and has a useful association with Bayesian methods. Like EM, dynamic responsibility is an iterative algorithm, but DR will converge to a unique maximum under reasonable conditions. The weights determined by DR can also be found using gradient descent but DR guarantees non-negative weights and gradient descent does not. I present a new algorithm which I call responsible softmax for doing classification with neural networks. This algorithm

Load more