Doctor of Philosophy

Total Pages: 16

File Type: PDF, Size: 1020 KB

RICE UNIVERSITY

By Randall Balestriero

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE Doctor of Philosophy

APPROVED, THESIS COMMITTEE: Richard Baraniuk, Ankit Patel, Behnaam Aazhang, Stephane Mallat, Moshe Vardi, Albert Cohen

HOUSTON, TEXAS, April 2021

ABSTRACT

Max-Affine Splines Insights Into Deep Learning

by Randall Balestriero

We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings. For instance, conditioned on the spline partition region containing the input signal, the output of a MASO DN can be written as a simple affine transformation of the input. Studying the geometry of those regions yields novel insights into different regularization techniques, layer configurations, and initialization schemes. Going further, this spline viewpoint provides precise geometric insights in various domains, such as the characterization of a Deep Generative Network's generated manifold, the understanding of Deep Network pruning as a means to simplify the DN input space partition, and the relationship between different nonlinearities (e.g., the ReLU and the Sigmoid Gated Linear Unit) as simply corresponding to different MASO region membership inference algorithms. The spline partition of the input signal space that is implicitly induced by a MASO directly links DNs to the theory of vector quantization (VQ) and K-means clustering, which opens up new geometric avenues to study how DNs organize signals in a hierarchical fashion.
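To make the per-region affine claim concrete, here is a minimal NumPy sketch (illustrative only, not code from the thesis; the toy network sizes and all function names are assumptions). It builds a small two-layer ReLU network, reads off the affine map (A, c) of the spline region containing an input x from the ReLU activation pattern, and checks that the network output equals Ax + c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU network: f(x) = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.standard_normal((16, 4)), rng.standard_normal(16)
W2, b2 = rng.standard_normal((3, 16)), rng.standard_normal(3)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_affine_map(x):
    """Affine map (A, c) of the spline partition region containing x.

    The ReLU activation pattern q encodes the region (the VQ code);
    within that region the network is exactly x -> A @ x + c."""
    q = (W1 @ x + b1 > 0).astype(float)       # region membership code
    A = W2 @ (q[:, None] * W1)                # = W2 diag(q) W1
    c = W2 @ (q * b1) + b2
    return A, c

x = rng.standard_normal(4)
A, c = region_affine_map(x)
assert np.allclose(f(x), A @ x + c)           # locally affine, as claimed
# Inputs sharing the same activation pattern lie in the same region and
# therefore share the same affine map (A, c).
```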
ACKNOWLEDGEMENTS

I would like to thank Prof. Herve Glotin for giving me the opportunity to enter the research world during my Bachelor's degree, with topics of great interest. Needless to say, without Herve's passion and love for his academic profession, I would not be doing research in this exciting field of machine and deep learning. Herve has done much more than just provide me with an opportunity; he has molded me into a curious dreamer, a quality that I hope to hold onto for as long as possible in order to one day follow in Herve's steps. I would also like to especially thank Prof. Sebastien Paris for considering me as an equal colleague during my Bachelor's research internships and thereafter. Sebastien's rigor, knowledge, and pragmatism have influenced me greatly in the most positive way. I am also grateful for the countless invaluable encounters I have had within the LSIS team, such as with Prof. Ricard Marxer; the LJLL team, such as with Prof. Frederic Hecht and Prof. Albert Cohen; and the DI ENS team, such as with Prof. Stephane Mallat and Prof. Vincent Lostanlen, all sharing two common traits: limitless expertise in their field and an unbounded desire to share their knowledge. I would also like to thank Prof. Richard Baraniuk for taking me into his group and for constantly inspiring me to produce work of the highest quality. Rich's influence has allowed me to considerably improve my ability not only to conduct research but also to communicate it. I would have been an incomplete PhD candidate without this essential skill. I also want to thank Prof. Rudolf Riedi for taking me on a mathematical tour. Rolf's ability to seamlessly bridge the most abstract theoretical concepts and the most intuitive observations will never cease to amaze me and to fuel my desire to learn. I also thank Sina Alemohammad, CJ Barberan, Yehuda Dar, Ahmed Imtiaz, Hamid Javadi, Daniel Lejeune, Lorenzo Luzi, Tan Nguyen, Jasper Tan, and Zichao Wang, who are part of the Deep Learning group at Rice and with whom I have been collaborating, discussing, and brainstorming. I also want to thank, beyond words, my family, from whom I have never stopped learning, and my partner Dr. Kerda Varaku for mollifying the world around me (while performing a multi-year reinforcement learning experiment on me, probably still in progress today). I also want to give a special word of thanks to Dr. Romain Cosentino, with whom I have been blindly pursuing ideas that led us to novel fields, and to Dr. Leonard Seydoux, with whom I discovered geophysics in the most interesting and captivating way.

This work was partially supported by NSF grants IIS-17-30574 and IIS-18-38177; AFOSR grant FA9550-18-1-0478; ARO grant W911NF-15-1-0316; ONR grants N00014-17-1-2551 and N00014-18-12571; DARPA grant G001534-7500; a DOD Vannevar Bush Faculty Fellowship (NSSEFF) grant N00014-18-1-2047; and a BP fellowship from the Ken Kennedy Institute.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables
Notations
1 Introduction
  1.1 Motivation
  1.2 Deep Networks
    1.2.1 Layers
    1.2.2 Training
    1.2.3 Approximation Results
  1.3 Related Works
    1.3.1 Mathematical Formulations of Deep Networks
    1.3.2 Training of Deep Generative Networks
    1.3.3 Batch-Normalization Understandings
    1.3.4 Deep Network Pruning
  1.4 Contributions
2 Max-Affine Splines for Convex Function Approximation
  2.1 Spline Functions
  2.2 Max-Affine Splines
  2.3 (Max-)Affine Spline Fitting
3 Deep Networks: Composition of Max-Affine Spline Operators
  3.1 Max-Affine Spline Operators
  3.2 From Deep Network Layers to Max-Affine Spline Operators
  3.3 Composition of Max-Affine Spline Operators
  3.4 Deep Networks Input Space Partition: Power Diagram Subdivision
    3.4.1 Voronoi Diagrams and Power Diagrams
    3.4.2 Single Layer: Power Diagram
    3.4.3 Composition of Layers: Power Diagram Subdivision
  3.5 Discussions
4 Insights Into Deep Generative Networks
  4.1 Introduction
    4.1.1 Related Works
    4.1.2 Contributions
  4.2 Deep Generative Network Latent and Intrinsic Dimension
    4.2.1 Input-Output Space Partition and Per-Region Mapping
    4.2.2 Generated Manifold Angularity
    4.2.3 Generated Manifold Intrinsic Dimension
    4.2.4 Effect of Dropout/Dropconnect
  4.3 Per-Region Affine Mapping Interpretability and Manifold Tangent Space
    4.3.1 Per-Region Mapping as Local Coordinate System and Disentanglement
    4.3.2 Tangent Space Regularization
  4.4 Density on the Generated Manifold
    4.4.1 Analytical Output Density
    4.4.2 On the Difficulty of Generating Low Entropy/Multimodal Distributions
  4.5 Discussions
5 Expectation-Maximization for Deep Generative Networks
  5.1 Introduction
    5.1.1 Related Works
    5.1.2 Contributions
  5.2 Posterior and Marginal Distributions of Deep Generative Networks
    5.2.1 Conditional, Marginal and Posterior Distributions of Deep Generative Networks
    5.2.2 Obtaining the DGN Partition
    5.2.3 Gaussian Integration on the Deep Generative Network Latent Partition
  5.3 Expectation-Maximization Learning of Deep Generative Networks
    5.3.1 Expectation Step
    5.3.2 Maximization Step
    5.3.3 Empirical Validation and VAE Comparison
  5.4 Discussions
6 Insights Into Deep Network Pruning
  6.1 Introduction
    6.1.1 Related Works
    6.1.2 Contributions
  6.2 Winning Tickets and DN Initialization
    6.2.1 The Initialization Dilemma and the Importance of Overparametrization
    6.2.2 Better DN Initialization: An Alternative to Pruning
  6.3 Pruning Continuous Piecewise Affine DNs
    6.3.1 Interpreting Pruning from a Spline Perspective
    6.3.2 Spline Early-Bird Tickets Detection
    6.3.3 Spline Pruning Policy
  6.4 Experiment Results
    6.4.1 Proposed Layerwise Spline Pruning over SOTA Pruning Methods
    6.4.2 Proposed Global Spline Pruning over SOTA Pruning Methods
  6.5 Discussions
7 Insights into Batch-Normalization
  7.1 Introduction
    7.1.1 Related Works
    7.1.2 Contributions
  7.2 Batch Normalization: Unsupervised Layer-Wise Fitting
    7.2.1 Batch-Normalization Updates
    7.2.2 Layer Input Space Hyperplanes and Partition
    7.2.3 Translating the Hyperplanes
  7.3 Multiple Layer Analysis: Following the Data Manifold
    7.3.1 Deep Network Partition and Boundaries
    7.3.2 Interpreting Each Batch-Normalization Parameter
    7.3.3 Experiments: Batch-Normalization Focuses the Partition onto the Data
  7.4 Where is the Decision Boundary
    7.4.1 Batch-Normalization is a Smart Initialization
    7.4.2 Experiments: Batch-Normalization Initialization Jump-Starts Training
  7.5 The Role of the Batch-Normalization Learnable Parameters
  7.6 Batch-Normalization Noisiness
  7.7 Discussions
8 Insights Into (Smooth) Deep Networks Nonlinearities
  8.1 Introduction
  8.2 Max-Affine Splines meet Gaussian Mixture Models
    8.2.1 From MASO to GMM via K-Means
    8.2.2 Hard-VQ Inference
    8.2.3 Soft-VQ Inference
    8.2.4 Soft-VQ MASO Nonlinearities
  8.3 Hybrid Hard/Soft Inference via Entropy Regularization
  8.4 Discussions
A Insights into Generative Networks
  A.1 Architecture Details
  A.2 Proofs
Recommended publications
  • LNAI 4264, Pp
Lecture Notes in Artificial Intelligence 4264. Edited by J. G. Carbonell and J. Siekmann. Subseries of Lecture Notes in Computer Science. José L. Balcázar, Philip M. Long, Frank Stephan (Eds.): Algorithmic Learning Theory, 17th International Conference, ALT 2006, Barcelona, Spain, October 7-10, 2006, Proceedings. Series Editors: Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA; Jörg Siekmann, University of Saarland, Saarbrücken, Germany. Volume Editors: José L. Balcázar, Universitat Politecnica de Catalunya, Dept. Llenguatges i Sistemes Informatics, c/ Jordi Girona, 1-3, 08034 Barcelona, Spain, E-mail: [email protected]; Philip M. Long, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA, E-mail: [email protected]; Frank Stephan, National University of Singapore, Depts. of Mathematics and Computer Science, 2 Science Drive 2, Singapore 117543, Singapore, E-mail: [email protected]. Library of Congress Control Number: 2006933733. CR Subject Classification (1998): I.2.6, I.2.3, F.1, F.2, F.4, I.7. LNCS Sublibrary: SL 7 – Artificial Intelligence. ISSN 0302-9743. ISBN-10 3-540-46649-5 Springer Berlin Heidelberg New York. ISBN-13 978-3-540-46649-9 Springer Berlin Heidelberg New York. This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer.
  • MMLS 2017 Booklet
MMLS 2017 Booklet

Monday, June 19
9:30 - 11:15  TTIC  Continental breakfast. (TTIC Colloquium: 10:00-11:00.)
11:15 - 11:30  GPAH  Opening remarks: Po-Ling Loh.
11:30 - 12:20  GPAH  Plenary speaker: Devavrat Shah (chair: Po-Ling Loh). Latent Variable Model Estimation via Collaborative Filtering.
12:20 - 2:50  GPAH  Lunch, posters.
2:50 - 3:30  GPAH  Invited talks (chair: Mesrob Ohannessian). 2:50: Rina Foygel, Projected Gradient Descent with Nonconvex Constraints. 3:10: Maxim Raginsky, Non-Convex Learning via Stochastic Gradient Langevin Dynamics.
3:30 - 4:00  GPAH  Coffee break.
4:00 - 4:50  GPAH  Plenary speaker: Rayid Ghani (chair: Nati Srebro). Machine Learning for Public Policy: Opportunities and Challenges.
4:50 - 5:30  GPAH  Invited talks (chair: Laura Balzano). 4:50: Dimitris Papailiopoulos, Gradient Diversity Empowers Distributed Learning. 5:10: Alan Ritter, Large-Scale Learning for Information Extraction.
5:45 - 7:00  TTIC  Reception, with remarks by Sadaoki Furui (TTIC President).

Tuesday, June 20
8:30 - 9:50  GPAH  Continental breakfast.
9:00 - 9:50  GPAH  Bonus speaker: Larry Wasserman (chair: Mladen Kolar). Locally Optimal Testing. [Cancelled.]
9:50 - 10:50  GPAH  Invited talks (chair: Misha Belkin). 9:50: Srinadh Bhojanapalli, Effectiveness of Local Search for Low Rank Recovery. 10:10: Niao He, Learning From Conditional Distributions. 10:30: Clayton Scott, Nonparametric Preference Completion.
10:50 - 11:20  GPAH  Coffee break.
11:20 - 12:20  GPAH  Invited talks (chair: Jason Lee). 11:20: Lev Reyzin, On the Complexity of Learning from Label Proportions. 11:40: Ambuj Tewari, Random Perturbations in Online Learning. 12:00: Risi Kondor, Multiresolution Matrix Factorization.
• On Bilinear Techniques for Similarity Search and Boolean Matrix Multiplication
Aalto University publication series DOCTORAL DISSERTATIONS 9/2020
On Bilinear Techniques for Similarity Search and Boolean Matrix Multiplication
Matti Karppa
Aalto University School of Science, Department of Computer Science
A doctoral dissertation completed for the degree of Doctor of Science (Technology) to be defended, with the permission of the Aalto University School of Science, at a public examination held at the lecture hall T2 of the school on 24 January 2020 at 12.
ISBN 978-952-60-8915-7 (printed), ISBN 978-952-60-8916-4 (pdf); ISSN 1799-4934 (printed), ISSN 1799-4942 (pdf); www.aalto.fi
  • Theory and Algorithms for Modern Problems in Machine Learning and an Analysis of Markets
Theory and Algorithms for Modern Problems in Machine Learning and an Analysis of Markets, by Ashish Rastogi. A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, May 2008. Richard Cole (Advisor), Mehryar Mohri (Advisor). © Ashish Rastogi, All Rights Reserved, 2008. To the most wonderful parents in the whole world, Mrs. Asha Rastogi and Mr. Shyam Lal Rastogi. Acknowledgements: First and foremost, I would like to thank my advisors, Professor Richard Cole and Professor Mehryar Mohri, for their unwavering support, guidance and constant encouragement. They have been inspiring mentors and much of what lies in the following pages can be credited to them. Working under their supervision has been one of the most enriching experiences of my life. I would also like to thank Professor Joel Spencer, Professor Arun Sundararajan, Professor Subhash Khot and Dr. Corinna Cortes for agreeing to serve as members on my thesis committee. Professor Spencer's class on Random Graphs remains one of the most stimulating courses I undertook as a graduate student. Internships at Google through the summers of 2005, 2006 and 2007 were some of the most enjoyable periods of my graduate school life. Many thanks are due to Dr. Corinna Cortes for providing me with the opportunity to work on several challenging problems at Google. Research initiated during these internships culminated in the development of ideas that form the bulk of this thesis. I would also like to thank my peers from the graduate school.
  • Session II – Unit I – Fundamentals
Tutor: Prof. Avadhoot S. Joshi
Subject: Design & Analysis of Algorithms

Session II, Unit I - Fundamentals
• Algorithm Design – II
• Algorithm as a Technology
• What kinds of problems are solved by algorithms?
• Evolution of Algorithms
• Timeline of Algorithms

Algorithm Design – II

Algorithm Design (Cntd.)
The figure briefly shows a sequence of steps one typically goes through in designing and analyzing an algorithm. (Figure: Algorithm design and analysis process.)

Algorithm Design (Cntd.)
Understand the problem: The first step in designing an algorithm is to understand the given problem completely. Read the problem's description carefully and ask questions if you have any doubts about the problem, do a few small examples by hand, think about special cases, and ask questions again if needed. There are a few types of problems that arise in computing applications quite often. If the problem in question is one of them, you might be able to use a known algorithm for solving it. If you have to choose among several available algorithms, it helps to understand how such an algorithm works and to know its strengths and weaknesses. But often you will not find a readily available algorithm and will have to design your own. An input to an algorithm specifies an instance of the problem the algorithm solves. It is very important to specify exactly the set of instances the algorithm needs to handle.

Algorithm Design (Cntd.)
Ascertaining the capabilities of the computational device: After understanding the problem, you need to ascertain the capabilities of the computational device the algorithm is intended for. The vast majority of algorithms in use today are still destined to be programmed for a computer closely resembling the von Neumann machine, a computer architecture outlined by the prominent Hungarian-American mathematician John von Neumann (1903–1957), in collaboration with A.
• The Boosting Approach to Machine Learning
Nonlinear Estimation and Classification, Springer, 2003. The Boosting Approach to Machine Learning: An Overview. Robert E. Schapire, AT&T Labs Research, Shannon Laboratory, 180 Park Avenue, Room A203, Florham Park, NJ 07932 USA. www.research.att.com/ schapire. December 19, 2001. Abstract: Boosting is a general method for improving the accuracy of any given learning algorithm. Focusing primarily on the AdaBoost algorithm, this chapter overviews some of the recent work on boosting, including analyses of AdaBoost's training error and generalization error; boosting's connection to game theory and linear programming; the relationship between boosting and logistic regression; extensions of AdaBoost for multiclass classification problems; methods of incorporating human knowledge into boosting; and experimental and applied work using boosting. 1 Introduction. Machine learning studies automatic techniques for learning to make accurate predictions based on past observations. For example, suppose that we would like to build an email filter that can distinguish spam (junk) email from non-spam. The machine-learning approach to this problem would be the following: Start by gathering as many examples as possible of both spam and non-spam emails. Next, feed these examples, together with labels indicating if they are spam or not, to your favorite machine-learning algorithm, which will automatically produce a classification or prediction rule. Given a new, unlabeled email, such a rule attempts to predict if it is spam or not. The goal, of course, is to generate a rule that makes the most accurate predictions possible on new test examples. Building a highly accurate prediction rule is certainly a difficult task.
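As a companion to the snippet above, the following is a small, self-contained AdaBoost sketch using decision stumps (an illustration of the re-weighting idea described there, not Schapire's reference code; the toy data, round count, and all names are assumptions). Each round fits a weak rule on the weighted sample, weighs it by its accuracy, and up-weights the examples it misclassifies.

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=20):
    """Minimal AdaBoost with decision stumps; labels y must be in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                       # example weights
    ensemble = []                                 # (alpha, feature, threshold, sign)
    for _ in range(n_rounds):
        best = None
        for j in range(d):                        # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] > thr, 1, -1)
                    err = w @ (pred != y)         # weighted training error
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = s * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)            # focus weight on mistakes
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

# Toy usage: two Gaussian blobs standing in for "spam" vs "non-spam" features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[np.full(50, -1), np.full(50, 1)]
model = adaboost_stumps(X, y)
print("training accuracy:", (predict(model, X) == y).mean())
```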
  • Algorithms and Hardness for Linear Algebra on Geometric Graphs
Algorithms and Hardness for Linear Algebra on Geometric Graphs
Josh Alman, Timothy Chu, Aaron Schild, Zhao Song

Abstract: For a function K : R^d × R^d → R≥0 and a set P = {x_1, ..., x_n} ⊂ R^d of n points, the K-graph G_P of P is the complete graph on n nodes where the weight between nodes i and j is given by K(x_i, x_j). In this paper, we initiate the study of when efficient spectral graph theory is possible on these graphs. We investigate whether or not it is possible to solve the following problems in n^(1+o(1)) time for a K-graph G_P when d < n^(o(1)):
• Multiply a given vector by the adjacency matrix or Laplacian matrix of G_P
• Find a spectral sparsifier of G_P
• Solve a Laplacian system in G_P's Laplacian matrix
For each of these problems, we consider all functions of the form K(u, v) = f(||u − v||_2^2) for a function f : R → R. We provide algorithms and comparable hardness results for many such K, including the Gaussian kernel, Neural tangent kernels, and more. For example, in dimension d = Ω(log n), we show that there is a parameter associated with the function f for which low parameter values imply n^(1+o(1))-time algorithms for all three of these problems and high parameter values imply the nonexistence of subquadratic-time algorithms, assuming the Strong Exponential Time Hypothesis (SETH), given natural assumptions on f. As part of our results, we also show that the exponential dependence on the dimension d in the celebrated fast multipole method of Greengard and Rokhlin cannot be improved, assuming SETH, for a broad class of functions f.
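For orientation, the first task above (multiplying a vector by the K-graph's adjacency or Laplacian matrix) has an obvious O(n^2 d)-time baseline; the paper's question is when that can be brought down to n^(1+o(1)). Below is a naive reference sketch of that baseline using the Gaussian kernel f(s) = exp(-s) as one example weight function (an illustration under assumed names, not the paper's algorithms).

```python
import numpy as np

def k_graph_laplacian_matvec(P, v, f=lambda s: np.exp(-s)):
    """Naive O(n^2 d) product of the K-graph Laplacian with a vector v.

    P is an (n, d) array of points; the weight between nodes i and j is
    K(x_i, x_j) = f(||x_i - x_j||_2^2), here the Gaussian kernel by default."""
    sq = np.sum(P * P, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * P @ P.T, 0.0)
    W = f(D2)                         # dense kernel (adjacency) matrix
    np.fill_diagonal(W, 0.0)          # no self-loops
    deg = W.sum(axis=1)
    return deg * v - W @ v            # L v = (D - W) v

rng = np.random.default_rng(0)
P = rng.standard_normal((500, 10))    # n = 500 points in dimension d = 10
v = rng.standard_normal(500)
print(k_graph_laplacian_matvec(P, v)[:3])
```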
  • Algorithms and Hardness for Linear Algebra on Geometric Graphs
2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)
Algorithms and Hardness for Linear Algebra on Geometric Graphs
Josh Alman (Harvard U.), Timothy Chu (Carnegie Mellon U.), Aaron Schild (U. of Washington), Zhao Song (Columbia, Princeton, IAS)
[email protected], [email protected], [email protected], [email protected]

Abstract: For a function K : R^d × R^d → R≥0 and a set P = {x_1, ..., x_n} ⊂ R^d of n points, the K-graph G_P of P is the complete graph on n nodes where the weight between nodes i and j is given by K(x_i, x_j). In this paper, we initiate the study of when efficient spectral graph theory is possible on these graphs. We investigate whether or not it is possible to solve the following problems in n^(1+o(1)) time for a K-graph G_P when d < n^(o(1)):
• Multiply a given vector by the adjacency matrix or Laplacian matrix of G_P
• Find a spectral sparsifier of G_P
• Solve a Laplacian system in G_P's Laplacian matrix
For each of these problems, we consider all functions of the form K(u, v) = f(||u − v||_2^2) for a function f : R → R.

The first task is a celebrated application of the fast multipole method of Greengard and Rokhlin [GR87], [GR88], [GR89], voted one of the top ten algorithms of the twentieth century by the editors of Computing in Science and Engineering [DS00]. The second task is spectral clustering [NJW02], [LWDH13], a popular algorithm for clustering data. The third task is to label a full set of data given only a small set of partial labels [Zhu05b], [CSZ09], [ZL05], which has seen increasing use in machine learning. One notable method for performing semi-supervised learning is the graph-based Laplacian regularizer method [LSZ+19b], [ZL05], [BNS06], [Zhu05a].
• Contents
Contents

ACM Awards Reception and Banquet, June 2018
Introduction
A.M. Turing Award
ACM Prize in Computing
ACM Charles P. “Chuck” Thacker Breakthrough in Computing Award
ACM – AAAI Allen Newell Award
Software System Award
Grace Murray Hopper Award
Paris Kanellakis Theory and Practice Award
Karl V. Karlstrom Outstanding Educator Award
Eugene L. Lawler Award for Humanitarian Contributions within Computer Science and Informatics
Distinguished Service Award
ACM Athena Lecturer Award
Outstanding Contribution
  • 12 | Learning Theory
12 | LEARNING THEORY

"The Universe is under no obligation to make sense to you." – Neil deGrasse Tyson

Learning Objectives:
• Explain why inductive bias is necessary.
• Define the PAC model and explain why both the "P" and "A" are necessary.
• Explain the relationship between complexity measures and regularizers.
• Identify the role of complexity in generalization.
• Formalize the relationship between margins and complexity.

By now, you are an expert at building learning algorithms. You probably understand how they work, intuitively. And you understand why they should generalize. However, there are several basic questions you might want to know the answer to. Is learning always possible? How many training examples will I need to do a good job learning? Is my test performance going to be much worse than my training performance? The key idea that underlies all these answers is that simple functions generalize well. The amazing thing is that you can actually prove strong results that address the above questions. In this chapter, you will learn some of the most important results in learning theory that attempt to answer these questions. The goal of this chapter is not theory for theory's sake, but rather as a way to better understand why learning models work, and how to use this theory to build better algorithms. As a concrete example, we will see how 2-norm regularization provably leads to better generalization performance, thus justifying our common practice!

12.1 The Role of Theory

In contrast to the quote at the start of this chapter, a practitioner friend once said "I would happily give up a few percent performance for an algorithm that I can understand." Both perspectives are completely valid, and are actually not contradictory.
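For the question "how many training examples will I need to do a good job learning?", one standard answer in the simplest setting (a finite hypothesis class, realizable PAC learning) is the bound m ≥ (1/ε)(ln|H| + ln(1/δ)); with that many examples, any hypothesis consistent with the training sample has true error at most ε with probability at least 1 − δ. The snippet below is a small illustrative calculation of this classic bound; it is not taken from this chapter.

```python
import math

def pac_sample_size(h_size, epsilon, delta):
    """Finite-class, realizable PAC bound:
    m >= (ln|H| + ln(1/delta)) / epsilon examples suffice so that, with
    probability >= 1 - delta, every consistent hypothesis has error <= epsilon."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Example: |H| = 2**20 hypotheses, target error 5%, confidence 99%.
print(pac_sample_size(2**20, epsilon=0.05, delta=0.01))   # prints 370
```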
  • Association for Computing Machinery 2 Penn Plaza, Suite 701, New York
ACM AWARDS
Association for Computing Machinery, 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, +1-212-869-7440, www.acm.org

Contents
ACM Awards Reception and Banquet, June 2017
Introduction
A.M. Turing Award
ACM Prize in Computing
ACM – AAAI Allen Newell Award
Software System Award
Grace Murray Hopper Award
Paris Kanellakis Theory and Practice Award
Karl V. Karlstrom Outstanding Educator Award
ACM Policy Award
Distinguished Service Award
ACM Athena Lecturer Award
Outstanding Contribution to ACM Award
ACM Presidential Award
Doctoral Dissertation Award
ACM Student Research Competition
ACM Fellows
Eugene L. Lawler Award for Humanitarian Contributions within Computer Science and Informatics
ACM Gordon Bell Prize
ACM – IEEE CS Eckert-Mauchly Award
ACM – IEEE CS Ken Kennedy Award
ACM – IEEE CS George Michael HPC Memorial Fellowships
SIAM – ACM Prize in Computational Science and Engineering
ACM – CSTA Cutler-Bell Prize
ACM India Doctoral Dissertation Award
ACM China Doctoral Dissertation Award
ACM China Rising Star Award
IPSJ/ACM Award for Early Career Contributions to Global Research
ACM Special Interest Group Awards
2017 ACM Award Subcommittee Chairs
ACM Award Nomination Submission Procedures
2018 ACM Award Subcommittee Chairs and Members
Past Recipients
ACM Fellows
In Memoriam, ACM Fellows

ACM Awards Reception & Banquet
The Westin St.
  • A Short Introduction to Boosting
Journal of Japanese Society for Artificial Intelligence, 14(5):771-780, September 1999. (In Japanese, translation by Naoki Abe.) A Short Introduction to Boosting. Yoav Freund and Robert E. Schapire, AT&T Labs Research, Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932 USA. www.research.att.com/{yoav, schapire}; {yoav, schapire}@research.att.com. Abstract: Boosting is a general method for improving the accuracy of any given learning algorithm. This short overview paper introduces the boosting algorithm AdaBoost, and explains the underlying theory of boosting, including an explanation of why boosting often does not suffer from overfitting as well as boosting's relationship to support-vector machines. Some examples of recent applications of boosting are also described. Introduction. A horse-racing gambler, hoping to maximize his winnings, decides to create a computer program that will accurately predict the winner of a horse race based on the usual information (number of races recently won by each horse, betting odds for each horse, etc.). To create such a program, he asks a highly successful expert gambler to explain his betting strategy. Not surprisingly, the expert is unable to articulate a grand set of rules for selecting a horse. On the other hand, when presented with the data for a specific set of races, the expert has no trouble coming up with a “rule of thumb” for that set of races (such as, “Bet on the horse that has recently won the most races” or “Bet on the horse with the most favored odds”). Although such a rule of thumb, by itself, is obviously very rough and inaccurate, it is not unreasonable to expect it to provide predictions that are at least a little bit better than random guessing.
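The value of rules of thumb that are only slightly better than random guessing is easy to see numerically: if T weak rules were independent and each correct with probability 0.55, a plain majority vote would already be much more accurate. (Boosting assumes neither independence nor plain voting; this is only an intuition-building sketch with assumed numbers.)

```python
from math import comb

def majority_vote_accuracy(p, T):
    """Probability that a majority of T independent weak rules, each correct
    with probability p, gives the right answer (T odd)."""
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

for T in (1, 11, 101, 1001):
    print(T, round(majority_vote_accuracy(0.55, T), 4))
# roughly: 1 -> 0.55, 11 -> 0.63, 101 -> 0.84, 1001 -> 0.999
```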