Deep Learning


Elisa Ricci, University of Perugia

Outline
● Introduction to Deep Learning
● Brief history of Neural Networks
● Deep Learning
● Recent technical advances
● Models: Autoencoders, CNNs, RNNs
● Open problems
● Deep Learning hands-on

The AI Revolution

Computer Vision
● Amazing progress in the last few years with Convolutional Neural Networks (CNNs)
● ImageNet: 1.4M images, 1K categories (2009)
● ImageNet Large-Scale Visual Recognition Challenge (ILSVRC)
(Figure: traditional ML vs deep models vs human performance.)

Speech Processing
● Machine translation: Rick Rashid in Tianjin, China, October 25, 2012. A voice recognition program translated a speech given by Richard F. Rashid, Microsoft's top scientist, into Mandarin Chinese.
● Text-to-Speech: WaveNet

Creativity
● Style transfer [Ruder et al.]
● Generating Donald Trump's Twitter eruptions

Machine Learning vs Deep Learning (2002 vs 2012)

Deep Learning
● What is deep learning?
● Why is it generally better than traditional ML methods on images, speech and certain other types of data?

What is deep learning?
● Deep learning means using a neural network with several layers of nodes between input and output.
(Figure: a feed-forward network with an input layer, several hidden layers and an output layer.)

More formally
● A family of parametric models which learn non-linear hierarchical representations:

  h_L = f(x; θ) = g(θ_L · g(θ_{L−1} · … · g(θ_1 · x)))

  where x is the input, the θ_l are the parameters of the network, g is a non-linear activation, and h_L is the activation of layer L.

…and informally
● Why is it generally better than other ML methods on images, speech and certain other types of data? The series of layers between input and output computes relevant features automatically, in a series of stages, just as our brains seem to.
● …but neural networks have been around for 25 years. So, what is new?

Biological neuron
(Figure: a neuron with dendrites, synapses, nucleus, axon and cell body, or soma.)
• A neuron has
  – branching input (the dendrites)
  – branching output (the axon)
• Information moves from the dendrites to the axon via the cell body
• The axon connects to dendrites via synapses
  – Synapses vary in strength
  – Synapses may be excitatory or inhibitory

Perceptron
● An artificial neuron (perceptron) is a non-linear parameterized function with restricted output range.
(Figure: inputs x_1 … x_5, weighted by w_1 … w_5, are summed (Σ) and passed through an activation function to produce the output h.)

Biological neuron and Perceptron
(Figure: the two diagrams side by side; the dendrites correspond to the inputs, the soma to the summation, the axon to the output.)

Brief History of Neural Networks [VUNO]

1943 – McCulloch & Pitts model
● Early model of the artificial neuron
● Generates a binary output
● The weight values are fixed
(Figure: weighted inputs are summed and compared against a threshold.)

1958 – Perceptron by Rosenblatt
● The perceptron as a machine for linear classification
● Main idea: learn the weights, and consider a bias b
● One weight per input
● Multiply the weights with the respective inputs and add the bias; if the result is larger than a threshold return 1, otherwise 0

Activation functions

First AI winter
● The exclusive or (XOR) cannot be solved by perceptrons
● Neural models cannot be applied to complex tasks
● But, can XOR be solved by neural networks? Multi-layer perceptrons (MLPs) can solve XOR, as the sketch below shows
● A few years later Minsky built such an MLP
(Figure: an MLP with inputs x_1, x_2, …, x_n, two hidden layers and an output layer.)
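A minimal sketch of this point, assuming a unit-step activation and hand-picked weights and biases (the values and helper names are illustrative, not from the lecture): a single perceptron realizes the linearly separable gates, and a two-layer composition of such units computes XOR.

```python
import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias, followed by a hard threshold.
    return 1 if np.dot(w, x) + b > 0 else 0

# A single perceptron can realize the linearly separable gates...
AND  = lambda x: perceptron(x, np.array([1.0, 1.0]), -1.5)
OR   = lambda x: perceptron(x, np.array([1.0, 1.0]), -0.5)
NAND = lambda x: perceptron(x, np.array([-1.0, -1.0]), 1.5)

# ...but no single (w, b) computes XOR. A two-layer MLP can:
# XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)).
def xor_mlp(x):
    hidden = np.array([OR(x), NAND(x)])  # hidden layer of two perceptrons
    return AND(hidden)                   # output perceptron

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_mlp(np.array(x)))  # prints 0, 1, 1, 0
```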
Multi-layer feed-forward Neural Network
● Main idea: densely connect artificial neurons to realize compositions of non-linear functions
● The information is propagated from the inputs to the outputs
● The network is a Directed Acyclic Graph (DAG)
● Tasks: classification, regression
● The input data are n-dimensional, usually feature vectors

First AI winter
● How to train an MLP?
● Rosenblatt's algorithm is not applicable, as it expects to know the desired target d_i of each unit
● For hidden layers we cannot know the desired target

1986 – Backpropagation
● Backpropagation revitalized the field
● Learning MLPs for complicated functions became solvable
● An efficient algorithm which processes "large" training sets
● Allowed for complicated neural network architectures
● Today backpropagation is still at the core of neural network training
● Werbos (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. Thesis, Harvard University.
● Rumelhart, Hinton, Williams (1986). Learning representations by back-propagating errors. Nature.

Backpropagation
● Learning is the process of modifying the weights θ_l of each layer in order to produce a network that performs some function
● Preliminary steps:
  – Collect/acquire a training set {X, Y}
  – Define the model and randomly initialize its weights
● Given the training set, find the weights: θ* = argmin_θ Σ_i L(h_θ(x_i), y_i)
● Given the true label vector y_i and the hypothesis a_L = h_θ(x_i), there are three phases:
  1) Forward propagation: sum inputs, produce activations, feed forward
  2) Error estimation
  3) Back-propagate the error signal and use it to update the weights

Backpropagation (the full procedure; a numpy sketch appears after the history slides below)
● Randomly initialize the weights
● While the error is too large:
  (1) For each training sample (presented in random order), apply the inputs to the network and calculate the output of every neuron, from the input layer through the hidden layers to the output layer
  (2) Calculate the error at the outputs
  (3) Use the output error to compute error signals for the previous layers, use the error signals to compute weight adjustments, and apply the weight adjustments
● Periodically evaluate the network performance

Backpropagation
● Optimization with gradient descent: the most important component is how to compute the gradient
● The backward computation of the network returns the gradient
● Recursive rule linking the error signal of the current layer to the previous one: δ_l = (θ_{l+1}ᵀ δ_{l+1}) ⊙ g′(a_l)

1990s – CNN and LSTM
● Important advances in the field:
  – Backpropagation
  – Recurrent Long Short-Term Memory networks (Schmidhuber, 1997)
  – Convolutional Neural Networks: OCR solved before the 2000s (LeNet, 1998)

Second AI winter
● NNs cannot exploit many layers:
  – Overfitting
  – Vanishing gradient (with NN training you need to multiply several small numbers → they become smaller and smaller)
● Lack of processing power (no GPUs)
● Lack of data (no large annotated datasets)

Second AI winter
● Kernel machines (e.g. SVMs) suddenly became very popular
● Similar accuracies to NNs on the same tasks
● Much fewer heuristics and parameters
● Nice proofs on generalization

The believers

2006 – Learning deep belief nets
● A clever way of initializing network weights:
  – Train each layer one by one with unsupervised training (using contrastive divergence)
  – Much better than random values
  – Fine-tune the weights with a round of supervised learning, just as is normal for neural nets
● State-of-the-art performance on the MNIST dataset [Hinton et al.]
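The training loop sketched in the backpropagation slides above, written out for a tiny one-hidden-layer sigmoid network in plain numpy. This is a minimal illustration under assumed choices (squared-error loss, full-batch updates, 4 hidden units, learning rate 1.0), not the lecture's own code; the delta recursion is the rule δ_l = (θ_{l+1}ᵀ δ_{l+1}) ⊙ g′(a_l) given above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: XOR, the problem a single perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Randomly initialize the weights (one hidden layer of 4 units).
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 1.0  # illustrative learning rate

for epoch in range(5000):
    # 1) Forward propagation: sum inputs, produce activations, feed forward.
    a1 = X @ W1 + b1; h1 = sigmoid(a1)
    a2 = h1 @ W2 + b2; h2 = sigmoid(a2)
    # 2) Error estimation at the output (squared-error loss here).
    delta2 = (h2 - Y) * h2 * (1 - h2)          # dL/da2
    # 3) Back-propagate the error signal with the recursive rule
    #    delta_l = (delta_{l+1} W_{l+1}^T) * g'(a_l).
    delta1 = (delta2 @ W2.T) * h1 * (1 - h1)   # dL/da1
    # Use the error signals to compute and apply the weight adjustments.
    W2 -= lr * (h1.T @ delta2); b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * (X.T @ delta1);  b1 -= lr * delta1.sum(axis=0)

print(np.round(h2.ravel(), 2))  # should end up close to [0, 1, 1, 0]
```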
2012 – AlexNet
● Hinton's group implemented a CNN similar to LeNet [LeCun, 1998], but…
● …trained on ImageNet with two GPUs
● …with some technical improvements (ReLU, dropout, data augmentation)
(Figure: traditional ML vs deep models vs human performance on ILSVRC.)

Why so powerful?
● Builds an improved feature space
● The first layer learns first-order features (e.g. edges…)
● Subsequent layers learn higher-order features (combinations of first-layer features, combinations of edges, etc.)
● The final layer of transformed features is fed into supervised layer(s)

Learning hierarchical representations
● Traditional framework: image → handcrafted features → simple classifier → "pear"
● Deep learning: image → Layer 1 (θ_1) → Layer 2 (θ_2) → Layer 3 (θ_3) → simple classifier → "pear"

Why Deep Learning now?
● Three main factors:
  – Better hardware: GPUs [NVIDIA Blog]
  – Big data: large, fully annotated datasets
  – Technical advances: layer-wise pretraining, optimization (e.g. Adam, batch normalization), regularization (e.g. dropout), …

Advances with Deep Learning
● Better:
  – activation functions (ReLU)
  – training schemes
  – weight initialization
  – ways to address overfitting (dropout)
  – normalization between layers
  – residual deep learning
  – …

Sigmoid activations, σ(a) = 1 / (1 + e^{−a})
● Positive facts:
  – The output can be interpreted as a probability
  – The output is bounded in [0, 1]
● Negative facts:
  – Gradients are always multiplied by values < 1, so they can become small
  – The gradient at the tails is flat, close to 0: almost no weight updates

Rectified Linear Units, g(a) = max(0, a)
● More efficient gradient propagation (the derivative is 0 or constant)
● More efficient computation (only comparison, addition and multiplication)
● Sparse activation: e.g. in a randomly initialized network only about 50% of the hidden units are activated (have a non-zero output)
● Lots of variations have been proposed recently

Losses
● The sum-squared-error (L2) loss gradient seeks the maximum likelihood hypothesis under the assumption that the training data can be modeled by normally distributed noise added to the target function value.
● Fine for regression, but less natural for classification.
● For classification problems it is advantageous, and increasingly popular, to use the softmax activation function just at the output layer, with the cross-entropy loss function.

Softmax and Cross Entropy
● Softmax softens 1-of-k targets to mimic a probability vector for each output: p_j = e^{a_j} / Σ_k e^{a_k}
● Cross-entropy loss, L = −Σ_j y_j log p_j: the most popular classification loss for classifiers that output probabilities
● A generalization of logistic regression to more than two outputs
● These loss and activation functions help avoid gradient saturation

Stochastic Gradient Descent (SGD)
● Use a mini-batch B sampled from the dataset for the gradient estimate: θ ← θ − (η/|B|) Σ_{i∈B} ∇_θ L(h_θ(x_i), y_i)
● Sometimes helps to escape from local minima
● Noisy gradients act as regularization
● Also suitable for datasets that change over time
● The variance of the gradients increases when the batch size decreases
● It is not clear how many samples to use per batch

Learning rate
● Great impact on learning performance

Momentum
● Gradient updates with momentum: v ← μv − η∇_θ L, then θ ← θ + v
● Prevents the gradient direction from switching all the time
● Faster and more robust convergence

Adaptive Learning
● Popular schemes (see the sketch below):
  – Nesterov momentum: calculate the gradient at the point you would go to if using normal momentum.
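A short sketch combining the last few slides: a linear softmax classifier trained with the cross-entropy loss and momentum SGD. The toy data, learning rate and momentum coefficient are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    # Subtract the row max for numerical stability; rows sum to 1.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy 3-class problem: three Gaussian blobs (illustrative data).
n, k = 50, 3
X = np.concatenate([rng.normal(c, 0.5, (n, 2))
                    for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat(np.arange(k), n)
Y = np.eye(k)[y]                       # 1-of-k (one-hot) targets

W, b = np.zeros((2, k)), np.zeros(k)   # linear classifier parameters
vW, vb = np.zeros_like(W), np.zeros_like(b)
lr, mu = 0.5, 0.9                      # learning rate, momentum

for epoch in range(200):
    P = softmax(X @ W + b)             # class probabilities
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))  # cross entropy
    # With softmax + cross entropy, the gradient wrt the logits is P - Y.
    G = (P - Y) / len(X)
    gW, gb = X.T @ G, G.sum(axis=0)
    # Momentum update: the velocity accumulates past gradients,
    # damping oscillations in the update direction.
    vW, vb = mu * vW - lr * gW, mu * vb - lr * gb
    W, b = W + vW, b + vb

acc = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(f"final loss {loss:.3f}, train accuracy {acc:.2f}")
```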