Word2vec Embeddings: CBOW and Skipgram


VL Embeddings, Uni Heidelberg, SS 2019

Outline: Skipgram – Intuition · Gradient Descent · Stochastic Gradient Descent · Backpropagation

Skipgram – Intuition

• Window size: 2
• Center word at position $t$, e.g. "Maus":

  Die [kleine graue (Maus) frißt den] leckeren Käse
       w_{t-2} w_{t-1}  w_t  w_{t+1} w_{t+2}

• Predict each context word from the center word: $P(w_{t-2} \mid w_t)$, $P(w_{t-1} \mid w_t)$, $P(w_{t+1} \mid w_t)$, $P(w_{t+2} \mid w_t)$
• The window then slides on, e.g. to the center word "frißt" at position $t+1$
• The same probability distribution is used for all context words
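To make the sliding window concrete, here is a minimal sketch (plain Python; the function name `skipgram_pairs` is illustrative, not from the slides) that enumerates the (center, context) pairs a skipgram model with window size m = 2 is trained on:

```python
def skipgram_pairs(tokens, m=2):
    """Enumerate (center, context) training pairs for a window of size m."""
    pairs = []
    for t, center in enumerate(tokens):
        # context positions t+j with -m <= j <= m, j != 0, inside the sentence
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

sentence = "Die kleine graue Maus frißt den leckeren Käse".split()
for center, context in skipgram_pairs(sentence):
    if center == "Maus":  # the slide's first example
        print(f"P({context} | {center})")
# -> P(kleine | Maus), P(graue | Maus), P(frißt | Maus), P(den | Maus)
```

Each such pair contributes one term $P(w_{t+j} \mid w_t)$ to the likelihood defined next.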
Skipgram – Objective function

For each position $t = 1, \dots, T$, predict the context words within a window of fixed size $m$, given the center word $w_t$:

$$\text{Likelihood} = L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t; \theta) \qquad (1)$$

What is $\theta$? It is the set of vector representations of each word.

The objective function $J(\theta)$ (also called cost function or loss function) is the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t; \theta) \qquad (2)$$

Minimising the objective function ⇔ maximising predictive accuracy.

Objective function – Motivation

• We want to model the probability distribution over mutually exclusive classes
• We measure the difference between the predicted probabilities $\hat{y}$ and the ground-truth probabilities $y$
• During training, we tune the parameters so that this difference is minimised

Negative log-likelihood

Why is minimising the negative log-likelihood equivalent to maximum likelihood estimation (MLE)?

$$\text{MLE} = \arg\max_\theta L(\theta; x)$$

• The log allows us to convert a product of factors into a summation of factors (nicer mathematical properties)
• $\arg\max_x f(x)$ is equivalent to $\arg\min_x (-f(x))$

Applying both steps to the likelihood (1) yields exactly the objective $J(\theta)$ in (2).

• We can interpret a negative log-probability as information content or surprisal. What is the log-likelihood of a model, given an event? ⇒ The negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

Cross entropy loss

The negative log-likelihood is the same as the cross entropy.

Recap: Entropy
• If a discrete random variable $X$ has the probability $p(x)$, then the entropy of $X$ is
$$H(X) = \sum_x p(x) \log \frac{1}{p(x)} = -\sum_x p(x) \log p(x)$$
⇒ the expected number of bits needed to encode $X$ if we use an optimal coding scheme

Cross entropy
⇒ the number of bits needed to encode $X$ if we use a suboptimal coding scheme $q(x)$ instead of $p(x)$:
$$H(p, q) = \sum_x p(x) \log \frac{1}{q(x)} = -\sum_x p(x) \log q(x)$$

Cross entropy loss and Kullback-Leibler divergence

Cross entropy is always larger than entropy (exception: if $p = q$, where they are equal).

Kullback-Leibler (KL) divergence: the difference between cross entropy and entropy
$$KL(p \,\|\, q) = \sum_x p(x) \log \frac{1}{q(x)} - \sum_x p(x) \log \frac{1}{p(x)} = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
⇒ the number of extra bits needed when using $q(x)$ instead of $p(x)$ (also known as the relative entropy of $p$ with respect to $q$)

Putting these together:
$$H(p, q) = -\sum_{x \in X} p(x) \log q(x) = H(p) + KL(p \,\|\, q)$$

Since $H(p)$ does not depend on $q$: minimising $H(p, q)$ → minimising the KL divergence from $q$ to $p$.
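The identities above are easy to check numerically. A small sketch (NumPy; the two distributions are made up for illustration):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])  # "true" distribution p(x)
q = np.array([0.4, 0.4, 0.2])    # model distribution q(x)

H_p  = -np.sum(p * np.log2(p))      # entropy: bits under the optimal code
H_pq = -np.sum(p * np.log2(q))      # cross entropy: bits when coding with q
KL   =  np.sum(p * np.log2(p / q))  # extra bits: relative entropy of p w.r.t. q

assert np.isclose(H_pq, H_p + KL)   # H(p, q) = H(p) + KL(p||q)
assert H_pq >= H_p                  # cross entropy >= entropy (equal iff p = q)
print(f"H(p) = {H_p:.3f}, H(p, q) = {H_pq:.3f}, KL(p||q) = {KL:.3f}")
```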
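Tying the two threads together: with a one-hot ground truth (probability 1 on the context word that was actually observed), the cross entropy of a single prediction reduces to the negative log-probability of that word, so the objective (2) is exactly an average cross entropy. Below is a minimal sketch of computing $J(\theta)$ for a few pairs, assuming the standard word2vec parameterisation in which $\theta$ consists of a center-vector matrix V and a context-vector matrix U and $P(w_{t+j} \mid w_t; \theta)$ is a softmax over the vocabulary (toy dimensions, random parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 8, 5
V = rng.normal(size=(vocab_size, dim))  # center-word vectors (part of theta)
U = rng.normal(size=(vocab_size, dim))  # context-word vectors (part of theta)

def log_prob(context_id, center_id):
    """log P(w_context | w_center; theta) as a softmax over the vocabulary."""
    scores = U @ V[center_id]  # one score per vocabulary word
    return scores[context_id] - np.log(np.sum(np.exp(scores)))

# J(theta): average negative log-likelihood over (center, context) pairs,
# i.e. the average cross entropy against one-hot targets
pairs = [(3, 1), (3, 2), (3, 4), (3, 5)]  # word ids, e.g. the "Maus" pairs above
J = -np.mean([log_prob(context, center) for center, context in pairs])
print(f"J(theta) = {J:.3f}")
```

Gradient descent on $J(\theta)$, covered in the following sections of the deck, then updates V and U.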