Machine Learning & Data Mining

Total Page:16

File Type:pdf, Size:1020Kb

Machine Learning & Data Mining MACHINE LEARNING & DATA MINING CS/CNS/EE 155 Deep Learning Part I 1 logistic regression Y 1 logistic function P (Y =1X) | 1 P (Y =1X)= (w|x+w ) | 1+e− 0 0 X can fit binary labels Y 0, 1 2 { } | w x + w0 =0 defines the boundary between the classes in higher dimensions, this is a hyperplane 2 X2 Y =0 Y =1 X1 additional features in X result in additional weights in w 3 if Y 0, 1 2 { } 1 P (Y =1X)= (w|x+w ) | 1+e− 0 use binary cross-entropy loss function N = y(i) log(P (y =1x(i))) + (1 y(i)) log(1 P (y =1x(i))) L − i | − − i | i=1 X h i when yi =1make the when yi =0make the output closeyi to=1 output closeyi to=0 4 if Y 1, 1 2 {− } 1 P (y x)= y(wT x+w ) | 1+e− 0 use logistic loss function N y(i)(wT x(i)+w ) = log(1 + e− 0 ) L i=1 X T (i) (i) make sure w x + w0 and y have the same sign, and T (i) w x + w0 is large in magnitude 5 how do we extend logistic regression to handle multiple classes? y 1,...,K 2 { } approach I split the points into groups of one vs. rest, train model on each split X2 X2 X2 X2 X1 X1 X1 X1 approach II train one model on all data classes simultaneously X2 (w|x+w ) e k 0,k P (Y = k X)= | K (w x+w0,k ) | e k0 0 k0=1 P softmax function X1 6 (w|x+w ) e k 0,k multi-class logistic regression P (Y = k X)= | K (w x+w0,k ) | e k0 0 k0=1 P 1 (w|x+w ) assume probabilities of the form P (Y = k)= e k 0,k Z (log linear) Z is the ‘partition function,’ which normalizes the probabilities probabilities must sum to one K K 1 (w|x+w ) P (Y = k)= e k 0,k =1 Z kX=1 kX=1 K (w|x+w ) therefore Z = e k 0,k kX=1 7 logistic regression is a linear classifier linear scoring function is passed through the non-linear logistic function to give a probability output often works well for simple data distributions breaks down when confronted with data distributions that are not linearly separable AND OR XOR X2 X2 X2 X1 X1 X1 linearly separable linearly separable not linearly separable 8 to tackle non-linear data distributions with a linear approach, we need to turn it into a linear problem use a set of non-linear features in which the data are linearly separable one approach: use a set of pre-defined non-linear transformations XOR X2 X ,X X ,X ,X X 1 2 ! 1 2 1 2 linear decision hyperbolic decision boundary boundary X1 9 to tackle non-linear data distributions with a linear approach, we need to turn it into a linear problem use a set of non-linear features in which the data are linearly separable another approach: logistic regression outputs a non-linear transformation, use multiple stages of logistic regression XOR X1,X2 X1 X2,X1 X2 X2 linear decision! multiple^ linear decision_ boundary boundaries XOR: (X1 X2) (X1 X2) X1 ¬ ^ ^ _ 10 we used multiple stages of linear classifiers to create a a non-linear classifier as the number of stages increases, so too does the expressive power of the resulting non-linear classifier depth: the number of stages of processing 11 the point of deep learning with enough stages of linear/non-linear operations (depth), we can learn a good set of non-linear features to linearize any non-linear problem 12 basic operation: logistic regression X1 weights X 2 output feature summation non-linearity X3 X 4 X input features Xp artificial neuron 13 multiple operations X1 X2 X X3 X X4 Xp X layer of artificial neurons 14 multiple stages of operations X X X X X artificial neural network 15 notation X X X X X at the input layer X0 units 16 notation X X X X X to the first layer W 1 weights 17 notation X X X X X S1 at the first layer summations 18 notation X X X X X σ non-linearity 19 notation X X X X at the first layer X X1 units 20 notation X X X X X etc. 21 notation bias unit 1 1 S1 1 W1,0 1 W1,1 0 X1 X 1 W 1,2 0 1 X X2 1,3 W 0 X X3 0 1 ,N 1 W X number of units in layer 0 0 X 0 N X N 0 1 1 1 0 S1 = W1,0 + W1,iXi i=1 bias weight X 22 notation 1 1 S1 1 W1,0 1 W1,1 0 X1 X 1 W 1,2 0 1 X X2 1,3 W 0 X X3 0 1 ,N 1 W X 0 XN 0 X 0 1 1 0 absorb bias unit into X1 S = W X 0 1 1 1 X1,0 =1 vectorized form (dot product) 23 notation 1 S1 1 0 X1 X1 S2 0 X X2 0 X X3 1 SN 1 X 0 XN 0 X S1 = W 1X0 fully vectorized form (matrix multiplication) 24 notation 1 S1 1 1 X1 0 X1 X1 S2 0 1 X X2 X2 0 X X3 1 SN 1 X 1 XN 1 0 XN 0 X X1 = σ(S1) element-wise non-linearity 25 notation append a bias unit with weights 1 1 S1 1 1 X1 0 X1 X1 S2 0 1 X X2 X2 0 X X3 1 SN 1 X 1 XN 1 0 XN 0 X and proceed 26 matrix form 0 0 N X number of features number N number of data points Y 27 matrix form N N 0 N = X0 N 0 N 1 S1 N 1 W 1 note: appending biases to W 1 and bias units to X 0 changes N 0 N 0 +1 ! 28 matrix form N N = 1 1 N 1 X1 σ(N S ) 29 matrix form input X 0 somewhere back in here ( ( N ( N ` ` = ` ` 1 ` 2 X σ( W σ( W − σ( W − σ( ) it’s just a (deep, non-linear) function! note: bias appending omitted for clarity 30 train with gradient descent final layer XL is the output of the network L X1 X use (XL,Y) to compare L L L X and Y X2 X take gradients ` of loss w.r.t. rW L weights at each level ` update the weights to decrease loss, L L bring X closer to Y XN L X ` ` W W ↵ W ` − r L 31 final layer how do we get the gradients? backpropagation L X1 X L X2 1. Define a loss function. X 2. Use chain rule to recursively take derivatives at each level backward through the network. L XN L X 32 output what is the best form in which to present the labels? binary labels: Y 0, 1 (binary classification) 2 { } represent with a single output unit, use binary logistic regression at the last layer L [0, 1] X1 P (Y =1X) | X P (Y =0X)=1 P (Y =1X) | − | 33 output what is the best form in which to present the labels? categorical labels: Y 1,...,K (multi-class classification) 2 { } represent with a single output unit, regress to the correct class? ( , ) XL 1 Y1 1,...,K 1 2 { } X No, in general, classes do not have a numerical relation class K is not ‘greater’ than class K 1 − 34 output what is the best form in which to present the labels? categorical labels: Y 1,...,K (multi-class classification) 2 { } represent with multiple output units, regress to an encoding of the correct class (e.g. binary) [0, 1] XL 0, 1 1 { } X [0, 1] XL 0, 1 2 { } X Y 1,...,K 2 { } L [0, 1] X L 0, 1 N { } X N L = K 6 No, correct output requires coordinated} effort from units implies arbitrary similarities induced by the coding 35 output what is the best form in which to present the labels? categorical labels: Y 1,...,K (multi-class classification) 2 { } represent with K output units, multi-class logistic regression at the last layer [0, 1] XL P (Y =1X) 1 | X [0, 1] XL P (Y =2X) 2 | X Y 1,...,K 2 { } [0, 1] XL P (Y = K X) K | X softmax Yes captures independence assumptions in the class }structure, and is a valid output 36 output categorical labels: Y 1,...,K (multi-class classification) 2 { } represent with K output units, multi-class logistic regression at the last layer one-hot vector encoding example K =5 Y =1 Y =2 Y =3 Y =4 Y =5 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 37 loss function probability of x belonging to class k w|x e k multi-class logistic regression P (Y = k)= K w| x e k0 k0=1 P softmax loss function (cross-entropy) w| x e yi = log P (Y = y )= log i i K w| x L − − e k0 k0=1 minimize the negative log probability ofP the correct class recall: N = y(i) log(P (y =1x(i))) + (1 y(i)) log(1 P (y =1x(i))) L − i | − − i | i=1 Xsoftmaxh loss is the extension of binary case to multi-class i 38 backpropagation deep networks are just composed functions L L L 1 1 0 X = σ(W σ(W − σ(...σ(W X ))))) the loss (XL,Y) is a function of XL, which is a function of each layer’s weights W ` L therefore, we can find W ` using chain rule r L recall ` ` ` ` ` 1 X = σ(S ) S = W X − at the output layer @ @ @XL @SL L = L @W L @XL @SL @W L 39 backpropagation recall ` ` ` ` ` 1 X = σ(S ) S = W X − at the output layer @ @ @XL @SL L = L @W L @XL @SL @W L @ L the first two terms are the same @SL at the layer before L L L 1 L 1 @ @ @X @S @X − @S − LL 1 = LL L L 1 L 1 L 1 @W − @X @S @X − @S − @W − 40 backpropagation recall ` ` ` ` ` 1 X = σ(S ) S = W X − at the layer before L L L 1 L 1 @ @ @X @S @X − @S − LL 1 = LL L L 1 L 1 L 1 @W − @X @S @X − @S − @W − @ LL 1 the first four terms are the same @S − at the layer before that L L L 1 L 1 L 2 L 2 @ @ @X @S @X − @S − @X − @S − LL 2 = LL L L 1 L 1 L 2 L 2 L 2 @W − @X @S @X − @S − @X − @S − @W − 41 backpropagation general overview - dynamic programming compute gradient of loss w.r.t.
Recommended publications
  • Lecture 9 Recurrent Neural Networks “I’M Glad That I’M Turing Complete Now”
    Lecture 9 Recurrent Neural Networks “I’m glad that I’m Turing Complete now” Xinyu Zhou Megvii (Face++) Researcher [email protected] Nov 2017 Raise your hand and ask, whenever you have questions... We have a lot to cover and DON’T BLINK Outline ● RNN Basics ● Classic RNN Architectures ○ LSTM ○ RNN with Attention ○ RNN with External Memory ■ Neural Turing Machine ■ CAVEAT: don’t fall asleep ● Applications ○ A market of RNNs RNN Basics Feedforward Neural Networks ● Feedforward neural networks can fit any bounded continuous (compact) function ● This is called Universal approximation theorem https://en.wikipedia.org/wiki/Universal_approximation_theorem Cybenko, George. "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals, and Systems (MCSS) 2.4 (1989): 303-314. Bounded Continuous Function is NOT ENOUGH! How to solve Travelling Salesman Problem? Bounded Continuous Function is NOT ENOUGH! How to solve Travelling Salesman Problem? We Need to be Turing Complete RNN is Turing Complete Siegelmann, Hava T., and Eduardo D. Sontag. "On the computational power of neural nets." Journal of computer and system sciences 50.1 (1995): 132-150. Sequence Modeling Sequence Modeling ● How to take a variable length sequence as input? ● How to predict a variable length sequence as output? RNN RNN Diagram A lonely feedforward cell RNN Diagram Grows … with more inputs and outputs RNN Diagram … here comes a brother (x_1, x_2) comprises a length-2 sequence RNN Diagram … with shared (tied) weights x_i: inputs y_i: outputs W: all the
    [Show full text]
  • Neural Turing Machines Arxiv:1410.5401V2 [Cs.NE]
    Neural Turing Machines Alex Graves [email protected] Greg Wayne [email protected] Ivo Danihelka [email protected] Google DeepMind, London, UK Abstract We extend the capabilities of neural networks by coupling them to external memory re- sources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to- end, allowing it to be efficiently trained with gradient descent. Preliminary results demon- strate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples. 1 Introduction Computer programs make use of three fundamental mechanisms: elementary operations (e.g., arithmetic operations), logical flow control (branching), and external memory, which can be written to and read from in the course of computation (Von Neumann, 1945). De- arXiv:1410.5401v2 [cs.NE] 10 Dec 2014 spite its wide-ranging success in modelling complicated data, modern machine learning has largely neglected the use of logical flow control and external memory. Recurrent neural networks (RNNs) stand out from other machine learning methods for their ability to learn and carry out complicated transformations of data over extended periods of time. Moreover, it is known that RNNs are Turing-Complete (Siegelmann and Sontag, 1995), and therefore have the capacity to simulate arbitrary procedures, if properly wired. Yet what is possible in principle is not always what is simple in practice. We therefore enrich the capabilities of standard recurrent networks to simplify the solution of algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by analogy to Turing’s enrichment of finite-state machines by an infinite memory tape, we 1 dub our device a “Neural Turing Machine” (NTM).
    [Show full text]
  • Memory Augmented Matching Networks for Few-Shot Learnings
    International Journal of Machine Learning and Computing, Vol. 9, No. 6, December 2019 Memory Augmented Matching Networks for Few-Shot Learnings Kien Tran, Hiroshi Sato, and Masao Kubo broken down when they have to learn a new concept on the Abstract—Despite many successful efforts have been made in fly or to deal with the problems which information/data is one-shot/few-shot learning tasks recently, learning from few insufficient (few or even a single example). These situations data remains one of the toughest areas in machine learning. are very common in the computer vision field, such as Regarding traditional machine learning methods, the more data gathered, the more accurate the intelligent system could detecting a new object with only a few available captured perform. Hence for this kind of tasks, there must be different images. For those purposes, specialists define the approaches to guarantee systems could learn even with only one one-shot/few-shot learning algorithms that could solve a task sample per class but provide the most accurate results. Existing with only one or few known samples (e.g., less than 20 solutions are ranged from Bayesian approaches to examples per category). meta-learning approaches. With the advances in Deep learning One-shot learning/few-shot learning is a challenging field, meta-learning methods with deep networks could be domain for neural networks since gradient-based learning is better such as recurrent networks with enhanced external memory (Neural Turing Machines - NTM), metric learning often too slow to quickly cope with never-before-seen classes, with Matching Network framework, etc.
    [Show full text]
  • Training Robot Policies Using External Memory Based Networks Via Imitation
    Training Robot Policies using External Memory Based Networks Via Imitation Learning by Nambi Srivatsav A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science Approved September 2018 by the Graduate Supervisory Committee: Heni Ben Amor, Chair Siddharth Srivastava HangHang Tong ARIZONA STATE UNIVERSITY December 2018 ABSTRACT Recent advancements in external memory based neural networks have shown promise in solving tasks that require precise storage and retrieval of past information. Re- searchers have applied these models to a wide range of tasks that have algorithmic properties but have not applied these models to real-world robotic tasks. In this thesis, we present memory-augmented neural networks that synthesize robot naviga- tion policies which a) encode long-term temporal dependencies b) make decisions in partially observed environments and c) quantify the uncertainty inherent in the task. We extract information about the temporal structure of a task via imitation learning from human demonstration and evaluate the performance of the models on control policies for a robot navigation task. Experiments are performed in partially observed environments in both simulation and the real world. i TABLE OF CONTENTS Page LIST OF FIGURES . iii CHAPTER 1 INTRODUCTION . 1 2 BACKGROUND .................................................... 4 2.1 Neural Networks . 4 2.2 LSTMs . 4 2.3 Neural Turing Machines . 5 2.3.1 Controller . 6 2.3.2 Memory . 6 3 METHODOLOGY . 7 3.1 Neural Turing Machine for Imitation Learning . 7 3.1.1 Reading . 8 3.1.2 Writing . 9 3.1.3 Addressing Memory . 10 3.2 Uncertainty Estimation . 12 4 EXPERIMENTAL RESULTS .
    [Show full text]
  • Learning Operations on a Stack with Neural Turing Machines
    Learning Operations on a Stack with Neural Turing Machines Anonymous Author(s) Affiliation Address email Abstract 1 Multiple extensions of Recurrent Neural Networks (RNNs) have been proposed 2 recently to address the difficulty of storing information over long time periods. In 3 this paper, we experiment with the capacity of Neural Turing Machines (NTMs) to 4 deal with these long-term dependencies on well-balanced strings of parentheses. 5 We show that not only does the NTM emulate a stack with its heads and learn an 6 algorithm to recognize such words, but it is also capable of strongly generalizing 7 to much longer sequences. 8 1 Introduction 9 Although neural networks shine at finding meaningful representations of the data, they are still limited 10 in their capacity to plan ahead, reason and store information over long time periods. Keeping track of 11 nested parentheses in a language model, for example, is a particularly challenging problem for RNNs 12 [9]. It requires the network to somehow memorize the number of unmatched open parentheses. In 13 this paper, we analyze the ability of Neural Turing Machines (NTMs) to recognize well-balanced 14 strings of parentheses. We show that even though the NTM architecture does not explicitely operate 15 on a stack, it is able to emulate this data structure with its heads. Such a behaviour was unobserved 16 on other simple algorithmic tasks [4]. 17 After a brief recall of the Neural Turing Machine architecture in Section 3, we show in Section 4 how 18 the NTM is able to learn an algorithm to recognize strings of well-balanced parentheses, called Dyck 19 words.
    [Show full text]
  • Neural Shuffle-Exchange Networks
    Neural Shuffle-Exchange Networks − Sequence Processing in O(n log n) Time Karlis¯ Freivalds, Em¯ıls Ozolin, š, Agris Šostaks Institute of Mathematics and Computer Science University of Latvia Raina bulvaris 29, Riga, LV-1459, Latvia {Karlis.Freivalds, Emils.Ozolins, Agris.Sostaks}@lumii.lv Abstract A key requirement in sequence to sequence processing is the modeling of long range dependencies. To this end, a vast majority of the state-of-the-art models use attention mechanism which is of O(n2) complexity that leads to slow execution for long sequences. We introduce a new Shuffle-Exchange neural network model for sequence to sequence tasks which have O(log n) depth and O(n log n) total complexity. We show that this model is powerful enough to infer efficient algorithms for common algorithmic benchmarks including sorting, addition and multiplication. We evaluate our architecture on the challenging LAMBADA question answering dataset and compare it with the state-of-the-art models which use attention. Our model achieves competitive accuracy and scales to sequences with more than a hundred thousand of elements. We are confident that the proposed model has the potential for building more efficient architectures for processing large interrelated data in language modeling, music generation and other application domains. 1 Introduction A key requirement in sequence to sequence processing is the modeling of long range dependencies. Such dependencies occur in natural language when the meaning of some word depends on other words in the same or some previous sentence. There are important cases, e.g., to resolve coreferences, when such distant information may not be disregarded.
    [Show full text]
  • Advances in Neural Turing Machines
    Advances in Neural Turing Machines [email protected] truyentran.github.io Truyen Tran Deakin University @truyenoz letdataspeak.blogspot.com Source: rdn consulting CafeDSL, Aug 2018 goo.gl/3jJ1O0 30/08/2018 1 https://twitter.com/nvidia/status/1010545517405835264 30/08/2018 2 (Real) Turing machine It is possible to invent a single machine which can be used to compute any computable sequence. If this machine U is supplied with the tape on the beginning of which is written the string of quintuples separated by semicolons of some computing machine M, then U will compute the same sequence as M. Wikipedia 30/08/2018 3 Can we learn from data a model that is as powerful as a Turing machine? 30/08/2018 4 Agenda Brief review of deep learning Neural Turing machine (NTM) Dual-controlling for read and write (PAKDD’18) Dual-view in sequences (KDD’18) Bringing variability in output sequences (NIPS’18 ?) Bringing relational structures into memory (IJCAI’17 WS+) 30/08/2018 5 Deep learning in a nutshell 1986 2016 http://blog.refu.co/wp-content/uploads/2009/05/mlp.png 30/08/2018 2012 6 Let’s review current offerings Feedforward nets (FFN) BUTS … Recurrent nets (RNN) Convolutional nets (CNN) No storage of intermediate results. Message-passing graph nets (MPGNN) Little choices over what to compute and what to use Universal transformer ….. Little support for complex chained reasoning Work surprisingly well on LOTS of important Little support for rapid switching of tasks problems Enter the age of differentiable programming 30/08/2018 7 Searching for better
    [Show full text]
  • ELEC 576: Understanding and Visualizing Convnets & Introduction
    ELEC 576: Understanding and Visualizing Convnets & Introduction to Recurrent Neural Networks Ankit B. Patel Baylor College of Medicine (Neuroscience Dept.) Rice University (ECE Dept.) 10-11-2017 Understand & Visualizing Convnets Deconvolutional Net [Zeiler and Fergus] Feature Visualization [Zeiler and Fergus] Activity Maximization [Simonyan et al.] Deep Dream Visualization • To produce human viewable images, need to • Activity maximization (gradient ascent) • L2 regularization • Gaussian blur • Clipping • Different scales Image Example [Günther Noack] Dumbbell Deep Dream AlexNet VGGNet GoogleNet Deep Dream Video Class: goldfish, Carassius auratus Texture Synthesis [Leon Gatys, Alexander Ecker, Matthias Bethge] Generated Textures [Leon Gatys, Alexander Ecker, Matthias Bethge] Deep Style [Leon Gatys, Alexander Ecker, Matthias Bethge] Deep Style [Leon Gatys, Alexander Ecker, Matthias Bethge] Introduction to Recurrent Neural Networks What Are Recurrent Neural Networks? • Recurrent Neural Networks (RNNs) are networks that have feedback • Output is feed back to the input • Sequence processing • Ideal for time-series data or sequential data History of RNNs Important RNN Architectures • Hopfield Network • Jordan and Elman Networks • Echo State Networks • Long Short Term Memory (LSTM) • Bi-Directional RNN • Gated Recurrent Unit (GRU) • Neural Turing Machine Hopfield Network [Wikipedia] Elman Networks [John McCullock] Echo State Networks [Herbert Jaeger] Definition of RNNs RNN Diagram [Richard Socher] RNN Formulation [Richard Socher] RNN Example [Andrej
    [Show full text]
  • A Review on Neural Turing Machine
    A review on Neural Turing Machine Soroor Malekmohammadi Faradounbeh, Faramarz Safi-Esfahani Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Najafabad, Iran. Big Data Research Center, Najafabad Branch, Islamic Azad University, Najafabad, Iran. [email protected], [email protected] Abstract— one of the major objectives of Artificial Intelligence is to design learning algorithms that are executed on a general purposes computational machines such as human brain. Neural Turing Machine (NTM) is a step towards realizing such a computational machine. The attempt is made here to run a systematic review on Neural Turing Machine. First, the mind-map and taxonomy of machine learning, neural networks, and Turing machine are introduced. Next, NTM is inspected in terms of concepts, structure, variety of versions, implemented tasks, comparisons, etc. Finally, the paper discusses on issues and ends up with several future works. Keywords— NTM, Deep Learning, Machine Learning, Turing machine, Neural Networks 1 INTRODUCTION The Artificial Intelligence (AI) seeks to construct real intelligent machines. The neural networks constitute a small portion of AI, thus the human brain should be named as a Biological Neural Network (BNN). Brain is a very complicated non-linear and parallel computer [1]. The almost relatively new neural learning technology in AI named ‘Deep Learning’ that consists of multi-layer neural networks learning techniques. Deep Learning provides a computerized system to observe the abstract pattern similar to that of human brain, while providing the means to resolve cognitive problems. As to the development of deep learning, recently there exist many effective methods for multi-layer neural network training such as Recurrent Neural Network (RNN)[2].
    [Show full text]
  • Evolving Neural Turing Machines
    Evolving Neural Turing Machines Rasmus Boll Greve, Emil Juul Jacobsen, and Sebastian Risi IT University of Copenhagen Copenhagen, Denmark fragr,ejuu,[email protected] Abstract Instead of training a Neural Turing Machine (NTM) with gradient descent [1], in this work NTMs are trained through an evolutionary algorithm. Preliminary results suggest that this setup can greatly simplify the neural model, generalizes better, and does not require accessing the entire memory content at each time-step. We show preliminary results on a simple copy and T-Maze learning task. 1 Introduction An important component of our human cognition is the ability to store and use information during inference. However, while much progress has been made in machine learning [2, 3], how to best augment artificial agents with a similar long-term memory component remains elusive. In recent work in this direction, Graves et al. [1] extended the capabilities of an artificial neural network (ANN) by combining it with an external memory bank. Because the architecture of their Neural Turing Machine (NTM) is differentiable end-to-end it can be trained through gradient descent and was able to learn simple algorithms such as copying, sorting and recall from example data. How- ever, the differentiable memory requires the NTM to access the entire memory content at each step, which can be prohibitively slow for larger memory banks and will likely not scale-up to large and more difficult problems. Others have tried to overcome this obstacle by combining gradient descent methods with reinforcement learning methods, so far with only limited success [4]. This study presents preliminary results training a NTM through a neuroevolutionary (NE) approach, in which both the topology and the weights of a network are determined through artificial evolution.
    [Show full text]
  • Dynamic Neural Turing Machine with Contin
    Under review as a conference paper at ICLR 2017 DYNAMIC NEURAL TURING MACHINE WITH CONTIN- UOUS AND DISCRETE ADDRESSING SCHEMES Caglar Gulcehre∗, Sarath Chandar∗, Kyunghyun Choy, Yoshua Bengio∗ ∗ University of Montreal, [email protected] y New York University, [email protected] ABSTRACT In this paper, we extend neural Turing machine (NTM) into a dynamic neural Turing machine (D-NTM) by introducing a trainable memory addressing scheme. This addressing scheme maintains for each memory cell two separate vectors, content and address vectors. This allows the D-NTM to learn a wide variety of location-based addressing strategies including both linear and nonlinear ones. We implement the D-NTM with both continuous, differentiable and discrete, non-differentiable read/write mechanisms. We investigate the mechanisms and effects for learning to read and write to a memory through experiments on Facebook bAbI tasks using both a feedforward and GRU-controller. The D-NTM is evaluated on a set of Facebook bAbI tasks and shown to outperform NTM and LSTM baselines. We also provide further experimental results on sequential MNIST, associative recall and copy tasks. 1 INTRODUCTION Designing general-purpose learning algorithms is one of the long-standing goals of artificial intelligence. Despite the success of deep learning in this area (see, e.g., (Goodfellow et al., 2016)) there are still a set of complex tasks that are not well addressed by conventional neural networks. Those tasks often require a neural network to be equipped with an explicit, external memory in which a larger, potentially unbounded, set of facts need to be stored.
    [Show full text]
  • Outline for Today's Presentation
    Outline for today’s presentation • We will see how RNNs and CNNs compare on variety of tasks • Then, we will go through a new approach for Sequence Modelling that has become state of the art • Finally, we will look at a few augmented RNN models RNNs vs CNNs Empirical Evaluation of Generic Networks for Sequence Modelling Let’s say you are given a sequence modelling task of text classification / music note prediction, and you are asked to develop a simple model. What would your baseline model be based on- RNNs or CNNs? Recent Trend in Sequence Modelling • Widely considered as RNNs “home turf” • Recent research has shown otherwise – • Speech Synthesis – WaveNet uses Dilated Convolutions for Synthesis • Char-to-Char Machine Translation – ByteNet uses Encoder-Decoder architecture and Dilated Convolutions. Tested on English-German dataset • Word-to-Word Machine Translation – Hybrid CNN-LSTM on English-Romanian and English-French datasets • Character-level Language Modelling – ByteNet on WikiText dataset • Word-level Language Modelling – Gated CNNs on WikiText dataset Temporal Convolutional Network (TCN) • Model that uses best practices in Convolutional network design • The properties of TCN - • Causal – there is no information leakage from future to past • Memory – It can look very far into the past for prediction/synthesis • Input - It can take any arbitrary length sequence with proper tuning to the particular task. • Simple – It uses no gating mechanism, no complex stacking mechanism and each layer output has the same length as the input •
    [Show full text]