Neural Structured Turing Machine

Hao Liu*1  Xinyi Yang*2  Zenglin Xu2

*Equal contribution. 1SMILE Lab & Yingcai Honors College, University of Electronic Science and Technology of China. 2SMILE Lab & Big Data Research Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China. Correspondence to: Zenglin Xu.

Proceedings of the ICML 17 Workshop on Deep Structured Prediction, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Differentiable external memory has greatly improved attention-based models such as the Neural Turing Machine (NTM). However, none of the existing attention models or memory-augmented attention models considers real-valued structured attention, which may limit their representational power. Here we take structured information into consideration and propose a Neural Structured Turing Machine (NSTM). Our preliminary experiments on a variable-length sequence copy task show the benefit of such a structured attention mechanism, suggesting a new perspective for improving mechanisms based on differentiable external memory, e.g., the NTM.

1. Introduction

Neural processes involving attention have been studied extensively in neuroscience and computational neuroscience (Itti et al., 1998). Broadly, an attention model is a method that takes n arguments and a context and returns a vector that summarizes the arguments while focusing on the information linked to the context. More formally, it returns a weighted arithmetic mean of the arguments, where the weights are chosen according to the relevance of each argument given the context.

Attention mechanisms are commonly divided into soft attention and hard attention. Soft attention is a fully differentiable, deterministic mechanism that can be plugged into an existing system: gradients are propagated through the attention mechanism at the same time as they are propagated through the rest of the network. In contrast, hard attention is a stochastic process: instead of using all the hidden states as input for the decoding, the system samples a hidden state according to a probability distribution. Commonly used approaches to estimate the resulting gradient are Monte Carlo sampling and REINFORCE.

Recent work has extended attention mechanisms to fully differentiable addressable memory, for example Memory Networks (Weston et al., 2014), Neural Turing Machines (Graves et al., 2014) and many other models (Kurach et al., 2016; Chandar et al., 2016; Gulcehre et al., 2016), all of which use an external memory that they can read from (and, in some cases, write to). Fully differentiable addressable memory enables a model to bind values to specific locations in data structures (Fodor & Pylyshyn, 1988). Differentiable memory matters because many everyday computational tasks are hard for neural networks precisely because conventional computers work in absolutes, whereas most neural network models do not: they work with real numbers and smooth curves, which makes them much easier to train, since the effect of each parameter on the output can be traced back and improved by gradient-based tweaking.

However, none of the current attention models or differentiable memory mechanisms considers the structured information between memory locations and query content when addressing memory; instead, they simply use content to compute a relevance score, which may limit their representational power. Here we show how to use a general structured inference mechanism to model complex dependencies between memory locations and the query in the Neural Turing Machine (NTM), and we validate through experiments the benefit of incorporating real-valued structured attention into the NTM.
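As a concrete illustration of this weighted-mean view of soft attention, here is a minimal NumPy sketch (our own illustration; the bilinear relevance score and softmax weighting are assumptions, not something specified in this paper):

```python
import numpy as np

def soft_attention(args, context, W):
    """Relevance-weighted mean of `args` given `context`.

    args:    (n, d) array of argument vectors.
    context: (c,) context vector.
    W:       (d, c) bilinear scoring matrix (a modelling assumption).
    """
    scores = args @ W @ context            # relevance of each argument, shape (n,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the n arguments
    return weights @ args                  # weighted arithmetic mean, shape (d,)

# Example: 5 arguments of dimension 4, context of dimension 3.
rng = np.random.default_rng(0)
summary = soft_attention(rng.normal(size=(5, 4)),
                         rng.normal(size=3),
                         rng.normal(size=(4, 3)))
```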
2. Model

Figure 1. Neural Structured Turing Machine. (Architecture diagram: the controller receives {m_{t-1}}, {q_t} and {r_{t-1}}; query-based structured attention (W) produces a content weighting y_t^c, which is passed through interpolation (g_t), convolutional shift (s_t) and sharpening (γ_t); the resulting weighting drives writing to and reading from the memory, yielding {m_t} and {r_t}.)

2.1. Energy Based Structured Attention

Let m_{t-1} denote the memory input and q_t denote the query. To enable query-based content search, the input m_{t-1} can be combined with the query q_t via x = m_{t-1}^T W q_t, where W comes from a neural network (e.g., an MLP) or is a learned parameter matrix. Let y = (y_1, y_2, ..., y_n) denote the attention values we want to predict, with y_i ∈ [0, 1]. E(x, y, w): R^K → R is an energy function, defined as

E = \sum_i f_i(y_i, x; w_u) + \sum_\alpha f_\alpha(y_\alpha, x; w_\alpha),

which can be viewed as a sum of unary terms and higher-order energy terms. Our aim is to obtain

y^* = \arg\min_{y \in Y} \sum_i f_i(y_i, x; w_u) + \sum_\alpha f_\alpha(y_\alpha, x; w_\alpha).

This kind of model is in general intractable, although a variety of methods have been developed in the context of predicting discrete outputs (Zheng et al., 2015b; Chen et al., 2015; Zheng et al., 2015a). A Markov random field (MRF) with Gaussian potentials is also real-valued, and when the precision matrix is positive semi-definite and satisfies the spectral radius condition (Weiss & Freeman, 2001), message passing yields exact inference. However, an MRF with Gaussian potentials may not be powerful enough to extract complex features from memory and to model complex dependencies between query and memory. We therefore turn to proximal methods as a remedy; for background on proximal methods, refer to (Parikh et al., 2014). We make the following assumptions: f_i(y_i, x; w) = g_i(y_i, h_i(x, w)) and f_\alpha(y_\alpha, x; w_\alpha) = h_\alpha(x; w) g_\alpha(w_\alpha^T y_\alpha), so that

E = \sum_i g_i(y_i, h_i(x, w)) + \sum_\alpha h_\alpha(x; w)\, g_\alpha(w_\alpha^T y_\alpha).    (1)

This objective can be minimized effectively with a primal-dual method. Denote the output of the t-th inference iteration by y_t^c.
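To make this concrete, the following NumPy sketch (our own illustration; the particular choices of g_i, g_α, h_i and h_α below are assumptions, not the authors' parameterization) computes the content scores x = m_{t-1}^T W q_t and evaluates the energy of Eq. (1) for a candidate attention vector y:

```python
import numpy as np

def content_scores(memory, query, W):
    """x = m_{t-1}^T W q_t: one relevance score per memory row."""
    return memory @ W @ query              # (R, C) @ (C, c) @ (c,) -> (R,)

def energy(y, x, w_alpha, h_alpha):
    """Eq. (1) with illustrative choices (assumptions):
    unary terms  g_i(y_i, h_i) = 0.5 * (y_i - sigmoid(x_i))^2,
    higher-order g_alpha(u)    = |u|, weighted by h_alpha >= 0.
    """
    unary = 0.5 * np.sum((y - 1.0 / (1.0 + np.exp(-x))) ** 2)
    higher = np.sum(h_alpha * np.abs(w_alpha @ y))   # sum_a h_a * |w_a^T y|
    return unary + higher

rng = np.random.default_rng(0)
R, C, c, A = 8, 16, 10, 3                  # memory rows/cols, query dim, #higher-order terms
memory, query = rng.normal(size=(R, C)), rng.normal(size=c)
W = rng.normal(size=(C, c))
x = content_scores(memory, query, W)
y = np.clip(rng.uniform(size=R), 0, 1)     # candidate attention values in [0, 1]
print(energy(y, x, rng.normal(size=(A, R)), rng.uniform(size=A)))
```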

2.2. Controller Parameters

Having shown how to address memory with the structured attention mechanism, we now turn to other addressing situations. In some cases, we may want to read from specific memory locations instead of reading specific memory values. This is called location-based addressing, and implementing it requires three more stages. In the second stage, a scalar parameter g_t ∈ (0, 1), called the interpolation gate, blends the content weight vector y_t^c with the previous time step's weight vector y_{t-1} to produce the gated weighting y_t^g. This allows the system to learn when to use (or ignore) content-based addressing:

y_t^g \leftarrow g_t\, y_t^c + (1 - g_t)\, y_{t-1}.

We would also like the controller to be able to shift its focus to other rows. Suppose that the range of allowable shifts is specified as one of the system's parameters: for example, a head's attention could shift forward a row (+1), stay still (0), or shift backward a row (-1), with shifts taken modulo R, so that a shift forward at the bottom row of memory moves the head's attention to the top row, and similarly for a shift backward at the top row. After interpolation, each head emits a normalized shift weighting s_t and applies the convolutional shift

\tilde{y}_t(i) \leftarrow \sum_j y_t^g(j)\, s_t(i - j).

Finally, to prevent the shift weighting s_t from blurring the attention, we apply sharpening, y_t(i) \propto \tilde{y}_t(i)^{\gamma_t}. Here g_t, s_t and γ_t are controller parameters.
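These interpolation, convolutional shift and sharpening steps follow the NTM-style addressing pipeline described above. A minimal NumPy sketch (our own illustration; representing the shift weighting as a length-R circular kernel and the example values are assumptions) is:

```python
import numpy as np

def address(y_content, y_prev, g, s, gamma):
    """Location-based addressing on top of a content weighting.

    y_content: (R,) content weights y_t^c       g: scalar gate in (0, 1)
    y_prev:    (R,) previous weights y_{t-1}    s: (R,) circular shift kernel
    gamma:     sharpening exponent >= 1
    """
    # 1) Interpolation: y^g = g * y^c + (1 - g) * y_{t-1}
    y_g = g * y_content + (1.0 - g) * y_prev

    # 2) Circular convolutional shift: y~(i) = sum_j y^g(j) * s(i - j), indices mod R
    R = y_g.shape[0]
    y_shift = np.array([sum(y_g[j] * s[(i - j) % R] for j in range(R)) for i in range(R)])

    # 3) Sharpening: y(i) = y~(i)^gamma / sum_j y~(j)^gamma
    y_sharp = y_shift ** gamma
    return y_sharp / y_sharp.sum()

R = 6
y_c = np.eye(R)[2]                       # content weighting focused on row 2
y_prev = np.ones(R) / R
s = np.zeros(R); s[1] = 1.0              # shift kernel: move focus forward by one row
print(address(y_c, y_prev, g=0.9, s=s, gamma=5.0))
```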

2.3. Writing

Writing involves two separate steps: erasing, then adding. In order to erase old data, a write head needs, in addition to the length-R normalized weight vector y_t, a length-C erase vector e_t. The erase vector is used in conjunction with the weight vector to specify which elements in a row should be erased, left unchanged, or something in between: if the weight vector tells us to focus on a row, and the erase vector tells us to erase an element, the element in that row will be erased. The write head then uses a length-C add vector a_t to complete the writing:

m_t^{erased}(i) \leftarrow m_{t-1}(i)\,[1 - y_t(i)\, e_t],
m_t(i) \leftarrow m_t^{erased}(i) + y_t(i)\, a_t.
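A minimal NumPy sketch of the erase-then-add write equations above (our own illustration; the example values are arbitrary):

```python
import numpy as np

def write(memory, y, e, a):
    """Erase-then-add write to an R x C memory.

    memory: (R, C) matrix m_{t-1}        e: (C,) erase vector in [0, 1]
    y:      (R,) write weights y_t       a: (C,) add vector
    """
    erased = memory * (1.0 - np.outer(y, e))   # m^erased(i) = m_{t-1}(i) * (1 - y(i) e)
    return erased + np.outer(y, a)             # m_t(i) = m^erased(i) + y(i) a

R, C = 4, 3
memory = np.ones((R, C))
y = np.array([0.0, 1.0, 0.0, 0.0])             # focus entirely on row 1
e = np.ones(C)                                  # erase every element of the focused row
a = np.array([5.0, 6.0, 7.0])
print(write(memory, y, e, a))                   # row 1 becomes [5, 6, 7]; other rows unchanged
```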

2.4. Optimization and Learning

The general idea of primal-dual solvers is to introduce auxiliary variables z to decompose the higher-order terms. We can then optimize over y and z alternately by computing their proximal operators. In particular, we can transform the primal problem in Eq. (1) into the following saddle-point problem:

E^* = \min_{y \in Y} \max_{z \in Z} \sum_i g_i(y_i, h_i(x, w)) - \sum_\alpha h_\alpha(x, w)\, g_\alpha^*(z_\alpha) + \sum_\alpha h_\alpha(x, w)\, \langle w_\alpha^T y_\alpha, z_\alpha \rangle,

where g_\alpha^*(\cdot) is the convex conjugate of g_\alpha, i.e., g_\alpha^*(z^*) = \sup\{\langle z^*, z \rangle - g_\alpha(z) \mid z \in Z\}. Note that the whole inference process has two stages: first we compute the unaries h(x, w_u) with a forward pass, followed by MAP inference. This kind of optimization approach was also proposed in (Wang et al., 2016); note that both are special cases of (Domke, 2012).
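As an illustration only, the following NumPy sketch runs alternating primal-dual updates for this saddle-point problem under strong simplifying assumptions of our own: quadratic unaries g_i(y_i, h_i) = ½(y_i - h_i)², g_α = |·| (whose conjugate is the indicator of [-1, 1], so its proximal map is a projection), and plain projected gradient steps. It is not the authors' solver or parameterization.

```python
import numpy as np

def structured_attention_inference(h_unary, W_alpha, h_alpha,
                                   steps=200, tau=0.1, sigma=0.1):
    """Alternating primal-dual updates for the saddle-point problem above,
    under the simplified choices stated in the text (our assumptions).

    h_unary: (R,) unary scores h_i(x, w) from the forward pass.
    W_alpha: (A, R) rows w_alpha of the higher-order terms.
    h_alpha: (A,) non-negative weights h_alpha(x, w).
    """
    R, A = h_unary.shape[0], h_alpha.shape[0]
    y = np.clip(h_unary, 0.0, 1.0)          # primal attention values, y in Y = [0, 1]^R
    z = np.zeros(A)                          # dual variables, z in Z = [-1, 1]^A
    for _ in range(steps):
        # Dual ascent step + proximal map of g*_alpha (projection onto [-1, 1]).
        z = np.clip(z + sigma * h_alpha * (W_alpha @ y), -1.0, 1.0)
        # Primal descent step + projection onto the feasible set Y = [0, 1]^R.
        grad_y = (y - h_unary) + W_alpha.T @ (h_alpha * z)
        y = np.clip(y - tau * grad_y, 0.0, 1.0)
    return y

rng = np.random.default_rng(0)
y_star = structured_attention_inference(h_unary=rng.uniform(size=8),
                                        W_alpha=rng.normal(size=(3, 8)),
                                        h_alpha=rng.uniform(size=3))
```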
3. Experiment

The goal is not only to establish that the NSTM can solve the problems, but also that it can do so by learning compact internal programs. The hallmark of such solutions is that they generalise well beyond the range of the training data. For example, we were curious to see whether a network that had been trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training. We compare the NSTM against an NTM with a feedforward controller, an NTM with an LSTM controller, and a standard LSTM network. All tasks are supervised learning problems with binary targets; all networks have logistic sigmoid output layers and are trained with the cross-entropy objective function.

3.1. Copy

To examine whether the NSTM can store and recall a long sequence of arbitrary information, we apply it to the copy task. This experiment was designed to test whether having a structured attention system allows for better performance. The network is presented with an input sequence of random binary vectors followed by a delimiter flag; storage and access of information over long time periods has always been problematic for RNNs and other dynamic architectures. Three different models are compared: a Neural Turing Machine with an LSTM controller, a basic LSTM, and the NSTM proposed in this paper. To assess the NSTM's performance on longer sequences that were never seen before, the networks were trained to copy sequences of eight-bit random vectors, with sequence lengths randomised between 1 and 20. The target sequence is simply a copy of the input sequence (without the delimiter flag). Note that no inputs are presented to the network while it receives the targets, to ensure that it recalls the entire sequence with no intermediate assistance. Since the training sequences had between 1 and 20 random vectors, the LSTM and NTMs were compared using sequences of lengths 10, 20, 30, 50 and 120. Results of the NSTM are shown in Figure 2; compared to the NTM and LSTM results shown in Tables 1 and 2, the NSTM remembers longer and can recall a longer sequence of arbitrary information, so we can conclude that it is beneficial to enable structured attention.

Table 1. Results of the Neural Turing Machine on different length sequences in the copy task.

Table 2. Results of the long short-term memory (LSTM) network on different length sequences in the copy task.

Figure 2. Results of the Neural Structured Turing Machine on different sequence lengths in the copy task. The four pairs of plots in the top row depict network outputs and the corresponding copy targets for test sequences of length 10, 20, 30 and 50, respectively. The plots in the bottom row are for a length-120 sequence. The network was only trained on sequences of up to length 20. Compared to the NTM and LSTM results shown in Tables 1 and 2, the first four sequences are reproduced by the Neural Structured Turing Machine with high confidence and very few mistakes.

3.2. Repeat Copy

To compare the NTM and NSTM on learning a simple nested function, we use the repeat copy task, since it requires the model to output the copied sequence a specified number of times and then emit an end marker. The model is fed random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies, presented through a separate input channel. The model must both interpret this extra input and keep count of the number of copies it has performed so far in order to emit the end marker at the correct time. The models were trained to reproduce sequences of size-eight random binary vectors, where both the sequence length and the number of repetitions were chosen randomly from one to ten. The input representing the repeat number was normalised to have mean zero and variance one. A comparison between the Neural Structured Turing Machine with an LSTM controller, an LSTM, and NTMs with feedforward/LSTM controllers on the repeat copy learning task is shown in Figure 3. It shows that, compared to the NTM and LSTM, the Neural Structured Turing Machine converges faster, although all of them eventually converge to the optimum.

Figure 3. Comparison between the Neural Structured Turing Machine with an LSTM controller, an LSTM, and NTMs with feedforward/LSTM controllers on the repeat copy learning task (cost per sequence in bits vs. number of training sequences, in thousands). The compared baselines, NTM and LSTM, use the same parameter settings as in (Graves et al., 2014).
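For concreteness, the following NumPy sketch generates copy and repeat-copy training examples as described above (our own illustration; the exact channel layout for the delimiter and the repeat-count encoding are assumptions, not the authors' specification):

```python
import numpy as np

rng = np.random.default_rng(0)

def copy_example(max_len=20, bits=8):
    """Copy task: random 8-bit vectors + delimiter flag; target = the sequence itself."""
    T = rng.integers(1, max_len + 1)
    seq = rng.integers(0, 2, size=(T, bits)).astype(float)
    inputs = np.zeros((T + 1, bits + 1))        # one extra channel for the delimiter
    inputs[:T, :bits] = seq
    inputs[T, bits] = 1.0                       # delimiter flag after the sequence
    return inputs, seq                          # no input is fed while the target is recalled

def repeat_copy_example(max_len=10, max_reps=10, bits=8):
    """Repeat copy: sequence + normalised repeat count; target = repeated sequence + end marker."""
    T = rng.integers(1, max_len + 1)
    reps = rng.integers(1, max_reps + 1)
    seq = rng.integers(0, 2, size=(T, bits)).astype(float)
    inputs = np.zeros((T + 1, bits + 1))
    inputs[:T, :bits] = seq
    # Repeat count on a separate channel, normalised to zero mean / unit variance
    # over the training range 1..max_reps (assumed encoding).
    r = np.arange(1, max_reps + 1)
    inputs[T, bits] = (reps - r.mean()) / r.std()
    target = np.zeros((T * reps + 1, bits + 1))
    target[:T * reps, :bits] = np.tile(seq, (reps, 1))
    target[T * reps, bits] = 1.0                # end marker
    return inputs, target

x, y = copy_example()
xr, yr = repeat_copy_example()
```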

4. Related work

4.1. Attention Mechanism and Neural Turing Machine

In the Neural Turing Machine (Graves et al., 2014), the authors augment an attention model with differentiable external memory, enabling neural networks to bind values to specific locations in data structures by learning how to use this memory. (Kim et al., 2017) proposes a linear-chain CRF to model discrete attention in a structured way; however, their model is limited to the discrete setting, which means that the dependencies this kind of structured attention can model are rather limited. (Mnih et al., 2014) views vision as a sequential decision task: the model sequentially chooses small windows of information in the data, integrating information from all past windows to make its next decision. (Gregor et al., 2015) combines an attention mechanism with a sequential variational auto-encoder and makes both reading and writing sequential tasks. (Xu et al., 2015) uses an attention-based model to dynamically select the needed features, based on the intuitions of attention.

4.2. Neural Structured Models

It is natural to combine MRFs or CRFs with neural networks to obtain (deep) neural structured models. (LeCun et al., 2006; Collobert et al., 2011; Chen et al., 2014; Schwing & Urtasun; Chen et al., 2015) parametrize the potentials of a CRF using deep features. Alternatively, non-iterative feed-forward predictors can be constructed using structured models as motivation (Domke, 2013; Li & Zemel, 2014; Zheng et al., 2015b). (Tompson et al., 2014) jointly trains convolutional neural networks and a graphical model for pose estimation, and (Johnson et al., 2016) also combines graphical models with neural networks for behaviour modelling. (Meshi et al., 2010; Taskar et al., 2005) used local-consistency-relaxation message-passing algorithms to do MAP inference in a logical MRF, using the dual objectives from fast message-passing algorithms for graphical model inference as the inner dual. (Schwing et al., 2012) used this for latent variable modelling, also using message-passing duals. (Chen et al., 2015) used the inner dual method twice to even more aggressively dualize expensive inferences during latent variable learning. (Chávez, 2015) revisited the inner dual idea to train deep models with structured predictors attached to them.

5. Conclusion and future work

The Neural Turing Machine is a very promising route towards general intelligence: it shows that access to external memory is crucial for learning and generalisation. This paper further improves upon these ideas by considering a structured, real-valued addressing attention mechanism; our preliminary experiments show the benefit of such structured attention in the Neural Turing Machine. Future work includes conducting more experiments on various tasks, including important vision and NLP tasks, and exploring the recurrent nature of the structured attention mechanism.

References

Chandar, A. P. Sarath, Ahn, Sungjin, Larochelle, Hugo, Vincent, Pascal, Tesauro, Gerald, and Bengio, Yoshua. Hierarchical memory networks. CoRR, 2016.

Chávez, Hugo. Paired-dual learning for fast training of latent variable hinge-loss MRFs: Appendices. 2015.

Chen, Liang-Chieh, Papandreou, George, Kokkinos, Iasonas, Murphy, Kevin, and Yuille, Alan L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICCV, 2014.

Chen, Liang-Chieh, Schwing, Alexander, Yuille, Alan, and Urtasun, Raquel. Learning deep structured models. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 1785–1794, 2015.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. JMLR, 12:2493–2537, 2011.

Domke, Justin. Generic methods for optimization-based modeling. In AISTATS, 2012.

Domke, Justin. Learning graphical model parameters with approximate marginal inference. PAMI, 2013.

Fodor, Jerry A and Pylyshyn, Zenon W. Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1):3–71, 1988.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.

Gregor, Karol, Danihelka, Ivo, Graves, Alex, Rezende, Danilo Jimenez, and Wierstra, Daan. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

Gulcehre, Caglar, Chandar, Sarath, Cho, Kyunghyun, and Bengio, Yoshua. Dynamic neural Turing machine with soft and hard addressing schemes. arXiv preprint arXiv:1607.00036, 2016.

Itti, Laurent, Koch, Christof, and Niebur, Ernst. A model of saliency-based visual attention for rapid scene analysis. PAMI, 20(11):1254–1259, 1998.

Johnson, Matthew J., Duvenaud, David K., Wiltschko, Alex, Adams, Ryan P., and Datta, Sandeep R. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, 2016.

Kim, Yoon, Denton, Carl, Hoang, Luong, and Rush, Alexander M. Structured attention networks. arXiv preprint arXiv:1702.00887, 2017.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural random access machines. ERCIM News, 2016, 2016.

LeCun, Yann, Chopra, Sumit, Hadsell, Raia, and Huang, Jie. A tutorial on energy-based learning. 2006.

Li, Yujia and Zemel, Richard S. Mean-field networks. CoRR, abs/1410.5884, 2014.

Meshi, Ofer, Sontag, David, Jaakkola, Tommi S., and Globerson, Amir. Learning efficiently with approximate inference via dual losses. In ICML, 2010.

Mnih, Volodymyr, Heess, Nicolas, Graves, Alex, et al. Recurrent models of visual attention. In NIPS, 2014.

Parikh, Neal, Boyd, Stephen, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Schwing, Alexander G. and Urtasun, Raquel. Fully connected deep structured networks.

Schwing, Alexander G., Hazan, Tamir, Pollefeys, Marc, and Urtasun, Raquel. Efficient structured prediction with latent variables for general graphical models. In ICML, 2012.

Taskar, Ben, Chatalbashev, Vassil, Koller, Daphne, and Guestrin, Carlos. Learning structured prediction models: a large margin approach. In ICML, 2005.

Tompson, Jonathan, Jain, Arjun, LeCun, Yann, and Bregler, Christoph. Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS, 2014.

Wang, Shenlong, Fidler, Sanja, and Urtasun, Raquel. Proximal deep structured models. In NIPS, 2016.

Weiss, Yair and Freeman, William T. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Computation, 2001.

Weston, Jason, Chopra, Sumit, and Bordes, Antoine. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron, Salakhudinov, Ruslan, Zemel, Rich, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pp. 2048–2057, 2015.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip H. S. Conditional random fields as recurrent neural networks. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1529–1537, 2015a.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip HS. Conditional random fields as recurrent neural networks. In ICCV, pp. 1529–1537, 2015b.