Partially Non-Recurrent Controllers for Memory-Augmented Neural Networks

Naoya Taguchi∗† and Yoshimasa Tsuruoka‡

†DeNA Co., Ltd., Shibuya Hikarie 2-21-1 Shibuya, Shibuya-ku Tokyo, Japan [email protected]

‡The University of Tokyo, 3-7-1 Hongo, Bunkyo-ku, Tokyo, Japan [email protected]

∗Work done while the author was at the University of Tokyo.

arXiv:1812.11485v1 [cs.NE] 30 Dec 2018

Abstract

Memory-Augmented Neural Networks (MANNs) are a class of neural networks equipped with an external memory, and are reported to be effective for tasks requiring a large long-term memory and its selective use. The core module of a MANN is called a controller, which is usually implemented as a recurrent neural network (RNN) (e.g., LSTM) to enable the use of contextual information in controlling the other modules. However, such an RNN-based controller often allows a MANN to directly solve the given task by using the (small) internal memory of the controller, and prevents the MANN from making the best use of the external memory, thereby resulting in a suboptimally trained model. To address this problem, we present a novel type of RNN-based controller that is partially non-recurrent and avoids the direct use of its internal memory for solving the task, while keeping the ability of using contextual information in controlling the other modules. Our empirical experiments using Neural Turing Machines and Differentiable Neural Computers on the Toy and bAbI tasks demonstrate that the proposed controllers give substantially better results than standard RNN-based controllers.

Introduction

Recurrent Neural Networks (RNNs) are widely used in applications that require sequential data processing such as natural language processing and speech recognition (Graves et al. 2013; Cho et al. 2014a). In particular, RNNs equipped with a long-term memory such as Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) have proven highly effective and achieved state-of-the-art performance in many tasks (Wu et al. 2016; Oord et al. 2016). Nevertheless, those RNNs are not without limitation; since they implement their memory using a fixed-size vector, the capacity of their memory is severely restricted and it is hard to have a compartmentalized memory to accurately remember facts about the past (Weston et al. 2014).

To address the limitation of RNNs, researchers have proposed models called Memory-Augmented Neural Networks (MANNs). MANNs are a class of networks equipped with an external memory (Santoro et al. 2016), and they are capable of using individual facts from the past selectively. While MANNs have shown promising results in some (relatively small-scale) experiments, they are not yet practical enough to be widely used in many real-world applications.

While there are various types of MANNs, we focus on the MANNs that are based on the Neural Turing Machine (NTM) (Graves et al. 2014). As shown in Figure 1, an NTM-based MANN consists of a memory and three types of modules implemented using neural networks (NNs), namely, a controller, read heads, and a write head. Among these modules, we focus on the controller, which is the core module that controls how a MANN operates. In most of the previous work on the NTM, the controller is implemented using an LSTM-based RNN because it enables the controller to operate using contextual information. However, it has recently been pointed out that using RNNs for the controller can have negative effects in training the whole model (Gulcehre et al. 2017a; Gulcehre et al. 2017b). This is mainly because the RNN-based controller has its own memory, and it allows the model to partially solve the given task without using the large external memory, thereby resulting in a suboptimally trained model.

To address the abovementioned problem, we present a novel type of RNN-based controller that can avoid suboptimal solutions while keeping the ability of using contextual information. Experiments on the Toy tasks (Graves et al. 2014; Grefenstette et al. 2015; Yang and Rush 2016) and the bAbI task (Weston et al. 2015) demonstrate that our approach substantially improves the performance of MANNs.

The main contributions of our work are as follows:

• Introducing a novel type of RNN-based controller for MANNs. This controller utilizes contextual information for controlling the other modules while avoiding its direct use for the outputs of the model.

• Demonstrating the effectiveness of the proposed controllers by the experiments on the Toy and bAbI tasks. The experimental results show that the proposed controllers significantly outperform conventional controllers in both tasks.

Figure 1: The architecture of an NTM-based MANN. Note that $r_t$ is generated after the read and write operations, and is used for $H^o_t$ at $t$.

Memory-Augmented Neural Networks

Model Outline

MANNs are a class of neural networks equipped with an external memory, which is implemented as a set of vectors. Each of the vectors is associated with an address of its memory, and each operation of reading from or writing to the memory is performed with respect to each address. This design of memory use enables a MANN to use a large memory and deal with facts from the past selectively. In this paper, we focus on the NTM-based MANNs, and use the term MANNs to refer to the NTM-based MANNs in what follows.

As shown in Figure 1, a MANN consists of a memory and three kinds of modules implemented using NNs, namely, a controller, read heads, and a write head. At each time step $t$, these modules follow the procedures from (i) to (iv) as below.

(i) According to the input to the model $x_t \in \mathbb{R}^{I}$ and the information read from the memory $r_{t-1} \in \mathbb{R}^{RW}$, the controller generates two vectors $H^r_t$ and $H^w_t$ to control the read heads and the write head. $I$ is the size of the input vectors, and $W$ is the size of each vector in the memory. $r_{t-1}$ is defined as $r_{t-1} = [r^1_{t-1}; \ldots; r^R_{t-1}]$, where $r^i_{t-1}$ is a vector read from the memory by the $i$th of the $R$ read heads at $t-1$, and the semicolons mean the concatenation of vectors.

(ii) According to $H^w_t$ and the memory at $t-1$, $M_{t-1} \in \mathbb{R}^{N \times W}$, the write head updates $M_{t-1}$ to $M_t$, where $N$ is the number of addresses of the memory. The write operation is performed as follows:

$$M_t = M_{t-1} \odot (E - w^w_t e_t^\top) + w^w_t v_t^\top,$$

where $\odot$ denotes the element-wise product, all the elements of $E \in \mathbb{R}^{N \times W}$ are 1, and the vectors $e_t \in [0, 1]^W$ and $v_t \in \mathbb{R}^W$ are used for erasing or adding information in the memory at $t$. $w^w_t \in [0, 1]^N$ is a vector which represents the weights for erasing and adding information at each address, where $\sum_j w^w_t(j) \le 1$. $e_t$ and $v_t$ are generated by a one-layer NN which uses $H^w_t$ as its input.

(iii) According to $H^r_t$ and $M_t$, the read heads read information from the memory, and generate $r_t$. The read operation of the $i$th read head is performed as follows:

$$r^i_t = M_t^\top w^{r,i}_t,$$

where $w^{r,i}_t \in [0, 1]^N$ is the weights of the $i$th read head for reading information from each address, where $\sum_j w^{r,i}_t(j) \le 1$. After its generation, $r_t$ is sent to the controller, and the controller saves it.

(iv) According to $x_t$, $r_{t-1}$, and $r_t$, the controller generates $H^o_t$, which is the information used for the output of the model.

In the procedures from (i) to (iv), how to generate $H^r_t$, $H^w_t$, $H^o_t$, $w^w_t$, and $w^{r,1}_t, \ldots, w^{r,R}_t$ depends on models and their implementations. In this paper, we use the NTM and the Differentiable Neural Computer (DNC) (Graves et al. 2016) for the models. First, we explain how to generate $H^r_t$, $H^w_t$, and $H^o_t$ for the two models in the next section. We then explain the mechanisms to generate $w^w_t$ and $w^{r,1}_t, \ldots, w^{r,R}_t$ for each model in the following sections.

Controller

How to generate $H^r_t$, $H^w_t$, and $H^o_t$ is determined by the controller. In this paper, we assume that the baseline controllers for the NTM and the DNC are implemented as $H^r_t = H^w_t = h_t$ and $H^o_t = [h_t; r_t]$, where $h_t$ is a vector generated by a NN in the controller according to $x_t$ and $r_{t-1}$. The same design is used in the original paper of DNC (Graves et al. 2016).

We consider three types of NNs for the baseline controller: Feedforward Neural Networks (FNNs), Elman Networks (ENs) (Elman 1990), and LSTMs. We call each type of the controllers an FNN controller, an EN controller, and an LSTM controller. The FNN controller generates $h_t$ as follows:

$$h_t = \phi(W_{xh} x_t + W_{rh} r_{t-1} + b),$$

where $\phi$ is an activation function. Similarly, the EN controller generates $h_t$ as follows:

$$h_t = \phi(W_{xh} x_t + W_{rh} r_{t-1} + W_{hh} h_{t-1} + b). \tag{1}$$

The LSTM controller generates $h_t$ as follows:

$$z_t = \tanh(W_{xz} x_t + W_{rz} r_{t-1} + W_{hz} h_{t-1} + b_z), \tag{2}$$
$$i_t = \sigma(W_{xi} x_t + W_{ri} r_{t-1} + W_{hi} h_{t-1} + b_i), \tag{3}$$
$$f_t = \sigma(W_{xf} x_t + W_{rf} r_{t-1} + W_{hf} h_{t-1} + b_f), \tag{4}$$
$$o_t = \sigma(W_{xo} x_t + W_{ro} r_{t-1} + W_{ho} h_{t-1} + b_o), \tag{5}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot z_t, \tag{6}$$
$$h_t = o_t \odot \tanh(c_t), \tag{7}$$

where $\sigma$ is an activation function for the gating mechanism. In Equations (1–7), we set $h_0 = c_0 = 0$.

Neural Turing Machine

For the NTM, $w^w_t$ and all of $w^{r,1}_t, \ldots, w^{r,R}_t$ are generated by the same mechanism. Here we denote them by $w_t$.

First, according to $H^r_t = H^w_t = h_t$ generated by the controller, the following operation is conducted:

$$c_t[i] = C(M, k_t, \beta_t)[i] = \frac{\exp(\beta_t K(k_t, M[i]))}{\sum_j \exp(\beta_t K(k_t, M[j]))}, \tag{8}$$

where $k_t$ and $\beta_t$ are generated from $h_t$ using a one-layer NN. Note that we use the expression $M$ because the write head uses $M_{t-1}$, while the read heads use $M_t$. $K(a, b)$ is a function which measures the relatedness between two vectors, $a$ and $b$, and is usually implemented by cosine similarity:

$$K(a, b) = \frac{a \cdot b}{|a||b|}.$$

Next, the NTM generates $w^g_t$ as follows:

$$w^g_t = g_t c_t + (1 - g_t) w_{t-1},$$

where $g_t \in [0, 1]$ is generated by a one-layer NN. After that, the following circular convolution is applied to $w^g_t$:

$$\tilde{w}_t[i] = \sum_{j=0}^{N-1} w^g_t[j] \, s_t[i - j],$$

where $s_t$ represents the amount of shift; its elements lie in $[0, 1]$ and satisfy the condition $\sum_j s_t[j] = 1$, and the index $i - j$ is taken modulo $N$. $s_t$ is generated by a one-layer NN. Finally, $w_t$ is generated as follows:

$$w_t[i] = \frac{\tilde{w}_t[i]^{\gamma_t}}{\sum_j \tilde{w}_t[j]^{\gamma_t}},$$

where $\gamma_t$ satisfies the condition $\gamma_t \ge 1$, and is generated by a one-layer NN. $\gamma_t$ sharpens the elements of $w_t$.

Differentiable Neural Computer

In the DNC, $w^w_t$ and $w^{r,1}_t, \ldots, w^{r,R}_t$ are generated by different mechanisms.

First, we explain the write operation. The following operation is conducted:

$$\psi_t = \prod_{i=1}^{R} (1 - f^i_t w^{r,i}_{t-1}),$$

where $f^i_t \in [0, 1]$ is a scalar generated by a one-layer NN for each read head. $\psi_t \in [0, 1]^N$ represents how much each address will not be freed by the free gates $f^i_t$. According to $\psi_t$, the usage vector is defined as follows:

$$u_t = (u_{t-1} + w^w_{t-1} - u_{t-1} \odot w^w_{t-1}) \odot \psi_t,$$

where each element of $u_t$ indicates the degree to which the address is used, and the nearer it is to 1, the higher the degree is. After that, $\varphi_t \in \mathbb{Z}^N$ is defined. Each element of $\varphi_t$ represents an index, and they are sorted by ascending order of usage. By using $\varphi_t$, the allocation weighting, which is used to provide new addresses for writing, is generated as follows:

$$a_t[\varphi_t[j]] = (1 - u_t[\varphi_t[j]]) \prod_{i=1}^{j-1} u_t[\varphi_t[i]].$$

According to $a_t$, the actual address used for the write operation is defined as follows:

$$w^w_t = g^w_t [g^a_t a_t + (1 - g^a_t) c^w_t],$$

where $g^w_t \in [0, 1]$ and $g^a_t \in [0, 1]$ are scalars generated by a one-layer NN. $c^w_t$ is a vector generated as $c_t$ in Equation (8).

The read operation is conducted using a temporal link matrix, $L_t \in [0, 1]^{N \times N}$. This matrix holds the order of written addresses. In the DNC, the following operation is conducted according to $w^w_t$:

$$p_t = \Big(1 - \sum_i w^w_t[i]\Big) p_{t-1} + w^w_t,$$

where $p_0 = 0$. $p_t$ basically represents the addresses where the write operation is conducted at $t$, while it holds the recently written addresses when the write operation is not conducted. $L_t$ tracks the write operation by the following operation:

$$L_t[i, j] = (1 - w^w_t[i] - w^w_t[j]) L_{t-1}[i, j] + w^w_t[i] \, p_{t-1}[j],$$

where $L_0[i, j] = 0$ for all $i, j$ and $L_t[i, i] = 0$. By using this matrix, vectors $f^i_t$ and $b^i_t$ are defined as follows:

$$f^i_t = L_t \cdot w^{r,i}_{t-1},$$
$$b^i_t = L_t^\top \cdot w^{r,i}_{t-1}.$$

The two vectors represent the addresses written immediately after ($f^i_t$) and immediately before ($b^i_t$) the addresses that the $i$th read head read at the previous time step. Finally, the addresses for the read operation are defined as follows:

$$w^{r,i}_t = \pi^i_t[1] \, b^i_t + \pi^i_t[2] \, c^{r,i}_t + \pi^i_t[3] \, f^i_t,$$

where $c^{r,i}_t$ is generated as $c_t$ in Equation (8), and $\pi^i_t \in [0, 1]^3$ is generated according to $h_t$ using a one-layer NN which uses a softmax function for its activation function. We do not use the sparse link matrix for the DNC in this paper.

Partially Non-Recurrent Controllers

RNN-based controllers enable MANNs to utilize contextual information for controlling the other modules. This is usually beneficial for the models, and most of the studies on MANNs adopt an RNN-based controller for their models. However, the use of the memory in the RNN-based controller potentially has a negative effect on the training of the models because the output of the controller $h_t$ is used for $H^o_t$, which allows the model to directly solve the given tasks using the (small) memory in the controller.

In this paper, we propose a novel type of RNN-based controller that is partially non-recurrent and avoids the direct use of its internal memory for solving the task, while keeping the ability of using contextual information in controlling the other modules. As shown in Figure 2, the outputs of the proposed RNN-based controller are $H^r_t = H^w_t = h_t$ and $H^o_t = [h'_t; r_t]$, where $h_t$ is the vector generated in the same way as usual RNN-based controllers, and $h'_t$ is the vector generated without using the memory in the controllers. For the EN controller, $h'_t$ is generated as follows:

$$\bar{h}'_t = W_{xh} x_t + W_{rh} r_{t-1} + b, \tag{9}$$
$$h'_t = \phi(\bar{h}'_t). \tag{10}$$

Task             Input                                           Output
COPY             $a_1 a_2 a_3 a_4 \ldots a_T$                    $a_1 a_2 a_3 a_4 \ldots a_T$
REVERSE          $a_1 a_2 a_3 a_4 \ldots a_T$                    $a_T a_{T-1} a_{T-2} a_{T-3} \ldots a_1$
BIGRAM FLIP      $a_1 a_2 a_3 a_4 \ldots a_T$                    $a_2 a_1 a_4 a_3 \ldots a_T$
ODD FIRST        $a_1 a_2 a_3 a_4 \ldots a_T$                    $a_1 a_3 \ldots a_2 a_4 \ldots$
REPEAT COPY      $a_1 a_2 a_3 a_4 \ldots a_T \, M$               $a_1 \ldots a_T \ldots a_1 \ldots a_T$ ($M$ times)
PRIORITY SORT    $a_1^{5} a_2^{10} a_3^{1} a_4^{T} \ldots a_T^{4}$    $a_3^{1} a_7^{2} a_8^{3} a_T^{4} \ldots a_4^{T}$

Table 1: Details of the Toy tasks. The superscripts in PRIORITY SORT are the priorities defined on the dataset, and are implicitly attached to each kind of vector.
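The input-to-output mappings listed in Table 1 can be made concrete with a small sketch. The function name `toy_target`, the use of placeholder strings instead of nine-dimensional binary vectors, the omission of end-of-sequence markers, and the choice to leave the last element in place for an odd-length BIGRAM FLIP input are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
def toy_target(task, seq, repeats=1, priorities=None):
    """Return the target sequence for one Toy task (illustrative sketch)."""
    if task == "COPY":
        return list(seq)
    if task == "REVERSE":
        return list(reversed(seq))
    if task == "BIGRAM FLIP":
        out = list(seq)
        # Swap each adjacent pair; an unpaired last element stays (assumption).
        for i in range(0, len(out) - 1, 2):
            out[i], out[i + 1] = out[i + 1], out[i]
        return out
    if task == "ODD FIRST":
        # Elements at odd positions (a1, a3, ...) followed by even positions.
        return list(seq[0::2]) + list(seq[1::2])
    if task == "REPEAT COPY":
        return list(seq) * repeats
    if task == "PRIORITY SORT":
        # Ascending priority, matching the output superscripts in Table 1.
        order = sorted(range(len(seq)), key=lambda i: priorities[i])
        return [seq[i] for i in order]
    raise ValueError("unknown task: " + task)


symbols = ["a1", "a2", "a3", "a4", "a5"]
for name in ["COPY", "REVERSE", "BIGRAM FLIP", "ODD FIRST"]:
    print(name, toy_target(name, symbols))
print("REPEAT COPY", toy_target("REPEAT COPY", symbols[:2], repeats=3))
print("PRIORITY SORT", toy_target("PRIORITY SORT", symbols[:3], priorities=[3, 1, 2]))
```

In an actual dataset each symbol would be a binary vector and the sequence length would be sampled per example; the sketch only fixes the ordering rule of each task.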

Figure 2: The architecture of the partially non-recurrent controller.

Then, $h_t$ is generated by using $\bar{h}'_t$ as follows:

$$h_t = \phi(W_{hh} h_{t-1} + \bar{h}'_t). \tag{11}$$

Note that the number of parameters used in Equation (9) and Equation (11) is the same as that used in Equation (1).

Similarly, for the LSTM controller, $h'_t$ is generated in the same manner as Equation (10) by using the following $\bar{h}'_t$:

$$\bar{h}'_t = W_{xz} x_t + W_{rz} r_{t-1} + b_z. \tag{12}$$

Then, $h_t$ is generated according to Equations (3–7) by using the following $z_t$:

$$z_t = \tanh(W_{hz} h_{t-1} + \bar{h}'_t). \tag{13}$$

Again, the number of parameters used in Equation (12) and Equation (13) is the same as that used in Equation (2).

Although in this paper we apply our proposal only to the EN and the LSTM controller, it can be applied to other types of RNN-based controllers such as the RNN-based controllers based on gated recurrent units (Cho et al. 2014b).

Experiments

General Settings

To evaluate the performance of our proposed RNN-based controller, we carry out experiments on two sets of tasks, the Toy and bAbI tasks. On both tasks, we compare the performance of the NTM and the DNC with the FNN controller, the EN controller, the proposed EN controller, the LSTM controller, and the proposed LSTM controller. The network settings of the NTM and the DNC are described in the following sections separately because they are different on the two tasks, except for the upper bound of the shift operation of the NTM, which is ±3. We also carry out experiments on a one-layer EN and LSTM with 128 hidden units to evaluate how well the RNNs in RNN-based controllers can solve the tasks. For parameter optimization, we use RMSProp (Graves 2013) with a learning rate of 0.0001 and a momentum of 0.9. The training is performed in an online manner, and during backpropagation we clip all gradient values by the global norm with a threshold of 5. When evaluating the models, we use the best model parameters in terms of validation loss, and we report average scores of ten individual models with different parameter initializations for each experimental setting.

Toy Tasks

Settings. Following the previous work (Graves et al. 2014; Grefenstette et al. 2015; Yang and Rush 2016), we use six kinds of the Toy tasks described in Table 1. On each task, the models receive a sequence of nine-dimensional binary vectors, and they are required to output an appropriate sequence of vectors. The ninth element of each vector indicates the end of the input and output sequences. The number of the input vectors is indicated by $T$, and it is chosen randomly for each input sequence. We adopt $T \in [1, 20]$ for all tasks except for REPEAT COPY, for which we adopt $T \in [1, 10]$ and $M \in [1, 10]$, where $M$ is the randomly chosen number of repetitions. We use ten different training datasets, each of which consists of 1,000,000 sequences for each individual model. The sizes of test and validation data are 10,000 and 1,000, respectively, and we validate the models every 1,000 training iterations. We use the unit size of 128 for all of the controllers. The memory size of the NTM and the DNC is 128 × 20, and the number of the read heads is set to one for all of the tasks except for PRIORITY SORT, for which we use the models with four read heads. For evaluation, we use the average bit error rates of the output of the ten models.

Discussions on the test results. Table 2 shows the experimental results on the Toy tasks. As shown in Table 2, the NTM or the DNC with one of the proposed controllers achieves the lowest average bit error rates on all of the tasks. Figure 3 illustrates why the models with the proposed controllers achieve the best results. In Figure 3(a), some of the ten models converge insufficiently, while all of the models converge successfully in Figure 3(b). Also, we show examples of the output and the memory use of the NTM with the LSTM and the proposed LSTM controller on COPY in Figure 4. As seen in Figure 4(b), the NTM with the proposed LSTM controller predicts the perfect output, making an appropriate use of the external memory, while the output of the NTM with the LSTM controller (Figure 4(a)) is far from perfect. An interesting observation is that the output of the NTM with the LSTM controller is partially correct although it does not read from the address where it wrote the information in the past. This phenomenon occurs because the model solves the task using the small memory in the controller directly as we hypothesized, and the phenomenon is seen for all of the insufficiently converged NTMs with the LSTM controller as shown in Figure 3(a).

Figure 3: Learning curves (average bit error rate against learning steps, ×10^6) of the NTM with (a) the LSTM controller and (b) the proposed LSTM controller on COPY. Bold lines represent the average learning curves, and each light-colored line represents the learning curve of one of the ten individual experiments.

Figure 4: Examples of output and memory use of the NTM with (a) the LSTM controller and (b) the proposed LSTM controller on COPY. In the figures of Read and Write, the white addresses are read or written. The yellow line is the border between the input phase and the output phase. Note that only a subset of memory locations is shown.

Nonetheless, there are situations where the proposed controllers perform worse than the other controllers. Among them, we focus on the results of the NTM with the proposed EN controller on REVERSE because both the average bit error rate and the number of trained models which completely solved the task are worse than those of the NTM with the conventional controller only for this case. We show failed and successful examples of output and memory use of the NTM with the proposed EN controller on REVERSE in Figure 5. In the experiment, we find that the NTM with the proposed EN controller tends to converge to the solution shown in Figure 5(a) or similar ones. In Figure 5(a), the model predicts partially correct outputs although it does not read the written information reversely. Because the proposed EN controller cannot use the internal memory of the controller directly to solve tasks, the phenomenon that the model even partially solves the task cannot occur without using the external memory. In Figure 5(a), we can see that the write operation is conducted on multiple addresses at each time step, while the read operation is conducted on just one address. In addition, the read operation in the input phase is conducted only on one specific address. These observations suggest that the model converges to a local optimum where it holds the partial contextual information using the internal memory of the controller, and sends it to the output using multiple memory locations. This type of local optimum tends to occur with the proposed controllers but not with the FNN controller and the standard RNN-based controllers. The FNN controller does not suffer from the second phenomenon because it does not have the internal memory of the controller, and the usual RNN-based controllers do not suffer from the two phenomena because they can directly use the internal memory of the controller for the output of the model.

Discussions on the average learning curves. Figure 6 shows the average learning curves of the NTM and the DNC with different controllers. In Figure 6, we can see that the learning curves of the successful cases of the NTM converge quickly (e.g., the NTM with the FNN, the proposed EN, and the proposed LSTM controller on COPY or BIGRAM FLIP). The architecture of the NTM is simple but contains enough functions to solve basic tasks such as COPY, while that of the DNC is more general but more complex. The FNN controller and the proposed RNN-based controllers utilize the external memory and functions of the model to solve the tasks as much as possible, which results in the fast convergence.

On the other hand, the models with the FNN, the proposed EN, and the proposed LSTM controllers are less stable compared to those with the EN and the LSTM controllers. For example, the learning curve for REVERSE COPY of the NTM with the proposed LSTM controller is basically worse than that of the NTM with the LSTM controller, while the test result with the proposed LSTM controller is better than that with the LSTM controller as seen in Table 2. This is because the learning curves of the NTM with the proposed LSTM controller have high volatility. Therefore, using appropriate early stopping is important to achieve good performance on the models with the FNN or the proposed RNN-based controllers.
                        RNN only                NTM                                                         DNC
Task                    EN          LSTM        FNN         EN          prop. EN    LSTM        prop. LSTM  FNN         EN          prop. EN    LSTM        prop. LSTM
COPY                    37.5 (0)    21.8 (0)    0.0 (10)    0.0 (9)     0.0 (10)    11.0 (3)    0.0 (10)    2.8 (9)     0.0 (9)     0.0 (10)    13.2 (3)    3.3 (7)
REVERSE                 25.8 (0)    13.2 (0)    10.0 (6)    2.0 (8)     15.2 (2)    8.9 (2)     7.8 (3)     2.2 (9)     7.4 (5)     0.0 (10)    7.7 (1)     13.5 (3)
BIGRAM FLIP             37.1 (0)    23.2 (0)    4.5 (3)     2.0 (8)     0.0 (10)    9.6 (5)     0.0 (10)    2.8 (5)     0.8 (8)     1.7 (8)     9.6 (3)     7.0 (6)
ODD FIRST               36.0 (0)    13.1 (0)    3.2 (1)     2.9 (6)     0.2 (7)     5.7 (3)     0.0 (10)    18.0 (0)    4.7 (2)     2.7 (6)     10.1 (1)    10.6 (1)
REPEAT COPY             15.6 (0)    7.7 (0)     1.0 (8)     0.0 (10)    0.0 (10)    0.9 (5)     0.0 (10)    1.5 (8)     0.3 (9)     0.0 (10)    0.0 (10)    0.0 (10)
PRIORITY SORT           30.0 (0)    14.7 (0)    11.2 (0)    7.4 (0)     9.3 (0)     7.5 (0)     8.2 (0)     12.1 (0)    6.7 (0)     9.0 (0)     8.3 (0)     4.4 (0)

Table 2: Average bit error rates (%) on the Toy tasks. Bold results are the best ones for each task. The bracketed numbers are the numbers of individual models, out of ten, which completely solved the tasks (achieved a bit error rate of 0.0%).

                        RNN only                NTM                                                         DNC
Task                    EN          LSTM        FNN         EN          prop. EN    LSTM        prop. LSTM  FNN         EN          prop. EN    LSTM        prop. LSTM
1: 1 supporting fact    76.9        30.0        1.1         33.7        26.5        6.1         1.4         12.7        56.8        36.7        11.7        0.1
4: 2 argument rels.     66.8        1.4         0.7         18.3        12.9        0.5         0.1         2.3         32.7        3.9         0.6         0.2
9: simple negation      80.0        17.8        7.9         14.3        22.1        5.5         4.3         13.8        34.6        16.9        10.8        0.7
10: indefinite knowl.   83.8        31.3        13.6        22.0        32.4        20.2        11.2        25.0        43.8        25.8        22.7        2.7
11: basic coreference   71.1        10.8        0.5         19.2        16.6        1.5         0.4         8.7         39.9        24.6        2.9         0.1
14: time reasoning      89.8        55.7        33.8        44.7        54.3        41.8        26.4        48.1        61.2        51.2        58.6        15.4
Mean err.               78.1        24.5        9.6         25.4        27.5        12.6        7.3         19.1        44.8        26.5        15.0        3.2

Table 3: Average error rates on the bAbI task. Bold results are the best ones for each task.

bAbI Task

Settings. To evaluate the proposed controllers on more practical situations, we carry out experiments on the bAbI task, which is a set of 20 simple question answering tasks. In each task, the models read stories followed by a few questions. Because the experiments using the full dataset were too computationally expensive, we only used Tasks 1, 4, 9, 10, 11, and 14 of the 20 tasks, following Hsin (2017). In the experiment, we train the models using a joint dataset of these six tasks, and the evaluation is conducted for each task separately. We use the dataset provided by Facebook (footnote 1) with 10k training examples for each task. We use NNs with 128 units for the controller, 128 × 32 for the memory, and $R = 4$. The epoch size is 128, and the other detailed settings are the same as Graves et al. (2016).

Footnote 1: http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz

Discussions. Table 3 shows the average error rates of the ten models on the bAbI task. Due to the difference in the experimental settings, the results are different from those in Graves et al. (2016).

Table 3 shows that the DNC with the proposed LSTM controller performs the best in terms of the mean error rate. The proposed LSTM controller brings out the potential ability of the DNC (e.g., the DNC can benefit from its design of tracking the order of written addresses to solve Task 14, time reasoning) because our proposed controller is designed to utilize the external memory. In addition, the proposed LSTM controller performs the best in all of the six tasks on each of the NTM and the DNC. Another observation seen in Table 3 is that the scores of the models with the EN and the proposed EN controllers are worse than those of the other cases. We speculate that the reason is that the internal memory of the EN does not contribute to increasing the performance of the model while preventing it from converging to an appropriate solution.

Figure 5: Failed and successful examples of output and memory use of the NTM with the proposed EN controller on REVERSE: (a) failure case, (b) successful case. In the figures of Read and Write, the white addresses are read or written. The yellow line is the border between the input phase and the output phase, and only a subset of memory locations is shown. Note that in REVERSE, successfully trained models read written information reversely as seen in Figure 5(b).

Related Work

NTM-based MANNs have been actively studied since the advent of the NTM (Santoro et al. 2016; Park et al. 2017; Franke et al. 2018). Rae et al. (2016) proposed Sparse Access Memory (SAM), which is a scalable end-to-end differentiable memory access scheme. One of the biggest restrictions of MANNs is that the capacity of memory depends on the size of the external memory, while larger external memory requires more computational cost. SAM enables efficient training of a MANN with a very large memory. Zaremba and Sutskever (2015) used a reinforcement learning algorithm on the NTM to apply it to tasks that require discrete interfaces, which are not differentiable.

Gulcehre et al. (2017b) proposed a novel NTM-based MANN, Dynamic Neural Turing Machine (D-NTM). While the original NTM implements location-based addressing using shift operations with a fixed size, the D-NTM performs this operation using NNs directly. They also address the same problem as ours, but they tackle the problem by using a regularization approach where the addresses pointed to by the read heads and the write head are forced to be consistent. Gulcehre et al. (2017a) proposed a novel MANN called TARDIS based on the concept that MANNs connect “time” discontinuously. Their work also addresses the problem which we focus on, but their approach is to use $H^o_t = [x_t; r_t]$ for the output of the model. Our proposed controller uses $h'_t$ instead of $x_t$ for $H^o_t$, which enables the output of the model and written information at $t$ to be more consistent.

There also exist models called MANNs which are not based on the NTM. In particular, the models based on Memory Networks (Weston et al. 2014) are actively studied (Kumar et al. 2015; Sukhbaatar et al. 2015; Henaff et al. 2016). These MANNs are different from the NTM-based models in some respects (e.g., MANNs based on Memory Networks conduct the read operation multiple times in a time step). In addition, MANNs based on Memory Networks do not have a module corresponding to the controller, which controls all the other modules.

Conclusion and Future Work

In this paper, we have proposed a novel type of RNN-based controller for MANNs. Without increasing the number of training parameters, our proposed controller avoids using the memory in the controller for the output of the model and benefits from it for controlling the other modules. In the experiments on both of the Toy and bAbI tasks, the best scores are achieved by the models with our proposed controller, which demonstrates the effectiveness of our approach.

An interesting direction of future work is exploring other architectures based on the insights obtained in this work because there are more variations than the two proposed controllers we have used.
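The statement that the proposed controller adds no parameters, while keeping its recurrent state out of the model output, can be checked on a small sketch of the EN variant following Equations (1) and (9) to (11). The class names, the choice of tanh for $\phi$, the initialization, and the layer sizes are assumptions made for illustration only and do not reproduce the authors' code.

```python
import numpy as np


class ENController:
    """Baseline Elman controller, Equation (1)."""

    def __init__(self, input_size, read_size, hidden_size, rng):
        self.W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
        self.W_rh = rng.normal(scale=0.1, size=(hidden_size, read_size))
        self.W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        self.b = np.zeros(hidden_size)

    def num_params(self):
        return sum(p.size for p in (self.W_xh, self.W_rh, self.W_hh, self.b))

    def step(self, x, r_prev, h_prev):
        h = np.tanh(self.W_xh @ x + self.W_rh @ r_prev + self.W_hh @ h_prev + self.b)
        # The same recurrent vector drives the heads and is sent to the output.
        return h, h


class ProposedENController(ENController):
    """Partially non-recurrent variant, Equations (9) to (11)."""

    def step(self, x, r_prev, h_prev):
        h_bar = self.W_xh @ x + self.W_rh @ r_prev + self.b  # Eq. (9)
        h_out = np.tanh(h_bar)                               # Eq. (10), no h_prev
        h = np.tanh(self.W_hh @ h_prev + h_bar)              # Eq. (11), for the heads
        return h, h_out


rng = np.random.default_rng(0)
base = ENController(9, 20, 16, np.random.default_rng(1))
prop = ProposedENController(9, 20, 16, np.random.default_rng(1))
x, r = rng.normal(size=9), rng.normal(size=20)

print("same parameter count:", base.num_params() == prop.num_params())
_, out_zero = prop.step(x, r, np.zeros(16))
_, out_ones = prop.step(x, r, np.ones(16))
print("output path ignores h_prev:", np.allclose(out_zero, out_ones))
```

In both classes the first returned vector plays the role of $H^r_t = H^w_t$; only the second, which would be concatenated with $r_t$ to form $H^o_t$, differs between the baseline and the proposed variant.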

Figure 6: Average learning curves (average bit error rate against learning steps, ×10^6) of the NTM and the DNC with all of the controllers we use, with one curve per task (Copy, Reverse, Bigram Flip, Odd First, Repeat Copy, Priority Sort). Panels: (a) NTM w/ FNN controller, (b) DNC w/ FNN controller, (c) NTM w/ EN controller, (d) DNC w/ EN controller, (e) NTM w/ proposed EN controller, (f) DNC w/ proposed EN controller, (g) NTM w/ LSTM controller, (h) DNC w/ LSTM controller, (i) NTM w/ proposed LSTM controller, (j) DNC w/ proposed LSTM controller. Note that the results reported in Table 2 were obtained at the best learning steps in terms of validation loss.

References

[Cho et al. 2014a] Kyunghyun Cho, B van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.

[Cho et al. 2014b] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In EMNLP, 2014.

[Elman 1990] Jeffrey L Elman. Finding structure in time. Cognitive science, 1990.

[Franke et al. 2018] Jörg Franke, Jan Niehues, and Alex Waibel. Robust and Scalable Differentiable Neural Computer for Question Answering. arXiv:1807.02658 [cs], 2018.

[Graves et al. 2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. IEEE ICASSP, 2013.

[Graves et al. 2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.

[Graves et al. 2016] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, and others. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.

[Graves 2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[Grefenstette et al. 2015] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. In NIPS, 2015.

[Gulcehre et al. 2017a] Caglar Gulcehre, Sarath Chandar, and Yoshua Bengio. Memory Augmented Neural Networks with Wormhole Connections. arXiv:1701.08718, 2017.

[Gulcehre et al. 2017b] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic Neural Turing Machine with Soft and Hard Addressing Schemes. arXiv:1607.00036, 2017.

[Henaff et al. 2016] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. Tracking the world state with recurrent entity networks. arXiv:1612.03969, 2016.

[Hochreiter and Schmidhuber 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.

[Hsin 2017] Carol Hsin. Implementation and Optimization of Differentiable Neural Computers. Technical Report for CS224 at Stanford University, 2017.

[Kumar et al. 2015] Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. arXiv preprint arXiv:1506.07285, 2015.

[Oord et al. 2016] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499, 2016.

[Park et al. 2017] Seongsik Park, Seijoon Kim, Seil Lee, Ho Bae, and Sungroh Yoon. Quantized Memory-Augmented Neural Networks. arXiv:1711.03712 [cs, stat], 2017.

[Rae et al. 2016] Jack W. Rae, Jonathan J. Hunt, Ivo Danihelka, Timothy Harley, Andrew W. Senior, Gregory Wayne, Alex Graves, and Tim Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In NIPS, 2016.

[Santoro et al. 2016] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.

[Sukhbaatar et al. 2015] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, and others. End-to-end memory networks. In NIPS, 2015.

[Weston et al. 2014] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.

[Weston et al. 2015] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

[Wu et al. 2016] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144, 2016.

[Yang and Rush 2016] Greg Yang and Alexander M. Rush. Lie-Access Neural Turing Machines. arXiv:1611.02854, 2016.

[Zaremba and Sutskever 2015] Wojciech Zaremba and Ilya Sutskever. Reinforcement Learning Neural Turing Machines - Revised. arXiv:1505.00521, 2015.