International Journal of Automation and Computing 13(3), June 2016, 226-234 DOI: 10.1007/s11633-016-1006-2
Minimal Gated Unit for Recurrent Neural Networks
Guo-Bing Zhou, Jianxin Wu, Chen-Lin Zhang, Zhi-Hua Zhou
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
Abstract: Recurrent neural networks (RNN) have been very successful in handling sequence data. However, understanding RNN and finding the best practices for RNN learning is a difficult task, partly because there are many competing and complex hidden units, such as the long short-term memory (LSTM) and the gated recurrent unit (GRU). We propose a gated unit for RNN, named the minimal gated unit (MGU), since it only contains one gate, which is a minimal design among all gated hidden units. The design of MGU benefits from evaluation results on LSTM and GRU in the literature. Experiments on various sequence data show that MGU has comparable accuracy with GRU, but has a simpler structure, fewer parameters, and faster training. Hence, MGU is suitable for RNN applications. Its simple architecture also means that it is easier to evaluate and tune, and in principle it is easier to study MGU's properties theoretically and empirically.
Keywords: Recurrent neural network, minimal gated unit (MGU), gated unit, gated recurrent unit (GRU), long short-term memory (LSTM), deep learning.
Research Article. Manuscript received April 6, 2016; accepted May 9, 2016. This work was supported by National Natural Science Foundation of China (Nos. 61422203 and 61333014) and National Key Basic Research Program of China (No. 2014CB340501). Recommended by Associate Editor Yi Cao. © Institute of Automation, Chinese Academy of Sciences and Springer-Verlag Berlin Heidelberg 2016.

1 Introduction

In recent years, deep learning models have been particularly effective in dealing with data that have complex internal structures. For example, convolutional neural networks (CNN) are very effective in handling image data, in which 2D spatial relationships are critical among the set of raw pixel values in an image[1, 2]. Another success of deep learning is handling sequence data, in which the sequential relationships within a variable length input sequence are crucial. In sequence modeling, recurrent neural networks (RNN) have been very successful in language translation[3-5], speech recognition[6], image captioning, i.e., summarizing the semantic meaning of an image into a sentence[7-9], recognizing actions in videos[10, 11], or short-term precipitation prediction[12].

There is, however, one important difference between recurrent neural networks (RNN) and CNN. The key building blocks of CNN, such as the nonlinear activation function and the convolution and pooling operations, have been extensively studied. The choices are becoming convergent, e.g., rectified linear unit (ReLU) for nonlinear activation, small convolution kernels, and max-pooling. Visualization also helps us understand the semantic functionalities of different layers[13], e.g., firing at edges, corners, combinations of specific edge groups, object parts, and objects.

The community's understanding of RNN, however, is not as thorough, and the opinions are much less convergent. Proposed by Hochreiter and Schmidhuber[14], the long short-term memory (LSTM) model and its variants have been popular RNN hidden units with excellent accuracy. A typical LSTM is complex, having 3 gates and 2 hidden states. Using LSTM as a representative example, we may find some obstacles that prevent us from reaching a consensus on RNN:

• Depth with unrolling. Different from CNN, which has concrete layers, RNN is considered a deep model only when it is unrolled along the time axis. This property helps in reducing parameter size, but at the same time hinders visualization and understanding. This difficulty comes from the complexity of sequence data itself. We do not have a satisfactory solution for this difficulty, but we hope this paper will help in mitigating the difficulties caused by the next one.

• Competing and complex structures. Since its inception, the focus of LSTM-related research has been altering its structure to achieve higher performance. For example, a forget gate was added by Gers et al.[15] to the original LSTM, and a peephole connection added by Gers et al.[16] made it even more complex. It was not until recently that simplified models were proposed and studied, such as the gated recurrent unit (GRU) by Cho et al.[3] GRU, however, is still relatively complex because it has two gates. Very recently there have been empirical evaluations of LSTM, GRU, and their variants[17-19]. Unfortunately, no consensus has yet been reached on the best LSTM-like RNN architecture.

Theoretical analysis and empirical understanding of deep learning techniques are fundamental. However, this is very difficult when there are too many components in the structure. Simplifying the model structure is an important step toward enabling learning theory analysis in the future.
Complex RNN models not only hinder our understanding. They also mean that more parameters are to be learned and more components to be tuned. As a natural consequence, more training sequences and (perhaps) more training time are needed. However, evaluations by Chung et al.[17] and Jozefowicz et al.[18] both show that more gates do not lead to better accuracy. On the contrary, the accuracy of GRU is often higher than that of LSTM, despite the fact that GRU has one less hidden state and one less gate than LSTM.

In this paper, we propose a new variant of GRU (which is also a variant of LSTM) that has a minimal number of gates – only one gate! Hence, the proposed method is named the minimal gated unit (MGU). Evaluations in [17-19] agreed that an RNN with a gated unit works significantly better than an RNN with a simple tanh unit without any gate. The proposed method has the smallest possible number of gates in any gated unit, a fact giving rise to the name minimal gated unit.

With only one gate, we expect MGU to have significantly fewer parameters to learn than GRU or LSTM, and also fewer components or variations to tune. The learning process will be faster than for GRU and LSTM, which will be verified by our experiments on a diverse set of sequence data in Section 4. What is more, our experiments also show that MGU has overall comparable accuracy with GRU, which once more concurs with the observation that fewer gates reduce complexity but not necessarily accuracy.

Before we present the details of MGU, we want to add that we are not proposing a "better" RNN model in this paper.¹ The purpose of MGU is two-fold. First, with this simpler model, we can reduce the requirement for training data, architecture tuning, and CPU time, while at the same time maintaining accuracy. This characteristic may make MGU a good candidate in various applications. Second, a minimally designed gated unit will (in principle) make our analyses of RNN easier, and help us understand RNN, but this will be left as future work.

¹ We also believe that without a carefully designed common set of comprehensive benchmark datasets and evaluation criteria, it is not easy to reach conclusive decisions as to which RNN model is better.
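To make the parameter comparison concrete, the following is a back-of-the-envelope count; it is our own illustration (the names n_h, n_x, and rnn_params are ours), anticipating the unit definitions in Fig. 1, where every gate or candidate state uses one transform W[h_{t-1}, x_t] + b:

    # Each transform W[h_{t-1}, x_t] + b has a weight matrix of shape
    # (n_h, n_h + n_x) plus an n_h-dimensional bias. The simple RNN uses
    # one such transform, MGU two, GRU three, and LSTM four.
    def rnn_params(num_transforms, n_h, n_x):
        return num_transforms * (n_h * (n_h + n_x) + n_h)

    n_h, n_x = 100, 50  # arbitrary example sizes
    for name, k in [("simple RNN", 1), ("MGU", 2), ("GRU", 3), ("LSTM", 4)]:
        print(f"{name:>10}: {rnn_params(k, n_h, n_x):,} parameters")

Since the units differ only in how many such transforms they stack, the ratios hold for any n_h and n_x: MGU carries two thirds of GRU's parameters and half of LSTM's.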
2 RNN: LSTM, GRU, and more

We start by introducing various RNN models, mainly LSTM and GRU, and their evaluation results. These studies in the literature have guided us in how to minimize a gated hidden unit in RNN.

A recurrent neural network uses an index t to indicate different positions in an input sequence, and assumes that there is a hidden state h_t to represent the system status at time t. We use boldface letters to denote vectors. RNN accepts input x_t at time t, and the status is updated by a nonlinear mapping f from time to time:

    h_t = f(h_{t-1}, x_t).    (1)
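As a minimal sketch (with hypothetical helper names of our own), the recurrence (1) applies the same map f at every position, which is also why an unrolled RNN is as deep as its input sequence is long:

    def run_rnn(f, h0, xs):
        # Unroll the recurrence (1) along the time axis: h_t = f(h_{t-1}, x_t).
        h = h0
        for x in xs:
            h = f(h, x)
        return h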
One usual way of defining the recurrent unit f is a linear transformation plus a nonlinear activation, e.g.,

    h_t = \tanh(W [h_{t-1}, x_t] + b)    (2)

where we combined the parameters related to h_{t-1} and x_t into a matrix W, and b is a bias term. The activation (tanh) is applied to every element of its input. The task of RNN is to learn the parameters W and b, and we call this architecture the simple RNN. An RNN may also have an optional output vector y_t.
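Concretely, one step of the simple RNN in (2) can be sketched in NumPy as follows (a sketch under our own naming assumptions, not code from the paper):

    import numpy as np

    def simple_rnn_step(h_prev, x, W, b):
        hx = np.concatenate([h_prev, x])  # [h_{t-1}, x_t]
        return np.tanh(W @ hx + b)        # (2): tanh acts element-wise

    # Usage: W has shape (n_h, n_h + n_x), b has length n_h.
    n_h, n_x = 4, 3
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((n_h, n_h + n_x))
    b = np.zeros(n_h)
    h = np.zeros(n_h)
    for x in rng.standard_normal((5, n_x)):  # a length-5 input sequence
        h = simple_rnn_step(h, x, W, b)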
LSTM. RNN in the simple form suffers from the vanishing or exploding gradient issue, which makes learning RNN using gradient descent very difficult in long sequences[14, 20]. LSTM solved the gradient issue by introducing various gates to control how information flows in RNN, which are summarized in Fig. 1, from (4a) to (4f), in which

    \sigma(x) = 1 / (1 + \exp(-x))    (3)

is the logistic sigmoid function (applied to every component of the vector input) and \odot is the component-wise product between two vectors.
LSTM (long short-term memory):

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    (4a)
    i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)    (4b)
    o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)    (4c)
    \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)    (4d)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    (4e)
    h_t = o_t \odot \tanh(c_t).    (4f)

GRU (gated recurrent unit):

    z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)    (5a)
    r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)    (5b)
    \tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)    (5c)
    h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t.    (5d)

MGU (minimal gated unit, the proposed method):

    f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    (6a)
    \tilde{h}_t = \tanh(W_h [f_t \odot h_{t-1}, x_t] + b_h)    (6b)
    h_t = (1 - f_t) \odot h_{t-1} + f_t \odot \tilde{h}_t.    (6c)

Fig. 1 Summary of three gated units: LSTM, GRU, and the proposed MGU. \sigma is the logistic sigmoid function, and \odot means component-wise product.
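For readers who prefer code, one time step of each unit in Fig. 1 can be sketched in NumPy as below; the parameter dictionary p and the initialization are our own (hypothetical) conventions, not the paper's:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))  # logistic sigmoid, (3)

    def lstm_step(h, c, x, p):
        hx = np.concatenate([h, x])
        f = sigmoid(p["Wf"] @ hx + p["bf"])    # (4a) forget gate
        i = sigmoid(p["Wi"] @ hx + p["bi"])    # (4b) input gate
        o = sigmoid(p["Wo"] @ hx + p["bo"])    # (4c) output gate
        c_new = f * c + i * np.tanh(p["Wc"] @ hx + p["bc"])  # (4d)-(4e)
        return o * np.tanh(c_new), c_new       # (4f)

    def gru_step(h, x, p):
        hx = np.concatenate([h, x])
        z = sigmoid(p["Wz"] @ hx + p["bz"])    # (5a) update gate
        r = sigmoid(p["Wr"] @ hx + p["br"])    # (5b) reset gate
        h_tilde = np.tanh(p["Wh"] @ np.concatenate([r * h, x]) + p["bh"])  # (5c)
        return (1 - z) * h + z * h_tilde       # (5d)

    def mgu_step(h, x, p):
        hx = np.concatenate([h, x])
        f = sigmoid(p["Wf"] @ hx + p["bf"])    # (6a) the single (forget) gate
        h_tilde = np.tanh(p["Wh"] @ np.concatenate([f * h, x]) + p["bh"])  # (6b)
        return (1 - f) * h + f * h_tilde       # (6c)

    def init_params(gates, n_h, n_x, rng):
        p = {}
        for g in gates:  # one transform W[h, x] + b per gate/candidate
            p["W" + g] = 0.1 * rng.standard_normal((n_h, n_h + n_x))
            p["b" + g] = np.zeros(n_h)
        return p

    # Usage: MGU needs only two transforms ("f" and "h"), see (6a)-(6c).
    rng = np.random.default_rng(0)
    n_h, n_x = 4, 3
    p = init_params(["f", "h"], n_h, n_x, rng)
    h = np.zeros(n_h)
    for x in rng.standard_normal((5, n_x)):
        h = mgu_step(h, x, p)

Note how in mgu_step the single gate f plays both roles: it attenuates h_{t-1} inside the candidate state (6b) and interpolates between the old and new states in (6c).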
Fig. 2(a) illustrates the data flow and operations in LSTM.

• There is one more hidden state c_t in addition to h_t, which helps maintain long-term memories.

• The forget gate f_t decides the portion (between 0 and 1) of c_{t-1} to be remembered, determined by parameters W_f and b_f.

... as illustrated in Fig. 2(b). The coupling removed one gate and its parameters (W_i and b_i), which leads to reduced com-