Bag-of-Words Input for Long History Representation in Neural Network-Based Language Models for Speech Recognition
INTERSPEECH 2015, September 6-10, 2015, Dresden, Germany
DOI: 10.21437/Interspeech.2015-513

Kazuki Irie (1), Ralf Schlüter (1), Hermann Ney (1,2)
(1) Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany
(2) Spoken Language Processing Group, LIMSI CNRS, Paris, France
{irie,schlueter,ney}@cs.rwth-aachen.de

Abstract

In most previous work on neural network based language models (NNLMs), words are represented as 1-of-N encoded feature vectors. In this paper we investigate an alternative encoding of the word history, the bag-of-words (BOW) representation of a word sequence, and use it as an additional input feature to the NNLM. Both the feedforward neural network (FFNN) and the long short-term memory recurrent neural network (LSTM-RNN) language models (LMs) with the additional BOW input are evaluated on an English large vocabulary automatic speech recognition (ASR) task. We show that the BOW features significantly improve both the perplexity (PP) and the word error rate (WER) of a standard FFNN LM. In contrast, the LSTM-RNN LM does not benefit from such an explicit long-context feature. The performance gap between feedforward and recurrent architectures for language modeling is therefore reduced. In addition, we revisit the cache-based LM, a seeming analog of the BOW for count-based LMs, which was unsuccessful for ASR in the past. Although the cache is able to improve the perplexity, we observe only a very small reduction in WER.

Index Terms: language modeling, speech recognition, bag-of-words, feedforward neural networks, recurrent neural networks, long short-term memory, cache language model

1. Introduction

The LM estimates the probability p(w_1^N) of a word sequence w_1^N. This probability appears in the fundamental Bayes decision rules of many language and speech processing tasks, including handwriting recognition, machine translation and ASR, and its incorporation is crucial for these systems to achieve state-of-the-art results. Since this probability can be factorized by the chain rule as

    p(w_1^N) = p(w_1) \prod_{n=2}^{N} p(w_n | w_1^{n-1})    (1)

the task of language modeling becomes the estimation of the conditional probabilities p(w_n | w_1^{n-1}). The conventional approach estimates these probabilities from the relative frequencies of n-gram counts, with a smoothing technique applied to distribute probability mass to unseen events [1, 2]. More recently, the estimation of these quantities by neural networks (NNs) has been shown to produce state-of-the-art performance in many tasks.

Two types of NN architectures have previously been investigated for language modeling: feedforward NNs and recurrent NNs (RNNs). Early investigations focused on RNNs [3], but the experiments were limited to sequence prediction for stochastic regular grammars and were not applied to a real ASR task. Within a real LM framework, investigations of the FFNN LM first showed improvements in PP and WER over the count LM [4, 5, 6]; the RNN LM was later reported to be more successful [7, 8]; and the LSTM-RNN LM [9], based on an improved RNN architecture, has been shown to achieve today's state-of-the-art LM results when interpolated with the count model. A number of works have compared the two types of architectures, and most of them have shown that the RNN LM outperforms the FFNN LM [8, 10, 11, 12]; some exceptions are [13, 14]. However, in most of these works the context length of the FFNN LM is very limited: at most 10-gram [13, 12], and often 4- or 5-gram, FFNN LMs have been used by analogy to the count model and compared to RNN LMs, which are in theory provided with an infinitely long context. A possible reason for the previous absence of attempts to further increase the context size of FFNNs is the inconvenience of the 1-of-N word representation for long word sequences.

In fact, in order to design an LM within an NN framework, words have to be encoded as vectors. Most previous works on NNLMs represent words with 1-of-N coding, also known as the one-hot representation: each word in the vocabulary is given an index, and a word is encoded as a vocabulary-sized vector whose coefficients are all set to 0, except the one at the index of the word, which is set to 1. A word sequence is therefore encoded as a sequence of 1-of-N vectors. In this paper, we investigate an alternative representation of a word sequence, the bag-of-words, which is the sum of the 1-of-N vectors of the word sequence. While this representation discards the information about word order, it provides a compact representation of long word sequences. It has been used successfully in text classification [15] and in computer vision [16]. We investigate the effect of an additional BOW input feature on both the FFNN and the LSTM-RNN LM, with respect to PP and WER, on an English large vocabulary continuous speech recognition (LVCSR) task.
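To make the two encodings concrete, the NumPy sketch below builds the 1-of-N vectors of a short word sequence and sums them into a BOW vector; the toy vocabulary, the word indices and the helper names (one_hot, bag_of_words) are chosen here purely for illustration.

import numpy as np

# Toy vocabulary; in the paper's setup this would be the full ASR vocabulary.
vocab = {"<s>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, "</s>": 6}
V = len(vocab)

def one_hot(word):
    # 1-of-N (one-hot) encoding: a |V|-dimensional vector with a single 1.
    v = np.zeros(V)
    v[vocab[word]] = 1.0
    return v

def bag_of_words(words):
    # BOW encoding: the sum of the 1-of-N vectors; word order is discarded.
    return sum(one_hot(w) for w in words)

history = ["the", "cat", "sat", "on", "the"]
print(one_hot("cat"))         # [0. 0. 1. 0. 0. 0. 0.]
print(bag_of_words(history))  # [0. 2. 1. 1. 1. 0. 0.]  ("the" occurs twice)

Note that the BOW vector stays |V|-dimensional no matter how long the history is, which is what makes it attractive for encoding long contexts.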
In addition, we revisit the cache-based LM [17] for ASR in the last section. A number of works have aimed to incorporate long-range dependencies into n-gram count models; techniques such as triggers [18], skip n-grams [19] and the cache-based LM [17] can be viewed as count-model analogs of the BOW. The cache LM is well known to give large improvements in PP over the n-gram count model by interpolation, yet it does not lead to WER improvements. We attempt to bring some clarification by removing pruning-related search errors and by reporting the PP of exactly the cache LM that was used to obtain a given WER.

Figure 1: Trigram feedforward NNLM.

Figure 2: Standard recurrent NNLM (with projection layer).

2. Word History Information in NNLMs: Model Architecture Review

2.1. Feedforward Neural Network LM

In a standard FFNN LM, the Markov property is assumed, as for the n-gram count LM:

    p(w_i | w_1^{i-1}) = p(w_i | w_{i-n+1}^{i-1})    (2)

The FFNN LM takes a fixed number of predecessor words, in the form of 1-of-N vectors, as input. The architecture of a trigram FFNN LM is illustrated in Figure 1. In practice, the output layer of NNLMs is factorized with word classes [20] to reduce the computation; we omit this in the illustration to keep the notation light. The forward pass of a trigram FFNN is as follows: given the 1-of-N inputs w(t) and w(t-1) of the two predecessor words w_t and w_{t-1}, it computes

    x(t) = [P w(t), P w(t-1)]                           (3)
    s(t) = f(A x(t))                                    (4)
    y(t) = [p(w | w_t, w_{t-1})]_{w \in V} = g(B s(t))  (5)

where V is the vocabulary, and P, A and B are the weight matrices of the projection, hidden and output layers, respectively; f is an element-wise sigmoid activation and g is the softmax function. We omit the bias terms for brevity. The only context information an n-gram FFNN LM is provided with is its (n-1) predecessor words, and extending the FFNN to a larger n-gram context requires increasing the size of x(t).

2.2. Recurrent Neural Network LM

As opposed to the FFNN LM, the RNN LM takes only one 1-of-N input, corresponding to the previous word. Instead of using a fixed-length context, the RNN stores the value of the hidden layer from the previous position, s(t-1), as illustrated in Figure 2. The corresponding equations are obtained by replacing Eqs. (3) and (4) of the FFNN with:

    x(t) = [P w(t)]                  (6)
    s(t) = f(A x(t) + R s(t-1))      (7)
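As a concrete illustration of the trigram forward pass in Eqs. (3)-(5), the NumPy sketch below uses toy dimensions, randomly initialized weights and a plain (unfactorized) softmax output instead of the class-factorized output of [20]; these choices are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
V, d_proj, d_hidden = 7, 8, 16  # toy sizes; real models are much larger

P = rng.normal(scale=0.1, size=(d_proj, V))             # projection matrix
A = rng.normal(scale=0.1, size=(d_hidden, 2 * d_proj))  # hidden layer weights
B = rng.normal(scale=0.1, size=(V, d_hidden))           # output layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ffnn_trigram_step(w_t, w_tm1):
    # p(w | w_t, w_{t-1}) for all w, with biases omitted as in the paper.
    x = np.concatenate([P @ w_t, P @ w_tm1])  # Eq. (3): concatenated word projections
    s = sigmoid(A @ x)                        # Eq. (4): hidden layer
    y = softmax(B @ s)                        # Eq. (5): distribution over the vocabulary
    return y

w_t   = np.eye(V)[2]   # 1-of-N vector of the most recent predecessor word
w_tm1 = np.eye(V)[1]   # 1-of-N vector of the word before it
y = ffnn_trigram_step(w_t, w_tm1)
print(y.shape, round(y.sum(), 6))  # (7,) 1.0 -- a normalized distribution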
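Analogously, a minimal sketch of the recurrent update in Eqs. (6)-(7); the dimensions are again toy values, and the plain sigmoid recurrence stands in for the LSTM variant used later in the paper.

import numpy as np

rng = np.random.default_rng(1)
V, d_proj, d_hidden = 7, 8, 16

P = rng.normal(scale=0.1, size=(d_proj, V))           # projection matrix
A = rng.normal(scale=0.1, size=(d_hidden, d_proj))    # input-to-hidden weights
R = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # recurrent weights
B = rng.normal(scale=0.1, size=(V, d_hidden))         # output layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(w_t, s_prev):
    # One step of the RNN LM; the hidden state carries the full history.
    x = P @ w_t                      # Eq. (6): only the previous word is projected
    s = sigmoid(A @ x + R @ s_prev)  # Eq. (7): recurrent hidden layer
    y = softmax(B @ s)               # output layer as in Eq. (5)
    return y, s

s = np.zeros(d_hidden)               # initial hidden state
for idx in [0, 1, 2, 3]:             # a toy sequence of word indices
    y, s = rnn_step(np.eye(V)[idx], s)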
Figure 3: Trigram feedforward NNLM with an additional BOW input.

3. BOW Input for NNLMs

We use the BOW, an alternative vector representation of word sequences, as an additional input feature to the NNLM, as illustrated in Figure 3. NNLM architectures with additional input features have been suggested previously [21]. In this framework, the BOW vector of context size L, bow(t), is defined at each time step as:

    bow(t) = \sum_{i=0}^{L-1} w(t-i)     (8)
           = bow(t-1) + w(t) - w(t-L)    (9)

The equations of the FFNN with BOW input are obtained by replacing Eq. (3) with:

    x(t) = [P w(t), P w(t-1), P_B bow(t)]    (10)

where P_B is the projection matrix for the BOW input. The same applies to the RNN.

The BOW representation has one clear advantage over the 1-of-N representation, in particular for the FFNN LM: adding one more word to the context of a 1-of-N encoded FFNN LM requires increasing the projection layer size by the feature size of one word. Even though the computational cost grows only linearly in this way, it is not appealing for long context lengths such as a 50-gram.
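A minimal sketch of the incremental BOW update in Eq. (9) and of the extended FFNN input in Eq. (10); the toy dimensions, the separate projection matrix P_B and the deque-based context window are illustrative assumptions.

import numpy as np
from collections import deque

rng = np.random.default_rng(2)
V, d_proj, L = 7, 8, 50  # L: BOW context length, e.g. the last 50 words

P   = rng.normal(scale=0.1, size=(d_proj, V))  # projection of the 1-of-N inputs
P_B = rng.normal(scale=0.1, size=(d_proj, V))  # projection of the BOW input, Eq. (10)

bow = np.zeros(V)          # bow(t), maintained incrementally
window = deque(maxlen=L)   # the last L words as 1-of-N vectors

def update_bow(w_t):
    # Eq. (9): bow(t) = bow(t-1) + w(t) - w(t-L)
    global bow
    if len(window) == L:   # the oldest word drops out of the context
        bow -= window[0]
    window.append(w_t)
    bow += w_t

def bow_ffnn_input(w_t, w_tm1):
    # Eq. (10): the projected BOW vector is appended to the trigram FFNN input.
    return np.concatenate([P @ w_t, P @ w_tm1, P_B @ bow])

for idx in [1, 2, 3, 4, 1]:       # feed a toy word-index sequence
    update_bow(np.eye(V)[idx])
x = bow_ffnn_input(np.eye(V)[1], np.eye(V)[4])
print(x.shape)  # (24,) = 3 * d_proj

The input stays 3 * d_proj wide regardless of L, whereas a 50-gram 1-of-N input would require projecting 49 predecessor words (49 * d_proj), which is the size argument made above.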