EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL
Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan "Honza" Černocký, Sanjeev Khudanpur
Speech@FIT, Brno University of Technology, Johns Hopkins University
May 25, 2011
Overview
Introduction
Model description
Extensions
Empirical evaluation
Current work
Introduction
Neural network based LMs outperform standard backoff n-gram models
Words are projected into a low-dimensional space; similar words are automatically clustered together
Smoothing is solved implicitly
The standard backpropagation algorithm (BP) is used for training
In [Mikolov2010], we have shown that the recurrent neural network (RNN) architecture is competitive with the standard feedforward architecture
Introduction
In this presentation, we will show:
The importance of the "backpropagation through time" (BPTT) [Rumelhart et al. 1986] training algorithm for RNN language models
A simple speed-up technique that reduces computational complexity 10×-100×
Results after combining randomly initialized RNN models
A comparison of different advanced LM techniques on the same data set
Results on large data sets and LVCSR experiments
Model description - recurrent NN
[Figure: simple recurrent network — input layer w(t), hidden layer s(t) with a delayed recurrent connection, output layer y(t); weight matrices U (input to hidden) and V (hidden to output)]
The input layer w and output layer y have the same dimensionality as the vocabulary
The hidden layer s is orders of magnitude smaller
U is the matrix of weights between the input and hidden layers; V is the matrix of weights between the hidden and output layers
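The forward pass of this model can be sketched in a few lines. One caveat: in the original architecture the recurrence is realized by concatenating w(t) and s(t-1) into a single input layer, so U covers both; the sketch below instead uses a separate recurrent matrix R (our own name, not from the slide) for readability.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def rnn_step(w_idx, s_prev, U, R, V):
    """One forward step of a simple recurrent LM.

    w_idx:  index of the current word (w(t) is a 1-of-V vector,
            so U @ w(t) is just column w_idx of U)
    s_prev: previous hidden state s(t-1)
    U, V:   weight matrices as in the figure; R is the recurrent matrix
    """
    s = 1.0 / (1.0 + np.exp(-(U[:, w_idx] + R @ s_prev)))  # sigmoid hidden layer
    y = softmax(V @ s)   # distribution over the next word
    return s, y
```

The returned y(t) is a proper probability distribution over the vocabulary; s(t) is fed back at the next step.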
Backpropagation through time
Training RNNs with normal backpropagation is not optimal
Backpropagation through time (BPTT) is an efficient algorithm for training recurrent neural networks
BPTT unfolds the recurrent part of the network in time to obtain a usual feedforward representation of the network; this deep network is then trained by backpropagation
For on-line learning, "truncated BPTT" is used
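As a concrete sketch of truncated BPTT (an illustration, not the toolkit's implementation): the snippet below propagates the error from a single prediction back through `tau` unfolded steps of the sigmoid RNN above. In actual training, an error is injected at every time step and the contributions are accumulated.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def truncated_bptt_step(inputs, target, s_init, U, R, V, tau):
    """Gradients of -log P(target) w.r.t. U, R, V, unfolding `tau` steps back."""
    seq = inputs[-tau:]                    # only the last tau input words
    states = [s_init]                      # s(t-tau), ..., s(t)
    for w in seq:
        states.append(1.0 / (1.0 + np.exp(-(U[:, w] + R @ states[-1]))))
    y = softmax(V @ states[-1])            # prediction at the last step
    dy = y.copy(); dy[target] -= 1.0       # softmax + cross-entropy gradient
    dV = np.outer(dy, states[-1])
    ds = V.T @ dy
    dU, dR = np.zeros_like(U), np.zeros_like(R)
    for k in range(tau, 0, -1):            # walk back through the unfolding
        dz = ds * states[k] * (1.0 - states[k])   # through the sigmoid
        dU[:, seq[k - 1]] += dz
        dR += np.outer(dz, states[k - 1])
        ds = R.T @ dz                      # carry the error one step back in time
    return dU, dR, dV, y
```

Setting tau = 1 recovers standard backpropagation, which is exactly the BPTT=1 baseline in the perplexity comparison below.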
RNN unfolded in time
[Figure, developed over three slides: the recurrent connection is unfolded for one, two, and three time steps, replicating the layers back in time — w(t), s(t), y(t); then w(t-1), s(t-1), y(t-1); then w(t-2), s(t-2), y(t-2)]
Factorization of the output layer
[Figure: the recurrent network extended with a class layer c(t) — matrices U (input to hidden), V (hidden to word outputs), and W (hidden to class outputs), with the delayed recurrent connection on s(t)]
P(w_i | history) = P(c_i | s(t)) · P(w_i | c_i, s(t))    (1)
Words are assigned to "classes" based on their unigram frequency
First, the class layer is evaluated; then only the words belonging to the predicted class are evaluated, instead of the whole output layer y [Goodman2001]
This provides a speedup of more than 100× in some cases
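Equation (1) can be sketched as follows, with W the class matrix from the figure; the `word2class` and `class_members` lookup structures are illustrative, not from the slides. Only one logit per class plus the logits of a single class's member words are computed, rather than the full output layer.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def factored_word_prob(s, W, V, word, word2class, class_members):
    """P(w | history) = P(c_w | s(t)) * P(w | c_w, s(t)), as in eq. (1)."""
    c = word2class[word]
    p_class = softmax(W @ s)                 # small softmax: one logit per class
    members = class_members[c]
    p_in_class = softmax(V[members] @ s)     # only this class's rows of V
    return p_class[c] * p_in_class[members.index(word)]
```

Summing this quantity over the whole vocabulary still gives exactly 1, so the factorization yields a proper distribution.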
Empirical evaluation - Setup description
We have used the Penn Treebank Corpus, with the same vocabulary and data division as other researchers:
Sections 0-20: training data, 930K tokens
Sections 21-22: validation data, 74K tokens
Sections 23-24: test data, 82K tokens
Vocabulary size: 10K
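All results below are reported as perplexity on this test set. For reference, a minimal evaluation helper using the standard definition (not part of the toolkit):

```python
import math

def perplexity(log_probs):
    """Perplexity of a test set, given the per-token natural-log
    probabilities log P(w_i | history) assigned by the model."""
    return math.exp(-sum(log_probs) / len(log_probs))
```

A model that assigned every token probability 1/10000 over this 10K vocabulary would score perplexity 10000; the models below do far better.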
Importance of BPTT training
[Figure: perplexity on the Penn corpus vs. BPTT step (1-8), comparing the average over 4 models, a mixture of 4 models, and the KN5 baseline; perplexity decreases as more BPTT steps are used, from roughly 145 down toward 105 across the plotted curves]
Importance of BPTT training on Penn Corpus. BPTT=1 corresponds to standard backpropagation.
Combination of randomly initialized RNNs
[Figure: perplexity on the Penn corpus vs. number of interpolated RNN models (up to 25), for an RNN mixture alone and an RNN mixture interpolated with KN5; perplexity drops from about 130 to below 100]
By linearly interpolating outputs from randomly initialized RNNs, we obtain better results
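Linear interpolation here simply averages the models' output distributions (optionally with non-uniform weights, e.g. when also mixing in the KN5 baseline). A minimal sketch:

```python
import numpy as np

def interpolate(distributions, weights=None):
    """Linearly interpolate next-word distributions from several models.

    distributions: list of equal-length probability vectors
    weights:       mixture weights summing to 1; uniform if omitted
    """
    P = np.asarray(distributions, dtype=float)
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))   # plain averaging
    return np.asarray(weights) @ P                # convex combination
```

Because the result is a convex combination of probability vectors, it is itself a valid distribution.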
Comparison of different language modeling techniques
Model                            Perplexity
Kneser-Ney 5-gram                141
Random forest [Xu 2005]          132
Structured LM [Filimonov 2009]   125
Feedforward NN LM                116
Syntactic NN LM [Emami 2004]     110
RNN trained by BP                113
RNN trained by BPTT              106
4x RNN trained by BPTT            98
Comparison of different language modeling techniques on Penn Corpus. Models are interpolated with the baseline 5-gram backoff model.
Speedup with different amount of classes
Classes   RNN   RNN+KN5   Min/epoch   Sec/test
30        134   112       12.8          8.8
100       136   114        9.1          5.6
1000      131   111       16.1         15.7
4000      127   108       44.4         57.8
Full      123   106      154          212
Class counts around sqrt(vocabulary size) lead to the largest speed-ups
Improvements with increasing amount of data
[Figure: entropy per word on the WSJ test data (roughly 7-9 bits) vs. number of training tokens (10^5 to 10^8), for KN5 and KN5+RNN; the gap between the curves widens as the training set grows]
The improvement obtained from a single RNN model over the best backoff model increases with more data!
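Note that this figure reports entropy per word rather than perplexity; the two are interchangeable via PPL = 2^H, so a fixed entropy gap corresponds to a fixed perplexity ratio:

```python
import math

def entropy_per_word(ppl):
    """Entropy per word in bits for a given perplexity: H = log2(PPL)."""
    return math.log2(ppl)

def ppl_from_entropy(h_bits):
    """Inverse mapping: PPL = 2**H."""
    return 2.0 ** h_bits
```

For example, the KN5 baseline's perplexity of 141 on the Penn corpus corresponds to just over 7.1 bits per word.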
Current work
Dynamic evaluation for model adaptation
Combination and comparison of RNNs with many other advanced LM techniques
More than 50% improvement in perplexity on a large data set over a modified Kneser-Ney smoothed 5-gram
Current work - ASR
Almost 20% reduction in WER on the Wall Street Journal task with a simple ASR system, against a backoff 5-gram model (WER 17.2% → 14.4%)
Almost 10% reduction in WER on Broadcast News with the state-of-the-art IBM system, against a backoff 4-gram model (WER 13.1% → 12.0%)
Toolkit
Our experiments can be repeated using the toolkit available at http://www.fit.vutbr.cz/~imikolov/rnnlm/
Thank you for your attention!