
EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan "Honza" Černocký, Sanjeev Khudanpur

Speech@FIT, Brno University of Technology, Johns Hopkins University

25. 5. 2011


Overview

Introduction
Model description
Extensions
Empirical evaluation
Current work


Introduction

Neural network based LMs outperform standard backoff n-gram models
Words are projected into a low-dimensional space; similar words are automatically clustered together
Smoothing is solved implicitly
The standard backpropagation (BP) algorithm is used for training
In [Mikolov2010], we have shown that the recurrent neural network (RNN) architecture is competitive with the standard feedforward architecture


Introduction

In this presentation, we will show:
Importance of the "backpropagation through time" (BPTT) [Rumelhart et al. 1986] training algorithm for RNN language models
Simple speed-up technique that reduces computational complexity 10×-100×
Results after combining randomly initialized RNN models
Comparison of different advanced LM techniques on the same data set
Results on large data sets and LVCSR experiments


Model description - recurrent NN

[Figure: the recurrent network: input layer w(t), hidden layer s(t), output layer y(t), weight matrices U and V, and a delayed recurrent connection feeding the previous state back into the hidden layer]

Input layer w and output layer y have the same dimensionality as the vocabulary
Hidden layer s is orders of magnitude smaller
U is the matrix of weights between input and hidden layer, V is the matrix of weights between hidden and output layer (the update equations are sketched below)
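For reference, a minimal sketch of the update equations usually assumed for this architecture (following [Mikolov2010]; W here stands for the weights on the delayed recurrent connection, f for the sigmoid, g for the softmax):

s(t) = f(U · w(t) + W · s(t-1))
y(t) = g(V · s(t))

f(z) = 1 / (1 + e^{-z}),    g(z)_m = e^{z_m} / Σ_k e^{z_k}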


Backpropagation through time

Training of RNNs by normal backpropagation is not optimal
Backpropagation through time (BPTT) is an efficient algorithm for training recurrent neural networks
BPTT works by unfolding the recurrent part of the network in time to obtain the usual feedforward representation of the network; such a deep network is then trained by backpropagation
For on-line learning, "truncated BPTT" is used (a numpy sketch follows below)
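To make the unfolding concrete, here is a minimal numpy sketch of one truncated-BPTT pass for a tiny Elman-style RNN LM; the variable names (U, W, V, BPTT_STEPS), the plain SGD update and the per-word unfolding are illustrative assumptions, not the RNNLM toolkit's actual implementation.

# Minimal truncated-BPTT sketch for a tiny Elman-style RNN LM (illustrative only).
import numpy as np

V_SIZE, H_SIZE = 10000, 100     # vocabulary size, hidden layer size
BPTT_STEPS = 5                  # how many steps back in time errors are propagated
LR = 0.1                        # learning rate

rng = np.random.default_rng(0)
U = rng.normal(0.0, 0.1, (H_SIZE, V_SIZE))   # input word  -> hidden
W = rng.normal(0.0, 0.1, (H_SIZE, H_SIZE))   # hidden(t-1) -> hidden(t)
V = rng.normal(0.0, 0.1, (V_SIZE, H_SIZE))   # hidden      -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_sequence(word_ids, target_ids):
    """One online pass over a word sequence; target_ids[i] is the next word."""
    global U, W, V
    s_prev = np.zeros(H_SIZE)
    states, inputs = [], []
    for w, target in zip(word_ids, target_ids):
        # forward pass: s(t) = f(U w(t) + W s(t-1)),  y(t) = g(V s(t))
        s = sigmoid(U[:, w] + W @ s_prev)
        y = softmax(V @ s)
        states.append(s)
        inputs.append(w)

        # cross-entropy error at the output layer: y - 1_target
        e_out = y.copy()
        e_out[target] -= 1.0
        # error at the hidden layer (computed before V is changed)
        e_h = (V.T @ e_out) * s * (1.0 - s)
        V -= LR * np.outer(e_out, s)

        # unfold the recurrence up to BPTT_STEPS back in time
        dW = np.zeros_like(W)
        for step in range(min(BPTT_STEPS, len(states))):
            t = len(states) - 1 - step
            s_before = states[t - 1] if t > 0 else np.zeros(H_SIZE)
            U[:, inputs[t]] -= LR * e_h          # input at time t is one-hot
            dW += np.outer(e_h, s_before)
            # push the error one more step back through the recurrent weights
            e_h = (W.T @ e_h) * s_before * (1.0 - s_before)
        W -= LR * dW
        s_prev = s

# e.g. for a sentence w0 w1 w2 w3:  train_sequence([w0, w1, w2], [w1, w2, w3])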


RNN unfolded in time

[Figure, built up over three slides: the recurrent network unfolded in time; the hidden state s(t) is fed by the input w(t) and the previous state s(t-1), which is in turn fed by w(t-1) and s(t-2), and so on, with each unfolded step having its own output y(t), y(t-1), y(t-2)]

Factorization of the output layer

[Figure: the recurrent network with a factorized output: input w(t), hidden layer s(t) with weight matrix W on the delayed recurrent connection, matrices U and V as before, and a class layer c(t) alongside the word output y(t)]

P(w_i | history) = P(c_i | s(t)) · P(w_i | c_i, s(t))    (1)

Words are assigned to "classes" based on their unigram frequency
First, the class layer is evaluated; then, only words belonging to the predicted class are evaluated, instead of the whole output layer y [Goodman2001] (a rough sketch follows below)
Provides a speedup of more than 100× in some cases
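A rough numpy sketch of the idea; the frequency-binning rule and the name X for the hidden-to-class weights are assumptions for illustration, not the toolkit's exact implementation:

# Class-based factorization of the output layer (illustrative sketch).
import numpy as np

def assign_classes(unigram_counts, n_classes):
    """Frequency binning: sort words by unigram count and slice the cumulative
    probability mass into n_classes roughly equal parts."""
    order = np.argsort(-unigram_counts)                       # most frequent first
    probs = unigram_counts[order] / unigram_counts.sum()
    bins = np.minimum((np.cumsum(probs) * n_classes).astype(int), n_classes - 1)
    word2class = np.empty_like(order)
    word2class[order] = bins
    return word2class

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_word(s, w, word2class, X, V):
    """P(w | history) = P(c_w | s(t)) * P(w | c_w, s(t));
    X: hidden -> class weights, V: hidden -> word weights."""
    c = word2class[w]
    p_class = softmax(X @ s)                                  # small class layer
    members = np.flatnonzero(word2class == c)                 # only words of class c
    p_within = softmax(V[members] @ s)
    return p_class[c] * p_within[np.flatnonzero(members == w)[0]]

Only the class units plus the members of one class are evaluated per word instead of the full 10K output layer, which is where the reported speed-up comes from.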

Empirical evaluation - Setup description

We have used the Penn Corpus, with the same vocabulary and data division as other researchers:
Sections 0-20: training data, 930K tokens
Sections 21-22: validation data, 74K tokens
Sections 23-24: test data, 82K tokens
Vocabulary size: 10K


Importance of BPTT training

[Figure: perplexity on the Penn corpus (y-axis, ~105-145) vs. number of BPTT steps (x-axis, 1-8), for the average over 4 models, the mixture of 4 models, and the KN5 baseline]

Importance of BPTT training on Penn Corpus. BPTT=1 corresponds to standard backpropagation.


Combination of randomly initialized RNNs

[Figure: perplexity on the Penn corpus (y-axis, ~95-130) vs. number of interpolated RNN models (x-axis, 0-25), for the RNN mixture alone and for the RNN mixture + KN5]

By linearly interpolating outputs from randomly initialized RNNs, we obtain better results
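What the interpolation amounts to, as a minimal sketch (the model interface, the equal weighting over RNNs and the 0.75 weight against KN5 are assumptions for illustration; in practice the weight is tuned on validation data):

# Linear interpolation of several RNN LMs, optionally mixed with a backoff model.
import numpy as np

def mixture_prob(rnn_models, history, word, kn5=None, lambda_rnn=0.75):
    """Average next-word probabilities over the randomly initialized RNNs,
    then interpolate the mixture with the n-gram model if one is given."""
    p_rnn = float(np.mean([m.prob(word, history) for m in rnn_models]))
    if kn5 is None:
        return p_rnn
    return lambda_rnn * p_rnn + (1.0 - lambda_rnn) * kn5.prob(word, history)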


Comparison of different language modeling techniques

Model                             Perplexity
Kneser-Ney 5-gram                 141
Random forest [Xu 2005]           132
Structured LM [Filimonov 2009]    125
Feedforward NN LM                 116
Syntactic NN LM [Emami 2004]      110
RNN trained by BP                 113
RNN trained by BPTT               106
4x RNN trained by BPTT            98

Comparison of different language modeling techniques on Penn Corpus. Models are interpolated with the baseline 5-gram backoff model.


Speedup with different amount of classes

Classes    Perplexity (RNN)    Perplexity (RNN+KN5)    Min/epoch    Sec/test
30         134                 112                     12.8         8.8
100        136                 114                     9.1          5.6
1000       131                 111                     16.1         15.7
4000       127                 108                     44.4         57.8
Full       123                 106                     154          212

Values around sqrt(vocabulary size) lead to the largest speed-ups
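A rough justification (our illustration, assuming each class ends up holding about V/C words): with hidden layer size H, vocabulary size V and C classes, evaluating the factorized output costs roughly

H × C  (class layer)  +  H × (V / C)  (words inside the predicted class),

and C + V/C is smallest at C = sqrt(V). For the 10K vocabulary this gives C ≈ 100, which matches the fastest row of the table above.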


Improvements with increasing amount of data

[Figure: entropy per word on the WSJ test data (y-axis, ~7-9) vs. number of training tokens (x-axis, 10^5 to 10^8), for KN5 and KN5+RNN]

The improvement obtained from a single RNN model over the best backoff model increases with more data!

Current work

Dynamic evaluation for model adaptation (see the sketch after this list)
Combination and comparison of RNNs with many other advanced LM techniques
More than 50% improvement in perplexity on a large data set against a modified Kneser-Ney smoothed 5-gram
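As a rough illustration of dynamic evaluation (the model keeps being updated with a small, fixed learning rate while it processes the test data, so it adapts to the new domain); the object interface and the learning rate are assumptions, not the toolkit's code:

# Dynamic evaluation sketch: score each test word, then take one training step on it.
import math

def dynamic_eval_log10prob(rnn, test_words, lr=0.1):
    total_log10 = 0.0
    history = []
    for w in test_words:
        total_log10 += math.log10(rnn.prob(w, history))   # score first ...
        rnn.train_step(w, history, lr=lr)                 # ... then adapt on the same word
        history.append(w)
    return total_log10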


Current work - ASR

Almost 20% reduction of WER (Wall Street Journal) with a simple ASR system, against a backoff 5-gram model (WER 17.2% → 14.4%)

Almost 10% reduction of WER (Broadcast News) with a state-of-the-art IBM system, against a backoff 4-gram model (WER 13.1% → 12.0%)


Toolkit

Our experiments can be repeated using the toolkit available at http://www.fit.vutbr.cz/~imikolov/rnnlm/


Thanks for your attention!
