
EXTENSIONS OF RECURRENT NEURAL NETWORK LANGUAGE MODEL

Tomáš Mikolov, Stefan Kombrink, Lukáš Burget, Jan "Honza" Černocký, Sanjeev Khudanpur

Speech@FIT, Brno University of Technology, Johns Hopkins University

25. 5. 2011


Overview

Introduction
Model description
Extensions
Empirical evaluation
Current work


Introduction

Neural network based LMs outperform standard backoff n-gram models
Words are projected into a low-dimensional space; similar words are automatically clustered together
Smoothing is solved implicitly
The standard backpropagation (BP) algorithm is used for training
In [Mikolov2010], we have shown that the recurrent neural network (RNN) architecture is competitive with the standard feedforward architecture


Introduction

In this presentation, we will show:
Importance of the "backpropagation through time" (BPTT) [Rumelhart et al. 1986] training algorithm for RNN language models
Simple speed-up technique that reduces computational complexity 10×-100×
Results after combining randomly initialized RNN models
Comparison of different advanced LM techniques on the same data set
Results on large data sets and LVCSR experiments


Model description - recurrent NN

[Figure: the recurrent network: input layer w(t), hidden layer s(t), output layer y(t), weight matrices U and V, and a delayed recurrent connection feeding the previous state back into the hidden layer]

Input layer w and output layer y have the same dimensionality as the vocabulary
Hidden layer s is orders of magnitude smaller
U is the matrix of weights between input and hidden layer, V is the matrix of weights between hidden and output layer (the update equations are sketched below)
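For reference, a minimal sketch of the update equations usually assumed for this architecture (following [Mikolov2010]; W here stands for the weights on the delayed recurrent connection, f for the sigmoid, g for the softmax):

s(t) = f(U · w(t) + W · s(t-1))
y(t) = g(V · s(t))

f(z) = 1 / (1 + e^{-z}),    g(z)_m = e^{z_m} / Σ_k e^{z_k}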


Backpropagation through time

Training of RNNs by normal backpropagation is not optimal
Backpropagation through time (BPTT) is an efficient algorithm for training recurrent neural networks
BPTT works by unfolding the recurrent part of the network in time to obtain the usual feedforward representation of the network; such a deep network is then trained by backpropagation
For on-line learning, "truncated BPTT" is used (a numpy sketch follows below)
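To make the unfolding concrete, here is a minimal numpy sketch of one truncated-BPTT pass for a tiny Elman-style RNN LM; the variable names (U, W, V, BPTT_STEPS), the plain SGD update and the per-word unfolding are illustrative assumptions, not the RNNLM toolkit's actual implementation.

# Minimal truncated-BPTT sketch for a tiny Elman-style RNN LM (illustrative only).
import numpy as np

V_SIZE, H_SIZE = 10000, 100     # vocabulary size, hidden layer size
BPTT_STEPS = 5                  # how many steps back in time errors are propagated
LR = 0.1                        # learning rate

rng = np.random.default_rng(0)
U = rng.normal(0.0, 0.1, (H_SIZE, V_SIZE))   # input word  -> hidden
W = rng.normal(0.0, 0.1, (H_SIZE, H_SIZE))   # hidden(t-1) -> hidden(t)
V = rng.normal(0.0, 0.1, (V_SIZE, H_SIZE))   # hidden      -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_sequence(word_ids, target_ids):
    """One online pass over a word sequence; target_ids[i] is the next word."""
    global U, W, V
    s_prev = np.zeros(H_SIZE)
    states, inputs = [], []
    for w, target in zip(word_ids, target_ids):
        # forward pass: s(t) = f(U w(t) + W s(t-1)),  y(t) = g(V s(t))
        s = sigmoid(U[:, w] + W @ s_prev)
        y = softmax(V @ s)
        states.append(s)
        inputs.append(w)

        # cross-entropy error at the output layer: y - 1_target
        e_out = y.copy()
        e_out[target] -= 1.0
        # error at the hidden layer (computed before V is changed)
        e_h = (V.T @ e_out) * s * (1.0 - s)
        V -= LR * np.outer(e_out, s)

        # unfold the recurrence up to BPTT_STEPS back in time
        dW = np.zeros_like(W)
        for step in range(min(BPTT_STEPS, len(states))):
            t = len(states) - 1 - step
            s_before = states[t - 1] if t > 0 else np.zeros(H_SIZE)
            U[:, inputs[t]] -= LR * e_h          # input at time t is one-hot
            dW += np.outer(e_h, s_before)
            # push the error one more step back through the recurrent weights
            e_h = (W.T @ e_h) * s_before * (1.0 - s_before)
        W -= LR * dW
        s_prev = s

# e.g. for a sentence w0 w1 w2 w3:  train_sequence([w0, w1, w2], [w1, w2, w3])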


RNN unfolded in time

[Figure, built up over three slides: the recurrent network unfolded in time; the hidden state s(t) is fed by the input w(t) and the previous state s(t-1), which is in turn fed by w(t-1) and s(t-2), and so on, with each unfolded step having its own output y(t), y(t-1), y(t-2)]

Factorization of the output layer

[Figure: the recurrent network with a factorized output: input w(t), hidden layer s(t) with weight matrix W on the delayed recurrent connection, matrices U and V as before, and a class layer c(t) alongside the word output y(t)]

P(w_i | history) = P(c_i | s(t)) · P(w_i | c_i, s(t))    (1)

Words are assigned to "classes" based on their unigram frequency
First, the class layer is evaluated; then, only words belonging to the predicted class are evaluated, instead of the whole output layer y [Goodman2001] (a rough sketch follows below)
Provides a speedup of more than 100× in some cases
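A rough numpy sketch of the idea; the frequency-binning rule and the name X for the hidden-to-class weights are assumptions for illustration, not the toolkit's exact implementation:

# Class-based factorization of the output layer (illustrative sketch).
import numpy as np

def assign_classes(unigram_counts, n_classes):
    """Frequency binning: sort words by unigram count and slice the cumulative
    probability mass into n_classes roughly equal parts."""
    order = np.argsort(-unigram_counts)                       # most frequent first
    probs = unigram_counts[order] / unigram_counts.sum()
    bins = np.minimum((np.cumsum(probs) * n_classes).astype(int), n_classes - 1)
    word2class = np.empty_like(order)
    word2class[order] = bins
    return word2class

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_word(s, w, word2class, X, V):
    """P(w | history) = P(c_w | s(t)) * P(w | c_w, s(t));
    X: hidden -> class weights, V: hidden -> word weights."""
    c = word2class[w]
    p_class = softmax(X @ s)                                  # small class layer
    members = np.flatnonzero(word2class == c)                 # only words of class c
    p_within = softmax(V[members] @ s)
    return p_class[c] * p_within[np.flatnonzero(members == w)[0]]

Only the class units plus the members of one class are evaluated per word instead of the full 10K output layer, which is where the reported speed-up comes from.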

Empirical evaluation - Setup description

We have used the Penn Corpus, with the same vocabulary and data division as other researchers:
Sections 0-20: training data, 930K tokens
Sections 21-22: validation data, 74K tokens
Sections 23-24: test data, 82K tokens
Vocabulary size: 10K


Importance of BPTT training

[Figure: perplexity on the Penn corpus (y-axis, ~105-145) vs. number of BPTT steps (x-axis, 1-8), for the average over 4 models, the mixture of 4 models, and the KN5 baseline]

Importance of BPTT training on Penn Corpus. BPTT=1 corresponds to standard backpropagation.


Combination of randomly initialized RNNs

[Figure: perplexity on the Penn corpus (y-axis, ~95-130) vs. number of interpolated RNN models (x-axis, 0-25), for the RNN mixture alone and for the RNN mixture + KN5]

By linearly interpolating outputs from randomly initialized RNNs, we obtain better results
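What the interpolation amounts to, as a minimal sketch (the model interface, the equal weighting over RNNs and the 0.75 weight against KN5 are assumptions for illustration; in practice the weight is tuned on validation data):

# Linear interpolation of several RNN LMs, optionally mixed with a backoff model.
import numpy as np

def mixture_prob(rnn_models, history, word, kn5=None, lambda_rnn=0.75):
    """Average next-word probabilities over the randomly initialized RNNs,
    then interpolate the mixture with the n-gram model if one is given."""
    p_rnn = float(np.mean([m.prob(word, history) for m in rnn_models]))
    if kn5 is None:
        return p_rnn
    return lambda_rnn * p_rnn + (1.0 - lambda_rnn) * kn5.prob(word, history)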


Comparison of different language modeling techniques

Model                             Perplexity
Kneser-Ney 5-gram                 141
Random forest [Xu 2005]           132
Structured LM [Filimonov 2009]    125
Feedforward NN LM                 116
Syntactic NN LM [Emami 2004]      110
RNN trained by BP                 113
RNN trained by BPTT               106
4x RNN trained by BPTT            98

Comparison of different language modeling techniques on Penn Corpus. Models are interpolated with the baseline 5-gram backoff model.


Speedup with different amount of classes

Classes    Perplexity (RNN)    Perplexity (RNN+KN5)    Min/epoch    Sec/test
30         134                 112                     12.8         8.8
100        136                 114                     9.1          5.6
1000       131                 111                     16.1         15.7
4000       127                 108                     44.4         57.8
Full       123                 106                     154          212

Values around sqrt(vocabulary size) lead to the largest speed-ups
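A rough justification (our illustration, assuming each class ends up holding about V/C words): with hidden layer size H, vocabulary size V and C classes, evaluating the factorized output costs roughly

H × C  (class layer)  +  H × (V / C)  (words inside the predicted class),

and C + V/C is smallest at C = sqrt(V). For the 10K vocabulary this gives C ≈ 100, which matches the fastest row of the table above.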


Improvements with increasing amount of data

[Figure: entropy per word on the WSJ test data (y-axis, ~7-9) vs. number of training tokens (x-axis, 10^5 to 10^8), for KN5 and KN5+RNN]

The improvement obtained from a single RNN model over the best backoff model increases with more data!

Current work

Dynamic evaluation for model adaptation (see the sketch after this list)
Combination and comparison of RNNs with many other advanced LM techniques
More than 50% improvement in perplexity on a large data set against a modified Kneser-Ney smoothed 5-gram
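As a rough illustration of dynamic evaluation (the model keeps being updated with a small, fixed learning rate while it processes the test data, so it adapts to the new domain); the object interface and the learning rate are assumptions, not the toolkit's code:

# Dynamic evaluation sketch: score each test word, then take one training step on it.
import math

def dynamic_eval_log10prob(rnn, test_words, lr=0.1):
    total_log10 = 0.0
    history = []
    for w in test_words:
        total_log10 += math.log10(rnn.prob(w, history))   # score first ...
        rnn.train_step(w, history, lr=lr)                 # ... then adapt on the same word
        history.append(w)
    return total_log10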


Current work - ASR

Almost 20% reduction of WER (Wall Street Journal) with a simple ASR system, against a backoff 5-gram model (WER 17.2% → 14.4%)

Almost 10% reduction of WER (Broadcast News) with a state-of-the-art IBM system, against a backoff 4-gram model (WER 13.1% → 12.0%)


Toolkit

Our experiments can be repeated using the toolkit available at http://www.fit.vutbr.cz/~imikolov/rnnlm/


Thanks for your attention!
