INTERSPEECH 2012, ISCA's 13th Annual Conference, Portland, OR, USA, September 9-13, 2012. doi: 10.21437/Interspeech.2012-458

Large Scale Hierarchical Neural Network Language Models

Hong-Kwang Jeff Kuo (1), Ebru Arısoy (1), Ahmad Emami (1), Paul Vozila (2)

(1) IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
(2) Nuance Communications, One Wayside Road, Burlington, MA 01803, U.S.A.
{hkuo, earisoy, emami}@us.ibm.com, [email protected]

Abstract

Feed-forward neural network language models (NNLMs) are known to improve both perplexity and word error rate performance for speech recognition compared with conventional n-gram language models. We present experimental results showing how much the WER can be improved by increasing the scale of the NNLM, in terms of model size and training data. However, training time can become very long. We implemented a hierarchical NNLM approximation to speed up the training, through splitting up events and parallelizing training as well as reducing the output vocabulary size of each sub-network. The training time was reduced by about 20 times, e.g. from 50 days to 2 days, with no degradation in WER. Using English Broadcast News data (350M words), we obtained significant improvements over the baseline n-gram language model, competitive with recently published recurrent neural network language model (RNNLM) results.

Index Terms: neural network language models

1. Introduction

A statistical language model assigns a probability distribution over all possible word strings in a language. In state-of-the-art ASR systems, n-grams are the conventional language modeling approach due to their simplicity and good modeling performance. One of the problems with n-gram language modeling is data sparseness. Even though smoothing techniques [1] produce better estimates, conventional n-gram language models cannot generalize well to unseen data.

Neural network language models (NNLMs) were proposed ...

... long with a large neural network architecture. Hierarchical decomposition of conditional probabilities has been proposed to speed up NNLM training. In [7], the output vocabulary words were partitioned into classes, and the posterior probability estimation was factored into class prediction and word prediction probabilities. Clustering the output vocabulary words into |C| classes made it possible to train |C|+1 separate NNLMs in parallel, where each sub-network has its own smaller output vocabulary. Parallel training of the sub-networks and the reduced output vocabulary size of each sub-network provided a significant speedup in training. This idea was extended to multiple levels using a binary tree [8, 9]. To address the shortcomings of the recursive binary structure, structuring the output layer with a clustering tree was proposed in [10]. However, the main motivation of that work was to make the training of NNLMs with full vocabularies computationally feasible. Using classes at the output layer was also investigated for RNNLMs to speed up training [5].

NNLMs are usually trained on a moderate-size (10-50M words) text corpus and interpolated with a baseline n-gram model before rescoring lattices or N-best lists. However, [11] showed that significant gains can be obtained on top of a very good state-of-the-art system with large scale RNNLMs trained on a large amount of text data (300M words). This work encouraged us to scale up feed-forward NNLM training in terms of model size and training data. Hierarchical decomposition of the output layer allows us to train large scale NNLMs in a feasible amount of time.

Our contributions in this paper include the following. We implemented the hierarchical NNLM ...
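As a concrete illustration of the class-based factorization of [7] described above, the decomposition can be written as a product of a class prediction probability and a within-class word prediction probability. The notation below (h_t for the n-gram history and c(w_t) for the class of word w_t) is ours and is only a sketch of the idea, not the paper's own formulation:

\[
P(w_t \mid h_t) \;=\; P\big(c(w_t) \mid h_t\big) \cdot P\big(w_t \mid c(w_t), h_t\big)
\]

Under this factorization, one sub-network predicts the class given the history and, for each of the |C| classes, a separate sub-network predicts the word within that class. The |C|+1 sub-networks can therefore be trained in parallel, and each has an output layer much smaller than the full vocabulary, which is the source of the training speedup described above.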