INTERSPEECH 2012, ISCA's 13th Annual Conference, Portland, OR, USA, September 9-13, 2012. doi: 10.21437/Interspeech.2012-458

Large Scale Hierarchical Neural Network Language Models

Hong-Kwang Jeff Kuo (1), Ebru Arısoy (1), Ahmad Emami (1), Paul Vozila (2)

(1) IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
(2) Nuance Communications, One Wayside Road, Burlington, MA 01803, U.S.A.
{hkuo, earisoy, emami}@us.ibm.com, [email protected]

Abstract

Feed-forward neural network language models (NNLMs) are known to improve both perplexity and word error rate performance for speech recognition compared with conventional n-gram language models. We present experimental results showing how much the WER can be improved by increasing the scale of the NNLM, in terms of model size and training data. However, training time can become very long. We implemented a hierarchical NNLM approximation to speed up the training, through splitting up events and parallelizing training as well as reducing the output vocabulary size of each sub-network. The training time was reduced by about 20 times, e.g. from 50 days to 2 days, with no degradation in WER. Using English Broadcast News data (350M words), we obtained significant improvements over the baseline n-gram language model, competitive with recently published recurrent neural network language model (RNNLM) results.

Index Terms: neural network language models

1. Introduction

A statistical language model assigns a probability distribution over all possible word strings in a language. In state-of-the-art ASR systems, n-grams are the conventional language modeling approach due to their simplicity and good modeling performance. One of the problems with n-gram language modeling is data sparseness. Even though smoothing techniques [1] produce better estimates, conventional n-gram language models cannot generalize well to unseen data.

Neural network language models (NNLMs) were proposed ...

... long with a large neural network architecture. Hierarchical decomposition of conditional probabilities has been proposed to speed up NNLM training. In [7], the output vocabulary words were partitioned into classes, and the posterior probability estimation was factored into class prediction and word prediction probabilities. Clustering the output vocabulary words into |C| classes made it possible to train |C|+1 separate NNLMs in parallel, where each sub-network has its own smaller output vocabulary. Parallel training of the sub-networks and the reduced output vocabulary size of each sub-network provided a significant speedup in training. This idea was extended to multiple levels using a binary tree [8, 9]. To address the shortcomings of the recursive binary structure, structuring the output layer with a clustering tree was proposed in [10]. However, the main motivation of that work was to make the training of NNLMs with full vocabularies computationally feasible. Using classes at the output layer was also investigated for RNNLMs to speed up training [5].

NNLMs are usually trained on a moderate-size (10-50M words) text corpus and interpolated with a baseline n-gram model before rescoring lattices or N-best lists. However, [11] showed that significant gains can be obtained on top of a very good state-of-the-art system with large scale RNNLMs trained on a large amount of text data (300M words). This work encouraged us to scale up feed-forward NNLM training in terms of model size and training data. Hierarchical decomposition of the output layer allows us to train large scale NNLMs in a feasible amount of time.

Our contributions in this paper include the following. We implemented the hierarchical NNLM ...
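As a concrete illustration of the class-based factorization of [7] described above, the decomposition can be written as a product of a class prediction probability and a within-class word prediction probability. The notation below (h_t for the n-gram history and c(w_t) for the class of word w_t) is ours and is only a sketch of the idea, not the paper's own formulation:

\[
P(w_t \mid h_t) \;=\; P\big(c(w_t) \mid h_t\big) \cdot P\big(w_t \mid c(w_t), h_t\big)
\]

Under this factorization, one sub-network predicts the class given the history and, for each of the |C| classes, a separate sub-network predicts the word within that class. The |C|+1 sub-networks can therefore be trained in parallel, and each has an output layer much smaller than the full vocabulary, which is the source of the training speedup described above.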