Machine Learning, 60, 195–227, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
A Neural Syntactic Language Model∗
AHMAD EMAMI [email protected]
FREDERICK JELINEK [email protected]
Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD

∗This work was supported by the National Science Foundation under grant No. IIS-0085940.
Editors: Dan Roth and Pascale Fung
Abstract. This paper presents a study of using neural probabilistic models in a syntactic-based language model. The neural probabilistic model makes use of a distributed representation of the items in the conditioning history, and is powerful at capturing long dependencies. Employing neural network based models in the syntactic-based language model enables it to make efficient use of the large amount of information available in a syntactic parse when estimating the next word in a string. Several scenarios for integrating neural networks into the syntactic-based language model are presented, accompanied by derivations of the training procedures involved. Experiments on the UPenn Treebank and the Wall Street Journal corpus show significant improvements in perplexity and word error rate over the baseline structured language model (SLM). Furthermore, comparisons with standard and neural net based N-gram models with arbitrarily long contexts show that the syntactic information is in fact very helpful in estimating the word string probability. Overall, our neural syntactic-based model achieves the best published results in perplexity and WER for the given data sets.
Keywords: statistical language models, neural networks, speech recognition, parsing
1. Introduction
Statistical language models are widely used in fields dealing with speech or natural language: speech recognition, machine translation, and information retrieval, to name a few. For example, in the accepted statistical formulation of the speech recognition problem (Jelinek, 1998), the recognizer seeks to find the word string:
$$\hat{W} = \arg\max_{W} P(A \mid W)\, P(W) \quad (1)$$

where $A$ denotes the observed speech signal, $P(A \mid W)$ is the probability of producing $A$ when $W$ is spoken, and $P(W)$ is the prior probability that $W$ was spoken.
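As a concrete illustration of the decoding rule in Eq. (1), the minimal Python sketch below rescores an N-best list of recognition hypotheses and keeps the one with the highest combined score. The names `hypotheses` and `lm_logprob` are illustrative placeholders, not part of the paper; in this work, `lm_logprob` would be supplied by the (neural) syntactic language model.

```python
import math

def rescore_nbest(hypotheses, lm_logprob):
    """Return the word string W maximizing log P(A|W) + log P(W), per Eq. (1).

    hypotheses: list of (words, acoustic_logprob) pairs, where
        acoustic_logprob stands in for log P(A|W) from the recognizer.
    lm_logprob: function returning log P(W) for a word string.
    """
    best_w, best_score = None, -math.inf
    for words, acoustic_logprob in hypotheses:
        # Work in the log domain to avoid numerical underflow.
        score = acoustic_logprob + lm_logprob(words)
        if score > best_score:
            best_w, best_score = words, score
    return best_w
```

In practice the language model term is often scaled by a tuned weight before being added to the acoustic score; Eq. (1) corresponds to a weight of one.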
The role of a statistical language model is to assign a probability $P(W)$ to any given word string $W = w_1 w_2 \ldots w_n$. This is usually done in a left-to-right manner by factoring the probability:

$$P(W) = P(w_1 w_2 \ldots w_n) = P(w_1) \prod_{i=2}^{n} P\big(w_i \mid W_1^{i-1}\big)$$
where the sequence of words $w_1 w_2 \ldots w_j$ is denoted by $W_1^j$. Ideally the language model would use the entire history $W_1^{i-1}$ to make its prediction for word $w_i$. However, data sparseness is a crippling problem with language models; hence all practical models employ some sort of equivalence classification of histories $W_1^{i-1}$:
$$P(W) \approx P(w_1) \prod_{i=2}^{n} P\big(w_i \mid \Phi\big(W_1^{i-1}\big)\big) \quad (2)$$

where $\Phi(\cdot)$ maps each history to its equivalence class.
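A minimal sketch of Eq. (2), assuming the familiar N-gram style equivalence classification in which $\Phi$ keeps only the last $n-1$ words of the history. The function `cond_prob` is a hypothetical stand-in for whatever estimator supplies $P(w \mid \text{class})$ — smoothed counts for a standard N-gram, or the neural model discussed later in this paper.

```python
import math

def phi(history, n=3):
    """Equivalence classification Phi of Eq. (2): keep only the last
    n-1 words of the history W_1^{i-1} (a trigram classing for n=3)."""
    return tuple(history[-(n - 1):])

def sentence_logprob(words, cond_prob, n=3):
    """log P(W) per Eq. (2): log P(w_1) + sum_i log P(w_i | Phi(W_1^{i-1}))."""
    logp = math.log(cond_prob((), words[0]))  # first term: P(w_1)
    for i in range(1, len(words)):
        logp += math.log(cond_prob(phi(words[:i], n), words[i]))
    return logp
```

Larger $n$ makes $\Phi$ finer-grained at the cost of sparser statistics; this tension between rich conditioning information and data sparseness is exactly what the syntactic and neural models studied in this paper aim to resolve.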