TWO DECADES OF STATISTICAL LANGUAGE MODELING: WHERE DO WE GO FROM HERE?

Ronald Rosenfeld

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
[email protected]

ABSTRACT

Statistical language models estimate the distribution of various natural language phenomena for the purpose of speech recognition and other language technologies. Since the first significant model was proposed in 1980, many attempts have been made to improve the state of the art. We review them here, point to a few promising directions, and argue for a Bayesian approach to the integration of linguistic theories with data.

1. OUTLINE

Statistical language modeling (SLM) is the attempt to capture regularities of natural language for the purpose of improving the performance of various natural language applications. By and large, statistical language modeling amounts to estimating the probability distribution of various linguistic units, such as words, sentences, and whole documents.

Statistical language modeling is crucial for a large variety of language technology applications. These include speech recognition (where SLM got its start), machine translation, document classification and routing, optical character recognition, information retrieval, handwriting recognition, spelling correction, and many more.

In machine translation, for example, purely statistical approaches have been introduced in [1]. But even researchers using rule-based approaches have found it beneficial to introduce some elements of SLM and statistical estimation [2]. In information retrieval, a language modeling approach was recently proposed by [3], and a statistical/information theoretic approach was developed by [4].

SLM employs statistical estimation techniques using language training data, that is, text. Because of the categorical nature of language, and the large vocabularies people naturally use, statistical techniques must estimate a large number of parameters, and consequently depend critically on the availability of large amounts of training data.

Over the past twenty years, successively larger amounts of text of various types have become available online. As a result, in domains where such data became available, the quality of language models has increased dramatically. However, this improvement is now beginning to asymptote. Even if online text continues to accumulate at an exponential rate (which it no doubt will, given the growth rate of the web), the quality of currently used statistical language models is not likely to improve by a significant factor. One informal estimate from IBM shows that bigram models effectively saturate within several hundred million words, and trigram models are likely to saturate within a few billion words. In several domains we already have this much data.

Ironically, the most successful SLM techniques use very little knowledge of what language really is. The most popular language models (n-grams) take no advantage of the fact that what is being modeled is language: it may as well be a sequence of arbitrary symbols, with no deep structure, intention or thought behind them.

A possible reason for this situation is that the knowledge-impoverished but data-optimal techniques of n-grams succeeded too well, and thus stymied work on knowledge-based approaches.

But one can only go so far without knowledge. In the words of the premier proponent of the statistical approach to language modeling, Fred Jelinek, we must "put language back into language modeling" [5]. Unfortunately, only a handful of attempts have been made to date to incorporate linguistic structure, theories or knowledge into statistical language models, and most such attempts have been only modestly successful.

In what follows, section 2 introduces statistical language modeling in more detail, and discusses the potential for improvement in this area. Section 3 overviews major established SLM techniques. Section 4 lists promising current research directions. Finally, section 5 suggests both an interactive approach and a Bayesian approach to the integration of linguistic knowledge into the model, and points to the encoding of such knowledge as a main challenge facing the field.

2. STATISTICAL LANGUAGE MODELING

2.1. Definition and use

A statistical language model is simply a probability distribution Pr(s) over all possible sentences s (or spoken utterances, documents, or any other linguistic units).

It is instructive to compare statistical language modeling to computational linguistics. Admittedly, the two fields (and communities) have fuzzy boundaries, and a great deal of overlap. Nonetheless, one way to characterize the difference is as follows. Let W be the word sequence of a given sentence, i.e. its surface form, and let H be some hidden structure associated with it (e.g. its parse tree, word senses, etc.). Statistical language modeling is mostly about estimating Pr(W), whereas computational linguistics is mostly about estimating Pr(H | W). Of course, if one could estimate well the joint Pr(W, H), both Pr(W) and Pr(H | W) could be derived from it. In practice, this is usually not feasible.

Statistical language models are usually used in the context of a Bayes classifier, where they can play the role of either the prior or the likelihood function. For example, in automatic speech recognition, given an acoustic signal A, the goal is to find the sentence s that is most likely to have been spoken. Using a Bayesian framework, the solution is:

    \hat{s} = \arg\max_s \Pr(s \mid A) = \arg\max_s \Pr(A \mid s) \cdot \Pr(s)    (1)

where the language model Pr(s) plays the role of the prior. In contrast, in document classification, given a document d, the goal is to find the class c to which it belongs. Typically, examples of documents from each of the (say) K classes are given, from which K different language models Pr_1(d), Pr_2(d), ..., Pr_K(d) are constructed. Using a Bayes classifier, the solution \hat{c} is:

    \hat{c} = \arg\max_c \Pr(c \mid d) = \arg\max_c \Pr(d \mid c) \cdot \Pr(c)    (2)

where the language model Pr(d | c) plays the role of the likelihood. In a similar fashion, one can derive the role of language models in Bayesian classifiers for the other language technologies listed above.
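As an illustration of equation (2), the sketch below (not part of the original exposition; the class names, toy corpora, and add-one smoothing are invented for the example) classifies a document using per-class unigram language models as the likelihood term of a Bayes classifier.

```python
import math
from collections import Counter

def train_unigram(docs):
    """Estimate a unigram LM Pr(w | c) from a list of tokenized documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    vocab_size = len(counts)
    # Add-one smoothing so unseen words do not receive zero probability.
    return lambda w: (counts[w] + 1) / (total + vocab_size + 1)

def classify(doc, class_models, class_priors):
    """Bayes classifier of equation (2): argmax_c Pr(d | c) * Pr(c), in log space."""
    def log_posterior(c):
        lm = class_models[c]
        return math.log(class_priors[c]) + sum(math.log(lm(w)) for w in doc)
    return max(class_models, key=log_posterior)

# Toy example: two classes, each with a handful of tiny "documents".
sports_docs = [["the", "team", "won", "the", "game"], ["a", "great", "match"]]
finance_docs = [["the", "bank", "raised", "rates"], ["stocks", "fell", "today"]]
models = {"sports": train_unigram(sports_docs), "finance": train_unigram(finance_docs)}
priors = {"sports": 0.5, "finance": 0.5}
print(classify(["stocks", "fell", "today"], models, priors))  # prints 'finance'
```

In a real system the class-conditional models would be smoothed n-grams rather than unigrams, but the decision rule is exactly the one in equation (2).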

2.2. Measures of progress

To assess the quality of a given language modeling technique, the likelihood of new data is most commonly used. The average log likelihood of a new random sample is given by:

    \text{Average-Log-Likelihood} = \frac{1}{N} \sum_{i=1}^{N} \log_2 \Pr_M(x_i)    (3)

where x_1, x_2, ..., x_N is the new data sample, and M is the given language model. This latter quantity can also be viewed as an empirical estimate of (the negative of) the cross entropy of the true (but unknown) data distribution Pr with regard to the model distribution Pr_M:

    \text{cross-entropy}(\Pr; \Pr_M) = - \sum_{x} \Pr(x) \log_2 \Pr_M(x)    (4)

Actual performance of language models is often reported in terms of perplexity [6]:

    \text{perplexity} = 2^{\,\text{cross-entropy}(\Pr; \Pr_M)}    (5)

Perplexity can be interpreted as the (geometric) average branching factor of the language according to the model. It is a function of both the language and the model. When considered a function of the model, it measures how good the model is (the better the model, the lower the perplexity). When considered a function of the language, it estimates the entropy, or complexity, of that language.

Ultimately, the quality of a language model must be measured by its effect on the specific application for which it was designed, namely by its effect on the error rate of that application. However, error rates are typically non-linear and poorly understood functions of the language model. Lower perplexity usually results in lower error rates, but there are plenty of counterexamples in the literature. As a rough rule of thumb, a reduction of 5% in perplexity is usually not practically significant; a 10%-20% reduction is noteworthy, and usually (but not always) translates into some improvement in application performance; a perplexity improvement of 30% or more over a good baseline is quite significant (and rare!).

Several attempts have been made to devise metrics that are better correlated with application error rate than perplexity, yet are easier to optimize than the error rate itself. These attempts have met with limited success. For now, perplexity continues to be the preferred metric for practical language model construction. For more details, see [7].
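To make equations (3)-(5) concrete, the following sketch (my own illustration; the `model_prob` callable is a hypothetical stand-in for any language model that returns Pr_M of a word given its history) computes the average log-likelihood of a test sample and the corresponding perplexity.

```python
import math

def perplexity(test_words, model_prob, history_len=2):
    """Empirical cross-entropy (base 2) and perplexity of a word-level model.

    model_prob(word, history) must return Pr_M(word | history) > 0,
    e.g. a smoothed trigram lookup.
    """
    log_likelihood = 0.0
    for i, w in enumerate(test_words):
        history = tuple(test_words[max(0, i - history_len):i])
        log_likelihood += math.log2(model_prob(w, history))
    avg_log_likelihood = log_likelihood / len(test_words)   # equation (3)
    cross_entropy = -avg_log_likelihood                     # empirical form of (4)
    return 2 ** cross_entropy                               # equation (5)

# Usage with a deliberately trivial "model": a uniform distribution over a
# 10,000-word vocabulary has cross-entropy log2(10000), hence perplexity 10,000.
uniform = lambda w, h: 1.0 / 10000
print(perplexity(["any", "words", "at", "all"], uniform))   # approximately 10000
```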

2.3. Known weaknesses in current models

Even the simplest language model has a drastic effect on the application in which it is used (this can be observed by, say, removing the language model from a speech recognition system). However, current language modeling techniques are far from optimal. Evidence for this comes from several sources:

Brittleness across domains: Current language models are extremely sensitive to changes in the style, topic or genre of the text on which they are trained. For example, to model casual phone conversations, one is much better off using 2 million words of transcripts from such conversations than using 140 million words of transcripts from TV and radio news broadcasts. This effect is quite strong even for changes that seem trivial to a human: a language model trained on Dow-Jones newswire text will see its perplexity doubled when applied to the very similar Associated Press newswire text from the same time period ([8, p. 220]).

False independence assumption: In order to remain tractable, virtually all existing language modeling techniques assume some form of independence among different portions of the same document. For example, the most commonly used model, the n-gram, assumes that the probability of the next word in a sentence depends only on the identity of the last n-1 words. Yet even a cursory look at any natural text proves this assumption patently false. False independence assumptions in statistical models usually lead to overly sharp distributions. This is precisely what is happening in language modeling, as can be seen for example in document classification: the posterior computed by equation 2 is usually extremely sharp, reaching virtually one for one of the classes and virtually zero for all others. This of course cannot be the true posterior, since the average classification error rate is typically much greater than zero.

Shannon-style experiments: Claude Shannon pioneered the technique of eliciting human knowledge of language by asking human subjects to predict the next element of text [9, 10]. Shannon used this technique to bound the entropy of English. [11] formulated a gambling setup and used it to derive its own estimate of the entropy of English. In the 1980s, the speech and language research group at IBM performed 'Shannon-style' experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text. Since then, Shannon-style experiments have been performed by several other researchers. For example, [12] performed experiments aimed at establishing the potential for language modeling improvements in specific linguistic areas. A common observation during all these experiments is that people improve on the performance of a language model easily, routinely and substantially. They apparently do so by using reasoning at the linguistic, common sense, and domain levels.

3. SURVEY OF MAJOR SLM TECHNIQUES

This section briefly reviews major established SLM techniques. For a more detailed technical treatment, see [13].

Almost all language models to date decompose the probability of a sentence into a product of conditional probabilities:

    \Pr(s) = \prod_{i=1}^{n} \Pr(w_i \mid h_i)    (6)

where w_i is the i-th word in the sentence, and h_i = \{w_1, \ldots, w_{i-1}\} is called the history.

3.1. n-grams

n-grams are the staple of current speech recognition technology. Virtually all commercial speech recognition products use some form of an n-gram. An n-gram reduces the dimensionality of the estimation problem by modeling language as a Markov source of order n-1:

    \Pr(w_i \mid h_i) \approx \Pr(w_i \mid w_{i-n+1}, \ldots, w_{i-1})    (7)

The value of n trades off the stability of the estimate (i.e. its variance) against its appropriateness (i.e. bias). A trigram (n=3) is a common choice with large training corpora (millions of words), whereas a bigram (n=2) is often used with smaller ones.

Deriving trigram and even bigram probabilities is still a sparse estimation problem, even with very large corpora. For example, after observing all trigrams (i.e., consecutive word triplets) in 38 million words' worth of newspaper articles, a full third of trigrams in new articles from the same source are novel [8, p. 8]. Furthermore, even among the observed trigrams, the vast majority occurred only once, and the majority of the rest had similarly low counts. Therefore, straightforward maximum likelihood (ML) estimation of n-gram probabilities from counts is not advisable. Instead, various smoothing techniques have been developed. These include discounting the ML estimates [14, 15], recursively backing off to lower order n-grams [16, 17, 18], and linearly interpolating n-grams of different order [19]. Other approaches include variable-length n-grams [20, 21, 22, 23, 24] as well as a lattice approach [25]. Much work has been done to compare and perfect smoothing techniques under various conditions. A good recent analysis can be found in [26]. In addition, toolkits implementing the various techniques have been disseminated [27, 28, 29, 30].
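As a concrete and deliberately minimal illustration of one of the smoothing families mentioned above, the sketch below implements a trigram with fixed-weight linear interpolation of trigram, bigram and unigram ML estimates, in the spirit of [19]. The weights and toy corpus are invented for the example; a real system would tune the weights on held-out data and add a floor for unseen words.

```python
from collections import Counter

class InterpolatedTrigram:
    """Pr(w | u, v) = l3*ML3(w | u, v) + l2*ML2(w | v) + l1*ML1(w), cf. equation (7)."""

    def __init__(self, words, lambdas=(0.6, 0.3, 0.1)):
        self.l3, self.l2, self.l1 = lambdas
        self.uni = Counter(words)
        self.bi = Counter(zip(words, words[1:]))
        self.tri = Counter(zip(words, words[1:], words[2:]))
        self.total = len(words)

    def _ml(self, count, context_count):
        # Relative-frequency (maximum likelihood) estimate; 0 if context unseen.
        return count / context_count if context_count else 0.0

    def prob(self, w, u, v):
        p3 = self._ml(self.tri[(u, v, w)], self.bi[(u, v)])
        p2 = self._ml(self.bi[(v, w)], self.uni[v])
        p1 = self.uni[w] / self.total
        return self.l3 * p3 + self.l2 * p2 + self.l1 * p1

corpus = "the cat sat on the mat the cat ate the rat".split()
lm = InterpolatedTrigram(corpus)
print(lm.prob("sat", "the", "cat"))   # mixes trigram, bigram and unigram evidence
```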

Yet another way to battle sparseness is via vocabulary clustering. Let c_i be the class that word w_i was assigned to. Then any of several model structures could be used. For example, for a trigram:

    \Pr(w_3 \mid w_1, w_2) = \Pr(w_3 \mid c_3) \cdot \Pr(c_3 \mid w_1, w_2)    (8)
    \Pr(w_3 \mid w_1, w_2) = \Pr(w_3 \mid c_3) \cdot \Pr(c_3 \mid w_1, c_2)    (9)
    \Pr(w_3 \mid w_1, w_2) = \Pr(w_3 \mid c_3) \cdot \Pr(c_3 \mid c_1, c_2)    (10)
    \Pr(w_3 \mid w_1, w_2) = \Pr(w_3 \mid c_1, c_2)    (11)

The quality of the resulting model depends of course on the clustering c(·). In narrow discourse domains (e.g. ATIS, [31]), good results are often achieved by manual clustering of semantic categories (e.g. [32]). But in less constrained domains, manual clustering by linguistic categories (e.g. parts of speech) does not usually improve on the word-based model. Automatic, iterative clustering using information theoretic criteria [33, 34] applied to large corpora can sometimes reduce perplexity by 10% or so, but only after the model is interpolated with its word-based counterpart.
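A minimal sketch of the decomposition in equation (10) follows (my own illustration; the two-class assignment and toy counts are invented). Predicting a word through its class lets words that share a class share statistics.

```python
from collections import Counter

def class_trigram(words, word2class):
    """Build Pr(w | c(w)) and Pr(c3 | c1, c2) tables; combine them as in equation (10)."""
    classes = [word2class[w] for w in words]
    w_count, c_count = Counter(words), Counter(classes)
    ctri = Counter(zip(classes, classes[1:], classes[2:]))
    cbi = Counter(zip(classes, classes[1:]))

    def prob(w3, w1, w2):
        c1, c2, c3 = word2class[w1], word2class[w2], word2class[w3]
        p_word_given_class = w_count[w3] / c_count[c3]
        p_class = ctri[(c1, c2, c3)] / cbi[(c1, c2)] if cbi[(c1, c2)] else 0.0
        return p_word_given_class * p_class

    return prob

word2class = {"monday": "DAY", "friday": "DAY", "meet": "VERB",
              "call": "VERB", "on": "FUNC", "me": "FUNC"}
corpus = "call me on monday meet me on friday".split()
p = class_trigram(corpus, word2class)
print(p("friday", "me", "on"))   # 'friday' benefits from statistics shared with 'monday'
```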

3.2. Decision tree models

Decision trees and CART-style [35] algorithms were first applied to language modeling by [36]. A decision tree can arbitrarily partition the space of histories by asking arbitrary binary questions about the history h at each of the internal nodes. The training data at each leaf is then used to construct a probability distribution Pr(w | h) over the next word. To reduce the variance of the estimate, this leaf distribution is interpolated with internal-node distributions found along the path to the root.

As usual, trees are grown by greedily selecting, at each node, the most informative question (as judged by reduction in entropy). Pruning and cross validation are also used.

Applying CART technology to language modeling is quite a challenge: the space of histories is very large (some 10^{100} histories for a 20-word sequence over a 100,000-word vocabulary), and the space of possible binary questions is even larger (2^{10^{100}}). Even if questions are restricted to individual words in the history, there are still on the order of 20 \cdot 2^{100,000} such questions. Very strong bias must be introduced, by restricting the class of questions to be considered and using greedy search algorithms. To support optimal single-word questions at a given node, algorithms were developed for rapid optimal binary partitioning of the vocabulary (e.g. [37]).

The first attempt at a CART-style LM [36] used a history window of 20 words and restricted questions to individual words, though it allowed more complicated questions consisting of composites of simple questions. It took many months to train, and the result fell short of expectations: a 4% reduction in perplexity over the baseline trigram, and a further 9% reduction when interpolated with the latter. In the second attempt [38], much stronger bias was introduced: first, the vocabulary was clustered into a binary hierarchy as in [33], and each word was assigned a bit-string representing the path leading to it from the root. Then, tree questions were restricted to the identity of the most significant as-yet-unknown bit in each word in the history. This reduced the candidate set to a handful of questions at each node. Unfortunately, results here were also disappointing, and the approach was largely abandoned.

Theoretically, decision trees represent the ultimate in partition based models. It is likely that trees exist which significantly outperform n-grams. But finding them seems difficult, for both computational and data sparseness reasons.
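To give a feel for the mechanics just described, here is a toy sketch (my own construction, not the cited systems) of growing a single decision-tree node: it scores candidate single-word questions about the history by the entropy reduction they yield on the next-word distribution.

```python
import math
from collections import Counter

def entropy(counter):
    total = sum(counter.values())
    return -sum(c / total * math.log2(c / total) for c in counter.values())

def best_question(samples, candidate_words):
    """samples: list of (history_words, next_word). A question asks
    'does the history contain word q?'; return the question with the
    largest reduction in weighted next-word entropy."""
    root = Counter(next_w for _, next_w in samples)
    best = (None, 0.0)
    for q in candidate_words:
        yes = Counter(next_w for hist, next_w in samples if q in hist)
        no = Counter(next_w for hist, next_w in samples if q not in hist)
        if not yes or not no:
            continue  # question does not split the data
        n_yes, n_no = sum(yes.values()), sum(no.values())
        split_entropy = (n_yes * entropy(yes) + n_no * entropy(no)) / (n_yes + n_no)
        gain = entropy(root) - split_entropy
        if gain > best[1]:
            best = (q, gain)
    return best

samples = [(["want", "to"], "fly"), (["want", "to"], "go"),
           (["flights", "to"], "boston"), (["fares", "to"], "denver")]
print(best_question(samples, ["want", "to", "flights"]))  # ('want', 1.0) splits best
```

A real tree would recurse on each split, prune with held-out data, and interpolate leaf distributions with their ancestors, as described above.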

3.3. Linguistically motivated models

While all SLMs get some inspiration from an intuitive view of language, in most models the actual linguistic content is quite negligible. Several SLM techniques, however, are directly derived from grammars commonly used by linguists.

Context free grammar (CFG) is a crude yet well understood model of natural language. A CFG is defined by a vocabulary, a set of non-terminal symbols and a set of production or transition rules. Sentences are generated, starting with an initial non-terminal, by repeated application of the transition rules, each transforming a non-terminal into a sequence of terminals (i.e. words) and non-terminals, until a terminals-only sequence is achieved. Specific CFGs have been created based on parsed and annotated corpora such as [39], with good, though still incomplete, coverage of new data.

A probabilistic (or stochastic) context free grammar puts a probability distribution on the transitions emanating from each non-terminal, thereby inducing a distribution over the set of all sentences. These transition probabilities can be estimated from annotated corpora using the Inside-Outside algorithm [40], an Estimation-Maximization (EM) algorithm (see [41]). However, the likelihood surfaces of these models tend to contain many local maxima, and the locally maximal likelihood points found by the algorithm usually fall short of the global maximum. Furthermore, even if global ML estimation were feasible, it is generally believed that context sensitive transition probabilities are needed to adequately account for the actual behavior of language. Unfortunately, no efficient training algorithm is known for this situation.

In spite of this, [42] successfully incorporated CFG knowledge sources into an SLM to achieve a 15% reduction in speech recognition error rate in the ATIS domain. They did so by parsing the utterances with a CFG to produce a sequence of grammatical fragments of various types, then constructing a trigram of fragment types to supplant the standard n-gram.

Link grammar is a lexicalized grammar proposed by [43]. Each word is associated with one or more ordered sets of typed links; each such link must be connected to a similarly typed link of another word in the sentence. A legal parse consists of satisfying all links in the sentence via a planar graph. Link grammar has the same expressive power as a CFG, but arguably conforms better to human linguistic intuition. A link grammar for English has been constructed manually with good coverage. Probabilistic forms of link grammar have also been attempted [44]. Link grammar is related to dependency grammar, which will be discussed in section 4.

3.4. Exponential models

All models discussed so far suffer from data fragmentation, in that more detailed modeling necessarily results in each new parameter being estimated with less and less data. This is very apparent in decision trees, where, as the tree grows, leaves contain fewer and fewer data points.

Fragmentation can be avoided by using an exponential model of the form:

    \Pr(w \mid h) = \frac{1}{Z(h)} \exp\Big( \sum_i \lambda_i f_i(h, w) \Big)    (12)

where the \lambda_i are the parameters, Z(h) is a normalizing term, and the features f_i(h, w) are arbitrary functions of the word-history pair. Given a training corpus, the ML estimate can be shown to satisfy the constraints:

    \sum_{h} \tilde{\Pr}(h) \sum_{w} \Pr(w \mid h)\, f_i(h, w) = \sum_{h, w} \tilde{\Pr}(h, w)\, f_i(h, w)    (13)

where \tilde{\Pr} is the empirical distribution of the training corpus.

The ML estimate can also be shown to coincide with the Maximum Entropy (ME) distribution [45], namely the one with highest entropy among all distributions satisfying equation 13. This unique ML/ME solution can be found by an iterative procedure [46, 47].

The ME paradigm, and the more general MDI framework, were first suggested for language modeling by [48], and have since seen considerable success (e.g. [49, 50, 8]). Its strength lies in incorporating arbitrary knowledge sources in a principled way while avoiding fragmentation. For example, in [8], conventional n-grams, distance-2 n-grams, and long distance word pairs ("triggers") were encoded as features, and resulted in up to 39% perplexity reduction and up to 14% speech recognition word error rate reduction over the trigram baseline.

While ME modeling is elegant and general, it is not without its weaknesses. Training an ME model is computationally challenging, and sometimes altogether infeasible. Using an ME model is also CPU intensive, because of the need for explicit normalization. Unnormalized ME modeling is attempted in [51]. ME smoothing is analyzed in [52].

The relative success of ME modeling focused attention on the remaining problem of feature induction, namely, selection of useful features to be included in the model. An automatic iterative procedure for selecting features from a given candidate set is described in [47]. An interactive procedure for eliciting candidate sets is described in [53].

ME language modeling remains the subject of intensive research; see for example [54, 55, 56, 57, 58].
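The sketch below (my own toy illustration, not the cited systems) shows the two core operations behind equations (12) and (13) for a conditional exponential model over a tiny vocabulary: computing the normalized probability, and measuring how far a feature's model expectation is from its empirical expectation, which is the quantity that iterative scaling drives to zero.

```python
import math

VOCAB = ["a", "b", "c"]

def features(h, w):
    """Binary features f_i(h, w); these two are invented for the example."""
    return [1.0 if w == h[-1] else 0.0,   # f_0: word repeats the last history word
            1.0 if w == "a" else 0.0]     # f_1: word is 'a'

def prob(w, h, lambdas):
    """Equation (12): Pr(w | h) = exp(sum_i lambda_i f_i(h, w)) / Z(h)."""
    score = lambda word: math.exp(sum(l * f for l, f in zip(lambdas, features(h, word))))
    return score(w) / sum(score(v) for v in VOCAB)

def expectation_gap(i, data, lambdas):
    """Constraint (13): model expectation of f_i minus its empirical expectation."""
    model_exp = sum(prob(v, h, lambdas) * features(h, v)[i]
                    for h, _ in data for v in VOCAB) / len(data)
    empirical_exp = sum(features(h, w)[i] for h, w in data) / len(data)
    return model_exp - empirical_exp

data = [(("b",), "b"), (("b",), "a"), (("c",), "c"), (("a",), "a")]
lambdas = [0.5, 0.2]
print(prob("a", ("b",), lambdas), expectation_gap(0, data, lambdas))
```

An iterative scaling trainer would repeatedly adjust each lambda_i until every expectation gap is (approximately) zero; the explicit sum over the vocabulary in Z(h) is what makes training and use of large conditional ME models expensive, as noted above.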

3.5. Adaptive models

So far we have treated language as a homogeneous source. But in fact natural language is highly heterogeneous, with varying topics, genres and styles.

In cross-domain adaptation, test data comes from a source to which the language model has not been exposed during training. The only useful adaptation information is in the current document itself. A common and quite effective technique for exploiting this information is the cache: the (continuously developing) history is used to create, at runtime, a dynamic n-gram \Pr^{\text{cache}}(w \mid h), which in turn is interpolated with the static model:

    \Pr^{\text{adaptive}}(w \mid h) = \lambda \Pr^{\text{static}}(w \mid h) + (1 - \lambda) \Pr^{\text{cache}}(w \mid h)    (14)

with the weight \lambda optimized on held-out data. Cache LMs were first introduced by [59] and [60]. [61, 62] report reduction in perplexity, and [63] also reports reduction in recognition error rate. [64] introduced yet another adaptation scheme.

In within-domain adaptation, test data comes from the same source as the training data, but the latter is heterogeneous, consisting of many subsets with varying topics, styles, or both. Adaptation then proceeds in the following steps:

1. Clustering the training corpus along the dimension of variability, say, topic (e.g. [65]).

2. At runtime, identifying the topic or set of topics ([66, 67]) of the test data.

3. Locating appropriate subsets of the training corpus, and using them to build a specific model.

4. Combining the specific model with a corpus-wide model (in statistical terminology, shrinking the specific model towards the general one, to trade off the former's variance against the latter's bias). This is usually done via linear interpolation, at either the word probability level or the sentence probability level [65].

A special (and very common) case is when one has only small amounts of data in the target domain and large amounts in other domains. In this case, the only relevant step is the last one: combining models from the two domains. The outcome here is often disappointing, though: training data outside the domain has surprisingly little benefit. For example, when modeling the Switchboard domain (conversational speech, [68]), the 40 million words of the WSJ corpus (newspaper articles, [69]) and even the 140 million words of the BN corpus (broadcast news transcriptions, [70]) improve the application performance of the in-domain model, trained on a paltry 2.5 million words, by only a few percentage points. Although this is a significant improvement on such a difficult corpus, it is nonetheless disappointing considering the amount of data involved. By some estimates [71], another 1 million words of Switchboard data would help the model more than 30 million words of out-of-domain data. This suggests that our adaptation techniques are too crude.
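The following sketch (my own illustration) implements the interpolation of equation (14) with a unigram cache built from the recent history; the same linear-mixture mechanism is what step 4 above uses to shrink a topic-specific model towards the corpus-wide one. The static model and the weight here are placeholders, to be replaced by a real smoothed n-gram and a value tuned on held-out data.

```python
from collections import Counter, deque

class CacheLM:
    """Pr_adaptive(w | h) = lam * Pr_static(w | h) + (1 - lam) * Pr_cache(w), cf. equation (14)."""

    def __init__(self, static_prob, lam=0.9, cache_size=200):
        self.static_prob = static_prob          # callable: Pr_static(word, history)
        self.lam = lam
        self.cache = deque(maxlen=cache_size)   # the continuously developing history
        self.counts = Counter()

    def observe(self, word):
        if len(self.cache) == self.cache.maxlen:
            self.counts[self.cache[0]] -= 1     # word about to fall out of the cache
        self.cache.append(word)
        self.counts[word] += 1

    def prob(self, word, history=()):
        p_cache = self.counts[word] / len(self.cache) if self.cache else 0.0
        return self.lam * self.static_prob(word, history) + (1 - self.lam) * p_cache

static = lambda w, h: 1.0 / 10000               # placeholder uniform static model
lm = CacheLM(static)
for w in "the patient was given the medication".split():
    lm.observe(w)
print(lm.prob("medication"), lm.prob("parliament"))  # recently seen word gets boosted
```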

The ML estimate can also be shown to coincide with styles, or both. Adaptation then proceeds in the following the Maximum Entropy (ME) distribution [45], namely the steps: one with highest entropy among all distributions satisfying 1. Clustering the training corpus along the dimension of equation 13. This unique ML/ME solution can be found by variability, say, topic (e.g. [65]). an iterative procedure [46, 47]. The ME paradigm, and the more general MDI frame- 2. At runtime, identifying the topic or set of topics ([66, work, were first suggested for language modeling by [48], 67]) of the test data. and have since seen considerable success (e.g. [49, 50, 8]). Its strength lies in principly incorporating arbitrary knowl- 3. Locating appropriate subsets of the training corpus, edge sources while avoiding fragmentation. For example, and using them to build a specific model. in [8], conventional ngrams, distance-2 ngrams, and long 4. Combining the specific model with a corpus-wide model distance word pairs (“triggers”) were encoded as features, (in statistical terminology, shrinking the specific model and resulted in up to 39% perplexity reduction and up to towards the general one, to trade off the former’s vari- 14% speech recognition word error rate reduction over the ance against the latter’s bias). This is usually done trigram baseline. via linear interpolation, at either the word probability While ME modeling is elegant and general, it is not level or the sentence probability level [65]. without its weaknesses. Training a ME model is compu- tationally challenging, and sometimes altogether infeasible. A special (and very common) case is when one has only Using a ME model is also CPU intensive, because of the small amounts of data in the target domain and large amounts need for explicit normalization. Unnormalized ME model- in other domains. In this case, the only relevant step is the ing is attempted in [51]. ME smoothing is analyzed in [52]. last one: combining models from the two domains. The The relative success of ME modeling focused attention outcome here is often disappointing, though: training data on the remaining problem of feature induction, namely, se- outside the domain has surprisingly little benefit. For ex- lection of useful features to be included in the model. An ample, when modeling the Switchboard domain (conversa- automatic iterative procedure for selecting features from a tional speech, [68]), the 40 million words of the WSJ corpus given candidate set is described in [47]. An interactive pro- (newspaper articles, [69]) and even the 140 million words cedure for eliciting candidate sets is described in [53]. of the BN corpus (broadcast news transcriptions, [70]) im- ME language modeling remains the subject of intensive prove by only a few percentage points the application per- research; see for example [54, 55, 56, 57, 58]. formance of the in-domain model trained on a paltry 2.5 million words. Although this is a significant improvement 4.2. Dimensionality reduction on such a difficult corpus, it is nonetheless disappointing considering the amount of data involved. By some esti- One of the reasons language is so hard to model statistically mates [71], another 1 million words of Switchboard data is that it is ostensibly categorical, with an extremely large would help the model more than 30 million words of out- number of categories, or dimensions. A prime example is of-domain data. This suggest that our adaptation techniques the vocabulary. 
To most language models, the vocabulary is are too crude. but a very large set of unrelated entries. BANK is no closer to LOAN or to BANKS than it is to, say, BRAZIL. This results in a large number of parameters. Yet our linguistic 4. PROMISING CURRENT DIRECTIONS intuition is that there is a great deal of structure in the rela- tionship among words. We feel that the “true” dimension of This section discusses current research directions that, in the vocabulary is actually quite lower. this author’s subjective opinion, show significant promise. Similarly, for other phenomena in language, the under- lying space may be of moderate or even low dimensional- ity. Consider topic adaptation. As the topic changes, the 4.1. Dependency models probabilities of almost all words in the vocabulary change. Since no two documents are exactly about the same thing, a Dependency grammars (DG) describe sentences in terms of straightforward approach would require an inordinate num- asymmetric pairwise relationships among words. With a ber of parameters. Yet the underlying topic space can be single exception, each word in the sentence is dependent reasonably modeled in much fewer dimensions. upon one other word, called its head or parent. The single This is the motivation behind [77], which uses the tech- exception is the root, which serves as the head of the en- nique of Latent Semantic Analysis ([78]) to simultaneously tire sentence. For more about DGs, see [72]. Probabilistic reduce the dimensionality of the vocabulary and that of the DGs have also been developed, together with algorithms for topic space. First, the occurrence of each vocabulary word learning them from corpora (e.g. [73]). in each document is tabulated. This very large matrix is Probabilistic dependency grammars are particularly suited then reduced via Singular Value Decomposition to a much to -gram style modeling, where each word is predicted lower dimension (typically 100–150). The new, smaller ma- based on a small number of other words. The main dif- trix captures the most salient correlations between specific ference is that in a conventional -gram, the structure of combinations of words on one hand and clusters of docu- the model is predetermined: each word is predicted from ments on the other. The decomposition also yields matrices a few words that immediately preceded it. In DG, which that project from document-space and word-space into the words serve as predictors depends on the dependency graph, new, combined space. Consequently, any new document which is a hidden variable. A typical implementation will can be projected into the combined space, effectively be-

parse a sentence ¤ to generate the most likely dependency ing classified as a combination of the fundamental under-

? ?

¡&¢ ¦ —

graphs — (with attendant probabilities ), compute lying topics, and adapted to accordingly. In [77], this type

?

¡&¢'¤ ¦

— for each of them a generation probability (either - of adaptation is combined with an -gram, and a perplex-

gram style or perhaps as an ME model), and finally estimate ity reduction of 30% over a trigram baseline is reported. In

?

¡£¢¥¤§¦£q™˜ ¡£¢ ¦n% ?

the complete sentence probability as — [79], the technique is further developed and is found to also

? ?

¡£¢'¤ ¦ ¡&¢ ¦

— —

(this is only approximate because the them- reduce recognition errors by 16% over a trigram baseline. ¡£¢'¤$¦
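A compact sketch of the SVD step just described follows, using numpy (my own illustration with a tiny invented word-document count matrix; real systems use tens of thousands of words, thousands of documents, and 100-150 retained dimensions).

```python
import numpy as np

# Rows = vocabulary words, columns = training documents: co-occurrence counts.
words = ["bank", "loan", "rates", "game", "team"]
counts = np.array([[3., 2., 0., 0.],
                   [2., 1., 0., 0.],
                   [1., 2., 0., 1.],
                   [0., 0., 4., 2.],
                   [0., 0., 2., 3.]])

k = 2                                     # retained dimensions (100-150 in real systems)
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
word_vecs = U[:, :k] * s[:k]              # low-dimensional word representations

# "Folding in" a new document: project its word-count vector into the same space,
# where it can be compared to words/documents or used to bias an n-gram.
new_doc = np.array([1., 1., 1., 0., 0.])  # a document about banking
doc_vec = new_doc @ U[:, :k] / s[:k]

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
for w, v in zip(words, word_vecs):
    print(w, round(cos(v, doc_vec), 2))   # financial words score high, sports words low
```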

4.3. Whole sentence models

All language models described so far use the chain rule to decompose the probability of a sentence into a product of conditional probabilities of the type Pr(w | h). Historically, this has been done to facilitate estimation by relative counts. The decomposition is ostensibly harmless: after all, it is not an approximation but an exact equality. However, as a result, language modeling by and large has been reduced to modeling the distribution of a single word. This in turn may be a significant hindrance to modeling linguistic structure: some linguistic phenomena are impossible or at best awkward to think about, let alone encode, in a conditional framework. These include sentence-level features such as person and number agreement, semantic coherence, parsability, and even length. Furthermore, external influences on the sentence (e.g. previous sentences, topic) must be factored into the prediction of every word, which can cause small biases to compound.

To address these issues, [53] proposed a whole sentence exponential model:

    \Pr(s) = \frac{1}{Z} \cdot \Pr_0(s) \cdot \exp\Big( \sum_i \lambda_i f_i(s) \Big)    (15)

Compared with the conditional exponential model of equation 12, Z is now a true constant, which eliminates the serious burden of normalization. Most importantly, the features f_i(s) can capture arbitrary properties of the entire sentence.

Training this model requires sampling from an exponential distribution, a non-trivial task. The use of Markov Chain Monte Carlo and other sampling methods for language is studied in [80]. Sampling efficiency is crucial. Consequently, the bottleneck in this model is not the number of features or the amount of data, but rather how rare the features are, and how accurately they need to be modeled. Interestingly, it has been shown [81] that most of the benefit is likely to come from the more common features.

Parse-based features have been tried in [81], and semantic features are discussed in [53]. An interactive methodology for feature induction was also proposed in [53]. This methodology leads to a formulation of the training problem as logistic regression, with significant practical benefits over ML training.

5. CHALLENGES

Perhaps the most frustrating aspect of statistical language modeling is the contrast between our intuition as speakers of natural language and the over-simplistic nature of our most successful models.

As native speakers, we feel strongly that language has a deep structure. Yet we are not sure how to articulate that structure, let alone encode it, in a probabilistic framework. Established linguistic theories have been of surprisingly little help here, probably because their goal is to draw a line between what is properly in the language and what isn't, whereas SLM's goals are quite different.

As an example, consider the problem of clustering the vocabulary words, which was discussed in section 3.1. As mentioned there, several automatic iterative methods have been proposed (e.g. [33, 34]). Table 1 lists example word classes derived by such a method [82]. While most words' placement appears satisfactory, a few of the words seem out of place. Not surprisingly, these are often words whose count in the corpus was insufficient for reliable assignment. Ironically, it is exactly these words which stood to benefit the most from clustering. In general, the more reliably a word can be assigned to a class, the less it will benefit from that assignment. How then is vocabulary clustering to become effective?

Table 1: Data driven word classes.

    COMMITTEE COMMISSION PANEL SUBCOMMITTEE WONK
    THEMSELVES MYSELF YOURSELF UNBECOMING ...
    ATTORNEY SURGEON RUKEYSER CONSUL RICKEY ...
    ACTION ACTIVITY INTERVENTION ATTACHE WARFARE ...
    CENTER ASSOCIATION FACETED INSTITUTE GUILD ...
    PARTICULAR YEAR'S NIGHT'S MORNING'S FATEFUL ...

I believe that the solution to this problem, and others like it, is to inject human knowledge of language into the process. This can take the following forms:

Interactive modeling. Data-driven optimization and human knowledge and decision making can play complementary roles in an intertwined iterative process. For the vocabulary clustering problem, this means that a human is put in the loop, to arbitrate some borderline decisions and override others. For example, a human can decide that 'TUESDAY' belongs in the same cluster as 'MONDAY', 'WEDNESDAY', 'THURSDAY' and 'FRIDAY', even if it did not occur enough times to be placed there automatically, and even if it did not occur at all. Another example of this approach is the interactive feature induction methodology described in [53].

Encoding knowledge as priors. One of the perils of using human knowledge is that it is often overstated, and sometimes wrong. Thus a better solution might be to encode such knowledge as a prior in a Bayesian updating scheme. After training, whatever phenomena are not sufficiently represented in the training corpus will continue to be captured thanks to the prior. Whenever enough data exist, however, they will override the prior. For the vocabulary clustering problem, experts' beliefs about the relationships between vocabulary entries must be suitably encoded, and the clustering paradigm must be changed to optimize an appropriate posterior measure. Thus, in the example above, enough data may exist to separate out 'FRIDAY' because of its use in phrases like "Thank God It's Friday".

Encoding linguistic knowledge as a prior is an exciting challenge which has yet to be seriously attempted. This will likely include defining a distance metric over words and phrases, and a stochastic version of structured word ontologies like WordNet [83]. At the syntactic level, it could include Bayesian versions of manually created lexicalized grammars. In practice, the Bayesian framework and the interactive process may be combined, taking advantage of the superior theoretical foundation of the former and the computational advantages of the latter.

Acknowledgements

I am grateful to Stanley Chen, Sanjeev Khudanpur, John Lafferty and Bob Moore for helpful comments.

6. REFERENCES
[1] Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Frederick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June 1990.

[2] Ralf Brown and Robert Frederking. Applying statistical English language modeling to symbolic machine translation. In Proceedings of the 6th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI'95), pages 221–239, July 1995.

[3] J. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st International Conference on Research and Development in Information Retrieval (SIGIR'98), pages 275–281, 1998.

[4] Adam Berger and John Lafferty. Information retrieval as statistical translation. In Proceedings of the 22nd Annual Conference on Research and Development in Information Retrieval (SIGIR'99), pages 222–229, 1999.

[5] Fred Jelinek. The 1995 language modeling summer workshop at Johns Hopkins University. Closing remarks.

[6] Lalit R. Bahl, Jim K. Baker, Frederick Jelinek, and Robert L. Mercer. Perplexity - a measure of the difficulty of speech recognition tasks. Program of the 94th Meeting of the Acoustical Society of America, J. Acoust. Soc. Am., 62:S63, 1977. Suppl. no. 1.

[7] Stanley F. Chen, Douglas Beeferman, and Ronald Rosenfeld. Evaluation metrics for language models. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, pages 275–280, 1998.

[8] Ronald Rosenfeld. A maximum entropy approach to adaptive statistical language modeling. Computer Speech and Language, 10:187–228, 1996. A longer version was published as "Adaptive Statistical Language Modeling: A Maximum Entropy Approach," Ph.D. thesis, Computer Science Department, Carnegie Mellon University, TR CMU-CS-94-138, April 1994.

[9] C. E. Shannon. A mathematical theory of communication. Bell Systems Technical Journal, 27:379–423, 623–656, 1948.

[10] C. E. Shannon. Prediction and entropy of printed English. Bell Systems Technical Journal, 30:50–64, January 1951.

[11] T. M. Cover and R. C. King. A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4):413–421, 1978.

[12] E. Brill, R. Florian, C. Henderson, and L. Mangu. Beyond n-grams: Can linguistic sophistication improve language modeling? In Proceedings of the 36th Annual Meeting of the ACL, 1998.

[13] Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, Massachusetts, 1997.

[14] I. J. Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3 and 4):237–264, 1953.

[15] Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085–1094, July 1991.

[16] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(3):400–401, March 1987.

[17] Hermann Ney, Ute Essen, and Reinhard Kneser. On structuring probabilistic dependences in stochastic language modeling. Computer Speech and Language, 8:1–38, 1994.

[18] Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, pages 181–184, Detroit, Michigan, May 1995.

[19] Frederick Jelinek and Robert L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in Practice, pages 381–397, Amsterdam, The Netherlands: North-Holland, May 1980.

[20] D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 176–183. Morgan Kaufmann, San Mateo, CA, 1994.

[21] I. Guyon and F. Pereira. Design of a linguistic postprocessor using variable memory length Markov models. In Proceedings of the 3rd ICDAR, pages 454–457, 1995.

[22] Reinhard Kneser. Statistical language modeling using a variable context length. In Proceedings of ICSLP, volume 1, pages 494–497, Philadelphia, October 1996.

[23] Thomas Niesler and Philip Woodland. Variable-length category n-gram language models. Computer Speech and Language, 21:1–26, 1999.

[24] Man-Hung Siu and Mari Ostendorf. Variable n-gram and extensions for conversational speech language modeling. IEEE Transactions on Speech and Audio Processing, 8(1):63–75, 2000.

[25] Pierre Dupont and Ronald Rosenfeld. Lattice based language models. Technical Report CMU-CS-97-173, Carnegie Mellon University, Department of Computer Science, September 1997.

[26] Stanley F. Chen and Joshua Goodman. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting of the ACL, pages 310–318, Santa Cruz, California, June 1996.

[27] Ronald Rosenfeld. The CMU statistical language modeling toolkit and its use in the 1994 ARPA CSR evaluation. In Proceedings of the Spoken Language Systems Technology Workshop, pages 47–50, Austin, Texas, January 1995.

[28] Philip Clarkson and Ronald Rosenfeld. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), 1997.

[29] Andreas Stolcke. SRILM - the SRI language modeling toolkit. http://www.speech.sri.com/projects/srilm/, 1999.

[30] Stanley F. Chen. Language model tools (v0.1) user's guide. http://www.cs.cmu.edu/~sfc/manuals/h015c.ps, December 1998.

[31] Patti J. Price. Evaluation of spoken language systems: the ATIS domain. In Proceedings of the DARPA Speech and Natural Language Workshop, June 1990.

[32] Wayne H. Ward. The CMU air travel information service: understanding spontaneous speech. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 127–129, June 1990.

[33] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December 1992.

[34] Reinhard Kneser and Hermann Ney. Improved clustering techniques for class-based statistical language modeling. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), 1993.

[35] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, California, 1984.

[36] Lalit R. Bahl, Peter F. Brown, Peter V. de Souza, and Robert L. Mercer. A tree-based statistical language model for natural language speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 37:1001–1008, July 1989.

[37] Arthur Nadas, David Nahamoo, Michael A. Picheny, and Jeffrey Powell. An iterative "flip-flop" approximation of the most informative split in the construction of decision trees. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Canada, May 1991.

[38] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer, and Philip S. Resnik. Language modeling using decision trees. Research report, IBM Research, Yorktown Heights, NY, 1991.

[39] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 1993.

[40] James K. Baker. Trainable grammars for speech recognition. In Proceedings of the Spring Conference of the Acoustical Society of America, pages 547–550, Boston, MA, June 1979.

[41] Frederick Jelinek, John D. Lafferty, and Robert L. Mercer. Basic methods of probabilistic context-free grammars. In P. Laface and R. De Mori, editors, Speech Recognition and Understanding: Recent Advances, Trends, and Applications, volume 75 of Series F: Computer and Systems Sciences, pages 345–360. Springer Verlag, 1992.

[42] R. Moore, D. Appelt, J. Dowding, J. M. Gawron, and D. Moran. Combining linguistic and statistical knowledge sources in natural-language processing for ATIS. In Spoken Language Systems Technology Workshop, pages 261–264, Austin, Texas, February 1995. Morgan Kaufmann Publishers, Inc.

[43] Danny Sleator and Davy Temperley. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, October 1991.

[44] John D. Lafferty, Danny Sleator, and Davy Temperley. Grammatical trigrams: a probabilistic model of link grammar. In Proceedings of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language, Cambridge, MA, October 1992.

[45] E. T. Jaynes. Information theory and statistical mechanics. Physics Reviews, 106:620–630, 1957.

[46] J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.

[47] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, April 1997.

[48] S. Della Pietra, V. Della Pietra, R. L. Mercer, and S. Roukos. Adaptive language modeling using minimum discriminant estimation. In Proceedings of the Speech and Natural Language DARPA Workshop, February 1992.

[49] Raymond Lau, Ronald Rosenfeld, and Salim Roukos. Trigger-based language models: A maximum entropy approach. In Proceedings of ICASSP-93, pages II-45 – II-48, April 1993.

[50] Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

[51] Stanley F. Chen, Kristie Seymore, and Ronald Rosenfeld. Topic adaptation for language modeling using unnormalized exponential models. In ICASSP-98, Seattle, Washington, 1998.

[52] Stan F. Chen and Ronald Rosenfeld. A survey of smoothing techniques for ME models. IEEE Transactions on Speech and Audio Processing, 8(1):37–50, 2000.

[53] Ronald Rosenfeld, Larry Wasserman, Can Cai, and Xiaojin Zhu. Interactive feature induction and logistic regression for whole sentence exponential language models. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, CO, December 1999.

[54] Doug Beeferman, Adam Berger, and John Lafferty. A model of lexical attraction and repulsion. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 373–380, Madrid, Spain, 1997.

[55] John D. Lafferty and Bernard Suhm. Cluster expansions and iterative scaling for maximum entropy language models. In K. Hanson and R. Silver, editors, Maximum Entropy and Bayesian Methods, pages 195–202. Kluwer Academic Publishers, 1995.

[56] Jochen Peters and Dietrich Klakow. Compact maximum entropy language models. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, CO, December 1999.

[57] Sanjeev Khudanpur and Jun Wu. A maximum entropy language model integrating n-grams and topic dependencies for conversational speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Phoenix, AZ, 1999.

[58] Jun Wu and Sanjeev Khudanpur. Combining nonlocal, syntactic and n-gram dependencies in language modeling. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), Budapest, Hungary, 1999.

[59] Roland Kuhn. Speech recognition and the frequency of recently used words: A modified Markov model for natural language. In 12th International Conference on Computational Linguistics, pages 348–350, Budapest, August 1988.

[60] Julian Kupiec. Probabilistic models of short and long distance word dependencies in running text. In Proceedings of the DARPA Workshop on Speech and Natural Language, pages 290–295, February 1989.

[61] Roland Kuhn and Renato De Mori. A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-12(6):570–583, 1990.

[62] Roland Kuhn and Renato De Mori. Correction to: A cache-based natural language model for speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-14(6):691–692, June 1992.

[63] Fred Jelinek, Bernard Merialdo, Salim Roukos, and M. Strauss. A dynamic language model for speech recognition. In Proceedings of the DARPA Workshop on Speech and Natural Language, pages 293–295, February 1991.

[64] Reinhard Kneser and Volker Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume II, pages 586–589, Minneapolis, MN, 1993.

[65] Rukmini Iyer and Mari Ostendorf. Modeling long distance dependence in language: Topic mixture vs. dynamic cache models. IEEE Transactions on Speech and Audio Processing, 7:30–39, 1999.

[66] Kristie Seymore and Ronald Rosenfeld. Using story topics for language model adaptation. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), 1997.

[67] Kristie Seymore, Stanley Chen, and Ronald Rosenfeld. Nonlinear interpolation of topic models for language model adaptation. In Proceedings of ICSLP-98, 1998.

[68] J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, pages 517–520, March 1992.

[69] Douglas B. Paul and Janet M. Baker. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 357–362, February 1992.

[70] David Graff. The 1996 broadcast news speech and language model corpus. In Proceedings of the DARPA Workshop on Spoken Language Technology, pages 11–14, 1997.

[71] Ronald Rosenfeld, Rajeev Agarwal, Bill Byrne, Rukmini Iyer, Mark Liberman, Elizabeth Shriberg, Jack Unverferth, Dimitra Vergyri, and Enrique Vidal. Error analysis and disfluency modeling in the Switchboard domain. In Proceedings of the International Conference on Speech and Language Processing, 1996.

[72] http://ufal.mff.cuni.cz/dg-bib2.html.

[73] Glenn Carrol and Eugene Charniak. Two experiments on learning probabilistic dependency grammars from corpora. Technical Report TR 92-16, Computer Science Department, Brown University, 1992.

[74] Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, Andreas Stolcke, and Dekai Wu. Structure and performance of a dependency language model. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), volume 5, pages 2775–2778, 1997.

[75] Michael Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184–191, May 1996.

[76] Ciprian Chelba and Fred Jelinek. Recognition performance of a structured language model. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), volume 4, pages 1567–1570, 1999.

[77] Jerome R. Bellegarda. A multi-span language modeling framework for large vocabulary speech recognition. IEEE Transactions on Speech and Audio Processing, 6:456–467, 1998.

[78] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

[79] Jerome R. Bellegarda. Large vocabulary speech recognition with multi-span statistical language models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84, 2000.

[80] Stanley F. Chen and Ronald Rosenfeld. Efficient sampling and feature selection in whole sentence maximum entropy language models. In ICASSP-99, Phoenix, Arizona, 1999.

[81] Xiaojin Zhu, Stanley F. Chen, and Ronald Rosenfeld. Linguistic features for whole sentence maximum entropy language models. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), Budapest, Hungary, 1999.

[82] Stanley F. Chen. Unpublished work, 1998.

[83] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. Language, Speech and Communication. MIT Press, 1998.