Linguistic Theory in Statistical Language Learning

Christer Samuelsson
Bell Laboratories, Lucent Technologies
600 Mountain Ave, Room 2D-339, Murray Hill, NJ 07974, USA
christer@research.bell-labs.com

Abstract

This article attempts to determine what elements of linguistic theory are used in statistical language learning, and why the extracted language models look like they do. The study indicates that some linguistic elements, such as the notion of a word, are simply too useful to be ignored. The second most important factor seems to be features inherited from the original task for which the technique was used, for example using hidden Markov models for part-of-speech tagging, rather than speech recognition. The two remaining important factors are properties of the runtime processing scheme employing the extracted language model, and the properties of the available corpus resources to which the statistical learning techniques are applied. Deliberate attempts to include linguistic theory seem to end up in a fifth place.

1 Introduction

What role does linguistics play in statistical language learning? "None at all!" might be the answer, if we ask hard-core speech-recognition professionals. But even the most nonlinguistic language model, for example a statistical word bigram model, actually relies on key concepts integral to virtually all linguistic theories. Words, for example, and the notion that sequences of words form utterances.

Statistical language learning is applied to some set of data to extract a language model of some kind. This language model can serve a purely decorative purpose, but is more often than not used to process data in some way, for example to aid speech recognition. Anyone working under the pressure of producing better results, and who employs language models to this purpose, such as researchers in the field of speech recognition, will have a high incentive to incorporate useful aspects of language into his or her language models. Now, the most useful, and thus least controversial, ways of describing language will, due to their usefulness, find their way into most linguistic theories and, for the very same reason, be used in models that strive to model language successfully.

So what do the linguistic theories underlying various statistical language models look like? And why? It may be useful to distinguish between those aspects of linguistic theory that are incidentally in the language model, and those that are there intentionally. We will start our tour of statistical language learning by inspecting language models with "very little" linguistic content, and then proceed to analyse increasingly more linguistic models, until we end with models that are entirely linguistic, in the sense that they are pure grammars, associated with no statistical parameters.

2 Word N-gram Models

Let us return to the simple bigram word model, where the probability of each next word is determined from the current one. We already noted that this model relies on the notion of a word, the notion of an utterance, and the notion that an utterance is a sequence of words.

The way this model is best visualized, and, as it happens, best implemented, is as a finite-state automaton (FSA), with arcs and states both labelled with words, and transition probabilities associated with each arc. For example, there will be one state labelled The with one arc to each other state, for example to the state Cat, and this arc will be labelled cat. The reason for labelling both arcs and states with words is that the states constitute the only memory device available to an FSA. To remember that the most recent word was "cat", all arcs labelled cat must fall into the same state Cat. The transition probability from the state The along the unique arc labelled cat to the state Cat will be the probability of the word "cat" following the word "the", P(cat | the).
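To make the construction concrete, here is a minimal sketch of such a bigram model as a table of transition probabilities estimated by counting; the three-utterance toy corpus and the "<s>" boundary token are inventions of the illustration, not anything prescribed by the model itself.

from collections import defaultdict

# Toy corpus: each inner list is one utterance, i.e. a sequence of words.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["the", "dog", "sat"],
]

# Count bigrams; the pseudo-token "<s>" plays the role of the special
# boundary word that starts and ends every utterance.
counts = defaultdict(lambda: defaultdict(int))
for utterance in corpus:
    previous = "<s>"
    for word in utterance + ["<s>"]:
        counts[previous][word] += 1
        previous = word

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 2/3 on this toy corpus

Each outer key of the table corresponds to a state labelled with the most recent word, and each inner entry to an outgoing arc labelled with the next word.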

More generally, we enumerate the words {w_1, ..., w_N} and associate a state S_i with each word w_i. Now the automaton has the states {S_1, ..., S_N}, and from each state S_i there is an arc labelled w_j to state S_j with transition probability P(w_j | w_i), the word bigram probability. To establish the probabilities of each word starting or finishing off the utterance, we introduce the special state S_0 and the special word w_0 that marks the end of the utterance, and associate the arc from S_0 to S_i with the probability of w_i starting an utterance, and the arc from S_i to S_0 with the probability of an utterance ending with word w_i.

If we want to calculate the probability of a word sequence w_i1 ... w_in, we simply multiply the bigram probabilities:

    P(w_i1 ... w_in) = P(w_i1 | w_0) · P(w_i2 | w_i1) · ... · P(w_0 | w_in)
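A minimal sketch of this product, assuming the bigram probabilities have already been estimated; the numbers and the "<s>" stand-in for the boundary word w_0 are invented for the illustration.

# Hypothetical bigram probabilities P(w_j | w_i); "<s>" stands in for the
# boundary word w_0 that both starts and ends the utterance.
P = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.4,
    ("cat", "sat"): 0.3,
    ("sat", "<s>"): 0.6,
}

def sequence_probability(words):
    """P(w_1 ... w_n) = P(w_1 | w_0) * P(w_2 | w_1) * ... * P(w_0 | w_n)."""
    prob = 1.0
    previous = "<s>"
    for word in words + ["<s>"]:
        prob *= P.get((previous, word), 0.0)
        previous = word
    return prob

print(sequence_probability(["the", "cat", "sat"]))   # 0.5 * 0.4 * 0.3 * 0.6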

We now recall something from formal language theory about the equivalence between finite-state automata and regular languages. What does the equivalent regular language look like? Let's just first rename S_0 as S and, by stretching it just a little, let the end-of-utterance marker w_0 be ε, the empty string.

    S    → w_i S_i     P(w_i | ε)
    S_i  → w_j S_j     P(w_j | w_i)
    S_i  → ε           P(ε | w_i)

Does this give us any new insight? Yes, it does! Let's define a string rewrite in the usual way: αAγ ⇒ αβγ if the rule A → β is in the grammar. We can then derive the string w_i1 ... w_in from the top symbol S in n+1 steps:

    S ⇒ w_i1 S_i1 ⇒ w_i1 w_i2 S_i2 ⇒ ... ⇒ w_i1 ... w_in

Now comes the clever bit: if we define the derivation probability as the product of the rewrite probabilities, and identify the rewrite and the rule probabilities, we realize that the string probability is simply the derivation probability. This illustrates one of the most central aspects of probabilistic parsing:

    String probabilities are defined in terms of derivation probabilities.

So the simple word bigram model not only employs highly useful notions from linguistic theory, it implicitly employs the machinery of rewrite rules and derivations from formal language theory, and it also assigns string probabilities in terms of derivation probabilities, just like most probabilistic parsing schemes around. However, the heritage from finite-state automata results in simplistic models of interword dependencies.

General word N-gram models, of which word bigram models are a special case with N equal to two, can be accommodated in very much the same way by introducing states that remember not only the previous word, but the N-1 previous words. This generalization is purely technical and adds little or no linguistic fuel to the model from a theoretical point of view. From a practical point of view, the gain in predictive power using more conditioning in the probability distributions is very quickly overcome by the difficulty in estimating these probability distributions accurately from available training data; the perennial sparse-data problem.

So why does this model look like it does? We conjecture the following explanations: Firstly, it is directly applicable to the representation used by an acoustic speech recognizer, and this can be done efficiently, as it essentially involves intersecting two finite-state automata. Secondly, the model parameters, the word bigram probabilities, can be estimated directly from electronically readable texts, and there is a lot of that available.

3 Tag N-gram Models

Let us now move on to a somewhat more linguistically sophisticated language model, the tag N-gram model. Here, the interaction between words is mediated by part-of-speech (PoS) tags, which constitute linguistically motivated labels that we assign to each word in an utterance. For example, we might look at the basic word classes adjectives, adverbs, articles, conjunctions, nouns, numbers, prepositions, pronouns and verbs, essentially introduced already by the ancient Greek Dionysius Thrax. We immediately realise that this gives us the opportunity to include a vast amount of linguistic knowledge into our model by selecting the set of PoS tags appropriately; consequently, this is a much debated and controversial issue.

Such a representation can be used for disambiguation, as in the case of the well-known, highly ambiguous example sentence "Time flies like an arrow". We can for example prescribe that "Time" is a noun, "flies" is a verb, "like" is a preposition (or adverb, according to your taste), "an" is an article, and that "arrow" is a noun. In effect, a label, i.e., a part-of-speech tag, has been assigned to each word. We realise that words may be assigned different labels in different contexts, or in different readings; for example, if we instead prescribe that "flies" is a noun and "like" is a verb, we get another reading of the sentence.
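The ambiguity can be made concrete with a small sketch that enumerates every candidate tag assignment; the miniature lexicon of possible tags per word below is invented for the illustration.

from itertools import product

# Hypothetical ambiguity classes for the words of the example sentence.
possible_tags = {
    "Time":  ["noun", "verb"],
    "flies": ["verb", "noun"],
    "like":  ["preposition", "verb"],
    "an":    ["article"],
    "arrow": ["noun"],
}

sentence = ["Time", "flies", "like", "an", "arrow"]

# Every way of choosing one tag per word is one candidate reading.
readings = list(product(*(possible_tags[word] for word in sentence)))
print(len(readings))   # 8 candidate tag assignments
print(readings[0])     # ('noun', 'verb', 'preposition', 'article', 'noun')

A statistical tagger of the kind described next ranks such readings by probability instead of merely listing them.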

What does this language model look like in more detail? We can actually recast it in virtually the same terms as the word bigram model, the only difference being that we interpret each state S_i as a PoS tag (in the bigram case, and as a tag sequence in the general N-gram case):

    S    → w_k S_i     P(S_i, w_k | S)
    S_i  → w_k S_j     P(S_j, w_k | S_i)
    S_i  → ε           P(ε | S_i)

Note that we have now separated the words w_k from the states S_i, and that thus, in principle, any state can generate any word. This is actually slightly more powerful than the standard hidden Markov model (HMM) used for N-gram PoS tagging (5). We recast it as follows:

    S    → T_i S_i     P(S_i | S)
    S_i  → T_j S_j     P(S_j | S_i)
    S_i  → ε           P(ε | S_i)
    T_i  → w_k         P(w_k | T_i)

Here we have the rules of the form S_i → T_j S_j, with the corresponding probabilities P(S_j | S_i), encoding the tag N-gram statistics. This is the probability that the tag T_j will follow the tag T_i (in the bigram case), or the sequence encoded by S_i (in the general N-gram case). The rules T_i → w_k with probabilities P(w_k | T_i) are the lexical probabilities, describing the probability of tag T_i being realised as word w_k. The latter probabilities seem a bit backward, as we would rather think in terms of the converse probability P(T_i | w_k) of a particular word w_k being assigned some PoS tag T_i, but one is easily recoverable from the other using Bayesian inversion:

    P(w_k | T_i) = P(T_i | w_k) · P(w_k) / P(T_i)

We now connect the second formulation with the first one by unfolding each rule T_j → w_k into each rule S_i → T_j S_j. This lays bare the independence assumption

    P(S_j, w_k | S_i) = P(S_j | S_i) · P(w_k | T_j)

As should be clear from this correspondence, this type of HMM-based PoS-tagging model can be formulated as a (deterministic) FSA, thus allowing very fast processing, linear in string length.

The word string w_k1 ... w_kn can be derived from the top symbol S in 2n+1 steps:

    S ⇒ T_i1 S_i1 ⇒ w_k1 S_i1 ⇒ w_k1 T_i2 S_i2 ⇒ w_k1 w_k2 S_i2 ⇒ ... ⇒ w_k1 ... w_kn

The interpretation of this is that we start off in the initial state S, select a PoS tag T_i1 at random, according to the probability distribution in state S, then generate the word w_k1 at random according to the lexical distribution associated with tag T_i1, then draw a next PoS tag T_i2 at random according to the transition probabilities associated with state S_i1, hop to the corresponding state S_i2, generate the word w_k2 at random according to the lexical distribution associated with tag T_i2, etcetera.
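This generative reading can be sketched directly as code; the two-tag toy distributions below, and the "</s>" stand-in for the empty-string transition that ends the utterance, are inventions of the example.

import random

# Hypothetical transition probabilities P(S_j | S_i) and lexical
# probabilities P(w_k | T_j); in the bigram case a state is just a tag.
transitions = {
    "<s>":  {"det": 0.8, "noun": 0.2},
    "det":  {"noun": 1.0},
    "noun": {"verb": 0.5, "</s>": 0.5},
    "verb": {"det": 0.6, "</s>": 0.4},
}
lexicon = {
    "det":  {"the": 0.7, "a": 0.3},
    "noun": {"cat": 0.5, "dog": 0.5},
    "verb": {"sat": 0.6, "slept": 0.4},
}

def pick(distribution):
    items, weights = zip(*distribution.items())
    return random.choices(items, weights=weights)[0]

def generate():
    """Draw a tag, emit a word from its lexical distribution, repeat."""
    words, state = [], "<s>"
    while True:
        tag = pick(transitions[state])
        if tag == "</s>":          # the P(epsilon | S_i) case: stop here
            return words
        words.append(pick(lexicon[tag]))
        state = tag

print(generate())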
Another general lesson can be learned from this: if we wish to calculate the probability of a word string, rather than of a word string with a particular tag associated with each word, as the model does as it stands, it would be natural to sum over the set of possible ways of assigning PoS tags to the words of the string. This means that:

    The probability of a word string is the sum of its derivation probabilities.
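As a brute-force sketch of this summation for a tiny bigram tagger: the tables are invented, the end-of-utterance factor is left out for brevity, and a real implementation would use the forward algorithm rather than enumerating every tag assignment.

from itertools import product

# Hypothetical tag transition probabilities P(T_j | T_i) and lexical
# probabilities P(w_k | T_j); "<s>" marks the start of the utterance.
transitions = {
    "<s>":  {"noun": 0.6, "verb": 0.4},
    "noun": {"noun": 0.2, "verb": 0.8},
    "verb": {"noun": 0.7, "verb": 0.3},
}
lexicon = {
    "noun": {"time": 0.5, "flies": 0.5},
    "verb": {"time": 0.2, "flies": 0.8},
}

def string_probability(words):
    """Sum the probability of every tag assignment, i.e. every derivation."""
    tags = list(lexicon)
    total = 0.0
    for assignment in product(tags, repeat=len(words)):
        prob, previous = 1.0, "<s>"
        for word, tag in zip(words, assignment):
            prob *= transitions[previous][tag] * lexicon[tag][word]
            previous = tag
        total += prob
    return total

print(string_probability(["time", "flies"]))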
The model parameters P(S_j | S_i) and P(w_k | T_j) can be estimated essentially in two different ways. The first employs manually annotated training data and the other uses unannotated data and some reestimation technique such as Baum-Welch reestimation (1). In both cases, an optimal set of parameters is sought, which will maximize the probability of the training data, supplemented with a portion of the black art of smoothing. In the former case, we are faced with two major problems: a shortage of training data, and a relatively high noise level in existing data, in terms of annotation inconsistencies. In the latter case, the problems are the instability of the resulting parameters as a function of the initial lexical bias required, and the fact that the chances of finding a global optimum using any computationally feasible technique rapidly approach zero as the size of the model (in terms of the number of tags, and N) increases. Experience has shown that, despite the noise level, annotated training data yields better models.

Let us take a step back and see what we have got: we have the notion of a word, the notion of an utterance, the notion that an utterance is a sequence of words, the machinery of rewrite rules and derivations, and string probabilities that are defined as the sum of the derivation probabilities. In addition to this, we have the possibility to include a lot of linguistic knowledge into the model by selecting an appropriate set of PoS tags. We also need to somehow specify the model parameters P(S_j | S_i) and P(w_k | T_j). Once this is done, the model is completely determined. In particular, the only way that syntactic relations are modelled is by the probability of one PoS tag given the previous tag (or, in the general N-gram case, given the previous N-1 tags). And just as in the case of word N-grams, the sparse-data problem sets severe bounds on N, effectively limiting it to about three.

We conjecture that the explanation of why this model looks like it does is that it was imported wholesale from the field of speech recognition, and proved to allow fast, robust processing at accuracy levels that until recently were superior to, or on par with, those of hand-crafted rule-based approaches.

4 Stochastic Grammar Models

To gain more control over the syntactic relationships between the words, we turn to stochastic context-free grammars (SCFGs), originally proposed by Booth and Thompson (4). This is the framework in which we have already discussed the N-gram models, and it has been the starting point for many excursions into probabilistic-parsing land. A stochastic context-free grammar is really just a context-free grammar where each grammar rule has been assigned a probability. If we keep the left-hand-side (LHS) symbol of the rule fixed, and sum these probabilities over the different RHSs, we get one, since the probabilities are conditioned on the LHS symbol.

The probability of a particular parse tree is the probability of its derivation, which in turn is the product of the probability of each derivation step. The probability of a derivation step is the probability of rewriting a given symbol using some grammar rule, and equals the rule probability. Thus, the parse-tree probability is the product of the rule probabilities. Since the same parse tree can be derived in different ways by first rewriting some symbol and then another, or vice versa, we need to specify the order in which the nonterminal symbols of a sentential form are rewritten. We require that in each derivation step, the leftmost nonterminal is always rewritten, which yields us the leftmost derivation. This establishes a one-to-one correspondence between parse trees and derivations.
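A minimal sketch of this product of rule probabilities under a miniature SCFG; the grammar, the probabilities and the tree are all invented, and lexical probabilities are left out of the sketch.

# Hypothetical SCFG: rule probabilities conditioned on the LHS symbol
# (for each LHS, the probabilities of its alternative RHSs sum to one).
rule_prob = {
    ("S",  ("NP", "VP")):    1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("NP", ("Noun",)):       0.3,
    ("VP", ("Verb", "NP")):  1.0,
}

# A parse tree as (label, children); preterminals get no expansion here.
tree = ("S", [("NP", [("Det", []), ("Noun", [])]),
              ("VP", [("Verb", []), ("NP", [("Noun", [])])])])

def tree_probability(node):
    """Multiply the probabilities of every rule application in the tree."""
    label, children = node
    if not children:                 # a preterminal: no rule applied here
        return 1.0
    rhs = tuple(child[0] for child in children)
    probability = rule_prob[(label, rhs)]
    for child in children:
        probability *= tree_probability(child)
    return probability

print(tree_probability(tree))   # 1.0 * 0.7 * 1.0 * 0.3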

We now have plenty of opportunity to include linguistic theory into our model by the choice of syntactic categories, and by the selection of grammar rules. The probabilistic limitations of the model mirror the expressive power of context-free grammars, as the independence assumptions exactly match the compositionality assumptions. For this reason, there is an efficient algorithm for finding the most probable parse tree, or calculating the string probability, under an SCFG. The algorithm is a variant of the Cocke-Kasami-Younger (CKY) algorithm (17), but can also be seen as an incarnation of a more general dynamic-programming scheme, and it is cubic in string length and grammar size. We conjecture that exactly the properties of SCFGs discussed in this paragraph explain why the model looks like it does.

We again have the choice between training the model parameters, the rule probabilities, on annotated data, or using unannotated data and some reestimation method like the inside-outside algorithm, which is the natural generalization of the Baum-Welch method of the previous section. If the chances of finding a global optimum were slim using the Baum-Welch algorithm, they're virtually zero using the inside-outside algorithm. There is also very much instability in terms of what set of rule probabilities one arrives at as a function of the initial assignment of rule probabilities in the reestimation process. The other option, training on annotated data, is also problematic, as there is precious little of it available, and what exists is quite noisy. A corpus of CFG-analysed sentences is known as a tree bank, and tree banks will be the topic of the next section.

As we have been stressing, the key idea is to assign probabilities to derivation steps. If we instead look at the rightmost derivation in reverse, as constructed by an LR parser, we can take as the derivation probability the probability of the action sequence, i.e., the product of the probabilities of each shift and reduce action in it. This isn't exactly the same thing as an SCFG, since the probabilities are typically not conditioned on the LHS symbol of some grammar rule, but on the current internal state and the current lookahead symbol. As observed by Fernando Pereira (12), this gives us the possibility to throw in a few psycho-linguistic features such as right association and minimal attachment by preferring shift actions to reductions, and longer reductions to shorter ones, respectively. So if these features are present in language, they should show up in our training data, and thus in our language model. Whether these features are introduced or incidental is debatable.

We can take the idea of derivational stochastic grammars one step further and claim that a parse tree constructed by any sequence of derivation actions, regardless of what the derivation actions are, should be assigned the product of the probabilities of each derivation step, appropriately conditioned. This idea will be crucial for the various extensions to SCFGs discussed in the next section.

5 Models Using Tree Banks

As previously mentioned, a tree bank is a corpus of CFG-annotated sentences, i.e., a collection of parse trees. The mere existence of a tree bank actually inspired a statistical language model, namely the data-oriented parsing (DOP) model (3) advocated by Remko Scha and Rens Bod. This model parses not only with the entire tree bank as its grammar, but with a grammar consisting of each subtree of each tree in the tree bank. One interesting consequence of this is that there will in general be many different leftmost derivations of any given parse tree. This can most easily be seen by noting that there is one leftmost derivation for each way of cutting up a parse tree into subtrees. Therefore, the parse probability is defined as the sum of the derivation probabilities, which is the source of the NP-hardness of finding the most probable parse tree for a given input sentence under this model, as demonstrated by Khalil Sima'an (15).
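A small counting sketch makes the combinatorics visible; the parse tree is invented, and the count relies on the usual convention that any nonterminal node below the root may serve as a substitution site.

# A parse tree for the sketch: (nonterminal, children); words are strings.
tree = ("S", [("NP", [("Det", ["the"]), ("Noun", ["cat"])]),
              ("VP", [("Verb", ["sat"])])])

def nonterminal_nodes(node):
    """Collect the nonterminal nodes of the tree (words are plain strings)."""
    if isinstance(node, str):
        return []
    label, children = node
    nodes = [node]
    for child in children:
        nodes.extend(nonterminal_nodes(child))
    return nodes

# Each nonterminal node below the root can independently be a cut point,
# so there are 2**(n - 1) ways of cutting the tree into subtrees, and
# hence that many leftmost derivations of it under the DOP model.
n = len(nonterminal_nodes(tree))
print(2 ** (n - 1))   # 2**5 = 32 for this little tree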
There aren't really that many tree banks around, and the by far most popular one for experimenting with probabilistic parsing is the Penn Treebank (11). This leads us to the final source of influence on the linguistic theory employed in statistical language learning: the available training and testing data.

The annotators of the Penn Treebank may have overrated the minimal-attachment principle, resulting in very flat rules with a minimum of recursion, and thus in very many rules. In fact, the Wall-Street-Journal portion of it consists of about a million words analysed using literally tens of thousands of distinct grammar rules. For example, there is one rule of the form

    NP → Det Noun (, Noun)^n Conj Noun

for each value of n seen in the corpus. There is not even close to enough data to accurately estimate the probabilities of most rules seen in the training data, let alone to achieve any type of robustness for unseen rules. This inspired David Magerman and subsequently Michael Collins to instead generate the RHS dynamically during parsing.

Magerman (10) grounded this in the idea that a parse tree is constructed by a sequence of generalized derivation actions and the derivation probability is the parse probability, a framework that is sometimes referred to as history-based parsing (2), at least when decision trees are employed to determine the probability of each derivation action taken. More specifically, to allow us to assemble the RHSs as we go along, any previously constructed syntactic constituent is assigned the role of the leftmost, rightmost, middle or single daughter of some other constituent with some probability. It may or may not also be the syntactic head of the other constituent, and here we have another piece of highly useful linguistic theory incorporated into a statistical language model: the grammatical notion of a syntactic head. The idea here is to propagate up the lexical head to use (amongst other things) lexical collocation statistics on the dependency level to determine the constituent boundaries and attachment preferences.

Collins (6; 7) followed up on these ideas and added further elegance to the scheme by instead generating the head daughter first, and then the rest of the daughters as two zero-order Markov processes, one going left and one going right from it. He also managed to adapt essentially the standard SCFG parsing scheme to his model, thus allowing polynomial processing time. It is interesting to note that although the conditioning of the probabilities is top-down, parsing is performed bottom-up, just as is the case with SCFGs. This allows him to condition his probabilities on the word string dominated by the constituent, which he does in terms of a distance between the head constituent and the current one being generated. This in turn makes it possible to let phrase-boundary indicators such as punctuation marks influence the probabilities, and gives the model the chance to infer preferences for, e.g., right association.

In addition to this, Collins incorporated the notion of lexical complements and wh-movement à la Generalized Phrase-Structure Grammar (GPSG) (8) into his probabilistic language model. The former is done by knocking off complements from a hypothesised complement list as the Markov chain of siblings of the head constituent is generated. The latter is achieved by adding hypothesised NP gaps to these lists, requiring that they be either matched against an NP on the complement list, or passed on to one of the sibling constituents or the head constituent itself, thus mimicking the behavior of the "slash feature" used in GPSG. The model learns the probabilities for these rather sophisticated derivation actions under various conditionings. Not bad for something that started out as a simple SCFG!

6 A Non-Derivational Model

The Constraint Grammar framework (9) introduced by Fred Karlsson and championed by Atro Voutilainen is a grammar formalism without derivations. It's not even constructive, but actually rather destructive. In fact, most of it is concerned with destroying hypotheses. Of course, you first have to have some hypotheses if you are going to destroy them, so there are a few components whose task it is to generate hypotheses. The first one is a lexicon, which assigns a set of possible morphological readings


Treebank". Computational Linguistics 19(2), pp. I 313-330. ACL. [12] Fernando Pereira. 1985. "A New Character- aa ization of Attachment Preferences". In Natural Language Parsing, pp. 307-319. Cambridge Uni- versity Press. [13] Christer Samuelsson and Atro Voutilainen. I 1997. "Comparing a Linguistic and a Stochas- tic Tagger". In Procs. Joint 85th Annual Meet- ing of the Association for Computational Linguis- tics and 8th Conference of the European Chapter el of the Association for Computational Linguistics, pp. 246-253. ACL. HI [14] Christer Samuelsson, Pasi Tapanainen and Atro Voutilainen. 1996. "Inducing Constraint Grammars". In Grammatical Inference: Learn- ing Syntaz from Sentences, pp. 146-155, Springer m Verlag. [15] Khalil Sima'an. 1996. "Computational Corn- m plexity of Probabilistic Disambiguations by means of Tree-Grammars". In Procs. 16th International Conference on Computational Linguistics, at the m very end. ICCL. [16] Pasi Tapanainen. 1996. The Constraint Gram- mar Parser CG-~. Publ. 27, Dept. General Lin- m guistics, University of Helsinki. [17] David H. Younger 1967. "Recognition and Parsing of Context-Free Languages in Time n 3". m In Information and Control 10(2), pp. 189-208. i i Hi m i i i
