Linguistic Theory in Statistical Language Learning

Christer Samuelsson
Bell Laboratories, Lucent Technologies
600 Mountain Ave, Room 2D-339, Murray Hill, NJ 07974, USA
christer@research.bell-labs.com

Abstract

This article attempts to determine what elements of linguistic theory are used in statistical language learning, and why the extracted language models look like they do. The study indicates that some linguistic elements, such as the notion of a word, are simply too useful to be ignored. The second most important factor seems to be features inherited from the original task for which the technique was used, for example using hidden Markov models for part-of-speech tagging, rather than speech recognition. The two remaining important factors are properties of the runtime processing scheme employing the extracted language model, and the properties of the available corpus resources to which the statistical learning techniques are applied. Deliberate attempts to include linguistic theory seem to end up in a fifth place.

1 Introduction

What role does linguistics play in statistical language learning? "None at all!" might be the answer, if we ask hard-core speech-recognition professionals. But even the most nonlinguistic language model, for example a statistical word bigram model, actually relies on key concepts integral to virtually all linguistic theories. Words, for example, and the notion that sequences of words form utterances.

Statistical language learning is applied to some set of data to extract a language model of some kind. This language model can serve a purely decorative purpose, but is more often than not used to process data in some way, for example to aid speech recognition. Anyone working under the pressure of producing better results, and who employs language models to this purpose, such as researchers in the field of speech recognition, will have a high incentive to incorporate useful aspects of language into his or her language models. Now, the most useful, and thus least controversial, ways of describing language will, due to their usefulness, find their way into most linguistic theories and, for the very same reason, be used in models that strive to model language successfully.

So what do the linguistic theories underlying various statistical language models look like? And why? It may be useful to distinguish between those aspects of linguistic theory that are incidentally in the language model, and those that are there intentionally. We will start our tour of statistical language learning by inspecting language models with "very little" linguistic content, and then proceed to analyse increasingly more linguistic models, until we end with models that are entirely linguistic, in the sense that they are pure grammars, associated with no statistical parameters.

2 Word N-gram Models

Let us return to the simple bigram word model, where the probability of each next word is determined from the current one. We already noted that this model relies on the notion of a word, the notion of an utterance, and the notion that an utterance is a sequence of words.

The way this model is best visualized, and, as it happens, best implemented, is as a finite-state automaton (FSA), with arcs and states both labelled with words, and transition probabilities associated with each arc. For example, there will be one state labelled The with one arc to each other state, for example to the state Cat, and this arc will be labelled cat. The reason for labelling both arcs and states with words is that the states constitute the only memory device available to an FSA. To remember that the most recent word was "cat", all arcs labelled cat must fall into the same state Cat. The transition probability from the state The along the unique arc labelled cat to the state Cat will be the probability of the word "cat" following the word "the", P(cat | the).
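To make the construction concrete, here is a minimal sketch of such a bigram model as a table of transition probabilities estimated by counting; the three-utterance toy corpus and the "<s>" boundary token are inventions of the illustration, not anything prescribed by the model itself.

from collections import defaultdict

# Toy corpus: each inner list is one utterance, i.e. a sequence of words.
corpus = [
    ["the", "cat", "sat"],
    ["the", "cat", "slept"],
    ["the", "dog", "sat"],
]

# Count bigrams; the pseudo-token "<s>" plays the role of the special
# boundary word that starts and ends every utterance.
counts = defaultdict(lambda: defaultdict(int))
for utterance in corpus:
    previous = "<s>"
    for word in utterance + ["<s>"]:
        counts[previous][word] += 1
        previous = word

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))   # 2/3 on this toy corpus

Each outer key of the table corresponds to a state labelled with the most recent word, and each inner entry to an outgoing arc labelled with the next word.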

More generally, we enumerate the words {w_1, ..., w_N} and associate a state S_i with each word w_i. Now the automaton has the states {S_1, ..., S_N}, and from each state S_i there is an arc labelled w_j to state S_j with transition probability P(w_j | w_i), the word bigram probability. To establish the probabilities of each word starting or finishing off the utterance, we introduce the special state S_0 and the special word w_0 that marks the end of the utterance, and associate the arc from S_0 to S_i with the probability of w_i starting an utterance, and the arc from S_i to S_0 with the probability of an utterance ending with word w_i.

If we want to calculate the probability of a word sequence w_i1 ... w_in, we simply multiply the bigram probabilities:

    P(w_i1 ... w_in) = P(w_i1 | w_0) · P(w_i2 | w_i1) · ... · P(w_0 | w_in)
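A minimal sketch of this product, assuming the bigram probabilities have already been estimated; the numbers and the "<s>" stand-in for the boundary word w_0 are invented for the illustration.

# Hypothetical bigram probabilities P(w_j | w_i); "<s>" stands in for the
# boundary word w_0 that both starts and ends the utterance.
P = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.4,
    ("cat", "sat"): 0.3,
    ("sat", "<s>"): 0.6,
}

def sequence_probability(words):
    """P(w_1 ... w_n) = P(w_1 | w_0) * P(w_2 | w_1) * ... * P(w_0 | w_n)."""
    prob = 1.0
    previous = "<s>"
    for word in words + ["<s>"]:
        prob *= P.get((previous, word), 0.0)
        previous = word
    return prob

print(sequence_probability(["the", "cat", "sat"]))   # 0.5 * 0.4 * 0.3 * 0.6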

We now recall something from formal language theory about the equivalence between finite-state automata and regular languages. What does the equivalent regular language look like? Let's just first rename S_0 as S and, by stretching it just a little, let the end-of-utterance marker w_0 be ε, the empty string.

    S    → w_i S_i     P(w_i | ε)
    S_i  → w_j S_j     P(w_j | w_i)
    S_i  → ε           P(ε | w_i)

Does this give us any new insight? Yes, it does! Let's define a string rewrite in the usual way: αAγ ⇒ αβγ if the rule A → β is in the grammar. We can then derive the string w_i1 ... w_in from the top symbol S in n+1 steps:

    S ⇒ w_i1 S_i1 ⇒ w_i1 w_i2 S_i2 ⇒ ... ⇒ w_i1 ... w_in

Now comes the clever bit: if we define the derivation probability as the product of the rewrite probabilities, and identify the rewrite and the rule probabilities, we realize that the string probability is simply the derivation probability. This illustrates one of the most central aspects of probabilistic parsing:

    String probabilities are defined in terms of derivation probabilities.

So the simple word bigram model not only employs highly useful notions from linguistic theory, it implicitly employs the machinery of rewrite rules and derivations from formal language theory, and it also assigns string probabilities in terms of derivation probabilities, just like most probabilistic parsing schemes around. However, the heritage from finite-state automata results in simplistic models of interword dependencies.

General word N-gram models, of which word bigram models are a special case with N equal to two, can be accommodated in very much the same way by introducing states that remember not only the previous word, but the N-1 previous words. This generalization is purely technical and adds little or no linguistic fuel to the model from a theoretical point of view. From a practical point of view, the gain in predictive power using more conditioning in the probability distributions is very quickly overcome by the difficulty in estimating these probability distributions accurately from available training data; the perennial sparse-data problem.

So why does this model look like it does? We conjecture the following explanations: Firstly, it is directly applicable to the representation used by an acoustic speech recognizer, and this can be done efficiently, as it essentially involves intersecting two finite-state automata. Secondly, the model parameters, the word bigram probabilities, can be estimated directly from electronically readable texts, and there is a lot of that available.

3 Tag N-gram Models

Let us now move on to a somewhat more linguistically sophisticated language model, the tag N-gram model. Here, the interaction between words is mediated by part-of-speech (PoS) tags, which constitute linguistically motivated labels that we assign to each word in an utterance. For example, we might look at the basic word classes adjectives, adverbs, articles, conjunctions, nouns, numbers, prepositions, pronouns and verbs, essentially introduced already by the ancient Greek Dionysius Thrax. We immediately realise that this gives us the opportunity to include a vast amount of linguistic knowledge into our model by selecting the set of PoS tags appropriately; consequently, this is a much debated and controversial issue.

Such a representation can be used for disambiguation, as in the case of the well-known, highly ambiguous example sentence "Time flies like an arrow". We can for example prescribe that "Time" is a noun, "flies" is a verb, "like" is a preposition (or adverb, according to your taste), "an" is an article, and that "arrow" is a noun. In effect, a label, i.e., a part-of-speech tag, has been assigned to each word. We realise that words may be assigned different labels in different contexts, or in different readings; for example, if we instead prescribe that "flies" is a noun and "like" is a verb, we get another reading of the sentence.
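The ambiguity can be made concrete with a small sketch that enumerates every candidate tag assignment; the miniature lexicon of possible tags per word below is invented for the illustration.

from itertools import product

# Hypothetical ambiguity classes for the words of the example sentence.
possible_tags = {
    "Time":  ["noun", "verb"],
    "flies": ["verb", "noun"],
    "like":  ["preposition", "verb"],
    "an":    ["article"],
    "arrow": ["noun"],
}

sentence = ["Time", "flies", "like", "an", "arrow"]

# Every way of choosing one tag per word is one candidate reading.
readings = list(product(*(possible_tags[word] for word in sentence)))
print(len(readings))   # 8 candidate tag assignments
print(readings[0])     # ('noun', 'verb', 'preposition', 'article', 'noun')

A statistical tagger of the kind described next ranks such readings by probability instead of merely listing them.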

What does this language model look like in more detail? We can actually recast it in virtually the same terms as the word bigram model, the only difference being that we interpret each state S_i as a PoS tag (in the bigram case, and as a tag sequence in the general N-gram case):

    S    → w_k S_i     P(S_i, w_k | S)
    S_i  → w_k S_j     P(S_j, w_k | S_i)
    S_i  → ε           P(ε | S_i)

Note that we have now separated the words w_k from the states S_i, and that thus, in principle, any state can generate any word. This is actually slightly more powerful than the standard hidden Markov model (HMM) used for N-gram PoS tagging (5). We recast it as follows:

    S    → T_i S_i     P(S_i | S)
    S_i  → T_j S_j     P(S_j | S_i)
    S_i  → ε           P(ε | S_i)
    T_i  → w_k         P(w_k | T_i)

Here we have the rules of the form S_i → T_j S_j, with the corresponding probabilities P(S_j | S_i), encoding the tag N-gram statistics. This is the probability that the tag T_j will follow the tag T_i (in the bigram case), or the sequence encoded by S_i (in the general N-gram case). The rules T_i → w_k with probabilities P(w_k | T_i) are the lexical probabilities, describing the probability of tag T_i being realised as word w_k. The latter probabilities seem a bit backward, as we would rather think in terms of the converse probability P(T_i | w_k) of a particular word w_k being assigned some PoS tag T_i, but one is easily recoverable from the other using Bayesian inversion:

    P(w_k | T_i) = P(T_i | w_k) · P(w_k) / P(T_i)

We now connect the second formulation with the first one by unfolding each rule T_j → w_k into each rule S_i → T_j S_j. This lays bare the independence assumption

    P(S_j, w_k | S_i) = P(S_j | S_i) · P(w_k | T_j)

As should be clear from this correspondence, this type of HMM-based PoS-tagging model can be formulated as a (deterministic) FSA, thus allowing very fast processing, linear in string length.

The word string w_k1 ... w_kn can be derived from the top symbol S in 2n+1 steps:

    S ⇒ T_i1 S_i1 ⇒ w_k1 S_i1 ⇒ w_k1 T_i2 S_i2 ⇒ w_k1 w_k2 S_i2 ⇒ ... ⇒ w_k1 ... w_kn

The interpretation of this is that we start off in the initial state S, select a PoS tag T_i1 at random, according to the probability distribution in state S, then generate the word w_k1 at random according to the lexical distribution associated with tag T_i1, then draw a next PoS tag T_i2 at random according to the transition probabilities associated with state S_i1, hop to the corresponding state S_i2, generate the word w_k2 at random according to the lexical distribution associated with tag T_i2, etcetera.
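This generative reading can be sketched directly as code; the two-tag toy distributions below, and the "</s>" stand-in for the empty-string transition that ends the utterance, are inventions of the example.

import random

# Hypothetical transition probabilities P(S_j | S_i) and lexical
# probabilities P(w_k | T_j); in the bigram case a state is just a tag.
transitions = {
    "<s>":  {"det": 0.8, "noun": 0.2},
    "det":  {"noun": 1.0},
    "noun": {"verb": 0.5, "</s>": 0.5},
    "verb": {"det": 0.6, "</s>": 0.4},
}
lexicon = {
    "det":  {"the": 0.7, "a": 0.3},
    "noun": {"cat": 0.5, "dog": 0.5},
    "verb": {"sat": 0.6, "slept": 0.4},
}

def pick(distribution):
    items, weights = zip(*distribution.items())
    return random.choices(items, weights=weights)[0]

def generate():
    """Draw a tag, emit a word from its lexical distribution, repeat."""
    words, state = [], "<s>"
    while True:
        tag = pick(transitions[state])
        if tag == "</s>":          # the P(epsilon | S_i) case: stop here
            return words
        words.append(pick(lexicon[tag]))
        state = tag

print(generate())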
Another general lesson can be learned from this: if we wish to calculate the probability of a word string, rather than of a word string with a particular tag associated with each word, as the model does as it stands, it would be natural to sum over the set of possible ways of assigning PoS tags to the words of the string. This means that:

    The probability of a word string is the sum of its derivation probabilities.
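As a brute-force sketch of this summation for a tiny bigram tagger: the tables are invented, the end-of-utterance factor is left out for brevity, and a real implementation would use the forward algorithm rather than enumerating every tag assignment.

from itertools import product

# Hypothetical tag transition probabilities P(T_j | T_i) and lexical
# probabilities P(w_k | T_j); "<s>" marks the start of the utterance.
transitions = {
    "<s>":  {"noun": 0.6, "verb": 0.4},
    "noun": {"noun": 0.2, "verb": 0.8},
    "verb": {"noun": 0.7, "verb": 0.3},
}
lexicon = {
    "noun": {"time": 0.5, "flies": 0.5},
    "verb": {"time": 0.2, "flies": 0.8},
}

def string_probability(words):
    """Sum the probability of every tag assignment, i.e. every derivation."""
    tags = list(lexicon)
    total = 0.0
    for assignment in product(tags, repeat=len(words)):
        prob, previous = 1.0, "<s>"
        for word, tag in zip(words, assignment):
            prob *= transitions[previous][tag] * lexicon[tag][word]
            previous = tag
        total += prob
    return total

print(string_probability(["time", "flies"]))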
The model parameters P(S_j | S_i) and P(w_k | T_j) can be estimated essentially in two different ways. The first employs manually annotated training data and the other uses unannotated data and some reestimation technique such as Baum-Welch reestimation (1). In both cases, an optimal set of parameters is sought, which will maximize the probability of the training data, supplemented with a portion of the black art of smoothing. In the former case, we are faced with two major problems: a shortage of training data, and a relatively high noise level in existing data, in terms of annotation inconsistencies. In the latter case, the problems are the instability of the resulting parameters as a function of the initial lexical bias required, and the fact that the chances of finding a global optimum using any computationally feasible technique rapidly approach zero as the size of the model (in terms of the number of tags, and N) increases. Experience has shown that, despite the noise level, annotated training data yields better models.

Let us take a step back and see what we have got: we have the notion of a word, the notion of an utterance, the notion that an utterance is a sequence of words, the machinery of rewrite rules and derivations, and string probabilities that are defined as the sum of the derivation probabilities. In addition to this, we have the possibility to include a lot of linguistic knowledge into the model by selecting an appropriate set of PoS tags. We also need to somehow specify the model parameters P(S_j | S_i) and P(w_k | T_j). Once this is done, the model is completely determined. In particular, the only way that syntactic relations are modelled is by the probability of one PoS tag given the previous tag (or, in the general N-gram case, given the previous N-1 tags). And just as in the case of word N-grams, the sparse-data problem sets severe bounds on N, effectively limiting it to about three.

We conjecture that the explanation of why this model looks like it does is that it was imported wholesale from the field of speech recognition, and proved to allow fast, robust processing at accuracy levels that until recently were superior to, or on par with, those of hand-crafted rule-based approaches.

4 Stochastic Grammar Models

To gain more control over the syntactic relationships between the words, we turn to stochastic context-free grammars (SCFGs), originally proposed by Booth and Thompson (4). This is the framework in which we have already discussed the N-gram models, and it has been the starting point for many excursions into probabilistic-parsing land. A stochastic context-free grammar is really just a context-free grammar where each grammar rule has been assigned a probability. If we keep the left-hand-side (LHS) symbol of the rule fixed, and sum these probabilities over the different RHSs, we get one, since the probabilities are conditioned on the LHS symbol.

The probability of a particular parse tree is the probability of its derivation, which in turn is the product of the probability of each derivation step. The probability of a derivation step is the probability of rewriting a given symbol using some grammar rule, and equals the rule probability. Thus, the parse-tree probability is the product of the rule probabilities. Since the same parse tree can be derived in different ways by first rewriting some symbol and then another, or vice versa, we need to specify the order in which the nonterminal symbols of a sentential form are rewritten. We require that in each derivation step, the leftmost nonterminal is always rewritten, which yields us the leftmost derivation. This establishes a one-to-one correspondence between parse trees and derivations.
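A minimal sketch of this product of rule probabilities under a miniature SCFG; the grammar, the probabilities and the tree are all invented, and lexical probabilities are left out of the sketch.

# Hypothetical SCFG: rule probabilities conditioned on the LHS symbol
# (for each LHS, the probabilities of its alternative RHSs sum to one).
rule_prob = {
    ("S",  ("NP", "VP")):    1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("NP", ("Noun",)):       0.3,
    ("VP", ("Verb", "NP")):  1.0,
}

# A parse tree as (label, children); preterminals get no expansion here.
tree = ("S", [("NP", [("Det", []), ("Noun", [])]),
              ("VP", [("Verb", []), ("NP", [("Noun", [])])])])

def tree_probability(node):
    """Multiply the probabilities of every rule application in the tree."""
    label, children = node
    if not children:                 # a preterminal: no rule applied here
        return 1.0
    rhs = tuple(child[0] for child in children)
    probability = rule_prob[(label, rhs)]
    for child in children:
        probability *= tree_probability(child)
    return probability

print(tree_probability(tree))   # 1.0 * 0.7 * 1.0 * 0.3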

We now have plenty of opportunity to include linguistic theory into our model by the choice of syntactic categories, and by the selection of grammar rules. The probabilistic limitations of the model mirror the expressive power of context-free grammars, as the independence assumptions exactly match the compositionality assumptions. For this reason, there is an efficient algorithm for finding the most probable parse tree, or calculating the string probability, under an SCFG. The algorithm is a variant of the Cocke-Kasami-Younger (CKY) algorithm (17), but can also be seen as an incarnation of a more general dynamic-programming scheme, and it is cubic in string length and grammar size. We conjecture that exactly the properties of SCFGs discussed in this paragraph explain why the model looks like it does.

We again have the choice between training the model parameters, the rule probabilities, on annotated data, or using unannotated data and some reestimation method like the inside-outside algorithm, which is the natural generalization of the Baum-Welch method of the previous section. If the chances of finding a global optimum were slim using the Baum-Welch algorithm, they're virtually zero using the inside-outside algorithm. There is also very much instability in terms of what set of rule probabilities one arrives at as a function of the initial assignment of rule probabilities in the reestimation process. The other option, training on annotated data, is also problematic, as there is precious little of it available, and what exists is quite noisy. A corpus of CFG-analysed sentences is known as a tree bank, and tree banks will be the topic of the next section.

As we have been stressing, the key idea is to assign probabilities to derivation steps. If we instead look at the rightmost derivation in reverse, as constructed by an LR parser, we can take as the derivation probability the probability of the action sequence, i.e., the product of the probabilities of each shift and reduce action in it. This isn't exactly the same thing as an SCFG, since the probabilities are typically not conditioned on the LHS symbol of some grammar rule, but on the current internal state and the current lookahead symbol. As observed by Fernando Pereira (12), this gives us the possibility to throw in a few psycho-linguistic features such as right association and minimal attachment by preferring shift actions to reductions, and longer reductions to shorter ones, respectively. So if these features are present in language, they should show up in our training data, and thus in our language model. Whether these features are introduced or incidental is debatable.

We can take the idea of derivational stochastic grammars one step further and claim that a parse tree constructed by any sequence of derivation actions, regardless of what the derivation actions are, should be assigned the product of the probabilities of each derivation step, appropriately conditioned. This idea will be crucial for the various extensions to SCFGs discussed in the next section.

5 Models Using Tree Banks

As previously mentioned, a tree bank is a corpus of CFG-annotated sentences, i.e., a collection of parse trees. The mere existence of a tree bank actually inspired a statistical language model, namely the data-oriented parsing (DOP) model (3) advocated by Remko Scha and Rens Bod. This model parses not only with the entire tree bank as its grammar, but with a grammar consisting of each subtree of each tree in the tree bank. One interesting consequence of this is that there will in general be many different leftmost derivations of any given parse tree. This can most easily be seen by noting that there is one leftmost derivation for each way of cutting up a parse tree into subtrees. Therefore, the parse probability is defined as the sum of the derivation probabilities, which is the source of the NP-hardness of finding the most probable parse tree for a given input sentence under this model, as demonstrated by Khalil Sima'an (15).
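A small counting sketch makes the combinatorics visible; the parse tree is invented, and the count relies on the usual convention that any nonterminal node below the root may serve as a substitution site.

# A parse tree for the sketch: (nonterminal, children); words are strings.
tree = ("S", [("NP", [("Det", ["the"]), ("Noun", ["cat"])]),
              ("VP", [("Verb", ["sat"])])])

def nonterminal_nodes(node):
    """Collect the nonterminal nodes of the tree (words are plain strings)."""
    if isinstance(node, str):
        return []
    label, children = node
    nodes = [node]
    for child in children:
        nodes.extend(nonterminal_nodes(child))
    return nodes

# Each nonterminal node below the root can independently be a cut point,
# so there are 2**(n - 1) ways of cutting the tree into subtrees, and
# hence that many leftmost derivations of it under the DOP model.
n = len(nonterminal_nodes(tree))
print(2 ** (n - 1))   # 2**5 = 32 for this little tree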
There aren't really that many tree banks around, and the by far most popular one for experimenting with probabilistic parsing is the Penn Treebank (11). This leads us to the final source of influence on the linguistic theory employed in statistical language learning: the available training and testing data.

The annotators of the Penn Treebank may have overrated the minimal-attachment principle, resulting in very flat rules with a minimum of recursion, and thus in very many rules. In fact, the Wall-Street-Journal portion of it consists of about a million words analysed using literally tens of thousands of distinct grammar rules. For example, there is one rule of the form

    NP → Det Noun (, Noun)^n Conj Noun

for each value of n seen in the corpus. There is not even close to enough data to accurately estimate the probabilities of most rules seen in the training data, let alone to achieve any type of robustness for unseen rules. This inspired David Magerman and subsequently Michael Collins to instead generate the RHS dynamically during parsing.

Magerman (10) grounded this in the idea that a parse tree is constructed by a sequence of generalized derivation actions and the derivation probability is the parse probability, a framework that is sometimes referred to as history-based parsing (2), at least when decision trees are employed to determine the probability of each derivation action taken. More specifically, to allow us to assemble the RHSs as we go along, any previously constructed syntactic constituent is assigned the role of the leftmost, rightmost, middle or single daughter of some other constituent with some probability. It may or may not also be the syntactic head of the other constituent, and here we have another piece of highly useful linguistic theory incorporated into a statistical language model: the grammatical notion of a syntactic head. The idea here is to propagate up the lexical head to use (amongst other things) lexical collocation statistics on the dependency level to determine the constituent boundaries and attachment preferences.

Collins (6; 7) followed up on these ideas and added further elegance to the scheme by instead generating the head daughter first, and then the rest of the daughters as two zero-order Markov processes, one going left and one going right from it. He also managed to adapt essentially the standard SCFG parsing scheme to his model, thus allowing polynomial processing time. It is interesting to note that although the conditioning of the probabilities is top-down, parsing is performed bottom-up, just as is the case with SCFGs. This allows him to condition his probabilities on the word string dominated by the constituent, which he does in terms of a distance between the head constituent and the current one being generated. This in turn makes it possible to let phrase-boundary indicators such as punctuation marks influence the probabilities, and gives the model the chance to infer preferences for, e.g., right association.

In addition to this, Collins incorporated the notion of lexical complements and wh-movement à la Generalized Phrase-Structure Grammar (GPSG) (8) into his probabilistic language model. The former is done by knocking off complements from a hypothesised complement list as the Markov chain of siblings of the head constituent is generated. The latter is achieved by adding hypothesised NP gaps to these lists, requiring that they be either matched against an NP on the complement list, or passed on to one of the sibling constituents or the head constituent itself, thus mimicking the behavior of the "slash feature" used in GPSG. The model learns the probabilities for these rather sophisticated derivation actions under various conditionings. Not bad for something that started out as a simple SCFG!

6 A Non-Derivational Model

The Constraint Grammar framework (9) introduced by Fred Karlsson and championed by Atro Voutilainen is a grammar formalism without derivations. It's not even constructive, but actually rather destructive. In fact, most of it is concerned with destroying hypotheses. Of course, you first have to have some hypotheses if you are going to destroy them, so there are a few components whose task it is to generate hypotheses. The first one is a lexicon, which assigns a set of possible morphological readings


Treebank". Computational Linguistics 19(2), pp. I 313-330. ACL. [12] Fernando Pereira. 1985. "A New Character- aa ization of Attachment Preferences". In Natural Language Parsing, pp. 307-319. Cambridge Uni- versity Press. [13] Christer Samuelsson and Atro Voutilainen. I 1997. "Comparing a Linguistic and a Stochas- tic Tagger". In Procs. Joint 85th Annual Meet- ing of the Association for Computational Linguis- tics and 8th Conference of the European Chapter el of the Association for Computational Linguistics, pp. 246-253. ACL. HI [14] Christer Samuelsson, Pasi Tapanainen and Atro Voutilainen. 1996. "Inducing Constraint Grammars". In Grammatical Inference: Learn- ing Syntaz from Sentences, pp. 146-155, Springer m Verlag. [15] Khalil Sima'an. 1996. "Computational Corn- m plexity of Probabilistic Disambiguations by means of Tree-Grammars". In Procs. 16th International Conference on Computational Linguistics, at the m very end. ICCL. [16] Pasi Tapanainen. 1996. The Constraint Gram- mar Parser CG-~. Publ. 27, Dept. General Lin- m guistics, University of Helsinki. [17] David H. Younger 1967. "Recognition and Parsing of Context-Free Languages in Time n 3". m In Information and Control 10(2), pp. 189-208. i i Hi m i i i
