
this discretion to pursue political and economic ends. Most experiences, however, suggest the limited power of privatization in changing the modes of governance which are prevalent in each country's large private companies. Further, those countries which have chosen the mass (voucher) privatization route have done so largely out of necessity and face ongoing efficiency problems as a result. In the UK, a country whose privatization policies are often referred to as a benchmark, 'control [of privatized companies] is not exerted in the forms of threats of take-over or bankruptcy; nor has it for the most part come from direct investor intervention' (Bishop et al. 1994, p. 11). After the steep rise experienced in the immediate aftermath of privatizations, the slow but constant decline in the number of small shareholders highlights the difficulties in sustaining people's capitalism in the longer run. In Italy, for example, privatization was accompanied by a legislative effort aimed at providing noncontrolling shareholders (i.e. both individual and collective investors) with more adequate safeguards and at introducing the necessary conditions to allow them to monitor managers. But successive governments were unsuccessful in broadening the number of large private business groups, whereas enhancing the mobility of control to investors outside the traditional core of Italy's capitalism was explicitly included among the authorities' strategic goals. On the other hand, experiences such as those of the UK and Chile underscore that mass sell-offs require the development of new institutional investors such as pension funds that may later play an active role in corporate governance. (In Chile, for example, the takeover of the country's dominant electricity utility, Enersis, one of the largest in emerging markets, was stalled for some months in 1998 as pension funds questioned lucrative additional terms that the management had negotiated for themselves based on important agreements concerning the future strategic direction of Enersis that they had never told other shareholders about.)

To conclude, although privatization's promise has been frequently oversold, not least by international organizations, its ills have also been greatly exaggerated. When ownership transfer has been accompanied by market liberalization and proper implementation, in OECD and non-OECD countries alike, consumers and end-users have benefited in terms of choice, quality, and prices. If public opinion and policy-makers wish so, much remains to be sold, and the future challenges for the regulatory state remain substantial to ensure maximization of welfare benefits.

See also: Business Law; Capitalism: Global; Corporate Finance: Financial Control; Corporate Governance; Corporate Law; International Law and Treaties; International Trade: Commercial Policy and Trade Negotiations; Law and Economics; Law and Economics: Empirical Dimensions; Liberalism; Markets and the Law; Monetary Policy; Multinational Corporations; Regulation, Economic Theory of; Regulation: Working Conditions; Smith, Adam (1723–90); Socialism; Venture Capital

Bibliography

Armijo L 1998 Balance sheet or ballot box? Incentives to privatize in emerging democracies. In: Oxhorn P, Starr P (eds.) The Problematic Relationship between Economic and Political Liberalization. Lynne Rienner, Boulder, CO
Bishop M, Kay J, Mayer C 1994 Introduction: privatization in performance. In: Bishop M, Kay J, Mayer C (eds.) Privatization and Economic Performance. Oxford University Press, Oxford, UK
Boubakri N, Cosset J-C 1998 The financial and operating performance of newly privatized firms: evidence from developing countries. Journal of Finance 53: 1081–110
Estrin S, Rosevear A 1999 Enterprise performance and ownership: The case of Ukraine. European Economic Review 43: 1125–36
Evans P 1978 Dependent Development. Princeton University Press, Princeton, NJ
Meggison W, Netter J 1999 From state to market: a survey of empirical studies on privatization. Fondazione Eni Enrico Mattei Working Paper, No. 01-99
Nicoletti G 2000 The implementation and the effects of regulatory reform: past experience and current issues. Mimeo. OECD Economics Department, Paris
Schamis H M 1998 Distributional coalitions and the politics of economic reform in Latin America. World Politics 51: 236–68
Shirley M M 1998 Bureaucrats in business: the roles of privatization versus corporatization in state-owned enterprise reform. World Development 27: 115–36
Spicer A, McDermott G, Kogut B 1999 Entrepreneurship and privatization in Central Europe: the tenuous balance between destruction and creation. The Wharton School, Reginald H. Jones Center Working Paper, No. 99-04

A. E. Goldstein

Probabilistic Grammars and their Applications

1. Introduction

Natural language processing is the use of computers for processing natural language text or speech. Machine translation (the automatic translation of text or speech from one language to another) began with the very earliest computers (Kay et al. 1994). Natural language interfaces permit computers to interact with humans using natural language, for example, to query databases. Coupled with speech recognition and speech synthesis, these capabilities will become more important with the growing popularity of portable computers that lack keyboards and large display screens.

Other applications include spell and grammar checking and document summarization. Applications outside of natural language include compilers, which translate source code into lower-level machine code, and computer vision (Fu 1974, 1982).

Most natural language processing systems are based on formal grammars. The development and study of formal grammars is known as computational linguistics. A grammar is a description of a language; usually it identifies the sentences of the language and provides descriptions of them, for example, by defining the phrases of a sentence, their interrelationships, and perhaps also aspects of their meanings. Parsing is the process of recovering a sentence's description from its words, while generation is the process of translating a meaning or some other part of a sentence's description into a grammatical or well-formed sentence. Parsing and generation are major research topics in their own right. Evidently, human use of language involves some kind of parsing and generation process, as do many natural language processing applications. For example, a machine translation program may parse an input language sentence into a (partial) representation of its meaning, and then generate an output language sentence from that representation.

Modern computational linguistics began with Chomsky (1957), and was initially dominated by the study of his 'transformational' grammars. These grammars involved two levels of analyses, a 'deep structure' meant to capture more-or-less simply the meaning of a sentence, and a 'surface structure' which reflects the actual way in which the sentence was constructed. The deep structure might be a clause in the active voice, 'Sandy saw Sam,' whereas the surface structure might involve the more complex passive voice, 'Sam was seen by Sandy.'

Transformational grammars are computationally complex, and in the 1980s several linguists came to the conclusion that much simpler kinds of grammars could describe most syntactic phenomena, developing Generalized Phrase-Structure Grammars (Gazdar et al. 1985) and Unification-based Grammars (Kaplan and Bresnan 1982, Pollard and Sag 1987, Shieber 1986). These grammars generate surface structures directly; there is no separate deep structure and therefore no transformations. These kinds of grammars can provide very detailed syntactic and semantic analyses of sentences, but even today there are no comprehensive grammars of this kind that fully accommodate English or any other natural language.

Natural language processing using handcrafted grammars suffers from two major drawbacks. First, the syntactic coverage offered by any available grammar is incomplete, reflecting both our lack of understanding of even relatively frequently occurring syntactic constructions and the organizational difficulty of manually constructing any artifact as complex as a grammar of a natural language. Second, such grammars almost always permit a large number of spurious ambiguities, that is, parses which are permitted by the rules of syntax but have unusual or unlikely semantic interpretations. For example, in the sentence 'I saw the boat with the telescope,' the prepositional phrase 'with the telescope' is most easily interpreted as the instrument used in seeing, while in 'I saw the policeman with the rifle,' the prepositional phrase usually receives a different interpretation in which the policeman has the rifle. Note that the corresponding alternative interpretation is marginally accessible for each of these sentences: in the first sentence one can imagine that the telescope is on the boat, and in the second, that the rifle (say, with a viewing scope) was used to view the policeman.

In effect, there is a dilemma of coverage. A grammar rich enough to accommodate natural language, including rare and sometimes even 'ungrammatical' constructions, fails to distinguish natural from unnatural interpretations. But a grammar sufficiently restricted so as to exclude what is unnatural fails to accommodate the scope of real language. These observations led, in the 1980s, to a growing interest in stochastic approaches to natural language, particularly to speech. Stochastic grammars became the basis of speech recognition systems by outperforming the best of the systems based on deterministic handcrafted grammars. Largely inspired by these successes, computational linguists began applying stochastic approaches to other natural language processing applications. Usually, the architecture of such a stochastic model is specified manually, while the model's parameters are estimated from a training corpus, that is, a large representative sample of sentences.

As explained in the body of this article, stochastic approaches replace the binary distinctions (grammatical vs. ungrammatical) of nonstochastic approaches with probability distributions. This provides a way of dealing with the two drawbacks of nonstochastic approaches. Ill-formed alternatives can be characterized as extremely low probability rather than ruled out as impossible, so even ungrammatical strings can be provided with an interpretation. Similarly, a stochastic model of possible interpretations of a sentence provides a method for distinguishing more plausible interpretations from less plausible ones.

Sect. 2 sets out formally various classes of grammars and languages. Probabilistic grammars are introduced in Sect. 3, along with the basic issues of parametric representation, inference, and computation.

2. Grammars and Languages

The formal framework, whether used in a transformational grammar, a generalized phrase-structure grammar, or a more traditionally styled context-free grammar, is due to Chomsky (1957) and his co-workers.

This section presents a brief introduction to this framework. But for a thorough (and very readable) presentation we highly recommend the book by Hopcroft and Ullman (1979).

The concept and notation of a rewrite grammar is perhaps best introduced by an example:

Example 1: Define a grammar, G1, by G1 = (T1, N1, S, R1), where T1 = {grows, rice, wheat} is a set of words (a 'lexicon'), N1 = {S, NP, VP} is a set of symbols representing grammatically meaningful strings of words, such as clauses or parts of speech (e.g., S for 'Sentence,' NP for 'Noun Phrase,' VP for 'Verb Phrase'), and R1 = {S → NP VP, NP → rice, NP → wheat, VP → grows} is a collection of rules for rewriting, or instantiating, the symbols in N1. Informally, the nonterminal S rewrites to sentences or clauses, NP rewrites to noun phrases and VP rewrites to verb phrases. The language, L_G1, generated by G1 is the set of strings of words that are reachable from S through the rewrite rules in R1. In this example, L_G1 = {rice grows, wheat grows}, derived by S ⇒ NP VP ⇒ rice VP ⇒ rice grows, and S ⇒ NP VP ⇒ wheat VP ⇒ wheat grows.

More generally, if T is a finite set of symbols, let T* be the set of all strings (i.e., finite sequences) of symbols of T, including the empty string, and let T+ be the set of all nonempty strings of symbols of T. A language is a subset of T*. A rewrite grammar G is a quadruple G = (T, N, S, R), where T and N are disjoint finite sets of symbols (called the terminal and nonterminal symbols respectively), S ∈ N is a distinguished nonterminal called the start or sentence symbol, and R is a finite set of productions. A production is a pair (α, β) where α ∈ N+ and β ∈ (N ∪ T)*; productions are usually written α → β. Productions of the form α → ε, where ε is the empty string, are called epsilon productions. This entry restricts attention to grammars without epsilon productions, that is, β ∈ (N ∪ T)+, as this simplifies the mathematics considerably.

A rewrite grammar G defines a rewriting relation ⇒_G ⊆ (N ∪ T)* × (N ∪ T)* over pairs of strings consisting of terminals and nonterminals as follows: βAγ ⇒ βαγ iff A → α ∈ R and β, γ ∈ (N ∪ T)* (the subscript G is dropped when clear from the context). The reflexive, transitive closure of ⇒ is denoted ⇒*. Thus ⇒* is the rewriting relation using arbitrary finite sequences of productions. (It is called 'reflexive' because the identity rewrite, α ⇒* α, is included.) The language generated by G, denoted L_G, is the set of all strings w ∈ T+ such that S ⇒* w.

Rewrite grammars are traditionally classified by the shapes of their productions. G = (T, N, S, R) is a context-sensitive grammar if for all productions α → β ∈ R, |α| ≤ |β|, that is, the right-hand side of each production is not shorter than its left-hand side. G is a context-free grammar if |α| = 1, that is, the left-hand side of each production consists of a single nonterminal. G is a left-linear grammar if G is context-free and β (the right-hand side of the production) is either of the form Aw or of the form w, where A ∈ N and w ∈ T*; in a right-linear grammar β always is of the form wA or w. A right or left-linear grammar is called a regular grammar. G1, in Example 1, is context sensitive and context free.

It is straightforward to show that the classes of languages generated by these classes of grammars stand in strict equality or subset relationships. Specifically, the class of languages generated by right-linear grammars is the same as the class generated by left-linear grammars; this class is called the regular languages, and is a strict subset of the class of languages generated by context-free grammars, which is a strict subset of the class of languages generated by context-sensitive grammars, which in turn is a strict subset of the class of languages generated by rewrite grammars.

It turns out that context-sensitive grammars (where a production rewrites more than one nonterminal) have not had many applications in natural language processing, so from here on we will concentrate on context-free grammars, where all productions take the form A → β, where A ∈ N and β ∈ (N ∪ T)+.

An appealing property of grammars with productions in this form is that they induce tree structures on the strings that they generate. And, as Sect. 3 shows, this is the basis for bringing in probability distributions and the theory of inference. We say that the context-free grammar G = (T, N, S, R) generates the labeled, ordered tree ψ iff the root node of ψ is labeled S, and for each node n in ψ, either n has no children and its label is a member of T (i.e., it is labelled with a terminal) or else there is a production A → β ∈ R where the label of n is A and the left-to-right sequence of labels of n's immediate children is β. It is straightforward to show that w is in L_G if G generates a tree ψ whose yield (i.e., the left-to-right sequence of terminal symbols labeling ψ's leaf nodes) is w; ψ is called a parse tree of w (with respect to G). In what follows, we define Ψ_G to be the set of parse trees generated by G, and Y(·) to be the function that maps trees to their yields.

Example 1 (continued): The grammar G1 defined above generates two trees, ψ1 and ψ2. In this example, Y(ψ1) = 'rice grows' and Y(ψ2) = 'wheat grows.'

A string of terminals w is called ambiguous if w has two or more parse trees. Linguistically, each parse tree of an ambiguous string usually corresponds to a distinct interpretation.
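To make the rewriting relation concrete, the following short Python sketch (an illustration added here, not part of the original article; the dictionary encoding of R1 is an assumption) expands the leftmost nonterminal exhaustively, starting from S, and recovers L_G1 = {rice grows, wheat grows}.

    # Example 1 as a toy rewrite system: expand the leftmost nonterminal until
    # only terminals remain, collecting every sentence that can be derived.
    RULES = {                       # R1, encoded as lhs -> list of right-hand sides
        "S": [["NP", "VP"]],
        "NP": [["rice"], ["wheat"]],
        "VP": [["grows"]],
    }

    def derive(symbols):
        """Yield all terminal strings reachable from the list `symbols`."""
        for i, sym in enumerate(symbols):
            if sym in RULES:                            # leftmost nonterminal
                for rhs in RULES[sym]:
                    yield from derive(symbols[:i] + rhs + symbols[i + 1:])
                return
        yield " ".join(symbols)                         # no nonterminals left

    print(sorted(derive(["S"])))                        # ['rice grows', 'wheat grows']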


Example 2: Consider G2 = (T2, N2, S, R2), where T2 = {I, saw, the, man, with, telescope}, N2 = {S, NP, N, Det, VP, V, PP, P} and R2 = {S → NP VP, NP → I, NP → Det N, Det → the, N → N PP, N → man, N → telescope, VP → V NP, VP → VP PP, PP → P NP, V → saw, P → with}. Informally, N rewrites to nouns, Det to determiners, V to verbs, P to prepositions and PP to prepositional phrases. It is easy to check that the two trees ψ3 and ψ4 with the yields Y(ψ3) = Y(ψ4) = 'I saw the man with the telescope' are both generated by G2. Linguistically, these two parse trees represent two different syntactic analyses of the sentence. The first analysis corresponds to the interpretation where the seeing is by means of a telescope, while the second corresponds to the interpretation where the man has a telescope.

3. Probability and Statistics

Obviously broad coverage is desirable—natural language is rich and diverse, and not easily held to a small set of rules. But it is hard to achieve broad coverage without massive ambiguity (a sentence may have tens of thousands of parses), and this of course complicates applications like language interpretation, language translation, and speech recognition. This is the dilemma of coverage that we referred to earlier, and it sets up a compelling role for probabilistic and statistical methods.

We briefly review the main probabilistic grammars and their associated theories of inference. We begin in Sect. 3.1 with probabilistic regular grammars, also known as hidden Markov models (HMM), which are the foundation of modern speech recognition systems—see Jelinek (1997) for a survey. In Sect. 3.2 we discuss probabilistic context-free grammars, which turn out to be essentially the same thing as branching processes. Finally, in Sect. 3.3, we take a more general approach to placing probabilities on grammars, which leads to Gibbs distributions, a role for Besag's pseudolikelihood method (Besag 1974, 1975), various computational issues, and, all in all, an active area of research in computational linguistics.

3.1 Regular Grammars

We will focus on right-linear grammars, but the treatment of left-linear grammars is more or less identical. It is convenient to work with a normal form: all rules are either of the form A → bB or A → b, where A, B ∈ N and b ∈ T. It is easy to show that every right-linear grammar has an equivalent normal form in the sense that the two grammars produce the same language. Essentially nothing is lost.

3.1.1 Probabilities. The grammar G can be made into a probabilistic grammar by assigning to each nonterminal A ∈ N a probability distribution p over productions of the form A → α ∈ R: for every A ∈ N

    Σ_{α ∈ (N ∪ T)+ s.t. (A→α) ∈ R} p(A → α) = 1        (1)

Recall that Ψ_G is the set of parse trees generated by G (see Sect. 2). If G is linear, then ψ ∈ Ψ_G is characterized by a sequence of productions, starting from S. It is, then, straightforward to use p to define a probability P on Ψ_G: just take P(ψ) (for ψ ∈ Ψ_G) to be the product of the associated production probabilities.

Example 3: Consider the right-linear grammar G3 = (T3, N3, S, R3), with T3 = {a, b}, N3 = {S, A} and the productions (R3) and production probabilities (p):

    S → aS    p = 0.80
    S → bS    p = 0.01
    S → bA    p = 0.19
    A → bA    p = 0.90
    A → b     p = 0.10

The language is the set of strings ending with a sequence of at least two bs. The grammar is ambiguous: in general, a sequence of terminal states does not uniquely identify a sequence of productions. The sentence aabbbb has three parses (determined by the placement of the production S → bA), but the most likely parse, by far, is S → aS, S → aS, S → bA, A → bA, A → bA, A → b (P = 0.8 × 0.8 × 0.19 × 0.9 × 0.9 × 0.1), which has a posterior probability of nearly 0.99. The corresponding parse tree is shown below.
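As a quick check on Example 3 (our own illustration, not from the article; the encoding of G3 below is ad hoc), one can enumerate the derivations of 'aabbbb', multiply the production probabilities along each one, and normalize; the largest posterior comes out near 0.99, as claimed.

    # Enumerate the derivations of 'aabbbb' under G3 and their probabilities.
    G3 = {                    # state -> [(emitted symbol, next state or None, prob)]
        "S": [("a", "S", 0.80), ("b", "S", 0.01), ("b", "A", 0.19)],
        "A": [("b", "A", 0.90), ("b", None, 0.10)],
    }

    def derivations(state, remaining, prob=1.0, rules=()):
        """Yield (production sequence, probability) for every complete parse."""
        for sym, nxt, p in G3[state]:
            if not remaining or sym != remaining[0]:
                continue
            step = rules + (f"{state}->{sym}{nxt or ''}",)
            if nxt is None:                              # terminating rule A -> b
                if len(remaining) == 1:
                    yield step, prob * p
            else:
                yield from derivations(nxt, remaining[1:], prob * p, step)

    parses = list(derivations("S", "aabbbb"))
    total = sum(p for _, p in parses)
    for rules, p in sorted(parses, key=lambda x: -x[1]):
        print(f"P = {p:.7f}   posterior = {p / total:.3f}   {', '.join(rules)}")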

3.1.2 Inference. The problem is to estimate (see Estimation: Point and Interval) the transition probabilities, p(·), either from parsed data (examples from Ψ_G) or just from sentences (examples from L_G). Consider first the case of parsed data ('supervised learning'), and let ψ1, ψ2, …, ψn be a sequence taken iid according to P. If f(A → α; ψ) is the counting function, counting the number of times transition A → α ∈ R occurs in ψ, then the likelihood function (see Likelihood in Statistics) is

    L = L(p; ψ1, …, ψn) = Π_{i=1}^n Π_{A→α ∈ R} p(A → α)^{f(A→α; ψi)}        (2)

The maximum likelihood estimate is, sensibly, the relative frequency estimator:

    p̂(A → α) = Σ_{i=1}^n f(A → α; ψi) / Σ_{β s.t. (A→β) ∈ R} Σ_{i=1}^n f(A → β; ψi)        (3)
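A minimal sketch of this relative frequency estimator, Eqn. (3) (ours, not the article's; the nested-tuple tree encoding and function names are assumptions):

    from collections import Counter, defaultdict

    def count_productions(tree, counts):
        """Accumulate f(A -> alpha; tree) for every production used in `tree`.
        Trees are nested tuples, e.g. ('S', ('NP', 'rice'), ('VP', 'grows'))."""
        if isinstance(tree, str):                        # terminal leaf
            return
        lhs, *children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts[(lhs, rhs)] += 1
        for child in children:
            count_productions(child, counts)

    def relative_frequency(trees):
        """Eqn. (3): counts of A -> alpha normalized within each left-hand side A."""
        counts = Counter()
        for tree in trees:
            count_productions(tree, counts)
        totals = defaultdict(float)
        for (lhs, _), c in counts.items():
            totals[lhs] += c
        return {rule: c / totals[rule[0]] for rule, c in counts.items()}

    treebank = [("S", ("NP", "rice"), ("VP", "grows")),
                ("S", ("NP", "wheat"), ("VP", "grows"))]
    estimates = relative_frequency(treebank)
    print(estimates[("NP", ("rice",))])                  # 0.5
    print(estimates[("VP", ("grows",))])                 # 1.0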

The problem of estimating p from sentences ('unsupervised learning') is more interesting, and more important for applications. Recall that Y(ψ) is the yield of ψ, that is, the sequence of terminals in ψ. Given a sentence w ∈ T+, let Ψ_w be the set of parses which yield w: Ψ_w = {ψ ∈ Ψ: Y(ψ) = w}. Imagine a sequence ψ1, …, ψn, iid according to P, for which only the corresponding yields, wi = Y(ψi), 1 ≤ i ≤ n, are observed. The likelihood function is

    L = L(p; w1, …, wn) = Π_{i=1}^n Σ_{ψ ∈ Ψ_wi} Π_{A→α ∈ R} p(A → α)^{f(A→α; ψ)}        (4)

As is usual with hidden data, there is an EM-type iteration for climbing the likelihood surface—see Baum (1972) and Dempster et al. (1977):

    p̂_{t+1}(A → α) = Σ_{i=1}^n E_{p̂_t}[f(A → α; ψ) | ψ ∈ Ψ_wi] / Σ_{A→β ∈ R} Σ_{i=1}^n E_{p̂_t}[f(A → β; ψ) | ψ ∈ Ψ_wi]        (5)

Needless to say, nothing can be done with this unless we can actually evaluate, in a computationally feasible way, expressions like E_p̂[f(A → α; ψ) | ψ ∈ Ψ_w]. This is one of several closely related computational problems that are part of the mechanics of working with regular grammars.

3.1.3 Computation. A sentence w ∈ T+ is parsed by finding a sequence of productions A → bB ∈ R which yield w. Depending on the grammar, this corresponds more or less to an interpretation of w. Often, there are many parses and we say that w is ambiguous. In such cases, if there is a probability p on R then there is a probability P on Ψ_G, and a reasonably compelling choice of parse is the most likely parse:

    arg max_{ψ ∈ Ψ_w} P(ψ)        (6)

This is the maximum a posteriori (MAP) estimate of ψ—obviously it minimizes the probability of error under the distribution P. (Of course, in those cases in which Eqn. (6) is small, P does little to make w unambiguous.)

What is the probability of w? How are its parses computed? How is the most likely parse computed? These computational issues turn out to be more-or-less the same as the issue of computing E_p̂[f(A → α; ψ) | ψ ∈ Ψ_w] that came up in our discussion of inference. The basic structure and cost of the computational algorithm is the same for each of the four problems—compute the probability of w, compute the set of parses, compute the best parse, compute E_p̂. In particular, there is a simple dynamic programming solution to each of these problems, and in each case the complexity is of the order n·|R|, where n is the length of w and |R| is the number of productions in G—see Jelinek (1997), Geman and Johnson (2000). The existence of a dynamic programming principle for regular grammars is a primary reason for their central role in state-of-the-art speech recognition systems.
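The following Python sketch illustrates the dynamic programming idea for the most likely parse, Eqn. (6), for a probabilistic right-linear grammar in the normal form above (a Viterbi-style recursion; the code and its rule encoding are our illustration, not the article's). Each input position is examined once per production, which is the order n·|R| cost cited above.

    def best_parse(w, rules, start="S"):
        """Most likely complete derivation of w under a probabilistic right-linear
        grammar in normal form.  `rules` maps a state A to a list of triples
        (b, B, p) for A -> bB, or (b, None, p) for A -> b."""
        best = {start: (1.0, [])}                 # state -> (prob, production sequence)
        for i, sym in enumerate(w):
            is_last = (i == len(w) - 1)
            nxt = {}
            for state, (p, seq) in best.items():
                for b, succ, q in rules.get(state, []):
                    if b != sym or (succ is None) != is_last:
                        continue                  # terminating rules only at the end
                    cand = (p * q, seq + [(state, b, succ)])
                    if succ not in nxt or cand[0] > nxt[succ][0]:
                        nxt[succ] = cand          # keep only the best path per state
            best = nxt
        return best.get(None)                     # the completed (state-free) parse

    G3 = {"S": [("a", "S", 0.80), ("b", "S", 0.01), ("b", "A", 0.19)],
          "A": [("b", "A", 0.90), ("b", None, 0.10)]}
    prob, rules = best_parse("aabbbb", G3)
    print(round(prob, 7), rules)                  # 0.0098496 and the six productions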


3.2 Context-free Grammars

Despite the successes of regular grammars in speech recognition, the problems of language understanding and translation are generally better addressed with the more structured and more powerful context-free grammars. Following our development of probabilistic regular grammars in the previous section, we will address here the interrelated issues of fitting context-free grammars with probability distributions, estimating the parameters of these distributions, and computing various functionals of these distributions.

The context-free grammars G = (T, N, S, R) have rules of the form A → α, α ∈ (N ∪ T)+, as discussed previously in Sect. 2. There is again a normal form, known as the Chomsky normal form, which is particularly convenient when developing probabilistic versions. Specifically, one can always find a context-free grammar G′, with all productions of the form A → BC or A → a, A, B, C ∈ N, a ∈ T, which produces the same language as G: L_G′ = L_G. Henceforth, we will assume that context-free grammars are in the Chomsky normal form.

3.2.1 Probabilities. The goal is to put a probability distribution on the set of parse trees generated by a context-free grammar in Chomsky normal form. Ideally, the distribution will have a convenient parametric form, that allows for efficient inference and computation.

Recall from Sect. 2 that context-free grammars generate labeled, ordered trees. Given sets of nonterminals N and terminals T, let Ψ be the set of finite trees with:
(a) root node labeled S;
(b) leaf nodes labeled with elements of T;
(c) interior nodes labeled with elements of N;
(d) every nonterminal (interior) node having either two children labeled with nonterminals or one child labeled with a terminal.
Every ψ ∈ Ψ defines a sentence w ∈ T+: read the labels off of the terminal nodes of ψ from left to right. Consistent with the notation of Sect. 3.1 we write Y(ψ) = w. Conversely, every sentence w ∈ T+ defines a subset of Ψ, which we denote by Ψ_w, consisting of all ψ with yield w (Y(ψ) = w). A context-free grammar G defines a subset of Ψ, Ψ_G, whose collection of yields is the language, L_G, of G. We seek a probability distribution P on Ψ which concentrates on Ψ_G.

The time-honored approach to probabilistic context-free grammars is through the production probabilities p: R → [0, 1], with

    Σ_{α ∈ N² ∪ T s.t. (A→α) ∈ R} p(A → α) = 1        (7)

Following the development in Sect. 3.1, we introduce a counting function f(A → α; ψ), which counts the number of instances of the rule A → α in the tree ψ, i.e., the number of nonterminal nodes A whose daughter nodes define, left-to-right, the string α. Through f, p induces a probability P on Ψ:

    P(ψ) = Π_{(A→α) ∈ R} p(A → α)^{f(A→α; ψ)}        (8)

It is clear enough that P concentrates on Ψ_G, and we shall see shortly that this parameterization, in terms of products of probabilities p, is particularly workable and convenient. The pair, G and P, is known as a probabilistic context-free grammar, or PCFG for short. (Notice the connection to branching processes—Harris (1963): Starting at S, use R, and the associated probabilities p(·), to expand nodes into daughter nodes until all leaf nodes are labeled with terminals—i.e., with elements of T.)
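To make Eqn. (8) concrete, here is a small sketch (ours; the probabilities and the tree encoding are invented for illustration) that scores a parse tree of Example 1's grammar by multiplying its production probabilities:

    # A PCFG over Example 1's grammar; the rule probabilities are invented here
    # purely for illustration.
    PCFG = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("rice",)): 0.7,
        ("NP", ("wheat",)): 0.3,
        ("VP", ("grows",)): 1.0,
    }

    def tree_probability(tree):
        """Eqn. (8): the product of p(A -> alpha) over the productions in `tree`."""
        if isinstance(tree, str):                        # terminal leaf contributes 1
            return 1.0
        lhs, *children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        prob = PCFG[(lhs, rhs)]
        for child in children:
            prob *= tree_probability(child)
        return prob

    psi1 = ("S", ("NP", "rice"), ("VP", "grows"))
    print(tree_probability(psi1))                        # 1.0 * 0.7 * 1.0 = 0.7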
3.2.2 Inference. As with probabilistic regular grammars, the production probabilities of a context-free grammar, which amount to a parameterization of the distribution P on Ψ_G, can be estimated from examples. In one scenario, we have access to a sequence ψ1, …, ψn from Ψ_G under P. This is 'supervised learning,' in the sense that sentences come equipped with parses. More practical is the problem of 'unsupervised learning,' wherein we observe only the yields, Y(ψ1), …, Y(ψn).

In either case, the treatment of maximum likelihood estimation is essentially identical to the treatment for regular grammars. In particular, the likelihood for fully observed data is again Eqn. (2), and the maximum likelihood estimator is again the relative frequency estimator Eqn. (3). And, in the unsupervised case, the likelihood is again Eqn. (4) and this leads to the same EM-type iteration given in Eqn. (5).

3.2.3 Computation. There are again four basic computations: find the probability of a sentence w ∈ T+; find a ψ ∈ Ψ (or find all ψ ∈ Ψ) satisfying Y(ψ) = w ('parsing'); find

    arg max_{ψ ∈ Ψ s.t. Y(ψ) = w} P(ψ)

('maximum a posteriori' or 'optimal' parsing); compute expectations of the form E_{p̂_t}[f(A → α; ψ) | ψ ∈ Ψ_w] that arise in iterative estimation schemes like (5). The four computations turn out to be more-or-less the same, as was the case for regular grammars (Sect. 3.1.3) and there is again a common dynamic-programming-like solution—see Lari and Young (1990, 1991), Geman and Johnson (2000).
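The common dynamic-programming solution for PCFGs is usually organized as an inside (CKY-style) recursion over substrings; the sketch below (our illustration, assuming a Chomsky-normal-form grammar encoded as two dictionaries) computes the probability of a sentence by summing over all of its parses:

    from collections import defaultdict

    def inside_probability(words, lexical, binary, start="S"):
        """Probability of `words` under a PCFG in Chomsky normal form, obtained by
        summing over all parses with the inside (CKY-style) dynamic program.
        lexical: {(A, word): p} for rules A -> word
        binary:  {(A, B, C): p} for rules A -> B C"""
        n = len(words)
        inside = defaultdict(float)            # inside[(i, j, A)] = P(A =>* words[i:j])
        for i, word in enumerate(words):
            for (A, w), p in lexical.items():
                if w == word:
                    inside[(i, i + 1, A)] += p
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):                        # split point
                    for (A, B, C), p in binary.items():
                        left, right = inside[(i, k, B)], inside[(k, j, C)]
                        if left and right:
                            inside[(i, j, A)] += p * left * right
        return inside[(0, n, start)]

    lexical = {("NP", "rice"): 0.7, ("NP", "wheat"): 0.3, ("VP", "grows"): 1.0}
    binary = {("S", "NP", "VP"): 1.0}
    print(inside_probability("rice grows".split(), lexical, binary))   # 0.7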

3.3 Gibbs Distributions

There are many ways to generalize. The coverage of a context-free grammar may be inadequate, and we may hope, therefore, to find a workable scheme for placing probabilities on context-sensitive grammars, or perhaps even more general grammars. Or, it may be preferable to maintain the structure of a context-free grammar, especially because of its dynamic programming principle, and instead generalize the class of probability distributions away from those induced (parameterized) by production probabilities. But nothing comes for free. Most efforts to generalize run into nearly intractable computational problems when it comes time to parse or to estimate parameters.

Many computational linguists have experimented with using Gibbs distributions, popular in statistical physics, to go beyond production-based probabilities, while nevertheless preserving the basic context-free structure. Next we take a brief look at this particular formulation, in order to illustrate the various challenges that accompany efforts to generalize the more standard probabilistic grammars.

3.3.1 Probabilities. The sample space is Ψ_G, the set of trees generated by a context-free grammar G. Gibbs measures are built from sums of more-or-less simple functions, known as 'potentials' in statistical physics, defined on the sample space. In linguistics, it is more natural to call these features rather than potentials. Suppose, then, that we have identified M linguistically salient features f1, …, fM, where fk: Ψ_G → R, through which we will characterize the fitness or appropriateness of a structure ψ ∈ Ψ_G. More specifically, we will construct a class of probabilities on Ψ_G which depend on ψ ∈ Ψ_G only through f1(ψ), …, fM(ψ). Examples of features are the number of times a particular production occurs, the number of words in the yield, various measures of subject-verb agreement, and the number of embedded or independent clauses.

Gibbs distributions have the form

    P_θ(ψ) = (1/Z) exp{Σ_{i=1}^M θi fi(ψ)}        (9)

where θ1, …, θM are parameters, to be adjusted 'by hand' or inferred from data, θ = (θ1, …, θM), and where Z = Z(θ) (known as the 'partition function') normalizes so that Σ_{ψ ∈ Ψ_G} P_θ(ψ) = 1. Evidently, we need to assume or ensure that Σ_{ψ ∈ Ψ_G} exp{Σ_{i=1}^M θi fi(ψ)} < ∞. For instance, we had better require that θ1 < 0 if M = 1 and f1(ψ) = |Y(ψ)| (the number of words in a sentence), unless of course |Ψ_G| < ∞.
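A small numerical illustration of Eqn. (9) (ours, not the article's; the two features and the candidate trees are toy assumptions) over an explicitly enumerated, finite set of trees:

    import math

    def gibbs_probabilities(trees, features, theta):
        """Eqn. (9) over an explicitly enumerated, finite set of candidate trees:
        P_theta(tree) = exp(sum_i theta_i * f_i(tree)) / Z(theta)."""
        scores = [math.exp(sum(t * f(tree) for t, f in zip(theta, features)))
                  for tree in trees]
        Z = sum(scores)                                   # the partition function
        return [score / Z for score in scores]

    def n_nodes(tree):          # feature 1: number of nonterminal nodes
        return 0 if isinstance(tree, str) else 1 + sum(n_nodes(c) for c in tree[1:])

    def yield_length(tree):     # feature 2: number of words in the yield, |Y(tree)|
        return 1 if isinstance(tree, str) else sum(yield_length(c) for c in tree[1:])

    candidates = [("S", ("NP", "rice"), ("VP", "grows")),
                  ("S", ("NP", ("Det", "the"), ("N", "rice")), ("VP", "grows"))]
    print(gibbs_probabilities(candidates, [n_nodes, yield_length], theta=[0.5, -0.2]))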
3.3.2 Relation to probabilistic context-free grammars. Gibbs distributions are much more general than probabilistic context-free grammars. In order to recover PCFGs, consider the special feature set {f(A → α; ψ)}_{A→α ∈ R}: The Gibbs distribution Eqn. (9) takes on the form

    P_θ(ψ) = (1/Z) exp{Σ_{A→α ∈ R} θ_{A→α} f(A → α; ψ)}        (10)

Evidently, then, we get the probabilistic context-free grammars by taking θ_{A→α} = log_e p(A → α), where p is a system of production probabilities consistent with Eqn. (7), in which case Z = 1. But is Eqn. (10) more general? Are there probabilities on Ψ_G of this form that are not PCFGs? The answer turns out to be no, as was shown by Chi (1999) and Abney et al. (1999): given a probability distribution P on Ψ_G of the form of Eqn. (10), there always exists a system of production probabilities p under which P is a PCFG.

3.3.3 Inference. The feature set {fi}_{i=1,…,M} can accommodate arbitrary linguistic attributes and constraints, and the Gibbs model Eqn. (9), therefore, has great promise as an accurate measure of linguistic fitness. But the model depends critically on the parameters {θi}_{i=1,…,M}, and the associated estimation problem is, unfortunately, very hard. Indeed, the problem of unsupervised learning appears to be all but intractable.

Let us suppose, then, that we observe ψ1, ψ2, …, ψn ∈ Ψ_G ('supervised learning'), iid according to P_θ. In general, the likelihood function, Π_{i=1}^n P_θ(ψi), is more or less impossible to maximize. But if the primary goal is to select good parses, then perhaps the likelihood function asks for too much, or even the wrong thing. It might be more relevant to maximize the likelihood of the observed parses, given the yields Y(ψ1), …, Y(ψn):

    Π_{i=1}^n P_θ(ψi | Y(ψi))        (11)

Maximization of Eqn. (11) is an instance of Besag's remarkably effective pseudolikelihood method (Besag 1974, 1975), which is commonly used for estimating parameters of Gibbs distributions. The computations involved are generally much easier than what is involved in maximizing the ordinary likelihood function. Take a look at the gradient of the logarithm of Eqn. (11): the θj component is proportional to

    (1/n) Σ_{i=1}^n fj(ψi) − (1/n) Σ_{i=1}^n E_θ[fj(ψ) | Y(ψ) = Y(ψi)]        (12)

and E_θ[fj(ψ) | Y(ψ)] can be computed directly from the set of parses of the sentence Y(ψ). (In practice there is often massive ambiguity, and the number of parses may be too large to feasibly consider. Such cases require some form of pruning or approximation.)

Thus gradient ascent of the pseudolikelihood function is (at least approximately) computationally feasible. This is particularly useful since the Hessian of the logarithm of the pseudolikelihood function is nonpositive, and therefore there are no local maxima. What is more, under mild conditions pseudolikelihood estimators (i.e., maximizers of Eqn. (11)) are consistent (Chi 1998).
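A sketch of the pseudolikelihood gradient, Eqn. (12) (our illustration; it assumes each observed parse comes with an enumerated set of candidate parses sharing its yield):

    import math

    def pseudolikelihood_gradient(observed, candidate_sets, features, theta):
        """Eqn. (12): for each feature j, the observed value minus its conditional
        expectation over the candidate parses that share the observed yield,
        averaged over the n training sentences."""
        n = len(observed)
        grad = [0.0] * len(theta)
        for psi, candidates in zip(observed, candidate_sets):
            weights = [math.exp(sum(t * f(c) for t, f in zip(theta, features)))
                       for c in candidates]
            Z = sum(weights)                              # normalizes within Psi_w
            for j, f in enumerate(features):
                expectation = sum(w * f(c) for w, c in zip(weights, candidates)) / Z
                grad[j] += (f(psi) - expectation) / n
        return grad

    # One ascent step, for some step size eps:
    #   theta = [t + eps * g for t, g in zip(theta, grad)]

A gradient-ascent step with this gradient climbs Eqn. (11); as noted above, the log pseudolikelihood is concave, so there are no local maxima to worry about.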

Bibliography

Abney S, McAllester D, Pereira F 1999 Relating probabilistic grammars and automata. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Morgan Kaufmann, San Francisco, pp. 542–9
Baum L E 1972 An inequality and associated maximization techniques in statistical estimation of probabilistic functions of Markov processes. Inequalities 3: 1–8
Besag J 1974 Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36: 192–236
Besag J 1975 Statistical analysis of non-lattice data. The Statistician 24: 179–95
Chi Z 1998 Probability models for complex systems. PhD thesis, Brown University
Chi Z 1999 Statistical properties of probabilistic context-free grammars. Computational Linguistics 25(1): 131–60
Chomsky N 1957 Syntactic Structures. Mouton, The Hague
Dempster A, Laird N, Rubin D 1977 Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39: 1–38
Fu K S 1974 Syntactic Methods in Pattern Recognition. Academic Press, New York
Fu K S 1982 Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ
Gazdar G, Klein E, Pullum G, Sag I 1985 Generalized Phrase Structure Grammar. Blackwell, Oxford, UK
Geman S, Johnson M 2000 Probability and statistics in computational linguistics, a brief review. Internal publication, Division of Applied Mathematics, Brown University
Harris T E 1963 The Theory of Branching Processes. Springer, Berlin
Hopcroft J E, Ullman J D 1979 Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA
Jelinek F 1997 Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA
Kaplan R M, Bresnan J 1982 Lexical functional grammar: A formal system for grammatical representation. In: Bresnan J (ed.) The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA, Chap. 4, pp. 173–281
Kay M, Gawron J M, Norvig P 1994 Verbmobil: A Translation System for Face-to-face Dialog. CSLI Press, Stanford, CA
Lari K, Young S J 1990 The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language 4: 35–56
Lari K, Young S J 1991 Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language 5: 237–57
Pollard C, Sag I 1987 Information-based Syntax and Semantics. CSLI Lecture Notes Series No. 13. University of California Press, Stanford, CA
Shieber S M 1986 An Introduction to Unification-based Approaches to Grammar. CSLI Lecture Notes Series. University of California Press, Stanford, CA

S. Geman and M. Johnson

Probabilistic Thinking

1. Introduction

Some people have enjoyed playing games that involved at least a component of chance ever since civilizations—as we know them—emerged in human history. Many of these people apparently had a fairly good intuitive idea of what we might now term the 'probabilities' or 'odds' of various outcomes associated with these games. Alternatively, at least some people (whom in hindsight we might now label 'superstitious') believed that the outcomes of these games could indicate something about the favorability of the world in general toward the gambler (a belief that may exist even today—at least on an implicit basis—for chronic gamblers), specifically the favorability or unfavorability of a particular god that was involved with a particular game.

In the last 600 years or so people have proposed systematic ways of evaluating probabilities—e.g., in terms of equally likely outcomes in games of chance or in terms of subjective belief. In the last 150 years or so scientific endeavor has started to make specific use of probabilistic thinking, and in the last 100 years or so probabilistic ideas have been applied to problems of everyday life (see Gigerenzer et al. 1989). Currently, for example, jurors are often asked to determine manufacturer or company liability on the basis of differential rates of accidents or cancers; in such judgments, jurors cannot rely on deterministic reasoning that would lead to a specification of exactly which negative outcome is due to what; nor can they rely on 'experience' in these contexts in which they have had none. Thus, it is extraordinarily important to understand—if indeed generalizations about such thinking are possible—how it is that people think about and evaluate probabilities, and in particular the problems that they (we) have. The major thesis presented here is that both the thinking and the problems can be understood in terms of systematic deviations from the normative principles of probabilistic reasoning that have been devised over the past five or so centuries. These deviations are not inevitable; it is certainly possible for people to think about probabilities coherently (at least on occasion); nevertheless, when deviations occur, they tend to be of the systematic nature to be described here; they occur most often when people are engaged in what is termed 'automatic' or 'associational' thinking, as opposed to 'reflective' and 'comparative' thinking. (These terms will be made more precise later in this article.)

2. A Brief History (of Thought)

Beginning in early infancy, we develop intuitions about space, time, and number (at least to the extent of 'one, two, three, many'). These intuitions are essentially correct. They remain throughout our lives basically unaltered except for small variations, and they form the foundation of far more sophisticated mathematical systems that have been developed over the past thousands of years. Moreover, their importance in our lives has been clear throughout those millennia.
