LING83600: Context-Free Grammars
Kyle Gorman

1 Introduction

Context-free grammars (or CFGs) are a formalization of what linguists call "phrase structure grammars". The formalism is originally due to Chomsky (1956), though it has been independently discovered several times. The insight underlying CFGs is the notion of constituency.

Most linguists believe that human languages cannot be even weakly described by CFGs (e.g., Shieber 1985). And in formal work, context-free grammars have long been superseded by "transformational" approaches, which introduce movement operations (in some camps), or by "generalized phrase structure" approaches, which add more complex phrase-building operations (in other camps). However, unlike more expressive formalisms, CFGs admit relatively efficient cubic-time parsing algorithms. Nearly all programming languages are described by, and parsed using, a CFG,¹ and CFGs of human languages are widely used as models of syntactic structure in natural language processing and understanding tasks.

Bibliographic note: This handout covers the formal definition of context-free grammars; the Cocke-Younger-Kasami (CYK) parsing algorithm will be covered in a separate assignment. The definitions here are loosely based on chapter 11 of Jurafsky & Martin, who in turn draw from Hopcroft and Ullman (1979).

2 Definitions

2.1 Context-free grammars

A context-free grammar G is a four-tuple ⟨N, Σ, R, S⟩ such that:

• N is a set of non-terminal symbols, corresponding to phrase markers in a syntactic tree.
• Σ is a set of terminal symbols, corresponding to words (i.e., X⁰s) in a syntactic tree.
• R is a set of production rules. These rules are of the form A → β, where A ∈ N and β ∈ (Σ ∪ N)*. Thus A is a phrase label and β is a sequence of zero or more terminals and/or non-terminals.

¹ Python programs are described by a CFG (https://docs.python.org/3/reference/grammar.html).
When you execute a Python script, this grammar specification is used to parse your script.

• S ∈ N is a designated start symbol (i.e., the highest projection in a sentence).

For simplicity, we assume that N and Σ are disjoint. As is standard, we use Roman uppercase characters to represent non-terminals and Greek lowercase characters to represent terminals.

2.2 Derivation

Direct derivation describes the relationship between the input to a single grammar rule in R and the resulting output. If there is a rule A → β ∈ R, and α, γ are strings in (Σ ∪ N)*, then

    αAγ ⇒ αβγ

i.e., αAγ directly derives αβγ. Derivation is a generalization of direct derivation which allows us to apply rules iteratively to strings. Given strings α₁, α₂, …, αₘ ∈ (Σ ∪ N)* such that α₁ ⇒ α₂, α₂ ⇒ α₃, …, αₘ₋₁ ⇒ αₘ, then

    α₁ ⇒* αₘ

i.e., α₁ derives αₘ (and α₁ also derives α₂, α₃, etc.).

2.3 Context-free language

The language L(G) generated by some grammar G is the (possibly infinite) set of strings of terminal symbols that can be derived by G starting from the start symbol S.

Exercise

Enumerate the language generated by a CFG with the following rules:

    S → NP VP
    VP → V NP
    VP → V
    NP → DT NN
    NP → Kyle
    DT → a | the
    NN → cat | dog
    V → barks | bites

Solution

The language generated is regular, and can be described by the following regular expression:

    (Kyle | (a | the)(dog | cat)) (bites | barks) (Kyle | (a | the)(dog | cat))?

Note that many strings in the language are ungrammatical in English; e.g., *Kyle barks the dog.

3 Non-equivalence of context-free and regular languages

You have previously seen regular grammars, which generate a class of languages known as the regular languages. The definition of the regular languages is repeated below:

• The empty language ∅ is a regular language.
• The empty string language {ε} is a regular language.
• For every symbol x ∈ Σ, the singleton language {x} is a regular language.
• If X is a regular language then X* (its closure) is a regular language.
• If X and Y are regular languages then:
  – X ∪ Y (their union) is a regular language, and
  – XY (their concatenation) is a regular language.
• Other languages are not regular languages.

It is well known (e.g., Chomsky 1959) that the regular languages are a proper subset of the context-free languages. One intuitive explanation for this fact is that all "rules" in a regular grammar must be left-linear or right-linear. That is, they are all of the form A → B α (a left-linear rule) or A → α B (a right-linear rule), where B ∈ N and α ∈ Σ*. But CFGs allow a third type of rule, a center-embedding rule of the form A → β A γ. Imagine this rule is part of the following CFG:

    S → A
    A → β A γ
    A → ε

Intuitively, this grammar derives the language βⁿγⁿ (where n is some non-negative integer). However, regular languages can only approximate this language (e.g., with β*γ*).

4 Chomsky normal form

Syntacticians have long had a preference for binary-branching syntactic structures, meaning that each non-terminal node has at most two children. As it happens, this assumption greatly simplifies parsing algorithms as well. One way this is enforced is by converting grammars or treebanks to a format known as Chomsky normal form (CNF; Chomsky 1963). In Chomsky normal form, the elements of R, the set of production rules, are constrained to have one of two forms:

• A → B C, where A, B, C ∈ N.
• A → β, where A ∈ N and β ∈ Σ.

In other words, the right-hand side of every rule consists of either two non-terminals or one terminal. For every CFG there exists a weakly equivalent CNF grammar, meaning a CNF grammar which generates the same language (though it does not necessarily assign exactly the same phrase structure). For instance, given the rule A → B C D, we can convert it to two CNF rules, namely A → B X and X → C D.

Exercise

Given the CFG rule M → X λ ρ Y, where X, Y are non-terminals and λ, ρ are terminals, convert the rule into a series of CNF rules.
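One mechanical recipe for such conversions (binarizing long rules left-to-right with fresh non-terminals, and giving each terminal inside a long rule its own pre-terminal) can be sketched in Python. This is a minimal sketch, not part of the handout: the tuple encoding of rules, the fresh-symbol names X1, X2, …, and the helper names are my own.

```python
import itertools

# Global counter for fresh non-terminal names (X1, X2, ...).
_counter = itertools.count(1)

def to_cnf(lhs, rhs, is_terminal):
    """Convert one rule lhs -> rhs (a sequence of symbols) into CNF rules.

    is_terminal(symbol) says whether a symbol is a terminal.
    Returns a list of (lhs, rhs_tuple) pairs, each in CNF.
    """
    if len(rhs) == 1 and is_terminal(rhs[0]):
        return [(lhs, tuple(rhs))]  # already CNF: A -> β
    rules = []
    # Step 1: replace each terminal in a long rule with a fresh pre-terminal.
    symbols = []
    for sym in rhs:
        if is_terminal(sym):
            pre = f"X{next(_counter)}"
            rules.append((pre, (sym,)))
            symbols.append(pre)
        else:
            symbols.append(sym)
    # Step 2: binarize, so A -> B C D becomes A -> B X and X -> C D.
    while len(symbols) > 2:
        fresh = f"X{next(_counter)}"
        rules.append((lhs, (symbols[0], fresh)))
        lhs, symbols = fresh, symbols[1:]
    rules.append((lhs, tuple(symbols)))
    return rules

# The exercise rule M -> X λ ρ Y, with the terminals spelled "lam" and "rho":
cnf = to_cnf("M", ["X", "lam", "rho", "Y"], lambda s: s in {"lam", "rho"})
for lhs, rhs in cnf:
    print(lhs, "->", " ".join(rhs))
```

Every output rule has either two non-terminals or a single terminal on its right-hand side, as CNF requires; the fresh symbols play the same role as LP, RP, etc. in the solution below, though the names differ.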
Solution

For example, we can introduce LP, RP as non-terminals immediately dominating λ, ρ, and XP, YP as the non-terminals headed by X and Y. We then obtain:

• M → XP YP
• XP → X LP
• YP → RP Y
• LP → λ
• RP → ρ

Note that this is not a complete grammar; we have not introduced a start symbol and there are no expansions for X or Y.

5 Further reading

• J&M (§13.2.1) give a general-purpose algorithm for converting context-free grammars and trees to Chomsky normal form.
• J&M (§14) and Eisenstein (§10) describe probabilistic context-free grammars (PCFGs), in which each rule is associated with a conditional probability (conditioned on the left-hand side).
• Klein and Manning (2003) describe knowledge-based Markovization techniques for unlexicalized PCFG parsing.
• Petrov et al. (2006) develop data-driven Markovization techniques for unlexicalized PCFG parsing.
• Bikel (2004) describes the Collins (1999) parser, which uses a novel form of lexicalized PCFG.

References

Bikel, Daniel M. 2004. Intricacies of Collins' parsing model. Computational Linguistics 30:479–511.

Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on Information Theory 2:113–124.

Chomsky, Noam. 1959. On certain formal properties of grammars. Information and Control 2:137–167.

Chomsky, Noam. 1963. Formal properties of grammars. In Handbook of Mathematical Psychology, ed. R. Duncan Luce, Robert R. Bush, and Eugene Galanter, 323–418. John Wiley & Sons.

Collins, Michael. 1999. Head-driven statistical models for natural language processing. Doctoral dissertation, University of Pennsylvania.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.

Klein, Dan, and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 423–430.

Petrov, Slav, Leon Barrett, Romain Thibaux, and Dan Klein. 2006.
Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 433–440.

Shieber, Stuart M. 1985. Evidence against the context-freeness of natural language. Linguistics and Philosophy 8:333–343.
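As a coda, the exercise in §2.3 can be checked mechanically: since that grammar has no recursive rules, its language is finite and can be enumerated by exhaustively expanding every rule. The sketch below is my own illustration (the dictionary encoding of the rules is not part of the handout); it shows the language contains 60 strings, among them ungrammatical ones such as *Kyle barks the dog.

```python
from itertools import product

# The exercise grammar, encoded as a mapping from each non-terminal to its
# alternative right-hand sides (each a tuple of terminals/non-terminals).
RULES = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V",)],
    "NP": [("DT", "NN"), ("Kyle",)],
    "DT": [("a",), ("the",)],
    "NN": [("cat",), ("dog",)],
    "V": [("barks",), ("bites",)],
}

def expand(symbol):
    """Yield every terminal string (as a tuple of words) derivable from symbol."""
    if symbol not in RULES:  # a terminal derives only itself
        yield (symbol,)
        return
    for rhs in RULES[symbol]:
        # Expand each right-hand-side symbol, take the cross-product of the
        # alternatives, and concatenate the pieces.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield tuple(word for part in parts for word in part)

language = sorted(" ".join(s) for s in expand("S"))
print(len(language))                      # 60 strings in total
print("Kyle barks the dog" in language)   # True, though ungrammatical in English
```

There are 5 noun phrases (Kyle, a cat, a dog, the cat, the dog) and 2 verbs, so the count is 5 NPs × (2 intransitive + 2 × 5 transitive VPs) = 60, matching the regular-expression characterization given in the solution.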