
15-212: Fundamental Structures of Computer Science II

Some Notes on Grammars and Parsing

Robert Harper, Spring 1997
edited by Michael Erdmann, Spring 1999
Draft of April 7, 1999

1 Introduction

These notes are intended as a "rough and ready" guide to grammars and parsing. The theoretical foundations required for a thorough treatment of the subject are developed in the Formal Languages, Automata, and Computability course. The construction of parsers for programming languages using more advanced techniques than are discussed here is considered in detail in the Compiler Construction course.

Parsing is the determination of the structure of a sentence according to the rules of grammar. In elementary school we learn the parts of speech and learn to analyze sentences into constituent parts such as subject, predicate, direct object, and so forth. Of course it is difficult to say precisely what are the rules of grammar for English or other human languages, but we nevertheless find this kind of grammatical analysis useful.

In an effort to give substance to the informal idea of grammar and, more importantly, to give a plausible explanation of how people learn languages, Noam Chomsky introduced the notion of a formal grammar. Chomsky considered several different forms of grammars with different expressive power. Roughly speaking, a grammar consists of a series of rules for forming sentence fragments that, when used in combination, determine the set of well-formed grammatical sentences. We will be concerned here only with one particularly useful form of grammar, called a context-free grammar. The idea of a context-free grammar is that the rules are specified to hold independently of the context in which they are applied. This clearly limits the expressive power of the formalism, but it is nevertheless powerful enough to be useful, especially with computer languages. To illustrate the limitations of the formalism, Chomsky gave the now-famous sentence "Colorless green ideas sleep furiously." This sentence is grammatical according to some fairly obvious context-free rules: it has a subject and a predicate, with the subject modified by two adjectives and the predicate by an adverb. It is debatable whether it is "really" grammatical, precisely because we are uncertain exactly where the boundary lies between that which is grammatical and that which is meaningful. We will dodge these questions by avoiding consideration of interpreted languages (those with meaning), and instead focus on the mathematical notion of a formal language, which is just a set of strings over a specified alphabet. A formal language has no intended meaning, so we avoid questions like those suggested by Chomsky's example.[1]

[1] He has given many others. For example, the two sentences "Fruit flies like a banana." and "Time flies like an arrow." are superficially very similar, differing only by two noun-noun replacements, yet their "deep structure" is clearly very different!

2 Context-Free Grammars



Let us fix an alphabet Σ of letters. Recall that Σ* is the set of strings over the alphabet Σ. In the terminology of formal grammars the letters of the alphabet are called terminal symbols, or just terminals for short, and a string of terminals is called a sentence. A context-free grammar consists of an alphabet Σ, a set V of non-terminals (or variables), together with a set P of rules (or productions) of the form

    A → α

where A is a non-terminal and α is any sequence of terminals and non-terminals, called a sentential form. We distinguish a particular non-terminal S ∈ V, the start symbol of the grammar. Thus a grammar G is a four-tuple (Σ, V, P, S) consisting of these four items.

The significance of a grammar G is that it determines a language, L(G), defined as follows:

    L(G) = { w ∈ Σ* | S ⇒* w }

That is, the language of the grammar G is the set of strings w over the alphabet Σ such that w is derivable from the start symbol S. The derivability relation between sentential forms is defined as follows. First, we say that α ⇒ β iff α = α₁Aα₂, β = α₁γα₂, and A → γ is a rule of the grammar G. In other words, β may be derived from α by "expanding" one non-terminal of α by one rule of the grammar G. The relation α ⇒* β is defined to hold iff β may be derived from α in zero or more derivation steps.

Example 1 Let G be the following grammar over the alphabet Σ = { a, b }[2]:

    S → ε
    S → aSa
    S → bSb

It is easy to see that L(G) consists of strings of the form ww^R, where w^R is the reverse of w. For example,

    S ⇒ aSa
      ⇒ abSba
      ⇒ abaSaba
      ⇒ abaaba

To prove that L(G) = { ww^R | w ∈ Σ* } for the above grammar G requires that we establish two containments. Suppose that x ∈ L(G), that is, S ⇒* x. We are to show that x = ww^R for some w ∈ Σ*. We proceed by induction on the length of the derivation sequence, which must be of length at least 1 since x is a string and S is a non-terminal. In the case of a one-step derivation, we must have x = ε, which is trivially of the required form. Otherwise the derivation is of length n + 1, where n ≥ 1. The derivation must look like this:

    S ⇒ aSa
      ⇒* x

or like this:

    S ⇒ bSb
      ⇒* x

We consider the first case; the second is handled similarly. Clearly x must have the form a y a, where S ⇒* y by a derivation of length n. Inductively, y has the form uu^R for some u, and hence x = a uu^R a. Taking w = au completes the proof.

Now suppose that x ∈ { ww^R | w ∈ Σ* }. We are to show that x ∈ L(G), i.e., that S ⇒* x. We proceed by induction on the length of w. If w has length 0, then x = w = ε, and we see that S ⇒* ε. Otherwise w = au and w^R = u^Ra, or w = bu and w^R = u^Rb. In the former case we have inductively that S ⇒* uu^R, and hence S ⇒* a uu^R a. The latter case is analogous. This completes the proof.

[2] The set of non-terminals is effectively specified by the notational conventions in the rules. In this case the only non-terminal is S, which we implicitly take to be the start symbol.
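The characterization just proved also gives a direct, grammar-free membership test: x ∈ L(G) iff x is an even-length string over {a, b} whose second half is the reverse of its first half. Here is a small sketch of that test, written in Python for illustration only (the course code is in ML):

```python
def in_lg(x: str) -> bool:
    """Membership test for L(G), G with rules S -> e | aSa | bSb.

    By the proof above, L(G) is exactly the set of strings w·w^R
    over {a, b}, i.e. the even-length palindromes."""
    n = len(x)
    if n % 2 != 0 or any(c not in "ab" for c in x):
        return False
    w = x[: n // 2]
    return x == w + w[::-1]
```

Note that "aba" is a palindrome but is rejected: only even-length palindromes have the form ww^R.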

Exercise 2 Consider the grammar G with rules

    S → ε
    S → (S)S

over the alphabet Σ = { (, ) }. Prove that L(G) consists of precisely the strings of well-balanced parentheses.
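It may help to see the two rules of this grammar in executable form. The following Python sketch (a recognizer for illustration, not a substitute for the requested proof; the course code is in ML) chooses a rule based on the next character: on "(" it uses S → (S)S, otherwise S → ε:

```python
def balanced(s: str) -> bool:
    """Recognizer for the grammar S -> e | (S)S over { (, ) }."""
    def parse_s(i: int) -> int:
        # Returns the index just past the S derived starting at position i,
        # or raises ValueError if some '(' has no matching ')'.
        while i < len(s) and s[i] == "(":
            j = parse_s(i + 1)              # the inner S
            if j >= len(s) or s[j] != ")":
                raise ValueError("unmatched '('")
            i = j + 1                       # the trailing S (loop again)
        return i
    try:
        return parse_s(0) == len(s)
    except ValueError:
        return False
```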

A word about notation. In computer science contexts we often see context-free grammars presented in BNF (Backus-Naur Form). For example, a language of arithmetic expressions might be defined as follows:

    ⟨expr⟩   ::= ⟨number⟩ | ⟨expr⟩ + ⟨expr⟩ | ⟨expr⟩ × ⟨expr⟩
    ⟨number⟩ ::= ⟨digit⟩ ⟨digits⟩
    ⟨digits⟩ ::= ε | ⟨digit⟩ ⟨digits⟩
    ⟨digit⟩  ::= 0 | ... | 9

where the non-terminals are bracketed and the terminals are not.

3 Parsing

The parsing problem for a grammar G is to determine whether or not w ∈ L(G) for a given string w over the alphabet of the grammar. There is a polynomial-time (in fact, cubic-time) algorithm that solves the parsing problem for an arbitrary grammar, but it is only rarely used in practice. The reason is that in typical situations restricted forms of context-free grammars admitting more efficient (linear-time) parsers are sufficient. We will briefly consider two common approaches used in hand-written parsers, called operator precedence and recursive descent parsing. It should be mentioned that in most cases parsers are not written by hand, but rather are automatically generated from a specification of the grammar. Examples of such systems are Yacc and Bison, available on most Unix platforms, which are based on what are called LR grammars.

3.1 Operator Precedence Parsing

Operator precedence parsing is designed to deal with infix operators of varying precedences. The motivating example is the language of arithmetic expressions over the operators +, −, ×, and /. According to the standard conventions, the expression 3 + 6 × 9 − 4 is to be read as (3 + (6 × 9)) − 4, since multiplication "takes precedence" over addition and subtraction. Left-associativity of addition corresponds to addition yielding precedence to itself(!) and subtraction and, conversely, subtraction yielding precedence to itself and addition. Unary minus is handled by assigning it highest precedence, so that −4 × 3 is parsed as (−4) × 3.

The grammar of arithmetic expressions that we shall consider is defined as follows:

    E → n | −E | E + E | E − E | E × E | E / E

where n stands for any number. Notice that, as written, this grammar is ambiguous in the sense that a given string may be derived in several different ways. In particular, we may derive the string 3 + 4 × 5 in at least two different ways, corresponding to whether or not we regard multiplication as taking precedence over addition:

    E ⇒ E + E
      ⇒ 3 + E
      ⇒ 3 + E × E
      ⇒ 3 + 4 × E
      ⇒ 3 + 4 × 5

    E ⇒ E × E
      ⇒ E + E × E
      ⇒ 3 + E × E
      ⇒ 3 + 4 × E
      ⇒ 3 + 4 × 5

The first derivation corresponds to the reading 3 + (4 × 5), the second to (3 + 4) × 5. Of course the first is the "intended" reading according to the usual rules of precedence. Note that both derivations are leftmost derivations, in the sense that at each step the leftmost non-terminal is expanded by a rule of the grammar. We have exhibited two distinct leftmost derivations, which is the proper criterion for proving that the grammar is ambiguous. Even unambiguous grammars may admit distinct leftmost and rightmost (defined analogously) derivations of a given sentence.

Given a string w over the alphabet { n, +, −, ×, / | n ∈ ℕ }, we would like to (a) determine whether or not it is well-formed according to the grammar of arithmetic expressions, and (b) parse it according to the usual rules of precedence to determine its meaning unambiguously. The parse will be represented by translating the expression from the concrete syntax specified by the grammar to the abstract syntax specified by the following ML datatype:[3]

    datatype expr =
        Int of int
      | Plus of expr * expr
      | Minus of expr * expr
      | Times of expr * expr
      | Divide of expr * expr

Thus 3 + 4 × 5 would be translated to the ML value Plus (Int 3, Times (Int 4, Int 5)), rather than as Times (Plus (Int 3, Int 4), Int 5).
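To see why the first translation is the right one, note that the two candidate trees denote different numbers. The following sketch makes this concrete, using Python for illustration with nested tuples standing in for the ML constructors (an assumption of these notes' editor, not the course's representation):

```python
def eval_expr(e):
    """Evaluate an abstract-syntax tree; the tree itself fixes the grouping."""
    tag = e[0]
    if tag == "Int":
        return e[1]
    ops = {"Plus":   lambda a, b: a + b,
           "Minus":  lambda a, b: a - b,
           "Times":  lambda a, b: a * b,
           "Divide": lambda a, b: a // b}   # integer division, by assumption
    return ops[tag](eval_expr(e[1]), eval_expr(e[2]))

# The two candidate translations of "3 + 4 * 5":
right = ("Plus", ("Int", 3), ("Times", ("Int", 4), ("Int", 5)))   # 3 + (4 * 5)
wrong = ("Times", ("Plus", ("Int", 3), ("Int", 4)), ("Int", 5))   # (3 + 4) * 5
```

Evaluating `right` gives 23, while `wrong` gives 35: the abstract syntax, not the flat string, carries the meaning.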

How is this achieved? The basic idea is simple: scan the string from left to right, translating as we go. The fundamental problem is that we cannot always decide immediately when to translate. Consider the strings x = 3 + 4 × 5 and y = 3 + 4 + 5. As we scan from left to right, we encounter the sub-expression 3 + 4, but we cannot determine whether to translate this to Plus (Int 3, Int 4) until we see the next operator. In the case of the string x, the right decision is not to translate, since the next operator is a multiplication which "steals" the 4, the right-hand argument to the addition being 4 × 5. On the other hand, in the case of the string y the right decision is to translate, so that the left-hand argument of the second addition is Plus (Int 3, Int 4).

[3] Since the abstract syntax captures the structure of the parse, the translation into abstract syntax is sometimes called a parse tree for the given string. This terminology is rapidly becoming obsolete.

The solution is to defer decisions as long as possible. We achieve this by maintaining a stack of terminals and non-terminals representing a partially-translated sentential form. As we scan from left to right we have two decisions to make based on each character of the input:[4]

Shift: Push the next item onto the stack, deferring translation until further context is determined.

Reduce: Replace the top several elements of the stack by their translation according to some rule of the grammar.

Consider again the string x given above. We begin by shifting 3, +, and 4 onto the stack, since no decision can be made. We then encounter the ×, and must decide whether to shift or reduce. The decision is made based on precedence. Since × takes precedence over +, we shift × onto the stack, then shift 5 as well. At this point we encounter the end of the expression, which causes us to pop the stack by successive reductions. Working backwards, we reduce [3; +; 4; ×; 5] to [3; +; 4 × 5], then reduce again to [3 + 4 × 5], and yield this expression as result (translated into a value of type expr).

For the sake of contrast, consider the expression y given above. We begin as before, shifting 3, +, and 4 onto the stack. Since + yields precedence to itself, this time we reduce the stack to [3 + 4] before shifting + and 5 to obtain the stack [3 + 4; +; 5], which is then reduced to [3 + 4 + 5], completing the parse.
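The shift/reduce discipline just traced can be sketched directly. The following Python fragment is a simplification for illustration (no unary minus, no parentheses, no error checking, and not the course's higher-order ML implementation): "yielding precedence" becomes the rule that before shifting an operator we reduce while the operator on top of the stack has greater or equal precedence, which makes + and − associate to the left. Tuples with operator characters as tags stand in for the expr constructors.

```python
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse(tokens):
    """Shift-reduce parse of a token list like [3, '+', 4, '*', 5]."""
    vals, ops = [], []          # translated subtrees; deferred operators

    def reduce():
        op = ops.pop()
        right = vals.pop()
        left = vals.pop()
        vals.append((op, left, right))

    for t in tokens:
        if isinstance(t, int):
            vals.append(("Int", t))              # shift a number
        else:
            while ops and PREC[ops[-1]] >= PREC[t]:
                reduce()                         # yield precedence: reduce first
            ops.append(t)                        # then shift the operator
    while ops:                                   # end of input: pop the stack
        reduce()
    return vals[0]
```

On [3, '+', 4, '*', 5] the '+' stays on the stack (1 < 2) until the end, reproducing the first trace above; on [3, '+', 4, '+', 5] the second '+' forces an immediate reduction, reproducing the second.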

The operator precedence method can be generalized to admit more operators with varying precedences by following the same pattern of reasoning. We may also consider explicit parenthesization by treating each parenthesized expression as a situation unto itself: the expression (3 + 4) × 5 is unambiguous as written, and should not be re-parsed as 3 + (4 × 5)! The "trick" is to save the stack temporarily while processing the parenthesized expression, then restore it when that sub-parse is complete.

A complete operator-precedence parser for a language of arithmetic expressions is given in the course directory. Please study it carefully! A number of instructive programming techniques are introduced there. Note in particular how the stack elements are represented using higher-order functions to capture pending translation decisions.

3.2 Recursive Descent Parsing

Another common method for writing parsers by hand is called recursive descent because it proceeds by a straightforward inductive analysis of the grammar. The difficulty is that the method doesn't always work: only some grammars are amenable to treatment in this way. However, in practice, most languages can be described by a suitable grammar, and hence the method is widely used.

Here is the general idea. We attempt to construct a leftmost derivation of the candidate string by a left-to-right scan using one symbol of look-ahead.[5] By "one symbol of look-ahead" we mean that we decide which rule to apply at each step based solely on the next character of input; no decisions are deferred as they were in the case of operator-precedence parsing. It should be obvious that this method doesn't always work. In particular, the grammar for arithmetic expressions given above is not suitable for this kind of approach. Given the expression 3 + 4 × 5, and starting with the non-terminal E, we cannot determine based solely on seeing 3 whether to expand E to E + E or E × E. As we observed earlier, there are two distinct leftmost parses for this expression in the arithmetic expression grammar.

[4] We ignore error checking for the time being.

[5] Hence the name "LL(1)" for grammars that are amenable to such treatment. LR(1) grammars are those that admit parsing by construction of a rightmost derivation based on a left-to-right scan of the input, using at most one symbol of look-ahead.

You might reasonably suppose that we're out of luck. But in fact we can define another grammar for the same language that admits a recursive descent parse. We present a suitable grammar in stages. In order to simplify the development, we omit division and unary negation.

First, we layer the grammar to capture the rules of precedence:

    E → T | E + T | E − T
    T → F | T × F
    F → n

The non-terminals E, T, and F stand for "expression", "term", and "factor", respectively. We have written the production rules in such a way that addition, subtraction, and multiplication all associate to the left.

The layering of the grammar resolves the precedences. In particular, observe that the string 3 + 4 × 5 has precisely one leftmost derivation according to this grammar:

    E ⇒ E + T
      ⇒ T + T
      ⇒ F + T
      ⇒ 3 + T
      ⇒ 3 + T × F
      ⇒ 3 + F × F
      ⇒ 3 + 4 × F
      ⇒ 3 + 4 × 5

Notice that this derivation corresponds to the intended reading, since the outermost structure is an addition whose right-hand operand is a multiplication.

Next, we eliminate "left recursion" from the grammar. This results in the following grammar:

    E  → T E'
    E' → + T E' | − T E' | ε
    T  → F T'
    T' → × F T' | ε
    F  → n

At this point the grammar is suitable for recursive-descent parsing. Here is a derivation of the string 3 + 4 × 5 once again, using the revised grammar. Notice that each step is either forced (no choice) or is determined by the next symbol of input. The bullet (•) indicates the separation between input already read and the next look-ahead character.

    E ⇒ T E'                • 3 + 4 × 5
      ⇒ F T' E'             • 3 + 4 × 5
      ⇒ 3 T' E'             3 • + 4 × 5
      ⇒ 3 E'                3 • + 4 × 5
      ⇒ 3 + T E'            3 + • 4 × 5
      ⇒ 3 + F T' E'         3 + • 4 × 5
      ⇒ 3 + 4 T' E'         3 + 4 • × 5
      ⇒ 3 + 4 × F T' E'     3 + 4 × • 5
      ⇒ 3 + 4 × 5 T' E'     3 + 4 × 5 •
      ⇒ 3 + 4 × 5 E'        3 + 4 × 5 •
      ⇒ 3 + 4 × 5           3 + 4 × 5 •

Study this derivation carefully, and be sure you see how each step is determined.
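A recursive-descent parser for this grammar writes one procedure per non-terminal; the ε-or-extend choice in E' and T' shows up as a loop (equivalently, tail recursion). Here is a sketch, in Python for illustration rather than the course's ML, taking numbers and operator characters as tokens and returning a tuple tree together with the unconsumed input:

```python
def parse_expr(ts):
    """E -> T E'.  The loop plays the role of E' -> + T E' | - T E' | e."""
    left, ts = parse_term(ts)
    while ts and ts[0] in ("+", "-"):   # look-ahead decides: extend or stop
        op = ts[0]
        right, ts = parse_term(ts[1:])
        left = (op, left, right)        # accumulate to the left
    return left, ts

def parse_term(ts):
    """T -> F T'.  The loop plays the role of T' -> * F T' | e."""
    left, ts = parse_factor(ts)
    while ts and ts[0] == "*":
        right, ts = parse_factor(ts[1:])
        left = ("*", left, right)
    return left, ts

def parse_factor(ts):
    """F -> n: the look-ahead must be a number."""
    if not ts or not isinstance(ts[0], int):
        raise SyntaxError("expected a number")
    return ("Int", ts[0]), ts[1:]
```

Note that accumulating `left` inside the loop restores the left associativity that the original layered grammar expressed with left recursion.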

3.3 Difficulties

3.3.1 Left Recursion

There are a few "gotchas" to watch out for when defining grammars for recursive descent parsing. One is that "left-recursive" rules must be avoided. Consider the following grammar for sequences of digits:

    N → ε | Nd

where d is a single digit. If we attempt to use this grammar to construct a leftmost derivation for a non-empty string of digits, we go into an infinite loop, endlessly expanding the leftmost N and never making any progress through the input! The following right-recursive grammar accepts the same language, but avoids the problem:

    N → ε | dN

Now we absorb digits one by one until we reach the end of the sequence.
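In code the contrast is stark: the right-recursive grammar is directly a terminating recursive procedure, whereas a procedure for the left-recursive grammar would call itself before consuming any input. A minimal Python sketch of the right-recursive version:

```python
def digits(s: str) -> bool:
    """Recognizer for N -> e | dN, where d is any digit."""
    if s == "":
        return True               # N -> e
    if s[0].isdigit():
        return digits(s[1:])      # N -> dN: consume one digit, recur
    return False
```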

More generally, suppose the grammar exhibits left-recursion like this:

    A → Au | v

where v does not begin with A. Then we can eliminate this left-recursion by replacing the two given rules with these three rules:

    A  → vA'
    A' → uA' | ε

3.3.2 Left Factoring

Another trouble spot is that several rules may share a common prefix. For example, consider the following grammar of statements in an imperative programming language:

    S → if E then S
    S → if E then S else S

where E derives some unspecified set of expressions. The intention, of course, is to express the idea that the else clause of a conditional statement is optional. Yet consider how we would write a recursive-descent parser for this grammar. Upon seeing if, we don't know whether to apply the first or the second rule: that is, do we expect a matching else or not? What to do? First, by left-factoring we arrive at the following grammar:

    S  → if E then S S'
    S' → ε | else S

This pushes the problem to the last possible moment: having seen if e then s, we must now decide whether to expect an else or not. But consider the sentence

    if e then if e' then s else s'

To which if does the else belong? That is, do we read this statement as

    if e then (if e' then s else s')

or as

    if e then (if e' then s) else s'?

The grammar itself does not settle this question. The usual approach is to adopt a fixed convention (the first is the typical interpretation), and to enforce it by ad hoc means: if there is an else, then it "belongs to" the most recent if statement.
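The convention is easy to enforce in a recursive-descent parser: after parsing the body of an if, greedily consume an else if one is present. The following Python sketch works over hypothetical flat token lists (the token shapes and tree representation are invented here for illustration; a lone name stands for an expression or a primitive statement):

```python
def parse_stmt(ts):
    """Parse S -> if E then S S' ;  S' -> e | else S.

    The key decision: on seeing 'else', take S' -> else S immediately,
    so each else attaches to the innermost pending if."""
    if ts[0] == "if":
        cond = ts[1]                          # the E (a single token here)
        body, ts = parse_stmt(ts[3:])         # skip 'if', E, 'then'
        if ts and ts[0] == "else":            # S' -> else S
            alt, ts = parse_stmt(ts[1:])
            return ("if", cond, body, alt), ts
        return ("if", cond, body, None), ts   # S' -> e
    return ("stmt", ts[0]), ts[1:]            # a primitive statement
```

Applied to the tokens of "if e then if e2 then s else s2", the else ends up inside the inner if-node, matching the stated convention.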

Designing grammars for recursive-descent parsing is usually fairly intuitive. If you're interested, a completely rigorous account of LL(1) grammars can be found in any textbook on formal language theory or any compiler construction textbook.