CDM [1Ex]Context-Free Grammars

CDM Context-Free Grammars Klaus Sutner Carnegie Mellon Universality 60-cont-free 2017/12/15 23:17 1 Generating Languages Properties of CFLs Generation vs. Recognition 3 Turing machines can be used to check membership in decidable sets. They can also be used to enumerate semidecidable sets, whence the classical notion of recursively enumerable sets. For languages L ⊆ Σ? there is a similar notion of generation. The idea is to set up a system of simple rules that can be used to derive all words in a particular formal language. These systems are typically highly nondeterministic and it is not clear how to find (efficient) recognition algorithms for the corresponding languages. Noam Chomsky 4 Historically, these ideas go back to work by Chomsky in the 1950s. Chomsky was mostly interested natural languages: the goal is to develop grammars that differentiate between grammatical and and ungrammatical sentences. 1 The cat sat on the mat. 2 The mat on the sat cat. Alas, this turns out to be inordinately difficult, syntax and semantics of natural languages are closely connected and very complicated. But for artificial languages such as programming languages, Chomsky's approach turned out be perfectly suited. Cat-Mat Example 5 Sentence Noun Phrase Verb Phrase Punctuation Determiner Noun Verb Prepositional Phrase . The cat sat Preposition Noun Phrase on Determiner Noun the mat Mat-Cat Example 6 Noun Phrase Noun Phrase Prepositional Phrase Punctuation Determiner Noun Preposition Noun Phrase . The mat on Determiner Adjective Noun the sat cat Killer App: Programming Languages 7 Many programming languages have a block structure like so: begin begin end begin begin end begin end end end Clearly, this is not a regular language and cannot be checked by a finite state machine. We need more compute power. Generalizing 8 We have two rather different ways of describing regular languages: finite state machine acceptors regular expressions We could try to generalize either one of these. Let's start with the algebra angle and handle the machine model later. Grammars 9 Definition A (formal) grammar is a quadruple G = h V; Σ; P;S i where V and Σ are disjoint alphabets, S 2 V , and P is a finite set of productions or rules. the symbols of V are (syntactic) variables, the symbols of Σ are terminals, S is called the start symbol (or axiom). We often write Γ = V [ Σ for the complete alphabet of G. Context Free Grammars 10 Definition (CFG) A context free grammar is a grammar where the productions have the form ? P ⊆ V × Γ It is convenient to write productions in the form π : A α where A 2 V and α 2 Γ?. The idea is that we may replace A by α. Naming Conventions 11 A; B; C : : : represent elements of V , S 2 V is the start symbol, a; b; c : : : represent elements of Σ, X; Y; Z : : : represent elements of Γ, w; x; y : : : represent elements of Σ?, α; β; γ : : : represent elements of Γ?. Derivations 12 Given a CFG G define a one-step relation =1) ⊆ Γ? × Γ? as follows: 1 αAβ =) αγβ if A γ 2 P As usual, by induction define α =k+1) β if 9 γ (α =k) γ ^ γ =1) β) and α =∗) β if 9 k α =k) β in which case one says that α derives or yields β. α is a sentential form if it can be derived from the start symbol S. To keep notation simple we'll often just write α =) β. Context Free Languages 13 Definition The language of a context free grammar G is defined to be L(G) = f x 2 Σ? j S =∗) x g Thus L(G) is the set of all sentential forms in Σ?. We also say that G generates L(G). A language is context free (CFL) if there exists a context free grammar that generates it. Note that in a CFG one can replace a single syntactic variable A by strings over Γ independently of were A occurs; whence the name \context free." Later on we will generalize to replacement rules that operate on a whole block of symbols (context sensitive grammars). Example: Regular 14 Let G = h fS; A; Bg; fa; bg; P;S i where the set P of productions is defined by: S aA j aB A aA j aB B bB j b. A typical derivation is: S ) aA ) aaA ) aaaB ) aaabB ) aaabb It is not hard to see that L(G) = a+b+ Not too interesting, we already know how to deal with regular languages. Can you see the finite state machine hiding in the grammar? Is it minimal? Derivation Graph 15 Derivations of length at most 6 in this grammar. Labeled 16 S aA aB aaA aaB ab abB aaaA aaaB aab aabB abb abbB aaaaA aaaaB aaab aaabB aabb aabbB abbb abbbB aaaaaA aaaaaB aaaab aaaabB aaabb aaabbB aabbb aabbbB abbbb abbbbB Example: Mystery 17 Let G = h fA; Bg; fa; bg; P;A i where the set P of productions is defined by: A AA j AB j a B AA j BB j b. A typical derivation is: A ) AA ) AAB ) AABB ) AABAA ) aabaa In this case it is not obvious what the language of G is (assuming it has some easy description, it does). More next time when we talk about parsing. Derivation Graph 18 Derivations of length at most 3 in this grammar. Three terminal strings appear at this point. Depth 4 19 Example: Counting 20 Let G = h fSg; fa; bg; P;S i where the set P of productions is defined by: S aSb j " A typical derivation is: S ) aSb ) aaSbb ) aaaSbbb ) aaabbb Clearly, this grammar generates the language f aibi j i ≥ 0 g It is easy to see that this language is not regular. Derivation Graph 21 Example: Palindromes 22 Let G = h fSg; fa; bg; P;S i where the set P of productions is defined by: S aSa j bSb j a j b j " A typical derivation is: S ) aSa ) aaSaa ) aabSbaa ) aababaa This grammar generates the language of palindromes. Exercise Give a careful proof of this claim. Derivation Graph 23 Example: Parens 24 Let G = h fSg; f(; )g; P;S i where the set P of productions is defined by: S SS j (S) j " A typical derivation is: S ) SS ) (S)S ) (S)(S) ) (S)((S)) ) ()(()) This grammar generates the language of well-formed parenthesized expressions. Exercise Give a careful proof of this claim. Derivation Graph 25 Example: Expressions of Arithmetic 26 Let G = h fEg; f+; ∗; (; ); vg; P;E i where the set P of productions is defined by: E E + E j E ∗ E j (E) j v A typical derivation is: E ) E ∗ E ) E ∗ (E) ) E ∗ (E + E) ) v ∗ (v + v) This grammar generates a language of arithmetical expressions with plus and times. Alas, there are problems: the following derivation is slightly awkward. E ) E + E ) E + (E) ) E + (E ∗ E) ) v + (v ∗ v) Our grammar is symmetric in + and ∗, it knows nothing about precedence. Derivation Graph 27 Ambiguity 28 We may not worry about awkward, but the following problem is fatal: E ) E + E ) E + E ∗ E ) v + v ∗ v E ) E ∗ E ) E + E ∗ E ) v + v ∗ v There are two derivations for the same word v + v ∗ v. Since derivations determine the semantics of a string this is really bad news: a compiler could interpret v + v ∗ v in two different ways, producing different results. Parse Trees 29 Derivation chains are hard to read, a better representation is a tree. Let G = hV; Σ; P;Si be a context free grammar. A parse tree of G (aka grammatical tree) is an ordered tree on nodes N, together with a labeling λ : N ! V [ Σ such that For all interior nodes x: λ(x) 2 V , If x1; : : : ; xk are the children, in left-to-right order, of interior node x then λ(x) λ(x1) : : : λ(xk) is a production of G, λ(x) = " implies x is an only child. Derivation Trees 30 Here are the parse trees of the \expressions grammar" from above. E E E + E E E ∗ E E E + E ∗ Note that the trees provide a method to evaluate arithmetic expressions, so the existence of two trees becomes a nightmare. Information Hiding 31 A parse tree typically represents several derivations: E E E ∗ v E + E v v represents for example θ1 : E ) E ∗ E ) E ∗ E + E ) v ∗ E + E ) v ∗ v + E ) v ∗ v + v θ2 : E ) E ∗ E ) E ∗ E + E ) E ∗ E + v ) E ∗ v + v ) v ∗ v + v θ3 : E ) E ∗ E ) v ∗ E ) v ∗ E + E ) v ∗ v + E ) v ∗ v + v but not θ4 : E ) E + E ) E ∗ E + E ) v ∗ E + E ) v ∗ v + E ) v ∗ v + v Leftmost Derivations 32 Let G be a grammar and assume α =1) β. We call this derivation step leftmost if α = xAα0 β = xγα0 x 2 Σ? A whole derivation is leftmost if it only uses leftmost steps. Thus, each replacement is made in the first possible position. Proposition Parse trees correspond exactly to leftmost derivations. Ambiguity 33 Definition A CFG G is ambiguous if there is a word in the language of G that has two different parse trees. Alternatively, there are two different leftmost derivations. As the arithmetic example demonstrates, trees are connected to semantics, so ambiguity is a serious problem in a programming language. Unambiguous Arithmetic 34 For a \reasonable" context free language it is usually possible to remove ambiguity by rewriting the grammar.

CDM [1Ex]Context-Free Grammars

Fast Graph Simplification for Interleaved Dyck-Reachability

Grammatical Compression: Compressed Equivalence and Other Problems Alberto Bertoni, Roberto Radicioni

Visibly Counter Languages and Constant Depth Circuits

The Inclusion Problem of Context-Free Languages: Some Tractable Cases?

Inverse Monoids, Trees, and Context-Free Languages

The Dyck Language Edit Distance Problem in Near-Linear Time

Quantum Lower and Upper Bounds for 2D-Grid and Dyck Language

Improved Bounds for Testing Dyck Languages

Rational Subsets of Groups, When Good Closure and Decidability Properties of These Subsets Are Satisﬁed

Rational Monoid and Semigroup Automata

Recursive Prime Factorizations: Dyck Words As Representations Of

How to Prove That a Language Is Regular Or Star-Free? Jean-Eric Pin