A Hints and Solutions to Selected Problems

A Hints and Solutions to Selected Problems Answer to Problem 2.1: The description is not really finite: it depends on all descriptions in the list, of which there are infinitely many. Answer to Problem 2.2: Any function that maps the integers onto unique integers will do (an injective function), for example nn or the n-th prime. Answer to Problem 2.3: Hint: use a special (non-terminal) marker for each block the tur- tle is located to the east of its starting point. Answer to Problem 2.4: The terminal production cannot contain S, and the only way to get rid of it is to apply S--->aSQ k ≥ 0 times followed by S--->abc. This yields a sentential form akabcQk. The part bcQk cannot produce any as, so all the as in the result come from aka,so all strings an··· can be produced, for n = k+1 ≥ 1. We now need to prove that bcQk produces exactly bncn. The only way to get rid of the Q is through bQc--->bbcc, which requires the Q to be adjacent to the b. The only way to get it there is through repeated application of cQ--->Qc. This does not affect the number of each token. After j application of bQc--->bbcc we have bb jc jcQk − j, and after k applications we have bbkckc, which equals bncn. Answer to Problem 2.5: Assume the alphabet is {(, [}. First create a nesting string of (, ), [,and], with a marker in the center and another marker at the end: ([[...[([X])]...]])Z. Give the center marker the ability to “ignite” the ) or ] on its right: ([[...[([X]*)]...]])Z. Make the ignited parenthesis move right until it bumps into the end marker: ([[...[([X)]...]])]*Z. Move it over the end marker and turn it into its (normal) partner: ([[...[([X)]...]])]Z[. This moves the character in the first position after the center marker to the first position after the end marker; repeat for the next character: ([[...[([X]...]])]Z([. This reverses the string between the markers while at the same time translating it: ([[...[([XZ([[...[([. Now the markers can annihilate each other. Answer to Problem 2.11: We have to carve up aibi into u, v,andw,suchthatuvnw is again in aibi,foralln. The problem is to find v:1.v cannot lie entirely in ai,sinceuvnw would produce ai + nbi.2.v cannot straddle the a-b boundary in aibi, since even vv would contain a b followed by an a.3.v cannot lie entirely in bieither, for the same reason as under 1. 646 A Hints and Solutions to Selected Problems Answer to Problem 2.12: It depends. In Ss--->A; S--->B; A--->a; B--->a the rules involving B can be removed without changing the language produced, but if one does so, the grammar is suddenly no longer ambiguous. Is that what we want? Answer to Problem 3.1: Not necessarily. Although there is no ambiguity in the prove- nance of each terminal in the input, in that each can come from only one rule identifying one non-terminal, the composition of the non-terminals into the start symbol can still be ambiguous: S--->A|B; A--->C; B--->C; C--->a. See also Problem 7.3. Answer to Problem 3.3: In a deterministic top-down parser the internal administration resides in the stack and the partial parse tree in the calls to semantic routines. k − 1 Answer to Problem 3.5: Ss--->aS|b with the input a b. Answer to Problem 3.8: See Tomita [161]. Answer to Problem 3.12: 1b. Compute how many t2sontheleftandhowmanyt1sonthe right each right-hand side contributes, using matching in between, and see if it comes out to 0 and 0 for the start symbol. 2. Assume the opposite. Then there is a (u1,u2)-demarcated string U which is not balanced for (t1,t2). Suppose U contains an unmatched t1. Since the entire string is balanced for (t1,t2), the corresponding t2 must be somewhere outside U. Together with the t1 it forms a string T which must be balanced for (u1,u2).However,T does either not contain the u1 or the u2 boundary token of U, which means that it contains either an unbalanced u1 or an unbalanced u2. 3. Exhaustive search seems the only option. 4. Trivial, but potentially useful. Answer to Problem 4.2: A simple solution can be obtained by replacing each terminal ti in the original grammar by a non-terminal Ti,andaddaruleTi → t1|ε to the grammar. Now the standard Unger and CYK parser will recognize subsequences. Incorporating the recovery of the deleted tokens in the parser itself is much more difficult. Answer to Problem 4.3: What does your algorithm do on S--->AAA, A--->ε (multiple nul- lable right-hand side symbols)? On S--->AS|x, A--->AS|B, B--->ε (infinite ambiguity)? Answer to Problem 4.4: No rule and no non-terminal remains, which supports the claim in Section 2.6 that ({}, {}, {}, {}) is the proper representation for grammars which produce the empty set. Answer to Problem 4.7: For each rule of the form A → BC, for each length l in Ti,B,for each length m in Ti+l,C, Ti,A has to contain a length l + m. Answer to Problem 5.2: 1. Yes: create a new accepting state ♦ and add ε-transitions to it from all original accepting states. 2. Yes: split off new transitions to ♦ just before you enter an accepting state. 3. No: an ε-free automaton deterministically recognizing just a and ab must have at least two different accepting states. Answer to Problem 5.3: It is sufficient to show that removing non-productive rules from a regular grammar with no unreachable non-terminals will not result in a grammar with unreachable non-terminals. Now, suppose that a non-terminal U becomes unreachable when removing non-productive rules. This means that U occurs in some right-hand side that is removed be- cause it is non-productive. But, as the grammar is regular, U is the only non-terminal occurring in this right-hand side, which means that U is non-productive and will be removed. Answer to Problem 5.4: 1. Find all ε-loops and combine all nodes on each of them into a ε a single node. 2. As long as there is an ε-transition Si → S j do: for each transition Sh → Si add a ε a transition Sh → S j,fora terminal or ε. Then remove the transition Si → S j. A Hints and Solutions to Selected Problems 647 Answer to Problem 6.4: 1. Define a stack of procedure pointers S. In the procedure for A → BCD, stack CD, call B, and unstack. In the procedure for A → a, pop the top-of-stack procedure, call it if the tokens match, and push it back. The stack represents the prediction stack. 2. Define a record L as a link in a chain of pointers to procedures. In the procedure for A → BCD, which has one parameter, a pointer prev to L, declare an L this, link it to prev, insert a pointer to CD, and call B with a pointer to this as parameter. In the procedure for A → a, call the procedure in prev with the backlink in it as a parameter, if the tokens match. The linked list represents the prediction stack. Answer to Problem 6.6: Hint: mind the ε-rules. Answer to Problem 6.8: (a) See Nederhof [105]. Answer to Problem 7.2: The plan does not tell how many εs should be recognized at a given position and in which order. In particular, for infinitely ambiguous grammars infinitely many εs would have to be recognized. Answer to Problem 7.3: No. Any grammar can be trivially brought in that form by intro- ducing non-terminals for all terminals. See also Problem 3.1. Answer to Problem 7.5: Lazy evaluation (+ memoization) is useful only when there is a good chance that the answer will not be needed. Since the Completer will come and visit anyway, it is not useful here. The required data structures would only slow the parser down. Answer to Problem 7.6: Put edges for all rules from each node to itself in the chart. This fulfills the invariant. Put edges for the input tokens on the stack. The parser now makes roughly the same movements as a CYK parser. Answer to Problem 7.7: Essentially the same arguments as in Section 7.3.6. Answer to Problem 8.1: Each rule has only one alternative, so no decision is required, and the language produced contains exactly one string. Answer to Problem 8.2: No. The first step in converting to CNF is ε-rule removal, which in general will not leave an LL(1) grammar. For example: A--->a|ε, B--->cA results in B--->cA’|c, A’--->a, which is not LL(1). Answer to Problem 8.3: Yes, these substitutions do not change the FIRST and FOLLOW sets of the right-hand sides. Answer to Problem 8.4: The token t is not in any FIRST set and either the grammar has no rules producing ε or t is not in any FOLLOW set either. (Question: How can a token not be in any FIRST nor in any FOLLOW set?) Answer to Problem 8.5: a. Yes. The terminal productions of both alternatives of S start with different terminals in spite of the fact that the alternatives themselves start with the same non-terminal, the latter producing only ε.

A Hints and Solutions to Selected Problems

Journal of Computer Science and Engineering Parsing

An Efficient Implementation of the Head-Corner Parser

Generalized Probabilistic LR Parsing of Natural Language (Corpora) with Unification-Based Grammars

CS 375, Compilers: Class Notes Gordon S. Novak Jr. Department Of

LATE Ain't Earley: a Faster Parallel Earley Parser

NLTK Parsing Demos

Parsing 1. Grammars and Parsing 2. Top-Down and Bottom-Up Parsing 3

Chart Parsing and Constraint Programming

Chart Parsing and Probabilistic Parsing

Chart Parsing: the CYK Algorithm Complexity

Compiler Construction

A Modified Earley Parser for Huge Natural Language Grammars 2 Earley Parser