18 Annotated Bibliography
The purpose of this annotated bibliography is to supply the reader with more material and more detail than was possible in the preceding chapters, rather than just to list the works referenced in the text. The annotations cover a considerable number of subjects that have not been treated in the rest of the book. The printed version of this book includes only those literature references and their summaries that are actually referred to in it. The full literature list, with summaries as far as available, can be found on the web site of this book; it includes its own author index and subject index.

This annotated bibliography differs in several respects from the usual literature list.

• The annotated bibliography consists of four sections:
  – Main parsing material — papers about the main parsing techniques.
  – Further parsing material — papers about extensions of and refinements to the main parsing techniques, non-Chomsky systems, error recovery, etc.
  – Parser writing and application — both in computer science and in natural languages.
  – Support material — books and papers useful to the study of parsers.
• The entries in each section have been grouped into more detailed categories; for example, the main section contains categories for general CF parsing, LR parsing, precedence parsing, etc. For details see the Table of Contents at the beginning of this book. Most publications in parsing can easily be assigned a single category. Some that span two categories have been placed in one, with a reference in the other.
• The majority of the entries are annotated. This annotation is not a copy of the abstract provided with the paper (which generally says something about the results obtained) but is rather the result of an attempt to summarize the technical content in terms of what has been explained elsewhere in this book.
• The entries are ordered chronologically rather than alphabetically.
This arrangement has the advantage that it is much more meaningful than a single alphabetic list, ordered on author names. Each section can be read as the history of research on that particular aspect of parsing; related material is found close together, and recent material is easily separated from older publications. A disadvantage is that it is now difficult to locate entries by author; to remedy this, an author index (starting on page 651) has been supplied.

18.1 Major Parsing Subjects

18.1.1 Unrestricted PS and CS Grammars

1. Tanaka, Eiichi and Fu, King-Sun. Error-correcting parsers for formal languages. IEEE Trans. Comput., C-27(7):605–616, July 1978.
In addition to the error correction algorithms referred to in the title (for which see [301]), a version of the CYK algorithm for context-sensitive grammars is described. It requires the grammar to be in 2-form: no rule has a right-hand side longer than 2, and no rule has a left-hand side longer than its right-hand side. This limits the number of possible rule forms to 4: A → a, A → BC, AB → CB (right context), and BA → BC (left context). The algorithm is largely straightforward; for example, for rule AB → CB, if C and B have been recognized adjacently, an A is recognized in the position of the C. Care has to be taken, however, to avoid recognizing a context for the application of a production rule when the context is not there at the right moment; a non-trivial condition is given for this, without explanation or proof.

18.1.2 General Context-Free Parsing

2. Irons, E. T. A syntax-directed compiler for ALGOL 60. Commun. ACM, 4(1):51–55, Jan. 1961.
The first to describe a full parser. It is essentially a full backtracking recursive descent left-corner parser. The published program is corrected in a Letter to the Editor by B. H. Mayoh, Commun. ACM, 4(6):284, June 1961.

3. Hays, David G. Automatic language-data processing. In H. Borko, editor, Computer Applications in the Behavioral Sciences, pages 394–423. Prentice-Hall, 1962.
Actually about machine translation of natural language. Contains descriptions of two parsing algorithms. The first is attributed to John Cocke of IBM Research, and is actually a CYK parser. All terminals have already been reduced to sets of non-terminals. The algorithm works by combining segments of the input (“phrases”) corresponding to non-terminals, according to rules X − Y = Z which are supplied in a list. The program iterates on the length of the phrases, and produces a list of numbered triples, each consisting of a phrase and the numbers of its two direct constituents. The list is then scanned backwards to produce all parse trees. It is suggested that the parser might be modified to handle discontinuous phrases, phrases in which X and Y are not adjacent. The second algorithm, “Dependency-Structure Determination”, seems akin to chart parsing. The input sentence is scanned repeatedly, and during each scan the reductions appropriate at that scan are performed: first the reductions that bind tightest, for example the nouns modified by nouns (as in “computer screen”), then such entities modified by adjectives, then the articles, etc. The precise algorithm and precedence table seem to be constructed ad hoc.

4. Kuno, S. and Oettinger, A. G. Multiple-path syntactic analyzer. In Information Processing 1962, pages 306–312, Amsterdam, 1962. North-Holland.
A pool of predictions is maintained during parsing. If the next input token and a prediction allow more than one new prediction, the prediction is duplicated as often as needed, and multiple new predictions result. If a prediction fails it is discarded. This is top-down breadth-first parsing.

5. Sakai, Itiroo. Syntax in universal translation. In 1961 International Conference on Machine Translation of Languages and Applied Language Analysis, pages 593–608, London, 1962. Her Majesty’s Stationery Office.
Using a formalism that seems equivalent to a CF grammar in Chomsky Normal Form and a parser that is essentially a CYK parser, the author describes a translation mechanism in which the source language sentence is transformed into a binary tree (by the CYK parser). Each production rule carries a mark telling if the order of the two constituents should be reversed in the target language. The target language sentence is then produced by following this new order and by replacing words. A simple Japanese-to-English example is provided.

6. Greibach, S. A. Inverses of Phrase Structure Generators. PhD thesis, Technical Report NSF-11, Harvard U., Cambridge, Mass., 1963.

7. Greibach, Sheila A. Formal parsing systems. Commun. ACM, 7(8):499–504, Aug. 1964.
“A formal parsing system G = (V, µ, T, R) consists of two finite disjoint vocabularies, V and T, a many-to-many map, µ, from V onto T, and a recursive set R of strings in T called syntactic sentence classes” (verbatim). This is intended to solve an additional problem in parsing, which occurs often in natural languages: a symbol found in the input does not always uniquely identify a terminal symbol from the language (for example, will (verb) versus will (noun)). On this level, the language is given as the entire set R, but in practice it is given through a “context-free phrase structure generator”, i.e. a grammar. To allow parsing, this grammar is brought into what is now known as Greibach Normal Form: each rule is of the form Z → aY1···Ym, where a is a terminal symbol and Z and Y1···Ym are non-terminals. Now a directed production analyser is defined which consists of an unlimited set of pushdown stores and an input stream, the entries of which are sets of terminal symbols (in T), derived through µ from the lexical symbols (in V). For each consecutive input entry, the machine scans the stores for a top non-terminal Z for which there is a rule Z → aY1···Ym with a in the input set.
A new store is filled with a copy of the old store and the top Z is replaced by Y1···Ym; if the resulting store is longer than the input, it is discarded. Stores will contain non-terminals only. For each store that is empty when the input is exhausted, a parsing has been found. This is in effect non-deterministic top-down parsing with a one-symbol look-ahead. This is probably the first description of a parser that will work for any CF grammar. A large part of the paper is dedicated to undoing the damage done by converting to Greibach Normal Form.

8. Greibach, S. A. A new normal form theorem for context-free phrase structure grammars. J. ACM, 12:42–52, Jan. 1965.
A CF grammar is in “Greibach Normal Form” when the right-hand sides of the rules all consist of a terminal followed by zero or more non-terminals. For such a grammar a parser can be constructed that consumes (matches) one token in each step; in fact it does a breadth-first search on stack configurations. An algorithm is given to convert any CF grammar into Greibach Normal Form. It basically develops the first non-terminal in each rule that violates the above condition, but much care has to be taken in that process.

9. Griffiths, T. V. and Petrick, S. R. On the relative efficiencies of context-free grammar recognizers. Commun. ACM, 8(5):289–300, May 1965.
To achieve a unified view of the parsing techniques known at that time, the authors define a non-deterministic two-stack machine whose only type of instruction is the replacement of two given strings on the tops of both stacks by two other strings; the machine is started with the input on one stack and the start symbol on the other, and it “recognizes” the input if both stacks become empty simultaneously.
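The directed production analyser of [7] and the one-token-per-step parser of [8] both amount to a breadth-first search over pushdown stores for a grammar in Greibach Normal Form. The following sketch illustrates that mechanism, including the rule that discards stores that have grown longer than the remaining input; the tiny grammar (for aⁿbⁿ, n ≥ 1) and all identifiers are illustrative examples, not taken from the papers.

```python
# Breadth-first recognition with a grammar in Greibach Normal Form:
# every rule is Z -> a Y1...Ym, so each step consumes exactly one token.
# Hypothetical example grammar for { a^n b^n | n >= 1 }:
GRAMMAR = {
    "S": [("a", ["S", "B"]), ("a", ["B"])],
    "B": [("b", [])],
}

def recognize(tokens, start="S"):
    # Each "store" is a pushdown stack of predicted non-terminals,
    # top of stack first; start with one store holding the start symbol.
    stores = [[start]]
    for pos, tok in enumerate(tokens):
        remaining = len(tokens) - pos  # tokens still to consume, incl. tok
        new_stores = []
        for store in stores:
            if not store or len(store) > remaining:
                # An empty store can consume no more input; a store longer
                # than the remaining input can never be emptied, since each
                # step shrinks it by at most one symbol net -- discard it.
                continue
            top, rest = store[0], store[1:]
            for terminal, body in GRAMMAR.get(top, []):
                if terminal == tok:
                    # Copy the store with the top Z replaced by Y1...Ym.
                    new_stores.append(body + rest)
        stores = new_stores
    # A parsing has been found iff some store is empty at end of input.
    return any(not s for s in stores)

print(recognize(list("aabb")))  # True
print(recognize(list("aab")))   # False
```

Keeping one store per live prediction makes the exponential worst case of this search visible: an ambiguous grammar can duplicate stores at every token, which is exactly the inefficiency the tabular methods (CYK) discussed in [1], [3], and [5] avoid.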