Assignment of Bachelor's Thesis
Total Page:16
File Type:pdf, Size:1020Kb
CZECH TECHNICAL UNIVERSITY IN PRAGUE FACULTY OF INFORMATION TECHNOLOGY ASSIGNMENT OF BACHELOR’S THESIS Title: Automata library - LR parser construction Student: Martin Kočička Supervisor: Ing. Jan Trávníček Study Programme: Informatics Study Branch: Computer Science Department: Department of Theoretical Computer Science Validity: Until the end of summer semester 2016/17 Instructions Get familiar with the current implementation of the Automata library toolkit[1]. Study LR(0) and SLR(1) parser construction. Propose suitable extensions of data structures in Automata library toolkit for representing LR(0) and SLR(1) parser. Implement these extensions of data structures. Extend the library to include LR(0) and SLR(1) parser construction algorithms using these new data structures. Implement an algorithm of syntactic analysis using the LR(0) and SLR(1) parser. Test the implemented extensions and algorithms. References [1] Martin Žák: Automatová knihovna – vnitřní a komunikační formát. Bakalářská práce, České vysoké učení technické v Praze, Fakulta informačních technologií, Praha, 2014. L.S. doc. Ing. Jan Janoušek, Ph.D. prof. Ing. Pavel Tvrdík, CSc. Head of Department Dean Prague January 9, 2016 Czech Technical University in Prague Faculty of Information Technology Department of Theoretical Computer Science Bachelor's thesis Automata library { LR parser construction Martin Koˇciˇcka Supervisor: Ing. Jan Tr´avn´ıˇcek 17th May 2016 Declaration I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis. I acknowledge that my thesis is subject to the rights and obligations stip- ulated by the Act No. 121/2000 Coll., the Copyright Act, as amended. In accordance with Article 46(6) of the Act, I hereby grant a nonexclusive author- ization (license) to utilize this thesis, including any and all computer programs incorporated therein or attached thereto and all corresponding documentation (hereinafter collectively referred to as the \Work"), to any and all persons that wish to utilize the Work. Such persons are entitled to use the Work in any way (including for-profit purposes) that does not detract from its value. This authorization is not limited in terms of time, location and quantity. However, all persons that makes use of the above license shall be obliged to grant a license at least in the same scope as defined above with respect to each and every work that is created (wholly or in part) based on the Work, by modi- fying the Work, by combining the Work with another work, by including the Work in a collection of works or by adapting the Work (including translation), and at the same time make available the source code of such work at least in a way and scope that are comparable to the way and scope in which the source code of the Work is made available. In Prague on 17th May 2016 . Czech Technical University in Prague Faculty of Information Technology c 2016 Martin Koˇciˇcka. All rights reserved. This thesis is school work as defined by Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author's permission is prohibited (with exceptions defined by the Copyright Act). Citation of this thesis Koˇciˇcka, Martin. Automata library { LR parser construction. Bachelor's thesis. Czech Technical University in Prague, Faculty of Information Techno- logy, 2016. Abstract This thesis covers basic LR parsing algorithms. We describe bottom-up and shift-reduce parsing methods in general, and then we focus on LR parsing. We go into more detail with two algorithms|LR(0) and SLR(1). Suitable algorithms and data structures are presented and implemented into the Auto- mata library. We also explore existing solutions and the Automata library itself. Keywords parsing, bottom-up parsing, shift-reduce parsing, LR parsing, SLR parser vii Abstrakt Pˇredmˇetemt´etopr´acejsou z´akladn´ıLR parsovac´ıalgoritmy. Pr´acev ´uvodu popisuje obecnˇebottom-up a shift-reduce parsov´an´ı. D´alese zamˇeˇrujena LR parsov´an´ı, specificky LR(0) a SLR(1) parsery. Pr´aceobsahuje n´avrh potˇrebn´ych datov´ych struktur a algoritm˚upro implementaci tˇechto parser˚u. Jsou prozkoum´anaexistuj´ıc´ıˇreˇsen´ı,a pˇrid´anz´akladn´ıpopis Automatov´ekni- hovny. Pr´aceje souˇc´ast´ıprojektu Automatov´aknihovna. Kl´ıˇcov´aslova parsov´an´ı,bottom-up parsov´an´ı,shift-reduce parsov´an´ı,LR parsov´an´ı,SLR parser viii Contents Introduction 1 1 Theory 3 1.1 Strings . 3 1.2 Grammars . 4 1.3 Deterministic finite automaton . 6 2 Bottom-up parsing 9 2.1 Reductions . 9 2.2 Shift-reduce parser . 10 2.3 Conflicts . 11 3 LR parsing 13 3.1 Action table . 13 3.2 Goto table . 15 3.3 Parsing algorithm . 15 4 LR(0) parsing 17 4.1 LR(0) item . 17 4.2 LR(0) closure . 18 4.3 Augmented grammar . 18 4.4 LR(0) automaton . 19 4.5 SLR(1) parsing . 21 5 Analysis 25 5.1 Existing solutions . 25 5.2 Automata library . 26 6 Implementation 29 6.1 Enumeration LRAction . 29 ix 6.2 Type definitions . 29 6.3 Class LRParser . 30 6.4 Class LR0Parser . 31 6.5 Class SLR1ParseTable . 31 6.6 Class LR0ItemsLabel . 32 6.7 Testing . 32 Conclusion 33 Bibliography 35 A List of used abbreviations 37 B Contents of enclosed memory card 39 x List of Figures 1.1 Parse tree for the expression grammar and sentential form id + id. 6 1.2 Example of a deterministic finite automaton. 7 3.1 LR parser. 14 4.1 LR(0) automaton for the expression grammar. 20 xi List of Tables 2.1 State of the shift-reduce parser after a successful parse. 10 2.2 Shift-reduce parse of a string (id + id) from the expression grammar. 11 3.1 Action table for the expression grammar. 15 3.2 Goto table for the expression grammar. 16 xiii Introduction Parsing (syntax analysis) is one of the most crucial tasks solved in computing. It has wide range of usages|from processing programming languages, struc- tured text such as XML, structured binary data, to DNA pattern recognition and many more. Parsing algorithms can be split into two categories: top-down and bottom- up. Names of these two methods are referring to parse trees, but it is easier to explain the parsing process using formal grammars. In top-down method, we start with the initial symbol of a grammar, and apply production rules until we produce the input string. On the other hand, in bottom-up method we start with the input string, and replace right-hand sides of the production rules by the nonterminals inside the string, until we reduce it to the start symbol. This thesis focuses on LR parsing. LR parsers are deterministic bottom- up parsers, which produce a correct parse in linear time, due to the fact that they don't do any guessing or backtracking. LR parsers read input from left to right, construct a rightmost derivation in reverse, and they can detect syntax errors as soon as possible. The objective of this thesis is to describe selected algorithms, and to im- plement them into the Automata library. Two algorithms, LR(0) and SLR(1), have been selected, since they are the foundation of almost all other powerful LR parsing methods. 1 Chapter 1 Theory 1.1 Strings Definition 1.1.1. Alphabet is a finite set of elements (called symbols). Example 1.1.1. Set f0; 1g is an alphabet consisting of two symbols. Definition 1.1.2. String is a finite sequence of symbols belonging to the alphabet. Instead of denoting the strings using the sequence notation, we will just append the symbols to each other. Example 1.1.2. The string (α0; α1; α2; α3) will be written as α0α1α2α3. Example 1.1.3. 0, 01 and 01101 are examples of strings over alphabet f0; 1g. Definition 1.1.3. The number of symbols in the sequence is called length of the string, and for an arbitrary string w, the length will be denoted jwj. Definition 1.1.4. An empty sequence of symbols is also a string, called empty string. It will be denoted using symbol ". Definition 1.1.5. Let x = (x0; x1; : : : ; xn−1; xn) and y = (y0; y1; : : : ; ym−1; ym) be strings. String z = (x0; x1; : : : ; xn−1; xn; y0; y1; : : : ; ym−1; ym) is called the concatenation of strings x and y, and will be denoted xy. Example 1.1.4. String 01101 is a concatenation of strings 01 and 101. Definition 1.1.6. Language is a set of strings over an alphabet. Example 1.1.5. Set f0; 01; 01101g is a language over the alphabet f0; 1g. 3 1. Theory 1.2 Grammars Definition 1.2.1. Grammar is a quadruple consisting of following elements: 1. Finite set of nonterminal symbols (often denoted N). 2. Finite set of terminal symbols (often denoted Σ). Finite set of production rules of the form α ! β 3. where α 2 (N [ Σ)∗N(N [ Σ)∗ and β 2 (N [ Σ)∗ (often denoted P ). 4. Start symbol which belongs to N (often denoted S). It is common to write the rules for the same left-hand side on one line, separating right-hand sides using the pipe symbol \j". Example 1.2.1. Production rules A ! α; A ! β; A ! γ could be written as A ! α j β j γ. 1.2.1 Context-free grammar Definition 1.2.2. Context-free grammar is a grammar where all the produc- tion rules are in format A ! α (A is a single nonterminal symbol and α is a string of nonterminal and/or terminal symbols).