A Guide to Parsing
Comparing Algorithms and Explaining Terminology

Learn more at tomassetti.me

We have already introduced a few parsing terms while listing the major tools and libraries used for parsing in Java, C#, Python and JavaScript. In this article we present the concepts and algorithms used in parsing in more depth, so that you can get a better understanding of this fascinating world.

We have tried to be practical in this article. Our goal is to help practitioners, not to explain the full theory. We explain just what you need to know to understand and build a parser.

After the definition of parsing, the article is divided into three parts:

1. The Big Picture. A section in which we describe the fundamental terms and components of a parser.
2. Grammars. In this part we explain the main formats of a grammar and the most common issues in writing them.
3. Parsing Algorithms. Here we discuss the most widely used parsing algorithms and say what they are good for.

Table of Contents

Definition of Parsing
The Big Picture
    Regular Expressions
        Regular Expressions in Grammars
    Structure of a Parser
        Scannerless Parsers
    Grammar
        Anatomy of a Grammar
        Types of Grammars
    Lexer
        Where the Lexer Ends and the Parser Begins
    Parser
        Syntactic vs Semantic Correctness
        Scannerless Parser
        Issues With Parsing Real Programming Languages
    Parsing Tree and Abstract Syntax Tree
        From Parse Tree to Abstract Syntax Tree
        Graphical Representation Of A Tree
Grammars
    Typical Grammar Issues
        The Missing Tokens
        Left-recursive Rules
        Predicates
        Embedded Actions
    Formats
        Backus-Naur Form and Its Variants
        PEG
Parsing Algorithms
    Overview
        Two Strategies
        Common Elements
        Automatons
    Tables of Parsing Algorithms
    Top-down Algorithms
        LL Parser
        Earley Parser
        Packrat (PEG)
        Recursive Descent Parser
    Bottom-up Algorithms
        CYK Parser
        LR Parser
Summary

Definition of Parsing

The analysis of an input to organize the data according to the rules of a grammar

There are a few ways to define parsing. However, the gist remains the same: parsing means finding the underlying structure of the data we are given.

In a way, parsing can be considered the inverse of templating. In parsing we identify the structure and extract the data; in templating we have a structure and we fill it with data. In the case of parsing you have to determine the model from the raw representation, while for templating you have to combine the data with the model to create the raw representation. The raw representation is usually text, but it can also be binary data.

Fundamentally, parsing is necessary because different entities need the data to be in different forms.
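To make the comparison with templating concrete, here is a minimal Python sketch. The template, the pattern, and the function names `render` and `parse` are hypothetical and chosen only for illustration. A single regular expression is enough here only because the input is this simple.

```python
import re

# Hypothetical structure, used in both directions (an assumption for this sketch).
TEMPLATE = "{name} is {age} years old"
PATTERN = re.compile(r"(?P<name>\w+) is (?P<age>\d+) years old")


def render(data: dict) -> str:
    """Templating: combine the data with the structure to produce raw text."""
    return TEMPLATE.format(**data)


def parse(text: str) -> dict:
    """Parsing: recover the data from raw text that follows the structure."""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"input does not follow the expected structure: {text!r}")
    return {"name": match["name"], "age": int(match["age"])}


data = {"name": "Ada", "age": 36}
raw = render(data)          # structure + data -> raw representation
assert parse(raw) == data   # raw representation -> data again
```

Rendering and then parsing returns the original data, which is the sense in which the two operations are inverses of each other.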
Parsing makes it possible to transform data into a form that a specific piece of software can understand. The obvious example is programs: they are written by humans, but they must be executed by computers. So humans write them in a form that they can understand, then a piece of software transforms them into a form that a computer can use.

However, parsing might be necessary even when passing data between two pieces of software that have different needs. For instance, it is needed when you have to serialize or deserialize a class.

The Big Picture

In this section we are going to describe the fundamental components of a parser. We are not trying to give you formal explanations, but practical ones.

Regular Expressions

A sequence of characters that can be defined by a pattern

Regular expressions are often touted as the thing you should not use for parsing. This is not strictly correct, because you can use regular expressions for parsing simple input. The problem is that some programmers only know regular expressions, so they use them to try to parse everything, even the things they should not. The result is usually a series of regular expressions hacked together that is very fragile.

You can use regular expressions to parse some simpler languages, but this excludes most programming languages, and even languages that look simple enough, like HTML. In fact, languages that can be parsed with just regular expressions are called regular languages. There is a formal mathematical definition, but that is beyond the scope of this article.

One important consequence of the theory, though, is that regular languages can also be parsed or expressed by a finite state machine. That is to say, regular expressions and finite state machines are equally powerful. This is the reason why they are used to implement lexers, as we are going to see later.

A regular language can be defined by a series of regular expressions, while more complex languages need something