Montana State University Computer Science Department

CSCI 468 - Compilers Spring 2019

Team Members Giovanny Addun, Connor Grace, Tyler Krueger, Ryley Rodriguez

Section 1: Program

Section 2: Teamwork (Gio = 1, Connor = 2, Tyler = 3, Ryley = 4) Throughout the course of this project our team worked together efficiently without any major setbacks. Because of our familiarity with GitHub, we were able to collaborate constantly without meeting in person very often during the semester. The few times we did meet in person were mainly to confirm specifics that were hard to communicate over email or text, or to help a team member get the code working on their machine. The majority of the code was written by Team Member One, while the remaining three helped troubleshoot errors and test the code on various inputs to ensure the program's accuracy. To balance the workload, Four took care of submitting all necessary code to D2L and contributed to multiple sections of the final report. Two and Three contributed to the completion of both the final portfolio and the final report. Three constructed various diagrams for the report and helped finalize the code base. All team members contributed to the UML diagrams and to ensuring all components were submitted by the due date. The estimated split of effort between team members is: One = 40%, Two = 20%, Three = 20%, Four = 20%.

Section 3: Design pattern During the course of the semester we were presented with a limited number of options as far as design patterns go. We were given freedom in most aspects of our compiler, but we were restricted to two options in the symbol table step: our team needed to choose between the Observer (Listener) and Visitor design patterns. We implemented the Listener pattern for a few reasons. First, it contains fewer components than the Visitor pattern, which usually makes it easier to implement. Second, our team understood the Listener pattern better when it was explained to us during lecture. The Visitor pattern does offer benefits when it comes to extending the program; however, since we only needed to construct a compiler for the LITTLE language, our team was not concerned with scaling this project beyond what was required. Please see the highlighted segments of Section 1 for the implementation of this design pattern.

Figure 1: Observer design pattern

Section 4: Technical writing Include the technical document that accompanied your capstone project.

1. Introduction The capstone project tasks a group of undergraduates with building a compiler for the LITTLE language. Before this class, most students understand the general idea of how a compiler works but lack insight into the details. This class provides a deeper description of each step and its importance to the compiler as a whole. The goal of this Compilers project is to test the knowledge that students have gained over the course of their studies. With the help of the ANTLR tool, groups are required to construct a scanner, parser, and symbol table in order to generate code. For the scanner and parser steps, a grammar file was created to recognize tokens and parse the input program. A symbol table was then created to store important program information such as scopes and variables. In the final step, the group builds semantic actions to generate an intermediate representation of the code, which can then be translated to assembly code. The following report gives an in-depth explanation of each element of a compiler and our group's implementation strategies.

2. Background A compiler is a translator. In the traditional sense, it converts Java or C++ code into a low-level assembly language. In order for a compiler to work, it needs an input language, a target language, and specifications about both. It uses the specifications to break the input down into smaller chunks and then converts those pieces into the target language. Compilers enhance the portability of all types of software as well as increase the level of productivity in many aspects of a company. With the development of increasingly sophisticated languages, i.e. those closer to English, the gap between programming languages and assembly language is becoming larger and larger. The better the compiler, the easier the tasks of transferring software between different machines and translating programs become. The major components of a compiler are the scanner, parser, semantic actions, symbol table, code generation, and optimization. The scanner takes an input program and creates tokens that represent words or symbols in the program. These tokens are then passed to the parser and organized to fit a set of rules given by the language's grammar. The symbol table is used to keep track of scopes, variables, function calls, objects, etc. in the source program. Finally, any optimization techniques are applied before the final code generation occurs. These components work together by passing information from one step to the next until the high-level language is transformed into machine code for the computer's hardware to execute. To begin, the scanner takes a program's source code as input and outputs the words and symbols that make up the tokens of the language to the parser. The parser is responsible for taking the tokens from the scanner and constructing a parse tree. The parser uses the rules of the grammar file for the source language to construct valid strings in the program.
Next, the symbol table uses the generated parse tree to keep track of all variable declarations by maintaining a scope throughout the blocks of the program. Finally, code is generated by first converting the parse tree into an AST (abstract syntax tree). Once the AST is constructed, we perform a post-order traversal of the tree and generate code at the nodes of the tree. Some real-life examples of compilers include basic language translators, as well as more software-oriented compilers such as gcc and javac, for C and Java programs, respectively.

3. Methods and Discussion a. ANTLR Setup: For this project we incorporated the toolkit ANTLR (ANother Tool for Language Recognition) to generate a parser from a grammar. Specifically, the parser is created in Java and the grammar is pre-defined in a .g4 file. Setting up the toolkit requires that Java (version 1.6 or higher) and ANTLR (version 4.7.2) both be installed on the user's system. The antlr4 command can then be run on a .g4 file, which creates the necessary .java files. Following that, the .java files are compiled and a parse tree can be viewed with a few additional ANTLR commands. The parse tree matches the specified keywords and identifiers given a string of input. b. Scanner i. General description: The scanner's, or lexer's, job is to take an input program and group its characters into words and symbols called tokens. This eliminates unwanted information, such as comments, and creates a more uniform format for easier interpretation later in the process. These tokens consist of identifiers, keywords, literals, operators, and others, depending on the language. Instead of building a scanner from scratch, a language recognition tool is used to generate one. Using regular expressions to describe the tokens, program files are created to perform the tasks of the scanner. Once the input is organized into these groups, it is passed on to the parser.
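As a rough illustration of this idea (separate from our actual ANTLR-generated lexer), a handful of token classes can be described with regular expressions and applied to an input string. The token names and patterns below are simplified stand-ins, not the rules from our Micro.g4 file:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSketch {
    // Order matters: KEYWORD is tried before IDENTIFIER, otherwise keywords
    // would be mis-recognized as identifiers (a bug we actually hit with the
    // grammar file). The \b prevents "BEGINx" from matching as a keyword.
    private static final Pattern TOKENS = Pattern.compile(
        "(?<KEYWORD>(?:PROGRAM|BEGIN|END|INT|FLOAT)\\b)"
      + "|(?<IDENTIFIER>[A-Za-z][A-Za-z0-9]*)"
      + "|(?<INTLITERAL>[0-9]+)"
      + "|(?<OPERATOR>:=|[+\\-*/;()])"
      + "|(?<WS>\\s+)");

    // Group each character of the input into "TYPE value" tokens,
    // discarding whitespace.
    public static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        while (m.find()) {
            if (m.group("WS") != null) continue;            // skip whitespace
            if (m.group("KEYWORD") != null)         out.add("KEYWORD " + m.group());
            else if (m.group("IDENTIFIER") != null) out.add("IDENTIFIER " + m.group());
            else if (m.group("INTLITERAL") != null) out.add("INTLITERAL " + m.group());
            else                                    out.add("OPERATOR " + m.group());
        }
        return out;
    }
}
```

A real scanner generator builds essentially this kind of matcher (plus error handling and maximal munch) from the .g4 token rules.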

ii. Methodology: For the Scanner portion of this project, we were tasked with taking a program written in the Micro language as input and extracting the tokens (Keywords, Literals, Identifiers, and Operators) of the Micro language. The only data we are interested in is the contents of the input program (a .micro file). The data is simply a long stream of characters, i.e. a string. The contents of the string are essentially different token types, which are predefined in a .g4 grammar file, as can be seen below.

Figure 1: Micro grammar screenshot for LITTLE language.

Our overall implementation for this part can be divided into two classes, Driver.java and TerminalListener.java. Our Driver class handles initializing a Lexer object, a Parser object, a ParseTreeWalker object, and a Listener object. The Lexer, or Tokenizer, is applied to the input data and is used to obtain the vocabulary, or token definitions. The Parser object takes the token definitions and generates a tree representation of the input string. For this step in the assignment we defined the parse tree to match the regular expression “(Any_Token)*”, so the children of this tree are simply the tokens in the .micro file. The Driver then creates a parse tree walker and our TerminalListener class to maintain a record of the tokens the walker encounters and their values. In our Listener class we decided to override the visitTerminal method to store the types and values of the tokens that the tree walker visits as it traverses the parse tree. This method uses two Java String Queues to store the token types and values. We made this choice because a queue is an intuitive data structure for preserving order when we dequeue the contents later in the program. Additionally, we implemented the output generation functionality in our TerminalListener class, because it seemed more logical to generate the output where the queues are stored rather than pass the queues around.

iii. Difficulties:
1. Understanding the various Java classes/methods needed to work with the ANTLR-generated files (e.g. ParseTreeWalker, TerminalListener).
2. One difficulty encountered with the grammar file was the syntax and how defining tokens in various orders can affect the output. Originally, the first set of tokens defined was the identifiers. This, however, caused keywords to be recognized as identifiers in the output. To fix this we moved the identifier definition to the end of the .g4 file.
3. Constructing and testing the Micro executable.
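The two-queue bookkeeping described above can be sketched with standard Java collections. The class and method names here are illustrative stand-ins, not our actual TerminalListener, and the ANTLR visitTerminal() callback is replaced by a plain method so the idea stands alone:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class TokenRecord {
    // Two parallel String queues: one for token types, one for token values.
    private final Queue<String> types = new ArrayDeque<>();
    private final Queue<String> values = new ArrayDeque<>();

    // Called once per terminal node the tree walker visits, in walk order
    // (stand-in for the overridden visitTerminal() method).
    public void recordToken(String type, String value) {
        types.add(type);
        values.add(value);
    }

    // Dequeue in the same order the tokens were seen and format the output.
    // The exact output layout here is illustrative.
    public String generateOutput() {
        StringBuilder sb = new StringBuilder();
        while (!types.isEmpty()) {
            sb.append("Token Type: ").append(types.remove())
              .append("\nValue: ").append(values.remove()).append('\n');
        }
        return sb.toString();
    }
}
```

Because a queue is first-in first-out, the output necessarily lists tokens in the order the walker encountered them, which is the property we wanted.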
Along with these general difficulties, each person had to work through some individual problems. Ryley, Tyler, and Connor struggled to get the project working as expected with the JAR dependencies. Once the dependencies were set up correctly we still had problems running the project to find the output; after collaborating with each other we found that we needed to move the generated files into the source folder. Ryley and Gio faced a steep learning curve in understanding the overall flow, from taking a .g4 file and compiling it into the necessary Java files to using predefined and overridden methods to output the tokens. It was particularly difficult to find a consolidated (and free) resource detailing how the different classes generated by ANTLR work together (i.e. the ParseTreeWalker, RuleContext, and Listener classes). We did eventually obtain a copy of the ANTLR definitive reference manual, which answered many of our questions.

c. Parser i. General description: The job of the parser is to take the string of valid tokens generated by the lexer and attempt to fit that string to the set of rules defined by some grammar. It is typically at this stage of compiling that syntax errors in source programs are detected by the compiler. These errors are found when the parser is unable to fit the string of tokens to the rules defined in the grammar. The figure on the following page shows our updated grammar file for this step of the project.

Figure 2: Additional screenshot of Micro grammar

The two types of parsers in consideration are LL and LR parsers. Both methods scan the input from left to right. The main difference arises in the derivation method: LL parsers use a leftmost derivation and LR parsers use a rightmost derivation. In addition, these methods have variants that specify the number of lookahead tokens. Two examples are LL(1) and LR(k), which stand for one token of lookahead and k tokens of lookahead, respectively. Since we are using the ANTLR tool, the parsing method used is LL(*). This is very similar to an LL(1) parser; however, it is not restricted in the number of lookahead tokens.
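A toy recursive-descent recognizer shows what LL parsing with a single token of lookahead looks like in practice. The rule below is hypothetical and unrelated to the actual Micro grammar:

```java
import java.util.List;

// Hand-written LL(1) sketch (not generated by ANTLR) for the toy rule
//   stmt -> ID ":=" (ID | INTLITERAL) ";"
// Peeking at one token (tokens.get(pos)) is enough to pick each step.
public class LLSketch {
    private final List<String> tokens;
    private int pos = 0;

    public LLSketch(List<String> tokens) { this.tokens = tokens; }

    private String lookahead() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    private boolean match(String expected) {
        if (lookahead().equals(expected)) { pos++; return true; }
        return false;
    }

    private boolean isId(String t)  { return t.matches("[A-Za-z]\\w*"); }
    private boolean isInt(String t) { return t.matches("\\d+"); }

    // Returns true iff the whole token list derives from stmt.
    public boolean parseStmt() {
        if (!isId(lookahead())) return false;
        pos++;
        if (!match(":=")) return false;
        // A single token of lookahead decides which alternative of
        // (ID | INTLITERAL) to take -- the essence of LL(1).
        String la = lookahead();
        if (isId(la) || isInt(la)) pos++; else return false;
        return match(";") && pos == tokens.size();
    }
}
```

ANTLR's LL(*) strategy generalizes this: when one token is not enough to choose between alternatives, it keeps scanning ahead until the alternatives diverge.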

ii. Methodology: By the last step of this project (parsing), our team had already partially implemented the ANTLR4 ParseTree, ParseTreeWalker, and BaseListener structures. For this step we extended the BaseListener class to override the visitErrorNode() method so that when an error node is encountered while traversing the parse tree, the error is logged in a class variable. With this methodology we were able to pass all the posted test cases except for test10.micro. For this test case, although mismatched-input errors occurred, the parser was still able to generate a full parse tree, and therefore no error nodes were encountered by the TerminalListener while it walked the parse tree. To fix this problem we extended the ANTLR BaseErrorListener class with our Error_Listener class and added this error listener to our MicroParser object. By doing so we were able to pass all the posted test cases. For any of the above methods to work, we first had to update our grammar (Micro.g4). We were given a definition of the grammar, which we had to convert to a form usable by the tool. This consisted of writing parser rules (productions) for the rules defined in the provided grammar.txt file.
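The error-collection strategy can be sketched independently of the ANTLR types. Here logError() stands in for the syntaxError() callback that a BaseErrorListener subclass receives from the parser; the class and method names are illustrative, not our actual Error_Listener:

```java
import java.util.ArrayList;
import java.util.List;

public class ErrorLog {
    private final List<String> errors = new ArrayList<>();

    // Record each reported syntax error instead of printing immediately
    // (stand-in for the syntaxError() callback).
    public void logError(int line, int charPos, String msg) {
        errors.add("line " + line + ":" + charPos + " " + msg);
    }

    // After the parse finishes, acceptance is decided by whether any
    // errors were logged along the way.
    public String verdict() {
        return errors.isEmpty() ? "Accepted" : "Not accepted";
    }
}
```

The key point is that the verdict is deferred until after the parse: this is what let us catch cases like test10.micro, where errors are reported even though a full parse tree (with no error nodes) is still produced.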

iii. Difficulties: The ANTLR Java API is still hard to understand. In particular, we did not understand why there is a BaseErrorListener structure when the BaseListener class already has a visitErrorNode() method. Additionally, it was unclear at first how to implement the BaseErrorListener interface. We had assumed that the implementation would be similar to the BaseListener class; however, the methods for implementing the two are very different.

d. Symbol Table i. General description: The function of a compiler's symbol table is essentially to maintain a record of all the variables, function names, classes, objects, etc. in a source program. In addition, symbol tables are used to keep track of scopes and which variables belong to each scope. The table or tables are created by semantic actions during the parse stage.

ii. Methodology: For our symbol table implementation we chose to use nested Java LinkedHashMap structures. An outer LinkedHashMap uses program scopes as keys and stores inner LinkedHashMaps as values. The keys of the inner LinkedHashMaps are the variable names for their corresponding scopes, and the values are ArrayLists with information about the variables (the variable's type and, for STRINGs, its value). Figure 3 shows a concrete realization of our implementation given a test program.

Figure 3: Implementation of data structures
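A minimal sketch of this nested-map layout, together with the stack-based scope tracking described below, looks roughly like the following. The scope names, variable names, and method names are illustrative, not our exact SymbolTableListener:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SymbolTableSketch {
    // Outer map: scope name -> inner map; inner map: variable name ->
    // [type] or [type, value] (the value is stored for STRINGs).
    private final Map<String, Map<String, List<String>>> table = new LinkedHashMap<>();
    // The current scope is always at the top of this stack.
    private final Deque<String> scopes = new ArrayDeque<>();

    public void enterScope(String name) {       // e.g. on entering a while block
        scopes.push(name);
        table.putIfAbsent(name, new LinkedHashMap<>());
    }

    public void exitScope() { scopes.pop(); }   // e.g. on leaving the block

    // Record a declaration in whatever scope is currently on top of the stack.
    public void declare(String var, String... info) {
        table.get(scopes.peek()).put(var, new ArrayList<>(Arrays.asList(info)));
    }

    public Map<String, List<String>> scope(String name) { return table.get(name); }
}
```

Because LinkedHashMap preserves insertion order, scopes and variables print in the order they were declared in the source program, which a plain HashMap would not guarantee.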

In addition to creating a data structure to represent our symbol tables, we also needed to create a Listener class to make entries into our symbol tables as our Parser parsed the source program. To achieve this we extended the ANTLR-generated MicroBaseListener class with our SymbolTableListener class. The functionality of our class can be divided into two basic operations: a) changing the current scope and b) making entries into the symbol table. Our class uses a stack to maintain the scope. Whenever the source program enters a code block where the scope changes (for instance, when entering a “while statement”), the new scope is pushed onto the top of the stack and the current scope is updated. Likewise, whenever the program ends a code block, the top of the stack is popped and the current scope is updated. These events are defined by the rules of the grammar and are captured by our Listener through the ANTLR enterRule() and exitRule() methods. With the appropriate scope maintained by the stack, whenever our Parser encounters a variable declaration, our Listener class is notified to make an entry into the symbol table. These events are also defined by the grammar and are captured with the ANTLR exitRule() methods.

iii. Difficulties: One difficulty we encountered was catching variable declaration errors, such as two variables with the same name in the same scope. At first we attempted to stop parsing the program altogether and print out the error, but we instead decided to create a separate data structure to hold declaration errors. Once the entire program is parsed, the 'errors' data structure is checked, and if it is non-empty, the Symbol Table is not printed.

e. Code Generation i. General description: The last step in constructing our compiler was the code generation step. A compiler is essentially a translator, translating code from one computer language to another.
Typically this translation is from some higher-level language to some lower-level language. For our project the objective was to translate LITTLE source code into the Tiny assembly language. After parsing, lexing, and symbol table generation, our compiler has all the components required to perform this translation.

ii. Methodology: To accomplish the goals of this step, we followed the three steps suggested in class:
1. Develop a program that works with our existing code to generate ASTs (Abstract Syntax Trees) given a grammar, parser, lexer, and some input source code.
2. Use a post-order traversal of the generated ASTs to create IR (Intermediate Representation) code for the given input source code.
3. Convert the IR code into Tiny assembly code.
For our implementation we created an abstract AST class and implemented several AST Node classes representing operations (+, -, :=, *, /) as well as variables.
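A stripped-down sketch of this design is shown below. The node classes and the post-order code emission mirror the approach described above, but the class names, IR opcodes, and temporary-register naming are illustrative rather than our exact implementation:

```java
import java.util.List;

// Abstract AST node: emitting a node appends its IR lines to `ir` and
// returns the location (variable or temporary) holding its result.
abstract class AST {
    abstract String emit(List<String> ir, int[] nextTemp);
}

class VarNode extends AST {
    final String name;
    VarNode(String name) { this.name = name; }
    String emit(List<String> ir, int[] nextTemp) { return name; }
}

class ConstNode extends AST {
    final String value;
    ConstNode(String value) { this.value = value; }
    String emit(List<String> ir, int[] nextTemp) {
        String t = "$T" + nextTemp[0]++;          // fresh temporary
        ir.add("STOREI " + value + " " + t);
        return t;
    }
}

class AddNode extends AST {
    final AST left, right;
    AddNode(AST left, AST right) { this.left = left; this.right = right; }
    String emit(List<String> ir, int[] nextTemp) {
        // Post-order: both children emit their code before this node's op.
        String l = left.emit(ir, nextTemp);
        String r = right.emit(ir, nextTemp);
        String t = "$T" + nextTemp[0]++;
        ir.add("ADDI " + l + " " + r + " " + t);
        return t;
    }
}

class AssignNode extends AST {
    final String target;
    final AST expr;
    AssignNode(String target, AST expr) { this.target = target; this.expr = expr; }
    String emit(List<String> ir, int[] nextTemp) {
        ir.add("STOREI " + expr.emit(ir, nextTemp) + " " + target);
        return target;
    }
}
```

For the statement a := b + 2, the post-order walk first emits the code for b and 2, then the addition into a temporary, then the store into a. As noted under Difficulties, this per-node emission works cleanly for expressions but is not sufficient on its own for control structures, which need labels and jumps.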

iii. Difficulties: The complexity of this step far exceeded that of the previous steps. As a team we had underestimated this complexity and the amount of work required. As a result we were unable to produce a fully functional compiler. We had particular difficulty with Step 4, part 2 (generating code for control structures). Our method for constructing AST objects worked for generating code for expressions, but it was not until after hours of failed attempts that we realized this approach would not accurately generate code for control structures as well. It was also quite difficult to determine which rules in the Tiny grammar to take semantic actions on.

f. Full-fledged compiler: In order to code this project in an efficient manner, our team constructed a UML diagram to show the relationships between the various classes (auto-generated and manually written). This diagram visually shows how the different components, the Scanner, Parser, and Symbol Table, work together to generate the assembly instruction set from the input program. Please refer to Figure 4 for the UML diagram of our compiler.

Figure 4: UML class diagram of fully fledged compiler

During the course of the semester we were presented with a limited number of options as far as design patterns go. We were given freedom in most aspects of our compiler, but we were restricted to two options in the symbol table step: our team needed to choose between the Observer (Listener) and Visitor design patterns. We implemented the Listener pattern for a few reasons. First, it contains fewer components than the Visitor pattern, which usually makes it easier to implement. Second, our team understood the Listener pattern better when it was explained to us during lecture. The Visitor pattern does offer benefits when it comes to extending the program; however, since we only needed to construct a compiler for the LITTLE programming language, our team was not concerned with scaling this project beyond what was required. Please see the highlighted segments of Section 1 for the implementation of this design pattern. Our team incorporated a version control tool, Git, to allow each team member to collaborate constantly without having to meet in person. We hosted the entire project on GitHub so that all four team members could have a working code base at all times if, for some reason, their branch contained errors that were too difficult or time-consuming to find and fix.

4. Conclusion and Future Work Although the compiler we created works for the LITTLE language, there are still some aspects that could be improved. For one, it would be difficult to expand our program to accommodate more high-level languages. Due to the design patterns we chose and the time constraints of a semester-long project, our compiler would require a serious overhaul in order to work with other languages. Another improvement that could be made to the code is simple refactoring. Certain classes contain duplicate information; for example, we created a different Driver for each portion of the project. Given more time we would clean up our code base to perform only the operations necessary to compile a LITTLE program. As far as future work goes, our team believes there is a lot of potential to optimize our code generation step. Throughout the semester we were presented with many different approaches to optimizing code, none of which were required for this specific course. Implementing these optimizations would prove very beneficial to the runtime of compiled programs. However, since these optimizations would not affect the correctness of our program, they were not considered critical for the submission of our final code base.

REFERENCES
❏ https://www.antlr.org
❏ https://github.com/antlr/antlr4/blob/master/doc/getting-started.md
❏ Fischer, Charles, and LeBlanc, Richard. “Crafting a Compiler with C.” Benjamin/Cummings Publ. Co., Redwood City, CA, 1991.
❏ Parr, Terence, and Kathleen Fisher. “LL(*): The Foundation of the ANTLR Parser Generator.” PLDI, June 2011, www.antlr.org/papers/LL-star-PLDI11.pdf.
❏ https://existek.com/blog/sdlc-models/

Section 5: UML The UML diagram for the Listener design pattern that we implemented for Step 3: Symbol table can be found below. Please see Section 3 for further discussion.

Figure 2: Observer design pattern

Section 6: Design trade-offs During the development of the various components of our compiler, our team had to make many design decisions. For the first two parts of the project, the code was relatively straightforward and we were not forced to make any major decisions affecting memory usage or execution time. In the symbol table step we decided to implement the Listener design pattern because our team thought the methods inside the auto-generated Listener classes were easier to interpret and therefore to extend to what we needed them to do. During the code generation step, we also decided to build an AST and generate the Intermediate Representation from it, as opposed to creating the code straight from the parse tree. This choice means that our program uses more memory, but we believe it to be a good design choice if the compiler were to be extended to handle more scenarios. The AST is also much easier to debug than trying to determine where a traversal error occurred in the parse tree.

Section 7: Software development life cycle model The software development life cycle model our team based the development of our compiler on is the Iterative model. We thought this was the appropriate choice mainly because of time restrictions, and because the requirements were laid out explicitly before any coding began. The Iterative model also works very well for smaller-scale projects such as this, where scalability is of little concern. In the Iterative model the development process is repetitive: each iteration is essentially the development of one component of the system. This worked perfectly for our team because we were presented with four different parts at various times throughout the semester. The project also built on itself, meaning we would perform the Analysis, Design, Coding, and Testing steps in a circular fashion: after testing Step 1, we would design and code the next step, test it, and repeat the process for the following step. Overall, this model ensured our team was never committed to one coding approach and that we could return to the design step and continuously test our program throughout the whole process.