UG3 Compiling Techniques Coursework

January 2015

About this practical

This handout provides you with the details of your coursework for the UG3 Compiling Techniques course. The coursework comprises two individually marked parts, weighted 40% and 60% respectively, which together count for 25% of the marks for the course. Both parts are individual exercises, not to be undertaken in a group. However, discussions with your fellow students about technical approaches to individual problems of the coursework, as well as infrastructure-related questions, are highly encouraged!

Description of this practical

Your task is to write a compiler for a simple subset of C, small-C. It is compulsory to use ANTLR and ASM for generating the compiler. In the first stage the compiler must check that a source program is well-formed and, whether or not it is, terminate gracefully. For the second (final) stage it should generate code (bytecode) for the Java Virtual Machine (JVM). The generated code should be reasonably short to solve the problem.

Note that some legitimate C programs will not be legitimate small-C programs.

The languages involved

The implementation language must be Java!

The syntax of the subset of C you must handle, small-C, is given in the Appendix. One of your first tasks is to translate the syntax diagrams to ANTLR input. In doing so, you must be careful not to change the language! You must also handle C comments.

small-C is simply typed: it only has variables of type integer or character. A string is a (possibly empty) sequence of characters (excluding newline) between double quotes, e.g. "hello world!". A character is a C character of the form 'c' where c is a single keyboard stroke, or \n, \t, \\, \' or EOF (end-of-file). You don’t have to handle all possible C character definitions, for example '\045'. A number is a non-empty sequence of digits. An ident is a non-empty sequence of characters [a-zA-Z0-9_] beginning with an alphabetic character.

A variable must be declared (exactly once) before it is used. All variables get the initial value 0 or '\000'. The binary subtraction, addition, multiplication, division and modulus operators all associate to the left, so 5 - 3 - 2 evaluates to 0, not 4. Unlike in usual C, they can only be applied to integers, not characters. Conditions in while and if statements are interpreted as follows: 0 stands for false, any other value stands for true. An else binds to the closest previous unbound if.

The #include lines should be ignored by your compiler; they are included so that the usual C compiler will also work on the input. The read and readc procedures read an integer and a character respectively, while output and outputc output an integer and a character respectively. You must handle non-recursive procedures and functions. You should check that each variable, procedure or function is defined before it is used or called.
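The lexical rules above can be tried out in isolation before they are written as ANTLR lexer rules. The following is a minimal sketch in plain Java, not an ANTLR specification. It assumes ASCII quote characters and no escaped double quotes inside strings; the class name SmallCTokens and the method classify are hypothetical, and keywords and the EOF constant are deliberately not handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of ANTLR or of the handout): classifies one
// lexeme according to the token descriptions given in the text.
final class SmallCTokens {
    // ident: starts with a letter, then letters, digits or underscores
    private static final Pattern IDENT = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");
    // number: non-empty sequence of digits
    private static final Pattern NUMBER = Pattern.compile("[0-9]+");
    // string: possibly empty, no newline (and, by assumption, no double quote) inside
    private static final Pattern STRING = Pattern.compile("\"[^\"\\n]*\"");
    // character: one plain character, or one of the escapes \n \t \\ \'
    private static final Pattern CHAR = Pattern.compile("'([^'\\\\\\n]|\\\\[nt\\\\'])'");

    static String classify(String lexeme) {
        // Note: keywords such as "while" would also match IDENT in this sketch.
        if (IDENT.matcher(lexeme).matches()) return "IDENT";
        if (NUMBER.matcher(lexeme).matches()) return "NUMBER";
        if (STRING.matcher(lexeme).matches()) return "STRING";
        if (CHAR.matcher(lexeme).matches()) return "CHAR";
        return "INVALID";
    }

    public static void main(String[] args) {
        String[] samples = {"x_1", "1x", "42", "\"hello world!\"", "'\\n'", "'\\045'"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

In this sketch the octal form '\045' is classified as INVALID, which is one way to read the statement that not all C character definitions need to be handled.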

Infrastructure

• ANTLR http://www.antlr.org

• ASM http://asm.ow2.org/

Useful references

• The Java Virtual Machine Specification http://java.sun.com/docs/books/vmspec/

• ANTLR Reference Manual http://www.antlr.org/doc/index.html

• ASM User Guide http://download.forge.objectweb.org/asm/asm-guide.pdf

• Wikipedia Entry for Small-C http://en.wikipedia.org/wiki/Small-C

• Wikipedia Entry for Bytecode http://en.wikipedia.org/wiki/Bytecode

• Apache ANT Manual http://ant.apache.org/manual/

Part 1: Front end (40%)

The first part of your coursework is to develop a front end of a compiler for the small-C language. Initially, you will develop a lexer and parser for the small-C language based on the ANTLR lexer and parser generation tool. A specification of the small-C language can be found in the appendix of this document. In a second step, you will extend your parser specification with action rules for the construction of an abstract syntax tree (AST) suitable for further processing. Finally, the AST will be written out to a plain text file (or any other human-readable format suitable for the representation of graphs), enabling you to inspect the AST for any given small-C program.

Your front end should accept correct small-C programs and construct an abstract syntax tree before this tree is written to a file. Furthermore, your front end should reject incorrect small-C programs (lexical and syntactical checking!) and provide the user with meaningful error messages before terminating gracefully. Perfecting error recovery, however, is not the main goal of this exercise, and you should focus on the correctness of your Small-C grammar implementation and the AST construction!
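One possible shape for the AST and its plain-text dump is sketched below. This is an illustration under stated assumptions rather than the tree representation ANTLR produces: the class AstNode, its label field and the methods dump and writeTo are all hypothetical names, and the indentation-based text format is just one human-readable choice.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node: a label (operator, literal, identifier, ...) plus children.
final class AstNode {
    final String label;
    final List<AstNode> children = new ArrayList<>();

    AstNode(String label, AstNode... kids) {
        this.label = label;
        for (AstNode kid : kids) {
            children.add(kid);
        }
    }

    // Renders the tree as text, one node per line, two spaces of indent per level.
    String dump() {
        StringBuilder sb = new StringBuilder();
        dump(sb, 0);
        return sb.toString();
    }

    private void dump(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(label).append('\n');
        for (AstNode child : children) {
            child.dump(sb, depth + 1);
        }
    }

    // Writes the textual dump to a plain text file.
    void writeTo(Path file) throws IOException {
        Files.write(file, dump().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // 5 - 3 - 2 parsed left-associatively: (5 - 3) - 2
        AstNode tree = new AstNode("-",
                new AstNode("-", new AstNode("5"), new AstNode("3")),
                new AstNode("2"));
        System.out.print(tree.dump());
    }
}
```

The nesting in the dump makes associativity visible: for 5 - 3 - 2 the inner subtraction appears as the left child of the outer one.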

Hints:

• Focus on the pure lexer and parser before you approach the construction of the AST! Once you have implemented a basic ANTLR specification for the given grammar you can then extend this with the necessary annotations for the construction of the AST.

• By default, ANTLR generates flat abstract syntax trees (= lists as degenerate trees). However, with a few extensions to your grammar file you can get ANTLR to generate proper trees. Make use of this facility.

• Please use ANTLR v.3 or v.4. Don’t use any older version of the ANTLR tool.

• It is advisable to use a build tool such as ANT for your project.

• There are many ANTLR grammars available for other programming languages. You are encouraged to read them and use them as templates for your own work!

• Write a couple of small Small-C test programs to exercise the various aspects of your front end. Include correct and incorrect Small-C programs and check whether they are correctly accepted or rejected. Also inspect the generated text file and compare the AST to what you would expect here. Submit your test programs along with your other code!

• Document your code!

• State what you have provided in a separate README file. This should include all features that you have implemented, tests that you have performed and features that you have found to be incomplete/incorrect, but haven’t managed to fix!

Part 2: Back end (60%)

The second part of your coursework is to develop a back end of a compiler for the Small-C language targeting the Java Virtual Machine. Initially, you will need to perform some semantic checks on the abstract syntax tree generated by your front end developed in part 1. Semantic analysis is dependent on context information; hence, you will need to develop a symbol table storing information about variables, functions and types. In the second stage of this part of the practical, you will generate JVM bytecode. You do not need to deal with low-level issues in bytecode generation, but will make use of the ASM library. Essentially, this library will provide you with a high-level API for bytecode generation. Traversing the abstract syntax tree, you will generate (simple) code for each visited node using the information stored in the symbol tables.
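A symbol table for variables could be sketched as follows. This is a minimal illustration, assuming a single flat scope (the handout does not describe scoping rules here); the names SymbolTable, Type, declare, lookup and arithmeticAllowed are hypothetical. It covers two of the checks mentioned in this handout: a variable is declared exactly once before use, and arithmetic operators apply only to integers.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical symbol table for variables only; functions and types would
// need their own tables or entries.
final class SymbolTable {
    enum Type { INT, CHAR }

    private final Map<String, Type> variables = new HashMap<>();

    // Records a declaration; returns false if the name was already declared.
    boolean declare(String name, Type type) {
        return variables.putIfAbsent(name, type) == null;
    }

    // Returns the declared type, or null if the name has not been declared.
    Type lookup(String name) {
        return variables.get(name);
    }

    // In small-C the binary arithmetic operators accept integer operands only.
    static boolean arithmeticAllowed(Type left, Type right) {
        return left == Type.INT && right == Type.INT;
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        System.out.println(table.declare("x", Type.INT));   // first declaration
        System.out.println(table.declare("x", Type.CHAR));  // duplicate declaration
        System.out.println(table.lookup("y"));              // use before declaration
    }
}
```

A semantic-analysis pass over the AST would call declare at each declaration node and lookup at each use, reporting an error when declare returns false or lookup returns null.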

Hints:

• If you prefer not to re-use your front end developed in part 1 you will be provided with a binary version of a front end for the Small-C language.

• It is advisable to maintain separate symbol tables for variables, functions, and types.

• It is sufficient to check the following properties during the semantic analysis stage: each identifier is declared before it is used, the types in expressions are compatible, and the destinations of assignments are legal.

• If in doubt about how to translate a Small-C construct to JVM bytecode, write a small Java program that contains this feature, compile it and inspect the generated bytecode. The javap Java Class File Disassembler is very useful for this task!

• Read the ASM manual. Use existing examples as templates for your own work!

• All your generated code for a Small-C program can go into a single (Java) class. Each Small-C function corresponds to a (static) Java method. The Small-C int and char types map to the corresponding Java types.

• As before, document all your code!

• Similarly, provide a README file stating what you have implemented and document the tests that you have performed. Include your test files.
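As an illustration of the javap hint above, the class below is a hand-written Java counterpart of a small, hypothetical Small-C function, following the suggested mapping of one static method per function. The class name JavapProbe and both methods are invented for this sketch, and the Small-C source shown in the comment assumes the grammar in the Appendix allows this form.

```java
// Hand-written Java counterpart of a hypothetical Small-C function such as:
//   int countdown(int n) { int steps; while (n) { n = n - 1; steps = steps + 1; } return steps; }
final class JavapProbe {
    static int countdown(int n) {
        int steps = 0;        // Small-C variables start with the value 0
        while (n != 0) {      // Small-C: while (n), any non-zero value means true
            n = n - 1;
            steps = steps + 1;
        }
        return steps;
    }

    // Left-associative subtraction: evaluated as (a - b) - c.
    static int sub3(int a, int b, int c) {
        return a - b - c;
    }

    public static void main(String[] args) {
        System.out.println(countdown(4));
        System.out.println(sub3(5, 3, 2));
    }
}
```

After compiling with javac JavapProbe.java, running javap -c JavapProbe lists the bytecode of each method, which can serve as a reference for the instructions your code generator should emit for loops, conditions and arithmetic.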

Deadlines and submission

The deadline for completion of part 1 of the practical exercise is

Friday, February 13th, 2015 at 4:00pm,

and the deadline for completion of part 2 is

Friday, March 20th, 2015 at 4:00pm.

Please submit your source code (and the ANTLR specification for the small-C parser) using the submit command, e.g.

submit ct cw1 .... (for part 1)
submit ct cw2 .... (for part 2)

Additional information on the electronic submission system can be found on the Informatics DICE machines using the man submit command.

Please submit only source code: no compiled class files. There is no need to submit the files generated by the ANTLR lexer and parser generator. Don’t tar/zip/... your code before submission; the submit command allows you to submit an entire directory!

Assessment procedure

Your Java submissions will be tested by being compiled and executed. A submission which uses proprietary (e.g. Microsoft) Java extensions and fails to compile with a version of the JDK will lose credit.

Your submission will be assessed on the correctness and clarity of your Java code and use of the tools infrastructure. You should follow good object-oriented programming practice by encapsulating information where it is appropriate to do so and providing a well-defined interface for other application . Your ANTLR grammar specification will be assessed on its correctness and clarity.

Queries and clarification

If you have any questions or uncertainties about this practical exercise, please contact your lecturer, Christophe Dubach ([email protected]) or Björn Franke (bfranke@inf.ed.ac.uk, IF1.04), by email or in person.
