
The Creation of a Compiler for the LITTLE Language

Will Cassella, Woodrow Melling, Bradley White

Montana State University

CSCI 468

Spring 2017

Table of Contents

Section 1: Program
    Program Specification
    Source Code
        IRGenerator
        IRInstruction
        Lexer
        LexerGrammar
        Listener
        Operand
        Symbol
        SymbolTable
        TinyEmitter
Section 2: Teamwork
Section 3: Design Pattern
Section 4: Technical Report
    Introduction
    Background
    Methods and Discussion
        Scanner
        Parser
        Symbol Table
        Semantic Routines
        Full Fledged Compiler
    Conclusion and Future Work
Section 5: UML
Section 6: Design Trade-offs
Section 7: Software Development Lifecycle

Section 1: Program

Program Specification

The following source code implements a simple compiler for the LITTLE language. The compiler is written in Java and uses the ANTLR library to generate scanner and parser classes from the grammar defined in LexerGrammar. The program is invoked through the Lexer class, which takes a LITTLE source file as a command line argument. The source code is first tokenized by the ANTLR-generated scanner. The stream of tokens then passes through the parser, also generated by ANTLR from the predefined grammar, which builds a parse tree and checks that the input is syntactically correct. After the tree is created, it is traversed by the Listener class, which performs symbol-table construction as well as intermediate representation (IR) generation.

The SymbolTable is an ordered hash map whose keys are symbol names and whose values are Symbol objects; it is used to track variables. A Symbol contains a type, a name, and an optional value. Each scope within the input file gets its own SymbolTable, and these tables are organized in a tree data structure so that nested scopes are associated with their respective parent scope.

The last step of the compiler is IR generation. Unlike commercial compilers, we perform no optimization on the intermediate code. The Listener instantiates an IRGenerator, which maintains a stack of IRInstruction objects. An IRInstruction holds up to two Operand instances (integer, float, or string variables or literals) and one operation code, such as ADDI or MULTI. These instructions are created as the Listener traverses the parse tree. The generated IR is independent of any machine architecture and is consumed by the TinyEmitter class, which emits assembly code for the Tiny VM architecture.
Further details on these processes are provided in the technical report, section four.
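The scope tree described above can be sketched as a parent-linked map. The `Scope` and `ScopeDemo` names below are illustrative stand-ins for the compiler's SymbolTable, not the actual implementation; a minimal sketch of the lookup behavior:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Each scope keeps an insertion-ordered map of its own symbols plus a
// pointer to its parent; find() walks the chain outward until a match
// or the global scope is exhausted.
class Scope
{
    private final Scope parent;
    private final Map<String, String> symbols = new LinkedHashMap<>();

    Scope(Scope parent) { this.parent = parent; }

    void add(String name, String type)
    {
        // Redeclaration in the same scope is an error, as in the compiler
        if (symbols.containsKey(name))
        {
            throw new IllegalArgumentException("DECLARATION ERROR " + name);
        }
        symbols.put(name, type);
    }

    String find(String name)
    {
        String type = symbols.get(name);
        if (type != null) return type;
        return parent != null ? parent.find(name) : null;
    }
}

public class ScopeDemo
{
    public static void main(String[] args)
    {
        Scope global = new Scope(null);
        global.add("x", "INT");

        Scope block = new Scope(global); // nested scope, e.g. an IF body
        block.add("y", "FLOAT");

        System.out.println(block.find("y")); // FLOAT (found locally)
        System.out.println(block.find("x")); // INT (found in parent)
        System.out.println(block.find("z")); // null (undeclared)
    }
}
```

Lookups that miss in the innermost scope fall through to enclosing scopes, which is exactly how the Listener resolves identifiers while walking nested IF and WHILE blocks.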

Source Code

IRGenerator

import java.util.ArrayList;
import java.util.Stack;

public class IRGenerator
{
    public IRGenerator()
    {
        label_stack = new Stack<>();
        expr_instructions = new Stack<>();
        instructions = new ArrayList<>();
        next_temp = 1;
    }

    public Operand allocate_temporary()
    {
        Operand result = Operand.temp_operand("$T" + next_temp, null);
        next_temp += 1;
        return result;
    }

    public IRInstruction add_instruction()
    {
        IRInstruction result = new IRInstruction();
        instructions.add(result);
        return result;
    }

    public IRInstruction push_instruction()
    {
        IRInstruction top = new IRInstruction();
        expr_instructions.push(top);
        return top;
    }

    public IRInstruction top_instruction()
    {
        if (!expr_instructions.empty())
        {
            return expr_instructions.peek();
        }
        else
        {
            return null;
        }
    }

    public IRInstruction pop()
    {
        if (!expr_instructions.empty())
        {
            IRInstruction top = expr_instructions.pop();
            instructions.add(top);
            return top;
        }
        else
        {
            return null;
        }
    }

    public void push_label(String label)
    {
        label_stack.push(label);
    }

    public String pop_label()
    {
        return label_stack.pop();
    }

    public String top_label()
    {
        return label_stack.peek();
    }

    public Stack<String> label_stack;
    public Stack<IRInstruction> expr_instructions;
    public ArrayList<IRInstruction> instructions;
    public int next_temp;
}

IRInstruction

public class IRInstruction
{
    public enum OP
    {
        ADDI, ADDF,
        SUBI, SUBF,
        MULTI, MULTF,
        DIVI, DIVF,
        STOREI, STOREF,
        GT, GE, LT, LE, NE, EQ,
        JUMP, LABEL,
        READI, READF,
        WRITEI, WRITEF, WRITES,
        UNDETERMINED_RESERVED,
    }

    @Override
    public String toString()
    {
        String str = op.toString();
        if (operand_1 != null)
        {
            str += " " + operand_1;
        }
        if (operand_2 != null)
        {
            str += " " + operand_2;
        }

        return str + " " + result;
    }

    public OP op = null;
    public Operand operand_1 = null;
    public Operand operand_2 = null;
    public Operand result = null;
}

Lexer

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;
import java.util.List;

public class Lexer
{
    private static void output_ir(List<IRInstruction> instructions)
    {
        // Output IR preamble
        System.out.println(";IR code");
        System.out.println(";LABEL main");
        System.out.println(";LINK");

        // Output instructions
        for (IRInstruction instruction : instructions)
        {
            System.out.println(";" + instruction);
        }

        // Output postamble
        System.out.println(";RET");
        System.out.println(";tiny code");
    }

    public static void main(String[] args) throws Exception
    {
        try
        {
            // Read in the LITTLE file from the command line arguments
            ANTLRFileStream fileStream = new ANTLRFileStream(args[0]);

            // Create a new lexer on the specified 'CharStream'
            lexerGrammarLexer lexer = new lexerGrammarLexer(fileStream);
            CommonTokenStream tokens = new CommonTokenStream(lexer);

            // Create a parser from the stream of tokens
            lexerGrammarParser parser = new lexerGrammarParser(tokens);

            // Remove the error listener so it does not interfere with the output stream
            parser.removeErrorListeners();

            // Walk the parse tree, starting from the 'program' rule
            ParseTreeWalker walker = new ParseTreeWalker();
            Listener listener = new Listener();
            walker.walk(listener, parser.program());

            output_ir(listener.ir_generator.instructions);

            TinyEmitter emitter = new TinyEmitter();
            String result = emitter.emit_code(listener.current_scope,
                listener.ir_generator.instructions);
            System.out.println(result);
        }
        catch (IllegalArgumentException e)
        {
            System.out.println(e.getMessage());
        }
    }
}

LexerGrammar

grammar lexerGrammar;

/* Program */
program : 'PROGRAM' id 'BEGIN' pgm_body 'END';
id : IDENTIFIER;
pgm_body : decl func_declarations;
decl : string_decl decl | var_decl decl | ;

/* Global String Declaration */
string_decl : 'STRING' id ':=' str ';' ;
str : STRINGLITERAL;

/* Variable Declaration */
var_decl : var_type id_list ';' ;
var_type : 'FLOAT' | 'INT';
any_type : var_type | 'VOID';
id_list : id id_tail;
id_tail : ',' id id_tail | ;

/* Function Parameter List */
param_decl_list : '(' param_decl param_decl_tail ')' | '(' ')' | ;
param_decl : var_type id;
param_decl_tail : ',' param_decl param_decl_tail | ;

/* Function Declarations */
func_declarations : func_decl func_declarations | ;
func_decl : 'FUNCTION' any_type id param_decl_list 'BEGIN' func_body 'END';
func_body : decl stmt_list;

/* Statement List */
stmt_list : stmt stmt_list | ;
stmt : base_stmt | if_stmt | while_stmt;
base_stmt : assign_stmt | read_stmt | write_stmt | return_stmt;

/* Basic Statements */
assign_stmt : assign_expr ';' ;
assign_expr : id ':=' expr;
read_stmt : 'READ' '(' id_list ')' ';' ;
write_stmt : 'WRITE' '(' id_list ')' ';' ;
return_stmt : 'RETURN' expr ';' ;

/* Expressions */
expr : expr_prefix factor;
expr_prefix : expr_prefix factor addop | ;
factor : factor_prefix postfix_expr;
factor_prefix : factor_prefix postfix_expr mulop | ;
postfix_expr : primary | call_expr;
call_expr : id '(' expr_list ')';
expr_list : expr expr_list_tail | ;
expr_list_tail : ',' expr expr_list_tail | ;
primary : '(' expr ')' | id | INTLITERAL | FLOATLITERAL;
addop : '+' | '-';
mulop : '*' | '/';

/* Complex Statements and Condition */
if_stmt : 'IF' '(' cond ')' decl stmt_list else_part 'ENDIF';
else_part : 'ELSE' decl stmt_list | ;
cond : expr compop expr;
compop : '<' | '>' | '=' | '!=' | '<=' | '>=';

/* While Statements */
while_stmt : 'WHILE' '(' cond ')' decl stmt_list 'ENDWHILE';
start : .*? EOF;

WS : (' ' | '\t' | '\r' | '\n') -> skip;

INTLITERAL : [0-9]+;

FLOATLITERAL : [0-9]* '.' [0-9]+;

STRINGLITERAL : '"' .*? '"';

COMMENT : '--' .*? '\n' -> skip;

KEYWORD : 'PROGRAM' | 'BEGIN' | 'END' | 'FUNCTION' | 'READ' | 'WRITE'
        | 'IF' | 'ELSE' | 'ENDIF' | 'WHILE' | 'ENDWHILE' | 'CONTINUE'
        | 'BREAK' | 'RETURN' | 'INT' | 'VOID' | 'STRING' | 'FLOAT';

IDENTIFIER : [A-Za-z]+ [0-9]*;

OPERATOR : ':=' | '+' | '-' | '*' | '/' | '=' | '!=' | '<' | '>'
         | '(' | ')' | ';' | ',' | '<=' | '>=';
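As a worked example of the grammar above, here is a small hypothetical LITTLE program it accepts (the names `demo`, `main`, and `a` are our own). Note that per `pgm_body`, a program body is declarations followed by function declarations, so statements live inside function bodies:

```
PROGRAM demo
BEGIN
    INT a;
    FUNCTION VOID main ()
    BEGIN
        a := 1 + 2;
        WRITE(a);
    END
END
```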

Listener

public class Listener extends lexerGrammarBaseListener
{
    public SymbolTable current_scope;
    public int block_id = 1;
    public IRGenerator ir_generator;
    public static final String END_PGM_LABEL = "END_PGM";

    public Listener()
    {
        current_scope = new SymbolTable(null, "GLOBAL");
        ir_generator = new IRGenerator();
    }

    @Override
    public void exitPgm_body(lexerGrammarParser.Pgm_bodyContext ctx)
    {
        // Add the END_PGM label
        IRInstruction end_instr = ir_generator.add_instruction();
        end_instr.op = IRInstruction.OP.LABEL;
        end_instr.result = Operand.label_operand(END_PGM_LABEL);
    }

    @Override
    public void enterReturn_stmt(lexerGrammarParser.Return_stmtContext ctx)
    {
        // Add an unconditional jump to the end of the program
        IRInstruction jmp_instr = ir_generator.add_instruction();
        jmp_instr.op = IRInstruction.OP.JUMP;
        jmp_instr.result = Operand.label_operand(END_PGM_LABEL);
    }

    @Override
    public void enterVar_decl(lexerGrammarParser.Var_declContext ctx)
    {
        String var_type = ctx.var_type().getText();

        // Add the first id
        Symbol id_sym = new Symbol(var_type, ctx.id_list().id().getText());
        current_scope.add(id_sym.get_name(), id_sym);

        // Add the remainder of the id list
        lexerGrammarParser.Id_tailContext tail = ctx.id_list().id_tail();
        while (tail.getChildCount() != 0)
        {
            Symbol tail_sym = new Symbol(var_type, tail.id().getText());
            current_scope.add(tail_sym.get_name(), tail_sym);
            tail = tail.id_tail();
        }
    }

    @Override
    public void enterString_decl(lexerGrammarParser.String_declContext ctx)
    {
        Symbol sym = new Symbol("STRING", ctx.id().getText(), ctx.str().getText());
        current_scope.add(sym.get_name(), sym);
    }

    @Override
    public void enterFunc_decl(lexerGrammarParser.Func_declContext ctx)
    {
        current_scope = new SymbolTable(current_scope, ctx.id().getText());
    }

    @Override
    public void exitFunc_decl(lexerGrammarParser.Func_declContext ctx)
    {
        current_scope = current_scope.get_parent();
    }

    @Override
    public void enterRead_stmt(lexerGrammarParser.Read_stmtContext ctx)
    {
        // Get the first id
        Symbol id_sym = current_scope.find(ctx.id_list().id().getText());
        add_read_instruction(id_sym);

        // Add the remainder of the id list
        lexerGrammarParser.Id_tailContext tail = ctx.id_list().id_tail();
        while (tail.getChildCount() != 0)
        {
            id_sym = current_scope.find(tail.id().getText());
            add_read_instruction(id_sym);
            tail = tail.id_tail();
        }
    }

    @Override
    public void enterWrite_stmt(lexerGrammarParser.Write_stmtContext ctx)
    {
        // Get the first id
        Symbol id_sym = current_scope.find(ctx.id_list().id().getText());
        add_write_instruction(id_sym);

        // Add the remainder of the id list
        lexerGrammarParser.Id_tailContext tail = ctx.id_list().id_tail();
        while (tail.getChildCount() != 0)
        {
            id_sym = current_scope.find(tail.id().getText());
            add_write_instruction(id_sym);
            tail = tail.id_tail();
        }
    }

    @Override
    public void enterParam_decl(lexerGrammarParser.Param_declContext ctx)
    {
        Symbol sym = new Symbol(ctx.var_type().getText(), ctx.id().getText());
        current_scope.add(sym.get_name(), sym);
    }

    @Override
    public void enterIf_stmt(lexerGrammarParser.If_stmtContext ctx)
    {
        current_scope = new SymbolTable(current_scope, "BLOCK " + block_id);

        // Create labels for the ELSE and END
        String else_label = "IF_ELSE_" + block_id;
        String end_label = "IF_END_" + block_id;
        block_id += 1;

        // Push labels onto the label stack so the condition knows where to go
        ir_generator.push_label(end_label);
        ir_generator.push_label(else_label);
    }

    @Override
    public void exitIf_stmt(lexerGrammarParser.If_stmtContext ctx)
    {
        current_scope = current_scope.get_parent();

        // Generate the end label
        String end_label = ir_generator.pop_label();
        IRInstruction end_instr = ir_generator.add_instruction();
        end_instr.op = IRInstruction.OP.LABEL;
        end_instr.result = Operand.label_operand(end_label);
    }

    @Override
    public void enterElse_part(lexerGrammarParser.Else_partContext ctx)
    {
        // Get labels
        String else_label = ir_generator.pop_label();
        String end_label = ir_generator.top_label();

        // Add an unconditional jump to END (so the positive part doesn't execute the else)
        IRInstruction end_instr = ir_generator.add_instruction();
        end_instr.op = IRInstruction.OP.JUMP;
        end_instr.result = Operand.label_operand(end_label);

        // Add the ELSE label
        IRInstruction else_instr = ir_generator.add_instruction();
        else_instr.op = IRInstruction.OP.LABEL;
        else_instr.result = Operand.label_operand(else_label);

        // Don't create a new scope for an empty else
        if (ctx.getChildCount() == 0)
        {
            return;
        }

        // The else should not be a child scope of the if
        current_scope = current_scope.get_parent();
        current_scope = new SymbolTable(current_scope, "BLOCK " + block_id);
        block_id += 1;
    }

    @Override
    public void enterWhile_stmt(lexerGrammarParser.While_stmtContext ctx)
    {
        current_scope = new SymbolTable(current_scope, "BLOCK " + block_id);
        String begin_label = "WHILE_" + block_id + "_BEGIN";
        String end_label = "WHILE_" + block_id + "_END";
        block_id += 1;

        // Add the label to jump back to
        IRInstruction begin_instr = ir_generator.add_instruction();
        begin_instr.op = IRInstruction.OP.LABEL;
        begin_instr.result = Operand.label_operand(begin_label);

        // Add labels to the stack, so the condition knows where to jump
        // and we can add the end label later
        ir_generator.push_label(begin_label);
        ir_generator.push_label(end_label);
    }

    @Override
    public void exitWhile_stmt(lexerGrammarParser.While_stmtContext ctx)
    {
        current_scope = current_scope.get_parent();

        // Get the labels for this loop
        String end_label = ir_generator.pop_label();
        String begin_label = ir_generator.pop_label();

        // Generate an unconditional jump back to the beginning
        IRInstruction jump_begin = ir_generator.add_instruction();
        jump_begin.op = IRInstruction.OP.JUMP;
        jump_begin.result = Operand.label_operand(begin_label);

        // Add the ending label
        IRInstruction end_instr = ir_generator.add_instruction();
        end_instr.op = IRInstruction.OP.LABEL;
        end_instr.result = Operand.label_operand(end_label);
    }

    @Override
    public void enterCond(lexerGrammarParser.CondContext ctx)
    {
        // Create the comparison instruction; it should jump to the label on
        // top of the stack (does NOT pop)
        IRInstruction comp_instr = ir_generator.push_instruction();
        comp_instr.result = Operand.label_operand(ir_generator.top_label());

        // Figure out the comparison type, and invert it (we jump when the
        // condition is false)
        if (ctx.compop().getText().equals("<"))
        {
            comp_instr.op = IRInstruction.OP.GE;
        }
        else if (ctx.compop().getText().equals(">"))
        {
            comp_instr.op = IRInstruction.OP.LE;
        }
        else if (ctx.compop().getText().equals("="))
        {
            comp_instr.op = IRInstruction.OP.NE;
        }
        else if (ctx.compop().getText().equals("!="))
        {
            comp_instr.op = IRInstruction.OP.EQ;
        }
        else if (ctx.compop().getText().equals("<="))
        {
            comp_instr.op = IRInstruction.OP.GT;
        }
        else if (ctx.compop().getText().equals(">="))
        {
            comp_instr.op = IRInstruction.OP.LT;
        }
        else
        {
            assert(false); // Unrecognized comparison op
        }
    }

    @Override
    public void enterAssign_expr(lexerGrammarParser.Assign_exprContext ctx)
    {
        // Generate a new instruction with the symbol as the target
        Symbol lvalue = current_scope.find(ctx.id().getText());
        IRInstruction assign_inst = ir_generator.push_instruction();
        assign_inst.result = Operand.symbol_operand(lvalue);
    }

    @Override
    public void exitAssign_expr(lexerGrammarParser.Assign_exprContext ctx)
    {
        // If the op was never determined, it should be a store
        IRInstruction assign_inst = ir_generator.pop();
        if (assign_inst != null && assign_inst.op == null)
        {
            if (assign_inst.result.type == Operand.Type.INT_VAR)
            {
                assign_inst.op = IRInstruction.OP.STOREI;
            }
            else if (assign_inst.result.type == Operand.Type.FLOAT_VAR)
            {
                assign_inst.op = IRInstruction.OP.STOREF;
            }
        }
    }

    @Override
    public void enterFactor_prefix(lexerGrammarParser.Factor_prefixContext ctx)
    {
        // If this isn't actually a factor prefix (empty production), skip it
        if (ctx.getChildCount() == 0)
        {
            return;
        }

        expr_recurse();
    }

    @Override
    public void enterExpr_prefix(lexerGrammarParser.Expr_prefixContext ctx)
    {
        // If this isn't actually an expr prefix (empty production), skip it
        if (ctx.getChildCount() == 0)
        {
            return;
        }

        expr_recurse();
    }

    @Override
    public void enterPrimary(lexerGrammarParser.PrimaryContext ctx)
    {
        if (ctx.id() != null)
        {
            Symbol sym = current_scope.find(ctx.id().getText());
            Operand operand = Operand.symbol_operand(sym);
            assign_operand(operand);
            pop_if_complete();
        }
        else if (ctx.INTLITERAL() != null)
        {
            assign_operand(Operand.int_lit_operand(ctx.INTLITERAL().getText()));
            pop_if_complete();
        }
        else if (ctx.FLOATLITERAL() != null)
        {
            assign_operand(Operand.float_lit_operand(ctx.FLOATLITERAL().getText()));
            pop_if_complete();
        }
    }

    @Override
    public void enterAddop(lexerGrammarParser.AddopContext ctx)
    {
        IRInstruction instr = ir_generator.top_instruction();

        // The first operand's type must have been determined (only ints and
        // floats are allowed for ADD/SUB)
        assert(instr.operand_1.is_int() || instr.operand_1.is_float());
        assert(instr.op == IRInstruction.OP.UNDETERMINED_RESERVED);

        if (ctx.getText().equals("+"))
        {
            if (instr.operand_1.is_int())
            {
                instr.op = IRInstruction.OP.ADDI;
            }
            else
            {
                instr.op = IRInstruction.OP.ADDF;
            }
        }
        else
        {
            assert(ctx.getText().equals("-"));
            if (instr.operand_1.is_int())
            {
                instr.op = IRInstruction.OP.SUBI;
            }
            else
            {
                instr.op = IRInstruction.OP.SUBF;
            }
        }
    }

    @Override
    public void enterMulop(lexerGrammarParser.MulopContext ctx)
    {
        IRInstruction instr = ir_generator.top_instruction();

        // The first operand's type must have been determined (only ints and
        // floats are allowed for MUL/DIV)
        assert(instr.operand_1.is_int() || instr.operand_1.is_float());
        assert(instr.op == IRInstruction.OP.UNDETERMINED_RESERVED);

        if (ctx.getText().equals("*"))
        {
            if (instr.operand_1.is_int())
            {
                instr.op = IRInstruction.OP.MULTI;
            }
            else
            {
                instr.op = IRInstruction.OP.MULTF;
            }
        }
        else
        {
            assert(ctx.getText().equals("/"));
            if (instr.operand_1.is_int())
            {
                instr.op = IRInstruction.OP.DIVI;
            }
            else
            {
                instr.op = IRInstruction.OP.DIVF;
            }
        }
    }

    private void expr_recurse()
    {
        // If the current instruction has already been assigned an op,
        // start a new temporary instruction
        IRInstruction instr = ir_generator.top_instruction();
        if (instr.op != null)
        {
            instr = push_temp_instruction();
        }

        instr.op = IRInstruction.OP.UNDETERMINED_RESERVED;
    }

    private IRInstruction push_temp_instruction()
    {
        Operand temp = ir_generator.allocate_temporary();
        assign_operand(temp);
        IRInstruction instr = ir_generator.push_instruction();
        instr.result = temp;
        return instr;
    }

    private void determine_result_type()
    {
        IRInstruction instr = ir_generator.top_instruction();
        if (instr.result.type != null)
        {
            return;
        }

        assert(instr.operand_1.is_int() || instr.operand_1.is_float());

        if (instr.operand_1.is_int())
        {
            instr.result.type = Operand.Type.INT_VAR;
        }
        else
        {
            instr.result.type = Operand.Type.FLOAT_VAR;
        }
    }

    private void assign_operand(Operand operand)
    {
        IRInstruction instr = ir_generator.top_instruction();
        assert(instr.operand_1 == null || instr.operand_2 == null);

        if (instr.operand_1 == null)
        {
            instr.operand_1 = operand;
        }
        else
        {
            instr.operand_2 = operand;
        }
    }

    private void pop_if_complete()
    {
        IRInstruction instr = ir_generator.top_instruction();
        if (instr != null && instr.operand_1 != null && instr.op != null &&
            instr.operand_2 != null)
        {
            determine_result_type();
            ir_generator.pop();
            pop_if_complete();
        }
    }

    private void add_read_instruction(Symbol symbol)
    {
        IRInstruction instr = ir_generator.add_instruction();
        instr.result = Operand.symbol_operand(symbol);
        if (symbol.get_type().equals("INT"))
        {
            instr.op = IRInstruction.OP.READI;
        }
        else if (symbol.get_type().equals("FLOAT"))
        {
            instr.op = IRInstruction.OP.READF;
        }
    }

    private void add_write_instruction(Symbol symbol)
    {
        IRInstruction instr = ir_generator.add_instruction();
        instr.result = Operand.symbol_operand(symbol);
        if (symbol.get_type().equals("INT"))
        {
            instr.op = IRInstruction.OP.WRITEI;
        }
        else if (symbol.get_type().equals("FLOAT"))
        {
            instr.op = IRInstruction.OP.WRITEF;
        }
        else if (symbol.get_type().equals("STRING"))
        {
            instr.op = IRInstruction.OP.WRITES;
        }
    }
}

Operand

public class Operand
{
    public enum Type
    {
        INT_VAR,
        FLOAT_VAR,
        STRING_VAR,
        INT_LIT,
        FLOAT_LIT,
        LABEL
    }

    public static Operand symbol_operand(Symbol symbol)
    {
        Operand result = new Operand();
        result.value = symbol.get_name();

        if (symbol.get_type().equals("INT"))
        {
            result.type = Type.INT_VAR;
        }
        else if (symbol.get_type().equals("FLOAT"))
        {
            result.type = Type.FLOAT_VAR;
        }
        else if (symbol.get_type().equals("STRING"))
        {
            result.type = Type.STRING_VAR;
        }

        return result;
    }

    public static Operand temp_operand(String name, Type type)
    {
        Operand result = new Operand();
        result.value = name;
        result.type = type;
        return result;
    }

    public static Operand label_operand(String name)
    {
        Operand result = new Operand();
        result.type = Type.LABEL;
        result.value = name;
        return result;
    }

    public static Operand int_lit_operand(String value)
    {
        Operand result = new Operand();
        result.type = Type.INT_LIT;
        result.value = value;
        return result;
    }

    public static Operand float_lit_operand(String value)
    {
        Operand result = new Operand();
        result.type = Type.FLOAT_LIT;
        result.value = value;
        return result;
    }

    @Override
    public String toString()
    {
        return value;
    }

    public boolean is_int()
    {
        return type == Type.INT_LIT || type == Type.INT_VAR;
    }

    public boolean is_float()
    {
        return type == Type.FLOAT_VAR || type == Type.FLOAT_LIT;
    }

    public boolean is_lit()
    {
        return type == Type.INT_LIT || type == Type.FLOAT_LIT;
    }

    public Type type;
    public String value;
}

Symbol

public class Symbol
{
    public Symbol(String type, String name)
    {
        _type = type;
        _name = name;
    }

    public Symbol(String type, String name, String value)
    {
        _type = type;
        _name = name;
        _value = value;
    }

    public String get_type()
    {
        return _type;
    }

    public String get_name()
    {
        return _name;
    }

    public String get_value()
    {
        return _value;
    }

    @Override
    public String toString()
    {
        return "name " + _name + " type " + _type +
            (_value != null ? " value " + _value : "");
    }

    private String _type;
    private String _name;
    private String _value;
}


SymbolTable

import org.antlr.v4.misc.OrderedHashMap;
import java.util.ArrayList;

public class SymbolTable
{
    public SymbolTable(SymbolTable parent, String scope_name)
    {
        _parent = parent;
        _children = new ArrayList<>();
        _scope_name = scope_name;
        _symbols = new OrderedHashMap<>();

        if (parent != null)
        {
            parent._children.add(this);
        }
    }

    public SymbolTable get_parent()
    {
        return _parent;
    }

    public Symbol find(String name)
    {
        Symbol result = _symbols.get(name);
        if (result != null)
        {
            return result;
        }
        else if (_parent != null)
        {
            return _parent.find(name);
        }

        return null;
    }

    public void add(String name, Symbol symbol) throws IllegalArgumentException
    {
        if (_symbols.containsKey(name))
        {
            throw new IllegalArgumentException("DECLARATION ERROR " + name);
        }

        assert(name != null && symbol != null);
        _symbols.put(name, symbol);
    }

    public ArrayList<Symbol> get_symbols()
    {
        ArrayList<Symbol> result = new ArrayList<>();
        result.addAll(_symbols.values());
        for (SymbolTable child : _children)
        {
            result.addAll(child.get_symbols());
        }

        return result;
    }

    @Override
    public String toString()
    {
        String result = "Symbol table " + _scope_name;
        for (Symbol entry : _symbols.values())
        {
            result += "\n" + entry.toString();
        }

        for (SymbolTable child : _children)
        {
            result += "\n\n" + child.toString();
        }

        return result;
    }

    private SymbolTable _parent;
    private ArrayList<SymbolTable> _children;
    private String _scope_name;
    private OrderedHashMap<String, Symbol> _symbols;
}

TinyEmitter import java.util.HashMap; ​ import java.util.HashSet; ​ import java.util.List; ​ public class TinyEmitter ​ { private static final String SWAP_REG = "r0"; ​ ​ ​ ​ ​ ​

24 public String emit_code(SymbolTable symbols, List instructions) ​ ​ { StringBuilder result = new StringBuilder(); ​ ​

// Allocate variables ​ for (Symbol sym : symbols.get_symbols()) ​ ​ { if (sym.get_type().equals("STRING")) ​ ​ ​ ​ { _allocations.put(sym.get_name(), sym.get_name()); ​ ​ result.append("str "); ​ ​ result.append(sym.get_name()); result.append(" "); ​ ​ result.append(sym.get_value()); result.append("\n"); ​ ​ ​ ​ continue; ​ ​ }

_allocations.put(sym.get_name(), "v_" + sym.get_name()); ​ ​ ​ ​ result.append("var "); ​ ​ result.append("v_" + sym.get_name()); ​ ​ result.append("\n"); ​ ​ ​ ​ }

// begin emitting code ​ for (IRInstruction instr : instructions) ​ ​ { switch (instr.op) ​ ​ ​ ​ { case LABEL: ​ ​ ​ gen_1ac(result, "label", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case JUMP: ​ ​ ​ gen_1ac(result, "jmp", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case GT: ​ ​ ​ gen_cmp(result, instr.operand_1, instr.operand_2); ​ ​ ​ ​ gen_1ac(result, "jgt", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case GE: ​ ​ ​ gen_cmp(result, instr.operand_1, instr.operand_2); ​ ​ ​ ​ gen_1ac(result, "jge", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case LT: ​ ​ ​ gen_cmp(result, instr.operand_1, instr.operand_2); ​ ​ ​ ​ gen_1ac(result, "jlt", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​

25 break; ​ ​

case LE: ​ ​ ​ gen_cmp(result, instr.operand_1, instr.operand_2); ​ ​ ​ ​ gen_1ac(result, "jle", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case NE: ​ ​ ​ gen_cmp(result, instr.operand_1, instr.operand_2); ​ ​ ​ ​ gen_1ac(result, "jne", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case EQ: ​ ​ ​ gen_cmp(result, instr.operand_1, instr.operand_2); ​ ​ ​ ​ gen_1ac(result, "jeq", instr.result.value); ​ ​ ​ ​ ​ ​ ​ ​ break; ​ ​

case STOREI: ​ ​ ​ case STOREF: ​ ​ ​ // Only one argument to MOVE may be a memory address ​ if (!is_lit_or_reg(instr.operand_1)) ​ ​ ​ ​ { gen_store_in_swap(result, get_opmr(instr.operand_1)); ​ ​ ​ ​ gen_2ac(result, "move", SWAP_REG, get_opmr(instr.result)); ​ ​ ​ ​ ​ ​ ​ ​ } else ​ { ​ gen_2ac(result, "move", get_opmrl(instr.operand_1), ​ ​ ​ ​ ​ ​ get_opmr(instr.result)); ​ ​ } break; ​ ​

case ADDI: ​ ​ ​ gen_store_in_swap(result, get_opmrl(instr.operand_1)); ​ ​ ​ ​ gen_2ac(result, "addi", get_opmrl(instr.operand_2), SWAP_REG); ​ ​ ​ ​ ​ ​ ​ ​ gen_load_from_swap(result, get_opmr(instr.result)); ​ ​ ​ ​ break; ​ ​

        case ADDF:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "addr", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case SUBI:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "subi", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case SUBF:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "subr", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case MULTI:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "muli", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case MULTF:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "mulr", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case DIVI:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "divi", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case DIVF:
            gen_store_in_swap(result, get_opmrl(instr.operand_1));
            gen_2ac(result, "divr", get_opmrl(instr.operand_2), SWAP_REG);
            gen_load_from_swap(result, get_opmr(instr.result));
            break;

        case READI:
            gen_1ac(result, "sys readi", get_opmr(instr.result));
            break;

        case READF:
            gen_1ac(result, "sys readr", get_opmr(instr.result));
            break;

        case WRITEI:
            gen_1ac(result, "sys writei", get_opmr(instr.result));
            break;

        case WRITEF:
            gen_1ac(result, "sys writer", get_opmr(instr.result));
            break;

        case WRITES:
            gen_1ac(result, "sys writes", get_opmr(instr.result));
            break;

        default:
            assert(false); // Unsupported op
            break;
        }
    }

    // Output post
    result.append("sys halt\n");
    result.append("end\n");

    // Output result
    return result.toString();
}

private static void gen_1ac(StringBuilder result, String op, String arg)
{
    result.append(op); result.append(" ");
    result.append(arg); result.append("\n");
}

private static void gen_2ac(StringBuilder result, String op, String opmrl, String reg_result)
{
    result.append(op); result.append(" ");
    result.append(opmrl); result.append(" ");
    result.append(reg_result); result.append("\n");
}

private static void gen_store_in_swap(StringBuilder result, String opmrl)
{
    gen_2ac(result, "move", opmrl, SWAP_REG);
}

private static void gen_load_from_swap(StringBuilder result, String opmr)
{
    gen_2ac(result, "move", SWAP_REG, opmr);
}

private void gen_cmp(StringBuilder result, Operand lhs, Operand rhs)
{
    // Store the RHS in the swap reg, because the RHS of a CMP instruction
    // is required to be a register
    gen_store_in_swap(result, get_opmrl(rhs));

    // Determine if we should use float or integer comparison
    if (lhs.is_int())
    {
        gen_2ac(result, "cmpi", get_opmrl(lhs), SWAP_REG);
    }
    else if (lhs.is_float())
    {
        gen_2ac(result, "cmpr", get_opmrl(lhs), SWAP_REG);
    }
    else
    {
        assert(false);
    }
}

private String get_opmrl(Operand operand)
{
    if (operand.type == Operand.Type.INT_LIT || operand.type == Operand.Type.FLOAT_LIT)
    {
        return operand.value;
    }

    return get_opmr(operand);
}

private String get_opmr(Operand operand)
{
    assert(!operand.is_lit());

    if (!_allocations.containsKey(operand.value))
    {
        String reg = "r" + _next_reg;
        _next_reg += 1;
        _allocations.put(operand.value, reg);
        _reg_variables.add(operand.value);
        return reg;
    }

    return _allocations.get(operand.value);
}

private boolean is_lit_or_reg(Operand operand)
{
    return operand.is_lit() || _reg_variables.contains(operand.value);
}

private HashSet<String> _reg_variables = new HashSet<>();
private HashMap<String, String> _allocations = new HashMap<>();
private int _next_reg = 1; // r0 is reserved for swap
}

Section 2: Teamwork

We collaborated on this project through in-person meetings, using GitHub for version control and sharing of the source code. For the first step of the compiler, the scanner, each team member worked on the source code to become familiar with the ANTLR library. During the subsequent steps, as the compiler became more complicated, the team began to do pair (and trio) programming on a single computer for more active collaboration and learning. For example, during a group meeting, team member 1 would write the code while team member 2 watched, commented, and assisted with design decisions and semantic errors. During this time, team member 3 would research the topic at hand, provide input similar to team member 2, or write the corresponding section of the technical report. These roles rotated throughout the project, with each member writing code, watching over a shoulder, or working on the report. At the completion of each step of the compiler, the project code was committed to GitHub from the computer we had worked on, and submitted after it was tested with the grading shell scripts and corresponding input test cases. To summarize, all team members contributed equal time across the project, including writing source code, testing source code, providing input on design decisions, and completing the portfolio.

Section 3: Design Pattern

We did not actively employ any design patterns during the design and creation of the compiler. The rigid structure of the ANTLR library largely determined our class hierarchy. Furthermore, the LITTLE compiler serves a static purpose, and design patterns are better suited to dynamic code bases which are continually maintained and extended. One could argue that a compiler does in fact need to be dynamic to handle various kinds of source code, and that is true of commercial compilers, but the LITTLE language is rather simplistic and most of the dynamic aspects could be handled within the grammar definition.

Section 4: Technical Report

Introduction

Most computer science students make little to no mental distinction between the language they write in (Java, C#, C, etc.) and the tools surrounding it (IDEs, compilers, etc.). In this course, our mission was to develop a full source-to-assembly compiler for a simple programming language known as LITTLE, in an effort to break down this indistinction.

Background

At the end of the day, modern computers are still only capable of executing assembly instructions encoded in binary, known as machine code. This machine code is very difficult to write, and the result is not portable between operating systems, processor architectures, and sometimes even different generations of the same processor architecture. In order for businesses to serve products to their millions of users across dozens of different platforms, there needs to be a way to write programs that are easy to port across different platforms, new and old. The way this is done is by using a compiler, which can translate code written in a machine-agnostic manner into machine code which can be executed.

There are many different ways to write a compiler, but they almost always begin by first defining a specification for the language they compile. There are several advantages to separating the specification and implementation of a language: it makes it possible to differentiate between bugs in the compiler and bugs in the design of the language, end users are able to reason about the specific rules of the language without having access to or knowledge of the compiler source code, and it becomes possible to develop new compilers for the language that are fully compatible with code written for the original compiler.

There is a great deal of theory surrounding language specification, but fortunately enough work has been done that it is frequently possible to leverage compiler generators for certain components of a compiler, using only the specification of the language. These compiler generators automatically produce source code, in the language the compiler is written in, that implements components such as lexing and parsing. This can immediately eliminate bugs frequent in text-processing applications and greatly increase compiler-developer productivity. However, this sometimes comes at the cost of performance and maintainability.
The type of compiler that most programmers are familiar with operates by consuming source code and directly emitting machine code for the architecture the compiler is configured for (which may be different from the architecture the compiler is running on; this is known as cross-compiling). These are sometimes referred to as native compilers, and the advantages of this style are generally high levels of static analysis and error checking, and greater run-time efficiency. The disadvantage of this style is that the generated code is as non-portable as if the machine code had been written by hand, though recompiling the original source code for a different architecture is trivial, assuming the source is available. That is not the case for closed-source applications, and it is not unheard of for companies to lose the source code for their own applications. Examples of languages that are typically compiled this way are C, C++, Rust, and D, which may be compiled with gcc, MSVC, rustc, and LDC, respectively (among others).

Other compilers operate by consuming source code and immediately executing it; these are generally known as interpreters. These compilers are known for having very fast developer turnaround time, at the cost of run-time efficiency and fewer guarantees of correctly executing code. Examples of languages that are typically run this way are Python, JavaScript, and Perl, which may be executed with CPython, SpiderMonkey, and Perl, respectively (among others).

Others generate binary code that can be interpreted more efficiently than source code by a program known as a virtual machine. The advantage of this is that the emitted code generally has the same portability as interpreted languages (assuming an implementation of the virtual machine is available for the target architecture), with typically the same level of static analysis as native compilers. Additionally, the binary code format can be specified independently of the source language, so that multiple source languages may target the same binary code format and can interoperate as cleanly as if they had been written in the same language.
The greatest disadvantage of this style is that programs compiled this way are still generally not as efficient as natively compiled programs, though some virtual machine implementations alleviate this to an extent by leveraging Just-In-Time compilation, which generates real machine code for the binary code while it is being interpreted. The ecosystems for these types of languages are by far the most diverse, as there can be multiple source languages that can be compiled by multiple compilers to the same binary code format, and multiple virtual machines that can execute it. Examples of this are the languages Java, Scala, and Kotlin, all of which may be compiled by various compilers to the JVM bytecode format, and may be executed by virtual machines such as the Oracle JVM and OpenJDK, among others. A similar ecosystem is Microsoft's CLR, which can be targeted by compilers for C#, F#, IronPython, and even C++. Implementations of virtual machines for this include the .NET Runtime and Mono.

While those are the general archetypes for compilers, many implementations blur the lines between them. There are native compilers for typically interpreted languages such as Ruby (Crystal), there are interpreters for typically natively compiled languages such as C++ (CINT, Cling), and there are even compilers that can compile code written in one source language directly to another (known as a source-to-source compiler), such as C++ to JavaScript (Emscripten). For this project, we have written a native compiler for the language LITTLE.

Methods and Discussion

In this section, we will discuss the functionality of each stage of the compiler and how we implemented that stage in our own code. We will also discuss some difficulties we faced when implementing each part.

Scanner

Background

The scanner is a sub-program of the compiler which views the LITTLE source code as a sequence of characters and outputs a sequence of tokens representing the meaningful elements of the program. This sequence of tokens is used as the input for later steps such as the parser. Each token is an object that contains a TOKENTYPE and the value of the token as it appeared in the input text.

Methods

To initialize ANTLR and provide the scanner with a sequence of characters to read, we first initialized an ANTLRFileStream that reads the contents of the input file. Then we created a lexerGrammarLexer that uses the contents of our lexerGrammar grammar specification to create a scanner. We then passed the result of this into a CommonTokenStream, which holds the sequence of tokens our scanner outputs. The lexerGrammar file contains the regular expressions that define the tokens in the LITTLE language. There are eight token types in our grammar: KEYWORD, IDENTIFIER, INTLITERAL, FLOATLITERAL, STRINGLITERAL, COMMENT, OPERATOR, and WS (whitespace). Each of these token types has a regular expression that defines what should be matched as that type. For example, the full definition for a floating-point number would be:

FLOATLITERAL: [0-9]*'.'[0-9]+ ;

This regular expression captures numbers of the form 1234.5678. After building the lexerGrammar file and initializing the lexerGrammarLexer, ANTLR creates the actual scanner itself and generates the token stream based on the grammar in lexerGrammar and the input file.
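The same pattern can be checked outside of ANTLR with a standard Java regular expression, which is a convenient way to sanity-test a token definition. The class and method names below are illustrative, not part of the compiler:

```java
import java.util.regex.Pattern;

// Sketch: the FLOATLITERAL rule [0-9]*'.'[0-9]+ expressed as a plain
// Java regex, so the token definition can be tested outside of ANTLR.
public class FloatLiteralCheck {
    // Zero or more digits, a literal dot, then one or more digits.
    private static final Pattern FLOAT_LITERAL = Pattern.compile("[0-9]*\\.[0-9]+");

    public static boolean isFloatLiteral(String s) {
        return FLOAT_LITERAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isFloatLiteral("1234.5678")); // true
        System.out.println(isFloatLiteral(".5"));        // true: integer part is optional
        System.out.println(isFloatLiteral("1234"));      // false: no decimal point
        System.out.println(isFloatLiteral("12."));       // false: fraction digits required
    }
}
```

Note that the rule requires at least one digit after the decimal point, so `12.` is rejected while `.5` is accepted.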

Difficulties

Differentiating between floating-point and integer literals was somewhat tricky. We encountered a bug where certain rules were too greedy at matching symbols, and would eventually match the entire program. We fixed this by following the advice of one of the warnings generated by the ANTLR grammar compiler, which was to replace the sequence '.*' with '.*?'. Additionally, the order of the rules within the grammar file proved to be important; for example, keywords like 'PROGRAM' could be erroneously matched as strings. Lastly, we had to account for whitespace characters like '\r' and '\n', because we were not getting the intended output, and it was not completely apparent why at first.
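The greedy-matching bug described above can be reproduced with plain Java regexes rather than ANTLR rules. This sketch (with illustrative names) shows how a greedy `.*` inside a string-literal pattern swallows the text between two adjacent literals, while the reluctant `.*?` stops at the first closing quote:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the greedy-matching bug: a greedy ".*" inside a string-literal
// pattern consumes everything up to the LAST quote in the input, while the
// reluctant ".*?" stops at the first one, matching a single literal.
public class GreedyDemo {
    public static String firstMatch(String pattern, String input) {
        Matcher m = Pattern.compile(pattern).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"Hello\" x \"World\"";
        // Greedy: matches from the first quote all the way to the last one.
        System.out.println(firstMatch("\".*\"", input));  // "Hello" x "World"
        // Reluctant: matches a single string literal, as intended.
        System.out.println(firstMatch("\".*?\"", input)); // "Hello"
    }
}
```

ANTLR's `.*?` in a lexer rule behaves analogously, which is why the grammar compiler's warning suggested the replacement.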

Parser

Background

The parser is a subprogram of the compiler which examines the stream of tokens output by the scanner. The parser has two main objectives. The first is to generate a parse tree which is then used to generate code. The parse tree is created according to a predefined grammar that contains the rules of the language. The second is syntactic error detection: the compiler will generate an error if the tokens do not match the predefined grammar. We used ANTLR's parser generator to create our parser. ANTLR uses an LL(*) parser. An LL(*) parser is similar to an LL(k) parser, but it does not have a fixed finite lookahead.

Methods

To initialize the parser, we construct a lexerGrammarParser which takes a CommonTokenStream, an ANTLR class, as input. The token stream is the output of the lexer, which is then received as input by the parser. Each token is then compared to the parsing rules defined in lexerGrammar (the naming convention was carried over from the lexer creation). The token type of each token is examined and the parser attempts to match it to one of the established rules. The simplest rules are defined as such:

id : IDENTIFIER ;

This rule defines that whenever the parser sees a token with the token type IDENTIFIER, the parser adds the "id" expression to the parse tree. Once we have this expression, we can start defining more complex expressions to add to the parse tree. The complex expressions are where we define the structures that a programmer can use, such as function definitions, if statements, and for loops. The expression that represents a function declaration looks like this:

func_decl : 'FUNCTION' any_type id param_decl_list 'BEGIN' func_body 'END' ;

Difficulties

With some of the more complex statements, we had to be careful that there were no inputs that could match two rules at once. For example, two if/then/else statements in a language could look like this:

if a then b
if a then b else c

This causes problems because when the parser reaches the "b" in the statement, the input could belong to two different expressions without any way to tell which rule to use. We got around this issue by structuring our if-statement definition as such:

if_stmt : 'IF' '(' cond ')' decl stmt_list else_part 'ENDIF' ;
else_part : 'ELSE' decl stmt_list | ;
cond : expr compop expr ;
compop : '<' | '>' | '=' | '!=' | '<=' | '>=' ;

Designing the grammar with as few intermediate rules as possible was a bit tricky, and sometimes impossible. In the case of identifier lists, it was necessary to have separate id_list and id_tail rules, even though id_tail should only ever appear in the context of an id_list. Furthermore, many rules had to be rewritten to remove left recursion, which would send the parser into infinite loops when trying to match the more complex rules. Sometimes this involved adding a λ (empty) transition; other times it took restructuring multiple rules or breaking a complex rule into several simpler rules.

Symbol Table

Background

The symbol table is the component of the compiler that keeps track of what the non-keyword names (symbols) encountered in source code refer to. The symbol table has to be aware of the type and scope of each symbol, so that name hiding can be supported and symbols go out of scope when appropriate. Within the LITTLE language, symbols can be of type string, float, or integer.

Methods

We implemented our SymbolTable class as a tree of tables. Each SymbolTable has an OrderedHashMap which associates a name with a Symbol object, a parent SymbolTable, an array of child symbol tables, and a scope name. The Listener is written by implementing the GrammarListener interface generated by ANTLR. When the listener enters a new scope, the implementation creates a new SymbolTable object, assigns the current SymbolTable object as its parent, and then sets it as the current SymbolTable object. Symbols created in the new scope are added to the current SymbolTable object, and any new scopes are created using the same process as before. When the Listener leaves the current scope, the current SymbolTable object is reassigned to its parent. This solution is quite elegant, and requires very little state to be managed in both the Listener and the symbol table. In fact, only two variables are required in the Listener class: the current symbol table, and the next block id. Lookups into a symbol table are done by recursively searching first the current SymbolTable object, and then its parent. This continues up the tree; if the symbol fails to be found even in the root symbol table, an error is returned.
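A minimal sketch of this tree-of-tables scheme is below. The names are illustrative, and the real SymbolTable stores a full Symbol object (type, name, value) rather than a bare type string:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

// Sketch of a scoped symbol table: each table keeps an insertion-ordered
// map of local symbols, a parent pointer, and a list of child scopes.
public class ScopeTable {
    private final LinkedHashMap<String, String> symbols = new LinkedHashMap<>();
    private final ScopeTable parent;
    private final List<ScopeTable> children = new ArrayList<>();
    private final String scopeName;

    public ScopeTable(String scopeName, ScopeTable parent) {
        this.scopeName = scopeName;
        this.parent = parent;
        if (parent != null) parent.children.add(this);
    }

    public void declare(String name, String type) {
        symbols.put(name, type);
    }

    // Search the current scope first, then walk up toward the root;
    // returns null if the symbol is not visible from this scope.
    public String lookup(String name) {
        if (symbols.containsKey(name)) return symbols.get(name);
        return parent != null ? parent.lookup(name) : null;
    }

    public static void main(String[] args) {
        ScopeTable global = new ScopeTable("GLOBAL", null);
        global.declare("x", "INT");
        ScopeTable block = new ScopeTable("BLOCK 1", global);
        block.declare("y", "FLOAT");
        block.declare("x", "FLOAT"); // name hiding: shadows the global x
        System.out.println(block.lookup("y"));  // FLOAT, found locally
        System.out.println(block.lookup("x"));  // FLOAT, local shadows global
        System.out.println(global.lookup("y")); // null, not visible in parent
    }
}
```

The child list is what lets the compiler revisit every scope later (for example, when the Tiny emitter traverses the tables to declare variables).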

Difficulties

Initially our Listener found symbol names by using hard-coded indices into the children array supplied by the listener context object. Thanks to auto-completion by IntelliJ, we discovered that the generated context object actually includes the names of children as they are described by the grammar, which makes the code much easier to read and resilient to grammar changes down the road. Variable name lists were somewhat difficult, but were managed by first parsing the initial variable name and type, and then iteratively adding the remaining variable names. Iteration terminates when the Id_tailContext object has no children. This is detected using the expression tail.getChildCount() != 0, which strangely is not identical to writing tail.isEmpty(), even if the tail is in fact empty. The only particularly difficult issue we ran into was handling 'if' statements. Our grammar requires that every 'if' statement be matched by an 'else' statement, even if the else branch is empty. This caused a bug where too many sub-scopes were created, but it was quickly solved by studying the tests and grammar, and adding an explicit check for whether an 'else' node is empty.

37 Semantic Routines

Background

After constructing a symbol table, the compiler must convert the constructs recognized in the source code into a language- and machine-independent format known as IR. As this is not the final phase of the compiler, the generated IR does not necessarily need to be optimal, though it does need to be correct. There are some non-obvious difficulties with this, such as generating unique temporary variables and labels in the correct order. Failure to do so could lead to code that functions, but without the effect the author intended.

Methods

The compiler makes use of two different stacks when generating IR from the syntax tree: an Instruction Stack and a Label Stack. When a node representing an assignment is encountered, the compiler pushes a blank instruction onto the top of the stack with the result assigned to the lvalue of the assignment and the operands and opcode undetermined. Operands for the instruction, two at most, are filled in a left-to-right manner, as this is the order in which the syntax tree listener sees the nodes. Symbol names and literals are simply filled into the operands for the instruction at the top of the stack, and if the instruction has been completed (i.e., the result, opcode, and both operands for the instruction have all been determined), the instruction is popped off the stack and inserted into the final array of instructions. The next instruction is then examined for completion, and the process continues until an incomplete instruction is encountered or the stack is empty.

The process for handling nodes that represent more complex expressions than symbols or literals is slightly more complicated. First, the instruction on the top of the stack is inspected to see if its opcode has been determined yet. If it has not, the opcode is simply assigned as whatever type of operation this node represents, and the listener moves on. If it has, the listener generates a new temporary variable, assigns that as the next operand for the top instruction, and pushes a new instruction onto the instruction stack with the temporary assigned as the result of the instruction, and the opcode assigned as whatever type of operation this node represents. This algorithm generates quite simple and optimal code for very complex expressions, as well as very simple ones. It does not generate intermediate STORE instructions, as was pervasive in the example generated code we were given; the only time a STORE is ever generated is when the expression given actually represents a STORE.
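The instruction-stack algorithm can be sketched as follows, driven by hand-written "listener events" rather than a real ANTLR parse tree. All class and method names here are illustrative, not the compiler's:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of the instruction-stack IR generation described above.
public class IRStackDemo {
    static class Instr {
        String result, op, op1, op2;
        Instr(String result) { this.result = result; }
        boolean complete() { return result != null && op != null && op1 != null && op2 != null; }
        public String toString() { return op + " " + op1 + " " + op2 + " " + result; }
    }

    private final Deque<Instr> stack = new ArrayDeque<>();
    private final List<String> emitted = new ArrayList<>();
    private int nextTemp = 1;

    // Assignment node: push a blank instruction whose result is the lvalue.
    public void enterAssign(String lvalue) { stack.push(new Instr(lvalue)); }

    // Complex expression node: either fill in the top instruction's opcode,
    // or spill into a new temporary and push a fresh instruction for it.
    public void enterExpr(String opcode) {
        Instr top = stack.peek();
        if (top.op == null) {
            top.op = opcode;
        } else {
            String temp = "$T" + nextTemp++;
            if (top.op1 == null) top.op1 = temp; else top.op2 = temp;
            Instr inner = new Instr(temp);
            inner.op = opcode;
            stack.push(inner);
        }
    }

    // Symbol or literal node: fill the next free operand slot, then pop
    // every instruction that has become complete off the stack.
    public void addOperand(String value) {
        Instr top = stack.peek();
        if (top.op1 == null) top.op1 = value; else top.op2 = value;
        while (!stack.isEmpty() && stack.peek().complete()) {
            emitted.add(stack.pop().toString());
        }
    }

    public List<String> result() { return emitted; }

    public static void main(String[] args) {
        IRStackDemo gen = new IRStackDemo();
        // a := b + c * d, visited in the listener's left-to-right order:
        gen.enterAssign("a");
        gen.enterExpr("ADDI");
        gen.addOperand("b");
        gen.enterExpr("MULTI");
        gen.addOperand("c");
        gen.addOperand("d");
        for (String line : gen.result()) System.out.println(line);
        // MULTI c d $T1
        // ADDI b $T1 a
    }
}
```

For `a := b + c * d`, the inner multiplication completes first and pops off into the temporary, after which the outer addition is found complete and pops as well, so the emitted order matches the evaluation order.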

Difficulties

Since our IR format requires that mathematical operations be distinguished by the data type they work with (i.e., the ADDI and ADDF opcodes are considered unique), the IR generator has to determine what data type it is working with when generating code, which may not always be immediately obvious. The simple solution we developed was to perform type determination for opcodes based on the first operand of the instruction (which, due to the order in which the listener traverses the syntax tree, will always have been determined beforehand). The type of the result operand of the instruction is determined when popping it off the stack, if it hasn't been already. This solution has the disadvantage that it may not work if the IR format is extended to support mixed-type instructions, and further considerations may have to be made if the grammar for the language is changed in the future. However, for our purposes this technique works flawlessly.
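Since LITTLE has no mixed-type expressions, the rule reduces to appending a type suffix chosen from the first operand. A trivial sketch (with illustrative names):

```java
// Sketch of opcode type determination: the typed opcode (ADDI vs ADDF,
// MULTI vs MULTF, etc.) is derived from the type of the first operand,
// which the listener has always resolved before the opcode is finalized.
public class OpcodeTyping {
    public enum Type { INT, FLOAT }

    // e.g. "ADD" + INT -> "ADDI", "MULT" + FLOAT -> "MULTF"
    public static String typedOpcode(String baseOp, Type firstOperand) {
        return baseOp + (firstOperand == Type.INT ? "I" : "F");
    }

    public static void main(String[] args) {
        System.out.println(typedOpcode("ADD", Type.INT));    // ADDI
        System.out.println(typedOpcode("MULT", Type.FLOAT)); // MULTF
    }
}
```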

Full Fledged Compiler

Background

While IR is a fantastic way to represent what the user was trying to accomplish with their code in a clear and language-independent manner, it cannot be run on an actual machine, and it has no understanding of register counts or other machine limitations. The solution is to use an assembly emitter, which consumes an array of IR and generates actual assembly instructions for manipulating registers and executing instructions in a machine-dependent manner. For this project, we implemented a backend which solely targets an architecture known as the Tiny VM, but due to the way the code is written it would be trivial to add additional architecture targets. The addition of this component completes the compiler, as it is now able to consume source code written in LITTLE and generate assembly instructions for Tiny, two languages which are completely unrelated.

Methods

The Tiny VM supports a generous number of registers, so conservative register allocation was not a concern for this backend. Tiny also supports assigning arbitrary names to allocated memory locations via the var and str instructions, which we made use of by doing a linear traversal of the symbol table (which was all that was required, since all test programs effectively consist of a single scope). To avoid collisions with any reserved Tiny names (such as register names), symbol names are slightly modified from their source representation. Anything not contained in the symbol table (temporaries) is allocated to registers as it is encountered.
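The declaration pass can be sketched as a linear walk over the symbol table that emits one Tiny `var` or `str` directive per symbol, prefixing names with "v_" to avoid colliding with register names such as `r0`. The structure below is illustrative, not the compiler's actual code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Tiny declaration pass: one directive per symbol, with
// string symbols carrying their literal value and everything else
// becoming a plain named memory location.
public class TinyDecls {
    public static List<String> emitDecls(Map<String, String> symbols) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : symbols.entrySet()) {
            if (e.getValue().startsWith("\"")) {
                // String symbols: "str" directive with the literal value.
                out.add("str v_" + e.getKey() + " " + e.getValue());
            } else {
                // INT/FLOAT symbols: a named memory location.
                out.add("var v_" + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> symbols = new LinkedHashMap<>();
        symbols.put("x", "INT");
        symbols.put("greeting", "\"hello\"");
        for (String line : emitDecls(symbols)) System.out.println(line);
        // var v_x
        // str v_greeting "hello"
    }
}
```

Using an insertion-ordered map keeps declarations in source order, mirroring the OrderedHashMap inside the real SymbolTable.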

Due to the linear nature in which IR is stored, traversing it in its entirety is simply a single for loop. The body of the loop consists of a large switch statement which handles each opcode supported by the IR format, and transforms the instruction as necessary.

Difficulties

Our IR format and Tiny have slightly different representations of mathematical operations as well as comparison-based branching, so special consideration was required for those instructions. The extent of it was that in some cases a reserved register known as the swap register was used to hold the intermediate results of multiple Tiny instructions where our IR format only required a single instruction. Without performing non-trivial analysis on the code, it would be difficult to find a more efficient solution to this problem. Since Tiny is fairly unique in its limitations and not highly utilized, we determined that there was not much benefit in developing one anyway. Other than that, the generated Tiny instructions effectively have a 1-to-1 correspondence with the IR.

Conclusion and Future Work

Future work could include optimization and extending the compiler to handle additional data types or structures such as booleans and arrays. Optimization would mostly revolve around peephole techniques that replace a set of slow instructions with faster ones and remove redundant code and stack instructions. These techniques would be implemented through a pattern-matching approach that examines a small "window" of code sliding through the intermediate representation. The optimizations could then be looped through numerous times before passing a final intermediate representation on to the code generator.

In conclusion, this project has allowed for the application of various aspects of computer science knowledge including theory, software engineering, version control, and collaboration. Given more time, it would have been interesting to write equivalent programs in LITTLE and C and compare the compilers to determine how well our LITTLE compiler performs. Furthermore, comparisons could have been made between each group's compiler within the class, because each implementation could be vastly different for the symbol table and semantic routines.
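As a concrete illustration of the sliding-window idea, the sketch below repeatedly scans the IR with a one-instruction window and removes no-op self-moves; a real peephole pass would use larger windows and a richer pattern set. The IR strings and the pattern here are hypothetical, not taken from the compiler's actual IR:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a peephole pass: slide a window over the IR, delete any
// instruction whose source and destination are identical (a no-op
// self-move), and repeat until no window matches.
public class Peephole {
    public static List<String> optimize(List<String> ir) {
        List<String> out = new ArrayList<>(ir);
        boolean changed = true;
        while (changed) { // loop until a full pass makes no change
            changed = false;
            for (int i = 0; i < out.size(); i++) {
                String[] p = out.get(i).split(" ");
                // Window of size one: "OP x x" moves x onto itself.
                if (p.length == 3 && p[1].equals(p[2])) {
                    out.remove(i);
                    changed = true;
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ir = new ArrayList<>(List.of(
            "STOREI 5 a",
            "STOREI a a",   // redundant self-move, removed
            "WRITEI a"));
        System.out.println(optimize(ir)); // [STOREI 5 a, WRITEI a]
    }
}
```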

Section 5: UML

The following figure contains a combined UML diagram for the entire compiler. Classes which were generated by or inherited from the ANTLR library have been omitted for clarity and due to their generic nature.

Section 6: Design Trade-offs

While our IR format represents most instructions as three-address codes, the Tiny VM architecture only supports two-address codes, so the IR is not directly translatable to Tiny code. The solution we developed to work around this was to have a single reserved swap register that acts as both an operand and the result for each two-address-code instruction. The register is initialized with the first operand of the IR instruction, and then used as the second operand of the Tiny instruction. Afterwards, the value in the swap register is moved into the result operand of the IR instruction. This is not terribly performant, but without performing a non-trivial amount of analysis on the code, it is the best solution we could create.

The Tiny architecture also supports a great number of registers, which is unlike most others. However, registers may only be named r0 - rn, which makes the generated code rather difficult to read. In order to support manual validation of the generated code, we made use of Tiny's var instruction, which allows for the creation of named memory addresses. The names given to these are based on the actual variable names used in the original source code, with the exception that they are prefixed with "v_", to prevent the unlikely scenario where a variable was given the same name as a Tiny register, which would have resulted in invalid code being generated. The result is that the generated Tiny code is fairly easy to read, though it is not technically as performant as it could be.
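The lowering described above can be sketched directly from the Tiny forms used elsewhere in this report (`move`, `addi`, and the reserved swap register `r0`); the class and method names are illustrative:

```java
import java.util.List;

// Sketch of the swap-register lowering: one three-address IR instruction
// ("ADDI op1 op2 result") becomes three two-address Tiny instructions
// routed through the reserved swap register r0.
public class SwapLowering {
    private static final String SWAP_REG = "r0";

    public static List<String> lower(String tinyOp, String op1, String op2, String result) {
        return List.of(
            "move " + op1 + " " + SWAP_REG,      // initialize swap with first operand
            tinyOp + " " + op2 + " " + SWAP_REG, // swap register accumulates the result
            "move " + SWAP_REG + " " + result);  // store the result to its destination
    }

    public static void main(String[] args) {
        // IR "ADDI a b c" lowered to Tiny:
        for (String line : lower("addi", "v_a", "v_b", "v_c")) System.out.println(line);
        // move v_a r0
        // addi v_b r0
        // move r0 v_c
    }
}
```

The three emitted lines correspond one-to-one with the gen_store_in_swap, gen_2ac, and gen_load_from_swap helpers shown in the TinyEmitter listing.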

Section 7: Software Development Lifecycle

For this project, we used the iterative development cycle. For each step of the compiler, we did one iteration of the cycle. Each cycle started with examining the requirements provided in the documentation for the project. With the requirements in mind, we moved on to the planning stage. Usually, this consisted of meeting in person and developing a plan for how we were going to structure our solution. Once we had a plan that would fulfill the requirements, we started development using the methods discussed in Section 2. When the solution was implemented, we continued to the testing phase. In the testing phase, we compared outputs visually, and then finally checked our outputs against the grading script, which utilizes the diff command. After the testing was finished, we addressed any issues discovered during the phase, and deployed the code (turned it in).

This cycle was very beneficial to us. Before we even opened our IDE, we had a solid plan for each step. This meant that we never had to spend too much time on rework, because our initial structure covered all of the normal cases. We only ever had to address special cases and syntactical errors in the debugging stage. Using the iterative development cycle meant that our meetings were streamlined and effective at finishing the work required.

Section 8: References

antlr/antlr4 documentation. Retrieved January 2017, from https://github.com/antlr/antlr4/blob/master/doc/index.md
