Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
COMPILING A LANGUAGE
DCC 888 Dealing with Programming Languages
• LLVM gives developers many tools to interpret or compile a language:
  – The intermediate representation
  – Lots of analyses and optimizations
• We can work on a language that already exists, e.g., C, C++, Java, etc.
• We can design our own language.
When is it worth designing a new language?
[Figure: compilation pipeline, annotated as follows; the tool and file names are not recoverable from this copy.]
  – We need a front end to convert programs in the source language to LLVM IR.
  – Machine-independent optimizations, such as constant propagation.
  – Machine-dependent optimizations, such as register allocation.
The Simple Calculator
• To illustrate this capacity of LLVM, let's design a very simple programming language:
  – A program is a function application
  – A function contains only one argument x
  – Only the integer type exists
  – The function body contains only additions, multiplications, references to x, and integer constants in Polish notation.
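To make the meaning of such programs concrete, here is a minimal sketch that evaluates a prefix expression directly, without LLVM. The grammar in the comment, the whitespace-separated tokens, and the names `evalTokens` and `evalPrefix` are assumptions made for illustration; they are not part of the course code.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <stdexcept>
#include <string>

// Assumed grammar (not stated in the slides):
//   E ::= '+' E E | '*' E E | 'x' | integer
// Tokens are assumed to be separated by whitespace.
int evalTokens(std::istringstream &in, int x) {
  std::string tk;
  if (!(in >> tk)) throw std::runtime_error("unexpected end of input");
  assert(!tk.empty());  // operator>> never yields an empty token on success
  if (tk == "+" || tk == "*") {
    int a = evalTokens(in, x);  // first operand
    int b = evalTokens(in, x);  // second operand
    return tk == "+" ? a + b : a * b;
  }
  if (tk == "x") return x;  // the only variable of the language
  if (std::isdigit(static_cast<unsigned char>(tk[0]))) return std::stoi(tk);
  throw std::runtime_error("unknown token: " + tk);
}

// Hypothetical helper: evaluates `program` with the variable x bound to `x`.
int evalPrefix(const std::string &program, int x) {
  std::istringstream in(program);
  return evalTokens(in, x);
}
```

With x bound to 4, `evalPrefix("* x x", 4)` yields 16, `evalPrefix("+ x * x 2", 4)` yields 12, and `evalPrefix("* x + x 2", 4)` yields 24. Each operator simply consumes the next two complete subexpressions, which is why no parentheses are needed.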
1) Can you understand why we got each of these values?
2) What does the grammar of our language look like?
The Architecture of Our Compiler
[Figure: class diagram with Lexer, Parser, Driver, Expr, VarExpr, MulExpr, NumExpr, AddExpr, and the LLVM libs; the arrows are not recoverable from this copy.]
1) Can you guess the meaning of the different arrows?
2) Can you guess the role of each class?
3) What would be a good execution mode for our system?
The Execution Engine
Our execution engine parses the expression, converts it to a function written in LLVM IR, JIT compiles this function, and runs it with the argument passed to the program on the command line.

  $> ./driver 4
  * x x
  Result: 16

  $> ./driver 4
  + x * x 2
  Result: 12

  $> ./driver 4
  * x + x 2
  Result: 24

  ; ModuleID = 'Example'
  define i32 @fun(i32 %x) {
  entry:
    %addtmp = add i32 %x, 2
    %multmp = mul i32 %x, %addtmp
    ret i32 %multmp
  }

Let's start with our lexer. Which tokens do we have?
Lexer.h
The Lexer
• A lexer is a program that divides a string of characters into tokens.
  – A token is a terminal in our grammar, e.g., a symbol that is part of the alphabet of our language.

#ifndef LEXER_H
#define LEXER_H
#include
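The tokenizing rule can be tried in isolation with a minimal sketch that scans a std::string instead of reading stdin. The class `StringLexer` and the helper `tokenize` are hypothetical names introduced here; they follow the same three token classes as the course lexer (identifiers, integers, one-character operators) but are not the course's Lexer class.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the slide's Lexer: it scans a string that is
// already in memory rather than calling getchar().
class StringLexer {
 public:
  explicit StringLexer(std::string text) : text_(std::move(text)) {}

  // Returns the next token, or "" at the end of the input.
  std::string getToken() {
    while (pos_ < text_.size() && is(std::isspace, text_[pos_])) ++pos_;
    if (pos_ == text_.size()) return "";
    const std::size_t start = pos_;
    if (is(std::isalpha, text_[pos_])) {
      // Identifier: a letter followed by letters or digits.
      while (pos_ < text_.size() && is(std::isalnum, text_[pos_])) ++pos_;
    } else if (is(std::isdigit, text_[pos_])) {
      // Integer constant: a run of digits.
      while (pos_ < text_.size() && is(std::isdigit, text_[pos_])) ++pos_;
    } else {
      ++pos_;  // any other character is a one-character operator token
    }
    assert(pos_ > start && pos_ <= text_.size());
    return text_.substr(start, pos_ - start);
  }

 private:
  // Wraps the <cctype> predicates so that char is converted safely.
  static bool is(int (*pred)(int), char c) {
    return pred(static_cast<unsigned char>(c)) != 0;
  }
  std::string text_;
  std::size_t pos_ = 0;
};

// Hypothetical helper: collects every token of `text` into a vector.
std::vector<std::string> tokenize(const std::string &text) {
  StringLexer lexer(text);
  std::vector<std::string> tokens;
  for (std::string tk = lexer.getToken(); !tk.empty(); tk = lexer.getToken())
    tokens.push_back(tk);
  return tokens;
}
```

For example, `tokenize("*x+x 42")` splits the input into `*`, `x`, `+`, `x`, `42`. Note that under this rule `x2` is a single identifier token, so a variable followed by a number needs a separating space.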
#include "Lexer.h"

std::string Lexer::getToken() {
  while (isspace(lastChar)) {
    lastChar = getchar();
  }
  if (isalpha(lastChar)) {
    std::string idStr;
    do {
      idStr += getNextChar();
    } while (isalnum(lastChar));
    return idStr;
  } else if (isdigit(lastChar)) {
    std::string numStr;
    do {
      numStr += getNextChar();
    } while (isdigit(lastChar));
    return numStr;
  } else if (lastChar == EOF) {
    return "";
  } else {
    std::string operatorStr;
    operatorStr = getNextChar();
    return operatorStr;
  }
}

1) Would you be able to represent this lexer as a state machine?
2) We must now define the parser. How can we implement it?
Parser.cpp
Parsing
• Parsing is the act of transforming a string of tokens into a syntax tree♤.
#ifndef PARSER_H
#define PARSER_H
#include
#endif

1) What are these forward declarations good for?

♤: it used to be one of the most important problems in computer science.
Syntax Trees
• The parser produces syntax trees.
* x x          + x * x 2          * x + x 2

  *                +                  *
 / \              / \                / \
x   x            x   *              x   +
                    / \                / \
                   x   2              x   2
How can we implement these trees in C++?
Expr.h
The Nodes of the Tree

#ifndef AST_H
#define AST_H

#include "llvm/IR/IRBuilder.h"

class Expr {
 public:
  virtual ~Expr() {}
  virtual llvm::Value *gen(llvm::IRBuilder<> *builder,
                           llvm::LLVMContext& con) const = 0;
};

class NumExpr : public Expr {
 public:
  NumExpr(int argNum) : num(argNum) {}
  llvm::Value *gen(llvm::IRBuilder<> *builder,
                   llvm::LLVMContext& con) const;
  static const unsigned int SIZE_INT = 32;
 private:
  const int num;
};

class VarExpr : public Expr {
 public:
  llvm::Value *gen(llvm::IRBuilder<> *builder,
                   llvm::LLVMContext& con) const;
  static llvm::Value* varValue;
};

class AddExpr : public Expr {
 public:
  AddExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
  llvm::Value *gen(llvm::IRBuilder<> *builder,
                   llvm::LLVMContext& con) const;
 private:
  const Expr* op1;
  const Expr* op2;
};

class MulExpr : public Expr {
 public:
  MulExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
  llvm::Value *gen(llvm::IRBuilder<> *builder,
                   llvm::LLVMContext& con) const;
 private:
  const Expr* op1;
  const Expr* op2;
};

#endif

There is a gen method that is a bit weird. We shall look into it later.
Going Back into the Parser
• Our parser will build a syntax tree.
+ x * x 2

Parser

new AddExpr(
    new VarExpr(),
    new MulExpr(
        new VarExpr(),
        new NumExpr(2)
    )
)

  +
 / \
x   *
   / \
  x   2

The Polish notation really simplifies parsing. We already have the tree, and without parentheses!
So, how can we implement our parser?
[Photo: Jan Łukasiewicz, father of the Polish notation]
Parser.cpp
The Parser's Implementation
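Before the LLVM-based version, the recursive structure of the parser can be sketched without any LLVM dependency. The `Node` struct and the functions `parseNode`, `toInfix`, and `parseToInfix` are hypothetical names introduced for this illustration, and tokens are assumed to be separated by whitespace. Printing the tree back in fully parenthesized infix form is only a way to make its shape visible.

```cpp
#include <cassert>
#include <cctype>
#include <memory>
#include <sstream>
#include <string>

// Hypothetical tree node: operators have two children, leaves have none.
struct Node {
  std::string label;  // "+", "*", "x", or an integer literal
  std::unique_ptr<Node> left;
  std::unique_ptr<Node> right;
};

// Parses one prefix expression; returns null on malformed input.
// Tokens after a complete expression are left unread.
std::unique_ptr<Node> parseNode(std::istringstream &in) {
  std::string tk;
  if (!(in >> tk)) return nullptr;  // ran out of tokens
  std::unique_ptr<Node> node(new Node());
  node->label = tk;
  if (tk == "+" || tk == "*") {
    // An operator is followed by exactly two subexpressions.
    node->left = parseNode(in);
    node->right = parseNode(in);
    if (!node->left || !node->right) return nullptr;
    return node;
  }
  if (tk == "x" || std::isdigit(static_cast<unsigned char>(tk[0])))
    return node;  // leaf: the variable or an integer constant
  return nullptr;  // unknown token
}

// Renders the tree in fully parenthesized infix form.
std::string toInfix(const Node &n) {
  if (!n.left) return n.label;
  assert(n.right);  // operators always carry two children
  return "(" + toInfix(*n.left) + " " + n.label + " " + toInfix(*n.right) + ")";
}

// Hypothetical helper: parses `program` and returns its infix rendering.
std::string parseToInfix(const std::string &program) {
  std::istringstream in(program);
  std::unique_ptr<Node> root = parseNode(in);
  return root ? toInfix(*root) : "<error>";
}
```

For instance, `parseToInfix("+ x * x 2")` returns `(x + (x * 2))`, which is the tree drawn on the previous slide. An incomplete program such as `+ x` yields `<error>` in this sketch.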
#include <cctype>   // isdigit
#include <cstdlib>  // atoi
#include "Expr.h"
#include "Lexer.h"
#include "Parser.h"

Expr* Parser::parseExpr() {
  std::string tk = lexer->getToken();
  if (tk == "") {
    return NULL;
  } else if (isdigit(tk[0])) {
    return new NumExpr(atoi(tk.c_str()));
  } else if (tk[0] == 'x') {
    return new VarExpr();
  } else if (tk[0] == '+') {
    Expr *op1 = parseExpr();
    Expr *op2 = parseExpr();
    return new AddExpr(op1, op2);
  } else if (tk[0] == '*') {
    Expr *op1 = parseExpr();
    Expr *op2 = parseExpr();
    return new MulExpr(op1, op2);
  } else {
    return NULL;
  }
}

1) Why is checking the first character of each token already enough to avoid any ambiguity?
2) Now we need a way to translate trees into LLVM IR. How can we do it?

Expr.cpp
The Translator

Our implementation has a small hack: our language has only one variable, which we have decided to call 'x'. This variable must be represented by an LLVM value, which is the argument of the function that we will create. Thus, we need a way to inform the translator of this value. We do it through a static variable varValue. That is the only static variable that we are using in this class.

#include "Expr.h"

llvm::Value* VarExpr::varValue = NULL;

llvm::Value* NumExpr::gen(llvm::IRBuilder<> *builder,
                          llvm::LLVMContext &context) const {
  return llvm::ConstantInt::get(llvm::Type::getInt32Ty(context), num);
}

llvm::Value* VarExpr::gen(llvm::IRBuilder<> *builder,
                          llvm::LLVMContext &context) const {
  llvm::Value* var = VarExpr::varValue;
  return var ? var : NULL;
}

llvm::Value* AddExpr::gen(llvm::IRBuilder<> *builder,
                          llvm::LLVMContext &context) const {
  llvm::Value* v1 = op1->gen(builder, context);
  llvm::Value* v2 = op2->gen(builder, context);
  return builder->CreateAdd(v1, v2, "addtmp");
}

llvm::Value* MulExpr::gen(llvm::IRBuilder<> *builder,
                          llvm::LLVMContext &context) const {
  llvm::Value* v1 = op1->gen(builder, context);
  llvm::Value* v2 = op2->gen(builder, context);
  return builder->CreateMul(v1, v2, "multmp");
}

Driver.cpp
The Driver's Skeleton

int main(int argc, char** argv) {
  if (argc != 2) {
    llvm::errs() << "Inform an argument to your expression.\n";
    return 1;
  } else {
    llvm::LLVMContext context;
    llvm::Module *module = new llvm::Module("Example", context);
    llvm::Function *function = createEntryFunction(module, context);
    module->dump();
    llvm::ExecutionEngine* engine = createEngine(module);
    JIT(engine, function, atoi(argv[1]));
  }
}

The procedure that creates an LLVM function is not that complicated. Can you guess its implementation?
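The driver prints the module before running it. To get a feeling for what that listing contains without linking against LLVM, here is a rough sketch that walks a tree and emits IR-like text. It deliberately does not use llvm::IRBuilder: the classes `TextBuilder`, `TExpr`, `TNum`, `TVar`, `TBin` and the `make*` helpers are all hypothetical, and the way repeated names receive a numeric suffix is an assumption that only approximates what LLVM does. The real IRBuilder may also fold operations on constants, which this sketch never does.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Collects instruction lines and hands out unique temporary names.
class TextBuilder {
 public:
  // Appends "%name = op i32 lhs, rhs" and returns "%name".
  std::string emitBinary(const std::string &op, const std::string &base,
                         const std::string &lhs, const std::string &rhs) {
    const int n = uses_[base]++;  // how often this base name was used so far
    const std::string suffix = (n == 0) ? std::string() : std::to_string(n);
    const std::string name = "%" + base + suffix;
    body_ << "  " << name << " = " << op << " i32 " << lhs << ", " << rhs << "\n";
    return name;
  }
  std::string body() const { return body_.str(); }

 private:
  std::map<std::string, int> uses_;
  std::ostringstream body_;
};

struct TExpr {
  virtual ~TExpr() {}
  // Returns the operand text (a constant or a %name) holding this value.
  virtual std::string gen(TextBuilder &b) const = 0;
};

struct TNum : TExpr {
  int num;
  explicit TNum(int n) : num(n) {}
  std::string gen(TextBuilder &) const override { return std::to_string(num); }
};

struct TVar : TExpr {
  // The single variable is the function argument %x.
  std::string gen(TextBuilder &) const override { return "%x"; }
};

struct TBin : TExpr {
  std::string op, base;
  std::unique_ptr<TExpr> lhs, rhs;
  TBin(std::string o, std::string bs, std::unique_ptr<TExpr> l,
       std::unique_ptr<TExpr> r)
      : op(std::move(o)), base(std::move(bs)), lhs(std::move(l)), rhs(std::move(r)) {
    assert(lhs && rhs);
  }
  std::string gen(TextBuilder &b) const override {
    const std::string v1 = lhs->gen(b);  // operands first, as in AddExpr::gen
    const std::string v2 = rhs->gen(b);
    return b.emitBinary(op, base, v1, v2);
  }
};

typedef std::unique_ptr<TExpr> TPtr;
TPtr makeNum(int n) { return TPtr(new TNum(n)); }
TPtr makeVar() { return TPtr(new TVar()); }
TPtr makeAdd(TPtr l, TPtr r) { return TPtr(new TBin("add", "addtmp", std::move(l), std::move(r))); }
TPtr makeMul(TPtr l, TPtr r) { return TPtr(new TBin("mul", "multmp", std::move(l), std::move(r))); }

// Wraps the generated body in a function named fun with one i32 argument.
std::string emitFunction(const TExpr &body) {
  TextBuilder b;
  const std::string result = body.gen(b);
  return "define i32 @fun(i32 %x) {\nentry:\n" + b.body() + "  ret i32 " + result + "\n}\n";
}
```

Calling `emitFunction(*makeMul(makeVar(), makeAdd(makeVar(), makeNum(2))))` builds the tree for `* x + x 2` by hand and returns text with the same three instructions as the listing shown for the execution engine: the add, then the mul, then the ret.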
Driver.cpp
Creating an LLVM Function

This code is not "that" complicated, but it is not super straightforward either, so we will go a bit more carefully over it.

llvm::Function *createEntryFunction(
    llvm::Module *module,
    llvm::LLVMContext &context) {
  llvm::Function *function = llvm::cast