
Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory

COMPILING A LANGUAGE

DCC 888 Dealing with Programming Languages

• LLVM gives developers many tools to interpret or compile a language:
  – The intermediate representation
  – Lots of analyses and optimizations
• We can work on a language that already exists, e.g., C, C++, Java, etc.
• We can design our own language.

When is it worth designing a new language?

We need a front end to convert programs in the source language to LLVM IR.

[Figure: compilation pipeline with machine-independent optimizations, such as constant propagation, and machine-dependent optimizations, such as register allocation]

The Simple Calculator

• To illustrate this capacity of LLVM, let's design a very simple programming language:
  – A program is a function application
  – A function contains only one argument x
  – Only the integer type exists
  – The function body contains only additions, multiplications, references to x, and integer constants in Polish notation.
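To make the semantics of the language concrete, here is a minimal sketch of a direct interpreter for it, written without LLVM. The names `evalPrefix` and `evalProgram` are hypothetical and not part of the compiler built in these slides; the sketch also assumes that tokens are separated by whitespace.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>

// Evaluates one prefix (Polish notation) expression read from `in`,
// with the single variable x bound to `x`.
// Hypothetical helper: assumes whitespace-separated tokens.
int evalPrefix(std::istringstream &in, int x) {
  std::string tk;
  in >> tk;
  assert(!tk.empty() && "expression ended too early");
  if (tk == "x") return x;
  if (tk == "+") {
    int a = evalPrefix(in, x);  // first operand
    int b = evalPrefix(in, x);  // second operand
    return a + b;
  }
  if (tk == "*") {
    int a = evalPrefix(in, x);
    int b = evalPrefix(in, x);
    return a * b;
  }
  return std::atoi(tk.c_str());  // anything else is taken as an integer constant
}

// Applies the function whose body is `program` to the argument `x`.
int evalProgram(const std::string &program, int x) {
  std::istringstream in(program);
  return evalPrefix(in, x);
}
```

For instance, `evalProgram("* x + x 2", 4)` multiplies 4 by the sum 4 + 2.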

1) Can you understand why we got each of these values?

2) What is the grammar of our language?

The Architecture of Our Compiler

[Figure: class diagram relating Lexer, Parser, Driver, Expr, VarExpr, NumExpr, MulExpr, and AddExpr]

1) Can you guess the meaning of the different arrows?

2) Can you guess the role of each class?

3) What would be a good execution mode for our system?

The Execution Engine

Our execution engine parses the expression, converts it to a function written in LLVM IR, JIT compiles this function, and runs it with the argument passed to the program on the command line.

    $> ./driver 4
    * x x
    Result: 16

    $> ./driver 4
    + x * x 2
    Result: 12

    $> ./driver 4
    * x + x 2
    Result: 24

    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %addtmp = add i32 %x, 2
      %multmp = mul i32 %x, %addtmp
      ret i32 %multmp
    }

Let's start with our lexer. Which kind of tokens do we have?

The Lexer (Lexer.h)

• A lexer is a program that divides a string of characters into tokens.
  – A token is a terminal in our grammar, e.g., a symbol that is part of the alphabet of our language.
  – Lexers can be easily implemented as finite automata.

    #ifndef LEXER_H
    #define LEXER_H
    #include <cstdio>   // getchar
    #include <string>

    class Lexer {
    public:
      std::string getToken();
      Lexer() : lastChar(' ') {}
    private:
      char lastChar;
      // Returns the current character and reads the next one from stdin.
      inline char getNextChar() {
        char c = lastChar;
        lastChar = getchar();
        return c;
      }
    };
    #endif

1) Again: which kind of tokens do we have?

2) Can you guess the implementation of the getToken() method?

Implementation of the Lexer (Lexer.cpp)
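The claim that lexers are finite automata can be made explicit. The sketch below is a hypothetical, string-based variant of the lexer (the names `State` and `tokenize` are assumptions, not code from these slides): it has three states and recognizes the same token classes as getToken(), but works on a std::string instead of standard input, which makes it easy to test.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// States of the automaton: between tokens, inside an identifier, inside a number.
enum class State { Start, Ident, Number };

// Splits `input` into identifier, number, and one-character operator tokens.
std::vector<std::string> tokenize(const std::string &input) {
  std::vector<std::string> tokens;
  std::string current;
  State state = State::Start;
  for (std::size_t i = 0; i <= input.size(); ++i) {
    // A virtual trailing space flushes the last token.
    unsigned char c =
        i < input.size() ? static_cast<unsigned char>(input[i]) : ' ';
    // Leave Ident or Number when the character no longer belongs to the token.
    if ((state == State::Ident && !std::isalnum(c)) ||
        (state == State::Number && !std::isdigit(c))) {
      tokens.push_back(current);
      current.clear();
      state = State::Start;
    }
    if (state == State::Start) {
      if (std::isspace(c)) continue;
      if (std::isalpha(c)) {
        state = State::Ident;
        current += static_cast<char>(c);
      } else if (std::isdigit(c)) {
        state = State::Number;
        current += static_cast<char>(c);
      } else {
        tokens.push_back(std::string(1, static_cast<char>(c)));  // operator
      }
    } else {
      current += static_cast<char>(c);  // extend the identifier or number
    }
  }
  assert(current.empty());  // the virtual space has flushed everything
  return tokens;
}
```

Calling `tokenize("* x + x 2")` yields the five tokens `*`, `x`, `+`, `x`, `2`.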

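    #include "Lexer.h"
    #include <cctype>   // isspace, isalpha, isalnum, isdigit
    #include <cstdio>   // getchar, EOF

    std::string Lexer::getToken() {
      // Skip whitespace.
      while (isspace(lastChar)) {
        lastChar = getchar();
      }
      if (isalpha(lastChar)) {
        // Identifier: a letter followed by letters or digits.
        std::string idStr;
        do {
          idStr += getNextChar();
        } while (isalnum(lastChar));
        return idStr;
      } else if (isdigit(lastChar)) {
        // Number: a sequence of digits.
        std::string numStr;
        do {
          numStr += getNextChar();
        } while (isdigit(lastChar));
        return numStr;
      } else if (lastChar == EOF) {
        // End of input. lastChar is a char, so this test relies on char being signed.
        return "";
      } else {
        // Any other character is a one-character operator token.
        std::string operatorStr;
        operatorStr = getNextChar();
        return operatorStr;
      }
    }

1) Would you be able to represent this lexer as a state machine?

2) We must now define the parser. How can we implement it?

Parsing (Parser.h)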

• Parsing is the act of transforming a string of tokens into a syntax tree ♤.

    #ifndef PARSER_H
    #define PARSER_H

    // Forward declarations: the header only uses pointers to these classes.
    class Expr;
    class Lexer;

    class Parser {
    public:
      Parser(Lexer* argLexer) : lexer(argLexer) {}
      Expr* parseExpr();
    private:
      Lexer* lexer;
    };

    #endif

1) What are these forward declarations good for?

2) Do you understand this syntax?

3) What does the parser return?

♤: it used to be one of the important problems in computer science.

Syntax Trees

• The parser produces syntax trees.

    * x x          + x * x 2          * x + x 2

      *                +                  *
     / \              / \                / \
    x   x            x   *              x   +
                        / \                / \
                       x   2              x   2

How can we implement these trees in C++?

The Nodes of the Tree (Expr.h)

    #ifndef AST_H
    #define AST_H

    #include "llvm/IR/IRBuilder.h"

    class Expr {
    public:
      virtual ~Expr() {}
      virtual llvm::Value *gen(llvm::IRBuilder<> *builder,
                               llvm::LLVMContext& con) const = 0;
    };

    class NumExpr : public Expr {
    public:
      NumExpr(int argNum) : num(argNum) {}
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext& con) const;
      static const unsigned int SIZE_INT = 32;
    private:
      const int num;
    };

    class VarExpr : public Expr {
    public:
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext& con) const;
      static llvm::Value* varValue;
    };

    class AddExpr : public Expr {
    public:
      AddExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext& con) const;
    private:
      const Expr* op1;
      const Expr* op2;
    };

    class MulExpr : public Expr {
    public:
      MulExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext& con) const;
    private:
      const Expr* op1;
      const Expr* op2;
    };

    #endif

There is a gen method that is a bit weird. We shall look into it later.

Going Back into the Parser

• Our parser will build a syntax tree.

    + x * x 2   ->   Parser   ->   new AddExpr(
                                     new VarExpr(),
                                     new MulExpr(
                                       new VarExpr(),
                                       new NumExpr(2)))

        +
       / \
      x   *
         / \
        x   2
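The tree for `+ x * x 2` can also be built by hand in a few lines of LLVM-free C++. The sketch below uses hypothetical node types (`Node`, `Num`, `Var`, `BinOp`) that only mirror the shape of the Expr hierarchy; instead of generating IR, each node prints itself in infix form, which shows where the parentheses implied by the prefix notation go.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins for Expr, NumExpr, VarExpr, and AddExpr/MulExpr.
struct Node {
  virtual ~Node() {}
  virtual std::string infix() const = 0;
};

struct Num : Node {
  int value;
  explicit Num(int v) : value(v) {}
  std::string infix() const override { return std::to_string(value); }
};

struct Var : Node {
  std::string infix() const override { return "x"; }
};

struct BinOp : Node {
  char op;  // '+' or '*'
  std::unique_ptr<Node> lhs, rhs;
  BinOp(char o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
      : op(o), lhs(std::move(l)), rhs(std::move(r)) {
    assert(lhs && rhs);  // a binary node always has two children
  }
  std::string infix() const override {
    return "(" + lhs->infix() + " " + op + " " + rhs->infix() + ")";
  }
};

// Builds the tree for the program "+ x * x 2".
std::unique_ptr<Node> buildExample() {
  return std::make_unique<BinOp>(
      '+', std::make_unique<Var>(),
      std::make_unique<BinOp>('*', std::make_unique<Var>(),
                              std::make_unique<Num>(2)));
}
```

Owning the children through std::unique_ptr is a design choice of this sketch; the slides use raw pointers.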

The Polish notation really simplifies parsing. We already have the tree, and without parentheses!

So, how can we implement our parser?

[Photo: Jan Łukasiewicz, father of the Polish notation]

The Parser's Implementation (Parser.cpp)
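Before the LLVM-based version, the recursive structure of the parser can be sketched without any library support. The code below is an illustration under stated assumptions, not the parser of these slides: `Tree`, `parse`, `parseProgram`, and `countNodes` are hypothetical names, tokens are assumed to be whitespace-separated, and, unlike parseExpr(), the sketch rejects truncated input instead of building an operator node with a missing operand.

```cpp
#include <cassert>
#include <cctype>
#include <memory>
#include <sstream>
#include <string>

// A generic tree node: operators have two children, leaves have none.
struct Tree {
  std::string label;                  // "+", "*", "x", or a number
  std::unique_ptr<Tree> left, right;  // both null for leaves
};

// Parses one prefix expression; returns nullptr on malformed or truncated input.
std::unique_ptr<Tree> parse(std::istringstream &in) {
  std::string tk;
  if (!(in >> tk)) return nullptr;  // no more tokens
  assert(!tk.empty());
  auto node = std::make_unique<Tree>();
  node->label = tk;
  if (tk == "+" || tk == "*") {
    // An operator is followed by exactly two sub-expressions.
    node->left = parse(in);
    node->right = parse(in);
    if (!node->left || !node->right) return nullptr;
    return node;
  }
  if (tk == "x") return node;
  for (char c : tk)  // otherwise the token must be an integer constant
    if (!std::isdigit(static_cast<unsigned char>(c))) return nullptr;
  return node;
}

std::unique_ptr<Tree> parseProgram(const std::string &program) {
  std::istringstream in(program);
  return parse(in);
}

// Number of nodes in the tree rooted at `t`.
std::size_t countNodes(const Tree *t) {
  return t ? 1 + countNodes(t->left.get()) + countNodes(t->right.get()) : 0;
}
```

Because every operator takes exactly two operands, the first token always decides which rule applies, so no lookahead or backtracking is needed.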

    #include "Expr.h"
    #include "Lexer.h"
    #include "Parser.h"
    #include <cctype>    // isdigit
    #include <cstdlib>   // atoi
    #include <string>

    Expr* Parser::parseExpr() {
      std::string tk = lexer->getToken();
      if (tk == "") {
        return NULL;
      } else if (isdigit(tk[0])) {
        return new NumExpr(atoi(tk.c_str()));
      } else if (tk[0] == 'x') {
        return new VarExpr();
      } else if (tk[0] == '+') {
        Expr *op1 = parseExpr();
        Expr *op2 = parseExpr();
        return new AddExpr(op1, op2);
      } else if (tk[0] == '*') {
        Expr *op1 = parseExpr();
        Expr *op2 = parseExpr();
        return new MulExpr(op1, op2);
      } else {
        return NULL;
      }
    }

1) Why is checking the first character of each token already enough to avoid any ambiguity?

2) Now we need a way to translate trees into LLVM IR. How to do it?

The Translator (Expr.cpp)

Our implementation has a small hack: our language has only one variable, which we have decided to call 'x'. This variable must be represented by an LLVM value, which is the argument of the function that we will create. Thus, we need a way to inform the translator of this value. We do it through a static variable varValue. That is the only static variable that we are using in this class.

    #include "Expr.h"

    llvm::Value* VarExpr::varValue = NULL;

    llvm::Value* NumExpr::gen
    (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      return llvm::ConstantInt::get
        (llvm::Type::getInt32Ty(context), num);
    }

    llvm::Value* VarExpr::gen
    (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* var = VarExpr::varValue;
      return var ? var : NULL;
    }

    llvm::Value* AddExpr::gen
    (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateAdd(v1, v2, "addtmp");
    }

    llvm::Value* MulExpr::gen
    (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateMul(v1, v2, "multmp");
    }

The Driver's Skeleton (Driver.cpp)

The procedure that creates an LLVM function is not that complicated. Can you guess its implementation?

    int main(int argc, char** argv) {
      if (argc != 2) {
        llvm::errs() << "Inform an argument to your expression.\n";
        return 1;
      } else {
        llvm::LLVMContext context;
        llvm::Module *module = new llvm::Module("Example", context);
        llvm::Function *function = createEntryFunction(module, context);
        module->dump();
        llvm::ExecutionEngine* engine = createEngine(module);
        JIT(engine, function, atoi(argv[1]));
      }
    }

Creating an LLVM Function (Driver.cpp)

This code is not "that" complicated, but it is not super straightforward either, so we will go a bit carefully over it.

    llvm::Function *createEntryFunction(
        llvm::Module *module,
        llvm::LLVMContext &context) {
      llvm::Function *function =
        llvm::cast<llvm::Function>(
          module->getOrInsertFunction("fun",
            llvm::Type::getInt32Ty(context),
            llvm::Type::getInt32Ty(context),
            (llvm::Type *)0)
        );
      llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
      llvm::IRBuilder<> builder(context);
      builder.SetInsertPoint(bb);
      llvm::Argument *argX = function->arg_begin();
      argX->setName("x");
      VarExpr::varValue = argX;
      Lexer lexer;
      Parser parser(&lexer);
      Expr* expr = parser.parseExpr();
      llvm::Value* retVal = expr->gen(&builder, context);
      builder.CreateRet(retVal);
      return function;
    }

Let's start with the humongous call to getOrInsertFunction. What do you think it is doing?

Here we are creating a function called "fun" that returns an integer, and receives an integer as a parameter. This call has a variable number of arguments, and so we use a sentinel, e.g., NULL, to indicate the end of the list of arguments.

And in the lines that create the basic block and the IRBuilder, what are we doing?

Creating the Body of the Function (Driver.cpp)

This code creates a basic block, where we will insert instructions. We are attaching this block to an IRBuilder. This object is an LLVM helper to create new instructions.

1) Before we move on, do you remember what a basic block is?

2) And the code sequence that names argX and stores it in VarExpr::varValue, what is it doing? That is a consequence of our hack...

Going Back to the Hack

Again: our hack is a way to return the evaluation of a variable. Our language only has one variable, and its value never changes. This variable is the argument of the function that we are creating. We set its value upon creating this argument.

Expr.h:

    class VarExpr : public Expr {
    public:
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext& con) const;
      static llvm::Value* varValue;
    };

Expr.cpp:

    llvm::Value* VarExpr::varValue = NULL;

    llvm::Value* VarExpr::gen
    (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* var = VarExpr::varValue;
      return var ? var : NULL;
    }

Driver.cpp:

    llvm::Argument *argX = function->arg_begin();
    argX->setName("x");
    VarExpr::varValue = argX;

A Few Final Remarks on Function Creation (Driver.cpp)

Two parts of createEntryFunction remain to be discussed:

1) Easy one: what are we doing in the lines that create the Lexer and the Parser and call parseExpr()?

2) And what are we doing in the code snippet that calls gen() and CreateRet(), and returns the function?

Now, the JIT (Driver.cpp)

What do you think the method createEngine, called from main, is doing?

Now, we need a way to execute programs. We can interpret these programs, using lli, a tool that comes in the LLVM distro. If a JIT compiler is available for your architecture (usually it is), then we can JIT compile the code, as we will show next.

Creating an Engine to Execute Programs (Driver.cpp)

• An engine is the name we give to a program that is in charge of executing other programs, e.g., the JavaScript engine in the Firefox browser, the C# engine in .NET, etc.

    llvm::ExecutionEngine* createEngine(llvm::Module *module) {
      llvm::InitializeNativeTarget();
      std::string errStr;
      llvm::ExecutionEngine *engine =
        llvm::EngineBuilder(module)
          .setErrorStr(&errStr)
          .setEngineKind(llvm::EngineKind::JIT)
          .create();
      if (!engine) {
        llvm::errs() << "Failed to construct ExecutionEngine: " << errStr << "\n";
      } else if (llvm::verifyModule(*module)) {
        llvm::errs() << "Error constructing function!\n";
      }
      return engine;
    }

This is the sequence of method calls necessary to create a JIT engine. This engine can, later, receive a function, and execute it.

Invoking the JIT (Driver.cpp)

Invoking the engine over a function is very easy. We just need a bit of setup to pass arguments to this function. After the JIT is done executing the function, we have the function's return value, which we can use as we wish.

    void JIT(llvm::ExecutionEngine* engine, llvm::Function* function, int arg) {
      std::vector<llvm::GenericValue> Args(1);
      Args[0].IntVal = llvm::APInt(32, arg);
      llvm::GenericValue retVal = engine->runFunction(function, Args);
      llvm::outs() << "Result: " << retVal.IntVal << "\n";
    }

Can you identify the code that sets the arguments up, and the code that gets the return value back?

Compiling Everything

• We can compile these programs using the LLVM standard Makefile.
• In fact, LLVM comes with a folder, "examples", which we can use to build our application:

    ~$ Programs/llvm/examples/DCC888/
    ~/Programs/llvm/examples/DCC888$
    llvm[0]: Compiling Driver.cpp for Debug+Asserts build
    llvm[0]: Compiling Expr.cpp for Debug+Asserts build
    llvm[0]: Compiling Lexer.cpp for Debug+Asserts build
    llvm[0]: Compiling Parser.cpp for Debug+Asserts build
    llvm[0]: Linking Debug+Asserts driver
    ld warning: ...
    llvm[0]: ======Finished Linking Debug+Asserts Executable driver

    ~/Programs/llvm/examples/DCC888$ cd ../../Debug+Asserts/examples/

    ~/Programs/llvm/Debug+Asserts/examples$ ./driver 4
    * x 3
    Result: 12

Using the standard Makefile makes it easy to link our code with all the LLVM libraries.

Quick Look in our Makefile (Makefile)

    LEVEL = ../..
    TOOLNAME = driver
    EXAMPLE_TOOL = 1

    # Link in JIT support
    LINK_COMPONENTS := jit interpreter nativecodegen

    include $(LEVEL)/Makefile.common

We can specify the name of the executable that we shall be creating, and we can point out which libraries will be necessary to compile our program.

Running

Example 1:

    $> ./driver 4
    * 3 + x * 5 + x 1
    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %addtmp = add i32 %x, 1
      %multmp = mul i32 5, %addtmp
      %addtmp1 = add i32 %x, %multmp
      %multmp2 = mul i32 3, %addtmp1
      ret i32 %multmp2
    }

    Result: 87

Example 2:

    * x + * x 4 + * x 3 + x + * x x * 3 x
    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %multmp = mul i32 %x, 4
      %multmp1 = mul i32 %x, 3
      %multmp2 = mul i32 %x, %x
      %multmp3 = mul i32 3, %x
      %addtmp = add i32 %multmp2, %multmp3
      %addtmp4 = add i32 %x, %addtmp
      %addtmp5 = add i32 %multmp1, %addtmp4
      %addtmp6 = add i32 %multmp, %addtmp5
      %multmp7 = mul i32 %x, %addtmp6
      ret i32 %multmp7
    }

    Result: 240

Can you draw these two syntax trees?

Optimizing the Programs

• One of the strengths of LLVM is that it comes with many optimizations, which we can apply to its intermediate representation.

As an example, if our input program has only constants, LLVM folds all of them into a single value:

    ./driver 4
    + 3 * 4 + 5 6

    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      ret i32 47
    }

    Result: 47

1) How do you think this optimization works?

2) Where do you think this optimization is implemented?

3) And what about this program below: will LLVM optimize it?

    ./driver 4
    + * x 3 * x 3

The Need for Global Optimizations
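The difference between folding constants while instructions are built and removing redundancy across instructions can be sketched without LLVM. The toy builder below is an assumption-laden illustration, not LLVM's IRBuilder: `Val`, `ToyBuilder`, `Lowered`, and the `%t0`, `%t1` naming scheme are all hypothetical. Its `create` method only inspects its two operands, so it folds `+ 3 * 4 + 5 6` completely but still emits three instructions for `+ * x 3 * x 3`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A value is either a known constant or the name of a run-time result.
struct Val {
  bool isConst;
  int constant;
  std::string name;
};

// Toy instruction builder (hypothetical; it only imitates local folding).
struct ToyBuilder {
  std::vector<std::string> code;  // emitted instructions, in order

  Val constant(int c) const { return Val{true, c, ""}; }
  Val arg() const { return Val{false, 0, "%x"}; }

  static std::string text(const Val &v) {
    return v.isConst ? std::to_string(v.constant) : v.name;
  }

  Val create(char op, const Val &a, const Val &b) {
    assert(op == '+' || op == '*');
    // Local folding: decided by looking at the two operands only.
    if (a.isConst && b.isConst)
      return constant(op == '+' ? a.constant + b.constant
                                : a.constant * b.constant);
    std::string name = "%t" + std::to_string(code.size());
    code.push_back(name + " = " + (op == '+' ? "add " : "mul ") + text(a) +
                   ", " + text(b));
    return Val{false, 0, name};
  }
};

struct Lowered {
  Val result;
  std::vector<std::string> code;
};

// Lowers "+ 3 * 4 + 5 6": every operand is a constant.
Lowered lowerConstantsOnly() {
  ToyBuilder b;
  Val inner = b.create('+', b.constant(5), b.constant(6));
  Val mul = b.create('*', b.constant(4), inner);
  Val v = b.create('+', b.constant(3), mul);
  return Lowered{v, b.code};
}

// Lowers "+ * x 3 * x 3": the two multiplications are identical.
Lowered lowerRedundant() {
  ToyBuilder b;
  Val l = b.create('*', b.arg(), b.constant(3));
  Val r = b.create('*', b.arg(), b.constant(3));
  Val v = b.create('+', l, r);
  return Lowered{v, b.code};
}
```

Catching the second case requires remembering which computations were already emitted, which is the kind of information a pass over the whole function has and a builder looking at one instruction does not.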

Constant folding is implemented by the IRBuilder class. This is a local optimization. In other words, this optimization can only look into the parameters of the instruction that will be constructed. Naturally, this is not enough to catch, for instance, the redundancy between the two occurrences of "* x 3" in our example.

    + * x 3 * x 3

    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %multmp = mul i32 %x, 3
      %multmp1 = mul i32 %x, 3
      %addtmp = add i32 %multmp, %multmp1
      ret i32 %addtmp
    }

    llvm::Value* AddExpr::gen
    (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateAdd(v1, v2, "addtmp");
    }

1) Which compiler optimizations do you know?

2) How could we optimize the program above?

3) Which optimizations do you think the compiler could use to optimize this program?

The LLVM Tool Belt (Driver.cpp)

    void optimizeFunction(
        llvm::ExecutionEngine* engine,
        llvm::Module *module,
        llvm::Function* function) {
      llvm::FunctionPassManager passManager(module);
      passManager.add(new llvm::DataLayout(*engine->getDataLayout()));
      passManager.add(llvm::createInstructionCombiningPass());
      passManager.add(llvm::createReassociatePass());
      passManager.add(llvm::createGVNPass());
      passManager.add(llvm::createCFGSimplificationPass());
      passManager.doInitialization();
      passManager.run(*function);
    }

1) Can you guess what each of these optimizations will do?

2) How do we use this new method?

Better not to forget:

    #include "llvm/Analysis/Passes.h"
    #include "llvm/PassManager.h"
    #include "llvm/IR/DataLayout.h"
    #include "llvm/Transforms/Scalar.h"

The New Driver (Driver.cpp)

    int main(int argc, char** argv) {
      if (argc != 2) {
        llvm::errs() << "Inform an argument to your expression.\n";
        return 1;
      } else {
        llvm::LLVMContext context;
        llvm::Module *module = new llvm::Module("Example", context);
        llvm::Function *function = createEntryFunction(module, context);
        llvm::errs() << "Module before optimizations:\n";
        module->dump();
        llvm::errs() << "Module after optimizations:\n";
        llvm::ExecutionEngine* engine = createEngine(module);
        optimizeFunction(engine, module, function);
        module->dump();
        JIT(engine, function, atoi(argv[1]));
      }
    }

Just for fun, we are printing the function before and after we run the optimizations.

The Optimizations in Action

    + * x 3 * x 3

    Module before optimizations:
    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %multmp = mul i32 %x, 3
      %multmp1 = mul i32 %x, 3
      %addtmp = add i32 %multmp, %multmp1
      ret i32 %addtmp
    }

    Module after optimizations:
    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %addtmp = mul i32 %x, 6
      ret i32 %addtmp
    }

    Result: 24

The optimized program has only one arithmetic instruction, whereas the original program had three such operations.

Different programming languages may require different kinds of optimizations. Can you think about optimizations that are specific to particular languages?

Final Remarks

• LLVM gives developers several tools to build their programming languages:
  – Nice intermediate representation
  – Several optimizations
  – Several back-ends

[Figure: diagram with labels garbled in extraction]