Compiling a Language
Universidade Federal de Minas Gerais – Department of Computer Science – Programming Languages Laboratory
COMPILING A LANGUAGE
DCC 888

Dealing with Programming Languages
• LLVM gives developers many tools to interpret or compile a language:
  – The intermediate representation
  – Lots of analyses and optimizations
• We can work on a language that already exists, e.g., C, C++, Java, etc.
• We can design our own language. When is it worth designing a new language?
[Figure: compilation pipeline, annotated as follows]
  – We need a front end to convert programs in the source language to LLVM IR.
  – Machine-independent optimizations, such as constant propagation.
  – Machine-dependent optimizations, such as register allocation.

The Simple Calculator
• To illustrate this capacity of LLVM, let's design a very simple programming language:
  – A program is a function application
  – A function contains only one argument x
  – Only the integer type exists
  – The function body contains only additions, multiplications, references to x, and integer constants in Polish notation
1) Can you understand why we got each of these values?
2) What is the grammar of our language?

The Architecture of Our Compiler
[Figure: class diagram with Lexer, Parser, Driver, Expr, and the Expr subclasses VarExpr, NumExpr, MulExpr, and AddExpr]
1) Can you guess the meaning of the different arrows?
2) Can you guess the role of each class?
3) What would be a good execution mode for our system?

The Execution Engine
Our execution engine parses the expression, converts it to a function written in LLVM IR, JIT compiles this function, and runs it with the argument passed to the program on the command line.

    $> ./driver 4
    * x x
    Result: 16

    $> ./driver 4
    + x * x 2
    Result: 12

    $> ./driver 4
    * x + x 2
    Result: 24

LLVM IR generated for * x + x 2:

    ; ModuleID = 'Example'
    define i32 @fun(i32 %x) {
    entry:
      %addtmp = add i32 %x, 2
      %multmp = mul i32 %x, %addtmp
      ret i32 %multmp
    }

Let's start with our lexer. Which tokens do we have?

The Lexer
File: Lexer.h
• A lexer is a program that divides a string of characters into tokens.
  – A token is a terminal in our grammar, e.g., a symbol that is part of the alphabet of our language.
  – Lexers can be easily implemented as finite automata.
1) Again: which kind of tokens do we have?
2) Can you guess the implementation of the getToken() method?

    #ifndef LEXER_H
    #define LEXER_H

    #include <cstdio>
    #include <string>

    class Lexer {
     public:
      std::string getToken();
      Lexer() : lastChar(' ') {}
     private:
      // Stored as int, not char, so that EOF stays distinguishable
      // from every valid character returned by getchar().
      int lastChar;
      // Returns the current character and reads the next one from stdin.
      inline char getNextChar() {
        char c = static_cast<char>(lastChar);
        lastChar = getchar();
        return c;
      }
    };

    #endif

Implementation of the Lexer
File: Lexer.cpp

    #include <cctype>
    #include <cstdio>
    #include "Lexer.h"

    std::string Lexer::getToken() {
      // Skip the whitespace that separates tokens.
      while (isspace(lastChar)) {
        lastChar = getchar();
      }
      if (isalpha(lastChar)) {
        // Identifier: a letter followed by letters or digits.
        std::string idStr;
        do {
          idStr += getNextChar();
        } while (isalnum(lastChar));
        return idStr;
      } else if (isdigit(lastChar)) {
        // Integer constant: a run of digits.
        std::string numStr;
        do {
          numStr += getNextChar();
        } while (isdigit(lastChar));
        return numStr;
      } else if (lastChar == EOF) {
        // End of input is reported as the empty token.
        return "";
      } else {
        // Anything else is a one-character operator.
        std::string operatorStr;
        operatorStr = getNextChar();
        return operatorStr;
      }
    }

1) Would you be able to represent this lexer as a state machine?
2) We must now define the parser. How can we implement it?

Parsing
File: Parser.h
• Parsing is the act of transforming a string of tokens into a syntax tree.♤
1) What are these forward declarations good for?
2) Do you understand this syntax?
3) What does the parser return?

    #ifndef PARSER_H
    #define PARSER_H

    #include <string>

    class Expr;
    class Lexer;

    class Parser {
     public:
      Parser(Lexer* argLexer) : lexer(argLexer) {}
      Expr* parseExpr();
     private:
      Lexer* lexer;
    };

    #endif

♤: It used to be one of the most important problems in computer science.

Syntax Trees
• The parser produces syntax trees.

    * x x        + x * x 2        * x + x 2

      *              +                *
     / \            / \              / \
    x   x          x   *            x   +
                      / \              / \
                     x   2            x   2

How can we implement these trees in C++? The node classes are declared in Expr.h.
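Before looking at the LLVM-based node classes, the tree shape itself can be tried without LLVM. The following is a minimal sketch, not the lecture's code: the names Node, Num, Var, Add, Mul, eval, buildExample, and checkExample are hypothetical, and eval(x) interprets the tree directly, whereas the lecture's classes generate LLVM IR instead.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical, LLVM-free node classes. eval(x) computes the value of the
// subtree when the single variable of the language is bound to x.
struct Node {
  virtual ~Node() = default;
  virtual int eval(int x) const = 0;
};

// Integer constant.
struct Num : Node {
  explicit Num(int v) : value(v) {}
  int eval(int) const override { return value; }
  int value;
};

// Reference to the only variable, x.
struct Var : Node {
  int eval(int x) const override { return x; }
};

// Addition of two subtrees.
struct Add : Node {
  Add(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  int eval(int x) const override { return lhs->eval(x) + rhs->eval(x); }
  std::unique_ptr<Node> lhs, rhs;
};

// Multiplication of two subtrees.
struct Mul : Node {
  Mul(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  int eval(int x) const override { return lhs->eval(x) * rhs->eval(x); }
  std::unique_ptr<Node> lhs, rhs;
};

// Builds, by hand, the tree for the expression "+ x * x 2".
std::unique_ptr<Node> buildExample() {
  return std::make_unique<Add>(
      std::make_unique<Var>(),
      std::make_unique<Mul>(std::make_unique<Var>(),
                            std::make_unique<Num>(2)));
}

// With x = 4 the tree computes 4 + 4 * 2.
void checkExample() { assert(buildExample()->eval(4) == 12); }
```

Calling buildExample()->eval(4) gives 12, the same result the driver reports for + x * x 2 with argument 4. The sketch uses std::unique_ptr for ownership, which is a design choice of this example and requires C++14 for std::make_unique.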
The Nodes of the Tree

    #ifndef AST_H
    #define AST_H

    #include "llvm/IR/IRBuilder.h"

    // Base class of every node in the syntax tree.
    class Expr {
     public:
      virtual ~Expr() {}
      virtual llvm::Value *gen(llvm::IRBuilder<> *builder,
                               llvm::LLVMContext &con) const = 0;
    };

    // Integer constant.
    class NumExpr : public Expr {
     public:
      NumExpr(int argNum) : num(argNum) {}
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext &con) const;
      static const unsigned int SIZE_INT = 32;
     private:
      const int num;
    };

    // Reference to the variable x.
    class VarExpr : public Expr {
     public:
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext &con) const;
      static llvm::Value* varValue;
    };

    // Addition of two subexpressions.
    class AddExpr : public Expr {
     public:
      AddExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext &con) const;
     private:
      const Expr* op1;
      const Expr* op2;
    };

    // Multiplication of two subexpressions.
    class MulExpr : public Expr {
     public:
      MulExpr(Expr* op1Arg, Expr* op2Arg) : op1(op1Arg), op2(op2Arg) {}
      llvm::Value *gen(llvm::IRBuilder<> *builder,
                       llvm::LLVMContext &con) const;
     private:
      const Expr* op1;
      const Expr* op2;
    };

    #endif

There is a gen method that is a bit weird. We shall look into it later.

Going Back into the Parser
• Our parser will build a syntax tree.

    + x * x 2

Parser:

    new AddExpr(
        new VarExpr(),
        new MulExpr(
            new VarExpr(),
            new NumExpr(2)
        )
    )

Resulting tree:

      +
     / \
    x   *
       / \
      x   2

The Polish notation really simplifies parsing. We already have the tree, and without parentheses!
[Photo: Jan Łukasiewicz, father of the Polish notation]
So, how can we implement our parser? The implementation is in Parser.cpp.
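Because every operator in Polish notation is followed by exactly two operands, a single recursive function is enough to consume a whole expression. The following is a minimal sketch under stated assumptions, not the lecture's parser: tokens are assumed to be separated by whitespace, closures stand in for JIT-compiled code, and the names Compiled, compile, compileString, and checkSlideExamples are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <functional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for compiled code: a closure from the argument x
// to the value of the expression.
using Compiled = std::function<int(int)>;

// Reads one expression in Polish notation from `in` and returns a closure
// that evaluates it. Assumes whitespace-separated tokens.
Compiled compile(std::istream &in) {
  std::string tk;
  if (!(in >> tk)) {
    throw std::runtime_error("unexpected end of input");
  }
  // As in the lecture's grammar, the first character identifies the token.
  if (std::isdigit(static_cast<unsigned char>(tk[0]))) {
    int n = std::atoi(tk.c_str());
    return [n](int) { return n; };
  }
  if (tk == "x") {
    return [](int x) { return x; };
  }
  if (tk == "+" || tk == "*") {
    // An operator is followed by its two operands, parsed left to right.
    Compiled lhs = compile(in);
    Compiled rhs = compile(in);
    if (tk == "+") {
      return [lhs, rhs](int x) { return lhs(x) + rhs(x); };
    }
    return [lhs, rhs](int x) { return lhs(x) * rhs(x); };
  }
  throw std::runtime_error("unknown token: " + tk);
}

// Convenience wrapper: compiles an expression held in a string.
Compiled compileString(const std::string &source) {
  std::istringstream in(source);
  return compile(in);
}

// The three expressions from the Execution Engine slide, run with x = 4.
void checkSlideExamples() {
  assert(compileString("* x x")(4) == 16);
  assert(compileString("+ x * x 2")(4) == 12);
  assert(compileString("* x + x 2")(4) == 24);
}
```

The closure returned by compileString can be called many times with different arguments, which mirrors the split between compiling the function once and running it with the command-line argument. Unlike the lecture's lexer, this sketch would not split a string such as "+x*x2" into tokens correctly, since it relies on whitespace.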
The Parser's Implementation

    #include <cctype>
    #include <cstdlib>
    #include "Expr.h"
    #include "Lexer.h"
    #include "Parser.h"

    Expr* Parser::parseExpr() {
      std::string tk = lexer->getToken();
      if (tk == "") {
        // End of input.
        return NULL;
      } else if (isdigit(tk[0])) {
        return new NumExpr(atoi(tk.c_str()));
      } else if (tk[0] == 'x') {
        return new VarExpr();
      } else if (tk[0] == '+') {
        // An operator is followed by its two operands.
        Expr *op1 = parseExpr();
        Expr *op2 = parseExpr();
        return new AddExpr(op1, op2);
      } else if (tk[0] == '*') {
        Expr *op1 = parseExpr();
        Expr *op2 = parseExpr();
        return new MulExpr(op1, op2);
      } else {
        // Unknown token.
        return NULL;
      }
    }

1) Why is checking the first character of each token already enough to avoid any ambiguity?
2) Now we need a way to translate trees into LLVM IR. How to do it?

The Translator
File: Expr.cpp
Our implementation has a small hack: our language has only one variable, which we have decided to call 'x'. This variable must be represented by an LLVM value, which is the argument of the function that we will create. Thus, we need a way to inform the translator of this value. We do it through a static variable varValue. That is the only static variable that we are using in this class.

    #include "Expr.h"

    llvm::Value* VarExpr::varValue = NULL;

    llvm::Value* NumExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      return llvm::ConstantInt::get
        (llvm::Type::getInt32Ty(context), num);
    }

    llvm::Value* VarExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* var = VarExpr::varValue;
      return var ? var : NULL;
    }

    llvm::Value* AddExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateAdd(v1, v2, "addtmp");
    }

    llvm::Value* MulExpr::gen
      (llvm::IRBuilder<> *builder, llvm::LLVMContext &context) const {
      llvm::Value* v1 = op1->gen(builder, context);
      llvm::Value* v2 = op2->gen(builder, context);
      return builder->CreateMul(v1, v2, "multmp");
    }

The Driver's Skeleton
File: Driver.cpp

    int main(int argc, char** argv) {
      if (argc != 2) {
        llvm::errs() << "Inform an argument to your expression.\n";
        return 1;
      } else {
        llvm::LLVMContext context;
        llvm::Module *module = new llvm::Module("Example", context);
        llvm::Function *function = createEntryFunction(module, context);
        module->dump();
        llvm::ExecutionEngine* engine = createEngine(module);
        JIT(engine, function, atoi(argv[1]));
      }
    }

The procedure that creates an LLVM function is not that complicated. Can you guess its implementation?

Creating an LLVM Function
File: Driver.cpp
This code is not "that" complicated, but it is not super straightforward either, so we will go over it a bit more carefully.

    llvm::Function *createEntryFunction(
        llvm::Module *module,
        llvm::LLVMContext &context) {
      llvm::Function *function =
        llvm::cast<llvm::Function>(
          module->getOrInsertFunction("fun",
            llvm::Type::getInt32Ty(context),
            llvm::Type::getInt32Ty(context),
            (llvm::Type *)0)
        );
      llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
      llvm::IRBuilder<> builder(context);
      builder.SetInsertPoint(bb);
      llvm::Argument *argX = function->arg_begin();
      argX->setName("x");
      VarExpr::varValue = argX;
      Lexer lexer;
      Parser parser(&lexer);
      Expr* expr = parser.parseExpr();
      llvm::Value* retVal = expr->gen(&builder, context);
      builder.CreateRet(retVal);
      return function;
    }

Let's start with this humongous call, the one to getOrInsertFunction. What do you think it is doing?
Creating an LLVM Function

    llvm::Function *createEntryFunction(
        llvm::Module *module,
        llvm::LLVMContext &context) {
      llvm::Function *function =
        llvm::cast<llvm::Function>(
          module->getOrInsertFunction("fun",
            llvm::Type::getInt32Ty(context),
            llvm::Type::getInt32Ty(context),
            (llvm::Type *)0)
        );
      llvm::BasicBlock *bb = llvm::BasicBlock::Create(context, "entry", function);
      llvm::IRBuilder<> builder(context);
      builder.SetInsertPoint(bb);
      llvm::Argument *argX = function->arg_begin();
      argX->setName("x");
      VarExpr::varValue = argX;
      Lexer lexer;
      Parser parser(&lexer);
      Expr* expr = parser.parseExpr();
      llvm::Value* retVal = expr->gen(&builder, context);

And here, in the lines that create bb and builder, what are we doing?
Here we are creating a function called "fun" that returns an integer, and receives an integer as a parameter. The call to getOrInsertFunction takes a variable number of arguments, and so we use a sentinel, e.g., NULL, to indicate the end of