CS308 Compiler Principles

Introduction

Li Jiang, Department of Computer Science and Engineering, Shanghai Jiao Tong University

1 Why study compiling?

• Importance:
 – Programs written in high-level languages have to be translated into binary code before executing
 – Reduce execution overhead of the programs
 – Make high-performance computer architectures effective on users' programs
• Influence:
 – Language design
 – Computer architecture (the influence is bi-directional)
• Techniques used influence other areas:
 – Text editors, information retrieval systems, and pattern recognition programs
 – Query processing systems such as SQL
 – Equation solvers
 – Natural language processing
 – Debugging and finding security holes in code
 – …

2 Compiler Principles Compiler Concept • A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.

source program → COMPILER → target program

(The source program is normally written in a high-level programming language; the target program is normally the equivalent program in machine code or a relocatable object file, which can be called to process input and provide output. The compiler also reports error messages.)

3 Compiler Principles Interpreter • An interpreter directly executes the operations specified in the source program on inputs supplied by the user.

source program + input → INTERPRETER → output (and error messages)

4 Compiler Principles Programming Languages • Compiled languages: – Fortran, Pascal, C, C++, C#, Delphi, Visual Basic, …

• Interpreted languages: – BASIC, Perl, PHP, Ruby, TCL, MATLAB,…

• Jointly compiled and interpreted languages – Java, Python, …

5 Compiler Principles Compiler vs. Interpreter*

• Preprocessing (e.g., for debugging)
 – Compilers do extensive preprocessing
 – Interpreters run programs "as is", with little or no preprocessing
• Efficiency
 – The target program produced by a compiler is usually much faster than interpreting the source code

6 Compiler Principles Compiler Structure

Source Language → Front End (language specific) → Intermediate Language → Back End (machine specific) → Target Language

Analysis (front end) | Symbol Table | Synthesis (back end)

• Separation of Concerns • Retargeting

7 Compiler Principles Two Main Phases
• Analysis Phase: breaks up a source program into constituent pieces and produces an internal representation of it called intermediate code.
 – Reports whether the program is syntactically ill-formed or semantically unsound
 – Collects useful information and passes it to the synthesis phase
• Synthesis Phase: translates the intermediate code into the target program.

8 Compiler Principles Phases of Compilation*

• Compilers work in a sequence of phases. • Each phase transforms the source program from one representation into another. • All phases use the symbol table to store information about the entire source program.

Source Language → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → (Intermediate Language) → Code Optimizer → Code Generator → Target Language    [Analysis | Symbol Table | Synthesis]

9 Compiler Principles A Model of A Compiler Front End

• Lexical analyzer reads the source program character by character and returns the tokens of the source program. • Parser creates the tree-like syntactic structure of the given program. • Intermediate-code generator translates the syntax tree into three-address code.

10 Compiler Principles Lexical Analysis

11 Compiler Principles Lexical Analysis • Lexical Analyzer reads the source program character by character and returns the tokens of the source program. • A token describes a pattern of characters having the same meaning in the source program. (such as identifiers, operators, keywords, numbers, delimiters, and so on)

12 Compiler Principles White Space Removal • No blanks, tabs, newlines, or comments appear in the grammar

What is the variable line used for?

Skipping white space (a sketch follows)
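The slide's code is not reproduced here; what follows is a minimal sketch (class and member names are illustrative, not the lecture's actual code) of skipping white space, showing why a line variable is kept: it counts newlines so that later error messages can report a line number.

class WhitespaceSkipper {
    final String src;           // source text being scanned
    int pos = 0;                // current position in src
    int line = 1;               // current source line, used in error messages

    WhitespaceSkipper(String src) { this.src = src; }

    // Advance pos past blanks, tabs, and newlines before the next token.
    void skipWhitespace() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == ' ' || c == '\t') pos++;
            else if (c == '\n') { line++; pos++; }   // count lines for diagnostics
            else break;                              // a real token starts here
        }
    }
}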

13 Compiler Principles Constants • When a sequence of digits appears in the input stream, the lexical analyzer passes to the parser a token consisting of the terminal num along with an integer-valued attribute computed from the digits.
 31+28+59 → <num, 31> <+> <num, 28> <+> <num, 59>
• How to get these attributes? Accumulate the value while scanning the digits (see the sketch below).
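A minimal sketch (a hypothetical helper, not the slide's code) of how the integer-valued attribute of a <num> token is computed: the value is accumulated digit by digit as v = 10*v + digit.

class NumberScanner {
    // Reads a maximal digit sequence starting at pos and returns its integer value,
    // which becomes the attribute of the token <num, value>. Error handling omitted.
    static int scanNumber(String src, int pos) {
        int v = 0;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            v = 10 * v + (src.charAt(pos) - '0');   // v = v*10 + next digit
            pos++;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(scanNumber("31+28+59", 0));   // prints 31
    }
}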

14 Compiler Principles Keywords and Identifiers* Keywords: Fixed character strings used as punctuation marks or to identify constructs.

Identifiers: A character string forms an identifier only if it is not a keyword. How to differentiate them? (Commonly, by looking the string up in a table of reserved keywords first.)

15 Compiler Principles Lexical Analysis Cont’d • Puts information about identifiers into the symbol table.

• Regular expressions are used to describe tokens (lexical constructs). – [a-z]*[A-Z]*[0-9]*

• A (Deterministic) Finite State Automaton can be used in the implementation of a lexical analyzer.
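A small sketch of how a DFA-style scanner can recognize an identifier of the form letter (letter | digit)* and, at the same time, separate keywords from identifiers by consulting a keyword table (the class name, keyword set, and token spelling below are illustrative).

import java.util.Set;

class WordScanner {
    static final Set<String> KEYWORDS = Set.of("if", "else", "while");   // assumed keyword set

    // Two-state automaton: state 0 expects a letter, state 1 accepts letters/digits.
    static String scanWord(String src, int pos) {
        int start = pos;
        if (pos < src.length() && Character.isLetter(src.charAt(pos))) pos++;   // state 0 -> state 1
        else return null;                                                        // not a word
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;  // stay in state 1
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        String w = scanWord("while (x1 > 0)", 0);
        System.out.println(KEYWORDS.contains(w) ? "<keyword, " + w + ">" : "<id, " + w + ">");
    }
}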

16 Compiler Principles Symbol Table

17 Compiler Principles Symbol Table
• Symbol Tables are data structures that are used by compilers to hold information about the source-program constructs.
• For each identifier, there is an entry in the symbol table containing its information.
• Symbol tables need to support multiple declarations of the same identifier
 – One symbol table per scope (of declaration), e.g.
   { int x; char y; { bool y; x; y; } x; y; }
 – Most-closely-nested rule: inside the inner block y refers to bool y, while x still refers to the outer int x
 – Outer symbol table: x → int, y → char; inner symbol table: y → bool
 – Implementation: a stack of hash tables (a sketch follows)
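A sketch of the stack-of-tables idea: one table per scope, each with a pointer to the table of the enclosing scope, so a lookup follows the most-closely-nested rule (class and method names are illustrative).

import java.util.HashMap;
import java.util.Map;

class Env {
    private final Map<String, String> table = new HashMap<>();  // identifier -> type
    private final Env prev;                                      // enclosing scope, or null

    Env(Env prev) { this.prev = prev; }

    void put(String name, String type) { table.put(name, type); }

    // Search this scope first, then the enclosing scopes.
    String get(String name) {
        for (Env e = this; e != null; e = e.prev) {
            String t = e.table.get(name);
            if (t != null) return t;
        }
        return null;
    }

    public static void main(String[] args) {
        Env outer = new Env(null);
        outer.put("x", "int"); outer.put("y", "char");
        Env inner = new Env(outer);
        inner.put("y", "bool");
        System.out.println(inner.get("x") + " " + inner.get("y"));  // int bool
        System.out.println(outer.get("y"));                         // char
    }
}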

18 Compiler Principles Symbol Table

• A Symbol Table is a data structure containing a record for each variable name, with fields for the attributes of the name.

position → id1 & attributes
initial  → id2 & attributes
rate     → id3 & attributes

19 Compiler Principles Parsing
Lexical analysis understands the words; parsing understands the sentence.

• A Syntax/Semantic Analyzer (Parser) creates the syntactic structure (generally a parse tree) of the given program.

• Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar.

20 Compiler Principles Syntax Analysis • A Syntax Analyzer/Parser creates the syntactic structure (generally a parse tree) of the given program. • A parse tree describes a syntactic structure. • Each interior node represents an operation

• The children of the node represent the arguments of the operation

Understand the sentence: the order of operations

21 Compiler Principles Syntax (CFG) • The syntax of a language is specified by a context free grammar (CFG). • The rules in a CFG are mostly recursive. • A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not. – If it satisfies, the syntax analyzer creates a parse tree for the given program.

• Ex: We use BNF (Backus-Naur Form) to specify a CFG

assgstmt   -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression

22 Compiler Principles Syntax Definition • Context-Free Grammar (CFG) is used to specify the syntax of a formal language (for example a programming language like C, Java) • Grammar describes the structure (usually hierarchical) of programming languages.

– Example: in Java an if statement should fit the form
  if ( expression ) statement else statement
– Production ("→" reads as "can be / can have the form"):
  statement → if ( expression ) statement else statement

– Note the recursive nature of statement.

23 Compiler Principles Definition of CFG • Four components: – A set of terminal symbols (name of tokens): elementary symbols of the language defined by the grammar – A set of non-terminals (syntactic variables): represent the set of strings of terminals – A set of productions: non-terminal à a sequence of terminals and/or non-terminals – A designation of one of the non-terminals as the start symbol.

24 Compiler Principles A Grammar Example List of digits separated by plus or minus signs

• Accepts strings such as 9-5+2, 3-1, or 7. • 0, 1, …, 9, +, - are the terminal symbols • list and digit are non-terminals • Every "line" is a production • list is the start symbol • Grouping: list → list + digit | list - digit | digit

25 Compiler Principles Derivations* • Given a grammar, how do we understand a sentence? • A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of one of its productions. • Language: the set of terminal strings that can be derived from the start symbol of the grammar. • Example: derivation of 9-5+2 (reasoning bottom up)
 – 9 is a list, since 9 is a digit.
 – 9-5 is a list, since 9 is a list and 5 is a digit.
 – 9-5+2 is a list, since 9-5 is a list and 2 is a digit.

26 Compiler Principles Parse Trees • A parse tree shows how the start symbol of a grammar derives a string in the language. For a production A → XYZ, a node labeled A has children labeled X, Y, Z. What is the relationship between tree nodes and grammar symbols? Root ↔ start symbol, leaf ↔ terminal, interior node ↔ non-terminal.

27 Compiler Principles Parse Trees Properties • The root is labeled by the start symbol.

• Each leaf is labeled by a terminal or by ε.

• Each interior node is labeled by a non- terminal.

• If A is the non-terminal labeling some interior node and X1, X2,… , Xn are the labels of the children of that node from left to right, then there must be a production A à X1X2 · · · Xn.

28 Compiler Principles Parse Tree for 9-5+2 *

29 Compiler Principles Ambiguity • A grammar can have more than one parse tree generating a given string of terminals.
 list   → list + digit | list - digit | digit
 digit  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
With the string grammar, 9-5+2 has two parse trees: one meaning (9-5)+2 = 6 and one meaning 9-(5+2) = 2.

How to eliminate it?

30 Compiler Principles Eliminating Ambiguity * • Operator Associativity: in most programming languages arithmetic operators have left associativity.
 – Example: 9+5-2 = (9+5)-2
 – Exception: the assignment operator = has right associativity: a=b=c is equivalent to a=(b=c)
 – Elaborate: list -> list + digit | list - digit | digit
 – What about right associativity?
• Operator Precedence: if an operator has higher precedence, it binds to its operands first.
 – Example: * has higher precedence than +, therefore 9+5*2 = 9+(5*2)
 – How to elaborate the CFG for operator precedence? (See the grammar sketch below.)
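One standard way to elaborate the grammar (a sketch; the non-terminal names expr, term, and factor are illustrative): give each precedence level its own non-terminal, and use left recursion for left-associative operators.

 expr   → expr + term | expr - term | term
 term   → term * factor | term / factor | factor
 factor → digit | ( expr )

Here * and / bind tighter than + and - because they are generated lower in the tree, and the left recursion in expr and term makes both levels left-associative. A right-associative operator puts the recursion on the right instead, e.g. assign → id = assign | id.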

31 Compiler Principles Parsing * • Parsing is the process of determining how a string of terminals can be generated by a grammar.
 – Does each program have a specific tree?
 – What if no tree can be derived from the start symbol?
• Two classes:
 – Top-down: construction of the parse tree starts at the root and proceeds towards the leaves; easier to follow and to write by hand
 – Bottom-up: construction of the parse tree starts at the leaves and proceeds towards the root; wider usage, usually generated automatically

32 Compiler Principles Top-Down Parsing * • The top-down construction of a parse tree is done by starting from the root, and repeatedly performing the following two steps.

– At node N, labeled with non-terminal A, select the proper production of A and construct children at N for the symbols in the production body.

– Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded non-terminal of the tree.

33 Compiler Principles Top-Down Parsing
For computers: try each production until the right one is found. What if the first production picked is unsuitable? Backtrack.
E.g., suppose we add another production
 stmt -> if (expr) stmt else stmt
so that  if (expr) stmt else if (expr) stmt else stmt  can also be derived. How to avoid backtracking? See next.

34 Compiler Principles Predictive Parsing
• Recursive-descent parsing: a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. Associate a procedure with every non-terminal.
• Predictive parsing: a simple form of recursive-descent parsing
 – The lookahead symbol unambiguously determines the flow of control, based on the first terminal(s) that each production body can derive
 – FIRST(stmts) → production to be used
 – Parse tree ← the sequence of procedure calls

35 Compiler Principles Procedure for stmt
Associate a procedure with every non-terminal (a sketch follows).

Necessary condition to use predictive parsing? No conflict among the first symbols of the bodies for the same head. What about a conflict such as the one on slide 34?
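A sketch of what such a procedure can look like (the grammar, token codes, and helper methods below are illustrative, not the lecture's actual code): one Java method per non-terminal, where the lookahead token alone selects the production.

class StmtParser {
    // Hypothetical token codes and parser state; a real parser gets tokens from the lexer.
    static final int IF = 256, WHILE = 257, ID = 258;
    int lookahead;                           // current token

    void match(int t) {
        if (lookahead == t) lookahead = nextToken();
        else throw new RuntimeException("syntax error");
    }
    int nextToken() { return -1; }           // stub: would call the lexical analyzer
    void expr() { /* parse an expression; omitted */ }

    // Grammar assumed: stmt -> if ( expr ) stmt | while ( expr ) stmt | id = expr ;
    void stmt() {
        switch (lookahead) {
            case IF:     match(IF);    match('('); expr(); match(')'); stmt(); break;
            case WHILE:  match(WHILE); match('('); expr(); match(')'); stmt(); break;
            case ID:     match(ID);    match('='); expr(); match(';'); break;
            default:     throw new RuntimeException("unexpected token " + lookahead);
        }
    }
}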

36 Compiler Principles Left Recursion Elimination * • Predictive parsing relies on information about the first symbols that can be generated by a production body. • Left recursion: the leftmost symbol of the body is the same as the head non-terminal. • A left-recursive production makes such a parser loop forever. • A left-recursive production can be eliminated by rewriting the offending production (see the rewrite below).

• Are the two grammars equivalent? Draw a parse tree!
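The standard rewrite: a left-recursive pair of productions

 A → A α | β

is replaced by

 A → β R
 R → α R | ε

For the digit-list grammar this gives

 list → digit rest
 rest → + digit rest | - digit rest | ε

Both grammars generate the same strings, but the rewritten one is no longer left-recursive, so a recursive-descent parser can use it; the shape of the parse tree changes, which is why the slide asks you to draw one.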

37 Compiler Principles Syntax Analyzer vs. Lexical Analyzer • Both of them do similar things • Granularity – The lexical analyzer works on the characters to recognize the smallest meaningful units (tokens) in a source program. – The syntax analyzer works on those tokens to recognize meaningful structures in the programming language. • Recursion – The lexical analyzer deals with simple non-recursive constructs of the language. – The syntax analyzer deals with recursive constructs of the language.

38 Compiler Principles Semantic Analysis Once sentence structure is understood, we can try to understand “meaning”.

• Semantic Analyzer
 – adds semantic information to the parse tree (syntax-directed translation)
 – checks the source program for semantic errors
 – collects type information for code generation
 – type checking: checks whether each operator has matching operands (where does the type information come from?)
 – coercion: type conversion

39 Compiler Principles Semantic Analysis • A Semantic Analyzer checks the source program for semantic errors and collects the type information for the code generation. • Type checking is an important part of semantic analysis.

Syntax Tree → Semantic Tree

40 Compiler Principles Syntax-Directed Translation • Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar.

• Example: infix expression → postfix expression • Techniques: attributes & translation schemes
41 Compiler Principles Postfix Notation • Definition:
 – If E is a variable or constant, the postfix notation of E is E itself.

 – If E is an expression of the form E1 op E2, the postfix notation of E is E1' E2' op, where E1' and E2' are the postfix notations of E1 and E2.

 – If E is a parenthesized expression of the form (E1), the postfix notation of E is E1', the postfix notation of E1.

• Examples: – 9-5+2 à 95-2+ – 9-(5+2) à 952+-

42 Compiler Principles Attributes • A syntax-directed definition – associates attributes with non-terminals and terminals in a grammar – attaches semantic rules to the productions of the grammar

• An attribute is said to be synthesized if its value at a parse-tree node is determined/computed from attribute values of its children and itself.

43 Compiler Principles Semantic Rules for Infix to Postfix

Annotated parse tree for 9-5+2 → 95-2+

Syntax-directed definition

How are attributes passed along in a complex tree? We will learn that later in this chapter.
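The slide's figure is not reproduced here; the following sketch shows the usual syntax-directed definition for this translation (the attribute name t and the || concatenation notation follow the textbook's convention):

 expr → expr1 + term   { expr.t = expr1.t || term.t || '+' }
 expr → expr1 - term   { expr.t = expr1.t || term.t || '-' }
 expr → term           { expr.t = term.t }
 term → 9              { term.t = '9' }     (and similarly for the other digits)

Evaluating the synthesized attribute t bottom-up on the parse tree of 9-5+2 yields 95- at the subtree for 9-5 and 95-2+ at the root.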

44 Compiler Principles Translation Schemes • The other syntax-directed translation approach • A Syntax-Directed Translation Scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar.

• The program fragments are called semantic actions.

45 Compiler Principles A Translation Scheme

Parse tree for 9-5+2 → 95-2+

Translation scheme

postorder traversal
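A sketch of the corresponding translation scheme (the semantic actions in braces follow the textbook's style): the actions are executed during a postorder traversal of the parse tree, printing the translation incrementally.

 expr → expr1 + term   { print('+') }
 expr → expr1 - term   { print('-') }
 expr → term
 term → 9              { print('9') }       (and similarly for the other digits)

Traversing the parse tree of 9-5+2 and executing the actions in postorder prints 95-2+.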

46 Compiler Principles Attribute vs. Translation Scheme • A syntax-directed definition attaches strings as attributes to the nodes in the parse tree

• A syntax-directed translation scheme prints the translation incrementally, through semantic actions

47 Compiler Principles Parsing Techniques • Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups: Top-Down Parsing and Bottom-Up Parsing.
• Top-Down Parsing:
 – Construction of the parse tree starts at the root and proceeds towards the leaves.
 – Efficient top-down parsers can easily be constructed by hand.
 – Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
 – Construction of the parse tree starts at the leaves and proceeds towards the root.
 – Efficient bottom-up parsers are normally created with the help of software tools.
 – Bottom-up parsing is also known as shift-reduce parsing.
 – Operator-Precedence Parsing: simple, restrictive, easy to implement.
 – LR Parsing: a much more general form of shift-reduce parsing; LR, SLR, LALR.

48 Compiler Principles A Simple Translator Semantic actions embedded in the productions are simply carried along in the transformation, as if they were terminals.


Grammar of List of digits separated by plus or minus signs

49 Compiler Principles Translation of 9-5+2 to 95-2+

Left recursion eliminated

50 Compiler Principles Procedures for Simple Translator
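The procedures themselves are not shown on the extracted slide; here is a minimal, runnable sketch (class and method names are illustrative) of a recursive-descent translator for the left-recursion-eliminated grammar expr → term rest, rest → + term { print + } rest | - term { print - } rest | ε, term → digit { print digit }.

public class SimpleTranslator {
    private final String input;
    private int pos = 0;

    SimpleTranslator(String input) { this.input = input; }

    private char lookahead() { return pos < input.length() ? input.charAt(pos) : '$'; }

    private void match(char t) {
        if (lookahead() == t) pos++;
        else throw new IllegalStateException("syntax error at position " + pos);
    }

    void expr() { term(); rest(); }

    void rest() {                                    // tail recursion replaced by a loop
        while (lookahead() == '+' || lookahead() == '-') {
            char op = lookahead();
            match(op);
            term();
            System.out.print(op);                    // semantic action: print the operator
        }
    }

    void term() {
        char c = lookahead();
        if (Character.isDigit(c)) { System.out.print(c); match(c); }   // print the digit
        else throw new IllegalStateException("digit expected at position " + pos);
    }

    public static void main(String[] args) {
        new SimpleTranslator("9-5+2").expr();        // prints 95-2+
        System.out.println();
    }
}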

51 Compiler Principles (Abstract) Syntax Trees • In an (abstract) syntax tree for an expression – each interior node represents an operator – the children of the node represent the operands of the operator.

Syntax tree for 9-5+2

• In the syntax tree, interior nodes represent programming constructs. • In the parse tree, the interior nodes represent nonterminals (auxiliary symbols).

52 Compiler Principles Syntax vs. Semantics • The syntax of a programming language describes the proper form of its programs.

• The semantics of the language defines what its programs mean, what each program does when it executes.

53 Compiler Principles Intermediate Code Generation

54 Compiler Principles Intermediate Code Generation • The front end of a compiler constructs an intermediate representation of the source program, from which the back end generates the target program. (What representations can be used?)

• Two kinds of intermediate representations

– Tree: parse trees and (abstract) syntax trees

– Linear representation: three-address code

55 Compiler Principles Intermediate Code Generation • A compiler may produce explicit intermediate code representing the source program. • This intermediate code is generally machine (architecture) independent, but its level is close to the level of machine code. • Ex: three-address code x = y op z

56 Compiler Principles Syntax Trees For Statement
 stmt -> while ( expr ) stmt1    { stmt.n = new While(expr.n, stmt1.n) }
(n is a Node in the syntax tree)

Each statement keyword has an operator with the same name; we use it as a class (extending Node). The semantically meaningful components become new classes/objects (also extending Node).

Expr and Stmt are subclasses of Node.

57 Compiler Principles Syntax Trees For Expressions

Each operator becomes a new class/object (extending Node).

58 Compiler Principles Syntax Trees For Expressions
• Q: what about an expression used as a statement, stmt → expr ; ?
• A: an expression statement – define a new operator eval and a class Eval
• Grouping of operators – based on "similarity" (see the class sketch below):
 Concrete syntax              Abstract syntax
 =                            assign
 ||  &&                       cond
 ==  !=  <  <=  >  >=         rel
 +  -  *  /  %                op
 !                            not
 []                           access
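A sketch of the node classes the slides suggest (names such as Rel, Assign, While, and Eval are illustrative; constructors and fields are kept minimal): every construct extends Node, expressions and statements get their own abstract subclasses, and each operator group or statement keyword becomes a concrete class.

abstract class Node { }

abstract class Expr extends Node { }
abstract class Stmt extends Node { }

class Rel extends Expr {                 // ==  !=  <  <=  >  >=  grouped as "rel"
    final String op; final Expr left, right;
    Rel(String op, Expr left, Expr right) { this.op = op; this.left = left; this.right = right; }
}

class Assign extends Expr {              // =
    final Expr target, value;
    Assign(Expr target, Expr value) { this.target = target; this.value = value; }
}

class While extends Stmt {               // built by: stmt -> while ( expr ) stmt1 { stmt.n = new While(expr.n, stmt1.n) }
    final Expr cond; final Stmt body;
    While(Expr cond, Stmt body) { this.cond = cond; this.body = body; }
}

class Eval extends Stmt {                // expression statement: stmt -> expr ;
    final Expr expr;
    Eval(Expr expr) { this.expr = expr; }
}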

59 Compiler Principles Static Checking
• Done by the compiler front end
• Checks that the program follows the syntactic and semantic rules
 – Syntactic checking
  • More than grammar: e.g., declarations, placement of break statements, … are not enforced by the grammar
 – Type checking
  • An operator/function must be applied to the right type of operands
  • e.g., for if (expr) stmt: if (E1.type == E2.type) E.type = Boolean; else error
  • Coercion & overloading
 – Complex checking: analyzing the syntax tree

60 Compiler Principles Three-Address Code
• Three-address code is a sequence of instructions of the form
 x = y op z
• Arrays are handled by using the following two variants of instructions:
 x [ y ] = z
 x = y [ z ]
• Instructions for control flow:
 ifFalse x goto L
 ifTrue x goto L
 goto L
• Instruction for copying a value:
 x = y

61 Compiler Principles Translation of Statements • Use jump instructions to implement the flow of control through the statement. • The translation of

if expr then stmt1

62 Compiler Principles Translation of Statements
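The slide's figure is not reproduced here; the translation of if expr then stmt1 has the usual shape (the label name after is illustrative):

 code to evaluate expr into x
 ifFalse x goto after
 code for stmt1
after:

That is, the condition is computed into a temporary x, a conditional jump skips the statement when the condition is false, and the label after marks the instruction following the if statement.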

63 Compiler Principles Translation of Expressions • Approach: – No code is generated for identifiers and constants – If a node x of class Expr has operator op, then an instruction is emitted to compute the value at node x into a temporary.

• Expression i-j+k translates into
 t1 = i - j
 t2 = t1 + k
• Expression 2 * a[i] translates into
 t1 = a [ i ]
 t2 = 2 * t1
* Do not use a temporary in place of a[i] if a[i] appears on the left side of an assignment.

64 Compiler Principles Functions lvalue and rvalue • In a = a + 1, a is computed differently as an l-value and as an r-value

• Two functions used to distinguish them: – lvalue: generates instructions to compute the subtrees below x, and returns a node representing the “address” for x

– rvalue: generates the instructions to compute x into a temporary, and returns a new node representing the temporary.

• R-values are what we usually think of as "values", while l-values are "locations".
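A small worked example (the temporary name t1 is illustrative): for a = a + 1, calling rvalue on the right-hand side emits code that computes the value into a temporary, while calling lvalue on the left-hand side just returns the location named a.

 t1 = a + 1        (rvalue of the right-hand side: its value, in a temporary)
 a = t1            (lvalue of the left-hand side: the location being assigned)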

65 Compiler Principles Translation of Expressions • Example:

66 Compiler Principles Test Yourself • Generate three-address code for if (x[2*a] == y[b]) x[2*a+1] = y[b+1];

t4 = 2 * a
t2 = x [ t4 ]
t3 = y [ b ]
t1 = t2 == t3
ifFalse t1 goto after
t5 = t4 + 1
t7 = b + 1
t6 = y [ t7 ]
x [ t5 ] = t6
after:

67 Compiler Principles Code Optimization • The code optimizer improves the code produced by the intermediate code generator in terms of time and space.

68 Compiler Principles Code Generation • The code generator takes as input an intermediate representation of the source program and maps it into the target language.

• Example:
 MOVE id3, R1
 MULT #60.0, R1
 ADD id2, R1
 MOVE R1, id1

69 Compiler Principles Issues Driving Compiler Design • Correctness • Speed (runtime and compile time) – Degrees of optimization – Multiple passes • Space • Feedback to user • Debugging

70 Compiler Principles Tools

• Lexical Analysis – Lex, Flex, JLex

• Syntax Analysis – Yacc, JavaCC, SableCC

• Semantic Analysis – Yacc, JavaCC, SableCC

71 Compiler Principles Homework • Reading – Chapters 1 and 2

72 Compiler Principles