
Describing Syntax and Semantics of Programming Languages, Part I

Programming Language Description

A description must
• be concise and understandable
• be useful to both programmers and language implementors
• cover both
  • syntax (the form of expressions, statements, and program units) and
  • semantics (the meaning of expressions, statements, and program units)

Example: the Java while-statement
Syntax: while (boolean_expr) statement
Semantics: if boolean_expr is true, then statement is executed and control returns to the expression to repeat the process; if boolean_expr is false, control passes to the statement following the while-statement.

Lexemes and Tokens

The lowest-level syntactic units are called lexemes. Lexemes include identifiers, literals, operators, special keywords, etc. A token is a category of lexemes (i.e., similar lexemes belong to the same token).

Example: the Java statement index = 2 * count + 17;

  Lexeme   Token
  index    IDENTIFIER
  =        EQUALS
  2        NUMBER
  *        MUL
  count    IDENTIFIER
  +        PLUS
  17       NUMBER
  ;        SEMI

The IDENTIFIER token has two lexemes here (index, count) and the NUMBER token has two (2, 17); each of the remaining four lexemes (=, *, +, ;) is the lone example of its corresponding token.

Lexemes and Tokens: Another Example

Example: the SQL statement select sno, sname from suppliers where sname = 'Smith'

  Lexeme     Token
  select     SELECT
  sno        IDENTIFIER
  ,          COMMA
  sname      IDENTIFIER
  from       FROM
  suppliers  IDENTIFIER
  where      WHERE
  sname      IDENTIFIER
  =          EQUALS
  'Smith'    SLITERAL

The IDENTIFIER token has lexemes sno, sname, and suppliers; the SLITERAL token has lexeme 'Smith'; each of the remaining lexemes (select, from, where, ",", =) is the lone example of its corresponding token.
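The lexeme-to-token classification in the Java example above can be sketched with a small hand-written tokenizer. This is only an illustration using Python's re module with named groups; the token names follow the Java example in the table, and TOKEN_SPEC, MASTER, and tokenize are names invented for this sketch:

```python
import re

# Token specification as (TOKEN_NAME, regex) pairs; order matters:
# alternatives are tried left to right at each position.
TOKEN_SPEC = [
    ('NUMBER',     r'[0-9]+'),
    ('IDENTIFIER', r'[A-Za-z_][A-Za-z_0-9]*'),
    ('EQUALS',     r'='),
    ('MUL',        r'\*'),
    ('PLUS',       r'\+'),
    ('SEMI',       r';'),
    ('SKIP',       r'\s+'),   # whitespace: matched but discarded
]
# One master regex with a named group per token
MASTER = re.compile('|'.join('(?P<%s>%s)' % p for p in TOKEN_SPEC))

def tokenize(text):
    """Yield (lexeme, token) pairs for the input string."""
    for m in MASTER.finditer(text):
        if m.lastgroup != 'SKIP':        # m.lastgroup names the group that matched
            yield (m.group(), m.lastgroup)

print(list(tokenize('index = 2 * count + 17;')))
```

Running it reproduces the table: [('index', 'IDENTIFIER'), ('=', 'EQUALS'), ('2', 'NUMBER'), ('*', 'MUL'), ('count', 'IDENTIFIER'), ('+', 'PLUS'), ('17', 'NUMBER'), (';', 'SEMI')].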
Lexemes and Tokens: A Third Example

Example: the WAE expression {with {{x 5} {y 2}} {+ x y}};

  Lexeme  Token        Lexeme  Token
  {       LBRACE       }       RBRACE
  with    WITH         {       LBRACE
  {       LBRACE       +       PLUS
  {       LBRACE       x       ID
  x       ID           y       ID
  5       NUMBER       }       RBRACE
  }       RBRACE       }       RBRACE
  {       LBRACE       ;       SEMI
  y       ID
  2       NUMBER
  }       RBRACE

The full token set: LBRACE, RBRACE, PLUS, MINUS, TIMES, DIV, ID, WITH, IF, NUMBER, SEMI.

Lexical Analyzer

A lexical analyzer is a program that reads an input program/expression/query and extracts each lexeme from it, classifying each as one of the tokens.

Two ways to write this lexical analyzer program:
1. Write it from scratch: choose your favorite programming language (Python!) and write a program that reads the input string (which contains the input program, expression, or query) and extracts the lexemes.
2. Use a code generator (Lex, Yacc, PLY, ANTLR, Bison, ...) that reads a high-level specification of all the tokens (in the form of regular expressions) and generates the lexical analyzer program for you.

We will see how to write a lexical analyzer from scratch later. For now, we will learn how to do it using PLY: http://www.dabeaz.com/ply/

Regular Expressions in Python

https://docs.python.org/3/library/re.html
https://www.w3schools.com/python/python_regex.asp

Meta characters used in Python regular expressions:

  Meta  Description                                Examples
  []    a set of characters                        [a-z], [0-9], [xyz012]
  .     any one character (except newline)         he..o
  ^     starts with                                ^hello
  $     ends with                                  world$
  *     zero or more occurrences                   [a-z]*
  +     one or more occurrences                    [a-zA-Z]+
  ?     zero or one occurrence                     [-+]?
  {}    a specified number of occurrences          [0-9]{5}
  |     either/or                                  [a-z]+|[A-Z]+
  ()    capture and group; use \1, \2, etc.        ([0-9]{5})
        to refer to captured groups
  \     begins a special sequence; also escapes    \d, \w, etc. (see documentation)
        meta characters

PLY (Python Lex/Yacc): WAE Lexer
Install PLY with pip install ply (or pip3 install ply). The lexer specification (WAELexer.py):

  # WAELexer.py -- PLY lexer for WAE
  import ply.lex as lex

  reserved = { 'with': 'WITH', 'if': 'IF' }

  tokens = ['NUMBER', 'ID', 'LBRACE', 'RBRACE', 'SEMI', 'PLUS',
            'MINUS', 'TIMES', 'DIV'] + list(reserved.values())

  t_LBRACE = r'\{'
  t_RBRACE = r'\}'
  t_SEMI   = r';'
  t_PLUS   = r'\+'
  t_MINUS  = r'-'
  t_TIMES  = r'\*'
  t_DIV    = r'/'

  def t_NUMBER(t):
      r'[-+]?[0-9]+(\.([0-9]+)?)?'
      t.value = float(t.value)
      t.type = 'NUMBER'
      return t

  def t_ID(t):
      r'[a-zA-Z][_a-zA-Z0-9]*'
      t.type = reserved.get(t.value.lower(), 'ID')
      return t

  # Ignored characters
  t_ignore = " \r\n\t"
  t_ignore_COMMENT = r'\#.*'

  def t_error(t):
      print("Illegal character '%s'" % t.value[0])
      t.lexer.skip(1)

  lexer = lex.lex()

Note: separate keyword rules such as t_WITH = r'[wW][iI][tT][hH]' and t_IF = r'[iI][fF]' are unnecessary here; in PLY, function rules such as t_ID are tried before string rules, so keywords are recognized through the reserved table inside t_ID.

WAE Lexer continued

• The lexer object has just two methods: lexer.input(data) and lexer.token().
• Usually, the lexical analyzer is used in tandem with a parser (the parser calls lexer.token()).
• So, the code below is written just to debug the lexical analyzer.
• Once satisfied, we can/should comment this code out.

  # Test it out
  data = '''
  {with {{x 5} {y 2}} {+ x y}};
  '''
  # Give the lexer some input
  print("Tokenizing: ", data)
  lexer.input(data)
  # Tokenize
  while True:
      tok = lexer.token()
      if not tok:
          break    # No more input
      print(tok)

WAE Lexer continued

For the input {with {{x 5} {y 2}} {+ x y}}; the PLY lexer we wrote generates the following sequence of (token type, value) pairs:

  ('LBRACE','{'), ('WITH','with'), ('LBRACE','{'), ('LBRACE','{'),
  ('ID','x'), ('NUMBER',5.0), ('RBRACE','}'), ('LBRACE','{'),
  ('ID','y'), ('NUMBER',2.0), ('RBRACE','}'), ('RBRACE','}'),
  ('LBRACE','{'), ('PLUS','+'), ('ID','x'), ('ID','y'),
  ('RBRACE','}'), ('RBRACE','}'), ('SEMI',';')

(Note that t_NUMBER converts the number lexemes 5 and 2 to the float values 5.0 and 2.0.)

Let us see this program (WAELexer.py) in action!

Language Generators and Recognizers

Now that we know how to describe the tokens of a program, let us learn how to describe a "valid" sequence of tokens that constitutes a program. A valid program is referred to as a sentence in formal language theory.
Two ways to describe the syntax:

(1) Language generator: a mechanism that can be used to generate the sentences of a language. This is usually a context-free grammar (CFG). Easier to understand.
(2) Language recognizer: a mechanism that can be used to verify whether a given string p of characters (grouped into a sequence of tokens) belongs to a language L. The syntax analyzer in a compiler is a language recognizer.

There is a close connection between a language generator and a language recognizer.

Chomsky Hierarchy and Backus-Naur Form

• Chomsky, a noted linguist, defined a hierarchy of language-generating mechanisms (grammars) for four different classes of languages. Two of them are used to describe the syntax of programming languages:
  • Regular grammars: describe the tokens; equivalent to regular expressions.
  • Context-free grammars: describe the syntax of programming languages.
• John Backus invented a similar mechanism, which was later extended by Peter Naur; it is referred to as Backus-Naur Form (BNF).
• The two mechanisms are essentially equivalent, and we may use "CFG" and "BNF" interchangeably.

Fundamentals of Context-Free Grammars

CFGs are a meta-language used to describe another language: they are meta-languages for programming languages! A context-free grammar G has four components (N, T, P, S):

1) N, a set of non-terminal symbols (non-terminals); these denote abstractions that stand for syntactic constructs of the programming language.
2) T, a set of terminal symbols (terminals); these denote the tokens of the programming language.
3) P, a set of production rules of the form X → α, where X is a non-terminal and α (the definition of X) is a string of terminals and/or non-terminals. The production rules define the "valid" sequences of tokens for the programming language.
4) S, a designated non-terminal called the start symbol; this denotes the highest-level abstraction, standing for all possible programs in the programming language.

CFGs: Examples of Production Rules

Note: we will use lower-case for non-terminals and upper-case for terminals.

(1) A Java assignment statement may be represented by the abstraction assign. The definition of assign may be given by the production rule

  assign → VAR EQUALS expression

(2) A Java if statement may be represented by the abstraction ifstmt and the following production rules:

  ifstmt → IF LPAREN logic_expr RPAREN stmt
  ifstmt → IF LPAREN logic_expr RPAREN stmt ELSE stmt

These two rules have the same LHS; they can be combined into one rule with an "or" (|) on the RHS:

  ifstmt → IF LPAREN logic_expr RPAREN stmt
         | IF LPAREN logic_expr RPAREN stmt ELSE stmt

In these examples, we would still have to introduce production rules defining the other abstractions used, such as expression, logic_expr, and stmt.

CFGs: Examples of Production Rules (continued)

(3) A list of identifiers in Java may be represented by the abstraction ident_list. The definition of ident_list can be given by the following recursive production rules, an important pattern:

  ident_list → IDENTIFIER
  ident_list → ident_list COMMA IDENTIFIER

Notice that the second rule is recursive, because the non-terminal ident_list on the LHS also appears on the RHS.

It is time to learn how these production rules are used. Production rules are a kind of "replacement" or "rewrite" rule, in which the LHS is replaced by the RHS.
Consider the following replacements/rewrites starting with ident_list:

  ident_list ⇒ ident_list COMMA IDENTIFIER
             ⇒ ident_list COMMA IDENTIFIER COMMA IDENTIFIER
             ⇒ ident_list COMMA IDENTIFIER COMMA IDENTIFIER COMMA IDENTIFIER
             ⇒ IDENTIFIER COMMA IDENTIFIER COMMA IDENTIFIER COMMA IDENTIFIER

Substituting these token types by their values, we may get: x, y, z, u

WAE PLY Grammar

Note: in PLY, we write : instead of →.

NON-TERMINALS (N): waeStart, wae, alist
TERMINALS (T): LBRACE, RBRACE, PLUS, MINUS, TIMES, DIV, ID, WITH, IF, NUMBER, SEMI

PRODUCTION RULES (P):

  waeStart : wae SEMI
  wae : NUMBER
  wae : ID
  wae : LBRACE PLUS wae wae RBRACE
  wae : LBRACE MINUS wae wae RBRACE
  wae : LBRACE TIMES wae wae RBRACE
  wae : LBRACE DIV wae wae RBRACE
  wae : LBRACE IF wae wae wae RBRACE
  wae : LBRACE WITH LBRACE alist RBRACE wae RBRACE
  alist : LBRACE ID wae RBRACE
  alist : LBRACE ID wae RBRACE alist

For example, the rule wae : LBRACE WITH LBRACE alist RBRACE wae RBRACE matches { with { {x 5} {y 2} } {+ x y} }, and the rule wae : LBRACE PLUS wae wae RBRACE matches { + x y }.

Grammars and Derivations

The sentences of a language are generated through a sequence of applications of the production rules, starting with the start symbol. This sequence of rule applications is called a derivation.
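The generator view of a grammar can be played out in a few lines of Python: encode the recursive ident_list rules from above and repeatedly replace a non-terminal by one of its right-hand sides until only terminals remain. This is a minimal sketch, not part of PLY; the GRAMMAR encoding and the derive function are names invented for this illustration:

```python
import random

# The recursive ident_list grammar from above, encoded as:
# non-terminal -> list of alternative right-hand sides.
GRAMMAR = {
    'ident_list': [
        ['IDENTIFIER'],
        ['ident_list', 'COMMA', 'IDENTIFIER'],
    ],
}

def derive(symbol, rng):
    """Expand a symbol into a list of terminals by repeatedly
    replacing non-terminals (one derivation of the grammar)."""
    if symbol not in GRAMMAR:          # terminal: nothing to expand
        return [symbol]
    rhs = rng.choice(GRAMMAR[symbol])  # pick one production rule
    out = []
    for s in rhs:
        out.extend(derive(s, rng))
    return out

sentence = derive('ident_list', random.Random(0))
print(' '.join(sentence))  # e.g. IDENTIFIER COMMA IDENTIFIER ...
```

Every sentence this generator produces has the shape IDENTIFIER (COMMA IDENTIFIER)*, which is exactly the language the two production rules describe.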