Notes on Programming Language Syntax by Geoffrey Smith and Jai Navlakha

At this point in the semester, we shift our focus from programming in F# to a more general study of programming languages. We will consider two main questions:  How can we precisely specify a programming language?  How can we implement a programming language?

For specification, we will explore three formalisms: 1. context-free grammars for specifying syntax, 2. natural semantics for specifying behavior, and 3. type systems for specifying "well-formed" programs.

When we consider language implementation, we will see that each of these formalisms can help us in building interpreters or compilers.

In these notes, we discuss programming language syntax. Any programming language has precise rules for the syntax of legal programs, which must be checked by an interpreter or compiler. Syntax specification is traditionally divided into two parts: the specification of the legal "words" or tokens of the language, and the specification of which combinations of tokens are legal. This division leads to two tasks for a compiler: lexical analysis and parsing.

Lexical Analysis

Lexical analysis is the process of transforming the source file from a sequence of characters into a sequence of tokens, usually including the stripping out of white space and comments. For instance, a source file like if x <= 27 (* comment *) else would be transformed into the tokens IF ID(x) LEQ INT(27) ELSE EOF

Note that x and 27 are recognized as ID (identifier) and INT (integer literal). As far as syntax analysis is concerned, it makes no difference which ID or INT appears. But for type checking and code generation later, we do need to remember the semantic values x and 27. Also note that EOF denotes the end of file.

Tools like lex make it very easy to construct a lexical analyzer, often called a lexer. One writes regular expressions that describe the legal tokens in the language, along with the action to be taken when they are found. For example, here is a lex-like specification: if {return IF;} [a-z][a-z0-9]* {return ID;} [0-9]+ {return INT;} (" "|\n)+ { }

Given this specification, lex automatically constructs a lexer.

Context-Free Grammars

Parsing is the problem of checking whether the sequence of tokens produced by the lexer is legal. To specify what is legal, we use the formalism of context-free grammars or CFGs.

A CFG consists of a set of nonterminals, one for each syntactic category to be defined. For example, we might have one nonterminal for expressions, another for commands, and another for declarations. Then we have a set of rules that describe all of the possible forms that each nonterminal can take. Each rule is of the form A -> alpha, where A is a nonterminal and alpha is a string of nonterminals and/or tokens.

Suppose for example that we wish to specify arithmetic expressions built from identifiers using the binary operators +, -, *, and /. Then we need just one nonterminal E, seven tokens (i, +, -, *, /, (, and )), and six rules: E -> i E -> E+E E -> E-E E -> E*E E -> E/E E -> (E)

Normally we would write the rules more compactly as follows: E -> i | E+E | E-E | E*E | E/E | (E)

Using this CFG we can generate legal arithmetic expressions by repeatedly replacing E using whichever rule we like. For example, here is a derivation of the expression (i-i)+i*i. E => E+E => E+E*E => E+i*E => (E)+i*E => (E)+i*i => (E-E)+i*i => (i-E)+i*i => (i-i)+i*i

Note that in each step we replace one E with one of the rule right-hand sides, copying all the other symbols verbatim. To avoid all this copying, it is much nicer to show a derivation as a parse tree, in which each nonterminal has as children each of the symbols of the right-hand side of the chosen rule:

The parse tree shows that a given expression is syntactically legal. More importantly, it reveals the structure of the expression. For example, it shows that the + operation is to be done last, taking as its two operands i-i and i*i. However, there is a major concern about the structuring of expressions. How do we know that every expression has a unique parse tree? If not, then there could be more than one way to structure it. This leads to an important definition:

Definition: A context-free grammar is ambiguous if there is at least one string with more than one parse tree. It is easy to see that our example grammar is ambiguous. For example, here are two parse trees for i+i*i:

Which of these two trees matches the usual mathematical conventions? You should also verify that our grammar gives two parse trees to i-i-i. The trouble with our grammar is that it does not enforce either precedence or associativity among the operators. However, it turns out that we can rewrite our grammar so that the following properties hold:  The set of legal expressions is not changed.  The new grammar is unambiguous.  The new grammar gives * and / higher precedence than + and -.  The new grammar makes all operators associate to the left.

The new grammar is a bit more complicated, as it requires three nonterminals E, T, and F to separately define expressions, terms, and factors:

E -> E+T | E-T | T T -> T*F | T/F | F F -> i | (E)

To convince yourself that the new grammar has the properties given above, you should draw the parse trees for a variety of expressions, such as i+i*i, (i+i)*i, and i-i-i.

Recursive-Descent Parsing

Given a CFG, the parsing problem is to find an algorithm that takes as input a sequence of tokens, and either  builds a parse tree, or  reports a syntax error (at some token).

Parsing is a very well-studied problem, and there exist nice tools like yacc that take as input a CFG and try to build a parser automatically.

Here I will briefly describe recursive-descent parsing, a simple (but not too general) parsing method that is easy to code by hand. [A parser that uses a set of recursive procedures to recognize its input with no backtracking is called a recursive-descent parser.] It is based on the observation that for some CFGs, a single lookahead token is enough to determine which rule to use next. Consider for example the following grammar for a little programming language:

S -> if E then S else S | begin S L | print E L -> end | ; S L E -> i

(For simplicity, we are limiting expressions to just identifiers.) It can generate programs like if a then begin print b; print c end else print d Suppose we are trying to parse an S. Notice that there are only three tokens that an S can start with, if, begin, and print, each corresponding to a different rule. Hence if the parser knows which token is coming next, it can know which rule it needs to use. The same property holds for L and for E Hence we can get the following C pseudo-code:

// lookahead token int tok = nextToken();

void advance() {tok = nextToken();}

// used whenever a specific token t must appear next void eat(int t) {if (tok == t) advance(); else error();}

void S() { switch (tok) { case IF: advance(); E(); eat(THEN); S(); eat(ELSE); S(); break; case BEGIN: advance(); S(); L(); break; case PRINT: advance(); E(); break; default: error(); } } void L() { switch (tok) { case END: advance(); break; case SEMICOLON: advance(); S(); L(); break; default: error(); } } void E() { if (tok == ID) then advance(); else error(); } void main() { S(); if (tok == EOF) accept(); else error(); }

To get a feel for recursive-descent parsing, you should try running this code (by hand) on sample inputs like begin print a; print b end and begin print a; print b; end (Note that the latter input has a syntax error.)

Finally, as written, the parser code only tests whether the input is syntactically legal or not. But it is quite straightforward to modify the code so that S(), T(), and E() return parse trees for whatever they parse.

Recursive-Descent Parsing of Arithmetic Expressions

I mentioned that recursive-decent parsing is "not too general". The reason is that many CFGs do not have the property that a single lookahead token determines which rule to use next. In particular our unambiguous grammar for arithmetic expressions does not have the property. Consider the rules E -> E+T | E-T | T

If the lookahead is ID, we have no way of knowing which rule to use. Nevertheless, we can "hack" a solution in this case. Observe that the grammar for E must ultimately generate a T followed by zero or more occurrences of +T or -T. Hence we can parse an E using the following code: void E() { T(); while (tok == PLUS || tok == MINUS) { advance(); T(); } }