COS 360 Programming Languages Formal Semantics
Semantics has to do with meaning, and typically the meaning of a complex whole is a function of the meaning of its parts. Just as in an English sentence, we parse a sentence into subject, verb, objects, adjectives (noun qualifiers) and adverbs (verb qualifiers), so with a program we break it up into control structures, statements, expressions, each of which has a meaning. The meanings of the constituents contribute to the meaning of the program as a whole. Another way this is put is to say that the semantics is compositional (see en.wikipedia.org/wiki/Denotational_semantics#Compositionality).

Consequently, the grammar of a language guides the exposition of its meaning. Each of the three approaches to formal semantics, axiomatic, operational, and denotational, will use the grammar as a structuring device to explain the meaning. The grammar used, however, is not the same one that the compiler’s parser uses. It eliminates the punctuation and just contains the relevant, meaning-carrying pieces and is often ambiguous, but the notion is that the parsing has already taken place and the parse tree simplified to what is referred to as an abstract syntax tree.

Learning Objectives

1. to understand the difference between the abstract syntax trees used in expositions of semantics and the concrete syntax trees, or parse trees, used in parsing
2. to understand how the abstract syntax tree contributes to the explanation of programming language semantics
3. to understand the three major approaches to programming language semantics: axiomatic, operational, and denotational
4. to understand how axiomatic semantics can be used to prove algorithms are correct
5. to understand how small step operational semantics define computation sequences for programs from abstract syntax trees
6. to understand how the semantic evaluation functions of denotational semantics define the meanings of abstract syntax trees in associated semantic domains and how to define such semantic evaluation functions
7. to appreciate the different roles with respect to a programming language: implementer (compiler writer), user (programmer), and designer
8. to appreciate how the three approaches have different emphases that favor one role over another

We make a few preliminary observations. We say “formal” semantics because the explication is intended to be very precise, complete, and based on formal notations (a programming language itself is a formal language), like the formal notations in mathematics and logic. The alternative would be “informal”, which could still be of value, but wouldn’t have the same degree of rigor.

Based as it is on mathematics and logic, the formal description of programming language semantics is one of the most technically sophisticated areas of computer science. Many of the contributors to its development, such as Robert W. Floyd, C. A. R. Hoare, Dana Scott, Leslie Lamport, Robin Milner, and Amir Pnueli have been recipients of the ACM Turing Award, the highest recognition for an academic computer scientist. We are not going to go into this area at great depth, which could easily occupy several graduate level courses in computer science, but will attempt to give you a sense of how the three main approaches proceed, using the same simple example in each.

For imperative languages, a key notion is that of an environment, which is an association of values to variables. The program’s meaning is a map from some specific initial environment to another environment, the environment the machine is left in when the program terminates.
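As a minimal sketch of this idea, assuming Python as an illustration language, an environment can be modeled as a dictionary from variable names to integers, and the meaning of a terminating program as a function from one dictionary to another. The names Env, Meaning, and double_x are hypothetical and are not part of the notation used in these notes.

```python
from typing import Callable, Dict

# An environment associates integer values with variable names.
Env = Dict[str, int]

# The meaning of a (terminating) program: a map from environments to environments.
Meaning = Callable[[Env], Env]


def double_x(env: Env) -> Env:
    """Meaning of the one-statement program  x <- x + x  (hypothetical example)."""
    new_env = dict(env)                 # copy, so the initial environment is untouched
    new_env["x"] = env["x"] + env["x"]
    return new_env


initial = {"x": 21, "y": 5}
final = double_x(initial)
print(final)    # {'x': 42, 'y': 5}
```

Returning a fresh dictionary rather than mutating the argument keeps the initial and final environments distinct, which matches the view of a program as a map between two environments.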
Suppose the program is C (for “Command”), and let E stand for the set of all possible environments, augmented (“lifted” is the verb that is used) to include a special value, the “undefined environment”, which we symbolize with ⊥. This ⊥ value is needed to cover the case when a program does not terminate; it is the image of any initial environment that induces the program to go into an infinite loop. It is just a way for us to make the definition of the program’s function into a total function, defined on all environments in E. The function C computes from E to E, which we name f_C because it is induced by C, is defined to be, for e ∈ E,

\[
f_C(e) =
\begin{cases}
\bot & \text{if } e = \bot \text{ (a program cannot recover from undefined)} \\
\bot & \text{if } e \neq \bot \text{ and } C \text{, when started in } e \text{, fails to terminate} \\
e' & \text{if } e \neq \bot \text{ and } C \text{ terminates in } e' \text{ when started in } e
\end{cases}
\]

Since the focus is on the computation, the i/o is typically ignored in these discussions. The models assume the variables have been loaded with their appropriate input values, and the exposition then confines itself to just the environment changes wrought by assignments, expression evaluation, and flow of control.

As indicated above, the parse trees used in discussions of programming language semantics are not the same as the parse trees associated with a grammar for the language. Languages typically use parentheses to force certain orders of evaluation, but once the parse has taken place so that the higher priority subexpression is lower in the tree, the parentheses are not needed any more and just clutter up the tree. Like the Lone Ranger and Tonto at the end of an episode, their work is done and it’s time for them to vanish. The trees used in semantics are called abstract syntax trees; they are usually very spare and are described by ambiguous grammars. You can imagine that the real parsing has already been done and the resulting parse trees have been cleaned up.
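To illustrate how parentheses disappear, here is a small sketch that assumes nested Python tuples as a hypothetical encoding of abstract syntax trees; the tuple layout is an assumption made for illustration only. The concrete strings (a + b) * c and a + b * c give different trees, and neither tree has a node for a parenthesis, because the grouping is carried entirely by the nesting.

```python
# Hypothetical AST encoding: ("id", name) for a variable,
# (operator, left, right) for a binary operation.

# Concrete syntax: (a + b) * c
grouped = ("*", ("+", ("id", "a"), ("id", "b")), ("id", "c"))

# Concrete syntax: a + b * c   (with * binding tighter than +)
ungrouped = ("+", ("id", "a"), ("*", ("id", "b"), ("id", "c")))


def count_nodes(tree) -> int:
    """Count the nodes of an abstract syntax tree in this encoding."""
    if tree[0] == "id":
        return 1
    _, left, right = tree
    return 1 + count_nodes(left) + count_nodes(right)


print(count_nodes(grouped), count_nodes(ungrouped))  # 5 5
print(grouped == ungrouped)                          # False
```

Both trees have the same five nodes' worth of operators and variables; only their shape differs, and that shape is all the semantics needs.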
Here’s a stripped back grammar for abstract syntax trees of an imperative programming language whose only type is the integers. The start symbol is <program>.

    <program>    ::= <stmnt>
    <stmnt>      ::= <assign> | <while> | <if> | skip | <stmnt><stmnt>
    <assign>     ::= id <- <e>
    <while>      ::= while <test> <stmnt>
    <if>         ::= if <test> <stmnt> | if <test> <stmnt> else <stmnt>
    <test>       ::= <test> and <test> | <test> or <test> | not <test> | <comparison>
    <comparison> ::= <e> < <e> | <e> <= <e> | <e> > <e> | <e> >= <e> | <e> = <e> | <e> != <e>
    <e>          ::= id | c | <e> + <e> | <e> - <e> | <e> * <e> | <e> / <e> | <e> mod <e> | - <e>

This language has terminals

    while if else and or not <- < <= > >= = != + - * / mod id c skip

The first three are reserved words to indicate the control statements and mark the second branch of the if statement. The next three are reserved words for the boolean operators. The next one is the assignment operator. The next six are numerical comparison operators. The next five are the arithmetic operators that return integer results. Note that - is overloaded as a binary and a unary operator and that mod is for the integer remainder of a division. The next two, id and c, are for identifiers and integer constants. We will assume that these have attributes for the actual string of the variable identifier (id.s) and the actual value of the integer constant (c.v). The last terminal, skip, is a “do nothing” statement that has the same meaning as x <- x. It has a technical use below.

You will note the obvious ambiguity in the replacements for the expression syntactic category, <e>. Similarly, the <test> variable for boolean valued expressions has an ambiguous collection of replacements.

Expression evaluation in this language, for both boolean valued expressions and integer valued expressions, does not have side effects to the environment, which is unlike constructs such as ++n and similar expressions of C and its descendants.
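The side-effect-free evaluation of <e> and <test> trees can be sketched as a pair of recursive functions. This is only an illustration under stated assumptions: the tuple encoding, the tag "neg" for unary minus, and the names eval_e and eval_test are hypothetical, and / is assumed to be floor division, which the text does not specify.

```python
import operator
from typing import Dict

Env = Dict[str, int]

# Hypothetical AST encoding (an assumption of this sketch):
#   ("id", s)  variable named s (id.s)      ("c", v)  integer constant v (c.v)
#   ("neg", e) unary minus                  (op, e1, e2) binary arithmetic
#   ("not", t), ("and", t1, t2), ("or", t1, t2), (cmp, e1, e2) for tests
ARITH = {"+": operator.add, "-": operator.sub, "*": operator.mul,
         "/": operator.floordiv,          # assumption: / is floor division
         "mod": operator.mod}
COMPARE = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
           ">=": operator.ge, "=": operator.eq, "!=": operator.ne}


def eval_e(e, sigma: Env) -> int:
    """Value of an <e> tree in environment sigma; sigma is only read."""
    tag = e[0]
    if tag == "id":
        return sigma[e[1]]
    if tag == "c":
        return e[1]
    if tag == "neg":
        return -eval_e(e[1], sigma)
    if tag in ARITH:
        return ARITH[tag](eval_e(e[1], sigma), eval_e(e[2], sigma))
    raise ValueError(f"unknown expression node: {tag!r}")


def eval_test(t, sigma: Env) -> bool:
    """Truth value of a <test> tree in environment sigma; sigma is only read."""
    tag = t[0]
    if tag == "not":
        return not eval_test(t[1], sigma)
    # Python's and/or short-circuit; the text does not say whether the
    # toy language's operators do.
    if tag == "and":
        return eval_test(t[1], sigma) and eval_test(t[2], sigma)
    if tag == "or":
        return eval_test(t[1], sigma) or eval_test(t[2], sigma)
    if tag in COMPARE:
        return COMPARE[tag](eval_e(t[1], sigma), eval_e(t[2], sigma))
    raise ValueError(f"unknown test node: {tag!r}")


sigma = {"result": 1, "base": 3, "exponent": 5}
is_odd = ("=", ("mod", ("id", "exponent"), ("c", 2)), ("c", 1))
print(eval_test(is_odd, sigma))                                # True
print(eval_e(("*", ("id", "result"), ("id", "base")), sigma))  # 3
```

Neither function ever writes to sigma, which is the sketch's counterpart of the statement that expression evaluation has no side effects on the environment.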
The only action that changes the environment is the execution of the assignment statement, and it changes only a single variable’s value. If we have an environment σ ∈ E and a statement id <- <e> that is executed in the environment σ and n is the value that <e> evaluates to in σ (which we might indicate with σ(<e>) by extending σ to be a map from not just the variables, but from all expressions), it is common to use a notation like σ[id.s ↦ n] or σ[n/id.s] to indicate the new environment that results from the assignment. It is exactly like σ except at the variable id.s, which it maps to n.

We could lift the space of integer values to include ⊥ to accommodate the result of dividing by zero, or taking the remainder of a division by zero, and have the environment associate variables to values in the lifted space. We could then propagate the error to have the result of assigning ⊥ to a variable be the undefined environment. Those are design decisions for the language designer who is specifying the meaning of the language.

Here is a simple program in this language that calculates into result the value of base raised to the power exponent, if exponent is not negative, or just 1 if it is.

    result <- 1
    while exponent > 0
        if exponent mod 2 = 1
            result <- result * base
        base <- base * base
        exponent <- exponent / 2

Because the punctuation is mostly thrown out, I have used indentation to indicate the subordination of statements. The program has two statements, the first one an assignment statement, and the second one a while statement. The while loop body has three statements, the first one an if with no else branch, and then two assignment statements.

It would be very difficult for me to draw the abstract syntax tree for this program in LaTeX, but I did draw it by hand and it has 63 nodes. Since it has 28 terminals, it has 35 variable nodes.
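The update notation σ[id.s ↦ n] and the example program can be tried out with a short sketch. It is a hand translation rather than an interpreter for the grammar, and it rests on assumptions: the helper names update and power_program are hypothetical, and / is taken to be integer division (Python's //), which agrees with truncation for the non-negative values that occur here.

```python
from typing import Dict

Env = Dict[str, int]


def update(sigma: Env, name: str, n: int) -> Env:
    """Return sigma[name ↦ n]: exactly like sigma except that name maps to n."""
    new_sigma = dict(sigma)
    new_sigma[name] = n
    return new_sigma


def power_program(sigma: Env) -> Env:
    """Hand translation of the example program, one update per assignment."""
    sigma = update(sigma, "result", 1)
    while sigma["exponent"] > 0:
        if sigma["exponent"] % 2 == 1:
            sigma = update(sigma, "result", sigma["result"] * sigma["base"])
        sigma = update(sigma, "base", sigma["base"] * sigma["base"])
        # Assumption: / on integers means integer division.
        sigma = update(sigma, "exponent", sigma["exponent"] // 2)
    return sigma


start = {"base": 3, "exponent": 5}
final = power_program(start)
print(final["result"])   # 243
print(start)             # {'base': 3, 'exponent': 5}
```

Starting from base = 3 and exponent = 5, the loop runs three times and leaves result = 243, while the initial environment is unchanged because every assignment builds a new environment.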
Once you accept that the while loop and if contain the statements given by the indentation, the only ambiguity comes from the <stmnt> ::= <stmnt><stmnt> replacement, since we could group the three statements of the while loop body as two followed by one or one followed by two.
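The two groupings can be compared in a small sketch. It assumes that the meaning of a terminating statement is modeled as a Python function from environments to environments and that sequencing is function composition; the helper names assign, if_then, and seq are hypothetical. Under that model composition is associative, so both groupings of the loop body produce the same environment.

```python
from typing import Callable, Dict

Env = Dict[str, int]
Stmt = Callable[[Env], Env]   # meaning of a statement, assumed to terminate


def assign(name: str, expr: Callable[[Env], int]) -> Stmt:
    """Meaning of  name <- expr, with expr given as a function of the environment."""
    def run(sigma: Env) -> Env:
        new_sigma = dict(sigma)
        new_sigma[name] = expr(sigma)
        return new_sigma
    return run


def if_then(test: Callable[[Env], bool], body: Stmt) -> Stmt:
    """Meaning of an if with no else branch."""
    return lambda sigma: body(sigma) if test(sigma) else sigma


def seq(first: Stmt, second: Stmt) -> Stmt:
    """Meaning of <stmnt><stmnt>: run first, then second."""
    return lambda sigma: second(first(sigma))


# The three statements of the while loop body in the example program.
s1 = if_then(lambda s: s["exponent"] % 2 == 1,
             assign("result", lambda s: s["result"] * s["base"]))
s2 = assign("base", lambda s: s["base"] * s["base"])
s3 = assign("exponent", lambda s: s["exponent"] // 2)  # assumption: integer division

two_then_one = seq(seq(s1, s2), s3)
one_then_two = seq(s1, seq(s2, s3))

sigma = {"result": 1, "base": 3, "exponent": 5}
print(two_then_one(sigma))   # {'result': 3, 'base': 9, 'exponent': 2}
print(one_then_two(sigma))   # {'result': 3, 'base': 9, 'exponent': 2}
```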