Scope Rules for Block Structured Languages: The Block Language

João Saraiva
27 May 2005

This document presents an attribute grammar that specifies the scope rules of block-structured languages. These rules are based on the name analysis rules of the programming language Algol 68. This document has the following objectives:

• A brief introduction to the LRC system.
• A concise introduction to Attribute Grammars.
• The graphical representation of Abstract Syntax Trees (as XML trees).
• The specification of GUIs via attribute grammars.

The PDF version of this document is available as block.pdf at http://www.di.uminho.pt/~jas/Research/LRC/examples/Block/block.pdf.

1 The Block Language

Consider a very simple language that deals with the scope rules of a block-structured language: a definition of an identifier x is visible in the smallest enclosing block, with the exception of local blocks that also contain a definition of x. In this latter case, the definition of x in the local scope hides the definition in the global one.

We shall analyse these scope rules via a toy language: the Block language. A sentence in Block consists of a block, and a block is a (possibly empty) list of statements. A statement is one of three things: a declaration of an identifier (such as decl a), the use of an identifier (such as use a), or a nested block. Statements are separated by the punctuation symbol ";" and blocks are surrounded by square brackets. A concrete sentence in this language looks as follows:

    sentence = [ use x ; use y ; decl x ; [ decl y ; use y ; use w ] ; decl y ; decl x ]

This language does not require that declarations of identifiers occur before their first use. This is the case for the first two applied occurrences, those of x and y: they refer to their (later) definitions in the outermost block. Note also that the local block defines a second identifier y.
Consequently, the second applied occurrence of y (in the local block) refers to the inner definition and not to the outer one. Within a block, however, an identifier may be declared at most once. So, the second definition of identifier x in the outermost block is invalid. Furthermore, the Block language requires that only defined identifiers be used. As a result, the applied occurrence of w in the local block is invalid, since w has no binding occurrence at all.

We aim to develop a program that analyses Block programs and computes a list containing the identifiers that do not obey the rules of the language. Thus, this program, called block, is a static semantic analyser for the Block language. It has the following type:

    block :: Prog -> [Name]

where Name is the type of the Block identifiers. In order to make the problem more interesting, and also to make it easier to detect which identifiers are being incorrectly used in a Block program, we require that the list of invalid identifiers follows the sequential structure of the input program. Thus, the semantic meaning of processing the example sentence is [w,x], since the use of w occurs before the duplicate declaration of x, i.e.:

    block sentence = [w,x]

Next, we shall describe the program block in the traditional attribute grammar paradigm. First, we define the concrete and the abstract syntax of Block via two context-free grammars. After that, we define the semantics of the language by extending the grammar with attributes and attribute equations.

2 Regular Expressions for Lexical analysis

In order to define the concrete syntax of the Block language, we need to define its set of terminal symbols. Usually, terminal symbols are easily and concisely defined by regular expressions. Next, we present the regular expressions defining the reserved keywords of the language and the notation used for its identifiers. In LRC, regular expressions are given a name so that they can be referenced later. The notation used for regular expressions is the Unix one.
    USE        : < "use" >;
    DECL       : < "decl" >;
    IDENTIFIER : < [a-zA-Z][a-zA-Z0-9_]* >;

Punctuation symbols, like ; or [, are also terminal symbols. We omit their definition via regular expressions here, since they can be defined directly in the context-free grammar (they consist of a single character).

Traditionally, lexical analysers remove "spaces" from the input. In the LRC specification language there is a special named regular expression, WHITESPACE, in which we can define regular expressions whose matched characters are not returned to the parser.

    WHITESPACE : < [\ \t\n]+ >;

3 Context-free grammar for syntactic analysis

The axiom defines a block surrounded by square brackets and the non-terminal stats defines the statements of the blocks.

    prog { syn P ast; };
    prog ::= (stats)
               { $$.ast = Root(stats.ast); }
           ;

The body of a block is either empty, which is defined by the empty production, or non-empty, in which case it is defined by the non-terminal lststs.

    stats { syn Its ast; };
    stats ::= (lststs)
                { $$.ast = lststs.ast; }
            | ()
                { $$.ast = NilIts(); }
            ;

Non-terminal lststs defines a sequence of one or more non-terminal symbols stat separated by the literal terminal ;.

    lststs { syn Its ast; };
    lststs ::= (stat)
                 { $$.ast = ConsIts(stat.ast, NilIts()); }
             | (stat ';' lststs)
                 { $$.ast = ConsIts(stat.ast, lststs$2.ast); }
             ;

The non-terminal stat defines the three statements of Block: definitions, uses and nested blocks.

    stat { syn It ast; };
    stat ::= (USE name)
               { $$.ast = Use(name.ast); }
           | (DECL name)
               { $$.ast = Decl(name.ast); }
           | ('[' stats ']')
               { $$.ast = Block(stats.ast); }
           ;

    name { syn Id ast; };
    name ::= (IDENTIFIER)
               { $$.ast = Ident(IDENTIFIER); }
           ;

The Block concrete grammar is used for the syntactic analysis of the Block language. It guides the derivation process of each concrete sentence of the language. It also assigns a unique concrete syntax tree to each syntactically valid sentence of the language.
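
To make the productions above more tangible, here is a minimal recursive-descent sketch in Python. It is only an illustration: it is not LRC notation and is not meant to reflect how LRC itself builds parsers. It mirrors the non-terminals prog, stats, lststs and stat as printed above and builds the tree described by the semantic actions. The class names Use, Decl, Block and Root are borrowed from those actions; the helper names tokenize, Parser and parse are hypothetical, a plain Python list stands in for the ConsIts/NilIts list, and identifiers are kept as plain strings rather than Ident nodes.

```python
import re
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Use:
    name: str


@dataclass
class Decl:
    name: str


@dataclass
class Block:
    items: List["It"]


@dataclass
class Root:
    items: List["It"]


It = Union[Use, Decl, Block]

# Keywords are tried before identifiers; leading whitespace is skipped,
# which plays the role of the WHITESPACE definition.
TOKEN = re.compile(r"[ \t\n]*(?:(use\b|decl\b)|([a-zA-Z][a-zA-Z0-9_]*)|([\[\];]))")


def tokenize(text):
    """Return a list of (kind, value) pairs; raise SyntaxError on bad input."""
    text = text.rstrip(" \t\n")
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input near offset {pos}")
        keyword, ident, punct = m.groups()
        if keyword:
            tokens.append((keyword.upper(), keyword))
        elif ident:
            tokens.append(("IDENTIFIER", ident))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind!r}, found {self.peek()!r}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def prog(self):  # prog ::= (stats)
        tree = Root(self.stats())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def stats(self):  # stats ::= (lststs) | ()
        if self.peek() in ("USE", "DECL", "["):
            return self.lststs()
        return []

    def lststs(self):  # lststs ::= (stat) | (stat ';' lststs)
        items = [self.stat()]
        while self.peek() == ";":
            self.expect(";")
            items.append(self.stat())
        return items

    def stat(self):  # stat ::= (USE name) | (DECL name) | ('[' stats ']')
        kind = self.peek()
        if kind == "USE":
            self.expect("USE")
            return Use(self.expect("IDENTIFIER"))
        if kind == "DECL":
            self.expect("DECL")
            return Decl(self.expect("IDENTIFIER"))
        self.expect("[")
        body = self.stats()
        self.expect("]")
        return Block(body)


def parse(text):
    return Parser(tokenize(text)).prog()
```

For instance, parse("use x ; [ decl y ; use w ] ; decl x") yields a Root with three items, the second being a nested Block. Because the sketch follows the production prog ::= (stats) literally, a sentence written entirely inside square brackets is parsed as a Root holding a single Block. A trailing ";" is rejected, since lststs requires a statement after every separator.
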
The concrete syntax trees are automatically generated by the LRC system (see chapter ???).

However, syntactically valid sentences may fail to correspond to semantically valid sentences of the language. Such sentences violate the semantic rules of the language. Consider, for example, the sentence sentence: it is syntactically correct because, as we said before, it can be generated by the previous CFG, but it is semantically incorrect, because two of its identifiers (x and w) violate the rules of the language.

To describe the semantic rules of the language we focus on its abstract structure. So, before we extend our grammar with attributes and attribute equations, we have to define the abstract syntax of the Block language. Let us start by informally defining the abstract structure of a Block sentence: an abstract sentence in Block is a list of statements, where each statement is the declaration or the use of an identifier, or a nested block. The syntactic property that statements are separated by a punctuation symbol is irrelevant for the abstract structure of the language. Formally, we define this abstract language by the following productions:

    P   : Root (Its) ;
    Its : NilIts ()
        | ConsIts (It Its) ;
    It  : Use (Id)
        | Decl (Id)
        | Block (Its) ;
    Id  : Ident (STR) ;

As expected, the literal symbols, e.g., decl, use, [, ] and ;, are not mentioned in the abstract grammar. Observe also that in the concrete grammar we used two non-terminal symbols to specify that the body of a block is a possibly empty list of statements separated by the character ";". As we said above, this is irrelevant for the abstract structure of the language, where non-terminal Its simply defines a list of statements.

3.1 Root Symbol(s)

In the LRC specification language we have to explicitly define the root symbol of the context-free grammars defining the concrete and the abstract syntax of the underlying language. Let us start with the abstract grammar.
To define the root symbol of this grammar we explicitly declare the root non-terminal with the keyword root. In the Block (abstract) language we have the following root:

    root P;

This can also be seen as the type of the abstract syntax tree. In other words, the result of parsing a sentence is a tree of type P.

Let us now consider the concrete grammar. Here we have to define not only the root symbol, but also the type of the result of the parser and the name of the attribute where that tree is computed. In LRC this is defined as follows:

    P ~ prog.ast;

We read it as follows: we (may) start deriving sentences from non-terminal symbol prog and we construct an abstract syntax tree of type P in attribute ast.

In LRC we can have several root symbols, meaning that we are able to parse sub-languages of Block as well. Next, we declare the so-called parser entry points.

    Its ~ stats.ast;
    It ~ stat.ast;
    Id ~ name.ast;

The possibility of having several "root symbols" will be an important feature of the parsers when we build programming environments (see chapter ??). To build batch tools (e.g., compilers) this feature is not so relevant. Note that other systems, such as Yacc, can fall back on a default rule that takes the non-terminal defined by the first production in the specification as the root symbol.

4 Pretty Printing/Unparsing Rules

To pretty print (or unparse) the abstract syntax tree we have three different approaches:

• We may use a predefined LRC notation for unparsing.
• We may use the xml4free tool (developed with LRC), which automatically produces attribution rules to compute XML trees (and their graphical representation).
• We may use an attribute grammar for (generic) pretty printing. This pretty printing attribute grammar is distributed in the AGLIB.

Next, we will present how to use the predefined unparsing rules for the Block language.
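
Before turning to the LRC notation, it may help to see what unparsing amounts to in a general-purpose language: a traversal of the abstract syntax tree that reinserts the literal symbols the abstract grammar leaves out. The following Python sketch is an assumption-laden illustration, not one of the three LRC approaches listed above. The class names are modelled on the abstract productions Root, Use, Decl and Block; Python lists stand in for Its, plain strings stand in for Id, and the function names unparse, unparse_items and unparse_item are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Use:
    name: str


@dataclass
class Decl:
    name: str


@dataclass
class Block:
    items: List["It"]


@dataclass
class Root:
    items: List["It"]


It = Union[Use, Decl, Block]


def unparse_items(items):
    """Join the statements of a list with the ';' separator."""
    return " ; ".join(unparse_item(item) for item in items)


def unparse_item(item):
    """Reinsert the keywords and brackets for a single statement."""
    if isinstance(item, Use):
        return f"use {item.name}"
    if isinstance(item, Decl):
        return f"decl {item.name}"
    if isinstance(item, Block):
        inner = unparse_items(item.items)
        return f"[ {inner} ]" if inner else "[ ]"
    raise TypeError(f"not a Block statement: {item!r}")


def unparse(root):
    """Turn an abstract tree back into a concrete Block sentence."""
    return unparse_items(root.items)
```

Applied to a tree that holds the example sentence as a single nested block, unparse reproduces the concrete text with the same spacing as in Section 1. The separator and bracket layout are choices made for this sketch; a real pretty printer would normally also handle indentation and line breaking.
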