Agile Parsing to Transform Web Applications

Agile Parsing to Transform Web Applications Thomas Dean1 and Mykyta Synytskyy2 1 Electrical and Computer Engineering, Queen’s University [email protected] 2 Amazon.com [email protected] Abstract. Syntactic analysis lies at the heart of many transformation tools. Grammars are used to provide a structure to guide the application of transformations. Agile parsing is a technique in which grammars are adapted on a transformation by transformation basis to simplify transformation tasks. This paper gives an overview of agile parsing techniques, and how they may be applied to Web Applications. We give examples from several transformations that have been used in the Web application domain. 1 Introduction Many program transformation tools [3,4,6,7,8,21,25] are syntactic in nature. That is, they parse the input into an abstract syntax tree (AST) or graph (ASG) as part of the transformation process. There is good reason for this: syntax is the framework on which the semantics of most modern languages is defined. One example is denotational semantics [24], which maps the syntax to a formal mathematical domain. The syntax of the language, and thus the structure of the tree or graph used by these tools is determined by a grammar. The grammar is typically some form of context free grammar, although it might be augmented with other information [1,2]. In most cases, the transformation rules that may be applied to the input are constrained by the resulting data structure. Grammars are used by transformation languages to impose a structure on the input that can be used by the rules. These grammars are general grammars and represent the compromise of authoring the best grammars that are suitable for a wide variety of tasks. However, for some tasks, minor changes to the grammar can make significant differences in the performance of the transformations. We call changing the grammar on a transformation by transformation basis agile parsing [13]. One particular area of analysis that agile parsing is useful for is web applications. Web applications are written in a variety of languages, and are the subject of a variety of analysis and transformation tasks. We have used agile parsing techniques on several web application transformation projects [11,14,16,22,23,26,27]. We do not discuss the details of the transformations as they are covered elsewhere. Instead we focus on how grammars can be crafted for the web applications and how they are customized for each of the particular transformation tasks. R. Lämmel, J. Saraiva, and J. Visser (Eds.): GTTSE 2005, LNCS 4143, pp. 312 – 326, 2006. © Springer-Verlag Berlin Heidelberg 2006 Agile Parsing to Transform Web Applications 313 2 Agile Parsing Agile parsing means customizing the grammar to the task. It is a collection of techniques from a variety of sources that we have found useful for transformation tasks. In this section, we give an introduction to agile parsing, how it is accomplished in the TXL programming language and a quick overview of some of the techniques. The parsing techniques generally serve several needs. One is to change the way the input is parsed. One example is that in the C reference grammar the keyword typedef is simply given in the grammar as a storage specifier, eliminating any syntactic difference between variable and type declarations. That is, there is a single non-terminal declaration, that consists of a sequence of declaration specifiers followed by the declarator list. A declaration of a global variable and a type definition using typedef are parsed as the same non-terminal. For many transformations, this is a reasonable choice. However a transformation that targets only type definitions must include conditions in the rule to match only those declarations that include the typedef keyword. Splitting the declaration productions into two non-terminals, one which requires a typedef keyword, and one that does not, lets the parser make the distinction. Those rule sets that depend on the distinction become much simpler since they no longer have to code the distinction directly as conditions in the rule. They can simply target the appropriate non-terminal. The change to the grammar is local and only applies for the given transformation task. The second need is to modify the grammar to allow information to be stored within the tree. An example is to modify the grammar to allow us to add XML style markup to various non-terminals in the grammar. While some transformation systems permit metadata to be added to the parse tree, sometimes data more complex than can be handled in the metadata is more appropriate for the problem. Thus the markup can be used to store temporary information part way through the transformation. Markup can also be used as a vehicle for communicating the results of an analysis to the user. The third need is to robustly parse a mixture of languages such as embedded SQL in COBOL[18] or in this case, dynamic scripting languages in HTML. The last need is to unify input and output grammars so that the rules may translate from one to the other while maintaining a consistent and well typed parse tree. 2.1 Agile Parsing in TXL TXL is a pure functional programming language particularly designed to support rule- based source-to-source transformation [7,8,10]. Each TXL program has two parts: a structure specification of the input to be transformed, which is expressed as an unrestricted context-free grammar; and a set of one or more transformation rules, which are specified as pattern/replacement pairs to define what actions will be performed on the input. Each pattern/replacement pair in a transformation rule is specified by example, and may be arbitrarily parameterized for more rule flexibility. Figure 1 shows a simple TXL grammar for expressions. Square brackets denote the use of a non-terminal. Prefixing a non-terminal symbol with the keyword repeat denotes a sequence of the non-terminal and the vertical bar is used to indicate alternate productions. The terminal symbols in the grammar are either literal values, such as the operator symbols in Fig. 1, or general token classes which are also referenced using square brackets ([id] for identifier in Fig 1). 314 T. Dean and M. Synytskyy define expression define term [term] [repeat addop_term] [factor] [repeat mulop_factor] end define end define define addop_term define mulop_factor [add_op] [term] [mul_op] [factor] end define end define define add_op define factor + | - [id] end define end define Fig. 1. Simple Expression Grammar in TXL Agile parsing is supported in TXL through the use of the redefine keyword. The redefine keyword is used to change a grammar. Figure 2 shows an example. In this example, the factor grammar production is changed to include XML markup on the identifier. The ellipses (‘...’) in the factor redefinition indicate the previous productions for the factor non-terminal. Thus a factor is whatever it was before, or it may be an identifier annotated by XML markup. Subsequent transformation programs may choose to insist that all such identifiers be marked up when parsing the input by removing the first line and the vertical bar from the factor redefinition. The TXL processor first tokenizes the input using a standard set of tokens which may be extended by the grammar definition, parses the input using the grammar and then applies the rules. In TXL, the rules are constrained by the grammar to keep the tree well formed. Thus the grammar includes not only the productions for parsing the input, but also the productions for the output and intermediate results. This may lead to ambiguities which are resolved through the use of an ordered parse. That is, the order of the rules in the grammar is used to determine the form of the input. 3 Agile Parsing in Web Applications Web applications deliver services over the HTTP protocol, usually to a browser of some sort. Early web applications provided dynamic content to standard web browsers using server side scripting such as CGI scripts, servlets, JSP and ASP. Recent innovations include Real Simple Syndication (RSS), SOAP and AJAX. redefine factor define XMLEnd ... </[id]> | [XmlStart][id][XMLEnd] end define end redefine define XmlStart define XMLParm <[id] [repeat XMLParm]> [id] = [stringlit] end define end define Fig. 2. Adding XML Markup to Identifiers in Expressions Agile Parsing to Transform Web Applications 315 Web applications present a particular challenge for parsing and for transformation. While some of the files in the web application contain conventional languages such as Java, the core files of the web application (the web pages) are typically comprised of a mixture of several languages. The web pages typically contain HTML, augmented with JavaScript, and server side languages such as Java, Visual Basic, or PHP. Other components of the application may be in XML such as various configuration files, or the XML transfer schema such as that used in SOAP. For some web transformation tasks, it may be possible to convert to a single language [12], but if it is desired to translate between client and server side technologies, we must be able to represent all of the languages in a single parse. A grammar for web applications must also be robust. Most browsers accept malformed HTML and our approach must also accept and deal with the malformed HTML. 3.1 Island Grammars and Robust Parsing The first agile parsing technique is island grammars [5,16,18,19,23]. Island grammars, originally introduced by van Deursen et al. [5], are a technique where the input is divided into interesting sequences of tokens, called islands and uninteresting tokens called water. Grammar productions that recognize the islands are given precedence over the water grammar productions which typically match any token or keyword.

Load more