Towards Environment Re-Targetable Parser Generators 2

This is page 1 Printer: Opaque this Towards Environment Re-targetable Parser Generators K. Kontogiannis J. Mylop oulos S. Wu ABSTRACT One of the most fundamental issues in program understanding is the is- sue of representing the source co de at a higher level of abstraction. Even though many researchers haveinvestigated a variety of Program Represen- tation schemes, one particular scheme, the Abstract Syntax Tree (AST), is of particular interest for its simplicity, generality, and completeness of the information it contains. In this pap er weinvestigate ways to generate Abstract Syntax Trees that conform with a user de ned domain mo del, can b e easily p orted to di erent CASE to ols, and can b e generated using publicly available parser generators. The work discussed in this pap er uses PCCTS and yacc as parser generators, and provides an integration mech- anism with the conceptual language Telos, and the CASE to ols Rigi, PBS, TM and Re ne . 1 Intro duction Extracting and representing the syntactic structure of source co de is gen- erally acknowledged to b e one of the most imp ortant activities in program understanding. This activity is considered a prerequisite to b oth mainte- nance and reverse engineering of legacy source co de [Till96]. Traditionally, the activity consists of two steps, parsing and lexical analysis. However, unlike traditional syntax-directed approaches used by compilers, parsing and lexical analysis for program understanding purp oses, is carried out with resp ect to a language sp eci c domain mo del which describ es the syntactic and semantic constructs of the underlying programming languages. The use of domain mo dels by parsers and parser generators is not new. In [Arango91], [Marko94] source co de analysis techniques for legacy co de which pro duce Abstract Syntax Trees (AST) on the basis of domain mo dels are discussed. In [Kotik89] the Re ne re-engineering environment allows for the de- velopment of parsers based on domain mo dels Similarly,in[Devambu92], [Devambu94] GENOA a language indep endent application generator is presented. GENOAcanbeinterfaced to the front-end of a language compiler that pro duces an attributed parse tree. This can b e achieved by writing 1. Towards Environment Re-targetable Parser Generators 2 an interface sp eci cation using a companion system to GENOA, called GENI I;. The advantage of GENOA/GENI I is that to ols for program understanding, metrics, reverse engineering and other applications can b e built, without having to write a complete parser/linker for the source language in which the application is written. However, most commercial to ols do not provide a programming interface (API) and therefore the generated Abstract Syntax Trees can only accessed and pro cessed in a limited way for any complex re-engineering or program understanding task. Within this framework, the ob jectives of this pap er fo cus on three directions. The rst is to rep ort on techniques whichmake it p ossible to generate annotated Abstract Syntax Trees at various levels of granularity with resp ect to the parsed source co de. The second is to prop ose a metho dology for simplifying the pro cess of developing a grammar and a parser for a given programming language. The third is to rep ort on an architecture that facilitates the exchange of source co de related information b etween di erent software engineering analysis to ols through the use of domain mo dels, schemata, and p ersistent software rep ositories. Towards these ob jectives, we exp erimented with the use of domain mo d- els by di erent parser generator environments. The rst one is the PCCTS [Par96], [Sta97] parser generator environment and the second is the C pub- lic domain parser generator yacc. On-going work rep orted also in this pa- per involves the use of XML for annotating source co de using the JavaCC parser generator [JavaCC]. The work describ ed in this pap er was carried out within the context of asoftware re-engineering pro ject whose ultimate ob jective is to establish a generic and op en software re-engineering environment. The environment should allow for di erent CASE to ols to b e integrated in terms of suitable intermediate representations and a p ersistent rep ository of multi-faceted information ab out a legacy system. Earlier rep orts on the pro ject include [Buss94] and [Finni97]. This pap er is organized as follows. In section 2 we rep ort on di erent techniques for pro ducing source co de representations tailored for program analysis and program understanding. In section 3 we discuss the rationale b ehind the use of the Abstract Syntax Trees (ASTs) as a representation of source co de structure. In section 3 we outline the approachwe adopted in order to construct customizable and domain re-targetable ASTs. Section 4 describ es how our approachworks by applying it to a small C co de fragment. In section 5, we discuss advantages and limitations of our approach. In section 6 we present alternative parsing techniques based on XML, and in section 7 we discuss the design of an integration environmentforcom- municating data b etween the system presented in this pap er and various CASE to ols. Finally, section 8 o ers a brief summary of the results rep orted in this pap er, as well as a p ersp ective for further work. 1. Towards Environment Re-targetable Parser Generators 3 2 Parsing for Program Analysis Traditionally, parsing has b een seen as a to ol for pro ducing internal representations of source co de with the purp ose of generating executable binaries for a sp eci c target op erating platform. In contrast, parsing for program understanding aims on pro ducing representations of the source co de that can b e used for program analysis and design recovery purp oses. In this resp ect, program representation aims on facilitating source co de analysis that can b e applied at various levels of abstraction and detail, namely at: the physical level where co de artifacts are represented as tokens, syntax trees, and lexemes, the logical level where the software is represented as a collection of mo dules and interfaces, in the form of a program design language, annotated tuples, aggregate data and control ow relations, the conceptual level where software is represented in the form of abstract entities such as, ob jects, abstract data typ es, and communi- cating pro cesses. These representations are achieved by parsing the source co de of the system b eing analyzed at various levels of detail and granularity. Sp eci cally, over the past yearsanumb er of parsing to ols have b een prop osed and used for re-do cumentation, design recovery and, architectural analysis. Overall, the parsing technology for program understanding can b e classi ed in three main categories. The rst category, deals with full parsers that pro duce complete ASTs of a given co de fragment. The second category fo cuses on scanners that pro duce partial ASTs or emit facts related to control and data ow prop- erties of a given source co de fragment. Finally, the third category deals with regular expression analyzers that extract high level syntactic information from the source co de of a given system. Full parsers, are used for detailed source co de analysis as they provide low level detailed information that can b e obtained from the source co de. This includes data typ e information, linking information, conditional com- pilation directives, and pre-pro cessed libraries and macros. In this context, a full parser generator called DIALECT is prop osed in [Kotik89]. The approach uses a domain mo del as well as yacc and lex to pro duce a parser given a BNF grammar. The AST is represented in terms of CLOS ob jects in dynamic memory and can b e accessed by LISP-likea programming interface (API) written for the CASE to ol Re ne. In [Datrix00], another full parser for C++ source co de is presented. The parser emits an AST representation that is suitable for source co de analysis and reverse engineering of systems written in C++. A domain mo del for C++ provides the schema that describ es the structure of the generated 1. Towards Environment Re-targetable Parser Generators 4 AST. Moreover, programming interface provides access to the AST and the ability to exp ort detailed information in the form of RSF tuples [Muller93]. Similar to the approachabove, a C/C++ parser generator that pro- duces a fully annotated AST representation of the co de is presented in [Devambu92]. The source co de representation can b e accessed by GEN++ for sp ecifying C++ analysis and GENOA that provides a language inde- pendent parse tree querying framework. Finally, another technique that is gaining p opularity due to the emer- gence of mark-up languages is to annotate the source co de using XML. Do cumentTyp e De nitions (DTDs) can b e automatically built by analyz- ing the grammar of a given programming language. Parser generators such as JavaCC can b e used to emit XML tagged co de by altering the semantic actions in a given grammar sp eci cation. Currently, DTDs for C, C++, and Javahave b een prop osed [Mamas00] and are available through the Con- sortium for Software Engineering Research (CSER). Once XML annotated source co de is parsed, AST-like representations can b e generated. These AST-like structures are called DOM trees. In the second category of parsing techniques, scanners are used for extracting high level information from a software system.

Load more