
Extracting GXF Models from C Code: towards LIME - next generation for Dataflow Models

Aditya S. Deshpande

August 2010

TECHNISCHE UNIVERSITEIT EINDHOVEN Department of Mathematics & Computer Science Software Engineering & Technology

Master Thesis

Extracting GXF Models from C Code

towards LIME-ng Tool-chain for Dataflow models

by

Aditya S. Deshpande

(0728718)

Supervisors:

dr. ir. Tom Verhoeff
ir. Pjotr Kourzanov
Yanja Dajsuren, PDEng.

August 2010

Preview

This thesis introduces the LIME - next generation (LIME-ng) toolchain. LIME-ng was implemented at NXP Semiconductors B.V., Eindhoven. The project was developed in the department of media and signal processing at NXP Research. The toolchain comprises four independent tools. The tools developed as part of the LIME-ng toolchain are:

• Extracting GXF Models from C code, by Aditya S. Deshpande,

• Model Transformation tool for Dataflow Model Transformations [20], by Swaraj Bhat,

• Generating C code from Platform Specific Model [42], by Nishanth Sudhakara Shetty and

• Visualization of Dataflow Models [36], by Namratha Nayak.

The first part (Part I) of this thesis gives an introduction to the project undertaken and is common to all the theses mentioned above.

Parallel computing is a form of computation in which many calculations are carried out simultaneously [18], operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). With new advances in technology, the study of parallel systems has reached new magnitudes, and the use of parallel systems for computational purposes is almost universal now. The result is a multitude of parallel systems. This might seem an asset, but it comes with its own set of drawbacks. These parallel systems are largely independent of each other, in the sense that each has its own architecture, its own memory model, and so on. A particular application may work efficiently on one parallel system but not on another. An application developer may choose to develop an application on the parallel system he considers most apt. If the application later does not work as efficiently as expected on the chosen parallel system, the developer has no choice but to move the application to another parallel system or rewrite it. This move currently compels the application developer to start the development process again from scratch.

A way to overcome these problems is to provide a level of abstraction over the existing parallel systems and parallel languages. Abstraction is especially important given the apparent diversity of communication and synchronization mechanisms found in modern embedded platforms, which often possess heterogeneous multi-cores and accelerators operating in parallel to run the necessary application. At first thought, a single layer of abstraction should solve the problem: the APIs of parallel languages can be used as an abstraction layer to bridge the gap between the different parallel systems. Unfortunately, these APIs are confined to specific architectures and cannot be easily migrated. Therefore, there is a need to introduce yet another layer of abstraction. This layer should allow the translation of the application to any parallel system, giving the application developer the capability to develop an application once in any language, translate it into any other programming language and run it on its respective parallel hardware.

The question thus is how to turn these abstractions into a practical and feasible implementation. The answer to this question leads us to the development of a comprehensive tool chain which we refer to as LIME - next generation (LIME-ng). This tool chain can create dataflow models, transform these models into various platform specific representations and then translate them into their respective parallel language. The tool chain is also equipped with visual editing and graphical support. The task of implementing LIME-ng has been divided among four work packages. The tools to be developed have the following functionalities:

1. Extracting GXF Models from C code - This tool handles the conversion of platform independent C code into a dataflow model for transformation purposes.

2. Model Transformation tool for Dataflow Model Transformations - This tool applies a series of model transformations on an input dataflow model to make it more platform specific [20].

3. Generating C code from Platform Specific Model - This tool translates the platform specific dataflow model to its corresponding C code and also provides error locating capabilities [42].

4. Visualization of Dataflow Models - This tool provides visualization and editing capabilities for the dataflow model obtained from the GXF [36].

Acknowledgement

The success of any task is incomplete without complimenting those who made it possible and whose guidance and encouragement made our effort successful. We are indebted to many people for making this project reach its logical ending.

We sincerely express our thanks and gratitude to Dr. Tom Verhoeff, who has been there for us at every cornerstone. We heartily thank ir. Pjotr Kourzanov from NXP Semiconductors, Eindhoven, whose encouragement, support, guidance and able supervision led us to the understanding and subsequent completion of this project. He has been a constant pillar of support and guidance from the preliminary level to the concluding stage. We would also like to thank Yanja Dajsuren, PDEng. from Virage Logic for her timely evaluations and constructive remarks throughout the project period. We also thank ir. Protic Zvezdan for his valuable guidance and support during the development of the Eclipse plugin. We owe our deepest gratitude to Dr. Manohara Pai M.M., our programme director, and Dr. Radhika M. Pai, H.O.D., Department of I & CT, MIT, Manipal, for providing us with their able guidance and moral support at every step of this project.

Last but not the least, we would like to thank the almighty for giving us the strength, and our families and friends for trusting us and being there during all the ups and downs in the project.

Aditya S Deshpande Namratha Nayak Nishanth Sudhakara Shetty Swaraj Bhat

Contents

Preview

I Introduction

1 Domain Information

2 Challenges
  2.1 Motivation
  2.2 Dataflow Models
  2.3 GXF and DTD

3 Existing Prototype
  3.1 LIME
    3.1.1 Compilation flow
  3.2 LIMEclipse

4 LIME-next generation
  4.1 General Architecture
  4.2 Code to Model Transformation
  4.3 Model Transformation
  4.4 Visualization
  4.5 Unparsing
  4.6 Implementation Language - Functional Programming
    4.6.1 Benefits of Functional Programming
    4.6.2 Why Scheme?

II Extracting GXF Models from C Code

5 Recognizers and Parsers
  5.1 Recognizers

6 CIGLOO and Extensions
  6.1 CIGLOO
  6.2 Analysis of the CIGLOO parser

7 Implementations
  7.1 Standardising the AST
  7.2 Formatting
    7.2.1 Pattern Matching
  7.3 Module translate read-ast
  7.4 Module SizeOf
    7.4.1 Structure Padding

8 Validation and Performance Measurements
  8.1 GLPK
  8.2 Graphviz
    8.2.1 SizeOf library
    8.2.2 ModifyAst library
  8.3 Performance Measurements
  8.4 Software metric
  8.5 Pmccabe

9 Related work
  9.1 Edison Design Group
  9.2 C Intermediate language (CIL)
  9.3 Src2srcML
  9.4 Columbus
  9.5 Design Maintenance System (DMS)

10 Conclusion and Future Work
  10.1 Conclusion
  10.2 Future Work

Bibliography

Appendices

A Structure of CIGLOO Grammar

B Graphviz Test Results

C Links to the Source Files

List of Figures

2.1 An example dataflow graph with 4 nodes and 5 edges

3.1 LIME tool-chain compilation flow [34]
3.2 A dataflow model in LIMEclipse

4.1 General Architecture of LIME-ng
4.2 Example of the visual representation of a simple Dataflow graph with three nodes and edges between them
4.3 Architecture of the tool developed

5.1 An example for lookahead

7.1 Architecture of Engine Modify module
7.2 Representation of the AST for Case 1
7.3 Representation of the AST for Case 2

8.1 Package structure of Cigloo1.1
8.2 Pmccabe tool result page
8.3 Graph of LOC vs Time taken for parsing and pretty printing
8.4 Graph of McCabe complexity vs Time taken for parsing and pretty printing

10.1 Dataflow model information as represented in a C file
10.2 AST representation of the C source code

A.1 Node hierarchy - 1
A.2 Node hierarchy - 2
A.3 Node hierarchy - 3
A.4 Node hierarchy - 4
A.5 Node hierarchy - 5

Part I

Introduction


Chapter 1

Domain Information

Software Defined Radio (SDR) refers to wireless communication in which the transmitter modulation is generated or defined by a computer and the receiver uses a computer to recover the signal intelligence. SDR allows network operators to simultaneously support multiple communication standards on one network infrastructure, without being bound by a particular standard [18]. SDR has evolved from a conceptual solution for enabling multiple radio applications on one mobile device.

SDR refers to software modules which run on a common hardware platform consisting of Digital Signal Processors (DSPs) and general purpose microprocessors. These are used to implement radio functions like generating transmission signals, detecting radio signals at the receiver, amplification and mixing. The idea is to use general purpose processors to handle the signal processing and to rely less on specific customized processors.

With SDR we can have a single device which offers features that were earlier implemented only by integrating multiple radio components. Initially, people involved with SDR conceptualized a multi-purpose device which could be pushed to its limits, i.e. they visualized a device which could handle multiple features like connecting to wireless data networks, AM/FM reception, HDTV reception and cellular connectivity. With SDR we have the upgradeability and flexibility to handle all these user requirements on a single piece of hardware.

The traditional approach involving Application Specific Integrated Circuits (ASICs) made interoperability very challenging and raised the cost of the applications. Earlier generations of wireless devices relied on the above mentioned highly customized ASICs with no consideration for scalability or adaptation to new standards. This design approach was feasible in yielding power and performance optimized solutions, but at the expense of flexibility and interoperability. An SDR base station, on the other hand, is modular, easily portable and reusable, thus reducing the changes needed to keep pace with the latest technology.

The ASIC technology leads the wireless communication industry to face problems like:

• New technological advances and new standards in wireless network communication lead to incompatibility with older devices.

• Introducing new features requires heavy customization of the devices.

• Global roaming facilities across multiple standards are not possible.

SDR, with the following unique features, is capable of resolving these issues:

• Ubiquitous connectivity - As the name implies, SDR strives to provide connectivity in any region. Different standards may co-exist in different regions, and if the device is incompatible with a certain standard then just a software module needs to be installed on the device to make it compatible with the current standard.

• Re-configurability - The essence of SDR lies in the fact that it allows multiple standards to run on the same system through different software modules. The device can then reconfigure itself to the current network type.

• Interoperability - SDR facilitates the end-user to easily use any application on their device. This is because SDR implements open architecture radio systems.

Hence, the effective way to completely harness the potential offered by wireless communication systems is through SDRs. SDR has been continuously evolving and has even spawned related technologies like Cognitive Radio. This has provided the flexibility, reusability and adaptability to the current generation of wireless communication systems. With these features and flexibility, end-users can continuously upgrade and reconfigure SDRs as new standards are released.

The programmable solutions offered by SDRs not only satisfy the operational requirements (met by custom hardware), but they have their own advantages too, such as the following:

• Capability to handle multimode operation depending on network availability, like CDMA in the USA and GSM in Europe.

• Lower cost and shorter time to market, since the hardware on the device can always be reused.

• An increase in chip volumes, because the same chip can be reused, which lowers the cost.

• Reduced debugging effort.

Chapter 2

Challenges

2.1 Motivation

Over the years, embedded real-time systems in general and automotive radio modem systems in particular have seen a tremendous change with respect to speed and complexity. Both of these domains today involve parallel computing. Powerful parallel computers give the advantage of performing tasks faster and more efficiently. Unfortunately, the pace and complexity in the development of scientific simulations, coupled with an even faster development in high performance parallel hardware, has put a great burden on software developers. A way to resolve this is to raise the level of abstraction in solutions developed in the existing Software Defined Radio domain.

The front-end of a software application handling streaming data is modeled by models like Dataflow, Kahn Process Networks (KPN), etc. The front-end of an application is designed initially and can be changed when the application designer chooses to redesign it. The back-end of an application is a bit more complex. Parallel hardware at the back-end of the application changes very rapidly, forcing the respective back-end languages being used to change at an equal pace. These languages are the actual non-abstract parallel languages used to run the application on a specific platform. Although the back-end changes quite rapidly and is beyond the control of the application designer, the changes in the front-end are less frequent, thus causing a loss of synchronization between the front-end and the back-end. These issues regarding the synchronization of the front-end and back-end cause the programmer to write a fresh implementation, or modify the existing one, to get a faster application which conforms with each new release of hardware or of a back-end programming language. This is not only tedious for the application developer, but it also costs the industry extra man-hours in the process.

What is required is a middle-end which can regulate these changes between the front-end and the back-end. Specifically, this middle-end should

be capable of converting a given front-end model onto ever-changing back-ends in a generic manner. The middle-end should introduce a layer of abstraction enabling the translation of a single model onto any back-end model and also vice-versa, i.e., the translation of multiple back-end models into a single model in harmony with the front-end.

The abstraction provided by the middle-end allows a programmer to define his program once and then run it on different parallel systems. This allows different parallel systems to be compared without rewriting an application. Also, the abstraction lets an application designer focus mainly on the application instead of the unnecessary and tricky details of the parallel systems. The C programming language that is currently widely used by embedded real-time engineers provides limited abstraction capabilities, while radical approaches to address language limitations (i.e., new functional and/or dataflow languages) have been struggling to gain practical significance.

However, some of the questions that arise are: What should be the model? How could it be converted into practical computer applications? A good answer to the first question is the well-known parallel programming model: the dataflow model. Dataflow models (Section 2.2) have a concrete theoretical background in scheduling and buffer limit determination. Hence, they are widely used by application designers to represent the abstraction model of a parallel programming system. The second question can be tackled by developing a software tool-chain which can create dataflow models from a parallel system or a programming model and, after certain processing, convert them back to the parallel systems or programming model as required. LIME is one such programming model which provides us with this capability (Section 3.1).

2.2 Dataflow Models

With the many advances in parallel systems, numerous researchers have been motivated to design several programming models and languages for parallel systems. However, many of these models and languages are not abstract and generic enough, as they are confined to the characteristics of the systems they are supposed to operate on. An abstract model such as a dataflow model defines parallelism in an explicit way by representing the potential parallel elements, such as components, in the model without actually concerning itself with the underlying parallel system. Dataflow is a well-known programming model in which a program is represented as a set of tasks with data precedence. Dataflow graphs are directed, labeled multi-graphs which consist of nodes, edges, ports, etc. Figure 2.1 shows an example of a simple dataflow graph from a theoretical perspective, where computation tasks (nodes) A, B, C and D are

represented as circles, and FIFO queues (edges) that direct data values from the output of one computation to the input of another are represented as directed arrows. From a more practical, implementation-oriented perspective, as used in the LIME prototype, the representation of dataflow graphs contains many more elements such as ports, bound-to associations and nested nodes. This will be explained in detail in the later chapters.

Figure 2.1: An example dataflow graph with 4 nodes and 5 edges

Nodes consume tokens (data) from their inputs, perform computations on them (fire) and produce a certain number of tokens on their outputs. The functions performed by the nodes define the overall function of the dataflow graph. There are several variations of dataflow models, namely SDF (Synchronous Dataflow), HSDF (Homogeneous Synchronous Dataflow), CSDF (Cyclo-static Dataflow), BDF (Boolean Dataflow) and VRDF (Variable Rate Dataflow). "Dataflow graphs are very useful specification mechanism for signal processing systems since they capture the intuitive expressivity of block diagrams, flow charts, and signal flow graphs, while providing the formal semantics needed for system design and analysis tools" [43]. Since dataflow models have a concrete theoretical background in scheduling, they form an effective mechanism to determine the run-time behavior of parallel systems [44]. In addition, they can also be used to depict the inherent parallelism in an application programmed for such systems.
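To make the firing rule concrete, the following is a minimal sketch in Scheme (the implementation language of the tool-chain). The representation of a node as a function together with a consumption rate, and of an edge as a plain list acting as a FIFO, is invented purely for this illustration and is not the representation used by LIME or LIME-ng.

    ;; Take the first n elements of a list (helper for the sketch).
    (define (take lst n)
      (if (= n 0) '() (cons (car lst) (take (cdr lst) (- n 1)))))

    ;; Fire an SDF-style node: consume rate-in tokens from the input FIFO,
    ;; apply f to them, and return a pair (produced-token . remaining-fifo).
    ;; If there are not enough tokens, the node cannot fire.
    (define (fire f rate-in fifo)
      (if (< (length fifo) rate-in)
          (cons #f fifo)
          (cons (apply f (take fifo rate-in))
                (list-tail fifo rate-in))))

    ;; Example: a node that sums two tokens per firing.
    (fire + 2 '(1 2 3 4))   ; => (3 3 4)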

2.3 GXF and DTD

Since dataflow models are used by groups of people, it is convenient to use versioning systems such as SVN or Mercurial to support team-work on a single model. As these versioning systems require the model to be in a textual format, the desired software is expected to provide a textual representation along with the graphical representation of the dataflow model. Also, by

having a textual representation, the range of transformation frameworks to select from, for the transformation of a Platform Independent Model (PIM) to a Platform Specific Model (PSM), increases.

The LIME standard for the textual representation is GXF (Graph eXchange Format), which is related to GXL (Graph eXchange Language). GXL is an XML (eXtensible Markup Language) schema that has limited expressiveness, i.e., it does not support programming with flow of control/data constructs [6]. It is an XML based standard exchange format for sharing data between tools. Formally, GXL is built to represent typed, attributed, directed, ordered graphs, which are further extended to represent hypergraphs and hierarchical graphs [32]. GXF is an XML based language providing a representation of the dataflow model which is used to transport and store data. It conforms to the GXF DTD (Document Type Definition), which defines the legal building blocks of a GXF model. The DTD defines the document structure with a list of legal elements (node, port, edge, stream, etc.) and attributes (id, xlink:label, xlink:type for a node, etc.).

There are two types of GXFs involved in the tool chain developed as a part of this project. The initial input to the tool chain consists of the preprocessed C files, which contain the structural and behavioral information (components), and the Stream GXF, which contains the connections between these components. The definition of the DTD for the Stream GXF can be found at [9]. In the tool chain, the Platform Independent Code (PIC) containing the structural information is parsed to obtain the nodes, which are inserted into the initial GXF. The Stream GXF containing the connections between the nodes is then appended to the end of this initial GXF. The definition of the DTD for this GXF can be found at [8, 21]. This GXF is first checked for conformance against the DTD and, if it complies, it acts as an input to the visualization tool and the transformation tool-chain. The resulting PSM is transformed back into Platform Specific Code (PSC), which can then be compiled and run on the specified platform.
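As an impression of what such a textual representation looks like, the fragment below sketches a tiny GXF-like graph. It is purely illustrative: the element names node, port and edge and the attributes id, xlink:label and xlink:type are taken from the description above, but the exact nesting, the remaining attributes and the way the edge refers to ports are assumptions; the authoritative structure is fixed by the GXF DTDs referenced in [8, 9, 21].

    <!-- Hypothetical GXF-like fragment, not generated by the tool-chain. -->
    <graph id="example">
      <node id="producer" xlink:label="src">
        <port id="out0" xlink:type="output"/>
      </node>
      <node id="consumer" xlink:label="sink">
        <port id="in0" xlink:type="input"/>
      </node>
      <edge id="e0" from="out0" to="in0"/>
    </graph>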

Chapter 3

Existing Prototype

3.1 LIME

LIME is a programming model which uses an application model and converts it to one of several parallel back-ends [45]. The application model in this context can be defined in a primitive version of dataflow that can be further converted to either pure dataflow models, SDF (Synchronous Dataflow) models or CSDF (Cyclo-static Dataflow) and SP (Series-Parallel) models for analysis purposes. A LIME model consists of the following elements:

• Node (actors, computation kernels or lime) - These are computation units which contain a program in ANSI C (ISO C99). Each node has a return type and a name, and it can also nest other nodes. Nodes can be either a class or an instance. An instance node is an instance of a class node. Only an instantiated node can be used in the run-time of a program.

• Port - These are used in LIME to connect nodes together. They can either be an input port or an output port or both (inout) ports in one. Each of these ports has a size attribute which shows the consumption rate (if it is an input port) or production rate (if it is an output port) of tokens.

• Edge - The ports mentioned above can connect to each other via edges. An edge can have different types. If an edge connects two nodes with different production and consumption rates, LIME instantiates the node of the smaller size until it is equal to the size of the bigger node. "This is the main source of data level parallelism in LIME [34]".

• bound-to (Associations) - These associate a port (input or output) in one node to a port of the same type and rate in a nested node.

A LIME model is defined in GXF. The graphical visualization of these models is created by the LIMEclipse GUI (explained in Section 3.2), which provides access to the graphical and textual format at the same time.

3.1.1 Compilation flow

"Compilation in LIME starts with a compiler-driver called 'slimer' (implemented as a script) which consists of several processing stages. The slimer is built as a shell script that encapsulates a set of Makefiles and other scripts which are written to allow parallel computations [34]." A pictorial representation of the compilation flow for the LIME tool chain is depicted in Figure 3.1. The 'slimer' compiles a LIME program in five steps [34]:

• Front-end Parsing - In this step the conversions of C algorithms and GXL graphs into machine readable XML format are performed using GCC (GNU Compiler Collection). LIME uses -fdump-translation-unit switch of GCC to get the parse trees thereby avoiding the creation of a new parser for C applications.

• Middle-end (ME) static analysis and scheduling - It is responsible for static task admission, mapping, grouping and scheduling [34]. In other words, it groups the actor components together to be assigned to different parallel tasks and then schedules those tasks. This uses an existing SDF analysis which is implemented in OCaml.

• Back-end code generation - In this step the final parallel system specific codes (CUDA, PThreads and NXP proprietary) are generated, followed by the production of initializer scripts and OS configurations.

• C tool-chain - In this step the computation kernels in components are compiled. It also provides the option of performing optimizations.

• Profiling and simulation - In this step the simulation tests are performed and feedback is provided to the middle-end for better scheduling and grouping for the rest of the application.

The middle-end analysis step is optional. The five components of the LIME tool-chain are implemented using ad-hoc scripting solutions based on tools such as GNU awk, sed, GCC and make. To make the tool-chain more formal, these ad-hoc scripting solutions can be rewritten in a single programming and scripting environment. Formalism is incorporated in this project by using BIGLOO Scheme, a functional programming environment.

Figure 3.1: LIME tool-chain compilation flow [34].

3.2 LIMEclipse

The front-end of LIME is LIMEclipse, which visualizes the parallel application models. It has been designed in such a way that it meets all the requirements and rules of the LIME programming model. LIMEclipse [10] [45] is an Eclipse plug-in which provides a visual editor for the dataflow model in LIME. It has been developed on top of the Eclipse Rich Client Platform (RCP) plug-ins for building rich-client applications. It also uses two other Eclipse plug-ins, namely GEF (Graphical Editing Framework) and Draw2d, provided by the Eclipse community. On the whole, the architecture of LIMEclipse is based on the Model-View-Controller (MVC) pattern [11]. The model elements visualized by LIMEclipse are similar to the LIME model (dataflow) elements.

• Components - These refer to the nodes in the LIME model. It can be either a class or an instance of a class.

• Ports - These correspond to the ports in the LIME model. It can be either an input port or an output port or both.

• Arc - It is a solid directed edge between ports which depicts the direction of dataflow.

• Association - It is a dashed bidirectional edge which associates ports of the same type, so that an internal port is not visible externally.

• Code Fragment - A note shape used to display any code fragment in the LIME model.

There is a certain set of properties associated with each of the model elements. LIMEclipse also supports the nesting of components and follows the rules stated in the LIME model. Each of the model elements defined above has a corresponding figure (i.e., a view) class defined for it, which helps to display the respective model element in the editor. There are controller classes for each model element, which connect the model class and the view class and help in reflecting changes made in the model class to the view class and vice versa. The controller handles the GXF (dataflow) model and keeps the model updated. To keep the GXF model updated, the dom4j [3] [4] parser is used, which serves two purposes. Firstly, it parses the GXF and displays it in the editor. Secondly, it updates the GXF whenever any changes are made in the editor.

Figure 3.2 shows how a simple dataflow model is displayed in the LIMEclipse editor. The illustrated dataflow model consists of a main outer component and two nested components with input and output ports, arcs, associations and a code fragment.

Figure 3.2: A dataflow model in LIMEclipse

LIMEclipse is designed in a way which allows the addition of new model elements, with their corresponding figure and controller parts, to the diagram editor. The properties of the model elements are defined by property descriptors, which are modifiable. This makes it easier for the programmer to add, remove or change the properties of an existing element, and also to define properties for new model elements. Currently, LIMEclipse includes features like the following:

1. Visualization of dataflow models in either the class mode or instance mode.

12 2. A palette, consisting of creation tools for components, ports, arcs, associations and code fragments.

3. A textual representation of a dataflow model in GXF format.

4. An Outline viewer, showing both the tree structure and a visual structure of the dataflow model visualized in the editor.

5. Support for normal zooming, where the entire dataflow model is enlarged by the selected zoom percentage.

Chapter 4

LIME-next generation

4.1 General Architecture

In the sections above, we have discussed the need for abstraction due to the increasing complexity of systems in the SDR domain, the dataflow models that provide a good level of abstraction for this domain, and the LIME tool-chain that uses this dataflow model concept and converts a platform independent application model to a platform specific model. The goal of our project was to develop a similar tool-chain that is formally defined in a common programming environment and more robust than LIME. This tool-chain should take as input Platform Independent C code and go through several intermediate transformations to finally result in Platform Specific C code that can be compiled into a binary and run on the intended hardware.

The tool chain developed is named LIME - next generation, or LIME-ng. The tool has been so named because it derives its essence and motivation from the existing LIME tool. LIME-ng takes C source files and connection information obtained from the GXF stream, parses them to obtain an AST which is then transformed into a Platform Independent GXF model. This GXF in turn undergoes a series of transformations (either manually using a visualization tool or automatically) to obtain a fine grained Platform Specific GXF model. This GXF is then transformed back into an AST and unparsed to generate the corresponding Platform Specific C code.

LIME-ng consists of four independent tools that are integrated in a particular order to achieve the above functionality of converting a PIC to a PSC. Figure 4.1 shows the high level view of the tool-chain flow. The proposed tool chain uses the architecture of LIME as a prototype (refer to Figure 3.1).

• The Platform Independent Code (PIC) to Platform Independent Model (PIM) tool generates a machine readable GXF file based on the given

Figure 4.1: General Architecture of LIME-ng

C source files and the GXF streams. The functionality of this tool is similar to the Front-End Parsing engine of the LIME tool chain.

• The PIM to Platform Specific Model (PSM) tool takes as input a GXF file produced by the previous tool and instantiates the nodes, replicates them when necessary to increase parallelism, groups the nodes to different parallel tasks and schedules these tasks using a sequence of transformations. This tool performs similar functionality as the Middle-End static analysis and scheduling engine of the LIME tool chain.

• Visualization tool takes as input a GXF file and builds a visual representation of the graph. It also provides the option of manually editing the graph to customize the model transformation. This tool is built on similar grounds as LIMEclipse.

• The PSM to Platform Specific Code (PSC) tool generates the final parallel system specific code. This tool performs similar functionality as the Back-End code generation engine of the LIME tool-chain.

The tool chain is developed by taking inputs from Model-Driven Architecture (MDA) approaches, which involve defining a PIM and its automated mapping to one or more PSMs. The MDA approach promises a number of benefits, including portability due to separating application knowledge from the mapping to a specific implementation technology, increased productivity due to automating the mapping, and improved maintainability

due to better separation of concerns and better consistency and traceability between models and code [25]. In general, the MDA approach can be distinguished into three major categories, namely, the model-to-model transformation approach, the model-to-code transformation approach, and the code-to-code transformation approach. The individual tools are developed based on the concepts provided by the first two approaches.

4.2 Code to Model Transformation

The need for abstraction compels us to create dataflow models for easier mapping of the application onto different parallel systems. Pattern matching is involved in source-to-model transformations, where a description of an algorithm in "text" is first analyzed lexically, then parsed, and finally converted to a dataflow model using pattern matching and/or recognition. The contribution of this tool is a source-to-model pattern recognizer working on the ANSI C99 Abstract Syntax Tree (AST). The tool should provide a robust lexical and semantic analyzer for the C programming language and effective pattern matchers for dataflow patterns.

This tool is the first in the tool chain and concerns itself with the transformation of C code into a GXF model. It is the starting point of the tool chain and it takes as input pre-processed C code and a stream GXF file. The pre-processed C code contains information regarding the various nodes and ports involved in the dataflow model, whereas the stream GXF file contains information regarding the connections between the ports of different nodes. The transformation from a formal code to a GXF model is not a single-step process but incorporates a few other transformations. The initial transformation consists of building an AST from the input code. This intermediate representation of the code is important for later pattern matching. A well equipped parser that can handle all the standard C99 constructs, as well as certain NXP Semiconductors specific constructs, like having statements before declarations, K&R style coding, etc., is a must here. The reason for a strong and well equipped parser is the need for a complete transfer of the information present in the source onto the model. This AST is still to be represented as a dataflow model (GXF). The GXF mainly contains the structural aspects of the C code.

4.3 Model Transformation

One major aspect of using such a tool chain is automation of the transformation of the application models into platform specific models. Confronting the application designer with the full complexity of an embedded heterogeneous multi-core platform is inefficient and can be rather unproductive.

Hence, simple, fine-grained platform-independent models that are specified by the application designers need to be gradually converted into coarse-grained platform specific application models by this tool. The implementation of this transformation tool consists of several smaller, independent but complete dataflow model transformations, which can be combined in some order by the application designer to convert a model into a platform specific dataflow model in several steps.

The input to such a transformation tool is a GXF file. This GXF captures the dataflow model, which contains the component information along with the connections between these components, relating the sequence in which the components are executed as well as the data rate. Such GXFs can be obtained as input either from the previous tool that parses C files to obtain GXFs, or directly from the application designer who develops a fine-grained application-specific GXF to be transformed using the visual editing tool. The output of the transformation tool is again a coarse-grained dataflow model in the form of a GXF file, which then forms the input to the unparsing tool.

The transformations are done with the help of the SSAX-SXML [14] package, which is a library built with BIGLOO Scheme. This package contains a collection of tools for processing markup documents (XML, XHTML, HTML) in the form of S-expressions (SXML, SHTML). With the help of this tool-set it is possible to query, add, delete, modify and transform the GXF model in several ways. An objective of this tool is to make the transformations as modular and independent as possible. It should also allow the designer to provide critical control information during the transformations.
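To give an impression of this S-expression view, the sketch below shows the hypothetical GXF fragment from Section 2.3 written as SXML, together with a small hand-written query. The element and attribute names are the same assumptions as before, and the query uses only plain Scheme recursion rather than the richer SSAX-SXML query tools, so it illustrates the representation, not the transformation tool itself.

    ;; The hypothetical GXF fragment as an SXML S-expression.
    (define gxf-model
      '(graph (@ (id "example"))
         (node (@ (id "producer") (xlink:label "src"))
           (port (@ (id "out0") (xlink:type "output"))))
         (node (@ (id "consumer") (xlink:label "sink"))
           (port (@ (id "in0") (xlink:type "input"))))
         (edge (@ (id "e0") (from "out0") (to "in0")))))

    ;; Collect the id attribute of every node element, by plain recursion.
    (define (node-ids sxml)
      (cond ((not (pair? sxml)) '())
            ((eq? (car sxml) 'node)
             (cons (cadr (assq 'id (cdr (cadr sxml))))   ; value of the id attribute
                   (apply append (map node-ids (cddr sxml)))))
            (else (apply append (map node-ids (cdr sxml))))))

    (node-ids gxf-model)   ; => ("producer" "consumer")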

4.4 Visualization

In the software streaming domain, a number of dataflow patterns are found which can be represented as actors and used in dataflow graphs having certain properties. These properties help the designer to estimate the performance of the application before deployment. There is a lot of legacy signal-processing software, and engineers are trained to program using sequential languages like C. Thus, the task of obtaining these dataflow graphs from legacy code, or of having them created by the designer, and the task of visualizing them are related to each other. Obtaining a dataflow graph from the source code is a source-to-model transformation, and visualization concerns itself with the transformation of these obtained models to a visual representation and vice-versa. The model used here is represented in the GXF format. This GXF model represents the dataflow graph consisting of actors and relationships between them.

Figure 4.2: Example of the visual representation of a simple Dataflow graph with three nodes and edges between them.

The presence of inherent relations among the elements to be visualized leads to the concept of graph visualization, where the elements can be represented as the nodes of a graph, with the edges representing the relations among them. Graphs have a fixed structure, so they can be used to represent structured static information [31]. As the goal of any graph visualization is to help its users understand and analyze the data represented, different layouts of the same graph can make a user infer different information from it.

As seen in Chapter 3, LIMEclipse is an existing tool which provides a visual editor for the dataflow model. Although a large number of features are supported in LIMEclipse, there are a few drawbacks as well. They are:

1. Layout: The layout was done in various levels. First the top-level

19 components were laid in the diagram and then in a recursive manner, the inner components were laid out. This kind of layout leads to overlap of the arcs and associations in the diagram, which in turn reduces the quality of the visualization.

2. Lack of shapes: There were only two kinds of shapes available to depict the ports, one being an ellipse for the output port and the other a rectangle for the input port.

3. Zooming and folding: LIMEclipse supports zooming for the entire visual model or diagram, but not for individual components. Also, there is no support for folding or unfolding the diagram, which would allow the user to focus on a single component of the diagram.

The visualization tool developed here has therefore taken the drawbacks of LIMEclipse as requirements, along with several further requirements; i.e., the graph visualization editor for this SDR domain must support the following features.

1. Lay out graphs with a large number of nodes.

2. Avoid visual cluttering in large graphs, which can be achieved by drawing edges in different shapes, like splines and polylines, that reduce the edge crossings.

3. As an interactive editor, it must allow users to select, explore (zooming), lay out, abstract/elaborate and filter (reduce the amount of data displayed) large graphs.

4. Provide support for different shapes to represent the nodes in the graph.

4.5 Unparsing

The process of constructing an Abstract Syntax Tree from a string of characters is called parsing, and unparsing is the reverse, i.e., constructing a string of characters from a tree. Unparsing can be classified into two types: textual unparsers and structural unparsers. A "pretty-printer" is an example of a processor that incorporates a textual unparser, which prints the "source text" [24] representation of the tree. In this project, the input to the pretty printer is the AST of a preprocessed C file, and this AST is matched with defined patterns to regenerate the code.
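As a flavour of this pattern-based unparsing, the sketch below uses BIGLOO's match-case form on a small, invented expression tree. The node shapes (add, mul, num, var) are assumptions made for this example only and do not correspond to the AST produced by the CIGLOO parser described in Part II.

    ;; Unparse a tiny expression AST back into C-like source text.
    ;; (A hedged illustration, not the pretty-printer of the tool-chain.)
    (define (expr->string e)
      (match-case e
        ((add ?a ?b) (string-append "(" (expr->string a) " + " (expr->string b) ")"))
        ((mul ?a ?b) (string-append "(" (expr->string a) " * " (expr->string b) ")"))
        ((num ?n)    (number->string n))
        ((var ?name) (symbol->string name))
        (else        (error "expr->string" "unknown AST node" e))))

    (expr->string '(add (var x) (mul (num 2) (var y))))   ; => "(x + (2 * y))"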

4.6 Implementation Language - Functional Programming

Lambda calculus is a formal system for function definition, function application and recursion. Functional programming is a style of programming that emphasizes the evaluation of expressions rather than the execution of commands. The expressions in these languages are formed by using functions to combine basic values. A functional language is a language that supports and encourages programming in a functional style [1]. LISP, an assembly-style language for manipulating lists, is referred to as the first computer-based functional programming language. Lambda calculus was conceived from the notion of solving complex computations and problems relating to calculation. According to Church, lambda calculus was not designed to work under physical limitations and hence, like today's object oriented programming, it is a set of ideas and not a set of guidelines [5].

4.6.1 Benefits of Functional Programming:

Functional programming languages, as mentioned above, have long been in existence. Unfortunately, despite a nice set of advantages, functional programming languages have not gained the importance which is due to them. A few of the advantages of functional programming languages are highlighted below.

1. Programs written in a high level functional language can easily be converted into a dataflow graph representation [43]. The following properties of functional programming languages free these languages from any kind of side effects:

(a) Functional programming languages treat each variable as an expression. Functional languages view the use of the = operator as an expression [29]. All the variables can be assigned only once and the values cannot be modified again later. Hence it makes more sense to call them symbols instead of variables. Functional programming languages are therefore not allowed to contain global variables or data structures. This property is referred to as referential transparency.
(b) Since all the symbols are constants, even the functions cannot change the symbols passed to them as arguments.
(c) Referential transparency [29] implies that the symbols in a functional program can be replaced at any time. An expression is referentially transparent if it does not change the program semantics when it is replaced by its value at any point in the program.

The above characteristics imply that a function call can only compute its result based on the non-changing arguments passed to it and nothing else. This is a major advantage and eliminates a source of bugs. As a result of this, a function value depends only on the arguments passed to it, making the order of execution irrelevant.

2. Functional programming languages allow pattern matching whereas many imperative languages do not offer it yet.

3. Functional programs are concurrent by design. A piece of data, once created, cannot be modified by the same thread or any other thread. This implies that there is no need to use any kind of locks to preserve consistency. Also, since the order of execution in a functional programming paradigm is not important, the compiler can optimize a single-threaded program to run on multiple CPUs.
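A tiny Scheme illustration of these points (single assignment and referential transparency), written for this text rather than taken from the tool-chain:

    ;; A pure function: (square 4) can be replaced by 16 anywhere in the
    ;; program without changing its meaning (referential transparency).
    (define (square x) (* x x))

    ;; "Modifying" a list builds a new one; the original is never mutated,
    ;; so any other thread holding xs never observes a change.
    (define xs '(1 2 3))
    (define doubled (map (lambda (x) (* 2 x)) xs))   ; => (2 4 6), xs stays (1 2 3)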

4.6.2 Why BIGLOO Scheme?

From the preceding section it is clear why a functional programming language is being used in the project. There exist various functional programming languages, for example Haskell, LISP, ML, Scheme, etc. In this project, a variant of Scheme called BIGLOO Scheme is chosen as the implementation language. BIGLOO is an implementation of an extended version of the Scheme programming language, R5RS (the Revised(5) Report on the Algorithmic Language Scheme). There are several advantages of using BIGLOO Scheme as the programming environment.

• BIGLOO programs can either be compiled to produce the respective intermediate files or they can also be interpreted to run the program without the creation of any intermediate files.

• Scheme files can be compiled in three modes, namely, the .NET mode, JVM (Java Virtual Machine byte code mode) and the native (C) mode under the BIGLOO environment.

– With the help of certain BIGLOO tools, the BIGLOO files can be compiled into JVM class files and used in the Eclipse or Java environment as a Foreign Function Interface (FFI). This is an important feature as this tool-chain is also supposed to run in an online mode as a set of Eclipse plug-ins.
– BIGLOO is a module compiler. It compiles modules into ".o", ".class", or ".obj" files that can be linked together to produce stand-alone executable programs, JVM jar files, or .NET programs or libraries.

• Pattern matching is a key feature of the BIGLOO functional programming language which allows clean and secure code to be written. In this project, pattern matching has been used to parse C code into an AST and to generate C code from the AST.

• Parsing: Conventional lexer generators like Lex are often coupled with tools like Yacc and Bison that can generate parsers for more powerful languages. These tools take as input a description of the language to be recognized and generate a parser for that language. The user in this case has to be aware of the requirements of the generated parser, which becomes quite inconvenient. BIGLOO overcomes this as it has an integrated parser generator which generates parsers of the LALR(1) class.

• BIGLOO provides the SSAX-SXML package, which is a library built using BIGLOO. This package contains a collection of tools for processing mark-up documents (XML, XHTML, and HTML) in the form of S-expressions (SXML, SHTML). This is used in the model transformation and visualization tools.

The BIGLOO programming environment thus incorporates all the features which are essential for the tools being implemented for the tool-chain. Since this single programming environment satisfies the needs of each individual tool, it gives the added advantage of using a single programming environment for the development of the complete tool chain.

Part II

Extracting GXF Models from C Code


Programming languages provide a platform to support data, activity and control modeling. Programming languages can be classified into two major classes: imperative languages and declarative languages. The imperative class includes languages like C and Pascal. These languages use a control-driven, procedure-based model of execution. The declarative class includes languages like LISP, PROLOG and Scheme. These languages are driven via pattern- or demand-based computation.

As mentioned earlier, this tool aims to raise the level of abstraction in the current development technology. The objective of this particular tool in the tool chain is to develop a dataflow model from an input source file. This source file is a C file depicting the different elements of the dataflow, including the actors, the various ports, their properties and several other pieces of necessary information. The C source file has previously been preprocessed. The tool acts as a front-end to the whole tool chain and accepts the preprocessed source file. The task of the tool is to analyze this preprocessed source file and develop a model out of it. This model is a GXF model, as described in Section 2.3.

In summary, the tool has to take in a preprocessed source file and generate an equivalent GXF model from it. This model is the abstraction that was referred to earlier. The generated model can be easily transformed to any of the parallel architectures when required. The application designer will not be involved in this transformation, and his task will be limited to just designing and developing the application without any consideration of the current back-end architecture in use. This should considerably reduce the burden on the developer, as was the aim. The high-level architecture of the proposed tool is shown in Figure 4.3.

Figure 4.3: Architecture of the tool developed.

PIC here refers to the Platform Independent Code. These are the preprocessed source files taken as input by the tool. The tool also takes as input a GXF stream file. This GXF stream file contains the connection information in the form of edges. The edge information tells which actor is connected to which other actor and the type of the edge, i.e. it can be a state edge or a FIFO edge, etc. This information cannot be directly used to generate the required dataflow models. The source files must first be passed through a parser to generate an Abstract Syntax Tree (AST). This AST is then pattern matched to extract the relevant information for generating the models. Since the model to be generated should contain all the information present in the source file, it is imperative that the AST too contains every bit of information related to the nodes and streams present in the file. The first task of the tool hence becomes to have a complete parser which can handle every C construct which may possibly be dealt with. Once the parser is ready, other modules handling pattern matching for the extraction of information will work on the output of the parser (the AST) to generate the GXF model.

In the remainder of this thesis we discuss recognizers and parsers in Chapter 5; the CIGLOO parser and the work done on it is discussed in Chapter 6; Chapter 7 discusses the various modules implemented; Chapter 8 is about the testing and performance analysis done; and Chapter 9 deals with related work, followed by the conclusion and future work.

Chapter 5

Recognizers and Parsers

5.1 Recognizers

A language is a form of communication using words, either spoken or gestured with the hands, structured with grammar and often with a writing system. In other words, a language can simply be a set of valid sentences. Any language needs to be backed by two main concepts: language generation and language recognition. It is not possible to recognize a language unless you are able to understand the nature of the language and how it conveys its meaning. Although it is important to understand language generation before going further with language recognition, we will concentrate here mainly on the recognition part as it deals more closely with the work at hand.

The phrase "language recognizer" seems very vast and at times can be really overwhelming. How can one create something which can recognize every construct of a particular language? Languages are generally perceived to be infinite, and a recognizer for the same would be equally complex. The following sections discuss what it means to recognize a structure, how a simple recognizer can be constructed, when, why and how recognizers use lookahead, and why it is important to analyze the input at two different levels and have two separate recognizers for efficiency.

An input can be recognized at different layers. There are mainly two layers to consider: the basic character level, where you consider each character individually and recognize it, and the file structure layer, where you consider the structure of the whole file. In compiler technology, the character level layer is what is recognized by a lexer and the file structure level layer is what is recognized by a parser. The basic building block of any language is its vocabulary, and in this case it is the character layer which forms the vocabulary of the language. As in the case of any language, the sequencing and grouping of these vocabulary symbols defines the general structure of the whole language. The ability to consider and evaluate the two layers separately is more beneficial and efficient, and hence we have

Figure 5.1: An example for lookahead

a separate lexer and a parser. The overall structure of the input is what is usually referred to as the language, recognized by the parser, and the lower rung character layer is referred to as the vocabulary, recognized by the lexer.

The main starting point of a recognizer should be an instance variable which holds the character to be recognized. From an implementation perspective, this means that a buffer is required which can read and store the next character to be recognized. There are mainly two ways in which a recognizer might function. Either it starts from the most abstract level and moves its way down to the character level structure (referred to as a top-down recognizer), or it starts the other way round, referred to as a bottom-up recognizer. A sub-category of the top-down recognizers are the recursive-descent recognizers, which are composed of mutually recursive methods.

As mentioned, the recognizer works with a buffer variable which always holds the next character to be recognized. This continues till the end of file is reached. The character in the buffer is not consumed by the recognizer until it is recognized. The recognizer splits the input into three groups:

1. Characters which have already been recognized and consumed.

2. Characters in the buffer, i.e. characters which have been read but not recognized.

3. Characters still to be read from the input stream.

In the illustration of Figure 5.1, the recognizer has already read and recognized the input printf("Hello, worl. The character d has been read into the buffer but not yet recognized, and the remaining input stream ") is yet to be read.

The character in the buffer is the next character for recognition. This character is used by the recognizer to predict the sub-structure coming next. If this character does not fit in the structure to be recognized, then the recognizer throws an error. This is not necessarily an immediate reaction of the recognizer; a recognizer may evaluate a certain character many times before throwing an error. The operations of the recognizer on the character in the buffer do not have any implication on any other segment, because the character is still not consumed.

There may be cases when a recognizer has to use more than a single character of lookahead. This can be illustrated when a program has certain keywords in it. The question is how the recognizer will differentiate such a keyword from an identifier. The problem lies in the fact that keywords and identifiers are lexically the same. This leads to an ambiguity when the decision of handling the symbol as a keyword or as an identifier is to be made. If all keywords were of the same length, then just increasing the number of lookahead characters to that length would solve the problem. Unfortunately this is never the case; a keyword can be of any arbitrary length.

A solution to the problem mentioned above regarding the number of lookahead characters is to reconsider the way the recognizer handles the vocabulary symbols. Instead of viewing the input as a string of characters, it is possible to look at it as tokens with a certain grammatical structure behind them. To make this more understandable, we take the human example of reading a particular sentence. At the ground level a sentence is formed by a sequence of characters. While reading it, though, we don't take it in as a sequence of characters; rather we break these characters down into words, which then form a meaningful sentence based on the language's grammar.

The recognizer considered so far takes both the language and its corresponding vocabulary structure into account. Methods for recognizing languages explicitly invoke other methods for matching the vocabulary. Although this approach works well for simple languages, it may not do so well when it comes to more complex languages. Hence, to solve this problem and be able to recognize complex languages, it is apt to separate the character level recognition from the language level recognition. Generally, the character level recognizer is referred to as the lexer, and the language level recognizer is referred to as the parser. A parser functions by recognizing grammatical structure from a given stream of tokens, whereas a lexer recognizes tokens from a stream of characters. The advantages of separating the parser and the lexer are numerous [38]:

• The parser can treat arbitrarily long character sequences as single tokens.

• The parser sees a pipeline of tokens, which isolates it from the lexical

31 language recognizer.

• A separate lexer simplifies the recognition of keywords.

• The lexer can filter the input, sending only tokens of interest to the parser.

The parser and lexer are now two separate entities, but they still need each other to recognize a language, so the two recognizers must work in tandem. The lexer takes a stream of characters as input and converts them into tokens, which are then passed to the parser. Breaking the recognizer into a parser and a lexer removes the lookahead problem at the language level, because the parser now sees any lexical element of any length, be it an identifier or a keyword, as a single element. The separation is also beneficial at the lexer level, as it merges the recognition of keywords and identifiers and sends different token types to the parser depending on whether the element is an identifier, a keyword, or some other special type. From a programmer's perspective, the preceding sections can be summarized by:

• Defining regular expressions that describe the token structure.

• Implementing the lexer by combining the regular expressions with code that produces tokens which can be passed to the parser.

• Developing a parser from a Context-Free Grammar (CFG) that describes the language structure in terms of the tokens passed by the lexer and other non-terminals.
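As a minimal illustration of these steps (a hypothetical sketch in C rather than the Scheme used by CIGLOO), the lexer below reads a complete identifier-like word and only then checks it against a keyword table; this is exactly what removes the need for arbitrary lookahead in the parser.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENTIFIER, TOK_EOF } TokenType;

static const char *keywords[] = { "int", "if", "else", "return", "while" };

/* Scan one identifier-like word from *src and classify it. */
TokenType next_token(const char **src, char *lexeme, size_t cap) {
    const char *p = *src;

    /* skip anything that cannot start an identifier or keyword */
    while (*p != '\0' && !(isalnum((unsigned char)*p) || *p == '_'))
        p++;
    if (*p == '\0') { *src = p; return TOK_EOF; }

    size_t n = 0;
    while ((isalnum((unsigned char)*p) || *p == '_') && n + 1 < cap)
        lexeme[n++] = *p++;
    lexeme[n] = '\0';
    *src = p;

    /* only after the whole word has been read is it compared with the
       keyword table; keywords and identifiers are lexically identical */
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENTIFIER;
}

int main(void) {
    const char *src = "if ifdef int integer";
    char word[64];
    TokenType t;
    while ((t = next_token(&src, word, sizeof word)) != TOK_EOF)
        printf("%-10s %s\n", t == TOK_KEYWORD ? "keyword" : "identifier", word);
    return 0;
}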

Let us now give a small example of how a lexer and a parser can be used in combination to recognize sentences of a particular grammar. The example concerns the English language [39]. Consider the following sentences -

1. John gesticulates.

2. John gesticulates vigorously.

3. The dog ate steak.

4. The dog ate ravenously.

Our mind subconsciously divides these sentences into basic language constructs and structures the words into phrases and groups of phrases. Making a machine do the same can get a bit tricky, though. An English sentence always comprises a subject (a person or a thing), a verb (describing an action) and an optional object (on which the subject acts). On the basis of these constructs we can substitute them into the above examples -

1. Subject VerbPhrase

2. Subject VerbPhrase

3. Subject VerbPhrase Object

4. Subject VerbPhrase.

These can be further compressed by removing duplicates to just -

1. Subject VerbPhrase Object

2. Subject VerbPhrase

The Subject can either be John or The dog, which in turn are either a Noun or a Determiner Noun. Similarly, the VerbPhrases can be structured as just a "verb" or a "verb adverb". The tokens, i.e. John and The dog, are identified by the lexer and reserved as Noun and Determiner Noun respectively in this case. Similarly, the verb phrases are tokenized by the lexer and passed to the parser. The grammar that is written indicates what language can be recognized by the parser; the parser is in fact generated from this grammar. The parser can be generated in any well-defined language, as desired and as implemented by the parser generator. The project required using the BIGLOO Scheme environment, and BIGLOO has been designed to generate an LALR parser.

Chapter 6

CIGLOO and Extensions

6.1 CIGLOO

CIGLOO was created with the intention of generating foreign function interfaces from C source files. CIGLOO reads C files and automatically generates BIGLOO extern clauses for those files. CIGLOO is implemented in the BIGLOO Scheme environment. CIGLOO itself provides a grammar and a lexer for recognizing the C language. Along with the C parser, CIGLOO provides the functions which translate an AST into the required function interfaces. The concern of this project was limited to the grammar and the lexer provided by CIGLOO, because the only requirement from CIGLOO was the AST, which could later be used for pattern matching purposes.

The grammar of CIGLOO is written in the Scheme functional programming language. Grammars written in a functional programming paradigm can be a bit more complicated than their counterparts in imperative programming paradigms. To illustrate, a compiler makes several passes over a syntax tree, computing certain information in each pass which is assigned to some node in the tree. In functional programming, by contrast, each pass over the syntax tree builds a new tree with the computed information in its respective nodes. At a higher level of abstraction, a grammar written in any conventional format and one written as a functional program are the same: both consist of non-terminal symbols leading to other non-terminals or terminals, and on reaching the terminal symbols the grammar is designed to perform certain tasks, in this case creating a structure in the AST. Grammars written in the functional programming paradigm differ only in their way of representation. To illustrate, consider a simple ANTLR parser [40] -

Exp : Exp Plus Exp          -> Exp^ "+" Exp
    | Exp Star Exp          -> Exp^ "*" Exp
    | PAR-OPEN Exp PAR-CLO  -> "(" Exp^ ")"

The semantically similar grammar, when written in functional programming terminology, would look like -

(Exp ((Exp Plus Exp) `(,Exp "+" ,Exp))
     ((Exp Star Exp) `(,Exp "*" ,Exp))
     ((PAR-OPEN Exp PAR-CLO) `("(" Exp ")"))
     ((id) id))

The former representation is the conventional one. In the functional programming representation, the productions of each non-terminal are enclosed within a block, identified by the associated non-terminal. This block contains all possible productions for the given non-terminal, each represented as a separate case, and every case is followed by the action(s) to be performed if that particular case is matched. In the above illustration, consider the case

((Exp Plus Exp) `(,Exp "+" ,Exp))

This case is similar to the production "Exp + Exp" in BNF format. The + symbol is recognized by the lexer and represented as the token Plus. Whenever this case is matched, the corresponding action is performed; here the action creates a list in which the recognized non-terminal (Exp) is run again and matched against a case, "+" indicates that the symbol is generated as output, and the second non-terminal (Exp) is treated in the same way as the first. The comma before each non-terminal signifies that these are not symbols but variables which need to be evaluated further.

This chapter provides a thorough analysis of the grammar provided by CIGLOO to recognize the C language. It indicates the deficiencies of the generated parser, along with the modifications and improvements made to overcome them. The intent was to build a grammar which could handle all constructs of C and subsequently generate ASTs containing all the information of the original source code. The existing structure of the grammar provided by CIGLOO is presented in Appendix A. The remainder of the chapter deals with each of these files individually and gives the respective conclusions.

6.2 Analysis of the CIGLOO parser

1. The first test was written to check whether the CIGLOO parser correctly generates the AST for various symbols.

void func1(int x){
    static int yStatic;
    volatile float zVolatile;
    if((x != 10) || (!x)) {
        printf("x is neither 10 nor zero");
    } else
        printf("x is either 10 or zero");
}

CIGLOO, when run with this file, generated an AST which reflected the "!" in the expression "(x != 10)", but missed depicting it when used in the second expression, i.e. when it is used with "(!x)". Another problem with the AST generated for this particular code snippet was that it did not recognize any variable declaration statements. These problems were identified and the corresponding CIGLOO code was modified to include these features in the AST. To handle the symbols, the grammar for unary expressions was modified by adding patterns containing these symbols. Patterns for the symbols '!', '&' and '*' were added to the rule of the unary expression in the grammar.

((! expr) `("!" ,expr))
((& unary-expr) `("&" ,unary-expr))
((* cast-expr) `("*" ,cast-expr))

This code adds the recognized symbols into the generated AST. To include the declaration statements in the AST, the grammar for compound statements, which handles the declaration statements, was modified. The declaration statements were previously being included in the AST only if they were written outside any function block, so the next step was to include the same functionality when the statement is written within a function block.

((BRA-OPEN declaration-list statement-list BRA-CLO)
 `(compound ,declaration-list ,statement-list)))

2. The next check was to see how the parser handles the various pointer declarations and expressions.

void func3(int *p, int q) {
    int **x;
    const *check;
    float * const xp;
    printf("swap");
    *x = &q;
    *p = 10;
    q = *p;
    *p = **x;
    **x = q;

}

The original CIGLOO parser could not handle pointer declarations within a function body. It missed the pointer declarations and also the use of pointers in any other part of the program except for the function parameter list. Function parameter lists were the only place where pointers were handled and subsequently included in the generated AST. To correct this, the modifications previously included for the symbols were sufficient and no other modifications were needed. With these modifications, all pointer declarations and their subsequent usages in the code were taken care of and reflected in the AST.

3. The following code was written to verify whether the parser handles all cases of array declarations. The cases tested were simple declarations of arrays, followed by various cases involving type qualifiers and array modifiers with the various variables. Cases involving multi-dimensional arrays, arrays of pointers and type qualification of pointer array variables were also tested.

void func1(int a, int b, int x, const int y){
    const int z;
    static int arrayNorm[10];
    const int arrayVLA[x];
    const int (* const arrayPoint[20])[30];
    int ( *arrayConstPoint)[y];
    float *arrayPoint1[z][20];
    int arrayMulti[b][70][80];
    float arrayParKeywordsp[const a][b];
}

The parser initially could handle only the normal declarations of arrays. It could handle variable length arrays, but was incapable of handling arrays where the array variable had a type qualifier associated with it. In the case of variable length arrays, the type qualifiers associated with the variables specifying the length were being missed in the AST. These were keywords like "static", "const", "volatile", etc. Care was taken not to parse incorrect constructs like array[10][const 2]. In fact, the parser missed all declarations where a type qualifier was followed by a variable without a data type in between. This problem was resolved by editing the declarator2 rule.

((declarator2 ANGLE-OPEN array-modifiers constant-expr ANGLE-CLO)
 (ast-decl2 #f #f #f declarator2 array-modifiers constant-expr #f))

The grammar which handled the declarations and variables was edited with the above case to incorporate the required changes. This change makes sure that cases which involve a constant expression may contain an array-modifier before it. Array-modifier function handles all the type-qualifiers like “restrict”, “const”, “volatile”.

4. This code snippet was written to check if the parser handles and depicts all the keywords present in a program. The keywords to be tested included "volatile", "static", "const", "restrict", etc. The initial version of the parser missed all these keywords from the AST.

void func2(float buf, int array[10]){
    int rate;
    int p[rate];
    static int x_static;
    volatile float y_volatile = 10;
    int *state = malloc(sizeof(int));
    buf = buf + 1;
    printf("DECLARED STATIC AND VOLATILE");
}

Modifications done to include declarations into the AST could be used to resolve this issue too.

5. The next thing to be checked was the parser's handling of complex data types like structures/unions and enumeration specifiers. Cases to check included

• Declarations of structures/unions within a defined structure or union. This was also meant to check the handling of enum specifiers and their declarations.

• Declaration of structures/unions/enumerators in function parameter lists.

struct forwardDeclStruct;

struct foo1{
    int a;
    int b;
    struct foo2{
        int abc;
        int pqr;
    }foo2_a;
}foo1_a;

const struct foo3{
    enum foo4{nish, swaraj}enumCHeck;
}foo5;

void func5(int a, int b,
           enum foo{aenum, benum}var2,
           struct foo1 {
               int a_foo;
               enum foo22 {abc, xyz}enumCheck;
               float b_foo;
               struct foo2{
                   int x;
                   enum foo3{abcd, pqrs}enumCheck2;
                   float y;
               }abcd;
           }var1)
{
    int afunc;
}

Initially the parser could not handle the declaration of structures or enum specifiers within themselves, i.e. it could not handle the recursive declaration of a structure, union or enumerator within itself. There was no grammar to handle such C constructs. The grammar for the declaration list of structures was therefore modified to handle these cases.

((struct-or-union-specifier) `(,struct-or-union-specifier))
((struct-or-union-specifier struct-declaration-list)
 `(,struct-or-union-specifier ,struct-declaration-list))
((enum-specifier struct-declaration-list)
 `(,enum-specifier ,struct-declaration-list))
((enum-specifier) `(,enum-specifier))

These modifications made sure that the above-mentioned cases were handled efficiently and reflected in the generated AST. The C language allows enumerator lists in any of the following forms -

(a) enum foo{var};
(b) enum foo{var, var};
(c) enum foo{var, var,};

The first two cases were handled well by the parser, but the third case was not included in the original parser. The third case has a list ending with a COMMA, as shown below. It required an additional modification to the rule of the enumerator list -

((enumerator COMMA) ‘(,enumerator ","))

6. Code was also written to check if the parser could handle ternary operators. A simple ternary condition was written inside a function to test the same. The code snippet is as follows.

void func4(int a, int b,
           struct foo4 {
               int x;
               int y;
               struct check{ int p; int q;}var1;
           }var2)
{
    int c=10;
    printf("%d", a? b : c);
}

For the ternary operators the grammar was structured in such a way that the expressions to be evaluated when the condition was true or false had certain restrictions, i.e. a few of the valid constructs could not be evaluated. Comma-separated expressions are one such case. The rules for handling expressions in the grammar are organized in a hierarchical manner; a particular construct is checked only after some more basic construct has been handled, so that the parser takes care of all precedences. In the case of ternary operators, however, comma-separated expressions were not handled. This was solved by adding another rule which explicitly took care of these comma-separated expressions.

((logical-or-expr@lexp1 ? argument-expr-list : conditional-expr) ‘(,lexp1 "?" ,argument-expr-list ":" ,conditional-expr))

The AST generated for this code reflects the ternary condition in it and also handles all other constructs well.
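A hypothetical construct of the kind the added rule accepts, where the middle operand of the ternary operator is a comma-separated expression, is shown below.

#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4, x;

    /* the middle operand of ?: is a full expression, so a comma-separated
       expression is legal there; since a is non-zero, x becomes c (3) */
    x = a ? ++b, c : d;

    printf("x = %d, b = %d\n", x, b);   /* prints: x = 3, b = 3 */
    return 0;
}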

7. struct edge edges[] = {
       {.type="init_state",
        .from={a+b,.node=constructor,.port="outs"},
        .to={.node=process,.port="inst"},
       },
       {.type="state",
        .from={.node=process,.port="outs"},
        .to={.node=process,.port="inst"},
       },
       {.type="deinit_state",
        .from={.node=process,.port="outs"},
        .to={.node=destructor,.port="inst"},
       },
   };

   struct foo{
       int a, b;
       char c;
   }var = { a : 10 };

While testing the NXP-specific codes a particular case was encountered. This case involved comma-separated expressions within braces as initializers for structures, as in the structure edge. The parser could handle expressions as initializers, but not in this case.

Also, constructs of the type expr : expr in initializers were not handled. To rectify this, another rule was added to the parser and another case was added to the existing assignment expression rule. If an initializer was encountered with the declaration, then an initializer rule was invoked to handle the case. This initializer rule was designed so that it could check whether the initializing elements were singular or composed of many expressions. In the latter case, it would break the list down into singular expressions and then invoke the rules related to expressions for each of them. Another case involved a particular way of assigning a structure, as depicted in the structure foo. Incorporating these changes into the existing rules was not the most feasible option, as it would require modification of the set of rules handling expression statements. It was decided that a new rule had to be introduced which would separate the instances; hence the rule init-assignment-expr was introduced. The modifications done to parse the former case were made in the assignment-expr rule; they are:

((unary-expr assignment-operator BRA-OPEN expr BRA-CLO) ‘(assign ,unary-expr ,assignment-operator ,expr))

For handling cases involving constructs like struct foo described above, the following modification was incorporated.

(init-assignment-expr
 ((assignment-expr : conditional-expr)
  `(,assignment-expr ":" ,conditional-expr))
 ((assignment-expr) assignment-expr))

The initializer now makes a call to the init-assignment-expr rule instead of the earlier assignment-expr. This rule handles the expr : expr problem and then continues with the assignment expression execution. The initialization with comma-separated expressions in braces was handled by the parser, but the information passed on to the AST missed the ",". This led to difficulties in the regeneration of the code. Hence the case was modified to include the missing "," in the AST. Along the same lines, there were cases where the argument expression list was not separated by a ",". This required another case to be included in the rule, similar to the one before, but one which did not include a "," in it.

(argument-expr-list
 ((assignment-expr)
  `(,assignment-expr))
 ((argument-expr-list COMMA assignment-expr)
  (append argument-expr-list `("," ,assignment-expr)))
 ((argument-expr-list assignment-expr)
  (append argument-expr-list `(,assignment-expr))))

The first modification in this section enabled the parser to handle comma-separated expressions within braces. The next issue was to handle comma-separated expressions within each of these expressions. This was handled by introducing a recursive case to handle assignment-exprs, each separated by a comma. Structure members declared of the type

{int : constant-expression}
{int a : constant-expression}
{int a = constant-expression}

were not being handled properly either. The parser entirely missed depicting the constant expression in the AST. The constant expression was hence added to the AST structure which handled such constructs. In the same section, the parser could not handle a construct like .types = "process"; for this, a case was added to the postfix-expr rule. This case allows a DOT to be followed by an identifier when there is nothing preceding the DOT. The case is as follows -

((DOT postfix-expr) ‘("." ,postfix-expr))

8. The lexer associated with the parser was not complete. For identifying strings it carried out a longest match, and it was designed to accept all characters between two quotes. This meant that escape characters such as newline and null were not taken as a single character but rather as a backslash followed by an "n" or a "0". To rectify this, the lexer of CIGLOO had to be modified. The regular expression for identifying strings was designed to include anything that came within quotes, except a quote itself. The lexer was accepting a newline character as a backslash followed by an "n", i.e. the lexer was not handling the escape character with its intended meaning but just as plain characters. This resulted in constructs like

‘‘printf("\"hello,world\"");’’

not being parsed, and errors being thrown. The rectification of this included a complete overhaul of the regular expressions handling strings and characters. It included adding rules to handle escape sequences as individual characters. The following are the modifications done to the regular expressions handling characters and strings respectively.

((: (? #\L) #\' #;(+ all)
    (+ (or (out #\' #\\ #\newline) (: #\\ #\") (: #\\ #\') (: #\\ #\?)
           (: #\\ #\\) (: #\\ #\a) (: #\\ #\b) (: #\\ #\f) (: #\\ #\n)
           (: #\\ #\r) (: #\\ #\t) (: #\\ #\v) (: #\\ #\0)))
    #\')
 (list 'CONSTANT (the-coord input-port *line-count*) (the-string)))

((: (? #\L) #\"
    (* (or (: #\\ #\\) (out #\" #\\) (: #\\ #\") (: #\\ #\') (: #\\ #\?)
           (: #\\ #\a) (: #\\ #\b) (: #\\ #\f) (: #\\ #\n) (: #\\ #\r)
           (: #\\ #\t) (: #\\ #\v) (: #\\ #\0)))
    #\")
 (list 'CONSTANT (the-coord input-port *line-count*) (the-string)))
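For reference, the following small (illustrative) C program shows the kind of escape sequences and escaped quotes the modified lexer has to treat as single characters; it is not taken from the test code.

#include <stdio.h>

int main(void) {
    /* each escape sequence is a single character to the lexer,
       not a backslash followed by a letter or a digit */
    char nl = '\n', nul = '\0', tab = '\t';
    printf("codes: %d %d %d\n", nl, nul, tab);   /* prints: codes: 10 0 9 */

    /* escaped quotes inside a string literal, as in the construct above */
    printf("\"hello,world\"\n");
    return 0;
}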

9. One of the cases encountered while testing the parser involved a storage class specifier as a parameter type. This capability was not handled by the existing parser and hence modifications had to be made to incorporate this feature. It involved the addition of another case to the rule handling function definitions as well as declarations. The case was as follows.

func(int var1, register int var2, int var3) {

// function body

}

To handle this case, the rule for parameter declaration was modified. It was incorporated with the storage-class-specifier rule to handle any such cases.

((storage-class-specifier type-specifier-list declarator)
 (ast-para-decl storage-class-specifier type-specifier-list declarator #f))
((storage-class-specifier type-name)
 (ast-para-decl storage-class-specifier #f #f type-name))

The first modification handles cases where a declarator is specified in the parameter list, and the second modification handles the case where a declarator is not specified in the parameter list.

10. CIGLOO provided an option to handle the various gcc attributes in the code. Since the code at hand was always pre-processed code, it contained a lot of gcc-added attributes. In many cases these were handled by the parser, but there were still cases which were not dealt with: a declarator succeeded by a gcc attribute, a pointer symbol ("*") succeeded by a gcc attribute, or a gcc attribute between a declarator and a function body. All these issues were handled by adding the appropriate cases to their respective rules.

((declarator2 gcc-attributes) (ast-decl #f #f declarator2))

((* gcc-attribute) (ast-ptr (car *) #f #f))

((declarator gcc-attribute function-body) (ast-fun-def #f #f ’() declarator function-body))

The first modification was added to the rule for declarators, the second to the rule for pointers and the last to function definitions. In all these cases the gcc-attributes are recognized but never handled: the gcc-attributes are never reflected in the AST, as they are anyway suppressed by the compiler during compilation.

11. The original CIGLOO parser was not designed to handle any of the C extensions. Due to this, cases such as having compound statements as expressions were not handled. This was a major hurdle, as most of the libraries on which this parser was to be tested incorporated almost all possible extensions of C. Below are these extensions and the way they are handled:

• Allowing compound statements as an expression - This extension allows compound statements, i.e. declarations or statements separated by semi-colons within curly braces, to be used as expressions. Handling this case involved modifying the grammar of expressions to incorporate compound statements, by adding the case to the expressions rule where the match for this particular case occurs. These expressions mostly came in the following form

__extension__( //statement expressions );

The __extension__ keyword was suppressed by the "gcc" option, hence it was only the remaining expression which had to be handled, and this was done as follows,

((PAR-OPEN statement PAR-CLO) `("(" ,statement ")"))

The statement rule is a very generic rule which is capable of handling any type of statement, be it a compound, iterative, labelled, expression, selection or jump statement.

• Various __builtin_* forms were missing from the parser. They were handled by adding the appropriate cases to the parser and also by including them as keywords in the lexer.
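For reference, a typical GCC statement expression wraps a compound statement in ({ ... }); the value of its last expression statement becomes the value of the whole construct. The following macro is an illustrative use and is not taken from the NXP code base.

#include <stdio.h>

/* A GCC statement expression: a compound statement used where an expression
   is expected; the value of the last statement is the result.
   __extension__ silences the pedantic warning about using the extension.   */
#define MAX(a, b) __extension__ ({ \
        int _a = (a);              \
        int _b = (b);              \
        _a > _b ? _a : _b;         \
    })

int main(void) {
    printf("%d\n", MAX(3, 7));   /* prints 7 */
    return 0;
}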

12. While testing the libraries, specifically with the glpk 4.43 [15] library, a lot of issues regarding postfix expressions and unary expressions surfaced. The GLPK (GNU Linear Programming Kit) package is intended for solving large-scale linear programming (LP), mixed integer programming (MIP), and other related problems. It is a set of routines written in ANSI C and organized in the form of a callable library [15]. This library hence includes a lot of numerical expressions and calculations. As a result, it was the most apt library for testing the parser's rules on expressions. Below are the various errors which surfaced and their respective handling -

• Case type-name X = abc -> test.var - The parser was initially capable of handling only expressions of the form a -> b, but there were legitimate cases which involved a postfix expression like the latter being followed by another postfix-expr. Multiple occurrences of the same type of expression (a -> b -> c) were also not handled. This required a recursive rule to be put in place which could handle all such cases. The specific case to handle this construct was

((postfix-expr -> identifier) `(,postfix-expr "->" ,identifier))

To make it more flexible, the case was modified to the following:

((postfix-expr -> unary-expr) `(,postfix-expr "->" ,unary-expr))

The rule "unary-expr" can again call postfix expressions, so all the resulting cases can be accepted.

• Case type-name *X = *test -> var - The parser was initially incapable of handling constructs involving pointers before certain postfix expressions, like the case mentioned. The capability was added to the parser by the addition of another case to the unary expression rule. This rule explicitly handles unary expressions which are preceded by a pointer. This pointer had to be treated differently since, unlike in the usual case, it is not followed by a variable or a declaration. The modifications mentioned for the first item evolved to their present state to incorporate these cases.

• Case type-name *X = *(test) -> var - The parser had the required grammar to handle parentheses in expressions, but this was not sufficient to handle all cases. A pointer variable where the pointer symbol "*" was followed by a variable in brackets was not handled. Cases were hence added to handle them; mainly, rules were written to recognize the parentheses and consume them before doing any further processing. A modification was made in the rule for primary expressions to eliminate the parentheses and handle the identifier.

((PAR-OPEN identifier PAR-CLO) `("(" ,(ident-id identifier) ")"))

This case eliminates the parentheses and handles the identifier next.

• Case type-name *X = *(int *) y[0] - This kind of explicit type casting using pointers was not handled by the parser. The pointer symbol both inside the parentheses and outside (for the same reason as mentioned above) had to be handled. The changes mentioned previously with respect to unary expressions and postfix expressions were extended to handle this case.

• Case type-name X = func(var -> a) - Postfix expressions within function parameters were another basic construct missed by the parser. Interestingly, the parser was capable of handling postfix expressions when they were not passed as a parameter, but not otherwise. This meant that the fix was not to be made in the grammar handling postfix expressions but rather in the grammar which handled the function declarations. All the declarations were being handled under the declarator rule, and the following addition was made to this rule:

((declarator2 PAR-OPEN postfix-expr PAR-CLO)
 (ast-decl2 #f #f #f declarator2 #f postfix-expr #f))

• Case var[var1].var2 == (-1) - This kind of statement, where a member of an array of a user-defined type is used in an assignment or comparison, threw exceptions when run with the original parser. The fault was in the way the parser handled the array index: it was not designed to expect anything between a variable of a user-defined type and its corresponding member. This issue was resolved by modifying the rules of postfix-expr. Recursion was added to this rule; the intention was to break the left-hand side into var[var1] and then append .var2 to this expression.

• Case void func(int, float) - In function declarations the identifier associated with each type need not be specified. The grammar of the parser, however, required it. Hence a rule to accept just the type specifier in the parameter of the function was added to the rule of primary expressions.

• Case &(expr) or !(expr) - Expressions containing a negation or an ampersand were another set of cases not taken care of. A simple case stating this construct was added to the rule for expressions to handle it.

((! expr) `("!" ,expr))
((& expr) `("&" ,expr))
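Collected into a single compilable C fragment (hypothetical code; the type node_t and all member names are invented for illustration), the expression forms discussed above look roughly like this.

/* declaration with unnamed parameters, as in: void func(int, float) */
static int use(int);

typedef struct node {
    struct node *next;
    int          var;
    int          vals[4];
} node_t;

int demo(node_t *test, node_t **y, node_t table[], int idx) {
    int a   = test->next->var;          /* chained postfix: a -> b -> c        */
    int *px = &test->var;               /* "&" applied to an expression        */
    int v   = *(test)->vals;            /* "*" before a parenthesised operand  */
    int w   = *(int *) y[0];            /* explicit pointer cast (syntax only) */
    int ok  = !(table[idx].var == -1);  /* indexed member access inside !(...) */
    int r   = use(test->next->var);     /* postfix expression as an argument   */
    return a + *px + v + w + ok + r;
}

static int use(int x) { return x; }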

13. The parser omitted a few keywords, which were later added to complete it. Below is a list of the additions made -

(a) "__inline__" was added to the storage class specifiers.
(b) "wchar_t" and "__builtin_va_list" were added to the list of different types.

14. Another peculiar problem encountered while trying to run the parser on the test libraries was the problem with data types/user-defined types/variables which have been typedef'd to some other variable name. At the highest level the problem was quite major, since it dealt with conflicts arising due to the different namespaces in C. Consider the following case.

typedef struct Ppoly_t{
    //struct body
}Ppoly_t;

typedef Ppoly_t Ppolyline_t;

typedef struct path{
    //struct body;
}path;

typedef struct XYZ{
    Ppolyline_t path;
}abc;

The above code excerpt is valid C code. The declaration of path as a variable of type Ppolyline_t may seem a bit off, but it is indeed valid. The reason is that the declarations of "path" are in two different namespaces and hence it has a different meaning in each namespace: it may represent a struct type in the global namespace, but it is an ordinary member within a structure. The CIGLOO parser was unable to mark the differences between these namespaces and parse the code. In this case CIGLOO took both Ppolyline_t and path as data types and then expected a variable after them. The cause of this was traced to two main roots -

(a) The first cause was more of a side effect of another existing rule. Consider a declaration in the global namespace, of the format

Ppolyline_t path;
int var1;

The above are valid C declaration statements and they were being parsed too. A closer analysis of the generated AST uncovered a major flaw in the way the parser was handling this construct. The declaration rule of the parser had a case to handle constructs of the form "(declaration-specifiers SEMI-COMMA)". The declaration specifiers could be resolved by "(storage-class-specifiers type-specifiers type-qualifier-specifiers)+". This meant that although the semantic meaning of "Ppolyline_t path;" is a type-specifier followed by an identifier, the parser was reading it as a type-specifier followed by another type-specifier. The cause of this was the recursion allowed for type-specifiers in the parser. This recursion was included in the parser to handle constructs like "long long", "short int", and the forward declarations of structures. Unfortunately it also allowed the parser to handle constructs like "long long long long long long... int", which is an incorrect construct in C. To handle this, all recursions involving type-specifiers had to be eliminated from the parser, although recursions involving type-qualifier-specifications and storage-class-specifications were still valid. To do this, the declaration specifier was broken into two parts: one part handled all three constructs, i.e. the storage-class-specifiers, type-specifiers and type-qualifier-specifiers, and the second part omitted the type-specifier construct. If recursion with regard to a storage-class-specifier or type-qualifier-specifier was required, the reference was passed to the second part of the declaration specifiers. The rule for type-specifiers now had to be revamped. Initially it only had the basic types, and all other types could be derived out of them. Now that the recursion had been eliminated, each of the possible type-specifiers had to be explicitly mentioned and handled in the function. In all, thirty-two specific cases of type-specifiers were added.

(b) CIGLOO's lexer is designed in such a way that it reserves all keywords as specific tokens, and all the TYPE-IDs found during the compilation are also reserved as keywords. This design prevents "path" in the above code from being used as a normal variable anywhere further in the program, even in an entirely different namespace. "path" has been reserved as a keyword by the lexer, and the grammar has no provision anywhere to accept such reserved keywords as variables. Hence the first task was to add this provision to the grammar. This involved a complete revamp of all available rules of the grammar: each of these rules now had to be equipped to handle a reserved type-id along with a normal identifier. Another angle to be considered while implementing this was its repercussions on the tool chain. Since the AST generated by this parser was to be used again to regenerate the original code, the changes had to be minimal and had to maintain the original structure. This was quite challenging, since the changes had to be made in the right rules so that no conflicts or ambiguities arose later. The parser had to be made capable of choosing the right rule for each particular case. The following is a case which could lead to an ambiguity because of the above implementation -

int path; path = 10;

In this case the parser should read the first line as a declaration statement and handle "path" as an identifier. In the next line, the parser can expect either a declaration or a statement; here it should again read "path" as an identifier and not as a type-name.

15. The ANSI C89 standard specifies that any block of code should start with its declarations, if present, followed by the statements (C99 later relaxed this restriction). This is the standard format and the parser adhered to it. While running the tests on certain NXP files it was observed that there existed code which had statements before declarations. Further tests revealed that there was also code which intermingled declarations and statements throughout the code block. To handle this, the rules for compound statements, which are invoked whenever a code block is encountered, had to be modified. Formerly the rule for compound statements was simple, with cases to handle an empty block, only statements, only declarations, or declaration lists followed by statement lists. To handle the NXP-specific constructs, i.e. the intermingling of declaration and statement lists, another rule was added. This rule contains all possible combinations of statement lists and declaration lists, barring one possible case: that of a declaration being the last statement of a code body. The rule is as stated below -

(decl-stmt-list
 ((declaration-list statement-list)
  `(,declaration-list ,statement-list))
 ((declaration-list statement-list decl-stmt-list)
  `(,declaration-list ,statement-list ,decl-stmt-list)))

Relevant changes with respect to the above rule were then incorporated in the rule for compound statements, i.e. a case of the form "BRA-OPEN statement-list decl-stmt-list BRA-CLO" was added.
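For illustration, a function body of the kind found in the NXP code, with declaration and statement lists intermingled, might look as follows (hypothetical example; valid in C99).

#include <stdio.h>

void mix(int n) {
    int a = n;              /* declaration list                    */
    a = a + 1;              /* statement list                      */
    int b = a * 2;          /* declaration list again (valid C99)  */
    printf("%d %d\n", a, b);
}

int main(void) { mix(3); return 0; }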

Chapter 7

Implementations

A number of modules were developed to help in the transformation of Platform Independent Code into a Platform Independent Model. Each of these modules performs certain specific functions which in turn help in the overall transformation process. This chapter contains information regarding these modules.

7.1 Standardising the AST

The objective of this module is to verify whether the AST conforms to a standard format, in this case the C99 standard. If it does not, it needs to perform operations to transform it into the standard format. The next objective is to transform the AST from its representation in the Scheme structure format into the Scheme list format (also referred to as S-Expressions). This was required as the functions provided by Scheme to perform operations on lists are far more convenient and flexible than those provided for structures. An overview of this module is depicted in Figure 7.1. This module has the following functionality.

1. It takes as input the AST generated by the parser (The AST is in the BIGLOO structure format.).

2. Checks whether the AST is in the standard form or in the NXP-specific style (similar to the Kernighan and Ritchie [33] style) format.

3. If the AST is in the standard format, no processing is done on it.

4. If the AST is in the Kernighan and Ritchie style format,

(a) Extract the parts of the AST containing the identifiers and the declarations into two separate lists. There now exist two lists: one containing the identifiers and the other containing their declarations. The two lists need not be aligned, meaning that any identifier in the list of identifiers may have its declaration anywhere in the other list.
(b) A simple mapping procedure is then applied to get the identifiers and their respective declarations in the same order in each of the lists.
(c) Once the two lists are in the same order, each identifier is merged with its respective declaration into a parameter declaration structure, the same as defined in the standard AST format. This structure is created for each identifier present in the list.
(d) All these structures are then merged into a single list.

Figure 7.1: Architecture of the Engine Modify module.

5. This newly formed list of parameter declarations is then used to recreate the original AST.

There are two different ways of defining a function. The more common way is to define the function in the standard C format.

Ret-type function-name((type-name var-name)*)
{
    //Function body
}

In this method all the arguments of the function are declared within the parentheses after the function name. This declaration of arguments includes the type of the variable and the associated identifier. If a structure has to be declared, then it is included with its complete syntax within the parentheses. This style of defining functions is the one referred to in the C99 standard. Another way of defining a function, which is not much in use currently, is the K&R style [33], but parsers are still designed to accept this format too. In this method the arguments are just mentioned by their identifiers within the parentheses, whereas the declarations of these arguments are provided after the parentheses. The structure looks like this -

Ret-type function-name(identifiers)
    declarations of the identifiers
{
    //Function body
}

This style of function definition is referred to as the Kernighan and Ritchie style of coding. The existing parser could not handle such constructs, as it expected a compound statement, i.e. a set of statements within braces, after the closing parenthesis. To make the parser handle these situations, a case was added which could take care of this and generate an equivalent AST. The case was added to the rule of function-body.

((declaration-list compound-statement) `(,declaration-list ,compound-statement))

This modification allows the parser to handle these constructs, although it was not the complete solution. Consider the following two function definitions -

1. void func1(int a, struct foo{int x; int y;}b)
   {
       printf ("Hello, world");
       //Function body
   }

2. void func2 (b, a)
   int a;
   struct foo{ int x; int y;} b;
   {
       printf ("Hello, world");
       //Function body
   }

Figure 7.2: Representation of the AST for Case 1

The functions defined above are semantically the same; they are just defined in a different manner, and the parser handles each of them differently. The former is the standard way of defining functions, and hence everything within the parentheses after the function name will be treated as a function parameter. In the above, case 1 is defined in the standard format and will generate the AST depicted in Figure 7.2. The generated AST has the following format -

#{fun-def #f #f (function ret-type) (declarators) (function-body)}

The declarator consists of the following pattern

#{decl2 (func identifier) (#{para-decl type-name identifier} #{para-decl type-name identifier}...)}

Above is the general and semantically fitting format for describing a function. For the second case this description changes: although the parser treats the construct as a function, it cannot correctly interpret the declaration statements after the parenthesis. This leads to the inclusion of these statements with the function body. The structure of the AST produced for the second case is shown in Figure 7.3.

Figure 7.3: Representation of the AST for Case 2

It is easily observable that although the semantics of the two function bodies are exactly the same, their ASTs are different. To maintain uniformity in the generated AST it was essential to modify the AST generated from the Kernighan and Ritchie style of coding and conform it to the existing standard. A complete revamp of the existing structure was required. This called for a thorough understanding of the generated AST and for marking the exact differences between the two ASTs. Another important consideration is the fact that the declarations of the arguments need not be in the same order as they are mentioned within the parentheses. Hence, along with the reconstruction of the whole AST, a rearrangement of the lists holding these declarations is also required. The recreated AST can be used again for any further operation. Another requirement of the project was that the generated AST needs to be re-read by another module of the tool chain to unparse it and construct valid code out of it. This meant that the generated AST had to be saved for further use. As mentioned earlier, the AST is generated in a particular format, referred to as structures in the BIGLOO nomenclature. These structures have the limitation that they cannot be read directly from a file and used by the compiler, i.e. the compiler can read these structures from a file as text but cannot interpret them as structures. Hence it was important to transform these ASTs into a format which could later be read and correctly interpreted by the BIGLOO compiler. The list format was one such option. All the ASTs are checked for conformity with the standard format of the AST, i.e. whether they are in the C99 format or the Kernighan and Ritchie [33] format, modified if necessary, and then written to a file in the list format.

7.2 Formatting

Each of the ASTs mentioned so far was in the BIGLOO structure format. BIGLOO offers multiple ways of depicting data, including structures, lists and records. Data written in the form of structures has the limitation that it cannot be read back entirely from a file. To elaborate, consider a simple structure as shown below -

#{ident 0 #{coord 1 "file.c" 779} "var"}

When this structure is to be read from a file using the BIGLOO Scheme functions, it is not possible to extract the complete structure as it is from the file in a single read cycle while at the same time maintaining the integrity of the structure format. If, on the other hand, the same data is represented in the form of a list and stored in a file, it can easily be extracted and reused. The problem now is that the functions working on the generated AST require it to be represented only through structures, while the structured data cannot be passed on for processing through a file. This was a stand-off between the two, and to resolve it a transformation had to be written which converts the AST from its structure format to the more convenient list format. This transformation was made possible by matching the AST against certain defined patterns and rewriting it in the list format. Pattern matching played an essential role in this transformation.

7.2.1 Pattern Matching

Pattern matching is not a new technique and has been a standard feature of many functional languages like ML, Caml and Haskell for some time now. Pattern matching in its simplest sense allows one to match a particular value against several cases. This may seem similar to a simple switch statement in C or Java, but unlike such high-level languages, which only allow matching numbers or single values, pattern matching allows one to match what are essentially the creation forms of objects. The general format of using pattern matching is as follows.

(match <expression>
  (<pattern 1> <expression 1>)
  (<pattern 2> <expression 2>)
  ...)

Here <expression> contains the value to be matched. The <pattern x> are the different patterns against which the input is to be matched. The corresponding <expression x> of each pattern determines the action to be performed if that particular match is made. To illustrate with an example, consider the case of identifying the structure of an identifier or that of a declarator in an input AST.

(match-case (ast)
  (#{ident ?var1 ?var2 ?id}
   ;; do this if matched
   )
  (#{decl ?var1 ?decl-spec ?declarators}
   ;; do this if matched
   ))

In this case the variable ast contains the source to be matched. The variable is first tried against the first pattern, which matches identifiers. If it matches, the statements corresponding to that case are executed; if not, the next case is tried, and so on. The difference here is that instead of matching single values, certain constructs are being matched. The variables above which are preceded by a '?' are placeholders for anything that may be in the construct to be matched.

The patterns depicted above are simple patterns. Most cases involve patterns which are deeply nested. Patterns can actually be compared to expressions, since they nest in the same way. Like expressions, patterns construct a tree of objects, but in the case of patterns there are no specifics to be considered: placeholders in place of the actual specifics are what is needed.

Considering what pattern matching can do, it is essential to know where this can be useful. This can be explained best via an example. Take an XML tree. The XML tree consists of pure data consisting of nodes. Now consider the task of translating this tree into certain richer format where there are lists of objects of various types. This list could contain anything in it, ranging from telephone numbers to addresses, just anything. Now if the task is to access all these elements in a statically typed manner, then there might be complications since the type of the elements is unknown. To solve this in any language one needs to determine the instance of each individual element and then cast it accordingly. Instead of this, pattern matching is a better alternative since it does the same thing but in a much faster way.

In essence, pattern matching is particularly useful when data from structured graphs needs to be accessed from the outside and methods cannot be added to these object graphs. Pattern matching is helpful in situations involving parsed data like abstract syntax trees.

Pattern matching in BIGLOO - BIGLOO provides built-in features for pattern matching [17]. To transform the generated ASTs from the structure format to the list format, it is required that we create a list of all patterns that may be found in the AST and then match the ASTs against them. A match was attempted for each structure of the AST, and if a match is found then that particular structure is transformed to its corresponding list format. ASTs generally contain other patterns within a single pattern, i.e. there is nesting of patterns. These patterns need to be handled too. This required a recursive function to be put in place which could handle all these patterns efficiently and transform them into the list format. The write-decl function in the Engine modify module handles this functionality. It works with a few helper functions which eliminate existing lists.

7.3 Module translate read-ast

Data can be represented in many formats in Scheme. Some of them are:

• List format - Represents the data within parentheses, e.g. (a b c 1 2).

• Structure format - Represents the data in braces preceded by a hash; the structure format is similar to structures in C or C++, e.g. #{a b 2 1}.

• Record format - Another way of representing data is by defining it as a record. Records are represented as #(a 1 b 2).

BIGLOO Scheme provides various functions to access data written in these formats. Of the three mentioned above, the list format is the most convenient and flexible format for handling data. The main purpose of this module is to recreate, in structure form, the AST which has been modified into list format by the earlier module. The module which reads this AST and regenerates the code from it has been designed specifically to expect certain formats and structures. The intermediate representation generated by the engine modify module is a BIGLOO-specific S-Expression format, whereas the actual AST generated by the parser has a structure representation, so it is all the more important to maintain a single standard of representation. This necessitates the implementation of this module. The module performs the following functionalities.

1. Reading an AST in its S-Expression format from a file.

2. Extracting each syntax tree from the AST forest (Still in its S-Expression form).

3. Converting each of these trees back into its original structure format.

4. Appending all these individual trees together and regenerating the AST forest, but now in a structure format.

The implementation of this module was fairly along the lines of the previous module, except for a few basic changes. Initially, each of the syntax trees is extracted from the whole list via a loop and then passed on to the write-back function, which does the actual conversion. In this function all possible patterns were included for pattern matching. The main challenge here was to handle the various lists (since they often tend to be recursive), eliminate them and identify the basic patterns. Along with this it was also imperative to handle the various patterns which had an S-Expression format originally; these patterns had to be recognized and excluded from this transformation. To handle the lists which encapsulated the basic patterns, a separate function, handle-back-list, as mentioned earlier, was written. Its functionality is the same as the earlier one: it takes in a part/node of a tree and eliminates all list-like expressions from it. The write-back function has all possible patterns of the AST listed in it. These patterns may in turn include other patterns which are also listed in the same function. Whenever a pattern is expected within a pattern, a recursive call to the same function (write-back), but with the new pattern, is made. Since an exhaustive list covering all possible patterns is included in this function, any construct that is not matched is written back in the same format. The write-back function is divided into two parts, write-back and write-back2. Although write-back2 is merely an extension containing patterns which could not be included in write-back, this divide was necessary as the BIGLOO compiler could not handle more than eight patterns in a single block. Hence the divide was made, but it has no semantic significance.

7.4 Module SizeOf

The existing GXF structure allows a certain number of attributes of the actor to be reflected. These include port attributes i.e. type of port, its properties as to whether it is an input or an output port, its buffer size which in turn depicts the port rate, node attributes like the node id, edge attributes like the from node and the to node.

Each actor in the dataflow model has one or more ports associated with it. A port signifies a major characteristic of the associated actor: it is the interface between two computational units for the transfer of data. To pass data among actors, the actors must first be connected to each other via an edge. Although this may seem sufficient to pass data, there are still a few other attributes to be taken care of. Most importantly, of the two connected ports, one should be an output port and the other an input port. This may seem trivial but is nevertheless an essential check. Apart from this, it should also be verified that the two ports connected via the edge are compatible with each other. Compatible here means that the rate at which data is sent by the output port is acceptable to the port at the other end. To check this compatibility between ports it is essential to verify that the sizes of the ports at the ends of the edge are the same. Consider an example description of an actor -

Actor process (const int ip[4], float op) {

//function body

}

In the above code chunk the actor process has an input port by the name ip. Its specification suggests that it has a buffer capacity of four, and the type of the port is integer. This implies that it cannot send data to or receive data from another port which has a data type different from an integer. It may seem sufficient to just check the type name of the port to determine whether two ports are compatible, and incidentally the present version of GXF does exactly that. In addition, however, it is the size of the port that should be matched against its corresponding port on the other side of the edge. The size of the port includes the buffer capacity along with the size of the type of the port. To exemplify, in the above case the size of the input port const int ip[4] depends on the buffer capacity, in this case four, and the size of the port type, in this case int. The port type in this case is a basic data type, but there may arise cases where the port type is a user-defined data type like a structure or union. The basic intention is to make sure that the two ports connected at either end of an edge are compatible with each other, and for this we should make sure that the sizes of these ports are the same. The above requirement led to the development of this module. The buffer capacity of a port can be directly accessed from the AST; it is the size of the type of the port which requires pattern matching and calculation. The port type could be any user-defined data type from the program. The sizeof operator of C was studied to understand the logic behind calculating the sizes of the various user-defined types in a program.

Sizeof is a unary operator in C and can operate on any type, be it a basic data type like int or float, a user-defined type like structures and unions, or even pointers. It is worth knowing why, besides its use in the situation concerning the ports in GXF, the sizeof operator was designed in the first place. There are times when knowledge of the size of a particular data type is required. In particular, when dealing with dynamic memory allocation it is always beneficial and safer not to allocate memory by specifying a literal size value, because although the sizes of basic data types are constant for a particular implementation, they may change across implementations. Providing a literal size value to the dynamic memory allocation function amounts to hard-coding a non-constant value into the code, which may lead to portability problems for the program. This is the case for basic data types, and the problem of determining sizes gets even more complex when dealing with compound data types like structures and unions, for which the addresses of the members are aligned to the nearest word while compiling; this is termed structure padding.

The main purpose of the sizeof operator is to compute the space occupied by any data type in memory. Sizeof is used by writing the sizeof keyword followed by the name of the data type or variable whose size is to be computed. The value returned by this expression corresponds to the size of the data type or variable passed to it. This size is a non-negative unsigned value and depicts the size in bytes. The sizeof operator can also be applied to arrays, in which case it computes the total memory allocated for the array variable. A restriction on the use of the sizeof operator is that it can only be used on complete data types. Also, the sizeof operator should be configurable to the memory allocation scheme being used: it should base all its calculations on the scheme of each system and should not have any hard-coded values written into it.
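A small illustrative example of this portability argument, where the allocation size is derived from the type rather than hard-coded:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t n = 16;

    /* portable: the element size is taken from the type, not hard-coded */
    int *buf = malloc(n * sizeof *buf);
    if (buf == NULL)
        return 1;

    /* a hard-coded "n * 4" would silently break on an implementation
       where int is not 4 bytes */
    printf("element: %zu bytes, total: %zu bytes\n",
           sizeof *buf, n * sizeof *buf);
    free(buf);
    return 0;
}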

7.4.1 Structure Padding

A processor reads from or writes to memory in word-sized chunks. This raises two issues. The first is data alignment, which means arranging the data at memory offsets that are multiples of the word size; this in turn improves the performance of the system. The second issue is data padding, which involves adding dummy bytes between data structures so as to align them with word boundaries. To exemplify why this is important, consider the following example. Assume that the word size of a particular computer is four. This implies that the data to be read should be at a memory location which is a multiple of four. If this is not the case, i.e. the data is at a non-multiple memory location such as 22, then the processor has to read two word-sized chunks of four bytes each and combine them to obtain the data element. When calculating the sizes of compound data types, the sizeof operator has to consider the alignment of the members of these user-defined data structures. Due to this, the size of a structure in C may be greater than the sum of the sizes of its members. For example:

struct foo { char var2; int var1; };

Under naive assumptions the size of foo should add up to five: four for the size of int and one for the size of char. However, when foo is passed to the sizeof operator the size returned is eight. This is because compound data structures are by default aligned to a word boundary by the compiler. In the above case this leads to the structure foo being aligned to a word boundary and the member var1 being aligned to the next word address. These alignments are realized by inserting padding space between members, or at the end of the structure, to satisfy the alignment requirements. The advantage of aligning members to word boundaries is that processors can fetch aligned words faster than words which straddle multiple words in memory. Padding is inserted only when a particular structure member is followed by a member with a bigger alignment requirement, or when the member is the last member of the structure. This implies that by changing the ordering of the members inside a structure, the amount of padding required can also be changed. With the above knowledge the module sizeof for calculating the size of port types was implemented. The module identifies each data type, primitive or user defined, and calculates its size. The identified type and its size are then put into a list of key-value pairs, where the key refers to the data type and the value is its size. The module can even resolve the sizes of data types which have been aliased using typedef; in such a case the (key, value) pair consists of two type names, the key being the new aliased name and the value its original name. The program recursively checks this list until it finds a numerical value associated with a key, which is the actual size of the data type. The module also handles calculations appearing in the length field of an array, including recursive occurrences of the sizeof operator inside the array length field.
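The following small C program is a sketch of the padding behaviour described above; the additional member names and the assumption of a 4-byte int and 4-byte word are illustrative only.

#include <stdio.h>

struct foo   { char var2; int var1; };  /* 1 + 3 padding + 4 = 8 bytes          */
struct wide  { char a; int b; char c; };/* 1 + 3 + 4 + 1 + 3 padding = 12 bytes */
struct tight { char a; char c; int b; };/* 1 + 1 + 2 padding + 4 = 8 bytes      */

int main(void)
{
    /* The exact values depend on the compiler and the target word size. */
    printf("foo:   %zu\n", sizeof(struct foo));
    printf("wide:  %zu\n", sizeof(struct wide));
    printf("tight: %zu\n", sizeof(struct tight));
    return 0;
}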

Chapter 8

Validation and Performance Measurements

A software tool is not complete without being tested for its correctness and success. Software testing aims at evaluating a certain property or attribute of a program and determining whether it meets its success criteria. These criteria need to be precisely defined before the tests on the program are conducted. Software testing is a complex process; it can be tedious to completely test even a moderately complex program. Testing involves more than just debugging a program: it may serve verification and validation or quality assurance. The testing was not done for this tool in isolation. Parts of it were combined with the testing of the tool developed for transforming a Platform Specific Model into Platform Specific Code. Since a part of the testing was merged with that other tool, some sections of this chapter are common to both theses. The other thesis is titled Generating C code from Platform Specific Model [42]. To test a program there are generally a number of test cases. Test cases are the oldest and most common means of testing any system. The test cases may involve anything, including:

• Black box cases.

• Simple checklist.

• Guidelines on information displayed by the system.

To summarize, a test case gives a clear indication of what the program should or should not do. Most of the time the development of a system is not documented and happens on the fly. Other factors like deadlines and changing requirements cause the functional requirements document to be discarded. In the present case too, there was no formal functional specification of the requirements. This was a major hurdle and raised many questions, such as:

• What testing needs to be done on the system: unit, performance, functional or regression testing?

• From where have the developers obtained their understanding of the system?

• Availability of an end user and the priority levels of any particular functionality of the system.

• Manual or automated testing.

The different types of testing options available are:

1. Unit testing: tests written to exercise specific functionality of the code. Tests under this umbrella are mostly written by the developers themselves to verify that a certain section of the code is working as expected.

2. Integration testing - This involves anything that verifies the interfaces between components in a software system.

3. System testing - Involves the testing of the complete system.

4. Regression testing - This focuses on errors and defects which crop up after a certain major change has been implemented in the code. It is done to ensure that the latest change does not allow previous bugs to come back.

Given the paucity of time and the iterative development environment adopted, regression testing was the best suited option, although before that, unit testing was carried out at a more basic level to cover the corner cases that had to be handled by each function developed. The modules to be tested were the one which parses a C source file into an AST and the one which unparses a given AST back into a source file. The success criterion of the modules depends on the results obtained after testing them both manually and automatically. The parser was initially tested with small files for basic constructs. These tests included checking for array modifiers, pointers and assignment statements; the details have already been mentioned earlier. For the Codegen module to be tested, it must first be provided with an AST. There is hence a dependency between the two modules, and the development as well as the testing had to be carried out carefully. The parser underwent development and regression testing by individually testing it over various preprocessed source files from different libraries, mainly the GLPK [15] and the Graphviz [7] libraries, which are explained later in this section. The testing of the parser was a slow process. Since a part of the parser was already developed, it was imperative to keep testing the new developments along with the existing ones, lest new bugs arise. Hence, after every change in the parser code, the parser was tested with library source files to check whether the previously tested constructs were still accepted. Up to this point all tests were manual. The initial tests involving the parser checked whether all C constructs present in the code were handled. There was still no test to check whether the parser generated valid ASTs for each construct; such a test could only be performed in conjunction with the testing of the Codegen module. The Codegen module too was developed iteratively, for both declarations and statement-lists, and the same holds for its testing and validation. Every time a new pattern or function was added to the Codegen module, it was tested with the same source file to check whether the result obtained was similar to the original input. These test cases also steered the development towards a better and more efficient Codegen module. The initial testing commenced with simple test files containing C declaration statements. These test files were parsed through Cigloo to obtain the necessary ASTs, which served as input to the Codegen module. On execution of the Codegen module, the regenerated code was manually verified to confirm that it was similar to the original source file. This test file grew in size as the Codegen module developed from simple declaration statements to more complex structures and other declarations. Maintaining the same test files and adding more cases to them ensured that the iterative development of the Codegen module did not change results obtained earlier. The NXP source files contained complex structure initializations such as the one shown below, which proved a good test input for both the parser and the Codegen module.

struct edge edges[] = {
    {.type="init_state",
     .from={.node=constructor, .port="outs"},
     .to={.node=process, .port="inst"},
    },
    {.type="deinit_state",
     .from={.node=process, .port="outs"},
     .to={.node=destructor, .port="inst"},
    },
};

When the test files grew in size, it became too tedious to verify them manually. At this point the test had to be automated. The C source file was compiled to obtain the AST, which served as the input for the Codegen module. The resulting C code was then compared with the original C source file using the standard Unix command diff. If no differences were found between the two files, this showed that the Codegen module was working properly. If any difference apart from layout was found, the Codegen module was examined for errors and retested until no differences remained between the two codes. The agreement between the two codes served as the success criterion for the Cigloo parser as well as for the Codegen module. Testing was also performed for the nesting of structures, enumerators and multidimensional arrays to ensure that recursion was handled appropriately. The focus then moved to the testing of function-body statements. This started with simple expressions and was extended by building more complex expressions with various postfix, prefix and assignment operators. Finding no difference between the two files confirmed the correct working of the module. This testing was further expanded to include iteration-statements, selection-statements and other statements; nesting of these statements was also tested for correct output generation. The success of the manual tests confirmed that the Codegen module was working as expected. However, these test files were all written by hand and were no larger than 200-250 lines of code. To confirm the proper working of the Codegen module, it had to be checked against standard libraries. Two such libraries are the GLPK and Graphviz libraries.

8.1 GLPK

GLPK (GNU Linear Programming Kit) is a software package intended for solving large-scale linear programming (LP), mixed integer programming (MIP), and other related problems [15]. The GLPK project comprises the following:

• A callable library, written in C, for solving LP and MIP problems.

• The GNU MathProg mathematical programming language for specifying large problems.

• The GLPSOL command-line application for translating and solving models.

The reason for choosing this library for testing lies in the complexity of the mathematical equations and expressions used in its C source files. The error-free regeneration of these equations using the Codegen module is the success criterion for this module. Since the source code of GLPK is available, it was easy to create the binaries and executables to suit our needs. The ASTs generated by the Cigloo parser for the source files were the input to the Codegen module. The following steps were followed to create the library and the application from the source files.

1. The C source file was compiled using the gcc compiler to obtain the preprocessed file.

2. The preprocessed file was then compiled to obtain the object file.

3. Steps 1 and 2 were repeated until all the source files were compiled to obtain the object files.

4. The library and the GLPSOL application were created using these object files.

5. An existing example present in the GLPK distribution was used as the input to GLPSOL and the output generated was saved.

The following steps were followed to install the GLPK library using the Codegen module:

1. This step remained the same as earlier (Step 1), i.e., the C source file was compiled to obtain the corresponding preprocessed file.

2. The preprocessed file was then given as input to the Cigloo parser, and the corresponding AST produced was the input to the Codegen module. The output file generated was also a preprocessed file with the same filename but a new extension.

3. Steps 1 and 2 were executed until all the source files were regenerated with the Codegen module.

4. From the newly generated preprocessed files, the library and the glpsol application were re-created.

5. The same input as used in the earlier steps was considered and the output generated was saved.

There were two success criteria for this testing. Firstly, all the preprocessed files had to be re-created without any errors using the Codegen module, and these regenerated files had to compile into object files. The second criterion concerned the result: the output obtained using the original preprocessed files had to match the output obtained with the regenerated preprocessed files. The successful execution of this test ensured that the Codegen module works correctly in regenerating the original code from the AST.

8.2 Graphviz

Graphviz is one of the oldest but still active graph visualization packages [7]. The package was first implemented in the early 1990s at AT&T Research Labs and has developed into one of the most widely used and accepted graph drawing packages. Graphviz is a collection of software for viewing and manipulating abstract graphs. It provides graph visualization for tools and websites in domains such as software engineering, networking, databases, knowledge representation and bio-informatics. Graphviz consists of several separate command-line tools. Most of Graphviz is written in C. The supporting libraries consist of 45,000 lines of code; the hierarchical layout consists of 6000 lines, Kamada-Kawai about 3700 lines, and so on. These large files are a good source of input for testing the Codegen module. The steps followed for this testing are similar to those for the GLPK library. The testing of the Graphviz library could not be completed and is only partially done. Certain files in the library threw errors when run with the parser. On further evaluation it was found that certain grammar rules introduced in CIGLOO to handle the constructs of files in the GLPK-4.43 library conflicted with the constructs of files in the Graphviz library. Limited time prevented us from looking further into the problem and solving it. Of approximately 450 files in the Graphviz library, we could successfully test 220 files. The files which could not be parsed due to these ambiguities are listed in Appendix B. A few statistics of the files tested are given in the table below:

Source              Average size (LOC)   Num. of files
GLPK library        1400                 65
Graphviz library    1700                 450
Self developed      75                   35

Table 8.1: Statistics of test cases involved

The successful testing of the GLPK-4.43 library proved the correctness of both the CIGLOO parser and the Codegen module. The Codegen module was then compiled into a library called Unparser. The testing process was repeated using this new library to confirm its proper working. This new library was then added to the CIGLOO package, and the new CIGLOO is called Cigloo1.1. Figure 8.1 depicts the package structure of Cigloo1.1. Apart from the Unparser library, two more new libraries were added to Cigloo1.1: sizeof, which takes an AST as input and calculates the sizes of the data types present in it, and modifyAst, which takes as input an AST in the structured format and produces an AST in the list format.

Figure 8.1: Package structure of Cigloo1.1

8.2.1 SizeOf library

The output of the SizeOf module is a list containing all defined data types and their respective sizes. It is impractical to manually check this list against the input code and verify that the generated sizes are correct; hence the testing had to be automated. To implement this, the module was modified to generate a C file as output containing all the defined data types and typedef statements. For every such defined data type and typedef statement, an if conditional statement is generated. The condition checked is whether the actual size of the type named by the key is equal to the size computed for it in the key-value pair; this key-value pair corresponds to the immediately preceding data type or typedef statement printed in the generated C file. An error message is printed if the sizes do not match, which helps in debugging the module. Testing was successfully performed on all the C source files included in the GLPK-4.43 library. Testing of structure declarations of the following type has not yet been carried out:

struct foo { int a; char b; };

The library generates the sizes of these structures and includes them in the list; the current implementation, however, does not include test cases for these constructs in the generated C file.
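A hypothetical excerpt of such a generated check file is sketched below; the typedef names and the computed sizes are illustrative assumptions and not actual output of the SizeOf module.

#include <stdio.h>

typedef unsigned int counter_t;   /* assumed typedef taken from an input file    */
typedef counter_t    ticks_t;     /* alias of an alias, resolved via the list    */

int main(void)
{
    /* 4 is the size the SizeOf module is assumed to have computed for counter_t. */
    if (sizeof(counter_t) != 4)
        printf("size mismatch for counter_t\n");

    /* ticks_t resolves recursively to counter_t, so the same size applies. */
    if (sizeof(ticks_t) != 4)
        printf("size mismatch for ticks_t\n");

    return 0;
}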

8.2.2 ModifyAst library

The ModifyAst library was tested along with the translate read-ast module and the Unparser library. The output of the ModifyAst library is a file containing the AST in the Scheme list format. This file is then read by the translate read-ast module, which transforms the list format of the AST into the structure format. This transformed AST is then passed on to the Unparser module to be regenerated into C source code. An executable named unparser was created which included all three libraries and performed the said functions. Testing for this was carried out manually on a subset of the files included in the GLPK-4.43 library. The executable was run on each file and the output generated was checked against the original file for any differences. The structure of the AST generated by the parser and the patterns written to match it are so specific that even a single change or difference would throw an error and stop the execution; hence the patterns are written for an exact match of what the parser generates. An improvement on this testing would be to automate it in the same manner as the integrated testing referred to above. After the completion of the integration and system testing of the libraries, it is important to measure the performance of CIGLOO1.1 to ensure that there is no deviation in its behaviour. The performance of the application can be recorded as time consumption against certain standard parameters like Lines of Code (LOC) or a complexity metric.

8.3 Performance Measurements

The system testing of CIGLOO1.1 ensured the proper working of the tool. Although functionality is the main criterion for the successful development of the tool, it is necessary that the performance of the tool is not compromised. There is a need to ensure that the tool executes using minimal space and that the time taken to process an input and produce the corresponding output is minimal. A method to assess this is by profiling the tool. Profiling is a well-known technique for recording program behavior and measuring program performance. It is used to estimate program execution times for code optimization and to identify program bottlenecks. In software engineering, program profiling (software profiling) is a form of dynamic program analysis: the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize in order to increase its overall speed or decrease its memory requirements. Profiling allows the user to collect information on the time spent in the execution of the program. Since the profiler uses the information collected during the actual execution of a program, it can be used on programs that are too large or too complex to analyze otherwise. However, the need for profiling arises only if there is a difference in the behaviour of the tool. The behaviour considered here is the time taken to execute the tool on a given input; the inputs are a set of preprocessed C files. A method to measure the performance is to draw a graph with one of its axes showing the time taken for the execution of the tool on a given input. Since these input files vary in size, the other axis can show the lines of code (LOC) of the input file. Nevertheless, LOC alone does not determine the performance of the tool; other software metrics can also be used.

8.4 Software metric

A software metric is a measure of a certain property of a piece of software or its specification. It is important to have a tangible and quantitative approach to measuring these properties of software. Tom DeMarco stated, "You can't control what you can't measure" [26]. It has been proven time and again that certain simple measurements can be beneficial for a system in the long run. Cyclomatic complexity is one such measurable attribute of a software system. It indicates the number of linearly independent paths through a program's source code. To calculate the cyclomatic complexity of a program, the control flow graph [2] of that program is required. The nodes of this graph indicate linear and indivisible groups of commands of the program. A directed edge between two nodes indicates that the node pointed to by the edge represents the command or group of commands which will be executed immediately after the present group of commands. This measure need not be applied to complete software programs; it may also be applied to individual functions, methods or classes within a program. To calculate the cyclomatic complexity of a program mathematically, it is essential to have its control flow graph, because the complexity is calculated as follows:

M = E − N + 2P    (8.1)

where
M = cyclomatic complexity,
E = number of edges in the control flow graph,
N = number of nodes in the control flow graph,
P = number of connected components of the graph.

An alternative formulation, in which the control flow graph is made strongly connected by connecting the exit point of the graph back to its entry point, may also be considered. In this case the complexity is calculated as follows:

M = E − N + P    (8.2)

where each variable has the meaning mentioned before. There are numerous tools that calculate the McCabe complexity for a given source file. One such tool is pmccabe [12].
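As a worked illustration of equation (8.1), consider the following small C function; it is a hypothetical example and not taken from the tested libraries.

int classify(int x)
{
    if (x < 0)      /* decision 1 */
        return -1;
    if (x == 0)     /* decision 2 */
        return 0;
    return 1;
}

/* Using a control flow graph with a single exit node, this function has
   N = 6 nodes, E = 7 edges and P = 1 connected component, so
   M = E - N + 2P = 7 - 6 + 2 = 3,
   which agrees with the usual rule of thumb "number of decisions + 1". */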

8.5 Pmccabe

Pmccabe [12] is a software tool to calculate McCabe-style complexity for C and C++ source code. Apart from calculating the complexity, the package also includes the following tools:

• decomment: A tool which removes all comments from the source code.

• codechanges: A tool which computes the changes between two source files.

• vifn: A tool to invoke the vi editor with a function name.

Our focus here is mainly on the pmccabe tool itself. An advantage of the pmccabe tool is that it calculates the apparent complexity of the C/C++ source code rather than computing the complexity after the source files have been preprocessed. The pmccabe tool takes a C source file as input, parses it and produces six-column output. Pmccabe generates two types of cyclomatic complexity:

1. One that counts each C switch statement as a whole, without considering the number of cases in the switch construct.

2. A primitive approach which includes each individual case mentioned in the switch construct.

The first column in the output of pmccabe corresponds to the first case mentioned above, i.e. the complexity of the program where each case in a switch construct is not considered individually. The next column corresponds to the second case. Columns three, four and five deal with line counts, giving the number of statements in a function, the line number of the first statement in the function and the total number of lines within the function, respectively. The last column includes the file name, the line number on which the function name occurs, and the name of the function. Figure 8.2 shows a snapshot of the pmccabe tool with the various columns as detailed above.

Figure 8.2: Pmccabe tool result page

The current requirement is only to obtain the complexity of the files and then compare it against other attributes of the files. The earlier tests with the standard libraries ensured that the parser and pretty printer have been developed as per the requirements. However, these tests do not show whether there is any change in the behavior of the tools for the various inputs. To check this, it is necessary to compare the time taken for the execution of these two tools against other parameters. The two parameters best suited for this purpose are the size of the C code, measured in Lines of Code (LOC), and the McCabe complexity. These two parameters give a better picture of the performance of the tool. With these parameters, two graphs were drawn for a set of input files. The table below shows the details of the graphs. The time mentioned here is measured in milliseconds (ms) and is the total time taken for both parsing and pretty printing.

Chart name                X-axis               Y-axis
LOC vs Time(ms)           LOC                  Time
Complexity vs Time(ms)    McCabe complexity    Time

Table 8.2: Axis information of the graphs

The graphs in Figures 8.3 and 8.4 are drawn based on the total running time taken by the parser and the pretty printer against the lines of code (LOC) and the McCabe complexity of the input source files. The first graph shows a linear increase in the time taken for parsing and pretty printing against the LOC. The same applies to the second graph: as the McCabe complexity increases, the time taken for parsing and pretty printing also increases linearly. These two graphs show that there is no deviation or sudden change in the behavior of the parser and pretty printer for any given set of input files.

Figure 8.3: Graph of LOC vs Time taken for parsing and pretty printing

The requirements stated that the tool should be able to run in both Java and native mode. Tests in native mode were done first. To test the Java mode, the libraries were first put into a package structure so as to work with the Eclipse framework. These libraries were then imported as external jar files and used in a Java project. A test file was written to invoke the functions in the library. This test ran successfully, ensuring that the project could be run as a Java application. The final step was to create a Java plug-in from this library. The necessary libraries and the standard plug-ins like the user interface and control menu were imported. On execution, certain errors occurred due to the functions interfacing with BIGLOO. This made us postpone this implementation as future work.

Figure 8.4: Graph of McCabe complexity vs Time taken for parsing and pretty printing

Chapter 9

Related work

C is one of the most preferred programming languages and has been around for a long time. Evidently, many tools exist which can parse C code into its corresponding AST. This chapter presents a few of these tools, which were analysed and compared to the tool developed in this project.

9.1 Edison Design Group

Edison Design Group (EDG) [30] is a private corporation focusing on the development of compiler front ends. The group provides technology that parses computer programming languages. Languages accepted by this compiler include C++ as defined by the ISO/IEC 14882:2003 standard and also the C language as defined by the C89 and C99 standards. The compiler provided by EDG reads source code and generates information that fully describes the structure and meaning of the code. The front end, as the compiler is generally referred to, translates source programs into a high-level, tree-structured, in-memory intermediate language. The intermediate language preserves a great deal of source information (e.g., line numbers, column numbers, original types, original names), which is helpful in generating symbolic debugging information as well as in source analysis and transformation applications. Implicit and overloaded operations in the source program are made explicit in the intermediate language, but constructs are not otherwise added, removed, or reordered. The intermediate language is not machine dependent (e.g., it does not specify registers or dictate the layout of stack frames). The front end can optionally generate raw cross-reference information, which can be used as a basis for building source browsing tools. The front end provided by EDG is complete in every feature but not freely available, thus making it unsuitable for use in this project. It also includes a C-generating back end which can be used to generate C code for C++ programs. The group is noteworthy because its front ends have the features, reliability and efficiency required of commercial compilers, and because the front end provided by the group is capable of supporting the complete C++ standard including all its extensions; the same holds true for the C programming language. The completeness of this compiler is shown by the fact that it even handles slightly relaxed modes of C and C++, since in real-world programming not all programs are written strictly according to the standards. The compiler is written in the C language. Care is taken to separate the characteristics of the host computer from those of the target computer so that no complications arise while rehosting or retargeting. The compiler also provides elaborate debugging capabilities, although one may choose to exclude these by setting the appropriate options.

9.2 C Intermediate language (CIL)

C Intermediate Language (CIL) is a high-level representation along with a set of tools that permit easy analysis and source-to-source transformation of C programs [37]. The intermediate language generated by this compiler is at a lower level than abstract syntax trees, since it clarifies ambiguous constructs and removes redundant ones. At the same time this intermediate language is at a higher level than other intermediate languages, as it maintains type information and a close relationship with the source program. A major benefit of this tool is that it compiles all valid C programs into a few core constructs. The representation used by CIL is syntax directed, making it easy to analyze and manipulate C programs. The CIL front end is capable of handling the ANSI C, Microsoft C and GNU C standards. As mentioned, the CIL tool does not generate the usual AST; rather, the intermediate language is more like a subset of C, which makes the code easier to understand. The intermediate language offers a reduced number of syntactic and conceptual forms; for example, all looping constructs are reduced to a single form. Below are a few significant examples of how CIL compiles a C program.

1. It interprets and normalizes the type specifiers -

int long signed x;
signed long extern x;
long static int long y;

int main() {
    return x + y;
}

The CIL output for the same is:

/* Generated by CIL v. 1.3.7 */
/* print_CIL_Input is true */

#line 1 "cilcode.tmp/ex1.c"
long x ;
#line 3 "cilcode.tmp/ex1.c"
static long long y ;
#line 6 "cilcode.tmp/ex1.c"
int main(void)
{
  {
  #line 6
  return ((int )((long long )x + y));
  }
}

2. When structures or unions without identifiers are encountered, CIL provides them with names:

struct { int x; } s;

leads to the following output:

#line 1 "cilcode.tmp/ex2.c"
struct __anonstruct_s_1 {
   int x ;
};
#line 1 "cilcode.tmp/ex2.c"
struct __anonstruct_s_1 s ;

3. Nested structures as given below are also handled and lead to the following output -

struct foo {
   struct bar {
      union baz {
         int x1;
         double x2;
      } u1;
      int y;
   } s1;
   int z;
} f;

Output:

#line 1 "cilcode.tmp/ex3.c"
union baz {
   int x1 ;
   double x2 ;
};
#line 1 "cilcode.tmp/ex3.c"
struct bar {
   union baz u1 ;
   int y ;
};
#line 1 "cilcode.tmp/ex3.c"
struct foo {
   struct bar s1 ;
   int z ;
};
#line 1 "cilcode.tmp/ex3.c"
struct foo f ;

As can be observed from the outputs above, the intermediate language produced by CIL is of a very high level. This representation is a simplified version of the input code, but it is at too high a level to be utilized for pattern matching. The aim of our exercise is to extract relevant information through pattern matching and then use this information to generate the required models. With this kind of representation it is easier to understand the code, but it is not convenient for pattern matching: the intermediate language does not provide any fixed patterns which can be used as hard references. In the case of ASTs, there is always a defined set of structures available against which patterns can be matched.

9.3 Src2srcML

Src2srcML [23] is a lightweight fact extractor which makes use of XML tools like XPath and XSLT to extract information from C++ source code. Fact extraction, usually the first process in reverse engineering, involves the identification and subsequent extraction of facts, entities, etc. from the source code. It includes processing such as parsing and/or searching the source code for the relevant facts. Src2srcML utilizes the ANTLR parser generator [40]. A distinctive feature of this tool compared to others is that it does not use a complete grammar for recognizing C++ constructs; rather, it uses island grammars [35]. Island grammars are suited to reverse engineering processes like fact extraction, as they take advantage of the fact that these applications do not need complete parse trees. According to Moonen, island grammars consist of productions describing certain particular constructs, referred to as islands, along with other, more liberal productions which handle the rest. Island grammars can be expressed in any grammar specification formalism or parsing technique [23]. They are well suited to situations requiring the identification and translation of high-level constructs while at the same time eliminating the lower-level constructs. This is an advantage on one hand, but it has its own disadvantages too: the output of the tool is not complete and loses most of the information. An incorrect input does not necessarily halt the functioning of the parser. Source code is translated to the srcML format after passing through multiple stages. Translation is carried out by the srcML translator, which is constructed using ANTLR with a pred-LL(k) grammar specification and context stacks. The translator converts the C++ source file into an XML representation. On recognizing the start of a syntactic structure, the translator places an XML start tag and a transition is made in the context stack to mark the current state of the construct being parsed. Whenever a block is terminated, the translator puts an XML closing tag and performs another transition on the context stack to move to the previous context. Literature on Src2srcML [40, 22] suggests that Src2srcML is relatively fast, mostly because it is not accompanied by a preprocessor and also because it does not do any extra work related to resolving C++ ambiguities. The fact that it does not generate a full parse tree, does not resolve certain ambiguities and occasionally fails in parsing code inside function bodies [22] makes it unsuitable for the tool to be developed. The following is a small code excerpt and its corresponding output from the src2srcML tool.

// swap two numbers
if( a > b) {
    t = a;
    a = b;
    b = t;
}

(The corresponding srcML output wraps each construct of this excerpt in XML elements; the XML markup is not reproduced here.)

9.4 Columbus

Columbus is a reverse engineering framework [28] which has been developed in cooperation between the Research Group on Artificial Intelligence in Szeged and the Software Technology Laboratory of Nokia Research Center. Columbus is able to analyze large C/C++ projects and extract their UML class model as well as conventional call graphs. It is a framework tool which supports project handling, data extraction, data representation, data storage, filtering, and visualization. The appropriate modules of the system help in accomplishing the basic tasks of the reverse-engineering process. Parsing of the input source code is performed by the C/C++ extractor plug-in of Columbus, which invokes a separate program called CAN (C++ Analyzer). The information extracted by the plug-in corresponds to the Columbus Schema [27]. The Columbus Schema is also used as the internal representation in the C/C++ extractor module CAN of the Columbus reverse engineering tool [28]. It reflects the low-level (Abstract Syntax Tree) structure of the code, as well as higher-level semantic information (e.g. the semantics of types). Unfortunately Columbus is not a freely available tool and needs to be licensed from Frontendart [13]. As mentioned, Columbus comes complete with a C++ analyzer CAN, a C++ preprocessor CANPP, a linker CANLink and the exporter exportCPP. Columbus is slightly lenient in accepting input code and can accept supersets of C/C++, implying that certain dialects of these languages are allowed. Columbus has the advantage that it does not halt the parsing process if an error is encountered. It outputs the information in multiple formats and is hence suitable for different purposes. Another advantage is that Columbus comes with its own preprocessor, and it is quite well documented too. With all these advantages there are a few shortcomings too, most importantly the fact that Columbus is not freely available and is a licensed tool; moreover, it is not permitted to modify or extend the tool for any kind of customization. Columbus' support for C++ is adequate but unfortunately not complete. Columbus can parse C, but only as a subset of C++. C, however, is not a strict subset of C++, and there are many projects which use a very specific and particular dialect of C which cannot be parsed by Columbus. This C support is crucial, since many industry projects still utilize legacy C libraries.

9.5 Design Maintenance System (DMS)

DMS [19] is a commercial tool developed by Semantic Designs [16]. It is a program analysis and transformation system. DMS is a complete package composed of several inter-related tools. Languages covered by DMS include COBOL, C/C++, Java, Fortran90 and VHDL. DMS stores the source information from input source files in the form of hypergraphs. These hypergraphs are a forest of abstract syntax trees, call graphs and flow graphs. For this hypergraph DMS offers a number of tools for various tasks such as source-to-source transformations for code optimization, program analysis, pretty printing and so on [19]. Our focus is mainly on the C parsing capabilities of DMS. The DMS parser is built using the combination of a lexer, a preprocessor and the parser. The parser takes a C source file as input and produces as output a forest of alternative parse trees. The DMS parser uses Generalized LR parsing [41], which generalizes LR parsing by efficiently trying all possible parses in parallel, to carry out so-called "full context-free parsing" [19]. Cases involving ambiguity nodes, such as multiple parses over the same phrase, are resolved by a symbol table construction step that follows parsing. Name lookups and scoping are also carried out at the same time. The symbol table is used via APIs implementing language-specific name lookup rules. DMS has been tested to analyze millions of lines of C code [ref-6]. DMS is high on performance since it uses symmetric multiprocessing support on x86 machines. DMS fits the requirements of the project very well, but unfortunately it is a closed-source project and it cannot be run as native code in batch mode and on the JVM inside Eclipse. Hence it could not be considered for the tool.

Chapter 10

Conclusion and Future Work

10.1 Conclusion

In this thesis we have discussed how to transform preprocessed C source code into a dataflow model represented in the form of GXF. The objective was to introduce a layer of abstraction in the development of applications in a parallel programming environment. This objective led to the development of the LIME-ng tool chain, which takes platform-independent C code as its input and transforms it into platform-dependent C code. The requirement of implementing the tool such that it works on both native and JVM backends led us to choose the BIGLOO Scheme programming framework as the implementation language. BIGLOO Scheme also offers a host of other features, as discussed in chapter 4 of this report. In particular, BIGLOO offers pattern matching, which was conveniently suited for this code-to-model transformation. Pattern matching requires the existence of certain fixed and defined cases which can be matched against any input. To obtain such patterns from the input source code, the given source files have to be transformed into an AST. To make this transition, a parser which could identify the constructs of the C programming language was required. CIGLOO, the FFI generator of BIGLOO, provided a grammar which could handle constructs of the C language. The initial task involved analyzing the existing parser and studying its structure and implementation. Determining its deficiencies, in the form of constructs not handled by it, was the logical next step. The parser was adapted to handle the NXP-specific style of coding, and an external module was implemented to conform the AST generated for this style of coding to the standard format. Elaborate testing was conducted to identify and subsequently rectify the mishandled cases in the grammar. The parser was validated by running it successfully over the GLPK-4.43 library. Partial tests were also carried out on the Graphviz library.

The need to retrieve and reuse the AST generated by the parser in different locations gave rise to the need for a separate module. The inability of Scheme to read structure nodes from a file was another reason for implementing this transformation module, which can write an AST to a file in the Scheme list format after transforming it from the Scheme structure format. The module uses the pattern matching feature of BIGLOO to identify the structures. Another module along the same lines, transforming an AST in list format back into its original structure format, was also written to allow the reuse of the AST. These two modules make sure that all the information present in the source code can be represented in an AST and then reused anywhere. Discussion regarding the GXF structure necessitated an additional size-of-port-type element in the GXF. This was required to check the compatibility of the ports connected via a GXF edge. The theory of computing the sizes of types in C code was studied in order to implement it in this environment, the main reference being the sizeof operator of C. None of the work above could be used unless it was validated and tested. A comprehensive test suite was hence designed, integrating the parser and the unparser libraries. The tests were run on the GLPK-4.43 and Graphviz libraries. The expected output of the test suite was the same C files which were given as input. To validate this, the test suite used the generated C files instead of the original files to compile the libraries. For the sizeOf library another test suite was written for validation and subsequently verified. The success of the test suites proved the validity of the implementations. The modules were hence converted into libraries which can in future be included in any project. The requirement of running the tool in JVM mode led to the creation of jar files of these libraries. A Java project was created to test the libraries in JVM mode. Upgrading the Java project into an Eclipse plug-in, though, could not be completed due to some interfacing complexities between Java and the BIGLOO framework. Although the aim of the project was to end up with a dataflow model of the given C input, this could not be completely realized: the task of pattern matching the generated AST for all the elements of the GXF, and other tasks mentioned in the future work section, could not be completed.

10.2 Future Work

The current status of the work provides us with a complete AST containing all the information present in the input source code. The generated AST contains line numbers corresponding to the declaration statements only. For the purpose of debugging, this feature needs to be extended to include line numbers corresponding to all the statements and declarations inside a module in the source file. The next intended work is to apply the pattern matching techniques to the AST and generate a GXF representation of the dataflow model. The GXF language is a variant of XML and uses a similar syntax. A sample GXF node would look as follows:

(Sample GXF node omitted: its XML markup did not survive extraction. The element describes the actor node "process" and its port "buf" of type unsigned int.)

This chunk represents a single node in the dataflow model. Its corresponding C representation would be:

actor process(unsigned int buf[restrict static BUF_SIZE])
{
    unsigned int i;
    for (i = 0; i < BUF_SIZE; i++) {
        buf[i] = _state.count;
        _state.count++;
        /* Wrap on signed 24 bits value */
        _state.count &= 0x7fffff;
    }
}

Comparing the GXF node with its C representation, a clear mapping can be seen between the two; Figure 10.1 highlights this mapping.

Figure 10.1: Dataflow model information as represented in a C file

In a C file, the node is identified by the actor keyword. The function name becomes the node id and the parameter list represents the port attributes of the corresponding node. All this information is collected in the AST; the AST corresponding to this C code would be as in Figure 10.2. The next task would be to create patterns which can match against this generated AST and represent the result in the Scheme list format. The BIGLOO Scheme framework provides a suite for handling XML documents in Scheme, referred to as the SSAX-SXML library. It is an S-expression based library which contains functions for XML parsing, querying and conversion from XML to S-expressions and vice versa. S-expression here refers to the Scheme list format. Once the data from the AST is extracted, it needs to be represented in the form of an S-expression. This S-expression has to be in a particular format so as to be recognized by the serializer function [14] of the SSAX-SXML library. The serializer function takes an S-expression object as its first parameter and an output port as its second argument, and converts the S-expression input into its corresponding XML format. Hence, if the S-expression generated from the extracted data is given as input to this function, the output will be a GXF file representing the dataflow model of the C source file.

Figure 10.2: AST representation of the C source code

Bibliography

[1] Functional Programming Languages. http://www.cs.nott.ac.uk/gmh/faq.html#functional-languages.

[2] Control flow graphs. www.ucw.cz/ hubicka/papers/proj/node18.html, August 2010.

[3] DOM parser. www.dom4j.org/dom4j-1.6.1/, June 2010.

[4] DOM parser. www.ibm.com/developerworks/library/x-dom4j.html, June 2010.

[5] Functional Programming Language. www.defmacro.org/ramblings/fp.html, June 2010.

[6] Graph Exchange Language. http://www.gupro.de/GXL/, June 2010.

[7] Graphviz library. www.graphviz.org/, June 2010.

[8] GXF DTD. http://bitbucket.org/pjotr/lime/src/9f5a51fe3ac3/doc/gxf.dtd, 2010.

[9] GXF Stream DTD. http://bitbucket.org/pjotr/lime/raw/karma/doc/stream.dtd, 2010.

[10] LIMEclipse Plug-in Source. http://bitbucket.org/mazaninfardi/lime- eclipse-jar/, June 2010.

[11] Model-View-Controller pattern. msdn.microsoft.com/en- us/library/ff649643.aspx, June 2010.

[12] Pmccabe. www.parisc-linux.org/ bame/pmccabe/overview.html, August 2010.

[13] Columbus, frontend art. www.frontendart.com, August, 2010.

[14] Ssax-sxml tutorial. modis.ispras.ru/Lizorkin/sxml-tutorial.html#hevea:serializ, August, 2010.

[15] Glpk library. www.gnu.org/software/glpk, July, 2010.

[16] Semantic Designs, Inc. http://www.semdesigns.com/Products/DMS/DMSComparison.html, July, 2010.

[17] Bigloo, pattern matching. www-sop.inria.fr/mimosa/fp/Bigloo/doc/bigloo-8.html#Pattern-Matching, June, 2010.

[18] G. S. Almasi and A. Gottlieb. Highly parallel computing. Benjamin- Cummings Publishing Co., Inc., Redwood City, CA, USA, 1989.

[19] I.D. Baxter, C. Pidgeon, and M. Mehlich. DMS: Program transformations for practical scalable software evolution. In Software Evolution, International Conference on Software Engineering (ICSE), pages 625–634. IEEE Computer Society, 2004.

[20] S. Bhat. Model Transformation tool for Dataflow Model Transforma- tions. Master’s thesis, Technical University, Eindhoven, 2010.

[21] S. Bhat and N. Nayak. Description of the elements of the gxf dtd. http://bitbucket.org/swaraj.bh/transformation- tool/src/21ddb2edd2ed/doc/, 2010.

[22] F.J.A. Boerboom and A.A.M.G. Janssen. Fact extraction, querying and visualization of large c++ code bases. Master’s thesis, TU Eindhoven, August, 2006.

[23] M.L. Collard, H.H. Kagdi, and J.I. Maletic. An XML-based lightweight C++ fact extractor. In Proceedings of the 11th IEEE International Workshop on Program Comprehension (IWPC'03), pages 10–11. IEEE Press, 2003.

[24] Compiler Tools Group, Dept. of Electrical and Computer Engineering. Abstract Syntax Tree Unparsing. Technical report, University of Colorado, Boulder, CO, USA, 2002.

[25] K. Czarnecki and S. Helsen. Classification of model transformation approaches. In OOPSLA03 Workshop on Generative Techniques in the Context of Model-Driven Architecture, 2003.

[26] T. DeMarco. Controlling Software Projects: Management, Measure- ment, and Estimates. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1986.

[27] R. Ferenc and Á. Beszédes. Data exchange with the Columbus Schema for C++. In Proceedings of CSMR, pages 59–66, 2002.

[28] R. Ferenc, F. Magyar, Á. Beszédes, Á. Kiss, and M. Tarkiainen. Columbus - tool for reverse engineering large object oriented software systems. In Seventh Symposium on Programming Languages and Software Tools, pages 16–27, 2001.

[29] B. Goldberg. Functional programming languages. ACM Computing Surveys, 28(1):249–251, 1996.

[30] Edison Design Group. C++ front end, version 4.1. http://www.semdesigns.com/Products/DMS/DMSComparison.html, August, 2009.

[31] I. Herman, G. Melancon, and M.S. Marshall. Graph visualization and Navigation in Information Visualization: A survey. IEEE Transactions on Visualization and Computer Graphics, 6(1):24–43, 2000.

[32] R.C. Holt, A. Schurr, S.E. Sim, and A. Winter. GXL: a Graph-based Standard Exchange Format for Reengineering. Science of Computer Programming, 60(2):149–170, 2006.

[33] B. W. Kernighan. The C Programming Language. Prentice Hall Pro- fessional Technical Reference, 1988.

[34] P. Kourzanov, O. Moreira, and H. Sips. Disciplined multicore programming in C.

[35] L. Moonen. Generating robust parsers using island grammars. In WCRE ’01: Proceedings of the Eighth Working Conference on Reverse Engineering (WCRE’01), page 13, Washington, DC, USA, 2001. IEEE Computer Society.

[36] N. Nayak. Visualization of Dataflow Models. Master’s thesis, Technical University, Eindhoven, 2010.

[37] G.C. Necula, S. McPeak, S.P. Rahul, and W. Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In International Conference on Compiler Construction, pages 213–228, 2002.

[38] T. Parr. Practical computer language recognition and translation. www.antlr2.org/book/byhand.pdf, 1999.

[39] T. Parr. Practical computer language recognition and translation. www.antlr2.org/book/language.pdf, 1999.

[40] T.J. Parr and R.W. Quong. ANTLR: A predicated-LL(k) parser generator. Software Practice and Experience, 25:789–810, 1994.

[41] J. Rekers. Parser Generation for Interactive Environments. PhD thesis, Univ. of Amsterdam, Amsterdam, The Netherlands, 1992.

[42] N.S. Shetty. Generating C code from Platform Specific Model. Master's thesis, Technical University, Eindhoven, 2010.

[43] S. Sriram and S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker Inc., New York, Basel, USA, 2000.

[44] P. Wauters, M. Engels, R. Lauwereins, and J.A. Peperstraete. Cyclo-static dataflow. Signal Processing, 44(2):397–408, 1996.

[45] M. Zandifar. Abstract reduction operation models in the LIME pro- gramming model. Master’s thesis, TUDelft, 2009.

Appendix A

Structure of CIGLOO Grammar

The following figures depict the structure of the C grammar as provided by CIGLOO. The figures show the non-terminal symbols of the grammar as nodes, and the edges originating from them show their relationship/dependency with other nodes. The whole structure is split across five figures, each depicting a part of the structure. The node marked as "NODE" depicts the start node of the grammar.

Figure A.1: Node hierarchy - 1

Figure A.2: Node hierarchy - 2

Figure A.3: Node hierarchy - 3

Figure A.4: Node hierarchy - 4

Figure A.5: Node hierarchy - 5

Appendix B

Graphviz Test Results

The Graphviz library provided a very interesting source of test cases for the integrated testing of the parser library and the unparser library. The method of testing this library was similar to what was described for the testing of GLPK-4.43. Unfortunately, in the case of Graphviz the testing could not be successfully completed. Initially, in the early stage of development, C source files from the Graphviz library were individually tested with the parser and subsequently with the unparser. Graphviz provided us with a very large set of test cases which varied drastically, covering almost all possible C constructs. Initially each error encountered during testing was checked and rectified. This manual testing covered a major part of the code provided. After a sufficiently large set of files had been tested, the test for Graphviz was automated along the lines of GLPK-4.43. On running this, however, a few errors came up. A few of these errors could be solved, but paucity of time and the complexity of the errors in a few cases prevented us from rectifying them. The errors encountered are mentioned below, each with a possible cause:

1. The file shl load.c could not be processed. The reason for this was a parser error on a declaration statement. On looking further into the file it was found that the type-specifier used in the statement was not defined anywhere in the file, even though the file was a preprocessed file. The reason for this might be the sheer complexity of the Graphviz library and the number of files linked to each other: the preprocessor might not have included the required file, and hence the definition of that type-specifier might be missing. A number of other files have the same problem.

2. In the file dtclose.c there exists a typedef statement which aliases an existing type to a primitive data type. The specific statement is typedef mode t int; where mode t refers to unsigned int. Running this file individually with the gcc compiler itself resulted in errors. The error given out was that there cannot be two declaration specifiers in the same line. To avoid situations with two declaration specifiers in the same line, the recursion in the type-specifier rule was removed. This is explained in Section 6.2; a sketch of the offending construct is given below.
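A minimal sketch of the offending construct follows; the type name used here is hypothetical, and the reversed typedef is kept in a comment because it does not compile.

/* The usual direction: the new alias comes last. */
typedef unsigned int my_mode_t;

/* The reversed form found in the library is kept as a comment because it
   does not compile:

       typedef my_mode_t int;

   gcc rejects such a statement because it contains two declaration
   specifiers in the same declaration. */

int main(void)
{
    my_mode_t m = 0;
    return (int) m;
}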

These were the two kinds of errors we encountered the most and because of which were not able to complete the tests.

Appendix C

Links to the Source Files

This appendix is shared with the work of Nishanth Shetty [42]. It provides the following:

• The location from where the source files can be downloaded.

• A brief description of the important files present in the source files.

• A link to the document detailing the procedure for compiling the various libraries and performing tests on them.

The original source files (gzipped version) and the information regarding the procedure to compile the various libraries can be obtained from the following link.

http://bitbucket.org/swaraj.bh/transformation-tool/src/ 23eb627a74fb/parser-unparser-src/

Detailed information about a few of the important files present in Cigloo1.1 is given below.

1. Makefile: This is the main Makefile which in turn calls the Makefiles for creating the four libraries, namely parser, modifyast, lib-sizeof and Unparser.

2. Makefile.lib: This makefile creates the parser library.

3. Makefile.sizetest: This makefile is used to test the sizeof library with a set of test-cases.

4. Makefile.test-glpk: This makefile performs the integration testing of the GLPK library.

5. Makefile.test-grapviz: This makefile performs the integration testing of the Graphviz library.

6. cigloo1.1.scm: This file imports the libraries and calls translate and translate-ast-convert. The function translate is the main function that converts the preprocessed C code to its AST. The function translate-ast-convert takes this AST as its input and displays the C code on the port. The executable cigloo.out performs the processing from C to AST and back to C code; it takes as input a valid preprocessed C source file.

7. gccE: This file is used for testing purposes in the case of the GLPK and Graphviz libraries.

8. make-lib.scm, codeTranslate.init: These files are useful in creating the Parser library.

9. Unparser: This folder has the required source files and other important files for creating the Unparser library. The module translate generation is present in translate generation.scm in the folder unparsing/prettyprint. This module has the source code for the pretty printing.
