Completeness of Fact Extractors and a New Approach to Extraction with Emphasis on the Refers-To Relation
by Yuan Lin

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

Waterloo, Ontario, Canada, 2008

©Yuan Lin 2008

AUTHOR'S DECLARATION

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Abstract

This thesis deals with fact extraction, which analyzes source code (and sometimes related artifacts) to produce extracted facts about the code. These facts may, for example, record where in the code variables are declared and where they are used, as well as related information. These extracted facts are typically used in software reverse engineering to reconstruct the design of the program. This thesis has two main parts, each of which deals with a formal approach to fact extraction.

Part 1 of the thesis deals with the question: How can we demonstrate that a fact extractor actually does its job? That is, does the extractor produce the facts that it is supposed to produce? This thesis builds on the concept of semantic completeness of a fact extractor, as defined by Tom Dean et al., and further defines source, syntax and compiler completeness. One of the contributions of this thesis is to show that in certain important cases (when the extractor is deterministic and its front end is idempotent), there is an efficient algorithm to determine whether the extractor is compiler complete. This result is surprising: in general it is undecidable whether two programs are semantically equivalent, and source code and its corresponding extracted facts would seem to be, in essence, two programs that must be proved equivalent or at least sufficiently similar.
The larger part of the thesis, Part 2, presents Algebraic Refers-to Analysis (ARA), a new approach to fact extraction with emphasis on the Refers-to relation. ARA provides a framework for specifying fact extraction, based on a three-step pipeline: (1) basic (lexical and syntactic) extraction, (2) a normalization step and (3) a binding step. For practical programming languages, these three steps are repeated, in stages and phases, until the Refers-to relation is computed. During the writing of this thesis, ARA pipelines for C, Java, C++, Fortran, Pascal and Ada were designed. A prototype fact extractor for the C language has been created.

Validating ARA means demonstrating that ARA pipelines satisfy programming language standards such as the ISO C++ standard. In other words, we show that ARA phases (stages and formulas) are correctly transcribed from the rules in the language standard.

Compared with existing approaches such as attribute grammars, ARA has the following advantages. First, ARA formulas are concise, elegant and, more importantly, insightful. As a result, we have made some interesting discoveries about programming languages. Second, ARA is validated based on set theory and relational algebra, which is more reliable than exhaustive testing. Finally, ARA formulas are supported by existing software tools such as database management systems and relational calculators.

Overall, the contributions of this thesis include 1) the invention of the concept of a hierarchy of completeness and the automatic testing of completeness, 2) the use of the relational data model in fact extraction, 3) the invention of Algebraic Refers-to Analysis (ARA) and 4) the discovery of some interesting facts about programming languages.

Acknowledgements

I would like to thank my supervisors, Richard Holt and Andrew Malton, for many years of guidance and inspiration.
They have constantly taken care of their graduate students by pointing out relevant research, generating ideas, and finding work for them. I thank my dissertation committee members, Professor Charlie Clarke, Professor Grant Weddell, Professor Kostas Kontogiannis, and Professor Hausi Müller, for the invaluable time and effort they put into reading my thesis. I also thank the current and previous members of SWAG, in particular Jingwei Wu, Lijie Zou, Xinyin Dong, Cory Kapser and Ahmed Hassan. I appreciate their great friendship as well as their insightful comments on many ideas in this dissertation. I am grateful to my parents and my sister for their support from China. I wish to especially thank my wife, Chunhui Meng, for her support during the past six years of study and for taking care of our baby, Erica Lin.

Table of Contents

List of Figures
List of Tables
Chapter 1 Introduction
  1.1 How to verify a fact extractor
    1.1.1 Drawbacks in current approaches
    1.1.2 Motivation
    1.1.3 Completeness of a fact extractor
  1.2 Algebraic Refers-to Analysis: A new approach to fact extraction
    1.2.1 Current approach for extracting the Refers-to relation
    1.2.2 Motivation
    1.2.3 Algebraic Refers-to Analysis (ARA)
    1.2.4 Validation of ARA
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2 Related work
  2.1 Extraction of the Refers-to relation
    2.1.1 Ad hoc method
    2.1.2 Attribute Grammars
  2.2 Data models for facts in the source code
    2.2.1 Conceptual models for extracted facts
    2.2.2 Storage models for extracted facts
    2.2.3 Schemas
  2.3 Application of relational algebra in software engineering
    2.3.1 Program analysis
    2.3.2 Software repository exploration
    2.3.3 Software architecture recovery and repair
  2.4 Relational algebra software tools
Chapter 3 Completeness of fact extraction
  3.1 Introduction
  3.2 Compiler phases and ASGs
  3.3 Schemas and interchange formats
  3.4 Completeness of fact extractors
    3.4.1 Four levels of completeness
    3.4.2 Hierarchy of completeness
    3.4.3 Semantic completeness
    3.4.4 Relative completeness
  3.5 Validating