Improving Software Dependability through Documentation Analysis

by

Edmund Wong

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Engineering

Waterloo, Ontario, Canada, 2019

© Edmund Wong 2019

Examining Committee Membership

The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.

Internal-External Examiner: Meiyappan Nagappan Assistant Professor, Dept. of Computer Science, University of Waterloo

Supervisor: Lin Tan Associate Professor, Dept. of Electrical and Computer Engineering, University of Waterloo

Internal Member: Arie Gurfinkel Associate Professor, Dept. of Electrical and Computer Engineering, University of Waterloo

Internal Member: Mahesh Tripunitara Associate Professor, Dept. of Electrical and Computer Engineering, University of Waterloo

External Member: Nicholas A. Kraft Software Researcher, ABB Corporate Research

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

Software documentation contains critical information that describes a system's functionality and requirements. Documentation exists in several forms, including code comments, test plans, manual pages, and user manuals. The lack of documentation in existing software systems is an issue that impacts software maintainability and programmer productivity. Since some code bases contain a large amount of documentation, we want to leverage this existing documentation to improve software dependability. Specifically, we utilize documentation to help detect software bugs and repair corrupted files, which can reduce the number of software errors and failures and improve a system's reliability (e.g., continuity of correct service). We also generate documentation (e.g., code comments) automatically to help developers understand the source code, which helps improve a system's maintainability (e.g., ability to undergo repairs and modifications).

In this thesis, we analyze software documentation and propose two branches of work, which focus on three types of documentation: manual pages, code comments, and user manuals. The first branch of work focuses on documentation analysis because documentation contains valuable information that describes the behavior of the program. We automatically extract constraints from documentation and apply them to a dynamic analysis symbolic execution tool to find bugs in the target software, and we manually extract constraints from documentation and apply them to a structured-file parsing application to repair corrupted PDF files. The second branch of work focuses on automatic code comment generation to improve software documentation.

For documentation analysis, we propose and implement DASE and DocRepair. DASE leverages automatically extracted constraints from documentation to improve a dynamic analysis symbolic execution tool. DASE guides symbolic execution to focus the testing on execution paths that execute a program's core functionalities using constraints learned from the documentation. We evaluated DASE on 88 programs from five mature real-world software suites to detect software bugs. DASE detects 12 previously unknown bugs that symbolic execution would fail to detect when given no input constraints, 6 of which have been confirmed by the developers. For DocRepair, we perform an empirical study to understand and repair corrupted PDF files. We create the first dataset of 319 corrupted PDF files and conduct an empirical study on 119 real-world corrupted PDF files to study the common types of file corruption. Based on the results of the empirical study, we propose a technique called DocRepair. DocRepair's repair algorithm includes seven repair operators that utilize manually extracted constraints from documentation to repair corrupted files. We evaluate DocRepair against three common PDF repair tools. Amongst the 1,827 corrupted files collected from two corpora of PDF files, DocRepair can successfully repair 354 files, compared to Mutool, PDFtk, and GhostScript, which repair 508, 41, and 84 files respectively. We also propose a technique to combine multiple repair tools, called DocRepair+, which can successfully repair 751 files.

Where documentation is lacking, DASE and DocRepair+ would not work. Therefore, we propose automated documentation generation to address the issue. We propose and implement CloCom+ to generate code comments by mining both existing software repositories in GitHub and a Question and Answer site, Stack Overflow. CloCom+ generated 442 unique comments for 16 Java projects. Although CloCom+ improves on previous work, SumSlice, on automatic comment generation, the quality (evaluated on completeness, conciseness, expressiveness, and usefulness) and yield (number of generated comments) are still rather low, which makes the technique not ready for real-world usage.

In the future, it may be possible to combine the two proposed branches of work (documentation analysis and documentation generation) to further improve software dependability. For example, we can extract constraints from the automatically generated documentation (e.g., code comments).

Acknowledgements

I would like to thank all the little people who made this possible.

Dedication

This is dedicated to the one I love.

Table of Contents

List of Tables

List of Figures

1 Introduction
   1.1 Automatic Documentation Analysis
       1.1.1 General Symbolic Execution-Based Software Testing
       1.1.2 Automatic File Repair
   1.2 Automatic Documentation Generation
       1.2.1 Mining Question and Answer Sites
       1.2.2 Mining Code Repositories
   1.3 Contributions
   1.4 Overview of Thesis

2 Related Work
   2.1 Symbolic Execution
   2.2 Automatic File Repair
   2.3 Automatic Comment Generation
   2.4 Source Code Summarization
   2.5 Mining Descriptions for Code Artifact
   2.6 Code Clone Detection

   2.7 Fuzz Testing
   2.8 Documentation Analysis

3 Documentation Analysis: Symbolic Execution-Based Software Testing using Documentation Constraints
   3.1 Motivation
   3.2 Overview
   3.3 Background
   3.4 Design and Implementation
       3.4.1 Extracting File Format Constraints
       3.4.2 Adding File Layout Constraints
       3.4.3 Extracting Valid Options
       3.4.4 Using Options to Flatten Symbolic Execution
   3.5 Evaluation Method
       3.5.1 Evaluated Programs
       3.5.2 Experimental Setup
   3.6 Evaluation Results
       3.6.1 Detected Bugs
       3.6.2 Code Coverage
       3.6.3 DASE Complements Developer Tests
       3.6.4 Constraint Extraction Results
   3.7 Threats to Validity
       3.7.1 Internal Validity
       3.7.2 External Validity
       3.7.3 Construct Validity
       3.7.4 Conclusion Validity
   3.8 Summary

4 Documentation Analysis: Automatic File Repair
   4.1 Motivation
   4.2 Background
       4.2.1 PDF File Format
       4.2.2 Existing Repair Approaches
   4.3 Definitions
   4.4 A Study of Corrupted PDF Files
       4.4.1 Collecting Corrupted PDF Files
       4.4.2 Identifying Corrupted PDF Files
       4.4.3 PDF Repair Tools and Viewers
       4.4.4 Empirical Study Findings
   4.5 Design and Implementation
       4.5.1 Data Parsing and Collection
       4.5.2 Repair Operators
   4.6 Evaluation Method and Results
   4.7 Threats to Validity
       4.7.1 Internal Validity
       4.7.2 External Validity
       4.7.3 Construct Validity
       4.7.4 Conclusion Validity
   4.8 Summary

5 Automatic Documentation Generation: Crowd Sourced Comment Generation
   5.1 Motivation
   5.2 Examples and Challenges
       5.2.1 Example One
       5.2.2 Example Two

       5.2.3 Example Three
   5.3 Design and Implementation
       5.3.1 Code-Description Mapping Extraction from Stack Overflow
       5.3.2 Description Refinement
       5.3.3 Code Clone Detection
       5.3.4 Code Clone Pruning
       5.3.5 Comment Selection
   5.4 Evaluation Method
       5.4.1 Experimental Settings
       5.4.2 Human Participants
       5.4.3 Questionnaire Generation
       5.4.4 Study Procedure
       5.4.5 Post Study Questions
       5.4.6 Replication Package
   5.5 Evaluation Results
       5.5.1 Participant Ratings
       5.5.2 Execution Time
   5.6 Qualitative Analysis and Discussion
       5.6.1 Properties of the Automatically Generated Comments
       5.6.2 Properties of Developer-written Comments
       5.6.3 Properties of Participant-written Comments
       5.6.4 Yield Analysis
       5.6.5 Limitations
   5.7 Threats to Validity
       5.7.1 External Validity
       5.7.2 Internal Validity
       5.7.3 Construct Validity
       5.7.4 Conclusion Validity
   5.8 Summary

6 Future Work
   6.1 Detecting Bugs using Documentation Constraints
       6.1.1 Automated Constraint Extraction using Regular Expressions
       6.1.2 Automated Constraint Extraction using Natural Language Analysis
       6.1.3 Applying File Format Constraints to a File Parser

7 Conclusion

References

List of Tables

3.1 New bugs detected by KLEE and DASE. "X" denotes that a bug is found by a tool. "IU" means "Integer Underflow." "DBZ" is "Divide By Zero." "IL" is "Infinite Loop." "NPD" means "NULL Pointer Dereference." "POB" stands for "Pointer Out of Bounds." "ME" is "Memory Exhausted."

3.2 Coverage results with KLEE's default search strategy. "Line", "BR", and "Call" show the total number of executable lines of code (ELOC), branches, and calls for each program, reported by gcov. "K" stands for KLEE and "D" is DASE. "∆" is the improvement in percentage points of DASE over KLEE.

3.3 Number of instructions of generated test cases, showing that DASE explored deeper than KLEE. "K-" stands for KLEE and "D-" stands for DASE. "AVG-I" and "MAX-I" are the average and maximum number of instructions for the generated test cases respectively. Since Coreutils includes multiple programs, a range (the minimum and the maximum) is shown.

3.4 Coverage of combining DASE with developer test cases, showing that DASE complements developer tests. "V" is developer tests. "D" is DASE combined with developer tests. Note that readelf(e) has no developer test cases and hence is missing.

4.1 List of evaluated repair tools and PDF viewers.

4.2 RQ1—A breakdown of the number of PDF files that are crawled from the bug trackers ('Crawled'), the number of remaining files after applying SanityCheck ('Filtered'), the number of automatically detected corrupted files ('Corr'), the number of manually identified real-world corrupted files ('Real'), and the number of manually identified files that cause security vulnerabilities ('Sec').

4.3 RQ1—Classification of the impact of all 119 real-world corrupted PDF files on the end user and target software based on previous work's classification labels [1].

4.4 RQ2—Number of corrupted files ('Corr.') that existing tools (Mutool, PDFtk, and GhostScript) cannot repair ('None rep.'). We also show the number of files, amongst the ones that no tool can repair, that are real-world corrupted ('None Rep. Real.') and trigger security vulnerabilities in the target software ('None Rep. Sec.').

4.5 RQ3—Causes of file corruption (Type Corr.) for the 119 real-world corrupted files. The table shows the number of files for each cause of corruption (T.); the number of files that cannot be repaired by existing repair tools including Mutool, PDFtk, and GhostScript (F.); the number of files that can be repaired by the proposed technique, DocRepair, which is described in Section 4.5 (R.); and the corresponding repair operator that is implemented in DocRepair to address the different causes of corruption (RO).

4.6 A list of proposed and existing repair operators (RO) and whether existing repair tools support them (* means the repair operator is unique to DocRepair). 'Repaired' shows the usage frequency of each repair operator over the 354 corrupted files that are repaired by DocRepair. 'Uniquely Repaired' shows the usage frequency over the 10 corrupted files that are uniquely repaired by DocRepair.

4.7 RQ4—A summary of the automated screenshot evaluation results over 319 corrupted files (including real-world corrupted and purposely corrupted files) and 1,508 fuzzed corrupted files. The table shows the number of repaired files by each specific repair tool (Mutool, PDFtk, GS and DocRepair). The number in brackets refers to the number of files that are uniquely repaired. 'All' includes the number of real-world corrupted and purposely corrupted files. 'RC' only includes real-world corrupted files.

5.1 List of terms for sentence filtering.

5.2 Explanation for Equation (5.1), VP-NP.

5.3 Explanation for Equation (5.2), NP-VP.

5.4 Results of the six SumSlice projects by CloCom+.

5.5 List of the 16 evaluated projects by CloCom+. CloCom+ generated 442 comments for 780 code locations in the target projects. The numbers in brackets represent the results of CloCom.

5.6 Human participant judgments on the generated comments by CloCom+. CP: completeness; CS: conciseness; EP: expressiveness; US: usefulness.

5.7 Human participant judgments on the generated comments by SumSlice. CP: completeness; CS: conciseness; EP: expressiveness; US: usefulness.

5.8 Human participants' opinions on CloCom+ and SumSlice. PA stands for participant.

5.9 Reasons that an automatically generated comment is not applicable to the new code segment. The percentage is calculated over the 40 comments that describe the code segment.

5.10 Distribution of the developer-written comments in project Jajuk. LCode—lines of code; LCom—lines of comment. LCode and LCom are reported using CLOC. We also show the number of manually identified JavaDoc, single line, and block comments.

5.11 The number of comments (obtained from the AST parser) and the comment ratio in brackets (lines of code per comment) of ten randomly sampled Java projects.

5.12 Manual classification of 135 participant-written comments' unique properties. The percentage is calculated over 113 comments because 22 of the comments received no answer.

6.1 Trailer dictionary's table from the ISO 32000 documentation specification. The trailer dictionary's table contains the caption, "Table 15 - Entries in the file trailer dictionary."

6.2 Examples of syntactic categories for the production rules in the CFG.

List of Figures

1.1 Example of how DASE leverages constraints to discover bugs.

1.2 Stack Overflow Post #32215979.

2.1 Code from Java project—Freecol.

3.1 Motivating example of how input constraints help symbolic execution find more bugs.

3.2 The runtime to find the bug in line 11 decreases exponentially as we supply more input constraints. The runtime when no constraint is supplied is not depicted because the bug was not detected after 10 hours, when we stopped the execution.

3.3 Abstract view of execution trees for command-line options. Clouds are execution subtrees related to valid command-line options. Ovals are other execution subtrees. Deep options such as -o are more likely to be tested with DASE.

3.4 DASE's ELF layout. SH is Section Header, and PH is Program Header. Numbers in brackets are array indices.

3.5 Buggy code in readelf from Binutils.

3.6 Branch coverage on readelf(b) over time.

4.1 Corrupted and repaired versions of the corrupted file opened on Chrome 21.0.1180.89 (Chromium bug #134551). The corrupted version of the file crashes the browser tab.

4.2 PDF body's document tree structure, where each node represents an object.

4.3 Overview of the empirical study and evaluation.

4.4 Hierarchy of the 'Page Tree' structure. The tree contains intermediate nodes called 'page tree' and leaf nodes called 'page object'. The arrows represent an indirect reference between the objects.

5.1 Code from Java project—Eclipse (PluginsView.java).

5.2 Stack Overflow Post #11501418.

5.3 Top code from Java project—Jabref (Month.java) matched against the bottom code from open source project—mdrill from GitHub (CustomPeriodicTest.java).

5.4 Top code from Java project—Vuze (UnchokerFactory.java) matched against bottom code from GitHub project spring-framework (ServletContextPropertyUtils.java).

5.5 An overview of CloCom+.

5.6 Score distribution of the 800,744 Android-tagged Stack Overflow questions. We do not show the distribution where a question has a score of more than ten (1,602 in total) or less than negative ten (two in total).

5.7 Parse tree for the sentence "You can use this method to capture the stacktrace in a String." The matched Tregex patterns are labeled in bold.

5.8 Top code segment from Java project Freemind matched against the bottom code segment from Stack Overflow (post #35109636), depicting the highlighted missing statement between the two code segments. CloCom+ generated "Check if a string is a floating point number." to describe the source code.

5.9 Code clones' size distribution in Stack Overflow.

5.10 Example of a repetitive clone.

5.11 Code segment from the Hibernate project for the classloader comment.

5.12 Code segment from project Vuze.

5.13 Code segment from project mdrill.

5.14 Code segment for the comment "remove query string part."

5.15 Code segment for the comment "Splitting a string at a particular position in java."

5.16 Two code segments that contain the same computation but are expressed with a different set of syntax.

6.1 Data field layout details of the ELF Header.

6.2 An example where the constraint cannot be applied directly on a static byte offset due to the extra white space between characters '0' and 'R'.

6.3 Context-free grammar that describes the valid structure of the dictionary string.

6.4 The first regular expression for matching table captions.

6.5 The second regular expression for decompiling table entries.

6.6 Parse tree for the sentence: "An integer shall be written as one or more decimal digits optionally preceded by a sign."

6.7 CFG for extracting constraints from natural language sentences. One of the production rules, OBJECT, is omitted since it contains a full list of object-related English terms.

Chapter 1

Introduction

Software documentation is written text or illustration that describes a software system. It contains information that describes a system's requirements and specifications and exists in several forms, including code comments, test plans, manual pages (also known as man pages), and user manuals (e.g., file format specifications). Software maintainers rely on source code and code comments as two of the most important documentation artifacts to understand software systems [2]. Documentation has also been shown to improve software maintainability [3] and programmer productivity [4]. Since documentation contains informal information that describes a software system, this thesis attempts to formalize that information to obtain constraints that describe the behavior of the system. The constraints that are extracted from documentation can be used to improve a system's dependability [5], which is defined as a measure of a system's reliability, maintainability, safety, confidentiality, integrity, and availability. Specifically, this thesis focuses on utilizing documentation to detect software bugs and repair corrupted files to improve a system's reliability (e.g., continuity of correct service), and on automatically generating documentation to improve a system's maintainability (e.g., ability to undergo repairs and modifications).

Code bases often contain a vast amount of documentation. For example, GitHub [6] contains a total of 17 million lines of Java comments and 42 million lines of code in 1,005 Java projects [7]. Documentation contains valuable information that describes the behavior of the program. Previous work has utilized code comments to detect inconsistencies between comments and code [8] and to infer method specifications from natural language API descriptions [9]. For example, a code comment from the Linux Kernel, "This function must not be called from interrupt context," specifies that the target function that the comment is describing must not be called from interrupt context [8]. Therefore, the first goal of this thesis is to analyze software documentation (including manual pages, code comments, and user manuals) to improve software reliability. Specifically, we want to minimize the number of failures in a software system by detecting software bugs and repairing corrupted files through the use of documentation constraints. This thesis examines three types of software documentation: manual pages (help text for Linux/Unix commands, system calls, file formats, etc.), source code comments, and user manuals (technical documents that can contain text, images, tables, etc.). We extract constraints from the documentation, and we utilize them to perform automated bug detection and automated file repair.

Many software projects do not contain sufficient documentation in practice [10]. Documentation can be scarce, incorrect [11], or outdated [2, 12]. Therefore, the second goal of this thesis is to generate documentation automatically. Specifically, this thesis leverages two sources of information, software repositories and question and answer (Q&A) sites, for automatic documentation generation.

The theme of this thesis considers both documentation analysis and documentation generation as the two main goals. Documentation analysis relies on the existence of high-quality documentation. However, high-quality documentation may not always be available, which motivates us to propose techniques to generate it automatically. This thesis proposes techniques to improve both areas. In the future, it may be possible to take the automatically generated comments as the input for documentation analysis.

1.1 Automatic Documentation Analysis

Documentation analysis focuses on improving software reliability. We propose and implement two pieces of work for automatic documentation analysis. First, we propose DASE to automatically detect software bugs. Software bugs can cause severe issues for the end user, such as program crashes and incorrect behavior. DASE detects software bugs automatically, which helps developers discover and fix software bugs early in the software development cycle and reduces the number of software errors and failures. Second, we propose DocRepair to study corrupted PDF files and use that knowledge to repair corrupted PDF files with manually extracted documentation constraints. Since corrupted PDF files can cause errors and failures in the target application (e.g., the inability to open and display the content of the file), DocRepair repairs the corrupted files to mitigate such software failures, which helps improve software reliability.

1.1.1 General Symbolic Execution-Based Software Testing

We implement DASE to leverage documentation to help improve automated test generation and bug detection. Real-world programs contain an enormous number of execution paths. Given a limited amount of time, testing tools have to prioritize the execution paths to ensure effective testing. We want to automate this prioritization by extracting input constraints from the documentation. DASE extracts and applies grammar-based constraints on complex program inputs to help find bugs in structured-file parsing applications.

DASE guides symbolic execution to focus testing on execution paths that implement a program's core functionalities using constraints learned from the documentation. It analyzes two types of documentation: source code comments and manual pages. It extracts two types of constraints: file format constraints and valid-option constraints. Since input constraints commonly exist in documents, symbolic execution techniques can potentially take advantage of the constraints automatically. For example, rm (version 6.10) only accepts 11 options including -r and -f, and readelf requires its input files to follow the Executable and Linkable Format (ELF). Focusing on valid and close-to-valid inputs can help test the core functionalities of the program, which should improve testing coverage and effectiveness, as shown by previous techniques [13, 14].

Figure 1.1 shows code from readelf, a program that accepts ELF files as input. The code segment processes an ELF file's header, and it contains a bug located deeper in the code. DASE can locate the bug because the branch condition at line 2, which checks whether the file's magic number is invalid, must evaluate to false (i.e., the magic number must be valid) for execution to reach the bug. DASE can infer the magic number by extracting constraints about the ELF file format's header from the documentation. DASE prunes away the execution paths that correspond to invalid file formats, which helps the symbolic execution engine reach the buggy location faster.

Challenges: Since documentation is often written in natural language, extracting the input constraints requires natural language processing techniques and heuristics. The extraction of constraints from documentation is a challenging process. First, English sentences may express the same information using a wide variety of terms and sentence structures. Therefore, we propose the usage of grammar rules to extract key information within each sentence. Second, source code comments and manual pages are semi-structured. They do not formally link the constraints back to the formally defined data structures, so static analysis and heuristics are required to link the constraints to the data fields.

1  static int process_file_header(void) {
2      if (elf_header.e_ident[EI_MAG0] != ELFMAG0
3          || elf_header.e_ident[EI_MAG1] != ELFMAG1
4          || elf_header.e_ident[EI_MAG2] != ELFMAG2
5          || elf_header.e_ident[EI_MAG3] != ELFMAG3) {
6          return 0;
7      }
8      ...
9      // An integer underflow bug in a deeper location of the code.
10 }

Figure 1.1: Example on how DASE leverages constraints to discover bugs.
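To make the file-format constraint in Figure 1.1 concrete, the following sketch expresses in plain Java the documented requirement that a valid ELF file begin with the magic bytes 0x7F 'E' 'L' 'F'. This is an illustration only, and the class and method names are hypothetical; DASE supplies an equivalent constraint to the symbolic execution engine so that paths violating it are pruned early.

// Illustrative only: the ELF magic-number constraint that DASE extracts from
// documentation and supplies to the symbolic execution engine.
public class ElfHeaderConstraint {
    // The ELF documentation requires the first four bytes to be 0x7F, 'E', 'L', 'F'.
    private static final byte[] ELF_MAGIC = {0x7F, 'E', 'L', 'F'};

    // Returns true if the header satisfies the documented magic-number constraint.
    public static boolean satisfiesMagic(byte[] header) {
        if (header == null || header.length < ELF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ELF_MAGIC.length; i++) {
            if (header[i] != ELF_MAGIC[i]) {
                return false; // invalid-format paths are pruned early by DASE
            }
        }
        return true; // valid-format paths reach the program's core logic
    }
}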

1.1.2 Automatic File Repair

We implement DocRepair to leverage documentation to help document recovery tools repair corrupted files. Corrupted files cause many software failures. Specifically, file corruption causes software crashes [15], security issues [16, 17, 18, 19, 20], users being unable to access valuable data, etc. For example, a corrupted PDF file triggered a null pointer bug in the Google Chromium browser, which caused the browser to crash [15]. Files often become corrupted due to problems such as file system bugs [21, 22], network transfer errors [23], malicious modifications of a file [24], and buggy software [25]. It is important to repair corrupted files to mitigate the resulting software failures. However, this task is challenging [26, 27]. File viewers often fail to display corrupted files because such files do not follow the PDF specification and cannot always be parsed by the viewer. Thus, practical techniques to repair corrupted files are in high demand.

Challenges: Extracting file format constraints for document repair tools is challenging because it requires a good understanding of the corrupted-file problem. It is important to understand the number of real-world corrupted files in the wild; the causes of file corruption (e.g., corrupted base structure and corrupted font); the impact of file corruption (e.g., incorrect functionality, crash, and performance degradation); and whether existing techniques can repair them. All these questions need to be investigated to design a solution for repairing corrupted PDF files. There is neither a comprehensive study to quantify the existence of corrupted files in the real world nor a study to understand their impact, despite some anecdotal evidence [26] and previous work that analyzed the impact of data corruption in storage systems [28, 29, 30].

Therefore, we performed an empirical study of corrupted files. Based on the results of the study, we implement a set of repair operators that address the common causes of file corruption.
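As a rough illustration of what such a repair operator can look like (a hypothetical sketch only; the class name and details are not DocRepair's implementation, which is described in Chapter 4), the snippet below restores a missing PDF header, relying on the documented requirement that a PDF file begin with "%PDF-" followed by a version number.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of a single specification-based repair operator: the PDF
// specification requires a file to begin with a "%PDF-<version>" header.
public class HeaderRepairOperator {
    private static final byte[] HEADER_PREFIX = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] DEFAULT_HEADER = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);

    // If the documented header is missing, prepend a default one; otherwise leave the file unchanged.
    public static byte[] apply(byte[] file) {
        if (file.length >= HEADER_PREFIX.length
                && Arrays.equals(Arrays.copyOf(file, HEADER_PREFIX.length), HEADER_PREFIX)) {
            return file; // the header already satisfies the constraint
        }
        byte[] repaired = new byte[DEFAULT_HEADER.length + file.length];
        System.arraycopy(DEFAULT_HEADER, 0, repaired, 0, DEFAULT_HEADER.length);
        System.arraycopy(file, 0, repaired, DEFAULT_HEADER.length, file.length);
        return repaired;
    }
}

A real operator would also have to account for the absolute byte offsets stored in the cross-reference table, which prepending bytes shifts; this is exactly the kind of file-format knowledge that documentation constraints provide.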

1.2 Automatic Documentation Generation

We propose and implement a new technique for automatic documentation generation. The purpose is to generate code comments automatically, since code comments are scarce in practice. A code comment is a form of documentation written in natural language that contains information to help developers understand the source code. A better understanding of the source code helps developers modify the target application and thus improves software maintainability. We implement CloCom+ to generate code comments by mining from two sources: existing software repositories on GitHub [6] and a Question and Answer site called Stack Overflow [31].

CloCom+ has a design similar to our previously published work, AutoComment [32] and CloCom [7]. The main difference is that CloCom+ extracts code comments from both GitHub projects and Stack Overflow posts, instead of extracting comments from only one of the sources. CloCom+ also adds a comparison against previous work [33] with a user study, and performs an empirical study on the properties of automatically generated and human-written comments.

Figure 1.2 shows an example of automatic comment generation, where CloCom+ leverages a Stack Overflow post to generate a comment automatically. The figure shows the title of the post, the code snippet, and a paragraph from the answer that describes the code snippet. Since a similar code segment also exists in the Java project Vuze (not shown here), CloCom+ can extract the sentence from the post's answer, apply natural language processing techniques to the sentence, and generate the following comment to explain the code segment in Vuze: "Build this value in java using Calendar." The underlined term in the code segment represents the text-similarity term between the sentence and the code segment, and we applied natural language processing techniques [34, 35] to refine the sentence into a source code comment.

Stack Overflow Question (Title): How to manage dates in Android sqLite database when I care only about the date (which is to be unique) and not the time?

Stack Overflow Answer: Alternatively, you can build this value in java using Calendar:

Calendar calendar = Calendar.getInstance();
calendar.set(Calendar.HOUR_OF_DAY, 0);
calendar.set(Calendar.MINUTE, 0);
calendar.set(Calendar.SECOND, 0);
calendar.set(Calendar.MILLISECOND, 0);
long time = calendar.getTimeInMillis();

Figure 1.2: Stack Overflow Post #32215979

1.2.1 Mining Question and Answer Sites

Stack Overflow contains code segments together with their descriptions, which we refer to as code-description mappings. We extract such mappings and leverage them to generate comments automatically for similar code segments matched in open source projects. We mine source code comments from Q&A sites because they naturally contain code descriptions written by developers that can be used for automatic comment generation. For example, a user asked, "Can I know if a given method exists?" (Stack Overflow post #28069121). The question received a Java code snippet that checks if a given method has been declared in the class. We can use the statement form of the question, "Know if a given method exists.", as an explanatory description of the code snippet.

Challenges: It is a challenging task to select the correct sentence that describes the source code. First, a code segment may have several sentences describing it. We have to select the sentence that best describes the code segment by leveraging text similarity heuristics. Second, the sentences that we mine from Stack Overflow cannot be utilized directly as source code comments. We have to leverage natural language processing techniques to refine each sentence because it can be in question form or contain unneeded information. Third, we have to detect similar code segments between Stack Overflow and the input project. Since code segments on Stack Overflow are often not compilable, we apply refinements to the code segments before running the code clone detection process.
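The sketch below illustrates the kind of refinement involved, using simple hypothetical string heuristics; CloCom+'s actual pipeline relies on a natural language parser, as described in Chapter 5, and the class name here is illustrative.

// Hypothetical refinement heuristics; CloCom+'s actual pipeline relies on a
// natural language parser rather than the simple string rules sketched here.
public class DescriptionRefiner {
    // Converts a Stack Overflow question title into a statement-form comment candidate.
    public static String toStatement(String title) {
        String s = title.trim();
        if (s.isEmpty()) {
            return s;
        }
        // Drop common interrogative prefixes such as "How to", "How do I", or "Can I".
        s = s.replaceFirst("(?i)^(how (to|do i|can i)|can i|is there a way to)\\s+", "");
        // Turn the trailing question mark into a period and capitalize the first letter.
        s = s.replaceAll("\\?+$", ".");
        if (!s.endsWith(".")) {
            s = s + ".";
        }
        return Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        // Prints "Know if a given method exists."
        System.out.println(toStatement("Can I know if a given method exists?"));
    }
}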

6 1.2.2 Mining Code Repositories

Existing software repositories contain source code comments that can be extracted for automatic comment generation. For example, previous work has shown that GitHub [6] contains 17 million lines of Java comments and 42 million lines of code across 1,005 projects (based on CLOC) [7]. Mining from existing code repositories brings a new set of unique challenges compared to mining from Q&A sites.

Challenges: Mining human-written comments from existing software repositories has four main challenges. First, since we mine code comments from a large pool of software repositories, and software reuse is common, a code segment is often similar to many other code segments that contain code comments. Therefore, we require a comment selection technique that works with a large set of candidates. Second, it is challenging to parse code comments from the source code. The reason is that code comments from existing software repositories are often not written in full sentences, which means a natural language parser cannot process the sentences accurately. Third, code comments mined from source code are more likely to contain project-specific information compared to human-written sentences on Q&A sites. Fourth, we require a highly scalable code clone detection tool [36, 37] that allows us to extract fine-grained code information such as type information and variable scope level. Therefore, we built a code clone detection tool that leverages an abstract syntax tree parser to improve the accuracy of the code clone detection.
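To illustrate only the matching step at the heart of clone detection (a deliberately simplified, hypothetical sketch; the tool described above additionally uses AST information such as types and variable scopes, and a gap-tolerant matching algorithm), the snippet below treats two code segments as clone candidates when enough of their tokens overlap.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified sketch of token-based clone candidate detection.
public class TokenCloneSketch {
    // Fraction of tokens in segment 'a' that also appear in segment 'b'.
    public static double overlap(List<String> a, List<String> b) {
        Set<String> bTokens = new HashSet<>(b);
        long shared = a.stream().filter(bTokens::contains).count();
        return a.isEmpty() ? 0.0 : (double) shared / a.size();
    }

    public static void main(String[] args) {
        List<String> seg1 = Arrays.asList("calendar", "set", "HOUR_OF_DAY", "0", "getTimeInMillis");
        List<String> seg2 = Arrays.asList("calendar", "set", "MINUTE", "0", "getTimeInMillis");
        // Segments sharing most of their tokens become candidates for comment transfer.
        System.out.println(overlap(seg1, seg2) >= 0.7 ? "clone candidate" : "not a clone");
    }
}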

1.3 Contributions

This thesis includes the following contributions:

• We propose and implement DASE for automated documentation analysis to automatically find bugs in the target software. DASE leverages documentation to help guide a symbolic execution engine to utilize constraints to focus the testing on execution paths that are semantically more important. We propose a new technique that combines natural language processing techniques and heuristics to extract the required constraints from manual pages and source code comments.

• We propose and implement DocRepair to leverage documentation to guide document recovery tools to repair corrupted files. We create the first dataset of real-world corrupted files and perform an empirical study of their causes and impact on the end user and target application. Based on the results of the empirical study, we propose seven repair operators that utilize documentation constraints to repair corrupted files.

• We propose and implement CloCom+ for automatic documentation generation. CloCom+ generates source code comments by mining both existing software repositories from GitHub and a large-scale Q&A site, Stack Overflow. It leverages natural language processing techniques to map sentences against code segments to generate a code-description mapping database for automatic comment generation, and it leverages information from an abstract syntax tree parser to perform text similarity for selecting the code comment that best describes the target code. Although CloCom+ improves on previous work, SumSlice, on automatic comment generation, the quality (evaluated on completeness, conciseness, expressiveness, and usefulness) and yield (number of generated comments) are still rather low, which makes the technique far from ready for real-world usage.

• We propose future work to improve symbolic execution-based software testing on structured-file parsing applications such as XML and PDF, which requires a more expressive way (e.g., context-free grammar) to describe the file format. We describe a preliminary approach to extract file format constraints automatically from documentation using regular expressions and natural language processing techniques, where the extracted constraints can be applied on a file parser to help detect constraint violations on the target file.

1.4 Overview of Thesis

The following is an overview of the thesis. Chapter 2 discusses the related work and background information. We describe the two techniques for documentation analysis. Specifically, we utilize documentation to improve a symbolic execution tool (Chapter 3) and a document recovery tool (Chapter 4). Chapter 5 describes a technique that is used for documentation generation. Specifically, we generate source code comments by mining both GitHub projects and a Q&A site called Stack Overflow. Chapter 6 discusses the future work on automatically extracting documentation constraints using regular expressions and natural language processing techniques for bug detection. Chapter ?? provides an overview of my research publications.

Chapter 2

Related Work

This chapter introduces the background material and work related to this thesis.

2.1 Symbolic Execution

Symbolic execution is a software testing technique that generates test cases automatically to expose bugs in a program. Instead of executing programs with concrete inputs, symbolic execution represents inputs as symbolic values. Upon exploring a branch whose condition involves symbolic values, two paths are created, and the corresponding constraints are added to each path. Once the execution of a path terminates, the collection of constraints along that execution path is used to generate concrete inputs to exercise the path.

Symbolic execution [38, 39] (alone or with concrete execution) has been widely used for automated testing [40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51]. However, it suffers from the fundamental problem of path explosion. To alleviate the path explosion problem, many strategies have been proposed [52, 40, 53, 46, 54, 55, 56, 57, 58, 59]. Veritesting [53] leverages static symbolic execution to guide and improve dynamic symbolic execution. CUTE [46] and CREST [55] both use a bounded depth-first search (DFS) strategy. ZESTI [60] uses developer-generated tests as "seeds" and explores paths similar to the seeds' paths; its performance is therefore affected by the existing tests. Differently, we propose DASE, which utilizes input constraints as a "filter" to aid search strategies to focus on both valid and close-to-valid inputs (e.g., boundary cases) and explore deeper into a program's core functionality.
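As a toy illustration of the path-forking mechanics described at the start of this section (not taken from any of the cited systems; the class name is hypothetical), consider symbolically executing the following method. Each branch on the symbolic input forks execution, and the constraints accumulated along each path are solved to produce a concrete test input for that path.

// Toy example of how symbolic execution forks paths and accumulates constraints.
public class PathExample {
    static int classify(int x) {      // x is treated as a symbolic value
        if (x > 10) {                 // fork: one path assumes (x > 10), the other (x <= 10)
            if (x % 2 == 0) {         // path constraints: (x > 10) && (x % 2 == 0), e.g. x = 12
                return 2;
            }
            return 1;                 // path constraints: (x > 10) && (x % 2 != 0), e.g. x = 11
        }
        return 0;                     // path constraint: (x <= 10), e.g. x = 0
    }
}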

DASE complements ZESTI: DASE detected two previously unknown bugs in Coreutils that were not detected by ZESTI (the same version of Coreutils was used by DASE and ZESTI). Input space partitioning [61] has been used to improve symbolic execution [62, 63, 64]. For example, FlowTest [62] partitions the inputs into "non-interfering" blocks by analyzing the dependency among inputs.

Recent work by Trabish et al. proposed chopped symbolic execution, which allows the user to manually specify uninteresting parts of the code (a list of functions along with specific call sites) to skip in order to improve the performance of symbolic execution [65]. Their proposed technique, Chopper, lazily executes the code that the user specified to skip. The evaluation is conducted on GNU libtasn1, a library that serializes and deserializes data in Abstract Syntax Notation One syntax. The authors manually created an execution driver for the library and manually specified a set of functions to skip. Our proposed technique, DASE, differs from Chopper in two ways in how it specifies the uninteresting parts of the code. First, DASE specifies the interesting parts of the code by constraining the standard input of the program, whereas Chopper specifies them through annotation of the source code. Second, DASE automatically extracts the constraints from documentation using natural language processing techniques and heuristics, whereas Chopper requires manual annotation of the functions.

One of the issues with symbolic execution is that it can fail to locate bugs in the deeper parts of the code when the bug requires a complex input to trigger. Concolic execution is a technique that helps symbolic execution explore deeper parts of a program by providing the symbolic execution engine with a concrete execution path. However, concolic execution still suffers from the path explosion issue due to the large number of execution paths in a program, which limits its scalability. Previous work proposed hybrid fuzzing, which combines fuzzing and concolic execution [66, 67].

Stephens et al. proposed Driller [67], which combines concolic execution with fuzzing to reach deeper parts of the code, using fuzzing to alleviate path explosion. Driller utilizes the fuzzer to explore the initial compartment of the program (compartments are separated by checks for particular values of a specific input), and when the fuzzer is unable to identify inputs that reach new compartments of the program, it invokes the concolic execution component to determine the specific input that is needed to reach new paths. The new possible input determined by the concolic execution component is then passed back to the fuzzer to explore the new compartment. DASE also attempts to infer valid input from the documentation to reach new compartments of the program. Specifically, the input inferred by DASE is utilized to reach beyond the compartments that process the command-line options and the program's input file. One of the main advantages of Driller over DASE is that Driller does not necessarily require an input test case to operate, whereas DASE requires the standard input of the program to be constrained with file format constraints. Driller is also capable of obtaining new constraints that are required to explore deeper parts of the program dynamically during concolic execution as more code is covered, whereas DASE determines all the required constraints prior to the execution of the symbolic execution tool.

Yun et al. proposed QSYM [66], which also combines fuzzing with concolic execution. The main focus of QSYM is a set of performance improvements and optimizations, including the removal of the IR translation layer from the symbolic emulation, the removal of the snapshot mechanism from hybrid fuzzing, and the collection of an incomplete set of constraints for efficiency. These optimizations make QSYM much more scalable. The evaluation of QSYM shows that it is able to outperform Driller in terms of code coverage for 104 out of the 126 DARPA CGC binaries.

The previous techniques rely on information from the code logic to guide the path exploration process. Differently, DASE automatically extracts input constraints from documents and uses the constraints to prune execution paths. Also, DASE focuses on valid and close-to-valid inputs, while the above techniques have no knowledge about whether an execution path corresponds to valid or invalid input.

Symbolic execution by default is an exhaustive testing approach that attempts to explore the entire input space of the target software, which is the reason it is not scalable. Previous work proposed directed random testing [68], which combines random input generation with a heuristic pruning technique that discards illegal and equivalent inputs to reduce the input space. Directed random testing is more scalable than symbolic execution, but it also lacks a meaningful stopping criterion since it does not systematically explore the program.

2.2 Automatic File Repair

Much work has focused on automatic file repair. The first branch of work relies on manually written constraints. Demsky et al. detect and repair constraint violations in data structures using manually written constraints [69], and Endignoux et al. proposed manually written formal grammar rules to validate the PDF file format [70]. The second branch of work repairs the corrupted file without knowledge of the file format. Docovery [26] uses symbolic execution to modify potentially corrupted bytes of a file and avoid program paths that will lead to a crash. SOAP [71, 72] automatically learns the constraints of a file format from training inputs, but the extracted constraints are based on the patterns observed in the input examples and can potentially be incorrect. The third branch of work repairs a corrupted file by modifying the execution path of the file viewer, without modifying the corrupted file itself. Wüst et al. proposed Force Open [27] to repair corrupted files with a black-box approach using binary instrumentation. The fourth branch of work consists of specialized repair tools. For the PDF file format, these include repair tools (e.g., MuPDF [73], PDFtk [74], PDF Repair Tool [75] and PDF Repair Toolbox [76]) and parsing libraries (e.g., [77] and GhostScript [78]). These tools are developed by experts to parse and repair specific file formats. In this work, we propose DocRepair as a specialized repair tool, since we rely on expert knowledge from our study.

Rinard [72] proposed a manually defined rectifier to transform program input into a constrained input format for email messages. The purpose is to avoid unanticipated errors and vulnerabilities during program execution. A follow-up work, SOAP [71], automatically infers constraints based on a set of training inputs. SOAP learns a set of constraints such as the upper bound and sign constraints of integer fields. However, the learned constraints depend on the training set of the program and can be incorrect if all the input files have the same value in specific fields.

Kuchta et al. proposed Docovery [26], a document recovery technique based on symbolic execution that is independent of the input file format. The technique first identifies the input bytes that cause the program to crash using dynamic taint tracking. Then, it executes the program concolically [79], and upon reaching a point where the program would crash, it forces the program to follow a different execution path by modifying only the bytes in the document that are related to the crash. The technique works well for less complicated document formats. However, PDF proved to be a challenging target due to the complexity of the file format. Docovery can theoretically repair any corrupted file given an unlimited amount of time, which would allow the symbolic execution engine to reach the exit of the program without any errors. However, existing symbolic execution tools still have scalability issues beyond small applications. On the other hand, DocRepair's repair capability is limited to its defined list of repair operators. Docovery generates a large number of repair candidates (that may or may not be repaired) since it creates a candidate file for each execution path that leads to a crash, which differs from traditional repair tools (e.g., DocRepair, Mutool, PDFtk, and GhostScript) that generate a single repaired file. Since Docovery generates multiple repair candidates, it would require a tool to select the best candidate from the list of files.

Wüst et al. proposed Force Open [27] to automatically repair file formats including PNG, JPG, and PDF with a black-box approach that relies on binary instrumentation. Unlike existing approaches that repair a file by modifying bytes of the corrupted file, it modifies the execution of the file viewer to force it to open a corrupted file. Force Open learns the behavior of a file viewer on valid files and modifies the program to follow the learned behavior to force open a corrupted file. They compared their repair capability against a PDF repair tool called PDFtk [74], and their technique achieved a repair rate of 4.9% compared to PDFtk's 13.4%.

Rinard et al. proposed a technique to detect and repair errors in data structures for several applications (e.g., a Linux file system and Office files) [80]. It allows manual specification of the data structure's constraints, which are represented by a set of object and relation declarations.

Brown et al. proposed the reconstruction of corrupt files that utilize the DEFLATE compression algorithm (e.g., ZIP archives, Microsoft Office 2007 and OpenDocument documents) [81, 82].

2.3 Automatic Comment Generation

Code commenting is an integral part of software development. It helps improve software maintainability [83] and programming productivity by helping developers understand the source code.

Several techniques generate source code comments automatically for certain code structures, such as failed test cases [84], the context surrounding a method [33], exceptions [85], APIs [86], code changes [87] and function parameters [88]. Sridhara et al. proposed approaches to generate comments automatically from source code for Java methods [89], high-level actions within methods [90], and Java classes [91]. These techniques rely on the source code containing high-quality identifier names and method signatures to generate a comment. For example, when grouping method calls [90] such as buildGameMenu and buildViewMenu, all the method names have to contain the same verb, build. The technique would then synthesize build menus or build different menus as the output. These techniques [89, 90, 91] all leverage the Software Word Usage Model [92], which is responsible for capturing the action, theme, and secondary arguments of a method. Their technique [90] works on statement sequences that are conditional blocks, perform similar actions, or follow specific templates. Differently, we propose AutoComment and CloCom, which can generate a high-level comment for multiple statements that perform different actions and are not limited to the grouping strategy.

Previous work also generates comments for software concerns [93] and MPI programs [94]. These techniques do not solve the problem of grouping statements that perform different sub-actions into a high-level action. Recently, Wang et al. [95] proposed a grouping strategy that segments method code into meaningful blocks. The grouping strategy can potentially improve the previous technique [90], but it is still difficult to generate comments for a group of statements with different sub-actions. Differently, AutoComment can naturally group the statements based on the code segments from Stack Overflow written by developers, and CloCom does not require grouping of the statements.

Recent work by Hu et al. proposed a technique called DeepCom to generate code comments for Java methods automatically by analyzing the structural information of the code; it utilizes machine translation techniques to convert source code to natural language sentences [96]. DeepCom treats comment generation as a Neural Machine Translation (NMT) problem, and it builds a language model for both the source code and the comments. They proposed a unique way to translate the abstract syntax tree (AST) of the code to a specially formatted sequence to ensure the code representation is lossless for the language model. Their evaluation compares against an existing technique called CODE-NN [97], which utilizes a recurrent neural network (RNN) with attention to generate summaries, and they use the BLEU score to measure performance. Our proposed technique, CloCom+, generates code comments for code segments within a Java method, as opposed to DeepCom, which generates a comment for an entire Java method.

The evaluation of code comments and code summaries commonly involves recruiting multiple human participants to judge the generated comments and summaries. Moreno et al. evaluated Java class summaries using criteria including content adequacy, conciseness, and expressiveness [91], and Sridhara et al. evaluated Java method summaries using accuracy, content adequacy and conciseness [89]. However, these criteria all focus on intrinsic evaluation instead of extrinsic evaluation. An intrinsic evaluation focuses on evaluating the content of the summary (e.g., accuracy, adequacy and conciseness), whereas an extrinsic evaluation focuses on the usefulness of the summary (e.g., impact on the developers). For example, an extrinsic evaluation can measure how much faster a developer can perform a coding task when given additional resources such as code summaries.

Figure 2.1 shows a code segment from the Java project—Freecol. Previous work from [90] generates comments for high-level actions within methods. Their work generates comments for three types of statement groups. The first type involves sequences of statements that can be grouped into a high-level action (same verb phrase with a common headword). The second type involves conditional blocks that contain integrable sequences of statements along the branches (if ... else if ... else ... or switch). The third type involves specific common high-level code patterns that are based on loop constructs.

while (wNew*2 <= w && hNew*2 <= h) {
    w = (w+1) / 2;
    h = (h+1) / 2;
    BufferedImage halved = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
    Graphics2D g = halved.createGraphics();
    // For halving, bilinear should most correctly average 2x2 pixels.
    g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, RenderingHints.VALUE_INTERPOLATION_BILINEAR);
    g.drawImage(im, 0, 0, w, h, null);
    g.dispose();
}

Figure 2.1: Code from Java project—Freecol

Their work cannot generate a comment for this code segment. The code contains no sequences of statements that can be grouped together. For example, the methods setRenderingHint() and drawImage() contain different actions, set and draw. The code also does not contain any conditionals or loop constructs.

Previous work from [89] generates summaries for Java methods, which are leading comments that occur before a method. For example, it can generate a summary for the method scaleImageTo under Figure 5.2. Their technique first selects lines of code that are important to a method summary. It selects the return statement on line 12 because it is the exit point of the method; selects lines 7, 10 and 11 because they contain method calls that do not have a return value; and selects line 6 because its return value, g, is not in the return statement. It then performs filtering and removes line 6 because the method, createGraphics, contains an object creation operation. In the end, it synthesizes natural language sentences for lines 7, 10, 11 and 12 using their Software Word Usage Model, which then become the method summary. Our work focuses on synthesizing comments for high-level actions within methods instead of synthesizing comments for the entire method.

2.4 Source Code Summarization

Source code summarization presents the set of important keywords that best represents the source code. It differs from automatic comment generation, which requires the generated sentences to describe the functionality of the source code.

Previous work utilizes text retrieval (TR) techniques (e.g., the Vector Space Model, Latent Semantic Indexing and hierarchical PAM) to summarize source code [98, 99, 100, 101]. McBurney et al. [102] proposed a technique that utilizes a topic modeling technique to select keywords to represent the source code. The approach is similar to the traditional field of automated text summarization. A source code summary contains terms that are extracted from the identifiers and comments in the source code. A TR technique selects terms based on the weight of each term in the source code document, and the terms with the highest similarity to the document are captured in the summary. These source code summarization techniques do not generate natural language sentence summaries; instead, they choose the top K terms to represent the source code.

Moreno et al. [103] performed an analysis of human-written and automatic summaries. The results show that developers create longer natural language summaries for code that involves sequences of method calls, and that a term-based summary is less informative than a sentence-based summary. They also identified the key structural elements that should be included in a summary. This thesis proposes CloCom+, which specifically generates sentence-based summaries.
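A minimal sketch of the top-K term selection idea described above (raw term frequency only, with a hypothetical class name; the cited techniques weight terms with models such as the Vector Space Model or topic models rather than raw counts):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch of keyword-based summarization: pick the top-K most frequent terms
// drawn from a method's identifiers and comments.
public class TermSummarizer {
    public static List<String> topTerms(List<String> terms, int k) {
        Map<String, Long> counts = new HashMap<>();
        for (String term : terms) {
            counts.merge(term.toLowerCase(), 1L, Long::sum);
        }
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("image", "scale", "width", "image", "height", "scale", "image");
        System.out.println(topTerms(terms, 2)); // prints [image, scale]
    }
}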

2.5 Mining Descriptions for Code Artifacts

Many studies mine descriptions or documentation for code artifacts from developers' communications, such as bug reports [104], forum posts [105], issue trackers [106], and the web [107]. These studies focus on project-specific descriptions. For example, they extract descriptions for Eclipse code artifacts from the mailing list and bug tracking system of Eclipse. Thus, such descriptions are more likely to benefit Eclipse.

Different from previous work, we propose AutoComment to mine descriptions that are general for each domain. AutoComment extracts source code comments that directly describe the functionality of the code segment, whereas previous work mines information that is specific to the project. AutoComment is not limited to generating descriptions for methods, and we adapt NLP techniques to improve the comment quality. Also, previous work leverages heuristics (e.g., text similarity) to link descriptions and code, whereas AutoComment combines code clone detection and heuristics to improve accuracy. Some work helps improve code extraction from unstructured data, such as emails and documents [108, 109]. In the future, AutoComment can leverage these techniques to mine more descriptions from emails and documents, not only from Q&A sites. Recent work by Chatterjee et al. [110] performed an exploratory study to learn the different kinds of information (e.g., testing, efficiency, erroneous, design, etc.)

that are embedded in different types of software documents (e.g., Q&A sites, bug reports, documentation, etc.). Chatterjee et al. [111] also proposed a technique to automatically identify code segments, along with the corresponding descriptions, from research articles.

2.6 Code Clone Detection

There are three major types of code clone detection techniques: token-based [112, 113, 114], AST-based [115, 116], and semantics-based [117]. Roy et al. performed a comprehensive comparison and evaluation of state-of-the-art clone detection tools [118]. In our proposed work, AutoComment and CloCom, we developed a token-based code clone detection tool to address both the scalability and the adaptability of the code clone detection process. It supports the compilation of partial code segments from Stack Overflow, and its clone matching algorithm provides better support for detecting code clones that contain inserted and deleted statements. CloCom [7] utilized a code clone detection tool with a matching algorithm similar to that of existing work, DuDe [119]. CloCom utilized an abstract syntax tree parser for tokenization of the code elements because the source code can be compiled, whereas SIM [112] used a custom tokenizer written in Lex; CloCom's code clone detection tool is also more scalable than SIM. Wang et al. proposed a token-based large-gap clone detector called CCAligner [120]. Sajnani et al. proposed a token-based clone detector called SourcererCC that utilizes a similarity overlap threshold to detect type-3 clones [37]. Svajlenko et al. addressed the scalability issue of code clone detection tools: they developed a shuffling framework [121] to scale classical code clone detection tools, such as Deckard [115], NiCad [122], iClones [123], and CCFinderX [114], to ultra-large datasets.
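To make the token-based approach concrete, the sketch below is a toy illustration (not the clone detection tool developed for AutoComment and CloCom): it normalizes identifiers and integer literals and reports two fragments as a clone when their normalized token streams are identical. Real detectors additionally tolerate gaps caused by inserted and deleted statements.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Normalize a code fragment into a space-separated token stream:
 * identifiers (and, in this toy, keywords) become "id", integer
 * literals become "num", and punctuation is kept as-is. */
static void normalize(const char *code, char *out, size_t cap) {
    size_t n = 0;
    const char *p = code;
    while (*p && n + 8 < cap) {
        if (isalpha((unsigned char)*p) || *p == '_') {
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            n += snprintf(out + n, cap - n, "id ");
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;
            n += snprintf(out + n, cap - n, "num ");
        } else if (!isspace((unsigned char)*p)) {
            n += snprintf(out + n, cap - n, "%c ", *p++);
        } else {
            p++;
        }
    }
    out[n] = '\0';
}

int main(void) {
    char a[256], b[256];
    normalize("w = (w + 1) / 2;", a, sizeof(a));
    normalize("h = (h + 1) / 2;", b, sizeof(b));
    /* Identical normalized streams => report a (type-2 style) clone. */
    printf("%s\n", strcmp(a, b) == 0 ? "clone" : "not a clone");
    return 0;
}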

2.7 Fuzz Testing

Fuzz testing is a software testing technique that feeds random, invalid, or unexpected data to a program to locate bugs [13, 124, 125]. Blackbox fuzzing provides a program with either random inputs or invalid inputs that have been mutated from a well-formed input. Whitebox fuzzing [13] utilizes symbolic execution to collect constraints on the program input, which allows mutating the input so that it affects the execution path of the program.
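As a heavily simplified illustration of blackbox mutation, the sketch below repeatedly flips a few random bytes of a well-formed seed input and feeds the result to a program; the target name and seed path are placeholders, and real fuzzers add coverage feedback, crash triage, and smarter mutation operators.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* "./target_program" and "seed.input" are placeholders for the program
 * under test and a well-formed seed input. */
#define TARGET "./target_program"

int main(void) {
    FILE *f = fopen("seed.input", "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    rewind(f);
    if (n <= 0) { fclose(f); return 1; }
    unsigned char *seed = malloc((size_t)n);
    fread(seed, 1, (size_t)n, f);
    fclose(f);

    srand((unsigned)time(NULL));
    unsigned char *mut = malloc((size_t)n);
    for (int iter = 0; iter < 1000; iter++) {
        memcpy(mut, seed, (size_t)n);
        /* Mutate a handful of random bytes of the well-formed input. */
        for (int k = 0; k < 4; k++)
            mut[rand() % n] ^= (unsigned char)(1 + rand() % 255);
        FILE *out = fopen("mutated.input", "wb");
        if (!out) break;
        fwrite(mut, 1, (size_t)n, out);
        fclose(out);
        /* A non-zero exit status (e.g., a crash signal) flags a candidate bug. */
        int status = system(TARGET " mutated.input > /dev/null 2>&1");
        if (status != 0)
            printf("iteration %d: target exited abnormally (status %d)\n", iter, status);
    }
    free(mut);
    free(seed);
    return 0;
}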

Godefroid et al. proposed SAGE [124], a whitebox fuzzing technique for security testing. It utilizes symbolic execution to gather constraints on the program input to help find defects in structured-file parsing applications. It first runs the program concretely with a well-formed input, which allows the collection of branch constraints on the individual input bytes. SAGE then negates the constraints one at a time to yield new inputs that can exercise different parts of a program. Majumdar et al. proposed CESE [126], which combines grammar-based blackbox fuzz testing with symbolic execution. CESE converts the context-free grammar of the program input into a symbolic grammar, enumerates a set of valid symbolic strings from the symbolic grammar, and performs concolic execution on the symbolic strings, where unbounded parts such as variable names and numbers are replaced with a symbolic constant. Godefroid et al. extended SAGE and proposed a grammar-based whitebox fuzzing technique [13] that enhances whitebox fuzzing with a grammar specification of the valid input. The grammar specification is translated into higher-level symbolic constraints, which restrict the symbolic tokens that are returned by the lexer to valid inputs. Their approach differs from traditional techniques, which mark input bytes as symbolic [127, 43, 124]. Instead, they mark individual tokens, identified by a lexer, as symbolic. However, the above techniques require the input grammar to be given manually. Therefore, we propose future work to extract context-free grammars that describe the program input automatically. Recent work by Lemieux et al. [128] proposed to automatically generate inputs with feedback-directed mutational fuzzing, which does not require any domain knowledge about the program.

2.8 Documentation Analysis

Many techniques analyze documents for bug detection, for example manual pages to check for undocumented error codes [129], code comments [8, 130, 131], and API documentation [132]. Much work has focused on analyzing software documentation using NLP techniques. Our previous work, DASE [133], extracts constraints from documentation to guide symbolic execution to focus testing on execution paths that implement a program's core functionalities. The valid input of a program, which can be expressed using a set of input constraints, reduces the size of the search space of execution paths. This thesis utilizes constraints from the documentation to repair corrupted files and detect bugs in programs. Zhong et al. proposed work to infer resource specifications from API documentation

for bug detection [134]. The work extracts the action that the method takes and the resource that the method interacts with. Zhong et al. also have follow-up work that infers method specifications from API descriptions [135]. The former work utilized a Hidden Markov Model to build the action-resource pairs, whereas the latter utilized semantic templates. Rubio-González et al. proposed work to detect inaccurate documentation by comparing Linux system calls against their respective manual pages [136]. Tan et al. utilized NLP techniques to detect inconsistencies between source code and code comments to detect software bugs [8, 130].

Chapter 3

Documentation Analysis: Symbolic Execution-Based Software Testing using Documentation Constraints

3.1 Motivation

Software testing is an essential part of software development. Many automated test generation techniques have been proposed and used to improve testing effectiveness and efficiency [52, 137, 138, 40, 124, 139]. Symbolic execution [38, 39] has been leveraged to automatically generate high-code-coverage test suites to detect bugs [53, 140, 141, 142, 143, 60, 144]. Symbolic execution represents inputs as symbolic values instead of concrete values. Upon exploring a branch whose condition involves symbolic values, two paths are created, and the corresponding constraints are added to each path. Once the execution of a path terminates, the collection of constraints along that execution path is used to generate concrete inputs to exercise the path. Symbolic execution suffers from the fundamental problem of path explosion. In practice, one needs to use search heuristics and other techniques to guide symbolic execution [52, 55, 124, 62, 60].

Although symbolic execution has been successful in improving testing effectiveness, existing techniques do not take full advantage of programs' input constraints expressed in documents. Valid program inputs typically need to follow certain constraints. For example, rm (version 6.10) only accepts 11 options including -r and -f, and readelf requires its

input files to follow the Executable and Linkable Format (ELF). Focusing on valid and close-to-valid inputs can help test the core functionalities of the program, which should improve testing coverage and effectiveness as shown by previous techniques [145, 126]. It allows symbolic execution to devote more resources to testing code that implements the program's core functionalities, as opposed to code for input sanity checks and error handling. Fortunately, information about input constraints commonly exists in software documents, such as programs' manual pages (e.g., the output of man rm) and the comments of header files (e.g., elf.h).

Thus, we propose a general approach, Document-Assisted Symbolic Execution (DASE), to enhance the effectiveness of symbolic execution for automatic test generation and bug detection. DASE automatically extracts input constraints from documents, and uses these constraints as a "filter" to favor execution paths that execute the core functionalities of the program. DASE, as a path pruning strategy, can be used on top of existing search strategies to further improve symbolic execution (Section 3.6 shows that DASE can find more bugs and improve testing coverage on top of different search strategies). This automation is novel because existing symbolic execution techniques [145, 126] do not analyze documents automatically and require input constraints to be given. Previous work has shown that constraint extraction from documentation [146, 147] is important yet challenging. Since this automation can reduce manual effort, DASE could make it easier for practitioners to adopt these symbolic execution techniques [145, 126] and other techniques that require input constraints [138, 148, 149], such as constraint verification. DASE considers two categories of input constraints: the format of an input file (e.g., ELF and tar), and the valid values of a command-line option (e.g., -r for rm). These two types are sufficient for a wide spectrum of programs. This work makes the following contributions:

• We propose a novel approach, DASE, to improve automated test generation. By leveraging input constraints automatically extracted from documents, DASE enables symbolic execution to automatically distinguish the semantic importance of different execution paths to focus on programs’ core functionalities to find more bugs and test more code.

• We propose a new technique that combines natural language processing (NLP) techniques, i.e., grammar relationships and heuristics, to automatically extract input constraints from documents. The technique is general and should be able to extract input constraints for purposes other than symbolic execution, such as program

comprehension and constraint verification. We study two types of documents, i.e., manual pages and code comments, and extract input constraints from both.

• Our evaluation shows that DASE finds more bugs and has higher code coverage than KLEE [40] (a symbolic execution tool without input constraints from documents). We evaluated DASE on 88 programs from 5 widely-used software suites—GNU Coreutils, GNU findutils, GNU grep, GNU Binutils, and elftoolchain—most of which have been thoroughly tested by many symbolic execution tools [40, 53, 60, 41]. DASE detected 12 previously unknown bugs that KLEE failed to detect, 6 of which have already been confirmed by the developers, and KLEE only detected 2 previously unknown bugs that DASE failed to detect. (We do not count bugs that are already reported in the KLEE paper; those bugs, which can also be found by DASE, are not counted as newly detected ones by DASE either.) Compared to KLEE, DASE increases line coverage, branch coverage, and call coverage by 14.2–120.3%, 2.3–167.7%, and 16.9–135.2% respectively, which are 6.0–21.1 percentage points (pp), 1.6–18.9 pp, and 2.8–20.1 pp increases. The input constraint extraction of three file formats—ELF, tar, and the Common Object File Format (COFF)—has accuracies of 97.8–100%.

3.2 Overview

A real-world program typically contains a large or even infinite number of execution paths. Given limited time, it is crucial for testing to prioritize the paths effectively. Researchers have proposed approaches to guide the path exploration of symbolic execution [52, 55, 124, 62, 60] to find more bugs and improve code coverage. Path pruning, which applies a "filter" to prune "uninteresting" paths before employing a search strategy, can further address the path explosion problem. Path pruning significantly reduces the size of the search space for a search strategy, which allows the symbolic execution engine to devote more time to testing the "interesting" paths that would not have been tested due to time constraints.

We propose using input constraints as a "filter" to aid search strategies to focus on both valid and close-to-valid inputs (e.g., boundary cases) and explore deeper into a program's core functionality. The core functionality of a program is typically related to processing valid inputs. For example, a C compiler's core functionality is parsing and compiling valid C programs. Valid C programs are only a small portion of all strings (the input space of a C compiler).

 1  int counter = 0;
 2  for (int i = 0; i < 30; i++) {
 3    if (input[i] == 'A') {
 4      counter++;
 5      foo();
 6    }
 7  }
 8  if (counter == 30) {
 9    process_boundary_cases();  // bug!
10    if (input[30] == 'B')
11      process_valid_input();   // bug!
12  }

Figure 3.1: Motivating example on how input constraints help symbolic execution find more bugs.

Randomly generated inputs can cover many invalid inputs, but miss valid and close-to-valid ones. While symbolic execution addresses this issue by exploring paths systematically, it is unaware of which branch (the "then" branch or the "else" branch) leads to valid inputs upon a conditional statement. Input constraints that define valid inputs can guide symbolic execution to focus on paths corresponding to valid inputs. The constraints can be slightly relaxed (e.g., relaxing the constraint "x must be between 0 and 10 (inclusive)" to "x must be between -1 and 10 (inclusive)") to exercise paths corresponding to close-to-valid inputs to test boundary cases. These paths (for valid and close-to-valid inputs) can pass the trivial part of the input sanity check to go deeper and are more likely to uncover bugs [145, 126] for two main reasons. First, keeping invalid inputs in the search space hurts the effectiveness of symbolic execution based test generation: exploring invalid inputs takes up time and memory, which could be used for testing valid and close-to-valid inputs instead. Second, some constraints are solved or simplified (e.g., the ones related to the concrete valid option), which reduces the computation time of the constraint solver.

Next we (1) illustrate why input constraints can help symbolic execution find more bugs and improve testing coverage and (2) summarize how DASE extracts these types of constraints automatically from two sources of documents.

Why can input constraints help symbolic execution find more bugs and test more code? We will use the code snippet in Figure 3.1 to answer this question; real code from Binutils is shown later in Figure 3.5 to explain how the automatically extracted input constraints help DASE detect previously unknown bugs and improve coverage.

[Figure 3.2 plot: runtime (minutes) versus the number of supplied input constraints.]

Figure 3.2: The runtime to find the bug in line 11 decreases exponentially as we supply more input constraints. The runtime when no constraint is supplied is not depicted because the bug was not detected after 10 hours, when we stopped the execution.

The code snippet in Figure 3.1 has 32 branches (30 from lines 2 and 3, one from line 8, and one from line 10), indicating 2^32 possible paths to explore. Without knowing which paths execute the core logic, it is hard to expose the bug deep in line 11 because only 1 out of the 2^32 paths leads to that line. DASE automatically extracts constraints from documents and finds that the first 30 characters of a valid input must be 'A', and the next character must be 'B'. These constraints guide the execution to line 11. If the document is incomplete, e.g., only mentioning that the first 30 characters of a valid input must be 'A', we can still hit the bug in line 9, which is triggered by close-to-valid inputs. In addition, it increases the chance to detect the bug in line 11 (from 1/2^32 to 1/2). In either case, DASE can cover more code (lines 9–10 and possibly 11), which is hard for standard symbolic execution to cover, in addition to detecting more bugs. We ran KLEE on this example for 10 hours, and KLEE detects neither of the bugs. In contrast, DASE detects both bugs in 0.1 seconds. In practice, one may not have all constraints to define the entire input. To understand the effect of the number of constraints, we plot in Figure 3.2 how the time to discover the bug in line 11 changes as the number of given constraints changes. The runtime to find the bug decreases exponentially as we supply more input constraints, suggesting that input constraints can dramatically improve the efficiency of finding bugs, i.e., finding more bugs given the same amount of time.

How to flatten symbolic execution to find more bugs and test more code? Command-line options are a special type of input. Therefore, we propose a new way to leverage their constraints to improve testing effectiveness.

[Figure 3.3 diagram of execution trees for `bash$ program -a -b ... -o`: (a) Without DASE; (b) With DASE.]

Figure 3.3: Abstract view of execution trees for command-line options. Clouds are execution subtrees related to valid command-line options. Ovals are other execution subtrees. Deep options such as -o are more likely to be tested with DASE.

Command-line options are used to invoke certain functionalities of programs or to tune parameters. For example, the option -r tells rm to perform a recursive deletion. A program typically uses nested if-else statements or a switch-case statement to check the input argument against all valid options until it finds a match, and then invokes the corresponding functionality. DASE extracts valid options by analyzing programs' documentation and uses them as input constraints. For example, DASE finds that among the 256^m possibilities for rm (m is the maximum number of characters allowed in an option, and there are 256 possibilities for a single 8-bit character), only 11 values are valid options. With n valid options (n = 11 in the example above), DASE "forks" the execution state n times, with each child execution state taking a valid option (for completeness, we add additional child execution states for an invalid option and a null option, but the execution time for these two should be less than that of the circles in Figure 3.3a combined). In this way, DASE creates n execution branches for a program with n valid options (one for each valid option). The concretization moves all valid options to the same depth of the execution tree, indicating that all valid options are treated equally (Figure 3.3b). Figure 3.3a illustrates the dynamic execution tree of symbolic execution without DASE. Clouds are subtrees related to valid command-line options. If a program has 15 valid options a–o, and o is the deepest valid option as shown in Figure 3.3a, the time spent on testing code related to

option -o could be 1/2^15 of the total testing time without DASE (the actual time depends on the search strategy, but the time spent on testing -o would be much smaller than that of testing -a). The reason is that there are 15 branches from the nested if-else or switch-case statements, one for each valid option. Such a small fraction often means option -o would not be tested at all in practice. With DASE, the time spent testing option -o would be much longer—about 1/15, as in Figure 3.3b. This way, a bug in the code that processes option -o is more likely to be exposed with DASE. On the other hand, the probability of hitting a bug in a shallower option (e.g., b) would be reduced from 1/2^2 to 1/15, but the difference is much smaller, and it is still highly likely that option b will be tested given that the probability is 1/15. In addition to finding more bugs, since each option has about a 1/15 chance to be explored, more options are likely to be tested, improving testing coverage (Section 3.6.2 shows that DASE covers more options than KLEE). Although this approach may appear to be similar to breadth-first search (BFS), it is very different from BFS. Without DASE, BFS would explore paths as in Figure 3.3a, which would still waste time on shallow paths and be less likely to explore deeper paths. In fact, our evaluation shows that DASE outperforms KLEE even if BFS is used as the underlying search strategy (Section 3.6.2).

What documents to analyze and how to extract input constraints from them automatically? Many types of software documents are available: manual pages, code comments, API documentation, requirement documents, etc. This work studies and analyzes two popular types for constraint extraction, i.e., manual pages and code comments, since they describe whole-program constraints (as opposed to API documentation, which describes method-level constraints), and they contain more code-level constraints (compared to requirement documents). We conduct an informal qualitative study of 82 manual pages from Coreutils and the code comments of 3 header files (ELF, tar and COFF). Manual pages typically have higher English quality (e.g., grammatically correct full English sentences) since they are meant to be read by more than just the developers. On the other hand, it is easier to link constraints from code comments to code artifacts since comments are embedded in the code (e.g., a comment typically describes the code segment right below it).

Valid options are typically described in a well-structured manner in manual pages. Therefore, we use simple regular expression matching to extract them (Section 3.4.3). Input file formats are described in both manual pages and code comments. We use regular expression matching to analyze the manual pages. Since code comments are less structured, we use NLP techniques, i.e., grammar relationships, for extraction (Section 3.4.1).


Grammar relationships can help identify relevant sentence structures for constraint extraction. They can tolerate different word orders and paraphrases, and are thus more general than hard-coded heuristics.

3.3 Background

KLEE is a symbolic execution engine based on LLVM. Programs are compiled into LLVM bitcode and then interpreted by KLEE. KLEE models the programs' running states. It checks for dangerous operations (e.g., pointer dereferences and assertions) that can cause the program to fail. In addition, KLEE maintains the path constraints that drive the execution to the current state. KLEE provides a function klee_make_symbolic() to make memory symbolic, whose usages are tracked and whose constraints are collected. KLEE can also intercept the startup of programs and insert logic to make them support options for symbolic execution by using the function klee_init_env(). Supported options include (1) --sym-args MIN MAX N, which expands to at least MIN and at most MAX symbolic arguments, each with a maximum length of N; and (2) --sym-files NUM N, which makes stdin and up to NUM files symbolic, each with a maximum size of N.

KLEE's default search strategy consists of two atomic search strategies that are interleaved in round-robin fashion to prevent one strategy from getting stuck. The first strategy, coverage-optimized search, uses heuristics to choose a state that is most likely to cover new code in the immediate future. The second strategy, random path selection, randomly selects a branch to follow at a branch point, which helps alleviate starvation.
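As a concrete illustration of this interface, the following is a minimal sketch of a KLEE test driver (illustrative only, not code from this thesis): klee_make_symbolic() marks a buffer as symbolic, and klee_assume() adds a constraint to the current path, which is also the mechanism by which document-derived input constraints such as those discussed for Figure 3.1 can be supplied.

/* Minimal sketch of a KLEE test driver (illustrative, not thesis code). */
#include <klee/klee.h>

int main(void) {
    char input[32];
    klee_make_symbolic(input, sizeof(input), "input");

    /* Constrain the input toward valid values: the first 30
     * characters must be 'A' (cf. the example in Figure 3.1). */
    for (int i = 0; i < 30; i++)
        klee_assume(input[i] == 'A');

    /* KLEE explores the remaining branches (e.g., whether input[30]
     * is 'B') and generates a concrete test case for each explored
     * path from the collected path constraints. */
    if (input[30] == 'B')
        return 1;
    return 0;
}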

3.4 Design and Implementation

This section describes how DASE extracts and utilizes input constraints for file formats (Section 3.4.1 and Section 3.4.2) and options (Section 3.4.3 and Section 3.4.4).

3.4.1 Extracting File Format Constraints

DASE automatically extracts input constraints regarding file formats from both code comments and manual pages. As discussed in Section 3.2, code comments and manual pages have different characteristics, so different techniques are used to extract constraints from

them: NLP techniques for code comments, and regular expressions for manual pages. The same techniques are used for all three file formats—ELF, tar, and COFF. We apply NLP techniques to analyze the comments and code in header files to extract constraints automatically. The header file contains a large number of comments that describe the constraints for the struct data fields (i.e., each comment is followed by a list of macros representing the valid values). One example is:

/* Fields in the e_ident array.  The EI_* macros are indices into the
   array.  The macros under each EI_* macro are the values the byte
   may have.  */
#define EI_MAG0   0
#define ELFMAG0   0x7f
#define EI_MAG1   1
#define ELFMAG1   'E'

DASE automatically generates two constraints regarding array index-value pairs from the comments and code:

assume(Elf32_Ehdr->e_ident[EI_MAG0] == ELFMAG0);
assume(Elf32_Ehdr->e_ident[EI_MAG1] == ELFMAG1);

where assume() is a KLEE function for putting constraints onto the current path. The rest of this section explains the NLP techniques used to generate the constraints. Our technique extracts two types of value constraints: array index-value pairs and struct field values (e.g., assume(Elf32_Shdr->e_type == 0 | ...);). Since comments are written in natural language, developers can use different forms to express the same meaning. For example, they may use "Fields in the e_ident array", "Fields of the e_ident array", "The e_ident array's fields", or "The array e_ident's fields" to start the listing of fields. These sentences use different sentence structures and words to express the same meaning, which makes them difficult to analyze automatically. Simple regular expression matching will fail to accommodate all these and other variants. We propose to use Stanford typed dependencies [150] to analyze the dependencies and grammatical relations among words and phrases in a sentence to handle these variants. Our technique is different from prior work [8, 151]. DASE uses four grammar rules (GR) to identify relevant comments and extract constraints from them. All four rules are used as main rules to identify relevant comments—if a sentence contains the typed dependency defined by a GR, it is considered relevant and

28 remains for further analysis. GR1 and GR2 can also act as a supporting rule for any main rule. For example, GR1 can help identify the parameters in a rule, e.g., array and field names. The four GRs are listed below:

• GR1: Noun or Adjectival Modifier (main/support rule) A noun or adjectival modifier is a noun or adjectival phrase that modifies a noun phrase [152]. For example, in the comment "Fields in the e_ident array", the noun phrase "e_ident" modifies the noun "array". DASE applies this grammar relationship to retrieve data structure names and index names.

• GR2: Prepositional Modifier (main/support rule) A prepositional modifier is a prepositional phrase that modifies the meaning of a verb, adjective, noun or preposition [152]. For example, in the comment "Legal values for sh_type field of Elf32_Shdr", the prepositional phrase "for ... Elf32_Shdr" modifies the noun "values". DASE applies this grammar on modifiers (i.e., "for", "of", "in" and "under") to locate specific nouns (i.e., "value" and "field") or a specific word in the prepositional phrase (i.e., "field"). After locating the prepositional modifier, the dependency tree links "values" to the content word "field". If the content word is being modified by an adjectival modifier, DASE applies GR1 to resolve the properties. In this example, GR1 will return "sh_type" as the property of "field", and GR2 will flag the macros as the legal values for that data field.

• GR3: Nominal Subject (main rule) A nominal subject is a noun phrase that is the syntactic subject of a clause [152]. For example, in the comment "The EI_* macros are indices into the array", the noun "macros" is the subject of the clause "indices into the array". DASE applies this grammar to locate specific clauses (i.e., "indices ..." and "values ..."). After locating the nominal subject, DASE applies GR1 to resolve the properties. In this example, GR1 will return the regular expression "EI_*" as the property of "macros", and GR3 will flag the macros named under this regular expression as the indices of an array.

• GR4: Possession Modifier (main rule) A possession modifier holds the relation between the head of a noun phrase and its possessive determiner [152]. For example, in the comment "sh_type field's legal values", the head noun is "field" and the possessive determiner is "values". DASE applies this grammar to locate specific possessive determiners (i.e., "value"). After locating the possession modifier, DASE applies GR1 to resolve the properties of the head noun. In this example, GR1 will return the field name "sh_type" as the property of "field".

[Figure 3.4 diagram: ELF Header, followed by a Section Header Table SH[0]–SH[4] (null section, section header string table, symbol table, dynamic section, random section) and a Program Header Table PH[0].]

Figure 3.4: DASE’s ELF layout. SH is Section Header, and PH is Program Header. Numbers in brackets are array indices.


If a comment only specifies a partial field name, DASE will resolve the name into a fully qualified name. For example, the comment "Legal values for e_type" specifies a field name, "e_type", without the struct name. DASE maps this field name to the structs that contain it and generates the fully qualified names "Elf32_Ehdr→e_type" and "Elf64_Ehdr→e_type".

In addition, DASE extracts constraints from manual pages using regular expressions. Manual pages often show a struct declaration, followed by the constraints (if available) for each field in the struct. The valid values for each struct field can be identified based on the indentation of the manual page. Based on this layout, DASE first locates the name of the struct, and maps it to each of the constraints that are listed below it. The output of this analysis is also a list of constraints that can be directly used by the symbolic execution part of DASE.

3.4.2 Adding File Layout Constraints

ELF files follow a certain layout, which also defines valid ELF files. Therefore, in addition to extracting the file format constraints as described in Section 3.4.1, we add file layout

constraints for ELF by reading the ELF specification [153]. Our results show that both the file format and the file layout constraints contribute to the improvement of DASE.

An ELF file always starts with an ELF header, followed by two header tables: the section header table (SHT) and the program header table (PHT). The SHT contains an array of section headers, the PHT contains an array of program headers, and the object file's real data are in the sections. In order to reduce the workload of the constraint solver and focus on the important parts of ELF, we adopt a rigid layout as shown in Figure 3.4. The SHT is set to have five section headers. The first section (at index 0) is a null section, followed by a string table, a symbol table, a dynamic section, and a random section. The second and fifth sections are set with a size of 8 bytes, and the third and fourth sections are set with a size that is enough to hold two symbols. The PHT is set to contain one random program header. Note that our ELF layout is incomplete. We retain this incompleteness to give DASE the ability to explore close-to-valid inputs and boundary cases. In addition, input constraints can be slightly relaxed to include more close-to-valid inputs, which remains future work.
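The sketch below illustrates the kind of constraints described above, expressed as klee_assume() calls over a symbolic Elf32_Ehdr. It is illustrative only: the constraints DASE actually enforces are more extensive and are generated from the elf.h comments and the ELF manual page, and the specific field values shown here are assumptions for the example.

#include <elf.h>
#include <klee/klee.h>

/* Sketch: constrain a symbolic ELF header toward the rigid layout of
 * Figure 3.4 (illustrative values; not DASE's actual constraint set). */
static void constrain_elf_header(Elf32_Ehdr *ehdr) {
    /* Magic bytes: array index-value pairs extracted from elf.h (Section 3.4.1). */
    klee_assume(ehdr->e_ident[EI_MAG0] == ELFMAG0);
    klee_assume(ehdr->e_ident[EI_MAG1] == ELFMAG1);
    klee_assume(ehdr->e_ident[EI_MAG2] == ELFMAG2);
    klee_assume(ehdr->e_ident[EI_MAG3] == ELFMAG3);

    /* Layout: five section headers and one program header; other offset
     * and size fields would be constrained similarly. */
    klee_assume(ehdr->e_shnum == 5);
    klee_assume(ehdr->e_phnum == 1);
    klee_assume(ehdr->e_ehsize == sizeof(Elf32_Ehdr));
}

int main(void) {
    Elf32_Ehdr ehdr;
    klee_make_symbolic(&ehdr, sizeof(ehdr), "ehdr");
    constrain_elf_header(&ehdr);
    /* ... hand the constrained header to the code under test ... */
    return 0;
}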

3.4.3 Extracting Valid Options

We automatically extract valid options only from manual pages because we find that code comments do not describe valid options. Since manual pages list the valid options in a standardized format, our parsers perform simple regular expression matching, which is effective and accurate. DASE takes a manual page as input and outputs a list of valid command-line options. It uses two regular expressions, one for short options (a single dash followed by a single letter), and one for long options (two dashes followed by multiple letters). If a short option has a long-option equivalent, DASE keeps only the short option. We also explored extracting valid options directly from the source code using Clang tools [154]. However, since Clang often undergoes version changes that break the API compatibility of the Clang tools, and applications may contain different parser code, we decided to utilize manual pages to extract the constraints. Although manual pages might not always contain the most up-to-date information, the extraction code can extract valid options from a wider range of programs with less effort, which makes it a better choice.
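The sketch below illustrates this style of extraction with POSIX regular expressions; the patterns are deliberately simplified and are not DASE's actual parser, which is more careful about the OPTIONS section layout and short/long equivalences.

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t long_opt, short_opt;
    /* Long options: two dashes followed by letters; short options: a
     * single dash and one letter, delimited by whitespace or a comma. */
    regcomp(&long_opt, "--[a-zA-Z][a-zA-Z-]*", REG_EXTENDED);
    regcomp(&short_opt, "(^|[[:space:]])-[a-zA-Z]([,[:space:]]|$)", REG_EXTENDED);

    char line[1024];
    /* Read a rendered manual page from stdin, e.g. `man rm | ./a.out`. */
    while (fgets(line, sizeof(line), stdin)) {
        regmatch_t m;
        if (regexec(&long_opt, line, 1, &m, 0) == 0)
            printf("long option:  %.*s\n", (int)(m.rm_eo - m.rm_so), line + m.rm_so);
        else if (regexec(&short_opt, line, 1, &m, 0) == 0)
            printf("short option: %.*s\n", (int)(m.rm_eo - m.rm_so), line + m.rm_so);
    }
    regfree(&long_opt);
    regfree(&short_opt);
    return 0;
}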

3.4.4 Using Options to Flatten Symbolic Execution

DASE takes the options extracted in Section 3.4.3 to trim and reorganize the dynamic symbolic execution tree as shown in Figure 3.3. Specifically, instead of having s symbolic arguments, DASE runs the program with s − 1 symbolic arguments and a concrete valid option, which forms one execution branch. In this way, DASE creates n execution branches for a program with n valid options (one for each valid option). The aim is to balance the testing effort across the options (and the corresponding functionalities), which should be of similar semantic importance (at least not as skewed as 1/2 versus 1/2^n, as shown in Section 3.2). The generated branches are then prioritized by search strategies. We can consider this technique a "partition" of the execution tree. The s − 1 arguments remain symbolic, and can expand to any concrete options. Therefore, it is possible to cover combinations of command-line options such as "-r -f" (of rm) in our approach. To ensure the completeness of this "partition", we add a branch for an invalid option and a branch for a null option.
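The following is a hypothetical illustration (not DASE's implementation) of one such execution branch: the first argument is a concrete valid option extracted from the manual page (here "-r" for rm), while the second argument stays fully symbolic; one branch of this shape would be created per valid option. The run_target() stub merely stands in for the entry point of the program under test.

#include <klee/klee.h>

/* Stub standing in for the entry point of the program under test. */
static int run_target(int argc, char **argv) {
    (void)argv;
    return argc;
}

int main(void) {
    static char sym_arg[11];
    klee_make_symbolic(sym_arg, sizeof(sym_arg), "sym_arg");
    klee_assume(sym_arg[10] == '\0');   /* keep it a valid C string */

    /* One execution branch: concrete valid option plus one symbolic argument. */
    char *argv[] = { "rm", "-r", sym_arg, 0 };
    return run_target(3, argv);
}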

3.5 Evaluation Method

We use three coverage criteria reported by gcov, i.e., line, branch, and call coverage (% of executed function calls), as our main coverage metrics. These coverage criteria and gcov are widely used in the literature [40, 53, 41, 54].

3.5.1 Evaluated Programs

We evaluate DASE on the following 88 programs from 5 popular and mature fundamental software suites for Unix-like systems. The sizes of these programs are at the same scale as the ones evaluated by previous work [53, 60, 41, 54].

Coreutils 6.10. Coreutils, also evaluated by KLEE, is a package of GNU programs that consists of basic file, shell, and text manipulation utilities.

diff 3.3. diff compares files line by line and outputs the differences.

grep 2.18. grep searches files for given patterns.

objdump & readelf(b) 2.24. objdump and readelf are used for displaying the contents of ELF files. Since both Binutils and elftoolchain contain a readelf program, we use readelf(b) to denote the readelf program in Binutils and readelf(e) to denote the one in elftoolchain.

elfdump & readelf(e) r2983. In order to test our ELF model more thoroughly, we select elftoolchain's counterparts for the above two programs. elftoolchain provides similar tools to Binutils, but favors well-separated and well-documented libraries.

3.5.2 Experimental Setup

All automatically extracted file format constraints and valid options (without manual examination, for zero manual effort) are used as input constraints for all programs when applicable. DASE extracts file format constraints for ELF and uses them for the 4 ELF processing programs (objdump, readelf(b), elfdump, and readelf(e)) for path pruning. ELF is a broadly used standard for binaries in Unix-like systems. One can use the ELF model that we build to potentially improve test generation for all programs that read or write ELF binaries on a Unix-like platform. In addition, DASE extracts valid options for the rest of the programs automatically and uses them to guide the symbolic execution on them. To show the generality of our techniques for automatically extracting file format constraints, DASE extracts file format constraints for two additional standard file formats—Tar from tar.h, and the Common Object File Format (COFF) from coff/internal.h. We run KLEE and DASE on each program until no new instructions are covered in a certain amount of time: 15 minutes for Coreutils programs and 30 minutes for the rest due to their larger sizes. This stop criterion allows both DASE and KLEE to run until they cannot make progress in coverage in a fixed time period, which is similar to that of the previous paper [54], but different from that of KLEE [40], in which each program is only allowed to run for one hour. In our experiments, the actual run time of each program varies from 6 seconds to 11.5 hours. We have also conducted experiments using the stop criterion from the KLEE paper, and DASE still achieves a similar amount of improvement over KLEE. The other parameters are set by following the instructions from KLEE's authors [155]. The key parameters are:

klee PROG --sym-args 0 1 10 --sym-args 0 2 2 --sym-files 1 8 --sym-stdout

where PROG is a program in Coreutils. For DASE, we keep all the parameters the same as for KLEE, except that we replace a symbolic argument with a list of valid options.

For diff and grep, we set the symbolic file size to 100 bytes because they are meant to process textual files. For the ELF processing programs, we use the following parameters respectively for KLEE and DASE:

klee --sym-args 0 2 2 --sym-files 1 640
klee --sym-args 0 2 2 --sym-elfs 1 640

where --sym-elfs holds our ELF model described in Section 3.4.2. We conduct our experiments on an Intel Core i5-2400 3.10GHz CPU machine running Ubuntu 13.10. KLEE is built from git revision a45df61 with LLVM 2.9.

3.6 Evaluation Results

This section shows that DASE finds more previously unknown bugs, improves code coverage on top of different search strategies, complements developer tests, and extracts input constraints automatically. We also show that our results are statistically significant.

3.6.1 Detected Bugs

Using the constraints automatically inferred from documents (without any manual verification), DASE finds more bugs than KLEE. KLEE detects 3 previously unknown bugs from the 88 programs, while DASE uncovers 13 previously unknown bugs (KLEE failed to detect 12 of them). Table 3.1 lists all of the detected previously unknown bugs.

DASE found 2 previously unknown bugs in Coreutils and 3 in Binutils (objdump & readelf(b)), both of which have already been thoroughly tested by many symbolic execution tools. For example, Coreutils has been tested by Veritesting [53], ZESTI [60] and KLEE [40], and Binutils has been tested by Veritesting [53], ZESTI [60], and KATCH [41]. Finding 5 new bugs in those extensively tested suites demonstrates DASE's ability to find new bugs and improve symbolic execution.

readelf(b) fails with a segmentation fault when the input file contains malformed attribute sections (of type SHT_ARM_ATTRIBUTES) [156]. The bug exists in the function process_attributes(), which is shown in Figure 3.5. Pointer p walks through the whole section.

Table 3.1: New bugs detected by KLEE and DASE. "X" denotes a bug is found by a tool. "IU" means "Integer Underflow." "DBZ" is "Divide By Zero." "IL" is "Infinite Loop." "NPD" means "NULL Pointer Dereference." "POB" stands for "Pointer Out of Bounds." "ME" is "Memory Exhausted."

No  Program     Location         Problem  Detected (KLEE/DASE)
1   readelf(b)  readelf.c:12202  IU       X
2   objdump     elf-attrs.c:463  IU       X
3   objdump     elf.c:1351       POB      X
4   readelf(e)  readelf.c:4015   DBZ      X
5   readelf(e)  readelf.c:2862   DBZ      X
6   readelf(e)  readelf.c:3680   DBZ      X
7   readelf(e)  readelf.c:3930   IU       X
8   readelf(e)  readelf.c:3961   IL       X
9   readelf(e)  readelf.c:4102   IL       X
10  readelf(e)  readelf.c:2662   NPD      X
11  readelf(e)  readelf.c:2426   POB      X
12  elfdump     elfdump.c:1509   POB      X X
13  elfdump     elf_scn.c:87     POB      X
14  head        head.c:207       ME       X
15  split       split.c:333      ME       X

At line 19, 4 bytes are read and interpreted as the length (section_len) of the subsequent data structure. Directly after that, the program expects to read a string and assign its length to namelen. However, section_len can be a number smaller than namelen + 4, which causes an integer underflow at line 24. The variable section_len, which becomes an extremely large number after the underflow, is later used as the stop condition of a loop that keeps reading the following memory, which eventually causes a segmentation fault. Five other functions are ahead of process_attributes() in the call stack, namely main(), process_file(), process_object(), process_arch_specific(), and process_arm_specific(). Each function reads and processes specific parts of the input ELF file. For example, to correctly invoke process_attributes(), the condition of the if statement at lines 2–5 must evaluate to false. The automatically extracted ELF constraints guide DASE to generate an ELF file that satisfies all these constraints to reach process_attributes() and expose the bug. This close-to-valid ELF file helps DASE detect this bug. readelf(e) contains a similar bug.

 1  static int process_file_header (void) {
 2    if (elf_header.e_ident[EI_MAG0] != ELFMAG0
 3        || elf_header.e_ident[EI_MAG1] != ELFMAG1
 4        || elf_header.e_ident[EI_MAG2] != ELFMAG2
 5        || elf_header.e_ident[EI_MAG3] != ELFMAG3) {
 6      error (_("Not an ELF file - ..."));
 7      return 0;
 8    } ...
 9  }
10  ...
11  static int process_object (...) { ...
12    if (!process_file_header ())
13      return 1; ...
14    process_arch_specific (file);  /* calls
15        process_attributes() indirectly */ ...
16  }
17  ...
18  static int process_attributes (...) { ...
19    section_len = byte_get (p, 4);
20    p += 4;
21    ...
22    namelen = strlen ((char *) p) + 1;
23    p += namelen;
24    section_len -= namelen + 4;
25
26    while (section_len > 0)
27      ...
28  }

Figure 3.5: Buggy code in readelf.c from Binutils.

The head program fails with memory exhaustion when invoked with the options -c -1P, which tell head to print all but the last 1P bytes of the input file. Since P is a large unit of 1024^5, head tries to allocate a large amount of memory, which exceeds the total amount of available memory. According to the comment, head is not expected to "fail (out of memory) when asked to elide a ridiculous amount". For bigger units (e.g., Z and Y), head exits with the correct error message—"number of bytes is so large that it is not representable". Neither the developers' hand-written tests nor the KLEE-generated tests detect this bug.

Two bugs can be found by KLEE but not by DASE for the following reason. The ELF file that triggers the bugs has a very large e_shoff value (the SHT's offset from the beginning of the ELF file), which is incompatible with our ELF model. As shown in Section 3.4.2, we manually fixed the offset to lay out the SHT. Missing these two bugs shows the tradeoff involved in designing the ELF model. DASE focuses on the more valid inputs to test the core logic.

36 Table 3.2: Coverage results with KLEE’s default search strategy. “Line”, “BR”, and “Call” show the total number of executable lines of code (ELOC), branches, and calls for each program, reported by gcov. “K” stands for KLEE and “D” is DASE. “∆” is the improvement in percentage points of DASE over KLEE.

Program      Line   K(%)   D(%)   Δ(pp)   BR     K(%)   D(%)   Δ(pp)   Call   K(%)   D(%)   Δ(pp)
Coreutils    18329  66.2   75.6   +9.4    12674  69.9   77.3   +7.4    7008   56.6   67.5   +10.9
diff         526    59.1   67.9   +8.8    489    68.1   69.7   +1.6    150    46.6   59.3   +12.7
grep         932    37.3   58.4   +21.1   786    40.3   59.2   +18.9   266    33.5   53.6   +20.1
objdump      1687   19.4   25.6   +6.2    1270   16.9   22.8   +5.9    463    16.6   19.4   +2.8
readelf(b)   6998   6.9    15.2   +8.3    5410   6.2    16.6   +10.4   1959   6.9    13.5   +6.6
elfdump      1539   16.1   22.1   +6.0    1157   20.4   30.7   +10.3   533    16.5   23.6   +7.1
readelf(e)   3571   13.0   28.0   +15.0   2550   18.5   34.5   +16.0   1126   10.8   25.4   +14.6

Our results clearly demonstrate the benefits of our design choice: DASE finds 10 more bugs than KLEE. One can relax the constraints to explore less-valid inputs and potentially cover these two bugs. Running KLEE and DASE together to gain the benefits of both is also a good solution.

3.6.2 Code Coverage

Table 3.2 shows the overall code coverage achieved by KLEE and DASE. DASE outperforms KLEE on the 88 programs: it increases the line coverage, branch coverage, and call coverage by 14.2–120.3%, 2.3–167.7%, and 16.9–135.2% respectively, which are 6.0–21.1 pp, 1.6–18.9 pp, and 2.8–20.1 pp increases. For example, the line coverage boost on grep is 21.1 pp. Programs readelf(b), objdump, readelf(e), and elfdump are difficult to test because their inputs involve the complex ELF format. Despite the lower coverage, DASE detected new bugs in them that existing techniques did not detect, as shown earlier. Figure 3.6 shows the coverage improvement of DASE over KLEE on readelf(b) over time; the improvement increases as time proceeds.

The coverage percentages for Coreutils are different from those of the KLEE paper [40]. The difference is inevitable because the KLEE tool has evolved significantly since then, including major code changes of KLEE (e.g., removals of special tweaks), and an architecture change from 32-bit to 64-bit. We choose the latest version of KLEE at the time of the experiment because the original version used in the KLEE paper is not publicly available. For a fair comparison, the configurations for KLEE and DASE are identical.

DASE explored deeper than KLEE. Since DASE filters out "uninteresting" paths, we expect it to explore deeper.

[Figure 3.6 plot: branch coverage (%) of DASE and KLEE on readelf(b) versus time (seconds, ×10^4).]
Figure 3.6: Branch coverage on readelf(b) over time.

Table 3.3: Number of instructions of generated test cases, showing that DASE explored deeper than KLEE. "K-" stands for KLEE and "D-" stands for DASE. "AVG-I" and "MAX-I" are the average and maximum number of instructions for the generated test cases, respectively. Since Coreutils includes multiple programs, a range (the minimum and the maximum) is shown.

Program                  K-AVG-I      D-AVG-I      K-MAX-I       D-MAX-I
Coreutils (82 programs)  7132–49688   8170–55672   11682–320138  13200–1189470
diff                     18483        22427        35432         35976
grep                     25942        26231        43424         53081
objdump                  45915        62236        104479        206874
readelf(b)               11570        17800        24884         36196
elfdump                  13827        28560        24433         319744
readelf(e)               18009        30069        29140         61233

We count the number of executed instructions for each test case generated by KLEE and DASE to approximate the depth of the corresponding paths. The average and maximum numbers are shown in Table 3.3: DASE generates test cases that execute many more instructions, indicating that DASE goes much deeper into the execution tree than KLEE. For the ELF processing programs, both the averages and maximums almost double their counterparts of KLEE. The difference is expected because while KLEE is still exploring the early stage of the ELF sanity check, DASE has already penetrated through that part with the help of our ELF model.

DASE covered more functionalities and options than KLEE. To investigate DASE's coverage gain in detail, we manually check the coverage difference of KLEE and DASE on diff.

Table 3.4: Coverage of combining DASE with developer test cases, showing that DASE complements developer tests. "V" is developer tests. "D" is DASE combined with developer tests. Note that readelf(e) has no developer test cases and hence is missing.

Program      Line   V(%)   D(%)   Δ(pp)   BR     V(%)   D(%)   Δ(pp)   Call   V(%)   D(%)   Δ(pp)
Coreutils    18332  66.1   84.4   +18.3   12670  73.2   87.2   +14.0   7008   53.2   73.5   +20.3
diff         526    57.0   81.0   +24.0   489    72.2   87.3   +15.1   150    50.7   71.3   +20.6
grep         932    82.0   86.5   +4.5    786    87.0   89.8   +2.8    266    73.7   81.2   +7.5
objdump      1687   58.6   64.3   +5.7    1270   66.9   68.9   +2.0    463    51.2   57.2   +6.0
readelf(b)   7038   28.8   33.5   +4.7    5424   43.5   46.5   +3.0    1964   28.6   33.2   +4.6
elfdump      1084   71.6   75.0   +3.4    813    86.5   86.5   +0.0    264    46.2   52.9   +6.7

KLEE explores only 27 out of the 55 distinct options (we count options that invoke the same code segment as one option), which are the shallower options, while DASE covers 46 options. The result agrees with our analysis in Figure 3.3. We have similar observations for the ELF processing programs. We manually examine the coverage difference on readelf.c (Binutils). For the three functions related to the dynamic section, *_dynamic_section(), in which * means get_32bit, get_64bit, or process, KLEE fails to cover any of them, while DASE naturally tests them all because our ELF model has a dynamic section. In addition, many other functions, such as print_symbol(), are missed by KLEE but covered by DASE.

3.6.3 DASE Complements Developer Tests

Since automated test generation aims to complement developer-generated tests, we evaluate whether DASE finds bugs that developer tests cannot detect, and whether it improves code coverage on top of developer tests. DASE detected a total of 13 bugs on the evaluated programs that developer-generated tests fail to detect (Section 3.6.1). Table 3.4 shows the coverage comparison. We can see that by adding DASE-generated tests, the line coverage is improved by 3.4–24.0 pp. Together with Table 3.2, we can see that for Coreutils and diff, DASE alone can generate tests that achieve code coverage comparable to the developer-generated ones. Although the coverage improvement on objdump, readelf(b), and elfdump is relatively small, the DASE-generated tests detected previously unknown bugs for all of them. The results demonstrate that DASE can be used by developers to find more bugs and further improve testing coverage even if manual tests exist.

3.6.4 Constraint Extraction Results

DASE automatically extracted input constraints from manual pages and comments for command-line options and three file formats with accuracies of 97.8–100%. Specifically, for command-line options, DASE automatically extracted 776 valid options from manual pages: 683 from the 82 Coreutils programs, 46 from grep, and 47 from diff. The accuracy is 100%.

For the ELF processing programs, we manually enforced 63 constraints to form the layout of our ELF model shown in Figure 3.4. By analyzing the ELF header file and ELF manual page, DASE automatically extracts 60 values for 16 constraints regarding array index-value pairs, and 312 values for 20 constraints regarding valid field values. For example, the constraint "assume(Elf32_Shdr→e_type == 0 | Elf32_Shdr→e_type == 1);" is one constraint with two values (0 and 1). In the case where a constraint exists in both the header file and the manual page, DASE combines all the values from both documents and creates a single new constraint with all the merged values. The ELF header file constraints are a superset of the manual page constraints except for two constraints, which specify valid values for the EI_VERSION indices of the e_ident arrays. The accuracy of the extracted constraints is 97.8%.

The breakdown of the constraints extracted from the header file and the manual page is as follows. From the manual page, DASE automatically extracts 46 values for 16 constraints regarding array index-value pairs, and 72 values for 8 constraints regarding valid field values. The accuracy is 100%. From the header file, DASE extracted 56 values for 14 constraints regarding array index-value pairs, and 312 values for 20 constraints regarding valid field values. Among the 312 values for 20 constraints regarding valid field values, 8 values are invalid, which affect 8 constraints. The accuracy is 97.8%. The inaccuracy results from a special kind of macro, *_NUM, in elf.h. This macro represents the total number of valid values, which is not itself a valid value. Among all the constraints, 10 constraints (consisting of 56 values) on special section types are not used because they are not applicable to our model. We can incorporate them when we improve our ELF model in the future.

DASE also extracted constraints from Tar's and COFF's header files. It extracted 23 values for 2 constraints for the Tar file format, and 18 values for 2 constraints for the COFF file format. All of the extracted constraints are correct.

Impact of Incorrect Constraints. To understand the impact of incorrect constraints, we ran DASE with only the correct constraints (our main evaluation applies all constraints

to minimize manual effort). The coverage and bug-finding results are almost identical, suggesting that DASE is robust when a few incorrect constraints are provided.

Potential Effort Savings. Automated constraint extraction is important yet challenging [146, 147], and much work has been proposed to infer constraints from source code and execution traces automatically [157, 158, 159]. The proposed automated constraint extraction technique (which takes 10–60 seconds to run) can save the effort of manually writing constraints. It is beneficial to automate the constraint extraction process to keep the constraints up to date, since the ELF, Tar and COFF file formats have all had many revisions since their standardization. DASE extracted almost all constraints in the header files and manual pages. This can be expanded by analyzing more comprehensive file specifications such as the ELF specification [153]. In the future, we would like to extend the proposed NLP techniques to analyze other formats, e.g., TCP/IP packets and the XML format.

3.7 Threats to Validity

We discuss the different threats to the validity of our experiments.

3.7.1 Internal Validity

Internal validity concerns the extent to which one can claim a cause-and-effect relationship between the variables in the experimental design. Our evaluation currently does not distinguish between the contributions of the two categories of input constraints. Therefore, it is not possible to have a deeper understanding of the impact of these two variables on the results.

3.7.2 External Validity

External validity concerns whether the conclusions can be generalized outside of the study. The evaluation currently only includes five software suites. Although Binutils and elftoolchain are included to test the ELF file format, additional software suites should be included to ensure the results generalize to other file formats.

While the natural language processing techniques are effective on the three evaluated file formats, the techniques may not generalize to other types of documentation. New grammar rules may be needed to support new types of sentence structures. In addition, the accuracy and quality of the extracted constraints depend on the quality of the documentation.

3.7.3 Construct Validity

Construct validity concerns whether a test measures the intended construct of the study. The evaluation measures how deeply the symbolic execution tool explored based on the number of instructions that are covered by the generated test cases. There may be other, more accurate ways to measure exploration depth.

3.7.4 Conclusion Validity

Conclusion validity concerns whether the conclusions reached about the relationships in the data are reasonable. The evaluation concluded that there are potential effort savings from the automated constraint extraction tool. Although the constraint extraction process is automated, we still have to define the file layout constraints, which requires manual effort in reading and understanding the ELF specification. A more controlled study may be needed to factor in the amount of time that it takes to learn a file format.

3.8 Summary

This work presents Document-Assisted Symbolic Execution (DASE)—a novel and general approach that automatically extracts input constraints from documents to improve symbolic execution for automated bug detection and test generation. DASE prunes and flattens paths based on their semantic importance to help search strategies prioritize execution paths more effectively. On 88 mature programs, DASE detected 12 previously unknown bugs that KLEE failed to detect, 6 of which have been confirmed by the developers. Compared to KLEE, DASE increases line coverage, branch coverage, and call coverage by 6.0–21.1 pp, 1.6–18.9 pp, and 2.8–20.1 pp, respectively. In the future, it would be promising to negate the input constraints to focus on testing error-handling code.

Chapter 4

Documentation Analysis: Automatic File Repair

4.1 Motivation

Corrupted files cause many software failures, which impacts software reliability. Specifically, file corruption causes software crashes [15], security issues [16, 17, 18, 19, 20], users being unable to access valuable data, etc. For example, a corrupted PDF file triggered a null pointer bug in the Google Chromium browser, which caused the browser to crash [15]. Files often become corrupted due to problems such as file system bugs [21, 22], network transfer errors [23], malicious modifications of a file [24], and buggy software [25]. It is important to repair corrupted files to mitigate the resulting software failures. We propose to utilize documentation constraints that are extracted from the user manual to repair corrupted PDF files. The repair of corrupted files is challenging [26, 27]. File viewers often fail to display corrupted files because corrupted files do not follow the PDF specification and cannot always be parsed by the viewer. Thus, practical techniques to repair corrupted files are in high demand. Designing a repair tool requires a good understanding of the corrupted file problem. It is important to understand the number of real-world corrupted files in the wild; the causes of file corruption (e.g., corrupted base structure and corrupted font); the impact of file corruption (e.g., incorrect functionality, crash, and performance degradation); and whether existing techniques can repair them. All these questions need to be investigated to design a solution for repairing corrupted PDF files.

There is neither a comprehensive study to quantify the existence of corrupted files in the real world, nor a study to understand their impact. There is some anecdotal evidence of the existence of corrupted files in the real world [26], and previous work analyzed the impact of data corruption in storage systems [28, 29, 30]. Force Open [27] generated a large corpus of corrupted PDF files using a testing technique called fuzzing [160, 145, 161, 162] to test their PDF file repair technique. However, the fuzzing technique generates corrupted files by overwriting a small number of bytes with random data until the file cannot be opened by a PDF reader, pdftotext [163], which may not be representative of real-world corrupted files. In general, there are two types of corrupted files: (1) Real-world corrupted files: these documents became corrupted unintentionally, e.g., due to network transfer errors or bugs in file creators or file processing software. These documents contain valuable content. (2) Purposely corrupted files: these documents were corrupted on purpose, e.g., through malicious modification to expose bugs and security vulnerabilities in a file viewer. Typically, users do not care about the content of these documents. It is beneficial to repair both types of documents to avoid failures of file viewer and file creator software, including crashes and security vulnerabilities. For real-world corrupted files, file repair has the additional benefit of allowing users to view valuable content in the file such as medical prescriptions, tax documents, etc. While we study and repair both types of files in this work, our focus is on real-world corrupted files as they have not been studied before and contain important content to recover. Specifically, we conduct a case study to understand and repair corrupted PDF files, since the PDF file format is one of the most popular electronic file formats [164, 165]. First, we collect the first dataset of real-world corrupted PDF files, understand the causes and impact of corrupted PDF files, and study the repair capabilities of existing PDF repair tools. Second, based on the findings from the empirical study, we present a white-box approach, DocRepair, for repairing corrupted PDF files. A fundamental problem of existing file repair tools is that they only use hardcoded heuristics to validate the correctness of a corrupted file. For example, Listing 4.1 shows the corrupted part of a PDF file that causes the Google Chrome browser to crash [15], where the file cannot be repaired by existing tools. The corrupted value, -406081386, on line 4 refers to an object number in the PDF file format. The PDF file format specification [166] specifies that the value is an indirect reference and that it must be a positive integer object number. The corrupted value, -406081386, is not a valid object number that exists inside the PDF file, which eventually leads to a null pointer crash in the PDF viewer because the Chrome browser parses the PDF file without validating the data field's correctness.

1 trailer
2 <<
3 /Size 38
4 /Root -406081386 1631658500 R   <- invalid
5 /Prev 33749
6 >>
Listing 4.1: Corrupted part of a PDF file causing a crash in Chromium bug #134551 [15].

There are several common PDF repair tools for repairing corrupted PDF files, such as Mutool [73], PDFtk [74], and GhostScript [78]. A repair tool accepts a corrupted PDF file as the input and produces a repaired file as the output. A corrupted file is defined to be successfully repaired by the repair tool if a PDF viewer can open and display the content of the repaired file. Existing PDF repair tools [73, 74, 78] failed to recover the content of the file because none of them check whether the corrupted data field contains an object number that points to a valid object in the PDF file. Learning from our study of real-world corrupted PDF files, we designed DocRepair to mitigate the crash and recover the content of the file with a set of new repair operators. The repair operators detect missing and invalid objects for specific components within the PDF file, which is done by analyzing the references between objects (i.e., object numbers) and the validity of the data structure (i.e., values). If an issue is detected on a specific component of the file, the repair operator attempts to recover the object and data structure, which enables PDF viewers to open the repaired file and display the file's content successfully after the repair. Given the corrupted file described in Listing 4.1, which triggers a bug in Chrome, we applied three existing repair tools (Mutool [73], PDFtk [74], and GhostScript [78]) and DocRepair to attempt to repair it. Figure 4.1 shows screenshots of the PDF files when opened with the buggy version of Chrome. Figure 4.1a shows the original corrupted file from the bug report, and Figure 4.1b, Figure 4.1c, and Figure 4.1d show the repaired versions of the corrupted file produced by Mutool, GhostScript, and DocRepair, respectively. Mutool generated a repaired file that triggers an error message, "Failed to load PDF document," in Chrome (Figure 4.1b). PDFtk failed to generate a repaired file. GhostScript generated an empty PDF file that contains no content (Figure 4.1c). Although existing tools can prevent the crash of the Chrome tab, DocRepair generated a better repair by recovering the valuable content (Figure 4.1d).

After the developers fixed the bug in Chrome, the corrupted file no longer crashes Chrome, but DocRepair is still the only tool that produces a repaired file that can be correctly displayed. This work answers four research questions (RQ), where RQ1–RQ3 are part of our empirical study of real-world corrupted PDF files, and RQ4 evaluates the effectiveness of DocRepair against existing work.

RQ1: How many real-world corrupted PDF files exist and what is their impact?

Corrupted PDF files can have a negative impact on end users, such as program crashes and security vulnerabilities. Therefore, we want to understand (1) how many real-world corrupted files exist in the wild, (2) whether these files pose any security threats to the end user, and (3) whether these files have an impact on the end user and the target software.

RQ2: Can existing repair tools repair corrupted PDF files?

The question attempts to understand the repair capability of existing file recovery tools. The demand for a new repair tool exists if many corrupted PDF files cannot be repaired by any existing tools.

RQ3: What types of real-world file corruption exist?

The question identifies the causes of corrupted files such that we can incorporate the common scenarios into the proposed repair algorithm.

RQ4: How does DocRepair compare to existing file recovery tools?

We evaluate the ability of our new repair technique, DocRepair, at repairing both real-world and purposely corrupted files that existing tools cannot repair. This work makes the following contributions:

• We created the first dataset of 119 real-world corrupted PDF files. We complement the dataset with 198 purposely corrupted files that trigger bugs and vulnerabilities in PDF readers. This dataset will help researchers to better understand real-world file corruption and practitioners to further develop PDF repair tools.

Figure 4.1: Corrupted and repaired versions of the corrupted file opened on Chrome 21.0.1180.89 (Chromium bug #134551). The corrupted version of the file crashes the browser tab. (a) Corrupted file that crashes the Chrome tab. (b) File repaired by Mutool; no content is recovered. (c) GhostScript outputs an empty page. (d) File successfully repaired by DocRepair.

Figure 4.2: PDF body's document tree structure where each node represents an object. The document catalog is the root of the tree; its children include the interactive form, the outline hierarchy, and the pages tree. The outline hierarchy contains outline entries, the pages tree contains page objects, and each page references objects such as a content stream, a thumbnail image, and annotations.

• We perform an empirical study on the repair capability of existing repair tools and the common causes of PDF file corruption. Our results show that 63 of the real-world corrupted files and 105 of the purposely corrupted files cannot be repaired by existing tools. A manual inspection of real-world corrupted files revealed that the most common causes of PDF file corruption include corruption in base structure, data stream, font resource, page tree structure, and data field value.

• We propose DocRepair, a new automatic file repair technique which uses a two-stage algorithm that contains seven repair operators. DocRepair repaired a total of 30 of the 119 real-world corrupted PDF files, 7 of which cannot be repaired by existing tools. We also propose to combine multiple repair tools as DocRepair+ and show that the combined tool can repair 63 of the 119 real-world corrupted files. Existing repair tools including Mutool, PDFtk, and GhostScript repaired a total of 38, 16 and 45 of the 119 real-world corrupted files respectively.

4.2 Background

4.2.1 PDF File Format

PDF is a file format that was created to share documents across platforms. It became an open ISO standard, ISO 32000-1, in 2008 [167]. The PDF file format contains four major components: a header, a body, a cross-reference table, and a file trailer.

• The header consists of the keyword, %PDF- followed by a version number between 1.0 and 1.7, and at least four high bit ASCII characters.

trailer
<< ... >>
====
trailer
<< ... >>
Listing 4.2: An incorrect repair generated by Mutool (top) and the correct repair generated by DocRepair (bottom).

• The body contains a list of objects that represent the contents of the file. Figure 4.2 shows the body as a document tree structure, where each node represents an object. For example, the pages tree object references a list of page objects, and each page object has a reference to a content stream object that stores the text and layout of the page.

• The cross-reference table contains a list of pointers (positional offsets) that point to the location of each object in the file.

• The file trailer is the entry point for parsing a PDF document, which allows a reader to locate the cross-reference table and the key objects to parse the PDF file.
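To make the role of the file trailer concrete, the following Python sketch (illustrative only, not part of DocRepair or the study tooling) shows how a reader typically starts parsing: it reads the tail of the file, locates the startxref keyword, and recovers the byte offset of the cross-reference table. The function name and the tail-size heuristic are assumptions made for this example.

def find_xref_offset(path, tail_size=2048):
    # Read only the tail of the file; in a well-formed PDF the trailer and the
    # 'startxref' keyword sit at the end.
    with open(path, "rb") as f:
        f.seek(0, 2)                      # jump to the end of the file
        f.seek(max(0, f.tell() - tail_size))
        tail = f.read()
    pos = tail.rfind(b"startxref")
    if pos == -1:
        return None                       # no entry point: the file may be corrupted
    token = tail[pos + len(b"startxref"):].split()
    # The byte offset of the cross-reference table follows the keyword.
    return int(token[0]) if token and token[0].isdigit() else None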

4.2.2 Existing Repair Approaches

Existing tools, including Mutool [73], PDFtk [74], and GhostScript [78], repair PDF files through a rewriting process. They parse the corrupted file into memory (if the corrupted bytes do not cause a cascading parsing failure) and write the parsed content back out with a new syntax as the repaired file. This approach allows existing repair tools to detect and discard corruption that affects the syntax of the file, and to detect missing components within the PDF file structure (Section 4.2.1) for reconstruction. However, existing tools do not perform deep validation on the parsed data values. For example, in the motivating example (Listing 4.1), Mutool failed to detect the file corruption because it only checks if the file trailer is syntactically correct. Since the file trailer is syntactically correct, Mutool copied the semantically incorrect values to the output PDF file that is supposed to be repaired. DocRepair is the only repair tool that is able to detect the incorrect value and infer the correct value to repair the file (Listing 4.2).

49 4.3 Definitions

In this work, a file is corrupted if it does not conform to the standard file format specification. Existing repair tools attempt to detect violations of the file specification and repair them using repair operators (RO). A repair operator's purpose is to address a specific issue in the PDF file. For example, most repair tools deploy repair operators for a malformed file header, file trailer, and cross-reference table (Section 4.2). For any corrupted file where the ground truth (correct file) is available, we consider a file fully repaired if the file generated by a repair tool is identical to the correct file. Since the corresponding correct files for most real-world corrupted PDF files are unavailable, following prior work on PDF file inconsistency detection [168] and black-box file repair [27], we define a file as repaired if no more than one PDF viewer fails to display some content from the PDF file. For example, if only one viewer fails to display the contents of a PDF file, but all other PDF viewers (six in this study) display it successfully, we consider the PDF not corrupted; this accommodates cases where the one viewer contains a bug, which is common as shown in prior work [168].
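As a concrete illustration of this definition, the sketch below (hypothetical code, not part of the actual study infrastructure) encodes the criterion as a predicate over per-viewer display results.

def is_repaired(viewer_results, max_failures=1):
    # A file counts as repaired if at most one viewer fails to display it.
    failures = sum(1 for displayed in viewer_results.values() if not displayed)
    return failures <= max_failures

# Example: six of the seven viewers display the file, so it is considered repaired.
print(is_repaired({"evince": True, "xpdf": True, "mupdf": True, "chromium": True,
                   "firefox": True, "acrobat": False, "qpdfview": True}))  # True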

4.4 A Study of Corrupted PDF Files

While the problem of real-world corrupted PDF files is known, there has only been anecdotal evidence of its existence [26]. Therefore, a study to better understand the extent of the problem and the most common causes and impact of file corruption is needed. The purpose of the study is threefold: (1) obtaining empirical evidence of how many documents in the wild are corrupted and their impact (RQ1), (2) checking if existing repair tools can repair real-world corrupted PDF files (RQ2), and (3) understanding the causes of file corruption (RQ3). We start the empirical study by collecting potentially corrupted PDF files from multiple bug tracking systems (Section 4.4.1). Since bug tracking systems contain a large number of files, many of which are not corrupted, we develop an automated tool, CorruptCheck, to identify corrupted PDF files automatically (Section 4.4.2), which are then manually inspected. We present the PDF repair tools and viewers used by CorruptCheck (Section 4.4.3) and show the findings from our empirical study (Section 4.4.4).

Figure 4.3: Overview of the empirical study and evaluation. PDF files come from two sources: (1) PDF files crawled from bug trackers and (2) the Force Open fuzzed PDF corpus. The files first pass through SanityCheck to remove invalid files, then through CorruptCheck (using the PDF viewers) to identify corrupted files; the corrupted files are processed by the PDF recovery tools, and the repaired files pass through CorruptCheck again to analyze the repair capability, producing the study findings (RQ1–RQ4).

4.4.1 Collecting Corrupted PDF Files

Real-World Corrupted PDF Files. To find real-world corrupted PDF files, we crawl files from existing software bug tracking systems. We focus on bug trackers because they often have corrupted files attached to bug reports to help reproduce bugs. We developed a web scraper using a web-crawling framework called Scrapy [169] to extract PDF files from bug report attachments. We also attempted to use an existing government document dataset crawled by previous work [170], but the previous investigation seems to indicate that it does not contain corrupted PDF files [168]. Therefore, we do not use the government document dataset in the study. The web scraper returns a list of PDF files, which we later manually inspect in the empirical study to identify the real-world corrupted PDF files. The study collected PDF files from eight bug trackers including GNOME, Mozilla, Chromium, LaunchPad, GhostScript, KDE, Apache, and Freedesktop (Table 4.2). Although we only collect PDF files from eight bug trackers, these trackers cover a large number of software applications that process PDF files. We selected these bug tracking systems based on two criteria. (1) We include all the evaluated PDF viewers' bug trackers whenever available (the Launchpad tracker for Qpdfview and other evaluated viewers; the Chromium tracker for the Chromium viewer; and the Mozilla tracker for the Firefox viewer). (2) We include bug trackers that contain software that either generates or handles PDF files (the GNOME tracker contains Documents [171]; the GhostScript tracker handles PDF files [78]; the KDE tracker contains [172]; the Apache tracker contains FOP [173]; the Freedesktop tracker contains LibreOffice [174], pdftohtml [175], and Poppler [77]). Our scraper supports three tracking systems: Bugzilla [176], Monorail [177], and LaunchPad [178]. For Bugzilla, we searched for the keyword '' and required an attachment of type pdf, zip, or tar. For LaunchPad, since it does not support searching for bug reports that contain attachments, we simply searched for the keyword 'pdf.' For Monorail, since it does not support searching for specific attachment types, we searched for bug reports that contain at least one attachment file.

The scraper downloads all the PDF attachments from the search results. If the result contains zip or tar compressed files, it decompresses the files and searches for PDF files.
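The sketch below gives a flavor of such a crawler. It is a hypothetical Scrapy spider, not the thesis's actual scraper: the tracker URL, the CSS selectors, and the spider name are placeholder assumptions, and a real Bugzilla, Monorail, or LaunchPad instance would need its own query parameters and attachment handling.

import scrapy

class BugTrackerPdfSpider(scrapy.Spider):
    # Hypothetical spider: crawl a Bugzilla-style search result page and collect
    # attachment URLs that may point to PDF, zip, or tar files.
    name = "bugtracker_pdf"
    start_urls = ["https://bugzilla.example.org/buglist.cgi?quicksearch=pdf"]

    def parse(self, response):
        # Follow each bug report listed in the search results.
        for href in response.css("a[href*='show_bug.cgi']::attr(href)").getall():
            yield response.follow(href, callback=self.parse_bug)

    def parse_bug(self, response):
        # Record attachment links; a later step downloads them and unpacks
        # zip/tar archives to look for PDF files inside.
        for href in response.css("a[href*='attachment.cgi']::attr(href)").getall():
            yield {"bug_url": response.url,
                   "attachment_url": response.urljoin(href)}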

Fuzzed Corrupted PDF Files. Previous work, Force Open [27], released a dataset of corrupted PDF files that are generated using fuzzing. It contains 1,723 fuzzed corrupted PDF files consisting of research papers and PDFs with randomly modified bytes.

4.4.2 Identifying Corrupted PDF Files

The empirical study requires an automated tool to identify corrupted PDF files among those crawled from the bug trackers because (1) there are too many crawled files for manual inspection (6,903 in Table 4.2), and (2) not all the collected files are corrupted. Although there exist tools such as Acrobat Pro DC [179] that can validate PDF files, (1) it is a proprietary GUI tool that does not support the automation needed to scale to validating thousands of files, and (2) previous work [168] has shown that Acrobat Reader contains bugs in its implementation of the PDF specification. Therefore, the automated tool uses seven popular PDF viewers to decide if a PDF file is corrupted by opening the files automatically. Figure 4.3 shows an overview of the empirical study. The tool first filters out invalid files, since not all crawled files from the bug trackers are valid, displayable PDF files (SanityCheck in Figure 4.3). The tool (1) removes password-protected PDF files by detecting them using Mutool's info utility [73], (2) removes any files that are smaller than 300 bytes (we chose a low threshold to be conservative, since even a minimal "Hello World" PDF file contains 739 bytes [180]), and (3) removes files that do not start with the magic number, %PDF, in the header. We then identify corrupted PDF files by applying screenshot analysis on the PDF viewers (CorruptCheck in Figure 4.3). A PDF file is considered corrupted if more than one PDF viewer fails to open the PDF file. We select this criterion since previous work has shown cases where one viewer contains a bug [168]. A PDF viewer is considered to have failed to open the PDF file if it displays a blank screen based on histogram analysis, where the screen only contains pixels of a single color. The primary challenge in the screenshot analysis is that corrupted files often cause unexpected behavior in the PDF viewer (e.g., crashes and error messages). We utilize optical character recognition (OCR) to detect and close on-screen messages (e.g., informational and error messages) before the screenshot analysis because they interfere with the histogram analysis when identifying content on the screen.

Table 4.1: List of evaluated repair tools and PDF viewers.

Repair Tool             Version        Build Date       PDF Library
mutool [73]             1.12.0         December 2017    Built-in
PDFtk [74]              2.02-4         July 2013        Built-in
GhostScript [181]       9.18           September 2015   Built-in

PDF Viewer              Version        Build Date       PDF Library
Evince [182]            3.18.2         July 2017        GhostScript
Xpdf [183]              3.04           May 2014         Poppler [77]
MuPDF [73]              1.12.0         December 2017    Built-in
Chromium [184]          65.0.3325.18   March 2018       PDFium [185]
Firefox [186]           59.0.2         March 2018       PDF.js [187]
Acrobat Reader DC [188] 2018.011       2018             Built-in (Win. 10)
Qpdfview [189]          0.4.17         December 2016    Poppler [77]

The reason to utilize OCR for detecting error messages, as opposed to image comparison, is that OCR is more reliable at detecting error messages from the user interface. For example, image comparison would fail to recognize a message box if the error message text differs between PDF files. OCR also allows us to adapt quickly to different screen resolutions if needed.
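The following Python sketch summarizes the SanityCheck filters described above. It is illustrative rather than the actual implementation; in particular, the way the output of Mutool's info utility is interpreted here is an assumption.

import subprocess

def sanity_check(path):
    # Filters (2) and (3): reject tiny files and files without the %PDF magic number.
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 300 or not data.startswith(b"%PDF"):
        return False
    # Filter (1): use `mutool info` to detect password-protected (encrypted) files.
    # The string matched below is an assumption about the utility's output format.
    info = subprocess.run(["mutool", "info", path],
                          capture_output=True, text=True).stdout
    if "Encryption" in info and "None" not in info:
        return False
    return True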

4.4.3 PDF Repair Tools and Viewers

Since the goal of the study is to understand the impact of corrupted PDF files and the repair capability of existing repair tools, we include three common PDF repair tools (Mutool, PDFtk, and GhostScript) and seven PDF viewers (Table 4.1). All software executes on the latest available version on two computers installed with Ubuntu 16.04 LTS and Windows 10 (Acrobat Reader DC). Both computers contain an Intel i5-2400 CPU and 6GB of RAM.

4.4.4 Empirical Study Findings

RQ1: How many real-world corrupted PDF files exist and what is their impact?

Procedure (RQ1): Corrupted PDF files can have a negative impact on the end user such as program crashes and security vulnerabilities. Therefore, we want to understand

(1) how many real-world corrupted files exist, (2) whether the real-world corrupted files cause severe consequences including security vulnerabilities, and (3) whether the real-world corrupted files have an impact on the end user and the target software. To locate real-world corrupted PDF files, we apply screenshot analysis (Section 4.4.2) over the 6,903 crawled files from the existing bug trackers (Table 4.2) and automatically identified a list of corrupted PDF files. Since the automated tool is not perfect, we manually inspected the 396 automatically flagged files. Our manual inspection identified and removed 77 false positives, resulting in a total of 319 corrupted files. There are two primary sources of false positives: (1) cases where the tool failed to take a screenshot using macros, and (2) crawled files that do not contain any content (e.g., a PDF with no text or images, which is impossible to detect automatically). We perform three classifications over the 319 corrupted files (excluding the false positives) to answer RQ1. The first two labels are assigned independently by two graduate students, and disagreements are discussed until they are resolved; the third label is assigned by one graduate student. The first classification labels PDF files as either real-world corrupted or purposely corrupted. The goal is to obtain the first set of real-world corrupted PDF files for further study in RQ3 of this work. For example, a PDF file that is modified manually by hand or modified automatically by fuzzing tools to expose a bug in a viewer would be considered a purposely corrupted file, but a PDF file that is generated by a real-world application (e.g., Microsoft Word) and later corrupted when sent as an email attachment is considered a real-world corrupted file. The labeling process involves manually inspecting the bug report and the corrupted file. The second classification labels whether the PDF file triggers security vulnerabilities in the target software, such as PDF applications. For example, a PDF file that triggers an integer overflow vulnerability in a specific PDF viewer or library is labeled as a file that triggers security vulnerabilities. The labeling process involves manually checking if the bug report is tagged as a security bug and if the developer comments indicate that the PDF file contains security vulnerabilities. The third classification labels the types of impact that are caused by the corrupted PDF file on the end user and target software. The classification is based on the labels from a previous work that studied bug characteristics [1]. The label list includes program hang, program crash, data corruption, performance degradation, incorrect functionality, and others. The labeling process involves manually inspecting the bug report and the corrupted file. Results (RQ1): Table 4.2 shows the number of real-world corrupted PDF files and their classifications. We found a total of 319 corrupted PDF files, 119 of which (37.9%)

Table 4.2: RQ1—A breakdown of the number of PDF files that are crawled from the bug trackers ('Crawled'), the number of remaining files after applying SanityCheck ('Filtered'), the number of automatically detected corrupted files ('Corr.'), the number of manually identified real-world corrupted files ('Real.'), and the number of manually identified files that cause security vulnerabilities ('Sec.').

Bug Repo.     Crawled   Filtered   Corr.   Real.   Sec.
GNOME             439        437      21      18      2
Mozilla           157        157       2       2      0
Chromium        2,306      2,276     205      25    117
LaunchPad       2,411      2,374      52      48      0
GhostScript        63         62       6       5      0
KDE               332        323       2       2      0
Apache            256        256       0       0      0
Freedesktop       939        930      31      19      0
Total           6,903      6,815     319     119    119

are real-world corrupted files, while the remaining 198 are purposely corrupted files that cause PDF applications to fail. Real-world corrupted files often come from bad PDF file generators, such as cups-pdf [190], that generate files that do not conform to the specification. The results suggest that there is a significant amount of real-world corrupted PDF files whose content is vital to the user, which motivates the need to repair the file content. Also, these files have not been systematically studied before, which motivates the need to study them. Amongst these 319 corrupted files, 119 files (30.8%) expose security vulnerabilities in PDF applications. It is still important to repair files with vulnerabilities because they can cause severe consequences to the user, and repairing them can help developers localize faults (Section 4.1). Amongst the 119 corrupted files that can trigger security vulnerabilities, 116 (99.1%) are purposely corrupted files. The one real-world corrupted file that triggers a security vulnerability triggers an out-of-bounds write error in Chromium's built-in PDF reader (bug #124182 [191]). The bug report contains a reference in the security vulnerability database (CVE-2011-3097), which is confirmed to be a security bug and fixed by the developers. The classification of the impact of corrupted PDF files is shown in Table 4.3. Corrupted files commonly cause incorrect functionality in the target software that is unexpected to the user (94 instances).

Table 4.3: RQ1—Classification of the impact of all 119 real-world corrupted PDF files on the end user and target software based on previous work's classification labels [1].

Incorrect Functionality   Crash   Performance Degradation   Hang
                     94      16                         6      3

For example, the target software may not be able to display the content of the corrupted file or may have issues processing the corrupted file. Other common issues include software crashes (18 instances), performance degradation (6 instances), and software hangs (3 instances). Findings (RQ1): We identified a total of 119 real-world corrupted files among the 319 corrupted files, and identified 119 files that trigger security vulnerabilities among the 319 corrupted files. The PDF file corruption problem is a real problem that often impacts the user with incorrect functionality or even software crashes.

RQ2: Can existing repair tools repair corrupted PDF files?

Procedure (RQ2): To check if corrupted PDF files can be repaired by existing tools, we apply existing repair tools (Mutool, PDFtk, and GhostScript) on 1,827 corrupted files (Table 4.2), consisting of both corrupted PDF files from bug trackers (319 files) and fuzzed PDF files from Force Open's fuzzed dataset (1,508 files). We include fuzzed PDF files in the evaluation because fuzzing simulates file system errors in the real world, which can corrupt the raw bytes of a file. The 1,508 files from Force Open's fuzzed dataset are obtained by applying CorruptCheck on the Force Open dataset of 1,723 files. For RQ2, we study both real-world corrupted PDF files and purposely corrupted PDF files because it is beneficial to repair both types, as discussed in the introduction. We apply existing repair tools to repair the corrupted files, and then verify if the corrupted files have been repaired successfully by the repair tools (Figure 4.3). Results (RQ2): Table 4.4 presents the repair result: 1,267 of the 1,827 corrupted files cannot be repaired by any existing repair tool (Mutool, PDFtk, and GhostScript). Specifically, 231 of the 319 corrupted files from the bug trackers cannot be repaired by any existing repair tool, while 1,036 of the 1,508 corrupted fuzzed files cannot be repaired by any existing repair tool. As discussed in RQ1, the 319 corrupted files contain 119 real-world corrupted files and 119 files that cause security vulnerabilities, respectively.

Table 4.4: RQ2—Number of corrupted files ('Corr.') that existing tools (Mutool, PDFtk, and GhostScript) cannot repair ('None Rep.'). We also show the number of files, amongst the ones that no tool can repair, that are real-world corrupted ('None Rep. Real.') and that trigger security vulnerabilities in the target software ('None Rep. Sec.').

Set   Source         Corr.   None Rep.   None Rep. Real.   None Rep. Sec.
1     GNOME             21          12                 9                2
      Mozilla            2           2                 2                0
      Chromium         205         162                11              103
      LaunchPad         52          32                29                0
      GhostScript        6           5                 4                0
      KDE                2           0                 0                0
      Apache             0           0                 0                0
      Freedesktop       31          18                 8                0
      Set 1 Total      319         231                63              105
2     Force Open     1,508       1,036                 0              n/a
      TOTAL          1,827       1,267                63              105

Table 4.4 shows that amongst the 119 real-world corrupted files, 63 cannot be repaired by existing tools, and amongst the 119 files that cause security vulnerabilities, 105 cannot be repaired by existing tools. Findings (RQ2): There is a need for a better repair technique since 1,267 of the 1,827 corrupted files cannot be repaired by any existing repair tool (Mutool, PDFtk, and GhostScript).

RQ3: What types of real-world file corruption exist?

Procedure (RQ3): To study the impact of file corruption, we perform a classification over the 119 real-world corrupted files to discover the causes of corruption. The classification requires inspecting the raw bytes of a PDF file, the file's associated bug report, and the output of PDF viewers. The initial step discovers the full list of topics (causes of corruption), which is used for open card sorting [192]. Open card sorting is a technique for organizing items into categories. The purpose is to identify high-level causes of corruption from detailed corruption causes. We follow the open sorting process digitally using xSort [193]. We first enter a unique list of corruption causes into the xSort software for sorting.

Table 4.5: RQ3—Causes of file corruption ('Type Corr.') for the 119 real-world corrupted files. The table shows the number of files for each cause of corruption ('T.'); the number of files that cannot be repaired by existing repair tools including Mutool, PDFtk, and GhostScript ('F.'); the number of files that can be repaired by the proposed technique, DocRepair, which is described in Section 4.5 ('R.'); and the corresponding repair operator that is implemented in DocRepair to address the different causes of corruption ('RO').

Type Corr.                T.    F.    R.   RO
Base Structure            14     9    10   RO1, RO3, RO5
Corrupted Data Stream     10     4     0
Font Resource              9     2     0   RO3
Page Tree Structure        6     3     2   RO4
Bad Data Field Value       5     3     1   RO3–RO6
File Trailer               5     1     3   RO6
Image                      4     3     0
Form                       4     0     3
Colorspace                 4     1     0
Cross-Reference Table      3     1     1   RO7
File Encryption            3     2     0
Graphics Gradient          3     3     0
Embedded Javascript        1     1     0
Unknown                   48    30     6

The presentation of the cards is randomized by xSort, and we create subcategories during the sorting process. The goal is to sort the topics into categories and assign a name that best describes the content of each category. A total of 119 topics (causes of corruption) are discovered before the initial classification of the 119 real-world corrupted files. Since multiple corrupted files may have the same topic, we perform string matching to obtain a list of unique topics for the open card sorting process. The unique list of topics is entered into the card sorting software, and the user sorts the topics into categories (Table 4.5). Results (RQ3): Table 4.5 shows the causes of file corruption for the 119 real-world corrupted files. We explain the identified types of file corruption below. 'Base structure' contains files that are corrupted due to file transfer or file-system bugs, where a large chunk of the PDF file is missing, thus destroying the base structure of the file. For example, the PDF objects (nodes in Figure 4.2) can be missing from the file, which can cause parsing issues.

7 0 obj
<< ... >>
endobj
Listing 4.3: Example of an invalid font object where the embedded font cannot be loaded by PDF viewers [194].

Therefore, it is important for repair tools to take into consideration the situation where multiple objects are missing during a repair. 'Corrupted data stream' contains files whose compressed content (e.g., encoded with encoders such as Flate, LZW, and ASCIIHex) has been corrupted, which causes the encoded content to be unrecoverable. However, it may be possible to recover other parts of the PDF file if the corrupted encoded data does not impact too many parts of the file. It may not be advisable to encode all data into a single data stream, since a single byte corruption in the encoded data can potentially corrupt the entire data stream. 'Font resource' contains files with corrupted font resources, which are required to display the text. When a file does not embed the needed font, PDF viewers can have issues displaying the content. It is also the reason that other PDF format extensions, such as PDF/A, require all fonts to be embedded in a PDF file for long-term preservation purposes. An invalid font component of a PDF file generated by GTK+ is shown in Listing 4.3, where the PDF file references a font that is not loadable by Evince and other viewers [194]. The font (DejaVuSans) is invalid because the viewer cannot load the embedded font. There are also other cases where a font is not embedded in the PDF file. For example, LibreOffice [174] generated a corrupted PDF file that requests a font that is not embedded in the PDF file [195]. 'Page tree structure' contains files with an invalid page tree structure. The page tree structure is responsible for organizing the display of pages within the PDF file. Common real-world corruption scenarios include cases where a part of the page tree of a PDF file is missing, causing PDF viewers to fail to open the file [196, 197]. Listing 4.4 contains an example of a corrupted page tree [198], where the page tree structure is missing an important object (8 0 R) called the 'Pages' object.

1  11 0 obj
2  << ... >>
6  endobj
7
8  12 0 obj
9  << ... >>
16 endobj
Listing 4.4: Example of a corrupted page tree structure where the 'Pages' object (object number 8, which is referenced on line 4) is missing [198].

The 'Pages' object is responsible for letting PDF viewers know the full list of available displayable pages in the PDF file (e.g., object number 12 at the bottom of the listing). 'Bad data field value' contains files with specific issues in the key-value pairs within an object. For example, Launchpad's Evince PDF viewer fails to print a PDF file due to an invalid value [199] for the key Encoding in Listing 4.5. Based on the PDF documentation [166], the value should be a name object that starts with the character '/'. For example, '/WinAnsiEncoding' is a potentially valid value. 'File trailer' contains files with invalid file trailers. An invalid file trailer is shown in Listing 4.1, where it contains a reference to an invalid object number (-406081386). The value should be a valid object number that exists in the PDF file. There are also cases where a file trailer is missing from the file entirely due to bad applications. For example, the scanner application Simple Scan [200] generates corrupted PDF files that do not contain a file trailer at all [201, 202]. 'Image' contains files with embedded images and graphics that are invalid. The issue can be caused by corrupted binary pixel data, the color profile of the image data, or the image encoding algorithm. A common issue is that the image parser can fail while parsing the input image stream. For example, Launchpad's Evince PDF viewer crashes when opening a PDF that contains an embedded image that is compressed using the JBIG2 algorithm [203].

1 0 obj
<< ... >>
Listing 4.5: Example of an invalid data field value for the key Encoding [199], where the key Encoding has an invalid value.

'Form' contains files with invalid fillable forms. The issues can relate to the inability to save the data entries and to bad text characters within the forms. 'Colorspace' contains files with an invalid colorspace, which is a color profile that defines the format of the embedded image pixel data [204]. It is important to understand that the PDF file format does not store images in a specific image file format (e.g., PNG or JPG), but instead stores images as binary pixel data whose format is defined by a color profile called a colorspace. While the 'Image' category contains PDF files with issues in image embedding, the 'Colorspace' category contains PDF files with issues that are directly caused by the color profile. 'Cross-reference table' refers to a specific part of the PDF file that is used by PDF applications to speed up loading. 'File encryption' refers to the encryption functionality that protects the content of a PDF file. 'Graphics gradient' refers to a drawing component within the PDF file format: shading patterns that define a smooth color transition from one point to another. The PDF file format defines a list of such shading patterns to allow a wide range of coloring patterns. 'Embedded Javascript' refers to Javascript code that is embedded within the PDF file, which allows for interactive content. 'Unknown' refers to corrupted files for which we cannot locate the cause of corruption. Identifying the cause of corruption is a challenging task since bug reports may not document the cause of the file corruption.

Findings (RQ3): The findings on the causes of corruption in Table 4.5 inspired us to implement seven repair operators to address them. Specifically, the proposed repair operators focus the repair on the most frequent causes of corruption, including base structure, font resource, page tree structure, bad data field value, file trailer, and cross-reference table.

4.5 Design and Implementation

DocRepair repairs PDF files in two main steps, as shown in Algorithm 1. First, it parses the corrupted file and collects data for specific objects in the file (Section 4.5.1). Second, it applies modifications to the file using a list of repair operators in a specific order (Section 4.5.2), where the repair operators are designed based on the causes of file corruption identified in the empirical study for RQ3 (Table 4.5).

4.5.1 Data Parsing and Collection

The first step in the repair algorithm (line 1 of Algorithm 1) parses a PDF file and collects data from specific objects in the file. The collected data is used in later stages by the repair operators (Section 4.5.2). The function parse() in line 1 of Algorithm 1 collects the information needed for detecting and repairing potential bugs in the PDF file. Its purpose is to collect structural information about important objects (i.e., the trailer, catalog, and page tree) and the relationships between these objects (i.e., parent and child relationships). The function also performs validation over the parsed data to ensure it is semantically correct (i.e., to ensure objects and references are valid). While several PDF parsing libraries exist, we re-implemented a parser (adapted from Mutool's PDF parsing library [73]) to overcome limitations of existing libraries. Since parsing libraries are designed to parse correctly formed PDF files, existing parsers would stop parsing and return an error if the PDF file is too corrupted. To overcome this issue, DocRepair parses one object at a time (an object can be a boolean, a number, a dictionary, a data stream, etc.) to confine the propagation of parsing errors.
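The sketch below illustrates the object-at-a-time strategy. It is a simplified stand-in, not DocRepair's parser: parse_one_object here only performs a trivial structural check, whereas the real parser (adapted from Mutool) fully decodes each object.

class ParseError(ValueError):
    """Raised when a single PDF object cannot be parsed."""

def parse_one_object(raw):
    # Trivial stand-in for the real object parser: require 'obj' ... 'endobj'.
    if b"obj" not in raw or b"endobj" not in raw:
        raise ParseError("malformed object")
    return raw

def parse_objects(objects_raw):
    # Parse each indirect object independently so that one malformed object
    # does not abort parsing of the rest of the file.
    parsed, errors = {}, []
    for obj_id, raw in objects_raw:            # (object number, raw bytes) pairs
        try:
            parsed[obj_id] = parse_one_object(raw)
        except ParseError as exc:
            errors.append((obj_id, str(exc)))  # confine the error to this object
    return parsed, errors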

4.5.2 Repair Operators

The second step of the repair algorithm (lines 2–20 of Algorithm 1) applies a list of repair operators to the PDF file. The repair operators analyze the collected data to flag bugs in the PDF file and modify the PDF file to address the flagged bugs.

Algorithm 1: Algorithm for repairing PDF files. RO1–RO7 denote the seven repair operators.
Input:  input PDF file buffer - inputFile
Output: repaired PDF file buffer - parsedPDF

 1  parsedPDF ← parse(inputFile)
    /* RO1 - Object spacing */
 2  parsedPDF ← removeBadSpacing(parsedPDF)
    /* RO2 - Header */
 3  parsedPDF ← repairHeader(parsedPDF)
    /* RO3* - Font Resource */
 4  for font ∈ parsedPDF.fontReferences do
 5      if !existObject(parsedPDF, font) then
 6          parsedPDF ← parsedPDF + genTemplate("font", font.id)
 7      end
 8  end
    /* RO4* - Page Tree Structure */
 9  if parsedPDF.pageTree.repair then
10      parsedPDF ← parsedPDF + genTemplate("page", parsedPDF.pageTree.id)
11      parsedPDF ← updateReference(parsedPDF, "page")
12  end
    /* RO5* - Catalog */
    /* Check if catalog is missing or the child (page node) is missing */
13  if parsedPDF.catalog.id = ∅ or parsedPDF.trailer.child = ∅ then
14      parsedPDF ← parsedPDF + genTemplate("catalog", parsedPDF.catalog.id)
15      parsedPDF ← updateReference(parsedPDF, "catalog")
16  end
    /* RO6 - Trailer */
17  parsedPDF ← remove(parsedPDF, "trailer")
18  parsedPDF ← parsedPDF + genTemplate("trailer", parsedPDF.catalog.id)
    /* RO7 - xRef table */
19  parsedPDF ← remove(parsedPDF, "xRef")
20  parsedPDF ← parsedPDF + genTemplate("xRef", parsedPDF)
21  return parsedPDF

Table 4.6: A list of proposed and existing repair operators (RO) and whether existing repair tools support them (* means the repair operator is unique to DocRepair). 'Repaired' shows the usage frequency of each repair operator over the 354 corrupted files that are repaired by DocRepair. 'Uniquely Repaired' shows the usage frequency over the 10 corrupted files that are uniquely repaired by DocRepair.

Repair Tool    RO1       RO2      RO3*       RO4*         RO5*      RO6       RO7
               Object    Header   Font       Page Tree    Catalog   File      Reference
               Spacing            Resource   Structure    Object    Trailer   Table
MuPDF          Yes       Yes      No         No           No        Yes       Yes
PDFtk          Yes       Yes      No         No           No        No        Yes
GhostScript    Yes       Yes      No         No           No        No        Yes
DocRepair      Yes       Yes      Yes        Yes          Yes       Yes       Yes
Repaired       92        5        12         198          218       354       354
Uniquely
Repaired       6         2        5          9            9         10        10

DocRepair includes both (1) newly proposed repair operators and (2) repair operators from existing repair tools. Table 4.6 shows a summary of the designed and implemented repair operators. Amongst the seven repair operators, DocRepair contains three that no existing tool has, which are designed based on the findings from our empirical study in Section 4.4.4. Table 4.5 shows the mapping between the causes of corruption and the implemented repair operators in column 'RO.' We do not propose repair operators for two of the causes of corruption ('Corrupted data stream' and 'Image'), and we discuss the reasoning at the end of this section. The main causes of corruption are covered by different repair operators (RO1–RO7) in Table 4.6. A repair operator may address multiple causes of file corruption. For example, when the base structure of a PDF file is corrupted (a cause of corruption), it may affect both the font and the header of the file, which means two repair operators are required to fully repair the file. The mapping between each cause of corruption and the corresponding repair operator is discussed below. 'Base structure' is covered by RO1 (object spacing), RO3 (font resource), and RO5 (catalog object). 'Font resource' is addressed by RO3 (font resource).

'Page tree structure' is addressed by RO4 (page tree structure). 'Bad data field value' is addressed by applying multiple repair operators (RO3–RO6) that detect invalid object numbers or references. 'File trailer' is directly addressed by RO6 (file trailer). 'Cross-reference table' is directly addressed by RO7 (cross-reference table). We do not propose repair operators for the remaining causes of corruption in Table 4.5 due to their low impact. The seven implemented repair operators (RO) are annotated as comments in Algorithm 1. Five of the seven proposed repair operators invoke the function genTemplate(), which generates a new object using templates for components such as the cross-reference table and the font object. The function genTemplate() generates the repair for the object whose name is passed as the argument of the function. We describe the seven repair operators below. Three of the repair operators are new and have not been used in previous document repair tools. Repair operators that have not been implemented by existing repair techniques (RO3, RO4, and RO5) are annotated with a star symbol (e.g., RO3*).

RO1: Object Spacing

Line 2 in Algorithm 1 checks the spacing between PDF objects and ensures that only valid spacing characters, as defined in the file format specification [166], appear between PDF objects. RO1 is performed first since a corrupted file may contain invalid characters that can lead to parsing failures that stop the rendering of the PDF file.
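A minimal sketch of this check is shown below, assuming the repair simply keeps the legal white-space characters found between two objects; the exact clean-up DocRepair performs may differ.

# PDF white-space characters per the file format specification: NUL, horizontal
# tab, line feed, form feed, carriage return, and space.
PDF_WHITESPACE = set(b"\x00\t\n\x0c\r ")

def has_bad_spacing(gap_bytes):
    # True if the bytes between two PDF objects contain anything other than
    # legal white-space.
    return any(b not in PDF_WHITESPACE for b in gap_bytes)

def remove_bad_spacing(gap_bytes):
    # Keep only the legal white-space in the gap; fall back to a single newline.
    kept = bytes(b for b in gap_bytes if b in PDF_WHITESPACE)
    return kept or b"\n"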

RO2: Header

Line 3 checks if the file header is correct because PDF applications may utilize the header to determine the file type. RO2 ensures that the magic number is correct, and the bytes after the magic number contain only ‘high bit’ ASCII characters of value 128 or higher. The ASCII characters can be used by file transfer applications to determine if the file’s content contains binary or text data.
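A sketch of such a header fix is shown below. The chosen version number and the four high-bit marker bytes are placeholder assumptions, not necessarily the values DocRepair writes.

def repair_header(data, version=b"1.7"):
    if data.startswith(b"%PDF-"):
        return data                       # magic number already present
    # Prepend the %PDF- keyword, a version number, and a comment line of four
    # bytes with values of 128 or higher (the 'high bit' binary marker).
    return b"%PDF-" + version + b"\n%\xe2\xe3\xcf\xd3\n" + data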

RO3*: Font Resource

Repairing font objects is important because a missing or corrupted font object often causes the text that is within the document to be undisplayable, even if the text is embedded in the file and not corrupted.

FONT_ID 0 obj
<< /Type /Font /Subtype /Type1
   /BaseFont /FONT_NAME >>
endobj
Listing 4.6: Template for the font object.

Lines 4 to 8 of Algorithm 1 iterate through the font resources and detect any missing or corrupted fonts. If any of the font resources are missing or corrupted, RO3 invokes genTemplate() to generate a new font object to replace the missing resource. Listing 4.6 shows the template for generating a font object, which is a dictionary object that consists of key-value pairs (e.g., the key 'Type' contains the value 'Font'). In this example, the template's object number FONT_ID has to be substituted with the correct object number. The variable FONT_NAME has to be substituted with the correct name of the missing font. These two values are extracted during the parsing step in Section 4.5.1.
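For illustration, the template instantiation could look like the hypothetical helper below, which substitutes FONT_ID and FONT_NAME into the shape of Listing 4.6; the function itself is not DocRepair's code.

FONT_TEMPLATE = ("{font_id} 0 obj\n"
                 "<< /Type /Font /Subtype /Type1\n"
                 "   /BaseFont /{font_name} >>\n"
                 "endobj\n")

def gen_font_object(font_id, font_name):
    # Fill in the object number and font name recovered during parsing.
    return FONT_TEMPLATE.format(font_id=font_id, font_name=font_name)

# Example: regenerate a missing font object 80 that referenced Times-Roman.
print(gen_font_object(80, "Times-Roman"))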

RO4*: Page Tree Structure

The page tree structure represents the page layout of a PDF file. The page tree hierarchy is shown in Figure 4.4. A 'page tree' is responsible for organizing 'page' objects, where a 'page' object defines the content of a single page. Both 'page tree' and 'page' objects contain pointers (arrows in Figure 4.4) that connect the objects. RO4 utilizes the parent-child pointer relationships between these objects, which are extracted during the parsing process (line 1). Lines 9 to 12 fix the page layout of a PDF file. Two pieces of information are necessary for this repair: the object ID of the top-level 'page tree' object and a list of 'page' object references. The top-level 'page tree' object (the root node in Figure 4.4) is located by tracing through the pointers. The list of 'page' object references is obtained by checking the type of the objects (leaf nodes in Figure 4.4). If the page tree structure is corrupted (there may be multiple page trees within a PDF file), RO4 generates a new page tree template (line 10) as the new root node. Although a PDF file may contain multiple page tree objects, the proposed repair operator only generates a single page tree object. The new page tree object aggregates all the available 'page' objects, which are responsible for holding the actual page content, under the new top-level 'page tree' object. RO4 also updates the pointer references of the page objects since a new 'page tree' object is generated (line 11).

Figure 4.4: Hierarchy of the 'Page Tree' structure. The tree contains intermediate nodes called 'page tree' and leaf nodes called 'page object'. The arrows represent an indirect reference between the objects.

PAGE_TREE_ID 0 obj
<< ... >>
endobj
Listing 4.7: Template for the page tree object.

An example is shown in Listing 4.7, where PAGE_TREE_ID is the object number of the new top-level 'page tree' object, and PAGE_OBJECT_LIST is the referenced list of 'page' objects that represent the content of the file. Although RO4 repairs the page tree structure, the current implementation does not analyze the order of the pages, and hence the final PDF file may contain pages in a randomized order. The repair operator is unique to DocRepair, and it is essential for recovering displayable content. Amongst the 354 PDF files successfully repaired by DocRepair, 198 utilized this repair operator (Table 4.6).
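The following sketch illustrates the aggregation step. Since the contents of the page tree template are not spelled out above, the /Type, /Kids, and /Count keys used here are assumptions based on the standard PDF page tree dictionary, not necessarily DocRepair's exact template.

def rebuild_page_tree(page_object_ids, page_tree_id):
    # Re-attach every recovered 'page' object under one new top-level node.
    kids = " ".join(f"{oid} 0 R" for oid in page_object_ids)
    return (f"{page_tree_id} 0 obj\n"
            f"<< /Type /Pages /Kids [ {kids} ] /Count {len(page_object_ids)} >>\n"
            f"endobj\n")

# Example: three surviving page objects are gathered under a new root node 82.
print(rebuild_page_tree([17, 18, 19], 82))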

RO5*: Catalog Object

The catalog object represents the root of the PDF tree structure (Figure 4.2) and contains references to other core parts of the PDF file. Lines 13–16 fix the document catalog object. A repair is performed if a catalog object cannot be found or if the catalog object cannot be located from the file trailer. The generated template (line 14) is shown in Listing 4.8, where CATALOG_ID and PAGE_TREE_ID have to be inferred similarly to RO4. RO5 has to be applied after RO4 because RO5 uses the repaired PAGE_TREE_ID from RO4 if the page tree structure is corrupted.

CATALOG_ID 0 obj
<< ... >>
endobj
Listing 4.8: Template for the catalog object.

trailer
<< ... >>
Listing 4.9: Template for the trailer object.
Similar to RO4, RO5 invokes updateReference() to update the external reference from the trailer object (line 15).

RO6: Trailer Object

The trailer object is the entry reading point for applications that process PDF files. Lines 17 to 18 fix the trailer object. Existing techniques only regenerate the trailer object if the trailer object cannot be found or cannot be parsed correctly in the PDF file, whereas DocRepair validates the correctness of the trailer data and triggers a repair upon detecting invalid values. For example, the trailer object requires a reference to the catalog object, but existing tools do not check if the existing reference actually points to a catalog object. The object template is shown in Listing 4.9. RO6 has to be applied at the end of the repair because the previous repair operators (RO3 and RO4) modify the total number of objects within a PDF file (TOTAL_NUM_OBJECTS). In addition, RO6 uses the variable CATALOG_ID from RO5.
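A sketch of the regeneration step is given below. The /Size and /Root entries are the standard trailer keys corresponding to TOTAL_NUM_OBJECTS and CATALOG_ID; treating them as the template's exact contents is an assumption, and the startxref offset would be filled in together with RO7.

def gen_trailer(total_num_objects, catalog_id, xref_offset):
    # Rebuild the trailer so that it points at a validated catalog object.
    return ("trailer\n"
            f"<< /Size {total_num_objects} /Root {catalog_id} 0 R >>\n"
            "startxref\n"
            f"{xref_offset}\n"
            "%%EOF\n")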

RO7: Cross-reference Table (xRef)

The cross-reference table stores a pointer reference to each PDF object. The cross-reference table is often used by PDF readers to fetch information quickly without reading the entire PDF file. Lines 19 to 20 reconstruct the cross-reference table of the PDF file. RO7 is a common repair operator that is deployed by all PDF viewers and repair tools.
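The sketch below shows the core of such a reconstruction, assuming a single uncompressed cross-reference section: scan the file for object headers, record each object's byte offset, and emit one entry per object. Real tools, including DocRepair, also handle free entries, multiple subsections, and cross-reference streams.

import re

def rebuild_xref(data):
    # Map each object number to the byte offset of its 'N 0 obj' header.
    offsets = {}
    for match in re.finditer(rb"(\d+)\s+\d+\s+obj", data):
        offsets.setdefault(int(match.group(1)), match.start())
    max_id = max(offsets, default=0)
    lines = ["xref", f"0 {max_id + 1}", "0000000000 65535 f "]
    for obj_id in range(1, max_id + 1):
        off = offsets.get(obj_id, 0)
        flag = "n" if obj_id in offsets else "f"
        lines.append(f"{off:010d} 00000 {flag} ")   # 10-digit offset, 5-digit generation
    return "\n".join(lines) + "\n"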

Excluded Categories

We do not include a repair operator for two causes of file corruption, 'Corrupted data stream' and 'Image,' for three reasons. The decision is made based on a manual inspection of all the corrupted files that cannot be repaired by any existing tool in Table 4.5. The first reason is that the files are too corrupted and are beyond repair. The six files in the 'Corrupted data stream' category that can be repaired by existing tools have a corrupted data stream that only affects a minor part of the PDF file. However, the four files that cannot be repaired by existing tools contain too many corrupted components. For example, two of the files from Chromium bug #444446 are damaged beyond repair, with hundreds of corrupted objects. In another example, one of the files from Freedesktop bug #14098 is created by a 2D graphics library with incorrectly embedded images, and since the only content within the file is an image and the image data is corrupted, it is impossible to recover any content from the file. The second reason is that the files require the design of a repair operator that is too specific to the file and is not generalizable. For example, Freedesktop bug #6688 contains a file that is corrupted with a unique pattern that is unlikely to occur in another corrupted file. While this file could be repaired, the RO used to repair this file would be very unlikely to generalize to other corrupted files. Third, the files in the 'Image' category require support for image analysis techniques, which is beyond the scope of this work. For example, corrupted image data within a PDF file can affect the binary pixel data, the color profile of the image data (also known as the colorspace, which has its own category), or the image encoding algorithm (also known as filters). Since there are many color profile configurations and ten different encoding algorithms supported by the PDF file format, we do not propose new repair operators to address these files. For example, Freedesktop bug #6500 contains an image that is encoded using the JBIG2 filter, where the decoded data contains a byte that causes the PDF rendering library, Poppler, to fail to render the image since Poppler's implementation does not expect the extra NULL byte in the binary data. Since the repair requires a change to the JBIG2 filter's algorithm, we do not implement a repair operator for this file.

4.6 Evaluation Method and Results

RQ4: How does DocRepair compare to existing file recovery tools?

Table 4.7: RQ4—A summary of the automated screenshot evaluation results over 319 corrupted files (including real-world corrupted and purposely corrupted files) and 1,508 fuzzed corrupted files. The table shows the number of files repaired by each specific repair tool (Mutool, PDFtk, GS, and DocRepair). The number in brackets refers to the number of files that are uniquely repaired. 'All' includes both real-world corrupted and purposely corrupted files. 'RC' only includes real-world corrupted files.

Source        Corr. (All/RC)   Mutool (All/RC)     PDFtk (All/RC)    GS (All/RC)         Doc (All/RC)       Doc+ (All/RC)
GNOME         21 / 18          8 (0) / 8 (0)       1 (0) / 1 (0)     9 (1) / 9 (1)       4 (1) / 4 (1)      10 / 10
Mozilla       2 / 2            0 (0) / 0 (0)       0 (0) / 0 (0)     0 (0) / 0 (0)       0 (0) / 0 (0)      0 / 0
Chromium      205 / 25         17 (5) / 9 (1)      20 (3) / 6 (0)    29 (13) / 10 (3)    18 (6) / 11 (3)    49 / 17
LaunchPad     52 / 48          14 (1) / 13 (1)     4 (0) / 4 (0)     14 (5) / 13 (5)     9 (2) / 9 (2)      22 / 21
GhostScript   6 / 5            1 (0) / 1 (0)       1 (0) / 1 (0)     1 (0) / 1 (0)       0 (0) / 0 (0)      1 / 1
KDE           2 / 2            2 (0) / 2 (0)       0 (0) / 2 (0)     2 (0) / 2 (0)       1 (0) / 1 (0)      2 / 2
Apache        0 / 0            0 (0) / 0 (0)       0 (0) / 0 (0)     0 (0) / 0 (0)       0 (0) / 0 (0)      0 / 0
Freedesktop   31 / 19          7 (2) / 5 (1)       3 (0) / 2 (0)     11 (6) / 10 (6)     6 (1) / 5 (1)      14 / 12
Bug trackers  319 / 119        49 (8) / 38 (3)     29 (3) / 16 (0)   66 (25) / 45 (15)   38 (10) / 30 (7)   98 / 63
Force Open    1,508 / none     459 (347) / none    12 (0) / none     18 (3) / none       316 (201) / none   673 / none
Total         1,827            508 (355)           41 (3)            84 (28)             354 (211)          751

Procedure (RQ4): To compare our proposed technique, DocRepair, against existing file recovery tools (Mutool, PDFtk, and GhostScript), we perform an evaluation over both the real-world and fuzzed corrupted PDF files identified in the empirical study (Table 4.2). Each PDF repair tool is executed on a corrupted PDF file to generate a new PDF file that represents the potentially repaired version. To verify if the repaired file is indeed successfully repaired, we utilize CorruptCheck to check if the repair is successful (Section 4.4.2). In addition, we propose and evaluate a technique called DocRepair+ that evaluates the results of multiple repair tools and returns the PDF file(s) that represent the best repair amongst a list of repair candidates. To measure the quality of a repair, we use a simple metric that determines the quality of a repaired file based on the total number of PDF viewers that can successfully open and display the file's content. DocRepair+ measures the repair quality of the four repair tools' generated PDF files by locating the repaired candidate(s) with the lowest number of failures amongst the seven different PDF viewers. Results (RQ4): The result of the comparison against existing repair tools' repair capability is shown in Table 4.7. For the real-world corrupted and purposely corrupted files ('All' column in Table 4.7), amongst the 1,827 corrupted files from the two corpora, DocRepair is capable of repairing 354 files, while Mutool, PDFtk, and GhostScript repaired 508, 41, and 84, respectively. The number in brackets shows the number of files uniquely repaired by each tool.

Although Mutool uniquely repaired 355 files in total, DocRepair can complement the existing repair tools by repairing 211 additional files that none of the existing repair tools can repair. We also show the data for real-world corrupted files (‘RC’ column in Table 4.7). DocRepair repaired a total of 30 real-world corrupted files, 7 of which cannot be repaired by any existing repair tool. Mutool, PDFtk, and GhostScript repaired 38, 16 and 45 real-world corrupted files respectively.

To understand each repair operator's contribution towards a successful repair, we measured each repair operator's usage frequency over the 354 corrupted PDF files that were successfully repaired by DocRepair (Table 4.6). The data shows that RO6 (file trailer) and RO7 (cross-reference table) are always applied during a repair. RO1 (object spacing), RO4 (page tree structure) and RO5 (catalog object) are also frequently used. RO2 (header) and RO3 (font resource) are infrequently used since it is less common for these two components to be corrupted. A median of four operators is used to repair the 354 corrupted files.

We show an example of a PDF file that requires the application of multiple repair operators. Chromium bug #70440 contains a PDF file that is corrupted when it is saved using the Chrome browser [205], where six of the seven repair operators are utilized to repair the file. A diff of the changes is shown in Listing 4.10. RO1 (object spacing) is applied because the file contains trailing garbage at the end of the file. RO3 (font resource) is applied because the referenced font object, T1_2, is missing from the file, so DocRepair generates a temporary font as a substitute. RO4 (page tree structure) and RO5 (catalog object) are applied because the ‘page tree’ object and the catalog object are both missing due to file corruption. RO6 (file trailer) and RO7 (cross-reference table) are always applied during a file repair.

The repair capability of the combined repair technique, DocRepair+, is shown in the last column of Table 4.7. DocRepair+ repaired 751 corrupted files, which is far more than any individual tool is capable of repairing. It shows that the proposed metric is effective at determining the quality of the file repair.

Findings (RQ4): The results show that the proposed technique, DocRepair, is effective at repairing corrupted PDF files (contributing 211 unique repairs among the 1,827 corrupted files). The proposed combined technique, DocRepair+, which relies on screenshot analysis to determine the best repair, is also shown to be effective (repairing 751 of the 1,827 corrupted files). The proposed repair operators are capable of contributing to unique file repairs (e.g., RO4 helped DocRepair repair 218 of the 354 corrupted files, and a median of four repair operators is used to repair corrupted files).

// removed garbage data using RO1

// RO3
80 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont
/Times-Roman >>
endobj

// RO4
82 0 obj
<< ... >>
endobj

// RO5
83 0 obj
<< ... >>
endobj

// RO7
xref
0 5
0000000000 65535 f
0000021329 00000 n
0000021733 00000 n
0000021867 00000 n
0000022595 00000 n
15 1
0000000016 00000 n
17 16
0000000477 00000 n
0000000936 00000 n
...

// RO6
trailer
<< ... >>
startxref
23361
%%EOF

Listing 4.10: An example of DocRepair applying multiple repair operators on the corrupted PDF file in Chromium bug #70440.
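To make the DocRepair+ selection step described in the procedure above concrete, the following is a minimal Java sketch of choosing the best repair candidate by counting viewer failures. The RepairCandidate type, its fields, and the surrounding class are illustrative assumptions rather than DocRepair+'s actual implementation; only the selection logic (keep the candidate(s) with the fewest failures amongst the seven viewers) follows the text.

    import java.util.ArrayList;
    import java.util.List;

    public class BestRepairSelector {

        /** One repaired PDF produced by a repair tool (hypothetical type). */
        static class RepairCandidate {
            final String toolName;
            final String repairedFilePath;
            final int viewerFailures; // how many of the seven PDF viewers fail to open/display it

            RepairCandidate(String toolName, String repairedFilePath, int viewerFailures) {
                this.toolName = toolName;
                this.repairedFilePath = repairedFilePath;
                this.viewerFailures = viewerFailures;
            }
        }

        /** Returns the candidate(s) with the lowest number of viewer failures. */
        static List<RepairCandidate> selectBest(List<RepairCandidate> candidates) {
            int fewestFailures = Integer.MAX_VALUE;
            for (RepairCandidate c : candidates) {
                fewestFailures = Math.min(fewestFailures, c.viewerFailures);
            }
            List<RepairCandidate> best = new ArrayList<>();
            for (RepairCandidate c : candidates) {
                if (c.viewerFailures == fewestFailures) {
                    best.add(c);
                }
            }
            return best;
        }
    }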

4.7 Threats to Validity

We discuss the threats to the validity of our experiments.

4.7.1 Internal Validity

Internal validity concerns the extent to which one can claim cause and effect between variables in the experimental design. The empirical study may contain external variables that can influence the authors' analysis of the PDF files. A person's experience with the PDF file format may impact the classification of the different causes and impacts of corrupted files. We mitigate this effect by having multiple authors perform the analysis for the parts of the classification that heavily depend on the knowledge of the inspector.

4.7.2 External Validity

External validity concerns whether the conclusions can be generalized outside of the study.

The empirical study currently only contains files that are crawled from existing bug trackers, which can cause a selection bias towards files that are corrupted due to specific software. For example, files that are corrupted due to file system bugs or network transfer bugs (ones that are not associated with a specific software application) may be less represented in the dataset. To mitigate the bias, one could collect PDF files directly from existing web pages instead of only focusing on bug trackers. However, the collection of PDF files from existing web pages can be a very challenging task.

The detection of corrupted PDF files is done using seven specific PDF viewers: Evince, Xpdf, MuPDF, Chromium, Firefox, Adobe Reader DC, and Qpdfview. There are other popular PDF viewers on Windows and Linux that may be used to help determine if a file is corrupted. The current selection of PDF viewers consists mostly of Linux PDF viewers, which is a potential bias. However, some of the PDF viewers, such as MuPDF, Chromium, and Firefox, are cross-platform, which mitigates the threat.

The proposed technique is only evaluated against three existing PDF repair tools: Mutool, PDFtk, and GhostScript. While these are the most popular PDF repair tools, other repair tools could be included to further expand the generality of the conclusions.

4.7.3 Construct Validity

Construct validity concerns whether a test measures the intended construct of the study. The effectiveness of a PDF repair tool is measured based on the number of PDF viewers that can successfully open and display the content of the corrupted file after the repair. This might not be the best way to measure the effectiveness of a repair tool, as it does not measure the amount of content that is recovered by the proposed repair technique. The previous work on Force Open proposed to utilize the Levenshtein distance to measure the amount of content that can be recovered by PDF repair tools [27]. However, their evaluation result shows that the corrupted files from their dataset are either completely repaired or not repaired at all. We analyzed Force Open's dataset and were able to attribute this to the approach they used to create their fuzzed dataset, where a file is only considered to be corrupted if it cannot be opened by a PDF processing tool called pdftotext.

4.7.4 Conclusion Validity

Conclusion validity concerns whether the conclusions reached about the relationships in the data are reasonable.

The empirical study relies on the observer's knowledge of the PDF file format. Although the observer who analyzed the corrupted files had a year of experience working with the PDF file format and had spent much time understanding the corrupted files, the conclusions about the corrupted PDF files may still not be accurate because the PDF file format is complex.

The analysis in the empirical study relies on studying the bug reports for the corrupted PDF files. Since each corrupted PDF can be mapped to a bug report, this helps the observer analyze the corrupted PDF files. However, bug reports may contain inaccurate information about the corrupted PDF file.

Another threat concerns the ability of an observer to determine if a corrupted file is fully recovered, because the observer does not have a non-corrupted version of the file. The observer can only judge the correctness of the file structure based on manual inspection of the raw data, the bug report associated with the corrupted file, and the output from PDF viewers and recovery tools.

4.8 Summary

Data corruption is one of the most common issues that impact the availability of users' data and cause software failures. This work proposed the first comprehensive dataset of 119 real-world corrupted files. We conducted an empirical study to understand the number of real-world corrupted files in the wild, the causes and impact of file corruption, and the repair capability of existing repair tools. The study findings motivated the design of three unique repair operators to help repair corrupted files. We proposed DocRepair and performed an evaluation against three common PDF recovery tools: Mutool [73], PDFtk [74] and GhostScript [78]. We also proposed DocRepair+, which repairs corrupted files through the combination of multiple repair tools. The results show that DocRepair is capable of complementing existing repair tools by repairing 211 additional corrupted files that none of the existing repair tools can repair.

Chapter 5

Automatic Documentation Generation: Crowd Sourced Comment Generation

5.1 Motivation

Code commenting has been an integral part of software development and standard practice in the industry. Code comments improve software maintainability [83] and programming productivity [4] by helping developers understand the source code. They also improve software reliability by assisting in detecting software defects [8]. Despite the need for and importance of commenting code, many code bases are not commented or not adequately commented [2]. Therefore, it would be beneficial to generate comments automatically when possible to improve software maintainability and reliability. In addition, since it is time-consuming for developers to write comments, automatic comment generation can save developers' time so that they can spend their valuable time on other tasks, which also helps keep the code comments up-to-date with the source code.

Previous work has shown that newly added code barely gets commented and that when code does get commented, the comments are mostly added to class and method declarations instead of method calls [206]. Therefore, an automatic comment generation technique at the code fragment level, as opposed to the class and method level, would be beneficial to developers.

Researchers have proposed techniques to generate comments from source code automatically. Sridhara et al. automatically generate a summary comment for a Java method [89]. They leverage the code structure and linguistic information from identifiers using the Software

Word Usage Model (SWUM) [92], and generate one comment sentence for each chosen statement from the method. After that, they concatenate the comment sentences to form a summary comment for the method. In a follow-up project, they identify statements with similar structures and topics to generate comments for groups of statements [90]. While these techniques are successful initial steps toward automatic comment generation, they have two main limitations. First, the techniques can only generate comments for particular code structures. For example, they can generate comments for one method body [89], groups of method calls [90], groups of if-else statements [90], method parameters [88], or classes [91]. Second, the performance of their work depends on high-quality identifier names and method signatures. For example, when grouping method calls, it requires that all method names contain the same verb [90]. If the identifiers and methods have poor names, then the approach may fail to generate accurate comments or any comments at all.

A recent state-of-the-art technique by McMillan et al. generates documentation that includes context information for Java methods [207]. They collect contextual data about a method from the source code (e.g., statements that call the method, statements that supply the method's inputs, and methods that use the method's output) and use the keywords from the context of the method to describe it. Their work labels method signatures using SWUM [92] to generate a natural language sentence that describes the functionality of the method, and extracts the contextual information of the method using PageRank to generate additional sentences as part of the summary. The differences between their technique and our proposed technique are that their work performs summarization at the method level whereas our work focuses on generating code comments for code segments inside a method, and that their summary includes the context of the method. Although their work relies on SWUM, their technique is capable of generating comments for any method since it only depends on SWUM's ability to identify the parts of speech in the method signature. Our work is less dependent on the quality of the identifier names because it only requires at least one common term between the code comment and the code segment to establish a link, and it utilizes a crowdsourced approach for automatic comment generation, where it selects the best comment amongst a range of candidates from Stack Overflow and code sources such as GitHub.

We propose a new approach to generate comments automatically to address the above limitations. We observe that Question and Answer (Q&A) sites such as Stack Overflow [31] naturally contain code descriptions, and open-source projects contain code comments written by developers. Therefore, we can extract code comments from these two available sources for automatic comment generation. Specifically, Stack Overflow is widely used to ask questions about code development and debugging. The questions on Stack Overflow often

receive high-quality answers due to the large user base. For example, a user asked, “Can I know if a given method exists?” (Stack Overflow post #28069121) to check if a specific method exists within a class. The question received a Java code snippet that checks if a given method had been declared in the class. We can use the statement form of the question, “Know if a given method exists,” as an explanatory description of the code snippet. We refer to the code snippet and description as a code-description mapping. If a code segment in a software project is identical or similar to the above code snippet, then the corresponding description can be an explanatory comment for the code segment in the software project.

Stack Overflow [31] contains a wealth of information, which makes it a feasible and valuable data source for extracting code-description mappings for automated comment generation [208]. Previous work has shown that Stack Overflow contained a total of 42 million questions as of March 2017. Previous work has also shown that at least 49% of the Java and Android questions on Stack Overflow have at least one code example in the accepted answer [209]. Android code snippets have a mean size of 16.4 lines of code (LOC) and a median of 9 LOC [210].

The idea is to generate comments automatically by mining Q&A sites and open-source projects. The prototype, CloCom+, searches for code segments from the two sources that are similar to code in the target software project. Once CloCom+ finds an identical or similar code segment, it presents the corresponding description as a comment to explain the matched code segment. One key benefit of CloCom+ is that the description is what a developer used to describe the code segment, which is likely to be accurate and useful for developers to understand the code (compared to descriptions generated from variable names and method names). However, our evaluation shows that CloCom+ is limited by the number of comments that it can generate since it relies on discovering a similar code segment that has an associated natural language description. Lastly, our user study shows that the quality of the code comments automatically generated by CloCom+ is not good enough for them to be directly applied as source code comments. The 20 participants could not reach a good agreement on the quality of the code comments. Although CloCom+ improves on SumSlice, it still requires improvement before it can be applied in the real world. We studied the automatically generated code comments and propose several future directions for automatic comment generation.

This work makes the following contributions:

• We proposed and built CloCom+, an open source tool shared on our website [211], that is capable of generating code comments at the code fragment level, where each code segment contains an average of 3–4 statements.

• We evaluated CloCom+ on 16 open source projects, and it successfully generated 442 groups of unique comments for 780 code locations. We performed a user study with 20 participants to evaluate the quality of the comments automatically generated by CloCom+ and compared it against the latest state of the art. Although the results show that the majority of the participants found the automatically generated comments to be complete, concise, expressive and useful, the statistical tests show little agreement between the participants. The technique is shown to be not ready for real-world usage and still requires much improvement.

• Our comparison against previous work, SumSlice, shows that the natural language summary that the previous work generates from the method signature greatly improves the summary's completeness. However, certain sentences that are included in their generated summary cause their technique to be overall less concise compared to CloCom+.

5.2 Examples and Challenges

In this section, we present three examples to illustrate how CloCom+ generates comments automatically. We describe the challenges, summarize our solutions, and show comparisons against previous work by Sridhara et al. and McMillan et al.

5.2.1 Example One

Figure 5.1 shows a code segment from the Java project—Eclipse. CloCom+ generates the following comment to explain the code segment between lines 13–15: “Create a temp file first.” Figure 5.2 shows the Stack Overflow post that CloCom+ leverages to generate the comment. It shows the title of the post, the code snippet, and the one paragraph immediately before the code snippet in the answer.

Challenges in Comment Selection: Figure 5.2 shows two textual descriptions that can be leveraged to describe the code segment in the answer. One is the title of the post, which describes the question. The other is the paragraph immediately before the code segment, which consists of a single sentence. Among the two sentences in the title and the answer paragraph, only the sentence in the answer describes the code snippet in Figure 5.2. CloCom+ needs to select this relevant sentence from the two sentences to generate the comment automatically, which is challenging.

1  private File getLocalCopy(File file) throws IOException {
2    String fileName = file.getName();
3    String prefix;
4    String suffix = null;
5    int dotLoc = fileName.indexOf('.');
6    if (dotLoc != -1) {
7      prefix = fileName.substring(0, dotLoc);
8      suffix = fileName.substring(dotLoc);
9    } else {
10     prefix = fileName;
11   }
12
13   File tmpFile = File.createTempFile(prefix, suffix);
14   tmpFile.deleteOnExit();
15   FileOutputStream fos = new FileOutputStream(tmpFile);
16   FileInputStream fis = new FileInputStream(file);
17   byte[] cbuffer = new byte[1024];
18   int read = 0;
19
20   while (read != -1) {
21     read = fis.read(cbuffer);
22     if (read != -1)
23       fos.write(cbuffer, 0, read);
24   }
25   fos.flush();
26   fos.close();
27   fis.close();
28   tmpFile.setReadOnly();
29   return tmpFile;
30 }

Figure 5.1: Code from Java project—Eclipse (PluginsView.java).

Stack Overflow Question (Title): Is it possible to create a File object from InputStream

Stack Overflow Answer: Create a temp file first.

File tempFile = File.createTempFile(prefix, suffix);
tempFile.deleteOnExit();
FileOutputStream out = new FileOutputStream(tempFile);
IOUtils.copy(in, out);
return tempFile;

Figure 5.2: Stack Overflow Post #11501418

CloCom+ uses a text similarity technique to address this comment selection challenge. Some sentences extracted from Stack Overflow do not actually describe the code segment and have to be filtered out or modified. CloCom+ leverages the text similarity between each sentence and the code segment to identify the most relevant sentences (Section 5.3.5). In Figure 5.2, the shared words between the text and code are in bold (create, file and temp), which results in a text similarity score of two for the title of the post and a score of three for the answer's sentence. CloCom+ discards the title of the post because it has a lower text similarity score.

Previous work generates comments for high-level actions within methods [90]. Their work generates comments for three types of statement groups. The first type involves sequences of statements that can be grouped into a high-level action (same verb phrase with a common headword). The second type involves conditional blocks that contain integrable sequences of statements along the branches (if ... else if ... else ... or switch). The third type involves specific common high-level code patterns that are based on loop constructs. Their work cannot generate a comment for this code segment: the code contains no sequences of statements that can be grouped together. For example, the method createTempFile() from line 13 and deleteOnExit() from line 14 contain different actions, create and delete. The code also does not contain any supported conditionals or loop constructs. For example, the if ... else condition and the while loop are both not supported.

Previous work also generates summaries for Java methods [89], which are leading comments that occur before a method. For example, it can generate a summary for the method getLocalCopy in Figure 5.1. Their technique first selects lines of code that are important to a method summary through their S_unit selection process. It selects the return statement on line 29 because it is the exiting point of a method; it selects

lines 14, 23, 25, 26, 27 and 28 because they contain method calls that do not have a return value; it selects line 2 because the method getName contains the same action as the method getLocalCopy; and it selects line 22 because the if-statement controls a previously selected S_unit at line 23. It then performs filtering and removes lines 26 and 27 because they contain the keyword close. In the end, it synthesizes natural language sentences for lines 14, 23, 25, and 28 using their Software Word Usage Model, which then become the method summary. Our work focuses on synthesizing comments for high-level actions within methods instead of synthesizing comments for the entire method.

A recent state-of-the-art technique by McMillan et al. generates a summary for Java methods [207]. It breaks down the method signature using SWUM and first generates a sentence that describes the purpose of the method (“This method gets a local copy of the file”). It then attempts to generate a sentence that describes the usage of the return value (e.g., “That file is used in methods that...”), and generates a sentence that describes the caller of the method (e.g., “Called from method that...”). Lastly, it utilizes the PageRank algorithm to determine the importance of the method (e.g., “getLocalCopy() seems far more important than average because it is called by X methods”). Their work focuses on providing information about the context of the method, whereas CloCom+ focuses on generating code comments for the statements inside a method.

Existing text retrieval techniques for code summarization [98, 99, 100, 101] can detect important terms such as file, temp, create, stream and delete. However, they only generate the top K terms to describe the code as opposed to natural language sentences.

Challenges in Comment Refinement: The sentences from question titles and the answers are often in a question form (e.g., “How to ...?”, “Is it possible to ...?”) or contain unneeded information (e.g., “You can ...”). The direct use of these sentences will lead to low-quality comments. To address this challenge, we deploy natural language processing (NLP) techniques to extract the core parts of a sentence. CloCom+ looks for a subtree that contains a verb phrase (VP) and a noun phrase (NP) in the parse tree of a sentence. In Figure 5.2, CloCom+ extracts “Create a File object from InputStream” from the title. It removes “Is it possible to” from the beginning of the sentence because the leading verb is “create.” It also removes personal pronouns from the leading part of the sentence if they exist (Section 5.3.2).
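As a rough illustration of the shared-word scoring used for comment selection, the sketch below counts the words that a candidate sentence shares with a code segment. The tokenization and the class and method names are simplified assumptions rather than CloCom+'s actual implementation, which works on stemmed terms and AST identifiers (Section 5.3.5).

    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;

    public class TextSimilarity {

        /** Splits text or code into lower-cased word tokens (camelCase is also split). */
        static Set<String> tokens(String text) {
            String split = text.replaceAll("([a-z])([A-Z])", "$1 $2"); // createTempFile -> create Temp File
            Set<String> result = new HashSet<>();
            for (String t : split.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!t.isEmpty()) {
                    result.add(t);
                }
            }
            return result;
        }

        /** Number of words shared between the sentence and the code segment. */
        static int score(String sentence, String code) {
            Set<String> shared = tokens(sentence);
            shared.retainAll(tokens(code));
            return shared.size();
        }

        public static void main(String[] args) {
            String code = "File tempFile = File.createTempFile(prefix, suffix);";
            // The answer's sentence shares three words (create, temp, file) with the code,
            // while the question title shares only two (create, file), so the answer is selected.
            System.out.println(score("Create a temp file first.", code));
            System.out.println(score("Is it possible to create a File object from InputStream", code));
        }
    }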

5.2.2 Example Two

Figure 5.3 shows a code segment from the Java project—JabRef. CloCom+ generates the following comment for the top code segment between lines 3–6: “try parsing as a number.”

1 public static Optional parse(String value) {
2   ...
3   try {
4     int number = Integer.parseInt(value);
5     return Month.getMonthByNumber(number);
6   } catch (NumberFormatException e) {
7     return Optional.empty();
8   }
9 }

1  private int parseMonth(String month) throws IllegalArgumentException {
2    // try parsing as a number
3    try {
4      int result = Integer.parseInt(month);
5      return result;
6    } catch (NumberFormatException e) {}
7
8    // try parsing as a word
9    month = month.toLowerCase();
10   if (month.equals("january") || month.equals("jan")) {
11     ...
12 }

Figure 5.3: Top code from Java project—Jabref (Month.java) matched against the bottom code from the open source project mdrill from GitHub (CustomPeriodicTest.java).

The code is matched against the bottom code segment between lines 3–6. CloCom+ detects that the two code segments match.

Challenges in Code Matching: Finding similar code segments between (1) the input projects and (2) Stack Overflow and open-source projects requires code clone detection techniques (Section 5.3.3) that are scalable [36]. The detection of code clones is challenging because the code segments from Stack Overflow are often incomplete and uncompilable. To alleviate this problem, CloCom+ wraps the code with dummy wrapper code (e.g., placing statements inside a main function or class) to make the code segments compilable. If the compilation is successful, it tokenizes the abstract syntax tree (AST) for code clone detection. The two code segments in Figure 5.3 are slightly different regarding the variable names.

The renamed variables are underlined (e.g., number vs. value). The code clone detection tool also has to capture the fact that the variable number is an int and value is a String during the matching process. On top of that, the two code segments in Figure 5.3 have a different structure, where line 5 in the top code segment does not match line 5 in the bottom code segment. However, such small differences do not affect the semantic similarity of the two pieces of code. Therefore, the code clone detection tool ignores such differences and reports the pair as a code clone.

Previous work by Pollock et al. [90] cannot generate a comment for this code segment. The code contains variable declarations along with a try block and a catch block. Since there is only a single statement within each of the try and catch blocks, the statements cannot be grouped together. The code also does not have any conditional or loop constructs.

Recent previous work by McMillan et al. [207] can generate a natural language summary for this Java method. However, it cannot generate a code comment for the statements within the method.

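To illustrate how a token-based clone detector can treat the two renamed statements above as equal, the following is a minimal sketch that hashes statements after replacing identifier names with a placeholder. The regular-expression normalization and the class name are simplifying assumptions; CloCom's actual matching works on AST tokens (Section 5.3.3), not on raw text.

    public class StatementHasher {

        /** Replaces lower-case identifiers with a placeholder; type and method names are kept. */
        static String normalize(String statement) {
            return statement.replaceAll("\\b[a-z][A-Za-z0-9_]*\\b(?!\\s*\\()", "ID");
        }

        /** Two statements that differ only in variable names receive the same hash. */
        static int statementHash(String statement) {
            return normalize(statement).hashCode();
        }

        public static void main(String[] args) {
            String a = "int number = Integer.parseInt(value);";
            String b = "int result = Integer.parseInt(month);";
            System.out.println(statementHash(a) == statementHash(b)); // prints true
        }
    }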
5.2.3 Example Three

Figure 5.4 shows a code segment from the Java project—Vuze. CloCom+ generates the following comment for lines 3–6: “Fall back to system properties.” The bottom code segment in Figure 5.4 shows the code that CloCom+ leverages to generate the comment.

Benefits of CloCom+: CloCom+ generates a comment that provides valuable information that is not explicitly in the code, e.g., that the code is falling back to a value. Also, CloCom+ can naturally group the three statements into a semantic unit for comment generation because developers have already grouped the statements in the matched code. Such grouping is general because it does not rely on the quality of the method names or the structure of the methods, which is different from previous work [90].

Previous work cannot generate a comment for this code segment [90]. The code contains no statements that can be grouped and no supported conditional blocks. Lastly, the if-statements do not satisfy any of the supported loop patterns.

1 private static UnchokerFactory getSingleton(String explicit_implementation) {
2   String impl = explicit_implementation;
3   if (impl == null) {
4     impl = System.getProperty(DEFAULT_MANAGER);
5   }
6   if (impl == null) {
7     impl = DEFAULT_MANAGER;
8   }
9   ...

1  public String resolvePlaceholder(String placeholderName) {
2    try {
3      String propVal = this.servletContext.getInitParameter(placeholderName);
4      if (propVal == null) {
5        // Fall back to system properties.
6        propVal = System.getProperty(placeholderName);
7        if (propVal == null) {
8          // Fall back to searching the system environment.
9          propVal = System.getenv(placeholderName);
10       }
11     }
12   ...

Figure 5.4: Top code from Java project—Vuze (UnchokerFactory.java) matched against bottom code from GitHub project spring-framework (ServletContextPropertyUtils.java).

5.3 Design and Implementation

Figure 5.5 shows an overview of CloCom+. CloCom+ takes two inputs: (1) a Stack Overflow database and/or GitHub projects; and (2) the source code of the target project that the user wants to generate comments on. The output is a list of generated comments. CloCom+ consists of two major components. The first component generates databases of code-description mappings (Section 5.3.1) and leverages natural language processing (NLP) techniques to refine the descriptions that come from Stack Overflow (Section 5.3.2). The second component generates comments for the target software. It applies a code clone detection technique to identify matched code between the database and the target software (Section 5.3.3). It then prunes the bad matches (Section 5.3.4). Lastly, it selects

the best comment for the matched code (Section 5.3.5).

5.3.1 Code-Description Mapping Extraction from Stack Overflow

To build databases of code-description mappings from Question and Answer sites, we choose the programming website Stack Overflow [31] as the data source. We utilize Stack Overflow's public data dump [212] to build the database. Previous work by Ponzanelli et al. presented a dataset that models each post in the Stack Overflow data dump [213]. However, we do not utilize their dataset because we can access all the required information directly from the public data dump.

Stack Overflow contains questions from diverse software domains such as Java, Android, and C++. Each domain is associated with its respective tag. We built a code-description mapping database for Java projects by locating questions that are tagged with either “java” or “android.” The two tags returned a total of 257,504 code-description mappings in our evaluation.

Stack Overflow contains invalid and low-quality questions and answers. Previous work by Ponzanelli et al. [214] proposed a technique to identify low-quality posts by using both the content of a post and community-related aspects. To ensure the quality of the extracted code-description mappings, CloCom+ selects questions and answers based on the score they received from Stack Overflow's voting system. For each post, CloCom+ extracts code-description mappings from all questions and answers that received a positive (one or more) score. Figure 5.6 shows the frequency distribution of the score for all 800,744 Android-tagged questions in 2016, which has a mean of 1.6 and a median of 0. We observe that 644,575 (80.5%) of the Android questions have a score between zero and two, and only 48,289 (7.5%) of the Java questions have a negative score.

Figure 5.5: An overview of CloCom+.


Figure 5.6: A Score Distribution of the 800,744 Android-tagged Stack Overflow questions. We do not show the distribution where a question has a score of more than ten (1602 in total) or less than negative ten (two in total).

The paragraph before a code segment is not the only description for the code segment (Figure 5.2). Since it is common for people to summarize the problem using the title of a post, we also extract the title as a comment candidate.

We attempted to extract description sentences from other parts of a post. The initial theory was that analyzing more sentences would help generate more comments. Therefore, instead of only extracting from the paragraph immediately before the code segment, we tried to include description sentences from 1) two paragraphs before the code segment, 2) one paragraph after the code segment (only if there are no code segments following the paragraph), and 3) one paragraph before and one paragraph after the code segment (only if there are no code segments following the paragraph). The inclusion of additional paragraphs in the analysis can improve the yield, but the majority of the description sentences from the additional paragraphs do not describe the code segment.

Based on the above experiment, we generate the mappings by including natural language sentences from (1) the one paragraph before the code segment and (2) the title of the post, which provided 257,504 code-description mappings for Java in the evaluation.
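The following is a minimal sketch of this extraction step for a single post, assuming the post's title, HTML body, and score have already been read from the data dump. The class names, the regular expression over the HTML, and the helper methods are illustrative assumptions rather than CloCom+'s actual parser.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** A code snippet paired with a candidate natural language description. */
    class CodeDescriptionMapping {
        final String description;
        final String code;

        CodeDescriptionMapping(String description, String code) {
            this.description = description;
            this.code = code;
        }
    }

    public class MappingExtractor {

        // Matches a paragraph immediately followed by a code block in a post body.
        private static final Pattern PARA_BEFORE_CODE =
            Pattern.compile("<p>(.*?)</p>\\s*<pre[^>]*><code>(.*?)</code></pre>", Pattern.DOTALL);

        /**
         * Extracts code-description mappings from one Stack Overflow post. Only posts
         * with a positive score are considered, and both the post title and the
         * paragraph immediately before each code block are kept as comment candidates.
         */
        static List<CodeDescriptionMapping> extract(String title, String bodyHtml, int score) {
            List<CodeDescriptionMapping> mappings = new ArrayList<>();
            if (score < 1) {
                return mappings; // skip posts without a positive score
            }
            Matcher m = PARA_BEFORE_CODE.matcher(bodyHtml);
            while (m.find()) {
                String paragraph = stripTags(m.group(1));
                String code = unescape(m.group(2));
                mappings.add(new CodeDescriptionMapping(paragraph, code));
                mappings.add(new CodeDescriptionMapping(title, code));
            }
            return mappings;
        }

        private static String stripTags(String html) {
            return html.replaceAll("<[^>]+>", "").trim();
        }

        private static String unescape(String html) {
            return html.replace("&lt;", "<").replace("&gt;", ">")
                       .replace("&quot;", "\"").replace("&amp;", "&").trim();
        }
    }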

5.3.2 Description Refinement

The description sentences of the code segments are often in a question form (e.g., “How can I ...?”) or contain unnecessary information (e.g., “You can try ...”). To improve the quality of the descriptions, CloCom+ leverages NLP techniques to perform refinements. It performs filtering of invalid descriptions and extraction of the core parts of the descriptions, applying the techniques below in the following order.

Table 5.1: List of terms for sentence filtering.

no, not, cannot, but, or, bad, fail, error, errors, exception, fix, bug, missing, help, wrong, problem, efficient, expensive, slow, incorrect, instead, likely, maybe, seems, perhaps, probably, think, where, why, crash, defect, patch, how, just, really

Description Filtering: As discussed in Section 5.2.1, sentences that ask and answer how to troubleshoot code often do not describe the code segment, for example, “Why this code does not work?” and “Android: problem retrieving bitmap from database.” CloCom+ filters out such sentences based on the manually collected terms in Table 5.1. This list was collected through manual inspection of the extracted sentences from Section 5.3.1. We attempt to generalize this list by expanding the terms with synonyms (e.g., “maybe” to “perhaps”). We use the same list for filtering both the Java and Android databases. CloCom+ removes a sentence from being a potential comment candidate if any of the terms in the list appears in the sentence.

We tried to apply Stanford's sentiment analysis tool to detect negative sentences, but most of the sentences were classified incorrectly, so the idea was discarded. In the future, it may be interesting to determine if a sentence contains negative or troubleshooting information using techniques such as semantic role labeling to retrieve high-level information associated with the verb (action) of a sentence.

Main Subtree Extraction: Sentences that are in a question form or contain personal pronouns (e.g., “you”) are not suitable as comments. Therefore, we adapt NLP techniques with two objectives: 1) to convert sentences in a question form to a statement form, and 2) to extract the core parts of the sentences. We achieve both by identifying and extracting the main subtree of a sentence. There are three steps to extract the main subtree of a sentence:

1. Generate a parse tree from the input sentence.

2. Obtain all the subtrees that match the specified patterns.

3. Merge all the matched subtrees together to form a refined sentence.

Step one generates a parse tree using the latest version of Stanford CoreNLP (v3.6.0, http://stanfordnlp.github.io/CoreNLP/). CloCom+ first uses CoreNLP's part-of-speech (POS) tagger [34] to label the part of speech of the words in each sentence. It then uses the statistical parser [35] to generate the parse tree. Figure 5.7 shows the parse tree for a sentence in a Stack Overflow post. The leaf nodes represent the words of the sentence, and the parent of each leaf node represents the POS tag of the word. Since sentences in Stack Overflow posts come from the software domain, it is very common for them to contain code artifacts (e.g., ClassName.methodName()). Code artifacts cause CoreNLP's POS tagger to fail to tokenize sentences correctly. To mitigate this issue, we identify such code artifacts using simple regular expressions and mark them as nouns before the tokenization process.

Step two extracts the main subtrees from the parse tree. The idea is to obtain subtree(s) that contain at least one verb phrase (VP) and one noun phrase (NP). This ensures each extracted phrase contains a verb and a subject/object. We define two patterns, Equation (5.1) and Equation (5.2), in Stanford's Tregex format (http://nlp.stanford.edu/software/tregex.shtml) to extract the main subtree(s) of a parse tree. Table 5.2 and Table 5.3 explain these two patterns respectively. We formulated the two patterns based on manual inspection of hundreds of natural language sentences in Stack Overflow to ensure their generality. Since the two patterns are only defined at the verb/noun phrase level, they are compatible with a wide range of sentences.

We require two different Tregex patterns for two reasons. First, each pattern contains a unique set of constraints on top of requiring at least one VP and NP. For example, the VP-NP (a verb phrase followed by a noun phrase) pattern in Table 5.2 contains the following two constraints: 1) the NP inside the VP must not only contain a personal pronoun, and 2) the VP must not only contain a modal verb. Second, each pattern defines the unique grammatical structure layout of the verb/noun phrase (e.g., ancestor and sister relationships in the parse tree), which cannot be defined using traditional regular expression matching.

VP-NP: VP << (NP < /NN.?/) < /VB.?/    (5.1)

NP-VP: NP !< PRP [<< VP | $ VP]    (5.2)

The NP-VP pattern excludes noun phrases that begin with a personal pronoun (PRP). Personal pronouns typically contribute no value in a code comment, so it is safe to remove them.

Figure 5.7: Parse tree for the sentence “You can use this method to capture the stacktrace in a String.” The matched Tregex patterns are labeled in bold.

Table 5.2: Explanation for Equation (5.1), VP-NP.

Regular Expression  | Explanation                                                               | Rationale
VP << (NP < /NN.?/) | Verb phrase (VP) that is an ancestor of a noun phrase (NP).               | Ensures the sentence starts with a VP that includes an NP.
NP < /NN.?/         | Noun phrase (NP) that is the parent of the basic category of a noun (NN). | Ensures the NP has at least one noun that is not a personal pronoun (PRP), but the NP will be allowed to contain personal pronouns.
VP < /VB.?/         | Verb phrase (VP) that is the parent of the basic category of a verb (VB). | Ensures the VP has at least one verb that is not a modal verb (e.g., can, must, should, will).

The Penn Treebank tagging guidelines [215] define PRPs to include personal pronouns proper (i.e., “I”, “me”, “you”, “he”, “him”, etc.), reflexive pronouns ending in -self or -selves, and nominal possessive pronouns (e.g., “mine”, “yours”, “his”, “ours” and “theirs”).

We show how to remove “You can” from the sentence in Figure 5.7 using the two patterns. CloCom+ finds three subtrees that are matched by the patterns and then merges them (step three), which produces a sentence without “You can.” We label the matched VP-NP and NP-VP patterns on the right-hand side of the figure. The conditions that each matched subtree satisfies are as follows:

VP-NP #1: “use this method to capture the stacktrace in a String”

1. The VP (highlighted as [VP]) is an ancestor of the NP, “this method to capture the stacktrace in a String.”

2. The NP is the parent of a noun, “method.”

3. The VP is the parent of a verb, “use.”

NP-VP #1: “this method to capture the stacktrace in a String”

1. The NP (highlighted as [NP]) is not the parent of a personal pronoun.

2. The NP is an ancestor of the VP, “to capture the stacktrace in a String.”

Table 5.3: Explanation for Equation (5.2), NP-VP. The pattern requires a noun phrase (NP) that is not the parent of a personal pronoun (PRP) and that is either an ancestor or a sister of a verb phrase (VP).

VP-NP #2: “capture the stacktrace in a String"

1. The VP (highlighted as [VP]) is an ancestor of the NP, “the stacktrace in a String.”

2. The NP is the parent of a noun, “stacktrace.”

3. The VP is the parent of a verb, “capture.”

For the other sentence in the motivating example, “Is it possible in Java's MessageFormat to receive a stack trace?”, CloCom+ extracts the main subtree as “Receive a stack trace” because it matches Equation (5.1).

Step three performs merging on the extracted subtrees in the case where there is more than one subtree, and outputs a single sentence. For example, the parse tree in Figure 5.7 contains three matched subtrees (VP-NP #1, NP-VP #1, and VP-NP #2). To generate a single sentence from the multiple matched subtrees, CloCom+ calls a method from Stanford NLP's tree class, “joinNode,” on all the subtrees, which performs the following: given two subtrees, locate the node j such that j dominates both subtrees, and return a tree with node j as the root. In other words, we locate an English phrase that contains both of the matched phrases (or matched subtrees). In this example, since the first subtree dominates the second and third subtrees, the “joinNode” operation returns the first subtree as the generated comment. The pseudo-code for merging the subtrees is shown in Listing 5.1.

// Obtain the parse tree of the input sentence
parseTree = getParseTree(inputSentence);
// Locate VP-NP and NP-VP patterns in the tree
vpnp = getTrees(parseTree, "VP << (NP < /NN.?/) < /VB.?/");
...

Listing 5.1: Pseudo-code for merging the matched subtrees.
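As a concrete illustration of the extraction and merging steps, the sketch below uses Stanford CoreNLP's constituency parser and Tregex, assuming the CoreNLP library and English models are on the classpath. It is a minimal sketch of the approach described above, not CloCom+'s actual code, and the printed result is only the expected outcome for this example sentence.

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.trees.TreeCoreAnnotations;
    import edu.stanford.nlp.trees.tregex.TregexMatcher;
    import edu.stanford.nlp.trees.tregex.TregexPattern;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;

    public class SubtreeExtractor {

        public static void main(String[] args) {
            String sentence = "You can use this method to capture the stacktrace in a String.";

            // Parse the sentence with the CoreNLP constituency parser.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,parse");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
            Annotation doc = new Annotation(sentence);
            pipeline.annotate(doc);
            Tree tree = doc.get(CoreAnnotations.SentencesAnnotation.class).get(0)
                           .get(TreeCoreAnnotations.TreeAnnotation.class);

            // The two Tregex patterns from Equations (5.1) and (5.2).
            TregexPattern vpNp = TregexPattern.compile("VP << (NP < /NN.?/) < /VB.?/");
            TregexPattern npVp = TregexPattern.compile("NP !< PRP [<< VP | $ VP]");

            // Collect every matched subtree.
            List<Tree> matches = new ArrayList<>();
            for (TregexPattern pattern : new TregexPattern[] {vpNp, npVp}) {
                TregexMatcher matcher = pattern.matcher(tree);
                while (matcher.findNextMatchingNode()) {
                    matches.add(matcher.getMatch());
                }
            }

            // Merge the matched subtrees via their least common ancestor (joinNode)
            // and print the refined sentence formed by the merged subtree's leaves.
            if (!matches.isEmpty()) {
                Tree merged = matches.get(0);
                for (int i = 1; i < matches.size(); i++) {
                    merged = tree.joinNode(merged, matches.get(i));
                }
                StringBuilder refined = new StringBuilder();
                for (Tree leaf : merged.getLeaves()) {
                    if (refined.length() > 0) {
                        refined.append(' ');
                    }
                    refined.append(leaf.value());
                }
                System.out.println(refined); // expected: "use this method to capture the stacktrace in a String"
            }
        }
    }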

5.3.3 Code Clone Detection

We utilized an existing token-based clone detection tool that was implemented in our previous work, CloCom [7], to detect matched code segments between Stack Overflow and GitHub projects on one side and the target project from the user on the other. However, other existing code clone detection tools that perform subtree analysis on abstract syntax trees (ASTs), or control and data flow analysis on program dependence graphs (PDGs), can also be utilized for clone detection. Previous work summarized the differences between the existing tools [118].

CloCom's code clone detection algorithm is based on DuDe's algorithm [119], a code clone detection tool based on a scatter-plot algorithm. CloCom compiles the input source code into an AST and transforms the AST into tokens to detect code clones. Listing 5.2 shows the pseudo-code of the code clone detection algorithm. The algorithm first obtains a list of methods from each file and all the hashed statements within each method. The scatter plot is then populated based on the hash numbers. The remainder of the algorithm is the same as DuDe's algorithm: it detects non-gapped clones and the gap locations, and builds the longest chain of code clones that does not exceed the maximum gap size. Lastly, it adds the clone chains that satisfy the minimum length as code clones.

CloCom provides a simple set of low-level AST tokenization rules for code clone detection. These rules provide a tradeoff between the accuracy and yield of the tool.

for method1 in file1:
  for method2 in file2:
    statements1 = method1.getStatements();
    statements2 = method2.getStatements();
    for statement1 in statements1:
      for statement2 in statements2:
        if statement1.hashNumber() == statement2.hashNumber():
          populateScatterPlot();
        end
      end
    end
  end
chainList = detectNonGappedCloneChains(scatterPlot);
gapMap = buildGapMap(chainList);
listChains = buildLongestChain(chainList, gapMap, maxGapSize);
for chain in listChains:
  if chain.length() > minLength:
    recordAsClone();
  end
end

Listing 5.2: Code clone detection tool's algorithm.

For example, allowing variable name differences between code clones significantly increases the yield while having a negligible impact on the accuracy.

CloCom requires the code segment to be compilable by the Eclipse Java compiler in order to extract the AST of the code. Most clone detection tools, except for token-based clone detection tools that do not rely on the AST for tokenization (e.g., SIM [112]), require compilable code. Stack Overflow contains code segments that are not wrapped inside a class or method, which makes them uncompilable. Therefore, we implemented a simple heuristic similar to previous work [216], which wraps the code segment with a method or class to ensure it is compilable. CloCom+ attempts to repair a code segment if it fails to compile. It checks for an import statement, package statement, class signature, and method signature in the code segment. If the code is missing a method signature, it appends “public class A { public static void main(String[] args) {” to the front and “}}” to the back of the code segment. Similarly, if the code is only missing a class signature, it appends “public class A {” and “}” to the code segment. This repairs the majority of the code segments into a compilable state.
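The following is a minimal sketch of such a wrapping heuristic. The class name and the regular expressions used to detect existing class and method signatures are rough assumptions for illustration, not the exact checks performed by CloCom+; the wrapped string could then be handed to the Eclipse Java compiler (or any Java parser) to obtain an AST.

    /** Wraps an incomplete Stack Overflow snippet so it can be parsed as a compilation unit. */
    public class SnippetWrapper {

        /** Returns true if the snippet already declares a top-level type. */
        static boolean hasClassSignature(String snippet) {
            return snippet.matches("(?s).*\\b(class|interface|enum)\\s+\\w+.*");
        }

        /** Very rough check: a type followed by a name and a parameter list before a '{'. */
        static boolean hasMethodSignature(String snippet) {
            return snippet.matches("(?s).*\\b\\w+\\s+\\w+\\s*\\([^)]*\\)\\s*\\{.*");
        }

        /** Adds dummy wrapper code so the snippet forms a compilable unit. */
        static String wrap(String snippet) {
            if (hasClassSignature(snippet)) {
                return snippet; // already a top-level type
            }
            if (hasMethodSignature(snippet)) {
                // Bare method: wrap it in a dummy class.
                return "public class A {\n" + snippet + "\n}";
            }
            // Bare statements: wrap them in a dummy class and main method.
            return "public class A { public static void main(String[] args) {\n"
                    + snippet + "\n}}";
        }
    }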

1 double d2;
2 try {
3   d2 = Double.parseDouble(conditionValue);
4 } catch (NumberFormatException fne) {

1 double result;
2 try {
3   result = Double.parseDouble(testString);
4   System.out.println("Success!");
5 }
6 catch (NumberFormatException nfe) {

Figure 5.8: Top code segment from the Java project FreeMind matched against the bottom code segment from Stack Overflow (post #35109636). The highlighting depicts the missing statement between the two code segments. CloCom+ generated “Check if a string is a floating point number.” as the source code comment.

In the matching algorithm, CloCom supports clones that contain gaps [217]. Figure 5.8 shows a code segment from the project FreeMind that is matched against Stack Overflow post #35109636, where four statements had been matched between the two code segments. The two code segments contain one gap: statements 1–3 from the Stack Overflow code segment are connected to statement six with a gap at line four. The gap in the Stack Overflow code segment contains an extra statement that invokes println(). We configured the matching algorithm to allow a maximum gap size of two for each code clone pair, which allows a maximum of one modified or deleted (or equivalently, inserted) statement at each gap location. We configured the gap size threshold to be as high as possible without sacrificing the quality of the obtained code clones. A gap size that is too high can render the comment from Stack Overflow invalid due to the additional statements. We limit the number of altered statements to one statement at each gap point.

We configure CloCom to report clones that contain a minimum of four matched statements. The threshold that controls the code clone's size is minCloneSize. A smaller clone size requirement increases the number of detected code clones, but it also increases the number of false positives because the clones will contain less content. We studied the impact of different values of minCloneSize on the yield. Figure 5.9 shows the size distribution of all the returned code segments when minCloneSize is configured to three statements. Since a large number of the code segments (2,220) contain a size of three statements, we selected a clone size of three as the threshold for the evaluation in this work.


Figure 5.9: Code clones’ size distribution in Stack Overflow.

5.3.4 Code Clone Pruning

The output of the code clone detection tool consists of pairs of code segments that have a similar syntactical structure. It is important to ensure a high level of semantic matching to generate accurate and useful comments.

Line Percentage Matching (Stack Overflow only): To ensure the extracted description sentence applies to the matched code segment in the target software, we have to ensure that a high proportion of the Stack Overflow code segment is matched against the code in the target project. Therefore, CloCom+ calculates a percentage matching score as a filtering metric. Specifically, for each Stack Overflow code segment, we count the number of lines in the extracted code segment (cTotal) and the number of lines that were matched by the code clone detection tool (cMatched). CloCom+ calculates the percentage matching score (PMS) using Equation 5.3 with a 60% threshold: at least 60% of the effective lines have to match. We select this threshold because it tolerates 1–2 unmatched lines for small code segments. For example, if a Stack Overflow code segment contains five lines of code, the threshold would require the matching of at least three lines of code. We configure this threshold based on our observation that matching less than 60% of the effective lines of code often renders the source code comment invalid, because a lower threshold means more code is not matched during the code clone detection process.

int i;
int j;
int k;

Figure 5.10: Example of a repetitive clone.

PMS = cMatched / cTotal    (5.3)

We varied the threshold with different values (40%, 60%, and 80%) on a Java project, GanttProject, which generated 12, 8 and 5 unique clone groups respectively. We manually classified the generated comments within each clone group and determined that 60% provides the best tradeoff between yield and quality. At 40%, 8 out of the 12 comments do not describe the code; at 60%, 4 of the 8 comments do not describe the code; at 80%, 1 of the 5 comments does not describe the code.

Repetitive Clone Pruning: CloCom+ removes code clones that contain repeated statements, which commonly occur in variable declaration code. Such code is not useful as a code clone and is often heavily reported by the code clone detection tool. The pruning algorithm checks the hash value of each statement to determine if two statements are the same. Statements that are identical in structure and differ only in their variable names are the only code removed by this heuristic. In Figure 5.10, since our code clone detection tool tolerates variations in variable names, the three statements are considered to be the same. Therefore, CloCom+ removes the code clone.

Other Filters: To ensure high semantic matching, CloCom+ requires the matched code segment to contain at least one statement that performs a method invocation. Furthermore, CloCom+ prunes code matches that contain over ten source code statements because they often contain multiple semantic units, and Stack Overflow is unlikely to contain detailed enough descriptions.
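A minimal sketch of these two pruning checks is shown below. The class and method names are illustrative, and the repetitive-clone check is an approximation of the described heuristic based on statement hash values.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ClonePruning {

        static final double MIN_PERCENTAGE_MATCH = 0.60;

        /** Percentage matching score (Equation 5.3): matched lines over total lines. */
        static double percentageMatchingScore(int cMatched, int cTotal) {
            return (double) cMatched / cTotal;
        }

        /** Keeps a Stack Overflow match only if at least 60% of its effective lines matched. */
        static boolean passesLineMatching(int cMatched, int cTotal) {
            return percentageMatchingScore(cMatched, cTotal) >= MIN_PERCENTAGE_MATCH;
        }

        /** Flags a clone as repetitive when any statement hash occurs more than once. */
        static boolean isRepetitiveClone(List<Integer> statementHashes) {
            Set<Integer> distinct = new HashSet<>(statementHashes);
            return distinct.size() < statementHashes.size();
        }
    }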

5.3.5 Comment Selection

For each remaining match, there can be one or more description sentences from the title and body of the post available as a comment candidate. If the code from the target

project matches multiple Stack Overflow code segments, CloCom+ considers all of the description sentences of each Stack Overflow code segment as candidates and selects the best comment. Note that if an automatically generated comment already exists in the target project, CloCom+ automatically removes it from the candidate pool.

Text Similarity: To select the best description sentences from the remaining sentences, CloCom+ measures the text similarity between each candidate description sentence and the code segment in the target software. Algorithm 2 describes the steps to perform text similarity for each clone group. A clone group refers to a group of code segments with a similar structure identified by the code clone detection tool. The goal is to obtain a list of comment candidates that satisfy the text similarity requirement (having at least one common term between the sentence and the code segment). Lines 2–3 first obtain the list of master clones (clones from the target project) and the list of database clones (clones from Stack Overflow and GitHub). Line 4 iterates through each database clone. Line 5 extracts a list of terms from the code segment of the database clone; specifically, it retrieves all the identifiers (e.g., variable names and method names) within the method that surrounds the code segment. Line 6 obtains the list of comments in the current database clone, and line 8 obtains a list of stemmed words for each sentence. Text similarity relies on obtaining a list of common terms between the code segment and the words in the code comment (line 9). The common terms must exist in each master clone (line 11) to ensure the comment candidate is not specific to the database clone. If the common terms exist in all of the master clones (line 19), then the comment is added to the candidate list (line 20).

5.4 Evaluation Method

We conducted a user study3 with externally recruited participants (engineering students and industrial programmers) to evaluate the quality of the comments generated by CloCom+ and by a recent previous work, SumSlice [207].

RQ1: Are the automatically generated comments by CloCom+ complete, concise, expressive, and useful in describing the code?

The research question evaluates the baseline quality of the automatically generated comments, similar to recent previous work on automatic unit test case documentation [218].

3 This research has received ethics clearance approval (#18754) from the University of Waterloo's Office of Research Ethics.

Algorithm 2: Algorithm for text similarity.
Input: cloneGroup
Output: commentCandidateList
 1  commentCandidateList ← null
 2  masterCloneList ← masterClonesInCloneGroup(cloneGroup)
 3  databaseCloneList ← databaseClonesInCloneGroup(cloneGroup)
 4  for databaseClone ∈ databaseCloneList do
 5      DC_terms ← getSimpleTerms(databaseClone)
 6      listComments ← getListComments(databaseClone)
 7      for sentence ∈ listComments do
 8          wordList ← getListStemmedWords(sentence)
 9          mustExistTerms ← getCommonTerms(DC_terms, wordList)
10          satisfiedRequirement ← true
11          for masterClone ∈ masterCloneList do
12              MC_terms ← getSimpleTerms(masterClone)
13              status ← allTermsExists(mustExistTerms, MC_terms)
14              if status = false then
15                  satisfiedRequirement ← false
16                  break
17              end
18          end
19          if satisfiedRequirement = true then
20              commentCandidateList ← add(sentence)
21          end
22      end
23  end

It rates a comment based on its completeness (i.e., it does not miss important information), conciseness (i.e., it does not contain redundant or useless information), expressiveness (i.e., readability), and usefulness (i.e., it helps developers understand the code). We added the usefulness criterion, which the previous work did not evaluate. The usefulness criterion refers to the value of having the comment present for understanding the source code. We evaluate the usefulness criterion because a comment can be complete, concise, and expressive, but still not help developers understand the code.

The majority of the participants rated strongly agree or weakly agree on the completeness, conciseness, expressiveness, and usefulness of the generated comments. However, the statistical tests failed to show agreement between the participants. The automatically generated comments are not ready for real-world usage due to the lack of quality consistency.

RQ2: Are the automatically generated comments by SumSlice complete, concise, expressive, and useful in describing the code?

The research question allows us to contrast our work with a recent previous work, SumSlice [207]. SumSlice generates code summaries for Java methods and is the closest piece of work to CloCom+. The evaluation approach follows RQ1's criteria and steps.

SumSlice is capable of generating method summaries that are more complete at describing the code segment. Specifically, the natural language summary that is generated from the method signature provides a clear explanation of the code segment. However, the generated summary is not as concise as CloCom+'s since it includes additional sentences that are not as useful to the participants (i.e., sentences that describe the importance of a method using PageRank).

RQ3: What are the participants' opinions on the advantages and shortcomings of CloCom+ and SumSlice?
The research question is addressed through a post-study to understand the pros and cons of CloCom+ and SumSlice. Unlike RQ1 and RQ2, which are formulated as a quantitative study based on a 4-point Likert scale, here we study the participants' qualitative responses.

For CloCom+, the majority of the participants expressed concerns about its consistency at generating quality code comments. However, they seem to welcome the ability to link a Stack Overflow post to the target code segment. For SumSlice, the majority of the participants agree that the method summaries are clear and complete at describing the code; in particular, they valued the ability to generate a natural language sentence from the method signature. Several expressed concerns that the summary is a pure rephrase of the code and that the summary is not very concise.

RQ4: What are the reasons that the automatically generated code comment is not applicable to the code segment?
The research question attempts to understand the reasons that cause weak ratings on the completeness, conciseness, expressiveness, and usefulness criteria.

In terms of completeness, the weak score is due to the comment missing information or containing useless information. Specifically, CloCom+ fails to generate comments that contain contextual information. For conciseness, the weak score is due to the sentence being informally written. For expressiveness, it is due to the sentence being hard to understand or containing wrong information. For usefulness, it is due to the failure to filter out trivial comments.

RQ5: What are the properties of the developer-written and participant-written comments?
The research question attempts to understand the properties of human-written comments (from developers and participants in the study). The purpose is to guide the design of future automatic comment generation tools.

The result shows that the manually inspected 110 JavaDoc and 17 single-line developer-written comments all contain a maximum of two sentences. The code comments often simply provide information about the action or purpose of the source code, and they rarely directly reuse the identifier names. The manually inspected 135 participant-written comments have three properties: they are often a natural rephrase of the source code (84.4%), they only contain a single sentence (97%), and the sentence often starts with a verb (84.4%).

5.4.1 Experimental Settings

We apply CloCom+ to analyze Stack Overflow questions that contain the Java or Android tag to create a code-description mapping database with a total of 257,504 code-description mappings. After applying the techniques, we generated a total of 442 unique comments (clone groups). Table 5.5 shows the total number of lines of code, calculated using CLOC [219], for each of the evaluated open-source Java projects. It also shows the number of uniquely generated comments (clone groups) per project. A clone group represents a group of code segments that are structurally similar to each other, which can all be described with the same code comment (see Section 5.3.5 for further details). We run the 16 Java projects against the Java database. Sridhara et al. evaluated a total of 15 Java projects in their work. In our evaluation, the 16 Java projects include the same 15 Java projects that were evaluated by previous work [90], and one additional Java project, Eclipse.

For the comparison against SumSlice [220], we downloaded the evaluation package of SumSlice from their website, which contains the generated code summaries for all the methods of the Java project NanoXML (286 summaries in total). Note that SumSlice only generates comments for Java methods, whereas CloCom+ generates comments for code segments. We attempted to run CloCom+ on NanoXML, but we were not able to generate any code comments due to the limited yield of our tool. Therefore, we do not attempt to perform a direct comparison of CloCom+ against SumSlice over the same piece of source code; instead, the comparison is indirect since we cannot generate comments over the same code. However, we ran CloCom+ on all the Java projects that SumSlice [207] had evaluated, which include NanoXML, Siena, JTopas, Jajuk, JEdit, and JHotDraw, and show the results for these projects in Table 5.4.

5.4.2 Human Participants

The study includes 20 human participants to rate the comments generated by CloCom+. We recruited student participants from the University of Waterloo's Computer Science, Software Engineering, and Electrical and Computer Engineering departments using the school's internal mailing list. The evaluator group includes nine graduate students, nine undergraduate students, and three industrial programmers. All participants were required to have a minimum of one year of industry programming experience. The participants have an average of four years of industrial programming experience (between 1 and 10 years). We remunerated each participant with 15 Canadian dollars for the completion of the study.

Table 5.4: Results of the six SumSlice projects by CloCom+.

Java Project   Lines of Code   # Generated Comments (Clone Groups)   # Commented Locations   Exec. Time (min)
JHotDraw       32,122          7                                     8                       21
Jajuk          68,767          11                                    19                      42
NanoXML        5,051           0                                     0                       7
Siena          52,751          8                                     15                      35
JTopas         9,037           0                                     0                       10
JEdit          123,399         15                                    15                      61
Total          291,127         41                                    57                      176

Table 5.5: List of the 16 evaluated projects by CloCom+. CloCom+ generated 442 comments for 780 code locations in the target projects. The numbers in brackets represent the results of CloCom.

Java Project    Lines of Code   # Generated Comments (Clone Groups)   # Commented Locations   Exec. Time (min)
Eclipse SDK     3,488,835       235 (84)                              447 (96)                1,100
FreeCol         133,637         3 (9)                                 4 (15)                  37
FreeMind        68,677          6 (6)                                 6 (9)                   23
GanttProject    62,409          8 (2)                                 8 (4)                   31
Hibernate       581,642         28 (9)                                50 (12)                 214
HSQLDB          168,239         43 (27)                               60 (35)                 80
JabRef          93,685          7 (18)                                9 (20)                  38
Jajuk           68,767          11 (5)                                19 (8)                  42
JavaHMO         25,631          8 (3)                                 22 (5)                  18
JBidWatcher     28,357          15 (11)                               17 (16)                 23
JFtp            22,394          2 (2)                                 10 (2)                  17
JHotDraw        32,122          7 (6)                                 8 (7)                   21
MegaMek         295,412         19 (8)                                31 (13)                 138
Planeta         13,574          1 (1)                                 1 (1)                   11
Sweet Home 3D   97,210          12 (9)                                14 (15)                 51
Vuze            613,557         37 (15)                               74 (19)                 257
Total           5,794,148       422 (215)                             780 (277)               2,101

5.4.3 Questionnaire Generation

We provide each participant with a digital questionnaire generated using Google Forms. Each questionnaire contains 15 questions. Since we require multiple opinions for each question, every participant assigned to a questionnaire answers the same set of questions. In order to increase the sample size, we created two questionnaires and performed the study with 20 participants. The participants are separated into two groups of 10 participants each: group one answers the first questionnaire and group two answers the second questionnaire. The two questionnaires are shared on our website. Each questionnaire contains 15 generated comments (12 questions for evaluating CloCom+ and 3 questions for evaluating SumSlice [207]). We document the questionnaire generation process below. Each participant within a group receives the same 15 questions (one comment per question) to ensure we can obtain multiple opinions (10) per question. The multiple opinions allow us to measure the level of agreement between the participants with statistical tests. We randomly sampled 24 comments from the 422 unique comments from CloCom+, and 6 comments from the 286 unique comments from SumSlice, for the evaluation. Each group evaluates 12 comments for CloCom+ and 3 comments for SumSlice.

5.4.4 Study Procedure

To ensure the format of the user study is clear and easy to understand, we asked two undergraduate students to perform a pilot study. At the end of the pilot, we collected feedback from the students about the study; they reported that all the questions and steps were easy to complete and comprehensible. In the main study, we first show the code segment and ask the participants to write a comment for the code. Asking participants to write down a comment can help them understand the code segment before rating the automatically generated comment. Since a code segment can be hard to understand without context, we show the surrounding code of the code segment to help participants understand it. Participants may give up on writing a comment if they find it too difficult. To avoid overwhelming the participants with an excessive amount of information, we show an average of 20 surrounding code lines per code segment. Second, we ask participants to rate the comments with a four-point Likert scale [221] on the completeness, conciseness, expressiveness, and usefulness criteria, similar to previous work [222]. Following Moreno et al.'s recent work [223], we asked participants to rate the answers using a 4-point Likert scale instead of a 5-point Likert scale to avoid neutral answers. We use the following four-point Likert scale: (1) Strongly Agree, (2) Weakly Agree, (3) Weakly Disagree, and (4) Strongly Disagree.

Since CloCom+ is capable of linking a code segment to the related Stack Overflow post, we show the link to the related Stack Overflow post and ask the participant to rate its usefulness whenever it is available. Previous work by Papineni et al. [224] proposed the usage of the BLEU score to evaluate the predictions that are made by automatic machine translation (MT) systems. Existing work on commit message generation [225] and automatic comment generation [96] also utilizes this metric to measure the quality of the generated text. The BLEU score measures the number of matching n-grams between the candidate and reference translations, and it relies on the existence of a reference code comment as a baseline for the measurement. Since CloCom+ generates code comments for code segments within a Java method instead of generating code comments for Java methods, a reference code comment may not always be available for the calculation of the BLEU score. Therefore, we decided to use an evaluation that involves human participants [222] with a four-point Likert scale to rate the generated comments. An advantage of human participants is that they are able to rate the usefulness of the comments, which cannot be measured by the BLEU score.
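For reference, the BLEU score combines modified n-gram precisions with a brevity penalty; the standard definition from Papineni et al. [224] is

\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r,\\
e^{\,1 - r/c} & \text{if } c \le r,
\end{cases}
\]

where p_n is the modified n-gram precision, w_n the n-gram weight (typically uniform, w_n = 1/N), c the length of the candidate text, and r the length of the reference text.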

5.4.5 Post Study Questions

After the study, we reveal to the participants that the first 12 questions were generated using CloCom+ (tool #1), and the last three questions were generated using SumSlice (tool #2). We asked two post-study questions after the participant had completed the questionnaire to understand the utility of the techniques.
PQ1: Which tool do you prefer more? The participants may choose one of "Tool #1," "Tool #2," "I don't like any of them," or "I like both of them."
PQ2: Please leave a comment on your thoughts on tool #1 and tool #2 (e.g., pros and cons): The question allows the participant to enter a free-form answer.

5.4.6 Replication Package

The source code, code-description mapping database, evaluation output of the tool, and user study data are available on our website4.

5.5 Evaluation Results

In this section, we show the results from the user study. Our tool generated a total of 442 comments (mapping to 782 code locations) for the 16 Java projects. A breakdown of the number of generated comments is shown in Table 5.5. We also show the number of comments generated by a previous work, CloCom [7], on the same 16 Java projects in Table 5.5 for comparison.

5.5.1 Participant Ratings

We show the human judgment results from the user study in Table 5.6. In the design of our study (Section 5.4.4), we ask participants to write down a comment to describe the given code segment. We collected 240 answers over the 20 participants (12 questions from each group). Participants gave up on writing a comment for 19 (group 1) and 33 (group 2) of the 240 questions because they did not feel that they fully understood the code segment. Amongst all the results, 23 (group 1) and 20 (group 2) of the 240 responses contain a rating of strongly agree on all four criteria (completeness, conciseness, expressiveness, and usefulness). The result shows that CloCom+ is not able to consistently generate high-quality code comments for code segments. Based on the 240 responses, the majority of the participants (more than 50%) strongly agree or weakly agree that the generated comments are complete, concise, expressive, and useful in helping them understand the code segments. Compared to our previous work [208], the yield and quality have improved, but they are still not good enough for direct usage.

We summarize the agreement between the participants using a statistical measure called Fleiss' kappa [226] (κ in Table 5.6), which allows us to measure the inter-rater agreement of three or more raters. A κ value of less than 0 means poor agreement, while a value of 1 means perfect agreement. We applied the test to the two groups, where each group contains 10 participants.

4http://asset.uwaterloo.ca/AutoComment2/

Table 5.6: Human participant judgments on the generated comments by CloCom+. CP: completeness; CS: conciseness; EP: expressiveness; US: usefulness.

Human Participant Responses

Responses (set 1)     CP            CS            EP            US
Strongly Agree        37 (30.8%)    69 (57.5%)    53 (44.2%)    47 (39.2%)
Weakly Agree          33 (27.5%)    29 (24.2%)    40 (33.3%)    33 (27.5%)
Weakly Disagree       29 (24.2%)    17 (14.2%)    21 (17.5%)    24 (20%)
Strongly Disagree     21 (17.5%)    5 (4.2%)      6 (5%)        16 (13.3%)
Total                 120 (100%)    120 (100%)    120 (100%)    120 (100%)
Fleiss' Kappa         -0.01         -0.01         -0.04         -0.01

Responses (set 2)     CP            CS            EP            US
Strongly Agree        74 (30.8%)    111 (46.3%)   99 (41.3%)    85 (35.4%)
Weakly Agree          61 (25.4%)    70 (29.2%)    68 (28.3%)    68 (28.3%)
Weakly Disagree       68 (28.3%)    39 (16.3%)    53 (22.1%)    61 (25.4%)
Strongly Disagree     37 (15.4%)    20 (8.3%)     20 (8.3%)     26 (10.8%)
Total                 120 (100%)    120 (100%)    120 (100%)    120 (100%)
Fleiss' Kappa         0.13          0.08          0.02          0.04

1 if ( url == null ) {
2     final ClassLoader standardClassloader = Thread.currentThread().getContextClassLoader();
3     if ( standardClassloader != null ) {
4         url = standardClassloader.getResource( configurationResourceName );
5     }
6     if ( url == null ) {

Figure 5.11: Code segment from the Hibernate project for the classloader comment.

The test is done on the four criteria (i.e., completeness, conciseness, expressiveness, and usefulness) on a four-point Likert scale over twelve subjects (questions). In the first group, we obtained κ values of -0.01 for completeness, -0.01 for conciseness, -0.04 for expressiveness, and -0.01 for usefulness. The result shows poor agreement overall. In the second group, we obtained κ values of 0.13 for completeness, 0.08 for conciseness, 0.02 for expressiveness, and 0.04 for usefulness. The κ values indicate little to no agreement between the participants.
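For reference, Fleiss' kappa over N subjects, n raters per subject, and k categories (in our setting, N = 12 questions, n = 10 raters per group, and k = 4 Likert categories) follows the standard definition [226]:

\[
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e},
\qquad
P_i = \frac{1}{n(n-1)}\left(\sum_{j=1}^{k} n_{ij}^{2} - n\right),
\qquad
\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i,
\qquad
\bar{P}_e = \sum_{j=1}^{k}\left(\frac{1}{Nn}\sum_{i=1}^{N} n_{ij}\right)^{2},
\]

where n_ij is the number of raters who assigned subject i to category j.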

Criterion 1: Completeness

For the completeness criterion, 30.8% and 30.8% of the participants from the two groups strongly agree that the generated comments are complete. The main cause of disagreement is that CloCom+ cannot extract comments that contain correct project-specific information, which usually constitutes important information. CloCom+ determines the fitness of a sentence based on the post's score and the calculated text similarity score. While using these metrics generally makes a good code comment more likely, it is not always true. For example, our tool generated the comment, "Try the current Thread context classloader," for the code segment from the Hibernate project in Figure 5.11. It selected this sentence because the code segment shares the text-similarity terms "current," "context," "classload," and "thread" with the comment (Section 5.3.5). The automatically generated comment describes the threading code, but it does not describe the semantics of the URL variable in line one, which contains project-specific information. This is a limitation of the crowd-sourced approach for automatic comment generation, where the generated comment either contains project-specific information that is likely to be incorrect or contains no project-specific information.

} else {
    File dir = file.getParentFile();
    if (!dir.exists()) {

Figure 5.12: Code segment from project Vuze.

if (bytesLeft >= length) {
    // we add the data to the end of the buffer
    System.arraycopy(b, offset, buffer, bufferPosition, length);
    bufferPosition += length;

Figure 5.13: Code segment from project mdrill.

Criterion 2: Conciseness

57.5% and 46.3% of the participants from the two groups strongly agree that the generated comments are concise. The main cause of disagreement on conciseness is that CloCom+ usually describes generic code segments with generalized text. For example, our tool generated the comment (Stack Overflow post #3693152), "Sometimes the directory entries aren't already created," for the code segment from the Vuze project in Figure 5.12. The code only requires information about creating a file if it does not exist. The problem is that Stack Overflow sentences are written for people who want a descriptive explanation of the code's functionality, which is the reason they can lack conciseness. In this example, words such as "sometimes" make the sentence less formal and less concise. CloCom+ also extracts code comments from existing source code (i.e., GitHub). However, code comments from existing code can suffer from the same issue. For example, our tool generated the comment "we add the data to the end of the buffer" for the Vuze project by extracting it from code in the GitHub project mdrill, shown in Figure 5.13.

Criterion 3: Expressiveness

44.2% and 41.3% of the participants from the two groups strongly agree that the generated comments are expressive. The main cause of disagreement on expressiveness is the automatically generated comment containing wrong information. However, there are also some instances where the participant interpreted the code in a different way, which causes disagreement.

1 String get_line = lines.get(0);
2 get_line = get_line.substring( get_line.indexOf(' ') + 1 ).trim();
3 get_line = get_line.substring( 0, get_line.indexOf(' ') ).trim();
4 int x_pos = get_line.indexOf('?');
5 if ( x_pos != -1 ) {
6     get_line = get_line.substring( 0, x_pos );
7 }
8 x_pos = get_line.lastIndexOf('/');

Figure 5.14: Code segment for the comment "remove query string part."

For example, CloCom+ generated "remove query string part" for lines 4 to 6 of the code segment in Figure 5.14. One of the participants wrote the comment, "Get line up to first question mark," for the code segment and rated the automatically generated comment as strongly not expressive even though both comments are technically correct.

Criterion 4: Usefulness

39.2% and 35.4% of the participants from the two groups strongly agree that the generated comments are useful to developers. The main cause of disagreement on usefulness is that the code is easy to understand (so that no comment is needed to aid comprehension) or the comment is too trivial. CloCom+ generated the comment, "Splitting a string at a particular position in java," for the code segment (Stack Overflow post #28419586) in Figure 5.15. It is considered not useful by the participants because the code is performing a rather simple operation. To improve the usefulness of the comments that are generated automatically by CloCom+, we may design code and comment complexity metrics to filter out simple comments and simple code segments in future evaluations; a sketch of such a filter is shown after Figure 5.15. Another way to tackle this issue is to utilize a different data source.

int c = arg.indexOf(""); //$NON-NLS-1$
int t = arg.indexOf("", c + 1); //$NON-NLS-1$
String className = arg.substring(0, c);
String testName = arg.substring(c + 1, t);
String status = arg.substring(t + 1);
String testId = className + testName;

Figure 5.15: Code segment for the comment “Splitting a string at a particular position in java.”
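As an illustration only, a minimal sketch of such a triviality filter follows; the class name, helper signature, and thresholds are hypothetical and not part of CloCom+.

// Hypothetical triviality filter (illustrative only, not part of CloCom+): discard a candidate
// comment when the comment is very short or the matched code segment has only a few statements,
// on the assumption that such code is likely self-explanatory.
final class TrivialityFilter {
    private static final int MIN_COMMENT_WORDS = 4;    // assumed threshold
    private static final int MIN_CODE_STATEMENTS = 3;  // assumed threshold

    static boolean isTrivial(String comment, int codeStatementCount) {
        int words = comment.trim().split("\\s+").length;
        return words < MIN_COMMENT_WORDS || codeStatementCount < MIN_CODE_STATEMENTS;
    }
}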

SumSlice Results

We summarize the results of SumSlice in Table 5.7. Note that we only assigned each participant three SWUM-generated code comments for evaluation (the evaluation of CloCom+ has twelve) to limit the total number of questions to fifteen per participant. The distribution of the responses is slightly different compared to CloCom+. Their technique achieved a much higher completeness rating than CloCom+ because it can generate a natural language sentence that describes the entire method signature (e.g., return variable, method name, and input arguments), as participants also pointed out during the post-study. The interesting observation is the conciseness rating, which is worse than CloCom+'s. The reason is that SumSlice generates a second sentence that rates the method's importance based on the PageRank algorithm. For example, SumSlice generated most comments with a structure similar to the following: "This method gets the full name and returns a String. GetFullName() seems less important than average because it is not called by any methods." Another possible explanation for the conciseness rating is that the participants became used to the short comments generated by CloCom+ during the evaluation, which causes a sudden change in expectation when they encounter comments generated by SumSlice.

Post Study

PQ1 asks for the participants' preference between CloCom+ and SumSlice. Five participants like both tools (25%), seven participants like CloCom+ more (35%), five participants like SumSlice more (25%), and three participants do not like either of them (15%). PQ2 asks for the participants' open opinions on the pros and cons of the two tools. We randomly sampled eight participants and display their opinions in Table 5.8.

Table 5.7: Human participant judgments on the generated comments by SumSlice. CP: completeness; CS: conciseness; EP: expressiveness; US: usefulness.

Human Participant Responses

Responses (Set 1)     CP            CS            EP            US
Strongly Agree        22 (73.3%)    6 (20%)       14 (46.7%)    15 (50%)
Weakly Agree          4 (13.3%)     6 (20%)       10 (33.3%)    8 (26.7%)
Weakly Disagree       4 (13.3%)     5 (16.7%)     6 (20%)       4 (13.3%)
Strongly Disagree     0 (0%)        13 (43.3%)    0 (0%)        3 (10%)
Total                 30 (100%)     30 (100%)     30 (100%)     30 (100%)

Responses (Set 2)     CP            CS            EP            US
Strongly Agree        8 (26.7%)     3 (10%)       7 (23.3%)     7 (23.3%)
Weakly Agree          8 (26.7%)     10 (33.3%)    12 (40%)      9 (30%)
Weakly Disagree       8 (26.7%)     9 (30%)       7 (23.3%)     8 (26.7%)
Strongly Disagree     6 (20%)       8 (26.7%)     4 (13.3%)     6 (20%)
Total                 30 (100%)     30 (100%)     30 (100%)     30 (100%)

An analysis of the post-study comments shows that there is no clear answer as to which tool is better (35% for CloCom+ and 25% for SumSlice). The majority of the participants stated that CloCom+ has the issue of not generating quality comments consistently (participants 1, 2, 3, 5, and 7). There are also cases where CloCom+ fails to generate a code comment that takes the context into account (participant 8). Several of the participants expressed concerns about SumSlice simply rephrasing the method signature (participants 6, 7, and 8). However, SumSlice is able to generate a much more concise and clean summary in comparison (participants 1, 2, 3, 4, and 5).

5.5.2 Execution Time

We executed CloCom+ on an Intel Core i5-3470 CPU with 16 GB of RAM. The database generation (Section 5.3.1) took roughly half an hour to execute. The per-project execution time of the clone detection tool (Section 5.3.3), NLP (Section 5.3.2), code clone pruning (Section 5.3.4), and comment selection (Section 5.3.5) is shown in Table 5.5; it took 2,101 minutes in total (between 11 and 1,100 minutes per project).

Table 5.8: Human participants' opinions on CloCom+ and SumSlice. PA stands for participant.

P1 (CloCom+): "Tool 1 considers more answers and ideally chooses the one with the best answer. However it may not be relevant to the code in question."
P1 (SumSlice): "Tool 2 ensures that documentation is clear and concise, but may be harder to implement."

P2 (CloCom+): "Tool 1 had a couple of hiccups in it, that didn't make me particularly like it if all of those were from the same tool."
P2 (SumSlice): "Tool 2 kind of makes it seem like the comments were written by a snarky co-worker. While amusing, it's not super useful."

P3 (CloCom+): "The possibility of reviweing a Stack Overflow question with a similar issue the comment is refering is the best feature of this tool. Sometimes the comment generated did not add any new information, was practically the same method/english written in english (especially for simple functions), in other cases for these same type of functions the comment added content that wasn't really neccesary due to the simplicity of the function/method"
P3 (SumSlice): "Pros: The comments generated where quite concise, but in most of the cases they really communicate fully the intetion of the method or function they were commenting. I'd prefer a bit larger comment but that will provide a better understanding of the problem"

P4 (CloCom+): "I think that is more descriptive and useful"
P4 (SumSlice): "I think this is more concise and less expressive"

P5 (CloCom+): "It seems to be useful at times and not so much at other times."
P5 (SumSlice): "I really liked how the comments were very consistent. However, I don't think it is fair to compare the 2 methods since the latter was very easy cases consisting of getter/setter methods while the first method had to solve a lot of harder comments."

P6 (CloCom+): "Tool 1 seems to produce more accurate and concise comments, however, if it is going to be including stack overflow links in source code comments, the result will be very messy, especially if irrelevant stack overflow links are used."
P6 (SumSlice): "Tool 2 seems to be a lot more verbose, simply repeating what you already know from the function name. It also includes useless info that may later be irrelevant once those methods are used more."

P7 (CloCom+): "I think on average this generated more useful comments that explain what a code snippet does in the context of the larger program, although sometimes it does generate useless, redundant information."
P7 (SumSlice): "This tool doesn't really provide anything useful beyond what's already obvious to the developer."

P8 (CloCom+): "Inaccurate but sometimes can provide info. Lack of context negatively affects results."
P8 (SumSlice): "Already had comments. Comments generated seem much more clean and accurate but sometimes just repeat of original comment."

5.6 Qualitative Analysis and Discussion

In this section, we analyze the automatically generated comments by CloCom+ (Section 5.6.1); analyze the comments that are written by developers (Section 5.6.2); analyze the properties of the comments that are written by the participants in the user study (Section 5.6.3); and analyze the impact of each design component on the yield of the generated comments (Section 5.6.4).

5.6.1 Properties of the Automatically Generated Comments

To understand the quality issues of the automatically generated comments, we manually studied the 43 comments that CloCom+ produced for a large-scale Java project, HSQLDB. The first step of the analysis involves the removal of comments that do not describe the target project's source code. These comments are selected by CloCom+ despite not being related to the target project's code segment because of the comment selection technique; removing them enables us to understand how to improve other aspects of a comment under the assumption that the comment selection component can select relevant comments. In total, three out of the 43 comments do not describe the target project's code segment. The common reason that CloCom+ selects an irrelevant sentence is that the comment selection technique relies solely on shared terms between the sentences and the code segment as a heuristic for selection. For example, CloCom+ generated "Skip comments and blank lines" for a Java project because the text similarity component detected a shared term, "line," between the target project's code and the GitHub project CloudStack-archive. The problem is that the extracted comment explains code outside of the matched lines; even though the target code is reading a file line by line, it is not skipping comments or blank lines.

The second step classifies the remaining 40 sentences that describe the code segment. The process involves iterating through the list of comments and creating a label for each identified issue with the automatically generated comment. We iterated through the labels and identified four major issues. We mapped the issues back to the completeness, conciseness, expressiveness, and usefulness criteria as shown in Table 5.9. A comment can be labeled as invalid for multiple reasons. The total number of comments that exhibit at least one of the four issues is 14 (35%); the rest of the comments do not belong to any of them. The result shows similar findings as the user study (Section 5.5), where the presented comment does not contain adequate contextual information about the code segment. For example, a code segment may be reading an audio file, but our tool generates a generic comment that describes the file reading operation instead of the audio file reading operation.

While it is possible to generate comments that do not contain information about the context of the code, the usefulness of such comments is limited since they are more generic. We located five comments that contain a grammar mistake (e.g., "Are called Maps (see implementations of java.util.Map)") and three comments with an incomplete sentence (e.g., "Always English collation"), which require future work on the natural language processing component for correction. CloCom+ currently only extracts phrases from a sentence, and it does not attempt to detect incomplete sentences, which can cause issues. Regarding the size of the generated comments, all comments in the HSQLDB project consist of a single sentence, where each sentence contains an average of 10 words.

The automatically generated comments by CloCom+ contain missing and useless information. Since the generated comments are mined from existing natural language sentences, they can be wordy or informally written. The natural language processing technique that is used to refine the English sentences can cause the code comment to be harder to understand. There is also a need for a way to detect if the code is too trivial, because not all code requires a code comment.

5.6.2 Properties of Developer-written Comments

To analyze the properties of developer-written comments, we manually inspected seven source code files5 from a mature Java project, Jajuk. The seven files contain 618 lines of code and 765 lines of comments (inflated due to copyright blocks) as reported by CLOC. We first extract all the comments from the source code files, and then classify the comments by their type. We show the comment type distribution of the seven source code files in Table 5.10. We manually excluded compiler-related comments (i.e., $NON-NLS-1$), templated messages (i.e., //nothing to do here), automated block comments that contain no content, and copyright comments. We also manually aggregated consecutive single-line comments in the same paragraph into one single-line comment. The results in the comment distribution table show that the majority of the source code comments are JavaDoc comments, block comments have a low utilization rate, and single-line comments occur occasionally. All of the Java classes and methods contain a JavaDoc comment that is written by the developers.

5/jajuk-src-1.10.3/src/main/java/org/qdwizard/

Table 5.9: Reasons that an automatically generated comment is not applicable to the new code segment. The percentage is calculated over the 40 comments that describe the code segment.

Reason 1 - Completeness: The comment contains missing information that is critical for a source code comment, or contains useless information that is not needed. This can be information that is specific to the context of the code.
Example: "Converts a value to this type." is missing information about the type of the object that it is converting towards (HSQLDB, NumberType.java).
Freq.: 6 (15%)

Reason 2 - Conciseness: The comment is wordy or informally written.
Example: "Eg, if you were using Calendar and the current day." is an informal sentence that was extracted from Stack Overflow (#14542336).
Freq.: 7 (17.5%)

Reason 3 - Expressiveness: The comment is hard to read or understand, or it expresses the wrong information.
Example: "autocommit true should never throw." is hard to understand as a sentence (HSQLDB, DatabaseManager.java). It does not express the intent of the code in a natural manner.
Freq.: 7 (17.5%)

Reason 4 - Usefulness: The comment is not useful or too trivial. The comment is not needed to understand the code, or it is a direct paraphrase of the code.
Example: "Sleep for a long time" is a relatively trivial comment that many participants rated as not useful (HSQLDB, DatabaseManagerSwing.java).
Freq.: 10 (25%)

Table 5.10: Distribution of the developer-written comments in project Jajuk. LCode = lines of code; LCom = lines of comment. LCode and LCom are reported using CLOC. We also show the number of manually identified JavaDoc, single-line, and block comments.

File (.java)   LCode   LCom   # JavaDoc   # Single Line   # Block
ActionsPanel   105     71     10          3               0
ClearPoint     3       26     1           0               0
Header         60      57     7           0               0
Langpack       39      44     6           2               0
Screen         77      147    23          5               0
ScreenState    50      95     17          0               0
Wizard         284     325    46          7               0
Total          618     765    110         17              0

We also investigated the contents of the 110 JavaDoc and 17 single-line developer-written comments through manual inspection. We observe that all comments contain a maximum of two sentences, which means the comments are relatively concise. There are two main types of comments that we observed.

The first type simply provides information about the action or purpose of the source code. Here are some examples. "Set the header title text." and "Construct a screen." are method-level comments that describe the action of the source code. "Contains a wizard title, a subtitle used to display the name" describes the purpose of the source code. "Screens needing to clear wizard cache ... should implement this interface." provides the context as to when to use the Java interface. The second type simply states the full name of an object in a more meaningful manner, where an object can be a method or variable. Here are some examples. "Associated action listener." is a paraphrase of the type name ActionListener. "UI creation." is a paraphrase of the method name initUI(). "Problem panel" provides the full name of the variable jlProblem. We performed a manual classification and found that 52 of the comments are of type one and 22 of the comments are of type two. A source code comment is not necessarily always meaningful. In the manually inspected source code files, all of the classes and methods contain a JavaDoc comment. We found that nearly all the type two comments are a paraphrase of the source code. For example, "Can go previous." is a paraphrase of the method name canGoPrevious().

Table 5.11: The number of comments (obtained from an AST parser) and the comment ratio in brackets (lines of code per comment) of ten randomly sampled Java projects.

Project         Lines of Code   Block Comments   JavaDoc Comments   Single Line Comments
Eclipse         3,488,835       5,932 (588)      26,263 (132)       30,950 (113)
FreeCol         68,767          82 (839)         11,696 (6)         8,175 (8)
Jajuk           133,637         1,626 (82)       6,291 (21)         7,611 (18)
Hibernate       581,642         8,376 (69)       17,010 (34)        21,029 (28)
Vuze            613,557         5,172 (119)      8,520 (72)         21,176 (29)
HSQLDB          168,239         956 (176)        4,341 (39)         8,913 (19)
Sweet Home 3D   97,210          250 (389)        4,752 (20)         5,625 (17)
FreeMind        68,677          1,603 (43)       2,154 (32)         4,994 (14)
JavaHMO         25,631          399 (64)         131 (196)          1,352 (19)
JBidWatcher     28,357          193 (147)        710 (40)           1,518 (19)
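The comment ratio in brackets is simply the project's lines of code divided by its number of comments of that type; for example, FreeMind's JavaDoc ratio in Table 5.11 follows from

\[
\frac{68{,}677 \text{ lines of code}}{2{,}154 \text{ JavaDoc comments}} \approx 32 \text{ lines of code per JavaDoc comment.}
\]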

Although the developers of Jajuk wrote a formal JavaDoc comment for most of the Java methods, a lot of the comments are a simple paraphrase of the source code. We also show the comment ratio of 10 randomly selected Java projects in Table 5.11. The result shows that projects such as FreeMind and Hibernate have a high comment ratio compared to the other projects (e.g., FreeMind has a JavaDoc comment for every 32 lines of code). Projects such as FreeCol and Sweet Home 3D utilize a high number of JavaDoc comments over block comments, but upon manual inspection, we observed that many of the JavaDoc comments were generated automatically and contained no content inside the block.

The study on developer-written comments shows that block comments are rarely used in practice compared to JavaDoc comments and single-line comments. The JavaDoc and single-line comments both contain a maximum of two sentences. The developer-written comments either state the purpose of the code or rephrase the code in a more meaningful manner.

Table 5.12: Manual classification of 135 participant-written comments' unique properties. Percentage is calculated over 113 comments because 22 of the comments received no answer.

Property: Comment is a natural rephrase of the source code (code element names are broken down correctly)
Example (negative): "get inSampleSize based on outWidth and outHeight"
Total #: 114 (84.4%)

Property: Comment only consists of a single concise sentence.
Example: "Reads in properties from a file"
Total #: 131 (97%)

Property: Comment starts the sentence with a verb.
Example: "Update the text of icon with the given position"
Total #: 114 (84.4%)

5.6.3 Properties of Participant-written Comments

We performed a manual inspection of the 135 participant-written comments (15 questions over nine randomly selected participants) to evaluate the positive properties of participant-written comments. The process involves iterating through the list of comments and creating a label for each identified positive property of the written comment. Once we have a list of labels, we attempt to merge the labels that contain overlapping properties. For example, we merged "Comment does not contain directly copied code artifacts (e.g., variable name and method name) in the comment" into "Comment is a natural rephrase of the source code (code element names are broken down correctly)." We repeat this process until a non-overlapping list of labels is discovered. We identified three labels, shown in Table 5.12, that may guide future techniques to improve automated comment generation.

First, only in rare cases did our participants include the name of a code artifact (e.g., variable names and method names) in their code comments. For example, a participant wrote, "FileReader is closed in the try segment, if some exception arises there is a catch segment defined for the same," which contains the class name, FileReader, in the comment. In the majority of the cases, participants break down the identifier into smaller parts. For example, another participant wrote the following comment for the same code segment: "Cleanup file reader." Our tool generated the following comment: "try to close the reader." In the future, we may supplement automatically generated comments with textual information (variable names and method names) from the source code ([90]). The inference of contextual information remains a challenging task, which may require information from the control or data paths of the program.

Second, participant-written comments are in general very concise: 97% of the comments consist of a single concise sentence. Third, most of the sentences start with a verb phrase (e.g., "Get ..." and "Create ...") followed by a noun phrase (e.g., "the bounds of ..." and "a StringWriter ..."). This pattern seems to be a standard convention that is consistently used by most participants. We also note that the Java language specification ([227]) suggests that method names should be verbs or verb phrases. Previous work ([90]) relies on detecting verbs in the source code to generate comments.

The study on participant-written comments shows three main properties. First, the comments are often a natural rephrase of the source code with a better explanation of the identifier names. Second, almost all comments are concise, containing only a single sentence. Third, the majority of the sentences start with a verb.

5.6.4 Yield Analysis

Since each proposed component has an impact on the number of generated comments, we conducted an analysis of each component using the Java project GanttProject to illustrate their impact on the yield.

To start the comment generation, CloCom+ first requires a database. Since GanttProject is a Java project, CloCom+ extracted 309,593 code-description mappings from Stack Overflow using the java tag. The next step involves performing natural language analysis on the extracted mappings to filter out invalid sentences (Section 5.3.2). The next step performs code clone detection (Section 5.3.3) between the database and the input source code from GanttProject (62 KLOC). After applying the heuristics, line percentage matching, and text similarity to the detected clones, the number of code clone match groups is reduced to 93. A clone group represents a group of code segments that are similar to each other, which contains one or more code segments from the target project and one or more code segments from Stack Overflow. However, not all clone groups contain a comment candidate. Amongst the 93 clone groups, only 8 contain a comment candidate. Disabling the percentage matching pruning technique introduces 8 new comments on top of the 8 existing clone groups, bringing the total number of clone groups to 14 (note that a new comment does not necessarily introduce a new clone group). Upon manual inspection, amongst the 8 new unique comments, 6 do not describe the target code segment correctly.

for (int i = 1; i < 5; i++) {
    ...
}

int i = 1;
while (i < 5) {
    ...
    i++;
}

Figure 5.16: Two code segments that perform the same computation but are expressed with different syntax.

Disabling the repetitive clone pruning heuristic does not impact the yield, which means that the code clone detection did not match low-level code between the target project and the database. Disabling text similarity does not impact the yield because all the code comments already contain terms that are similar to the target source code. The yield is controlled mostly by the parameters configured for the code clone detection tool and the text similarity component. In the code clone detection tool, the minimum clone size threshold (four statements) and gap size (two) directly affect the amount of content in the matched code segments. In the text similarity component, the minimum text similarity term threshold impacts the similarity between the selected comment and the matched code segment. We did not observe any components or heuristics that can be removed to simplify the design.

A potential factor for the low yield comes from the code clone detection tool, which may not be able to detect more complicated types of code clones. For example, the statement c = xx(yy(b)); can be expressed with a different syntax, a = yy(b); c = xx(a);. Such code clones cannot be detected by the current code clone detection tool. Another example would be a loop that is expressed with two different syntaxes, as shown in Figure 5.16.

5.6.5 Limitations

We discuss the major types of general limitations that can affect the practical application of the technique.

First, the results (Section 5.5) show that only 21 of the 105 responses strongly agreed or agreed that the generated comments are accurate, adequate, concise, and useful at the same time. Although most heuristics can be tuned to reduce the number of false positives, doing so often further reduces the yield. For example, the natural language processing technique that we use to extract the core part of a sentence is very reliable, but it is clear that more sophisticated natural language processing techniques are needed to understand the semantic meaning of the sentence for a more consistent result. Second, the yield remains an issue with the technique. Our recent work, CloCom [7], attempted to tackle this issue by mining existing software repositories such as GitHub for automatic comment generation. It is interesting to see that CloCom also suffers from the yield issue despite having a much larger code base. Based on the same 15 evaluated Java projects as previous work [90], which do not include the Eclipse and Android projects, CloCom generated comments for 197 clone groups whereas CloCom+ generated comments for 100 clone groups. Additionally, previous work [228] has shown that API usage patterns do exist and can be mined from source code. Therefore, we believe that common code patterns do exist, but these patterns do not necessarily come with a source code comment.

5.7 Threats to Validity

Several threats limit the validity of our experiments. We now discuss these potential threats and how we control or mitigate them.

5.7.1 External Validity

The external validity of the study relates to the extent to which we can generalize its results. CloCom+ mines Q&A sites to generate the code-description mapping database. If the Q&A site does not discuss a code segment, then CloCom+ cannot generate a comment for it. This limitation impacts the generalizability and the yield (number of generated comments). It may be possible to combine our approach with previous techniques ([89, 90]) to determine the important terms in a code segment for the comment selection process. The current technique only generated 442 comments for the 16 evaluated projects, which means the yield is still rather low; this can limit the generalizability of our technique.

In the future, it is possible to expand the natural language analysis to analyze different parts of a Stack Overflow post to help improve the yield.

In our user study, we randomly sampled code segments during the generation of each questionnaire (Section 5.4.3). We generated a total of seven questionnaires, where the same questionnaire had been evaluated by every participant in RQ1 of the user study. The advantage is that we can obtain multiple opinions over the same set of questions. The drawback of this approach is that we can only cover a subset of the generated comments. Our previous work ([208]) randomly samples code segments from the question set until all code segments have been evaluated, which means each participant receives a different set of questions. Since human opinion is subjective, we chose to have multiple participants evaluate the same set of code segments to gain higher confidence in the results.

5.7.2 Internal Validity

The internal validity of the study is the extent to which a treatment affects a change in the dependent variables. The current code clone detection tool is only capable of detecting code clones that are semantically different if the difference is due to statement reordering or deletion. Since we implemented heuristics to make the Stack Overflow code segments compilable (Section 5.3.3) in order to extract the AST, it is now possible to apply other AST-based code clone detection tools to detect more advanced types of code clones.

CloCom+ applied an off-the-shelf natural language processing model to the software domain, which can affect the correctness of the natural language processing technique. Most natural language models, including the Stanford CoreNLP parser, can fall short on interpreting technical terms in the software domain because the models were trained on well-written text such as the Wall Street Journal. For example, "file directory" is a noun phrase in the software domain, but most natural language parsers would treat it as a verb phrase. The Stanford parser that we used in this work worked well in our experiments and was able to generate correct parse trees by analyzing the entire sentence. For example, a full sentence, "Open the file directory," is reliably interpreted by the parser to contain a verb followed by a noun phrase. For sentences in Stack Overflow posts that are not full sentences, we mitigated this issue by applying the main subtree extraction technique to filter them out.

In RQ2 of the user study, each question contains two comments: a participant-written comment and an automatically generated comment. The participants were unaware of the source of each comment, but when we generate the questionnaire, the order of the two comments is always the same. The participant-written comment is always labeled as the first comment, and the automatically generated comment is always labeled as the second comment.

The ordering might introduce a learning threat to our user study.

The subjects of our user study consist of graduate and undergraduate students, who might not be representative of real developers. Also, the subjects have different programming language experience; therefore, their judgment could differ based on their programming language experience. The subjects in our study have an average of 5.4 years of programming experience (between 3 and 9 years), and all participants have industry programming experience.

The qualitative study contains multiple manual evaluations of the source code segments and comments. Our opinion might not be representative of real developers, but we tried our best to be objective during the evaluation. In the evaluation of automatically generated comments (Section 5.6.1), we extracted the classification labels from the user study results (Section 5.5) to ensure the labels are not just the authors' opinion. For developer-written comments (Section 5.6.2), we added an automated analysis of the code comment ratio to support the manual analysis.

5.7.3 Construct Validity

Construct validity concerns the relation between theory and observations. In our evaluation, we proposed metrics to evaluate the quality of the comments based on the completeness, conciseness, expressiveness, and usefulness criteria. A potential threat is that the criteria do not directly indicate the quality of the comment. For example, a comment can be accurate, adequate, and concise but not useful, which makes it unclear whether the comment is still applicable to the source code. In our evaluation, we mitigate this threat by showing that 21 out of the 105 responses satisfied all four criteria with a rating of strongly agree or agree. Previous work by [89] used a similar set of metrics, with the difference that they did not evaluate the comments using the usefulness criterion. Therefore, we believe that these metrics are valid for evaluation. Also, several parts of the qualitative study rely on observations from the authors, which may not be representative of the actual theory.

The usefulness criterion may not be a practical measure of the generated comments because the usefulness of a comment depends on the context of the code, the background of the participant, and the summarization task that is given to the participant. Previous work ([229]) performed a pilot study on how a target task can impact the content of a software summary. In their study, they considered two summarization tasks: a summary directed towards code testing and another summary directed towards code reuse. The results show that when a specific target task is given to the user, the summaries contain words that are specific to testing and words that are specific to software reuse.

They also show that summarization of unfamiliar code is challenging for the participants and that having dynamic information related to the code would be useful (e.g., GUI code can have the running GUI shown to the user). We attempted to mitigate this threat by training the participants to understand that the code summaries are code comments that a programmer would write for other developers to read (Section 5.4.4).

5.7.4 Conclusion Validity

Conclusion validity threats deal with the relationship between the treatment and the outcome. We only concluded that our tool works on Java and Android projects, but our technique should apply to other similar compiled languages such as C and C# because the proposed techniques and tools are largely language independent. CloCom+ is designed to extract code comments from a particular data source, Stack Overflow, which may impact the design thresholds that are utilized in CloCom+.

5.8 Summary

We presented an approach for automatic comment generation using Q&A sites. Our approach leverages natural language processing, code clone detection, and text similarity analysis techniques to map a natural language sentence into a source code comment. A user study with 20 participants evaluated a sample of the 442 comments generated by CloCom+. The number of generated comments from CloCom+ is still rather low, and the quality still requires improvement. Possible future paths include extending our tool to analyze other parts of a sentence or using more advanced code clone detection tools that can detect code insertion and reordering. Our user study shows that the majority of the participants consider the automatically generated comments to be complete, concise, expressive, and useful. However, the statistical tests show that there is little agreement between the participants. We compared our work against a previous work, SumSlice. The result shows that their work can generate more complete summaries compared to CloCom+, but the generated summaries are not very concise.

Chapter 6

Future Work

In the future, it may be possible to combine the two proposed branches of work (documentation analysis and documentation generation) to further improve software dependability. However, automated documentation generation is still a rather new area of research that requires more time to mature. Our proposed technique for automated comment generation (CloCom+) still needs further work to improve the completeness and usefulness of the generated comments. Also, CloCom+ currently generates code comments that describe general source code, which is not applicable to DASE, which relies on analyzing code comments from the header files that describe the data structures. We present additional work to complete this thesis, which expands upon other ways to automate the constraint extraction process. We propose an approach to automatically extract file format constraints from documentation, and discuss how to potentially apply the extracted file format constraints to a file parser.

6.1 Detecting Bugs using Documentation Constraints

Applications that accept input files of a specific structure are called structured-file parsing applications. Some file formats have a rigid structure, where each data field can be directly accessed at a defined offset (e.g., ELF, PNG, JPEG, and MP3). There are also file formats that have a loose structure (e.g., PDF, XML, and programming source code), which require a lexer to tokenize the input and a parser to analyze the tokens based on the grammar of the input. Our previous work, DASE [133], analyzed the ELF file format, which has a mostly rigid structure as shown in Figure 3.4. The ELF header, section header table, and program header table each contain a list of positional offsets for each data field, which makes it a rigid structure.

typedef struct {
    unsigned char e_ident[EI_NIDENT];
    uint16_t      e_type;
    uint16_t      e_machine;
    uint32_t      e_version;
    ElfN_Addr     e_entry;
    ElfN_Off      e_phoff;
    ElfN_Off      e_shoff;
    uint32_t      e_flags;
    uint16_t      e_ehsize;
    uint16_t      e_phentsize;
    uint16_t      e_phnum;
    uint16_t      e_shentsize;
    uint16_t      e_shnum;
    uint16_t      e_shstrndx;
} ElfN_Ehdr;

Figure 6.1: Data field layout details of the ELF Header. table, and program header table each contain a list of positional offsets for each data field, which makes it a rigid structure. Figure 6.1 shows the struct definition of the file header, which defines the positional offset of each field. For example, the header starts with a char byte (e_ident) that describes how to interpret the file, followed by an unsigned int (e_type) that indicates the object file type. Structured-file parsing applications are difficult to test because they contain a large number of execution paths to test. Existing symbolic execution testing tools such as KLEE [40] do not scale on structured- file parsing applications because the lexer and parser of a program often generate a large number of execution paths. Suppose we have a program that processes a string input from a file (e.g., “int i = 1;”), where the string input does not have a rigid structure. The token separator between “int” and “i” may be replaced by one or more white space, line break, or tab character. Each possible combination of the token separator generates a new execution path, and the separator can have any possible length. Without an understanding of the input file’s structure or grammar, given a symbolic input, the symbolic execution engine may fork many execution states for each possible combination of the token separator, which leads to the path explosion problem. Our previous work, DASE [133], attempted to bound the input of a program with file format constraints. However, due to limitations on the expressiveness of the constraint system, DASE can only specify constraints on file formats that have a rigid structure. For example, suppose we want to constraint the following dictionary string from a PDF file that defines a list of key-value pairs, where there are two

<< /Type /Catalog     <- order of the key/value fields cannot be assumed
   /Pages 2 0 R       <- extra token spacing that affects the byte offset
>>

Figure 6.2: An example where a constraint cannot be applied directly at a static byte offset due to the extra white space between the characters '0' and 'R'.

<< /Type /Catalog /Pages 2 0 R >>

KLEE [40] and DASE [133] do not scale to programs that process the above type of input because the lexer and parser of the application cause the symbolic execution engine to keep forking execution states that contain invalid inputs. While it is possible to constrain the dictionary string by assuming the position of the key/value fields using DASE, doing so also prunes away many other valid input forms, such as a dictionary whose key/value pairs appear in reverse order or with different spacing (e.g., << /Pages 2 0 R /Type /Catalog >>). Figure 6.2 shows an example where the positional offset of the data inside a PDF file cannot be assumed, for two reasons. First, we cannot assume the order of the two keys, 'Pages' and 'Type,' to always be the same. Second, there may be variations in the syntax of the data (e.g., token spacing). Both issues can only be addressed with a parser that understands the file format. If we configure DASE to assume that the first value, /Catalog, always starts at byte position 9, such a constraint becomes invalid when applied to a reordered or differently spaced dictionary string. Therefore, we propose to address the challenges of modeling structured-file inputs to help improve symbolic execution. We believe the utilization of a string solver is a promising direction for modeling the input file format so that the engine avoids forking execution states that contain invalid inputs. Figure 6.3 shows the context-free grammar that represents the valid structure of the dictionary string, which can be converted into a constraint using existing string solvers such as HAMPI [230]. The grammar makes the symbolic execution tool aware of the following list of symbolic tokens and a collection of high-level constraints (as opposed to the traditional byte-level constraints) during symbolic execution:

1. <<
2. keyName
3. value
4. keyName
5. value
6. >>

dictStr := "<<" + keyValuePair + ">>";
keyValuePair := "/" + keyName + space + value
              | keyValuePair + space + keyValuePair;
value := keyName | reference;
reference := numbers + space + numbers + space + "R";
keyName := "/" + (letter)+;
letter := [a-z];
numbers := (number)+;
number := [0-9];
space := ("\s" | "\n")+;

Figure 6.3: Context-free grammar that describes the valid structure of the dictionary string.
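To make this concrete, the sketch below is our own illustration (not DASE's implementation and not HAMPI's constraint language) of how the grammar in Figure 6.3 could act as a high-level input constraint. It renders the grammar as a single Python regular expression and, as an assumption for readability, relaxes the letter rule to accept upper-case characters so that the example keys /Type and /Pages are recognized. A string solver such as HAMPI [230] would instead encode the same grammar as constraints over a symbolic string variable.

import re

# Regular-expression rendering of the Figure 6.3 grammar (a sketch only).
SPACE   = r"\s+"                              # space := ("\s" | "\n")+
KEYNAME = r"/[A-Za-z]+"                       # keyName, relaxed to upper case
REF     = rf"[0-9]+{SPACE}[0-9]+{SPACE}R"     # reference := numbers space numbers space "R"
VALUE   = rf"(?:{KEYNAME}|{REF})"             # value := keyName | reference
PAIR    = rf"{KEYNAME}{SPACE}{VALUE}"         # keyValuePair := keyName space value
DICT_RE = re.compile(rf"<<\s*(?:{PAIR}\s*)+>>")   # dictStr := "<<" keyValuePair+ ">>"

def satisfies_grammar(s: str) -> bool:
    """Check a candidate dictionary string against the high-level constraint."""
    return DICT_RE.fullmatch(s) is not None

# Both key orderings and arbitrary token spacing satisfy the constraint,
# while a fixed byte-offset constraint would accept only one exact layout.
assert satisfies_grammar("<< /Type /Catalog /Pages 2 0 R >>")
assert satisfies_grammar("<</Pages 2  0 R /Type /Catalog>>")
assert not satisfies_grammar("<< /Type /Catalog /Pages 2 0 >>")   # malformed reference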

We discuss two possible directions for exploration. The first direction is to explore the combination of symbolic execution with fuzzing [124, 126, 13]. Previous work [13] has shown that it is possible to leverage grammar-based constraints to help generate well-formed inputs for fuzz testing. Their technique first utilizes a lexer to tokenize the input, which allows grammar-based constraints to be used for the specification of high-level constraints. We would like to explore the extraction of grammar-based constraints automatically from documentation to model file formats. The second direction is learning file formats automatically. For example, the dictionary string discussed above contains a key, Pages, that connects to object number two. It may be possible to capture this semantic relation between objects by analyzing the constraints collected during symbolic execution. One potential downside of this approach is that the learned constraints will depend on the correctness of the input files.

Challenges

The challenge comes from the automatic extraction of file format constraints from the documentation. We require file format constraints that describe the lexical (i.e., tokens) and grammatical parts of the file format, which differs from our previous work in DASE [133], where we focused on extracting value constraints. We describe two approaches to automatically extract constraints for the PDF file format [166], and an approach to apply the constraints to a PDF parser.

6.1.1 Automated Constraint Extraction using Regular Expressions

Since documentation contains constraints that describe structured file formats, we demonstrate an approach to extract file format constraints automatically from the ISO documentation of the PDF file format [166]. The documentation contains many types of constraints that describe a PDF file, such as the general layout of the file format, specifications of the supported data types (e.g., integer numbers, real numbers, strings, and booleans), special characters (e.g., the list of valid white-space characters and delimiters), and dictionary constraints (e.g., the formatting of the key-value pairs). We implemented a constraint extractor for dictionary constraints because dictionary constraints help validate the data's correctness, which is important for file recovery because it prevents the parser from extracting incorrect data.

Extracting Dictionary Constraints

A dictionary is an associative array data structure. It consists of a list of key-value pairs, where each key may appear at most once in the collection. Dictionary constraints are documented inside tables in the PDF documentation. Since all the tables in the documentation follow a consistent format, we utilize regular expressions to extract the constraints. The PDF file format contains many types of dictionaries (e.g., trailer, pages, page, and root). Each dictionary contains key-value pair entries that define the properties of an object. Dictionary constraints can be used to understand each key-value pair's expected format and value. Below is an example of a trailer dictionary, which contains two key-value pairs, "Root" and "Size."

<< /Root 1 0 R /Size 5 >>

This trailer dictionary contains a list of valid key-value entries which are specified in the documentation (Table 6.1). We extract the following constraints for the key, “Root,” which is defined for the trailer dictionary in the documentation:

• Value format: dictionary
• Is it a mandatory key: yes
• Is it inheritable: not defined
• Does it have to be an indirect reference: yes
• Dictionary type: catalog
• Dictionary location: Section 7.7.2 in the documentation [166]

(entries\sin\s(a|an|the)|
 entries\scommon\sto\sall|
 entries\sspecific\sto\sa|
 required\sentries\sin\sa)
\s
(.+)\s
(dictionary|object|stream|root|node|[a-zA-Z]+)

Figure 6.4: The first regular expression for matching table captions.
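For illustration, the constraint entries listed above could be stored as a simple record once extracted; the Python representation below is a hypothetical sketch of such a record (DocRepair's internal format is not shown in this chapter), with the field values taken directly from the list above.

# Hypothetical record for the constraints extracted for the "Root" key of the
# trailer dictionary (field names are illustrative).
ROOT_CONSTRAINT = {
    "key": "Root",
    "value_format": "dictionary",
    "mandatory": True,
    "inheritable": None,        # not defined in the documentation
    "indirect_reference": True,
    "dictionary_type": "catalog",
    "defined_in": "Section 7.7.2 of the documentation [166]",
}

def missing_mandatory_keys(parsed_dictionary, constraints):
    """Return the mandatory keys that are absent from a parsed dictionary."""
    return [c["key"] for c in constraints
            if c["mandatory"] and c["key"] not in parsed_dictionary]

# Example: a trailer that lost its Root entry violates the extracted constraint.
assert missing_mandatory_keys({"Size": "5"}, [ROOT_CONSTRAINT]) == ["Root"]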

The key, Root, contains the value, 1 0 R. The extracted constraints allow us to validate the correctness of the value. The constraint states that the value has to be an indirect reference, which means it has to be written in the format of two numbers followed by the keyword "R," where the formatting of an indirect reference is defined in a different part of the documentation.

We automatically parsed 137 tables that contain dictionary-related constraints from the PDF documentation by defining two regular expressions. The two regular expressions were written manually after we identified the patterns that are common to the tables in the documentation. The first regular expression, presented in Figure 6.4 ("\s" refers to spaces), identifies tables that are related to dictionary parsing by matching against table captions. The caption of Table 6.1, "Table 15 - Entries in the file trailer dictionary.", satisfies the first regular expression. We extract match groups from the regular expression to identify properties of the table. For example, the first match group (the first four lines) matches "Entries in the". The second and third match groups identify the type of the dictionary (e.g., pages object or stream dictionary), which is "file trailer" in this example. The second regular expression, shown in Figure 6.5, decomposes each entry of Table 6.1 into three parts: a "Key," a "Type," and a "Value." For example, Table 6.1 has an entry whose "Key" is "Root," whose "Type" is "dictionary," and whose "Value" is "(Required; shall be an ...)".

([a-zA-Z]+)\s
(integer|boolean|dictionary|
 array|name|stream|number|
 real|string|text\sstring|
 rectangle|date|name\stree|
 [a-z]\sor\s[a-z])
\s
(.+)

Figure 6.5: The second regular expression for decomposing table entries.

Table 6.1: Trailer dictionary's table from the ISO 32000 specification. The trailer dictionary's table contains the caption, "Table 15 - Entries in the file trailer dictionary."

Key      Type        Value
Prev     integer     (Present only if the file has more than one cross-reference section; shall be an indirect reference) The byte offset in the decoded stream from the beginning of the file to the beginning of the previous cross-reference section.
Root     dictionary  (Required; shall be an indirect reference) The catalog dictionary for the PDF document contained in the file (see 7.7.2, "Document Catalog").
Encrypt  dictionary  (Required if document is encrypted; PDF 1.1) The document's encryption dictionary (see 7.6, "Encryption").
...
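As an illustration of how the two regular expressions could be applied, the Python sketch below runs the pattern from Figure 6.4 over the caption of Table 6.1 and the pattern from Figure 6.5 over its "Root" row; the use of re.IGNORECASE and the exact group numbering are assumptions of this sketch rather than details stated in the text.

import re

# Figure 6.4: match table captions that describe a dictionary.
CAPTION_RE = re.compile(
    r"(entries\sin\s(a|an|the)|"
    r"entries\scommon\sto\sall|"
    r"entries\sspecific\sto\sa|"
    r"required\sentries\sin\sa)"
    r"\s(.+)\s"
    r"(dictionary|object|stream|root|node|[a-zA-Z]+)",
    re.IGNORECASE)

# Figure 6.5: decompose a table row into Key, Type, and Value.
ENTRY_RE = re.compile(
    r"([a-zA-Z]+)\s"
    r"(integer|boolean|dictionary|array|name|stream|number|"
    r"real|string|text\sstring|rectangle|date|name\stree|[a-z]\sor\s[a-z])"
    r"\s(.+)",
    re.IGNORECASE)

caption = "Table 15 - Entries in the file trailer dictionary."
m = CAPTION_RE.search(caption)
print(m.group(3))        # "file trailer" -- the dictionary the table describes

row = ('Root dictionary (Required; shall be an indirect reference) The catalog '
       'dictionary for the PDF document contained in the file (see 7.7.2, '
       '"Document Catalog").')
e = ENTRY_RE.match(row)
print(e.group(1))        # "Root"        -- the Key column
print(e.group(2))        # "dictionary"  -- the Type column
print(e.group(3))        # "(Required; shall be an indirect reference) ..." -- the Value column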

6.1.2 Automated Constraint Extraction using Natural Language Analysis

The PDF file format supports eight basic object formats, including boolean values, integer/real numbers, strings, names, arrays, dictionaries, streams, and the null object. The descriptions of the object formats are written in English in free form instead of in a structured table format. Thus, regular expressions are ineffective for processing them, and we instead adapt and apply natural language processing techniques. The PDF documentation contains dedicated chapters that describe each object format. We manually identified the chapters that describe the object format constraints (chapters 7.3.2 to 7.3.9) and supplied all the English sentences to the constraint extraction tool.

Table 6.2: Syntactic category examples for the production rules in the CFG.

Symbol        Meaning                            Example
S             sentence                           an integer shall be written as one or more decimal digits optionally preceded by a sign
PP            prepositional phrase               as one or more
VP            verb phrase                        shall be written as one of the following two ways
SBAR          subordinating conjunction clause   when writing a name in a PDF file
REQUIREMENT   requirement                        one of the following two ways
ACTION        action                             writing a name in a PDF file
...

An object format constraint limits an object's appearance. For example, an integer number object has to contain one or more decimal digits optionally preceded by a sign (e.g., "+17"). Since the dictionary constraints (Section 6.1.1) specify the expected object format for each key-value pair, we can use object format constraints to validate whether an object conforms to the format. For example, Table 6.1 contains the key, "Prev," that requires an "integer" object format. Thus, we can use the integer object format constraints to validate whether a "Prev" value is an integer object.

To parse the English sentences, we defined a domain-specific context-free grammar (CFG) to extract object format constraints (Figure 6.7). The purpose is to use a formal grammar to describe and label the high-level parts of a sentence. The CFG in Figure 6.7 contains a starting symbol, 'S,' that represents the input sentence. Each production rule contains a non-terminal symbol (e.g., PP, VP, NP, SBAR, etc.) that can be replaced by a pattern consisting of terminals and non-terminals. The syntactic category of each symbol is shown in Table 6.2. Terminals are words in a sentence. For example, the non-terminal symbol, 'DT' (determiner), can be replaced by one of two words, 'a' or 'an'. Consider the sentence from the PDF documentation that describes the integer object: "An integer shall be written as one or more decimal digits optionally preceded by a sign." The production rule, 'REQUIREMENT,' defines the syntactic structure of the string, "shall be written as one or more decimal digits optionally preceded by a sign."

We defined the CFG to focus the parsing on three categories of sentence structures (production rule 'S' in Figure 6.7 contains three alternatives).

The first category starts with a noun phrase (NP) followed by a verb phrase (VP). The second category is a prepositional phrase (PP). The third category starts with a subordinating conjunction clause (SBAR) followed by a noun phrase (NP) and a verb phrase (VP). Our approach is similar to a natural language parser, but it includes custom part-of-speech tags (e.g., REQUIREMENT, ACTION, APPEARANCE) for the production rules. The reason is that we wanted to label specific high-level phrases, such as requirement phrases (e.g., "shall consist of ...") or format phrases (e.g., "a sequence of ..."), and be able to retrieve them as constraints later on.

The output of the parser is the parse tree of the sentence. Figure 6.6 shows the labeled parse tree of the sentence "An integer shall be written as one or more decimal digits optionally preceded by a sign." The internal nodes of the tree represent the non-terminals in the grammar, and the leaf nodes represent the terminals (words) in the sentence. For example, we can extract from the sentence that the object ("an integer") contains the requirement ("shall be written as ..."), and that the requirement contains a format specification ("one or more decimal digits ...").

After obtaining the parse tree of a sentence, we applied three templates to extract constraints from the parse trees.

The first template focuses on sentences that describe an object's format. It detects such sentences by checking for the existence of a 'REQUIREMENT' and a 'FORMAT' node. It extracts the values of specific nodes, including 'VALUE,' 'OBJECT,' 'FORMAT,' and 'KEYWORD,' and returns either 1) a string that can be a keyword or a regular expression or 2) a list that contains the identified tokens. In the previous example, "An integer shall be written as one or more decimal digits optionally preceded by a sign." contains the 'REQUIREMENT' node, 'shall be written as xxx,' and the 'FORMAT' node, 'one or more decimal xxx.' The template statically maps the format string into a regular expression. The template can also return a list. For example, "A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream and endstream." contains a 'REQUIREMENT' node and a 'FORMAT' node. The template extracts the 'KEYWORD' nodes that exist within the 'FORMAT' node and returns a list containing the two keywords, stream and endstream.

The second template focuses on sentences that describe the valid enclosure format of an object. It detects such sentences by checking for the existence of a 'FORMAT' and a 'PP' node. Once detected, it extracts the value of the 'ENCLOSE' node and returns a list of valid tokens. For example, "As a sequence of literal characters enclosed in parentheses ( )." returns a list containing the two characters, '(' and ')'.

The third template focuses on sentences that describe how an object should appear in a PDF file. It detects such sentences by checking for the existence of an 'APPEARANCE' and a 'REQUIREMENT' node. Once detected, it extracts the value of the 'FORMAT' node and returns a list of tokens. For example, "They appear in PDF files using the keywords true and false." returns a list containing the two keywords, true and false.

(S
  (NP (OBJECT an integer))
  (VP (REQUIREMENT shall be written
        (IN as)
        (FORMAT (COUNT one or more)
                (OBJECT decimal digits)
                (OPTIONAL optionally preceded by (OBJECT a sign))))))

Figure 6.6: Parse tree for the sentence: "An integer shall be written as one or more decimal digits optionally preceded by a sign."
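As a sketch of how the parse tree in Figure 6.6 could be produced, the example below encodes a simplified fragment of the Figure 6.7 grammar with NLTK and parses the integer sentence; NLTK is our assumption here (the text does not name a parsing library), and the grammar fragment covers only this one sentence.

from nltk import CFG, ChartParser

# Simplified fragment of the CFG in Figure 6.7, restricted to the example sentence.
grammar = CFG.fromstring("""
S -> NP VP
NP -> OBJECT
VP -> REQUIREMENT
REQUIREMENT -> 'shall' 'be' 'written' IN FORMAT
FORMAT -> COUNT OBJECT OPTIONAL
OPTIONAL -> 'optionally' 'preceded' 'by' OBJECT
COUNT -> 'one' 'or' 'more'
IN -> 'as'
OBJECT -> DT 'integer' | 'decimal' 'digits' | DT 'sign'
DT -> 'a' | 'an'
""")

sentence = ("an integer shall be written as one or more "
            "decimal digits optionally preceded by a sign").split()
tree = next(ChartParser(grammar).parse(sentence))
tree.pretty_print()

# The labeled subtrees can then be read off as constraints, e.g. the FORMAT phrase:
for subtree in tree.subtrees(lambda t: t.label() == "FORMAT"):
    print(" ".join(subtree.leaves()))
    # -> "one or more decimal digits optionally preceded by a sign"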

6.1.3 Applying File Format Constraints to a File Parser

We build a constraint-based parser that utilizes file format constraints (dictionary constraints) to (1) enable fault-tolerant parsing of the data and (2) validate the semantic correctness of the parsed data. A parser is critical to the recovery process since it is responsible for extracting data from the objects. Existing recovery tools such as mutool [73] and Poppler [77] do not utilize dictionary constraints to validate the key-value pairs in their PDF parsers. Previous work by Endignoux et al. performs constraint validation on dictionary constraints [70]. However, our proposed approach differs in two ways. First, their algorithm rejects the entire PDF file upon detecting constraint violations, whereas our proposed approach discards the part that violates the constraint to allow the parser to recover from the error. Second, their constraints are manually written, whereas our proposed approach extracts them automatically from the documentation. The following shows an example where a constraint-based parser can identify and avoid the parsing of invalid values, where the parser's job is to return a list of key-value pairs:

⟨S⟩ ::= ⟨NP⟩ ⟨VP⟩ ⟨ENDING⟩
      | ⟨PP⟩ ⟨ENDING⟩
      | ⟨SBAR⟩ ',' ⟨REQUIREMENT⟩ ⟨ENDING⟩
⟨PP⟩ ::= 'as' ⟨FORMAT⟩
⟨VP⟩ ::= ⟨REQUIREMENT⟩ | ⟨ACTION⟩
⟨NP⟩ ::= ⟨OBJECT⟩
⟨SBAR⟩ ::= 'when' ⟨VP⟩
⟨MULTI-OBJECT⟩ ::= ⟨DT⟩ ⟨POSITION⟩ ',' ⟨POSITION⟩ ',' 'or' ⟨POSITION⟩ ⟨OBJECT⟩
                 | ⟨OBJECT⟩ 'and' ⟨OBJECT⟩
⟨REQUIREMENT⟩ ::= 'shall' 'be' 'written' ⟨IN⟩ ⟨FORMAT⟩
                | 'shall' 'consist' 'of' ⟨OBJECT⟩ ⟨FORMAT⟩
                | 'shall' 'be' 'only' ⟨NUM⟩ ⟨OBJECT⟩ ',' 'denoted' 'by' ⟨OBJECT⟩
                | ⟨FORMAT⟩
                | ⟨APPEARANCE⟩
⟨APPEARANCE⟩ ::= 'appear' 'in' ⟨OBJECT⟩ ⟨FORMAT⟩
⟨ACTION⟩ ::= 'writing' ⟨OBJECT⟩ ⟨IN⟩ ⟨OBJECT⟩
⟨FORMAT⟩ ::= ⟨COUNT⟩ ⟨OBJECT⟩ 'with' ⟨OPTIONAL⟩
           | 'one' 'of' 'the' 'following' 'two' 'ways'
           | ⟨COUNT⟩ ⟨OBJECT⟩ ⟨OPTIONAL⟩
           | 'a' 'sequence' 'of' ⟨OBJECT⟩ ⟨ENCLOSE⟩
           | ⟨OBJECT⟩ ⟨ENCLOSE⟩
           | 'followed' 'by' ⟨COUNT⟩ ⟨OBJECT⟩ ⟨ENCLOSE⟩
           | ⟨KEYWORD⟩ 'shall' 'be' 'used' 'to' 'introduce' ⟨OBJECT⟩
           | 'using' ⟨KEYWORD⟩
⟨KEYWORD⟩ ::= 'the' 'keywords' ⟨VALUE⟩ 'and' ⟨VALUE⟩
            | ⟨DT⟩ 'solidus'
⟨ENCLOSE⟩ ::= 'enclosed' 'in' ⟨OBJECT⟩
            | 'bracketed' 'between' ⟨KEYWORD⟩
⟨OPTIONAL⟩ ::= 'optionally' 'preceded' 'by' ⟨OBJECT⟩
             | ⟨DT⟩ 'optional' ⟨OBJECT⟩ 'and' ⟨MULTI-OBJECT⟩
⟨COUNT⟩ ::= 'one' 'or' 'more' | 'zero' 'or' 'more'
⟨ENDING⟩ ::= '.' | ':' | ';'
⟨VALUE⟩ ::= 'true' | 'false' | 'stream' | 'endstream'
⟨NUM⟩ ::= 'one'
⟨POSITION⟩ ::= 'leading' | 'trailing' | 'embedded'
⟨IN⟩ ::= 'as' | 'in'
⟨DT⟩ ::= 'a' | 'an'

Figure 6.7: CFG for extracting constraints from natural language sentences. One of the production rules, OBJECT, is omitted since it contains a full list of object-related English terms.

trailer
<< /Root 1 0 R /Size 5 >>
endobj

Listing 6.1: Example of a constraint-based parser identifying invalid values.

• Root, 1 0 R
• Size, 5

Existing recovery tools [73, 77] assume that a value must follow a key, but they do not validate the format of the value. Suppose the value of the key, Root, becomes the numeric value '1' instead of the indirect reference "1 0 R". Since mutool does not know the expected format of the value, it accepts '1' as the value for Root if "0 R" is corrupted or missing. The acceptance of values that violate the constraints causes mutool to fail to generate a correct PDF file; because it does not validate the correctness of the key-value pairs, the trailer that mutool generates keeps the invalid Root value.

On the other hand, our constraint-based parser recognizes the violation of the constraint and rejects the value, which forces the repair operator to generate a new trailer that contains a valid indirect reference for Root. Note that the value of Size increases in the repaired trailer because the repair operators add new objects to the PDF file during the repair process.
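The sketch below illustrates this behavior under stated assumptions (it is not DocRepair's implementation): values that violate the extracted trailer constraints, such as a Root value that is not an indirect reference, are rejected so that the repair operators can be triggered.

import re

# An indirect reference is two numbers followed by the keyword "R".
INDIRECT_REF = re.compile(r"\d+\s+\d+\s+R")

# Expected type and indirect-reference requirement per trailer key (illustrative values).
TRAILER_CONSTRAINTS = {
    "Root": ("dictionary", True),
    "Size": ("integer", False),
}

def validate_entry(key, value):
    expected_type, needs_indirect_ref = TRAILER_CONSTRAINTS[key]
    if needs_indirect_ref:
        return INDIRECT_REF.fullmatch(value.strip()) is not None
    if expected_type == "integer":
        return value.strip().isdigit()
    return True

# The corrupted trailer from the example: "0 R" was lost, so Root became "1".
parsed = {"Root": "1", "Size": "5"}
rejected = [key for key, value in parsed.items() if not validate_entry(key, value)]
print(rejected)   # ['Root'] -- the violation triggers the repair operator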

Existing recovery tools attempt to detect invalid syntax during the parsing process, but they do not validate the semantics of the parsed data. The above trailer example shows that such a basic heuristic is not sufficient. For example, the reason that the value '1' is accepted as a valid value by their parsers is that it conforms to the syntax of a numeric object. Although the syntax is correct, the value is not semantically correct, which prevents the repair operators from being triggered to repair the incorrect file trailer. Another advantage of validating the semantics of the parsed data is that it allows fault-tolerant parsing. Existing parsers attempt to recover from parsing errors by

skipping over tokens (from the tokenizer). They generally give up after a certain number of attempts due to the possibility of introducing parsing errors. A constraint-based parser reduces the chances of a parsing error since it validates all the key-value pairs within each dictionary. Therefore, our proposed constraint-based parser does not have a limit on the number of parsing attempts.

In order for a constraint-based parser to validate an object, it must know the type of the object it is dealing with. We can achieve this by analyzing the file format's tree structure starting from the root of the document tree (Figure 4.2) and recursively parsing the child objects in the tree with guidance from the constraints. For example, the trailer dictionary contains the key, "Root," and the constraints of the child object are documented in chapter 7.7.2 of the documentation (Table 6.1). In the case where the parser fails to locate the dictionary constraint, it attempts to guess the type of the object by locating the dictionary constraint table that can parse the highest number of key-value pairs (a sketch of this fallback appears at the end of this section).

We implemented the backend tokenizer and parser based on mutool's [73] implementation by mapping the C code into Python. While mutool's tokenizer and parser validate the syntactic correctness of the data, our proposed technique utilizes constraints to validate the semantic correctness of the data.
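The following sketch (our assumption, not the implemented algorithm) shows the fallback mentioned above: when the dictionary type cannot be located, pick the constraint table whose defined keys account for the largest number of parsed key-value pairs. The key sets below are illustrative excerpts rather than the full tables.

def guess_dictionary_type(parsed_keys, constraint_tables):
    """Return the dictionary type whose constraint table covers the most parsed keys."""
    return max(constraint_tables,
               key=lambda t: len(parsed_keys & constraint_tables[t]))

# Illustrative key sets for three of the extracted dictionary constraint tables.
tables = {
    "trailer": {"Size", "Root", "Prev", "Encrypt"},
    "catalog": {"Type", "Pages", "Outlines"},
    "page":    {"Type", "Parent", "Contents", "MediaBox"},
}

print(guess_dictionary_type({"Size", "Root"}, tables))   # -> "trailer"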

Chapter 7

Conclusion

This thesis demonstrated several techniques that utilize documentation to improve software dependability. Specifically, we improve both a system's reliability (e.g., failure-free operation) and maintainability (e.g., ease of understanding) using documentation. We propose and implement three pieces of work to support this claim.

Chapter 3 focuses on improving software reliability. The proposed technique, DASE, automatically extracts and applies input constraints from documentation to improve a symbolic execution tool for automated bug detection and test generation.

Chapter 4 also focuses on improving software reliability. The work contains an empirical study of corrupted PDF files. The proposed technique, DocRepair, utilizes manually extracted constraints from documentation to repair corrupted files automatically.

Chapter 5 focuses on improving software maintainability. The proposed technique, CloCom+, improves software documentation by generating source code comments automatically. CloCom+ generates code comments by mining existing software repositories on GitHub and a Question and Answer site, Stack Overflow.

Lastly, we include a future work discussion in Chapter 6 to demonstrate how to extract file format constraints from documentation automatically to improve a symbolic execution tool for bug detection.

References

[1] Lin Tan, Chen Liu, Zhenmin Li, Xuanhui Wang, Yuanyuan Zhou, and Chengxiang Zhai. Bug characteristics in open source software. Empirical Software Engineering, 19(6):1665–1705, December 2014.

[2] Sergio Cozzetti B. de Souza, Nicolas Anquetil, and Káthia M. de Oliveira. A study of the documentation essential to software maintenance. In International Conference on Design of Communication, pages 68–75, 2005.

[3] K. K. Aggarwal, Y. Singh, and J. K. Chhabra. An integrated measure of software maintainability. In Reliability and Maintainability Symposium, pages 235–241, 2002.

[4] L. Prechelt, B. Unger-Lamprecht, M. Philippsen, and W. F. Tichy. Two controlled experiments assessing the usefulness of design pattern documentation in program maintenance. IEEE Transactions on Software Engineering, 28(6):595–606, 2002.

[5] Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, et al. Fundamental concepts of dependability. University of Newcastle upon Tyne, Computing Science, 2001.

[6] GitHub. https://github.com/, 2018.

[7] E. Wong, Taiyue Liu, and Lin Tan. Clocom: Mining existing source code for automatic comment generation. In Software Analysis, Evolution, and Reengineering, pages 380–389, 2015.

[8] Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. /*icomment: Bugs or bad comments?*/. In ACM SIGOPS Symposium on Operating Systems Principles, pages 145–158, 2007.

[9] R. Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, S. Oney, and A. Paradkar. Inferring method specifications from natural language api descriptions. In International Conference on Software Engineering, pages 815–825, 2012.

[10] Mira Kajko-Mattsson. A survey of documentation practice within corrective maintenance. Empirical Software Engineering, 10(1):31–55, 2005.

[11] Hao Zhong and Zhendong Su. Detecting api documentation errors. In Object Oriented Programming Systems Languages & Applications, pages 803–816, 2013.

[12] Lin Shi, Hao Zhong, Tao Xie, and Mingshu Li. An empirical study on evolution of api documentation. In Fundamental Approaches to Software Engineering, pages 416–431, 2011.

[13] Patrice Godefroid, Adam Kiezun, and Michael Y. Levin. Grammar-based whitebox fuzzing. Programming Language Design and Implementation, 43(6):206–215, 2008.

[14] Rupak Majumdar and Ru-Gang Xu. Directed test generation using symbolic grammars. In Automated Software Engineering, pages 134–143, 2007.

[15] Chromium. 134551 - pdf: Null ptr crash trying to open (corrupt?) document with two trailers. https://bugs.chromium.org/p/chromium/issues/detail?id=134551, 2012.

[16] National Vulnerability Database. Cve-2018-3924 detail. https://nvd.nist.gov/vuln/detail/CVE-2018-3924, 2018.

[17] National Vulnerability Database. Cve-2018-3939 detail. https://nvd.nist.gov/vuln/detail/CVE-2018-3939, 2018.

[18] National Vulnerability Database. Cve-2017-14458 detail. https://nvd.nist.gov/vuln/detail/CVE-2017-14458, 2017.

[19] Zero Day Initiative. combobox format event use-after-free remote code execution vulnerability. https://www.zerodayinitiative.com/advisories/ZDI-18-694/, 2018.

[20] Zero Day Initiative. Foxit reader resetform use-after-free remote code execution vulnerability. https://www.zerodayinitiative.com/advisories/ZDI-18-695/, 2018.

[21] Haryadi S. Gunawi, Vijayan Prabhakaran, Swetha Krishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Improving file system reliability with i/o shepherding. In ACM SIGOPS Symposium on Operating Systems Principles, pages 293–306, 2007.

[22] Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using model checking to find serious file system errors. Transactions on Computer Systems, 24(4):393–423, 2006.

[23] Mozilla. Bug 560346 - pdf attachments corrupt when downloaded through imap. https://bugzilla.mozilla.org/show_bug.cgi?id=560346, 2016.

[24] US-CERT/NIST. Vulnerability summary for cve-2009-3608. https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2009-3608, 2017.

[25] Launchpad. Bug #713325 "save as pdf produces broken pdf with blurred paths". https://bugs.launchpad.net/inkscape/+bug/713325, 2011.

[26] Tomasz Kuchta, Cristian Cadar, Miguel Castro, and Manuel Costa. Docovery: Toward generic automatic document recovery. In Automated Software Engineering, pages 563–574, 9 2014.

[27] Karl Wüst, Petar Tsankov, Saša Radomirović, and Mohammad Torabi Dashti. Force open: Lightweight black box file repair. In DFRWS Europe, pages S75–S82, 2017.

[28] Sumit Narayan, John A. Chandy, Samuel Lang, Philip Carns, and Robert Ross. Uncovering errors: The cost of detecting silent data corruption. In Workshop on Petascale Data Storage, pages 37–41, 2009.

[29] Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy, and Jiri Schindler. An analysis of latent sector errors in disk drives. In ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 289–300, 2007.

[30] Lakshmi N. Bairavasundaram, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Garth R. Goodson, and Bianca Schroeder. An analysis of data corruption in the storage stack. Trans. Storage, 4(3):8:1–8:28, November 2008.

[31] Joel Spolsky and Jeff Atwood. StackOverflow. http://stackoverflow.com, 2018.

[32] E. Wong, Jinqiu Yang, and Lin Tan. Autocomment: Mining question and answer sites for automatic comment generation. In Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, pages 562–567, Nov 2013.

[33] Paul W. McBurney and Collin McMillan. Automatic documentation generation via source code summarization of method context. In International Conference on Program Comprehension, pages 279–290, 2014.

[34] Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180, 2003.

[35] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430, 2003.

[36] Manziba Akanda Nishi and Kostadin Damevski. Scalable code clone detection and search based on adaptive prefix filtering. Journal of Systems and Software, 137:130 – 142, 2018.

[37] Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K. Roy, and Cristina V. Lopes. Sourcerercc: Scaling code clone detection to big-code. In International Conference on Software Engineering, pages 1157–1168, 2016.

[38] L.A. Clarke. A system to generate test data and symbolically execute programs. IEEE Transactions on Software Engineering, SE-2(3):215–222, 1976.

[39] James C. King. Symbolic execution and program testing. Communications of the ACM, 19(7):385–394, 1976.

[40] C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In USENIX conference on Operating systems design and implementation, pages 209–224, 2008.

[41] P. D. Marinescu and C. Cadar. KATCH: High-coverage testing of software patches. In Joint Meeting on Foundations of Software Engineering, pages 235–245, 2013.

[42] V. Chipounov, V. Kuznetsov, and G. Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In Architectural Support for Programming Languages and Operating Systems, pages 265–278, 2011.

[43] P. Godefroid, N. Klarlund, and K. Sen. DART: Directed automated random testing. In ACM SIGPLAN conference on Programming language design and implementation, pages 213–223, 2005.

[44] C. S. Păsăreanu, W. Visser, D. Bushnell, J. Geldenhuys, P. Mehlitz, and N. Rungta. Symbolic pathfinder: Integrating symbolic execution with model checking for java bytecode analysis. Automated Software Engineering, 20(3):391–425, 2013.

[45] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCamant, and D. Song. A symbolic execution framework for javascript. In Proceedings of the IEEE Symposium on Security and Privacy, pages 513–528, 2010.

[46] K. Sen, D. Marinov, and G. Agha. CUTE: A concolic unit testing engine for C. In Foundations of Software Engineering, pages 263–272, 2005.

[47] Nikolai Tillmann and Jonathan De Halleux. Pex–white box test generation for .net. In Tests and Proofs, pages 134–153. 2008.

[48] Z. Xu, Y. Kim, M. Kim, G. Rothermel, and M. B. Cohen. Directed test suite augmentation: Techniques and tradeoffs. In Foundations of Software Engineering, pages 257–266, 2010.

[49] Kin-Keung Ma, Khoo Yit Phang, Jeffrey S Foster, and Michael Hicks. Directed symbolic execution. In International Conference on Static Analysis, pages 95–111. 2011.

[50] David M. Perry, Andrea Mattavelli, Xiangyu Zhang, and Cristian Cadar. Accelerating array constraints in symbolic execution. In International Symposium on Software Testing and Analysis, pages 68–78, 2017.

[51] Koushik Sen, George Necula, Liang Gong, and Wontae Choi. Multise: Multi-path symbolic execution using value summaries. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 842–853, New York, NY, USA, 2015. ACM.

[52] D. Babić, L. Martignoni, S. McCamant, and D. Song. Statically-directed dynamic automated test generation. In International Symposium on Software Testing and Analysis, pages 12–22, 2011.

[53] Thanassis Avgerinos, Alexandre Rebert, Sang Kil Cha, and David Brumley. Enhancing symbolic execution with veritesting. In International Conference on Software Engineering, pages 1083–1094, 2014.

[54] Y. Li, Z. Su, L. Wang, and X. Li. Steering symbolic execution to less traveled paths. In International Conference on Object-Oriented Programming, Systems, Languages and Applications, pages 19–32, 2013.

[55] J. Burnim and K. Sen. Heuristics for scalable dynamic test generation. In International Conference on Automated Software Engineering, pages 443–446, 2008.

[56] C. Y. Cho, D. Babić, P. Poosankam, K. Z. Chen, E. X. Wu, and D. Song. MACE: Model-inference-assisted concolic exploration for protocol and vulnerability discovery. In USENIX conference on Security, pages 10–10, 2011.

[57] K. Krishnamoorthy, M. S. Hsiao, and L. Lingappan. Strategies for scalable symbolic execution-driven test generation for programs. Science China Information Sciences, 54(9):1797–1812, 2011.

[58] R. Santelices and M. J. Harrold. Exploiting program dependencies for scalable multiple-path symbolic execution. In International Symposium on Software Testing and Analysis, pages 195–206, 2010.

[59] Sangmin Park, B. M. Mainul Hossain, Ishtiaque Hussain, Christoph Csallner, Mark Grechanik, Kunal Taneja, Chen Fu, and Qing Xie. Carfast: Achieving higher statement coverage faster. In International Symposium on the Foundations of Software Engineering, pages 35:1–35:11, 2012.

[60] P. D. Marinescu and C. Cadar. Make test-zesti: A symbolic execution solution for improving regression testing. In International Conference on Software Engineering, pages 716–726, 2012.

[61] E. J. Weyuker and T. J. Ostrand. Theories of program testing and the application of revealing subdomains. Transactions on Software Engineering, SE-6(3):236–246, 1980.

[62] R. Majumdar and R. Xu. Reducing test inputs using information partitions. In International Conference on Computer Aided Verification, pages 555–569, 2009.

[63] M. Staats and C. Păsăreanu. Parallel symbolic execution for structural test gener- ation. In International symposium on Software testing and analysis, pages 183–194, 2010.

[64] D. Qi, H. D.T. Nguyen, and A. Roychoudhury. Path exploration based on symbolic output. In ACM Transactions on Software Engineering and Methodology, pages 278– 288, 2011.

[65] David Trabish, Andrea Mattavelli, Noam Rinetzky, and Cristian Cadar. Chopped symbolic execution. In International Conference on Software Engineering, pages 350–360, 2018.

[66] Insu Yun, Sangho Lee, Meng Xu, Yeongjin Jang, and Taesoo Kim. QSYM: A Practical Concolic Execution Engine Tailored for Hybrid Fuzzing. In USENIX Security Symposium, 2018.

[67] Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In Network and Distributed System Security Symposium, 2016.

[68] Carlos Pacheco. Directed Random Testing. Ph.D., MIT Department of Electrical Engineering and Computer Science, Cambridge, Massachusetts, 2009.

[69] Brian Demsky and Martin Rinard. Automatic detection and repair of errors in data structures. In ACM SIGPLAN Conference on Object-oriented Programing, Systems, Languages, and Applications, OOPSLA ’03, pages 78–95, 2003.

[70] G. Endignoux, O. Levillain, and J. Y. Migeon. Caradoc: A pragmatic approach to pdf parsing and validation. In IEEE Security and Privacy Workshops, pages 126–139, 2016.

[71] Fan Long, Vijay Ganesh, Michael Carbin, Stelios Sidiroglou, and Martin Rinard. Automatic input rectification. In International Conference on Software Engineering, pages 80–90, 2012.

[72] Martin C. Rinard. Living in the comfort zone. In ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications, pages 611–622, 2007.

[73] Artifex Software. Mupdf. http://mupdf.com/news, 2016.

[74] PDF Labs. Pdftk - the pdf toolkit. https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/, 2016.

[75] SysInfoTools Software. Pdf recovery tool. https://gallery.technet.microsoft.com/PDF-Repair-Tool-to-Restore-f21703c9.

[76] Inc Repair Toolbox. Pdf repair toolbox. https://gallery.technet.microsoft.com/PDF-Repair-Tool-to-Restore-f21703c9.

[77] Freedesktop. Poppler. https://poppler.freedesktop.org/, 2017.

[78] GhostScript. Bugzilla – main page. http://bugs.ghostscript.com/, 2016.

[79] Koushik Sen, Darko Marinov, and Gul Agha. Cute: A concolic unit testing engine for c. In Foundations of Software Engineering, pages 263–272, 2005.

[80] Brian Demsky and Martin Rinard. Automatic detection and repair of errors in data structures. ACM SIGPLAN conference on Object-oriented programing, systems, languages, and applications, 38(11):78–95, 2003.

[81] Ralf D. Brown. Reconstructing corrupt deflated files. Digital Investigation: The International Journal of Digital Forensics & Incident Response, 8:S125–S131, 2011.

[82] Ralf D. Brown. Improved recovery and reconstruction of deflated files. DFRWS Conference, 10:S21–S29, 2013.

[83] K.K. Aggarwal, Y. Singh, and J.K. Chhabra. An integrated measure of software maintainability. In Proceedings of Annual Reliability and Maintainability Symposium, pages 235–241, 2002.

[84] S. Zhang, C. Zhang, and M. D. Ernst. Automated documentation inference to explain failed tests. In Automated Software Engineering, pages 63–72, 2011.

[85] Raymond P. L. Buse and Westley R. Weimer. Automatic documentation inference for exceptions. In International Symposium on Software Testing and Analysis, pages 273–282, 2008.

[86] F. Long, X. Wang, and Y. Cai. Api hyperlinking via structural overlap. In Foundations of Software Engineering, pages 203–212, 2009.

[87] R. P. L. Buse and W. Weimer. Automatically documenting program changes. In Automated Software Engineering, pages 33–42, 2010.

[88] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Generating parameter comments and integrating with method summaries. In International Conference on Program Comprehension, pages 71–80, 2011.

[89] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K. Vijay-Shanker. Towards automatically generating summary comments for java methods. In Automated Software Engineering, pages 43–52, 2010.

[90] G. Sridhara, L. Pollock, and K. Vijay-Shanker. Automatically detecting and describing high level actions within methods. In International Conference on Software Engineering, pages 101–110, 2011.

[91] L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and K. Vijay-Shanker. Automatic generation of natural language summaries for java classes. In International Conference on Program Comprehension, pages 23–32, 2013.

[92] E. Hill, L. Pollock, and K. Vijay-Shanker. Automatically capturing source code context of nl-queries for software maintenance and reuse. In International Conference on Software Engineering, pages 232–242, 2009.

[93] S. Rastkar, G. C. Murphy, and A. W. J. Bradley. Generating natural language summaries for crosscutting source code concerns. In International Conference on Software Maintenance, pages 103–112, 2011.

[94] Suparna Gundagathi Manjunath. Towards comment generation for mpi programs. Master’s thesis, University of Delaware, 2011.

[95] X. Wang, L. Pollock, and K. Vijay-Shanker. Automatic segmentation of method code into meaningful blocks to improve readability. In Working Conference on Reverse Engineering, pages 35 –44, 2011.

[96] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. Deep code comment generation. In International Conference on Program Comprehension, ICPC ’18, pages 200–210, 2018.

[97] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke S. Zettlemoyer. Summarizing source code using a neural attention model. In ACL, 2016.

[98] S. Haiduc, J. Aponte, L. Moreno, and A. Marcus. On the use of automated text summarization techniques for summarizing source code. In Working Conference on Reverse Engineering, pages 35–44, 2010.

[99] A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella, and S. Panichella. Using ir methods for labeling source code artifacts: Is it worthwhile? In International Conference on Program Comprehension, pages 193–202, 2012.

[100] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. Supporting program comprehension with source code summarization. In International Conference on Software Engineering, pages 223–226, 2010.

[101] B. P. Eddy, J. A. Robinson, N. A. Kraft, and J. C. Carver. Evaluating source code summarization techniques: Replication and expansion. In International Conference on Program Comprehension, pages 13–22, 2013.

[102] Paul W. McBurney, Cheng Liu, Collin McMillan, and Tim Weninger. Improving topic model source code summarization. In International Conference on Program Comprehension, pages 291–294, 2014.

[103] Laura Moreno and Jairo Aponte. On the analysis of human and automatic summaries of source code. CLEI Electronic Journal, 15(2), 2012.

[104] S. Panichella, J. Aponte, M. D. Penta, A. Marcus, and G. Canfora. Mining source code descriptions from developer communications. In International Conference on Program Comprehension, pages 63–72, 2012.

[105] B. Dagenais and M. P. Robillard. Recovering traceability links between an api and its learning resources. In International Conference on Software Engineering, pages 47–57, 2012.

[106] Carmine Vassallo, Sebastiano Panichella, Massimiliano Di Penta, and Gerardo Canfora. Codes: Mining source code descriptions from developers discussions. In International Conference on Program Comprehension, pages 106–109, 2014.

[107] J. Kim, S. Lee, S. Hwang, and S. Kim. Enriching documents with examples: A corpus mining approach. ACM Transactions on Information Systems, pages 1:1–1:27, 2013.

[108] A. Bacchelli, M. D’Ambros, and M. Lanza. Extracting source code from e-mails. In International Conference on Program Comprehension, pages 24 –33, 2010.

[109] Nicolas Bettenburg, Bram Adams, Ahmed E. Hassan, and Michel Smidt. A lightweight approach to uncover technical artifacts in unstructured data. In International Conference on Program Comprehension, pages 185–188, 2011.

[110] P. Chatterjee, M. A. Nishi, K. Damevski, V. Augustine, L. Pollock, and N. A. Kraft. What information about code snippets is available in different software-related documents? an exploratory study. In International Conference on Software Analysis, Evolution and Reengineering, pages 382–386, 2017.

[111] P. Chatterjee, B. Gause, H. Hedinger, and L. Pollock. Extracting code segments and their descriptions from research articles. In International Conference on Mining Software Repositories, pages 91–101, 2017.

[112] Dick Grune. The software and text similarity tester SIM. http://dickgrune.com/Programs/similarity_tester/, 2015.

[113] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. Cp-miner: Finding copy-paste and related bugs in large-scale software code. IEEE Transactions on Software Engineering, pages 176–192, 2006.

[114] T. Kamiya, S. Kusumoto, and K. Inoue. Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering, pages 654–670, 2002.

[115] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In International Conference on Software Engineering, pages 96–105, 2007.

[116] I.D. Baxter, C. Pidgeon, and M. Mehlich. Dms®: Program transformations for practical scalable software evolution. In International Conference on Software Engineering, pages 625–634, 2004.

[117] Raghavan Komondoor and Susan Horwitz. Using slicing to identify duplication in source code. In International Symposium on Static Analysis, pages 40–56, 2001.

[118] Chanchal K. Roy, James R. Cordy, and Rainer Koschke. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming, pages 470–495, 2009.

[119] R. Wettel and R. Marinescu. Archeology of code duplication: recovering duplication chains from small duplication fragments. In International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, pages 8 pp.–, 2005.

[120] Pengcheng Wang, Jeffrey Svajlenko, Yanzhao Wu, Yun Xu, and Chanchal K. Roy. Ccaligner: A token based large-gap clone detector. In International Conference on Software Engineering, pages 1066–1077, 2018.

[121] I. Keivanloo, C.K. Roy, J. Rilling, and P. Charland. Shuffling and randomization for scalable source code clone detection. In International Workshop on Software Clones, pages 82–83, 2012.

[122] C.K. Roy and J.R. Cordy. Nicad: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In International Conference on Program Comprehension, pages 172–181, 2008.

[123] N. Gode and R. Koschke. Incremental clone detection. In European Conference on Software Maintenance and Reengineering, pages 219–228, 2009.

[124] Patrice Godefroid, Michael Y. Levin, and David Molnar. Sage: Whitebox fuzzing for security testing. Queue, 10(1):20:20–20:27, January 2012.

[125] Michael Sutton, Adam Greene, and Pedram Amini. Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley Professional, 2007.

[126] R. Majumdar and R. Xu. Directed test generation using symbolic grammars. In International Conference on Automated Software Engineering, pages 134–143, 2007.

[127] Cristian Cadar, Vijay Ganesh, Peter M. Pawlowski, David L. Dill, and Dawson R. Engler. Exe: Automatically generating inputs of death. In Proceedings of the 13th ACM Conference on Computer and Communications Security, CCS ’06, pages 322– 335, New York, NY, USA, 2006. ACM.

[128] Caroline Lemieux, Rohan Padhye, Koushik Sen, and Dawn Song. Perffuzz: Automatically generating pathological inputs. In International Symposium on Software Testing and Analysis, pages 254–265, 2018.

[129] C. Rubio-González and B. Liblit. Expect the unexpected: Error code mismatches between documentation and the real world. In ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 73–80, 2010.

[130] L. Tan, Y. Zhou, and Y. Padioleau. aComment: Mining annotations from comments and code to detect interrupt-related concurrency bugs. In International Conference on Software Engineering, pages 11–20, 2011.

[131] S. H. Tan, D. Marinov, L. Tan, and G. T. Leavens. @tComment: Testing javadoc comments to detect comment-code inconsistencies. In International Conference on Software Testing, Verification and Validation, pages 260–269, 2012.

[132] H. Zhong, L. Zhang, T. Xie, and H. Mei. Inferring specifications for resources from natural language api documentation. Automated Software Engineering Journal, 18(3-4):227–261, 2011.

[133] Edmund Wong, Lei Zhang, Song Wang, Taiyue Liu, and Lin Tan. Dase: Document-assisted symbolic execution for improving automated software testing. In International Conference on Software Engineering - Volume 1, pages 620–631, 2015.

[134] Hao Zhong, Lu Zhang, Tao Xie, and Hong Mei. Inferring resource specifications from natural language api documentation. In Automated Software Engineering, pages 307– 318, 2009.

[135] Rahul Pandita, Xusheng Xiao, Hao Zhong, Tao Xie, Stephen Oney, and Amit Paradkar. Inferring method specifications from natural language api descriptions. In International Conference on Software Engineering, pages 815–825, 2012.

[136] Cindy Rubio-González and Ben Liblit. Expect the unexpected: Error code mismatches between documentation and the real world. In ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 73–80, 2010.

[137] L. C. Briand and A. Wolf. Software testing research: Achievements, challenges, dreams. In Future of Software Engineering, pages 85–103, 2007.

[138] C. Boyapati, S. Khurshid, and D. Marinov. Korat: Automated testing based on java predicates. SIGSOFT Software Engineering Notes, 27(4):123–133, 2002.

[139] S. Thummalapenta, K. V. Lakshmi, S. Sinha, N. Sinha, and S. Chandra. Guided test generation for web applications. In International Conference on Software Engineering, pages 162–171, 2013.

[140] A. Avancini and M. Ceccato. Comparison and integration of genetic algorithms and dynamic symbolic execution for security testing of cross-site scripting vulnerabilities. Information and Software Technology, 55(12):2209–2222, 2013.

[141] C. Cadar, P. Godefroid, S. Khurshid, C. S. Păsăreanu, K. Sen, N. Tillmann, and W. Visser. Symbolic execution for software testing in practice: Preliminary assessment. In International Conference on Software Engineering, pages 1066–1071, 2011.

[142] M. K. Ganai, N. Arora, C. Wang, A. Gupta, and G. Balakrishnan. BEST: A symbolic testing tool for predicting multi-threaded program failures. In International Conference on Automated Software Engineering, pages 596–599, 2011.

[143] K. Li, C. Reichenbach, Y. Smaragdakis, Y. Diao, and C. Csallner. SEDGE: Symbolic example data generation for dataflow programs. In International Conference on Automated Software Engineering, pages 235–245, 2013.

[144] P. Zhang, S. Elbaum, and M. B. Dwyer. Automatic generation of load tests. In Automated Software Engineering, pages 43–52, 2011.

[145] P. Godefroid, A. Kiezun, and M. Y. Levin. Grammar-based whitebox fuzzing. In Proceedings of the ACM SIGPLAN conference on Programming language design and implementation, pages 206–215, 2008.

[146] David McClosky and Christopher D. Manning. Learning constraints for consistent timeline extraction. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 873–882, 2012.

[147] Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. Jointly identifying temporal relations with markov logic. In Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1 - Volume 1, pages 405–413, 2009.

[148] Achim D Brucker, Matthias P Krieger, Delphine Longuet, and Burkhart Wolff. A specification-based test case generation method for uml/ocl. In Models in Software Engineering, pages 334–348. 2011.

[149] J. Offutt and A. Abdurazik. Generating tests from uml specifications. In International Conference on The Unified Modeling Language: Beyond the Standard, pages 416–429, 1999.

[150] Sourceware Bugzilla. The Stanford natural language processing dependencies. http://nlp.stanford.edu/software/stanford-dependencies.shtml, 2015.

[151] H. Zhong, L. Zhang, T. Xie, and H. Mei. Inferring resource specifications from natural language api documentation. In Automated Software Engineering, pages 307–318, 2009.

[152] Marie-Catherine de Marneffe and Christopher D. Manning. Stanford typed dependencies manual. http://nlp.stanford.edu/software/dependencies_manual.pdf, 2013.

[153] Tool Interface Standards. Executable and linkable format. http://www.skyfree.org/linux/references/ELF_Format.pdf, 2015.

[154] The Clang Team. Clang 8 documentation. https://clang.llvm.org/docs/ClangTools.html, 2018.

[155] The KLEE Team. OSDI'08 Coreutils Experiments. http://klee.github.io/docs/coreutils-experiments/, 2015.

[156] Sourceware Bugzilla. Readelf bug 16664 - Segmentation fault in process_attributes() of readelf.c. https://sourceware.org/bugzilla/show_bug.cgi?id=16664, 2014.

[157] Michael D. Ernst, Jeff H. Perkins, Philip J. Guo, Stephen McCamant, Carlos Pacheco, Matthew S. Tschantz, and Chen Xiao. The daikon system for dynamic detection of likely invariants. Science of Computer Programming, 69(1-3):35–45, 2007.

[158] Sudheendra Hangal and Monica S. Lam. Tracking down software bugs using automatic anomaly detection. In International Conference on Software Engineering, pages 291–301, 2002.

[159] Dawson Engler, David Yu Chen, Seth Hallem, Andy Chou, and Benjamin Chelf. Bugs as deviant behavior: A general approach to inferring errors in systems code. In ACM Symposium on Operating Systems Principles, pages 57–72, 2001.

[160] Vijay Ganesh, Tim Leek, and Martin Rinard. Taint-based directed whitebox fuzzing. In International Conference on Software Engineering, pages 474–484, 2009.

[161] Maverick Woo, Sang Kil Cha, Samantha Gottlieb, and David Brumley. Scheduling black-box mutational fuzzing. In ACM Conference on Computer and Communications Security, pages 511–522, 2013.

[162] American fuzzy lop. http://lcamtuf.coredump.cx/afl/, 2018.

[163] Glyph & Cog. pdftotext. https://www.xpdfreader.com/pdftotext-man.html, 2018.

[164] F. Zhao. On choosing the digital document's file format for long-term preservation. In International Conference on Communication Software and Networks, pages 370–372, 2011.

[165] M. Y. B. Masod and R. Ahmad. Digital data exchange in digital prepress technology: The adoption of a new standards. In IEEE Business Engineering and Industrial Applications Colloquium, pages 410–414, 2013.

[166] Adobe Systems Incorporated. Pdf reference and adobe extensions to the pdf specification, 2016.

[167] Adobe Systems Incorporated. Document management - portable document format - part 1: Pdf 1.7. http://www.adobe.com/devnet/pdf/pdf_reference.html, 2017.

[168] Tomasz Kuchta, Thibaud Lutellier, Edmund Wong, Lin Tan, and Cristian Cadar. On the correctness of electronic documents: studying, finding, and localizing inconsistency bugs in pdf readers and files. Empirical Software Engineering, 2018.

[169] Scrapinghub. Scrapy. http://scrapy.org/, 2016.

[170] Simson Garfinkel, Paul Farrell, Vassil Roussev, and George Dinolt. Bringing science to digital forensics with standardized forensic corpora. In The Digital Forensic Research Conference, 2009.

[171] GNOME. Documents - document manager. https://wiki.gnome.org/Apps/Documents, 2018.

[172] Okular developers. Okular. https://okular.kde.org/, 2016.

[173] Apache. The apache fop project. https://xmlgraphics.apache.org/fop/, 2018.

[174] LibreOffice. Bugzilla main page. https://bugs.documentfoundation.org/, 2016.

[175] Lincoln. Pdftohtml conversion program. http://pdftohtml.sourceforge.net/, 2018.

[176] Mozilla Foundation. Bugzilla. https://www.bugzilla.org/, 2016.

[177] Google. Monorail issue tracker. https://chromium.googlesource.com/infra/infra/+/master/appengine/monorail, 2016.

[178] Canonical Group Ltd. Launchpad. https://launchpad.net/, 2016.

[179] Adobe. Acrobat pro dc. https://acrobat.adobe.com/us/en/acrobat/acrobat-pro.html, 2018.

[180] Brendan Zagaeski. Minimal pdf - contents. http://brendanzagaeski.appspot.com/0004.html, 2018.

[181] Artifex Software. Ghostscript. http://www.ghostscript.com/, 2016.

[182] The GNOME Project. Apps/evince. https://wiki.gnome.org/Apps/Evince, 2016.

[183] Glyph & Cog. Xpdf: A pdf viewer for x. https://wiki.gnome.org/Apps/Evince, 2016.

[184] Google. The chromium projects. https://www.chromium.org/Home, 2016.

[185] Google. Pdfium. https://pdfium.googlesource.com/pdfium/, 2017.

[186] Mozilla. Firefox. http://mupdf.com/news, 2016.

[187] Mozilla. Pdf.js. https://mozilla.github.io/pdf.js/, 2017.

[188] Adobe Systems Incorporated. Adobe acrobat reader dc. https://get.adobe.com/reader/, 2016.

[189] Adam Reichold. qpdfview. https://launchpad.net/qpdfview, 2016.

[190] Volker C. Behr. Cups-pdf. https://www.cups-pdf.de/, 2015.

[191] Chromium. Bug # "out of bounds write in pdf with sample function with lots of inputs". https://bugs.chromium.org/p/chromium/issues/detail?id=124182, 2012.

[192] Gavriel Salvendy. review understanding your users: A practical guide to user requirements by catherine courage and kathy baxter. International Journal of Human–Computer Interaction, 19(1):155–156, 2005.

[193] Enough Pepper. xsort - free card sorting application for mac. https://xsortapp.com/, 2018.

[194] GNOME. Bug #349826 "gtk+ produces invalid pdf on amd64". https://bugzilla.gnome.org/show_bug.cgi?id=349826, 2006.

[195] Document Foundation. Bug 39946 - different appearance while printing or exporting to pdf. https://bugs.documentfoundation.org/show_bug.cgi?id=39946, 2011.

[196] Freedesktop. Bug 43646 - downloaded pdf w/forms will not open when uploaded to adobe. https://bugs.freedesktop.org/show_bug.cgi?id=43646, 2011.

[197] Launchpad. Bug 518230 - some pdf forms don't save entered information. https://bugs.launchpad.net/ubuntu/+source/evince/+bug/518230, 2010.

[198] Chromium. Bug 179013 - print blank page on chrome. https://bugs.chromium.org/p/chromium/issues/detail?id=179013, 2013.

[199] Launchpad. Bug #423630 "lists the right application but failt to run it when mimetype wrongly set on server". https://bugs.launchpad.net/ubuntu/+source/firefox/+bug/423630, 2010.

[200] Launchpad. Simple scan. https://launchpad.net/simple-scan, 2018.

[201] Launchpad. Bug 760967 - simple scan creates corrupt file and freezes when saving pdf. https://bugs.launchpad.net/ubuntu/+source/simple-scan/+bug/760967, 2011.

[202] Launchpad. Bug 789906 - crash when saving pdf. https://bugs.launchpad.net/ubuntu/+source/simple-scan/+bug/789906, 2011.

[203] Launchpad. Bug #537331 "evince crashed with sigsegv in __memset_sse2() when opening a pdf". https://bugs.launchpad.net/ubuntu/+source/poppler/+bug/537331, 2010.

[204] ICC. International color consortium. http://www.color.org/index.xalter, 2018.

[205] Chromium. Bug #70440 "saving results in damaged file". https://bugs.chromium.org/p/chromium/issues/detail?id=70440, 2011.

[206] Beat Fluri, Michael Würsch, and Harald C. Gall. Do code and comments co-evolve? On the relation between source code and comment changes. In Working Conference on Reverse Engineering, pages 70–79, 2007.

[207] Paul W. McBurney and Collin McMillan. Automatic documentation generation via source code summarization of method context. In International Conference on Program Comprehension, pages 279–290, 2014.

[208] Edmund Wong, Jinqiu Yang, and Lin Tan. Autocomment: Mining question and answer sites for automatic comment generation. In Automated Software Engineering, pages 562–567, 2013.

[209] Chris Parnin, Christoph Treude, Lars Grammel, and Margaret-Anne D. Storey. Crowd documentation: Exploring the coverage and the dynamics of api discussions on stack overflow. Technical Report GIT-CS-12-05, Georgia Tech, 2012.

[210] S. Subramanian and R. Holmes. Making sense of online code snippets. In Mining Software Repositories, pages 85–88, 2013.

[211] Edmund Wong, Jinqiu Yang, Nasir Ali, and Lin Tan. Autocomment: Crowd sourced comment generation. http://asset.uwaterloo.ca/AutoComment2/, 2018.

[212] Stack Exchange, Inc. Stack exchange data dump. https://archive.org/details/stackexchange, 2018.

[213] L. Ponzanelli, A. Mocci, and M. Lanza. Stormed: Stack overflow ready made data. In Working Conference on Mining Software Repositories, pages 474–477, 2015.

[214] L. Ponzanelli, A. Mocci, A. Bacchelli, M. Lanza, and D. Fullerton. Improving low quality stack overflow post detection. In International Conference on Software Maintenance and Evolution, pages 541–544, 2014.

[215] Beatrice Santorini. Part-of-speech tagging guidelines for the penn treebank project (3rd revision, 2nd printing). Technical report, Department of Linguistics, University of Pennsylvania, 1990.

[216] Di Yang, Aftab Hussain, and Cristina Videira Lopes. From query to usable code: An analysis of stack overflow code snippets. In Mining Software Repositories, pages 391–402, 2016.

[217] Y. Ueda, T. Kamiya, S. Kusumoto, and K. Inoue. On detection of gapped code clones using gap locations. In Software Engineering Conference, pages 327–336, 2002.

[218] B. Li, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, and N. A. Kraft. Automatically documenting unit test cases. In International Conference on Software Testing, Verification and Validation, pages 341–352, 2016.

[219] Al Danial. CLOC – count lines of code. https://github.com/AlDanial/cloc, 2016.

[220] B. D. Cruz, P. W. McBurney, and C. McMillan. Tracelab components for reproducing source code summarization experiments. In International Conference on Software Maintenance and Evolution, pages 610–610, 2016.

[221] R. Likert. A technique for the measurement of attitudes. Archives of Psychology, 22(140):1–55, 1932.

[222] Luis Fernando Cortés-Coy, Mario Linares-Vásquez, Jairo Aponte, and Denys Poshyvanyk. On automatically generating commit messages via summarization of source code changes. In International Working Conference on Source Code Analysis and Manipulation, pages 275–284, 2014.

[223] Laura Moreno, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Andrian Marcus, and Gerardo Canfora. Automatic generation of release notes. In ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 484–495, 2014.

[224] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Association for Computational Linguistics, pages 311–318, 2002.

[225] Siyuan Jiang, Ameer Armaly, and Collin McMillan. Automatically generating com- mit messages from diffs using neural machine translation. CoRR, 2017.

[226] J.L. Fleiss et al. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971.

[227] James Gosling, Bill Joy, Guy Steele, Gilad Bracha, and Alex Buckley. The java language specification, java se 7 edition. https://docs.oracle.com/javase/specs/jls/se7/jls7.pdf, 2013.

[228] Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, and Dongmei Zhang. Mining succinct and high-coverage api usage patterns from source code. In Mining Software Repositories, 2013.

[229] Dave Binkley, Dawn Lawrie, Emily Hill, Janet Burge, Ian Harris, Regina Hebig, Oliver Keszocze, Karl Reed, and John Slankas. Task-driven software summarization. In International Conference on Software Maintenance, pages 432–435, 2013.

[230] Adam Kiezun, Vijay Ganesh, Philip J. Guo, Pieter Hooimeijer, and Michael D. Ernst. Hampi: A solver for string constraints. In International Symposium on Software Testing and Analysis, pages 105–116, 2009.
