Program Changes and the Cost of Selective Recompilation

Ellen Ariel Borison
July 1989
CMU-CS-89-205

Submitted to Carnegie Mellon University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Copyright © 1989 Ellen A. Borison

This research was sponsored in part by a Xerox Special Opportunity Fellowship and in part by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 4976, Amendment 20, under contract number F33615-87--1499, monitored by the Avionics Laboratory, Air Force Wright Aeronautical Laboratories, Aeronautical Systems Division (AFSC), United States Air Force, Wright-Patterson AFB, Ohio 45433-6543.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Xerox Corporation, the Defense Advanced Research Projects Agency or the US Government.

to the Lackowitz sisters: my mother, Rosaline (1922-1984) and her sisters, Hilda Goldstein (1922-1976) and Silvia Grossman

and to my father, Herbert

Abstract

When module boundaries are dissolved, a program can be seen simply as a collection of named objects (procedures, types, variables, etc.) that reference one another. Program designers impose order on such collections by grouping somehow-related names into modules. Informed by over twenty years of software engineering wisdom, we trust that modularization makes programs easier to control, maintain, and understand. However, not all of the effects of modularity are beneficial. This research suggests that most of the recompilations performed after a change to an interface are redundant and that this redundancy is a direct consequence of how we modularize software systems.

This conclusion is based on the careful analysis of a small number of C and Ada programs. This analysis in turn is based on a model of software manufacture that specifically addresses the question of how much work has to be done to incorporate a given set of changes consistently into a given software product.

In each program analyzed, the average name defined in an interface is used in 2 or 3 compilation units; the average interface is used in 7 to 25 units. Thus if a programmer were to change a single name chosen at random from an arbitrary interface and then, like the UNIX tool make, compile every compilation unit using that interface, we would expect between 6 and 9 out of every 10 compilations to be unnecessary. This phenomenon extends to the purposeful changes made during program development. For historic changes made to one program, fewer than half the compilations performed after an interface change were actually necessary, even though the data was treated conservatively both by grouping changes and by eliminating spurious interconnections.

This work corroborates practical experience as well as observations made by other researchers, solidly tying together a collection of disparate evidence into a coherent picture of software manufacture. It validates the approach taken by some programming environments to use an underlying flat (i.e. non-modular) representation of program objects and, to the extent that recompilation costs reflect general program complexity, leads us to question some basic assumptions about modularization.

Acknowledgements

I would not have completed this work without the support and enthusiasm of my thesis advisor, Jim Morris. The case studies of Chapters 3 and 4 reflect his talent for gently directing my attention to fruitful problems. The other members of my thesis committee, Bob Ellison, Joe Newcomer, Gene Rollins and Dave Wortman, provided useful and timely feedback as each stage of the work progressed. Joe Newcomer, especially, always displayed confidence in my approach to software manufacture. My long-time officemates Paola Giannini and Roberto Minio, and frequent luncheon companion Pedro Szekely, among many others, made CMU an enjoyable place to work. I am deeply indebted to Phil Levy of Rational, Inc. for writing the necessary code and collecting the raw data on the Ada programs that I used in the case studies of Chapter 4. Phil was completing his own thesis at the time.

John Nestor and Reidar Conradi each read a preliminary draft of the thesis and made comments that have clarified its presentation. Reidar pointed out the relationship between my name use and visibility studies and observations on the "big inhale" made by his group and by Ragui Kamel and colleagues at Bell Northern Research. John has been a sometimes reluctant sounding board for as-of-yet poorly formulated ideas. My brother, Adam, reviewed the assumptions on which I based the analysis in Chapter 4. Several people helped me with the mechanics of the research. I want to specifically thank Susan Straub of the CMU Information Technology Center, and Grace Downey and Susan Dart of the Environments Group at the CMU Software Engineering Institute. Susan Straub helped me ship a copy of the code for Vice to a computer where I could analyze it. Grace Downey showed me the interactive cross-reference capabilities of the Rational Environment and helped me get a copy of the code for the Rational Kermit program. Susan Dart not only helped me find reference material, but also annotated a copy of her report analyzing the Rational Environment so that I could easily find the material most relevant to my needs. I also want to thank Joe Newcomer for the extended loan of a Macintosh for figure production.

Finally, I want to thank my father, an accomplished scientist, who taught me by his example the meaning of research.

Contents

1 Introduction
1.1 The Consequences of Modularization
1.2 The Thesis
1.2.1 The Model
1.2.2 An Analysis of Compilation Costs
1.2.3 Name Use versus Name Visibility
1.2.4 Conclusions and Recommendations
1.3 Background
1.3.1 Examples of Problems in Software Manufacture
1.3.2 Software Manufacture
1.3.3 Related Activities
1.4 Directly Related Work
1.4.1 Software Manufacturing Systems
1.4.2 Selective Recompilation
1.4.3 A Profile of Compiling and Linking

2 The Model
2.1 Pitfalls of Software Manufacture
2.2 The Representation of a Software Configuration
2.2.1 Components
2.2.2 Manufacturing Steps and Step Schemas
2.2.3 Manufacturing Graphs and Graph Schemas
2.2.4 Encapsulated Subgraphs
2.3 Examples of Manufacturing Graph Schemas
2.3.1 Conventional Compilation Strategies for C and Ada
2.3.2 C Compilation in the SMILE Programming Environment
2.3.3 Generating the Tartan Lexical Analyzer Subsystem
2.3.4 Bootstrapping the Mini IDL Tools


2.4 The Instantiation of a Software Configuration
2.4.1 Change, Context and the Incidence of Redundant Manufacture
2.4.2 What it Means for Two Products to be Effectively Indistinguishable
2.4.3 Difference Predicates
2.5 The Cost of Selective Manufacture

3 An Analysis of Compilation Costs
3.1 An Overview of the Study
3.2 The Seven Difference Predicates
3.3 The Descartes Project
3.3.1 Why Descartes?
3.3.2 Descartes Programming Conventions
3.3.3 The Descartes Change History
3.4 The Method
3.4.1 From Change History to Configurations
3.4.2 Measuring Compilation Costs
3.4.3 The Data Collected
3.5 The Results of the Study
3.5.1 The Effect of Superfluously Included .h Files
3.5.2 The Relationship Between the Predicates
3.5.3 The Size of Compiled Files
3.6 Discussion

4 The Distribution of Names in Interfaces
4.1 The Ratio of Use to Visibility
4.1.1 Computing Use/Visibility Numbers for C
4.1.2 Computing Use/Visibility Numbers for Ada
4.1.3 The RUV versus the Big Inhale
4.1.4 Partial RUVs
4.2 The Six Program Study
4.2.1 The Programs
4.2.2 The Data in Table 4.1
4.2.3 The Effect of Compensating for Language Differences on the Concrete RUV
4.2.4 The Unit RUV versus the Partial RUVs of the Big Inhale
4.3 Patterns of Use and Visibility
4.3.1 The Effect of Modularity in Descartes

4.4 The RUV versus Predicate Performance
4.4.1 The Distribution of Name Changes in the Descartes History
4.4.2 The Effect of Grouping Name Changes in the Descartes History
4.5 Summary

5 Recommendations and Conclusion
5.1 Recommendations
5.2 Further Empirical Studies and Other Research
5.3 Summary of Contributions
5.4 The Thesis in Perspective

Bibliography

List of Figures

1.1 Modularization and the Compounding of Interconnections
1.2 A Simple Case of Version Skew

2.1 A Schema for Yacc
2.2 A Rudimentary Manufacturing Graph
2.3 A Schema for the UNIX cc Command
2.4 The UNIX cc Command Encapsulated
2.5 A Generic Manufacturing Graph Schema for a C Program
2.6 A Typical Manufacturing Step Schema for a C Program
2.7 A Generic Manufacturing Graph Schema for an Ada Program
2.8 A Typical Manufacturing Step Schema for an Ada Program
2.9 The Source of an Unnecessary Recompilation to Prevent Version Skew
2.10 A Generic Manufacturing Graph Schema for SMILE
2.11 A Partial Schema for the Generation of a Lexical Analyzer
2.12 A Manufacturing Graph Schema for Mini IDL
2.13 The Successful Application of a Difference Predicate
2.14 The Successful Application of a Partial Difference Predicate

3.1 A Simple Categorization of 572 Compilations
3.2 The Relative Strength of the Seven Predicates
3.3 Predicate Performance Relative to BIG BANG
3.4 The Compound Effect of Parsimony in Interconnection and Manufacture

4.1 The Use and Visibility of Names in Ada
4.2 Name Use Density in Three C and Three Ada Programs
4.3 Cumulative Name Use and Visibility for Three C and Three Ada Programs
4.4 Expected Use and Visibility for Groups of Names in Descartes
4.5 The k-RUV for Descartes

List of Tables

3.1 A Comparison of Three Approaches to Software Manufacture
3.2 The Same Comparison Based on Number of Lines Compiled
3.3 Summary of Seven Predicates
3.4 Summary of the Descartes Change History
3.5 Summary of Descartes Revision Groups
3.6 Four Revision Groups from the Change History of the Crostic Client
3.7 Average Predicate Performance
3.8 Cumulative Predicate Performance
3.9 Predicate Performance Relative to BIG BANG
3.10 Average Size of the Files Compiled by each Predicate

4.1 Comparative Name Use and Visibility for 3 C and 3 Ada Programs
4.2 Clients of C Externs that are not Declared in .h Files
4.3 Effect of Superfluously Included .h Files in C
4.4 The Effect of Counting the Declaring Specification as a Client in Ada
4.5 Average Per Unit Use/Visibility Ratios
4.6 Descartes' RUV versus Relative Predicate Performance
4.7 Comparative Name Use and Visibility for Historic Changes
4.8 Differences in the Estimated and Generated k-RUV
4.9 Estimated versus Actual Use and Visibility


Chapter 1

Introduction

Regardless of the language or environment they use, most experienced programmers have been frustrated repeatedly by their inability to regenerate a system quickly and accurately after having made a change. Many details can go wrong: steps may be performed in the wrong order, with the wrong parameters, using the wrong tools or on the wrong versions of program components. Sometimes a problem will show up as a failure of the system to rebuild; sometimes (particularly when tools provide inadequate checking) it will show up only in program execution. Often the cause is outside the programmer's control. The behavior of a program after it has been rebuilt might change for no apparent reason; bugs thought to have been fixed might reappear. Even when the details are under control, the amount of time necessary to rebuild a system can leave the programmer idling impatiently while apparently remote parts of the system are being regenerated. Sometimes a seemingly innocuous change will trigger hours of compilation.

Project managers have it worse than programmers because they are responsible for an entire system and must coordinate the efforts of many programmers. On large projects an error of judgement in scheduling the integration of changes can suspend progress for weeks. At least one authority claims that the most difficult decisions a project manager must make are when and how often to schedule complete system builds and how much regression testing to perform afterwards [8].

Today most production programming is done using higher level languages (like Ada, C, Fortran, or Pascal), which are processed by compilers. While such languages undoubtedly will dominate software development technology for some time, compilers are not the only tools at a programmer's disposal. For this reason, I use the term software manufacture to describe the process by which a software product is generated from the programmed components of a system.

The goal of this research has been to define and to explore the applicability of a general paradigm for efficient software manufacture that does not compromise reliability. Case studies of program changes and the cost of selective recompilation suggest that most recompilations performed using standard techniques are redundant and that this redundancy is a direct consequence of how we modularize software systems. This work corroborates practical experience as well as the observations of other researchers, validates the approach taken by some environments to use an underlying flat representation of program objects (e.g. SMILE [30]), and raises questions about some assumptions about modularity.

After offering an observation about the relationship between modularization and the cost of recompilation that became clear during this research, I present an overview of the thesis, motivate its approach and aims, and review related work. In Chapter 2, I present a paradigm for software manufacture with examples. In Chapters 3 and 4, I explore the applicability of that paradigm to the recompilation of C and Ada programs. Finally, I present conclusions and recommendations in Chapter 5.

An Aside Introducing The Programs Studied

This thesis consists of both an abstract model and a set of observations based on quantitative data resulting from applying the model to actual manufacturing problems. Most of the quantitative results (presented in Chapters 3 and 4) come from studying one program, the Descartes crostic client. This 12,000 line program, which implements an acrostic puzzle game, was designed to exercise user interface management software developed as part of the Descartes project [58]. I chose to study the crostic client because its size is tractable, because, as a Descartes developer, I was familiar with the code, and because a complete change history was available.

I also looked at two additional C programs: the Andrew File System [45] developed at Carnegie Mellon University's Information Technology Center, and a version of Kermit [12] obtained from the Columbia University archives. In addition, Phil Levy of Rational, Inc. provided summary cross-reference data on three programs written in Ada: a Kermit subset, and parts of two debuggers.

Some of the examples in Chapter 2 are taken from the above programs; others are based on program generation technologies originally developed at Carnegie Mellon University as part of the PQCC project [48] and later commercialized at Tartan Laboratories. Finally, the Mini IDL system developed by Nestor and Stone [49] provides an excellent example of a bootstrapped system.

1.1 The Consequences of Modularization

When a programmer changes the definition of a name declared in an interface, typical recompilation strategies require the recompilation of all compilation units in which the changed name is visible. Recompilation is necessary, however, only for those units in which the changed name is actually used. This research suggests that the underlying cause of most unnecessary recompilations is the sparsity of name use relative to name visibility; that is, only a fraction of those units that might reference a given name actually do reference the name. This is a direct consequence of how we modularize software systems.
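A minimal C sketch of the distinction, using hypothetical names (none of them taken from the programs studied): a header makes every name it declares visible to each client that includes it, but a given client typically uses only a few of those names.

    /* A hypothetical interface (imagine this is screen.h).  Every name
     * declared here is visible in any compilation unit that includes
     * the header. */
    extern void screen_init(void);
    extern void screen_refresh(void);
    extern int  screen_width(void);
    extern int  screen_height(void);

    /* A hypothetical client (imagine this is clock.c, which includes
     * screen.h).  It uses only screen_refresh.  If the declaration of
     * screen_width changes, a tool like make recompiles this unit
     * because the changed name is visible here; but the object code
     * produced cannot change, because the name is not used here, so
     * the recompilation is redundant. */
    void clock_tick(void)
    {
        screen_refresh();
    }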

Modularization can be seen in two ways. It can be viewed either as a breaking apart of the entire corpus of a program into smaller units or as the gathering together of individual definitions into larger units. (In either case, units are formed according to principles of data abstraction and information hiding.) It is the latter view that is adopted here. The results of this thesis suggest that the unit size appropriate for information management is not necessarily the unit size appropriate for recompilation.

If we were to dissolve module boundaries and look at the fine-grained structure of a program, we would see that a program is simply a collection of named objects that reference one another. Program designers impose order on such collections by grouping names into modules and by replacing references between names with references between modules. Unless the programming environment remembers which names reference which other names, the effect of modularization is to compound the number of interconnections. When a name in one module references a name in a second module, the environment must behave as if each name in the first module references each name in the second.

This can be illustrated by a simple gedanken experiment. Starting with a graph that might represent the connections between names, individual nodes are grouped into clusters and connections between nodes are replaced with connections between their containing clusters. This is an information-losing transformation: the cluster graph contains fewer nodes and connections than the original graph. To reconstruct the finer-grained original graph from the cluster graph, it is necessary to assume that each node in a cluster is connected to each node in every connected cluster. There is a potential explosion of interconnections in the reconstructed graph depending on the connectivity of the original graph and the way nodes are selected for clustering.

An example of this experiment is shown in Figure 1.1. Figure 1.1(a) is a graph of 12 nodes with 20 arbitrary connections. In Figure 1.1(b) individual nodes are clustered arbitrarily, simply by grouping adjacent nodes into pairs, and connections between individual nodes are replaced by connections between pairs. (The figure does not show connections between nodes in the same pair.) Figure 1.1(c) again shows the original nodes plus all connections between nodes that are implied by the connections between pairs. Where in Figure 1.1(a) there may have been only a single connection between two of the four nodes in two pairs, in Figure 1.1(c) there are four connections between the four nodes. (Connections between nodes in the same pair are as in the original figure.)

Figure 1.1: Modularization and the Compounding of Interconnections. (a) Original Connections; (b) Nodes Grouped in Pairs; (c) Implied Interconnections.

While neither the pattern of references between names nor the collection of names into modules is arbitrary in programs, Figure 1.1 is representative of the phenomenon observed in the case studies reported in Chapter 4. In fact, in five of the six case studies the compounding of interconnections attributable to modularization is even greater than that illustrated in the figure. The almost complete interconnection of Figure 1.1(c), however, is an artifact of the small size of the example and the relative density of the original connections in Figure 1.1(a).
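The gedanken experiment is easy to mechanize. The following C sketch uses the same node count as the figure, but the edges are invented for illustration rather than read off Figure 1.1; it clusters nodes into pairs and counts the node-to-node connections that must be assumed once only cluster-to-cluster connections are remembered.

    #include <stdio.h>

    #define N 12                  /* nodes, as in the gedanken experiment */
    #define C (N / 2)             /* clusters: node i goes to cluster i/2 */

    /* An arbitrary sparse set of connections between nodes.  These
     * edges are made up, not copied from Figure 1.1; any sparse graph
     * shows the same compounding effect. */
    static const int edge[][2] = {
        {0,2},{0,5},{1,7},{2,9},{3,4},{3,10},{4,8},{5,6},{6,11},{7,8},
        {9,11},{1,4},{2,6},{5,10},{8,11},{0,9},{1,10},{3,7},{4,6},{5,9}
    };

    int main(void)
    {
        int connected[C][C] = {{0}};
        int nedges = (int)(sizeof edge / sizeof edge[0]);
        int i, a, b, implied = 0;

        /* Clustering: replace each connection between nodes with a
         * connection between their containing clusters.  Connections
         * inside a single cluster are dropped, as in the figure. */
        for (i = 0; i < nedges; i++) {
            a = edge[i][0] / 2;
            b = edge[i][1] / 2;
            if (a != b)
                connected[a][b] = connected[b][a] = 1;
        }

        /* Reconstruction: with no record of which names reference
         * which, every node in a cluster must be assumed connected to
         * every node in each connected cluster -- 2 x 2 = 4 implied
         * connections for each pair of connected clusters. */
        for (a = 0; a < C; a++)
            for (b = a + 1; b < C; b++)
                if (connected[a][b])
                    implied += 2 * 2;

        printf("original connections between nodes: %d\n", nedges);
        printf("implied node-to-node connections:   %d\n", implied);
        return 0;
    }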

1.2 The Thesis

This thesis is based on a model (or paradigm) of software manufacture that specifically addresses how much work has to be done to incorporate a given set of changes consistently into a given software product. Using this model to analyze the compilation costs associated with the change history of a small C program, I found that most compilations associated with an interface change are redundant and that those redundant compilations can most effectively be avoided by using simple name-based selective recompilation techniques. I then looked at a handful of C and Ada programs to corroborate and explain this phenomenon.

1.2.1 The Model

The model for efficient manufacture defined in Chapter 2 is based on a dependency graph representation of a software configuration called a manufacturing graph. This representation establishes a baseline for consistency in manufacture. It is philosophically akin to the system models described by Lampson and Schmidt [38] but, having less semantic content, avoids the arcane linguistic constructs and the type-theoretic issues that make the System Modeler difficult to understand and implement. Unlike most other representations (including Lampson and Schmidt's), manufacturing graphs are programming language and tool independent and easily accommodate difficult manufacturing problems such as the cyclic dependencies found in bootstrapped systems, the use of program generators, or the incorporation of premanufactured components.

Software manufacture is the process of instantiating a manufacturing graph schema. When the schema is derived from the graph representing an already manufactured configuration by changing one or more initial (source) components, this instantiation can be done in two ways: by generating new derived components as needed, or by appropriating already manufactured components where their substitution will go unnoticed. The process is controlled by difference predicates. Difference predicates generalize the selective recompilation mechanisms described by Tichy [62] and others. These mechanisms are discussed in Section 1.4.2.
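As a rough illustration only (the types and names below are hypothetical, not the thesis's notation), a difference predicate can be pictured in C as a function that examines the old and new versions of an input component and reports whether a consuming step could produce a distinguishable output; if no input reports a difference, the already manufactured output is appropriated rather than regenerated.

    #include <stdbool.h>

    struct component;          /* a source or derived object (opaque here) */

    /* Returns true when the change from old_version to new_version
     * could affect the output of the step consuming this component; a
     * conservative predicate may always return true and simply rebuild. */
    typedef bool (*difference_predicate)(const struct component *old_version,
                                         const struct component *new_version);

    struct manufacturing_step {
        const struct component **inputs;   /* e.g. a .c file plus the .h files it includes */
        int                      ninputs;
        struct component        *output;   /* e.g. the corresponding .o file */
        difference_predicate     differs;  /* predicate chosen for this step */
    };

    /* Rerun a step only if some input differs, in the predicate's
     * sense, from the version used when the output was last
     * manufactured; otherwise the existing output can be kept. */
    static bool must_regenerate(const struct manufacturing_step *step,
                                const struct component *const *previous_inputs)
    {
        for (int i = 0; i < step->ninputs; i++)
            if (step->differs(previous_inputs[i], step->inputs[i]))
                return true;
        return false;
    }

Under this picture, a make-style timestamp check and a check restricted to the names a compilation unit actually uses are simply different choices of predicate.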

Chapter 2 describes various ways difference predicates may be applied and the conditions they must satisfy to be used safely. Both the predicates and the underlying representation are illustrated with examples from real programs and tools.

1.2.2 An Analysis of Compilation Costs

Chapter 3 represents a narrowing of focus from the general paradigm of Chapter 2 to a specific case study of compilation costs in C.

The predicate-driven model for software manufacture of Chapter 2 is intended to be used as a springboard for approaching manufacturing problems: for assessing manufacturing costs, for evaluating existing technology, and as the basis of environment, tool, and program design. In Chapter 3 that model is applied to compare the efficacy of seven predicates on almost 200 revisions recorded in the change history of the Descartes crostic client. Roughly 40% of these revisions represent interface changes (changes to .h files).

One of the results of this study is that 64% of the compilations that would have been initiated by make [24] in response to an interface change are redundant. Of these redundant recompilations, 89% are due to changes to names that are visible in a compilation unit but are not used in that unit. Because only a small number of additional redundant compilations can be detected using a more selective predicate that considers both the nature of the change and how a changed name is used, it does not pay to be too smart.

The study also indicates that while all compilations based on gratuitous changes (such as changes to comments or to white space) are unnecessary, such changes probably do not contribute significantly to overall compilation costs.

Rolf Adams, Annette Weinert and Walter Tichy of the University of Karlsruhe have recently completed a similar study based on over 700 daily changes in the history of two Ada programs consisting of over 60,000 lines of code [1]. They found that half of the compilations initiated by normal Ada rules were redundant according to the name use criterion. In comparing the two studies it is interesting to note that the incidence of redundant compilations for a C program is comparable to that for the larger Ada programs despite substantial differences in the compilation models of the two languages. The results of Chapter 4 suggest that differences in the compilation models of C and Ada do not necessarily lead to concomitant differences in the number of redundant compilations.

1.2.3 Name Use versus Name Visibility

Chapter 4 further explores the relationship among name use, name visibility, and recompilation cost, considering, instead of a single C program, a handful of C and Ada programs. This chapter departs from the paradigm for software manufacture per se to look at properties of programs that influence manufacturing costs; its purpose is to explain and corroborate the results of Chapter 3.

The change history study of Chapter 3 is based on historic changes to a single program. While this C study and the Adams-Weinert-Tichy Ada study demonstrate that there exist development efforts in which most recompilations are redundant, the two studies do not indicate to what extent this result can be extrapolated to other programs. Unfortunately, without a substantial investment in tooling, it would be prohibitively expensive to repeat the C study on a larger scale. Given the corroborating results of the Adams-Weinert-Tichy

study, it is also not clear what the intrinsic value of such a study would be (see Chapter 5 for further discussion). So instead of examining changes, Chapter 4 seeks to explain the incidence of redundant recompilations by looking at the structure of programs. In so doing it compares the Descartes crostic client with other programs.

The conclusions of Chapter 4 are based on static name use and visibility patterns for three Ada and three C programs, including the Descartes crostic client. I computed both the average number of compilation units that use the names defined in the interfaces of each program and the average number of units in which those names are visible. For changes made to a single name chosen at random, the first number represents the expected number of units that actually need to be recompiled; the second represents the expected number of units that typically are recompiled. For the six programs studied, the ratios of these two numbers vary from .07 to .42, indicating that from 9 to 6 out of every 10 compilations may be unnecessary when one name changes at a time. The values for the Descartes crostic client fall in the middle of the range for the six programs.
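The arithmetic behind these numbers is simple; the following C sketch shows it with made-up cross-reference records (the names and counts are invented, not drawn from the six programs).

    #include <stdio.h>

    struct name_xref {
        const char *name;     /* a name defined in some interface        */
        int units_using;      /* units that actually reference the name  */
        int units_visible;    /* units in which the name is visible      */
    };

    static double expected_redundancy(const struct name_xref *xref, int n)
    {
        double use = 0.0, visible = 0.0;
        for (int i = 0; i < n; i++) {
            use     += xref[i].units_using;
            visible += xref[i].units_visible;
        }
        /* use/n     = expected recompilations actually needed
         * visible/n = expected recompilations typically performed
         * 1 - ratio = expected fraction of redundant recompilations    */
        return 1.0 - (use / visible);
    }

    int main(void)
    {
        /* invented numbers, roughly in the range discussed in Chapter 4 */
        struct name_xref sample[] = {
            { "screen_refresh", 3, 14 },
            { "screen_width",   2, 14 },
            { "cursor_move",    2,  9 },
            { "buffer_insert",  4,  9 },
        };
        int n = (int)(sizeof sample / sizeof sample[0]);

        printf("expected redundant fraction: %.2f\n",
               expected_redundancy(sample, n));
        return 0;
    }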

Chapter 4 attempts to reconcile the results of the Descartes change history study with static use and visibility properties in two ways: by looking at the use and visibility properties of the names that changed in Descartes' history and by considering the effect of multiple name changes. There is no evidence of a relationship between name use or visibility and the selection of names that actually changed in Descartes' history. However, when several names are changed at the same time, they are often defined in the same interface. Thus groups of names that are changed at the same time are likely to be visible in the same set of compilation units but used by different (perhaps overlapping) sets of units.

While the name use/visibility study is too small for generalization beyond the six programs studied, it raises important questions about the relationship between program organization and compilation costs. The most profound result of the study is an unexpected consistency in patterns of name use across the six programs. Although average name visibility varied considerably, average name use was consistently between 2 and 3 units for all six programs. In other words, most global names link just two or three program units, including the unit in which the name is defined. This is reminiscent of Knuth's discovery that the average number of operations in an assignment statement is .5 [35].

Concerned with the overhead associated with importing externally defined symbols into a compilation unit, both Conradi and Wanvik, reporting on experience with the languages Mary/1 and Chill [7], and Kamel and Gammage, reporting on experience with

the language Protel [32], estimate that only 15% to 25% of the symbols visible to a compilation unit are actually used in that unit. These numbers indicate that if each name visible to a compilation unit were equally likely to change and if only one name were to change at a time, then between 75% and 85% of the recompilations performed on an average compilation unit would be unnecessary. Neither Conradi and Wanvik nor Kamel and Gammage made this observation, nor did they report on the average number of clients per name or on the distribution of name use.

1.2.4 Conclusions and Recommendations

The examples of Chapter 2 show the range of the dependency graph representation and predicate-driven model of software manufacture developed in this thesis. The case studies of Chapters 3 and 4 lead to new insights about the causes of redundant compilations, showing that this model helps us understand and therefore control manufacturing costs. While these studies represent only a small number of programs, their evidence is sufficiently compelling to validate the approach of programming environments such as the Rational Environment [2,21] or SMILE [30], which base recompilation decisions on the interconnections between individual names. Although this thesis raises a number of questions that demand further study, today's designers of practical programming environments would be well-advised to follow the example of these environments. More speculatively, the thesis suggests that we reappraise the mechanisms we use to organize programs.

At face value, the thesis shows that for historic changes to the Descartes crostic client program, only two predicates are interesting: one that makes recompilation decisions based on name visibility and one that makes the same decisions based on name use. Other predicates do not perform significantly differently from these two predicates, indicating that name use versus visibility is the primary source of redundancy in manufacture. This explanation is reinforced by the discovery that name use is consistently sparse relative to name visibility in a group of six programs. Because the static properties of the Descartes program are comparable to those of the other programs studied, and because historic changes tend to represent sequences of actual changes, the results of the Descartes change history study are probably conservative for a broad class of programs.

Further studies are needed to confirm and expand the results of the thesis. First, it would be interesting to do a broad study to determine whether the consistent patterns of name use seen in Chapter 4 hold up for large numbers of programs. This finding has implications not just for compilation costs, but for how we organize programs into modules. Second, because compilation units are themselves composite, it would be worthwhile to measure patterns of references within units. This information would provide additional insights about how programs are organized. Finally, it would be desirable to characterize patterns of change to measure the size of actual changes and to see whether there is a discernible relationship between use or visibility and the names that change.

At a time of unprecedented increases in computational power, some might challenge the importance of reducing inefficiency in software manufacture. Historically, similar challenges have not held up. Increased computational power is used to build more am- bitious systems, to tackle harder problems, or to apply more sophisticated development techniques. In the future, techniques such as those explored in this thesis will be impor- tant as we build larger systems with more complex derivation relationships. At present, these techniques may make the difference between 1 minute and 10 for a programmer making a small change; they may make the difference between 10 minutes and an hour for a more pervasive change to a larger program. For a group of programmers, they may make the difference between 30 minutes and overnight; for a large organization they may make the difference between overnight and a few weeks.

1.3 Background

Recently there has been an awakening of academic interest in software configuration management. 2 In an effort to develop techniques for managing complexity and change in the development and maintenance of software systems, researchers are combining the language-and-tool-oriented concerns of programming in the large with what have been the traditional methodological concerns of configuration management as practiced in industry. These techniques represent a collection of activities that manipulate the programmed components of a software system as primitives in order to compose them into a coherent, well-formed and operational whole. One of these activities is software manufacture.

In this section I first give concrete examples of the problems that have motivated my own interest in software manufacture. I then offer a more abstract view of software manufacture, motivating the model of Chapter 2. Finally, I give a brief overview of the relationship between software manufacture and other activities of programming in the large.

2 For example, a workshop devoted to configuration management was held in West Germany in January 1988 [68] and a second is planned for the United States in October 1989.

1.3.1 Examples of Problems in Software Manufacture

My interest in software manufacture as a problem (and my approach to its solution) stems from my experience as a software developer at both Bell Laboratories and Tartan Laboratories and from my experience maintaining configurations of the tools, libraries and program skeletons used to generate front-ends at Tartan. At Tartan, too, I learned about the complexity of manufacture using sophisticated tools and about the effectiveness of selective recompilation techniques.

The source of all problems in software manufacture is change. Dealing with change is conceptually simple. All one has to do is (1) recognize that something has changed, and then (2) perform the necessary steps to incorporate the changed components into a product. In practice, things are not so simple. A common and frustrating manifestation of a failure in either of these two operations is version skew. Version skew occurs when two potentially inconsistent versions of the same component are unintentionally incorporated in the same product. Version skew can only happen when there is more than one dependency chain between the product and the offending component. The derivation steps along one dependency chain are repeated using an updated version of this component while the steps along another chain are not. This is illustrated in Figure 1.2.

Figure 1.2: A Simple Case of Version Skew (legal versus skewed)

In many compilation systems, version skew is detected at compile or link time; in others it may only show up at run-time. Sometimes version skew is benign. When there are not substantial differences between the versions of the component producing the skew, the compilations required by a checked system are unnecessary.

The two incidents of version skew I describe below will not surprise experienced software developers. Both happened during the early stages of small projects before formal configuration management procedures were put into place. The first incident, an example of version skew in an unchecked system, took place at Bell Laboratories. The programming language was C. The second incident, an example of version skew in a checked system, took place at Tartan Laboratories.

Version Skew in C

One of my duties at Bell Laboratories was maintaining the interface defining a common object file format (coff) shared by a family of UNIX-based software development tools. At the time there were perhaps a half-dozen developers using the interface and I was allowed to change it at will. The programmers I worked with used make to regenerate their programs after making changes. Unfortunately some programmers would omit the coff interface dependency from their makefiles since it was not their code; thus make would fail to trigger the necessary recompilations after the interface was changed. On more than one occasion, a developer who had omitted the dependency would recompile one of the .c files using the coff interface after the interface had changed but fail to recompile other files that also used the interface. The partially recompiled system would successfully link and proceed to produce anomalous run-time behavior when data structures defined by the coff were communicated between modules having inconsistent views of the interface. The conscientious programmer who chose to use lint (a C program checker) to uncover the error would find nothing amiss since lint would reprocess all the .c files in the program using the new version of the interface. If, however, the appropriate parts of the system were recompiled, the anomalous behavior would disappear as mysteriously as it appeared.
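The mechanism is easy to reproduce in miniature. The sketch below is contrived and self-contained (the struct is invented; it is not the actual coff interface): the two layouts stand for two versions of the same header, one baked into a stale object file and one into a freshly recompiled unit. Because the linker checks names rather than layouts, such a program links cleanly and misbehaves only at run time.

    #include <stdio.h>

    struct record_v1 {            /* layout compiled into the stale .o file */
        long id;
        long length;
    };

    struct record_v2 {            /* layout in the freshly recompiled unit  */
        long version;             /* new field inserted at the front        */
        long id;
        long length;
    };

    int main(void)
    {
        struct record_v2 written = { 2, 1234, 56 };

        /* The stale unit still interprets the bytes using the old
         * layout.  (The cast mimics a pointer crossing the module
         * boundary; it is undefined behavior in a real program, which
         * is exactly the point.) */
        const struct record_v1 *seen =
            (const struct record_v1 *)(const void *)&written;

        printf("id as seen by the stale module:     %ld\n", seen->id);
        printf("length as seen by the stale module: %ld\n", seen->length);
        return 0;
    }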

Version Skew in Gnal

The software development environment at Tartan, based on the proprietary language Gnal, was different from the C/UNIX environment at Bell Laboratories. Designed to produce very efficient implementations, Gnal is a separately compiled systems-programming language offering strict type checking across module boundaries. In Gnal, unlike Ada (for example), interface and implementation information are declared in the same compilation unit. Thus when a Gnal module is compiled it produces both a relocatable object file and a symbol table that is used to compile clients of the module. The compiler, not the programmer, determines how much information about exported names is transmitted in this symbol table. A serious drawback of this strategy is that new versions of the symbol table are produced every time any change is made to a module and the programmer does

not always know if the new version resulting from certain changes will be compatible with the old version. Techniques for dealing with these problems were eventually developed.

Version skew was a serious problem at Tartan before program library and configuration management facilities for Gnal were put into place. During one summer, I was responsible for developing program skeletons for a table-driven lexical analyzer. These program skeletons made use of an error reporting subsystem being developed by another programmer. This programmer would periodically recompile the interface used by the lexer. I would change some lexical analyzer module and recompile that module as well as any of its clients only to find that I was unable to recompile the lexer root module (module A in Figure 1.2) because the module I had changed (module C in Figure 1.2) had been recompiled with the new version of the error interface while a second imported module (module B in the figure) had been compiled with the previous version. The Gnal compiler detected version skew when two versions of the symbol table for some interface were imported either directly or transitively into the same compilation unit, whether there was a substantial difference between the two versions or not. It was precisely this problem that led to the development of a selective recompilation mechanism for Gnal. (This mechanism is described briefly in Sections 1.4.2 and 2.4.3.)

The early Gnal compiler reported the name of the module producing the version skew (module D); it did not, however, report the names of the modules using conflicting versions of the interface (modules B and C). When the network of dependencies is complex, it can be difficult to determine what modules need to be recompiled to correct a version skew. According to Jim Morris, Mesa programmers would sometimes resort to recompiling the entire system in comparable situations.

Incidentally, the problem between the lexer and the error reporting subsystem was complicated because the programmer responsible for the error subsystem used some low level definitions that I maintained as part of the lexer. Although dependencies between individual modules remained hierarchical, the dependencies between the two subsystems were circular. Eventually, both the lexer and the error reporting subsystem were combined in a larger compiler front-end subsystem.

Further Complexities

The version skew problem becomes more complex when, instead of a one-level hierarchy between two developers, there is a lattice of dependencies among several independently developed subsystems. This is exactly the problem addressed by the Cedar environment release process described by Schmidt in his thesis [55, Section 2.4]. In the Cedar environment, interdependent subsystems had to be incrementally compiled in topological order so that the same versions of all components would propagate along all dependency chains in the hierarchy. According to Schmidt, internal releases of new versions of the Cedar environment took place once a month and took one to two days to accomplish. A similar problem plagued Tartan. However, while I was responsible for the configurations of tools, libraries and program skeletons used to generate compiler front-ends, several internal releases would take place weekly and the compilation system was as likely to change as the compiler generation tools or library code. We were able to turn around new releases in a matter of a few hours. 3

The complexity of software manufacture at Tartan was in part due to the Gnal separate compilation model. At least as important, however, was the heavy use of program generating tools. A half-dozen or more tools were involved in the generation of a compiler front-end alone. (This does not include another set of tools used in generating and maintaining compiler back-ends.) These tools all operated on a collection of specifications derived from an original target language grammar. They would generate a collection of Gnal and assembly language modules to be combined with precompiled libraries and table-driven program skeletons. The whole process was set up to produce compilers running in different operating environments on different computers. The Tartan tooling resulted in phenomenal leverage as tens of modules would be generated to be used in combination with hundreds of library and skeleton modules. However, due to this leverage, a single change to the original grammar could trigger a substantial amount of processing, much of it redundant. It is precisely this kind of situation that the model of Chapter 2 is designed to address.

1.3.2 Software Manufacture

Software manufacture is the process by which an initial set of components representing a software system are incrementally transformed and combined, through an often complex sequence of manufacturing steps, into one or more software products. These steps are effected by software generation tools including, but not limited to, such tools as compilers and link editors. As the number and expressive power of the tools used increases, so does the complexity of the manufacturing process.

3 I do not know how the volume of code constituting a release of the Tartan compiler front-end building system compares with the volume of code in the Cedar environment. I do know that although both the Tartan and Cedar internal release procedures had to respond to similar problems, the two development environments were different, as were the demands on the developers working in those environments.

I use the term software manufacture because the process of product derivation is largely automatic, mediated by tools. By comparison, the rest of the development process resembles a guild-hall craft where journeyman programmers labor under the supervision of craft masters. In some ways, the use of the term is unfortunate because it recalls the production techniques of the assembly line. In the manufacture of hard goods, the objective is to accurately replicate identical copies of a single product; in the manufacture of software, the same product is almost never produced twice.

Reproducibility as a Standard for Reliability

The first problem in software manufacture is reliability.

During its implementation and maintenance, a software system is repeatedly manufactured as each new set of changes is introduced. The result is then tested sufficiently to validate the changes and to discover additional problems. Whether casually manufactured by an individual programmer or officially by an integration team, it is imperative that the system undergoing testing (or about to be released) accurately represent the components from which it was supposed to have been constructed; otherwise, any information gained from testing is worthless. If there is any doubt about which versions of which components were used to build a given version of a system, it is impossible to attribute any problems found in testing to specific components, or to know what to change in order to fix those problems.

A software manufacturing process is reliable to the extent that it provides its users with complete control over changes. This means that once a problem in the behavior of a system has been identified and appropriate changes to correct that problem have been made, it is possible to remanufacture the system so that all differences between the new and previous versions can be ascribed to the changes made. Unfortunately, just knowing which versions of which components were used to build a given version of a system is not enough to ensure this. It is also essential that the programmer or integration team know exactly how the system was manufactured; otherwise an inadvertent change in the manufacturing process may inexplicably distort system behavior.

Reproducibility is a single standard for reliability in software manufacture. If it is possible to reproduce a version of a system, then it is possible to produce a known variant of that version.

There are four conditions necessary to establish reproducibility. First, it is necessary to identify the versions of all primitive components used to build a designated version of the system, including tools. Second, it is necessary to retain copies of all those components. Third, it is necessary to remember exactly how they were combined and transformed by the manufacturing process. Finally, each of the recorded steps of the manufacturing process must be repeatable.

The Issue of Efficiency

Once reliability has been established, the second problem in software manufacture is efficiency.

For an individual programmer making and testing changes, the time it takes to rebuild a system is often wasted. The programmer cannot make additional changes that might be contingent on test results or that might interfere with the manufacturing process, and he cannot begin testing until manufacturing is completed. If manufacturing time is not negligible, the programmer may be reluctant to make a discretionary improvement because of the potential overhead. He may postpone the informal integration of changes made by other programmers for the same reason. For a project (like the development of the Cedar programming environment at Xerox PARC or the development of compiler front ends at Tartan), formal integration may require suspension of all programming activities as changes propagate through the whole system.

Just as reliability in software manufacture is associated with control over the content of a software product, efficiency is associated with control over manufacturing cost. A software manufacturing process is efficient to the extent that the amount of work necessary to incorporate a change is proportional to the scope of that change. That means not doing extra work.

While it is reasonable to expect manufacturing costs to be high when a pervasive definition is changed, it is not appropriate to have to pay the same price for changing a less widely used definition. Unfortunately, in most environments there is no a priori way of distinguishing between the two situations. Programmers will sometimes risk failure to rebuild a required component in an attempt to avoid rebuilding components unnecessarily. Some environments (for example, Apollo's DSEE [39]) sanction this practice by providing explicit escapes to subvert normal recompilation rules.

The price of inefficiency in manufacture is not measured only in time lost to redundant recompilations. Perceptions about manufacturing cost also influence how a system is designed, what programming language or software development tools are used, and what changes are made and when. Decisions based on lack of information or inaccurate perceptions can limit the productivity of software development personnel, the effectiveness of software generation technology, and the prerogatives of software development management.

1.3.3 Related Activities

As part of the increased interest in software configuration management, researchers have been making an effort to factor the problem and to define and understand its constituent functions. Recent papers by Tichy [64] and Estublier [20] (both reporting at SVCC'88) have focused on the separate functions required to configure a software product from a collection of assorted components. Tichy considers the relationship among functions whereas Estublier asks "what is a configuration".

Unfortunately, there is little consensus about what an appropriate set of functions comprising software configuration management might be or even what constitutes a cover. Researchers still disagree on the meaning of basic concepts. In the following paragraphs I offer one view of the principal functions that interact with software manufacture.

Source and derived object management. Manufacturing operations must interact with a persistent object store to identify, retrieve and store components. Called version control, this function has long been treated as a distinct activity. Its traditional concerns have been techniques for representing the content and relationships between versions of individual software components and controlling mutual access. The typical paradigm for the control of text files (used for example in SCCS [52], RCS [61] and the DSEE history manager [40]) stores only the differences between successive versions as deltas and provides reserve/replace access to those versions. Derived files are handled differently. In DSEE and Odin [6] for example, they are cached in a "derived object pool" and discarded according to some cache flushing strategy.

A problem with conventional version control techniques is that they do not deal effectively with versions of components that are not text files. They also do not effectively capture the relationships between the versions of different components. As the limitations of traditional file systems and database technology are becoming evident [47], versioned object management has become an important topic in programming environment research.

Traditional configuration management and change control. Historically the concerns of traditional configuration management and change control have been largely administrative. Configuration managers are responsible for the scheduling, coordination and tracking of changes. They determine when new features will be introduced, who will implement them, and how they will be integrated into the product.

Version selection. Version selection is a central problem in software configuration man- agement. Before a software configuration can be manufactured, appropriate versions of each of the primitive components in the configuration must be identified. These versions must be syntactically and semantically compatible and must implement a desired set of features.

Version selection can be extremely difficult in the presence of multiple versions of multiple components, especially when trying to integrate independently developed subsystems. Unfortunately, there is no universal taxonomy of versions that describes the space from which to choose. The accepted approach to the problem of version selection has been to attach attributes to versions and to use predicates designating attribute values as selection criteria. Individual mechanisms differ in the complexity of the predicates they allow. This approach has been used by INTERCOL, an early module interconnection language [63] and for the definition of DSEE configuration threads [39]. It has been explored extensively in the Adele system [19,3].

Module interconnection languages. Like version control, module interconnection languages have traditionally been a distinct area of interest. While DeRemer and Kron originally envisioned that a module interconnection language would be used to specify information flow among the modules of a system [15], subsequent languages have provided facilities for version selection and software manufacture. Recently, Wolf has distinguished this original function of the module interconnection languages as interface control [69].

The module interconnection languages have always been concerned with the structure of software systems and with the relationships between components. Recently, both the NuMIL language [46] and the Inscape environment [50] have associated assertions with module interface specifications. These assertions, used during version selection and for consistency checking, distinguish between changes that are upwardly compatible and those that are not.

Most module interconnection languages describe relationships between modules written in a single programming language and see software manufacture as compiling and linking exclusively. These languages simply do not accommodate the use of other tools, be they preprocessors, assemblers, or program generators. A notable exception to this bias toward compilation is early work by Cooprider [10]. Because module interconnection language research has been the foundation of much recent interest in software configuration management, the limited outlook of the former has carried over into the latter.

Unlike the other constituent activities of software configuration management, which

are essential if one is to actually produce a software product, the study of module inter- connection languages has been confined to research laboratories and universities and has produced a fair number of doctoral theses.

System modeling. System modeling as coined by Lampson and Schmidt is a hybrid between module interconnection and software manufacture. In their view, a system model should be a complete and explicit description of a unique version of a software system that, like a deck of punched cards of the past, can be used to instantiate the system automatically once an initial command to proceed is given [37]. This view is maintained in the definition of a manufacturing graph in Chapter 2.

Subsequent to its introduction by Lampson and Schmidt, the term "system model" has come to be used to describe schemas for software manufacture such as makefiles. When applied to a selected set of primitive component versions by a product build tool, the system model yields an appropriately configured software product.

1.4 Directly Related Work

This thesis builds on related work in software manufacture and selective recompilation. The following sections review the most influential of this work. The thesis is unique in its synthesis of work in these areas, in its search for general principles rather than special purpose mechanisms and in its emphasis on quantifying and explaining the practical value of alternative techniques.

1.4.1 Software Manufacturing Systems

Because it produces the end products of software development, software manufacture is recognized as an important function of configuration management. Until recently, however, most manufacturing systems have been designed simply to automate the steps necessary to rebuild a software product after some of its sources have changed. These systems treat the products they build as ephemeral. They do not record how a given product was built and are unable to reproduce it at an arbitrary time in the future. Often auxiliary mechanisms are built around such systems to collect the information necessary to support some measure of reproducibility. Slowly, researchers are recognizing the value of integrating such mechanisms with manufacturing software both to reduce uncertainty about product composition during development and to expedite product release and continuing maintenance.

Make and its Successors

Make is the archetypal manufacturing system for building ephemeral products. For this reason it is discussed briefly in Section 2.1 where it serves to illustrate some of the pitfalls of current practice. Make has been an essential part of UNIX since 1975 and has been widely imitated on non-UNIX systems. It uses a makefile to keep track of dependencies between files, and when invoked regenerates only those derived files that are out of date with respect to their sources. An important reason for make's success is the simplicity of this model. In its long tenure, make has proven remarkably adaptable; while it does not handle cyclic dependencies well, there are few derivation relationships it cannot express.

Some of make's descendants, including the "fourth generation make" [25] and the program mk [29], have addressed perceived shortcomings in the make program, but have not challenged its underlying model. Areas in which these systems have made improvements include makefile succinctness, execution speed, cleaner dependency semantics (especially to support parallel building), and a cleaner interface with the UNIX shell [23]. While mk retains the spirit of the original tool, the fourth generation make sacrifices much of make's simplicity and elegance in favor of more comprehensive support for idiomatic C development methods.

Build [18] extends make by allowing developers to specify a viewpath of identically-structured directory hierarchies that are searched in sequence to locate the files named in the makefile. Typically the first directory along this path represents the root of a programmer's working directory structure and the last represents the root of a stable system. By interposing other directories along the viewpath, programmers can control the visibility of changes. While this mechanism both helps to coordinate changes and promotes the sharing of files, it does little to support product reproducibility.
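The viewpath mechanism amounts to a simple search rule: look for a file in each directory along the path and take the first copy found. The fragment below is a minimal sketch of that rule in C, not code taken from build itself; the directory names are hypothetical examples.

    #include <stdio.h>

    /* Sketch of viewpath lookup (not build's own code).  The directories
       listed here are hypothetical examples of a working hierarchy, a
       shared integration hierarchy and the stable system. */
    static const char *viewpath[] = {
        "/usr/jane/work",
        "/project/integration",
        "/project/stable",
        NULL
    };

    /* Return a stream for the first copy of 'name' found along the
       viewpath, or NULL if no directory on the path contains it. */
    FILE *viewpath_open(const char *name)
    {
        char path[1024];
        for (int i = 0; viewpath[i] != NULL; i++) {
            snprintf(path, sizeof path, "%s/%s", viewpath[i], name);
            FILE *f = fopen(path, "r");
            if (f != NULL)
                return f;    /* earlier directories hide later ones */
        }
        return NULL;
    }

Because the search stops at the first match, a file placed in the working hierarchy hides the stable copy of the same file, which is precisely how changes are made selectively visible.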

Alternatives to Make

SMILE and Odin are alternatives to make that have each been used to support substan- tial university research projects. Like make, however, neither system supports product reproducibility.

Used to develop and maintain Gandalf programming environments [59], SMILE is an incremental compilation system for a strongly-typed dialect of C that performs semantic analysis at the grain of individual declarations. While the user sees a SMILE program as a collection of modules, its underlying representation is a collection of separate procedures and variable declarations. The mechanisms that SMILE uses for incremental compilation are discussed in Section 2.3.2. Incremental compilation as a selective recompilation strategy is discussed in Section 1.4.2, below.

The Odin object manager is used in the Eli compiler construction system to manufacture target compilers [66]. The Odin interpreter uses compiled descriptions of manufacturing operations called derivation graphs to determine how to construct a requested product. Similar to the manufacturing step schemas described in Section 2.2.2, derivation graphs describe relationships between the types of objects known to the Odin system. Associated with each derived object type is the name of a procedure and the input types necessary to produce instances of that derived object type.

Odin's derivation graphs allow one to specify, in one place, complex manufacturing operations that are applied over and over again. However, while Odin maintains a representation of the relationships between object types, it does not maintain any explicit representation of the relationships between the objects themselves. That is, it has no concept of a system model.

Keeping Lists of Files to Support Reproducibility

The ability to regenerate a system precisely has long been a concern of production software development efforts. A prudent approach to this problem is to take snapshots of the whole system when it is about to be released (or when other milestones are achieved), and rebuild the system under conditions that ensure that only the information in the snapshot is used. For large systems, ensuring that the snapshot is complete and its content consistent can be a formidable task that is performed only rarely. The introduction of appropriate tools can help considerably.

A minimal snapshot of a software system lists one version of each of the system's source components. Together with appropriate conventions for building the system, such a list makes it possible to reproduce the system as long as the tools used to build it and the environment in which it is built do not change. This approach forms the basis of the Software Manufacturing System used at Bell Laboratories [11] and the Description Files used by Mesa development groups at Xerox [41]. It is also used in the Gandalf Software Version Control Environment [31], for configuration management in the Rational Environment [21], and in Sun's Network Software Environment [60].

The Bell Laboratories Software Manufacturing System combines checks and conventions with the use of make and SCCS to provide some degree of reproducibility in UNIX environments. The filename and SCCS-assigned version number of each source file used to generate a software product is embedded in that product. These filename/version number pairs, together with the version number of the makefile used to build the product, form a master slist. Checks ensure that only one version of each source component appears in the slist. When a product is rebuilt, its new slist must conform to the master slist.

Description Files (DF files) have been used heavily by the Cedar and Mesa development projects at Xerox to maintain consistent sets of files on different nodes of a distributed file system and to manage frequent internal releases. They were adopted more to achieve control over the development process than to achieve reproducibility. Each DF file lists versions of both the source and the object files that make up a subsystem. Files from other subsystems are imported by referencing their defining DF file. Although DF files are not used to automate software manufacture, they are checked for completeness and consistency according to Mesa language rules. Release procedures based on DF files are used to configure and install consistent baseline versions of systems in protected locations.

The Sun Network Software Environment (NSE) contains mechanisms that are conceptually akin to slists and DF files. However, instead of manipulating lists of files, the NSE maintains views of entire populated directory structures. These views are called environments. The NSE manages these environments to automatically configure the private workspaces of individual developers and to reintegrate changes made in those workspaces with main line project development. Each environment contains exactly one version of each source file in a given directory structure. While only source files are versioned, NSE ensures that any derived objects in an environment are consistent with the environment's sources. A variant mechanism is used to capture the compilation options and the tool suites used to produce different sets of objects from the same set of sources.

Bound System Models

There is nothing fundamentally wrong with keeping lists of source versions and using version-controlled makefiles or canonical procedures as a basis for reproducing designated software products. While many lists are incomplete (they often do not identify versions of tools or hidden files pulled by those tools from the environment), the missing information could in principle be recorded. It is also possible to use manufacturing conventions (such as undefining UNIX shell variables or disallowing command line parameterization of make invocations) that reduce the chances of inadvertently changing the manufacturing process. It is even possible to refine component names so that more than one version of the same component can be used in the same system. However, all this information has to be put back together to describe an as-built system.

An alternative approach, used both to identify and to reproduce designated software products, combines version information and the manufacturing schema in a single representation, a bound system model. This is the approach taken by the Cedar System Modeler, by Apollo's Distributed Software Engineering Environment and by the Software Configuration Engineering system developed at the Technical University of Berlin [43]. Each of these three systems is capable of regenerating a product directly from its representation as a bound system model.

The Cedar System Modeler generates executable software by interpreting functional programs, called system models, written in the System Modeling Language (SML). A system model describes the composition of a Cedar program. Only source files (Cedar definition and implementation modules) are named in the model; these source files are identified as immutable objects with unique identifiers. Derived objects are represented as function applications and are treated simply as accelerators to be used by the Modeler in interpreting the model. As such, derived objects are identified by unique identifiers that are created by hashing the names and unique identifiers of the sources, the compiler, and the compiler options used in their creation. Thus, while the Cedar compiler is not represented explicitly in the model, new derived objects would be created were the compiler to change. This, of course, does not guarantee reproducibility of products. However, the model, not any product it may have generated, is seen as the true representation of a system.
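The identification of derived objects can be sketched in a few lines. The fragment below is an illustration of the idea only, not the Modeler's actual algorithm; the 64-bit hash, the mixing constants and the function names are choices made for this example.

    #include <stdint.h>
    #include <string.h>

    /* Sketch: identify a derived object by hashing the identities of
       everything that contributed to it.  The hash is an FNV-style mix
       chosen for illustration, not the scheme used by the Cedar Modeler. */
    typedef uint64_t uid;   /* stands in for a unique identifier */

    static uid mix(uid h, const void *bytes, size_t n)
    {
        const unsigned char *p = bytes;
        while (n--)
            h = (h ^ *p++) * 1099511628211u;
        return h;
    }

    uid derived_object_uid(const char *source_name, uid source_uid,
                           uid compiler_uid, const char *options)
    {
        uid h = 14695981039346656037u;
        h = mix(h, source_name, strlen(source_name));
        h = mix(h, &source_uid, sizeof source_uid);
        h = mix(h, &compiler_uid, sizeof compiler_uid);
        h = mix(h, options, strlen(options));
        return h;   /* a new compiler or new options yields a new identifier */
    }

Because the compiler's identity and options participate in the hash, recompiling the same source with a different compiler produces an object with a different identifier, which is how the Modeler can avoid representing the compiler explicitly in the model.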

The Cedar System Modeler was never a successful product. SML was designed specifically to represent Cedar programs. The Modeler was built only as a research prototype and used only by its developers. However, because of the general applicability and clear articulation of its principles, the System Modeler has been widely influential and its key ideas are reflected in most later efforts.

DSEE, Apollo's Distributed Software Engineering Environment, is a proprietary programming environment supporting large scale software development efforts. It has been available for five years. While DSEE separates the specification of version selection information, in a configuration thread, and the specification of manufacturing information, in a system model, it combines the two in a bound configuration thread to identify all derived objects. The bound configuration thread, or BCT, lists all commands and arguments used to create a given derived object, including input versions, tool versions and options. Any BCT can be saved and used for recreating the particular object it designates. Unlike a Cedar System Model which is, at least in principle, meant to be read and understood by people, a BCT is an internal representation and not something that a user would manipulate.

Shape, part of the Software Configuration Engineering system, is a software manufacturing tool that is upwardly compatible with make. It is integrated with a dedicated version control system, similar to SCCS. Shapefiles, unlike makefiles, allow the programmer to specify how the versions used to instantiate the dependency graph are to be chosen. The default rule is to select the "busy" object in the current directory, just like make. Shape can be instructed to create a bound shapefile that replaces version selection patterns with the specific versions used to generate a designated product. This bound shapefile then can be used to recreate the product. Derived objects are stored and identified much as they are under DSEE.

1.4.2 Selective Recompilation

Several of the manufacturing systems discussed above include mechanisms that eliminate the need for certain recompilations:

• The fourth generation make allows one to specify dependencies on individual C preprocessor variables defined in .h files.

• Odin implements a "cutoff" recompilation strategy. In multistep transformations, it uses a simple bit-wise equivalence test to compare newly derived objects with objects that already exist to determine whether to propagate changes further.

• DSEE has two mechanisms that allow programmers to control the propagation of changes. Equivalences are used interactively. Programmers can declare that one object (an old .h file, for example) can be used in any bound configuration thread instead of another object (the revised .h file). Non-critical dependencies are specified in system models. They express preferences for the manufacturing process that are used only when a derived object must be built for other reasons.

However, the real impetus for selective recompilation has come from the separately compiled modular languages.

Modular programming languages (like Ada, Mesa or Modula-2) provide the benefits of strong type checking to programs developed as collections of separate units. Only the information explicitly exported by a unit is available for use by other units, and the compiler is required to check each use of that information for compliance with its definition. While the utility of modular languages for designing and maintaining large or complex programs is unquestionable, such programs place heavy demands on the compilation system. One problem is simply determining what units might need to be recompiled after another unit has changed. Other problems include the "big inhale" [7] and "trickle-down recompilations" [51]:

• The big inhale. Before processing a compilation unit, many compilation systems first construct a symbol table containing all the symbols that are visible to the unit. This symbol table, which includes information from both directly and indirectly imported interfaces, can be large and expensive to construct. For example, the largest symbol table constructed for a Protel application included 11700 declarations from 90 interfaces and required 800 kilobytes of storage. Only about 15% of these declarations were used [32]. (Protel is a modular language for telephony that has been used at Bell Northern Research since the late 1970's [5].)

• Trickle-down recompilations. Trickle-down recompilations are a class of unnec- essary recompilations due to transitive dependencies between compilation units. In the absence of more specific information, the compiler must assume that any compilation unit that imports a changed unit, either directly or indirectly, may be affected by the change. In practice, however, many changes do not propagate to indirectly importing units, and for some changes the number of such units can be large. Unfortunately, although trickle-down recompilations are widely recognized as a problem, the extent of the problem has not been effectively quantified.

Trickle-down recompilations are only one class of redundant compilations identified by selective recompilation strategies. In the remainder of this section, I review strategies for selective recompilation that have appeared in the literature. These strategies range from the BNR Pascal compilation system, which requires the programmer to identify the units that need recompilation and relies on the linker to detect inconsistencies, to various incremental compilation systems, which use specialized databases to encode the fine-grained dependency structure of a program.

Conradi and Wanvik [7] survey the issues and engineering tradeoffs that arise from alternative approaches to separate compilation. In so doing they identify several mechanisms for limiting the big inhale and for detecting trickle-down and other redundant compilations.

Avoiding Trickle-down Recompilations

In the language Mary2, as in Gnal, interface and implementation information are both declared in the same compilation unit (see Section 1.3). Thus when a module is compiled, the compiler produces both object code and an exported interface containing the symbolic information needed to compile clients of the module. This exported interface is regenerated whenever the module is recompiled for any reason, potentially unleashing a cascade of contingent compilations. If the exported interface does not change when the module is recompiled, then importers of the interface do not need to be recompiled. This simple observation led to the development of similar selective recompilation strategies for both Mary2 and Gnal.

In both strategies, the changed module is always recompiled. If the newly recompiled exported interface has changed, then direct importers of the interface are recompiled to see if their interfaces have changed. If not, no further compilations are performed. Direct importers that are unaffected by a change to an exported interface may be recompiled needlessly, but trickle-down recompilations are eliminated.
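The rule is compact enough to state as code. The following is a minimal sketch of the cutoff rule, assuming a unit type, an importer list and a recompile operation that reports whether the unit's exported interface changed; all three are hypothetical, and a real dispatcher would also remember which units it had already processed.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Unit Unit;
    struct Unit {
        Unit **importers;      /* units that import this unit directly */
        size_t n_importers;
    };

    /* 'recompile' rebuilds a unit and returns true if its exported
       interface differs from the previous one. */
    void propagate(Unit *changed, bool (*recompile)(Unit *))
    {
        /* The changed unit is always recompiled. */
        if (!recompile(changed))
            return;            /* interface unchanged: cut off here */

        /* Otherwise recompile the direct importers, recursing only for
           those whose own exported interfaces change in turn. */
        for (size_t i = 0; i < changed->n_importers; i++)
            propagate(changed->importers[i], recompile);
    }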

This technique has been called cutoff recompilation [1]. In addition to languages like Mary2 and Gnal, it is applicable to any language requiring multistep derivations. Adams, Weinert and Tichy looked at cutoff recompilations in their study of Ada compilations; they report that half of the redundant compilations detected using the name use criteria (see Section 1.2.2) could be found using this technique.

In the Mary2 implementation [51], each exported interface has a version stamp. The version stamps of imported units are packaged with the symbolic information in the exported interface. Global consistency requires that the same version stamp appear wherever the same interface is used. To avoid trickle-down recompilations while maintaining consistency, a newly recompiled interface is given a new version stamp only if it differs from its predecessor. The same method was suggested for Protel [5], but not implemented because of the perceived expense of tracking the previous versions of modules needed for comparisons [32].

Selective recompilation for Gnal was implemented by a compilation dispatcher named tcomp and a comparison predicate named refines. These tools were designed by John Nestor and Joe Newcomer and implemented by Hank Mashburn. The Gnal strategy differs from the Mary2 strategy in two ways. First, each newly recompiled interface is always assigned a new version stamp. Second, refines allows the addition of new definitions when comparing a newly compiled interface with its predecessor. If the comparison succeeds then refines appends the version stamp(s) of the predecessor to the version stamp of the new interface; if the comparison fails then tcomp triggers the recompilation of dependent modules. In checking for version skew, the compiler allows multiple versions of the same interface to coexist (as transitive imports) in the same compilation unit as long as the version stamps of all predecessor versions are suffixes of the version stamp of the most recent version.
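The version-skew check reduces to a suffix test. The sketch below assumes, purely for illustration, that a version stamp is represented as a character string and that refines concatenates the predecessor's stamp onto the new one; under that assumption two stamps are compatible exactly when the older is a suffix of the newer.

    #include <stdbool.h>
    #include <string.h>

    /* Sketch only: stamps modelled as strings, with the predecessor's
       stamp appended to the new one by refines. */
    bool stamp_compatible(const char *newer, const char *older)
    {
        size_t ln = strlen(newer), lo = strlen(older);
        return lo <= ln && strcmp(newer + (ln - lo), older) == 0;
    }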

Other Version Stamp Methods

The CHIPSY compilation environment for Chill [27] and the BNR Pascal compilation system for a modular dialect of Pascal [33] also use version stamps to check for global consistency. In both cases, however, version stamps are associated with individual names, not with entire modules. Because the compiler records the versions of names used during compilation, link-time consistency checks require the recompilation of a client of a changed interface only if the client actually references a changed name.

In the CHIPSY environment, each exported name has an associated version counter that is updated whenever the definition of the name is changed. The linker checks to make sure that the same version of each name is used throughout a program.

In the BNR Pascal compilation system the use of version stamps is coupled with different compilation rules for different classes of change:

• no additional recompilations are necessary when a new declaration is added to an interface,

• direct importers must be recompiled when variable or procedure declarations are changed, and

• both direct and indirect importers must be recompiled when types or constants are changed.

While the programmer is responsible for initiating the necessary compilations, a dedicated linker enforces the rules by checking version stamps as in the CHIPSY environment. Evidently enterprising programmers use the linker to identify the modules that need to be recompiled, but the process is iterative when there are transitive dependencies.

The version numbers assigned by the BNR Pascal compilation system are based on the size of variables and the total size of procedure parameters. Because this is a heuristic, the full system is always recompiled and tested to prepare a release. Experience with three million lines of code written in BNR Pascal indicates that the use of differential recompilation rules reduces the number of recompilations by a factor of 3.

Smart Recompilation

Another set of selective recompilation strategies depend on maintaining auxiliary information about the use and definition of names. The best known of these strategies is Walter Tichy's smart recompilation algorithm [62] which makes recompilation decisions based on name use. Schwanke and Kaiser's smarter recompilation technique [56] extends Tichy's algorithm by allowing parts of a system to continue to use the old version of a changed declaration. Techniques proposed by Rudmik and Moore for Chill [54] and by Dausmann for Ada [13] initiate compilations on intermediate representations of programs, making recompilation decisions based on changes to the attributes of declarations. These techniques anticipate incremental compilation systems like the Rational Environment, discussed below.

Tichy's algorithm keeps track of the names defined in each interface and the names referenced in each compilation unit. When an interface is changed, the set of changed names is intersected with the set of referenced names to determine whether a unit has to be recompiled. A collection of specific tests handles special cases. A prototype implementation was built for Berkeley Pascal, which uses a C-style include mechanism. Measurements of this prototype indicated that avoiding a single compilation recovers the cost of the smart recompilation analysis.
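The core decision reduces to a set intersection. The following sketch represents name sets as arrays of strings purely for illustration; it is not Tichy's implementation, which also applies the special-case tests mentioned above.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* A unit must be recompiled only if it references at least one name
       whose definition in the changed interface is different. */
    bool must_recompile(const char *changed[], size_t n_changed,
                        const char *referenced[], size_t n_referenced)
    {
        for (size_t i = 0; i < n_changed; i++)
            for (size_t j = 0; j < n_referenced; j++)
                if (strcmp(changed[i], referenced[j]) == 0)
                    return true;   /* non-empty intersection: recompile */
        return false;              /* the change does not touch this unit */
    }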

Schwanke and Kaiser's smarter recompilation technique requires the programmer to select an initial set of modules for recompilation called the "test set". These modules are recompiled along with any other modules that share with the test set any information derived from a changed declaration. Modules that do not share the use of a changed declaration with the test set are unaffected, whether they use the declaration or not. While smarter recompilation might be valuable for changes to pervasive declarations in large systems, the data of Section 4.3 indicate that the number of modules using most declarations is too small to justify applying the technique.

The GTE Laboratories CHILL Compilation System (CCS) [54] stores the intermediate code resulting from semantic analysis in a dedicated database. The database includes a symbol table for each version of a program. This symbol table is the basis for an efficient recompilation strategy. Dependent units are reprocessed only if they depend on a changed definition, and the amount of processing done is limited based on the nature of the change. For example, extending the range of a type requires only code generation for the dependent unit. A self-compiling prototype of the CCS was built for the GTEL dialect of Pascal.

Manfred Dausmann proposed a similar strategy to minimize recompilation for Ada programs. Based on a dependence relation between the attributes of named entities and the compiler phases in which those attributes are used, the compilation of a dependent unit would be initiated from the earliest compilation phase in which a changed attribute was used.

Selective Recompilation and Interprocedural Optimization

Cooper, Kennedy and Torczon [9] describe a recompilation system designed for the Rn programming environment for Fortran. Fortran is not a modular language and therefore does not require type-checking across compilation units. However, the Rn environment performs global interprocedural optimization across compilation units. This places heavy demands on the separate compilation system. In modular languages compilation dependencies are restricted to information in interfaces. Interprocedural optimization opens a potentially much wider and less disciplined channel for information flow between modules. For example, an optimization may produce a dependency between two otherwise unrelated procedures simply because they both call the same third procedure.

Like Tichy's smart recompilation algorithm, the Cooper-Kennedy-Torczon algorithm keeps track of the information needed to make recompilation decisions. In addition to reference information, the algorithm keeps annotation sets that represent the assumptions used in generating code for each procedure. These annotation sets are compared with new interprocedural information derived from the recompilation of changed procedures to determine whether the annotated procedure must be recompiled. The algorithm can be tuned by increasing the precision of the annotation sets, in which case the number of redundant recompilations decreases while the cost of the analysis increases.

Unlike other selective recompilation techniques, this technique does not produce results that are equivalent to recompilation from scratch. While the resulting code will be functionally equivalent, the failure to recompile a procedure may miss an opportunity for additional optimization.

Incremental Compilation Systems

Incremental programming environments aim to provide the immediacy of an interpreted language without sacrificing the performance or the stability of a compiled representation. They achieve this immediacy by recompiling only the individual procedures or other program fragments that use changed declarations. In some environments these program fragments may be as small as single expressions. To do this, an incremental compilation system must manipulate whole programs, even if the programs are written in a modular language. While the distinction between compilation units remains as a program structuring mechanism, modularity ceases to be a (separate) compilation issue.

Existing incremental programming environments are monolingual. With notable exceptions, most exist only as research prototypes. In contrast to earlier environments designed to accommodate novice programmers, more recent environments are designed for modular languages, and support multiple programmers or multiple versions of software. The Rational Environment [2,21] is a fully-integrated incremental programming environment for Ada that encompasses both hardware and operating system. It is a production software system that has successfully supported its own development. SMILE [36,30] is a programming environment for a restricted dialect of C. It is interesting for the leverage it achieves using simple implementation techniques.

The Rational Environment supports interactive incremental compilation at the level of individual expressions. All program units are represented as abstract syntax trees. Semantic analysis and code generation are performed as distinct operations as the programmer "promotes" the syntax-tree representation generated by the Ada-oriented editor. Selective reprocessing is done in cooperation with the user. Before a programmer can edit a program fragment, he or she must first "demote" a subtree containing the fragment as well as subtrees (possibly in other units) containing code that depends on the fragment. This can be done interactively with guidance from the environment. The sizes of the subtrees demoted determine how much code will eventually need to be recompiled, regardless of what the programmer actually changes.

In addition to offering incremental compilation, the Rational Environment also provides two mechanisms to limit the propagation of changes. First, Ada package specifications are implemented so that importing units do not need to be recompiled when the private part changes. Second, a subsystem concept provides an additional level of encapsulation, allowing only upwardly-compatible changes to be seen outside a subsystem boundary.

Where the Rational Environment is a fully integrated environment for Ada based on abstract syntax trees, SMILE represents C programs as plain text and relies on ordinary UNIX tools for processing. A SMILE module is a collection of C procedure, variable, type and macro definitions; each of these definitions is stored in a separate file. Simple string pattern matching is used to maintain cross-reference information. While the module as a whole is the unit of code generation and access control, the individual definition is the unit of editing and semantic analysis. When a programmer wishes to edit a definition, he or she must reserve the containing module and all modules containing code that references the definition. Although the whole module is reserved, only those items that reference a changed definition are reanalyzed. Feiler and Kaiser [22] discuss granularity issues for programming environments based on their experience with SMILE.

Unlike SMILE, Integral C [53] is an integrated incremental environment for the full C language including the lexical freedom of the preprocessor. It is designed to support a single programmer working in a private workspace and uses Tichy's rules for smart recompilation to identify program fragments that need to be recompiled after a change.

Recent research on incremental compilation has focused on the use of attribute grammar formalisms to incrementally process whole programs. This approach does not require the complete reprocessing of dependent code when a program changes, but only the processing necessary to recompute invalidated attribute values. Wagner and Ford [65] use attribute grammars to define efficient separate recompilation systems that are integrated with version selection mechanisms; the same techniques are used to incrementally recheck the constraints of version selection and to determine the (minimal number of) modules needing recompilation after a change. While the Wagner-Ford approach can concurrently process changes made by a single programmer, Micallef and Kaiser [44] describe a technique for incremental attribute evaluation in a distributed environment where multiple programmers make asynchronous changes to the same program.

1.4.3 A Profile of Compiling and Linking

While it is commonplace to measure the speed of compilers, little else has been done to analyze compilation costs. The few studies (those of Adams, Weinert and Tichy; Conradi and Wanvik; and Kamel and Gammage) that are directly relevant to this thesis are discussed elsewhere. However, there is one additional study that deserves mention.

Linton and Quong [42] instrumented make to discover how much time C programmers spend waiting for compiling and linking, how many modules are compiled each time a program is linked, and the change in size of the compiled modules. Based on data collected at Stanford University in which 93 users invoked make 13000 times on 800 (mostly small) programs, the study shows that most programs are relinked after only one or two modules are recompiled. While this contrasts with the change history study of Chapter 3, which averaged six compilations per link, it is consistent with the name distribution studies of Chapter 4. In the first case it is likely that the grouped RCS changes of Chapter 3 are larger than the live changes observed by instrumenting make; while in the second case, the name distribution study models changes to interfaces made one name at a time. In real life, one must assume a mix of changes to interfaces and bodies.

The data collected by Linton and Quong must be interpreted with caution as a measure of the size of the average change. The data indicates that 20% of all make invocations resulted in linking only. Since linking is typically triggered by the recompilation of .o files, some other process (another invocation of make, perhaps) must have initiated that recompilation. It is possible that the same changes were incorporated in several products, each built by a different makefile. However, one cannot simply divide the number of compilations by the number of links to count the number of compilations per change.

Chapter 2

The Model

This chapter develops, and illustrates with examples, a model of software manufacture to be used in examining how to incorporate changes accurately and efficiently into software products. This model not only explains the specific techniques for reducing recompilation costs studied in the remainder of the thesis, it also shows that such techniques are not limited to recompilation per se and that their effectiveness is not bought at the price of reliability. Thus, in addition to a framework for classifying and evaluating powerful incremental techniques such as smart recompilation, the model also provides a foundation enabling these techniques to achieve the reliability of manufacture from scratch in a controlled environment.

Previously presented at the Trondheim Workshop on Advanced Programming Environments [4], the model has two parts. The first part, a dependency graph representation of a software configuration called a manufacturing graph, captures the static derivation relationships between the components of an instantiated system. Each graph contains a complete record of how a particular version of a system was manufactured, including any information that might distinguish that version of the system from any other. Although there is nothing particularly novel in using a graph to represent a software system, existing representations have failed to capture the complexity of the manufacturing process.

As a surrogate for manufacture from scratch in a known and stable environment, a manufacturing graph establishes the identity of a software product and makes it possible to trace properties or features of the product from its sources. These graphs also make it possible to compare different versions of a product, to recreate previous versions or produce known variants, and to assess the potential impact of a change. Consequently, they also make it possible to avoid generating new components unnecessarily where existing components might be used. This last application is the focus of Chapters 3 and 4 and the subject of the second part of the model.

The second part of the model considers how a manufacturing graph schema is instantiated. Whereas a manufacturing graph represents a single generated system, a manufacturing graph schema represents a family of potential systems all sharing the same structure. The schema describes what steps have to be performed in what order on what components to produce a given product. When previous versions of a product exist, a schema can be instantiated by generating new derived components where necessary, or by reusing already manufactured components where their substitution would go unnoticed. When one or more initial components change, it is possible to reuse any derived component not affected by the change. This is exactly what make does crudely but effectively with timestamps. However, make is not especially selective in identifying the components that are actually affected by a change.

In principle, make regenerates every "target" derived from a changed source. However, for certain changes the newly manufactured target may be effectively indistinguishable from the corresponding component in the original configuration. For example, adding a comment to an interface module will not produce a substantive change in any of its clients when they are recompiled; changing a declaration in the same interface module will affect only those clients that use the declaration. In such cases the manufacture of the target may be suppressed and an already existing component used when instantiating the new configuration. Such components are identified by difference predicates.

Tichy's smart recompilation technique implements a difference predicate, as does make's less selective timestamp heuristic. A difference predicate is simply an assertion that determines whether an existing component can be substituted for one that would otherwise need to be remanufactured. Depending on the cost of evaluating the predicate and on the interconnectivity of the to-be-instantiated schema, the choice of difference predicate can substantially affect the efficiency of software manufacture.
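Stated in programming terms, a difference predicate is simply a function over a pair of components. The sketch below models components as files, which is an assumption of the example rather than of the model, and shows the bit-wise test used for cutoff (two components are interchangeable exactly when their contents are identical); make's timestamp heuristic would be another instance of the same function type.

    #include <stdbool.h>
    #include <stdio.h>

    /* A difference predicate decides whether an existing component can
       stand in for one that would otherwise be remanufactured. */
    typedef bool (*difference_predicate)(const char *existing, const char *fresh);

    /* Bit-wise equivalence: substitution is allowed exactly when the two
       components have identical contents. */
    bool identical_contents(const char *a, const char *b)
    {
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        bool same = (fa != NULL && fb != NULL);
        int ca = 0, cb = 0;
        while (same && ca != EOF) {
            ca = fgetc(fa);
            cb = fgetc(fb);
            same = (ca == cb);
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
    }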

Chapter Outline

In turn, the following sections

1. Motivate the model using make as a convenient foil to discuss the pitfalls of software manufacture,

2. Define the manufacturing graph and graph schema representation,

3. Illustrate this representation through examples that contrast the manufacturing requirements of different compilation systems and tools, and

4. Describe the instantiation of a configuration using selective manufacturing techniques and characterize those techniques as difference predicates.

The chapter concludes with a discussion about the cost of selective manufacture and some general reflections on the utility of the model.

Because the scope of the model is considerably larger than the specific concerns of the next two chapters, readers interested primarily in the results of the Descartes change history study and in patterns of name use and visibility may wish to skim the abstract presentation of the model in Section 2.2 and most of Section 2.4 and rely more on the concrete examples and illustrations of Sections 2.3 and 2.4.1.

2.1 Pitfalls of Software Manufacture

Software manufacture is conceptually straightforward. If, starting with a base development environment populated only with a set of initial components, a software product is manufactured from scratch according to an automated script, then it is virtually assured that the resulting product represents the given set of initial components and that, given the same base environment, initial components and script, one could reproduce the same product as necessary. Rather than rebuild the product in its entirety every time anything changes, it is a small and obvious step to try to incorporate changes incrementally. Starting from a known state, one simply has to identify which initial components have changed and what might depend on those components and then propagate the changes as necessary. Unfortunately, in practice problems of both reliability and efficiency occur in all stages of this process.

Because its use is so widespread and its paradigm so universal, the tool make provides a good example of some common pitfalls in the current practice of software manufacture. Although make represents old technology and its problems are well understood, it nonetheless provides good motivation for the model. While some of its successors, notably Apollo's DSEE, have made substantial inroads in solving its most critical problems, none of these tools is based on a clean and comprehensive model of software manufacture.

Make's schema consists of the information in a makefile along with a set of default transformation rules. The makefile contains dependency lines that identify source-to-target derivation relationships between components and, where necessary, explicit transformation rules that list the command sequences necessary to recreate a target should any of the sources it depends on change. Together the dependency lines in the makefile form a directed acyclic graph. Make traverses this graph in depth-first order, executing the command sequences necessary to bring targets "up to date" with their sources according to an implicit change propagation rule that triggers the associated transformation rule if any of the sources on a dependency line have a more recent creation date than any of the targets, or if any of the targets do not exist.
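The propagation rule is easy to state precisely. The fragment below, which is not make's own code, expresses the rule for a single dependency line in terms of UNIX file modification times, assuming a POSIX system.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/stat.h>
    #include <time.h>

    /* Rebuild if any target is missing, or if any source is newer than
       the oldest target on the dependency line. */
    bool needs_rebuild(const char *targets[], size_t nt,
                       const char *sources[], size_t ns)
    {
        struct stat st;
        time_t oldest_target = 0;
        bool have_target = false;

        for (size_t i = 0; i < nt; i++) {
            if (stat(targets[i], &st) != 0)
                return true;                 /* a target does not exist */
            if (!have_target || st.st_mtime < oldest_target)
                oldest_target = st.st_mtime;
            have_target = true;
        }
        for (size_t i = 0; i < ns; i++)
            if (stat(sources[i], &st) == 0 && st.st_mtime > oldest_target)
                return true;                 /* a source is newer */
        return false;
    }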

Make's extraordinary success as a tool may be due to its simple model of manufacture and its ability to handle arbitrary transformation steps. This model is adequate as long as there is need for only one version of each source or target component named in the makefile and as long as only those components are allowed to change.

In addition to the obvious problems caused by incomplete or inaccurate dependency information or by the outright manipulation of timestamps, the make model breaks down

• when the tools or standard components constituting the base development environment change,

• when its timestamp protocol is violated because distributed or multiprogrammed make invocations are not synchronized,

• when, as the structure of an application evolves, there is a need to change the makefile itself, or

• when it becomes necessary to produce different versions of one or more targets based on different options or on different tools, for example when producing debugging or instrumented versions of certain components or when generating a product for an alien architecture.

While these problems often can be surmounted by placing restrictions on how make is used, such restrictions imply a change in paradigm and may be awkward or costly, particularly for large systems using many makefiles, where the need is greatest.

The fundamental problem with make is that components are identified by their filenames and extensions, not by their derivation history. Different versions of a component all share the same name. After make has been executed and the context of execution lost, it is impossible to identify how any given target component was actually constructed. Under such circumstances any notion of reproducibility is moot. It is this part of the problem that is addressed by the representation of Section 2.2.

When its timestamp protocol is obeyed, when the dependency lines in the makefile are accurate and when only the components named in the makefile are allowed to change, make is conservative in propagating changes. It may regenerate some targets unnecessarily, if for example a source file is written gratuitously, but it will not fail to rebuild a target that is out of date. As long as the number of targets that depend on a given source component is small, so that only a few components need to be regenerated whenever make is invoked, make's efficiency is not at issue. However, when the potential impact of a change is large, make is unable to take into account the nature of the change or the context in which the changed component is used. For example, with make it costs as much to change a comment or to reformat a declaration as it does to change a pervasive type definition. Again the problem is especially pronounced for large systems where hundreds of components potentially may be affected by certain changes. This part of the problem is addressed by the difference predicates defined in Section 2.4.

2.2 The Representation of a Software Configuration

A manufacturing graph is a directed acyclic graph made up of components, which represent arbitrary software artifacts, and manufacturing steps, which represent arbitrary derivation relationships between components.

By capturing the tools and standard components comprising the base development environment, the set of initial components comprising the to-be-manufactured product, and the derivation steps by which those components are transformed and combined, a manufacturing graph serves as a surrogate for manufacturing a software product from scratch in a controlled environment. Consequently, the representation is extremely detailed. This level of detail is necessary because any omission or change in the structure or content of a graph can potentially affect the product represented. Although appropriate as an internal representation to be manipulated by tools or used for archival purposes, manufacturing graphs are not meant to be specified directly by the user.

The sole purpose of these graphs is to record the derivation dependencies between the components of a software system. While the content and structure of a manufacturing graph might be checked for conformity with higher level semantic constraints such as might be specified by a module interconnection description of a system, the graph does not represent those constraints explicitly. For example, knowing that a product was supposed to have been instantiated with a particular version of a component won't help reproduce a product that was built with a different version.

This section defines manufacturing graphs and manufacturing graph schemas. Section 2.3 gives examples of the kinds of schemas that arise in practice.

2.2.1 Components

A component is any artifact used in software manufacture that has the potential to affect the outcome of the manufacturing process if replaced by an artifact with a different value. Components are not limited to the objects conventionally considered as part of a configuration, generally restricted to "source". Instead, the model extends the notion of a component to include such artifacts as the option string supplied when a tool is invoked, the tool itself, the machine-readable representation of a processor's serial number or the instantaneous value of a processor's clock (a timestamp).

Whether it is a file, a structured object retrieved from a software database, a string of ascii text, or a string of bits, each component is atomic, has an immutable value (its concrete representation), and is identified by a unique label that serves to distinguish that component from every other component in the universe. A component's atomic value is the only property of the component that can affect the outcome of the manufacturing process, and that value, referenced transparently through a unique identifier, does not change.

Variants, Revisions and Other Component Attributes

Although components are atomic with respect to the model, many do have internal structure. This internal structure is important to the tools that manipulate components and to the predicates that determine whether the differences between two components are significant or not, but it is not visible in the structure of the manufacturing graph itself.

Many components also have attributes such as a symbolic name, a type, a revision number, a size, a creation time, an owner, etc. Such attributes are important in classifying or managing components, but they do not affect the outcome of the manufacturing process. When the value of a component attribute is used in the manufacturing process, it is treated as a component in its own right. For example, if the filename or creation time of a program source module is embedded in the corresponding object module, it is treated as an independent, atomic and immutable component used in the creation of the object module.

One consequence of the treatment of attributes in the model is that variants and revisions have no special status. Components that are versions of one another are not treated any differently from components that are otherwise unrelated; they simply happen to share certain attributes.

2.2.2 Manufacturing Steps and Step Schemas

Unlike components, manufacturing steps have no concrete existence.

A manufacturing step represents an atomic derivation relationship between two sets of components: an initial (input) set and a target (output) set. The target set is said to depend on the initial set. Typically the initial set consists of a tool (or some other agent such as a human being), the components representing the actual arguments, including options, supplied when the tool is invoked, and any components that the tool accesses through the environment. The target set is exactly that set of components produced by a specific invocation of the tool with the remaining inputs as parameters. Because the manufacturing step records an actual derivation process, the components in the target set are consistent, by definition, with those in the initial set. Because the initial components are immutable and the target components are created by the manufacturing process, the intersection of the initial and target sets is empty.

Figure 2.1: A Schema for Yacc

As an example of a manufacturing step, consider the UNIX tool yacc, a parser generator. When invoked with the "-d" option on an appropriately described grammar, yacc produces two outputs: a parser in a C source file named "y.tab.c", and a list of token numbers in a C include file named "y.tab.h". The file "y.tab.c" includes code taken verbatim from a parser skeleton named "/usr/lib/yaccpar". No other information is used in generating the output. Thus a manufacturing step representing the invocation of yacc on the file "grammar.y", invoked by the command "yacc -d grammar.y", relates unique instances of inputs named "yacc", "grammar.y" and "/usr/lib/yaccpar" and an instance of the anonymous string "-d" to unique instances of outputs named "y.tab.c" and "y.tab.h". This is represented schematically in Figure 2.1.
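For concreteness, the yacc step can also be written down as data. The declarations below are one possible rendering chosen for this illustration; the model itself does not prescribe any particular representation, and the numeric identifiers are made up.

    #include <stddef.h>

    typedef unsigned long uid;        /* unique, immutable component label */

    typedef struct {
        uid id;                       /* identifies the component */
        const char *symbolic_name;    /* an attribute, not part of its identity */
    } Component;

    typedef struct {
        const Component **inputs;     /* tool, arguments, sources, ... */
        size_t n_inputs;
        const Component **outputs;    /* components produced by this step */
        size_t n_outputs;
    } ManufacturingStep;

    /* The yacc invocation of Figure 2.1 written as a step. */
    static const Component tool_yacc = { 101, "yacc" };
    static const Component opt_d     = { 102, "-d" };
    static const Component grammar   = { 103, "grammar.y" };
    static const Component yaccpar   = { 104, "/usr/lib/yaccpar" };
    static const Component y_tab_c   = { 201, "y.tab.c" };
    static const Component y_tab_h   = { 202, "y.tab.h" };

    static const Component *ins[]  = { &tool_yacc, &opt_d, &grammar, &yaccpar };
    static const Component *outs[] = { &y_tab_c, &y_tab_h };

    static const ManufacturingStep yacc_step = { ins, 4, outs, 2 };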

Completeness and Reproducibility

Virtually all systems supporting software manufacture make and exploit the assumption that repeated invocations of the same tool on the same input will always produce identical results. This is the basis of reproducibility. Since derived components can always be recreated, efficiency considerations are the only reason to keep previously derived components around; the assumption is exploited to avoid recreating components that already exist. In both the Cedar System Modeler and Apollo's DSEE previously derived objects are cached simply to be used as accelerators in subsequent manufacturing operations.

Contrary to this conventional wisdom, the model does not make the same assumption. Whereas in principle it is desirable to completely capture everything that might affect the outcome of a manufacturing step, in practice this is not always possible. The first problem is that one might omit a dependency; the second is that repeated invocations of the same tool on the same input sometimes produce different results. The latter problem, which most often is really a consequence of the first, is a problem no matter how conservative and controlled the manufacturing strategy.

Many factors might cause repeated invocations of the same tool on the same input to produce different results. These factors include transient hardware problems, subtle bugs in the tool itself or in the host operating system, or assumptions built into the tool about facilities provided by the underlying host computer or operating system that are violated when either is changed. What these factors have in common is that they represent pervasive and unexpected sources of change. It can be argued that they are no different from other sources of change and should be treated in exactly the same way: their effect should be captured by explicit dependencies on appropriate components, for example, on the state of the host computer and operating system. At some point this hairsplitting will yield a component unique to each invocation of a manufacturing step.

As an alternative, the model considers each invocation of a manufacturing step, whether on different inputs or not, as distinct. This provides a hedge against aberrations such as the above. The model then provides a uniform mechanism (difference predicates) for equating manufacturing steps that produce equivalent outputs, regardless of the values of their inputs.

Manufacturing Step Schemas

Whereas a manufacturing step represents a specific relationship between individual components, a manufacturing step schema is a template for a class of existing and potential relationships that all share the same structure and may share one or more initial components.

A manufacturing step schema is a manufacturing step in which the entire target set and zero or more components in the initial set have been replaced by variables. The process by which these variables are bound to components is called schema instantiation. All variables in the target set must be instantiated simultaneously and may be instantiated only after all variables in the initial set have been instantiated.

Divorced from an implementation that supplies identifiers for components, all the examples of software manufacture in this chapter are shown schematically. The alternative would be to invent some artificial labeling convention that would contribute little to understanding.

2.2.3 Manufacturing Graphs and Graph Schemas

While a manufacturing step records a single step in the manufacturing process, a manufacturing graph records the set of steps representing the instantiation of a particular configuration from a particular set of initial components.

Figure 2.2: A Rudimentary Manufacturing Graph

A manufacturing graph is a directed graph composed of manufacturing steps tied together by the components that are produced by one step and consumed by others. As the rudimentary graph of Figure 2.2 shows, a manufacturing graph is a bipartite graph in which nodes representing components alternate with nodes representing manufacturing steps. Components that have no in-edges are the primitives of the configuration represented by the graph. These are the components where changes to the configuration originate; they are represented by shaded circles in the figure. Components that are meant to be distributed as part of the (sub)system under development are designated as software products; they are represented by double circles in the figure. A component may be both a primitive and a product; consider, for example, an interface that is both used and exported by some subsystem.

Manufacturing graphs are acyclic. Because no component exists prior to its manufacture, no component can be derived from itself or from one of its derivatives.1 Because no component is the direct output of more than one manufacturing step, no node representing a component can have more than one in-edge. Every manufacturing step in the graph must contribute to the generation of some product; otherwise it is gratuitous. However, a manufacturing graph is not necessarily connected. If a configuration exports multiple products, each may be associated with a disjoint subgraph.

Unique Identifiers and the Status of Components

The status of a component as primitive or derived (source or target) is not always an independent property of the component; instead it depends on how the component is used in a given configuration. In particular, the same component may be derived as a product of one configuration and used as a primitive in another. For example, a subroutine library may be represented both as a product that is manufactured and exported by one subsystem and as a primitive component of a client subsystem. Since each component is labeled with a unique identifier that belongs to the component, not the configuration, there is never any confusion about its identity or its derivation history.

If a configuration is defined by the set of products it exports, the same configuration may be represented at many different levels of detail. A project that is composed of several independently developed subsystems may choose to represent each subsystem internally as a separate configuration producing intermediate products. Externally, the released product may be represented as a single configuration. For a bootstrapped system, it may be desirable to represent only a single iteration for a released product but necessary to record several iterations to track the incorporation of a particular feature.

The alternative would be to represent the entire known genealogy of every component or else lose its derivation history.

Manufacturing Graph Schemas

The relationship between manufacturing graphs and graph schemas is the same as that between manufacturing steps and manufacturing step schemas. Whereas a manufacturing graph represents a fully instantiated software configuration, a manufacturing graph schema is a template for a family of existing and potential configurations that all share the same structure and may share some components.

1 The constraint that a manufacturing graph be acyclic does not imply that cyclic relationships of other kinds are impossible. Such relationships can and do occur at the module interconnection level, notably in bootstrapped systems. When such systems are manufactured, however, at least one of the components in each cycle must be multiply instantiated. See Section 2.3.4 for an example.

A manufacturing graph schema is simply a manufacturing graph in which one or more manufacturing steps have been replaced by step schemas and in which no instantiated component depends on a variable. Consequently, a graph schema has at least one exported product that is represented by a variable. Variables for primitive components in manufacturing graph schemas are instantiated by version selection; variables for derived components are instantiated by manufacture or, as described in Section 2.4, by appropriating suitable existing components.

Interesting schemas are derived by combining version selection with a graph representing an already instantiated configuration: new components are substituted for one or more primitive components, and any derived components that depend on a new component are replaced with variables.

2.2.4 Encapsulated Subgraphs

At times, particularly for presentation purposes, it is convenient to be able to manipulate a subgraph of a manufacturing graph or graph schema as if it were a single manufacturing step or step schema. For example, when representing the manufacture of a large system, it may be desirable to hide the internal details of the manufacture of a subsystem so that a complex manufacturing sequence is encapsulated as a single step. Alternatively, some graphs contain multistep transformations that, adding little but size and complexity to the graph, can be encapsulated without fundamentally changing the graph's structure.

The UNIX cc command makes a good example of how a multistep transformation might be represented as a single step. This encapsulation is used in the examples of the next section.

Under Berkeley UNIX, the cc command, often thought of as the C compiler, is actually a dispatcher that invokes, as directed, the C preprocessor (cpp), the C compiler (ccom), an optional code improver (c2), the assembler (as), and finally, the link editor (ld). The process by which an optimized relocatable object file is generated is shown schematically in Figure 2.3. The components labeled "cc_cmd", etc., represent the command lines passed to each tool on invocation. The components labeled "[tmpl]", etc., are temporary files created during the compilation process. When the cc command is invoked with different options, the manufacturing graph may have a different shape.

Figure 2.4 shows how the same process might be encapsulated into a single step corresponding more closely to the user's view of the compiler.

Figure 2.3: A Schema for the UNIX cc Command (the Preprocessor, Compiler, Optimizer and Assembler steps, connected by command-line components and temporary files)

Figure 2.4: The UNIX cc Command Encapsulated

What goes on inside the encapsulation boundary (the outer box in the figure) is effectively hidden. Since the primitive components "cpp", "ccom", "c2" and "as" are not represented outside the boundary, that boundary must be opened when any schema for the encapsulated step is instantiated.

In general a subgraph encapsulated as a single manufacturing step must contain all the nodes of the parent configuration that occur on any path between the inputs and outputs of the encapsulated step. The inputs of the encapsulated step are a subset of the primitive components of the subgraph; the outputs of the step are those components that are derived in the subgraph and used or exported by the parent configuration.

2.3 Examples of Manufacturing Graph Schemas

This section gives several concrete examples of manufacturing graphs including generic schemas for the compilation of C programs in the UNIX and SMILE programming environments and for the compilation of Ada programs as prescribed by the Ada Language Reference Manual [16]. These examples show how the manufacturing graph representation allows one to examine and compare the manufacturing implications of tool design and application architecture.

I have chosen to present generic schemas showing the main features of each compilation system to avoid overwhelming the reader with the amount of detail contained in the schemas of even small programs. For example, a complete manufacturing graph schema for the Descartes crostic client studied in Chapter 3 has 27 encapsulated compilation steps, each with an average of 19 inputs (including compiler, compilation command, .c file, and 3 UNIX and 13 application .h files), and a link step with 31 inputs (.o files, linker, etc.). The generic schemas are supplemented by examples of typical steps for the Descartes crostic client and for the Rational Kermit program, one of the Ada programs studied in Chapter 4.

To show the range of the manufacturing graph representation, this section also shows partial schemas for the generation of the lexical analyzers discussed in Section 1.3.1 and for bootstrapping the Mini IDL tools.

2.3.1 Conventional Compilation Strategies for C and Ada

The languages C and Ada both allow programmers to divide programs into separate compilation units. This capability is desirable for many reasons, for example:

• programmers can simultaneously work on separate parts of a program,

• the same unit can be reused in multiple programs,

• the impact of many changes can be confined to a small part of the whole program, and

• only those parts affected by a change need to be recompiled.

However, when a program is divided into multiple units, those units have to share information, if only the addresses of global procedures and variables at run time. C and Ada differ in their requirements for sharing information between compilation units. Their compilers exemplify two widely used strategies for compiling the separate units.

Independent Compilation in C

C is an independently compiled language. This means that the results of one compilation do not depend on the results of any other compilation: all the information needed to

analyze and generate code for a compilation unit is available within that unit. Information sharing is accomplished through redundant declarations or by source text inclusion. There are no mechanisms for the compiler to check whether redundantly declared information is consistent or whether the same text is shared by any two units. Some independently compiled language systems never perform such checks; others perform them at link time.

In C the physical unit of compilation is the file. The logical unit of compilation is the declaration and any number of declarations can be grouped in a file. Because a declaration is visible only within its containing file, information to be shared between files must be redeclared in each sharing file.

The C preprocessor is a vital part of the C compilation system. It is used to avoid having to maintain multiple copies of shared information. In addition to providing simple macro expansion and conditional compilation, the C preprocessor will insert the text of a specified file at a specified point in the text of another file as directed by an "include statement". C programmers exploit this feature by placing information to be shared in an "include" or "header" file, typically identified by the extension ".h". This file is then included by each C source file (identified by the extension ".c") that requires the shared information.
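A minimal sketch of this convention, with invented file and procedure names rather than anything from the Descartes sources:

    /* geometry.h -- an invented header holding the shared declarations */
    #ifndef GEOMETRY_H
    #define GEOMETRY_H
    struct point { int x, y; };
    extern int manhattan(struct point a, struct point b);
    #endif

    /* geometry.c -- one compilation unit; it includes the header it implements */
    #include <stdlib.h>
    #include "geometry.h"
    int manhattan(struct point a, struct point b)
    {
        return abs(a.x - b.x) + abs(a.y - b.y);
    }

    /* client.c -- another compilation unit; inclusion gives it the same
     * declarations, and the two .c files are compiled independently and
     * then linked together                                                 */
    #include <stdio.h>
    #include "geometry.h"
    int main(void)
    {
        struct point o = { 0, 0 }, p = { 3, 4 };
        printf("%d\n", manhattan(o, p));
        return 0;
    }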

Figure 2.5: A Generic Manufacturing Graph Schema for a C Program (compilation steps take a1.c ... an.c, the needed .h files, cc and a compilation command to a1.o ... an.o; a link step combines the .o files, crt0.o and ld to produce a.exe)

Figure 2.5 shows a generic schema for the compilation and linking of a C program under UNIX. The program source consists of some number of .c files "a1.c ... an.c" and some number of .h files "i1.h ... im.h". The latter include both system .h files such as "stdio.h", providing access to the C standard i/o library, and application .h files. Each compilation step, encapsulated as in Figure 2.4, produces a relocatable object file (a .o file) from a .c file, the C compiler and compilation command, and a subset of the program .h files. Although it is not clear in the generic schema, not every .h file need be included in every compilation step. All the .o files are linked together in a single step with specified C libraries and the UNIX run-time entry procedure ("crt0.o") to produce the final executable program.

compose.c "-- _, _compose.o utility.h except.h assoclist.h gp. h image, h style .h • rule.h glyph.h format.h rsupport.h

"cccompose.h-c compose.c" Z CC P'__

Figure 2.6: A Typical Manufacturing Step Schema for a C Program

A typical compilation step schema for the Descartes crostic client is shown in Figure 2.6. Filenames in pointed brackets represent UNIX include files.

Many C compilation systems include an implementation of the tool lint, which checks the consistency of declarations across compilation units. Lint operates in two passes. The first pass, which typically shares source with the C compiler, performs syntax and semantic analysis on each compilation unit separately, generating a symbol table with information about each procedure definition and use. The second pass combines the symbol tables produced by the first pass to check the return and argument types of every procedure against each of its call sites. A generic schema for lint is identical in structure to the schema for compilation in Figure 2.5, with pass one substituted for each compilation step and pass two substituted for the link step. Unfortunately, since lint is invoked separately from the compiler, there is no guarantee that the set of files that it processes is the same set processed by the compiler.

Separate Compilation in Ada

Separate compilation in Ada is typical of a class of modern programming languages including Mesa and Modula-2 that support the separate definition of interfaces and their

implementations and that require compile-time type-checking across compilation unit boundaries. In these languages all information to be shared at compile time must be declared in an interface.2 When an interface is imported by a client, the names declared in the interface are made visible to the client. This is typically implemented by precompiling the interface to generate a symbol table that is used in subsequent compilations. Thus patterns of interface visibility constrain compilation order so that every interface unit imported by a client must be compiled before the client unit is compiled.

In C a compilation unit has no special status. In Ada, a compilation unit is a distinguished syntactic entity. It can be a specification unit defining a subprogram, package or task interface; it can be a body unit defining the implementation of the corresponding specification; or it can be a body subunit defining the separately compiled body of a nested subprogram, package or task. An Ada package is a collection of declarations (a module); an Ada task represents a parallel execution thread.

Names declared in a specification unit are visible in any other compilation unit that names the specification in a with clause. The importing unit may be another specification unit, a body unit or a body subunit. All the names imported or declared in a specification unit are visible in the corresponding body unit and all the names imported or declared in a body unit are visible in all its subunits.

Figure 2.7: A Generic Manufacturing Graph Schema for an Ada Program (specification, body and subunit compilations produce .sym and .o files, which are linked by ld into prog.exe)

2 This statement is not entirely true. While an Ada interface always contains sufficient information to compile a program, Ada does allow the inline substitution of procedure bodies. This exposes information (the text of the procedure body) declared only in an implementation unit.

Figure 2.7 suggests how a typical Ada program is compiled and linked. The unit visibility relation determines compilation order: a unit can be compiled only after all the units named in its with clause have been compiled. A body unit can be compiled only after its corresponding specification unit, and a subunit only after the containing body. This requirement is enforced by the Ada compilation system, as is the requirement that only one version of each compilation unit be used to generate any given program.

The compilation of each Ada compilation unit produces a symbol table to be used in the compilation of any dependent units; symbol tables for body units and subunits are used in compiling any nested subunits. The compilation of body units and subunits also produces relocatable object files that are linked together with any necessary libraries to form an executable program.

Figure 2.8: A Typical Manufacturing Step Schema for an Ada Program (compilation steps for k.spec, k.body and k_connect.subunit, with the symbol tables they consume and the .sym and .o files they produce)

Compilation steps for the specification and body units of package "kermit" and the subunit defining the body of procedure "connect" are shown in Figure 2.8. The name "kermit" is abbreviated to the letter "k" so the figure will fit on the page. A distinguishing arrow in the figure identifies symbol files produced as the result of antecedent compilations.

A Comparison of the Two Strategies

The C and Ada strategies for separate compilation make different tradeoffs in compile-time checking, program organization, and the complexity of the compilation system.

Independent compilation makes it easy for separate individuals or organizations to work independently. While independent compilations can be performed in parallel or in any order, shared text must be reprocessed every time it is used. Independent compilation also precludes compile-time type-checking across compilation unit boundaries, which means that programmers may not learn about incompatibilities between units until link time or later.

When interfaces are compiled separately, each interface is processed only once, no matter how many times it is used. Compile-time checks ensure that interfaces are used consistently and that only one version of each interface appears in a program. While these checks make the integration of independently developed subsystems much more reliable,3 they do restrict compilation order and limit the number of compilations that can be performed in parallel.

Figure 2.9: The Source of an Unnecessary Recompilation to Prevent Version Skew (interface A depends on names in interface B that do not depend on interface C)

When interfaces are compiled separately, version skew (as defined in Section 1.3.1) is detected at compile time. This can be a particular problem if the programmer responsible for a unit that fails to compile has no control over the units causing the version skew. In addition, when a low level interface changes, a significant fraction of a system may have to be recompiled simply to prevent version skew. For example, even though A may depend on B and B on C, as Figure 2.9 shows, A may not necessarily depend on C. Yet A will still have to be recompiled every time C changes.

Many language reference manuals notwithstanding, the choice of compilation strategy is rightfully a part of a language implementation, not a language definition. Although effectively precluded by vagaries of the C language (the use of macros and the syntax of typedefs), there is no reason the include files of other independently compiled languages could not be precompiled. Interdependencies between include files would then lead to what looks like a schema for Ada. Conversely, Ada specification files could be reread and reprocessed during the compilation of each referencing body unit, producing what looks like a schema for C. In fact, although BLISS uses a C-like file inclusion mechanism for shared information, the language allows programmers to specify that certain included files be precompiled [14]. The BNR Pascal compiler, on the other hand, does not precompile module interfaces because the source text representation is more compact than a symbol table [33].

3 It is often noted that once a program written in a language like Ada finally compiles, it usually runs.

2.3.2 C Compilation in the SMILE Programming Environment

The SMILE and Rational programming environments both use incremental techniques to provide faster turnaround during development. The Rational environment is a fully integrated environment for programming in Ada. It is based on a highly interconnected representation that nevertheless maintains the integrity of Ada compilation units. Thus Rational manufacturing graph schemas look much the same as other Ada schemas. What distinguishes the Rational environment is how schemas are incrementally derived from instantiated graphs and how they are incrementally reinstantiated.

SMILE, on the other hand, performs routine-at-a-time semantic analysis by maintaining individual C objects as distinct entities. For this reason it is interesting to contrast SMILE with conventional compilation strategies for C.

Originally developed to support the GANDALF development effort, SMILE supports programming in a restricted dialect of C. It augments C with a module construct and requires the declaration of procedure prototypes. In return it provides greater type safety than most implementations of C. Users can manipulate individual declarations independently and are provided with rapid turnaround for semantic analysis. Although implemented using the UNIX equivalent of chewing gum and baling wire, SMILE has proven to be an effective tool for the projects that use it [30].

The following description of SMILE is derived from the SMILE Reference Manual [36] in the GANDALF System Reference Manuals [59] and from conversations with Charlie Krueger, its current maintainer.

SMILE programs are divided into modules, each of which is a collection of C procedure, variable, constant, type and macro definitions, some of which are designated for export. Each of these items is stored in a separate file.

Each module also establishes the compilation context for each of its contained items. This context consists of a project prelude, common to all the modules in a system, and a module prelude; both typically contain references to external .h files or declarations of library routines. In addition, the context includes a collection of declarations imported from other modules and declarations of all the items local to the module.

In SMILE, each procedure is analyzed independently using a variant of the tool lint. This strategy requires that the normal compilation context be supplemented with procedure stubs for all imported and locally declared procedures so that lint can check each call site against the corresponding procedure definition. Code is generated for the module as a whole using the UNIX C compiler when all procedures have been analyzed successfully.

Figure 2.10: A Generic Manufacturing Graph Schema for SMILE (semantic analysis of individual procedures with lint and generated stub contexts, followed by module-level code generation with cc and a final link)

A generic semantic analysis step for a single SMILE procedure ("aproc1.g") and a generic code generation step for a SMILE module ("amodule.c") are shown, somewhat simplified, in Figure 2.10. The types and constants declared in the module are combined with declarations imported from other modules to form part of the context for all module compilations ("aimport.h"). Stubs for imported procedures are also collected to form part of the supplemental context to be used in the semantic analysis of locally defined procedures ("agimport.h"). Items exported by the module are used in similar fashion to form contexts for the compilation of procedures defined in other modules (e.g. "cimport.h"). Variable definitions are collected in yet another context file ("aobject.h"). The remainder of the unit to be analyzed is formed by combining the code for a single procedure with stubs for the remaining procedures ("aproc1.g"). Code for all procedures is combined for code generation. The resulting object file is then linked with other object files and libraries to form an executable program.

The various procedures that collect and transform individual declarations appropriately for their intended use are not shown in Figure 2.10. For example, SMILE stores the header and body of a procedure in separate files so that it can distinguish between interface and implementation changes. The representation of a procedure header is similar to an ANSI C function prototype [34] (SMILE predates ANSI C). This representation must be transformed into a standard C function header when the procedure itself is analyzed and compiled, into a stub procedure declaration when other procedures are analyzed, and into a simple external procedure declaration when the containing module is compiled.
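As a rough, invented illustration of these three renditions (the procedure name and exact forms are hypothetical; actual SMILE output differs in detail):

    /* Suppose a procedure "ascore" whose stored header resembles the prototype
     *     int ascore(char *guess);
     * The expansions below only suggest the idea described in the text.        */

    /* (a) In the unit built to analyze and compile the procedure itself
     *     (cf. "aproc1.g"), the header becomes an ordinary function definition: */
    int ascore(char *guess)
    {
        return guess[0] != '\0';      /* the body is kept in its own file */
    }

    /* (b) In the stub context used while other procedures are analyzed
     *     (cf. "agimport.h"), it becomes a stub so lint can check call sites:   */
    int ascore(char *guess)
    {
        (void)guess;
        return 0;                     /* stub: never executed */
    }

    /* (c) When the containing module is compiled as a whole (cf. "amodule.c"),
     *     it becomes a simple external declaration:                             */
    extern int ascore();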

The SMILE compilation system is clearly more complex than the corresponding C compilation system. It is also fragile in response to abuses of the C macro language. However, SMILE is parsimonious in the amount of work that has to be done to reanalyze a program after a change. In particular, when there is a change to the interface of an exported declaration, only those procedures that actually use the declaration are reanalyzed. While the context in which each procedure is analyzed may be larger than necessary, the ability to import individual items selectively from other modules ensures that each imported item in the context is necessary for the compilation of the module as a whole.

2.3.3 Generating the Tartan Lexical Analyzer Subsystem

The manufacturing graph schema for the generation of a Tartan lexical analyzer makes an interesting example for two reasons. First, the production of a lexical analyzer involves the use of tools other than just a compiler and a linker. Second, the interconnection structure of the lexical analyzer is such that it is possible to show a fair amount of the system in a relatively compact graph.

As stated in the description of the Tartan environment in Section 1.3.1, the lexical analyzer is written in the proprietary language Gnal. Like Ada, Gnal offers strict type checking across module boundaries and uses precompiled symbolic information to compile clients of an interface. Unlike Ada, Gnal does not support the separation of interface and

implementation but combines the two in a single compilation unit. Thus when a Gnal module is compiled, the result is both a relocatable object file and a symbol table to be used in compiling clients of the module. Consequently, a new version of the symbolic information transmitted between compilations is produced whenever any change is made to a module.

Figure 2.11 shows a partial manufacturing graph schema for the generation of a lexical analyzer. The schema shows only those manufacturing steps necessary for the generation of the lexer compiled interface "lexer.sym". It does not show all the inputs and outputs of every manufacturing step, omitting the Gnal compiler as an input and the relocatable object file produced as an output of every compilation step. These omissions are necessary to fit the graph on one page. In addition, the names of components that are used in more than one place are repeated rather than fill the page with a tangle of crossing lines. Consequently, the names of primitive components are printed in capital letters to distinguish them from derived components. These primitive components include Gnal source files (with extension ".gnl") as well as symbol table files (with extension ".sym") imported from other subsystems.

The schema for the lexer interface contains 35 manufacturing steps. Of these, 29 steps are Gnal compilations; the remaining 6 steps, shaded in the figure, are performed by the tools leg, tbg, idl and bh. While the maximum depth of the graph is 7 steps, and the maximum fan-out of any one step is 8 steps, the lexical analyzer is just one subsystem of many used to generate a compiler front end. However, because most of the lexical analyzer subsystem is target language independent, only the 8 steps that depend on input "L.FEG" must be re-executed when the user changes the description of the language to be analyzed. A user's view of the generation of a lexical analyzer would hide the whole manufacturing process in an encapsulated subgraph.

2.3.4 Bootstrapping the Mini IDL Tools

IDL [49] is a language for describing structured data. An IDL translator, supported by an IDL run-time system, provides facilities for reading, writing and manipulating instances of the structures described in an IDL specification. The Mini IDL system [49, Part IV] is a spare implementation of an IDL subset, written in and targeted to C. The system consists of an IDL translator, idl, and a structure instance linker, rtgen. These tools communicate using an IDL data structure, the IDL symbol table. When the description of that structure changes, their remanufacture involves a bootstrap operation.

Figure 2.11" A Partial Schema for the Generation of a Lexical Analyzer 56 CHAPTER 2. THE MODEL

Figure 2.12: A Manufacturing Graph Schema for Mini IDL

Figure 2.12 shows a partial schema for the bootstrapping of the Mini IDL translator and structure instance linker. The components involved in the bootstrap are circled in the figure. In addition to the tools idl and rtgen, the bootstrap involves the description of the structure of the IDL run-time symbol table ("ist.idl") and ascii and binary instances of the symbol table for the symbol table structure ("ist.ast" and "ist.o"). This is the source of the circularity in the bootstrap.

In Step 1 of the figure, when the IDL description of the symbol table changes, the existing IDL translator is used to generate an ascii symbol table for the new structure as an instance of the old structure. In Step 2 this ascii instance is converted to a linkable binary instance by the existing structure instance linker. The binary instance is then linked with the object code for the structure instance linker in Step 4 to produce an intermediate structure instance linker that manipulates instances of the new symbol table structure. Meanwhile, the ascii symbol table generated by Step 1 is hand-edited in Step 3 transforming it into an instance of the new symbol table suitable for processing by the intermediate structure instance linker in Step 8. This produces a binary instance of the new symbol table for the new symbol table. This is linked with object files for the IDL translator and structure instance linker that have been recompiled using interfaces to the new symbol table structure produced by Step 1. The result is new versions of the translator and structure instance linker that communicate instances of the new symbol table structure.

Since multiple components have the same name in the Mini IDL bootstrap, its schema cannot be directly encoded by make or by any other system that describes a schema solely in terms of filenames and extensions. This is true also for any manufacturing operation in which there is a cyclic dependency.

2.4 The Instantiation of a Software Configuration

Because a manufacturing graph traces the primitive components of a configuration through each stage of their transformation into a set of software products, there can be no question that the products of the graph represent the primitive components and that product behavior can be explained by looking at those primitives. When a primitive component changes, as long as all the variables in the resulting schema are remanufactured in the proper order, it is equally certain that the new products will represent the new set of primitives. However, some of the manufacturing steps necessary to reinstantiate the new configuration may be redundant. The price of this security may be considerable inefficiency in manufacture. For example:

1. When instantiating a graph schema derived from an existing configuration, it is unnecessary to execute any manufacturing step whose inputs are identical in value to those of the corresponding step in the original configuration. If the step in question happens not to be repeatable, then it is probably unwise to reexecute it.

2. Because the inputs to two manufacturing steps need not be identical for their outputs to have identical values, it is further unnecessary to perform any step in the new schema whose outputs would be identical in value to those of the corresponding step in the original configuration, regardless of the values of its inputs.

3. Because the differences between two outputs may be an artifact of processing that has no bearing on the function or performance of any resulting product and may disappear after further processing, it is further unnecessary to execute any step or sequence of steps that does not eventually result in a material difference in a product.

2.4.1 Change, Context and the Incidence of Redundant Manufacture

When a primitive component is changed, the changed component replaces the original in a manufacturing graph schema. Since the component may be used in several places, the eventual impact of the substitution depends not only on the type of change but also on each context in which the changed component is used. Conventional manufacturing strategies treat all changes as important in all contexts. Sometimes, however, simple tests (like the bitwise comparison of two components) are sufficient to prevent redundant manufacturing operations; at other times, specialized knowledge may be required to accomplish the same purpose.

Before characterizing such tests further, this section considers several specific cases where selective remanufacturing decisions can be made. While the following examples are drawn mostly from C and other UNIX tools, which are likely to be familiar to most readers, comparable examples can be found for any other language or system. In each example, it is presumed that the only change to a configuration is the one mentioned. When more than one change occurs at a time, potential interactions between the changes must be considered. Each of the specific techniques mentioned in this section is reviewed in more detail in Section 1.4.2.

If there are no differences between a component and its replacement in a manufactur- ing graph schema, so that only the identity of the component and not its value has changed, then the substitution will have no impact regardless of context. Whenever a component's value is changed, however, no matter what the component or what the change, there may be some circumstances in which the substitution has an effect and others in which it does not.

The effect of some changes is the same for whole classes of manufacturing steps:

• When a programmer changes a comment in a C .h file, it is safe to suppress the recompilation of all including .c files because the change will not affect any generated code. However, specially formatted comments are often used to embed directives or other information used by other tools. (One such tool is lint.) Noting that a comment has changed is not a sufficient condition to suppress steps involving these tools.

The effect of other changes must be evaluated in each context where the changed component is used:

• When a programmer changes the definition of a name declared in a .h file, it is safe to suppress the recompilation of those .c files that do not reference the name either directly or indirectly. However, each .c file has to be inspected individually to determine whether it falls in this category. This is what Tichy's smart recompilation mechanism does.

• For certain changes to names declared in .h files, it is safe even to suppress the recompilation of some referencing .c files. For example, if a programmer appends a field to a C structure definition, it is necessary only to recompile those clients that reference the size of the structure, either by allocating storage or by doing pointer arithmetic.

If, due to alignment requirements, the addition of the field does not change the size of the structure, it may not be necessary to recompile any client. While the effect of such a change is comparable to changing a comment, recognizing this requires much more intimate knowledge of the compiler's behavior. Techniques proposed by Rudmik and Moore, by Dausmann and by Cooper, Kennedy and Torczon might identify redundant compilations of this nature. (A sketch of the size-dependence distinction follows.)
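A hedged sketch of the size-dependence distinction, with invented types and clients:

    /* shape.h, new version -- the field "tag" has been appended */
    struct shape {
        int x, y;
        int tag;                      /* appended field */
    };

    /* client_a.c -- only dereferences existing fields through a pointer; the
     * generated code does not depend on sizeof(struct shape), so its
     * recompilation could, in principle, be suppressed                         */
    int move_right(struct shape *s)
    {
        return ++s->x;
    }

    /* client_b.c -- allocates storage for the structure and therefore
     * references its size; it must be recompiled when the size changes         */
    #include <stdlib.h>
    struct shape *new_shape(void)
    {
        return calloc(1, sizeof(struct shape));
    }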

Sometimes it is possible to apply a simple test to the outputs of a manufacturing step after it has been executed rather than try to predict, in advance, the effect of a change in its inputs. Rain's technique for avoiding trickle-down compilations in Mary2 does this, as does the refines predicate for Gnal. "Write-if-changed" file system interfaces institutionalize this mechanism by creating a new version of an output file only when it differs from the existing version of the same file.
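A minimal sketch, not drawn from any particular system, of what such a write-if-changed helper might look like in C:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Rewrite "path" only when its current contents differ from "data" (or when
     * it does not yet exist), so that downstream steps keyed off the file are
     * not triggered gratuitously.  Returns 0 on success, -1 on error.           */
    int write_if_changed(const char *path, const char *data, size_t len)
    {
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            char *old = malloc(len + 1);
            size_t n = old ? fread(old, 1, len + 1, f) : 0;
            int same = (old != NULL && n == len && memcmp(old, data, len) == 0);
            free(old);
            fclose(f);
            if (same)
                return 0;             /* unchanged: keep the old version */
        }
        f = fopen(path, "wb");
        if (f == NULL)
            return -1;
        if (fwrite(data, 1, len, f) != len) { fclose(f); return -1; }
        return fclose(f) == 0 ? 0 : -1;
    }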

Often it is easier to determine from a changed input whether a change to an output will be significant or not:

• Existing programs will not be affected when a new routine is added to a subroutine library. However, the internal structure of the library may change dramatically if the new routine affects the order in which its constituent routines are collated. Since specialized knowledge of the format of the library would be required to recognize the nature of the differences between the old and new versions of the library, it might be easier simply to recognize that a new routine was added in the first place.

• A program verification system might be able to prove that the semantics of two particular subroutines are equivalent. When the same two subroutines are embedded in two versions of the same module and compiled using an optimizing compiler, it may be difficult even to isolate the corresponding sequences of generated code, let alone determine that the two object modules are functionally equivalent.

Some changes may have an effect on one output of a manufacturing step, but not on another:

• A programmer may continue to make changes to the semantic actions associated with the productions of a yacc grammar long after the syntax of the language it describes has stabilized. Such changes will be reflected in yacc's output y.tab.c, but output y.tab.h will be unchanged. Therefore, while the yacc step itself must be executed, any recompilation triggered only by the new version of y.tab.h is unnecessary.

Changes that affect y.tab.c but not y.tab.h are perceived to occur frequently enough so that make users arrange to selectively copy y.tab.h to a surrogate only when it differs from its predecessor. Client files that depend on the surrogate do not need to be recompiled every time y.tab.h is regenerated. This strategy simulates write-if-changed.

Finally, some manufacturing decisions cannot be made one step at a time:

• Suppose a .h file is used by two programs, one of which needs a field added to a C structure definition contained therein. As long as all the .c files of the second program that reference the structure definition continue to use the old version of the .h file, none need to be recompiled. As soon as one of those files is recompiled with the new .h file, however, all will have to be recompiled.

Schwanke and Kaiser's smarter recompilation mechanism applies this reasoning to separate partitions of a single program. As long as the modules in two partitions do not communicate using the changed structure, it is permissible for each to use a different version of the .h file.

2.4.2 What it Means for Two Products to be Effectively Indistinguishable

With manufacture from scratch as a standard, it is permissible to use selective techniques such as those described above when instantiating a manufacturing graph schema as long as the resulting products are effectively indistinguishable from comparable products manufactured from scratch. It remains to define, however, what it means for two products to be effectively indistinguishable. For example, requiring bitwise identity is probably too strong a condition and may be impossible to achieve if the representation of a product includes a timestamp.

A software product can be many things: the source code for an interface, a subroutine library, an executable program, an operations manual, etc. It might be destined for a specific use by a single programmer (for example, an executable might be built to test the feasibility of a particular solution to a bug) or it may be released to a larger community to be used as each member of the community sees fit. These different uses place different requirements on the underlying manufacturing system. In the first case, the programmer may not be concerned with any properties of the program that do not pertain to the solution of the bug and might be willing to take chances with the manufacturing process as long as they do not compromise what he can learn about the bug.4 The second case demands complete reliability of the manufacturing process.

Rather than require a single standard for all selective manufacturing techniques, the model posits an external test that could be applied to the products of a schema as instantiated and to the products of the same schema manufactured from scratch. As long as the test cannot distinguish between the two sets of products, they are considered to be the same. In this way, constraints on the manufacturing process may vary according to the needs of a project.

4 There are certain techniques, discussed below, that are not guaranteed to be reliable under all circumstances. If these techniques should fail, however, recovery may be as simple as changing the name of a declaration.

Given such a test, the objective of schema instantiation is to predict as early as possible in the manufacturing process whether a given step or step sequence will have an effect on its outcome. If not, then the given step or step sequence need not be executed. Making such predictions is the function of difference predicates.

The test need not exist in fact, but only in principle. While it may access only the products of a configuration themselves, not their derivation histories, it may compare those products in any way deemed important. For example, the test can compare representations or subject products to a battery of tests for functionality or performance. It may even consult an oracle that predicts their future uses. Any differences in properties not tested do not matter.

In what follows, any detectable difference in the function or performance of a product is considered to be significant. Thus differences in representation are permitted but not differences in content. It should be noted that stronger criteria such as equivalence in representation imply equivalence in functionality and performance. The ramifications of weaker criteria are not in general considered in the remainder of the thesis.

2.4.3 Difference Predicates

When a manufacturing graph schema is derived from the graph representing an already instantiated configuration, the instantiation of the schema can be treated as a decision making process. The inputs of each to-be-instantiated step are compared with the inputs of the corresponding step in the original configuration to determine if new outputs need to be manufactured or if the already existing outputs can be reused. 5 This decision is made by consulting a difference predicate. The predicate is applied to the set of original inputs and the set of new inputs. If the predicate evaluates to true then the existing outputs are guaranteed to produce satisfactory products for the new inputs. If the predicate evaluates to false then no such guarantee can be made.

When a difference predicate is true, the outputs of the already instantiated step are said to be compatible with the inputs of the to-be-instantiated step.
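As a minimal sketch, using invented types rather than the thesis's notation, a difference predicate can be modeled as a function over the original and new input sets; when it holds, the previously manufactured outputs are appropriated:

    #include <stddef.h>
    #include <string.h>

    struct component {
        const char *unique_id;
        const char *value;
    };

    /* True when the already existing outputs are compatible with the new inputs. */
    typedef int (*difference_predicate)(const struct component *old_inputs,
                                        const struct component *new_inputs,
                                        size_t n);

    /* One possible predicate: every input has the same value as before
     * (this corresponds to the predicate called SAME VALUE below).              */
    static int same_value(const struct component *old_inputs,
                          const struct component *new_inputs, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (strcmp(old_inputs[i].value, new_inputs[i].value) != 0)
                return 0;
        return 1;
    }

    /* Instantiating one step: reuse the existing outputs when the predicate
     * holds, otherwise run the manufacturing step on the new inputs.            */
    static void instantiate(difference_predicate p,
                            const struct component *old_in,
                            const struct component *new_in, size_t n,
                            void (*manufacture)(const struct component *, size_t))
    {
        if (!p(old_in, new_in, n))
            manufacture(new_in, n);   /* outputs must be (re)manufactured */
        /* else: the outputs of the corresponding original step are appropriated */
    }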

A manufacturing step instantiated by the successful application of a difference predicate might be represented as in Figure 2.13. The figure shows that while "y.tab.c" and "y.tab.h" were manufactured from an original set of inputs including "grammar.y", they are compatible with a new set of inputs including "grammar.y'". While Figure 2.13 appears to be schematic, it is not: one must assume that each name in the figure represents a unique component, not a variable.

Figure 2.13: The Successful Application of a Difference Predicate (yacc and /usr/lib/yaccpar produce y.tab.c and y.tab.h from grammar.y; those outputs are compatible with the changed input grammar.y')

5 In principle the to-be-instantiated step can be compared with any manufacturing step to find suitable outputs. Such outputs are simply most likely to be found in the corresponding step of the previous configuration.

When instantiating a manufacturing step using already existing outputs, it is essential to remember that while the outputs may be compatible with the new inputs, they were actually derived from the original inputs. This is important because it makes it possible to reproduce the schema as instantiated.6 It is even more important when processing successive changes to the same component. For example, a predicate may not produce the same result when it compares "grammar.y" and "grammar.y''" as when it compares "grammar.y'" and "grammar.y''".

It is important to note that difference predicates are not necessarily symmetrical. For example, when a new declaration is added to an interface, the new version can transparently replace the old everywhere it is currently used (ignoring the possibility of redeclaring an already defined name). In contrast, when an existing declaration is deleted from an interface, there is no guarantee that all the current clients of the interface, which may reference the name that was deleted, can use the new version.
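A small, invented C illustration of this asymmetry:

    /* widget.h, original version */
    extern int open_widget(void);
    extern int close_widget(void);

    /* widget.h, after an addition: existing clients still compile, so the new
     * version can transparently replace the old one (barring a clash with an
     * already defined name)                                                     */
    extern int open_widget(void);
    extern int close_widget(void);
    extern int resize_widget(int w, int h);

    /* widget.h, after a deletion: a client that still calls close_widget() no
     * longer compiles, so the substitution is not guaranteed in this direction  */
    extern int open_widget(void);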

The Strength of Predicates

Difference predicates model manufacturing strategies ranging from manufacture from scratch (the constant predicate FALSE), to an idealized make (the predicate that compares components' unique identifiers), to the most elaborate mechanism for selective recompilation.

6 A difference predicate may guarantee that the schema can be remanufactured to produce products that are equivalent in function and performance (for example), but it may not guarantee that those products are equivalent in representation.

If a given predicate is always true whenever a second predicate is also true, then either the two predicates are equivalent or the second predicate is stronger than the first. Stronger predicates make finer distinctions between components than weaker predicates, so a change that is treated as compatible by a weaker predicate might be treated as incompatible by a stronger predicate. For example,

• The predicate FALSE is the strongest predicate. It requires the outputs of every step to be manufactured regardless of inputs.

• The predicate SAME COMPONENT is weaker than FALSE; it is true whenever the components it compares have the same unique identifiers.

• The predicate SAME VALUE is weaker than SAME COMPONENT. It is true not only whenever two components have the same unique identifiers; it is also true whenever they have the same value.

Any other predicate will be weaker than SAME VALUE and will require some knowledge about the components it evaluates. The selective manufacturing techniques discussed in Section 2.4.1 are good examples of such predicates and form a hierarchy of sorts of their own.
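The three strong predicates might be sketched as follows (an invented representation in which a component carries only a unique identifier and a value):

    #include <string.h>

    struct component { const char *unique_id; const char *value; };

    static int pred_false(const struct component *old_c, const struct component *new_c)
    {
        (void)old_c; (void)new_c;
        return 0;                                    /* always remanufacture */
    }

    static int pred_same_component(const struct component *old_c, const struct component *new_c)
    {
        return strcmp(old_c->unique_id, new_c->unique_id) == 0;
    }

    static int pred_same_value(const struct component *old_c, const struct component *new_c)
    {
        return pred_same_component(old_c, new_c) ||
               strcmp(old_c->value, new_c->value) == 0;
    }

Whenever a stronger predicate accepts a pair of components, every weaker predicate accepts it too, which is exactly the ordering described above.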

It should be noted that not all predicates are comparable. Some may not be applicable to the same sets of components; others may not be consistently weaker or stronger on all sets of inputs. Under unusual circumstances, for example, Dausmann's predicate and Schwanke and Kaiser's predicate may require the recompilation of disjoint sets of files.7

Partial Predicates

It is in general necessary to reevaluate each changed component in each context in which it is used to determine whether the change is significant in that context. This is particularly important when there are changes to more than one component used in the same context. For certain kinds of changes, however, it is possible to compare two versions of a single component in isolation and make provisional decisions about a whole class of steps. Changes to comments in interfaces provide one such opportunity; after such a change, none of the interface's clients need to be recompiled.

7 Schwanke and Kaiser's predicate requires the recompilation of all units in a given partition that reference a changed name. Dausmann's predicate requires the recompilation of all units that reference a changed attribute of a changed name. It may be the case that the only units that reference the attribute in question are not in the partition that Schwanke and Kaiser's predicate recompiles. See Section 1.4.2 for more information on these predicates.

A predicate that makes a decision about a single component in isolation is a partial difference predicate. If partial predicates on each of the inputs of an applicable manufacturing step all evaluate to true, then the step can be instantiated using already existing outputs.

The strong predicates of the previous section (FALSE, SAME COMPONENT, and SAME VALUE) are just as effective when used as partial predicates as they are when used as general predicates. Because they have access to less information, partial predicates are constrained to be less selective, and therefore stronger, than general predicates. They are cheaper to implement for the same reason. Assuming that an unchanged component can always be substituted for itself, the advantage of partial predicates is that only changed components have to be evaluated and that each changed component must be processed only once, no matter how many times it is used.
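A minimal sketch of how partial predicates compose, again over an invented component representation: a step's existing outputs may be reused only if every changed input passes the partial predicate.

    #include <stddef.h>
    #include <string.h>

    struct component { const char *unique_id; const char *value; };

    /* A partial difference predicate examines one original/new pair in isolation. */
    typedef int (*partial_predicate)(const struct component *old_c,
                                     const struct component *new_c);

    /* Unchanged components pass trivially; every changed input must be accepted
     * by the partial predicate for the step's outputs to be reusable.            */
    static int step_outputs_reusable(partial_predicate p,
                                     const struct component *old_in,
                                     const struct component *new_in, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int unchanged = strcmp(old_in[i].value, new_in[i].value) == 0;
            if (!unchanged && !p(&old_in[i], &new_in[i]))
                return 0;
        }
        return 1;
    }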

Figure 2.14: The Successful Application of a Partial Difference Predicate

Figure 2.14 shows how a partial predicate might be represented. Again, one must assume that each name in the figure represents a unique component.

Predicate Strength and Component Granularity

Sometimes it is possible to simulate a weaker general predicate with a stronger partial predicate by changing the granularity of the components in the manufacturing graph. For example, make uses an approximation of the partial predicate SAME COMPONENT. Barring gratuitous changes, make would recompile exactly the same .c files as Tichy's smart recompilation predicate if each included .h file consisted of exactly one declaration. This is effectively the strategy used by SMILE.

Approximate Predicates

In practice a predicate that considers every eventuality is frequently too strong (it fails to detect many compilations that are in fact redundant) or too expensive. Sometimes, it may be cost effective to use a predicate that only approximates the truth. Such predicates are based on assumptions about how programmers program and about how changes are made.

A particularly interesting class of predicates are the partial predicates that make assumptions about the context in which a component will be used. One such predicate was implemented by the Tartan tool refines, used in the recompilation of Gnal programs. Refines permitted upwardly compatible additions to an interface without requiring the compilation of its clients. Because it independently compared each compiled interface (.sym file) with its predecessor in isolation, however, there was no guarantee that such additions would not conflict with names declared in other files. Refines was a successful tool despite this loophole because (1) individual programmers do not in general define the same name in the same scope of their own code, because (2) to compensate for a flat link-time name space, Tartan had instituted naming conventions for combining subsystems written by multiple programmers, and because (3) the cost of recovery from a failure of the predicate was low.

While the use of a predicate like refines might postpone the detection of a name conflict, such conflicts are unlikely to result in the wrong name being used. Existing references will continue to use the original declaration; where there is no conflict, new references will use the new name; and potentially conflicting new references will be caught by subsequent compilations. When the conflict is detected, the remedy is simply to change the offending name.

While approximate predicates are not appropriate for preparing releases, they may be cost effective at other times during development. When the predicate encounters a change having the potential for an error, it may be prudent for it to supply a warning. Approximate partial predicates may be particularly important for C where there is no guarantee that it is even possible to parse a .h file independently of its compilation context.

2.5 The Cost of Selective Manufacture

In the prevailing view, the costs of software manufacture are seen as inevitable. A project can choose to accept the performance penalty of programming in an interpretive environment (in lisp) where there are essentially no manufacturing costs or expect to endure the normal delays of system regeneration. Such projects may use make for

day-to-day operations and resort to manufacture from scratch for epochal builds. While waiting for recompilations, clever programmers may devise explanations and stratagems to defeat the worst of the delays; but such attempts are usually shortsighted. (They are comparable to trying to optimize a program without profiling its performance.)

Specific smart recompilation mechanisms have been built to deal with the perceived problems of separate compilation systems. In keeping with these mechanisms, the model developed in this chapter could be used to justify implementing the weakest predicate possible in any given programming environment to minimize the number of manufacturing steps that would have to be performed. The model also offers an alternative.

With the notable exception of the BNR Pascal implementation, designed to eliminate the dependencies that trigger secondary recompilations, little has been done to systematically understand and control manufacturing costs. The model provides a point of view and a framework for asking the questions needed to do exactly that. This is demonstrated in the case studies of the next two chapters.

The strength of the model is that it is technology independent. It is applicable to any multistep manufacturing process that manipulates discrete components, regardless of the tools involved or the underlying computational environment. 8 The same paradigm applies whether manufacturing steps are executed in parallel or sequentially, or if there is a dramatic shift in storage versus computation costs. Differences in technology raise new questions about the manufacturing process and result in different cost/benefit tradeoffs.

8 The one area where the model is not particularly suited is in situations where each action updates the global state of some database.

Chapter 3

An Analysis of Compilation Costs

Once we admit the use of an explicit test (a difference predicate) to determine whether a given manufacturing operation has to take place, we challenge conventional approaches to software manufacture that treat all changes as if they had the same impact, raising obvious questions about how well conventional techniques perform and about what alternatives might be more effective:

• How much software manufacture done in response to a typical change is really necessary and how much is redundant?

• What techniques might be appropriate for detecting this redundancy?

If, for example, make rarely initiated an extraneous compilation, there would be little point in appealing to a more selective test. If, on the other hand, a sizable fraction of the compilations initiated by make proved to be redundant, it would make sense to ask what predicates most effectively identify those redundant compilations.

This chapter reports on a case study comparing the performance of seven difference predicates on 190 revisions recorded in the change history of the Descartes crostic client. While the study would be valuable simply as an exercise in evaluating predicate performance, there is more to it than that. Although the study is based on only one system, its results are sufficiently pronounced to suggest that it indeed does make sense to challenge conventions in software manufacture and that the gains to be had from doing so can be surprisingly large.


Figure 3.1: A Simple Categorization of 572 Compilations (171 necessary compilations, 217 compilations due to declarations, 156 compilations due to unused .h files, 28 otherwise gratuitous compilations).

Figure 3.1 categorizes the 572 compilations initiated by make in response to the analyzed changes. The figure shows that:

• Fewer than one-third of the compilations were actually necessary.

• Almost two-fifths of the redundant compilations could be attributed simply to .h files that were included but unused. While this is largely an artifact of Descartes programming conventions, it should serve as a cautionary example.

• Almost all the remaining redundant compilations could be detected simply by using cross-reference information as in Tichy's smart recompilation algorithm.

• Discounting the redundant compilations due to superfluous .h files, make still recompiled more than twice the number of files necessary.

The data in Figure 3.1 is based on changes to both .h and .c files. The figure does not show that when only .c files changed, few of the compilations initiated by make were unnecessary.

The lessons of the Descartes study are clear: superfluous include files can contribute significantly to recompilation costs; make is effective when there are no interface changes; and when there are interface changes, smart recompilation will find 9 out of every 10 redundant recompilations.

The next section gives an overview of the study by presenting a simple example that illustrates the key factors of the analysis. The next two sections provide background information relevant to the interpretation of the results. Section 3.2 describes each of the seven predicates and explains why it was chosen for the study. Section 3.3 characterizes the Descartes project, its programming conventions and its change history. Section 3.4 describes the methods of the study, including what data was collected and how it was obtained. The results of the study are given in Section 3.5, and the chapter concludes by considering what those results mean. The impatient reader may want to read the overview in Section 3.1, consult Table 3.3 (on page 78) which summarizes the predicates, and then fast forward to Section 3.5, returning to the more detailed descriptions of the study as interest warrants.

Related Work

Many selective recompilation mechanisms have been proposed or implemented in the last decade and experienced users will attest to their benefits. However, the only other attempt to quantify the effectiveness of these techniques is a recently completed study by Adams, Weinert and Tichy [1] that compares the performance of Tichy's smart recompilation method with conventional recompilation in Ada. Although Adams and company analyzed a larger body of code written in a different language, their main result is the same as that reported here. Smart recompilation saves about half of the recompilations required by conventional methods.

Adams, Weinert and Tichy also measured the performance of cutoff recompilation. They report that about half the redundant compilations detected by smart recompilation could be found by comparing the new and old outputs of an interface compilation to determine whether subsequent client compilations must be performed. This strategy is not appropriate for independently compiled languages like C.

3.1 An Overview of the Study

The amount of manufacture necessary to restore consistency to a system after a change is bounded by the structure of its manufacturing graph. The amount that is redundant depends on what has changed and how the change is used. The effectiveness of any manufacturing strategy depends not only on its ability to discriminate between necessary and redundant recompilations but also on how often the latter occur.

The following simple example illustrates how three manufacturing strategies might be compared.

A Simple Example

Consider a simple software system that consists of the following components (all C files; a minimal sketch of their contents appears below):

• File definitions.h defines names A, B and D.

• File client1.c includes definitions.h and uses name A.

• File client2.c includes definitions.h and uses names A and B.
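The contents of these files are not given in the text; the following minimal sketch, with an invented variable A, macro B and type D, merely makes the stated name-usage pattern concrete.

    /* definitions.h (hypothetical contents): defines names A, B and D.   */
    extern int A;                    /* name A: a global variable          */
    #define B 512                    /* name B: a preprocessor constant    */
    typedef struct { int x; } D;     /* name D: a type used by no client   */

    /* client1.c: includes definitions.h and uses only name A.            */
    #include "definitions.h"
    int scaled_a(void) { return A * 2; }

    /* client2.c: includes definitions.h and uses names A and B.          */
    #include "definitions.h"
    int padded_a(void) { return A + B; }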

Under the BIG BANG¹ approach to software manufacture, every change is a cataclysm. This approach requires that the entire system be remanufactured whenever any part of it changes. Thus, no matter which of the above files were to change, BIG BANG would require that both client1.c and client2.c be recompiled. The advantage of this strategy, at least for C, is that it is not necessary to keep track of the dependencies between files.

In contrast to the BIG BANG approach, typical conservative remanufacturing strategies require the recompilation of only those components that have themselves changed or that depend on a component that has changed. Under the MAKE approach (named in honor of the tool), if the file definitions.h were to change then both client1.c and client2.c would have to be recompiled. However, if either .c file were to change, only that .c file would have to be recompiled. If changes to each of the three files were equally likely, this approach would average 1.3 compilations for every 2 compilations required by BIG BANG, saving 1 compilation out of every 3.

The model suggests that even these conservative approaches to software manufacture may not be particularly cost effective. A third approach to software manufacture recognizes that every change to a .h file does not necessarily affect every .c file that uses it. The NAME-USE approach (smart recompilation) considers the effect of each change separately. In the example, both client1.c and client2.c might be affected by a change in the definition of A; so if A were to change, both files would have to be recompiled. However, if the definition of B were to change then only client2.c would need to be recompiled; and if D were to change, then neither .c file would have to be recompiled. Assuming changes to A, B and D were equally likely, NAME-USE would average only 1 compilation for every 2 required by MAKE whenever definitions.h changed.

¹BIG BANG is a more descriptive name for the predicate FALSE.

Average Number of Compilations per Change

               definitions.h   client1.c   client2.c   average
  BIG BANG           2             2           2          2
  MAKE               2             1           1         1.3
  NAME-USE           1             1           1          1

Table 3.1: A Comparison of Three Approaches to Software Manufacture

The differences between these three approaches are summarized in Table 3.1. The table shows the average number of compilations required by each approach for a single change to any one file, assuming that changes to each file are equally likely, and that changes to any definition within the .h file are also equally likely.

Because compilation costs are generally proportional to the size of a compilation unit, it might be more appropriate to compare the number of lines of code compiled by each approach instead of simply counting compilation units. In C the size of a compilation unit is the sum of the sizes of the .c file and each included .h file. If the sizes of the components in this example were as follows:

• File definitions.h: 6 lines of code.

• File client1.c: 35 lines of code.

• File client2.c: 50 lines of code.

then the size of the compilation unit consisting of client1.c and definitions.h would be 41 lines of code, and the size of the unit consisting of client2.c and definitions.h would be 56 lines. Table 3.2 shows the results of using these numbers to compute the relative costs of each approach as in Table 3.1.

Average Number of Lines Compiled per Change

               definitions.h   client1.c   client2.c   average
  BIG BANG          97             97          97         97
  MAKE              97             41          56         65
  NAME-USE          51             41          56         49

Table 3.2: The Same Comparison Based on Number of Lines Compiled

BIG BANG requires that both client1.c and client2.c be recompiled always. The sum of their sizes is 97 lines of code. MAKE requires that both units be recompiled when definitions.h changes, but only requires one client to be recompiled when that client changes. The average number of lines of code compiled, when changes are equally distributed among the files, is (97 + 41 + 56)/3 or 65 lines. NAME-USE requires both clients to be recompiled when the definition of A changes in definitions.h. It requires only client2.c to be recompiled when the definition of B changes, and it requires no recompilations when D changes. Thus, the average number of lines of code compiled when definitions.h changes is (97 + 56 + 0)/3 or 51 lines and the average overall for NAME-USE is (51 + 41 + 56)/3 or 49 lines of code. Since the sizes of the two compilation units in the example are so close, the relative costs of the three strategies are the same whether measured in lines of code or measured in compilation units.
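The averages in Tables 3.1 and 3.2 follow mechanically from the file sizes and the equal-likelihood assumptions; the short program below (not part of the study's tooling) reproduces them.

    #include <stdio.h>

    int main(void)
    {
        /* Component sizes from the example, in lines of code.             */
        int def_h = 6, client1 = 35, client2 = 50;
        int unit1 = client1 + def_h;   /* compilation unit 1: 41 lines     */
        int unit2 = client2 + def_h;   /* compilation unit 2: 56 lines     */
        int both  = unit1 + unit2;     /* recompiling everything: 97 lines */

        /* Lines compiled when definitions.h changes.  NAME-USE averages   */
        /* over the equally likely changes to A (both units), B (unit 2    */
        /* only) and D (nothing).                                          */
        double make_h     = both;
        double name_use_h = (both + unit2 + 0) / 3.0;   /* 51 lines        */

        /* Average over equally likely changes to the three files.         */
        printf("BIG BANG: %4.0f lines per change\n", (both + both + both) / 3.0);
        printf("MAKE    : %4.0f lines per change\n", (make_h + unit1 + unit2) / 3.0);
        printf("NAME-USE: %4.0f lines per change\n", (name_use_h + unit1 + unit2) / 3.0);
        return 0;
    }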

Clearly this small example grossly oversimplifies the problem of analyzing the differences between approaches to software manufacture. It does, however, illustrate the relevant factors in that analysis.

• For any given compilation model, the first factor is the actual interconnection patterns of a software system as expressed within that model. Here the compilation model is that of C, and the interconnection patterns are the file inclusion and name usage information given in the example. (For a discussion of how C is compiled see Section 2.3.1.)

• The second factor is the actual pattern of changes that require the system to be remanufactured. In the example above, it was assumed that changes were equally distributed among all the files in a system and furthermore, that when a definition file changes, only one change happens at a time and each name is equally likely to change.

• Finally we need a basis of comparison. Here two metrics are used: the average number of units and the average number of lines recompiled by each approach.

For any given software system and history of changes, one can analyze the actual interconnection structure and pattern of changes to determine what difference predicates are effective for detecting redundancy in the manufacture of that system. In the following sections, I describe how I did this with the Descartes data, and what I learned as a result.

An Aside Concerning Predicate Evaluation Costs

The unit and line count metrics used in the simple example above, and in the Descartes study itself, reflect only the compilation costs associated with each predicate. This is only one of two factors necessary to completely assess predicate performance. The other factor is the cost of evaluating the predicates themselves. This cost can vary widely and depends heavily on predicate implementation. In general, however, the more discriminating the predicate, the more expensive it is to evaluate. For example, the predicate BIG BANG costs nothing to evaluate. The predicate MAKE requires comparing the unique identifier (or what serves as such on any given system) of each primitive component in the to-be-instantiated schema with the unique identifier of the corresponding component in the already instantiated manufacturing graph. Failure of this comparison is an appropriate precondition for the application of weaker predicates including NAME-USE. While the cost of NAME-USE is incurred only when MAKE fails, its evaluation is based on parsing changed files to identify changed names. NAME-USE also requires the maintenance of a cross-reference database to find where changed names are used.

Although the data reported here does not reflect predicate evaluation costs, it would not be difficult to factor in such costs for specific implementations of the predicates analyzed.

3.2 The Seven Difference Predicates

The seven predicates selected for study represent conventional approaches to software manufacture as well as plausible alternatives. The following labeled paragraphs describe what each predicate does and why it was chosen for the study. For the convenience of the reader, this information is summarized in Table 3.3. The section concludes with a comparison of the predicates' relative strength.

BIG BANG

BIG BANG requires that the entire system be remanufactured whenever any part of it changes. The advantages of this approach are numerous. BIG BANG costs nothing to implement and nothing to evaluate. It is not necessary to retain any information about previous configurations because there is no need to determine which files may have changed; and, at least with a language like C, it is not even necessary to keep track of the dependencies between files to decide what must be compiled and in what order. There is no chance of failing to recompile a .c file when a .h file it includes changes if the .c file is recompiled in any event.

BIG BANG is a plausible approach for small systems and a necessary one for systems that are out of control due to convoluted dependencies or the inability to identify changes. It is a cost effective choice for any system, however, only if weaker predicates consistently recompile a significant fraction of the system.

In the Descartes change history study, BIG BANG represents an upper bound against which the relative performance of the other predicates is evaluated.

MAKE

In contrast to BIG BANG, typical conservative remanufacture strategies require the recompilation of only those components that have themselves changed or that depend on a component that has changed. MAKE considers any .c or .h file with a new version identifier to have changed. Formally, it is desirable for this version identifier to be a unique identifier; in practice it is often a timestamp. In the Descartes change history study, it is an RCS revision number.

The predicate MAKE does what make would do given an accurate makefile and no tampering with timestamps. MAKE is easy to implement and even easier to evaluate; all it has to do is compare the version identifiers of corresponding files in a predecessor configuration and its successor. The only information that has to be retained from the predecessor configuration is its instantiated manufacturing graph; no provision has to be made for saving the content of any changed file.
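A minimal sketch of that comparison follows; the structure and field names are invented for illustration and are not those of the program described in Section 3.4.3.

    #include <string.h>

    /* One primitive component of an instantiated manufacturing graph.     */
    struct component {
        const char *name;       /* e.g. "comptools.c"                      */
        const char *version;    /* version identifier: here an RCS         */
                                /* revision number such as "1.4"           */
    };

    /* MAKE's difference test: a component may have changed whenever the   */
    /* version selected for the new schema differs from the version        */
    /* recorded for the same component in the predecessor configuration.   */
    int make_may_have_changed(const struct component *predecessor,
                              const struct component *successor)
    {
        return strcmp(predecessor->version, successor->version) != 0;
    }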

MAKE is an effective strategy when most changes to .c files are real, not gratuitous, and when changes to .h files are rare or when they affect every client of the .h file.

In the Descartes change history study MAKE represents standard practice in software manufacture. Compared with MAKE, the remaining predicates show how standard practice might be improved.

GRATUITOUS (HEADERS)

GRATUITOUS (HEADERS) is a partial predicate that detects gratuitous changes in .h files. It behaves identically with MAKE with respect to .c files, but for .h files, GRATUITOUS (HEADERS) suppresses compilations based on null revisions (where the version identifier is the only thing that changes) or on changes confined to comments and whitespace.²

GRATUITOUS (HEADERS) is appealing as a strategy for software manufacture because it is easily implemented and because a single test of a .h file applies to all compilations that include the file. Because GRATUITOUS (HEADERS) must scan both predecessor and successor versions of a changed .h file to localize changes to comments and whitespace, it is necessary to retain old versions of .h files in a form that makes this possible. Alternatively, GRATUITOUS (HEADERS) might be integrated with an editor.

GRATUITOUS (HEADERS) is an effective strategy for software manufacture if there is a high incidence of null revisions or gratuitous changes to .h files, especially if those .h files have many clients.

²It should be noted that GRATUITOUS (HEADERS) is actually an approximate predicate, since one can abuse the C preprocessor in such a way that changes to comments or whitespace become significant. Such abuses did not occur in the Descartes change history and are not especially common in practice.

Compared with MAKE, GRATUITOUS (HEADERS) measures the compilation costs associated with gratuitous changes to .h files.
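One way to realize the test is to normalize both versions of the .h file by discarding comments and whitespace and then compare the results. The sketch below ignores string literals and preprocessor subtleties; it is an illustration of the idea, not the mechanism used in the study.

    #include <ctype.h>
    #include <string.h>

    /* Copy src into dst, dropping whitespace and C comments.  String       */
    /* literals and line splicing are not handled; a production version     */
    /* would have to treat them carefully.                                  */
    static void normalize(const char *src, char *dst)
    {
        while (*src) {
            if (src[0] == '/' && src[1] == '*') {          /* skip comment  */
                src += 2;
                while (*src && !(src[0] == '*' && src[1] == '/'))
                    src++;
                if (*src)
                    src += 2;
            } else if (isspace((unsigned char)*src)) {     /* skip blanks   */
                src++;
            } else {
                *dst++ = *src++;
            }
        }
        *dst = '\0';
    }

    /* The change to a .h file is gratuitous if the normalized texts agree. */
    int header_change_is_gratuitous(const char *old_text, const char *new_text,
                                    char *scratch1, char *scratch2)
    {
        normalize(old_text, scratch1);
        normalize(new_text, scratch2);
        return strcmp(scratch1, scratch2) == 0;
    }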

GRATUITOUS (ALL FILES)

GRATUITOUS (ALL FILES) is a partial predicate that detects gratuitous changes in both .c and .h files. It simply adds to the compilations suppressed by GRATUITOUS (HEADERS) those that result from null revisions, changes in comments or changes in whitespace made to .c files. It is identical with GRATUITOUS (HEADERS) in its treatment of changes to .h files.

Treating GRATUITOUS (HEADERS) and GRATUITOUS (ALL FILES) separately makes it possible to differentiate between gratuitous changes to .h and to .c files. GRATUITOUS (ALL FILES) will be incrementally better than GRATUITOUS (HEADERS) only if the incidence of gratuitous changes to .c files is high. However, the impact of a gratuitous change to a .h file can be significantly greater than the impact of a gratuitous change to a .c file. In the former case, every client of the .h file is spared recompilation (if no other file has changed); in the latter, only one .c file is spared recompilation.

Compared with MAKE, GRATUITOUS (ALL FILES) reflects the compilation costs associated with all gratuitous changes. Compared with weaker predicates (NAME-USE and DEMI-ORACLE, discussed subsequently), it represents the costs that would be incurred by conventional manufacturing technology if no purely gratuitous changes were ever made.

NAME-USE

NAME-USE determines what files need to be recompiled based on what names (that is, definitions) have changed and whether those names are used. It requires the recompilation of a changed .c file only if the change affects a name potentially referenced somewhere in the executable system.³ It requires the recompilation of a .c file that includes a changed .h file only if the change to the .h file affects a name that is used by the .c file. For changes to .h files NAME-USE is equivalent to smart recompilation.

³In practice this kind of change proved not to be important; it spared a handful of compilations based on changes to a single .c file that was linked in the crostic executable but never referenced.

NAME-USE is appealing as a strategy for software manufacture since it changes the basis upon which recompilation decisions are made from files to individual definitions. While it can be computed solely based on cross-reference information, NAME-USE can be costly both to implement and to evaluate. It requires that each changed file be evaluated in each context in which it is used to determine what definitions have changed and whether a changed definition is used. Naive implementations may require that each compilation unit potentially affected by a change be parsed and processed independently. More sophisticated implementations might use incremental techniques to update a common database. Both techniques require that enough information be retained, in some form, about previous configurations to decide what names have changed.

The predicate NAME-USE is effective in reducing the number of compilations beyond GRATUITOUS if the typical set of names changed in a .h file does not affect all clients of the .h file. However, NAME-USE is considerably more expensive to evaluate than the other predicates described so far. It is cost-effective only if the number of compilations suppressed compensates for this evaluation cost.

NAME-USE is the simplest predicate based on cross-reference information. In the Descartes change history study, it differentiates between compilations that are redundant simply because a changed name is not used and those that are redundant for more subtle reasons.
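The decision NAME-USE makes for a client of a changed .h file can be sketched as a lookup in a cross-reference database; the representation below is a hypothetical stand-in for whatever an implementation would actually maintain.

    #include <string.h>

    /* One cross-reference entry: compilation unit `unit' uses name `name'. */
    struct xref {
        const char *name;
        const char *unit;
    };

    /* Recompile `unit' only if one of the names that actually changed in   */
    /* an included .h file appears among the names the unit uses.           */
    int name_use_must_recompile(const char *unit,
                                const char *const changed[], int n_changed,
                                const struct xref db[], int n_db)
    {
        int i, j;
        for (i = 0; i < n_db; i++) {
            if (strcmp(db[i].unit, unit) != 0)
                continue;
            for (j = 0; j < n_changed; j++)
                if (strcmp(db[i].name, changed[j]) == 0)
                    return 1;            /* a changed name is used          */
        }
        return 0;                        /* no changed name is used         */
    }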

DEMI-ORACLE

DEMI-ORACLE behaves identically to NAME-USE, except it considers how a changed definition is used before making the decision to recompile a .c file. It uses advanced static analysis techniques to recompile only those .c files whose behavior might change as a result of a change to a definition in a .h file. For example, DEMI-ORACLE might discover that changing the value of a constant does not change the sense of a test in which the constant is used.

DEMI-ORACLE is appealing as a predicate because it represents a practical limit in suppressing redundant compilations. However, it requires the full static analysis of each compilation unit potentially affected by a change. Since most compilers do not perform such extensive static analysis, DEMI-ORACLE may cost more to evaluate than the compilation steps it suppresses. Incremental techniques may alleviate this problem. However, if a compiler were designed to prevent inconsequential changes to a source file from perturbing the generated object file, it may be cost effective to use a cutoff strategy, simply performing the compilation and then comparing object files, especially if several additional manufacturing steps are contingent on the results of the compilation.

Short of a true oracle, DEMI-ORACLE represents a practical lower bound on recompilation costs against which we can evaluate the performance of the other predicates.

OPTIMIST

The predicate OPTIMIST is an approximate partial predicate that attempts to combine the advantages of GRATUITOUS (HEADERS) with the discrimination of the name-based predicates. It behaves identically with GRATUITOUS (HEADERS), except it permits definitions to be added to or deleted from .h files without triggering the compilation of their clients. Any changes to an existing definition are considered to be important.

OPTIMIST is based on the assumption that programmers do not make inappropriate additions or deletions to .h files: a careful programmer will delete only unused definitions, and new definitions can be ignored until new references are also added. OPTIMIST is inherently unreliable for two reasons. The first reason applies in general to interface changes in any language: programmers are not always careful; the addition or deletion of a definition from a .h file may affect some client of the file. The second reason is specific to C: because of the preprocessor, it may not be possible to correctly determine what definitions have been added or deleted from a .h file independently of the context in which the file is used.

Because Descartes did not exploit the C preprocessor to redefine reserved words and because all the code in the Descartes .h files represents complete type definitions, macro definitions, or procedure or variable declarations, the second problem was not an issue in the Descartes change history study.

The cost of implementing and evaluating OPTIMIST depends on being able to detect additions and deletions in isolated .h files. Its effectiveness depends on how often such additions and deletions are harmless. Its role in the study is to explore the behavior of a predicate that might give a wrong answer. At issue is how OPTIMIST performs relative to GRATUITOUS (HEADERS) and how often it fails to recompile a file recompiled by DEMI-ORACLE.
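Assuming the definitions in the old and new versions of a .h file can be extracted at all (the C-specific difficulty noted above), the comparison OPTIMIST needs can be sketched as follows; the definition record is hypothetical.

    #include <string.h>

    /* One definition extracted from a version of a .h file.                */
    struct definition {
        const char *name;   /* the name the definition introduces           */
        const char *text;   /* its full text, comments and blanks stripped  */
    };

    /* OPTIMIST lets the change pass without recompiling the header's       */
    /* clients as long as every name defined in both versions has identical */
    /* text; names present in only one version (additions or deletions)     */
    /* are ignored.                                                         */
    int optimist_suppresses(const struct definition old_defs[], int n_old,
                            const struct definition new_defs[], int n_new)
    {
        int i, j;
        for (i = 0; i < n_old; i++)
            for (j = 0; j < n_new; j++)
                if (strcmp(old_defs[i].name, new_defs[j].name) == 0 &&
                    strcmp(old_defs[i].text, new_defs[j].text) != 0)
                    return 0;   /* an existing definition changed           */
        return 1;               /* only additions and/or deletions          */
    }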

Predicate                 Description

BIG BANG                  BIG BANG requires that every .c file in the system be recompiled, regardless of whether it or any .h file it includes has or has not changed. Because it always requires the recompilation of every .c file, BIG BANG represents an upper bound on recompilation costs.

MAKE                      MAKE recompiles every .c file that may have changed and every .c file that includes a .h file that may have changed. A .c or .h file may have changed if its version identifier has changed. Formally we would like the version identifier to be a unique identifier; in practice it is usually a timestamp.

GRATUITOUS (HEADERS)      GRATUITOUS (HEADERS) behaves identically to MAKE, except it ignores changes to comments and whitespace in .h files.

GRATUITOUS (ALL FILES)    GRATUITOUS (ALL FILES) behaves identically to MAKE, except it ignores changes to comments and whitespace in both .c and .h files. GRATUITOUS (ALL FILES) represents what conventional manufacturing strategies would do in the absence of purely gratuitous changes.

NAME-USE                  NAME-USE determines what files need to be recompiled based on what names have changed and whether those names are used. NAME-USE requires the recompilation of a .c file that includes a changed .h file only if the change affects a name used by the .c file. It requires the recompilation of a changed .c file only if the change affects a name that might be used in the executable system.

DEMI-ORACLE               DEMI-ORACLE behaves identically to NAME-USE, except it considers how a changed definition is used before making the decision to recompile a .c file. It uses static analysis techniques to recompile only those .c files whose behavior might change as a result of a change to a definition in a .h file. Short of a true oracle, DEMI-ORACLE represents a practical lower bound on recompilation costs.

OPTIMIST                  OPTIMIST behaves identically to GRATUITOUS (HEADERS), except it permits definitions to be added to or deleted from a .h file without requiring the recompilation of its clients.

Table 3.3: Summary of Seven Predicates

The Relationships between Predicates

Of the seven predicates analyzed, BIG BANG and DEMI-ORACLE represent upper and lower bounds on the amount of recompilation necessary to return a system to consistency after a change. One can do no worse in rebuilding a system than BIG BANG; it requires that everything be recompiled, always. On the other hand, while a true oracle might improve on DEMI-ORACLE, the latter represents the best that can be done in practice using state-of-the-art compilation technology.

All the predicates, except BIG BANG, are selective in the steps that they perform. Each requires the recompilation of only those .c files that appear to have changed or that appear to be affected by a change to a .h file. They simply differ in their ability to discriminate between actual and apparent changes and between actual and apparent dependencies. This ability increases monotonically in the four predicates MAKE, GRATUITOUS (HEADERS), GRATUITOUS (ALL FILES) and NAME-USE. The amount of recompilation required by DEMI-ORACLE is never greater than that required by NAME-USE, that required by NAME-USE is never greater than that required by GRATUITOUS (ALL FILES), and so on. Thus every file recompiled by DEMI-ORACLE will be recompiled by NAME-USE and by each of the four stronger predicates; BIG BANG will recompile files that may not be recompiled by MAKE or by any of the four weaker predicates.

  big bang --> make --> gratuitous --> gratuitous  --> name-use --> demi-oracle
                        (headers)      (all files)
                            |
                            +--> optimist

Figure 3.2: The Relative Strength of the Seven Predicates

The relative strength of the predicates is shown in Figure 3.2. The ability to discriminate between actual and apparent changes is flawed in the seventh predicate, OPTIMIST, which is shown as a side branch in the figure. It sometimes mistakes a significant change for an insignificant one. The amount of recompilation required by OPTIMIST is never greater than that required by GRATUITOUS (HEADERS), but it sometimes fails to recompile files recompiled by DEMI-ORACLE.

3.3 The Descartes Project

The Descartes software defines a set of abstract data types that together implement the application-independent parts of an interactive user interface. One of the applications built to exercise this interface is a crossword puzzle game, crostic.

Counting comments and blank lines, the most recent configuration of the crostic program contains some 12,000 lines of code, of which roughly 10,000 represent the generic Descartes software. The latter is divided among 23 .c and 25 .h files, which crostic augments to 26 .c and 28 .h files. Crostic also references 6 .h files from the UNIX library, which were presumed not to change.

Each of crostic's 26 .c files represents a compilation unit. In the most recent configuration, these units together contain roughly 39,000 lines of code. (On average each Descartes .h file is represented 12 times in this number.) The average size of a compilation unit is 1500 lines of code.

3.3.1 Why Descartes?

I chose to study the Descartes system for several reasons. Most important was the availability of change data. Because the Descartes source was maintained under RCS, a complete record of its development was available. While this record is not sufficient to recreate every compilation performed on the system, it at least represents an authentic distribution of changes. Furthermore, because the system was no longer under development at the time of the study, there was no chance that the study would interfere with how the Descartes programmers chose to make those changes. Secondly, because I was one of the Descartes programmers and because the system is relatively small, I could hope to master it thoroughly. This was important because it was necessary to analyze the Descartes development record and apply the more complicated recompilation tests manually (see Section 3.4.3). Finally, although C does not explicitly support data abstraction, the programming conventions used by the Descartes developers required that every .c file have a corresponding .h file that, to the extent possible, defines its visible interface. This discipline made it easier to perform the analysis and helps extrapolate the Descartes results to systems developed in languages that do support data abstraction.

The remainder of this section presents material relevant to understanding and interpreting the results of the study.

3.3.2 Descartes Programming Conventions

Although C does not support data abstraction, Descartes programming conventions required that every .c file have a corresponding .h file defining all exported types and preprocessor constants (including macros), and declaring all global variables and procedures. In addition, any procedure or variable not exported from the defining .c file was declared static so as not to be visible outside the defining .c file. Thus, to the extent possible in C, a Descartes .c and .h file pair correspond to a module implementation and its visible interface; clients of a module import its interface by including the .h file. A notable deficiency in this correspondence is the inability to declare procedure parameters in .h files.

The problem with this approach is the need to manage all the .h files without language support. Even though Descartes is a relatively small system, keeping track of the dependencies between 28 .h files was not easy. Moreover, applications built on top of Descartes, like crostic, should not need to know the names of and relationships between lower level Descartes .h files, even if the client has access to higher level interfaces.

To solve this problem, Descartes introduced a single umbrella .h file, descartes.h, that included each of the remaining Descartes .h files in an order consistent with the dependencies between them. All that a Descartes or a client module needed to do then was to include descartes.h.
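The umbrella file would have looked roughly like the fragment below. Two of the interface names (image.h, comptools.h) appear in the change history of Table 3.6; base.h and the particular dependency order shown are invented for illustration.

    /* descartes.h: umbrella interface for the Descartes software.          */
    /* Each Descartes .h file is included exactly once, in an order         */
    /* consistent with the dependencies among the interfaces, so that a     */
    /* Descartes or client module need only include this one file.          */
    #include "base.h"          /* lowest-level definitions (invented name)  */
    #include "image.h"         /* may depend on base.h                      */
    #include "comptools.h"     /* may depend on base.h and image.h          */
    /* ... the remaining Descartes .h files, in dependency order ...        */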

While this technique is convenient for the programmer, not every .c file that includes descartes.h uses every .h file in it. In addition, because descartes.h hides which .h files are actually included, some .c files include some .h files more than once. As long as the duplicated .h file contains only external declarations and preprocessor macro definitions, the compilation unit compiles successfully and the duplication is not detected.

Each unused or duplicated .h file is superfluous. Each superfluous .h file inflates recompilation costs by increasing the number of client units that apparently need to be recompiled when it changes and by increasing the size of those units. In the most recent configuration of the crostic client, of the 434 .h files included across all the .c files, only 264 are actually used.

3.3.3 The Descartes Change History

The 54 RCS files (26 .c and 28 .h files) that make up the revision history of the Descartes software and its crostic client together contain a total of 435 revisions. These revisions were made by six programmers over the course of one year (between December 21, 1983 and December 5, 1984). I analyzed the 190 most recent revisions in detail, beginning with the initial revisions of the crostic files.

                                                    All Revisions   Revisions Analyzed
  number of revisions:                                   435               190
  number of initial revisions:                            54                10
  number of changes:                                     381               180
  average number of changes per file:                      7                 3
  number of .h file revisions:                           181                77
    percent of all revisions:                             40%               41%
  average number of changes per .h file:                   5                 3
  number of null changes to .h files:                     17                11
    null changes as percent of .h file changes:           11%               14%
  average number of lines added or deleted:               14                13
  median number of lines added or deleted:                 5                 7
  number of .c file revisions:                           254               113
    percent of all revisions:                             60%               59%
  average number of changes per .c file:                   9                 4
  number of null changes to .c files:                      0                 0
  average number of lines added or deleted:               61                56
  median number of lines added or deleted:                22                17

Table 3.4: Summary of the Descartes Change History

Table 3.4 characterizes the complete Descartes change history and the subset of changes analyzed. Initial revisions represent the first introduction of a file to RCS; subsequent revisions are changes. The size of a change is measured in lines added plus lines deleted; a single changed line constitutes one line added and one line deleted. The average size of a change to a .h file is about one-quarter the average size of a change to a .c file.

Approximately 40% of all revisions and 41% of the revisions analyzed represent changes to interfaces (.h files). This is noteworthy since interface changes tend to favor the weaker predicates. For example, when a .c file changes, a more selective predicate can save at most one recompilation over a less selective predicate. When a .h file changes, each client of the .h file is a potential recompilation saved.

Equally notable is the high number of null revisions of .h files. A null revision is one in which a file is checked out and checked in again, but no code is changed. While there were no null revisions of .c files, fully 14% of the .h file revisions analyzed were null revisions. This is because Descartes programmers often checked out .c and .h files in pairs and then checked them in again without having changed the .h file. The incidence of null revisions is one reason that the predicate GRATUITOUS (HEADERS) was selected for the study.

3.4 The Method

To compare predicate performance, I used the Descartes change history to form a sequence of manufacturing graph schemas and then computed the cost of instantiating each schema in turn using each predicate. I started with an initial configuration built from the initial revisions of the 6 crostic client files and the most recent contemporary revisions of each of the remaining 48 Descartes generic files.⁴ I then selected appropriate groups of revisions, based on when they were made and by whom, to reconstruct a sequence of schemas for recompilation. Section 3.4.1 explains the criteria used to form these groups.

I simulated the recompilation decisions made by each predicate in response to each manufacturing graph schema, keeping track of the number of compilation units as well as the number of lines of code recompiled by each predicate. Section 3.4.2 explains my choice of metric. Section 3.4.3 describes the data I collected for each predicate, and briefly discusses why I chose to simulate the predicates manually rather than build a selective recompilation tool for C.

3.4.1 From Change History to Configurations

Because the Descartes project used RCS, it left behind a record of all the changes that project members committed as RCS revisions, ordered in time. This record does not contain enough information to recreate every compilation done by each individual project member or by the project as a whole. One problem is that programmers typically update an RCS library only after they are reasonably satisfied that the changes they have made are correct. More often than not, this requires several iterations as a programmer makes coding errors and corrects them. The programmer may even defer check-in until he has made and tested a group of related changes. Another problem is that while revisions are totally ordered in time, there is no explicit information in the RCS record to indicate which files were changed together and which were meant to be recompiled at the same time.

⁴There were only 42 Descartes files at the time; 6 additional files were created in the development interval analyzed.

While it is impossible to reconstruct the intermediate changes made between RCS revisions, it is possible to use information maintained by RCS to group the revisions that were likely to have been submitted for recompilation together. The information available for grouping revisions includes when each revision was checked in and by whom. It does not include when a revision was checked out, so there is no way to reconstruct which files were checked out at the same time. The criteria used to group revisions should be objective and uniformly applicable to all revisions. The following three criteria satisfy this condition:

1. Each group may include at most 1 revision of each file.

2. All the revisions in a group must have been made by the same programmer.

3. All the revisions in a group must have been checked in consecutively within a short time of one another.

The second criterion is based on the assumption that programmers usually work independently and when two or more programmers work together, only one checks out a working set of files. This is consistent with the way Descartes was developed. The third criterion is based on the assumption that once a programmer is satisfied with a change that affects a set of files, he checks in all the files together. Of course, the set of files checked in might constitute only a subset of the files checked out, but there is no way to discover if this is the case.

The remaining problem in forming groups of revisions is to establish a fixed threshold for the third criterion. As long as the first two criteria are satisfied, any two consecutive revisions checked in within the allotted time are put in the same group; any two revisions whose check-in times differ by more than this amount go into separate groups. The size of the threshold is important because it affects the number of revisions in a group, which in turn can affect the relative performance of the predicates.

Larger groups tend to favor the stronger predicates. The more revisions there are in a group, the more likely it is that any given file is tangibly affected by some change, and the more likely it is that a weaker predicate makes the same decision as a stronger predicate. When groups are smaller, there is more opportunity for contrast between predicate performance. Consider two simple examples:

• Suppose a .h file and one of its client .c files were each changed. If the two files were in separate groups, MAKE would recompile the .c file twice; it might not be recompiled at all by a weaker predicate. If the two changes were in the same group, MAKE would only recompile the .c file once.

• Suppose two .h files were changed. If the files were in separate groups, MAKE would recompile the clients of both files twice. If the files were in the same group MAKE would only recompile each client once. Some clients of either .h file might not be recompiled at all by a weaker predicate.

I chose a threshold of five minutes between revisions. While this number is arbitrary, five minutes is more than sufficient for a programmer to compose a short log message⁵ and for RCS to process the change, even under loaded conditions. Descartes programmers were able to check in multiple files on the same command line.

If the threshold had been increased to 10 minutes, there would have been no significant differences in the groups formed.
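The grouping procedure itself is straightforward once the three criteria and the threshold are fixed. The sketch below assumes revisions sorted by check-in time and uses invented field names; it is an illustration of the criteria, not the procedure actually used for the study.

    #include <string.h>

    #define THRESHOLD (5 * 60)        /* five minutes, in seconds           */

    struct revision {
        const char *file;             /* e.g. "comptools.c"                 */
        const char *author;           /* programmer who checked it in       */
        long        checkin_time;     /* check-in time, in seconds          */
    };

    /* Assign a group number to each revision.  rev[] must be sorted by     */
    /* check-in time.  A new group starts when the programmer changes,      */
    /* when more than five minutes separate consecutive check-ins, or when  */
    /* the group would otherwise contain two revisions of the same file.    */
    void group_revisions(const struct revision rev[], int n, int group[])
    {
        int i, j, g = 0, start = 0;
        for (i = 0; i < n; i++) {
            int split = 0;
            if (i > 0) {
                split = strcmp(rev[i].author, rev[i - 1].author) != 0 ||
                        rev[i].checkin_time - rev[i - 1].checkin_time > THRESHOLD;
                for (j = start; j < i && !split; j++)
                    if (strcmp(rev[j].file, rev[i].file) == 0)
                        split = 1;            /* criterion 1                */
                if (split) { g++; start = i; }
            }
            group[i] = g;
        }
    }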

                                                          All Revisions   Revisions Analyzed
  number of revision groups:                                   164                68
  average number of revisions per group:                       2.7               2.8
  percent of groups with 1 revision:                           46%               49%
  percent of groups with 1 or 2 revisions:                     72%               71%
  percent of groups with more than 2 revisions:                28%               29%
  average number of .c file revisions per group:               1.5               1.7
  percent of groups with no .c file revisions:                 20%               19%
  percent of groups with at most 1 .c file revision:           70%               68%
  percent of groups with more than 1 .c file revision:         30%               32%
  average number of .h file revisions per group:               1.1               1.1
  percent of groups with no .h file revisions:                 37%               43%
  percent of groups with at most 1 .h file revision:           83%               82%
  percent of groups with more than 1 .h file revision:         17%               18%

Table 3.5: Summary of Descartes Revision Groups

Using the three criteria described above, I partitioned the 435 revisions of the Descartes change history into 164 groups and analyzed 68. These groups are characterized in Table 3.5. On average each analyzed group contains 2.8 revisions. Some 19% of these groups have no .c file revisions while 43% have no .h file revisions.

⁵The interface to RCS used by Descartes programmers required that the programmer describe the changes he had made. Only minimal intraline editing was possible when composing these descriptions and most consisted of a single line. The longest description was 6 lines and it occurred at the boundary between two change groups separated because they both contained the same revision. The amount of time separating the revision in question from its predecessor was 4 minutes.

3.4.2 Measuring Compilation Costs

Comparing predicate performance requires a metric that consistently and fairly represents relative recompilation costs. There are several measures that could have been used. One alternative was simply to count the number of modules recompiled by each predicate. Another alternative was to time the compilations that had to be performed.

While the former is appealing since it is direct and easy to understand, it has the disadvantage that it gives all compilation steps equal weight, when clearly some are less expensive than others. The latter has the disadvantage that it is tied to particular hardware and depends on the questionable accuracy of timing tools.⁶

Thus although I recorded the number of modules compiled for gross comparisons between predicates, I also measured the size, in lines of code, of each compilation unit that was recompiled. The advantage of this measure is that it is not machine dependent yet it still recognizes that different manufacturing steps will contribute differently to the overall cost of manufacture. Furthermore, the measure is credible since compiler performance is typically measured in lines compiled per minute.

Because of the high incidence of .h files that are included but not used in the Descartes software, I tracked predicate performance both for the crostic client as written and for the program stripped of those extraneous .h files.

3.4.3 The Data Collected

Before a manufacturing graph schema can be instantiated, it is necessary to select versions of each of its primitive components. Each group of revisions described above represents the differences in the versions selected for two consecutive schemas. Assuming that the first schema had been instantiated successfully from scratch, I computed the cost of instantiating the second schema by simulating each predicate to decide which files had to be recompiled.

This process is clarified by a specific example.

Table 3.6 shows the summary data reported by RCS for the revisions in 4 of the groups analyzed (Groups 5 through 9).

⁶Wendorf [67, Appendix A] describes some inherent problems with the timing facilities of UNIX systems.

  Timestamp            Filename       Revision No.   Lines Added/Deleted

  Revision Group 5:
  84/09/25 16:59:33    image.h            1.5               1/1
  84/09/25 16:58:24    image.c            1.10            111/6

  Revision Group 6:
  84/09/25 17:24:43    comptools.h        1.4              10/1
  84/09/25 17:23:37    comptools.c        1.4              73/2

  Revision Group 7:
  84/09/27 10:12:13    chario.c           1.10              7/4

  Revision Group 8:
  84/09/27 10:30:31    strio.c            1.12              8/5
  84/09/27 10:27:14    intio.c            1.13              7/4
  84/09/27 10:24:01    floatio.c          1.12              7/4

  Revision Group 9:
  84/09/27 10:43:09    strio.c            1.13              5/2
  84/09/27 10:39:17    getstring.c        1.6               5/2

Table 3.6: Four Revision Groups from the Change History of the Crostic Client

To analyze Group 6, for example, I assumed that a graph for the configuration including the revisions in Group 5 had already been instantiated. I then formed a new schema by substituting revisions 1.4 of comptools.h and comptools.c for the previous revisions (revisions 1.3) of the same files in the already instantiated graph. Finally, I counted the files and the lines of code that each predicate would recompile when instantiating the new schema. Here MAKE would require the recompilation of comptools.c and each of the clients of comptools.h. I assumed that all such compilations would be successful.

To compute manufacturing costs, I wrote a small program to keep track of the following information:

1. the revision of each .c and .h file used in each configuration;

2. the size of each revision;

3. the .h files included by each revision of each .c file, and whether each .h file is used; and finally

4. the .c files recompiled by each of the seven predicates.

While the first three items were sufficient to compute the recompilation costs attributable to BIG BANG and MAKE, I supplied the list of files recompiled by each of the other predicates as interactive input to the program.
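The program itself is not reproduced in the thesis; the following sketch shows data structures sufficient for the first three items and, as an example, the cost computation for MAKE. The names and the fixed array bound are invented.

    #define MAX_INCLUDES 64

    /* Per-unit record for one configuration (items 1 through 3).          */
    struct unit {
        const char *c_file;               /* the .c file and its revision   */
        const char *c_revision;
        int         c_lines;              /* size of that revision          */
        int         c_changed;            /* did the .c revision change?    */
        int         n_includes;
        struct {
            const char *h_file;           /* an included .h file, its       */
            const char *h_revision;       /* revision and size              */
            int         h_lines;
            int         h_changed;        /* did the .h revision change?    */
            int         used;             /* is the .h file actually used?  */
        } inc[MAX_INCLUDES];
    };

    /* Lines recompiled by MAKE for one configuration: a unit is recompiled */
    /* if its .c file or any included .h file changed, and its cost is the  */
    /* size of the .c file plus the sizes of all included .h files.         */
    long make_lines_compiled(const struct unit u[], int n_units)
    {
        long total = 0;
        int i, j;
        for (i = 0; i < n_units; i++) {
            int recompile = u[i].c_changed;
            int size = u[i].c_lines;
            for (j = 0; j < u[i].n_includes; j++) {
                size += u[i].inc[j].h_lines;
                if (u[i].inc[j].h_changed)
                    recompile = 1;
            }
            if (recompile)
                total += size;
        }
        return total;
    }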

The results of this analysis are shown in Section 3.5.

Simulating Predicate Decisions

Instead of building a multipurpose selective recompilation tool, I chose to consult file comparisons and cross-reference listings to simulate the performance of the five predicates weaker than MAKE. Although such a tool would be useful, it would be costly to implement. The extent to which that cost is justified ultimately depends on the results of the very study to which the tool would be applied.

A multipurpose selective recompilation tool for C would undoubtedly produce more accurate results in larger volume than a manually conducted study. However, idiosyncrasies of C make it difficult to implement predicates that would be trivial to implement for other languages. For instance:

• Depending on the macro definition, if one .h file uses a macro defined in another .h file, it may be impossible to parse the .h file outside the context in which it is used.

• Available grammars for C are designed to manipulate the already preprocessed language, and cannot be used to identify and process changes to preprocessor macros.

In addition, building a sophisticated static analyzer for any language is a major undertaking. While these difficulties can certainly be surmounted (especially when the ultimate performance of the predicate is not at issue), it is not clear that the cost is justified; it is also not clear that having the tool would produce substantially better information. It is clear, however, that building the tool would raise a set of implementation issues particular to C that may be important but are not directly relevant to the study.

3.5 The Results of the Study

While tables quantifying each predicate's performance are presented later in this section, the relative performance of the seven predicates is best appreciated graphically. Figures 3.3(a), 3.3(b) and 3.3(c) summarize the performance of each predicate relative to BIG BANG. Each figure compares both the number of lines compiled and the number of files compiled by each predicate, contrasting results for the crostic client as written (fat) and for the program stripped of extraneous include files (lean). Figure 3.3(a) is based on data representing all 68 configurations analyzed, Figure 3.3(b) shows the results for the 39 configurations in which at least one .h file changed, and Figure 3.3(c) shows the results for the 29 configurations without any .h file changes.

The main conclusions of the study are readily apparent from these three figures:

[Three bar charts, (a) all configurations, (b) configurations with interface changes, and (c) configurations without interface changes, each showing, for the fat and lean versions of the system, the percentage of lines compiled and of units compiled by big bang, make, gratuitous, optimist, name-use and demi-oracle relative to BIG BANG.]

Figure 3.3: Predicate Performance Relative to BIG BANG

• Comparing the three figures shows a marked difference in relative predicate performance for configurations with and without changes to .h files. Changes to interfaces (.h files) clearly require different manufacturing strategies than changes to implementations (.c files).

• It is apparent in each figure that superfluous .h files have a greater effect on the number of lines compiled than on the number of files compiled. The differences in fat and lean values for lines of code in Figure 3.3(b) indicate that the weaker predicates to some extent mitigate this effect.

• Figure 3.3(c) shows that for changes to implementations there is no point being more discriminating than MAKE. Even the weakest predicate, DEMI-ORACLE, hardly does better.

• Figure 3.3(b) shows that for changes involving interfaces, DEMI-ORACLE is not significantly better than NAME-USE. There is little to justify investing in the additional complexity of the weaker predicate.

• Figure 3.3(b) also shows that despite the high incidence of null interface revisions in the Descartes change history, the overall impact of gratuitous changes is not significant.

• Finally, although OPTIMIST would seem to present an intermediate option between MAKE and NAME-USE for changes involving interfaces, 10% of its decisions not to recompile were wrong. This may be an artifact of the method used to group RCS revisions.

The remainder of this section looks at the relative performance of the predicates in more detail. This discussion is based on four views of the comparative recompilation data shown in four sets of tables. As in Figure 3.3, the tables present separate results for all 68 configurations, for the 39 configurations with changes to .h files and for the 29 configurations without such changes; based on both the number of lines and the number of files compiled by each predicate; for the Descartes software as written and minus its superfluously included .h files.

• Tables 3.7(a) and (b) show the average cost of compilation per configuration using each predicate.

• Tables 3.8(a) and (b) give data on the cumulative performance of the predicates, showing the total number of lines and the total number of files compiled by each predicate. All the data in the other tables and figures is based on these totals.

Table 3.8 includes the ratios between the lean and fat costs for each predicate. These numbers quantify the profound effect of superfluous .h files on the cost of compiling the crostic client.

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               38.2     15.6      14.7         14.6         5.4         4.7         10.4
    lean:              29.2      8.6       8.1          8.0         4.2         3.6          5.7
  39 configurations with interface changes
    fat:               38.4     25.0      23.5         23.5         7.7         6.4         16.0
    lean:              29.3     13.4      12.4         12.4         5.9         4.9          8.4
  29 configurations without interface changes
    fat:               37.9      2.8       2.8          2.6         2.3         2.3          2.8
    lean:              29.0      2.2       2.2          2.0         1.8         1.8          2.2

  (a) Average number of lines compiled (in thousands) per configuration.

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:                25        8        8            8            3           3            6
    lean:               24        6        6            6            3           3            4
  39 configurations with interface changes
    fat:                25       13       13           13            4           3            9
    lean:               24       10        9            9            4           3            6
  29 configurations without interface changes
    fat:                25        2        2            1            1           1            2
    lean:               24        2        2            1            1           1            2

  (b) Average number of files compiled per configuration.

Table 3.7: Average Predicate Performance

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               2595     1058       997          991         368         318          705
    lean:              1984      585       547          543         283         244          391
    lean / fat:         .76      .55       .55          .55         .77         .77          .55
  39 configurations with interface changes
    fat:               1496      977       917          917         301         250          624
    lean:              1144      521       484          484         230         190          327
    lean / fat:         .76      .53       .53          .53         .76         .76          .52
  29 configurations without interface changes
    fat:               1099       81        81           75          67          67           81
    lean:               840       64        64           59          54          54           64
    lean / fat:         .76      .79       .79          .79         .79         .79          .79

  (a) Total number of lines compiled (in thousands).

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               1699      572       541          538         199         171          381
    lean:              1631      416       390          387         199         171          278
    lean / fat:         .96      .73       .72          .72        1.00        1.00          .73
  39 configurations with interface changes
    fat:                979      526       495          495         160         132          335
    lean:               940      371       345          345         160         132          233
    lean / fat:         .96      .71       .70          .70        1.00        1.00          .70
  29 configurations without interface changes
    fat:                720       46        46           43          39          39           46
    lean:               691       45        45           42          39          39           45
    lean / fat:         .96      .98       .98          .98        1.00        1.00          .98

  (b) Total number of files compiled.

Table 3.8: Cumulative Predicate Performance

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat / fat:         1.00      .41       .38          .38         .14         .12          .27
    lean / fat:         .76      .23       .21          .21         .11         .09          .15
    lean / lean:       1.00      .29       .28          .27         .14         .12          .20
  39 configurations with interface changes
    fat / fat:         1.00      .65       .61          .61         .20         .17          .42
    lean / fat:         .76      .35       .32          .32         .15         .13          .22
    lean / lean:       1.00      .46       .42          .42         .20         .17          .29
  29 configurations without interface changes
    fat / fat:         1.00      .07       .07          .07         .06         .06          .07
    lean / fat:         .76      .06       .06          .05         .05         .05          .06
    lean / lean:       1.00      .08       .08          .07         .06         .06          .08

  (a) Ratio of lines compiled.

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat / fat:         1.00      .34       .32          .32         .12         .10          .22
    lean / fat:         .96      .24       .23          .23         .12         .10          .16
    lean / lean:       1.00      .26       .24          .24         .12         .10          .17
  39 configurations with interface changes
    fat / fat:         1.00      .54       .51          .51         .16         .13          .34
    lean / fat:         .96      .38       .35          .35         .16         .13          .24
    lean / lean:       1.00      .39       .37          .37         .17         .14          .25
  29 configurations without interface changes
    fat / fat:         1.00      .06       .06          .06         .05         .05          .06
    lean / fat:         .96      .06       .06          .06         .05         .05          .06
    lean / lean:       1.00      .07       .07          .06         .06         .06          .07

  (b) Ratio of files compiled.

Table 3.9: Predicate Performance Relative to BIG BANG

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               1527     1849      1844         1842        1850        1858         1850
    lean:              1217     1405      1404         1402        1422        1424         1405
  39 configurations with interface changes
    fat:               1528     1857      1852         1852        1880        1896         1863
    lean:              1217     1405      1403         1403        1435        1440         1404
  29 configurations without interface changes
    fat:               1526     1758      1758         1736        1729        1729         1758
    lean:              1216     1412      1412         1397        1372        1372         1412

  Number of lines per compiled file.

Table 3.10: Average Size of the Files Compiled by each Predicate


• Tables 3.9(a) and (b) compare each of the seven predicates with BIG BANG. The data in these tables quantify the observations about predicate performance based on Figure 3.3.

• Table 3.10 shows the average size of each file compiled by each predicate. This data is interesting because it indicates some relationship between file size and change: not surprisingly, larger files are more likely to be affected by a change than smaller files.

Because the relationship between fat and lean pervades the results, the effect of superfluously included .h files will be considered first.

3.5.1 The Effect of Superfluously Included .h Files

The effect of .h files that are included but not used is apparent when counting files compiled, especially when interfaces change, but it is dramatic when counting lines compiled, except when interfaces do not change.

As explained in Section 3.3.2, superfluously included .h files inflate recompilation costs both by increasing the number of units that apparently need to be recompiled and by increasing the size of those units. Comparing the ratios of lean to fat costs for different predicates makes it possible to distinguish the effect of each of these factors.

In Table 3.8(a), the differences between the fat and lean numbers for the predicates BIG BANG, NAME-USE and DEMI-ORACLE are due entirely to the inflated size of the compilation units that include unused .h files. The same is true of the numbers for all predicates limited to the 29 configurations without interface changes. In each of these cases, unused or unchanged .h files do not influence which .c files are compiled. BIG BANG always recompiles every file while NAME-USE and DEMI-ORACLE never recompile a file because of a change to a .h file that is not used. While the small differences in the ratios of lean to fat sometimes are due to the different populations of files compiled by each predicate, the effect of superfluous includes on the size of the system as a whole is captured in the ratio for BIG BANG. This ratio shows that 24% of the size of the crostic compilation units is due to .h files that are included but not used.

Since module size is not reflected in the data representing the number of files compiled, the ratio of lean to fat approaches 1.00 for BIG BANG, NAME-USE and DEMI-ORACLE in Table 3.8(b). (Some values are less than 1.00 because the fat version of the system contains a .c file that is not used in the crostic client.) The differences between the fat and lean numbers for the predicates MAKE, GRATUITOUS and OPTIMIST in the same table are due entirely to the inflated number of compilations triggered by unused .h files. These ratios show that approximately 30% of the compilations of the Descartes crostic client files were triggered by unused .h files.

The differences between the fat and lean numbers for MAKE, GRATUITOUS and OPTIMIST in Table 3.8(a) show the cumulative effect of superfluous .h files on both the number and size of compilation units. Simply removing unused .h files would reduce the cost of compiling the crostic client by a full 45%.

3.5.2 The Relationship Between the Predicates

Tables 3.9(a) and (b) compare each of the seven predicates with BIG BANG. The data in these tables quantify the observations made earlier about the relationships between predicates. Comparisons of fat values to fat values indicate the relative performance of the predicates on the crostic client as it was written; comparisons of lean values to lean values indicate how the predicates would perform on a system devoid of superfluous include files. Comparisons of the weaker lean predicates to the stronger fat ones demonstrate the combined effect of parsimony in the use of interfaces and in the propagation of changes.

Because the numbers associated with BIG BANG represent the repeated recompilation of the whole system, comparing the other predicates to BIG BANG yields the fraction of the system recompiled by each predicate. In particular, the ratio of MAKE to BIG BANG indicates how much of the system is potentially affected by the typical change; the ratio of DEMI-ORACLE to BIG BANG indicates how much of the system is actually affected by the typical change. Comparing one predicate to another yields the incremental difference in the amount of code or number of modules compiled by the two predicates. Thus:

• The ratios of MAKE to BIG BANG in Table 3.9(a) show that on average, 65% of the code in the crostic client is potentially affected when interfaces change and that less than 10% is affected when they do not.

• The ratios of DEMI-ORACLE to BIG BANG show that on average less than 20% of the code is actually affected when .h files change. When interfaces do not change, the amount of code actually affected by a .c file change is comparable to the amount potentially affected.

• The difference of 4% between the ratios for MAKE and GRATUITOUS when interfaces change represents the incremental value of GRATUITOUS over MAKE. Although not shown in the tables, GRATUITOUS recompiles 94% of the code compiled by MAKE when interfaces change.

• The difference of 3% between the ratios for NAME-USE and DEMI-ORACLE represents the incremental cost of NAME-USE over DEMI-ORACLE. To the extent that DEMI-ORACLE represents a lower bound on recompilation costs, NAME-USE recompiles 20% more code than necessary when interfaces change. In contrast, MAKE comes within 20% of the performance of DEMI-ORACLE when interfaces do not change. When interfaces do change, MAKE recompiles three to four times the amount of code necessary.

• Also not shown in the tables is the ratio of NAME-USE to GRATUITOUS. The lean numbers for GRATUITOUS represent the costs incurred by conventional compilation technology in a world without gratuitous changes or extraneous .h files. For the 39 configurations with interface changes, NAME-USE (lean) recompiles only 48% of the code compiled by GRATUITOUS. DEMI-ORACLE recompiles 39%.

Figure 3.4 shows the dual effect of removing superfluous .h files and replacing MAKE with NAME-USE. The numbers in the figure represent ratios of lines compiled for configurations with interface changes. For the crostic client, removing .h files that are included but unused reduces compilation costs by 47% using conventional manufacturing techniques and by 24% using NAME-USE. Using NAME-USE instead of MAKE reduces costs by 69% in the presence of extraneously included .h files and by 56% in their absence. Doing both reduces costs by 76%.


Figure 3.4: The Compound Effect of Parsimony in Interconnection and Manufacture
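The 76% figure for doing both is consistent with composing either path through the figure (rounding aside):

$$1 - (1 - 0.47)(1 - 0.56) \approx 0.77, \qquad 1 - (1 - 0.69)(1 - 0.24) \approx 0.76.$$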

3.5.3 The Size of Compiled Files

Table 3.10 lists the average number of lines per file compiled by each predicate. The data shows that larger files are compiled more frequently than smaller files, regardless of superfluous includes. The numbers for configurations without interface changes, where the average size of a file compiled by BIG BANG (fat) is approximately 1500 lines and that compiled by MAKE (also fat) is over 1700 lines, indicate that larger .c files change more frequently. The lean numbers for configurations with interface changes suggest that .c files that include more .h files (and are larger for that reason) are selected for compilation more often.

The differences in the average sizes of the files compiled by different predicates help to explain why measuring predicate performance by number of files compiled is not equivalent to measuring performance by number of lines compiled. The difference between the two measures is particularly evident when we compare the weaker predicates to BIG BANG.

3.6 Discussion

The strength of the results of the Descartes change history study invites questions about how applicable those results are to other systems, possibly written in other languages, and to changes as they happen rather than in retrospect. The next paragraphs discuss to what extent the results might depend on the crostic program and its change history.

The Influence of the Crostic Program

There are two features of the crostic program that might distort predicate performance; otherwise there is no specific indication that the crostic program is exceptional. The first feature, for which the study compensates, is the use of programming conventions that lead to a high incidence of interfaces that are included but not used. Almost 40% of the .h files included in crostic compilation units were included unnecessarily. In contrast, Kamel and Gammage report that 7% of the files imported by 1500 Protel modules were unused [32]. This is probably a more typical number.

The second feature is more of a problem. Most of the code of the crostic program consists of the application-independent Descartes user interface software; naturally, this software provides functionality that is not used by every client. It is arguable that the observed performance of the predicates is simply due to changes in this generic software that have no relevance to crostic itself. This question is addressed in Chapter 4 by comparing name use and visibility patterns in the crostic client with the same patterns in five other programs.

The most compelling argument for the broader applicability of the results of this chapter is their corroboration in the work of Adams, Weinert and Tichy. Despite differences between the two studies in program application area, program size, programming language and change history, Adams and company found the same relationship between NAME-USE and MAKE.

The Influence of its Change History

There are also questions about the influence of the experimental method on the results, particularly about the use of historic change information. Every change entered as an RCS revision represents the product of some series of intermediate changes; typically parts of the system will have been repeatedly remanufactured and tested as a result of these intermediate changes.

The relative performance of different predicates depends on the distribution and size of changes. While the RCS record may not capture the provisional changes made between revisions, it is unlikely that it would misrepresent the distribution of those changes. It is, however, likely that those provisional changes are smaller than the revision groups formed from the Descartes change history upon which the study was based. Section 3.4.1 argued that larger changes favor the stronger predicates; if this is true then the results of the study are conservative. Thus, for the intermediate changes to .h files made between RCS revisions, a greater percentage of the recompilations triggered by MAKE may be redundant. This too is addressed in Chapter 4 by considering the relationship between the number of names changed and recompilation costs.

Chapter 4

The Distribution of Names in Interfaces

The results of the previous chapter show that the name-based predicates (NAME-USE and DEMI-ORACLE) significantly outperform the file-based predicates (MAKE and GRATUITOUS) when applied to historic changes made to the Descartes program. While this is not surprising given the way C programs are organized, it is not clear whether the observed differences between the two kinds of predicates reflect some idiosyncrasy of the Descartes code and its change history, or whether comparable differences might arise from the immediate changes¹ made to other programs written in C or in other languages. To answer this question, this chapter offers a concrete explanation for the differences between the name-based and the file-based predicates and uses that explanation to compare Descartes to other programs written in C and in Ada.

For any given program and any given set of changes, the relative performance of NAME-USE and MAKE depends not only on the dynamic pattern of changes, but also on static patterns of name use relative to name visibility. When a name defined in a .h file changes, NAME-USE compiles every .c file that uses the name; MAKE compiles every .c file that uses the defining .h file -- that is, every .c file in which the name is visible. By assuming a fixed and simple distribution of changes, static name use and visibility patterns can be used to predict the relative performance of the two predicates for any given program. These predictions can then be used to compare different programs. The ratio of average name use to average name visibility is used as the basis of comparison. This metric is explained in detail in Section 4.1.

¹An immediate change is the consequence of a single iteration of the edit-compile-debug cycle. In contrast, a historic change often represents a series of immediate changes.


The conclusions of this chapter are based on cross-reference data for six programs, three (including Descartes) written in C, and three in Ada. This data is presented in Sections 4.2 and 4.3. Each of the six programs is characterized by its size and interface properties. The latter include both averages and distributions of the number of clients using each name and the number of clients in which each name is visible, as well as ratios of name use to visibility measured both in compilation units and in application lines compiled. The value of each property measured for Descartes is either comparable to or intermediate among the values measured for the five remaining programs, although in some ways Descartes is more like the Ada programs than the other C programs.

Section 4.4 compares average name use and visibility patterns for Descartes with the use and visibility patterns of the names that changed in the revisions analyzed in Chapter 3. Use and visibility patterns for groups of names are also considered. For the Descartes configurations, it appears that while grouping changes is an effective strategy for reducing relative compilation costs, the reduction may not be in proportion to the size of the group.

It is important to remember that each of the six programs examined in this chapter represents a single case study. This number is not sufficient to make correlations between program size or interconnection structure and predicate performance even though there may be reasons to believe that such correlations exist. There is, however, sufficient evidence to conclude that Descartes is not idiosyncratic in its name use and visibility patterns. This evidence suggests further that the results of the previous chapter may be conservative not only for Descartes (if immediate changes to interfaces are typically smaller than historic changes) but also for a broad class of programs written in separately compiled languages.

Related Work

Before it can process a compilation unit, the compiler for a separately compiled language must first acquire the definitions of all separately declared names that might be used in the compilation. The number of such names is bounded above by the number of symbols visible to the compilation unit and below by the number of symbols actually used. Typical compilation strategies require all visible symbols to be read before processing begins; this is costly and alternative strategies have been proposed. These strategies reduce compilation costs by limiting the number of extraneous symbols read.

Both Conradi and Wanvik, reporting on experience with the languages Mary/1 and Chill [7], and Kamel and Gammage, reporting on experience with the language Protel [32], observe that only 15% to 25% of the symbols visible to a compilation unit are actually used. This is precisely the unit ratio of use to visibility described below, which Kamel and Gammage computed on a per compilation unit basis rather than on a per program basis. The consequences of this difference in granularity are discussed in Section 4.1.3. Conradi and Wanvik give specific use/visibility ratios of .07 for 24 library modules defining types and constants and .13 for 6 application modules defining procedures. All modules are part of a 17000 line Chill system.

While neither the Conradi-Wanvik nor the Kamel-Gammage observations are as detailed as the case studies of this chapter, the values reported by both teams are consistent with the values obtained in these studies.

4.1 The Ratio of Use to Visibility

The ratio of use to visibility (abbreviated RUV) compares the average number of clients actually using the names defined in a program with the average number of clients in which those names are visible; it can be computed solely from information contained in the source text of a program. Based on two assumptions about the distribution of changes, the RUV can be used as a static measure relating the expected performance of the predicate NAME-USE to the expected performance of the predicate MAKE. The first assumption is that only one name changes at a time; the second is that all names are equally likely to change.

The value of the RUV is always between 0 and 1. (A name can only be used where it is visible.) To the extent that the two assumptions hold, NAME-USE can be expected to significantly outperform MAKE on a program with a low RUV. In contrast, a high RUV indicates that MAKE is an adequate compilation strategy. The following paragraphs motivate the use of the RUV as a predictor of predicate performance and explain exactly how it is computed for C and for Ada.

To compare the performance of two predicates given an actual change, one simply computes the compilation costs associated with each predicate and then compares these costs. For example, suppose the definition of the macro TokenToRule, defined in the Descartes .h file rule.h, were to change. TokenToRule is used in 2 .c files, glyph.c and rule.c, and is visible in each of the 12 .c files that include rule.h. NAME-USE would require the recompilation of the 2898 lines of application code contained in glyph.c and rule.c and in the 16 Descartes .h files that together they include. MAKE would require the recompilation of the 18928 lines contained in the 12 compilation units that include rule.h.
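Expressed as ratios, this single change gives

$$\frac{2}{12} \approx 0.17 \ \text{(compilation units)}, \qquad \frac{2898}{18928} \approx 0.15 \ \text{(lines)},$$

which is exactly the kind of per-change comparison that the RUV summarizes over all names.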

Similar use and visibility numbers for every name defined in a program can be produced straightforwardly from cross-reference information and a table of file sizes. Given these numbers and a set of observed changes, it is not difficult to calculate and compare the relative costs of NAME-USE and MAKE. This is exactly what was done in Chapter 3 with the historic changes to the Descartes crostic client. In the absence of change data, however, it is not obvious how to combine the individual use and visibility numbers into a single number representing the whole program. To do so, it is necessary to make simplifying assumptions about the distribution of changes.

The first simplifying assumption is that only one name changes at a time. While it is possible to use the Descartes change data to estimate the number of names affected by the average historic change to an interface, that data does not reveal how many names are affected by the average immediate change. It is arguable that most changes affect only a few names and that single name changes may predominate, although there are doubtless many changes that affect more than one name. The assumption that only one name changes at a time establishes an easy-to-understand baseline. Furthermore, the assumption frees us from the formidable combinatorics of multiple name changes, necessary to account for all sets of two or more names. For example, the .h files of the Descartes crostic client define 711 names; these 711 names can be grouped into over 250,000 different pairs, nearly 60,000,000 different triplets, and so on. The effect of multiple name changes is reconsidered in Section 4.4.2.

The second assumption is that changes are uniformly distributed among the names of a program. This is the appropriate assumption without specific information relating the probability of a name changing to static properties of the name, in particular to its use or visibility. A uniform distribution does not bias the RUV in favor of either predicate. If the probability of a name changing is independent of its use or visibility then it is plausible that the relative performance of NAME-USE and MAKE will in fact conform to the computed RUV.

Suppose the interfaces of a program define $n$ names: $name_1 \ldots name_n$. Let $use_i$ be the set of compilation units that use $name_i$ and let $visible_i$ be the set of units in which $name_i$ is visible. As before, predicate performance is measured in two ways: by counting number of units compiled and by counting application lines compiled. The unit measure is a convenient way to summarize program properties; the line measure accounts for variability in unit size and serves as an approximation of actual compilation costs. The unit costs for $name_i$ are simply the number of units in $use_i$ ($|use_i|$) and in $visible_i$ ($|visible_i|$) respectively; the line costs are $\sum_{x \in use_i} size(x)$ and $\sum_{x \in visible_i} size(x)$.

Given both assumptions about the distribution of changes, the unit RUV is simply the ratio of average name use to average name visibility measured in compilation units:

$$\frac{\sum_{i=1}^{n} |use_i|}{\sum_{i=1}^{n} |visible_i|}.$$

The line RUV is the same ratio measured in lines compiled:

$$\frac{\sum_{i=1}^{n} \sum_{x \in use_i} size(x)}{\sum_{i=1}^{n} \sum_{x \in visible_i} size(x)}.$$
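Given per-name use and visibility sets and unit sizes, both ratios are mechanical to compute. The following sketch shows the computation over entirely hypothetical membership flags and unit sizes (none of the numbers are taken from the programs studied); it illustrates the formulas, not the tooling actually used in the study.

    /* A minimal sketch of the unit and line RUV computations, using
     * hypothetical use/visibility data for three names and four
     * compilation units; the numbers are made up for illustration. */
    #include <stdio.h>

    #define NNAMES 3
    #define NUNITS 4

    /* use[i][j] = 1 if name i is used in unit j; vis[i][j] = 1 if visible. */
    static const int use[NNAMES][NUNITS] = {
        {1, 1, 0, 0},
        {0, 1, 0, 0},
        {0, 0, 1, 0},
    };
    static const int vis[NNAMES][NUNITS] = {
        {1, 1, 1, 1},
        {1, 1, 1, 0},
        {0, 1, 1, 1},
    };
    /* size[j] = lines of application code in compilation unit j. */
    static const int size[NUNITS] = {900, 400, 650, 300};

    int main(void)
    {
        long use_units = 0, vis_units = 0;   /* sums of |use_i|, |visible_i| */
        long use_lines = 0, vis_lines = 0;   /* sums of the line costs       */

        for (int i = 0; i < NNAMES; i++) {
            for (int j = 0; j < NUNITS; j++) {
                use_units += use[i][j];
                vis_units += vis[i][j];
                use_lines += use[i][j] * size[j];
                vis_lines += vis[i][j] * size[j];
            }
        }
        printf("unit RUV = %.2f\n", (double)use_units / vis_units);
        printf("line RUV = %.2f\n", (double)use_lines / vis_lines);
        return 0;
    }

For these made-up flags the sketch prints a unit RUV of 0.40 and a line RUV of 0.42.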

The formulas above give an abstract definition of the RUV. The concrete value of the RUV for a given program written in a given language depends on the language and on the tools available for collecting use and visibility information. In particular, it is necessary to identify the set of names on which the RUV is based, to specify the meaning of use and visibility, and to determine how the size of a compilation unit is measured. The next two subsections address these issues for C and for Ada.

4.1.1 Computing Use/Visibility Numbers for C

For a C program, the RUV is based on the set of names that are defined or declared in the program's .h files. A C compilation unit consists of a base .c file together with every .h file that it includes either directly or indirectly. The size of a compilation unit is simply the sum of the sizes of all these files. A name is used in a compilation unit if it is referenced either directly or indirectly in the base .c file. If a name is referenced only in a .h file in a compilation unit, then the reference is not counted as a use unless the referencing name is ultimately used in the base file.

In C, any name defined in a .h file is visible in every compilation unit that includes that .h file. Because of the high incidence of extraneously included .h files in Descartes, name visibility for the C programs in the study is computed both with and without the extraneously included files. In the former case, name visibility is based on a program's original include structure; in the latter case, the names defined in a .h file are considered visible only in those compilation units that actually reference some name defined in the .h file. These are exactly the visibility relationships that were used to compute the fat and lean numbers of Chapter 3. As before, the extraneous .h files serve to increase both the visibility of the average name and the size of the average client. Since interface visibility for the Ada programs in this study was computed based on necessary interconnections only, the lean C numbers are used when comparing programs written in the two languages.

The raw name use and visibility data used in this study was obtained by postprocessing cross-reference listings produced by the Tartan Laboratories C cross-reference tool, teexref. Whereas most C language tools expand macro definitions prior to compilation so that macro names never appear to the compiler, the Tartan tools treat macros as first class language objects. Since macro definitions make up the bulk of interfaces in many C programs, this feature was essential for this study. While it would have been impossible to undertake this study without teexref, the amount of effort necessary to suit its data to the study's requirements should not be minimized. It would be advisable to build appropriate special-purpose tools before undertaking a similar study on a larger scale.

In C, conditional compilation complicates both name visibility and name use. On a per compilation unit basis, conditional compilation may hide or expose a name defined in a .h file. It may disable or enable the inclusion of a .h file, or it may hide or expose the use of a name in the base .c file.

The treatment of conditional compilation in computing the RUV is an artifact of the procedures used to gather raw name use and visibility data. Name use in .c files and .h file inclusion are determined on a per compilation unit basis using appropriate values for all conditional expressions (as would be supplied to a compiler). Thus the inclusion of a particular .h file, guarded by the same conditional expression in two compilation units, may be disabled in one unit and enabled in the other so that the names defined in the file are hidden from one base .c file and visible to the second.

In contrast, the visibility of names declared or defined in .h files is determined uniformly across all compilation units without regard to the values of conditional expressions that may differ from compilation to compilation. Suppose a name is declared conditionally in a .h file. If the declaration is hidden in every compilation involving the .h file, then the name remains hidden and does not affect the value of the RUV. If the declaration is visible in any compilation involving the .h file, then the name is treated as visible in all compilations but unused where it is hidden by conditional compilation.

Only the sizes of a program's own .c and .h files were counted in measuring the size of a C compilation unit, not the sizes of the UNIX .h files included in the unit. For units that interact heavily with the operating system, system files can contribute substantially to compilation costs. Thus the line RUV cannot be interpreted strictly as a measure of relative compilation cost. It does however compare the amount of application code affected by a change.

4.1.2 Computing Use/Visibility Numbers for Ada

Phil Levy of Rational provided the data on the Ada programs in the study. Using facilities of the Rational environment, he designed and built a tool that reports on the use and visibility of the names declared in the specification units of Ada programs. Levy's tool does not report on the names declared in the private parts of specifications, nor does it report on subsidiary names, such as those of record components or subprogram parameters, that appear as parts of larger declarations.

Ada compilation units are either specification units, body units or body subunits. When a specification unit is compiled, the compiler produces a symbol table that is used in the compilation of any clients of the specification. A client may be another specification unit, or it may be a body unit or a body subunit. The set of units using a name declared in a specification unit is computed as the compilation closure of the name; the set of units in which a name is visible is computed as the compilation closure of the specification in which the name is declared.

W:  [ type omega3 is ...; ]

A:  [ type alpha0 is new W.omega3; ]

B:  [ type beta1 is new A.alpha0;     constant beta2 is ...; ]

C1: [ func gamma1a ( ... ) return B.beta1; ]
C2: [ func gamma2b ( ... ) return B.beta1; ]
D:  [ array delta5 range [1..beta2] of ...; ]

     x     |      use(x)      |      visible(x)
  ---------+------------------+----------------------
  omega3   | {A, B, C1, C2}   | {W, A, B, C1, C2, D}
  alpha0   | {B, C1, C2}      | {A, B, C1, C2, D}
  beta1    | {C1, C2}         | {B, C1, C2, D}
  beta2    | {D}              | {B, C1, C2, D}
  gamma1a  | ∅                | {C1}
  gamma2b  | ∅                | {C2}
  delta5   | ∅                | {D}

Figure 4.1: The Use and Visibility of Names in Ada

A compilation unit is in the compilation closure of a name if the unit contains a reference to the name or if the unit is in the compilation closure of a second name that references the first. This is best illustrated by an example. In Figure 4.1, the name alpha0 is declared in specification unit A. Because alpha0 is referenced in the declaration of the name beta1 in specification unit B, B is in its compilation closure, as is every unit in the compilation closure of beta1. These units constitute the use set for alpha0. Because omega3 is referenced in the declaration of alpha0, its compilation closure will include A as well as the closure of alpha0.

A compilation unit is in the compilation closure of a specification if the unit contains a reference to some name declared in the specification or if the unit is in the compilation closure of another specification that references some such name. For example, compilation units A and B in Figure 4.1 are in the compilation closure of W, the specification declaring omega3. So is unit D, because D references a name declared in B and B is in the closure of W. The name referenced need not be in the compilation closure of any name declared in W. In particular, unit D is not in the compilation closure of omega3 but omega3 is visible in D.
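The closure of a name can be computed by following the name-reference relation backwards from the name and collecting the declaring unit of every name reached. The sketch below (in C) reproduces the use sets of Figure 4.1 from the figure's declarations; it is an illustration of the definition only, not Levy's tool, and, like the figure, it does not count a name's own declaring specification.

    /* A sketch of the compilation closure of a name, using the units and
     * declarations of Figure 4.1: a unit belongs to use(n) if it declares
     * a name that directly or transitively references n. */
    #include <stdio.h>

    enum { OMEGA3, ALPHA0, BETA1, BETA2, GAMMA1A, GAMMA2B, DELTA5, NNAMES };
    enum { W, A, B, C1, C2, D, NUNITS };

    static const char *name_str[NNAMES] =
        { "omega3", "alpha0", "beta1", "beta2", "gamma1a", "gamma2b", "delta5" };
    static const char *unit_str[NUNITS] = { "W", "A", "B", "C1", "C2", "D" };

    static const int decl[NNAMES] = { W, A, B, B, C1, C2, D };  /* declaring unit */

    /* refs[k][i] = 1 if the declaration of name k references name i. */
    static const int refs[NNAMES][NNAMES] = {
        [ALPHA0]  = { [OMEGA3] = 1 },
        [BETA1]   = { [ALPHA0] = 1 },
        [GAMMA1A] = { [BETA1]  = 1 },
        [GAMMA2B] = { [BETA1]  = 1 },
        [DELTA5]  = { [BETA2]  = 1 },
    };

    /* Mark the declaring unit of every name that transitively references i. */
    static void close_over(int i, int in_use[], int seen[])
    {
        for (int k = 0; k < NNAMES; k++) {
            if (refs[k][i] && !seen[k]) {
                seen[k] = 1;
                in_use[decl[k]] = 1;
                close_over(k, in_use, seen);
            }
        }
    }

    int main(void)
    {
        for (int i = 0; i < NNAMES; i++) {
            int in_use[NUNITS] = { 0 };
            int seen[NNAMES] = { 0 };
            close_over(i, in_use, seen);
            printf("use(%s) = {", name_str[i]);
            for (int j = 0; j < NUNITS; j++)
                if (in_use[j])
                    printf(" %s", unit_str[j]);
            printf(" }\n");
        }
        return 0;
    }

Running it prints, for example, use(omega3) = { A B C1 C2 }, matching the first row of the figure.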

The Ada compilation model requires the recompilation of a specification whenever any of the names it declares is changed, whether the name is referenced in the specification or not. This is necessary to incorporate the change in the symbol table representing the specification. For this reason, Levy's data always includes the declaring specification in the compilation closure of each name; each specification is also included in its own compilation closure. Consequently Ada names that are unused appear to have one client whereas C names that are unused have none. Because Levy's data does not contain enough information to distinguish names that are in fact referenced in the declaring specification from names that are not, it is impossible to accurately correct for this difference. Thus Levy's name use numbers will yield RUV values that are too large. Compensating by subtracting one from each of these numbers will produce RUV values that are too small. In the belief that most names are not referenced in their declaring specifications, the compensated RUV values are used in comparing Ada and C. RUV values based on Levy's original numbers are given in Section 4.2.3. The latter numbers probably give a better reflection of compilation costs since the declaring specification must be recompiled if the changed name has any clients.

In Ada, the names declared in a specification unit are visible in another compilation unit only if that unit imports the specification using a with clause. Most Ada compilers base recompilation on the with relation. However, a unit may import a specification without using any name declared therein. Such an extraneously imported specification corresponds directly with an extraneously included .h file in C. Since the compilation closure is computed without reference to the with relation, the name visibility data automatically excludes unnecessary imports. Thus there is no data for the Ada programs based on their with clause closures that corresponds with the data for C programs based on their original (fat) include structures.

In computing the line RUV, only the lines of source text in each compilation unit are counted; there is no measure of the size of the symbol table information needed to support the compilation of the unit. As with C, this number is not proportional to actual compilation costs, but it does reflect lines of code affected by a change.

4.1.3 The RUV versus the Big Inhale

Before it can process a compilation unit, the compiler for a separately compiled language must acquire the definitions of all symbols visible to the compilation unit. Conradi calls this process the "big inhale". The fraction of inhaled names that are actually used is exactly what is measured in the unit RUV, except that instead of computing the ratio for each compilation unit, it is computed for the program as a whole. The following paragraphs consider how the two values might differ.

Suppose a program that defines $n$ names has $m$ compilation units. To compare the number of names that are used in a compilation unit with the number of names that are visible, we define $u_i^j$ to be 1 if $name_i$ is used in $unit_j$; otherwise $u_i^j$ is 0. Similarly, $v_i^j$ is 1 if $name_i$ is visible in $unit_j$ and 0 otherwise. These numbers are used to define $R_j$, the ratio of use to visibility for compilation unit $j$; $\bar{R}$, the average ratio of use to visibility for units $1 \ldots m$; and $R$, the unit RUV for the program.

Since the number of names used in $unit_j$ is $\sum_{i=1}^{n} u_i^j$ and the number of visible names is $\sum_{i=1}^{n} v_i^j$, the ratio of use to visibility for the unit is

$$R_j = \frac{\sum_{i=1}^{n} u_i^j}{\sum_{i=1}^{n} v_i^j}.$$

Averaging this number over all $m$ units in the program yields

$$\bar{R} = \frac{\sum_{j=1}^{m} R_j}{m}.$$

Earlier, $use_i$ was defined as the set of units using $name_i$. Since $u_i^j = 1 \Leftrightarrow unit_j \in use_i$, the number of units in $use_i$ is $\sum_{j=1}^{m} u_i^j$ and the number of units in $visible_i$ is $\sum_{j=1}^{m} v_i^j$. The unit RUV for the program is

$$R = \frac{\sum_{i=1}^{n} |use_i|}{\sum_{i=1}^{n} |visible_i|} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} u_i^j}{\sum_{i=1}^{n} \sum_{j=1}^{m} v_i^j}.$$

In general, the average ratio of names used in a compilation unit to names visible in the compilation unit, $\bar{R}$, is not the same as the unit RUV for the program, $R$. However, if the distribution of name use and name visibility is uniform across all compilation units in the program, these two numbers will be close. If we let $R_j = R + \delta_j$ (so that $\delta_j$ reflects the difference between the ratio of use to visibility for $unit_j$ and the program as a whole), then

$$\bar{R} = \frac{\sum_{j=1}^{m} (R + \delta_j)}{m} = R + \frac{\sum_{j=1}^{m} \delta_j}{m}.$$

As long as the $\delta_j$ values are small or cancel each other out, $\bar{R}$ will approximate the unit RUV. Section 4.2.4 compares the ratios of use to visibility for the individual compilation units of the C programs in the study with the unit RUV values for the same programs.
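As a hypothetical illustration of how the two can diverge when visibility is not uniform: suppose one unit uses 1 of 10 visible names and a second uses 30 of 40; then

$$\bar{R} = \frac{1}{2}\left(\frac{1}{10} + \frac{30}{40}\right) = 0.425, \qquad R = \frac{1 + 30}{10 + 40} = 0.62.$$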

4.1.4 Partial RUV's

The per compilation unit ratio of use to visibility $R_j$, defined above, can be thought of as a partial RUV. A partial RUV is a ratio of use to visibility computed either from a subset of the compilation units in a program or from a subset of the names. There is occasion to look at both kinds of partial RUV in the next section.

4.2 The Six Program Study

I computed RUV values for six programs including Descartes. The five additional programs were chosen based largely on their accessibility. Like Descartes, C-Kermit and Vice are written in C; otherwise, at least superficially, they are quite unlike Descartes. R-Kermit, Trtl3w and Debugger are written in Ada. Size and interface characteristics for each program, along with its unit and line RUV, are given in Table 4.1.

For each program, the table contains the following information:

• The size of the program text in thousands of lines, including comments and blank lines.

• The percent of the program text contained in interface components.

• The number of implementation (body) components. For C this is the number of .c files; for Ada it is the number of package and subprogram body units and subunits.

• The average size in lines of an implementation component.

• The number of interface components. For C this is the number of .h files; for Ada it is the number of package and subprogram specification units.

                                   C-Kermit  Descartes   Vice  |  R-Kermit  Trtl3w  Debugger
Size (thousands of lines)              11.5       11.7   57.7  |       4.3    32.8      28.5
... percent interface                    4%        14%     6%  |       17%     16%       24%
Number of Body Components                13         26    151  |        13      45        47
... average size (lines)                849        385    359  |       272     611       460
Number of Interface Components            4         28     43  |        13      48        54
... average size (lines)                118         59     82  |        55     110       126
Average No. Clients/Interface           7.8        7.4   13.6  |       6.8    10.1      25.3
Average No. Names/Interface              51         25     36  |        12      20        26
Average Name Use (units)                1.8        2.4    3.2  |       3.2     2.7       2.9
Average Name Visibility (units)         6.1       12.8   22.4  |       7.8    18.6      39.7
Unit RUV                               0.29       0.19   0.14  |      0.42    0.14      0.07
Line RUV                               0.33       0.20   0.16  |      0.52    0.22      0.13

Table 4.1: Comparative Name Use and Visibility for 3 C and 3 Ada Programs

• The average size in lines of an interface component.

• The average number of clients per interface, measured in compilation units. For C this is the average number of compilation units that must include each .h file; for Ada it is the average number of units in the compilation closure of each specification unit.

• The average number of names declared in each interface. For Ada this number does not include record components or names declared in the private part of a specification unit.

• The average number of clients using each name, measured in compilation units. For Ada this number does not include the specification declaring each name.

• The average number of clients in which each name is visible, measured in compi- lation units.

• The unit RUV: The ratio of the average number of clients actually using each name to the average number of clients in which each name is visible.

• The line RUV: The ratio of the aggregate size of the clients actually using each name to the aggregate size of the clients in which each name is visible.

It is clear from the data in the table that there are considerable differences among the six programs, but Descartes is nowhere conspicuously atypical. Despite differences elsewhere, average name use is consistently small for all programs. After briefly introducing the programs in the study, the data in the table will be considered in more detail.

4.2.1 The Programs

C-Kermit and R-Kermit² were included in the study because both programs are implementations of the Kermit protocol for exchanging files between computers [12]. The source for C-Kermit was obtained from archives at Columbia University in the fall of 1987. Modeled after the Columbia sources, R-Kermit was developed at Rational and is distributed to Rational customers as part of an unsupported software catalog.

²R-Kermit is my name for the program Rational calls Kermit.

As programmed, C-Kermit includes five .h files. The data in Table 4.1 reflects only four of these. The fifth .h file is vestigial in the configuration analyzed, containing only a single definition that is effectively unused. I omitted this .h file from the study because its influence on the aggregate data would have been unduly large.

The Andrew File System, colloquially known as Vice, is the third C program in the study. Vice is the distributed file system of the Andrew programming environment developed at Carnegie Mellon University's Information Technology Center [45]. It is not a single program, but a collection of programs that run as clients and servers on local workstations and on central file servers. These programs share a set of interfaces realized in common .h files and object libraries. The data in Table 4.1 represents these common .h files, the library sources, and those additional .c and .h files that make up those Vice programs that use the shared interfaces. The data does not reflect a small amount of additional code making up three standalone Vice programs that do not use any of the shared interfaces. I obtained the source for Vice in the fall of 1987 with the help of Susan Straub of the Information Technology Center.

Like Descartes, Vice is organized around a common core of support facilities. However, I looked at only one Descartes application (the crostic client) whereas I looked at almost all the Vice programs. This is because each Descartes application is developed as an independent program, whereas the Vice programs cooperate to provide a set of coordinated services.

Unlike the C programs and R-Kermit, the two remaining Ada programs in the study are effectively black boxes, revealed only through their use and visibility profiles. Both are parts of a debugger system. Unlike R-Kermit and Trtl3w, Debugger consists of multiple Rational subsystems. The subsystem construct is a program structuring mechanism provided by Rational to control information sharing between collections of modules. Perhaps the difference in the RUVs of Trtl3w and Debugger, two programs that are otherwise similar, is due to differences in their structure.

Like most software designed to run in the Rational environment, all three Ada pro- grams are organized as sets of procedures that can be invoked by another Ada program, either interactively on behalf of the user as by the Rational environment, or in a more conventional batch-oriented manner.

4.2.2 The Data in Table 4.1

The range of RUV values in Table 4.1 is considerable. The unit RUVs differ by a factor of 6, and the line RUVs by a factor of 4; in both cases the extreme values belong to the Ada programs Debugger and R-Kermit. For a single change to an arbitrary name defined in Debugger, NAME-USE would require the compilation of only 1 program unit for every 14 units compiled by MAKE; for an arbitrary name defined in R-Kermit, NAME-USE would recompile 1 unit out of every 2 units. For the three C programs, both the unit and line RUVs differ by a factor of 2. At .19 (unit) and .20 (line), the RUV values for Descartes are intermediate among the values for the three C programs and among those for all six C and Ada programs.
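Under the two assumptions of Section 4.1, these per-change figures are simply the reciprocals of the unit RUVs:

$$\frac{1}{0.07} \approx 14 \ \text{(Debugger)}, \qquad \frac{1}{0.42} \approx 2.4 \ \text{(R-Kermit)}.$$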

In all cases, the value of the line RUV is larger than that of the unit RUV. This indicates (not surprisingly) that larger compilation units are more likely to use a name than smaller units. While the difference between the two values is relatively small for the C programs in the study, it is significant for the Ada programs. The line RUV for Debugger, for example, is almost twice the unit RUV. This may be a consequence of how the sizes of compilation units are measured in the two languages. The size of C compilation units may tend to be more uniform because they each consist of a collection of files. Ada compilation units range from subprogram specifications that are a few lines long to package bodies that are several hundred lines long.

While any conclusions drawn from such a small sample must be taken as anecdotal, it is intriguing to speculate about the relationship between program characteristics and the RUV.³ The most striking feature of the data in Table 4.1 is that despite the considerable range of values for the RUV, the average integral number of compilation units using each name is either 2 or 3 for each of the programs in the study. For these programs at least, it is name visibility, not name use, that determines the value of the RUV. The distribution of use and visibility is discussed in Section 4.3.

³These speculations are testable hypotheses.

Both unit and line RUV values are conspicuously larger for the two versions of Kermit. If we interpret the RUV as a measure of the effectiveness of interface organization, then R-Kermit in particular seems to be well modularized. R-Kermit is the smallest program in the study, so it would be puzzling if the case were otherwise. Larger programs are harder to modularize.

In comparing the three C programs in the study, few conclusions can be drawn about the relationship between size and RUV or between gross interface characteristics and RUV. Descartes and C-Kermit are the same size but C-Kermit has a substantially larger RUV. A clear difference between the two programs is their interface characteristics: 4% of C-Kermit is interface compared with 15% of Descartes. C-Kermit and Vice have similar interface characteristics, 4% of C-Kermit and 6% of Vice is interface, yet their RUV values differ by a factor of 2. Vice is 5 times the size of C-Kermit. Among the Ada programs, Trtl3w and Debugger are comparable both in size and number of code and interface components. Debugger has a larger interface and a disproportionately larger average number of clients per interface. Its RUV is half that of Trtl3w.

The paragraphs that follow treat a few issues in more detail.

C-Kermit versus R-Kermit: The Effect of Language Difference?

Although both programs are implementations of the same protocol and both are relatively small, Table 4.1 shows C-Kermit and R-Kermit to be surprisingly dissimilar. Some of this difference is undoubtedly due to differences between C and Ada, but much of it can be attributed to more mundane factors. R-Kermit is the work of a single programmer; it implements only a subset of the protocol and runs only on a Rational computer. C-Kermit is an older and more complete program, developed by a team of programmers; it runs on a variety of machines under several operating systems.

Eliminating the conditional compilation expressions used to tailor the code for different architectures and operating systems reduces the size of C-Kermit by about 18% to 9.5 thousand lines, still substantially larger than R-Kermit.

Descartes versus C-Kermit and Vice: The Effect of Programming Style?

It is clear from the data in Table 4.1 that the percent of interface code in Descartes is more characteristic of the Ada programs in the study than of the remaining C programs. Descartes was developed using abstraction techniques that are compulsory in Ada.

C-Kermit and Vice are probably more typical of most C programs. In C any global variable or procedure not declared static is automatically visible outside the defining .c file. An importing .c file simply has to declare the imported name as extern and the reference is resolved by the linker. For integer returning functions, a declaration is not even necessary. This is a perfectly legitimate and common C programming style. Most of the potentially shared names defined in both C-Kermit and Vice are not declared in .h files.

Given the number of such names, it is important to consider how the names not declared in .h files in C-Kermit and Vice would affect the RUV if they were to be made part of the interface. While no conclusions can be drawn about the visibility of these names, it is possible to investigate their use. If the average number of clients using each undeclared name were high, then the RUV values for C-Kermit and Vice might be too low and the consistency of average name use across the six programs in the study may simply be a fluke. If, on the other hand, the average number of clients using each undeclared name were low, then the existing data is probably good. Table 4.2 shows that average name use for names not declared in interfaces is less than 2 for both C-Kermit and Vice. Considering only names that are used outside their defining .c file, that number increases to just slightly more than 2, even though each such name must have at least 2 clients.

                                                C-Kermit    Vice
Number of externs not declared in interfaces         391    1657
... percent of all shared names                       66%     52%
Average number of clients per extern                  1.9     1.5
Number of externs that are unused                       6     101
Number of externs that should be static               106     865
Number of externs that are in fact shared             279     691
Average number of clients per shared extern           2.2     2.3

Table 4.2: Clients of C Externs that are not Declared in .h Files

4.2.3 The Effect of Compensating for Language Differences on the Concrete RUV

Sections 4.1.1 and 4.1.2 described how the concrete RUV was computed for C and for Ada. To compensate for language and tool differences, I disregarded superfluously included .h files in computing name visibility for C; I disregarded the declaring specification in computing name use for Ada. This section considers the effect of these decisions.

The Effect of Superfluously Included .h Files in C

Section 4.1.1 justified the decision not to consider superfluously included .h files in computing name visibility for C because this method conforms with the way name visibility was computed for the Ada programs studied. This decision not only provides a common basis for comparing C and Ada programs, it also provides a basis for comparing fundamental visibility patterns in C programs. However, superfluously included .h files do have an effect on recompilation costs. By increasing name visibility they naturally decrease the RUV. Table 4.3 shows by how much.

As in the previous chapter, the columns labeled "fat" represent each program as written; the columns labeled "lean" echo the data from Table 4.1, representing each program minus any unnecessarily included .h files. The table shows differences in name visibility and in the unit and line RUV for each of the three programs. None of the programs are free of unused .h files, though Descartes by far has the greatest proportion and C-Kermit has almost none.⁴ The only RUV values that are substantially affected by superfluously included .h files are those of Descartes, which decrease in the presence of these files by 20-25%.

⁴Perhaps this is because C-Kermit is the only program of the three that represents mature production code.

The high incidence of unnecessarily included .h files in Descartes is the result of a deliberate decision to use an umbrella include file and is one aspect in which Descartes differs from the other programs in the study. While the lean numbers of this and the previous chapter compensate for this difference, the picture of Descartes that emerges even from the fat numbers of Table 4.3 does not distinguish it as unusual among the programs in the study.

                                    C-Kermit         Descartes          Vice
                                   fat    lean      fat    lean     fat    lean
Size Ratio (fat/lean)                 1.01             1.22             1.07
Average No. Clients/Interface      8.0     7.8     11.6     7.4    16.5    13.6
Average Name Use (units)               1.8              2.4              3.2
Average Name Visibility (units)    6.4     6.1     16.5    12.8    24.9    22.4
... ratio of fat to lean              1.05             1.29             1.11
Unit RUV                          0.28    0.29     0.15    0.19    0.13    0.14
... ratio of fat to lean              0.97             0.79             0.93
Line RUV                          0.31    0.33     0.15    0.20    0.15    0.16
... ratio of fat to lean              0.94             0.75             0.94

Table 4.3: Effect of Superfluously Included .h Files in C

The Effect of Counting the Declaring Specification as a Client in Ada

The name use data for the Ada programs in the study counts the specification declaring a name as a user of that name whether the name is referenced outside its declaration or not. Because this data does not indicate which names (or what proportion of all names) are legitimately used in their declaring specifications and which are not, it is impossible to compute an accurate RUV. The data in Table 4.1 was produced by compensating for names that are not used in their declaring specifications by subtracting 1 from all use numbers. Since not all names are unused, this produces a lower bound for the RUV. Since not all names are used, computing the RUV based on the original Ada use numbers produces an upper bound for the RUV. The lower bound is used in Table 4.1 based on the assumption that more names are unused in their declaring specification than are used, so that the true RUV is closer to the lower than to the upper bound. Both figures are presented in Table 4.4.

                                        R-Kermit          Trtl3w         Debugger
                                      upper   lower    upper   lower   upper   lower
                                      bound   bound    bound   bound   bound   bound
Average Number of Clients/Interface       6.8              10.1             25.3
Average Name Use (units)               4.2     3.2      3.7     2.7     3.9     2.9
Average Name Visibility (units)           7.8              18.6             39.7
Unit RUV                              0.55    0.42     0.20    0.14    0.10    0.07
... ratio of upper to lower bounds       1.31              1.43             1.42
Line RUV                              0.56    0.52     0.25    0.22    0.16    0.13
... ratio of upper to lower bounds       1.08              1.14             1.23

Table 4.4: The Effect of Counting the Declaring Specification as a Client in Ada
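The bounds in the table follow directly from the averages, since the unit RUV is the ratio of average name use to average name visibility; for Debugger, for example,

$$\frac{3.9}{39.7} \approx 0.10 \ \text{(upper bound)}, \qquad \frac{3.9 - 1}{39.7} = \frac{2.9}{39.7} \approx 0.07 \ \text{(lower bound)}.$$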

The only difference in the values used to compute the lower and upper bound RUVs in Table 4.4 is in average name use; counting the declaring specification increases its value by 1. Whereas average name use for the C programs in the study is consistently less than 3, the uncompensated upper-bound numbers place it close to 4 for all three Ada programs. This number is still consistently small, but it is not as uniform across programs as would appear from the values in Table 4.1. Increasing average name use by 1 has a dramatic effect on the unit RUV; its effect on the line RUV is less pronounced since specification units tend to be smaller than body units, both of which contribute to the line RUV. Even so, and once again excepting R-Kermit, the values of the RUV remain consistent with those of the C programs in the study.

4.2.4 The Unit RUV versus the Partial RUVs of the Big Inhale

Section 4.1.3 compared the unit RUV ($R$) with the average ratio of use to visibility for a compilation unit ($\bar{R}$). While it is not possible to extract per unit ratios from the data on the Ada programs in the study, it was possible to do so for the C programs. The average per unit ratio for each C program is given in Table 4.5.

                                             C-Kermit   Vice   Descartes
average per unit use/visibility ratio            0.28   0.22        0.19
average ratio for units weighted by size         0.33   0.20        0.17

Table 4.5: Average Per Unit Use/Visibility Ratios

The second line of the table gives the average unit ratio weighted by unit size. This value is computed according to the formula:

$$W = \frac{\sum_{j=1}^{m} size(unit_j) \, R_j}{\sum_{j=1}^{m} size(unit_j)}$$

Although W is not the same as the line RUV, the values of the two are virtually the same for all 3 C programs.

All average unit ratios are close to the 15-25% range reported by Conradi and Wanvik and by Kamel and Gammage.

4.3 Patterns of Use and Visibility

Figures 4.2 and 4.3 give a more detailed picture of name use and visibility for the six programs studied. The histograms in Figure 4.2 show how many names are used by 0 through 20 compilation units. The graphs of Figure 4.3 contrast cumulative name use and visibility across the entire range of units per name for each program. In these graphs, the name use value shown for k compilation units is the total number of names used in k or fewer units. Thus the curves grow monotonically with endpoints representing the largest number of compilation units using any one name. The cumulative values for name visibility are defined analogously.

[Figure 4.2 contains six histograms, one per program: (a) C-Kermit, (b) Descartes, (c) Vice, (d) R-Kermit, (e) Trtl3w, (f) Debugger. Each plots the number of names against the number of clients per name.]

Figure 4.2: Name Use Density in Three C and Three Ada Programs

[Figure 4.3 contains six graphs, one per program: (a) C-Kermit, (b) Descartes, (c) Vice, (d) R-Kermit, (e) Trtl3w, (f) Debugger. Each plots cumulative name use and cumulative name visibility against the number of clients per name.]

Figure 4.3: Cumulative Name Use and Visibility for Three C and Three Ada Programs

The pattern of name use is remarkably consistent across the six programs in the study. In each case most names have 4 or fewer users: there are very many names that have very few users and very few names that have many users. Although C-Kermit and R-Kermit each define too few names to display the characteristic curve, in the remaining programs name use is clearly exponentially distributed.

If there is a difference in patterns of name use and visibility that would distinguish Descartes from the remaining programs in the study, it is in the distribution of name visibility. For Descartes this distribution is clearly uniform; for the other programs it is less so. In Descartes, there are as many names visible to a large or moderate number of clients as there are names visible to a small number of clients. In the other programs, visibility is biased toward fewer clients, but there is still a substantial number of names that are visible to a moderate number of clients.

4.3.1 The Effect of Modularity in Descartes

The thesis was introduced with an observation about modularity and the incidence of redundant recompilation. Although Figure 1.1 is powerful graphically, Section 4.4.2 (on page 128) will show that names are not assigned to interfaces at random. It is reasonable to ask whether there is a natural modularity in programs that would make it possible to partition the names into a relatively small number of units so that use and visibility would coincide.

Suppose, for example, we were to reorganize a C program by partitioning the names defined in its original .h files into a new set of .h files so that the following conditions were met:

1. All the names defined in a new .h file were defined in the same original .h file.

2. Every name defined in a new .h file is used by all the clients of the new .h file.

The first condition ensures that logically unrelated names are not combined (assuming the original partition into .h files grouped maximal sets of related names). The second ensures that use and visibility coincide so that any change to a .h file would affect each of its clients. How many new .h files would we have to manage?
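Read this way, the new files are formed by grouping the names of each original .h file by their exact sets of using clients. The sketch below illustrates the grouping; the names, files and client sets in it are hypothetical, not drawn from Descartes.

    /* A sketch of the repartitioning described above: names are grouped
     * into new .h files by (original .h file, exact set of client .c
     * files), so that every name in a new file is used by all of that
     * file's clients.  The data below is made up for illustration. */
    #include <stdio.h>
    #include <string.h>

    struct name {
        const char *ident;    /* name defined in an original .h file    */
        const char *orig_h;   /* original .h file defining it           */
        unsigned    clients;  /* bitmask of the .c files using the name */
    };

    enum { NNAMES = 4 };
    static const struct name names[NNAMES] = {
        { "kindOf",   "rule.h",  0x03 },  /* used by units 0 and 1       */
        { "countOf",  "rule.h",  0x03 },  /* same clients: same new file */
        { "widthOf",  "rule.h",  0x14 },
        { "heightOf", "glyph.h", 0x02 },
    };

    int main(void)
    {
        int grouped[NNAMES] = { 0 };
        int new_files = 0;

        for (int i = 0; i < NNAMES; i++) {
            if (grouped[i])
                continue;
            printf("new .h file %d (from %s):", ++new_files, names[i].orig_h);
            for (int j = i; j < NNAMES; j++) {
                /* Condition 1: same original .h file.
                 * Condition 2: identical client set.                    */
                if (!grouped[j] &&
                    strcmp(names[j].orig_h, names[i].orig_h) == 0 &&
                    names[j].clients == names[i].clients) {
                    grouped[j] = 1;
                    printf(" %s", names[j].ident);
                }
            }
            printf("\n");
        }
        printf("%d names partitioned into %d new .h files\n", NNAMES, new_files);
        return 0;
    }

With these four hypothetical names the sketch produces three new .h files.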

Partitioning the original Descartes .h files as above would produce a total of 249 new .h files, each containing an average of 3 names. While the largest file would contain 54 names, 85% of the files would contain between 1 and 3 names. If the first condition were relaxed, so that names originally defined in different .h files could appear together in the same new .h file, partitioning the original .h files would produce a total of 189 new .h files, each containing an average of 4 names. The largest would again contain 54 names; 86% would contain between 1 and 4 names.

For Descartes there is not a reasonable program organization where use and visibility coincide, at least based on the current content of the Descartes .c files. While it would be interesting to repartition the content of the .c files as well as the .h files, doing so was not possible based on the available cross-reference data.

4.4 The RUV versus Predicate Performance

The last two sections have shown that Descartes is like other programs in its static name use and visibility patterns. However, if Descartes' RUV values are compared with appropriate ratios of predicate performance from Chapter 3 there is a pronounced difference.

                        RUV    NAME-USE / MAKE
compilation units      0.19         .43
lines compiled         0.20         .44

Table 4.6: Descartes' RUV versus Relative Predicate Performance

Table 4.6 contrasts the unit and line RUVs for Descartes taken from Table 4.1 with the ratios of NAME-USE to MAKE for the 39 configurations with .h file changes based on the lean data in Table 3.8. The table shows that the RUV values for Descartes are substantially lower than the ratios of predicate performance. This section tries to reconcile this difference and, using Descartes as an example, to understand the relationship between the RUV and predicate performance.

As explained in Section 4.1, two assumptions are required if the RUV is to be related to predicate performance. The first assumption is that only one name changes at a time; the second is that changes to all names are equally likely. The effect of these assumptions can be assessed by reexamining the historic data.

Section 4.4.1 considers the effect of the distribution of changes.

While numerous factors may affect the distribution of changes to a program relative to name use or visibility, the RUV is based on the assumption that changes are independent of either. To assess the effect of this assumption, I computed the partial RUV of the names that changed in the revisions analyzed in the previous chapter. How closely this number approximates the whole-program RUV reflects the extent to which use and visibility patterns in the population of the names that changed correspond to the same patterns in the population of names as a whole. For example, while programmers may be reluctant to change names in a widely visible interface, there also may be little incentive to change names that are only sparsely used. Under these circumstances, the names that are changed would tend to have lower than average visibility and higher than average use so that their partial RUV would be larger than the RUV of the program as a whole.

Section 4.4.2 considers the effect of the size of changes.

While many of the historic changes to the Descartes software certainly represent a sequence of smaller immediate changes, many immediate changes certainly affect more than one name. If the RUV for sets of names is defined analogously to the RUV for individual names, it is reasonable to expect that the more names that change at a time, the higher the RUV. (In the extreme, if a set of names is large enough then it will be both used by and visible to every compilation unit in the program.) To assess the effect of multiple name changes in Descartes, I used a combinatorial formula along with the empirical distribution of use and visibility from Section 4.3 to estimate average name use and visibility according to the number of names changed. Knowing how many names changed in each of the configurations of the previous chapter, static properties of Descartes can be correlated with predicate performance based on the historic change data. The results of this comparison indicate that sets of names that are changed at the same time have substantially lower use and visibility than arbitrarily chosen sets of similar size.

4.4.1 The Distribution of Name Changes in the Descartes History

Table 4.7 shows partial unit and line RUV values for the names that changed in the .h file revisions examined in Chapter 3. These partial RUVs reflect the distribution of changes in the Descartes history and therefore represent what would have been the relationship between NAME-USE and MAKE had only one name changed at a time.

Rather than return to the historic record to determine the number of units using or visible from each name, I used use and visibility data from the most recent configuration. There are two minor problems with this expediency. The first is that use and visibility change over time. Had I tabulated contemporary use and visibility data for each changed name, I probably would have produced a different partial RUV. However, because the structure of the crostic client was reasonably stable during the period studied, it is unlikely that the difference would have been large.

Historic Changes to Descartes .h Files

                                   Strict                              All        All
                                  Changes   Additions   Deletions   Changes     Names
Number of Names                     125        107         105        337        711
Average Name Use (units)            2.8        2.2          0         1.8        2.4
Average Name Visibility (units)     9.6       12.8        10.1       10.8       12.8
Unit RUV                           0.29       0.18          0        0.16       0.19
Line RUV                           0.30       0.18          0        0.17       0.20

Table 4.7: Comparative Name Use and Visibility for Historic Changes

The second problem is that some of the names that changed do not exist in the most recent configuration. Most of these names are deletions, which I simply assumed have no uses. I ignored the few remaining names that were added or changed before being deleted.

Table 4.7 shows that the historic partial RUV is smaller than the static RUV for Descartes and that, taken together, the names that changed are on average both less used and less visible than the average name in the program. Average name use is lower for additions and deletions than for the population as a whole and higher for strict changes; however, additions and deletions make up a significant fraction of all changes. The table shows that for the changes in Descartes' history, the RUV is a conservative predictor of the relative performance of NAME-USE and MAKE for changes to one name at a time.
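
As a quick check on the table, each unit RUV entry is just the ratio of the corresponding use and visibility averages (the single-name case of the set RUV defined in the next section); for example,

\[
\frac{2.8}{9.6} \approx 0.29 \quad \text{for strict changes,} \qquad
\frac{2.4}{12.8} \approx 0.19 \quad \text{for all names.}
\]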

Not surprisingly, this implies that the difference between the RUV values for Descartes and the ratios of NAME-USE to MAKE in Table 4.6 is due to the size, not the distribution, of changes.

4.4.2 The Effect of Grouping Name Changes in the Descartes History

The RUV values discussed so far are based on one name changing at a time. Since typical changes involve groups of names, it would be useful to understand the relationship between the RUV and group size. The RUV for a set of k names is a simple extension of the RUV for a single name: it is the ratio of set use to set visibility for all sets of names of size k. A set of names is used (visible) in a compilation unit if at least one name in the set is used (visible) in that unit.

If N^(k) is the set of all subsets of size k of a set of names N and if use_i is defined (as in Section 4.1) as the set of all units using name_i, then the unit RUV for sets of size k

(the k-RUV) is the ratio:

\[
\frac{\sum_{K \in N^{(k)}} \left|\, \bigcup_{i \in K} use_i \,\right|}
     {\sum_{K \in N^{(k)}} \left|\, \bigcup_{i \in K} visible_i \,\right|}
\]

By making simplifying assumptions once again, the problem of computing expected use and visibility for sets of names can be modeled as a variation of a standard combinatorial problem. This problem is the occupancy problem: If n balls are randomly distributed among m cells, what is the expected number of occupied cells? 5 Instead of balls and cells, our problem involves names and compilation units: We wish to compute the expected number of compilation units "occupied" by a set of n names. The simplifying assumptions involve how compilation units are assigned to names. Since names are used by more than one unit, instead of assigning each ball one cell, each name (a ball) is assigned between 0 and m compilation units (cells). The number of units assigned to each name is determined by the distribution of Figure 4.3(b). The specific units that make up that number are selected at random.

Figure 4.4 plots expected use and visibility for sets of names against the number of names in a set; Figure 4.5 plots the relationship between unit RUV and set size.

The paragraphs that follow explain the formula used to compute these numbers. Then the effect of the assumptions made in applying this formula is considered. Finally, the relationship between the results of the previous chapter and the k-RUV of this section is discussed.

Expected Use and Visibility for Sets of Names

Consider a program that has M compilation units and defines N names. If P(m, n) is the probability that a set of n names, chosen at random, is used in exactly m units, then the expected number of units using a set of n names is

\[
E(n) = \sum_{m=0}^{M} m \cdot P(m, n) \tag{4.1}
\]

The only problem is determining the probabilities P(m, n).

P(m, 1) is the probability that a given name is used in m units. It is defined according to the distribution for Descartes in Figure 4.2, as the ratio of the number of names used in m units to the total number of names. (For example 30 of the 711 names defined in the Descartes .h files are used in exactly 4 compilation units. Thus P(4, 1), the probability that a given name is used in 4 units is 30/711.)

5 For a discussion of the occupancy problem, see any good textbook on probability. For example, Dwass [17, pages 61-64].

[Figure 4.4 is a plot of expected use and visibility (in compilation units) against the number of names in a group, comparing the occupancy model estimates of use and visibility with the average use and visibility of generated name sets.]

Figure 4.4: Expected Use and Visibility for Groups of Names in Descartes

[Figure 4.5 is a plot of the unit k-RUV against the number of names in a group, comparing the ratio of the occupancy model estimates with the ratio for generated name sets.]

Figure 4.5: The k-RUV for Descartes

P(m, n) is computed according to the inductive formula:

\[
P(m, n+1) = \sum_{i=0}^{m} P(i, 1) \sum_{k=0}^{i} P(m-k, n)\,
\frac{\binom{m-k}{i-k}\binom{M-(m-k)}{k}}{\binom{M}{i}}
\]

This formula assumes that names are chosen at random, one at a time, until n + 1 names have been chosen. The probability that any given name is used in i compilation units is P(i, 1). Assuming further that these units are also chosen at random from among the M units in the program, there are \(\binom{M}{i}\) ways of choosing the i units.

The formula for m compilation units and n + 1 names is based on dividing the M compilation units in the program into two groups, one group consisting of those units that use the first n names chosen and the other group consisting of the remaining units. The first group may contain between 0 and m units. If the first group has m - k units then the set of n + 1 names will be used in m units only if the (n + 1)st name is used in k units taken from the second group. If the (n + 1)st name is used in i units, where m >= i >= k, then i - k of the units must be chosen from the first group.

There are \(\binom{m-k}{i-k}\) combinations of i - k units that can be selected from the first group of m - k units. There are, independently, \(\binom{M-(m-k)}{k}\) combinations of k units that can be selected from the second group of M - (m - k) units. The product of these two numbers is the number of ways i units may be selected out of M so that the n + 1 names are used in m units given that n names are used in m - k units. The probability that n names are used in m - k units is P(m - k, n).

By defining P(i, 1) to be the probability that a name is visible in i units, the same formula can be used to compute the expected number of units visible to a set of n + 1 names.
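
To make the computation concrete, the recurrence and formula (4.1) can be evaluated directly. The sketch below is illustrative rather than a reconstruction of the program actually used for the thesis; it assumes the single-name distribution is supplied as a list p1 in which p1[i] is the fraction of names used in exactly i units (for Descartes, for example, p1[4] = 30/711), and the function name is invented.

    from math import comb

    def expected_occupied(p1, M, n):
        """Expected number of compilation units "occupied" by a set of n names.

        p1[i] is the empirical probability that a single name is used in
        exactly i units; M is the total number of compilation units.
        Applies the inductive formula for P(m, n) and then formula (4.1)."""
        p1 = list(p1) + [0.0] * (M + 1 - len(p1))   # pad to length M + 1
        P = p1[:M + 1]                              # P[m] = P(m, 1)
        for _ in range(n - 1):                      # extend from j names to j + 1
            nxt = [0.0] * (M + 1)
            for m in range(M + 1):
                s = 0.0
                for i in range(m + 1):              # the new name is used in i units
                    if p1[i] == 0.0:
                        continue
                    for k in range(i + 1):          # k of those i units are "new"
                        s += (p1[i] * P[m - k]
                              * comb(m - k, i - k) * comb(M - (m - k), k)
                              / comb(M, i))
                nxt[m] = s
            P = nxt
        return sum(m * P[m] for m in range(M + 1))  # formula (4.1)

Running this with the Descartes use distribution should reproduce the use curve of Figure 4.4; substituting the visibility distribution gives the visibility curve, and the ratio of the two gives the occupancy-model estimate of the k-RUV plotted in Figure 4.5.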

Assumptions about the Distribution of Compilation Units

To model the use and visibility of sets of names as an occupancy problem, it is necessary to make certain assumptions about how units are assigned to names.

The first assumption is that the number of units assigned to each name is independent of the number assigned to any other name. Thus the probability that the nth name in a set has k clients is exactly the same as the probability that the ith name has k clients. While this would not be true if we were to generate sets of names, the effect of the assumption is minor as long as the number of names in the program is large compared with set size. (What we are doing is sampling with replacement when we should be sampling without replacement.)

The second assumption has a larger effect. We assume that units are assigned to a name at random so that all sets of compilation units of a given size are equally likely to be clients of the name. This means there is no relationship either between the units using a single name or between the units using a group of names. Such relationships clearly exist in practice. For example, if two units share the same view of an interface they will both use many of the same names, and if there is reason to change two names at the same time, chances are both names will share some clients.

A straightforward application of the occupancy problem makes it possible to estimate how visible the average Descartes .h file would be if its clients had been assigned to its names at random.

The Assignment of Clients to Names

In Descartes, the average name is used in 2.4 compilation units; the average .h file defines 25 names and is visible in 7.4 units. There are a total of 26 .c files, each of which is paired with a .h file. (We ignore the 2 .h files that are not so paired.) What if each of the 25 names in a .h file were assigned 2.4 clients at random? How visible would the average .h file be then?

The problem of estimating .h file visibility is simplified by assuming that each name has exactly 2 clients, one fixed (the paired .c file), and one chosen at random from the remaining 25 .c files. Since this simplification reduces the number of units assigned to each name, it produces a conservative estimate of .h file visibility. It also lets us map .h file visibility directly onto the occupancy problem: If the 25 names (balls) defined in a .h file are randomly distributed among 25 units (cells), the expected number of "occupied" units is 16. Thus had clients been assigned to names at random, the average .h file should have 17 clients--its paired .c file plus the 16 units using its 25 names. Conversely, the probability that a .h file defining 25 names would have fewer than 8 clients (including the paired .c file) is negligible (6.3 x 10^-9). If clients were chosen at random we would expect the average .h file with 8 clients to define about 7 names.
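
For reference, the 16-unit figure follows from the standard occupancy expectation (an intermediate step not spelled out above): distributing 25 names at random over 25 candidate units leaves each unit unoccupied with probability (24/25)^25, so

\[
E[\text{occupied units}] = 25\left(1 - \left(\tfrac{24}{25}\right)^{25}\right) \approx 25\,(1 - 0.36) \approx 16,
\]

and counting the paired .c file gives the expected 17 clients.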

This closer look at the assumption about the assignment of clients to names clearly shows that there is some relationship among the clients of the names defined in the same .h file. Because of these relationships, average use and visibility should be smaller for sets of names than indicated in Figure 4.4. However, it is not clear by how much, nor is it clear how relationships between the names in compilation units might affect the k-RUV.

Generating Name Sets to Compute the k-RUV

An alternative method for computing the k-RUV is to generate all sets of names of a given size along with their use and visibility sets. While this method does not model any relationship between the names in a set, it does produce, in proper proportion, only those use and visibility sets that occur in practice.

Unfortunately the combinatorics of generating all such sets quickly get out of hand. I was, however, able to generate all sets of 2 and 3 names and to compute their expected use, visibility and k-RUV. 6 These data points are marked in Figures 4.4 and 4.5.
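
Enumerating the sets themselves is conceptually simple. What follows is a sketch of the computation (not the program used for the thesis), assuming the cross-reference data is available as two mappings from each name to the set of units that use it and the set of units from which it is visible:

    from itertools import combinations

    def k_ruv_by_enumeration(use, visible, k):
        """Average use, average visibility, and k-RUV over all k-name sets.

        use and visible map each name to the set of compilation units that
        use it / from which it is visible (a hypothetical input format)."""
        names = list(use)
        total_use = total_vis = count = 0
        for group in combinations(names, k):
            total_use += len(set().union(*(use[n] for n in group)))
            total_vis += len(set().union(*(visible[n] for n in group)))
            count += 1
        return total_use / count, total_vis / count, total_use / total_vis

The expense is easy to see: with 711 names there are roughly 250,000 pairs and 60 million triples, which is consistent with sets of 4 or more names being out of reach.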

           Occupancy Model Estimation          Generated Name Sets
           Use   Visibility   RUV              Use   Visibility   RUV
2 names    4.6      19.3      0.24             4.5      17.5      0.26
3 names    6.6      22.6      0.29             6.4      19.6      0.32

Table 4.8: Differences in the Estimated and Generated k-RUV

Table 4.8 compares the values produced by the two methods of computing the k- RUV. For expected use, the difference in estimated and generated values is small. This is not surprising since most names are used by few enough clients that it is likely that 2 or 3 randomly selected names will have different client sets. For expected visibility, the difference is larger, probably because there is more overlap in the sets of clients visible to 2 or 3 names. If we extrapolate from these data points, we might expect a slower increase in average use and visibility than that of Figure 4.4 and a faster increase in the expected RUV.

The Number of Changes per Configuration versus the k-RUV

There is a third assumption underlying both methods of computing the k-RUV. This assumption is that all sets of names of a given size are equally likely to change. The historic data shows that this assumption is false. As a final comparison between the k-RUV and predicate performance, I consider a composite partial RUV based on the number of changes per configuration in the Descartes change history.

The ratio of NAME-USE to MAKE in Chapter 3 is the ratio of the actual number of compilation units using the names that changed in each of the revision groups analyzed to the actual number of units in which those names are visible. One way to assess the

6 My program, which was not optimized, had not produced results for sets of 4 names after running overnight on an Apollo DS 3000 workstation.

k-RUV is to compare these actual use and visibility numbers with the estimated use and visibility numbers computed using formula 4.1.

Of the 39 configurations with .h file changes examined in Chapter 3, 9 are inappropriate for such a comparison. The changes in these configurations consist only of initial revisions, changes to the umbrella .h file, null revisions or changes to a .h file that has no (lean) clients. Data on the remaining 30 configurations is shown in Table 4.9, sorted by the number of names that changed in each revision group.

The table shows that on average both estimated use and estimated visibility, representing groups of names chosen at random, are twice the corresponding actual values. This suggests that groups of names that are changed at the same time have more clients in common than arbitrarily chosen groups of names.

4.5 Summary

This chapter has explored in some depth the relationship between static patterns of name use and visibility and the predicate performance observed in Chapter 3. Because this exploration is complex, it is fitting to review its lessons once again.

What I set out to do in this chapter was to suggest (1) that the results of the previous chapter are not limited to the Descartes program and (2) that the relative performance of

NAME-USE and MAKE based on historic change data is conservative for changes as they take place. The data presented above support these conclusions. In particular, name use and visibility patterns in Descartes are similar to the same patterns in other programs. Thus it is unlikely that the predicate performance data reflects idiosyncrasies of the Descartes program. I also showed that the use and visibility patterns of the names that changed in the revisions analyzed in Chapter 3 are comparable to use and visibility patterns for the program at large. Thus if the same names were to change one at a time, the RUV would be a good predictor of predicate performance. This indicates that the observed predicate performance is due to the size, not the distribution, of changes.

To explore the relationship between the size of changes and static use and visibility patterns, I defined the k-RUV. While the k-RUV gives an indication of how predicate performance might vary according to the number of names changed at the same time, it depends on assumptions about how groups of names are selected for change that do not appear to hold true. In particular, comparisons with the historic data show that these assumptions overestimate both collective name use and name visibility by a considerable margin.

                       Names       Expected            Actual
                      Changed   Use   Visibility   Use   Visibility
Revision Group 19:       1      2.4      12.8        1        6
Revision Group 22:                                   0       22
Revision Group 29:                                   2       14
Revision Group 37:                                   2       14
Revision Group 38:                                   1        6
Revision Group 55:                                   3        3
Revision Group 61:                                   0       12
Revision Group 52:       2      4.6      19.3        2        2
Revision Group 14:       3      6.6      22.6        2       18
Revision Group 18:                                   9       11
Revision Group 20:                                   5       13
Revision Group 23:                                   4       11
Revision Group 34:                                   8       11
Revision Group 36:                                   9       11
Revision Group 41:                                   6        6
Revision Group 63:                                   4        9
Revision Group 13:       4      8.4      24.3        2       18
Revision Group 6:        5     10.1      25.1        1        3
Revision Group 49:       7     12.9      25.8        4        7
Revision Group 44:       8     14.2      25.9        6        7
Revision Group 28:       9     15.3      25.9        6       23
Revision Group 46:                                   2        2
Revision Group 48:                                   5        5
Revision Group 57:                                   1        8
Revision Group 24:      10     16.3      26.0        5       18
Revision Group 54:      16     20.6      26.0        9        9
Revision Group 45:      24     23.5      26.0       16       23
Revision Group 1:       47     25.7      26.0       14       19
Revision Group 51:                                  10       14
Revision Group 27:      96     26.0      26.0       12       12
average                        10.6      21.7       5.0     11.2

Table 4.9: Estimated versus Actual Use and Visibility

The most profound result of this chapter is undoubtedly the discovery that average name use was consistently between 2 and 3 for the names in all six programs analyzed and that furthermore the distribution of use over names was comparable in all programs. For Descartes at least, patterns of name use do not support any natural modularity of any scale.

Chapter 5

Recommendations and Conclusion

This thesis addresses the problem of controlling the cost of software manufacture. It is based on two observations. First, when incorporating changes in a configuration the only manufacturing steps that must be performed are those that ultimately produce a detectable difference in a software product. Second, there is a spectrum of techniques for determining a priori whether a given step may produce such a difference. These techniques, called difference predicates, differ in their method of operation and in their relative strength. Their effectiveness depends on program structure, on properties of the tools used in the manufacturing process, and on patterns of change. Through a small number of case studies designed to compare and explain predicate performance, the thesis shows that the use of an appropriate predicate can result in substantial savings in manufacturing costs. Moreover, these case studies also reveal some interesting facts about programs.

The value of this work is due in part to the results of the case studies of Chapters 3 and 4 and in part to the introduction of a fresh approach to software manufacture that led to those results. In the next sections, I make concrete recommendations for practice based on the results of the case studies and suggest additional studies to substantiate and extend those results. I then enumerate the contributions of the thesis and conclude by indulging in some speculation on its broader implications.

5.1 Recommendations

The results of Chapters 3 and 4 are based on a small number of programs and must be further substantiated. However, in the absence of contradictory evidence, these results are sufficiently clear to influence the practical decisions programmers and project managers must make today. Accordingly, the following recommendations are discussed in order of increasing levels of investment (and risk).

Organize programs to limit name visibility.

There are a number of simple, low-cost measures that individual programmers can use to reduce the visibility of names in programs. The profound effect on compilation costs of .h files that are included but not used in the Descartes system suggests that much can be gained from identifying and eliminating unused imports. Programmers should avoid techniques, such as the use of umbrella include files, that lead to the wholesale import of entire interface hierarchies.
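
As an illustration of the unused-import check suggested here, the following is a minimal sketch, assuming a cross-reference database can provide, for each unit, the .h files it includes and the names it references, and, for each .h file, the names it defines (the mapping names are invented for illustration):

    def unused_imports(includes, uses, defines):
        """For each unit, the .h files it includes but from which it uses no names.

        includes: unit -> set of included .h files
        uses:     unit -> set of names the unit references
        defines:  .h file -> set of names it defines"""
        return {u: {h for h in hs if not (defines.get(h, set()) & uses[u])}
                for u, hs in includes.items()}

Every entry this reports is an import whose removal reduces the unit's visibility without affecting its compilation.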

Another technique programmers should avoid is the use of catch-all interfaces. In C this is a particular problem. Since interfaces (.h files) and implementations (.c files) are not necessarily paired, C programmers often provide only one .h file per subsystem. Where interfaces are overly broad, it may be possible to partition a composite interface into smaller units each of which has more limited visibility. It may also be useful to break apart large implementation modules based on their use of imported interfaces.

Finally, the incidence of names with no more than one client indicates that many names may be removed from interfaces and either deleted entirely or made local to the single module in which they are used. Cross-reference tools are essential for the discovery of such names.

Limiting name visibility requires additional bookkeeping to track dependencies. While reductions in compilation costs may be modest, they are realized every time a system is recompiled. An investment in program organization also yields cleaner code that is easier to maintain.

Use cross-reference tools for use/visibility analysis.

While an interactive browser may be state-of-the-art, and grep 1 may be the tool of choice when searching for all the places a given identifier is mentioned, neither tool is useful for understanding cross-reference patterns. For example, neither is designed to answer queries like: "How many of the names defined in module A are used in module B?" Or: "On average, how many modules use the types declared in program P?"
1 The UNIX tool grep is a regular expression pattern matcher that reports all occurrences of a given expression in a given set of files.

Coupled with postmortem analysis tools, older batch-style cross-reference tools can be used to detect both names that are visible but not used and interfaces that are imported but not used, as well as answer other questions about general patterns of name use and visibility. Such tools make it easier for programmers to reorganize programs effectively. They are also essential to enlarge the name use and visibility studies of Chapter 4.
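
As an illustration of the kind of postmortem analysis intended here, the sketch below answers the two queries above from a flat table of (interface, name, using unit) triples; the table format and the identifiers in it are hypothetical, not taken from any particular tool:

    from collections import defaultdict

    # Hypothetical cross-reference dump: (defining .h file, name, using .c file).
    xref = [
        ("symtab.h", "sym_lookup", "parse.c"),
        ("symtab.h", "sym_lookup", "eval.c"),
        ("symtab.h", "sym_dump",   "debug.c"),
    ]

    def names_used(interface, unit):
        """How many of the names defined in `interface` are used in `unit`?"""
        return len({n for (i, n, u) in xref if i == interface and u == unit})

    def average_clients_per_name(interface):
        """On average, how many units use each name defined in `interface`?"""
        clients = defaultdict(set)
        for i, n, u in xref:
            if i == interface:
                clients[n].add(u)
        return sum(len(c) for c in clients.values()) / max(len(clients), 1)

Aggregated over all interfaces, the second query yields average-use statistics of the kind reported in Chapter 4.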

Unfortunately good cross-reference tools are not always easy to find. For example, to obtain the use and visibility data for C programs, I had to manually reprocess the available cross-reference data. To obtain data for Ada programs, Phil Levy built a special purpose tool.

Replace MAKE with NAME-USE in existing programming environments.

Because they require more information about a program than file names and change dates, selective recompilation techniques based on name use are incompatible with the timestamp approaches to software manufacture typical of existing file-based programming environments. However, despite the high cost of introducing name-based predicates in existing environments and the lack of commercial products, all the evidence of Chapters 3 and 4 points to the value of making recompilation decisions on a per name rather than a per file basis.

To develop a name-based predicate for a file-based environment it is necessary to produce a tool that processes interface units that have changed. For C this is especially difficult since in general .h files cannot be parsed individually. To determine what names have changed, the tool must maintain information about previous versions of interfaces. To efficiently identify where changed names are used, it must also maintain a complete cross-reference database. While the performance of a name-based recompilation tool depends on exactly how it is implemented, Tichy found that the execution cost of his smart-recompilation tool, developed as a research prototype, was amortized by avoiding a single compilation [62].
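
The contrast between the two decision rules can be stated compactly. The sketch below is only a schematic of the comparison: it assumes the environment can already supply each unit's include list and the set of names each unit references (the cross-reference database mentioned above), and it sidesteps the hard problems of versioning interfaces and of parsing .h files in isolation.

    def recompile_make(changed_interfaces, includes):
        """MAKE-style rule: rebuild every unit that includes a changed .h file
        (visibility).  `includes` maps each unit to its set of .h files."""
        return {u for u, hs in includes.items() if hs & changed_interfaces}

    def recompile_name_use(changed_names, uses):
        """NAME-USE-style rule: rebuild only units that reference a changed
        name.  `uses` maps each unit to the set of names it references."""
        return {u for u, names in uses.items() if names & changed_names}

The difference between the two result sets is precisely the redundant recompilation measured in Chapter 3.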

Avoid being too smart.

The comparative performance of DEMI-ORACLE and NAME-USE in the Descartes study of Chapter 3, reinforced by the name use data of Chapter 4, indicates that there is little to be gained by using predicates weaker than NAME-USE. The added complexity of the weaker predicates, reflected in higher development and execution costs and less robustness, is probably not balanced by savings in compilation costs. The same data indicates that language and program implementation techniques that defer bindings in order to short-circuit dependencies (for example, using record descriptors to compute field offsets at run-time or storing a constant in memory rather than embedding it in an instruction) are not in general effective for reducing compilation costs. While there may be projects where the use of weaker predicates or deferred binding techniques are valuable, these techniques should not be applied unless an analysis of the project itself suggests they are warranted.

The relative performance of GRATUITOUS indicates that changes to comments or whitespace do not constitute an important source of redundant compilations.

5.2 Further Empirical Studies and Other Research

This thesis raises many questions for further research. Some of these questions are considered in the paragraphs that follow. In particular, it would be worthwhile to undertake large-scale empirical studies designed to substantiate and refine the results of Chapters 3 and 4 and smaller studies designed to elicit new information about software manufacture. While the former require a substantial investment in tooling, the success of the case studies demonstrates the value of making the investment. Some tooling can be used as an adjunct to program development as described above.

Large intermodule name use studies.

The unexpected consistency of patterns of name use in the programs studied in Chapter 4 raises the most provocative question of the thesis: How universal are these patterns? In most programs, is the average name declared in an interface used in only 2 or 3 compilation units?

To answer this question, it is necessary to look at a large number of programs written in different languages. By doing so, patterns may emerge relating name use or visibility to program size, structure or development language. For example, how does the partitioning of large systems into multiple subsystems affect the locality of name visibility and use?

Doing studies on this scale requires automating the data collection process. For Ada programs running in the Rational environment this is feasible using Phil Levy's tool.

Large intramodule name use studies.

The use and visibility studies of Chapter 4 examined name use among compilation units, but did not consider how names are used within those units. There is no reason to believe that each name used in a compilation unit will be referenced by every subroutine or variable defined within the unit. In fact, the sparseness of name use among compilation units suggests that name use within units, themselves composite, may be comparably sparse. If so, then subroutine-at-a-time compilation may be an effective strategy for reducing compilation costs.

The predicate NAME-USE recompiles the minimum number of units according to the name use criterion. However, the number of lines compiled is inflated by unused symbols in interfaces. According to both Conradi and Wanvik, and Kamel and Gammage, unused symbols account for 75% to 85% of the symbols imported during compilation. Because the bodies of compilation units, like interfaces, are composite, one can speculate that there may be comparable overhead in processing unaffected parts of the unit body. Thus subroutine-at-a-time compilation performed in the context of minimal interfaces (that is, interfaces composed according to use, not visibility) has the potential for substantial savings in the number of lines compiled.

To evaluate the potential of subroutine-at-a-time compilation, it is necessary to investigate how references to imported names are distributed within a compilation unit. Unfortunately typical cross-reference tools do not provide the necessary functionality to do so.

Large in vivo change studies.

The Descartes study of Chapter 3 and Adams, Weinert and Tichy's similar study of an Ada program show that for historic changes, there is considerable redundancy in manufacture. Because historic changes represent sequences of immediate changes (those made during a single iteration of the edit-compile-debug cycle), for the programmer at work the size of the average change is probably smaller and the amount of redundancy in manufacture is probably higher than what is reflected in the historic record.

While historic studies may provide an accurate picture of the selection of names that change, they give conservative estimates of other change-related properties. Studies of the immediate changes made by programmers at work will answer questions about the actual frequency of interface changes and the average size of those changes. Large studies may help correlate the size of interface changes with the amount of redundancy in manufacture and reveal any relationship between use or visibility and the selection of names that change.

To carry out this kind of study on a large scale, it is necessary to instrument a widely used manufacturing environment. In a Stanford University study, Linton and Quong instrumented make to measure the effectiveness of a strategy for incremental link editing [42]. They found a ratio of 1 to 2 compilations per link, indicating that most changes affected only 1 or 2 modules. It is hard to reconcile this result with the frequency of interface changes seen in both the Descartes and in the Adams-Weinert-Tichy change histories along with the data on average name visibility from Chapter 4. For a brief discussion of the Linton-Quong study see Section 1.4.3.

Case study of fine program structure.

An advantage to carrying out case studies without the benefit of fully automated tooling is the insight such studies provide about the phenomenon under investigation. One such study that might be valuable is an exercise in representing a program as a collection of smaller entities. For example: What if the Descartes crostic client system was represented as a collection of individual subroutines, variables and type definitions? Would there be any natural modularity to the structure? How many separate units would there be? How big would the minimal interfaces needed to compile these units be? How many units would have interfaces in common?

Like the case studies reported here, the results of such a study may suggest patterns that can be substantiated by subsequent larger studies.

Data representation for subroutine-at-a-time compilation.

A major focus of current programming environment research is how to represent the data that is manipulated by the environment. With emerging object-oriented database technology, it is possible for an environment to maintain more detailed information with more complex structure [70]. Given the effectiveness of name-based predicates and their mismatch with traditional file-based environments, representations supporting subroutine-at-a-time (incremental) compilation based on name use may be the most effective strategy for reducing compilation costs. While this strategy is used by both the SMILE system (for semantic analysis) and the Rational Environment, the effectiveness of either system has not been analyzed. Because each system imposes certain constraints on development and because the Rational system is expensive, compilation costs are a secondary factor in the decision to use either system.

A challenge in programming environment research is to find appropriate data representations for different activities performed in the environment and, where necessary, to efficiently map between different representations. A representation supporting efficient recompilation must balance the overhead of managing complex data structures for a large number of units against the cost of redundant processing. What is the best representation for recompilation may not be the best representation for version control, or for software distribution, or for concurrent execution.

Additional studies.

Other studies suggested by the model have to do with evaluating strategies for using difference predicates and applying predicates to operations other than compilation. For example: Should the shape of a manufacturing graph affect the choice of predicate, especially when there are transitive dependencies as in Ada, when a manufacturing step has large fan-out, or when a manufacturing step produces multiple outputs? Does predicate cost indicate that cheaper, less effective predicates should be tried before a more expensive predicate like NAME-USE? What predicates are appropriate for tools like yacc, lex or idl?

5.3 Summary of Contributions

This thesis has made contributions in several areas. These contributions are summarized below.

The Model

This thesis began by defining a model of software manufacture based on a dependency graph representation of a software configuration and the use of difference predicates to control the amount of manufacture needed to incorporate changes. The dependency graph representation is versatile enough to represent arbitrary manufacturing steps and unique in that it represents tools, options and hidden dependencies at the same level as programmed components. Difference predicates represent the spectrum of techniques that may be used to produce "equivalent" software products consistent with a given set of primitives. Predicates differ in their method of operation, in their relative strength and in their relative effectiveness.

• The model raises questions of clear practical significance about the manufacturing process and provides a framework for effectively addressing those questions.

• While tools dedicated to rebuilding software have been available for some time, this research has been the first to distinguish software manufacture as an independent problem in the larger context of software configuration management and to demonstrate the value of treating it as such.

Incidence, Causes and Responses to Redundancy in Compilation

The model was applied to the problem of compilation in C by comparing the effectiveness of seven predicates applied to the change history of a small program, the Descartes crostic client. The results of that study were clear. Only two predicates are interesting: MAKE, which makes recompilation decisions based on name visibility, and NAME-USE, which makes recompilation decisions based on name use. The relative performance of these two predicates is corroborated by a recent study by Adams and colleagues.

The change history study was reinforced by examining static name use and visibility patterns in six C and Ada programs, including Descartes. While the six programs differed in other ways, in each program the average number of compilation units using a name defined in an interface was between 2 and 3.

• The Descartes change history study demonstrates that there is considerable redundancy in compilation - even for a small program written in C. It also suggests concrete steps that can be taken to eliminate that redundancy.

• The change history study not only quantifies the performance of NAME-USE relative to MAKE, it also quantifies the performance of NAME-USE relative to a practical upper bound on predicate performance. The results of the study suggest that a weaker predicate that considers the content of a change and the context of its use is not substantially better than NAME-USE, and that a stronger predicate that detects only changes to comments or whitespace is not substantially better than MAKE.

Insights about Program Structure

The unexpected consistency in patterns of name use led to insights about program structure and modularization. An analysis of the module interconnection structure of the Descartes crostic client indicates that while modules are clearly not random collections of names, few of the names defined in an interface are used by the same sets of clients.

• The name use studies of Chapter 4 provide concrete evidence suggesting that the average name defined in an interface is used in only 2 or 3 of the compilation units that import the interface. The same evidence suggests that many of the names defined in program interfaces are unused or used only in a single client.

• The thesis suggests that because grouping names into modules compounds their visibility, most unnecessary recompilations are the direct consequence of how we modularize software.

Platform for Further Work

The thesis set out to define and explore a paradigm for efficient manufacture that does not compromise reliability. In so doing, it has produced new insights about program structure as well as concrete recommendations for decreasing compilation costs. Nonetheless, the thesis has just begun to explore software manufacture.

• The success of the case studies of Chapters 3 and 4 provides method, material and incentive for further work to both broaden the scope of the thesis and to deepen its conclusions.

5.4 The Thesis in Perspective

The cost of software manufacture is one factor that influences how we organize software systems and design software production tools; other factors are intellectual and administrative control. While intellectual control is certainly the dominant problem for larger systems, the other two factors are far more tractable. Monolithic program development has been largely abandoned in favor of three paradigms that have evolved in response to the cost of introducing changes:

• The separately compiled languages are the medium of choice for the mainstream of software development and as such are the focus of the thesis. Using these languages, a system is built in pieces and recomposed by manufacture. The paradigm accommodates multiple programmers and produces a product that can be separated from its development environment.

• The interpreted languages, able to update a function at a time, sacrifice run-time efficiency and often reproducibility for flexibility during implementation. Notably with Lisp, the object of program development is to define the program itself, not to use it.

• Incremental programming environments that synthesize the other two paradigms are emerging. These environments have the potential to offer the control and efficiency of the separately compiled languages plus the flexibility of interpreted systems.

Ultimately, as we build faster processors and configure them to offer more parallelism, the techniques developed in this thesis may lose their importance for most development efforts. We certainly will continue to build larger and more ambitious systems that will strain the then existing technology, whatever it may be. We can speculate that the use of difference predicates will become even more important for such systems, especially those that gain leverage from tool-intensive development techniques such as the use of program generators or program transformers. The difference predicate approach to manufacture should also be applicable to other derivation relationships in program development including the relationships between specifications and designs, between designs and code, or between implementations and testing results, for example.

In future environments intellectual control over the development process will continue to be the dominant problem. If we can parse a million lines a minute on a single processor [28] and compile parts of a program concurrently [57], manufacturing costs will be less of an issue. It may become feasible, for example, to compile all but the largest systems monolithically. What will be important in such environments are facilities for viewing and navigating system structure, for understanding and manipulating the relationships between components, and for anticipating the effects of change.

What may be lasting about the thesis is its approach to software manufacture that considers the relationship between system structure and its sensitivity to changes. To the extent that manufacturing costs represent these underlying relationships, this approach will be important in how we design future programming environments and tools and how we process information.

Bibliography

[1] Rolf Adams, Annette Weinert, and Walter Tichy. Software change dynamics or half of all Ada compilations are redundant. In Proceedings of the European Software Engineering Conference, September 1989.

[2] James E. Archer, Jr. and Michael T. Devlin. Rational's experience using Ada for very large systems. In Proceedings of the First International Conference on Ada Programming Language Applications for the NASA Space Station, June 1986.

[3] N. Belkhatir and J. Estublier. Protection and cooperation in a software engineering environment. In Proceedings of the International Workshop on Advanced Programming Environments, pages 221-229, June 1986. Published by Springer-Verlag as Lecture Notes in Computer Science 244.

[4] Ellen Borison. A model of software manufacture. In Proceedings of the International Workshop on Advanced Programming Environments, pages 196-220, June 1986. Published by Springer-Verlag as Lecture Notes in Computer Science 244.

[5] P. M. Cashin, M. L. Joliat, R. F. Kamel, and D. M. Lasker. Experience with a modular typed language: PROTEL. In Proceedings of the 5th International Conference on Software Engineering, pages 136-143, March 1981.

[6] Geoffrey M. Clemm. The Odin System - An Object Manager for Extensible Software Environments. PhD thesis, University of Colorado, 1986.

[7] Reidar Conradi and Dag Heieraas Wanvik. Mechanisms and Tools for Separate Compilation. Technical Report 25/85, University of Trondheim, Norwegian Institute of Technology, November 1985.

[8] Jack Cooper. Software development management planning. IEEE Transactions on Software Engineering, 10(1):22-26, January 1984.

[9] Keith D. Cooper, Ken Kennedy, and Linda Torczon. Interprocedural optimization: eliminating unnecessary recompilation. In Proceedings of the SIGPLAN'86 Symposium on Compiler Construction, pages 58-67, June 1986. Published as SIGPLAN Notices 21:7 (July 1986).

[10] Lee W. Cooprider. The Representation of Families of Software Systems. PhD thesis, Carnegie-Mellon University, 1979.

[11] Eugene Cristofor, T. A. Wendt, and B. C. Wonsiewicz. Source control + tools = stable systems. In Proceedings of the 4th Computer Science and Applications Conference, pages 527-532, IEEE Computer Society, October 1980.

[12] Frank da Cruz. Kermit, A File Transfer Protocol. Digital Press, 1987.

[13] Manfred Dausmann. Reducing recompilation costs for software systems in Ada. Draft of a paper published in Proceedings of the IFIP Conference on System Implementation Languages: Experience and Assessment, North-Holland, Amsterdam, 1985.

[14] BLISS Language Guide. Digital Equipment Corporation, 1977.

[15] Frank DeRemer and Hans H. Kron. Programming-in-the-large versus programming-in-the-small. IEEE Transactions on Software Engineering, 2(2):80-86, June 1976.

[16] Reference Manual for the Ada Programming Language. United States Department of Defense, 1983. ANSI/MIL-STD-1815A-1983.

[17] Meyer Dwass. Probability and Statistics: An Undergraduate Course. W. A. Benjamin, Inc., Menlo Park, CA, 1970.

[18] V. B. Erickson and J. F. Pellegrin. Build - a software construction tool. AT&T Bell Laboratories Technical Journal, 63(6), July-August 1984.

[19] J. Estublier, S. Ghoul, and S. Krakowiak. Preliminary experience with a configuration control system for modular programs. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 149-156, April 1984. Published as SIGPLAN Notices 19:5 (May 1984).

[20] Jacky Estublier. Configuration management: the notion and the tools. In Proceedings of the International Workshop on Software Version and Configuration Control, Grassau, FRG, pages 38-61, January 1988.

[21] Peter H. Feiler, Susan A. Dart, and Grace Downey. Evaluation of the Rational Environment. Technical Report CMU/SEI-88-TR-15, Carnegie-Mellon University, Software Engineering Institute, July 1988.

[22] Peter H. Feiler and Gail E. Kaiser. Granularity Issues in a Knowledge-Based Programming Environment. Technical Report SEI-86-TM-11, Carnegie-Mellon University, Software Engineering Institute, September 1986. Presented at the 2nd Kansas Conference on Knowledge-Based Software Development, Manhattan KA, October 1986.

[23] Stuart I. Feldman. Evolution of MAKE. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 413-416, January 1988.

[24] Stuart I. Feldman. Make - a program for maintaining computer programs. Software--Practice and Experience, 9(4):255-265, April 1979.

[25] Glenn S. Fowler. The fourth generation Make. In Proceedings, Summer 1985 USENIX Conference, pages 159-174, June 1985.

[26] Geir A. Green, Svein O. Hallsteinsen, Dag H. Wanvik and Lars Nokken. Separate compilation in Chipsy. In Proceedings of the Second Chill Conference, March 1983.

[27] Svein O. Hallsteinsen and Dag H. Wanvik. Separate compilation in Chipsy. February 1985. Revision of [26].

[28] R. Nigel Horspool and Michael Whitney. Even Faster LR Parsing. Technical Report DCS-114-IR, University of Victoria, May 1989.

[29] Andrew Hume. Mk: a successor to make. In Proceedings, Summer 1987 USENIX Conference, pages 445-457, 1987.

[30] Gail E. Kaiser and Peter H. Feiler. Intelligent assistance without artificial intelligence. In Proceedings of the Thirty-Second IEEE Computer Society International Conference (COMPCON), February 1987.

[31] Gail E. Kaiser and A. Nico Habermann. An environment for system version control. In Digest of Papers COMPCON Spring 83, pages 415-420, IEEE Computer Society, February 1983.

[32] R. F. Kamel and N. D. Gammage. Further experience with separate compilation at BNR. In Proceedings of the IFIP Conference on System Implementation Languages: Experience and Assessment, North-Holland, Amsterdam, 1985.

[33] Ragui F. Kamel. Effect of modularity on system evolution. IEEE Software, 4(11):48-54, January 1987.

[34] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice Hall, Englewood Cliffs, New Jersey, second edition, 1988.

[35] D. E. Knuth. An empirical study of FORTRAN programs. Software--Practice and Experience, 1(2):105-133, April-June 1971.

[36] Charles W. Krueger. The SMILE reference manual. In The GANDALF System Reference Manuals, May 1986. Carnegie-Mellon University Technical Report Number CMU-CS-86-130.

[37] Butler W. Lampson and Eric E. Schmidt. Organizing software in a distributed environment. In Proceedings of the SIGPLAN '83 Symposium on Programming Language Issues in Software Systems, pages 1-13, March 1983. Published as SIGPLAN Notices 18:6 (June 1983).

[38] Butler W. Lampson and Eric E. Schmidt. Practical use of a polymorphic applicative language. In Conference Record of the Tenth Annual ACM Symposium on Principles of Programming Languages, pages 237-255, January 1983.

[39] David B. Leblang and Gordon D. McLean, Jr. Configuration management for large-scale software development efforts. In Proceedings of the Workshop on Software Engineering Environments for Programming-in-the-Large, pages 122-127, June 1985.

[40] David B. Leblang and Robert P. Chase, Jr. Computer-aided software engineering in a distributed workstation environment. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 104-112, April 1984. Published as SIGPLAN Notices 19:5 (May 1984).

[41] Brian T. Lewis. Experience with a system for controlling software versions in a distributed environment. In Proceedings of the Symposium on Application and Assessment of Automated Tools for Software Development, pages 210-219, November 1983.

[42] Mark A. Linton and Russell W. Quong. A macroscopic profile of program compilation and linking. IEEE Transactions on Software Engineering, SE-15(4):427-436, April 1989.

[43] Axel Mahler and Andreas Lampen. An integrated toolset for engineering software configurations. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 191-200, November 1988. Published as SIGPLAN Notices 24:2 (February 1989).

[44] Josephine Micallef and Gail E. Kaiser. Version and configuration control in distributed language-based environments. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 119-143, January 1988.

[45] James H. Morris, Mahadev Satyanarayanan, Michael H. Conner, John H. Howard, David S. H. Rosenthal, and F. Donelson Smith. Andrew: a distributed personal computing environment. Communications of the ACM, 29(3):184-201, March 1986.

[46] K. Narayanaswamy and Walt Scacchi. Maintaining configurations of evolving software systems. IEEE Transactions on Software Engineering, SE-13(3):324-334, March 1987.

[47] John R. Nestor. Toward a persistent object base. In Proceedings of the International Workshop on Advanced Programming Environments, pages 372-394, June 1986. Published by Springer-Verlag as Lecture Notes in Computer Science 244.

[48] John R. Nestor and Margaret A. Beard. Front end generator system. In Computer Science Research Review (1980-1981), pages 75-92, 1982. An annual report published by the Department of Computer Science, Carnegie-Mellon University, PA.

[49] John R. Nestor, Joseph M. Newcomer, Paola Giannini, and Donald L. Stone. IDL: The Language and Its Implementation. Prentice Hall, Englewood Cliffs, New Jersey, 1989.

[50] Dewayne E. Perry. Version control in the Inscape environment. In Proceedings of the 9th International Conference on Software Engineering, pages 61-69, March 1987.

[51] Mark Rain. Avoiding trickle-down recompilations in the Mary2 implementation. Software--Practice and Experience, 14(12):1149-1157, December 1984.

[52] Marc J. Rochkind. The source code control system. IEEE Transactions on Software Engineering, 1(4):364-370, December 1975.

[53] Graham Ross. Integral C - a practical environment for C programming. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 42-48, December 1986. Published as SIGPLAN Notices 22:1 (January 1987).

[54] Andres Rudmik and Barbara G. Moore. An efficient separate compilation strategy for very large programs. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pages 301-307, June 1982. Published as SIGPLAN Notices 17:6 (June 1982).

[55] Eric Emerson Schmidt. Controlling Large Software Development in a Distributed Environment. PhD thesis, University of California, Berkeley, 1982.

[56] Robert W. Schwanke and Gail E. Kaiser. Technical correspondence: smarter recompilation. ACM Transactions on Programming Languages and Systems, 10(4):627-632, October 1988.

[57] V. Seshadri, D. B. Wortman, M. D. Junkin, S. Weber, C. P. Yu and I. Small. Semantic analysis in a concurrent compiler. In Proceedings of the SIGPLAN'88 Conference on Programming Language Design and Implementation, pages 233-240, June 1988. Published as SIGPLAN Notices 23:7 (July 1988).

[58] Mary Shaw, Ellen Borison, Michael Horowitz, Tom Lane, David Nichols, and Randy Pausch. Descartes: a programming-language approach to interactive display interfaces. In Proceedings of the SIGPLAN'83 Symposium on Programming Language Issues in Software Systems, pages 100-111, June 1983. Published as SIGPLAN Notices 18:6 (June 1983).

[59] Barbara J. Staudt, Charles W. Krueger, A. N. Habermann, and Vincenzo Ambriola. The GANDALF System Reference Manuals. Technical Report CMU-CS-86-130, Carnegie-Mellon University, May 1986.

[60] The Network Software Environment. Sun Microsystems, Inc., 1989.

[61] Walter F. Tichy. Design, implementation, and evaluation of a revision control system. In Proceedings of the 6th International Conference on Software Engineering, pages 58-67, September 1982.

[62] Walter F. Tichy. Smart recompilation. ACM Transactions on Programming Languages and Systems, 8(3):273-291, July 1986.

[63] Walter F. Tichy. Software Development Control Based on System Structure Description. PhD thesis, Carnegie-Mellon University, 1980.

[64] Walter F. Tichy. Tools for software configuration management. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 1-20, January 1988.

[65] Mary Pfreundschuh Wagner and Ray Ford. Using attribute grammars to control incremental, concurrent builds of modular systems. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 283-304, January 1988.

[66] W. M. Waite, V. P. Heuring, and U. Kastens. Configuration control in compiler construction. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 159-168, January 1988.

[67] James W. Wendorf. Operating System / Application Concurrency in Tightly-Coupled Multiple-Processor Systems. PhD thesis, Carnegie-Mellon University, 1987.

[68] J. F. H. Winkler, editor. Proceedings of the International Workshop on Software Version and Configuration Control, B. G. Teubner, Stuttgart, FRG, 1988.

[69] Alexander L. Wolf, Lori A. Clarke, and Jack C. Wileden. The AdaPIC tool set: supporting interface control and analysis throughout the software development process. IEEE Transactions on Software Engineering, 15(3):250-263, March 1989.

[70] Stanley B. Zdonik and David Maier. Fundamentals of object-oriented databases. In Readings in Object-Oriented Database Systems, pages 1-32, Morgan Kaufmann Publishers, Inc., San Mateo, California, 1990.