Program Changes and the Cost of Selective Recompilation

Ellen Ariel Borison
July 1989
CMU-CS-89-205

Submitted to Carnegie Mellon University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Copyright © 1989 Ellen A. Borison

This research was sponsored in part by a Xerox Special Opportunity Fellowship and in part by the Defense Advanced Research Projects Agency (DOD), ARPA Order No. 4976, Amendment 20, under contract number F33615-87--1499, monitored by the Avionics Laboratory, Air Force Wright Aeronautical Laboratories, Aeronautical Systems Division (AFSC), United States Air Force, Wright-Patterson AFB, Ohio 45433-6543.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Xerox Corporation, the Defense Advanced Research Projects Agency or the US Government.

to the Lackowitz sisters: my mother, Rosaline (1922-1984) and her sisters, Hilda Goldstein (1922-1976) and Silvia Grossman

and to my father, Herbert

Abstract

When module boundaries are dissolved, a program can be seen simply as a collection of named objects (procedures, types, variables, etc.) that reference one another. Program designers impose order on such collections by grouping somehow-related names into modules. Informed by over twenty years of software engineering wisdom, we trust that modularization makes programs easier to control, maintain, and understand. However, not all of the effects of modularity are beneficial. This research suggests that most of the recompilations performed after a change to an interface are redundant and that this redundancy is a direct consequence of how we modularize software systems.

This conclusion is based on the careful analysis of a small number of C and Ada programs. This analysis in turn is based on a model of software manufacture that specifically addresses the question of how much work has to be done to incorporate a given set of changes consistently into a given software product.

In each program analyzed, the average name defined in an interface is used in 2 or 3 compilation units; the average interface is used in 7 to 25 units. Thus if a programmer were to change a single name chosen at random from an arbitrary interface and then, like the UNIX tool make, compile every compilation unit using that interface, we would expect between 6 and 9 out of every 10 compilations to be unnecessary. This phenomenon extends to the purposeful changes made during program development. For historic changes made to one program, fewer than half the compilations performed after an interface change were actually necessary, even though the data was treated conservatively both by grouping changes and by eliminating spurious interconnections.

This work corroborates practical experience as well as observations made by other researchers, solidly tying together a collection of disparate evidence into a coherent picture of software manufacture. It validates the approach taken by some programming environments to use an underlying flat (i.e. non-modular) representation of program objects and, to the extent that recompilation costs reflect general program complexity, leads us to question some basic assumptions about modularization.

Acknowledgements

I would not have completed this work without the support and enthusiasm of my thesis advisor, Jim Morris. The case studies of Chapters 3 and 4 reflect his talent for gently directing my attention to fruitful problems. The other members of my thesis committee, Bob Ellison, Joe Newcomer, Gene Rollins and Dave Wortman, provided useful and timely feedback as each stage of the work progressed. Joe Newcomer, especially, always displayed confidence in my approach to software manufacture. My long-time officemates Paola Giannini and Roberto Minio, and frequent luncheon companion Pedro Szekely, among many others, made CMU an enjoyable place to work. I am deeply indebted to Phil Levy of Rational, Inc. for writing the necessary code and collecting the raw data on the Ada programs that I used in the case studies of Chapter 4. Phil was completing his own thesis at the time.

John Nestor and Reidar Conradi each read a preliminary draft of the thesis and made comments that have clarified its presentation. Reidar pointed out the relationship between my name use and visibility studies and observations on the "big inhale" made by his group and by Ragui Kamel and colleagues at Bell Northern Research. John has been a sometimes reluctant sounding board for as-of-yet poorly formulated ideas. My brother, Adam, reviewed the assumptions on which I based the analysis in Chapter 4. Several people helped me with the mechanics of the research. I want to specifically thank Susan Straub of the CMU Information Technology Center, and Grace Downey and Susan Dart of the Environments Group at the CMU Software Engineering Institute. Susan Straub helped me ship a copy of the code for Vice to a computer where I could analyze it. Grace Downey showed me the interactive cross-reference capabilities of the Rational Environment and helped me get a copy of the code for the Rational Kermit program. Susan Dart not only helped me find reference material, but also annotated a copy of her report analyzing the Rational Environment so that I could easily find the material most relevant to my needs. I also want to thank Joe Newcomer for the extended loan of a Macintosh for figure production.

Finally, I want to thank my father, an accomplished scientist, who taught me by his example the meaning of research.

Contents

1 Introduction
1.1 The Consequences of Modularization
1.2 The Thesis
1.2.1 The Model
1.2.2 An Analysis of Compilation Costs
1.2.3 Name Use versus Name Visibility
1.2.4 Conclusions and Recommendations
1.3 Background
1.3.1 Examples of Problems in Software Manufacture
1.3.2 Software Manufacture
1.3.3 Related Activities
1.4 Directly Related Work
1.4.1 Software Manufacturing Systems
1.4.2 Selective Recompilation
1.4.3 A Profile of Compiling and Linking

2 The Model
2.1 Pitfalls of Software Manufacture
2.2 The Representation of a Software Configuration
2.2.1 Components
2.2.2 Manufacturing Steps and Step Schemas
2.2.3 Manufacturing Graphs and Graph Schemas
2.2.4 Encapsulated Subgraphs
2.3 Examples of Manufacturing Graph Schemas
2.3.1 Conventional Compilation Strategies for C and Ada
2.3.2 C Compilation in the SMILE Programming Environment
2.3.3 Generating the Tartan Lexical Analyzer Subsystem
2.3.4 Bootstrapping the Mini IDL Tools


2.4 The Instantiation of a Software Configuration
2.4.1 Change, Context and the Incidence of Redundant Manufacture
2.4.2 What it Means for Two Products to be Effectively Indistinguishable
2.4.3 Difference Predicates
2.5 The Cost of Selective Manufacture

3 An Analysis of Compilation Costs
3.1 An Overview of the Study
3.2 The Seven Difference Predicates
3.3 The Descartes Project
3.3.1 Why Descartes?
3.3.2 Descartes Programming Conventions
3.3.3 The Descartes Change History
3.4 The Method
3.4.1 From Change History to Configurations
3.4.2 Measuring Compilation Costs
3.4.3 The Data Collected
3.5 The Results of the Study
3.5.1 The Effect of Superfluously Included .h Files
3.5.2 The Relationship Between the Predicates
3.5.3 The Size of Compiled Files
3.6 Discussion

4 The Distribution of Names in Interfaces
4.1 The Ratio of Use to Visibility
4.1.1 Computing Use/Visibility Numbers for C
4.1.2 Computing Use/Visibility Numbers for Ada
4.1.3 The RUV versus the Big Inhale
4.1.4 Partial RUVs
4.2 The Six Program Study
4.2.1 The Programs
4.2.2 The Data in Table 4.1
4.2.3 The Effect of Compensating for Language Differences on the Concrete RUV
4.2.4 The Unit RUV versus the Partial RUVs of the Big Inhale
4.3 Patterns of Use and Visibility
4.3.1 The Effect of Modularity in Descartes

4.4 The RUV versus Predicate Performance
4.4.1 The Distribution of Name Changes in the Descartes History
4.4.2 The Effect of Grouping Name Changes in the Descartes History
4.5 Summary

5 Recommendations and Conclusion
5.1 Recommendations
5.2 Further Empirical Studies and Other Research
5.3 Summary of Contributions
5.4 The Thesis in Perspective

Bibliography

List of Figures

1.1 Modularization and the Compounding of Interconnections
1.2 A Simple Case of Version Skew

2.1 A Schema for Yacc
2.2 A Rudimentary Manufacturing Graph
2.3 A Schema for the UNIX cc Command
2.4 The UNIX cc Command Encapsulated
2.5 A Generic Manufacturing Graph Schema for a C Program
2.6 A Typical Manufacturing Step Schema for a C Program
2.7 A Generic Manufacturing Graph Schema for an Ada Program
2.8 A Typical Manufacturing Step Schema for an Ada Program
2.9 The Source of an Unnecessary Recompilation to Prevent Version Skew
2.10 A Generic Manufacturing Graph Schema for SMILE
2.11 A Partial Schema for the Generation of a Lexical Analyzer
2.12 A Manufacturing Graph Schema for Mini IDL
2.13 The Successful Application of a Difference Predicate
2.14 The Successful Application of a Partial Difference Predicate

3.1 A Simple Categorization of 572 Compilations
3.2 The Relative Strength of the Seven Predicates
3.3 Predicate Performance Relative to BIG BANG
3.4 The Compound Effect of Parsimony in Interconnection and Manufacture

4.1 The Use and Visibility of Names in Ada
4.2 Name Use Density in Three C and Three Ada Programs
4.3 Cumulative Name Use and Visibility for Three C and Three Ada Programs
4.4 Expected Use and Visibility for Groups of Names in Descartes
4.5 The k-RUV for Descartes

List of Tables

3.1 A Comparison of Three Approaches to Software Manufacture
3.2 The Same Comparison Based on Number of Lines Compiled
3.3 Summary of Seven Predicates
3.4 Summary of the Descartes Change History
3.5 Summary of Descartes Revision Groups
3.6 Four Revision Groups from the Change History of the Crostic Client
3.7 Average Predicate Performance
3.8 Cumulative Predicate Performance
3.9 Predicate Performance Relative to BIG BANG
3.10 Average Size of the Files Compiled by each Predicate

4.1 Comparative Name Use and Visibility for 3 C and 3 Ada Programs
4.2 Clients of C Externs that are not Declared in .h Files
4.3 Effect of Superfluously Included .h Files in C
4.4 The Effect of Counting the Declaring Specification as a Client in Ada
4.5 Average Per Unit Use/Visibility Ratios
4.6 Descartes' RUV versus Relative Predicate Performance
4.7 Comparative Name Use and Visibility for Historic Changes
4.8 Differences in the Estimated and Generated k-RUV
4.9 Estimated versus Actual Use and Visibility


Chapter 1

Introduction

Regardless of the language or environment they use, most experienced programmers have been frustrated repeatedly by their inability to regenerate a system quickly and accurately after having made a change. Many details can go wrong: steps may be performed in the wrong order, with the wrong parameters, using the wrong tools or on the wrong versions of program components. Sometimes a problem will show up as a failure of the system to rebuild; sometimes (particularly when tools provide inadequate checking) it will show up only in program execution. Often the cause is outside the programmer's control. The behavior of a program after it has been rebuilt might change for no apparent reason; bugs thought to have been fixed might reappear. Even when the details are under control, the amount of time necessary to rebuild a system can leave the programmer idling impatiently while apparently remote parts of the system are being regenerated. Sometimes a seemingly innocuous change will trigger hours of compilation.

Project managers have it worse than programmers because they are responsible for an entire system and must coordinate the efforts of many programmers. On large projects an error of judgement in scheduling the integration of changes can suspend progress for weeks. At least one authority claims that the most difficult decisions a project manager must make are when and how often to schedule complete system builds and how much regression testing to perform afterwards [8].

Today most production programming is done using higher level languages (like Ada, C, Fortran, or Pascal), which are processed by compilers. While such languages undoubtedly will dominate software development technology for some time, compilers are not the only tools at a programmer's disposal. For this reason, I use the term software manufacture to describe the process by which a software product is generated from the programmed components of a system.

The goal of this research has been to define and to explore the applicability of a general paradigm for efficient software manufacture that does not compromise reliability. Case studies of program changes and the cost of selective recompilation suggest that most recompilations performed using standard techniques are redundant and that this redundancy is a direct consequence of how we modularize software systems. This work corroborates practical experience as well as the observations of other researchers, validates the approach taken by some environments to use an underlying flat representation of program objects (e.g. SMILE [30]), and raises questions about some assumptions about modularity.

After offering an observation about the relationship between modularization and the cost of recompilation that became clear during this research, I present an overview of the thesis, motivate its approach and aims, and review related work. In Chapter 2, I present a paradigm for software manufacture with examples. In Chapters 3 and 4, I explore the applicability of that paradigm to the recompilation of C and Ada programs. Finally, I present conclusions and recommendations in Chapter 5.

An Aside Introducing The Programs Studied

This thesis consists of both an abstract model and a set of observations based on quantitative data resulting from applying the model to actual manufacturing problems. Most of the quantitative results (presented in Chapters 3 and 4) come from studying one program, the Descartes crostic client. This 12,000 line program, which implements an acrostic puzzle game, was designed to exercise user interface management software developed as part of the Descartes project [58]. I chose to study the crostic client because its size is tractable, because, as a Descartes developer, I was familiar with the code, and because a complete change history was available.

I also looked at two additional C programs: the Andrew File System [45] developed at Carnegie Mellon University's Information Technology Center, and a version of Kermit [12] obtained from the Columbia University archives. In addition, Phil Levy of Rational, Inc. provided summary cross-reference data on three programs written in Ada: a Kermit subset, and parts of two debuggers.

Some of the examples in Chapter 2 are taken from the above programs; others are based on program generation technologies originally developed at Carnegie Mellon University as part of the PQCC project [48] and later commercialized at Tartan Laboratories. Finally, the Mini IDL system developed by Nestor and Stone [49] provides an excellent example of a bootstrapped system.

1.1 The Consequences of Modularization

When a programmer changes the definition of a name declared in an interface, typical recompilation strategies require the recompilation of all compilation units in which the changed name is visible. Recompilation is necessary, however, only for those units in which the changed name is actually used. This research suggests that the underlying cause of most unnecessary recompilations is the sparsity of name use relative to name visibility; that is, only a fraction of those units that might reference a given name actually do reference the name. This is a direct consequence of how we modularize software systems.
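A minimal C sketch of the distinction, using hypothetical names (none of them taken from the programs studied): a header makes every name it declares visible to each client that includes it, but a given client typically uses only a few of those names.

    /* A hypothetical interface (imagine this is screen.h).  Every name
     * declared here is visible in any compilation unit that includes
     * the header. */
    extern void screen_init(void);
    extern void screen_refresh(void);
    extern int  screen_width(void);
    extern int  screen_height(void);

    /* A hypothetical client (imagine this is clock.c, which includes
     * screen.h).  It uses only screen_refresh.  If the declaration of
     * screen_width changes, a tool like make recompiles this unit
     * because the changed name is visible here; but the object code
     * produced cannot change, because the name is not used here, so
     * the recompilation is redundant. */
    void clock_tick(void)
    {
        screen_refresh();
    }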

Modularization can be seen in two ways. It can be viewed either as a breaking apart of the entire corpus of a program into smaller units or as the gathering together of individual definitions into larger units. (In either case, units are formed according to principles of data abstraction and information hiding.) It is the latter view that is adopted here. The results of this thesis suggest that the unit size appropriate for information management is not necessarily the unit size appropriate for recompilation.

If we were to dissolve module boundaries and look at the fine-grained structure of a program, we would see that a program is simply a collection of named objects that reference one another. Program designers impose order on such collections by grouping names into modules and by replacing references between names with references between modules. Unless the programming environment remembers which names reference which other names, the effect of modularization is to compound the number of interconnections. When a name in one module references a name in a second module, the environment must behave as if each name in the first module references each name in the second.

This can be illustrated by a simple gedanken experiment. Starting with a graph that might represent the connections between names, individual nodes are grouped into clusters and connections between nodes are replaced with connections between their containing clusters. This is an information-losing transformation: the cluster graph contains fewer nodes and connections than the original graph. To reconstruct the finer-grained original graph from the cluster graph, it is necessary to assume that each node in a cluster is connected to each node in every connected cluster. There is a potential explosion of interconnections in the reconstructed graph depending on the connectivity of the original graph and the way nodes are selected for clustering.

An example of this experiment is shown in Figure 1.1. Figure 1.1(a) is a graph of 12 nodes with 20 arbitrary connections. In Figure 1.1(b) individual nodes are clustered arbitrarily, simply by grouping adjacent nodes into pairs, and connections between individual nodes are replaced by connections between pairs. (The figure does not show connections between nodes in the same pair.) Figure 1.1(c) again shows the original nodes plus all connections between nodes that are implied by the connections between pairs. Where in Figure 1.1(a) there may have been only a single connection between two of the four nodes in two pairs, in Figure 1.1(c) there are four connections between the four nodes. (Connections between nodes in the same pair are as in the original figure.)

Figure 1.1: Modularization and the Compounding of Interconnections. (a) Original Connections; (b) Nodes Grouped in Pairs; (c) Implied Interconnections.

While neither the pattern of references between names nor the collection of names into modules is arbitrary in programs, Figure 1.1 is representative of the phenomenon observed in the case studies reported in Chapter 4. In fact, in five of the six case studies the compounding of interconnections attributable to modularization is even greater than that illustrated in the figure. The almost complete interconnection of Figure 1.1(c), however, is an artifact of the small size of the example and the relative density of the original connections in Figure 1.1(a).
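The gedanken experiment is easy to mechanize. The following C sketch uses the same node count as the figure, but the edges are invented for illustration rather than read off Figure 1.1; it clusters nodes into pairs and counts the node-to-node connections that must be assumed once only cluster-to-cluster connections are remembered.

    #include <stdio.h>

    #define N 12                  /* nodes, as in the gedanken experiment */
    #define C (N / 2)             /* clusters: node i goes to cluster i/2 */

    /* An arbitrary sparse set of connections between nodes.  These
     * edges are made up, not copied from Figure 1.1; any sparse graph
     * shows the same compounding effect. */
    static const int edge[][2] = {
        {0,2},{0,5},{1,7},{2,9},{3,4},{3,10},{4,8},{5,6},{6,11},{7,8},
        {9,11},{1,4},{2,6},{5,10},{8,11},{0,9},{1,10},{3,7},{4,6},{5,9}
    };

    int main(void)
    {
        int connected[C][C] = {{0}};
        int nedges = (int)(sizeof edge / sizeof edge[0]);
        int i, a, b, implied = 0;

        /* Clustering: replace each connection between nodes with a
         * connection between their containing clusters.  Connections
         * inside a single cluster are dropped, as in the figure. */
        for (i = 0; i < nedges; i++) {
            a = edge[i][0] / 2;
            b = edge[i][1] / 2;
            if (a != b)
                connected[a][b] = connected[b][a] = 1;
        }

        /* Reconstruction: with no record of which names reference
         * which, every node in a cluster must be assumed connected to
         * every node in each connected cluster -- 2 x 2 = 4 implied
         * connections for each pair of connected clusters. */
        for (a = 0; a < C; a++)
            for (b = a + 1; b < C; b++)
                if (connected[a][b])
                    implied += 2 * 2;

        printf("original connections between nodes: %d\n", nedges);
        printf("implied node-to-node connections:   %d\n", implied);
        return 0;
    }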

1.2 The Thesis

This thesis is based on a model (or paradigm) of software manufacture that specifically addresses how much work has to be done to incorporate a given set of changes consistently into a given software product. Using this model to analyze the compilation costs associated with the change history of a small C program, I found that most compilations associated with an interface change are redundant and that those redundant compilations can most effectively be avoided by using simple name-based selective recompilation techniques. I then looked at a handful of C and Ada programs to corroborate and explain this phenomenon.

1.2.1 The Model

The model for efficient manufacture defined in Chapter 2 is based on a dependency graph representation of a software configuration called a manufacturing graph. This representation establishes a baseline for consistency in manufacture. It is philosophically akin to the system models described by Lampson and Schmidt [38] but, having less semantic content, avoids the arcane linguistic constructs and the type-theoretic issues that make the System Modeler difficult to understand and implement. Unlike most other representations (including Lampson and Schmidt's), manufacturing graphs are programming language and tool independent and easily accommodate difficult manufacturing problems such as the cyclic dependencies found in bootstrapped systems, the use of program generators, or the incorporation of premanufactured components.

Software manufacture is the process of instantiating a manufacturing graph schema. When the schema is derived from the graph representing an already manufactured configuration by changing one or more initial (source) components, this instantiation can be done in two ways: by generating new derived components as needed, or by appropriating already manufactured components where their substitution will go unnoticed. The process is controlled by difference predicates. Difference predicates generalize the selective recompilation mechanisms described by Tichy [62] and others. These mechanisms are discussed in Section 1.4.2.
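As a rough illustration only (the types and names below are hypothetical, not the thesis's notation), a difference predicate can be pictured in C as a function that examines the old and new versions of an input component and reports whether a consuming step could produce a distinguishable output; if no input reports a difference, the already manufactured output is appropriated rather than regenerated.

    #include <stdbool.h>

    struct component;          /* a source or derived object (opaque here) */

    /* Returns true when the change from old_version to new_version
     * could affect the output of the step consuming this component; a
     * conservative predicate may always return true and simply rebuild. */
    typedef bool (*difference_predicate)(const struct component *old_version,
                                         const struct component *new_version);

    struct manufacturing_step {
        const struct component **inputs;   /* e.g. a .c file plus the .h files it includes */
        int                      ninputs;
        struct component        *output;   /* e.g. the corresponding .o file */
        difference_predicate     differs;  /* predicate chosen for this step */
    };

    /* Rerun a step only if some input differs, in the predicate's
     * sense, from the version used when the output was last
     * manufactured; otherwise the existing output can be kept. */
    static bool must_regenerate(const struct manufacturing_step *step,
                                const struct component *const *previous_inputs)
    {
        for (int i = 0; i < step->ninputs; i++)
            if (step->differs(previous_inputs[i], step->inputs[i]))
                return true;
        return false;
    }

Under this picture, a make-style timestamp check and a check restricted to the names a compilation unit actually uses are simply different choices of predicate.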

Chapter 2 describes various ways difference predicates may be applied and the conditions they must satisfy to be used safely. Both the predicates and the underlying representation are illustrated with examples from real programs and tools.

1.2.2 An Analysis of Compilation Costs

Chapter 3 represents a narrowing of focus from the general paradigm of Chapter 2 to a specific case study of compilation costs in C.

The predicate-driven model for software manufacture of Chapter 2 is intended to be used as a springboard for approaching manufacturing problems: for assessing manufacturing costs, for evaluating existing technology, and as the basis of environment, tool, and program design. In Chapter 3 that model is applied to compare the efficacy of seven predicates on almost 200 revisions recorded in the change history of the Descartes crostic client. Roughly 40% of these revisions represent interface changes (changes to .h files).

One of the results of this study is that 64% of the compilations that would have been initiated by make [24] in response to an interface change are redundant. Of these redundant recompilations, 89% are due to changes to names that are visible in a compilation unit but are not used in that unit. Because only a small number of additional redundant compilations can be detected using a more selective predicate that considers both the nature of the change and how a changed name is used, it does not pay to be too smart.

The study also indicates that while all compilations based on gratuitous changes (such as changes to comments or to white space) are unnecessary, such changes probably do not contribute significantly to overall compilation costs.

Rolf Adams, Annette Weinert and Walter Tichy of the University of Karlsruhe have recently completed a similar study based on over 700 daily changes in the history of two Ada programs consisting of over 60,000 lines of code [1]. They found that half of the compilations initiated by normal Ada rules were redundant according to the name use criterion. In comparing the two studies it is interesting to note that the incidence of redundant compilations for a C program is comparable to that for the larger Ada programs despite substantial differences in the compilation models of the two languages. The results of Chapter 4 suggest that differences in the compilation models of C and Ada do not necessarily lead to concomitant differences in the number of redundant compilations.

1.2.3 Name Use versus Name Visibility

Chapter 4 further explores the relationship among name use, name visibility, and recompilation cost, considering, instead of a single C program, a handful of C and Ada programs. This chapter departs from the paradigm for software manufacture per se to look at properties of programs that influence manufacturing costs; its purpose is to explain and corroborate the results of Chapter 3.

The change history study of Chapter 3 is based on historic changes to a single program. While this C study and the Adams-Weinert-Tichy Ada study demonstrate that there exist development efforts in which most recompilations are redundant, the two studies do not indicate to what extent this result can be extrapolated to other programs. Unfortunately, without a substantial investment in tooling, it would be prohibitively expensive to repeat the C study on a larger scale. Given the corroborating results of the Adams-Weinert-Tichy

study, it is also not clear what the intrinsic value of such a study would be (see Chapter 5 for further discussion). So instead of examining changes, Chapter 4 seeks to explain the incidence of redundant recompilations by looking at the structure of programs. In so doing it compares the Descartes crostic client with other programs.

The conclusions of Chapter 4 are based on static name use and visibility patterns for three Ada and three C programs, including the Descartes crostic client. I computed both the average number of compilation units that use the names defined in the interfaces of each program and the average number of units in which those names are visible. For changes made to a single name chosen at random, the first number represents the expected number of units that actually need to be recompiled; the second represents the expected number of units that typically are recompiled. For the six programs studied, the ratios of these two numbers vary from .07 to .42, indicating that from 9 to 6 out of every 10 compilations may be unnecessary when one name changes at a time. The values for the Descartes crostic client fall in the middle of the range for the six programs.
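The arithmetic behind these numbers is simple; the following C sketch shows it with made-up cross-reference records (the names and counts are invented, not drawn from the six programs).

    #include <stdio.h>

    struct name_xref {
        const char *name;     /* a name defined in some interface        */
        int units_using;      /* units that actually reference the name  */
        int units_visible;    /* units in which the name is visible      */
    };

    static double expected_redundancy(const struct name_xref *xref, int n)
    {
        double use = 0.0, visible = 0.0;
        for (int i = 0; i < n; i++) {
            use     += xref[i].units_using;
            visible += xref[i].units_visible;
        }
        /* use/n     = expected recompilations actually needed
         * visible/n = expected recompilations typically performed
         * 1 - ratio = expected fraction of redundant recompilations    */
        return 1.0 - (use / visible);
    }

    int main(void)
    {
        /* invented numbers, roughly in the range discussed in Chapter 4 */
        struct name_xref sample[] = {
            { "screen_refresh", 3, 14 },
            { "screen_width",   2, 14 },
            { "cursor_move",    2,  9 },
            { "buffer_insert",  4,  9 },
        };
        int n = (int)(sizeof sample / sizeof sample[0]);

        printf("expected redundant fraction: %.2f\n",
               expected_redundancy(sample, n));
        return 0;
    }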

Chapter 4 attempts to reconcile the results of the Descartes change history study with static use and visibility properties in two ways: by looking at the use and visibility properties of the names that changed in Descartes' history and by considering the effect of multiple name changes. There is no evidence of a relationship between name use or visibility and the selection of names that actually changed in Descartes' history. However, when several names are changed at the same time, they are often defined in the same interface. Thus groups of names that are changed at the same time are likely to be visible in the same set of compilation units but used by different (perhaps overlapping) sets of units.

While the name use/visibility study is too small for generalization beyond the six programs studied, it raises important questions about the relationship between program organization and compilation costs. The most profound result of the study is an unexpected consistency in patterns of name use across the six programs. Although average name visibility varied considerably, average name use was consistently between 2 and 3 units for all six programs. In other words, most global names link just two or three program units, including the unit in which the name is defined. This is reminiscent of Knuth's discovery that the average number of operations in an assignment statement is .5 [35].

Concerned with the overhead associated with importing externally defined symbols into a compilation unit, both Conradi and Wanvik, reporting on experience with the languages Mary/1 and Chill [7], and Kamel and Gammage, reporting on experience with

the language Protel [32], estimate that only 15% to 25% of the symbols visible to a compilation unit are actually used in that unit. These numbers indicate that if each name visible to a compilation unit were equally likely to change and if only one name were to change at a time, then between 75% and 85% of the recompilations performed on an average compilation unit would be unnecessary. Neither Conradi and Wanvik nor Kamel and Gammage made this observation, nor did they report on the average number of clients per name or on the distribution of name use.

1.2.4 Conclusions and Recommendations

The examples of Chapter 2 show the range of the dependency graph representation and predicate-driven model of software manufacture developed in this thesis. The case studies of Chapters 3 and 4 lead to new insights about the causes of redundant compilations, showing that this model helps us understand and therefore control manufacturing costs. While these studies represent only a small number of programs, their evidence is sufficiently compelling to validate the approach of programming environments such as the Rational Environment [2,21] or SMILE [30], which base recompilation decisions on the interconnections between individual names. Although this thesis raises a number of questions that demand further study, today's designers of practical programming environments would be well-advised to follow the example of these environments. More speculatively, the thesis suggests that we reappraise the mechanisms we use to organize programs.

At face value, the thesis shows that for historic changes to the Descartes crostic client program, only two predicates are interesting: one that makes recompilation decisions based on name visibility and one that makes the same decisions based on name use. Other predicates do not perform significantly differently from these two predicates, indicating that name use versus visibility is the primary source of redundancy in manufacture. This explanation is reinforced by the discovery that name use is consistently sparse relative to name visibility in a group of six programs. Because the static properties of the Descartes program are comparable to those of the other programs studied, and because historic changes tend to represent sequences of actual changes, the results of the Descartes change history study are probably conservative for a broad class of programs.

Further studies are needed to confirm and expand the results of the thesis. First, it would be interesting to do a broad study to determine whether the consistent patterns of name use seen in Chapter 4 hold up for large numbers of programs. This finding has implications not just for compilation costs, but for how we organize programs into modules. Second, because compilation units are themselves composite, it would be worthwhile to measure patterns of references within units. This information would provide additional insights about how programs are organized. Finally, it would be desirable to characterize patterns of change to measure the size of actual changes and to see whether there is a discernible relationship between use or visibility and the names that change.

At a time of unprecedented increases in computational power, some might challenge the importance of reducing inefficiency in software manufacture. Historically, similar challenges have not held up. Increased computational power is used to build more am- bitious systems, to tackle harder problems, or to apply more sophisticated development techniques. In the future, techniques such as those explored in this thesis will be impor- tant as we build larger systems with more complex derivation relationships. At present, these techniques may make the difference between 1 minute and 10 for a programmer making a small change; they may make the difference between 10 minutes and an hour for a more pervasive change to a larger program. For a group of programmers, they may make the difference between 30 minutes and overnight; for a large organization they may make the difference between overnight and a few weeks.

1.3 Background

Recently there has been an awakening of academic interest in software configuration management. 2 In an effort to develop techniques for managing complexity and change in the development and maintenance of software systems, researchers are combining the language-and-tool-oriented concerns of programming in the large with what have been the traditional methodological concerns of configuration management as practiced in industry. These techniques represent a collection of activities that manipulate the programmed components of a software system as primitives in order to compose them into a coherent, well-formed and operational whole. One of these activities is software manufacture.

In this section I first give concrete examples of the problems that have motivated my own interest in software manufacture. I then offer a more abstract view of software manufacture, motivating the model of Chapter 2. Finally, I give a brief overview of the relationship between software manufacture and other activities of programming in the large.

2 For example, a workshop devoted to configuration management was held in West Germany in January 1988 [68] and a second is planned for the United States in October 1989.

1.3.1 Examples of Problems in Software Manufacture

My interest in software manufacture as a problem (and my approach to its solution) stems from my experience as a software developer at both Bell Laboratories and Tartan Laboratories and from my experience maintaining configurations of the tools, libraries and program skeletons used to generate front-ends at Tartan. At Tartan, too, I learned about the complexity of manufacture using sophisticated tools and about the effectiveness of selective recompilation techniques.

The source of all problems in software manufacture is change. Dealing with change is conceptually simple. All one has to do is (1) recognize that something has changed, and then (2) perform the necessary steps to incorporate the changed components into a product. In practice, things are not so simple. A common and frustrating manifestation of a failure in either of these two operations is version skew. Version skew occurs when two potentially inconsistent versions of the same component are unintentionally incorporated in the same product. Version skew can only happen when there is more than one dependency chain between the product and the offending component. The derivation steps along one dependency chain are repeated using an updated version of this component while the steps along another chain are not. This is illustrated in Figure 1.2.

Figure 1.2: A Simple Case of Version Skew (legal versus skewed)

In many compilation systems, version skew is detected at compile or link time; in others it may only show up at run-time. Sometimes version skew is benign. When there are not substantial differences between the versions of the component producing the skew, the compilations required by a checked system are unnecessary.

The two incidents of version skew I describe below will not surprise experienced software developers. Both happened during the early stages of small projects before formal configuration management procedures were put into place. The first incident, an example of version skew in an unchecked system, took place at Bell Laboratories. The programming language was C. The second incident, an example of version skew in a checked system, took place at Tartan Laboratories.

Version Skew in C

One of my duties at Bell Laboratories was maintaining the interface defining a common object file format (coff) shared by a family of UNIX-based software development tools. At the time there were perhaps a half-dozen developers using the interface and I was allowed to change it at will. The programmers I worked with used make to regenerate their programs after making changes. Unfortunately some programmers would omit the coff interface dependency from their makefiles since it was not their code; thus make would fail to trigger the necessary recompilations after the interface was changed. On more than one occasion, a developer who had omitted the dependency would recompile one of the .c files using the coff interface after the interface had changed but fail to recompile other files that also used the interface. The partially recompiled system would successfully link and proceed to produce anomalous run-time behavior when data structures defined by the coff were communicated between modules having inconsistent views of the interface. The conscientious programmer who chose to use lint (a C program checker) to uncover the error would find nothing amiss since lint would reprocess all the .c files in the program using the new version of the interface. If, however, the appropriate parts of the system were recompiled, the anomalous behavior would disappear as mysteriously as it appeared.
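The mechanism is easy to reproduce in miniature. The sketch below is contrived and self-contained (the struct is invented; it is not the actual coff interface): the two layouts stand for two versions of the same header, one baked into a stale object file and one into a freshly recompiled unit. Because the linker checks names rather than layouts, such a program links cleanly and misbehaves only at run time.

    #include <stdio.h>

    struct record_v1 {            /* layout compiled into the stale .o file */
        long id;
        long length;
    };

    struct record_v2 {            /* layout in the freshly recompiled unit  */
        long version;             /* new field inserted at the front        */
        long id;
        long length;
    };

    int main(void)
    {
        struct record_v2 written = { 2, 1234, 56 };

        /* The stale unit still interprets the bytes using the old
         * layout.  (The cast mimics a pointer crossing the module
         * boundary; it is undefined behavior in a real program, which
         * is exactly the point.) */
        const struct record_v1 *seen =
            (const struct record_v1 *)(const void *)&written;

        printf("id as seen by the stale module:     %ld\n", seen->id);
        printf("length as seen by the stale module: %ld\n", seen->length);
        return 0;
    }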

Version Skew in Gnal

The software development environment at Tartan, based on the proprietary language Gnal, was different from the C/UNIX environment at Bell Laboratories. Designed to produce very efficient implementations, Gnal is a separately compiled systems-programming language offering strict type checking across module boundaries. In Gnal, unlike Ada (for example), interface and implementation information are declared in the same compilation unit. Thus when a Gnal module is compiled it produces both a relocatable object file and a symbol table that is used to compile clients of the module. The compiler, not the programmer, determines how much information about exported names is transmitted in this symbol table. A serious drawback of this strategy is that new versions of the symbol table are produced every time any change is made to a module and the programmer does

not always know if the new version resulting from certain changes will be compatible with the old version. Techniques for dealing with these problems were eventually developed.

Version skew was a serious problem at Tartan before program library and configuration management facilities for Gnal were put into place. During one summer, I was responsible for developing program skeletons for a table-driven lexical analyzer. These program skeletons made use of an error reporting subsystem being developed by another programmer. This programmer would periodically recompile the interface used by the lexer. I would change some lexical analyzer module and recompile that module as well as any of its clients only to find that I was unable to recompile the lexer root module (module A in Figure 1.2) because the module I had changed (module C in Figure 1.2) had been recompiled with the new version of the error interface while a second imported module (module B in the figure) had been compiled with the previous version. The Gnal compiler detected version skew when two versions of the symbol table for some interface were imported either directly or transitively into the same compilation unit, whether there was a substantial difference between the two versions or not. It was precisely this problem that led to the development of a selective recompilation mechanism for Gnal. (This mechanism is described briefly in Sections 1.4.2 and 2.4.3.)

The early Gnal compiler reported the name of the module producing the version skew (module D); it did not, however, report the names of the modules using conflicting versions of the interface (modules B and C). When the network of dependencies is complex, it can be difficult to determine what modules need to be recompiled to correct a version skew. According to Jim Morris, Mesa programmers would sometimes resort to recompiling the entire system in comparable situations.

Incidentally, the problem between the lexer and the error reporting subsystem was complicated because the programmer responsible for the error subsystem used some low level definitions that I maintained as part of the lexer. Although dependencies between individual modules remained hierarchical, the dependencies between the two subsystems were circular. Eventually, both the lexer and the error reporting subsystem were combined in a larger compiler front-end subsystem.

Further Complexities

The version skew problem becomes more complex when, instead of a one-level hierarchy between two developers, there is a lattice of dependencies among several independently developed subsystems. This is exactly the problem addressed by the Cedar environment release process described by Schmidt in his thesis [55, Section 2.4]. In the Cedar environment, interdependent subsystems had to be incrementally compiled in topological order so that the same versions of all components would propagate along all dependency chains in the hierarchy. According to Schmidt, internal releases of new versions of the Cedar environment took place once a month and took one to two days to accomplish. A similar problem plagued Tartan. However, while I was responsible for the configurations of tools, libraries and program skeletons used to generate compiler front-ends, several internal releases would take place weekly and the compilation system was as likely to change as the compiler generation tools or library code. We were able to turn around new releases in a matter of a few hours. 3

The complexity of software manufacture at Tartan was in part due to the Gnal separate compilation model. At least as important, however, was the heavy use of program generating tools. A half-dozen or more tools were involved in the generation of a compiler front-end alone. (This does not include another set of tools used in generating and maintaining compiler back-ends.) These tools all operated on a collection of specifications derived from an original target language grammar. They would generate a collection of Gnal and assembly language modules to be combined with precompiled libraries and table-driven program skeletons. The whole process was set up to produce compilers running in different operating environments on different computers. The Tartan tooling resulted in phenomenal leverage as tens of modules would be generated to be used in combination with hundreds of library and skeleton modules. However, due to this leverage, a single change to the original grammar could trigger a substantial amount of processing, much of it redundant. It is precisely this kind of situation that the model of Chapter 2 is designed to address.

1.3.2 Software Manufacture

Software manufacture is the process by which an initial set of components representing a software system are incrementally transformed and combined, through an often complex sequence of manufacturing steps, into one or more software products. These steps are effected by software generation tools including, but not limited to, such tools as compilers and link editors. As the number and expressive power of the tools used increases, so does the complexity of the manufacturing process.

3 I do not know how the volume of code constituting a release of the Tartan compiler front-end building system compares with the volume of code in the Cedar environment. I do know that although both the Tartan and Cedar internal release procedures had to respond to similar problems, the two development environments were different, as were the demands on the developers working in those environments.

I use the term software manufacture because the process of product derivation is largely automatic, mediated by tools. By comparison, the rest of the development process resembles a guild-hall craft where journeyman programmers labor under the supervision of craft masters. In some ways, the use of the term is unfortunate because it recalls the production techniques of the assembly line. In the manufacture of hard goods, the objective is to accurately replicate identical copies of a single product; in the manufacture of software, the same product is almost never produced twice.

Reproducibility as a Standard for Reliability

The first problem in software manufacture is reliability.

During its implementation and maintenance, a software system is repeatedly manufactured as each new set of changes is introduced. The result is then tested sufficiently to validate the changes and to discover additional problems. Whether casually manufactured by an individual programmer or officially by an integration team, it is imperative that the system undergoing testing (or about to be released) accurately represent the components from which it was supposed to have been constructed; otherwise, any information gained from testing is worthless. If there is any doubt about which versions of which components were used to build a given version of a system, it is impossible to attribute any problems found in testing to specific components, or to know what to change in order to fix those problems.

A software manufacturing process is reliable to the extent that it provides its users with complete control over changes. This means that once a problem in the behavior of a system has been identified and appropriate changes to correct that problem have been made, it is possible to remanufacture the system so that all differences between the new and previous versions can be ascribed to the changes made. Unfortunately, just knowing which versions of which components were used to build a given version of a system is not enough to ensure this. It is also essential that the programmer or integration team know exactly how the system was manufactured; otherwise an inadvertent change in the manufacturing process may inexplicably distort system behavior.

Reproducibility is a single standard for reliability in software manufacture. If it is possible to reproduce a version of a system, then it is possible to produce a known variant of that version.

There are four conditions necessary to establish reproducibility. First, it is necessary to identify the versions of all primitive components used to build a designated version of the system, including tools. Second, it is necessary to retain copies of all those components. Third, it is necessary to remember exactly how they were combined and transformed by the manufacturing process. Finally, each of the recorded steps of the manufacturing process must be repeatable.

The Issue of Efficiency

Once reliability has been established, the second problem in software manufacture is efficiency.

For an individual programmer making and testing changes, the time it takes to rebuild a system is often wasted. The programmer cannot make additional changes that might be contingent on test results or that might interfere with the manufacturing process, and he cannot begin testing until manufacturing is completed. If manufacturing time is not negligible, the programmer may be reluctant to make a discretionary improvement because of the potential overhead. He may postpone the informal integration of changes made by other programmers for the same reason. For a project (like the development of the Cedar programming environment at Xerox PARC or the development of compiler front ends at Tartan), formal integration may require suspension of all programming activities as changes propagate through the whole system.

Just as reliability in software manufacture is associated with control over the content of a software product, efficiency is associated with control over manufacturing cost. A software manufacturing process is efficient to the extent that the amount of work necessary to incorporate a change is proportional to the scope of that change. That means not doing extra work.

While it is reasonable to expect manufacturing costs to be high when a pervasive definition is changed, it is not appropriate to have to pay the same price for changing a less widely used definition. Unfortunately, in most environments there is no a priori way of distinguishing between the two situations. Programmers will sometimes risk failure to rebuild a required component in an attempt to avoid rebuilding components unnecessarily. Some environments (for example, Apollo's DSEE [39]) sanction this practice by providing explicit escapes to subvert normal recompilation rules.

The price of inefficiency in manufacture is not measured only in time lost to redundant recompilations. Perceptions about manufacturing cost also influence how a system is designed, what programming language or software development tools are used, and what changes are made and when. Decisions based on lack of information or inaccurate perceptions can limit the productivity of software development personnel, the effectiveness of software generation technology, and the prerogatives of software development management.

1.3.3 Related Activities

As part of the increased interest in software configuration management, researchers have been making an effort to factor the problem and to define and understand its constituent functions. Recent papers by Tichy [64] and Estublier [20] (both reporting at SVCC'88) have focused on the separate functions required to configure a software product from a collection of assorted components. Tichy considers the relationship among functions whereas Estublier asks "what is a configuration".

Unfortunately, there is little consensus about what an appropriate set of functions comprising software configuration management might be or even what constitutes a cover. Researchers still disagree on the meaning of basic concepts. In the following paragraphs I offer one view of the principal functions that interact with software manufacture.

Source and derived object management. Manufacturing operations must interact with a persistent object store to identify, retrieve and store components. Called version control, this function has long been treated as a distinct activity. Its traditional concerns have been techniques for representing the content and relationships between versions of individual software components and controlling mutual access. The typical paradigm for the control of text files (used for example in SCCS [52], RCS [61] and the DSEE history manager [40]) stores only the differences between successive versions as deltas and provides reserve/replace access to those versions. Derived files are handled differently. In DSEE and Odin [6] for example, they are cached in a "derived object pool" and discarded according to some cache flushing strategy.

A problem with conventional version control techniques is that they do not deal effectively with versions of components that are not text files. They also do not effectively capture the relationships between the versions of different components. As the limitations of traditional file systems and database technology are becoming evident [47], versioned object management has become an important topic in programming environment research.

Traditional configuration management and change control. Historically the concerns of traditional configuration management and change control have been largely administrative. Configuration managers are responsible for the scheduling, coordination and tracking of changes. They determine when new features will be introduced, who will implement them, and how they will be integrated into the product.

Version selection. Version selection is a central problem in software configuration man- agement. Before a software configuration can be manufactured, appropriate versions of each of the primitive components in the configuration must be identified. These versions must be syntactically and semantically compatible and must implement a desired set of features.

Version selection can be extremely difficult in the presence of multiple versions of multiple components, especially when trying to integrate independently developed subsystems. Unfortunately, there is no universal taxonomy of versions that describes the space from which to choose. The accepted approach to the problem of version selection has been to attach attributes to versions and to use predicates designating attribute values as selection criteria. Individual mechanisms differ in the complexity of the predicates they allow. This approach has been used by INTERCOL, an early module interconnection language [63] and for the definition of DSEE configuration threads [39]. It has been explored extensively in the Adele system [19,3].

Module interconnection languages. Like version control, module interconnection languages have traditionally been a distinct area of interest. While DeRemer and Kron originally envisioned that a module interconnection language would be used to specify information flow among the modules of a system [15], subsequent languages have provided facilities for version selection and software manufacture. Recently, Wolf has distinguished this original function of the module interconnection languages as interface control [69].

The module interconnection languages have always been concerned with the structure of software systems and with the relationships between components. Recently, both the NuMIL language [46] and the Inscape environment [50] have associated assertions with module interface specifications. These assertions, used during version selection and for consistency checking, distinguish between changes that are upwardly compatible and those that are not.

Most module interconnection languages describe relationships between modules written in a single programming language and see software manufacture as compiling and linking exclusively. These languages simply do not accommodate the use of other tools, be they preprocessors, assemblers, or program generators. A notable exception to this bias toward compilation is early work by Cooprider [10]. Because module interconnection language research has been the foundation of much recent interest in software configuration management, the limited outlook of the former has carried over into the latter.

Unlike the other constituent activities of software configuration management, which

are essential if one is to actually produce a software product, the study of module inter- connection languages has been confined to research laboratories and universities and has produced a fair number of doctoral theses.

System modeling. System modeling as coined by Lampson and Schmidt is a hybrid between module interconnection and software manufacture. In their view, a system model should be a complete and explicit description of a unique version of a software system that, like a deck of punched cards of the past, can be used to instantiate the system automatically once an initial command to proceed is given [37]. This view is maintained in the definition of a manufacturing graph in Chapter 2.

Subsequent to its introduction by Lampson and Schmidt, the term "system model" has come to be used to describe schemas for software manufacture such as makefiles. When applied to a selected set of primitive component versions by a product build tool, the system model yields an appropriately configured software product.

1.4 Directly Related Work

This thesis builds on related work in software manufacture and selective recompilation. The following sections review the most influential of this work. The thesis is unique in its synthesis of work in these areas, in its search for general principles rather than special purpose mechanisms and in its emphasis on quantifying and explaining the practical value of alternative techniques.

1.4.1 Software Manufacturing Systems

Because it produces the end products of software development, software manufacture is recognized as an important function of configuration management. Until recently, however, most manufacturing systems have been designed simply to automate the steps necessary to rebuild a software product after some of its sources have changed. These systems treat the products they build as ephemeral. They do not record how a given product was built and are unable to reproduce it at an arbitrary time in the future. Often auxiliary mechanisms are built around such systems to collect the information necessary to support some measure of reproducibility. Slowly, researchers are recognizing the value of integrating such mechanisms with manufacturing software both to reduce uncertainty about product composition during development and to expedite product release and continuing maintenance.

Make and its Successors

Make is the archetypal manufacturing system for building ephemeral products. For this reason it is discussed briefly in Section 2.1 where it serves to illustrate some of the pitfalls of current practice. Make has been an essential part of UNIX since 1975 and has been widely imitated on non-UNIX systems. It uses a makefile to keep track of dependencies between files, and when invoked regenerates only those derived files that are out of date with respect to their sources. An important reason for make's success is the simplicity of this model. In its long tenure, make has proven remarkably adaptable; while it does not handle cyclic dependencies well, there are few derivation relationships it cannot express.

Some of make's descendants, including the "fourth generation make" [25] and the program mk [29], have addressed perceived shortcomings in the make program, but have not challenged its underlying model. Areas in which these systems have made improvements include makefile succinctness, execution speed, cleaner dependency semantics (especially to support parallel building), and a cleaner interface with the UNIX shell [23]. While mk retains the spirit of the original tool, the fourth generation make sacrifices much of make's simplicity and elegance in favor of more comprehensive support for idiomatic C development methods.

Build [18] extends make by allowing developers to specify a viewpath of identically-structured directory hierarchies that are searched in sequence to locate the files named in the makefile. Typically the first directory along this path represents the root of a programmer's working directory structure and the last represents the root of a stable system. By interposing other directories along the viewpath, programmers can control the visibility of changes. While this mechanism both helps to coordinate changes and promotes the sharing of files, it does little to support product reproducibility.
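The viewpath mechanism amounts to a simple search rule: look for a file in each directory along the path and take the first copy found. The fragment below is a minimal sketch of that rule in C, not code taken from build itself; the directory names are hypothetical examples.

    #include <stdio.h>

    /* Sketch of viewpath lookup (not build's own code).  The directories
       listed here are hypothetical examples of a working hierarchy, a
       shared integration hierarchy and the stable system. */
    static const char *viewpath[] = {
        "/usr/jane/work",
        "/project/integration",
        "/project/stable",
        NULL
    };

    /* Return a stream for the first copy of 'name' found along the
       viewpath, or NULL if no directory on the path contains it. */
    FILE *viewpath_open(const char *name)
    {
        char path[1024];
        for (int i = 0; viewpath[i] != NULL; i++) {
            snprintf(path, sizeof path, "%s/%s", viewpath[i], name);
            FILE *f = fopen(path, "r");
            if (f != NULL)
                return f;    /* earlier directories hide later ones */
        }
        return NULL;
    }

Because the search stops at the first match, a file placed in the working hierarchy hides the stable copy of the same file, which is precisely how changes are made selectively visible.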

Alternatives to Make

SMILE and Odin are alternatives to make that have each been used to support substan- tial university research projects. Like make, however, neither system supports product reproducibility.

Used to develop and maintain Gandalf programming environments [59], SMILE is an incremental compilation system for a strongly-typed dialect of C that performs semantic analysis at the grain of individual declarations. While the user sees a SMILE program as a collection of modules, its underlying representation is a collection of separate procedures and variable declarations. The mechanisms that SMILE uses for incremental compilation are discussed in Section 2.3.2. Incremental compilation as a selective recompilation strategy is discussed in Section 1.4.2, below.

The Odin object manager is used in the Eli compiler construction system to manufacture target compilers [66]. The Odin interpreter uses compiled descriptions of manufacturing operations called derivation graphs to determine how to construct a requested product. Similar to the manufacturing step schemas described in Section 2.2.2, derivation graphs describe relationships between the types of objects known to the Odin system. Associated with each derived object type is the name of a procedure and the input types necessary to produce instances of that derived object type.

Odin's derivation graphs allow one to specify, in one place, complex manufacturing operations that are applied over and over again. However, while Odin maintains a representation of the relationships between object types, it does not maintain any explicit representation of the relationships between the objects themselves. That is, it has no concept of a system model.

Keeping Lists of Files to Support Reproducibility

The ability to regenerate a system precisely has long been a concern of production software development efforts. A prudent approach to this problem is to take snapshots of the whole system when it is about to be released (or when other milestones are achieved), and rebuild the system under conditions that ensure that only the information in the snapshot is used. For large systems, ensuring that the snapshot is complete and its content consistent can be a formidable task that is performed only rarely. The introduction of appropriate tools can help considerably.

A minimal snapshot of a software system lists one version of each of the system's source components. Together with appropriate conventions for building the system, such a list makes it possible to reproduce the system as long as the tools used to build it and the environment in which it is built do not change. This approach forms the basis of the Software Manufacturing System used at Bell Laboratories [11] and the Description Files used by Mesa development groups at Xerox [41]. It is also used in the Gandalf Software Version Control Environment [31], for configuration management in the Rational Environment [21], and in Sun's Network Software Environment [60].

The Bell Laboratories Software Manufacturing System combines checks and conventions with the use of make and SCCS to provide some degree of reproducibility in UNIX environments. The filename and SCCS-assigned version number of each source file used to generate a software product is embedded in that product. These filename/version number pairs, together with the version number of the makefile used to build the product, form a master slist. Checks ensure that only one version of each source component appears in the slist. When a product is rebuilt, its new slist must conform to the master slist.

Description Files (DF files) have been used heavily by the Cedar and Mesa development projects at Xerox to maintain consistent sets of files on different nodes of a distributed file system and to manage frequent internal releases. They were adopted more to achieve control over the development process than to achieve reproducibility. Each DF file lists versions of both the source and the object files that make up a subsystem. Files from other subsystems are imported by referencing their defining DF file. Although DF files are not used to automate software manufacture, they are checked for completeness and consistency according to Mesa language rules. Release procedures based on DF files are used to configure and install consistent baseline versions of systems in protected locations.

The Sun Network Software Environment (NSE) contains mechanisms that are conceptually akin to slists and DF files. However, instead of manipulating lists of files, the NSE maintains views of entire populated directory structures. These views are called environments. The NSE manages these environments to automatically configure the private workspaces of individual developers and to reintegrate changes made in those workspaces with main line project development. Each environment contains exactly one version of each source file in a given directory structure. While only source files are versioned, NSE ensures that any derived objects in an environment are consistent with the environment's sources. A variant mechanism is used to capture the compilation options and the tool suites used to produce different sets of objects from the same set of sources.

Bound System Models

There is nothing fundamentally wrong with keeping lists of source versions and using version-controlled makefiles or canonical procedures as a basis for reproducing designated software products. While many lists are incomplete (they often do not identify versions of tools or hidden files pulled by those tools from the environment), the missing information could in principle be recorded. It is also possible to use manufacturing conventions (such as undefining UNIX shell variables or disallowing command line parameterization of make invocations) that reduce the chances of inadvertently changing the manufacturing process. It is even possible to refine component names so that more than one version of the same component can be used in the same system. However, all this information has to be put back together to describe an as-built system.

An alternative approach, used both to identify and to reproduce designated software products, combines version information and the manufacturing schema in a single representation, a bound system model. This is the approach taken by the Cedar System Modeler, by Apollo's Distributed Software Engineering Environment and by the Software Configuration Engineering system developed at the Technical University of Berlin [43]. Each of these three systems is capable of regenerating a product directly from its representation as a bound system model.

The Cedar System Modeler generates executable software by interpreting functional programs, called system models, written in the System Modeling Language (SML). A system model describes the composition of a Cedar program. Only source files (Cedar definition and implementation modules) are named in the model; these source files are identified as immutable objects with unique identifiers. Derived objects are represented as function applications and are treated simply as accelerators to be used by the Modeler in interpreting the model. As such, derived objects are identified by unique identifiers that are created by hashing the names and unique identifiers of the sources, the compiler, and the compiler options used in their creation. Thus, while the Cedar compiler is not represented explicitly in the model, new derived objects would be created were the compiler to change. This, of course, does not guarantee reproducibility of products. However, the model, not any product it may have generated, is seen as the true representation of a system.
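The identification of derived objects can be sketched in a few lines. The fragment below is an illustration of the idea only, not the Modeler's actual algorithm; the 64-bit hash, the mixing constants and the function names are choices made for this example.

    #include <stdint.h>
    #include <string.h>

    /* Sketch: identify a derived object by hashing the identities of
       everything that contributed to it.  The hash is an FNV-style mix
       chosen for illustration, not the scheme used by the Cedar Modeler. */
    typedef uint64_t uid;   /* stands in for a unique identifier */

    static uid mix(uid h, const void *bytes, size_t n)
    {
        const unsigned char *p = bytes;
        while (n--)
            h = (h ^ *p++) * 1099511628211u;
        return h;
    }

    uid derived_object_uid(const char *source_name, uid source_uid,
                           uid compiler_uid, const char *options)
    {
        uid h = 14695981039346656037u;
        h = mix(h, source_name, strlen(source_name));
        h = mix(h, &source_uid, sizeof source_uid);
        h = mix(h, &compiler_uid, sizeof compiler_uid);
        h = mix(h, options, strlen(options));
        return h;   /* a new compiler or new options yields a new identifier */
    }

Because the compiler's identity and options participate in the hash, recompiling the same source with a different compiler produces an object with a different identifier, which is how the Modeler can avoid representing the compiler explicitly in the model.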

The Cedar System Modeler was never a successful product. SML was designed specifically to represent Cedar programs. The Modeler was built only as a research prototype and used only by its developers. However, because of the general applicability and clear articulation of its principles, the System Modeler has been widely influential and its key ideas are reflected in most later efforts.

DSEE, Apollo's Distributed Software Engineering Environment, is a proprietary programming environment supporting large scale software development efforts. It has been available for five years. While DSEE separates the specification of version selection information, in a configuration thread, and the specification of manufacturing information, in a system model, it combines the two in a bound configuration thread to identify all derived objects. The bound configuration thread, or BCT, lists all commands and arguments used to create a given derived object, including input versions, tool versions and options. Any BCT can be saved and used for recreating the particular object it designates. Unlike a Cedar System Model which is, at least in principle, meant to be read and understood by people, a BCT is an internal representation and not something that a user would manipulate.

Shape, part of the Software Configuration Engineering system, is a software manufacturing tool that is upwardly compatible with make. It is integrated with a dedicated version control system, similar to SCCS. Shapefiles, unlike makefiles, allow the programmer to specify how the versions used to instantiate the dependency graph are to be chosen. The default rule is to select the "busy" object in the current directory, just like make. Shape can be instructed to create a bound shapefile that replaces version selection patterns with the specific versions used to generate a designated product. This bound shapefile then can be used to recreate the product. Derived objects are stored and identified much as they are under DSEE.

1.4.2 Selective Recompilation

Several of the manufacturing systems discussed above include mechanisms that eliminate the need for certain recompilations:

• The fourth generation make allows one to specify dependencies on individual C preprocessor variables defined in .h files.

• Odin implements a "cutoff" recompilation strategy. In multistep transformations, it uses a simple bit-wise equivalence test to compare newly derived objects with objects that already exist to determine whether to propagate changes further.

• DSEE has two mechanisms that allow programmers to control the propagation of changes. Equivalences are used interactively. Programmers can declare that one object (an old .h file, for example) can be used in any bound configuration thread instead of another object (the revised .h file). Non-critical dependencies are specified in system models. They express preferences for the manufacturing process that are used only when a derived object must be built for other reasons.

However, the real impetus for selective recompilation has come from the separately compiled modular languages.

Modular programming languages (like Ada, Mesa or Modula-2) provide the benefits of strong type checking to programs developed as collections of separate units. Only the information explicitly exported by a unit is available for use by other units, and the compiler is required to check each use of that information for compliance with its definition. While the utility of modular languages for designing and maintaining large or complex programs is unquestionable, such programs place heavy demands on the compilation system. One problem is simply determining what units might need to be recompiled after another unit has changed. Other problems include the "big inhale" [7] and "trickle-down recompilations" [51]:

• The big inhale. Before processing a compilation unit, many compilation systems first construct a symbol table containing all the symbols that are visible to the unit. This symbol table, which includes information from both directly and indirectly imported interfaces, can be large and expensive to construct. For example, the largest symbol table constructed for a Protel application included 11700 declarations from 90 interfaces and required 800 kilobytes of storage. Only about 15% of these declarations were used [32]. (Protel is a modular language for telephony that has been used at Bell Northern Research since the late 1970's [5].)

• Trickle-down recompilations. Trickle-down recompilations are a class of unnec- essary recompilations due to transitive dependencies between compilation units. In the absence of more specific information, the compiler must assume that any compilation unit that imports a changed unit, either directly or indirectly, may be affected by the change. In practice, however, many changes do not propagate to indirectly importing units, and for some changes the number of such units can be large. Unfortunately, although trickle-down recompilations are widely recognized as a problem, the extent of the problem has not been effectively quantified.

Trickle-down recompilations are only one class of redundant compilations identified by selective recompilation strategies. In the remainder of this section, I review strategies for selective recompilation that have appeared in the literature. These strategies range from the BNR Pascal compilation system, which requires the programmer to identify the units that need recompilation and relies on the linker to detect inconsistencies, to various incremental compilation systems, which use specialized databases to encode the fine-grained dependency structure of a program.

Conradi and Wanvik [7] survey the issues and engineering tradeoffs that arise from alternative approaches to separate compilation. In so doing they identify several mechanisms for limiting the big inhale and for detecting trickle-down and other redundant compilations.

Avoiding Trickle-down Recompilations

In the language Mary2, as in Gnal, interface and implementation information are both declared in the same compilation unit (see Section 1.3). Thus when a module is compiled, the compiler produces both object code and an exported interface containing the symbolic information needed to compile clients of the module. This exported interface is regenerated whenever the module is recompiled for any reason, potentially unleashing a cascade of contingent compilations. If the exported interface does not change when the module is recompiled, then importers of the interface do not need to be recompiled. This simple observation led to the development of similar selective recompilation strategies for both Mary2 and Gnal.

In both strategies, the changed module is always recompiled. If the newly recompiled exported interface has changed, then direct importers of the interface are recompiled to see if their interfaces have changed. If not, no further compilations are performed. Direct importers that are unaffected by a change to an exported interface may be recompiled needlessly, but trickle-down recompilations are eliminated.
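The rule is compact enough to state as code. The following is a minimal sketch of the cutoff rule, assuming a unit type, an importer list and a recompile operation that reports whether the unit's exported interface changed; all three are hypothetical, and a real dispatcher would also remember which units it had already processed.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Unit Unit;
    struct Unit {
        Unit **importers;      /* units that import this unit directly */
        size_t n_importers;
    };

    /* 'recompile' rebuilds a unit and returns true if its exported
       interface differs from the previous one. */
    void propagate(Unit *changed, bool (*recompile)(Unit *))
    {
        /* The changed unit is always recompiled. */
        if (!recompile(changed))
            return;            /* interface unchanged: cut off here */

        /* Otherwise recompile the direct importers, recursing only for
           those whose own exported interfaces change in turn. */
        for (size_t i = 0; i < changed->n_importers; i++)
            propagate(changed->importers[i], recompile);
    }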

This technique has been called cutoff recompilation [1]. In addition to languages like Mary2 and Gnal, it is applicable to any language requiring multistep derivations. Adams, Weinert and Tichy looked at cutoff recompilations in their study of Ada compilations; they report that half of the redundant compilations detected using the name use criteria (see Section 1.2.2) could be found using this technique.

In the Mary2 implementation [51], each exported interface has a version stamp. The version stamps of imported units are packaged with the symbolic information in the exported interface. Global consistency requires that the same version stamp appear wherever the same interface is used. To avoid trickle-down recompilations while maintaining consistency, a newly recompiled interface is given a new version stamp only if it differs from its predecessor. The same method was suggested for Protel [5], but not implemented because of the perceived expense of tracking the previous versions of modules needed for comparisons [32].

Selective recompilation for Gnal was implemented by a compilation dispatcher named tcomp and a comparison predicate named refines. These tools were designed by John Nestor and Joe Newcomer and implemented by Hank Mashburn. The Gnal strategy differs from the Mary2 strategy in two ways. First, each newly recompiled interface is always assigned a new version stamp. Second, refines allows the addition of new definitions when comparing a newly compiled interface with its predecessor. If the comparison succeeds then refines appends the version stamp(s) of the predecessor to the version stamp of the new interface; if the comparison fails then tcomp triggers the recompilation of dependent modules. In checking for version skew, the compiler allows multiple versions of the same interface to coexist (as transitive imports) in the same compilation unit as long as the version stamps of all predecessor versions are suffixes of the version stamp of the most recent version.
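The version-skew check reduces to a suffix test. The sketch below assumes, purely for illustration, that a version stamp is represented as a character string and that refines concatenates the predecessor's stamp onto the new one; under that assumption two stamps are compatible exactly when the older is a suffix of the newer.

    #include <stdbool.h>
    #include <string.h>

    /* Sketch only: stamps modelled as strings, with the predecessor's
       stamp appended to the new one by refines. */
    bool stamp_compatible(const char *newer, const char *older)
    {
        size_t ln = strlen(newer), lo = strlen(older);
        return lo <= ln && strcmp(newer + (ln - lo), older) == 0;
    }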

Other Version Stamp Methods

The CHIPSY compilation environment for Chill [27] and the BNR Pascal compilation system for a modular dialect of Pascal [33] also use version stamps to check for global consistency. In both cases, however, version stamps are associated with individual names, not with entire modules. Because the compiler records the versions of names used during compilation, link-time consistency checks require the recompilation of a client of a changed interface only if the client actually references a changed name.

In the CHIPSY environment, each exported name has an associated version counter that is updated whenever the definition of the name is changed. The linker checks to make sure that the same version of each name is used throughout a program.

In the BNR Pascal compilation system the use of version stamps is coupled with different compilation rules for different classes of change:

• no additional recompilations are necessary when a new declaration is added to an interface,

• direct importers must be recompiled when variable or procedure declarations are changed, and

• both direct and indirect importers must be recompiled when types or constants are changed.

While the programmer is responsible for initiating the necessary compilations, a dedicated linker enforces the rules by checking version stamps as in the CHIPSY environment. Evidently enterprising programmers use the linker to identify the modules that need to be recompiled, but the process is iterative when there are transitive dependencies.

The version numbers assigned by the BNR Pascal compilation system are based on the size of variables and the total size of procedure parameters. Because this is a heuristic, the full system is always recompiled and tested to prepare a release. Experience with three million lines of code written in BNR Pascal indicates that the use of differential recompilation rules reduces the number of recompilations by a factor of 3.

Smart Recompilation

Another set of selective recompilation strategies depend on maintaining auxiliary information about the use and definition of names. The best known of these strategies is Walter Tichy's smart recompilation algorithm [62] which makes recompilation decisions based on name use. Schwanke and Kaiser's smarter recompilation technique [56] extends Tichy's algorithm by allowing parts of a system to continue to use the old version of a changed declaration. Techniques proposed by Rudmik and Moore for Chill [54] and by Dausmann for Ada [13] initiate compilations on intermediate representations of programs, making recompilation decisions based on changes to the attributes of declarations. These techniques anticipate incremental compilation systems like the Rational Environment, discussed below.

Tichy's algorithm keeps track of the names defined in each interface and the names referenced in each compilation unit. When an interface is changed, the set of changed names is intersected with the set of referenced names to determine whether a unit has to be recompiled. A collection of specific tests handles special cases. A prototype implementation was built for Berkeley Pascal, which uses a C-style include mechanism. Measurements of this prototype indicated that avoiding a single compilation recovers the cost of the smart recompilation analysis.
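The core decision reduces to a set intersection. The following sketch represents name sets as arrays of strings purely for illustration; it is not Tichy's implementation, which also applies the special-case tests mentioned above.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* A unit must be recompiled only if it references at least one name
       whose definition in the changed interface is different. */
    bool must_recompile(const char *changed[], size_t n_changed,
                        const char *referenced[], size_t n_referenced)
    {
        for (size_t i = 0; i < n_changed; i++)
            for (size_t j = 0; j < n_referenced; j++)
                if (strcmp(changed[i], referenced[j]) == 0)
                    return true;   /* non-empty intersection: recompile */
        return false;              /* the change does not touch this unit */
    }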

Schwanke and Kaiser's smarter recompilation technique requires the programmer to select an initial set of modules for recompilation called the "test set". These modules are recompiled along with any other modules that share with the test set any information derived from a changed declaration. Modules that do not share the use of a changed declaration with the test set are unaffected, whether they use the declaration or not. While smarter recompilation might be valuable for changes to pervasive declarations in large systems, the data of Section 4.3 indicate that the number of modules using most declarations is too small to justify applying the technique.

The GTE Laboratories CHILL Compilation System (CCS) [54] stores the intermediate code resulting from semantic analysis in a dedicated database. The database includes a symbol table for each version of a program. This symbol table is the basis for an efficient recompilation strategy. Dependent units are reprocessed only if they depend on a changed definition, and the amount of processing done is limited based on the nature of the change. For example, extending the range of a type requires only code generation for the dependent unit. A self-compiling prototype of the CCS was built for the GTEL dialect of Pascal.

Manfred Dausmann proposed a similar strategy to minimize recompilation for Ada programs. Based on a dependence relation between the attributes of named entities and the compiler phases in which those attributes are used, the compilation of a dependent unit would be initiated from the earliest compilation phase in which a changed attribute was used.

Selective Recompilation and Interprocedural Optimization

Cooper, Kennedy and Torczon [9] describe a recompilation system designed for the Rn programming environment for Fortran. Fortran is not a modular language and therefore does not require type-checking across compilation units. However, the Rn environment performs global interprocedural optimization across compilation units. This places heavy demands on the separate compilation system. In modular languages compilation dependencies are restricted to information in interfaces. Interprocedural optimization opens a potentially much wider and less disciplined channel for information flow between modules. For example, an optimization may produce a dependency between two otherwise unrelated procedures simply because they both call the same third procedure.

Like Tichy's smart recompilation algorithm, the Cooper-Kennedy-Torczon algorithm keeps track of the information needed to make recompilation decisions. In addition to reference information, the algorithm keeps annotation sets that represent the assumptions used in generating code for each procedure. These annotation sets are compared with new interprocedural information derived from the recompilation of changed procedures to determine whether the annotated procedure must be recompiled. The algorithm can be tuned by increasing the precision of the annotation sets, in which case the number of redundant recompilations decreases while the cost of the analysis increases.

Unlike other selective recompilation techniques, this technique does not produce results that are equivalent to recompilation from scratch. While the resulting code will be functionally equivalent, the failure to recompile a procedure may miss an opportunity for additional optimization.

Incremental Compilation Systems

Incremental programming environments aim to provide the immediacy of an interpreted language without sacrificing the performance or the stability of a compiled representation. They achieve this immediacy by recompiling only the individual procedures or other program fragments that use changed declarations. In some environments these program fragments may be as small as single expressions. To do this, an incremental compilation system must manipulate whole programs, even if the programs are written in a modular language. While the distinction between compilation units remains as a program structuring mechanism, modularity ceases to be a (separate) compilation issue.

Existing incremental programming environments are monolingual. With notable exceptions, most exist only as research prototypes. In contrast to earlier environments designed to accommodate novice programmers, more recent environments are designed for modular languages, and support multiple programmers or multiple versions of software. The Rational Environment [2,21] is a fully-integrated incremental programming environment for Ada that encompasses both hardware and operating system. It is a production software system that has successfully supported its own development. SMILE [36,30] is a programming environment for a restricted dialect of C. It is interesting for the leverage it achieves using simple implementation techniques.

The Rational Environment supports interactive incremental compilation at the level of individual expressions. All program units are represented as abstract syntax trees. Semantic analysis and code generation are performed as distinct operations as the programmer "promotes" the syntax-tree representation generated by the Ada-oriented editor. Selective reprocessing is done in cooperation with the user. Before a programmer can edit a program fragment, he or she must first "demote" a subtree containing the fragment as well as subtrees (possibly in other units) containing code that depends on the fragment. This can be done interactively with guidance from the environment. The sizes of the subtrees demoted determine how much code will eventually need to be recompiled, regardless of what the programmer actually changes.

In addition to offering incremental compilation, the Rational Environment also provides two mechanisms to limit the propagation of changes. First, Ada package specifications are implemented so that importing units do not need to be recompiled when the private part changes. Second, a subsystem concept provides an additional level of encapsulation, allowing only upwardly-compatible changes to be seen outside a subsystem boundary.

Where the Rational Environment is a fully integrated environment for Ada based on abstract syntax trees, SMILE represents C programs as plain text and relies on ordinary UNIX tools for processing. A SMILE module is a collection of C procedure, variable, type and macro definitions; each of these definitions is stored in a separate file. Simple string pattern matching is used to maintain cross-reference information. While the module as a whole is the unit of code generation and access control, the individual definition is the unit of editing and semantic analysis. When a programmer wishes to edit a definition, he or she must reserve the containing module and all modules containing code that references the definition. Although the whole module is reserved, only those items that reference a changed definition are reanalyzed. Feiler and Kaiser [22] discuss granularity issues for programming environments based on their experience with SMILE.

Unlike SMILE, Integral C [53] is an integrated incremental environment for the full C language including the lexical freedom of the preprocessor. It is designed to support a single programmer working in a private workspace and uses Tichy's rules for smart recompilation to identify program fragments that need to be recompiled after a change.

Recent research on incremental compilation has focused on the use of attribute grammar formalisms to incrementally process whole programs. This approach does not require the complete reprocessing of dependent code when a program changes, but only the processing necessary to recompute invalidated attribute values. Wagner and Ford [65] use attribute grammars to define efficient separate recompilation systems that are integrated with version selection mechanisms; the same techniques are used to incrementally recheck the constraints of version selection and to determine the (minimal number of) modules needing recompilation after a change. While the Wagner-Ford approach can concurrently process changes made by a single programmer, Micallef and Kaiser [44] describe a technique for incremental attribute evaluation in a distributed environment where multiple programmers make asynchronous changes to the same program.

1.4.3 A Profile of Compiling and Linking

While it is commonplace to measure the speed of compilers, little else has been done to analyze compilation costs. The few studies (those of Adams, Weinert and Tichy; Conradi and Wanvik; and Kamel and Gammage) that are directly relevant to this thesis are discussed elsewhere. However, there is one additional study that deserves mention.

Linton and Quong [42] instrumented make to discover how much time C programmers spend waiting for compiling and linking, how many modules are compiled each time a program is linked, and the change in size of the compiled modules. Based on data collected at Stanford University in which 93 users invoked make 13000 times on 800 (mostly small) programs, the study shows that most programs are relinked after only one or two modules are recompiled. While this contrasts with the change history study of Chapter 3, which averaged six compilations per link, it is consistent with the name distribution studies of Chapter 4. In the first case it is likely that the grouped RCS changes of Chapter 3 are larger than the live changes observed by instrumenting make; while in the second case, the name distribution study models changes to interfaces made one name at a time. In real life, one must assume a mix of changes to interfaces and bodies.

The data collected by Linton and Quong must be interpreted with caution as a measure of the size of the average change. The data indicates that 20% of all make invocations resulted in linking only. Since linking is typically triggered by the recompilation of .o files, some other process (another invocation of make, perhaps) must have initiated that recompilation. It is possible that the same changes were incorporated in several products, each built by a different makefile. However, one cannot simply divide the number of compilations by the number of links to count the number of compilations per change.

Chapter 2

The Model

This chapter develops, and illustrates with examples, a model of software manufacture to be used in examining how to incorporate changes accurately and efficiently into software products. This model not only explains the specific techniques for reducing recompilation costs studied in the remainder of the thesis, it also shows that such techniques are not limited to recompilation per se and that their effectiveness is not bought at the price of reliability. Thus, in addition to a framework for classifying and evaluating powerful incremental techniques such as smart recompilation, the model also provides a foundation enabling these techniques to achieve the reliability of manufacture from scratch in a controlled environment.

Previously presented at the Trondheim Workshop on Advanced Programming Environments [4], the model has two parts. The first part, a dependency graph representation of a software configuration called a manufacturing graph, captures the static derivation relationships between the components of an instantiated system. Each graph contains a complete record of how a particular version of a system was manufactured, including any information that might distinguish that version of the system from any other. Although there is nothing particularly novel in using a graph to represent a software system, existing representations have failed to capture the complexity of the manufacturing process.

As a surrogate for manufacture from scratch in a known and stable environment, a manufacturing graph establishes the identity of a software product and makes it possible to trace properties or features of the product from its sources. These graphs also make it possible to compare different versions of a product, to recreate previous versions or produce known variants, and to assess the potential impact of a change. Consequently, they also make it possible to avoid generating new components unnecessarily where existing components might be used. This last application is the focus of Chapters 3 and 4 and the subject of the second part of the model.

The second part of the model considers how a manufacturing graph schema is instantiated. Whereas a manufacturing graph represents a single generated system, a manufacturing graph schema represents a family of potential systems all sharing the same structure. The schema describes what steps have to be performed in what order on what components to produce a given product. When previous versions of a product exist, a schema can be instantiated by generating new derived components where necessary, or by reusing already manufactured components where their substitution would go unnoticed. When one or more initial components change, it is possible to reuse any derived component not affected by the change. This is exactly what make does crudely but effectively with timestamps. However, make is not especially selective in identifying the components that are actually affected by a change.

In principle, make regenerates every "target" derived from a changed source. However, for certain changes the newly manufactured target may be effectively indistinguishable from the corresponding component in the original configuration. For example, adding a comment to an interface module will not produce a substantive change in any of its clients when they are recompiled; changing a declaration in the same interface module will affect only those clients that use the declaration. In such cases the manufacture of the target may be suppressed and an already existing component used when instantiating the new configuration. Such components are identified by difference predicates.

Tichy's smart recompilation technique implements a difference predicate, as does make's less selective timestamp heuristic. A difference predicate is simply an assertion that determines whether an existing component can be substituted for one that would otherwise need to be remanufactured. Depending on the cost of evaluating the predicate and on the interconnectivity of the to-be-instantiated schema, the choice of difference predicate can substantially affect the efficiency of software manufacture.
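Stated in programming terms, a difference predicate is simply a function over a pair of components. The sketch below models components as files, which is an assumption of the example rather than of the model, and shows the bit-wise test used for cutoff (two components are interchangeable exactly when their contents are identical); make's timestamp heuristic would be another instance of the same function type.

    #include <stdbool.h>
    #include <stdio.h>

    /* A difference predicate decides whether an existing component can
       stand in for one that would otherwise be remanufactured. */
    typedef bool (*difference_predicate)(const char *existing, const char *fresh);

    /* Bit-wise equivalence: substitution is allowed exactly when the two
       components have identical contents. */
    bool identical_contents(const char *a, const char *b)
    {
        FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
        bool same = (fa != NULL && fb != NULL);
        int ca = 0, cb = 0;
        while (same && ca != EOF) {
            ca = fgetc(fa);
            cb = fgetc(fb);
            same = (ca == cb);
        }
        if (fa) fclose(fa);
        if (fb) fclose(fb);
        return same;
    }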

Chapter Outline

In turn, the following sections

1. Motivate the model using make as a convenient foil to discuss the pitfalls of software manufacture,

2. Define the manufacturing graph and graph schema representation,

3. Illustrate this representation through examples that contrast the manufacturing requirements of different compilation systems and tools, and

4. Describe the instantiation of a configuration using selective manufacturing techniques and characterize those techniques as difference predicates.

The chapter concludes with a discussion about the cost of selective manufacture and some general reflections on the utility of the model.

Because the scope of the model is considerably larger than the specific concerns of the next two chapters, readers interested primarily in the results of the Descartes change history study and in patterns of name use and visibility may wish to skim the abstract presentation of the model in Section 2.2 and most of Section 2.4 and rely more on the concrete examples and illustrations of Sections 2.3 and 2.4.1.

2.1 Pitfalls of Software Manufacture

Software manufacture is conceptually straightforward. If, starting with a base development environment populated only with a set of initial components, a software product is manufactured from scratch according to an automated script, then it is virtually assured that the resulting product represents the given set of initial components and that, given the same base environment, initial components and script, one could reproduce the same product as necessary. Rather than rebuild the product in its entirety every time anything changes, it is a small and obvious step to try to incorporate changes incrementally. Starting from a known state, one simply has to identify which initial components have changed and what might depend on those components and then propagate the changes as necessary. Unfortunately, in practice problems of both reliability and efficiency occur in all stages of this process.

Because its use is so widespread and its paradigm so universal, the tool make provides a good example of some common pitfalls in the current practice of software manufacture. Although make represents old technology and its problems are well understood, it nonetheless provides good motivation for the model. While some of its successors, notably Apollo's DSEE, have made substantial inroads in solving its most critical problems, none of these tools is based on a clean and comprehensive model of software manufacture.

Make's schema consists of the information in a makefile along with a set of default transformation rules. The makefile contains dependency lines that identify source-to-target derivation relationships between components and, where necessary, explicit transformation rules that list the command sequences necessary to recreate a target should any of the sources it depends on change. Together the dependency lines in the makefile form a directed acyclic graph. Make traverses this graph in depth-first order, executing the command sequences necessary to bring targets "up to date" with their sources according to an implicit change propagation rule that triggers the associated transformation rule if any of the sources on a dependency line have a more recent creation date than any of the targets, or if any of the targets do not exist.
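The propagation rule is easy to state precisely. The fragment below, which is not make's own code, expresses the rule for a single dependency line in terms of UNIX file modification times, assuming a POSIX system.

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/stat.h>
    #include <time.h>

    /* Rebuild if any target is missing, or if any source is newer than
       the oldest target on the dependency line. */
    bool needs_rebuild(const char *targets[], size_t nt,
                       const char *sources[], size_t ns)
    {
        struct stat st;
        time_t oldest_target = 0;
        bool have_target = false;

        for (size_t i = 0; i < nt; i++) {
            if (stat(targets[i], &st) != 0)
                return true;                 /* a target does not exist */
            if (!have_target || st.st_mtime < oldest_target)
                oldest_target = st.st_mtime;
            have_target = true;
        }
        for (size_t i = 0; i < ns; i++)
            if (stat(sources[i], &st) == 0 && st.st_mtime > oldest_target)
                return true;                 /* a source is newer */
        return false;
    }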

Make's extraordinary success as a tool may be due to its simple model of manufacture and its ability to handle arbitrary transformation steps. This model is adequate as long as there is need for only one version of each source or target component named in the makefile and as long as only those components are allowed to change.

In addition to the obvious problems caused by incomplete or inaccurate dependency information or by the outright manipulation of timestamps, the make model breaks down

• when the tools or standard components constituting the base development environment change,

• when its timestamp protocol is violated because distributed or multiprogrammed make invocations are not synchronized,

• when, as the structure of an application evolves, there is a need to change the makefile itself, or

• when it becomes necessary to produce different versions of one or more targets based on different options or on different tools, for example when producing debugging or instrumented versions of certain components or when generating a product for an alien architecture.

While these problems often can be surmounted by placing restrictions on how make is used, such restrictions imply a change in paradigm and may be awkward or costly, particularly for large systems using many makefiles, where the need is greatest.

The fundamental problem with make is that components are identified by their filenames and extensions, not by their derivation history. Different versions of a component all share the same name. After make has been executed and the context of execution lost, it is impossible to identify how any given target component was actually constructed. Under such circumstances any notion of reproducibility is moot. It is this part of the problem that is addressed by the representation of Section 2.2.

When its timestamp protocol is obeyed, when the dependency lines in the makefile are accurate and when only the components named in the makefile are allowed to change, make is conservative in propagating changes. It may regenerate some targets unnecessarily, if for example a source file is written gratuitously, but it will not fail to rebuild a target that is out of date. As long as the number of targets that depend on a given source component is small, so that only a few components need to be regenerated whenever make is invoked, make's efficiency is not at issue. However, when the potential impact of a change is large, make is unable to take into account the nature of the change or the context in which the changed component is used. For example, with make it costs as much to change a comment or to reformat a declaration as it does to change a pervasive type definition. Again the problem is especially pronounced for large systems where hundreds of components potentially may be affected by certain changes. This part of the problem is addressed by the difference predicates defined in Section 2.4.

2.2 The Representation of a Software Configuration

A manufacturing graph is a directed acyclic graph made up of components, which represent arbitrary software artifacts, and manufacturing steps, which represent arbitrary derivation relationships between components.

By capturing the tools and standard components comprising the base development environment, the set of initial components comprising the to-be-manufactured product, and the derivation steps by which those components are transformed and combined, a manufacturing graph serves as a surrogate for manufacturing a software product from scratch in a controlled environment. Consequently, the representation is extremely detailed. This level of detail is necessary because any omission or change in the structure or content of a graph can potentially affect the product represented. Although appropriate as an internal representation to be manipulated by tools or used for archival purposes, manufacturing graphs are not meant to be specified directly by the user.

The sole purpose of these graphs is to record the derivation dependencies between the components of a software system. While the content and structure of a manufacturing graph might be checked for conformity with higher level semantic constraints such as might be specified by a module interconnection description of a system, the graph does not represent those constraints explicitly. For example, knowing that a product was supposed to have been instantiated with a particular version of a component won't help reproduce a product that was built with a different version.

This section defines manufacturing graphs and manufacturing graph schemas. Section 2.3 gives examples of the kinds of schemas that arise in practice.

2.2.1 Components

A component is any artifact used in software manufacture that has the potential to affect the outcome of the manufacturing process if replaced by an artifact with a different value. Components are not limited to the objects conventionally considered as part of a configuration, generally restricted to "source". Instead, the model extends the notion of a component to include such artifacts as the option string supplied when a tool is invoked, the tool itself, the machine-readable representation of a processor's serial number or the instantaneous value of a processor's clock (a timestamp).

Whether it is a file, a structured object retrieved from a software database, a string of ascii text, or a string of bits, each component is atomic, has an immutable value (its concrete representation), and is identified by a unique label that serves to distinguish that component from every other component in the universe. A component's atomic value is the only property of the component that can affect the outcome of the manufacturing process, and that value, referenced transparently through a unique identifier, does not change.

Variants, Revisions and Other Component Attributes

Although components are atomic with respect to the model, many do have internal structure. This internal structure is important to the tools that manipulate components and to the predicates that determine whether the differences between two components are significant or not, but it is not visible in the structure of the manufacturing graph itself.

Many components also have attributes such as a symbolic name, a type, a revision number, a size, a creation time, an owner, etc. Such attributes are important in classifying or managing components, but they do not affect the outcome of the manufacturing process. When the value of a component attribute is used in the manufacturing process, it is treated as a component in its own right. For example, if the filename or creation time of a program source module is embedded in the corresponding object module, it is treated as an independent, atomic and immutable component used in the creation of the object module.

One consequence of the treatment of attributes in the model is that variants and revisions have no special status. Components that are versions of one another are not treated any differently from components that are otherwise unrelated; they simply happen to share certain attributes.

2.2.2 Manufacturing Steps and Step Schemas

Unlike components, manufacturing steps have no concrete existence.

A manufacturing step represents an atomic derivation relationship between two sets of components: an initial (input) set and a target (output) set. The target set is said to depend on the initial set. Typically the initial set consists of a tool (or some other agent such as a human being), the components representing the actual arguments, including options, supplied when the tool is invoked, and any components that the tool accesses through the environment. The target set is exactly that set of components produced by a specific invocation of the tool with the remaining inputs as parameters. Because the manufacturing step records an actual derivation process, the components in the target set are consistent, by definition, with those in the initial set. Because the initial components are immutable and the target components are created by the manufacturing process, the intersection of the initial and target sets is empty.

Figure 2.1: A Schema for Yacc

As an example of a manufacturing step, consider the UNIX tool yacc, a parser generator. When invoked with the "-d" option on an appropriately described grammar, yacc produces two outputs: a parser in a C source file named "y.tab.c", and a list of token numbers in a C include file named "y.tab.h". The file "y.tab.c" includes code taken verbatim from a parser skeleton named "/usr/lib/yaccpar". No other information is used in generating the output. Thus a manufacturing step representing the invocation of yacc on the file "grammar.y", invoked by the command "yacc -d grammar.y", relates unique instances of inputs named "yacc", "grammar.y" and "/usr/lib/yaccpar" and an instance of the anonymous string "-d" to unique instances of outputs named "y.tab.c" and "y.tab.h". This is represented schematically in Figure 2.1.
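For concreteness, the yacc step can also be written down as data. The declarations below are one possible rendering chosen for this illustration; the model itself does not prescribe any particular representation, and the numeric identifiers are made up.

    #include <stddef.h>

    typedef unsigned long uid;        /* unique, immutable component label */

    typedef struct {
        uid id;                       /* identifies the component */
        const char *symbolic_name;    /* an attribute, not part of its identity */
    } Component;

    typedef struct {
        const Component **inputs;     /* tool, arguments, sources, ... */
        size_t n_inputs;
        const Component **outputs;    /* components produced by this step */
        size_t n_outputs;
    } ManufacturingStep;

    /* The yacc invocation of Figure 2.1 written as a step. */
    static const Component tool_yacc = { 101, "yacc" };
    static const Component opt_d     = { 102, "-d" };
    static const Component grammar   = { 103, "grammar.y" };
    static const Component yaccpar   = { 104, "/usr/lib/yaccpar" };
    static const Component y_tab_c   = { 201, "y.tab.c" };
    static const Component y_tab_h   = { 202, "y.tab.h" };

    static const Component *ins[]  = { &tool_yacc, &opt_d, &grammar, &yaccpar };
    static const Component *outs[] = { &y_tab_c, &y_tab_h };

    static const ManufacturingStep yacc_step = { ins, 4, outs, 2 };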

Completeness and Reproducibility

Virtually all systems supporting software manufacture make and exploit the assumption that repeated invocations of the same tool on the same input will always produce identical results. This is the basis of reproducibility. Since derived components can always be recreated, efficiency considerations are the only reason to keep previously derived components around; the assumption is exploited to avoid recreating components that already exist. In both the Cedar System Modeler and Apollo's DSEE previously derived objects are cached simply to be used as accelerators in subsequent manufacturing operations.

Contrary to this conventional wisdom, the model does not make the same assumption. Whereas in principle it is desirable to completely capture everything that might affect the outcome of a manufacturing step, in practice this is not always possible. The first problem is that one might omit a dependency; the second is that repeated invocations of the same tool on the same input sometimes produce different results. The latter problem, which most often is really a consequence of the first, is a problem no matter how conservative and controlled the manufacturing strategy.

Many factors might cause repeated invocations of the same tool on the same input to produce different results. These factors include transient hardware problems, subtle bugs in the tool itself or in the host operating system, or assumptions built into the tool about facilities provided by the underlying host computer or operating system that are violated when either is changed. What these factors have in common is that they represent pervasive and unexpected sources of change. It can be argued that they are no different from other sources of change and should be treated in exactly the same way: their effect should be captured by explicit dependencies on appropriate components, for example, on the state of the host computer and operating system. At some point this hairsplitting will yield a component unique to each invocation of a manufacturing step.

As an alternative, the model considers each invocation of a manufacturing step, whether on different inputs or not, as distinct. This provides a hedge against aberrations such as the above. The model then provides a uniform mechanism (difference predicates) for equating manufacturing steps that produce equivalent outputs, regardless of the values of their inputs.

Manufacturing Step Schemas

Whereas a manufacturing step represents a specific relationship between individual components, a manufacturing step schema is a template for a class of existing and potential relationships that all share the same structure and may share one or more initial components.

A manufacturing step schema is a manufacturing step in which the entire target set and zero or more components in the initial set have been replaced by variables. The process by which these variables are bound to components is called schema instantiation. All variables in the target set must be instantiated simultaneously and may be instantiated only after all variables in the initial set have been instantiated.

Divorced from an implementation that supplies identifiers for components, all the examples of software manufacture in this chapter are shown schematically. The alternative would be to invent some artificial labeling convention that would contribute little to understanding.

2.2.3 Manufacturing Graphs and Graph Schemas

While a manufacturing step records a single step in the manufacturing process, a manufacturing graph records the set of steps representing the instantiation of a particular configuration from a particular set of initial components.

Figure 2.2: A Rudimentary Manufacturing Graph

A manufacturing graph is a directed graph composed of manufacturing steps tied together by the components that are produced by one step and consumed by others. As the rudimentary graph of Figure 2.2 shows, a manufacturing graph is a bipartite graph in which nodes representing components alternate with nodes representing manufacturing steps. Components that have no in-edges are the primitives of the configuration represented by the graph. These are the components where changes to the configuration originate; they are represented by shaded circles in the figure. Components that are meant to be distributed as part of the (sub)system under development are designated as software products; they are represented by double circles in the figure. A component may be both a primitive and a product; consider, for example, an interface that is both used and exported by some subsystem.

Manufacturing graphs are acyclic. Because no component exists prior to its manufacture, no component can be derived from itself or from one of its derivatives.1 Because no component is the direct output of more than one manufacturing step, no node representing a component can have more than one in-edge. Every manufacturing step in the graph must contribute to the generation of some product; otherwise it is gratuitous. However, a manufacturing graph is not necessarily connected. If a configuration exports multiple products, each may be associated with a disjoint subgraph.

Unique Identifiers and the Status of Components

The status of a component as primitive or derived (source or target) is not always an independent property of the component; instead it depends on how the component is used in a given configuration. In particular, the same component may be derived as a product of one configuration and used as a primitive in another. For example, a subroutine library may be represented both as a product that is manufactured and exported by one subsystem and as a primitive component of a client subsystem. Since each component is labeled with a unique identifier that belongs to the component, not the configuration, there is never any confusion about its identity or its derivation history.

If a configuration is defined by the set of products it exports, the same configuration may be represented at many different levels of detail. A project that is composed of several independently developed subsystems may choose to represent each subsystem internally as a separate configuration producing intermediate products. Externally, the released product may be represented as a single configuration. For a bootstrapped system, it may be desirable to represent only a single iteration for a released product but necessary to record several iterations to track the incorporation of a particular feature.

The alternative would be to represent the entire known genealogy of every component or else lose its derivation history.

Manufacturing Graph Schemas

The relationship between manufacturing graphs and graph schemas is the same as that between manufacturing steps and manufacturing step schemas. Whereas a manufacturing graph represents a fully instantiated software configuration, a manufacturing graph schema is a template for a family of existing and potential configurations that all share the same structure and may share some components.

1 The constraint that a manufacturing graph be acyclic does not imply that cyclic relationships of other kinds are impossible. Such relationships can and do occur at the module interconnection level, notably in bootstrapped systems. When such systems are manufactured, however, at least one of the components in each cycle must be multiply instantiated. See Section 2.3.4 for an example.

A manufacturing graph schema is simply a manufacturing graph in which one or more manufacturing steps have been replaced by step schemas and in which no instantiated component depends on a variable. Consequently, a graph schema has at least one exported product that is represented by a variable. Variables for primitive components in manufacturing graph schemas are instantiated by version selection; variables for derived components are instantiated by manufacture or, as described in Section 2.4, by appropriating suitable existing components.

Interesting schemas are derived by combining version selection with a graph representing an already instantiated configuration: new components are substituted for one or more primitive components, and any derived components that depend on a new component are replaced with variables.

2.2.4 Encapsulated Subgraphs

At times, particularly for presentation purposes, it is convenient to be able to manipulate a subgraph of a manufacturing graph or graph schema as if it were a single manufacturing step or step schema. For example, when representing the manufacture of a large system, it may be desirable to hide the internal details of the manufacture of a subsystem so that a complex manufacturing sequence is encapsulated as a single step. Alternatively, some graphs contain multistep transformations that, adding little but size and complexity to the graph, can be encapsulated without fundamentally changing the graph's structure.

The UNIX cc command makes a good example of how a multistep transformation might be represented as a single step. This encapsulation is used in the examples of the next section.

Under Berkeley UNIX, the cc command, often thought of as the C compiler, is actually a dispatcher that invokes, as directed, the C preprocessor (cpp), the C compiler (ccom), an optional code improver (c2), the assembler (as), and finally, the link editor (ld). The process by which an optimized relocatable object file is generated is shown schematically in Figure 2.3. The components labeled "cc_cmd", etc., represent the command lines passed to each tool on invocation. The components labeled "[tmpl]", etc., are temporary files created during the compilation process. When the cc command is invoked with different options, the manufacturing graph may have a different shape.

Figure 2.4 shows how the same process might be encapsulated into a single step corresponding more closely to the user's view of the compiler.

Figure 2.3: A Schema for the UNIX cc Command (the Preprocessor, Compiler, Optimizer and Assembler steps, connected by command-line components and temporary files)

Figure 2.4: The UNIX cc Command Encapsulated

What goes on inside the encapsulation boundary (the outer box in the figure) is effectively hidden. Since the primitive components "cpp", "ccom", "c2" and "as" are not represented outside the boundary, that boundary must be opened when any schema for the encapsulated step is instantiated.

In general a subgraph encapsulated as a single manufacturing step must contain all the nodes of the parent configuration that occur on any path between the inputs and outputs of the encapsulated step. The inputs of the encapsulated step are a subset of the primitive components of the subgraph; the outputs of the step are those components that are derived in the subgraph and used or exported by the parent configuration.

2.3 Examples of Manufacturing Graph Schemas

This section gives several concrete examples of manufacturing graphs including generic schemas for the compilation of C programs in the UNIX and SMILE programming environments and for the compilation of Ada programs as prescribed by the Ada Language Reference Manual [16]. These examples show how the manufacturing graph representation allows one to examine and compare the manufacturing implications of tool design and application architecture.

I have chosen to present generic schemas showing the main features of each compilation system to avoid overwhelming the reader with the amount of detail contained in the schemas of even small programs. For example, a complete manufacturing graph schema for the Descartes crostic client studied in Chapter 3 has 27 encapsulated compilation steps, each with an average of 19 inputs (including compiler, compilation command, .c file, and 3 UNIX and 13 application .h files), and a link step with 31 inputs (.o files, linker, etc.). The generic schemas are supplemented by examples of typical steps for the Descartes crostic client and for the Rational Kermit program, one of the Ada programs studied in Chapter 4.

To show the range of the manufacturing graph representation, this section also shows partial schemas for the generation of the lexical analyzers discussed in Section 1.3.1 and for bootstrapping the Mini IDL tools.

2.3.1 Conventional Compilation Strategies for C and Ada

The languages C and Ada both allow programmers to divide programs into separate compilation units. This capability is desirable for many reasons, for example:

• programmers can simultaneously work on separate parts of a program,

• the same unit can be reused in multiple programs,

• the impact of many changes can be confined to a small part of the whole program, and

• only those parts affected by a change need to be recompiled.

However, when a program is divided into multiple units, those units have to share information, if only the addresses of global procedures and variables at run time. C and Ada differ in their requirements for sharing information between compilation units. Their compilers exemplify two widely used strategies for compiling the separate units.

Independent Compilation in C

C is an independently compiled language. This means that the results of one compilation do not depend on the results of any other compilation: all the information needed to

analyze and generate code for a compilation unit is available within that unit. Information sharing is accomplished through redundant declarations or by source text inclusion. There are no mechanisms for the compiler to check whether redundantly declared information is consistent or whether the same text is shared by any two units. Some independently compiled language systems never perform such checks; others perform them at link time.

In C the physical unit of compilation is the file. The logical unit of compilation is the declaration and any number of declarations can be grouped in a file. Because a declaration is visible only within its containing file, information to be shared between files must be redeclared in each sharing file.

The C preprocessor is a vital part of the C compilation system. It is used to avoid having to maintain multiple copies of shared information. In addition to providing simple macro expansion and conditional compilation, the C preprocessor will insert the text of a specified file at a specified point in the text of another file as directed by an "include statement". C programmers exploit this feature by placing information to be shared in an "include" or "header" file, typically identified by the extension ".h". This file is then included by each C source file (identified by the extension ".c") that requires the shared information.
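A minimal sketch of this convention, with invented file and procedure names rather than anything from the Descartes sources:

    /* geometry.h -- an invented header holding the shared declarations */
    #ifndef GEOMETRY_H
    #define GEOMETRY_H
    struct point { int x, y; };
    extern int manhattan(struct point a, struct point b);
    #endif

    /* geometry.c -- one compilation unit; it includes the header it implements */
    #include <stdlib.h>
    #include "geometry.h"
    int manhattan(struct point a, struct point b)
    {
        return abs(a.x - b.x) + abs(a.y - b.y);
    }

    /* client.c -- another compilation unit; inclusion gives it the same
     * declarations, and the two .c files are compiled independently and
     * then linked together                                                 */
    #include <stdio.h>
    #include "geometry.h"
    int main(void)
    {
        struct point o = { 0, 0 }, p = { 3, 4 };
        printf("%d\n", manhattan(o, p));
        return 0;
    }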

Figure 2.5: A Generic Manufacturing Graph Schema for a C Program (compilation steps take a1.c ... an.c, the needed .h files, cc and a compilation command to a1.o ... an.o; a link step combines the .o files, crt0.o and ld to produce a.exe)

Figure 2.5 shows a generic schema for the compilation and linking of a C program under UNIX. The program source consists of some number of .c files "a1.c ... an.c" and some number of .h files "i1.h ... im.h". The latter include both system .h files such as "stdio.h", providing access to the C standard i/o library, and application .h files. Each compilation step, encapsulated as in Figure 2.4, produces a relocatable object file (a .o file) from a .c file, the C compiler and compilation command, and a subset of the program .h files. Although it is not clear in the generic schema, not every .h file need be included in every compilation step. All the .o files are linked together in a single step with specified C libraries and the UNIX run-time entry procedure ("crt0.o") to produce the final executable program.

compose.c "-- _, _compose.o utility.h except.h assoclist.h gp. h image, h style .h • rule.h glyph.h format.h rsupport.h

"cccompose.h-c compose.c" Z CC P'__

Figure 2.6: A Typical Manufacturing Step Schema for a C Program

A typical compilation step schema for the Descartes crostic client is shown in Figure 2.6. Filenames in pointed brackets represent UNIX include files.

Many C compilation systems include an implementation of the tool lint, which checks the consistency of declarations across compilation units. Lint operates in two passes. The first pass, which typically shares source with the C compiler, performs syntax and semantic analysis on each compilation unit separately, generating a symbol table with information about each procedure definition and use. The second pass combines the symbol tables produced by the first pass to check the return and argument types of every procedure against each of its call sites. A generic schema for lint is identical in structure to the schema for compilation in Figure 2.5, with pass one substituted for each compilation step and pass two substituted for the link step. Unfortunately, since lint is invoked separately from the compiler, there is no guarantee that the set of files that it processes is the same set processed by the compiler.

Separate Compilation in Ada

Separate compilation in Ada is typical of a class of modern programming languages including Mesa and Modula-2 that support the separate definition of interfaces and their

implementations and that require compile-time type-checking across compilation unit boundaries. In these languages all information to be shared at compile time must be declared in an interface.2 When an interface is imported by a client, the names declared in the interface are made visible to the client. This is typically implemented by precompiling the interface to generate a symbol table that is used in subsequent compilations. Thus patterns of interface visibility constrain compilation order so that every interface unit imported by a client must be compiled before the client unit is compiled.

In C a compilation unit has no special status. In Ada, a compilation unit is a distinguished syntactic entity. It can be a specification unit defining a subprogram, package or task interface; it can be a body unit defining the implementation of the corresponding specification; or it can be a body subunit defining the separately compiled body of a nested subprogram, package or task. An Ada package is a collection of declarations (a module); an Ada task represents a parallel execution thread.

Names declared in a specification unit are visible in any other compilation unit that names the specification in a with clause. The importing unit may be another specification unit, a body unit or a body subunit. All the names imported or declared in a specification unit are visible in the corresponding body unit and all the names imported or declared in a body unit are visible in all its subunits.

Figure 2.7: A Generic Manufacturing Graph Schema for an Ada Program (specification, body and subunit compilations produce .sym and .o files, which are linked by ld into prog.exe)

2 This statement is not entirely true. While an Ada interface always contains sufficient information to compile a program, Ada does allow the inline substitution of procedure bodies. This exposes information (the text of the procedure body) declared only in an implementation unit.

Figure 2.7 suggests how a typical Ada program is compiled and linked. The unit visibility relation determines compilation order: a unit can be compiled only after all the units named in its with clause have been compiled. A body unit can be compiled only after its corresponding specification unit, and a subunit only after the containing body. This requirement is enforced by the Ada compilation system, as is the requirement that only one version of each compilation unit be used to generate any given program.

The compilation of each Ada compilation unit produces a symbol table to be used in the compilation of any dependent units; symbol tables for body units and subunits are used in compiling any nested subunits. The compilation of body units and subunits also produces relocatable object files that are linked together with any necessary libraries to form an executable program.

Figure 2.8: A Typical Manufacturing Step Schema for an Ada Program (compilation steps for k.spec, k.body and k_connect.subunit, with the symbol tables they consume and the .sym and .o files they produce)

Compilation steps for the specification and body units of package "kermit" and the subunit defining the body of procedure "connect" are shown in Figure 2.8. The name "kermit" is abbreviated to the letter "k" so the figure will fit on the page. A distinguishing arrow in the figure identifies symbol files produced as the result of antecedent compilations.

A Comparison of the Two Strategies

The C and Ada strategies for separate compilation make different tradeoffs in compile-time checking, program organization, and the complexity of the compilation system.

Independent compilation makes it easy for separate individuals or organizations to work independently. While independent compilations can be performed in parallel or in any order, shared text must be reprocessed every time it is used. Independent compilation also precludes compile-time type-checking across compilation unit boundaries, which means that programmers may not learn about incompatibilities between units until link time or later.

When interfaces are compiled separately, each interface is processed only once, no matter how many times it is used. Compile-time checks ensure that interfaces are used consistently and that only one version of each interface appears in a program. While these checks make the integration of independently developed subsystems much more reliable,3 they do restrict compilation order and limit the number of compilations that can be performed in parallel.

Figure 2.9: The Source of an Unnecessary Recompilation to Prevent Version Skew (interface A depends on names in interface B that do not depend on interface C)

When interfaces are compiled separately, version skew (as defined in Section 1.3.1) is detected at compile time. This can be a particular problem if the programmer responsible for a unit that fails to compile has no control over the units causing the version skew. In addition, when a low level interface changes, a significant fraction of a system may have to be recompiled simply to prevent version skew. For example, even though A may depend on B and B on C, as Figure 2.9 shows, A may not necessarily depend on C. Yet A will still have to be recompiled every time C changes.

Many language reference manuals notwithstanding, the choice of compilation strategy is rightfully a part of a language implementation, not a language definition. Although effectively precluded by vagaries of the C language (the use of macros and the syntax of typedefs), there is no reason the include files of other independently compiled languages could not be precompiled. Interdependencies between include files would then lead to what looks like a schema for Ada. Conversely, Ada specification files could be reread and reprocessed during the compilation of each referencing body unit, producing what looks like a schema for C. In fact, although BLISS uses a C-like file inclusion mechanism for shared information, the language allows programmers to specify that certain included files be precompiled [14]. The BNR Pascal compiler, on the other hand, does not precompile module interfaces because the source text representation is more compact than a symbol table [33].

3 It is often noted that once a program written in a language like Ada finally compiles, it usually runs.

2.3.2 C Compilation in the SMILE Programming Environment

The SMILE and Rational programming environments both use incremental techniques to provide faster turnaround during development. The Rational environment is a fully integrated environment for programming in Ada. It is based on a highly interconnected representation that nevertheless maintains the integrity of Ada compilation units. Thus Rational manufacturing graph schemas look much the same as other Ada schemas. What distinguishes the Rational environment is how schemas are incrementally derived from instantiated graphs and how they are incrementally reinstantiated.

SMILE, on the other hand, performs routine-at-a-time semantic analysis by maintaining individual C objects as distinct entities. For this reason it is interesting to contrast SMILE with conventional compilation strategies for C.

Originally developed to support the GANDALF development effort, SMILE supports programming in a restricted dialect of C. It augments C with a module construct and requires the declaration of procedure prototypes. In return it provides greater type safety than most implementations of C. Users can manipulate individual declarations independently and are provided with rapid turnaround for semantic analysis. Although implemented using the UNIX equivalent of chewing gum and baling wire, SMILE has proven to be an effective tool for the projects that use it [30].

The following description of SMILE is derived from the SMILE Reference Manual [36] in the GANDALF System Reference Manuals [59] and from conversations with Charlie Krueger, its current maintainer.

SMILE programs are divided into modules, each of which is a collection of C procedure, variable, constant, type and macro definitions, some of which are designated for export. Each of these items is stored in a separate file.

Each module also establishes the compilation context for each of its contained items. This context consists of a project prelude, common to all the modules in a system, and a module prelude; both typically contain references to external .h files or declarations of library routines. In addition, the context includes a collection of declarations imported from other modules and declarations of all the items local to the module.

In SMILE, each procedure is analyzed independently using a variant of the tool lint. This strategy requires that the normal compilation context be supplemented with procedure stubs for all imported and locally declared procedures so that lint can check each call site against the corresponding procedure definition. Code is generated for the module as a whole using the UNIX C compiler when all procedures have been analyzed successfully.

Figure 2.10: A Generic Manufacturing Graph Schema for SMILE (semantic analysis of individual procedures with lint and generated stub contexts, followed by module-level code generation with cc and a final link)

A generic semantic analysis step for a single SMILE procedure ("aproc1.g") and a generic code generation step for a SMILE module ("amodule.c") are shown, somewhat simplified, in Figure 2.10. The types and constants declared in the module are combined with declarations imported from other modules to form part of the context for all module compilations ("aimport.h"). Stubs for imported procedures are also collected to form part of the supplemental context to be used in the semantic analysis of locally defined procedures ("agimport.h"). Items exported by the module are used in similar fashion to form contexts for the compilation of procedures defined in other modules (e.g. "cimport.h"). Variable definitions are collected in yet another context file ("aobject.h"). The remainder of the unit to be analyzed is formed by combining the code for a single procedure with stubs for the remaining procedures ("aproc1.g"). Code for all procedures is combined for code generation. The resulting object file is then linked with other object files and libraries to form an executable program.

The various procedures that collect and transform individual declarations appropriately for their intended use are not shown in Figure 2.10. For example, SMILE stores the header and body of a procedure in separate files so that it can distinguish between interface and implementation changes. The representation of a procedure header is similar to an ANSI C function prototype [34] (SMILE predates ANSI C). This representation must be transformed into a standard C function header when the procedure itself is analyzed and compiled, into a stub procedure declaration when other procedures are analyzed, and into a simple external procedure declaration when the containing module is compiled.
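As a rough, invented illustration of these three renditions (the procedure name and exact forms are hypothetical; actual SMILE output differs in detail):

    /* Suppose a procedure "ascore" whose stored header resembles the prototype
     *     int ascore(char *guess);
     * The expansions below only suggest the idea described in the text.        */

    /* (a) In the unit built to analyze and compile the procedure itself
     *     (cf. "aproc1.g"), the header becomes an ordinary function definition: */
    int ascore(char *guess)
    {
        return guess[0] != '\0';      /* the body is kept in its own file */
    }

    /* (b) In the stub context used while other procedures are analyzed
     *     (cf. "agimport.h"), it becomes a stub so lint can check call sites:   */
    int ascore(char *guess)
    {
        (void)guess;
        return 0;                     /* stub: never executed */
    }

    /* (c) When the containing module is compiled as a whole (cf. "amodule.c"),
     *     it becomes a simple external declaration:                             */
    extern int ascore();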

The SMILE compilation system is clearly more complex than the corresponding C compilation system. It is also fragile in response to abuses of the C macro language. However, SMILE is parsimonious in the amount of work that has to be done to reanalyze a program after a change. In particular, when there is a change to the interface of an exported declaration, only those procedures that actually use the declaration are reanalyzed. While the context in which each procedure is analyzed may be larger than necessary, the ability to import individual items selectively from other modules ensures that each imported item in the context is necessary for the compilation of the module as a whole.

2.3.3 Generating the Tartan Lexical Analyzer Subsystem

The manufacturing graph schema for the generation of a Tartan lexical analyzer makes an interesting example for two reasons. First, the production of a lexical analyzer involves the use of tools other than just a compiler and a linker. Second, the interconnection structure of the lexical analyzer is such that it is possible to show a fair amount of the system in a relatively compact graph.

As stated in the description of the Tartan environment in Section 1.3.1, the lexical analyzer is written in the proprietary language Gnal. Like Ada, Gnal offers strict type checking across module boundaries and uses precompiled symbolic information to compile clients of an interface. Unlike Ada, Gnal does not support the separation of interface and

implementation but combines the two in a single compilation unit. Thus when a Gnal module is compiled, the result is both a relocatable object file and a symbol table to be used in compiling clients of the module. Consequently, a new version of the symbolic information transmitted between compilations is produced whenever any change is made to a module.

Figure 2.11 shows a partial manufacturing graph schema for the generation of a lexical analyzer. The schema shows only those manufacturing steps necessary for the generation of the lexer compiled interface "lexer.sym". It does not show all the inputs and outputs of every manufacturing step, omitting the Gnal compiler as an input and the relocatable object file produced as an output of every compilation step. These omissions are necessary to fit the graph on one page. In addition, the names of components that are used in more than one place are repeated rather than fill the page with a tangle of crossing lines. Consequently, the names of primitive components are printed in capital letters to distinguish them from derived components. These primitive components include Gnal source files (with extension ".gnl") as well as symbol table files (with extension ".sym") imported from other subsystems.

The schema for the lexer interface contains 35 manufacturing steps. Of these, 29 steps are Gnal compilations; the remaining 6 steps, shaded in the figure, are performed by the tools leg, tbg, idl and bh. While the maximum depth of the graph is 7 steps, and the maximum fan-out of any one step is 8 steps, the lexical analyzer is just one subsystem of many used to generate a compiler front end. However, because most of the lexical analyzer subsystem is target language independent, only the 8 steps that depend on input "L.FEG" must be re-executed when the user changes the description of the language to be analyzed. A user's view of the generation of a lexical analyzer would hide the whole manufacturing process in an encapsulated subgraph.

2.3.4 Bootstrapping the Mini IDL Tools

IDL [49] is a language for describing structured data. An IDL translator, supported by an IDL run-time system, provides facilities for reading, writing and manipulating instances of the structures described in an IDL specification. The Mini IDL system [49, Part IV] is a spare implementation of an IDL subset, written in and targeted to C. The system consists of an IDL translator, idl, and a structure instance linker, rtgen. These tools communicate using an IDL data structure, the IDL symbol table. When the description of that structure changes, their remanufacture involves a bootstrap operation.

Figure 2.11" A Partial Schema for the Generation of a Lexical Analyzer 56 CHAPTER 2. THE MODEL

Figure 2.12: A Manufacturing Graph Schema for Mini IDL

Figure 2.12 shows a partial schema for the bootstrapping of the Mini IDL translator and structure instance linker. The components involved in the bootstrap are circled in the figure. In addition to the tools idl and rtgen, the bootstrap involves the description of the structure of the IDL run-time symbol table ("ist.idl") and ascii and binary instances of the symbol table for the symbol table structure ("ist.ast" and "ist.o"). This is the source of the circularity in the bootstrap.

In Step 1 of the figure, when the IDL description of the symbol table changes, the existing IDL translator is used to generate an ascii symbol table for the new structure as an instance of the old structure. In Step 2 this ascii instance is converted to a linkable binary instance by the existing structure instance linker. The binary instance is then linked with the object code for the structure instance linker in Step 4 to produce an intermediate structure instance linker that manipulates instances of the new symbol table structure. Meanwhile, the ascii symbol table generated by Step 1 is hand-edited in Step 3 transforming it into an instance of the new symbol table suitable for processing by the intermediate structure instance linker in Step 8. This produces a binary instance of the new symbol table for the new symbol table. This is linked with object files for the IDL translator and structure instance linker that have been recompiled using interfaces to the new symbol table structure produced by Step 1. The result is new versions of the translator and structure instance linker that communicate instances of the new symbol table structure.

Since multiple components have the same name in the Mini IDL bootstrap, its schema cannot be directly encoded by make or by any other system that describes a schema solely in terms of filenames and extensions. This is true also for any manufacturing operation in which there is a cyclic dependency.

2.4 The Instantiation of a Software Configuration

Because a manufacturing graph traces the primitive components of a configuration through each stage of their transformation into a set of software products, there can be no question that the products of the graph represent the primitive components and that product behavior can be explained by looking at those primitives. When a primitive component changes, as long as all the variables in the resulting schema are remanufactured in the proper order, it is equally certain that the new products will represent the new set of primitives. However, some of the manufacturing steps necessary to reinstantiate the new configuration may be redundant. The price of this security may be considerable inefficiency in manufacture. For example:

1. When instantiating a graph schema derived from an existing configuration, it is unnecessary to execute any manufacturing step whose inputs are identical in value to those of the corresponding step in the original configuration. If the step in question happens not to be repeatable, then it is probably unwise to reexecute it.

2. Because the inputs to two manufacturing steps need not be identical for their outputs to have identical values, it is further unnecessary to perform any step in the new schema whose outputs would be identical in value to those of the corresponding step in the original configuration, regardless of the values of its inputs.

3. Because the differences between two outputs may be an artifact of processing that has no bearing on the function or performance of any resulting product and may disappear after further processing, it is further unnecessary to execute any step or sequence of steps that does not eventually result in a material difference in a product.

2.4.1 Change, Context and the Incidence of Redundant Manufacture

When a primitive component is changed, the changed component replaces the original in a manufacturing graph schema. Since the component may be used in several places, the eventual impact of the substitution depends not only on the type of change but also on each context in which the changed component is used. Conventional manufacturing strategies treat all changes as important in all contexts. Sometimes, however, simple tests (like the bitwise comparison of two components) are sufficient to prevent redundant manufacturing operations; at other times, specialized knowledge may be required to accomplish the same purpose.

Before characterizing such tests further, this section considers several specific cases where selective remanufacturing decisions can be made. While the following examples are drawn mostly from C and other UNIX tools, which are likely to be familiar to most readers, comparable examples can be found for any other language or system. In each example, it is presumed that the only change to a configuration is the one mentioned. When more than one change occurs at a time, potential interactions between the changes must be considered. Each of the specific techniques mentioned in this section is reviewed in more detail in Section 1.4.2.

If there are no differences between a component and its replacement in a manufactur- ing graph schema, so that only the identity of the component and not its value has changed, then the substitution will have no impact regardless of context. Whenever a component's value is changed, however, no matter what the component or what the change, there may be some circumstances in which the substitution has an effect and others in which it does not.

The effect of some changes is the same for whole classes of manufacturing steps:

• When a programmer changes a comment in a C .h file, it is safe to suppress the recompilation of all including .c files because the change will not affect any generated code. However, specially formatted comments are often used to embed directives or other information used by other tools. (One such tool is lint.) Noting that a comment has changed is not a sufficient condition to suppress steps involving these tools.

The effect of other changes must be evaluated in each context where the changed component is used:

• When a programmer changes the definition of a name declared in a .h file, it is safe to suppress the recompilation of those .c files that do not reference the name either directly or indirectly. However, each .c file has to be inspected individually to determine whether it falls in this category. This is what Tichy's smart recompilation mechanism does.

• For certain changes to names declared in .h files, it is safe even to suppress the recompilation of some referencing .c files. For example, if a programmer appends a field to a C structure definition, it is necessary only to recompile those clients that reference the size of the structure, either by allocating storage or by doing pointer arithmetic.

If, due to alignment requirements, the addition of the field does not change the size of the structure, it may not be necessary to recompile any client. While the effect of such a change is comparable to changing a comment, recognizing this requires much more intimate knowledge of the compiler's behavior. Techniques proposed by Rudmik and Moore, by Dausmann and by Cooper, Kennedy and Torczon might identify redundant compilations of this nature. (A sketch of the size-dependence distinction follows.)
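A hedged sketch of the size-dependence distinction, with invented types and clients:

    /* shape.h, new version -- the field "tag" has been appended */
    struct shape {
        int x, y;
        int tag;                      /* appended field */
    };

    /* client_a.c -- only dereferences existing fields through a pointer; the
     * generated code does not depend on sizeof(struct shape), so its
     * recompilation could, in principle, be suppressed                         */
    int move_right(struct shape *s)
    {
        return ++s->x;
    }

    /* client_b.c -- allocates storage for the structure and therefore
     * references its size; it must be recompiled when the size changes         */
    #include <stdlib.h>
    struct shape *new_shape(void)
    {
        return calloc(1, sizeof(struct shape));
    }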

Sometimes it is possible to apply a simple test to the outputs of a manufacturing step after it has been executed rather than try to predict, in advance, the effect of a change in its inputs. Rain's technique for avoiding trickle-down compilations in Mary2 does this, as does the refines predicate for Gnal. "Write-if-changed" file system interfaces institutionalize this mechanism by creating a new version of an output file only when it differs from the existing version of the same file.
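A minimal sketch, not drawn from any particular system, of what such a write-if-changed helper might look like in C:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Rewrite "path" only when its current contents differ from "data" (or when
     * it does not yet exist), so that downstream steps keyed off the file are
     * not triggered gratuitously.  Returns 0 on success, -1 on error.           */
    int write_if_changed(const char *path, const char *data, size_t len)
    {
        FILE *f = fopen(path, "rb");
        if (f != NULL) {
            char *old = malloc(len + 1);
            size_t n = old ? fread(old, 1, len + 1, f) : 0;
            int same = (old != NULL && n == len && memcmp(old, data, len) == 0);
            free(old);
            fclose(f);
            if (same)
                return 0;             /* unchanged: keep the old version */
        }
        f = fopen(path, "wb");
        if (f == NULL)
            return -1;
        if (fwrite(data, 1, len, f) != len) { fclose(f); return -1; }
        return fclose(f) == 0 ? 0 : -1;
    }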

Often it is easier to determine from a changed input whether a change to an output will be significant or not:

• Existing programs will not be affected when a new routine is added to a subroutine library. However, the internal structure of the library may change dramatically if the new routine affects the order in which its constituent routines are collated. Since specialized knowledge of the format of the library would be required to recognize the nature of the differences between the old and new versions of the library, it might be easier simply to recognize that a new routine was added in the first place.

• A program verification system might be able to prove that the semantics of two particular subroutines are equivalent. When the same two subroutines are embedded in two versions of the same module and compiled using an optimizing compiler, it may be difficult even to isolate the corresponding sequences of generated code, let alone determine that the two object modules are functionally equivalent.

Some changes may have an effect on one output of a manufacturing step, but not on another:

• A programmer may continue to make changes to the semantic actions associated with the productions of a yacc grammar long after the syntax of the language it describes has stabilized. Such changes will be reflected in yacc's output y.tab.c, but output y.tab.h will be unchanged. Therefore, while the yacc step itself must be executed, any recompilation triggered only by the new version of y.tab.h is unnecessary.

Changes that affect y.tab.c but not y.tab.h are perceived to occur frequently enough so that make users arrange to selectively copy y.tab.h to a surrogate only when it differs from its predecessor. Client files that depend on the surrogate do not need to be recompiled every time y.tab.h is regenerated. This strategy simulates write-if-changed.

Finally, some manufacturing decisions cannot be made one step at a time:

• Suppose a .h file is used by two programs, one of which needs a field added to a C structure definition contained therein. As long as all the .c files of the second program that reference the structure definition continue to use the old version of the .h file, none need to be recompiled. As soon as one of those files is recompiled with the new .h file, however, all will have to be recompiled.

Schwanke and Kaiser's smarter recompilation mechanism applies this reasoning to separate partitions of a single program. As long as the modules in two partitions do not communicate using the changed structure, it is permissible for each to use a different version of the .h file.

2.4.2 What it Means for Two Products to be Effectively Indistinguishable

With manufacture from scratch as a standard, it is permissible to use selective techniques such as those described above when instantiating a manufacturing graph schema as long as the resulting products are effectively indistinguishable from comparable products manufactured from scratch. It remains to define, however, what it means for two products to be effectively indistinguishable. For example, requiring bitwise identity is probably too strong a condition and may be impossible to achieve if the representation of a product includes a timestamp.

A software product can be many things: the source code for an interface, a subroutine library, an executable program, an operations manual, etc. It might be destined for a specific use by a single programmer (for example, an executable might be built to test the feasibility of a particular solution to a bug) or it may be released to a larger community to be used as each member of the community sees fit. These different uses place different requirements on the underlying manufacturing system. In the first case, the programmer may not be concerned with any properties of the program that do not pertain to the solution of the bug and might be willing to take chances with the manufacturing process as long as they do not compromise what he can learn about the bug.4 The second case demands complete reliability of the manufacturing process.

Rather than require a single standard for all selective manufacturing techniques, the model posits an external test that could be applied to the products of a schema as instantiated and to the products of the same schema manufactured from scratch. As long as the test cannot distinguish between the two sets of products, they are considered to be the same. In this way, constraints on the manufacturing process may vary according to the needs of a project.

4 There are certain techniques, discussed below, that are not guaranteed to be reliable under all circumstances. If these techniques should fail, however, recovery may be as simple as changing the name of a declaration.

Given such a test, the objective of schema instantiation is to predict as early as possible in the manufacturing process whether a given step or step sequence will have an effect on its outcome. If not, then the given step or step sequence need not be executed. Making such predictions is the function of difference predicates.

The test need not exist in fact, but only in principle. While it may access only the products of a configuration themselves, not their derivation histories, it may compare those products in any way deemed important. For example, the test can compare representations or subject products to a battery of tests for functionality or performance. It may even consult an oracle that predicts their future uses. Any differences in properties not tested do not matter.

In what follows, any detectable difference in the function or performance of a product is considered to be significant. Thus differences in representation are permitted but not differences in content. It should be noted that stronger criteria such as equivalence in representation imply equivalence in functionality and performance. The ramifications of weaker criteria are not in general considered in the remainder of the thesis.

2.4.3 Difference Predicates

When a manufacturing graph schema is derived from the graph representing an already instantiated configuration, the instantiation of the schema can be treated as a decision making process. The inputs of each to-be-instantiated step are compared with the inputs of the corresponding step in the original configuration to determine if new outputs need to be manufactured or if the already existing outputs can be reused. 5 This decision is made by consulting a difference predicate. The predicate is applied to the set of original inputs and the set of new inputs. If the predicate evaluates to true then the existing outputs are guaranteed to produce satisfactory products for the new inputs. If the predicate evaluates to false then no such guarantee can be made.

When a difference predicate is true, the outputs of the already instantiated step are said to be compatible with the inputs of the to-be-instantiated step.
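As a minimal sketch, using invented types rather than the thesis's notation, a difference predicate can be modeled as a function over the original and new input sets; when it holds, the previously manufactured outputs are appropriated:

    #include <stddef.h>
    #include <string.h>

    struct component {
        const char *unique_id;
        const char *value;
    };

    /* True when the already existing outputs are compatible with the new inputs. */
    typedef int (*difference_predicate)(const struct component *old_inputs,
                                        const struct component *new_inputs,
                                        size_t n);

    /* One possible predicate: every input has the same value as before
     * (this corresponds to the predicate called SAME VALUE below).              */
    static int same_value(const struct component *old_inputs,
                          const struct component *new_inputs, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (strcmp(old_inputs[i].value, new_inputs[i].value) != 0)
                return 0;
        return 1;
    }

    /* Instantiating one step: reuse the existing outputs when the predicate
     * holds, otherwise run the manufacturing step on the new inputs.            */
    static void instantiate(difference_predicate p,
                            const struct component *old_in,
                            const struct component *new_in, size_t n,
                            void (*manufacture)(const struct component *, size_t))
    {
        if (!p(old_in, new_in, n))
            manufacture(new_in, n);   /* outputs must be (re)manufactured */
        /* else: the outputs of the corresponding original step are appropriated */
    }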

A manufacturing step instantiated by the successful application of a difference predicate might be represented as in Figure 2.13. The figure shows that while "y.tab.c" and "y.tab.h" were manufactured from an original set of inputs including "grammar.y", they are compatible with a new set of inputs including "grammar.y'". While Figure 2.13 appears to be schematic, it is not: one must assume that each name in the figure represents a unique component, not a variable.

Figure 2.13: The Successful Application of a Difference Predicate (yacc and /usr/lib/yaccpar produce y.tab.c and y.tab.h from grammar.y; those outputs are compatible with the changed input grammar.y')

5 In principle the to-be-instantiated step can be compared with any manufacturing step to find suitable outputs. Such outputs are simply most likely to be found in the corresponding step of the previous configuration.

When instantiating a manufacturing step using already existing outputs, it is essential to remember that while the outputs may be compatible with the new inputs, they were actually derived from the original inputs. This is important because it makes it possible to reproduce the schema as instantiated.6 It is even more important when processing successive changes to the same component. For example, a predicate may not produce the same result when it compares "grammar.y" and "grammar.y''" as when it compares "grammar.y'" and "grammar.y''".

It is important to note that difference predicates are not necessarily symmetrical. For example, when a new declaration is added to an interface, the new version can transparently replace the old everywhere it is currently used (ignoring the possibility of redeclaring an already defined name). In contrast, when an existing declaration is deleted from an interface, there is no guarantee that all the current clients of the interface, which may reference the name that was deleted, can use the new version.
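A small, invented C illustration of this asymmetry:

    /* widget.h, original version */
    extern int open_widget(void);
    extern int close_widget(void);

    /* widget.h, after an addition: existing clients still compile, so the new
     * version can transparently replace the old one (barring a clash with an
     * already defined name)                                                     */
    extern int open_widget(void);
    extern int close_widget(void);
    extern int resize_widget(int w, int h);

    /* widget.h, after a deletion: a client that still calls close_widget() no
     * longer compiles, so the substitution is not guaranteed in this direction  */
    extern int open_widget(void);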

The Strength of Predicates

Difference predicates model manufacturing strategies ranging from manufacture from scratch (the constant predicate FALSE), to an idealized make (the predicate that compares components' unique identifiers), to the most elaborate mechanism for selective recompilation.

6 A difference predicate may guarantee that the schema can be remanufactured to produce products that are equivalent in function and performance (for example), but it may not guarantee that those products are equivalent in representation.

If a given predicate is always true whenever a second predicate is also true, then either the two predicates are equivalent or the second predicate is stronger than the first. Stronger predicates make finer distinctions between components than weaker predicates, so a change that is treated as compatible by a weaker predicate might be treated as incompatible by a stronger predicate. For example,

• The predicate FALSE is the strongest predicate. It requires the outputs of every step to be manufactured regardless of inputs.

• The predicate SAME COMPONENT is weaker than FALSE; it is true whenever the components it compares have the same unique identifiers.

• The predicate SAME VALUE is weaker than SAME COMPONENT. It is true not only whenever two components have the same unique identifiers; it is also true whenever they have the same value.

Any other predicate will be weaker than SAME VALUE and will require some knowledge about the components it evaluates. The selective manufacturing techniques discussed in Section 2.4.1 are good examples of such predicates and form a hierarchy of sorts of their own.
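The three strong predicates might be sketched as follows (an invented representation in which a component carries only a unique identifier and a value):

    #include <string.h>

    struct component { const char *unique_id; const char *value; };

    static int pred_false(const struct component *old_c, const struct component *new_c)
    {
        (void)old_c; (void)new_c;
        return 0;                                    /* always remanufacture */
    }

    static int pred_same_component(const struct component *old_c, const struct component *new_c)
    {
        return strcmp(old_c->unique_id, new_c->unique_id) == 0;
    }

    static int pred_same_value(const struct component *old_c, const struct component *new_c)
    {
        return pred_same_component(old_c, new_c) ||
               strcmp(old_c->value, new_c->value) == 0;
    }

Whenever a stronger predicate accepts a pair of components, every weaker predicate accepts it too, which is exactly the ordering described above.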

It should be noted that not all predicates are comparable. Some may not be applicable to the same sets of components; others may not be consistently weaker or stronger on all sets of inputs. Under unusual circumstances, for example, Dausmann's predicate and Schwanke and Kaiser's predicate may require the recompilation of disjoint sets of files.7

Partial Predicates

It is in general necessary to reevaluate each changed component in each context in which it is used to determine whether the change is significant in that context. This is particularly important when there are changes to more than one component used in the same context. For certain kinds of changes, however, it is possible to compare two versions of a single component in isolation and make provisional decisions about a whole class of steps. Changes to comments in interfaces provide one such opportunity; after such a change, none of the interface's clients need to be recompiled.

7 Schwanke and Kaiser's predicate requires the recompilation of all units in a given partition that reference a changed name. Dausmann's predicate requires the recompilation of all units that reference a changed attribute of a changed name. It may be the case that the only units that reference the attribute in question are not in the partition that Schwanke and Kaiser's predicate recompiles. See Section 1.4.2 for more information on these predicates.

A predicate that makes a decision about a single component in isolation is a partial difference predicate. If partial predicates on each of the inputs of an applicable manufacturing step all evaluate to true, then the step can be instantiated using already existing outputs.

The strong predicates of the previous section (FALSE, SAME COMPONENT, and SAME VALUE) are just as effective when used as partial predicates as they are when used as general predicates. Because they have access to less information, partial predicates are constrained to be less selective, and therefore stronger, than general predicates. They are cheaper to implement for the same reason. Assuming that an unchanged component can always be substituted for itself, the advantage of partial predicates is that only changed components have to be evaluated and that each changed component must be processed only once, no matter how many times it is used.
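A minimal sketch of how partial predicates compose, again over an invented component representation: a step's existing outputs may be reused only if every changed input passes the partial predicate.

    #include <stddef.h>
    #include <string.h>

    struct component { const char *unique_id; const char *value; };

    /* A partial difference predicate examines one original/new pair in isolation. */
    typedef int (*partial_predicate)(const struct component *old_c,
                                     const struct component *new_c);

    /* Unchanged components pass trivially; every changed input must be accepted
     * by the partial predicate for the step's outputs to be reusable.            */
    static int step_outputs_reusable(partial_predicate p,
                                     const struct component *old_in,
                                     const struct component *new_in, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int unchanged = strcmp(old_in[i].value, new_in[i].value) == 0;
            if (!unchanged && !p(&old_in[i], &new_in[i]))
                return 0;
        }
        return 1;
    }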

Figure 2.14: The Successful Application of a Partial Difference Predicate

Figure 2.14 shows how a partial predicate might be represented. Again, one must assume that each name in the figure represents a unique component.

Predicate Strength and Component Granularity

Sometimes it is possible to simulate a weaker general predicate with a stronger partial predicate by changing the granularity of the components in the manufacturing graph. For example, make uses an approximation of the partial predicate SAME COMPONENT. Barring gratuitous changes, make would recompile exactly the same .c files as Tichy's smart recompilation predicate if each included .h file consisted of exactly one declaration. This is effectively the strategy used by SMILE.

Approximate Predicates

In practice a predicate that considers every eventuality is frequently too strong (it fails to detect many compilations that are in fact redundant) or too expensive. Sometimes, it may be cost effective to use a predicate that only approximates the truth. Such predicates are based on assumptions about how programmers program and about how changes are made.

A particularly interesting class of predicates are the partial predicates that make assumptions about the context in which a component will be used. One such predicate was implemented by the Tartan tool refines, used in the recompilation of Gnal programs. Refines permitted upwardly compatible additions to an interface without requiring the compilation of its clients. Because it independently compared each compiled interface (.sym file) with its predecessor in isolation, however, there was no guarantee that such additions would not conflict with names declared in other files. Refines was a successful tool despite this loophole because (1) individual programmers do not in general define the same name in the same scope of their own code, because (2) to compensate for a flat link-time name space, Tartan had instituted naming conventions for combining subsystems written by multiple programmers, and because (3) the cost of recovery from a failure of the predicate was low.

While the use of a predicate like refines might postpone the detection of a name conflict, such conflicts are unlikely to result in the wrong name being used. Existing references will continue to use the original declaration; where there is no conflict, new references will use the new name; and potentially conflicting new references will be caught by subsequent compilations. When the conflict is detected, the remedy is simply to change the offending name.

While approximate predicates are not appropriate for preparing releases, they may be cost effective at other times during development. When the predicate encounters a change having the potential for an error, it may be prudent for it to supply a warning. Approximate partial predicates may be particularly important for C where there is no guarantee that it is even possible to parse a .h file independently of its compilation context.

2.5 The Cost of Selective Manufacture

In the prevailing view, the costs of software manufacture are seen as inevitable. A project can choose to accept the performance penalty of programming in an interpretive environment (in lisp) where there are essentially no manufacturing costs or expect to endure the normal delays of system regeneration. Such projects may use make for

day-to-day operations and resort to manufacture from scratch for epochal builds. While waiting for recompilations, clever programmers may devise explanations and stratagems to defeat the worst of the delays; but such attempts are usually shortsighted. (They are comparable to trying to optimize a program without profiling its performance.)

Specific smart recompilation mechanisms have been built to deal with the perceived problems of separate compilation systems. In keeping with these mechanisms, the model developed in this chapter could be used to justify implementing the weakest predicate possible in any given programming environment to minimize the number of manufacturing steps that would have to be performed. The model also offers an alternative.

With the notable exception of the BNR Pascal implementation, designed to eliminate the dependencies that trigger secondary recompilations, little has been done to systematically understand and control manufacturing costs. The model provides a point of view and a framework for asking the questions needed to do exactly that. This is demonstrated in the case studies of the next two chapters.

The strength of the model is that it is technology independent. It is applicable to any multistep manufacturing process that manipulates discrete components, regardless of the tools involved or the underlying computational environment. 8 The same paradigm applies whether manufacturing steps are executed in parallel or sequentially, or if there is a dramatic shift in storage versus computation costs. Differences in technology raise new questions about the manufacturing process and result in different cost/benefit tradeoffs.

8 The one area where the model is not particularly suited is in situations where each action updates the global state of some database.

Chapter 3

An Analysis of Compilation Costs

Once we admit the use of an explicit test (a difference predicate) to determine whether a given manufacturing operation has to take place, we challenge conventional approaches to software manufacture that treat all changes as if they had the same impact, raising obvious questions about how well conventional techniques perform and about what alternatives might be more effective:

• How much software manufacture done in response to a typical change is really necessary and how much is redundant?

• What techniques might be appropriate for detecting this redundancy?

If, for example, make rarely initiated an extraneous compilation, there would be little point in appealing to a more selective test. If, on the other hand, a sizable fraction of the compilations initiated by make proved to be redundant, it would make sense to ask what predicates most effectively identify those redundant compilations.

This chapter reports on a case study comparing the performance of seven difference predicates on 190 revisions recorded in the change history of the Descartes crostic client. While the study would be valuable simply as an exercise in evaluating predicate performance, there is more to it than that. Although the study is based on only one system, its results are sufficiently pronounced to suggest that it indeed does make sense to challenge conventions in software manufacture and that the gains to be had from doing so can be surprisingly large.


Figure 3.1: A Simple Categorization of 572 Compilations (171 necessary compilations, 217 compilations due to declarations, 156 compilations due to unused .h files, 28 otherwise gratuitous compilations).

Figure 3.1 categorizes the 572 compilations initiated by make in response to the analyzed changes. The figure shows that:

• Fewer than one-third of the compilations were actually necessary.

• Almost two-fifths of the redundant compilations could be attributed simply to .h files that were included but unused. While this is largely an artifact of Descartes programming conventions, it should serve as a cautionary example.

• Almost all the remaining redundant compilations could be detected simply by using cross-reference information as in Tichy's smart recompilation algorithm.

• Discounting the redundant compilations due to superfluous .h files, make still recompiled more than twice the number of files necessary.

The data in Figure 3.1 is based on changes to both .h and .c files. The figure does not show that when only .c files changed, few of the compilations initiated by make were unnecessary.

The lessons of the Descartes study are clear: superfluous include files can contribute significantly to recompilation costs; make is effective when there are no interface changes; and when there are interface changes, smart recompilation will find 9 out of every 10 redundant recompilations.

The next section gives an overview of the study by presenting a simple example that illustrates the key factors of the analysis. The next two sections provide background information relevant to the interpretation of the results. Section 3.2 describes each of the seven predicates and explains why it was chosen for the study. Section 3.3 characterizes the Descartes project, its programming conventions and its change history. Section 3.4 describes the methods of the study, including what data was collected and how it was obtained. The results of the study are given in Section 3.5, and the chapter concludes by considering what those results mean. The impatient reader may want to read the overview in Section 3.1, consult Table 3.3 (on page 78) which summarizes the predicates, and then fast forward to Section 3.5, returning to the more detailed descriptions of the study as interest warrants.

Related Work

Many selective recompilation mechanisms have been proposed or implemented in the last decade and experienced users will attest to their benefits. However, the only other attempt to quantify the effectiveness of these techniques is a recently completed study by Adams, Weinert and Tichy [1] that compares the performance of Tichy's smart recompilation method with conventional recompilation in Ada. Although Adams and company analyzed a larger body of code written in a different language, their main result is the same as that reported here. Smart recompilation saves about half of the recompilations required by conventional methods.

Adams, Weinert and Tichy also measured the performance of cutoff recompilation. They report that about half the redundant compilations detected by smart recompilation could be found by comparing the new and old outputs of an interface compilation to determine whether subsequent client compilations must be performed. This strategy is not appropriate for independently compiled languages like C.

3.1 An Overview of the Study

The amount of manufacture necessary to restore consistency to a system after a change is bounded by the structure of its manufacturing graph. The amount that is redundant depends on what has changed and how the change is used. The effectiveness of any manufacturing strategy depends not only on its ability to discriminate between necessary and redundant recompilations but also on how often the latter occur.

The following simple example illustrates how three manufacturing strategies might be compared.

A Simple Example

Consider a simple software system that consists of the following components (all C files; a minimal sketch of their contents appears below):

• File definitions.h defines names A, B and D.

• File client1.c includes definitions.h and uses name A.

• File client2.c includes definitions.h and uses names A and B.
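The contents of these files are not given in the text; the following minimal sketch, with an invented variable A, macro B and type D, merely makes the stated name-usage pattern concrete.

    /* definitions.h (hypothetical contents): defines names A, B and D.   */
    extern int A;                    /* name A: a global variable          */
    #define B 512                    /* name B: a preprocessor constant    */
    typedef struct { int x; } D;     /* name D: a type used by no client   */

    /* client1.c: includes definitions.h and uses only name A.            */
    #include "definitions.h"
    int scaled_a(void) { return A * 2; }

    /* client2.c: includes definitions.h and uses names A and B.          */
    #include "definitions.h"
    int padded_a(void) { return A + B; }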

Under the BIG BANG¹ approach to software manufacture, every change is a cataclysm. This approach requires that the entire system be remanufactured whenever any part of it changes. Thus, no matter which of the above files were to change, BIG BANG would require that both client1.c and client2.c be recompiled. The advantage of this strategy, at least for C, is that it is not necessary to keep track of the dependencies between files.

In contrast to the BIG BANG approach, typical conservative remanufacturing strategies require the recompilation of only those components that have themselves changed or that depend on a component that has changed. Under the MAKE approach (named in honor of the tool), if the file definitions.h were to change then both client1.c and client2.c would have to be recompiled. However, if either .c file were to change, only that .c file would have to be recompiled. If changes to each of the three files were equally likely, this approach would average 1.3 compilations for every 2 compilations required by BIG BANG, saving 1 compilation out of every 3.

The model suggests that even these conservative approaches to software manufacture may not be particularly cost effective. A third approach to software manufacture recognizes that every change to a .h file does not necessarily affect every .c file that uses it. The NAME-USE approach (smart recompilation) considers the effect of each change separately. In the example, both client1.c and client2.c might be affected by a change in the definition of A; so if A were to change, both files would have to be recompiled. However, if the definition of B were to change then only client2.c would need to be recompiled; and if D were to change, then neither .c file would have to be recompiled. Assuming changes to A, B and D were equally likely, NAME-USE would average only 1 compilation for every 2 required by MAKE whenever definitions.h changed.

¹BIG BANG is a more descriptive name for the predicate FALSE.

Average Number of Compilations per Change

               definitions.h   client1.c   client2.c   average
  BIG BANG           2             2           2          2
  MAKE               2             1           1         1.3
  NAME-USE           1             1           1          1

Table 3.1: A Comparison of Three Approaches to Software Manufacture

The differences between these three approaches are summarized in Table 3.1. The table shows the average number of compilations required by each approach for a single change to any one file, assuming that changes to each file are equally likely, and that changes to any definition within the .h file are also equally likely.

Because compilation costs are generally proportional to the size of a compilation unit, it might be more appropriate to compare the number of lines of code compiled by each approach instead of simply counting compilation units. In C the size of a compilation unit is the sum of the sizes of the .c file and each included .h file. If the sizes of the components in this example were as follows:

• File definitions.h: 6 lines of code.

• File client1.c: 35 lines of code.

• File client2.c: 50 lines of code.

then the size of the compilation unit consisting of client1.c and definitions.h would be 41 lines of code, and the size of the unit consisting of client2.c and definitions.h would be 56 lines. Table 3.2 shows the results of using these numbers to compute the relative costs of each approach as in Table 3.1.

Average Number of Lines Compiled per Change

               definitions.h   client1.c   client2.c   average
  BIG BANG          97             97          97         97
  MAKE              97             41          56         65
  NAME-USE          51             41          56         49

Table 3.2: The Same Comparison Based on Number of Lines Compiled

BIG BANG requires that both client1.c and client2.c be recompiled always. The sum of their sizes is 97 lines of code. MAKE requires that both units be recompiled when definitions.h changes, but only requires one client to be recompiled when that client changes. The average number of lines of code compiled, when changes are equally distributed among the files, is (97 + 41 + 56)/3 or 65 lines. NAME-USE requires both clients to be recompiled when the definition of A changes in definitions.h. It requires only client2.c to be recompiled when the definition of B changes, and it requires no recompilations when D changes. Thus, the average number of lines of code compiled when definitions.h changes is (97 + 56 + 0)/3 or 51 lines and the average overall for NAME-USE is (51 + 41 + 56)/3 or 49 lines of code. Since the sizes of the two compilation units in the example are so close, the relative costs of the three strategies are the same whether measured in lines of code or measured in compilation units.
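The averages in Tables 3.1 and 3.2 follow mechanically from the file sizes and the equal-likelihood assumptions; the short program below (not part of the study's tooling) reproduces them.

    #include <stdio.h>

    int main(void)
    {
        /* Component sizes from the example, in lines of code.             */
        int def_h = 6, client1 = 35, client2 = 50;
        int unit1 = client1 + def_h;   /* compilation unit 1: 41 lines     */
        int unit2 = client2 + def_h;   /* compilation unit 2: 56 lines     */
        int both  = unit1 + unit2;     /* recompiling everything: 97 lines */

        /* Lines compiled when definitions.h changes.  NAME-USE averages   */
        /* over the equally likely changes to A (both units), B (unit 2    */
        /* only) and D (nothing).                                          */
        double make_h     = both;
        double name_use_h = (both + unit2 + 0) / 3.0;   /* 51 lines        */

        /* Average over equally likely changes to the three files.         */
        printf("BIG BANG: %4.0f lines per change\n", (both + both + both) / 3.0);
        printf("MAKE    : %4.0f lines per change\n", (make_h + unit1 + unit2) / 3.0);
        printf("NAME-USE: %4.0f lines per change\n", (name_use_h + unit1 + unit2) / 3.0);
        return 0;
    }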

Clearly this small example grossly oversimplifies the problem of analyzing the differences between approaches to software manufacture. It does, however, illustrate the relevant factors in that analysis.

• For any given compilation model, the first factor is the actual interconnection patterns of a software system as expressed within that model. Here the compilation model is that of C, and the interconnection patterns are the file inclusion and name usage information given in the example. (For a discussion of how C is compiled see Section 2.3.1.)

• The second factor is the actual pattern of changes that require the system to be remanufactured. In the example above, it was assumed that changes were equally distributed among all the files in a system and furthermore, that when a definition file changes, only one change happens at a time and each name is equally likely to change.

• Finally we need a basis of comparison. Here two metrics are used: the average number of units and the average number of lines recompiled by each approach.

For any given software system and history of changes, one can analyze the actual interconnection structure and pattern of changes to determine what difference predicates are effective for detecting redundancy in the manufacture of that system. In the following sections, I describe how I did this with the Descartes data, and what I learned as a result.

An Aside Concerning Predicate Evaluation Costs

The unit and line count metrics used in the simple example above, and in the Descartes study itself, reflect only the compilation costs associated with each predicate. This is only one of two factors necessary to completely assess predicate performance. The other factor is the cost of evaluating the predicates themselves. This cost can vary widely and depends heavily on predicate implementation. In general, however, the more discriminating the predicate, the more expensive it is to evaluate. For example, the predicate BIG BANG costs nothing to evaluate. The predicate MAKE requires comparing the unique identifier (or what serves as such on any given system) of each primitive component in the to-be-instantiated schema with the unique identifier of the corresponding component in the already instantiated manufacturing graph. Failure of this comparison is an appropriate precondition for the application of weaker predicates including NAME-USE. While the cost of NAME-USE is incurred only when MAKE fails, its evaluation is based on parsing changed files to identify changed names. NAME-USE also requires the maintenance of a cross-reference database to find where changed names are used.

Although the data reported here does not reflect predicate evaluation costs, it would not be difficult to factor in such costs for specific implementations of the predicates analyzed.

3.2 The Seven Difference Predicates

The seven predicates selected for study represent conventional approaches to software manufacture as well as plausible alternatives. The following labeled paragraphs describe what each predicate does and why it was chosen for the study. For the convenience of the reader, this information is summarized in Table 3.3. The section concludes with a comparison of the predicates' relative strength.

BIG BANG

BIG BANG requires that the entire system be remanufactured whenever any part of it changes. The advantages of this approach are numerous. BIG BANG costs nothing to implement and nothing to evaluate. It is not necessary to retain any information about previous configurations because there is no need to determine which files may have changed; and, at least with a language like C, it is not even necessary to keep track of the dependencies between files to decide what must be compiled and in what order. There is no chance of failing to recompile a .c file when a .h file it includes changes if the .c file is recompiled in any event.

BIG BANG is a plausible approach for small systems and a necessary one for systems that are out of control due to convoluted dependencies or the inability to identify changes. It is a cost effective choice for any system, however, only if weaker predicates consistently recompile a significant fraction of the system.

In the Descartes change history study, BIG BANG represents an upper bound against which the relative performance of the other predicates is evaluated.

MAKE

In contrast to BIG BANG, typical conservative remanufacture strategies require the recompilation of only those components that have themselves changed or that depend on a component that has changed. MAKE considers any .c or .h file with a new version identifier to have changed. Formally, it is desirable for this version identifier to be a unique identifier; in practice it is often a timestamp. In the Descartes change history study, it is an RCS revision number.

The predicate MAKE does what make would do given an accurate makefile and no tampering with timestamps. MAKE is easy to implement and even easier to evaluate; all it has to do is compare the version identifiers of corresponding files in a predecessor configuration and its successor. The only information that has to be retained from the predecessor configuration is its instantiated manufacturing graph; no provision has to be made for saving the content of any changed file.
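A minimal sketch of that comparison follows; the structure and field names are invented for illustration and are not those of the program described in Section 3.4.3.

    #include <string.h>

    /* One primitive component of an instantiated manufacturing graph.     */
    struct component {
        const char *name;       /* e.g. "comptools.c"                      */
        const char *version;    /* version identifier: here an RCS         */
                                /* revision number such as "1.4"           */
    };

    /* MAKE's difference test: a component may have changed whenever the   */
    /* version selected for the new schema differs from the version        */
    /* recorded for the same component in the predecessor configuration.   */
    int make_may_have_changed(const struct component *predecessor,
                              const struct component *successor)
    {
        return strcmp(predecessor->version, successor->version) != 0;
    }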

MAKE is an effective strategy when most changes to .c files are real, not gratuitous, and when changes to .h files are rare or when they affect every client of the .h file.

In the Descartes change history study MAKE represents standard practice in software manufacture. Compared with MAKE, the remaining predicates show how standard practice might be improved.

GRATUITOUS (HEADERS)

GRATUITOUS (HEADERS) is a partial predicate that detects gratuitous changes in .h files. It behaves identically with MAKE with respect to .c files, but for .h files, GRATUITOUS (HEADERS) suppresses compilations based on null revisions (where the version identifier is the only thing that changes) or on changes confined to comments and whitespace.²

GRATUITOUS (HEADERS) is appealing as a strategy for software manufacture because it is easily implemented and because a single test of a .h file applies to all compilations that include the file. Because GRATUITOUS (HEADERS) must scan both predecessor and successor versions of a changed .h file to localize changes to comments and whitespace, it is necessary to retain old versions of .h files in a form that makes this possible. Alternatively, GRATUITOUS (HEADERS) might be integrated with an editor.

GRATUITOUS (HEADERS) is an effective strategy for software manufacture if there is a high incidence of null revisions or gratuitous changes to .h files, especially if those .h files have many clients.

²It should be noted that GRATUITOUS (HEADERS) is actually an approximate predicate, since one can abuse the C preprocessor in such a way that changes to comments or whitespace become significant. Such abuses did not occur in the Descartes change history and are not especially common in practice.

Compared with MAKE, GRATUITOUS (HEADERS) measures the compilation costs associated with gratuitous changes to .h files.
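One way to realize the test is to normalize both versions of the .h file by discarding comments and whitespace and then compare the results. The sketch below ignores string literals and preprocessor subtleties; it is an illustration of the idea, not the mechanism used in the study.

    #include <ctype.h>
    #include <string.h>

    /* Copy src into dst, dropping whitespace and C comments.  String       */
    /* literals and line splicing are not handled; a production version     */
    /* would have to treat them carefully.                                  */
    static void normalize(const char *src, char *dst)
    {
        while (*src) {
            if (src[0] == '/' && src[1] == '*') {          /* skip comment  */
                src += 2;
                while (*src && !(src[0] == '*' && src[1] == '/'))
                    src++;
                if (*src)
                    src += 2;
            } else if (isspace((unsigned char)*src)) {     /* skip blanks   */
                src++;
            } else {
                *dst++ = *src++;
            }
        }
        *dst = '\0';
    }

    /* The change to a .h file is gratuitous if the normalized texts agree. */
    int header_change_is_gratuitous(const char *old_text, const char *new_text,
                                    char *scratch1, char *scratch2)
    {
        normalize(old_text, scratch1);
        normalize(new_text, scratch2);
        return strcmp(scratch1, scratch2) == 0;
    }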

GRATUITOUS (ALL FILES)

GRATUITOUS (ALL FILES) is a partial predicate that detects gratuitous changes in both .c and .h files. It simply adds to the compilations suppressed by GRATUITOUS (HEADERS) those that result from null revisions, changes in comments or changes in whitespace made to .c files. It is identical with GRATUITOUS (HEADERS) in its treatment of changes to .h files.

Treating GRATUITOUS (HEADERS) and GRATUITOUS (ALL FILES) separately makes it possible to differentiate between gratuitous changes to .h and to .c files. GRATUITOUS (ALL FILES) will be incrementally better than GRATUITOUS (HEADERS) only if the incidence of gratuitous changes to .c files is high. However, the impact of a gratuitous change to a .h file can be significantly greater than the impact of a gratuitous change to a .c file. In the former case, every client of the .h file is spared recompilation (if no other file has changed); in the latter, only one .c file is spared recompilation.

Compared with MAKE, GRATUITOUS (ALL FILES) reflects the compilation costs associated with all gratuitous changes. Compared with weaker predicates (NAME-USE and DEMI-ORACLE, discussed subsequently), it represents the costs that would be incurred by conventional manufacturing technology if no purely gratuitous changes were ever made.

NAME-USE

NAME-USE determines what files need to be recompiled based on what names (that is, definitions) have changed and whether those names are used. It requires the recompilation of a changed .c file only if the change affects a name potentially referenced somewhere in the executable system.³ It requires the recompilation of a .c file that includes a changed .h file only if the change to the .h file affects a name that is used by the .c file. For changes to .h files NAME-USE is equivalent to smart recompilation.

³In practice this kind of change proved not to be important; it spared a handful of compilations based on changes to a single .c file that was linked in the crostic executable but never referenced.

NAME-USE is appealing as a strategy for software manufacture since it changes the basis upon which recompilation decisions are made from files to individual definitions. While it can be computed solely based on cross-reference information, NAME-USE can be costly both to implement and to evaluate. It requires that each changed file be evaluated in each context in which it is used to determine what definitions have changed and whether a changed definition is used. Naive implementations may require that each compilation unit potentially affected by a change be parsed and processed independently. More sophisticated implementations might use incremental techniques to update a common database. Both techniques require that enough information be retained, in some form, about previous configurations to decide what names have changed.

The predicate NAME-USE is effective in reducing the number of compilations beyond GRATUITOUS if the typical set of names changed in a .h file does not affect all clients of the .h file. However, NAME-USE is considerably more expensive to evaluate than the other predicates described so far. It is cost-effective only if the number of compilations suppressed compensates for this evaluation cost.

NAME-USE is the simplest predicate based on cross-reference information. In the Descartes change history study, it differentiates between compilations that are redundant simply because a changed name is not used and those that are redundant for more subtle reasons.
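The decision NAME-USE makes for a client of a changed .h file can be sketched as a lookup in a cross-reference database; the representation below is a hypothetical stand-in for whatever an implementation would actually maintain.

    #include <string.h>

    /* One cross-reference entry: compilation unit `unit' uses name `name'. */
    struct xref {
        const char *name;
        const char *unit;
    };

    /* Recompile `unit' only if one of the names that actually changed in   */
    /* an included .h file appears among the names the unit uses.           */
    int name_use_must_recompile(const char *unit,
                                const char *const changed[], int n_changed,
                                const struct xref db[], int n_db)
    {
        int i, j;
        for (i = 0; i < n_db; i++) {
            if (strcmp(db[i].unit, unit) != 0)
                continue;
            for (j = 0; j < n_changed; j++)
                if (strcmp(db[i].name, changed[j]) == 0)
                    return 1;            /* a changed name is used          */
        }
        return 0;                        /* no changed name is used         */
    }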

DEMI-ORACLE

DEMI-ORACLE behaves identically to NAME-USE, except it considers how a changed definition is used before making the decision to recompile a .c file. It uses advanced static analysis techniques to recompile only those .c files whose behavior might change as a result of a change to a definition in a .h file. For example, DEMI-ORACLE might discover that changing the value of a constant does not change the sense of a test in which the constant is used.

DEMI-ORACLE is appealing as a predicate because it represents a practical limit in suppressing redundant compilations. However, it requires the full static analysis of each compilation unit potentially affected by a change. Since most compilers do not perform such extensive static analysis, DEMI-ORACLE may cost more to evaluate than the compilation steps it suppresses. Incremental techniques may alleviate this problem. However, if a compiler were designed to prevent inconsequential changes to a source file from perturbing the generated object file, it may be cost effective to use a cutoff strategy, simply performing the compilation and then comparing object files, especially if several additional manufacturing steps are contingent on the results of the compilation.

Short of a true oracle, DEMI-ORACLE represents a practical lower bound on recompilation costs against which we can evaluate the performance of the other predicates.

OPTIMIST

The predicate OPTIMIST is an approximate partial predicate that attempts to combine the advantages of GRATUITOUS (HEADERS) with the discrimination of the name-based predicates. It behaves identically with GRATUITOUS (HEADERS), except it permits definitions to be added to or deleted from .h files without triggering the compilation of their clients. Any changes to an existing definition are considered to be important.

OPTIMIST is based on the assumption that programmers do not make inappropriate additions or deletions to .h files: a careful programmer will delete only unused definitions, and new definitions can be ignored until new references are also added. OPTIMIST is inherently unreliable for two reasons. The first reason applies in general to interface changes in any language: programmers are not always careful; the addition or deletion of a definition from a .h file may affect some client of the file. The second reason is specific to C: because of the preprocessor, it may not be possible to correctly determine what definitions have been added or deleted from a .h file independently of the context in which the file is used.

Because Descartes did not exploit the C preprocessor to redefine reserved words and because all the code in the Descartes .h files represents complete type definitions, macro definitions, or procedure or variable declarations, the second problem was not an issue in the Descartes change history study.

The cost of implementing and evaluating OPTIMIST depends on being able to detect additions and deletions in isolated .h files. Its effectiveness depends on how often such additions and deletions are harmless. Its role in the study is to explore the behavior of a predicate that might give a wrong answer. At issue is how OPTIMIST performs relative to GRATUITOUS (HEADERS) and how often it fails to recompile a file recompiled by DEMI-ORACLE.
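Assuming the definitions in the old and new versions of a .h file can be extracted at all (the C-specific difficulty noted above), the comparison OPTIMIST needs can be sketched as follows; the definition record is hypothetical.

    #include <string.h>

    /* One definition extracted from a version of a .h file.                */
    struct definition {
        const char *name;   /* the name the definition introduces           */
        const char *text;   /* its full text, comments and blanks stripped  */
    };

    /* OPTIMIST lets the change pass without recompiling the header's       */
    /* clients as long as every name defined in both versions has identical */
    /* text; names present in only one version (additions or deletions)     */
    /* are ignored.                                                         */
    int optimist_suppresses(const struct definition old_defs[], int n_old,
                            const struct definition new_defs[], int n_new)
    {
        int i, j;
        for (i = 0; i < n_old; i++)
            for (j = 0; j < n_new; j++)
                if (strcmp(old_defs[i].name, new_defs[j].name) == 0 &&
                    strcmp(old_defs[i].text, new_defs[j].text) != 0)
                    return 0;   /* an existing definition changed           */
        return 1;               /* only additions and/or deletions          */
    }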

Predicate                 Description

BIG BANG                  BIG BANG requires that every .c file in the system be recompiled, regardless of whether it or any .h file it includes has or has not changed. Because it always requires the recompilation of every .c file, BIG BANG represents an upper bound on recompilation costs.

MAKE                      MAKE recompiles every .c file that may have changed and every .c file that includes a .h file that may have changed. A .c or .h file may have changed if its version identifier has changed. Formally we would like the version identifier to be a unique identifier; in practice it is usually a timestamp.

GRATUITOUS (HEADERS)      GRATUITOUS (HEADERS) behaves identically to MAKE, except it ignores changes to comments and whitespace in .h files.

GRATUITOUS (ALL FILES)    GRATUITOUS (ALL FILES) behaves identically to MAKE, except it ignores changes to comments and whitespace in both .c and .h files. GRATUITOUS (ALL FILES) represents what conventional manufacturing strategies would do in the absence of purely gratuitous changes.

NAME-USE                  NAME-USE determines what files need to be recompiled based on what names have changed and whether those names are used. NAME-USE requires the recompilation of a .c file that includes a changed .h file only if the change affects a name used by the .c file. It requires the recompilation of a changed .c file only if the change affects a name that might be used in the executable system.

DEMI-ORACLE               DEMI-ORACLE behaves identically to NAME-USE, except it considers how a changed definition is used before making the decision to recompile a .c file. It uses static analysis techniques to recompile only those .c files whose behavior might change as a result of a change to a definition in a .h file. Short of a true oracle, DEMI-ORACLE represents a practical lower bound on recompilation costs.

OPTIMIST                  OPTIMIST behaves identically to GRATUITOUS (HEADERS), except it permits definitions to be added to or deleted from a .h file without requiring the recompilation of its clients.

Table 3.3: Summary of Seven Predicates

The Relationships between Predicates

Of the seven predicates analyzed, BIG BANG and DEMI-ORACLE represent upper and lower bounds on the amount of recompilation necessary to return a system to consistency after a change. One can do no worse in rebuilding a system than BIG BANG; it requires that everything be recompiled, always. On the other hand, while a true oracle might improve on DEMI-ORACLE, the latter represents the best that can be done in practice using state-of-the-art compilation technology.

All the predicates, except BIG BANG, are selective in the steps that they perform. Each requires the recompilation of only those .c files that appear to have changed or that appear to be affected by a change to a .h file. They simply differ in their ability to discriminate between actual and apparent changes and between actual and apparent dependencies. This ability increases monotonically in the four predicates MAKE, GRATUITOUS (HEADERS), GRATUITOUS (ALL FILES) and NAME-USE. The amount of recompilation required by DEMI-ORACLE is never greater than that required by NAME-USE, that required by NAME-USE is never greater than that required by GRATUITOUS (ALL FILES), and so on. Thus every file recompiled by DEMI-ORACLE will be recompiled by NAME-USE and by each of the four stronger predicates; BIG BANG will recompile files that may not be recompiled by MAKE or by any of the four weaker predicates.

  big bang --> make --> gratuitous --> gratuitous  --> name-use --> demi-oracle
                        (headers)      (all files)
                            |
                            +--> optimist

Figure 3.2: The Relative Strength of the Seven Predicates

The relative strength of the predicates is shown in Figure 3.2. The ability to discriminate between actual and apparent changes is flawed in the seventh predicate, OPTIMIST, which is shown as a side branch in the figure. It sometimes mistakes a significant change for an insignificant one. The amount of recompilation required by OPTIMIST is never greater than that required by GRATUITOUS (HEADERS), but it sometimes fails to recompile files recompiled by DEMI-ORACLE.

3.3 The Descartes Project

The Descartes software defines a set of abstract data types that together implement the application-independent parts of an interactive user interface. One of the applications built to exercise this interface is a crossword puzzle game, crostic.

Counting comments and blank lines, the most recent configuration of the crostic program contains some 12,000 lines of code, of which roughly 10,000 represent the generic Descartes software. The latter is divided among 23 .c and 25 .h files, which crostic augments to 26 .c and 28 .h files. Crostic also references 6 .h files from the UNIX library, which were presumed not to change.

Each of crostic's 26 .c files represents a compilation unit. In the most recent configuration, these units together contain roughly 39,000 lines of code. (On average each Descartes .h file is represented 12 times in this number.) The average size of a compilation unit is 1500 lines of code.

3.3.1 Why Descartes?

I chose to study the Descartes system for several reasons. Most important was the availability of change data. Because the Descartes source was maintained under RCS, a complete record of its development was available. While this record is not sufficient to recreate every compilation performed on the system, it at least represents an authentic distribution of changes. Furthermore, because the system was no longer under development at the time of the study, there was no chance that the study would interfere with how the Descartes programmers chose to make those changes. Secondly, because I was one of the Descartes programmers and because the system is relatively small, I could hope to master it thoroughly. This was important because it was necessary to analyze the Descartes development record and apply the more complicated recompilation tests manually (see Section 3.4.3). Finally, although C does not explicitly support data abstraction, the programming conventions used by the Descartes developers required that every .c file have a corresponding .h file that, to the extent possible, defines its visible interface. This discipline made it easier to perform the analysis and helps extrapolate the Descartes results to systems developed in languages that do support data abstraction.

The remainder of this section presents material relevant to understanding and interpreting the results of the study.

3.3.2 Descartes Programming Conventions

Although C does not support data abstraction, Descartes programming conventions required that every .c file have a corresponding .h file defining all exported types and preprocessor constants (including macros), and declaring all global variables and procedures. In addition, any procedure or variable not exported from the defining .c file was declared static so as not to be visible outside the defining .c file. Thus, to the extent possible in C, a Descartes .c and .h file pair correspond to a module implementation and its visible interface; clients of a module import its interface by including the .h file. A notable deficiency in this correspondence is the inability to declare procedure parameters in .h files.

The problem with this approach is the need to manage all the .h files without language support. Even though Descartes is a relatively small system, keeping track of the dependencies between 28 .h files was not easy. Moreover, applications built on top of Descartes, like crostic, should not need to know the names of and relationships between lower level Descartes .h files, even if the client has access to higher level interfaces.

To solve this problem, Descartes introduced a single umbrella .h file, descartes.h, that included each of the remaining Descartes .h files in an order consistent with the dependencies between them. All that a Descartes or a client module needed to do then was to include descartes.h.
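The umbrella file would have looked roughly like the fragment below. Two of the interface names (image.h, comptools.h) appear in the change history of Table 3.6; base.h and the particular dependency order shown are invented for illustration.

    /* descartes.h: umbrella interface for the Descartes software.          */
    /* Each Descartes .h file is included exactly once, in an order         */
    /* consistent with the dependencies among the interfaces, so that a     */
    /* Descartes or client module need only include this one file.          */
    #include "base.h"          /* lowest-level definitions (invented name)  */
    #include "image.h"         /* may depend on base.h                      */
    #include "comptools.h"     /* may depend on base.h and image.h          */
    /* ... the remaining Descartes .h files, in dependency order ...        */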

While this technique is convenient for the programmer, not every .c file that includes descartes.h uses every .h file in it. In addition, because descartes.h hides which .h files are actually included, some .c files include some .h files more than once. As long as the duplicated .h file contains only external declarations and preprocessor macro definitions, the compilation unit compiles successfully and the duplication is not detected.

Each unused or duplicated .h file is superfluous. Each superfluous .h file inflates recompilation costs by increasing the number of client units that apparently need to be recompiled when it changes and by increasing the size of those units. In the most recent configuration of the crostic client, of the 434 .h files included across all the .c files, only 264 are actually used.

3.3.3 The Descartes Change History

The 54 RCS files (26 .c and 28 .h files) that make up the revision history of the Descartes software and its crostic client together contain a total of 435 revisions. These revisions were made by six programmers over the course of one year (between December 21, 1983 and December 5, 1984). I analyzed the 190 most recent revisions in detail, beginning with the initial revisions of the crostic files.

                                                    All Revisions   Revisions Analyzed
  number of revisions:                                   435               190
  number of initial revisions:                            54                10
  number of changes:                                     381               180
  average number of changes per file:                      7                 3
  number of .h file revisions:                           181                77
    percent of all revisions:                             40%               41%
  average number of changes per .h file:                   5                 3
  number of null changes to .h files:                     17                11
    null changes as percent of .h file changes:           11%               14%
  average number of lines added or deleted:               14                13
  median number of lines added or deleted:                 5                 7
  number of .c file revisions:                           254               113
    percent of all revisions:                             60%               59%
  average number of changes per .c file:                   9                 4
  number of null changes to .c files:                      0                 0
  average number of lines added or deleted:               61                56
  median number of lines added or deleted:                22                17

Table 3.4: Summary of the Descartes Change History

Table 3.4 characterizes the complete Descartes change history and the subset of changes analyzed. Initial revisions represent the first introduction of a file to RCS; subsequent revisions are changes. The size of a change is measured in lines added plus lines deleted; a single changed line constitutes one line added and one line deleted. The average size of a change to a .h file is about one-quarter the average size of a change to a .c file.

Approximately 40% of all revisions and 41% of the revisions analyzed represent changes to interfaces (.h files). This is noteworthy since interface changes tend to favor the weaker predicates. For example, when a .c file changes, a more selective predicate can save at most one recompilation over a less selective predicate. When a .h file changes, each client of the .h file is a potential recompilation saved.

Equally notable is the high number of null revisions of .h files. A null revision is one in which a file is checked out and checked in again, but no code is changed. While there were no null revisions of .c files, fully 14% of the .h file revisions analyzed were null revisions. This is because Descartes programmers often checked out .c and .h files in pairs and then checked them in again without having changed the .h file. The incidence of null revisions is one reason that the predicate GRATUITOUS (HEADERS) was selected for the study.

3.4 The Method

To compare predicate performance, I used the Descartes change history to form a sequence of manufacturing graph schemas and then computed the cost of instantiating each schema in turn using each predicate. I started with an initial configuration built from the initial revisions of the 6 crostic client files and the most recent contemporary revisions of each of the remaining 48 Descartes generic files.⁴ I then selected appropriate groups of revisions, based on when they were made and by whom, to reconstruct a sequence of schemas for recompilation. Section 3.4.1 explains the criteria used to form these groups.

I simulated the recompilation decisions made by each predicate in response to each manufacturing graph schema, keeping track of the number of compilation units as well as the number of lines of code recompiled by each predicate. Section 3.4.2 explains my choice of metric. Section 3.4.3 describes the data I collected for each predicate, and briefly discusses why I chose to simulate the predicates manually rather than build a selective recompilation tool for C.

3.4.1 From Change History to Configurations

Because the Descartes project used RCS, it left behind a record of all the changes that project members committed as RCS revisions, ordered in time. This record does not contain enough information to recreate every compilation done by each individual project member or by the project as a whole. One problem is that programmers typically update an RCS library only after they are reasonably satisfied that the changes they have made are correct. More often than not, this requires several iterations as a programmer makes coding errors and corrects them. The programmer may even defer check-in until he has made and tested a group of related changes. Another problem is that while revisions are totally ordered in time, there is no explicit information in the RCS record to indicate which files were changed together and which were meant to be recompiled at the same time.

⁴There were only 42 Descartes files at the time; 6 additional files were created in the development interval analyzed.

While it is impossible to reconstruct the intermediate changes made between RCS revisions, it is possible to use information maintained by RCS to group the revisions that were likely to have been submitted for recompilation together. The information available for grouping revisions includes when each revision was checked in and by whom. It does not include when a revision was checked out, so there is no way to reconstruct which files were checked out at the same time. The criteria used to group revisions should be objective and uniformly applicable to all revisions. The following three criteria satisfy this condition:

1. Each group may include at most 1 revision of each file.

2. All the revisions in a group must have been made by the same programmer.

3. All the revisions in a group must have been checked in consecutively within a short time of one another.

The second criterion is based on the assumption that programmers usually work independently and when two or more programmers work together, only one checks out a working set of files. This is consistent with the way Descartes was developed. The third criterion is based on the assumption that once a programmer is satisfied with a change that affects a set of files, he checks in all the files together. Of course, the set of files checked in might constitute only a subset of the files checked out, but there is no way to discover if this is the case.

The remaining problem in forming groups of revisions is to establish a fixed threshold for the third criterion. As long as the first two criteria are satisfied, any two consecutive revisions checked in within the allotted time are put in the same group; any two revisions whose check-in times differ by more than this amount go into separate groups. The size of the threshold is important because it affects the number of revisions in a group, which in turn can affect the relative performance of the predicates.

Larger groups tend to favor the stronger predicates. The more revisions there are in a group, the more likely it is that any given file is tangibly affected by some change, and the more likely it is that a weaker predicate makes the same decision as a stronger predicate. When groups are smaller, there is more opportunity for contrast between predicate performance. Consider two simple examples:

• Suppose a .h file and one of its client .c files were each changed. If the two files were in separate groups, MAKE would recompile the .c file twice; it might not be recompiled at all by a weaker predicate. If the two changes were in the same group, MAKE would only recompile the .c file once.

• Suppose two .h files were changed. If the files were in separate groups, MAKE would recompile the clients of both files twice. If the files were in the same group MAKE would only recompile each client once. Some clients of either .h file might not be recompiled at all by a weaker predicate.

I chose a threshold of five minutes between revisions. While this number is arbitrary, five minutes is more than sufficient for a programmer to compose a short log message⁵ and for RCS to process the change, even under loaded conditions. Descartes programmers were able to check in multiple files on the same command line.

If the threshold had been increased to 10 minutes, there would have been no significant differences in the groups formed.
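The grouping procedure itself is straightforward once the three criteria and the threshold are fixed. The sketch below assumes revisions sorted by check-in time and uses invented field names; it is an illustration of the criteria, not the procedure actually used for the study.

    #include <string.h>

    #define THRESHOLD (5 * 60)        /* five minutes, in seconds           */

    struct revision {
        const char *file;             /* e.g. "comptools.c"                 */
        const char *author;           /* programmer who checked it in       */
        long        checkin_time;     /* check-in time, in seconds          */
    };

    /* Assign a group number to each revision.  rev[] must be sorted by     */
    /* check-in time.  A new group starts when the programmer changes,      */
    /* when more than five minutes separate consecutive check-ins, or when  */
    /* the group would otherwise contain two revisions of the same file.    */
    void group_revisions(const struct revision rev[], int n, int group[])
    {
        int i, j, g = 0, start = 0;
        for (i = 0; i < n; i++) {
            int split = 0;
            if (i > 0) {
                split = strcmp(rev[i].author, rev[i - 1].author) != 0 ||
                        rev[i].checkin_time - rev[i - 1].checkin_time > THRESHOLD;
                for (j = start; j < i && !split; j++)
                    if (strcmp(rev[j].file, rev[i].file) == 0)
                        split = 1;            /* criterion 1                */
                if (split) { g++; start = i; }
            }
            group[i] = g;
        }
    }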

                                                          All Revisions   Revisions Analyzed
  number of revision groups:                                   164                68
  average number of revisions per group:                       2.7               2.8
  percent of groups with 1 revision:                           46%               49%
  percent of groups with 1 or 2 revisions:                     72%               71%
  percent of groups with more than 2 revisions:                28%               29%
  average number of .c file revisions per group:               1.5               1.7
  percent of groups with no .c file revisions:                 20%               19%
  percent of groups with at most 1 .c file revision:           70%               68%
  percent of groups with more than 1 .c file revision:         30%               32%
  average number of .h file revisions per group:               1.1               1.1
  percent of groups with no .h file revisions:                 37%               43%
  percent of groups with at most 1 .h file revision:           83%               82%
  percent of groups with more than 1 .h file revision:         17%               18%

Table 3.5: Summary of Descartes Revision Groups

Using the three criteria described above, I partitioned the 435 revisions of the Descartes change history into 164 groups and analyzed 68. These groups are characterized in Table 3.5. On average each analyzed group contains 2.8 revisions. Some 19% of these groups have no .c file revisions while 43% have no .h file revisions.

⁵The interface to RCS used by Descartes programmers required that the programmer describe the changes he had made. Only minimal intraline editing was possible when composing these descriptions and most consisted of a single line. The longest description was 6 lines and it occurred at the boundary between two change groups separated because they both contained the same revision. The amount of time separating the revision in question from its predecessor was 4 minutes.

3.4.2 Measuring Compilation Costs

Comparing predicate performance requires a metric that consistently and fairly represents relative recompilation costs. There are several measures that could have been used. One alternative was simply to count the number of modules recompiled by each predicate. Another alternative was to time the compilations that had to be performed.

While the former is appealing since it is direct and easy to understand, it has the disadvantage that it gives all compilation steps equal weight, when clearly some are less expensive than others. The latter has the disadvantage that it is tied to particular hardware and depends on the questionable accuracy of timing tools.⁶

Thus although I recorded the number of modules compiled for gross comparisons between predicates, I also measured the size, in lines of code, of each compilation unit that was recompiled. The advantage of this measure is that it is not machine dependent yet it still recognizes that different manufacturing steps will contribute differently to the overall cost of manufacture. Furthermore, the measure is credible since compiler performance is typically measured in lines compiled per minute.

Because of the high incidence of .h files that are included but not used in the Descartes software, I tracked predicate performance both for the crostic client as written and for the program stripped of those extraneous .h files.

3.4.3 The Data Collected

Before a manufacturing graph schema can be instantiated, it is necessary to select versions of each of its primitive components. Each group of revisions described above represents the differences in the versions selected for two consecutive schemas. Assuming that the first schema had been instantiated successfully from scratch, I computed the cost of instantiating the second schema by simulating each predicate to decide which files had to be recompiled.

This process is clarified by a specific example.

Table 3.6 shows the summary data reported by RCS for the revisions in 4 of the groups analyzed (Groups 5 through 9).

⁶Wendorf [67, Appendix A] describes some inherent problems with the timing facilities of UNIX systems.

  Timestamp            Filename       Revision No.   Lines Added/Deleted

  Revision Group 5:
  84/09/25 16:59:33    image.h            1.5               1/1
  84/09/25 16:58:24    image.c            1.10            111/6

  Revision Group 6:
  84/09/25 17:24:43    comptools.h        1.4              10/1
  84/09/25 17:23:37    comptools.c        1.4              73/2

  Revision Group 7:
  84/09/27 10:12:13    chario.c           1.10              7/4

  Revision Group 8:
  84/09/27 10:30:31    strio.c            1.12              8/5
  84/09/27 10:27:14    intio.c            1.13              7/4
  84/09/27 10:24:01    floatio.c          1.12              7/4

  Revision Group 9:
  84/09/27 10:43:09    strio.c            1.13              5/2
  84/09/27 10:39:17    getstring.c        1.6               5/2

Table 3.6: Four Revision Groups from the Change History of the Crostic Client

To analyze Group 6, for example, I assumed that a graph for the configuration including the revisions in Group 5 had already been instantiated. I then formed a new schema by substituting revisions 1.4 of comptools.h and comptools.c for the previous revisions (revisions 1.3) of the same files in the already instantiated graph. Finally, I counted the files and the lines of code that each predicate would recompile when instantiating the new schema. Here MAKE would require the recompilation of comptools.c and each of the clients of comptools.h. I assumed that all such compilations would be successful.

To compute manufacturing costs, I wrote a small program to keep track of the following information:

1. the revision of each .c and .h file used in each configuration;

2. the size of each revision;

3. the .h files included by each revision of each .c file, and whether each .h file is used; and finally

4. the .c files recompiled by each of the seven predicates.

While the first three items were sufficient to compute the recompilation costs attributable to BIG BANG and MAKE, I supplied the list of files recompiled by each of the other predicates as interactive input to the program.
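The program itself is not reproduced in the thesis; the following sketch shows data structures sufficient for the first three items and, as an example, the cost computation for MAKE. The names and the fixed array bound are invented.

    #define MAX_INCLUDES 64

    /* Per-unit record for one configuration (items 1 through 3).          */
    struct unit {
        const char *c_file;               /* the .c file and its revision   */
        const char *c_revision;
        int         c_lines;              /* size of that revision          */
        int         c_changed;            /* did the .c revision change?    */
        int         n_includes;
        struct {
            const char *h_file;           /* an included .h file, its       */
            const char *h_revision;       /* revision and size              */
            int         h_lines;
            int         h_changed;        /* did the .h revision change?    */
            int         used;             /* is the .h file actually used?  */
        } inc[MAX_INCLUDES];
    };

    /* Lines recompiled by MAKE for one configuration: a unit is recompiled */
    /* if its .c file or any included .h file changed, and its cost is the  */
    /* size of the .c file plus the sizes of all included .h files.         */
    long make_lines_compiled(const struct unit u[], int n_units)
    {
        long total = 0;
        int i, j;
        for (i = 0; i < n_units; i++) {
            int recompile = u[i].c_changed;
            int size = u[i].c_lines;
            for (j = 0; j < u[i].n_includes; j++) {
                size += u[i].inc[j].h_lines;
                if (u[i].inc[j].h_changed)
                    recompile = 1;
            }
            if (recompile)
                total += size;
        }
        return total;
    }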

The results of this analysis are shown in Section 3.5.

Simulating Predicate Decisions

Instead of building a multipurpose selective recompilation tool, I chose to consult file comparisons and cross-reference listings to simulate the performance of the five predicates weaker than MAKE. Although such a tool would be useful, it would be costly to implement. The extent to which that cost is justified ultimately depends on the results of the very study to which the tool would be applied.

A multipurpose selective recompilation tool for C would undoubtedly produce more accurate results in larger volume than a manually conducted study. However, idiosyncrasies of C make it difficult to implement predicates that would be trivial to implement for other languages. For instance:

• Depending on the macro definition, if one .h file uses a macro defined in another .h file, it may be impossible to parse the .h file outside the context in which it is used.

• Available grammars for C are designed to manipulate the already preprocessed language, and cannot be used to identify and process changes to preprocessor macros.

In addition, building a sophisticated static analyzer for any language is a major undertaking. While these difficulties can certainly be surmounted (especially when the ultimate performance of the predicate is not at issue), it is not clear that the cost is justified; it is also not clear that having the tool would produce substantially better information. It is clear, however, that building the tool would raise a set of implementation issues particular to C that may be important but are not directly relevant to the study.

3.5 The Results of the Study

While tables quantifying each predicate's performance are presented later in this section, the relative performance of the seven predicates is best appreciated graphically. Figures 3.3(a), 3.3(b) and 3.3(c) summarize the performance of each predicate relative to BIG BANG. Each figure compares both the number of lines compiled and the number of files compiled by each predicate, contrasting results for the crostic client as written (fat) and for the program stripped of extraneous include files (lean). Figure 3.3(a) is based on data representing all 68 configurations analyzed, Figure 3.3(b) shows the results for the 39 configurations in which at least one .h file changed, and Figure 3.3(c) shows the results for the 29 configurations without any .h file changes.

The main conclusions of the study are readily apparent from these three figures:

[Three bar charts, (a) all configurations, (b) configurations with interface changes, and (c) configurations without interface changes, each showing, for the fat and lean versions of the system, the percentage of lines compiled and of units compiled by big bang, make, gratuitous, optimist, name-use and demi-oracle relative to BIG BANG.]

Figure 3.3: Predicate Performance Relative to BIG BANG

• Comparing the three figures shows a marked difference in relative predicate performance for configurations with and without changes to .h files. Changes to interfaces (.h files) clearly require different manufacturing strategies than changes to implementations (.c files).

• It is apparent in each figure that superfluous .h files have a greater effect on the number of lines compiled than on the number of files compiled. The differences in fat and lean values for lines of code in Figure 3.3(b) indicate that the weaker predicates to some extent mitigate this effect.

• Figure 3.3(c) shows that for changes to implementations there is no point being more discriminating than MAKE. Even the weakest predicate, DEMI-ORACLE, hardly does better.

• Figure 3.3(b) shows that for changes involving interfaces, DEMI-ORACLE is not significantly better than NAME-USE. There is little to justify investing in the additional complexity of the weaker predicate.

• Figure 3.3(b) also shows that despite the high incidence of null interface revisions in the Descartes change history, the overall impact of gratuitous changes is not significant.

• Finally, although OPTIMIST would seem to present an intermediate option between MAKE and NAME-USE for changes involving interfaces, 10% of its decisions not to recompile were wrong. This may be an artifact of the method used to group RCS revisions.

The remainder of this section looks at the relative performance of the predicates in more detail. This discussion is based on four views of the comparative recompilation data shown in four sets of tables. As in Figure 3.3, the tables present separate results for all 68 configurations, for the 39 configurations with changes to .h files and for the 29 configurations without such changes; based on both the number of lines and the number of files compiled by each predicate; for the Descartes software as written and minus its superfluously included .h files.

• Tables 3.7(a) and (b) show the average cost of compilation per configuration using each predicate.

• Tables 3.8(a) and (b) give data on the cumulative performance of the predicates, showing the total number of lines and the total number of files compiled by each predicate. All the data in the other tables and figures is based on these totals.

Table 3.8 includes the ratios between the lean and fat costs for each predicate. These numbers quantify the profound effect of superfluous .h files on the cost of compiling the crostic client.

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               38.2     15.6      14.7         14.6         5.4         4.7         10.4
    lean:              29.2      8.6       8.1          8.0         4.2         3.6          5.7
  39 configurations with interface changes
    fat:               38.4     25.0      23.5         23.5         7.7         6.4         16.0
    lean:              29.3     13.4      12.4         12.4         5.9         4.9          8.4
  29 configurations without interface changes
    fat:               37.9      2.8       2.8          2.6         2.3         2.3          2.8
    lean:              29.0      2.2       2.2          2.0         1.8         1.8          2.2

  (a) Average number of lines compiled (in thousands) per configuration.

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:                25        8        8            8            3           3            6
    lean:               24        6        6            6            3           3            4
  39 configurations with interface changes
    fat:                25       13       13           13            4           3            9
    lean:               24       10        9            9            4           3            6
  29 configurations without interface changes
    fat:                25        2        2            1            1           1            2
    lean:               24        2        2            1            1           1            2

  (b) Average number of files compiled per configuration.

Table 3.7: Average Predicate Performance

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               2595     1058       997          991         368         318          705
    lean:              1984      585       547          543         283         244          391
    lean / fat:         .76      .55       .55          .55         .77         .77          .55
  39 configurations with interface changes
    fat:               1496      977       917          917         301         250          624
    lean:              1144      521       484          484         230         190          327
    lean / fat:         .76      .53       .53          .53         .76         .76          .52
  29 configurations without interface changes
    fat:               1099       81        81           75          67          67           81
    lean:               840       64        64           59          54          54           64
    lean / fat:         .76      .79       .79          .79         .79         .79          .79

  (a) Total number of lines compiled (in thousands).

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               1699      572       541          538         199         171          381
    lean:              1631      416       390          387         199         171          278
    lean / fat:         .96      .73       .72          .72        1.00        1.00          .73
  39 configurations with interface changes
    fat:                979      526       495          495         160         132          335
    lean:               940      371       345          345         160         132          233
    lean / fat:         .96      .71       .70          .70        1.00        1.00          .70
  29 configurations without interface changes
    fat:                720       46        46           43          39          39           46
    lean:               691       45        45           42          39          39           45
    lean / fat:         .96      .98       .98          .98        1.00        1.00          .98

  (b) Total number of files compiled.

Table 3.8: Cumulative Predicate Performance

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat / fat:         1.00      .41       .38          .38         .14         .12          .27
    lean / fat:         .76      .23       .21          .21         .11         .09          .15
    lean / lean:       1.00      .29       .28          .27         .14         .12          .20
  39 configurations with interface changes
    fat / fat:         1.00      .65       .61          .61         .20         .17          .42
    lean / fat:         .76      .35       .32          .32         .15         .13          .22
    lean / lean:       1.00      .46       .42          .42         .20         .17          .29
  29 configurations without interface changes
    fat / fat:         1.00      .07       .07          .07         .06         .06          .07
    lean / fat:         .76      .06       .06          .05         .05         .05          .06
    lean / lean:       1.00      .08       .08          .07         .06         .06          .08

  (a) Ratio of lines compiled.

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat / fat:         1.00      .34       .32          .32         .12         .10          .22
    lean / fat:         .96      .24       .23          .23         .12         .10          .16
    lean / lean:       1.00      .26       .24          .24         .12         .10          .17
  39 configurations with interface changes
    fat / fat:         1.00      .54       .51          .51         .16         .13          .34
    lean / fat:         .96      .38       .35          .35         .16         .13          .24
    lean / lean:       1.00      .39       .37          .37         .17         .14          .25
  29 configurations without interface changes
    fat / fat:         1.00      .06       .06          .06         .05         .05          .06
    lean / fat:         .96      .06       .06          .06         .05         .05          .06
    lean / lean:       1.00      .07       .07          .06         .06         .06          .07

  (b) Ratio of files compiled.

Table 3.9: Predicate Performance Relative to BIG BANG

                    BIG BANG   MAKE   GRATUITOUS   GRATUITOUS   NAME-USE   DEMI-ORACLE   OPTIMIST
                                       (HEADERS)    (ALL FILES)
  68 configurations
    fat:               1527     1849      1844         1842        1850        1858         1850
    lean:              1217     1405      1404         1402        1422        1424         1405
  39 configurations with interface changes
    fat:               1528     1857      1852         1852        1880        1896         1863
    lean:              1217     1405      1403         1403        1435        1440         1404
  29 configurations without interface changes
    fat:               1526     1758      1758         1736        1729        1729         1758
    lean:              1216     1412      1412         1397        1372        1372         1412

  Number of lines per compiled file.

Table 3.10: Average Size of the Files Compiled by each Predicate


• Tables 3.9(a) and (b) compare each of the seven predicates with BIG BANG. The data in these tables quantify the observations about predicate performance based on Figure 3.3.

• Table 3.10 shows the average size of each file compiled by each predicate. This data is interesting because it indicates some relationship between file size and change: not surprisingly, larger files are more likely to be affected by a change than smaller files.

Because the relationship between fat and lean pervades the results, the effect of superfluously included .h files will be considered first.

3.5.1 The Effect of Superfluously Included .h Files

The effect of .h files that are included but not used is apparent when counting files compiled, especially when interfaces change, but it is dramatic when counting lines compiled, except when interfaces do not change.

As explained in Section 3.3.2, superfluously included .h files inflate recompilation costs both by increasing the number of units that apparently need to be recompiled and by increasing the size of those units. Comparing the ratios of lean to fat costs for different predicates makes it possible to distinguish the effect of each of these factors.

In Table 3.8(a), the differences between the fat and lean numbers for the predicates BIG BANG, NAME-USE and DEMI-ORACLE are due entirely to the inflated size of the compilation units that include unused .h files. The same is true of the numbers for all predicates limited to the 29 configurations without interface changes. In each of these cases, unused or unchanged .h files do not influence which .c files are compiled. BIG BANG always recompiles every file while NAME-USE and DEMI-ORACLE never recompile a file because of a change to a .h file that is not used. While the small differences in the ratios of lean to fat sometimes are due to the different populations of files compiled by each predicate, the effect of superfluous includes on the size of the system as a whole is captured in the ratio for BIG BANG. This ratio shows that 24% of the size of the crostic compilation units is due to .h files that are included but not used.

Since module size is not reflected in the data representing the number of files compiled, the ratio of lean to fat approaches 1.00 for BIG BANG, NAME-USE and DEMI-ORACLE in Table 3.8(b). (Some values are less than 1.00 because the fat version of the system contains a .c file that is not used in the crostic client.) The differences between the fat and lean numbers for the predicates MAKE, GRATUITOUS and OPTIMIST in the same table are due entirely to the inflated number of compilations triggered by unused .h files. These ratios show that approximately 30% of the compilations of the Descartes crostic client files were triggered by unused .h files.

The differences between the fat and lean numbers for MAKE, GRATUITOUS and OPTIMIST in Table 3.8(a) show the cumulative effect of superfluous .h files on both the number and size of compilation units. Simply removing unused .h files would reduce the cost of compiling the crostic client by a full 45%.

3.5.2 The Relationship Between the Predicates

Tables 3.9(a) and (b) compare each of the seven predicates with BIG BANG. The data in these tables quantify the observations made earlier about the relationships between predicates. Comparisons of fat values to fat values indicate the relative performance of the predicates on the crostic client as it was written; comparisons of lean values to lean values indicate how the predicates would perform on a system devoid of superfluous include files. Comparisons of the weaker lean predicates to the stronger fat ones demonstrate the combined effect of parsimony in the use of interfaces and in the propagation of changes.

Because the numbers associated with BIG BANG represent the repeated recompilation of the whole system, comparing the other predicates to BIG BANG yields the fraction of the system recompiled by each predicate. In particular, the ratio of MAKE to BIG BANG indicates how much of the system is potentially affected by the typical change; the ratio of DEMI-ORACLE to BIG BANG indicates how much of the system is actually affected by the typical change. Comparing one predicate to another yields the incremental difference in the amount of code or number of modules compiled by the two predicates. Thus:

• The ratios of MAKE to BIG BANG in Table 3.9(a) show that on average, 65% of the code in the crostic client is potentially affected when interfaces change and that less than 10% is affected when they do not.

• The ratios of DEMI-ORACLE to BIG BANG show that on average less than 20% of the code is actually affected when .h files change. When interfaces do not change, the amount of code actually affected by a .c file change is comparable to the amount potentially affected.

• The difference of 4% between the ratios for MAKE and GRATUITOUS when interfaces change represents the incremental value of GRATUITOUS over MAKE. Although not shown in the tables, GRATUITOUS recompiles 94% of the code compiled by MAKE when interfaces change.

• The difference of 3% between the ratios for NAME-USE and DEMI-ORACLE represents the incremental cost of NAME-USE over DEMI-ORACLE. To the extent that DEMI-ORACLE represents a lower bound on recompilation costs, NAME-USE recompiles 20% more code than necessary when interfaces change. In contrast, MAKE comes within 20% of the performance of DEMI-ORACLE when interfaces do not change. When interfaces do change, MAKE recompiles three to four times the amount of code necessary.

• Also not shown in the tables is the ratio of NAME-USE to GRATUITOUS. The lean numbers for GRATUITOUS represent the costs incurred by conventional compilation technology in a world without gratuitous changes or extraneous .h files. For the 39 configurations with interface changes, NAME-USE (lean) recompiles only 48% of the code compiled by GRATUITOUS. DEMI-ORACLE recompiles 39%.

Figure 3.4 shows the dual effect of removing superfluous .h files and replacing MAKE with NAME-USE. The numbers in the figure represent ratios of lines compiled for configurations with interface changes. For the crostic client, removing .h files that are included but unused reduces compilation costs by 47% using conventional manufacturing techniques and by 24% using NAME-USE. Using NAME-USE instead of MAKE reduces costs by 69% in the presence of extraneously included .h files and by 56% in their absence. Doing both reduces costs by 76%.


Figure 3.4: The Compound Effect of Parsimony in Interconnection and Manufacture
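The 76% figure for doing both is consistent with composing either path through the figure (rounding aside):

$$1 - (1 - 0.47)(1 - 0.56) \approx 0.77, \qquad 1 - (1 - 0.69)(1 - 0.24) \approx 0.76.$$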

3.5.3 The Size of Compiled Files

Table 3.10 lists the average number of lines per file compiled by each predicate. The data shows that larger files are compiled more frequently than smaller files, regardless of superfluous includes. The numbers for configurations without interface changes, where the average size of a file compiled by BIG BANG (fat) is approximately 1500 lines and that compiled by MAKE (also fat) is over 1700 lines, indicate that larger .c files change more frequently. The lean numbers for configurations with interface changes suggest that .c files that include more .h files (and are larger for that reason) are selected for compilation more often.

The differences in the average sizes of the files compiled by different predicates help to explain why measuring predicate performance by number of files compiled is not equivalent to measuring performance by number of lines compiled. The difference between the two measures is particularly evident when we compare the weaker predicates to BIG BANG.

3.6 Discussion

The strength of the results of the Descartes change history study invites questions about how applicable those results are to other systems, possibly written in other languages, and to changes as they happen rather than in retrospect. The next paragraphs discuss to what extent the results might depend on the crostic program and its change history.

The Influence of the Crostic Program

There are two features of the crostic program that might distort predicate performance; otherwise there is no specific indication that the crostic program is exceptional. The first feature, for which the study compensates, is the use of programming conventions that lead to a high incidence of interfaces that are included but not used. Almost 40% of the .h files included in crostic compilation units were included unnecessarily. In contrast, Kamel and Gammage report that 7% of the files imported by 1500 Protel modules were unused [32]. This is probably a more typical number.

The second feature is more of a problem. Most of the code of the crostic program consists of the application-independent Descartes user interface software; naturally, this software provides functionality that is not used by every client. It is arguable that the observed performance of the predicates is simply due to changes in this generic software that have no relevance to crostic itself. This question is addressed in Chapter 4 by comparing name use and visibility patterns in the crostic client with the same patterns in five other programs.

The most compelling argument for the broader applicability of the results of this chapter is their corroboration in the work of Adams, Weinert and Tichy. Despite differences between the two studies in program application area, program size, programming language and change history, Adams and company found the same relationship between NAME-USE and MAKE.

The Influence of its Change History

There are also questions about the influence of the experimental method on the results, particularly about the use of historic change information. Every change entered as an RCS revision represents the product of some series of intermediate changes; typically parts of the system will have been repeatedly remanufactured and tested as a result of these intermediate changes.

The relative performance of different predicates depends on the distribution and size of changes. While the RCS record may not capture the provisional changes made between revisions, it is unlikely that it would misrepresent the distribution of those changes. It is, however, likely that those provisional changes are smaller than the revision groups formed from the Descartes change history upon which the study was based. Section 3.4.1 argued that larger changes favor the stronger predicates; if this is true then the results of the study are conservative. Thus, for the intermediate changes to .h files made between RCS revisions, a greater percentage of the recompilations triggered by MAKE may be redundant. This too is addressed in Chapter 4 by considering the relationship between the number of names changed and recompilation costs.

Chapter 4

The Distribution of Names in Interfaces

The results of the previous chapter show that the name-based predicates (NAME-USE and DEMI-ORACLE) significantly outperform the file-based predicates (MAKE and GRATUITOUS) when applied to historic changes made to the Descartes program. While this is not surprising given the way C programs are organized, it is not clear whether the observed differences between the two kinds of predicates reflect some idiosyncrasy of the Descartes code and its change history, or whether comparable differences might arise from the immediate changes¹ made to other programs written in C or in other languages. To answer this question, this chapter offers a concrete explanation for the differences between the name-based and the file-based predicates and uses that explanation to compare Descartes to other programs written in C and in Ada.

For any given program and any given set of changes, the relative performance of NAME-USE and MAKE depends not only on the dynamic pattern of changes, but also on static patterns of name use relative to name visibility. When a name defined in a .h file changes, NAME-USE compiles every .c file that uses the name; MAKE compiles every .c file that uses the defining .h file -- that is, every .c file in which the name is visible. By assuming a fixed and simple distribution of changes, static name use and visibility patterns can be used to predict the relative performance of the two predicates for any given program. These predictions can then be used to compare different programs. The ratio of average name use to average name visibility is used as the basis of comparison. This metric is explained in detail in Section 4.1.

¹An immediate change is the consequence of a single iteration of the edit-compile-debug cycle. In contrast, a historic change often represents a series of immediate changes.


The conclusions of this chapter are based on cross-reference data for six programs, three (including Descartes) written in C, and three in Ada. This data is presented in Sections 4.2 and 4.3. Each of the six programs is characterized by its size and interface properties. The latter include both averages and distributions of the number of clients using each name and the number of clients in which each name is visible, as well as ratios of name use to visibility measured both in compilation units and in application lines compiled. The value of each property measured for Descartes is either comparable to or intermediate among the values measured for the five remaining programs, although in some ways Descartes is more like the Ada programs than the other C programs.

Section 4.4 compares average name use and visibility patterns for Descartes with the use and visibility patterns of the names that changed in the revisions analyzed in Chapter 3. Use and visibility patterns for groups of names are also considered. For the Descartes configurations, it appears that while grouping changes is an effective strategy for reducing relative compilation costs, the reduction may not be in proportion to the size of the group.

It is important to remember that each of the six programs examined in this chapter represents a single case study. This number is not sufficient to make correlations between program size or interconnection structure and predicate performance even though there may be reasons to believe that such correlations exist. There is, however, sufficient evidence to conclude that Descartes is not idiosyncratic in its name use and visibility patterns. This evidence suggests further that the results of the previous chapter may be conservative not only for Descartes (if immediate changes to interfaces are typically smaller than historic changes) but also for a broad class of programs written in separately compiled languages.

Related Work

Before it can process a compilation unit, the compiler for a separately compiled language must first acquire the definitions of all separately declared names that might be used in the compilation. The number of such names is bounded above by the number of symbols visible to the compilation unit and below by the number of symbols actually used. Typical compilation strategies require all visible symbols to be read before processing begins; this is costly and alternative strategies have been proposed. These strategies reduce compilation costs by limiting the number of extraneous symbols read.

Both Conradi and Wanvik, reporting on experience with the languages Mary/1 and Chill [7], and Kamel and Gammage, reporting on experience with the language Protel [32], observe that only 15% to 25% of the symbols visible to a compilation unit are actually used. This is precisely the unit ratio of use to visibility described below, which Kamel and Gammage computed on a per compilation unit basis rather than on a per program basis. The consequences of this difference in granularity are discussed in Section 4.1.3. Conradi and Wanvik give specific use/visibility ratios of .07 for 24 library modules defining types and constants and .13 for 6 application modules defining procedures. All modules are part of a 17000 line Chill system.

While neither the Conradi-Wanvik nor the Kamel-Gammage observations are as detailed as the case studies of this chapter, the values reported by both teams are consistent with the values obtained in these studies.

4.1 The Ratio of Use to Visibility

The ratio of use to visibility (abbreviated RUV) compares the average number of clients actually using the names defined in a program with the average number of clients in which those names are visible; it can be computed solely from information contained in the source text of a program. Based on two assumptions about the distribution of changes, the RUV can be used as a static measure relating the expected performance of the predicate NAME-USE to the expected performance of the predicate MAKE. The first assumption is that only one name changes at a time; the second is that all names are equally likely to change.

The value of the RUV is always between 0 and 1. (A name can only be used where it is visible.) To the extent that the two assumptions hold, NAME-USE can be expected to significantly outperform MAKE on a program with a low RUV. In contrast, a high RUV indicates that MAKE is an adequate compilation strategy. The following paragraphs motivate the use of the RUV as a predictor of predicate performance and explain exactly how it is computed for C and for Ada.

To compare the performance of two predicates given an actual change, one simply computes the compilation costs associated with each predicate and then compares these costs. For example, suppose the definition of the macro TokenToRule, defined in the Descartes .h file rule.h, were to change. TokenToRule is used in 2 .c files, glyph.c and rule.c, and is visible in each of the 12 .c files that include rule.h. NAME-USE would require the recompilation of the 2898 lines of application code contained in glyph.c and rule.c and in the 16 Descartes .h files that together they include. MAKE would require the recompilation of the 18928 lines contained in the 12 compilation units that include rule.h.
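Expressed as ratios, this single change gives

$$\frac{2}{12} \approx 0.17 \ \text{(compilation units)}, \qquad \frac{2898}{18928} \approx 0.15 \ \text{(lines)},$$

which is exactly the kind of per-change comparison that the RUV summarizes over all names.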

Similar use and visibility numbers for every name defined in a program can be produced straightforwardly from cross-reference information and a table of file sizes. Given these numbers and a set of observed changes, it is not difficult to calculate and compare the relative costs of NAME-USE and MAKE. This is exactly what was done in Chapter 3 with the historic changes to the Descartes crostic client. In the absence of change data, however, it is not obvious how to combine the individual use and visibility numbers into a single number representing the whole program. To do so, it is necessary to make simplifying assumptions about the distribution of changes.

The first simplifying assumption is that only one name changes at a time. While it is possible to use the Descartes change data to estimate the number of names affected by the average historic change to an interface, that data does not reveal how many names are affected by the average immediate change. It is arguable that most changes affect only a few names and that single name changes may predominate, although there are doubtless many changes that affect more than one name. The assumption that only one name changes at a time establishes an easy-to-understand baseline. Furthermore, the assumption frees us from the formidable combinatorics of multiple name changes, necessary to account for all sets of two or more names. For example, the .h files of the Descartes crostic client define 711 names; these 711 names can be grouped into over 250,000 different pairs, nearly 60,000,000 different triplets, and so on. The effect of multiple name changes is reconsidered in Section 4.4.2.

The second assumption is that changes are uniformly distributed among the names of a program. This is the appropriate assumption without specific information relating the probability of a name changing to static properties of the name, in particular to its use or visibility. A uniform distribution does not bias the RUV in favor of either predicate. If the probability of a name changing is independent of its use or visibility then it is plausible that the relative performance of NAME-USE and MAKE will in fact conform to the computed RUV.

Suppose the interfaces of a program define $n$ names: $name_1 \ldots name_n$. Let $use_i$ be the set of compilation units that use $name_i$ and let $visible_i$ be the set of units in which $name_i$ is visible. As before, predicate performance is measured in two ways: by counting number of units compiled and by counting application lines compiled. The unit measure is a convenient way to summarize program properties; the line measure accounts for variability in unit size and serves as an approximation of actual compilation costs. The unit costs for $name_i$ are simply the number of units in $use_i$ ($|use_i|$) and in $visible_i$ ($|visible_i|$) respectively; the line costs are $\sum_{x \in use_i} size(x)$ and $\sum_{x \in visible_i} size(x)$.

Given both assumptions about the distribution of changes, the unit RUV is simply the ratio of average name use to average name visibility measured in compilation units:

$$\frac{\sum_{i=1}^{n} |use_i|}{\sum_{i=1}^{n} |visible_i|}.$$

The line RUV is the same ratio measured in lines compiled:

$$\frac{\sum_{i=1}^{n} \sum_{x \in use_i} size(x)}{\sum_{i=1}^{n} \sum_{x \in visible_i} size(x)}.$$
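Given per-name use and visibility sets and unit sizes, both ratios are mechanical to compute. The following sketch shows the computation over entirely hypothetical membership flags and unit sizes (none of the numbers are taken from the programs studied); it illustrates the formulas, not the tooling actually used in the study.

    /* A minimal sketch of the unit and line RUV computations, using
     * hypothetical use/visibility data for three names and four
     * compilation units; the numbers are made up for illustration. */
    #include <stdio.h>

    #define NNAMES 3
    #define NUNITS 4

    /* use[i][j] = 1 if name i is used in unit j; vis[i][j] = 1 if visible. */
    static const int use[NNAMES][NUNITS] = {
        {1, 1, 0, 0},
        {0, 1, 0, 0},
        {0, 0, 1, 0},
    };
    static const int vis[NNAMES][NUNITS] = {
        {1, 1, 1, 1},
        {1, 1, 1, 0},
        {0, 1, 1, 1},
    };
    /* size[j] = lines of application code in compilation unit j. */
    static const int size[NUNITS] = {900, 400, 650, 300};

    int main(void)
    {
        long use_units = 0, vis_units = 0;   /* sums of |use_i|, |visible_i| */
        long use_lines = 0, vis_lines = 0;   /* sums of the line costs       */

        for (int i = 0; i < NNAMES; i++) {
            for (int j = 0; j < NUNITS; j++) {
                use_units += use[i][j];
                vis_units += vis[i][j];
                use_lines += use[i][j] * size[j];
                vis_lines += vis[i][j] * size[j];
            }
        }
        printf("unit RUV = %.2f\n", (double)use_units / vis_units);
        printf("line RUV = %.2f\n", (double)use_lines / vis_lines);
        return 0;
    }

For these made-up flags the sketch prints a unit RUV of 0.40 and a line RUV of 0.42.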

The formulas above give an abstract definition of the RUV. The concrete value of the RUV for a given program written in a given language depends on the language and on the tools available for collecting use and visibility information. In particular, it is necessary to identify the set of names on which the RUV is based, to specify the meaning of use and visibility, and to determine how the size of a compilation unit is measured. The next two subsections address these issues for C and for Ada.

4.1.1 Computing Use/Visibility Numbers for C

For a C program, the RUV is based on the set of names that are defined or declared in the program's .h files. A C compilation unit consists of a base .c file together with every .h file that it includes either directly or indirectly. The size of a compilation unit is simply the sum of the sizes of all these files. A name is used in a compilation unit if it is referenced either directly or indirectly in the base .c file. If a name is referenced only in a .h file in a compilation unit, then the reference is not counted as a use unless the referencing name is ultimately used in the base file.

In C, any name defined in a .h file is visible in every compilation unit that includes that .h file. Because of the high incidence of extraneously included .h files in Descartes, name visibility for the C programs in the study is computed both with and without the extraneously included files. In the former case, name visibility is based on a program's original include structure; in the latter case, the names defined in a .h file are considered visible only in those compilation units that actually reference some name defined in the .h file. These are exactly the visibility relationships that were used to compute the fat and lean numbers of Chapter 3. As before, the extraneous .h files serve to increase both the visibility of the average name and the size of the average client. Since interface visibility for the Ada programs in this study was computed based on necessary interconnections only, the lean C numbers are used when comparing programs written in the two languages.

The raw name use and visibility data used in this study was obtained by postprocessing cross-reference listings produced by the Tartan Laboratories C cross-reference tool, teexref. Whereas most C language tools expand macro definitions prior to compilation so that macro names never appear to the compiler, the Tartan tools treat macros as first class language objects. Since macro definitions make up the bulk of interfaces in many C programs, this feature was essential for this study. While it would have been impossible to undertake this study without teexref, the amount of effort necessary to suit its data to the study's requirements should not be minimized. It would be advisable to build appropriate special-purpose tools before undertaking a similar study on a larger scale.

In C, conditional compilation complicates both name visibility and name use. On a per compilation unit basis, conditional compilation may hide or expose a name defined in a .h file. It may disable or enable the inclusion of a .h file, or it may hide or expose the use of a name in the base .c file.

The treatment of conditional compilation in computing the RUV is an artifact of the procedures used to gather raw name use and visibility data. Name use in .c files and .h file inclusion are determined on a per compilation unit basis using appropriate values for all conditional expressions (as would be supplied to a compiler). Thus the inclusion of a particular .h file, guarded by the same conditional expression in two compilation units, may be disabled in one unit and enabled in the other so that the names defined in the file are hidden from one base .c file and visible to the second.

In contrast, the visibility of names declared or defined in .h files is determined uniformly across all compilation units without regard to the values of conditional expressions that may differ from compilation to compilation. Suppose a name is declared conditionally in a .h file. If the declaration is hidden in every compilation involving the .h file, then the name remains hidden and does not affect the value of the RUV. If the declaration is visible in any compilation involving the .h file, then the name is treated as visible in all compilations but unused where it is hidden by conditional compilation.

Only the sizes of a program's own .c and .h files were counted in measuring the size of a C compilation unit, not the sizes of the UNIX .h files included in the unit. For units that interact heavily with the operating system, system files can contribute substantially to compilation costs. Thus the line RUV cannot be interpreted strictly as a measure of relative compilation cost. It does however compare the amount of application code affected by a change.

4.1.2 Computing Use/Visibility Numbers for Ada

Phil Levy of Rational provided the data on the Ada programs in the study. Using facilities of the Rational environment, he designed and built a tool that reports on the use and visibility of the names declared in the specification units of Ada programs. Levy's tool does not report on the names declared in the private parts of specifications, nor does it report on subsidiary names, such as those of record components or subprogram parameters, that appear as parts of larger declarations.

Ada compilation units are either specification units, body units or body subunits. When a specification unit is compiled, the compiler produces a symbol table that is used in the compilation of any clients of the specification. A client may be another specification unit, or it may be a body unit or a body subunit. The set of units using a name declared in a specification unit is computed as the compilation closure of the name; the set of units in which a name is visible is computed as the compilation closure of the specification in which the name is declared.

W:  [ type omega3 is ...; ]

A:  [ type alpha0 is new W.omega3; ]

B:  [ type beta1 is new A.alpha0;     constant beta2 is ...; ]

C1: [ func gamma1a ( ... ) return B.beta1; ]
C2: [ func gamma2b ( ... ) return B.beta1; ]
D:  [ array delta5 range [1..beta2] of ...; ]

     x     |      use(x)      |      visible(x)
  ---------+------------------+----------------------
  omega3   | {A, B, C1, C2}   | {W, A, B, C1, C2, D}
  alpha0   | {B, C1, C2}      | {A, B, C1, C2, D}
  beta1    | {C1, C2}         | {B, C1, C2, D}
  beta2    | {D}              | {B, C1, C2, D}
  gamma1a  | ∅                | {C1}
  gamma2b  | ∅                | {C2}
  delta5   | ∅                | {D}

Figure 4.1: The Use and Visibility of Names in Ada

A compilation unit is in the compilation closure of a name if the unit contains a reference to the name or if the unit is in the compilation closure of a second name that references the first. This is best illustrated by an example. In Figure 4.1, the name alpha0 is declared in specification unit A. Because alpha0 is referenced in the declaration of the name beta1 in specification unit B, B is in its compilation closure, as is every unit in the compilation closure of beta1. These units constitute the use set for alpha0. Because omega3 is referenced in the declaration of alpha0, its compilation closure will include A as well as the closure of alpha0.

A compilation unit is in the compilation closure of a specification if the unit contains a reference to some name declared in the specification or if the unit is in the compilation closure of another specification that references some such name. For example, compilation units A and B in Figure 4.1 are in the compilation closure of W, the specification declaring omega3. So is unit D, because D references a name declared in B and B is in the closure of W. The name referenced need not be in the compilation closure of any name declared in W. In particular, unit D is not in the compilation closure of omega3 but omega3 is visible in D.
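The closure of a name can be computed by following the name-reference relation backwards from the name and collecting the declaring unit of every name reached. The sketch below (in C) reproduces the use sets of Figure 4.1 from the figure's declarations; it is an illustration of the definition only, not Levy's tool, and, like the figure, it does not count a name's own declaring specification.

    /* A sketch of the compilation closure of a name, using the units and
     * declarations of Figure 4.1: a unit belongs to use(n) if it declares
     * a name that directly or transitively references n. */
    #include <stdio.h>

    enum { OMEGA3, ALPHA0, BETA1, BETA2, GAMMA1A, GAMMA2B, DELTA5, NNAMES };
    enum { W, A, B, C1, C2, D, NUNITS };

    static const char *name_str[NNAMES] =
        { "omega3", "alpha0", "beta1", "beta2", "gamma1a", "gamma2b", "delta5" };
    static const char *unit_str[NUNITS] = { "W", "A", "B", "C1", "C2", "D" };

    static const int decl[NNAMES] = { W, A, B, B, C1, C2, D };  /* declaring unit */

    /* refs[k][i] = 1 if the declaration of name k references name i. */
    static const int refs[NNAMES][NNAMES] = {
        [ALPHA0]  = { [OMEGA3] = 1 },
        [BETA1]   = { [ALPHA0] = 1 },
        [GAMMA1A] = { [BETA1]  = 1 },
        [GAMMA2B] = { [BETA1]  = 1 },
        [DELTA5]  = { [BETA2]  = 1 },
    };

    /* Mark the declaring unit of every name that transitively references i. */
    static void close_over(int i, int in_use[], int seen[])
    {
        for (int k = 0; k < NNAMES; k++) {
            if (refs[k][i] && !seen[k]) {
                seen[k] = 1;
                in_use[decl[k]] = 1;
                close_over(k, in_use, seen);
            }
        }
    }

    int main(void)
    {
        for (int i = 0; i < NNAMES; i++) {
            int in_use[NUNITS] = { 0 };
            int seen[NNAMES] = { 0 };
            close_over(i, in_use, seen);
            printf("use(%s) = {", name_str[i]);
            for (int j = 0; j < NUNITS; j++)
                if (in_use[j])
                    printf(" %s", unit_str[j]);
            printf(" }\n");
        }
        return 0;
    }

Running it prints, for example, use(omega3) = { A B C1 C2 }, matching the first row of the figure.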

The Ada compilation model requires the recompilation of a specification whenever any of the names it declares is changed, whether the name is referenced in the specification or not. This is necessary to incorporate the change in the symbol table representing the specification. For this reason, Levy's data always includes the declaring specification in the compilation closure of each name; each specification is also included in its own compilation closure. Consequently Ada names that are unused appear to have one client whereas C names that are unused have none. Because Levy's data does not contain enough information to distinguish names that are in fact referenced in the declaring specification from names that are not, it is impossible to accurately correct for this difference. Thus Levy's name use numbers will yield RUV values that are too large. Compensating by subtracting one from each of these numbers will produce RUV values that are too small. In the belief that most names are not referenced in their declaring specifications, the compensated RUV values are used in comparing Ada and C. RUV values based on Levy's original numbers are given in Section 4.2.3. The latter numbers probably give a better reflection of compilation costs since the declaring specification must be recompiled if the changed name has any clients.

In Ada, the names declared in a specification unit are visible in another compilation unit only if that unit imports the specification using a with clause. Most Ada compilers base recompilation on the with relation. However, a unit may import a specification without using any name declared therein. Such an extraneously imported specification corresponds directly with an extraneously included .h file in C. Since the compilation closure is computed without reference to the with relation, the name visibility data automatically excludes unnecessary imports. Thus there is no data for the Ada programs based on their with clause closures that corresponds with the data for C programs based on their original (fat) include structures.

In computing the line RUV, only the lines of source text in each compilation unit are counted; there is no measure of the size of the symbol table information needed to support the compilation of the unit. As with C, this number is not proportional to actual compilation costs, but it does reflect lines of code affected by a change.

4.1.3 The RUV versus the Big Inhale

Before it can process a compilation unit, the compiler for a separately compiled language must acquire the definitions of all symbols visible to the compilation unit. Conradi calls this process the "big inhale". The fraction of inhaled names that are actually used is exactly what is measured in the unit RUV, except that instead of computing the ratio for each compilation unit, it is computed for the program as a whole. The following paragraphs consider how the two values might differ.

Suppose a program that defines $n$ names has $m$ compilation units. To compare the number of names that are used in a compilation unit with the number of names that are visible, we define $u_i^j$ to be 1 if $name_i$ is used in $unit_j$; otherwise $u_i^j$ is 0. Similarly, $v_i^j$ is 1 if $name_i$ is visible in $unit_j$ and 0 otherwise. These numbers are used to define $R_j$, the ratio of use to visibility for compilation unit $j$; $\bar{R}$, the average ratio of use to visibility for units $1 \ldots m$; and $R$, the unit RUV for the program.

Since the number of names used in $unit_j$ is $\sum_{i=1}^{n} u_i^j$ and the number of visible names is $\sum_{i=1}^{n} v_i^j$, the ratio of use to visibility for the unit is

$$R_j = \frac{\sum_{i=1}^{n} u_i^j}{\sum_{i=1}^{n} v_i^j}.$$

Averaging this number over all $m$ units in the program yields

$$\bar{R} = \frac{\sum_{j=1}^{m} R_j}{m}.$$

Earlier, $use_i$ was defined as the set of units using $name_i$. Since $u_i^j = 1 \Leftrightarrow unit_j \in use_i$, the number of units in $use_i$ is $\sum_{j=1}^{m} u_i^j$ and the number of units in $visible_i$ is $\sum_{j=1}^{m} v_i^j$. The unit RUV for the program is

$$R = \frac{\sum_{i=1}^{n} |use_i|}{\sum_{i=1}^{n} |visible_i|} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} u_i^j}{\sum_{i=1}^{n} \sum_{j=1}^{m} v_i^j}.$$

In general, the average ratio of names used in a compilation unit to names visible in the compilation unit, $\bar{R}$, is not the same as the unit RUV for the program, $R$. However, if the distribution of name use and name visibility is uniform across all compilation units in the program, these two numbers will be close. If we let $R_j = R + \delta_j$ (so that $\delta_j$ reflects the difference between the ratio of use to visibility for $unit_j$ and the program as a whole), then

$$\bar{R} = \frac{\sum_{j=1}^{m} (R + \delta_j)}{m} = R + \frac{\sum_{j=1}^{m} \delta_j}{m}.$$

As long as the $\delta_j$ values are small or cancel each other out, $\bar{R}$ will approximate the unit RUV. Section 4.2.4 compares the ratios of use to visibility for the individual compilation units of the C programs in the study with the unit RUV values for the same programs.
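As a hypothetical illustration of how the two can diverge when visibility is not uniform: suppose one unit uses 1 of 10 visible names and a second uses 30 of 40; then

$$\bar{R} = \frac{1}{2}\left(\frac{1}{10} + \frac{30}{40}\right) = 0.425, \qquad R = \frac{1 + 30}{10 + 40} = 0.62.$$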

4.1.4 Partial RUV's

The per compilation unit ratio of use to visibility $R_j$, defined above, can be thought of as a partial RUV. A partial RUV is a ratio of use to visibility computed either from a subset of the compilation units in a program or from a subset of the names. There is occasion to look at both kinds of partial RUV in the next section.

4.2 The Six Program Study

I computed RUV values for six programs including Descartes. The five additional programs were chosen based largely on their accessibility. Like Descartes, C-Kermit and Vice are written in C; otherwise, at least superficially, they are quite unlike Descartes. R-Kermit, Trtl3w and Debugger are written in Ada. Size and interface characteristics for each program, along with its unit and line RUV, are given in Table 4.1.

For each program, the table contains the following information:

• The size of the program text in thousands of lines, including comments and blank lines.

• The percent of the program text contained in interface components.

• The number of implementation (body) components. For C this is the number of .c files; for Ada it is the number of package and subprogram body units and subunits.

• The average size in lines of an implementation component.

• The number of interface components. For C this is the number of .h files; for Ada it is the number of package and subprogram specification units.

                                   C-Kermit  Descartes   Vice  |  R-Kermit  Trtl3w  Debugger
Size (thousands of lines)              11.5       11.7   57.7  |       4.3    32.8      28.5
... percent interface                    4%        14%     6%  |       17%     16%       24%
Number of Body Components                13         26    151  |        13      45        47
... average size (lines)                849        385    359  |       272     611       460
Number of Interface Components            4         28     43  |        13      48        54
... average size (lines)                118         59     82  |        55     110       126
Average No. Clients/Interface           7.8        7.4   13.6  |       6.8    10.1      25.3
Average No. Names/Interface              51         25     36  |        12      20        26
Average Name Use (units)                1.8        2.4    3.2  |       3.2     2.7       2.9
Average Name Visibility (units)         6.1       12.8   22.4  |       7.8    18.6      39.7
Unit RUV                               0.29       0.19   0.14  |      0.42    0.14      0.07
Line RUV                               0.33       0.20   0.16  |      0.52    0.22      0.13

Table 4.1: Comparative Name Use and Visibility for 3 C and 3 Ada Programs

• The average size in lines of an interface component.

• The average number of clients per interface, measured in compilation units. For C this is the average number of compilation units that must include each .h file; for Ada it is the average number of units in the compilation closure of each specification unit.

• The average number of names declared in each interface. For Ada this number does not include record components or names declared in the private part of a specification unit.

• The average number of clients using each name, measured in compilation units. For Ada this number does not include the specification declaring each name.

• The average number of clients in which each name is visible, measured in compi- lation units.

• The unit RUV: The ratio of the average number of clients actually using each name to the average number of clients in which each name is visible.

• The line RUV: The ratio of the aggregate size of the clients actually using each name to the aggregate size of the clients in which each name is visible.

It is clear from the data in the table that there are considerable differences among the six programs, but Descartes is nowhere conspicuously atypical. Despite differences elsewhere, average name use is consistently small for all programs. After briefly introducing the programs in the study, the data in the table will be considered in more detail.

4.2.1 The Programs

C-Kermit and R-Kermit² were included in the study because both programs are implementations of the Kermit protocol for exchanging files between computers [12]. The source for C-Kermit was obtained from archives at Columbia University in the fall of 1987. Modeled after the Columbia sources, R-Kermit was developed at Rational and is distributed to Rational customers as part of an unsupported software catalog.

²R-Kermit is my name for the program Rational calls Kermit.

As programmed, C-Kermit includes five .h files. The data in Table 4.1 reflects only four of these. The fifth .h file is vestigial in the configuration analyzed, containing only a single definition that is effectively unused. I omitted this .h file from the study because its influence on the aggregate data would have been unduly large.

The Andrew File System, colloquially known as Vice, is the third C program in the study. Vice is the distributed file system of the Andrew programming environment developed at Carnegie Mellon University's Information Technology Center [45]. It is not a single program, but a collection of programs that run as clients and servers on local workstations and on central file servers. These programs share a set of interfaces realized in common .h files and object libraries. The data in Table 4.1 represents these common .h files, the library sources, and those additional .c and .h files that make up those Vice programs that use the shared interfaces. The data does not reflect a small amount of additional code making up three standalone Vice programs that do not use any of the shared interfaces. I obtained the source for Vice in the fall of 1987 with the help of Susan Straub of the Information Technology Center.

Like Descartes, Vice is organized around a common core of support facilities. However, I looked at only one Descartes application (the crostic client) whereas I looked at almost all the Vice programs. This is because each Descartes application is developed as an independent program, whereas the Vice programs cooperate to provide a set of coordinated services.

Unlike the C programs and R-Kermit, the two remaining Ada programs in the study are effectively black boxes, revealed only through their use and visibility profiles. Both are parts of a debugger system. Unlike R-Kermit and Trtl3w, Debugger consists of multiple Rational subsystems. The subsystem construct is a program structuring mechanism provided by Rational to control information sharing between collections of modules. Perhaps the difference in the RUVs of Trtl3w and Debugger, two programs that are otherwise similar, is due to differences in their structure.

Like most software designed to run in the Rational environment, all three Ada pro- grams are organized as sets of procedures that can be invoked by another Ada program, either interactively on behalf of the user as by the Rational environment, or in a more conventional batch-oriented manner.

4.2.2 The Data in Table 4.1

The range of RUV values in Table 4.1 is considerable. The unit RUVs differ by a factor of 6, and the line RUVs by a factor of 4; in both cases the extreme values belong to the Ada programs Debugger and R-Kermit. For a single change to an arbitrary name defined in Debugger, NAME-USE would require the compilation of only 1 program unit for every 14 units compiled by MAKE; for an arbitrary name defined in R-Kermit, NAME-USE would recompile 1 unit out of every 2 units. For the three C programs, both the unit and line RUVs differ by a factor of 2. At .19 (unit) and .20 (line), the RUV values for Descartes are intermediate among the values for the three C programs and among those for all six C and Ada programs.
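Under the two assumptions of Section 4.1, these per-change figures are simply the reciprocals of the unit RUVs:

$$\frac{1}{0.07} \approx 14 \ \text{(Debugger)}, \qquad \frac{1}{0.42} \approx 2.4 \ \text{(R-Kermit)}.$$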

In all cases, the value of the line RUV is larger than that of the unit RUV. This indicates (not surprisingly) that larger compilation units are more likely to use a name than smaller units. While the difference between the two values is relatively small for the C programs in the study, it is significant for the Ada programs. The line RUV for Debugger, for example, is almost twice the unit RUV. This may be a consequence of how the sizes of compilation units are measured in the two languages. The size of C compilation units may tend to be more uniform because they each consist of a collection of files. Ada compilation units range from subprogram specifications that are a few lines long to package bodies that are several hundred lines long.

While any conclusions drawn from such a small sample must be taken as anecdotal, it is intriguing to speculate about the relationship between program characteristics and the RUV.³ The most striking feature of the data in Table 4.1 is that despite the considerable range of values for the RUV, the average integral number of compilation units using each name is either 2 or 3 for each of the programs in the study. For these programs at least, it is name visibility, not name use, that determines the value of the RUV. The distribution of use and visibility is discussed in Section 4.3.

³These speculations are testable hypotheses.

Both unit and line RUV values are conspicuously larger for the two versions of Kermit. If we interpret the RUV as a measure of the effectiveness of interface organization, then R-Kermit in particular seems to be well modularized. R-Kermit is the smallest program in the study, so it would be puzzling if the case were otherwise. Larger programs are harder to modularize.

In comparing the three C programs in the study, few conclusions can be drawn about the relationship between size and RUV or between gross interface characteristics and RUV. Descartes and C-Kermit are the same size but C-Kermit has a substantially larger RUV. A clear difference between the two programs is their interface characteristics: 4% of C-Kermit is interface compared with 15% of Descartes. C-Kermit and Vice have similar interface characteristics, 4% of C-Kermit and 6% of Vice is interface, yet their RUV values differ by a factor of 2. Vice is 5 times the size of C-Kermit. Among the Ada programs, Trtl3w and Debugger are comparable both in size and number of code and interface components. Debugger has a larger interface and a disproportionately larger average number of clients per interface. Its RUV is half that of Trtl3w.

The paragraphs that follow treat a few issues in more detail.

C-Kermit versus R-Kermit: The Effect of Language Difference?

Although both programs are implementations of the same protocol and both are relatively small, Table 4.1 shows C-Kermit and R-Kermit to be surprisingly dissimilar. Some of this difference is undoubtedly due to differences between C and Ada, but much of it can be attributed to more mundane factors. R-Kermit is the work of a single programmer; it implements only a subset of the protocol and runs only on a Rational computer. C-Kermit is an older and more complete program, developed by a team of programmers; it runs on a variety of machines under several operating systems.

Eliminating the conditional compilation expressions used to tailor the code for different architectures and operating systems reduces the size of C-Kermit by about 18% to 9.5 thousand lines, still substantially larger than R-Kermit.

Descartes versus C-Kermit and Vice: The Effect of Programming Style?

It is clear from the data in Table 4.1 that the percent of interface code in Descartes is more characteristic of the Ada programs in the study than of the remaining C programs. Descartes was developed using abstraction techniques that are compulsory in Ada.

C-Kermit and Vice are probably more typical of most C programs. In C any global variable or procedure not declared static is automatically visible outside the defining .c file. An importing .c file simply has to declare the imported name as extern and the reference is resolved by the linker. For integer returning functions, a declaration is not even necessary. This is a perfectly legitimate and common C programming style. Most of the potentially shared names defined in both C-Kermit and Vice are not declared in .h files.

Given the number of such names, it is important to consider how the names not declared in .h files in C-Kermit and Vice would affect the RUV if they were to be made part of the interface. While no conclusions can be drawn about the visibility of these names, it is possible to investigate their use. If the average number of clients using each undeclared name were high, then the RUV values for C-Kermit and Vice might be too low and the consistency of average name use across the six programs in the study may simply be a fluke. If, on the other hand, the average number of clients using each undeclared name were low, then the existing data is probably good. Table 4.2 shows that average name use for names not declared in interfaces is less than 2 for both C-Kermit and Vice. Considering only names that are used outside their defining .c file, that number increases to just slightly more than 2, even though each such name must have at least 2 clients.

                                                C-Kermit    Vice
Number of externs not declared in interfaces         391    1657
... percent of all shared names                       66%     52%
Average number of clients per extern                  1.9     1.5
Number of externs that are unused                       6     101
Number of externs that should be static               106     865
Number of externs that are in fact shared             279     691
Average number of clients per shared extern           2.2     2.3

Table 4.2: Clients of C Externs that are not Declared in .h Files

4.2.3 The Effect of Compensating for Language Differences on the Concrete RUV

Sections 4.1.1 and 4.1.2 described how the concrete RUV was computed for C and for Ada. To compensate for language and tool differences, I disregarded superfluously included .h files in computing name visibility for C; I disregarded the declaring specification in computing name use for Ada. This section considers the effect of these decisions.

The Effect of Superfluously Included .h Files in C

Section 4.1.1 justified the decision not to consider superfluously included .h files in computing name visibility for C because this method conforms with the way name visibility was computed for the Ada programs studied. This decision not only provides a common basis for comparing C and Ada programs, it also provides a basis for comparing fundamental visibility patterns in C programs. However, superfluously included .h files do have an effect on recompilation costs. By increasing name visibility they naturally decrease the RUV. Table 4.3 shows by how much.

As in the previous chapter, the columns labeled "fat" represent each program as written; the columns labeled "lean" echo the data from Table 4.1, representing each program minus any unnecessarily included .h files. The table shows differences in name visibility and in the unit and line RUV for each of the three programs. None of the programs are free of unused .h files, though Descartes by far has the greatest proportion and C-Kermit has almost none.⁴ The only RUV values that are substantially affected by superfluously included .h files are those of Descartes, which decrease in the presence of these files by 20-25%.

⁴Perhaps this is because C-Kermit is the only program of the three that represents mature production code.

The high incidence of unnecessarily included .h files in Descartes is the result of a deliberate decision to use an umbrella include file and is one aspect in which Descartes differs from the other programs in the study. While the lean numbers of this and the previous chapter compensate for this difference, the picture of Descartes that emerges even from the fat numbers of Table 4.3 does not distinguish it as unusual among the programs in the study.

                                    C-Kermit         Descartes          Vice
                                   fat    lean      fat    lean     fat    lean
Size Ratio (fat/lean)                 1.01             1.22             1.07
Average No. Clients/Interface      8.0     7.8     11.6     7.4    16.5    13.6
Average Name Use (units)               1.8              2.4              3.2
Average Name Visibility (units)    6.4     6.1     16.5    12.8    24.9    22.4
... ratio of fat to lean              1.05             1.29             1.11
Unit RUV                          0.28    0.29     0.15    0.19    0.13    0.14
... ratio of fat to lean              0.97             0.79             0.93
Line RUV                          0.31    0.33     0.15    0.20    0.15    0.16
... ratio of fat to lean              0.94             0.75             0.94

Table 4.3: Effect of Superfluously Included .h Files in C

The Effect of Counting the Declaring Specification as a Client in Ada

The name use data for the Ada programs in the study counts the specification declaring a name as a user of that name whether the name is referenced outside its declaration or not. Because this data does not indicate which names (or what proportion of all names) are legitimately used in their declaring specifications and which are not, it is impossible to compute an accurate RUV. The data in Table 4.1 was produced by compensating for names that are not used in their declaring specifications by subtracting 1 from all use numbers. Since not all names are unused, this produces a lower bound for the RUV. Since not all names are used, computing the RUV based on the original Ada use numbers produces an upper bound for the RUV. The lower bound is used in Table 4.1 based on the assumption that more names are unused in their declaring specification than are used, so that the true RUV is closer to the lower than to the upper bound. Both figures are presented in Table 4.4.

                                        R-Kermit          Trtl3w         Debugger
                                      upper   lower    upper   lower   upper   lower
                                      bound   bound    bound   bound   bound   bound
Average Number of Clients/Interface       6.8              10.1             25.3
Average Name Use (units)               4.2     3.2      3.7     2.7     3.9     2.9
Average Name Visibility (units)           7.8              18.6             39.7
Unit RUV                              0.55    0.42     0.20    0.14    0.10    0.07
... ratio of upper to lower bounds       1.31              1.43             1.42
Line RUV                              0.56    0.52     0.25    0.22    0.16    0.13
... ratio of upper to lower bounds       1.08              1.14             1.23

Table 4.4: The Effect of Counting the Declaring Specification as a Client in Ada
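The bounds in the table follow directly from the averages, since the unit RUV is the ratio of average name use to average name visibility; for Debugger, for example,

$$\frac{3.9}{39.7} \approx 0.10 \ \text{(upper bound)}, \qquad \frac{3.9 - 1}{39.7} = \frac{2.9}{39.7} \approx 0.07 \ \text{(lower bound)}.$$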

The only difference in the values used to compute the lower and upper bound RUVs in Table 4.4 is in average name use; counting the declaring specification increases its value by 1. Whereas average name use for the C programs in the study is consistently less than 3, the uncompensated upper-bound numbers place it close to 4 for all three Ada programs. This number is still consistently small, but it is not as uniform across programs as would appear from the values in Table 4.1. Increasing average name use by 1 has a dramatic effect on the unit RUV; its effect on the line RUV is less pronounced since specification units tend to be smaller than body units, both of which contribute to the line RUV. Even so, and once again excepting R-Kermit, the values of the RUV remain consistent with those of the C programs in the study.

4.2.4 The Unit RUV versus the Partial RUVs of the Big Inhale

Section 4.1.3 compared the unit RUV ($R$) with the average ratio of use to visibility for a compilation unit ($\bar{R}$). While it is not possible to extract per unit ratios from the data on the Ada programs in the study, it was possible to do so for the C programs. The average per unit ratio for each C program is given in Table 4.5.

                                             C-Kermit   Vice   Descartes
average per unit use/visibility ratio            0.28   0.22        0.19
average ratio for units weighted by size         0.33   0.20        0.17

Table 4.5: Average Per Unit Use/Visibility Ratios

The second line of the table gives the average unit ratio weighted by unit size. This value is computed according to the formula:

$$W = \frac{\sum_{j=1}^{m} size(unit_j) \, R_j}{\sum_{j=1}^{m} size(unit_j)}$$

Although W is not the same as the line RUV, the values of the two are virtually the same for all 3 C programs.

All average unit ratios are close to the 15-25% range reported by Conradi and Wanvik and by Kamel and Gammage.

4.3 Patterns of Use and Visibility

Figures 4.2 and 4.3 give a more detailed picture of name use and visibility for the six programs studied. The histograms in Figure 4.2 show how many names are used by 0 through 20 compilation units. The graphs of Figure 4.3 contrast cumulative name use and visibility across the entire range of units per name for each program. In these graphs, the name use value shown for k compilation units is the total number of names used in k or fewer units. Thus the curves grow monotonically with endpoints representing the largest number of compilation units using any one name. The cumulative values for name visibility are defined analogously.

[Figure 4.2 contains six histograms, one per program: (a) C-Kermit, (b) Descartes, (c) Vice, (d) R-Kermit, (e) Trtl3w, (f) Debugger. Each plots the number of names against the number of clients per name.]

Figure 4.2: Name Use Density in Three C and Three Ada Programs

[Figure 4.3 contains six graphs, one per program: (a) C-Kermit, (b) Descartes, (c) Vice, (d) R-Kermit, (e) Trtl3w, (f) Debugger. Each plots cumulative name use and cumulative name visibility against the number of clients per name.]

Figure 4.3: Cumulative Name Use and Visibility for Three C and Three Ada Programs

The pattern of name use is remarkably consistent across the six programs in the study. In each case most names have 4 or fewer users: there are very many names that have very few users and very few names that have many users. Although C-Kermit and R-Kermit each define too few names to display the characteristic curve, in the remaining programs name use is clearly exponentially distributed.

If there is a difference in patterns of name use and visibility that would distinguish Descartes from the remaining programs in the study, it is in the distribution of name visibility. For Descartes this distribution is clearly uniform; for the other programs it is less so. In Descartes, there are as many names visible to a large or moderate number of clients as there are names visible to a small number of clients. In the other programs, visibility is biased toward fewer clients, but there is still a substantial number of names that are visible to a moderate number of clients.

4.3.1 The Effect of Modularity in Descartes

The thesis was introduced with an observation about modularity and the incidence of redundant recompilation. Although Figure 1.1 is powerful graphically, Section 4.4.2 (on page 128) will show that names are not assigned to interfaces at random. It is reasonable to ask whether there is a natural modularity in programs that would make it possible to partition the names into a relatively small number of units so that use and visibility would coincide.

Suppose, for example, we were to reorganize a C program by partitioning the names defined in its original .h files into a new set of .h files so that the following conditions were met:

1. All the names defined in a new .h file were defined in the same original .h file.

2. Every name defined in a new .h file is used by all the clients of the new .h file.

The first condition ensures that logically unrelated names are not combined (assuming the original partition into .h files grouped maximal sets of related names). The second ensures that use and visibility coincide so that any change to a .h file would affect each of its clients. How many new .h files would we have to manage?
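Read this way, the new files are formed by grouping the names of each original .h file by their exact sets of using clients. The sketch below illustrates the grouping; the names, files and client sets in it are hypothetical, not drawn from Descartes.

    /* A sketch of the repartitioning described above: names are grouped
     * into new .h files by (original .h file, exact set of client .c
     * files), so that every name in a new file is used by all of that
     * file's clients.  The data below is made up for illustration. */
    #include <stdio.h>
    #include <string.h>

    struct name {
        const char *ident;    /* name defined in an original .h file    */
        const char *orig_h;   /* original .h file defining it           */
        unsigned    clients;  /* bitmask of the .c files using the name */
    };

    enum { NNAMES = 4 };
    static const struct name names[NNAMES] = {
        { "kindOf",   "rule.h",  0x03 },  /* used by units 0 and 1       */
        { "countOf",  "rule.h",  0x03 },  /* same clients: same new file */
        { "widthOf",  "rule.h",  0x14 },
        { "heightOf", "glyph.h", 0x02 },
    };

    int main(void)
    {
        int grouped[NNAMES] = { 0 };
        int new_files = 0;

        for (int i = 0; i < NNAMES; i++) {
            if (grouped[i])
                continue;
            printf("new .h file %d (from %s):", ++new_files, names[i].orig_h);
            for (int j = i; j < NNAMES; j++) {
                /* Condition 1: same original .h file.
                 * Condition 2: identical client set.                    */
                if (!grouped[j] &&
                    strcmp(names[j].orig_h, names[i].orig_h) == 0 &&
                    names[j].clients == names[i].clients) {
                    grouped[j] = 1;
                    printf(" %s", names[j].ident);
                }
            }
            printf("\n");
        }
        printf("%d names partitioned into %d new .h files\n", NNAMES, new_files);
        return 0;
    }

With these four hypothetical names the sketch produces three new .h files.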

Partitioning the original Descartes .h files as above would produce a total of 249 new .h files, each containing an average of 3 names. While the largest file would contain 54 names, 85% of the files would contain between 1 and 3 names. If the first condition were relaxed, so that names originally defined in different .h files could appear together in the same new .h file, partitioning the original .h files would produce a total of 189 new .h files, each containing an average of 4 names. The largest would again contain 54 names; 86% would contain between 1 and 4 names.

For Descartes there is not a reasonable program organization where use and visibility coincide, at least based on the current content of the Descartes .c files. While it would be interesting to repartition the content of the .c files as well as the .h files, doing so was not possible based on the available cross-reference data.

4.4 The RUV versus Predicate Performance

The last two sections have shown that Descartes is like other programs in its static name use and visibility patterns. However, if Descartes' RUV values are compared with appropriate ratios of predicate performance from Chapter 3 there is a pronounced difference.

                        RUV    NAME-USE / MAKE
compilation units      0.19         .43
lines compiled         0.20         .44

Table 4.6: Descartes' RUV versus Relative Predicate Performance

Table 4.6 contrasts the unit and line RUVs for Descartes taken from Table 4.1 with the ratios of NAME-USE to MAKE for the 39 configurations with .h file changes based on the lean data in Table 3.8. The table shows that the RUV values for Descartes are substantially lower than the ratios of predicate performance. This section tries to reconcile this difference and, using Descartes as an example, to understand the relationship between the RUV and predicate performance.

As explained in Section 4.1, two assumptions are required if the RUV is to be related to predicate performance. The first assumption is that only one name changes at a time; the second is that changes to all names are equally likely. The effect of these assumptions can be assessed by reexamining the historic data.

Section 4.4.1 considers the effect of the distribution of changes.

While numerous factors may affect the distribution of changes to a program relative to name use or visibility, the RUV is based on the assumption that changes are independent of either. To assess the effect of this assumption, I computed the partial RUV of the names that changed in the revisions analyzed in the previous chapter. How closely this number approximates the whole-program RUV reflects the extent to which use and visibility patterns in the population of the names that changed correspond to the same patterns in the population of names as a whole. For example, while programmers may be reluctant to change names in a widely visible interface, there also may be little incentive to change names that are only sparsely used. Under these circumstances, the names that are changed would tend to have lower than average visibility and higher than average use so that their partial RUV would be larger than the RUV of the program as a whole.

Section 4.4.2 considers the effect of the size of changes.

While many of the historic changes to the Descartes software certainly represent a sequence of smaller immediate changes, many immediate changes certainly affect more than one name. If the RUV for sets of names is defined analogously to the RUV for individual names, it is reasonable to expect that the more names that change at a time, the higher the RUV. (In the extreme, if a set of names is large enough then it will be both used by and visible to every compilation unit in the program.) To assess the effect of multiple name changes in Descartes, I used a combinatorial formula along with the empirical distribution of use and visibility from Section 4.3 to estimate average name use and visibility according to the number of names changed. Knowing how many names changed in each of the configurations of the previous chapter, static properties of Descartes can be correlated with predicate performance based on the historic change data. The results of this comparison indicate that sets of names that are changed at the same time have substantially lower use and visibility than arbitrarily chosen sets of similar size.

4.4.1 The Distribution of Name Changes in the Descartes History

Table 4.7 shows partial unit and line RUV values for the names that changed in the .h file revisions examined in Chapter 3. These partial RUVs reflect the distribution of changes in the Descartes history and therefore represent what would have been the relationship between NAME-USE and MAKE had only one name changed at a time.

Rather than return to the historic record to determine the number of units using or visible from each name, I used use and visibility data from the most recent configuration. There are two minor problems with this expediency. The first is that use and visibility change over time. Had I tabulated contemporary use and visibility data for each changed name, I probably would have produced a different partial RUV. However, because the structure of the crostic client was reasonably stable during the period studied, it is unlikely that the difference would have been large.

Historic Changes to Descartes .h Files

                                   Strict                              All        All
                                  Changes   Additions   Deletions   Changes     Names
Number of Names                     125        107         105        337        711
Average Name Use (units)            2.8        2.2          0         1.8        2.4
Average Name Visibility (units)     9.6       12.8        10.1       10.8       12.8
Unit RUV                           0.29       0.18          0        0.16       0.19
Line RUV                           0.30       0.18          0        0.17       0.20

Table 4.7: Comparative Name Use and Visibility for Historic Changes

The second problem is that some of the names that changed do not exist in the most recent configuration. Most of these names are deletions, which I simply assumed have no uses. I ignored the few remaining names that were added or changed before being deleted.

Table 4.7 shows that the historic partial RUV is smaller than the static RUV for Descartes and that, taken together, the names that changed are on average both less used and less visible than the average name in the program. Average name use is lower for additions and deletions than for the population as a whole and higher for strict changes; however, additions and deletions make up a significant fraction of all changes. The table shows that for the changes in Descartes' history, the RUV is a conservative predictor of the relative performance of NAME-USE and MAKE for changes to one name at a time.
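
As a quick check on the table, each unit RUV entry is just the ratio of the corresponding use and visibility averages (the single-name case of the set RUV defined in the next section); for example,

\[
\frac{2.8}{9.6} \approx 0.29 \quad \text{for strict changes,} \qquad
\frac{2.4}{12.8} \approx 0.19 \quad \text{for all names.}
\]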

Not surprisingly, this implies that the difference between the RUV values for Descartes and the ratios of NAME-USE to MAKE in Table 4.6 is due to the size, not the distribution, of changes.

4.4.2 The Effect of Grouping Name Changes in the Descartes History

The RUV values discussed so far are based on one name changing at a time. Since typical changes involve groups of names, it would be useful to understand the relationship between the RUV and group size. The RUV for a set of k names is a simple extension of the RUV for a single name: it is the ratio of set use to set visibility for all sets of names of size k. A set of names is used (visible) in a compilation unit if at least one name in the set is used (visible) in that unit.

If N^(k) is the set of all subsets of size k of a set of names N and if use_i is defined (as in Section 4.1) as the set of all units using name_i, then the unit RUV for sets of size k

(the k-RUV) is the ratio:

\[
\frac{\sum_{K \in N^{(k)}} \left|\, \bigcup_{i \in K} use_i \,\right|}
     {\sum_{K \in N^{(k)}} \left|\, \bigcup_{i \in K} visible_i \,\right|}
\]

By making simplifying assumptions once again, the problem of computing expected use and visibility for sets of names can be modeled as a variation of a standard combinatorial problem. This problem is the occupancy problem: If n balls are randomly distributed among m cells, what is the expected number of occupied cells? 5 Instead of balls and cells, our problem involves names and compilation units: We wish to compute the expected number of compilation units "occupied" by a set of n names. The simplifying assumptions involve how compilation units are assigned to names. Since names are used by more than one unit, instead of assigning each ball one cell, each name (a ball) is assigned between 0 and m compilation units (cells). The number of units assigned to each name is determined by the distribution of Figure 4.3(b). The specific units that make up that number are selected at random.

Figure 4.4 plots expected use and visibility for sets of names against the number of names in a set; Figure 4.5 plots the relationship between unit RUV and set size.

The paragraphs that follow explain the formula used to compute these numbers. Then the effect of the assumptions made in applying this formula is considered. Finally, the relationship between the results of the previous chapter and the k-RUV of this section is discussed.

Expected Use and Visibility for Sets of Names

Consider a program that has M compilation units and defines N names. If P(m, n) is the probability that a set of n names, chosen at random, is used in exactly m units, then the expected number of units using a set of n names is

\[
E(n) = \sum_{m=0}^{M} m \cdot P(m, n) \tag{4.1}
\]

The only problem is determining the probabilities P(m, n).

P(m, 1) is the probability that a given name is used in m units. It is defined according to the distribution for Descartes in Figure 4.2, as the ratio of the number of names used in m units to the total number of names. (For example 30 of the 711 names defined in the Descartes .h files are used in exactly 4 compilation units. Thus P(4, 1), the probability that a given name is used in 4 units is 30/711.)

5 For a discussion of the occupancy problem, see any good textbook on probability. For example, Dwass [17, pages 61-64].

[Figure 4.4 is a plot of expected use and visibility (in compilation units) against the number of names in a group, comparing the occupancy model estimates of use and visibility with the average use and visibility of generated name sets.]

Figure 4.4: Expected Use and Visibility for Groups of Names in Descartes

[Figure 4.5 is a plot of the unit k-RUV against the number of names in a group, comparing the ratio of the occupancy model estimates with the ratio for generated name sets.]

Figure 4.5: The k-RUV for Descartes

P(m, n) is computed according to the inductive formula:

\[
P(m, n+1) = \sum_{i=0}^{m} P(i, 1) \sum_{k=0}^{i} P(m-k, n)\,
\frac{\binom{m-k}{i-k}\binom{M-(m-k)}{k}}{\binom{M}{i}}
\]

This formula assumes that names are chosen at random, one at a time, until n + 1 names have been chosen. The probability that any given name is used in i compilation units is P(i, 1). Assuming further that these units are also chosen at random from among the M units in the program, there are \(\binom{M}{i}\) ways of choosing the i units.

The formula for m compilation units and n + 1 names is based on dividing the M compilation units in the program into two groups, one group consisting of those units that use the first n names chosen and the other group consisting of the remaining units. The first group may contain between 0 and m units. If the first group has m - k units then the set of n + 1 names will be used in m units only if the (n + 1)st name is used in k units taken from the second group. If the (n + 1)st name is used in i units, where m >= i >= k, then i - k of the units must be chosen from the first group.

There are \(\binom{m-k}{i-k}\) combinations of i - k units that can be selected from the first group of m - k units. There are, independently, \(\binom{M-(m-k)}{k}\) combinations of k units that can be selected from the second group of M - (m - k) units. The product of these two numbers is the number of ways i units may be selected out of M so that the n + 1 names are used in m units given that n names are used in m - k units. The probability that n names are used in m - k units is P(m - k, n).

By defining P(i, 1) to be the probability that a name is visible in i units, the same formula can be used to compute the expected number of units visible to a set of n + 1 names.
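
To make the computation concrete, the recurrence and formula (4.1) can be evaluated directly. The sketch below is illustrative rather than a reconstruction of the program actually used for the thesis; it assumes the single-name distribution is supplied as a list p1 in which p1[i] is the fraction of names used in exactly i units (for Descartes, for example, p1[4] = 30/711), and the function name is invented.

    from math import comb

    def expected_occupied(p1, M, n):
        """Expected number of compilation units "occupied" by a set of n names.

        p1[i] is the empirical probability that a single name is used in
        exactly i units; M is the total number of compilation units.
        Applies the inductive formula for P(m, n) and then formula (4.1)."""
        p1 = list(p1) + [0.0] * (M + 1 - len(p1))   # pad to length M + 1
        P = p1[:M + 1]                              # P[m] = P(m, 1)
        for _ in range(n - 1):                      # extend from j names to j + 1
            nxt = [0.0] * (M + 1)
            for m in range(M + 1):
                s = 0.0
                for i in range(m + 1):              # the new name is used in i units
                    if p1[i] == 0.0:
                        continue
                    for k in range(i + 1):          # k of those i units are "new"
                        s += (p1[i] * P[m - k]
                              * comb(m - k, i - k) * comb(M - (m - k), k)
                              / comb(M, i))
                nxt[m] = s
            P = nxt
        return sum(m * P[m] for m in range(M + 1))  # formula (4.1)

Running this with the Descartes use distribution should reproduce the use curve of Figure 4.4; substituting the visibility distribution gives the visibility curve, and the ratio of the two gives the occupancy-model estimate of the k-RUV plotted in Figure 4.5.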

Assumptions about the Distribution of Compilation Units

To model the use and visibility of sets of names as an occupancy problem, it is necessary to make certain assumptions about how units are assigned to names.

The first assumption is that the number of units assigned to each name is independent of the number assigned to any other name. Thus the probability that the nth name in a set has k clients is exactly the same as the probability that the ith name has k clients. While this would not be true if we were to generate sets of names, the effect of the assumption is minor as long as the number of names in the program is large compared with set size. (What we are doing is sampling with replacement when we should be sampling without replacement.)

The second assumption has a larger effect. We assume that units are assigned to a name at random so that all sets of compilation units of a given size are equally likely to be clients of the name. This means there is no relationship either between the units using a single name or between the units using a group of names. Such relationships clearly exist in practice. For example, if two units share the same view of an interface they will both use many of the same names, and if there is reason to change two names at the same time, chances are both names will share some clients.

A straightforward application of the occupancy problem makes it possible to estimate how visible the average Descartes .h file would be if its clients had been assigned to its names at random.

The Assignment of Clients to Names

In Descartes, the average name is used in 2.4 compilation units; the average .h file defines 25 names and is visible in 7.4 units. There are a total of 26 .c files, each of which is paired with a .h file. (We ignore the 2 .h files that are not so paired.) What if each of the 25 names in a .h file were assigned 2.4 clients at random? How visible would the average .h file be then?

The problem of estimating .h file visibility is simplified by assuming that each name has exactly 2 clients, one fixed (the paired .c file), and one chosen at random from the remaining 25 .c files. Since this simplification reduces the number of units assigned to each name, it produces a conservative estimate of .h file visibility. It also lets us map .h file visibility directly onto the occupancy problem: If the 25 names (balls) defined in a .h file are randomly distributed among 25 units (cells), the expected number of "occupied" units is 16. Thus had clients been assigned to names at random, the average .h file should have 17 clients--its paired .c file plus the 16 units using its 25 names. Conversely, the probability that a .h file defining 25 names would have fewer than 8 clients (including the paired .c file) is negligible (6.3 x 10^-9). If clients were chosen at random we would expect the average .h file with 8 clients to define about 7 names.
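
For reference, the 16-unit figure follows from the standard occupancy expectation (an intermediate step not spelled out above): distributing 25 names at random over 25 candidate units leaves each unit unoccupied with probability (24/25)^25, so

\[
E[\text{occupied units}] = 25\left(1 - \left(\tfrac{24}{25}\right)^{25}\right) \approx 25\,(1 - 0.36) \approx 16,
\]

and counting the paired .c file gives the expected 17 clients.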

This closer look at the assumption about the assignment of clients to names clearly shows that there is some relationship among the clients of the names defined in the same .h file. Because of these relationships, average use and visibility should be smaller for sets of names than indicated in Figure 4.4. However, it is not clear by how much, nor is it clear how relationships between the names in compilation units might affect the k-RUV.

Generating Name Sets to Compute the k-RUV

An alternative method for computing the k-RUV is to generate all sets of names of a given size along with their use and visibility sets. While this method does not model any relationship between the names in a set, it does produce, in proper proportion, only those use and visibility sets that occur in practice.

Unfortunately the combinatorics of generating all such sets quickly get out of hand. I was, however, able to generate all sets of 2 and 3 names and to compute their expected use, visibility and k-RUV. 6 These data points are marked in Figures 4.4 and 4.5.
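
Enumerating the sets themselves is conceptually simple. What follows is a sketch of the computation (not the program used for the thesis), assuming the cross-reference data is available as two mappings from each name to the set of units that use it and the set of units from which it is visible:

    from itertools import combinations

    def k_ruv_by_enumeration(use, visible, k):
        """Average use, average visibility, and k-RUV over all k-name sets.

        use and visible map each name to the set of compilation units that
        use it / from which it is visible (a hypothetical input format)."""
        names = list(use)
        total_use = total_vis = count = 0
        for group in combinations(names, k):
            total_use += len(set().union(*(use[n] for n in group)))
            total_vis += len(set().union(*(visible[n] for n in group)))
            count += 1
        return total_use / count, total_vis / count, total_use / total_vis

The expense is easy to see: with 711 names there are roughly 250,000 pairs and 60 million triples, which is consistent with sets of 4 or more names being out of reach.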

           Occupancy Model Estimation          Generated Name Sets
           Use   Visibility   RUV              Use   Visibility   RUV
2 names    4.6      19.3      0.24             4.5      17.5      0.26
3 names    6.6      22.6      0.29             6.4      19.6      0.32

Table 4.8: Differences in the Estimated and Generated k-RUV

Table 4.8 compares the values produced by the two methods of computing the k- RUV. For expected use, the difference in estimated and generated values is small. This is not surprising since most names are used by few enough clients that it is likely that 2 or 3 randomly selected names will have different client sets. For expected visibility, the difference is larger, probably because there is more overlap in the sets of clients visible to 2 or 3 names. If we extrapolate from these data points, we might expect a slower increase in average use and visibility than that of Figure 4.4 and a faster increase in the expected RUV.

The Number of Changes per Configuration versus the k-RUV

There is a third assumption underlying both methods of computing the k-RUV. This assumption is that all sets of names of a given size are equally likely to change. The historic data shows that this assumption is false. As a final comparison between the k-RUV and predicate performance, I consider a composite partial RUV based on the number of changes per configuration in the Descartes change history.

The ratio of NAME-USE to MAKE in Chapter 3 is the ratio of the actual number of compilation units using the names that changed in each of the revision groups analyzed to the actual number of units in which those names are visible. One way to assess the

6 My program, which was not optimized, had not produced results for sets of 4 names after running overnight on an Apollo DS 3000 workstation.

k-RUV is to compare these actual use and visibility numbers with the estimated use and visibility numbers computed using formula 4.1.

Of the 39 configurations with .h file changes examined in Chapter 3, 9 are inappropriate for such a comparison. The changes in these configurations consist only of initial revisions, changes to the umbrella .h file, null revisions or changes to a .h file that has no (lean) clients. Data on the remaining 30 configurations is shown in Table 4.9, sorted by the number of names that changed in each revision group.

The table shows that on average both estimated use and estimated visibility, representing groups of names chosen at random, are twice the corresponding actual values. This suggests that groups of names that are changed at the same time have more clients in common than arbitrarily chosen groups of names.

4.5 Summary

This chapter has explored in some depth the relationship between static patterns of name use and visibility and the predicate performance observed in Chapter 3. Because this exploration is complex, it is fitting to review its lessons once again.

What I set out to do in this chapter was to suggest (1) that the results of the previous chapter are not limited to the Descartes program and (2) that the relative performance of

NAME-USE and MAKE based on historic change data is conservative for changes as they take place. The data presented above support these conclusions. In particular, name use and visibility patterns in Descartes are similar to the same patterns in other programs. Thus it is unlikely that the predicate performance data reflects idiosyncrasies of the Descartes program. I also showed that the use and visibility patterns of the names that changed in the revisions analyzed in Chapter 3 are comparable to use and visibility patterns for the program at large. Thus if the same names were to change one at a time, the RUV would be a good predictor of predicate performance. This indicates that the observed predicate performance is due to the size, not the distribution, of changes.

To explore the relationship between the size of changes and static use and visibility patterns, I defined the k-RUV. While the k-RUV gives an indication of how predicate performance might vary according to the number of names changed at the same time, it depends on assumptions about how groups of names are selected for change that do not appear to hold true. In particular, comparisons with the historic data show that these assumptions overestimate both collective name use and name visibility by a considerable margin.

                       Names       Expected            Actual
                      Changed   Use   Visibility   Use   Visibility
Revision Group 19:       1      2.4      12.8        1        6
Revision Group 22:                                   0       22
Revision Group 29:                                   2       14
Revision Group 37:                                   2       14
Revision Group 38:                                   1        6
Revision Group 55:                                   3        3
Revision Group 61:                                   0       12
Revision Group 52:       2      4.6      19.3        2        2
Revision Group 14:       3      6.6      22.6        2       18
Revision Group 18:                                   9       11
Revision Group 20:                                   5       13
Revision Group 23:                                   4       11
Revision Group 34:                                   8       11
Revision Group 36:                                   9       11
Revision Group 41:                                   6        6
Revision Group 63:                                   4        9
Revision Group 13:       4      8.4      24.3        2       18
Revision Group 6:        5     10.1      25.1        1        3
Revision Group 49:       7     12.9      25.8        4        7
Revision Group 44:       8     14.2      25.9        6        7
Revision Group 28:       9     15.3      25.9        6       23
Revision Group 46:                                   2        2
Revision Group 48:                                   5        5
Revision Group 57:                                   1        8
Revision Group 24:      10     16.3      26.0        5       18
Revision Group 54:      16     20.6      26.0        9        9
Revision Group 45:      24     23.5      26.0       16       23
Revision Group 1:       47     25.7      26.0       14       19
Revision Group 51:                                  10       14
Revision Group 27:      96     26.0      26.0       12       12
average                        10.6      21.7       5.0     11.2

Table 4.9: Estimated versus Actual Use and Visibility

The most profound result of this chapter is undoubtedly the discovery that average name use was consistently between 2 and 3 for the names in all six programs analyzed and that furthermore the distribution of use over names was comparable in all programs. For Descartes at least, patterns of name use do not support any natural modularity of any scale.

Chapter 5

Recommendations and Conclusion

This thesis addresses the problem of controlling the cost of software manufacture. It is based on two observations. First, when incorporating changes in a configuration the only manufacturing steps that must be performed are those that ultimately produce a detectable difference in a software product. Second, there is a spectrum of techniques for determining a priori whether a given step may produce such a difference. These techniques, called difference predicates, differ in their method of operation and in their relative strength. Their effectiveness depends on program structure, on properties of the tools used in the manufacturing process, and on patterns of change. Through a small number of case studies designed to compare and explain predicate performance, the thesis shows that the use of an appropriate predicate can result in substantial savings in manufacturing costs. Moreover, these case studies also reveal some interesting facts about programs.

The value of this work is due in part to the results of the case studies of Chapters 3 and 4 and in part to the introduction of a fresh approach to software manufacture that led to those results. In the next sections, I make concrete recommendations for practice based on the results of the case studies and suggest additional studies to substantiate and extend those results. I then enumerate the contributions of the thesis and conclude by indulging in some speculation on its broader implications.

5.1 Recommendations

The results of Chapters 3 and 4 are based on a small number of programs and must be further substantiated. However, in the absence of contradictory evidence, these results are sufficiently clear to influence the practical decisions programmers and project managers must make today. Accordingly, the following recommendations are discussed in order of increasing levels of investment (and risk).

Organize programs to limit name visibility.

There are a number of simple, low-cost measures that individual programmers can use to reduce the visibility of names in programs. The profound effect on compilation costs of .h files that are included but not used in the Descartes system suggests that much can be gained from identifying and eliminating unused imports. Programmers should avoid techniques, such as the use of umbrella include files, that lead to the wholesale import of entire interface hierarchies.
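
As an illustration of the unused-import check suggested here, the following is a minimal sketch, assuming a cross-reference database can provide, for each unit, the .h files it includes and the names it references, and, for each .h file, the names it defines (the mapping names are invented for illustration):

    def unused_imports(includes, uses, defines):
        """For each unit, the .h files it includes but from which it uses no names.

        includes: unit -> set of included .h files
        uses:     unit -> set of names the unit references
        defines:  .h file -> set of names it defines"""
        return {u: {h for h in hs if not (defines.get(h, set()) & uses[u])}
                for u, hs in includes.items()}

Every entry this reports is an import whose removal reduces the unit's visibility without affecting its compilation.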

Another technique programmers should avoid is the use of catch-all interfaces. In C this is a particular problem. Since interfaces (.h files) and implementations (.c files) are not necessarily paired, C programmers often provide only one .h file per subsystem. Where interfaces are overly broad, it may be possible to partition a composite interface into smaller units each of which has more limited visibility. It may also be useful to break apart large implementation modules based on their use of imported interfaces.

Finally, the incidence of names with no more than one client indicates that many names may be removed from interfaces and either deleted entirely or made local to the single module in which they are used. Cross-reference tools are essential for the discovery of such names.

Limiting name visibility requires additional bookkeeping to track dependencies. While reductions in compilation costs may be modest, they are realized every time a system is recompiled. An investment in program organization also yields cleaner code that is easier to maintain.

Use cross-reference tools for use/visibility analysis.

While an interactive browser may be state-of-the-art, and grep 1 may be the tool of choice when searching for all the places a given identifier is mentioned, neither tool is useful for understanding cross-reference patterns. For example, neither is designed to answer queries like: "How many of the names defined in module A are used in module B?" Or: "On average, how many modules use the types declared in program P?"
1 The UNIX tool grep is a regular expression pattern matcher that reports all occurrences of a given expression in a given set of files.

Coupled with postmortem analysis tools, older batch-style cross-reference tools can be used to detect both names that are visible but not used and interfaces that are imported but not used, as well as answer other questions about general patterns of name use and visibility. Such tools make it easier for programmers to reorganize programs effectively. They are also essential to enlarge the name use and visibility studies of Chapter 4.
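
As an illustration of the kind of postmortem analysis intended here, the sketch below answers the two queries above from a flat table of (interface, name, using unit) triples; the table format and the identifiers in it are hypothetical, not taken from any particular tool:

    from collections import defaultdict

    # Hypothetical cross-reference dump: (defining .h file, name, using .c file).
    xref = [
        ("symtab.h", "sym_lookup", "parse.c"),
        ("symtab.h", "sym_lookup", "eval.c"),
        ("symtab.h", "sym_dump",   "debug.c"),
    ]

    def names_used(interface, unit):
        """How many of the names defined in `interface` are used in `unit`?"""
        return len({n for (i, n, u) in xref if i == interface and u == unit})

    def average_clients_per_name(interface):
        """On average, how many units use each name defined in `interface`?"""
        clients = defaultdict(set)
        for i, n, u in xref:
            if i == interface:
                clients[n].add(u)
        return sum(len(c) for c in clients.values()) / max(len(clients), 1)

Aggregated over all interfaces, the second query yields average-use statistics of the kind reported in Chapter 4.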

Unfortunately good cross-reference tools are not always easy to find. For example, to obtain the use and visibility data for C programs, I had to manually reprocess the available cross-reference data. To obtain data for Ada programs, Phil Levy built a special purpose tool.

Replace MAKE with NAME-USE in existing programming environments.

Because they require more information about a program than file names and change dates, selective recompilation techniques based on name use are incompatible with the timestamp approaches to software manufacture typical of existing file-based programming environments. However, despite the high cost of introducing name-based predicates in existing environments and the lack of commercial products, all the evidence of Chapters 3 and 4 points to the value of making recompilation decisions on a per name rather than a per file basis.

To develop a name-based predicate for a file-based environment it is necessary to produce a tool that processes interface units that have changed. For C this is especially difficult since in general .h files cannot be parsed individually. To determine what names have changed, the tool must maintain information about previous versions of interfaces. To efficiently identify where changed names are used, it must also maintain a complete cross-reference database. While the performance of a name-based recompilation tool depends on exactly how it is implemented, Tichy found that the execution cost of his smart-recompilation tool, developed as a research prototype, was amortized by avoiding a single compilation [62].
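
The contrast between the two decision rules can be stated compactly. The sketch below is only a schematic of the comparison: it assumes the environment can already supply each unit's include list and the set of names each unit references (the cross-reference database mentioned above), and it sidesteps the hard problems of versioning interfaces and of parsing .h files in isolation.

    def recompile_make(changed_interfaces, includes):
        """MAKE-style rule: rebuild every unit that includes a changed .h file
        (visibility).  `includes` maps each unit to its set of .h files."""
        return {u for u, hs in includes.items() if hs & changed_interfaces}

    def recompile_name_use(changed_names, uses):
        """NAME-USE-style rule: rebuild only units that reference a changed
        name.  `uses` maps each unit to the set of names it references."""
        return {u for u, names in uses.items() if names & changed_names}

The difference between the two result sets is precisely the redundant recompilation measured in Chapter 3.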

Avoid being too smart.

The comparative performance of DEMI-ORACLE and NAME-USE in the Descartes study of Chapter 3, reinforced by the name use data of Chapter 4, indicates that there is little to be gained by using predicates weaker than NAME-USE. The added complexity of the weaker predicates, reflected in higher development and execution costs and less robustness, is probably not balanced by savings in compilation costs. The same data indicates that language and program implementation techniques that defer bindings in order to short-circuit dependencies (for example, using record descriptors to compute field offsets at run-time or storing a constant in memory rather than embedding it in an instruction) are not in general effective for reducing compilation costs. While there may be projects where the use of weaker predicates or deferred binding techniques are valuable, these techniques should not be applied unless an analysis of the project itself suggests they are warranted.

The relative performance of GRATUITOUS indicates that changes to comments or whitespace do not constitute an important source of redundant compilations.

5.2 Further Empirical Studies and Other Research

This thesis raises many questions for further research. Some of these questions are considered in the paragraphs that follow. In particular, it would be worthwhile to undertake large-scale empirical studies designed to substantiate and refine the results of Chapters 3 and 4 and smaller studies designed to elicit new information about software manufacture. While the former require a substantial investment in tooling, the success of the case studies demonstrates the value of making the investment. Some tooling can be used as an adjunct to program development as described above.

Large intermodule name use studies.

The unexpected consistency of patterns of name use in the programs studied in Chapter 4 raises the most provocative question of the thesis: How universal are these patterns? In most programs, is the average name declared in an interface used in only 2 or 3 compilation units?

To answer this question, it is necessary to look at a large number of programs written in different languages. By doing so, patterns may emerge relating name use or visibility to program size, structure or development language. For example, how does the partitioning of large systems into multiple subsystems affect the locality of name visibility and use?

Doing studies on this scale requires automating the data collection process. For Ada programs running in the Rational environment this is feasible using Phil Levy's tool.

Large intramodule name use studies.

The use and visibility studies of Chapter 4 examined name use among compilation units, but did not consider how names are used within those units. There is no reason to believe that each name used in a compilation unit will be referenced by every subroutine or variable defined within the unit. In fact, the sparseness of name use among compilation units suggests that name use within units, themselves composite, may be comparably sparse. If so, then subroutine-at-a-time compilation may be an effective strategy for reducing compilation costs.

The predicate NAME-USE recompiles the minimum number of units according to the name use criterion. However, the number of lines compiled is inflated by unused symbols in interfaces. According to both Conradi and Wanvik, and Kamel and Gammage, unused symbols account for 75% to 85% of the symbols imported during compilation. Because the bodies of compilation units, like interfaces, are composite, one can speculate that there may be comparable overhead in processing unaffected parts of the unit body. Thus subroutine-at-a-time compilation performed in the context of minimal interfaces (that is, interfaces composed according to use, not visibility) has the potential for substantial savings in the number of lines compiled.

To evaluate the potential of subroutine-at-a-time compilation, it is necessary to investigate how references to imported names are distributed within a compilation unit. Unfortunately typical cross-reference tools do not provide the necessary functionality to do so.

Large in vivo change studies.

The Descartes study of Chapter 3 and Adams, Weinert and Tichy's similar study of an Ada program show that for historic changes, there is considerable redundancy in manufacture. Because historic changes represent sequences of immediate changes (those made during a single iteration of the edit-compile-debug cycle), for the programmer at work the size of the average change is probably smaller and the amount of redundancy in manufacture is probably higher than what is reflected in the historic record.

While historic studies may provide an accurate picture of the selection of names that change, they give conservative estimates of other change-related properties. Studies of the immediate changes made by programmers at work will answer questions about the actual frequency of interface changes and the average size of those changes. Large studies may help correlate the size of interface changes with the amount of redundancy in manufacture and reveal any relationship between use or visibility and the selection of names that change.

To carry out this kind of study on a large scale, it is necessary to instrument a widely used manufacturing environment. In a Stanford University study, Linton and Quong instrumented make to measure the effectiveness of a strategy for incremental link editing [42]. They found a ratio of 1 to 2 compilations per link, indicating that most changes affected only 1 or 2 modules. It is hard to reconcile this result with the frequency of interface changes seen in both the Descartes and in the Adams-Weinert-Tichy change histories along with the data on average name visibility from Chapter 4. For a brief discussion of the Linton-Quong study see Section 1.4.3.

Case study of fine program structure.

An advantage to carrying out case studies without the benefit of fully automated tooling is the insight such studies provide about the phenomenon under investigation. One such study that might be valuable is an exercise in representing a program as a collection of smaller entities. For example: What if the Descartes crostic client system was represented as a collection of individual subroutines, variables and type definitions? Would there be any natural modularity to the structure? How many separate units would there be? How big would the minimal interfaces needed to compile these units be? How many units would have interfaces in common?

Like the case studies reported here, the results of such a study may suggest patterns that can be substantiated by subsequent larger studies.

Data representation for subroutine-at-a-time compilation.

A major focus of current programming environment research is how to represent the data that is manipulated by the environment. With emerging object-oriented database technology, it is possible for an environment to maintain more detailed information with more complex structure [70]. Given the effectiveness of name-based predicates and their mismatch with traditional file-based environments, representations supporting subroutine-at-a-time (incremental) compilation based on name use may be the most effective strategy for reducing compilation costs. While this strategy is used by both the SMILE system (for semantic analysis) and the Rational Environment, the effectiveness of either system has not been analyzed. Because each system imposes certain constraints on development and because the Rational system is expensive, compilation costs are a secondary factor in the decision to use either system.

A challenge in programming environment research is to find appropriate data representations for different activities performed in the environment and, where necessary, to efficiently map between different representations. A representation supporting efficient recompilation must balance the overhead of managing complex data structures for a large number of units against the cost of redundant processing. What is the best representation for recompilation may not be the best representation for version control, or for software distribution, or for concurrent execution.

Additional studies.

Other studies suggested by the model have to do with evaluating strategies for using difference predicates and applying predicates to operations other than compilation. For example: Should the shape of a manufacturing graph affect the choice of predicate, especially when there are transitive dependencies as in Ada, when a manufacturing step has large fan-out, or when a manufacturing step produces multiple outputs? Does predicate cost indicate that cheaper, less effective predicates should be tried before a more expensive predicate like NAME-USE? What predicates are appropriate for tools like yacc, lex or idl?

5.3 Summary of Contributions

This thesis has made contributions in several areas. These contributions are summarized below.

The Model

This thesis began by defining a model of software manufacture based on a dependency graph representation of a software configuration and the use of difference predicates to control the amount of manufacture needed to incorporate changes. The dependency graph representation is versatile enough to represent arbitrary manufacturing steps and unique in that it represents tools, options and hidden dependencies at the same level as programmed components. Difference predicates represent the spectrum of techniques that may be used to produce "equivalent" software products consistent with a given set of primitives. Predicates differ in their method of operation, in their relative strength and in their relative effectiveness.

• The model raises questions of clear practical significance about the manufacturing process and provides a framework for effectively addressing those questions.

• While tools dedicated to rebuilding software have been available for some time, this research has been the first to distinguish software manufacture as an independent problem in the larger context of software configuration management and to demonstrate the value of treating it as such.

Incidence, Causes and Responses to Redundancy in Compilation

The model was applied to the problem of compilation in C by comparing the effectiveness of seven predicates applied to the change history of a small program, the Descartes crostic client. The results of that study were clear. Only two predicates are interesting: MAKE, which makes recompilation decisions based on name visibility, and NAME-USE, which makes recompilation decisions based on name use. The relative performance of these two predicates is corroborated by a recent study by Adams and colleagues.

The change history study was reinforced by examining static name use and visibility patterns in six C and Ada programs, including Descartes. While the six programs differed in other ways, in each program the average number of compilation units using a name defined in an interface was between 2 and 3.

• The Descartes change history study demonstrates that there is considerable redundancy in compilation - even for a small program written in C. It also suggests concrete steps that can be taken to eliminate that redundancy.

• The change history study not only quantifies the performance of NAME-USE relative to MAKE, it also quantifies the performance of NAME-USE relative to a practical upper bound on predicate performance. The results of the study suggest that a weaker predicate that considers the content of a change and the context of its use is not substantially better than NAME-USE, and that a stronger predicate that detects only changes to comments or whitespace is not substantially better than MAKE.

Insights about Program Structure

The unexpected consistency in patterns of name use led to insights about program structure and modularization. An analysis of the module interconnection structure of the Descartes crostic client indicates that while modules are clearly not random collections of names, few of the names defined in an interface are used by the same sets of clients.

• The name use studies of Chapter 4 provide concrete evidence suggesting that the average name defined in an interface is used in only 2 or 3 of the compilation units that import the interface. The same evidence suggests that many of the names defined in program interfaces are unused or used only in a single client.

• The thesis suggests that because grouping names into modules compounds their visibility, most unnecessary recompilations are the direct consequence of how we modularize software.

Platform for Further Work

The thesis set out to define and explore a paradigm for efficient manufacture that does not compromise reliability. In so doing, it has produced new insights about program structure as well as concrete recommendations for decreasing compilation costs. Nonetheless, the thesis has just begun to explore software manufacture.

• The success of the case studies of Chapters 3 and 4 provides method, material and incentive for further work to both broaden the scope of the thesis and to deepen its conclusions.

5.4 The Thesis in Perspective

The cost of software manufacture is one factor that influences how we organize software systems and design software production tools; other factors are intellectual and administrative control. While intellectual control is certainly the dominant problem for larger systems, the other two factors are far more tractable. Monolithic program development has been largely abandoned in favor of three paradigms that have evolved in response to the cost of introducing changes:

• The separately compiled languages are the medium of choice for the mainstream of software development and as such are the focus of the thesis. Using these languages, a system is built in pieces and recomposed by manufacture. The paradigm accommodates multiple programmers and produces a product that can be separated from its development environment.

• The interpreted languages, able to update a function at a time, sacrifice run-time efficiency and often reproducibility for flexibility during implementation. Notably with Lisp, the object of program development is to define the program itself, not to use it.

• Incremental programming environments that synthesize the other two paradigms are emerging. These environments have the potential to offer the control and efficiency of the separately compiled languages plus the flexibility of interpreted systems.

Ultimately, as we build faster processors and configure them to offer more parallelism, the techniques developed in this thesis may lose their importance for most development efforts. We certainly will continue to build larger and more ambitious systems that will strain the then existing technology, whatever it may be. We can speculate that the use of difference predicates will become even more important for such systems, especially those that gain leverage from tool-intensive development techniques such as the use of program generators or program transformers. The difference predicate approach to manufacture should also be applicable to other derivation relationships in program development including the relationships between specifications and designs, between designs and code, or between implementations and testing results, for example.

In future environments intellectual control over the development process will continue to be the dominant problem. If we can parse a million lines a minute on a single processor [28] and compile parts of a program concurrently [57], manufacturing costs will be less of an issue. It may become feasible, for example, to compile all but the largest systems monolithically. What will be important in such environments are facilities for viewing and navigating system structure, for understanding and manipulating the relationships between components, and for anticipating the effects of change.

What may be lasting about the thesis is its approach to software manufacture that considers the relationship between system structure and its sensitivity to changes. To the extent that manufacturing costs represent these underlying relationships, this approach will be important in how we design future programming environments and tools and how we process information.

Bibliography

[1] Rolf Adams, Annette Weinert, and Walter Tichy. Software change dynamics or half of all Ada compilations are redundant. In Proceedings of the European Software Engineering Conference, September 1989.

[2] James E. Archer, Jr. and Michael T. Devlin. Rational's experience using Ada for very large systems. In Proceedings of the First International Conference on Ada Programming Language Applications for the NASA Space Station, June 1986.

[3] N. Belkhatir and J. Estublier. Protection and cooperation in a software engineering environment. In Proceedings of the International Workshop on Advanced Programming Environments, pages 221-229, June 1986. Published by Springer-Verlag as Lecture Notes in Computer Science 244.

[4] Ellen Borison. A model of software manufacture. In Proceedings of the International Workshop on Advanced Programming Environments, pages 196-220, June 1986. Published by Springer-Verlag as Lecture Notes in Computer Science 244.

[5] P. M. Cashin, M. L. Joliat, R. F. Kamel, and D. M. Lasker. Experience with a modular typed language: PROTEL. In Proceedings of the 5th International Conference on Software Engineering, pages 136-143, March 1981.

[6] Geoffrey M. Clemm. The Odin System - An Object Manager for Extensible Software Environments. PhD thesis, University of Colorado, 1986.

[7] Reidar Conradi and Dag Heieraas Wanvik. Mechanisms and Tools for Separate Compilation. Technical Report 25/85, University of Trondheim, Norwegian Institute of Technology, November 1985.

[8] Jack Cooper. Software development management planning. IEEE Transactions on Software Engineering, 10(1):22-26, January 1984.

[9] Keith D. Cooper, Ken Kennedy, and Linda Torczon. Interprocedural optimization: eliminating unnecessary recompilation. In Proceedings of the SIGPLAN'86 Symposium on Compiler Construction, pages 58-67, June 1986. Published as SIGPLAN Notices 21:7 (July 1986).

[10] Lee W. Cooprider. The Representation of Families of Software Systems. PhD thesis, Carnegie-Mellon University, 1979.

[11] Eugene Cristofor, T. A. Wendt, and B. C. Wonsiewicz. Source control + tools = stable systems. In Proceedings of the 4th Computer Science and Applications Conference, pages 527-532, IEEE Computer Society, October 1980.

[12] Frank da Cruz. Kermit, A File Transfer Protocol. Digital Press, 1987.

[13] Manfred Dausmann. Reducing recompilation costs for software systems in Ada. Draft of a paper published in Proceedings of the IFIP Conference on System Implementation Languages: Experience and Assessment, North-Holland, Amsterdam, 1985.

[14] BLISS Language Guide. Digital Equipment Corporation, 1977.

[15] Frank DeRemer and Hans H. Kron. Programming-in-the-large versus programming-in-the-small. IEEE Transactions on Software Engineering, 2(2):80-86, June 1976.

[16] Reference Manual for the Ada Programming Language. United States Department of Defense, 1983. ANSI/MIL-STD-1815A-1983.

[17] Meyer Dwass. Probability and Statistics: An Undergraduate Course. W. A. Benjamin, Inc., Menlo Park, CA, 1970.

[18] V. B. Erickson and J. F. Pellegrin. Build - a software construction tool. AT&T Bell Laboratories Technical Journal, 63(6), July-August 1984.

[19] J. Estublier, S. Ghoul, and S. Krakowiak. Preliminary experience with a configuration control system for modular programs. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 149-156, April 1984. Published as SIGPLAN Notices 19:5 (May 1984).

[20] Jacky Estublier. Configuration management: the notion and the tools. In Proceedings of the International Workshop on Software Version and Configuration Control, Grassau, FRG, pages 38-61, January 1988.

[21] Peter H. Feiler, Susan A. Dart, and Grace Downey. Evaluation of the Rational Environment. Technical Report CMU/SEI-88-TR-15, Carnegie-Mellon University, Software Engineering Institute, July 1988.

[22] Peter H. Feiler and Gail E. Kaiser. Granularity Issues in a Knowledge-Based Programming Environment. Technical Report SEI-86-TM-11, Carnegie-Mellon University, Software Engineering Institute, September 1986. Presented at the 2nd Kansas Conference on Knowledge-Based Software Development, Manhattan KA, October 1986.

[23] Stuart I. Feldman. Evolution of MAKE. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 413-416, January 1988.

[24] Stuart I. Feldman. Make - a program for maintaining computer programs. Software--Practice and Experience, 9(4):255-265, April 1979.

[25] Glenn S. Fowler. The fourth generation Make. In Proceedings, Summer 1985 USENIX Conference, pages 159-174, June 1985.

[26] Geir A. Green, Svein O. Hallsteinsen, Dag H. Wanvik and Lars Nokken. Separate compilation in Chipsy. In Proceedings of the Second Chill Conference, March 1983.

[27] Svein O. Hallsteinsen and Dag H. Wanvik. Separate compilation in Chipsy. February 1985. Revision of [26].

[28] R. Nigel Horspool and Michael Whitney. Even Faster LR Parsing. Technical Report DCS-114-IR, University of Victoria, May 1989.

[29] Andrew Hume. Mk: a successor to make. In Proceedings, Summer 1987 USENIX Conference, pages 445-457, 1987.

[30] Gail E. Kaiser and Peter H. Feiler. Intelligent assistance without artificial intelligence. In Proceedings of the Thirty-Second IEEE Computer Society International Conference (COMPCON), February 1987.

[31] Gail E. Kaiser and A. Nico Habermann. An environment for system version control. In Digest of Papers COMPCON Spring 83, pages 415-420, IEEE Computer Society, February 1983.

[32] R. F. Kamel and N. D. Gammage. Further experience with separate compilation at BNR. In Proceedings of the IFIP Conference on System Implementation Languages: Experience and Assessment, North-Holland, Amsterdam, 1985.

[33] Ragui F. Kamel. Effect of modularity on system evolution. IEEE Software, 4(11):48-54, January 1987.

[34] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language. Prentice Hall, Englewood Cliffs, New Jersey, second edition, 1988.

[35] D. E. Knuth. An empirical study of FORTRAN programs. Software--Practice and Experience, 1(2):105-133, April-June 1971.

[36] Charles W. Krueger. The SMILE reference manual. In The GANDALF System Reference Manuals, May 1986. Carnegie-Mellon University Technical Report Number CMU-CS-86-130.

[37] Butler W. Lampson and Eric E. Schmidt. Organizing software in a distributed environment. In Proceedings of the SIGPLAN '83 Symposium on Programming Language Issues in Software Systems, pages 1-13, March 1983. Published as SIGPLAN Notices 18:6 (June 1983).

[38] Butler W. Lampson and Eric E. Schmidt. Practical use of a polymorphic applicative language. In Conference Record of the Tenth Annual ACM Symposium on Principles of Programming Languages, pages 237-255, January 1983.

[39] David B. Leblang and Gordon D. McLean, Jr. Configuration management for large-scale software development efforts. In Proceedings of the Workshop on Software Engineering Environments for Programming-in-the-Large, pages 122-127, June 1985.

[40] David B. Leblang and Robert P. Chase, Jr. Computer-aided software engineering in a distributed workstation environment. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 104-112, April 1984. Published as SIGPLAN Notices 19:5 (May 1984).

[41] Brian T. Lewis. Experience with a system for controlling software versions in a distributed environment. In Proceedings of the Symposium on Application and Assessment of Automated Tools for Software Development, pages 210-219, November 1983.

[42] Mark A. Linton and Russell W. Quong. A macroscopic profile of program compilation and linking. IEEE Transactions on Software Engineering, SE-15(4):427-436, April 1989.

[43] Axel Mahler and Andreas Lampen. An integrated toolset for engineering software configurations. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 191-200, November 1988. Published as SIGPLAN Notices 24:2 (February 1989).

[44] Josephine Micallef and Gail E. Kaiser. Version and configuration control in distributed language-based environments. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 119-143, January 1988.

[45] James H. Morris, Mahadev Satyanarayanan, Michael H. Conner, John H. Howard, David S. H. Rosenthal, and F. Donelson Smith. Andrew: a distributed personal computing environment. Communications of the ACM, 29(3):184-201, March 1986.

[46] K. Narayanaswamy and Walt Scacchi. Maintaining configurations of evolving software systems. IEEE Transactions on Software Engineering, SE-13(3):324-334, March 1987.

[47] John R. Nestor. Toward a persistent object base. In Proceedings of the International Workshop on Advanced Programming Environments, pages 372-394, June 1986. Published by Springer-Verlag as Lecture Notes in Computer Science 244.

[48] John R. Nestor and Margaret A. Beard. Front end generator system. In Computer Science Research Review (1980-1981), pages 75-92, 1982. An annual report published by the Department of Computer Science, Carnegie-Mellon University, PA.

[49] John R. Nestor, Joseph M. Newcomer, Paola Giannini, and Donald L. Stone. IDL: The Language and Its Implementation. Prentice Hall, Englewood Cliffs, New Jersey, 1989.

[50] Dewayne E. Perry. Version control in the Inscape environment. In Proceedings of the 9th International Conference on Software Engineering, pages 61-69, March 1987.

[51] Mark Rain. Avoiding trickle-down recompilations in the Mary2 implementation. Software--Practice and Experience, 14(12):1149-1157, December 1984.

[52] Marc J. Rochkind. The source code control system. IEEE Transactions on Software Engineering, 1(4):364-370, December 1975.

[53] Graham Ross. Integral C - a practical environment for C programming. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 42-48, December 1986. Published as SIGPLAN Notices 22:1 (January 1987).

[54] Andres Rudmik and Barbara G. Moore. An efficient separate compilation strategy for very large programs. In Proceedings of the SIGPLAN '82 Symposium on Compiler Construction, pages 301-307, June 1982. Published as SIGPLAN Notices 17:6 (June 1982).

[55] Eric Emerson Schmidt. Controlling Large Software Development in a Distributed Environment. PhD thesis, University of California, Berkeley, 1982.

[56] Robert W. Schwanke and Gail E. Kaiser. Technical correspondence: smarter recompilation. ACM Transactions on Programming Languages and Systems, 10(4):627-632, October 1988.

[57] V. Seshadri, D. B. Wortman, M. D. Junkin, S. Weber, C. P. Yu and I. Small. Semantic analysis in a concurrent compiler. In Proceedings of the SIGPLAN'88 Conference on Programming Language Design and Implementation, pages 233-240, June 1988. Published as SIGPLAN Notices 23:7 (July 1988).

[58] Mary Shaw, Ellen Borison, Michael Horowitz, Tom Lane, David Nichols, and Randy Pausch. Descartes: a programming-language approach to interactive display interfaces. In Proceedings of the SIGPLAN'83 Symposium on Programming Language Issues in Software Systems, pages 100-111, June 1983. Published as SIGPLAN Notices 18:6 (June 1983).

[59] Barbara J. Staudt, Charles W. Krueger, A. N. Habermann, and Vincenzo Ambriola. The GANDALF System Reference Manuals. Technical Report CMU-CS-86-130, Carnegie-Mellon University, May 1986.

[60] The Network Software Environment. Sun Microsystems, Inc., 1989.

[61] Walter F. Tichy. Design, implementation, and evaluation of a revision control system. In Proceedings of the 6th International Conference on Software Engineering, pages 58-67, September 1982.

[62] Walter F. Tichy. Smart recompilation. ACM Transactions on Programming Languages and Systems, 8(3):273-291, July 1986.

[63] Walter F. Tichy. Software Development Control Based on System Structure Description. PhD thesis, Carnegie-Mellon University, 1980.

[64] Walter F. Tichy. Tools for software configuration management. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 1-20, January 1988.

[65] Mary Pfreundschuh Wagner and Ray Ford. Using attribute grammars to control incremental, concurrent builds of modular systems. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 283-304, January 1988.

[66] W. M. Waite, V. P. Heuring, and U. Kastens. Configuration control in compiler construction. In Proceedings of the International Workshop on Software Version and Configuration Control, pages 159-168, January 1988.

[67] James W. Wendorf. Operating System / Application Concurrency in Tightly-Coupled Multiple-Processor Systems. PhD thesis, Carnegie-Mellon University, 1987.

[68] J. F. H. Winkler, editor. Proceedings of the International Workshop on Software Version and Configuration Control, B. G. Teubner, Stuttgart, FRG, 1988.

[69] Alexander L. Wolf, Lori A. Clarke, and Jack C. Wileden. The AdaPIC tool set: supporting interface control and analysis throughout the software development process. IEEE Transactions on Software Engineering, 15(3):250-263, March 1989.

[70] Stanley B. Zdonik and David Maier. Fundamentals of object-oriented databases. In Readings in Object-Oriented Database Systems, pages 1-32, Morgan Kaufmann Publishers, Inc., San Mateo, California, 1990.