Supporting Component-Based Software Development with Active Component Repository Systems
by
Yunwen Ye
B.Sc., Fudan University, China, 1987
M.S., Fudan University, China, 1990
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
Department of Computer Science
2001
This thesis entitled:
Supporting Component-Based Software Development with Active Component Repository Systems
written by Yunwen Ye
has been approved for the Department of Computer Science
Gerhard Fischer
James Martin
Date
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.
Ye, Yunwen (Ph.D., Computer Science)
Supporting Component-Based Software Development with Active Component Repository Systems
Thesis directed by Prof. Gerhard Fischer
It is widely believed and empirically proven that component reuse improves both the quality and productivity of software development. Before software components are reused, however, they must be located. Component repository systems provide a means to locate software components. Current component repository systems are designed to support the paradigm of development-with-reuse, which views reuse as a process independent of the whole software development process and relies on programmers to take the reuse initiative. Such systems fall short in supporting programmers who make no attempt to reuse because they do not know of the existence of reusable components or because they perceive that reuse costs more than programming from scratch.
This dissertation advocates a paradigm shift from development-with-reuse to reuse-within-development, which views reuse as an integral part of software development, and component repository systems as information systems that augment programmers’ insufficient knowledge about reusable components and assist them in accomplishing their tasks. Active component repository systems (component repository systems equipped with active information delivery mechanisms) support reuse-within-development. They can be seamlessly integrated with programming environments. Through this integration, their active information delivery mechanism delivers task-relevant and user-specific components, without being given explicit reuse queries, to help programmers reuse unknown components and to reduce the cost of reuse.
An active component repository system, CodeBroker, has been developed and evaluated.
CodeBroker runs continuously in the background of a programming environment and infers programmers’ needs for reusable components by monitoring their interactions with the environment. Potentially reusable components that match reuse queries extracted from comments and signatures in the programming environment are autonomously located and actively delivered to programmers. Formal evaluations of the CodeBroker system have indicated that it motivated programmers to reuse once relevant components were delivered, and that it was able to deliver components relevant to both the task and the background knowledge of programmers.
Acknowledgments
I feel very fortunate that my employer, Software Research Associates, Inc. (SRA),
Tokyo, Japan, provided me with the time and financial support to complete this research. In particular, I thank Kouichi Kishida, executive vice president and technical director of SRA, for his lasting support and encouragement, without which I could not have finished this research.
Yoshitaka Matsumura, Kaoru Hayashi, and Yoshikazu Hayashi have been excellent managers who have gone to great lengths to provide the best conditions for me to complete my research.
I also want to thank my colleague Tomohiro Oda for his help.
I am grateful to the members of my thesis committee. Gerhard Fischer, my advisor, is simply the best advisor I could have found. His conceptual frameworks on Domain-Oriented
Design Environments and on learning have provided the foundations for this research. Without his excellent skills in challenging my ideas and motivating me to think deeper, I could not have
finished the research in this manner. Kumiyo Nakakoji, my mentor and role model, has provided immeasurable support, both emotionally and intellectually. She has always been there when I needed help. Brent Reeves has spent much time patiently listening to my sometimes rough ideas and reading my immature manuscripts, and has provided frank, yet friendly, critical feedback.
His constructive criticism has been invaluable in guiding me to frame the research problem, prioritize my resources, and present my ideas clearly. The support from other members of my thesis committee, Ken Anderson, James Martin, and Walter Kintsch, helped me to clarify my understanding, and their input is very much appreciated. In particular, I thank James Martin for his excellent course on Natural Language Processing, which introduced me to the research field of information retrieval. That was one of the best courses I have ever taken.
Members of the Center for LifeLong Learning and Design have been very supportive. I thank Taro Adachi for numerous, wide-ranging discussions that I have greatly enjoyed over the years. Jonathan Ostwald generously offered many times to listen to my thoughts and read my writings. His encouragement and feedback are greatly appreciated. I was extremely delighted to have as an officemate Eric Scharff, who had an answer to every computer problem I had, no matter whether it was a Mac, Windows, or Linux problem. Many discussions with Rogerio de Paula helped me structure my thoughts. I thank Gerry Stahl, Hal Eden, Andy Gorman, and
Francesca Iovine for their support.
Finally, I would like to thank my family members. I thank my parents, who have taught me the joy of learning and have always urged me to do my best. I thank my eldest daughter,
Hanlu, for understanding when she had to spend many weekends being bored because dad had to work, and my 5-month-old daughter, Hanlei, for her innocent and sweet smiles which provided the best comfort after a day’s hard work. Most of all, I wholeheartedly acknowledge the endless love, understanding, and support that my wife, Yonghong Pan, has given to me. In particular, I thank her for her unabated confidence in me, which has cheered me greatly at times of frustration.
Contents
Chapter
1 Introduction
1.1 Motivation
1.2 Goal of the Research
1.3 Active Component Repository Systems
1.4 The CodeBroker System
1.5 Organization of the Dissertation
2 Roles of Reusable Components in Programming
2.1 A Process Model of Programming
2.2 Programming Knowledge
2.3 Opportunistic Programming
2.4 Benefits of Software Components in Programming
3 Challenges of Software Reuse
3.1 Overview of Software Reuse
3.2 General Issues of Component Reuse
3.3 Creating Reusable Components
3.4 Understanding the Cognitive Difficulties of Component Reuse
4 The Component Locating Problem
4.1 No Attempt to Reuse
4.2 Paradigm Shift: From Development-with-Reuse to Reuse-within-Development
4.3 Information-Enriched Workspaces
4.4 Active Component Repository Systems
5 Active Information Systems
5.1 Basic Issues of Active Information Systems
5.2 Acquiring Information of User Tasks
5.3 Personalizing Information Delivery
5.4 Dealing with Partial, Imprecise Queries
5.5 Comparing Active Information Systems with an Example in the Real World
5.6 The Spectrum of Support for Locating Information
6 Indexing and Retrieval Mechanisms in CodeBroker
6.1 Indexing and Retrieval Mechanisms
6.2 Creating the Component Repository
7 Locating and Delivering Components in CodeBroker
7.1 System Architecture
7.2 Listener
7.3 Fetcher
7.4 Presenter
7.5 The Retrieval-by-Reformulation Mechanism
7.6 Summary of CodeBroker
8 Evaluations of CodeBroker
8.1 Evaluating the Retrieval Mechanisms
8.2 Empirical Evaluations of the CodeBroker System
8.3 Findings about the Usage of CodeBroker
8.4 Other Findings about Programming in General
8.5 Problems of CodeBroker and Needed Improvements
8.6 Summary of Evaluations
9 Related Work
9.1 Active Information Systems
9.2 Component Repository Systems
9.3 Intelligent Programming Environments
10 Future Work and Conclusions
10.1 Future Work
10.2 Summary
10.3 Contributions
Bibliography
Appendix
A The List of Queries and Relevant Components
B Questions Asked in the Post-Experiment Interview
C Abbreviations
D Glossary
Tables
Table
1.1 The rapid growth of the Java Core API library
4.1 Relations between reuse mode, knowledge sources, and tool support
5.1 A comparison between plan recognition and similarity analysis
8.1 Average precision and recall values for LSA, Mixed (average of LSA and Okapi), and Okapi
8.2 Programming knowledge and expertise of subjects
8.3 Overall results of evaluation experiments with programmers
8.4 Subjective evaluations of the CodeBroker system
8.5 Experiment data regarding user models
8.6 Experiment data about discourse models
Figures
Figure
1.1 The location-comprehension-modification process of reusing components
1.2 Software reuse failure modes
1.3 Overview of CodeBroker
2.1 The process model of programming
2.2 A program and its program plans
2.3 Orthogonality between program plans and software components
2.4 The role of components in problem framing
3.1 A cognitive model of the component reuse process
4.1 Different levels of programmers’ knowledge about a component repository
4.2 The development-with-reuse paradigm
4.3 The reuse-within-development paradigm
5.1 Feedforward information delivery
5.2 Autocompletion in Internet Explorer
5.3 Feedback information delivery
5.4 Two assumptions of similarity analysis
5.5 The spectrum of support to information location
6.1 The CodeIndexer and CodeBroker subsystems
6.2 The process of creating a component repository from Java programs
6.3 An example of a document generated by Javadoc
6.4 The indexing format of method documents in CodeBroker
7.1 The architecture of the CodeBroker system
7.2 Component delivery based on concept queries only
7.3 Component delivery based on both concept queries and constraint queries
7.4 Presenting more information triggered by mouse movement
7.5 An example discourse model
7.6 An example user model
7.7 An illustrative program for adaptive user modeling
7.8 The Skip Components Menu
7.9 The Direct Manipulation interface
7.10 The Query Refinement interface
7.11 Summary of CodeBroker
8.1 Recall-precision curves
Chapter 1
Introduction
1.1 Motivation
A wide gap exists between the constantly increasing demands for complex software systems and the capability of the software industry to deliver quality software systems in a timely and cost-effective manner. Software reuse, a development method of using existing reusable software components to create new programs, has been shown through empirical studies to improve both the quality and productivity of software development (Basili et al., 1996; Boehm, 1999). Software reuse also increases the evolvability of software systems because complex systems evolve faster when they are built from stable subsystems (Simon, 1996).
Programmers are knowledge workers, and programming is a process of progressive crystallization of their knowledge into a program. Knowledge needed during programming comes either from the programmer’s head or from such external sources as books, manuals, peer workers, and computerized information systems (Norman, 1993). A lack of needed knowledge is one of the major reasons for poor quality and productivity of programming. With the advent of object-oriented technology, reusable software components now comprise the bulk of programming knowledge. Easy access to needed external information, in particular, reusable software components, to complement the insufficient knowledge of programmers is thus critical to the improvement of programming quality and productivity.
If programmers know a reusable software component well enough, they may integrate it into their programs whenever it is applicable without even realizing they are reusing because such reusable components become “ready-to-hand” to programmers (Winograd & Flores, 1986). However, repositories of reusable software components are often so large that programmers cannot learn about all of the components before they start programming. Software component repositories are not static; they are constantly evolving with new components added and old components updated. As an example, Table 1.1 shows the rapid growth of the Java Core API (Application Programmer Interface) library, a repository of reusable components (classes and methods). Few Java programmers, if any, can claim that they know all the components in this library.

Version     No. of Packages   No. of Classes   Year of Release
Java 1.0    8                 211              1996
Java 1.1    23                503              1997
Java 1.2    59                1525             1998
Java 2      70+               2100+            1999

Table 1.1: The rapid growth of the Java Core API library
Programmers who have not learned a software component have to go through the reuse process if they want to reuse or use it in their programming. The reuse process consists of three steps: location, comprehension, and modification (Figure 1.1). Programmers have to locate those components that are potentially reusable in the current programming task from the component repository, comprehend their functionality and usage, and make necessary modifications if the components do not completely fit their needs (Fischer et al., 1991).
[Figure 1.1 diagram: nodes Location, Comprehension, and Modification; arrow labels explanation, reformulation, and extraction]

Figure 1.1: The location-comprehension-modification process of reusing components
Successful reuse requires that programmers be able to locate, comprehend, and modify needed reusable components.

The foremost obstacle to the success of component reuse is that programmers cannot locate needed software components quickly and easily. Locating reusable software components is often supported by component repository systems or reuse repository systems. Like many other information repository systems, browsing- and querying-oriented schemes have long served as the principal techniques for programmers to locate reusable software components. More innovative schemes, such as query by reformulation (Williams et al., 1982; Fischer & Nieper-Lemke, 1989; Henninger, 1993), information filtering (Belkin & Croft, 1992), and Latent Semantic Analysis (Landauer & Dumais, 1997), have introduced new possibilities. Unfortunately, the problem remains that programmers simply do not actively search for components and make no attempt to reuse. According to a study by Frakes and Fox (1995), no attempt to reuse is the leading failure mode of software reuse (Figure 1.2). This inhibiting factor to the wide success of reuse has been reported again and again by software companies that have tried to introduce reuse into their organizations (Devanbu et al., 1991; Rosenbaum & DuCastel, 1995; Fichman & Kemerer, 1997).
1.2 Goal of the Research
Although many factors, such as the lack of managerial commitment and the difficulty in developing good reusable components, affect the widespread uptake of reuse, this research focuses on the cognitive difficulties faced by programmers who try to reuse, because only when programmers are willing and able to put reuse into their daily practice will reuse become fruitful.
This research tries to create a conceptual framework to analyze what hinders programmers from making attempts to locate reusable components and, based on the analysis, it proposes a new approach to the design of component repository systems that can motivate and encourage programmers to reuse by reducing the difficulty of locating components.
By applying cognitive engineering (Norman, 1986) to the reuse process, a cognitive model of reuse is first built. Based on this cognitive model and past research on the effective use of large information repositories (Fischer, 2001), the following two barriers to the component locating process are identified.
(1) Due to the large volume and constantly evolving nature of component repositories, programmers often fail to anticipate the existence of reusable components; when they do not believe that a component exists in the repository, they will not even make an effort to locate it in the first place.
(2) Even if programmers are aware of the existence of reusable components, they do not want to start the locating process if they do not know how to locate the components or if they perceive that locating the components costs more than programming from scratch.

Figure 1.2: Software reuse failure modes
In the Frakes and Fox (1995) paper, seven conditions (attempting to reuse, components existing, components available, components found, components understood, components valid, and components integratable) form a successful reuse chain. A breakdown in any condition causes the failure of reuse. The above data were collected from 29 software development organizations. The Y-axis shows the percentage each condition plays in causing the failure of reuse.
Although reusable component repository systems have been an active research area for more than a decade, these two issues, especially the first one, have not been given enough attention. This is because those systems are designed to support the paradigm of development-with-reuse (Rada, 1995), which advocates reuse as a new paradigm for programming. Under this paradigm, the reuse process is treated as an independent process, and programmers have to change their current programming practice to embrace reuse; reusable component repository systems are researched as stand-alone systems under the assumption that programmers are always willing to use these systems and are able to use them with well-defined queries. Consequently, research on component repository systems has focused mainly on the information access mechanism only. Information access is an approach to obtain information that requires users¹ to start the information locating process through browsing or querying.
This research proposes a paradigm shift from development-with-reuse to reuse-within-development. Development-with-reuse is a methodology-centered view of reuse that demands that programmers adapt themselves to the new methodology, reuse. It does not concern itself with the confusions and difficulties faced by programmers who try to reuse. When the approach does not meet its expected success, programmers are labeled, due to their resistance to change, as having the NIH (Not Invented Here) syndrome (Fafchamps, 1994), and education of programmers about the value of reuse is called for.
Conversely, the reuse-within-development paradigm puts programmers back into the center and views reuse as an integral part of the whole programming process. It stresses that reusable component repository systems should serve as extensions to programmers’ limited knowledge. Such systems should actively participate in the programming process by providing programmers immediate and easy access to reusable software components instead of passively waiting to be explored after programmers have made the decision to reuse.

¹ Because users of component repository systems are programmers, in this thesis the term “user” is used interchangeably with the term “programmer”.
Reuse-within-development needs the support of active component repository systems. Active component repository systems are a subset of active information systems that are equipped with the information delivery mechanism. Unlike the passive information access mechanism by which users have to explicitly launch the information-seeking process by specifying their information needs in the form of well-defined queries or engaging in a series of browsing actions, the information delivery mechanism presents information to users on its own initiative without being prompted by explicit queries. With reusable components delivered by active component repository systems, programmers are able to reuse without changing their current programming practice and environment.
1.3 Active Component Repository Systems
In general, active information systems that just throw a piece of decontextualized information at a user are of little use because they ignore the user’s working context. The working context consists of the task acted upon and the user acting. The challenge of implementing an active information system or an information delivery system is to deliver context-sensitive information related to both the task at hand and the background knowledge of the user. Task- and user-independent information delivery systems, or “push” systems, such as Microsoft’s “Tip of the Day,” suffer from the problem that information gets thrown at users in a decontextualized way. The “Tip of the Day” is a feature that tries to acquaint users with some arbitrarily chosen functionality in a complex system. Despite the possibility for interesting serendipitous encounters of information (Roberts, 1989), most users find this feature more annoying than helpful.
The specific challenge faced by this research is to deliver context-sensitive components. In other words, how can the active component repository system capture programmers’ needs for reusable components by understanding to some extent what their tasks at hand are and then present only those task-relevant components that are not yet known to the programmers?

Needs for reusable software components are not determined before programming starts, as most current component repository systems have assumed; they arise in the middle of the programming process (Sen, 1997). Inasmuch as programmers are using computer-based development environments to develop software systems, it is possible for component repository systems to capture the reuse needs autonomously by utilizing information available in programming environments when component repository systems and program development environments are properly integrated. For example, in a programming editor, comments inside programs and signatures (the syntactical interfaces of program modules) are good indications of what programmers are going to develop next (Ye & Fischer, 2000). The integration of component repository systems and programming environments creates a shared workspace accessible to both programmers and component repository systems. This shared workspace enables component repository systems to play an active role in supporting reuse by programmers, with the delivery of task-relevant and user-specific reusable components. Presenting components specific to a programmer can be realized through user models (Fischer, 2001), because user models that represent programmers’ knowledge about reusable components can be used as filters by the repository system to ensure only unknown components are delivered.
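To make the idea of capturing reuse needs from the editor more tangible, the following Python sketch pulls a doc comment and the method signature that follows it out of a fragment of Java source text. It is only an illustration under stated assumptions: the regular expression, the function name extract_queries, and the sample method getRandomNumber are hypothetical and are not taken from CodeBroker; real Java source would need a proper parser.

```python
import re

# Hypothetical pattern: a Javadoc-style comment followed by a method header.
# It only handles simple cases and is not a substitute for a Java parser.
DOC_AND_SIGNATURE = re.compile(
    r"/\*\*(?P<doc>.*?)\*/\s*"              # doc comment body
    r"(?P<sig>[\w\s<>\[\],]+?\([^)]*\))",   # signature up to the closing ')'
    re.DOTALL,
)


def extract_queries(java_source):
    """Return (comment_text, signature) pairs found in the source text."""
    queries = []
    for match in DOC_AND_SIGNATURE.finditer(java_source):
        # Strip the leading '*' decoration from each comment line.
        lines = [line.strip().lstrip("*").strip()
                 for line in match.group("doc").splitlines()]
        comment = " ".join(part for part in lines if part)
        # Collapse whitespace so the signature is a single line.
        signature = " ".join(match.group("sig").split())
        queries.append((comment, signature))
    return queries


# Sample workspace content; the method name is made up for this example.
source = """
/**
 * Create a random number between two limits
 */
public static int getRandomNumber(int from, int to) {
"""
print(extract_queries(source))
```

The comment text could then serve as a description of what the programmer intends to write, and the signature as a constraint on the shape of a matching component.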
1.4 The CodeBroker System
An active component repository system, CodeBroker, has been developed. CodeBroker is integrated with the Emacs program development environment. It utilizes an information delivery mechanism to bring to the attention of Java programmers those components that are unknown to them and yet are relevant to their current programming task by
(1) constructing a task model to capture the programming task through continuously monitoring programming activities in the development environment;
(2) identifying the domains of a programmer’s current interest by creating a discourse model based on the history of interaction between the system and the programmer; and
(3) creating a user model to represent each programmer’s knowledge about reusable components to personalize the delivery.
Integrated with CodeBroker, the development environment becomes an information-enriched workspace (Ye, 2001b) consisting of the original programming environment and an augmented information display that presents reusable components dynamically based on the programming task and the programmer’s background knowledge. Programmers can access potentially reusable components immediately without switching working contexts. This is a distinct advantage because it avoids interrupting the programming flow. The operational interface of the component repository system becomes transparent to programmers, and is replaced by three cooperative autonomous software agents (Bradshaw, 1997): Listener, Fetcher, and Presenter. The Listener agent creates reuse queries from the programming workspace as the task model; the Fetcher agent retrieves components matching reuse queries; and the Presenter agent presents retrieved components directly into the workspace of programmers, using discourse models and user models as filters (Figure 1.3).
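The division of labor among the three agents can be sketched as a small pipeline. The Python code below is a toy illustration, not the CodeBroker implementation: the word-overlap ranking is a deliberately simple stand-in for a real retrieval mechanism, the component descriptions are invented for the example, and the shapes chosen for the discourse model (a set of package names to skip) and the user model (a set of known component names) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str          # fully qualified name, e.g. "java.util.Random.nextInt"
    description: str   # short documentation text (invented for this sketch)


def listener(workspace_comment):
    """Listener: turn text from the workspace into a reuse query (a word set)."""
    return set(workspace_comment.lower().split())


def fetcher(query, repository, top_n=3):
    """Fetcher: rank components by word overlap with the query.

    Word overlap is a simple stand-in chosen for illustration only.
    """
    scored = []
    for component in repository:
        overlap = len(query & set(component.description.lower().split()))
        if overlap > 0:
            scored.append((overlap, component))
    # Highest overlap first; ties are broken by name for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [component for _, component in scored[:top_n]]


def presenter(components, skipped_packages, known_components):
    """Presenter: drop components filtered out by the two models.

    skipped_packages plays the role of a discourse model and
    known_components the role of a user model (both hypothetical shapes).
    """
    return [
        c for c in components
        if not any(c.name.startswith(p + ".") for p in skipped_packages)
        and c.name not in known_components
    ]


repository = [
    Component("java.util.Random.nextInt", "returns a random int number"),
    Component("java.lang.Math.random", "returns a random double number"),
    Component("java.sql.Date.valueOf", "converts a string to a date"),
]
query = listener("Create a random number between two limits")
delivered = presenter(fetcher(query, repository),
                      skipped_packages={"java.sql"},
                      known_components={"java.lang.Math.random"})
print([c.name for c in delivered])
```

In this run the programmer is assumed to know Math.random already and to have no interest in the java.sql package, so only the remaining candidate is delivered.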
The information-enriched workspace created by active component repository systems improves the “readiness-to-hand” of components because it hides the retrieval interface of component repository systems from programmers so that programmers can directly interact with reusable components rather than the repository system.
Evaluations of the system with programmers have found that the system was effective in supporting reuse along the following three dimensions:
(1) CodeBroker effectively encouraged programmers to explore the possibility of reuse.
(2) Programmers were able to reuse unknown software components when they were delivered by the system.
(3) The combination of task models, discourse models, and user models succeeded in most cases in delivering context-sensitive reusable components.

Figure 1.3: Overview of CodeBroker
The programming environment is augmented with a reusable component information display (the lower buffer), which presents reusable components dynamically. These components are autonomously retrieved by three cooperative software agents (Listener, Fetcher, and Presenter) based on the programming task and the programmer’s background knowledge. In this example, the programmer can reuse the first component (highlighted) to implement the task: “Create a random number between two limits” (indicated in the doc comment), without leaving the programming environment or explicitly operating the component repository system.
1.5 Organization of the Dissertation
Chapter 2 of this dissertation presents a conceptual framework of programming for analyzing the roles of reusable components in programming. Most programmers follow the opportunistic programming strategy, and the availability of reusable components affects the choice of different development alternatives.
After overviewing the issues of instituting systematic reuse in a software development organization, Chapter 3 analyzes in detail the difficulties of component reuse from the perspective of programmers. Through cognitive engineering, a cognitive model of the reuse process is created and the challenges faced by programmers in each step are discussed.
Chapter 4 focuses on the central theme of this research: why locating components is difficult for programmers, and, in particular, what prevents them from attempting to reuse. Drawing on past research on the use of large information repositories and on human cognition theories, the argument is made that the “no attempt to reuse” phenomenon is caused by the existence of information islands and perceived low reuse utility. The concept of active component repository systems is introduced as a solution to this problem.
Chapter 5 delineates the challenges in implementing active information systems and their general solutions: Task models and discourse models contribute to the task-relevance of information delivery, and user models support the user-specific delivery. To accommodate the dynamic nature of the information-seeking process, the concept and role of retrieval-by-reformulation are discussed.
Chapter 6 describes the retrieval mechanisms used in the CodeBroker system and the CodeIndexer subsystem that creates the contents of the component repository from existing programs.
Chapter 7 presents the design and implementation of the CodeBroker system.
Chapter 8 presents the findings from formal evaluations of CodeBroker.
Chapter 9 compares this research with related work.
Chapter 10 concludes the thesis by discussing future research directions and summarizing the contributions of this research.
Chapter 2
Roles of Reusable Components in Programming
With the advent of object-oriented technology, reusable software components have become an indispensable part of programming knowledge: “[Reusable component] library design is [programming] language design” (Stroustrup, 1995). In addition to those classes and methods included in standard libraries of programming languages, such as the Java API library, many reusable software components are developed by software development organizations specifically for reuse or repackaged from previously developed systems.
Practitioners and researchers generally believe, and experiments have empirically proven, that component reuse improves the quality and productivity of programming (Lange & Moher, 1989; Lim, 1994; Basili et al., 1996; Simon, 1996; Boehm, 1999). However, most analyses of the benefits of reusable components have been based on the products finally produced. To better understand how reusable components help programmers produce better software systems faster (not a better product and a shorter production time, per se), we must analyze the roles of reusable components in the programming process. After presenting the process model of programming, drawing on design theory in general and empirical programming studies in particular, this chapter explains the benefits of reusable components in programming.
2.1 A Process Model of Programming
Viewed as a task to create a computer-executable representation (a program) of a real-world problem by piecing together a set of primitive elements provided by a programming language and its component libraries, programming consists of two distinctive, yet tightly intertwined processes: problem framing and problem solving (Schön, 1983; Hoc et al., 1990; Fischer, 1994).
2.1.1 Intertwining of Problem Framing and Problem Solving
During the problem-framing process, commonly known as the specification process in software engineering, programmers try to understand the problem given in the actual problem space by building a mental representation of the programming task. This mental representation is a situation model that is the result of the interaction between the problem and the programmer’s knowledge about the problem domain (Kintsch, 1998). Different programmers with different knowledge often come up with different situation models of the same programming task. During the problem-solving process, or implementation in software engineering terminology, programmers create programs based on the situation model as a new representation in the solution space defined by the programming language and its libraries.
Although problem solving starts after problem framing, these two processes are not separate. They influence each other: every transformation of the problem framing provides the direction in which a partial solution is to be transformed, and every transformation of the constructed solution determines the direction in which the framing is to be transformed. Like all other design activities, which are interactions between understanding (problem framing) and creation (problem solving) (Rittel, 1984; Winograd & Flores, 1986), programming is an iterative process of problem framing and problem solving. Programmers rarely complete one process before beginning the second one (Pennington & Grabowski, 1990) for the following two reasons.
(1) In most cases, programming tasks cannot be fully understood without considering the solution (Ghezzi et al., 1991). For example, given the programming task of drawing a filled circle, a programmer can define the filled circle as the trajectory of a fixed-length line rotated 360 degrees about one of its ends, or as the collection of dots whose distance to a center is not greater than the radius. Each definition is actually based on an intended solution to the problem.
(2) Programming involves many tentative problem-solving strategies. After those tentative strategies have been explored and their consequences evaluated, some become eventual commitments and some require the modification of the initial mental representation of the problem. This modification often breeds new subtasks to be solved.
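The two framings of the filled circle in item (1) can be made concrete. The following minimal Java sketch is an illustration added here, not code from the cited studies; the class name FilledCircle, its method names, and the 0.25 step along the line are hypothetical choices. It rasterizes a circle onto a small grid under each definition: one tests every dot's distance to the center, the other sweeps a line of fixed length around the center.

```java
final class FilledCircle {
    /** Framing A: collect every dot whose distance to the center is not greater than the radius. */
    static boolean[][] byDistance(int size, int cx, int cy, int radius) {
        boolean[][] grid = new boolean[size][size];
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int dx = x - cx, dy = y - cy;
                grid[y][x] = dx * dx + dy * dy <= radius * radius;
            }
        }
        return grid;
    }

    /** Framing B: rotate a line of fixed length about the center and mark the dots it passes over. */
    static boolean[][] bySweep(int size, int cx, int cy, int radius, int steps) {
        boolean[][] grid = new boolean[size][size];
        for (int s = 0; s < steps; s++) {
            double angle = 2 * Math.PI * s / steps;
            // Walk along the line from the center to its free end.
            for (double t = 0; t <= radius; t += 0.25) {
                int x = (int) Math.round(cx + t * Math.cos(angle));
                int y = (int) Math.round(cy + t * Math.sin(angle));
                if (x >= 0 && x < size && y >= 0 && y < size) grid[y][x] = true;
            }
        }
        return grid;
    }
}
```

Because of rounding, the two rasterizations need not agree dot for dot near the rim, which is one small way in which the chosen framing shapes the solution that results.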
2.1.2 Programming Is Knowledge Intensive
Neither problem framing nor problem solving is a simple transformation that converts one representation into another; instead, both are processes of interpretation. The programming task, the situation model, and the final program are representations at different levels of formalization and abstraction, intended for different purposes. Drawing on their knowledge, programmers have to interpret the previous representation by reifying abstract concepts, explicating the implicit, and structuring the symbols existing at the new representation level.
Knowledge required in programming can be divided into two categories: domain knowledge and programming knowledge. Domain knowledge is the knowledge about the problem domain and is mainly used in the process of problem framing. Programming knowledge is the knowledge needed to construct a program in the process of problem solving. However, due to the intertwined nature of those two processes, programming knowledge also contributes to problem framing, and domain knowledge contributes to problem solving as well. Figure 2.1 illustrates the process model of programming and its reliance on knowledge.
2.2 Programming Knowledge
Among the many constituents of programming knowledge (for example, the operation of compilers and other tools, general knowledge of data structures, and the capability of reasoning and abstracting), program plans and building blocks are two of the most important. As a series of interconnected actions to achieve a goal (Soloway & Ehrlich, 1984; Rich & Waters, 1990), a program plan provides a skeleton structure for programs by abstracting key elements. Building blocks are the primitive elements provided by a programming language. They include the basic statements of the programming language and reusable software components in repositories or libraries.
[Figure 2.1 diagram. Labels: Problem in Actual Problem Space; Situation Model in Represented Problem Space; Program in Solution Space; Programming (Problem Framing, Problem Solving); Domain Knowledge; Domain-Specific Programming Knowledge; Programming Knowledge]
Figure 2.1: The process model of programming
Problem framing and problem solving are intertwined and they require both domain knowledge and programming knowledge. Domain knowledge and programming knowledge often overlap, and the overlap becomes domain-specific programming knowledge.
2.2.1 Program Plans
Considerable evidence from empirical studies of programming shows that program plans are the basic cognitive chunks used in program design and understanding (Soloway & Ehrlich, 1984; Rich & Waters, 1990). Programs are often written one plan chunk at a time (Rist, 1995; Detienne, 1995). Because program plans are abstract representations of a solution, they need to be gradually fleshed out with building blocks during programming. A program often contains several plans that are interlaced. Figure 2.2 shows a program and the program plans it uses. Program plans are hierarchical: a program plan at a higher abstraction level is built upon program plans of lower levels. For example, in Figure 2.2, the plan Shuffling an array comprises three other program plans: Loop over an array, Create a random number in a range, and Swap two numbers.
2.2.2 Building Blocks
Although programmers can build a program with only the basic statements of a programming language, it is just as impossible to build a complex software system from basic program statements alone as it is to build a jet airplane from only nuts and bolts. Reusable software components are an indispensable part of the building blocks, especially in today’s object-oriented programming languages. A reusable software component is a software module that can be integrated into a new program directly or after minor changes. A software module refers to a named and addressable abstraction: either a procedural abstraction, such as a function, or a data abstraction, such as a class. Procedures, functions, methods, and classes are all considered software modules. In this dissertation, the term module refers to software abstractions to be developed by programmers, and the term component is used to refer to those modules that have been packaged for reuse. Because basic program statements of a programming language are not of interest in this research, the term “building block” is used throughout interchangeably with the term “software component.”
2.2.3 Orthogonality of Program Plans and Software Components
Software components are used to realize program plans. Program plans and software components are orthogonal to each other: a program plan can be realized with different software components, and a software component can be used in the realization of different program plans. Figure 2.3 illustrates the orthogonal relationship between program plans and software components.
    01 public class CardDealer {
    02   static int[] cards = new int[52];
    03   static { for (int i = 0; i < 52; i++) cards[i] = i; }
    04   /** create a random number in a range */
    05   public static int getRandomNumber(int from, int to) {
    06     return ((int) (Math.random() * (to - from)) + from);
    07   }
    08   /** shuffle the cards */
    09   public static void shuffleCards() {
    10     int r, temp;
    11     for (int i = 0; i < 52; i++) {
    12       r = getRandomNumber(i, 52);
    13       temp = cards[i];
    14       cards[i] = cards[r];
    15       cards[r] = temp;
    16     }
    17   }
    18   public static void main(String[] args) {
    19     shuffleCards();
    20     for (int i = 0; i < 52; i++) {
    21       System.out.print(" " + cards[i]);
    22     }
    23   }
    24 }
The above program contains the following program plans:

| Plan Name | Plan Description | Lines Realizing the Plan |
| --- | --- | --- |
| Create a random number in a range | Get the range; convert a random number between [0, 1.0] to the range. | 5,7 |
| Swap two numbers | Save one data to a temporary variable; move the other data to the saved data; move the temporary variable to the other data. | 13-15 |
| Loop over an array | Initialize; set the ending condition; perform operations; increase the loop variable. | 11-16; 20-22 |
| Shuffling an array | Loop over an array; create a random number in a range; swap two numbers. | 11-16 |

Figure 2.2: A program and its program plans
[Figure 2.3 diagram: an AND/OR graph with three layers. Task: Shuffling an array. Plans: Loop over an array; Create a random number in a range; Swap two numbers; Swap two sets of numbers. Components: swapInt(); Math.random(); getInt(int, int); swapRanges().]
Figure 2.3: Orthogonality between program plans and software components
The task Shuffling an array can be implemented in at least three ways: (1) with the nodes connected by solid lines (the concrete implementation shown in Figure 2.2, except that swapInt was implemented there with primitive statements); (2) with the nodes connected by thick dashed lines, i.e., with the program plans Create a random number in a range, Loop over an array, and Swap two sets of numbers, and the components getInt(int, int) and swapRanges(); or (3) with the same program plans as in (2) and the components connected by thin dashed lines, i.e., swapInt() and Math.random().
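As an illustration of this orthogonality, the sketch below realizes the same plan, Shuffling an array, with two different sets of components from the standard Java library. It is an added example rather than a rendering of Figure 2.3: the class PlanRealizations and its method names are hypothetical, whereas java.util.Random.nextInt(int), Collections.swap, and Collections.shuffle are existing library components.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class PlanRealizations {
    /** Plan realized with fine-grained components: a loop, Random.nextInt, and Collections.swap. */
    static void shuffleWithSwap(List<Integer> cards, Random rnd) {
        for (int i = 0; i < cards.size(); i++) {
            int r = i + rnd.nextInt(cards.size() - i); // random index in [i, size)
            Collections.swap(cards, i, r);
        }
    }

    /** Same plan realized with one coarse-grained component that hides the loop, the random number, and the swap. */
    static void shuffleWithLibrary(List<Integer> cards, Random rnd) {
        Collections.shuffle(cards, rnd);
    }

    /** Helper for the example: a deck of 52 cards numbered 0..51. */
    static List<Integer> newDeck() {
        List<Integer> deck = new ArrayList<>();
        for (int i = 0; i < 52; i++) deck.add(i);
        return deck;
    }
}
```

Conversely, Collections.swap is not tied to shuffling; the same component could serve a sorting or reversing plan, which is the other direction of the orthogonal relationship.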
2.3 Opportunistic Programming
Different strategies exist to develop a program. A top-down development strategy starts with decomposing the programming task into subtasks, choosing program plans to achieve those subtasks, and then fleshing out the program plans with reusable software components and program statements. A bottom-up development strategy starts with selecting reusable software components, and then combining them according to the structure of a program plan.
Empirical studies have revealed, however, that most programmers follow neither the top-down nor the bottom-up design strategy. In fact, their programming activities are very opportunistic: They are a mixture of top-down and bottom-up strategies, and which strategy is chosen depends on the knowledge of individual programmers and the particular situation (Curtis et al., 1988; Visser, 1990). Interim decisions made during the programming process “often can lead to subsequent decisions at arbitrary points in the [programming] space” (Hayes-Roth & Hayes-Roth, 1979).
The opportunistic nature of programming comes from differences in each programmer’s knowledge of program plans and software components. Simon (1996) has pointed out that cognitive activities are determined by the environment in which they take place. The environment includes information present in the workspace as well as information present in human memory. Information in the workspace, including partially constructed programs, talks back to the problem solvers (programmers) and serves as cues that activate relevant program plans and software components from memory (Schön, 1983). Programmers differ in their familiarity with program plans and software components, and this familiarity determines the strength of the links from cues to the activated knowledge in memory. It is therefore natural that each programmer pursues a different programming process and that the resulting solutions vary. Taking Figure 2.3 as an example, a programmer who is more familiar with the component swapRanges may choose the program plan Swap two sets of numbers, and the final implementation will be the one connected with thick dashed lines. Conversely, a programmer who is more familiar with the program plan Swap two numbers may proceed from that program plan and choose the component swapInt.
A lack of knowledge about the reusable software components needed to implement a program plan often prevents programmers from considering that plan. However, if information about relevant reusable components is present in the current workspace, it can expand programmers’ solution spaces, which are otherwise limited by their knowledge. Active component repository systems can complement programmers’ insufficient knowledge of reusable components by presenting them, in the workspace, with immediately accessible components relevant to the current programming task.
2.4 Benefits of Software Components in Programming
Reusable software components have both short-term and long-term benefits for the development of software systems. Short-term benefits are the immediate benefits that a programmer can attain during the implementation of a programming task. Long-term benefits may not be immediately enjoyed by the programmer who reuses the components, but they extend to the whole life cycle of the software system and to later programming activities of the programmer.
2.4.1 Short-Term Benefits
Reduced Development Time. By reusing existing software components, programmers write less code and thus spend less time programming. Furthermore, because reusable components have usually been carefully tested already, less time is needed for debugging and testing, which are the “hard and slow part” of programming (Brooks, 1995). Lim (1994) has reported that in a Hewlett-Packard division, a nearly linear relationship exists between the percentage of reused code in the product and the productivity of programmers, measured in LOC/pm (the number of lines of noncomment source code produced by a programmer in a month). With only 5% reused code, productivity is 550 LOC/pm; as the percentage of reused code increases to 81%, it reaches 2,850 LOC/pm. Similar reports can be found in Browne et al. (1990) and Hallsteinsen & Paci (1997).
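As a rough reading of the two data points reported by Lim, and assuming (an assumption made here for illustration, not a figure from the report) that productivity is exactly linear in the reuse percentage between them, the implied ratio and slope are:

```latex
\[
\frac{2850}{550} \approx 5.2,
\qquad
\frac{2850 - 550}{81 - 5} = \frac{2300}{76} \approx 30.3 \ \text{LOC/pm per percentage point of reused code}.
\]
```

Under that assumption, productivity at 81% reuse is about five times that at 5% reuse.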
Improved Quality. Because software components are often reused repeatedly, the defect fixes from each reuse accumulate, resulting in higher quality of the developed software systems. Raymond has vividly described this incremental bug-fixing process as “given enough eyeballs, all bugs are shallow” in his seminal essay explaining why Open Source software systems tend to have high quality (Raymond & Bob, 2001). Basili et al. (1996) have reported that the error density (errors per thousand lines of code) drops from 6.11 for systems developed without reuse to 0.12 for systems developed from reusable components. Similar formal evaluations of the contribution of reuse to the quality of software systems can be found in Lim (1994) and Thomas et al. (1997).
2.4.2 Long-Term Benefits
Easy Maintenance. Reusable components ease maintenance not only because they have fewer defects, but also because they facilitate communication among software developers by providing a common vocabulary, especially for the indirect communication between system builders and system maintainers. Because reused software components are high-level abstractions, system maintainers do not need to look into the details of the implementation to uncover the original intentions of the system builders.
Improved Evolvability. To cope with constantly changing requirements and implementation platforms, software systems must be able to evolve. Reusing software components improves the evolvability of software systems because a needed change can be confined to the components, instead of requiring that every occurrence scattered over the system be identified and changed. Graham (1995) has reported a typical example. Three project teams in a company had used the same formula in their software systems. Later, they discovered an error in the formula and needed to modify the systems. The team that had not created a component for the formula spent 5 weeks finding and correcting each occurrence of the formula. The other two teams, which had put the formula in a set of components, spent 1.5 days and 2 days, respectively, to correct their systems.
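The effect Graham describes can be sketched in a few lines. The example below is hypothetical: the formula (simple interest), the class InterestFormula, and its two callers are invented for illustration and are not taken from Graham's report. Because the formula lives in one component, a correction is made in one place and every caller picks it up.

```java
final class InterestFormula {
    /** The single shared definition of the formula: simple interest on a principal. */
    static double interest(double principal, double annualRate, int years) {
        return principal * annualRate * years;
    }
}

/** First caller: reuses the component instead of repeating the formula inline. */
final class LoanReport {
    static double totalDue(double principal, double rate, int years) {
        return principal + InterestFormula.interest(principal, rate, years);
    }
}

/** Second caller: a fix inside InterestFormula reaches this code without any edit here. */
final class SavingsForecast {
    static double balanceAfter(double deposit, double rate, int years) {
        return deposit + InterestFormula.interest(deposit, rate, years);
    }
}
```

Had each caller written `principal * annualRate * years` inline, correcting the formula would mean locating every copy, which is the situation of the team that needed 5 weeks.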
Increased Problem Framing Ability. The representation of a problem is an important determinant of the range of solutions that will be considered, as well as an important source of problem-solving difficulty (Hayes & Simon, 1977). Reusable software components provide programmers with higher-level concepts that are both close to application domains and easy to implement. Components increase programmers’ ability to frame the problem into representations that are easier to solve. A component creates an abstraction for an existing solution, and it reduces the number of items that a programmer has to hold in simultaneous contemplation because the programmer can refer to the whole solution with the abstraction, in place of the details of the solution. Figure 2.4 illustrates the contribution software components make to the problem-framing ability. Without the support of reusable components, programmers have to frame each concept in the problem domain based on their knowledge of the programming language; with the support of software components, however, the difficulty of problem framing is reduced because certain concepts can be directly mapped to the components.
[Figure 2.4 diagram, two panels. Top panel labels: Concepts in Problem Domain; Programming Languages; Computer; Programmer; Compiler Developer. Bottom panel labels: Concepts in Problem Domain; Components; Programming Languages; Computer; Programmer; Compiler Developer; Component Developer.]
Figure 2.4: The role of components in problem framing
In the top figure, programmers have to frame each concept in the problem domain based on their knowledge of the programming language. In the bottom figure, programmers can represent some domain concepts with components directly (such as those having the same fill pattern in concepts and components) without thinking of their implementation.
Chapter 3
Challenges of Software Reuse
This chapter consists of two parts. The first part overviews software reuse to provide a broad background for this research. It defines the concept and scope of software reuse; describes different kinds of reusable software artifacts to establish the link between component reuse and other reuse research efforts; and discusses managerial issues, legal issues, technical issues, and cognitive issues involved in instituting a reuse program within a software development organization. The second part of the chapter analyzes the difficulties of component-based reuse from the perspective of programmers who want to reuse.
3.1 Overview of Software Reuse
3.1.1 Software Reuse and Reusable Software Artifacts
A broad definition of software reuse is using existing software artifacts to construct a new software system. A software artifact can be defined as a piece of formalized knowledge that can contribute to the software development process (Dusink & Van Katwijk, 1995). There are two types of software artifacts: (1) software products that are created as “things” or deliverables during the development process, and (2) development knowledge that is applied to the process.
The most commonly reused software product is source code, which is the final and most important product of software development. In addition to code, any intermediate life cycle product can be reused, which means that software developers can pursue the reuse of requirement documents, system specifications, modular designs, test plans, test cases, and documentation in various stages of software development.
Reusable software development knowledge and experience exist at different abstraction levels: the architecture level, the modular design level, and the program (or code) level. Research on software architecture is currently aiming to define different software architecture styles for different families of software systems (Perry & Wolf, 1992). A software architecture style describes the formal arrangement of architectural elements, and can be reused by software developers to construct their new software systems once the style is well defined (Shaw & Garlan, 1996; Taylor et al., 1996). For example, the domain-independent multifaceted architecture is an architecture style for domain-oriented design environments, which has been reused in and refined through the development of many generations of design environments for different domains (Fischer, 1994).
Reusable knowledge on modular design can be codified in design patterns (Alexander et al., 1977) and frameworks. A design pattern is the description of a solution to a recurring problem. It specifies a problem to be solved, a solution that has stood the test of time, and the context in which the solution works (Gamma et al., 1994). Design patterns provide a common vocabulary for software developers to discuss their designs and can be passed from one developer to another for reuse. The concept of framework comes from object-oriented programming languages (Fischer et al., 1995). A framework describes the interaction pattern among a set of collaborative classes or objects, and can be represented as a set of abstract classes that interact with each other in a particular way (Johnson, 1997). Programmers can reuse frameworks directly in their development after providing implementations for those abstract classes. Framework reuse is a mixture of knowledge reuse and code reuse.
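A minimal sketch of framework reuse follows. It is a hypothetical example, reduced for brevity to a single abstract class where a real framework would have several: ReportFramework and ErrorReport are invented names. The framework fixes the interaction pattern in run(), and the programmer who reuses it supplies only the abstract operations.

```java
import java.util.List;

/** Framework part: fixes how the collaborating steps interact. */
abstract class ReportFramework {
    /** The interaction pattern is defined once and reused unchanged. */
    final String run(List<String> records) {
        StringBuilder out = new StringBuilder(header());
        for (String record : records) {
            if (accept(record)) out.append('\n').append(format(record));
        }
        return out.toString();
    }

    // Operations left abstract for the programmer who reuses the framework.
    abstract String header();
    abstract boolean accept(String record);
    abstract String format(String record);
}

/** Application part: the programmer supplies only the missing implementations. */
final class ErrorReport extends ReportFramework {
    @Override String header() { return "ERRORS"; }
    @Override boolean accept(String record) { return record.startsWith("E:"); }
    @Override String format(String record) { return "- " + record.substring(2).trim(); }
}
```

The control flow in run() is the reused design knowledge, and the compiled class itself is the reused code, which is why framework reuse mixes the two.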
Programming knowledge at the level of code is represented as program plans, which can also be reused by programmers if a suitable representation form is defined (Rich & Waters, 1988).
3.1.2 Two Approaches to Reuse
Another dimension to classify reuse research is the approach it takes: Reuse can be generation-based or composition-based.
The generation-based approach reuses the process of previous software development efforts, often embodied in computer tools that automate a part of the development life cycle (Henderson-Sellers & Edwards, 1990). This approach weaves domain knowledge and programming knowledge into a very high-level programming language (VHLL), which is then converted to executable systems by a VHLL compiler or an application generator. Because VHLLs have a higher abstraction level than most high-level languages (HLLs), such as Java and C, they are relatively closer to programmers’ informal requirements. They are meant for programmers to describe what the computer program does instead of how it is implemented. VHLL compilers directly convert VHLL programs into executable programs in HLLs. Lex and Yacc in Unix are two well-known examples; research prototypes include SETL (Dubinsky et al., 1989) and PAISLey (Zave & Schell, 1984). Unlike VHLL compilers, which convert VHLL programs to HLL programs in one step, application generators often use a series of transformation rules to transform VHLL programs into HLL programs. A transformation rule maps a program at one abstraction level to a semantically equivalent but more computationally efficient program (Feather, 1989). Transformation-based application generators allow programmers to control which transformation rule is applied when several applicable rules exist (Biggerstaff, 2000). Problems with the generation-based approach are the following:
(1) VHLLs are often defined for an extremely small application domain.
(2) Most VHLLs use mathematical abstractions, such as set theory or logic theory, that are actually more difficult to learn and to use than HLLs.
The VHLLs in the generation-based reuse approach overlap considerably with end-user programming languages (Repenning, 1993), which provide end users with a simple instruction set at the abstraction level of the problem domain so that they can adapt the behavior of application systems to their own needs or add new functionality (Girgensohn, 1992; Fischer & Eisenberg, 1994).
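To make the notion of a transformation rule concrete, the sketch below applies one rule to a tiny expression language. Everything in it is an assumption chosen for illustration: the classes Expr, Lit, Input, Sum, and ConstantFolding are hypothetical, and constant folding is a standard compiler rewrite used here as a stand-in, not a rule taken from any of the generators cited above. The rule replaces the sum of two constants with a single constant, which leaves the meaning of the program unchanged while removing an addition at run time.

```java
/** A tiny expression language: integer constants, one input variable, and addition. */
abstract class Expr {
    abstract int eval(int x);
}

final class Lit extends Expr {
    final int value;
    Lit(int value) { this.value = value; }
    @Override int eval(int x) { return value; }
}

final class Input extends Expr {
    @Override int eval(int x) { return x; }
}

final class Sum extends Expr {
    final Expr left, right;
    Sum(Expr left, Expr right) { this.left = left; this.right = right; }
    @Override int eval(int x) { return left.eval(x) + right.eval(x); }
}

final class ConstantFolding {
    /** Rule: Sum(Lit a, Lit b) => Lit(a + b), applied bottom-up over the tree. */
    static Expr apply(Expr e) {
        if (!(e instanceof Sum)) return e;
        Sum sum = (Sum) e;
        Expr l = apply(sum.left);
        Expr r = apply(sum.right);
        if (l instanceof Lit && r instanceof Lit) {
            return new Lit(((Lit) l).value + ((Lit) r).value);
        }
        return new Sum(l, r);
    }
}
```

A generator with several such rules would need a policy for choosing among them when more than one applies, which is the control that transformation-based generators hand to the programmer.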
The composition-based approach reuses existing software products in a new system to avoid repetitive work. As mentioned in the previous section, many types of software products can be reused. However, because this research focuses on the reuse of components, the discussion here is limited to component reuse, although many of the problems and solutions discussed can be extrapolated to the reuse of other software products. Component reuse is also known as component-based development. Based on the role that components play in the programming process, component reuse is further divided into three categories.
Black-Box Reuse. In black-box reuse, a component is directly reused without modification. A component can be reused as it is or reused through inheritance if the programmer creates a specialized subclass of an existing class component.
White-Box Reuse. In white-box reuse, programmers reuse the component after they have modified it to their needs. White-box reuse does not contribute as much to the easier maintenance and evolution of software systems as black-box reuse does, but it can reduce development time.
Glass-Box Reuse. In glass-box reuse, programmers do not directly reuse the component; instead, they use it as an example for their own development. For instance, programmers can look at examples to find out how a program plan is realized and build their own system through analogy. Glass-box reuse contributes indirectly to the quality and productivity of programming because examples can reduce the cognitive load of programmers (Neal, 1996).
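The two forms of black-box reuse can be sketched with components from the standard Java library. The classes BlackBoxAsIs and BoundedList are hypothetical names introduced for this example; Collections.max and ArrayList are existing library components whose source is left untouched in both cases.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class BlackBoxAsIs {
    /** The library component Collections.max is reused exactly as delivered. */
    static int highest(List<Integer> scores) {
        return Collections.max(scores);
    }
}

/** Black-box reuse through inheritance: ArrayList is specialized, not edited. */
final class BoundedList<E> extends ArrayList<E> {
    private final int capacity;

    BoundedList(int capacity) { this.capacity = capacity; }

    /** Only add(E) is guarded in this sketch; other insertion methods are inherited unchanged. */
    @Override
    public boolean add(E element) {
        if (size() >= capacity) return false; // reject instead of growing
        return super.add(element);
    }
}
```

White-box reuse would instead copy the source of ArrayList and edit it, and glass-box reuse would read that source only as a model for a newly written list class.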
3.2 General Issues of Component Reuse
Despite its many benefits, component reuse has not yet achieved wide success in practice because of the many difficulties associated with it. Component reuse introduces two different processes into the life cycle of software development:
(1) the process of developing for reusable components
(2) the process of developing with reusable components
The first process creates component repositories by identifying, developing, and indexing components. The second process, commonly known as the reuse process, is conducted by programmers who want to reuse. In the reuse process, programmers need to locate, comprehend, and integrate components (see Figure 1.1 in Chapter 1). For reuse to succeed widely, the managerial, legal, technical, and cognitive issues raised by both processes must be overcome.
3.2.1 Managerial Issues
Successful systematic reuse requires the support and commitment of the managers of a software development organization. Managers should foster a reuse culture in their organization by encouraging programmers to reuse. For example, managers must stop evaluating the performance of programmers based on the lines of code produced, which, unfortunately, still occurs in many software development organizations. This evaluation criterion obviously discourages reuse by programmers because programs developed with reusable components have fewer lines of code.
To encourage reuse, component repositories, either purchased from third parties or developed in-house, should be set up. To do so, managers must be willing to make long-term investments. This also requires good metric models to analyze the economics of reuse and to identify the most effective reuse strategies. Several reuse metric models have been proposed and are in use in some companies. However, these models still lack formal validation (Frakes & Terry, 1996).
3.2.2 Legal Issues
Protecting the legal rights of creators and consumers of software components is another difficult aspect of instituting reuse. Currently, software is protected by copyrights that are designed to protect products in the “world of atoms”. In the world of atoms, after a product is passed from its owner to its customer, the owner no longer owns it, and the customer possesses full ownership. Software components are made of bits. In the “world of bits”, ownership does not change hands in the same way as in the world of atoms (Cox, 1996). It is extremely easy to reproduce a software component without any loss of quality, and even after the software component is passed from its owner to its customer, the owner still has exactly the same software component as the customer does. This makes it difficult to estimate the cost of reusable components and to define a suitable standard mechanism for charging fees. For example, when a customer uses a purchased component in his or her application and sells many copies of the application to many end users, how much should the customer who purchased the component pay to the original creator? Should the end users of the developed application pay the original creator as well? And if that is the case, how should the law ensure that such payments are made? The lack of a standard mechanism for charging for the use or reuse of components inhibits the emergence of a robust market for components, which is essential to the widespread reuse of components.
3.2.3 Technical Issues
The implementation of systematic reuse has to overcome many technical difficulties. These difficulties exist in both phases of reuse: the initial setup of reusable component repositories and the actual reuse of components during programming.
A great investment, both intellectual and financial, is required to develop and maintain reusable component repositories. First, it is difficult to identify what kind of components should be included in component repositories. Second, the development of reusable components is more difficult than usual development because reusable components should be more general and require higher quality and better documentation. Developing a reusable component costs an estimated two to three times as much as developing an ordinary component (Jones, 1984; Lim, 1994). Reusable component developers need to balance the inherent dilemma between the component size and its reuse potential: A larger component has more reuse value than a smaller one, but its reusability decreases because larger components are often more specific and more difficult to understand.
Another dilemma in setting up a component repository is the relationship between the number of components in a repository and the ease of finding the needed components. For a component repository to really pay off, it needs a critical mass of available components; however, as the number of components increases, it becomes more difficult for programmers to find the components they need. Only repositories with a large number of components can reap the real benefits of reuse, so an effective searching mechanism must be provided for programmers to find the needed component.
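As a point of reference for what a searching mechanism minimally involves, the sketch below implements a plain keyword lookup over component descriptions. It is a deliberately naive, hypothetical example (the class ComponentRepository and its methods are invented here) and is not the retrieval mechanism of any system discussed in this thesis: a component is returned only when its description contains every word of the query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ComponentRepository {
    // Component name -> lower-cased description, kept in registration order.
    private final Map<String, String> descriptions = new LinkedHashMap<>();

    void register(String componentName, String description) {
        descriptions.put(componentName, description.toLowerCase(Locale.ROOT));
    }

    /** Returns the names of components whose description contains every query word. */
    List<String> search(String query) {
        String[] words = query.toLowerCase(Locale.ROOT).trim().split("\\s+");
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : descriptions.entrySet()) {
            boolean matchesAll = true;
            for (String word : words) {
                if (!entry.getValue().contains(word)) { matchesAll = false; break; }
            }
            if (matchesAll) hits.add(entry.getKey());
        }
        return hits;
    }
}
```

Such a lookup succeeds only when the programmer's wording matches the wording of the description, and it scans every entry on every query, so both its precision and its cost degrade as the repository grows toward the critical mass that makes reuse worthwhile.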
3.2.4 Cognitive Issues
Even if good reusable component repositories do exist and a favorable reuse culture is fostered in an organization, reuse still fails if programmers do not put it into practice. After all, reuse must be carried out by programmers. This creates another dilemma for reuse: If programmers do not reuse, the huge investment of building reuse repositories cannot be justified; and if the investment in reuse repositories cannot be justified, companies are less willing to create them, and then programmers have nothing to reuse.
Programmers’ resistance to reuse is often falsely dismissed as a mere attitude problem caused by the so-called not-invented-here (NIH) syndrome. However, recent studies have revealed that most programmers do not have the NIH syndrome (Fafchamps, 1994; Frakes & Fox, 1995). On the contrary, programmers are very motivated to reuse if they know of the component or know how to locate the component (Lange & Moher, 1989; Isoda, 1995). What prevents programmers from reusing is the fundamentally limited capability inherent in human cognition (Curtis, 1989; Fischer et al., 1991): the limitation of short-term memory, the scarcity of human attention, the mental inertia of coping with changes, the subjectiveness of evaluation, and the ambiguity of natural language. Section 3.4 provides a more detailed analysis of the cognitive challenges programmers have to overcome in order to reuse, and of the technology needed to address such challenges.
3.3 Creating Reusable Components
Although this research is not concerned in particular with the creation of reusable components (see footnote 1), it is worthwhile to point out the possible sources of reusable components, because “anybody who sells a technology for reuse without providing a library of components is a snake oil salesman, a fraud, a charlatan” (Zand et al., 1997).
As mentioned before, creating reusable components is difficult, time-consuming, and expensive, and repositories of good-quality components are rare. Nevertheless, steady progress has been made recently in several directions.
3.3.1 Domain Analysis and Product-Line Analysis
Domain analysis is the identification, analysis, and specification of common requirements from a specific application domain for reuse on multiple projects within that application domain. Domain analysis produces a domain model, which is used as a starting point to construct specifications and designs for many different systems within the application domain (Kang, 1998). The domain analysis can be either synthetic or evidentiary (Fischer et al., 1995).
The synthetic domain analysis approach resembles the process of developing a single application, but it is more broadly conceived. It starts with an informal description of the application domain, identifies the common features, and develops reusable components corresponding to each feature in the domain.
Footnote 1: Components in the repository of the CodeBroker system come from existing libraries. For more details, see Section 6.2.
The evidentiary domain analysis approach starts with existing systems in an application domain, using reverse engineering or design recovery (Ye, 1996) to identify and repackage common components for later reuse.
Product-line analysis is a more comprehensive approach than domain analysis. A product line is a set of products, already existing or planned to be developed, that share a common set of requirements but also exhibit significant variability in requirements (Griss, 2000). Product-line analysis differs from domain analysis in that it not only extracts the commonality of the family of systems but also provides a systematic way to treat their significant variability. In addition to common reusable components, product-line analysis often creates a product-line architecture for the family of related systems where reusable components can be plugged in (Batory et al., 2000).
3.3.2 Commercial Off-the-Shelf
Thousands of companies worldwide are developing their own information systems. There are three problems in this regard: (1) because most companies do not have enough expertise in software development, they cannot produce information systems of the highest quality; (2) because these systems are often developed internally and do not follow interoperation standards, it is very difficult to integrate them; and (3) similar functionality has been repeatedly developed.
Although it may take decades for COTS (Commercial Off-the-Shelf) software to dominate software development, the COTS market is rapidly taking shape (Morisio et al., 2000). COTS comes in a variety of types and levels of software, e.g., components that provide specific functionality (such as subroutines, classes, frameworks, and even complete applications) or tools used to generate code (such as domain-oriented language processors and application generators). Many companies are providing reusable off-the-shelf components for specific domains, and those components can be purchased by developers on the market. As this trend continues, programmers may be able to create their own systems in the future by integrating components from different component vendors. For example, programmers or even end users may create their own word processing applications by integrating components for outline mode, spell-checking, grammar correction, and diagram drawing purchased on the market in the same way as they purchase standard applications now.
3.3.3 Open-Source Components
With the advent of the Open Source movement (DiBona et al., 1999), many developer communities, such as the Gamelan website² and the Giant Java Tree,³ have formed, through which programmers can freely exchange the products they develop. Moreover, some high-quality reusable component repositories, such as the Jun library (Aoki et al., 2001), have become open source too. Traditionally, reusable components are created and maintained by the creators who develop them. Programmers who reuse those components are consumers and do not directly contribute to the creation and evolution of components. By giving programmers full access to the source code, Open Source breaks down the binary choice of creators and consumers (Fischer, 1998a) so that consumers can directly participate in the maintenance and improvement of reusable components or even derive new components from existing ones.
This encourages the natural emergence of reusable components with good quality, following the seeding, evolutionary growth, and reseeding (SER) model (Fischer, 1998b). The initial creator develops the component (seeding), and the component experiences evolutionary growth when it is reused and modified by many other programmers (or consumers). As those modifications are incorporated back into the original component (reseeding), the quality and reusability of the component will improve. The Open Source development model is particularly promising in pushing reuse to a large scale because programmers working on complementary projects can each leverage the results of the other freely.
² http://www.gamelan.com
³ http://www.gjt.com
3.4 Understanding the Cognitive Difficulties of Component Reuse
The exciting advances in the creation of reusable components cannot make reuse a reality if programmers are not able to reuse those components during their own system development.
To create appropriate tools to assist programmers in reusing, we need to identify the tasks faced by programmers when they try to reuse and the cognitive skills required to perform those tasks.
3.4.1 Cognitive Engineering
Component reuse is a cognitive activity, which is a goal-directed problem-solving effort.
To understand the complexity of cognitive activities, Norman (1986) has developed a method called cognitive engineering that applies what is known from cognitive science to the design and construction of tools that assist the cognitive activities of human beings. Such cognitive tools, including reusable component repository systems, must address the discrepancy between a user’s goal, expressed in terms relevant to the user and his or her task, and the tool’s mechanism, expressed in terms relative to the tool. This discrepancy creates two gulfs: the gulf of execution and the gulf of evaluation. The gulf of execution is the gap from goals to tools, and it must be bridged by three consecutive efforts:
Intention Formation. Users decide to do something with an internal specification of the task created from their goal.
Action Specification. Users externalize the internal specification into a sequence of specified actions.
Action Execution. The actions are executed with the tool.
The gulf of evaluation is the gap from the tool output to the intended goal, and it must be bridged by another three consecutive efforts:
System Perception. Users perceive the output of the tool.
Interpretation. Users interpret the perceived output.
Evaluation. Users compare the interpretation with the original goal.
3.4.2 A Cognitive Model of Component Reuse
Like other cognitive activities, component reuse starts with a goal in the mind of a programmer. To achieve such goals, programmers have to translate their internal intentions into a series of physical actions constrained by a component repository system. Figure 3.1 illustrates the actions needed for the reuse process to succeed based on the method of cognitive engineering (Ye et al., 2000):
Forming Reuse Intentions. As the first step to start a reuse process, programmers must consciously decide to use the repository with requirements for reusable components in mind. The requirements for reusable components arise from the goal of the programming tasks.
Formulating Reuse Queries. Programmers formulate their intentions as a reuse query in the terms provided by the component repository system. This is the step of action specification.
Retrieving Components. Reusable components matching the queries are retrieved from the repository system. This is the step of action execution.
Choosing Components. When the repository system returns matching components, programmers must choose the appropriate one by comparing them with the reuse intentions. This action of choosing corresponds to the gulf of evaluation for the reuse activity: Programmers read each retrieved component, interpret its meaning, and evaluate whether the component can be reused in their tasks. The evaluation may result in the reformulation of the original query if no suitable components are found because the reuse query has not been appropriate.
Integrating Components. Chosen components are integrated into the current program. If there is not a complete fit, programmers need to modify the component or write a wrapper to adapt it. Integration is also a part of evaluation because the component is finally reused only if it can be integrated into the current programming task.
Figure 3.1: A cognitive model of the component reuse process
[Diagram labels: Development Environment; Programmer; Forming Reuse Intentions; Formulating Reuse Queries; Retrieving; Choosing Retrieved Components; Retrieval by Reformulation; Integrating Chosen Components; Component Repository; System Model of Repository; Existence of Components; Action; Information.]
The cognitive model in Figure 3.1 is a refinement of the location-comprehension-modification cycle (LCM-cycle) of a reuse process in Figure 1.1 (Fischer et al., 1991), with special emphasis on the location process, which is the focus of this research. Although the LCM-cycle acknowledges and stresses the difficulty of formulating appropriate reuse queries in the location process, it does not point out that formulating reuse queries must be preceded by the forming of reuse intentions. Furthermore, the comprehension step in the LCM-cycle does not differentiate two different levels of comprehension: comprehension for choosing components and comprehension for integrating components. Comprehension for choosing is still a part of the location effort because it may result in the reformulation of queries.
3.4.3 Cognitive Challenges of Component Reuse
Each action in this cognitive model of component reuse poses challenges to programmers and, without appropriate tool support, may prevent reuse from succeeding.
3.4.3.1 Vocabulary Learning: A Prerequisite for Forming Reuse Intentions
Reusable components help programmers think at higher levels of abstraction and increase the “vocabulary” programmers can use to create and interpret program designs (Krueger, 1992). However, programmers must learn the syntax and the semantics of the new vocabulary to take advantage of reusable components. At the least, programmers should know of the existence of the components; otherwise, they are not able to form reuse intentions in the first place, and reuse fails at this very first step. Vocabulary learning is a major part of the cognitive barrier to reuse (Brooks, 1995; Fichman & Kemerer, 1997).
Unlike the syntax of programming languages, which can be learned through schooling or tutoring before programmers start working, the mastery of reusable components cannot be completed in classrooms or merely by reading books (Ye, 1998). Due to the large volume of reusable components and their constantly evolving nature, total coverage is impossible and obsolescence is unavoidable. Moreover, learning components is less effective when the components are separated from their use context; components are better learned when they are needed for a programming task. Therefore, component learning needs to be integrated with working where the components are reused, and programmers should learn components on demand, that is, learn the component when it is needed (Fischer, 1991). However, a pitfall for the learning-on-demand model is that programmers need to learn a component because they do not know it, but because they do not know it, they may settle on a suboptimal solution by creating their own program instead of reusing the existing component. To support learning-on-demand, component repository systems should be able to identify learning (and reusing) opportunities by connecting programmers to the components that can be reused in their current task.
3.4.3.2 Conceptual Gap in Formulating Reuse Queries
Formulating reuse queries refers to transforming internal reuse intentions into explicit, external reuse queries. Reuse intentions, derived from development activities, are conceptualized in the situation model (Kintsch, 1998) that is related to the application task to be solved and to the concerns of the programmer. A situation model is the mental model programmers have of their environment. A system model is the “actual” model of a computer system. For a component to be retrieved, these intentions need to be mapped from the user’s situation model onto the system model, namely the repository system (Fischer et al., 1991). Without enough knowledge about the system model of component repository systems, programmers cannot formulate reuse queries appropriately. This conceptual gap between situation model and system model is another cognitive barrier to reuse.
There are two types of conceptual gap between situation model and system model: vocabulary mismatch and abstraction mismatch. The vocabulary mismatch refers to the inherent ambiguity in most natural languages. Thanks to the richness of natural languages, people use a variety of words to refer to the same concept. Based on their systematic study of word use of ordinary people in different domains, Furnas et al. have found that the probability that two persons choose the same word to describe a concept is less than 20% (Furnas et al., 1987). Even well-trained indexing experts have a 20% disparity on average in choosing terms to describe the same document (Harman, 1995).
The abstraction mismatch refers to the difference of abstraction levels in requirements and component descriptions. Programmers deal with concrete problems and thus tend to describe their requirements concretely, whereas reusable components are often described in abstract concepts because they are designed to be generic so they can be reused in many different situations. For example, in one experiment to evaluate the CodeBroker system, one subject described his task as follows:⁴
/** This class contains methods for converting between western-style numbers (three numbers in a set with a comma) and Chinese style numbers (four numbers followed by a comma). For example, 1,000,000--> 100,0000. */
Another subject initially described the same task in a similar way:⁵
/** Takes a string with a Chinese formatted number and outputs a western formatted number. */
This task can be easily implemented by setting the group size to 4 with the method setGroupingSize of the class java.text.DecimalFormat. However, the description of this method (as follows) is abstract: It describes grouping of numbers without mentioning western style or Chinese style in particular.
public void setGroupingSize(int newValue)
Set the grouping size. Grouping size is the number of digits between grouping separators in the integer portion of a number. For example, in the number "123,456.78", the grouping size is 3.
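The use of setGroupingSize described above can be sketched as follows. This is a minimal illustration, not the code written by the experiment's subjects: the class name GroupingSizeDemo and its helper methods are hypothetical, only integers are handled, and US-style symbols are assumed so that the grouping separator is a comma.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo class; only java.text.DecimalFormat comes from the text.
class GroupingSizeDemo {

    // Build an integer formatter that groups digits in blocks of the given size,
    // using US symbols so that the grouping separator is a comma (an assumption).
    static DecimalFormat formatter(int groupingSize) {
        DecimalFormat format =
                new DecimalFormat("#,###", DecimalFormatSymbols.getInstance(Locale.US));
        format.setGroupingSize(groupingSize);
        return format;
    }

    // Parse with one grouping size and re-format with another.
    static String regroup(String text, int fromSize, int toSize) {
        try {
            Number value = formatter(fromSize).parse(text);
            return formatter(toSize).format(value.longValue());
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a grouped integer: " + text, e);
        }
    }

    // Chinese style (groups of four) to western style (groups of three).
    static String chineseToWestern(String chinese) {
        return regroup(chinese, 4, 3);
    }

    // Western style (groups of three) to Chinese style (groups of four).
    static String westernToChinese(String western) {
        return regroup(western, 3, 4);
    }

    public static void main(String[] args) {
        System.out.println(westernToChinese("1,000,000")); // prints 100,0000
        System.out.println(chineseToWestern("100,0000"));  // prints 1,000,000
    }
}
```

The sketch shows why the abstraction mismatch matters: the whole task reduces to one call with the argument 4, yet nothing in the method's description mentions Chinese or western number styles.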
3.4.3.3 Effective Retrieval Mechanisms in Retrieving
The retrieval process finds the components that match given reuse queries. An effective retrieval mechanism, including a representation schema for indexing and a matching criterion between a query and a component, is essential.
⁴ The task asks the programmer to implement a program that converts a number written in Chinese format to an equal number in western format. The traditional way of writing big numbers in Chinese is to group digits in fours and add a comma before each fourth digit from the right, because the Chinese units are ten thousand (wan), a hundred million (yi), and a thousand billion (zhao), instead of thousand, million, and billion.
⁵ He later realized the description was not good enough because he had not found what he wanted, and modified it to “/** Takes a string with a Chinese formatted number (numbers grouped into 4 columns separated by commas) and outputs a western formatted number (3 columns separated by commas). */”, which made the system deliver the component setGroupingSize.
Many retrieval mechanisms have been proposed in the past (for more detailed descriptions, see related work in Section 9.2). There are three major approaches: text-based, descriptor-based, and formal specification-based. In text-based approaches, components are represented by their textual documents, and information retrieval technology is used to match components to queries (Maarek et al., 1991). In descriptor-based approaches, components are represented by a set of selected descriptors. The semantic relationships among those descriptors are captured in a predetermined structure that can be specified by a semantic network (Henninger, 1997), an AI frame (Ostertag et al., 1992), a taxonomic category system (Devanbu et al., 1991), or fuzzy set theory (Damiani et al., 1997). In specification-based approaches, components are represented with formal specification languages, and automatic theorem-proving systems (Zaremski & Wing, 1997) or specification refinement systems (Mili et al., 1997a) are used to determine whether a component matches a query, which is also written in a formal specification language.
In terms of complexity of representation schemata, the text-based approach is the simplest and the specification-based approach the most complicated. In general, a complete and precise representation can make the matching more precise and retrieval more effective. However, because the same representation is also used by programmers to specify their reuse queries, the schema of representation is greatly limited by programmers’ willingness to formulate long and precise queries. There is no point in representing every bit of relevant information about a component if a programmer barely has the patience for typing string search regular expressions (Mili et al., 1995).
3.4.3.4 Retrieval by Reformulation in Retrieving and Choosing
Because effective use of any information retrieval system requires that users be fairly familiar with the structure of the information system and its representation schemata, it is difficult for most users to create a well-defined query on their first attempt (Jones, 1997). Component repository systems can, at best, retrieve those components that match the queries submitted by programmers, but these components do not necessarily match the programmers’ intentions, many of which are not articulated.
Retrieval by reformulation is a mechanism that allows users to incrementally improve their queries to match their intentions after they have interpreted and evaluated the retrieved results and have explored the underlying structure of the information system (Williams et al., 1982; Fischer & Nieper-Lemke, 1989). If programmers cannot find the needed component from the first retrieval result, they can reformulate their query by using more appropriate terms that they learn from retrieved components, or they can narrow the search range by taking advantage of the structure of the repository, which they may not have known before exploring the retrieved results.
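The effect of reformulating a query with terms closer to the repository's own vocabulary can be illustrated with a toy text-based matcher. This is only a sketch under stated assumptions: it is not the matching mechanism of CodeBroker or of any cited system. It scores a query by counting shared terms, uses a crude five-character truncation as a stand-in for stemming, and the class name ReformulationSketch, the stop-word list, and the abbreviated component descriptions are all hypothetical choices. The two queries are the subject's original and reformulated descriptions quoted earlier in this section.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy matcher: term overlap between a query and a description.
class ReformulationSketch {

    // Small, hand-picked stop-word list (an assumption for this sketch).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "of", "in", "is", "with", "and", "by", "into", "to", "this"));

    // Abbreviated component descriptions standing in for repository documents.
    static final String SET_GROUPING_SIZE =
            "Set the grouping size. Grouping size is the number of digits between "
            + "grouping separators in the integer portion of a number.";
    static final String TO_UPPER_CASE =
            "Converts all of the characters in this String to upper case.";

    static final String FIRST_QUERY =
            "Takes a string with a Chinese formatted number and outputs a western "
            + "formatted number.";
    static final String REFORMULATED_QUERY =
            "Takes a string with a Chinese formatted number (numbers grouped into 4 "
            + "columns separated by commas) and outputs a western formatted number "
            + "(3 columns separated by commas).";

    // Lower-case the text, split on non-alphanumerics, drop stop words, and
    // truncate each word to five characters as a very crude substitute for stemming.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            result.add(token.length() > 5 ? token.substring(0, 5) : token);
        }
        return result;
    }

    // Score = number of distinct truncated terms shared by query and description.
    static int score(String query, String description) {
        Set<String> shared = terms(query);
        shared.retainAll(terms(description));
        return shared.size();
    }

    public static void main(String[] args) {
        for (String query : new String[] {FIRST_QUERY, REFORMULATED_QUERY}) {
            System.out.println("setGroupingSize: " + score(query, SET_GROUPING_SIZE)
                    + ", toUpperCase: " + score(query, TO_UPPER_CASE));
        }
    }
}
```

In this toy setting the first query shares a single term with each description, so the two components cannot be told apart, whereas the reformulated query shares three terms with the setGroupingSize description and still only one with the other. Real retrieval mechanisms are considerably more sophisticated, but the direction of the effect is the same.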
3.4.3.5 Component Comprehension in Choosing and Integrating
Being able to comprehend components is necessary both for choosing the right component and for integrating the chosen component.
Comprehension for choosing is focused on what the component does, and it is conducted in two stages: information discernment and detailed evaluation (Carey & Rusli, 1995). At the stage of information discernment, programmers quickly scan the component and its description to decide whether the component is related to their current task; to save time, they avoid any deep understanding at this point (Lange & Moher, 1989). The process of information discernment may result in the reformulation of queries if programmers find the retrieval results unsatisfactory. Only when a promising component is found do programmers start to evaluate it extensively.
To integrate a component into their programs, programmers need to understand the component’s functionality, its usage, and even its implementation details, especially in cases of white-box reuse and glass-box reuse (Section 3.1.2). Executable examples that use the component prove to be very useful to help programmers quickly understand how to reuse the component in their own programming task (Redmiles, 1992; Aoki et al., 2001).
Chapter 4
The Component Locating Problem
Before programmers can take advantage of reuse, they must be able to locate reusable components quickly and easily. A component repository system is an information system that helps programmers locate reusable components. It consists of three parts: a collection of reusable components, a retrieval mechanism, and a retrieval interface. Research on component repository systems has focused mainly on the effectiveness of retrieval mechanisms. However, even the most sophisticated and powerful component repository systems will not be effective if programmers make no attempt to reuse. Studies on reuse have shown that no attempt to reuse is the most significant barrier to reuse (Figure 1.2) (Frakes & Fox, 1996). This chapter analyzes the phenomenon of “no attempt to reuse” and points out that it is caused by the existence of information islands and perceived low reuse utility. As a solution, the concept of the active component repository system is introduced and its benefits are analyzed.
4.1 No Attempt to Reuse
4.1.1 Three Reuse Modes
As a part of the knowledge-intensive programming process, reuse is a process of applying the knowledge of reusable components to programs. Because few programmers know all about reusable components, component repository systems are introduced to facilitate the easy application of reusable components during programming. Based on the source of the knowledge of reusable components, three modes of reuse exist: reuse-by-memory, reuse-by-recall, and reuse-by-anticipation.
Reuse-by-Memory. In the reuse-by-memory mode, while designing a new program, programmers notice similarities between the new program and reusable components that they have learned in the past and know very well. Therefore, they can reuse these known components easily during the programming, even without the support of a component repository system, because their memory assumes the role of the repository system.
Reuse-by-Recall. In the reuse-by-recall mode, while developing a new program, programmers vaguely recall that the repository contains some reusable components with similar functionality, but they do not remember exactly which components they are. They need to search the repository to find what they need. In this mode, programmers are often determined to find the needed components. An effective retrieval mechanism is the main concern for component repository systems supporting this mode. The successful operation of reuse in this mode needs both knowledge from programmers and knowledge from the repository.
Reuse-by-Anticipation. In the reuse-by-anticipation mode, programmers formulate reuse intentions based on their anticipation of the existence of certain reusable components. Even though they are not certain that relevant components exist, their knowledge of the domain, the programming environment, and the repository is enough to motivate them to search in hopes of finding relevant components. In this mode, if programmers cannot find what they want from the repository quickly enough, they will soon give up reuse (Mili et al., 1995). The repository is the main source of knowledge for the successful operation of reuse in this mode.
Programmers have little resistance to the first two modes of reuse. As has been reported by Isoda, programmers reuse those components repeatedly once they have learned them (Isoda, 1995). Lange and Moher, in their empirical study on programming and reuse strategies, have found that programmers search extensively for the components they know exist even if they may not be able to name them a priori (Lange & Moher, 1989). This explains why individual ad hoc reuse has been taking place while organization-wide systematic reuse has not achieved the same success: programmers have individual reuse repositories in their memories so they can reuse by memory or reuse by recall (Mili et al., 1995). For those components that have not yet been internalized into their memories, programmers have to resort to the mode of reuse-by-anticipation. The activation of the reuse-by-anticipation mode relies on two enabling factors:
Programmers anticipate the existence of reusable components.
Programmers perceive that the cost of the reuse process is lower than that of programming from scratch.
Figure 4.1: Different levels of programmers’ knowledge about a component repository
[Diagram labels: L1 (Well Known); L2 (Vaguely Known); L3 (Belief); L4; Unknown Components.]
4.1.2 Information Islands in Component Repositories
Unfortunately, programmers’ anticipation of available reusable components does not always match real repository systems. Empirical studies on the use of high-functionality computer systems (component repository systems being typical examples of them) have found there are four levels of users’ knowledge about a computing system (Figure 4.1) (Fischer, 2001).
In Figure 4.1, ovals represent the collection of components that are in a particular knowledge level of programmers, and the rectangle represents the actual information space (namely, the whole collection of items in an information system), labeled L4. L1 includes those reusable components that are well known, easily employed, and regularly reused by a programmer. L1 corresponds to the reuse-by-memory mode. L2 contains components known vaguely and reused only occasionally by a programmer; they often require further confirmation before being reused. L2 corresponds to the reuse-by-recall mode. L3 represents what programmers believe about the repository. L3 corresponds to the reuse-by-anticipation mode.
Many components exist in the area of (L4 - L3), and their existence is not known to programmers. Consequently, there is no possibility for programmers to reuse them simply because people do not ask for what they do not know (Fischer & Reeves, 1995). Components in (L4 - L3) thus become information islands (Engelbart, 1990; Ye & Fischer, 2000), inaccessible to programmers without appropriate tools. Repositories are not static: they are expected to evolve over time, and this will increase the size of (L4 - L3).
Many reports about reuse experiences of industrial software companies illustrate this inhibiting factor of reuse. Devanbu et al. have reported that because developers are unaware of reusable components, they repeatedly re-implement the same function; in one case, this occurred ten times (Devanbu et al., 1991). This kind of behavior is also observed as typical among the four companies investigated by Fichman and Kemerer (Fichman & Kemerer, 1997). From the experience of promoting reuse, Rosenbaum and DuCastel have concluded that making components known to developers is a key factor for successful reuse (Rosenbaum & DuCastel, 1995).
4.1.3 Low Reuse Utility
Human beings often try to be utility-maximizers in the decision-making process (Reisberg, 1997), and programmers are no exception. When programmers perceive that reuse utility, which is the ratio of reuse value to reuse cost, is too low, they do not make an attempt to reuse (Sen, 1997). Because there is no easy way for programmers to estimate reuse value and reuse cost objectively, the estimation made by programmers during programming is quite subjective and suffers from cognitive biases against reuse; they tend to underestimate reuse value and overestimate reuse cost.
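The ratio described above can be written compactly. The symbols are introduced here only for illustration and are not part of the original analysis: V and C denote the actual reuse value and reuse cost (both assumed positive), U the resulting utility, and hats mark the programmer's subjective estimates.

```latex
U = \frac{V}{C}, \qquad
\hat{U} = \frac{\hat{V}}{\hat{C}}, \qquad
\hat{V} < V \ \text{and}\ \hat{C} > C
\;\Longrightarrow\;
\hat{U} = \frac{\hat{V}}{\hat{C}} < \frac{V}{\hat{C}} < \frac{V}{C} = U .
```

Under this reading, the two biases compound: an underestimated value and an overestimated cost each lower the perceived utility, so a component worth reusing may still appear not worth the attempt.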
4.1.3.1 Underestimated Reuse Value
The value of reuse is multifold. As stated in Section 2.4, reuse value includes:
(1) reduced development time
(2) improved quality
(3) easy maintenance
(4) improved evolvability
(5) increased problem-framing ability
However, not all programmers recognize reuse value when they are under a tight schedule to finish their current program. Most of the reuse value is long-term and shows its benefit only after the program has been developed; for programmers, what interests them most are the short-term benefits. In his investigation on reuse in NTT (Nippon Telegraph and Telephone Corporation), Isoda concludes that unless programmers find the immediate benefits of applying reusable components, they will not, of their own free will, perform reuse (Isoda, 1995). It is human nature to pay attention to the immediate benefits only and ignore long-term benefits (Grudin, 1994) because human beings are unable to think coherently about the remote future and particularly about the distant consequences of their actions (Simon, 1996). To encourage programmers to recognize the full benefits of reuse, many researchers have called for reuse education. Despite its importance, reuse education alone has not brought reuse to fruition (Joos, 1994) because being told that “it is for your own good” seldom provides adequate motivation for programmers to change their behavior (Simon, 1996). Some organizations have also tried to provide monetary rewards to programmers who reuse, which has not been successful either (Frakes & Fox, 1995).
4.1.3.2 Overestimated Reuse Cost
As analyzed in Section 3.4.3, the cost of reuse incurred at reuse time includes:
(1) the cost of forming reuse intentions
(2) the cost of formulating reuse queries
(3) the cost of operating the repository system to retrieve components
(4) the cost of choosing components
(5) the cost of understanding and modifying components
(6) the cost of integrating components
In addition, when reuse repository systems are separated from the current programming environment, reuse cost includes the cost associated with switching back and forth between the programming environment and the reuse repository system, which causes the loss of working memory and the disruption of workflow.
Depending on the reuse mode, only some of these costs may be involved. In the reuse-by-memory mode, the cost of reuse is reduced to cost (6) only. In the reuse-by-recall mode, costs (1), (2), and (4) are quite small because programmers know what to look for and where to find the components. In the reuse-by-anticipation mode, all of these costs are involved, and because of the following two cognitive biases against reuse, Einstellung and loss aversion, those costs are often overestimated.
Einstellung. Human beings often display Einstellung in problem solving. Einstellung, the German word for “attitude,” refers to the mechanization of problem-solving strategy. Once problem solvers discover a strategy that “gets the job done,” they are less likely to discover new strategies until they are completely stuck (Reisberg, 1997). Due to Einstellung, human beings often stick with what they know best. As the term production paradox (Carroll & Rosson, 1987) suggests, even though there is an effective strategy of solving a problem, most people are not motivated to learn this new strategy and will “play it safe” by using a suboptimal solution that they personally consider to be safe. Even today, for most programmers, building programs from scratch is still the proven strategy. This partially explains the observed phenomenon of “programmer machoism”: programmers have a tendency to chronically underestimate how difficult a programming task is and overestimate the cost of reuse (Graham, 1995).
Loss Aversion. Another known phenomenon in the decision-making process of human beings is loss aversion, the tendency to be far more sensitive to potential loss than to potential gain (Reisberg, 1997). Starting a reuse process requires a mental switch. The demand on working memory and time is immediate, and the potential gain is unclear because programmers are not sure whether the needed component exists, whether they are able to find it even if it does exist, and whether they are able to understand and modify it even if they find it.
4.2 Paradigm Shift: From Development-with-Reuse to Reuse-within-Development
4.2.1 Development-with-Reuse
Designers of current component repository systems are not particularly concerned with the problem that programmers make no attempt to reuse because these systems are designed to support the development-with-reuse paradigm (Rada, 1995). The development-with-reuse paradigm views reuse as a stand-alone process, independent of the current programming process and environment. Consequently, component repository systems are studied as self-contained systems, with no consideration of the context from which the needs for reusable components are derived and in which the components are reused. Their major focus has been on retrieval mechanisms only, with the assumption that programmers have no difficulty in forming reuse intentions and formulating reuse queries. Such systems require programmers to initiate the reuse process by switching from their current development environments to component repository