
Supporting Component-Based Software Development with Active Component Repository Systems

by

Yunwen Ye

B.Sc., Fudan University, China, 1987

M.S., Fudan University, China, 1990

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Doctor of Philosophy

Department of Computer Science

2001

This thesis entitled: Supporting Component-Based Software Development with Active Component Repository Systems written by Yunwen Ye has been approved for the Department of Computer Science

Gerhard Fischer

James Martin

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Ye, Yunwen (Ph.D., Computer Science)

Supporting Component-Based Software Development with Active Component Repository Systems

Thesis directed by Prof. Gerhard Fischer

It is widely believed and empirically proven that reuse improves both the quality and productivity of software development. Before software components are reused, however, they must be located. Component repository systems provide a means to locate software components. Current component repository systems are designed to support the paradigm of development-with-reuse, which views reuse as a process independent of the whole software development process and relies on programmers to take the reuse initiative. Such systems fall short in supporting programmers who make no attempt to reuse, either because they do not know that reusable components exist or because they perceive that reuse costs more than programming from scratch.

This dissertation advocates a paradigm shift from development-with-reuse to reuse-within-development, which views reuse as an integral part of software development, and component repository systems as information systems that augment programmers’ insufficient knowledge about reusable components and assist them in accomplishing their tasks. Active component repository systems (component repository systems equipped with active information delivery mechanisms) support reuse-within-development. They can be seamlessly integrated with programming environments. Through this integration, their active information delivery mechanism delivers task-relevant and user-specific components, without being given explicit reuse queries, to help programmers reuse unknown components and to reduce the cost of reuse.

An active component repository system, CodeBroker, has been developed and evaluated. CodeBroker runs continuously in the background of a programming environment and infers programmers’ needs for reusable components by monitoring their interactions with the environment. Potentially reusable components that match reuse queries extracted from comments and signatures in the programming environment are autonomously located and actively delivered to programmers. Formal evaluations of the CodeBroker system have indicated that it motivated programmers to reuse once relevant components were delivered, and that it was able to deliver components relevant to both the task and the background knowledge of programmers.

Acknowledgments

I feel very fortunate that my employer, Software Research Associates, Inc. (SRA), Tokyo, Japan, provided me the time and financial support to complete this research. In particular, I thank Kouichi Kishida, executive vice president and technical director of SRA, for his lasting support and encouragement, without which I could not have finished this research. Yoshitaka Matsumura, Kaoru Hayashi, and Yoshikazu Hayashi have been excellent managers who have gone to great lengths to provide the best conditions for me to complete my research. I also want to thank my colleague Tomohiro Oda for his help.

I am grateful to the members of my thesis committee. Gerhard Fischer, my advisor, is simply the best advisor I could have found. His conceptual frameworks on Domain-Oriented Design Environments and on learning have provided the foundations for this research. Without his excellent skills in challenging my ideas and motivating me to think deeper, I could not have finished the research in this manner. Kumiyo Nakakoji, my mentor and role model, has provided immeasurable support, both emotionally and intellectually. She has always been there when I needed help. Brent Reeves has spent much time patiently listening to my sometimes rough ideas and reading my immature manuscripts, and has provided frank, yet friendly, critical feedback. His constructive criticism has been invaluable in guiding me to frame the research problem, prioritize my resources, and present my ideas clearly. The support from other members of my thesis committee, Ken Anderson, James Martin, and Walter Kintsch, helped me to clarify my understanding, and their input is very much appreciated. In particular, I thank James Martin for his excellent course on Natural Language Processing, which introduced me to the research field of information retrieval. That was one of the best courses I have ever taken.

Members of the Center for LifeLong Learning and Design have been very supportive. I thank Taro Adachi for numerous, wide-ranging discussions that I have greatly enjoyed over the years. Jonathan Ostwald generously offered many times to listen to my thoughts and read my writings. His encouragement and feedback are greatly appreciated. I was extremely delighted to have as an officemate Eric Scharff, who had an answer to every computer problem I had, no matter whether it was a Mac, Windows, or Linux problem. Many discussions with Rogerio de Paula helped me structure my thoughts. I thank Gerry Stahl, Hal Eden, Andy Gorman, and Francesca Iovine for their support.

Finally, I would like to thank my family members. I thank my parents, who have taught me the joy of learning and have always urged me to do my best. I thank my eldest daughter, Hanlu, for understanding when she had to spend many weekends being bored because dad had to work, and my 5-month-old daughter, Hanlei, for her innocent and sweet smiles which provided the best comfort after a day’s hard work. Most of all, I wholeheartedly acknowledge the endless love, understanding, and support that my wife, Yonghong Pan, has given to me. In particular, I thank her for her unabated confidence in me, which has cheered me greatly at times of frustration.

Contents

Chapter

1 Introduction

1.1 Motivation

1.2 Goal of the Research

1.3 Active Component Repository Systems

1.4 The CodeBroker System

1.5 Organization of the Dissertation

2 Roles of Reusable Components in Programming

2.1 A Process Model of Programming

2.2 Programming Knowledge

2.3 Opportunistic Programming

2.4 Benefits of Software Components in Programming

3 Challenges of Software Reuse

3.1 Overview of Software Reuse

3.2 General Issues of Component Reuse

3.3 Creating Reusable Components

3.4 Understanding the Cognitive Difficulties of Component Reuse

4 The Component Locating Problem

4.1 No Attempt to Reuse

4.2 Paradigm Shift: From Development-with-Reuse to Reuse-within-Development

4.3 Information-Enriched Workspaces

4.4 Active Component Repository Systems

5 Active Information Systems

5.1 Basic Issues of Active Information Systems

5.2 Acquiring Information of User Tasks

5.3 Personalizing Information Delivery

5.4 Dealing with Partial, Imprecise Queries

5.5 Comparing Active Information Systems with an Example in the Real World

5.6 The Spectrum of Support for Locating Information

6 Indexing and Retrieval Mechanisms in CodeBroker

6.1 Indexing and Retrieval Mechanisms

6.2 Creating the Component Repository

7 Locating and Delivering Components in CodeBroker

7.1 System Architecture

7.2 Listener

7.3 Fetcher

7.4 Presenter

7.5 The Retrieval-by-Reformulation Mechanism

7.6 Summary of CodeBroker

8 Evaluations of CodeBroker

8.1 Evaluating the Retrieval Mechanisms

8.2 Empirical Evaluations of the CodeBroker System

8.3 Findings about the Usage of CodeBroker

8.4 Other Findings about Programming in General

8.5 Problems of CodeBroker and Needed Improvements

8.6 Summary of Evaluations

9 Related Work

9.1 Active Information Systems

9.2 Component Repository Systems

9.3 Intelligent Programming Environments

10 Future Work and Conclusions

10.1 Future Work

10.2 Summary

10.3 Contributions

Bibliography

Appendix

A The List of Queries and Relevant Components

B Questions Asked in the Post-Experiment Interview

C Abbreviations

D Glossary

Tables

Table

1.1 The rapid growth of the Java Core API

4.1 Relations between reuse mode, knowledge sources, and tool support

5.1 A comparison between plan recognition and similarity analysis

8.1 Average precision and recall values for LSA, Mixed (average of LSA and Okapi), and Okapi

8.2 Programming knowledge and expertise of subjects

8.3 Overall results of evaluation experiments with programmers

8.4 Subjective evaluations of the CodeBroker system

8.5 Experiment data regarding user models

8.6 Experiment data about discourse models

Figures

Figure

1.1 The location-comprehension-modification process of reusing components

1.2 Software reuse failure modes

1.3 Overview of CodeBroker

2.1 The process model of programming

2.2 A program and its program plans

2.3 Orthogonality between program plans and software components

2.4 The role of components in problem framing

3.1 A cognitive model of the component reuse process

4.1 Different levels of programmers’ knowledge about a component repository

4.2 The development-with-reuse paradigm

4.3 The reuse-within-development paradigm

5.1 Feedforward information delivery

5.2 Autocompletion in Internet Explorer

5.3 Feedback information delivery

5.4 Two assumptions of similarity analysis

5.5 The spectrum of support to information location

6.1 The CodeIndexer and CodeBroker subsystems

6.2 The process of creating a component repository from Java programs

6.3 An example of a document generated by Javadoc

6.4 The indexing format of method documents in CodeBroker

7.1 The architecture of the CodeBroker system

7.2 Component delivery based on concept queries only

7.3 Component delivery based on both concept queries and constraint queries

7.4 Presenting more information triggered by mouse movement

7.5 An example discourse model

7.6 An example user model

7.7 An illustrative program for adaptive user modeling

7.8 The Skip Components Menu

7.9 The Direct Manipulation

7.10 The Query Refinement interface

7.11 Summary of CodeBroker

8.1 Recall-precision curves

Chapter 1

Introduction

1.1 Motivation

A wide gap exists between the constantly increasing demands for complex software systems and the capability of the software industry to deliver quality software systems in a timely and cost-effective manner. Software reuse, a development method of using existing reusable software components to create new programs, has been shown through empirical studies to improve both the quality and productivity of software development (Basili et al., 1996; Boehm, 1999). Software reuse also increases the evolvability of software systems because complex systems evolve faster when they are built from stable subsystems (Simon, 1996).

Programmers are knowledge workers, and programming is a process of progressive crystallization of their knowledge into a program. Knowledge needed during programming comes either from the programmer’s head or from such external sources as books, manuals, peer workers, and computerized information systems (Norman, 1993). A lack of needed knowledge is one of the major reasons for poor quality and productivity of programming. With the advent of object-oriented technology, reusable software components now comprise the bulk of programming knowledge. Easy access to needed external information, in particular, reusable software components, to complement the insufficient knowledge of programmers is thus critical to the improvement of programming quality and productivity.

If programmers know a reusable software component well enough, they may integrate it into their programs whenever it is applicable without even realizing they are reusing because such reusable components become “ready-to-hand” to programmers (Winograd & Flores, 1986). However, repositories of reusable software components are often so large that programmers cannot learn about all of the components before they start programming. Software component repositories are not static; they are constantly evolving with new components added and old components updated. As an example, Table 1.1 shows the rapid growth of the Java Core API (Application Programmer Interface) library, a repository of reusable components of classes and methods. Few Java programmers, if any, can claim that they know all the components in this library.

Version     No. of Packages   No. of Classes   Year of Release
Java 1.0    8                 211              1996
Java 1.1    23                503              1997
Java 1.2    59                1525             1998
Java 2      70+               2100+            1999

Table 1.1: The rapid growth of the Java Core API library

Programmers who have not learned the software component have to go through the reuse process if they want to reuse or use it in their programming. The reuse process consists of three steps: location, comprehension, and modification (Figure 1.1). Programmers have to locate those components that are potentially reusable in the current programming task from the component repository, comprehend their functionality and usage, and make necessary modifications if the components do not completely fit their needs (Fischer et al., 1991).

Figure 1.1: The location-comprehension-modification process of reusing components (diagram labels: Location, Comprehension, Modification, explanation, reformulation, extraction)

Successful reuse requires that programmers be able to locate, comprehend, and modify needed reusable components.

The foremost obstacle to the success of component reuse is that programmers cannot locate needed software components quickly and easily. Locating reusable software components is often supported by component repository systems or reuse repository systems. As in many other information repository systems, browsing- and querying-oriented schemes have long served as the principal techniques for programmers to locate reusable software components. More innovative schemes, such as query by reformulation (Williams et al., 1982; Fischer & Nieper-Lemke, 1989; Henninger, 1993), information filtering (Belkin & Croft, 1992), and Latent Semantic Analysis (Landauer & Dumais, 1997), have introduced new possibilities. Unfortunately, the problem remains that programmers simply do not actively search for components and make no attempt to reuse. According to a study by Frakes and Fox (Frakes & Fox, 1995), no attempt to reuse is the leading failure mode of software reuse (Figure 1.2). This inhibiting factor to the wide success of reuse has been reported again and again by software companies that have tried to introduce reuse into their organizations (Devanbu et al., 1991; Rosenbaum & DuCastel, 1995; Fichman & Kemerer, 1997).

1.2 Goal of the Research

Although many factors, such as the lack of managerial commitment and the difficulty in developing good reusable components, affect the widespread uptake of reuse, this research focuses on the cognitive difficulties faced by programmers who try to reuse, because only when programmers are willing and able to put reuse into their daily practice will reuse become fruitful.

This research tries to create a conceptual framework to analyze what hinders programmers from making attempts to locate reusable components and, based on the analysis, it proposes a new approach to the design of component repository systems that can motivate and encourage programmers to reuse by reducing the difficulty of locating components.

By applying cognitive engineering (Norman, 1986) to the reuse process, a cognitive model of reuse is first built. Based on this cognitive model and past research on the effective use of large information repositories (Fischer, 2001), the following two barriers to the component locating process are identified.

(1) Due to the large volume and constantly evolving nature of component repositories, programmers often fail to anticipate the existence of reusable components; when they do not believe that a component exists in the repository, they will not even make an effort to locate it in the first place.

(2) Even if programmers are aware of the existence of reusable components, they do not want to start the locating process if they do not know how to locate the components or if they perceive that locating the components costs more than programming from scratch.

Figure 1.2: Software reuse failure modes

In the Frakes and Fox (1995) paper, seven conditions (attempting to reuse, components existing, components available, components found, components understood, components valid, and components integratable) form a successful reuse chain. A breakdown in any condition causes the failure of reuse. The data were collected from 29 software development organizations. The Y-axis shows the percentage of reuse failures attributed to each condition.

Although reusable component repository systems have been an active research area for more than a decade, these two issues, especially the first one, have not been given enough attention. This is because those systems are designed to support the paradigm of development-with-reuse (Rada, 1995), which advocates reuse as a new paradigm for programming. Under this paradigm, the reuse process is treated as an independent process, and programmers have to change their current programming practice to embrace reuse; reusable component repository systems are researched as stand-alone systems under the assumption that programmers are always willing to use these systems and are able to use them with well-defined queries. Consequently, research on component repository systems has focused mainly on the information access mechanism. Information access is an approach to obtaining information that requires users[1] to start the information locating process through browsing or querying.

[1] Because users of component repository systems are programmers, in this thesis, the term “user” is used interchangeably with the term “programmer”.

This research proposes a paradigm shift from development-with-reuse to reuse-within-development. Development-with-reuse is a methodology-centered view of reuse that demands that programmers adapt themselves to the new methodology, reuse. It does not concern itself with the confusions and difficulties faced by programmers who try to reuse. When the approach does not meet its expected success, programmers are labeled, due to their resistance to change, as having the NIH (Not Invented Here) syndrome (Fafchamps, 1994), and education of programmers about the value of reuse is called for.

Conversely, the reuse-within-development paradigm puts programmers back into the center and views reuse as an integral part of the whole programming process. It stresses that reusable component repository systems should serve as extensions to programmers’ limited knowledge. Such systems should actively participate in the programming process by providing programmers immediate and easy access to reusable software components instead of passively waiting for programmers to explore them after they have made the decision to reuse.

Reuse-within-development needs the support of active component repository systems. Active component repository systems are a subset of active information systems that are equipped with the information delivery mechanism. Unlike the passive information access mechanism by which users have to explicitly launch the information-seeking process by specifying their information needs in the form of well-defined queries or engaging in a series of browsing actions, the information delivery mechanism presents information to users on its own initiative without being prompted by explicit queries. With reusable components delivered by active component repository systems, programmers are able to reuse without changing their current programming practice and environment.

1.3 Active Component Repository Systems

In general, active information systems that just throw a piece of decontextualized information at a user are of little use because they ignore the user’s working context. The working context consists of the task acted upon and the user acting. The challenge of implementing an active information system or an information delivery system is to deliver context-sensitive information related to both the task at hand and the background knowledge of the user. Task- and user-independent information delivery systems, or “push” systems, such as Microsoft’s “Tip of the Day,” suffer from the problem that information gets thrown at users in a decontextualized way. The “Tip of the Day” is a feature that tries to acquaint users with some arbitrarily chosen functionality in a complex system. Despite the possibility for interesting serendipitous encounters of information (Roberts, 1989), most users find this feature more annoying than helpful.

The specific challenge faced by this research is to deliver context-sensitive components. In other words, the question is how an active component repository system can capture programmers’ needs for reusable components by understanding to some extent what their tasks at hand are, and then present only those task-relevant components that are not yet known to the programmers.

Needs for reusable software components are not determined before programming starts, as most current component repository systems have assumed; they arise in the middle of the programming process (Sen, 1997). Inasmuch as programmers are using computer-based development environments to develop software systems, it is possible for component repository systems to capture the reuse needs autonomously by utilizing information available in programming environments when component repository systems and program development environments are properly integrated. For example, in a programming editor, comments inside programs and signatures (the syntactical interfaces of program modules) are good indications of what programmers are going to develop next (Ye & Fischer, 2000). The integration of component repository systems and programming environments creates a shared workspace accessible to both programmers and component repository systems. This shared workspace enables component repository systems to play an active role in supporting reuse by programmers, with the delivery of task-relevant and user-specific reusable components. Presenting components specific to a programmer can be realized through user models (Fischer, 2001), because user models that represent programmers’ knowledge about reusable components can be used as filters by the repository system to ensure only unknown components are delivered.
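As a rough illustration of the idea described in this section, the following Python sketch extracts a reuse query from a doc comment and the method signature that follows it, and then uses a user model as a filter. It is not CodeBroker's implementation: the names `ReuseQuery`, `extract_query`, and `filter_unknown`, the stop-word list, the regular expressions, and the sample Java method are all assumptions introduced here for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "for"}


@dataclass(frozen=True)
class ReuseQuery:
    concept_terms: Tuple[str, ...]                    # words from the doc comment
    signature: Optional[Tuple[str, Tuple[str, ...]]]  # (return type, parameter types)


def extract_query(java_source: str) -> ReuseQuery:
    """Build a query from the first doc comment and the declaration after it."""
    comment = re.search(r"/\*\*(.*?)\*/", java_source, re.DOTALL)
    text = comment.group(1) if comment else ""
    terms = tuple(w for w in re.findall(r"[a-z]+", text.lower())
                  if w not in STOP_WORDS)

    # Simplification: the first "<type> <name>(<params>)" pattern is taken as
    # the signature, and the first token of each parameter as its type.
    rest = java_source[comment.end():] if comment else java_source
    decl = re.search(r"([\w\[\]<>]+)\s+\w+\s*\(([^)]*)\)", rest)
    signature = None
    if decl:
        params = tuple(p.split()[0] for p in decl.group(2).split(",") if p.strip())
        signature = (decl.group(1), params)
    return ReuseQuery(terms, signature)


def filter_unknown(candidates, known_components):
    """User model as a filter: keep only components the programmer does not know."""
    return [c for c in candidates if c not in known_components]


# Hypothetical editor contents; the comment text follows the Figure 1.3 example.
SAMPLE = '''
/** Create a random number between two limits */
public static int getRandomNumber(int from, int to) {
'''

query = extract_query(SAMPLE)
delivered = filter_unknown(
    ["java.util.Random.nextInt", "java.lang.Math.random"],
    known_components={"java.lang.Math.random"},
)
```

In this sketch the comment supplies the conceptual part of the query and the signature supplies a type constraint; keeping them separate lets a retrieval step use either one or both.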

1.4 The CodeBroker System

An active component repository system, CodeBroker, has been developed. CodeBroker is integrated with the Emacs program development environment. It utilizes an information delivery mechanism to bring to the attention of Java programmers those components that are unknown to them and yet are relevant to their current programming task by

(1) constructing a task model to capture the programming task through continuously monitoring programming activities in the development environment,

(2) identifying the domains of a programmer’s current interest by creating a discourse model based on the history of interaction between the system and the programmer, and

(3) creating a user model to represent each programmer’s knowledge about reusable components to personalize the delivery.

Integrated with CodeBroker, the development environment becomes an information-enriched workspace (Ye, 2001b) consisting of the original programming environment and an augmented information display that presents reusable components dynamically based on the programming task and the programmer’s background knowledge. Programmers can access potentially reusable components immediately without switching working contexts. This is a distinct advantage because it avoids interrupting the programming flow. The operational interface of the component repository system becomes transparent to programmers, and is replaced by three cooperative autonomous software agents (Bradshaw, 1997): Listener, Fetcher, and Presenter. The Listener agent creates reuse queries from the programming workspace as the task model; the Fetcher agent retrieves components matching reuse queries; and the Presenter agent presents retrieved components directly into the workspace of programmers, using discourse models and user models as filters (Figure 1.3).
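The division of labor among the three agents can be sketched as a small pipeline. The Python code below is a toy under stated assumptions, not the CodeBroker implementation: the function names, the four-entry repository, and its one-line descriptions are invented for illustration, the Fetcher uses plain Jaccard word overlap as a stand-in for the system's actual retrieval mechanisms, and the discourse model is reduced to a set of package names the programmer is assumed to have marked as out of interest.

```python
from dataclasses import dataclass

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "this"}


@dataclass(frozen=True)
class Component:
    name: str         # fully qualified method name
    description: str  # toy one-line description (not real Javadoc text)


def words(text):
    """Lower-cased content words of a text."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def listener(doc_comment):
    """Listener: turn the comment being written into a reuse query (task model)."""
    return words(doc_comment)


def fetcher(query, repository, top_n=3):
    """Fetcher: rank components by Jaccard overlap with the query (a stand-in)."""
    def score(component):
        terms = words(component.description)
        union = query | terms
        return len(query & terms) / len(union) if union else 0.0

    ranked = sorted(repository, key=score, reverse=True)
    return [c for c in ranked if score(c) > 0][:top_n]


def presenter(candidates, known_names, skipped_packages):
    """Presenter: filter by user model (known) and discourse model (skipped)."""
    return [
        c for c in candidates
        if c.name not in known_names
        and not any(c.name.startswith(p + ".") for p in skipped_packages)
    ]


REPOSITORY = [
    Component("java.lang.String.trim", "removes leading and trailing whitespace"),
    Component("java.lang.Math.random", "returns a random double between zero and one"),
    Component("java.awt.Color.brighter", "creates a brighter version of this color"),
    Component("java.util.Random.nextInt", "returns a random int number between zero and a bound"),
]

query = listener("Create a random number between two limits")
hits = fetcher(query, REPOSITORY)
shown = presenter(hits, known_names={"java.lang.Math.random"}, skipped_packages=set())
```

Because each stage only consumes the previous stage's output, the filtering models can change (for example, as the user model learns that a component is now known) without touching query extraction or retrieval.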

The information-enriched workspace created by active component repository systems improves the “readiness-to-hand” of components because it hides the retrieval interface of component repository systems from programmers so that programmers can directly interact with reusable components rather than the repository system.

Evaluations of the system with programmers have found that the system was effective in supporting reuse along the following three dimensions:

(1) CodeBroker effectively encouraged programmers to explore the possibility of reuse.

(2) Programmers were able to reuse unknown software components when they were delivered by the system.

(3) The combination of task models, discourse models, and user models succeeded in most cases in delivering context-sensitive reusable components.

Figure 1.3: Overview of CodeBroker

The programming environment is augmented with a reusable component information display (the lower buffer), which presents reusable components dynamically. These components are autonomously retrieved by three cooperative software agents (Listener, Fetcher, and Presenter) based on the programming task and the programmer’s background knowledge. In this example, the programmer can reuse the first component (highlighted) to implement the task: “Create a random number between two limits” (indicated in the doc comment), without leaving the programming environment or explicitly operating the component repository system.

1.5 Organization of the Dissertation

Chapter 2 of this dissertation presents a conceptual framework of programming for analyzing the roles of reusable components in programming. Most programmers follow the opportunistic programming strategy, and the availability of reusable components affects the choice of different development alternatives.

After overviewing the issues of instituting systematic reuse in a software development organization, Chapter 3 analyzes in detail the difficulties of component reuse from the perspective of programmers. Through cognitive engineering, a cognitive model of the reuse process is created and the challenges faced by programmers in each step are discussed.

Chapter 4 focuses on the central theme of this research: why locating components is difficult for programmers, and, in particular, what prevents them from attempting to reuse. Drawing on past research on the use of large information repositories and on human cognition theories, the argument is made that the “no attempt to reuse” phenomenon is caused by the existence of information islands and perceived low reuse utility. The concept of active component repository systems is introduced as a solution to this problem.

Chapter 5 delineates the challenges in implementing active information systems and their general solutions: Task models and discourse models contribute to the task-relevance of information delivery, and user models support the user-specific delivery. To accommodate the dynamic nature of the information-seeking process, the concept and role of retrieval-by-reformulation are discussed.

Chapter 6 describes the retrieval mechanisms used in the CodeBroker system and the CodeIndexer subsystem that creates the contents of the component repository from existing programs.

Chapter 7 presents the design and implementation of the CodeBroker system.

Chapter 8 presents the findings from formal evaluations of CodeBroker.

Chapter 9 compares this research with related work.

Chapter 10 concludes the thesis by discussing future research directions and summarizing the contributions of this research.

Chapter 2

Roles of Reusable Components in Programming

With the advent of object-oriented technology, reusable software components have become an indispensable part of programming knowledge: “[Reusable component] library design is [programming] language design” (Stroustrup, 1995). In addition to those classes and methods included in standard libraries of programming languages, such as the Java API library, many reusable software components are developed by software development organizations specifically for reuse or repackaged from previously developed systems.

Practitioners and researchers generally believe, and experiments have empirically proven, that component reuse improves the quality and productivity of programming (Lange & Moher, 1989; Lim, 1994; Basili et al., 1996; Simon, 1996; Boehm, 1999). However, most analyses of the benefits of reusable components have been based on the products finally produced. To better understand how reusable components help programmers produce better software systems faster (not a better product and a shorter production time, per se) we must analyze the roles of reusable components in the programming process. After presenting the process model of programming, drawing on design theory in general and empirical programming studies in particular, this chapter explains the benefits of reusable components in programming.

2.1 A Process Model of Programming

Viewed as a task to create a computer-executable representation (a program) of a real-world problem by piecing together a set of primitive elements provided by a programming language and its component libraries, programming consists of two distinctive, yet tightly intertwined processes: problem framing and problem solving (Schön, 1983; Hoc et al., 1990; Fischer, 1994).

2.1.1 Intertwining of Problem Framing and Problem Solving

During the problem-framing process, commonly known as the specification process in software engineering, programmers try to understand the problem given in the actual problem space by building a mental representation of the programming task. This mental representation is a situation model that is the result of the interaction between the problem and the programmer’s knowledge about the problem domain (Kintsch, 1998). Different programmers with different knowledge often come up with different situation models of the same programming task. During the problem-solving process, or implementation in software engineering terminology, programmers create programs based on the situation model as a new representation in the solution space defined by the programming language and its libraries.

Although problem solving starts after problem framing, these two processes are not separate. They influence each other: every transformation of the framing of the problem provides the direction in which a partial solution is to be transformed, and every transformation of the constructed solution determines the direction in which the framing is to be transformed. Like all other design activities, which are interactions between understanding (problem framing) and creation (problem solving) (Rittel, 1984; Winograd & Flores, 1986), programming is an iterative process of problem framing and problem solving. Programmers rarely complete one process before beginning the other (Pennington & Grabowski, 1990), for the following two reasons.

(1) In most cases, programming tasks cannot be fully understood without considering the solution (Ghezzi et al., 1991). For example, given the programming task of drawing a filled circle, a programmer can define the filled circle as the trajectory of rotating one end of a fixed line 360 degrees, or as a collection of dots whose distance to a center is not greater than the radius. Each definition is actually based on an intended solution to the problem.

(2) Programming involves many tentative problem-solving strategies. After those tentative strategies have been explored and their consequences evaluated, some become eventual commitments and some require the modification of the initial mental representation of the problem. This modification often breeds new subtasks to be solved.
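The filled-circle example in item (1) can be made concrete with a small sketch. It is illustrative only: the class name FilledCircleFramings, the grid representation, and the angular and radial sampling steps are assumptions made here, not part of the cited studies. Each method follows one of the two framings, and each framing already commits the programmer to a particular kind of solution.

```java
/** Two framings of "draw a filled circle"; each one already implies a solution. */
class FilledCircleFramings {

    /** Framing A: every dot whose distance to the center is not greater than the radius. */
    static boolean[][] byDistance(int radius) {
        int size = 2 * radius + 1;
        boolean[][] grid = new boolean[size][size];
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                int dx = x - radius;
                int dy = y - radius;
                grid[y][x] = dx * dx + dy * dy <= radius * radius;
            }
        }
        return grid;
    }

    /** Framing B: the trajectory of a fixed line rotated 360 degrees about one end. */
    static boolean[][] byRotation(int radius) {
        int size = 2 * radius + 1;
        boolean[][] grid = new boolean[size][size];
        int steps = 16 * Math.max(radius, 1); // angular resolution, chosen arbitrarily
        for (int s = 0; s < steps; s++) {
            double angle = 2 * Math.PI * s / steps;
            // Mark sample points along the line from the center to its free end.
            for (double t = 0; t <= radius; t += 0.25) {
                int x = radius + (int) Math.round(t * Math.cos(angle));
                int y = radius + (int) Math.round(t * Math.sin(angle));
                grid[y][x] = true;
            }
        }
        return grid;
    }

    /** Renders a grid as text, one row per line. */
    static String render(boolean[][] grid) {
        StringBuilder out = new StringBuilder();
        for (boolean[] row : grid) {
            for (boolean filled : row) {
                out.append(filled ? '#' : '.');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(byDistance(4)));
        System.out.println(render(byRotation(4)));
    }
}
```

Because framing B samples and rounds points along the rotating line, the two grids may differ near the boundary, which is one way in which the chosen framing shapes the resulting program.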

2.1.2 Programming Is Knowledge Intensive

Neither problem framing nor problem solving is a process of simple transformation that converts one representation to another representation; instead, they are processes of interpretation. The programming task, the situation model, and the final program are representations at different levels of formalization and abstraction intended for different purposes. Drawing on their knowledge, programmers have to interpret the previous representation by reifying abstract concepts, explicating the implicit, and structuring the symbols existing at the new representation level.

Knowledge required in programming can be divided into two categories: domain knowledge and programming knowledge. Domain knowledge is the knowledge about the problem domain and is mainly used in the process of problem framing. Programming knowledge is the knowledge needed to construct a program in the process of problem solving. However, due to the intertwined nature of those two processes, programming knowledge also contributes to problem framing, and domain knowledge contributes to problem solving as well. Figure 2.1 illustrates the process model of programming and its reliance on knowledge.

2.2 Programming Knowledge

Among the many constituents of programming knowledge (for example, the operation of tools, general data structure knowledge, and the capability of reasoning and abstracting), program plans and building blocks are two of the most important. As a series of interconnecting actions to achieve a goal (Soloway & Ehrlich, 1984; Rich & Waters, 1990), a program plan provides a skeleton structure for programs by abstracting key elements. Building blocks are the primitive elements provided by a programming language. They include basic statements of a programming language and reusable software components in repositories or libraries.

[Figure 2.1 diagram. Labels: Problem in Actual Problem Space; Programming; Program in Solution Space; Problem Framing; Problem Solving; Situation Model in Represented Problem Space; Domain Knowledge; Domain-Specific Programming Knowledge; Programming Knowledge.]

Figure 2.1: The process model of programming

Problem framing and problem solving are intertwined and they require both domain knowledge and programming knowledge. Domain knowledge and programming knowledge often overlap, and the overlap becomes domain-specific programming knowledge.

2.2.1 Program Plans

Considerable evidence exists in empirical studies of programming that program plans are the basic cognitive chunk used in program design and understanding (Soloway & Ehrlich, 1984; Rich & Waters, 1990). Programs are often built up one plan chunk at a time (Rist, 1995; Detienne, 1995). Because program plans are abstract representations of a solution, during the process of programming, they need to be gradually fleshed out with building blocks. A program often contains different plans that are interlaced. Figure 2.2 shows a program and the program plans it uses. Program plans are hierarchical. A program plan at a higher abstraction level is built upon program plans of lower levels. For example, in Figure 2.2, the plan Shuffling an array comprises three other program plans: Loop over an array, Create a random number in a range, and Swap two numbers.

2.2.2 Building Blocks

Although programmers can build a program with only the basic statements of a programming language, it is just as impossible to build a complex software system from basic program statements alone as it is to build a jet airplane from only nuts and bolts. Reusable software components are an indispensable part of the building blocks, especially in today’s object-oriented programming languages. A reusable software component is a software module that can be integrated into a new program directly or after minor changes. A software module refers to a named and addressable abstraction: either a procedural abstraction, such as a function, or a data abstraction, such as a class. Procedures, functions, methods, and classes are all considered software modules. In this dissertation, the term module refers to software abstractions to be developed by programmers, and the term component is used to refer to those modules that have been packaged for reuse. Because basic program statements of a programming language are not of interest in this research, the term “building block” is used throughout interchangeably with the term “software component.”

2.2.3 Orthogonality of Program Plans and Software Components

Software components are used to realize program plans. Program plans and software components are orthogonal to each other: a program plan can be realized with different software components, and a software component can be used in the realization of different program plans. Figure 2.3 illustrates the orthogonal relationship between program plans and software components.

01 public class CardDealer {
02     static int[] cards = new int[52];
03     static { for (int i = 0; i < 52; i++) cards[i] = i; }
04     /** create a random number in a range */
05     public static int getRandomNumber(int from, int to) {
06         return ((int) (Math.random() * (to - from)) + from);
07     }
08     /** shuffle the cards */
09     public static void shuffleCards() {
10         int r, temp;
11         for (int i = 0; i < 52; i++) {
12             r = getRandomNumber(i, 52);
13             temp = cards[i];
14             cards[i] = cards[r];
15             cards[r] = temp;
16         }
17     }
18     public static void main(String[] args) {
19         shuffleCards();
20         for (int i = 0; i < 52; i++) {
21             System.out.print(" " + cards[i]);
22         }
23     }
24 }

The above program contains the following program plans:

Plan Name | Plan Description | Lines Realizing the Plan
Create a random number in a range | Get the range; Convert a random number between [0, 1.0] to the range. | 5-7
Swap two numbers | Save one data to a temporary variable; Move the other data to the saved data; Move the temporary variable to the other data. | 13-15
Loop over an array | Initialize; Set the ending condition; Perform operations; Increase the loop variable. | 11-16; 20-22
Shuffling an array | Loop over an array; Create a random number in a range; Swap two numbers. | 11-16

Figure 2.2: A program and its program plans

[Figure 2.3 diagram: an AND/OR graph with three layers. Task: Shuffling an array. Plans: Create a random number in a range; Loop over an array; Swap two numbers; Swap two sets of numbers. Components: swapInt(), Math.random(), getInt(int, int), swapRanges().]

Figure 2.3: Orthogonality between program plans and software components

The task Shuffling an array can be implemented in at least three ways: (1) with nodes connected with solid lines (concrete implementation shown in Figure 2.2, except that the swapInt was implemented with primitive statements); (2) with nodes connected with thick dashed lines, i.e., with the program plans Create a random number in a range, Loop over an array, and Swap two sets of numbers, and the components getInt(int, int) and swapRanges(); (3) with the same program plans as in (2), and components connected with thin dashed lines, i.e., the components swapInt() and Math.random().
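The orthogonality described in Section 2.2.3 can be sketched in code. The example below is a minimal illustration under stated assumptions: the class name ShufflePlanVariants and its method names are hypothetical, and the standard library components java.util.Random.nextInt, java.util.Collections.swap, and java.util.Collections.shuffle stand in for the components named in Figure 2.3, which are not assumed to exist in any particular library. The first method realizes the plan Shuffling an array from its three subplans; the second realizes the same task with a single component.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** One task, "Shuffling an array", realized with different components. */
class ShufflePlanVariants {

    /** Creates an ordered deck of cards numbered 0..51. */
    static List<Integer> newDeck() {
        List<Integer> cards = new ArrayList<Integer>();
        for (int i = 0; i < 52; i++) {
            cards.add(i);
        }
        return cards;
    }

    /**
     * Plans: loop over an array, create a random number in a range, swap two numbers.
     * Components: Random.nextInt and Collections.swap.
     */
    static void shuffleWithSwap(List<Integer> cards, Random rng) {
        for (int i = 0; i < cards.size(); i++) {
            int r = i + rng.nextInt(cards.size() - i); // random index in [i, size)
            Collections.swap(cards, i, r);
        }
    }

    /** The same task realized by a single library component. */
    static void shuffleWithLibrary(List<Integer> cards, Random rng) {
        Collections.shuffle(cards, rng);
    }

    public static void main(String[] args) {
        List<Integer> first = newDeck();
        shuffleWithSwap(first, new Random());
        System.out.println(first);

        List<Integer> second = newDeck();
        shuffleWithLibrary(second, new Random());
        System.out.println(second);
    }
}
```

Conversely, Collections.swap is not tied to shuffling: the same component could serve other plans, such as reversing or sorting a list in place.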

2.3 Opportunistic Programming

Different strategies exist to develop a program. A top-down development strategy starts with decomposing the programming task into subtasks, choosing program plans to achieve those subtasks, and then fleshing out the program plans with reusable software components and program statements. A bottom-up development strategy starts with selecting reusable software components, and then combining them according to the structure of a program plan.

Empirical studies have revealed, however, that most programmers follow neither the top-down nor the bottom-up design strategy. In fact, their programming activities are very opportunistic: They are a mixture of top-down and bottom-up strategies, and which strategy is chosen depends on the knowledge of individual programmers and the particular situation (Curtis et al., 1988; Visser, 1990). Interim decisions made during the programming process “often can lead to subsequent decisions at arbitrary points in the [programming] space” (Hayes-Roth & Hayes-Roth, 1979).

The opportunistic nature of programming stems from differences in each programmer’s knowledge of program plans and software components. Simon (1996) has pointed out that cognitive activities are determined by the environment in which they take place. The environment includes information present in the workspace as well as information present in the memory of human beings. Information in the workspace, including partially constructed programs, talks back to the problem solvers (programmers) and serves as cues to activate relevant program plans and software components from memory (Schön, 1983). Because programmers differ in their familiarity with program plans and software components, which determines the link strength from cues to the activated knowledge in memory, it is quite natural that the programming process pursued by each programmer is different, and that the resulting solutions vary. Taking Figure 2.3 as an example, if the programmer is more familiar with the component swapRanges, he or she may choose the program plan Swap two sets of numbers, and the final implementation will be the one connected with thick dashed lines. Conversely, if he or she is more familiar with the program plan Swap two numbers, he or she may proceed from that program plan and choose the component swapInt.

The lack of knowledge about reusable software components needed to implement a program plan often prevents programmers from considering it. However, if information about relevant reusable components is somehow present in the current workspace, it can expand programmers’ solution spaces, which are limited by their knowledge. Active component repository systems can complement programmers’ insufficient knowledge of reusable components by presenting them with immediately accessible components relevant to the current programming task in the workspace.

2.4 Benefits of Software Components in Programming

Reusable software components have both short-term and long-term benefits for the development of software systems. Short-term benefits are the immediate benefits that a programmer can attain during the implementation of a programming task. Long-term benefits may not be immediately enjoyed by the programmer who reuses the components, but they extend to the whole life cycle of the software system and to later programming activities of the programmer.

2.4.1 Short-Term Benefits

Reduced Development Time. By reusing existing software components, programmers write less code, and thus spend less time programming. Furthermore, because reusable components are usually carefully tested already, less time is needed in debugging and testing, which are the “hard and slow part” of programming (Brooks, 1995). Lim (1994) has reported that in a Hewlett-Packard division, a nearly linear relationship exists between the percentage of reused code in the product and the productivity of programmers, measured in LOC/pm (the number of noncomment lines of code produced by a programmer in a month). With only 5% reused code, productivity is 550 LOC/pm; as the percentage of reused code increases to 81%, productivity reaches 2,850 LOC/pm. Similar reports can be found in Browne et al. (1990) and Hallsteinsen & Paci (1997).
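As a rough reading of the two data points quoted from Lim (1994), and assuming the relationship is exactly linear between them (the source describes it only as nearly linear), the implied productivity gain per percentage point of reused code can be worked out as follows. The \text command assumes the amsmath package.

```latex
\[
\frac{\Delta(\text{LOC/pm})}{\Delta(\text{reused code, \%})}
  = \frac{2850 - 550}{81 - 5}
  = \frac{2300}{76}
  \approx 30.3
\]
```

Under this assumption, each additional percentage point of reused code corresponds to roughly 30 additional LOC/pm, and the productivity at 81% reuse is about 5.2 times that at 5% reuse (2850 / 550).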

Improved Quality. Because software components are often repeatedly reused, the defect fixes from each reuse accumulate, resulting in higher quality of the developed software systems. Raymond has vividly described this incremental bug fix process as “given enough eyeballs, all bugs are shallow” in his seminal essay that explains why Open Source software systems tend to have high quality (Raymond & Bob, 2001). Basili et al. (1996) have reported that the error density (errors per thousand lines of code) drops from 6.11 for systems developed without reuse to 0.12 for systems developed from reusable components. Similar formal evaluations of the contribution of reuse to the improved quality of software systems can be found in Lim (1994) and Thomas et al. (1997).

2.4.2 Long-Term Benefits

Easy Maintenance. Reusable components contribute to easy maintenance not only because they have fewer defects, but also because they facilitate communication among software developers by providing a common vocabulary, especially for the indirect communication between system builders and system maintainers. Because reused software components are high-level abstractions, system maintainers do not need to look into the details of implementation to uncover the original intentions of the system builders.

Improved Evolvability. To cope with constantly changing requirements and implementation platforms, software systems must be able to evolve. Reusing software components improves the evolvability of software systems because it confines the needed change to components, so that programmers do not have to identify and change occurrences scattered all over the system. Graham (1995) has reported a very typical example. Three project teams in a company had used the same formula in their software systems. Later, they discovered an error in the formula and needed to modify the systems. The team that had not created a component for the formula spent 5 weeks finding and correcting each occurrence of the formula. The other two teams, which had put the formula in a set of components, spent 1.5 days and 2 days, respectively, correcting their systems.
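The situation in Graham's report can be sketched as follows. The formula, the class names (PricingFormula, InvoiceModule, QuoteModule, FormulaReuseDemo), and the tax rate are all hypothetical, chosen only to show the structure; the source does not say what the formula was. Because both client modules call the one component, correcting an error in the formula means editing a single method rather than searching for every copy.

```java
/** Hypothetical shared component: the only place where the formula lives. */
class PricingFormula {
    static final long TAX_PERCENT = 8; // illustrative value, not from the source

    /** Returns the gross amount in cents for a net amount in cents. */
    static long withTax(long netCents) {
        return netCents + netCents * TAX_PERCENT / 100;
    }
}

/** First client: totals an invoice through the shared component. */
class InvoiceModule {
    static long total(long[] netCents) {
        long sum = 0;
        for (long item : netCents) {
            sum += PricingFormula.withTax(item);
        }
        return sum;
    }
}

/** Second client: reuses the same component instead of copying the formula. */
class QuoteModule {
    static long estimate(long netCents) {
        return PricingFormula.withTax(netCents);
    }
}

class FormulaReuseDemo {
    public static void main(String[] args) {
        System.out.println(InvoiceModule.total(new long[] {1000, 2000}));
        System.out.println(QuoteModule.estimate(1000));
    }
}
```

Had each module embedded its own copy of the expression, every copy would have to be located and corrected separately, which is the five-week effort described above.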

Increased Problem Framing Ability. The representation of a problem is an important determinant of the range of solutions that will be considered, as well as an important source of problem-solving difficulty (Hayes & Simon, 1977). Reusable software components provide programmers with higher level concepts that are both close to application domains and easy to implement. Components increase programmers’ ability to frame the problem into representations that are easier to solve. A component creates an abstraction for an existing solution, and it reduces the number of items that a programmer has to hold in simultaneous contemplation because the programmer can refer to the whole solution with the abstraction, in place of the details of the solution. Figure 2.4 illustrates the contribution software components make to the problem-framing ability. Without the support of reusable components, programmers have to frame each concept in the problem domain based on their knowledge of the programming language; with the support of software components, however, the difficulty of problem framing is reduced because certain concepts can be directly mapped to the components.

[Figure 2.4 diagram, two panels. Top panel labels: Concepts in Problem Domain; Programming Languages; Computer; Programmer; Compiler Developer. Bottom panel labels: Concepts in Problem Domain; Components; Programming Languages; Computer; Programmer; Component Developer; Developer.]

Figure 2.4: The role of components in problem framing

In the top figure, programmers have to frame each concept in the problem domain based on their knowledge of the programming language. In the bottom figure, programmers can represent some domain concepts with components directly (such as those having the same fill pattern in concepts and components) without thinking of their implementation.

Chapter 3

Challenges of Software Reuse

This chapter consists of two parts. The first part overviews software reuse to provide a broad background for this research. It defines the concept and scope of software reuse; describes different kinds of reusable software artifacts to establish the link between component reuse and other reuse research efforts; and discusses managerial issues, legal issues, technical issues, and cognitive issues involved in instituting a reuse program within a software development organization. The second part of the chapter analyzes the difficulties of component-based reuse from the perspective of programmers who want to reuse.

3.1 Overview of Software Reuse

3.1.1 Software Reuse and Reusable Software Artifacts

A broad definition of software reuse is using existing software artifacts to construct a new software system. A software artifact can be defined as a piece of formalized knowledge that can contribute to the software development process (Dusink & Van Katwijk, 1995). There are two types of software artifacts: (1) software products that are created as “things” or deliverables during the development process, and (2) development knowledge that is applied to the process.

The most commonly reused software product is source code, which is the final and most important product of software development. In addition to code, any intermediate life cycle products can be reused, which means that software developers can pursue the reuse of requirement documents, system specifications, modular designs, test plans, test cases, and documentation in various stages of software development.

Reusable software development knowledge and experience exist at different abstraction levels: the architecture level, the modular design level, and the program (or code) level. Research on software architecture is currently aiming to define different software architecture styles for different families of software systems (Perry & Wolf, 1992). A software architecture style describes the formal arrangement of architectural elements, and can be reused by software developers to construct their new software systems once the style is well defined (Shaw & Garlan, 1996; Taylor et al., 1996). For example, the domain-independent multifaceted architecture is an architecture style for domain-oriented design environments, which has been reused in and refined through the development of many generations of design environments for different domains (Fischer, 1994).

Reusable knowledge on modular design can be codified in design patterns (Alexander et al., 1977) and frameworks. A design pattern is the description of a solution to recurring problems. It specifies a problem to be solved, a solution that has stood the test of time, and the context in which the solution works (Gamma et al., 1994). Design patterns provide a common vocabulary for software developers to discuss their designs and can be passed from one developer to another developer for reuse. The concept of framework comes from object-oriented programming languages (Fischer et al., 1995). A framework describes the interaction pattern among a set of collaborative classes or objects, and can be represented as a set of abstract classes that interact with each other in a particular way (Johnson, 1997). Programmers can reuse frameworks directly in their development after providing implementations for those abstract classes.

Framework reuse is a mixture of knowledge reuse and code reuse.
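The idea of a framework as a set of abstract classes with a fixed interaction pattern can be illustrated with a deliberately small sketch. The names ReportFramework and BulletReport are hypothetical and are not taken from any of the cited frameworks; the sketch only assumes the general structure described above, in which the framework owns the control flow and the programmer supplies implementations for the abstract operations.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical miniature framework: it fixes how the parts interact. */
abstract class ReportFramework {

    /** The interaction pattern supplied by the framework; subclasses cannot change it. */
    final String render(List<String> items) {
        StringBuilder out = new StringBuilder(header()).append('\n');
        for (String item : items) {
            out.append(formatItem(item)).append('\n');
        }
        return out.toString();
    }

    /** Operations left open for the programmer who reuses the framework. */
    protected abstract String header();

    protected abstract String formatItem(String item);
}

/** The programmer's part: implementations for the abstract operations only. */
class BulletReport extends ReportFramework {
    @Override
    protected String header() {
        return "Components";
    }

    @Override
    protected String formatItem(String item) {
        return "* " + item;
    }

    public static void main(String[] args) {
        System.out.print(new BulletReport().render(Arrays.asList("swapInt", "swapRanges")));
    }
}
```

The programmer reuses both the design decision (when each operation is called) and the code that carries it out, while writing only the two small methods.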

Programming knowledge at the level of code is represented as program plans that can also be reused by programmers if a suitable representation form is defined (Rich & Waters, 1988).

3.1.2 Two Approaches to Reuse

Another dimension to classify reuse research is the approach it takes: Reuse can be generation-based or composition-based.

The generation-based approach reuses the process of previous software development efforts, often embodied in computer tools that automate a part of the development life cycle (Henderson-Sellers & Edwards, 1990). This approach weaves domain knowledge and programming knowledge into a very high-level programming language (VHLL), which is then converted to executable systems by a VHLL compiler or an application generator. Because VHLLs have a higher abstraction level than most high-level languages (HLLs), such as Java and C, they are relatively closer to programmers’ informal requirements. They are meant for programmers to describe what the program does instead of how it is implemented. Compilers of VHLLs directly convert the VHLL programs into executable programs in HLLs. Lex and Yacc in Unix are two well-known examples; other research prototypes include SETL (Dubinsky et al., 1989) and PAISLey (Zave & Schell, 1984). Unlike VHLL compilers, which make the conversion from VHLL programs to HLL programs in one step, application generators often use a series of transformation rules to transform VHLL programs into HLL programs. A transformation rule maps a program at one abstraction level to a semantically equivalent but more computationally efficient program (Feather, 1989). Transformation-based application generators allow programmers to control which transformation rule is applied when several applicable rules exist (Biggerstaff, 2000). Problems with the generation-based approach are the following:

(1) VHLLs are often defined for an extremely small application domain.

(2) Most VHLLs use mathematical abstractions, such as set theory or logic theory, that are actually more difficult to learn and to use than HLLs.

The VHLLs in the generation-based reuse approach overlap considerably with end-user programming languages (Repenning, 1993), which provide end users with a simple instruction set at the abstraction level of the problem domain so they can modify the behavior of the application systems to their own needs or add new functionality (Girgensohn, 1992; Fischer & Eisenberg, 1994).
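The notion of a transformation rule introduced above, a mapping from one program to a semantically equivalent but cheaper one, can be shown on a toy scale. The sketch below is an assumption-laden illustration, not a description of any cited generator: the expression classes (Expr, Num, Var, Add, Mul) and the class TransformationRules are hypothetical, and the two rules (x + 0 becomes x, x * 1 becomes x) rewrite within a single abstraction level, whereas real application generators chain many rules across levels.

```java
/** A tiny expression language over a single integer variable x. */
interface Expr {
    int eval(int x);
}

final class Num implements Expr {
    final int value;
    Num(int value) { this.value = value; }
    public int eval(int x) { return value; }
}

final class Var implements Expr {
    public int eval(int x) { return x; }
}

final class Add implements Expr {
    final Expr left;
    final Expr right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval(int x) { return left.eval(x) + right.eval(x); }
}

final class Mul implements Expr {
    final Expr left;
    final Expr right;
    Mul(Expr left, Expr right) { this.left = left; this.right = right; }
    public int eval(int x) { return left.eval(x) * right.eval(x); }
}

/** Hypothetical rule set: each rule preserves meaning and removes work. */
class TransformationRules {

    static boolean isNum(Expr e, int v) {
        return e instanceof Num && ((Num) e).value == v;
    }

    /** Applies "e + 0 => e" and "e * 1 => e" bottom-up. */
    static Expr simplify(Expr e) {
        if (e instanceof Add) {
            Expr l = simplify(((Add) e).left);
            Expr r = simplify(((Add) e).right);
            if (isNum(l, 0)) return r;
            if (isNum(r, 0)) return l;
            return new Add(l, r);
        }
        if (e instanceof Mul) {
            Expr l = simplify(((Mul) e).left);
            Expr r = simplify(((Mul) e).right);
            if (isNum(l, 1)) return r;
            if (isNum(r, 1)) return l;
            return new Mul(l, r);
        }
        return e; // numbers and variables are already minimal
    }

    public static void main(String[] args) {
        Expr original = new Mul(new Add(new Var(), new Num(0)), new Num(1));
        Expr simplified = simplify(original);
        System.out.println(original.eval(7) + " " + simplified.eval(7));
    }
}
```

Where several rules apply to the same expression, a transformation-based generator lets the programmer choose among them; this sketch simply applies the rules in a fixed order.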

The composition-based approach reuses existing software products in a new system to avoid repetitive work. As mentioned in the previous section, many types of software products can be reused. However, because this research focuses on the reuse of components, the discussion here is limited to component reuse, although many problems and solutions discussed can be extrapolated to the reuse of other software products. Component reuse is also known as component-based development. Based on the role that components play in the programming process, component reuse is further divided into three categories.

Black-Box Reuse. In black-box reuse, a component is directly reused without modification. A component can be reused as it is or reused through inheritance if the programmer creates a specialized subclass of an existing class component.

White-Box Reuse. In white-box reuse, programmers reuse the component after they have modified the component to their needs. White-box reuse does not contribute as much to the easier maintenance and evolution of software systems as black-box reuse does, but it can reduce development time.

Glass-Box Reuse. In glass-box reuse, programmers do not directly reuse the component; instead, they use it as an example for their own development. For instance, programmers can look at examples to find out how a program plan is realized and build their own system through analogy. Glass-box reuse contributes indirectly to the quality and productivity of programming because examples can reduce the cognitive load of programmers (Neal, 1996).
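The two forms of black-box reuse described above can be sketched with a standard library class. The enclosing class BlackBoxReuse and the subclass CountingList are hypothetical names introduced for illustration; java.util.ArrayList is the reused component. White-box and glass-box reuse are not shown, because they involve editing or studying a component's source rather than calling it.

```java
import java.util.ArrayList;
import java.util.List;

class BlackBoxReuse {

    /** Reuse as it is: the component is used without any modification. */
    static List<String> asIs() {
        List<String> names = new ArrayList<String>();
        names.add("swapInt");
        names.add("swapRanges");
        return names;
    }

    /** Reuse through inheritance: a specialized subclass of the existing class. */
    static class CountingList<E> extends ArrayList<E> {
        private int adds = 0;

        @Override
        public boolean add(E element) {
            adds++; // added behavior; storage is still handled by ArrayList
            return super.add(element);
        }

        int addCount() {
            return adds;
        }
    }

    public static void main(String[] args) {
        System.out.println(asIs());

        CountingList<String> counted = new CountingList<String>();
        counted.add("getInt");
        System.out.println(counted + " after " + counted.addCount() + " add(s)");
    }
}
```

In both cases the source of ArrayList is left untouched, so fixes and improvements made to the component remain available to the reusing program.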

3.2 General Issues of Component Reuse

Despite its many benefits, component reuse has not yet achieved wide success in practice due to the many difficulties associated with it. Component reuse introduces two different processes in the life cycle of software development:

(1) the process of developing for reusable components

(2) the process of developing with reusable components

The first process creates component repositories by identifying, developing, and indexing components. The second process, commonly known as the reuse process, is conducted by programmers who want to reuse. In the reuse process, programmers need to locate, comprehend, and integrate components (see Figure 1.1 in Chapter 1). Widespread success of reuse requires overcoming the managerial, legal, technical, and cognitive issues raised by both processes.

3.2.1 Managerial Issues

Successful systematic reuse requires the support and commitment from the managers of a software development organization. Managers should foster a reuse culture in their organization by encouraging programmers to reuse. For example, managers must stop evaluating the performance of programmers based on the lines of code produced, which, unfortunately, still occurs in many software development organizations. This evaluation criterion obviously discourages reuse by programmers because programs developed with reusable components have fewer lines of code.

To encourage reuse, component repositories, either purchased from third parties or developed in-house, should be set up. To do so, managers must be willing to make long-term investments. This also requires good metric models to analyze the economics of reuse and to identify the most effective reuse strategies. Several reuse metric models have been proposed and are in use in some companies. However, these models still lack formal validation (Frakes & Terry, 1996).

3.2.2 Legal Issues

Protecting the legal rights of creators and consumers of software components is another difficult aspect of instituting reuse. Currently, software is protected by copyrights that are designed to protect products in the “world of atoms”. In the world of atoms, after a product is passed from its owner to its customer, the owner no longer owns it, and the customer possesses full ownership. Software components are made of bits. In the “world of bits”, ownership does not change hands in the same way as in the world of atoms (Cox, 1996). It is extremely easy to reproduce a software component without any loss of quality, and even after the software component is passed from its owner to its customer, the owner can still have the exact same software component as the customer does. This makes it difficult to estimate the cost of reusable components and to define a suitable standard mechanism for charging fees. For example, when a customer uses a purchased component in his or her application and sells many copies of the application to many end users, how much should the customer who purchased the component pay to the original creator? Should the end users of the developed application pay the original creator as well? And if that is the case, how should the law ensure that such payments are made? The lack of a standard mechanism of charging for the use or reuse of components inhibits the emergence of a robust market for components, which is essential to the widespread reuse of components.

3.2.3 Technical Issues

The implementation of systematic reuse has to overcome many technical difficulties. These difficulties exist in both phases of reuse: the initial setup of reusable component repositories and the actual reuse of components during programming.

A great investment, both intellectual and financial, is required to develop and maintain reusable component repositories. First, it is difficult to identify what kind of components should be included in component repositories. Second, the development of reusable components is more difficult than usual development because reusable components should be more general and require higher quality and better documentation. It costs an estimated two or three times more to develop a reusable component than an ordinary component (Jones, 1984; Lim, 1994). Reusable component developers need to balance the inherent tension between the size of a component and its reuse potential: A larger component has more reuse value than a smaller one, but its reuse potential decreases because larger components are often more specific and more difficult to understand.

Another dilemma in setting up a component repository is the relationship between the number of components in a repository and the ease of finding the needed components. A component repository pays off only once it holds a critical mass of available components; however, as the number of components increases, it becomes more difficult for programmers to find the needed components. Only repositories with a large number of components can reap the real benefits of reuse, so an effective searching mechanism must be provided for programmers to find the needed component.
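To make the need for a searching mechanism concrete, the following sketch ranks components by the number of query words that appear in their descriptions. It is a deliberately naive baseline under stated assumptions: the class KeywordRepositorySearch, its methods, and the sample descriptions are hypothetical, and the sketch is not the retrieval mechanism of any system discussed in this dissertation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Naive keyword retrieval over a map from component name to description. */
class KeywordRepositorySearch {

    /** Splits text into a set of lowercase words. */
    static Set<String> words(String text) {
        Set<String> result =
                new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove(""); // split may yield an empty leading token
        return result;
    }

    /** Counts the distinct query words that also occur in the description. */
    static int score(String query, String description) {
        Set<String> shared = words(query);
        shared.retainAll(words(description));
        return shared.size();
    }

    /** Returns the names of matching components, best match first, ties by name. */
    static List<String> search(String query, Map<String, String> repository) {
        final Map<String, Integer> scores = new HashMap<String, Integer>();
        for (Map.Entry<String, String> entry : repository.entrySet()) {
            int s = score(query, entry.getValue());
            if (s > 0) {
                scores.put(entry.getKey(), s);
            }
        }
        List<String> names = new ArrayList<String>(scores.keySet());
        names.sort((a, b) -> {
            int byScore = scores.get(b) - scores.get(a);
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return names;
    }

    /** A three-entry sample repository; the descriptions are written for this sketch. */
    static Map<String, String> sampleRepository() {
        Map<String, String> repository = new LinkedHashMap<String, String>();
        repository.put("Collections.shuffle", "random permutation of a list");
        repository.put("Collections.swap", "swap two elements of a list");
        repository.put("Math.random", "return a random number between zero and one");
        return repository;
    }

    public static void main(String[] args) {
        System.out.println(search("random list", sampleRepository()));
    }
}
```

Exact word matching of this kind fails whenever the programmer's wording differs from the description's wording, and it becomes less selective as the repository grows, which is the difficulty the paragraph above points to.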

3.2.4 Cognitive Issues

Even if good reusable component repositories do exist and a favorable reuse culture is fostered in an organization, reuse still fails if programmers do not put it into practice. After all, reuse must be carried out by programmers. This creates another dilemma for reuse: If programmers do not reuse, the huge investment of building reuse repositories cannot be justified; and if the investment in reuse repositories cannot be justified, companies are less willing to create them, and then programmers have nothing to reuse.

Programmers’ resistance to reuse is often falsely dismissed as a mere attitude problem caused by the so-called not-invented-here (NIH) syndrome. However, recent studies have revealed that most programmers do not have the NIH syndrome (Fafchamps, 1994; Frakes & Fox, 1995). On the contrary, programmers are very motivated to reuse if they know of the component or know how to locate the component (Lange & Moher, 1989; Isoda, 1995). What prevents programmers from reusing is the fundamentally limited capability inherent in human cognition (Curtis, 1989; Fischer et al., 1991): the limitation of short-term memory, the scarcity of human attention, the mental inertia of coping with changes, the subjectiveness of evaluation, and the ambiguity of natural language. Section 3.4 provides a more detailed analysis of what kind of cognitive challenges programmers have to overcome in order to reuse, and what kind of technology is needed to address such challenges.

3.3 Creating Reusable Components

Although this research is not concerned in particular with the creation of reusable components,¹ “anybody who sells a technology for reuse without providing a library of components is a snake oil salesman, a fraud, a charlatan” (Zand et al., 1997), so it is worthwhile to point out the possible sources from which reusable components will come.

As mentioned before, creating reusable components is difficult, time-consuming, and expensive, and repositories of components with good quality are rare. Nevertheless, steady progress has been made recently in several directions.

3.3.1 Domain Analysis and Product-Line Analysis

Domain analysis is the identification, analysis, and specification of common requirements from a specific application domain for reuse on multiple projects within that application domain. Domain analysis produces a domain model, which is used as a starting point to construct specifications and designs for many different systems within the application domain (Kang, 1998). The domain analysis can be either synthetic or evidentiary (Fischer et al., 1995).

The synthetic domain analysis approach resembles the process of developing a single application, but it is more broadly conceived. It starts with an informal description of the application domain, identifies the common features, and develops reusable components corresponding to each feature in the domain.

¹ Components in the repository of the CodeBroker system come from existing libraries. For more details, see Section 6.2.

The evidentiary domain analysis approach starts with existing systems in an application domain, using reverse engineering or design recovery (Ye, 1996) to identify and repackage common components for later reuse.

Product-line analysis is a more comprehensive approach than domain analysis. A product line is a set of products, already existing or planned to be developed, that share a common set of requirements but also exhibit significant variability in requirements (Griss, 2000). Product-line analysis differs from domain analysis in that it not only extracts the commonality of the family of systems but also provides a systematic way to treat their significant variability. In addition to common reusable components, product-line analysis often creates a product-line architecture for the family of related systems into which reusable components can be plugged (Batory et al., 2000).

3.3.2 Commercial Off-the-Shelf

Thousands of companies worldwide are developing their own information systems. There are three problems in this regard: (1) because most companies do not have enough expertise in software development, they cannot produce information systems with the highest quality; (2) because these systems are often developed internally and do not follow interoperation standards, it is very difficult to integrate them; and (3) similar functionality has been repeatedly developed.

Although it may take decades for it to dominate software development, the market for COTS (Commercial Off-the-Shelf) software is rapidly taking shape (Morisio et al., 2000). COTS comes in a variety of types and levels of software, e.g., components that provide specific functionality (such as classes, frameworks, and even complete applications) or tools used to generate code (such as domain-oriented language processors and application generators). Many companies are providing reusable off-the-shelf components for specific domains, and developers can purchase those components on the market. As this trend continues, programmers may be able to create their own systems in the future by integrating components from different component vendors. For example, programmers or even end users may create their own word processing applications by integrating components for outline mode, spell-checking, grammar correction, and diagram drawing purchased on the market in the same way as they purchase standard applications now.

3.3.3 Open-Source Components

With the advent of the Open Source movement (DiBona et al., 1999), many developer communities, such as the Gamelan website² and the Giant Java Tree,³ have formed, through which programmers can freely exchange the products they develop. Moreover, some high-quality reusable component repositories, such as the Jun library (Aoki et al., 2001), have become open-source too. Traditionally, reusable components are created and maintained by the creators who develop them. Programmers who reuse those components are consumers and do not directly contribute to the creation and evolution of components. By giving programmers full access to the source code, Open Source breaks down the binary choice of creators and consumers (Fischer, 1998a) so that consumers can directly participate in the maintenance and improvement of reusable components or even derive new components from existing ones.

This encourages the natural emergence of reusable components with good quality, following the seeding, evolutionary growth, and reseeding (SER) model (Fischer, 1998b). The initial creator develops the component (seeding), and the component experiences evolutionary growth when it is reused and modified by many other programmers (or consumers). As those modifications are incorporated back into the original component (reseeding), the quality and reusability of the component will improve. The Open Source development model is particularly promising in pushing reuse to a large scale because programmers working on complementary projects can each leverage the results of the other freely.

² http://www.gamelan.com
³ http://www.gjt.com

3.4 Understanding the Cognitive Difficulties of Component Reuse

Advances in the creation of reusable components will not translate into actual reuse unless programmers are able to reuse those components during their own system development.

To create appropriate tools to assist programmers in reusing, we need to identify the tasks faced by programmers when they try to reuse and the cognitive skills required to perform those tasks.

3.4.1 Cognitive Engineering

Component reuse is a cognitive activity, which is a goal-directed problem-solving effort.

To understand the complexity of cognitive activities, Norman (1986) developed a method called cognitive engineering that applies what is known from cognitive science to the design and construction of tools that assist the cognitive activities of human beings. Such cognitive tools, including reusable component repository systems, must address the discrepancy between a user’s goal, expressed in terms relevant to the user and his or her task, and the tool’s mechanism, expressed in terms relative to the tool. This discrepancy creates two gulfs: the gulf of execution and the gulf of evaluation. The gulf of execution is the gap from goals to tools, and it must be bridged by three consecutive efforts:

Intention Formation. Users decide to do something with an internal specification of the task created from their goal.

Action Specification. Users externalize the internal specification into a sequence of specified actions.

Action Execution. The actions are executed with the tool.

The gulf of evaluation is the gap from the tool output to the intended goal, and it must be bridged by another three consecutive efforts:

System Perception. Users perceive the output of the tool.

Interpretation. Users interpret the perceived output.

Evaluation. Users compare the interpretation with the original goal.

3.4.2 A Cognitive Model of Component Reuse

Like other cognitive activities, component reuse starts with a goal in the mind of a programmer. To achieve such goals, programmers have to translate their internal intentions into a series of physical actions constrained by a component repository system. Figure 3.1 illustrates the actions needed for the reuse process to succeed, based on the method of cognitive engineering (Ye et al., 2000):

Forming Reuse Intentions. As the first step to start a reuse process, programmers must consciously decide to use the repository with requirements for reusable components in mind. The requirements for reusable components arise from the goal of the programming tasks.

Formulating Reuse Queries. Programmers formulate their intentions as a reuse query in the terms provided by the component repository system. This is the step of action specification.

Retrieving Components. Reusable components matching the queries are retrieved from the repository system. This is the step of action execution.

Choosing Components. When the repository system returns matching components, programmers must choose the appropriate one by comparing them with the reuse intentions. This action of choosing corresponds to the gulf of evaluation for the reuse activity: Programmers read each retrieved component, interpret its meaning, and evaluate whether the component can be reused in their tasks. The evaluation may result in the reformulation of the original query if no suitable components are found because the reuse query has not been appropriate.

Integrating Components. Chosen components are integrated into the current program. If there is not a complete fit, programmers need to modify the component or write a wrapper to adapt it. Integration is also a part of evaluation because the component is finally reused only if it can be integrated into the current programming task.

Figure 3.1: A cognitive model of the component reuse process
[Diagram labels: Programmer; Development Environment; Component Repository; Forming Intentions; Reuse Intentions; Formulating Queries; Reuse Queries; Retrieving; Retrieved Components; Choosing; Retrieval by Reformulation; Chosen Components; Integrating; System Model of Repository; Existence of Components; Information; Action.]

The cognitive model in Figure 3.1 is a refinement of the location-comprehension-modification cycle (LCM-cycle) of a reuse process in Figure 1.1 (Fischer et al., 1991), with special emphasis on the location process, which is the focus of this research. Although the LCM-cycle acknowledges and stresses the difficulty of formulating appropriate reuse queries in the location process, it does not point out that formulating reuse queries must be preceded by the forming of reuse intentions. Furthermore, the comprehension step in the LCM-cycle does not differentiate two different levels of comprehension: comprehension for choosing components and comprehension for integrating components. Comprehension for choosing is still a part of the location effort because it may result in the reformulation of queries.

3.4.3 Cognitive Challenges of Component Reuse

Each action in this cognitive model of component reuse poses challenges to programmers and may deter the success of reuse without appropriate tool support.

3.4.3.1 Vocabulary Learning: A Prerequisite for Forming Reuse Intentions

Reusable components help programmers think at higher levels of abstraction and increase the “vocabulary” programmers can use to create and interpret program designs (Krueger, 1992).

However, programmers must learn the syntax and the semantics of the new vocabulary to take advantage of reusable components. At the least, programmers should know the existence of the components; otherwise, they are not able to form reuse intentions in the first place, and reuse would fail in this very first step. Vocabulary learning is a major part of the cognitive barrier to reuse (Brooks, 1995; Fichman & Kemerer, 1997).

Unlike the syntax of programming languages, which can be learned through schooling or tutoring before programmers start working, the mastery of reusable components cannot be completed in classrooms or merely by reading books (Ye, 1998). Due to the large volume of reusable components and their constantly evolving nature, total coverage is impossible and obsolescence is unavoidable. Moreover, learning components is less effective when the components are separated from their use context; components are better learned when they are needed for a programming task. Therefore, component learning needs to be integrated with the work in which the components are reused, and programmers should learn components on demand, that is, learn the component when it is needed (Fischer, 1991). However, a pitfall for the learning-on-demand model is that programmers need to learn a component because they do not know it, but because they do not know it, they may settle on a suboptimal solution by creating their own program instead of reusing the existing component. To support learning-on-demand, component repository systems should be able to identify learning (and reusing) opportunities by connecting programmers to the components that can be reused in their current task.

3.4.3.2 Conceptual Gap in Formulating Reuse Queries

Formulating reuse queries refers to transforming internal reuse intentions into explicit, external reuse queries. Reuse intentions, derived from development activities, are conceptualized in the situation model (Kintsch, 1998) that is related to the application task to be solved and to the concerns of the programmer. A situation model is the mental model programmers have of their environment. A system model is the “actual” model of a computer system. For a component to be retrieved, these intentions need to be mapped from the user’s situation model onto the system model, namely the repository system (Fischer et al., 1991). Without enough knowledge about the system model of component repository systems, programmers cannot formulate reuse queries appropriately. This conceptual gap between situation model and system model is another cognitive barrier to reuse.

There are two types of conceptual gap between situation model and system model: vocabulary mismatch and abstraction mismatch. The vocabulary mismatch refers to the inherent ambiguity in most natural languages. Thanks to the richness of natural languages, people use a variety of words to refer to the same concept. Based on their systematic study of word use of ordinary people in different domains, Furnas et al. have found that the probability that two persons choose the same word to describe a concept is less than 20% (Furnas et al., 1987). Even well-trained indexing experts have a 20% disparity on average in choosing terms to describe the same document (Harman, 1995).

The abstraction mismatch refers to the difference of abstraction levels in requirements and component descriptions. Programmers deal with concrete problems and thus tend to describe their requirements concretely, whereas reusable components are often described in abstract concepts because they are designed to be generic so they can be reused in many different situations. For example, in one experiment to evaluate the CodeBroker system, one subject described his task as follows:⁴

/** This class contains methods for converting between western-style numbers (three numbers in a set with a comma) and Chinese style numbers (four numbers followed by a comma). For example, 1,000,000--> 100,0000. */

Another subject initially described the same task in a similar way:⁵

/** Takes a string with a Chinese formatted number and outputs a western formatted number. */

This task can be easily implemented by setting the group size to 4 with the method setGroupingSize of the class java.text.DecimalFormat. However, the description of this method (as follows) is abstract: it describes the grouping of digits without mentioning western style or Chinese style in particular.

public void setGroupingSize(int newValue)
Set the grouping size. Grouping size is the number of digits between grouping separators in the integer portion of a number. For example, in the number "123,456.78", the grouping size is 3.
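As a concrete illustration of this abstraction mismatch, the following minimal Java sketch shows how the conversion task reduces to a call to setGroupingSize. The class name GroupingDemo and its helper methods are hypothetical names introduced only for this illustration, and the use of Locale.US symbols is an assumption made so that the comma serves as the grouping separator regardless of the default locale.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo class; only DecimalFormat and its methods are real API.
final class GroupingDemo {
    // Builds an integer formatter that groups digits in blocks of the given size.
    static DecimalFormat newFormat(int groupingSize) {
        DecimalFormat df = new DecimalFormat("#,##0", DecimalFormatSymbols.getInstance(Locale.US));
        df.setGroupingSize(groupingSize);
        return df;
    }

    // Formats a value with the given grouping size (3 = western style, 4 = Chinese style).
    static String format(long value, int groupingSize) {
        return newFormat(groupingSize).format(value);
    }

    // Reads a number grouped in fours and writes it grouped in threes.
    static String chineseToWestern(String chinese) {
        try {
            long value = newFormat(4).parse(chinese).longValue();
            return format(value, 3);
        } catch (ParseException e) {
            throw new IllegalArgumentException("not a grouped number: " + chinese, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(1000000L, 4));          // prints 100,0000
        System.out.println(chineseToWestern("100,0000")); // prints 1,000,000
    }
}
```

Nothing in the API used here mentions “Chinese” or “western” numbers, which is why a query phrased in the subjects’ own terms is unlikely to match the method’s description.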

3.4.3.3 Effective Retrieval Mechanisms in Retrieving

The retrieval process finds the components that match given reuse queries. An effective retrieval mechanism, including a representation schema for indexing and a matching criterion between a query and a component, is essential.

⁴ The task asks the programmer to implement a program that converts a number written in Chinese format to an equal number in western format. The traditional way of writing big numbers in Chinese is to group digits in fours and add a comma before each fourth digit from the right, because the Chinese units are ten thousand (wan), a hundred million (yi), and a thousand billion (zhao), instead of thousand, million, and billion.

⁵ He later realized the description was not good enough because he had not found what he wanted, and modified it to “/** Takes a string with a Chinese formatted number (numbers grouped into 4 columns separated by commas) and outputs a western formatted number (3 columns separated by commas). */”, which made the system deliver the component setGroupingSize.

Many retrieval mechanisms have been proposed in the past (for more detailed descriptions, see related work in Section 9.2). There are three major approaches: text-based, descriptor-based, and formal specification-based. In text-based approaches, components are represented by their textual documents, and information retrieval technology is used to match components to queries (Maarek et al., 1991). In descriptor-based approaches, components are represented by a set of selected descriptors. The semantic relationships among those descriptors are captured in a predetermined structure that can be specified by a semantic network (Henninger, 1997), an AI frame (Ostertag et al., 1992), a taxonomic category system (Devanbu et al., 1991), or fuzzy set theory (Damiani et al., 1997). In specification-based approaches, components are represented with formal specification languages, and automatic theorem-proving systems (Zaremski & Wing, 1997) or specification refinement systems (Mili et al., 1997a) are used to determine whether a component matches a query, which is also written in a formal specification language.

In terms of complexity of representation schemata, the text-based approach is the simplest and the specification-based approach the most complicated. In general, a complete and precise representation can make the matching more precise and retrieval more effective. However, because the same representation is also used by programmers to specify their reuse queries, the representation schema is greatly limited by programmers’ willingness to formulate long and precise queries. There is no point in representing every bit of relevant information about a component if a programmer barely has the patience for typing string search regular expressions (Mili et al., 1995).
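To make the text-based approach concrete, the sketch below ranks components by the overlap between the words of a query and the words of each component’s textual description (Jaccard similarity). It is a deliberately naive stand-in chosen for brevity, not the matching criterion of any system cited above; the class TextRetrievalSketch, its methods, and the two-entry repository are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of text-based retrieval by word overlap.
final class TextRetrievalSketch {
    // Splits free text into a set of lowercase word tokens.
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: number of shared terms divided by number of distinct terms.
    static double similarity(String query, String document) {
        Set<String> q = terms(query);
        Set<String> d = terms(document);
        Set<String> all = new HashSet<>(q);
        all.addAll(d);
        if (all.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(q);
        shared.retainAll(d);
        return (double) shared.size() / all.size();
    }

    // Returns component names ordered by decreasing similarity to the query.
    static List<String> rank(String query, Map<String, String> repository) {
        List<String> names = new ArrayList<>(repository.keySet());
        names.sort(Comparator.comparingDouble(
                (String name) -> -similarity(query, repository.get(name))));
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> repo = new LinkedHashMap<>();
        repo.put("toUpperCase", "Converts all of the characters in this String to upper case");
        repo.put("setGroupingSize",
                "Set the grouping size, the number of digits between grouping separators");
        // The query shares five words with the second description and one with the first.
        System.out.println(rank("number of digits between separators", repo));
    }
}
```

Because the score depends only on shared words, a query that expresses the same need in different vocabulary scores poorly, which is the vocabulary mismatch discussed in Section 3.4.3.2.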

3.4.3.4 Retrieval by Reformulation in Retrieving and Choosing

Because effective use of any information retrieval system requires users to be fairly familiar with the structure of the information system and its representation schema, it is difficult for most users to create a well-defined query on their first attempt (Jones, 1997). Component repository systems can, at best, retrieve those components that match the queries submitted by programmers, but these do not necessarily match their intentions, many of which are not articulated.

Retrieval by reformulation is a mechanism that allows users to incrementally improve their queries to match their intentions after they have interpreted and evaluated the retrieved results and have explored the underlying structure of the information system (Williams et al., 1982; Fischer & Nieper-Lemke, 1989). If programmers cannot find the needed component from the first retrieval result, they can reformulate their query by using more appropriate terms that they learn from retrieved components, or they can narrow the search range by taking advantage of the structure of the repository, which they may not have known before exploring the retrieved results.

3.4.3.5 Component Comprehension in Choosing and Integrating

Being able to comprehend components is necessary both for choosing the right component and for integrating the chosen component.

Comprehension for choosing is focused on what the component does, and it is conducted in two stages: information discernment and detailed evaluation (Carey & Rusli, 1995). At the stage of information discernment, programmers avoid spending too much time by quickly scanning the component and its description to decide whether this component is related to their current task, and thereby also avoid any deep understanding at this point (Lange & Moher, 1989). The process of information discernment may result in the reformulation of queries if programmers find the retrieval results are not satisfactory. Only when a promising component is found do programmers start to evaluate it extensively.

To integrate a component into their programs, programmers need to understand the component’s functionality, its usage, and even its implementation details, especially in cases of white-box reuse and glass-box reuse (Section 3.1.2). Executable examples that use the component prove to be very useful to help programmers quickly understand how to reuse the component in their own programming task (Redmiles, 1992; Aoki et al., 2001).

Chapter 4

The Component Locating Problem

Before programmers can take advantage of reuse, they must be able to locate reusable components quickly and easily. A component repository system is an information system that helps programmers locate reusable components. It has three connotations: a collection of reusable components, a retrieval mechanism, and a retrieval interface. Research on component repository systems has focused mainly on the effectiveness of retrieval mechanisms. However, even the most sophisticated and powerful component repository systems will not be effective if programmers make no attempt to reuse. Studies on reuse have shown that no attempt to reuse is the most significant barrier to reuse (Figure 1.2) (Frakes & Fox, 1996). This chapter analyzes the phenomenon of “no attempt to reuse” and points out that it is caused by the existence of information islands and perceived low reuse utility. As a solution, the concept of the active component repository system is introduced and its benefits are analyzed.

4.1 No Attempt to Reuse

4.1.1 Three Reuse Modes

As a part of the knowledge-intensive programming process, reuse is a process of applying the knowledge of reusable components to programs. Because few programmers know all about reusable components, component repository systems are introduced to facilitate the easy application of reusable components during programming. Based on the source of the knowledge of reusable components, three modes of reuse exist: reuse-by-memory, reuse-by-recall, and reuse-by-anticipation.

Reuse-by-Memory. In the reuse-by-memory mode, while designing a new program, programmers notice similarities between the new program and reusable components that they have learned in the past and know very well. Therefore, they can reuse these known components easily during the programming, even without the support of a component repository system, because their memory assumes the role of the repository system.

Reuse-by-Recall. In the reuse-by-recall mode, while developing a new program, programmers vaguely recall that the repository contains some reusable components with similar functionality, but they do not remember exactly which components they are. They need to search the repository to find what they need. In this mode, programmers are often determined to find the needed components. An effective retrieval mechanism is the main concern for component repository systems supporting this mode. The successful operation of reuse in this mode needs both knowledge from programmers and knowledge from the repository.

Reuse-by-Anticipation. In the reuse-by-anticipation mode, programmers formulate reuse intentions based on their anticipation of the existence of certain reusable components. Even though they are not certain that relevant components exist, their knowledge of the domain, the programming environment, and the repository is enough to motivate them to search in hopes of finding relevant components. In this mode, if programmers cannot find quickly enough what they want from the repository, they will soon give up reuse (Mili et al., 1995). The repository is the main source of knowledge for the successful operation of reuse in this mode.

Programmers have little resistance to the first two modes of reuse. As Isoda has reported, programmers reuse components repeatedly once they have learned them (Isoda, 1995). Lange and Moher, in their empirical study on programming and reuse strategies, have found that programmers search extensively for the components they know exist even if they may not be able to name them a priori (Lange & Moher, 1989). This explains why individual ad hoc reuse has been taking place while organization-wide systematic reuse has not met with the same success: programmers have individual reuse repositories in their memories so they can reuse by memory or reuse by recall (Mili et al., 1995). For those components that have not yet been internalized into their memories, programmers have to resort to the mode of reuse-by-anticipation. The activation of the reuse-by-anticipation mode relies on two enabling factors:

Programmers anticipate the existence of reusable components.

Programmers perceive that the cost of the reuse process is lower than that of programming from scratch.

Figure 4.1: Different levels of programmers’ knowledge about a component repository
[Diagram labels: L1 (Well Known); L2 (Vaguely Known); L3 (Belief); L4; Unknown Components.]

4.1.2 Information Islands in Component Repositories

Unfortunately, programmers’ anticipation of available reusable components does not always match real repository systems. Empirical studies on the use of high-functionality computer systems (component repository systems being typical examples of them) have found there are four levels of users’ knowledge about a computing system (Figure 4.1) (Fischer, 2001).

In Figure 4.1, ovals represent the collection of components that are in a particular knowledge level of programmers, and the rectangle represents the actual information space (namely, the whole collection of items in an information system), labeled L4. L1 includes those reusable components that are well known, easily employed, and regularly reused by a programmer. L1 corresponds to the reuse-by-memory mode. L2 contains components known vaguely and reused only occasionally by a programmer; they often require further confirmation before being reused. L2 corresponds to the reuse-by-recall mode. L3 represents what programmers believe about the repository. L3 corresponds to the reuse-by-anticipation mode.

Many components exist in the area of (L4 - L3), and their existence is not known to programmers. Consequently, there is no possibility for programmers to reuse them, simply because people do not ask for what they do not know (Fischer & Reeves, 1995). Components in (L4 - L3) thus become information islands (Engelbart, 1990; Ye & Fischer, 2000), inaccessible to programmers without appropriate tools. Repositories are not static; they are expected to evolve over time, and this will increase the size of (L4 - L3).
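The notion of information islands can be expressed as a set difference. The following sketch, with hypothetical component names and a hypothetical class InformationIslands, computes (L4 - L3) for a toy repository and shows that the islands grow when the repository evolves while the programmer’s beliefs stay the same.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: information islands as the set difference L4 - L3.
final class InformationIslands {
    // Returns the components that are in the repository (L4) but outside
    // what the programmer believes the repository contains (L3).
    static Set<String> islands(Set<String> l4Repository, Set<String> l3Believed) {
        Set<String> unknown = new TreeSet<>(l4Repository);
        unknown.removeAll(l3Believed);
        return unknown;
    }

    public static void main(String[] args) {
        Set<String> l4 = new TreeSet<>(Set.of("sort", "parse", "setGroupingSize"));
        Set<String> l3 = Set.of("sort", "parse");
        System.out.println(islands(l4, l3)); // prints [setGroupingSize]

        // The repository evolves; the programmer's beliefs do not.
        l4.add("digest");
        System.out.println(islands(l4, l3)); // prints [digest, setGroupingSize]
    }
}
```

No query formulated from L3 alone can name the components in the result, which is why they remain inaccessible unless a tool brings them to the programmer’s attention.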

Many reports about reuse experiences of industrial software companies illustrate this inhibiting factor of reuse. Devanbu et al. have reported that because developers are unaware of reusable components, they repeatedly re-implement the same function; in one case, this occurred ten times (Devanbu et al., 1991). This kind of behavior is also observed as typical among the four companies investigated by Fichman and Kemerer (Fichman & Kemerer, 1997). From the experience of promoting reuse, Rosenbaum and DuCastel have concluded that making components known to developers is a key factor for successful reuse (Rosenbaum & DuCastel, 1995).

4.1.3 Low Reuse Utility

Human beings often try to be utility-maximizers in the decision-making process (Reisberg, 1997), and programmers are no exception. When programmers perceive that reuse utility, which is the ratio of reuse value to reuse cost, is too low, they do not make an attempt to reuse (Sen, 1997). Because there is no easy way for programmers to estimate reuse value and reuse cost objectively, the estimation made by programmers during programming is quite subjective and suffers from cognitive biases against reuse; they tend to underestimate reuse value and overestimate reuse cost.

4.1.3.1 Underestimated Reuse Value

The value of reuse is multifold. As stated in Section 2.4, reuse value includes:

(1) reduced development time

(2) improved quality

(3) easy maintenance

(4) improved evolvability

(5) increased problem-framing ability

However, not all programmers recognize reuse value when they are under a tight schedule to finish their current program. Most of the reuse value is long-term and shows its benefit only after the program has been developed; what interests programmers most are the short-term benefits. In his investigation of reuse at NTT (Nippon Telegraph and Telephone Corporation), Isoda concludes that unless programmers find immediate benefits in applying reusable components, they will not, of their own free will, perform reuse (Isoda, 1995). It is human nature to pay attention to immediate benefits and ignore long-term benefits (Grudin, 1994) because human beings are unable to think coherently about the remote future and particularly about the distant consequences of their actions (Simon, 1996). To encourage programmers to recognize the full benefits of reuse, many researchers have called for reuse education. Despite its importance, reuse education alone has not brought reuse to fruition (Joos, 1994) because being told that “it is for your own good” seldom provides adequate motivation for programmers to change their behavior (Simon, 1996). Some organizations have also tried to provide monetary rewards to programmers who reuse, which has not been successful either (Frakes & Fox, 1995).

4.1.3.2 Overestimated Reuse Cost

As analyzed in Section 3.4.3, the cost of reuse incurred at reuse time includes:

(1) the cost of forming reuse intentions

(2) the cost of formulating reuse queries

(3) the cost of operating the repository system to retrieve components

(4) the cost of choosing components

(5) the cost of understanding and modifying components

(6) the cost of integrating components

In addition, when reuse repository systems are separated from the current programming environment, reuse cost includes the cost associated with switching back and forth between the programming environment and the reuse repository system, which causes the loss of working memory and the disruption of workflow.

Depending on the reuse mode, only some of these costs may be involved. In the reuse-by-memory mode, the cost of reuse is reduced to the cost of (6) only. In the reuse-by-recall mode, the costs of (1), (2), and (4) are quite small because programmers know what to look for and where to find the components. In the reuse-by-anticipation mode, all of these costs are involved, and due to the following two cognitive biases against reuse, Einstellung and loss aversion, those costs are often overestimated.

Einstellung. Human beings often display Einstellung in problem solving. Einstellung, the German word for “attitude,” refers to the mechanization of problem-solving strategy. Once problem solvers discover a strategy that “gets the job done,” they are less likely to discover new strategies until they are completely stuck (Reisberg, 1997). Due to Einstellung, human beings often stick with what they know best. As the term production paradox (Carroll & Rosson, 1987) suggests, even though there is an effective strategy for solving a problem, most people are not motivated to learn this new strategy and will “play it safe” by using a suboptimal solution that they personally consider to be safe. Even today, for most programmers, building programs from scratch is still the proven strategy. This partially explains the observed phenomenon of “programmer machoism”: programmers have a tendency to chronically underestimate how difficult a programming task is and overestimate the cost of reuse (Graham, 1995).

Loss Aversion. Another known phenomenon in the decision-making process of human beings is loss aversion, the tendency to be far more sensitive to potential loss than to potential gain (Reisberg, 1997). Starting a reuse process requires a mental switch. The demand on working memory and time is immediate, and the potential gain is unclear because programmers are not sure whether the needed component exists, whether they are able to find it even if it does exist, and whether they are able to understand and modify it even if they find it.
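The cost accounting of Section 4.1.3.2 and the definition of reuse utility as the ratio of reuse value to reuse cost can be sketched as follows. All names and numeric factors are assumptions made for illustration: SMALL_FACTOR is a hypothetical stand-in for costs that are “quite small,” and the two bias factors merely model underestimated value and overestimated cost. None of these values comes from empirical data.

```java
// Hypothetical sketch of mode-dependent reuse cost and (perceived) reuse utility.
final class ReuseUtilitySketch {
    enum Mode { MEMORY, RECALL, ANTICIPATION }

    // Assumed weight standing in for costs described as "quite small".
    static final double SMALL_FACTOR = 0.125;

    // costs[0] .. costs[5] correspond to cost items (1) .. (6) of Section 4.1.3.2.
    static double totalCost(Mode mode, double[] costs) {
        if (costs.length != 6) {
            throw new IllegalArgumentException("expected six cost items");
        }
        switch (mode) {
            case MEMORY:
                // Only item (6), the cost of integrating components, remains.
                return costs[5];
            case RECALL:
                // Items (1), (2), and (4) are small; (3), (5), and (6) count in full.
                return SMALL_FACTOR * (costs[0] + costs[1] + costs[3])
                        + costs[2] + costs[4] + costs[5];
            default:
                // ANTICIPATION: all six items are involved.
                double sum = 0.0;
                for (double c : costs) {
                    sum += c;
                }
                return sum;
        }
    }

    // Reuse utility is the ratio of reuse value to reuse cost.
    static double utility(double value, double cost) {
        return value / cost;
    }

    // valueBias <= 1 models underestimated value; costBias >= 1 models overestimated cost.
    static double perceivedUtility(double value, double cost, double valueBias, double costBias) {
        return utility(value * valueBias, cost * costBias);
    }

    public static void main(String[] args) {
        double[] costs = {1, 2, 3, 4, 5, 6};
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " cost: " + totalCost(mode, costs));
        }
        double cost = totalCost(Mode.ANTICIPATION, costs);
        System.out.println("actual utility:    " + utility(42, cost));
        System.out.println("perceived utility: " + perceivedUtility(42, cost, 0.5, 2.0));
    }
}
```

With these assumed factors the perceived utility is a quarter of the actual utility, which illustrates how the two biases compound in the reuse-by-anticipation mode.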

4.2 Paradigm Shift: From Development-with-Reuse to Reuse-within-Development

4.2.1 Development-with-Reuse

Designers of current component repository systems are not particularly concerned with the problem that programmers make no attempt to reuse because these systems are designed to support the development-with-reuse paradigm (Rada, 1995). The development-with-reuse paradigm views reuse as a stand-alone process, independent of the current programming process and environment. Consequently, component repository systems are studied as self-contained systems, with no consideration of the context from which the needs for reusable components are derived and the components are reused. Their major focuses have been on the retrieval mechanisms only, with the assumption that programmers have no difficulty in forming reuse intentions and formulating reuse queries. Such systems require programmers to initiate the reuse process by switching from their current development environments to component repository systems with properly formulated reuse queries. Whenever a programming task, either from the original task or as a result of further decomposition, arises, programmers must divert from their current process to execute the reuse process on their own initiative. If they fail to do so, component repository systems are of no use, and reuse will not happen.

[Figure 4.2 diagram labels: Program Development Process; Reuse Process; Component Repository; knowing or anticipating the existence; articulating the queries and retrieving; evaluating; integrating; Einstellung; loss aversion; lost working memory; conceptual gap; low reuse utility; unknown components]

Figure 4.2: The development-with-reuse paradigm

In this paradigm, programmers have to take the initiative to overcome the huge gap between program development and reuse.

Figure 4.2 depicts the development-with-reuse paradigm and its relationship with the overall programming process. At the left side are program development processes and environments, and at the right side are reuse processes and systems. They are separated from each other, and for reuse to succeed, programmers have to bridge the cognitive gap between programming tasks and component repository systems by making an attempt to reuse on their own initiative.

4.2.2 Reuse-within-Development

Development-with-reuse is derived from the methodology-centered perspective, which views methodology as the most important thing and requires that programmers adapt their practice to incorporate the new methodology. In contrast, the user-centered perspective (in this case, the programmer-centered perspective) focuses on the behavior of programmers and aims at melding the new development methodology (reuse) into the current practice of programmers (Jarzabek & Huang, 1998).

Development-with-reuse is also a result of the company-centered perspective, which views reuse as a method that is profitable for the company, without considering the difficulties encountered by individual programmers (Aaen, 1992). In contrast, the programmer-centered perspective stresses the importance of offering immediate benefits for programmers. Instead of being driven only by the long-term productivity and quality gains for the company, it attempts to appeal to individual programmers (Winograd, 1995).

Development-with-reuse may work if all programming activities can be planned beforehand. However, as analyzed in Chapter 2, programming is by nature opportunistic: new programming tasks arise all the time during the whole period of programming; so do the reuse opportunities. Reuse cannot be completely planned a priori; it takes place within the context and the process of development (Sen, 1997). The needs for reusable components cannot be determined in advance, either; instead, they emerge throughout the whole programming process.

In order to put programmers into the center of the design of component repository systems and to put the reuse into the context of programming activities as a whole, a paradigm shift from development-with-reuse to reuse-within-development is needed (Ye, 2001a). Reuse-within-development views reuse as a supporting, not a replacing, method to the current practice of programmers. It requires that the reuse process be smoothly melded into the current programming process and environment so that there is no context change from programming to reuse. Furthermore, it stresses that reuse should be immediately beneficial to each individual programmer.

To support reuse-within-development, component repository systems should

(1) be integrated seamlessly with the programming environment

(2) help programmers identify reuse opportunities whenever they arise during their programming processes

(3) provide immediate access, from current programming environments, to components potentially reusable in the current development situation so that programmers do not need to switch contexts between programming and reuse

4.3 Information-Enriched Workspaces

Integrating a component repository system and a programming environment creates an information-enriched workspace (Ye, 2001b). An information-enriched workspace is a special working environment (or programming environment, in this case) that is augmented with an information display that constantly shows the information immediately needed by users. In an information-enriched workspace, the cost structure of accessing needed information is tuned to the requirements of the work (programming) process using it (Robertson et al., 1993) because it provides immediate access to the most needed information for users without interrupting their workflow.

An observation of our own physical working environments helps us to better understand the information needs of users and the concept of an information-enriched workspace. When we are working, we have memos, paper, and books on our desks serving as the immediate storage of information mostly relevant to our current task; we have file cabinets and bookshelves as secondary storage to keep less relevant information; furthermore, libraries and bookstores serve as tertiary storage to complement the lack of information in our offices. In such a hierarchical structure of information storage, while the relevance to our task and the frequency of access decrease, the cost of accessing the information increases rapidly, by orders of magnitude (Card et al., 1991). An independent information repository system (or component repository system) can be abstractly thought of as secondary information storage from the perspective of computer users because information stored there is accessible only after users have stopped working on their current tasks and switched from their workspaces. In contrast, the information display in an information-enriched workspace takes the role of immediate storage by storing frequently needed or immediately needed information that can be readily accessed by users without interrupting the workflow.

Figure 4.3 shows an information-enriched workspace that helps programmers reuse within development. A portion of the programming environment is now dedicated to the display of information on reusable components. This information display presents components that are extracted from the component repository based on their relevance to the programming task conducted in the programming environment. The interface to the component repository system becomes transparent to programmers because programmers now, within the programming environment, can evaluate and integrate reusable components without operating the component repository system directly.

[Figure 4.3 diagram labels: Program Development Process; Component Repository]

Figure 4.3: The reuse-within-development paradigm

Task-relevant components from the repository are now automatically presented in the information display, which is a part of the programming environment. In this paradigm, reuse is seamlessly integrated with the program development process.

4.4 Active Component Repository Systems

To create a programming environment (workspace) enriched with information on reusable components, the component repository system needs to predict programmers’ needs for reusable components and to automatically present those needed components in the information display. This task can be supported by the active information delivery mechanism.

4.4.1 The Concept of Active Component Repository Systems

In contrast to conventional information access mechanisms, in which users explicitly initiate the information search process, active information delivery presents relevant information to users without having been asked for it explicitly (Fischer et al., 1993). The information access mechanism requires users to articulate and specify clearly their information needs, whereas the information delivery mechanism infers information needs. Support for information access is indispensable in reusable component repository systems because when programmers recall or anticipate the existence of reusable components, they must be able to locate them. However, reusable component repository systems need to be complemented with the information delivery mechanism so that programmers can reuse those components they fail to anticipate.

Component repository systems equipped with active information delivery mechanisms are called active component repository systems, or active component repositories. Traditional component repository systems that employ information access mechanisms solely are called passive component repository systems, or passive component repositories. Active component repositories autonomously extract cues that reveal the programming task in a programming environment, and based on such cues, they formulate reuse queries on behalf of programmers and deliver relevant components in the information display embedded in the programming environment.
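The cycle of extracting cues, formulating a query, and delivering components can be illustrated with a minimal sketch. It is not the mechanism developed in this dissertation: the repository contents, the stop-word list, the function names, and the word-overlap scoring are all assumptions chosen only to make the idea concrete.

```python
import re
from collections import Counter

# Hypothetical repository: component name -> short textual description.
REPOSITORY = {
    "shuffle": "randomly permute the elements of a list",
    "sort": "sort the elements of a list into ascending order",
    "read_lines": "read all lines of a text file into a list",
}

# Assumed stop-word list; a real system would use a larger one.
STOP_WORDS = {"a", "an", "the", "of", "to", "into", "and", "in", "all"}


def extract_cues(source_text):
    """Collect word cues from the comments and identifiers in an editor buffer."""
    # Split camelCase and snake_case identifiers into separate words.
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", source_text).replace("_", " ")
    words = re.findall(r"[A-Za-z]+", spaced.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def deliver(source_text, top_n=2):
    """Use the cues as an implicit query and rank components by word overlap."""
    cues = extract_cues(source_text)
    scored = []
    for name, description in REPOSITORY.items():
        words = set(re.findall(r"[a-z]+", description)) - STOP_WORDS
        score = sum(cues[w] for w in words)
        if score > 0:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


buffer_text = '''
# randomly permute the cards of the deck
def shuffleCards(deck):
'''
print(deliver(buffer_text))  # ['shuffle']
```

The programmer never writes a query in this sketch; the comment and the identifier in the buffer serve as the query.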

4.4.2 Benefits of Active Component Repository Systems

Active component repository systems promote reuse by offering the following benefits (Ye & Fischer, 2000).

A Bridge to Information Islands. As analyzed in Section 4.1.2, the existence of reusable components does not guarantee their reuse if programmers do not anticipate their existence (see Figure 4.1). Passive component repository systems can only help programmers locate those components whose existence is anticipated. Active component repository systems can set up a bridge to information islands in a component repository. They lower the barrier of the vocabulary learning problem by supporting learning-on-demand because they can deliver those components that programmers have not learned and yet are reusable in the current task.

Well-Informed Decision-Making. Psychological studies on the decision-making process of human beings have shown that the presence of other alternatives affects decisions dramatically (Reisberg, 1997). The presence of actively delivered reusable components reminds programmers that reuse is an alternative to their current approach of programming from scratch, and alleviates the cognitive bias against reuse caused by Einstellung in programming. Immediately accessible reusable components can contribute to the activation of associated program plans, similar to how components in memory do (see Section 2.3). Active component repository systems serve as extensions to the memory of programmers, and expand the possible solution space that is bounded by the limited knowledge of programmers.

Reduction of Perceived Reuse Cost. Compared with stand-alone passive component repository systems, readily accessible reusable components in an information-enriched workspace supported by active component repository systems reduce the perceived cost of reuse greatly because this approach requires less commitment of resources from programmers. Programmers can quickly decide whether suitable components exist by scanning the actively delivered reusable components, and there is no conscious context switch between programming and reusing. In passive component repository systems, such a decision (whether reusable components exist or not) can be made only after programmers have committed considerable working memory and attention to the process of component location. As programmers switch from programming to reuse, their working memory of the programming activities decays with a half-life of about 15 seconds (Norman, 1986). Therefore, the longer they spend on locating components, the more working memory is lost. Conversely, the near-instantaneous decision-making afforded by active component repository systems allows programmers to stay on task.
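To give a rough sense of what a 15-second half-life implies, the following sketch computes the fraction of working memory retained after a given time away from the programming task. Reading the half-life as simple exponential decay is an assumption made here for illustration; the function name and its parameters are hypothetical.

```python
def retained_fraction(seconds_away, half_life=15.0):
    """Fraction of working memory retained after `seconds_away` seconds.

    Assumes simple exponential decay, which is one possible reading of
    "half-life" and not a claim about the underlying cognitive model.
    """
    return 0.5 ** (seconds_away / half_life)


for seconds in (15, 30, 60):
    print(seconds, retained_fraction(seconds))
# 15 0.5
# 30 0.25
# 60 0.0625
```

Under this assumption, a one-minute search in a stand-alone repository system would leave only about 6 percent of the original working memory of the programming task.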

Reduction of Actual Reuse Cost. Active component repository systems reduce the actual cost of reuse because programmers do not need to go through the location process explicitly. As mentioned in the previous paragraph, shifting attention from current work to the operation of component repository systems causes the loss of precious working memory and interrupts the workflow. Formulating internal reuse intentions into external reuse queries also presents a difficult cognitive activity that requires programmers to overcome the conceptual gap between situation model and system model (Shneiderman, 1998). Active component repository systems (1) allow programmers to interact directly with reusable components instead of interacting with the component repository system; (2) improve the readiness-to-hand of reusable components because the cognitive breakdown caused by the operation of component repository systems is bridged; and (3) reduce the cost of reusing unknown or anticipated components to the cost of reuse-by-recall or reuse-by-memory.

4.4.3 Full Support of Component Locating

In passive component repository systems, two kinds of knowledge are required for programmers to locate reusable components successfully:

(1) They must know something about or at least the existence of the components.

(2) They must know how to operate the repository system correctly by submitting well-defined queries or browsing efficiently.

Active component repository systems do not have such requirements, and they fill the void unsupported by passive systems.

| Reuse Mode | Knowledge Required | Knowledge Sources | Features of Locating | Needed Support |
|---|---|---|---|---|
| Reuse-by-memory | Knowing components well | Programmer’s head | Replaced by learning efforts in advance | None |
| Reuse-by-recall | Knowing components vaguely | Programmer’s head and repository | Specific search | Browsing or querying |
| Reuse-by-anticipation | Anticipating the existence of components and knowing the operation of repository systems | Mostly repository | Open-end search | Browsing, querying, or delivery |
| No attempt to reuse | Not knowing the existence or the operation | Repository only | No user-initiated search | Delivery |

Table 4.1: Relations between reuse mode, knowledge sources, and tool support

Table 4.1 summarizes the knowledge required from programmers and the needed support from repository systems to locate reusable components. Active delivery mechanisms not only overcome the “no attempt to reuse” phenomenon, but also support reuse-by-anticipation by speeding up the locating process.

Chapter 5

Active Information Systems

Active component repository systems are a subclass of active information systems that support the information delivery mechanism. The information delivery mechanism is a complementary approach to the information access mechanism and is needed in situations in which users are unable to articulate the need for information or are unaware that they may profit from information. Examples of active information systems include, among others, active help systems (Fischer et al., 1985; Virvou & Du Boulay, 1999); critic systems (Fischer et al., 1993); Microsoft’s “Tip of the Day” and Office Assistants; and information agents (Lieberman, 1997; Nardi et al., 1998). This chapter describes the challenges involved in the implementation of active information systems, possible solutions, and their applicability to component repository systems.

5.1 Basic Issues of Active Information Systems

Implementing active information systems is quite different from implementing passive information systems that support browsing and querying only. In passive information systems, the process of information seeking is explicitly initiated by users, and the needs for information are either articulated as retrieval queries or externalized through a series of browsing actions. In active information systems, the system must determine the information needs of users and when and how to present the retrieved information.

5.1.1 Contextualization: What to Deliver?

Users who are engaged in a task are, most of the time, not very interested in information that bears no relationship to their current task. They need only information that helps them accomplish their task. Because different users have different knowledge backgrounds, their needs for information are also different. For most active information systems, the critical challenge is the contextualization of new information to the task acted upon and the user acting.

Active information systems that merely present a piece of de-contextualized information, such as Microsoft’s “Tip of the Day”, are of little use to most users. This type of system could be viewed as a reverse help system that exploits the communication paradigm of “Answer First, Then Questions” in contrast to the traditional “Question-Answer” paradigm of most help systems (Owen, 1986). Despite the possibility for interesting serendipitous encounters of information (Roberts, 1989), most users find this feature more annoying than helpful. The random presentation of information also makes it difficult to understand when or how the information should be used due to the lack of the problem context.

Sections 5.2 through 5.6 explain in detail how to achieve this contextualization in active information systems, which is the focus of this dissertation research.

5.1.2 Feedforward or Feedback: When to Deliver?

Depending on the temporal order between the time when the information is delivered and the time when the user action for which the information is delivered takes place, active information systems can provide feedforward or feedback to users.

For each action, there is a period of time called action-present, in which users have decided what to do but have not yet executed the needed operations to change the situation (Schön, 1983). Information delivered in this period of time is feedforward (Simon, 1996) information because it can make users change the course of action or assist users in accomplishing the action (Figure 5.1). For example, the Autocompletion of Internet Explorer provides feedforward (Figure 5.2) to users who want to visit a website by saving some keystrokes, but more importantly, by relieving users from remembering exact addresses of websites.

Figure 5.1: Feedforward information delivery

Information delivered during the period of action-present is feedforward information that can affect the execution of the action.

Figure 5.2: Autocompletion in Internet Explorer

When a user types http://www.cs into the address bar, all URLs that the user has recently visited and that start with the typed string are shown in the pop-up menu, and the user can choose one to revisit.
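The address-bar behavior described above amounts to prefix matching over a history of visited URLs. The following is a minimal sketch of that idea, not Internet Explorer's actual implementation; the function name, the example URLs, and the most-recent-first ordering are assumptions.

```python
def autocomplete(typed, recent_urls):
    """Return visited URLs that start with the typed string.

    `recent_urls` is assumed to be ordered from oldest to newest visit;
    matches are returned most recent first, without duplicates.
    """
    seen = set()
    matches = []
    for url in reversed(recent_urls):
        if url.startswith(typed) and url not in seen:
            seen.add(url)
            matches.append(url)
    return matches


# Hypothetical browsing history.
history = [
    "http://www.cs.example.edu",
    "http://www.example.com",
    "http://www.cs.example.org",
    "http://www.cs.example.edu",
]
print(autocomplete("http://www.cs", history))
# ['http://www.cs.example.edu', 'http://www.cs.example.org']
```

The delivery happens while the user is still typing, that is, within the action-present, which is what makes it feedforward rather than feedback.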

Feedback information is delivered when the action for which the information is delivered has been finished (Figure 5.3). Feedback can create a situational backtalk of the action by pointing out a potential breakdown the user has not known or noticed, or can augment the situational backtalk to help users reflect better on the action just completed (Nakakoji et al., 1998). Feedback can serve two roles. First, it creates a learning opportunity for users to improve work performance. For example, the ACTIVIST system (Fischer et al., 1985) teaches users the corresponding key shortcut to replace a series of complex keystrokes used in their previous action in a text editor. Second, if the previous problematic action can be undone or modified, it helps users reach a better solution, such as the on-the-fly spell-checking mechanism in many word-processing systems.

Feedback is retrospective because it gives users a chance to change a problematical or suboptimal solution; feedforward is prophylactic because it prevents a problematical or suboptimal solution. To provide feedback, systems have to compare users’ solutions with ideal solutions to find out what went wrong; to provide feedforward, systems have to predict what is needed by the user in the near future based on what has been done so far.

5.1.3 Interruptive or Noninterruptive: How to Deliver?

Because information delivered by active information systems is unsolicited, it risks interrupting the workflow of users whose primary goal is not the processing of the delivered information. When delivered information distracts users, it becomes intrusive. The intrusiveness of a system is the degree of users’ perception of being interrupted from their current focus. Not all intrusive information is bad. Information that prevents a user from making a mistake that may cause all subsequent work to be void needs to be attended to promptly so the user can avoid the cost of revising a whole chain of actions.

Figure 5.3: Feedback information delivery

Information delivered after the action has been finished is feedback information that helps users reflect upon the action.

Information can be delivered interruptively or noninterruptively. An interruptive delivery requires the immediate reaction of users: if users do not attend to the delivered information, they cannot continue their current work. A noninterruptive delivery just presents information with no reaction from users required. It is up to the user whether to pay attention to the delivered information. Although noninterruptive deliveries present less disruption to the workflow of users, they may go unnoticed and provide insufficient help. Noninterruptive delivery can have various degrees of intrusiveness, depending on how the delivered information is presented, for instance, on the distance between the window displaying the information and the focal window of users. At one end, if the window does not exist or is hidden from the current working space and gets opened or displayed only when the users become interested, the intrusiveness does not exist. LispCritic (Fischer & Mastaglio, 1989) is an example of this type of system. At the other end, if the information window is placed right in the middle of the user’s current focus, the intrusiveness is close to that of interruptive delivery.

Active information systems need to achieve the right balance between the cost of intrusive interruptions and the loss of context-sensitivity of deferred alerts (Horvitz et al., 1999) by carefully considering when and how to deliver the information so that it can be utilized best by users. Depending on the importance of the information, systems can explore a variety of intervention modes to decide when and how to interrupt the user (Sumner, 1995).

5.2 Acquiring Information of User Tasks

To locate information relevant to a user’s task, the information system has to know to some extent what the task is. In passive information systems, users communicate to the information systems what their task is by articulating a query or taking a series of browsing actions through the explicit communication channel established at the time when the user initiates the information location process. In active information systems, systems must infer what the task is.

In everyday communication among people, understanding draws on a shared background and a shared context between speakers and listeners. Each speech act committed by the speaker is interpreted by the listener against the shared background and the shared context implicitly accessible to both (Winograd & Flores, 1986). An implicit communication channel can be established when the workspace of users is shared with information systems because such a shared workspace can be utilized to create the shared background and context between users and information systems. Recent user actions in the workspace partially reveal what the current task is, and the actions can be understood and the goal of the task can be inferred based on the underlying domain knowledge and the relationship between the actions and the elements existing in the workspace already. Such an inference or understanding of user actions and goals can be used by systems to predict what kind of information would be interesting to the user. For example, when a kitchen designer places a stove in front of a window with a curtain in a kitchen design environment, a knowledgeable observer (either a human being or a computer system) can infer that the designer is not aware of the fire hazard of the design, even if the designer does not say anything about it; and a piece of information about fire safety rules in kitchen design probably would interest the designer (Nakakoji, 1993).
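The kitchen example can be read as a rule applied to the shared workspace: the system watches the elements already placed and infers an information need without being asked. The sketch below illustrates this reading only; the `Element` class, the `wall_segment` position model, and the returned topic string are hypothetical and do not describe the system cited above.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str                  # e.g. "stove" or "window"
    wall_segment: int          # hypothetical position index along the wall
    has_curtain: bool = False


def infer_information_need(workspace):
    """Return a topic to deliver, or None if no rule applies.

    Assumed domain rule: a stove placed at the same wall segment as a
    curtained window suggests the designer has overlooked a fire hazard.
    """
    stove_segments = {e.wall_segment for e in workspace if e.kind == "stove"}
    for element in workspace:
        if (element.kind == "window" and element.has_curtain
                and element.wall_segment in stove_segments):
            return "fire safety rules in kitchen design"
    return None


design = [Element("window", 3, has_curtain=True), Element("sink", 1)]
print(infer_information_need(design))  # None
design.append(Element("stove", 3))     # the designer places the stove
print(infer_information_need(design))  # fire safety rules in kitchen design
```

The designer communicates nothing explicitly in this sketch; the placement action in the shared workspace is the only input.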

The remainder of this section explains how an information system can utilize information existing in the shared workspace to locate task-relevant information. Relevance of information to a user’s task can be assessed at two levels: the immediate task level and the larger context level. Most endeavors of users cannot be accomplished in one action; they often need to be divided into smaller tasks. The immediate task, or task at hand, is the portion of the whole endeavor to which a user is currently attending, and every immediate task is conducted within a context defined by previously accomplished tasks and the overall goal of the whole endeavor. For example, in a programming environment, an immediate task for a programmer is a procedure or method that he or she is currently developing, and the larger context includes those functions or methods that have been developed so far and with which the current procedure or method will interact to make a whole system.

5.2.1 General Approaches to Capturing the Immediate Task

To find information relevant to the immediate task, information systems need a representation schema of user tasks. An abstract representation of a user task is called a task model, and the acquisition of such task models is called task modeling. Tasks can be modeled through either plan recognition or similarity analysis.

5.2.1.1 Plan Recognition

The plan recognition approach uses plans to describe a user task. A plan is a sequence of user actions that achieve a certain goal.[1] In general, a plan can be represented as a rule consisting of two parts: the condition and the result. The condition part includes a sequence of actions required to accomplish a task, and the result part is the intended goal of the task. When the actions of a user match, completely or partially, the condition part, the system can infer that the user is performing that corresponding task, and information about that task is delivered.
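A plan in this condition-and-result form can be sketched as a small data structure with a matching function. The sketch is illustrative only: the `Plan` class, the editor action names, and the reading of a partial match as a prefix match are assumptions, not the representation used by any system discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    condition: tuple  # sequence of actions required to accomplish the task
    result: str       # intended goal of the task


def match_plans(observed, plans):
    """Return (complete, partial) lists of goals for the observed actions.

    A partial match is taken here to mean that the observed actions form
    a proper prefix of the condition part (an assumption of this sketch).
    """
    observed = tuple(observed)
    complete, partial = [], []
    for plan in plans:
        if plan.condition == observed:
            complete.append(plan.result)
        elif plan.condition[:len(observed)] == observed:
            partial.append(plan.result)
    return complete, partial


# Hypothetical plans for a text editor.
PLANS = [
    Plan(("select_text", "copy", "move_cursor", "paste"), "duplicate text"),
    Plan(("select_text", "cut", "move_cursor", "paste"), "move text"),
    Plan(("select_text", "delete"), "remove text"),
]
print(match_plans(["select_text", "cut"], PLANS))     # ([], ['move text'])
print(match_plans(["select_text", "delete"], PLANS))  # (['remove text'], [])
```

As more actions are observed, the set of partially matching plans shrinks until one plan matches completely and information about its goal can be delivered.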

Two kinds of approaches exist for the recognition of task plans: plan libraries and generic plans.

Plan Libraries. In this approach, all task plans are stored in a plan library of the information system and each task is described by a specific plan. As a user acts in a computer system, plans whose beginnings match the sequence of actions are selected. The ACTIVIST system (Fischer et al., 1985) takes this approach. For instance, as a user repeatedly uses the delete key to delete a character backward in an editor, three plans are first recognized: (1) deleting a word, (2) deleting a line up to the current character, and (3) deleting a paragraph up to the current character. If the user stops deleting at a space, then the first plan is finally recognized and information about the command of deleting a word is delivered; if the user stops at the beginning of the line, then the second plan is finally recognized and information about the command of deleting a line is delivered; and so on. The difficulty with this approach is that system designers must specify all plans beforehand. Moreover, as the number of plans increases in the plan library, the performance of plan recognition becomes the bottleneck because there are so many plans to compare with user actions.

[1] Program plans defined in Section 2.2.1 are plans specific to the programming domain.

Generic Plans. In this approach, a type of task plan is described in a generic rule using regular expressions, syntactical grammar rules, or other descriptive patterns. If a sequence of user actions is an instantiation of such a generic rule, the system recognizes that the user is performing the corresponding task. For example, the ADD (Apple Data Detector) system (Nardi et al., 1998) uses a regular expression to describe a URL (Uniform Resource Locator) or an email address. When a string in text matches the regular expression, several actions associated with it, such as “save it in the bookmark list,” are suggested by the system and can be automatically executed if the user chooses to do so. LispCritic (Fischer, 1987) also takes a similar approach. This approach requires user actions to have a generalizable structure.
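The generic-plan idea of describing a whole class of items with one pattern can be sketched with regular expressions. The patterns below are deliberately simplified and are not the ones used by the ADD system; apart from saving to the bookmark list, which is mentioned above, the suggested action names are hypothetical.

```python
import re

# Simplified, assumed patterns; real URL and email grammars are more complex.
PATTERNS = {
    "url": re.compile(r"https?://[^\s<>\"']+"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}

# Actions offered when a pattern matches (mostly hypothetical).
SUGGESTED_ACTIONS = {
    "url": ["open in browser", "save it in the bookmark list"],
    "email": ["compose message", "add to address book"],
}


def detect(text):
    """Find strings that instantiate a generic pattern and suggest actions."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(), SUGGESTED_ACTIONS[kind]))
    return found


sample = "Write to jane@example.org or see http://www.example.org/reuse for details."
for kind, string, actions in detect(sample):
    print(kind, string, actions)
```

One rule covers every URL or address the user may encounter, which is why this approach avoids enumerating specific plans but requires the recognized items to share a generalizable structure.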

5.2.1.2 Similarity Analysis

The similarity analysis approach examines the contextual information surrounding the current focus of users, and uses that contextual information to predict their information needs. Information from the repository that has high similarity to the contextual circumstance is then delivered. Systems taking the approach of similarity analysis do not try to directly infer the goal of user actions; they operate based on the following two assumptions (Figure 5.4):

Similar Situations. If the current situation of a user is similar enough to another situation (Situation A), which has been encountered by either the same user or another user before, and information X is often explored in Situation A, then the current situation probably also needs information X.

Similar Information. If the current situation uses information X and information Y is similar enough to information X, then the user is probably also interested in exploring information Y.

Figure 5.4: Two assumptions of similarity analysis

Relevant information can be determined based on (a) the similarity of situations or (b) the similarity of information.

Some recommendation systems such as Siteseer (Rucker & Polanco, 1997) use the first assumption to recommend new information to users. Siteseer helps users discover interesting web pages. The situation of a user is defined by his or her Bookmarks (or Favorites). Users are thought to be in a similar situation if their Bookmarks have enough overlap. Within such a group of users, new web pages that have the highest overlap among the Bookmarks of other users but do not appear in one particular user’s own Bookmarks are recommended to that user. Other systems, such as GroupLens (Konstan et al., 1997) and PHOAKS (Terveen et al., 1997), are also designed based on this assumption. This assumption is also widely used in e-commerce websites. For example, the Amazon.com website recommends to a book-buying customer new books that have been purchased by other customers who have bought books similar to those the customer has bought.
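The similar-situations assumption, as applied to bookmarks, can be sketched as follows. This is a simplified illustration rather than Siteseer's actual algorithm; the overlap threshold, the user names, and the page names are assumptions.

```python
from collections import Counter


def recommend(target, bookmarks, min_overlap=2, top_n=3):
    """Recommend pages bookmarked by users whose bookmarks overlap the target's.

    `bookmarks` maps each user to a set of pages. Users sharing at least
    `min_overlap` pages with the target are treated as being in a similar
    situation (an assumed threshold).
    """
    own = bookmarks[target]
    votes = Counter()
    for user, pages in bookmarks.items():
        if user == target:
            continue
        if len(own & pages) >= min_overlap:
            votes.update(pages - own)  # count only pages new to the target
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:top_n]]


# Hypothetical users and bookmarked pages.
BOOKMARKS = {
    "ann": {"a.example", "b.example", "c.example"},
    "bob": {"a.example", "b.example", "d.example"},
    "cat": {"a.example", "c.example", "d.example", "e.example"},
    "dan": {"z.example"},
}
print(recommend("ann", BOOKMARKS))  # ['d.example', 'e.example']
```

No goal is inferred for any user in this sketch; the overlap of bookmarks alone stands in for the similarity of situations.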

The second assumption, similar information, underlies many information agents, such as Remembrance Agent (Rhodes & Starner, 1996) and WebWatcher (Armstrong et al., 1995). These information agents deliver to users new information, such as emails or web pages, that is determined to be similar to what the user is currently focusing on. For example, when a user is writing a new email in an email editor, Remembrance Agent autonomously searches the email folders and personal notes of the user and delivers messages and notes that are similar in content to what the user is currently writing.
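The similar-information assumption can be sketched by comparing the text currently being written with stored documents. The sketch uses cosine similarity over word counts, which is a common technique assumed here for illustration and not necessarily what Remembrance Agent implements; the notes and the draft are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a text as a multiset of lowercase words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def most_similar(draft, documents, top_n=1):
    """Return titles of the stored documents most similar to the draft."""
    query = bag_of_words(draft)
    scored = [(cosine(query, bag_of_words(text)), title)
              for title, text in documents.items()]
    scored = [(score, title) for score, title in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]


# Hypothetical personal notes.
NOTES = {
    "budget memo": "travel budget for the spring conference",
    "lunch": "pizza order for friday",
    "reviewer notes": "comments on the conference paper draft",
}
draft = "Can we raise the travel budget for the conference?"
print(most_similar(draft, NOTES))  # ['budget memo']
```

The draft itself acts as the query, so the delivered notes change as the user keeps writing.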

5.2.1.3 Comparing Plan Recognition and Similarity Analysis

The plan recognition approach can support multiple well-defined and concrete tasks (such as deleting a word or extracting a URL address), each of which is relatively easy to describe in a rule. A rule is activated when its condition is met. For the same task, the delivered information is always the same, and it is meant to be feedback to users on their just-finished actions. Similarity analysis often supports only one ill-defined and abstract task (such as finding interesting information or writing a program) that is implicitly activated when users start to use the system. The delivery operation requires input from the current situation, and thus the delivered information varies with the situation. Information delivered through similarity analysis can be either feedback on the finished action or feedforward to stimulate a new action. Plan recognition approaches have difficulty dealing with the semantic aspects of user actions because they try to “understand” what users are doing. They are therefore difficult to scale up, because system designers have to engineer this understanding mechanism into the system beforehand. Because the similarity analysis approach focuses on the intention communication aspect of users’ working environment (Winograd & Flores, 1986), it can circumvent the requirement of human-like understanding and give an interpretation that makes sense to the system only. However, active information delivery based on similarity analysis is often not as accurate as delivery based on plan recognition.

These two approaches correspond, respectively, to the two models of long-term memory retrieval of human beings: retrieval by recognition and retrieval by association. In the process of retrieving information by recognition, information from memory is directly retrieved at the recognition of distinctive features of a fixed pattern (Simon, 1996). Plan recognition systems simulate the same process: from features to goal recognition and then to relevant information.

In the process of retrieving information by association, information strongly linked with the existing perceptual elements is activated from the long-term memory. In this process, along with information useful for the current situation, some irrelevant information may also be activated. Therefore, humans have to select, based on their existing knowledge, the information that can be correctly integrated with the current situation (Kintsch, 1998). Following this process, similarity analysis-based systems retrieve and deliver information based on the link (the similarity or association), and it is up to the users to decide which information they need to incorporate into their task.

Table 5.1 summarizes the differences between plan recognition and similarity analysis in terms of their underlying memory retrieval models, their technical approaches, and their major shortcomings.

                            Plan Recognition                                    Similarity Analysis
Memory Retrieval Model      Retrieval by recognition                            Retrieval by association
Technical Approach          User actions -> Goal -> Information                 Context -> Information
Major Technical Challenges  Defining plans and recognizing plans from actions   Determining similarity of situations and similarity of information
Major Objective             Feedback to previous actions                        Feedforward to immediate actions
Supported Tasks             Multiple                                            Single
Major Shortcomings          Difficult to scale                                  Imprecise information
Example Systems             Activist, LispCritic, ADA                           Siteseer, CodeBroker, Remembrance Agent

Table 5.1: A comparison between plan recognition and similarity analysis

5.2.2 Modeling the Programming Task

In the case of locating reusable components, a plan recognition approach needs to recognize the program plan from programs under development, using plan recovery methods such as those adopted in the Proust system (Soloway & Ehrlich, 1984), and to deliver reusable components that can be used to realize the recognized program plan. Recognition of program plans from complete programs is extremely difficult. Woods and Yang prove that program plan recognition is reducible to the problem of identifying two isomorphic graphs, which is NP-complete (Woods & Yang, 1996).

The primary goal of active component repository systems is to identify reuse opportunities by delivering reusable components to programmers before they have fully implemented the program. Plan recognition is even more difficult in this setting because the system has to recognize program plans from partially constructed programs. Therefore, the plan recognition approach is not suitable for active component repository systems.

This research adopts the similarity analysis approach to locate reusable components that are relevant to the module (a piece of program under development) by making use of the descriptive elements existing in both modules and components.

5.2.2.1 Three Aspects of Programs

A program has three aspects: concept, code, and constraint. The concept of a program is its functional purpose, or goal; the code is the embodiment of the concept; and the constraint regulates the environment in which the program runs (Ye et al., 2000). This characterization is similar to the 3C model of Tracz (Tracz, 1990), who uses concept, content, and context to describe a reusable component.

Concept

Important concepts of a program are often contained in its informal information structure. Software development is essentially a cooperative process among many programmers. Programs include both formal information for their executability and informal information for their readability by peer programmers (Fischer & Schneider, 1984). Informal information includes structural indentation, comments, and identifier names (Soloway & Ehrlich, 1984). Comments and identifier names are important beacons for the understanding of programs because they reveal the important concepts of programs (Biggerstaff et al., 1994; Etzkorn & Davis, 1997a; Michail & Notkin, 1999). The embedding of informal information in programs improves long-term indirect communication among programmers because, unlike a separate document for a program, information about the program is stored in the place where it is most useful: in the program itself (Reeves, 1993).

Modern programming languages such as Java reinforce this embedding of informal information by introducing the concept of the documentation comment (doc comment for short). A doc comment, beginning with /** and continuing until the next */, immediately precedes the declaration of a module, which is either a class or a method. Doc comments are used by the Javadoc program to create online documentation from Java programs. The contents of a doc comment are meant to describe the functionality of the module that follows.
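As a small illustration of the doc comment convention, the following hypothetical method carries its concept (what it is for) in the comment and in its identifier names. The class name `RandomUtil` and the inclusive-bounds behavior are assumptions made for this example only.

```java
import java.util.Random;

// Hypothetical example class; only the doc comment convention is the point here.
class RandomUtil {

    /**
     * Returns a pseudo-random integer between the two bounds, inclusive.
     *
     * @param from the lower bound
     * @param to   the upper bound, assumed to be no smaller than {@code from}
     * @return a number n such that {@code from <= n <= to}
     */
    static int getRandomNumber(int from, int to) {
        return from + new Random().nextInt(to - from + 1);
    }

    public static void main(String[] args) {
        System.out.println(getRandomNumber(1, 6));
    }
}
```

The words "random", "integer", "between", and "bounds" in the comment, together with the identifier `getRandomNumber`, are the kind of informal information from which the concept of a module can be read.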

Constraint

Constraints of a program exist at four levels: syntactical level, semantic level, architectural level, and practical level.

The syntactical constraint of a program is captured by its signature. As the type expression of a program, a signature defines the program’s syntactical interface. The basic form of a signature of a method or a function is:

Signature: OutputTypeExp <- InputTypeExp

where OutputTypeExp and InputTypeExp are type expressions that result from applying a Cartesian product constructor to all their parameter types. For example, for the method

int getRandomNumber (int from, int to)

the signature is

getRandomNumber: int <- int x int

A signature of a class contains all the type definitions of its attributes and all the signatures of its methods.
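For Java methods, a signature in the form shown above can be derived mechanically with the standard reflection API. The sketch below is a minimal illustration under that assumption; the class `SignatureSketch` and its helper methods are hypothetical names, and the formatting of a method without parameters (an empty input side) is left unspecified.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.stream.Collectors;

// Illustrative sketch: formats a method's signature as "name: Output <- Input x Input".
class SignatureSketch {

    // Sample method taken from the text.
    static int getRandomNumber(int from, int to) {
        return from + (int) (Math.random() * (to - from + 1));
    }

    // Joins the parameter types with " x " to mimic the Cartesian product notation.
    static String signatureOf(Method method) {
        String inputs = Arrays.stream(method.getParameterTypes())
                .map(Class::getSimpleName)
                .collect(Collectors.joining(" x "));
        return method.getName() + ": " + method.getReturnType().getSimpleName() + " <- " + inputs;
    }

    // Convenience overload that looks the method up by name and parameter types.
    static String signatureOf(Class<?> owner, String name, Class<?>... parameterTypes) {
        try {
            return signatureOf(owner.getDeclaredMethod(name, parameterTypes));
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + name, e);
        }
    }

    public static void main(String[] args) {
        // prints "getRandomNumber: int <- int x int"
        System.out.println(signatureOf(SignatureSketch.class, "getRandomNumber", int.class, int.class));
    }
}
```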

The semantic constraint of a program involves the conditions with which the input and output data have to agree. In formal specification languages, these conditions are described as pre-conditions and post-conditions (Wing, 1990). Some programming languages, such as C, use the assertion statement for programmers to express the intended semantic constraint of a program. For object-oriented programming languages, a contract model can be used to specify the semantic constraints of a class (Meyer, 1997). A contract of a class specifies the invariant of the class and the legal order in which the methods of the class should be called. However, both pre- and post-conditions and contracts are still not widely adopted by programmers. One reason is that it is difficult to write these constraints because it requires deep mathematical knowledge, and another reason is that it is also very difficult to develop reliable yet efficient computing tools to check these constraints.
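To make the notion of pre- and post-conditions concrete, the following sketch attaches explicit run-time checks to the sample method getRandomNumber. The particular conditions (from must not exceed to, and the result must lie within the bounds) are assumptions chosen for illustration, as is the class name `ContractSketch`; explicit checks are used rather than the Java assert statement because Java assertions are disabled unless enabled at run time.

```java
import java.util.Random;

// Illustrative sketch of a semantic constraint expressed as run-time checks.
class ContractSketch {

    private static final Random RANDOM = new Random();

    static int getRandomNumber(int from, int to) {
        // Pre-condition (assumed for this example): the lower bound must not exceed the upper bound.
        if (from > to) {
            throw new IllegalArgumentException("pre-condition violated: from > to");
        }
        // Assumes that the size of the range fits in an int.
        int result = from + RANDOM.nextInt(to - from + 1);
        // Post-condition (assumed for this example): the result lies within the bounds.
        if (result < from || result > to) {
            throw new IllegalStateException("post-condition violated: result out of range");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(getRandomNumber(1, 6));
    }
}
```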

The architectural constraint of a program is introduced when a programmer wants to develop a system in conformance with a particular architectural style. At the class level, a programmer may want to confine a class to a chosen design pattern or to fit the class into an existing framework. Because design patterns and frameworks prescribe how their constituent classes interact with each other, they impose extra constraints on the interface and implementation of classes.

The practical constraint of a program includes the required performance criteria. In some critical situations, programs need to be time-efficient to respond in a timely fashion, memory-efficient to run within limited memory resources, thread-safe to support concurrent execution, and so on.

Code

Code is meant to be executed by computers. It is the machine-executable representation of concepts, and it must conform to all the required constraints.

5.2.2.2 Locating Relevant Components

Relevance of reusable components to the current programming task can be determined by the combination of concept similarity and constraint compatibility. Concept similarity is the similarity between the concept of the current task, as revealed through comments and identifiers, and the concept revealed in the documentation of reusable components. Constraint compatibility is the compatibility between the constraints required for the module under development and those satisfied by the components from the repository. A component from the repository whose concept is similar to the concept of the program under development has a high probability of being reused in the current situation. Moreover, if the component has compatible constraints, the likelihood of its reuse increases further. Programmers who use passive repository systems, especially browsing mechanisms, follow the same heuristic to find reusable components. They first look at the names and short descriptions of components and choose to explore those that suggest something similar to their task (Biggerstaff et al., 1994; Henninger, 1993), and then choose to reuse components that can be easily integrated with their current programs (that is, components that have constraint compatibility).
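The two-step heuristic can be sketched as follows. This is a deliberately simplified stand-in and makes several assumptions: concept similarity is approximated by word overlap (the Jaccard coefficient) between the text of the module and the description of a component, constraint compatibility is reduced to an exact match of signatures, and compatible candidates are simply listed first. The names `RelevanceSketch`, `Component`, and `rank` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: relevance = concept similarity first, constraint compatibility second.
class RelevanceSketch {

    // A repository component reduced to its surrogates: a description and a signature.
    static final class Component {
        final String name;
        final String description;
        final String signature;

        Component(String name, String description, String signature) {
            this.name = name;
            this.description = description;
            this.signature = signature;
        }
    }

    // Lower-cased set of the words in a text.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove("");
        return result;
    }

    // Assumed stand-in for concept similarity: shared words divided by all words.
    static double conceptSimilarity(String a, String b) {
        Set<String> common = words(a);
        common.retainAll(words(b));
        Set<String> union = words(a);
        union.addAll(words(b));
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    // Assumed stand-in for constraint compatibility: identical signatures, ignoring spaces.
    static boolean constraintCompatible(String required, String offered) {
        return required.replace(" ", "").equals(offered.replace(" ", ""));
    }

    // Keeps conceptually similar components, then lists compatible ones first.
    static List<Component> rank(String taskText, String taskSignature,
                                List<Component> repository, double threshold) {
        List<Component> candidates = new ArrayList<>();
        for (Component c : repository) {
            if (conceptSimilarity(taskText, c.description) >= threshold) {
                candidates.add(c);
            }
        }
        candidates.sort(Comparator
                .comparing((Component c) -> !constraintCompatible(taskSignature, c.signature))
                .thenComparingDouble(c -> -conceptSimilarity(taskText, c.description)));
        return candidates;
    }
}
```

Calling `rank` with the comment text and signature of the module under development returns only components whose descriptions share enough words with the task, with signature-compatible components at the top of the list.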

5.2.3 Relevance to the Larger Context of Task

Each action taken by users in a computer system serves a global goal, and actions take place in a historical, larger context that is shaped by all preceding actions. Information systems can provide more appropriate information by taking into consideration the overall goal and the historical context in which the information is needed. To reach this objective, a shared understanding of the larger context needs to be created between users and information systems. This shared understanding can be created through up-front specification of goals and objectives by users or created incrementally during the course of interactions between users and systems.

5.2.3.1 Specifying the Global Goal

Before users start to interact with computer systems, they can specify their goals at a high level of abstraction. The specifications do not need to be complete and can be modified and augmented during the work process that follows. Active information systems can utilize partial specifications to determine what information might be most relevant to the actions taken by users to accomplish their global goals.

The KID system, a kitchen design environment embedded with active critiquing, includes such a specification mechanism (Nakakoji, 1993). KID supports two types of critiquing: generic and specific. Generic critics deliver design knowledge applicable to all kitchen designs, such as accepted standards or regulations. Specific critics deliver design knowledge applicable only to the design situation currently under consideration. Prior to design actions, a user can characterize the kitchen to be designed by answering some questions, such as “Size of the family?” and “Is the primary cook right-handed or left-handed?” This specification is used later by the system to fire relevant specific critics regarding design actions as well as to exclude critics that are relevant in general but not consistent with the specification in particular.

Despite the value of a partial specification of high-level goals in delivering relevant information, some users may have difficulty articulating their requirements before they start to perform their task, especially when they are not clear about what kind of information the system provides. Moreover, if the structure of the information system is too complicated and users need to answer too many questions to produce a meaningful partial specification, they may become annoyed and impatient with the system. Generally speaking, users are more eager to do “real work” than to answer a long list of questions.

5.2.3.2 Incremental Discourse Modeling

Incremental discourse modeling is another approach to the creation of the shared understanding of the larger context. A discourse model represents the interaction history between the user and the system. In this approach, information about the larger context is revealed to the system piece by piece, and the shared understanding is established incrementally as users act to achieve the global goal. Previous actions define the historical context in which the current action takes place and limit the applicability of information in the current situation. This is similar to the conversation structure in natural languages, in which a new utterance is interpreted by the listener in light of the conversational discourse defined by previous utterances. However, the term “discourse model” is used here in a broader sense than it is used in natural language understanding. In natural language understanding, discourse models are mainly used to disambiguate referring expressions such as it, this, and my car (Jurafsky & Martin, 2000); whereas in this thesis, discourse models are used to disambiguate the relevance of information regarding the context.

Incremental discourse modeling amortizes the effort needed from users for the specification of the global goal. For each piece of information delivered by active information systems throughout the work process, users can choose to agree with it, disagree with it, ignore it, or declare it irrelevant. If information systems have the means to capture and represent these responses in a discourse model, more appropriate information can be delivered later.

A discourse model can be either positive or negative. A positive discourse model contains the type of information that has interested users and is likely to be useful. In later deliveries, information systems try to retrieve and deliver information of the same or a similar type. A negative discourse model contains the type of information in which users have not been interested. It can be used as a filter to remove the same type of information. Negative discourse modeling is particularly powerful in situations where misfits are much easier for users to identify than fits (Alexander, 1964). Information filtering and information retrieval are two sides of the same coin; both aim to improve the relevance of information. An information system can create a discourse model with both positive and negative representations of the larger context. In systems that have this kind of mixed discourse model, information similar to positive descriptions is considered to have higher relevance, whereas information similar to negative descriptions is considered to have lower relevance.
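A mixed discourse model of this kind can be sketched as two sets of information types that raise or lower a base relevance score. The sketch below is an illustration only: the notion of a "type" (here an arbitrary label such as a package name), the fixed `ADJUSTMENT` constant, and the names `DiscourseModelSketch` and `Item` are assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of a mixed (positive and negative) discourse model.
class DiscourseModelSketch {

    // A retrieved piece of information with its type and base relevance score.
    static final class Item {
        final String name;
        final String type;
        final double score;

        Item(String name, String type, double score) {
            this.name = name;
            this.type = type;
            this.score = score;
        }
    }

    // Assumed constant: how strongly the discourse model shifts relevance.
    private static final double ADJUSTMENT = 0.2;

    private final Set<String> positive = new HashSet<>();
    private final Set<String> negative = new HashSet<>();

    // Records a type of information the user has found interesting.
    void markInteresting(String type) {
        negative.remove(type);
        positive.add(type);
    }

    // Records a type of information the user has rejected.
    void markUninteresting(String type) {
        positive.remove(type);
        negative.add(type);
    }

    // Raises the score of positive types and lowers the score of negative types.
    double adjustedScore(Item item) {
        if (positive.contains(item.type)) {
            return item.score + ADJUSTMENT;
        }
        if (negative.contains(item.type)) {
            return item.score - ADJUSTMENT;
        }
        return item.score;
    }

    // Orders retrieved items by adjusted relevance, highest first.
    List<Item> rank(List<Item> retrieved) {
        List<Item> ranked = new ArrayList<>(retrieved);
        ranked.sort(Comparator.comparingDouble((Item item) -> adjustedScore(item)).reversed());
        return ranked;
    }
}
```

A purely negative model used as a filter corresponds to dropping items of negative types altogether instead of lowering their scores.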

A discourse model can be created and augmented explicitly or implicitly. Explicit discourse modeling requires users to respond to the delivery of information with an explicit answer. For example, a piece of information can be delivered with a mechanism that allows users to specify whether the information is useful or useless, relevant or irrelevant, and the system integrates the user responses into the discourse model. Instead of asking users for direct input, implicit discourse modeling observes the users’ reaction to the delivered information (whether the user uses or ignores the information) and augments discourse models based on acquisition rules that make inferences from the observation.

As an example, a discourse model can be incorporated into the “Tip of the Day” feature of Microsoft Office to improve the context relevance of delivered tips. For instance, if a user has repeatedly used tables in recent sessions with Office, an appropriate discourse modeling mechanism would be able to capture this implicitly and instruct the system to deliver tips about table operations to the user.

5.3 Personalizing Information Delivery

Information systems exist as a resource to supplement a user’s knowledge and overcome its limitations. However, delivering a piece of information that the user already knows is more disruptive than helpful. Because different users have different knowledge backgrounds, a piece of information that is helpful to one user may be distracting to another. Therefore, active information systems should personalize the delivered information. In other words, they should deliver user-specific information.

5.3.1 Representing Background Knowledge as User Models

Effective communication requires the ability to represent the other communicating partner’s knowledge (Norman, 1993). User models, which represent the users’ preferences and knowledge levels about a system, can be used in an active information system to adapt the system behavior to each user and to improve the efficiency of communication between users and systems (Thomas, 1996). User models are the result of a user modeling mechanism embedded in a system (Wahlster & Kobsa, 1989).

The terms “user model” and “user modeling” are overloaded with different meanings in the research literature. As a general definition, a user model is a computer system’s model of user characteristics for the purpose of tailoring the interaction or making the dialog between the user and system adaptive (Murray, 1987). However, user characteristics can have many dimensions: knowledge about the computer system, knowledge about the domain, goal of the current task, preferences, cognitive and learning abilities or disabilities, and so on. When the user model represents the goal of the current task, it overlaps with the task modeling described in Section 5.2.1. Furthermore, a user model has different temporal dimensions (Dieterich et al., 1993). When it is used to describe characteristics of a user valid only in the current context or session (short-term data), it overlaps with the discourse model described in Section 5.2.3.2.

Throughout this thesis, the term “user model” is used exclusively in the following sense. A user model represents the background knowledge that a user possesses and is kept as long-term data on a permanent storage medium. As shown in Figure 4.1, a user’s knowledge about an information repository falls into four levels. A user model should contain pieces of information falling in both L1 and L2. Because information belonging to L1 is well known and regularly used, there is no need for the system to actively deliver it. Although information belonging to L2 has not been completely acquired by the user yet, it can still be considered as a part of the user’s active knowledge because the user knows about it and will use it readily when it is needed. Even if the user may need more details about it, he or she knows very well how to find them with information access mechanisms. Accordingly, user modeling is used in this thesis to refer to the computational mechanism embedded in the information systems for the creation, augmentation, and maintenance of such user models.
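In its simplest form, a user model in this sense can be thought of as the set of items at levels L1 and L2, which the delivery mechanism consults so that it does not deliver what the user already knows. The sketch below illustrates only that filtering step; persistence on a storage medium is omitted, and the names `UserModelSketch`, `addKnown`, and `personalize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch: a user model as the set of components the user already knows (L1 and L2).
class UserModelSketch {

    private final Set<String> known = new HashSet<>();

    // Adds a component that the user is known to know about.
    void addKnown(String component) {
        known.add(component);
    }

    // Removes already-known components from a list of retrieved components,
    // preserving the original order of the rest.
    List<String> personalize(List<String> retrieved) {
        List<String> toDeliver = new ArrayList<>();
        for (String component : retrieved) {
            if (!known.contains(component)) {
                toDeliver.add(component);
            }
        }
        return toDeliver;
    }
}
```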

5.3.2 Acquiring User Models

User models cannot be created once and for all because users’ knowledge about a system changes over time. As users’ knowledge changes, their needs for information change, and their user models should also be modified to reflect the change.

Similar to discourse models, user models can be explicitly modified by users or implicitly updated by the system. Direct modification by users requires that the system be adaptable, which means that users can customize the system’s behavior to their own needs. Adaptive systems automatically update user models based on information observed or inferred from monitoring users’ interactions with the system (Fischer, 1993).

Adaptability and adaptivity complement each other. Although adaptivity requires little effort from users, it needs a relatively long time to establish a reliable user model. Deployment of VDDE, an active design environment for phone-based interface design, found that experienced designers do not expect to be interrupted with information they have told the system is irrelevant (Sumner, 1995). Adaptability gives users direct and immediate control over what information should be delivered. However, it places extra work on users.

Another challenge in the acquisition of user models is how to initialize a user model. Few users know nothing about an information system when they start to use it, and, obviously, empty user models do not reflect this fact. Mechanisms supporting adaptability and adaptivity can be extended to the explicit and implicit acquisition of initial user models, respectively. An explicit acquisition method directly asks users what they already know through an up-front questionnaire or testing sessions when they use the system for the first time. An implicit acquisition method is suitable if artifacts previously created by users in the domain that the information system is designed to support are available. The interactive adaptivity mechanism can be modified to run as a batch process that analyzes those existing artifacts to obtain the initial user models. The third method is to create several stereotypical user models to represent different levels of users, as is widely done in intelligent tutoring systems and intelligent help systems. One of the stereotypical models can be chosen as the approximation of the initial user model, or a user can copy the user model of another user who has a similar level of knowledge.

5.4 Dealing with Partial, Imprecise Queries

Locating useful information from an information system to support complex design processes in wicked problem domains in general and locating reusable components from a component repository in particular are not the same as searching for data in a database system where the query is well defined and completely articulated. Instead, users are looking for useful information and the usefulness can only be determined by them, depending on how they intend to make use of it.

Users’ queries as well as items in information systems are usually not directly represented. For example, in a multimedia design environment, a user wants to find an ideal image from a large image library for a design task. There is plainly no way to express the query directly; after all, if the user knew how to express it directly, he or she would already have the image and would not need to locate it at all! The user must rely on an abstract representation schema to describe certain attributes of the needed image. Images in the image library are also abstractly represented. But even with this abstract representation schema, users can begin with only a very vague query (Nakakoji et al., 1998). In a reusable component repository system, reusable components are not directly represented either. A direct representation would be the program code; instead, components are represented by surrogates such as textual descriptions and signatures. Those abstract representations are partial, biased, and imprecise. Even in the domain of document retrieval, where the representation schema is the same as the information itself, users often ask the wrong questions (Jones, 1997).

5.4.1 Context-Aware Browsing

Given that complete requirements for information are not available at first, information systems cannot locate the exact information needed by users. The problem of incomplete information requirements is more severe in active information systems because requirements are inferred. However, active information systems can heuristically reduce the search space to the extent that users can easily browse it and choose the information they need. This kind of approach to the acquisition of information is defined as context-aware browsing.

Querying and browsing are the two major information access mechanisms for most users. Querying is direct: users formulate a query and the system returns information matching the query. However, formulating queries is a cognitively challenging task because users have to overcome the gap from the situational model to the system model (see Section 3.4.3.2). In browsing, users determine the usefulness or relevance of the information currently being displayed in terms of their task and traverse its associated links. People tend to find browsing more fun than querying because they do not need to commit resources at first and can incrementally develop their requirements after evaluating the information along the way (Lieberman, 1997).

Mili et al. claim that browsing is the predominant pattern of component repository usage because most programmers cannot formulate clearly defined requirements for reusable components, so they rely on browsing to get acquainted with the reusable components available in the repository (Mili et al., 1999).

However, browsing is not scalable, for the following reasons. First, there is an inherent dilemma in the design of the browsing structure of an information system: if there are too many links, users will be puzzled by the complexity; if there are too few, information is not well connected. Second, there cannot be a structure suitable for all users and all user tasks, and the structure defined at design time may not be the structure needed at use time. For example, the Smalltalk class library is structured according to the inheritance relationship. This structure is perfectly suitable for the execution of programs, and for locating components whose super nodes (classes or super-classes) are known. However, it is not suitable for programmers to find a method based on functionality. Some methods with similar functionality are scattered in different deep nodes of the inheritance tree (Helm & Maarek, 1991). It is therefore very difficult for programmers to find and compare all of them in order to choose the most appropriate one to reuse. Third, in a large information system, following the right link requires users to have a very good understanding of the structure of the whole system. Most users, especially less experienced ones, may easily get lost in a complex network of nodes while tracing dozens of links (Halasz, 1988).

Context-aware browsing supported by active information systems combines the strengths of both querying and browsing: the directness of querying and the lower cognitive threshold of browsing. Active information systems automatically collect and present information to users based on the task context where the information is consumed. Even though the delivered information may not be precise enough due to the incompleteness of task models, users can immediately start to browse a significantly reduced information space that is organized in accordance with their task structure.

5.4.2 Supporting Retrieval-by-Reformulation

Another approach to complement the incompleteness of information requirements is to support retrieval-by-reformulation (see Section 3.4.3.4). Active information systems first present users with the initial retrieval results. This initial delivery can serve the following two purposes:

(1) Users can learn how the information system stores and organizes its information by examining the retrieval results.

(2) Users can discover some requirements that were not present in the initial query by comparing the retrieval results with their intentions of use in context.

Based on newly acquired knowledge of the information system and newly discovered aspects of their requirements, users can either reframe or refine the initial query to improve its completeness and precision, or directly manipulate the retrieval results by filtering out apparently irrelevant information so that the needed items get more focused attention.

When the retrieval-by-reformulation mechanism is integrated with active information systems, the reformulation process of users can be used at the same time, as a nice side effect, to augment discourse models and user models. Details will be explained in Section 3.4.3.4 in the context of introducing CodeBroker.

5.5 Comparing Active Information Systems with an Example in the Real World

Locating useful information from a large information repository is very similar to locating an item in a big store. Empirical studies (Reeves, 1991) on the interaction between customers and sales agents in McGuckin Hardware, a store that carries more than three hundred thousand items, have revealed that customers coming to the store often have a vague idea of what they want, and they often do not know where to start to find it. Sales agents, called Roamers, have incomplete or vague knowledge of the items in the store, but enough knowledge to direct customers toward an aisle where things of interest might be located, based on listening to customers’ descriptions. Once in that aisle, customers can incrementally improve the preciseness of the problem description, based on examining existing available tools, or by talking to another kind of agent, called a Green Apron, who is a specialist in the domain of that aisle.

Active information systems play the role of a Roamer by dynamically constructing a virtual “aisle” of information of interest. Through examining and evaluating the information in the virtual “aisle,” users have two ways to narrow their focus through retrieval-by-reformulation:

(1) directly manipulate the virtual “aisle” to remove apparently irrelevant items to narrow the collection for easier choice, or

(2) refine or reframe their queries to start another round of locating.

5.6 The Spectrum of Support for Locating Information

As epitomized in the problem of locating reusable components analyzed in Chapters 3 and 4, a wide gap exists between the user needing information and the information system providing information. Three approaches exist to bridge this gap as shown in Figure 5.5.

The first approach is taken by passive information systems based on pure information access mechanisms. For users to acquire the needed information, they must bridge the gap by themselves after they have learned how to use the system, how to write appropriate queries, and how to anticipate the existence of information. This approach can be called the user-expert approach because the user is trained to be an expert in using the system.

The second approach is the computer-expert approach, in which the computer system plays the role of expert and tries to infer the needs of users and deliver the precisely relevant information. Although this approach is ideal, due to the incompleteness of task models, it is very difficult to implement such smart systems.

The third approach is the distributed-expert, or human-computer cooperation, approach. It acknowledges the fact that neither computers nor users have enough expertise to find the relevant information alone, and that the expertise is distributed among users and systems: users know their needs, and systems know what exists in their repositories. To overcome this symmetry of ignorance or asymmetry of knowledge (Rittel, 1984), cooperation is needed between users and computer systems. Active information systems incorporated with the retrieval-by-reformulation mechanism adopt such an approach. Their delivery mechanism first presents a set of potentially relevant items of information based on inferred task models, discourse models, and user models, and then users contribute to the process of information location through the retrieval-by-reformulation mechanism. This cooperation process is also a mutual learning process. From the users’ reformulation process, systems learn the knowledge level of users and the larger context of their tasks to augment the discourse model and user model that can make systems deliver more context-relevant information later. From the deliveries of systems, users learn the structure of information systems and the availability of relevant information, which can be utilized in their later retrieval activities.

[Figure 5.5 labels: User Needing Information; Information System; the gap between information needs of users and information system; pure delivery; delivery; reformulation; pure access]

Figure 5.5: The spectrum of support to information location

The first part of the figure describes the computer-expert approach in which the system plays the role of expert and presents relevant information based on a pure delivery mechanism; the second part describes the distributed-expert approach, in which the system and the user cooperate; the third part is the user-expert approach, in which the user plays the role of expert and locates the information using a pure access mechanism.

Chapter 6

Indexing and Retrieval Mechanisms in CodeBroker

This and the following chapters describe the system development effort of the research.

Two subsystems have been developed: CodeIndexer, which creates the component repository from existing Java programs and libraries, and CodeBroker, which assists Java programmers in locating and reusing components while creating new programs¹ (Figure 6.1).

Figure 6.1: The CodeIndexer and CodeBroker subsystems

After explaining the indexing and retrieval mechanisms used by CodeBroker, this chapter describes how CodeIndexer creates a component repository that CodeBroker can use.

¹ For the sake of brevity, throughout the thesis, the word CodeBroker refers to the whole system development effort, and CodeIndexer is used when the description involves indexing only.

6.1 Indexing and Retrieval Mechanisms

An effective retrieval mechanism is essential in deciding whether relevant reusable components can be located. An encoding schema that determines how to represent reusable components and reuse queries, and a relevance judgment criterion that determines whether a component is relevant to a reuse query, are two major considerations in the design of retrieval mechanisms. Encoding a component for the purpose of indexing can be based on its concepts, constraints, or code (see Section 9.2 for more details about other indexing methods used for component repositories). CodeBroker encodes components based on both concepts and constraints.

The CodeBroker system extracts the concept of a component from its associated documentation embedded in Java source programs in the format of doc comments, and the constraint from the signature of a component. Reuse queries are represented in the same format. Relevance is determined by the combination of concept similarity and constraint compatibility. CodeBroker uses both probabilistic model-based indexing and retrieval techniques (Robertson & Walker, 1994) and the Latent Semantic Analysis (LSA) technique (Landauer & Dumais, 1997) to compute concept similarity. Constraint compatibility is computed by signature matching (Zaremski & Wing, 1995). These methods are chosen because

(1) They require the least effort, among all suggested retrieval mechanisms for component repository systems, to encode components for indexing and retrieval.

(2) They are the easiest and the most straightforward way for programmers to formulate reuse queries for retrieval.

(3) The needed information to formulate a reuse query is readily available from the program editor.

(4) They are as effective as other complicated retrieval mechanisms in terms of retrieval performance (Frakes & Pole, 1994; Mili et al., 1997b).

6.1.1 Free-Text Indexing and Retrieval

Free-text indexing and retrieval is concerned with finding documents in free-text form (such as newspaper articles, research papers, books, and web pages) relevant to the queries submitted by users (Salton & McGill, 1983).

Free-text documents are indexed by terms. Terms can be controlled or un-controlled. In the controlled-term approach, an indexer is responsible for choosing the appropriate terms to index the document. Those controlled terms are also known as keywords. Because this process is often manually conducted, it is very time-consuming when the collection of documents gets large. Another problem with controlled terms is that people often choose different terms to describe the same document (Furnas et al., 1987; Harman, 1995), and even the same person may not be consistent in choosing terms.

The other approach to choosing terms is to automatically extract them from documents. Terms can be precisely the words appearing in documents. However, most free-text indexing systems use one or both of the following two techniques: stemming and stop list. Stemming reduces words to their morphological root forms. For example, computer, computing, compute, computation, and computational are all reduced to the form comput and all five words are represented by the term comput. The advantage of stemming is to allow users to find documents that contain morphological variations of the word in their queries. A stop list is simply a list of high-frequency words that are not used as terms. Words that appear in almost every document are not very useful in distinguishing one document from other documents in terms of relevance to user queries.
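The two techniques can be illustrated with a minimal Python sketch. The suffix list and the stop list below are illustrative assumptions chosen to reproduce the comput example; they are not the rules of any particular indexing system, and a production stemmer applies far more careful rules.

```python
import re
from collections import Counter

# Hypothetical stop list: a handful of high-frequency English words.
STOP_LIST = {"a", "an", "and", "the", "of", "to", "in", "is", "this", "that", "for"}

# Hypothetical suffix list, ordered so that longer suffixes are tried first.
SUFFIXES = ("ational", "ation", "ing", "er", "e")


def toy_stem(word):
    """Strip the first matching suffix, keeping a root of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Return the stemmed, stop-list-filtered terms of a text with their frequencies."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(w) for w in words if w not in STOP_LIST)


if __name__ == "__main__":
    print(extract_terms("The computer is computing a computation"))
```

The resulting term frequencies are the raw material for both the vector space model and the probabilistic model described next.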

The two primary models for indexing and retrieving free-text documents are the vector space model and the probabilistic model.

6.1.1.1 Vector Space Model

In the vector space model, documents and queries are represented as vectors of terms contained in the whole collection of documents, commonly known as a corpus. The value of each element in the vector reflects the importance of a particular term in representing the concept or meaning of that document. In a corpus containing $N$ terms, a document is represented by a vector in the $N$-dimensional space as follows:

$$D_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}) \qquad (6.1)$$

where $w_{i,j}$ is a value denoting the importance of the term $i$ in representing the concept of the document $D_j$.

The variation of the vector space model comes from the method of determining the value for each $w_{i,j}$. The simplest binary value model sets $w_{i,j}$ to 1 if the term $i$ is present in the document $j$, and to 0 if it is not. The binary value model does not reflect the fact that some terms appear more in a document and thus contribute more to the concept of the document. Term frequency ($tf_{i,j}$), which is the number of occurrences of term $i$ in document $j$, can be used as the value of $w_{i,j}$. Using term frequency favors longer documents because most longer documents tend to use the same word more often, as the verbosity hypothesis (Robertson & Walker, 1994) states. To level the ground, $tf_{i,j}$ can be normalized by being divided by the overall length of the document vector.

The term frequency model, normalized or not, treats all terms equally. However, terms that are limited to a few documents are more useful for discriminating documents from the rest of the corpus than terms that occur frequently across the entire corpus. To reflect this discrimination power of a term $i$, $w_{i,j}$ can be multiplied by its inverse document frequency ($idf_i$). The $idf$ for each term is defined as follows:

$$idf_i = \log\left(\frac{N}{df_i}\right) \qquad (6.2)$$

where

$N$ is the number of documents in the collection,

$df_i$ is the number of documents that include term $i$.

Queries submitted by users are also in free-text form and are represented as a query vector in the same way as document vectors:

$$Q = (q_1, q_2, \ldots, q_N) \qquad (6.3)$$

where

$q_i$ is [expression not recoverable from the source text],

$tf_i$ is the frequency of term $i$ in the query.

The relevance of a document to a query is determined by comparing the document vector against the query vector. It is common to use the cosine of the angle between two vectors as the criterion to judge the similarity ($sim$) of a document and a query:

$$sim(Q, D_j) = \frac{\sum_{i=1}^{N} w_{i,j} \times q_i}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{N} q_i^2}} \qquad (6.4)$$

When the document and the query (or two documents) are identical, their vectors should be identical in the vector space and the cosine is one; and when they share no common terms, namely, they are orthogonal to each other, the cosine is zero. Upon receiving a query from a user, the retrieval system should thus compute the cosine for each document against the query vector, and return those documents with higher cosine values to the user as the relevant documents.
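The ranking procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: weights are plain term frequency multiplied by inverse document frequency for both documents and queries, the documents are already tokenized, and the function names and the three-document corpus are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already extracted terms.
DOCS = [
    ["random", "number", "generator"],
    ["string", "length"],
    ["random", "string"],
]


def idf(term, docs):
    """Inverse document frequency: log(N / df); 0.0 for unseen terms."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


def weight_vector(tokens, vocabulary, docs):
    """One weight per vocabulary term: term frequency times idf."""
    tf = Counter(tokens)
    return [tf[term] * idf(term, docs) for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices ordered by decreasing cosine similarity."""
    vocabulary = sorted({term for doc in docs for term in doc})
    q = weight_vector(query, vocabulary, docs)
    sims = [cosine(q, weight_vector(doc, vocabulary, docs)) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: -sims[i])


if __name__ == "__main__":
    print(rank(["random", "number"], DOCS))
```

Note that a term occurring in every document receives an idf of zero in this sketch and therefore contributes nothing to the similarity.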

6.1.1.2 Probabilistic Model

The probabilistic model ranks documents in decreasing order of their evaluated probability of relevance to a user query. It makes use of formal theories of probability and statistics to evaluate, or estimate, those probabilities of relevance. The relevance probability is different from the similarity computed in the vector space model. The latter generally lacks the theoretical soundness of the relevance probability, which can be defined precisely. However, the computation of a theoretically sound probability is not practically tractable, and currently the probability can only be roughly approximated based on various simplification assumptions (Crestani et al., 1998).

The basis for all probabilistic models is the probability ranking principle, which asserts that optimal retrieval performance can be achieved when documents are ranked according to their probabilities of being judged relevant to a query (Robertson, 1977). Given a query $Q$, the main task of retrieval systems based on the probabilistic model is to compute the relevance probability $P(R \mid Q, D_j)$ for each query-document pair.

The relevance probability of a document can be estimated by assigning an appropriate weight to each term in the document corpus. Probabilistic models assume that terms are distributed differently in relevant and irrelevant documents, which is known as the cluster hypothesis (Van Rijsbergen, 1979). If a term appears more frequently in relevant than in irrelevant documents, it has more power to discriminate relevant from irrelevant documents. The discriminating power of a term is called its term relevance weight ($trw_i$) in probabilistic models, and its value is calculated by the following formula:

$$trw_i = \log\frac{p_i \times (1 - q_i)}{q_i \times (1 - p_i)} \qquad (6.5)$$

where $p_i$, $q_i$ represent the probability of the $i$th term appearing in a relevant or an irrelevant document, respectively.

The above formula can be computed only retrospectively on test collections where the relevance assessments of documents are known. At the time of regular document retrieval, we do not know yet which document is relevant or irrelevant; therefore, we do not know how to compute $p_i$ and $q_i$, and $trw_i$ can only be estimated.

A simplified formula, proposed by Croft and Harper (Croft & Harper, 1979), uses corpus information to make estimates and does not use the distribution probability ($p_i$ and $q_i$). In their model, $trw_i$ is computed as follows:

$$trw_i = \log\frac{(r + 0.5) / (R - r + 0.5)}{(n - r + 0.5) / (N - n - R + r + 0.5)} \qquad (6.6)$$

where

$N$ is the number of documents in the collection,

$n$ is the number of documents containing the term,

$R$ is the number of relevant documents,

$r$ is the number of relevant documents containing the $i$th term.

When $R$ and $r$ are not available, $trw_i$ can be further reduced to

$$trw_i = \log\frac{N - n + 0.5}{n + 0.5} \qquad (6.7)$$

which is similar to the inverse document frequency in the vector space model (see Equation (6.2), whose $df$ corresponds to $n$).

In Equation (6.6), only the presence and absence of terms (the binary value model) are considered. To take the term frequency within-document ($tf$) and term frequency within-query ($qtf$) into consideration, a more refined formula is proposed by Robertson et al. to estimate the probability of relevance between document $D_j$ and query $Q$ (Robertson et al., 1995):

$$P(R \mid Q, D_j) = \sum_{i \in Q} trw_i \times \frac{(k_1 + 1) \times tf_{i,j}}{K + tf_{i,j}} \times \frac{(k_3 + 1) \times qtf_i}{k_3 + qtf_i} \qquad (6.8)$$

where

$K$ is $k_1 \times ((1 - b) + b \times dl_j / avdl)$,

$k_1$, $k_3$, $b$ are parameters depending on the nature of the queries and the collection of the data. In CodeBroker, $k_1$ is set to 1.2, $k_3$ to 1.0, and $b$ to 0.75, according to the data in (Walker et al., 1998),

$dl_j$ is the length of document $j$,

$avdl$ is the average length of all documents.

CodeBroker uses Equation (6.8) to calculate the relevance probability of a document to a given query for two reasons. First, it can reuse the source code that is available through the distribution of the Remembrance Agent system (Rhodes & Starner, 1996). Second, Okapi, the system in which the equation has been implemented, has achieved retrieval performance comparable to other leading research prototypes of information retrieval systems (Robertson et al., 1995; Walker et al., 1998). For a more detailed explanation of why Equation (6.8) provides an estimation of $trw_i$, please see (Robertson & Walker, 1994). For the sake of brevity, Okapi will be used to refer to the probabilistic model used in CodeBroker.
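The scoring described by Equations (6.7) and (6.8) can be sketched as follows. This is a minimal Python illustration, not the Remembrance Agent or Okapi source code: it assumes that no relevance information is available (so the term relevance weight falls back to Equation (6.7)), that document length is counted in terms, and that the sum runs over the distinct query terms. The parameter values are the ones reported for CodeBroker; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

K1, K3, B = 1.2, 1.0, 0.75  # parameter values reported for CodeBroker

# Hypothetical toy corpus of already extracted terms.
DOCS = [
    ["random", "number", "generator"],
    ["string", "length"],
    ["read", "file"],
    ["sort", "array"],
]
QUERY = ["random", "number"]


def trw(n_docs, n_containing):
    """Term relevance weight without relevance information (Equation 6.7)."""
    return math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5))


def okapi_score(query, doc, docs):
    """Score one document against a query in the manner of Equation (6.8)."""
    avdl = sum(len(d) for d in docs) / len(docs)
    big_k = K1 * ((1 - B) + B * len(doc) / avdl)
    tf, qtf = Counter(doc), Counter(query)
    score = 0.0
    for term in qtf:
        if tf[term] == 0:
            continue  # a term absent from the document contributes nothing
        n = sum(1 for d in docs if term in d)
        score += (trw(len(docs), n)
                  * (K1 + 1) * tf[term] / (big_k + tf[term])
                  * (K3 + 1) * qtf[term] / (K3 + qtf[term]))
    return score


if __name__ == "__main__":
    for index, document in enumerate(DOCS):
        print(index, round(okapi_score(QUERY, document, DOCS), 4))
```

One property worth noting is that the weight of Equation (6.7) becomes negative for a term that occurs in more than half of the documents, which is one more reason to remove very common words with a stop list.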

6.1.1.3 Latent Semantic Analysis

LSA is an extension of the vector space model. The vector space model assumes that terms are independent from each other and does not take their semantics into consideration; therefore, it suffers from the concept-based retrieval problem (also known as vocabulary mismatch, discussed in Section 3.4.3.2): If programmers use terms different from those used in the descriptions of components, they cannot find what they want. By constructing a large semantic space of terms to capture the overall pattern of their associative relationship, LSA is expected to facilitate concept-based retrieval and bridge the conceptual gap in formulating reuse queries.

The indexing process of LSA starts with creating a semantic space with a large corpus of training documents in a specific domain. It first creates a vector for each document in the corpus in the same way that the vector space model does. All vectors for documents of the corpus compose a large term-by-document matrix $X$.

$$X = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,M} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,M} \\ \vdots & \vdots & & \vdots \\ w_{N,1} & w_{N,2} & \cdots & w_{N,M} \end{pmatrix}$$

where the columns of the matrix represent the documents in the collection ($M$ documents in the corpus), and the rows represent the terms ($N$ terms in the corpus). The term-by-document matrix $X$ is then decomposed, by means of singular value decomposition, into the product of three matrices: $T_0$, $S_0$, and $D_0'$.

$$X = T_0 \times S_0 \times D_0' = \begin{pmatrix} t_{1,1} & \cdots & t_{1,r} \\ \vdots & & \vdots \\ t_{N,1} & \cdots & t_{N,r} \end{pmatrix} \times \begin{pmatrix} s_{1,1} & & \\ & \ddots & \\ & & s_{r,r} \end{pmatrix} \times \begin{pmatrix} d_{1,1} & \cdots & d_{1,M} \\ \vdots & & \vdots \\ d_{r,1} & \cdots & d_{r,M} \end{pmatrix}$$

$T_0$ is an orthogonal matrix having the left singular vectors of $X$, $D_0'$ is also an orthogonal matrix having the right singular vectors of $X$, and the diagonal matrix $S_0$ is the singular value matrix whose rank is $r$, the smaller number of $N$ and $M$. $s_{j,j}$ ($1 \le j \le r$) are singular values, and they appear in decreasing order along the diagonal of the matrix $S_0$, namely, $s_{1,1} \ge s_{2,2} \ge \cdots \ge s_{r,r}$.

The hypothesis behind LSA holds that because of synonymy and polysemy in natural languages, there is much noise in the matrix $X$, and if the rank of the singular value matrix $S_0$ is reduced (by getting rid of less significant singular values) to a much smaller number $k$ to obtain another singular value matrix $S$, the noise is reduced too. The value of $k$ often ranges from 40 to 400, but the best value of $k$ still remains an open question and needs to be empirically determined. The $T_0$ matrix with the size of $N \times r$ is reduced to $T$ with the size of $N \times k$, and $D_0'$ with the size of $r \times M$ is reduced to $D'$ with the size of $k \times M$.

A new matrix $\hat{X}$, viewed as the semantic space of the domain represented by the corpus, is constructed through the product of the three reduced matrices: $T$, $S$, and $D'$.

$$X \approx \hat{X} = T \times S \times D' = \begin{pmatrix} t_{1,1} & \cdots & t_{1,k} \\ \vdots & & \vdots \\ t_{N,1} & \cdots & t_{N,k} \end{pmatrix} \times \begin{pmatrix} s_{1,1} & & \\ & \ddots & \\ & & s_{k,k} \end{pmatrix} \times \begin{pmatrix} d_{1,1} & \cdots & d_{1,M} \\ \vdots & & \vdots \\ d_{k,1} & \cdots & d_{k,M} \end{pmatrix}$$

In this new matrix $\hat{X}$, each row represents the position of each term in the semantic space. Terms are re-represented in the newly created semantic space. The reduction of singular values is important because it captures only the major, overall pattern of associative relationships among terms by ignoring the noise accompanying most automatic thesaurus construction simply based on co-occurrence statistics of terms.

After the semantic space is created, each document is represented as a vector in the semantic space based on terms contained, and so is a query. The similarity of a query and a document is thus determined by the cosine of the two vectors as in Equation (6.4). A document matches a query if their similarity value is above a certain threshold value.
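The effect of rank reduction can be illustrated with a small, dependency-free Python sketch. It does not reproduce how CodeBroker computes its semantic space: in place of a real singular value decomposition routine, it uses power iteration with deflation on the Gram matrix, which is adequate only for tiny matrices with well-separated singular values. The function names and the four-term corpus are hypothetical.

```python
import math


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def truncated_svd(x, k, iterations=500):
    """Return up to k largest singular triplets (s, term_vector, doc_vector) of x."""
    xt = transpose(x)
    # Gram matrix X'X: its eigenvectors are the right singular vectors of X.
    gram = [[sum(a * b for a, b in zip(ci, cj)) for cj in xt] for ci in xt]
    size = len(gram)
    triplets = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(size)]  # arbitrary start vector
        for _ in range(iterations):
            w = mat_vec(gram, v)
            norm = math.sqrt(sum(c * c for c in w))
            if norm == 0.0:
                break
            v = [c / norm for c in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(gram, v)))
        if eigenvalue <= 1e-12:
            break  # no significant singular value is left
        s = math.sqrt(eigenvalue)
        u = [c / s for c in mat_vec(x, v)]
        triplets.append((s, u, v))
        # Deflation: remove the component that has just been found.
        gram = [[gram[i][j] - eigenvalue * v[i] * v[j] for j in range(size)]
                for i in range(size)]
    return triplets


def term_positions(x, k):
    """Rows of T x S: the position of each term in the k-dimensional space."""
    triplets = truncated_svd(x, k)
    return [[u[i] * s for s, u, _ in triplets] for i in range(len(x))]


# Rows: car, automobile, engine, flower. Columns: three tiny documents.
TERM_BY_DOC = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    for row in term_positions(TERM_BY_DOC, 1):
        print(row)
```

In the toy matrix, car and automobile never occur in the same document, so their raw row vectors are orthogonal; after reduction to one dimension they occupy the same position because both co-occur with engine, which is the kind of associative pattern the reduction is meant to capture.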

The corpus used by CodeBroker to create the LSA semantic space for Java programming comes from four sources: Linux on-line manuals, programming textbooks, the Java language specification and virtual machine specification, and Java class libraries (component repositories). These four types of documents are chosen because they cover the domain knowledge a Java programmer needs: knowledge about the computer and operating systems, which is covered by Linux manuals; knowledge about programming in general, which is covered by programming textbooks; knowledge about programming in Java, which is covered by the Java specifications; and knowledge about reusable components, which is covered by the Java class libraries. The corpus contains 78,475 documents and 10,988 different terms after common and extremely rare words are cut off. A word is considered extremely rare if it appears only once, in a single document. This cutoff removes the esoteric abbreviations that are common in Linux on-line manuals but not used elsewhere.

6.1.2 Signature Matching

Signature matching is the process of determining the compatibility of two components in terms of their signatures (Zaremski & Wing, 1995). It is an indexing and retrieval mechanism based on type constraints of a module or a component (see Section 5.2.2.1).

Two signatures

Sig1 : OutTypeExp1 <- InTypeExp1

Sig2 : OutTypeExp2 <- InTypeExp2

match if and only if InTypeExp1 is in structural conformance with InTypeExp2, and OutTypeExp1 is in structural conformance with OutTypeExp2. Two type expressions are structurally conformant if they are formed by applying the same type constructor to structurally conformant types.

This definition of signature matching is very restrictive because it misses components whose signature does not exactly match, but that are practically similar enough to be reusable after slight modification.

Partial signature matching relaxes the definition of structural conformance of types: A type is considered as conforming to its more generalized form or more specialized form. For procedural types, if there is a path from type $T_1$ to type $T_2$ in the type lattice, $T_1$ is a generalized form of $T_2$, and $T_2$ is a specialized form of $T_1$. For example, in most programming languages, integer is a specialized form of float; and float is a generalized form of integer. For object-oriented types, if $T_1$ is a subclass of $T_2$, $T_1$ is a specialized form of $T_2$, and $T_2$ is a generalized form of $T_1$.

The constraint compatibility value between two signatures is the product of the conformance values between their types. The type conformance value is 1.0 if two types are in structural conformance according to the definition of the programming language. It drops by a certain percentage if one type conversion is needed or if there is an immediate inheritance relationship between them. The signature compatibility value is 1.0 if two signatures exactly match.
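The computation of the constraint compatibility value can be sketched as follows. The text does not state the size of the drop or how input types are paired, so the penalty of 0.5, the table of one-step conversions, and the positional pairing of input types are all assumptions made for illustration; the function names are hypothetical.

```python
# Assumed penalty applied when one conversion or one inheritance step is needed.
PENALTY = 0.5

# Assumed one-step relations: (specialized form, generalized form).
ONE_STEP = {("int", "long"), ("int", "float"), ("float", "double")}


def type_conformance(t1, t2):
    """1.0 for identical types, PENALTY for a one-step relation, else 0.0."""
    if t1 == t2:
        return 1.0
    if (t1, t2) in ONE_STEP or (t2, t1) in ONE_STEP:
        return PENALTY
    return 0.0


def signature_compatibility(sig1, sig2):
    """Product of the conformance values of the output and input types.

    A signature is a pair (output_type, list_of_input_types); input types
    are compared position by position (an assumption of this sketch).
    """
    (out1, ins1), (out2, ins2) = sig1, sig2
    if len(ins1) != len(ins2):
        return 0.0
    value = type_conformance(out1, out2)
    for a, b in zip(ins1, ins2):
        value *= type_conformance(a, b)
    return value


if __name__ == "__main__":
    query = ("int", ["int", "int"])
    print(signature_compatibility(query, ("int", ["int", "int"])))
    print(signature_compatibility(query, ("long", ["int", "int"])))
```

Because the value is a product, a single pair of unrelated types drives the whole compatibility value to zero, whereas each needed conversion only scales it down.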

A class signature is composed of its data definition part and method definition part. Signature matching of two classes $C_1$ and $C_2$ requires that both their data definition parts match and their method definition parts match, respectively. Data definition parts are treated as record types, and are compared according to their structural conformance.

The method definition parts of two classes $C_1$ and $C_2$ match if for each method $m_1$ in class $C_1$, there is a corresponding method $m_2$ in class $C_2$ such that $m_1$ matches $m_2$ according to signature matching for methods. Correspondence from $m_1$ to $m_2$ is decided based on the best match principle: among all methods from $C_2$, if $m_2$ has the highest signature compatibility with $m_1$, it is considered as the matching method of $m_1$. The compatibility value of the method definition part is thus the average value of compatibility values existing between pairs of matching methods.

Because classes inherit data and methods from their parent classes, comparing two classes only is not enough if they do not inherit immediately from the same class. Inherited data definitions and methods must be taken into consideration. A common ancestor class is located first, and then all data definitions and method definitions in between the common ancestor and the compared classes should be added.

CodeBroker does not implement signature matching for classes due to the following two considerations. First, the primary goal of CodeBroker is to deliver reusable components before programmers start to implement the module. Unlike other object-oriented programming languages, such as C++, in which the declaration of class interfaces is usually separated from the implementation and is stored in a separate file such as a header file, the Java programming language does not provide a mechanism to separate the declaration of class interfaces from implementations. Java programmers seldom finish declaring the whole class signature (the class variables and all method signatures) before they start to fill in the implementation of methods. Therefore, in most cases, the class signature would not be available to CodeBroker for it to deliver components before implementation.

Second, signature matching alone is not very powerful in locating reusable components because the limited number of primary data types in a programming language such as Java leads to few variations of signatures. However, signature matching can play an important differential role when many components with similar concepts are retrieved, and programmers need to choose one that fits their type constraints. The main goal of using signature matching in CodeBroker is to make choosing components easier when there is a set of components intended for the same purpose but implemented for different data types. For example, a reasonably good component repository that contains random number generators needs a set of components that create random numbers of different types such as integers, floats, and long integers. These components usually have the same functionality descriptions, namely, the same concepts, and the signature-matching process would help programmers identify the desired component immediately without too much browsing effort.

6.2 Creating the Component Repository

The component repository in CodeBroker is created by its indexing subsystem, CodeIndexer. CodeIndexer extracts and indexes functional descriptions (concepts) and signatures (constraints) from the HTML-based online documentation generated by running Javadoc over Java source programs (Figure 6.2).

[Figure 6.2 diagram: Java Source Programs, Javadoc, HTML-based Documents, CodeIndexer (Concept Indexing, Signature Indexing), Component Repository]

Figure 6.2: The process of creating a component repository from Java programs

Javadoc generates documentation, in HTML format, for Java programs from their source files. In the HTML documentation, each Java class has its own HTML-formatted document file, which is cross-linked to the document files of its super-classes and sub-classes. The contents of a class document describe the functionality of the class and all of its methods. Those descriptions are extracted from doc comments associated with each class and method. An example of a Javadoc document is shown in Figure 6.3. Documentation for Java components distributed with JDK (Java Development Kit) by Sun Microsystems, Inc. is also generated by Javadoc. Other component developers create documentation for their components in the same fashion.

CodeIndexer creates indexes for Java components in two steps. First, it extracts needed information for indexing from Javadoc documents and converts it into the CodeBroker indexing format that can be processed by the indexing program. Each method of a class is treated as a document to be independently indexed, although in Javadoc documentation, all method descriptions of a class appear as one physical file. Five types of information are extracted for the purpose of indexing a method component: the full class name (including the package name and class name); the HTML tag name which specifies the exact location of the method in the Javadoc document; the method name; the signature; and the description of the method included in the doc comment for the method (Figure 6.4). Doc comments of Java may use special tags, which begin with the @ character and allow Javadoc to provide additional formatting for the documentation. For example, some doc comments may include @author to specify the author of the component, or @see to specify a link to related methods or classes. These tags could provide additional indexing information to narrow the range of components to be located. For instance, a programmer may be interested in components written by a specific author only. However, the current version of CodeBroker does not support this, and all special tags, along with their contents, are removed.
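A reader of the indexing format can be sketched in a few lines of Python. The sketch assumes that every method document starts with NEW METHOD:: and contains the five fields CLS, TAG, MET, SIG, and DEF in that order, as in Figure 6.4; the function name and the dictionary representation are hypothetical and are not taken from CodeIndexer.

```python
FIELDS = ("CLS", "TAG", "MET", "SIG", "DEF")


def parse_index_documents(text):
    """Split indexing text into one dictionary per method document."""
    documents = []
    # Everything before the first marker is ignored.
    for chunk in text.split("NEW METHOD::")[1:]:
        record = {}
        position = 0
        for i, field in enumerate(FIELDS):
            start = chunk.index(field + ":", position) + len(field) + 1
            if i + 1 < len(FIELDS):
                end = chunk.index(FIELDS[i + 1] + ":", start)
            else:
                end = len(chunk)
            # Collapse line breaks and repeated blanks inside a field value.
            record[field] = " ".join(chunk[start:end].split())
            position = end
        documents.append(record)
    return documents


SAMPLE = """NEW METHOD::
CLS: java.lang.String
TAG: length
MET: length
SIG: int length()
DEF: Returns the length of this string.
NEW METHOD::
CLS: java.lang.String
TAG: charAt
MET: charAt
SIG: char charAt(int index)
DEF: Returns the character at the specified index.
"""

if __name__ == "__main__":
    for document in parse_index_documents(SAMPLE):
        print(document["CLS"], document["MET"], "|", document["SIG"])
```

Treating each method as its own record mirrors the decision to index methods independently even though Javadoc stores all methods of a class in one file.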

The second step of CodeIndexer creates, from the CodeBroker indexing documents in the format of Figure 6.4, three index files: the probabilistic model index file (or Okapi index file, for short), the LSA index file, and the signature index file. The Okapi index file and LSA index file contain the concept indexes of components, and the signature index file contains the signature indexes.

Figure 6.3: An example of a document generated by Javadoc

The Okapi index for a component consists of terms and their frequencies appearing in the doc comment. A term is the stemmed form of an English word, which is not included in the stop list.

The LSA index for a component is a float vector with length $k$ calculated by the following equations²:

$$V_{LSA} = (v_1, v_2, \ldots, v_k) \qquad (6.9)$$

$$v_i = \left(\sum_{j=1}^{N} tf_j \times t_{j,i}\right) \times s_{i,i}^{-1} \qquad (6.10)$$

where

$k$ is the number of singular values in the pre-computed semantic space,

$N$ is the number of terms in the semantic space,

$tf_j$ is the frequency of the term $j$ in the component,

$t_{j,i}$ is the $i$th element of the term vector of term $j$ in the pre-computed semantic space; terms in the component but not in the corpus are discarded,

$s_{i,i}$ is the singular value.

    NEW METHOD::
    CLS: java.lang.String
    TAG: length
    MET: length
    SIG: int length()
    DEF: Returns the length of this string. The length is equal to the number of 16-bit Unicode characters in the string.

    NEW METHOD::
    CLS: java.lang.String
    TAG: charAt
    MET: charAt
    SIG: char charAt(int index)
    DEF: Returns the character at the specified index. An index ranges from 0 to length() - 1.

Figure 6.4: The indexing format of method documents in CodeBroker

In CodeBroker, each method is represented by an independent document, which starts with the line NEW METHOD::, followed by 5 fields: CLS for the full class name, TAG for the HTML tag of the method, MET for the method name, SIG for the signature, and DEF for the description of the method.

² $k$ is the reduced number of singular values in the LSA semantic space created from the training corpus. In CodeBroker, $k$ is equal to 278, which is the number of non-zero singular values in the semantic space.
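The computation of an LSA index from a pre-computed semantic space can be sketched as follows. The sketch assumes that Equation (6.10) denotes the usual LSA folding-in step, that is, the frequency-weighted sum of term vectors with each dimension divided by the corresponding singular value; the function name and the two-dimensional toy space are hypothetical.

```python
def lsa_index(term_frequencies, term_vectors, singular_values):
    """Project a component into the semantic space.

    term_frequencies: {term: frequency in the component}
    term_vectors:     {term: list of k floats from the semantic space}
    singular_values:  list of k singular values
    """
    k = len(singular_values)
    sums = [0.0] * k
    for term, tf in term_frequencies.items():
        vector = term_vectors.get(term)
        if vector is None:
            continue  # terms that are not in the corpus are discarded
        for i in range(k):
            sums[i] += tf * vector[i]
    return [sums[i] / singular_values[i] for i in range(k)]


# Hypothetical two-dimensional semantic space.
TERM_VECTORS = {"random": [0.5, 0.1], "number": [0.3, -0.2]}
SINGULAR_VALUES = [2.0, 1.0]

if __name__ == "__main__":
    component = {"random": 2, "number": 1, "frobnicate": 3}
    print(lsa_index(component, TERM_VECTORS, SINGULAR_VALUES))
```

A query can be projected in the same way, after which the cosine of Equation (6.4) compares the two $k$-dimensional vectors.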

The signature index for a component is in the following format:

    5617 getInt : int <- int x int

where the leftmost number is the identifier number assigned to each component. The string following the number is the component name (getInt), the string following the colon (:) is the returned type (int), and the string following the left arrow (<-) specifies the input type(s) (int x int).
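An entry in this format can be read back with a few string operations. The following Python sketch assumes that input types are separated by " x " and that every entry has at least one input type; how a method without parameters is encoded is not stated in the text, so the empty-input branch is a guess. The function name is hypothetical.

```python
def parse_signature_entry(line):
    """Return (identifier, name, output_type, input_types) for one entry."""
    head, types = line.split(":", 1)
    identifier, name = head.split()
    output, inputs = types.split("<-", 1)
    if inputs.strip():
        input_types = [t.strip() for t in inputs.split(" x ")]
    else:
        input_types = []  # assumed encoding of a method without parameters
    return int(identifier), name, output.strip(), input_types


if __name__ == "__main__":
    print(parse_signature_entry("5617 getInt : int <- int x int"))
```

The identifier links the signature entry to the concept indexes of the same component, so that a component retrieved by concept can be checked against a constraint query.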

To speed up the locating process and to reduce the size of indexing files, all three index

files are encoded and stored as a database file.

The indexing mechanism can easily create a component repository from any Java source programs. However, components scavenged from ordinary programs present more challenges for reuse other than the locating problem, such as the low quality issue of documents and code.

Because the focus of this research is to help programmers discover reusable components, the current version of CodeBroker includes the Java 1.1.8 Core API library and JGL 3.0 (Java General Library from ObjectSpace Inc.), both of which are of high quality and well documented. There are 503 and 170 classes in Java 1.1.8 and JGL, respectively, and a total of 7,338 method components.

Chapter 7

Locating and Delivering Components in CodeBroker

This chapter describes in detail the techniques used by CodeBroker. The interface of CodeBroker is integrated with the programming environment to create an information-enriched workspace that supports reuse within development. Compared to passive component repository systems that automate the retrieval process only, CodeBroker also eliminates the steps of forming reuse intentions and formulating reuse queries (see Figure 3.1).

CodeBroker runs, as an active process, in the background of a programmer’s working environment (a program editor, Emacs) to monitor the programmer’s interactions with it. Using the programmer’s current working context as retrieval cues, it automatically locates and delivers context-sensitive reusable components.

7.1 System Architecture

The architecture of CodeBroker is shown in Figure 7.1. CodeBroker consists of three software agents: Listener, Fetcher, and Presenter. A software agent is a software entity that functions autonomously in response to the changes in its running environment without requiring human guidance or intervention (Bradshaw, 1997; Ye & Reeves, 2000). The Listener agent captures the task in which programmers are currently engaged by monitoring and analyzing their interactions with the editor, and creates reuse queries as task models. Those queries are then passed to the Fetcher agent, which retrieves the matching components from the component repository. Reusable components retrieved by Fetcher are passed to the Presenter agent, which uses discourse models and user models to remove unwanted components and delivers the remaining ones in the Reusable Component Information-display (RCI-display, for short) placed below the editor window. Through the RCI-display, programmers can invoke the retrieval-by-reformulation mechanism supported by CodeBroker to either directly manipulate the delivered components or refine the queries if they cannot immediately find useful components. The retrieval-by-reformulation mechanism also helps to evolve discourse models and user models.

[Figure 7.1 diagram. Elements: Program Editor, RCI-display, Listener, Fetcher, Presenter, Component Repository, User Model. Data flows: working products, concept queries, constraint queries, retrieved components, delivered components. Control flows: analyzes, retrieves, delivers, manipulate, refine, update, automatic update, concept indexing, signature indexing. Legend: bubbles show the contents of a data flow; labels show the action of a control flow.]

Figure 7.1: The architecture of the CodeBroker system

7.2 Listener

The Listener agent runs continuously in the background of Emacs to monitor the input of programmers. Its goal is to model the immediate programming task of programmers. As described in Section 5.2.2, Listener uses similarity analysis to create task models that are used as reuse queries passed to the Fetcher agent. Two types of reuse queries are autonomously created by Listener: the concept query extracted from doc comments, and the constraint query extracted from signatures. A task model may include the concept query only or include both the concept query and the constraint query.

7.2.1 Creating Concept Queries from Doc Comments

Doc comments in Java begin with /** and end with */. Whenever a slash character (/) is entered into the editor, Listener scans one character backward to see if the slash is preceded by an asterisk (*). If that is the case, Listener scans backward further until it finds the string /**, or the end of another statement in case the programmer has not written a legal doc comment statement. If a legal doc comment statement is found, the contents between /** and */ are extracted and passed by the Listener agent as a concept query to the Fetcher agent, which locates and delivers components that match the concept query.

Figure 7.2 shows an example of component delivery based on concept queries. A programmer wants to create a random number between two long integers, and before he or she implements it (i.e., writes the code part of the program), the programmer indicates his or her task in the doc comment. As soon as the comment is written, a task model, including the concept query only, is created and represented in the same format as is used for indexing component documents (Section 6.2, Figure 6.4):

DEF: Create a random number between two limits.

Components whose functionality descriptions or concepts show similarity to this task model are delivered by CodeBroker.
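The backward scan described in this section can be illustrated with a small Python sketch. It is not CodeBroker's implementation, which runs inside Emacs. The function name `extract_concept_query`, the `buffer` and `cursor` parameters, and the crude test for "the end of another statement" (a semicolon or closing brace inside the candidate comment) are all assumptions made for illustration.

```python
from typing import Optional


def extract_concept_query(buffer: str, cursor: int) -> Optional[str]:
    """Return the doc comment text that ends at `cursor`, or None.

    `buffer` is the editor text and `cursor` is the index just past the
    slash that was typed (both names are hypothetical).
    """
    # The trigger is a slash that is immediately preceded by an asterisk.
    if cursor < 2 or buffer[cursor - 2:cursor] != "*/":
        return None
    # Scan backward for the opening "/**" of the doc comment.
    start = buffer.rfind("/**", 0, cursor - 2)
    if start == -1:
        return None
    body = buffer[start + 3:cursor - 2]
    # Assumed heuristic: a statement terminator between the opener and the
    # closer suggests that this is not one well-formed doc comment.
    if ";" in body or "}" in body:
        return None
    # Drop the leading asterisks that Javadoc style puts on each line.
    lines = [line.strip().lstrip("*").strip() for line in body.splitlines()]
    return " ".join(line for line in lines if line)
```

One limitation of this simplification is that a doc comment whose prose contains a semicolon is rejected; a real scanner would need to track comment boundaries more carefully.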

7.2.2 Creating Constraint Queries from Signatures

Concept queries or concept-based task models are often not complete enough to describe what the programmer wants. For example, in Figure 7.2, the doc comment from which the concept query is created does not say the method must take two long integers as input. Although the fourth component in the RCI-display buffer, the signature of which is shown in the message buffer (the last line of the window), could be modified to achieve the task, it would be better to find a component that can be immediately integrated without modification.

Figure 7.2: Component delivery based on concept queries only

Components in RCI-display are delivered based on the task indicated in the doc comments preceding the cursor. The third component (nextBytes) and the fourth one (getInt) in RCI-display have the potential to be reused because their concepts are similar to the concept of the method under development.

Further information about the task could be acquired from signatures that reveal the type constraints of modules under development (Section 6.1.2). In the previous example (Figure 7.2), as the programmer proceeds to declare the signature, the Listener agent refines the task model by taking into consideration the constraint requirements for reusable components.

As Figure 7.3 shows, when the programmer types the left brace { (just before the cursor), Listener is able to recognize it as the end of a signature definition, and creates a constraint query in the format of signature: long <- long x long. A more precise task model, including both the concept query and the constraint query, is created as follows:

SIG: getRandomNumber: long <- long x long

DEF: Create a random number between two limits.

Figure 7.3: Component delivery based on both concept queries and constraint queries

Components in RCI-display are delivered based on both the doc comment and the signature. The first component, which was not shown in Figure 7.2, can be reused by the programmer because it fully matches the programming task.

The RCI-display in Figure 7.3 shows components delivered based on this task model. Note that the first component in the RCI-display has exactly the same signature, shown in the message buffer, as the one extracted from the editor, and can be reused without further modification.
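As a rough illustration of how a method header could be turned into a constraint query of the form `long <- long x long`, consider the following Python sketch. The function name `constraint_query` and the regular expression are assumptions; the sketch handles only simple headers and does not reproduce Listener's actual parsing.

```python
import re

# Matches a simple Java method header that ends with the opening brace, e.g.
# "public static long getRandomNumber(long from, long to) {".
_HEADER = re.compile(
    r"(?P<ret>[\w.\[\]<>]+)\s+(?P<name>\w+)\s*\((?P<params>[^)]*)\)\s*"
    r"(?:throws\s+[\w.,\s]+)?\{\s*$"
)


def constraint_query(header: str):
    """Return (method_name, "ret <- t1 x t2"), or None if nothing matches."""
    match = _HEADER.search(header)
    if match is None:
        return None
    param_types = []
    for param in match.group("params").split(","):
        words = [w for w in param.split() if w != "final"]
        if words:
            # A parameter reads "Type name"; the words before the name
            # make up its type.
            param_types.append(" ".join(words[:-1]))
    # Assumption: an empty parameter list is written as "void".
    inputs = " x ".join(param_types) if param_types else "void"
    return match.group("name"), f"{match.group('ret')} <- {inputs}"
```

For the header in Figure 7.3, assuming it reads like `public static long getRandomNumber(long from, long to) {`, the sketch yields the name `getRandomNumber` and the signature `long <- long x long`, matching the SIG line of the task model.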

7.2.3 Updating User Models

In addition to modeling the programming task by creating concept queries and constraint queries, the Listener agent is also responsible for the automatic updating of user models. More details on this are available in Section 7.4.3.2.

7.3 Fetcher

The Fetcher agent performs the retrieval process based on the retrieval mechanisms introduced in Section 6.1. When Listener passes a concept query, Fetcher computes, using both the Okapi technique (Section 6.1.1.2) and the LSA technique (Section 6.1.1.3), the concept similarity value from each component in the repository to the query, and returns those components whose similarity value ranks in the top 20. The number 20 is the threshold value used by Fetcher to determine what components are to be regarded as relevant; it can be customized by programmers.

The concept similarity value between a query $q$ and a component $c$ is determined by the following formula:

$$\mathit{Sim}_{concept}(q, c) = w_1 \cdot \mathit{Sim}_{Okapi}(q, c) + w_2 \cdot \mathit{Sim}_{LSA}(q, c) \tag{7.1}$$

where

$\mathit{Sim}_{Okapi}(q, c)$ is computed according to Equation (6.8), as is used in the probabilistic model-based Okapi system;

$\mathit{Sim}_{LSA}(q, c)$ is computed according to LSA; and

$w_1$ and $w_2$ are the weights assigned to each model, with $w_1 + w_2 = 1$. Each of them can be changed on the fly by programmers by issuing the command cb-set-lsa-weight from Emacs.

In order to find the best retrieval mechanism, I have experimented with the following three methods to compute the concept similarity value:

(1) Using the Okapi model only (i.e., $w_1 = 1.0$ and $w_2 = 0.0$)

(2) Averaging the similarity value computed by the Okapi model and the one computed by LSA (i.e., $w_1 = 0.5$ and $w_2 = 0.5$)

(3) Using LSA only (i.e., $w_1 = 0.0$ and $w_2 = 1.0$).

The retrieval performance of method (1) was consistently better than that of the other two methods. Therefore, the default setting for the CodeBroker system is $w_1 = 1.0$ and $w_2 = 0.0$. For more details about the evaluation of retrieval mechanisms, see Section 8.1.
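The weighted combination and the top-20 cutoff can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the Okapi and LSA scores are taken as given inputs, and the sketch assumes the two scores are already on comparable scales.

```python
def concept_similarity(okapi_score: float, lsa_score: float,
                       lsa_weight: float = 0.0) -> float:
    """Weighted sum of the two scores; the two weights always sum to 1.

    The default lsa_weight of 0.0 corresponds to using the Okapi model only.
    """
    if not 0.0 <= lsa_weight <= 1.0:
        raise ValueError("lsa_weight must lie in [0, 1]")
    okapi_weight = 1.0 - lsa_weight
    return okapi_weight * okapi_score + lsa_weight * lsa_score


def top_components(scores: dict, k: int = 20) -> list:
    """Return the names of the k highest-scoring components, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Deriving one weight from the other keeps the constraint that the weights sum to 1, which is why a single setting is enough to move between the three experimental configurations.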

When both the concept query and the constraint query are passed, the Fetcher agent computes the similarity value by combining the concept similarity $\mathit{Sim}_{concept}$ (Equation 7.1) and the constraint compatibility $\mathit{Compat}_{constraint}$, which is determined by the signature matching process, according to the following formula:

$$\mathit{Similarity}(q, c) = w_c \cdot \mathit{Sim}_{concept}(q, c) + w_s \cdot \mathit{Compat}_{constraint}(q, c) \tag{7.2}$$

where $w_c + w_s = 1$, and the default values for them are each 0.5. Programmers can assign different weights to $w_c$ and $w_s$ to reflect the importance they assign to the concept similarity and constraint compatibility, respectively.
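To see how the constraint compatibility can change the order of delivery, the following Python sketch ranks candidates by the combined value. The function names and the scores in the usage note are made up for illustration; they are not measurements from CodeBroker.

```python
def combined_similarity(concept_sim: float, constraint_compat: float,
                        concept_weight: float = 0.5) -> float:
    """Weighted sum of concept similarity and constraint compatibility."""
    if not 0.0 <= concept_weight <= 1.0:
        raise ValueError("concept_weight must lie in [0, 1]")
    constraint_weight = 1.0 - concept_weight  # the two weights sum to 1
    return concept_weight * concept_sim + constraint_weight * constraint_compat


def rank(candidates: dict, concept_weight: float = 0.5) -> list:
    """Order component names by combined similarity, best first.

    `candidates` maps a name to (concept_sim, constraint_compat).
    """
    return sorted(
        candidates,
        key=lambda name: combined_similarity(*candidates[name], concept_weight),
        reverse=True,
    )
```

With the hypothetical scores `{"getInt": (0.6, 0.2), "getLong": (0.5, 1.0)}`, the default weights place `getLong` first because its signature is fully compatible, whereas a concept weight of 1.0 would place `getInt` first.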

7.4 Presenter

Retrieved components are shown to programmers by the Presenter agent in the RCI-display in decreasing order of similarity value.

7.4.1 Layered Information Presentation

Information about components is presented to programmers in different layers of abstraction due to the following two considerations. First, because the RCI-display has only a limited size (otherwise, it would take up too much working space of programmers), presented information should be condensed to accommodate as many components as possible for programmers to choose. Second, the evaluation of the usefulness of information by programmers consists of two stages: information discernment, whereby they grossly determine whether the component is relevant, and detailed evaluation, whereby they study the component thoroughly (see Section 3.4.3.5). Therefore, information shown in the RCI-display should contain the essential information on components only, and more detailed information should be displayed to programmers when they show interest in a particular component (Ye, 2001b).

The CodeBroker system presents information on components to programmers in three layers. The first layer is the RCI-display, in which each component is accompanied by its rank of similarity, its similarity value, its name, and a short description (Figures 7.2 and 7.3). The presentation of the second layer of information is triggered by the mouse movements of programmers. Component names and short descriptions in the RCI-display are mouse-sensitive. When the mouse cursor is moved over the component name, the signature of the component is shown in the mini-buffer (see the last lines in Figures 7.2 and 7.3); and when the mouse cursor is over the short description, terms contributing to the concept similarity between the component and the concept query are shown in the mini-buffer (Figure 7.4) to reveal why this component is retrieved and to help programmers refine their queries if necessary. The third layer of information, the most complete description of a component, is shown in an external HTML browser, such as Netscape Navigator. When the programmer left-clicks on the component, the full Javadoc documentation for the component is displayed in the browser. The HTML tag extracted at the time of indexing is used so that the browser can display the exact place of the component description.

Figure 7.4: Presenting more information triggered by mouse movement

The mini-buffer shows the keywords (terms) that contribute most to the concept similarity between the first component and the reuse query, which is not shown here.

7.4.2 Larger Context-Sensitive Presentation

The task model, created by the Listener agent from doc comments and signatures, describes the immediate programming task, namely, the module that the programmer is going to develop. However, programmers often do not give a complete description about their tasks in doc comments. Furthermore, a module is only a part of the whole development task, and the functionality of this module is deeply connected with other modules that have been developed so far. As mentioned in Section 5.2.3, if the component repository system knows what the whole development task is and the larger context under which the current module development is conducted, it can provide more appropriate information on reusable components.

CodeBroker captures this larger context in a discourse model (Section 5.2.3.2) that represents the previous interactions between the programmer and the system in one development session. The discourse model is used by Presenter as a filter to remove components in which the programmer is not interested in the current development session, although they are retrieved by Fetcher based on incomplete task models.

Java component repositories are organized hierarchically according to packages and classes, and packages and classes are often designed for particular application domains. For most programming tasks, only a part of the repository is involved. CodeBroker uses negative discourse models to capture what part of the repository is not of interest to programmers, for two reasons: discourse models are incrementally evolved by programmers during their interactions with the CodeBroker system, and in many cases it takes less effort for programmers to identify apparently irrelevant components. Section 7.5 explains in detail how the discourse model is incrementally augmented by programmers.

A discourse model in CodeBroker is in the format of a Lisp association list (Figure 7.5). It specifies packages or classes in which the programmer has no interest for the current development session. Before components retrieved by Fetcher are delivered to programmers, Presenter compares each component against the discourse model, and if the component belongs to a class or a package in the discourse model, it is removed.

Figure 7.5: An example discourse model

A discourse model is a Lisp list of items with the format: (package-name (class-name (method-name))). Empty class-name or method-name fields indicate that the whole package or the whole class should not be delivered in this session.
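The filtering rule can be sketched in Python by mirroring the nested list format with nested dictionaries. The representation and the function name `is_filtered` are assumptions chosen for illustration; CodeBroker itself stores the model as a Lisp association list.

```python
def is_filtered(model: dict, package: str, cls: str, method: str) -> bool:
    """Return True if the discourse model excludes the given component.

    `model` maps a package name to {class name: [method names]}. An empty
    dict excludes the whole package; an empty list excludes the whole class.
    """
    if package not in model:
        return False
    classes = model[package]
    if not classes:            # no class list: the whole package is excluded
        return True
    if cls not in classes:
        return False
    methods = classes[cls]
    # No method list: the whole class is excluded.
    return not methods or method in methods
```

For example, the model `{"java.awt": {"CardLayout": []}}` removes every method of java.awt.CardLayout while leaving other classes of java.awt deliverable.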

Discourse models also reduce the delivery of irrelevant components caused by polysemy, a difficult problem for any information retrieval system, by limiting searching domains, because polysemous words often have different meanings in totally different domains. For example, if the programming task is to shuffle a deck of cards, the programmer may use the word “card” in doc comments. That would make the system deliver components from the class java.awt.CardLayout, a GUI (Graphical User Interface) class in which “card” means a graphical element. If the current development project does not involve interface building, this whole class is irrelevant. The programmer can add the class (java.awt.CardLayout) or even the whole package (java.awt) to the discourse model to prevent components belonging to it from being delivered in this development session.

7.4.3 Personalized Component Presentation

The goal of active delivery in CodeBroker is to inform programmers of those components that fall into L3 (reuse-by-anticipation) and the area of (L4 - L3) (information islands) in Figure 4.1 (Section 4.1.2). Delivery of components from L2 (reuse-by-recall) and especially from L1 (reuse-by-memory) might be of little use, with the risk of making the unknown, really needed components less salient. Therefore, the system needs to know what components the programmer already knows. CodeBroker uses user models (Section 5.3.1) to represent programmers’ knowledge about the component repository to ensure user-specific delivery of components. User models in CodeBroker are both adaptable and adaptive (Thomas, 1996; Fischer & Ye, 2001).

User models in CodeBroker contain a list of components known to the programmer, namely, those components from L1 and L2. An example user model is shown in Figure 7.6. Each item in the list is a package, a class, or a method. Each component retrieved from the component repository is looked up in the user model before it is delivered. If a method component matches a method in the user model, and the user model indicates the programmer has used it more than three times (this number is adjustable by the programmer), the system assumes the programmer knows it already and removes it from the delivery. If the method has no use time, it was added by the programmer, who indicated that he or she knows it very well and does not want it delivered, so it is removed as well. If the class of the method (which has no method list in the user model), or the package of the method (which has no class list) is included in the user model, the method is also removed.

Figure 7.6: An example user model

A user model is a Lisp list of items with the format: (package-name (class-name (method-name use-time use-time ...))). When the use of a component is detected by the system, it is added to the list with the current time as the “use time.” If the component is added by the user, there is no use time. As with discourse models, empty class-name or method-name areas mean the whole package or class is included.
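The lookup rules for user models can be summarized in a short Python sketch. Nested dictionaries stand in for the Lisp list, and the names `is_known` and `min_uses` are hypothetical; the threshold of three uses follows the default described in this section.

```python
def is_known(user_model: dict, package: str, cls: str, method: str,
             min_uses: int = 3) -> bool:
    """Return True if the user model says the programmer knows the method.

    `user_model` maps package -> {class -> {method -> [use times]}}.
    Empty containers mean "the whole package" or "the whole class".
    """
    classes = user_model.get(package)
    if classes is None:
        return False
    if not classes:                    # package listed without a class list
        return True
    methods = classes.get(cls)
    if methods is None:
        return False
    if not methods:                    # class listed without a method list
        return True
    if method not in methods:
        return False
    use_times = methods[method]
    # No use time: added explicitly by the programmer.
    # Otherwise the method counts as known after more than `min_uses` uses.
    return len(use_times) == 0 or len(use_times) > min_uses
```

A method used only once or twice is therefore still delivered, which matches the intent of not hiding components the programmer may not yet know well.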

7.4.3.1 Adaptable User Models

Programmers can explicitly update their user models through interactions with CodeBroker. If they find a known component is delivered, they can invoke the retrieval-by-reformulation interface (see Section 7.5 for more details) to tell the system that they know the component already.

7.4.3.2 Adaptive User Models

Due to the large volume of components and the constantly evolving nature of repositories, it is a time-consuming task for programmers to maintain their user models. To reduce the difficulty of maintaining user models, user models in CodeBroker are also adaptive. As mentioned in Section 7.2.3, the Listener agent continuously monitors programmers’ input in the programming editor. In addition to modeling the programming task, Listener detects what components are used by a programmer and updates user models through the following three heuristic steps.

Figure 7.7: An illustrative program for adaptive user modeling

This program is excerpted from a user experiment and is slightly modified. The line number is added to make the explanation easier.

Step 1: Extracting Method Names

A method invocation in a Java program is followed by a pair of parentheses between which parameters are passed. When a left parenthesis ( is entered in the editor, Listener scans backward to extract the identifier preceding the left parenthesis. For example, in Figure 7.7, when the left parenthesis (where the cursor is placed)1 is entered, Listener extracts the identifier addElement. After that, Listener scans back further to determine if this identifier is a legal method name using the following rules.

(1) If the identifier is a Java keyword, such as the for in line 8, it is not a method name.

(2) If the identifier follows another word or a right square bracket, it is not a method invocation either; instead, it is the name of a new method developed by the programmer, such as the findDuplicates in line 5.

(3) If the identifier follows a dot (.), it is a method invocation, and the identifier is a legal method name. Listener scans further back to extract all characters preceding the dot until a white space is met. If these characters constitute a legal class name in Java, the method is a class method instead of an object method; otherwise, these characters are recognized by Listener as a variable name which will be used to find what class the method belongs to (described in Step 2).

(4) If the identifier follows termination characters of a Java statement, such as a semicolon (;), a left brace ({), or a right parenthesis, it is a class method name.

1 The extraction of the used component starts immediately after the left parenthesis is entered, not after the whole statement is entered, as shown in the figure. The whole statement is included in the figure simply to show what a method invocation looks like.
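The four rules of Step 1 can be sketched as a single classification function in Python. The function name, the labels it returns, and the partial keyword set are assumptions; the sketch applies the rules literally and is not Listener's Emacs implementation.

```python
import re

# A partial, illustrative set of Java keywords that may precede "(".
JAVA_KEYWORDS = {"for", "if", "while", "switch", "catch", "synchronized"}


def classify_identifier(text_before_paren: str):
    """Classify the identifier that precedes a just-typed "(".

    Returns (kind, identifier, qualifier). The kinds map to the rules as:
    "keyword" (1), "declaration" (2), "qualified-call" (3),
    "unqualified-call" (4), and "unknown" when no rule applies.
    """
    match = re.search(r"([A-Za-z_$][\w$]*)\s*$", text_before_paren)
    if match is None:
        return ("unknown", None, None)
    name = match.group(1)
    before = text_before_paren[:match.start()].rstrip()
    last = before[-1:]  # empty string when nothing precedes the identifier
    if name in JAVA_KEYWORDS:                        # rule (1)
        return ("keyword", name, None)
    if last == "]" or re.match(r"[\w$]", last):      # rule (2)
        return ("declaration", name, None)
    if last == ".":                                  # rule (3)
        # Everything before the dot, back to the nearest white space.
        qualifier = re.search(r"(\S*)\.$", before).group(1)
        return ("qualified-call", name, qualifier)
    if last in (";", "{", ")"):                      # rule (4)
        return ("unqualified-call", name, None)
    return ("unknown", name, None)
```

For a qualified call, the returned qualifier is either a class name or a variable name; deciding which one it is, and resolving a variable to its class, is the subject of Step 2.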

Step 2: Finding the Class of an Object Method

The class name of a variable can be extracted from the variable declaration statement. A variable declaration statement is recognized by Listener based on the following BNF syntax of Java (Gosling et al., 1996):

VariableDeclarationStatement := LocalVariableDeclaration;

LocalVariableDeclaration := final_opt Type VariableDeclarators

VariableDeclarators := VariableDeclarator |

VariableDeclarators, VariableDeclarator

VariableDeclarator := VariableDeclaratorId |

VariableDeclaratorId = VariableInitializer

VariableDeclaratorId := Identifier | VariableDeclaratorId [ ]

VariableInitializer := Expression | ArrayInitializer

Type := PrimitiveType | ReferenceType

ReferenceType := TypeName | Type []

TypeName := Identifier | PackageOrTypeName.Identifier

PackageOrTypeName := Identifier | PackageOrTypeName.Identifier

Each time a new variable is declared by a programmer, Listener stores the variable name and its class name in an association list. When an object method name and its variable name are extracted, Listener looks up the variable-class association list to find to which class the method belongs. For example, in Figure 7.7, the variable name for the method addElement (line 9) is tmpVec, which is declared as a Vector (line 6).
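A simplified version of the variable-class association list can be sketched in Python as follows. The regular expressions cover only a small part of the grammar above (an optional `final`, a type name with optional array brackets, and comma-separated declarators), and all names in the sketch are hypothetical.

```python
import re

# "final_opt Type VariableDeclarators ;" in a deliberately simplified form.
_DECLARATION = re.compile(
    r"^\s*(?:final\s+)?([A-Za-z_$][\w$.]*)((?:\s*\[\s*\])*)\s+([^;]+);\s*$"
)
# "Identifier" optionally followed by array brackets (VariableDeclaratorId).
_DECLARATOR_ID = re.compile(r"\s*([A-Za-z_$][\w$]*)((?:\s*\[\s*\])*)\s*")
# Words that can start a statement of the same shape but are not types.
_NOT_TYPES = {"return", "throw", "new", "import", "package", "else"}


def record_declaration(statement: str, var_classes: dict) -> None:
    """Add each variable declared in `statement` to `var_classes`."""
    match = _DECLARATION.match(statement)
    if match is None or match.group(1) in _NOT_TYPES:
        return
    type_name = match.group(1) + re.sub(r"\s", "", match.group(2))
    for declarator in match.group(3).split(","):
        # Only the part left of "=" names the variable; fragments of an
        # initializer (e.g. "2}") fail the full match and are skipped.
        left = _DECLARATOR_ID.fullmatch(declarator.split("=")[0])
        if left is not None:
            brackets = re.sub(r"\s", "", left.group(2))
            var_classes[left.group(1)] = type_name + brackets
```

After `record_declaration("Vector tmpVec = new Vector();", table)`, looking up `table["tmpVec"]` gives `Vector`, which is the class needed for the method addElement in the example.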

Step 3: Finding the Package of a Class

Because not all class names in a Java program include their package names, and class names are not unique, Listener needs to find to which package the class belongs. Listener first finds all packages that include the class from the list of indexed components, created by CodeIndexer at the time of indexing. If only one package is found, it is assumed to be the package of the class. If several packages are found, Listener will pick the package imported by the programmer in the package import statements (lines 1 and 2 in Figure 7.7). Whenever a package import statement is entered, Listener recognizes it based on its BNF syntax shown below:

ImportDeclaration := import TypeName; |

import PackageOrTypeName.*;

and creates a list of imported packages and classes. If the package of a class is unique in the imported package list, then the imported package becomes the package of the class; otherwise, the programmer has probably made a mistake,2 and the extracted method is ignored.

To make it easier to understand here, the three steps were described in the reverse order of their execution. Listener creates the list of imported packages first, followed by creating the variable-class list, followed by extracting method names. When Listener successfully extracts the method name and determines its class name and package name, it adds the component, including its class and package, with the current time as use time, to the user model. Listener adds only methods to the user model; it does not add a class or a package because the use of a class or a package does not mean that the developer knows the whole class or package.

2 This mistake will cause a compiler error. As an extension, CodeBroker could point out this error so that programmers can correct it before they submit the program to the compiler.
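The package lookup of Step 3 can be sketched in Python as follows. The function name and both data structures are assumptions: `indexed` stands in for the list of indexed components, and `imports` for the list of import declarations collected from the editor.

```python
from typing import Optional


def resolve_package(cls: str, indexed: dict, imports: list) -> Optional[str]:
    """Return the package of `cls`, or None if it cannot be decided.

    `indexed` maps a class name to the set of packages that contain it.
    `imports` holds import targets such as "java.util.*" or
    "java.util.Vector".
    """
    candidates = indexed.get(cls, set())
    if len(candidates) == 1:
        # Only one indexed package contains the class.
        return next(iter(candidates))
    imported = set()
    for entry in imports:
        package, _, last = entry.rpartition(".")
        if last == "*" or last == cls:
            imported.add(package)
    matches = candidates & imported
    # Ambiguous or missing imports are treated as a probable mistake.
    return next(iter(matches)) if len(matches) == 1 else None
```

The sketch ignores the implicit import of java.lang; a fuller version would add that package to the imported set before matching.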

7.4.3.3 Initializing User Models

Initial user models are created by analyzing the Java programs that programmers have written so far. CodeBroker analyzes Java programs to extract each method used in the same way as adaptive user modeling, except that it is a batch process.

7.5 The Retrieval-by-Reformulation Mechanism

To complement the incompleteness of reuse queries, CodeBroker supports two forms of retrieval-by-reformulation (Section 3.4.3.4): direct manipulation and query refinement. After examining the components initially delivered by CodeBroker, programmers can either refine the query to improve its completeness and preciseness or directly manipulate the delivered components by removing apparently irrelevant ones.

7.5.1 Direct Manipulation

Direct manipulation of the delivered components serves two purposes: to facilitate the easy choice of components and to augment the discourse model or the user model. Each component in the RCI-display is associated with a floating menu, the Skip Components Menu (Figure 7.8), which pops up when the component name is right-clicked. The Skip Components Menu allows programmers to remove those components that are apparently not related to their current development task so that they can find needed information more easily. The first item of the menu is the method component itself; the second, its class; and the third, its package. If programmers want to remove the method or all of the components in the class or the package from the RCI-display, they can choose the appropriate item. Each item has three choices: This Buffer Only, This Session Only, and All Sessions.

Figure 7.8: The Skip Components Menu

When the command This Buffer Only is chosen, the corresponding components are removed from the RCI-display. When the command This Session Only is chosen, the components are not only removed from the RCI-display but also added to the discourse model, and they will not be delivered later in this development session. The discourse model is empty when a development session starts, and it is incrementally augmented by programmers as they interact with the system. When the command All Sessions is chosen, the components are removed from the current RCI-display and are added to the user model. Components added to user models through the Skip Components Menu do not have the use time field (see the last line in Figure 7.6).
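The three scopes of the Skip Components Menu can be sketched in Python. To keep the example short, components are written as dotted names and both models are simplified to sets of dotted names; these simplifications and all function names are assumptions.

```python
def covers(entry: str, component: str) -> bool:
    """True if `entry` (method, class, or package) contains `component`."""
    return component == entry or component.startswith(entry + ".")


def skip_components(entry: str, scope: str, display: list,
                    discourse_model: set, user_model: set) -> None:
    """Apply one menu choice; `scope` is "buffer", "session", or "all"."""
    if scope not in ("buffer", "session", "all"):
        raise ValueError("unknown scope: " + scope)
    # Every choice removes the matching components from the current display.
    display[:] = [c for c in display if not covers(entry, c)]
    if scope == "session":
        discourse_model.add(entry)   # suppressed for the rest of the session
    elif scope == "all":
        user_model.add(entry)        # recorded without a use time
```

Matching on `entry + "."` rather than on a bare prefix keeps a partial name such as `java.awt.Card` from removing components of java.awt.CardLayout.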

With this design, the system can obtain information to evolve discourse models and user models without adding too much extra work for programmers, who also gain an immediate benefit because the choice of needed components becomes easier once apparently irrelevant components are removed. For example, in Figure 7.9(a), in response to the doc comment, CodeBroker delivers some components (No. 1 through No. 4) belonging to the class java.awt.CardLayout (a GUI class) due to the term “card.” However, the current task is not related to the class java.awt.CardLayout, so the programmer can remove it through the direct manipulation interface. This manipulation brings the needed component randomShuffle, obscured previously, to the salient fourth place (Figure 7.9(b)). The fact that the programmer is not interested in the class java.awt.CardLayout can be added to the discourse model at the same time if the programmer chooses the This Session Only command, and then no components from the class java.awt.CardLayout will be delivered later in this development session, even if the programmer uses the word “card” in doc comments again, which is quite possible because the programmer is developing programs about card shuffling.


Figure 7.9: The Direct Manipulation interface

Parts (a) and (b) show the delivered components before and after direct manipulation, respectively.

7.5.2 Query Refinement

Query refinement is invoked by choosing the Query Refinement command in the same pop-up menu, or by typing it directly as an Emacs command. A buffer (Figure 7.10) will appear for programmers to start another round of component locating after having refined the automatically extracted reuse queries. Programmers can refine the concept query by choosing more appropriate terms, or they can modify the constraint query to make it less restrictive or more restrictive depending on the situation. To narrow the searching range of relevant components, the query refinement interface also provides two additional fields:

Filtered Components: for specifying classes or packages that are not of interest, and

Interested Components: for instructing the system to return components from the specified classes or packages only.

Component repository systems could provide a mechanism to let programmers specify either of these fields prior to the initial use of the system. However, programmers who do not know the structure of the repository well enough may not be able to specify these two fields. Even a system-guided dialog mechanism to solicit user specifications, as explored in the KID system (Nakakoji, 1993), is not suitable for repository systems because component repositories are often very large and it will take a long time to get a meaningful specification. The CodeBroker system does not assume that programmers know the repository structure well enough, and it solicits user input only after its delivered components have acquainted programmers with the structure of the component repository, especially the structure of the part of the repository that might be relevant to the task at hand.

Figure 7.10: The Query Refinement interface

The components in the RCI-display are retrieved after the refined query is submitted. Any one of the four visible components can be reused in the current situation, depending on how the programmer wants to restructure his or her data types.

7.5.3 Comparing Retrieval-by-Reformulation and Relevance Feedback

The retrieval-by-reformulation mechanism in CodeBroker is a more comprehensive approach to improving the retrieval performance than the relevance feedback mechanism used in many information retrieval systems (Buckley et al., 1994). Through the adjustment of terms used in a query by query expansion or other techniques, relevance feedback of information retrieval systems focuses mainly on the improvement of the retrieval process itself. Instead, the focus of retrieval-by-reformulation is to improve the relevance of information to the working context of programmers, not to the query per se. The direct manipulation tries to establish a shared understanding of the context between the component repository system and the programmer. It uses programmers’ previous interactions with the system as filters for later deliveries. Although it does not affect what the Fetcher agent returns, it does modify what gets shown. The system also takes advantage of the fact that software components are organized into a hierarchy (packages, classes, and methods) according to their application domains to let programmers limit the retrieval range to their interests.

7.6 Summary of CodeBroker

Figure 7.11 summarizes the role of each agent and retrieval-by-reformulation in CodeBroker. In Figure 7.11, T represents components that can potentially be reused in the current programming task. From these components, programmers need to choose the most appropriate one. D1 and D2 represent delivered reusable components before and after user models are considered, respectively. User models can reduce the work of choosing by removing known components, because if the most appropriate component were known, it would have already been reused. Ideally, active component repository systems should present to programmers the T set with known components (black circles) removed. However, due to the incompleteness of reuse queries, irrelevant components and missed components are unavoidable. Therefore, retrieval-by-reformulation is needed to allow programmers to move D2 toward T incrementally. Direct manipulation of retrieved results lets programmers remove irrelevant components (circles with waved lines) quickly; and query refinement lets programmers incorporate missed components (circles with no shade) into D2 in following locating efforts with incrementally developed reuse queries.

Figure 7.11: Summary of CodeBroker

Chapter 8

Evaluations of CodeBroker

This chapter presents the results of two types of evaluations conducted on the CodeBroker system. The first evaluation compares the retrieval effectiveness of the Okapi-based retrieval mechanism and the LSA-based retrieval mechanism (Section 6.1.1). The second evaluation empirically studies how well CodeBroker supports reuse-within-development (Section 4.2.2) through experiments with programmers.

The purpose of the empirical studies of CodeBroker was not to analyze the quality of programs produced by programmers and the productivity of programming, but to observe and analyze how the system promotes reuse during programming.1 The empirical studies attempted to answer the following questions:

• Are programmers able to reuse unknown software components with the support of CodeBroker?

• Does CodeBroker encourage programmers to explore the possibility of reuse?

• Is the task modeling based on doc comments and signatures good enough to find components relevant to the task at hand?

• Do discourse models improve the relevance of delivered components?

• Do user models contribute to the personalization of component delivery?

1 The fact that reuse improves the quality and productivity has been well studied by other researchers and has been discussed in Chapter 3.

8.1 Evaluating the Retrieval Mechanisms

Retrieval mechanisms play an important role in locating reusable components that match reuse queries. This section presents the evaluation and comparison of two retrieval mechanisms: the Okapi mechanism (Section 6.1.1.2) based on probabilistic models, and the LSA mechanism (Section 6.1.1.3), used by the Fetcher agent of CodeBroker for the retrieval of conceptually similar components.

8.1.1 The Concept of Recall and Precision

Conventionally, information retrieval systems are measured by recall and precision. Recall indicates the ability of the system to present all relevant documents, and precision indicates the ability of the system to present only the relevant documents. They can be computed by the following equations:

$$\mathit{Recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents in the collection}} \tag{8.1}$$

$$\mathit{Precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}} \tag{8.2}$$

Recall and precision are not absolute, objective measurements of an information retrieval system because

(1) the definition of relevance between documents and queries is subjective, and

(2) even if the relevance of a document is unanimously agreed on, it may not be of interest to one particular user if that user already knows the document.

Nonetheless, when the relevance of documents is agreed on, these two measures can be used to compare the performance of two retrieval systems.
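The two measures can be illustrated with a minimal Python sketch for a single query. The function name, the document identifiers, and the relevance judgments below are hypothetical and serve only to show the arithmetic of Equations 8.1 and 8.2.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: document ids returned by the retrieval system
    relevant:  document ids judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical judgment: four documents are relevant, the system returns
# five documents, and two of the returned documents are relevant.
recall, precision = recall_precision(
    ["d1", "d2", "d3", "d4", "d5"], ["d2", "d5", "d8", "d9"]
)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.50 precision=0.40
```

Returning fewer documents tends to raise precision and lower recall, which is why the two values are reported together.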

8.1.2 Recall and Precision in CodeBroker

The purpose of computing recall and precision of the CodeBroker system is to compare the two retrieval mechanisms to find the better one. The data should not be taken as an absolutely objective measurement of the effectiveness of retrieval mechanisms implemented in the CodeBroker system because the sample queries were not random enough.

8.1.2.1 Reuse Queries and Relevant Components

In total, 19 reuse queries were selected. Among them, 10 queries were created by me, 4 queries were chosen from questions asked in newsgroups related to Java programming, with phrases such as “How do I” and “Can someone tell me how to” removed, and 5 queries were extracted from the evaluation experiments (Section 8.2) without any words changed.

The relevance of components was determined as follows:

(1) For queries created by me, I chose those components that I thought could be used to implement the task.

(2) For queries from newsgroups, those components suggested by responders were considered relevant, in addition to components that I chose from the JGL library, which not all Java programmers use.

(3) For queries from experiments, only those components that were used by the programmers were considered relevant.

The sets of relevant components determined by the above criteria are by no means exhaustive. However, for the purpose of comparison, they can provide sufficient evidence.2

8.1.2.2 Computed Recall and Precision

2 Queries and relevant components are listed in Appendix A.

Two retrieval mechanisms, Okapi and LSA, are supported by CodeBroker to locate components that are conceptually similar to queries extracted from doc comments. The two mechanisms can be combined by giving different weights to the similarity value computed by each. I tried the system by using LSA only, Okapi only, and the average of both (Mixed). The average precision values at different recall values are shown in Table 8.1. Figure 8.1 shows the recall-precision curves, which are constructed by plotting the precision values against the recall values. Superimposing the recall-precision curves of different retrieval mechanisms in the same graph shows which retrieval mechanism is superior. In general, the curve closest to the upper right-hand corner indicates the best performance (Salton & McGill, 1983).

            Average Precision
Recall      LSA        Mixed      Okapi
0%          35.77%     32.80%     45.82%
10%         31.86%     32.80%     45.82%
20%         30.89%     32.75%     45.82%
30%         25.62%     28.63%     41.20%
40%         20.62%     24.66%     41.01%
50%         20.44%     24.66%     40.74%
60%         13.86%     21.95%     37.46%
70%         13.82%     20.70%     37.46%
80%         13.82%     20.63%     32.71%
90%         12.32%     19.90%     32.19%
100%        12.32%     17.86%     29.43%

Table 8.1: Average precision and recall values for LSA, Mixed (average of LSA and Okapi), and Okapi
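Averaged precision values at fixed recall levels, such as those reported in Table 8.1, are conventionally obtained per query and then averaged. The Python sketch below shows one way to do this. It assumes the common 11-point interpolation rule (the precision at a recall level is the best precision observed at that level or any higher one); the text does not state which rule was used, and the function names, rankings, and relevance judgments are hypothetical.

```python
RECALL_LEVELS = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%


def precision_at_recall_levels(ranking, relevant):
    """Interpolated precision at the 11 standard recall levels for one query."""
    relevant = set(relevant)
    points = []  # (recall, precision) measured after each relevant hit
    hits = 0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    # Assumed interpolation: best precision at any recall >= the given level.
    return [
        max((p for r, p in points if r >= level - 1e-9), default=0.0)
        for level in RECALL_LEVELS
    ]


def average_curves(curves):
    """Average several per-query curves level by level."""
    return [sum(column) / len(column) for column in zip(*curves)]


# Two hypothetical queries with hand-made rankings and relevance judgments.
curve_a = precision_at_recall_levels(["a", "x", "b", "y"], {"a", "b"})
curve_b = precision_at_recall_levels(["x", "a"], {"a"})
mean_curve = average_curves([curve_a, curve_b])
for level, precision in zip(RECALL_LEVELS, mean_curve):
    print(f"{level:4.0%}  {precision:.2%}")
```

Plotting each mechanism's averaged curve on the same axes gives the kind of comparison described above.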

8.1.2.3 Conclusions

It is easy to see from Table 8.1 and Figure 8.1 that Okapi has better retrieval performance than LSA and the mixed mechanism (the average of both). The result is somewhat unexpected because other researchers have reported that LSA performs better than other retrieval methods (Deerwester et al., 1990). The unexpectedly low performance of LSA might be caused by insufficient training documents used in CodeBroker, because LSA performance depends largely on the quality and volume of training documents. Because the evaluation shows that Okapi has the best performance, the default setting of CodeBroker is to use Okapi only. Okapi is also favored over LSA because in Okapi, the system can find the terms that contribute most to the relevance between components and queries, and those terms can be shown to programmers when they move the mouse cursor over the descriptions of components in the RCI-display, to help them refine their queries (Figure 7.4, Section 7.4.1). In LSA, in contrast, the reason a component is determined to be relevant is obscure because relevance is computed in the semantic space rather than from individual terms.

Figure 8.1: Recall-precision curves

8.2 Empirical Evaluations of the CodeBroker System

To understand the effectiveness of the CodeBroker system in supporting reuse-within-development, formal evaluation experiments were conducted. The structure of the experiments is described in this section, and findings and conclusions are presented in the next four sections.

8.2.1 Subjects of Experiments

Subjects were recruited from undergraduate and graduate students in the Computer Science Department. As mentioned in Chapter 2, programming involves a wide range of knowledge. Because the design goal of CodeBroker is to provide programmers with knowledge about reusable components, only students who already had extensive programming knowledge and experience were recruited as subjects, which minimized other factors that contribute to the difficulty of programming in general. Because CodeBroker is developed as an add-on to the existing programming environment, Emacs in Unix, a basic working knowledge of Emacs and Unix was also required so that subjects could easily learn the operations of the system and the experiments could be focused on the support provided by the system.

Five subjects voluntarily participated in the evaluation experiments. All but one programmer had extensive knowledge of other programming languages, such as C and C++. Two had worked as professional programmers. Three were regular contributors to several Open Source projects. Their expertise in Java programming varied, ranging from medium to expert level. All of them knew the syntax of Java very well; the difference in their expertise came from the range of reusable components (classes and methods in API libraries) they knew. Table 8.2 summarizes their background knowledge about programming in general and Java in particular. In that table, small (abbreviated as S) projects refer to projects similar to semester projects, requiring 1 or 2 man-months; medium projects (abbreviated as M) refer to projects requiring 3 to 5 man-months; large projects (abbreviated as L) refer to projects requiring more than 6 man-months.

8.2.2 Structure of Experiments

Subjects were asked to implement two or three programming tasks with the CodeBroker system. Days before the experiments, CodeBroker created an initial user model (see Section 7.4.3 for the method, and Figure 7.6 for an example) for each subject by scanning programs the subject had written recently. Because many of the programs the subjects had written were for companies and thus were not available, no user models were complete. Nonetheless, the number and range of components included in the user models were consistent with the subjects’ self-evaluations of Java expertise.

Subject                                  S1             S2             S3           S4          S5
Years of general programming             3 or 4         5 or 6         8            10+         10+
Programming experience in general        3S, 1L         10S            7M, 1L       10+L        10+SM, 2L
  (measured in number of projects)
Current major programming language       C++            Java           Java         Java        Java
Years of Java programming                10 months      4              4            7           5
Self-evaluation of Java expertise        4              7              7 or 8       10          7
  (1: Beginner - 10: Expert)
Recent frequency of programming          Not active     Not active     Every week   Every day   Not active
  in Java                                for 3 months   for 3 months                            for months

Table 8.2: Programming knowledge and expertise of subjects

After their user models had been analyzed, the subjects were assigned tasks whose implementation involved components they probably did not know well enough. At the beginning of the experiments, after the subjects had signed the Informed Consent Form for participating in the experiments, the main functionality of the CodeBroker system was briefly introduced with a running example. This took about 5 minutes. Before the implementation of each task, programmers were asked to describe briefly how they would implement the task, and after each task had been finished, simple questions such as “Did you know this component before?” and “Why did you choose this component?” were asked regarding their programming activities. At the end of the experiments, a post-experiment interview3 was conducted to capture the subjects’ background knowledge of programming and their subjective evaluation of the CodeBroker system based on their use.

3 Questions asked in the interview are listed in Appendix B.

Programmers were told to program in their normal way but to take advantage of the support provided by CodeBroker. They could use books, the Java API Documentation Browser, and all other support as they usually did. Two subjects actually brought and consulted their favorite “Java in a Nutshell” (Flanagan, 1997). As an observer, I occasionally answered their questions about the operation of the CodeBroker system.

The CodeBroker system used the following default settings in the experiments:

(1) It adopted the Okapi retrieval mechanism and the signature-matching mechanism, with each assigned a weight of 0.5.

(2) A component was considered known to the programmer if the user model indicated that the programmer had used it three times.

(3) In the first four experiments with the first two subjects, the system delivered 14 components in the RCI-display because the experiments were conducted on a laptop with a small monitor. In all other experiments, which were conducted on a desktop with a large monitor, the system delivered 20 components.

(4) The component repository contained 673 classes and 7,338 methods from both the Core API library of Java 1.1.8 and JGL 3.0.
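How settings (1) to (3) could interact is sketched below in Python. This is an illustrative sketch only, not the CodeBroker implementation: the `Settings` and `deliver` names, the score values, and the use counts are hypothetical, and "used it three times" is assumed to mean three or more uses.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    okapi_weight: float = 0.5      # setting (1): weight of the Okapi similarity
    signature_weight: float = 0.5  # setting (1): weight of the signature match
    known_after_uses: int = 3      # setting (2): assumed to mean "three or more"
    delivered: int = 20            # setting (3): 14 on the small laptop display


def deliver(candidates, use_counts, settings=None):
    """Rank candidates and drop those the user model marks as known.

    candidates: dict name -> (okapi_score, signature_score); hypothetical values
    use_counts: dict name -> times the programmer has used the component
    """
    settings = settings or Settings()
    scored = []
    for name, (okapi, signature) in candidates.items():
        if use_counts.get(name, 0) >= settings.known_after_uses:
            continue  # already known to the programmer: do not deliver
        score = settings.okapi_weight * okapi + settings.signature_weight * signature
        scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [name for _, name in scored[: settings.delivered]]


# Hypothetical scores and user model.
candidates = {
    "java.util.Calendar.roll": (0.8, 0.6),
    "java.lang.Character.isDigit": (0.6, 0.2),
    "java.lang.String.length": (0.9, 0.9),
}
use_counts = {"java.lang.String.length": 12}
print(deliver(candidates, use_counts))
# ['java.util.Calendar.roll', 'java.lang.Character.isDigit']
```

Passing `Settings(delivered=14)` would correspond to the smaller RCI-display used in the first experiments.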

8.2.3 Programming Tasks

Because subjects were volunteers, large and time-consuming tasks were not very suitable. The experiments used programming tasks similar to the typical assignments of a programming language course, which could be implemented with several methods in about 20 to 60 minutes. The following tasks were used in the experiments.

Task 1 You are asked to implement a program that selectively backs up files based on a list that holds all files needed to be backed up. The list looks like:

/usr/java/private/important/letter1
/usr/joe/project/backup/getAllFiles.java

and the file name is passed as a parameter in the command line. It requires that the back-up program retain the same hierarchical structure when the files are backed up in another directory $BACKUPDIR, which is passed as the second parameter in the command line; for example, getAllFiles.java should also be found under the directory $BACKUPDIR/usr/joe/project/backup/.

Task 2 You are asked to write a program to simulate the process of card dealing. Each card is represented by a number from 0 to 51. The program should produce a list of 52 cards, as it results from a human card dealer. Let us assume that if a person cuts a deck of cards and shuffles it 7 times, the result is satisfactory.

Task 3 Traditionally, Chinese write numbers with a comma inserted at each fourth number from the right. For example, 1,000,000 is written as 100,0000. Please implement a program that transforms the Chinese writing format (100,0000) to the western format (1,000,000). To simplify the programming task, you don’t need to read the input from the keyboard. You can assume you can get the input anywhere you like, such as a static class variable, a parameter of a method, or input from the command line.

Task 4 Jack has a long list of MP3 songs he has compiled. However, many of the songs are repeated in the list. He wants to create a new list in which each song appears only once. Assume each list has the following format TITLEa, TITLEb, TITLEc, ... where TITLEi is a string including letters only. Implement a method to create a new list with no repetitions. Assume the list is stored somewhere; for example, you can put it into a class variable.

Task 5 Please write a program that can calculate the day of the week. We know that today is Jan. 19, 2001, Friday. Your program should be able to compute the day of the week M years from today, or N months from today. Both M and N could be negative, which means M years or N months before. Assume the convention to pass the data to your program is: Y 10 means 10 years from today, and M -5 means 5 months ago.

Task 6 A processor needs to respond to a series of events. Each event is assigned a distinct number. When the processor is busy, newly arrived events will be put into a waiting list. When the processor finishes processing the previous event, it picks an event in the waiting list. However, it picks the event with the largest number in the waiting list. You are asked to implement a pair of operations: one to put a new event into the waiting list, and the other to help the processor pick up the next event to be processed. (You don’t need to be concerned with concurrency.)

All tasks could be implemented with different combinations of reusable components from the repository. If the subjects knew or found the right components, the implementation would be fairly easy; if they did not, they would have to use lower-level components or even basic statements. Therefore, those tasks allowed us to observe how the deliveries of the system changed the programming process of the subjects.

8.2.4 Methods of Observation and Analysis

The CodeBroker system has an automatic log mechanism that logs the reuse queries extracted by Listener, the components retrieved by Fetcher, the components removed by Presenter, and both system-initiated and user-initiated changes to discourse models and user models.

All experiments, including interviews, were videotaped. Subjects were asked to think aloud during the experiments. However, because thinking aloud may interfere with normal programming practice, this was not stressed. Analysis of the system was based on the log data, the videotapes, and transcribed interviews. The analysis was not concerned with the quality and productivity of programming; instead, it examined how CodeBroker affects the process of programming by encouraging programmers to reuse. Quantitative assessment was based on log data, and qualitative evaluation was based on interviews and think-aloud protocols.

8.3 Findings about the Usage of CodeBroker

This section presents the findings about the usage of CodeBroker in the experiments. After presenting the overall results, I discuss in detail the observed roles of active component delivery, task models, discourse models, user models, and the retrieval-by-reformulation mechanism.

8.3.1 Overall Results

Table 8.3 summarizes the support the CodeBroker system provided to programmers during the experiments based on automatically logged data. Numbers in each column are defined as follows.

                                  Breakdown of Delivered Components
Subj.  Task  Total  Delivered     Unknown  Anticipated  Vaguely Known     Triggered
S1     T1    10     4             2        2            0                 0
S1     T2    3      1             1        0            0                 1
S2     T3    7      1             1        0            0                 0
S2     T4    4      1             1        0            0                 0
S2     T5    5      3             0        2            1                 1
S3     T3    5      2             1        1            0                 1
S3     T5    4      3             1        2            0                 1
S3     T6    3      0             0        0            0                 0
S4     T3    4      3             0        3            0                 0
S4     T5    3      1             1        0            0                 2
S5     T3    4      1             1        0            0                 2
S5     T6    5      0             0        0            0                 0
SUM          57     20            9        10           1                 8

Table 8.3: Overall results of evaluation experiments with programmers

Subj. The subject who participated in the experiment.

Task. The number of the task used in the experiment.

Total. The number of distinct method components used in the implemented program. If a method was used more than once, it was still counted as one. Class components were not directly counted because when they were used, some of their methods, including constructors, must have been used somewhere in the program.

Delivered. The total number of components that programmers directly reused from the components delivered by the system. Those components are further broken down into three categories based on subjects’ original knowledge about the components.

    Unknown. Components whose existence was not expected (i.e., components from the information islands (L4 - L3) in Figure 4.1).

    Anticipated. Components subjects believed existed, but had never used before (i.e., components of L3 in Figure 4.1). Sometimes, they even guessed the right class or the right package.

    Vaguely known. Components subjects had used before, but whose names they were not sure about or remembered incorrectly (i.e., components of L2 in Figure 4.1).

Triggered. The number of components that were not delivered but whose reuse in the programs was triggered by the delivery. In some cases, when the subjects wanted to reuse a delivered component that needed other supplementary components, they needed to find those components. Triggered components were not known by subjects before. For example, one subject wanted to use the delivered randomShuffle method that operates on Array. Because he did not know the Array class, he used the browsing mechanism to find it. Although triggered components were not directly delivered by the system, they would not have been reused without the support of the system.

Table 8.4 summarizes the responses from subjects when asked to rate the usefulness of the system on a scale from 1 (totally useless) to 10 (extremely useful), and whether they would use the system as their daily programming environment. Although those evaluations are subjective, they are indications of the subjects’ desire to use the system.

Subj.  Rate  Will you use CodeBroker as your daily programming environment?
S1     7     Yes.
S2     4     It is right on the threshold that maybe I would use it.
S3     8.5   Yes. It is not perfect, but it is really good and it is very helpful.
S4     7     Yes, but I have to get used to the system.
S5     8     Yes.

Table 8.4: Subjective evaluations of the CodeBroker system

As both the quantitative data in Table 8.3 and the subjective evaluations in Table 8.4 show, CodeBroker was quite effective in helping programmers locate and reuse components during the experiments.

8.3.2 Roles of Information Delivery Mechanism in Supporting Reuse

In the experiments, the information delivery mechanism of CodeBroker provided several kinds of support that encouraged subjects to reuse.

8.3.2.1 Supporting the Reuse of Unknown Components

As Table 8.3 shows, in 7 out of the 12 experiments, the system delivered reused components that had not been known to subjects. In one experiment, the subject had not even known the existence of the whole package. Without the delivery mechanism, the subjects would not have been able to reuse them and would have created their own solutions instead, as two subjects commented in the interviews:

“I would have never looked the roll function by myself, I would have done a lot of stuff by hand. Just because it showed up in the list, I saw the Calendar provided the roll feature that allowed me to do the task.”

“I did not know the isDigit thing. I would have wasted time to design that thing.”

The delivery mechanism not only helped subjects reuse components right off the deliveries, it also created a snowball effect that triggered them to reuse other unknown components that were not directly delivered but were needed to reuse those delivered components. Components in the libraries of object-oriented programming languages are often coupled through parameter passing or access to common class variables. Reusing one component often requires reusing other components that are tightly coupled with it. In the experiments, when those coupled components were not known, programmers used the deliveries of CodeBroker as the starting point and then followed the existing hyperlinks of the documentation system to learn and reuse them.

The delivery mechanism also created latent reuse opportunities. Sometimes, the delivered components were not immediately reused because, although they were related to the task to some extent, they could not be directly reused for the immediate programming task in which the subjects were engaged. However, as programming continued, subjects realized that something delivered before could now be reused. For example, in one experiment, the subject was first concerned with finding how to read the contents of a file. Among the delivered components was the isDirectory method, which could not be reused right away for the task of “reading the contents of a file” but nevertheless caught the attention of the subject. Later, when the subject moved to the task of “creating a new directory if it does not exist,” he thought of something he had seen before, but he could not remember the name. So he asked whether the system had a mechanism allowing him to go back to previous deliveries. When told that it did not, he inserted a temporary comment to find the isDirectory method.

8.3.2.2 Reducing the Cost of Locating Anticipated or Vaguely Known Components

In all experiments, 9 components that were reused from the deliveries of the CodeBroker system were to some extent anticipated by the subjects (Table 8.3). In those cases, the subjects knew there was something in the Java library that could help them implement the task. Although some of them even knew the class names, they did not know the needed method names and were not sure whether all the needed functionality was supported by the class. Those components might have been reused by subjects without the support of delivery if they could have located them through browsing or querying quickly enough. However, CodeBroker made locating those components faster and easier. It is difficult to evaluate objectively how much the system reduced the cost of locating those anticipated components because (1) comparing the cost of locating the same component with and without the support of the CodeBroker system would require two programmers with the same knowledge about the repository, which determines how a programmer conducts the locating process, and no such programmers can be found; and, more importantly, (2) programmers’ evaluation of the locating cost itself is subjective. Therefore, the conclusion that the delivery mechanism of CodeBroker reduced the cost of locating anticipated or vaguely known components was based on the subjects’ answers to the question “Did you think the system saved you time in locating this component?” (referring to a specific component anticipated or known by the subject):

“It beats browsing. Because the way that I normally would have done the task, I would do a lot of browsing and then write the code alongside. So this reduced the browsing and searching.”

“Yes. First, I did not have to start browsing and go through the packages, and I did not have to go through the index of methods. I could just go to the short list [RCI-display], found it and clicked it.”

“I thought there might be a parse method, but I also was not sure whether it is called parse or something else. I also wasn’t sure if it was in the Format class. Maybe it is in a different class like Integer or number or something else. It’s helpful that I saw parse [in the RCI-display] and went through to see that it was in the Format class.”

“It seems to me the key benefit of this [CodeBroker] is that it gives you methods for every class, not like this one [the API Documentation Browser] that you have to first find which class it is in then go to the class. Although it has index of methods, but it is hard to find here [the API Documentation Browser].”

Subjects also acknowledged that the reduced cost of locating components motivated them more to explore the possibility of reuse. As one subject said:

“Having this system, I would try to explore more, I would spend more time to see whether this thing exists or not.”

8.3.3 Effectiveness of Task Models

Task models in CodeBroker are used as queries to retrieve relevant components, and they are created from doc comments and signatures of modules. In the experiments, most doc comments written by subjects made the functionality of their programs easy for human readers to understand. At the same time, those doc comments also served as good reuse queries, as evidenced by the number of successful deliveries shown in Table 8.3.

The more knowledge subjects had about the repository, the better their doc comments worked as queries for retrieving relevant components. One subject described why he wrote one particular comment as follows:

“I knew there should be a class called NumberFormat or DecimalFormat having the method format...That’s why I wrote the word ‘format’ because I knew it would catch those.”

As a result, he found what he expected from the deliveries of CodeBroker.

Probably because most subjects had a fairly good working knowledge of programming in general and Java in particular, they were able to describe, in comments, the functionality of programs in a way similar to that used to describe components in the repository.

Different subjects had different ways of writing comments. Some wrote very long and elaborate comments to describe everything they wanted to do in the method or the class. Others wrote very concise and short comments focused on the major task of the program. Because descriptions of components in the repository are short and concise, the short and focused comments made the system deliver more task-relevant components.

Comments are essential for CodeBroker to deliver task-relevant components. Therefore, in the interviews, subjects were asked if they wrote comments in their daily programming activities. Two subjects answered they always wrote comments before the implementation for most classes and methods, one subject said he always wrote comments for classes but not always before the implementation, one subject said he mostly wrote comments for classes but not always before the implementation, and one subject said he usually did not write comments. However, two subjects indicated that they probably would write more comments before implementation if they were going to use CodeBroker because they could benefit from the comments. Two subjects changed their styles of comments within methods from C++-style (which begins with “//” and continues until the end of the line) to the doc comment style in order to take advantage of the system delivery. That was an unexpected use of the system,4 and showed that subjects expected and valued the help provided by the delivery mechanism of the CodeBroker system.

The signature matching mechanism of CodeBroker (Section 6.1.2) did not play much of a role in the experiments. In fact, only one subject tried once to look at the change of delivery when he finished the signature declaration of a method, but the system failed to improve the task-relevance of the delivery because there was no component in the repository that was both similar in concept and compatible in constraint with the task of the subject. In all other experiments, subjects shifted their attention to the RCI-display immediately after they had written the comments and started browsing. When they found components they needed, they moved back to programming and did not pay any attention to the RCI-display until they wrote the next doc comment. The original design goal of adopting the signature matching mechanism in CodeBroker was to help programmers find components that can be reused to replace the module under development. However, in the experiments, all subjects used the system to look for components that could be reused as parts of the module implementation instead of components to replace their intended implementation. The system was more effective in delivering implementation parts than in delivering replacement components.

4 The original design goal of the CodeBroker system was to deliver components based on doc comments preceding methods or classes, not on comments inside methods.

The fact that subjects did not pay attention to the change of delivery caused by signature definitions provided a clue to speculate on the boundary of the action-present (see Section 5.1.2) period of programming. When programmers are writing comments, they are still at the stage of planning, thinking of how to implement the program. At that stage, they are still willing to explore alternative solutions. When they start to define the signature, they have already committed to one chosen solution and have shifted into the stage of execution, at which they are less inclined to explore alternatives.

8.3.4 Effectiveness of User Models

No strong and conclusive data were collected regarding the role of user models in CodeBroker. User models removed some known components in five experiments (Table 8.5), and in only two experiments (S1-T1 and S2-T5), more than 8% of the components were removed because they were included in the user models of subjects. The low number of components removed by user models was probably caused by the following two reasons:

(1) All subjects provided only a very small portion of their Java programs for CodeBroker to create the initial user models. As a result, the user models did not sufficiently reflect subjects’ knowledge about components.

(2) In order to observe the role of component delivery, subjects were assigned tasks whose implementation required components from packages and classes that subjects had not known yet. Therefore, most delivered components were not included in user models.

             Total No. of       No. of Comp.     No. of Comp.      No. of Comp.
Subj.-Task   Comp. Retrieved    Removed by UM    Added to UM       Added to UM
                                                 by User           by System
S1-T1        168                15               0                 0
S1-T2        28                 0                0                 0
S2-T3        140                5                0                 0
S2-T4        52                 0                0                 0
S2-T5        160                14               2                 5
S3-T3        60                 0                0                 6
S3-T5        20                 1                0                 0
S3-T6        60                 0                0                 0
S4-T3        80                 0                0                 0
S4-T5        140                0                0                 0
S5-T3        100                1                0                 9
S5-T6        420                0                0                 0
SUM          1428               36               2                 20

Table 8.5: Experiment data regarding user models

Subj. is an abbreviation for Subject, Comp. for Components, and UM for User Model.
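The removal rates quoted above can be recomputed from the first two data columns of Table 8.5. The short Python sketch below does only that arithmetic; the variable and function names are illustrative.

```python
# (components retrieved, components removed by the user model) per experiment,
# copied from Table 8.5
TABLE_8_5 = {
    "S1-T1": (168, 15), "S1-T2": (28, 0), "S2-T3": (140, 5), "S2-T4": (52, 0),
    "S2-T5": (160, 14), "S3-T3": (60, 0), "S3-T5": (20, 1), "S3-T6": (60, 0),
    "S4-T3": (80, 0), "S4-T5": (140, 0), "S5-T3": (100, 1), "S5-T6": (420, 0),
}


def removal_rates(table):
    """Percentage of retrieved components removed by the user model."""
    return {
        experiment: 100.0 * removed / retrieved
        for experiment, (retrieved, removed) in table.items()
    }


rates = removal_rates(TABLE_8_5)
with_removals = sorted(exp for exp, rate in rates.items() if rate > 0)
above_8_percent = sorted(exp for exp, rate in rates.items() if rate > 8.0)

print(with_removals)    # ['S1-T1', 'S2-T3', 'S2-T5', 'S3-T5', 'S5-T3']
print(above_8_percent)  # ['S1-T1', 'S2-T5']
print(f"{rates['S1-T1']:.2f}% {rates['S2-T5']:.2f}%")  # 8.93% 8.75%
```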

Although user models did not remove many components, all subjects said, in the interviews, that they did not notice many well-known components being delivered. A careful examination of those components removed by user models in the experiments found that none of them could have been reused in those tasks. The perception of subjects and the available, although quite limited, data still pointed to the need for, and the effectiveness of, personalizing component delivery based on user models.

As shown in Table 8.5, neither the system nor the users added many components to the user models. More on this will be discussed in Section 8.5.4.

             Total No. of       No. of Comp.     No. of Comp.
Subj.-Task   Comp. Retrieved    Added to DM      Removed by DM
S1-T1        168                1p, 1c           45
S1-T2        28                 1p, 1c           10
S2-T3        140                4m               0
S2-T4        52                 0                0
S2-T5        160                0                0
S3-T3        60                 0                0
S3-T5        20                 0                0
S3-T6        60                 0                0
S4-T3        80                 1p               7
S4-T5        140                2p               68
S5-T3        100                0                0
S5-T6        420                0                0
SUM          1428               5p, 2c, 4m       130

Table 8.6: Experiment data about discourse models

Subj. is an abbreviation for Subject, Comp. for Components, and DM for Discourse Model. In the third column, p, c, and m refer to package, class, and method, respectively.

8.3.5 Effectiveness of Discourse Models

Discourse models, used as filters, improved the relevance of retrieved components. As the experiments with S1 and S4 showed (Table 8.6), when subjects added packages or classes into the discourse models, those discourse models were successful in removing irrelevant components in later deliveries, and in all four experiments with S1 and S4, the system helped them find their needed components. However, if only method components were added to the discourse model, as in the experiment S2-T3, the discourse model was not very effective in filtering components. None of the components removed by discourse models was needed for the implementation of tasks in those experiments. Therefore, discourse models did reduce the number of irrelevant components.
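The difference between package-level and method-level entries can be illustrated with a small Python sketch of a name-prefix filter. It is a simplification under stated assumptions, not CodeBroker's implementation: the discourse model is assumed to be a plain set of package, class, or method names marked as irrelevant, and the retrieved list is a hypothetical example.

```python
def filter_by_discourse_model(components, discourse_model):
    """Drop components covered by a discourse-model entry.

    components:      fully qualified method names
    discourse_model: package, class, or method names marked as irrelevant
    """
    def excluded(name):
        # An entry covers a component if it names the component itself
        # or an enclosing class or package.
        return any(
            name == entry or name.startswith(entry + ".")
            for entry in discourse_model
        )

    return [name for name in components if not excluded(name)]


# Hypothetical delivery for a number-formatting task.
retrieved = [
    "java.awt.Button.setLabel",
    "java.awt.Label.setText",
    "java.text.NumberFormat.parse",
    "java.text.NumberFormat.format",
]

# One package entry removes every component in that package.
print(filter_by_discourse_model(retrieved, {"java.awt"}))
# ['java.text.NumberFormat.parse', 'java.text.NumberFormat.format']

# One method entry removes only that single component.
print(len(filter_by_discourse_model(retrieved, {"java.awt.Button.setLabel"})))  # 3
```

A package or class entry covers many components at once, which is consistent with the observation that method-only entries, as in S2-T3, filtered little.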

8.3.6 Use of the Retrieval-by-Reformulation Mechanism

The CodeBroker system supports two interfaces of retrieval-by-reformulation: direct manipulation and query refinement. Discourse models and user models were the results of using the direct manipulation interface (Section 7.5.1), and their roles were discussed in the previous two sections (Sections 8.3.4 and 8.3.5). This section discusses the usage of the query refinement interface (Section 7.5.2).

The query refinement interface was invoked by three subjects (S1, S3, and S5), and interestingly, they used three different features. S1 used the interface to modify a query and the modification led to locating a needed component. S3 did not change the query; instead he filled the field of Interested Components with a package name because he was sure that all his needed components were from that package. He did, in fact, find all of his needed components from that package. Later in the interview, he commented:

“It worked great since I knew everything was going to be in java.text. That is a nice feature, that little refinement thing.”

In addition to query modifications, S5 also used the interface to specify packages in which he was not interested by filling the field of Filtered Components. However, that still did not help him find what he wanted because the terms he chose appear in many other packages he did not specify in the Filtered Components field.

Instead of using the query refinement interface, some subjects activated the delivery mechanism by directly modifying comments in the editor when they realized their previous comments did not help them find what they wanted.

All of these observations confirmed that locating reusable components is an iterative process and the support of retrieval-by-reformulation is necessary (Section 5.4). However, the CodeBroker system did not provide a mechanism that could guide programmers in refining their queries (for discussions, see Section 8.5.3).

8.3.7 The Role of Layered Presentation of Information

In CodeBroker, information about reusable components is presented to programmers in different layers: names and short descriptions of components in RCI-display, signatures and matching terms triggered by mouse movements, and full details in an external HTML Browser (Section 7.4.1). The experiments confirmed that such a layered presentation mechanism was effective in helping programmers choose the right component to reuse.

Subjects were observed to use the following procedure to choose a component. They first looked at names and descriptions in RCI-display. If they found one promising component, they clicked on it and went to the Browser for detailed information. If they did not find anything interesting in RCI-display, they either moved back to the editor to modify their doc comments or invoked the query refinement interface. If they found several similar components in RCI-display, they moved the mouse over the names and looked at the signatures, trying to find the best one. This process, called information discernment in Section 3.4.3.5, was often very short, and rarely took more than a minute; in contrast, the detailed evaluation of chosen components often took several minutes.

8.4 Other Findings about Programming in General

This section presents findings about programming in general that are not necessarily related to the use of CodeBroker, but are related to the theoretical framework discussed in Chapters 2 and 3.

8.4.1 Knowledge of Components and Problem Framing

In Chapter 2, I postulated that programming is an interaction between problem framing and problem solving, and knowledge about reusable components not only makes the solving process easier but also increases the programmers’ capability of problem-framing because they can frame the problem directly with those components. Findings from the experiments support the claim.

Programming Task 3 (T3) (Section 8.2.3) was assigned to four subjects. Two subjects (S3 and S4) knew some classes in Java could be used, even though neither of them had ever used the classes, and they were not sure what the classes were or whether the classes included all the needed functionality. They described their understanding and implementation plan of the task as follows:

S3: “There are two ways that I could do this. One way is that Java might actually have some supports in Java’s international text classes for doing reading in Chinese format and then writing out in western format because I know there is a NumberFormat class, but I have never used it. And it might be easier for me to just do it by hand, which is to take the number, read it from right to left and then read it and write out another string, because it is pretty a simple thing. I have to take the comma from here and put it right here.”

S4: “So what I want to do is that I will probably parse it using something like date format (I don’t remember exactly what kind of number you can parse into) and then just reformat it into another way, another date format...oh, not date format, number format or something like that, java.util.text or java.text or something like that package. If it doesn’t work, I am going to do it by hand: just remove the commas and insert them back. But I believe there should be a class that, given a pattern of a string, can convert it back to a string of another pattern.”

Two other subjects (S2 and S5), who did not know anything about those classes, described the same task as follows:

S2: “I will convert the western number to an int primitive, run through the int-like StringBuffer backwards, throwing sets of four numbers into another StringBuffer. After each fourth number, check if there is a fifth number available; and if there is, insert a comma. At the end, simply reversing the string should give you the Chinese number.”5

S5: “Basically, I am gonna to parse the number, take out the commas and insert the commas.”

Apparently, the problem descriptions of S3 and S4 were strongly affected by their knowledge about components, whereas S2 and S5 described a more detailed, lower-level implementation plan. As a result, both S3 and S4 implemented the task in a very compact way using components delivered by CodeBroker. S2 implemented the task as he described because the system failed to deliver relevant components based on his comments. S5 came up with a compact implementation too because the CodeBroker system delivered components from the NumberFormat class.

5 The subject thought it was to convert a western format number to a Chinese format number.

8.4.2 The Opportunistic Nature of Programming

The example of S5 implementing T3 illustrates well the opportunistic nature of programming (Section 2.3). According to his original implementation plan, S5 started by creating two method interfaces: one to read the input number, and another to convert and return it. However, based on the doc comment he wrote for the second method, CodeBroker delivered the format method of the java.text.NumberFormat class. S5 noticed it, browsed its documentation, and totally changed his original implementation plan. His final implementation was very close to the one created by S3, who anticipated the existence of the format method.

This example showed that actively delivered components augment programmers’ insufficient programming knowledge and create learning and reuse opportunities for programmers. The phenomenon that delivered components changed a programmer’s original implementation plan was also observed in experiments with two other subjects.
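A compact solution to Task 3 of the kind discussed in Sections 8.4.1 and 8.4.2 might look roughly like the following sketch. It is not any subject's actual code: the class NumberRegrouper and the method chineseToWestern are hypothetical names, and the sketch assumes, following the subjects' own descriptions, that the Chinese format groups digits in fours and the western format groups them in threes. It uses java.text.DecimalFormat, a subclass of the NumberFormat class mentioned by the subjects.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.util.Locale;

/** Hypothetical sketch of Task 3: regroup digits from groups of four to groups of three. */
final class NumberRegrouper {
    // Fixed symbols so that ',' is the grouping separator regardless of the default locale.
    private static final DecimalFormatSymbols SYMBOLS = new DecimalFormatSymbols(Locale.US);

    static String chineseToWestern(String chinese) {
        DecimalFormat fourDigitGroups = new DecimalFormat("#,####", SYMBOLS);
        DecimalFormat threeDigitGroups = new DecimalFormat("#,###", SYMBOLS);
        try {
            // Parsing skips the grouping separators; formatting reinserts them every three digits.
            Number value = fourDigitGroups.parse(chinese);
            return threeDigitGroups.format(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a number: " + chinese, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(chineseToWestern("1234,5678"));
    }
}
```

Compared with the by-hand plan of removing and reinserting commas, the whole conversion reduces to one parse call and one format call, which is the kind of compactness the subjects were willing to invest learning time to obtain.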

8.4.3 Four Levels of Knowledge about Components

Not surprisingly, the subjects did have four different levels (Figure 4.1) of knowledge about components in the repository. The disparity between L3 and L4 was very obvious in all subjects. For example, neither S2 nor S5 knew of the existence of the format method. Some subjects knew a lot about GUI packages, whereas others knew a lot about the java.lang package. Subjects who had more Java programming experience had more anticipation of the repository because they had explored the repository more often than the less experienced ones. Some of the anticipations were transferred from their programming experience with other programming languages. Sometimes, subjects anticipated the existence of components that were not included in the repository.6 For example, one subject thought there should be a method for filling a sequence with a sequence of numbers. Before he gave up and wrote his own code, he first wrote a comment in his program, hoping the system could deliver it, and then browsed the documentation system quite thoroughly, trying to find the nonexistent component. Even after he had finished his program, he still believed the component should exist in the repository, and he said: “I am not sure there is not a fill method [that can fill a sequence with a sequence of numbers].”

6 Note that in Figure 4.1, a part of L3 is outside of L4.

For those components that subjects had used before, there was a difference in levels of mastery (L1 and L2) as well. As expected, many components were reused by subjects directly, without consulting documents. However, in one experiment (S2-T5), the subject (S2) relied on the delivery mechanism to reuse a component that he claimed to have used many times before but whose exact name he could not remember, even though he knew that it was a static method.

8.4.4 Learning Task-Relevant Components on Demand

One important factor that contributes to the importance of the learning-on-demand model (Section 3.4.3.1) is that learning is more effective within the working context because programmers are more motivated to learn new things that can be immediately applied to their task at hand, and the existence of application working context makes learning easier. Observation of the experiments confirmed this claim, and the following example is quite typical in all experiments.

In the previously cited example (Section 8.4.1), neither S3 nor S4 had known how to use the java.text.NumberFormat class or what kind of functionality it had,7 and S5 had not even known of the existence of the java.text package. When the format method from that class was delivered, all of them realized they could use it. However, in order to use the format method, they needed to know how the class was structured and also needed other methods from the same class to implement the task. Despite their claims in the beginning that they could easily implement the task with what they already knew, they were all very motivated to learn how to use the format method and other related methods. It took S3 and S4 more than 15 minutes, and S5 more than 30 minutes, to learn all the needed methods. Given the fact that they all are expert programmers with extensive programming knowledge, learning those components probably cost them more time than had they just implemented the task by hand. However, they all were determined to learn those new components because they wanted to use those components instead of creating their own primitive solutions.

7 In fact, S4 used the java.text.DecimalFormat class, whereas S3 and S5 used the java.text.NumberFormat class. For the sake of brevity, I will use java.text.NumberFormat to represent both.

8.5 Problems of CodeBroker and Needed Improvements

The experiments uncovered several problems of CodeBroker, which provide guidelines to improve the system.

8.5.1 Irrelevant Components

Although the system delivered task-relevant components, it also delivered many irrelevant components. Most subjects would have liked the deliveries of the system to be more “focused” on their tasks. However, they also acknowledged that if they could find something immediately useful from the deliveries that could save time, they would like to use the system even if the deliveries are not “focused” enough.

In the experiments, subjects were asked to implement two or three unrelated tasks; therefore, the benefits of user models and discourse models were not fully utilized. As programmers continue using the system for a relatively longer time, the number of irrelevant components can be expected to decrease.

One particular problem was that the system looked at doc comments only. Some subjects wanted the system also to deliver components based on comments inside methods. One subject was disappointed that the system did not use the names of methods and variables because, although he did not like to write comments, he always created very good names to indicate what he wanted to implement. The system might be able to improve the relevance of deliveries if it used not only the currently entered doc comments, but also the surrounding comments and identifiers.
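The suggested improvement of drawing query terms from identifiers as well as comments could be sketched as follows. This is only an illustration under stated assumptions: the class QueryTerms and its methods are hypothetical, identifiers are assumed to follow the usual Java camelCase or underscore conventions, and no stop-word removal or stemming is attempted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: build query terms from a doc comment plus nearby identifiers. */
final class QueryTerms {
    /** Splits an identifier such as "sortEventList" into lower-case words. */
    static List<String> fromIdentifier(String identifier) {
        List<String> terms = new ArrayList<>();
        // Break where an upper-case letter follows a lower-case letter or digit,
        // and at underscores.
        for (String part : identifier.split("(?<=[a-z0-9])(?=[A-Z])|_")) {
            if (!part.isEmpty()) {
                terms.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Combines the words of a doc comment with terms taken from surrounding identifiers. */
    static List<String> build(String docComment, List<String> identifiers) {
        List<String> terms = new ArrayList<>();
        for (String word : docComment.split("[^A-Za-z]+")) {
            if (!word.isEmpty()) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        for (String identifier : identifiers) {
            terms.addAll(fromIdentifier(identifier));
        }
        return terms;
    }
}
```

With such a scheme, a programmer who writes no comment at all but names a method sortEventList would still contribute the terms sort, event, and list to the query.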

8.5.2 Abstraction Mismatch from Queries to Components

When programmers wrote comments to describe concretely what they wanted, the system had no means to find components that could be reused but were described in abstract terms. For example, the system did not deliver any reusable component for Task 6 (T6) that was assigned to S3 and S5. The task could be very easily implemented with the two methods push and pop in the jgl.PriorityQueue class. However, descriptions of those methods in the repository are very abstract.

public synchronized void push(Object object)
    Push an object.

public synchronized Object pop()
    Pop the last object that was pushed onto me.

Both subjects, however, described the task in concrete terms. Both of them interpreted the task literally as an event management problem and described the two methods as

/** Takes the new event and puts it at the end of the event queue */
/** Gets the next event and handles it */

/** add new event to waiting list */
/** get the next event from the list */

In order to address this abstraction mismatch problem (Section 3.4.3.2), CodeBroker needs to index components based not only on their descriptions but also on their usage context. For example, despite the disparity between the two subjects’ comments and the descriptions of the components in the repository, the comments of both subjects were quite similar. If one programmer used the push and pop methods in his program, and the system also indexed the two methods by using the comments of the program that used them, it might be able to deliver them to the second programmer.
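The idea of indexing a component by its usage context as well as its own description can be sketched with a toy inverted index. This is not the retrieval mechanism CodeBroker uses: the class UsageContextIndex is hypothetical, matching is plain word overlap, and stop words are not removed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: index components by descriptions and by comments of code that used them. */
final class UsageContextIndex {
    private final Map<String, Set<String>> termToComponents = new HashMap<>();

    /** Indexes a component under every word of the given text (description or usage comment). */
    void index(String component, String text) {
        for (String term : tokenize(text)) {
            termToComponents.computeIfAbsent(term, t -> new HashSet<>()).add(component);
        }
    }

    /** Returns, for each matching component, how many distinct query words it is indexed under. */
    Map<String, Integer> match(String query) {
        Map<String, Integer> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            for (String component : termToComponents.getOrDefault(term, Set.of())) {
                scores.merge(component, 1, Integer::sum);
            }
        }
        return scores;
    }

    private static Set<String> tokenize(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }
}
```

Indexed only under "Push an object.", the push method shares no word with the comment "add new event to waiting list". Once it is also indexed under the first subject's comment about putting a new event at the end of the event queue, the second subject's comment matches it on the words new and event.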

8.5.3 Lack of Guidance in Refining Queries

The system did not provide any guidance to help programmers choose more appropriate terms to describe their task. One subject wanted to find a sorting component, and he strongly believed it should exist somewhere in the repository. He first created the following comment:

/** sort the list */

for which the system did not deliver anything useful. He continued to try two more queries (as follows) by invoking the query refinement interface:

order list

returns the highest value from the list

which did not help either. In fact, the problem was caused by the term “list,” which was used in all three queries, because this term is used in the descriptions of dozens of components in the repository. If the system had a mechanism pointing out to the subject that the term “list” should be dropped, the subject would have been able to find the component he needed.
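One simple form such guidance could take is to flag query terms that occur in a large share of component descriptions. The following sketch is an assumption-laden illustration, not a feature of CodeBroker: the class QueryAdvisor, the method overlyCommonTerms, and the maxFraction threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: point out query terms that are too common to discriminate. */
final class QueryAdvisor {
    /** Returns the query terms that occur in more than maxFraction of the descriptions. */
    static List<String> overlyCommonTerms(List<String> queryTerms,
                                          List<String> descriptions,
                                          double maxFraction) {
        List<String> tooCommon = new ArrayList<>();
        for (String term : queryTerms) {
            String needle = term.toLowerCase(Locale.ROOT);
            long hits = descriptions.stream()
                    .filter(description -> containsWord(description, needle))
                    .count();
            if (hits > maxFraction * descriptions.size()) {
                tooCommon.add(term);
            }
        }
        return tooCommon;
    }

    private static boolean containsWord(String description, String word) {
        for (String candidate : description.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (candidate.equals(word)) {
                return true;
            }
        }
        return false;
    }
}
```

In the sorting example, a check of this kind would single out "list" as a candidate for removal while leaving "sort" in the query.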

8.5.4 Problems with User Modeling

As shown in Table 8.5, only one subject (S2) added two items (one package and one class) into the user models, and that was actually a wrong operation. The subject actually wanted to add them to the discourse model. This illustrated one design problem with the system. Subjects were concerned that if they added something into the user model, which is a permanent file, they could not get it back. An interface for programmers to edit their user models might be able to alleviate this concern.

Subjects used at least 57 different components (shown in the Total column in Table 8.3), but the system automatically detected and added only 20 method components in total to the user models (shown in Table 8.5). This was caused by the inconsistency between the programming style assumed by the adaptive mechanism in CodeBroker and that of the subjects. The system assumed that programmers would import a package first, followed by variable declarations, followed by method invocations. However, most subjects programmed in the reverse order: they declared variables and imported packages after they had used the variables in method invocations (Section 7.4.3.2). Modifications to the algorithm of adaptive user modeling are needed.

The current user modeling technique in CodeBroker is too simple. One subject reused one delivered component whose name he often mistook for something else although he had used the component many times before. A more sophisticated user modeling technique is needed to capture misconceptions like that so that the system can deliver the right name of the component when the user repeatedly uses the wrong name. Another problem was pointed out by another subject, who said:

“When you have programmed for a very long time, you may forget what you have used in your first program.”

Therefore, counting the number of uses is too simplistic; a forgetting mechanism should be incorporated to decide when to remove from user models those components that have not been used by programmers for a long time.
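A forgetting mechanism of the kind proposed here could, for instance, record when each component was last used in addition to how often. The sketch below is hypothetical throughout: the class ForgettingUserModel, the use-count threshold, the forgetting period, and the representation of time as a day number are assumptions made for illustration, and choosing suitable values would require empirical study.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a user model that counts uses and forgets unused components. */
final class ForgettingUserModel {
    private static final class Usage {
        int count;
        long lastUsedDay;
    }

    private final Map<String, Usage> usages = new HashMap<>();
    private final int knownThreshold;   // uses needed before a component counts as known
    private final long forgetAfterDays; // days of disuse after which a component is forgotten

    ForgettingUserModel(int knownThreshold, long forgetAfterDays) {
        this.knownThreshold = knownThreshold;
        this.forgetAfterDays = forgetAfterDays;
    }

    void recordUse(String component, long day) {
        Usage usage = usages.computeIfAbsent(component, c -> new Usage());
        usage.count++;
        usage.lastUsedDay = day;
    }

    /** A known component is not delivered; a forgotten one becomes deliverable again. */
    boolean isKnown(String component, long today) {
        Usage usage = usages.get(component);
        if (usage == null) {
            return false;
        }
        if (today - usage.lastUsedDay > forgetAfterDays) {
            usages.remove(component); // forgotten: start counting from scratch
            return false;
        }
        return usage.count >= knownThreshold;
    }
}
```

Resetting the count on forgetting is only one possible policy; a gradual decay of the count over time would be a gentler alternative.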

8.5.5 Lack of Examples of Components

Some components in the repository include simple examples that explain how to use them. During the experiments, whenever subjects found such examples, they immediately jumped to read them instead of reading descriptive texts. However, only very few components are accompanied by examples in the current repository. Creating examples for each component is a time-consuming task and increases the difficulty of setting up a component repository.

A possible solution to this problem is to make the component repository system extensible by the programmers who reuse components and to provide a mechanism that allows programmers to add their own simple examples to the component repository. Programmers commonly write their own test programs to try out potentially reusable components in order to fully understand their functionality (Lange & Moher, 1989; Aoki et al., 2001). Examples do not need to be added into the physical storage of the repository; instead, hyperlinks from components in the repository to example programs can be created by utilizing the link service provided by open hypermedia systems, such as the Chimera system (Anderson et al., 2000).

8.5.6 Lack of User Configurability

Currently, the RCI-display is placed as a part of the editor buffer, and it reduces the available workspace of programmers. This placement did not cause any discomfort in the experiments. It may become a problem in real use, however, because the experiment tasks were very small and programmers did not need a lot of editing space. Some subjects said that because they wanted to make their effective working space as large as possible in their normal programming practice, they wanted to be able to rearrange the RCI-display to other places.

They also wanted to control what information is shown in the RCI-display. For example, when the working space is too small, it would be better if the RCI-display shows component names only and shows the descriptions of components when programmers move the mouse over the names. Some subjects did not like to use the mouse to invoke the Skip Components Menu (Section 7.5.1); instead, they wished they could use the keyboard to filter components.

8.6 Summary of Evaluations

Overall, the findings from the experiments have indicated that

• Active component delivery can promote reuse by supporting the reuse of unknown components and reducing the cost of locating components.

• In most cases, modeling a programmer’s task based on comments is sufficient, although not perfect, to locate relevant components.

• Discourse models can improve the relevance of retrieved components.

• User models contribute to the personalization of delivery to individual programmers.

• Support of retrieval-by-reformulation is essential because locating reusable components is an iterative process.

All subjects reused components delivered by CodeBroker during the experiments. Most of them have very positively acknowledged the overall effectiveness of the system and have indicated that they would like to use CodeBroker as their daily programming environment.

However, the delivery mechanism alone is not sufficient; it must be combined with query and browsing mechanisms. Delivery jump-starts the reuse process by giving programmers immediate access to components contextualized to their task and their background knowledge. After that, programmers need to use query or browsing mechanisms to find other components that are also needed because in object-oriented programming, reusing one component often requires the reuse of other tightly coupled components.

Chapter 9

Related Work

This research has tried to find a suitable way to integrate active information delivery with reusable component repository systems to provide a reusable component-enriched programming environment so that programmers can easily access the needed components. This work has been greatly influenced by research efforts on active information systems, component repository systems, and intelligent programming environments. This chapter compares CodeBroker with closely related research work in those three fields.

9.1 Active Information Systems

The simplest implementation of the information delivery mechanism in an active information system is to deliver a piece of information without considering the working context, such as Microsoft Office’s “Tip of the Day” and a similar research prototype, the DYK (Did You Know) system (Owen, 1986). The irrelevance of the information to the task at hand often causes users to ignore it totally.

ACTIVIST is an active help system for a text editor that uses a plan library to infer user goals from observed actions by matching them against the condition part of plans (Fischer et al., 1985). It then suggests more efficient solutions that accomplish the same goals. It maintains a user model to tune the delivery to each user. CodeBroker resembles ACTIVIST in that both are fully embedded in the working environment and deliver task-relevant information into the workspace. However, ACTIVIST delivers feedback information after users have finished their task, and the delivery is meant to improve their future work. Information delivered by CodeBroker is meant to influence the current task under execution. LispCritic (Fischer, 1987) helps programmers improve their programming skills. After it has recognized a less ideal code segment, it uses rules to suggest a syntactical equivalent that is either cognitively or computationally more efficient. Unlike CodeBroker, which makes use of both the semantic and syntactical information of programs, LispCritic has no knowledge about the semantics.

Apple Data Detector (ADD) (Nardi et al., 1998) is an active system that smoothes the routine workflow of extracting structured information from everyday documents and automates the consequent actions. Both CodeBroker and ADD take advantage of the semiformal structure (doc comments in CodeBroker, URLs in ADD) existing in the working environment and automate the following actions (retrieving reusable components in CodeBroker, storing the URL as a bookmark in ADD) to eliminate the unnecessary switch of working contexts.

Remembrance Agent (RA) (Rhodes & Starner, 1996) tries to augment human memory by displaying relevant documents. Like CodeBroker, RA also listens to a text editor and autonomously formulates a query based on the user’s current focus. A back-end search engine is invoked to find relevant old emails and notes in the user’s individual information space. RA deals with unstructured texts only; CodeBroker relies on the semiformal structure of the program to extract needed information. In addition, CodeBroker also makes use of syntactical information. One shortcoming of RA is that it treats all documents the same, although its goal is to remind users of forgotten documents.

Letizia (Lieberman, 1997) assists users in browsing the WWW by suggesting web pages within a few links from the current page. Like CodeBroker, it aims at eliminating the context switch from a browsing interface to a search interface to streamline the exploration of web information. Web pages in the bookmark list of a user are analyzed based on information retrieval techniques to create an interest profile. Suggestions are based on the similarity between web pages and the interest profile. Other web-browsing assistant systems, such as WebWatcher (Armstrong et al., 1995) and Lira (Balabanovic & Shoham, 1995), adopt similar approaches.

9.2 Component Repository Systems

Research on reusable component repository systems is abundant. They differ from each other mainly in the retrieval mechanisms adopted. These systems can be divided into three groups according to the aspect of programs on which the retrieval mechanisms are based.

9.2.1 Concept-Based Component Repository Systems

The retrieval mechanism used in CodeBroker is similar to those of reusable component repository systems that use free-text indexing. GURU (Maarek et al., 1991) indexes components based on their textual documentation. Etzkorn and Davis have tried to use header comments (similar to doc comments in CodeBroker) to index legacy object-oriented programs (Etzkorn & Davis, 1997b). Comments and identifier names are also used for indexing in (Girardi & Ibrahim, 1995) and (DiFelice & Fonzi, 1998). Michail and Notkin have demonstrated the possibility of using identifier names only to find similar reusable components for comparison (Michail & Notkin, 1999).

Free-text indexing makes it easy both to set up a component repository and for programmers to formulate reuse queries. Despite its simplicity, empirical studies have found that it performs no worse, in terms of retrieval effectiveness, than other more elaborate, effort-consuming repository systems (Frakes & Pole, 1994; Mili et al., 1997b). Nevertheless, free-text indexing-based reuse systems do not directly support shortening the conceptual gap in query formulation.

One attempt to bridge the conceptual gap is to use structured representations and knowledge bases. Both CodeFinder (Henninger, 1997) and LaSSIE (Devanbu et al., 1991) use frames to represent reusable components. Frames in CodeFinder are connected by an associative network whose links have weights to reflect the semantic relationships among components. Searching for relevant components is supported by spreading activation. Frames in LaSSIE are structured into hierarchical, taxonomic categories by human experts. To ease the difficulty of creating the frame representations of components, ROSA (Girardi & Ibrahim, 1995) applies natural language processing techniques to automate the process. However, the sentences that can be processed are very limited. The multiple faceted classification schema (Prieto-Diaz, 1991) is another format of structured representations. Reusable components are represented with multiple facets, each of which is described with a term. A conceptual distance graph has to be constructed to reflect the semantic relationships among terms. AIRS is a system that combines multiple facets and the frame-based approach (Ostertag et al., 1992).

Structured representation-based systems are labor intensive in creating representations of components and knowledge bases.

9.2.2 Constraint-Based Component Repository Systems

Constraints of programs can also be used to index and retrieve reusable components. Rittri first proposed to use signatures in reusable component retrieval (Rittri, 1989). His work is further extended by Zaremski and Wing who give a general framework for signature matching in languages (Zaremski & Wing, 1995). Although research on signature matching has largely focused on functional programming languages that are often designed with a sound type theory, signature matching is also applicable to other strongly typed programming languages (DiCosmo, 1995). An Ada version of signature matching has been implemented in (Stringer-Calvert, 1994). CodeBroker applies this technique to a strongly typed object-oriented programming language. Signature matching in CodeBroker is not used as the sole method of retrieving components; rather, it is used as a filter to exclude those components that are significantly different from the current task in terms of constraint compatibility.
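The use of signatures as a filter can be illustrated with a deliberately crude sketch built on Java reflection. It is neither CodeBroker's signature matching algorithm nor the framework of Zaremski and Wing: the class SignatureFilter is hypothetical, and compatibility is reduced to an assumption that each parameter type must accept the corresponding argument type and the return type must fit the wanted type. Boxing, primitive widening, and parameter reordering are ignored.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: keep only methods whose signature is compatible with a query. */
final class SignatureFilter {
    /** True if the method accepts the given argument types and its result fits wantedReturn. */
    static boolean isCompatible(Method method, Class<?> wantedReturn, Class<?>... argTypes) {
        Class<?>[] params = method.getParameterTypes();
        if (params.length != argTypes.length) {
            return false;
        }
        for (int i = 0; i < params.length; i++) {
            // No boxing or primitive widening: long matches long, not double or Object.
            if (!params[i].isAssignableFrom(argTypes[i])) {
                return false;
            }
        }
        return wantedReturn.isAssignableFrom(method.getReturnType());
    }

    /** Names of the public methods of owner that pass the signature filter. */
    static List<String> compatibleMethods(Class<?> owner, Class<?> wantedReturn,
                                          Class<?>... argTypes) {
        List<String> names = new ArrayList<>();
        for (Method method : owner.getMethods()) {
            if (isCompatible(method, wantedReturn, argTypes)) {
                names.add(method.getName());
            }
        }
        return names;
    }
}
```

For example, asking java.text.NumberFormat for methods that take a long and return a String keeps format and excludes parse, which takes a String and returns a Number.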

The formal specification-based approach is another form of using constraints to index and retrieve components. Zaremski and Wing have adopted pre- and post-predicates to find components that exactly match or approximately match a reuse query (Zaremski & Wing, 1997). A. Mili et al. have tried to classify reusable components based on the refinement order existing among their formal specifications (Mili et al., 1997a). The formal specification-based approach could be integrated into CodeBroker to improve the precision of retrieval, if the programming environment supports formal methods. For the majority of programmers, however, the formal approach is too difficult to use.

9.2.3 Code-Based Reuse Repository Systems

Behavior sampling exploits the code aspect of programs to retrieve reusable components (Hall, 1993; Podgurski & Pierce, 1993). In behavior sampling-based systems, a programmer randomly chooses a small set of sample inputs and computes the corresponding outputs after having specified the signature of the module (same as the constraint query created by CodeBroker). Reusable components whose signature is compatible are found and executed on the sample inputs. Components whose outputs match the outputs computed by the programmer are returned. Behavior sampling is difficult to apply to components with complicated data structures. Moreover, it is unable to find close but not identical components.
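The core loop of behavior sampling can be sketched in a few lines. This is an illustration of the general idea only, not the systems of Hall or of Podgurski and Pierce: the class BehaviorSampler is hypothetical, candidates are assumed to have already passed a signature check, and each is modeled as a single-argument function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: retrieve candidates by running them on programmer-supplied samples. */
final class BehaviorSampler {
    /** Returns the names of candidates whose output equals the expected output on every sample. */
    static <I, O> List<String> matching(Map<String, Function<I, O>> candidates,
                                        Map<I, O> samples) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Function<I, O>> candidate : candidates.entrySet()) {
            boolean agrees = true;
            for (Map.Entry<I, O> sample : samples.entrySet()) {
                O actual = candidate.getValue().apply(sample.getKey());
                if (!sample.getValue().equals(actual)) {
                    agrees = false;
                    break;
                }
            }
            if (agrees) {
                matches.add(candidate.getKey());
            }
        }
        return matches;
    }
}
```

Because a candidate is rejected as soon as one output differs, the sketch also shows why the approach cannot return components that are close to, but not identical with, the desired behavior.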

9.2.4 Uniqueness of CodeBroker

Compared with other component repository systems, CodeBroker has three unique features:

(1) It automatically extracts and formulates reuse queries.

(2) It is adaptable and adaptive to each programmer to reflect their growing knowledge about reusable components.

(3) It is seamlessly integrated with current programming environments.

All of the above-mentioned systems, except CodeFinder, assume that component retrieval is a one-time effort and do not support retrieval-by-reformulation. In addition to the query refinement supported in CodeFinder, CodeBroker allows programmers to manipulate retrieved components interactively to reduce the difficulties of choosing components.

9.3 Intelligent Programming Environments

CodeBroker is similar to structure-based editors, which exploit the structure of programs to speed and ease the work of program entry. Syntax-based editors translate the syntactical knowledge of programming languages into templates to aid novice programmers by freeing them from remembering syntactical details (Szwillus & Neal, 1996). In terms of supporting reuse, CodeBroker is more similar to the example-based programming environment proposed in (Neal, 1996) and the cliche-based programming environment of KBEmacs (Rich & Waters, 1990). An example-based programming environment has a window for example programs. Programmers can reuse them directly or use them as examples of language constructs or of algorithm implementations. KBEmacs is implemented as an extension to Emacs. It has a knowledge base of cliches, or program plans, which programmers can reuse in their programming by referring to them by name. However, in both systems, programmers have to activate examples or cliches by name. That is to say, programmers have to learn the “vocabulary” before they can reuse.

The Argo design environment (Robbins & Redmiles, 1998) incorporates plan recognition-based critics to actively deliver information for programmers to reflect upon their current design or programs. Based on the type of programming knowledge provided, critics are grouped into nine types: correctness, completeness, consistency, optimization, alternative, evolvability, presentation, experimental, and organizational critics. The critiquing infrastructure (or information delivery mechanism, as it is called in this thesis) of Argo supports program design by supplying, automatically and in a timely manner, general programming knowledge that is relevant to the task at hand. It complements the support provided by CodeBroker, which focuses on supplying knowledge about reusable components.

Chapter 10

Future Work and Conclusions

This chapter discusses future research opportunities uncovered by this research, summarizes the research approach taken, and then concludes with a description of the major contributions of this research.

10.1 Future Work

In the discussion of the problems of CodeBroker found in the experiments (Section 8.5), I have described some future work needed to improve the system. In this section, I point out future research questions in need of investigation based on lessons learned from the research, and speculate on possible directions.

10.1.1 Extending CodeBroker to a Larger Scale

Currently, the component repository of CodeBroker is located on the same machine as the programming environment and is created statically before its use, because most current component repositories are closed and proprietary. As the open-source movement attracts more and more developers, we can expect more software systems and software components to become open source; one example is the Jun system (a 3D Smalltalk/Java library) (Aoki et al., 2001). This will make it increasingly difficult for programmers to learn about newly available open-source components. I envision a distributed CodeBroker system running on several server computers to which programmers contribute open-source components. It dynamically indexes components from those constantly evolving repositories and then delivers components through networks to other programmers. Through the brokerage of CodeBroker, programmers can benefit from each other’s work and improve the productivity of software development by avoiding unnecessary repetition of work.

CodeBroker can be used not only to make programmers aware of unknown components developed by others, but also to foster a community of developers. Forming a relatively stable community of developers is the key factor for the success of an Open Source project. Currently, support for the creation of such communities is rare. When CodeBroker finds and delivers components or systems that are relevant to what a programmer is doing, it can also make the programmer aware of the programmers who created those components and systems, thus creating an opportunity for those programmers to join forces and form a new community.

10.1.2 Extending CodeBroker to Higher Design Levels

Although CodeBroker is designed to promote reuse in the phase of coding, the underlying principles are equally applicable to higher levels of software development activities. For example, if developers use modeling tools such as Argo (Taylor et al., 1996; Robbins & Redmiles, 1998) to create a conceptual design of a software system, they need to specify the functionality and signature for each class. An active component repository system can utilize that information to actively deliver potentially reusable components.

In addition to delivering reusable components, an active reuse system can also try to deliver reusable designs from a repository of existing designs by exploiting the similarity between the current design and those existing designs. The conceptual similarity can still be computed based on their common descriptive names and functionality descriptions; however, constraint compatibility at the design level needs to be based on the interaction patterns of design elements.

10.1.3 Supporting More Complicated Indexing and Retrieving Mechanisms

How well an active component repository system such as CodeBroker can deliver relevant components depends on how well it can capture the programming task from existing information in a programming environment. CodeBroker has tried to capture the task based on conceptual information revealed through comments and constraint information revealed through signatures. However, there are other kinds of information that can be utilized. For example, a programmer may try to write a program using a known design pattern or framework, which may place extra constraints on the kind of components that could be reused. This piece of information could be used by the repository system to limit its search range and improve the relevance of delivered components.

Active component repository systems can also explore the affinity of components to deliver relevant components. The affinity of two components is the likelihood that the two components will be used in the same program. The repository system can deliver components that have high affinity with the components used by programmers in their current programs. There are two possible approaches to computing the affinity of two components: coupling-based and statistics-based. The coupling-based approach looks at how tightly two components are structurally connected. Two components are more likely to appear together if they both access common data, or if the data type output by one component is the same as the data type input by another. The statistics-based approach looks at how often two components appear together in a single program. Such co-occurrence information could be obtained in a way similar to automatic thesaurus construction in information retrieval systems, by treating programs as documents and components as terms.
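A minimal sketch of the statistics-based approach is given below, treating each program as a "document" and each component it uses as a "term". The Jaccard-style ratio used as the affinity score is an assumption chosen for illustration, since the text does not prescribe a particular measure; the function names `cooccurrence_affinity` and `suggest` and the toy `PROGRAMS` data are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_affinity(programs):
    """Map each unordered component pair to a Jaccard-style affinity in (0, 1]."""
    usage = Counter()  # number of programs using each component
    joint = Counter()  # number of programs using both components of a pair
    for components in programs:
        unique = sorted(set(components))
        usage.update(unique)
        joint.update(combinations(unique, 2))
    return {
        pair: both / (usage[pair[0]] + usage[pair[1]] - both)
        for pair, both in joint.items()
    }


def suggest(programs, in_use, top_n=3):
    """Rank components not yet in use by their highest affinity with one in use."""
    scores = {}
    for (a, b), value in cooccurrence_affinity(programs).items():
        for used, other in ((a, b), (b, a)):
            if used in in_use and other not in in_use:
                scores[other] = max(scores.get(other, 0.0), value)
    # Highest affinity first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Toy corpus: each inner list names the components one program uses.
PROGRAMS = [
    ["FileReader", "BufferedReader", "StringTokenizer"],
    ["FileReader", "BufferedReader"],
    ["Random", "Math.abs"],
    ["BufferedReader", "StringTokenizer"],
]

print(suggest(PROGRAMS, {"FileReader"}))
```

A coupling-based score could be combined with this one, for instance by boosting pairs in which the output type of one component equals an input type of the other; that combination is not shown here.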

10.2 Summary

“The best way to attack the essence of building software is not to build it at all” (Brooks, 1995). Reusing existing software components is one way of reaching this goal. By reuse, programmers can avoid repetitive work and focus on the unique features of the new system. Based on his investigation, Jones claimed that, of all programs written in 1983, fewer than 15% were unique, novel, and specific to individual programs, whereas 85% were common and generic (Jones, 1984). The problem, however, is how programmers can know that they are doing something that others have done many times before. As one programmer said:

“I could be creating a method that does exactly the same thing somebody else’s does...even though we have access to each other’s code. We might call them different names and we might have a bit different way of doing it, but we’re still doing the same thing.” (Fichman & Kemerer, 1997)

This has been the central question investigated in this thesis. In order for programmers to be able to reuse components whose existence is not known to them, component repository systems need to support an information delivery mechanism in addition to the traditional information access mechanism. A conceptual framework for designing active component repository systems was proposed. The key challenge of creating an active component repository system is to contextualize the delivery of reusable components to the task and the background knowledge of programmers. This thesis proposed that the contextualization could be realized through the combination of task models, which capture the immediate programming task; discourse models, which capture the larger context in which the current programming task takes place; and user models, which represent programmers’ knowledge about reusable components.

Based on the conceptual framework, an active component repository system, CodeBroker, has been developed. CodeBroker postulates that comments and signatures in a program can serve as task models to describe what programmers are going to develop. Relevant reusable components can be retrieved by using such task models as reuse queries. The relevance of components can be further improved by using discourse models that are incrementally evolved by programmers through interactions with the system. In addition, the delivery of components can be personalized to each programmer using his or her user model, which is both adaptable and adaptive.

The empirical evaluation of CodeBroker found that the research met its goals at both the technical level and the theoretical level. The technical approach taken by CodeBroker was feasible because, in most of the experiments, CodeBroker successfully delivered task-relevant and user-specific components to programmers. The theoretical hypothesis of the research, that active component repository systems can promote reuse, was also validated through the empirical studies. CodeBroker not only made it possible for programmers to reuse components they had not known before, but also triggered programmers to find more related components in order to reuse the delivered ones. Moreover, because CodeBroker reduced the cost of locating components, programmers were more willing to explore the possibility of reuse.

10.3 Contributions

The contributions of this research were threefold. First, this research contributed to a better understanding of the cognitive difficulties faced by programmers who want to reuse. It identified two major cognitive barriers to locating reusable components: information islands and the perceived low reuse utility, drawing on cognitive theories and empirical studies on programming and the use of large information repositories.

Second, a new type of component repository system, the active component repository system, was proposed, developed, and evaluated. Active component repository systems not only automate the component-locating process but also help programmers identify reuse opportunities that otherwise would be missed because they either did not know of the existence of the components or perceived that it was too costly to locate them.

Third, this research contributed to the design of active information systems in general. Besides programming, there are many other knowledge-intensive domains where workers rely on external information resources to augment their mental abilities to comprehend and solve complex problems, as much as programmers rely on reusable components. The challenge in designing information systems to support information acquisition for those knowledge workers is not only to make information available to them at any time, at any place, and in any form, but also to reduce information overload by making information relevant to the task at hand and to the background knowledge of the users. The similarity-analysis-based approach to task modeling, in combination with incremental discourse modeling and user modeling, proposed and evaluated through the implementation of CodeBroker in this research, is a step toward meeting such a challenge.

Bibliography

Aaen, I. (1992), CASE Tools Bootstrapping (How Little Strokes Fell Great Oaks), in V.-P. T. K. Lyytinen (ed.), Next Generation CASE Tools, IOS, Netherlands, pp. 8–17.

Alexander, C. (1964), The Synthesis of Form, Harvard University Press.

Alexander, C., Ishikawa, S., Silverstein, M., Jacobson, M., Fiksdahl-King, I. & Angel, S. (1977), A Pattern Language: Towns, Buildings, Construction, Oxford University Press, New York.

Anderson, K. M., Taylor, R. N. & Whitehead, E. J. (2000), Chimera: Hypermedia for Heterogeneous Software Development Environments, ACM Transactions on Information Systems 18(3), 211–245.

Aoki, A., Hayashi, K., Kishida, K., Nakakoji, K., Nishinaka, Y., Reeves, B., Takashima, A. & Yamamoto, Y. (2001), A Case Study of the Evolution of Jun: An Object-Oriented Open-Source 3D Multimedia Library, in Proceedings of 23rd International Conference on Software Engineering (ICSE’01), Toronto, Canada, (to appear).

Armstrong, R., Freitag, D., Joachims, T. & Mitchell, T. (1995), WebWatcher: A Learning Apprentice for the World Wide Web, in Proceedings of AAAI Spring Symposium on Information Gathering, Stanford, CA, pp. 6–12.

Balabanovic, M. & Shoham, Y. (1995), Learning Information Retrieval Agents: Experiments with Automated Web Browsing, in Proceedings of AAAI Spring Symposium on Information Gathering, Stanford, CA, pp. 13–18.

Basili, V., Briand, L. & Melo, W. (1996), How Reuse Influences Productivity in Object-Oriented Systems, Communications of the ACM 39(10), 104–116.

Batory, D., Johnson, C., MacDonald, B. & von Heeder, D. (2000), Achieving Extensibility through Product-Lines and Domain-Specific Languages: A Case Study, in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna, Austria, pp. 117–136.

Belkin, N. & Croft, B. (1992), Information Filtering and Information Retrieval, Communications of the ACM 35(12), 29–37.

Biggerstaff, T. J. (2000), A New Control Structure for Transformation-Based Generators, in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna, Austria, pp. 1–19.

Biggerstaff, T. J., Mitbander, B. G. & Webster, D. E. (1994), Program Understanding and the Concept Assignment Problem, Communications of the ACM 37(5), 72–83.

Boehm, B. (1999), Managing Software Productivity and Reuse, IEEE Computer 16(9), 111–113.

Bradshaw, J. M. (1997), An Introduction to Software Agents, in J. M. Bradshaw (ed.), Software Agents, AAAI Press, Menlo Park, CA, pp. 1–46.

Brooks, F. P. J. (1995), The Mythical Man-Month: Essays on Software Engineering, 20th anniversary edition, Addison-Wesley, Reading, MA.

Browne, J., Lee, T. & Werth, J. (1990), Experimental Evaluation of a Reusability-Oriented Parallel Programming Environment, IEEE Transactions on Software Engineering 16(2), 111–120.

Buckley, C., Salton, G. & Allan, J. (1994), The Effect of Adding Relevance Information in a Relevance Feedback Environment, in W. B. Croft & C. J. v. Rijsbergen (eds.), Proceedings of 17th Annual International ACM SIGIR Conference, Springer-Verlag, Dublin, Ireland, pp. 292–300.

Card, S., Robertson, G. & Mackinlay, J. (1991), The Information Visualizer: An Information Workspace, in Proceedings of Conference on Human Factors in Computing Systems, ACM Press, pp. 181–188.

Carey, T. & Rusli, M. (1995), Usage Representations for Reuse of Design Insights: A Case Study of Access to On-Line Books, in J. M. Carroll (ed.), Scenario-Based Design: Envisioning Work and Technology in System Development, Wiley, pp. 165–182.

Carroll, J. M. & Rosson, M. B. (1987), Paradox of the Active User, in J. M. Carroll (ed.), Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, The MIT Press, Cambridge, MA, pp. 80–111.

Cox, B. J. (1996), Superdistribution: Objects as Property on the Electronic Frontier, Addison-Wesley, Reading, MA.

Crestani, F., Lalmas, M., Van Rijsbergen, C. J. & Campbell, I. (1998), ’Is This Document Relevant? ... Probably’: A Survey of Probabilistic Models in Information Retrieval, ACM Computing Surveys 30(4), 528–552.

Croft, W. B. & Harper, D. J. (1979), Using Probabilistic Models of Document Retrieval without Relevance Information, Journal of Documentation 35, 285–295.

Curtis, B. (1989), Cognitive Issues in Reusing Software Artifacts, in T. J. Biggerstaff & A. J. Perlis (eds.), Software Reusability, Vol. II, ACM Press, New York, pp. 269–287.

Curtis, B., Krasner, H. & Iscoe, N. (1988), A Field Study of the Software Design Process for Large Systems, Communications of the ACM 31(11), 1268–1287.

Damiani, E., Fugini, M. G. & Fusaschi, E. (1997), A Descriptor-Based Approach to OO Code Reuse, IEEE Software 14(10), 73–80.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K. & Harshman, R. (1990), Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41(6), 391–407.

Detienne, F. (1995), Design Strategies and Knowledge in Object-Oriented Programming: Effects of Expertise, Human-Computer Interaction 10(2/3), 129–169.

Devanbu, P., Brachman, R. J., Selfridge, P. G. & Ballard, B. W. (1991), LaSSIE: A Knowledge-Based Software Information System, Communications of the ACM 34(5), 34–49.

DiBona, C., Ockman, S. & Stone, M. (eds.) (1999), Open Sources: Voices from the Open Source Revolution, O’Reilly & Associates, Sebastopol, CA.

DiCosmo, R. (1995), Isomorphisms of Types: From Lambda Calculus to Information Retrieval and Language Design, Birkhauser, Boston.

Dieterich, H., Malinowski, U., Kuhme, T. & Schneider-Hufschmidt, M. (1993), State of the Art in Adaptive User Interfaces, in M. Schneider-Hufschmidt, T. Kuhme & U. Malinowski (eds.), Adaptive User Interfaces: Principles and Practice, Elsevier Science Publishers, Amsterdam, pp. 13–48.

DiFelice, P. & Fonzi, G. (1998), How to Write Comments Suitable for Automatic Software Indexing, Journal of Systems and Software 42, 17–28.

Dubinsky, E., Freudenberger, S., Schonberg, E. & Schwartz, J. T. (1989), Reusability of Design for Large Software Systems: An Experiment with the SETL Optimizer, in T. J. Biggerstaff & A. J. Perlis (eds.), Software Reusability, Vol. I, ACM Press, New York, pp. 275–294.

Dusink, L. & Van Katwijk, J. (1995), Reuse Dimensions, in Proceedings of ACM Symposium on Software Reuse (SSR’95), ACM Press, Seattle, WA, pp. 137–149.

Engelbart, D. C. (1990), Knowledge-Domain Interoperability and an Open Hyperdocument System, in Proceedings of Computer Supported Cooperative Work 1990, ACM Press, New York, pp. 143–156.

Etzkorn, L. H. & Davis, C. G. (1997a), Automated Object-Oriented Reusable Component Identification, Knowledge-Based Systems 9(8), 517–524.

Etzkorn, L. H. & Davis, C. G. (1997b), Automatically Identifying Reusable OO Legacy Code, IEEE Computer 30(10), 66–71.

Fafchamps, D. (1994), Organizational Factors and Reuse, IEEE Software 11(5), 31–41.

Feather, M. S. (1989), Reuse in the Context of a Transformation-Based Methodology, in T. J. Biggerstaff & A. J. Perlis (eds.), Software Reusability, ACM Press, New York, pp. 337–360.

Fichman, R. G. & Kemerer, C. E. (1997), Object Technology and Reuse: Lessons from Early Adopters, IEEE Software 14(10), 47–59.

Fischer, G. (1987), A Critic for LISP, in J. McDermott (ed.), Proceedings of the 10th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, Los Altos, CA, pp. 177–184.

Fischer, G. (1991), Supporting Learning on Demand with Design Environments, in L. Birnbaum (ed.), International Conference on the Learning Sciences, Association for the Advancement of Computing in Education, Evanston, IL, pp. 165–172.

Fischer, G. (1993), Shared Knowledge in Cooperative Problem-Solving Systems: Integrating Adaptive and Adaptable Components, in M. Schneider-Hufschmidt, T. Kuehme & U. Malinowski (eds.), Adaptive User Interfaces: Principles and Practice, Elsevier Science Publishers, Amsterdam, pp. 49–68.

Fischer, G. (1994), Domain-Oriented Design Environments, Automated Software Engineering 1(2), 177–203.

Fischer, G. (1998a), Beyond ’Couch Potatoes’: From Consumers to Designers, in Proceedings of 1998 Asia-Pacific Computer and Human Interaction, IEEE Computer Society, Kanagawa, Japan, pp. 2–9.

Fischer, G. (1998b), Seeding, Evolutionary Growth and Reseeding: Constructing, Capturing and Evolving Knowledge in Domain-Oriented Design Environments, Automated Software Engineering 5(4), 447–464.

Fischer, G. (2001), User Modeling in Human-Computer Interaction, User Modeling and User-Adapted Interaction (to appear).

Fischer, G. & Eisenberg, M. (1994), Programmable Design Environments: Integrating End-User Programming with Domain-Oriented Assistance, in Human Factors in Computing Systems, CHI’94 Conference Proceedings, Boston, MA, pp. 431–437.

Fischer, G. & Mastaglio, T. (1989), Computer-Based Critics, in Proceedings of the 22nd Annual Hawaii Conference on System Sciences (HICSS-22), Vol. III: Decision Support and Knowledge Based Systems Track, IEEE Computer Society, Kailua-Kona, HI, pp. 427–436.

Fischer, G. & Nieper-Lemke, H. (1989), HELGON: Extending the Retrieval by Reformulation Paradigm, in Human Factors in Computing Systems, CHI’89 Conference Proceedings, Austin, TX, pp. 357–362.

Fischer, G. & Reeves, B. N. (1995), Beyond Intelligent Interfaces: Exploring, Analyzing and Creating Success Models of Cooperative Problem Solving, in R. Baecker, J. Grudin, W. Buxton & S. Greenberg (eds.), Readings in Human-Computer Interaction: Toward the Year 2000, 2nd edition, Morgan Kaufmann, San Francisco, CA, pp. 822–831.

Fischer, G. & Schneider, M. (1984), Knowledge-Based Communication Processes in Software Engineering, in Proceedings of 7th International Conference on Software Engineering (ICSE’84), IEEE Computer Society, Orlando, FL, pp. 358–368.

Fischer, G. & Ye, Y. (2001), Personalizing Delivered Information in a Software Reuse Environment, in Proceedings of User Modeling 2001, Sonthofen, Germany, (to appear).

Fischer, G., Henninger, S. & Redmiles, D. (1991), Cognitive Tools for Locating and Comprehending Software Objects for Reuse, in Proceedings of 13th International Conference on Software Engineering (ICSE’91), IEEE Computer Society, Austin, TX, pp. 318–328.

Fischer, G., Lemke, A. C. & Schwab, T. (1985), Knowledge-Based Help Systems, in Human Factors in Computing Systems, CHI’85 Conference Proceedings, San Francisco, CA, pp. 161–167.

Fischer, G., Nakakoji, K., Ostwald, J., Stahl, G. & Sumner, T. (1993), Embedding Critics in Design Environments, The Knowledge Engineering Review Journal 8(4), 285–307.

Fischer, G., Nakakoji, K., Ostwald, J., Stahl, G. & Sumner, T. (1998), Embedding Critics in Design Environments, in M. T. Maybury & W. Wahlster (eds.), Readings in Intelligent User Interfaces, Morgan Kaufmann Publisher, pp. 537–559.

Fischer, G., Redmiles, D., Williams, L., Puhr, G., Aoki, A. & Nakakoji, K. (1995), Beyond Object-Oriented Development: Where Current Object-Oriented Approaches Fall Short, Human-Computer Interaction, Special Issue on Object-Oriented Design 10(1), 79–119.

Flanagan, D. (1997), JAVA in a Nutshell, 2nd edition, O’Reilly & Associates, Sebastopol, CA.

Frakes, W. & Terry, C. (1996), Software Reuse: Metrics and Models, ACM Computing Surveys 28(2), 415–435.

Frakes, W. B. & Fox, C. J. (1995), Sixteen Questions about Software Reuse, Communications of the ACM 38(6), 75–87.

Frakes, W. B. & Fox, C. J. (1996), Quality Improvement Using a Software Reuse Failure Modes Model, IEEE Transactions on Software Engineering 22(4), 274–279.

Frakes, W. B. & Pole, T. P. (1994), An Empirical Study of Representation Methods for Reusable Software Components, IEEE Transactions on Software Engineering 20(8), 617–630.

Furnas, G. W., Landauer, T. K., Gomez, L. M. & Dumais, S. T. (1987), The Vocabulary Problem in Human-System Communication, Communications of the ACM 30(11), 964–971.

Gamma, E., Johnson, R., Helm, R. & Vlissides, J. (1994), Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, Reading, MA.

Ghezzi, C., Jazayeri, M. & Mandrioli, D. (1991), Fundamentals of Software Engineering, Prentice Hall, Englewood Cliffs, NJ.

Girardi, M. R. & Ibrahim, B. (1995), Using English to Retrieve Software, Journal of Systems and Software 30, 249–270.

Girgensohn, A. (1992), End-User Modifiability in Knowledge-Based Design Environments, Ph.D. Dissertation, University of Colorado at Boulder.

Gosling, J., Joy, B. & Steele, G. (1996), The Java Language Specification, 2nd edition, Addison-Wesley, Reading, MA.

Graham, I. (1995), Reuse: A Key to Successful Migration, Object Magazine 5(6), 82–83.

Griss, M. L. (2000), Implementing Product-Line Features with Component Reuse, in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna, Austria, pp. 137–152.

Grudin, J. (1994), Groupware and Social Dynamics: Eight Challenges for Developers, Communications of the ACM 37(1), 92–105.

Halasz, F. G. (1988), Reflections on NoteCards: Seven Issues for the Next Generation of Hypermedia Systems, Communications of the ACM 31(7), 836–852.

Hall, R. J. (1993), Generalized Behavior-Based Retrieval, in Proceedings of 15th International Conference on Software Engineering (ICSE’93), ACM Press, Baltimore, MD, pp. 371–380.

Hallsteinsen, S. & Paci, M. (eds.) (1997), Experiences in Software Evolution and Reuse: Twelve Real World Projects, Springer-Verlag, Berlin.

Harman, D. (1995), Overview of the Third Text REtrieval Conference (TREC-3), in D. Harman (ed.), Overview of the Third Text REtrieval Conference, National Institute of Standards and Technology Special Publication, Gaithersburg, MD, pp. 1–21.

Hayes, J. R. & Simon, H. A. (1977), Psychological Differences among Problem Isomorphs, in N. J. Castellan, D. B. Pisoni & G. R. Potts (eds.), Cognitive Theory, Vol. 2, Erlbaum, Hillsdale, NJ.

Hayes-Roth, B. & Hayes-Roth, F. (1979), A Cognitive Model of Planning, Cognitive Science 3, 275–310.

Helm, R. & Maarek, Y. S. (1991), Integrating Information Retrieval and Domain Specific Approaches for Browsing and Retrieval in Object-Oriented Class Libraries, in Proceedings of the 1991 ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA’91), pp. 47–61.

Henderson-Sellers, B. & Edwards, J. M. (1990), The Object-Oriented Systems Life Cycle, Communications of the ACM 33(9), 143–159.

Henninger, S. (1993), Locating Relevant Examples for Example-Based Software Design, Ph.D. Dissertation, University of Colorado at Boulder.

Henninger, S. (1997), An Evolutionary Approach to Constructing Effective Software Reuse Repositories, ACM Transactions on Software Engineering and Methodology 6(2), 111–140.

Hoc, J.-M., Green, T. R. G., Samurcay, R. & Gilmore, D. J. (eds.) (1990), Psychology of Programming, Academic Press, New York.

Horvitz, E., Jacobs, A. & Hovel, D. (1999), Attention-Sensitive Alerting, in Proceedings of Conference on Uncertainty and Artificial Intelligence 1999, Morgan Kaufmann, San Francisco, CA, pp. 305–313.

Isoda, S. (1995), Experiences of a Software Reuse Project, Journal of Systems and Software 30, 171–186.

Jarzabek, S. & Huang, R. (1998), The Case for User-Centered CASE Tools, Communications of the ACM 41(8), 93–99.

Johnson, R. E. (1997), Components, Frameworks, Patterns, in Proceedings of ACM Symposium on Software Reuse (SSR’97), ACM Press, Boston, MA, pp. 10–17.

Jones, M. P. (1997), Spoken-Language Help for High-Functionality Applications, Ph.D. Dissertation, University of Colorado at Boulder.

Jones, T. C. (1984), Reusability in Programming: A Survey of the State of the Art, IEEE Transactions on Software Engineering SE-10(5), 1984.

Joos, R. (1994), Software Reuse at Motorola, IEEE Software 11(5), 42–47.

Jurafsky, D. & Martin, J. (2000), Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall, Upper Saddle River, NJ.

Kang, K. C. (1998), Feature-Oriented Development of Applications for a Domain, in W. Frakes (ed.), Systematic Software Reuse, Annals of Software Engineering 5, Baltzer Science Publishers, Bussum, The Netherlands, pp. 143–168.

Kintsch, W. (1998), Comprehension: A Paradigm for Cognition, Cambridge University Press, Cambridge, UK.

Konstan, J. A., Miller, B. N., Maltz, D., Herlocker, J. L., Gordon, L. R. & Riedl, J. (1997), GroupLens: Applying Collaborative Filtering to Usenet News, Communications of the ACM 40(3), 77–87.

Krueger, C. W. (1992), Software Reuse, ACM Computing Surveys 24(2), 131–183.

Landauer, T. K. & Dumais, S. T. (1997), A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review 104(2), 211–240.

Lange, B. M. & Moher, T. G. (1989), Some Strategies of Reuse in An Object-oriented Programming Environment, in Human Factors in Computing Systems, CHI’89 Conference Proceedings, ACM Press, Austin, TX, pp. 69–73.

Lieberman, H. (1997), Autonomous Interface Agents, in Human Factors in Computing Systems, CHI’97 Conference Proceedings, ACM Press, Atlanta, GA, pp. 67–74.

Lim, W. C. (1994), Effects of Reuse on Quality, Productivity and Economics, IEEE Software 11(5), 23–29.

Maarek, Y. S., Berry, D. M. & Kaiser, G. E. (1991), An Information Retrieval Approach for Automatically Constructing Software Libraries, IEEE Transactions on Software Engineering 17(8), 800–813.

Meyer, B. (1997), Object-Oriented Software Construction, 2nd edition, Prentice Hall.

Michail, A. & Notkin, D. (1999), Assessing Software Libraries by Browsing Similar Classes, Functions and Relationships, in Proceedings of 21st International Conference on Software Engineering (ICSE’99), ACM Press, Los Angeles, CA, pp. 463–472.

Mili, A., Mili, R. & Mittermeir, R. (1997a), Storing and Retrieving Software Components: A Refinement-Based System, IEEE Transactions on Software Engineering 23(7), 445–460.

Mili, A., Yacoub, S., Addy, E. & Hafedh, M. (1999), Toward an Engineering Discipline of Software Reuse, IEEE Software 16(5), 22–31.

Mili, H., Ah-Ki, E., Grodin, R. & Mcheick, H. (1997b), Another Nail to the Coffin of Faceted Controlled-Vocabulary Component Classification and Retrieval, in Proceedings of Symposium on Software Reuse (SSR’97), ACM Press, Boston, MA, pp. 89–98.

Mili, H., Mili, F. & Mili, A. (1995), Reusing Software: Issues and Research Directions, IEEE Transactions on Software Engineering 21(6), 528–562.

Morisio, M., Seaman, C. B., Parra, A. T., Basili, V. R., Kraft, S. E. & Condon, S. E. (2000), Investigating and Improving a COTS-Based Software Development Process, in Proceedings of 22nd International Conference on Software Engineering (ICSE’00), ACM Press, Limerick, Ireland, pp. 31–40.

Murray, D. M. (1987), Embedded User Models, in H.-J. Bullinger & B. Shackel (eds.), Proceedings of Human-Computer Interaction (INTERACT’87), Elsevier, Amsterdam, pp. 228–235.

Nakakoji, K. (1993), Increasing Shared Understanding of a Design Task between Designers and Design Environments: The Role of a Specification Component, Ph.D. Dissertation, University of Colorado at Boulder.

Nakakoji, K., Yamamoto, Y., Suzuki, T., Takada, S. & Gross, M. D. (1998), From Critiquing to Representational Talkback: Computer Support for Revealing Features in Design, Knowledge-Based Systems 11(7-8), 457–468.

Nardi, B. A., Miller, J. R. & Wright, D. J. (1998), Collaborative, Programmable Intelligent Agents, Communications of the ACM 41(3), 96–104.

Neal, L. (1996), Support for Software Design, Development and Reuse through an Example-Based Environment, in G. Szwillus & L. Neal (eds.), Structure-Based Editors and Environments, Academic Press, San Diego, CA, pp. 185–192.

Norman, D. (1986), Cognitive Engineering, in D. Norman & S. Draper (eds.), User Centered System Design, New Perspectives on Human-Computer Interaction, Erlbaum, Hillsdale, NJ, pp. 31–61.

Norman, D. (1993), Things That Make Us Smart, Addison-Wesley, Reading, MA.

Ostertag, E., Hendler, J., Prieto-Diaz, R. & Braun, C. (1992), Computing Similarity in a Reuse Library System: An AI-Based Approach, ACM Transactions on Software Engineering and Methodology 1(3), 205–228.

Owen, D. (1986), Answers First, Then Questions, in D. Norman & S. Draper (eds.), User Centered System Design, New Perspectives on Human-Computer Interaction, Erlbaum, Hillsdale, NJ, pp. 361–375.

Pennington, N. & Grabowski, B. (1990), The Tasks of Programming, in J.-M. Hoc, T. R. G. Green, R. Samurcay & D. J. Gilmore (eds.), Psychology of Programming, Academic Press, New York, pp. 45–61.

Perry, D. & Wolf, A. (1992), Foundations for the Study of Software Architecture, ACM Software Engineering Notes 17(4), 40–52.

Podgurski, A. & Pierce, L. (1993), Retrieving Reusable Software by Sampling Behavior, ACM Transactions on Software Engineering and Methodology 2(3), 286–303.

Prieto-Diaz, R. (1991), Implementing Faceted Classification for Software Reuse, Communications of the ACM 34(5), 88–97.

Rada, R. (1995), Software Reuse: Principles, Methodologies and Practices, Ablex, Norwood, NJ.

Raymond, E. S. & Bob, Y. (2001), The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, rev. edition, O’Reilly & Associates.

Redmiles, D. F. (1992), From Programming Tasks to Solutions: Bridging the Gap through the Explanation of Examples, Ph.D. Dissertation, University of Colorado at Boulder.

Reeves, B. N. (1991), Locating the Right Object in a Large Hardware Store – An Empirical Study of Cooperative Problem Solving among Humans, Technical Report CU-CS-523-91, Department of Computer Science, University of Colorado.

Reeves, B. N. (1993), Supporting Collaborative Design by Embedding Communication and History in Design Artifacts, Ph.D. Dissertation, University of Colorado at Boulder.

Reisberg, D. (1997), Cognition, W. W. Norton & Company, New York.

Repenning, A. (1993), Agentsheets: A Tool for Building Domain-Oriented Visual Programming Environments, in Human Factors in Computing Systems, CHI’93 Conference Proceedings, ACM Press, Amsterdam, pp. 142–143.

Rhodes, B. J. & Starner, T. (1996), Remembrance Agent: A Continuously Running Automated Information Retrieval System, in Proceedings of 1st International Conference on the Practical Application of Intelligent Agents and Multi Agent Technology, London, pp. 487–495.

Rich, C. H. & Waters, R. C. (1988), : Myths and Prospects, 21(8), 40–51.

Rich, C. H. & Waters, R. C. (1990), The Programmer’s Apprentice, Addison-Wesley, Reading, MA.

Rist, R. S. (1995), Program Structure and Design, Cognitive Science pp. 507–562.

Rittel, H. (1984), Second-Generation Design Methods, in N. Cross (ed.), Developments in Design Methodology, Wiley, New York, pp. 317–327.

Rittri, M. (1989), Using Types as Search Keys in Function Libraries, Journal of Functional Programming 1(1), 71–89.

Robbins, J. E. & Redmiles, D. F. (1998), Software Architecture Critics in the Argo Design Environment, Knowledge-Based Systems 11, 47–60.

Roberts, R. M. (1989), Serendipity: Accidental Discoveries in Science, Wiley, New York.

Robertson, G. G., Card, S. K. & Mackinlay, J. D. (1993), Information Visualization Using 3D Interactive Animation, Communications of the ACM 36(4), 57–71.

Robertson, S. E. (1977), The Probability Ranking Principle in IR, Journal of Documentation 33(4), 294–304.

Robertson, S. E. & Walker, S. (1994), Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, in W. B. Croft & C. J. Van Rijsbergen (eds.), Proceedings of the 17th International ACM-SIGIR Conference, Springer-Verlag, Dublin, Ireland, pp. 232–241.

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M. & Gatford, M. (1995), Okapi at TREC-3, in D. K. Harman (ed.), The 3rd Text REtrieval Conference (TREC-3), National Institute of Standards and Technology, Gaithersburg, MD, pp. 109–126.

Rosenbaum, S. & DuCastel, B. (1995), Managing Software Reuse–An Experience Report, in Proceedings of 17th International Conference on Software Engineering (ICSE’95), ACM Press, Seattle, WA, pp. 105–111.

Rucker, J. & Polanco, M. J. (1997), Siteseer: Personalized Navigation for the Web, Communications of the ACM 40(3), 73–75.

Salton, G. & McGill, M. J. (1983), Introduction to Modern Information Retrieval, McGraw-Hill, New York.

Schön, D. A. (1983), The Reflective Practitioner: How Professionals Think in Action, Basic Books, New York.

Sen, A. (1997), The Role of Opportunism in the Software Design Reuse Process, IEEE Transactions on Software Engineering 23(7), 418–436.

Shaw, M. & Garlan, D. (1996), Software Architecture: Perspectives on an Emerging Discipline, Prentice Hall, Upper Saddle River, NJ.

Shneiderman, B. (1998), Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd edition, Addison-Wesley, Reading, MA.

Simon, H. A. (1996), The Sciences of the Artificial, third edition, The MIT Press, Cambridge, MA.

Soloway, E. & Ehrlich, K. (1984), Empirical Studies of Programming Knowledge, IEEE Transactions on Software Engineering SE-10(5), 595–609.

Stringer-Calvert, D. W. J. (1994), Signature Matching for Ada Software Reuse, Master’s thesis, University of York, UK.

Stroustrup, B. (1995), The C++ Programming Language, 2nd edition, Addison-Wesley, Reading, MA.

Sumner, T. (1995), Designers and Their Tools: Computer Support for Domain Construction, Ph.D. Dissertation, University of Colorado at Boulder.

Szwillus, G. & Neal, L. (eds.) (1996), Structure-Based Editors and Environments, Academic Press, New York.

Taylor, R. N., Medvidovic, N., Anderson, K. M., Whitehead, E. J., Robbins, J. E., Nies, K. A., Oreizy, P. & Dubrow, D. L. (1996), A Component- and Message-Based Architectural Style for GUI Software, IEEE Transactions on Software Engineering 22(6), 390–406.

Terveen, L., Hill, W., Amento, B., McDonald, D. & Creter, J. (1997), PHOAKS: A System for Sharing Recommendations, Communications of the ACM 40(3), 59–62.

Thomas, C. G. (1996), To Assist the User: On the Embedding of Adaptive and Agent-Based Mechanisms, R. Oldenbourg Verlag.

Thomas, W. M., Delis, A. & Basili, V. R. (1997), An Analysis of Errors in a Reuse-Oriented Development Environment, Journal of Systems and Software 38, 211–224.

Tracz, W. (1990), The 3 Cons of Software Reuse, in Proceedings of the 3rd Annual Workshop on Institutionalizing Software Reuse (WISR ’90), Syracuse, NY.

Van Rijsbergen, C. J. (1979), Information Retrieval, 2nd edition, Butterworths, London.

Virvou, M. & Du Boulay, B. (1999), Human Plausible Reasoning for Intelligent Help, User Modeling and User-Adapted Interaction 9, 321–375.

Visser, W. (1990), More or Less Following a Plan During Design: Opportunistic Deviations in Specification, International Journal of Man-Machine Studies 33(3), 247–278.

Wahlster, W. & Kobsa, A. (1989), User Models in Dialog Systems, in W. Wahlster & A. Kobsa (eds.), User Models in Dialog Systems, Springer-Verlag, New York, pp. 4–34.

Walker, S., Robertson, S. E., Boughanem, M., Jones, G. J. F. & Sparck Jones, K. (1998), Okapi at TREC-6: Automatic ad hoc, VLC, Routing, Filtering and QSDR, in D. K. Harman (ed.), The 6th Text REtrieval Conference (TREC-6), National Institute of Standards and Technology, Gaithersburg, MD, pp. 125–136.

Williams, M. D., Tou, F. N., Fikes, R., Henderson, A. & Malone, T. W. (1982), RABBIT: Cognitive Science in Interface Design, in Proceedings of the 4th Annual Conference of the Cognitive Science Society, Cognitive Science Society, Ann Arbor, MI, pp. 82–85.

Wing, J. M. (1990), A Specifier’s Introduction to Formal Methods, IEEE Computer 23(9), 8–24.

Winograd, T. (1995), From Programming Environments to Environments for Designing, Communications of the ACM 38(6), 65–74.

Winograd, T. & Flores, F. (1986), Understanding Computers and Cognition: A New Foundation for Design, Ablex, Norwood, NJ.

Woods, S. & Yang, Q. (1996), The Program Understanding Problem: Analysis and a Heuristic Approach, in Proceedings of 18th International Conference on Software Engineering (ICSE’96), ACM Press, Berlin, Germany, pp. 6–15.

Ye, Y. (1996), TCARE–Total Computer Aided Reverse Engineering Tool, in Proceedings of International Symposium on Software Engineering for the Next Generation, Nagoya, Japan, pp. 89–95.

Ye, Y. (1998), Supporting Incremental Learning with Active Accumulative and Adaptable Documentation, in Proceedings of International Symposium on Future Software Technology 1998, Software Engineers Association, Hangzhou, China, pp. 185–190.

Ye, Y. (2001a), An Active and Adaptive Reuse Repository System, in Proceedings of 34th Hawaii International Conference on System Sciences (HICSS-34), IEEE Press, Maui, HI, pp. CD–ROM.

Ye, Y. (2001b), Information Enriched Workspaces, in Proceedings of INTERACT’01, Tokyo, Japan, (to appear).

Ye, Y. & Fischer, G. (2000), Promoting Reuse with Active Reuse Repository Systems, in Proceedings of 6th International Conference on Software Reuse (ICSR-6), Springer-Verlag, Vienna, Austria, pp. 302–317.

Ye, Y. & Reeves, B. (2000), An Active and Intelligent Agent for Component Location, in Proceedings of Software Symposium 2000, Software Engineers Association, Kanazawa, Japan, pp. 67–74.

Ye, Y., Fischer, G. & Reeves, B. (2000), Integrating Active Information Delivery and Reuse Repository Systems, in Proceedings of ACM SIGSOFT 8th International Symposium on the Foundations of Software Engineering, ACM Press, San Diego, CA, pp. 60–68.

Zand, M., Arango, G., Davis, M., Johnson, R., Poulin, J. S. & Watson, A. (1997), Reuse R&D: Is It on the Right Track, in Proceedings of ACM Symposium on Software Reuse (SSR’97), ACM Press, Boston, MA, pp. 212–216.

Zaremski, A. M. & Wing, J. M. (1995), Signature Matching: A Tool for Using Software Libraries, ACM Transactions on Software Engineering and Methodology 4(2), 146–170.

Zaremski, A. M. & Wing, J. M. (1997), Specification Matching of Software Components, ACM Transactions on Software Engineering and Methodology 6(4), 333–369.

Zave, P. & Schell, W. (1984), Salient Features of an Executable Specification Language and Its Environment, IEEE Transactions on Software Engineering SE-12(2), 312–325.

Appendix A

The List of Queries and Relevant Components

The following table includes the queries and relevant components used in the evaluation of retrieval mechanisms (Section 8.1). In the table, the Query Contents column shows the queries submitted to the system. Queries 1 through 10 were created by me; queries 11 through 14 were extracted from newsgroups; and queries 15 through 19 were extracted from experiments. The Component Name column shows the names (including both class names and method names, and input parameter types if the component is polymorphic) of pre-determined relevant components. The Rank column shows the rank of the component returned by each retrieval mechanism.

[Table: the 19 queries and their pre-determined relevant components, with columns Query Contents, Component Name, and Rank. The table body is not recoverable from the source text.]

Appendix B

Questions Asked in the Post-Experiment Interview

Q1: How many years of programming experience do you have?

Q2: How many projects have you participated in? How large were they?

Q3: What is your current major programming language?

Q4: When did you start to learn Java? How often do you program with it?

Q5: What do you think your programming level in Java is, on a scale from 1 (beginner) to 10 (guru)?

Q6: Do you write comments in general? And in Java in particular?

Q7: Did you find the automatically delivered components useful to your programming tasks? Please give an explicit example.

Q8: Did you learn something from the deliveries? For example, was there a delivered component that you did not use at the moment of delivery but used later, or that you think you will use from now on? Please give examples.

Q9: Did you find that some known components were delivered?

Q10: Did the Reusable-Component-Information-display distract your attention from your work? If yes, did you want to turn it off?

Q11: Did you find the system useful overall?

Q12: What part of the system did you like most?

Q13: What part of the system did you not like most?

Q14: On a scale from 1 (totally useless) to 10 (extremely useful), how do you rate the system?

Q15: Do you want to use the system as your daily programming environment?

Q16: Do you have any suggestions or comments on the system and the experiment?

Appendix C

Abbreviations

ADA Apple Data Detector

API Application Programmer Interface

COTS Commercial Off-The-Shelf

GUI Graphical User Interface

HLL High-Level Language

HTML Hyper Text Markup Language

JDK Java Development Kit

JGL Java Generic Library

LSA Latent Semantic Analysis

LCM Location, Comprehension, and Modification

NIH Not Invented Here

SER Seeding, Evolutionary growth, and Reseeding

RCI Reusable Component Information

URL Uniform Resource Locator

VHLL Very High-Level Language

Appendix D

Glossary

Abstraction Mismatch, 36, 144 The difference of abstraction levels in reuse requirements and component descriptions. Programmers deal with concrete problems and thus tend to describe their requirements concretely, whereas reusable components are often described in abstract concepts because they are designed to be generic so they can be reused in many different situations. Related Terms: Vocabulary Mismatch

Action-Present, 56, 135 Action-present is the period of time in which users have decided what to do but have not yet executed the needed operations to change the situation. (Schön, 1983)

Active Component Repository System, 6, 18, 51, 52-55, 117 Active component repository systems support an information delivery mechanism that presents context-sensitive components to programmers without being given explicit reuse queries. (Ye & Fischer, 2000) Related Terms: Active Information System, Information Delivery

Active Information System, 6, 55 Active information systems are information systems that actively deliver information to users. The challenge in implementing active information systems is to contextualize the information to the task and to the background knowledge of users. (Fischer et al., 1998) Related Terms: Information Delivery

Adaptive, 74, 109, 146 A characteristic of a system. A system is adaptive if it changes its behavior by itself. Adaptive user modeling (or discourse modeling) means that the system automatically updates user models (or discourse models) based on information observed or inferred from monitoring users’ interactions with the system. (Fischer, 1993; Fischer & Ye, 2001) Related Terms: Adaptable

Adaptable, 74, 109 A characteristic of a system. A system is adaptable if its behavior can be adjusted by users to their own needs. Adaptable user modeling (or discourse modeling) means that users can directly modify their own user models (or discourse models). (Fischer, 1993; Fischer & Ye, 2001) Related Terms: Adaptive

Black-Box Reuse, 25 In black-box reuse, a component is directly reused without modification. A component can be reused as it is or reused through inheritance if the programmer creates a specialized subclass of an existing class component. Related Terms: Component Reuse, Glass-Box Reuse, White-Box Reuse

Building Block, 14, 15 Building blocks are the primitive elements provided by a programming language. They include basic statements of a programming language and reusable software components in repositories or libraries. Related Terms: Components

Component, 15 Components are software modules that have been packaged for reuse. Both classes and methods are components. Related Terms: Module, Component Reuse

Component-Based Development, 25 An approach of creating new software systems by reusing existing software components. Component-based development improves the quality and productivity of software development. Related Terms: Component Reuse

Component Reuse, 25 An approach of creating new software systems by reusing existing software components. Component reuse has three forms: black-box reuse, white-box reuse and glass-box reuse. Related Terms: Component-Based Development

Cognitive Engineering, 32, 33 Cognitive engineering is the process of applying what is known from cognitive science to the design and construction of tools that assist the cognitive activities of human beings. (Norman, 1986)

Component Repository System, 1, 40, 49, 51 An information system that supports the locating of reusable software components. It has three connotations: a collection of reusable components, a retrieval mechanism, and a retrieval interface.

Concept Similarity, 69, 83, 104 The similarity between the concept of the current programming task, revealed through comments and identifiers in the program under development, and the concept revealed in the documentation of reusable components. The concept of a program is its functional purpose, or goal. A reusable component from the repository whose concept is similar to the concept of the program under development has a high probability of being reused in the current situation. Concept similarity can be determined by using information retrieval techniques such as probabilistic models and latent semantic analysis.

Constraint Compatibility, 69, 83, 104 The compatibility existing between the constraints required for the program under development and those satisfied by components from the repository. The constraint of a program regulates the environment in which it runs. For a component to be easily reused in a programming task, it should have compatible constraints. Signature matching is a process of determining the syntactical compatibility between a component and a program under development. Related Terms: Signature Matching
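The syntactic check described under Constraint Compatibility can be illustrated with a minimal sketch. It assumes that a signature is represented as a pair of parameter types and a return type, and it treats two signatures as compatible when the return types agree and the parameter types agree up to reordering. This is a deliberately simplified relaxation, not the matching procedure of the system described in this dissertation; the function name and the example signatures are hypothetical.

```python
from collections import Counter

def signatures_match(needed, candidate):
    """True if return types agree and parameter types agree up to reordering.

    Each signature is a (parameter_types, return_type) pair (assumed representation).
    """
    needed_params, needed_ret = needed
    cand_params, cand_ret = candidate
    # Counter compares the parameter types as multisets, ignoring their order.
    return needed_ret == cand_ret and Counter(needed_params) == Counter(cand_params)

# Hypothetical signatures: the method under development and two repository components.
needed = (("int", "String"), "String")
reordered = (("String", "int"), "String")     # same types, different order
different = (("String", "String"), "String")  # different parameter types

compatible = signatures_match(needed, reordered)
incompatible = signatures_match(needed, different)
```

Stricter or looser relaxations (for example, allowing subtypes) would change only the comparison inside `signatures_match`.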

Context-Sensitive Information, 6, 56, 99 Information that is relevant to the working context of users. Working context consists of the task acted upon and the user acting; therefore, context-sensitive information is related to both the task and the background knowledge of users. (Fischer & Ye, 2001)

Development with Reuse, 5, 46, 47, 48 The development-with-reuse paradigm views reuse as a stand-alone process, independent of the current programming process and environment. Programmers have to change their current programming practice to embrace reuse. Component repository systems designed to support development-with-reuse assume that programmers have no difficulty in forming reuse intentions and formulating reuse queries. (Rada, 1995; Ye, 2001a) Related Terms: Reuse within Development

Discourse Model, 8, 71, 73, 78, 106-108, 114, 137 A discourse model represents the interaction history between the user and the system. It captures the larger context of the current task and can improve the task-relevance of delivered information.

Einstellung, 45, 52 Human beings often display Einstellung in problem solving. Einstellung, the German word for “attitude,” refers to the mechanization of problem solving strategy. Once problem solvers discover a strategy that “gets the job done,” they are less likely to discover new strategies until they are completely stuck. Einstellung is one of the cognitive biases that prevent programmers from attempting to reuse because for most programmers, programming from scratch is the proven approach. (Reisberg, 1997; Ye et al., 2000)

Feedforward, 56, 58 Information delivered during the period of action-present. Feedforward information affects the execution of user actions. (Simon, 1996)

Glass-Box Reuse, 25, 39 In glass-box reuse, programmers do not directly reuse the component; instead, they use it as an example for their own development. For instance, programmers can look at examples to find out how a program plan is realized and build their own system through analogy. Glass-box reuse contributes indirectly to the quality and productivity of programming because examples can reduce the cognitive load of programmers.

High-Functionality Computer Systems, 42 Systems that contain thousands of items and whose description requires thousands of pages. For these systems, complete understanding is impossible. Component repository systems are examples of high-functionality computer systems. Also known as HFA (High-Functionality Applications). (Fischer, 2001)

Information Access, 5, 6, 51, 55 Information access requires users to start the information locating process through browsing or querying. Users have to anticipate the existence of information, and know how to search the information space by specifying their information needs in the form of well-defined queries or engaging in a series of browsing actions. Related Terms: Information Delivery

Information Delivery, 6, 7, 51, 55, 131-133 The information delivery mechanism presents information to users on its own initiative without being prompted by explicit queries. Information delivery complements information access and is needed in situations where users are unable to articulate the need for information or are unaware that they may profit from information. (Fischer, 1994) Related Terms: Active Information System

Information Discernment, 39, 105, 139 One of the two stages needed for choosing the right component. At the stage of information discernment, programmers avoid spending too much time by quickly scanning the component and its description to decide whether this component is related to their current task, and thereby also avoid any deep understanding at this point. (Ye, 2001b)

Information-Enriched Workspace, 8, 49, 50, 52, 99 An information-enriched workspace is a special working environment that is augmented with an information display that constantly shows the information immediately needed by users. In an information-enriched workspace, the cost structure of accessing needed information is tuned to the requirements of the work process using it because it provides immediate access to the most needed information for users without interrupting their workflow. (Ye, 2001b)

Information Island, 0, 52 In an information system, those items whose existence is not anticipated by users become information islands. Information access mechanisms offer little support for users to reach information islands. In contrast, information delivery mechanisms can build a bridge to information islands. (Engelbart, 1990)

Intrusiveness, 58 The intrusiveness of a system is the degree of users’ perception of being interrupted from their current focus. Active information systems need to achieve the right balance between the cost of intrusive interruptions and the loss of context-sensitivity of deferred alerts by carefully considering when and how to deliver the information so that it can be utilized best by users. (Horvitz et al., 1999)

Latent Semantic Analysis, 2, 83, 89 An extension of the vector space model. By constructing a large semantic space of terms to capture the overall pattern of their associative relationship, LSA (Latent Semantic Analysis) is expected to facilitate concept-based retrieval and bridge the conceptual gap in formulating reuse queries. (Landauer & Dumais, 1997) Related Terms: Vector Space Model

Learning on Demand, 35, 52, 142 Situated learning in a working context which occurs at the user’s discretion—often triggered by a breakdown. However, if users are not aware of the existence of the knowledge they need to learn for the working context, they may miss the learning opportunity and settle on a suboptimal solution. Active information systems can create learning opportunities. Learning-on-demand is the only feasible way for programmers to learn about reusable components when the repository becomes very large. (Fischer, 1991)

Loss Aversion, 46 In the decision-making process, human beings have the tendency to be far more sensitive to potential loss than to potential gain. Loss aversion is one of the cognitive biases that prevent programmers from attempting to reuse. Starting a reuse process requires a mental switch. The demand on working memory and time is immediate, and the potential gain is unclear because programmers are not sure whether the needed component exists, whether they are able to find it even if it exists, and whether they are able to understand and modify it even if they find it. (Reisberg, 1997; Ye et al., 2000)

Module, 15 A module refers to a named and addressable abstraction in software, either a procedural abstraction such as a function, or a data abstraction such as a class. Procedures, functions, methods and classes are all considered as software modules. In this dissertation, the term module refers to software abstractions to be developed by programmers. Related Terms: Components

Opportunistic Programming, 17, 48, 141 Most programmers follow neither the top-down nor the bottom-up design strategy. In fact, their programming activities are very opportunistic: They are a mixture of top-down and bottom-up strategies, and which strategy is chosen depends on the knowledge of individual programmers and the particular situation. Interim decisions made during the programming process often can lead to subsequent decisions at arbitrary points in the programming space. (Curtis et al., 1988; Visser, 1990)

Passive Component Repository System, 51, 53 A component repository system that supports information access only. Most current component repository systems are passive, and they fall short in supporting programmers who make no attempt to reuse. Related Terms: Information Access, Active Component Repository System

Plan Recognition, 61, 64-66 The plan recognition approach uses plans to describe a user task. A plan is a sequence of user actions that achieve a certain goal. In general, a plan can be represented as a rule consisting of two parts: the condition and the result. The condition part includes a sequence of actions required to accomplish a task, and the result part is the intended goal of the task. When the actions of a user match, completely or partially, the condition part, the system can infer that the user is performing that corresponding task, and information about that task is delivered. (Fischer, 1987) Related Terms: Similarity Analysis, Task Model
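The rule structure described under Plan Recognition can be sketched as follows. This is only an illustration under stated assumptions: a plan is a pair of an ordered action sequence (the condition) and a goal (the result), a partial match is measured as the fraction of the condition observed in order, and a threshold of 0.5 decides when a task is inferred. The class, the function names, the threshold, and the example actions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    condition: tuple  # ordered user actions required to accomplish the task
    result: str       # intended goal of the task

def match_fraction(plan, observed):
    """Fraction of the plan's condition matched, in order, by the observed actions."""
    if not plan.condition:
        return 0.0
    matched = 0
    for action in observed:
        # Advance only when the next expected action of the plan is observed.
        if matched < len(plan.condition) and action == plan.condition[matched]:
            matched += 1
    return matched / len(plan.condition)

def infer_goals(plans, observed, threshold=0.5):
    """Goals of all plans whose condition is matched at or above the threshold."""
    return [p.result for p in plans if match_fraction(p, observed) >= threshold]

# Hypothetical plans and observed user actions.
plans = [
    Plan(("open_file", "read_line", "close_file"), "read a text file"),
    Plan(("create_random", "next_int"), "generate a random number"),
]
observed = ["open_file", "declare_variable", "read_line"]
goals = infer_goals(plans, observed)
```

Here the first plan is matched partially (two of its three actions, in order), so its goal is inferred; the second plan is not matched at all.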

Precision, 120-123 A metric measuring the performance of an information retrieval system. It is the ratio of the number of retrieved relevant items to the number of total retrieved items. Precision indicates the ability of the system to present only the relevant documents. (Salton & McGill, 1983) Related Terms: Recall

Probabilistic Model, 83, 86, 96 The probabilistic model ranks documents in decreasing order of their evaluated probability of relevance to a user query. It makes use of formal theories of probability and statistics in order to evaluate, or estimate, the probabilities of relevance. (Robertson & Walker, 1994; Crestani et al., 1998)

Program Plan, 14, 15-18, 23, 52 As a series of interconnecting actions to achieve a goal, a program plan provides a skeleton structure for programs by abstracting key elements. Program plans are the basic chunks used in program design and understanding. (Soloway & Ehrlich, 1984; Rist, 1995)

Recall, 120-123 A metric measuring the performance of information retrieval systems. It is the ratio of the number of retrieved relevant items to the number of total relevant items in the collection. Recall indicates the ability of the system to present all relevant documents. (Salton & McGill, 1983) Related Terms: Precision
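The two ratios defined under Precision and Recall can be computed as in the following minimal sketch, which assumes that the retrieved items and the relevant items are available as collections of component names. The function name and the example component names are hypothetical and are not taken from the evaluation data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of item identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved items that are also relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: four components retrieved, two of the three relevant ones among them.
retrieved = ["File.isDirectory", "File.mkdir", "File.renameTo", "String.concat"]
relevant = ["File.isDirectory", "File.mkdir", "File.mkdirs"]
p, r = precision_recall(retrieved, relevant)
```

In this example, precision is 2/4 and recall is 2/3: half of what was retrieved is relevant, and two thirds of what is relevant was retrieved.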

Retrieval by Reformulation, 2, 38, 77, 113-117, 137 A mechanism that allows users to incrementally improve their queries to match their intentions after they have interpreted and evaluated the retrieved results and have explored the underlying structure of the information systems. (Williams et al., 1982; Fischer & Nieper-Lemke, 1989)

Reuse-by-Anticipation, 41, 42-45, 53, 54 In the reuse-by-anticipation mode, programmers formulate reuse intentions based on their anticipation of the existence of certain reusable components. (Ye & Fischer, 2000) Related Terms: Reuse-by-Memory, Reuse-by-Recall

Reuse-by-Memory, 41, 42-45, 53, 54 In the reuse-by-memory mode, while designing a new program, programmers notice similarities between the new program and reusable components that they have learned in the past and know very well. Therefore, they can reuse them easily during the programming, even without the support of a component repository system because their memory assumes the role of the repository system. (Ye & Fischer, 2000) Related Terms: Reuse-by-Recall, Reuse-by-Anticipation

Reuse-by-Recall, 41, 42-45, 53, 54 In the reuse-by-recall mode, while developing a new program, programmers vaguely recall that the repository contains some reusable components with similar functionality, but they do not remember exactly which components they are. They need to search the repository to find what they need. (Ye & Fischer, 2000) Related Terms: Reuse-by-Memory, Reuse-by-Anticipation

Reuse Repository System, 1, 40 A synonym of Component Repository System.

Reuse Process, 2, 26, 33, 34, 47-49 For programmers to reuse a software component from a repository, they have to go through the reuse process, which has three steps: location, comprehension, and modification. (Fischer et al., 1991)

Reuse within Development, 5, 48, 49, 50 The reuse-within-development paradigm views reuse as an integral part of software development and component repository systems as information systems that augment programmers’ insufficient knowledge about reusable components and assist them in accomplishing their tasks. It requires that the reuse process be smoothly melded into the current programming process and environment so that there is no context change from programming to reuse. (Ye, 2001a) Related Terms: Development with Reuse

Reuse Utility, 43 The ratio of reuse value to reuse cost. If reuse utility is perceived as too low, programmers do not make an attempt to reuse. Reducing the reuse cost with the support of appropriate repository systems and increasing the recognition of reuse value through education are two approaches to increasing the reuse utility. (Ye et al., 2000)

Signature, 6, 68, 76, 96, 101-103, 134, 152 The type expression of a program. A signature captures the syntactic constraints of a program by defining the types of input and output data. Related Terms: Signature Matching, Constraint Compatibility

Signature Matching, 92 The process of determining the compatibility of two components in terms of their signatures. It is an indexing and retrieval mechanism based on type constraints of a module or a component. (Zaremski & Wing, 1995)
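
A minimal sketch of signature matching follows, assuming a signature is reduced to a tuple of input type names and one output type name. The `Signature` class, the two match functions, and the example types are hypothetical; the relaxed match shown (ignoring argument order) is only one possible relaxation and is not claimed to be the definition used in this thesis.

```python
from collections import Counter
from typing import NamedTuple


class Signature(NamedTuple):
    inputs: tuple  # type names of the input data
    output: str    # type name of the output data


def exact_match(query, component):
    """Signatures are compatible only if all types agree, in the same order."""
    return query == component


def reorder_match(query, component):
    """Relaxed match (assumed): same output type, same input types in any order."""
    return (query.output == component.output
            and Counter(query.inputs) == Counter(component.inputs))


# Hypothetical query and component signatures.
query = Signature(("String", "int"), "String")
component = Signature(("int", "String"), "String")
print(exact_match(query, component), reorder_match(query, component))
```

The example pair fails the exact match because the argument order differs, but passes the relaxed match, which is the kind of case a retrieval mechanism may still want to surface.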

Similarity Analysis, 62-66 An approach to capturing the tasks of users for the purpose of delivering task-relevant information. It examines the contextual information surrounding the current focus of users, and uses that contextual information to predict their information needs. Information from the repository that has high similarity to the contextual circumstance is then delivered. Task-relevant information can be determined based on similar situations or similar information. Related Terms: Plan Recognition, Task Model

Situation Model, 12, 36, 53 A user’s understanding of a problem. A situation model is characterized with respect to the goal and background knowledge of a user. The conceptual gap between the situation model and the system model includes the vocabulary mismatch and the abstraction mismatch. (Kintsch, 1998) Related Terms: System Model

Software Agent, 8, 9, 99 A software entity that functions autonomously in response to the changes in its running environment without requiring human guidance or intervention. (Bradshaw, 1997; Ye & Reeves, 2000)

System Model, 36, 53 An actual model of the computer system. The system model of a component repository system consists of the terms used in the descriptions of its components and the repository structure used to organize and store the components.

Task Model, 7, 61, 73, 77, 80, 100-102 An abstract representation of a user task. Appropriate task models are essential in delivering task-relevant information. The acquisition of such task models is called task modeling. Tasks can be modeled through either plan recognition or similarity analysis. (Fischer & Ye, 2001) Related Terms: Plan Recognition, Similarity Analysis

User Model, 6, 8, 73, 78, 80, 109-114, 124-125, 135-136, 145-146 A representation of a user’s knowledge about an information space. User models can be used as a filter by the information system to ensure only unknown information is presented. The process of acquiring user models is called user modeling. User modeling can be adaptable or adaptive. (Fischer, 2001)

White-Box Reuse, 25, 39 In white-box reuse, programmers reuse components after they have modified them to fit their needs. White-box reuse does not contribute as much to the easier maintenance and evolution of software systems as black-box reuse does, but it can reduce development time.

Vector Space Model, 85 An approach to free-text indexing and retrieval. Documents and queries are represented as vectors of terms contained in the whole collection of documents, commonly known as a corpus. The similarity between a query and a document is measured by the distance between their vectors in the vector space. (Jurafsky & Martin, 2000)
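
The vector representation in this entry can be sketched with raw term counts and cosine similarity. Both choices are assumptions made for illustration (term weighting and the exact distance measure vary between systems), and the component names and descriptions in the example are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a sparse vector of raw term counts (assumed weighting)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Return document names ordered from most to least similar to the query."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), name)
              for name, text in documents.items()]
    return [name for score, name in sorted(scored, reverse=True)]


# Hypothetical component descriptions acting as the corpus.
documents = {
    "randomInt": "generate a random integer number",
    "sortList": "sort a list of elements",
}
print(rank("random number", documents))
```

The query shares two terms with the first description and none with the second, so the first component is ranked higher.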

Vocabulary Mismatch, 36 The vocabulary mismatch describes the phenomenon that people use a variety of words to refer to the same concept. Studies have found that the probability that two persons choose the same word to describe a concept is less than 20%. Even well-trained indexing experts have 20% disparity on average in choosing terms to describe the same document. (Furnas et al., 1987; Harman, 1995) Related Terms: Abstraction Mismatch