RESEARCH TOPICS IN SOFTWARE EVOLUTION AND MAINTENANCE

Jairo Hernán Aponte Melo Mario Linares Vásquez Laura Viviana Moreno Cubillos Christian Adolfo Rodríguez Bustos editors

VICERRECTORÍA DE INVESTIGACIÓN
DIRECCIÓN DE INVESTIGACIÓN SEDE BOGOTÁ

Bogotá, D. C., May 2012

© Universidad Nacional de Colombia, Vicerrectoría de Investigación, Dirección de Investigación Sede Bogotá
© Editorial Universidad Nacional de Colombia
© Jairo Hernán Aponte-Melo ([email protected]), Mario Linares-Vásquez ([email protected]), Laura Viviana Moreno-Cubillos ([email protected]), Christian Adolfo Rodríguez-Bustos ([email protected]), Editors

Dirección de Investigación Sede Bogotá
Luis Fernando Niño Vásquez, Director

Editorial UN Editorial Board: Alfonso Correa Motta, María Belén Sáez de Ibarra, Jaime Franky, Julián García González, Luis Eugenio Andrade Pérez, Salomón Kalmanovitz Krauter, Gustavo Silva Carrero

First Edition, 2012
ISBN: 978-958-761-162-5 (paperback)
ISBN: 978-958-761-163-2 (print on demand)
ISBN: 978-958-761-167-0 (e-book)

DIB collection design: Ángela Pilone Herrera

Publisher: Editorial Universidad Nacional de Colombia
[email protected]
www.editorial.unal.edu.co
Bogotá, D. C., Colombia, 2012

No part of this book may be reproduced by any means without permission in writing from the owner of the patrimonial rights.

Made and printed in Bogotá, D. C., Colombia

Universidad Nacional de Colombia Cataloging-in-Publication Data

Research topics in software evolution and maintenance / [eds.] Jairo Hernán Aponte Melo ... [et al.]. -- Bogotá: Universidad Nacional de Colombia, Vicerrectoría de Investigación, Dirección de Investigación Sede Bogotá, 2012. xxiv, 256 p. -- (Colección DIB)

Includes bibliographical references and indexes

ISBN: 978-958-761-162-5 (paperback). -- ISBN: 978-958-761-163-2 (print on demand). -- ISBN: 978-958-761-167-0 (e-book)

1. Software engineering 2. Software evolution 3. Software maintenance 4. Software visualization I. Aponte Melo, Jairo Hernán, 1965- II. Series CDD-21 005.1 / 2012

List of Contributors

Jairo Aponte, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Fernando Cortés, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Miguel Cubides, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Óscar Chaparro, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Víctor Escobar-Sarmiento, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Mario Linares-Vásquez, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
David Montaño, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]

Laura Moreno, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Yury Niño, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Christian Rodríguez-Bustos, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Juan Gabriel Romero-Silva, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Leslie Solorzano, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Henry Roberto Umaña-Acosta, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]
Angélica Veloza-Suan, Universidad Nacional de Colombia, sede Bogotá. E-mail: [email protected]

Contents

Preface xxi

Chapter 1 Summarizing Software Artifacts: Overview and Applications 1
Abstract 1
1.1 Introduction 1
1.2 Essentials on Natural Language Summarization 2
1.2.1 The Dimensions of Summarization 3
1.2.2 Summarization Evaluation 5
1.3 Summarizing Software Artifacts: Existing Approaches 6
1.3.1 Summarizing Documentation 7
1.3.2 Summarizing Source Code 8
1.3.3 Combining Software Artifacts 16
1.4 Making Easier Software Evolution: Using Software Summaries in Maintenance Activities 17
1.4.1 Software Comprehension 19
1.4.2 Reverse Engineering 20
1.5 Trends and Challenges 21
References 24

Chapter 2 Survey and Research Trends in Mining Software Repositories 29
Abstract 29
2.1 Introduction 29

2.2 Understanding Software Repositories 31
2.2.1 Historical Repositories 32
2.2.2 Communications Logs 33
2.2.3 Source Code 34
2.2.4 Other Kinds of Repositories 34
2.3 Processes of Mining Software Repositories 34
2.3.1 Techniques 34
2.3.2 Tools 35
2.4 Purpose of Mining Software Repositories 36
2.4.1 Program Understanding 37
2.4.2 Prediction of Quality of Software Systems 38
2.4.3 Discovering Patterns of Change and Refactorings 39
2.4.4 Measuring the Contribution of Individuals 40
2.4.5 Modeling Social and Development Processes 42
2.5 Trends and Challenges 44
2.5.1 Thinking in Distributed Systems 44
2.5.2 Integrating and Redesigning Repositories 46
2.5.3 Simplifying MSR Techniques 47
2.6 Summary 48
References 50

Chapter 3 Software Visualization to Simplify the Evolution of Software Systems 57
Abstract 57
3.1 Introduction 57
3.2 Background on Software Visualization 58
3.2.1 How Software Visualization Supports Software Evolution Tasks 58
3.2.2 The Software Visualization Pipeline 59
3.2.3 Overview of Visualization Tools 59
3.2.4 Sources of Information Commonly Used 60
3.2.5 Differences between Software Visualization and Modeling Languages Like UML 62

3.3 SV Techniques 62
3.3.1 Metaphors 62
3.3.2 2D Approaches 64
3.3.3 3D Approaches 69
3.3.4 Virtual Environments 72
3.4 Towards a Better Software Visualization Process 72
3.4.1 Other Programming Paradigms 73
3.4.2 Include Other Languages 73
3.4.3 Better and More Flexible Metaphors 74
3.4.4 Educational Issues 74
3.5 Summary 75
References 76

Chapter 4 Incremental Change: The Way that Software Evolves 81
Abstract 81
4.1 Introduction 81
4.2 Incremental Change in the Software Development Process 82
4.2.1 Software Maintenance vs. Software Evolution 83
4.2.2 Activities of Incremental Change 84
4.3 Concept and Feature Location 87
4.3.1 Software Comprehension 87
4.3.2 Concept Location 88
4.3.3 Static Techniques 89
4.4 Impact Analysis 96
4.5 Summary 98
References 100

Chapter 5 Software Evolution Supported by Information Retrieval 105
Abstract 105
5.1 Introduction 105
5.2 Information Retrieval 106
5.2.1 Classic Models 108

5.2.2 Alternative and Hybrid Models 109
5.2.3 Web Models 109
5.3 Software Evolution Activities 110
5.3.1 Incremental Change 110
5.3.2 Software Comprehension 111
5.3.3 Mining Software Repositories 112
5.3.4 Software Visualization 112
5.3.5 Reverse Engineering & Reengineering 112
5.3.6 Refactoring 113
5.4 Information Retrieval and Software Evolution 113
5.4.1 Concept/Feature Location 113
5.4.2 Mining Software Repositories (MSR) 114
5.4.3 Automatic Categorization of Source Code Repositories 117
5.4.4 Summarization of Software Artifacts 118
5.4.5 Traceability Recovery 119
5.5 Summary 120
References 121

Chapter 6 Reverse Engineering in Procedural Software Evolution 127
Abstract 127
6.1 Introduction 127
6.2 Reverse Engineering Concepts and Relationships 129
6.2.1 Reverse Engineering and Software Comprehension 129
6.2.2 Reverse Engineering and Software Maintenance 130
6.2.3 Reverse Engineering Concepts 130
6.3 Techniques in Reverse Engineering 133
6.3.1 Standard Techniques 133
6.3.2 Specialized Techniques 134
6.4 Application of Techniques 141
6.4.1 Description of the System 141
6.4.2 Considerations for Applying Reverse Engineering 142

6.4.3 Application of Standard Techniques 143
6.4.4 Application of Specialized Techniques 144
6.5 Reverse Engineering Assessment 147
6.5.1 Assessment of Techniques 147
6.5.2 Assessment of Tools 148
6.6 Trends and Challenges 151
References 154

Chapter 7 Agility Is Not Only About Iterations But Also About Software Evolution 161
Abstract 161
7.1 Introduction 161
7.2 Evolutionary Software Processes 164
7.2.1 EVO 165
7.2.2 Spiral 168
7.2.3 The Unified Process Family 168
7.2.4 Staged Model 171
7.3 Principles, Agility and the Agile Manifesto 173
7.3.1 The Agile Manifesto 173
7.3.2 Agile Principles 175
7.3.3 Agility in Software Development 177
7.4 Agile Methodologies History 178
7.4.1 Iterative Development (1970-1990) 178
7.4.2 The Birth of Agile Methodologies (1990-2001) 179
7.4.3 The Post-manifesto Age (2001-2011) 181
7.5 Agile Methodologies Overview 183
7.5.1 Extreme Programming (XP) 183
7.5.2 SCRUM 184
7.5.3 Feature Driven Development (FDD) 184
7.5.4 Lean Agile Development: LSD, Kanban, Scrumban 185
7.5.5 Agile Versions of UP: AgileUP, Basic/OpenUP 188
7.6 Agility and Software Evolution 190

7.7 Trends and Challenges 193
References 195

Chapter 8 Software Development Agility in Small and Medium Enterprises (SMEs) 201
Abstract 201
8.1 Introduction 202
8.2 Legal Definition of SMEs 205
8.2.1 Foreign Definitions 205
8.2.2 Local Definition (Colombia) 207
8.3 Agile Methodologies in the Real World 208
8.3.1 Knowledge About Agile Methodologies 208
8.3.2 People Who Practiced Agile at a Previous Company 209
8.3.3 Roles in Agile Usage 209
8.3.4 Reasons for Agile Adoption 209
8.3.5 Agile Methodology Used 210
8.3.6 What Has Already Been Achieved by Using ASDM 210
8.3.7 Barriers to Further Agile Adoption 210
8.3.8 Agile Practices 211
8.3.9 Plans for Implementing Agile on Future Projects 211
8.3.10 Using Agile Techniques on Outsourced Projects 211
8.4 Agility Assessment Models 211
8.4.1 Boehm and Turner’s Agility and Discipline Assessment 212
8.4.2 Pikkarainen and Huomo’s Agile Assessment Framework 215
8.5 Weaknesses and Strengths of Agile Methodologies 219
8.6 Challenges Adopting Agile Methodologies in SMEs 221
8.7 Summary 223
References 224

Chapter 9 Model-driven Development and Model-driven Testing 227
Abstract 227
9.1 Introduction 227
9.1.1 Challenges of Software Development 228
9.1.2 Traditional Testing Process 228
9.1.3 A Solution to Face the Problem 228
9.1.4 Model-based Testing 229
9.2 Why Use Model-driven Architecture in Agile Methodologies 230
9.3 Model-driven Development 230
9.3.1 Philosophy 230
9.3.2 Unified Modeling Language 232
9.4 Model-driven Architecture 233
9.4.1 Software Engineering 233
9.5 What the Model-driven Architecture Does Not Do 234
9.6 Notations for Modeling Tests 235
9.6.1 Transition-based Modeling 235
9.6.2 Pre/Post Modeling 235
9.7 Testing from Finite State Machines 236
9.7.1 FSM and ModelJUnit 236
9.7.2 Case Study 237
9.8 Testing from Pre/Post Models 239
9.8.1 Object Constraint Language (OCL) 239
9.8.2 Case Study 240
9.9 Trends and Challenges 241
9.10 Summary 242
References 243

Subject Index 246

Name Index 252

List of Figures

1.1 Characterization of Natural Language Summarization 4
1.2 Summary generation process for methods 11
1.3 Corpus creation process 12
1.4 Summary generation process for software changes 15

2.1 Classification of Software Repositories 32
2.2 Development workflow models 45

3.1 Structograms (sequence, loop and conditional) 65
3.2 Jackson diagrams representation (sequence, loop and conditional) 66
3.3 Basic control structure diagrams (sequence, loop and conditional) 66
3.4 SeeSoft tool example 67
3.5 Mozilla Firefox example of evolutionary coupling 68
3.6 Fractal representation 68
3.7 Pixel-maps example 69
3.8 sv3D and SeeIT 3D metaphor 70
3.9 Sorting steps using a third spatial dimension 71
3.10 CodeCity metaphor 71

4.1 Incremental change activities 83
4.2 Concept location techniques frequently use an intermediate representation of source code 89

5.1 General architecture of an IR system 107
5.2 Information retrieval models 108

5.3 Information retrieval and software evolution activities 113

6.1 Reverse Engineering process 132
6.2 Business rules knowledge through abstraction levels 135
6.3 Modularization process based on clustering/searching algorithms 140
6.4 Prototype of the Oracle Forms Reverse Engineering Tool 144
6.5 Example of a CURSOR and its use in a FOR statement 145
6.6 Example of a code block that does not represent a business rule 146
6.7 Application of the MECCA approach 152

7.1 Iterative development 164
7.2 Incremental development 165
7.3 Iterative and incremental development 166
7.4 The Spiral model 169
7.5 The Unified Process (UP) model 171
7.6 The RUP phases and their goals 172
7.7 UP elements 173
7.8 The Rational Unified Process (RUP) model 174
7.9 The simple staged model 175
7.10 The Crystal Family 180
7.11 Agile methodologies evolution 182
7.12 SCRUM 184
7.13 FDD lifecycle 185
7.14 Kanban board 187
7.15 OpenUP layers 190
7.16 Communication modes and effectiveness 191
7.17 Hub organizational structure 193

8.1 Software development in Colombia 204
8.2 Earnings generated by software development in Colombia 204
8.3 Dimensions affecting method selection 216
8.4 An Agile assessment framework 218

8.5 Data collection planning 218
8.6 Changes required for adopting Agile methodologies in traditional organizations, and associated challenges/risks 222

9.1 Representing states and state transitions using a state diagram 236
9.2 State diagram for the case study 237
9.3 State diagram for the case study 239

List of Tables

1.1 Software summarization approaches 18

2.1 Laws of program evolution 30
2.2 Methodologies used in MSR 35
2.3 Mining Software Repositories tools 36
2.4 Current state of MSR tools 48

3.1 Steps in the visualization pipeline 60
3.2 List of applications 61

4.1 Classification of IR-based techniques according to source code applicability 94
4.2 Classification of IR-based techniques according to intermediate representation 95
4.3 Classification of IR-based techniques according to granularity of results 95
4.4 Classification of IR-based techniques according to the type of information used 96

7.1 Agile principles vs. Software features 192

8.1 Software exports from India, Ireland and Israel 202
8.2 Growth of the Indian software industry 203
8.3 Definition of SMEs in selected countries 206
8.4 Personnel characteristics 214
8.5 Agility - plan-driven method home grounds and levels of software method understanding and use 215

9.1 Main OCL constructs 240
9.2 Test cases generated 241

Preface

Over its lifetime, a software system is affected by many changes whose fundamental purpose is to adapt it to its operating environment. When the first software systems were developed, only a few adaptive changes were needed to achieve a perfect coupling between the new system and its production environment. This phenomenon, almost negligible at that time, has been growing gradually, and has become what is now called software evolution and maintenance. Nowadays, it is accepted that any software system must be designed to change, because during its lifespan it must be continuously updated in order to adapt to an ever-changing environment.

As a research group in software engineering at Universidad Nacional de Colombia, we are interested in understanding this evolutionary phenomenon, building models to describe the past, present and future of the evolution of a software system, and designing and implementing tools to support these permanent change processes. The study of the drivers and general properties of the software evolution phenomenon is important because the highest costs of software construction are associated with maintenance tasks, and because the lifespan of a software system depends heavily on the procedures and techniques used to implement changes, corrections and required extensions.

Our research in this subject is aimed at understanding, assessing, implementing and managing the changes needed by all types of software artifacts. This includes evolutionary construction (Agile methods), software understanding, concept location (to identify where a particular functionality is implemented in the source code), impact analysis (to determine which parts of the current system are affected by a particular

change already implemented), identification of dependencies and links among artifacts that evolve simultaneously, and reverse engineering (to produce high-quality models to facilitate understanding of a complex legacy system).

What is this Book About?

This book presents specific research topics within the field of software evolution, addressed by members of the ColSWE research group, who have been working on them for the last three years. The first chapter examines several approaches for summarizing software artifacts, and discusses the use of summaries for aiding software comprehension, an unavoidable step of maintenance tasks. The second chapter highlights the importance of software repositories as sources of information that have allowed researchers to understand the evolutionary processes of software systems. It also presents a literature survey on approaches for mining software repositories, and a brief review of the main research trends and current challenges in this field. The third chapter presents the most relevant visualization techniques that have been applied to software entities as tools for supporting software understanding, and explains the major difficulties to be overcome by future visualization tools. The fourth chapter studies the incremental change process, which is the underpinning of software evolution because it enables developers to add new functionality to programs, improve existing functionality, and remove bugs. The fifth chapter shows how information retrieval techniques can support the implementation of some activities of the evolutionary model of software development. The sixth chapter presents an overview of reverse engineering, an analysis of its applicability to an industrial legacy software system, and some future trends that will direct research in this discipline. The seventh chapter provides a comprehensive introduction to agile software methods, and explains how these methodologies address specific features of software projects such as requirements volatility, users’ volubility, incremental change, and uncertainty in schedules.
The eighth chapter analyzes how Agile software methods are being used in small and medium-sized software development companies in Colombia. Finally, the last chapter introduces the application of model-based design for building and executing the artifacts necessary to perform software testing, as well as the use of test cases for developing a new system.

Who Should Read this Book?

This book is of interest to everyone working in the field of software engineering. It is also aimed at senior undergraduate and graduate students and researchers who are new to the field of software evolution. For them, each chapter of the book provides up-to-date information and links to other resources within a specialized subject. In addition, each chapter can be read independently. We hope students, researchers, and practitioners will find in this book a gentle introduction and the initial motivation to start their own research in the field of software engineering, and particularly, in software development and maintenance.

Acknowledgments

The authors would like to thank all the people who have contributed to this book. In particular, a very special thanks to:

• Luis Fernando Niño, the head of Dirección de Investigación Sede Bogotá - DIB, who has made possible the publication of this book as part of the collection called “Colección de Investigación de la DIB”.

• Hugo Alberto Herrera, the chair of the Department of Computer Systems and Industrial Engineering, who has provided significant resources to conduct the research presented in this book.

• Leslie Solorzano and Lina Johana Montoya, for reviewing and improving our writing in English in so many ways.

• Our advisors and collaborators at Wayne State University and The College of William and Mary, who have given us their expertise and support for conducting research in Software Engineering.

• Our undergraduate students at Universidad Nacional de Colombia, who have participated as Java developers within the user studies related to this research field.

Bogotá, Colombia, November 2011

Jairo Aponte

Summarizing Software Artifacts: Overview and Applications

Laura Moreno Jairo Aponte

ABSTRACT

Software maintenance is one of the most time- and effort-consuming phases of the software life cycle. Usually, there is some documentation underlying the design and development of software, which is fundamental for supporting maintainers’ tasks. However, as the amount and size of software artifacts increase, their use becomes complicated and impractical. In order to deal with this situation, summarization of software artifacts has emerged as a new area of software engineering research. This chapter presents various existing approaches for summarizing software artifacts, their possible applications in maintenance, and some research challenges.

1.1 INTRODUCTION

Most of the time and effort of software developers and maintainers is devoted to reading and analyzing a huge amount of information about the system they are dealing with. This information is generated during each stage of the software life cycle and, depending on its purpose, is frequently captured in different kinds of software artifacts, such as requirements specifications, use case documents, technical designs, code, bug reports, test cases, and so on. Such artifacts represent the main source of knowledge when maintaining software, since they reflect the domain, design, and functionality of the system.

Nonetheless, on some occasions those software documents can be excessive or too long to be read completely when performing, for example, an improvement to the system. In order to reduce the cost, effort and time spent by maintainers when using software artifacts, several approaches have traditionally been proposed from diverse fields, including visualization, data mining and information retrieval. In contrast, the application of concepts and techniques used in natural language processing is a novel subject in software engineering, particularly in regard to summarization processes. The integration of these two subjects is reasonable, since the same problem affects both areas: The continuous growth of documents and information.

Accordingly, Section 1.2 states the basics of summarization, including the factors that affect it and the taxonomy of summary evaluation. Section 1.3 describes how summarization has been adopted by the software engineering domain; it also presents some existing approaches for summarizing software artifacts. The role of summaries in aiding software maintenance activities is treated in Section 1.4. Finally, some trends and challenges of summarization in the software engineering field are discussed in Section 1.5.

1.2 ESSENTIALS ON NATURAL LANGUAGE SUMMARIZATION

The huge amount of textual information found physically and digitally has stimulated research on Natural Language Processing (NLP) in the last decades, turning it into an extended field, the product of the fusion of a variety of areas of knowledge such as linguistics, statistics, and psychology. In particular, the continuous growth of the World Wide Web has made data reduction a central point in NLP, in order to deal with the information overload problem [1,2]. That reduced form is called a summary, “a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that” [3]. In this way, automatic summarization in NLP is basically about generating a shorter text from one or more documents, which still preserves their important content.

1.2.1 The Dimensions of Summarization

Since the mid-1950s, different automatic techniques have been proposed to represent text documents in a condensed form, coming from fields such as statistics, machine learning, information retrieval, and natural language analysis itself. These techniques are affected by several conditions, called orthogonal views [2], aspects of variation [4], or context factors [5], which are commonly divided into three major categories (Figure 1.1):

• input factors, which define the characteristics of the original document(s);

• purpose factors, for characterizing the required transformations according to the summary usage; and

• output factors, which determine the final product, i.e., the summary.

Among the most remarkable input factors are the language (monolingual vs. multilingual), the register or linguistic style, the genre, and the units or size of the source(s). This last factor determines whether the input comprises single or multiple documents, and whether the elimination of content redundancy (or repetitive information) becomes a key issue within summarizing techniques.

On the other hand, purpose factors include the envelope [5], the target audience, and the usage of a summary. For instance, a summary is critical or evaluative when it points out the opinion of the author on a particular subject; it is indicative if it helps the reader decide whether the sources are worth reading; otherwise, if the summary completely replaces the reading of the original documents, it is informative.

Figure 1.1 Characterization of Natural Language Summarization [2, 4, 5].

Finally, output factors take into account coherence, reduction, and derivation. In detail, derivation means that a summary can be extractive or abstractive: While the former is comprised entirely of sequences of words literally taken from the source, the latter is built, at least in part, from words which do not appear in the original document [4]. Producing abstracts is thus harder, because they pose additional challenges involving analysis, topic fusion and natural language generation [1,4]; even so, several approaches have tackled each case, some more successfully than others, as mentioned in [1, 2, 5]. But how is this determined? How do we know whether an approach shows good results, or at least better results than others?
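The extractive strategy described above can be illustrated with a minimal sketch. The following toy summarizer (an illustrative frequency-based scorer, not any specific system from the literature; the function name and example text are invented) ranks each sentence by the corpus-wide frequency of its words and emits the top-scoring sentences literally, in their original order, exactly as an extract does:

```python
import re
from collections import Counter

def extractive_summary(text, max_sentences=2):
    """Return the top-scoring sentences of `text`, in original order.

    Each sentence is scored by summing corpus-wide word frequencies,
    normalized by its length so long sentences are not favored.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Emit the selected sentences in source order, as extracts do.
    return [s for s in sentences if s in ranked]

doc = ("Software maintenance consumes most of the software life cycle. "
       "Maintainers read many artifacts. "
       "Summaries of software artifacts can reduce the reading effort of maintainers.")
summary = extractive_summary(doc, max_sentences=1)
```

An abstractive system, by contrast, would have to generate new wording for the same content, which is why the text above calls it the harder problem.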

1.2.2 Summarization Evaluation

Currently, there is no clear idea of what constitutes a good summary. Actually, it is possible to obtain several perfectly acceptable summaries from the same sources. Moreover, the lack of a standard framework makes it difficult to construct a baseline for contrasting summarization systems. However, despite the fact that evaluation is a controversial and challenging concern, it is a major subject in text summarization, since it makes it possible to assess the results of a specific method or system and to compare the results of different techniques; moreover, some types of evaluation make it possible to understand why a method does not work adequately [6]. The quality of summaries can be determined in two different ways: By analyzing their internal properties (intrinsic methods) or by studying their impact when performing a particular activity (extrinsic methods). In the first case, the features to be evaluated are the text quality and the content of the summary. Measures of text quality assess aspects such as grammar, readability, cohesion, and coherence. These kinds of linguistic features cannot be evaluated by automatic methods, but only by human annotators (online approaches). In contrast, content evaluation is done with significantly less or no human intervention (off-line approaches), usually by comparing a system’s output with a gold standard, i.e., an ideal summary which contains the important content of a given source and can be produced either manually or automatically [5, 7]. Co-selection

measures like precision, recall or relative utility are good examples of this type of evaluation [7], since they allow comparing the number of sentences present in both the peer summary and the gold standard. Nevertheless, such methods leave aside the fact that two sentences might mean the same thing although they are written differently. Thus, the evaluation of two summaries can diverge even when they express identical ideas in different words. In order to avoid this situation, content-based measures assess the similarity level between sentences, and by extension, between summaries. As an illustration, cosine similarity establishes how similar two documents are by representing them in a vector space model and verifying the co-occurrence of words. The ROUGE family also evaluates co-occurrence, but of subsequences of words from a given text [8]. The pyramid method is a good example of content-based evaluation as well. It scores summaries based on content units (SCUs), which are prioritized depending on their frequency of appearance in human summaries [9].

Extrinsic evaluation moves away from the inner properties of the summary, in order to take it as a whole and determine its effect, efficiency or usefulness on a particular task. Therefore, task-based evaluation requires performing an activity, manually or automatically, supported by summaries. In [1], some attempts associated with relevance assessment and reading comprehension are mentioned. In [2], some approaches related to document categorization, information retrieval and question answering are discussed. A well-formed study on intrinsic summary evaluation is shown in [7]. An approximation to the taxonomy of summary evaluation is reflected in [2] and in Figure 1.1.
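The co-selection and content-based measures just discussed are straightforward to compute. The sketch below is an illustrative toy (the sentences and data are invented, not taken from any evaluated system): it derives precision, recall and F-score from sentence overlap between a peer summary and a gold standard, and contrasts them with a bag-of-words cosine similarity, which still rewards shared vocabulary when the selected sentences differ:

```python
import math
from collections import Counter

def co_selection(peer, gold):
    """Precision, recall and F-score over shared sentences (co-selection)."""
    overlap = len(set(peer) & set(gold))
    precision = overlap / len(peer)
    recall = overlap / len(gold)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f

def cosine_similarity(a, b):
    """Cosine of bag-of-words vectors: a content-based measure that
    detects shared wording even when whole sentences do not match."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

peer = ["the system parses bug reports", "summaries help maintainers"]
gold = ["summaries help maintainers", "reports are long"]
p, r, f = co_selection(peer, gold)  # one shared sentence out of two on each side
```

Here co-selection sees only the one identical sentence, while cosine similarity would also credit the overlapping words in the non-matching pair; this is exactly the gap between co-selection and content-based evaluation described above.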

1.3 SUMMARIZING SOFTWARE ARTIFACTS: EXISTING APPROACHES

The software engineering domain is not exempt from information overload problems. Throughout the software development process, developers must work daily with a large amount of data: From requirements specifications and design documents, through maintenance records, to the source code itself.

Sometimes when maintaining software, developers deal with a software system of considerable size whose domain is totally unknown to them, and still they have to perform some kind of enhancement, adaptation or correction in it. According to [10, 11], developers spend more time reading and navigating the source code than writing it. Therefore, it is imperative to find a way to facilitate the maintenance of systems. At this point, a short description of software artifacts would be very useful, especially because searching and browsing within source code and documentation are two common tasks when maintaining software.

So, although summarization is an emerging issue within the software engineering field, several approaches have been proposed to reduce the data found in software artifacts. In this context, an artifact is defined as any product derived from the development process which describes the process, functionality, design, or implementation of software. These derivatives include source code, binaries, and mainly documentation. Currently, there is at least one summarization work that deals with each one of them separately, but in some cases artifacts are used together to produce more accurate descriptions of their content.

1.3.1 Summarizing Documentation From a practical point of view, the approaches that deal with software documentation usually apply natural language summarization techniques to natural language texts, i.e., the traditional summarizing process. A representative approach to the summarization of software documentation is [12], where bug reports are synthesized by machine learning techniques. An important issue about bug reports is that they often comprise two parts: one with predefined values in fixed fields, and another with free-form texts such as a title, a bug description, and a sequence of comments related to its lifecycle. In that sense, bug reports are somewhat similar to email conversations, and in consequence, the same techniques applied in the latter case might be useful for summarizing the former. Along these lines, in [12] the most relevant sentences are extracted from bug reports by three classifiers trained on structural, participant, length and lexical features, with different corpora: annotated email threads, meetings, and bug reports.

7 RESEARCH TOPICS IN SOFTWARE EVOLUTION AND MAINTENANCE

The evaluation of the three methods was performed from several perspectives. For instance, each system was evaluated against a baseline classifier to measure its effectiveness using the ROC curve. Later, the systems were compared with each other by applying the standard intrinsic measures of precision, recall and F-score. Next, content selection quality was assessed via the pyramid method. Not surprisingly, the classifier based on the bug reports corpus surpassed the other two systems. The informativeness of each feature considered in the summarizers was also evaluated: the features with the highest F-score were those related to length. Extrinsic measures were also used for evaluating the informativeness, redundancy, relevance, and coherence of summaries, and their results were indeed acceptable.
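The standard intrinsic measures just mentioned can be computed by co-selection, i.e., by comparing the set of sentences the system selected with those in the gold standard. A minimal sketch (the sentence indices are hypothetical):

```java
import java.util.Locale;
import java.util.Set;

public class CoSelection {
    // Precision: fraction of system-selected sentences that are also in the gold standard.
    static double precision(Set<Integer> system, Set<Integer> gold) {
        long hits = system.stream().filter(gold::contains).count();
        return (double) hits / system.size();
    }

    // Recall: fraction of gold-standard sentences that the system selected.
    static double recall(Set<Integer> system, Set<Integer> gold) {
        long hits = system.stream().filter(gold::contains).count();
        return (double) hits / gold.size();
    }

    // F-score: harmonic mean of precision and recall.
    static double fScore(Set<Integer> system, Set<Integer> gold) {
        double p = precision(system, gold), r = recall(system, gold);
        return (p + r == 0) ? 0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        Set<Integer> system = Set.of(1, 2, 5, 7); // sentences the summarizer picked
        Set<Integer> gold   = Set.of(1, 2, 3, 7); // sentences the annotators picked
        System.out.printf(Locale.ROOT, "P=%.2f R=%.2f F=%.2f%n",
                precision(system, gold), recall(system, gold), fScore(system, gold));
        // prints P=0.75 R=0.75 F=0.75
    }
}
```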

1.3.2 Summarizing Source Code When the aim of summarization is to describe source code, one crucial issue has to be considered: source code is a mixed artifact which contains information for communicating with both humans (the developers) and machines (the compilers). This situation is aptly explained in [13] by means of an example similar to the following one. Consider the following chunk of code (extracted from the aTunes1 system, a full-featured audio player and manager):

public static void setLanguage(String fileName) {
    if (fileName != null)
        languageBundle = getLanguageFile(TRANSLATIONS_DIR + '/' + fileName);
    else
        languageBundle = getLanguageFile(TRANSLATIONS_DIR + '/' + DEFAULT_LANGUAGE_FILE);
}

Program instructions within this method can be interpreted by compilers even if identifiers are replaced by arbitrary words. So, the functionality is unaltered even though the identifiers no longer reveal the intention of the code:

public static void method_1(type_1 a) {
    if (a != null)
        b = method_2(c + '/' + a);
    else
        b = method_2(c + '/' + d);
}

1http://www.atunes.org/ (accessed and verified on April 11, 2011)


On the other hand, if the information removed from the source code is the formal one, the textual information represented by the terms composing the identifiers still remains, providing readers with an idea of the purpose of the code:

set language string file name
file name language bundle get language file translations dir file name
language bundle get language file translations dir default language file

Consequently, source code analysis can be performed statically or dynamically. The fundamental difference between them is that static analysis does not involve executing code, whereas dynamic analysis studies the behavior of code during program execution. In these circumstances, dynamic approaches depend on the program input and the program itself [14].

Static Approaches Static approaches usually consider syntactic and semantic properties of source code. The first step in any static approach is commonly tokenization, where composite identifiers are split into words according to capital letters, underscores or other special characters. In this way, method names such as cdda2WavFile or cdda_2_wav_file are transformed into cdda, 2, wav, file. The most notable techniques for describing source code statically can be distinguished by the coherence factor, which is determined by the fluency of the output [4]. This means that a summary is fluent if it consists of well-formed sentences that are related to each other, forming coherent paragraphs. Otherwise, if it comprises individual words or text fragments which do not keep any relation, the summary is said to be disfluent. As an illustration, for the setLanguage method previously mentioned, an example of a fluent sentence-based summary is:

This method sets application language. If the file name is defined, it is used; if not, the default language file is applied

For the same method, a disfluent term-based summary could be: set language bundle translation default file.
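The tokenization step described earlier, splitting composite identifiers on capital letters, underscores and digit boundaries, can be sketched as follows (the splitting rules here are a simplification of what real preprocessing tools do):

```java
import java.util.ArrayList;
import java.util.List;

public class Tokenizer {
    // Split a composite identifier into lowercase word tokens:
    // underscores, camel-case boundaries and digit runs all act as separators.
    static List<String> split(String identifier) {
        List<String> tokens = new ArrayList<>();
        for (String part : identifier.split("_")) {
            // insert a space at camel-case and letter/digit boundaries
            String spaced = part
                .replaceAll("([a-z])([A-Z])", "$1 $2")
                .replaceAll("([a-zA-Z])([0-9])", "$1 $2")
                .replaceAll("([0-9])([a-zA-Z])", "$1 $2");
            for (String t : spaced.trim().split("\\s+"))
                if (!t.isEmpty()) tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("cdda2WavFile"));    // [cdda, 2, wav, file]
        System.out.println(split("cdda_2_wav_file")); // [cdda, 2, wav, file]
    }
}
```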


Sentence-based Summaries One of the most outstanding approaches to software summarization which deals directly with source code by creating descriptions of Java methods is [15]. The essentials of this proposal are heuristics, for selecting the central statements of code within a method (s_units), and templates, for generating natural language sentences and reducing redundancy. The algorithm applied to obtain the descriptive comments starts, as usual, with source code preprocessing, which includes tokenization and abbreviation expansion. Then, the action, theme and secondary arguments for methods are obtained using a Software Word Usage Model (SWUM) that captures linguistic, structural and occurrence relationships of words within code. Next, some heuristics are applied to identify the most relevant units of code within a method. Five kinds of statements are considered relevant: ending units, void-return units, same-action units, data-facilitating units, and controlling units. The relevance and role of these statements were inferred from a study of a set of comments from open source Java programs, and an opinion survey with Java developers about the need for certain units within method descriptions. At last, those units are lexicalized from predefined templates. For instance, the fixed template for an assignment is: action theme secondary-args and get return-type. So, the text generated for the s_unit title = getNameWithoutExtension() is Get name without extension and get title. Similar kinds of templates were designed for variables, single method calls, return statements, nested and composed method calls, conditional expressions, and loop expressions. The whole summarizing process is shown in Figure 1.2.
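The lexicalization step for the assignment template above can be sketched as follows; the splitting and phrasing logic here is a simplification for illustration, not the SWUM-based pipeline of [15]:

```java
public class TemplateLexicalizer {
    // Hypothetical sketch: lexicalize an assignment s_unit of the form
    // "lhs = methodName()" with the fixed template
    // "action theme secondary-args and get return-type".
    static String lexicalize(String lhs, String calledMethod) {
        // split the camel-case method name into words:
        // getNameWithoutExtension -> "get name without extension"
        String phrase = calledMethod
            .replaceAll("([a-z])([A-Z])", "$1 $2")
            .toLowerCase();
        // capitalize the first word, then append "and get <lhs>"
        return Character.toUpperCase(phrase.charAt(0)) + phrase.substring(1)
            + " and get " + lhs;
    }

    public static void main(String[] args) {
        // s_unit: title = getNameWithoutExtension()
        System.out.println(lexicalize("title", "getNameWithoutExtension"));
        // prints: Get name without extension and get title
    }
}
```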
In general, this technique produces acceptable summaries for methods, as stated by an informal evaluation where some Java developers were asked about the accuracy, content adequacy, and conciseness of the text generated for individual s_units and for whole summaries. However, developers disagreed with the level of detail required in summaries,


which suggests that the selection of statements should be studied carefully. In addition, the proposal is designed for and limited to methods, making it unable to produce that kind of comments at other granularity levels, such as classes.

Figure 1.2 Summary generation process for methods: the method goes through preprocessing (program analysis, natural language analysis, and Software Word Usage Model construction) and then through the summary comment generator (relevant s_unit selection, text generation, and text combination and smoothing), producing the method summary. Adapted from [15].

Term-based Summaries In that sense, [16] goes further and generates (mostly) extractive summaries based on terms for methods and classes, applying and comparing the results of several text retrieval techniques. Two basic phases are executed to this end: corpus creation and relevant term selection. First, a corpus is created from source code by extracting identifiers and comments, but on this occasion, as a particular case, tokenization is an optional step. So, the corpus can be composed of split identifiers, original identifiers, or both (Figure 1.3). After filtering out terms which do not carry a specific meaning, the summaries are generated by selecting the terms with the highest scores, obtained by applying algebraic reduction methods, Vector Space Model (VSM) and Latent Semantic Indexing (LSI), with variations in the options for weighting terms (e.g., log, tf-idf, binary entropy). In [16], these schemes were compared against random and lead summaries, using intrinsic, online evaluation. Random summaries comprise artifact terms chosen in a haphazard way, whereas lead summaries are built with the first terms of the artifact. The results up to this point of the study showed that each variation of the lead method outperformed


the other techniques regardless of the weighting and length options. Although VSM summaries also obtained good scores, they were not as good as the lead ones. Nevertheless, it was found that this technique extracts relevant terms from parts of the code where the lead method has no effect at all, i.e., lines other than the header of the source code artifact. Therefore, these techniques are complementary and their union produces summaries with a greater amount of relevant terms, principally for methods.
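A minimal sketch of term scoring with the tf-idf weighting option mentioned above (the two hypothetical "documents" stand for the term lists of two methods):

```java
import java.util.*;

public class TfIdf {
    // Score each distinct term of a document against a corpus of documents:
    // term frequency in the document times the log of the inverse document
    // frequency. Assumes doc is one of the corpus documents, so df >= 1.
    static Map<String, Double> tfidf(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> scores = new HashMap<>();
        for (String term : new HashSet<>(doc)) {
            double tf = Collections.frequency(doc, term);
            long df = corpus.stream().filter(d -> d.contains(term)).count();
            scores.put(term, tf * Math.log((double) corpus.size() / df));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> m1 = List.of("set", "language", "file", "language");
        List<String> m2 = List.of("get", "file", "name");
        Map<String, Double> s = tfidf(m1, List.of(m1, m2));
        // "language" appears only in m1, so it outranks the corpus-wide "file"
        System.out.println(s.get("language") > s.get("file")); // prints true
    }
}
```

A term-based summary would then simply keep the top-scoring terms of each artifact.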

Figure 1.3 Corpus creation process followed in [16]: identifiers and comments are extracted from a collection of documents (methods, classes), stop words (keywords, prepositions, ...) are filtered out, and identifiers are either kept in their original form, split, or both, yielding corpora of original identifiers, split identifiers, or split plus original identifiers.

Model-based Summaries Source code descriptions can also be represented through models such as diagrams, maps, or graphs. For instance, [17] presents the software reflection model, where the developer initially selects a high-level, task-specific model of the software system, which is used as a framework for summarizing information in the source code. This high-level model is delineated through interconnected boxes that refer to the task at hand, and represent the main modules of the system and the interactions between them. After choosing the high-level model, the developer uses a syntactic analysis tool to extract structural information from the source code, i.e., the methods and the method calls. Next, the user performs a mapping of the high-level model to the entities and relationships extracted from the code. The developer can use a series of tools to support this task (like regular expression matching or mapping several source code entities into a single map entry), but the mapping is mostly performed manually. Lastly, the developer computes


a reflection model, which is a comparison between the high-level model, the source code structural information and the constructed map, and represents a summary of the structural information contained in the source code of the system.
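The mostly-manual mapping step can be supported by simple regular expression matching; a hypothetical sketch (the entity and module names are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class ReflectionMapping {
    // Map a source code entity to a high-level model module using the first
    // regular expression that matches it, or "unmapped" otherwise.
    static String mapEntity(String entity, Map<String, String> patterns) {
        for (Map.Entry<String, String> e : patterns.entrySet())
            if (Pattern.matches(e.getKey(), entity)) return e.getValue();
        return "unmapped";
    }

    public static void main(String[] args) {
        // Hypothetical high-level model of an audio player.
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put(".*[Pp]laylist.*", "Playlist management");
        patterns.put(".*[Pp]layer.*",  "Playback");

        System.out.println(mapEntity("AudioPlayer.play", patterns)); // Playback
        System.out.println(mapEntity("PlaylistIO.load", patterns));  // Playlist management
    }
}
```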

Summaries as By-products Other approaches like [13, 18] do not have the generation of that kind of short descriptions as an explicit aim, but still, their results can be considered as (partial) summaries. For example, in [13] the main objective was to analyze software without taking into account external documents, in order to provide a first impression of an unfamiliar system. In this case, LSI was used to compute the similarity between source code artifacts, and then to describe the topic of clusters of artifacts with labels extracted from the same source code. These labels capture the important concepts within each linguistic cluster, revealing the intention of the code. By the same token, a combination of LSI and Formal Concept Analysis (FCA) was used by [18] for indexing source code and organizing the results in a concept lattice, respectively. The initial aim of this work was the reduction of developers' effort when searching in source code, providing them with a list of artifacts related to a query entered by the user. Even so, the resulting representation of relevant information, labeling topics, concepts, and relationships between them, can be recognized as an approximation to source code summarization.

Dynamic Approaches As stated before, in order to analyze behavioral aspects, there are some methods that require the execution of a program, or at least, of one of its slices or traces. In contrast to static approaches, dynamic ones are able to analyze the behavior of variables and control structures, detect data dependencies, and collect and log temporal information. Moreover, dynamic approaches allow observing the flow and behavior of a program under determined conditions.


Summarizing Long Traces

In [19], the most representative software routines of large traces are identified by selection or generalization of executed content. Thus, the summary is a simplification of the original input, reduced by removing low-level implementation details and utilities (through a utilityhood metric); in this case the obtained description is represented as a UML2 sequence diagram. By utility, [19] refers to any software element designed and implemented to be accessed or called from anywhere in the source code (e.g., accessing methods or constructors), whereas by implementation detail, it refers to any element whose absence does not interfere with the comprehension of a component. Basically, the algorithm proposed in [19] for extracting the relevant routines from a scenario starts by instrumenting source code to log the method calls. Afterward, a static call graph is built and pruned by removing the low-level implementation details mentioned above. Next, the utilityhood of each remaining node of the graph is computed, based on fan-in and fan-out metrics; then, the unnecessary routines are removed, i.e., those nodes whose utilityhood is the lowest. This process is repeated until reaching the number of routines required by the user. Finally, the routines which are still present in the graph are represented in a light version of a UML sequence diagram, in order to visualize the trace behavior. The evaluation of this approach was performed informally by analyzing a summary obtained from a particular trace of a program, through a questionnaire answered by developers with intermediate and advanced knowledge of the system. The questions were intended to assess the quality of the summary by asking how well its content represented the trace process, and how effective it could be in software maintenance. Again, just as in [15], the amount of detail included in the summaries versus the amount required by the users was a point of disagreement.
Still, this kind of short description was deemed useful for understanding different scenarios of a system.
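The utilityhood intuition above, that routines called from many places but calling few others are likely utilities, can be sketched with fan-in and fan-out; the formula here is purely illustrative, not the exact metric defined in [19]:

```java
public class Utilityhood {
    // Illustrative utilityhood: the fraction of a routine's connections that
    // are incoming calls. High values suggest accessors, constructors and
    // other widely reused utilities. NOT the exact metric of [19].
    static double utilityhood(int fanIn, int fanOut) {
        return (fanIn + fanOut == 0) ? 0.0 : (double) fanIn / (fanIn + fanOut);
    }

    public static void main(String[] args) {
        System.out.println(utilityhood(25, 0)); // accessor called from 25 sites -> 1.0
        System.out.println(utilityhood(1, 9));  // coordinating routine -> 0.1
    }
}
```

Iteratively removing the highest-utilityhood nodes from the pruned call graph leaves the routines that end up in the sequence diagram.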

2Unified Modeling Language, a standardized modeling language for object-oriented development.


Summarizing Software Changes Another direction is taken by [20], where code changes are documented by describing the effects they have on the behavior of a program. This description, commonly known as the commit message in version control systems, is generated automatically by executing a pair of versions of the same system in search of differences between their control flows (delta). As a result, the proposed algorithm considers the new behavior of the system and the conditions under which it is produced. To this end, pairs of statements and path predicates are obtained from running two different versions of the system, using symbolic values as input variables. In this context, the path predicate indicates the conditions under which the statement is executed. By comparing both sets of pairs, the statements whose path predicate has been added, removed or modified are identified. Of course, not all statements become part of the summary: a filtering process is performed to retain only method invocations, field assignments, return statements, and throw statements. Then, some summarization transformations are applied to the statements previously found. For instance, if the first version of a chunk of code is if (interrupted) deleteFile(); and the second one is if (interrupted) revertProcess(); the whole-change transformation for conditions expresses the change as if interrupted, do revertProcess() instead of deleteFile().
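The whole-change transformation just described can be sketched as a fixed template fill; a simplification of the approach in [20]:

```java
public class ChangeSummarizer {
    // Fixed template for a changed statement under an unchanged condition:
    // "if <condition>, do <new statement> instead of <old statement>".
    static String describeChange(String condition, String oldStmt, String newStmt) {
        return "if " + condition + ", do " + newStmt + " instead of " + oldStmt;
    }

    public static void main(String[] args) {
        System.out.println(describeChange("interrupted",
                "deleteFile()", "revertProcess()"));
        // prints: if interrupted, do revertProcess() instead of deleteFile()
    }
}
```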

Figure 1.4 Summary generation process for software changes: two revisions of the program go through predicate extraction, text generation, and text optimization and smoothing, and after an acceptance step the change summary text is produced. Adapted from [20].

Similar kinds of templates are defined for predicates, hierarchical structures, and method calls. Other transformations are applied to reduce the size of the documentation, and some others to improve readability. The major steps of the approach are shown in Figure 1.4.


The evaluation of this method assessed quantitative features like the size and content of the summary. The latter was evaluated through a comparison rubric which contrasted the description of changes with a set of human annotations. Additionally, developers considered qualitative features of summaries like usefulness, readability and accuracy. In general, the results were satisfactory, and the summaries could be a complement for version control systems, reducing developers' effort when describing changes.

1.3.3 Combining Software Artifacts

All previous approaches give an insight into specific software artifacts. They are explicitly focused on a single type of document, avoiding external files which might in fact complement their results. However, more often than not, the information provided by just one artifact is not enough. This is especially true in some maintenance tasks that demand several sources of knowledge in order to be successfully completed, as in the case of bug correction, where data contained in bug reports, source code, and even specification documents can support maintainers' work. As an illustration, the second model proposed in [17], called the lexical source model, is constructed using keyword matching techniques (e.g., grep or awk) to find structural information in the source code, by specifying regular expressions related to different types of structural constructs (i.e., method declarations, method calls, variable definitions, etc.). In this approach, a developer is able to specify a set of patterns for finding structural information in source code (e.g., define a regular expression to match a function call), a set of actions to be executed when specific structural information is found, and a set of rules to combine the structural information found in different files into one model. Several types of artifacts such as data files or documentation files can be scanned using this lexical approach, and the information can be combined into one model. This technique can be used to provide the structural information needed to build the aforesaid reflection model. In [21], a framework is proposed for synthesizing the information related to an evolution task or concern, and its interactions between


code, revision history, bug reports, code wikis, and available documentation. Basically, the approach follows two phases. In the first one, the knowledge provided by the sources and related to the concern is extracted and deduced using static analysis methods, data mining and language processing techniques for populating an ontology, which is used in the second phase to complete fixed templates or predefined rules in order to form a text-based summary. Likewise, in [22] code analysis and text mining are applied to specification documents, UML diagrams and user's manuals, with the purpose of building an ontology that allows crossing and matching the semantic knowledge between those elements. So, source code and documents are processed in such a way that entities are identified and extracted to be associated with a class belonging to the ontology. This approach was evaluated through intrinsic co-selection measures, which yielded acceptable levels of precision and recall for the text mining techniques. Nevertheless, the results of the code analysis suggested, once again, that naming conventions are fundamental for the quality of the ontology, and also that disambiguation between parts of code is a desired feature. Another approach which uses several types of software artifacts is [23]. In it, some techniques from the information extraction and natural text processing fields are applied to source code and documentation in order to connect the entities found in both of them, in such a way that the business rules underlying the design and implementation are rebuilt. Thus, after preprocessing the source code, an Abstract Syntax Tree is built and stored in a knowledge database. Then, the collected data are simplified and linked to knowledge documentation. To evaluate the results, the associated keyphrases are compared to other sets of documents used by analysts, through intrinsic measures.

1.4 MAKING SOFTWARE EVOLUTION EASIER: USING SOFTWARE SUMMARIES IN MAINTENANCE ACTIVITIES Maintenance is the most difficult and most extended phase of the software life cycle: according to [24], more than 90% of software costs are spent on maintenance activities. The causes of this situation are related to the inherent properties of modern software systems such as complexity


and changeability, but they are also associated with technical problems (e.g., limited understanding, testing costs, impact analysis, and maintainability) and management issues (e.g., alignment with organizational objectives, staffing, process issues, organizational aspects, and outsourcing) [25].

Table 1.1 Software summarization approaches.

Documentation
- Summarizing bug reports through machine learning classifiers [12]. Input: bug reports. Output: text-based summaries. Evaluation: intrinsic evaluation by co-selection; informal evaluation.

Source code (dynamic approaches)
- Summarizing the content of large traces through routine filtering [19]. Input: large traces. Output: UML sequence diagrams. Evaluation: informal evaluation.
- Documenting program changes by symbolic execution comparison and natural language processing [20]. Input: source code of two versions of a program. Output: textual description of changes in runtime behavior. Evaluation: quantitative and qualitative evaluation.

Source code (static approaches)
- Building reflection models comparing high-level models and source code structural data [17]. Input: source code. Output: reflection models.
- Identifying topics in source code using information retrieval and map visualization [13]. Input: source code. Output: semantic clusters.
- Summarizing source code artifacts applying text retrieval techniques [16, 26]. Input: source code artifacts. Output: term-based summaries. Evaluation: intrinsic evaluation (online and offline).
- Summarizing Java methods by relevance heuristics and templates [15]. Input: source code of methods. Output: text-based summaries. Evaluation: informal evaluation.

Documentation + Source code
- Building lexical source models using keyword matching techniques [17]. Input: source code, data files, documentation. Output: lexical source models.
- Summarizing software concerns applying static analysis, information retrieval and natural language processing techniques [21]. Input: source code, historical repositories, bug repositories, wikis and documentation. Output: text-based summaries.
- Building an ontology linking code and documentation via source code analysis and text semantic mining [22]. Input: source code, specification documents, UML diagrams, user's manuals. Output: ontology linking source code and documents. Evaluation: intrinsic evaluation by co-selection.
- Extracting business rules from code and documentation by information extraction and natural language processing [23]. Input: source code, specification documents. Output: knowledge database.


In order to reduce the effects of these concerns, there are some techniques used specifically for maintenance, which include but are not limited to program comprehension, re-engineering, reverse engineering, impact analysis, and feature/concept location. Such activities require support from the available information about the software, which is mainly found in artifacts. In that sense, a short description of them would be entirely appropriate, because it would reduce the time and effort spent reading, for example, source code or software documentation. Next, some applications of software summarization to certain maintenance activities are mentioned.

1.4.1 Software Comprehension Between 40% and 60% of maintenance effort is dedicated to program understanding [25]. Before making any enhancement, adaptation or correction to a system, the possible changes need to be analyzed to determine where they have to be implemented, and how those modifications will be performed. In an ideal situation, maintainers would be provided with concise, sufficiently explanatory and up-to-date software deliverables to aid software comprehension. However, on many occasions documentation is lengthy, outdated, or in the worst case, nonexistent or of poor quality. The summarization of software documents represents a viable alternative to minimize the documentation overload problem, which is a common concern in document-driven processes (e.g., those based on the Unified Process, such as RUP). In each phase of this type of methodology, a long list of documents intended to help maintain systems is often generated. Although such artifacts are neither the most important nor the most used by maintainers when understanding a system [27], they still denote a rich but undervalued source of information about software, which can be better exploited if a shorter version of the documents, or at least the gist of their content, is provided to the user. As a result, depending on the purpose of the summary, maintainers would be able to

• identify whether the document is useful and worth analyzing more carefully (if the summary is indicative); or

• use it as a substitute for the document (if the summary is informative).


On the other hand, if the problem to face is low-quality or obsolete documentation, source code becomes the only useful artifact for supporting maintenance tasks. Actually, source code is in some cases the only element generated through the development process, and not surprisingly, it is considered by developers as the most important and most used artifact when understanding software systems [27]. However, during software maintenance it is not always possible to read and understand the entire implementation of a system: techniques like scanning, skimming, and detailed reading are commonly used by maintainers when dealing with source code. Briefly:

• scanning is about deciding if a resource is useful or not (e.g., reading only the header of a method), and allows developers to locate relevant information rapidly;

• skimming is getting the gist of a source code artifact without reading each line, and allows developers to grasp its purpose and role in the system quickly; and

• detailed reading is focusing on artifacts of particular interest by reading them thoroughly.

As can be noticed, these techniques are fully integrated with searching and navigation. Furthermore, developers spend more time reading and navigating the source code than writing it [16]. In this sense, source code summarization could generate descriptions whose content offers more information than headers and comments, and whose length is shorter than the chunk of code at hand. Even if reading the summary does not replace detailed reading, it could substitute scanning and skimming tasks. Along these lines, searching and navigation tools can also be supported by summaries.

1.4.2 Reverse Engineering Reverse engineering covers every method, technique and tool whose aim is to recover or acquire knowledge about software systems, in order to support the performance of a software engineering task [28]. This is usually done by identifying software components and their interrelationships, and creating representations of the software at higher levels of abstraction [25]. In this regard, software summarization itself can be considered a reverse engineering technique. As a case in point, the automatic summarization of source code artifacts can be used in documentation and re-documentation processes of software systems [15, 16], by generating the leading comments and keywords of classes or methods. Moreover, the summaries of software artifacts have been proposed to improve traceability link recovery. One of the main challenges in the analysis of candidate links is the great number of different software artifacts that have to be studied. In [29], the application of text processing techniques is proposed for summarizing text-based elements such as requirements, user's manuals, design documents, test cases, bug reports, etc., together with hybrid methods for describing source code. In this way, developers are provided with summaries of software artifacts effective for making decisions about the correctness of links. The summaries for fulfilling this objective are extractive or abstractive depending on the summarized artifact, but they are always intended to be informative.

1.5 TRENDS AND CHALLENGES

The previous background reveals that summarization is a relatively recent concept in the software engineering field. Even though there are currently some approaches attempting to briefly describe certain types of artifacts, there is still a lot of work to do and many obstacles to overcome. The first obvious one is the lack of approaches aiming to summarize single documents like requirements specifications, use cases, technical designs or test cases. Such files are often written in natural language, and their content is organized in formal, semi-standard structures; these features can be exploited using the same techniques applied in natural language processing. Even if they include source code chunks in their content, they can be treated with mixed methods that take into account different properties of discourse. With regard to source code summarization, there are some issues that need to be covered as well. Although acceptable summaries are achieved in [15, 16], both approaches present their own inconveniences. For example, the former assumes that the code follows certain types


of conventions, and a template is proposed for each pattern found, which implies indirect human intervention, extra effort in text generation, and probably the lack of a template for unidentified patterns. On the other hand, the latter approach leaves aside the structural information of the source code, considering an artifact as a bag of words, which is clearly not the case. Even more, static approaches suppose that source code contains good-quality identifiers and comments, and sometimes that it follows certain writing standards. So, the results of static summarizing methods keep a strong correlation with the lexical, syntactic and semantic adequacy of the terms chosen by developers. In this sense, dynamic approaches represent an apparent alternative for summarizing source code, leaving aside those grammatical assumptions by considering its real functionality under arbitrary conditions. But none of the existing dynamic techniques describes the content of a method, class or package yet, probably because this kind of analysis requires extra effort when designing quality execution scenarios. Furthermore, a better alternative could take advantage of the semantic, syntactic and functional properties of source code, by designing a fusion between static and dynamic analyses, similar to the one presented by [20] to describe software changes. Having this in mind, in [6] a proposal is presented to provide a description of the code which is more informative than the header and the leading comments, yet much shorter than the implementation, while capturing the essential information from it. The novelty of this approach is the use of lexical and structural information, complemented with some heuristics based on the parts of the code that developers consider important when describing specific source code units.
Besides, hybrid techniques embody one of the most promising challenges in software summarization, not just because of the value they represent in maintenance activities through the richness of their content (derived from developers' knowledge and captured in different kinds of software artifacts), but also because of the value of algorithms that consider the diverse properties of source code and natural language texts and, beyond that, interrelate their content in some way to generate a more complete and coherent description of artifacts. In this regard, [22, 23] embody well-designed approaches which mix artifacts on the basis of

22 SUMMARIZING SOFTWARE ARTIFACTS: OVERVIEW AND APPLICATIONS

a common language, whereas [21] is so far a proposal for summarizing systems by using several types of artifacts. It is important to state that summarization techniques should tend to be fully automatic, or at least to require minimal human intervention, unless their objective is feedback or customization of the results. In this sense, although the models generated in [17] allow developers to understand the structure of a system, the techniques applied to obtain them require a lot of developer intervention, making the approach unsuitable for maintenance tasks. Regarding the evaluation of software summaries, the lack of formal frameworks becomes evident for both intrinsic and extrinsic types. Until now, the informal assessment of software summarization approaches has made it difficult to compare techniques and results. Just as in Natural Language Processing, techniques which do not involve the subjectivity of evaluators are desired (i.e., offline evaluation). The results presented in [12, 26, 30] suggest that information retrieval metrics are suitable to assess the inner quality of summaries, either of documentation or source code. Even so, evaluation depends on the representation of summaries; if the resulting description is delineated as a diagram or map, ideally the same measures would be adapted to the context in order to facilitate the comparison between approaches. Additionally, research on evaluation requires more studies related to the usefulness of software summaries within typical activities of software processes, besides software comprehension and reverse engineering. Finally, summaries are practical only if they can be easily generated, accessed and used. In that sense, automatic tools for producing summaries should be available within development environments. Moreover, summarizers should allow developers to provide feedback about the summaries in order to improve them by adding or removing details, depending on the needs and expertise of users.
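For intrinsic, offline evaluation, a common low-cost scheme is to compare the terms of a generated summary against a human-written reference using precision and recall, in the spirit of the IR metrics discussed above. A minimal sketch follows; the two summaries are hypothetical.

```python
def precision_recall(generated, reference):
    """Term-overlap precision and recall between a generated and a reference summary."""
    gen, ref = set(generated), set(reference)
    overlap = len(gen & ref)
    precision = overlap / len(gen) if gen else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall

# Hypothetical term-based summaries of the same code unit
generated = ["price", "total", "item", "list"]
reference = ["price", "item", "compute", "total", "order"]
p, r = precision_recall(generated, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Because both scores are computed over term sets, the measure is independent of any evaluator's judgment, which is exactly the "offline" property desired.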
In conclusion, software summarization is an interesting research area in software engineering. Not surprisingly, it has rapidly gained importance and obtained good results at many levels. However, a lot of work remains to be done concerning software artifact coverage, the improvement of summarization techniques, summary representation, the formalization of evaluation, and the availability of summarizers.


REFERENCES

[1] Mani, I. Summarization of Text, Automatic. In: Encyclopedia of Language & Linguistics, (ed.) Brown, K. Elsevier, Oxford: 2006. doi:10.1016/B0-08-044854-2/00957-3.

[2] Steinberger, J. & Ježek, K. Text Summarization: An Old Challenge and New Approaches. In: Foundations of Computational Intelligence, Volume 6, (eds.) Abraham, A., Hassanien, A. E., Leon, D. & Snášel, V., volume 206 of Studies in Computational Intelligence. Springer Berlin / Heidelberg: 2009, 127–149. doi:10.1007/978-3-642-01091-0_6.

[3] Radev, D. R., Hovy, E. & McKeown, K. Introduction to the Special Issue on Summarization. Computational Linguistics, 28(4): 2002; 399–408. ISSN 0891-2017. doi:10.1162/089120102762671927.

[4] Hovy, E. & Lin, C. Y. Automated Text Summarization in SUMMARIST. In: Advances in Automatic Text Summarization, (eds.) Mani, I. & Maybury, M. T. MIT Press: 1999.

[5] Jones, K. Automatic summarising: The state of the art. Information Processing & Management, 43(6): 2007. doi:10.1016/j.ipm.2007.03.009.

[6] Aponte, J. & Moreno, L. Summarizing Source Code Artifacts. In: Encuentro Nacional de Investigación y Desarrollo - ENID, (eds.) Niño, L. & Gómez, A. Universidad Nacional de Colombia: 2010.

[7] Hariharan, S. & Srinivasan, R. Studies on intrinsic summary evaluation. 2: 2010.

[8] Lin, C. ROUGE: A Package for Automatic Evaluation of Summaries. In: Proc. ACL Workshop on Text Summarization Branches Out: 2004, 10.

[9] Nenkova, A. & Passonneau, R. Evaluating Content Selection in Summarization: The Pyramid Method: 2005.


[10] LaToza, T. D., Venolia, G. & DeLine, R. Maintaining mental models: a study of developer work habits. In: ICSE '06: Proceedings of the 28th International Conference on Software Engineering. ACM, New York, NY, USA: 2006, 492–501. doi:10.1145/1134285.1134355.

[11] Ko, A., Myers, B., Coblenz, M. & Aung, H. An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks. IEEE Transactions on Software Engineering, 32(12): 2006; 971–987. doi:10.1109/TSE.2006.116.

[12] Rastkar, S., Murphy, G. C. & Murray, G. Summarizing software artifacts: a case study of bug reports. In: ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering. ACM, New York, NY, USA: 2010, 505–514. doi:10.1145/1806799.1806872.

[13] Kuhn, A., Ducasse, S. & Gîrba, T. Semantic clustering: Identifying topics in source code. Information and Software Technology, 49(3): 2007; 230–243. doi:10.1016/j.infsof.2006.10.017.

[14] Binkley, D. Source Code Analysis: A Road Map. In: 2007 Future of Software Engineering, FOSE '07. IEEE Computer Society, Washington, DC, USA: 2007. ISBN 0-7695-2829-5, 104–119. doi:10.1109/FOSE.2007.27.

[15] Sridhara, G., Hill, E., Muppaneni, D., Pollock, L. & Vijay-Shanker, K. Towards Automatically Generating Summary Comments for Java Methods. In: 25th IEEE/ACM International Conference on Automated Software Engineering: 2010, 43–52. doi:10.1145/1858996.1859006.

[16] Haiduc, S., Aponte, J., Moreno, L. & Marcus, A. On the Use of Automated Text Summarization Techniques for Summarizing Source Code: 2010.

[17] Murphy, G. C. Lightweight structural summarization as an aid to software evolution. PhD thesis, University of Washington: 1996.


[18] Poshyvanyk, D. & Marcus, A. Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code. In: 15th IEEE International Conference on Program Comprehension (ICPC '07). IEEE, Washington, DC, USA: 2007, 37–48. doi:10.1109/ICPC.2007.13.

[19] Hamou-Lhadj, A. & Lethbridge, T. Summarizing the Content of Large Traces to Facilitate the Understanding of the Behaviour of a Software System. In: Proceedings of the 14th IEEE International Conference on Program Comprehension. IEEE Computer Society: 2006. doi:10.1109/ICPC.2006.45.

[20] Buse, R. P. L. & Weimer, W. R. Automatically documenting program changes. In: ASE '10: Proceedings of the IEEE/ACM International Conference on Automated Software Engineering. ACM, New York, NY, USA: 2010, 33–42. doi:10.1145/1858996.1859005.

[21] Rastkar, S. Summarizing software concerns. In: ICSE '10: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering. ACM, New York, NY, USA: 2010, 527–528. doi:10.1145/1810295.1810464.

[22] Witte, R., Li, Q., Zhang, Y. & Rilling, J. Text mining and software engineering: an integrated source code and document analysis approach. IET Software, 2(1): 2008.

[23] Putrycz, E. & Kark, A. Connecting Legacy Code, Business Rules and Documentation, volume 5321. Springer Berlin / Heidelberg: 2008, 17–30.

[24] Koskinen, J. Software Maintenance Costs: 2010. Accessed and verified on 04/10/2011.

[25] IEEE Computer Society. Software Engineering Body of Knowledge (SWEBOK): 2004. Accessed and verified on April 11, 2011.

[26] Haiduc, S., Aponte, J. & Marcus, A. Supporting program comprehension with source code summarization. In: ICSE '10:


Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering, volume 2. ACM, New York, NY, USA: 2010, 223–226. doi:10.1145/1810295.1810335.

[27] de Souza, S. C. B., Anquetil, N. & de Oliveira, K. M. A study of the documentation essential to software maintenance. In: Proceedings of the 23rd Annual International Conference on Design of Communication: Documenting & Designing for Pervasive Information, SIGDOC '05. ACM, New York, NY, USA: 2005. ISBN 1-59593-175-9, 68–75. doi:10.1145/1085313.1085331.

[28] Tonella, P., Torchiano, M., Du Bois, B. & Systä, T. Empirical studies in reverse engineering: state of the art and future trends. Empirical Software Engineering, 12: 2007; 551–571. ISSN 1382-3256. doi:10.1007/s10664-007-9037-5.

[29] Aponte, J. & Marcus, A. Improving Traceability Link Recovery Methods through Software Artifact Summarization. In: Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering (TEFSE 2011): 2011.

[30] Moreno, L., Rodríguez, C. & Aponte, J. Evaluating source-code summaries using Information Retrieval and Text Summarization techniques. In: Tendencias en ingeniería de software e inteligencia artificial, (eds.) Giraldo, G. & Zapata, C. Universidad Nacional de Colombia: 2011.

Survey and Research Trends in Mining Software Repositories

Christian Rodríguez-Bustos
Yury Niño
Jairo Aponte

ABSTRACT

Software repositories such as version control systems, bug tracking systems and archived communication mechanisms are used by software developers to facilitate the tracking of the activities that they perform. In particular, version control systems allow storing and managing changes made to source code and documentation. Bug tracking systems are employed to track the state of issues, and other repositories such as IRC systems are used to discuss topics related to coordination activities. Data contained in these repositories have been used by researchers to support maintenance tasks, improve software designs, understand software development, predict bugs, handle planning aspects and measure the contribution of individuals. Mining Software Repositories (MSR) is a research field that seeks to discover knowledge through the exploration, integration, processing and analysis of data contained in these repositories. This document presents a literature survey on approaches for MSR and various research trends and challenges that require further investigation.

2.1 INTRODUCTION

Since the 1980s, several authors have studied data generated in software projects to understand how software evolves. The most notable results of these studies are software life cycles [1], the laws of software evolution [2], software evolution metrics [3, 4], and the theory of software


evolution [5]. These studies were carried out only on industrial systems which did not have public software repositories; consequently, researchers' efforts were limited to a few companies interested in making good use of their historical data in order to improve the development process. Meir Lehman was one of the pioneers of software evolution theory; he identified the sources of evolution in computer applications and programs and showed that evolution is a never ending process. One of his main contributions was the formulation of the laws of software evolution in the 1980s. They are summarized in Table 2.1.

Table 2.1 Laws of program evolution [1].

Continuing change: This law expresses the fact that large programs are never completed. The change or decay process continues until it is judged more cost effective to replace the system with a recreated version.

Increasing complexity: This law proposes that because an evolving program is continually changed, its complexity reflects a deteriorating structure.

The fundamental law of program evolution: This law proposes that the evolution of a program is subject to a dynamic. It makes the programming process self-regulating with statistically determinable trends and invariances.

Conservation of organizational stability: This law proposes that during the active life of a program, the global activity rate in a programming project is statistically invariant.

Conservation of familiarity: This law describes the fact that during the active life of a program the release content (changes, additions, deletions) of the successive releases of an evolving program is statistically invariant.

Recently, with the success of open source software projects, the data available in public software repositories have grown exponentially¹. These repositories include version control systems, bug tracking systems and archived communication mechanisms. Version control systems allow

¹ http://sourceforge.net, the most important Web site for the development of open source projects, had 2.7 million registered developers and 260,000 projects as of February 2011.


developers to store and manage changes made to source code and documentation. Bug tracking systems are employed to track the state of issues. Other repositories such as Internet Relay Chat (IRC) systems are used to discuss topics related to coordination activities. These repositories are generated and populated during software evolution; consequently, they hold a wealth of information and provide a unique view of the actual evolutionary path taken to realize a software system [6]. In this context, Mining Software Repositories (MSR) emerged as a research field that seeks new knowledge through exploring, integrating, processing and analyzing data contained in software repositories. The goal is to make the acquired knowledge useful for present and future decision making processes [7]. The term MSR has been coined to describe a broad class of related investigations which can be classified according to the type of issue they tackle. In particular, MSR has been used by researchers in various ways, for instance, analyzing software ecosystems and mining repositories across multiple projects, assisting in program understanding and visualization [8–10], predicting the quality of software systems [11–13], studying the evolution of software systems, characterizing software defects [14], discovering patterns of change and refactorings [15, 16], understanding the origins of code cloning [17], measuring the contribution of developers [18–23], and modeling social and development processes [24–26]. The goal of this chapter is to provide a general overview of MSR. This document presents a description of software repositories in Section 2.2; in Section 2.3, an explanation of the phases commonly used in MSR and a description of the functionality of some tools for MSR are presented. The state of the art of research works in MSR is presented in Section 2.4, and finally, some research challenges and trends are mentioned in Section 2.5.

2.2 UNDERSTANDING SOFTWARE REPOSITORIES

Software repositories are artifacts produced and archived during the software life cycle. They are used by software developers to help in managing the progress of software projects. They can be classified into historical


repositories, communications logs, source code, runtime, documentation, and test executions. Figure 2.1 illustrates the classification of software repositories according to the data they store.

[Figure 2.1 is a tree that classifies software repositories by the data they store: historical repositories (centralized version control systems such as SCCS, RCS, CVS, SVN, PVCS, ClearCase and Perforce, and distributed ones such as Bazaar; bug trackers such as Bugzilla, Jira, Mantis and BitKeeper), communications logs (IM conversations, emails, forums), source code sites (SourceForge, Google Code, GitHub), and other kinds of repositories (test results, deployment logs, error messages, build warnings).]

Figure 2.1 Classification of Software Repositories.

2.2.1 Historical Repositories

Historical repositories include version control and bug tracking systems. The following subsections present a general description of both.

Version Control Systems

Version control systems keep a record of the changes made to files over time and have been used by developers for managing source code in terms of revisions, versions or patches. Version control systems also allow developers to create workflows through branching and merging operations [27]. Changes stored in historical repositories generally have detailed information associated with every transaction or commit, namely: date, author, modification performed, comments and extra information [28]. These repositories are classified as centralized or distributed.
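A minimal sketch of how such commit metadata can be extracted programmatically follows. The pipe-separated log format is an assumption standing in for the text a real miner would obtain from its version control system (for Git, e.g., `git log --pretty=format:"%H|%an|%ad|%s"`).

```python
def parse_log(text):
    """Parse a pipe-separated, one-line-per-commit log dump into metadata records.
    The field layout (sha|author|date|message) is an assumed example format."""
    commits = []
    for line in text.strip().splitlines():
        sha, author, date, message = line.split("|", 3)
        commits.append({"sha": sha, "author": author, "date": date, "message": message})
    return commits

# Hypothetical log excerpt
log = (
    "a1b2c3|alice|2011-02-01|Fix null check in parser\n"
    "d4e5f6|bob|2011-02-03|Refactor: extract PriceCalculator\n"
)
for c in parse_log(log):
    print(c["author"], "->", c["message"])
```

Once the transactions are in this record form, the date, author and message fields become the raw material for the mining techniques discussed later in the chapter.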


Centralized repositories, such as CVS², Subversion³, and Perforce⁴, have a single server that contains all the versioned files, allowing clients to check out files from that central place [27]. In distributed repositories, such as Git⁵, Mercurial⁶, Bazaar⁷ or Darcs⁸, clients not only check out the latest snapshot of the files, but mirror the entire central repository or other developers' repositories [29]. This model is explained later in Section 2.5.1.

Bug Tracking Systems

Bug tracking systems such as Bugzilla⁹ or Mantis¹⁰ have been used to store bug reports, requests for enhancements and requests for new features. These repositories are employed by developers for submitting reports and providing feedback about the results of testing activities. Reports are assigned to developers, who fix the corresponding bugs and then mark them as solved. In bug tracking systems, users may add comments and propose source code patches [28].

2.2.2 Communications Logs

In open source software projects, most developers are geographically distributed. Communication mainly occurs through electronic mail, in most cases in the context of electronic mailing lists, forums, or Internet Relay Chat (IRC) systems [28]. These systems are used by developers to ask each other questions and get instant responses. Consequently, the corresponding archives are a rich source of information related to the decisions taken throughout the life of a project.

² CVS project webpage: http://www.nongnu.org/cvs/
³ SVN project webpage: http://subversion.tigris.org
⁴ Perforce project webpage: http://www.perforce.com/
⁵ Git project webpage: http://git-scm.com/
⁶ Mercurial project webpage: http://mercurial.selenic.com/
⁷ Bazaar project webpage: http://bazaar.canonical.com/en/
⁸ Darcs project webpage: http://darcs.net/
⁹ Bugzilla project webpage: http://www.bugzilla.org/
¹⁰ Mantis project webpage: http://www.mantisbt.org/


2.2.3 Source Code

Sites such as SourceForge¹¹, Google Code¹² or GitHub¹³ are web-based source code repositories that allow developers to create open source software projects and manage the development and distribution process. These repositories act as a centralized location for software developers to control and manage projects. Some of these systems provide revision control systems, bug trackers, wikis, metrics, access to databases and unique URLs¹⁴.

2.2.4 Other Kinds of Repositories

Other kinds of less common repositories include documentation, test executions, runtime logs, deployment logs and build reports. These archives contain valuable dynamic and static information about the systems. Although the extraction of knowledge from these archives can be difficult given the unstructured nature of the data contained in them, they are a rich source of information for the operation and maintenance phases.

2.3 PROCESSES OF MINING SOFTWARE REPOSITORIES

In the following subsections, a survey of the methodologies that have been developed or adopted in MSR is presented. A description of the techniques employed in recent MSR studies, as well as an explanation of the functionality of tools proposed in the literature, is also included.

2.3.1 Techniques

Several methodologies have been proposed for solving problems associated with software engineering using MSR. Kagdi et al. [6, 30] classified some methodologies according to the techniques used in studies found in the MSR literature. Table 2.2 presents a description of several popular techniques.

¹¹ SourceForge webpage: http://sourceforge.net/
¹² Google Code webpage: http://code.google.com/
¹³ GitHub webpage: https://github.com/
¹⁴ Unique URLs are useful for publishing projects on the Internet.

Table 2.2 Methodologies used in MSR.

Metadata Analysis: To study the links between metadata (e.g., author, severity, date, id) using regular expressions, heuristics and common subsequence matching.

Static Source Code Analysis: To study changes in individual versions of source code using static program analysis.

Source Code Differencing: To study differences between versions of source code using semantic differencing and analysis of changes in micro patterns.

Software Metrics: To study software metrics that assess aspects of software products by analyzing module size, developers' effort, phase costs, implemented functionality, and software quality, complexity, efficiency, reliability and maintainability.

Visualization: To study computer-based and interactive visual representations of data mined from software repositories in order to amplify cognition and understanding.

Clone Detection Methods: To study source code entities with similar textual, structural and semantic composition using techniques such as text-based and token-based comparison, program dependency graphs and ASTs.

Frequent Pattern Mining: To study metadata, source code data and difference data for frequent patterns through itemset mining and sequential pattern mining.

Information Retrieval Methods: To study traceability, program comprehension and software reuse through classification and clustering of textual units based on various similarity concepts using IR methods.

Classification with Supervised Learning: To study the classification of bug reports or changes using techniques that automatically acquire and integrate knowledge in order to improve performance for a task.

Social Network Analysis: To study the relationships between human entities using social network analysis for measuring contributions, discovering developers' roles and associations between projects' contributors.
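As an illustration of the first technique in Table 2.2, metadata analysis often links commits to bug reports by matching issue identifiers in commit messages with regular expressions. A toy sketch follows; the pattern and messages are hypothetical, and real studies tune the expressions per project.

```python
import re

# Heuristic link recovery between commit messages and bug-tracker issues:
# search the message for patterns like "bug 1234", "bug #1234" or "#1234".
BUG_REF = re.compile(r"(?:bugs?\s*#?|#)(\d+)", re.IGNORECASE)

def linked_bugs(message):
    """Return the issue numbers referenced in a commit message."""
    return [int(m) for m in BUG_REF.findall(message)]

messages = [
    "Fix crash on empty input (bug 4711)",
    "Merge branch feature-x",
    "Address #102 and bug #103 in the scheduler",
]
print([linked_bugs(m) for m in messages])  # [[4711], [], [102, 103]]
```

Such recovered commit-to-issue links are the starting point for many of the defect studies surveyed in Section 2.4.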

2.3.2 Tools

In order to carry out MSR studies, some researchers have developed specific-purpose tools; some of them are shown in Table 2.3. For each tool, a short description of its purpose is given.


Table 2.3 Mining Software Repositories tools.

Alitheia Core: An extensible platform for software quality analysis. It is designed to facilitate software engineering research on large and diverse data sources, by integrating the data collection and preprocessing phases with an array of analysis services; it presents the researcher with an easy-to-use extension mechanism [31, 32].

CVSAnaly: A tool for remotely measuring and analyzing big libre software projects using publicly available data from their version control repositories. It consists of three steps: preprocessing, database insertion, and postprocessing [33].

CVSChecker: A tool designed to analyze the performance of individual developers and the work distribution patterns of teams based on historical source code repository data. It is developed as a plug-in for the Eclipse IDE and assumes CVS as the underlying source code repository [34].

CVSScan: An integrated multiview environment oriented to displaying changing code, in which each version is represented by a column and the horizontal direction is used for time. Separate linked displays show various metrics, as well as the source code itself. A large variety of options is provided to visualize a number of different aspects [35].

DrJones: A system for performing analysis of software archived in a versioned repository. It is used for studying how old the current code base of a software application is [33].

GlueTheos: A system which allows users to retrieve and analyze data from public software repositories. It can access CVS repositories and archives of source packages. It is designed in a highly modularized way. By using external tools it can perform various analyses on the fetched data and produce various kinds of reports [36].

MailingListStats: A tool for analyzing activity in mailing lists. The tool extracts information about authors, dates, subjects and other fields related to email, and stores them in a database [37].

Programeter: A metrics tracking kit for software development. It monitors a set of metrics throughout the software development life cycle and generates full and objective analytics on coding, quality assurance and project management trends.

SoftChange: A tool for the extraction, enhancement and visualization of software trails such as those stored in CVS. It has an extractor, a SQL relational database management system, a fact enhancer and a visualizer [38].

SLOCCount: A tool that performs advanced counting of physical source lines of code. It uses various heuristics to determine the programming language, and then filters comments [39].

Xia: A visualization tool for the navigation and exploration of software version history and associated human activities. It has exploration mechanisms to query the information space. The tool is integrated with Eclipse, an integrated development environment [40].
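The physical-SLOC counting that tools like SLOCCount perform can be illustrated with a greatly simplified sketch that handles only C-style comments; SLOCCount itself applies per-language heuristics and is far more thorough.

```python
import re

def physical_sloc(source):
    """Count non-blank, non-comment lines of C-style source code
    (a simplification of what SLOCCount does per language)."""
    # strip /* ... */ block comments, then // line comments
    no_block = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    count = 0
    for line in no_block.splitlines():
        code = re.sub(r"//.*", "", line).strip()
        if code:
            count += 1
    return count

c_source = """
/* utility */
int add(int a, int b) {
    // sum two ints
    return a + b;
}
"""
print(physical_sloc(c_source))  # 3
```

Only the three lines that actually contain code are counted; the block comment, the line comment and the blank lines are filtered out.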

2.4 PURPOSE OF MINING SOFTWARE REPOSITORIES

Researchers have mined data and metadata gathered from software repositories to extract information pertinent to solving problems

related to system growth, program understanding, prediction of the quality of software systems, refactorings and change patterns, measurement of individual contributions, and the modeling and understanding of social and development processes. In the next subsections, the state of the art of some works related to these problems is presented.

2.4.1 Program Understanding

A good understanding of the software system is needed to reduce the cost and time of maintenance activities. Various research works have been proposed to assist developers in this process. Next, a description of some related works is given.

In 2001, Sayyad et al. [8] described the application of inductive methods to data extracted from both source code and software maintenance records. They extracted the relations that indicated which files in a legacy system were relevant to each other in the context of program maintenance. They proposed a methodology for extracting and evaluating the relations and found that the precision and recall measures increased compared with other research works.

In 2004, Hassan et al. [41] described an approach which recovers valuable information from source control systems and attaches this information to the static dependency graph of a software system. This information is used to help understand the architecture of large software systems. To demonstrate the viability of their approach, they used it to understand the architecture of NetBSD.

In 2007, Canfora et al. [9] presented a technique to track the evolution of source code lines, identifying whether a CVS change is due to line modifications rather than to additions and deletions. The technique compares the sets of lines added and deleted in a change set using vector space models and the Levenshtein distance. Their results indicated that the proposed approach ensures both high precision and high recall.

The improvement of these kinds of techniques could be useful for the development industry because the knowledge about a system could be extracted from the system itself. In this sense there is no dependency on human experts, making work easier for newcomers,


distributed teams or legacy systems. One of the big drawbacks is that there are few generic tools that implement these approaches, restricting their applicability in industry.
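The core matching step of such line-tracking techniques can be sketched as follows: pair each deleted line with its most similar added line by normalized edit distance, and treat pairs under a threshold as modifications. This is a simplified illustration rather than Canfora et al.'s exact algorithm, and the 0.4 threshold is an arbitrary assumption.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match_changed_lines(deleted, added, threshold=0.4):
    """Pair each deleted line with its closest added line; pairs whose
    normalized distance is under the threshold count as modifications,
    the rest as pure deletions/additions."""
    matches = []
    for d in deleted:
        best = min(added, key=lambda a: levenshtein(d, a), default=None)
        if best is not None:
            dist = levenshtein(d, best) / max(len(d), len(best), 1)
            if dist <= threshold:
                matches.append((d, best))
    return matches

deleted = ["int total = 0;", 'printf("done");']
added = ["int total = 1;", 'log_info("finished");']
print(match_changed_lines(deleted, added))
```

Here only the first deleted line is recognized as a modification of an added line; the second pair is too dissimilar and is treated as a deletion plus an addition.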

2.4.2 Prediction of Quality of Software Systems

The participants of software projects spend a large amount of resources on quality assurance, but most of the time these resources are limited. Because of this high cost, managers need to invest resources in those modules which have the highest risk of producing failures. Some research works proposed in this direction are presented below.

In 2006, Knab et al. [11] presented an approach that applies a decision tree learner to evolution data, extracted from the Mozilla open source web browser project, for predicting defect density. The data include different source code, modification, and defect measures, computed from seven recent Mozilla releases. Their experiments showed that a simple tree learner can produce good results with various sets of input data. They found that lines of code have a lower value for predicting defect densities than the number of past bug reports.

In 2007, Morisaki et al. [12] described an empirical study to reveal rules associated with defect correction effort, defined as a quantitative variable, and extended association rule mining to directly handle such quantitative variables. An extended rule describes the statistical characteristics of a ratio or interval scale variable in the consequent part of the rule by its mean value and standard deviation, so that conditions producing distinctive statistics can be discovered. They found that it is necessary to pay attention to types of defects that have a larger mean and standard deviation of effort.

In 2008, Zimmermann et al. [42] researched how code complexity, problem domain, past history, and process quality affect software defects. They performed a case study on five Microsoft projects and observed significant correlations between complexity metrics, both object-oriented (OO) and non-OO, and post-release defects.

In 2007, Panjer et al. [13] explored the viability of using data mining tools to predict the time spent on fixing a bug, having only basic


information about the bug's lifetime. They used a historical portion of the Eclipse Bugzilla database for modeling and predicting bug lifetimes. A bug history transformation process is described and various data mining models are built and tested. They found that an accuracy of 34.9% can be achieved by using primitive attributes associated with a bug.

In 2010, Guo et al. [43] performed an empirical study to characterize the factors that affect which bugs get fixed in Windows Vista and Windows 7. They focused on factors related to bug reports and relationships between the people involved in handling a bug. They built a statistical model to predict the probability of fixing a new bug and found that bugs reported by people with better reputations were more likely to be fixed, as well as bugs handled by people on the same team and working in geographical proximity.

It is well known that the software industry spends a lot of resources on quality assurance of its products. Since these resources are limited, progress in this research field has made it possible to spend them in a more effective way, obtaining the best quality with the lowest risk of failure. Open source project communities have benefited from these results because they have bug tracking systems that contain a wealth of information about software failures, how they occurred, who was affected, and how they were fixed. In previous works, this information has been used to predict future software properties such as where the defects are, how to fix them and their associated cost [42].
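The flavor of such tree-based defect prediction can be conveyed with a toy one-level "decision tree" (a stump) over hypothetical module data. The features, values and labels below are invented for illustration; they merely echo the kind of finding reported by Knab et al., where past bug reports predict better than size.

```python
def best_stump(samples, labels, feature):
    """Find the threshold on one feature that best separates defect-prone
    modules: a one-level decision tree, enough to illustrate the idea."""
    best = (None, -1.0)
    for t in sorted({s[feature] for s in samples}):
        preds = [s[feature] >= t for s in samples]
        acc = sum(p == l for p, l in zip(preds, labels)) / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best

# Hypothetical module data: lines of code, past bug reports, defect-prone?
modules = [
    {"loc": 1200, "past_bugs": 1}, {"loc": 300, "past_bugs": 7},
    {"loc": 2500, "past_bugs": 0}, {"loc": 450, "past_bugs": 5},
    {"loc": 800, "past_bugs": 6}, {"loc": 1700, "past_bugs": 2},
]
defect_prone = [False, True, False, True, True, False]

for feature in ("loc", "past_bugs"):
    t, acc = best_stump(modules, defect_prone, feature)
    print(f"{feature}: split at {t}, accuracy {acc:.2f}")
```

On this invented data the stump on past bug reports separates the modules perfectly while the stump on lines of code does no better than chance, mirroring the relative predictive value described above.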

2.4.3 Discovering Patterns of Change and Refactorings

Change propagation is a central aspect of software development. When developers modify software entities, such as functions or variables, to introduce new features or fix bugs, they must ensure that other entities in the software system are updated to be consistent with these new changes [15]. A description of some research works published in the literature follows.

In 2004, Hassan et al. [15] proposed various heuristics to predict change propagation. They presented a framework to measure the performance of these heuristics and validated the results empirically using


data obtained by analyzing the development history of five large open source software systems.

In 2004, Ying et al. [16] showed how useful change pattern mining is and introduced a set of criteria for evaluating the usefulness of change pattern recommendations. Their approach consisted of three stages: first, data from a software configuration management (SCM) system are extracted; second, the data are preprocessed to be suitable as input for a data mining algorithm; finally, a mining algorithm (based on association rules) is applied to construct change patterns and recommend relevant source files. Although the precision and recall found after the process were not high, the recommendations revealed valuable dependencies that may not be apparent from other existing analyses.
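The association-rule idea behind change-pattern mining can be sketched as follows. This is a minimal illustration, assuming commit "transactions" (sets of files changed together) have already been extracted and preprocessed; the filenames and the confidence threshold are hypothetical:

```python
# Minimal sketch of association-rule co-change mining: from past commit
# transactions, compute confidence for rules "file A changed -> file B
# changed", and recommend files when A is modified.
from collections import Counter
from itertools import permutations

commits = [
    {"parser.c", "parser.h", "lexer.c"},
    {"parser.c", "parser.h"},
    {"lexer.c", "main.c"},
    {"parser.c", "parser.h", "main.c"},
]

pair_count = Counter()   # ordered pairs (antecedent, consequent)
item_count = Counter()   # per-file change counts
for files in commits:
    item_count.update(files)
    pair_count.update(permutations(files, 2))

def recommend(changed_file, min_confidence=0.6):
    """Files that historically co-change with `changed_file`, ranked by
    confidence = count(A and B together) / count(A)."""
    rules = []
    for (a, b), together in pair_count.items():
        if a == changed_file:
            conf = together / item_count[a]
            if conf >= min_confidence:
                rules.append((b, conf))
    return sorted(rules, key=lambda r: -r[1])
```

On this toy history, editing parser.c suggests also updating parser.h with confidence 1.0, the kind of recommendation these tools surface to developers.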

2.4.4 Measuring of the Contribution of Individuals

Measuring and monitoring the contribution of developers is important for managers because human resources are the source of most of the costs. In this sense, some models described in the literature have proposed solutions for visualizing the progress and planning of teams, revealing weaknesses in released products, identifying bottlenecks within working groups, and monitoring the contribution of developers.

In 2006, Amor et al. [18] presented a model for measuring the contribution of developers in free software projects. This model considered human effort as the sum of individual costs: internal costs and external costs. Internal costs covered developers assigned by companies participating in the project, while external costs referred to the effort of third parties, usually volunteers.

In 2006, Amor et al. [19] characterized the coding activity of developers on a software project. They used a methodology for classifying the interactions of developers with versioning systems, based on the analysis of the textual descriptions attached to each transaction. They presented the results of applying this methodology to the FreeBSD CVS repository.

In 2007, Anvik et al. [20] presented an empirical evaluation of two approaches to determine implementation expertise, based on the data


contained in source and bug repositories. They used the "Line 10 Rule" heuristic, which is used in programs such as Expertise Recommender and the Expertise Browser. Expertise sets were created from bug reports and compared to those provided by experts. This comparison was evaluated using precision and recall, and the results showed that both approaches are good at finding all the appropriate developers, although they may vary in how many false positives are returned.

In 2009, Gousios et al. [22] presented a model for measuring the developer's contribution in agile and distributed projects. This model combines traditional contribution metrics with data mined from software repositories; with these data they created clusters of similar projects to extract weights that are then applied to the actions a developer performs on project tasks, in order to obtain a combined measurement of the developer's contribution. The model has been developed as a plug-in to the Alitheia Core software evaluation tool.

In 2010, Niño et al. [44] presented a model to measure the contribution of developers in open source projects. They defined a contribution function associated with each developer; the function is composed of two factors: one associated with the size of the contributions and another associated with their quality. The values are defined on a set of metrics of software size and quality. The size metrics include the number of added, modified, or deleted artifacts, and the quality metrics include the cyclomatic complexity of these artifacts. They implemented a tool named DevMeter that automatically calculates the values of these metrics and the value of the contribution function.

Industry and academia have paid special attention to measuring the contribution of developers.
Human resources are the source of most of the costs, and a good team is essential for effective performance [45]. Unfortunately, only a third of all software companies use techniques to measure their products and development projects [46]; the reason is related to the difficulties associated with measuring non-tangible aspects such as the contribution of the developers. The results obtained in automating productivity measurement will reduce the costs associated with the software production process; it will show


weaknesses in released products and can be used to identify bottlenecks within working groups.
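As a rough illustration of a two-factor contribution function in the spirit of the DevMeter model, the sketch below combines a size factor (artifact counts) with a quality factor (a complexity penalty); the weights, penalty formula, and threshold are illustrative assumptions, not the published model:

```python
# Hedged sketch of a size-times-quality contribution function. The
# coefficients and the complexity threshold below are assumptions made
# for illustration only.

def contribution(added, modified, deleted, mean_complexity):
    """Combine a size factor (counts of touched artifacts) with a quality
    factor that penalizes high mean cyclomatic complexity."""
    size_factor = added + 0.5 * modified + 0.25 * deleted
    # Assumed: no penalty up to complexity 10, hyperbolic decay beyond it.
    quality_factor = 1.0 / (1.0 + max(0.0, mean_complexity - 10))
    return size_factor * quality_factor

# A developer adding many simple artifacts scores higher than one adding
# the same number of highly complex ones.
simple = contribution(added=10, modified=4, deleted=0, mean_complexity=5)
complex_ = contribution(added=10, modified=4, deleted=0, mean_complexity=30)
```

The multiplicative form captures the intuition that a large but low-quality contribution should not outrank a smaller, cleaner one.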

2.4.5 Modeling Social and Development Processes

The theory of complex networks is based on representing complex systems as graphs. In software engineering, this approach has been successfully used for modeling social processes.

In 2004, Lopez et al. [47] represented a network in terms of actors: each vertex was associated with a particular person, and two vertices were linked when they belonged to the same group of people. They also represented the network in terms of groups: each vertex is associated with a group, and two groups are linked through an edge when there is at least one person belonging to both at the same time. They used this approach to characterize open software projects, their evolution over time, and their internal structure.

In 2004, Crowston et al. [24] examined 120 project teams from SourceForge, representing a wide range of FLOSS project types, for their communications centralization as revealed in the interactions in the bug tracking system. They found that FLOSS development teams vary widely in their communications centralization, from projects completely centered on one developer to projects that are highly decentralized and exhibit a distributed pattern of conversation between developers and active users.

In 2005, Ohira et al. [48] developed a tool named Graphmania to collect data about projects and developers at SourceForge, and to visualize the relationships among them using collaborative filtering and social network techniques. They performed a case study applying Graphmania to F/OSS15 project data collected from SourceForge; they found a common practice in which similar projects have similar project names, and they showed the benefits of knowing the developer's neighborhoods.

In 2005, Huang et al. [49] used a representative model named Legitimate Peripheral Participation (LPP) for describing the interactions in the open source development process. They divided developers and

15Acronym for free and open-source software.


modules into groups according to the revision histories of the open source software repository, dividing the modules into kernel and non-kernel types. Their results showed processes of relative importance on the graph constructed from the project development. The graph revealed certain subtle relationships in the interactions between core and non-core team developers, and the interfaces between kernel and non-kernel modules.

In 2007, Yu et al. [25] applied clustering techniques to data gathered from CVS repositories, in order to determine whether a developer belongs to the core or not. They discovered that for small (14 members) or medium (55 members) teams, the core was formed by 5 and 11 developers, respectively. They remarked that there is no strong relation between core and non-core developers.

In 2010, Yuan et al. [50] mined 7779 projects hosted on SourceForge to understand the structure of OSS in terms of the relationships between the roles involved in the development process; the goal was to analyze the role structure around those projects in order to measure them quantitatively. The authors discovered that the number of roles involved in a project is related to the rank of the project inside the community; in this way, they mentioned that analyzing the role structure of an OSS project leads to an OSS evaluation. They also found that specific roles are more suitable for interacting with certain other roles, and that projects with more established roles tend to be better ranked inside the SourceForge community.

In the software engineering field, people are responsible for the success or the failure of a project; in this sense, analyzing the invisible relationships between team members should be a priority. Factors such as code contribution, roles, communication flows, or code ownership are a few examples of the kinds of invisible relationships and information that can be extracted using MSR techniques.
The improvement, implementation, and adoption of techniques that allow team leaders to calculate this kind of metrics could provide valuable data for organizing, measuring, sponsoring, or encouraging team members.
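The two network views used by Lopez et al., an actor network (two people linked when they share a group) and a group network (two groups linked when they share a person), can be sketched as follows; the group memberships below are hypothetical:

```python
# Sketch of the actor and group network representations described above.
# Group names and member names are hypothetical.
from itertools import combinations

groups = {
    "core":    {"ana", "bob"},
    "docs":    {"bob", "carla"},
    "website": {"dave"},
}

def actor_edges(groups):
    """Undirected edges between people who belong to the same group."""
    edges = set()
    for members in groups.values():
        edges.update(frozenset(p) for p in combinations(sorted(members), 2))
    return edges

def group_edges(groups):
    """Undirected edges between groups that share at least one person."""
    edges = set()
    for g1, g2 in combinations(groups, 2):
        if groups[g1] & groups[g2]:  # at least one shared member
            edges.add(frozenset((g1, g2)))
    return edges
```

Here "bob" bridges the core and docs groups, so the two views respectively expose a person-to-person link (ana-bob, bob-carla) and a group-to-group link (core-docs), the raw material for the centralization and neighborhood analyses cited above.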


2.5 TRENDS AND CHALLENGES

The previous section showed that MSR is a multi-purpose research field related to the software evolution area, but there are new challenges and opportunities that researchers have started to work on. Some of them are focused on making the mining process easier, researching new data sources, or simplifying access to MSR techniques. These challenges and opportunities are presented in the next paragraphs.

2.5.1 Thinking in Distributed Version Control Systems

All the works mentioned in Section 2.4, independently of which problem they addressed, were based on data extracted from centralized version control systems (CVCSs) such as SVN or CVS, but we think that new opportunities are emerging with the growth of distributed version control systems (DVCSs). During the last 10 years several DVCSs, such as Git, Bazaar, and Mercurial, have been developed, and many of them are still undergoing rapid evolution and gaining adopters; this is highlighted by the adoption of the technology by large and important open source projects, such as the Mozilla Project, the Linux Kernel Project, MySQL, Perl, and the Eclipse Foundation, and is evidence of the current importance of this new flavor of version control systems.

From a technical point of view, the most important feature of DVCSs is that they do not have a fixed client-server architecture, allowing individual developers to be servers or clients depending on particular circumstances, as in peer-to-peer models. In this way, developers can work on source code without being connected to a central repository, and therefore most of the operations are performed faster since no network is involved.

From a conceptual point of view, DVCSs work in terms of changes instead of versions, forcing developers to think about the changes themselves as a first-class concept within their version control operations. The major argument behind this crucial variation is that when someone manages changes (patches) instead of versions, merging works better, and therefore developers can branch any time they need, because going back is easier, as is mentioned in [51] and in the technical talk16 offered in 2007 by Linus Torvalds, the creator and leader of the Git project. In this regard, David Miller, a senior Linux developer from the Red Hat project, mentioned two years after the migration of the Linux Kernel to BitKeeper (another DVCS): "Before BitKeeper I was in hell every time a new kernel was released. Now I do real work instead of wasting time on repeated merging."

Regarding team structure, DVCSs allow large teams, such as those found in some OSS projects, to easily create, implement, or switch between different workflow models. The classic central model, the "dictator and lieutenant" model, and the "integration manager" model are examples of the possibilities brought by the use of DVCSs [27]. These models are shown in Figure 2.2.

[Figure: diagrams of the central model (developers sharing one repository), the integration manager model (developers with public and private repositories, an integrator, and a blessed repository), and the dictator and lieutenant model (developers, lieutenants, a dictator, and a blessed repository).]

Figure 2.2 Development workflow models. Adapted from [27].

Recently, many open and closed source projects have proposed to migrate, or have already migrated, their repositories from a CVCS to a DVCS [52]. Researchers are starting to find out the rationale behind the migration decisions and to determine the consequences for the organization

16Tech talk: Linus Torvalds on Git: http://www.youtube.com/watch?v=4XpnKHJAok8


and activities of the development team. Besides, they think this new generation of version control systems comes with the promise of new data, which will lead to new research questions related to how DVCSs affect the processes, products, and people around software projects.

The promises and perils of DVCSs are mentioned in [52], but this time the authors look at the problem from a research point of view. They observe that researchers could recover more history from repositories, since any developer can create her own repository or branch. Those local workspaces contain all the information about merges, branches, and commits done in the DAGs17 of multiple repositories. It is also possible to recover the information related to the original committer or reviewer of a change using information stored in each patch, in contrast to centralized repositories, where only authorized committers (the ones who have write permissions) appear as code contributors to projects. Technically, this new way of managing historical data brings two benefits to MSR researchers: first, distributed repositories are smaller in size than centralized ones but contain more information about contributions and workflows; second, the extraction of data is faster since a connection with a remote repository is not required. It is also important to mention that these kinds of version control systems have been little explored to date, making them an attractive research area.
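The extra history available in these patch DAGs can be illustrated with a minimal sketch: given commit-to-parent links, where merge commits have two parents, all ancestors reachable from any branch head can be recovered. The commit identifiers below are hypothetical:

```python
# Sketch of history recovery over a commit DAG. Each commit maps to its
# parent commits; a merge commit has more than one parent. Ids are
# hypothetical.

parents = {
    "a": [],           # initial commit
    "b": ["a"],
    "c": ["a"],        # branch point: b and c diverge from a
    "m": ["b", "c"],   # merge commit joining both branches
}

def ancestors(commit):
    """All commits reachable from `commit` by following parent links."""
    seen, stack = set(), [commit]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def is_merge(commit):
    """Merge commits are the points where branch histories are joined."""
    return len(parents[commit]) > 1
```

In a CVCS, only the history of the central line is visible; in a DVCS, every cloned repository carries a DAG like this, so branch and merge structure is itself minable data.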

2.5.2 Integrating and Redesigning Repositories

Some MSR studies, for example those which intend to support maintenance tasks, need to link different data sources. Section 2.2 mentioned that data generated along the evolution of software products can be stored in several kinds of repositories; this fact may cause a loss of traceability between related artifacts. For instance, a bug report stored in a bug tracker and its respective code fix stored in a version control system have an evident relationship that cannot be handled by most of these systems. This problem has been addressed by researchers in order to develop mechanisms for linking data between

17DAG: acronym for Directed Acyclic Graph; here it refers to the relationships between different patches.


repositories; most of them have focused their efforts on relating bug reports and source code patches.

In 2009, Bachmann and Bernstein [53] presented an automatic method for linking bug reports stored in Bugzilla with source code patches stored in SVN, using XML parsers and regular expression analyzers to search for bug report identifiers in commit logs; they mentioned several problems they had to tackle in the extraction, processing, and validation processes. As a result, they found their proposal had a higher precision than other authors' techniques. We think that one of the problems of this kind of study is the dependency on the "good practices" of the development teams, because the content of commit logs can change from one team to another; if two teams describe commits in different ways, the linking technique may get better or worse results [53].

In a similar way as Herraiz, Robles, and Gonzalez-Barahona expressed in [29], we think that MSR researchers could attack the linking problem by adapting or extending current open source repository systems (or developing new ones) to support automatic, or at least assisted, linking between data stored in different repositories, e.g., bug reports, source code artifacts, IRC logs, documentation, etc.
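The regular-expression side of this linking technique can be sketched as follows; the identifier pattern and the log messages are hypothetical, and, as noted above, real teams' commit conventions vary, which is exactly what makes the approach fragile:

```python
# Sketch of regex-based bug-to-commit linking: scan commit log messages
# for bug-tracker identifiers and link each commit to the bug reports it
# mentions. The pattern and messages are hypothetical examples.
import re

# Matches forms like "Bug 4211", "issue 4211", or "#4211" (case-insensitive).
BUG_ID = re.compile(r"(?:bug|issue|#)\s*(\d+)", re.IGNORECASE)

def link_commits(commit_logs):
    """Map commit id -> set of bug ids mentioned in its log message."""
    links = {}
    for commit_id, message in commit_logs.items():
        ids = {int(m) for m in BUG_ID.findall(message)}
        if ids:
            links[commit_id] = ids
    return links

logs = {
    "r101": "Fix null pointer, closes Bug 4211",
    "r102": "Refactor parser (no ticket)",
    "r103": "issue 4211 and #4300: update docs",
}
```

Commits like "r102" that mention no identifier stay unlinked, which is why assisted linking and repository-level support, as argued above, would improve recall.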

2.5.3 Simplifying MSR Techniques

Table 2.3 shows that there are tools developed for solving specific MSR tasks; it is possible to find tools that help with measuring contribution, visualizing historical data, or measuring software quality. For the tools mentioned before, plus four additional ones cited by Hassan in [7], we determined the last source code activity, the availability of source code and of binaries or runnable scripts, and whether the tools are web based or IDE plug-ins. These data are shown in Table 2.4.

We find that most of the projects are not in active development; only 5 of 13 have been updated in the last year, and 2 do not have a web page. Also, tools are commonly offered as source code or binary scripts; this can be considered a usage limitation, because only experienced users would be able to use them freely. On the contrary, we


found that DevMeter and MyLyn (both of them under active development) are implemented as a web tool and an IDE plug-in, respectively; this could facilitate the adoption of MSR techniques by non-researcher users.

Table 2.4 Current state of MSR tools.

Tool                     | Last Update | Source Code | Binary or Scripts | Web Tool | IDE Integration | Web Page
Alitheia Core            | 2011        | yes         | no                | yes      | no              | yes
CVSAnaly                 | 2010        | yes         | yes               | no       | no              | yes
CVSScan                  | 2005        | no          | yes               | no       | no              | yes
DevMeter                 | 2010        | yes         | yes               | yes      | no              | yes
DrJones                  | no info     | no info     | no info           | no info  | no info         | no info
eRose                    | 2005        | no          | no                | no       | Eclipse         | yes
GluTeos                  | 2007        | yes         | no                | no       | no              | yes
HATARI                   | no info     | no          | no                | no       | Eclipse         | yes
Hipikat                  | 2005        | no          | no                | no       | Eclipse         | yes
MailingListStats         | 2007        | yes         | no                | no       | no              | yes
Mylar (MyLyn)            | 2011        | yes         | yes               | no       | Eclipse         | yes
Programeter (Commercial) | no info     | no          | no                | yes      | no              | yes
SLOCCount                | 2006        | yes         | yes               | no       | no              | yes
SoftChange               | 2003        | yes         | yes               | no       | no              | yes
XIA                      | no info     | no info     | no info           | no info  | no info         | no info

As Hassan mentions in [7], researchers must think in terms of users, taking into account the usability and accessibility of their tools, because MSR techniques must help team members to solve real-life problems (solving bugs, understanding systems, measuring quality, etc.), not only research ones. Therefore, new efforts in this field must be oriented toward facilitating access to MSR techniques by integrating them into IDEs, or offering them as web tools for fast and easy access by developers, team leaders, and users.

2.6 SUMMARY

This chapter presented a short review of the main concepts of mining software repositories (MSR); this field was born to support evolution and maintenance tasks by mining data stored in the software repositories of development projects. The most common repositories can be categorized as historical, communication logs, source code, or other repositories (this categorization depends on the data they store).


Some of the problems related to mining large software repositories are the preprocessing, extraction, and analysis tasks, because they involve processing the large amounts of data generated during the software life cycle; some researchers have tackled this problem by developing tools or by designing techniques that facilitate mining processes. Examples of these techniques are metadata analysis, static source code analysis, and social network analysis.

Also, three main research trends and opportunities were presented. First, the popularization and adoption of distributed version control systems by large open source projects highlight the emerging importance of this technology, which promises new data and technical benefits for developers and MSR researchers. Second, the limitations of current repository systems need to be addressed by developing extensions or creating new systems that support the mining process naturally. The last trend highlights the importance of simplifying access to MSR techniques for researchers and non-researchers; this could be done by implementing tools as web-based tools and as IDE plug-ins, bringing the benefits of this kind of studies to new users. Finally, MSR is an active field of research which can bring multiple benefits to software evolution researchers and to new or active software developers.


REFERENCES

[1] Lehman, M. Programs, Life Cycles, and Laws of Software Evolution. In: Proceedings of the IEEE: 1980.

[2] Lehman, M. On Understanding Laws, Evolution and Conservation in the Large Program Life Cycle. Journal of Systems and Software, 1: 1980; 213–221.

[3] Lehman, M., Ramil, J. F., Wernick, P. D., Perry, D. E. & Turski, W. M. Metrics and Laws of Software Evolution - The Nineties View. In: Proceedings 4th International Software Metrics Symposium (METRICS '97): 1997.

[4] Ramil, J. F. & Lehman, M. Metrics of Software Evolution as Effort Predictors - A Case Study. In: Proc. Int. Software Maintenance Conf.: 2000, 163–172. doi:10.1109/ICSM.2000.883036.

[5] Lehman, M. & Ramil, J. F. An Approach to a Theory of Software Evolution. In: Proceedings of International Workshop on Principles of Software Evolution (IWPSE'01), Vienna, Austria: 2001.

[6] Kagdi, H., Collard, M. L. & Maletic, J. I. A Survey and Taxonomy of Approaches for Mining Software Repositories in the Context of Software Evolution. J. Softw. Maint. Evol., 19(2): 2007; 77–131. ISSN 1532-060X. doi:http://dx.doi.org/10.1002/smr.344.

[7] Hassan, A. E. The Road Ahead for Mining Software Repositories. In: Proc. FoSM 2008. Frontiers of Software Maintenance: 2008, 48–57. doi:10.1109/FOSM.2008.4659248.

[8] Sayyad, J. & Lethbridge, C. Supporting Software Maintenance by Mining Software Update Records. In: ICSM '01: Proceedings of the IEEE International Conference on Software Maintenance (ICSM'01). IEEE Computer Society, Washington, DC, USA: 2001. ISBN 0-7695-1189-9, 22.


[9] Canfora, G., Cerulo, L. & Penta, M. D. Identifying Changed Source Code Lines from Version Repositories. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[10] Hassan, A. E. Mining Software Repositories to Assist Developers and Support Managers. Doctoral thesis, University of Waterloo, Ontario, Canada: 2004.

[11] Knab, P., Pinzger, M. & Bernstein, A. Predicting Defect Densities in Source Code Files with Decision Tree Learners. In: Proceedings of the 2006 International Workshop on Mining Software Repositories: 2006.

[12] Morisaki, S., Monden, A., Matsumura, T., Tamada, H. & Matsumoto, K. Defect Data Analysis Based on Extended Association Rule Mining. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[13] Panjer, L. Predicting Eclipse Bug Lifetimes. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[14] Lamkanfi, A., Demeyer, S., Giger, E. & Goethals, B. Predicting the Severity of a Reported Bug. In: Proc. 7th IEEE Working Conf. Mining Software Repositories (MSR): 2010, 1–10. doi:10.1109/MSR.2010.5463284.

[15] Hassan, A. E. & Holt, R. C. Predicting Change Propagation in Software Systems. In: ICSM '04: Proceedings of the 20th IEEE International Conference on Software Maintenance. IEEE Computer Society, Washington, DC, USA: 2004. ISBN 0-7695-2213-0, 284–293.

[16] Ying, A. T. T., Murphy, G. C., Ng, R. & Chu-Carroll, M. C. Predicting Source Code Changes by Mining Change History. IEEE Trans. Softw. Eng., 30(9): 2004; 574–586. ISSN 0098-5589. doi:http://dx.doi.org/10.1109/TSE.2004.52.


[17] Krinke, J., Gold, N., Yue, J. & Binkley, D. Cloning and Copying Between GNOME Projects. In: Proc. 7th IEEE Working Conf. Mining Software Repositories (MSR): 2010, 98–101. doi:10.1109/MSR.2010.5463290.

[18] Amor, J., Robles, G. & González, J. Effort Estimation by Characterizing Developer Activity. In: 8th Workshop on Software Engineering Economics, Shanghai, China: 2006.

[19] Amor, J., Robles, G., González, J. & Navarro, A. Discriminating Development Activities in Versioning Systems: A Case Study. In: Proceedings PROMISE 2006: 2nd International Workshop on Predictor Models in Software Engineering: 2006.

[20] Anvik, J. & Murphy, G. Determining Implementation Expertise from Bug Reports. In: Proceedings of the Fourth International Workshop on Mining Software Repositories: 2007.

[21] Linstead, E., Rigor, P., Bajracharya, S., Lopes, C. & Baldi, P. Mining Eclipse Developer Contributions via Author-Topic Models. In: Proc. Fourth Int. Workshop Mining Software Repositories ICSE Workshops MSR '07: 2007. doi:10.1109/MSR.2007.20.

[22] Kalliamvakou, E., Gousios, G., Spinellis, D. & Pouloudi, N. Measuring Developer Contribution from Software Repository Data. In: 4th Mediterranean Conference on Information Systems: 2009, 129–132.

[23] Koch, S. & Schneider, G. Effort, Cooperation and Coordination in an Open Source Software Project: GNOME. Information Systems Journal, 12: 2002; 27–42.

[24] Crowston, K. & Howison, J. The Social Structure of Free and Open Source Software Development. First Monday, 10(2): 2005; 1–21.

[25] Yu, L. & Ramaswamy, S. Mining CVS Repositories to Understand Open-Source Project Developer Roles. In: Proc. Fourth Int. Workshop Mining Software Repositories ICSE Workshops MSR '07: 2007, 8. doi:10.1109/MSR.2007.19.

[26] Shihab, E., Jiang, Z. M. & Hassan, A. E. On the Use of Internet Relay Chat (IRC) Meetings by Developers of the GNOME GTK+ Project. In: Proc. 6th IEEE Int. Working Conf. Mining Software Repositories MSR '09: 2009, 107–110. doi:10.1109/MSR.2009.5069488.

[27] Chacon, S. Pro Git. Apress, Berkeley, CA, USA: 2009. ISBN 1430218339, 9781430218333.

[28] Fogel, K. Producing Open Source Software. O'Reilly: 2005.

[29] Herraiz, I., Robles, G. & González, J. M. Research Friendly Software Repositories. In: IWPSE-Evol '09: Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) Workshops. ACM, New York, NY, USA: 2009. ISBN 978-1-60558-678-6, 19–24. doi:http://doi.acm.org/10.1145/1595808.1595814.

[30] Kagdi, H. Mining Software Repositories to Support Software Evolution. Doctoral thesis, Kent State University: 2008.

[31] Gousios, G. & Spinellis, D. Alitheia Core: An Extensible Software Quality Monitoring Platform. In: Proceedings of the 31st International Conference on Software Engineering, Research Demos Track: 2009.

[32] Georgios, G. & Diomidis, S. A Platform for Software Engineering Research. In: MSR '09: Proceedings of the 6th Working Conference on Mining Software Repositories: 2009.

[33] Robles, G. Empirical Software Engineering Research on Libre Software: Data Sources, Methodologies and Results. Doctoral thesis, Escuela Superior de Ciencias Experimentales y Tecnología, Universidad Rey Juan Carlos: 2005.

[34] Liu, Y., Stroulia, E. & Erdogmus, H. Understanding the Open-Source Software Development Process: A Case Study with CVSChecker. In: Proceedings of the First International Conference on Open Source Systems: 2005, 11–15.

[35] Voinea, L., Telea, A. & van Wijk, J. J. CVSscan: Visualization of Code Evolution. In: Proceedings of the 2005 ACM Symposium on Software Visualization. ACM, St. Louis, Missouri: 2005. ISBN 1-59593-073-6, 47–56. doi:10.1145/1056018.1056025.

[36] Robles, G., González, J. M. & Rishab, G. GlueTheos: Automating the Retrieval and Analysis of Data from Publicly Available Software Repositories. In: Proceedings 1st International Workshop on Mining Software Repositories: 2004, 20–32.

[37] Herraiz, I., Robles, G., Amor, J., Teofilo, R. & González, J. M. The Processes of Joining in Global Distributed Software Projects. In: Proceedings of the 2006 International Workshop on Global Software Development for the Practitioner. ACM, Shanghai, China: 2006. ISBN 1-59593-404-9, 27–33. doi:10.1145/1138506.1138513.

[38] German, D. Mining CVS Repositories, the SoftChange Experience. IEE Seminar Digests, 2004(917): 2004; 17–21. doi:10.1049/ic:20040469.

[39] Amor, J., Robles, G., González, J. M. & Herraiz, I. From Pigs to Stripes: A Travel Through Debian. In: DebConf5 (Debian Annual Developers Meeting), Helsinki, Finland: 2005.

[40] Wu, X. Visualization of Version Control Information. Undergraduate thesis, University of Victoria: 2003.

[41] Hassan, A. E. & Holt, R. C. Using Development History Sticky Notes to Understand Software Architecture. In: IWPC '04: Proceedings of the 12th IEEE International Workshop on Program Comprehension. IEEE Computer Society, Washington, DC, USA: 2004. ISBN 0-7695-2149-5, 183.

[42] Zimmermann, T., Nagappan, N. & Zeller, A. Software Evolution, chapter Predicting Bugs from History. Springer: 2008, 69–88.


[43] Guo, P., Zimmermann, T., Nagappan, N. & Murphy, B. Characterizing and Predicting Which Bugs Get Fixed: An Empirical Study of Microsoft Windows. In: Proceedings of the 32nd International Conference on Software Engineering (ICSE 2010): 2010.

[44] Niño, Y. & Aponte, J. DevMeter: Una Herramienta que Mide la Contribución de los Desarrolladores: 2010.

[45] Winter, M. Developing a Group Model for Student Software Engineering Teams. Undergraduate thesis, University of Saskatchewan: 2004.

[46] Ebert, C., Dumke, R., Bundschuh, M. & Schmietendorf, A. Best Practices in Software Measurement. Springer, 1st edition: 2004. ISBN 3540208674.

[47] López, L., Robles, G. & González, J. Applying Social Network Analysis to the Information in CVS Repositories. IEE Seminar Digests, 2004(917): 2004; 101–105. doi:10.1049/ic:20040485.

[48] Ohira, M., Ohsugi, N., Ohoka, T. & Matsumoto, K. Accelerating Cross-project Knowledge Collaboration Using Collaborative Filtering and Social Networks. SIGSOFT Softw. Eng. Notes, 30: 2005; 1–5. ISSN 0163-5948. doi:http://doi.acm.org/10.1145/1082983.1083163.

[49] Huang, S.-K. & Liu, K. M. Mining Version Histories to Verify the Learning Process of Legitimate Peripheral Participants. In: Proceedings of the 2005 International Workshop on Mining Software Repositories, MSR '05. ACM, New York, NY, USA: 2005. ISBN 1-59593-123-6, 1–5. doi:http://doi.acm.org/10.1145/1082983.1083158.

[50] Yuan, L., Wang, H., Yin, G., Shi, D. & Mi, H. Mining Roles of Open Source Software. In: Proc. 2nd Int. Software Engineering and Data Mining (SEDM) Conf.: 2010, 548–554.

[51] Ruparelia, N. B. The History of Version Control. SIGSOFT Softw. Eng. Notes, 35(1): 2010; 5–9. ISSN 0163-5948. doi:http://doi.acm.org/10.1145/1668862.1668876.

[52] Bird, C., Rigby, P., Barr, E., Hamilton, D., German, D. & Devanbu, P. The Promises and Perils of Mining Git. In: Proc. 6th IEEE Int. Working Conf. Mining Software Repositories MSR '09: 2009, 1–10. doi:10.1109/MSR.2009.5069475.

[53] Bachmann, A. & Bernstein, A. Data Retrieval, Processing and Linking for Software Process Data Analysis. Technical report, University of Zurich, Department of Informatics: 2009.

Software Visualization to Simplify the Evolution of Software Systems

David Montaño Leslie Solorzano Henry Roberto Umaña-Acosta

ABSTRACT
Software Visualization can be used to address the problems associated with the evolution of software systems. It offers a wide set of possibilities, as has been shown since the first appearance of visualization in the early 80s. Based on the progress made in the area, this chapter presents the different techniques developed in the past, as well as the most novel solutions. However, despite the efforts of researchers, the field still needs to face new challenges posed by the software industry, such as new programming paradigms.

3.1 INTRODUCTION
The size and complexity of software systems make understanding and maintenance tasks difficult to complete. Moreover, these systems are abstract entities that lack an inherent representation understandable by humans. Software Visualization (SV), through the use of various kinds of imagery techniques, provides strategies to facilitate understanding and reduce the apparent complexity of a software system [1]. Specifically, the challenge of Software Visualization is to find effective mappings between software aspects and their graphical representations, using visual metaphors [1] that help users understand the software system.


Considering this, three main sources of information have been defined: static, dynamic, and evolutionary history [2]. These sources have been used in the development of several visualization tools, most of them based on graphs, geometrical figures, colors, and combinations of them. These tools include Tarantula [3], Structure101, CodeCrawler [4], SeeSoft [5], X-Ray [6], STAN, CVSscan [7] and EPOSee, among others. These sources of information have also been used in the development of three-dimensional tools, such as sv3D [8], Vizz3D [9] and CodeCity [10]. Despite the favorable and promising aspects of three-dimensional space, such as less error-prone perception [11] and a more direct mapping between software artifacts and their representation, no system has yet been designed that users feel comfortable with.

3.2 BACKGROUND ON SOFTWARE VISUALIZATION
As stated before, software visualization aims to support the tasks associated with the development of software systems, mainly understanding and maintenance tasks. Given this goal, SV fits naturally within the software evolution research field, whose objective is to study the phenomenon of an evolving system that changes according to external factors, such as the environment, technologies, and business modeling. This section gives an overview of what has been done in the SV area, as well as how it helps the software evolution process.

3.2.1 How Software Visualization Supports Software Evolution Tasks
Mens et al. proposed in [12] a set of challenges associated with software evolution, and SV systems are able to contribute to at least seven of them in several ways. First, helping to preserve and improve software quality, since SV tools can support, for example, understanding processes and error-fixing tasks. Second, supporting model evolution by visualizing different software artifacts, thereby allowing developers to embrace the underlying model in a more practical way, and providing different views (like filtering mechanisms) that promote better modeling of the system. Third, supporting multi-language systems by using metaphors that are independent of the information source formats. Fourth, increasing managerial awareness by exposing views easily understandable by managers and stakeholders. Fifth, integrating data from various sources and, in this way, providing a more complete view of the system. Sixth, analyzing huge amounts of data by using 3D systems capable of displaying more information. Seventh, assisting software evolution training and teaching by making it easier to highlight concepts when they are seen by students [13].

1 http://www.headwaysoftware.com/products/structure101/index.php
2 http://stan4j.com/
3 http://www.st.uni-trier.de/eposoft/eposee/index.html

3.2.2 The Software Visualization Pipeline
The visualization process is divided into three main stages, namely: information extraction, analysis, and visualization. This process can be applied in several software areas, such as data mining, software repository analysis, and reverse engineering. Given the basic elements of this process, SV is easily extendable to support other tasks. However, its effectiveness will depend on the metaphor chosen, as will be explained in the next sections. Each step in the visualization pipeline is explained in Table 3.1.

3.2.3 Overview of Visualization Tools
Since the first appearance of SV in algorithm visualization [14], hundreds of tools have been developed whose intention is to provide a simple mechanism for understanding a software system. Table 3.2 briefly describes a set of tools developed in the area. It highlights the wide spectrum of these tools: they range from 2D to 3D spaces and from graph representations to animations, and they cover static, dynamic, and evolutionary sources of information.


Table 3.1 Steps in the visualization pipeline.

Information extraction: This step is in charge of taking the different sources of information (static, dynamic, or evolutionary), processing them, and finally getting them ready for the next step.

Analysis: Once the information is loaded, it needs to be processed (e.g., calculating metric values) in order to prepare a high-level view of the raw data that the user can study and understand easily.

Visualization: Once the information is analyzed, it needs to be rendered to the user. At this point the visualization tool needs to define a graphical metaphor that correctly represents the underlying information.
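As a minimal sketch (the file extension, metric, and textual "bar" metaphor are all illustrative assumptions, not part of any tool described here), the three pipeline steps can be expressed as plain functions: extraction reads raw source files, analysis computes a metric, and visualization maps metric values onto a simple metaphor:

```python
import os

def extract(root):
    """Information extraction: collect raw static data (source files and their lines)."""
    data = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(".py"):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    data[path] = f.readlines()
    return data

def analyze(data):
    """Analysis: derive a high-level metric (here, lines of code per file)."""
    return {path: len(lines) for path, lines in data.items()}

def visualize(metrics, width=50):
    """Visualization: map each metric value onto a simple textual bar metaphor."""
    top = max(metrics.values()) if metrics else 1
    for path, loc in sorted(metrics.items(), key=lambda kv: -kv[1]):
        bar = "#" * max(1, int(width * loc / top))
        print(f"{path:40s} {bar} {loc}")

# The pipeline chained end to end (hypothetical source directory):
# visualize(analyze(extract("src/")))
```

The point of the sketch is the separation of stages: a different metaphor only replaces `visualize`, and a different source of information only replaces `extract`.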

3.2.4 Sources of Information Commonly Used
As shown in Table 3.2, the applications developed in the area of SV are based on three main sources of information:

• The static source of information refers to data that can be extracted without running the program; therefore, no runtime information is included. The main source of static information is the source code, but documents, diagrams, requirements, and others can also be used.

• Visualization that depends on data extracted from the execution of the program is called dynamic. It is based on runtime information, such as the content of variables, the conditions executed, the stack size, and so on. This kind of data is difficult to obtain because of the lack of mechanisms for gathering information from the program memory.

• The study of software evolution is a research field in its own right, and several interesting proposals have been made regarding the visualization of data resulting from the analysis of this source of information. It was proposed in [15] that the basic unit of information needed to visualize software evolution is the Maintenance Request (MR). It refers to the delta of change in a software system (a committed revision in a version control system like SVN or CVS). These deltas are taken together to analyze and visualize the evolutionary information about a software system.
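As an illustrative sketch of the evolutionary source (the log tuples and field names are hypothetical stand-ins for what a version control system would report), per-file deltas can be grouped under their revision, each revision acting as one MR:

```python
from collections import defaultdict

# Hypothetical flattened log: (revision, file, lines added, lines removed),
# as could be parsed from the output of a version control system.
LOG = [
    ("r1", "core.py", 120, 0),
    ("r2", "core.py", 10, 4),
    ("r2", "util.py", 35, 0),
    ("r3", "util.py", 2, 2),
]

def group_maintenance_requests(log):
    """Group file-level deltas by revision; each revision is one MR."""
    mrs = defaultdict(list)
    for revision, path, added, removed in log:
        mrs[revision].append({"file": path, "added": added, "removed": removed})
    return dict(mrs)

def churn(mrs):
    """Total change (added + removed lines) per MR, a basic evolutionary metric."""
    return {rev: sum(d["added"] + d["removed"] for d in deltas)
            for rev, deltas in mrs.items()}

mrs = group_maintenance_requests(LOG)
print(churn(mrs))  # e.g. {'r1': 120, 'r2': 49, 'r3': 4}
```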

Table 3.2 List of applications taken from [13].

sv3D [8]. Description: uses three-dimensional polycylinders as a metaphor. Techniques: 3D, colors, heights, and depth to show information. Source: static.

Vizz3D [9]. Description: visualization as cities. Techniques: color, heights, and a real-world metaphor to show software components and their relations. Source: static.

Tarantula [3]. Description: uses a SeeSoft-like representation to show the results of a set of tests. Techniques: color and geometrical shapes to show the results. Source: dynamic.

Structure [16]*. Description: visualization of Java dependencies at different levels of abstraction. Techniques: a two-dimensional metaphor, based on the Eclipse platform set of icons and graphs, to show relations. Source: static.

SeeSoft [5]. Description: visualization of changes through time. Techniques: color and geometrical shapes to show the results of the analysis. Source: evolution.

X-Ray [6]. Description: visualizes relationships in Java source code. Techniques: geometrical shapes, links (lines), and some basic colors to represent dependencies. Source: static.

EPOSee [17]. Description: shows information about the evolution of files in the source code. Techniques: pixel-maps, with support for graph representations, to show relations among files in the version control system. Source: static.

SHriMP [18]. Description: shows the dependencies in program code and other kinds of artifacts, like architectural design and documentation. Techniques: a two-dimensional approach based on rectangles, color, and links among the parts of the visualization. Source: static.

X-Tango [19]. Description: visualizes the execution of a program. Techniques: an animated version of geometrical shapes and color to show the behavior of the program. Source: dynamic.

* http://www.headwaysoftware.com/products/structure101/index.php


3.2.5 Differences Between Software Visualization and Modeling Languages Like UML
UML [20] was first conceived as a union of the methodologies proposed by Grady Booch, Ivar Jacobson, and James Rumbaugh. They participated in the evolution of what became the first UML specification, which came out in 1997 under the organization of the Object Management Group. UML clearly follows the same principles as SV (i.e., it tries to make the process of maintaining and understanding software easier). The main difference is the way they face the problem: while software visualization tries to provide a view of an already written software system, the UML specification aims to provide a view before the actual code is written. UML is a modeling language and also a visualization tool, although it does not provide a view of a system in terms of its metric values. Despite the effort invested in the modeling stage of any methodology, there will always be the need to determine the current state of a system. This current state and its metrics can be determined by a visualization tool that analyzes, for example, the source code.

3.3 SV TECHNIQUES
Based on the three sources of information, researchers have developed a wide spectrum of techniques. They try to use as many visual techniques as possible, in order to endow the metaphor with enough data about the analyzed system. In the following sections, the definition of metaphor as well as the techniques developed in the past are explored. These are divided into different sections depending on the mechanism used to represent the software system.

3.3.1 Metaphors
One of the fundamental concepts behind any kind of visualization is the metaphor. It was defined by Lakoff [21] as “a rhetorical figure whose essence is the understanding and experiencing of one kind of thing in terms of another”. In software visualization, metaphors are the most important

4 After a negotiation for the “UML” name with Rational.


concern, because the information that is going to be visualized does not have a natural visual representation. Since the beginning of SV, metaphors have been developed using different techniques, such as bar charts, pie charts, cylinders, pixel-maps, buildings within cities, and even galaxies in the universe. When building metaphors, designers must consider a set of basic aspects, as defined by Graçanin [15], before they can be included in a visualization system; these aspects include:

1. Scope of representation: Software systems usually consist of thousands of lines of code, and the visualization tool has to render information related to them. This vast amount of information often confuses the final user. Any metaphor should therefore allow the user to limit the scope of the information being visualized, so that he can decide which information is relevant.

2. Medium of representation: One aspect of the medium is the visualization type, 2D or 3D. The medium plays an important role when building a software visualization system, as it usually depends on the kind of information that is being visualized.

3. Visual metaphor: This aspect refers to what visual elements the metaphor uses to display information to the user. This includes ge- ometric shapes, such as lines, dots, circles, squares, and polygons, or real-world entities, such as buildings, trees, planets, and so on. These elements may have a color (in a color scale) to represent another aspect of a software artifact. Considering these elements and how they are used, two important aspects of a metaphor are:

• Consistency of the metaphor: It refers to the correct use of the metaphor. This means that there must be a mapping between software entities and entities in the visualization that avoids misunderstandings, i.e., different software entities or properties must not be represented by the same graphical element in the visualization.

• Semantic richness and complexity of the metaphor: The metaphor should be rich enough to provide as many representations as there are different aspects of the software being visualized.

4. Abstractedness (Ellison): The user of the visualization system should be able to choose the level of detail in the software sys- tem that is being evaluated. In this way the user may choose from direct representation, structural representation, synthesized representation, and analytical representation.

5. Ease of navigation and interaction: Since the visualization system normally provides a large amount of information, it should let the user know what information is being visualized, what part of the system is shown, and what level of abstraction has been selected, and it should allow the user to navigate in an understandable way according to usability criteria. This is an especially important aspect in 3D visualizations, where the user can easily get lost.

6. Level of automation: Software visualization systems need to be automated (i.e., extract, analyze, and render all the information from a software system with a minimum interaction with the user).

7. Effectiveness [22]: It indicates the efficacy of the metaphor as a medium for representing the information. The metaphor should be able to convey the analyzed information from the software system. For example, it could be important to know if the system is able to show ordinal values as well as cardinal ones.

8. Expressiveness [22]: It refers to the capability of the metaphor to visually represent all the analyzed data. As in the case of semantic richness, the metaphor must provide a considerable number of visual properties so the parameters obtained by the analysis can be represented in the view.
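Several of these aspects, consistency and expressiveness in particular, boil down to a fixed mapping from software properties to visual properties. A toy sketch of that idea follows; the metric and channel names are hypothetical, not drawn from any tool in this chapter:

```python
# A hypothetical mapping from software metrics to visual properties,
# illustrating consistency (one metric is always shown by the same channel)
# and expressiveness (enough channels for all the analyzed parameters).
METAPHOR = {
    "lines_of_code": "height",
    "num_methods":   "width",
    "age_in_days":   "color",
}

def render_entity(metrics):
    """Map each metric of one software entity onto its visual channel."""
    missing = set(metrics) - set(METAPHOR)
    if missing:  # expressiveness check: every metric needs a visual channel
        raise ValueError(f"metaphor cannot express: {missing}")
    return {METAPHOR[m]: value for m, value in metrics.items()}

print(render_entity({"lines_of_code": 420, "num_methods": 12}))
# {'height': 420, 'width': 12}
```

A metaphor that silently reused a channel for two metrics would violate consistency; one with fewer channels than metrics fails the expressiveness check above.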

3.3.2 2D Approaches
As mentioned in Section 3.2.4, one of the main sources of information is the source code. To address the challenges posed by the analysis of this source, techniques have been created that make use of a bidimensional space, mainly diagrams, such as:

• Control-flow diagrams: In 1947, Von Neumann [23] created one of the most famous ways to visualize the flow of a program, using geometrical figures to represent actions or events within the application: rectangles when the flow refers to events, activities, processes, functions, and other general statements, and diamonds when the flow reaches a point where a decision has to be made. This is a simple yet powerful way to explain and understand basic algorithms, for example sorting algorithms, but it gradually degrades as the program gets bigger. For this reason, researchers have since developed tools that automatically generate these diagrams with the proper layout and configuration.

• Structograms: Nassi and Shneiderman proposed a new way to diagram programs based on nested rectangles (see Figure 3.1). Since it does not have a representation of the GOTO statement, the programmer is forced to write programs without it (when this representation was proposed, Object-Oriented Programming was not as popular as it is today), thus making programs more structured and easier to maintain and understand.

Figure 3.1 Structograms (sequence, loop and conditional) [2].

• Jackson diagrams: They represent a program as a tree hierarchy providing a format to depict the structure of the source code. Figure 3.2 shows the diagrams for three control structures.

Figure 3.2 Jackson diagrams representation (sequence, loop and conditional) [2].

• Control structure diagrams: These diagrams put the control-flow charts and the source code together. They embed the diagram into the source code by showing, beside the code, the figure that represents each statement. For example, if a line contains a conditional, it is marked with a diamond on the left side; similarly with loops, where a vertical bar spanning the whole block is placed on the left side, as shown in Figure 3.3.

Figure 3.3 Basic control structure diagrams (sequence, loop and conditional) [2].
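As a toy sketch of the control-structure-diagram idea (the textual markers and the keyword detection are simplistic assumptions, nothing like a real tool), each source line can be prefixed with a symbol standing in for its diagram shape:

```python
def annotate(lines):
    """Prefix each source line with a textual stand-in for its diagram symbol:
    '<>' for conditionals, '||' for loop headers, spaces for plain statements."""
    out = []
    for line in lines:
        s = line.strip()
        if s.startswith(("if ", "elif", "else")):
            out.append("<> " + line)
        elif s.startswith(("for ", "while ")):
            out.append("|| " + line)
        else:
            out.append("   " + line)
    return out

src = ["for i in range(3):", "    if i % 2 == 0:", "        print(i)"]
print("\n".join(annotate(src)))
```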

One of the most important visualization tools within this category is SeeSoft [5]. This tool intends to provide a view of how a software system has evolved through time. It uses a bidimensional space based on rectangles that represent each file of the system, while the content of each rectangle shows how each line of code has changed. This metaphor is depicted in Figure 3.4, where the color scale represents how much each file has changed during the development or maintenance of the software system.

Figure 3.4 SeeSoft tool example. Taken from [2].

Another example of bidimensional visualization based on the history of a program can be found in the evolutionary information. In this case, researchers have developed interesting techniques, such as support graphs [24], fractals [25], and pixel maps. The first of them, the support graph, takes the concept of evolutionary coupling and represents the evolution of a system using a graph-like representation, where each node represents a file and a measure of coupling between nodes is defined and used as a weight. This weight is later interpreted as a distance between nodes. Figure 3.5 shows an example of this kind of visualization. The second technique, fractals, was developed as a tool to help stakeholders comprehend and measure the effort of each developer within a development team. It is based on a set of nested rectangles where the size of each rectangle represents the contribution of a developer; each example illustrated in Figure 3.6 represents a type of development environment. Finally, the third technique, pixel maps, also builds on the concept of evolutionary coupling: a matrix is built from the files of the system, where each row and column represents a file, and each cell has a color indicating the strength of the coupling between the two files represented by the corresponding column and row. By using this mechanism, it is possible to detect emerging blocks that indicate the presence of related concepts in the code, leading to easier understanding and supporting impact analysis. Figure 3.7 shows an example: the files are placed in a matrix where each cell represents, on a color scale, the weight of the relation between the two corresponding files. With this kind of visualization it is possible, once again, to see emerging patterns in the system architecture.

Figure 3.5 Mozilla Firefox example of evolutionary coupling. Taken from [2].

Figure 3.6 Fractal representation: (a) developer balance, (b) few developers, (c) many developers. Taken from [2].

68 SOFTWARE VISUALIZATION TO SIMPLIFY THE EVOLUTION OF SOFTWARE SYSTEMS

Figure 3.7 Pixel-maps example. Taken from [2].
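The coupling matrix behind a pixel map can be derived directly from commit data. A small sketch follows, under the simplifying assumption (the data is invented) that two files are evolutionarily coupled whenever they change in the same committed revision:

```python
from collections import Counter
from itertools import combinations

# Hypothetical change sets: the files touched by each committed revision (MR).
CHANGE_SETS = [
    {"a.py", "b.py"},
    {"a.py", "b.py", "c.py"},
    {"c.py"},
    {"a.py", "b.py"},
]

def coupling_matrix(change_sets):
    """Count how often each pair of files changes together (evolutionary coupling).
    The counts are the cell values a pixel map would render on a color scale."""
    counts = Counter()
    for files in change_sets:
        for pair in combinations(sorted(files), 2):
            counts[pair] += 1
    return counts

m = coupling_matrix(CHANGE_SETS)
print(m[("a.py", "b.py")])  # a.py and b.py changed together 3 times
```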

3.3.3 3D Approaches
These approaches have not been widely explored compared to 2D techniques. However, Stasko states in [26], about visualizing in a three-dimensional world, that “by adding an extra spatial dimension, we supply visualization designers with one more possibility for describing some aspect of a program or system”; thus, more information is easily represented. In addition, when using a 3D metaphor, it has been suggested that perception is less error prone if software objects are mapped to visual objects, as there is a natural mapping between them [11]. There are some cases where a third dimension (spatial or temporal) is needed. An example is the visualization of structural change: at least two dimensions are needed to reflect the internals of a system at a specific time, and another one to render the information about how it has changed over time. Techniques like the one used by SeeSoft have been extended to a 3D space. This is the case of sv3D [8] and its successor SeeIT 3D [13], where a software artifact is represented by a container and a set of polycylinders represents the components of the artifact; e.g., a class is a container and its methods are the polycylinders. This representation allows more information to be visualized, and users feel more comfortable with it than with 2D approaches, as stated before. It even allows the representation of a wide spectrum of software artifacts, such as source code and relational databases. Figure 3.8 depicts the concept of this metaphor.

Figure 3.8 sv3D and SeeIT 3D metaphor.

A special case of a metaphor like the one presented before is shown in Figure 3.9. It shows the stages of a sorting algorithm, using a third dimension to represent how it has solved the problem. In the front row, the polycylinders are unsorted, and step by step (represented by each row) they become sorted; the last row shows them fully sorted, revealing how the sorting process was completed. Another important example of using a 3D space and its advantages is the CodeCity [10] tool. The purpose of this tool is to render a software system as a city where the user is able to locate himself easily. CodeCity employs a mapping from software components to city objects, so that each package represents a district and each building represents a class; the height of a building is given by the number of methods and its width by the number of attributes. Figure 3.10 illustrates this metaphor at work.


Figure 3.9 Sorting steps using a third spatial dimension. Taken from [2].

Figure 3.10 CodeCity metaphor. Taken from [10].
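The CodeCity mapping described above is essentially a function from code entities to city geometry. A minimal sketch follows; the class names and metrics are invented for illustration:

```python
# Hypothetical per-class metrics: (package, class, num_methods, num_attributes)
CLASSES = [
    ("app.ui",   "MainWindow", 24, 8),
    ("app.ui",   "Dialog",      6, 3),
    ("app.core", "Engine",     40, 12),
]

def to_city(classes):
    """CodeCity-style mapping: package -> district, class -> building,
    number of methods -> building height, number of attributes -> width."""
    city = {}
    for package, name, methods, attributes in classes:
        district = city.setdefault(package, [])
        district.append({"building": name, "height": methods, "width": attributes})
    return city

city = to_city(CLASSES)
print(len(city["app.ui"]))  # two buildings in the app.ui district
```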

Finally, it is worth mentioning a special case of 3D visualization based on UML. After the definition of this language, an enhancement was proposed in [16, 27]: a 3D space used to visualize the same elements of UML. However, it was not as successful as UML itself, due to the difficulty of drawing the shapes on paper or a whiteboard.


3.3.4 Virtual Environments
Virtual environments give the user a unique type of immersion and navigation because of the way they represent and render the information. This level of interaction is achieved by presenting a world to the user, in which he is able to interact with different objects that are mapped to software components or artifacts. Research in this area is far more complicated, as it requires more human and technological resources; therefore, work in the area has not been as popular as 2D approaches. Despite the costs, some work has been done. This is the case of ImsoVision [28], which visualizes C++ source code, employing geometrical figures to represent the different components of the software system. While this work has had an impact on the C++ community, Software World [1] has done a similar job for the Java language; it uses a metaphor based on elements from the real world, like countries, districts, and buildings, to represent the source code. Additionally, one of the most important visualization environments is CAVE, proposed in [29], which uses a cubicle where the user interacts with the world presented in it. Distributed VEs are a special kind of virtual environment in which many users, distributed in different places, interact with the visualization at the same time. By providing a tool capable of supporting this kind of operation, all the users involved in the visualization can interact with each other and work in a collaborative style.

3.4 TOWARDS A BETTER SOFTWARE VISUALIZATION PROCESS
Despite the efforts made in the SV area, emerging technologies and other paradigms pose challenges that need to be addressed by new metaphors and techniques. These include new programming paradigms, such as Aspect-Oriented Programming (AOP), more dynamic languages, the processing of higher-level languages, more flexible metaphors, and some educational issues.

6 Not only for software visualization.


3.4.1 Other Programming Paradigms
With more and more languages available for writing programs, it is mandatory to build better visualization systems based on these new models, such as aspect-oriented programming. AOP is an extension of the object-oriented model based on cross-cutting concerns within a system (e.g., logging capabilities). If a SV tool is going to analyze a system of this type, it needs to know how the interactions of aspects are defined and implemented. In this way, a SV system could explain the underlying system adequately and achieve the objective of providing better understanding. This specific case was explored in [30], but few of the tools developed so far provide a novel solution to this problem. Instead, they have tried to extend the models used in object-oriented systems, or even models employed in older paradigms. Dynamic languages are a second example where SV has not been widely introduced. These languages differ from compiled languages in that behaviors are determined at runtime rather than at compile time. Although these behaviors can be emulated in compiled languages, the syntax and the understanding process are much easier in dynamic languages, because they are built in that specific way. Examples of dynamic languages are Groovy, JavaScript, Objective-C, PHP, and Python. Therefore, new metaphors need to be defined that enable the presentation of this kind of information, be it runtime information (more important in this case) or static information referring to the source code.

7 See http://en.wikipedia.org/wiki/Dynamic_programming_language for more information.

3.4.2 Include Other Languages
In software development there are more languages apart from the classic programming languages (i.e., Java, PHP, .Net, C, C++, etc.). Domain-Specific Languages (DSLs), Architecture Description Languages (ADLs), and metamodeling are examples of such languages. DSLs are languages or specifications designed to be applied in a specific domain; for example, R for statistics. Considering that not every aspect of a software system is specified in a general-purpose language, it is necessary to be able to analyze sources like DSLs, or even the languages in which they can be written (e.g., the Groovy language). Another topic that has been only partially explored using the classic sources of information is the architecture of a software system. Specifically, ADLs should be considered as sources of information for SV tools. An ADL is a language to define and express how the architecture of a system is implemented. Hence, ADLs could be very useful sources of information that would facilitate the visualization of bigger components, and therefore enable the rendering of high-level views of the analyzed system. Finally, a last example of other languages that could be visualized is metamodeling techniques. Metamodeling is a mechanism used to define constraints, terms, rules, and concepts about the model employed to solve the domain-specific problem. A visualization tool may add this information to the model itself and help address the inherent problems in the software evolution process. As can be seen, there are other languages that can be considered alternative or additional sources of information for SV tools.

3.4.3 Better and More Flexible Metaphors
Most of the metaphors defined for software visualization have been designed for a particular problem or source of information, as explained in Section 3.3. The next step in SV should be to provide better and more flexible metaphors that allow the user to visualize several sources of information within just one view. If a mechanism of this nature is provided, the learning curve of visualization tools is reduced, because the final user only needs to comprehend how one metaphor behaves, instead of learning a different metaphor for each source of information. The results would be even better if the tool considered other sources and languages, like the ones explained in the previous sections.
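The idea of one metaphor over several sources can be sketched as a small adapter layer (all names and data here are hypothetical): each source is normalized to the same entity-to-value shape, so a single metaphor can render any of them:

```python
# Hypothetical adapters normalizing different sources of information
# into one shape: {entity_name: numeric_value}.
def from_static(loc_per_file):
    return dict(loc_per_file)        # e.g. lines of code per file

def from_evolution(changes_per_file):
    return dict(changes_per_file)    # e.g. number of MRs touching each file

def bar_metaphor(entities, width=30):
    """One metaphor for every source: horizontal bars scaled to the maximum."""
    top = max(entities.values())
    return {name: "#" * max(1, int(width * v / top)) for name, v in entities.items()}

# The same metaphor renders both sources; the user learns it only once.
print(bar_metaphor(from_static({"a.py": 300, "b.py": 150})))
print(bar_metaphor(from_evolution({"a.py": 12, "b.py": 3})))
```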

3.4.4 Educational Issues
At the educational level, the SV process can be improved if this subject is taught as part of a software evolution course. At least at Universidad Nacional de Colombia, there is no clear way to teach the benefits of SV and how it can be used in production environments to support the software evolution process. Students of software engineering or software evolution courses can start by reviewing the state of the art of software visualization. Next, they can become users of various visualization tools when doing software maintenance tasks. After that, they will be able to work on the design and implementation of new tools, based on real-world needs instead of researchers' assumptions about a world they may not know.

3.5 SUMMARY
This chapter presented the foundations of software visualization. As was seen, the approaches range from 2D techniques using graph-like representations, pixel maps, geometrical shapes, and colors, to virtual environments where the user is able to navigate a world representing the analyzed software system. These approaches are mainly based on three sources of information: static, dynamic, and evolutionary. Static refers to data extracted from sources that do not require the program to be running (source code, documents, and so on). Dynamic information is gathered from a running system, for example, the values placed in the execution stack of the program. And evolutionary information is processed from the different software repositories, such as bug trackers and version control systems. At the end of the chapter, some considerations for improving visualization processes and tools were presented. These include embracing new programming paradigms and languages different from the classic ones, developing new metaphors, and some basic considerations that could lead to a bigger community around software visualization tools.


REFERENCES [1] Knight, C. & Munro, M. Comprehension within virtual environment visualisations. In: Proceedings of the Seventh International Workshop on Program Comprehension: 1999, 4–11.

[2] Diehl, S. Software visualization: visualizing the structure, behaviour, and evolution of software. Springer-Verlag: 2007.

[3] Jones, J. A., Harrold, M. J. & Stasko, J. T. Visualization for fault localization. In: Proceedings of ICSE 2001 Workshop on Software Visualization. Toronto, Ontario, Canada: 2001, 71–75.

[4] Lanza, M. CodeCrawler: lessons learned in building a software visualization tool. In: Proceedings of the Seventh European Conference on Software Maintenance and Reengineering, CSMR '03: 2003.

[5] Eick, S. C., Steffen, J. L. & Sumner Jr., E. E. Seesoft: a tool for visualizing line oriented software statistics. IEEE Transactions on Software Engineering, 18(11): 1992; 957–968.

[6] Malnati, J. X-Ray: An Eclipse Plug-in for Software Visualization. Bachelor's thesis, Università della Svizzera italiana: 2007.

[7] Voinea, L., Telea, A. & van Wijk, J. J. CVSscan: visualization of code evolution. In: Proceedings of the 2005 ACM symposium on Software visualization, SoftVis '05. ACM, New York, NY, USA: 2005. ISBN 1-59593-073-6, 47–56.

[8] Marcus, A., Feng, L. & Maletic, J. I. 3D representations for software visualization. In: Proceedings of the 2003 ACM symposium on Software visualization, SoftVis '03. ACM, New York, NY, USA: 2003. ISBN 1-58113-642-0, 27–ff.

[9] Löwe, W. & Panas, T. Rapid construction of software comprehension tools. International Journal of Software Engineering and Knowledge Engineering, 15(6): 2005; 995–1025.


[10] Wettel, R. & Lanza, M. CodeCity: 3D visualization of large-scale software. In: Companion of the 30th International Conference on Software Engineering, ICSE Companion '08. ACM, New York, NY, USA: 2008. ISBN 978-1-60558-079-1, 921–922.

[11] Ware, C., Hui, D. & Franck, G. Visualizing object oriented software in three dimensions. In: Proceedings of the 1993 Conference of the Centre for Advanced Studies on Collaborative Research: Software Engineering - Volume 1, CASCON '93. IBM Press: 1993, 612–620.

[12] Mens, T., Wermelinger, M., Ducasse, S., Demeyer, S., Hirschfeld, R. & Jazayeri, M. Challenges in software evolution. In: Principles of Software Evolution, Eighth International Workshop on: 2005, 13–22.

[13] Montano, D. Development of a 3D tool for visualization of different software artifacts and their relationships. Undergraduate thesis, Universidad Nacional de Colombia, Departamento de Ingeniería de Sistemas e Industrial: 2010.

[14] Baecker, R. & Sherman, D. Sorting out sorting. Video shown at SIGGRAPH-81: 1981. Video.

[15] Graçanin, D., Matković, K. & Eltoweissy, M. Software visualization. Innovations in Systems and Software Engineering, 1(2): 2005; 221–230.

[16] Irani, P., Tingley, M. & Ware, C. Using Perceptual Syntax to Enhance Semantic Content in Diagrams. IEEE Computer Graphics and Applications, 21(5): 2001; 76–85. ISSN 0272-1716.

[17] Burch, M., Diehl, S. & Weißgerber, P. EPOSee: A Tool For Visualizing Software Evolution. In: 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis: 2005.

[18] Storey, M.-A., Best, C. & Michaud, J. SHriMP Views: An Interactive Environment for Exploring Java Programs. International Conference on Program Comprehension, 0: 2001; 0111.


[19] Stasko, J. T. Tango: A Framework and System for Algorithm Animation. ACM SIGCHI Bulletin, 21: 1990; 59–60. ISSN 0736-6906.

[20] OMG. UML Specification. http://www.omg.org/spec/UML/2.2/Infrastructure/PDF/.

[21] Lakoff, G. & Johnson, M. Metaphors we live by. Chicago London: 1980.

[22] Mackinlay, J. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5: 1986; 110–141. ISSN 0730-0301.

[23] Goldstine, H. H. & Von Neumann, J. Planning and coding of problems for an electronic computing instrument. Institute for Advanced Study: 1947.

[24] Zimmermann, T., Weißgerber, P., Diehl, S. & Zeller, A. Mining Version Histories to Guide Software Changes. In: Proceedings of the 26th International Conference on Software Engineering, ICSE '04. IEEE Computer Society, Washington, DC, USA: 2004. ISBN 0-7695-2163-0, 563–572.

[25] D’Ambros, M., Lanza, M. & Gall, H. Fractal Figures: Visualizing Development Effort for CVS Entities. In: Proceedings of the 3rd IEEE International Workshop on Visualizing Software for Understanding and Analysis, VISSOFT '05. IEEE Computer Society, Washington, DC, USA: 2005. ISBN 0-7803-9540-9, 16–. doi:10.1109/VISSOF.2005.1684303.

[26] Stasko, J. T. & Wehrli, J. Three-dimensional computation visualization. In: Proceedings of the 1993 IEEE Symposium on Visual Languages: 1993, 100–107.

[27] Irani, P. & Ware, C. Diagrams based on structural object perception. In: Proceedings of the working conference on Advanced visual interfaces. ACM, Palermo, Italy: 2000. ISBN 1-58113-252-2, 61–67. doi:10.1145/345513.345254.


[28] Maletic, J. I., Leigh, J., Marcus, A. & Dunlap, G. Visualizing object oriented software in virtual reality. In: Proceedings of the 9th International Workshop on Program Comprehension. ACM: 2001, 21–13.

[29] Cruz-Neira, C., Sandin, D. J., DeFanti, T. A., Kenyon, R. V. & Hart, J. C. The CAVE: audio visual experience automatic virtual environment. Communications of the ACM, 35(6): 1992; 64–72.

[30] Pfeiffer, J. H. & Gurd, J. R. Visualisation-based tool support for the development of aspect-oriented programs. In: Proceedings of the 5th international conference on Aspect-oriented software development, AOSD '06. ACM, New York, NY, USA: 2006. ISBN 1-59593-300-X, 146–157. doi:10.1145/1119655.1119676.

Incremental Change: The Way that Software Evolves

Juan Romero-Silva Mario Linares-Vásquez Jairo Aponte

ABSTRACT Change is an unavoidable characteristic of software. The research field of software maintenance and software evolution emerges precisely from this characteristic. Incremental Change (IC) is an essential part of software evolution because it deals with the addition of features and properties to the software. In this chapter, we review several techniques related to Incremental Change as a method for implementing new features or correcting bugs in software. IC embraces change in the same way that iterative and agile methodologies of software development do. Thus, IC is also a fundamental piece of iterative and agile development.

4.1 INTRODUCTION Change is a fundamental element of software development processes. However, plan-driven methodologies traditionally try to avoid change by using strategies borrowed from other branches of industry and engineering (e.g., planning at the beginning of a project in order to “freeze” requirements). These models have been successful for product manufacturing. That success was the reason for the adoption of the plan-driven waterfall model in software development. However, in software development it is not easy to know the requirements in advance. Thus, a “complete design” before the implementation is impossible to


fulfill, and using plan-driven methodologies in software becomes almost useless. Rajlich [1] quotes a report stating that approximately 16% of software projects using the waterfall methodology have succeeded. The remaining 84% failed because they far exceeded their budget, their schedule, or both; or because they were canceled. Several authors have studied the problem of changing requirements from several points of view; some of the most notable works include:

• The software evolution laws described by Lehman [2,3]. • The software maintenance categorization proposed by Swanson [4]. • The staged model for the software life cycle proposed by Bennett and Rajlich [5].

Incremental change appears whenever software needs to evolve because the environment changes, users want new features, or some kind of issue must be resolved. These cases usually occur after the initial software delivery, but if software is developed with an agile or even an iterative methodology, incremental change appears as soon as the first iteration is completed. Thus, this chapter intends to show the importance of IC in the software development process and how researchers are making several efforts to address the steps related to IC. The structure of the rest of the chapter is as follows: Section 2 describes the steps to complete an incremental change. Section 3 presents a review of research related to concept and feature location, and establishes the differences and similarities between them. Section 4 reviews techniques for impact analysis. Finally, Section 5 draws some conclusions.

4.2 INCREMENTAL CHANGE IN THE SOFTWARE DEVELOPMENT PROCESS After several years of trying to avoid change during software development processes, the acceptance of change as an inherent feature of software helped developers to adopt iterative and agile methodologies. These methodologies embrace change as a fundamental piece. Thus, iterative and agile methodologies have much in common with incremental change. Febbraro and Rajlich in [6] present an agile methodology


based on the implementation of one change per iteration. In Figure 4.1, the IC activities are presented. The starting point for an IC process is a change request. Then, the successive activities are the extraction of concepts from the request, the location of these concepts in source code, and the change impact analysis. Next, the developer prepares the source code for the change by doing a previous refactoring (this is an optional activity), updates the code by implementing the changes, and incorporates these changes into the source code (by propagating the effects through the elements identified during impact analysis). Sometimes, incorporation and update are done together and cannot be easily separated into two activities. Finally, the programmer propagates changes that were not expected and performs a new refactoring. This refactoring is only needed when “bad smells” are located after updating, incorporation, and change propagation.

Figure 4.1 Incremental change activities. [Figure: a flow from Change Request through Concept Extraction, Concept Location, and Impact Analysis, followed by Prefactoring, Actualization, Incorporation, Change Propagation, and Postfactoring.]

4.2.1 Software Maintenance vs. Software Evolution Incremental change was first described only as a maintenance task that can only be performed after the release of the software system, when change requests are made. Agile software development processes based on incremental change treat the results of the first iteration as the first release. From this point of view, the next iterations increment the existing software by adding features, showing an evolutionary view of incremental change.


According to [7], the differences between maintenance and evolution are deep, because maintenance denotes the idea of keeping a system in execution without fundamental changes in the design, while evolution instead tries to produce enhancements that often require profound design modifications. In software, changes are produced in order to meet the unsatisfied needs of users. Occasionally, software systems do not need architectural changes, but frequently these changes affect the architecture widely. In this latter case, changes do not look for preservation, but for innovation. This is why the term evolution is considered more adequate when discussing software. Besides, in [5], Rajlich and Bennett present even more drastic differences between evolution and maintenance. They proposed a software life cycle, which is composed of five stages and strongly distinguishes the evolution stage (the second stage, after initial development) from the servicing one (after the evolution stage). In the evolution stage, software changes without degrading its architecture; during this stage, the software incorporates new features while maintaining the integrity of its structure. When the software moves to the servicing stage, it stops evolving and begins maintenance. In this stage the architecture begins to degrade and changes become increasingly harder to make. Maintenance finishes when the architecture is so degraded that changes become impossible or prohibitively expensive.

4.2.2 Activities of Incremental Change There is a consensus on the activities related to incremental change. Occasionally some activities are not performed; for example, if the software architecture is prepared for the change, the prefactoring activity is not needed, or if the program is short enough and the programmer knows it very well, concept location may be a trivial task done intuitively. However, in complex and large software systems it is likely that all activities have to be done. According to several works by Rajlich [1, 6, 8, 9], the activities of incremental change are:

1. Change request: The discovery of a bug, a request for a new feature, or an enhancement are typical change requests, which can be made by users or by someone in the development team.


These change requests are usually written in natural language, and are formulated in terms of domain concepts [8]. Since the goal is to implement the functionality described using the concepts in the change request, the next steps of incremental change make extensive use of concepts.

2. Concept extraction: As the request is written in natural language, it contains many words, and domain concepts can be found among them. In this step, the developer must extract the domain concepts from the change request; this has to be done in order to formulate the queries used to locate the concepts in the source code. As developers extract concepts from the change request, the ability to compare them with concepts in source code could increase the effectiveness of the next activity; recent work has been done to enhance the extraction of concepts from source code [10,11].

3. Concept location: The developer uses the domain concepts involved in the change request in order to locate the parts of the code that have to be changed. This activity is called concept location. It is done by formulating a query and processing some of the results, which show possible places where the concepts in the query are implemented in the source code. In early research [12], concept location was used as a synonym of feature location, but in this chapter a comparison that shows the differences between both terms (according to the current definitions from the same authors [13]) is presented.

4. Impact analysis: As different entities or classes have to collaborate to achieve an objective, the implementation of a new functionality usually involves several parts of the code. During this activity, developers search for the other parts of the code that have to be modified to fulfill the change request. The source code is not the only artifact affected by the change. There are also non-executable files (configuration files, media files, user manuals, etc.) that are impacted by the change. These non-source code artifacts also need to be identified in order to successfully complete the change.


5. Prefactoring: Refactoring is defined as an activity in which the internals of the program change while the observable behavior remains the same. This is done to maintain or even improve the architecture of a software application. Prefactoring is a refactoring task done before implementing the change identified in the previous steps. It is necessary because sometimes the architecture is not adequately prepared for the change. The objective of this activity is to make the coming changes easier.

6. Actualization: The concept is fully implemented, perhaps by adding new classes or by modifying existing ones, depending on whether the concept is implicit or explicit.

7. Incorporation: When the change implemented during the actualization activity requires the addition of new classes, these classes need to be intertwined with the old code. This is called incorporation.

8. Change propagation: Once the change is implemented, all source code elements interacting with the changed elements may be affected. During impact analysis the programmer identifies the parts of the code affected by the change. Change propagation is the activity in which all these identified changes are effectively implemented; while doing this task, parts that need modification but were not identified during impact analysis can always appear. This can be better expressed by explaining impact analysis as part of incremental change design, while change propagation is part of incremental change implementation [14].

9. Post-factoring: During the implementation of changes, there is always the possibility of injecting “bad smells” into the code. The post-factoring activity is another refactoring, done with the objective of reducing or removing the negative impact of changes on the architecture of the program.

10. Testing: Testing is done throughout the whole process. Existing tests must not be broken after the change is implemented; obviously, in some cases, tests need to be updated to match the updated


code. This is done especially during refactoring, actualization, and incorporation. This task also shows another significant similarity between incremental change and agile methodologies: both treat testing as a fundamental activity.

11. Documentation: Non-source code artifacts also have to be updated when a change is made in the code; this updating task is done throughout the process. After post-factoring, some changes that were not previously identified have to be reflected in the documentation. That is why documentation can be considered a final step in an incremental change of software.
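As an illustration of activity 2 (concept extraction), a minimal sketch in Python is shown below. It is not taken from the works cited above; the stopword list, function name, and sample change request are invented for the example, and a real tool would use a fuller stopword list plus stemming.

```python
import re

# A minimal, illustrative stopword list; real tools use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "are", "in", "of", "to", "when",
             "and", "should", "be", "not", "on", "for", "it", "that"}

def extract_concepts(change_request: str) -> list[str]:
    """Extract candidate domain-concept terms from a natural-language
    change request by tokenizing and dropping common stopwords."""
    tokens = re.findall(r"[a-zA-Z]+", change_request.lower())
    seen, concepts = set(), []
    for tok in tokens:
        if tok not in STOPWORDS and tok not in seen:
            seen.add(tok)
            concepts.append(tok)
    return concepts

request = "The invoice total is not updated when a discount is applied"
print(extract_concepts(request))
# → ['invoice', 'total', 'updated', 'discount', 'applied']
```

The surviving terms become the query used in the concept location activity described in the next section.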

4.3 CONCEPT AND FEATURE LOCATION

4.3.1 Software Comprehension Only source code that is understood can be modified. That is why software comprehension is an important issue in incremental change. While there are many definitions for concept in the literature, this work uses the definition by Rajlich in [15]: A concept is a unit of knowledge that can be processed by a human mind. Thus, the kinds of concepts to consider in the software development domain are: • Domain concepts: These concepts form the vocabulary of end users. They are used to describe the problem.

• High-level design concepts: These are related to the implementation of functionalities in source code. Architectural patterns, among other terms from the solution domain, may be included.

• Conditions of failure: Software system concepts that the user barely understands. Some of these probably remain hidden until programming tasks begin. Later, Rajlich extended the definition of concept [16], presenting it as a triad composed of name, intention, and extension: • Name is the label used to identify the concept.

• Intention represents the meaning, and


• Extension includes the set of all the things described by the concept.

From the work described above, the central role of concepts in software comprehension can be inferred. In [17] the authors state that one of the biggest difficulties in achieving software comprehension is what they called the concept assignment problem. It is the problem of knowing what parts of the source code implement a specific requirement of the software [18]. One of the principal causes of the concept assignment problem is that the concepts used in the requirements specification belong to the problem domain, while the concepts found in source code are high-level design concepts from the application domain. But this is not the only cause; another important difficulty comes from the fact that some concepts cannot be implemented in just one software component, but are implemented through several components. This is known as the delocalization problem [19].

4.3.2 Concept Location Once the software development team receives a change request, the first task is to identify the parts of the software artifacts that need to be modified. This is done in two steps: First, the developer locates where to start making changes (concept location), and then, the developer performs the impact analysis of these changes (discussed in the next section). Usually, the change request is expressed in natural language, and includes the domain concepts that must be implemented. The programmer now needs to find the parts of the source code related to the concepts in the change request, because those are the parts that will probably change. Precisely here the problem arises, because the concepts from the change request are domain concepts, while the concepts in the source code (where the developer is searching) are design concepts or failure conditions; therefore, these are concepts from different abstraction levels. In order to minimize the gap between the domain concepts and the concepts expressed in source code, most concept location techniques use an intermediate representation, as seen in Figure 4.2. Reducing this


difference between levels of representation is the motivation behind concept location. If the objective is achieved, the search space the programmer explores when looking for the place to implement a change request will be smaller than before.

Figure 4.2 Concept location techniques frequently use an intermediate representation of source code [20]. [Figure: a change request (human-oriented, usually expressed in natural language using domain concepts) is associated with the source code through an intermediate representation.]

There are some situations where concept location is done intuitively; for example, when the programmer has a lot of experience with the source code that is being changed, or when the size of the software system is small enough. However, when the project is big, when the developer does not have much experience with the code, or when an experienced developer has not been in contact with the project for a while, alternatives beyond intuitive location have to be used. Concept location techniques are broadly classified into two types: static and dynamic techniques.

4.3.3 Static Techniques Static techniques do not require the software to be executed. They are largely based on the textual information present in software artifacts. For this reason these techniques are usually inexpensive and can be used even if the source code is not complete.

String Pattern Matching The search is conducted directly on the source code, looking for strings that match a regular expression. This is probably the technique most used by software developers, maybe because learning it is easy and


almost natural for developers. The most important part of this technique is the selection of the search pattern. As this step is done by the programmer, the method is considered highly dependent on the user’s judgment. Another weakness is that it only uses the text found in the source code as the search space, leaving its structure unused.
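A minimal sketch of this technique in Python is shown below; the function name and the restriction to .py files are assumptions made for the example, but the core idea is the same as running grep over the code base.

```python
import re
from pathlib import Path

def grep_sources(root: str, pattern: str) -> list[tuple[str, int, str]]:
    """Search all .py files under `root` for lines matching `pattern`,
    returning (file, line number, line text) hits, as grep would."""
    regex = re.compile(pattern)
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if regex.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits

# e.g. grep_sources("src", r"discount")  # lines mentioning the concept
```

Note that the quality of the results depends entirely on the pattern chosen by the developer, which is exactly the weakness discussed above.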

Dependency Search In [12] the use of the Abstract System Dependence Graph (ASDG) as a representation of source code is described. The ASDG is used to facilitate the developer’s search. The authors’ abstraction works at the function and global-variable levels. The technique starts by selecting one source code component to begin the search, and constructing a search graph with its neighbors. The graph is extended as more components are visited. Finally, when the components that implement the concept are found, the process finishes. The selection of the first component is one of the critical steps of this technique. This can be done either randomly, or as a product of a previous exploration, or by selecting the top component (perhaps a main method).
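The following Python sketch illustrates the idea of expanding a search graph from a selected starting component. The dependency map and component names are hypothetical, and the developer’s judgment about relevance is stood in for by a simple predicate; a real ASDG-based tool would present each neighbor to the developer interactively.

```python
from collections import deque

# Hypothetical dependency graph: component -> components it calls/uses.
DEPS = {
    "main":           ["parse_args", "run_billing"],
    "run_billing":    ["load_invoice", "apply_discount"],
    "load_invoice":   [],
    "apply_discount": ["round_money"],
    "parse_args":     [],
    "round_money":    [],
}

def dependency_search(start: str, looks_relevant) -> list[str]:
    """Expand the search graph from `start` in breadth-first order;
    `looks_relevant` stands in for the developer's judgment of which
    components implement the concept."""
    visited, found = set(), []
    queue = deque([start])
    while queue:
        comp = queue.popleft()
        if comp in visited:
            continue
        visited.add(comp)
        if looks_relevant(comp):
            found.append(comp)
        queue.extend(DEPS.get(comp, []))
    return found

print(dependency_search("main", lambda c: "discount" in c))
# → ['apply_discount']
```

Starting from "main" (the top component) mimics the last selection strategy mentioned above.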

Information Retrieval-based Techniques IR techniques have been successfully used in tasks aimed at extracting information from unstructured sources. That is why researchers decided to use them to perform the location task. In [21], several approaches are described:

• VSM: Vector Space Model, used by Zhao et al. [22]. It constructs feature documents with words extracted from the requirements and design documentation. Then, these documents are matched against query documents derived from identifiers in functions from the source code. This approach considers two documents as similar if they share the same terms, which leads to polysemy and synonymy problems.

• LSI: Latent Semantic Indexing analyzes the relation between words and documents. In the case of concept location, the words are extracted from the queries, while the documents are the source


code artifacts. LSI generates a vector for each document, and then uses it to compute similarity with other documents (the query is constructed as another document) [18]. According to [23], LSI outperforms VSM because it deals with polysemy and synonymy.

• Language Modeling: This technique calculates the conditional probability of generating a query Q given a document D based on a probabilistic language model derived from document D [21].
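To make the vector-space idea concrete, the following Python sketch builds term-frequency vectors and ranks hypothetical “documents” (built from function identifiers) against a query by cosine similarity. It illustrates plain VSM only; the documents and query are invented for the example, and a real tool would add term weighting (e.g., tf-idf) or apply LSI on top to mitigate the polysemy and synonymy problems noted above.

```python
import math
from collections import Counter

def tf_vector(text: str) -> Counter:
    """Term-frequency vector of a whitespace-tokenized document."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical documents built from identifiers in two source functions.
docs = {
    "apply_discount": "apply discount invoice total rate",
    "render_report":  "render report page footer header",
}
query = tf_vector("discount not applied to invoice total")
ranked = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])),
                reverse=True)
print(ranked[0])  # → apply_discount
```

Under this model, only shared surface terms contribute to the score, which is exactly why two documents about the same concept but using different vocabulary (synonymy) would wrongly rank low.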

Dynamic Techniques

Contrary to static techniques, these require the execution of the software. Therefore, the requirements of these techniques are higher than those of the static ones. In order to achieve execution, the source code must be complete. These techniques also need test cases, because the execution starts with a test case, and any software artifact that cannot be executed is ignored. The Software Reconnaissance method [24] is one of the first dynamic approaches to the concept location problem. First, it instruments the source code to make the production of traces possible. Then, test cases are selected. Some of these test cases execute the characteristic the developer is searching for, and some are unrelated to the characteristic. After the execution, the traces are compared. Components exercised by test cases related to the characteristic, but not exercised by unrelated test cases, are the results of the technique. In [25], some slicing techniques that can be used in the context of concept location are presented. These methods act by restricting the behavior of software to a specific zone of interest (a slice). In this way, the search space is reduced to the components in traces related to the slice. In short, dynamic techniques collect traces and then analyze them in order to create sets of components related and non-related to the concept of interest.
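The set comparison at the heart of Software Reconnaissance can be sketched in a few lines of Python. The traces below are hypothetical sets of executed functions per test case, invented for the example; a real tool would obtain them by instrumenting the code.

```python
def reconnaissance(with_feature: list, without_feature: list) -> set:
    """Components exercised by feature-related test cases but by none
    of the unrelated ones are candidates for implementing the feature."""
    exercised = set().union(*with_feature)
    unrelated = set().union(*without_feature)
    return exercised - unrelated

# Hypothetical traces (sets of executed functions per test case).
t_feat   = [{"main", "load", "apply_discount"},
            {"main", "apply_discount", "round"}]
t_nofeat = [{"main", "load"}, {"main", "round"}]
print(reconnaissance(t_feat, t_nofeat))  # → {'apply_discount'}
```

Common infrastructure such as "main" is filtered out automatically because it appears in both kinds of traces, which is the key insight of the technique.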


Hybrid Techniques Hybrid techniques that combine static and dynamic analysis have attracted the interest of researchers. In [23], a technique that uses static and dynamic analysis is described. It uses Latent Semantic Indexing (LSI), as described above, as the static part, and Scenario-based Probabilistic Ranking (SPR) as the dynamic analysis. In SPR, some scenarios that exercise the feature and some scenarios that do not exercise it are defined to collect traces. The collected traces are examined to split the events into two sets, one with relevant and one with irrelevant events. Then, the traces are used to find events whose frequency in the relevant set is greater than their frequency in the irrelevant set. The two techniques are applied independently, and the result of each technique is considered the result of an expert. Finally, a weight is assigned to each expert to produce the final results. An interesting fact is that the results were better when the weights were approximately the same for each expert (0.5). In [26], a technique is proposed that uses dynamic analysis to collect a single execution trace. After that, LSI is used to rank only the methods in the execution trace (not all methods in the source code). That way, the dynamic part (the collected execution trace) filters information for LSI (the static part). Finally, one more filter is added by using web mining techniques in order to exclude irrelevant elements.
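The weighted combination of “experts” used by such hybrid approaches can be sketched as follows. The scores and method names are invented for the example; the default w = 0.5 reflects the observation above that roughly equal weights gave the best results.

```python
def combine_experts(static_scores: dict, dynamic_scores: dict,
                    w: float = 0.5) -> list:
    """Affine combination of two relevance rankings: `w` weighs the
    static (e.g., LSI) expert against the dynamic (e.g., SPR) one.
    Methods scored by only one expert get 0 from the other."""
    methods = set(static_scores) | set(dynamic_scores)
    scored = {m: w * static_scores.get(m, 0.0)
                 + (1 - w) * dynamic_scores.get(m, 0.0)
              for m in methods}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical relevance scores in [0, 1] from each technique.
lsi = {"apply_discount": 0.9, "load_invoice": 0.4, "render_report": 0.1}
spr = {"apply_discount": 0.7, "round_money": 0.5}
print(combine_experts(lsi, spr)[0])  # → apply_discount
```

Setting w = 1.0 or w = 0.0 degenerates to the purely static or purely dynamic ranking, so the same function can express the weighting trade-off discussed below for choosing between technique families.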

Feature or Concept? The words feature and concept were used as synonyms for a long time; only recently have these terms begun to be used in specific ways. According to [13], features are a subset of concepts. A feature is a concept that can be exposed through a user interface and can be selected by a system user. In other words, a feature is a special kind of concept that describes the observable functionality of software while it is executed. Concept and feature location can be viewed as slightly different tasks. However, since features are a subset of concepts, feature location can be described as a specialization of concept location. That is, any concept location technique should be able to locate where to initiate the implementation of a feature, but not all concepts can be located using a feature location approach.

According to [18], dynamic techniques are better suited for feature location (understanding features as concepts that can be observed by the user when running the program with appropriate data), while static approaches can better locate the remaining concepts that are present in the source code but are not necessarily selectable by an end user. With this in mind, typifying the change request could help the developer select the best technique: using dynamic techniques (or hybrid techniques with more weight on the dynamic part) when the concepts searched for are features, and static techniques (or hybrid ones with more weight on the static side) in other cases.

CL Techniques Classification Besides the classic taxonomy that divides techniques into static and dynamic (more recently extended with hybrid techniques), here we consider other ways of classifying them.

Source Taking into account the source of information used during the application of the technique, we basically divide techniques into those that use source code artifacts and those that also incorporate non-executable artifacts (diagrams, help documentation, etc.). Most techniques do not use non-source code artifacts at all, even though IR-based models are good at extracting information from texts, and some of these artifacts are composed of text. As can be seen in Table 4.1, among the techniques we know, only Cognitive Assignment [21] uses both types of sources. Table 4.1 shows that IR-based techniques have been widely applied to source code artifacts. It can also be seen that the same techniques have not been used on non-source code artifacts, even though they could be applied to these types of elements.

Intermediate Representation As previously stated, intermediate representations are used to reduce the gap between domains of concepts. The representation is constructed in order to conduct the search in it, instead of performing it directly on


Table 4.1 Classification of IR-based techniques according to source code applicability.

Technique                               source code   non-source code
Software Reconnaissance                      x
Grep-based [20]                              x
ASDG [12]                                    x
VSM (IR) [22]                                x
LSI (IR) [18]                                x
Cognitive Assignment [21]                    x               x
Set-based trace recollection [24, 25]        x
Set-based with FCA & ASDG [27]               x
SPR [28]                                     x
PROMESIR [23]                                x
FCA + IR [29]                                x
Data fusion [26]                             x

the source code. As the representation is an abstraction of the source code, the search space is thus reduced. Traces are representations of particular executions of the software, so they can only be used by dynamic techniques. As Table 4.2 shows, every dynamic or hybrid technique makes use of traces. IR-based approaches use vectors that represent documents. The indexes of the vectors are the terms extracted from the documents, and the values in the vector are either the frequency of occurrence of the term or just a binary indication of whether the term appears in the document. Graphical representations seem to be used more by hybrid techniques. Both dynamic and static approaches are capable of obtaining dependencies that can easily be represented in graphs on their own. Another graphical representation used is the lattice. Grep-based techniques do not use any representation at all; the search is conducted directly in the source code. Table 4.2 shows a classification of the techniques according to the intermediate representation used.
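As a sketch of the document-vector representation just described, the following (illustrative, with a toy corpus and whitespace tokenization) builds both the frequency and the binary variants:

```python
from collections import Counter

def term_vectors(documents, binary=False):
    """Build term-by-document vectors: the indexes are the corpus terms,
    the values are term frequencies (or 0/1 when binary=True)."""
    vocab = sorted({t for doc in documents for t in doc.split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        if binary:
            vectors.append([1 if counts[t] else 0 for t in vocab])
        else:
            vectors.append([counts[t] for t in vocab])
    return vocab, vectors

# Two tiny "documents" of identifier-like terms (hypothetical corpus).
docs = ["save file buffer save", "load file parser"]
vocab, freq = term_vectors(docs)
_, binary_vecs = term_vectors(docs, binary=True)
# vocab -> ['buffer', 'file', 'load', 'parser', 'save']
# freq[0] -> [1, 1, 0, 0, 2]; binary_vecs[0] -> [1, 1, 0, 0, 1]
```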

94 INCREMENTAL CHANGE: THE WAY THAT SOFTWARE EVOLVES

Table 4.2 Classification of IR-based techniques according to intermediate representation.

Technique                       Graphical   Documents   Traces   None
Software Reconnaissance                                    x
Grep-based                                                          x
ASDG                                x           x
VSM (IR)                                        x
LSI (IR)                                        x
Cognitive Assignment                            x
Set-based trace recollection                               x
Set-based with FCA & ASDG           x                      x
SPR                                                        x
PROMESIR                                        x          x
FCA + IR                            x           x
Data fusion                         x           x          x

Granularity

As Table 4.3 shows, most techniques scope the search at the method or function level. Class or file level is typically considered by researchers to be too coarse, because some files can be really large. On the other hand, statement-level granularity is extremely fine-grained, and probably too expensive to achieve, while the benefits obtained would be minimal.

Table 4.3 Classification of IR-based techniques according to granularity of results.

Technique                       Method   Statement
Software Reconnaissance            x
Grep-based                                   x
ASDG                               x
VSM (IR)                           x
LSI (IR)                           x
Cognitive Assignment               x
Set-based trace recollection       x
Set-based with FCA & ASDG          x
SPR                                x
PROMESIR                           x
FCA + IR                           x
Data fusion                        x


Information Type

Source code, as the principal source used, exposes semantic information (mainly nouns and verbs), but also structural information (inheritance, dependencies, coupling, etc.). Later hybrid techniques tend to use both types, while earlier techniques choose one and stick to it. Table 4.4 shows the information type used by each technique revisited in this chapter.

Table 4.4 Classification of IR-based techniques according to the type of information used.

Technique                       Semantic   Structural
Software Reconnaissance                        x
Grep-based                          x
ASDG                                           x
VSM (IR)                            x
LSI (IR)                            x
Cognitive Assignment                x
Set-based trace recollection                   x
Set-based with FCA & ASDG                      x
SPR                                            x
PROMESIR                            x          x
FCA + IR                            x          x
Data fusion                         x          x

4.4 IMPACT ANALYSIS

Once the first place to implement the changes has been located, the next step is to predict the impact these changes will have on all software artifacts. This is called software change impact analysis. The implementation of the concept, in the last step, introduces some constraints on the related software components. For example, the change of a method’s signature or the addition of a component are changes that have effects on the rest of the program. The output of this activity is the impact set, a collection of components that need to be changed in order to maintain consistency in the source code; this is also important for management because it helps in budgeting the cost of a change.


Earlier papers describe change impact analysis as a task intended to better understand the scope and determine the possible effects of a change. The results of this task can be used to support the planning and management activities of software change [30]. In this way, impact analysis is an activity related to planning, analyzing, designing, and managing an incremental software change. This task also helps avoid the injection of bugs into the program. Just like concept location techniques, impact analysis techniques have traditionally been divided into static and dynamic. In [31], static techniques that use dependencies between components can be found. According to [32], previous papers show that techniques that use expert judgment or source code inspections are inaccurate or too expensive. Orso et al. [33] compare two dynamic techniques called PathImpact and CoverageImpact. Both need to instrument and execute the system with a test suite, and both conservatively include every method that can be affected. PathImpact collects and compresses the traces. It requires large amounts of disk space to save the traces, and also needs large amounts of time to compress them. CoverageImpact intersects a slice for all the definitions in the function with a particular execution of it. Both techniques show high precision. The principal weakness of these approaches is the resources they consume. In [34], two techniques that perform the analysis while the program is executed are presented. These techniques are categorized under the label of online techniques. One of these techniques is called PI_Allin1. As the name suggests, this algorithm is similar to PathImpact, since it also instruments the program and considers that the impact set of a function f includes every function called after f and every function that f can return into.
The difference is that while PathImpact collects the trace, compresses it, and then analyzes it, PI_Allin1 analyzes the trace as it is generated, by means of a matrix that records the functions that can be affected. These algorithms (PI and PI_Allin1) show an approach better suited for programs that use global variables. These techniques are considered pessimistic because the impact set includes every function that could be affected. More optimistic algorithms can be used in object-oriented programs, because the probability that a function called after the function f will be impacted, when it is not called directly or transitively by f, is much lower.
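The pessimistic, trace-based idea behind these dynamic algorithms can be sketched as follows. This is a simplified single-trace illustration, not the actual PathImpact or PI_Allin1 implementation (which work on compressed or online traces):

```python
def impact_set(trace, target):
    """Pessimistic dynamic impact set over one execution trace:
    once the target function is entered, every function subsequently
    entered, and every function still on the call stack (which the
    target can return into), is considered potentially impacted.
    Trace events are ("enter", name) or ("exit", name)."""
    stack, impacted, seen_target = [], set(), False
    for event, fn in trace:
        if event == "enter":
            if fn == target and not seen_target:
                seen_target = True
                impacted.update(stack)   # functions the target can return into
            if seen_target:
                impacted.add(fn)         # target itself and functions called after it
            stack.append(fn)
        else:
            stack.pop()
    return impacted

# Hypothetical trace: main calls f (which calls g), then calls h.
trace = [("enter", "main"), ("enter", "f"), ("enter", "g"), ("exit", "g"),
         ("exit", "f"), ("enter", "h"), ("exit", "h"), ("exit", "main")]
impacted = impact_set(trace, "f")
# impacted -> {"main", "f", "g", "h"}
```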

In [35], a static technique based on SVD is proposed and compared to PathImpact and CoverageImpact. The technique uses software change records to find components that change together. Then, this information is used to obtain an impact set based on historical evidence of change. The technique obtains lower precision, but its performance is much higher. Another static technique, presented in [32], uses historical change requests and revision comments to find textual similarities and, based on the results, computes the impact set. The weaknesses of static techniques are related to their poor precision compared to dynamic techniques. The trade-off here is related to performance, because static methods are far less resource intensive. Dynamic techniques usually show better precision, but are unable to find non-executable impacted software artifacts. By their nature, static techniques are able to find documents, help files, and configuration files that need to be changed. One big problem during change impact analysis is hidden dependencies. When classes A and B are not related, but class C is related to both and silently propagates a change, there is a hidden dependency. These are extremely difficult to find. In [36], the technique executes the whole test suite, extracts invariants from the traces, and uses them to search for hidden dependencies.
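The historical co-change idea used by these static techniques can be sketched with simple pair counting (the SVD step of [35] and the textual similarity of [32] are omitted; the change history and threshold are illustrative):

```python
from collections import defaultdict
from itertools import permutations

def cochange_counts(change_sets):
    """Count, for each ordered pair of components, how often they
    appeared together in the same change record."""
    counts = defaultdict(int)
    for cs in change_sets:
        for a, b in permutations(set(cs), 2):
            counts[(a, b)] += 1
    return counts

def historical_impact(counts, changed, min_support=2):
    """Components that co-changed with `changed` at least
    `min_support` times form the predicted impact set."""
    return {b for (a, b), n in counts.items() if a == changed and n >= min_support}

# Hypothetical change history: each entry is the set of files in one commit.
history = [["A.c", "B.c"], ["A.c", "B.c", "C.c"], ["A.c", "C.c"], ["B.c"]]
counts = cochange_counts(history)
predicted = historical_impact(counts, "A.c")
# predicted -> {"B.c", "C.c"}
```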

4.5 SUMMARY

Software changes can be considered from an evolution or a maintenance perspective. IC can be the way that software evolves, but it is also a path to the degradation of its architecture. It is extremely important to manage it in order to keep the software in the evolution stage and avoid an early move to maintenance. Another important distinction made in the current literature is the one between concept and feature. With features being concepts “touchable” by users during normal use of the application, dynamic approaches seem better suited for feature location. This chapter shows how some techniques can be used in different tasks related to software incremental change. Almost the same techniques applied to perform concept location can be applied in search of


the whole change impact set. Most techniques make use only of source code, but since every artifact that can be affected matters in impact analysis, we think more research can be done to incorporate non-executable components. The static and dynamic techniques in both IC activities show strengths and weaknesses. Combinations of the approaches seem to enhance the results in a significant way, so we believe several works using hybrid techniques are still to appear. It can also be observed that IR plays an important role as a basis for implementations that improve the ability to perform IC activities. Incremental changes are an unavoidable characteristic of software. Managing these changes in a systematic manner gives developers opportunities to enhance comprehension, evolution, and maintenance tasks.


REFERENCES [1] Rajlich, V. Changing the paradigm of software engineering. Communications of the ACM, 49(8): 2006; 67–70. ISSN 00010782. doi:10.1145/1145287.1145289.

[2] Lehman, M. & Belady, L. A. Program Evolution: Processes of software change. Academic Press: 1985.

[3] Lehman, M., Ramil, J., Wernick, P., Perry, D. & Turski, W. Metrics and laws of software evolution-the nineties view. In: Software Metrics Symposium, 1997. Proceedings., Fourth International: 1997, 20 –32. doi:10.1109/METRIC.1997.637156.

[4] Swanson, E. B. The dimensions of maintenance. In: Proceedings of the 2nd international conference on Software engineering, ICSE ’76. IEEE Computer Society Press, Los Alamitos, CA, USA: 1976, 492–497.

[5] Rajlich, V. & Bennett, K. A staged model for the software life cycle. Computer, 33(7): 2000; 66 –71. ISSN 0018-9162. doi: 10.1109/2.869374.

[6] Febbraro, N. & Rajlich, V. The Role of Incremental Change in Agile Software Processes. In: AGILE 2007: 2007, 92 –103. doi:10.1109/AGILE.2007.58.

[7] Godfrey, M. & German, D. The past, present, and future of software evolution. In: Frontiers of Software Maintenance, 2008. FoSM 2008.: 2008, 129 –138. doi:10.1109/FOSM.2008.4659256.

[8] Rajlich, V. & Gosavi, P. A case study of unanticipated incremental change. In: Software Maintenance, 2002. Proceedings. International Conference on: 2002. ISSN 1063-6773, 442–451. doi:10.1109/ICSM.2002.1167801.

[9] Rajlich, V. & Gosavi, P. Incremental change in object-oriented programming. Software, IEEE, 21(4): 2004; 62–69. ISSN 0740-7459. doi:10.1109/MS.2004.17.


[10] Abebe, S. & Tonella, P. Natural Language Parsing of Program Element Names for Concept Extraction: 2010. ISSN 1063-6897, 156 –159. doi:10.1109/ICPC.2010.29.

[11] Ratiu, D. & Heinemann, L. Utilizing Web Search Engines for Program Analysis: 2010. ISSN 1063-6897, 94 –103. doi:10.1109/ ICPC.2010.26.

[12] Chen, K. & Rajlich, V. Case study of feature location using dependence graph. In: Program Comprehension, 2000. Proceedings. IWPC 2000. 8th International Workshop on: 2000, 241–247. doi:10.1109/WPC.2000.852498.

[13] Chen, K. & Rajlich, V. Case Study of Feature Location Using Dependence Graph, after 10 Years. In: Program Comprehension (ICPC), 2010 IEEE 18th International Conference on: 2010. ISSN 1063-6897, 1 –3. doi:10.1109/ICPC.2010.40.

[14] Buckner, J., Buchta, J., Petrenko, M. & Rajlich, V. JRipples: a tool for program comprehension during incremental change. In: Program Comprehension, 2005. IWPC 2005. Proceedings. 13th International Workshop on: 2005. ISSN 1092-8138, 149–152. doi:10.1109/WPC.2005.22.

[15] Rajlich, V. & Wilde, N. The role of concepts in program comprehension. In: Program Comprehension, 2002. Proceedings. 10th International Workshop on: 2002. ISSN 1092-8138, 271 – 278. doi:10.1109/WPC.2002.1021348.

[16] Rajlich, V. Intensions are a key to program comprehension. In: Program Comprehension, 2009. ICPC ’09. IEEE 17th International Conference on: 2009. ISSN 1063-6897, 1 –9. doi:10.1109/ICPC. 2009.5090022.

[17] Biggerstaff, T., Mitbander, B. & Webster, D. The concept assignment problem in program understanding. In: Software Engineering, 1993. Proceedings., 15th International Conference on: 1993. ISSN 0270-5257, 482–498. doi:10.1109/ICSE.1993.346017.


[18] Marcus, A., Sergeyev, A., Rajlich, V. & Maletic, J. An information retrieval approach to concept location in source code: 2004. ISSN 1095-1350, 214–223. doi:10.1109/WCRE.2004.10.

[19] Letovsky, S. & Soloway, E. Delocalized Plans and Program Comprehension. Software, IEEE, 3(3): 1986; 41–49. ISSN 0740-7459. doi:10.1109/MS.1986.233414.

[20] Marcus, A., Rajlich, V., Buchta, J., Petrenko, M. & Sergeyev, A. Static techniques for concept location in object-oriented code: 2005. ISSN 1092-8138, 33–42. doi:10.1109/WPC.2005.33.

[21] Cleary, B., Exton, C., Buckley, J. & English, M. An empirical analysis of information retrieval based concept location techniques in software comprehension. Empirical Software Engineering, 14(1): 2009; 93–130.

[22] Zhao, W., Zhang, L., Liu, Y., Sun, J. & Yang, F. SNIAFL: towards a static non-interactive approach to feature location. In: International Conference on Software Engineering (ICSE 04). ACM/IEEE: 2004.

[23] Poshyvanyk, D., Gueheneuc, Y.-G., Marcus, A., Antoniol, G. & Rajlich, V. Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval. Software Engineering, IEEE Transactions on, 33(6): 2007; 420–432. ISSN 0098-5589. doi:10.1109/TSE.2007.1016.

[24] Wilde, N. & Casey, C. Early field experience with the Software Reconnaissance technique for program comprehension. In: Software Maintenance 1996, Proceedings., International Conference on: 1996, 312–318. doi:10.1109/ICSM.1996.565034.

[25] Gallagher, K. & Lyle, J. Using program slicing in software maintenance. Software Engineering, IEEE Transactions on, 17(8): 1991; 751–761. ISSN 0098-5589. doi:10.1109/32.83912.

[26] Revelle, M., Dit, B. & Poshyvanyk, D. Using Data Fusion and Web Mining to Support Feature Location in Software.


In: Program Comprehension (ICPC), 2010 IEEE 18th International Conference on: 2010. ISSN 1063-6897, 14–23. doi:10.1109/ICPC.2010.10.

[27] Koschke, R. & Quante, J. On dynamic feature location. In: Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering, ASE ’05. ACM, New York, NY, USA: 2005. ISBN 1-58113-993-4, 86–95. doi:10.1145/1101908.1101923.

[28] Antoniol, G. & Gueheneuc, Y. G. Feature identification: A novel approach and a case study. In: Software Maintenance, 2005. ICSM ’05. Proceedings of the 21st IEEE International Conference on: 2005, 357–366.

[29] Poshyvanyk, D. & Marcus, A. Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code: 2007. ISSN 1063-6897, 37 –48. doi:10.1109/ICPC.2007.13.

[30] Bohner, S. Impact analysis in the software change process: a year 2000 perspective. In: Software Maintenance 1996, Proceedings., International Conference on: 1996, 42 –51. doi:10.1109/ICSM. 1996.564987.

[31] Arnold, R. & Bohner, S. Impact analysis - towards a framework for comparison. In: Software Maintenance, 1993. CSM-93, Proceedings., Conference on: 1993, 292–301. doi:10.1109/ICSM.1993.366933.

[32] Canfora, G. & Cerulo, L. Impact analysis by mining software and change request repositories. In: Software Metrics, 2005. 11th IEEE International Symposium: 2005. ISSN 1530-1435, 9 – 29. doi:10.1109/METRICS.2005.28.

[33] Orso, A., Apiwattanapong, T., Law, J., Rothermel, G. & Harrold, M. An empirical comparison of dynamic impact analysis algorithms. In: Software Engineering, 2004. ICSE 2004. Proceedings. 26th International Conference on: 2004. ISSN 0270- 5257, 491 – 500. doi:10.1109/ICSE.2004.1317471.


[34] Breech, B., Tegtmeyer, M. & Pollock, L. A Comparison of Online and Dynamic Impact Analysis Algorithms. In: Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on: 2005.

[35] Sherriff, M. & Williams, L. Empirical Software Change Impact Analysis using Singular Value Decomposition. In: Software Testing, Verification, and Validation, 2008 1st International Conference on: 2008, 268–277. doi:10.1109/ICST.2008.25.

[36] Vanciu, R. & Rajlich, V. Hidden dependencies in software systems. In: Software Maintenance (ICSM), 2010 IEEE International Conference on: 2010. ISSN 1063-6773, 1–10. doi:10.1109/ICSM.2010.5609657.

Software Evolution Supported by Information Retrieval

Angélica Veloza-Suan Mario Linares-Vásquez Henry Roberto Umaña-Acosta

ABSTRACT

Information Retrieval (IR) techniques have traditionally been used in the analysis of free text and documents, but with the rise of new work areas proposed by software evolution research, IR techniques are becoming a necessary support for the activities of the evolutionary model of software development; some of them, such as incremental change and software comprehension, deal with problems related to information extraction and querying in source code. One of the challenges of software evolution activities is that software evolution analysis requires dealing with large-scale repositories, and IR is a good method for extracting information from large repositories. Thus, this chapter shows how the implementations of some activities of the evolutionary model are supported by IR techniques.

5.1 INTRODUCTION

The evolutionary model of software development [1] appears as a paradigm in which the development process is addressed from the perspective of adaptation and not from the standpoint of prediction, as in traditional development. Software is an evolving product, with a life cycle [2] in which incremental change (IC) is continuous. This incremental change requires a set of activities that support the development and


maintenance of software, in order to ensure application quality and to facilitate its implementation by any team member (newcomer or expert) at any stage of the development process. Thus, incremental change has positioned itself as a focus of research in software development, with activities such as concept location, automatic categorization of repositories, automatic summarization, and traceability, among others. Because the inputs of incremental change are software artifacts (documentation and code) or some kind of query that represents the change, IC activities require the use of text analysis techniques like those used in information retrieval. The information retrieval approach is used in software evolution tasks in the same way: there is a query, then retrieval is executed on the information source using that query, and finally the relevant results are presented, although some implementation details differ for each task, such as the information sources, the format in which results are presented, and the way in which the query is built. The purpose of this chapter is to show how the techniques and concepts of information retrieval are used in the context of the activities/tasks of the evolutionary model of software development. The structure of the chapter is as follows: Section 2 describes the purpose of information retrieval and the techniques generally used for document analysis. Section 3 describes the activities of the evolutionary model of software development, with emphasis on those in which information retrieval is relevant. Section 4 presents how information retrieval is applied to software evolution activities. Section 5 presents the summary.

5.2 INFORMATION RETRIEVAL

Information Retrieval (IR) addresses issues of the representation, storage, organization of, and access to information [3]. IR attempts to model, design, and implement systems capable of providing fast and efficient access to large amounts of information, in order to present to the user the most relevant elements from a collection or repository of objects (documents, multimedia, images). The relevance of the objects is estimated based on a query that expresses the needs of the user [4].


The overall operation of IR systems is based on searching for occurrences of terms in a repository of objects, which are modeled using an intermediate numerical representation. This representation symbolizes the importance (weight) of the terms with regard to the objects. Based on those weights, a numerical calculation is performed, and the result is the degree of similarity (relevance) between the terms and the query of the user; at the end of the process, the most relevant objects are displayed in a ranked list. The general architecture of an IR system (see Figure 5.1) consists of a repository, an indexing module, a query module, and a ranking module. The repository holds the objects on which the retrieval will be done. The indexing module creates an intermediate representation of the corpus in order to make searching large repositories faster. The intermediate representation usually reduces the dimensionality and represents the latent semantics of the corpus. The query module finds the documents that best match the query by using the intermediate representation. The ranking module prioritizes the retrieved objects by using a similarity measure. Then the top-ranked results are displayed to the user.


Figure 5.1 General architecture of an IR system.

Several models have been developed to address the issues of information retrieval. Figure 5.2 shows a taxonomy of information retrieval models. The models are described below.


[Figure 5.2 groups the models as: Classical Models (Boolean Retrieval, Vector Space Model, Probabilistic Models); Alternative and Hybrid Models (Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA)); and Web Models (Page Rank, Hypertext Induced Topic Selection (HITS)).]

Figure 5.2 Information retrieval models.

5.2.1 Classic Models

Boolean Retrieval

It is a model based on set theory and mathematical logic. The documents are represented as sets of binary weights of the index terms, and queries are represented as Boolean expressions. The relevant documents are retrieved by applying the logical operations of the query to the document representations; that is, the documents whose sets of terms satisfy the query will be relevant to the user [5].
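A minimal sketch of Boolean retrieval over a toy collection (the document set is hypothetical, and Python's `and`/`or`/`not` stand in for the Boolean operators of the query language):

```python
def boolean_retrieve(docs, query):
    """Boolean model: each document is a set of terms; the query is a
    boolean expression over term names, true for a document when the
    term belongs to its set."""
    results = []
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        names = {t for t in query.replace("(", " ").replace(")", " ").split()
                 if t not in ("and", "or", "not")}
        env = {t: (t in terms) for t in names}
        # Evaluate the query with term names bound to membership booleans.
        if eval(query, {"__builtins__": {}}, env):
            results.append(doc_id)
    return results

docs = {1: "software change impact", 2: "change request location", 3: "impact analysis"}
result = boolean_retrieve(docs, "change and not impact")
# result -> [2]
```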

Vector Space Model

It is a model in which the documents and the query are represented as vectors of terms. The documents are represented as a matrix in which the columns are the documents, the rows are the terms, and each position in the matrix contains a value that represents the number of occurrences of the term in the document. The calculation of similarity is based on the result of the dot product between the query vector and the document vector [6].
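The dot-product similarity can be sketched as follows; normalizing by the vector lengths gives the usual cosine measure (the vocabulary and counts are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity: normalized dot product of two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Columns of the term-document matrix, here stored one vector per document;
# the query is a vector of term counts over the same vocabulary.
vocab = ["change", "impact", "location"]
doc_vectors = {"d1": [2, 1, 0], "d2": [0, 0, 3], "d3": [1, 1, 1]}
query = [1, 1, 0]
ranked = sorted(doc_vectors, key=lambda d: cosine(doc_vectors[d], query), reverse=True)
# ranked -> ["d1", "d3", "d2"]
```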

Probabilistic Models

The foundation of probabilistic models is to determine the probability that a given document satisfies a query. This model requires an initial hypothesis to establish relevance, so there is no need to take into account the frequency of the terms [5, 7].

5.2.2 Alternative and Hybrid Models

Latent Semantic Indexing (LSI)

It is an extension of the Vector Space Model. LSI applies Singular Value Decomposition (SVD) to the term-document matrix, reducing the dimensionality of the document representation and mitigating problems of ambiguity and synonymy. Thus, LSI captures latent semantics by modeling associations between the terms and concepts of the documents. LSI is based on the fact that terms used in similar contexts have similar meanings [5,7,8].
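A sketch of the LSI idea using a truncated SVD on a toy term-by-document matrix (assuming NumPy is available; the matrix and the choice k = 2 are illustrative):

```python
import numpy as np

# Tiny term-by-document matrix (rows = terms, columns = documents).
A = np.array([[1., 1., 0.],
              [1., 0., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

# Truncated SVD: keep k latent concepts; documents are re-expressed
# in the reduced concept space instead of the raw term space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = np.diag(s[:k]) @ Vt[:k, :]   # each column: one document in k dimensions

# A query vector over the terms is folded into the same space: q_k = U_k^T q.
q = np.array([1., 1., 0., 0.])        # query containing the first two terms
q_k = U[:, :k].T @ q
sims = [float(q_k @ docs_k[:, j] / (np.linalg.norm(q_k) * np.linalg.norm(docs_k[:, j])))
        for j in range(A.shape[1])]
# Document 0 contains exactly the query terms, so it ranks first.
```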

Latent Dirichlet Allocation (LDA)

It is a Bayesian network method in which the distribution of terms in each document is generated by a Dirichlet probability distribution. The documents are modeled using a Dirichlet random variable over a set of predefined categories, where each category is a multinomial distribution over the terms. The probability of a document is given by the likelihood that its words appear in a category [8,9].

5.2.3 Web Models

Web models quantify the importance of web pages using link analysis, search, and visualization, among others. Link analysis is based on the structure of the Web, which is seen as a directed graph of pages and links.


Page Rank

This technique analyzes the links on each web page and assigns a value to the page through a probability distribution, which depends on the number and importance of the links that the page has. Then, given a query made by the user, a score is computed by combining a number of features, such as proximity of the terms, with the page rank of the web page, in order to return the pages most significant to the query [5, 7].
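A power-iteration sketch of the PageRank computation on a toy link graph (the damping factor 0.85 is the commonly cited default; the graph itself is illustrative):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)   # rank flows along outgoing links
                for q in outs:
                    new[q] += damping * share
            else:                             # dangling page: spread rank evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(links)
# "c" is linked by both "a" and "b", so it ends with the highest rank
```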

Hypertext Induced Topic Selection (HITS)

This technique is based on two indicators to assess and rank the importance of a Web page. The first indicator, the authority, is based on the hub values of the pages that link to it. The second indicator, the hub, is based on the authority values of the sites it refers to, which means that it takes into account the quality of the information that would be obtained by following the links the page has to other pages. A hub value is high when the page links to many pages with high authority scores [5,7].
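The mutual reinforcement between hub and authority scores can be sketched as follows (the link graph is illustrative):

```python
import math

def hits(links, iters=50):
    """HITS: a page's authority sums the hub scores of pages linking to it;
    its hub score sums the authority scores of the pages it links to.
    Scores are normalized after each iteration."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        for p in pages:   # authority from incoming hubs
            auth[p] = sum(hub[q] for q, outs in links.items() if p in outs)
        for p in pages:   # hub from the authorities it points to
            hub[p] = sum(auth[q] for q in links.get(p, []))
        for d in (auth, hub):
            norm = math.sqrt(sum(v * v for v in d.values()))
            for p in d:
                d[p] /= norm
    return hub, auth

# Two hub-like pages pointing at two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
# "a1" is linked by both hubs, so it gets the highest authority score
```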

5.3 SOFTWARE EVOLUTION ACTIVITIES

The concept of software evolution arose with software methodologies such as Evo [10], Spiral [11], and Rajlich’s Staged Model [2]. Additionally, the high rate of adoption of agile methodologies and the decline of the waterfall process have strengthened the concept of evolution as an essential feature of software, which must be embraced by the development process. Activities in software evolution can be grouped into Incremental Change (IC), Software Comprehension, Mining Software Repositories (MSR), Software Visualization, Reverse Engineering & Reengineering, and Refactoring. Each of them is described below.

5.3.1 Incremental Change

It is the foundation of software evolution and the manifestation of the volatility of requirements in software development processes. IC starts with a change request (new feature, bug, enhancement) and ends with the


implementation of the request on the existing code. According to [2, 12], IC includes the following activities:

• Concept extraction. Change requests are described using domain concepts and can be written as user stories, feature lists, or free-text specifications. Thus, concept extraction consists in extracting the relevant domain concepts from the change request specifications.

• Concept location. It consists in locating the places where the relevant domain concepts are implemented in the existing code. These places are the candidate locations where the change request can be implemented.

• Impact analysis. It consists in identifying the set of classes that can be affected by implementing the change request.

• Prefactoring. It is an opportunistic refactoring aimed at preparing the architecture for the change request implementation.

• Actualization and change propagation. It consists in implementing the change request and fixing inconsistencies or bugs.

• Postfactoring. It is aimed at removing “bad smells” that could have been injected by the change implementation.

5.3.2 Software Comprehension

It is the process by which developers understand software artifacts using domain, semantic, and syntactic knowledge, in order to build a mental model of the software and its relationship with the environment. The software comprehension process includes building models from software artifacts and the cognitive processes of the stakeholders. The tasks associated with software comprehension are:

• Summarization of software artifacts.

• Software categorization.

• Traceability recovery between requirements specifications and source code.


For further information refer to [13] and [14].

5.3.3 Mining Software Repositories

The aim of MSR is to analyze software repositories (artifacts, issue reports) in order to extract relevant information for learning, understanding, modeling, and managing evolution in software systems. The most important MSR tasks are:

• Clones detection in code.

• Analysis of developers’ contributions.

• Traceability analysis between changes implementation and change requests.

• Automatic assignment of tasks.

• Software defect prediction.

• Analysis of patterns in the code and in the process.

See [15] for further information on MSR.

5.3.4 Software Visualization

Visualization is the process of transforming information into visual representations in order to improve its comprehension; software visualization consists in building visual representations of several software aspects by using metaphors. For example, Code City is a 3D metaphor in which several software metrics are visualized as a city by using polycylinders and containers [16].

5.3.5 Reverse Engineering & Reengineering

Reengineering is the analysis and modification of software systems in order to make a new implementation [17]. The process includes a reverse engineering stage which provides models of the systems. Reengineering is usually applied to legacy systems for migration purposes, such as changing the database engine, programming language, or architecture.


5.3.6 Refactoring

It consists in applying transformations to the code to improve the internal structure of the software, while preserving its features and external behavior.

5.4 INFORMATION RETRIEVAL AND SOFTWARE EVOLUTION

Figure 5.3 shows the activities in software evolution that use information retrieval techniques. These activities are explained below.

[Figure 5.3 lists: Concept/Feature Location; Mining Software Repositories (MSR), with Identification and Elimination of Clones, Static Source Code Analysis, and Other Applications; Automatic Categorization of Source Code Repositories; Summarization of Software Artifacts; and Traceability Recovery.]

Figure 5.3 Information retrieval and software evolution activities.

5.4.1 Concept/Feature Location

Methods for concept/feature location based on IR share the following general workflow:

1. Preprocess the source code and documentation.

2. Create the corpus from the source code and the documentation previously preprocessed.

3. Index the corpus and create an intermediate representation. This representation is modeled as a relationship between terms (variables, methods, comments) and documents (class files) of the corpus. The indexing process includes the decomposition of the entire search space into documents; according to the criterion of


granularity and the kind of language in this measure, each docu- ment can represent a package, a class or a method.

4. Formulate the query based on words that represent the concepts/features. The query is formulated manually or with assistance from the model.

5. Run the query on the corpus.

6. Retrieve the results and display them in a ranked list.
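The workflow above can be sketched with a plain vector space model; the cited works use LSI, which would add an SVD step over the term-document matrix. The corpus, method names, and query below are invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Steps 1-2: lowercase and split identifiers/comments into terms
    return text.lower().replace("_", " ").split()

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 2-3: corpus where each "document" is one method (toy example)
corpus = {
    "Account.deposit":  "deposit amount update balance log transaction",
    "Account.withdraw": "withdraw amount check balance insufficient funds",
    "Report.render":    "render html report format table",
}
index = {doc: Counter(tokenize(text)) for doc, text in corpus.items()}

# Steps 4-6: formulate the query, run it, display the ranked list
query = Counter(tokenize("update account balance"))
ranked = sorted(index, key=lambda d: cosine(query, index[d]), reverse=True)
print(ranked[0])  # the method most relevant to the feature
```

With this toy data, `Account.deposit` ranks first because it shares the most terms with the query; an LSI variant would compare the vectors in the reduced latent space instead.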

In [18], LSI is used for concept location. The corpus is built from the identifiers and the comments in the code; each function, and each block of code external to the functions, is represented as a document. [19] extends the model in [18] by building a lattice of concepts from the list of relevant documents generated by LSI. The lattice is built using Formal Concept Analysis (FCA) and is used to select the most relevant attributes of the documents. In [20], LSI and SPR (Scenario-based Probabilistic Ranking) are used for feature location; the two techniques generate their lists of relevant documents independently, acting as two experts. Finally, the two lists are combined by a linear transform and a translation function to display a single ranked list with the location. In [21], the authors present a state of the art of concept location techniques based on information retrieval and additionally present a new model, called the cognitive mapping technique, in which location is based on query expansion. The expansion is based on information flows and the co-occurrence of information between artifacts.

5.4.2 Mining Software Repositories (MSR)

IR techniques are used in MSR for several tasks, such as clone detection, clone removal, and static analysis.


Clone Detection and Removal

The general process for clone detection based on IR techniques is:

1. Preprocess the code of interest and split it into units of source code (preprocessing).

2. Build an intermediate representation of the code, through the application of mining and processing techniques (transformation).

3. Compare the transformed code units in order to find similarities between them and to detect the clones (clones detection).

4. Map the clones to the original code (mapping).

5. Display the clones so that the user can filter out false positives (filtering).

6. Cluster the clones into classes or families, to reduce the amount of data and to facilitate the analysis process (clustering).
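A minimal sketch of the transformation and detection stages, under simplifying assumptions: each unit is reduced to its set of identifiers and compared with Jaccard similarity, a crude stand-in for the LSI comparison used in the cited work. The code units and the threshold are invented for the example.

```python
import re
from itertools import combinations

def identifiers(code):
    # Transformation: reduce each unit to its set of identifiers
    return set(re.findall(r"[A-Za-z_]\w*", code))

def similarity(a, b):
    # Jaccard similarity between identifier sets (a simple stand-in
    # for the LSI-based comparison in the literature)
    return len(a & b) / len(a | b) if a | b else 0.0

# Preprocessing: the code of interest, already split into units
units = {
    "copy_v1": "total = 0\nfor price in items: total += price",
    "copy_v2": "total = 0\nfor price in goods: total += price",
    "other":   "def render(page): return page.strip()",
}
vectors = {name: identifiers(code) for name, code in units.items()}

# Detection: compare all pairs and keep candidates above a threshold,
# leaving false-positive filtering and clustering to later stages
THRESHOLD = 0.5
clones = [(a, b) for a, b in combinations(vectors, 2)
          if similarity(vectors[a], vectors[b]) >= THRESHOLD]
print(clones)  # candidate clone pairs for manual filtering
```

Only the near-identical pair survives the threshold; mapping the pairs back to file positions and clustering them into families would complete the pipeline.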

In [22], LSI is applied in the transformation stage in order to find similarities between code segments. This work is limited to the comparison of comments and identifiers, returning two pieces of code as potential clones, or a cluster of potential clones, in which there is a high degree of similarity between the sets of identifiers and comments. [23] shows the benefits of grouping clones of classes using LSI to improve their comprehension (i.e., the relations among the clones of classes). [24] presents a comprehensive qualitative comparison of techniques and tools for clone detection, and [25] argues for the need to include information retrieval techniques in MSR research.

Static Analysis

In [26], the vector space model and machine learning techniques are applied to free-text records (logs) to identify performance problems in the code without manual intervention. The methodology proposed in [26] follows the steps listed below:


1. Find messages inside the log that present particular scenarios (e.g., messages that have been reported many times, or that have many different values and appear in multiple types of messages).

2. Group the messages based on the values of the variables from the previous step.

3. For each group of messages, create a vector that counts the number of occurrences of each message in the group.
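The three steps might be sketched as follows; the log lines and the `key=value` message format are assumptions made for the example, and the grouping is far simpler than the feature extraction in the cited work.

```python
import re
from collections import Counter, defaultdict

logs = [
    "open file=a.txt status=ok",
    "open file=a.txt status=ok",
    "open file=b.txt status=error",
    "close file=a.txt",
]

# Step 1: extract the message template and the variable values
def parse(line):
    template = re.sub(r"=\S+", "=*", line)      # "open file=* status=*"
    values = tuple(re.findall(r"=(\S+)", line))  # ("a.txt", "ok")
    return template, values

# Step 2: group messages by the values of their variables
groups = defaultdict(list)
for line in logs:
    template, values = parse(line)
    groups[values].append(template)

# Step 3: one occurrence-count vector per group
vectors = {values: Counter(templates) for values, templates in groups.items()}
for values, vec in vectors.items():
    print(values, dict(vec))
```

Each resulting vector counts how often every message template occurs for a given combination of variable values; anomaly detection would then run over these vectors.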

Other Applications

Other applications that involve IR and MSR are presented in [15] and [27]. In [27], software changes are automatically classified based on their descriptions, using data available in version control systems to discover qualitative and quantitative information on several characteristics of software development. [28] proposes a method to predict the impact of change requests on source code; the probabilistic IR model is used for linking the descriptions of change requests with the set of historical revisions of source code affected by similar change requests made previously. [29] and [30] propose a tool that uses the vector space model to infer (as an implicit project memory) the possible relationships between objects stored in a project, and to recommend relevant artifacts to developers working on a given task. [31] proposes a technique that evaluates and recommends open source applications containing relevant features in their implementations, based on their functionality; to do this, it combines probabilistic reordering techniques and program analysis. In [32], the authors use version control histories to estimate the association between changes of software modules, using the probabilistic model. [33] presents a model in which LSI is used to extract the vocabulary of the source code and to automatically suggest the developers with the most experience handling error reports. In [34], data mining techniques and the probabilistic model are combined to analyze the version history and then suggest additional changes that must be made. The objective of the analysis is to prevent errors due to incomplete changes, by detecting couplings between elements that are not detected by program analysis. [35] uses LDA in order to find bugs automatically, and [36] presents a work based on LDA for the automatic categorization of software systems implemented in different programming languages.

5.4.3 Automatic Categorization of Source Code Repositories

Automatic categorization of software is achieved using IR through an overall process like this:

1. Define the categories, manually or automatically using IR.

2. Index the corpus, i.e., assign the projects of the repository to the categories.

3. Build the intermediate representation.

4. Categorize the new project using the similarity between the new project and the elements in the intermediate representation; the category assigned to the new project is that of the most similar element in the corpus.
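A sketch of step 4 as nearest-neighbour classification over term-frequency vectors with cosine similarity; the projects, categories, and identifier lists are invented, and a real system would build the intermediate representation with LSA or LDA.

```python
import math
from collections import Counter

def vec(text):
    # Term-frequency vector over the identifiers of a project
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Indexed corpus: projects already assigned to categories,
# each represented by identifiers extracted from its code
corpus = {
    "sqlfly": ("database", vec("query table index transaction sql")),
    "httpd":  ("web",      vec("request response socket header http")),
    "pgtool": ("database", vec("table schema sql vacuum backup")),
}

# Categorize a new project: inherit the category of its most
# similar element in the corpus (nearest neighbour by cosine)
new_project = vec("parse sql query plan table")
best = max(corpus, key=lambda p: cosine(new_project, corpus[p][1]))
print(corpus[best][0])  # predicted category
```

Here the new project lands in the "database" category because it is closest to `sqlfly`; with LSA, the comparison would happen in the reduced concept space rather than over raw term counts.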

In [37], the categorization problem consists in indexing the components of a code library so that they can be grouped by similar characteristics. The indexing is performed by generating a profile of each component, extracting the representative characteristics from the available documentation (manuals and comments). Each profile is a list of binary lexical relations between the words in the text that carry the greatest amount of information. According to the retrieval scheme selected by the user, retrieval can be done with a classical vector space model or with a hierarchy of clusters of profiles. [38] proposes a tool called MUDABlue, which allows automatic and multiple categorization by analyzing the application source code. An application is modeled as a document, and the identifiers in the code as the words of the document. The categories are modeled as clusters and are defined automatically from the code, using LSA over the identifiers and then grouping them by clustering. The similarity measure used is the cosine distance. The identifier of each category is composed of the ten most representative terms of its cluster. In [39], the identifiers are preprocessed to obtain legible category names. The categories are built using the identifiers and the comments in the code. The intermediate representation is built with LDA, and the categories are grouped into clusters using the cosine similarity measure, as in the case of MUDABlue [38].

5.4.4 Summarization of Software Artifacts

Summaries can be generated from the documentation or from the code. Additionally, they can be general summaries or summaries relevant to a query made by the user. Generating query-relevant summaries is essentially a text retrieval task, handled basically as the extraction of relevant words or phrases (as in the case of automatic categorization). In the case of general summaries, artifacts are analyzed without a user query leading the process. The general process for generating summaries by using IR is:

1. Decompose documents in phrases or paragraphs (preprocessing).

2. Build the matrix of occurrences of words per sentence and per document; then apply transformations such as SVD or another IR technique (intermediate representation).

3. Calculate the relevance of each sentence or paragraph to the document, and construct the summary with the sentences or paragraphs of highest relevance. If the summary must be relevant to a query, the relevance of each sentence or paragraph is calculated against the query (summary generation).
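The process can be sketched with raw term frequencies standing in for the SVD-based relevance measure of the cited approach; the document is a toy example and the "summary" is just the single highest-scoring sentence.

```python
from collections import Counter

document = (
    "The cache stores parsed pages. "
    "The cache evicts old pages when memory is low. "
    "Logging is optional."
)

# Step 1: decompose the document into sentences
sentences = [s.strip() for s in document.split(".") if s.strip()]

# Step 2: occurrence counts of words over the whole document
# (a full LSI pipeline would apply SVD to the word/sentence matrix)
freq = Counter(document.lower().replace(".", "").split())

# Step 3: score each sentence by the average document-wide
# frequency of its words, and keep the most relevant one
def score(sentence):
    tokens = sentence.lower().split()
    return sum(freq[t] for t in tokens) / len(tokens)

summary = max(sentences, key=score)
print(summary)
```

For a query-relevant summary, each sentence would instead be scored against the query vector rather than against the whole document.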

[40] presents a model that creates general text summaries using LSI. [41] and [42] build summaries based on the source code.


5.4.5 Traceability Recovery

The general approach to traceability recovery using IR includes several processes over the documents and the source code, prior to the analysis of their similarity. Free-text documents are indexed by a vocabulary extracted from them, using the following preprocessing steps:

1. Apply stemming and remove stop-words.

2. Convert all letters to lowercase.

3. Convert plural words to singular, and change verbs to the infinitive.

The query is built with the identifiers in the source code. The steps for building the query are:

1. Split composite identifiers.

2. Apply the same process used for the free-text documents.

After these preprocessing steps, the classifier computes the similarity between queries and documents, and returns a ranked list of documents for each component of the source code. In [43], the vector space model and probabilistic models are used to recover traceability between source code and free-text documents. In [8], the authors extend the vector space model and compare its performance with LSI and the probabilistic model. In [44], several variants of the vector space model are applied in order to recover traceability between requirements and UML artifacts, source code, and test cases. In [45], LSI is used to recover traceability between free-text documents and source code. In [46] and [47], traceability recovery is performed between different types of artifacts (interaction diagrams, test cases, use cases, among others) by using LSI. [48] proposes an improvement to the performance of dynamic requirements traceability, incorporating three strategies into a probabilistic retrieval algorithm. Finally, [49] analyzes the equivalence of some traceability recovery methods; the techniques analyzed were Jensen-Shannon, the vector space model, LSI, and LDA.
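A sketch of the whole pipeline, with toy requirements and invented identifiers: composite identifiers are split, the same stop-word removal is applied to query and documents (stemming is omitted), and cosine similarity produces the ranked list.

```python
import math
import re
from collections import Counter

def split_identifier(name):
    # Split composite identifiers: camelCase and snake_case
    parts = re.sub(r"([a-z])([A-Z])", r"\1 \2", name).replace("_", " ")
    return parts.lower().split()

STOP = {"the", "a", "to", "is"}  # tiny illustrative stop-word list

def preprocess(text):
    # Same preprocessing as for free-text documents: lowercase
    # and remove stop-words (stemming omitted in this sketch)
    return Counter(t for t in text.lower().split() if t not in STOP)

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# Free-text documents (e.g., requirements), indexed as vectors
docs = {
    "REQ-1": preprocess("the user account balance is updated after deposit"),
    "REQ-2": preprocess("a report is rendered to html"),
}

# Query built from the identifiers of one source-code component
identifiers = ["updateAccountBalance", "deposit_amount"]
query = Counter(t for name in identifiers for t in split_identifier(name))

# Ranked list of documents for this component
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # most likely trace links first
```

The component's identifiers trace most strongly to REQ-1; in practice, a similarity threshold or a fixed cut point on the ranked list decides which links are reported.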


5.5 SUMMARY

The aim of software evolution tasks is to support the software development process by improving the mechanisms for software comprehension, change implementation, maintenance, and task assignment in development teams. For example, analyzing the explicit/implicit semantics in source code is a cross-cutting challenge for software evolution tasks, and it is one of the classic problems in Information Retrieval. Using IR techniques as a support for software evolution tasks has contributed to the development of research fields in software evolution. The general model of Information Retrieval has been applied naturally to the software evolution tasks (except for software visualization). The way IR techniques are applied depends on how the query is used in each task. For example, in concept location the query is defined by the user or assisted by the model in an automatic manner; in clone detection, the query is built automatically from software artifacts. The integration between IR and software evolution has been achieved because software artifacts can be considered documents. Software evolution tasks are complex for machine learning techniques; but if software artifacts are treated as documents, then intermediate representations can be generated, and IR techniques can be applied to extract information through queries that represent the user's needs. IR has provided software evolution with a framework for resolving issues related to the tasks and the nature of research problems in software evolution. In spite of the use of IR techniques in software evolution, there are still some tasks, such as refactoring, software visualization, and some MSR tasks, which are not yet supported by IR; but tasks such as incremental change, software categorization, and summarization have widely used several IR models.


REFERENCES [1] Rajlich, V. Changing the Paradigm of Software Engineering. In: Communications of the ACM: 2006.

[2] Rajlich, V. & Bennett, K. A Staged Model for the Software Life Cycle. In: Computer, Vol. 33, Issue 7: 2000.

[3] Baeza-Yates, R. & Ribeiro-Neto, B. Modern Information Retrieval. Addison-Wesley Longman: 1999.

[4] Baeza-Yates, R. Information Retrieval in the Web: Beyond Current Search Engines. International Journal of Approximate Reasoning, 34: 2003; 97 – 104.

[5] Dominich, S. The Modern Algebra of Information Retrieval. Springer-Verlag Berlin Heidelberg: 2008.

[6] Grossman, D. & Frieder, O. Information Retrieval: Algorithms and Heuristics. Springer, Second edition: 2004.

[7] Manning, C. D., Raghavan, P. & Schutze, H. Introduction to Information Retrieval. Cambridge University Press: 2009.

[8] Deerwester, S., Dumais, S. T., Furnas, G., Landauer, T. K. & Harshman, R. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41: 1990; 391 – 407.

[9] Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 2003; 993 – 1022.

[10] Gilb, T. Evolutionary Development. ACM SIGSOFT Software Engineering Notes, 6: 1981; 17.

[11] Boehm, B. A spiral model of software development and enhancement. IEEE Computer, 21: 1988; 61 – 72.

[12] Febbraro, N. & Rajlich, V. The Role of Incremental Change in Agile Software Processes. In: Agile: 2007.

121 RESEARCH TOPICS IN SOFTWARE EVOLUTION AND MAINTENANCE

[13] Storey, M. A. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal, 14: 2006; 187 – 208.

[14] O’Brien, M. Software Comprehension - A Review & Research Direction. Technical report, Department of Computer Science & Information Systems, University of Limerick: 2004.

[15] Kagdi, H., Collard, M. & Maletic, J. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. Journal of Software Maintenance and Evolution: Research and Practice, 19: 2007; 77 – 131.

[16] Wettel, R. & Lanza, M. Program Comprehension through Software Habitability. In: Proceedings of the 15th IEEE International Conference on Program Comprehension: 2007.

[17] Chikofsky, E. J. & Cross, J. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1): 1990; 13–17.

[18] Marcus, A., Sergeyev, A., Rajlich, V. & Maletic, J. An Information Retrieval Approach to Concept Location in Source Code. In: 11th Working Conference on Reverse Engineering: 2004.

[19] Poshyvanyk, D. & Marcus, A. Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code. In: 15th IEEE International Conference on Program Comprehension: 2007.

[20] Poshyvanyk, D., Gueheneuc, Y., Marcus, A., Antoniol, G. & Rajlich, V. Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval. IEEE Transactions on Software Engineering, 33: 2007; 420 – 432.

[21] Cleary, B., Exton, C., Buckley, J. & English, M. An empirical analysis of information retrieval based concept location techniques in software comprehension. Empirical Software Engineering, 14: 2009; 93 – 130.

122 SOFTWARE EVOLUTION SUPPORTED BY INFORMATION RETRIEVAL

[22] Marcus, A. & Maletic, J. Identification of High-Level Concept Clones in Source Code. In: 16th IEEE International Conference on Automated Software Engineering: 2001.

[23] Tairas, R. & Gray, J. An Information Retrieval Process to Aid in the Analysis of Code Clones. In: Empirical Software Engineering: 2009.

[24] Roy, C., Cordy, J. & Koschke, R. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. In: Science of Computer Programming: 2009.

[25] Walenstein, A. & Lakhotia, A. Clone Detector Evaluation Can Be Improved: Ideas from Information Retrieval. In: Second International Workshop on the Detection of Software Clones: 2003.

[26] Wei, X., Huang, L., Fox, A., Patterson, D. & Jordan, M. Detecting large-scale system problems by mining console logs. In: ACM SIGOPS 22nd Symposium on Operating Systems Principles: 2009.

[27] Mockus, A. & Votta, L. G. Identifying reasons for software changes using historic databases. In: 16th IEEE International Conference on Software Maintenance: 2000.

[28] Canfora, G. & Cerulo, L. Impact Analysis by Mining Software and Change Request Repositories. In: 11th IEEE International Symposium on Software Metrics: 2009.

[29] Cubranic, D. & Murphy, G. C. Hipikat: Recommending pertinent software development artifacts. In: 25th International Conference on Software Engineering: 2003.

[30] Cubranic, D., Murphy, G. C., Singer, J. & Booth, K. S. Hipikat: A project memory for software development. In: IEEE Transactions on Software Engineering: 2005.

[31] Grechanik, M. & Poshyvanyk, D. Evaluating recommended applications. In: International Workshop on Recommendation Systems for Software Engineering: 2008.

123 RESEARCH TOPICS IN SOFTWARE EVOLUTION AND MAINTENANCE

[32] Colaco, M., Mendonca, M. & Rodrigues, F. Mining Software Change History in an Industrial Environment. In: Brazilian Symposium on Software Engineering: 2009.

[33] Matter, D., Kuhn, A. & Nierstrasz, O. Assigning Bug Reports using a Vocabulary-Based Expertise Model of Developers. In: 6th IEEE International Working Conference on Mining Software Repositories: 2009.

[34] Zimmermann, T., Zeller, A., Weissgerber, P. & Diehl, S. Mining Version Histories to Guide Software Changes. IEEE Transactions on Software Engineering, 31: 2005; 429 – 445.

[35] Lukins, S. K., Kraft, N. A. & Etzkorn, L. H. Source Code Retrieval for Bug Localization Using Latent Dirichlet Allocation. In: Working Conference on Reverse Engineering: 2008.

[36] Tian, K., Revelle, M. & Poshyvanyk, D. Using Latent Dirichlet Allocation for automatic categorization of software. In: 6th IEEE International Working Conference on Mining Software Repositories: 2009.

[37] Maarek, Y. S., Berry, D. M. & Kaiser, G. E. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17: 1991; 800 – 813.

[38] Kawaguchi, S., Garg, P. K., Matsushita, M. & Inoue, K. MUDABlue: An automatic categorization system for Open Source repositories. Journal of Systems and Software, 79: 2006; 939 – 953.

[39] Tian, K., Revelle, M. & Poshyvanyk, D. Using Latent Dirichlet Allocation for Automatic Categorization of Software. In: 6th IEEE Working Conference on Mining Software Repositories: 2009.

[40] Gong, Y. & Liu, X. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In: 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 2001.

[41] Haiduc, S., Aponte, J. & Marcus, A. Supporting program comprehension with source code summarization. In: 32nd ACM/IEEE International Conference on Software Engineering: 2010.

[42] Haiduc, S., Aponte, J., Moreno, L. & Marcus, A. On the Use of Automated Text Summarization Techniques for Summarizing Source Code. In: 17th Working Conference on Reverse Engineering: 2010.

[43] Antoniol, G., Canfora, G., Casazza, G., De Lucia, A. & Merlo, E. Recovering Traceability Links between Code and Documentation. In: IEEE Transactions on Software Engineering: 2002.

[44] Settimi, R., Cleland-Huang, J., Khadra, O. B., Mody, J., Lukasik, W. & DePalma, C. Supporting Software Evolution through Dynamically Retrieving Traces to UML Artifacts. In: 7th International Workshop on Principles of Software Evolution: 2004.

[45] Marcus, A. & Maletic, J. I. Recovering Documentation-to-Source-Code Traceability Links using Latent Semantic Indexing. In: 25th International Conference on Software Engineering: 2003.

[46] Lucia, A. D., Fasano, F., Oliveto, R. & Tortora, G. Enhancing an Artefact Management System with Traceability Recovery Features. In: 20th IEEE International Conference on Software Maintenance: 2004.

[47] Lucia, A. D., Fasano, F., Oliveto, R. & Tortora, G. Recovering Traceability Links in Software Artifact Management Systems using Information Retrieval Methods. In: ACM Transactions on Software Engineering and Methodology: 2007.

[48] Cleland-Huang, J., Settimi, R., Duan, C. & Zou, X. Utilizing Supporting Evidence to Improve Dynamic Requirements Traceability. In: International Requirements Engineering Conference: 2005.

[49] Oliveto, R., Gethers, M., Poshyvanyk, D. & De Lucia, A. On the Equivalence of Information Retrieval Methods for Automated Traceability Link Recovery. In: IEEE 18th International Conference on Program Comprehension: 2010.

Reverse Engineering in Procedural Software Evolution

Óscar Chaparro Fernando Cortés Jairo Aponte

ABSTRACT

Software Engineering is much more than the development of new software; it also involves understanding, maintenance, re-engineering and quality assurance of software. In these and other processes, Reverse Engineering plays an important role, especially in legacy systems. In Software Understanding, Reverse Engineering is vital, since it abstracts detailed information and presents structured knowledge of the software to the user. In the same way, it allows performing maintenance activities in a more controlled way, because it provides information to measure the impact of changes. Reverse Engineering also gives useful information about how a system is designed and allows assessing software from many perspectives, for example, to identify software clones or to measure attributes such as coupling or cohesion. This chapter presents an overview of some techniques in the field, an analysis of their applicability in an industrial procedural software system, and some future trends that will direct research in this discipline.

6.1 INTRODUCTION

Software Reverse Engineering is a field that comprises techniques, tools and processes to extract knowledge from software systems already built. The main goal of this discipline is to retrieve information about how a software system is composed and how it behaves according to the relations


of its components [1]. Reverse Engineering aims at creating representations of software in other forms or at higher levels of abstraction, independently of the implementation (e.g., visual metaphors or UML diagrams). Techniques in the field are employed for many purposes [1]: to reduce the complexity of software systems, to generate alternative views of software from several architectural perspectives, to retrieve “hidden” information accumulated through the long evolution of systems, to detect side effects or non-planned design branches [2], to synthesize software, to facilitate the reuse of code and components, to assess the correctness and reliability of systems, and to locate and track defects or changes faster. Reverse Engineering (RE) arises as an important area of software engineering because it becomes necessary in Software Evolution. Software Evolution is an inevitable process, basically because most factors related to software (and technology) change [3]: business factors such as business concepts, paradigms, and processes, and technology factors such as hardware, software, and software engineering paradigms. RE is especially important with regard to legacy systems, since they suffer degradation and have a long operational life, producing a loss of knowledge about how they are built. Generally, the changes in this kind of software do happen, but are not well documented. Instead, knowledge about changes is kept by people, but people are volatile: people leave projects and companies, and people forget easily. Therefore, knowledge needs to be extracted directly from the software (source code) and its behavior (run-time information), which are the main sources of information for RE. In this sense, the general problem is how to extract information or knowledge from software artifacts.
Reverse Engineering can be considered a more general field than Software Comprehension, as the former involves more activities than comprehension of code alone (e.g., RE is also used for fault localization), although sometimes the comprehension process is a secondary result of it. The philosophy of RE is the application of techniques to know how software works, how it is designed, and how this design allows the software to behave the way it does. So, in this process it is perfectly natural that the comprehension part appears as an indirect consequence or as a motivation.

128 REVERSE ENGINEERING IN PROCEDURAL SOFTWARE EVOLUTION

Since Reverse Engineering is vital for software development processes, this chapter presents an overview of some techniques in the area. The chapter presents an analysis of the applicability of some techniques in an industrial procedural software system, as a first step in the development of a RE tool for Oracle Forms applications. The chapter is organized as follows: Section 6.2 reviews the needs, benefits and purposes of RE, in the context of Software Understanding and Maintenance. Section 6.3 presents a review of some techniques in the field. An analysis of the applicability of the techniques in an industrial Oracle Forms application is presented in Section 6.4. Later, the assessment of RE techniques and tools is addressed in Section 6.5, since the results of this topic allow setting a common research and development environment in RE for future work. Finally, future trends in the field and conclusions are presented in Section 6.6.

6.2 REVERSE ENGINEERING CONCEPTS AND RELATIONSHIPS

The term Reverse Engineering has been used to refer to methods and tools for understanding (or comprehending) and maintaining software systems. RE techniques have been used to perform systems examination, so in Software Understanding and Maintenance, RE has been a useful medium to support these processes.

6.2.1 Reverse Engineering and Software Comprehension

Software Comprehension is the process performed by an engineer or a developer to understand how a software system works internally. The understanding process involves the comprehension of the structure, the behavior and the context of operation of a program. Along with these attributes, the explanation of problem domain relationships is required [4]. Understanding is one of the most important problems in Software Evolution; it is said that between 50% and 90% of the effort in maintenance stages is devoted to this task [5]. Commonly, system documentation is out of date, incorrect or even nonexistent, which increases the difficulty of understanding what the system does, how it works and why it is coded that way [6].

Several comprehension models and cognitive theories are reviewed by Storey in [7]. It could be said that the main models, or at least the most common ones, are top-down and bottom-up. The top-down comprehension strategy is basically the mapping between previous system/domain knowledge and the code, through the formulation, verification and rejection of hypotheses. In terms of the implementation of a RE tool, this process typically includes rule matching to detect how code chunks achieve subgoals within a specific feature or plan [8]. When performing automatic RE on legacy software, the bottom-up approach is commonly used, because the top-down approach requires detailed knowledge about the “goals the program is supposed to achieve” [9]. In bottom-up understanding, software instructions are taken to form and infer logic and semantic groups, categories and goals. The automation of this approach is very complex [8, 9], and is not supposed to be solved by a single technique, because of the semantic gap between the code and the domain knowledge of a system. Additionally, developers actually need a variety of functionalities and information that a single technique or implementation cannot provide.

6.2.2 Reverse Engineering and Software Maintenance

On the other hand, Software Maintenance is usually defined as the process performed on software after its delivery to production [10]. Common activities in maintenance are the correction of defects, performance improvement, and software adaptation due to changes in requirements or in business rules. Software Maintenance is divided into four categories [11]: Corrective maintenance, Adaptive maintenance, Performance enhancement and Perfective maintenance. Reverse Engineering comprises the first step in Software Maintenance: the examination process. The change of the software is executed later; therefore, RE does not involve the changing process [10].

6.2.3 Reverse Engineering Concepts

Software Reverse Engineering was defined by Chikofsky et al. [1] as the process of analyzing a system to:


• Identify the system’s components and their inter-relationships, and

• Create representations of the system in another form or at a higher level of abstraction.

A discussion of this definition is developed in [12]. In that work, the authors state that the definition does not fit all techniques in the field; for example, program slicing does not recover the system’s components and relationships. The definition does not specify what kinds of representations are considered, nor the context in which the process is executed, so the role of automation and the knowledge acquisition process are not clear. In this sense, the authors propose a more complete definition: “Reverse Engineering includes every method aimed at recovering knowledge about an existing software system in support to the execution of a software engineering task”. The process of Reverse Engineering is divided into four phases [13]: Context parsing, Component analyzing, Design recovery and Design reconstruction. Figure 6.1 shows the whole process. Asif [14] presents the elements involved in the Reverse Engineering process:

• Extraction at different levels of abstraction,

• Abstraction for scaling through more abstract representations,

• Presentation for supporting other processes such as maintenance, and

• User specification, allowing the user to manage the process, the mappings for the transformation of representations, and the software artifacts.

Khan et al. [2] define the general process of Reverse Engineering, which consists of four phases: data extraction from software artifacts, data processing and analysis, knowledge persistence in a repository, and presentation of the knowledge. The authors also present some benefits and applications of Reverse Engineering. RE is used:

• To ensure system consistency and completeness with respect to its specification.


Figure 6.1 Reverse Engineering process according to [13].

• To support verification and validation phases.

• To assess the correctness and reliability of the system in the development phase, before it is delivered.

• To trace down software defects.

• To evaluate the impact of a change in the software (for estimating and controlling the maintenance process).

• To facilitate understanding, by allowing the user to navigate through software in a graphical way (software visualization is an important aspect of RE; this topic is detailed in Chapter 3).

• To provide more speed in Software Maintenance and Understanding. A requirement of a RE tool is the fast and correct generation of cross-reference information and different representations of software. In Section 6.5.2 some features of tools are reviewed.


• To measure re-usability through pattern identification.

RE embraces a broad range of techniques, from simple ones, such as call graph extraction, to more elaborate ones, such as architecture recovery. The trend goes towards more sophisticated and automatic methods. The next section presents some techniques in the field.

6.3 TECHNIQUES IN REVERSE ENGINEERING

Methods or techniques (used interchangeably in this chapter) in Reverse Engineering are automatic solutions to one or more problems in the field [12]. This section provides a partial review of techniques, divided into two categories: Standard and specialized techniques.

6.3.1 Standard Techniques

Standard techniques include mostly basic descriptive techniques, such as dependency models or structural diagrams. The objective of standard techniques is to obtain the system structure and dependencies at different levels of abstraction, by applying basic source code analysis and Abstract Syntax Tree (AST)1 processing. According to [2], a RE tool typically provides the following views:

• Module charts: They present relationships between system components. A module is a group of software units based on a criterion. Theoretically, modules must have a well-defined function or purpose.

• Structure charts: In a general sense, they present software objects categorized and linked by some kind of relationship, generally method calls. According to [15]2, this kind of diagram also shows data interfaces between modules. Examples of structure charts are entity-relationship models and class diagrams. Actually, module charts are structure charts.

1An AST is a tree representation of the syntactic structure of source code. For example, AST View is an Eclipse plug-in for visualizing the AST of Java programs: http://www.eclipse.org/jdt/ui/astview/
2See Chapter 15 of [15]: Additional Modeling Tools.


• Call graphs: They describe calls and dependencies between objects at different levels of granularity. These diagrams are built by analyzing the AST of the code: The level of granularity is set (for instance, functions, methods, classes or variables) and then the AST is traversed to find the usage of objects. Call graphs are important for change propagation analysis.

• Control-flow diagrams: At low/medium levels of granularity, they present the system's execution flow, in which control structures (e.g., IF or FOR / WHILE) guide the flow. Another diagram of this type is the Control Structure Diagram (CSD) [16], in which source code is enriched with several graphical constructs called "CSD program components/units", to improve its comprehensibility.

• Data-flow diagrams: They are graphs that show the flow of data (in parameters and variables) through functionalities, modules, or functional processes [15].

• Global and local data structures, and parameter lists: These allow going down to a fine-grained level of the software.

The automatic extraction of these diagrams involves several operational tasks on the code and its AST. For example, for data-flow diagrams, the transformation and storing of data must be tracked; every parameter needs to be followed between and inside methods or functions, to know which operations include it and how it is used. Before this process, modules need to be determined. In addition, some of these diagrams are the source of information for techniques such as dependency analysis of code and data, and the evaluation of change impact in code (see Impact Analysis in Chapter 4).
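As a minimal illustration of AST-based call graph extraction, the following Python sketch uses the standard `ast` module to collect, for each function, the plain names it calls. The `SOURCE` fragment is invented; a real tool would also resolve attribute calls and cross-file references.

```python
import ast
from collections import defaultdict

# Toy source (invented); a real tool would parse whole files of the system.
SOURCE = """
def load(path):
    return open(path).read()

def parse(text):
    return text.split()

def process(path):
    return parse(load(path))
"""

def build_call_graph(source):
    """For each function definition, collect the plain names it calls."""
    graph = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    graph[node.name].add(sub.func.id)
    return dict(graph)

print(build_call_graph(SOURCE))
# {'load': {'open'}, 'process': {'parse', 'load'}}
```

Once the granularity (here, functions) is fixed, traversing the AST and recording call edges is exactly the operation described above; attribute calls such as `text.split()` are skipped by this sketch.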

6.3.2 Specialized Techniques

Other techniques, called specialized techniques in this chapter, include operations on software artifacts that are intended not only to describe software but to extract knowledge from it. Business rules extraction [17,18], natural text processing [18], execution traces processing [19,20], pattern extraction [21], module extraction [22,23], and programming plans matching using artificial intelligence techniques [9], are some examples. Some of them are described briefly in this section.

Business Rules Extraction

Business rules extraction refers to the discovery of the important conditions that produce business actions. The key is to find those conditions and translate them into the business domain. Recovery of business rules supports the preservation of legacy business assets, allows optimizing the business model and assists system forward engineering [17,18]. The process of business rules extraction is depicted in Figure 6.2.

Figure 6.2 Business rules knowledge through abstraction levels.

In the case of [17] and [18], the authors address business rules extraction from COBOL systems, through the analysis of the AST and some particular constructs of the COBOL language. The authors define a format to express business rules: <conditions, actions>, where conditions are boolean expressions and actions are "action expressions" which are executed only if the conditions are true. This format is based on the

3Picture based on the image located at http://blog.erikputrycz.net/projects/business-rules-extraction/ (February, 2011).


definition of production rules in the specification of SBVR (Semantics of Business Vocabulary and Business Rules) [24]. The procedure proposed to extract business rules is the following: First, the branching and calculation statements are identified, and second, the context of those statements is constructed. The branching statements contain the CALL and PERFORM words; the first allows executing external programs and the second allows transferring the control flow to a paragraph4. The context of the statements is the set of all conditions (local and global), in which branching and calculation operations happen. The key point in their work is that not only production rules are retrieved, but also comments, identifiers, assignments, code blocks, business rules dependencies and exceptions. CALL and PERFORM statements and the analysis of local and global conditions allow defining a context; on the other hand, comments, identifiers and other "artifacts" allow assigning semantic information to that context. On the other hand, according to Baxter et al. [25], automated extraction of business rules from code can only be heuristic, because business vocabulary and business rules are independent from the implementation and they are not present in code. As they say, code simply suggests or hints at business rules, so these clues in code (code fragments, operations and functions, error messages, program comments, etc.) are extracted to form approximations of business rules. Of course, some rules will be missed or incorrect, so the problem is how to get better approximations. Other approaches for extracting business rules are slicing criterion identification and program slicing [26–28], data analysis [27,28] and text processing from documents [29].
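The heuristic character that Baxter et al. point out can be illustrated with a minimal sketch: assuming a PL/SQL-like syntax and invented identifiers, a regular expression pairs each IF condition with the action block it guards. A real extractor would work on the AST and track local and global context, as [17] and [18] do.

```python
import re

# Invented PL/SQL-like fragment; identifiers are hypothetical.
CODE = """
IF invoice_date < SYSDATE - 30 THEN
   status := 'EXPIRED';
END IF;
IF amount > credit_limit THEN
   RAISE_APPLICATION_ERROR(-20001, 'credit limit exceeded');
END IF;
"""

RULE_RE = re.compile(
    r"IF\s+(?P<condition>.+?)\s+THEN\s+(?P<action>.+?)\s*END IF;",
    re.DOTALL | re.IGNORECASE)

def candidate_rules(code):
    """Heuristically pair each IF condition with the action block it guards."""
    return [(m.group("condition").strip(), m.group("action").strip())
            for m in RULE_RE.finditer(code)]

for cond, act in candidate_rules(CODE):
    print(cond, "=>", act)
```

Each (condition, action) pair is only an approximation of a rule; mapping the identifiers back to business vocabulary is the remaining, harder step.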

Programming Plans Matching

One problem in Reverse Engineering is how to find the meaning of code. In the automation of this process, the analysis of code identifiers and concept extraction are almost always required. However, another approach is matching chunks of code against programming plans stored in a repository [8,9]. A programming plan is a "design element in terms

4A paragraph in COBOL is a sequence of statements.


of common implementation patterns" [8]. A plan can be considered as a programming pattern or template, and can be generic or domain-specific. Some examples of plans are READ-PROCESS-LOOP and READ-EMPLOYEE-INFO-CALCULATE-SALARY; the first is a generic plan which means "reading input values and performing some actions on each value" at the implementation level. The second is a domain-specific plan that is a specialization of the first one, because at the implementation level it is almost the same as the first plan, but has a more semantic stereotype. This means that the repository has a set of generic and specialized plans organized in a hierarchy. In [8], the author states that the recognition of programming plans against information from the AST of the code performs better, in terms of searching cost, if the plan library is highly organized and each plan has indexing and specialization/implication links to other plans5. However, the matching task is an NP problem, so the process is computationally expensive [9]. The work presented in [9] addresses this problem by applying artificial intelligence techniques. The approach is a two-step process. The first step is a Genetic Algorithm execution that makes an initial filtering of the plan library, based on "relaxed" matching between code chunks [30] and the programming plans stored in the library (the repository). The second step uses Fuzzy Logic to perform a deeper matching. The output of the whole approach is a set of programming plans ranked according to a similarity value with a chunk of code. In summary, the objective to be achieved by these works is to find programming plans similar to a portion of code. Programming plans are stereotypical patterns of code (generic or domain-specific patterns); therefore, it is possible to assign high-level concepts to programs once the matching process has been performed.
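The Genetic Algorithm plus Fuzzy Logic pipeline of [9] is too elaborate for a short example, but the core idea of "relaxed" matching and ranking can be sketched with a simple token-set (Jaccard) similarity. The plan library below is invented for illustration; real plan bodies carry structure, constraints and indexing links, not just tokens.

```python
import re

def tokens(code):
    """Crude lexer: set of lowercase word tokens."""
    return set(re.findall(r"[a-z_]+", code.lower()))

# Hypothetical plan library; plan bodies reduced to token sequences.
PLAN_LIBRARY = {
    "READ-PROCESS-LOOP": "while read record process record end",
    "READ-EMPLOYEE-INFO-CALCULATE-SALARY":
        "while read employee record calculate salary process end",
}

def rank_plans(chunk, library):
    """Rank plans by Jaccard similarity with a code chunk (relaxed matching)."""
    t = tokens(chunk)
    scores = {name: len(t & tokens(body)) / len(t | tokens(body))
              for name, body in library.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

chunk = "while read employee rec: salary = calculate(rec); process(salary)"
print(rank_plans(chunk, PLAN_LIBRARY))
```

The output is a ranked list, mirroring the ranked set of plans produced by the approach in [9]; here the domain-specific salary plan scores above the generic loop plan.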

Execution Traces Analysis

As a complement to Static Analysis in RE, the processing of run-time information is commonly used, in what is called Dynamic Analysis. Perhaps the most common source is the system execution trace.

5According to Quilici [18], a plan consists of inputs, components, constraints, indexes, and implications.

Execution traces analysis refers to processing execution traces to find patterns in the traces that have a specific function. The advantage of traces is that they show the portions of the software that are executed in a specific execution scenario. In this way, the search space is smaller than in static analysis, because only the executed portions of code are considered. Two problems are detected in dynamic processing: First, knowledge of the system is required to perform the analysis, and second, dynamic analysis produces huge amounts of information (long traces). The former problem refers to the fact that it is not possible to capture the (infinite) entire execution domain of a system; if there is no knowledge about how the system works, using and executing all its functionalities is not possible. If knowing the entire system is not necessary, and the knowledge about the use of a specific functionality actually exists, this is not a problem. The latter problem depends on how the software is built at low level, how the code is instrumented6 and how much information the user needs. Other problems related to dynamic analysis are low performance, high storage requirements and cognitive load on humans [4]. Object-oriented software has been the most common object of study in traces analysis. For example, in [20], the problem of identifying clusters of classes is addressed, based on a technique that reduces the "noise" in execution traces through the detection of what the authors call "temporally omnipresent elements", which represent execution units (groups of statements) distributed along the trace. In this sense, noise represents information that is not specific to a behavior of interest. For this, samples of a trace are taken and the distribution of each element along the samples is calculated through a measure of temporal occurrence. On the other hand, dynamic correlation is used to cluster elements. Two elements are dynamically correlated if they appear in the same samples, so the measure of correlation is based on the number of samples in which they occur. The clustering step takes all elements whose correlation is higher than a fixed threshold, thus grouping elements into components.
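A rough sketch of this sampling-and-correlation idea follows, with an invented toy trace: the trace is split into fixed-size samples, pairs of elements that co-occur in a sufficient fraction of the samples are taken as correlated, and correlated pairs are merged into components. The actual measures in [20] are more refined than this co-occurrence ratio.

```python
from itertools import combinations

def split_samples(trace, size):
    """Split a call trace into fixed-size samples (as sets of elements)."""
    return [set(trace[i:i + size]) for i in range(0, len(trace), size)]

def correlated_pairs(samples, threshold):
    """Pairs of elements that appear together in enough samples."""
    elements = set().union(*samples)
    pairs = []
    for a, b in combinations(sorted(elements), 2):
        together = sum(1 for s in samples if a in s and b in s)
        if together / len(samples) >= threshold:
            pairs.append((a, b))
    return pairs

def cluster(pairs, elements):
    """Merge correlated pairs into components (naive union-find)."""
    parent = {e: e for e in elements}
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for e in elements:
        groups.setdefault(find(e), set()).add(e)
    return list(groups.values())

trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C"]
samples = split_samples(trace, 3)
pairs = correlated_pairs(samples, threshold=0.6)
print(cluster(pairs, set(trace)))
```

In this toy trace, A, B and C co-occur often enough to form one component, while D, which appears in only one sample, is left on its own, playing the role of noise.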

6Code instrumentation refers to the use of software tools or additional portions of code in the system, through which execution traces and behavior information of the system are gathered.


In summary, the noise of the traces is removed and then clustering is applied to the filtered traces. That work presents an industrial experiment of the approach. The system under study was a two-tier client-server application: The client was a Visual Basic 6 application of 240 Kloc and the server comprised 90 Kloc of Oracle PL/SQL (Procedural Language/Structured Query Language) code. The client code was instrumented, since no trace generation environment was found, and in the case of the server, Oracle tracing functions were used. The system was executed over a use case, producing a trace of 26,000 calls. Similarly, the authors in [19] define a "utility element" as any element in a program that is accessed from multiple places within a scope of the program. The objective of the authors is to remove these utility elements or classes from the analysis. This is achieved by calculating the proportion of classes that call each other (fan-in analysis), iteratively reducing the analysis scope or applying the technique on a set of packages. The utility-hood metric, U, of a class C, is defined as

U = |IN| / (|S| − 1)

where S is the set of classes considered in the analysis and IN is the subset of classes that use C. Besides this metric, the standard score (z-score) is considered to determine possible utility classes: classes with large and positive z-score values are possible utilities. Once the filtering is performed, the depiction of components is done by a tool that generates Use Case Maps7. In this latter step, calls between classes and conditions of execution (control-flow statements) are considered. As can be noticed, the main challenge in execution trace analysis is how to reduce and process the trace, so that the final result is a good abstraction of what a system does under a specific execution scenario. The main problems to be addressed are the definition of the information that the traces should contain, the metrics and procedures that should be used for filtering, and the way to analyze and represent the reduced traces, so that they can express knowledge about the system. Other dynamic analysis works employ Web mining [31], association rules and clustering [32–34], and reduction techniques [35]8.
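The utility-hood metric and its z-score filtering can be computed directly. The fan-in data below is invented for illustration: a class called from almost everywhere (a logger, say) gets U close to 1 and a large positive z-score, flagging it as a likely utility.

```python
from statistics import mean, stdev

def utility_hood(callers, all_classes):
    """U = |IN| / (|S| - 1): fraction of the other classes that use each class."""
    return {c: len(ins) / (len(all_classes) - 1) for c, ins in callers.items()}

def z_scores(values):
    """Standard score of each value against the whole population."""
    m, s = mean(values.values()), stdev(values.values())
    return {c: (u - m) / s for c, u in values.items()}

# Hypothetical fan-in data: class -> set of classes that call it.
callers = {
    "Logger": {"A", "B", "C", "D", "E"},   # called from almost everywhere
    "A": {"B"}, "B": {"C"}, "C": {"A"}, "D": set(), "E": {"A"},
}
U = utility_hood(callers, set(callers))
Z = z_scores(U)
suspects = [c for c, z in Z.items() if z > 1.0]
print(suspects)   # "Logger" stands out as a likely utility class
```

The z-score threshold of 1.0 is an illustrative choice; [19] determines suspects from large positive scores rather than a single fixed cutoff.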

7For more information about this type of model go to www.usecasemaps.org
8For more information about dynamic analysis techniques see [4].


Module Extraction

One important task for architecture recovery and other areas in RE is module extraction. This is performed by defining a set of criteria that provide a semantic clustering of components to form cohesive modules. For example, in [22] and [23], Hill Climbing and Simulated Annealing are used to form clusters of components based on fan-in/out analysis. The approach starts by building a Module Dependency Graph; then, random partitions (clusters) of the graph are formed as the initial clustering configuration, and later the partitions are rearranged iteratively by moving one component from one cluster to another. The objective is to find the optimal configuration based on the concepts of low coupling and high cohesion, which is achieved by considering fan-in/out information. This is accomplished by maximizing an objective function, which the authors call Modularization Quality (MQ). The general process is shown in Figure 6.3.

Figure 6.3 Modularization process based on clustering/searching algorithms [23].
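A minimal hill-climbing sketch of this modularization process follows, using a simplified TurboMQ-style cluster factor (2μ / (2μ + ε), intra-edges against inter-edges) as the objective and an invented dependency graph. The real approach in [23] uses more elaborate neighboring moves and the exact MQ definition.

```python
import random

# Hypothetical module dependency graph: directed edges between components.
EDGES = [("a", "b"), ("b", "a"), ("c", "d"), ("d", "c"), ("a", "c")]
NODES = ["a", "b", "c", "d"]

def mq(partition):
    """Simplified TurboMQ: sum of cluster factors 2*intra / (2*intra + inter)."""
    where = {n: i for i, cl in enumerate(partition) for n in cl}
    total = 0.0
    for i, _ in enumerate(partition):
        intra = sum(1 for u, v in EDGES if where[u] == i and where[v] == i)
        inter = sum(1 for u, v in EDGES if (where[u] == i) != (where[v] == i))
        if intra:
            total += 2 * intra / (2 * intra + inter)
    return total

def hill_climb(nodes, k, steps=1000, seed=0):
    """Randomly move single components between clusters, keeping improvements."""
    rng = random.Random(seed)
    part = [set() for _ in range(k)]
    for n in nodes:                                  # random initial partition
        part[rng.randrange(k)].add(n)
    best = mq(part)
    for _ in range(steps):
        n = rng.choice(nodes)
        src = next(i for i, c in enumerate(part) if n in c)
        dst = rng.randrange(k)
        if dst == src:
            continue
        part[src].discard(n); part[dst].add(n)       # try the move
        new = mq(part)
        if new >= best:
            best = new                               # accept non-worsening moves
        else:
            part[dst].discard(n); part[src].add(n)   # revert worsening moves
    return part, best

partition, score = hill_climb(NODES, k=2)
print(partition, score)
```

The normalization per cluster matters: a plain intra-minus-inter count would be maximized by dumping every component into one cluster, whereas the cluster-factor form rewards the cohesive {a, b} / {c, d} split.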

Text Processing

Text processing refers to the processing of source code as text. This implies the processing of the morphological, syntactic, semantic and lexical elements of code. Text processing takes advantage of identifiers and of how statements are organized, to extract semantic information from artifacts: classes, methods, packages, etc. The requirement is that the code is well written, i.e., the identifiers express semantic information of the business and follow some general parameters about how they are defined. In [18], the key-phrase extraction algorithm KEA9 is used to translate business rules into specific business domain terms, from documentation. For this, the authors connect documents to business rules. The key point is that documents contain technical descriptions of variables, so it is possible to establish a direct mapping between rules and documents. More information about text processing is presented in Chapter 1.
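A first step of such text processing, splitting identifiers into words and counting term frequencies to surface likely domain concepts, can be sketched as follows (the identifier list is invented; a real pipeline would add stop-word removal, abbreviation expansion, and keyphrase extraction such as KEA):

```python
import re
from collections import Counter

# Hypothetical identifiers harvested from source code.
IDENTIFIERS = ["calculate_tax", "total_tax", "tax_rate",
               "movement_date", "insert_movement", "v_company_name"]

def split_identifier(ident):
    """Split snake_case and camelCase identifiers into lowercase words."""
    parts = re.split(r"_|(?<=[a-z])(?=[A-Z])", ident)
    return [p.lower() for p in parts if p]

def term_frequencies(idents):
    """Count how often each word appears across all identifiers."""
    return Counter(w for i in idents for w in split_identifier(i))

print(term_frequencies(IDENTIFIERS).most_common(3))
# "tax" and "movement" surface as likely domain concepts
```

This only works when identifiers are well written, which is exactly the requirement stated above; cryptic names yield cryptic "concepts".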

6.4 APPLICATION OF TECHNIQUES

This section includes an analysis of the applicability of some techniques to a business procedural software application.

6.4.1 Description of the System

The business system is called Financial Management System (SGF)10. It is an Oracle Forms11 business application, which manages the financial and administrative information of an organization. The main characteristics of SGF are:

• It is a two-tier client-server application, implemented in Oracle Forms and in PL/SQL and SQL.

• The system has about 2072 DB tables, 507 views, 1689 DB stored procedures and 1097 forms, and about 700,000 lines of code.

9For more information about the KEA algorithm see http://www.nzdl.org/Kea (February, 2011)
10Acronym in Spanish for "Sistema de Gestión Financiera".
11The latest version of Oracle Forms to date is 11g. For more information about this technology please visit http://www.oracle.com/technetwork/developer-tools/forms or http://en.wikipedia.org/wiki/Oracle_Forms (March, 2011)


• The system is implemented in Oracle Forms 6i technology, which is no longer supported by Oracle.

• It has evolved for about ten years and has several clients, such as Universidad Nacional de Colombia, Policía Nacional de Colombia and EPM Bogotá12.

• The coupling of this system is relatively high, because business logic is implemented both on the server and on the client side.

The application of RE to this system is performed with three purposes: First, to redocument and understand the system, because it was owned by another company; second, to facilitate maintenance tasks in this system; and third, to get representations of the system at different levels of granularity, as a first step for the re-engineering process that the owner company, ITC S.A.S13, will carry out in the short/medium term.

6.4.2 Considerations for Applying Reverse Engineering

In order to perform RE on this system, the application of both static and dynamic approaches can be considered convenient. On the one hand, as PL/SQL is a procedural language, there is no object resolution at run-time14, so the usage of static methods is prominent. On the other hand, it is possible to apply dynamic analysis using execution traces in order to restrict the search space; this is useful when a developer wants to focus on some part of the system and on a particular execution scenario. In this case, there are two alternatives: using the tracing mechanisms of Oracle15, or performing code instrumentation, which provides a better

12These are large organizations in Colombia.
13http://www.itc.com.co
14Oracle provides a programming technique called Dynamic SQL. According to the opinion of several developers who maintain the system, the usage of this technique is low, so this topic is not addressed in the RE process, at least for now.
15Some Web sites about Oracle tracing mechanisms are: http://download.oracle.com/docs/cd/B28359_01/appdev.111/b28419/d_trace.htm, http://docstore.mik.ua/orelly/oracle/prog2/ch26_01.htm and http://www.devshed.com/c/a/Oracle/Debugging-PLSQL-Code/1/ (March, 2011)


customization degree. Two problems arise here: First, the places in the code where the tracing code is inserted need to be determined (for example, at the beginning of a procedure or just after every control statement), and second, the trace produced could be huge. These problems depend on the specific user needs and the required balance between these issues, so the main way (maybe the only way) to tackle them is by performing empirical studies. Additionally, it is possible to try to infer more knowledge from what is already known about the system. In this way, a mix between top-down and bottom-up approaches is convenient and practical [8]. On the other hand, the application of some RE techniques for Object-Oriented Software (OOS) to procedural software is possible in some way, especially when modeling behavior, since these two paradigms have a common component: The procedural part [36]16.

6.4.3 Application of Standard Techniques

Standard views and techniques are suitable for almost all kinds of software. The application of some views to SGF is described below.

• Module charts: The problem in module extraction is how to group objects into modules. For example, a basic approach is to group objects by their operations on tables, i.e., grouping forms, procedures or packages that perform DELETE, UPDATE or INSERT operations on the same tables. Then, based on a heuristic (for example, the number of objects that use a specific table, given a threshold), the modules can be formed.

• Structure charts: This can be viewed as just a categorization of objects and their usages; for example, the natural structure of a form (composed of canvases, data blocks, record groups, etc.17) or of a table (columns, triggers, etc.).

16In OOS, the data is encapsulated together with procedures, but in procedural software the latter are separated from the data.
17For more information about Oracle Forms components see [37] or review the presentation hosted at http://www.cse.iitb.ac.in/dbms/Data/Courses/DBIS/DBIS-Fall2000/course_demos/OracleForms/FormsBuilder.ppt (March, 2011)


• Call graphs: They show call dependencies between PL/SQL objects: Procedures, functions, program units, triggers, etc. Call graphs allow searching paths, evaluating the impact of a change (for example, a change in the signature of a procedure) or calculating metrics such as coupling.

• Control-flow diagrams: Beyond constructing this kind of diagram, which is relatively easy, the identification of key control structures in code is proposed (see Section 6.4.4).

• Data-flow diagrams: In which the flow of data (in parameters and variables) across procedures, functions and packages is displayed.

Some of these diagrams have already been implemented in a prototype of a Reverse Engineering tool for Oracle Forms applications18, which has been tested on SGF. Figure 6.4 shows the graphical interface of the prototype.

Figure 6.4 Prototype of the Oracle Forms Reverse Engineering Tool.

6.4.4 Application of Specialized Techniques

The specialized techniques presented in this chapter can feasibly be applied to SGF. However, only some of them are considered in this section.

18The tool can be found at http://code.google.com/p/retool/


Simple Business Rules Extraction

The format (<condition, action>) presented by Putrycz et al. [17] is a good starting point for formalizing business rules. The conditions can be extracted from IF and FOR statements. In the case of FOR statements, the analysis of cursors is very important, because they define specific restrictions on the actions to be performed. For example, in Figure 6.5 there is a tax calculation which depends on the data queried by the cursor C_MOVIDP, which has several restrictions: The type of the process (PECL_TIPO IN (‘DC’, ‘DF’, ‘DD’, ‘AM’)) and the date of the movement (MOVI_FECHA > ‘01/01/2010’).

Figure 6.5 Example of a CURSOR and its use in a FOR statement.

The business rule associated with this code could be the following:

<Condition>: for all MOVI_MOVI, with PECL_TIPO IN (‘DC’, ‘DF’, ‘DD’, ‘AM’), with MOVI_FECHA > ‘01/01/2010’.

<Action>: TOTAL_TAX := TOTAL_TAX + CALCULATE_TAX(REC.MOVI)

In some way, this rule is readable, but it depends on the meaning of the identifiers. In this case, there are two alternatives: using replacement rules (for example, MOVI_MOVI → movement) or using the descriptions of tables and columns, assuming that this information is available. The first option can be exploited because there are some conventions in SGF with respect to the identifiers; for example, a variable named v_cias refers to a company (cias → company). So, the initial idea is to define simple rules (and some heuristics) to perform this replacement by using simple matching; the process would not be completely automatic.
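The replacement idea can be sketched with a small dictionary of abbreviation rules; the table below is hypothetical, not the actual SGF conventions, and a real mapping would be built (and reviewed) by the maintainers.

```python
import re

# Hypothetical abbreviation table following SGF-style naming conventions.
REPLACEMENTS = {
    "movi": "movement",
    "cias": "company",
    "fecha": "date",
}

def humanize(rule_text, table):
    """Rewrite abbreviated identifier fragments into business terms.

    Note: words are lowercased on output; good enough for a sketch.
    """
    def sub(match):
        word = match.group(0).lower()
        return table.get(word, word)
    return re.sub(r"[A-Za-z]+", sub, rule_text)

print(humanize("for all MOVI_FECHA > '01/01/2010'", REPLACEMENTS))
# -> "for all movement_date > '01/01/2010'"
```

Fragments with no entry in the table pass through unchanged, which makes the semi-automatic nature of the process visible: unmapped abbreviations still require a human.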

Text Processing and Dependencies Analysis

The basis considered in this analysis is constructing business rules from the processing of control structures in code. A problem is that not all control structures are important when extracting rules; it is possible that the implementation contains control structures that do not contribute to a rule. For instance, Figure 6.6 shows a code block that does not represent a business rule per se; instead, it is just a validation of parameters and the recording of its result.

Figure 6.6 Example of a code block that does not represent a business rule.

Consequently, the initial hypothesis is that some control statements in PL/SQL code, together with non-control statements, actually express rules about how the system should behave, according to the business it models. Then, the problem to address is how to find these statements. The preliminary proposal for addressing this problem is applying first text processing techniques and then dependencies analysis. The input of the process is the code of a PL/SQL object (package, function, procedure, trigger, etc.) and the output is a list of control statements (the conditions) that are likely to represent business rules. Each control statement would have an operational context, which would be translated into non-control statements (the actions) that complement the conditions. The objective of the text processing step is the recognition of some identifiers (names of variables, functions, procedures, etc.), and their places in the code, that could be important. This step explores the code without performing a deep analysis, by using simple techniques such as word frequency and relationships between words. The second step would exploit those statements that contain the identifiers previously recognized. The control statements and their inner code blocks are analyzed, considering variable, procedure and function dependencies, and data operations such as INSERTs, UPDATEs or DELETEs. The basic idea is to find a heuristic that evaluates the probability of a statement being part of a business rule. For example, the heuristic should take into account the following elements in statements: If-then statements, exception and raise statements, update, insert and delete operations, etc. Once the business statements are recognized, the next step is processing them to form human-readable business rules. The initial approximation to tackle this is using parsing, syntactic analysis, and business rules templates.
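Such a heuristic can be sketched as a weighted sum over features of a control statement. The features and weights below are invented for illustration; in practice they would be calibrated empirically against statements known to encode business rules.

```python
# Hypothetical feature weights for deciding whether a control statement
# likely encodes a business rule (weights invented for illustration).
WEIGHTS = {
    "has_dml": 3,                 # INSERT / UPDATE / DELETE inside the block
    "raises_exception": 2,        # RAISE or exception handling
    "uses_domain_identifier": 2,  # identifiers recognized in the text step
    "is_parameter_check": -3,     # pure validation, like Figure 6.6
}

def score(features):
    """Sum the weights of the features present in a control statement."""
    return sum(WEIGHTS[f] for f, present in features.items() if present)

tax_rule = {"has_dml": True, "raises_exception": False,
            "uses_domain_identifier": True, "is_parameter_check": False}
validation = {"has_dml": False, "raises_exception": True,
              "uses_domain_identifier": False, "is_parameter_check": True}

candidates = [("IF MOVI_FECHA > ... THEN UPDATE ...", tax_rule),
              ("IF p_param IS NULL THEN RAISE ...", validation)]
ranked = sorted(candidates, key=lambda c: -score(c[1]))
print([(stmt, score(f)) for stmt, f in ranked])
```

Statements above some threshold would be kept as condition candidates and passed to the dependency-analysis step; validation-only blocks sink to the bottom of the ranking.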

6.5 REVERSE ENGINEERING ASSESSMENT

Assessment of techniques and tools is an important subject in RE, since the results of this process allow setting a common research and development environment in RE for future work. This section details some concepts and procedures about how the evaluation should be performed and what factors should be considered when assessing RE techniques and tools.

6.5.1 Assessment of Techniques

Assessment of techniques refers to measuring their effectiveness when performing a specific task, for example, a maintenance task such as defect fixing. In general, assessment of techniques and tools is difficult, because Reverse Engineering is a goal-driven process, so there is a wide range of scenarios to compare [12]. Effectiveness includes correctness and performance. The main instrument for measuring characteristics of techniques and tools is empirical studies. According to [12], the first step is to have a defined scope of the discipline, as a requirement for empirical evaluation. The second step is the definition of a common and agreed taxonomy of the methods and tools under investigation19. Finally, as a third step, a framework of empirical studies in the field is required. The authors propose some taxonomy criteria: Method or tool, dynamic or static, input required, output produced, interaction supported, required user guidance, task applicability and scalability. According to this framework, there are six dimensions with regard to designing and classifying empirical studies in Reverse Engineering:

• Type of study: Experience report, case study, experiment, observational study and systematic review.

• Object of study: It is basically a method or a tool.

• Purpose: Conceptual proposition, proof of concept, quantifica- tion, comparison, conditioned comparison, review and post-facto.

• Focus: Usefulness and usability.

• Population: Humans or programs.

• Context: For example, factors that influence Software Mainte- nance tasks.

6.5.2 Assessment of Tools

Assessment of tools can be considered in terms of quality evaluation, but this depends on the concept of quality. Khan et al. [2] mention the following quality criteria: Absence of defects or errors, which is important especially in RE tools that display specific information (for estimation

19An attempt to provide this taxonomy is made by the authors through an open wiki: http://lore.cmi.ua.ac.be/reWiki/index.php/Main_Page. However, the wiki is outdated.


and impact analysis this factor is critical); and user requirements satisfaction. In the latter case, the goals that a developer can achieve with the tool are more important: How they can evaluate the impact of a specific maintenance task (time and precision) and the knowledge obtained (automatic knowledge acquisition). This is associated with the features that a tool provides; actually, the literature about assessment of tools considers mostly the amount and type of functionalities as a quality judgment. Storey, in [7], presents some general features that Program Comprehension tools should have. Some of them are required in Reverse Engineering tools as well:

• Documentation: In top-down comprehension, program-level documentation improves maintenance tasks in terms of time and number of errors.

• Browsing and navigation support: Developers switch between top-down and bottom-up models, so flexible browsing needs to be supported. Control-flow and data-flow links or breadth-first and depth-first browsing should be provided by tools.

• Searching and querying: Developers focus on specific comprehension objects, so filtering mechanisms should be provided.

• Multiple views: Combination and cross-referencing of views are required since developers employ several comprehension strategies. This is also important when describing a system from different perspectives.

• Context-driven views: Depending on some attributes, metrics or conditions, some views are more appropriate than others, even for different kinds of users (e.g., developers, architects, etc.).

• Cognitive support: More cognitive support benefits comprehension, through visualization, browsing and navigation, etc.

Other features required in a RE tool include concept assignment capabilities, searching and keeping of history, control-flow, call graphs, pruned

149 RESEARCH TOPICS IN SOFTWARE EVOLUTION AND MAINTENANCE

call trees, entity fan-in, capabilities for maintainers, software visualization support, integration with Integrated Development Environments (IDEs), among others.

In [38], the authors perform an evaluation of four Reverse Engineering tools (Refine/C, Imagix 4D, Rigi, and SNiFF+), following the experimental framework for evaluating software technology presented in [39]. The context of evaluation for determining the differences among the tools was the recovery of architectural information from embedded software systems. First, based on the authors' experience using RE tools in commercial contexts, a set of assessment criteria grouped in functional categories was defined. Second, the tools were assessed against the criteria by reviewing the degree of availability of each functionality. The assessment categories and some of their criteria are listed below:

• Analysis: This category refers to the parsing process. Some criteria are source languages, incremental parsing, fault-tolerant parsing, abortable parsing, parsing results, and parse speed.

• Representation: In this category the usability and graphical representations are assessed. Some criteria are speed of generation, static/dynamic views, filters, scopes and grouping, sorting, layered views, and view edition.

• Editing/browsing: Switching between abstract and fine-grained levels is a requirement when performing a task that involves Reverse Engineering. Some criteria are searching functions, history of browsing, editor integration, and highlighting of objects and source code.

• General capabilities: For instance, supported platforms, multi-user support, toolset extensibility, storing capabilities, exporting, automatic generation of documentation, and on-line help.

On the other hand, this and other approaches give a good list of assessment criteria, but they do not provide a quantitative evaluation methodology.
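The checklist style of assessment used in [38] — scoring each tool by the degree to which each criterion is available — can be sketched as a tools-by-criteria matrix. The tool names below are the ones evaluated in the study, but the scores and the three sample criteria are invented for illustration; the study itself reports availability qualitatively:

```python
# Availability degree per criterion: 0 = absent, 1 = partial, 2 = full.
# Tool names follow [38]; all scores here are invented.
criteria_scores = {
    "Refine/C":  {"incremental parsing": 0, "filters": 1, "editor integration": 2},
    "Imagix 4D": {"incremental parsing": 1, "filters": 2, "editor integration": 1},
    "Rigi":      {"incremental parsing": 0, "filters": 2, "editor integration": 0},
    "SNiFF+":    {"incremental parsing": 2, "filters": 1, "editor integration": 2},
}

def availability(tool: str) -> float:
    """Fraction of the maximum attainable score for one tool."""
    scores = criteria_scores[tool]
    return sum(scores.values()) / (2 * len(scores))

for tool in criteria_scores:
    print(f"{tool}: {availability(tool):.2f}")
```

Even with such a matrix, the result is a per-tool checklist ratio rather than a weighted quality judgment, which is exactly the gap the MECCA-based approach addresses.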
In this sense, a quantitative approach is presented in [2], based on the MECCA (Multi-Element Component Comparison and Analysis) method. The idea is to define a set of mandatory attributes that a Reverse Engineering tool must have. These attributes


are arranged and split hierarchically into sub-attributes. Then, a weight percentage is assigned to each attribute and sub-attribute, and a score is assigned to each last-level sub-attribute according to a scale. The result of the evaluation is a number that indicates how appropriate the tool is, given the defined attributes. For example, if the result for a tool X is 7 and the result for a tool Y is 8, then tool Y would be better than the other one in terms of functionality20. The MECCA method is applied to evaluate a tool only if the mandatory attributes regarding functionality are present. According to the authors, attributes are grouped in five categories: user functions or functionality, interface, I/O operation, metrics, and verification strategy (Figure 6.7).
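Computationally, the MECCA aggregation reduces to a weighted average over the attribute tree. The following sketch uses the category weights reported in Figure 6.7, but all leaf scores are invented for illustration and the 0–10 scale is an assumption (for the actual computational model see [40]):

```python
# Attribute tree: each node maps an attribute name to (weight, subtree);
# leaves carry a score on an assumed 0-10 scale. Category weights follow
# Figure 6.7; every leaf score below is invented.
tree = {
    "functionality": (0.38, {"mandatory functions": (0.90, 8),
                             "optional functions":  (0.10, 5)}),
    "interface":     (0.23, {"man-machine interaction":  (0.60, 7),
                             "interface to other tools": (0.40, 4)}),
    "i/o operation": (0.19, {"scope of languages": (0.30, 6),
                             "report generation":  (0.30, 7),
                             "repository access":  (0.40, 5)}),
    "metrics":       (0.11, {"metrics collection": (0.65, 6),
                             "metrics analysis":   (0.35, 3)}),
    "verification":  (0.09, {"consistency checking":    (0.60, 5),
                             "constraining mechanisms": (0.40, 5)}),
}

def mecca_score(node) -> float:
    """Recursively compute the weighted average over the hierarchy."""
    if isinstance(node, dict):
        return sum(w * mecca_score(sub) for w, sub in node.values())
    return float(node)  # leaf score

print(round(mecca_score(tree), 2))
```

With these invented scores, the tool earns roughly 6.4 out of 10; comparing two tools amounts to scoring each against the same tree.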

6.6 TRENDS AND CHALLENGES

Reverse Engineering provides useful methods and tools in many domains such as Software Maintenance and Comprehension. Since at least 1990 [1], a lot of work has been done to solve different problems in RE successfully [41]. In this chapter some methods and techniques proposed in the research literature were reviewed in order to, first, show the wide spectrum of proposals and solutions; second, present the general landscape of the area; and third, analyze how some techniques could be applied to an industrial procedural software system, as evidence of their applicability.

One of the main tendencies in RE is the development of more automatic, usable and useful tools for software developers and stakeholders, supporting comprehension and maintenance tasks. Also, new important approaches will continue emerging; for instance, Model Driven Development and Model Driven Architecture (MDD/MDA) have grown in recent years, so it is expected that the Architecture Driven Modernization (ADM) process will recast RE concepts and methods in terms of models and visual languages across several levels of abstraction. Some work in ADM can be found in [42] and [43].

20For the computational method of the MECCA model please refer to [40].


Figure 6.7 shows the weighted attribute tree of the MECCA approach for a Reverse Engineering tool:

• Functionality (38%): Mandatory functions (90%), Optional functions (10%)

• Interface (23%): Man-machine interaction ability (60%), Interface to other tools (40%)

• I/O operation (19%): Scope of programming languages (30%), Report generation capability (30%), Accessibility to the repository (40%)

• Metrics (11%): Metrics collection (65%), Metrics analysis (35%)

• Verification strategy (9%): Consistency checking (60%), Constraining mechanisms (40%)

Figure 6.7 Application of the MECCA approach [2].

Besides, the proposal of new combined techniques is important, since it makes it possible to achieve greater effectiveness when solving common problems in Reverse Engineering, Software Comprehension and Maintenance. In this sense, Section 6.4 presents some elements that combine business rules extraction, text processing, and dependency analysis. On the other hand, more applications of the field will emerge; for example, RE for assessing security issues in software systems, or for inferring new business rules.


From another perspective, the field also has problems to solve. The research and academic communities should work toward a more structured field of study. Standard taxonomies and frameworks are needed [12,44] for, among other purposes, assessing RE methods and tools in a quantifiable way [12], and for defining a strong basis for research in each subdomain of RE: business rules extraction, module and architecture recovery, concept extraction, etc.

In the same way, the development of usable and useful tools (useful for different tasks) is a critical requirement. For example, tools should be flexible and extensible, and they should provide filtering and searching mechanisms and scalable features. As Tonella stated [12], “the possibility to customize and extend a tool clearly affects its usability and adaptability”. Therefore, research on new methods and the development of useful tools are mandatory, because as industry keeps adopting practices and tools, RE will continue consolidating as a major field.

Finally, the applicability analysis and the proposal presented in Section 6.4 will continue evolving toward a more structured and effective method and solution. A more detailed analysis will be performed to determine what other factors would be convenient to consider from the RE perspective.


REFERENCES [1] Chikofsky, E. J. & Cross II, J. H. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Softw., 7(1): 1990; 13–17. ISSN 0740-7459. doi:http://dx.doi.org/10.1109/52.43044.

[2] Skramstad, T. & Khan, M. Assessment of reverse engineering tools: A MECCA approach. In: Assessment of Quality Software Development Tools, 1992, Proceedings of the Second Symposium on: 1992, 120–126. doi:10.1109/AQSDT.1992.205845.

[3] Bennett, K. H. & Rajlich, V. Software maintenance and evolution: a roadmap. In: ICSE ’00: Proceedings of the Conference on The Future of Software Engineering. ACM, New York, NY, USA: 2000. ISBN 1-58113-253-0, 73–87. doi:http://doi.acm.org/10.1145/336512.336534.

[4] Cornelissen, B. Evaluating Dynamic Analysis Techniques for Program Comprehension. Doctoral Thesis, Delft University of Technology: 2009.

[5] Müller, H. A., Tilley, S. R. & Wong, K. Understanding software systems using reverse engineering technology perspectives from the Rigi project. In: Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research: software engineering - Volume 1, CASCON ’93. IBM Press: 1993, 217–226.

[6] Baxter, I. D. & Mehlich, M. Reverse engineering is reverse forward engineering. Science of Computer Programming, 36(2-3): 2000; 131–147. ISSN 0167-6423. doi:10.1016/S0167-6423(99)00034-9.

[7] Storey, M.-A. Theories, tools and research methods in program comprehension: past, present and future. Software Quality Journal, 14: 2006; 187–208. ISSN 0963-9314. doi:10.1007/s11219-006-9216-4.

[8] Quilici, A. A memory-based approach to recognizing program- ming plans. Commun. ACM, 37: 1994; 84–93. ISSN 0001-0782. doi:http://doi.acm.org/10.1145/175290.175301.


[9] Burnstein, I., Saner, R. & Limpiyakorn, Y. Using an artificial intelligence approach to build an automated program understanding/fault localization tool. In: Tools with Artificial Intelligence, 1999. Proceedings. 11th IEEE International Conference on: 1999. ISSN 1082-3409, 69–76. doi:10.1109/TAI.1999.809768.

[10] Canfora, G. & Cimitile, A. Software Maintenance, vol. 2, chapter 2. World Scientific Pub. Co: 2002, 15–20.

[11] Swanson, E. B. The dimensions of maintenance. In: Proceedings of the 2nd international conference on Software engineering, ICSE ’76. IEEE Computer Society Press, Los Alamitos, CA, USA: 1976, 492–497.

[12] Tonella, P., Torchiano, M., Du Bois, B. & Systa, T. Empirical studies in reverse engineering: state of the art and future trends. Empirical Softw. Engg., 12: 2007; 551–571. ISSN 1382-3256. doi:10.1007/s10664-007-9037-5.

[13] Lu, C. W., Chu, W., Chang, C. H., Chung, Y. C., Liu, X. & Yang, H. Reverse Engineering, vol. 2, chapter 18. World Scientific Pub. Co: 2002, 447–466.

[14] Asif, N. Software reverse engineering process: Factors, elements and features. International Journal of Library and Information Science, 2(7): 2010; 124–136.

[15] Yourdon, E. Structured Analysis Wiki. http://yourdon.com/strucanalysis: 2011.

[16] Jgrasp.org. The Control Structure Diagram (CSD) (tutorial). http://www.jgrasp.org/: 2009.

[17] Putrycz, E. & Kark, A. Recovering Business Rules from Legacy Source Code for System Modernization. In: Advances in Rule Interchange and Applications, (eds.) Paschke, A. & Biletskiy, Y., vol. 4824 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg: 2007, 107–118. doi:10.1007/978-3-540-75975-1_9.


[18] Putrycz, E. & Kark, A. Connecting Legacy Code, Business Rules and Documentation. In: Rule Representation, Interchange and Reasoning on the Web, (eds.) Bassiliades, N., Governatori, G. & Paschke, A., vol. 5321 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg: 2008, 17–30. doi:10.1007/978-3-540-88808-6_5.

[19] Hamou-Lhadj, A., Braun, E., Amyot, D. & Lethbridge, T. Recovering Behavioral Design Models from Execution Traces. In: Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on: 2005. ISSN 1534-5351, 112–121. doi:10.1109/CSMR.2005.46.

[20] Dugerdil, P. Using trace sampling techniques to identify dynamic clusters of classes. In: Proceedings of the 2007 conference of the center for advanced studies on Collaborative research, CASCON ’07. ACM, New York, NY, USA: 2007, 306–314. doi:http://doi.acm.org/10.1145/1321211.1321254.

[21] Gall, H. C., Rol, R. R. K. & Mittermeir, T. Abstract Pattern-Driven Reverse Engineering: 1995.

[22] Mancoridis, S., Mitchell, B., Chen, Y. & Gansner, E. Bunch: a clustering tool for the recovery and maintenance of software system structures. In: Software Maintenance, 1999. (ICSM ’99) Proceedings. IEEE International Conference on: 1999, 50–59. doi:10.1109/ICSM.1999.792498.

[23] Mitchell, B. & Mancoridis, S. On the automatic modularization of software systems using the Bunch tool. Software Engineering, IEEE Transactions on, 32(3): 2006; 193–208. ISSN 0098-5589. doi:10.1109/TSE.2006.31.

[24] OMG. Semantics of Business Vocabulary and Business Rules (SBVR): 2002.

[25] Baxter, I. A Standards-Based Approach to Extracting Business Rules. OMG’s Architecture Driven Modernization Workshop: 2005.


[26] Huang, H., Tsai, W., Bhattacharya, S., Chen, X., Wang, Y. & Sun, J. Business rule extraction from legacy code. In: Computer Software and Applications Conference, 1996. COMPSAC ’96., Proceedings of 20th International: 1996, 162–167. doi:10.1109/CMPSAC.1996.544158.

[27] Wang, X., Sun, J., Yang, X., He, Z. & Maddineni, S. Business rules extraction from large legacy systems. In: Software Maintenance and Reengineering, 2004. CSMR 2004. Proceedings. Eighth European Conference on: 2004. ISSN 1534-5351, 249–258. doi:10.1109/CSMR.2004.1281426.

[28] Shekar, S., Hammer, J., Schmalz, M. & Topsakal, O. Knowledge Extraction in the SEEK Project Part II: Extracting Meaning from Legacy Application Code through Pattern Matching. Technical report, University of Florida: 2003.

[29] Martinez-Fernandez, J. L., Gonzalez, J. C., Villena, J. & Martinez, P. A Preliminary Approach to the Automatic Extraction of Business Rules from Unrestricted Text in the Banking Industry. In: Proceedings of the 13th international conference on Natural Language and Information Systems: Applications of Natural Language to Information Systems, NLDB ’08. Springer-Verlag, Berlin, Heidelberg: 2008. ISBN 978-3-540-69857-9, 299–310. doi:http://dx.doi.org/10.1007/978-3-540-69858-6_29.

[30] Burnstein, I. & Roberson, K. Automated chunking to support program comprehension. In: Program Comprehension, 1997. IWPC ’97. Proceedings., Fifth International Workshop on: 1997, 40–49. doi:10.1109/WPC.1997.601262.

[31] Zaidman, A., Calders, T., Demeyer, S. & Paredaens, J. Applying Webmining Techniques to Execution Traces to Support the Program Comprehension Process. In: Software Maintenance and Reengineering, 2005. CSMR 2005. Ninth European Conference on: 2005. ISSN 1534-5351, 134–142. doi:10.1109/CSMR.2005.12.


[32] Lo, D. Mining specifications in diversified formats from execution traces. In: Software Maintenance, 2008. ICSM 2008. IEEE International Conference on: 2008. ISSN 1063-6773, 420–423. doi:10.1109/ICSM.2008.4658094.

[33] Lo, D., Khoo, S.-C. & Liu, C. Mining temporal rules from program execution traces. Int. Work. on Prog. Comprehension, vol. 20, no. 4: 2007; 227–247.

[34] Safyallah, H. & Sartipi, K. Dynamic Analysis of Software Systems using Execution Pattern Mining. In: Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on: 2006, 84–88. doi:10.1109/ICPC.2006.19.

[35] Cornelissen, B., Moonen, L. & Zaidman, A. An Assessment Methodology for Trace Reduction Techniques. In: Proceedings of the 24th International Conference on Software Maintenance, (ed.) Hong Mei, K. W. IEEE Computer Society: 2008. ISBN 978-1-4244-2614-0, 107–116.

[36] White, G. & Sivitanides, M. Cognitive Differences Between Procedural Programming and Object Oriented Programming. Information Technology and Management, 6: 2005; 333–350. ISSN 1385-951X. doi:10.1007/s10799-005-3899-2.

[37] Andrade, L., Gouveia, J., Antunes, M., El-Ramly, M. & Koutsoukos, G. Forms2Net - Migrating Oracle Forms to Microsoft .NET. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 4143 LNCS. Braga, Portugal: 2006. ISSN 03029743, 261–277.

[38] Bellay, B. & Gall, H. An evaluation of reverse engineering tool capabilities. Journal of Software Maintenance, 10: 1998; 305–331. ISSN 1040-550X. doi:10.1002/(SICI)1096-908X(199809/10)10:5<305::AID-SMR175>3.3.CO;2-Z.

[39] Brown, A. & Wallnau, K. A framework for evaluating software technology. Software, IEEE, 13(5): 1996; 39–49. ISSN 0740-7459. doi:10.1109/52.536457.


[40] Khan, M., Ramakrishnan, M. & Lo, B. Assessment Model for Software Maintenance Tools: A Conceptual Framework. In: PACIS 1997 Proceedings. Paper 51. http://aisel.aisnet.org/pacis1997/51: 1997.

[41] Canfora, G. & Di Penta, M. New Frontiers of Reverse Engineering. In: FOSE ’07: 2007 Future of Software Engineering. IEEE Computer Society, Washington, DC, USA: 2007. ISBN 0-7695-2829-5, 326–341. doi:http://dx.doi.org/10.1109/FOSE.2007.15.

[42] Cánovas, J. L. & Molina, J. G. A Domain Specific Language for Extracting Models in Software Modernization. In: ECMDA-FA ’09: Proceedings of the 5th European Conference on Model Driven Architecture - Foundations and Applications. Springer-Verlag, Berlin, Heidelberg: 2009. ISBN 978-3-642-02673-7, 82–97. doi:http://dx.doi.org/10.1007/978-3-642-02674-4_7.

[43] Izquierdo, J. & Molina, J. An Architecture-Driven Modernization Tool for Calculating Metrics. Software, IEEE, 27(4): 2010; 37–43. ISSN 0740-7459. doi:10.1109/MS.2010.61.

[44] Rasool, G., Maeder, P. & Philippow, I. Evaluation of design pattern recovery tools. Procedia Computer Science, 3: 2011; 813–819. ISSN 1877-0509. doi:10.1016/j.procs.2010.12.134. World Conference on Information Technology.

Agility Is Not Only About Iterations But Also About Software Evolution

Mario Linares-Vásquez Jairo Aponte

ABSTRACT

What is agility in software development? Agility is related to high-quality and fast software development. Most people think that agile methodologies are only about development with frequent releases over short iterations. However, agility is more than that; agility is based on a set of values and principles that embrace the real nature of software development and try to address the quality required by the stakeholders. The nature of software development is defined by phenomena such as requirements’ volatility, users’ volubility, incremental change, and uncertainty in schedules. The best way to address this nature is through an adaptive, evolutionary process, because software systems and their environments are in continuous evolution. Therefore, agility and agile methodologies were conceived some years ago to develop software as an adaptive process; now, they are a way to achieve an evolutionary process in software development.

7.1 INTRODUCTION

Plan-driven methodologies are based on two well-defined stages, design and construction; the former requires creative people to build detailed plans for construction; the latter is less expensive than design and can be achieved in a predictable way. This manufacturing model was


initially adopted by software development, because the software process was considered a predictable one. Consequently, the iterative waterfall process proposed by Royce [1] in the early 70s has been used as a sequential and plan-driven methodology1 with a long up-front elicitation stage. This has been recognized as a big mistake, because software development is not a predictable process. Software development is an adaptive process [2], and prominent researchers and practitioners such as Winston Royce, Frederick Brooks, Tom Gilb and Barry Boehm stated it during the 1980s. Brooks proposed rapid prototyping and requirements refinement as suitable approaches for dealing with the essential difficulties of software development in 1987 [3]; Boehm proposed an iterative, risk-driven model which uses prototypes and user feedback in 1988 [4]; Gilb suggested an evolutionary model for delivering value to stakeholders, fast and continuously, in 1988 [5].

Agile methodologies (formerly called light-weight methodologies) arose as a natural reaction of the software developer community against plan-driven methodologies, promoting adaptive development. Although adaptive and evolutionary development is not a new idea, the values and principles behind agility have promoted it during the last 10 years. Thus, agile methodologies recognize the essential difficulties and the nature of software development, and provide a way to develop software using adaptive processes. For us, followers of the agile philosophy, the most relevant features that describe the nature of software are:

• F1 - Requirements’ volatility and evolution. Heraclitus said in ancient times, “There is nothing permanent except change”, and software systems are not the exception, since they are open and operate in ecosystems together with other elements such as people, hardware, laws, business rules, and other systems. Evolution is a fact, and requirements’ volatility is a manifestation of evolution, because all the elements in the ecosystem are continuously changing.

• F2 - Users’ volubility. Users often understand their needs using software prototypes. Thus, the initial requirements are usually

1The real waterfall process proposed by Royce was conceived as an iterative process.

162 AGILITY IS NOT ONLY ABOUT ITERATIONS BUT ALSO ABOUT SOFTWARE EVOLUTION

raw and fuzzy, and can only be refined through exploration and experimentation (with prototypes). However, while users are testing and using software prototypes, the infinite space that represents their requirements can change easily (the odds are high). The experience of a user with a new prototype is the source of new requirements and a trigger for mental processes that refine the requirements space.

• F3 - The users’ needs are also tacit knowledge. Business rules and business logic are in the stakeholders’ minds and must be described in software artifacts. Explicit knowledge is easily expressed in artifacts, even if artifacts are written in ambiguous languages (natural languages, modeling languages). However, business rules are usually tacit knowledge, and getting them out of users’ minds onto software artifacts is a hard task.

• F4 - Software invisibility. Software is intangible; it lives in silicon chips, hard disks, and software designs. Two consequences of this feature are: (i) most users think that software development is an easy task, and (ii) there are no full spatial representations of software designs to perform early verification and validation.

• F5 - Code decay. Software evolution has a limit, because incremental changes gradually degrade the architecture of a system throughout its lifespan. Therefore, software systems die just like living systems do [6].

• F6 - Developers are non-linear components. Any software process is carried out by people, and we really think that software development requires creative people during each stage of the process (analysis, design, implementation, testing, deployment). However, software professionals are humans and cannot be considered plug-ins or replaceable parts [7].

These features describe common issues in software development processes. Can a software development process be predictable?
We think NO; the creators, practitioners and promoters of iterative development answer NO; agile people answer NO. And the answer of every person involved in software development should be NO.


Therefore, we will show you in this chapter how agile methodologies embrace software evolution and provide the developers with means for dealing with software nature. The structure of the rest of the chapter is as follows: Section 2 describes software evolutionary models. Section 3 discusses agility concepts, values and principles. Section 4 presents briefly the history of agile methodologies. Section 5 describes the main agile methods. Finally, Section 6 examines the relationship between software evolution and agility.

7.2 EVOLUTIONARY SOFTWARE PROCESSES

Software evolution is not a new term; it dates from the 1960s, and it was explicitly recognized by the community when Lehman [8] explained that evolution is an activity different from post-deployment maintenance. There are several definitions and considerations about what software evolution is, but we prefer to adopt the viewpoint in which evolution is the process of adapting the software to the environment by means of planned and unplanned activities.

Figure 7.1 Iterative development.

The foundations of evolutionary software processes are iterative and incremental development. Figure 7.1 represents iterative development with n iterations. Iterative development means making several product releases over iterations. Each iteration is associated with a full sequential waterfall process, and there is a responsibility chain between them; each iteration generates a product release, each release implements a set of features, and the iterations are part of a sequence. In iterative


development the outcome of an iteration can be a throwaway prototype that is discarded for the next iteration or a product release that can be used to build a new release in a further iteration.

Figure 7.2 Incremental development.

Incremental development is related to making evolvable or incremental products, where each one is an increment of the previous release. Figure 7.2 summarizes an incremental development with three product releases; each one includes the features implemented in previous releases and is built on the previous product. Each release can be built using some stages of the sequential waterfall process. However, when incremental development includes well-defined iterations (each product/prototype is built using a sequential waterfall process), it is called iterative and incremental development. Figure 7.3 depicts an iterative and incremental development process. This kind of process has also been called an evolutionary process; next we list some of them.

7.2.1 EVO

The Evolutionary Development Model (EVO) proposed by Gilb [5, 9] is an iterative and incremental process for early and on-time delivery of results. EVO relies on making small and frequent releases of high-value-first results to the stakeholders. EVO is based on the “Plan-Do-Study-Act” cycle: plan the EVO step, perform the EVO step, analyze the feedback results from the EVO step and the current environment, then decide what to do next. Its principles are [10]:

1. Capablanca’s next move: There is only one move that really counts, the next one.


Figure 7.3 Iterative and incremental development.

2. Do the juicy bits first: Do whatever gives the biggest gains. Do not let the other stuff distract you!

3. Better the devil you know: Successful visionaries start from where they are, and what they and their customers have.

4. You eat an elephant one bite at a time: System stakeholders need to digest new systems in small increments.

5. Cause and effect: If you change in small stages, the causes of effects are clearer and easier to correct, if needed.

6. The early bird catches the worm: Your customers will be happier with an early long-term stream of their priority improvements, than with years of promises, culminating in late disaster.

7. Strike early, while the iron is still hot: Release and test quickly with people who are most interested and motivated.

8. A bird in the hand is worth two in the bush: Your next step should give the best result you can get now.


9. No plan survives first contact with the enemy: A little practical experience beats a lot of committee meetings.

10. Adaptive architecture: Since you cannot be sure where or when you are going, your first priority is to equip yourself to go almost anywhere, anytime.

The characteristics of EVO are:

• Frequent delivery of system changes (steps).

• Steps delivered to stakeholders for real use.

• Feedback obtained from stakeholders to determine next step(s).

• The existing system is used as the initial system base.

• Small steps (ideally between 2%-5% of total project cost and time).

• Steps with the highest value and benefit-to-cost ratios are given highest priority for delivery.

• Feedback used “immediately” to modify long-term plans and requirements.

• Result-oriented (“delivering the results” is the prime concern).

Typical steps of the EVO process are:

1. Gather from all the key stakeholders the top few (5 to 20) most critical goals that the project needs to deliver.

2. For each goal, define a scale of measure and a “final” goal level.

3. Define approximately four budgets for your most limited resources.

4. Write up these plans for the goals and budgets.

5. Negotiate with the key stakeholders to formally agree on the goals and budgets.

6. Plan to deliver some goals in weekly increments (EVO steps).

7. Implement the project in EVO steps.
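The “highest benefit-to-cost first” rule in EVO (“do the juicy bits first”) amounts to a greedy ordering of candidate steps. A minimal sketch, with step names, values and costs invented for illustration:

```python
# Candidate EVO steps: (name, stakeholder value, cost); all invented.
steps = [
    ("online report export", 30, 10),   # ratio 3.0
    ("batch import rewrite", 80, 50),   # ratio 1.6
    ("login audit trail",    20,  4),   # ratio 5.0
]

# Deliver the steps with the highest benefit-to-cost ratio first.
plan = sorted(steps, key=lambda s: s[1] / s[2], reverse=True)
for name, value, cost in plan:
    print(f"{name}: ratio {value / cost:.2f}")
```

In practice the ratios would be re-estimated after each delivered step, since EVO feeds stakeholder feedback back into the plan.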


7.2.2 Spiral

The Spiral model proposed by Boehm [4] was conceived as a risk-driven approach to software development. Each cycle in the spiral involves the evolution of the product by continuously addressing the same sequence of steps, for each portion of the product and for each of its levels of elaboration (specification, modules, components, methods). With each cycle the product evolves by increments. Typical steps of a spiral cycle are (Figure 7.4):

1. Identify the functional requirements (objectives) of the portion of the product which is going to be elaborated in the current cycle.

2. Identify the possible strategies/alternatives for implementing the product, and the constraints imposed by each strategy.

3. Evaluate the alternatives relative to the objectives and constraints. This includes evaluating the risks and proposing strategies for mitigating them (prototyping, simulation, benchmarking, etc.).

4. Implement and test the product in order to reduce the dominant risks.

5. Make a review involving the primary people or organizations concerned with the product. The review covers all products developed during the previous cycle, including the plans for the next cycle and the resources required to carry them out.

6. Prepare the plan for the next cycle.

7.2.3 The Unified Process Family

The Unified Process (UP) is among the most widely used models for large projects. UP is based on previous models such as the Ericsson approach (1967), the Rational Objectory Process (1997) and the Rational Unified Process (1998). The main features of UP are [11]:

• It is an iterative and incremental model.


[Figure 7.4 depicts the Spiral model as a spiral of growing cumulative cost traversing four quadrants per cycle: 1. Determine objectives; 2. Identify and resolve risks (risk analysis, prototypes 1 and 2, operational prototype); 3. Development and test (concept of requirements, requirements, draft and detailed design, code, integration, test, release, with verification-and-validation plans); 4. Plan the next iteration (requirements plan, development plan, test plan, review).]

Figure 7.4 The Spiral model [4].

• It is use-case-driven. This means that use cases are used to specify the functional requirements and the iterations are planned and evaluated against use cases. Use cases drive the work through each iteration.

• It is centered on the architecture definition, because UP is focused on early development and baselining of an executable architecture.

• It is an adaptable framework because it may be tailored according to the team and the project.

• It is a two-dimensional model (Figure 7.5). The horizontal axis represents the dynamic aspect and the vertical axis represents the static aspect. The horizontal axis shows time and how the phases and iterations define the lifecycle aspects. The four phases of UP


are inception, elaboration, construction, and transition. Each of these has a set of well-defined goals, artifacts, and milestones; a summary of the goals is shown in Figure 7.6. UP is serial in the large because the whole process is split into four sequential phases, and it is iterative in the small because each phase is split into iterations. The vertical axis represents the disciplines in the lifecycle and the effort required for each one through the iterations. Figure 7.7 summarizes the main elements in the vertical dimension of UP.

• It includes risk mitigation by using iterations with product releases; assessments at the end of each phase in order to decide whether the project continues; and an inception phase aimed at defining the project scope and the business case for the system.

The Rational Unified Process (RUP) is the commercial version of UP, developed initially by Rational [12]. RUP has a dual nature: it is both a process and a product. A major difference between UP and RUP is that RUP has nine disciplines grouped into two types, (i) engineering (development) and (ii) support (Figure 7.8), whereas UP has only the development disciplines.

Another flavor of UP is the Enterprise Unified Process (EUP) proposed by Ambler [13]. It is an extension of RUP aimed at covering weaknesses of RUP and UP. The scope of RUP is the software process; therefore, RUP does not include activities of real development processes such as design of the IT architecture, maintenance, operation and support in the production environment, and portfolio management. Thus, EUP adds new disciplines and phases to RUP. These disciplines and phases are described as follows:

• Production and Retirement phases: These new phases represent the lifecycle after a system has been deployed. The purpose of the Production phase is to keep the software in production until it is either replaced with a new version or retired and removed from production. The purpose of the Retirement phase is the successful removal of a system from production.

• Enterprise disciplines: These are seven new enterprise management disciplines related to cross-system issues that organizations


Figure 7.5 The Unified Process (UP) model [11].

should address to be successful at IT. These disciplines are: enterprise business modeling, portfolio management, enterprise architecture, strategic reuse, people management, enterprise administration, and software process improvement.

• A new support discipline called Operations & Support: It is related to operating and supporting the software, especially after the system has been deployed in the production environment.

A full list of all the flavors of UP is in [14] and a summary of the UP history is in [15].

7.2.4 Staged Model

Maintenance is not just post-delivery work, nor is it a uniform task over the software life cycle, because it can be performed in several ways (perfective, adaptive, corrective, preventive). Maintenance is a series


Inception: define project scope; estimate cost and schedule; define risks; develop business case; prepare project environment; identify architecture.

Elaboration: specify requirements in greater detail; validate architecture; evolve project environment; staff project team.

Construction: model, build, and test system; develop supporting documentation.

Transition: system testing; user testing; system rework; system deployment.

Figure 7.6 The RUP phases and their goals [12].

of distinct stages, each one with different activities, tools, and business consequences. This is the motivation for the Staged Model proposed by Rajlich et al. [16]. Thus, the software life cycle consists of five stages (Figure 7.9):

1. Initial development: The system is built from scratch to meet initial requirements.

2. Evolution: Capabilities and functionality of the system are extended to meet user needs. It is possible to make major changes in the architecture.

3. Servicing: Minor defects are repaired and simple functional changes are performed through servicing patches.

4. Phaseout: Servicing is stopped, while the organization seeks to generate revenue from the system for as long as possible.

5. Closedown: The system is retired from the production environment.
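The staged model can be read as a simple state machine in which a system only moves forward through the stages (evolvability is lost after the evolution stage, so there is no path back). The following Python sketch illustrates this reading; the transition API is an assumption for illustration, not part of the model in [16].

```python
# Illustrative sketch: the staged model as a forward-only state machine.

STAGES = ["initial development", "evolution", "servicing", "phaseout", "closedown"]

# Each stage may only advance to the next one; there is no path back.
NEXT = {stage: STAGES[i + 1] for i, stage in enumerate(STAGES[:-1])}

class SystemLifecycle:
    def __init__(self):
        self.stage = STAGES[0]

    def advance(self):
        if self.stage == STAGES[-1]:
            raise ValueError("system already closed down")
        self.stage = NEXT[self.stage]
        return self.stage

life = SystemLifecycle()
assert life.advance() == "evolution"   # first running version delivered
assert life.advance() == "servicing"   # loss of evolvability
assert life.advance() == "phaseout"    # servicing discontinued
assert life.advance() == "closedown"   # switchoff
```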


A Worker/Role performs Activities and is responsible for Artifacts; an Activity creates or updates Artifacts; an Artifact follows a Template; an Activity is part of a Workflow; and a Discipline is defined by a Workflow.

Figure 7.7 UP elements.

7.3 PRINCIPLES, AGILITY AND THE AGILE MANIFESTO

7.3.1 The Agile Manifesto

On February 11-13, 2001, at The Lodge at Snowbird ski resort in the Wasatch mountains of Utah, 17 people representing Extreme Programming, SCRUM, DSDM, Adaptive Software Development, Crystal, Feature-Driven Development, Pragmatic Programming, and other “light-weight” methodologies met to talk about a new way of developing software, different from “heavy-weight” (plan/document-driven) processes. The results were an alliance around agile development and the “Manifesto for Agile Software Development” [17]. This manifesto is a set of value statements and principles that describe how people should implement agility in software development:


Figure 7.8 The Rational Unified Process (RUP) model [12].

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

Individuals and interactions over processes and tools.
Working software over comprehensive documentation.
Customer collaboration over contract negotiation.
Responding to change over following a plan.

That is, while there is value in the items on the right, we value the items on the left more [17].

Each one of the value statements indicates a preference (the first segment) and an item of lesser priority (the latter segment). It does


Initial development → first running version → Evolution (evolution changes) → loss of evolvability → Servicing (servicing patches) → servicing discontinued → Phaseout → switchoff → Closedown

Figure 7.9 The simple staged model [16].

not mean that agile people dismiss the second segments or that the second segments are bad practices. For agile people, the first segments in the statements are more important than the second segments. For example, agile practitioners recognize the importance of processes and tools, with the additional recognition that the interaction among skilled individuals has even greater importance.

7.3.2 Agile Principles

The second part of the manifesto is a set of principles that agile practitioners must follow:


• P1: Our highest priority is to satisfy the customer through early and continuous delivery of valuable software.

• P2: Welcome changing requirements, even late in development. Agile processes harness change for the customer’s competitive advantage.

• P3: Deliver working software frequently, from a couple of weeks to a couple of months, with a preference for the shorter timescale.

• P4: Business people and developers must work together daily throughout the project.

• P5: Build projects around motivated individuals. Give them the environment and support they need, and trust them to get the job done.

• P6: The most efficient and effective method of conveying information to and within a development team is face-to-face conversation.

• P7: Working software is the primary measure of progress.

• P8: Agile processes promote sustainable development. The sponsors, developers, and users should be able to maintain a constant pace indefinitely.

• P9: Continuous attention to technical excellence and good design enhances agility.

• P10: Simplicity –the art of maximizing the amount of work not done– is essential.

• P11: The best architectures, requirements, and designs emerge from self-organizing teams.

• P12: At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.


7.3.3 Agility in Software Development

Iterative does not mean agile, and incremental (evolutionary) does not necessarily mean agile. However, agile methodologies are iterative and incremental: they treat the software process as an adaptive one and are the evolution of iterative and incremental processes. Agility in software development rests on a set of values and principles that represent ways of dealing with the inherent features and difficulties of software development; these values and principles are stated in the agile manifesto. Thus, the main features of agile methodologies are:

• Iterative and incremental.

• Frequent planning and feedback.

• Frequent and rapid delivery of business value in the form of high-quality working software.

• Planning based on prioritized requirements. Users define high-value requirements.

• Minimum usage of bureaucracy and overhead within the development lifecycle.

• Embracing and managing changing requirements and business priorities.

• Collaborative decision making and continuous customer involvement.

• Usage of effective and efficient ways of communication.

• Empowered teams.

• Promotion of team’s skills improvement.

• Usage of creative and ergonomic environments.2

• Self-organizing and adaptive teams.

2An agile team-room wish list is presented in [18].


7.4 AGILE METHODOLOGIES HISTORY

The birth of agile methodologies is a process that started before the waterfall method [1]. The first steps of iterative methods were taken in the period between 1930 and 1970; iterative and incremental practices (time-boxed iterations and test-driven development) were used in military projects such as the X-15 and Mercury [19]. Larman et al. [19] present an interesting chronology of iterative and incremental practices in software development.

7.4.1 Iterative Development (1970-1990)

Iterative methodologies were widely used during the 1970s and the 1980s in research centers such as NASA and the IBM T.J. Watson Research Center. Furthermore, the waterfall method was wrongly adopted as a non-iterative model, with poor results in several projects. Facts like these were the reasons why people started to promote the principles of iterative methodologies by the end of the 1980s. Concurrent movements and publications such as [3-5, 20] are examples of the discussion that arose around the effectiveness and performance of the waterfall model.

Another element of discussion was the need to improve the results of the development process, drawing on thirty years of people's experience with large software systems. Brooks [3] proposed fast and iterative prototyping as suitable approaches for software development. Victor Basili and Albert Turner [20] proposed a method for developing software by iterative enhancement; the method starts with an implementation of a subset of the problem and then iteratively enhances the versions until the full system is implemented; it is an application of the stepwise refinement proposed in [21]. Tom Gilb in [5, 22] introduced EVO, an iterative and incremental methodology. He also started the discussion about the “evolutionary delivery” of software, based on fast and incremental delivery of high-value results. Gilb is perhaps the person who most promoted iterative models in the 1980s. Boehm [4] introduced his Spiral model as an enhancement of waterfall and other classic models.

By the end of the 1980s, the discussion about iterative methodologies, plus the ideas of Boehm, Gilb, Shultz, and Lantz [23], were used by


James Martin to define a process called Rapid Application Development (RAD) [24]. RAD is also recognized as a foundation for DSDM (the first agile methodology) and XP (the most radical of the agile methodologies). RAD's main features are iterative prototyping using CASE tools, short iterations, and small teams of motivated and experienced people with strong development skills.

7.4.2 The Birth of Agile Methodologies (1990-2001)

The 1990s is the decade of the agile methodologies. All the efforts during the 1970s and 1980s to promote iterative and incremental models resulted in the foundation of the Agile Alliance and the birth of a new kind of methodology [2], based on the set of value statements and principles written in the Agile Manifesto.

In January 1994, a group of 16 practitioners of RAD met in the United Kingdom to define an iterative process based on RAD. The result of this meeting was the Dynamic Systems Development Method (DSDM) [25]. DSDM is a method based on nine principles and four values. The DSDM process includes three phases (pre-project, project life-cycle, and post-project) and five stages (feasibility study, business study, functional model iteration, design and build iteration, and implementation).

SCRUM appeared on the scene as an iterative process with time-boxed iterations of 30 days, used by Ken Schwaber and Jeff Sutherland during 1993 and 1994 at Easel Corp. SCRUM is based on an iterative model used at Honda, Canon, and Fujitsu; this model was published in the article “The New Product Development Game” [26] and is considered the first version of SCRUM, with adaptive and self-directed team practices in a Sashimi approach3. SCRUM was published as a software development process for the first time in [27], and a refined version was presented in [28].

Feature Driven Development (FDD) is one of the agile methodologies that is process-oriented rather than people-oriented. It was used

3Sashimi is a Japanese dish consisting of fish or shellfish served in thin slices. In the context of SCRUM, Sashimi refers to delivering a slice of completed functionality in each time-boxed iteration.


in 1995 as a set of five processes designed by Jeff De Luca for implementing a system for a bank in Singapore, with 50 developers and a schedule of 15 months. FDD was influenced by Peter Coad's approach to modeling with UML and colors, making FDD the only agile methodology that explicitly uses UML as its modeling language. The FDD description was published for the first time in [29], and a deeper description is in [30].

Figure 7.10 The Crystal Family.

Another agile methodology conceived during the 1990s is Crystal. The Crystal family is the result of an empirical study carried out by Alistair Cockburn over ten years (starting around 1991), aimed at designing an effective methodology for software development and analyzing the impact of team size on project success. The study analyzed several teams and projects (especially successful ones) in order to find and document similarities between them. Two of the conclusions were:

• People-centered methodologies work better than process-centered methodologies.


• The right methodology for a project depends on the team size and the project objectives.

Thus, the Crystal family is a set of methodologies (clear, yellow, orange, red) for software development, to be applied according to team size and criticality level [31-33]. Figure 7.10 shows the Crystal family.

Adaptive Software Development (ASD) is the result of two experiences of Jim Highsmith in the early 1990s: as a Microsoft consultant, and as author, together with Sam Bayer in 1994, of a RAD methodology (RADical Application Development4) [34]. These two experiences, combined with ideas from complex adaptive systems, are the main elements in the conception of ASD [35].

Extreme Programming (XP) is the consolidation of the ideas of three people (Kent Beck, Ron Jeffries, and Ward Cunningham) in the C3 team [36]. The Chrysler Comprehensive Compensation (C3) system was initially conceived as the solution for the complex payroll system of Chrysler. The C3 project was canceled, but Beck's experience at Tektronix, the practices used at Chrysler, and the help of Jeffries (the first XP coach) were used to define the XP methodology [37, 38].

7.4.3 The Post-manifesto Age (2001-2011)

The Agile Alliance foundation and the agile manifesto were the starting point for the agile community and the boom of agility in software development. This is confirmed by the large number of publications in high-impact journals and magazines (Software Development, IEEE Software, IEEE Computer, Cutter IT Journal, Software Testing and Quality Engineering, The Economist, etc.); the publication of Web sites and agile conferences; the adoption of agile methodologies in industry; and the foundation of new agile companies. New methodologies conceived during this period include Agile Modeling (AM), Agile Unified Process (AUP), eXtreme Unified Process (XUP), Dynamic Systems Development Method vs. 4.2 (DSDM 4.2), DSDM Atern, Microsoft Solutions Framework for Agile Development (MSF4), OpenUP, and the Lean methodologies.

4It was a methodology with one-week iterations.

Figure 7.11 Agile methodologies evolution.

In the post-manifesto movement, methodologies based on lean manufacturing concepts are the most representative: Lean Software Development (LSD) [39, 40], Kanban [41], and Scrumban [42]. Figure 7.11 shows the evolution of agile methodologies since 1970 and the relationships between agile methodologies and some events in their history.


As a consequence of the agile movement, the Declaration of Interdependence (DOI) was published and signed as a set of six management principles initially intended for agile project managers:

Agile and adaptive approaches for linking people, projects and value: We are a community of project leaders that are highly successful at delivering results. To achieve these results:

• We increase return on investment by making continuous flow of value our focus.

• We deliver reliable results by engaging customers in frequent interactions and shared ownership.

• We expect uncertainty and manage for it through iterations, anticipation, and adaptation.

• We unleash creativity and innovation by recognizing that individuals are the ultimate source of value, and creating an environment where they can make a difference.

• We boost performance through group accountability for results and shared responsibility for team effectiveness.

• We improve effectiveness and reliability through situationally specific strategies, processes and practices [43].

7.5 AGILE METHODOLOGIES OVERVIEW

7.5.1 Extreme Programming (XP)

XP is the most popular and controversial of the agile methods. It is based on five values (simplicity, communication, feedback, courage, and respect), five principles (rapid feedback, assuming simplicity, incremental change, embracing change, quality work), and thirteen supporting and mandatory practices (whole team, planning game, small releases, customer tests, simple design, pair programming, test-driven development, refactoring, continuous integration, collective code ownership, coding standard, metaphor, sustainable pace). According to Fowler, the XP practices are concrete things that a team can do day-to-day, while values are the fundamental knowledge and understanding that underpin the approach [2]. XP's main features are high customer involvement, rapid feedback loops, continuous testing, continuous planning, and close teamwork to deliver working software at very frequent intervals, typically every 1-3 weeks.


7.5.2 SCRUM

SCRUM is a lightweight management framework for iterative and incremental development; it can be applied to any kind of product. In SCRUM each iteration (sprint) is time-boxed (30 days) and the development is driven by a set of prioritized requirements (sprint backlog). This backlog is a subset of the full list of requirements for the product (product backlog). SCRUM has two moments for planning:

• Daily SCRUM: It is a daily meeting driven by three questions that each member of the team has to answer (What have I done since the last daily scrum? What will I do today? What problems did I have?).

• Sprint planning: The customer chooses the items for the sprint backlog and the team estimates the effort for each item.
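Sprint planning can be illustrated with a small sketch. The following Python fragment greedily fills a sprint backlog from a prioritized product backlog up to the team's estimated capacity; the item names, efforts, and the capacity model are hypothetical, since SCRUM itself prescribes no algorithm for this selection.

```python
# Illustrative sketch of sprint planning: highest-priority items are taken
# from the product backlog while they fit the team's estimated capacity.

def plan_sprint(product_backlog, capacity):
    """product_backlog: list of (priority, effort, item); lower priority value
    means more important. Returns the items chosen for the sprint backlog."""
    sprint_backlog, used = [], 0
    for priority, effort, item in sorted(product_backlog):
        if used + effort <= capacity:
            sprint_backlog.append(item)
            used += effort
    return sprint_backlog

backlog = [(1, 5, "login"), (2, 8, "reports"), (3, 3, "export"), (4, 13, "search")]
print(plan_sprint(backlog, capacity=16))   # -> ['login', 'reports', 'export']
```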

Figure 7.12 SCRUM.

7.5.3 Feature Driven Development (FDD)

FDD is a model-driven, short-iteration process. It begins with a startup phase whose aims are to:

1. Establish an overall model of the system by modeling in color with UML (domain object modeling).


2. Build a features list by using the domain model and the feature naming template.

3. Make a development plan according to the features list.

Then it continues with the construction phase, a series of two-week “design by feature, build by feature” iterations (Figure 7.13). FDD relies heavily on color-coded artifacts such as the domain object model, the graphic summary of features, and the plan view.
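The feature naming template used in the startup phase is commonly stated as “<action> the <result> <by|for|of|to> a(n) <object>”. The following Python helper is an illustrative sketch of building a features list with that template; the helper and the example features are assumptions for illustration, not taken from the chapter.

```python
# Illustrative sketch of FDD's feature naming template:
# "<action> the <result> <by|for|of|to> a(n) <object>".

def feature(action, result, preposition, obj):
    assert preposition in {"by", "for", "of", "to"}
    return f"{action} the {result} {preposition} a {obj}"

# A features list built from the domain model (hypothetical examples).
features = [
    feature("Calculate", "total", "of", "sale"),
    feature("Assess", "fulfillment timeliness", "of", "sale"),
]
print(features[0])   # -> Calculate the total of a sale
```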

Figure 7.13 FDD lifecycle.

7.5.4 Lean Agile Development: LSD, Kanban, Scrumban

Lean agile methodologies are based on the principles of Lean Manufacturing [44, 45]:


1. Specify value from the standpoint of the end customer by product family.

2. Identify all the steps in the value stream for each product family, eliminating, whenever possible, those steps that do not create value.

3. Make the value-creating steps occur in tight sequence, so the product will flow smoothly toward the customer.

4. As flow is introduced, let customers pull value from the next upstream activity.

5. As value is specified, value streams are identified, wasted steps are removed, and flow and pull are introduced, begin the process again and continue it until a state of perfection is reached in which perfect value is created with no waste.

These principles were initially applied by Mary and Tom Poppendieck for developing the first agile methodology based on lean manufacturing: LSD [39,40]. Lean Software Development (LSD) has seven principles:

1. Optimize the whole: Optimizing a part of a system will always, over time, sub-optimize the overall system.

2. Eliminate waste: Waste is anything that does not deliver value to the customer.

3. Build quality in: If you routinely find defects in your verification process, your process is defective.

4. Learn constantly: Planning is useful. Learning is essential.

5. Deliver as fast as possible: Start with a deep understanding of all stakeholders and what they will value. Create a steady, even flow of work, pulled from this deep understanding of value.

6. Engage everyone: The time and energy of creative people are the scarce resources in today’s economy, and the basis of competitive advantage.


7. Keep getting better: Results are not the main point, the point is to develop people and systems capable of delivering results.

Figure 7.14 Kanban board.

A kanban is a physical card used in the Toyota Production System (TPS) as a tool of lean manufacturing. Kanban cards support non-centralized “pull” production control. In software development, this tool is used together with walls or whiteboards for visualizing project progress. Thus, in a kanban system the aim is to minimize work in progress (WIP) by “pulling” the parts or tasks when needed and “pushing” instructions about how to do the tasks. Every time a task is pulled from the WIP queue, a kanban card signals production. Therefore, a kanban card is information exchanged between separate processes; in the software development case, each process is a step of the development lifecycle. Figure 7.14 shows a typical kanban board. Kanban systems are based on a strict discipline described by the “six rules of kanban”:

1. Customer processes withdraw or “pull” items in the precise amounts specified on the Kanban (Downstream processes).


2. Supplier produces items in the precise amounts and sequences specified by the Kanban (Upstream processes).

3. No items are made or moved without a Kanban.

4. A Kanban should accompany each item, every time.

5. Defects and incorrect amounts are never sent to the next downstream process5.

6. The number of Kanbans is carefully reduced to lower inventories and to reveal problems.

These principles and the philosophy of kanban systems are used in newer agile methodologies such as Kanban [41, 46] and a version of SCRUM called Scrumban [42]. The main features of software development models based on kanban are:

• Requirements are split into pieces (tasks), and each one is written on a card and put on the wall.

• In the kanban board (wall) there are named columns to illustrate where each task is in the workflow.

• Workflow and progress of the process are always visible for the team on Kanban boards.

• Work in progress must be limited, by using explicit limits of how many tasks may be in progress at each workflow step.

• The process is optimized by measuring the lead time.
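The WIP-limit rule above can be made concrete with a small sketch. The following Python fragment models a board where a task may be pulled into a column only while that column is under its explicit limit; the column names and limits are illustrative assumptions, not prescribed by Kanban.

```python
# Illustrative sketch of a kanban board with explicit WIP limits per column.

class KanbanBoard:
    def __init__(self, limits):
        """limits: dict of column name -> WIP limit (None means unlimited)."""
        self.limits = limits
        self.columns = {name: [] for name in limits}

    def pull(self, task, src, dst):
        """Pull a task from column src into column dst, respecting dst's limit."""
        limit = self.limits[dst]
        if limit is not None and len(self.columns[dst]) >= limit:
            raise RuntimeError(f"WIP limit reached in '{dst}'")
        self.columns[src].remove(task)
        self.columns[dst].append(task)

board = KanbanBoard({"todo": None, "doing": 2, "done": None})
board.columns["todo"] = ["T1", "T2", "T3"]
board.pull("T1", "todo", "doing")
board.pull("T2", "todo", "doing")
try:
    board.pull("T3", "todo", "doing")   # third task exceeds the WIP limit
except RuntimeError as e:
    print(e)                            # -> WIP limit reached in 'doing'
```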

7.5.5 Agile Versions of UP: AgileUP, Basic/OpenUP

The UP family has two agile versions: AUP and Basic/OpenUP. The Agile Unified Process (AUP) is a simplified version of RUP proposed by Scott Ambler [47]. AUP preserves the essentials of RUP and uses

5Upstream processes produce parts only if their downstream processes need them. Downstream workers withdraw or “pull” the parts they need from their upstream processes.


agile techniques such as test-driven development, agile modeling, agile change management, and database refactoring. The AUP lifecycle is quite different from RUP's, because the AUP disciplines are modeling6, implementation, test, deployment, configuration management, project management, and environment. The AUP principles are:

1. The AUP product provides high-level guidance.

2. Simplicity.

3. Agility.

4. Focus on high-value activities.

5. Tool independence.

6. Tailor the AUP without taking a course or buying a product.

BasicUP (BUP) is a lightweight version of RUP for small projects. It is an iterative and incremental process based on scenario-driven development, risk management, and an architecture-centric approach. OpenUP is simply the renamed version of BUP published in [49]. OpenUP is driven by the four core principles listed below:

1. Collaborate to align interests and share understanding.

2. Balance competing priorities to maximize stakeholder value.

3. Focus on the architecture early to minimize risks and organize development.

4. Evolve to continuously obtain feedback and improve.

OpenUP is organized along two different dimensions: method content and process content. The method content defines the method elements (roles, tasks, artifacts, and guidance), and the process content is where the method elements are applied in a temporal sense. One of the innovations of OpenUP over RUP is that OpenUP addresses

6The goal of the modeling discipline is to understand the business of the organization and the problem domain being addressed by the project, and to identify a viable solution to address the problem domain, using agile modeling techniques [48].


organization of work at the personal, team, and stakeholder levels (Figure 7.15). The OpenUP method focuses on the following disciplines: requirements, architecture, development, test, project management, and configuration and change management.

Figure 7.15 OpenUP layers - http://epf.eclipse.org/wikis/openup/ (accessed and verified on April 10, 2011).

7.6 AGILITY AND SOFTWARE EVOLUTION

People in agile methods do not expect a detailed and complete set of requirements at the beginning of the project, because the environment of a software system and the requirements are subject to frequent changes. In order to deal with this feature of the software nature, agile methods use several strategies to minimize the risk and impact of new requirements in the process. Consider a typical example from control theory: systems without feedback loops have high odds of drifting away from the desired behavior, which means there is a high probability of reaching an entropic state. Entropy exists because change exists, and also because evolution is an inevitable natural phenomenon. If the environment evolves, then the systems in the environment must evolve. Thus, the strategy for dealing with entropy is to introduce feedback into the systems, in order to compare the desired behavior against the real behavior, make internal decisions (inside the system), and correct undesirable results. The corrective actions then depend on the recurrence and the quality of the feedback. The shorter the feedback period, the better the


results, but there are limits because it is impossible to build systems with feedback periods equal to zero.
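The feedback argument can be illustrated numerically. In the following Python sketch (all values are illustrative), a system corrected toward its target once per feedback cycle drives its error down, while an open-loop system (no feedback) never does:

```python
# Minimal numeric sketch of feedback vs. no feedback; values are illustrative.

def track(target, start, cycles, gain):
    """Return the absolute error observed at each cycle.

    gain = 0 models a system without feedback; gain > 0 applies a
    proportional correction toward the desired behavior each cycle.
    """
    value, errors = start, []
    for _ in range(cycles):
        errors.append(abs(target - value))
        value += gain * (target - value)   # feedback correction
    return errors

with_feedback = track(target=100, start=0, cycles=5, gain=0.5)
open_loop = track(target=100, start=0, cycles=5, gain=0.0)
assert with_feedback[-1] < open_loop[-1]   # feedback shrinks the error
```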

Figure 7.16 Communication modes and effectiveness [50].

But how is this related to software development? Software is a system, and the development process is a system too; the environment where software and its process “live” is continuously evolving, so the software and the process must evolve too. But how can the development team know whether the process is drifting away from the desired results? By frequently releasing the product to the stakeholders and measuring how much value the product gives them (feedback loops). And there are limits, because with too-short iterations the development process can degenerate into chaotic “code and fix”. That is why agile processes are iterative and incremental.

Feedback quality is also a factor in the successful evolution of systems, and agile people know it; thus agile methodologies use and recommend practices for getting good feedback, based on communication effectiveness. Alistair Cockburn [50] describes various modes of communication and their effectiveness. Figure 7.16 plots the relation


between the richness of communication modes and their effectiveness. Face-to-face conversation and design at a whiteboard are the most effective, and both are the preferred communication modes in agile methodologies.

Table 7.1 Agile principles vs. software features.
Principle vs. features: F1 F2 F3 F4 F5 F6
P1: X X X X
P2: X X X
P3: X X X
P4: X X
P5: X
P6: X X X
P7: X
P8: X
P9: X
P10: X
P11: X
P12: X

Agile methodologies embrace the real nature of software development. Table 7.1 shows the relationship between each of the agile principles and the software features presented in the introduction. Thus, agile methodologies are a good choice for dealing with the difficulties and features of software. However, there is still an open discussion about the effectiveness of agile methods in large projects with large teams, because these methods were initially designed for small teams. Scaling agile methods can be achieved with structures of small, distributed agile teams instead of typical hierarchical structures, together with management practices for coordinating the small teams. Highsmith [51] presents a structure called a "hub", which consists of a network of small agile teams; each node in the hub is a team, and the network may contain several feature teams, a customer team, an architecture team, a project-management team, an integration and build team, and even a center of excellence (Figure 7.17). Other examples of strategies and models for scaling Scrum teams can be found in [52] and [53].


Figure 7.17 Hub organizational structure [51].

7.7 TRENDS AND CHALLENGES In December 2009 the SEMAT7 (Software Engineering Method and Theory) initiative was launched as a new community effort to reshape

7SEMAT initiative: http://www.semat.org/


software engineering based on a solid theory, proven principles, and best practices. The founders and first signatories of this initiative are world-class experts who have pioneered the development of software engineering since its inception, and the group includes most of the creators of agile methodologies. Many new supporters are coming from all sides of the field and from around the world, and there is now an organization that defines the work of several tracks. One of the key points is that the SEMAT initiative aims to define a kernel of widely agreed elements needed to build software, using a minimalist approach: the kernel will be complete when no other element can be removed from it. This approach is the complete opposite of the one used in the definition of heavyweight methodologies, where almost every type of potentially useful element was included. In other words, the initiative recognizes that agile principles embody a smart way to build software, and therefore the core of the intersection of agile methods will surely be part of this kernel. However, the goals of SEMAT go beyond current agile methods. One of them is to obtain a foundation for the agile creation and enactment of software engineering methods (agile or more traditional) by practitioners themselves. Ideally, development teams would be able to customize, tailor, and adapt the process they use, not just at the beginning of a project but continuously, as necessary, over the course of a development effort. This means that agile principles will influence the way we think about well-defined processes, since in the future software development teams will not have a prescribed method that they must follow step by step regardless of specific circumstances and changes in the environment around them. Instead, the method used will be flexible enough to be adapted at any time during the software development effort.
In this regard, a consistent notation or language for describing software engineering practices is needed to allow developers to formally express changes in their method and, consequently, to be effective and scalable while remaining flexible and agile. Accordingly, an agile method will evolve to suit the particular circumstances of the project in which it is used.


REFERENCES [1] Royce, W. Managing the development of large software systems. In: Proceedings of IEEE WESCON: 1970 (reprinted in Proceedings of the 9th International Conference on Software Engineering, 1987).

[2] Fowler, M. The New Methodology: 2005. Accessed and Verified on April 11, 2011.

[3] Brooks, F. No Silver Bullet - Essence and Accidents of Software Engineering. Computer, 20: 1987; 10–19.

[4] Boehm, B. A spiral model of software development and enhancement. Computer, 21: 1988; 61–72.

[5] Gilb, T. Principles of Software Management. Addison-Wesley: 1988.

[6] Eick, S., Graves, T., Karr, A., Marron, J. & Mockus, A. Does Code Decay? Assessing the Evidence from Change Management Data. IEEE Transactions on Software Engineering, 27(1): 2001.

[7] Cockburn, A. Characterizing people as non-linear, first-order components in software development. Technical report, Humans and Technology: 1999.

[8] Lehman, M. Programs, Life Cycles, and Laws of Software Evo- lution. Proceedings of the IEEE, 68(9): 1980; 1060–1076.

[9] Gilb, K. Evolutionary Project Management and Product Development: 2007.

[10] Gilb, T. Competitive Engineering: A Handbook For Systems Engineering, Requirements Engineering, and Software Engineering Using Planguage. Butterworth-Heinemann: 2005.

[11] Jacobson, I., Booch, G. & Rumbaugh, J. The Unified Software Development Process. Addison-Wesley Professional: 1999.


[12] Kruchten, P. The Rational Unified Process: An Introduction. Addison-Wesley: 2003.

[13] Ambler, S., Nalbone, J. & Vizdos, M. The Enterprise Unified Process: Extending the Rational Unified Process. Prentice Hall: 2005.

[14] Ambler, S. The Unified Process From A to Z: 2009. Accessed and verified on April 11, 2011.

[15] Ambler, S. History of the Unified Process: 2010. Accessed and verified on April 11, 2011.

[16] Rajlich, V. & Bennett, K. A staged model for the software life cycle. Computer, 33(7): 2000; 66–71.

[17] Beck, K. Manifesto for Agile Software Development: 2001. Ac- cessed and verified on April 11, 2011.

[18] Hartmann, D. Designing Collaborative Spaces for Productivity: 2007. Accessed and Verified on April 11, 2011.

[19] Larman, C. & Basili, V. Iterative and Incremental Development. Computer, 36(6): 2003; 47–56.

[20] Basili, V. & Turner, A. Iterative Enhancement: A Practical Technique for Software Development. IEEE Transactions on Software Engineering, SE-1(4): 1975; 390–396.

[21] Wirth, N. Program Development by Stepwise Refinement. Communications of the ACM, 14(4): 1971; 221–227.

[22] Gilb, T. Software Metrics. Winthrop Publishers: 1977.

[23] Lantz, K. Prototyping Methodology. Reston Pub Co: 1986.

[24] Martin, J. Rapid Application Development. Macmillan Coll Div: 1991.

[25] Stapleton, J. DSDM: The Method in Practice. Addison-Wesley: 1997.


[26] Takeuchi, H. & Nonaka, I. The new product development game. Harvard Business Review, January-February: 1986; 137– 146.

[27] Sutherland, J. & Schwaber, K. Controlled Chaos: Living on the Edge. Technical report, Advanced Development Methods: 1996.

[28] Beedle, M. Scrum: An extension pattern language for hyperproductive software development. Pattern Languages of Program Design, 4: 1999; 637–651.

[29] Coad, P., Lefebvre, E. & De Luca, J. Java Modeling in Color with UML: Enterprise Components and Process. Prentice Hall: 2000.

[30] Palmer, S. & Felsing, J. A Practical Guide to Feature-Driven Development. Prentice Hall PTR: 2002.

[31] Cockburn, A. Surviving Object-Oriented Projects. Addison-Wesley Professional: 1998.

[32] Cockburn, A. People and methodologies in software development. Doctoral thesis, Faculty of Mathematics and Natural Sciences, University of Oslo: 2003.

[33] Cockburn, A. Crystal Clear: A Human-Powered Methodology for Small Teams. Addison-Wesley Professional: 2004.

[34] Bayer, S. & Highsmith, J. Radical Software Development. American Programmer Magazine, 7: 1994; 35–42.

[35] Highsmith, J. Adaptive Software Development: A Collaborative Approach to Managing Complex Systems. Dorset House Publishing: 2000.

[36] Beck, K. Chrysler goes to extreme. Distributed Computing: 1998; 24–28.

[37] Beck, K. Extreme Programming Explained. Addison-Wesley: 2000.


[38] Beck, K. & Fowler, M. Planning Extreme Programming. Addison-Wesley: 2000.

[39] Poppendieck, M. & Poppendieck, T. Lean Software Development: An Agile Toolkit. Addison-Wesley: 2003.

[40] Poppendieck, M. & Poppendieck, T. Implementing Lean Software Development: From Concept to Cash. Addison-Wesley: 2006.

[41] Kniberg, H. & Skarin, M. Kanban and Scrum - Making the Most of Both. InfoQ: 2010.

[42] Ladas, C. Scrumban - Essays on Kanban Systems for Lean Software Development. Modus Cooperandi Press: 2009.

[43] Highsmith, J. Declaration of Interdependence: 2005. Accessed and Verified on April 11, 2011.

[44] Womack, J. & Jones, D. The Machine that Changed the World: The Story of Lean Production. Harper Perennial: 1991.

[45] Womack, J. & Jones, D. Lean Thinking. Simon & Schuster: 1996.

[46] Anderson, D. Kanban. Blue Hole Press: 2010.

[47] Ambler, S. The Agile Unified Process (AUP): 2009. Accessed and verified on April 11, 2011.

[48] Ambler, S. Agile Modelling. Wiley: 2001.

[49] Kroll, P. & MacIsaac, B. Agility and Discipline Made Easy: Practices from OpenUP and RUP. Addison-Wesley Professional: 2006.

[50] Cockburn, A. Agile Software Development. Addison-Wesley Professional: 2001.

[51] Highsmith, J. Agile project management. Addison-Wesley: 2004.


[52] Woodward, E., Surdek, S. & Ganis, M. A Practical Guide to Distributed Scrum. IBM Press: 2010.

[53] Larman, C. & Vodde, B. Practices for Scaling Lean & Agile Development. Addison-Wesley: 2010.

Software Development Agility in Small and Medium Enterprises (SMEs)

Victor Escobar-Sarmiento Mario Linares-Vásquez Jairo Aponte

ABSTRACT The creation of new small and medium-sized companies in the software development industry has been driven by the worldwide acceptance of software as an important part of daily life, and by the continued growth of the software development industry during the last decade. The rapid pace at which these companies are founded and enter into business exposes them to drawbacks such as informality in their processes and management models, and methodological deficiencies. On the other hand, agile methodologies have been conceived as a model for developing software with high quality, using small workgroups without hierarchical organization and reducing the ceremony of internal processes. These methodologies are characterized by constant planning, continuous feedback, and permanent interaction with the client. Agile methodologies thus appear as a way to improve the performance and results of companies interested in software development, especially small and medium ones. In this chapter we present the features of small and medium companies and the challenges they face in adopting agile methodologies.


8.1 INTRODUCTION In the last decades, the rapid evolution of technology has been a crucial factor in the birth and growth of new companies that provide many kinds of services worldwide. Nowadays, software development companies flood the market, offering new products and solutions. Acs et al. [1] argued that small firms are indeed the engines of global economic growth, technological advancement, and employment opportunities. In most countries, SMEs (Small and Medium Enterprises) dominate the industrial and commercial infrastructure, and economists believe that the wealth of nations and the growth of their economies depend on SME performance [2]. SMEs are also the most important companies in terms of software production. Tables 8.1 and 8.2 show the importance of India, Ireland, and Israel in the software market. Ireland, for example, is recognized as one of the principal countries in this field; the software sector is one of the top sectors of its economy, and it grew 25% faster than international markets in the nineties [3]. In Ireland, almost 99% of companies are SMEs, each employing fewer than 50 people; nevertheless, they account for over 68% of private sector employment [4]. A report by ICT Ireland [5] shows that the employment contribution of Irish SMEs is just 11%, compared with 15% in the United States. Another remarkable country in the genesis of software development SMEs is India. Starting in the 70s, Indian companies experienced unexpected growth, employing about 345,000 people in 2004 and generating earnings of almost US$12 billion, equivalent to 3.3% of global services spending.

Table 8.1 Software exports from India, Ireland and Israel (US$ millions) [6].
                          India     Ireland    Israel
1990                      105       2,132      90
2000                      6,200     8,865      2,600
2002                      7,500     12,192     3,000
2003                      8,600     11,819     N/A
Employment (2003)         260,000   23,930     15,000
Revenue/employee (2003)   33,076    493,988    273,000


Table 8.2 Growth of the Indian software industry [6].
Year   Total Exports    No. of   Average revenue   Average revenue      Exports/
       (US$ millions)   firms    per firm (US$)    per employee (US$)   Revenue (%)
1980   4                21       190,476           16,000               50
1984   25.3             35       722,857           18,471               50
1990   105.4            700      150,571           16,215               N/A
2000   5,287            816      7,598,039         32,635               71.8
2004   12,200           3,170    7,004,154         35,362               73.9

The software industry in Colombia has taken an important role in the economy; in recent years its growth has been significant, with revenues of around US$465 million in 2009 [7], and SMEs are its main participants. Reports by Proexport 1 show that the Colombian market doubled its earnings from 2006 to 2009 [7]. However, according to [8], the software products of Colombian companies lack recognition outside local boundaries, because their market is not sophisticated and is oriented only to customers in Latin America. The same study reports a list of factors to be improved in the industry, mostly related to SME configuration, industrial maturity, government support, infrastructure, and the human resources involved in the processes. Additionally, [9] and [10] describe a primary set of obstacles to SME development in Colombia, which includes the following items:
• Difficulties recognizing and accessing appropriate technology.
• Formalization and absorption of new technologies.
• Technical and competitive limitations.
• Poor physical infrastructure.
• Lack of managers with managerial skills and strategic thinking.
• Lack of qualified human resources.
• Limited access to external markets; among others.

1Proexport is the entity that promotes tourism, foreign investment and exports in Colombia. http://www.proexport.com.co/ (accessed and verified on November 17, 2011)


(a) Software development companies in Colombia (2005-2009). (b) Size participation of software development companies in Colombia.

Figure 8.1 Software development in Colombia [7].

Figure 8.2 Earnings generated by Software development in Colombia [7].

Therefore, smart and effective solutions to these kinds of problems are needed, and we think agile methodologies suit SME features better than traditional plan-driven methodologies. This chapter is organized as follows: Section 2 explains the legal definition of SMEs globally and locally; Section 3 describes the current situation of agile methodologies around the world; Section 4 presents some agility assessment models; Section 5 discusses the weaknesses and strengths of agile methodologies; Section 6 describes the challenges of adopting agile methodologies in SMEs; and the last section draws conclusions and summarizes the chapter.


8.2 LEGAL DEFINITION OF SMES In this section we present the international and local definitions of what SMEs are.

8.2.1 Foreign Definitions The European Commission defines SMEs according to Recommendation 2003/361/EC of 6 May 2003, in effect from January 1st, 2005 (published in OJ L 124 of 20.5.2003, p. 36):

• A micro enterprise has a headcount of less than 10, and a turnover or balance sheet total of not more than €2 million.

• A small enterprise has a headcount of less than 50, and a turnover or balance sheet total of not more than €10 million.

• A medium-sized enterprise has a headcount of less than 250, and a turnover of not more than €50 million or a balance sheet total of not more than €43 million. Other definitions can be found in Table 8.3.
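The thresholds above can be expressed as a small classification routine (an illustrative sketch: the function name is ours, and the rule that the headcount ceiling must be met together with at least one of the two financial ceilings is our reading of the Recommendation):

```python
# Hedged sketch of the EC size categories quoted above; monetary amounts
# are in millions of euros. A category applies when the headcount is under
# its ceiling and at least one of the two financial ceilings is met.
def classify_enterprise(headcount, turnover_meur, balance_meur):
    if headcount < 10 and (turnover_meur <= 2 or balance_meur <= 2):
        return "micro"
    if headcount < 50 and (turnover_meur <= 10 or balance_meur <= 10):
        return "small"
    if headcount < 250 and (turnover_meur <= 50 or balance_meur <= 43):
        return "medium"
    return "large"

assert classify_enterprise(8, 1.5, 1.8) == "micro"
assert classify_enterprise(45, 9.0, 12.0) == "small"    # balance above, turnover below
assert classify_enterprise(200, 48.0, 60.0) == "medium"
```

Note how the second example still qualifies as small: only one of the two financial ceilings needs to be satisfied.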

According to Schatz [12], the main characteristics of these kinds of companies are:

1. SMEs are strongly owner-manager driven. Most of the decision makers’ time is spent on doing routine tasks. In many cases, they are family run.

2. SMEs are driven by the demand for improving productivity, cutting costs, and ever-shorter life cycles.

3. SMEs do not have extensive processes or structures. They are run by one individual or a small team, who make decisions on a short-term horizon.

4. SMEs are generally more flexible, and can quickly adapt the way they do their work around a better solution.

5. SME entrepreneurs are generally "all-rounders" with basic knowledge in many areas. They are good at multi-tasking.


Table 8.3 Definition of SMEs in selected countries (adapted from [11]); *(CBI, 2009), **(ISIPO, 2009).
European Commission. Small: 10-50 employees; turnover less than €10 million (US$13.5 million); balance sheet total less than €10 million. Medium: fewer than 250 employees; turnover less than €50 million (US$67.6 million); balance sheet total less than €43 million.
Iran. Small: fewer than 10 employees*, fewer than 50 employees**. Medium: 10-100 employees*, 50-250 employees**.
Malaysia. Small: 5-50 employees; turnover between RM 250,000 (US$75,000) and less than RM 10 million (US$3 million). Medium: 50-150 employees; turnover between RM 10 million (US$3 million) and RM 25 million (US$7.5 million).

6. SMEs are more people-dependent than process-dependent. Specific individuals do certain tasks; their experience and knowledge enable them to do so.

7. SMEs are often less sophisticated, since it is hard for them to recruit and retain technology professionals.

8. SMEs are more focused on medium-term survival than long-term profits.

9. SMEs do not focus on efficiencies. They end up wasting a lot of time and money on general and administrative expenses.

10. SMEs are time-pressured.

11. SMEs want a solution, not a particular machine or service.


12. SMEs focus on gaining instant gratification with technology solu- tions. These solutions must be simple to use and easy to deploy, and must provide clear tangible benefits.

13. SMEs do not necessarily need to have the "latest and greatest" technology. The solution can use "lag technology", which is cheaper to obtain and use.

8.2.2 Local Definition (Colombia) The SME definition in Colombia is established in Law 590 of 2000. It defines an SME as every unit of economic exploitation, established by natural or legal persons, in business, agricultural, industrial, commercial or service activities, that meets the following parameters:

• Medium enterprises

– Number of employees: Between 51 and 200.
– Total assets: Between 5,001 and 15,000 SMLV2.

• Small enterprises

– Number of employees: Between 11 and 50.
– Total assets: Between 501 and 5,000 SMLV.

• Micro enterprises

– Number of employees: Less than 10.
– Total assets: Less than 500 SMLV.

For SMEs where the combination of employee count and total assets yields different categories, the determining factor is the total assets.
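The parameters above, including the tie-breaking rule, can be sketched as follows (an illustrative sketch; the function names and the handling of boundary values between categories are our own assumptions):

```python
# Hedged sketch of the Law 590 of 2000 parameters quoted above. When the
# employee count and the total assets (in SMLV) point to different
# categories, the law makes total assets the determining factor.
def category_by_assets(assets_smlv):
    if assets_smlv < 500:
        return "micro"
    if assets_smlv <= 5000:
        return "small"
    if assets_smlv <= 15000:
        return "medium"
    return "large"

def category_by_employees(employees):
    if employees < 10:
        return "micro"
    if employees <= 50:
        return "small"
    if employees <= 200:
        return "medium"
    return "large"

def classify(employees, assets_smlv):
    by_emp = category_by_employees(employees)
    by_assets = category_by_assets(assets_smlv)
    # assets decide whenever the two parameters disagree
    return by_assets if by_emp != by_assets else by_emp

assert classify(30, 3000) == "small"   # both parameters agree
assert classify(30, 8000) == "medium"  # assets decide on disagreement
```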

2Legal minimum wage (for its initials in Spanish - SMLV: Salario Mínimo Legal Vigente); equivalent to US$302.


8.3 AGILE METHODOLOGIES IN THE REAL WORLD This section reports the results of the 5th annual State of Agile Development survey [13]. The report shows the main participants in ASDM3 usage, the impact of ASDM within organizations, and the tools used in ASDM implementations. The survey was conducted between August 11 and October 30, 2010, and the data from a total of 4770 responses were analyzed and prepared into a summary report. The respondents were most commonly project managers (almost 20%) and other managerial staff; the roles of the remaining respondents were distributed as follows:
• 9.7% were development managers.
• 9.2% were team leaders.
• 9% were developers.
• 8.9% had other roles.
More than 90% of the survey respondents reported some kind of knowledge about ASDM, although only 30% of them had practiced ASDM in previous jobs. More than 60% of the participating companies and organizations that have implemented ASDM approve of its usage; ASDM and its practices helped them improve the results of their software development processes. The survey shows that many people want to get more involved in agile methodologies, and those who have worked with them want to improve their skills and achieve a more effective, faster, and friendlier software development process. The results of the survey are listed as follows:

8.3.1 Knowledge About Agile Methodologies
• 42.7% were moderately knowledgeable.
• 25.3% were extremely knowledgeable.
• 23.7% were knowledgeable.
• 8.3% had very little knowledge.
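As a quick arithmetic check, the four shares reported above add up to exactly 100% of respondents:

```python
# Trivial verification of the knowledge-level shares reported above.
shares = [42.7, 25.3, 23.7, 8.3]
total = sum(shares)
assert abs(total - 100.0) < 1e-6
```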

3Agile Software Development Methodologies.


8.3.2 People Who Practiced Agile at a Previous Company • 34% did.

• 66% did not.

The survey also identifies concerns prior to ASDM adoption; the most common are the following:

• 224 people are concerned about loss of management control.

• 210 people are concerned about lack of up-front planning.

• 204 people are concerned about management opposed to change.

• 173 people are concerned about lack of documentation.

8.3.3 Roles in Agile Usage According to the survey, the role most closely related to agile usage is the VP/director of development, with 21.5%, in contrast to the QA group, with 0.9%. The results are:

• 21.5% named the VP/director of development.

• 17.9% named the project manager.

• 15.4% named the development manager.

• 12.8% named the product manager.

• 7.9% named the team lead.

• 7.4% named the developer.

8.3.4 Reasons for Agile Adoption Enhancing speed is one of the most important reasons to implement ASDM. Respondents rated the reason "Accelerate Time to Market" as "Very important" (41.5%) or "Highest importance" (37.4%), whereas


they rated the reason "Manage distributed teams" as "Not important at all" (45.8%) or "Somewhat important" (31.8%). For the reason "Enhance Software Quality", respondents answered "Very important" (49%) or "Highest importance" (23.6%). On the other hand, 40.3% of respondents answered that "Increased Engineering Discipline" is only "Somewhat important" for adopting agile methods.

8.3.5 Agile Methodology Used Scrum and the Scrum/XP hybrid were the most used methodologies, accounting for 75.6% of answers, while the other methodologies each accounted for less than 6%. The results were:

• 58% used Scrum.

• 17.6% used ScrumXP Hybrid.

• 5.4% used CustomHybrid.

• 3.7% used Extreme programming (XP).

• 2.1% used Feature-driven development (FDD).

• 3.3% did not know what methodology they used.

8.3.6 What Has Already Been Achieved by Using ASDM For this topic, 46% of the respondents answered that the "ability to manage changing priorities" was significantly improved after adopting agile; others (38.5%) said that ASDM has significantly improved project visibility. According to the survey results, project performance improved after adopting ASDM.

8.3.7 Barriers to Further Agile Adoption The most frequently cited barriers to ASDM adoption are:

• 1286 respondents (17.1%) said that the ability to change organi- zational culture was a barrier.


• 1023 respondents (13.6%) answered that the availability of per- sonnel with the necessary skills was a barrier.

• 1018 respondents (13.5%) responded that the general resistance to change was a barrier.

• 867 respondents (11.5%) said that managerial support was a bar- rier.

8.3.8 Agile Practices The most commonly used techniques were iteration planning, with 83%; daily stand-up meetings, with 82%; and unit testing, with 77%; while the least common techniques were collective code ownership, with 36%; Kanban, with 18%; and behavior-driven development, with 9%.

8.3.9 Plans for Implementing Agile on Future Projects Most respondents, about 436 people (63.3%), indicated that they or their companies plan to implement agile methods on future projects; about 193 respondents (28%) said that they or their companies do not know whether they will, and only 60 respondents (8.7%) answered that they will not.

8.3.10 Using Agile Techniques on Outsourced Projects The survey reports that 1082 respondents (41%) do not outsource. On the other hand, 845 (32.1%) outsource or are planning to do so. Others (12%) answered that they are using agile methods on outsourced projects but do not plan to do so in the future, and 10.8% do not currently use agile methods on outsourced projects and do not plan to.

8.4 AGILITY ASSESSMENT MODELS The adoption of agile methods depends on the project, the team, and the company. Accordingly, several models have been proposed to assess the


level of agility and discipline in companies. Below we describe some of these models.

8.4.1 Boehm and Turner’s Agility and Discipline Assessment

Boehm et al. [14] argued that, according to a set of factors in software projects, agility and discipline should be combined to achieve good results in software development. "Discipline is the foundation for any successful endeavor" [14]; it can be compared to an athlete's training or a musician's practice. Without discipline, success would be occasional. The authors highlight that discipline creates well-organized memories, history, and experience in an organization. However, discipline has its counterpart, agility. Taking the athlete example, agility gives the ability to make an unexpected play; likewise, it provides engineers with the ability to embrace changing technology, needs, and requirements. Agility, according to the authors, uses memory and history to adjust to new environments, react and adapt to changes, take advantage of unexpected opportunities, and update the experience for the future. It is clear that people develop software using both agility and discipline, both agile and plan-driven methods; thus, the best way to approach development is to find a balance between the two models.

Plan-driven and Agile Methods

The plan-driven approach is the traditional way to develop software. It is based on formal methodologies adopted from traditional engineering disciplines, such as civil and mechanical engineering. For example, in the construction of a new vehicle in an assembly factory, the design phase focuses on the specification of the principal characteristics of the new vehicle, such as materials, structure, and shape, among others. Then the process goes to the assembly line, which carries out the plan made in the design phase; at this stage all the materials, personnel, and dependencies necessary to build the vehicle are available. In the software development case, this approach is represented by a requirements/design/build paradigm with standard, well-defined processes that organizations often improve continuously.


Boehm et al. [14, 15] describe the plan-driven approach as systematic engineering that carefully adheres to specific processes that move a software product through a series of representations from requirements to finished code. Thus, there is a need for complete documentation in every step of the process. This is the traditional way of viewing the development cycle as a waterfall from the concept up to the product release. The plan-driven approach requires management support, organizational infrastructure, and an environment where the participants understand the importance of common processes to their personal work and to the success of the enterprise. Boehm et al. [14, 15] describe agile methods as lightweight processes that employ short iterative cycles; actively involve users to establish, prioritize, and verify requirements; and rely on the tacit knowledge of the team as opposed to documentation. Agile methods have the following attributes:

• Iterative (several cycles).

• Incremental (the entire product is not delivered at once).

• Self-organizing (teams determine the best way to handle work).

• Emergence (processes, principles, and work structures are recog- nized during the project rather than predetermined).

The rapidly changing nature of software demands greater speed from software developers and new techniques to achieve their goals. The problem of change manifests in long development cycles that yield code that may be well written but does not meet user expectations. Agile methods deal with this problem. However, they have some requirements for success, such as a close relationship with the customer and the final users of the system under development, motivated and knowledgeable team members, and a minimal documentation effort.

Agile and Plan-driven Methods Home Grounds Boehm et al. [14] identified five critical decision factors related to agile and plan-driven home grounds (Table 8.5), and summarized them on a five-axis plane (Figure 8.3) where:


• The size axis represents the number of persons working on the project.
• The culture axis represents the balance between chaos and order.
• The dynamism axis estimates how much the team or organization likes to work on the edge of chaos, as opposed to working with more planning and defined practices and procedures.
• The personnel axis represents the skills of the team (the different skill levels are explained in Table 8.4).
• The criticality axis represents the criticality of the project, measured as the potential loss (up to loss of life) resulting from defects in the product.

Table 8.4 Personnel characteristics [14, 15].
Level  Characteristics
3      Able to revise a method (break its rules) to fit an unprecedented situation.
2      Able to tailor a method to fit a precedented new situation. Can manage a small, precedented agile or plan-driven project but would need level 3 guidance on complex, unprecedented projects.
1A     With training, able to perform discretionary method steps (e.g., sizing tasks for project timescales, composing patterns, architecture reengineering). With experience, can become level 2. 1A's perform well in all teams with guidance from level 2 people.
1B     With training, able to perform procedural method steps (e.g., coding a class method, using a CM tool, performing a build/installation/test, writing a test document). With experience, can master some level 1A skills. May slow down an agile team but will perform well in a plan-driven team.
-1     May have technical skills, but unable or unwilling to collaborate or follow shared methods. Not good on an agile or plan-driven team.

These five factors associated with the ASDMs and the plan-driven methods can be summarized in the radar plot (Figure 8.3), which allows us to visualize the agility needed in each organization. Therefore, the level of agility is based on the kind of projects underway and the organization in charge of them.
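As an illustration of how such an assessment might be mechanized, the sketch below scores a hypothetical project on the five axes. The 0–1 scores, the 0.5 threshold, and the function are all invented for illustration and are not part of Boehm and Turner's method.

```python
# Illustrative sketch only: rate a hypothetical project on the five
# decision-factor axes and flag which factors lean toward the agile
# home ground. Scores and threshold are invented, not from [14, 15].

def agile_leaning_factors(ratings):
    """Return the axes whose rating falls on the agile side.

    `ratings` maps each axis to a 0.0-1.0 score, where 1.0 is the
    agile extreme (small team, chaotic culture, high dynamism, ...).
    """
    return [axis for axis, score in ratings.items() if score >= 0.5]

project = {
    "size": 0.9,        # small team
    "culture": 0.7,     # thrives on some chaos
    "dynamism": 0.8,    # frequent requirement changes
    "personnel": 0.6,   # mostly level 2/3 developers
    "criticality": 0.9, # low criticality (few consequences of defects)
}

print(agile_leaning_factors(project))  # all five axes lean agile here
```

A project whose scores cluster near the other extreme would, by the same reading of the radar plot, be a plan-driven home ground.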

214 SOFTWARE DEVELOPMENT AGILITY IN SMALL AND MEDIUM ENTERPRISES (SMES)

Table 8.5 Agile and plan-driven method home grounds and levels of software method understanding and use [15].

APPLICATION
Primary Goal. Agile: rapid value; responding to change. Plan-driven: predictability, stability, high assurance.
Size. Agile: smaller teams and projects. Plan-driven: larger teams and projects.
Environment. Agile: turbulent; high change; project-focused. Plan-driven: stable; low change; project/organization focused.

MANAGEMENT
Customer Relations. Agile: dedicated on-site customers; focused on prioritized increments. Plan-driven: as-needed customer interactions; focused on contract provisions.
Planning and Control. Agile: internalized plans; qualitative control. Plan-driven: documented plans; quantitative control.
Communications. Agile: tacit interpersonal knowledge. Plan-driven: explicit documented knowledge.

TECHNICAL
Requirements. Agile: prioritized informal stories and test cases; undergoing unforeseeable change. Plan-driven: formalized project, capability, interface, quality, foreseeable evolution requirements.
Development. Agile: simple design; short increments; refactoring assumed inexpensive. Plan-driven: extensive design; longer increments; refactoring assumed expensive.
Test. Agile: executable test cases define requirements, testing. Plan-driven: documented test plans and procedures.

PERSONNEL
Customers. Agile: dedicated, collocated CRACK* performers. Plan-driven: CRACK* performers, not always collocated.
Developers. Agile: at least 30% full-time Cockburn Level 2 and 3 experts; no Level 1B or -1 personnel**. Plan-driven: 50% Cockburn Level 3s early; 10% throughout; 30% Level 1Bs workable; no Level -1s**.
Culture. Agile: comfort and empowerment via many degrees of freedom (thriving on chaos). Plan-driven: comfort and empowerment via framework of policies and procedures (thriving on order).

* Collaborative, Representative, Authorized, Committed, Knowledgeable
** See Table 8.4. These numbers will particularly vary with the complexity of the application.

8.4.2 Pikkarainen and Huomo's Agile Assessment Framework

According to [16], agility means the ability of companies to respond to change and to balance organizational flexibility and stability. The assessment framework proposed by Pikkarainen and Huomo [16] is based on a broad understanding of the different practices, methods and tools that have been proven to increase agility in software development companies. The


Figure 8.3 Dimensions affecting method selection [15].

authors state that “The purpose of agile assessment is thus to integrate the software process assessment methods and the knowledge of agile practices together”. The aim of this framework is to provide answers to the following questions:

1. How can the agility of software product development be evaluated?

2. How can agile practices, methods and tools be tailored to fit the needs of the projects?

3. How can agile practices, methods and tools be tailored to fit the needs of the organization?

The first version of the assessment focuses on the requirements and project management process areas. The main purpose here is to evaluate a selected group of the company's software development projects through the following aspects:


1. CMMI4 process descriptions, which give a basic structure for analysis.

2. Agile principles, through which the processes, methods and techniques are analyzed.

3. Agile practices, methods and tools.

4. Boehm and Turner's agile dimensions [14, 15].

Thus, the authors define the goals of their agile assessment as:

• Analyze the agility of the evaluated project (e.g., using the Boehm and Turner agility dimensions).

• Analyze both the agile and the plan-driven practices, methods and tools that are currently used in an organization or in a project.

• Discover and evaluate the most suitable agile practices for the development needs in the organizations.

• Evaluate the efficiency of the agile practices, methods and tools in use.

• Support the deployment of the agile practices, methods and tools at the organizational level.

The agile assessment process used by [16] is depicted in Figure 8.4. Its main steps are focus definition, agility evaluation, data collection planning, interviews, analysis, and workshops and learning phases. Agility evaluation is divided into four phases. It starts with the context factor analysis, using Boehm and Turner's model. The next phase is aimed at defining the agile principles, methods, practices and tools used in the evaluated projects. Phases three and four are alternatives whose selection depends on the main purpose: whether we want to find the best agile practices in plan-driven software development or we want to evaluate the efficiency of agile projects. The

4 CMMI (Capability Maturity Model Integration) is a process improvement approach that provides organizations with the essential elements of effective processes, which will improve their performance. http://www.sei.cmu.edu/cmmi/


Figure 8.4 An Agile assessment framework [16].

data collection planning is explained in Figure 8.5. We consider this last phase the most important one for assessing agility in SMEs, because it could be used to define a data structure capable of assessing agility in three different dimensions: the company (administration), the projects and the work teams.

Figure 8.5 Data collection planning [16].


After the agile assessment, the agile practices and improvements are prioritized and further analyzed in the company's internal meetings. The results should be validated by testing them in pilot projects and comparing the final outcomes.

8.5 WEAKNESSES AND STRENGTHS OF AGILE METHODOLOGIES

Petersen and Wohlin [17] present a case study on software development with agile and iterative models. The main contribution of the study is to help managers with the decision of adopting agile methods, showing the problems that have to be addressed as well as the merits that can be gained with ASDMs. The issues and advantages they formulate are listed as follows:

• Small projects make it possible to implement and release requirements iterations faster, which leads to a reduction of requirements volatility in projects.

• The waste of unused work (documented requirements, implemented components, etc.) is reduced through small iterations.

• Requirements in iterations are precise, and estimates are accurate due to the small scope.

• Small teams with people having different roles require only small amounts of documentation, because it is replaced with direct communication, facilitating learning and understanding for everyone in the team.

• Frequent integration and deliveries to subsystem tests allow the team to receive early and frequent feedback on its work.

• Rework caused by faults is reduced, as testing priorities are made clearer due to prioritized features, and because testers and designers work closely together.

• Testers' time is used more efficiently, because in small teams testing and design can be easily parallelized due to short lines of communication between designers/developers and testers (instant testing).

• Testing the latest product release makes problems and successes transparent (testing and integration per iteration) and thus generates high incentives for designers/developers to deliver high quality.

There are also other issues that become barriers to agile processes. Boehm and Turner [14] stated that, although ASDMs are a good solution for some old problems, they also have some characteristics that make the process vulnerable to failure. For example, one of the biggest conflicts when using ASDMs lies in estimation, resource loading and slack calculations, due to the level of uncertainty and ambiguity that exists in an iterative process; in long-term estimates, it is higher with agile approaches. Following [14], we state that the most important barriers to agile adoption in companies are:

• Cost estimation: One of the main problems with ASDMs is cost estimation, because there are no long-term estimates.

• Business process conflicts: An often overlooked difference between agile and traditional engineering processes is the way everyday business is conducted. Estimation, resource loading, and slack calculations can vary significantly.

• Human resources policies and processes: Agile development team members often cross the boundaries of standard development position descriptions and might require significantly more skills and experience to adequately perform them.

• Process standards assessed: Most agile methods do not support the degree of documentation and infrastructure required for lower-level certification; imposing it could, in fact, make agile methods less effective.

• Resistance to change: The paradigm change is aimed at empowering individuals by supporting new processes, but many people are reluctant to change because of the fear of failure in a project.


• Customer empowerment: Agile methods require close relationships between the customer and the development team.

8.6 CHALLENGES ADOPTING AGILE METHODOLOGIES IN SMES

The primary challenge in adopting agile practices in large organizations is the integration of agile projects with the existing processes [18]. Adopting an ASDM in a company stems from different kinds of reasons, such as increasing customer satisfaction, improving the teams' performance, or producing software faster. Therefore, before explaining these challenges it is necessary to explain the role of the company in the ASDM implementation, considering the following topics proposed by [19]:

• The development team's evolution inside the methodology: It is common to find some kinds of concerns in the work team when it faces a change. The organization must understand this and take actions such as training, communication, and continued support, managing the change in the correct way. With this in mind, it is important to be aware of the time it takes to implement the new ASDM.

• Business addressing: It is important to keep the top-role people involved in the agile process, making them aware that continuity is required for success in the ASDM implementation. The organization has the duty of leading the team in search of incremental advance.

• Ensuring project success: To achieve success in the implementation of an ASDM, it is necessary for both managers and stakeholders to be involved in the process. Both parties should understand the project requirements and try to use a common language to cover them.

• Managing the company's expectations: The company can obtain a return on investment by making different partial deliveries to the client. To achieve this, it is necessary to prioritize the most important user stories, giving greater value to each one and minimizing the risk.


A group of students at Carleton University, Ottawa, Canada, identified the major changes needed to implement agile software development practices (Figure 8.6) and the main challenges that companies using traditional methodologies have to address, from a management point of view. Understanding these challenges and having a strategy to overcome them can make the adoption of agile methodologies easier for organizations. In software projects, people and processes are important. People can be trained and processes can be built and improved. Nowadays, the use of agile software development methodologies is rising among software professionals. Engineers' experience in agile software construction reveals the changes and challenges involved in agile software projects. SMEs widely use the conventional way of developing software, where the requirements are fixed, with written documentation and a unique delivery. However, following an agile software development methodology could allow software teams to develop quickly and react to changes presented throughout projects.

Figure 8.6 Changes required for adopting Agile methodologies in traditional organizations, and associated challenges/risks [20].


8.7 SUMMARY

The aim of this chapter was mainly to show the challenges that small companies could face in adopting agile methodologies. It is necessary to understand that the growth of the software market is the best opportunity for these companies to get involved with agile methodologies. Nowadays, SMEs are spreading across the software development market, offering all kinds of products and solutions. For example, Colombia grew 48% in this field between 2000 and 2004 [9]. However, these kinds of companies cannot use conventional methodologies such as UP, RUP or CMMI, because they do not have clearly defined hierarchical structures, their development teams are small, and plan-driven methods require a lot of bureaucracy. The use of agile methods has contributed to finding practices with better results than the ones achieved with plan-driven methods for all kinds of projects. Therefore, it is important to find a way of assessing and assuring the success of being agile.

The agile culture is gaining popularity, becoming one of the most frequently mentioned topics in software development. This can be seen in the number of people who are adopting agile methods, and the companies that are beginning to use them in their processes. Thus, it is necessary to consider all the challenges associated with the implementation of new processes in a company; for example, the issues related to integrating agile projects with an existing company environment cannot be avoided. Additionally, the profiles of the company and the project need to be assessed in order to identify the level of agility. According to that level, strategies for adopting agile methods should be designed so as to minimize the negative impact on the productivity of the team.


REFERENCES

[1] Acs, Z. J. & Preston, L. Small and Medium-Sized Enterprises, Technology, and Globalization: Introduction to a Special Issue on Small and Medium-Sized Enterprises in the Global Economy. Small Bus. Econ., 9: 1997; 1–6.

[2] Schroder, H. H. & Kraaijenbrink, J. In: Knowledge Integration – The Practice of Knowledge Management in Small and Medium Enterprises. Physica-Verlag HD: 2006.

[3] Consulting, M. Manpower, Education and Training Study of the Irish Software Sector. Report submitted to the Software Training Advisory Committee and FAS: 1998.

[4] Richardson, I. & Avram, G. Having a Foot on Each Shore, Bridging Global Software Development in the Case of SMEs. 2008 IEEE International Conference on Global Software Engineering: 2008.

[5] Ireland, I. Key Industry Statistics: 2005.

[6] Dossani, R. Origins and Growth of the Software Industry in India. Asia-Pacific Research Center - Stanford University.

[7] Proexport. Sector de software y servicios TI. http://www.slideshare.net/inviertaencolombia/sector-servicios-de- tiproexport?src=related_normal&rel=1187105: 2010. Accessed and verified on May 3, 2011.

[8] Ministerio de Comercio, Industria y Turismo. Programa MIDAS – Desarrollando el Sector de TI Como uno de Clase Mundial. Technical report: 2008.

[9] Rodríguez, A. G. La realidad de la Pyme colombiana: Desafío para el desarrollo. FUNDES Internacional: 2003.

[10] Sánchez, J. J. & Osorio, J. Algunas Aproximaciones al Problema de Financiamiento de las Pymes en Colombia. Scientia et Technica, Año XIII: 2007; 321–324.


[11] Ebrahim, N. A., Ahmed, S. & Taha, Z. Virtual R&D teams in small and medium enterprises: A literature review. Sci. Res. Essay: 2009; 1575–1590.

[12] Schatz, C. A Methodology for Production Development. Doctoral thesis, Norwegian University of Science and Technology: 2006.

[13] VersionOne. 5th Annual State of Agile Development Survey: Final summary report: 2010.

[14] Boehm, B. & Turner, R. Balancing Agility and Discipline - A Guide for the Perplexed. Addison-Wesley: 2004.

[15] Boehm, B. & Turner, R. Observations on Balancing Discipline and Agility: 2004.

[16] Pikkarainen, M. & Huomo, T. Agile Assessment Framework. Information Technology for European Advancement: 2005; 1–44.

[17] Petersen, K. & Wohlin, C. A Comparison of Issues and Advantages in Agile and Incremental Development between State of the Art and an Industrial Case. Journal of Systems and Software: 2009; 1–14.

[18] Lindvall, M., Muthig, D., Dagnino, A., Wallin, C., Stupperich, M., Kiefer, D., May, J. & Kahkonen, T. Agile software development in large organizations. Computer, 37(12): 2004; 26–34. ISSN 0018-9162. doi:10.1109/MC.2004.231.

[19] Mahanti, A. Challenges in Enterprise Adoption of Agile Methods. Journal of Computing and Information Technology, 3: 2006; 197–206.

[20] Misra, S. C., Kumar, U., Kumar, V. & Grant, G. The Organizational Changes Required and the Challenges Involved in Adopting Agile Methodologies in Traditional Software Development Organizations. IEEE Computer Society: 2006.

Model-driven Development and Model-driven Testing

Henry Roberto Umaña-Acosta
Miguel Cubides

ABSTRACT

Software modeling is useful in the specification and comprehension of software requirements. A UML specification, for example, helps developers to understand the intended solution through static and dynamic designs. In this chapter we explain several software modeling approaches in software development and testing.

9.1 INTRODUCTION

ColSWE, the research group in Software Engineering at Universidad Nacional de Colombia, has two main research areas in software modeling: development modeling and test modeling. This has been motivated by the need to support the comprehension of software. In the software development area, we explain the advantages and disadvantages of modeling, and in the software testing area, we focus on writing and generating tests automatically from models. The latter technique belongs to the wider area of Model-based Testing. When developers build models, they spend much time and effort, but the productivity of development is increased. This area is promising both in academic research and in industry.


9.1.1 Challenges of Software Development

Software development has, amongst others, two identified difficulties: maintenance and evolution, and the comprehension and communication of the different artifacts that compose it. Given the intrinsically evolvable nature of software, over time it has to adapt to its environment, including solutions for new requirements and/or modifications of previous requirements. This requires modifications in different aspects that have to be studied and implemented: studied at the level of the impact that a modification would have, both directly and indirectly, on the system; and implemented by making all the modifications that the impact analysis determines need to be executed.

On the other hand, achieving the development of optimal-quality software requires that the development team fully comprehends each one of the specifications of the different artifacts produced in the stages previous to that point. For example, in the development phase, the specifications produced in the requirements and design phases should be perfectly understood. This produces certain difficulties; for example, the fact that in the different phases there are also different types of people who work with distinct levels of expertise and experience, and even with various intellectual and cultural focuses.

9.1.2 Traditional Testing Process

In the life cycle of a software project, sooner or later the team needs to deal with tests. Formally, someone, maybe the analyst or the tester, designs the Test Cases based on the Use Cases or on the functional requirements. Then, the tester executes, step by step, each Test Case, compares the result with the expected output, and determines whether the Test Case was successful or not.
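The manual process just described can be pictured as a simple loop that runs each Test Case and compares the actual output against the expected one. The following Python sketch uses an invented toy system and invented test cases; it only mirrors the workflow, not any specific tool.

```python
# A minimal sketch of the traditional testing process: run each test
# case against the system and compare actual vs. expected output.
# The `doubler` system and the test cases are hypothetical examples.

def run_test_cases(system, test_cases):
    """Execute each test case and record a pass/fail verdict."""
    results = {}
    for name, (test_input, expected) in test_cases.items():
        actual = system(test_input)
        results[name] = "pass" if actual == expected else "fail"
    return results

# A toy system under test: it should double its input.
def doubler(x):
    return x * 2

test_cases = {
    "TC-01 doubles a positive number": (3, 6),
    "TC-02 doubles zero": (0, 0),
}

print(run_test_cases(doubler, test_cases))
```

In practice the "system" is exercised through its user interface or API and the comparison is done by a human tester, which is precisely the cost that model-based approaches try to reduce.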

9.1.3 A Solution to Face the Problem

Popular knowledge wisely indicates that “a picture is worth a thousand words”; that explains why, for the comprehension, development and communication of the different software artifacts, a specification that allows guiding them graphically has been developed, using different tools specialized in modeling to achieve this objective. We have, then, diagrams that represent the software and also what it should do, that support documentation and facilitate communication between parties with different focuses and knowledge. Also, these diagrams offer a perspective of the software from two different focuses: a static vision and a dynamic one.

9.1.4 Model-based Testing

On the other hand, Model-based Testing (MBT) starts by defining the SUT (System Under Test), or the parts of the system that need to be tested. Then, the team models this SUT with one of several approaches, and generates the Test Cases from this model. The Test Cases generated are not executable: they need to be transformed into a test script in some language in order to be executed. Finally, the tester analyzes the result of each test, which can be fail or pass, and reports it to the development team.

This chapter is an introductory description of the good use of Model-driven Development (MDD) during software development. It initially treats the reasons for using MDD in agile methodologies (modeling a specification does not slow software development), and continues with a description of model-driven development, where its philosophy is explained. Following this, the UML unified language specification is described, which allows us to talk about Model-driven Architecture (MDA), its capacities and potential, the use of MDA in the software's analysis and design phases, and a review of the development facilities that some CASE tools offer. Finally, some common errors into which people fall when it is assumed that MDA will act by itself, improving the quality of the developed software, are described, and some conclusions of the work are specified.

In this chapter we will also revise two works in the area of Model-based Testing, one of the research areas of ColSWE. These works were guided by the structure and explanation given by Mark Utting and Bruno Legeard [1]. We will cover two techniques of MBT: one based on Finite State Machines (FSM) and the other one based on Pre/Post notations, specifically on OCL. In both cases, we use and evaluate the tools ModelJUnit for FSM and Qtronic (QML) from Conformiq for the Pre/Post notation.
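The MBT flow described above — model the SUT, generate abstract Test Cases, transform them into executable scripts, execute, and report a pass/fail verdict — can be sketched end to end. Everything below (the toy on/off model, the SUT, the generator) is invented for illustration and is unrelated to ModelJUnit or Qtronic.

```python
# An illustrative end-to-end MBT pipeline with a toy on/off switch.

# 1. A tiny behavioral model: valid transitions of the switch.
MODEL = {("off", "press"): "on", ("on", "press"): "off"}

def generate_abstract_tests(model, start="off", length=2):
    """Derive abstract test cases: event sequences with expected states."""
    state, steps = start, []
    for _ in range(length):
        state = model[(state, "press")]
        steps.append(("press", state))
    return [steps]  # one abstract test case

def to_executable(abstract_test, sut):
    """2. Transform an abstract test into an executable script (a closure)."""
    def script():
        for event, expected_state in abstract_test:
            sut.handle(event)
            if sut.state != expected_state:
                return "fail"
        return "pass"
    return script

class ToySwitch:
    """3. The SUT: a hand-written implementation to test against the model."""
    def __init__(self):
        self.state = "off"
    def handle(self, event):
        if event == "press":
            self.state = "on" if self.state == "off" else "off"

# 4. Execute the scripts and collect the verdicts for the team.
verdicts = [to_executable(t, ToySwitch())() for t in generate_abstract_tests(MODEL)]
print(verdicts)  # ['pass']
```

A defect in `ToySwitch` (e.g., ignoring the second press) would surface as a "fail" verdict, which is exactly the signal the tester reports back to the development team.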

9.2 WHY USE MODEL-DRIVEN ARCHITECTURE IN AGILE METHODOLOGIES

According to Jordi Cabot [2], there are two points to consider: Can agile methodologies benefit from modeling? And can modeling benefit from agile methodologies? To address the first question, it should be considered that agile methodologies, in spite of an emergent design, maintain the use of modeling, supporting it in different specifications related to the development team's capacity and experience. There is the case, for example, of the agile modeling proposed by Scott Ambler [2, 3], where he suggests the use of model storming and iteration modeling as needed, keeping a light but representative model of the system. This suggested model goes from abstract representations of the system to representations of the tests to be performed on the product. The second question is addressed from a point of view based on experience and offered as a research theme, reaching the proposal of establishing a methodology or specification that brings to the modeling process the characteristics that are common in agile methodologies.

Based on this, it can be observed that there is, firstly, the possibility of developing a light model to support development following an agile methodology, and secondly, the possibility of initiating a line of research that allows implementing new characteristics in the existing modeling methodologies, which are oriented to classic development methodologies.

9.3 MODEL-DRIVEN DEVELOPMENT

9.3.1 Philosophy

A model is a generalized representation of the system. This is achieved by the use of specific and specialized diagrams for each of the different parts of the product.


The use of MDD is suggested for three reasons:

Documentation and Comprehension of the System

Due to the simplicity and generality of a model, it is easier to get the general idea from a diagram than from a text. That is why such documentation is a great support (even though a diagram does not replace textual documentation), helping the reviewers to comprehend the system's characteristics in a better way.

Internal Communication and Client Communication

Once the documentation is complete, it acts as a support for communication with the client, helping to acquire the initial requirements and to represent what, as a development team, has been understood regarding what the application should do.

On the other hand, communication between different internal divisions of the software team gets easier too, as a diagram is designed in a universal language, and no matter their specialty, the team members will understand it easily.

Automatic Code Generation and Evolution-related Facilities

Thanks to Computer-Aided Software Engineering (CASE) tools, code can be generated automatically from a model, which reduces the dedication time of the different resources that intervene in the application's development phase. With these tools, reverse engineering can also be performed on the code of an existing application, generating diagrams that model the system and allowing the generation of its own documentation.

Another characteristic that helps optimize model-oriented development is the localization of the sectors of the system affected by an applied change. This is seen in the evolution process, where the modification of a requirement or the arrival of a new one changes the product in such a way that it is necessary to dimension


this change by making an impact analysis (despite the fact that low-coupling development is suggested, there are some cases in which the impact is very high). There is also an advantage given by the union of impact analysis and self-generating code, through which the coding required by evolution can be automated.

9.3.2 Unified Modeling Language

The OMG1 group worked on the specification of a language that permitted creating a model of any system that could be understood universally; for this goal, the Unified Modeling Language (UML) was created, which in its 2.0 version specifies more than 10 diagrams, divided between dynamic and static [3]. This specification includes use case, class, package, sequence, communication, state, activity, component, deployment and object diagrams (amongst others). With this group of diagrams a system can be represented from a general perspective down to the required level of granularity.

Through use case diagrams the specification of system requirements can be represented. With the class, package, state, component, and deployment diagrams the system is represented in a static way, while in sequence, communication, activity and object diagrams the system is dynamically represented. To be strict, in this context, dynamic is taken as the way in which the system behaves given certain characteristics; that is so because, for example, the object diagram is a representation of the system in a specific state.

One of the great advantages of UML is its capacity to be extended through profiles, as Lidia Fuentes shows in her research [4], by which a particularized specification can be generated, tailored to a specific development need or to the company's development policies.

1 Acronym for Object Management Group; more information at http://www.omg.org/


9.4 MODEL-DRIVEN ARCHITECTURE

Because UML was developed to standardize systems modeling through diagrams, a methodology derived from MDD that implements UML was born naturally; it is known as Model-driven Architecture (MDA) and was developed by the OMG group. The difference between the two methodologies is that MDD is a generalized methodology and MDA is the one that uses UML. Regarding tools for automatic code generation from models, the more evolved of these two methodologies is the one that uses the UML standard, i.e., MDA, as Nikiforova shows in [5].

MDA proposes the creation of a metamodel and a model [6–10]. The metamodel is a model that is language-independent and gives an initial holistic vision of the system, which brings up a first approximation for communication with the stakeholders, while the model is a specification directed to a particular programming language, which makes it possible to automate code generation tasks, using UML profiles for greater accuracy and higher independence of the activities.
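The code-generation idea can be illustrated with a deliberately tiny example: a language-independent class description (here just a dictionary, standing in for a UML model) is transformed into source code for one concrete target. This is only a conceptual sketch with invented names, not how MDA tools actually represent models or profiles.

```python
# Illustrative sketch of model-to-code transformation: a platform-
# independent class description is turned into Python source.
# The `model` dictionary is a stand-in for a real UML class model.

model = {
    "class": "Player",
    "attributes": [("name", "str"), ("score", "int")],
}

def generate_python_class(model):
    """Transform the platform-independent model into Python source code."""
    lines = [f"class {model['class']}:"]
    params = ", ".join(f"{n}: {t}" for n, t in model["attributes"])
    lines.append(f"    def __init__(self, {params}):")
    for name, _ in model["attributes"]:
        lines.append(f"        self.{name} = {name}")
    return "\n".join(lines)

print(generate_python_class(model))
```

A second generator targeting another language could consume the same `model`, which is the point of separating the platform-independent description from the platform-specific output.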

9.4.1 Software Engineering

Software creation implies an engineering process that, depending on the implemented methodology, will embrace different phases. In agile methodologies, despite being managed in an iterative way, planning or analysis, architecture and design, development, testing and revision, and implementation phases are contemplated. It is important to point out that, depending on the specific methodology, these phases have greater or lower emphasis; for example, XP contemplates an emergent design [11].

As has been explained, software development will be strengthened and facilitated by the implementation of MDD. Especially by carrying out complete engineering focused on the analysis and design phases, it will be possible to comprehend, in advance and in a better way, the product to be created.


The Analysis and Design Phases

Both of these phases are focused on the comprehension of the application to be developed, creating different artifacts to capture the representation of the different characteristics of the system: what the product has to do and its limits, how it should be developed, and which development characteristics it will have. It is in the design phase where models with different purposes are created to analyze how the product will be created, what modules it will contain, and in which manner the modules will communicate among themselves. It is in this phase where the implementation of representative system models will be strongly used and where the UML standard is suggested, for obtaining greater facilities in implementation and development (for example, automatic code generation).

MDA for Software Engineering

Since the UML standard provides different diagrams that can represent a system from different focuses [3, 12], it can be used during design to obtain a more complete abstraction easily, which will make it possible to comprehend, in earlier production phases, what the system has to do and how to do it. A system model will be more complete the more perspectives it contemplates. For example, a system should be analyzed from the user's point of view (what it should do) as well as from the performance point of view (how it should do it). For this purpose, use case and state diagrams can be used to see what it must do, and object, activity and sequence diagrams to see how it must do it. The first ones allow seeing the system from a static point of view, while the second ones represent a dynamic view.

9.5 WHAT THE MODEL-DRIVEN ARCHITECTURE DOES NOT DO

A typical error made by software development companies that introduce themselves into model-oriented design, as Bell explains [13], is to believe that creating an initial model of the system will optimize their development practices by itself. It should be clearly understood that model-driven design is just a tool that, well used, permits obtaining great advantages in software development.

9.6 NOTATIONS FOR MODELING TESTS

In order to build the model, Utting and Legeard [1] give some recommendations:

• Choose only the classes related to the SUT (system under test)

• Include only the methods to be tested

• Include only the class attributes needed to reflect the behavior of the methods

Within this scope, you may choose among several notations for your model. We explain some of those studied in our research.

9.6.1 Transition-based Modeling

We model the behavior of the system as transitions between several states triggered by events. Usually the model is represented by a Finite State Machine (Figure 9.1).
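As a concrete illustration of this style, a transition-based model can be written directly as a transition table. The sketch below is our own minimal example: the vending-machine states and events are invented for the illustration and are not part of the chapter's case study.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal transition-based model: states, events, and a transition table.
class VendingModel {
    enum State { IDLE, PAID, DISPENSED }
    enum Event { INSERT_COIN, PRESS_BUTTON, TAKE_ITEM }

    private State state = State.IDLE;
    private final Map<State, Map<Event, State>> transitions = new HashMap<>();

    VendingModel() {
        transitions.put(State.IDLE, Map.of(Event.INSERT_COIN, State.PAID));
        transitions.put(State.PAID, Map.of(Event.PRESS_BUTTON, State.DISPENSED));
        transitions.put(State.DISPENSED, Map.of(Event.TAKE_ITEM, State.IDLE));
    }

    // Fires an event; returns false (and stays in the same state) if the event
    // is not enabled in the current state.
    boolean fire(Event e) {
        State next = transitions.getOrDefault(state, Map.of()).get(e);
        if (next == null) return false;
        state = next;
        return true;
    }

    State getState() { return state; }
}
```

Each (state, event) pair maps to a successor state; test generation then amounts to choosing sequences of enabled events, which is exactly what ModelJUnit automates later in this chapter.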

9.6.2 Pre/Post Modeling

Another kind of system representation is through a set of variables with their respective values at a specific point in time. We model that in the specification of the use cases with preconditions and post-conditions. We can also be more formal and write this specification in the Z language, Spec#, or OCL (Object Constraint Language). Preconditions specify the conditions that must be true before an operation is executed. As an example [14], consider these instructions in OCL:

context Player::calculateFinalScore() : Integer
pre: self.isComplete = true

Figure 9.1 Representing states and state transitions using a state diagram.

The precondition related to the operation "calculateFinalScore()" states that the player has completed the game. Post-conditions specify the conditions that must be true after the operation has been executed. From the same authors, another specification in OCL:

context GameEvent::processPlayerChoices() : Integer
post: result = 0

According to this example, the post-condition of the operation "processPlayerChoices()" states that the player has no choices left to play.
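Such contracts can be approximated in ordinary code with explicit runtime checks. The Java sketch below is our own illustration, loosely following the OCL precondition above; the Player fields and the score-summing logic are assumptions made for the example.

```java
// Design-by-contract style checks mirroring an OCL pre/post specification.
class Player {
    boolean isComplete;   // OCL: self.isComplete
    int[] roundScores;    // assumed field, invented for the illustration

    Player(boolean isComplete, int[] roundScores) {
        this.isComplete = isComplete;
        this.roundScores = roundScores;
    }

    int calculateFinalScore() {
        // Precondition (OCL: pre: self.isComplete = true)
        if (!isComplete)
            throw new IllegalStateException("precondition violated: game not complete");
        int total = 0;
        for (int s : roundScores) total += s;
        // Postcondition (assumed for the example): the result is non-negative
        if (total < 0)
            throw new IllegalStateException("postcondition violated: negative score");
        return total;
    }
}
```

The precondition check guards entry to the operation, while the postcondition check validates its result before returning, just as the OCL pre: and post: clauses do declaratively.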

9.7 TESTING FROM FINITE STATE MACHINES

The next sections show the work developed by Miguel Cubides, researcher of ColSWE, presented as a requirement for the undergraduate degree in Systems Engineering in 2009 [15].

9.7.1 FSM and ModelJUnit

ModelJUnit [16] is a library that extends the JUnit functionality. Its implementation uses a Finite State Machine as the model. In addition, ModelJUnit gives us several figures about the testing process, as presented by Utting [16].

MODEL-DRIVEN DEVELOPMENT AND MODEL-DRIVEN TESTING

9.7.2 Case Study

We developed a small prototype in order to show the functionality of ModelJUnit. The case chosen is a very basic ATM with two operations, credit and debit, and few restrictions. We start by modeling the system with a finite state machine (Figure 9.2).

Figure 9.2 State diagram for the case study [15].

Next, we code this model using the classes offered by ModelJUnit:

public class PruebasModel implements FsmModel {

    public enum Estados {VERIFICAR_DATOS, MOVIMIENTO, CANCELAR, TERMINAR};

    private Estados estado;
    private ControlCajeroTest test;

    public PruebasModel() {
        test = new ControlCajeroTest();
        estado = Estados.VERIFICAR_DATOS;
    }

    public String getState() {
        return String.valueOf(estado);
    }

    public void reset(boolean testing) {
        test.reset();
        estado = Estados.VERIFICAR_DATOS;
    }

    public boolean verificarDatosCorrectosGuard() {
        return estado == Estados.VERIFICAR_DATOS;
    }

    @Action
    public void verificarDatosCorrectos() throws Exception {
        test.testVerificarDatosCorrectos();
        estado = Estados.MOVIMIENTO;
    }

    public boolean verificarDatosIncorrectosGuard() {
        return estado == Estados.VERIFICAR_DATOS;
    }

    @Action
    public void verificarDatosIncorrectos() throws Exception {
        test.testVerificarDatosIncorrectos();
        estado = Estados.VERIFICAR_DATOS;
    }

    public boolean debitarGuard() {
        return estado == Estados.MOVIMIENTO;
    }

    @Action
    public void debitar() throws Exception {
        test.testDebitar();
        estado = Estados.MOVIMIENTO;
    }

    public boolean cancelarGuard() {
        return estado == Estados.MOVIMIENTO;
    }

    @Action
    public void cancelar() throws Exception {
        test.testCancelar();
        estado = Estados.MOVIMIENTO;
    }

    public boolean salirGuard() {
        return estado == Estados.MOVIMIENTO;
    }

    @Action
    public void salir() throws Exception {
        test.testCancelar();
        estado = Estados.MOVIMIENTO;
    }
}


ModelJUnit has a graphical tool that shows us the FSM embedded in the code. Figure 9.3 depicts the FSM of the code for the case study.

Figure 9.3 State diagram for case study [15].

The tool also reports some figures about the tests:

• Number of tests: 10

  – State coverage = 2/3
  – Transition coverage = 4/6
  – Transition pair coverage = 6/16
  – Action coverage = 4/6

The main advantage of using the tool is obtaining the possible sequences of events in the system through the implementation of the Finite State Machine. In other words, we obtained the test cases by running the code.
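The idea behind these figures can be reproduced with a self-contained sketch: a random walk over a transition table yields event sequences (the test cases), and the coverage figures are simply the fraction of states and transitions the walk visits. This is our own simplified illustration, not the actual ModelJUnit implementation; the three-state machine is invented for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Random-walk test generation over a tiny FSM, with state/transition coverage.
class FsmWalker {
    // TRANSITIONS[t] = {fromState, toState}; each index t is one event/action.
    static final int[][] TRANSITIONS = {{0, 1}, {0, 0}, {1, 1}, {1, 2}};
    static final int NUM_STATES = 3;

    // Returns {stateCoverage, transitionCoverage} after a walk of `steps` events.
    static double[] walk(int steps, long seed) {
        Random rnd = new Random(seed);
        Set<Integer> statesSeen = new HashSet<>();
        Set<Integer> transSeen = new HashSet<>();
        int state = 0;
        statesSeen.add(state);
        for (int i = 0; i < steps; i++) {
            List<Integer> enabled = new ArrayList<>(); // transitions leaving `state`
            for (int t = 0; t < TRANSITIONS.length; t++)
                if (TRANSITIONS[t][0] == state) enabled.add(t);
            if (enabled.isEmpty()) break;              // no outgoing transitions
            int t = enabled.get(rnd.nextInt(enabled.size()));
            transSeen.add(t);
            state = TRANSITIONS[t][1];
            statesSeen.add(state);
        }
        return new double[] {
            statesSeen.size() / (double) NUM_STATES,
            transSeen.size() / (double) TRANSITIONS.length
        };
    }
}
```

The recorded sequence of chosen transitions is exactly a generated test case; partial figures (such as the 2/3 and 4/6 above) appear whenever some states or transitions remain unvisited by the walk.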

9.8 TESTING FROM PRE/POST MODELS

The next sections show the work developed by Luis Alberto Bonilla [17], researcher of ColSWE, presented as a requirement for the undergraduate degree in Systems Engineering in 2010.

9.8.1 Object Constraint Language (OCL)

The OCL notation is used to make modeling with UML diagrams more precise. For example, OCL allows us to specify preconditions and post-conditions of the operations in a class. Each expression in OCL has a context, which usually is the class or method it belongs to. Table 9.1 shows some important constructs of OCL.


Table 9.1 Main OCL constructs [1].

    context class inv: predicate
    context class def: name: type = expr
    context class::attribute init: expr
    context class::method pre: predicate
    context class::method post: predicate
    context class::method body: expr
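As an illustration of how these constructs read together, consider the following OCL fragment; the Account class, its attribute, and its operation are hypothetical, invented for the example:

```
context Account
  inv: balance >= 0

context Account::balance
  init: 0

context Account::deposit(amount: Integer)
  pre:  amount > 0
  post: balance = balance@pre + amount
```

Here `@pre` refers to the attribute's value before the operation executed, which is how a post-condition relates the new state to the old one.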

9.8.2 Case Study

The system to be modeled is the triangle classifier from Myers's book [18]:

The program reads three integer values from an input dialog. The three values represent the lengths of the sides of a triangle. The program displays a message that states whether the triangle is scalene, isosceles, or equilateral. The context is the Triangle with its three values (the lengths of the sides) and a text message with the classification as output.

context Triangle::kindOfTriangle(a: int, b: int, c: int) : String

The precondition states that the integers must be positive:

pre: a>0 and b>0 and c>0

The post-conditions validate that the three integers form a triangle, that is, a + b > c for all the possible combinations of sides. They also contain the rules to classify the triangle as equilateral, isosceles, or scalene:

post: if (a+b <= c or a+c <= b or b+c <= a) then
        result = 'notriangle'
      else
        if (a=b or b=c or a=c) then
          if (a=b and b=c) then
            result = 'equilateral'
          else
            result = 'isosceles'
          endif
        else
          result = 'scalene'
        endif
      endif
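A direct Java translation of this pre/post specification can make it concrete. This is our own sketch; Bonilla's work generates test cases from the specification rather than implementing the operation.

```java
// Triangle classifier implementing the OCL pre/post specification above.
class Triangle {
    static String kindOfTriangle(int a, int b, int c) {
        // pre: a > 0 and b > 0 and c > 0
        if (a <= 0 || b <= 0 || c <= 0)
            throw new IllegalArgumentException("badside");
        // post: the triangle inequality must hold for every combination of sides
        if (a + b <= c || a + c <= b || b + c <= a)
            return "notriangle";
        if (a == b || b == c || a == c)
            return (a == b && b == c) ? "equilateral" : "isosceles";
        return "scalene";
    }
}
```

The test cases of Table 9.2 can be read as inputs and expected outputs of this method, with the badside rows exercising the precondition.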


In this assessment of OCL as a modeling notation for generating test cases, Bonilla uses QML from Qtronic [19], an OCL-like language that allows generating the test cases from the specification, as shown in Table 9.2.

Table 9.2 Test cases generated. Bonilla [17].

    Test   In            Out
    1      (−1, 0, 0)    badside
    2      (0, 0, 0)     badside
    3      (1, −1, 0)    badside
    4      (2, 0, 0)     badside
    5      (1, 1, −1)    badside
    6      (1, 2, 0)     badside
    7      (1, 1, 2)     notriangle
    8      (1, 1, 9)     notriangle
    9      (1, 2, 1)     notriangle
    10     (1, 9, 1)     notriangle
    11     (2, 1, 1)     notriangle
    12     (9, 1, 1)     notriangle
    13     (1, 1, 1)     equilateral
    14     (5, 5, 9)     isosceles
    15     (9, 1, 9)     isosceles
    16     (9, 9, 1)     isosceles
    17     (1, 9, 9)     isosceles
    18     (3, 9, 7)     scalene
    19     (9, 3, 7)     scalene

The main advantage of this technique is the coverage. In fact, the algorithm behind the tool is based on decision coverage, also known as branch coverage.
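Decision coverage can be measured with very simple instrumentation: record, for each decision, whether its true and false outcomes were both exercised. The sketch below is our own illustration of the metric, not Qtronic's algorithm; the decision identifiers are arbitrary labels.

```java
import java.util.HashMap;
import java.util.Map;

// Tracks, per decision id, whether the true and false outcomes were exercised.
class BranchTracker {
    private final Map<String, boolean[]> outcomes = new HashMap<>();

    // Records the outcome of decision `id` and passes the value through,
    // so it can wrap a condition in place: if (tracker.decide("d1", x > 0)) ...
    boolean decide(String id, boolean value) {
        boolean[] seen = outcomes.computeIfAbsent(id, k -> new boolean[2]);
        seen[value ? 1 : 0] = true;
        return value;
    }

    // Fraction of decision outcomes (true and false per decision) exercised so far.
    double coverage() {
        int hit = 0, total = 0;
        for (boolean[] seen : outcomes.values()) {
            total += 2;
            if (seen[0]) hit++;
            if (seen[1]) hit++;
        }
        return total == 0 ? 0.0 : hit / (double) total;
    }
}
```

A decision-coverage-based generator picks inputs precisely to close the remaining gaps in this metric, which is why the 19 cases of Table 9.2 systematically exercise every branch of the classifier.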

9.9 TRENDS AND CHALLENGES

Right now, we are conducting our research in another area related to MBT: how can we derive test cases automatically from models that represent GUIs? In the medium term, we plan to tackle another challenge: making the abstract tests generated from the model executable, that is, concrete. On the other hand, in the software modeling area, our research includes two sub-areas: MDD for mobile applications and the optimization of data transfer applying MDA, focused on the data access objects in a multilayer architecture.

9.10 SUMMARY

Model-based software development has three main advantages:

1. It is an excellent support for documenting the different artifacts of the development process.

2. It allows the people involved in development to easily comprehend what the software is and what it is designed for.

3. It makes development more agile by combining it with code-generation tools, easing evolution tasks.

The implementation of a model-oriented software development methodology enables the analysis of a project from both dynamic and static standpoints. One of the most advanced methodological lines in MDD is the one offered by the OMG: MDA, which, by implementing UML, applies MDD with a standardization that allows results to be easily understood and evaluated among the different parties.


REFERENCES

[1] Utting, M. & Legeard, B. Practical Model-Based Testing: A Tools Approach. Morgan Kaufmann: 2007. ISBN 9780123725011.

[2] Cabot, J. Agile and Modeling / MDE: friends or foes? MOdeling LAnguages. http://modeling-languages.com/blog/content/agile-and-modeling-mde-friends-or-foes.

[3] Ambler, S. The Elements of UML 2.0 Style. Cambridge University Press: 2005.

[4] Fuentes, L. & Vallecillo, A. Una Introduccion a los Perfiles UML. Novática, 168: 2004; 6–11.

[5] Nikiforova, O., Cernickins, A. & Pavlova, N. Discussing the Difference between Model Driven Architecture and Model Driven Development in the Context of Supporting Tools. Fourth International Conference on Software Engineering Advances, 1: 2009; 446–451.

[6] Belaunde, M., Burt, C. & Casanave, C. MDA Guide Version 1.0.1. OMG: 2003.

[7] Caramazana, A. Tecnologias MDA para el desarrollo de software. In: I Jornada Academica de Investigacion en Ingenieria Informatica: 2004.

[8] Franky, M. C. MDA: Arquitectura Dirigida por Modelos: 2010.

[9] Schmidt, D. C. Model-Driven Engineering. IEEE Computer, 39: 2006; 25–31.

[10] Schmidt, D. C. Guest Editor’s Introduction to Model-Driven Engineering. Computer, 39: 2006; 25–31. ISSN 0018-9162. doi: http://doi.ieeecomputersociety.org/10.1109/MC.2006.58.

[11] XP Design and Documentation | xProgramming.com. http://xprogramming.com/articles/ferlazzo/.


[12] Favre, L. Formalizing MDA-Based Reverse Engineering Processes. In: Australian Software Engineering Conference: 2008; 153–160. doi:10.1109/SERA.2008.21.

[13] Bell, A. E. Death by UML fever. DSPs, 2: 2004; 11–23.

[14] Eriksson, H.-E., Penker, M., Lyons, B. & Fado, D. UML 2 Toolkit. Wiley: 2004.

[15] Umana, H. & Cubides, M. Pruebas basadas en maquinas de estado Finitas (FSM). Revista Tendencias en Ingenieria de Software e Inteligencia Artificial, 4: 2009.

[16] The ModelJUnit test generation tool. http://www.cs.waikato.ac.nz/~marku/mbt/modeljunit/.

[17] Bonilla, L. Como escribir modelos Pre/Post adecuados para la automatizacion de pruebas. Paper presentado en el proceso de trabajo de grado en Ingenieria de Sistemas: 2010.

[18] Myers, G. J. The Art of Software Testing. John Wiley & Sons, 1st edition: 1979. ISBN 0471043281.

[19] Automated Test Design | Model-Based Testing (Conformiq). http://www.conformiq.com/.
