Mining Software Repositories for Traceability Links

Huzefa Kagdi, Jonathan I. Maletic, Bonita Sharif
Department of Computer Science
Kent State University
Kent Ohio 44242
{hkagdi, jmaletic, bsimoes}@cs.kent.edu

Abstract

An approach to recover/discover traceability links between software artifacts via the examination of a software system’s version history is presented. A heuristic-based approach that uses sequential-pattern mining is applied to the commits in software repositories for uncovering highly frequent co-changing sets of artifacts (e.g., source code and documentation). If different types of files are committed together with high frequency then there is a high probability that they have a traceability link between them. The approach is evaluated on a number of versions of the open source system KDE. As a validation step, the discovered links are used to predict similar changes in the newer versions of the same system. The results show high-precision predictions of certain types of traceability links.

1. Introduction

Traceability link recovery has been a subject of investigation for many years within the software engineering community [12, 24]. Having explicit documented traceability links between various artifacts (e.g., source code and documentation) is vital for a variety of software maintenance tasks including impact analysis, program comprehension, and requirements assurance of high quality systems. It is particularly useful to support source code comprehension if links exist between the source code, design, and requirements documentation. These links help explain why a particular function or class exists in the program.

A number of techniques have been proposed to assist in the recovery/discovery of traceability links in existing software systems [24]. However, the problem has proven to be quite difficult and no single approach is, by itself, completely successful or accurate. Many approaches suffer from producing many false positives, suggesting a link when none should exist [2, 17, 18].

Approaches to recovering/discovering traceability links typically analyze a single snapshot (i.e., current version) of a software system to infer links between two or more artifacts. The method presented here for uncovering traceability links differs in that it examines multiple versions (i.e., change history) of the software artifacts to recover traceability links. In [15] we developed the premise that if artifacts of different types (e.g., src.cpp and help.doc) are co-changed with high frequency over multiple versions, then such artifacts potentially have a traceability link between them. This history-based approach offers some distinct advantages over single version approaches, namely:

• Traceability links are derived from the actual changes to artifacts, rather than estimations that are based on the analysis of various structural and semantic dependencies between them in a single snapshot of a system.
• The change history represents one of the few sources of information available for recovering traceability links that is manually created and maintained by the actual developers. The commits and commit messages embody part of the developer’s knowledge and experience. The version history may contain domain-specific “hidden” links that program-analysis methods fail to uncover.

To accomplish this, we mine the version history found in software repositories that are maintained by version-control tools such as Subversion or CVS. Specifically, we use the data mining technique, sequential-pattern mining [1], to identify and analyze sets of files that are committed together. This technique produces frequently co-changed files, otherwise known as change patterns. Change patterns that include files representing different types of artifacts are considered related via a traceability link. Another approach, itemset mining [11], has also been used for mining version histories. Itemset mining produces change patterns that are unordered sets of co-changing artifacts, whereas sequential-pattern mining produces ordered (actually partially ordered) lists of co-changing artifacts. That is, the order in which the artifacts were changed (or committed) is preserved. The ordering information can thus be used to infer directionality of the traceability links.

Our approach is evaluated on the open-source software system KDE (K Desktop Environment). The results show that our approach is able to uncover traceability links between various types of software artifacts (e.g., source code files, change logs, user documentation, and build files) with high accuracy. This approach can readily be applied, in conjunction with other link recovery methods, to produce a more complete picture of the traceability of a software system.

The paper is organized as follows. In Section 2 we discuss traceability in the context of open source development. In Section 3 we discuss our approach to mining traceability patterns from software repositories and our mining tool. The evaluation of the approach is given in Section 4. Related work and conclusions follow in Sections 5 and 6.

2. Open source and Traceability

Scacchi et al. [21] observed that requirements elicitation, analysis, and specification of open-source systems are very different from the traditional approaches (e.g., use of mathematical logic, descriptive schemes, and UML design models) in software engineering. Their requirements are typically implied by discourse of project participants, and after-implementation assertions. Different types of informal sources (termed software informalisms) form the collective requirements and documentation of an open-source project. This includes software repositories, communications, HowTo guides, and traditional system documents (e.g., man pages). One particular type of requirement that is a common feature in many open-source projects is the ability to support extension mechanisms with various programming languages and architectures (e.g., a Python binding to the KDE libraries). Due to the distributed collaborative nature of open development, software repositories comprise the primary location of project artifacts along with the primary means of coordination and archival.

The bug/issue tracking repositories and emails can be seen as a source for requirements and corrective-maintenance requests of an open-source system. Source-control repositories can be seen as a source of implementation artifacts. Few efforts have been made to infer and then utilize traceability links between artifacts in bug repositories and source code artifacts via Mining Software Repositories (MSR). Canfora et al. [4] used the bug descriptions and the CVS commit messages for the purpose of change predictions. Their approach provides a set of files (at line level of granularity) that are likely to change given the textual description of a new bug (or feature). An information-retrieval method is used to index the changed files in the CVS repositories with the textual description of past bug reports in the Bugzilla repository [...]

[...] a combination of information in the CVS log file (commits) and Bugzilla to study fix-inducing changes. Fix-inducing changes are the changes that introduced new changes to fix an earlier reported problem. Regular-expression matching on the commit messages and text descriptions in Bugzilla, along with heuristics, is used to determine the CVS deltas that are related to a change that fixes a bug. Cubranic et al. [6] describe a tool, namely Hipikat, to assist new developers (not necessarily novice) on a project in performing their current task(s). Hipikat recommends artifacts from the project memory that may hold relevance to a task at hand. A developer may ask for the relevant artifacts explicitly in the form of a query, or the tool can do so automatically based on the current context (e.g., based on the currently open document(s) in the developer’s workspace).

In summary, existing MSR approaches have focused on uncovering traceability links between requests in bug-tracking systems and source code. While these are important efforts, they cover only a portion of the broad spectrum of documents found in open-source development. The sustained success of an open-source project, from both development and end use perspectives, depends to a large extent on how well these documents are maintained. For example, an application that frequently fails to compile or that offers very little installation help could have a diminishing effect on the user base. It is important that these documents be kept in alignment with the current state of the source code. Therefore, traceability between them is of interest and value. Accounting for these documents along with the requests in bug-tracking systems is a major step towards achieving the complete picture of traceability to source code in the context of open-source development.

3. Uncovering Traceability Links

Our research interest is in uncovering traceability between source code and other artifacts. This includes user documents (e.g., HTML, XML/docbook, LaTeX and Doxygen), build management documents (automake, cmake, and makefile), HowTo guides (e.g., FAQs), release and distribution documents (e.g., ChangeLogs, whatsNew, README, and INSTALL guides), progress monitoring documents (TODO and STATUS), and extensible mechanisms (e.g., Python, Ruby, and Perl bindings for an API). These artifacts can be considered software informalisms [21].

Our approach is to analyze sets of files that frequently co-occur in change-sets by applying a frequent-pattern
