Co-Change Patterns: A Large Scale Empirical Study

Luciana L. Silva (a,*), Marco Tulio Valente (a), Marcelo de Almeida Maia (b)

(a) Department of Computer Science, UFMG, Brazil
(b) Faculty of Computer Science, UFU, Brazil

(*) Corresponding author at Departamento de Ciência da Computação, Av. Antônio Carlos, 6627 - Pampulha, CEP: 31270-010, UFMG, Belo Horizonte, Brazil. Tel.: +55 31 3409-5865.
Email addresses: [email protected], [email protected] (Luciana L. Silva), [email protected] (Marco Tulio Valente), [email protected] (Marcelo de Almeida Maia)

Preprint submitted to Journal of Systems and Software, March 15, 2019

Abstract

Co-Change Clustering is a modularity assessment technique that reveals how often changes are localized in modules and whether change propagation represents design problems. This technique is centered on co-change clusters, which are sets of source code files that are highly inter-related with respect to co-change relations. In this paper, we conduct a series of empirical analyses on a large corpus of 133 popular software projects on GitHub. We describe six co-change patterns by projecting them over the directory structure. We mine 1,802 co-change clusters, of which 1,719 (95%) are covered by the six co-change patterns. In this study, we aim to answer two central questions: (i) Are co-change patterns detected in different programming languages? (ii) How do different co-change patterns relate to rippling, activity density, ownership, and team diversity on clusters? We conclude that Encapsulated and Well-Confined clusters (Wrapped) implement well-defined and confined concerns. Octopus clusters are proportionally numerous relative to other patterns. They relate significantly with ripple effect, activity, ownership, and diversity in development teams. Although Crosscutting clusters are scattered over directories, they implement well-defined concerns. Although they present higher activity than Wrapped clusters, it is not necessarily easy to get rid of them, suggesting that support tools may play a crucial role.

Keywords: Modularity, co-change clusters, co-change patterns

1. Introduction

Parnas developed the criteria that “modules should hide decisions or decisions that are likely to change” (Parnas, 1972). In addition, according to Parnas, a module represents a responsibility assignment. However, the initially planned modules may suffer changes over time to adapt to their new responsibilities; otherwise their modularity tends to decay. During software development, changes are performed constantly in tasks related to new features, code refactoring, and bug fixing. In modular software, when those tasks are required, they should change a single module with minimal (if possible, no) impact on other modules (Aggarwal and Singh, 2005). In contrast, improper modularization may cause ripple effects during software maintenance. In such situations, it would be beneficial if the system expert could semi-automatically analyze modularity by retrieving sets of classes that usually change together, in order to compare and contrast co-change clusters with the directory structure.

The importance of modular systems has motivated researchers and practitioners to investigate different dimensions to assess system modularity. Despite modularity being an essential principle in software development, effective approaches to assess modularity still remain an open problem. Typically, the standard approach is based on the analysis of structural measures, e.g., coupling and cohesion (Chidamber and Kemerer, 1991; Stevens et al., 1974). Over the years, several alternative attempts have been proposed, such as semantic approaches, which analyze the source code vocabulary using Information Retrieval techniques (Santos et al., 2014; Maletic and Marcus, 2000), co-change approaches, which mine historical data to detect software artifacts that usually change together (Zimmermann et al., 2007; Alali et al., 2013), or hybrid approaches, which combine these types of information (Kagdi et al., 2013; Bavota et al., 2014).

Recently, we proposed the Co-Change Clustering technique to assess modularity by capturing logical modules from the history of software changes (Silva et al., 2014, 2015a). Co-change clusters consist of source code artifacts that frequently change together among themselves but not with other artifacts in different clusters. Later, we reported a study with experts of six object-oriented systems to investigate the developers’ perception of co-change clusters (Silva et al., 2015b). These clusters were classified in three patterns to represent common instances of co-change clusters, regarding their projection over the directory structure: Encapsulated Clusters (clusters that match the directory structure, i.e., they dominate all co-change classes in the directory structures they touch), Crosscutting Clusters (clusters whose co-change classes are scattered across several directory structures), and Octopus Clusters (most co-change classes are confined in one directory structure with some arms, or “tentacles”, in others). Our proposed co-change patterns could cover 52% of the 102 mined clusters. Although unitary changes are preferable to co-changes, the co-change patterns with propagation and scattering proved to have different perspectives on the impact they may cause in software maintenance. In summary, our first results indicate that: (i) Encapsulated Clusters tend to be more controllable; (ii) around 50% of the Crosscutting Clusters were associated with design anomalies; (iii) Octopus Clusters represent expected class distributions, which are difficult to implement in an encapsulated way. Nonetheless, we did not evaluate whether programming languages, application domains, thresholds, and commit density (number of commits divided by the number of source code files) impact the detection of co-change patterns. We also did not analyze whether co-change patterns relate to ripple effect, activity density, ownership, and team diversity on clusters.

In this paper, we extend our previous work to answer these open questions in five directions:

1. We conduct a new study on a large corpus of 133 popular projects hosted on GitHub. We consider projects in different languages (C/C++, Java, PHP, Ruby, JavaScript, and Python) and application domains.

2. We investigate the threshold used to group commits (applied by the same developer in a period of time during the data preprocessing) and the process of co-change cluster extraction. In this analysis, we intend to evaluate the effectiveness of co-change clusters.

3. We propose three new co-change patterns (Section 3): (i) Well-Confined Clusters (clusters which touch a single directory structure but do not dominate it); (ii) Black Sheep Clusters (similar to Crosscutting, but they touch very few code files in each directory structure); and (iii) Squid Clusters (similar to Octopus, but they have smaller bodies and arms). We mine 1,802 co-change clusters from the version histories of such projects, which were then categorized in six patterns regarding their projection over the directory structure. From the initially computed clusters, 1,719 co-change clusters (95%) are covered by the proposed co-change patterns.

4. We conduct a series of statistical analyses to evaluate the categorized co-change clusters on the 133 projects (Section 5.4). This study aims to answer two central questions: (i) Are co-change patterns detected in different programming languages? (ii) How do different co-change patterns relate to rippling, activity density, ownership, and team diversity on clusters?

5. We conduct a qualitative analysis by investigating underlying natural language topics in commit messages on classified co-change clusters.

we conduct a qualitative and semantic analysis and Section ?? we discuss our findings. We discuss threats to validity in Section 7 and related work in Section 8. Finally, we conclude the paper in Section 9.

2. Co-Change Clustering

The ultimate goal of our technique is to support developers on modularity assessment using co-change relations. The technique relies on historical information to generate co-change graphs and then mine co-change clusters. As illustrated in Figure 1, we propose three phases to extract co-change clusters. In the first phase, we extract commit transactions and apply preprocessing steps to build co-change graphs. After that, a post-processing phase is applied on co-change graphs to prune edges with small weights. Finally, in the last phase, the co-change graph is clustered several times to select the best clustering. A detailed description of the technique can be found in previous work (Silva et al., 2014, 2015a).

2.1. Co-Change Graph

Beyer and Noack (2005) proposed co-change graphs, an abstraction to represent commit transactions from a version control system (VCS). Conceptually, a co-change graph is an undirected and weighted graph $\{V, E\}$, where $V$ is a set of source code files and $E$ is a set of edges. If there is an edge between two vertices (source code files) $F_i$ and $F_j$, then at least one commit in the VCS changes $F_i$ and $F_j$, where $i \neq j$. Finally, we conduct some experiments with relative edge weights to define the weighting measure. We concluded that co-change modifications are usually much less frequent than single changes in a source code file. For example, in our previous work (Silva et al., 2014, 2015a), we extract co-change clusters for the Lucene project to present the concept of co-change clustering. We observe that the maximal edges’ weight is seven and the maximum size is 27. For this reason, we do not adopt relative edge weights on co-change graphs. Instead, the edges’ weights represent how many commits changed the connected source code files simultaneously.

Before extracting co-change graphs, we perform Phase 1 (Figure 1) to preprocess commits extracted from version history. This phase consists of the following three steps:

1. We discard commits which do not change source code files, e.g., documentation and configuration files. In addition, testing files are eliminated because co-changes between testing
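
The co-change graph definition in Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the extension list `SOURCE_EXTENSIONS`, the path-based test-file heuristic in `is_test_file`, and the `min_weight` pruning parameter are all hypothetical choices made for this example. The sketch only assumes what the text states: vertices are source code files, an edge weight counts the commits that changed both files, non-source and testing files are dropped, and edges with small weights are pruned afterwards.

```python
from collections import Counter
from itertools import combinations

# Hypothetical extension list; the paper does not say how source files are recognized.
SOURCE_EXTENSIONS = (".java", ".py", ".rb", ".php", ".js", ".c", ".cpp")


def is_test_file(path):
    """Crude, assumed heuristic: any path mentioning 'test' is a testing file."""
    return "test" in path.lower()


def is_source_file(path):
    """Keep only non-test files with a known source extension."""
    return path.endswith(SOURCE_EXTENSIONS) and not is_test_file(path)


def build_cochange_graph(commits):
    """Build an undirected weighted graph as {frozenset({f_i, f_j}): weight}.

    Each commit is an iterable of changed file paths. The weight of an edge
    is the absolute number of commits that changed both files together.
    """
    edges = Counter()
    for changed_files in commits:
        # Deduplicate and drop documentation, configuration, and test files.
        sources = sorted({f for f in changed_files if is_source_file(f)})
        for f_i, f_j in combinations(sources, 2):
            edges[frozenset((f_i, f_j))] += 1
    return edges


def prune_edges(edges, min_weight):
    """Post-processing: drop edges whose weight is below min_weight (assumed threshold)."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}


if __name__ == "__main__":
    commits = [
        ["src/A.java", "src/B.java", "README.md"],
        ["src/A.java", "src/B.java"],
        ["src/A.java", "lib/C.java", "test/ATest.java"],
        ["docs/guide.md"],
    ]
    graph = build_cochange_graph(commits)
    for pair, weight in sorted(graph.items(), key=lambda kv: -kv[1]):
        print(sorted(pair), weight)
    print(prune_edges(graph, min_weight=2))
```

Storing each edge as a `frozenset` keeps the graph undirected without normalizing the pair order, and a commit that touches fewer than two source files contributes no edge.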
