Comparison of Clustering Algorithms in the Context of Software Evolution
Jingwei Wu, Ahmed E. Hassan, Richard C. Holt
School of Computer Science, University of Waterloo
Waterloo, ON, Canada
{j25wu, aeehassa, holt}@uwaterloo.ca
Abstract

To aid software analysis and maintenance tasks, a number of software clustering algorithms have been proposed to automatically partition a software system into meaningful subsystems or clusters. However, it is unknown whether these algorithms produce similarly meaningful clusterings for similar versions of a real-life software system under continual change and growth.

This paper describes a comparative study of six software clustering algorithms. We applied each of the algorithms to subsequent versions of five large open source systems. We conducted comparisons based on three criteria: stability (Does the clustering change only modestly as the system undergoes modest updating?), authoritativeness (Does the clustering reasonably approximate the structure an authority provides?), and extremity of cluster distribution (Does the clustering avoid huge clusters and many very small clusters?).

Experimental results indicate that the studied algorithms exhibit distinct characteristics. For example, the clusterings from the most stable algorithm bear little similarity to the implemented system structure, while the clusterings from the least stable algorithm have the best cluster distribution. Based on the obtained results, we claim that current automatic clustering algorithms need significant improvement to provide continual support for large software projects.

1 Introduction

A well documented architecture can improve the quality and maintainability of a software system. However, many existing systems do not have their architecture documented. Moreover, the documented architecture becomes outdated and the system structure decays as rapid changes are made to the system to meet market pressure [7, 18]. A high rate of turnover among developers makes the situation even worse. Maintenance of architectural documentation is one of many problems that confront today's large software projects. Software clustering holds out the promise of helping in this task [13, 24].

Software clustering refers to the decomposition of a software system into meaningful subsystems. It plays an important role in understanding legacy software systems [15], assisting in their architectural documentation [3, 13], and supporting their re-modularization [22, 27]. For example, a misplaced procedure or file can be automatically discovered and relocated to the proper subsystem to reduce unexpected dependencies and prevent the decay of the architecture [12].

Ideally, a software clustering algorithm should be automated to provide continual support throughout the lifetime of a large software system. More importantly, the algorithm must produce meaningful clusterings in a stable manner. To be meaningful, the algorithm must produce clusterings that can help developers understand the system; an algorithm which places all source files from a system into two large clusters will not be helpful to developers analyzing the code. To be stable, the algorithm must produce clusterings that do not vary widely from one version of the system to the next; an algorithm which produces clusterings that vary widely even though no major code restructuring occurred between versions is not likely to be used by developers.

In this paper, we compare six clustering algorithms by applying them to subsequent versions of five large open source systems. Our objective is to clarify to what extent software clustering algorithms can be used to support re-modularization and architectural documentation during the life cycle of a large software system. We use three criteria to evaluate the usefulness of these algorithms: C1 When a system changes modestly, the clustering produced by an algorithm should also change modestly; C2 An automatically produced clustering should approximate the clustering produced by an authority (e.g., the architect); and C3 Automatically produced clusters should generally be neither huge (i.e., containing hundreds of source files) nor tiny (i.e., containing very few source files).

The rest of this paper is organized as follows: Section 2

Systems   Prog. Lang.   # Versions   # Source Files   Size (KLOC)   Graph Data   Raw Data   Extraction Time
Ruby      C             73           90 - 261         74 - 187      6.5 MB       1.0 GB     21 mins
KSDK      C/C++         70           21 - 1156        3 - 263       82.5 MB      2.5 GB     52 mins
OpenSSL   C             73           593 - 845        164 - 278     74.4 MB      4.2 GB     57 mins
PGSQL     C             73           771 - 947        182 - 519     99.1 MB      4.5 GB     59 mins
KOffice   C/C++         70           1358 - 3266      272 - 962     1235.5 MB    42.0 GB    325 mins
Table 1: Properties of five target systems from 1999 to 2004. The Raw Data refers to CTSX output and Graph Data refers to lifted graphs. Due to a CVS update problem, the last three monthly versions of KSDK and KOffice in 2004 are not available.

provides an overview of the five systems chosen for the experiments. Section 3 describes the experimental setup in the context of software evolution. Section 4 introduces a simple ordinal measure for comparing a number of data series. Section 5 describes the experimental results obtained using ordinal evaluation techniques. The results show that the studied clustering algorithms exhibit distinct characteristics in terms of stability, authoritativeness, and extremity. Section 6 discusses several interesting observations. Section 7 considers related work and Section 8 concludes this paper.

2 Target Systems

To evaluate software clustering algorithms, we chose to apply them to a number of real-life systems, all of which are open source software and hence available for study. Table 1 gives a summary of key properties of the five target systems. They represent distinct application domains and have gone through a number of years of development. We now briefly describe these systems.

1. KSDK is a software development kit for the K Desktop Environment (KDE) [8]. It offers powerful framework support for various kinds of KDE applications.

2. KOffice is a free, integrated office suite for KDE. This suite contains 12 major applications: KWord, KChart, KSpread, KPresenter, Kivio, Karbon14, Krita, Kugar, KPlato, Kexi, KFormular, and Filters [9].

3. OpenSSL is a cryptography toolkit implementing the Secure Socket Layer (SSL) and Transport Layer Security (TLS) network protocols and related cryptography standards [17].

4. PostgreSQL is a free, large, SQL compliant object-relational Database Management System (DBMS) [19] that originated at the University of California at Berkeley in 1996. The rest of this paper will refer to PostgreSQL as PGSQL.

5. Ruby is an interpreted scripting language designed for quick and easy object oriented programming [21]. It has many convenient features for text file processing and system management.

3 Experimental Design

Figure 1 illustrates the design of our experiments. We begin with the repository (stored in CVS) of a target system (such as PGSQL) that we use to evaluate clustering algorithms. The repository contains successive versions of the target system. We retrieve monthly versions of the source code, called here V1, V2, etc. In step 1 (program extraction), we extract a directed graph Gi from each version Vi. In this graph, each node represents a file in the target system. Each edge represents a static dependency (such as a reference to a variable or a data type, a call to a function, or a call to a macro) from one file to another. In step 2 (software clustering), we run a number of clustering algorithms, called A, B, etc., on each graph Gi, producing clusterings CiA, CiB, etc. Finally, in step 3 (comparative analysis), we use a set of criteria to evaluate the algorithms. We now discuss these steps in more detail.

3.1 Program Extraction

In the first step, we extract graphs from 70 to 73 monthly versions (see Table 1) of the target systems, as illustrated in Figure 1. Extraction can be a difficult task due to the size of the systems, the number of versions, and irregularities present in the source code (e.g., syntactic errors, incomplete code, language dialects, and software configurations). To circumvent these difficulties, we developed a lightweight C/C++ source code extractor called CTSX.

CTSX is built on CTags [5] and CScope [4]. Both tools are efficient and robust in handling software systems up to millions of lines of code. CTSX uses CTags to extract program entities (e.g., functions, variables, and data types) and CScope to retrieve references (e.g., function calls) to entities found by CTags. These references are then "lifted" to the level of source files to produce directed edges between nodes (files) in the extracted graph. This lifting to the level of files greatly decreases the size of the graph, which is an important consideration when dealing with many versions of a large target system.
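The lifting step described above can be sketched as follows. This is a minimal illustration, not the actual CTSX implementation; the entity-to-file map and the reference list are hypothetical stand-ins for the CTags and CScope outputs.

```python
# Sketch of "lifting" entity-level references to file-level graph edges.
# Inputs are hypothetical; CTSX's real data formats are not shown in the paper.

def lift_to_file_graph(entity_file, references):
    """entity_file: maps each program entity to its defining source file.
    references: (from_entity, to_entity) pairs, e.g., function calls.
    Returns a set of directed file-level edges (from_file, to_file)."""
    edges = set()
    for src_entity, dst_entity in references:
        src_file = entity_file[src_entity]
        dst_file = entity_file[dst_entity]
        if src_file != dst_file:  # keep only cross-file dependencies
            edges.add((src_file, dst_file))
    return edges

entity_file = {"main": "main.c", "parse": "parse.c", "TOKEN": "parse.h"}
references = [("main", "parse"), ("main", "TOKEN"), ("parse", "TOKEN")]
print(sorted(lift_to_file_graph(entity_file, references)))
# [('main.c', 'parse.c'), ('main.c', 'parse.h'), ('parse.c', 'parse.h')]
```

Because many entity-level references collapse into a single file-level edge, the lifted graph is much smaller than the raw reference data, which matches the size reduction the paper reports between Raw Data and Graph Data in Table 1.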
[Figure 1 shows the experimental pipeline: a source repository (e.g., CVS) yields software versions V1, V2, ..., Vn; Step 1 (program extraction) produces software graphs G1, G2, ..., Gn; Step 2 (software clustering) applies algorithms A, B, etc., yielding clusterings C1A, C2A, ..., CnA and C1B, C2B, ..., CnB; Step 3 (comparative analysis) produces scores and data plots.]

Figure 1: Comparative analysis of software clustering algorithms in the context of software evolution
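The three-step pipeline of Figure 1 can be sketched as a simple driver loop. The extractor, clustering algorithms, and comparison function here are placeholders for the tools described in this paper (CTSX and the six clustering algorithms), not their actual interfaces.

```python
# Sketch of the three-step pipeline in Figure 1. All functions passed in are
# placeholders; the real study uses CTSX for extraction and six clustering tools.

def run_pipeline(versions, extract_graph, algorithms, compare):
    graphs = [extract_graph(v) for v in versions]       # Step 1: program extraction
    clusterings = {name: [algo(g) for g in graphs]      # Step 2: software clustering
                   for name, algo in algorithms.items()}
    scores = {name: compare(series)                     # Step 3: comparative analysis
              for name, series in clusterings.items()}
    return scores

# Toy usage: two "versions", one trivial "algorithm" that puts every file in a
# single cluster, and a "comparison" that merely counts the clusterings produced.
versions = [["a.c", "b.c"], ["a.c", "b.c", "c.c"]]
result = run_pipeline(versions,
                      extract_graph=lambda v: v,          # identity stand-in
                      algorithms={"ONE": lambda g: [set(g)]},
                      compare=len)
print(result)  # {'ONE': 2}
```

The point of the sketch is the shape of the experiment: every algorithm is run on every monthly graph, and only then are the per-algorithm series of clusterings evaluated against the criteria.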
3.2 Software Clustering

In the second step of the experiment (see Figure 1), a set of clustering algorithms is run over the graphs extracted from each monthly version of the target system. A clustering is created for each version by each algorithm.

We chose a set of six clustering algorithms based on their availability and their discussion in the literature. We limited our choice to available implementations that run in batch mode, since we needed to run each of these algorithms many times as they were applied to different target systems over a large number of versions. We selected software clustering tools produced by the researchers Anquetil, Mancoridis, and Tzerpos. Anquetil designed a hierarchical clustering algorithm suite, which offers a selection of association and distance coefficients as well as update rules [1]. Four algorithms from this suite were chosen and given names of the form CL** and SL**, where ** encodes the parametrization, as described below. Mancoridis and Mitchell provided us with their Bunch suite [13]. Tzerpos provided us with his algorithm for comprehension driven clustering (ACDC) [24]. We now give a brief description of the six algorithms.

1. CL75 is an agglomerative clustering algorithm based on the Jaccard coefficient and the complete linkage update rule [1]. The cut-point height for the dendrogram is set to 0.75. The higher the cut-point, the smaller the number of clusters in the resulting dendrogram.

2. CL90 has the same configuration as CL75 except that its cut-point height is set to 0.90 [1].

3. SL75 is an agglomerative clustering algorithm based on the Jaccard coefficient and the single linkage update rule. The cut-point height is set to 0.75 [1].

4. SL90 has the same configuration as SL75 except that its cut-point height is set to 0.90 [1].

5. ACDC is an algorithm based on program comprehension patterns; it attempts to recover subsystems that are commonly found in manually-created decompositions of large software systems [24].

6. Bunch provides a suite of algorithms that include Hill Climbing, Exhaustive, and Genetic Algorithms [6, 13]. We tried three different hill climbing configurations: NAHC (nearest ascent hill climbing), SAHC (steepest ascent hill climbing), and a customized configuration with the minimum search space greater than 55% and the randomized proportion of the search space equal to 20%. This paper will only discuss the third configuration, since NAHC and SAHC are not much different from it based on the results we obtained. For brevity, it will be referred to as Bunch in the rest of this paper.

3.3 Comparative Analysis

In the third step of the experiment (see Figure 1), we compare the clustering algorithms based on the three criteria, C1, C2, and C3, mentioned previously. We now detail the three criteria:

C1 Stability
Similar clusterings should be produced for similar versions of a software system. This criterion emphasizes the persistence of the clustering structure across successive versions of an evolving software system. Under conditions of small and incremental change between consecutive versions, an algorithm should be stable, i.e., it should produce clusterings similar to those of prior months.

C2 Authoritativeness
Clusterings produced by an algorithm should resemble a clustering from some authority. An authoritative clustering may be produced by a (human) architect. It may also be derived from the directory structure of the target system. In the latter case, authoritativeness can be seen as adherence to the source folder structure.

C3 Extremity of Cluster Distribution
The cluster size distribution of a clustering should not exhibit extremity. In particular, a clustering algorithm should avoid the following two situations: (1) the majority of files are grouped into one or a few huge clusters (sometimes called black holes), and (2) the majority of clusters are singletons (forming what are sometimes called dust clouds).

Based on these criteria, we conducted three comparisons to examine how the algorithms chosen for this study differ from one another. Detailed information on these comparisons is presented in Section 5.

A clustering algorithm has previously been considered stable if the difference between a newly obtained clustering and the one obtained from the original version of the system is not greater than 1% of the total number of resources (i.e., source files). We feel that it is difficult to determine a proper threshold in the context of real-life software evolution, and 1% seems too optimistic in reality. We instead used our ordinal measure (defined in Section 4) to derive a stability ordering of an algorithm in comparison to other algorithms. In addition, we evaluated an algorithm's stability with three ordinal values: High, Medium, and Low. We refer to the latter method as the HML ordinal evaluation.

[Figure: a sequence of clusterings C1, C2, C3, ..., Cn, with clustering comparisons between consecutive clusterings yielding similarity values Sim1, Sim2, Sim3, ..., Simn-1.]

4 A Simple Ordinal Measure

To support our analysis of clustering algorithms over consecutive versions of a target software system, we created an ordinal measure for ranking a number of data series. A data series refers to a sequence of quantitative values, for example, <1, 2, 3, 4>. For series Si and Sj, we define above(Si, Sj) and below(Si, Sj)
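The definition of the ordinal measure is cut off in this copy. As a sketch only, assuming above(Si, Sj) counts the points at which series Si lies above Sj and below(Si, Sj) counts the converse, an ordinal comparison of two data series might look like the following; the paper's exact definition may differ.

```python
# Hedged sketch of an ordinal comparison of two equal-length data series.
# Assumption (the definitions are truncated in this copy): above(Si, Sj) counts
# the positions where Si exceeds Sj, and below(Si, Sj) counts the converse.

def above(si, sj):
    return sum(1 for x, y in zip(si, sj) if x > y)

def below(si, sj):
    return sum(1 for x, y in zip(si, sj) if x < y)

def ordinal_compare(si, sj):
    """Positive if Si is mostly above Sj, negative if mostly below."""
    return above(si, sj) - below(si, sj)

s1 = [0.9, 0.8, 0.85, 0.9]   # e.g., monthly similarity values of algorithm A
s2 = [0.7, 0.95, 0.6, 0.8]   # e.g., monthly similarity values of algorithm B
print(above(s1, s2), below(s1, s2), ordinal_compare(s1, s2))  # 3 1 2
```

Such a pointwise comparison needs no assumption about the magnitude of differences between series, which fits the paper's stated goal of deriving an ordering of algorithms rather than an absolute score.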