Comparison of Clustering Algorithms in the Context of Software Evolution

Jingwei Wu, Ahmed E. Hassan, Richard C. Holt
School of Computer Science, University of Waterloo
Waterloo, ON, Canada
{j25wu, aeehassa, holt}@uwaterloo.ca

Abstract

To aid software analysis and maintenance tasks, a number of software clustering algorithms have been proposed to automatically partition a software system into meaningful subsystems or clusters. However, it is unknown whether these algorithms produce similarly meaningful clusterings for similar versions of a real-life software system under continual change and growth.

This paper describes a comparative study of six software clustering algorithms. We applied each of the algorithms to subsequent versions of five large open source systems. We conducted comparisons based on three criteria: stability (Does the clustering change only modestly as the system undergoes modest updating?), authoritativeness (Does the clustering reasonably approximate the structure an authority provides?) and extremity of cluster distribution (Does the clustering avoid huge clusters and many very small clusters?).

Experimental results indicate that the studied algorithms exhibit distinct characteristics. For example, the clusterings from the most stable algorithm bear little similarity to the implemented system structure, while the clusterings from the least stable algorithm have the best cluster distribution. Based on the obtained results, we claim that current automatic clustering algorithms need significant improvement to provide continual support for large software projects.

1 Introduction

A well documented architecture can improve the quality and maintainability of a software system. However, many existing systems do not have their architecture documented. Moreover, the documented architecture becomes outdated and the system structure decays as rapid changes are made to the system to meet market pressure [7, 18]. A high rate of turnover among developers makes the situation even worse. Maintenance of architectural documentation is one of many problems that confront today's large software projects. Software clustering holds out the promise of helping in this task [13, 24].

Software clustering refers to the decomposition of a software system into meaningful subsystems. It plays an important role in understanding legacy software systems [15], assisting in their architectural documentation [3, 13], and supporting their re-modularization [22, 27]. For example, a misplaced procedure or file can be automatically discovered and relocated to the proper subsystem to reduce unexpected dependencies and prevent the decay of the architecture [12].

Ideally, a software clustering algorithm should be automated to provide continual support throughout the lifetime of a large software system. More importantly, the algorithm must produce meaningful clusterings in a stable manner. To be meaningful, the algorithm must produce clusterings that can help developers understand the system. A clustering algorithm which places all source files from a system into two large clusters will not be helpful to developers analyzing the code. To be stable, the algorithm must produce clusterings that do not vary widely from one version of the system to the next. An algorithm which produces clusterings that vary widely even though no major code restructuring occurred between versions is unlikely to be used by developers.

In this paper, we compare six clustering algorithms by applying them to subsequent versions of five large open source systems. Our objective is to clarify to what extent software clustering algorithms can be used to support re-modularization and architectural documentation during the life cycle of a large software system. We use three criteria to evaluate the usefulness of these algorithms: C1 When a system changes modestly, the clustering produced by an algorithm should also change modestly; C2 An automatically produced clustering should approximate the clustering produced by an authority (e.g., the architect); and C3 Automatically produced clusters should generally be neither huge (i.e., containing hundreds of source files) nor tiny (i.e., containing very few source files).

The rest of this paper is organized as follows: Section 2 provides an overview of the five systems chosen for the experiments. Section 3 describes the experimental setup in the context of software evolution. Section 4 introduces a simple ordinal measure for comparing a number of data series. Section 5 describes the experimental results obtained using ordinal evaluation techniques. The results show that the studied clustering algorithms exhibit distinct characteristics in terms of stability, authoritativeness, and extremity. Section 6 discusses several interesting observations. Section 7 considers related work and Section 8 concludes this paper.

2 Target Systems

To evaluate software clustering algorithms, we chose to apply them to a number of real-life systems, all of which are open source software and hence available for study. Table 1 gives a summary of key properties of the five target systems. They represent distinct application domains and have gone through a number of years of development. We now briefly describe these systems.

Table 1: Properties of the five target systems from 1999 to 2004. Raw Data refers to the CTSX output and Graph Data refers to the lifted graphs. Due to a CVS update problem, the last three monthly versions of KSDK and KOffice in 2004 are not available.

Systems   Prog. Lang.   # Versions   # Source Files   Size (KLOC)   Graph Data   Raw Data   Extraction Time
Ruby      C             73           90 - 261         74 - 187      6.5 MB       1.0 GB     21 mins
KSDK      C/C++         70           21 - 1156        3 - 263       82.5 MB      2.5 GB     52 mins
OpenSSL   C             73           593 - 845        164 - 278     74.4 MB      4.2 GB     57 mins
PGSQL     C             73           771 - 947        182 - 519     99.1 MB      4.5 GB     59 mins
KOffice   C/C++         70           1358 - 3266      272 - 962     1235.5 MB    42.0 GB    325 mins

1. KSDK is a software development kit for the K Desktop Environment (KDE) [8]. It offers powerful framework support for various kinds of KDE applications.

2. KOffice is a free, integrated office suite for KDE. The suite contains 12 major applications, including KWord, KChart, KSpread, KPresenter, Kivio, Karbon14, Krita, Kugar, KPlato, KFormular, and Filters [9].

3. OpenSSL is a cryptography toolkit implementing the Secure Socket Layer (SSL) and Transport Layer Security (TLS) network protocols and related cryptography standards [17].

4. PostgreSQL is a free, large, SQL compliant object Relational Database Management System (DBMS) [19] that originated at the University of California at Berkeley in 1996. The rest of this paper refers to PostgreSQL as PGSQL.

5. Ruby is an interpreted scripting language designed for quick and easy object oriented programming [21]. It has many convenient features for text file processing and system management.

3 Experimental Design

Figure 1 illustrates the design of our experiments. We begin with the repository (stored in CVS) of a target system (such as PGSQL) that we use to evaluate clustering algorithms. The repository contains successive versions of the target system. We retrieve monthly versions of the source code, called here V1, V2, etc. In step 1 (program extraction), we extract a directed graph Gi from each version Vi. In this graph, each node represents a file in the target system. Each edge represents a static dependency (such as a reference to a variable or a data type, a call to a function, or a call to a macro) from one file to another. In step 2 (software clustering), we run a number of clustering algorithms, called A, B, etc., on each graph Gi, producing clusterings CiA, CiB, etc. Finally, in step 3 (comparative analysis), we use a set of criteria to evaluate the algorithms. We now discuss these steps in more detail.

3.1 Program Extraction

In the first step we extract graphs from 70 to 73 monthly versions (see Table 1) of the target systems, as illustrated in Figure 1. Extraction can be a difficult task due to the size of the systems, the number of versions, and irregularities present in the source code (e.g., syntactical errors, incomplete code, language dialects and software configurations). To circumvent these difficulties, we developed a lightweight C/C++ source code extractor called CTSX.

CTSX is built on CTags [5] and CScope [4]. Both tools are efficient and robust in handling software systems up to millions of lines of code. CTSX uses CTags to extract program entities (e.g., functions, variables, and data types) and CScope to retrieve references (e.g., function calls) to entities found by CTags. These references are then "lifted" to the level of source files to produce directed edges between nodes (files) in the extracted graph. This lifting to the level of files greatly decreases the size of the graph, which is an important consideration when dealing with many versions of a large target system.
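To make the lifting step concrete, the sketch below shows one way such entity-to-file lifting can be implemented. It is an illustration only, not the actual CTSX implementation; the input format (one (source-file, target-file) pair per extracted reference) and the file names are assumptions.

```python
# Illustrative sketch of lifting entity-level references to file-level
# dependency edges, as described in Section 3.1 (not the CTSX code itself).
from collections import defaultdict

def lift_to_file_graph(references):
    """Collapse entity-level references into a directed file-level graph.

    references: iterable of (src_file, dst_file) pairs, one per reference
                from an entity defined in src_file to an entity defined
                in dst_file (e.g., a call, a variable or type reference).
    Returns a dict mapping each file to the set of files it depends on.
    """
    graph = defaultdict(set)
    for src_file, dst_file in references:
        if src_file != dst_file:           # ignore intra-file references
            graph[src_file].add(dst_file)  # at most one edge per file pair
    return graph

# Example: three entity-level references collapse into two file-level edges.
refs = [("parser.c", "scan.c"), ("parser.c", "scan.c"), ("parser.c", "util.c")]
print({f: sorted(d) for f, d in lift_to_file_graph(refs).items()})
# {'parser.c': ['scan.c', 'util.c']}
```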

[Figure 1: Comparative analysis of software clustering algorithms in the context of software evolution. A source repository (e.g., CVS) yields software versions V1, V2, ..., Vn; Step 1 (program extraction) produces software graphs G1, G2, ..., Gn; Step 2 (software clustering) produces clusterings C1A, C2A, ..., CnA for algorithm A, C1B, C2B, ..., CnB for algorithm B, and so on; Step 3 (comparative analysis) produces scores and data plots.]

3.2 Software Clustering

In the second step of the experiment (see Figure 1), a set of clustering algorithms is run over the graphs extracted from each monthly version of the target system. A clustering is created for each version by each algorithm.

We chose a set of six clustering algorithms based on their availability and their discussion in the literature. We limited our choice to available implementations that run in batch mode, since we needed to run each of these algorithms many times as they were applied to different target systems over a large number of versions. We selected software clustering tools produced by researchers Anquetil, Mancoridis, and Tzerpos. Anquetil designed a hierarchical clustering algorithm suite, which offers a selection of association and distance coefficients as well as update rules [1]. Four algorithms from this suite were chosen and given names of the form CL** and SL**, where ** encodes the parametrization, as described below. Mancoridis and Mitchell provided us with their Bunch suite [13]. Tzerpos provided us with his algorithm for comprehension driven clustering (ACDC) [24]. We now give a brief description of the six algorithms.

1. CL75 is an agglomerative clustering algorithm based on the Jaccard coefficient and the complete linkage update rule [1]. The cut-point height for the dendrogram is set to 0.75. The higher the cut-point, the smaller the number of clusters in the resulting dendrogram. (An illustrative sketch of this kind of configuration follows Section 3.3.)

2. CL90 has the same configuration as CL75 except that its cut-point height is set to 0.90 [1].

3. SL75 is an agglomerative clustering algorithm based on the Jaccard coefficient and the single linkage update rule. The cut-point height is set to 0.75 [1].

4. SL90 has the same configuration as SL75 except that its cut-point height is set to 0.90 [1].

5. ACDC is an algorithm based on program comprehension patterns; it attempts to recover subsystems that are commonly found in manually-created decompositions of large software systems [24].

6. Bunch provides a suite of algorithms that include Hill Climbing, Exhaustive, and Genetic Algorithms [6, 13]. We tried three different hill climbing configurations: NAHC (nearest ascend hill climbing), SAHC (shortest ascend hill climbing), and a customized configuration with the minimum search space greater than 55% and the randomized proportion of the search space equal to 20%. This paper only discusses the third configuration, since NAHC and SAHC are not much different from it based on the results we obtained. For brevity, it will be referred to as Bunch in the rest of this paper.

3.3 Comparative Analysis

In the third step of the experiment (see Figure 1), we compare the clustering algorithms based on the three criteria, C1, C2 and C3, mentioned previously. We now detail the three criteria:

C1 Stability
Similar clusterings should be produced for similar versions of a software system. This criterion emphasizes the persistence of the clustering structure across successive versions of an evolving software system. Under conditions of small and incremental change between consecutive versions, an algorithm should be stable, i.e., it should produce clusterings similar to those of prior months.

C2 Authoritativeness
Clusterings produced by an algorithm should resemble a clustering from some authority. An authoritative clustering may be produced by a (human) architect. It may also be derived from the directory structure of the target system. In the latter case, authoritativeness can be seen as adherence to the source folder structure.

C3 Extremity of Cluster Distribution
The cluster size distribution of a clustering should not exhibit extremity. In particular, a clustering algorithm should avoid the following two situations: (1) the majority of files are grouped into one or a few huge clusters (sometimes called black holes), and (2) the majority of clusters are singletons (forming what are sometimes called dust clouds).
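To make the CL/SL configurations of Section 3.2 concrete, the following sketch shows an agglomerative clustering based on the Jaccard coefficient, a complete or single linkage update rule, and a dendrogram cut at a fixed height, written with SciPy. It is an illustration under assumed inputs (a made-up binary file-by-feature matrix), not Anquetil's implementation.

```python
# A minimal sketch in the spirit of CL75/SL75: Jaccard distance,
# complete or single linkage, dendrogram cut at a fixed height.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_files(features, method="complete", cut_height=0.75):
    """features: binary matrix, one row per file, one column per feature
    (e.g., referenced identifiers). Returns a flat cluster id per file."""
    distances = pdist(features, metric="jaccard")   # 1 - Jaccard coefficient
    dendrogram = linkage(distances, method=method)  # "complete" -> CL, "single" -> SL
    return fcluster(dendrogram, t=cut_height, criterion="distance")

files = np.array([[1, 1, 0, 0],   # file A
                  [1, 1, 1, 0],   # file B (shares most features with A)
                  [0, 0, 1, 1]])  # file C
print(cluster_files(files, method="complete", cut_height=0.75))  # e.g. [1 1 2]
```

With complete linkage and the cut at 0.75, A and B (Jaccard distance 1/3) end up in one cluster while C remains separate; raising the cut-point merges clusters earlier and therefore yields fewer, larger clusters, which is the difference between the *75 and *90 configurations.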

Based on these criteria, we conducted three comparisons to examine how the algorithms chosen for this study differ from one another. Detailed information on these comparisons is presented in Section 5.

4 A Simple Ordinal Measure

To support our analysis of clustering algorithms over consecutive versions of a target software system, we created an ordinal measure for ranking a number of data series. A data series refers to a sequence of quantitative values, for example, {1, 2, 3, 4}.

For series S_i and S_j, we define aboveOf(S_i, S_j) and belowOf(S_i, S_j) as:

$$\mathrm{aboveOf}(S_i, S_j) = \frac{|\{\, n \mid S_i[n] > S_j[n],\ 1 \le n \le |S_i| \,\}|}{|S_i|}$$

$$\mathrm{belowOf}(S_i, S_j) = \frac{|\{\, n \mid S_i[n] < S_j[n],\ 1 \le n \le |S_i| \,\}|}{|S_i|}$$

If we think of S_i and S_j as lines of points, the function aboveOf denotes the proportion of the line formed using S_i that lies above the line formed using S_j. A similar explanation can be given for belowOf. We say that S_i is above S_j if aboveOf(S_i, S_j) > aboveOf(S_j, S_i).

Given K data series S_1, S_2, ..., S_K, we have the following equations for measuring the relative position of a particular S_i with regard to all of the K data series:

$$\mathrm{aboveOf}(S_i) = \frac{1}{K} \sum_{j=1}^{K} \mathrm{aboveOf}(S_i, S_j) \qquad (1)$$

$$\mathrm{belowOf}(S_i) = \frac{1}{K} \sum_{j=1}^{K} \mathrm{belowOf}(S_i, S_j) \qquad (2)$$

In the next section, we adapt these two equations to obtain different orderings of the clustering algorithms in terms of stability, authoritativeness, and extremity.
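The following is a minimal sketch of this ordinal measure (Equations 1 and 2), under the assumption that all compared data series have the same length; the example series are made up.

```python
# Sketch of the ordinal measure of Section 4 (Equations 1 and 2).
def above_of(s_i, s_j):
    """Fraction of positions where series s_i lies above series s_j."""
    return sum(a > b for a, b in zip(s_i, s_j)) / len(s_i)

def below_of(s_i, s_j):
    """Fraction of positions where series s_i lies below series s_j."""
    return sum(a < b for a, b in zip(s_i, s_j)) / len(s_i)

def above_of_all(s_i, all_series):
    """Equation 1: average of above_of(s_i, s_j) over all K series."""
    return sum(above_of(s_i, s_j) for s_j in all_series) / len(all_series)

def below_of_all(s_i, all_series):
    """Equation 2: average of below_of(s_i, s_j) over all K series."""
    return sum(below_of(s_i, s_j) for s_j in all_series) / len(all_series)

series = {"A": [3, 4, 5, 6], "B": [1, 2, 3, 4], "C": [2, 2, 6, 1]}
print({name: round(below_of_all(s, list(series.values())), 2)
       for name, s in series.items()})
# {'A': 0.08, 'B': 0.5, 'C': 0.33} -- B lies lowest overall
```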

5 Experiments and Results

This section describes the results we obtained.

5.1 Stability Comparison

Tzerpos defines a stability measure based on the ratio of the number of "good" clusterings to the total number of clusterings produced by a clustering algorithm [25]. He considers a clustering obtained from a perturbed version of the system to be good if the MoJo dissimilarity between that clustering and the one obtained from the original version of the system is not greater than 1% of the total number of resources (i.e., source files). We feel that it is difficult to determine a proper threshold in the context of real-life software evolution, and 1% seems too optimistic in reality. We instead used our ordinal measure (defined in Section 4) to derive a stability ordering of an algorithm in comparison with the other algorithms. In addition, we evaluated an algorithm's stability with three ordinal values: High, Medium, and Low. We refer to the latter method as the HML ordinal evaluation.

[Figure 2: Intra-sequence clustering comparison. Consecutive clusterings C1, C2, C3, ..., Cn are compared pair-wise, yielding similarity values Sim1, Sim2, Sim3, ..., Simn-1.]

Before presenting our stability measure, we explain how to create similarity sequences using intra-sequence comparisons of consecutive clusterings, as shown in Figure 2. Given a sequence of n clusterings, we have a sequence of n-1 similarity values denoted as Sim_1, Sim_2, ..., Sim_{n-1}. For a target system, a comparison of its two consecutive clusterings yields a similarity or dissimilarity value depending on the selection of similarity measures. For example, MoJo produces dissimilarity values [23], while EdgeSim produces similarity values [14]. Since some similarity measures like MoJo are not symmetric, we need to perform the directional comparison in the same direction as the target system evolves. This is desirable if we think of a clustering as an entity moving along its own lifetime path from inception to death. We choose MoJo for its simplicity. Correspondingly, we have Sim_i defined to be MoJo(C_i, C_{i+1}), where i = 1, ..., (n-1).

A measure for relative stability

Given a target system S and K clustering algorithms A_1, A_2, ..., A_K, we use MJS(A_i, S) to denote the sequence of MoJo values calculated by intra-sequence comparison over the sequence of clusterings obtained by A_i from S. A relative score of A_i over A_j is defined as:

$$\mathrm{Score}_{MJ}(A_i, A_j, S) = \mathrm{belowOf}(\mathrm{MJS}(A_i, S),\ \mathrm{MJS}(A_j, S))$$

The relative score of A_i over all the K clustering algorithms is defined as:

$$\mathrm{Score}_{MJ}(A_i, S) = \mathrm{belowOf}(\mathrm{MJS}(A_i, S)) \qquad (3)$$

where belowOf is given in Equation 2 (computed over the MoJo series of all K algorithms) and MJ stands for MoJo. We say that A_i is more stable than A_j with regard to system S if Score_MJ(A_i, S) > Score_MJ(A_j, S). When comparing K algorithms, we use Equation 3 to calculate a stability score for each algorithm. Using the K obtained scores, we can determine a relative stability ordering of the K algorithms. The greater the score, the more relatively stable the algorithm.
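The sketch below shows how Equation 3 can be used to rank algorithms by relative stability; the MoJo series are invented for illustration and do not come from the experiments.

```python
# Sketch of the relative stability score of Equation 3.
def below_of(s_i, s_j):
    return sum(a < b for a, b in zip(s_i, s_j)) / len(s_i)

def stability_scores(mojo_series):
    """Score_MJ(A, S) = belowOf(MJS(A, S)) over the MoJo series of all
    K algorithms. MoJo is a dissimilarity, so a lower-lying series means
    a more stable algorithm, and a greater score means greater stability."""
    all_series = list(mojo_series.values())
    return {alg: sum(below_of(s, other) for other in all_series) / len(all_series)
            for alg, s in mojo_series.items()}

mojo_series = {"Bunch": [400, 420, 410], "CL90": [180, 190, 185],
               "SL90": [40, 35, 30]}   # made-up MJS(A, S) values
ranking = sorted(stability_scores(mojo_series).items(), key=lambda kv: kv[1])
print(ranking)   # least stable first, most stable last
```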

A HML-based ordinal measure

Based on the data series of system growth, SG, we create two new data series: SG10 and SG30. SG10 is proportional to SG and accounts for 10% of the number of source files at each data point in SG; correspondingly, SG30 accounts for 30%. The three data series SG10, SG30, and SG divide the area below SG into three smaller regions. If the MoJo series MJS(A, S) has at least 80% of its data points below SG10, we say the stability of A is H (High). If MJS(A, S) has at least 80% below SG30, A has a score of M (Medium). Otherwise, A has a score of L (Low):

$$\mathrm{HML}_{MJ}(A, S) = \begin{cases} \mathrm{H} & \text{if } \mathrm{belowOf}(\mathrm{MJS}(A,S),\ SG10) \ge 0.8 \\ \mathrm{M} & \text{else if } \mathrm{belowOf}(\mathrm{MJS}(A,S),\ SG30) \ge 0.8 \\ \mathrm{L} & \text{otherwise} \end{cases} \qquad (4)$$

With this measure, we are actually measuring how stable an algorithm is with regard to a particular target system, rather than simply stating which algorithm is more stable than another. For example, if a clustering algorithm has a score of H, it means that the number of MoJo operations (i.e., moves and joins) is less than 10% of the total number of source files in the target system for at least 80 out of every 100 transformations between pair-wise consecutive clusterings produced by that algorithm.

Plots of MoJo series

Figure 3 shows a plot of the MoJo data series calculated for each of the six clustering algorithms with respect to PGSQL. From the figure, we can see that the MoJo values associated with Bunch are greater than the corresponding MoJo values associated with the other algorithms. Recalling that MoJo is a dissimilarity measure, we can tell that Bunch is less stable than the other five algorithms. We examined the MoJo series for NAHC and SAHC. They exhibit behavior similar to that shown for Bunch in Figure 3. We omit them from this paper for the sake of brevity.

The data series associated with algorithms CL90, CL75, and ACDC reside in the middle of Figure 3. They appear intertwined at some locations. We can roughly see that ACDC appears more stable than CL75 and CL90. SL90 and SL75 are at the bottom of the figure. Clearly, they are the most stable algorithms.

[Figure 3: Stability comparison wrt PGSQL. MoJo values between consecutive monthly clusterings for the six algorithms over 73 versions, plotted together with the system growth series SG and the derived series SG30 and SG10.]

Ordinal evaluation

In Figure 3, when data series are highly intertwined it can be difficult to obtain an ordering of several clustering algorithms based on visual perception. A quantitative method is needed. Using Equation 3, we calculated relative stability scores for the six studied algorithms with regard to each target system as well as the concatenation of all five target systems. When comparing algorithms with regard to a concatenation of different systems, we actually concatenated the similarity series from all those systems for each clustering algorithm and then applied Equation 3. The scores are given in Table 2. Those obtained using concatenation are listed in the ALL column and are used to determine the overall stability ordering.

Table 2: Relative stability scores obtained using MoJo. The greater the score, the more relatively stable the algorithm. (Rows, in increasing order of overall stability: Bunch, CL90, CL75, ACDC, SL75, SL90; columns: KSDK, KOffice, OpenSSL, PGSQL, Ruby, ALL.)

All the scores in Table 2 seem to tell the same story. In the direction of increasing stability, the chosen algorithms can be ordered as: Bunch, CL90, CL75, ACDC, SL75 and SL90. There is an exception with regard to OpenSSL, where CL75 appears to be more stable than ACDC since the former scored 2.46 but the latter scored only 2.08 (see Table 2). The cells shaded in light gray in Table 2 indicate this disagreement.

Similarly, we calculated an HML ordering for the algorithms. The obtained scores are given in Table 3. From the table, we can see that SL90, SL75 and ACDC are highly stable algorithms, while Bunch is not.

Table 3: HML-based stability scores obtained using MoJo

           KSDK   KOffice   OpenSSL   PGSQL   Ruby   ALL
Bunch      L      L         L         L       L      L
CL90       M      M         M         M       M      M
CL75       M      M         M         M       M      M
ACDC       M      H         M         M       H      H
SL75       H      H         H         H       H      H
SL90       H      H         H         H       H      H
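The sketch below illustrates the HML evaluation of Equation 4. The pairing of each MoJo value with the size of the newer version in the compared pair, and the helper names, are our assumptions; the numbers are made up.

```python
# Sketch of the HML ordinal evaluation (Equation 4). system_growth is the
# number of source files per version; mojo_series is MJS(A, S).
def hml_stability(mojo_series, system_growth, threshold=0.8):
    """Return 'H', 'M' or 'L' for one algorithm on one target system."""
    sg10 = [0.10 * size for size in system_growth[1:]]  # band per compared pair
    sg30 = [0.30 * size for size in system_growth[1:]]
    def fraction_below(series, band):
        return sum(v < b for v, b in zip(series, band)) / len(series)
    if fraction_below(mojo_series, sg10) >= threshold:
        return "H"
    if fraction_below(mojo_series, sg30) >= threshold:
        return "M"
    return "L"

growth = [800, 810, 820, 830, 840]   # source files per monthly version
mojo = [60, 70, 65, 250]             # MoJo between consecutive versions
print(hml_stability(mojo, growth))   # 'M' for this made-up series
```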

To rule out the possibility that the obtained stability ordering is biased by the use of MoJo, we measured the ordering of these algorithms using other similarity measures, including EdgeMoJo [26], EdgeSim and MeCl [14]. We found that EdgeSim strongly agrees with MoJo, but the other two measures slightly disagree with MoJo on the ordering of CL75 and ACDC and rank CL75 as relatively more stable than ACDC. The experimental results obtained using other similarity measures are reported in [28]. Given that our results were obtained using evolution data from several hundred versions of multiple real-life systems and were further verified by multiple similarity measures, we can safely state that the following stability order holds:

Low: Bunch.  Medium: CL90, CL75.  High: ACDC, SL75, SL90.

5.2 Authoritativeness Comparison

Unfortunately, a stable algorithm may not produce meaningful clusterings at all. For example, at one extreme, an algorithm that produces only singleton clusters is stable over time but is not useful in practice. At the other extreme, an algorithm that groups all source files into one notoriously large cluster is obviously stable but not meaningful at all. A clustering algorithm of practical use should produce clusterings similar to authoritative decompositions of a software system by experienced software engineers [3]. However, an agreed-upon architecture or authoritative decomposition of a complex software system often does not exist. This makes it difficult to perform an authoritativeness comparison.

In an attempt to perform authoritativeness analysis, we used a simple technique to create authoritative clusterings. Our technique comprises four steps: (1) create the subsystem hierarchy based on the directory structure; (2) relocate every header file (.h) to the subsystem that directly contains the related implementation file (.c); (3) merge a subsystem with its parent if it contains less than five files; (4) create a flat clustering with each subsystem in the remaining hierarchy as a cluster. For well structured software systems like KOffice and PGSQL, this technique is likely to produce a clustering which conforms to the mental model of the developers of the system. It can be easily automated to cluster a large number of versions.

We performed inter-sequence comparisons of clusterings from two parallel sequences. One sequence comprises clusterings produced by the algorithm under analysis, and the other contains authoritative clusterings obtained using the technique described above. Figure 4 shows how inter-sequence comparisons are done over time. The similarity Sim_i is calculated as MoJo(C_i, C_iA). Based on the obtained similarity series Sim_1, Sim_2, ..., Sim_n, we carried out the authoritativeness comparison using the same scoring methods as for the stability analysis (see Equations 3 and 4).

[Figure 4: Inter-sequence clustering comparison. Each clustering C_i in the sequence produced by an algorithm is compared with the corresponding authoritative clustering C_iA, yielding similarity values Sim_1, Sim_2, ..., Sim_n.]

Plots of MoJo series

Figure 5 shows the MoJo series we obtained from PGSQL. At the top of the figure, there are two MoJo series associated with SL75 and SL90. Their data values are almost equal to the number of source files. This indicates that nearly every source file in PGSQL needs to be moved or joined in order to transform the clusterings produced by SL75 and SL90 into the corresponding authoritative clusterings. Given that PGSQL is well structured¹, we can tell that SL75 and SL90 produce clusterings bearing little similarity to the implemented structure of PGSQL.

[Figure 5: Authoritativeness comparison wrt PGSQL. MoJo distances between each algorithm's clusterings and the authoritative clusterings over the monthly versions, plotted together with the system growth series SG and the derived series SG50 and SG20.]

The MoJo series for ACDC and Bunch reside in the middle of Figure 5, with the former staying mostly above the latter. This shows that ACDC is slightly less authoritative than Bunch. However, neither of them produces clusterings resembling the implemented structure of PGSQL. Roughly speaking, 60-70% of all the source files in PGSQL need to be moved or joined in order to transform the clusterings produced by Bunch or ACDC into their authoritative counterparts. This may be due to the fact that the clusterings produced by Bunch or ACDC are much coarser (containing fewer clusters) than their corresponding authoritative clusterings.

The most authoritative algorithms, CL75 and CL90, are at the bottom of Figure 5. Their data series are intertwined and even overlap. They are slightly better than Bunch. We also manually examined the plots for the other target systems. They roughly confirmed the above observations. We omit them from this paper for the sake of brevity.

¹ The PostgreSQL project has six Core Members (architects) who have strict control over the source code structure [16].
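The sketch below illustrates the four-step, directory-based technique for deriving authoritative clusterings. The file paths are hypothetical, and step 3 is simplified to a single level of merging.

```python
# Sketch of the directory-based authoritative clustering technique (Section 5.2):
# build subsystems from directories, move each .h next to its .c, merge small
# subsystems upward, and flatten the remaining subsystems into clusters.
import os
from collections import defaultdict

def authoritative_clustering(paths, min_size=5):
    # Steps 1 and 2: assign each file to a directory, relocating foo.h
    # to the directory of foo.c when such an implementation file exists.
    impl_dir = {os.path.splitext(os.path.basename(p))[0]: os.path.dirname(p)
                for p in paths if p.endswith(".c")}
    clusters = defaultdict(set)
    for p in paths:
        base, ext = os.path.splitext(os.path.basename(p))
        directory = impl_dir.get(base, os.path.dirname(p)) if ext == ".h" else os.path.dirname(p)
        clusters[directory].add(p)
    # Step 3 (one level only in this sketch): merge a subsystem into its
    # parent directory if it contains fewer than min_size files.
    for directory in sorted(clusters, key=lambda d: d.count("/"), reverse=True):
        if len(clusters[directory]) < min_size and "/" in directory:
            clusters[os.path.dirname(directory)] |= clusters.pop(directory)
    # Step 4: the remaining subsystems form a flat clustering.
    return dict(clusters)

files = ["src/parser/parse.c", "include/parse.h", "src/parser/scan.c",
         "src/tcop/utility.c", "src/tcop/utility.h", "src/misc/env.c"]
print(authoritative_clustering(files, min_size=2))
# src/misc holds a single file, so it is merged into its parent 'src'.
```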

Ordinal evaluation

We provide the obtained authoritativeness scores in Table 4. The greater the score, the more the clusterings produced by the algorithm resemble the implemented system structure, and consequently the more authoritative the algorithm turns out to be. The scores in the ALL column determine an overall ordering of the studied algorithms. The cells colored in light gray in Table 4 indicate disagreement between the overall ordering and the one obtained with respect to a particular target system.

Table 4: Authoritativeness scores calculated using MoJo. The greater the score, the more the clusterings produced by the algorithm resemble the implemented system structure. (Rows, in increasing order of overall authoritativeness: SL90, SL75, ACDC, Bunch, CL75, CL90; columns: KSDK, KOffice, OpenSSL, PGSQL, Ruby, ALL.)

To derive the HML-based ordering, we chose SG20 and SG50 instead of SG10 and SG30, to relax the requirements on how closely a clustering should resemble the implemented structure. We found that all the algorithms we studied were ranked Low. For this reason, we omit the obtained HML scores. The final ordering is shown below. None of these algorithms is satisfactory in producing clusterings that approximate the implemented structure of any of the five target systems.

Low (all six algorithms), in increasing order of authoritativeness: SL90, SL75, ACDC, Bunch, CL75, CL90.

We verified this ordering using three similarity measures (EdgeMoJo, EdgeSim and MeCl), and found that they disagree with MoJo only on the ordering of CL75 and CL90. The experimental results are described in detail in [28].

5.3 Extremity Comparison

We studied cluster distributions over time in an attempt to examine whether a particular clustering algorithm avoids generating huge clusters (black holes) or many very small clusters (dust clouds).

We again base our discussion on PGSQL. Figure 6 displays six bubble charts. In each of these charts, the x axis represents software versions, the y axis represents cluster size, and the size of a bubble denotes the number of clusters of the same size. For example, the clustering we obtained using SL90 from version 72 of PGSQL has 14 clusters, in which 13 clusters are singletons and one huge cluster comprises 934 source files. This clustering is represented by two bubbles in Fig. 6(b), located at the coordinates (72, 1) and (72, 934) respectively. The larger bubble is 13 times as big as the smaller one, even though the latter (a black hole) represents a cluster of size 934. Though the x and y axes may not use consistent scales across charts, all the bubbles share the same scale of measurement.

Figure 6 shows cluster distributions for every sixth version in order to avoid the overlapping of bubbles in the x direction. Each algorithm exhibits a distinct cluster distribution pattern. From Figures 6(a), 6(b), and 6(f), we see that SL90, SL75, and ACDC tend to produce extreme clusters, either huge or very small. By contrast, Bunch, CL75, and CL90 produce more evenly distributed clusters. However, CL75 has a tendency to generate a relatively large number of singleton clusters, as shown in Fig. 6(c).

A measure for non-extreme cluster distribution

In order to quantitatively evaluate the extremity of a cluster distribution, we defined a simple measure called NED (non-extreme distribution). In this experiment, we assumed that any cluster of size less than 5 or greater than 100 is an extreme cluster. Such an assumption is reasonable with regard to the five target systems, since very few clusters in the obtained authoritative clusterings contain less than five or more than a hundred source files.

NED is defined as the ratio of the number of source files contained in non-extreme clusters to the total number of source files in the target system. Clearly, the larger the NED value, the better the distribution.

Figure 7 shows the NED data series we obtained for each studied algorithm with regard to PGSQL and KOffice. The NED data series obtained from KSDK, OpenSSL, and Ruby share very similar patterns with those in Figure 7(a), so they are omitted for brevity. From the figure, we can see that Bunch has the best cluster distribution. However, Bunch does not perform as well on KOffice as it does on PGSQL. The SL75 and SL90 algorithms always group the majority of source files into extreme clusters. The NED data series obtained for the other three algorithms mostly remain in the middle of the figure. In Fig. 7(b), the NED series for CL75 and CL90 show a downward trend as KOffice grows in size. An examination of the cluster distributions reveals that both algorithms start to form black holes.
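A minimal sketch of the NED measure follows, reusing the SL90 clustering of PGSQL version 72 described above as the worked example.

```python
# Sketch of NED (non-extreme distribution): the fraction of source files
# that fall into clusters whose size is within the assumed range [5, 100].
def ned(cluster_sizes, lower=5, upper=100):
    """cluster_sizes: list with the number of files in each cluster."""
    total = sum(cluster_sizes)
    non_extreme = sum(size for size in cluster_sizes if lower <= size <= upper)
    return non_extreme / total if total else 0.0

# The SL90 clustering of PGSQL version 72: 13 singletons plus one cluster of 934 files.
sizes = [1] * 13 + [934]
print(round(ned(sizes), 3))   # 0.0 -- every file sits in an extreme cluster
```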

[Figure 6: Bubble chart based distribution comparison of clustering algorithms wrt PGSQL. Each chart plots cluster size (y axis) against version (x axis), with bubble size indicating the number of clusters of that size. (a) Distributions obtained using SL75; (b) Distributions obtained using SL90; (c) Distributions obtained using CL75; (d) Distributions obtained using CL90; (e) Distributions obtained using Bunch; (f) Distributions obtained using ACDC.]

[Figure 7: NED-based distribution comparison of clustering algorithms wrt PGSQL and KOffice. NED (percentage) is plotted over versions for the six algorithms together with the reference series P75 and P50. (a) NED-based distribution comparison wrt PGSQL; (b) NED-based distribution comparison wrt KOffice.]





We extend NED to derive a relative ordering and an HML ordering of non-extreme cluster distribution. We have two scoring methods, defined as follows:

$$\mathrm{Score}_{NE}(A_i, A_j, S) = \mathrm{aboveOf}(\mathrm{NES}(A_i, S),\ \mathrm{NES}(A_j, S))$$

$$\mathrm{Score}_{NE}(A_i, S) = \mathrm{aboveOf}(\mathrm{NES}(A_i, S)) \qquad (5)$$

$$\mathrm{HML}_{NE}(A, S) = \begin{cases} \mathrm{H} & \text{if } \mathrm{aboveOf}(\mathrm{NES}(A,S),\ P75) \ge 0.8 \\ \mathrm{M} & \text{else if } \mathrm{aboveOf}(\mathrm{NES}(A,S),\ P50) \ge 0.8 \\ \mathrm{L} & \text{otherwise} \end{cases} \qquad (6)$$

where A_i (i = 1, ..., K) denotes the K algorithms and NES(A_i, S) denotes the NED data series obtained from the clusterings produced by A_i with regard to system S. The aboveOf function is defined in Equation 1 (computed over the NED series of all K algorithms), and P75/P50 refers to a constant data series comprising only the value 0.75/0.50. Score_NE evaluates whether one NES lies above another; if so, the algorithm producing the former NES has a better cluster distribution than the algorithm producing the latter. HML_NE evaluates whether an algorithm has a genuinely good cluster distribution: if A has a score of H, it means that at least 80% of the clusterings produced by A have a NED value greater than 0.75.

Ordinal evaluation

We conducted a similar ordinal evaluation as we did for stability and authoritativeness. For brevity, we only give the final NED-based ordering below, omitting the relative scores and HML scores.

Low: SL90, SL75, ACDC, CL75.  Medium: CL90.  High: Bunch.

5.4 Summary

Table 5 provides a summary of the ordinal evaluation we obtained for each of the studied algorithms. This table serves as the main basis for discussion in the next section.

Table 5: A summary of ordinal evaluation

Algorithms   Stability   Authoritativeness   Non-extremity
CL75         Medium      Low                 Low
CL90         Medium      Low                 Medium
SL75         High        Low                 Low
SL90         High        Low                 Low
ACDC         High        Low                 Low
Bunch        Low         Low                 High

6 Discussion of the Results

This section discusses some of the obtained experimental results. In particular, we focus on the following observations.

SL75/SL90, the most stable but the least useful

From Figures 6(a) and 6(b), we can see that SL75 and SL90 tend to merge smaller clusters (including singletons) one by one into one super large cluster. As the height of the cut point increases from 0.75 to 0.90, the number of clusters in the produced clustering decreases, and the size of the super large cluster increases. We can think of the super large cluster as a black hole, which eventually attracts every small cluster surrounding it. The number of source files consumed by the black hole accounts for about 84.8-86.3% of the total number of source files in the case of SL75 and 79.8-98.7% in the case of SL90. This explains why SL75 and SL90 appear more stable than the other four algorithms.

Once the majority of the source files are put into one super large cluster, the resulting clustering bears little similarity to the directory-based authoritative clustering. In terms of MoJo, the number of moves and joins is almost equal to the number of source files, because nearly every file in the super large cluster needs to be moved to a directory-based authoritative cluster and every singleton cluster needs to be joined. This explains why SL75 and SL90 are the least useful. A clustering that places the majority of source files into one black hole cluster is equivalent to no clustering.

Algorithms in CL less stable than those in SL

Because of the complete linkage update rule, the algorithms from the CL class are known to produce more compact clusters than those from the SL class. Raghavan and Yu have shown that graph theoretic clustering methods that produce more compact clusters are less stable [20]. The stability ordering we obtained agrees with these theoretical results (see Section 5.1).

Bunch, the least stable with the best distribution

In our configuration of Bunch, every time a better neighbor is sought, 20% of the entire search space is randomized and at least 55% is examined in order to find a better MQ (Modularization Quality) value [13]. Consequently, the hill climbing algorithm in Bunch does not behave deterministically, and it can even produce different decompositions for the same input graph. By contrast, the other algorithms use less randomization.

Bunch (including NAHC and SAHC) produces the best cluster distribution. It favors neither a group of very small clusters nor an overwhelmingly large cluster. This can be seen from Figure 6(e).

No gain for ACDC from overusing program patterns

Since ACDC is a program pattern based algorithm [24], we had expected that it would produce clusterings more similar to the authoritative clusterings we obtained. However, it did not meet our expectations.

ACDC internally has an upper limit on the size of an obtained cluster. By default this limit is set to 20. However, Figure 6(f) shows that ACDC produces both very large clusters and very small ones. To investigate what goes wrong in ACDC, we manually examined the clusterings obtained from versions 12, 24, 36, 48, 60, and 72 of PGSQL. We found two main problems regarding the use of program patterns in ACDC.

1. The body-header pattern [24] results in the grouping of an interface file (.h) and its implementation file (.c). However, the clustering of the obtained body-header pairs is not carried far enough. This results in a relatively large number of small clusters of size 2.

2. The subgraph dominator pattern [24] is somewhat overused. For example, the file utility.c located in the directory tcop was recognized by ACDC as a dominator. After we removed utility.c from the input graph, ACDC cut the size of the largest cluster approximately by 50%. After we further removed 2 or 3 dominators from the input graph, no cluster in the obtained clustering contained more than 100 source files. We suspect that ACDC has internal defects in processing subgraph dominators.

7 Related Work

Current research is mainly focused on designing algorithms that partition large software systems into meaningful subsystems. For example, early work by Belady et al. identified automatic clustering as a means to produce high-level views of large software systems [2]. Bunch has evolved to become a suite of algorithms that fit into various contexts in reverse engineering [6, 13]. In addition, a variety of evaluation frameworks have been proposed to evaluate the quality of clustering techniques [10, 11]. However, clustering techniques have not been sufficiently tested and evaluated against lifetime versions of systems from diverse domains. In particular, the stability of clustering algorithms has not attracted much attention.

Tzerpos and Holt examined the stability of a number of clustering algorithms by generating randomly perturbed versions of an example system and measuring differences between the obtained clusterings and the one obtained from the original version of the system [25]. In this paper, we argue that random perturbation of a fixed-size system is insufficient to simulate how changes occur in a real world software system, since changes to software systems rarely occur in a random fashion and most software systems are continuously growing in size. To be faithful to reality, we conducted the stability comparison on evolution data extracted from several target systems. In addition, our work shows that stability comparison of clustering algorithms should be augmented with an analysis of meaningfulness (adherence to authority and non-extreme distribution). Otherwise, the results of a stability comparison could be misleading (i.e., they show only one side of the coin).

8 Conclusions

In this paper we have asked the question of how useful clustering algorithms might be for large software systems undergoing evolutionary change.

In an attempt to gain insight into this question, we investigated the effectiveness of six clustering algorithms which represent a range of clustering techniques and which are supported by available batch implementations. We selected five open source software systems, each having roughly 70 monthly versions. These systems represent a range of applications, so they can be expected to be a reasonable testing base for clustering algorithms.

We proposed three criteria to evaluate the usefulness of software clustering algorithms. We ran experiments to measure how well each clustering algorithm satisfies these criteria. Table 5 provides a brief view of the overall results we obtained. It indicates that:

- On the stability criterion (Are successive versions of a target system given similar clusterings?), three algorithms (SL75, SL90 and ACDC) are ranked as having high quality, two (CL75 and CL90) as medium, and the remaining algorithm (Bunch) as low.

- On the authoritativeness criterion (Do the clusterings reasonably approximate the implemented structure of the target system?), all six algorithms are ranked as having low quality.

- On the extremity criterion (Are non-extreme cluster distributions normally produced?), Bunch is ranked as high, CL90 as medium, and the remaining four (CL75, SL75, SL90 and ACDC) as low.

Although SL75 and SL90 ranked high on stability, it appears that this is largely because they repeatedly produced black holes (overly large clusters) or dust clouds (many too small clusters), as indicated by their low NED ranking. This suggests that in practice these two algorithms may not be that helpful.

The fact that all six algorithms are ranked low on authoritativeness suggests that they may not be mature enough for use in production on large systems undergoing evolutionary change. However, it is also possible that our technique for generating authoritative clusterings is biased toward the as-implemented structure of the target system.

These results are discouraging, suggesting that, for large systems such as those we used as a basis for our testing, and for the criteria we chose, more work needs to be done before these clustering algorithms are ready to be widely adopted. However, it may be that such algorithms are useful in less stringent environments. For example, these algorithms may be useful in a reverse engineering exercise, by producing a basic partitioning of a particular version of a target system, thus eliminating a significant amount of manual effort.

Our hope is that our results spur further efforts both to create and improve automated clustering algorithms and to subject these algorithms to empirical evaluations such as we have reported in this paper.

References

[1] N. Anquetil and T. Lethbridge. Experiments with clustering as a software remodularization method. In Proceedings of the 6th Working Conference on Reverse Engineering, pages 235-255, Atlanta, Georgia, USA, October 1999.
[2] L. A. Belady and C. J. Evangelisti. System partitioning and its measure. The Journal of Systems and Software, 2:23-29, 1981.
[3] I. T. Bowman, R. C. Holt, and N. V. Brewster. Linux as a case study: Its extracted software architecture. In Proceedings of the 21st International Conference on Software Engineering, pages 555-563, Los Angeles, California, May 1999.
[4] Cscope. Website, 2004. http://cscope.sourceforge.net.
[5] Ctags. Website, 2004. http://ctags.sourceforge.net.
[6] D. Doval, S. Mancoridis, and B. Mitchell. Automatic clustering of software systems using a genetic algorithm. In Proceedings of the International Conference on Software Technology and Engineering Practice, pages 73-91, Pittsburgh, PA, September 1998.
[7] S. G. Eick, T. L. Graves, A. F. Karr, J. Marron, and A. Mockus. Does code decay? Assessing the evidence from change management data. IEEE Transactions on Software Engineering, 27(1):1-12, January 2001.
[8] KDE. K Desktop Environment. Website, 2004. http://www.kde.org.
[9] KOffice. Website, 2004. http://www.koffice.org.
[10] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. In Proceedings of the 8th International Workshop on Program Comprehension, pages 201-210, Limerick, Ireland, June 2000.
[11] A. Lakhotia and J. M. Gravley. A unified framework for expressing software subsystem classification techniques. The Journal of Systems and Software, 36(3):211-231, March 1997.
[12] R. Lange and R. W. Schwanke. Software architecture analysis: A case study. In Proceedings of the 3rd International Workshop on Software Configuration Management, pages 19-28, Trondheim, Norway, June 1991.
[13] S. Mancoridis, B. Mitchell, Y. Chen, and E. Gansner. Bunch: A clustering tool for the recovery and maintenance of software system structures. In Proceedings of the 15th International Conference on Software Maintenance, pages 50-59, Oxford, England, September 1999.
[14] B. Mitchell and S. Mancoridis. Comparing the decompositions produced by software clustering algorithms using similarity measurements. In Proceedings of the 17th International Conference on Software Maintenance, pages 744-753, Florence, Italy, November 2001.
[15] H. A. Müller, M. A. Orgun, S. R. Tilley, and J. S. Uhl. A reverse engineering approach to system structure identification. Journal of Software Maintenance: Research and Practice, 5(4):181-204, December 1993.
[16] K. Nakakoji, Y. Yamamoto, Y. Nishinaka, K. Kishida, and Y. Ye. Evolution patterns of open-source software systems and communities. In Proceedings of the International Workshop on Principles of Software Evolution, pages 76-85, Orlando, Florida, May 19-20 2002.
[17] OpenSSL. Website, 2004. http://www.openssl.org.
[18] D. L. Parnas. Software aging. In Proceedings of the 16th International Conference on Software Engineering, pages 279-287, Sorrento, Italy, May 16-21 1994.
[19] PostgreSQL. A free open source database system. Website, 2003. http://www.postgresql.org.
[20] V. V. Raghavan and C. Yu. A comparison of the stability characteristics of some graph theoretic clustering methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3.
[21] Ruby. Website, 2004. http://www.ruby-lang.org.
[22] R. W. Schwanke. An intelligent tool for re-engineering software modularity. In Proceedings of the 13th International Conference on Software Engineering, pages 83-92, Austin, Texas, United States, May 1991.
[23] V. Tzerpos and R. C. Holt. MoJo: A distance metric for software clusterings. In Proceedings of the 6th Working Conference on Reverse Engineering, pages 187-193, Atlanta, Georgia, USA, October 1999.
[24] V. Tzerpos and R. C. Holt. ACDC: An algorithm for comprehension driven clustering. In Proceedings of the 7th Working Conference on Reverse Engineering, pages 258-267, Brisbane, Australia, November 2000.
[25] V. Tzerpos and R. C. Holt. On the stability of software clustering algorithms. In Proceedings of the 8th International Workshop on Program Comprehension, pages 211-218, Limerick, Ireland, June 2000.
[26] Z. Wen and V. Tzerpos. Evaluating similarity measures for software decompositions. In Proceedings of the 20th International Conference on Software Maintenance, pages 368-377, Chicago, IL, USA, September 2004.
[27] T. A. Wiggerts. Using clustering algorithms in legacy systems remodularization. In Proceedings of the 4th Working Conference on Reverse Engineering, pages 33-43, Amsterdam, The Netherlands, October 1997.
[28] J. Wu and R. C. Holt. Do clustering similarity measures agree with each other? 2005. To be submitted.
