Automated software architecture extraction

using graph-based clustering

by

John Thomas Chargo

A thesis submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Major: Computer Engineering

Program of Study Committee:

Suraj Kothari, Major Professor

Tien Nguyen

Joseph Zambreno

Iowa State University

Ames, Iowa

2012

Copyright John Thomas Chargo, 2012. All rights reserved.

DEDICATION

I would like to dedicate this thesis to the friends, family, and teachers who have inspired, supported, and motivated me.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT
CHAPTER 1. OVERVIEW
CHAPTER 2. BASICS OF CLUSTERING AND GRAPHS
CHAPTER 3. SOFTWARE CLUSTERING AND GRAPHS
CHAPTER 4. PROPOSED APPROACH
CHAPTER 5. RELATED RESEARCH
  5.1 Star Diagrams: Designing Abstractions out of Existing Code
  5.2 Clustering Software Using Knowledgebase
  5.3 Term Weighting Schemes for Labeling Clusters
  5.4 Weighted Combined Algorithm
  5.5 Hierarchical Clustering for Software Architecture Recovery
CHAPTER 6. EXPERIMENTAL SETUP
CHAPTER 7. PARTITIONAL ALGORITHM DEVELOPMENT
  7.1 Algorithm 1
  7.2 Algorithm 2
  7.3 Algorithm 3
  7.4 Algorithm 4
CHAPTER 8. RESULTS
  8.1 Partitional Clustering Algorithm
  8.2 Weighted Combined Algorithm
CHAPTER 9. CONCLUSIONS AND FUTURE WORK
APPENDIX A. PARTITIONAL ALGORITHM CLUSTER RESULTS
APPENDIX B. WEIGHTED COMBINED ALGORITHM RESULTS
BIBLIOGRAPHY

LIST OF TABLES

Table 8.1 Xinu Subsystem Truth Data
Table 8.2 Partitional Algorithm Precision and Recall Score
Table 8.3 Weighted Combined Algorithm Precision and Recall Score
Table A.1 Partitional Algorithm Xinu Clusters
Table B.1 Weighted Combined Algorithm Xinu Clusters

LIST OF FIGURES

Figure 5.1 Star Diagram
Figure 5.2 Weighted Combined Algorithm Feature Vector
Figure 5.3 Weighted Combined Algorithm Combined Feature Vector
Figure 5.4 Weighted Combined Algorithm vs Complete, Xfig f files Subsystem
Figure 7.1 Partitional Algorithm 1 Example
Figure 7.2 Partitional Algorithm 2 Example
Figure 7.3 Partitional Algorithm 3 Example
Figure 7.4 Partitional Algorithm 4 Example
Figure 8.1 Graph Viewer Output of Sample Clusters
Figure 8.2 Precision, Recall and F-Score of Xinu Analysis
Figure 8.3 Weighted Combined Algorithm Precision, Recall and F-Score

ACKNOWLEDGEMENTS

I would like to express my sincere thanks to those who helped me in conducting research and writing this thesis. First and foremost, I'd like to thank my major professor, Dr. Suraj Kothari, for his direction and guidance throughout the development of this thesis. His knowledge of software engineering and experience in guiding graduate research have been invaluable. I'd also like to thank my committee members, Dr. Tien Nguyen and Dr. Joseph Zambreno, for their contributions to this work.

ABSTRACT

As the size and complexity of software grow, developers have an ever-increasing need to understand software in a modular way. Most complex software systems can be divided into smaller modules if the developer has domain knowledge of the code or up-to-date documentation. If neither of these exists, discovery of code modules can be a tedious, manual process.

This research hypothesizes that graph-based clustering can be used effectively for automated software architecture extraction. We propose methods of representing relationships between program artifacts as graphs and then propose new partitional algorithms to extract software modules from those graphs. To validate our hypothesis and the partitional algorithms, a new set of tools, including a software data miner, cluster builder, graph viewer, and cluster score calculator, was created. This toolset was used to implement the partitional algorithms and analyze their performance in extracting modules. The Xinu operating system was used as a case study because it has defined modules that can be compared to the results of the partitional algorithm.

CHAPTER 1. OVERVIEW

As the size and complexity of software grow, developers have an ever-increasing need to understand software in a modular way. It is impractical for one person to understand all the code in a large system. However, most complex software systems can be divided into many smaller modules. For instance, an operating system might contain modules that handle tasks, file I/O, networking, or memory management. If a developer can identify these modules, the task of understanding the code becomes easier. A developer with knowledge of these modules who sets out to modify the software need not understand everything about all modules, only the module being modified and how it interacts with the other modules.

Frequently the existence of code modules is known to the developers only after they gain significant domain knowledge of the software. For instance, a developer working on operating system software would probably know there are file, network, and task scheduling subsystems.

Without significant domain knowledge discovering these modules can be a tedious process.

Cross-cutting connections further complicate this process by obscuring module boundaries.

This research focuses on methods to automate the discovery and identification of code modules.

Recently, clustering techniques borrowed from other fields have made their way into software engineering research. This research summarizes these existing methods in an overview of how clustering and graphs have been applied to software. It then explores the use of graphs for software clustering and visualization. Included is a description of a test bed that was developed for experimentation and analysis of clustering applied to software. This test bed allowed us to develop our own graph-based partitional clustering algorithm and compare it to another popular clustering algorithm, the weighted combined algorithm.

CHAPTER 2. BASICS OF CLUSTERING AND GRAPHS

In general, clustering is a process of grouping together inter-related items based on common properties or features of the items. By grouping together similar items, complex data sets can be divided into multiple smaller sets that are more manageable. As a practical application, these smaller sets could be divided amongst members of a team if the complex set is more than a single developer could manage. Successful grouping of common items can decrease the scope of data that must be understood or processed at one time and can therefore simplify complex systems.

Clustering of data has been applied within many disciplines as a method of analyzing data. In biology, taxonomy is used to cluster groups of biological organisms based on common properties. For instance, mammals are grouped together, reptiles are grouped together, and so on.

Through this clustering a biologist need not be an expert on all animal species. Instead he or she can understand properties of an animal based simply on the cluster in which it exists.

Clustering is similarly used in astrophysics, health sciences, and chemistry.

Distance is a common notion in clustering used as a heuristic for measuring how similar items are to each other. Items separated by a small distance are very similar, whereas those separated by a large distance are very different. Distance can be an abstract measure. For example, the distance between two strings could be how many characters they have in common, how similar the lengths of the strings are, or how close they are alphabetically. Regardless of the definition of distance, many clustering algorithms function by grouping items with the smallest distances together.
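
To make the notion concrete, here is a minimal sketch in Java (the language used for the tools developed later in this thesis) of one hypothetical string distance, the number of characters appearing in one string's character set but not the other's; the measure is purely illustrative and is not one used in this research:

    import java.util.HashSet;
    import java.util.Set;

    public class StringDistance {
        // Distance between two strings, measured as the number of characters
        // that appear in one string's character set but not the other's.
        static int distance(String a, String b) {
            Set<Character> charsA = new HashSet<>();
            Set<Character> charsB = new HashSet<>();
            for (char c : a.toCharArray()) charsA.add(c);
            for (char c : b.toCharArray()) charsB.add(c);

            Set<Character> union = new HashSet<>(charsA);
            union.addAll(charsB);
            Set<Character> common = new HashSet<>(charsA);
            common.retainAll(charsB);
            return union.size() - common.size(); // size of the symmetric difference
        }

        public static void main(String[] args) {
            // "graph" and "grape" share {g, r, a, p}; only 'h' and 'e' differ.
            System.out.println(distance("graph", "grape")); // prints 2
        }
    }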

There are exponentially many cluster combinations for any data set. Clustering algorithms work efficiently to create meaningful clusters: those with the smallest distances between items within a cluster and the maximum spacing between clusters [9].

Graphs are mathematical structures used to model relationships between objects. A graph contains vertices, or nodes, which represent entities, and edges, which represent the relationships between those entities. Relationships represented as edges can be physical: in a transportation network graph, nodes could be cities and edges the roads that connect them. Edges can also be abstract or virtual, representing some other relationship the nodes have in common. Graphs are commonly used in computer science because they allow for intuitive visualization of relationships by modeling vertices as dots or circles and edges as lines connecting those dots or circles.

Significant research has been done on the properties of graphs. Problems solved using graphs include graph coloring, routing, network flows, and covering.

Of particular relevance to this research is the use of graphs to represent and analyze flow networks. Flow networks, frequently used for transportation network diagrams, have weighted edges that connect a source node to a sink node through intermediate nodes. Flow problems include the maximum flow problem, which seeks the greatest total flow that can be routed from a source to a sink without exceeding the capacity, given by the edge weight, of any edge.

Algorithms like the Ford-Fulkerson algorithm and the Edmonds-Karp algorithm are available to compute the maximum flow of a flow network. These maximum-flow algorithms are augmented by the max-flow min-cut theorem, which states that the maximum flow from a source to a sink in a flow network is equal to the minimum total capacity of edges which, if removed, would result in no flow between the source and the sink [9]. Finding the minimum cut is a mathematically proven way of splitting a graph into two smaller graphs.
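
As a minimal sketch of these ideas, the following Java fragment computes a maximum flow and the corresponding minimum cut with the JGraphT library (the library later chosen for our clustering tool); the class and method names follow a recent JGraphT release and may differ from the version available when this research was performed:

    import java.util.Set;
    import org.jgrapht.Graph;
    import org.jgrapht.alg.flow.EdmondsKarpMFImpl;
    import org.jgrapht.graph.DefaultWeightedEdge;
    import org.jgrapht.graph.SimpleWeightedGraph;

    public class MinCutExample {
        public static void main(String[] args) {
            Graph<String, DefaultWeightedEdge> g =
                    new SimpleWeightedGraph<>(DefaultWeightedEdge.class);
            for (String v : new String[] {"source", "a", "b", "sink"}) g.addVertex(v);
            g.setEdgeWeight(g.addEdge("source", "a"), 3.0);
            g.setEdgeWeight(g.addEdge("a", "b"), 1.0); // bottleneck edge
            g.setEdgeWeight(g.addEdge("b", "sink"), 3.0);

            EdmondsKarpMFImpl<String, DefaultWeightedEdge> alg = new EdmondsKarpMFImpl<>(g);
            // By the max-flow min-cut theorem these two quantities are equal.
            double minCutWeight = alg.calculateMinCut("source", "sink"); // 1.0
            Set<String> sourceSide = alg.getSourcePartition();           // {source, a}
            Set<DefaultWeightedEdge> cutEdges = alg.getCutEdges();       // {(a, b)}
            System.out.println(minCutWeight + " " + sourceSide + " " + cutEdges);
        }
    }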

It is theorized that if one could adequately model software as graphs, such algorithms from the graph theory field could be used to break complicated sets of software artifacts into smaller sets that represent the software's modules.

CHAPTER 3. SOFTWARE CLUSTERING AND GRAPHS

There are numerous methods of quantifying the quality of software. Many such methods analyze the amount of coupling and cohesion in a given codebase. Well-designed software will exhibit low coupling, the degree to which a module depends on other modules, and high cohesion, the degree to which the elements of a module belong together [8]. Software elements in code with low coupling and high cohesion will naturally group together.

When these natural groupings, or clusters, are known, software understanding and maintenance become easier because developers may be able to limit the scope of their work to the cluster under review. Research into the use of clustering on software aims to exploit the natural sets that exist within well-designed code to reveal the software's underlying structure.

Likewise, graphs can be used as a tool to pictorially visualize relationships implicit in the software. Software itself includes several sets of artifacts that are interrelated. At a macro level software can be viewed as a series of functions that are in the same classes or a series of header files that include each other. At a micro level software can be viewed as individual code statements that call each other or access the same variable. These relationships can all be modeled as graphs.

Software developers can use function call graphs to visualize a sometimes complicated web of function calls. Control flow branches can be visualized to simplify test coverage analysis. Beyond those visualization techniques, it has been shown that techniques from the graph theory field can be applied to software if the correct program artifact relationships are graphed [2].

Another such method of modeling program relationships as graphs is the source graph, described in [13]. In a source graph, nodes represent code files, functions, data types, and variables. Edges represent the contain and use relationships between the nodes. Included in source graphs are sub-graphs like uses graphs, data dependency graphs, definition-use graphs, and is-component-of graphs. These sub-graphs are used by [4] as a property-based measure of software. Through existing research it is apparent that modeling software as graphs can provide a valuable abstraction for software analysis.

CHAPTER 4. PROPOSED APPROACH

As mentioned previously and detailed in the related works section, research has been done on methods of visualizing software and methods of applying clustering to software. A goal of this research was to explore new methods of combining these two areas of study to further a developer’s ability to understand and maintain software.

It was hypothesized that applying clustering to software graphs could be used not only to extract software modules but also to display them in a way that is easy for a developer to understand. To accomplish this, we began by deciding how to represent software as graphs and then how to apply clustering to those graphs.

There are numerous relationships inherent in software, and because graphs provide an abstraction on which powerful analysis algorithms can be built, there are numerous ways to model software as graphs. Each of these approaches has its advantages and disadvantages.

We began with a simple call graph, which is a classic way of modeling call relationships. In a call graph, functions are represented as nodes. Two nodes share a directed edge if one of the functions calls the other function. This representation reveals functional flow, but we found it didn't give a complete enough view of the software for module extraction. Call relationships don't reveal hidden control mechanisms like the use of mutexes and semaphores for synchronization in multi-threaded programming. Call relationships also don't reveal the flow of data through shared memory or mailboxes.

While call relationship graphs are valuable, other software relationships can provide a more relevant graph for object-oriented software. Commonly accepted object-oriented design principles dictate the grouping of related types and functions that operate on those types. This research proposes taking advantage of this principle by relating functions based on the data types they access rather than just their call relationships. For example, two functions using the common wait and signal functions for synchronization wouldn't be related in a call graph; there is no direct function call from one to another. However, the semaphore type that these two functions use is common between them. Relating functions through the types they use yields a data set that captures this hidden control.

We propose modeling this type usage relationship in a graph whose nodes are functions, with two functions sharing an edge if they access a common data type. The edges are therefore not directed. Two functions could share multiple different data types, and we theorize that functions sharing many data types are more closely related than those sharing fewer. To improve the clustering process, edges were weighted based on the number of data types the source and sink nodes share, as sketched below.
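
A minimal sketch of one way this graph might be assembled with JGraphT (the library chosen for our tools in Chapter 6) follows; the map of mined function-to-type relationships and all identifiers are illustrative assumptions:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import org.jgrapht.Graph;
    import org.jgrapht.graph.DefaultWeightedEdge;
    import org.jgrapht.graph.SimpleWeightedGraph;

    public class TypeUsageGraphBuilder {
        // typesUsed maps each function name to the set of data types it
        // accesses, assumed here to have been mined from the code already.
        static Graph<String, DefaultWeightedEdge> build(Map<String, Set<String>> typesUsed) {
            Graph<String, DefaultWeightedEdge> g =
                    new SimpleWeightedGraph<>(DefaultWeightedEdge.class);
            typesUsed.keySet().forEach(g::addVertex);

            String[] fns = typesUsed.keySet().toArray(new String[0]);
            for (int i = 0; i < fns.length; i++) {
                for (int j = i + 1; j < fns.length; j++) {
                    // Count the data types the two functions have in common.
                    Set<String> shared = new HashSet<>(typesUsed.get(fns[i]));
                    shared.retainAll(typesUsed.get(fns[j]));
                    if (!shared.isEmpty()) {
                        // Undirected edge, weighted by the number of shared types.
                        DefaultWeightedEdge e = g.addEdge(fns[i], fns[j]);
                        g.setEdgeWeight(e, shared.size());
                    }
                }
            }
            return g;
        }
    }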

After deciding on the software relationships to model, analysis can be done on the resulting graph to uncover information about the software. As previously mentioned, it was theorized that clustering based on the software graph could be used to extract software modules. Creating clusters from graphs has been widely researched. However, we believe that generic clustering algorithms fail to take advantage of some truths hidden in the software. For instance, code with low coupling and high cohesion will form natural groupings. Because the graphs, and the relationships they contain, are of a particular type, we can make assumptions about how clusters in the graph should present themselves.

Methods of clustering graphs, like using the Edmonds-Karp maximum flow algorithm combined with the min-cut theorem to divide graphs into smaller graphs, have been proven to provide optimal results [6]. A limitation of this and similar algorithms is that the runtime complexity of each iteration is O(NE²), where N is the number of nodes and E is the number of edges. Software projects can easily involve hundreds if not thousands of functions, which makes implementing these algorithms on software impractical.

By exploiting the assumption that well-designed object-oriented code will naturally cluster, we theorize that computationally complex optimal clustering isn't necessary. We propose that a much simpler algorithm that approximates clusters is more than sufficient to provide valuable software module extraction.

CHAPTER 5. RELATED RESEARCH

Using graphs to model software has been researched significantly. Some of these graphs, such as the Program Dependence Graph (PDG), have been in use since the late 1980s as a method to optimize compilers [7]. Other research has focused on the automated extraction of software architecture from code. A summary of the existing research related to our hypothesis provides background and demonstrates the relevance of this research.

5.1 Star Diagrams: Designing Abstractions out of Existing Code

Recently several different research efforts have focused on methods of using functions and data types to visualize code. One such method of visualizing code is the star diagram, which is described in [3]. The author of [3] describes a visualization technique to help a developer reengineer existing legacy systems into an object-oriented system. The author points out the steps necessary to reengineer such a system. First, the programmer must identify all the uses of a data structure that is going to be encapsulated into object-oriented objects. Next, the programmer groups similar computations on the data structure. From there the programmer can plan his or her task and begin manipulating expressions.

The star diagram is a visualization that gives a software engineer a bottom-up view of the source code which allows for easier reengineering. The diagram, as shown in figure 5.1, starts with a single root node on the left that denotes all the references to the data structure under analysis. Each operation directly referencing that data structure is connected to the right of that root node by an edge. The operation consuming the result of this reference is connected to the right of those operations, creating a tree.

These star diagrams are a visualization technique, but [3] doesn't propose automated processing on those diagrams to perform the re-engineering. Nevertheless, the paper proposes a valuable visualization technique for software and shows how proper visualization techniques can benefit developers. An automated process for refactoring code is described in [12].

Figure 5.1 Star Diagram

5.2 Clustering Software Using Knowledgebase

As summarized in [1], software engineers frequently find that source code is the most up-to-date source of information about the software. For various reasons documentation can become outdated, limited, or nonexistent. Therefore, [1] proposes using software clustering as a method of identifying subsystem structures. More specifically, [1] proposes the use of a knowledgebase to perform the clustering.

Knowledgebases are used in the artificial intelligence field. A knowledgebase is a set of data in the form of rules that describes knowledge about a topic. In the case of software, the knowledgebase would be the repository of information describing how software entities might be related and the weights assigned to those relations.

For instance, a knowledgebase might consider function calls, type usage, functions calling the same function, naming prefixes, etc. Each of these considerations is assigned a weight inside the knowledgebase. Building this knowledgebase is a tedious manual process, but once it is developed on generic entities it can be used across multiple software projects with minor modification.

Once the knowledgebase is developed it is then possible to measure the similarity between two software entities. Once the similarity between all software entities under examination is determined, [1] creates a similarity matrix that is used to identify subsystems.

The knowledgebase identifies "soul clusters" from some obligatory subsystems with known properties. As an example given in [1], a library management system would have subsystems like "Book" or "Member". With soul clusters identified, all other entities are compared with the soul clusters. If entities are similar enough to the soul clusters defined in the knowledgebase, they are included in the soul cluster.

The remaining software entities are considered "candidate clusters." These candidate clusters are merged together based on their similarity measurement from the knowledgebase. Similarity matrices are built and clusters are joined iteratively until a threshold stopping criterion is met.

In this research, it is concluded that as a knowledgebase improves so does the accuracy of the software cluster. However, the process of building the knowledgebase can be labor intensive and still requires knowledge of the software in order to implement correctly.

5.3 Term Weighting Schemes for Labeling Clusters

Research done in [14] acknowledges that software clusters are valuable to developers, but recognizes that those clusters could be difficult to understand if they aren’t labeled correctly.

To overcome this difficulty [14] proposes the use of a term weighting scheme to automate cluster labeling.

Term weighting schemes are used in the information retrieval field to weight terms based on their importance in a document. Four different term weighting schemes, Inverse Document Frequency (IDF), RF, Odds Ratio (OR), and chi-square, were analyzed using clusters from the CD Net and Xfig software systems. Terms used in software artifacts, like the NET term in a function NET_SV_AddrToString, were weighted, with the heaviest weighted terms becoming the cluster label. In their analysis the chi-square method obtained the most meaningful labels.

5.4 Weighted Combined Algorithm

As described in [11], research has been done exploring the use of clustering techniques for reverse engineering and software architecture recovery. There are several available similarity measures, but those need to be tailored to software. Therefore, [11] examines a variety of those measures and proposes a new algorithm for finding inter-cluster distance.

To create any clusters, first the similarity between entities must be measured. Similarity can be based on either a direct link or a sibling link approach. The direct link approach measures how closely two entities are related, such as a function calling another function. The sibling link approach measures similarity based on shared features. In this case software entities can be viewed as graph nodes sharing edges that represent features the nodes have in common. The sibling link approach lends itself well to software and is therefore used in [11].

After the type of similarity is decided, there are several similarity measures available to measure how similar the entities are. As with other research, association coefficients, distance measures, and correlation coefficients are considered. Association coefficients calculate similarity based on binary features: either a feature is present or it isn't. Common association coefficient methods include the Jaccard coefficient, simple matching coefficient, and Sorensen-Dice coefficient. Likewise, various distance measures are considered, including the Euclidean distance, Canberra distance, and Minkowski distance. Distance measures calculate the dissimilarity between entities. The Pearson product-moment correlation coefficient is also presented.

After the measure of distance between entities has been calculated, the entities can be clustered. Maqbool [11] asserts that clustering algorithms fall into two categories: partitional and hierarchical. Partitional algorithms start with an initial partition and then modify it iteratively until the final partitions are found. Hierarchical algorithms, on the other hand, can either build clusters by starting with nothing and iteratively connecting nodes, or start with everything connected and split nodes until clusters are formed. Several different linkage algorithms exist to measure a cluster's distance from other clusters, including single linkage, complete linkage, weighted average linkage, and un-weighted average linkage.

To be able to claim successful clustering there must be a method available to assess or validate the performance of a clustering algorithm. Maqbool [11] summarizes three common validation studies: external assessment, internal assessment, and relative assessment. With external assessment, the results of the clustering algorithm are compared to an expert decomposition obtained by a subject matter expert of the code under analysis. Precision and recall is a common method of comparing that expert decomposition with the algorithm's results. The precision and recall method is described later in this paper because it is used to compare the results of this research to an expert decomposition.

Maqbool and Babri [11] go on to summarize a limitation of their previously published combined algorithm and propose the weighted combined algorithm to overcome that limitation.

The weighted combined algorithm operates by creating a feature vector for each cluster, which is the binary OR of the feature vectors of the individual entities in the cluster. This feature vector includes information about what data is accessed by the entities.

Figure 5.2 Weighted Combined Algorithm Feature Vector

After the feature vector for each cluster is determined, a similarity matrix using the Jaccard coefficient is created. The most similar entities in that matrix are clustered together. The weighted combined algorithm, unlike the combined algorithm, maintains the number of entities in a cluster that access each specific feature instead of starting over each iteration.

Figure 5.3 Weighted Combined Algorithm Combined Feature Vector
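
To make the preceding description concrete, here is a minimal sketch in Java of the Jaccard coefficient and the binary-OR cluster feature vector; the boolean-array representation is an assumption for illustration, and the per-feature entity counts that distinguish the weighted combined algorithm are omitted:

    public class FeatureVectors {
        // Jaccard coefficient between two binary feature vectors:
        // features present in both, divided by features present in either.
        static double jaccard(boolean[] a, boolean[] b) {
            int both = 0, either = 0;
            for (int i = 0; i < a.length; i++) {
                if (a[i] && b[i]) both++;
                if (a[i] || b[i]) either++;
            }
            return either == 0 ? 0.0 : (double) both / either;
        }

        // A cluster's feature vector is the binary OR of its members' vectors.
        static boolean[] clusterVector(boolean[] a, boolean[] b) {
            boolean[] merged = new boolean[a.length];
            for (int i = 0; i < a.length; i++) merged[i] = a[i] || b[i];
            return merged;
        }
    }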

In addition to proposing the weighted combined algorithm, [11] tests the algorithm using the code for Xfig, a drawing tool, and Bash, a Unix shell. The tests analyzed the weighted combined algorithm using a variety of similarity measures and compared the weighted combined algorithm to the complete algorithm. The weighted combined algorithm was shown to yield better results, judged by a higher precision/recall crossover point, than the complete algorithm as well as the combined algorithm.

Figure 5.4 Weighted Combined Algorithm vs Complete, Xfig f files Subsystem

5.5 Hierarchical Clustering for Software Architecture Recovery

The paper [10] reviews several different methods of using clustering for software architecture recovery. In doing so, it provides an analysis of the behavior of a variety of similarity and distance measures as they apply to software clustering. As an overview of clustering, [10] introduces both formal and non-formal software features that can be used as entities for similarity analysis. Examples of formal features are function calls or accesses to variables. Non-formal features might be things like comments or the developer's name. Once the features to analyze are identified, clustering algorithms relate them through similarity measures like distance measures, correlation coefficients, and association coefficients.

These similarity measures then allow the algorithms to cluster similar entities together until a threshold number of clusters are formed.

Some similarity measures will yield identical similarity values between multiple entities, forcing arbitrary choices. As discovered experimentally in our research and presented in [10], such arbitrary decisions in clustering can cause poor algorithm performance.

In addition to providing an overview of clustering methods, [10] includes a summary of comparative studies done between a variety of software clustering algorithms, including the Single Linkage Algorithm (SLA), Complete Linkage Algorithm (CLA), Combined Algorithm (CA), and the Weighted Combined Algorithm (WCA) detailed above. In some of the studies summarized, WCA was shown to extract understandable software architecture. As such, it was analyzed and used in this research for comparison purposes.

CHAPTER 6. EXPERIMENTAL SETUP

To research graph-based clustering on software, four experimental tools were created: a data mining tool, a cluster builder tool, a graph viewer tool, and a cluster scoring tool.

The purpose of the data miner tool is to gather program artifacts from the software under analysis. As proposed, our graphs represent the relationships between functions using the types they access. The data mining tool, developed in Java, extracts those relationships from software using EnSoft's Atlas Java API. The Atlas tool indexes C code and then provides a query language to find program artifacts. The data miner tool was developed separately from the other tools to increase the flexibility of this research. It outputs query results as a standard comma-separated values (CSV) document that the cluster builder tool can use.

The software cluster builder tool was developed in Java with the purpose of analyzing the data created by the data mining tool. The clustering tool acts as a test bed for clustering algorithm development as described later in this paper. Rather than duplicating work to develop a graph implementation and related analysis framework, this research considered many open source graph implementation libraries. Notable libraries considered included the Java Universal Network/Graph Framework (JUNG) and JGraph. Due to scalability concerns and ease of implementing custom algorithms, JGraphT was chosen as the graph framework for the clustering tool. JGraphT is an open source Java library of graph structures and algorithms developed by Barak Naveh and contributors.

Within the software clustering tool, a graph clustering algorithm framework was implemented using the JGraphT library. This framework allows for rapid prototyping of clustering algorithms using data output from the data mining tool. Output from the software clustering tool, such as the resulting clusters, is logged to a file, and a visualization of the resulting clusters is rendered using the JGraph graph visualization and layout library.

The graph viewer tool was developed in Java and displays the graph clusters for visual analysis. The graph viewer tool uses the JGraph graph visualization engine developed at the Swiss Federal Institute of Technology in Zurich. JGraph was ideal for this implementation because it separates the layout, facade, and graph implementation to allow for real-time GUI-based manipulation of large graphs.

The cluster scoring tool is used to analyze the output of the cluster builder. This tool performs precision and recall analysis, comparing cluster output to manually computed truth data, to grade the clustering algorithm's correctness.

CHAPTER 7. PARTITIONAL ALGORITHM DEVELOPMENT

Mining data from software is useless without a method of parsing the data to reveal useful information. Even in small software projects the amount of mined data can be overwhelming, especially without domain knowledge of the software. The bulk of this research is finding and analyzing effective algorithms that use graphs for automated software architecture extraction from software code artifacts.

Once a software graph is created, the architecture can be extracted using clustering. Graph algorithms such as the Ford-Fulkerson algorithm find the maximum flow of a graph. Combined with the max-flow min-cut theorem, the Ford-Fulkerson algorithm can be used to find perfect minimum-cut clusters in a graph. However, the algorithm is computationally complex and is therefore impractical for use in even moderate or large software projects.

While there are several algorithms in various fields that provide mathematically correct clustering results, we chose to exploit relationships inherent in well-designed object-oriented code to create an algorithm that generates cluster approximations using a reasonable amount of processing power on large sets of software artifacts.

Throughout the course of researching our hypothesis, our approximation algorithm underwent several iterations. The results of those iterations were compared with truth data gathered through domain knowledge of the software under analysis to provide improvements through the iterations.

Our algorithm iterations were tested using C code for the Xinu operating system. Xinu is a Unix-like operating system developed by Douglas Comer at Purdue University. Xinu is a small operating system, but with 263 functions and 65 types it provides a viable codebase for analysis. Those functions and types include operating system components like process, memory, I/O, timer management, and interprocess communications [5].

7.1 Algorithm 1

The first algorithm attempt started with a connected graph and then relied on a connectivity analysis to identify the nodes and edges in each cluster.

In the Xinu operating system it was discovered that 52 of the 263 functions don’t share any common data types with other functions in Xinu. These functions therefore are disconnected at the beginning of the clustering algorithm.

The remaining 211 functions are connected as a single large graph cluster that must be split to provide valuable information about the underlying software architecture. The algorithm operates by splitting the cluster(s) until each has fewer than a threshold number of nodes. To split the clusters, this algorithm simply removes the edge with the smallest weight. Due to the splitting nature of this algorithm we refer to it as the partitional algorithm.

Removing the edge with the smallest weight resulted in valuable clusters, but also resulted in many "orphaned" nodes, nodes that would be in clusters by themselves. These orphaned nodes resulted in many small clusters which did little to further our goal of extracting the software architecture.

To illustrate, the figure below shows an example software graph, followed by a sketch of this first algorithm. With this algorithm the ideal minimum cut might be the edge between F and G. Instead, the algorithm would remove the edge between A and B, which would leave A in a cluster by itself.

Figure 7.1 Partitional Algorithm 1 Example
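
A minimal sketch of this first iteration's core loop appears below, assuming the JGraphT library used by our clustering tool; method names follow a recent JGraphT release, and details such as tie-breaking are simplified here, which is exactly the weakness the later iterations address:

    import java.util.Comparator;
    import java.util.List;
    import java.util.Set;
    import org.jgrapht.Graph;
    import org.jgrapht.alg.connectivity.ConnectivityInspector;
    import org.jgrapht.graph.DefaultWeightedEdge;

    public class PartitionalAlgorithm1 {
        // Repeatedly remove the lightest edge inside any oversized component
        // until every connected component is at or below the threshold size.
        static List<Set<String>> cluster(Graph<String, DefaultWeightedEdge> g, int threshold) {
            while (true) {
                List<Set<String>> components =
                        new ConnectivityInspector<>(g).connectedSets();
                Set<String> big = components.stream()
                        .filter(c -> c.size() > threshold)
                        .findFirst().orElse(null);
                if (big == null) return components; // every cluster is small enough
                DefaultWeightedEdge lightest = g.edgeSet().stream()
                        .filter(e -> big.contains(g.getEdgeSource(e)))
                        .min(Comparator.comparingDouble(g::getEdgeWeight))
                        .orElseThrow(IllegalStateException::new);
                g.removeEdge(lightest);
            }
        }
    }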

7.2 Algorithm 2

For our second iteration of the partitional algorithm, the first algorithm was modified in an effort to solve the orphaned node issue. Instead of blindly removing the smallest weighted edge, the algorithm would remove the smallest edge whose source and sink nodes both had at least two edges. In doing this, the algorithm will not create orphaned nodes as it did in the first iteration.

With this small change the algorithm was still only partially effective in removing orphaned nodes. Additionally, after comparison with the ideal solution developed using domain knowledge of Xinu, it became clear that the algorithm's output was far from correct.

The following illustration shows the inefficiency of algorithm two. Once again the ideal minimum cut would be the edge between F and G. Unlike partitional algorithm one, the edge between A and B wouldn't be removed. However, that would then leave a tie among other edges. The algorithm as written would remove the edge between B and C because it was the first edge the algorithm came across.

In analysis of a real software project it was discovered that there are frequently ties for the smallest edge. Arbitrarily breaking those ties could produce sub-optimal results.

Figure 7.2 Partitional Algorithm 2 Example

7.3 Algorithm 3

One of the limitations of the second partitional algorithm implementation is that ties in edge weights aren't handled in a meaningful way. Instead, the first edge of the lowest weight that the algorithm comes across is removed. As shown in the diagram illustrating the limitation of algorithm two, this can be less than ideal.

In an attempt to make the output of the algorithm more meaningful for a developer wishing to extract the software architecture, details were added to the algorithm to determine what to do in a "tie" situation. In algorithm three, the smallest weighted edge whose removal wouldn't result in an orphaned node, and whose source and sink nodes had the fewest other edges, would be removed, iteratively, until clusters were below a set threshold.

As an example, consider the graph below. Algorithm three would treat edges with a weight of two as the smallest edges whose removal wouldn't orphan a node. To break the tie between edges with weight two, the algorithm looks at an edge x and then calculates how many edges the source and sink nodes for x have. For instance, in this graph, edge D-E would have a score of three: E has two edges and D has two edges, and the edge between D and E is counted only once (2 + 2 - 1 = 3).

Partitional algorithm implementation three would score edge D-E with a three, as would H-I and I-J. This is the lowest score, so one of those edges would be removed. While this simple example illustrates the improvement in the algorithm, it still isn't an ideal grouping because removing those edges wouldn't result in a smaller cluster.

Figure 7.3 Partitional Algorithm 3 Example

7.4 Algorithm 4

Partitional algorithm iteration four improves on algorithm three by modifying the criteria for breaking "ties" in a graph. Among the smallest weighted edges whose removal doesn't result in an orphaned node, algorithm four removes the edge with the smallest ratio of its own weight to the total weight of all edges connected to its source and sink nodes.

In creating this algorithm it was theorized that functions that share many data types could be interface functions to a subsystem. These interface functions would tie two clusters together. Therefore, using the defined ratio to break ties would be more effective at separating subsystems into their own clusters.

To demonstrate, in the graph below algorithm four would treat edges with a weight of two as the smallest edges whose removal wouldn't orphan a node. To break the tie between all the edges of weight two, the algorithm looks at an edge x and then calculates the ratio of the weight of edge x to the total weight of all edges connected to the source and sink of x. For instance, the score for the edge D-E would be calculated as:

w(D-E) / (w(D-E) + w(E-F) + w(C-D)) = 2/6 = 1/3

The adjusted weights of the edges are listed in red in the figure. Using these adjusted weights, the algorithm would choose to remove edge F-G because it has the smallest adjusted weight. In this case the removal of F-G is a minimum cut for the graph. A sketch of this computation follows the figure below.

Figure 7.4 Partitional Algorithm 4 Example
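
To make the tie-breaking rule concrete, the following is a minimal sketch of the adjusted-weight computation, again assuming JGraphT types; per the worked example above, the algorithm removes, among edges whose removal would not orphan a node, the edge with the smallest adjusted weight:

    import java.util.HashSet;
    import java.util.Set;
    import org.jgrapht.Graph;
    import org.jgrapht.graph.DefaultWeightedEdge;

    public class AdjustedWeight {
        // Adjusted weight of edge e: w(e) divided by the total weight of all
        // edges touching e's source and sink nodes, with e counted only once.
        static double adjustedWeight(Graph<String, DefaultWeightedEdge> g,
                                     DefaultWeightedEdge e) {
            Set<DefaultWeightedEdge> touching = new HashSet<>(g.edgesOf(g.getEdgeSource(e)));
            touching.addAll(g.edgesOf(g.getEdgeTarget(e))); // set union avoids double counting
            double total = 0.0;
            for (DefaultWeightedEdge t : touching) total += g.getEdgeWeight(t);
            return g.getEdgeWeight(e) / total;
        }
    }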

CHAPTER 8. RESULTS

Using the data miner, graph cluster builder, graph viewer, and cluster scoring tools developed, we tested our hypothesis that software modules could be automatically extracted using graph-based clustering. The tests were done using the Xinu operating system and subsystems of that operating system. These tests provided for improvements in the partitional algorithm that allow it to successfully extract software modules.

8.1 Partitional Clustering Algorithm

As described in the Partitional Algorithm Development section, the partitional algorithm makes decisions to remove edges to split large clusters into smaller clusters until all clusters are below a threshold size.

Tuning this threshold size to its optimal value was done first through manual inspection and then, as described later in this paper, through precision and recall analysis. Increasing the threshold size increases the number of functions in a cluster but decreases the overall number of clusters the algorithm will produce. The goal is to set the threshold so that the resulting clusters describe subsystems of the software.

Appendix Table A.1 shows the clusters found when applying the partitional clustering algorithm to the Xinu operating system using a cluster threshold of 15. The types (edges) used to make up each cluster are displayed to reveal details of how the algorithm made clustering decisions. Because the partitional algorithm is based on graphs and a graph viewer tool was developed, output from the algorithm can be displayed pictorially, as shown in figure 8.1. 26

Figure 8.1 Graph Viewer Output of Sample Clusters

While these results intuitively seem valuable, there is a desire to quantify their correctness. Such quantification can score the algorithm for comparison to other algorithms or to variations of the same algorithm.

As used in [11], the scoring of this algorithm’s results is done using the precision and recall measure. Precision and recall is a measure of relevance used in the pattern recognition and information retrieval field. The measure compares a data set to a truth set in two different ways: precision, which is the fraction of retrieved data that is relevant; and recall, which is the fraction of relevant data that is retrieved.

Precision = |Pairs in Test ∩ Pairs in Truth Data| / |Pairs in Test|

Recall = |Pairs in Test ∩ Pairs in Truth Data| / |Pairs in Truth Data|

If only singleton, or orphaned, nodes exist in clusters, it is easy to see that the results would be zero recall but 100 percent precision. Likewise, if there were only one large cluster, the system would have 100 percent recall but zero percent precision. Precision naturally decreases as the size of clusters increases because more of the nodes in the cluster aren't relevant when compared to truth data. Inversely, recall naturally increases as cluster size increases because smaller clusters can mean less relevant data per cluster. The point where recall and precision intersect is important because it can reveal the optimal tuning for cluster size. Alternatively, computing the f-score of the precision and recall can be used to measure the overall accuracy.

The f-score is the harmonic mean of precision and recall, and can be computed as:

F-score = 2 × (precision × recall) / (precision + recall)
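
As an illustration, the following is a minimal sketch in Java of how such pairwise precision, recall, and f-score computation might be implemented; the representation of clusters as sets of function names is an assumption, and this is not the cluster scoring tool's actual code:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ClusterScorer {
        // Collect every unordered pair of functions placed in the same cluster.
        static Set<String> sameClusterPairs(List<Set<String>> clusters) {
            Set<String> pairs = new HashSet<>();
            for (Set<String> cluster : clusters)
                for (String a : cluster)
                    for (String b : cluster)
                        if (a.compareTo(b) < 0) pairs.add(a + "|" + b);
            return pairs;
        }

        // Returns {precision, recall, f-score} of test clusters against truth.
        static double[] score(List<Set<String>> test, List<Set<String>> truth) {
            Set<String> testPairs = sameClusterPairs(test);
            Set<String> truthPairs = sameClusterPairs(truth);
            Set<String> common = new HashSet<>(testPairs);
            common.retainAll(truthPairs);
            double precision = testPairs.isEmpty() ? 0 : (double) common.size() / testPairs.size();
            double recall = truthPairs.isEmpty() ? 0 : (double) common.size() / truthPairs.size();
            double fScore = (precision + recall) == 0 ? 0
                    : 2 * precision * recall / (precision + recall);
            return new double[] {precision, recall, fScore};
        }
    }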

To judge the effectiveness of any algorithm there must be truth data to which it can be compared. In the case of Xinu, partial truth data of the major subsystems was gathered from experts with domain knowledge in the software.

Table 8.1 Xinu Subsystem Truth Data

Process: chprio, create, ctxsw, kill

Files: ckmode, dfalloc, dfdsrch, ibclear, ibfree, ibget, ibfree, ibnew, ibput, lfclose, lfgetc, lfinit, lfputc, lfread, lfsdfree, lfseek, lfsetup, lfsflush, lfsnewd, lfwrite

Disks: dscntl, dsinit, dsinter, dskenq, dskqopt, dskstrt, dsopen, dsread, dsseek, dswrite

TTY: ttycntl, ttygetc, ttyiin, ttyinit, ttyoin, ttyopen, ttyputc, ttyread, ttywrite

Networking: arp in, arpfind, ethinit, ethinter, ethread, ethrstrt, ethwrite, ethwstrt, icmp in, ip in, ipsend, rarp in, mkarp, netin, netdump, netinit, dgalloc, dgclose, dgcntl, dgdump, dginit, dgread, dgwrite, route, udpecho, udpsend

Utility - cross-cutting: close, conf, freebuf, getbuf, getpid, getpath, getnet, getname, getprio, ioerr, kprintf, open, panic, putc, read, ready, resched, wait, signal, suspend

Messaging: pcount, pcreate, pdelete, pinit, preceive, psend, preset

Semaphores: receive, send, scount, screate, sdelete, signal, wait, signaln, sreset

This truth data, while only a partial set of the functions in Xinu, was used to determine how effective the researched algorithm was. Our cluster scoring tool can then be used to compute precision, recall, and f-score by measuring the pairs of functions, in the same cluster, that were present in both the truth data and the experimental data.

Table 8.2 Partitional Algorithm Precision and Recall Score

Threshold   Precision   Recall        F-Score
2           0.213115    0.01656051    0.030733
5           0.198413    0.063694268   0.096432
10          0.218519    0.150318471   0.178113
15          0.191489    0.206369427   0.198651
20          0.189641    0.303184713   0.233333
25          0.194872    0.338853503   0.247442
30          0.155738    0.338853503   0.213398
35          0.146315    0.346496815   0.205749
40          0.148746    0.445859873   0.223072
45          0.128019    0.472611465   0.201466
50          0.128019    0.472611465   0.201466

Maximum F-Score: 0.247442

Figure 8.2 Precision, Recall and F-Score of Xinu Analysis

Analysis shows, as expected, an intersection of precision and recall. This intersection, which occurs when all clusters have fewer than 15 nodes in them, is used as the optimal tuning threshold by some research. However, because recall increases faster than precision decreases, the maximum f-score actually occurs when 25 is used as the algorithm threshold.

8.2 Weighted Combined Algorithm

As described in the related works section, related research has analyzed methods of visualizing software and methods of using clustering to understand software. Some research, like that done on the weighted combined algorithm, examines the use of clustering algorithms to extract software modules.

The partitional algorithm uses connected graphs as the basis of clustering for the automated extraction of software modules. The weighted combined algorithm also uses clustering, though not based on graphs, to extract software modules. To provide a basis for supporting our hypothesis, the weighted combined algorithm was implemented in Java using the Atlas API and used to analyze the same Xinu operating system code the partitional algorithm analyzed.

The weighted combined algorithm starts with stand-alone software artifacts and then iteratively joins those artifacts together to form clusters. The number of iterations the algorithm goes through is used to tune the algorithm's performance.

As expected, the weighted combined algorithm provided valuable clusters that give an approximate representation of the software's underlying modules. These clusters can be found in Appendix Table B.1.

As with the partitional algorithm, the precision and recall analysis method was applied to compare the results of the weighted combined algorithm to the truth data found in table 8.1.

Table 8.3 Weighted Combined Algorithm Precision and Recall Score

Iterations   Precision   Recall        F-Score
30           0.205128    0.010191083   0.019417
60           0.213592    0.028025478   0.04955
90           0.194286    0.043312102   0.070833
120          0.251479    0.108280255   0.15138
135          0.275269    0.163057325   0.2048
150          0.223438    0.182165605   0.200702
165          0.204983    0.230573248   0.217026
180          0.17475     0.289171975   0.21785
210          0.026766    0.755414013   0.0517

Maximum F-Score: 0.21785

Figure 8.3 Weighted Combined Algorithm Precision, Recall and F-Score

CHAPTER 9. CONCLUSIONS AND FUTURE WORK

An overarching goal of this research was to develop methods of using tools to aid in the development and maintenance of complex software systems. Since developers frequently find themselves with large amounts of code and only small amounts of documentation or domain knowledge, effective methods of extracting software modules from code would be of great value to the developer.

It was hypothesized that graph-based clustering algorithms could be used to automatically extract software modules. After identifying the software relationships to visualize using graphs, we examined methods of clustering those graphs.

This research developed a set of tools for software data mining, visualization, clustering, and cluster evaluation. These tools allowed us to develop an algorithm showing that software module extraction is possible through graph clustering.

After a survey of existing literature, we found research done on methods of visualizing software and research on using clustering for module extraction. We found no research that used graphs to create software clusters for module extraction. The weighted combined algorithm uses a different method for module extraction, so we implemented that algorithm as a baseline to validate our hypothesis.

To quantitatively validate our hypothesis that graph-based clustering can be used for software module extraction, we used the precision and recall method from pattern recognition to compare the results from the partitional algorithm and the weighted combined algorithm to truth data.

Results indicate that our algorithm operates with comparable accuracy to the weighted combined algorithm and that these results are valuable for module extraction. Furthermore, by using graph-based clustering we are able to view results pictorially. The automatically-extracted modules combined with a graph visualization of results give developers knowledge of software that was previously only available to those with significant domain knowledge or extensive documentation.

While our hypothesis was validated, the notion of using graph-based software clustering for software module extraction is new, which leaves many opportunities for future work. This research noted that the precision and recall crossover points of the partitional algorithm and the weighted combined algorithm indicate differences between the algorithms. Future research could compare the strengths and weaknesses of the two algorithms in an attempt to develop a more accurate hybrid algorithm. Additionally, it may be possible to combine several software relationships together with the type-use relationship this research used to provide better module extraction.

This and other related research has shown that software clustering can be used effectively to extract software modules. This capability, especially if enhanced through future research, can increase a developer's ability to maintain and develop complex software systems.

APPENDIX A. PARTITIONAL ALGORITHM CLUSTER RESULTS

Table A.1 Paritional Algorithm Xinu Clusters Cluster Elements Cluster 1: x date getutim Types in cluster 1: clktime Cluster 2: control remove rename access Types in cluster 2: devsw.dvcntl devtab devsw Cluster 3: mkinit mark Types in cluster 3: mkmutex nmarks Cluster 4: pdelete psend ptclear freebuf bpdump pinit pcreate getbuf preceive poolinit mkpool pcount x bpool preset Types in cluster 4: ports pt.ptstate pt ptmark MARKER marks nmarks pt.ptssem pt.pthead pt.pttail ptnode pt.ptrsem pt.ptseq ptnode.ptnext ptn- ode.ptmsg ptfree pt.ptmaxcnt bptab nbpools bpool.bpnext bpool.bpsem bpool bpmark bpool.bpsize ptnextp Cluster 5: mount naminit ndump unmount namrepl mprint Types in cluster 5: nament.nrepl nam Nam.nnames Nam nament nam.nametab na- ment.ndev nam.nnames nament.npre Nam.nametab devsw.dvname de- vtab devsw Cluster 6: signaln sreset scount screate signal sdelete Types in cluster 6: sentry.semcnt semaph sentry.sstate sentry.sqhead sentry Cluster 7: dskbcpy dumkdl Types in cluster 7: dskdbp Cluster 8: suspend resume Types in cluster 8: pentry.pstate proctab pentry pentry.pprio 35

Table A.1 (Continued)

Paritional Algorithm Xinu Clusters Continued Cluster 9: enqueue insertd getlast newqueue dequeue getfirst

Types in cluster 9: qent.qprev qent q qent.qnext qent.qkey

Cluster 10: clkinit stopclk strtclk

Types in cluster 10: preempt defclk clkdiff clockq slnempty

Cluster 11: send sendf

Types in cluster 11: pentry.phasmsg pentry.pstate proctab Bool pentry pentry.pmsg

Cluster 12: getpid recvclr receive

Types in cluster 12: currpid pentry.phasmsg proctab Bool pentry pentry.pmsg

Cluster 13: sleep10 unsleep wakeup recvtim sleep insert qxdump

Types in cluster 13: clkruns pentry.pstate qent.qkey proctab qent sltop q qent.qnext clockq

pentry slnempty currpid qent.qprev

Cluster 14: setdev setnok kill addarg

Types in cluster 14: proctab pentry pentry.pdevs pentry.pnxtkin pentry.pstate pentry.pbase

Cluster 15: freemem mdump getmem getstk nulluser x snap x mem

Types in cluster 15: memlist.mnext memlist mblock.mnext maxaddr mblock.mlen end

mblock pentry.pstate edata etext pentry proctab pentry.pstklen

Cluster 16: sysinit newsem chprio pxdump resched getprio x ps wait ready newpid

create

Types in cluster 16: pentry.pstate sentry.sqtail pentry proctab semaph sentry.sstate sen-

try currpid nextsem numproc Bool pentry.plimit pentry.phasmsg

pentry.pprio pentry.pargs pentry.pname pentry.paddr pentry.pbase

nextproc rdyhead pentry.psem pentry.pregs qent q

Cluster 17: rfread

Types in cluster 17: -

Cluster 18: getc

Types in cluster 18: -

Cluster 19: putc

Types in cluster 19: - 36

Table A.1 (Continued)

Paritional Algorithm Xinu Clusters Con’t Cluster 20: x mount init iosetvec lfwrite rfwrite seek close lfread open ioinit x devs

dscntl write read devdump

Types in cluster 20: devsw.dvname devtab devsw devsw.dvovec devsw.dvivec devsw.dvminor

devsw.dvioblk devsw.dvcsr

Cluster 21: rfcntl rfmkpac rfalloc rfopen rfio rfinit rfclose rfseek rfdump x rf rfsend

Types in cluster 21: rfinfo.device Rf.device Rf rfinfo rfinfo.rmutex Rf.rmutex Rf.rftab rf-

blk.rf state rfinfo.rftab rfblk rfblk.rf pos rfblk.rf mode rfblk.rf name rf-

blk.rf dnum rfblk.rf mutex devsw.dvioblk devsw

Cluster 22: dsksync dsinter dsread dskqopt dsseek dswrite dsinit dskenq dskstrt

Types in cluster 22: dreq dreq.drop dreq.drdba dreq.drpid dreq.drstat dreq.drbuff DBADDR

dreq.drnext dsblk dsblk.dreqlst devsw.dvioblk dskrbp currpid de-

vsw dtc.dt csr dsblk.dcsr dtc xbdcb.xcntl xbdcb.xop xbdcb.xladdr

dtc.dt xdar dsblk.ddcb dtc.dt car xbdcb.xunit xbdcb.xcount dtc.dt xcar

xbdcb xbdcb.xmaddr dtc.dt dar

Cluster 23: lfinit dfalloc

Types in cluster 23: flblk flblk.fl pid fltab

Cluster 24: ibget ibput

Types in cluster 24: dskdbp IBADDR iblk

Cluster 25: lfsnewd lfsdfree

Types in cluster 25: dsblk dir.d fblst devsw.dvioblk freeblk dsblk.dflsem dir freeblk.fbnext

DBADDR devtab dsblk.ddir devsw

Cluster 26: dsopen dumkfs dfdsrch ibnew lfsetup lfputc ibfree ibclear dumkil iblfree

lfgetc lfclose dudir lfseek lfsflush

Types in cluster 26: flblk Bool flblk.fl mode devsw.dvioblk flblk.fl pos flblk.fl dch flblk.fl dev

fdes flblk.fl dent devsw flblk.fl iba flblk.fl bptr iblk fdes.fdiba flblk.fl buff

iblk.ib dba flblk.fl iblk DBADDR IBADDR flblk.fl ipnum dsblk de-

vtab dsblk.ddir dir flblk.fl iblk.ib dba dir.d fblst dir.d nfiles dir.d id

dir.d iblks dir.d filst fdes.fdname dir.d files fdes.fdlen iblk.ib next ds-

blk.dflsem iblk.ib byte flblk.fl pid currpid 37

Table A.1 (Continued)

Paritional Algorithm Xinu Clusters Con’t Cluster 27: x rls echoch x creat

Types in cluster 27: Bool

Cluster 28: ttyiin ttyoin tdump1 ttycntl ttywrite rststate erase1 ttyinit kputc ttyread

writcopy ttygetc ttyputc savestate eputc

Types in cluster 28: Bool tty.oflow tty.oheld csr.ctstat tty.ostop tty csr.crbuf csr tty.ibuff

tty.iecho tty.ihead tty.isem tty.ioaddr tty.imode tty.iintpid tty.iintr

tty.ieof tty.ieofc tty.ierase tty.iintrc tty.ikill tty.icursor tty.ierasec

tty.ifullc tty.ostart tty.ikillc tty.icrlf csr.ctbuf tty.obuff tty.otail tty.osem

tty.ebuff tty.ehead tty.odsend tty.etail tty.itail tty.ohead devsw.dvminor

devsw devtab devsw.dvcsr csr.crstat saveps savedev savectstat savecr-

stat tty.evis tty.ieback devsw.dvioblk tty.ocrlf

Cluster 29: rfputc ttyopen ethstrt ethinit rfgetc ethread ethwstrt ethrstrt ethinter

Types in cluster 29: devsw.dvnum devsw etblk.ercmd etblk.eioaddr dcmd.dc st1 etblk dcmd

dqregs dqregs.d csr etblk.etdev Eaddr etblk.etpaddr dqsetu etblk.etrpid

devsw.dvioblk etblk.etrsem etblk.ewcmd dcmd.dc st2 dcmd.dc buf et-

blk.etwsem dcmd.dc flag dcmd.dc bufh etblk.etlen dqregs.d wcmd et-

blk.etwtry dqregs.d wcmdh etblk.etsetup dcmd.dc len dqregs.d rcmd

dqregs.d rcmdh

Cluster 30: shell lexan x help

Types in cluster 30: cmds Shl shvars cmdent cmdent.cmdnam shvars.shncmds Shl.shncmds

shvars.shtok Shl.shtktyp Shl.shtok shvars.shtktyp

Cluster 31: dgparse dgdump x dg

Types in cluster 31: dgblk.dg_faddr dgblk IPaddr dgblk.dg_fport dgblk.dg_mode dgblk.dg_state dgtab dgblk.dg_lport dgblk.dg_dnum dgblk.dg_xport

Cluster 32: icmp_in netout

Types in cluster 32: epacket.ep_data epacket ip ip.i_paclen ip.i_dest IPaddr

Cluster 33: arpfind adump x_routes arpinit

Types in cluster 33: Arp.atabnxt arpent.arp_Ead arpent arpent.arp_dev Arp.atabsiz Arp Eaddr arpblk.atabsiz arpblk.atabnxt Arp.arptab IPaddr arpblk arpblk.arptab arpent.arp_Iad arpent.arp_state

Cluster 34: mkarp dgmcntl getpath getnet route ipsend getaddr arp_in sndrarp ethwrite rarp_in netin

Types in cluster 34: eheader IPaddr eheader.e_ptype epacket.ep_data epacket epacket.ep_hdr epacket.ep_hdr.e_ptype arppak.ar_op arppak.ar_tha arppak.ar_tpa etblk.etpaddr arppak.ar_sha Net epacket.ep_hdr.e_dest arppak Eaddr netinfo arppak.ar_spa etblk eheader.e_dest netinfo.netpool Net.netpool Net.gateway netinfo.gateway arpent arpblk.arpwant Arp.arpwant Arp.arppid Arp Arp.arptab arpblk arpblk.arppid arpblk.arptab arpent.arp_state Bool Net.mavalid netinfo.mavalid netinfo.mynet Net.mynet arpent.arp_Ead arpent.arp_dev netinfo.myaddr Net.myaddr devsw.dvioblk devtab devsw eheader.e_src arpblk.rarppid Arp.rarppid

Cluster 35: dgmopen dgclose dginit dgalloc

Types in cluster 35: dgblk.dg_state dgblk.dg_dnum dgtab dgblk Bool netq netinfo dgblk.dg_netq netq.valid netq.uport netinfo.netqs Net Net.netqs

Cluster 36: ip_in x_net nqalloc netdump getname udpnxtp netinit

Types in cluster 36: netq.xport netq netinfo Net.nover netq.uport netinfo.netqs netinfo.nover Net netinfo.ndrop Net.netqs netq.pid Net.ndrop Bool netq.valid Net.nmutex Net.mavalid netinfo.nxtprt netinfo.npacket netinfo.netpool netinfo.nmutex netinfo.mavalid Net.npacket Net.nxtprt Net.netpool netinfo.mnvalid Net.mnvalid

Cluster 37: udpsend rwhod dgcntl dgread x_who rwhoind login dgwrite x_uptime

Types in cluster 37: ip.i_data udp.u_udplen udp.u_data udp epacket.ep_data epacket ip udp.u_sport rwent.rwmach rwent Rwho.rwnent rwhopac.rw_btim rwent.rwslast rwhopac.rw_load rwinfo rwent.rwusers rwent.rwload rwent.rwlast rwinfo.rwnent rwinfo.rwcache rwhopac.rw_sndtim rwhopac Rwho.rwcache rwhopac.rw_host Rwho rwent.rwboot Shl Bool shvars.shlast shvars.shlogon Shl.shuser MARKER Shl.shmark shvars.shuser Shl.shlast shvars shvars.shmark Shl.shused Shl.shlogon shvars.shused marks nmarks netinfo netinfo.netpool IPaddr Net.netpool Net dgblk devsw.dvioblk dgblk.dg_xport dgblk.dg_mode devsw xgram.xg_faddr dgblk.dg_lport xgram.xg_data xgram.xg_fport xgram

APPENDIX B. WEIGHTED COMBINED ALGORITHM RESULTS

Table B.1 Weighted Combined Algorithm Xinu Clusters

Cluster 0: qdumph

Cluster 1: sndrarp

Cluster 2: hl2vax

Cluster 3: userret

Cluster 4: x_rm

Cluster 5: getpid

Cluster 6: sleep

Cluster 7: namopen

Cluster 8: x_rls

Cluster 9: tqdump

Cluster 10: qdumpa

Cluster 11: iosetvec

Cluster 12: ckmode

Cluster 13: net2hs

Cluster 14: udpecho

Cluster 15: blkcopy

Cluster 16: ionull

Cluster 17: getname

Cluster 18: ctxsw

Cluster 19: prdumpa

Cluster 20: cksum

Cluster 21: vax2hl

Cluster 22: rfmkpac

Cluster 23: dgmcntl

Cluster 24: nulluser

Cluster 25: prdumph

Cluster 26: x_snap

Cluster 27: tdump

Cluster 28: nammap

Cluster 29: x_sleep

Cluster 30: outint

Cluster 31: dgparse dgdump x_dg

Cluster 32: x_reboot

Cluster 33: main

Cluster 34: udpsend

Cluster 35: clkinit

Cluster 36: x_cp

Cluster 37: ascdate

Cluster 38: rfcntl rfsend

Cluster 39: xdone

Cluster 40: x_creat

Cluster 41: gettime

Cluster 42: shell x_help

Cluster 43: kprintf

Cluster 44: getutim x_date

Cluster 45: setclkr

Cluster 46: dot2ip

Cluster 47: restart

Cluster 48: clkint

Cluster 49: dginit lfinit dfalloc dgalloc

Cluster 50: x_kill

Cluster 51: blkequ

Cluster 52: icmp_in netout ipsend

Cluster 53: netnum

Cluster 54: x_close

Cluster 55: x_echo

Cluster 56: x_unmou

Cluster 57: lexan addarg

Cluster 58: dumkdl

Cluster 59: x_mv

Cluster 60: rwhod x_who rwhoind login x_uptime

Cluster 61: prdump

Cluster 62: dgcntl

Cluster 63: net2hl

Cluster 64: dsksync dsinter dsread dskqopt dsinit dswrite dsseek dskstrt dskenq

Cluster 65: tdumph

Cluster 66: ethstrt ethinit ethread ethwstrt ethrstrt ethinter

Cluster 67: mkinit mark

Cluster 68: freemem mdump getstk getmem x_mem

Cluster 69: inint

Cluster 70: panic

Cluster 71: ioerr

Cluster 72: rwho

Cluster 73: ip2name

Cluster 74: dsopen lfsnewd dumkfs dfdsrch ibnew lfsetup lfsdfree ibput lfputc ibfree iblfree lfclose lfgetc dudir lfseek lfsflush

Cluster 75: ttyoin ttyiin ttycntl tdump1 rststate ttyinit ttyread kputc writcopy ttygetc ttyputc savestate eputc

Cluster 76: x_exit

Cluster 77: hl2net

Cluster 78: setdev suspend setnok resume chprio getprio ready newpid

Cluster 79: ibclear ibget dumkil dskbcpy

Cluster 80: control ioinit remove rename ttywrite access

Cluster 81: pdelete pcount psend ptclear pinit pcreate preset preceive

Cluster 82: ethwrite

Cluster 83: rfalloc rfopen rfio rfinit x_rf rfdump rfseek rfclose

Cluster 84: x_cat

Cluster 85: hs2net

Cluster 86: mount naminit unmount ndump namrepl mprint

Cluster 87: newsem signaln sreset scount screate signal sdelete

Cluster 88: x_net ip_in dgmopen dgread nqalloc dgclose netdump udpnxtp netinit dgwrite

Cluster 89: arpfind adump mkarp getpath getnet route x_routes arp_in getaddr rarp_in netin arpinit

Cluster 90: stopclk

Cluster 91: putc getc write read seek

Cluster 92: mkpool freebuf x_bpool bpdump getbuf poolinit

Cluster 93: sleep10 sysinit send pxdump resched strtclk unsleep kill recvclr recvtim x_ps wait create dequeue enqueue insertd wakeup getlast sendf newqueue insert receive qxdump getfirst

Cluster 94: qdump

Cluster 95: lfread ttyopen rfputc dscntl rfgetc rfread lfwrite rfwrite

Cluster 96: open x_devs x_mount init devdump close

Cluster 97: erase1 echoch

BIBLIOGRAPHY

[1] Adnan, M. N., Islam, M. R., and Hossain, S. (2011). Clustering software systems to identify subsystem structures using knowledgebase. 2011 Malaysian Conference in Software Engineering, pages 445–450.

[2] Allen, E. (2002). Measuring graph abstractions of software: An information-theory approach. Proceedings of the Eighth IEEE Symposium on Software Metrics.

[3] Bowdidge, R. (2007). Star diagrams: Designing abstractions out of existing code. OOPSLA '96 Workshop on Transforming Legacy . . . , pages 1–5.

[4] Briand, L., Morasca, S., and Basili, V. (1996). Property-based software engineering measurement. IEEE Transactions on Software Engineering, 22(1).

[5] Comer, D. (2011). Operating System Design: The Xinu Approach. CRC Press.

[6] Edmonds, J. and Karp, R. M. (1972). Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264.

[7] Ferrante, J., Ottenstein, K. J., and Warren, J. D. (1987). The program dependence graph and its use in optimization. ACM Transactions on Programming Languages and Systems, 9(3):319–349.

[8] Hitz, M. and Montazeri, B. (1995). Measuring coupling and cohesion in object-oriented systems. Proceedings of the International Symposium on Applied Corporate Computing, pages 1–10.

[9] Kleinberg, J. and Tardos, E. (2006). Algorithm Design. Addison-Wesley.

[10] Maqbool, O. (2007). Hierarchical clustering for software architecture recovery. IEEE Transactions on Software Engineering, 33(11):759–780.

[11] Maqbool, O. and Babri, H. (2004). The weighted combined algorithm: a linkage algorithm for software clustering. Proceedings of the Eighth European Conference on Software Maintenance and Reengineering (CSMR 2004), pages 15–24.

[12] O’Connor, A., Shonle, M., and Griswold, W. (2005). Star diagram with automated refactorings for Eclipse. Proceedings of the 2005 OOPSLA Workshop on Eclipse Technology eXchange (eclipse '05), pages 16–20.

[13] Sartipi, K. and Kontogiannis, K. (2003). On modeling software architecture recovery as graph matching. Proceedings of the International Conference on Software Maintenance (ICSM 2003), pages 224–234.

[14] Siddique, F. and Maqbool, O. (2011). Analyzing term weighting schemes for labeling software clusters. 2011 15th European Conference on Software Maintenance and Reengineering, pages 85–88.