Automatic Structure Discovery for Large Source Code

By Sarge Rogatch
Universiteit van Amsterdam, Master Thesis Artificial Intelligence, 2010

Master Thesis, AI
Sarge Rogatch, University of Amsterdam
July 2010

Acknowledgements

I would like to acknowledge the researchers and developers who are not even aware of this project, but whose findings have played a very significant role:

Soot developers: Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and others.
TreeViz developer: Werner Randelshofer
H3 Layout author and H3Viewer developer: Tamara Munzner
Researchers of static call graph construction: Ondřej Lhoták, Vijay Sundaresan, David Bacon, Peter Sweeney
Researchers of Reverse Architecting: Heidar Pirzadeh, Abdelwahab Hamou-Lhadj, Timothy Lethbridge, Luay Alawneh
Researchers of Min Cut related problems: Dan Gusfield, Andrew Goldberg, Maxim Babenko, Boris Cherkassky, Kostas Tsioutsiouliklis, Gary Flake, Robert Tarjan

Contents

1 Abstract
2 Introduction
  2.1 Project Summary
  2.2 Global Context
  2.3 Relevance for Artificial Intelligence
  2.4 Problem Analysis
  2.5 Hypotheses
  2.6 Business Applications
  2.7 Thesis Outline
3 Literature and Tools Survey
  3.1 Source code analysis
    3.1.1 Soot
    3.1.2 Rascal
  3.2 Clustering
    3.2.1 Particularly Considered Methods
      3.2.1.1 Affinity Propagation
      3.2.1.2 Clique Percolation Method
      3.2.1.3 Based on Graph Cut
    3.2.2 Other Clustering Methods
      3.2.2.1 Network Structure Indices based
      3.2.2.2 Hierarchical clustering methods
4 Background
  4.1 Max Flow & Min Cut algorithm
    4.1.1 Goldberg’s implementation
  4.2 Min Cut Tree algorithm
    4.2.1 Gusfield algorithm
    4.2.2 Community heuristic
  4.3 Flake-Tarjan clustering
    4.3.1 Alpha-clustering
    4.3.2 Hierarchical version
  4.4 Call Graph extraction
  4.5 The Problem of Utility Artifacts
  4.6 Various Algorithms
5 Theory
  5.1 Normalization
    5.1.1 Directed Graph to Undirected
    5.1.2 Leverage
    5.1.3 An argument against fan-out analysis
    5.1.4 Lifting the Granularity
    5.1.5 An Alternative
  5.2 Merging Heterogeneous Dependencies
  5.3 Alpha-search
    5.3.1 Search Tree
    5.3.2 Prioritization
  5.4 Hierarchizing the Partitions
  5.5 Distributed Computation
  5.6 Perfect Dependency Structures
    5.6.1 Maximum Spanning Tree
    5.6.2 Root Selection Heuristic
6 Implementation and Specification
  6.1 Key Choices
    6.1.1 Reducing Real- to Integer- Weighted Flow Graph
    6.1.2 Results Presentation
  6.2 File formats
  6.3 Visualization
  6.4 Processing Pipeline
7 Evaluation
  7.1 Experiments
    7.1.1 Analyzed Software and Dimensions
  7.2 Interpretation of the Results
    7.2.1 Architectural Insights
    7.2.2 Class purpose from library neighbors
      7.2.2.1 Obvious from class name
      7.2.2.2 Hardly obvious from class name
      7.2.2.3 Not obvious from class name
      7.2.2.4 Class name seems to contradict the purpose
    7.2.3 Classes that act together
