Automatic Structure Discovery for Large Source Code
By Sarge Rogatch
Universiteit van Amsterdam, Master Thesis Artificial Intelligence, 2010

Automatic Structure Discovery for Large Source Code Page 1 of 130 Master Thesis, AI Sarge Rogatch, University of Amsterdam July 2010

Acknowledgements

I would like to acknowledge the researchers and developers who are not even aware of this project but whose findings have played a very significant role:

Soot developers: Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and others.
TreeViz developer: Werner Randelshofer
H3 Layout author and H3Viewer developer: Tamara Munzner
Researchers of static call graph construction: Ondřej Lhoták, Vijay Sundaresan, David Bacon, Peter Sweeney
Researchers of Reverse Architecting: Heidar Pirzadeh, Abdelwahab Hamou-Lhadj, Timothy Lethbridge, Luay Alawneh
Researchers of Min Cut related problems: Dan Gusfield, Andrew Goldberg, Maxim Babenko, Boris Cherkassky, Kostas Tsioutsiouliklis, Gary Flake, Robert Tarjan

Contents

1 Abstract
2 Introduction
  2.1 Project Summary
  2.2 Global Context
  2.3 Relevance for Artificial Intelligence
  2.4 Problem Analysis
  2.5 Hypotheses
  2.6 Business Applications
  2.7 Thesis Outline
3 Literature and Tools Survey
  3.1 Source code analysis
    3.1.1 Soot
    3.1.2 Rascal
  3.2 Clustering
    3.2.1 Particularly Considered Methods
      3.2.1.1 Affinity Propagation
      3.2.1.2 Clique Percolation Method
      3.2.1.3 Based on Graph Cut
    3.2.2 Other Clustering Methods
      3.2.2.1 Network Structure Indices based
      3.2.2.2 Hierarchical clustering methods
4 Background
  4.1 Max Flow & Min Cut algorithm
    4.1.1 Goldberg’s implementation
  4.2 Min Cut Tree algorithm
    4.2.1 Gusfield algorithm
    4.2.2 Community heuristic
  4.3 Flake-Tarjan clustering
    4.3.1 Alpha-clustering
    4.3.2 Hierarchical version
  4.4 Call Graph extraction
  4.5 The Problem of Utility Artifacts
  4.6 Various Algorithms
5 Theory
  5.1 Normalization
    5.1.1 Directed Graph to Undirected
    5.1.2 Leverage
    5.1.3 An argument against fan-out analysis
    5.1.4 Lifting the Granularity
    5.1.5 An Alternative
  5.2 Merging Heterogeneous Dependencies
  5.3 Alpha-search
    5.3.1 Search Tree
    5.3.2 Prioritization
  5.4 Hierarchizing the Partitions
  5.5 Distributed Computation
  5.6 Perfect Dependency Structures
    5.6.1 Maximum Spanning Tree
    5.6.2 Root Selection Heuristic
6 Implementation and Specification
  6.1 Key Choices
    6.1.1 Reducing Real- to Integer- Weighted Flow Graph
    6.1.2 Results Presentation
  6.2 File formats
  6.3 Visualization
  6.4 Processing Pipeline
7 Evaluation
  7.1 Experiments
    7.1.1 Analyzed Software and Dimensions
  7.2 Interpretation of the Results
    7.2.1 Architectural Insights
    7.2.2 Class purpose from library neighbors
      7.2.2.1 Obvious from class name
      7.2.2.2 Hardly obvious from class name
      7.2.2.3 Not obvious from class name
      7.2.2.4 Class name seems to contradict the purpose
    7.2.3 Classes that act together