COMBINING LOGICAL AND PROBABILISTIC REASONING IN PROGRAM ANALYSIS

A Dissertation Presented to The Academic Faculty

By

Xin Zhang

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the School of Computer Science

Georgia Institute of Technology

December 2017

Copyright © Xin Zhang 2017

Approved by:

Dr. Mayur Naik, Advisor
Department of Computer and Information Science
University of Pennsylvania

Dr. Hongseok Yang
Department of Computer Science
Oxford University

Dr. Santosh Pande
School of Computer Science
Georgia Institute of Technology

Dr. William Harris
School of Computer Science
Georgia Institute of Technology

Dr. Aditya Nori
Microsoft Research

Date Approved: August 23, 2017

Knowledge comes, but wisdom lingers.
Alfred, Lord Tennyson

To my parents.

ACKNOWLEDGEMENTS

I am forever indebted to my advisor, Mayur Naik, for his support and guidance throughout my Ph.D. study. It is Mayur who brought me to the wonderful world of research. As his first Ph.D. student, I have received an enormous amount of attention from Mayur that other students could only dream of. From topic selection to problem solving, formalization to empirical evaluation, writing to presentation, he has coached me heavily in every aspect of being a researcher. Mayur’s passion and high standards for research will always inspire me to be a better researcher.

Besides Mayur, I was fortunate to be mentored by Hongseok Yang and Aditya Nori. I worked with Hongseok closely in the first half of my Ph.D. study, and learnt the most about programming language theories from him. It always amazes me how Hongseok can draw principles and insights from raw and seemingly hacky ideas. I worked with Aditya closely in the second half of my Ph.D. study and benefited greatly from his crisp feedback. Although we only met for one hour a week, this was often the hour of the week in which I learnt the most.

I would like to thank all my collaborators, but especially Radu Grigore, Ravi Mangal, and Xujie Si. They have helped greatly in the projects that eventually led to this thesis. I also had great fun working with them. I am also grateful to the rest of my collaborators, including Sulekha Kulkarni, Aditya Kamath, Jongse Park, Hadi Esmaeilzadeh, Vasco Manquinho, Mikolas Janota, and Alexey Ignatiev. Bill Harris and Santosh Pande were kind enough to serve on my thesis committee. They and other folks at Georgia Tech have made GT a wonderful place for graduate study.

Last but not least, I would like to thank my parents for the unconditional love and support I have received ever since I can remember. Throughout my life, they have always encouraged me to pursue things that I am passionate about.
When facing challenges in life, I always gain strength knowing that they will be there for me.

TABLE OF CONTENTS

Acknowledgments ...... v

List of Tables ...... xi

List of Figures ...... xiii

Chapter 1: Introduction ...... 1

1.1 Motivating Applications ...... 2

1.1.1 Automated Verification ...... 3

1.1.2 Interactive Verification ...... 5

1.1.3 Static Bug Detection ...... 6

1.2 System Architecture ...... 8

1.3 Thesis Contributions ...... 10

1.4 Thesis Organization ...... 11

Chapter 2: Background ...... 12

2.1 Constraint-Based Program Analysis ...... 12

2.2 Datalog ...... 14

2.3 Markov Logic Networks ...... 16

Chapter 3: Applications ...... 20

3.1 Automated Verification ...... 20

3.1.1 Introduction ...... 20

3.1.2 Overview ...... 23

3.1.3 Parametric Dataflow Analyses ...... 30

3.1.3.1 Abstractions and Queries ...... 31

3.1.3.2 Problem Statement ...... 33

3.1.4 Algorithm ...... 34

3.1.4.1 From Datalog Derivations to Hard Constraints ...... 35

3.1.4.2 The Algorithm ...... 37

3.1.4.3 Choosing Good Abstractions via Mixed Hard and Soft Constraints ...... 38

3.1.5 Empirical Evaluation ...... 42

3.1.5.1 Evaluation Setup ...... 43

3.1.5.2 Evaluation Results ...... 45

3.1.6 Related Work ...... 51

3.1.7 Conclusion ...... 53

3.2 Interactive Verification ...... 53

3.2.1 Introduction ...... 53

3.2.2 Overview ...... 56

3.2.3 The Optimum Root Set Problem ...... 62

3.2.3.1 Declarative Static Analysis ...... 62

3.2.3.2 Problem Statement ...... 63

3.2.3.3 Monotonicity ...... 64

3.2.3.4 NP-Completeness ...... 64

3.2.4 Interactive Analysis ...... 65

3.2.4.1 Main Algorithm ...... 67

3.2.4.2 Soundness ...... 67

3.2.4.3 Finding an Optimum Root Set ...... 70

3.2.4.4 From Augmented Datalog to Markov Logic Network . . . 70

3.2.4.5 Feasible Payoffs ...... 72

3.2.4.6 Discussion ...... 74

3.2.5 Instance Analyses ...... 76

3.2.6 Empirical Evaluation ...... 78

3.2.6.1 Evaluation Setup ...... 79

3.2.6.2 Evaluation Results ...... 81

3.2.7 Related Work ...... 88

3.2.8 Conclusion ...... 92

3.3 Static Bug Detection ...... 92

3.3.1 Introduction ...... 92

3.3.2 Overview ...... 95

3.3.3 Analysis Specification ...... 100

3.3.4 The EUGENE System ...... 103

3.3.4.1 Online Component of EUGENE: Inference ...... 104

3.3.4.2 Offline Component of EUGENE: Learning ...... 105

3.3.5 Empirical Evaluation ...... 106

3.3.5.1 Evaluation Setup ...... 106

3.3.5.2 Evaluation Results ...... 110

3.3.6 Related Work ...... 117

3.4 Conclusion ...... 118

Chapter 4: Solver Techniques ...... 119

4.1 Iterative Lazy-Eager Grounding ...... 124

4.1.1 Introduction ...... 124

4.1.2 The IPR Algorithm ...... 126

4.1.3 Empirical Evaluation ...... 132

4.1.4 Related Work ...... 136

4.1.5 Conclusion ...... 137

4.2 Query-Guided Maximum Satisfiability ...... 137

4.2.1 Introduction ...... 137

4.2.2 Example ...... 139

4.2.3 The Q-MAXSAT Problem ...... 144

4.2.4 Solving a Q-MAXSAT Instance ...... 145

4.2.4.1 Implementing an Efficient CHECK Function ...... 146

4.2.4.2 Efficient Optimality Check via Distinct APPROX Functions 151

4.2.5 Empirical Evaluation ...... 169

4.2.5.1 Evaluation Setup ...... 169

4.2.5.2 Evaluation Result ...... 172

4.2.6 Related Work ...... 176

4.2.7 Conclusion ...... 179

Chapter 5: Future Directions ...... 180

Chapter 6: Conclusion ...... 184

Appendix A: Proofs ...... 187

A.1 Proofs for Results of Chapter 2 ...... 187

A.2 Proofs of Results of Chapter 3.1 ...... 188

A.2.1 Proofs of Theorems 4 and 5 ...... 188

A.2.2 Proof of Theorem 6 ...... 195

A.3 Proofs of Results of Chapter 3.2 ...... 199

A.4 Proofs of Results of Chapter 4.1 ...... 200

Appendix B: Alternate Use Case of URSA: Combining Two Static Analyses . . 205

References ...... 210

LIST OF TABLES

3.1 Markov Logic Network encodings of different program analysis applications. 21

3.2 Each iteration (run) eliminates a number of abstractions. Some are eliminated by analyzing the current Datalog run (within run); some are eliminated because derivations from the current run interact with derivations from previous runs (across runs)...... 26

3.3 Benchmark characteristics. All numbers are computed using a 0-CFA call- graph analysis...... 42

3.4 Results showing statistics of queries, abstractions, and iterations of our approach (CURRENT) and the baseline approaches (BASELINE) on the pointer analysis...... 45

3.5 Results showing statistics of queries, abstractions, and iterations of our ap- proach (CURRENT) and the baseline approaches (BASELINE) on the type- state analysis...... 46

3.6 Running time (in seconds) of the Datalog solver in each iteration...... 48

3.7 Running time (in seconds) of the Markov Logic Network solver in each iteration...... 49

3.8 Benchmark characteristics. Column |A| shows the numbers of alarms. Column |QU| shows the sizes of the universes of potential causes, where k stands for thousands. All the reported numbers except for |A| and |QU| are computed using a 0-CFA call-graph analysis...... 79

3.9 Results of URSA on ftp with noise in Decide. The baseline analysis produces 193 true alarms and 594 false alarms. We run each setting 30 times and take the averages...... 85

3.10 Statistics of our probabilistic analyses...... 107

3.11 Benchmark statistics. Columns “total” and “app” are with and without JDK library code...... 109

4.1 Clauses in the initial grounding and additional constraints grounded in each iteration of IPR for graph reachability example...... 131

4.2 Statistics of application constraints and datasets...... 132

4.3 Results of evaluating CPI, IPR1, ROCKIT, IPR2, and TUFFY, on three benchmark applications. CPI and IPR1 use LBX as the underlying solver, while ROCKIT and IPR2 use GUROBI. In all experiments, we used a memory limit of 64GB and a time limit of 24 hours. Timed out experiments (denoted ‘–’) exceeded either of these limits...... 132

4.4 Characteristics of the benchmark programs. Columns “total” and “app” are with and without counting JDK library code, respectively...... 170

4.5 Number of queries, variables, and clauses in the MAXSAT instances generated by running the datarace analysis and the pointer analysis on each benchmark program. The datarace analysis has no queries on antlr and chart as they are sequential programs...... 171

4.6 Performance of our approach (PILOT) and the baseline approach (BASELINE). In all experiments, we used a memory limit of 3 GB and a time limit of one hour for each invocation of the MAXSAT solver in both approaches. Experiments that timed out exceeded either of these limits...... 172

4.7 Performance of our approach and the baseline approach with different un- derlying MAXSAT solvers...... 175

B.1 Numbers of alarms (denoted by |A|) and tuples in the universe of potential causes (denoted by |QU|) of the pointer analysis, where k stands for thousands...... 206

LIST OF FIGURES

1.1 Graphs depicting how different applications of our approach enable a program analysis to avoid reporting false information flow from node 1 to node 8 in two programs...... 3

1.2 Architecture of our system for incorporating probabilistic reasoning in a logical program analysis...... 8

2.1 Datalog syntax...... 14

2.2 Datalog semantics...... 14

2.3 Markov Logic Network syntax...... 15

2.4 Markov Logic Network semantics...... 16

3.1 Example program...... 24

3.2 Graph reachability example in Datalog...... 25

3.3 Derivations after different iterations of our approach on our graph reachability example...... 26

3.4 Formula from the Datalog run’s result in the first iteration...... 29

3.5 Running time of the Datalog solver and abstraction size for pointer analysis on schroeder-m in each iteration...... 48

3.6 Running time of the Datalog solver and abstraction size for typestate analysis on schroeder-m in each iteration...... 49

3.7 Running time of the Markov Logic Network solver for pointer analysis on schroeder-m in each iteration...... 50

3.8 Example Java program extracted from Apache FTP Server...... 56

3.9 Simplified static datarace analysis in Datalog...... 57

3.10 Derivation of dataraces in example program...... 58

3.11 Syntax and semantics of Datalog with causes...... 62

3.12 Implementing IsFeasible and FeasibleSet by solving a Markov Logic Network. All xt, yt, zt are tuples except that in 3.6, they are variables taking values in {0, 1} which represent whether the corresponding tuples are derived...... 72

3.13 Heuristic instantiations for the datarace analysis...... 78

3.14 Number of questions asked over total number of false alarms (denoted by the lower dark bars) and percentage of false alarms resolved (denoted by the upper light bars) by URSA. Note that URSA terminates when the expected payoff is ≤ 1, which indicates that the user should stop looking at potential causes and focus on the remaining alarms...... 81

3.15 Number of questions asked and number of false alarms resolved by URSA in each iteration...... 82

3.16 Time consumed by URSA in each iteration...... 86

3.17 Number of questions asked and number of false alarms eliminated by URSA with different Heuristic instantiations...... 87

3.18 Java code snippet of Apache FTP server...... 95

3.19 Simplified race detection analysis...... 96

3.20 Race reports produced for Apache FTP server. Each report specifies the field involved in the race, and line numbers of the program points with the racing accesses. The user feedback is to “dislike” report R2...... 97

3.21 Probabilistic analysis example...... 101

3.22 Workflow of the EUGENE system for user-guided program analysis. . . . . 103

3.23 Results of EUGENE on datarace analysis...... 110

3.24 Results of EUGENE on polysite analysis...... 111

3.25 Results of EUGENE on datarace analysis with feedback (0.5%, 1%, 1.5%, 2%, 2.5%)...... 112

3.26 Running time of EUGENE...... 113

3.27 Time spent by each user in inspecting reports of infoflow analysis and providing feedback...... 115

3.28 Results of EUGENE on infoflow analysis with real user feedback. Each bar maps to a user...... 116

4.1 Graph reachability in Markov Logic Network...... 124

4.2 Example graph reachability input and solution...... 124

4.3 Graph representation of a large MAXSAT formula ϕ...... 140

4.4 Graph representation of each iteration in our algorithm when it solves the Q-MAXSAT instance (ϕ, {v6})...... 141

4.5 Syntax and interpretation of MAXSAT formulae...... 144

4.6 The memory consumption of PILOT when it resolves each query separately on instances generated from (a) pointer analysis and (b) AR. The dotted line represents the memory consumption of PILOT when it resolves all queries together...... 174

B.1 Number of questions asked over total number of false alarms (denoted by the lower dark part of each bar) and percentage of false alarms resolved (denoted by the upper light part of each bar) by URSA for the pointer analysis...... 207

B.2 Number of questions asked and number of false alarms resolved by URSA in each iteration for the pointer analysis (k = thousands)...... 208

SUMMARY

Software is becoming increasingly pervasive and complex. These trends expose masses of users to unintended software failures and deliberate cyber-attacks. A widely adopted solution to enforce software quality is automated program analysis. Existing program analyses are expressed in the form of logical rules that are handcrafted by experts. While such a logic-based approach provides many benefits, it cannot handle uncertainty and lacks the ability to learn and adapt. This in turn hinders the accuracy, scalability, and usability of program analysis tools in practice. We seek to address these limitations by proposing a methodology and framework for incorporating probabilistic reasoning directly into existing program analyses that are based on logical reasoning. The framework consists of a frontend, which automatically integrates probabilities into a logical analysis by synthesizing a system of weighted constraints, and a backend, which is a learning and inference engine for such constraints. We demonstrate that the combined approach can benefit a number of important applications of program analysis and thereby facilitate more widespread adoption of this technology. We also describe new algorithmic techniques to solve very large instances of weighted constraints that arise not only in our domain but also in other domains such as Big Data analytics and statistical AI.

CHAPTER 1: INTRODUCTION

Software is becoming increasingly pervasive and complex. A pacemaker comprises a hundred thousand lines of code; a mobile application can comprise a few million lines of code; and all the programs in an automobile together consist of up to 100 million lines of code. While programs have become an indispensable part of our daily life, their growing complexity poses a dire challenge for the software industry to build software artifacts that are correct, efficient, secure, and reliable. In particular, traditional methods for ensuring software quality (e.g., testing and code review) require a considerable amount of manual effort and therefore struggle to scale with such increasing software complexity.

One promising solution for enhancing software quality is automated program analysis. Program analyses are algorithms that discover a wide range of useful facts about programs, including proofs, bugs, and specifications. They have achieved remarkable success in different application domains: SLAM [1] is a program analysis tool based on model checking that has been applied widely for verifying safety properties of Windows device drivers; ASTRÉE [2], which is based on abstract interpretation, has been applied to Airbus flight controller software to prove the absence of runtime errors; Coverity [3], which is based on data-flow analysis, has found many bugs in real-world enterprise applications; and more recently, Facebook developed Infer [3], which is based on separation logic, to improve memory safety of mobile applications.

Although the techniques underlying these tools vary, their algorithms are all expressed in the form of logical axioms and inference rules that are handcrafted by experts. This logic-based approach provides many benefits. First, logical rules are human-comprehensible, making it convenient for analysis writers to express their domain knowledge. Secondly, the results produced by solving logical rules often come with explanations (e.g., provenance

information), making analysis tools easy to use. Last but not least, logical rules enable program analyses to provide rigorous formal guarantees such as soundness.

While logic-based program analyses have achieved remarkable success, they have significant limitations: they cannot handle uncertain knowledge and they lack the abilities to learn and adapt. Although the semantics of most programs are deterministic, uncertainties arise in many scenarios due to reasons such as imprecise specifications, missing program parts, imperfect environment models, and many others. Current program analyses rely on experts to manually choose their representations, and these representations cannot be changed once they are deployed to end-users. However, the diversity of usage scenarios prevents such fixed representations from addressing the needs of individual end-users. Moreover, the analysis does not improve as it reasons about more programs, and is therefore prone to repeating past mistakes.

To overcome the drawbacks of the existing logical approach, this thesis proposes to combine logical and probabilistic reasoning in program analysis. While the logical part preserves the benefits of the current approach, the probabilistic part enables handling uncertainties and provides the additional ability to learn and adapt. Moreover, such a combined approach makes it possible to incorporate probability directly into existing program analyses, leveraging a rich literature.

In the rest of this thesis, we demonstrate how such a combined approach can improve the state-of-the-art of important program analysis applications, describe a general recipe for incorporating probabilistic reasoning into existing program analyses that are based on logical reasoning, and present a system for supporting this recipe.

1.1 Motivating Applications

We present an informal overview of three prominent applications of program analysis to motivate our approach: automated verification, interactive verification, and static bug detection. For ease of exposition, we presume the given analysis operates on an abstract

Figure 1.1: Graphs (a) and (b) depicting how different applications of our approach enable a program analysis to avoid reporting false information flow from node 1 to node 8 in two programs.

program representation in the form of a directed graph. We illustrate our three applications using a static information-flow analysis applied to the two graphs depicted in Figure 1.1. We elide details of this analysis that are not relevant. Chapter 3 discusses each application in more detail.

1.1.1 Automated Verification

A central problem in automated verification concerns picking an abstraction of the program that balances accuracy and scalability. An ideal abstraction should keep only as much information relevant to proving a program property of interest. Efficiently finding such an abstraction is challenging because the space of possible abstractions is typically exponential in program size or even infinite. Consider the graph in Figure 1.1(a). Suppose the possible abstractions are indicated by dotted ovals, each of which independently enables the analysis to lose distinctions between the contained nodes, and thereby trade accuracy for scalability. We thus have a total of 23

= 8 abstractions. We denote each abstraction using a bitstring b2b4b6 where bit bi is 0 iff the distinction between nodes i, i + 1 is lost by that abstraction. The least precise but cheapest abstraction 000 loses all distinctions, whereas the most precise but costliest abstraction 111 keeps all distinctions. Suppose we wish to prove that this graph does not have a path from node 1 to node 8. The absence of such a path may, for instance, implies the absence of

malicious information flow in the original program. The ideal abstraction for this purpose is 010, that is, it loses distinctions between nodes 2, 3 and nodes 6, 7 but differentiates between nodes 4, 5.
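The abstraction-selection example above can be simulated in a few lines. Since the figure itself is not reproduced in this text, the edge set below is a hypothetical one chosen only to be consistent with the prose (edge 1→4 is present, and node 8 becomes reachable from node 1 exactly when the abstraction merges nodes 4 and 5); it is a minimal sketch, not the analysis used in this thesis.

```python
from itertools import product

# Hypothetical edge set consistent with the description of Figure 1.1(a).
EDGES = [(1, 2), (3, 4), (1, 4), (5, 6), (7, 8), (5, 8)]

def reaches(edges, src, dst):
    # Standard worklist reachability.
    seen, frontier = {src}, [src]
    while frontier:
        n = frontier.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return dst in seen

def abstract_reaches(bits, src, dst):
    # Reachability in the quotient graph: bit b_i = '0' merges nodes i, i+1.
    rep = {n: n for n in range(1, 9)}
    for bit, i in zip(bits, (2, 4, 6)):
        if bit == '0':
            rep[i + 1] = i  # collapse the pair into one abstract node
    quotient = [(rep[a], rep[b]) for a, b in EDGES]
    return reaches(quotient, rep[src], rep[dst])

# Abstractions that prove the absence of a 1->8 path.
proving = [''.join(bs) for bs in product('01', repeat=3)
           if not abstract_reaches(''.join(bs), 1, 8)]
```

Under this assumed edge set, the proving abstractions are exactly those with b4 = 1, the cheapest being 010, matching the discussion above.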

Limits of purely logical reasoning. A purely logical approach, such as one that is based on the popular CEGAR (counter-example guided abstraction refinement) technique [4], starts with the cheapest abstraction and iteratively refines parts of it by generalizing the cause of failure of the abstraction used in the current iteration. For instance, in our example, it starts with abstraction 000, which fails to prove the absence of a path from node 1 to node

8. However, it faces a non-deterministic choice of whether to refine b2, b4, or b6 next. A poor choice hinders scalability or even termination of the analysis (in the case of an infinite space of abstractions).

Incorporating probabilistic reasoning. A probabilistic approach can help guide a logical approach to better abstraction selection. For instance, it can leverage the success probability of each abstraction, which in turn can be obtained from a probability model built

from training data. In our example, such a model may predict that refining b4 has a higher success probability than refining b2 or b6.

The case for a combined approach. The above two approaches in fact address complementary aspects of abstraction selection. A combined approach stands to gain their benefits without suffering from their limitations. For instance, a logical approach can infer with certainty that refining b2 is futile due to the presence of the edge from node 1 to node 4.

However, it is unable to decide whether refining b4 or b6 next is more likely to prove our property of interest. Here, a probabilistic approach can provide the needed bias towards refining b4 over b6, enabling the combined approach to select abstraction 010 next. In summary, the combined approach attempts only two cheap abstractions, 000 and 010, before successfully proving the given property. Besides allowing logical and probabilistic

elements to interact in a fine-grained manner and amplify the benefits of these individual approaches, the combined approach also allows encoding other objective functions uniformly with probabilities. These may include, for instance, the relative costs of different abstractions, and rewards for proving different properties. The combined approach thus allows striking a more effective balance between accuracy and scalability than the individual approaches.

1.1.2 Interactive Verification

Automated verification is inherently incomplete due to undecidability reasons. Interactive verification seeks to address this limitation by introducing a human in the loop. A central challenge for an interactive verifier concerns reducing user effort by deciding which questions to the user are expected to yield the highest payoff. Consider the graph in Figure 1.1(b). Suppose we once again wish to prove that this graph does not have a path from node 1 to node 8. Suppose the dotted edge from node 4 to node 5 is spurious, that is, it is present due to the incompleteness of the verifier. This spurious edge results in the verifier reporting a false alarm. Suppose the questions that the user is capable of answering are of the form: “Is edge (x, y) spurious?”. Then, the ideal set of questions to ask in this example is the single question: “Is edge (4, 5) spurious?”.

Limits of purely logical reasoning. A purely logical approach can help prune the space of possible questions to ask. In particular, for our example, it can determine that it is fruitless to ask the user whether any of edges (2, 4), (3, 4), (5, 6), and (5, 7) is spurious. But it faces a non-deterministic choice of whether to ask the user about the spuriousness of edge (1, 4), (4, 5), or (5, 8). In the worst case, this approach ends up asking all three questions, instead of just (4, 5).

Incorporating probabilistic reasoning. A probabilistic approach can help guide a logical approach to better question selection in interactive verification. In particular, it can

leverage the likelihood of different answers to each question, which in turn can be obtained from a probability model built from dynamic or static heuristics. In our example, for instance, test runs of the original program may reveal that edges (1, 4) and (5, 8) are definitely not spurious, but edge (4, 5) may be spurious. Similarly, a static heuristic might state that an edge (x, y) with a high in-degree for x and a high out-degree for y is likely spurious—a criterion that only edge (4, 5) meets in our example.

The case for a combined approach. The above two approaches can be combined to compute the expected payoff of each question. For instance, the inference by the probabilistic approach that edge (4, 5) is likely spurious can be combined with the inference by the logical approach that no path exists from node 1 to node 8 if edge (4, 5) is absent, thereby proving our property of interest. This approach can thus infer that the question of whether edge (4, 5) is spurious is the one with the highest payoff. The combined approach allows encoding other objective functions that may be desirable in interactive verification. Consider a scenario in which multiple false alarms arise from a common root cause. In our example, such a scenario arises when we wish to verify that there is no path from any node in {1, 2, 3} to any node in {6, 7, 8}. Maximizing the payoff in this scenario involves asking the fewest questions that are likely to rule out the most of these paths. In our example, even assuming equal likelihood of each answer to any question, we can conclude that the payoff in this scenario is maximized by asking whether edge (4, 5) is spurious: it has a payoff of 9/1 compared to, for instance, a payoff of 5/2 for the set of questions {(1, 4), (5, 8)} (since 5 of the 9 paths are ruled out if both edges in this set are deemed spurious by the user).
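The payoff arithmetic above can be checked mechanically. The edge set of Figure 1.1(b) is reconstructed from the edges named in the prose; this is an illustrative sketch, not the payoff computation used in this thesis.

```python
# Edges of Figure 1.1(b), as named in the surrounding text; (4, 5) is the
# spurious edge.
EDGES = {(1, 4), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (5, 8)}
SOURCES, SINKS = (1, 2, 3), (6, 7, 8)

def reaches(edges, src, dst):
    # Standard worklist reachability.
    seen, frontier = {src}, [src]
    while frontier:
        n = frontier.pop()
        for a, b in edges:
            if a == n and b not in seen:
                seen.add(b)
                frontier.append(b)
    return dst in seen

def alarms(edges):
    # Reported paths from sources to sinks.
    return {(s, t) for s in SOURCES for t in SINKS if reaches(edges, s, t)}

def payoff(questions):
    # Alarms ruled out per question asked, assuming every queried edge is
    # deemed spurious by the user and therefore removed.
    ruled_out = alarms(EDGES) - alarms(EDGES - set(questions))
    return len(ruled_out) / len(questions)
```

Here payoff({(4, 5)}) evaluates to 9.0 while payoff({(1, 4), (5, 8)}) evaluates to 2.5, matching the 9/1 and 5/2 figures in the text.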

1.1.3 Static Bug Detection

Another widespread application of program analysis is bug detection. Its key challenge lies in the need to avoid false positives (or false bugs) and false negatives (or missed bugs).

They arise because of various approximations and assumptions that an analysis writer must necessarily make. However, they are absolutely undesirable to analysis users. Consider the graph in Figure 1.1(b). Suppose this time all edges in the graph are real but certain paths are spurious, resulting in a mix of true positives and false positives among the paths from nodes in {1, 2, 3} to nodes in {6, 7, 8}.

Limits of purely logical reasoning. A purely logical approach allows the analysis writer to express idioms for bug detection. An idiom in our graph example is: “If there is an edge (x, y) then there is a path (x, y).” Another idiom captures the transitivity rule: “If there is a path (x, y) and an edge (y, z), then there is a path (x, z).” These idioms enable the analysis to suppress certain false positives, e.g., they prevent reporting a path from node 8 to node 1. However, they cannot incorporate feedback from an analysis user about retained false positives and generalize it to suppress similar false positives. For instance, suppose subpath (1, 5) is spurious. Even if the analysis user labels paths (1, 6) and (1, 7) as spurious, a purely logical approach cannot deduce that the subpath (1, 5) is the likely source of imprecision, and generalize it to suppress reporting path (1, 8). As a result, an analysis user must manually inspect each of the 9 paths to sift the true positives from the false positives.
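The two idioms above are exactly the rules of transitive closure, and can be sketched as a naive Datalog-style fixpoint over the Figure 1.1(b) edges named in the prose; a minimal sketch, not the analysis implementation used in this thesis.

```python
# The two idioms, read as Datalog rules:
#   path(x, y) :- edge(x, y).
#   path(x, z) :- path(x, y), edge(y, z).
EDGES = {(1, 4), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7), (5, 8)}

def transitive_closure(edges):
    path = set(edges)                       # first idiom: every edge is a path
    while True:
        derived = {(x, z)
                   for (x, y) in path
                   for (y2, z) in edges if y == y2}
        if derived <= path:
            return path                     # least fixpoint reached
        path |= derived                     # second idiom: transitivity

P = transitive_closure(EDGES)
```

The rules derive path(1, 8), but because they are directional they never derive path(8, 1), which is why the spurious reverse path is suppressed.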

Incorporating probabilistic reasoning. A probabilistic approach can provide the ability to associate a probability with each analysis fact and compute it based on a model trained using past data. In our example, it can compute a probability for each path in the graph. For instance, a model might predict that paths of longer length are less likely. Ranking-based bug detection tools exemplify this approach.

The case for a combined approach. The above two approaches can be combined to incorporate positive (resp. negative) feedback from an analysis user about true (resp. false)

positives, and learn from it to retain (resp. suppress) similar true (resp. false) positives. For this purpose, we associate a probability with each idiom written by the analysis writer using the logical approach, and obtain the probability by training on previously labeled data. In our example, then, suppose the analysis user labels paths (1, 6) and (1, 7) as spurious. These labels are themselves treated probabilistically. The objective function seeks to balance the confidence of the analysis writer in the idioms with the confidence of the analysis user in the labels. In our example, the optimal solution involves suppressing the subpath (1, 5), which in turn prevents deducing path (1, 8).

Figure 1.2: Architecture of our system for incorporating probabilistic reasoning in a logical program analysis. (Figure omitted: the logical analysis specification, input program, and analysis application flow into Petablox, whose output is solved by Nichrome to produce the analysis result.)
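The writer-versus-user balance described above can be sketched as a toy weighted-constraint (MaxSAT-style) optimization over a single boolean choice. The weights below are illustrative assumptions, not values from this thesis.

```python
# Illustrative weights (assumptions): the writer's idioms derive subpath
# (1, 5) with weight 5; the user labels paths (1, 6) and (1, 7) as
# spurious with weight 4 each.
W_IDIOM, W_FEEDBACK = 5, 4

def objective(sub_15_real):
    # Weight earned by keeping or suppressing subpath (1, 5).
    score = W_IDIOM if sub_15_real else 0   # writer's confidence satisfied
    # Paths (1, 6) and (1, 7) hold iff subpath (1, 5) does, since edges
    # 5->6 and 5->7 are not in question; the user's labels say they
    # should not hold.
    if not sub_15_real:
        score += 2 * W_FEEDBACK             # both user labels satisfied
    return score

best = max([True, False], key=objective)
# Suppressing subpath (1, 5) wins (score 8 vs. 5), which in turn
# suppresses path (1, 8).
```

With these weights, two user labels outweigh one idiom, so the optimizer suppresses subpath (1, 5), mirroring the optimal solution described in the text.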

1.2 System Architecture

We need to address two central technical challenges in order to effectively combine logical and probabilistic reasoning in program analysis and thereby enable the three aforementioned applications as well as other emerging applications:

C1. How can we incorporate probabilistic reasoning into an existing logical analysis without requiring excessive effort from the analysis designer? While the probabilistic part provides more expressiveness, the analysis designer is challenged with new design decisions. Specifically, it can be a delicate task to combine logic and probability in a meaningful manner and set suitable parameters for the probabilistic part. Ideally, we should provide an automated way to incorporate probabilistic reasoning

into a conventional logical analysis and thereby avoid changing the analysis design process significantly.

C2. How can we scale the combined analysis to real-world programs? The improvement in expressiveness of the combined analysis comes at the cost of increased complexity. While the conventional logical approach typically requires solving a decision problem, the new combined approach requires solving an optimization problem or a counting problem. Since the latter is often computationally harder than the former (e.g., in propositional logic, maximum satisfiability and model counting vs. satisfiability), we need novel algorithms to scale the combined approach to large real-world programs.

We address the above two challenges by proposing a system whose high-level architecture is depicted in Figure 1.2. Our system takes a logical program analysis specification and a program as inputs, and outputs program facts that the input analysis intends to discover (e.g., bugs, proofs, or specifications). Besides the two inputs, it is parameterized by the analysis application. The system consists of a frontend and a backend, which address the above two challenges respectively. We describe them briefly below; Chapter 3 and Chapter 4 discuss the frontend and the backend in more detail, respectively.

The frontend, PETABLOX, addresses the first challenge (C1) for program analyses specified declaratively in a constraint language. We target constraint-based program analyses for two reasons: first, it allows PETABLOX to leverage many existing benefits of the constraint-based approach [5]; second, it enables PETABLOX to provide a general and automatic mechanism for combining logic and probability by analyzing the analysis constraints. Specifically, PETABLOX requires the input analysis to be specified in Datalog [6], a declarative logic programming language that is widely popular for formulating program analyses [7, 8, 9, 10, 11]. By analyzing the input analysis and program, PETABLOX automatically synthesizes a novel program analysis instance that combines logical and probabilistic reasoning via a system of weighted constraints. This system of weighted constraints is specified via a Markov Logic Network [12], a declarative language for combining logic and probability from the Artificial Intelligence community. Based on the specified application, PETABLOX formulates the analysis instance differently.

The backend, NICHROME, which is a learning and inference engine for Markov Logic Networks, then solves the synthesized analysis instance and produces the final analysis results. We address the second challenge (C2) in NICHROME by applying novel algorithms that exploit domain insights from program analysis. These algorithms enable NICHROME to solve Markov Logic Network instances generated from real-world analyses and programs in a sound, accurate, and scalable manner.

1.3 Thesis Contributions

We summarize the contributions of this thesis below:

1. We propose a new paradigm for program analysis that augments the conventional logic-based approach with probability, which we envision will benefit traditional applications and enable emerging ones.

2. We describe a general recipe for incorporating probabilistic reasoning into a conventional logical program analysis by converting a Datalog analysis into a novel analysis instance specified in Markov Logic Networks, a declarative language for combining logic and probability.

3. We present an effective system that implements the proposed paradigm and recipe, which includes a frontend that automates the conversion from Datalog analyses to Markov Logic Networks, and a backend that is an effective learning and inference engine for Markov Logic Networks.

4. We show empirically that the proposed approach significantly improves the effectiveness of program analyses for three important applications: automated verification, interactive verification, and static bug detection.

1.4 Thesis Organization

The rest of this thesis is organized as follows: Chapter 2 describes the necessary background knowledge, which includes the notions of constraint-based program analysis, Datalog, and Markov Logic Networks; Chapter 3 describes how PETABLOX incorporates probability in existing conventional program analyses for emerging applications; Chapter 4 presents the novel algorithms applied in NICHROME, which allow solving combined analysis instances in an efficient and accurate manner; Chapter 5 discusses future research directions; finally, Chapter 6 concludes the thesis. We include the proofs of most propositions, lemmas, and theorems in Appendix A, except for the ones in Chapter 4.2, as those proofs are themselves among the main contributions of that section.

CHAPTER 2 BACKGROUND

This chapter describes the concept of constraint-based program analysis, the syntax and semantics of Datalog, and the syntax and semantics of Markov Logic Networks.

2.1 Constraint-Based Program Analysis

Designing a program analysis that works in practice is a challenging task. In theory, any nontrivial analysis problem is undecidable in general [13]; in practice, there are additional concerns related to scalability, imprecisely defined specifications, missing program parts, etc. Due to these constraints, program analysis designers have to make various approximations and assumptions that balance competing aspects like scalability, accuracy, and user effort. As a result, there are various approaches to program analysis, each with its own advantages and drawbacks.

One popular approach is constraint-based program analysis, whose key idea is to divide a program analysis into two stages: constraint generation and constraint resolution. The former generates constraints from the input program that constitute a declarative specification of the desired information about the program, while the latter then computes the desired information by solving the constraints. The constraint-based approach has several benefits that make it one of the preferred approaches to program analysis [5]:

1. It separates analysis specification from implementation. Constraint generation is the specification of the analysis, while constraint resolution is the implementation. Such separation of concerns not only helps organize the analysis but also simplifies understanding it. Specifically, one only needs to inspect constraint generation to reason about correctness and accuracy without understanding the low-level implementation details of the constraint solver; on the other hand, one only needs to inspect the underlying algorithm of the constraint solver to reason about performance while ignoring the analysis specification. Most importantly, this separation allows analysis writers to focus on high-level design issues, such as formulating the analysis specification and choosing the right approximations and assumptions, without being concerned with low-level implementation details.

2. Constraints are natural for specifying program analyses. Each constraint is usually local, which means it individually captures a subset of the input program's syntactic attributes without considering the rest. The conjunction of local constraints in turn captures global properties of the program.

3. It enables program analyses to leverage the sophisticated implementations of existing constraint solvers. The constraint-based approach has emerged as a popular computing paradigm not only in program analysis but in many areas of computer science (e.g., hardware design, machine learning, and natural language processing), and even in other fields (e.g., mathematical optimization, biology, and planning). Motivated by such strong demand, the constraint-solving community has made remarkable strides in both the algorithms and the engineering of constraint solvers, all of which can be leveraged by constraint-based program analyses.

Because of the above benefits, the constraint-based approach has been widely used to formulate program analyses. Popular constraint problem formulations include boolean satisfiability (SAT), Datalog, satisfiability modulo theories (SMT), and integer linear programming (ILP). Our approach uses Datalog as the constraint language of the input analysis, which we introduce in the next section.

(program)        C ::= {c1, . . . , cn}
(constraint)     c ::= l0 :- l1, . . . , ln
(literal)        l ::= r(α1, . . . , αn)
(argument)       α ::= v | d
(variable)       v ∈ V = {x, y, . . .}
(constant)       d ∈ N = {0, 1, . . .}
(relation name)  r ∈ R = {a, b, . . .}
(tuple)          t ∈ T = R × N*

Figure 2.1: Datalog syntax.

⟦C⟧ ∈ 2^T
⟦c⟧ ∈ 2^T → 2^T
⟦l⟧ ∈ Σ → T, where Σ = (N ∪ V) → N

⟦C⟧ = lfp λT. (T ∪ ⋃_{c ∈ C} ⟦c⟧(T))
⟦l0 :- l1, . . . , ln⟧(T) = { ⟦l0⟧(σ) | (∀ i ∈ [1, n]. ⟦li⟧(σ) ∈ T) ∧ σ ∈ Σ }
⟦r(α1, . . . , αn)⟧(σ) = r(σ(α1), . . . , σ(αn))

Figure 2.2: Datalog semantics.

2.2 Datalog

Datalog [6] is a declarative logic programming language that originated as a query language for deductive databases. Compared to the standard query language SQL, it supports recursive queries. It is popular for specifying program analyses due to its declarativity, deductive constraint format, and least fixed point semantics.

Figure 2.1 shows the syntax of Datalog. A Datalog program C is a set of constraints {c1, . . . , cn}. A constraint c is a deductive rule which consists of a head literal l0 and a body l1, . . . , ln, which is a (possibly empty) set of literals. Each literal l is a relation name r followed by a list of arguments α1, . . . , αn, each of which can be either a variable or a constant. We also call a literal a tuple or a ground literal if its arguments are all constants. Note that the standard Datalog syntax includes input tuples, which can be changed to vary the output. Without loss of generality, we treat input tuples as constraints with empty bodies.

Figure 2.2 shows the semantics of Datalog. Let T be the domain of tuples. A Datalog program C computes a set of output tuples, which is denoted by ⟦C⟧. It obtains ⟦C⟧ by computing the least fixed point of its constraints. More concretely, starting with the empty set as the initial output T₀, it keeps growing T₀ by applying each constraint until T₀ no

(program)          C ::= {c1, . . . , cn}
(constraint)       c ::= ch | cs
(hard constraint)  ch ::= l1 ∨ . . . ∨ ln
(soft constraint)  cs ::= (ch, w)
(literal)          l ::= l+ | l− ∈ L
(weight)           w ∈ R+
(positive literal) l+ ::= r(α1, . . . , αn)
(negative literal) l− ::= ¬l+
(relation name)    r ∈ R = {a, b, . . .}
(argument)         α ::= v | d
(variable)         v ∈ V = {x, y, . . .}
(constant)         d ∈ N = {0, 1, . . .}
(tuple)            t ∈ T = R × N*

Figure 2.3: Markov Logic Network syntax.

longer changes. For a given constraint c = l0 :- l1, . . . , ln, it is applied as follows: if there exists a substitution σ ∈ (N ∪ V) → N such that {⟦l1⟧(σ), . . . , ⟦ln⟧(σ)} ⊆ T₀, then ⟦l0⟧(σ) is added to T₀. In the very first iteration, only constraints with empty bodies are triggered. The following standard result will be used tacitly in later arguments.

Proposition 1 (Monotonicity). If C1 ⊆ C2, then ⟦C1⟧ ⊆ ⟦C2⟧.

Example. Below is a Datalog program that computes pairwise node reachability in a directed graph:

c1 : path(a, a)
c2 : path(a, c) :- path(a, b), edge(b, c)

Besides the above two constraints, the input tuples, which are also part of the constraints, are edge tuples which encode the input graph, while the output tuples are path tuples that encode the reachability information. Constraints c1 and c2 capture two axioms about graph reachability respectively: (1) any node can reach itself, and (2) if there is a path from a to b, and there is an edge from b to c, then there is a path from a to c. The program successfully computes the reachability information by evaluating the least fixed point of these two constraints along with the input tuples.
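To make the least-fixed-point evaluation concrete, the following is a minimal sketch (in Python, with names of our own choosing) of naive bottom-up evaluation for the two reachability constraints above; production Datalog engines use far more efficient strategies, such as semi-naive evaluation or BDD-based solving.

```python
# Naive bottom-up evaluation of the reachability Datalog program.
# c1: path(a, a).  c2: path(a, c) :- path(a, b), edge(b, c).
# (Illustrative sketch only; the function name is ours.)

def eval_datalog(nodes, edges):
    """Compute the least fixed point of c1 and c2 over the input graph."""
    path = set()
    while True:
        new = {(a, a) for a in nodes}                      # apply c1
        new |= {(a, c) for (a, b) in path                  # apply c2
                       for (b2, c) in edges if b2 == b}
        if new <= path:                                    # fixed point reached
            return path
        path |= new

paths = eval_datalog(nodes={0, 1, 2}, edges={(0, 1), (1, 2)})
assert (0, 2) in paths       # 0 reaches 2 transitively via 1
assert (2, 0) not in paths   # reachability is not symmetric here
```

Each loop iteration applies every constraint to the tuples derived so far, exactly mirroring the growth of T₀ described above.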

⟦C⟧_P ∈ 2^T → [0, 1]
⟦c⟧_P ∈ Σ × 2^T → {−∞} ∪ R0+, where Σ = (N ∪ V) → N
⟦l⟧ ∈ Σ → L

⟦C⟧_P(T) = (1/Z) exp(W_C(T)), where Z = Σ_{T′ ⊆ T} exp(W_C(T′))
W_C(T) = Σ_{c ∈ C, σ ∈ Σ} ⟦c⟧_P(σ, T)

⟦l1 ∨ . . . ∨ ln⟧_P(σ, T) = 0 if T ⊨ ⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ), and −∞ otherwise
⟦(l1 ∨ . . . ∨ ln, w)⟧_P(σ, T) = w if T ⊨ ⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ), and 0 otherwise

T ⊨ ⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ) iff ∃ i ∈ [1, n]. T ⊨ ⟦li⟧(σ)
⟦r(α1, . . . , αn)⟧(σ) = r(σ(α1), . . . , σ(αn))
⟦¬r(α1, . . . , αn)⟧(σ) = ¬r(σ(α1), . . . , σ(αn))
T ⊨ r(d1, . . . , dn) iff r(d1, . . . , dn) ∈ T
T ⊨ ¬r(d1, . . . , dn) iff r(d1, . . . , dn) ∉ T

Figure 2.4: Markov Logic Network semantics.

2.3 Markov Logic Networks

Our system takes a Datalog analysis and synthesizes a novel analysis that combines logic and probability. The new analysis is also constraint-based, and the constraint language we apply is a variant of Markov Logic Networks [12], a declarative language for combining logic and probability. Compared to the original Markov Logic Networks proposed by Richardson and Domingos [12], our variant differs in two ways: (1) the logical part is a decidable fragment of first-order logic instead of full first-order logic; and (2) besides constraints with weights, our language also includes constraints without weights. The second difference allows us to directly support hard constraints, which are constraints that cannot be violated; such constraints are typically represented using constraints with very high weights in the original Markov Logic Networks. We next introduce the syntax and semantics of our language in detail.

Figure 2.3 shows the syntax of a Markov Logic Network. A Markov Logic Network

program C is a set of constraints {c1, . . . , cn}, each of which is either a hard constraint or a soft constraint. Each hard constraint ch is a disjunction of literals, while each soft constraint cs is a hard constraint extended with a positive real weight w. A literal l can be either a positive literal or a negative literal: a positive literal l+ is a relation name r followed by a list of arguments α1, . . . , αn, each of which is either a variable or a constant, while a negative literal l− is a negated positive literal. In the rest of the thesis, we sometimes write constraints in the form of implications (e.g., r1(a) =⇒ r2(a) instead of ¬r1(a) ∨ r2(a)). Similar to the Datalog syntax, we call a positive literal whose arguments are all constants a tuple. We call constraint ci an instance or a ground constraint of c if we can obtain ci by substituting all variables in c with constants. We call the set of all instances of c the grounding of c. Similarly, we call the union of the groundings of the constraints in C the grounding of C. Formally:

grounding(l1 ∨ . . . ∨ ln) = { ⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ) | σ ∈ Σ },
grounding(C) = ⋃_{c ∈ C} grounding(c).
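As an illustration of this definition, the following sketch (our own encoding, not the system's) enumerates the grounding of a clause over a finite constant domain:

```python
from itertools import product

# Sketch of grounding a clause over a finite constant domain N.
# A literal is (name, args, is_positive), where args mix variables
# (strings) and constants (ints). Names here are illustrative.

def ground_literal(lit, sub):
    name, args, pos = lit
    return (name, tuple(sub.get(a, a) for a in args), pos)

def grounding(clause, constants):
    """All instances of a clause (a list of literals) over `constants`."""
    variables = sorted({a for (_, args, _) in clause
                          for a in args if isinstance(a, str)})
    instances = []
    for values in product(constants, repeat=len(variables)):
        sub = dict(zip(variables, values))
        instances.append([ground_literal(l, sub) for l in clause])
    return instances

# Clause ¬path(a, b) ∨ ¬edge(b, c) ∨ path(a, c) over N = {0, 1}:
c2 = [("path", ("a", "b"), False),
      ("edge", ("b", "c"), False),
      ("path", ("a", "c"), True)]
assert len(grounding(c2, constants=[0, 1])) == 8  # |N|^3 instances
```

Shrinking the constant domain shrinks the grounding combinatorially, which is exactly why parameterizing the syntax by the domain of constants matters for solver performance.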

Different from the Datalog syntax, we parameterize the Markov Logic Network syntax with the domain of constants, which allows us to control the size of the grounding. More concretely, when the domain of constants is some subset N ⊆ N, the domain of substitutions Σ becomes (N ∪ V) → N and the domain of tuples T becomes R × N*. This further affects the performance of the underlying Markov Logic Network solver (as we shall see in Chapter 3.1). However, unless explicitly specified (only in Chapter 3.1), the domain of constants is the set of all constants N. We introduce a function constants : 2^C → 2^N that returns all the constants that appear in constraints C.

Figure 2.4 shows the semantics of a Markov Logic Network. Compared to a Datalog program, which defines a unique output, a Markov Logic Network defines a joint distribution of outputs. Given a set of tuples T, program C returns its probability, which is denoted

by ⟦C⟧_P(T). Specifically, we say T is not a valid output if ⟦C⟧_P(T) = 0. Intuitively, T is valid iff it does not violate any hard constraint instance, and the more soft constraint instances it satisfies, the higher probability it has. We calculate ⟦C⟧_P(T) by dividing the result of applying the exponential function exp to the weight of T (denoted by W_C(T)) by a normalization factor Z. The normalization factor Z is calculated by adding up the results of applying exp to the weights of all tuple sets, and thereby guarantees that (1) all probabilities are between 0 and 1, and (2) the sum of all probabilities is 1.¹ For a set of tuples T, we calculate its weight by summing its weights over all constraint instances, W_C(T) = Σ_{c ∈ C, σ ∈ Σ} ⟦c⟧_P(σ, T). Given a hard constraint l1 ∨ . . . ∨ ln and a substitution σ ∈ (N ∪ V) → N, the weight of T over the instance ⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ) is 0 if T satisfies it and −∞ otherwise. Given a soft constraint (l1 ∨ . . . ∨ ln, w) and a substitution σ, the weight of T over the instance (⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ), w) is w if T satisfies ⟦l1⟧(σ) ∨ . . . ∨ ⟦ln⟧(σ) and 0 otherwise. Note that when the domain of constants is specified as some subset N ⊆ N of all constants, the domain of substitutions Σ changes to (N ∪ V) → N and the domain of tuples T becomes R × N*.

In all our applications, we are interested in finding an output with the highest probability while satisfying all hard constraint instances, which can be obtained by solving the maximum a posteriori probability (MAP) inference problem. We define the MAP inference problem below:

Problem 2 (MAP Inference). Given a Markov Logic Network C, the MAP inference problem is to find a valid set of tuples that maximizes the probability:

MAP(C) =  UNSAT, if max_T ⟦C⟧_P(T) = 0;
          some T ∈ arg max_T ⟦C⟧_P(T), otherwise.

¹To ensure that Z is a finite number, the domain of constants of C needs to be finite, which is not required for Datalog. Throughout this thesis, we assume N to be finite, except in Chapter 3.1; in that section, however, the domains of constants of all discussed Markov Logic Networks are always finite subsets of N.

Since Z is a constant and exp is monotonic, we can rewrite arg max_T ⟦C⟧_P(T) as:

arg max_T ⟦C⟧_P(T) = arg max_T (1/Z) exp( Σ_{c ∈ C, σ ∈ Σ} ⟦c⟧_P(σ, T) ) = arg max_T Σ_{c ∈ C, σ ∈ Σ} ⟦c⟧_P(σ, T).

In other words, the MAP inference problem is equivalent to finding a set of output tuples that maximizes the sum of the weights of satisfied ground soft constraints while satisfying all the ground hard constraints. Note that, once we obtain the grounding of C, we can solve this problem directly by casting it as a maximum satisfiability (MAXSAT) problem, which in turn can be solved by an off-the-shelf MAXSAT solver.
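The following self-contained sketch makes this equivalence concrete by solving MAP over ground constraints by brute-force enumeration; a real backend such as NICHROME would instead invoke a MAXSAT solver, and all names here are illustrative.

```python
from itertools import chain, combinations

# Brute-force MAP inference over ground constraints (illustrative only;
# in practice this is cast as MAXSAT). A ground clause is a list of
# (tuple, is_positive) pairs; a soft clause additionally carries a weight.

def satisfies(T, clause):
    """T satisfies a clause if some literal in it holds under T."""
    return any((t in T) == pos for (t, pos) in clause)

def map_inference(universe, hard, soft):
    best, best_w = None, float("-inf")
    all_subsets = chain.from_iterable(
        combinations(universe, k) for k in range(len(universe) + 1))
    for subset in all_subsets:
        T = set(subset)
        if not all(satisfies(T, c) for c in hard):
            continue  # invalid: violates a ground hard constraint
        w = sum(w for (c, w) in soft if satisfies(T, c))
        if w > best_w:
            best, best_w = T, w
    return "UNSAT" if best is None else best

# Hard: a holds; a implies b.  Soft: prefer leaving b and c out.
hard = [[("a", True)], [("a", False), ("b", True)]]
soft = [([("b", False)], 1.0), ([("c", False)], 1.0)]
assert map_inference({"a", "b", "c"}, hard, soft) == {"a", "b"}
```

The search maximizes the summed weight of satisfied ground soft clauses subject to all ground hard clauses, which is precisely the MAXSAT view of MAP stated above.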

Example. Consider the same graph reachability example as in Chapter 2.2; we can express the problem using the Markov Logic Network below:

c1 : path(a, a)
c2 : path(a, b) ∧ edge(b, c) =⇒ path(a, c)
c3 : ¬path(a, b) weight 1.

Besides the above three constraints, we also add each input tuple in the original Datalog example as a hard constraint to the program; we omit these for brevity. Among the three constraints, hard constraints c1 and c2 express the same two axioms as their counterparts in the example Datalog program, while soft constraint c3 biases the MAP solution towards a minimal model. As a result, by solving the MAP inference problem of this Markov Logic Network, we obtain the same solution as the example Datalog program produces.
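To see constraint c3 at work, the following self-contained check (our own encoding of the ground constraints, for a two-node graph with a single edge 0 → 1) confirms that MAP inference recovers the Datalog least model:

```python
from itertools import combinations, product

# Check that the soft ¬path constraint (c3) makes MAP inference pick
# the Datalog least model on a 2-node graph. Tuples are ("path", i, j);
# edges are fixed input facts. (Illustrative encoding; names are ours.)

NODES = [0, 1]
EDGES = {(0, 1)}
CAND = [("path", i, j) for i in NODES for j in NODES]

def valid(T):
    # hard c1: path(i, i) for every node i
    if any(("path", i, i) not in T for i in NODES):
        return False
    # hard c2: path(a, b) and edge(b, c) imply path(a, c)
    for a, b, c in product(NODES, repeat=3):
        if ("path", a, b) in T and (b, c) in EDGES \
                and ("path", a, c) not in T:
            return False
    return True

def weight(T):
    # soft c3: each absent path tuple earns weight 1
    return sum(1 for t in CAND if t not in T)

models = [set(c) for k in range(len(CAND) + 1)
                 for c in combinations(CAND, k)]
best = max((T for T in models if valid(T)), key=weight)
assert best == {("path", 0, 0), ("path", 1, 1), ("path", 0, 1)}
```

Any valid model must contain the least model's tuples; c3 then penalizes every extra path tuple, so the least model wins.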

CHAPTER 3 APPLICATIONS

This chapter discusses how PETABLOX incorporates probabilistic reasoning in a conventional logic-based program analysis to address key challenges in three prominent applications: automated verification, interactive verification, and static bug detection. For automated verification, it addresses the challenge of selecting an abstraction that balances precision and scalability; for interactive verification, it addresses the challenge of reducing user effort in resolving analysis alarms; and for static bug detection, it addresses the challenge of sifting true alarms from false alarms. These applications were originally discussed in our previous publications [8, 14, 9]. Table 3.1 summarizes the Markov Logic Network encoding for each application. Briefly, while the hard constraints encode the correctness conditions (e.g., soundness), the soft constraints balance various trade-offs of the analysis; by solving the combined analysis instance, we obtain a correct analysis result that strikes the best balance between these trade-offs.

In the rest of this chapter, we discuss each application in detail, including the motivation, our approach (in particular, the underlying Markov Logic Network encoding), the empirical evaluation, and related work.

3.1 Automated Verification

3.1.1 Introduction

Building a successful program analysis requires solving high-level conceptual issues, such as finding an abstraction of programs that keeps just enough information for a given verification problem, as well as handling low-level implementation issues, such as coming up with efficient data structures and algorithms for the analysis. One popular approach for addressing this problem is to use Datalog [15, 16, 17, 18]. In this approach, a program analysis only specifies how to generate Datalog constraints from program text. The task of solving the generated constraints is then delegated to an off-the-shelf Datalog constraint solver, such as that underlying BDDBDDB [19], Doop [20], Jedd [21], and Z3's fixpoint engine [22], which in turn relies on efficient symbolic algorithms and data structures, such as Binary Decision Diagrams (BDDs).

Table 3.1: Markov Logic Network encodings of different program analysis applications.

Automated Verification. Hard constraints: Analysis Rules; Abstraction1 ⊕ ... ⊕ Abstractionn. Soft constraints: ¬Resulti weight wi, where wi is the reward for resolving Resulti; Abstractionj weight wj, where wj is the cost of applying Abstractionj. Trade-off: Accuracy vs. Scalability.

Interactive Verification. Hard constraints: Analysis Rules. Soft constraints: ¬Resulti weight wi, where wi is the reward for resolving Resulti; Causej weight wj, where wj is the cost of inspecting Causej. Trade-off: Accuracy vs. User Effort.

Static Bug Detection. Hard constraints: Analysis Rules (Optional). Soft constraints: Analysis Rulei weight wi, where wi is the writer's confidence in Rulei; Feedbackj weight wj, where wj is the user's confidence in Feedbackj. Trade-off: Writer's Idioms vs. User's Feedback.

The benefits of using Datalog for program analysis, however, are currently limited to the automation of low-level implementation issues. In particular, finding an effective program abstraction is done entirely manually by analysis designers, which results in undesirable consequences such as ineffective analyses hindered by inflexible abstractions or undue tuning burden for analysis designers.

In this section, we present a new approach for lifting this limitation by automatically finding effective abstractions for program analyses written in Datalog. Our approach is

based on counterexample-guided abstraction refinement (CEGAR), which was developed in the model-checking community and has been applied effectively for software verification with predicate abstraction [4, 23, 1, 24, 25]. A counterexample in Datalog is a derivation of an output tuple from a set of input tuples via Horn-clause inference rules: the rules specify the program analysis, the set of input tuples represents the current program abstraction, and the output tuple represents a (typically undesirable) program property derived by the analysis under that abstraction. The counterexample is spurious if there exists some abstraction under which the property cannot be derived by the analysis. The CEGAR problem in our approach is to find such an abstraction from a given family of abstractions.

We propose solving this problem by formulating it as a Markov Logic Network. We give an efficient construction of Markov Logic Network constraints from a Datalog solver's solution in each CEGAR iteration. Our main theoretical result is that, regardless of the Datalog solver used, its solution contains information to reason about all counterexamples, which is captured by the hard constraints in our problem formulation. This result seems unintuitive because a Datalog solver performs a least fixed-point computation that can stop as soon as each derivable output tuple has been derived (i.e., the solver need not reason about all possible ways to derive a tuple).

The above result ensures that solving our Markov Logic Network formulation generalizes the cause of verification failure in the current CEGAR iteration to the maximum extent possible, eliminating not only the current abstraction but all other abstractions destined to suffer a similar failure. There is still the problem of deciding which abstraction to try next. We show that the soft constraints in our problem formulation enable us to identify the cheapest refined abstraction. Our approach avoids unnecessary refinement by using this abstraction in the next CEGAR iteration.

We have implemented our approach and applied it to two realistic static analyses written in Datalog, a pointer analysis and a typestate analysis, for Java programs. These two analyses differ significantly in aspects such as flow sensitivity (insensitive vs. sensitive),

context sensitivity (cloning-based vs. summary-based), and heap abstraction (weak vs. strong updates), which demonstrates the generality of our approach. On a suite of eight real-world Java benchmark programs, our approach searches a large space of abstractions, ranging from 2^1k to 2^5k for the pointer analysis and 2^13k to 2^54k for the typestate analysis, for hundreds of analysis queries considered simultaneously in each program, thereby showing the practicality of our approach. We summarize the main contributions of our work:

1. We propose a CEGAR-based approach to automatically find effective abstractions for analyses in Datalog. The approach enables Datalog analysis designers to specify high-level knowledge about abstractions while continuing to leverage low-level implementation advances in off-the-shelf Datalog solvers.

2. We solve the CEGAR problem using a Markov Logic Network formulation that has desirable properties of generality, completeness, and optimality: it is independent of the Datalog solver, it fully generalizes the failure of an abstraction, and it computes the cheapest refined abstraction.

3. We show the effectiveness of our approach on two realistic analyses written in Datalog. On a suite of real-world Java benchmark programs, the approach explores a large space of abstractions for a large number of analysis queries simultaneously.

3.1.2 Overview

We illustrate our approach using a graph reachability problem that captures the core concept underlying a precise pointer analysis. The example program in Figure 3.1 allocates an object in each of methods f and g, and passes it to methods id1 and id2. The pointer analysis is asked to prove two queries: query q1 stating that v6 is not aliased with v1 at the end of g, and query q2 stating that v3 is not aliased with v1 at the end of f. Proving q1 requires a context-sensitive analysis

1  f() { v1 = new ...;          7  g() { v4 = new ...;
2    v2 = id1(v1);              8    v5 = id1(v4);
3    v3 = id2(v2);              9    v6 = id2(v5);
4    q2: assert(v3 != v1);     10    q1: assert(v6 != v1);
5  }                           11  }
6  id1(v) { return v; }        12  id2(v) { return v; }

Figure 3.1: Example program.

that distinguishes between different calling contexts of methods id1 and id2. Query q2, on the other hand, cannot be proven since v3 is in fact aliased with v1. A common approach to distinguish between different calling contexts is to clone (i.e., inline) the body of the called method at a call site. However, cloning each called method at each call site is infeasible even in the absence of recursion, as it grows program size exponentially and hampers the scalability of the analysis. We seek to address this problem by cloning selectively.

For exposition, we recast this problem as a reachability problem on the graph in Figure 3.2. In that graph, nodes 0, 1, and 2 represent basic blocks of f, while nodes 3, 4, and 5 represent basic blocks of g. Nodes 6 and 7 represent the bodies of id1 and id2 respectively, while nodes 6′, 6″, 7′ and 7″ are their clones at different call sites. Edges denoting matching calls and returns have the same label. A choice of labels constitutes a valid abstraction of the original program if, for each of a, b, c, and d, either the zero (non-cloned) or the one (cloned) version is chosen. Then, proving query q1 corresponds to showing that node 5 is unreachable from node 0 under some valid choice of labeled edges, which is the case if edges labeled {a1, b0, c1, d0} are chosen; proving query q2 corresponds to finding a valid combination of edge labels that makes node 2 unreachable from node 0, but this is impossible.

Our graph reachability problem can be expressed in Datalog as shown in Figure 3.2. A Datalog program consists of a set of input relations, a set of derived relations, and a set of rules that express how to compute the derived relations from the input relations.

Input relations:
    edge(i, j, n)   (edge from node i to node j labeled n)
    abs(n)          (edge labeled n is allowed)
Derived relations:
    path(i, j)      (node j is reachable from node i)
Rules:
    (1): path(i, i).
    (2): path(i, j) :- path(i, k), edge(k, j, n), abs(n).
Input tuples: edge(0, 6, a0), edge(6, 1, a0), edge(1, 7, c0), . . . , abs(a0), abs(c0), . . .
Derived tuples: path(0, 0), path(0, 6), path(0, 1), . . .
Query tuples: path(0, 5), path(0, 2)
[Graph drawing omitted: nodes 0, 1, 2 (blocks of f), 3, 4, 5 (blocks of g), 6 and 7 (bodies of id1 and id2), and clones 6′, 6″, 7′, 7″, with each matching call/return edge pair labeled a0/a1, b0/b1, c0/c1, or d0/d1.]

Figure 3.2: Graph reachability example in Datalog.

There are two input relations in our example: edge, representing the possible labeled edges in the given graph; and abs, containing labels of edges that may be used in computing graph reachability. Relation abs specifies a program abstraction in our original setting; for instance, abs = {a1, b0, c1, d0} specifies the abstraction in which only the calls to methods id1 and id2 from f are inlined. The derived relation path contains each tuple (i, j) such that node j is reachable from node i along a path with only edges whose labels appear in relation abs. This computation is expressed by rules (1) and (2) both of which are Horn clauses with implicit universal quantification. Rule (1) states that each node is reachable from itself. Rule (2) states that if node k is reachable from node i and edge (k, j) is allowed, then node j is reachable from node i. Queries in the original program correspond to tuples in relation path. Proving a query amounts to finding a valid instance of relation abs such that the tuple corresponding to the query is not derived. Our two queries q1 and q2 correspond to tuples path(0, 5) and path(0, 2) respectively. There are in all 16 abstractions, each involving a different choice of the zero/one versions

[Derivation graphs omitted: (a) using abs = {a0, b0, c0, d0}; (b) using abs = {a1, b0, c0, d0}; (c) using abs = {a1, b0, c1, d0}.]

Figure 3.3: Derivations after different iterations of our approach on our graph reachability example.
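The refinement strategy traced in Figure 3.3 and Table 3.2 can be sketched as a small search loop. The code below is our own simplified illustration on the Figure 3.2 graph: it eliminates abstractions using a single counterexample path per failed run, whereas the approach of this section generalizes over all derivations at once, so the full algorithm can converge faster and may settle on a different equal-cost abstraction. Clone nodes 6′, 6″, 7′, 7″ are renamed 61, 66, 71, 77 here, and all function names are ours.

```python
from collections import deque
from itertools import product

# Simplified abstraction search on the Figure 3.2 graph (illustrative
# only). Labels a..d each have a cheap 0-version (no cloning) and a
# costly 1-version (cloning); an abstraction picks one version of each.

EDGES = [
    (0, 6, "a0"), (6, 1, "a0"), (0, 61, "a1"), (61, 1, "a1"),
    (3, 6, "b0"), (6, 4, "b0"), (3, 66, "b1"), (66, 4, "b1"),
    (1, 7, "c0"), (7, 2, "c0"), (1, 71, "c1"), (71, 2, "c1"),
    (4, 7, "d0"), (7, 5, "d0"), (4, 77, "d1"), (77, 5, "d1"),
]

def reach_witness(abstraction, src, dst):
    """Labels used on some src -> dst path under `abstraction`, or None."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        n = queue.popleft()
        for (i, j, lab) in EDGES:
            if i == n and lab in abstraction and j not in parent:
                parent[j] = (n, lab)
                queue.append(j)
    if dst not in parent:
        return None
    labels, n = set(), dst
    while parent[n] is not None:
        n, lab = parent[n]
        labels.add(lab)
    return labels

def search(src, dst):
    """Try abstractions cheapest-first, skipping supersets of failures."""
    choices = product(*[(x + "0", x + "1") for x in "abcd"])
    candidates = sorted((frozenset(c) for c in choices),
                        key=lambda a: sum(l.endswith("1") for l in a))
    eliminated = []
    for a in candidates:
        if any(bad <= a for bad in eliminated):
            continue  # destined to fail for the same reason
        witness = reach_witness(a, src, dst)
        if witness is None:
            return a  # dst unreachable: query proven
        eliminated.append(frozenset(witness))
    return None  # no abstraction can prove the query

assert search(0, 5) is not None  # q1 (path(0, 5)) is provable
assert search(0, 2) is None      # q2 (path(0, 2)) is not
```

Even this crude single-witness variant eliminates whole families of abstractions per iteration, which is the key leverage the monotonicity of Datalog provides.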

Table 3.2: Each iteration (run) eliminates a number of abstractions. Some are eliminated by analyzing the current Datalog run (within run); some are eliminated because the derivations from the current run interact with derivations from previous runs (across runs).

                           eliminated abstractions
                           within run                  across runs
run   used abstraction     q1              q2          q2
1     a0 b0 c0 d0          a0 b0 * d0,     a0 * c0 *
                           a0 * c0 d0
2     a1 b0 c0 d0          a1 * c0 d0      a1 * c0 *
3     a1 b0 c1 d0                          a1 * c1 *   a0 * c1 *

of labels a through d in relation abs. Since we wish to minimize the amount of cloning in the original setting, abstractions with more zero versions of edge labels are cheaper. Our approach, outlined next, efficiently finds the cheapest abstraction abs = {a1, b0, c1, d0} that proves q1, and shows that q2 cannot be proven by any of the 16 abstractions.

Our approach is based on iterative counterexample-guided abstraction refinement. Table 3.2 illustrates its iterations on the graph reachability example. In the first iteration, the

cheapest abstraction abs = {a0, b0, c0, d0} is tried. It corresponds to the case where neither of nodes 6 and 7 is cloned (i.e., a fully context-insensitive analysis). This abstraction fails to

prove both of our queries. Figure 3.3a shows all the possible derivations of the two queries using this abstraction. Each set of edges incoming into a node in this graph represents an application of a Datalog rule, with the source nodes denoting the tuples in the body of the rule and the target node denoting the tuple at its head.

The first question we ask is: how do we generalize the failure of the current abstraction to avoid picking another that will suffer a similar failure? Our solution is to exploit a monotonicity property of Datalog: more input tuples can only derive more output tuples. It follows from this property that the maximum generalization of the failure can be achieved if we find all minimal subsets of the set of tuples in the current abstraction that suffice to derive the queries. From the derivation in Figure 3.3a, we see that these minimal subsets are

{a0, b0, d0} and {a0, c0, d0} for query path(0, 5), and {a0, c0} for query path(0, 2). We thus generalize the current failure to the maximum possible extent, eliminating any abs that is a

superset of {a0, b0, d0} or {a0, c0, d0} for query path(0, 5) and any abs that is a superset of

{a0, c0} for query path(0, 2). The next question we ask is: how do we pick the abstraction to try next? Our solution is to use the cheapest abstraction among those not eliminated so far for both queries. From the fact that label a0 is in both minimal subsets identified above, and that zero labels are cheaper than one labels, we conclude that this abstraction is abs = {a1, b0, c0, d0}, which corresponds to only cloning method id1 at the call in f.

In the second iteration, our approach uses this abstraction, and again fails to prove both queries. But this time the derivation, shown in Figure 3.3b, is different from that in the

first iteration. This time, we eliminate any abs that is a superset of {a1, c0, d0} for query path(0, 5), and any abs that is a superset of {a1, c0} for query path(0, 2). The cheapest of the remaining abstractions, not eliminated for both the queries, is abs = {a1, b0, c1, d0}, which corresponds to cloning methods id1 and id2 in f (but not in g). Using this abstraction in the third iteration, our approach succeeds in proving query path(0, 5), but still fails to prove query path(0, 2). As seen in the derivation in Figure 3.3c,

this time {a1, c1} is the minimal failure subset for query path(0, 2) and we eliminate any abs that is its superset. At this point, four abstractions remain in trying to prove query path(0, 2). However, another novel feature of our approach allows us to eliminate these remaining abstractions without any more iterations. After each iteration, we accumulate the derivations generated by the current run of the Datalog program with the derivations from all the previous iterations. Then, in our current example, at the end of the third iteration, we have the following derivations available:

path(0, 6) :- path(0, 0), edge(0, 6, a0), abs(a0)
path(0, 1) :- path(0, 6), edge(6, 1, a0), abs(a0)
path(0, 7′) :- path(0, 1), edge(1, 7′, c1), abs(c1)
path(0, 2) :- path(0, 7′), edge(7′, 2, c1), abs(c1)

The derivation of path(0, 1) using abs(a0) comes from the first iteration of our approach. Similarly, the derivation of path(0, 2) using path(0, 1) and abs(c1) is seen during the third iteration. However, accumulating the derivations from different iterations allows

our approach to explore the above derivation of path(0, 2) using abs(a0) and abs(c1) and detect {a0, c1} to be an additional minimal failure subset for query path(0, 2). Consequently,

we remove any abs that is a superset of {a0, c1}. This eliminates all the remaining abstractions for query path(0, 2) and our approach concludes that the query cannot be proven by any of the 16 abstractions.

We summarize the three main strengths of our approach over previous CEGAR approaches. (1) Given a single failing run of a Datalog program on a single query, it reasons about all counterexamples that cause the proof of the query to fail and generalizes the failure to the maximum extent possible. (2) It reasons about failures and discovers new counterexamples across iterations by mixing the already available derivations from different iterations. (3) It generalizes the causes of failure for multiple queries simultaneously. Together, these three features enable faster convergence of our approach.
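The failure-generalization step can be sketched as a brute-force search over subsets of the current abstraction, replaying the derivation as a reachability fixpoint. The edge list below covers only the fragment of the example graph that reaches path(0, 2), and the helper names are ours; the actual approach extracts minimal subsets from the recorded derivation graph rather than re-running the analysis.

```python
from itertools import combinations

# Fragment of the example graph that derives query path(0, 2) in run 1
# (the edges that reach path(0, 5) are omitted for brevity).
EDGES = [(0, 6, "a0"), (6, 1, "a0"), (1, 7, "c0"), (7, 2, "c0")]

def derives(abs_tuples, src, dst):
    """Reachability fixpoint: is path(src, dst) derivable under abs_tuples?"""
    reach = {src}
    changed = True
    while changed:
        changed = False
        for u, v, label in EDGES:
            if label in abs_tuples and u in reach and v not in reach:
                reach.add(v)
                changed = True
    return dst in reach

def minimal_failure_subsets(abstraction, src, dst):
    """All minimal subsets of `abstraction` that still derive the query."""
    minimal = []
    for r in range(len(abstraction) + 1):
        for subset in map(set, combinations(sorted(abstraction), r)):
            # Skip supersets of an already-found minimal subset.
            if derives(subset, src, dst) and not any(m <= subset for m in minimal):
                minimal.append(subset)
    return minimal

# For query path(0, 2) under the cheapest abstraction, the only minimal
# failure subset on this fragment is {a0, c0}, as in the first iteration.
print(minimal_failure_subsets({"a0", "b0", "c0", "d0"}, 0, 2))
```

Any abstraction that is a superset of a returned subset can then be discarded without running the Datalog solver again.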

Hard constraints:                                      Soft constraints:
path(i, i)                                             abs(a0)       weight 1
path(i, k) ∧ edge(k, j, n) ∧ abs(n) =⇒ path(i, j)      abs(b0)       weight 1
edge(0, 6, a0)                                         abs(c0)       weight 1
edge(6, 1, a0)                                         abs(d0)       weight 1
edge(1, 7, c0)                                         ¬path(0, 5)   weight 5
...                                                    ¬path(0, 2)   weight 5

Figure 3.4: Formula from the Datalog run’s result in the first iteration.

Encoding as a Markov Logic Network. In the graph reachability example, our approach searched a space of 16 different abstractions for each of two queries. In practice, we seek to apply our approach to real-world programs and analyses, where the space of abstractions is 2^x for x on the order of tens of thousands, for each of hundreds of queries. To handle such large spaces efficiently, our frontend PETABLOX formulates a Markov Logic Network from the output of the Datalog computation in each iteration, and solves it using our backend solver NICHROME to obtain the abstraction to try in the next iteration. For our example, Figure 3.4 shows the Markov Logic Network PETABLOX constructs at the end of the first iteration. It has two kinds of constraints: hard constraints, whose instances must be satisfied, and soft constraints, each of which is associated with a weight and whose instances may be left unsatisfied. We seek to find a solution that maximizes the sum of the weights of satisfied soft constraint instances, which can be obtained by solving the MAP inference problem of the constructed formula. Briefly, the formula has three parts. The first part is hard constraints that encode the derivation. Concretely, the first two hard constraints correspond to the two rules in the original Datalog program, while the remaining hard constraints correspond to non-abstraction input tuples. The instances of the rule constraints and the non-abstraction input tuple constraints together capture the derivation of the Datalog program.

The second part is soft constraints that guide NICHROME to pick the cheapest abstraction among those that satisfy the hard constraints. In our example, a weight of 1 is accrued when the solver picks an abstraction that contains a zero label and, implicitly, a weight of 0 when it contains a one label.

The third part of the formula is soft constraints that negate each query unproven so far. The reason is that, to prove a query, we must find an abstraction that avoids deriving the query. One may wonder why we make these soft instead of hard constraints. The reason is that certain queries (e.g., path(0, 2)) cannot be proven by any abstraction; these would make the entire formula unsatisfiable and prevent other queries (e.g., path(0, 5)) from being proven. But we must be careful in the weights we attach: these weights can affect the convergence characteristics of our approach. Returning to our example, leaving such a soft clause unsatisfied forgoes a weight of 5, which implies that no abstraction – not just the cheapest – can prove that query: proving it with even the most expensive abstraction would gain a weight of 5, whereas the abstraction cost constraints contribute at most 4.

The formula constructed in each subsequent iteration conjoins those from all previous iterations with the formula encoding the derivation of the current iteration. We also add additional hard constraints to encode the space of valid abstractions. For instance, abs(a0) ∨ abs(a1) and ¬abs(a0) ∨ ¬abs(a1) are added to the formula constructed at the end of the second iteration to denote that one must choose either to clone id1 at the call in f or not, and cannot choose both simultaneously. For efficiency reasons, we limit the domain of constants to the ones that appear in the output tuples of the Datalog runs. Since the space of valid abstractions can be very large and even infinite, this treatment is essential to the scalability of the underlying Markov Logic Network solver.
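The objective being optimized can be illustrated with a brute-force MAP solver over a small propositional fragment of Figure 3.4 (one derivation chain and query path(0, 2) only). The variable names and clause encoding below are ours; real instances are solved by NICHROME, not by enumeration.

```python
from itertools import product

def map_inference(variables, hard, soft):
    """Brute-force MAP: among assignments satisfying every hard clause,
    maximize the total weight of satisfied soft clauses.
    Clauses are lists of (variable, polarity) literals."""
    best, best_w = None, -1
    for bits in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, bits))
        sat = lambda clause: any(world[v] == pol for v, pol in clause)
        if not all(sat(c) for c in hard):
            continue
        w = sum(wt for wt, c in soft if sat(c))
        if w > best_w:
            best, best_w = world, w
    return best, best_w

variables = ["abs_a0", "abs_c0", "path02"]
# Hard: the derivation abs(a0) ∧ abs(c0) =⇒ path(0, 2), in clause form.
hard = [[("abs_a0", False), ("abs_c0", False), ("path02", True)]]
# Soft: weight 1 per zero label kept, weight 5 for leaving path(0, 2) underived.
soft = [(1, [("abs_a0", True)]),
        (1, [("abs_c0", True)]),
        (5, [("path02", False)])]

world, weight = map_inference(variables, hard, soft)
# Best weight is 6: keep one zero label, drop the other, and path(0, 2)
# stays underived -- exactly the trade-off the weights are chosen to induce.
```

The 5-versus-4 asymmetry from the text is visible here in miniature: satisfying a query negation is always worth more than any combination of abstraction-cost constraints.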

3.1.3 Parametric Dataflow Analyses

Datalog is a logic programming language that is capable of naturally expressing many static analyses [19, 20]. In this section, we review this use of Datalog, especially for developing parametric static analyses. Such an analysis takes (an encoding of) the program to analyze, a set of queries, and a setting of parameters that dictate the degree of program abstraction. The analysis then outputs the queries that it could successfully verify using the chosen abstraction. Our goal is to develop an efficient algorithm for automatically adjusting the setting of these abstraction parameters for a given program and set of queries. We formally state this problem in this subsection and present our CEGAR-based solution in the subsequent subsection.

3.1.3.1 Abstractions and Queries

Datalog programs that implement parametric static analyses contain three types of constraints: (1) those that encode the abstract semantics of the programming language, (2) those that encode the program being analyzed, and (3) those that determine the degree of program abstraction used by the abstract semantics.

Example. Our graph reachability example (Figure 3.2) is modeled after a parametric pointer analysis, and is specified using these three types of constraints. The two rules in the figure describe inference steps for deriving tuples of the path relation. These rules apply for all graphs, and they are constraints of type (1). Input tuples of the edge relation encode information about a specific graph, and are of type (2). The remaining input tuples of the abs relation specify the amount of cloning and control the degree of abstraction used in the analysis; they are thus constraints of type (3).
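A minimal bottom-up evaluator for the two rules makes the three constraint types concrete: the rules are fixed, while the edge and abs facts are data. This is a naive-iteration sketch of the least-model semantics (real analyses use solvers such as bddbddb), and the edge facts cover only a fragment of the example graph.

```python
def solve(edges, abstraction, nodes):
    """Least model of:  path(i, i).
                        path(i, k), edge(k, j, n), abs(n) => path(i, j)."""
    path = {(i, i) for i in nodes}                     # rule 1
    while True:
        new = {(i, j)
               for (i, k) in path                      # rule 2, naive iteration
               for (k2, j, n) in edges
               if k2 == k and n in abstraction}
        if new <= path:
            return path
        path |= new

# Type (2) facts: a fragment of the example graph. Type (3) facts: abstraction.
edges = [(0, 6, "a0"), (6, 1, "a0"), (1, 7, "c0"), (7, 2, "c0")]
nodes = {0, 1, 2, 6, 7}
print((0, 2) in solve(edges, {"a0", "c0"}, nodes))  # True: query derived
print((0, 2) in solve(edges, {"a0", "c1"}, nodes))  # False: c0 edges disabled
```

Changing only the abs facts switches the derivable queries, which is exactly what makes the abstraction a tunable parameter.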

Hereafter, we refer to the set A of constraints of type (3) as the abstraction, and the set C of constraints of types (1) and (2) as the analysis. This further grouping reflects the different treatment of constraints by our refinement approach – the constraints in the abstraction change during iterative refinement, whereas the constraints in the analysis do not. Given an analysis, only abstractions from a certain family A ⊆ P(T) make sense. We say that A is a valid abstraction when A ∈ A. The result of evaluating the analysis C with such a valid abstraction A is the set ⟦C ∪ A⟧ of tuples¹.

¹To join with C, we implicitly cast a set of tuples A as a set of constraints each of which is a unit clause.

A query q ∈ Q ⊆ T is just a particular kind of tuple that describes a bug or an undesirable program property. We assume that a set Q ⊆ Q of queries is given in the specification of a verification problem. The goal of a parametric static analysis is to show, as much as possible, that the bugs or properties described by Q do not arise during the execution of a given program. We say of a valid abstraction A that it rules out a query q if and only if q ∉ ⟦C ∪ A⟧. Note that an abstraction either derives a query, or rules it out. Different abstractions rule out different queries, and we will often refer to the set of queries ruled out by several abstractions taken together. We denote by R(A, Q) the set of queries out of Q that are ruled out by some abstraction A ∈ A:

R(A, Q) = Q \ ⋂ { ⟦C ∪ A⟧ | A ∈ A }

Conversely, we say that an abstraction A is unviable with respect to a set Q of queries if and only if A does not rule out any query in Q; that is, Q ⊆ ⟦C ∪ A⟧.

Example. In our graph reachability example, the family A of valid abstractions consists of

sets {abs(ai), abs(bj), abs(ck), abs(dl)} for all i, j, k, l ∈ {0, 1}. They describe 16 options of cloning nodes in the graph. The set of queries is Q = {path(0, 5), path(0, 2)}.

We assume that the family A of valid abstractions is equipped with the precision preorder ⊑ and the efficiency preorder ⪯. Intuitively, A1 ⊑ A2 holds when A1 is at most as precise as A2, and so it rules out fewer queries. Formally, we require that the precision preorder obeys the following condition:

A1 ⊑ A2 =⇒ ⟦C ∪ A1⟧ ∩ Q ⊇ ⟦C ∪ A2⟧ ∩ Q

Some analyses have a most precise abstraction A⊤, which can rule out the most queries. This abstraction, however, is often impractical, in the sense that computing ⟦C ∪ A⊤⟧ requires too much time or space. The efficiency preorder ⪯ captures the notion of abstraction efficiency: A1 ⪯ A2 denotes that A1 is at most as efficient as A2. Often, the two preorders point in opposite directions.

Example. Abstractions of our running example are ordered as follows. Let A = {abs(ai), abs(bj), abs(ck), abs(dl)} and B = {abs(ai′), abs(bj′), abs(ck′), abs(dl′)}. Then, A ⊑ B if and only if i ≤ i′ ∧ j ≤ j′ ∧ k ≤ k′ ∧ l ≤ l′. Also, A ⪯ B if and only if (i + j + k + l) ≥ (i′ + j′ + k′ + l′). These relationships formally express that cloning more nodes can improve the precision of the analysis but at the same time can slow down the analysis.

3.1.3.2 Problem Statement

Our aim is to solve the following problem:

Definition 3 (Datalog Analysis Problem). Suppose we are given an analysis C, a set Q ⊆ Q of queries, and an abstraction family (A, ⊑, ⪯). Compute the set R(A, Q) of queries that can be ruled out by some valid abstraction.

Since A is typically finite, a brute-force solution is possible: simply apply the definition of R(A, Q). However, |A| is often exponential in the size of the analyzed program. Thus, it is highly desirable to exploit the structure of the problem to obtain a better solution. In particular, the information embodied by the efficiency preorder ⪯ and by the precision preorder ⊑ should be exploited. Our general approach, in the vein of CEGAR, is to run Datalog, in turn, on a finite sequence A1, . . . , An of abstractions. In the ideal scenario, every query q ∈ Q is ruled out by some abstraction in the sequence, and the combined cost of running the analysis for all the abstractions in the sequence is as small as possible. The efficiency preorder ⪯ provides a way to estimate the cost of running an analysis without actually doing so; the precision preorder ⊑ could be used to restrict the search for abstractions. We describe the approach in detail in the next section.

Example. What we have described for our running example provides the instance (C, Q, (A, ⊑, ⪯)) of the Datalog Analysis problem. Recall that Q = {path(0, 2), path(0, 5)}. As we explained in Chapter 3.1.2, among these two queries, only path(0, 5) can be ruled out by some abstraction. A cheapest such abstraction according to the efficiency order ⪯ is A = {abs(a1), abs(b0), abs(c1), abs(d0)}, which clones two nodes, while the most expensive one is B = {abs(a1), abs(b1), abs(c1), abs(d1)} with four clones. Hence, the answer R(A, Q) for this problem is {path(0, 5)}, and our goal is to arrive at this answer in a small number of refinement iterations, while mostly trying a cheap abstraction in each iteration, such as the abstraction A rather than B.
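On the 16-abstraction family, the bookkeeping behind the refinement iterations can be mimicked directly: drop every abstraction that is a superset of an eliminated minimal failure subset and try the cheapest survivor. This is a sketch with our own helper names; the dissertation delegates this choice to MAP inference instead of explicit enumeration.

```python
from itertools import product

LABELS = "abcd"

def all_abstractions():
    """The 16 valid abstractions {a?, b?, c?, d?} with versions 0 or 1."""
    return [frozenset(f"{l}{v}" for l, v in zip(LABELS, vs))
            for vs in product("01", repeat=len(LABELS))]

def cheapest_surviving(eliminated):
    """Cheapest abstraction (fewest version-1 labels) that is not a
    superset of any eliminated minimal failure subset; None if impossible."""
    survivors = [a for a in all_abstractions()
                 if not any(e <= a for e in eliminated)]
    if not survivors:
        return None
    return min(survivors, key=lambda a: sum(t.endswith("1") for t in a))

# After iteration 1, the minimal failure subsets are {a0,b0,d0} and
# {a0,c0,d0} for path(0,5) and {a0,c0} for path(0,2); the cheapest
# survivor is {a1, b0, c0, d0}, the abstraction tried in iteration 2.
print(cheapest_surviving([{"a0", "b0", "d0"}, {"a0", "c0", "d0"}, {"a0", "c0"}]))

# Once all four {a?, c?} subsets are eliminated, path(0, 2) is impossible.
print(cheapest_surviving([{"a0", "c0"}, {"a1", "c0"},
                          {"a1", "c1"}, {"a0", "c1"}]))  # None
```

The second call reproduces the conclusion above: every valid abstraction contains some pair {a?, c?}, so no abstraction rules out path(0, 2).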

3.1.4 Algorithm

In this section, we present our CEGAR algorithm for parametric analyses expressed in Datalog. Our algorithm frees the designer of the analysis from the task of describing how to do refinement. All they must do is describe which abstractions are valid. We achieve such a high degree of automation while remaining efficient due to two main ideas. The first is to record the result of a Datalog run using a set of hard constraints that compactly represents large sets of unviable abstractions. The second is to reduce the problem of finding a good abstraction to a MAP inference problem on a Markov Logic Network that augments these hard constraints with soft constraints. We begin by presenting the first idea (Chapter 3.1.4.1): how Datalog runs are encoded as sets of hard constraints, and what properties this encoding has. In particular, we observe that conjoining the encodings of multiple Datalog runs gives an under-approximation of the set of unviable abstractions (Theorem 5). This observation motivates the overall structure of our CEGAR-based solution to the Datalog analysis problem, which we describe next (Chapter 3.1.4.2). The algorithm relies on a subroutine for choosing the next abstraction. While arguing for the correctness of the algorithm, we formalize the requirements for this subroutine: it should choose a cheap abstraction not yet known to be unviable. We finish by describing the second idea (Chapter 3.1.4.3), how choosing a good abstraction is essentially a MAP inference problem, thus completing the description of our solution.

3.1.4.1 From Datalog Derivations to Hard Constraints

In the CEGAR algorithm, we iteratively call a Datalog solver. It is desirable to do so as few times as possible, so we wish to eliminate as many unviable abstractions as possible without calling the solver. To do so, we need to rely on more information than the binary answer of a Datalog run, i.e., whether an abstraction derives or rules out a query. Intuitively, there is more information, waiting to be exploited, in how a query is derived. Theorem 4 shows that by recording the Datalog run for an abstraction A as a set of hard constraints it is possible to partly predict what Datalog will do for other abstractions that share tuples with A. Perhaps less intuitively, Theorem 5 shows that it is sound to mix (parts of) derivations seen for different runs of Datalog. Thus, in some situations we can predict that an abstraction A1 will derive a certain query by combining tuple dependencies observed in runs for two other abstractions A2 and A3.

Our method of producing a set of hard constraints from the result of running analysis C with abstraction A is straightforward: we simply take all the constraints in C and limit the domain of constants to the ones that are present in ⟦C ∪ A⟧, which we refer to as constants(⟦C ∪ A⟧). While the set of constructed constraints does not change across iterations, the information that they capture about the Datalog runs increases as the domain of constants grows. We grow the domain of constants lazily because the MAP inference problem of Markov Logic Networks is computationally challenging and the domain of constants can be very large and even infinite (when the space of valid abstractions is infinite). However, we show that such a set of constraints still contains sufficient information about existing derivations. The following theorem states that such a set of constraints allows us to predict what Datalog would do for those abstractions A′ ⊆ A.

Theorem 4. Given a set of Datalog constraints C and a set of tuples A, let A′ be a subset of A. Then the Datalog run result ⟦C ∪ A′⟧ is fully determined by the Markov Logic Network C where the domain of constants is constants(⟦C ∪ A⟧), as follows:

t ∈ ⟦C ∪ A′⟧ ⇐⇒ ∀T. (T = MAP(C ∪ A′)) =⇒ t ∈ T,

where constants(⟦C ∪ A⟧) is the domain of constants of the Markov Logic Network C ∪ A′.

One interesting instantiation of the theorem is the case where A is a valid abstraction. To see the importance of this case, consider the first iteration of the running example (Chapter

3.1.2), where A is {abs(a0), abs(b0), abs(c0), abs(d0)}. The theorem says that it is possible to predict exactly what Datalog would do for subsets of A. If such a subset derives a query then, by monotonicity (Proposition 1), so do all its supersets. In other words, we can give lower bounds on which queries other abstractions derive. (All other abstractions are supersets of subsets of A.)

When the analysis C is run multiple times with different abstractions A1, . . . , An, these runs lead to the same set of hard constraints with different constant domains constants(⟦C ∪ A1⟧), . . . , constants(⟦C ∪ An⟧). The next theorem points out the benefit of considering these formulas together, as illustrated in Chapter 3.1.2. It implies that by conjoining these formulas, we can mix derivations from different runs, and identify more unviable abstractions.

Theorem 5. Let A1, . . . , An be sets of tuples. For all A′ ⊆ ⋃_{i∈[1,n]} Ai,

(∀T. (T = MAP(C ∪ A′)) =⇒ t ∈ T) =⇒ t ∈ ⟦C ∪ A′⟧,

where constants(⟦C ∪ ⋃_{i∈[1,n]} Ai⟧) is the domain of constants of the Markov Logic Network C ∪ A′.

Algorithm 1 CEGAR-based Datalog analysis.
1: INPUT: Queries Q
2: OUTPUT: A partition (R, I) of Q, where R contains queries that have been ruled out and I queries impossible to rule out.
3: var R := ∅, T := ∅
4: loop
5:    A := choose(T, C, Q \ R)
6:    if (A = impossible) return (R, Q \ R)
7:    T′ := ⟦C ∪ A⟧
8:    R := R ∪ (Q \ T′)
9:    T := T ∪ T′
10: end loop
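Algorithm 1 can be transcribed almost line for line. In the sketch below, run stands for the Datalog solver computing ⟦C ∪ A⟧, and choose is stubbed with a cheapest-first exhaustive search over a toy table of runs (made-up data); the dissertation instead reduces choose to MAP inference.

```python
def cegar(queries, run, choose):
    """Algorithm 1: partition queries into (ruled_out, impossible)."""
    R, T = set(), set()                     # line 3
    while True:                             # line 4
        A = choose(T, set(queries) - R)     # line 5
        if A is None:                       # line 6: "impossible"
            return R, set(queries) - R
        T_new = run(A)                      # line 7: [[C ∪ A]]
        R |= set(queries) - T_new           # line 8
        T |= T_new                          # line 9

# Toy stand-in for [[C ∪ A]]: which queries each abstraction derives.
RUNS = {
    frozenset({"a0", "c0"}): {"q1", "q2"},
    frozenset({"a1", "c0"}): {"q1", "q2"},
    frozenset({"a0", "c1"}): {"q2"},
    frozenset({"a1", "c1"}): {"q2"},
}

def make_choose():
    """Cheapest-first exhaustive choose; never repeats an abstraction."""
    order = sorted(RUNS, key=lambda a: (sum(t.endswith("1") for t in a),
                                        sorted(a)))
    it = iter(order)
    return lambda T, remaining: next(it, None) if remaining else None

print(cegar({"q1", "q2"}, RUNS.get, make_choose()))
# q1 is ruled out (e.g., by {a0, c1}); q2 is derived by every
# abstraction, so it is impossible to rule out.
```

Because the stub eventually exhausts the family, the loop terminates with exactly the partition (R, I) that Theorem 6 promises for a correct choose.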

3.1.4.2 The Algorithm

Our main algorithm (Algorithm 1) classifies the queries Q into those that are ruled out by some abstraction and those that are impossible to rule out using any abstraction. The

algorithm maintains its state in two variables, R ∈ P(Q) and T ∈ P(T), where R is the set of queries that have been ruled out so far and T is the union of the tuples derived by the Datalog runs so far. The call choose(T, C, Q′) evaluates to impossible only if all queries in Q′ are impossible to rule out, according to the information encoded in C where the domain of constants is constants(T). Thus, the algorithm terminates only if all queries that can be ruled out have been ruled out. Conversely, choose never returns an abstraction whose analysis was previously recorded in C and T. Intuitively, C with constants(T) as the domain of constants represents the set of abstractions known to be unviable for the remaining set of queries Q′ = (Q \ R). Formally, this notion is captured by the concretization function γ, whose definition is justified by Theorem 5. Letting C be the domain of constraints, we have

γ ∈ P(T) × P(C) × P(Q) → P(A)

γ(T, C, Q′) ≜ { A ∈ A | ∀T′. (T′ = MAP(C ∪ (A ∩ T))) =⇒ Q′ ⊆ T′, where the domain of constants is constants(T) }

The condition ∀T′. (T′ = MAP(C ∪ (A ∩ T))) =⇒ Q′ ⊆ T′ appears complicated, but it is just a formal way to say that all queries in Q′ are derivable from A using derivations encoded in C with constants(T) as the domain of constants. Hence, γ(T, C, Q′) contains the abstractions known to be unviable with respect to Q′; thus A \ γ(T, C, Q′) is the set of valid abstractions that, according to C and T, might be able to rule out some queries in Q′. The function choose(T, C, Q′) chooses an abstraction from A \ γ(T, C, Q′), if this set is not empty. Each iteration of Algorithm 1 begins with a call to choose (Line 5). If all remaining queries Q \ R are impossible to rule out, then the algorithm terminates. Otherwise, a Datalog solver is run with the new abstraction A (Line 7). The set R of ruled-out queries is updated (Line 8) and the newly derived tuples are recorded in T (Line 9).

Theorem 6. If choose(T, C, Q′) evaluates to an element of the set A \ γ(T, C, Q′) whenever such an element exists, and to impossible otherwise, then Algorithm 1 is partially correct: it returns (R, I) such that R = R(A, Q) and I = Q \ R. In addition, if A is finite, then Algorithm 1 terminates.

The next section gives one definition of the function choose that satisfies the requirements of Theorem 6, thus completing the description of a correct algorithm. The definition of choose from Chapter 3.1.4.3 makes use of the efficiency preorder ⪯, such that the resulting algorithm is not only correct, but also efficient. Our implementation also makes use of the precision preorder ⊑ to further improve efficiency. The main idea is to constrain the sequence of used abstractions to be ascending with respect to ⊑. For this to be correct, however, the abstraction family must satisfy an additional condition: for any two abstractions A1 and A2 there must exist another abstraction A such that A ⊒ A1 and A ⊒ A2. The proofs are available in Appendix A.2.

3.1.4.3 Choosing Good Abstractions via Mixed Hard and Soft Constraints

The requirement that the function choose(T, C, Q′) should satisfy was laid down in the previous section: it should return an element of A \ γ(T, C, Q′), or say impossible if no such element exists. In this subsection we describe how to choose the cheapest element according to the preorder ⪯. The type of function choose is

choose ∈ P(T) × P(C) × P(Q) → A ⊎ {impossible}

The function choose is essentially a reduction to the MAP problem of a Markov Logic Network that extends the hard constraints C with additional soft constraints. Before describing the formulation in general, let us examine an example.

Example. Consider an analysis with parameters p1, p2, . . . , pn, each taking a value from {1, 2, . . . , k}. The set of valid abstractions is A = {1, 2, . . . , n} → {1, 2, . . . , k}. We use [pi = j] to refer to the Datalog tuple that encodes that pi has value j. Let the queries be {q1, q2, q3}, the Datalog program be C, and suppose we have tried the abstractions ⋀i pi = 1 and ⋀i pi = 2 and none of the queries has been resolved. We construct the hard constraint such that (1) only valid abstractions are considered, and (2) abstractions known to be unviable with respect to {q1, q2, q3} are avoided. For efficiency, we limit the domain of constants to the ones that appear in the existing derived tuples.

ψ0 = δA ∪ C ∪ { ¬q1 ∨ ¬q2 ∨ ¬q3 }

The formula δA encodes the space of valid abstractions, which can be very large and even infinite. We introduce a new relation [pi > j] to represent the set of parameter settings where pi > j. It serves two purposes: (1) it helps compactly represent the large number of unexplored abstractions without expanding the domain of constants with ones that do not appear in existing derived tuples, and (2) it allows δA to grow linearly with the maximum pi that the Datalog runs have tried so far rather than quadratically, as a naive encoding would. We define δA as ⋃_{1≤i≤n} δA^i, where

δA^i ≜ { [pi = 1] ∨ [pi = 2] ∨ [pi > 2] }
     ∪ { ¬([pi > j] ∧ [pi = j]) | 1 ≤ j ≤ 2 }
     ∪ { ([pi > j] ∨ [pi = j]) ⇔ [pi > j − 1] | j = 2 }

Formula δA^i says that pi can either take exactly one of the values that the Datalog runs have explored so far (pi = 1 or pi = 2) or an unexplored value (pi > 2). When pi > 2 is chosen, it is up to the designer to choose which concrete value to try next, as long as the chosen values form a cheapest abstraction among the valid ones that have not been shown to be unviable. For example, if a larger pi yields a more expensive analysis, the designer would choose pi = 3 as the parameter. We use Γ ∈ P(T) ⊎ {UNSAT} → A ⊎ {impossible}, where Γ(UNSAT) = impossible, to denote the user-defined function that converts the MAP inference result into an abstraction to try next.

It remains to construct the soft constraints. The first category of soft constraints gives estimates of the costs of different abstractions. To a large extent, this is done based on knowledge about the efficiency characteristics of a particular analysis, and thus is left to the designer of the analysis. For example, if the designer knows from experience that the analysis tends to take longer as Σi pi increases, then they could include two kinds of soft

constraints: (1) [pi = j] with weight k − j, for each i ∈ [1, n] and j ∈ [1, 2], and (2) [pi > j] with weight k − j − 1, for each i ∈ [1, n] and j = 2. The former encodes the costs incurred by the parameter values that have been explored, while the latter gives a lower bound on the costs incurred by the parameter values that have not been explored so far. We use k − j − 1 as the weight for [pi > j] because pi = j + 1 would yield the cheapest setting among the values in pi > j.

The remaining soft constraints that we include are independent of the particular analysis, and they express that abstractions should be preferred when they could potentially help with more queries. In this example, the extra soft constraints are ¬q1, ¬q2, ¬q3, all with some large weight w. Suppose an abstraction A1 is known to imply q1, and an abstraction

A2 is known to imply both q1 and q2. Then A1 is preferred over A2 because of the last three soft constraints. Note that an abstraction is not considered at all only if it is known to imply all three queries, because of the hard constraint.
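The cost-weight scheme above can be checked in a couple of lines (k and j as in the example; the function names are ours). The weight of [pi > j] is a lower bound on the cost of the values it covers because it equals the weight of the cheapest such value, pi = j + 1.

```python
def weight_eq(k, j):
    """Soft weight of [p_i = j]: cheaper (smaller) values weigh more."""
    return k - j

def weight_gt(k, j):
    """Soft weight of [p_i > j]: the weight of its cheapest member."""
    return k - j - 1

k = 5  # assumed number of possible parameter values
# [p_i > 2] weighs the same as its cheapest member p_i = 3 ...
assert weight_gt(k, 2) == weight_eq(k, 3)
# ... and at least as much as every value it covers, so it never
# overestimates the cost of leaving p_i unexplored.
assert all(weight_gt(k, 2) >= weight_eq(k, j) for j in range(3, k + 1))
```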

In general, choose is defined as follows:

choose(T, C, Q) ≜ Γ(MAP(δA ∪ C ∪ α(Q) ∪ ηA ∪ βA(Q))),

where the domain of constants of the constructed Markov Logic Network is constants(T).

The boolean formula δA encodes the set A of valid abstractions, the soft constraints ηA encode the efficiency preorder ⪯, and the function Γ converts a MAP solution to an abstraction. Formally, we require δA, ηA, and Γ to satisfy the conditions

T′ = MAP(δA) =⇒ Γ(T′) ∈ A                                          (1)

∀A′ ∈ A. ∃T′ = MAP(δA). A′ ⪯ Γ(T′) ∧ (Γ(T′) ∩ T = A′ ∩ T)           (2)

Γ(T1) ⪯ Γ(T2) ⇐⇒ WηA(T1) ≤ WηA(T2)                                  (3)

where both T1 and T2 are MAP solutions to δA.

Condition (1) specifies that Γ always converts a MAP solution of δA to a valid abstraction. Condition (2) specifies that for any given valid abstraction A′, δA encodes an abstraction that shares the same set of abstraction tuples in T and is not more expensive; this condition ensures that if there exists a valid abstraction that avoids the problematic derivations (that is, the ones that lead to queries) captured by C and T, then δA encodes an abstraction that avoids the same derivations and is not more expensive. Condition (3) specifies that ηA captures the costs of different abstractions.

Table 3.3: Benchmark characteristics. All numbers are computed using a 0-CFA call-graph analysis.

            description                            # classes     # methods      bytecode (KB)  source (KLOC)
                                                   app   total   app    total   app   total    app   total
toba-s      Java bytecode to C compiler             25     158    149     745    32      56      6      69
javasrc-p   Java source code to HTML translator     49     135    461     789    43      60     13      66
weblech     website download/mirror tool            11     576     78   3,326     6     208     12     194
hedc        web crawler from ETH                    44     353    230   2,134    16     140      6     153
antlr       parser/translator generator            111     350  1,150   2,370   128     186     29     131
luindex     document indexing and search tool      206     619  1,390   3,732   102     235     39     190
lusearch    text indexing and search tool          219     640  1,399   3,923    94     250     40     198
shroeder-m  sampled audio editing tool             109     936    617   6,435    37     352     12     334

Finally, the hard constraint α(Q) and the soft constraints βA(Q) are

α(Q) ≜ { ¬q | q ∈ Q }        βA(Q) ≜ { (wA + 1, ¬q) | q ∈ Q }

where wA is an upper bound on the weight given by ηA to a valid abstraction; for example, the sum Σ_{i=1}^{n} wi of all weights would do.

Discussion. The function choose(T, C, Q) reasons about a possibly very large set γ(T, C, Q) of abstractions known to be unviable, and it does so by using the compact representation C of previous Datalog runs (with the domain of constants drawn from those runs), together with several helper formulas, such as δA and ηA. Moreover, the reduction to a Markov Logic Network is natural and involves almost no transformation of the formulas involved. Our approach provides a general algorithm to find a viable abstraction that optimizes a given metric. One only needs to replace the soft constraints ηA to extend it to metrics other than analysis cost. For example, to speed up the convergence of the overall algorithm, one can factor in the probability that a given abstraction can resolve a certain query [26].

3.1.5 Empirical Evaluation

In this section, we empirically evaluate our approach on real-world analyses and programs. The experimental setup is described in Chapter 3.1.5.1 and the evaluation results are discussed in Chapter 3.1.5.2.

3.1.5.1 Evaluation Setup

We evaluate our approach on two static analyses written in Datalog, a pointer analysis and a typestate analysis, for Java programs. We study the results of applying our approach to each of these analyses on the eight Java benchmark programs described in Table 3.3. We analyzed the bytecode of the programs, including the libraries that they use. The programs are from the Ashes Suite and the DaCapo Suite.

We implemented our approach using NICHROME and an open-source Datalog solver, both without any modification. Both our analyses are expressed and solved using bddbddb [15], a BDD-based Datalog solver. We use NICHROME to solve the Markov Logic Networks generated by this algorithm. All our experiments were done using Oracle HotSpot JVM 1.6 on a machine with 128GB memory and 3.0GHz processors. We next describe our two analyses in more detail.

Pointer Analysis. Our pointer analysis is flow-insensitive but context-sensitive, based on k-object-sensitivity [27]. It computes a call graph simultaneously with points-to results, because the precision of the call graph and points-to results is inter-dependent due to dynamic dispatching in OO languages like Java. The precision and efficiency of this analysis depends heavily on how many distinctions it makes between different calling contexts. In Java, the value of the receiver object this provides context at runtime. But a static analysis cannot track concrete objects. Instead, static analyses typically track the allocation site of objects. In general, in object-sensitive

analyses the abstract execution context is an allocation string h1, . . . , hk. The site h1 is where the receiver object was instantiated. The allocation string h2, . . . , hk is the context of h1, defined recursively. Typically, all contexts are truncated to the same length k. In our setting, an abstraction A enables finer control on how contexts are truncated: the context for allocation site h is truncated to length A(h). Thus, the truncation length may vary from one allocation site to another. This finer control allows us to better balance precision and efficiency. Abstractions are ordered as follows:

(precision) A1 ⊑ A2 ⇐⇒ ∀h : A1(h) ≤ A2(h)

(efficiency) A1 ⪯ A2 ⇐⇒ Σh A1(h) ≥ Σh A2(h)
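The per-site truncation and the two orderings can be rendered in a few lines; this is a hypothetical sketch (A is a map from allocation site to truncation length), not code from the actual implementation:

```python
def truncate(ctx, A, h):
    """Keep only the first A[h] sites of h's allocation string (h1, ..., hk)."""
    return tuple(ctx[:A[h]])

def leq_precision(A1, A2):
    """A1 is below A2 in the precision order: A2 distinguishes at least as
    many contexts at every allocation site."""
    return all(A1[h] <= A2[h] for h in A1)

def leq_efficiency(A1, A2):
    """A1 is below A2 in the efficiency order: A2's total truncation
    length is no larger, so A2 is at least as cheap."""
    return sum(A1.values()) >= sum(A2.values())
```

An abstraction that raises k at every site is above in the precision order but below in the efficiency order, matching the trade-off described in the text.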

These definitions reflect the intuition that making more distinctions between contexts increases precision but is more expensive.

Typestate Analysis. Our typestate analysis is based on that by Fink et al. [28]. It differs in three major ways from the pointer analysis described above. First, it is fully flow-sensitive, whereas the pointer analysis is fully flow-insensitive. Second, it is fully context-sensitive, using procedure summaries instead of cloning, and is therefore capable of precise reasoning for programs with arbitrary call chain depth, including recursive ones. It is based on the tabulation algorithm [29], which we expressed in Datalog. Third, it performs both may- and must-alias reasoning; in particular, it can do strong updates, whereas our pointer analysis only does weak updates. These differences between our two analyses highlight the versatility of our approach.

More specifically, the typestate analysis computes, at each program point, a set of abstract states of the form (h, t, a) that collectively over-approximate the typestates of all objects at that program point. The meaning of the components of an abstract state is as follows: h is an allocation site in the program, t is the typestate in which a certain object allocated at that site might be, and a is a finite set of heap access paths with which that object is definitely aliased (called the must set). The precision and efficiency of this analysis depend heavily on how many access paths it tracks in must sets. Hence, the abstraction A we use to parameterize this analysis is a set of access paths that the analysis is allowed to track: any must set in any abstract state computed by the analysis must be a subset of the current abstraction A. The specification of this parameterized analysis differs from the original analysis in that the parameterized analysis simply checks, before adding an access path p to a must set m, whether p ∈ A: if not, it does not add p to m; otherwise, it proceeds as before.
Note that it is always safe to drop any access path from any must set in any abstract state, which ensures that it is sound to run the analysis using any set of access paths as the abstraction. Different abstractions, however, do affect the precision and efficiency of the analysis, and are ordered as follows:

(precision) A1 ⊑ A2 ⇐⇒ A1 ⊆ A2

(efficiency) A1 ⪯ A2 ⇐⇒ |A1| ≥ |A2|

These orderings reflect the intuition that tracking more access paths makes the analysis more precise but also less efficient.

Table 3.4: Results showing statistics of queries, abstractions, and iterations of our approach (CURRENT) and the baseline approaches (BASELINE) on the pointer analysis.

                          queries                   abstraction size   iterations
              total   resolved    resolved          final      max.
                      (CURRENT)   (BASELINE)
toba-s            7          7            0           170    17,820          10
javasrc-p        46         46            0           470    18,450          13
weblech           5          5            2           140    30,950          10
hedc             47         47            6           730    29,480          18
antlr           143        143            5           970    29,170          15
luindex         138        138           67         1,160    40,550          26
lusearch        322        322           29         1,460    39,360          17
schroeder-m      51         51           25           450    58,260          15

Using the typestate analysis client, we compare our refinement approach to a scalable CEGAR-based approach for finding optimal abstractions proposed by Zhang et al. [30]. A similar comparison is not possible for the pointer analysis client since the work by Zhang et al. cannot handle non-disjunctive analyses. Instead, we compare the precision and scalability of our approach on the pointer analysis client with an optimized Datalog-based implementation of k-object-sensitive pointer analysis that uses k = 4 for all allocation sites in the program. Using a higher k value caused this baseline analysis to time out on our larger benchmarks.
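The must-set filtering that parameterizes the typestate analysis amounts to a one-line guard. A minimal sketch, with illustrative function names that are not from the actual implementation:

```python
def add_access_path(must_set, p, A):
    """Add access path p to a must set only if the abstraction A tracks it.
    Dropping p instead is always sound: it only loses must-alias precision,
    never soundness."""
    return must_set | {p} if p in A else must_set
```

Running the analysis with A = ∅ degenerates to pure may-alias reasoning, while larger A recovers more strong updates.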

3.1.5.2 Evaluation Results

Table 3.4 and Table 3.5 summarize the results of our experiments. They show the numbers of queries, abstractions, and iterations of our approach (CURRENT) and the baseline approaches (BASELINE) for each analysis and benchmark.

Table 3.5: Results showing statistics of queries, abstractions, and iterations of our approach (CURRENT) and the baseline approaches (BASELINE) on the typestate analysis.

                   queries          abstraction size        iterations
              total   resolved      final      max.     CURRENT   BASELINE
toba-s          543        543         62    14,781          15        159
javasrc-p       159        159         89    13,653          14         92
weblech          13         13         33    25,781          14         16
hedc             24         24         14    23,622           7         10
antlr            77         77         66    24,815          12         45
luindex         248        248         79    33,835          16         26
lusearch         45         45         74    33,526          13         52
schroeder-m     194        194         71    54,741           9         49

The 'total' column under queries shows the number of queries posed by the analysis on each benchmark. For the pointer analysis, each query corresponds to proving that a certain dynamically dispatched call site in the benchmark is monomorphic; i.e., it has a single target method. We excluded queries that could be proven by a context-insensitive pointer analysis. For the typestate analysis, each query corresponds to a typestate assertion. We tracked typestate properties for the objects from the same set of classes as used by Fink et al. [28] in their evaluation.

The 'resolved' column shows the number of queries proven or shown to be impossible to prove using any abstraction in the search space. For the pointer analysis, impossibility means that a call site cannot be proven monomorphic no matter how high the k values are. For the typestate analysis, impossibility implies that the typestate assertion cannot be proven even by tracking all program variables. In our experiments, we found that our approach successfully resolved all the queries for the pointer analysis, by using a maximum k value of 10 at any allocation site. However, the baseline 4-object-sensitive analysis without refinement could only resolve up to 50% of the queries. Selectively increasing the k value allowed our approach to scale better and try higher k values, leading to greater precision. For the typestate analysis client, both our approach and the baseline approach resolved all queries.

Table 3.4 and Table 3.5 also give the abstraction size, which is an estimate of how costly it is to run an abstraction. An abstraction is considered to be more efficient when its size is smaller. But the size of abstraction A is defined differently for the two analyses: for the pointer analysis, it is Σh A(h); for the typestate analysis, it is |A|. The 'max.' column shows the maximum size of an abstraction for the given program. For the pointer analysis, the maximum size corresponds to the 10-object-sensitive analysis. For the typestate analysis, the maximum size corresponds to tracking all access paths. Even on the smallest benchmark, these costly analyses ran out of memory, emphasizing the need for our CEGAR approach. The 'final' column shows the size of the abstraction used in the last iteration. In all cases, the size of the final abstraction is less than 5% of the maximum size.

The 'iterations' column shows the total number of iterations until all queries were resolved. These numbers show that our approach is capable of exploring a huge space of abstractions for a large number of queries simultaneously, in just a few iterations. In comparison, the baseline approach (BASELINE) of Zhang et al. invokes the typestate client analysis far more frequently because it refines each query individually. For example, the baseline approach took 159 iterations to finish the typestate analysis on toba-s, while our approach only needed 15 iterations. Since the baseline for the pointer analysis client is not a refinement-based approach, it invokes the client analysis just once and is not comparable with our approach. In the rest of this section, we evaluate the performance of the Datalog solver and the Markov Logic Network solver in more detail.

Performance of Datalog solver. Table 3.6 shows statistics of the running time of the Datalog solver in different iterations of our approach. These statistics include the minimum, maximum, and average running time over all iterations for a given analysis and benchmark. The numbers in Table 3.6 indicate that the abstractions chosen by our approach are small enough to allow the analyses to scale. For schroeder-m, one of our largest benchmarks, the change in running time from the slowest to the fastest run is only 2X for both client analyses. This further indicates that our approach is able to resolve all posed queries simultaneously before the sizes of the chosen abstractions start affecting the scalability of the client Datalog analyses. In contrast, the baseline k-object-sensitive analysis could only scale up to k = 4 on our larger benchmarks. Even with k = 4, the Datalog solver ran for over six hours on our largest benchmark when using the baseline approach. With our approach, on the other hand, the longest single run of the Datalog solver for the pointer analysis client was only seven minutes.

Table 3.6: Running time (in seconds) of the Datalog solver in each iteration.

                          pointer analysis          typestate analysis
              BASELINE   min.    max.    avg.      min.     max.      avg.
toba-s              11      5       7       6        49       82      68.1
javasrc-p           29      7      11       9        76      152     120.8
weblech          2,574     44      54    47.5       121      172     146.6
hedc             5,058     21      37    27.9        52       58      54.3
antlr            3,723     30      55    39.3       193      325     264.8
luindex            913     59      84    76.4       311      512     426.7
lusearch         7,040     59      85    72.7       238      437     343.9
schroeder-m     23,038    192     428   289.6     1,778    2,681   2,304.6

Figures 3.5 and 3.6 show the change in abstraction size and the analysis running time across iterations for the pointer and typestate analysis, respectively, applied on schroeder-m. There is a clear correlation between the growth in abstraction size and the increase in the running times. For both analyses, since our approach only chooses the cheapest viable abstraction in each iteration, the abstraction size grows almost linearly, as expected. Further, for the typestate analysis, an increase in abstraction size typically results in an almost linear growth in the number of abstract states tracked. Consequently, the linear growth in the running time for the typestate analysis is also expected behavior. However, for the pointer analysis, the number of distinct calling contexts typically grows exponentially with the increase in abstraction size. The linear curve for the running time in Figure 3.5 indicates that the abstractions chosen by our approach are small enough to limit this exponential growth.

Figure 3.5: Running time of the Datalog solver and abstraction size for pointer analysis on schroeder-m in each iteration.

Figure 3.6: Running time of the Datalog solver and abstraction size for typestate analysis on schroeder-m in each iteration.

Performance of Markov Logic Network solver. Table 3.7 shows statistics of the running time of the Markov Logic Network solver in different iterations of our approach. The metrics reported are the same as those for the Datalog solver. Although the performance of Markov Logic Network solvers is not completely deterministic, it is largely affected by two factors: (1) the size of the grounding of the instance posed to the solver, and (2) the structure of the constraints in this grounding. For both analyses, as seen previously, the abstraction size increases with the number of iterations while the number of unresolved queries decreases. Growth in abstraction size increases the complexity of the client Datalog analyses, causing an increase in the number of constraints in the grounding of the Markov Logic Network instance generated. On the other hand, fewer queries tend to simplify the structure of the constraints to be solved.

Table 3.7: Running time (in seconds) of the Markov Logic Network solver in each iteration.

                 pointer analysis       typestate analysis
              min.    max.    avg.     min.    max.     avg.
toba-s           2       7     3.1        1       6      3.1
javasrc-p       <1       4     1.6        2      19      6.4
weblech          5      11     6.7        3       8      5.3
hedc             1      23     3.7        1       2      1.7
antlr           11      44    24.1        5      27    13.25
luindex          8      48    16.3        6      26     14.7
lusearch         7      62    23.9        6      29     15.9
schroeder-m     34     257     114       37     308    138.6

Figure 3.7 shows the running time of the Markov Logic Network solver across all iterations for the pointer analysis applied to our largest benchmark schroeder-m. Initially, the solver running time shows an increasing trend, but this reverses towards the end. We believe that the two conflicting factors of size and structure of the constraints are at play here. While the complexity of the constraints increases initially due to the growing size of their grounding, after a certain iteration the number of unresolved queries becomes small enough to suitably simplify the structure of the constraints and overwhelm the effect of growing grounding size. For the remaining benchmarks and analyses, we observed a similar trend, which we omit for the sake of brevity.

Figure 3.7: Running time of the Markov Logic Network solver for pointer analysis on schroeder-m in each iteration.

3.1.6 Related Work

Our approach is broadly related to work on constraint-based analysis, including analysis based on boolean constraints, set constraints, and SMT constraints. Constraint-based analysis has well-known benefits that our approach also avails, such as the ability to reason about the analysis and to leverage sophisticated solvers to implement the analysis. A key difference is that constraint-based analyses typically solve constraints generated from program text, whereas our approach solves constraints generated from an analysis run, which is itself obtained by solving constraints generated from program text. Our approach is also related to work on CEGAR-based model checking and program analyses using Datalog, as we discuss next.

CEGAR-based Model Checking. CEGAR was originally proposed to enable model checkers to scale to even larger state-spaces than those possible by symbolic approaches such as BDDs [4]. Our motivation for using CEGAR, in contrast, is to enable designers of analyses in Datalog to express flexible abstractions. Moreover, our notions of counterexamples and refined abstractions differ radically from those in model checkers. Our work is most closely related to recent work on synthesizing software verifiers from proof rules for safety and liveness properties in the form of Horn-like clauses [31, 32, 33]. Their approach is also CEGAR-based but differs in two key ways: (1) they can identify internal nodes of derivations where information gets lost due to the current abstraction, which they subsequently refine, whereas we focus on leaf nodes of derivations; and (2) they use CEGAR to solve difficult Horn constraints formulated even on infinite domains, whereas we use CEGAR for finding a better abstraction, which is then used to generate new Horn constraints. As such, their approach is more expressive and flexible, but ours appears to scale better. Zhang et al. [30] propose a CEGAR-based approach for efficiently finding an optimal abstraction in a parametric program analysis. Our approach improves on Zhang et al. in three aspects. First, their counterexample generation requires a parametric static analysis

to be disjunctive (which implies path-sensitivity), whereas any analysis written in Datalog, including non-disjunctive ones, can be handled by our approach. As a result, their approach is not applicable to the pointer analysis in Chapter 3.1.5. Second, their approach relies on a nontrivial backward analysis for analyzing a counterexample and selecting the next abstraction to try, but this backward analysis is not generic and must be designed for each parametric analysis. Our approach, on the other hand, uses a generic algorithm based on Markov Logic Networks for the counterexample analysis and the abstraction selection, which only requires users to define the cost model of abstractions. Conversely, [30] converges faster for certain problems. Finally, the approach in [30] cannot mix counterexamples across iterations to generate new counterexamples for free, a feature that is present in our approach as illustrated in Section 3.1.2.

Program Analysis Using Datalog. Recent years have witnessed a surge of interest in using Datalog for program analysis (see Chapter 3.1.1). Datalog solvers have simultaneously advanced, using either symbolic representations of relations such as BDDs (e.g.,

BDDBDDB [19] and Jedd [21]) or even explicit representations (e.g., Doop [20]). More recently, the popular Z3 SMT solver has been extended to compute least fixpoints of constraints expressed in Datalog [22]. Our CEGAR approach is independent of the underlying Datalog solver and leverages these advances.

Liang et al. [34] propose a cause-effect dependence analysis technique for analyses in Datalog. The technique identifies input tuples that definitely do not affect output tuples. More specifically, it computes the transitive closure of all derivations of an output tuple to identify an over-approximation of the set of input tuples needed in any derivation (e.g., {t1, t2, t3}). In contrast, our approach identifies the exact set of input tuples needed in each of even exponentially many derivations (e.g., {{t1}, {t2, t3}}). Thus, in our example, their approach prunes abstractions that contain {t1, t2, t3}, whereas ours also prunes those that only contain {t1} or {t2, t3}.
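The difference between the two pruning conditions can be stated in a few lines. This is a hypothetical sketch, with derivations represented simply as sets of input tuples:

```python
def prunes_union(abstraction, derivations):
    """Pruning in the style of Liang et al. [34]: an abstraction is pruned
    only if it covers the over-approximate union of input tuples over all
    derivations of the spurious output tuple."""
    return set().union(*derivations) <= abstraction

def prunes_per_derivation(abstraction, derivations):
    """Our approach: prune if the abstraction covers any single derivation,
    since that derivation alone re-derives the spurious tuple."""
    return any(d <= abstraction for d in derivations)
```

With derivations {{t1}, {t2, t3}}, the second condition also rejects abstractions containing only {t1} or only {t2, t3}, exactly as described above.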

3.1.7 Conclusion

We presented a novel CEGAR-based approach to automatically find effective abstractions for program analyses written in Datalog. We formulated the abstraction refinement problem in each iteration as a Markov Logic Network that not only eliminates all abstractions that fail in a similar way but also finds the next cheapest viable abstraction. We showed the generality of our approach by applying it to two different and realistic static analyses. Finally, we demonstrated its practicality by evaluating it on a suite of eight real-world Java benchmarks.

3.2 Interactive Verification

3.2.1 Introduction

Automated static analyses make a number of approximations. These approximations are necessary because the static analysis problem is undecidable in general [13]. However, they result in many false alarms in practice, which in turn imposes a steep burden on users of the analyses. A common pragmatic approach to reduce false alarms involves applying heuristics to suppress their root causes. For instance, such a heuristic may ignore a certain code construct that can lead to a high false alarm rate [3]. While effective in alleviating user burden, however, such heuristics potentially render the analysis results unsound.

In this section, we propose a novel methodology that synergistically combines a sound but imprecise analysis with precise but unsound heuristics, through user interaction. Our key idea is that, instead of directly applying a given heuristic in a manner that may unsoundly suppress false alarms, the combined approach poses questions to the user about the truth of root causes that are targeted by the heuristic. If the user confirms them, only then is the heuristic applied to eliminate the false alarms, with the user's knowledge.

To be effective, however, the combined approach must accomplish two key goals: generalization and prioritization. We describe each of these goals and how we realize them.

Generalization. The number of questions posed to the user by our approach should be much smaller than the number of false alarms that are eliminated. Otherwise, the effort in answering those questions could outweigh the effort in resolving the alarms directly. To realize this objective for analyses that produce many alarms, we observe that most alarms are often symptoms of a relatively small set of common root causes. For example, a method that is falsely deemed reachable can result in many false alarms in the body of that method. Inspecting this single root cause is easier than inspecting multiple alarms.

Prioritization. Since the number of false alarms resulting from different root causes may vary, the user might wish to only answer questions with relatively high payoffs. Our approach realizes this objective by interacting with the user in an iterative manner instead of posing all questions at once. In each iteration, the questions asked are chosen such that the expected payoff is maximized. Note that the user may be asked multiple questions per iteration because single questions may be insufficient to resolve any of the remaining alarms. Asking questions in order of decreasing payoff allows the user to stop answering them when diminishing returns set in. Finally, the iterative process allows incorporating the user's answers to past questions in making better choices of future questions, thereby further amplifying the reduction in user effort.

We formulate the problem to be solved in each iteration of our approach as a non-linear optimization problem called the optimum root set problem. This problem aims at finding a set of questions that maximizes the benefit-cost ratio in terms of the number of false alarms expected to be eliminated and the number of questions posed.
We propose an efficient solution to this problem by reducing it to a sequence of Markov Logic Networks. Each Markov Logic Network encodes the dependencies between analysis facts, and weighs the benefits and costs of questioning different sets of root causes. Since our objective function is non-linear, we solve a sequence of Markov Logic Networks that performs a binary search

between the upper bound and lower bound of the benefit-cost ratio. Our approach automatically generates the Markov Logic Network formulation for any analysis specified in a constraint language. Specifically, we target Datalog, a declarative logic programming language that is widely used in formulating program analyses [7, 8, 10, 9, 20, 19]. Our constraint-based approach also allows incorporating orthogonal techniques to reduce user effort, such as alarm clustering techniques that express dependencies between different alarms using constraints, possibly in a different abstract domain [35, 36].
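The binary search over the non-linear benefit-cost ratio can be sketched as follows; a brute-force feasibility check stands in for the Markov Logic Network solve at each probe, and the names and enumeration are hypothetical simplifications:

```python
def optimum_payoff(candidates, gain, tol=1e-3):
    """Binary-search the largest ratio lam such that some root set R
    achieves gain(R) >= lam * |R|, i.e., a payoff of at least lam."""
    def feasible(lam):
        # stand-in for one Markov Logic Network solve at ratio lam
        return any(gain(R) >= lam * len(R) for R in candidates)

    lo, hi = 0.0, max(gain(R) for R in candidates)  # bounds on the ratio
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, if one candidate root set eliminates 4 alarms with 1 question and another eliminates 6 alarms with 2 questions, the search converges to a payoff of 4.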

We have implemented our approach in a tool called URSA and evaluate it on a static datarace analysis using a suite of 8 Java programs comprising 41-194 KLOC each. URSA eliminates 74% of the false alarms per benchmark with an average payoff of 12× per question. Moreover, URSA effectively prioritizes questions with high payoffs. Further, based on data collected from 40 Java programmers, we observe that the average time to resolve a root cause is only half of that to resolve an alarm. Together, these results show that our approach achieves significant reduction in user effort. We summarize our contributions below:

1. We propose a new paradigm to synergistically combine a sound but imprecise analy- sis with precise but unsound heuristics, through user interaction.

2. We present an interactive algorithm that implements the paradigm. In each iteration, it solves the optimum root set problem which finds a set of questions with the highest expected payoff to pose to the user.

3. We present an efficient solution to the optimum root set problem using Markov Logic Networks for a general class of constraint-based analyses.

4. We empirically show that our approach eliminates a majority of the false alarms by asking only a few questions. Moreover, it effectively prioritizes questions with high payoffs.

1  public class FTPServer implements Runnable{
2    private ServerSocket serverSocket;
3    private List conList;
4
5    public void main(String args[]){
6      ...
7      startButton.addActionListener(
8        new ActionListener(){
9          public void actionPerformed(...){
10           new Thread(this).start();
11         }
12       });
13     stopButton.addActionListener(
14       new ActionListener() {
15         public void actionPerformed(...){
16           stop();
17         }
18       });
19   }
20
21   public void run(){
22     ...
23     while (runner != null){
24       Socket soc = serverSocket.accept();
25       connection = new RequestHandler(soc);
26       conList.add(connection);
27       new Thread(connection).start();
28     }
29     ...
30   }
31
32   private void stop(){
33     ...
34     for (RequestHandler con : conList)
35       con.close();
36     serverSocket.close();
37   }
38 }
39 public class RequestHandler implements Runnable{
40   private FtpRequestImpl request;
41   private Socket controlSocket;
42   ...
43
44   public RequestHandler(Socket socket){
45     ...
46     controlSocket = socket;
47     request = new FtpRequestImpl();
48     request.setClientAddress(socket.getInetAddress());
49     ...
50   }
51
52   public void run(){
53     ...
54     // log client information
55     clientAddr = request.getRemoteAddress();
56     ...
57     input = controlSocket.getInputStream();
58     reader = new BufferedReader(new InputStreamReader(input));
59     while (!isConnectionClosed){
60       ...
61       commandLine = reader.readLine();
62       // parse and check permission
63       request.parse(commandLine);
64       // execute command
65       service(request);
66     }
67   }
68
69   public synchronized void close(){
70     ...
71     controlSocket.close();
72     controlSocket = null;
73     request.clear();
74     request = null;
75   }
76 }

Figure 3.8: Example Java program extracted from Apache FTP Server.

3.2.2 Overview

We illustrate our approach by applying a static race detection tool to the multi-threaded Java program shown in Figure 3.8, which is extracted from the open-source program Apache FTP Server [37]. It starts by constructing a GUI in the main method that allows the administrator to control the status of the server. In particular, it creates a startButton and a stopButton, and registers the corresponding callbacks (on lines 7-18) for switching the server on/off. When the administrator clicks the startButton, a thread starts (on line 10) to execute the run method of class FTPServer, which handles connections in a loop (on lines 23-28). We refer to this thread as the Server thread. In each iteration of the loop, a fresh RequestHandler object is created to handle the incoming connection (on line 25). Then, it is added to field conList of FTPServer, which allows tracking all connections. Finally, a thread is started asynchronously to execute the run method of class RequestHandler.

Input Relations:
access(p, o)        (program point p accesses some field of abstract object o)
alias(p1, p2)       (program points p1 and p2 access same memory location)
escape(o)           (abstract object o is thread-shared)
parallel(p1, p2)    (program points p1 and p2 can be reached by different threads simultaneously)
unguarded(p1, p2)   (program points p1 and p2 are not guarded by any common lock)

Output Relations:
shared(p)           (program point p accesses thread-shared memory location)
race(p1, p2)        (datarace between program points p1 and p2)

Rules:
shared(p) :- access(p, o), escape(o).                                                       (1)
race(p1, p2) :- shared(p1), shared(p2), alias(p1, p2), parallel(p1, p2), unguarded(p1, p2). (2)

Figure 3.9: Simplified static datarace analysis in Datalog.

We refer to this thread as the Worker thread. We next inspect the RequestHandler class more closely. It has multiple fields that can be accessed from different threads and potentially cause dataraces. For brevity, we only discuss accesses to fields request and controlSocket, which are marked in bold in Figure 3.8. Both fields are initialized in the constructor of RequestHandler, which is invoked by the Server thread on line 25. Then, they are accessed in the run method by the Worker thread. Finally, they are accessed and set to null in the close method, which is invoked to clean up the state of RequestHandler when the current connection is about to be closed. The close method can either be invoked in the service method by the Worker thread when the thread finishes processing the connection, or be invoked by the Java GUI thread in the stop method (on line 35) when the administrator clicks the stopButton to shut down the entire server by invoking stop (on line 16). The accesses in close are guarded by a synchronization block (on lines 69-75) to prevent dataraces between them.

The program has multiple harmful race conditions: one on controlSocket between line 72 and line 57, and three on request between line 74 and lines 55, 63, and 65. More concretely, these two fields can be accessed simultaneously by the Worker thread in run and the Java GUI thread in close. The race conditions can be fixed by guarding lines 55-66 with a synchronization block that holds a lock on the this object.

[Derivation graph: the input fact escape(25) derives shared(46), shared(47), shared(48), shared(55), shared(57), shared(63), shared(65), shared(71), shared(72), shared(73), and shared(74), which in turn derive the alarms race(46,57), race(46,71), race(46,72), race(47,55), race(47,63), race(47,65), race(47,73), race(47,74), and race(48,74).]

Figure 3.10: Derivation of dataraces in example program.

We next describe using a static analysis to detect these race conditions. In our implementation, we use the static datarace analysis from Chord [38], which is context- and flow-sensitive, and combines a thread-escape analysis, a may-happen-in-parallel analysis, and a lockset analysis. Here, for ease of exposition, we use a much simpler version of that analysis, shown in Figure 3.9. It comprises two logical inference rules in Datalog. Rule (1) states that the instruction at program point p accesses a thread-shared memory location if the instruction accesses a field of an object o, and o is thread-shared. Since these properties are undecidable, the rule over-approximates them using abstract objects o, such as object allocation sites. Rule (2) uses the thread-escape information computed by Rule (1) along with several input relations to over-approximate dataraces: instructions at p1 and p2, at least one of which is a write, may race if (a) each of them may access a thread-shared memory location, (b) both of them may access the same memory location, (c) they may be reached from different threads simultaneously, and (d) they are not guarded by any common lock. All input relations are in fact calculated by the analysis itself, but we treat them as inputs for brevity.

Since the analysis is over-approximating, it not only successfully captures the four real dataraces but also reports nine false alarms. The derivation of these false alarms is shown by the graph in Figure 3.10. We use line numbers of the program to identify program points p and object allocation sites o. Each hyperedge in the graph is an instantiation of an analysis rule. In the graph, we elide tuples from all input relations except the ones in escape. For example, the dotted hyperedge is an instantiation of Rule (2): it derives race(48, 74), a race between lines 48 and 74, from the facts shared(48), shared(74), alias(48, 74), parallel(48, 74), and unguarded(48, 74), which in turn are recursively derived or input facts.

Note that the above analysis does not report any datarace between accesses in the constructor of RequestHandler. This is because, by excluding tuples such as parallel(46, 46), the parallel relation captures the fact that the constructor can only be invoked by the Server thread. Likewise, it does not report any datarace between accesses in the close method as it is able to reason about locks via the unguarded relation. However, there are still nine false alarms among the 13 reported alarms. To find the four real races, the user must inspect all 13 alarms in the worst case.

To reduce the false alarm rate, Chord incorporates various heuristics, one of which is particularly effective in reducing false alarms in our scenario. This heuristic can be turned on by running Chord with the option chord.datarace.exclude.init=true, which causes Chord to ignore all races that involve at least one instruction in an object constructor. Internally, the heuristic suppresses all shared tuples whose instruction lies in an object constructor. Intuitively, most such memory accesses are on the object being constructed, which typically stays thread-local until the constructor returns. Indeed, in our scenario, shared(46), shared(47), and shared(48) are all spurious (as the object only becomes thread-shared on line 27) and lead to nine false alarms. However, applying the heuristic can render the analysis result unsound. For instance, the object under construction can become thread-shared inside the constructor by being assigned to the field of a thread-shared object or by starting a new thread on it. Applying the heuristic can suppress true alarms that are derived from shared tuples related to such facts.

We present a new methodology and tool, URSA, to address this problem. Instead of directly applying the above heuristic in a manner that may introduce false negatives, URSA poses questions to the user about root causes that are targeted by the heuristic. In our example, such causes are the shared tuples in the constructor of RequestHandler. If the user confirms such a tuple as spurious, only then is it suppressed in the analysis, thereby eliminating all false alarms resulting from it.

URSA has two features that make it effective. First, it can eliminate a large number of false alarms by asking a few questions. The key insight is that most alarms are often symptoms of a relatively small set of common root causes. For instance, false alarms race(47, 55), race(47, 63), race(47, 65), race(47, 73), and race(47, 74) are all derived due to shared(47). Inspecting this single root cause is easier than inspecting all five alarms.

Second, URSA interacts with the user in an iterative manner, and prioritizes questions with high payoffs in earlier iterations. This allows the user to only answer questions with high payoffs and stop the interaction when the gain in alarms resolved has diminished compared to the effort in answering further questions.

URSA realizes the above two features by solving an optimization problem called the optimum root set problem, in each iteration. Recall that Chord’s heuristic identified as potentially spurious the tuples shared(46), shared(47), shared(48); these appear in red in the derivation graph. We wish to ask the user if these tuples are indeed spurious, but not in an arbitrary order, because a few answers to well-chosen questions may be sufficient. We wish to find a small non-empty subset of those tuples, possibly just one tuple, which could rule out many false alarms. We call such a subset of tuples a root set, as the tuples in it are root causes for the potential false alarms, according to the heuristic. Each root set has an expected gain (how many false alarms it could rule out) and a cost (its size, which is how many questions the user will be asked). The optimum root set problem is to find a root set that maximizes the expected gain per unit cost. We refer to the gain per unit cost as the payoff. We next describe each iteration of URSA on our example. Here, we assume the user always answers the question correctly. Later we discuss the case where the user occasionally gives incorrect answers (Chapter 3.2.4.6 and Chapter 3.2.6.2).

Iteration 1. URSA picks {shared(47)} as the optimum root set since it has an expected payoff of 5, which is the highest among those of all root sets. Other root sets like {shared(46), shared(47)} may resolve more alarms but end up with lower expected payoffs due to larger numbers of required questions.

Based on the computed root set, URSA poses a single question about shared(47) to the user: “Does line 47 access any thread-shared memory location? If the answer is no, five races will be suppressed as false alarms.” Here URSA only asks one question, but there are other cases in which it may ask multiple questions in an iteration depending on the size of the computed optimum root set. Besides the root set, we also present the corresponding number of false alarms expected to be resolved. This allows the user to decide whether to answer the questions by weighing the gains in reducing alarms against the costs in answering them. Suppose the user decides to continue and answers no. Doing so labels shared(47) as false, which in turn resolves five race alarms race(47, 55), race(47, 63), race(47, 65), race(47, 73), and race(47, 74), highlighting the ability of URSA to generalize user input.

Next, URSA repeats the above process, but this time taking into account the fact that shared(47) is labeled as false.

Iteration 2. In this iteration, we have two optimum root set candidates: {shared(46)} with an expected payoff of 3 and {shared(48)} with an expected payoff of 1. While resolving {shared(46)} is expected to eliminate {race(46, 57), race(46, 71), race(46, 72)}, resolving {shared(48)} can only eliminate {race(48, 74)}.

URSA picks {shared(46)} as the optimum root set and poses the following question to the user: “Does line 46 access any thread-shared memory location? If the answer is no, three races will be suppressed as false alarms.” Compared to the previous iteration, the expected payoff of the optimum root set has reduced, highlighting URSA’s ability to prioritize questions with high payoffs. We assume the user answers no, which labels shared(46) as false. As a result, race alarms race(46, 57), race(46, 71), and race(46, 72) are eliminated. At the end of this iteration, eight out of the 13 alarms have been eliminated. Moreover, only one false alarm race(48, 74) remains.

a ∈ A ⊆ T (alarms)    q ∈ Q ⊆ T (potential causes)    f ∈ F ⊆ Q (causes)

(a) Auxiliary definitions and notations.

⟦C⟧_F := lfp S^F_C
S^F_C(T) := S_C(T) \ F
S_C(T) := T ∪ { t | t ∈ s_c(T) and c ∈ C }
s_{l0 :- l1,...,ln}(T) := { σ(l0) | σ(l1) ∈ T, . . . , σ(ln) ∈ T }

(b) Augmented semantics of Datalog.

Figure 3.11: Syntax and semantics of Datalog with causes.

Iteration 3. In this iteration, there is only one root set {shared(48)}, which has an expected payoff of 1. Similar to the previous two iterations, URSA poses a single question about shared(48) to the user along with the number of false alarms expected to be suppressed. At this point, the user may prefer to resolve the remaining five alarms manually, as the expected payoff is too low and very few alarms remain unresolved. URSA terminates if the user makes this decision. Otherwise, the user proceeds to answer the question regarding shared(48) and thereby resolves the only remaining false alarm race(48, 74). At this point, URSA terminates as all three tuples targeted by the heuristic have been answered.

In summary, URSA resolves, in a sound manner, 8 (or 9) out of 9 false alarms by only asking two (or three) questions. Moreover, it prioritizes questions with high payoffs in earlier iterations.

3.2.3 The Optimum Root Set Problem

In this section, we define the Optimum Root Set problem, abbreviated ORS, in the context of static analyses expressed in Datalog.

3.2.3.1 Declarative Static Analysis

We say that an analysis is sound when it over-approximates the concrete semantics. We augment the semantics of Datalog in a way that allows us to control the over-approximation. Figure 3.11(a) introduces three special subsets of tuples: alarms, potential causes, and causes. An alarm corresponds to a bug report. A potential cause is a tuple that is identified by a heuristic as possibly spurious. A cause is a tuple that is identified by an oracle (e.g., a human user) as indeed spurious.

Figure 3.11(b) gives the augmented semantics of Datalog, which allows controlling the over-approximation with a set of causes F. It is similar to the standard Datalog semantics except that in each iteration it removes the tuples in the cause set F from the derived tuples T (denoted by function S^F_C). Since a cause in F is never derived, it is not used in deriving other tuples. In other words, the augmented semantics not only suppresses the causes themselves but also the spurious tuples that can be derived from them.
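The augmented fixpoint can be sketched as follows, with rules represented as plain functions from a set of tuples to newly derivable tuples (an assumed encoding for illustration, not a real Datalog engine):

```python
def augmented_lfp(rules, base_facts, F):
    # Compute lfp S^F_C, where S^F_C(T) = S_C(T) \ F: evaluate the rules
    # to a fixpoint, removing the cause set F after every application so
    # that nothing is ever derived from a suppressed tuple.
    T = set(base_facts) - F
    while True:
        new = set(T)
        for rule in rules:
            new |= rule(T)
        new -= F                      # S^F_C(T) = S_C(T) \ F
        if new == T:
            return T
        T = new

# Illustrative rules: path(x,y) :- edge(x,y); path(x,z) :- path(x,y), edge(y,z)
edges = {(1, 2), (2, 3)}
rules = [
    lambda T: {("path", x, y) for (x, y) in edges},
    lambda T: {("path", x, z)
               for (tag, x, y) in T if tag == "path"
               for (y2, z) in edges if y2 == y},
]
full = augmented_lfp(rules, set(), set())               # standard semantics, F = ∅
pruned = augmented_lfp(rules, set(), {("path", 1, 2)})  # suppress one cause
```

Suppressing the cause ("path", 1, 2) also removes ("path", 1, 3), which is derivable only from it, matching the behavior described above.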

3.2.3.2 Problem Statement

We assume the following are fixed: a set C of Datalog constraints, a set A of alarms, a set Q of potential causes, and a set F of confirmed causes. Informally, our goal is to grow F such that ⟦C⟧_F contains fewer alarms. To make this precise, we introduce a few terms. Given two sets T1 ⊇ T2 of tuples, we define the gain of T2 relative to T1 as the number of alarms that are in T1 but not in T2:

gain(T1, T2) := |(T1 \ T2) ∩ A|

A root set R is a non-empty subset of potential but unconfirmed causes: R ⊆ Q \ F. Intuitively, the name is justified because we will aim to put in R the root causes of the false alarms. The potential causes in R are those we will investigate, and for that we have to pay a cost |R|. The potential gain of R is the gain of ⟦C⟧_{F∪R} relative to ⟦C⟧_F. The expected payoff of R is the potential gain per unit cost. (The actual payoff depends on which potential causes in R will be confirmed.) With these definitions, we can now formally introduce the ORS problem.

Problem 7 (Optimum Root Set). Given a set C of constraints, a set Q of potential causes, a set F of confirmed causes, and a set A of alarms, find a root set R ⊆ Q \ F that maximizes gain(⟦C⟧_F, ⟦C⟧_{F∪R}) / |R|, which is the expected payoff.
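The definitions above can be stated directly in code. In this sketch, derive is an illustrative stand-in for the augmented semantics ⟦C⟧_F, and the alarms and causes are made up for the example:

```python
# Sketch of gain and expected payoff; `derive`, the alarms, and the causes
# below are illustrative stand-ins, not the analysis of the running example.
def gain(T1, T2, A):
    # Number of alarms in T1 but not in T2 (assumes T2 ⊆ T1).
    return len((T1 - T2) & A)

def expected_payoff(R, derive, F, A):
    # Potential gain of R per unit cost |R|.
    return gain(derive(F), derive(F | R), A) / len(R)

A = {"a1", "a2", "a3"}
def derive(F):
    T = {"a1", "a2", "a3", "c1", "c2"}
    if "c1" in F:
        T -= {"a1", "a2"}   # alarms a1, a2 depend only on cause c1
    if "c2" in F:
        T -= {"a3"}         # alarm a3 depends only on cause c2
    return T - F

print(expected_payoff({"c1"}, derive, set(), A))        # 2 alarms / 1 question
print(expected_payoff({"c1", "c2"}, derive, set(), A))  # 3 alarms / 2 questions
```

Here {"c1"} has the higher expected payoff (2.0 versus 1.5), even though the larger root set resolves more alarms.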

3.2.3.3 Monotonicity

The definitions from the previous two subsections (Chapter 3.2.3.1 and Chapter 3.2.3.2) immediately imply some monotonicity properties. We list these here, because we refer to them in later sections.

Lemma 8. If T1 ⊆ T2 and F1 ⊇ F2, then S^{F1}_C(T1) ⊆ S^{F2}_C(T2). In particular, if F1 ⊇ F2, then ⟦C⟧_{F1} ⊆ ⟦C⟧_{F2}.

Lemma 9. If T1′ ⊆ T1 and T2 ⊆ T2′, then gain(T1′, T2′) ≤ gain(T1, T2). Also, gain(T, T) = 0 for all T.

3.2.3.4 NP-Completeness

Problem 7 is an optimization problem which can be turned into a decision problem in the usual way, by providing a threshold for the objective. The question becomes whether there exists a root set R that can achieve a given payoff. This decision problem is clearly in NP: the set R is a suitable polynomial certificate. The problem is also NP-hard, which we can show by a reduction from the minimum vertex cover problem. Given a graph G = (V, E), a subset U ⊆ V of vertices is said to be a vertex cover when it intersects all edges {i, j} ∈ E. The minimum vertex cover problem, which asks for a vertex cover of minimum size, is well-known to be NP-hard. We reduce it to the ORS problem as follows. We represent the graph G with three relations R = {vertex, edge, alarm}. Without loss of generality, let V be the set {0, . . . , n − 1}, for some n. Thus, we identify vertices with Datalog constants. For each edge {i, j} ∈ E we include two ground constraints:

edge(i, j):- vertex(i), vertex(j)

alarm() :- edge(i, j)

We set F = ∅, A := {alarm()}, and Q := { vertex(i) | i ∈ V }. Since A has size 1, the gain can only be 0 or 1. To maximize the ORS objective, the gain has to be 1. Further, the size of the root set R has to be minimized. But, one can easily see that a root set leads to a gain of 1 if and only if it corresponds to a vertex cover.
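The correspondence between gain-1 root sets and vertex covers can be checked by brute force on a toy instance (a hypothetical three-vertex graph; this is an illustration, not part of the formal proof):

```python
from itertools import combinations

# Toy instance of the reduction: graph with V = {0, 1, 2}, E = {{0, 1}, {1, 2}}.
V = [0, 1, 2]
E = [(0, 1), (1, 2)]

def alarm_derived(blocked):
    # alarm() is derivable iff some edge {i, j} has both vertex(i) and
    # vertex(j) unblocked, i.e. iff the blocked set is NOT a vertex cover.
    return any(i not in blocked and j not in blocked for (i, j) in E)

# Root sets achieving gain 1 (alarm suppressed) are exactly the vertex covers.
covers = [set(c) for r in range(1, len(V) + 1)
          for c in combinations(V, r) if not alarm_derived(set(c))]
min_cover = min(covers, key=len)   # a minimum vertex cover
```

On this instance the smallest root set with gain 1 is {vertex(1)}, which is indeed the minimum vertex cover of the graph.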

3.2.4 Interactive Analysis

Our interactive analysis combines three ingredients: (a) a static analysis; (b) a heuristic; and (c) an oracle. We require the static analysis to be implemented in Datalog, which lets us track dependencies. The requirements on the heuristic and the oracle are mild: given a set of Datalog tuples they must pick some subset. These are all the requirements. Suppose there is a ground truth that returns the truth of each analysis fact based on the concrete semantics. It helps to think of the three ingredients intuitively as follows:

(a) the static analysis is sound, imprecise with respect to the ground truth and fast;

(b) the heuristic is unsound, precise with respect to the ground truth and fast; and

(c) the oracle agrees with the ground truth and is slow.

Technically, we show a soundness preservation result (Theorem 13): if the given oracle agrees with the ground truth and the static analysis over-approximates it, then our combined analysis also over-approximates the ground truth. Soundness preservation holds with any heuristic. When a heuristic agrees with the ground truth, we say that it is ideal; when it almost agrees with the ground truth, we say that it is precise. Even though which heuristic one uses does not matter for soundness, we expect that more precise heuristics lead to better performance. We explore speed and precision later, through experiments (Chapter 3.2.6). Also later, we discuss what happens in practice when the static analysis is unsound or the oracle does not agree with the ground truth (Chapter 3.2.4.6).

Algorithm 2 Interactive Resolution of Alarms
INPUT: constraints C, and potential alarms A
OUTPUT: remaining alarms
1: Q := Heuristic()
2: F := ∅
3: while Q ≠ F do
4:   R := OptimumRootSet(C, A, Q, F)
5:   Y, N := Decide(R)
6:   Q := Q \ Y
7:   F := Q \ ⟦C⟧_{F∪N}
8: end while
9: return A ∩ ⟦C⟧_F

The high level intuition is as follows. The heuristic is fast. But, since the heuristic might be unsound, we need to consult an oracle before relying on what the heuristic reports. However, consulting the oracle is an expensive operation, so we would like to not do it often. Here is where the static analysis specified in Datalog helps. On one hand, by analyzing the Datalog derivation, we can choose good questions to ask the oracle. On the other hand, after we obtain information from the oracle, we can propagate that information through the Datalog derivation, effectively finding answers to questions we have not yet asked.

We begin by showing how the static analysis, the heuristic, and the oracle fit together (Chapter 3.2.4.1), and why the combination preserves soundness (Chapter 3.2.4.2). A key ingredient in our algorithm is solving the ORS problem. We do this by solving a sequence of Markov Logic Networks (Chapter 3.2.4.3). Each Markov Logic Network encodes the problem of whether, given a Datalog derivation, a certain payoff is feasible (Chapter 3.2.4.4 and Chapter 3.2.4.5). Finally, we close with a short discussion (Chapter 3.2.4.6).

3.2.4.1 Main Algorithm

Algorithm 2 brings together our key ingredients:

(a) the set C of constraints and the set A of potential alarms represent the static analysis implemented in Datalog;

(b) Heuristic is the heuristic; and

(c) Decide is the oracle.

The key to combining these ingredients into an analysis that is fast, sound, and precise is the procedure OptimumRootSet, which solves Problem 7. From now on, the set C of constraints and the set A of potential alarms should be thought of as immutable global variables: they do not change, and they will be available in subroutines even if not explicitly passed as arguments. In contrast, the set Q of potential causes and the set F of confirmed causes do change in each iteration, while maintaining the invariant F ⊆ Q. Initially, the invariant holds because no cause has been confirmed, F = ∅. In each iteration, we invoke the oracle for a set of potential causes that are not yet confirmed: we call Decide(R) for some non-empty R ⊆ Q \ F. The result we get back is a partition (Y,N) of R. In Y we have tuples that are in fact true, and therefore are not causes for false alarms; in N we have tuples that are in fact false, and therefore are causes for false alarms. We remove tuples Y from the set Q of potential causes (line 6); we insert tuples N into the set F of confirmed causes, and we also insert all other potential causes that may have been blocked by N (line 7).

At the end, we return the remaining alarms A ∩ ⟦C⟧_F.
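The loop can be sketched as follows, with derive standing in for the augmented semantics ⟦C⟧_F and the heuristic, oracle, and ORS solver supplied as toy stand-in functions:

```python
# Sketch of Algorithm 2. All ingredients are assumed, simplified stand-ins.
def interactive_resolution(derive, heuristic, decide, optimum_root_set, A):
    Q = set(heuristic())          # potential causes
    F = set()                     # confirmed causes
    while Q != F:
        R = optimum_root_set(Q, F)     # nonempty R ⊆ Q \ F
        Y, N = decide(R)               # Y: true tuples, N: spurious tuples
        Q -= Y                         # line 6: true tuples are not causes
        F = Q - derive(F | N)          # line 7: F := Q \ ⟦C⟧_{F∪N}
    return A & derive(F)               # remaining alarms

# Toy analysis: alarm a1 is derived only from cause c1; a2 is always derived.
def derive(F):
    T = {"c1", "c2", "a1", "a2"}
    if "c1" in F:
        T -= {"c1", "a1"}
    return T - F

result = interactive_resolution(
    derive,
    heuristic=lambda: {"c1", "c2"},
    decide=lambda R: (R & {"c2"}, R & {"c1"}),   # ground truth: c1 spurious
    optimum_root_set=lambda Q, F: {sorted(Q - F)[0]},
    A={"a1", "a2"},
)
```

With the ground truth above, the loop confirms c1 as spurious, keeps c2, and returns only the alarm a2.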

3.2.4.2 Soundness

For correctness, we require that the oracle is always right. In particular, the answers given by the oracle should be consistent: if asked repeatedly, the oracle should give the same

answer about a tuple. Formally, we require that there exists a partition of all tuples T into two subsets True and False such that

Decide(T ) = (T ∩ True,T ∩ False) for all T ⊆ T (3.1)

We refer to the set True as the ground truth. We say that the analysis C is sound when

F ⊆ False implies ⟦C⟧_F ⊇ True (3.2)

That is, a sound analysis is one that over-approximates, as long as we do not explicitly block a tuple in the ground truth.

Lemma 10. A sound analysis C can derive the ground truth; that is, True = ⟦C⟧_False.

The correctness of Algorithm 2 relies on the analysis C being sound and also on making progress in each iteration, which we ensure by requiring

∅ ⊂ OptimumRootSet(C, A, Q, F) ⊆ Q \ F (3.3)

Lemma 11. In Algorithm 2, suppose that Heuristic returns all tuples T. Also, assume (3.1), (3.2), and (3.3). Then, Algorithm 2 returns the true alarms A ∩ True.

Proof sketch. The key invariant is that

F ⊆ False ⊆ Q (3.4)

The invariant is established by setting F := ∅ and Q := T. By (3.1), we know that Y ⊆ True and N ⊆ False, on line 5. It follows that the invariant is preserved by removing

Y from Q, on line 6. It also follows that F ∪ N ⊆ False and, by (3.2), that ⟦C⟧_{F∪N} ⊇ True. So, the invariant is also maintained by line 7. We conclude that (3.4) is indeed an invariant.

Because R is nonempty, the loop terminates (details in appendix). When it does, we have F = Q. Together with the invariant (3.4), we obtain that F = False. By Lemma 10, it follows that Algorithm 2 returns A ∩ True.

We can relax the requirement that Heuristic returns T because of the following result:

Lemma 12. Consider two heuristics which compute, respectively, the sets Q1 and Q2 of potential causes. Let A1 and A2 be the corresponding sets of alarms computed by Algorithm 2. If Q1 ⊆ Q2, then A1 ⊇ A2.

Proof sketch. Use the same argument as in the proof of Lemma 11, but replace the invariant (3.4) with

F ⊆ Q′ ∩ False ⊆ Q

where Q′ ∈ {Q1, Q2} is the initial value of Q.

From Lemmas 11 and 12, we conclude that Algorithm 2 is sound, for all heuristics.

Theorem 13. Assume that there is a ground truth, the analysis C is sound, and OptimumRootSet makes progress; that is, assume (3.1), (3.2), and (3.3). Then, Algorithm 2 terminates and it returns at least the true alarms A ∩ True.

Observe that there is no requirement on Heuristic. If Heuristic always returns ∅, then our method degenerates to simply using the imprecise but fast static analysis implemented in

Datalog. If Heuristic always returns T and the oracle is a precise analysis, then our method degenerates to an iterative combination between the imprecise and the precise analyses. Usually, we would use a heuristic that has demonstrated its effectiveness in practice, even though it may lack theoretical guarantees for soundness. In our approach, soundness is inherited from the static analysis specified in Datalog and from the oracle. If the oracle is unsound, perhaps because the user makes mistakes, then so is our analysis. Thus, Theorem 13 is a relative soundness result.

Algorithm 3 OptimumRootSet
INPUT: constraints C, potential alarms A, potential causes Q, causes F
OUTPUT: root set R with maximum payoff
1: ratios := Sorted{ a/b | a ∈ {0, . . . , |A|} and b ∈ {1, . . . , |Q \ F|} }
2: i := 0; k := |ratios|   {ratios is an array indexed from 0, and |ratios| is its length}
3: while i + 1 < k do
4:   j := ⌊(i + k)/2⌋
5:   if IsFeasible(ratios[j], C, A, Q, F) then
6:     i := j
7:   else
8:     k := j
9:   end if
10: end while
11: return FeasibleSet(ratios[i], C, A, Q, F)

3.2.4.3 Finding an Optimum Root Set

We want to find a root set R that maximizes expected payoff

gain(⟦C⟧_F, ⟦C⟧_{F∪R}) / |R| = |(⟦C⟧_F \ ⟦C⟧_{F∪R}) ∩ A| / |R|

This expression is nonlinear, so it is not obvious how to maximize it by using a solver for linear programs. Our solution is to do a binary search on the payoff values, as seen in Algorithm 3. Given a gain a, a cost b, an analysis of constraints C and alarms A, potential causes Q, and causes F, we find out if a payoff a/b is feasible by calling IsFeasible(a/b, C, A, Q, F), which we will implement by solving a Markov Logic Network (Chapter 3.2.4.5). Similarly, we implement FeasibleSet(a/b, C, A, Q, F) using a Markov Logic Network solver, to find a root set R ⊆ Q \ F with payoff ≥ a/b. We require, as a precondition, that Q \ F is nonempty. The array ratios contains all possible payoff values, sorted.
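Assuming the feasibility oracle is monotone (if a payoff is feasible, so is every smaller one), the binary search of Algorithm 3 can be sketched as:

```python
from fractions import Fraction

# Sketch of Algorithm 3. `is_feasible` is an assumed stand-in for the
# Markov Logic Network query IsFeasible(ratio, C, A, Q, F); it must be
# monotone: if a payoff is feasible, every smaller payoff is too.
def optimum_payoff(n_alarms, n_unconfirmed, is_feasible):
    ratios = sorted({Fraction(a, b)
                     for a in range(n_alarms + 1)
                     for b in range(1, n_unconfirmed + 1)})
    i, k = 0, len(ratios)          # ratios[0] = 0 is always feasible
    while i + 1 < k:
        j = (i + k) // 2
        if is_feasible(ratios[j]):
            i = j                  # ratios[i] stays feasible
        else:
            k = j                  # ratios[k] stays infeasible
    return ratios[i]               # largest feasible payoff
```

For example, with |A| = 5, |Q \ F| = 3, and a toy oracle that deems payoffs up to 5/2 feasible, the search returns 5/2.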

3.2.4.4 From Augmented Datalog to Markov Logic Network

Let us begin by encoding the augmented Datalog semantics (Figure 3.11(b)) into Markov Logic Networks: given F, we want the Markov Logic Network solver to compute ⟦C⟧_F. Standard Datalog semantics correspond to the case F = ∅. So, to find ⟦C⟧_∅ and all rule instances t0 :- t1, . . . , tn, we start by running a Datalog solver once. We set T := ⟦C⟧_∅. Knowing the result of the standard Datalog solver, the task is to construct a Markov Logic Network that would compute ⟦C⟧_F for an arbitrary F. For each relation r ∈ R, we introduce new relations Xr, Yr, Zr. Consequently, we introduce new tuples xt, yt, zt for each tuple t. Let X be the union of the Xr's computed by the Markov Logic Network solver, and define Y and Z similarly. We will construct our Markov Logic Network such that X = ⟦C⟧_F, Y = S_C(X), and Z = F. We encode Z = F by having hard constraints

zt for each t ∈ F

¬zt for each t ∉ F

We encode Y ⊇ S_C(X) by having constraints

X1 ∧ ... ∧ Xn =⇒ Y0 for each l0 :- l1, . . . , ln

where Xi corresponds to the relation in li. Note that we use the same list of arguments in Xi as the ones used in li, which we elide for brevity. Finally, we encode X ⊇ Y \ Z by

Yr ∧ ¬Zr =⇒ Xr for each r ∈ R.

Observe that X ⊇ Y \ Z together with Y ⊇ S_C(X) imply that X ⊇ S^Z_C(X); that is, X is a post fixed point of S^Z_C. By Lemma 8 and the Knaster–Tarski theorem, the least post fixed point of S^Z_C coincides with its least fixed point. Thus, to guarantee that X = ⟦C⟧_F, it only remains to add soft constraints {¬Xr weight 1 | r ∈ R} to minimize |X|. Note that X = ⟦C⟧_F implies X = S^Z_C(X), which together with X ⊇ Y \ Z ⊇ S^Z_C(X) imply that Y = S_C(X).

We use the Markov Logic Network solver as follows. Given a set F, we create constraints as above. Then we run the Markov Logic Network solver, which will return a set of tuples. The desired ⟦C⟧_F is encoded in the derived tuples {xt}_{t∈T}.

X1 ∧ . . . ∧ Xn =⇒ Y0    for each l0 :- l1, . . . , ln    (1)
Yr ∧ ¬Zr =⇒ Xr           for each r ∈ R                  (2)
¬zt                       for t ∉ Q                       (3)
zt                        for t ∈ F                       (4)
⋁_{t∈Q\F} zt              (requires |R| > 0)              (5)
b · Σ_{t∈A∩⟦C⟧_F} (1 − xt) − a · Σ_{t∈Q\F} zt ≥ 0    (requires payoff ≥ a/b)    (6)

Figure 3.12: Implementing IsFeasible and FeasibleSet by solving a Markov Logic Network. All xt, yt, zt are tuples except that in (6), they are variables taking values in {0, 1} which represent whether the corresponding tuples are derived.
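The construction can be mimicked with plain data structures. The clause representation below is ad hoc, chosen only to show the shape of the constraints (hard implications per ground rule, the Yr ∧ ¬Zr ⟹ Xr bridges, unit clauses fixing Z = F, and soft ¬x_t clauses); it is not the input format of any particular Markov Logic Network solver, and input tuples are not modeled:

```python
# Ad hoc sketch of the constraint generation of this subsection.
# Tuples are strings; ground rules are (head, [body, ...]) pairs.
def encode(ground_rules, universe, F):
    hard, soft = [], []
    # (x_t1 ∧ ... ∧ x_tn) ⟹ y_t0, one clause per ground rule instance
    for head, body in ground_rules:
        hard.append((["x:" + t for t in body], "y:" + head))
    # y_t ∧ ¬z_t ⟹ x_t for every tuple t
    for t in universe:
        hard.append((["y:" + t, "-z:" + t], "x:" + t))
    # unit clauses fixing Z = F exactly
    for t in universe:
        hard.append(([], ("z:" if t in F else "-z:") + t))
    # soft clauses ¬x_t with weight 1, so the solver minimizes |X|
    for t in universe:
        soft.append(("-x:" + t, 1))
    return hard, soft

rules = [("shared(48)", ["access(48,o25)", "escape(o25)"])]
hard, soft = encode(rules, ["shared(48)"], F=set())
```

For the single rule and single output tuple above, this yields three hard clauses and one soft clause, matching constraints (1), (2), and (3)/(4) in shape.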

3.2.4.5 Feasible Payoffs

In this section, we adjust the Markov Logic Network encoding of augmented Datalog so that we can implement the subroutines IsFeasible and FeasibleSet used in Algorithm 3. Given sets A, Q, F and a rational payoff a/b, we want to decide whether there exists a nonempty root set R ⊆ Q \ F that achieves the payoff (IsFeasible), and find an example of such a set (FeasibleSet). In other words, we want to find a set R such that

b · |(⟦C⟧_F \ ⟦C⟧_{F∪R}) ∩ A| − a · |R| ≥ 0

and |R| > 0. The resulting encoding appears in Figure 3.12. As before, we introduce new relations Xr, Yr, Zr and new tuples xt, yt, zt, but with slightly different meanings: the constraints are set up such that X ⊇ ⟦C⟧_Z and Z = F ∪ R. As before (Chapter 3.2.4.4), constraints (1)–(4) in Figure 3.12 encode Y ⊇ S_C(X) and X ⊇ Y \ Z, which imply that X is a post fixed point of S^Z_C. We dropped the optimization objective that ensured we compute least fixed points, which is why X ⊇ ⟦C⟧_Z rather than X = ⟦C⟧_Z. However, using Lemma 9, we can show that there exists a post fixed point of S^Z_C that leads to the payoff a/b (that is, satisfies constraints (5) and (6) in Figure 3.12) if and only if the least fixed point of S^Z_C leads to the payoff. So, the minimization of |X| is unnecessary. Note that, for ease of exposition, we write the payoff constraint (6) as a linear constraint. It can be encoded as Markov Logic Network constraints in a way akin to standard approaches for converting linear constraints to SAT constraints [39].

Example. We show the IsFeasible encoding for the second iteration of the example from Chapter 3.2.2. We first show the encoding that corresponds to constraints (1) in Figure 3.12. Recalling the analysis rules from Figure 3.9, we can encode them using the following hard constraints:

Xaccess(p, o) ∧ Xescape(o) =⇒ Yshared(p)

Xshared(p1) ∧ Xshared(p2) ∧ Xalias(p1, p2) ∧ Xparallel(p1, p2) ∧ Xunguarded(p1, p2)

=⇒ Yrace(p1, p2).

We also model the inputs with unit hard constraints such as Xescape(25).

Then, for every relation r in Figure 3.9, we add Yr ∧ ¬Zr =⇒ Xr (e.g., Yshared(p) ∧ ¬Zshared(p) =⇒ Xshared(p)), which corresponds to constraints (2) in Figure 3.12. The potential causes returned by the heuristic are Q = {shared(46), shared(47), shared(48)}. At the beginning of the second iteration one cause had been confirmed, F = {shared(47)}. We have zt for t ∈ F, and ¬zt for t ∉ Q, which correspond to constraints (4) and (3) in Figure 3.12, respectively.

For constraints (5), we have Zshared(46) ∨ Zshared(48). Finally,

b · (1 − Xrace(46, 57) + 1 − Xrace(46, 71) + 1 − Xrace(46, 72) + 1 − Xrace(48, 74))

−a · (Zshared(46) + Zshared(48)) ≥ 0

where the X tuples range over the four unresolved false alarms at the beginning of the second iteration.

Let us see what we need to write down the constraints from Figure 3.12. First, we need the result of running a standard Datalog solver: the set ⟦C⟧_∅, and the corresponding constraints C. We obtain these by running the Datalog solver once, in the beginning. Second, we need the sets A, Q, and F. The set A is fixed, since it only depends on the analysis. Sets Q and F do change, but they are sent as arguments to IsFeasible and FeasibleSet. Third, we need the ratio a/b, which is also sent as an argument. Fourth, we need the set ⟦C⟧_F; we can compute it as described earlier (Chapter 3.2.4.4). If the Markov Logic Network solver finds the instance to be feasible, then it gives us a set of output tuples, including tuples in Z. To implement FeasibleSet, we compute R as Z \ F.

3.2.4.6 Discussion

First, we discuss a few alternatives for the payoff, for the termination condition, and for the oracle. Then, we discuss the issue of soundness from the point of view of the mismatch between theory and practice. Finally, we discuss how our approach could be applied in a non-Datalog setting.

Payoff. The algorithm presented above uses |R| as the cost measure. It might be, however, that some potential causes are more difficult to investigate than others. If the user provides us with an expected cost of investigating each potential cause, then we can adapt our algorithm, in the obvious way, so that it prefers calling Decide on cheap potential causes. In situations when multiple root sets have the same maximum payoff, we may want to prefer one with minimum size. Intuitively, small root sets let us gather information from the oracle quickly. To encode a preference for smaller root sets, we can extend the constraints in Figure 3.12 with soft constraints { ¬zt weight 1 | t ∈ Q \ F }.

Termination. Algorithm 2 terminates when all potential causes have been investigated; that is, when Q = F. Another option is to look at the current value of the expected payoff. Suppose Algorithm 3 finds an expected payoff ≤ 1. This means that we expect investigating potential causes to be no easier than investigating the alarms themselves. So, we could decide to terminate, as we do in our experiments (Chapter 3.2.6). Finally, instead of looking at the expected payoff computed by the Markov Logic Network solver, we could track the actual payoff for each iteration. Then, we could stop if the average payoff in the last few iterations is less than some threshold.

Oracle. By default, the oracle Decide is a human user. However, we can also use a precise yet expensive analysis as the oracle, which enables an alternative use case of our approach. This use case focuses on balancing the overall precision and scalability of combining the base analysis and the oracle analysis, rather than reducing user effort in resolving alarms. Our approach allows the oracle analysis to focus on only the potential causes that are relevant to the alarms, especially the ones with high expected payoffs. For example, the end user might find it takes too long to apply the oracle analysis to resolve all potential causes. By applying our approach, they can use the oracle analysis to answer only the potential causes with high payoffs and resolve the remaining alarms via other methods (e.g., manual inspection). Appendix B includes a more detailed discussion of this use case.

Soundness. In practice, most static analyses are unsound [40]. In addition, in our setting, the user may answer incorrectly. We discuss each of these issues below. Many program analyses are designed to be sound in theory but are unsound in practice due to engineering compromises. If certain language features were handled in a sound manner, then the analysis would be unscalable or exceedingly imprecise. For example, in Java, such features include reflection, dynamic class loading, native code, and exceptions. If we start from an unsound static analysis, then our interactive analysis is also unsound. However, Theorem 13 gives evidence that we do not introduce new sources of unsoundness.

More importantly, our approach is still effective in suppressing false alarms and therefore reduces user effort, as we demonstrate through experiments (Chapter 3.2.6). In theory, the oracle never makes mistakes; in practice, users do make mistakes, of two kinds: they may label a true tuple as spurious, and they may label a spurious tuple as true. If a true tuple is labeled as spurious, then the interactive analysis becomes unsound. However, if the user answers x% of questions incorrectly, then we expect that the fraction of false negatives is not far from x%. Our experiments show that, even better, the fraction of false negatives tends to be less than x%. A consequence of this observation is that, if potential causes are easier to inspect than alarms, then our approach will decrease the chances that real bugs are missed. If a spurious tuple is labeled as true, then the interactive analysis may ask more questions. It is also possible that fewer false alarms are filtered out: the user could make the mistake on the only remaining question with expected payoff > 1, which in turn would cause the interaction to terminate earlier. (See the previous discussion on termination.) Later, we analyze both kinds of mistakes quantitatively (Chapter 3.2.6.2 and Table 3.9).

Non-Datalog Analyses. We focus on program analyses implemented in Datalog for two reasons: (1) it is easy to capture provenance information for such analyses; and (2) there is a growing trend towards specifying program analyses in Datalog [20, 10, 11, 8, 9]. However, not all analyses are implemented in Datalog; for example, see [3, 41, 42]. In principle, it is possible to apply our approach to any program analysis. To do so, the analysis designer would need to figure out how to capture provenance information of an analysis’ execution. This might not be an easy task, but it is a one-time effort.

3.2.5 Instance Analyses

We demonstrate our approach on the datarace analysis from Chord, a static datarace detector for Java programs. To show the generality and versatility of our approach, Appendix B

describes its instantiation on a pointer analysis for the alternative use case, where the oracle is a precise yet expensive static analysis. Next, we briefly describe the datarace analysis, its notions of alarms and causes, and our implementation of the procedure Heuristic.

The datarace analysis is a context- and flow-sensitive analysis introduced by [43]. It comprises 30 rules, 18 input relations, and 18 output relations. It combines a thread-escape analysis, a may-happen-in-parallel analysis, and a lockset analysis. It reports a datarace between each pair of instructions that access the same thread-shared object, are reachable by different threads in parallel, and are not guarded by a common lock.

While the alarms are the datarace reports, the potential causes, which are identified by Heuristic, could be any set of tuples in theory. However, we found it useful to focus our heuristics on two relations: shared and parallel, which we observe to often contain common root causes of false alarms. The shared relation contains instructions that may access thread-shared objects; the parallel relation contains instruction pairs that may be reachable in parallel. We call the set of all tuples from these two relations the universe of potential causes. We provide four different Heuristic instantiations, which are shown in Figure 3.13.

Instantiations static optimistic and static pessimistic contain static rules that reflect analysis designers' intuition. Instantiation dynamic leverages a dynamic analysis to identify analysis facts that are likely spurious. Finally, instantiation aggregated combines the previous three instantiations using a decision tree. We next describe each instantiation in detail.

Instantiation static optimistic encodes a heuristic applied in the implementation by [43], which treats shared tuples whose associated access occurs in an object constructor as spurious. Moreover, it includes parallel tuples related to instructions in Thread.run() and Runnable.run() in the potential causes, as they are often falsely derived due to a context-insensitive call-graph. Instantiation static pessimistic is similar to static optimistic except that it aggressively classifies all shared tuples as false, capturing the intuition that most accesses are thread-local.

static optimistic()  = {shared(i) | instruction i is in a constructor} ∪
                       {parallel(i, t1, t2) | instruction i is in java.lang.Thread.run() or java.lang.Runnable.run()}
static pessimistic() = {shared(i) | i is an instruction} ∪
                       {parallel(i, t1, t2) | instruction i is in java.lang.Thread.run() or java.lang.Runnable.run()}
dynamic()    = {shared(i) | instruction i is executed and only accesses thread-local objects during the runs} ∪
               {parallel(i, t1, t2) | whenever thread t1 executes instruction i in the run, t2 is not running}
aggregated() = decisionTree(dynamic, static optimistic, static pessimistic)

Figure 3.13: Heuristic instantiations for the datarace analysis.

When applying our approach, one might lack intuitions like the above to effectively identify potential causes. In this case, one can leverage the power of testing, akin to work on discovering likely invariants [44]: if an analysis fact is not observed consistently across different runs, then it is very likely to be false. We provide instantiation dynamic to capture this intuition. It leverages a dynamic analysis and returns tuples whose associated program points are reached but whose associated program facts are not observed across runs.

Having multiple Heuristic instantiations is not unusual; we gain the benefits of each of them using a combined instantiation aggregated, which decides whether to classify each given tuple as a potential cause by considering the results of all the individual instantiations. Instantiation aggregated realizes this idea using a decision tree that aggregates the results of the other instantiations. We obtain such a decision tree by training it on benchmarks where the tuples in the universe of potential causes are fully labeled.

3.2.6 Empirical Evaluation

This section evaluates the effectiveness of our approach by applying it to the datarace analysis on a suite of 8 Java programs. In addition, Appendix B discusses the evaluation results for the use case where the oracle is a precise yet expensive analysis, by applying our approach to the pointer analysis on the same benchmark suite.

Table 3.8: Benchmark characteristics. Column |A| shows the numbers of alarms. Column |QU| shows the sizes of the universes of potential causes, where k stands for thousands. All the reported numbers except for |A| and |QU| are computed using a 0-CFA call-graph analysis.

             # Classes    # Methods    Bytecode (KB)  Source (KLOC)          |A|            |QU|
             app  total   app  total    app   total    app   total   false  total  false%
raytracer     18     87    74    283    5.1      18    1.8    41.4     226    411     55%    5.3k
montecarlo    18    114   115    442    5.2      23    3.5    50.8      37     38   97.4%    4.4k
sor            6    100    12    431    1.8      30    0.6    52.5      64     64    100%     940
elevator       5    188    24    899    2.3      52    0.6      88     100    100    100%    1.4k
jspider      113    391   422  1,572   17.7    74.6    6.7     106     214    264   81.1%     82k
hedc          44    353   230  2,134     16     140    6.1     128     317    378   83.9%     38k
ftp          119    527   608  2,705   36.5     142   18.2     162     594    787   75.5%    131k
weblech       11    576    78  3,326      6     208     12     194       6     13   46.2%    6.2k

3.2.6.1 Evaluation Setup

We implemented our approach in a tool called URSA for analyses specified in Datalog that target Java programs. We use Chord [38] as the Java analysis framework and bddbddb [19] as the Datalog solver. All experiments were done using Oracle HotSpot JVM 1.6 on a Linux machine with 64GB memory and 2.7GHz processors.

Table 3.8 shows the characteristics of each benchmark. In particular, it shows the numbers of alarms and tuples in the universes of potential causes. Note that the size of the universe of potential causes is one to two orders of magnitude larger than the number of alarms for each benchmark, highlighting the large search space of our problem.

We next describe how we instantiate the main algorithm (Algorithm 2) of URSA by describing our implementations of Heuristic, Decide, and the termination check.

Heuristic. We evaluate all Heuristic instantiations described in Chapter 3.2.5, plus an ideal one that answers according to the ground truth. We define aggregated by

aggregated() := { t ∈ QU | f(t ∈ static optimistic(), t ∈ static pessimistic(), t ∈ dynamic()) }

where f is a ternary Boolean function represented by a decision tree. We learn this decision tree, using the C4.5 algorithm [45], on the four smaller benchmarks. Each data point corresponds to a tuple in the universe of potential causes, and we obtain the expected output by invoking the Decide procedure, whose implementation is described immediately after this discussion of Heuristic. The result of learning is f(x, y, z) = x ∧ z, which leads to

aggregated() = static optimistic() ∩ dynamic().

Thus, aggregated only classifies a tuple as spurious when both static optimistic and dynamic do so. We find aggregated to be the best instantiation overall in terms of numbers of questions asked and alarms resolved by URSA.
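Since a decision tree over three Boolean features is equivalent to a truth table with at most eight leaves, the aggregation can be sketched by learning a majority label per feature combination. This is a hypothetical stand-in for C4.5, and the toy training data below is fabricated to be consistent with the learned result f(x, y, z) = x ∧ z.

```python
from collections import Counter

def learn_boolean_aggregator(samples):
    """Learn f: (x, y, z) -> label from labeled examples by taking the
    majority label observed for each feature combination (a hypothetical
    stand-in for C4.5 on three Boolean features)."""
    votes = {}
    for features, label in samples:
        votes.setdefault(features, Counter())[label] += 1
    return {k: c.most_common(1)[0][0] for k, c in votes.items()}

# Fabricated training data: a tuple is a likely root cause only if both
# static_optimistic (x) and dynamic (z) flag it; static_pessimistic (y)
# turns out to be irrelevant, matching f(x, y, z) = x and z.
samples = [((x, y, z), x and z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
f = learn_boolean_aggregator(samples)
```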

To evaluate the effectiveness of various heuristics, we also define the heuristic ideal() := QU ∩ True. This heuristic requires precomputing the ground truth, which one would not do when using URSA normally. As ground truth, we use the consensus among answers from many users. Note that the interactive analysis treats ideal as it does any of the other heuristics, cross-checking its answers against the oracle.

Decide. In practice, Decide is implemented by a real user or an expensive analysis that provides answers to questions posed by URSA in an online manner. To evaluate our approach uniformly under various settings, however, we obtained answers to all the alarms and potential causes offline. To obtain such answers, we hired a group of 44 Java developers on UpWork [46], a freelancer platform. For improved confidence in the answers, we required them to have prior experience with concurrent programming in Java; moreover, we filtered out 4 users who gave incorrect answers to three hidden diagnostic questions, resulting in 40 valid participants. Finally, to reduce the answering effort, we applied a dynamic analysis and marked facts observed to hold in concrete runs as true, and required answers only for the unresolved ones.

Termination. As described in Chapter 3.2.4.6, we decide to terminate the algorithm when the expected payoff is ≤ 1. Intuitively, there is no root cause left that could explain more than one alarm. Thus, the user may find it more effective to inspect the remaining reports directly or to use other techniques to further reduce the false alarms.

Figure 3.14: Number of questions asked over total number of false alarms (denoted by the lower dark bars) and percentage of false alarms resolved (denoted by the upper light bars) by URSA. Note that URSA terminates when the expected payoff is ≤ 1, which indicates that the user should stop looking at potential causes and focus on the remaining alarms. (Bar chart omitted; one pair of bars per benchmark, comparing URSA with the ideal setting.)

3.2.6.2 Evaluation Results

Our evaluation addresses the following six questions:

1. Generalization: How effectively does URSA identify the fewest queries that eliminate the most false alarms?
2. Prioritization: How effectively does URSA prioritize those queries that eliminate the most false alarms?
3. User time for causes vs. alarms: How much time do users spend inspecting a potential cause versus a potential alarm?
4. Impact of incorrect user answers: How do incorrect user answers affect the effectiveness of URSA in terms of precision, soundness, and user effort?
5. Scalability: Does the optimization procedure of URSA scale to large programs?

6. Effect of different heuristics: What is the impact of different heuristics on generalization and prioritization?

We report all averages using arithmetic means, except average payoffs and average speedups, which are calculated using geometric means.

Figure 3.15: Number of questions asked and number of false alarms resolved by URSA in each iteration. (Plots omitted; one panel per benchmark, with x-axis: # questions, y-axis: # alarms, comparing URSA with the ideal setting.)
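For reference, the geometric mean used for payoffs and speedups can be computed as follows; the payoff values in the example are fabricated.

```python
import math

def geometric_mean(xs):
    """Geometric mean: the appropriate average for ratio quantities
    such as per-benchmark payoffs and speedups."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Fabricated per-benchmark payoffs: a single large outlier influences
# the geometric mean far less than the arithmetic mean (here ~40.3).
payoffs = [4.0, 9.0, 108.0]
gm = geometric_mean(payoffs)
```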

Generalization results. Figure 3.14 shows the generalization results of URSA with aggregated, the best available Heuristic instantiation. For comparison, we also include the ideal heuristic. For each benchmark under both settings, we show the percentage of resolved false alarms (the upper light bars) and the number of questions asked over the total number of false alarms (the lower dark bars). The figure also shows the absolute numbers of questions asked (denoted by the numbers over the dark bars) and the numbers of false alarms produced by the input static analysis (denoted by the numbers at the top).

URSA is able to eliminate 73.7% of the false alarms with an average payoff of 12× per question. On ftp, the benchmark with the most false alarms, the gain is as high as 87% and the average payoff increases to 29×. Note that URSA does not resolve all alarms, as it terminates when the expected payoff becomes 1 or smaller, which means there are no common root causes for the remaining alarms. These results show that most of the false alarms can indeed be eliminated by inspecting only a few common root causes.

URSA eliminates most of the false alarms for all benchmarks except hedc, where only 17.7% of the false alarms are eliminated. In fact, even under the ideal setting, only 34% of the false alarms can be eliminated. Closer inspection revealed that most alarms in hedc indeed do not share common root causes. However, the questions asked comprise only 7% of the false alarms. This shows that even when there is little room to generalize, URSA does not ask unnecessary questions.

In the ideal case, URSA eliminates an additional 15.6% of false alarms on average, which modestly improves over the results with aggregated. We thus conclude that the aggregated instantiation is effective in identifying common root causes of false alarms.

Prioritization results. Figure 3.15 shows the prioritization results of URSA. In the plots, each iteration of Algorithm 2 has a corresponding point, which represents the number of false alarms eliminated (y-axis) and the number of questions asked so far (x-axis). As before, we compare the results of URSA to the ideal case, which has a perfect heuristic.

We observe that a few causes often yield most of the benefits. For instance, three causes can resolve 323 out of 594 false alarms on ftp, yielding a payoff of 108×. URSA successfully identifies these causes with high payoffs and poses them to the user in the earliest few iterations. In fact, for the first three iterations on most benchmarks, URSA asks exactly the same set of questions as the ideal setting. The results of these two settings differ only in later iterations, where the payoff becomes relatively low. We also notice that the set yielding the highest benefit can contain multiple causes (for instance, in the aforementioned ftp results). The reason is that there can be multiple derivations of each alarm. If we naively searched the potential causes by fixing the number of questions in advance, we could miss such causes. URSA, on the other hand, successfully finds them by solving the optimum root set problem iteratively.
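The iterative question-asking loop can be sketched as follows. This greedy, one-question-at-a-time stand-in (with hypothetical helper names) only illustrates the control flow; URSA actually solves a global optimum root set problem over whole question sets via Markov Logic Networks in each iteration.

```python
def interact(alarms, candidates, resolves, decide):
    """Greedy sketch of URSA's interaction loop (hypothetical helpers).
    resolves[c] is the set of alarms resolved if cause c is confirmed
    spurious; decide is the user oracle. Each iteration asks about the
    cause with the highest payoff (alarms resolved per question) and
    stops once that payoff drops to 1 or below."""
    alarms, candidates = set(alarms), set(candidates)
    questions = 0
    while alarms and candidates:
        best = max(candidates, key=lambda c: len(resolves.get(c, set()) & alarms))
        if len(resolves.get(best, set()) & alarms) <= 1:
            break  # expected payoff <= 1: stop interacting
        questions += 1
        candidates.discard(best)
        if not decide(best):            # user marks the cause as spurious
            alarms -= resolves[best]    # all alarms it explains are false
    return alarms, questions

# Fabricated example: cause "c1" explains three alarms, "c2" only one,
# so only one question is worth asking.
remaining, asked = interact({1, 2, 3, 4}, {"c1", "c2"},
                            {"c1": {1, 2, 3}, "c2": {4}},
                            decide=lambda c: False)
```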

The fact that URSA is able to prioritize causes with high payoffs allows the user to stop interacting after a few iterations and still get most of the benefits. URSA terminates either when the expected payoff drops to 1 or when there are no more questions to ask. But in practice, the user might choose to stop the interaction even earlier. For instance, the user might be satisfied with the current result, or she may find limited payoff in answering more questions.

We study these causes with high payoffs more closely. For our datarace analysis, the two main categories of causes are (i) spurious thread-shared memory access (shared) tuples in object constructors of classes that extend the java.lang.Thread class or implement the java.lang.Runnable interface, and (ii) spurious may-happen-in-parallel (parallel) tuples in the run methods of similar classes. The objects whose constructors contain spurious shared tuples are mostly created in loops in which each iteration creates a new thread that executes the run method of the object. The datarace analysis is unable to distinguish the objects created in different iterations and considers them all thread-shared after the first iteration. This leads to many false datarace alarms between the main thread, which invokes the constructors, and the threads created in the loops, which also access the objects by executing their run methods. The spurious parallel tuples are produced by the context-insensitive call-graph, which incorrectly matches the run methods containing them to Thread.start invocations that cannot actually reach these run methods. This in turn leads to many false alarms between multiple threads.

User time for causes vs. alarms. While URSA can significantly reduce the number of alarms that a user must inspect, it comes at the expense of the user inspecting causes. We measured the time consumed by each programmer when labeling individual causes and alarms for the datarace analysis. Our measurements show that it takes 578 seconds on average for a user to inspect a datarace alarm but only 265 seconds on average to inspect a cause. This is because reasoning about an alarm often requires reasoning about multiple causes and other program facts that can derive it. To precisely quantify the reduction in user effort, a more rigorous user study is needed, which is beyond the scope of this work. However, the massive reduction in the number of alarms that a user needs to inspect, and the fact that a cause is on average much less expensive to inspect, show URSA's potential to significantly reduce overall user effort. In practice, there might be cases where the causes are much more expensive to inspect than the alarms. However, as discussed in Chapter 3.2.4.6, as long as the costs can be quantified in some way, we can naturally encode them in our optimization problem. As a result, URSA will be able to find the set of questions that maximizes the payoff in user effort.

Table 3.9: Results of URSA on ftp with noise in Decide. The baseline analysis produces 193 true alarms and 594 false alarms. We run each setting 30 times and report the averages.

% Noise   # Resolved     % Resolved     # False     % Retained     # Questions   Payoff
          False Alarms   False Alarms   Negatives   True Alarms
  0%         517.0          87.0%          0.0        100.0%          18.0        28.7
  1%         516.4          86.9%          0.0        100.0%          18.1        28.6
  5%         515.4          86.8%          4.9         97.4%          18.2        28.4
 10%         505.0          85.0%          9.2         95.2%          19.4        26.3

Impact of incorrect user answers. As discussed earlier (Chapter 3.2.4.6), users can make mistakes. We analyze the impact of mistakes quantitatively by injecting random noise into Decide: we flip its answer to each tuple with probability x%. We set x to 1, 5, and 10, and apply URSA to ftp, the benchmark with the most alarms. We run the experiment under each setting 30 times and calculate the averages of all statistics. Table 3.9 summarizes the results in terms of precision, soundness, and user effort. We also show the results of URSA without noise for comparison.

Columns 2 and 3 show, respectively, the number and the percentage of false alarms that are resolved by URSA with noise. When the amount of noise increases, both statistics drop. This is due to Decide incorrectly marking certain spurious tuples posed by URSA as true; such tuples might well be the only tuples with payoff > 1 that can resolve certain alarms. However, we also notice that the drop is modest: a noise level of 10% leads to a drop in resolved false alarms of only 2%. Columns 4 and 5 show, respectively, the number of false negatives and the percentage of retained true alarms. URSA can introduce false negatives when a noisy Decide incorrectly

marks a true tuple as spurious. The number of false negatives grows as the amount of noise increases, but the growth is not significant: a noise level of 10% leads to missing only 4.8% of the true alarms. This shows that URSA does not amplify user errors. This is especially true for the datarace analysis, given that its alarms are harder to inspect than the root causes.

Figure 3.16: Time consumed by URSA in each iteration. (Plot omitted; y-axis: runtime in seconds, one distribution per benchmark.)

Columns 6 and 7 show, respectively, the number of questions asked by URSA and the payoff of answering these questions. As expected, more noise leads to more questions and smaller payoffs. But, the effect is modest: a noise of 10% increases the number of questions by 7.7% and decreases the payoff by 8.4%.

In summary, URSA is resilient to user mistakes.
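The noise model in this experiment is straightforward to reproduce: wrap Decide so that each answer is flipped independently with probability x%. A minimal sketch, with a fabricated always-true oracle:

```python
import random

def noisy_decide(decide, x):
    """Wrap an oracle so each answer is flipped with probability x% --
    the noise model of this robustness experiment (hypothetical sketch;
    URSA's Decide answers yes/no per tuple)."""
    def wrapped(tuple_):
        answer = decide(tuple_)
        return (not answer) if random.random() < x / 100.0 else answer
    return wrapped

random.seed(0)                       # fixed seed for reproducibility
oracle = lambda t: True              # fabricated ground truth: all true
flaky = noisy_decide(oracle, 10)     # 10% noise
answers = [flaky(i) for i in range(1000)]
# roughly 10% of the 1000 answers are flipped to False
```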

Scalability results. Figure 3.16 shows the distributions of the time consumed by URSA in each iteration. As the figure shows, almost all iterations finish within half a minute. On closer inspection, we find that most of the time is spent in the Markov Logic Network solver: while each invocation takes less than 5 seconds, one iteration consists of up to 20 invocations. We note that while the user inspects causes (which takes 265 seconds on average), URSA has time to speculate what the answers will be and to prepare the next set of questions accordingly. There are two cases in which URSA spent more than half a minute: the first iterations of jspider and ftp. This iteration happens after the static analysis runs and before the user interaction starts. So, from the point of view of the user,

it is the same as having a slightly longer static analysis time. Moreover, on the four smaller benchmarks, each iteration of URSA takes only a few seconds. Such fast response times ensure URSA's usability in an interactive setting.

Figure 3.17: Number of questions asked and number of false alarms eliminated by URSA with different Heuristic instantiations: (a) overall; (b) across iterations on raytracer. (Charts omitted; settings compared: static_pessimistic, dynamic, ideal, static_optimistic, aggregated, and, in (b), a baseline.)

Effect of different heuristics. The results of running the datarace analysis under each heuristic (Chapter 3.2.5) are shown in Figure 3.17: (a) generalization and (b) prioritization.

Generalization. We begin with a basic sanity check. By Lemma 12, if one heuristic returns a subset of the potential causes returned by another, then it cannot resolve more false alarms than the other. Thus, aggregated can never resolve more false alarms than static optimistic or dynamic, and static optimistic can never resolve more false alarms than static pessimistic. This is indeed the case, as we can see from the white bars in Figure 3.17(a).

We make two main observations: (i) dynamic has wide variation in performance, and (ii) aggregated has the best average performance. As before, we measure generalization performance by payoff. The variation in payoff under the dynamic heuristic can be explained by a variation in test coverage: dynamic performs well for montecarlo and elevator (which have good test coverage), and poorly for raytracer and hedc (which have poor test coverage). The good average payoff of aggregated shows that aggressive heuristics guide our approach towards asking good questions.

Prioritization. Figure 3.17(b) shows the prioritization result for each instantiation on raytracer. For comparison, we also show the result with a baseline Heuristic instantiation which returns all tuples in the shared and parallel relations as potential causes. Instantiation aggregated yields the best result: the first six questions posed under it are the same as those under the ideal case, and answering these questions resolves 177 out of 226 false alarms. The other instantiations perform slightly worse but still prioritize the top three questions in terms of payoff within the first 10 questions. On the other hand, the baseline fails to resolve any alarms for the first 107 iterations, highlighting the effectiveness of using any of our Heuristic instantiations over using none of them.

3.2.7 Related Work

We survey techniques for reducing user effort in inspecting static analysis alarms. Then, we explain the relation of our work to techniques for combining analyses.

Models for interactive analysis. Different usage models have been proposed to counter the impact of approximations [47, 48, 9] or missing specifications [49, 50] on the accuracy of static analyses.

In the approaches proposed by [48, 9], user annotations on inspected alarms are generalized to re-rank or re-classify the remaining alarms. These approaches are probabilistic, and do not guarantee soundness. Also, users inspect alarm reports rather than intermediate causes.

The technique proposed by [47] asks users to resolve intermediate causes that are guaranteed to resolve individual alarms soundly. It aims to minimize the user effort needed to resolve a single alarm. In contrast, our approach aims to minimize the user effort to resolve all alarms considered together, making it more suitable for static analyses that report large numbers of alarms with shared underlying causes of imprecision.

Instead of approximations, the approaches proposed by [50, 49] target specifications for missing program parts that result in misclassified alarms. In particular, they infer candidate specifications that are needed to verify a given assertion, and present them to users for validation. These techniques are applicable to analyses based on standard graph reachability [49] or CFL reachability [50], while we target analyses specified in Datalog.

Just-in-time analysis aims at low-hanging fruit: it first presents alarms that can be found quickly, are unlikely to be false positives, and involve only code close to where the programmer edits. While the programmer responds to these alarms, the analysis searches for more alarms, which are harder to find. A just-in-time analysis exploits user input only in a limited sense: it begins by looking at the code in the programmer's working set. [51] show how to transform an analysis based on the IFDS/IDE framework into a just-in-time analysis. By comparison, our approach targets Datalog (rather than IFDS/IDE) and requires a delay in the beginning to run the underlying analysis, but then fully exploits user input and provenance information to minimize the number of alarms presented to the user.

The Ivy model checker takes user guidance to infer inductive invariants [52].
Our approach is to start with an over-approximation and ask for user help to prune it down towards the ground truth; the Ivy approach is to guess a separating frontier between reachable and error states and ask for user help in refining this guess both downwards and upwards, until no transition crosses the frontier. Ivy targets small modeling languages, rather than full-blown programming languages like Java.

Solvers for interactive analysis. Various solving techniques have been developed to incorporate user feedback into analyses or to ask relevant questions of analysis users. Abductive inference, used by [47, 49], involves computing minimum satisfying assignments to SMT formulae [47] or minimum-size prime implicants of Boolean formulae [49]. The maximum satisfiability problem used by [9] involves solving a system of mixed hard and soft constraints. While the hard constraints encode soundness conditions, the soft constraints encode various objectives such as the costs of different user questions, the likelihood of different outcomes, and the confidence in different hypotheses. These problems are NP-hard; in contrast, the CFL reachability algorithm used by [50] runs in polynomial time.

Report ranking and clustering. A common approach to reducing a user's effort in inspecting static analysis alarms is to rank them by their likelihood of being true [53, 54, 55]. In the work by [53], a statistical post-analysis is presented that computes the probability of each alarm being true. Z-ranking [54] employs a simple statistical model based on semantic inconsistency detection [56] to rank those alarms most likely to be true errors over those that are least likely. In the work by [55], a post-processing technique also based on semantic inconsistency detection is proposed to de-prioritize the reporting of alarms in modular verifiers that are caused by overly demonic environments.

Report clustering techniques [35, 36, 48] group related alarms together to reduce user effort in inspecting redundant alarms. These techniques either compute logical dependencies between alarms [35, 36], or employ a probabilistic model for finding correlated alarms [48]. Unlike these techniques, our approach correlates alarms by identifying their common root causes, which are obtained by analyzing the abstract semantics of the analysis at hand. Our approach is complementary to the above techniques; information such as the truth likelihood of alarms and the logical dependencies between alarms can be exploited to further reduce user effort in our approach.

Spectrum-based fault localization. There is a large body of work on fault localization, especially spectrum-based techniques [57, 58, 59, 60, 61] that seek to pinpoint the root causes of failing tests in terms of program behaviors that differ in passing tests, and thereby reduce programmers' debugging effort. Analogously, our approach aims to identify the root causes of alarms, albeit with two crucial differences: first, our approach is not aware upfront whether each alarm is true or false, unlike in fault localization, where each test is known to be failing or passing; second, our approach operates on the abstract program semantics used by the given static analysis, whereas fault localization techniques operate on the concrete program semantics.

Interactive Program Optimization. IOpt [62, 63] is an interactive program optimizer. While it targets program performance rather than correctness, it also uses benefit-cost ratios to rank the questions, where the benefit is the expected program speedup and the cost is the estimated user effort in answering the question. However, the underlying techniques to compute such ratios are radically different: while IOpt’s approach is based on heuristics, our approach solves a rigorous optimization problem that is generated from the analysis semantics.

Combining multiple analyses. Our method allows a user and an analysis to work together interactively. It is possible to replace the user with another analysis, thus obtaining a method for combining two analyses. We note that the converse is often false: if one starts with a method for combining analyses, then it is usually impossible to replace one of the analyses with the user. The reason is that humans need an interface with low bandwidth. In our case, they get simple yes/no questions. But if we look at the approach proposed by [8], which is closest to our work on a technical level, then we see a tight integration between analyses in which most state is shared, and it is even difficult to say where the interface between the analyses lies. There is no user who would be happy to be presented with the entire internal state of an analysis and asked to perform an update on it. Designing a fruitful method for combining analyses becomes significantly more difficult if one adds the constraint that the interaction must be low bandwidth, and such a constraint is necessary if a user is to be involved.

Low-bandwidth methods for combining analyses may prove useful even in the absence of a user because they do not require the analyses to be alike. This is speculation, but we can point to a similar situation that is reality: the Nelson–Oppen method for combining decision procedures is wildly successful because it requires only equalities to be communicated [64]. Existing methods for combining analyses do not have the low-bandwidth property. In previous works by [65, 66, 67], a fast pre-analysis is used to infer a precise and efficient abstraction for a parametric analysis. This pre-analysis can be either an over-approximating static analysis [66] or an under-approximating dynamic analysis [65, 67].

3.2.8 Conclusion

We presented an interactive approach to resolve static analysis alarms. The approach encompasses a new methodology that synergistically combines a sound but imprecise analysis with precise but unsound heuristics. In each iteration, it solves the optimum root set problem, which finds a set of questions with the highest expected payoff to pose to the user. We presented an efficient solution to this problem based on Markov Logic Networks for a general class of constraint-based analyses. We demonstrated the effectiveness of our approach in practice at eliminating a majority of false alarms by asking only a few questions, and at prioritizing questions with high payoffs.

3.3 Static Bug Detection

3.3.1 Introduction

Program analysis tools often make approximations. These approximations are a necessary evil, as the program analysis problem is undecidable in general. There are also several specific factors that drive various assumptions and approximations: the program behaviors that the analysis intends to check may be impossible to define precisely (e.g., what constitutes a security vulnerability), computing exact answers may be prohibitively costly (e.g., worst-case exponential in the size of the analyzed program), parts of the analyzed program may be missing or opaque to the analysis (e.g., if the program is a library), and so on. As a result, program analysis tools often produce false positives (false bugs) and false negatives (missed bugs), which are highly undesirable to users. Users today, however, lack the means to guide such tools towards what they believe to be "interesting" analysis results, and away from "uninteresting" ones.

This section presents a new approach to user-guided program analysis. It shifts decisions about the kind and degree of approximations to apply in an analysis from the analysis writer to the analysis user. The user conveys such decisions in a natural fashion, by giving feedback about which analysis results she likes or dislikes, and re-running the analysis. Our approach is a radical departure from existing approaches, allowing users to control both the precision and scalability of the analysis. It offers a different, and potentially more useful, notion of precision: one from the standpoint of the analysis user instead of the analysis writer. It also allows the user to control scalability, as the user's feedback enables tailoring the analysis to the precision needs of the analysis user instead of catering to the broader precision objectives of the analysis writer.

Our approach and tool called EUGENE satisfies three useful goals: (i) expressiveness: it is applicable to a variety of analyses, (ii) automation: it does not require unreasonable effort by analysis writers or analysis users, and (iii) precision and scalability: it reports interesting analysis results from a user’s perspective and it handles real-world programs.

EUGENE achieves each of these goals as described next.

Expressiveness. An analysis in EUGENE is expressed as a Markov Logic Network, which is a set of logic inference rules with optional weights (Chapter 3.3.3). In the absence of weights, rules become “hard rules”, and the analysis reduces to a conventional one where a

solution that satisfies all hard rules is desired. Weighted rules, on the other hand, constitute “soft rules” that generalize a conventional analysis in two ways: they enable the analysis writer to express different degrees of approximation, and they enable the incorporation of feedback from the analysis user that may be at odds with the assumptions of the analysis writer. The desired solution of the resulting analysis is one that satisfies all hard rules and maximizes the weight of satisfied soft rules. Such a solution amounts to respecting all indisputable conditions of the analysis writer, while maximizing the precision preferences of the analysis user.

Automation. EUGENE takes as input analysis rules from the analysis writer, and automatically learns their weights using an offline learning algorithm (Chapter 3.3.4.2). EUGENE also requires analysis users to specify which analysis results they like or dislike, and automatically generalizes this feedback using an online inference algorithm (Chapter 3.3.4.1). The analysis rules (hard and soft) together with the feedback from the user (modeled as soft rules) form a probabilistic constraint system that EUGENE solves efficiently.

Precision and Scalability. EUGENE maintains precision by ensuring integrity and optimality in solving the rules without sacrificing scalability. Integrity (i.e., satisfying hard rules) amounts to respecting indisputable conditions of the analysis. Optimality (i.e., maximally satisfying soft rules) amounts to generalizing user feedback effectively. Together, these aspects ensure precision. Satisfying all hard rules and maximizing the weight of satisfied soft rules corresponds to the aforementioned MAP inference problem of Markov Logic Networks. EUGENE leverages our learning and inference engine, NICHROME, to solve Markov Logic Networks in a manner that is integral, optimal, and scalable.
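To make the MAP objective concrete, the following is a minimal brute-force sketch in Python (not NICHROME’s actual algorithm, which scales to far larger instances): it enumerates assignments, discards any that violate a hard clause, and keeps the one maximizing the total weight of satisfied soft clauses. The clause encoding and the tiny instance at the end are illustrative assumptions.

```python
from itertools import product

def map_inference(variables, hard, soft):
    """Brute-force MAP inference over ground clauses (illustration only).

    A clause is a list of (variable, polarity) literals; an assignment
    satisfies it if at least one literal matches. Returns an assignment
    that satisfies every hard clause and maximizes the total weight of
    satisfied soft clauses (None if the hard clauses are unsatisfiable).
    """
    def sat(clause, a):
        return any(a[v] == pol for v, pol in clause)

    best, best_w = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, values))
        if not all(sat(c, a) for c in hard):
            continue  # integrity: hard clauses are inviolable
        w = sum(wt for c, wt in soft if sat(c, a))
        if w > best_w:
            best, best_w = a, w
    return best

# Tiny hypothetical instance: hard rule p => r, soft evidence p (weight 5),
# and a soft preference against r (weight 2). Keeping p and r earns weight 5;
# dropping p forfeits more weight than violating the preference.
sol = map_inference(["p", "r"],
                    hard=[[("p", False), ("r", True)]],            # ¬p ∨ r
                    soft=[([("p", True)], 5), ([("r", False)], 2)])
```

Real engines replace this enumeration with weighted-MaxSAT-style techniques, but the objective they optimize is the same.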

We demonstrate the precision and scalability of EUGENE on two analyses, namely, datarace detection and monomorphic call site inference, applied to a suite of seven Java programs of size 131–198 KLOC. We also report on a user study involving nine users who employ EUGENE to guide an information-flow analysis on three Java micro-benchmarks. In these experiments, EUGENE significantly reduces misclassified reports upon providing limited amounts of feedback.

1  package org.apache.ftpserver;
2  public class RequestHandler {
3    Socket m_controlSocket;
4    FtpRequestImpl m_request;
5    FtpWriter m_writer;
6    BufferedReader m_reader;
7    boolean m_isConnectionClosed;
8    public FtpRequest getRequest() {
9      return m_request;
10   }
11   public void close() {
12     synchronized(this) {
13       if (m_isConnectionClosed)
14         return;
15       m_isConnectionClosed = true;
16     }
17     m_request.clear();
18     m_request = null;
19     m_writer.close();
20     m_writer = null;
21     m_reader.close();
22     m_reader = null;
23     m_controlSocket.close();
24     m_controlSocket = null;
25   }
26 }

Figure 3.18: Java code snippet of Apache FTP server.

In summary, our work makes the following contributions:

1. We present a new approach to user-guided program analysis that shifts decisions about approximations in an analysis from the analysis writer to the analysis users, allowing users to tailor its precision and cost to their needs.

2. We formulate our approach in terms of solving a combination of hard rules and soft rules, which enables leveraging off-the-shelf solvers for weight learning and infer- ence that scale without sacrificing integrity or optimality.

3. We show the effectiveness of our approach on diverse analyses applied to a suite of real-world programs. The approach significantly reduces the number of misclassified reports by using only a modest amount of user feedback.

3.3.2 Overview

Similar to the interactive verification example described in Chapter 3.2.2, we illustrate our approach using the example of applying the static race detection tool Chord [38] to a real- world multi-threaded Java program, Apache FTP server [37]. However, for elaboration,

Analysis Relations:
next(p1, p2)       (program point p1 is immediate successor of program point p2)
parallel(p1, p2)   (different threads may reach program points p1 and p2 in parallel)
alias(p1, p2)      (instructions at program points p1 and p2 may access the same memory location, and constitute a possible datarace)
unguarded(p1, p2)  (no common lock guards program points p1 and p2)
race(p1, p2)       (datarace may occur between different threads while executing the instructions at program points p1 and p2)

Analysis Rules:
parallel(p3, p2) :- parallel(p1, p2), next(p3, p1).                     (1)
parallel(p2, p1) :- parallel(p1, p2).                                   (2)
race(p1, p2)     :- parallel(p1, p2), alias(p1, p2), unguarded(p1, p2). (3)

Figure 3.19: Simplified race detection analysis.

the example in this subsection looks slightly different, as the approach we will present improves a program analysis in a different manner and is complementary to the previous approach.
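The rules of Figure 3.19 can be read operationally as a least-fixpoint computation. Below is a hedged Python sketch of that evaluation; the input facts at the bottom are invented solely for illustration (a real run would derive them from the analyzed program):

```python
def race_analysis(next_fact, alias, unguarded, parallel_seed):
    """Least-fixpoint evaluation of the simplified race rules of Figure 3.19:
      (1) parallel(p3, p2) :- parallel(p1, p2), next(p3, p1).
      (2) parallel(p2, p1) :- parallel(p1, p2).
      (3) race(p1, p2)     :- parallel(p1, p2), alias(p1, p2), unguarded(p1, p2).
    """
    parallel = set(parallel_seed)
    while True:
        new = set()
        for (p1, p2) in parallel:
            new.add((p2, p1))                      # rule (2): symmetry
            for (p3, q) in next_fact:
                if q == p1:
                    new.add((p3, p2))              # rule (1): step to the successor
        if new <= parallel:                        # fixpoint reached
            break
        parallel |= new
    race = {(p1, p2) for (p1, p2) in parallel
            if (p1, p2) in alias and (p1, p2) in unguarded}  # rule (3)
    return parallel, race

# Invented input facts covering lines 17-20 of the example (illustrative only):
nxt = {(18, 17), (19, 18), (20, 19)}
par, races = race_analysis(nxt, alias={(18, 17)}, unguarded={(18, 17)},
                           parallel_seed={(17, 17)})
```

Seeding parallel(17, 17) lets rules (1) and (2) propagate parallelism down the successor chain, and rule (3) then flags the aliased, unguarded pair as a race.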

Figure 3.18 shows a code snippet from the program. The RequestHandler class is used to handle client connections, and an object of this class is created for every incoming connection to the server. The close() method is used to clean up and close an open client connection, while the getRequest() method is used to access the m_request field. Both these methods can be invoked from various components of the program (not shown), and thus can be simultaneously executed by multiple threads in parallel on the same RequestHandler object. To ensure that this parallel execution does not result in any dataraces, the close() method uses a boolean flag m_isConnectionClosed. If this flag is set, all calls to close() return without any further updates. If the flag is not set, then it is first updated to true, followed by execution of the clean-up code (lines 17–24). To avoid dataraces on the flag itself, it is read and updated while holding a lock on the RequestHandler object (lines 12–16). All the subsequent code in close() is free from dataraces since only the first call to close() executes this section. However, note that an actual datarace still exists between the two accesses to field m_request on line 9 and line 18. We motivate our approach by contrasting the goals and capabilities of a writer of an

(a) Before feedback:

Detected Races
R1: Race on field org.apache.ftpserver.RequestHandler.m_request
    org.apache.ftpserver.RequestHandler: 9   org.apache.ftpserver.RequestHandler: 18
R2: Race on field org.apache.ftpserver.RequestHandler.m_request
    org.apache.ftpserver.RequestHandler: 17  org.apache.ftpserver.RequestHandler: 18
R3: Race on field org.apache.ftpserver.RequestHandler.m_writer
    org.apache.ftpserver.RequestHandler: 19  org.apache.ftpserver.RequestHandler: 20
R4: Race on field org.apache.ftpserver.RequestHandler.m_reader
    org.apache.ftpserver.RequestHandler: 21  org.apache.ftpserver.RequestHandler: 22
R5: Race on field org.apache.ftpserver.RequestHandler.m_controlSocket
    org.apache.ftpserver.RequestHandler: 23  org.apache.ftpserver.RequestHandler: 24
Eliminated Races
E1: Race on field org.apache.ftpserver.RequestHandler.m_isConnectionClosed
    org.apache.ftpserver.RequestHandler: 13  org.apache.ftpserver.RequestHandler: 15

(b) After feedback:

Detected Races
R1: Race on field org.apache.ftpserver.RequestHandler.m_request
    org.apache.ftpserver.RequestHandler: 9   org.apache.ftpserver.RequestHandler: 18
Eliminated Races
E1: Race on field org.apache.ftpserver.RequestHandler.m_isConnectionClosed
    org.apache.ftpserver.RequestHandler: 13  org.apache.ftpserver.RequestHandler: 15
E2: Race on field org.apache.ftpserver.RequestHandler.m_request
    org.apache.ftpserver.RequestHandler: 17  org.apache.ftpserver.RequestHandler: 18
E3: Race on field org.apache.ftpserver.RequestHandler.m_writer
    org.apache.ftpserver.RequestHandler: 19  org.apache.ftpserver.RequestHandler: 20
E4: Race on field org.apache.ftpserver.RequestHandler.m_reader
    org.apache.ftpserver.RequestHandler: 21  org.apache.ftpserver.RequestHandler: 22
E5: Race on field org.apache.ftpserver.RequestHandler.m_controlSocket
    org.apache.ftpserver.RequestHandler: 23  org.apache.ftpserver.RequestHandler: 24

Figure 3.20: Race reports produced for Apache FTP server. Each report specifies the field involved in the race, and line numbers of the program points with the racing accesses. The user feedback is to “dislike” report R2.

analysis, such as the race detection analysis in Chord, with those of a user of the analysis, such as a developer of the Apache FTP server.

The role of the analysis writer. The designer or writer of a static analysis tool, say Alice, strives to develop an analysis that is precise yet scales to real-world programs, and is widely applicable to a large variety of programs. In the case of Chord, this translates into a race detection analysis that is context-sensitive but path-insensitive. This is a common design choice for balancing precision and scalability of static analyses. The analysis in Chord is expressed using Datalog, a declarative logic programming language, and Figure 3.19 shows a simplified subset of the logical inference rules used by Chord. The actual analysis implementation uses a larger set of more elaborate rules, but the rules shown here suffice for the discussion. These rules are used to produce output relations from input relations, where the input relations express known program facts and the output relations express the analysis outcome. The rules express the idioms that the analysis writer Alice deems to be the most important for capturing dataraces in Java programs. For example, Rule (1) in Figure 3.19 conveys that if a pair of program points (p1, p2) can execute in parallel, and if program point p3 is an immediate successor of p1, then (p3, p2) are also likely to happen

in parallel. Rule (2) conveys that the parallel relation is symmetric. Via Rule (3), Alice expresses the idiom that only program points not guarded by a common lock can be potentially racing. In particular, if program points (p1, p2) can happen in parallel, can access the same memory location, and are not guarded by any common lock, then there is a potential datarace between p1 and p2.

The role of the analysis user. The user of a static analysis tool, say Bob, ideally wants the tool to produce exact (i.e., sound and complete) results on his program. This allows him to spend his time on fixing the bugs in the program instead of classifying the reports generated by the tool as spurious or real. In our example, suppose that Bob runs Chord on the Apache FTP server program in Figure 3.18. Based on the rules in Figure 3.19, Chord produces the list of datarace reports shown in Figure 3.20(a). Reports R1–R5 are identified as potential dataraces in the program, whereas for report E1, Chord detects

that the accesses to m_isConnectionClosed on lines 13 and 15 are guarded by a common lock, and therefore do not constitute a datarace. Typically, the analysis user Bob is well-acquainted with the program being analyzed, but not with the details of the underlying analysis itself. In this case, given his familiarity with the program, it is relatively easy for Bob to conclude that the code from lines 17–24 in the body of the close() method can never be executed by multiple threads in parallel, and thus reports R2–R5 are spurious.

The mismatch between analysis writers and users. The design decisions of the analysis writer Alice have a direct impact on the precision and scalability of the analysis. The datarace analysis in Chord is imprecise for various theoretical and usability reasons. First, the analysis must scale to large programs. For this reason, it is designed to be path-insensitive and over-approximates the possible thread interleavings. To eliminate spurious reports R2–R5, the analysis would need to only consider feasible thread interleavings by accurately tracking control-flow dependencies across threads. However, such precise analyses do not scale to programs of the size of Apache FTP server, which comprises 130 KLOC.

Second, scalability concerns aside, relations such as alias are necessarily inexact, as the corresponding property is undecidable for Java programs. Chord over-approximates this property by using a context-sensitive but flow-insensitive pointer analysis, resulting in spurious pairs (p1, p2) in this relation, which in turn are reported as spurious dataraces.

Third, the analysis writer may lack sufficient information to design a precise analysis, because the program behaviors that the analysis intends to check may be vague or ambiguous. For example, in the case of datarace analysis, real dataraces can be benign in that they do not affect the program’s correctness [68]. Classifying such reports typically requires knowledge about the program being analyzed.

Fourth, the program specification can be incomplete. For instance, the race in report R1 above could be harmful but impossible to trigger due to timing reasons extrinsic to Apache FTP server, such as the hardware environment.

In short, while the analysis writer Alice can influence the design of the analysis, she cannot foresee every usage scenario or program-specific tweak that might improve the analysis. Conversely, the analysis user Bob is acquainted with the program under analysis, and can classify the analysis reports as spurious or real. But he lacks the tools or expertise to suppress the spurious bugs by modifying the underlying analysis based on his intuition and program knowledge.

Closing the gap between analysis writers and users. Our user-guided approach aims to empower the analysis user Bob to adjust the underlying analysis as per his demands without involving the analysis writer Alice. Our system, EUGENE, achieves this by automatically incorporating user feedback into the analysis. The user provides feedback in a natural fashion, by “liking” or “disliking” a subset of the analysis reports, and re-running the analysis.
For example, when presented with the datarace reports in Figure 3.20(a), Bob might start inspecting from the first report. This report is valid and Bob might choose to ei- ther like or ignore this report. Liking a report conveys that Bob accepts the reported bug as a real one and would like the analysis to generate more similar reports, thereby reinforcing

the behavior of the underlying analysis that led to the generation of this report. However, suppose that Bob ignores the first report, but indicates that he dislikes the second report by clicking on the corresponding icon. Re-running Chord after providing this feedback produces the reports shown in Figure 3.20(b). While the true report R1 is generated in this run as well, all the remaining spurious reports are eliminated. This highlights a key strength of our approach: EUGENE not only incorporates user feedback, but also generalizes the feedback to other similar results of the analysis. Reports R2–R5 are correlated and are spurious for the same root cause: the code from lines 17–24 in the body of the close() method can never be executed by multiple threads in parallel. Bob’s feedback on report R2 conveys to the underlying analysis that lines 17 and 18 cannot execute in parallel. EUGENE is able to generalize this feedback automatically and conclude that none of the lines from 17–24 can execute in parallel.

In the following subsections, we describe the underlying details of EUGENE that allow it to incorporate user feedback and generalize it automatically to other reports.

3.3.3 Analysis Specification

EUGENE uses a constraint-based approach wherein analyses are written in a declarative constraint language. Constraint languages have been widely adopted to specify a broad range of analyses. The declarative nature of such languages allows analysis writers to focus on the high-level analysis logic while delegating low-level implementation details to off-the-shelf constraint solvers. In particular, Datalog, a logic programming language, is widely used in such approaches. Datalog has been shown to suffice for expressing a variety of analyses, including pointer and call-graph analyses [15, 16, 18, 7], concurrency analyses [43, 69], security analyses [70, 71], and reflection analysis [72]. Existing constraint-based approaches allow specifying only hard rules, where an acceptable solution is one that satisfies all the rules. However, this is insufficient for incorporating feedback from the analysis user that may be at odds with the assumptions of the analysis

Input tuples:
next(18, 17)    alias(18, 17)    guarded(13, 13)
next(19, 18)    alias(20, 19)    guarded(15, 15)
next(20, 19)    alias(22, 21)    guarded(15, 13)
...

Ground formula:
w1 : (¬parallel(17, 17) ∨ ¬next(18, 17) ∨ parallel(18, 17)) ∧
     (¬parallel(18, 17) ∨ parallel(17, 18)) ∧
w1 : (¬parallel(17, 18) ∨ ¬next(18, 17) ∨ parallel(18, 18)) ∧
w1 : (¬parallel(18, 18) ∨ ¬next(19, 18) ∨ parallel(19, 18)) ∧
     (¬parallel(19, 18) ∨ parallel(18, 19)) ∧
w1 : (¬parallel(18, 19) ∨ ¬next(19, 18) ∨ parallel(19, 19)) ∧
w1 : (¬parallel(19, 19) ∨ ¬next(20, 19) ∨ parallel(20, 19)) ∧
     (¬parallel(18, 17) ∨ ¬alias(18, 17) ∨ ¬unguarded(18, 17) ∨ race(18, 17)) ∧
     (¬parallel(20, 19) ∨ ¬alias(20, 19) ∨ ¬unguarded(20, 19) ∨ race(20, 19)) ∧
w2 : ¬race(18, 17)   [boxed feedback clause]
∧ ...

Output tuples (before feedback):
parallel(18, 9)    parallel(20, 19)    race(18, 9)    race(20, 19)
parallel(18, 17)   parallel(22, 21)    race(18, 17)   race(22, 21)    ...

Output tuples (after feedback):
parallel(18, 9)    race(18, 9)    ...

Figure 3.21: Probabilistic analysis example.

writer. To enable the flexibility of having conflicting constraints, it is necessary to allow soft rules that an acceptable solution can violate. Our user-guided approach is based on Markov Logic Networks, which extend Datalog rules with weights. We refer to analyses specified in this extended language as probabilistic analyses. The semantics of these analyses is encoded as MAP inference problems of the corresponding Markov Logic Networks.

Example. Recalling the syntax and semantics of Markov Logic Networks (Chapter 2.3),

we now describe how EUGENE works on the race detection example from Chapter 3.3.2. Figure 3.21 shows a subset of the input and output facts as well as a snippet of the ground formula constructed for the example. The input tuples are derived from the analyzed pro- gram (Apache FTP server) and comprise the next, alias, and unguarded relations. In all these relations, the domain of program points is represented by the corresponding line number in the code. Note that all the analysis rules expressed in Figure 3.19 are hard rules

since existing tools like Chord do not accept soft rules. However, we assume that when this

analysis is fed to EUGENE, rule (1) is specified to be soft by analysis writer Alice, which captures the fact that the parallel relation is imprecise. EUGENE automatically learns the weight of this rule to be w1 by applying the learning engine of NICHROME to training data (see Chapter 3.3.4.2 for details).

To understand how EUGENE generalizes user feedback, we inspect the ground formula generated from the probabilistic datarace analysis. Given these input tuples and rules, the ground formula is generated by grounding the analysis rules, and a snippet of the constructed ground formula is shown in Figure 3.21. Recall the definition of the MAP inference problem: it is to find a set of output tuples that maximizes the sum of the weights of satisfied ground soft rules while satisfying all ground hard rules. Ignoring the clause enclosed in the box, solving this ground formula yields the output tuples, a subset of which is shown under “Output tuples (before feedback)” in Figure 3.21. The output includes multiple spurious races like race(18, 17), race(20, 19), and race(22, 21). As described in Chapter 3.3.2, when analysis user Bob provides feedback that race(18, 17)

is spurious, EUGENE suppresses all spurious races while retaining the real race race(18, 9).

EUGENE achieves this by incorporating the user feedback itself as a soft rule, represented by the boxed clause ¬race(18, 17) in Figure 3.21. The weight for such user feedback is

also learned during the training phase. Assuming the weight w2 of the feedback clause is

higher than the weight w1 of rule (1)—a reasonable choice that emphasizes Bob’s preferences over Alice’s assumptions—the MAP problem semantics ensures that the solver prefers violating rule (1) over violating the feedback clause. When the ground formula (with the boxed clause) in Figure 3.21 is then solved, the output solution violates the clause

w1 :(¬parallel(17, 17) ∨ ¬next(18, 17) ∨ parallel(18, 17)) and does not produce tuples parallel(18, 17) and race(18, 17) in the output. Further, all the tuples that are dependent on parallel(18, 17) are not produced either.2 This implies that tuples like parallel(19, 18),

2 This is due to implicit soft rules that negate each output relation, such as w0 : ¬parallel(p1, p2) where w0 < w1, in order to obtain the least solution.


Figure 3.22: Workflow of the EUGENE system for user-guided program analysis.

parallel(20, 19), parallel(22, 21) are not produced, and therefore race(20, 19) and race(22, 21)

are also suppressed. Thus, EUGENE is able to generalize user feedback. The degree of gen- eralization depends on the quality of the weights assigned or learned for the soft rules.

3.3.4 The EUGENE System

This section describes our system EUGENE for user-guided program analysis. Its workflow, shown in Figure 3.22, comprises an online inference stage and an offline learning stage.

In the online stage, EUGENE takes the probabilistic analysis specification together with a program P that an analysis user Bob wishes to analyze. The inference engine of

NICHROME (Chapter 4) uses these inputs to produce the analysis output Q. Further, the online stage allows Bob to provide feedback on the produced output Q. In particular,

Bob can indicate the output queries he likes (QL) or dislikes (QD), and invoke the inference

engine with QL and QD as additional inputs. The inference engine incorporates Bob’s feedback as additional soft rules in the probabilistic analysis specification used for producing the new result Q. This interaction continues until Bob is satisfied with the analysis output.

The accuracy of the produced results in the online stage is sensitive to the weights assigned to the soft rules. Manually assigning weights is not only inefficient, but in most cases

it is also infeasible since weight assignment requires analyzing data. Therefore, EUGENE provides an offline stage that automatically learns the weights of the soft rules in the probabilistic analysis specification. In the offline stage, EUGENE takes a logical analysis specification from analysis writer Alice and training data in the form of a set of input programs

Algorithm 4 Inference: Online component of EUGENE.

1: PARAM(wl, wd): Weights of liked and disliked queries.
2: INPUT: C = Ch ∪ Cs: Probabilistic analysis, where Ch are the hard rules and Cs are the soft rules.
3: INPUT: P: Program to analyze.
4: OUTPUT: Q: Final output of user-guided analysis.
5: QL := ∅; QD := ∅
6: C′h := Ch ∪ P
7: repeat
8:     C′s := Cs ∪ {(t, wl) | t ∈ QL} ∪ {(¬t, wd) | t ∈ QD}
9:     Q := MAP(C′h ∪ C′s)
10:    QL := PositiveUserFeedback(Q)
11:    QD := NegativeUserFeedback(Q)
12: until QL ∪ QD = ∅

and desired analysis output on these programs. These inputs are fed to the learning engine of NICHROME (Chapter 4). The logical analysis specification includes hard rules as well as rules marked as soft whose weights need to be learnt. The learning engine infers these weights to produce the probabilistic analysis specification. The learning engine ensures that the learnt weights maximize the likelihood of the training data with respect to the probabilistic analysis specification.

3.3.4.1 Online Component of EUGENE: Inference

Algorithm 4 describes the online component Inference of EUGENE, which leverages the

inference engine of NICHROME. Inference takes as input a probabilistic analysis C (with learnt weights) and the program P to be analyzed. First, the algorithm augments the hard and soft rules Ch, Cs of the analysis with the inputs P, QL, and QD to obtain an extended set of rules C′h ∪ C′s (lines 6 and 8). Notice that the user

feedback QL (liked queries) and QD (disliked queries) are incorporated as soft rules in the

extended rule set. Each liked query feedback is assigned the fixed weight wl, while each

disliked query feedback is assigned weight wd (line 8). Weights wl and wd are learnt in the offline stage and fed as parameters to Algorithm 4. Instead of using fixed weights for the user feedback, two other options are: (a) treating user feedback as hard rules, and

Algorithm 5 Learning: Offline component of EUGENE.
1: INPUT: C: Initial probabilistic analysis.
2: INPUT: Q: Desired analysis output.
3: OUTPUT: C′: Probabilistic analysis with learnt weights.
4: C′ = learn(C, Q)

(b) allowing a different weight for each query feedback. Option (a) does not account for users being wrong, leaving no room for the inference engine to ignore the feedback if necessary. Option (b) is too fine-grained, requiring a separate weight to be learnt for each query. We therefore take a middle ground between these two extremes.

Next, in line 9, the algorithm invokes the inference engine of NICHROME with the extended set of rules. Note that EUGENE treats the solver as a black box, and any suitable solver suffices. The solver produces a solution Q that satisfies all the hard rules in the extended set, while maximizing the weight of satisfied soft rules. The solution Q is then presented to Bob, who can give his feedback by liking or disliking the queries (lines 10–11).

The sets of liked and disliked queries, QL and QD, are used to further augment the soft rules

Cs of the analysis. This loop (lines 7–12) continues until no further feedback is provided by Bob.
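The loop of Algorithm 4 can be sketched as follows; `map_solver` and `get_feedback` are assumed hooks standing in for the NICHROME inference engine and the user, and the clause encoding is an illustrative simplification:

```python
def eugene_inference(hard, soft, program, get_feedback, map_solver, wl, wd):
    """Sketch of Algorithm 4: solve MAP, collect feedback, fold it back in
    as soft clauses, and repeat until the user has no more feedback."""
    hard = hard | program                 # line 6: program facts join the hard rules
    liked, disliked = set(), set()
    while True:
        soft_ext = (soft
                    | {((t, True), wl) for t in liked}       # line 8: liked => t
                    | {((t, False), wd) for t in disliked})  # line 8: disliked => ¬t
        q = map_solver(hard, soft_ext)    # line 9: integrity + optimality
        liked, disliked = get_feedback(q) # lines 10-11
        if not liked and not disliked:    # line 12: fixpoint of the interaction
            return q

# Toy run with stub hooks (both hypothetical): this "solver" merely drops any
# report that feedback marked as disliked.
def stub_solver(hard, soft):
    suppressed = {atom for (atom, pol), w in soft if pol is False}
    return {"r1", "r2"} - suppressed

rounds = iter([(set(), {"r2"}), (set(), set())])  # dislike r2, then stop
out = eugene_inference(set(), set(), set(), lambda q: next(rounds),
                       stub_solver, wl=1.0, wd=2.0)
```

A real solver would, of course, weigh the feedback clauses against the analysis rules rather than obey them unconditionally.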

3.3.4.2 Offline Component of EUGENE: Learning

Algorithm 5 describes the offline component Learning of EUGENE, which leverages the learning engine of NICHROME. Learning takes as input a probabilistic analysis C with arbitrary weights, a set of programs P, and the desired analysis output Q ⊆ T, and outputs a probabilistic analysis C′ with learnt weights. Without loss of generality, we assume that P is encoded as a set of hard rules and is part of C. Learning computes the output analysis C′ by invoking the learning engine of NICHROME (denoted learn) with the input analysis C and the desired analysis output Q.
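While NICHROME’s learning engine is described in Chapter 4, the general shape of such weight learning can be illustrated with a perceptron-style update (one standard method for Markov Logic Networks, not necessarily the one NICHROME uses): nudge each soft rule’s weight by the difference between how often the desired output satisfies it and how often the current MAP prediction does. Everything below, including the one-variable toy at the end, is an illustrative assumption:

```python
def learn_weights(clauses, init_w, desired, map_solver, lr=0.125, iters=20):
    """Perceptron-style weight learning sketch: after each MAP prediction,
    move weights so the desired output scores at least as well as it."""
    w = list(init_w)
    for _ in range(iters):
        pred = map_solver(clauses, w)
        for i, clause in enumerate(clauses):
            # clause(a) is 1.0 if assignment a satisfies the clause, else 0.0
            w[i] += lr * (clause(desired) - clause(pred))
    return w

def brute_map(clauses, w):
    """Pick the assignment (over one variable x) maximizing weighted satisfaction."""
    best, best_w = None, float("-inf")
    for x in (False, True):
        a = {"x": x}
        s = sum(wi * c(a) for c, wi in zip(clauses, w))
        if s > best_w:
            best, best_w = a, s
    return best

# One soft clause asserting x; the training data says x should be False, so
# the learnt weight decays until the MAP prediction agrees with the data.
clauses = [lambda a: 1.0 if a["x"] else 0.0]
w = learn_weights(clauses, [1.0], {"x": False}, brute_map)
```

In the full setting the assignments range over all ground tuples, and the training data is the oracle output on the sample programs.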

3.3.5 Empirical Evaluation

We implemented EUGENE atop Chord [38], an extensible program analysis framework for Java bytecode that supports writing analyses in Datalog. In our evaluation, we investigate the following research questions:

• RQ1: Does using EUGENE improve analysis precision for practical analyses applied to real-world programs? How much feedback is needed for the same, and how does the amount of provided feedback affect the precision?

• RQ2: Does EUGENE scale to large programs? Does the amount of feedback influ- ence the scalability?

• RQ3: How feasible is it for users to inspect analysis output and provide useful feedback to EUGENE?

3.3.5.1 Evaluation Setup

We performed two different studies with EUGENE: a control study and a user study.

First, to evaluate the precision and scalability of EUGENE, we performed a control study using two realistic analyses expressed in Datalog, applied to seven Java benchmark programs. The goal of this study is to thoroughly investigate the performance of EUGENE in realistic scenarios and with varying amounts of feedback. To practically enable the evaluation of EUGENE over a large number of data points in the (benchmark, analysis, #feedback) space, this study uses a more precise analysis, instead of a human user, as the oracle for generating the feedback to be provided. This study helps us evaluate RQ1 and RQ2.

Second, to evaluate the practical usability of EUGENE when human analysis users are

in the loop, we conducted a user study with nine users who employed EUGENE to guide an information-flow analysis on three benchmark programs. In contrast to the first study, the human users provide the feedback in this case. This study helps us evaluate RQ3.

Table 3.10: Statistics of our probabilistic analyses.

           rules   input relations   output relations
datarace    30           18                 18
polysite    76           50                 42
infoflow    76           52                 42

All experiments were run using Oracle HotSpot JVM 1.6.0 on a Linux server with 64GB RAM and 3.0GHz processors.

Clients. Our two analyses in the first study (Table 3.10) are datarace detection (datarace) and monomorphic call site inference (polysite), while we use an information-flow analysis (infoflow) for the user study. Each of these analyses is sound, and composed of other analyses written in Datalog. For example, datarace includes a thread-escape analysis and a may-happen-in-parallel analysis, while polysite and infoflow include a pointer analysis and a call-graph analysis. The pointer analysis used here is a flow/context-insensitive, field-sensitive, Andersen-style analysis using allocation site heap abstraction [73]. The datarace analysis is from [43], while the polysite analysis has been used in previous works [8, 74, 75] to evaluate pointer analyses. The infoflow analysis only tracks explicit information flow, similar to the analysis described in [76]. For scalability reasons, all these analyses are context-, object-, and flow-insensitive, which is the main source of the false positives they report.

Benchmarks. The benchmarks for the first study (upper seven rows of Table 3.11) are

131–198 KLOC in size, and include programs from the DaCapo suite [77] (antlr, avrora, luindex, lusearch) and from past works that address our two analysis problems. The benchmarks for the user study (bottom three rows of Table 3.11) are 0.6–4.2 KLOC in size, and are drawn from Securibench Micro [78], a micro-benchmark suite designed to exercise different parts of a static information-flow analyzer.

Methodology. We describe the methodology for the offline (learning) and online (inference) stages of EUGENE.

Offline stage. We first converted the above three logical analyses into probabilistic

analyses using the offline training stage of EUGENE. To avoid selection bias, we used a set of small benchmarks for training instead of those in Table 3.11. Specifically, we used

elevator and tsp (100 KLOC each) from [79]. While the training benchmarks are smaller and fewer than the testing benchmarks, they are sizable, realistic, and disjoint from those in the evaluation, demonstrating the practicality of our training component. Besides the

sample programs, the training component of EUGENE also requires the expected output of the analyses on these sample programs. Since the main source of false positives in our analyses is the lack of context- and object-sensitivity, we used context- and object-sensitive versions of these analyses as oracles for generating the expected output. Specifically, we used k-object-sensitive versions [27] with cloning depth k = 4. Note that these oracle analyses used for generating the training data comprise their own approximations (for example, flow-insensitivity), and thus do not produce the absolute ground truth. Using better training

data would only imply that the weights learnt by EUGENE are more reflective of the ground truth, leading to more precise results. Online stage. We describe the methodology for the online stage separately for the control study and the user study.

Control study methodology. To perform the control study, we started by running the inference stage of EUGENE on our probabilistic analyses (datarace and polysite) with no feedback to generate the initial set of reports for each benchmark. Next, we simulated the process of providing feedback by: (i) randomly choosing a subset of the initial set of reports, (ii) classifying each of the reports in the chosen subset as spurious or real, and (iii) re-running the inference stage of EUGENE on the probabilistic analyses with the labeled reports in the chosen subset as feedback. To classify the reports as spurious or real, we used the results of the k-object-sensitive versions of our client analyses as ground truth. In other words, if a report in the chosen subset is also generated by the precise version of the analysis, it is classified as a real report; otherwise it is labeled as a spurious report. For each (benchmark, analysis) pair, we generated random subsets that contain 5%, 10%, 15%, and 20% of the initial reports. This allows us to study the effect of varying amounts of feedback on EUGENE’s performance. Moreover, EUGENE can be sensitive to not just the amount of feedback, but also to the actual reports chosen for feedback. To mitigate this effect, for a given (benchmark, analysis) pair and a given feedback subset size, we ran EUGENE thrice using different random subsets of the given size in each run. Randomly choosing feedback ensures that we conservatively estimate the performance of EUGENE. Finally, we evaluated the quality of the inference by comparing the output of EUGENE with the output generated by the k-object-sensitive versions of our analyses with k = 4.

Table 3.11: Benchmark statistics. Columns “total” and “app” are with and without JDK library code.

            # classes      # methods      bytecode (KB)   source (KLOC)
            app   total    app    total   app    total    app    total
antlr       111   350      1,150  2,370   128    186      29     131
avrora      1,158 1,544    4,234  6,247   222    325      64     193
ftp         93    414      471    2,206   29     118      13     130
hedc        44    353      230    2,134   16     140      6      153
luindex     206   619      1,390  3,732   102    235      39     190
lusearch    219   640      1,399  3,923   94     250      40     198
weblech     11    576      78     3,326   6      208      12     194
secbench1   4     5        10     13      0.3    0.3      0.08   0.6
secbench2   3     4        9      12      0.2    0.2      0.07   0.6
secbench3   2     17       4      46      0.3    1.25     0.06   4.2
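The feedback-simulation step above can be sketched as follows; the report identifiers, the `oracle` set, and the helper name `simulate_feedback` are illustrative, not part of EUGENE:

```python
import random

def simulate_feedback(reports, oracle_reports, fraction, seed=0):
    """Randomly pick `fraction` of the initial reports and label each one
    by consulting the oracle (precise analysis) results."""
    rng = random.Random(seed)
    k = max(1, round(fraction * len(reports)))
    chosen = rng.sample(sorted(reports), k)
    # A report also produced by the precise analysis is "real";
    # otherwise it is "spurious".
    return {r: ("real" if r in oracle_reports else "spurious")
            for r in chosen}

# Toy example: 6 initial reports, 3 of which the oracle confirms.
initial = {"r1", "r2", "r3", "r4", "r5", "r6"}
oracle = {"r1", "r3", "r5"}
feedback = simulate_feedback(initial, oracle, fraction=0.5)
print(feedback)
```

Re-running the probabilistic analysis with such a labeled subset as feedback, and comparing its output against the oracle’s, then completes one control-study configuration.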

User study methodology. For the user study, we engaged nine users, all graduate students in computer science, to run EUGENE on the infoflow analysis. Each user was assigned two benchmarks from the set {secbench1, secbench2, secbench3}, such that each of these benchmarks was assigned to six users in total. The users interacted with EUGENE by first running it without any feedback so as to produce the initial set of reports. The users then analyzed these reports, and were asked to provide any eight reports

with their corresponding label (spurious or real) as feedback. Also, for each benchmark, we recorded the time spent by each user in analyzing the reports and generating the feedback. Next, EUGENE was run with the provided feedback, and the produced output was compared with manually generated ground truth for each of the benchmarks.

Figure 3.23: Results of EUGENE on datarace analysis. (Figure omitted: per-benchmark bar chart of the % of false reports eliminated and % of true reports retained under increasing feedback. Baseline false reports: antlr 0, avrora 338, ftp 324, hedc 111, luindex 1824, lusearch 79, weblech 0; baseline true reports: antlr 0, avrora 700, ftp 119, hedc 153, luindex 2597, lusearch 183, weblech 10.)

We next describe the results of evaluating EUGENE’s precision (RQ1), scalability (RQ2), and usability (RQ3).

3.3.5.2 Evaluation Results

Precision of EUGENE. The analysis results of our control study under varying amounts of feedback are shown in Figures 3.23 and 3.24. In these figures, “baseline false reports” and “baseline true reports” are the numbers of false and true reports produced when EUGENE is run without any feedback. The light colored bars above and below the x-axis indicate the % of false reports eliminated and the % of true reports retained, respectively, when the % of feedback indicated by the corresponding dark colored bars is provided. For each benchmark, the feedback percentages increase from left to right, i.e., 5% to 20%. Ideally, we want all the false reports to be eliminated and all the true reports to be retained, which would be indicated by the light colored bars extending to 100% on both sides.

Figure 3.24: Results of EUGENE on polysite analysis. (Figure omitted: bar chart in the same format as Figure 3.23. Baseline false reports: antlr 5, avrora 75, ftp 7, hedc 6, luindex 67, lusearch 29, weblech 2; baseline true reports: antlr 138, avrora 119, ftp 64, hedc 41, luindex 71, lusearch 293, weblech 29.)

Even without any feedback, our probabilistic analyses are already fairly precise and sophisticated, and eliminate all but the non-trivial false reports. Despite this, EUGENE helps eliminate a significant number of such hard-to-refute reports. On average, 70% of the false reports are eliminated across all our experiments with 20% feedback. Equally importantly, on average 98% of the true reports are retained when 20% feedback is provided. Also, note that with 5% feedback the percentage of false reports eliminated falls to 44% on average, while that of true reports retained is 94%. A finer-grained look at the results for individual benchmarks and analyses reveals that in many cases, increasing feedback only leads to modest gains.

We next discuss the precision of EUGENE for each of our probabilistic analyses. For the datarace analysis, with 20% feedback, an average of 89% of the false reports are eliminated while an average of 98% of the true reports are retained. Further, with 5% feedback the averages are 66% for false reports eliminated and 97% for true reports retained. Although the precision of EUGENE increases with more feedback in this case, the gains are relatively modest. Note that given the large number of initial reports generated for luindex and avrora (4421 and 1038 respectively), it is somewhat impractical to expect analysis users to provide up to 20% feedback. Consequently, we re-run EUGENE for these benchmarks

with 0.5%, 1%, 1.5%, 2%, and 2.5% feedback. The results are shown in Figure 3.25.

Figure 3.25: Results of EUGENE on datarace analysis with feedback (0.5%, 1%, 1.5%, 2%, 2.5%). (Figure omitted: bar chart in the same format as Figure 3.23, for avrora and luindex only; baseline false reports are 338 and 1824, and baseline true reports are 700 and 2597, respectively.)

Interestingly, we observe that for luindex, with only 2% feedback on the false reports and 1.9% feedback on the true reports, EUGENE eliminates 62% of false reports and retains 89% of the true reports. Similarly, for avrora, with only 2.3% feedback on the false reports and 1.8% feedback on the true reports, EUGENE eliminates 76% of false reports and retains 96% of the true reports. These numbers indicate that, for the datarace client, EUGENE is able to generalize even with a very limited amount of user feedback.

For polysite, with 20% feedback, an average of 57% of the false reports are eliminated and 99% of the true reports are retained, while with 5% feedback, 29% of the false reports are eliminated and 92% of the true reports are retained. There are two important things to notice here. First, the number of eliminated false reports does not always grow monotonically with more feedback. The reason is that EUGENE is sensitive to the reports chosen for feedback, but in each run, we randomly choose the reports to provide feedback on. Though the precision numbers here are averaged over three runs for a given feedback amount, the randomness in choosing feedback still seeps into our results. Second, EUGENE tends to do a better job at generalizing the feedback for the larger benchmarks compared to the smaller

Figure 3.26: Running time of EUGENE. (Figure omitted: two bar charts showing the inference running time in minutes for the datarace and polysite analyses on each benchmark, under 5%, 10%, 15%, and 20% feedback.)

ones. We suspect the primary reason for this is that smaller benchmarks tend to have a higher percentage of bugs with unique root causes, and thereby a smaller number of bugs are attributable to each unique cause. Consequently, the scope for generalizing the user feedback is reduced.

Answer to RQ1: EUGENE significantly reduces false reports with only modest feedback, while retaining the vast majority of true reports. Though increasing feedback leads to more precise results in general, in many cases the gain in precision due to additional feedback is modest.

Scalability of EUGENE. The performance of EUGENE for our control study, in terms of the inference engine running time, is shown in Figure 3.26. For each (benchmark, analysis, #feedback) configuration, the running time shown is an average over the three runs of the corresponding configuration. We observe two major trends from this figure. First, as expected, the running time depends on the size of the benchmark and the complexity of the analysis. For both the analyses in the control study, EUGENE takes the longest time for avrora, our largest benchmark. Also, for each of our benchmarks, the datarace analysis, with fewer rules, needs less time. Recollect that EUGENE uses an off-the-shelf solver for solving the constraints of the probabilistic analysis, and thus the performance of the inference engine largely depends on the performance of the underlying solver. The running time of all such solvers depends on the number of ground clauses that are generated, and this number in turn depends on the size of the input program and the complexity of the analysis. Second, the amount of feedback does not significantly affect the running time. Incorporating the feedback only requires adding the liked/disliked queries as soft rules, and thus does not significantly alter the underlying set of constraints.
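The feedback-incorporation step just described can be illustrated with a small sketch; the clause representation, the weight value, and all names here are assumptions for illustration, not EUGENE’s actual encoding:

```python
# Each liked/disliked report becomes a soft unit clause; the original
# constraint set is otherwise untouched, so the ground problem barely grows.
FEEDBACK_WEIGHT = 100.0  # illustrative confidence placed on user labels

def add_feedback(soft_clauses, feedback):
    """soft_clauses: list of (clause, weight) pairs, where a clause is a
    frozenset of (atom, polarity) literals.
    feedback: dict mapping a report atom to True (real) or False (spurious)."""
    extended = list(soft_clauses)
    for atom, is_real in feedback.items():
        # "Liked" report: prefer it derived; "disliked": prefer it refuted.
        extended.append((frozenset([(atom, is_real)]), FEEDBACK_WEIGHT))
    return extended

# Hypothetical report clause from the analysis, plus two labeled reports.
analysis = [(frozenset([("race_r7", True)]), 5.0)]
augmented = add_feedback(analysis, {"race_r7": False, "race_r9": True})
print(len(augmented))  # 3 (original clause plus two feedback clauses)
```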

Finally, the fact that EUGENE spends up to 120 minutes (polysite analysis on avrora with 15% feedback) might seem disconcerting. But note that this is time spent by the system, rather than the user, in computing the new results after incorporating the user feedback. Since EUGENE uses the underlying solver as a black box, any improvement in solver technology directly translates into improved performance of EUGENE. Given the variety of solvers that already exist [80, 81, 82, 83, 84, 85], and the ongoing active research in this area, we expect the running times to improve further.

Answer to RQ2: EUGENE effectively scales to large programs of up to a few hundred KLOC, and its scalability will only improve with advances in the underlying solver technology. Additionally, the amount of feedback has no significant effect on the scalability of EUGENE.

Usability of EUGENE. Next, we evaluate the results of our user study conducted using EUGENE. The usage model for EUGENE assumes that analysis users are familiar with the kind of reports produced by the analysis as well as with the program under analysis. To ensure familiarity with the reports produced by the infoflow analysis, we informed all our users about the expected outcomes of a precise infoflow analysis in general. However, familiarity with the program under analysis is harder to achieve and typically requires the user to have spent time developing or fixing the program. To address this issue, we chose relatively small benchmarks in our study that users can understand without too much effort or expertise. The users in this study were not informed about the internal workings of either

Figure 3.27: Time spent by each user in inspecting reports of the infoflow analysis and providing feedback. (Figure omitted: bar chart of per-user inspection time in minutes for secbench1, secbench2, and secbench3.)

EUGENE or the infoflow analysis. The two main questions that we evaluate here are: (i) the ease with which users are able to analyze the reported results and provide feedback, and (ii) the quality of the user feedback. To answer the first question, we record the time spent by each user in analyzing the infoflow reports and providing the feedback for each benchmark. Recall that we ask each user to provide eight reports as feedback, labeled either spurious or real. Figure 3.27 shows the time spent by each user on analyzing the reports and providing feedback. We observe that the average time spent by the users is only 8 minutes on secbench1, 11.5 minutes on secbench2, and 5.5 minutes on secbench3. These numbers show that the users are able to inspect the analysis output and provide feedback to EUGENE with relative ease on these benchmarks. To evaluate the quality of the user provided feedback, we consider the precision of

EUGENE when it is run on the probabilistic version of infoflow analysis with the feedback.

Figure 3.28 shows the false reports eliminated and the true reports retained by EUGENE for each user and benchmark. This figure is similar in format to Figures 3.23 and 3.24. However, for each benchmark, instead of different bars representing different amounts of feedback, the different bars here represent different users, with the feedback amount fixed at eight reports.

The varying behavior of EUGENE on these benchmarks highlights the strengths and limits of our approach.

Figure 3.28: Results of EUGENE on infoflow analysis with real user feedback. Each bar maps to a user. (Figure omitted: bar chart in the same format as Figure 3.23; baseline false reports are 20, 21, and 8, and baseline true reports are 4, 9, and 16, for secbench1, secbench2, and secbench3, respectively.)

For secbench1, an average of 78% of the false reports are eliminated and 62.5% of the true reports are retained. The important thing to note here is that the number of true reports retained is sensitive to the user feedback. With the right feedback, all the true reports are retained (5th bar). However, in the case where the user only chooses to provide one true feedback report (4th bar), EUGENE fails to retain most of the true reports.

For secbench2, an average of 79% of the false reports are eliminated and 100% of the true reports are retained. The reason EUGENE does well here is that secbench2 has multiple large clusters of reports with the same root cause. User feedback on any report in such clusters generalizes to other reports in the cluster. This highlights the fact that EUGENE tends to produce more precise results when there are larger clusters of reports with the same root cause.

For secbench3, an average of 46% of the false reports are eliminated while 82% of the true reports are retained. First, notice that this benchmark produces only eight false reports. We traced the relatively poor performance of EUGENE in generalizing the feedback on false reports to the fact that the analysis user’s interaction with the system is limited to liking or disliking the results. This does not suffice for secbench3 because, to effectively suppress the false reports in this case, the user must add new analysis rules. We intend to explore this richer interaction model in future work.

Finally, we observed that for all the benchmarks in this study, the labels provided by the users to the feedback reports matched the ground truth. While this is not unexpected, it is important to note that EUGENE is robust even under incorrectly labeled feedback, and can produce precise answers if a majority of the feedback is correctly labeled.

Answer to RQ3: It is feasible for users to inspect analysis output and provide feedback to EUGENE, since they only needed an average of 8 minutes for this activity in our user study. Further, in general, EUGENE produces precise results with this user feedback, leading to the conclusion that it is not unreasonable to expect useful feedback from users.

Limitations of EUGENE. EUGENE requires analyses to be specified using the Datalog-based language described in Chapter 2.2. Additionally, the program to be analyzed itself has to be encoded as a set of ground facts. This choice is motivated by the fact that a growing number of program analysis tools, including bddbddb [19], Chord [38], Doop [20], LLVM [86], Soot [21], and Z3 [87], support specifying analyses and programs in Datalog.

The offline (learning) component of EUGENE requires the analysis designer to specify which analysis rules must be soft. Existing analyses employ various approximations such as path-, flow-, and context-insensitivity; in our experience, rules encoding such approximations are good candidates for soft rules. Further, the learning component requires suitable training data in the form of desired analysis output. We expect such training data to be either annotated by the user, or generated by running a precise but unscalable version of the same analysis on small sample programs. Learning using partial or noisy training data is an interesting future direction that we plan to explore.

3.3.6 Related Work

Our work is related to past work on classifying error reports and applications of probabilistic reasoning in program analysis.

Dillig et al. [47] propose a user-guided approach to classify reports of analyses as errors or non-errors. They use abductive inference to compute small, relevant queries to pose to a user that capture exactly the information needed to discharge or validate an error. Their approach does not incorporate user feedback into the analysis specification, nor generalize it to other reports. Blackshear and Lahiri [55] propose a post-processing framework to prioritize alarms produced by a static verifier based on semantic reasoning about the program. Statistical error ranking techniques [54, 48, 53] employ statistical methods and heuristics to rank errors reported by an underlying static analysis. Non-statistical clustering techniques correlate error reports based on a root-cause analysis [36, 35]. Our technique, on the other hand, makes the underlying analysis itself probabilistic.

Recent years have seen many applications of probabilistic reasoning to analysis problems. In particular, specification inference techniques based on probabilistic inference [88, 89, 90] can be formulated as Markov Logic Networks (as defined in Chapter 2.3). Another connection between user-guided program analysis and specification inference is that user feedback can be viewed as an iterative method by means of which the analysis user communicates a specification to the program analysis tool. Finally, the inferred specifications can themselves be employed as soft rules in our system.

3.4 Conclusion

We presented a user-guided approach to program analysis that shifts decisions about the kind and degree of approximations to apply in analyses from analysis writers to analysis users. Our approach enables users to interact with the analysis by providing feedback on a portion of the results produced by the analysis, and automatically uses the feedback to guide the analysis approximations to the user’s preferences. We implemented our approach in a system EUGENE and evaluated it on real users, analyses, and programs. We showed that EUGENE greatly reduces misclassified reports even with limited user feedback.

CHAPTER 4
SOLVER TECHNIQUES

This chapter discusses our backend, NICHROME, a learning and inference system for Markov Logic Networks. In order to solve the problems generated by the aforementioned applications in an efficient and accurate manner, NICHROME exploits various domain insights that are not limited to our applications. We first give an overview of our learning and inference algorithms, then discuss how the algorithms exploit these insights in detail. In particular, most of the discussion will focus on the inference algorithm, as the learning algorithm is eventually reduced to a sequence of invocations of the inference algorithm. The algorithms discussed in this chapter were originally described in our previous publications [85, 91].

Weight Learning. NICHROME supports weight learning assuming the structure of the constraints is given. Example applications include learning weights that represent designers’ confidence in constraints from an existing Datalog analysis in the static bug detection application (Section 3.3). We adapted the gradient descent algorithm described in [92] for weight learning, which reduces the learning problem to a sequence of MAP inference problems.

Algorithm 6 outlines our algorithm. The input is a Markov Logic Network C which consists of hard constraints Ch and soft constraints Cs. The output is a Markov Logic Network C′. C′ has the same structure as C, but the weights in the soft constraints Cs are updated with the learnt weights. Our algorithm calculates the weights such that the expected MAP solution T becomes the output of MAP(C′). As a first step, in line 5, our algorithm assigns initial weights to all the soft constraints. The initial weight w′ of a constraint (ch, w′) ∈ C′s is computed as the log of the ratio of the number of instances of ch satisfied by the desired output T (denoted by

Algorithm 6 Weight learning algorithm for Markov Logic Networks.
1: PARAM α: rate of change of weight of soft rules.
2: INPUT: A Markov Logic Network C = Ch ∪ Cs, where Ch are the hard constraints and Cs are the soft constraints.
3: INPUT: T, expected output tuples.
4: OUTPUT: C′, a Markov Logic Network with the same structure as C but with learnt weights.
5: C′s := { (ch, w′) | ∃w. (ch, w) ∈ Cs ∧ w′ = log(n1/n2) }, where n1 = |Satisfactions(ch, T)| and n2 = |Violations(ch, T)|.
6: repeat
7:   C′ := Ch ∪ C′s
8:   T′ := MAP(C′)
9:   Cs := C′s
10:  C′s := { (ch, w′) | ∃w. (ch, w) ∈ Cs ∧ w′ = w + α × (n1 − n2) }, where n1 = |Violations(ch, T′)| and n2 = |Violations(ch, T)|.
11: until C′s = Cs

|Satisfactions(ch, T)|) to the number of instances of ch violated by T (denoted by |Violations(ch, T)|). In other words, the initial weight captures the log odds of the rule being true in the training data. Note that, in the case |Violations(ch, T)| = 0, it is substituted by a suitably small value [92]. We formally define Satisfactions and Violations below:

Satisfactions(l1 ∨ ... ∨ ln, T) := { ⟦l1⟧(σ) ∨ ... ∨ ⟦ln⟧(σ) | σ ∈ Σ ∧ T ⊨ ⟦l1⟧(σ) ∨ ... ∨ ⟦ln⟧(σ) }

Violations(l1 ∨ ... ∨ ln, T) := { ⟦l1⟧(σ) ∨ ... ∨ ⟦ln⟧(σ) | σ ∈ Σ ∧ T ⊭ ⟦l1⟧(σ) ∨ ... ∨ ⟦ln⟧(σ) }

Next, in line 8, we perform MAP inference on the Markov Logic Network C′ (defined in line 7) with the initialized weights. This produces a solution T′, which is then used to update the weights of the soft constraints. The weights are updated according to the formula in line 10. The basic intuition for updating weights is as follows: the weights learnt by the learning algorithm must be such that MAP(C′) is as close to the desired output T as possible. Towards that end, if the current output T′ produces more violations for a rule than the desired output, it implies that the rule needs to be strengthened and its weight should be increased. On the other hand, if the current output T′ produces fewer violations for a rule than T, the rule needs to be weakened and its weight should be reduced. The formula in the algorithm has exactly this effect. Moreover, the rate of change of the weights can be controlled by an input parameter α. The learning process continues iteratively until the learnt weights do not change. In practice, the learning process can be terminated after a fixed number of iterations, or when the difference in weights between successive iterations becomes insignificant. Since the above algorithm eventually reduces the learning problem to a series of MAP inference problems, the key component that decides the effectiveness of NICHROME is the inference algorithm. As a result, most of our technical innovations are in the inference algorithm, which we will illustrate in detail in the rest of the chapter.
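Algorithm 6 can be sketched in miniature as follows; the clause representation, the brute-force MAP routine, and the two-atom example are illustrative stand-ins (NICHROME’s actual MAP inference is far more sophisticated). Since each soft constraint here is a single ground clause, the per-clause violation counts are just 0 or 1:

```python
import itertools, math

def satisfied(clause, world):
    """A ground clause (frozenset of (atom, polarity) literals) is
    satisfied if some literal matches the world."""
    return any(world[a] == pol for a, pol in clause)

def map_solve(atoms, hard, soft):
    """Brute-force MAP inference: among worlds satisfying all hard
    clauses, maximize the total weight of satisfied soft clauses."""
    best, best_w = None, -math.inf
    for bits in itertools.product([False, True], repeat=len(atoms)):
        world = dict(zip(atoms, bits))
        if not all(satisfied(c, world) for c in hard):
            continue
        w = sum(wt for c, wt in soft if satisfied(c, world))
        if w > best_w:
            best, best_w = world, w
    return best

def learn_weights(atoms, hard, soft_clauses, desired, alpha=0.5, iters=20):
    """Line-10-style update: a clause's weight grows when the current MAP
    solution violates it more than the desired output does."""
    weights = [w for _, w in soft_clauses]
    for _ in range(iters):
        current = map_solve(atoms, hard,
                            [(c, w) for (c, _), w in zip(soft_clauses, weights)])
        new = [w + alpha * ((0 if satisfied(c, current) else 1)
                            - (0 if satisfied(c, desired) else 1))
               for (c, _), w in zip(soft_clauses, weights)]
        if new == weights:           # fixpoint: weights no longer change
            return new
        weights = new
    return weights

atoms = ["p", "q"]
hard = [frozenset([("p", True), ("q", True)])]   # hard: p ∨ q
soft = [(frozenset([("p", True)]), 0.0),         # soft: p
        (frozenset([("q", False)]), 0.0)]        # soft: ¬q
desired = {"p": True, "q": False}                # expected MAP output T
weights = learn_weights(atoms, hard, soft, desired)
final = map_solve(atoms, hard, [(c, w) for (c, _), w in zip(soft, weights)])
print(weights, final)
```

On this toy instance the loop raises both weights once, after which the MAP solution coincides with the desired output and the weights stabilize.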

MAP Inference. Before illustrating our approach, we first describe the properties of an ideal inference solver and identify key challenges in implementing such a solver. Ideally, an inference solver should satisfy the following three criteria:

Soundness. We say an inference solver is sound if any solution it produces does not violate any hard constraint in the input formula. Soundness is necessary as it ensures correctness properties of the applications built upon the inference solver. For instance, applying an unsound solver to the formulae generated by the automated verification application (Section 3.1) can produce unviable abstractions that repeat the same mistakes, and even invalid abstractions. As a result, the application may not terminate even when the space of valid abstractions is finite. As another example, applying an unsound solver in the static bug detection application can lead to solutions that violate analysis constraints that should definitely hold based on the analysis designers’ knowledge. This in turn can greatly degrade the quality of the results (i.e., lead to many false positives or false negatives).

Optimality. We say an inference solver is optimal if any solution it produces maximizes the sum of the weights of the satisfied soft constraints while respecting the soundness property. Optimality is important as it impacts the quality of the application built on top of the solver. For instance, returning suboptimal answers in the automated verification application will result in non-minimal abstractions, which can still resolve the desired queries but may result in program analyses with unnecessarily high costs. These analyses might not even terminate if resources are limited. As another example, a non-optimal solver can cause the user to spend extra effort in inspecting computed root causes in the interactive verification setting. In the worst case, the user may end up spending more time inspecting these root causes than inspecting the final alarms directly.

Scalability. Finally, a practical solver should be able to solve large Markov Logic Network instances generated from real-world applications. This is especially important in our setting as the instances generated by our applications are more demanding than those in past applications of Markov Logic Networks, as we will discuss later.

Since Markov Logic Networks have emerged as a promising model for combining logic and probability, researchers from different communities have developed various MAP inference solvers [80, 81, 84, 83, 22]. However, none of them satisfies all three criteria. Tuffy [80] and Alchemy [81] are probably the most well-known Markov Logic Network solvers from the Artificial Intelligence community. They enforce neither soundness nor optimality. They do not enforce soundness because, unlike problems in program reasoning, problems in Artificial Intelligence typically do not require rigorous formal guarantees. Moreover, while they scale well on problems in the Artificial Intelligence domain, they have difficulty terminating on our problems, whose characteristics are radically different. Markov thebeast [84] and Rockit [83] are two solvers that were later developed by Artificial Intelligence researchers. While both enforce soundness and optimality, they also do not scale on our problems. Finally, Z3 [22] is the most successful constraint solver from the program reasoning community. While it is possible to use Z3’s MaxSMT engine to solve Markov Logic Networks

indirectly by converting the instances into MaxSMT instances, this approach does not scale well, although it is able to enforce soundness and optimality.

Before introducing our approach, we first discuss the key challenges in designing a sound, optimal, and scalable MAP inference solver. At a high level, all the algorithms employed by the aforementioned solvers can be divided into two phases. The first phase is grounding, where the input Markov Logic Network is reduced to a weighted propositional formula by instantiating all variables with constants in the domain. The generated formulae are instances of Maximum Satisfiability (or MAXSAT), an optimization extension of the well-known boolean satisfiability problem. Specifically, we use one of its variants, Partial Weighted MAXSAT. The generated MAXSAT instance is then solved by a MAXSAT solver in the solving phase. Both phases are challenging to scale: for the grounding phase, if we naively replace variables with all constants in the domain, we can easily get a MAXSAT formula comprising over 10^30 clauses, which is intractable for any existing MAXSAT solver; for the solving phase, the MAXSAT problem is a combinatorial optimization problem, which is known to be computationally challenging.

We next describe two approaches that address the challenges in the above two phases respectively. Briefly, the first approach enables effective grounding by exploiting two observations: 1) solutions are typically “sparse”, meaning that most of the tuples in the domain will not be derived in the output, and 2) a majority of the constraints in the formula are Horn, as they come from a Datalog analysis. The second approach enables effective MAXSAT solving by exploiting the insight that, in all our applications, we are only interested in the part of the result that concerns a few tuples; intuitively, if we only care about these few tuples, we do not always have to reason about the complete formula to get a correct partial solution. This is in line with locality, a property that is universal to program analysis and has been successfully leveraged by query-driven and demand-driven program analyses for better efficiency and precision. In the rest of the chapter, we describe these two approaches in detail.

c1 : path(a, a)
c2 : path(a, b) ∧ edge(b, c) =⇒ path(a, c)
c3 : ¬path(a, b)   weight 1

Figure 4.1: Graph reachability in Markov Logic Network.

Input facts:                 Output facts:
edge(0, 1)  edge(0, 2)       path(0, 0)  path(0, 1)
edge(1, 3)  edge(1, 4)       path(0, 2)  path(0, 3)
edge(2, 5)  edge(2, 6)       path(0, 4)  path(0, 5)
                             ...

Figure 4.2: Example graph reachability input and solution. (The accompanying drawing of the 7-node tree rooted at node 0 is omitted.)
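The hard rules c1 and c2 are Horn, so their least solution on the input facts above can be computed by a simple fixpoint iteration, as a Datalog solver would; this toy sketch reproduces the listed output facts:

```python
# Evaluate c1 and c2 of Figure 4.1 on the input facts of Figure 4.2
# by least-fixpoint (Datalog-style) iteration.
edges = {(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)}
nodes = {n for e in edges for n in e}

path = {(n, n) for n in nodes}       # c1: path(a, a)
changed = True
while changed:                       # c2: path(a,b) ∧ edge(b,c) => path(a,c)
    changed = False
    for (a, b) in list(path):
        for (b2, c) in edges:
            if b == b2 and (a, c) not in path:
                path.add((a, c))
                changed = True

print(sorted(c for (a, c) in path if a == 0))  # [0, 1, 2, 3, 4, 5, 6]
```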

4.1 Iterative Lazy-Eager Grounding

4.1.1 Introduction

We first show informally why a naive grounding strategy, which eagerly replaces variables with all constants, will produce intractable formulae. Consider the Markov Logic Network encoding the graph reachability problem in Figure 4.1 and Figure 4.2. Eagerly grounding constraint c2 entails instantiating a, b, c over all nodes in the graph, producing N^3 ground constraints, where N is the number of nodes. Hence, the number of ground constraints quickly becomes intractable as the size of the input graph grows.

In order to generate tractable MAXSAT formulae, several techniques have been proposed to lazily ground constraints [80, 81, 82, 83, 84]. For the scale of problems we consider, however, these techniques are either too lazy and converge very slowly, or too eager and produce instances that are beyond the reach of sound MAXSAT solvers (i.e., solvers that do not produce solutions that violate hard clauses). For instance, a recent technique [82] grounds hard constraints too lazily and soft constraints too eagerly. Specifically, for the graph reachability example, this technique takes L iterations to lazily ground hard constraint c2 (where L is the length of the longest path in the input graph), and generates N^2 constraints upfront by eagerly grounding soft constraint c3.
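The blow-up can be made concrete on the 7-node example graph of Figure 4.2 (a toy count, nothing more):

```python
import itertools

# Counting eager-grounding sizes for the example graph of Figure 4.2.
edges = {(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)}
nodes = sorted({n for e in edges for n in e})
N = len(nodes)                                   # 7 nodes

# c2 has three variables (a, b, c): eager grounding yields N^3 clauses.
eager_c2 = sum(1 for _ in itertools.product(nodes, repeat=3))
# c3 has two variables (a, b): eager grounding yields N^2 soft clauses.
eager_c3 = N ** 2
print(eager_c2, eager_c3)  # 343 49
```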

124 plementary optimizations: eagerly exploiting proofs and lazily refuting counterexamples. To eagerly exploit proofs, our algorithm uses an efficient procedure to upfront ground constraints that will necessarily be grounded during the iterative process. As a concrete instance of this procedure, we use a Datalog solver, which efficiently computes the least solution of a set of recursive Horn constraints 1. In practice, most constraints in many inference tasks are Horn, allowing to effectively leverage a Datalog solver. For instance, both hard constraints c1 and c2 in our graph reachability example are Horn. Our algorithm therefore applies a Datalog solver to efficiently ground them upfront. On the example graph in Figure 4.2, this produces 7 ground instances of c1 and only 10 instances of c2. To lazily refute counterexamples, our algorithm uses an efficient procedure to check for violations of constraints by the MAXSAT solution to the set of ground constraints in each iteration, terminating the iterative process in the absence of violations. We use a Datalog solver as a concrete instance of this procedure as well. Existing lazy techniques, such as Cutting Plane Inference (CPI) [84] and SoftCegar [82], can be viewed as special cases of our algorithm in that they only lazily refute counterexamples. For this purpose, CPI uses a relational database query engine and SoftCegar uses a satisfiability modulo theories or SMT theorem prover [22], whereas we use a Datalog solver; engineering differences aside, the key motivation underlying all three approaches is the same: to use a separate procedure that is efficient at refuting counterexamples. Our main technical insight is to apply this idea analogously to eagerly exploit proofs. Our resulting strategy of guiding the grounding process based on both proofs and counterexamples gains the benefits of both eager and lazy grounding without suffering from the disadvantages of either. 
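The eager step can be pictured as a naive Datalog-style fixpoint. The sketch below is our own illustration of what a Datalog solver such as bddbddb computes, not the thesis implementation; on the graph of Figure 4.2 it reproduces the counts quoted above (7 ground instances of c1, 10 of c2):

```python
def ground_horn(nodes, edges):
    """Least solution of the Horn constraints c1, c2 by naive fixpoint,
    recording only the ground instances of c2 that actually fire."""
    path = {(a, a) for a in nodes}   # the ground instances of c1
    fired = set()                    # ground instances of c2 that can fire
    changed = True
    while changed:
        changed = False
        for (a, b) in list(path):
            for (src, c) in edges:
                if src == b:
                    # records ¬path(a,b) ∨ ¬edge(b,c) ∨ path(a,c)
                    fired.add((a, b, c))
                    if (a, c) not in path:
                        path.add((a, c))
                        changed = True
    return path, fired

nodes = range(7)
edges = {(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)}
path, fired = ground_horn(nodes, edges)
print(len(path), len(fired))  # 17 10
```

Only instantiations reachable from true facts are ever grounded, which is exactly the contrast with the N³ clauses of eager grounding.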
We apply our algorithm to enable sound and efficient inference for large problem instances not only from the aforementioned program analysis applications, but also from information retrieval applications, which concern discovering relevant information in large, unstructured data collections.

¹A Horn constraint is a disjunction of literals with at most one positive (i.e., unnegated) literal.

The graph reachability example illustrates important aspects of inference tasks in both these domains. For program analysis, the reader can refer to the pointer analysis example described in Section 3.1.2. In the information retrieval domain, an example task is the entity resolution problem of removing duplicate entries in a citation database [93]. In this problem, a hard constraint similar to the transitivity constraint c2 in Figure 4.1 encodes an axiom about the equivalence of citations:

sameBib(v1, v2) ∧ sameBib(v2, v3) =⇒ sameBib(v1, v3)

We evaluate our algorithm on three benchmarks with three different input datasets generated from real-world information retrieval and program analysis applications. Our empirical evaluation shows that our approach achieves significant improvements over three state-of-the-art approaches, CPI [84], RockIt [83], and Tuffy [80], in running time as well as in the quality of the solution.

4.1.2 The IPR Algorithm

We propose an efficient iterative algorithm IPR (Inference via Proof and Refutation) for solving Markov Logic Networks. The algorithm has the following key features:

1. Eager proof exploitation. IPR eagerly explores the logical structure of the relational constraints to generate an initial grounding, which has the effect of speeding up the convergence of the algorithm. When the relational constraints are in the form of Horn constraints, we show that such an initial grounding is optimal (Theorem 15).

2. Lazy counterexample refutation. After solving the constraints in the initial ground- ing, IPR applies a refutation-based technique to refine the solution: it lazily grounds the constraints that are violated by the current solution, and solves the accumulated grounded constraints in an iterative manner.

Algorithm 7 IPR: the eager-lazy algorithm.

1: INPUT: C = Ch ∪ Cs, a Markov Logic Network where Ch are the hard constraints and Cs are the soft constraints.
2: OUTPUT: T ⊆ 𝕋, a solution (assumes MAP(C) ≠ UNSAT).
3: φ := any {ρ1, ..., ρn} such that ∀i ∈ [1, n]. ∃(l1 ∨ ... ∨ lm) ∈ Ch. ∃σ ∈ (N ∪ V) → N. ρi = ⟦l1⟧(σ) ∨ ... ∨ ⟦lm⟧(σ)
4: ψ := any {(ρ1, w1), ..., (ρn, wn)} such that ∀i ∈ [1, n]. ∃(l1 ∨ ... ∨ lm, w) ∈ Cs. w = wi ∧ (∃σ ∈ (N ∪ V) → N. ρi = ⟦l1⟧(σ) ∨ ... ∨ ⟦lm⟧(σ))
5: T := ∅; w := 0
6: while true do
7:   φ′ := ⋃_{ch ∈ Ch} Violations(ch, T)
8:   ψ′ := ⋃_{(ch, w) ∈ Cs} { (ρ, w) | ρ ∈ Violations(ch, T) }
9:   (φ, ψ) := (φ ∪ φ′, ψ ∪ ψ′)
10:  T′ := MAXSAT(φ, ψ)
11:  w′ := WEIGHT(ψ, T′)
12:  if (w′ = w ∧ φ′ = ∅) then return T
13:  T := T′; w := w′
14: end while

3. Termination with soundness and optimality. IPR performs a termination check

that guarantees the soundness and optimality of the solution if an exact MAXSAT solver is used. Moreover, this check is more precise than the termination checks in existing refutation-based algorithms [84, 82], thereby leading to faster convergence.

Before presenting the algorithm, we first introduce two functions, MAXSAT and WEIGHT, that abstract the interface of a MAXSAT solver, which allows us to focus our discussion on the grounding subproblem of the MAP inference problem. We discuss the MAXSAT problem definition and our MAXSAT solver implementation in detail in the next section. To define these two functions, we introduce a symbol ρ to represent a ground hard constraint:

ρ ::= ⋁_{i ∈ [1, n]} ti ∨ ⋁_{j ∈ [1, m]} ¬tj.

Let φ be a set of ground hard constraints {ρi | 1 ≤ i ≤ n} and ψ be a set of ground soft constraints {(ρi, wi) | 1 ≤ i ≤ m}. Then we have:

WEIGHT(ψ, T) = Σ_{(ρ, w) ∈ ψ′} w, where ψ′ = {(ρ, w) | (ρ, w) ∈ ψ ∧ T ⊨ ρ},

MAXSAT(φ, ψ) = UNSAT if ∀T. ∃ρ ∈ φ. T ⊭ ρ; otherwise, some T such that T ∈ argmax_{T′ ∈ 𝕋} WEIGHT(ψ, T′), where 𝕋 = {T′ | ∀ρ ∈ φ. T′ ⊨ ρ}.

Intuitively, WEIGHT returns the sum of the weights of the ground soft constraints satisfied by a given solution, while MAXSAT returns a set of tuples that maximizes WEIGHT and satisfies all ground hard constraints. We now introduce the IPR algorithm. IPR (Algorithm 7) takes a Markov Logic Network C as input and produces a set of tuples T as output. The input C is divided into hard constraints Ch and soft constraints Cs. For ease of exposition, we assume that the hard constraints Ch are satisfiable, allowing us to elide UNSAT as a possible alternative to the output T. We next explain each component of IPR separately.
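The two interface functions can be read as an executable specification. The following brute-force sketch is ours, exponential, and for intuition only; a solution T is represented as the set of tuples assigned true, and a clause as a pair of tuples asserted true and tuples asserted false:

```python
from itertools import combinations

def satisfies(T, clause):
    pos, neg = clause  # tuples asserted true, tuples asserted false
    return any(t in T for t in pos) or any(t not in T for t in neg)

def weight(soft, T):
    # WEIGHT(ψ, T): total weight of ground soft clauses satisfied by T
    return sum(w for clause, w in soft if satisfies(T, clause))

def maxsat(hard, soft, universe):
    # MAXSAT(φ, ψ): a T satisfying every hard clause and maximizing WEIGHT;
    # returns None in place of UNSAT
    best = None
    for r in range(len(universe) + 1):
        for T in map(frozenset, combinations(universe, r)):
            if all(satisfies(T, c) for c in hard):
                if best is None or weight(soft, T) > weight(soft, best):
                    best = T
    return best

hard = [(("x1", "x2"), ())]                       # x1 ∨ x2
soft = [(((), ("x1",)), 1), (((), ("x2",)), 2)]   # (¬x1, 1), (¬x2, 2)
print(sorted(maxsat(hard, soft, ["x1", "x2"])))   # ['x1']
```

The toy instance must make one of x1, x2 true and prefers to sacrifice the lighter soft clause, so setting only x1 is optimal with weight 2.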

Eager proof exploitation. To start with, IPR computes an initial set of ground hard constraints φ and ground soft constraints ψ by exploiting the logical structure of the constraints (lines 3–4 of Algorithm 7). The sets φ and ψ can be arbitrary subsets of the ground hard constraints and ground soft constraints in the full grounding. When φ and ψ are both empty, the behavior of IPR defaults to that of lazy approaches like CPI [84] (Definition 14). As a concrete instance of eager proof exploitation, when a subset of the relational constraints have a recursive Horn form, IPR applies a Datalog solver to efficiently compute the least solution of this set of recursive Horn constraints, and finds a relevant subset of clauses to be grounded upfront. In particular, when all the hard constraints are Horn, IPR prescribes a recipe for generating an optimal initial set of ground constraints.

Theorem 15 shows that for hard relational constraints in Horn form, lazy approaches like CPI ground at least as many hard constraints as there are true ground facts in the least solution of the Horn constraints. More importantly, we show that there exists a strategy that can discover upfront the set of all these necessary ground hard constraints and guarantee that no ground constraints besides those in the initial set will be grounded. In practice, eager proof exploitation is employed for both hard and soft Horn constraints. Theorem 15 guarantees that if upfront grounding is applied to hard Horn constraints, then only relevant constraints are grounded. While this guarantee does not apply to upfront grounding of soft Horn constraints, we observe empirically that most such ground constraints are relevant. Moreover, as Section 4.1.3 shows, using this strategy for the initial grounding allows the iterative process to terminate in far fewer iterations while ensuring that each iteration does approximately as much work as before (without initial grounding).

Definition 14. (Lazy Algorithm) Algorithm Lazy is an instance of IPR with φ = ψ = ∅ as the initial grounding.

Theorem 15 (Optimal initial grounding for Horn constraints). Suppose a Markov Logic Network comprises a set of hard constraints Ch, each of which is a Horn constraint ⋀_{i=1}^{n} li =⇒ l0, whose least solution is desired:

T = lfp λT′. T′ ∪ { ⟦l0⟧(σ) | (⋀_{i=1}^{n} li =⇒ l0) ∈ Ch ∧ ∀i ∈ [1, n]. ⟦li⟧(σ) ∈ T′ ∧ σ ∈ Σ }.

Then, for such a system, (a) Lazy(Ch, ∅) grounds at least |T| constraints, and (b) IPR with the initial grounding φ does not ground any more constraints, where

φ = { ⋁_{i=1}^{n} ¬⟦li⟧(σ) ∨ ⟦l0⟧(σ) | (⋀_{i=1}^{n} li =⇒ l0) ∈ Ch ∧ ∀i ∈ [0, n]. ⟦li⟧(σ) ∈ T ∧ σ ∈ Σ }.

Lazy counterexample refutation. After generating the initial grounding, IPR iteratively grounds more constraints and refines the solution by refutation (lines 6–14 of Algorithm 7).

In each iteration, the algorithm keeps track of the previous solution T and the weight w of that solution. Initially, the solution is empty with weight zero (line 5). In line 7, IPR computes all the violations of the hard constraints by the previous solution T using Violations (defined at the beginning of this chapter). Similarly, in line 8, the set of ground soft constraints ψ′ violated by the previous solution T is computed. In line 9, the sets of violations φ′ and ψ′ are added to the corresponding sets of ground hard constraints φ and ground soft constraints ψ, respectively. The intuition for adding the violated hard constraints φ′ to the set φ is straightforward: the set of ground hard constraints φ is not yet sufficient to prevent the MAXSAT procedure from producing a solution T that violates the hard constraints Ch. The intuition for ground soft constraints is similar: since the goal of MAXSAT is to maximize the sum of the weights of the satisfied soft constraints in Cs, and all weights in Markov Logic Networks are non-negative, any violation of a ground soft constraint possibly leads to a sub-optimal solution that could have been avoided had the violated constraint been present in the set of ground soft constraints ψ. In line 10, the updated set φ of ground hard constraints and set ψ of ground soft constraints are fed to the MAXSAT procedure to produce a new solution T′, whose weight w′ is computed in line 11. At this point, in line 12, the algorithm checks whether the termination condition is satisfied by the solution T′.
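The loop of lines 6–14 can be transcribed almost directly. In the sketch below (ours, not the thesis implementation), constraints are assumed to be already ground, and `maxsat`, `weight`, and `violations` are injected callbacks standing in for the WPMS solver and the Datalog-based violation finder:

```python
def satisfies(T, clause):
    pos, neg = clause  # tuples asserted true, tuples asserted false
    return any(t in T for t in pos) or any(t not in T for t in neg)

def ipr(hard, soft, phi, psi, maxsat, weight, violations):
    """Lines 5-14 of Algorithm 7 over already-ground constraints."""
    T, w = frozenset(), 0
    while True:
        phi_new = {g for ch in hard for g in violations(ch, T)}
        psi_new = {(g, wc) for (cs, wc) in soft for g in violations(cs, T)}
        phi, psi = phi | phi_new, psi | psi_new
        T2 = maxsat(phi, psi)
        w2 = weight(psi, T2)
        if w2 == w and not phi_new:   # termination check (line 12)
            return T
        T, w = T2, w2

# Tiny run with a brute-force solver over the universe {"a", "b"}:
universe = ["a", "b"]
hard = [(("a",), ())]                            # "a" must hold
soft = [(((), ("a",)), 1), (((), ("b",)), 1)]    # prefer "a", "b" false

def brute_maxsat(phi, psi):
    best, best_w = None, -1
    for bits in range(4):
        T = frozenset(x for i, x in enumerate(universe) if bits >> i & 1)
        if all(satisfies(T, c) for c in phi):
            w = sum(wc for c, wc in psi if satisfies(T, c))
            if w > best_w:
                best, best_w = T, w
    return best

result = ipr(
    hard, soft, set(), set(),
    maxsat=brute_maxsat,
    weight=lambda psi, T: sum(wc for c, wc in psi if satisfies(T, c)),
    violations=lambda c, T: set() if satisfies(T, c) else {c},
)
print(sorted(result))  # ['a']
```

Starting from the empty grounding (the Lazy instance of Definition 14), the run grounds the hard clause in the first iteration and the violated soft clause in the second, then terminates.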

Termination check. IPR terminates when no hard constraints are violated by the current solution (φ′ = ∅) and the current objective cannot be improved by adding more ground soft constraints (w′ = w). This termination check improves upon those in previous works [84, 82], and speeds up the convergence of the algorithm in practice. Theorem 16 shows that IPR always terminates with a sound and optimal solution if the underlying MAXSAT solver is exact.

Theorem 16 (Soundness and optimality of IPR). For any Markov Logic Network Ch ∪ Cs whose hard constraints Ch are satisfiable, IPR(C) produces a sound and optimal solution.

Table 4.1: Clauses in the initial grounding and additional constraints grounded in each iteration of IPR for the graph reachability example.

Initial grounding:
  path(0, 0)   path(1, 1)   path(2, 2)   path(3, 3)   path(4, 4)   path(5, 5)   path(6, 6)
  ¬path(0, 0) ∨ ¬edge(0, 1) ∨ path(0, 1)    ¬path(0, 0) ∨ ¬edge(0, 2) ∨ path(0, 2)
  ¬path(1, 1) ∨ ¬edge(1, 3) ∨ path(1, 3)    ¬path(1, 1) ∨ ¬edge(1, 4) ∨ path(1, 4)
  ¬path(2, 2) ∨ ¬edge(2, 5) ∨ path(2, 5)    ¬path(2, 2) ∨ ¬edge(2, 6) ∨ path(2, 6)
  ¬path(0, 1) ∨ ¬edge(1, 3) ∨ path(0, 3)    ¬path(0, 1) ∨ ¬edge(1, 4) ∨ path(0, 4)
  ¬path(0, 2) ∨ ¬edge(2, 5) ∨ path(0, 5)    ¬path(0, 2) ∨ ¬edge(2, 6) ∨ path(0, 6)

Iteration 1 (each with weight 1):
  ¬path(0, 0)   ¬path(1, 1)   ¬path(2, 2)   ¬path(3, 3)   ¬path(4, 4)   ¬path(5, 5)
  ¬path(6, 6)   ¬path(0, 1)   ¬path(0, 2)   ¬path(1, 3)   ¬path(1, 4)   ¬path(2, 5)
  ¬path(2, 6)   ¬path(0, 3)   ¬path(0, 4)   ¬path(0, 5)   ¬path(0, 6)

Example. The IPR algorithm takes two iterations, and grounds 17 hard constraints and 17 soft constraints, to solve the graph reachability example in Figures 4.1 and 4.2. Table 4.1 shows the constraints in the initial grounding, computed using a Datalog solver, and the additional constraints grounded in each iteration of IPR. IPR grounds no additional constraints in Iteration 2, so the corresponding entry is omitted from Table 4.1. In contrast, an eager approach with full grounding needs to ground 392 hard constraints and 49 soft constraints, 12× the number of constraints grounded by IPR. Moreover, the eager approach generates ground constraints, such as ¬path(0, 1) ∨ ¬edge(1, 5) ∨ path(0, 5) and ¬path(1, 4) ∨ ¬edge(4, 2) ∨ path(1, 2), that are trivially satisfied given the input edge relation. A lazy approach with an empty initial grounding grounds the same number of hard and soft constraints as IPR, but takes 5 iterations to terminate, 2.5× the number of iterations needed by IPR. These results show that IPR combines the benefits of the eager and lazy approaches while avoiding their drawbacks.

Table 4.2: Statistics of application constraints and datasets.

      # relations   # rules   # EDB tuples                         # clauses in full grounding
                              E1        E2         E3              E1          E2          E3
PA    94            89        1,274     3.9×10^6   1.1×10^7        2×10^10     1.6×10^28   1.2×10^30
AR    14            24        607       3,595      7,010           1.7×10^8    1.1×10^9    2.4×10^10
RC    5             17        1,656     3,190      7,766           7.9×10^11   1.2×10^13   3.8×10^14

Table 4.3: Results of evaluating CPI, IPR1, ROCKIT, IPR2, and TUFFY on three benchmark applications. CPI and IPR1 use LBX as the underlying solver, while ROCKIT and IPR2 use GUROBI. In all experiments, we used a memory limit of 64 GB and a time limit of 24 hours. Timed-out experiments (denoted ‘–’) exceeded one of these limits.

# iterations (CPI / IPR1 / ROCKIT / IPR2):
  PA: E1 23/6/19/3;   E2 171/8/185/3;   E3 –/11/–/–
  AR: E1 6/5/4/3;     E2 6/6/7/7;       E3 6/6/7/7
  RC: E1 17/6/5/4;    E2 8/8/5/3;       E3 17/13/20/20

total time (CPI / IPR1 / ROCKIT / IPR2 / TUFFY; m = min., s = sec.):
  PA: E1 29s/28s/53s/13s/6m56s;   E2 235m/23m/286m/150m/–;        E3 –/114m/–/–/–
  AR: E1 8s/8s/4s/8s/2m25s;       E2 34m/36m/37s/42s/–;           E3 141m/124m/1m45s/2m1s/–
  RC: E1 3m5s/1m4s/6s/6s/11s;     E2 6m19s/3m41s/9s/10s/2m28s;    E3 150m/46m20s/2m35s/2m54s/180m

# ground constraints (CPI / IPR1 / ROCKIT / IPR2 / TUFFY; K = thousand, M = million):
  PA: E1 0.6K/0.6K/0.6K/0.6K/0.6K;   E2 3M/3.2M/2.9M/3.2M/–;          E3 –/13M/–/–/–
  AR: E1 9.8K/9.8K/27K/83K/589K;     E2 2.3M/2.3M/0.4M/0.4M/–;        E3 8M/8M/1.4M/1.4M/–
  RC: E1 0.5M/0.3M/55K/68K/28K;      E2 2.4M/1.3M/0.11M/0.17M/64K;    E3 14M/4.5M/0.5M/1.2M/0.35M

solution cost (CPI / IPR1 / ROCKIT / IPR2 / TUFFY; K = thousand, M = million):
  PA: E1 0/0/0/0/743;                E2 46/46/35/35/–;                    E3 –/345/–/–/–
  AR: E1 6.4K/6.4K/5.8K/6.1K/7K;     E2 0.4M/0.4M/0.39M/0.39M/–;          E3 0.68M/0.68M/0.67M/0.67M/–
  RC: E1 5.7K/5.7K/5.7K/5.7K/160K;   E2 10.7K/10.7K/10.6K/10.6K/42.7K;    E3 25.1K/25.1K/24.9K/24.9K/0.25M

4.1.3 Empirical Evaluation

In this section, we evaluate IPR and compare it with three state-of-the-art approaches, CPI [84], ROCKIT [83], and TUFFY [80], on three different benchmarks with three different-sized inputs per benchmark.

We implemented IPR in roughly 10,000 lines of Java. To compute the initial ground constraints, we use bddbddb [15], a Datalog solver. The same solver is used to identify ground constraints that are violated by a solution. For our evaluation, we use two different instances of IPR, referred to as IPR1 and IPR2, that vary in the underlying solver used for solving the constraints. For IPR1, we use LBX [94] as the underlying WPMS solver, which guarantees soundness of the solution (i.e., it does not violate any hard clauses). Though LBX does not guarantee optimality of the solution, in practice we find that the cost of the solution computed by LBX is close to that of the solution computed by an exact solver. For IPR2, we use GUROBI as the underlying solver. GUROBI is an integer linear programming (ILP) solver that guarantees soundness of the solution. Additionally, it guarantees that the cost of the generated solution is within a bounded distance from that of the optimal solution. We incorporate it in our approach by replacing the call to MAXSAT (line 10 of Algorithm 7) with a call to an ILP encoder followed by a call to GUROBI. The ILP encoder translates the WPMS problem to an equivalent ILP formulation.

All experiments were done using Oracle HotSpot JVM 1.6.0 on a Linux server with 64 GB RAM and 3.0 GHz processors. The three benchmarks are as follows:

Program Analysis (PA): We choose static bug detection as a representative of the previously discussed program analysis applications. The datasets for this application are derived from a pointer analysis of three real-world Java programs ranging in size from 1.4K to 190K lines of code.

Advisor Recommendation (AR): This is an advisor recommendation system to aid new graduate students in finding good PhD advisors. The datasets for this application were generated from the AI Genealogy Project (http://aigp.eecs.umich.edu) and from DBLP (http://dblp.uni-trier.de).

Relational Classification (RC): In this application, papers from the Cora dataset [95] are classified into different categories based on their authors and main area of research.

Table 4.2 shows statistics of our three benchmarks (PA, AR, RC) and the corresponding EDBs used in our evaluation.

We compare IPR1 and IPR2 with three state-of-the-art techniques: CPI, ROCKIT, and TUFFY.

TUFFY employs a non-iterative approach composed of two steps: first, it generates an initial set of ground constraints that is expected to be a superset of all the required ground constraints; next, TUFFY uses a highly efficient but approximate WPMS solver to solve these ground constraints. In our evaluation, we use the TUFFY executable available from its website http://i.stanford.edu/hazy/hazy/tuffy/.

CPI is a fully lazy iterative approach that refutes counterexamples in a way similar to IPR. However, CPI does not employ proof exploitation and applies a more conservative termination check. In order to ensure a fair comparison between IPR and CPI, we implement CPI in our framework by incorporating these two differences, and use LBX as the underlying solver.

ROCKIT is also a fully lazy iterative approach similar to CPI. Like CPI, ROCKIT does not employ proof exploitation, and its termination check is as conservative as CPI's. The main innovation of ROCKIT is a clever ILP encoding for solving the underlying constraints. This reduces the solving time per iteration, but does not necessarily reduce the number of iterations. In our evaluation, we use the ROCKIT executable available from its website https://code.google.com/p/rockit/.

The primary innovation of ROCKIT is complementary to our approach. In fact, to ensure a fair comparison between IPR and ROCKIT, IPR2 uses the same ILP encoding as ROCKIT. This combined approach yields the benefits of both ROCKIT and our approach.
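For intuition, the basic WPMS-to-ILP translation introduces a 0/1 variable per tuple and a 0/1 slack variable per soft clause that the objective penalizes when the clause is left unsatisfied. The sketch below is ours and emits the constraints symbolically; ROCKIT's actual encoding adds further optimizations beyond this basic scheme:

```python
def wpms_to_ilp(hard, soft):
    """Translate ground WPMS clauses (pos-tuples, neg-tuples) into
    symbolic ILP constraints over 0/1 variables x_t and slacks z_i."""
    cons, obj = [], []
    for (pos, neg) in hard:
        # hard clause: at least one literal must hold, no slack allowed
        lhs = [f"x_{t}" for t in pos] + [f"(1 - x_{t})" for t in neg]
        cons.append(" + ".join(lhs) + " >= 1")
    for i, ((pos, neg), w) in enumerate(soft):
        # soft clause: slack z_i absorbs a violation at cost w
        lhs = [f"x_{t}" for t in pos] + [f"(1 - x_{t})" for t in neg] + [f"z_{i}"]
        cons.append(" + ".join(lhs) + " >= 1")
        obj.append(f"{w}*(1 - z_{i})")
    return "maximize " + " + ".join(obj), cons

obj, cons = wpms_to_ilp(hard=[(("p",), ("q",))], soft=[((("q",), ()), 3)])
print(obj)      # maximize 3*(1 - z_0)
print(cons[0])  # x_p + (1 - x_q) >= 1
print(cons[1])  # x_q + z_0 >= 1
```

An ILP solver maximizing this objective over 0/1 variables then plays exactly the role of the MAXSAT call in line 10 of Algorithm 7.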

Table 4.3 summarizes the results of running CPI, IPR1, ROCKIT, IPR2, and TUFFY on our benchmarks. IPR1 significantly outperforms CPI in terms of running time. Similarly, IPR2 outperforms ROCKIT in running time and in the number of iterations needed, while TUFFY has the worst performance. IPR1 terminates under all nine experiment settings, while IPR2, CPI, and ROCKIT terminate under eight settings and TUFFY terminates under only five. In terms of the quality of the solution, IPR1 and CPI produce solutions with similar costs under all settings. IPR2 and ROCKIT produce solutions with slightly lower costs, as they employ an ILP solver that guarantees a solution whose cost is within a fixed bound of that of the optimal solution. TUFFY produces solutions with significantly higher costs. We next study the results for each benchmark more closely.

Program Analysis (PA): For PA, we first compare IPR1 with CPI. IPR1 significantly outperforms CPI on the larger datasets: CPI does not terminate on E3 even after 24 hours, while IPR1 terminates in under two hours. This is because most of the relational constraints in PA are Horn, allowing IPR1 to effectively perform eager proof exploitation and ground relevant clauses upfront. This is also reflected in the reduced number of iterations for IPR1 compared to CPI. IPR2, which also uses eager proof exploitation, similarly outperforms ROCKIT on E1 and E2. However, both IPR2 and ROCKIT fail to terminate on E3. This indicates that the type of underlying solver plays an important role in the performance of these approaches: for PA, the WPMS solver employed by IPR1 is better suited to the problem than the ILP solver employed by IPR2. TUFFY performs the worst of all the approaches, terminating only on the smallest dataset, E1. Even for E1, the cost of the final solution generated by TUFFY is significantly higher than for the other approaches. More acutely, TUFFY violates ten ground hard constraints on E1, which is unacceptable for program analysis, where hard constraints encode soundness conditions.

Advisor Recommendation (AR): On AR, IPR1, IPR2, CPI, and ROCKIT terminate on all three datasets and produce similar results, while TUFFY terminates only on the smallest dataset. IPR1 and CPI have comparable performance on AR, as a fully lazy approach suffices to solve the relational constraint system efficiently. Similarly, IPR2 and ROCKIT have similar performance. However, both IPR2 and ROCKIT significantly outperform IPR1 and CPI in running time, although they need a similar number of iterations. This indicates that for AR, the smart ILP encoding leads to fewer constraints, and also that the ILP solver is better suited than the WPMS solver, leading to a lower running time per iteration. Note that all these approaches except TUFFY terminate within seven iterations on all three datasets. TUFFY, on the other hand, times out on the two larger datasets without completing its grounding phase, reflecting the need to ground the constraints lazily.

Relational Classification (RC): All the approaches terminate on RC for all three datasets. However, IPR1 outperforms CPI and TUFFY significantly in runtime on the largest dataset, and IPR2 and ROCKIT in turn outperform IPR1. This is again due to the faster solving time per iteration enabled by the smart ILP encoding and the use of an ILP solver. Unlike on the other benchmarks, the running time of TUFFY is comparable to the other approaches; however, the costs of its solutions are on average 14× higher than those produced by the other approaches.

4.1.4 Related Work

A large body of work exists on solving weighted relational constraints in a lazy manner. Lazy inference techniques [93, 96] rely on the observation that most ground facts in the final solution to a relational inference problem have default values (a value appearing much more often than the others). These techniques start by assuming a default value for all ground facts, and gradually ground clauses as the values of facts are determined to change for a better objective. Such techniques apply a loose termination check, and therefore guarantee neither soundness nor optimality of the final solution. Iterative ground-and-solve approaches such as CPI [84] and RockIt [83] solve constraints lazily by iteratively grounding only those constraints that are violated by the current solution. Compared to lazy inference techniques, they apply a conservative termination check that guarantees the soundness and optimality of the solution. Compared to our approach, all of the above techniques either need more iterations to converge or terminate prematurely with potentially unsound and suboptimal solutions. Tuffy [80] applies a non-iterative technique divided into a grounding phase and a solving phase. In the grounding phase, Tuffy grounds a set of constraints that it deems relevant to the solution of the inference problem. In practice, this set is often imprecise, which either hinders the scalability of the overall process (when the set is too large) or degrades the quality of the solution (when the set is too small). Soft-CEGAR [82] grounds the soft constraints eagerly and solves the hard constraints lazily, using an SMT solver to efficiently find the hard constraints violated by a solution. Lifted inference techniques [97, 98, 99] use approaches from first-order logic, like variable elimination, to simplify the system of relational constraints. Such techniques can be used in conjunction with various iterative techniques, including our approach.

4.1.5 Conclusion

We presented a new technique for grounding Markov Logic Networks. Existing approaches either ground the constraints too eagerly, which produces intractably large propositional instances for WPMS solvers, or ground them too lazily, which prohibitively slows the convergence of the overall process. Our approach strikes a balance between these two extremes by applying two complementary optimizations: eagerly exploiting proofs and lazily refuting counterexamples. Our empirical evaluation showed that our technique achieves significant improvements over existing techniques in both performance and quality of the solution.

4.2 Query-Guided Maximum Satisfiability

4.2.1 Introduction

The maximum satisfiability or MAXSAT problem [100] is an optimization extension of the well-known satisfiability or SAT problem [101, 102]. A MAXSAT formula consists of a set of conventional SAT clauses, called hard clauses, together with a set of weighted soft clauses.

The solution to a MAXSAT formula is an assignment to its variables that satisfies all the hard clauses and maximizes the sum of the weights of the satisfied soft clauses. A number of interesting problems spanning a diverse set of areas, such as program reasoning [8, 9, 103, 104], information retrieval [105, 106, 80, 81], databases [107, 108], circuit design [109], bioinformatics [110, 111], and planning and scheduling [112, 113, 114], can be naturally encoded as MAXSAT formulae. Many of these problems specify hard constraints that encode various soundness conditions, and soft constraints that specify objectives to optimize.

MAXSAT solvers have made remarkable strides in performance over the last decade [115, 116, 117, 118, 119, 120, 121, 122, 123]. This in turn has motivated even more demanding and emerging applications to be formulated using MAXSAT. Real-world instances of many of these problems, however, result in very large MAXSAT formulae [124]. These formulae, comprising millions of clauses or more, are beyond the scope of existing MAXSAT solvers. Fortunately, for many of these problems, one is interested in a small set of queries that constitute a very small fraction of the entire MAXSAT solution. For instance, in program analysis, a query could be the analysis information for a particular variable in the program; intuitively, one would expect the computational cost of answering a small set of queries to be much smaller than the cost of computing analysis information for all program variables. In the MAXSAT setting, the notion of a query translates to the value of a specific variable in a MAXSAT solution. Given a MAXSAT formula ϕ and a set of queries Q, one obvious method for answering the queries in Q is to compute the MAXSAT solution to ϕ and project it onto the variables in Q. Needless to say, especially for very large MAXSAT formulae, this is not a scalable solution. Therefore, it is interesting to ask the following question: “Given a MAXSAT formula ϕ and a set of queries Q, is it possible to answer Q by computing only information relevant to Q?” We call this the query-guided maximum satisfiability or Q-MAXSAT problem (ϕ, Q).
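As a reference semantics (ours; exponential and purely definitional), Q-MAXSAT can be read as: solve the full instance, then project the model onto Q. A literal is a (variable, polarity) pair, and a clause is a list of literals:

```python
from itertools import product

def sat(model, lits):
    # a clause (list of literals) is satisfied if some literal holds
    return any(model[v] == pol for v, pol in lits)

def qmaxsat_ref(hard, soft, variables, Q):
    """Brute-force MAXSAT over all assignments, projected onto Q."""
    best, best_w = None, -1
    for bits in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, bits))
        if all(sat(model, c) for c in hard):
            w = sum(wt for wt, c in soft if sat(model, c))
            if w > best_w:
                best, best_w = model, w
    return {v: best[v] for v in Q}

hard = [[("v1", True), ("v2", True)]]                            # v1 ∨ v2
soft = [(5, [("v1", False)]), (1, [("v2", False)]), (1, [("v3", False)])]
print(qmaxsat_ref(hard, soft, ["v1", "v2", "v3"], {"v1"}))  # {'v1': False}
```

The optimal model sets only v2 true (total weight 6), so the query v1 is answered false; the point of the query-guided approach is to reach that answer without enumerating the full model.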

Our main technical insight is that a Q-MAXSAT instance (ϕ, Q) can be solved by computing a MAXSAT solution of a small subset of the clauses in ϕ. The main challenge, however, is to efficiently determine whether the answers to Q indeed correspond to a MAXSAT solution of ϕ. We propose an iterative algorithm for solving a Q-MAXSAT instance (ϕ, Q) (Algorithm 8 in Section 4.2.4). In each iteration, the algorithm computes a MAXSAT solution to a subset of clauses in ϕ that are relevant to Q. We also define an algorithm (Algorithm 9 in Section 4.2.4.1) that efficiently checks whether the current assignment to the variables in Q corresponds to a MAXSAT solution to ϕ. In particular, our algorithm constructs a small set of clauses that succinctly summarizes the effect of the clauses in ϕ that are not considered by our algorithm, and then uses it to overestimate the gap between the optimum objective value of ϕ under the current assignment and the optimum objective value of ϕ.

We have implemented our approach in a tool called PILOT and applied it to 19 large MAXSAT instances, ranging in size from 100 thousand to 22 million clauses, generated from real-world problems in program analysis and information retrieval. Our empirical evaluation shows that PILOT achieves significant improvements in runtime and memory over conventional MAXSAT solvers: on these instances, PILOT used only 285 MB of memory on average and terminated in 107 seconds on average. In contrast, conventional MAXSAT solvers timed out on eight of the instances. In summary, the main contributions of this work are as follows:

1. We introduce and formalize a new optimization problem called Q-MAXSAT. In contrast to traditional MAXSAT, where one is interested in an assignment to all variables, in Q-MAXSAT we are interested only in an assignment to a subset of variables, called queries.

2. We propose an iterative algorithm for Q-MAXSAT that has the desirable property of being able to efficiently check whether an assignment to the queries is an optimal solution to the Q-MAXSAT instance.

3. We present empirical results showing the effectiveness of our approach by applying PILOT to several Q-MAXSAT instances generated from real-world problems in program analysis and information retrieval.

4.2.2 Example

We illustrate the Q-MAXSAT problem and our solution with the help of an example. Figure 4.3 represents a large MAXSAT formula ϕ in conjunctive form. Each variable vi in ϕ is represented as a node in the graph, labeled by its subscript i. Each clause is a disjunction of literals with a positive weight. Nodes marked T (or F) indicate soft clauses of the form (100, vi) (or (100, ¬vi)), while each edge from node i to node j represents a soft clause of the form (5, ¬vi ∨ vj).

[Figure omitted: a graph over nodes 1–8, some marked T or F.]
Figure 4.3: Graph representation of a large MAXSAT formula ϕ.

Suppose we are interested in the assignment to the query v6 in ϕ (shown by the dashed node in Figure 4.3). Then the Q-MAXSAT instance that we want to solve is (ϕ, {v6}). A solution to this Q-MAXSAT instance is a partial model that maps v6 to true or false such that there is a completion of this partial model maximizing the objective value of ϕ.

A naive approach to solve this problem is to directly feed ϕ into a MAXSAT solver and extract the assignment to v6 from the solution. However, this approach is highly inefficient due to the large size of practical instances.

Our approach exploits the fact that the Q-MAXSAT instance only requires an assignment to specific query variables, and solves the problem lazily in an iterative manner. The high-level strategy of our approach (see Section 4.2.4 for details) is to only solve a subset of clauses relevant to the query in each iteration, and terminate when the current solution can be proven to be the solution to the Q-MAXSAT instance. In particular, our approach proceeds in the following four steps.

Initialize. Given a Q-MAXSAT instance, our query-guided approach constructs an initial set of clauses that includes all the clauses containing a query variable. We refer to these relevant clauses as the workset of our algorithm.

Solve. The next step is to solve the MAXSAT instance induced by the workset. Our approach uses an existing off-the-shelf MAXSAT solver for this purpose. This produces a partial model of the original instance.

Figure 4.4: Graph representation of each iteration in our algorithm when it solves the Q-MAXSAT instance (ϕ, {v6}): (a) Iteration 1; (b) Iteration 2; (c) Iteration 3.

Check. The key step in our approach is to check whether the current partial model can be completed to a model of the original MAXSAT instance. Since the workset only includes a small subset of clauses from the complete instance, the challenge here is to summarize the effect of the unexplored clauses on the assignment to the query variables. We propose a novel technique for performing this check efficiently without actually considering all the unexplored clauses (see Section 4.2.4.1 for formal details).

Expand. If the check in the previous step fails, it indicates that we need to grow our workset of relevant clauses. Based on the check failure, in this step, our approach identifies the set of clauses to be added to the workset for the next iteration.

As long as the Q-MAXSAT instance is finite, our iterative approach is guaranteed to terminate since it only grows the workset in each iteration.
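The four steps above can be sketched as a small driver loop. This is an illustrative sketch, not the PILOT implementation: the clause encoding (a frozenset of signed literals), `brute_maxsat` (a stand-in for an off-the-shelf MAXSAT solver), and `expand_all` (a trivially conservative check that simply grows the workset to the whole formula) are all our own names, and we assume every query variable occurs in some clause of ϕ. Algorithm 9 later replaces the trivial check with a lazy one.

```python
from itertools import product

# Soft clause = (weight, clause); clause = frozenset of (variable, polarity) literals.
def eval_formula(phi, alpha):
    return sum(w for (w, c) in phi if any(alpha.get(v, 0) == p for (v, p) in c))

def brute_maxsat(phi):
    # Stand-in for an off-the-shelf MAXSAT solver: enumerate all assignments.
    vs = sorted({v for (_, c) in phi for (v, _) in c})
    return max((dict(zip(vs, bits)) for bits in product((0, 1), repeat=len(vs))),
               key=lambda a: eval_formula(phi, a))

def solve_qmaxsat(phi, queries, check):
    # Initialize: all clauses that mention a query variable.
    workset = {s for s in phi if {v for (v, _) in s[1]} & queries}
    while True:
        alpha = brute_maxsat(workset)      # Solve the current workset.
        new = check(phi, workset, alpha)   # Check optimality of the partial model.
        if not new:                        # Check passed: queries are resolved.
            return {v: alpha[v] for v in queries}
        workset |= new                     # Expand and iterate.

# Trivially sound (but maximally eager) check: expand to the full formula.
def expand_all(phi, workset, alpha):
    return phi - workset
```

On the chain formula of Definition 17 below, the loop resolves the query a to false after one expansion.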

We next describe how our approach solves the Q-MAXSAT instance (ϕ, {v6}). To resolve v6, our approach initially constructs the workset ϕ′ that includes all the clauses containing v6. We represent ϕ′ by the subgraph contained within the dotted area in Figure 4.4(a). By invoking a MAXSAT solver on ϕ′, we get a partial model αϕ′ = [v4 ↦ false, v5 ↦ false, v6 ↦ false, v7 ↦ false, v8 ↦ false] as shown in Figure 4.4(a), with an objective value of 20. Our approach next checks if the current partial model found is a solution to the Q-MAXSAT instance (ϕ, {v6}). It constructs a set of clauses ψ which succinctly summarizes the effect of the clauses that are not present in the workset. We refer to this set as the summarization set, and use the following expression to overestimate the gap between the optimum objective value under the current partial model and the optimum objective value of ϕ:

maxα eval(ϕ′ ∪ ψ, α) − maxα eval(ϕ′, α),

where maxα eval(ϕ′ ∪ ψ, α) and maxα eval(ϕ′, α) are the optimum objective values of ϕ′ ∪ ψ and ϕ′, respectively.

To construct such a summarization set, our insight is that the clauses that are not present in the workset can only affect the query assignment via clauses sharing variables with the workset. We call such clauses frontier clauses. Furthermore, if a frontier clause is already satisfied by the current partial model, expanding the workset with such a clause cannot further improve the partial model. We now construct a summarization set ψ by taking all frontier clauses not satisfied by αϕ′. As a result, our algorithm produces ψ = {(100, v4), (100, v8)}. The check comparing the optimum objective values of ϕ′ and ϕ′ ∪ ψ fails in this case. In particular, by solving ϕ′ ∪ ψ, we get a partial model αϕ′∪ψ = [v4 ↦ true, v5 ↦ true, v6 ↦ true, v7 ↦ true, v8 ↦ true] with 220 as the objective value, which is greater than the optimum objective value of ϕ′. As a consequence, our approach expands the workset ϕ′ with (100, v4) and (100, v8) and proceeds to the next iteration. We include these two clauses as they are not satisfied by αϕ′ in the last iteration but are satisfied by αϕ′∪ψ, which indicates they are likely responsible for the failure of the previous check.

In iteration 2, invoking a MAXSAT solver on the new workset ϕ′ produces αϕ′ = [v4 ↦ true, v5 ↦ true, v6 ↦ true, v7 ↦ true, v8 ↦ true], as shown in Figure 4.4(b), with 220 as the objective value. The violated frontier clauses in this case are (100, ¬v7), (5, ¬v5 ∨ v2), and (5, ¬v5 ∨ v3). However, if this set of violated frontier clauses were to be used as the summarization set ψ, the newly added variables v2 and v3 would cause the check comparing the optimum objective values of ϕ′ and ϕ′ ∪ ψ to trivially fail. To address this problem, and further improve the precision of our check, we modify the summarization set to ψ = {(100, ¬v7), (5, ¬v5), (5, ¬v5)} by removing v2 and v3 and strengthening the corresponding clauses. In this case, it is equivalent to setting v2 and v3 to false (marked as F in the graph). Intuitively, by setting these variables to false, we are overestimating the effects of the unexplored clauses, by assuming these frontier clauses will not be satisfied by unexplored variables like v2 and v3. By solving ϕ′ ∪ ψ, we get a partial model αϕ′∪ψ = [v4 ↦ true, v5 ↦ false, v6 ↦ true, v7 ↦ false, v8 ↦ true] with 320 as the objective value, which is greater than the optimum objective value of ϕ′.

In iteration 3, as shown in Figure 4.4(c), our approach expands the workset ϕ′ with (100, ¬v7), (5, ¬v5 ∨ v2), and (5, ¬v5 ∨ v3). By invoking a MAXSAT solver on ϕ′, we produce αϕ′ = [v2 ↦ true, v3 ↦ true, v4 ↦ true, v5 ↦ true, v6 ↦ true, v7 ↦ false, v8 ↦ true] with an objective value of 325. We omit the edges representing frontier clauses that are already satisfied by αϕ′ in Figure 4.4(c). To check whether the current partial model is a solution to the Q-MAXSAT instance, we construct the strengthened summarization set ψ = {(5, ¬v3)} from the frontier clause (5, ¬v3 ∨ v1). By invoking MAXSAT on ϕ′ ∪ ψ, we get an optimum objective value of 325, which is the same as that of ϕ′. As a result, our algorithm terminates and extracts [v6 ↦ true] as the solution to the Q-MAXSAT instance.

Despite the fact that many clauses and variables can reach v6 in the graph (i.e., they might affect the assignment to the query), our approach successfully resolves v6 in three iterations and only explores a very small, local subset of the graph, while successfully summarizing the effects of the unexplored clauses.

(variable)     v ∈ V
(clause)       c ::= v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m | 1 | 0
(weight)       w ∈ R⁺
(soft clause)  s ::= (w, c)
(formula)      ϕ ::= {s1, ..., sn}
(model)        α ∈ V → {0, 1}

fst = λ(w, c). w    snd = λ(w, c). c
∀α : α ⊨ 1    ∀α : α ⊭ 0
α ⊨ v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m ⇔ ∃i : α(vi) = 1 or ∃j : α(v′j) = 0
eval({s1, ..., sn}, α) = Σi=1..n { fst(si) | α ⊨ snd(si) }
MaxSAT(ϕ) = argmaxα eval(ϕ, α)

Figure 4.5: Syntax and interpretation of MAXSAT formulae.

4.2.3 The Q-MAXSAT Problem

First, we describe the standard MAXSAT problem [100]. The syntax of a MAXSAT formula is shown in Figure 4.5. A MAXSAT formula ϕ consists of a set of soft clauses. Each soft clause s = (w, c) is a pair that consists of a positive weight w ∈ R⁺, and a clause c that is a disjunctive form over a set of variables V.² We use 1 to denote true and 0 to denote false. Given an assignment α : V → {0, 1} to each variable in a MAXSAT formula ϕ, we use eval(ϕ, α) to denote the sum of the weights of the soft clauses in the formula that are satisfied by the assignment. We call this sum the objective value of the formula under that assignment. The space of models MaxSAT(ϕ) of a MAXSAT formula ϕ is the set of all assignments that maximize the objective value of ϕ.

The query-guided maximum satisfiability or Q-MAXSAT problem is an extension of the MAXSAT problem that augments the MAXSAT formula with a set of queries Q ⊆ V. In contrast to the MAXSAT problem, where the objective is to find an assignment to all variables V, Q-MAXSAT only aims to find a partial model αQ : Q → {0, 1}. In particular, given a MAXSAT formula ϕ and a set of queries Q, the Q-MAXSAT problem seeks a partial model αQ : Q → {0, 1} for ϕ, that is, an assignment to the variables in Q such that there exists a completion α : V → {0, 1} of αQ that is a model of the MAXSAT formula ϕ. Formally:

²Without loss of generality, we assume that MAXSAT formulae contain only soft clauses. Every hard clause can be converted into a soft clause with a sufficiently high weight.

Definition 17 (Q-MAXSAT problem). Given a MAXSAT formula ϕ, and a set of queries Q ⊆ V, a model of the Q-MAXSAT instance (ϕ, Q) is a partial model αQ : Q → {0, 1} such that

∃α ∈ MaxSAT(ϕ). (∀v ∈ Q. αQ(v) = α(v)).

Example. Let (ϕ, Q), where ϕ = {(5, ¬a ∨ b), (5, ¬b ∨ c), (5, ¬c ∨ d), (5, ¬d)} and Q = {a}, be a Q-MAXSAT instance. A model of this instance is given by αQ = [a ↦ 0]. Indeed, there is a completion α = [a ↦ 0, b ↦ 0, c ↦ 0, d ↦ 0] that belongs to MaxSAT(ϕ) (in other words, α maximizes the objective value of ϕ, which is equal to 20), and agrees with αQ on the variable a (that is, α(a) = αQ(a) = 0).
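Definition 17 can be checked by brute force on the example above. The following sketch is illustrative only; the clause encoding (a frozenset of (variable, polarity) literals) is our own and not part of the formalism. It enumerates MaxSAT(ϕ) and then projects onto Q.

```python
from itertools import product

# Clause = frozenset of (variable, polarity) literals; polarity 1 = positive literal.
def eval_formula(phi, alpha):
    return sum(w for (w, c) in phi if any(alpha[v] == p for (v, p) in c))

def maxsat_models(phi):
    # Enumerate all assignments and keep those attaining the optimum objective value.
    vs = sorted({v for (_, c) in phi for (v, _) in c})
    models = [dict(zip(vs, bits)) for bits in product((0, 1), repeat=len(vs))]
    best = max(eval_formula(phi, a) for a in models)
    return [a for a in models if eval_formula(phi, a) == best]

# ϕ = {(5, ¬a ∨ b), (5, ¬b ∨ c), (5, ¬c ∨ d), (5, ¬d)}, Q = {a}
phi = {(5, frozenset({("a", 0), ("b", 1)})),
       (5, frozenset({("b", 0), ("c", 1)})),
       (5, frozenset({("c", 0), ("d", 1)})),
       (5, frozenset({("d", 0)}))}
```

Every model in MaxSAT(ϕ) maps a to 0, so αQ = [a ↦ 0] is a model of the Q-MAXSAT instance (ϕ, {a}).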

Hereafter, we use Var(ϕ) to denote the set of variables occurring in a MAXSAT formula ϕ. For a set of variables U ⊆ V, we denote the partial model α : U → {0, 1} by αU. Also, we use αϕ as shorthand for αVar(ϕ). Throughout the paper, we assume that for eval(ϕ, α), the assignment α is well-defined for all variables in Var(ϕ).

4.2.4 Solving a Q-MAXSAT Instance

Algorithm 8 describes our technique for solving the Q-MAXSAT problem. It takes as input a Q-MAXSAT instance (ϕ, Q), and returns a solution to (ϕ, Q) as output. The main idea in the algorithm is to iteratively identify and solve a subset of clauses in ϕ that are relevant to the set of queries Q, which we refer to as the workset of our algorithm.

The algorithm starts by invoking function INIT (line 3), which returns an initial set of clauses ϕ′ ⊆ ϕ. Specifically, ϕ′ is the set of clauses in ϕ which contain variables in the query set Q.

In each iteration (lines 4–12), Algorithm 8 first computes a model αϕ′ of the MAXSAT formula ϕ′ (line 5). Next, in line 6, the function CHECK checks whether the model αϕ′

Algorithm 8
1: INPUT: A Q-MAXSAT instance (ϕ, Q), where ϕ is a MAXSAT formula, and Q is a set of queries.
2: OUTPUT: A solution to the Q-MAXSAT instance (ϕ, Q).
3: ϕ′ := INIT(ϕ, Q)
4: while true do
5:   αϕ′ := MAXSAT(ϕ′)
6:   ϕ″ := CHECK((ϕ, Q), ϕ′, αϕ′), where ϕ″ ⊆ ϕ \ ϕ′
7:   if ϕ″ = ∅ then
8:     return λv.αϕ′(v), where v ∈ Q
9:   else
10:    ϕ′ := ϕ′ ∪ ϕ″
11:  end if
12: end while

Algorithm 9 CHECK((ϕ, Q), ϕ′, αϕ′)
1: INPUT: A Q-MAXSAT instance (ϕ, Q), a MAXSAT formula ϕ′ ⊆ ϕ, and a model αϕ′ ∈ MaxSAT(ϕ′).
2: OUTPUT: A set of clauses ϕ″ ⊆ ϕ \ ϕ′.
3: ψ := APPROX(ϕ, ϕ′, αϕ′)
4: ϕ″ := ϕ′ ∪ ψ
5: αϕ″ := MAXSAT(ϕ″)
6: if eval(ϕ″, αϕ″) − eval(ϕ′, αϕ′) = 0 then
7:   return ∅
8: else
9:   ϕs := {(w, c) | (w, c) ∈ ψ ∧ αϕ″ ⊨ c}
10:  return REFINE(ϕs, ϕ, ϕ′, αϕ′)
11: end if

is sufficient to compute a model of the Q-MAXSAT instance (ϕ, Q). If αϕ′ is sufficient, then the algorithm returns a model which is αϕ′ restricted to the variables in Q (line 8). Otherwise, CHECK returns a set of clauses ϕ″ that is added to the set ϕ′ (line 10). It is easy to see that Algorithm 8 always terminates if ϕ is finite.

The main and interesting challenge is the implementation of the CHECK function so that it is sound (that is, when CHECK returns ∅, the model αϕ′ restricted to the variables in Q is indeed a model of the Q-MAXSAT instance), yet efficient.

4.2.4.1 Implementing an Efficient CHECK Function

Algorithm 9 describes our implementation of the CHECK function. The input to the CHECK function is a Q-MAXSAT instance (ϕ, Q), a MAXSAT formula ϕ′ ⊆ ϕ as described in Algorithm 8, and a model αϕ′ ∈ MaxSAT(ϕ′). The output is a set of clauses ϕ″ that are required to be added to ϕ′ so that the resulting MAXSAT formula ϕ′ ∪ ϕ″ is solved in the next iteration of Algorithm 8. If ϕ″ is empty, then this means that Algorithm 8 can stop and return the appropriate model (as described in line 8 of Algorithm 8).

CHECK starts by calling the function APPROX (line 3), which takes ϕ, ϕ′, and αϕ′ as inputs, and returns a new set of clauses ψ, which we refer to as the summarization set. APPROX analyzes clauses in ϕ \ ϕ′, and returns a much smaller formula ψ which allows us to overestimate the gap between the optimum objective value under the current partial model αϕ′ and the optimum objective value of ϕ. In line 5, CHECK computes a model αϕ″ of the MAXSAT formula ϕ″ = ϕ′ ∪ ψ. Next, in line 6, CHECK compares the objective value of ϕ′ with respect to αϕ′ and the objective value of ϕ″ with respect to αϕ″. If these objective values are equal, CHECK concludes that the partial assignment for the queries in αϕ′ is a model of the Q-MAXSAT problem and returns an empty set (line 7). Otherwise, it computes ϕs in line 9, which is the set of clauses in ψ that are satisfied by αϕ″. Finally, in line 10, CHECK returns the set of clauses to be added to the MAXSAT formula ϕ′. This is computed by REFINE, which takes ϕs, ϕ, ϕ′, and αϕ′ as input. Essentially, REFINE identifies the clauses in ϕ \ ϕ′ which are likely responsible for failing the check in line 6, and uses them to expand the MAXSAT formula ϕ′.

Optimality check via APPROX. The core step of CHECK is line 6, which uses eval(ϕ″, αϕ″) − eval(ϕ′, αϕ′) to overestimate the gap between the optimum objective value under the current partial model αϕ′ and the optimum objective value of ϕ. The key idea here is to apply APPROX to generate a small set of clauses ψ which succinctly summarizes the effect of the unexplored clauses ϕ \ ϕ′. We next describe the specification of the APPROX function, and formally prove the soundness of the optimality check in line 6.

Given a set of variables U ⊆ Var(ϕ) and an assignment αU, we define a substitution operation ϕ[αU] which simplifies ϕ by replacing all occurrences of variables in U with their corresponding values in the assignment αU. Formally:

{s1, ..., sn}[αU] = {s1[αU], ..., sn[αU]}
(w, c)[αU] = (w, c[αU])
1[αU] = 1
0[αU] = 0
(v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m)[αU] =
   1,  if αU ⊨ v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m;
   0,  if {v1, ..., vn} ⊆ U ∧ {v′1, ..., v′m} ⊆ U ∧ αU ⊭ v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m;
   ⋁v∈{v1,...,vn}\U v ∨ ⋁v′∈{v′1,...,v′m}\U ¬v′,  otherwise.
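The substitution can be realized directly over a concrete clause representation. A minimal sketch, using our own encoding (the constant clauses 1 and 0 are represented by the strings "1" and "0", and a clause by a frozenset of (variable, polarity) literals):

```python
TRUE, FALSE = "1", "0"   # the constant clauses 1 and 0

def subst_clause(c, alpha_u):
    if c in (TRUE, FALSE):
        return c
    # Case 1: some literal over U is satisfied by αU -- the clause collapses to 1.
    if any(v in alpha_u and alpha_u[v] == p for (v, p) in c):
        return TRUE
    # Drop the literals falsified by αU; keep the literals over variables outside U.
    rest = frozenset((v, p) for (v, p) in c if v not in alpha_u)
    # Case 2: every variable was in U and none satisfied -- the clause is 0.
    return FALSE if not rest else rest   # Case 3: residual clause over Var \ U

def subst(phi, alpha_u):
    return [(w, subst_clause(c, alpha_u)) for (w, c) in phi]
```

We return a list rather than a set in `subst`, since substitution can make two distinct soft clauses identical and their weights must both be kept.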

Definition 18 (Summarizing unexplored clauses). Given two MAXSAT formulae ϕ and ϕ′ such that ϕ′ ⊆ ϕ, and αϕ′ ∈ MaxSAT(ϕ′), we say ψ = APPROX(ϕ, ϕ′, αϕ′) summarizes the effect of ϕ \ ϕ′ with respect to αϕ′ if and only if:

maxα eval(ϕ′ ∪ ψ, α) ≥ maxα eval(ϕ, α) − maxα eval((ϕ \ ϕ′)[αϕ′], α).

We state two useful facts about eval before proving soundness.

Proposition 19. Let ϕ and ϕ′ be two MAXSAT formulae such that ϕ ∩ ϕ′ = ∅. Then, ∀α. eval(ϕ ∪ ϕ′, α) = eval(ϕ, α) + eval(ϕ′, α).

Proposition 20. Let ϕ and ϕ′ be two MAXSAT formulae. Then,

maxα eval(ϕ, α) + maxα eval(ϕ′, α) ≥ maxα (eval(ϕ, α) + eval(ϕ′, α)).

The theorem below states the soundness of the optimality check performed in line 6 of CHECK.

Theorem 21 (Soundness of optimality check). Given a Q-MAXSAT instance (ϕ, Q), a MAXSAT formula ϕ′ ⊆ ϕ s.t. Var(ϕ′) ⊇ Q, a model αϕ′ ∈ MaxSAT(ϕ′), and ψ = APPROX(ϕ, ϕ′, αϕ′) such that the following condition holds:

maxα eval(ϕ′ ∪ ψ, α) = maxα eval(ϕ′, α).

Then, λv.αϕ′(v), where v ∈ Q, is a model of the Q-MAXSAT instance (ϕ, Q).

Proof. Let αQ = λv.αϕ′(v), where v ∈ Q. Let ϕ″ = ϕ[αϕ′]. Construct a completion αϕ of αQ as follows:

αϕ = αϕ′ ⊎ αϕ″, where αϕ″ ∈ MaxSAT(ϕ″).

It suffices to show that αϕ ∈ MaxSAT(ϕ), that is, eval(ϕ, αϕ) = maxα eval(ϕ, α). We have:

eval(ϕ, αϕ)
= eval(ϕ, αϕ′ ⊎ αϕ″)
= eval(ϕ[αϕ′], αϕ″)
= maxα eval(ϕ[αϕ′], α)   [since ϕ″ = ϕ[αϕ′] and αϕ″ ∈ MaxSAT(ϕ″)]
= maxα eval(ϕ′[αϕ′] ∪ (ϕ \ ϕ′)[αϕ′], α)
= maxα (eval(ϕ′[αϕ′], α) + eval((ϕ \ ϕ′)[αϕ′], α))   [from Prop. 19]
= maxα (eval(ϕ′, αϕ′) + eval((ϕ \ ϕ′)[αϕ′], α))   [since αϕ′ ∈ MaxSAT(ϕ′)]
= eval(ϕ′, αϕ′) + maxα eval((ϕ \ ϕ′)[αϕ′], α)
≥ eval(ϕ′, αϕ′) + maxα eval(ϕ, α) − maxα eval(ϕ′ ∪ ψ, α)   [since ψ = APPROX(ϕ, ϕ′, αϕ′), see Defn. 18]
= maxα eval(ϕ′, α) + maxα eval(ϕ, α) − maxα eval(ϕ′ ∪ ψ, α)   [since αϕ′ ∈ MaxSAT(ϕ′)]
= maxα eval(ϕ′ ∪ ψ, α) + maxα eval(ϕ, α) − maxα eval(ϕ′ ∪ ψ, α)   [from the condition in the theorem statement]
= maxα eval(ϕ, α).

Since eval(ϕ, αϕ) ≤ maxα eval(ϕ, α) holds trivially, the inequality above is in fact an equality, and hence αϕ ∈ MaxSAT(ϕ). ∎

Discussion. There are a number of possibilities for the APPROX function that satisfy the specification in Definition 18. The quality of such an APPROX function can be measured by two criteria: the efficiency and the precision of the optimality check (lines 3 and 6 in Algorithm 9).

Given that eval can be computed efficiently, the cost of the optimality check primarily depends on the cost of computing ψ via APPROX, and the cost of invoking the MAXSAT solver on ϕ′ ∪ ψ. Since MAXSAT is known to be a computationally hard problem, a simple ψ returned by APPROX can significantly speed up the optimality check.

On the other hand, a precise optimality check can significantly reduce the number of iterations of Algorithm 8. We define a partial order ⪯ on APPROX functions that compares the precision of the optimality check via them. We say APPROX1 is more precise than APPROX2 (denoted by APPROX2 ⪯ APPROX1) if, for any given MAXSAT formulae ϕ, ϕ′ ⊆ ϕ, and αϕ′ ∈ MaxSAT(ϕ′), the optimum objective value of ϕ′ ∪ APPROX1(ϕ, ϕ′, αϕ′) is no larger than that of ϕ′ ∪ APPROX2(ϕ, ϕ′, αϕ′). More formally:

APPROX2 ⪯ APPROX1 ⇔ ∀ϕ, ϕ′ ⊆ ϕ, αϕ′ ∈ MaxSAT(ϕ′) :
maxα eval(ϕ′ ∪ ψ1, α) ≤ maxα eval(ϕ′ ∪ ψ2, α),
where ψ1 = APPROX1(ϕ, ϕ′, αϕ′), ψ2 = APPROX2(ϕ, ϕ′, αϕ′).

In Section 4.2.4.2 we introduce three different APPROX functions with increasing order of precision. While the more precise APPROX operators reduce the number of iterations in our algorithm, they are more expensive to compute.

Expanding relevant clauses via REFINE. The REFINE function finds a new set of clauses that are relevant to the queries, and adds them to the workset when the optimality check fails. To guarantee the termination of our approach, when the optimality check fails, REFINE should always return a nonempty set. We describe the details of various REFINE functions in Section 4.2.4.2.

4.2.4.2 Efficient Optimality Check via Distinct APPROX Functions

We introduce three different APPROX functions and their corresponding REFINE functions to construct efficient CHECK functions. These three functions are the ID-APPROX function, the π-APPROX function, and the γ-APPROX function. Each function is constructed by extending the previous one, and their precision order is:

ID-APPROX ⪯ π-APPROX ⪯ γ-APPROX.

The cost of executing each of these APPROX functions also increases with precision. After defining each function and proving that it satisfies the specification of an APPROX function, we discuss the efficiency and the precision of the optimality check using it.

I The ID-APPROX Function

The ID-APPROX function is based on the following observation: for a Q-MAXSAT instance, the clauses not explored by Algorithm 8 can only affect the assignments to the queries via the clauses sharing variables with the workset. We refer to such unexplored clauses as frontier clauses. If all the frontier clauses are satisfied by the current partial model, or they cannot further improve the objective value of the workset, we can construct a model of the Q-MAXSAT instance from the current partial model. Based on these observations, the ID-APPROX function constructs the summarization set by adding all the frontier clauses that are not satisfied by the current partial model.

To define ID-APPROX, we first define what it means for a clause to be satisfied by a partial model. A clause is satisfied by a partial model over a set of variables U if it is satisfied by all completions of that partial model. In other words:

αU ⊨ v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m ⇔ (∃i : vi ∈ U ∧ αU(vi) = 1) or (∃j : v′j ∈ U ∧ αU(v′j) = 0).

Definition 22 (ID-APPROX). Given formulae ϕ, ϕ′ ⊆ ϕ, and a partial model αϕ′ ∈ MaxSAT(ϕ′), ID-APPROX is defined as follows:

ID-APPROX(ϕ, ϕ′, αϕ′) = {(w, c) | (w, c) ∈ (ϕ \ ϕ′) ∧ Var({(w, c)}) ∩ Var(ϕ′) ≠ ∅ ∧ αϕ′ ⊭ c}.

The corresponding REFINE function is:

ID-REFINE(ϕs, ϕ, ϕ′, αϕ′) = ϕs.

Example. Let (ϕ, Q) be a Q-MAXSAT instance, where ϕ = {(2, x), (5, ¬x ∨ y), (5, y ∨ z), (1, ¬y)} and Q = {x, y}. Assume that the workset is ϕ′ = {(2, x), (5, ¬x ∨ y)}. By invoking a MAXSAT solver on ϕ′, we get a model αQ = [x ↦ 1, y ↦ 1]. Both clauses in ϕ \ ϕ′ contain y, where ϕ \ ϕ′ = {(5, y ∨ z), (1, ¬y)}. Since (1, ¬y) is not satisfied by αQ, ID-APPROX(ϕ, ϕ′, αϕ′) produces ψ = {(1, ¬y)}. As the optimum objective values of both ϕ′ and ϕ′ ∪ ψ are 7, we conclude [x ↦ 1, y ↦ 1] is a model of the given Q-MAXSAT instance. Indeed, its completion [x ↦ 1, y ↦ 1, z ↦ 0] is a model of the MAXSAT formula ϕ.
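The example above can be replayed mechanically. A sketch of ID-APPROX over our illustrative encoding (a clause is a frozenset of (variable, polarity) literals; `opt` is a brute-force stand-in for a MAXSAT solver, not the solver used by PILOT):

```python
from itertools import product

def eval_formula(phi, alpha):
    return sum(w for (w, c) in phi if any(alpha.get(v, 0) == p for (v, p) in c))

def opt(phi):
    # Optimum objective value by exhaustive enumeration.
    vs = sorted({v for (_, c) in phi for (v, _) in c})
    return max(eval_formula(phi, dict(zip(vs, bits)))
               for bits in product((0, 1), repeat=len(vs)))

def id_approx(phi, workset, alpha):
    wvars = {v for (_, c) in workset for (v, _) in c}
    return {(w, c) for (w, c) in phi - workset
            if {v for (v, _) in c} & wvars                # frontier clause ...
            and not any(v in alpha and alpha[v] == p      # ... not satisfied by the
                        for (v, p) in c)}                 #     current partial model

# ϕ = {(2, x), (5, ¬x ∨ y), (5, y ∨ z), (1, ¬y)}, workset ϕ′ = {(2, x), (5, ¬x ∨ y)}
phi = {(2, frozenset({("x", 1)})),
       (5, frozenset({("x", 0), ("y", 1)})),
       (5, frozenset({("y", 1), ("z", 1)})),
       (1, frozenset({("y", 0)}))}
work = {(2, frozenset({("x", 1)})), (5, frozenset({("x", 0), ("y", 1)}))}
alpha = {"x": 1, "y": 1}
```

Running the optimality check on this instance reproduces the numbers in the example: ψ = {(1, ¬y)} and both optima equal 7, so the check passes.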

Theorem 23 (Soundness of ID-APPROX). ID-APPROX(ϕ, ϕ′, αϕ′) summarizes the effect of ϕ \ ϕ′ with respect to αϕ′.

Proof. Let ψ = ID-APPROX(ϕ, ϕ′, αϕ′). We show that ID-APPROX satisfies the specification of an APPROX function in Definition 18 by proving the inequality below:

maxα eval(ϕ′ ∪ ψ, α) + maxα eval((ϕ \ ϕ′)[αϕ′], α) ≥ maxα eval(ϕ, α).

We use ϕ1 to represent the set of frontier clauses, and ϕ2 to represent the rest of the clauses in ϕ \ ϕ′:

ϕ1 = {(w, c) ∈ (ϕ \ ϕ′) | Var({(w, c)}) ∩ Var(ϕ′) ≠ ∅},
ϕ2 = {(w, c) ∈ (ϕ \ ϕ′) | Var({(w, c)}) ∩ Var(ϕ′) = ∅}.

We further split the set of frontier clauses ϕ1 into two sets:

ϕ1′ = {(w, c) ∈ ϕ1 | αϕ′ ⊭ c},   ϕ1″ = {(w, c) ∈ ϕ1 | αϕ′ ⊨ c}.

ϕ1′ is effectively what is returned by ID-APPROX(ϕ, ϕ′, αϕ′). We first prove the following claim:

∀α. eval(ϕ1″[αϕ′] ∪ ϕ2, α) ≥ eval(ϕ1″ ∪ ϕ2, α).   (4.1)

We have:

eval(ϕ1″[αϕ′] ∪ ϕ2, α)
= eval(ϕ1″[αϕ′], α) + eval(ϕ2, α)   [from Prop. 19]
= maxα eval(ϕ1″, α) + eval(ϕ2, α)   [since ∀(w, c) ∈ ϕ1″. αϕ′ ⊨ c]
≥ eval(ϕ1″, α) + eval(ϕ2, α)
= eval(ϕ1″ ∪ ϕ2, α)   [from Prop. 19]

Now we prove the theorem. We have:

maxα eval(ϕ′ ∪ ψ, α) + maxα eval((ϕ \ ϕ′)[αϕ′], α)
= maxα eval(ϕ′ ∪ ϕ1′, α) + maxα eval((ϕ \ ϕ′)[αϕ′], α)   [since ψ = ID-APPROX(ϕ, ϕ′, αϕ′), see Defn. 22]
= maxα eval(ϕ′ ∪ ϕ1′, α) + maxα eval(ϕ1′[αϕ′] ∪ ϕ1″[αϕ′] ∪ ϕ2[αϕ′], α)
≥ maxα eval(ϕ′ ∪ ϕ1′, α) + maxα eval(ϕ1″[αϕ′] ∪ ϕ2[αϕ′], α)   [since ∀ϕ, ϕ′. maxα eval(ϕ ∪ ϕ′, α) ≥ maxα eval(ϕ, α)]
= maxα eval(ϕ′ ∪ ϕ1′, α) + maxα eval(ϕ1″[αϕ′] ∪ ϕ2, α)   [since Var(ϕ′) ∩ Var(ϕ2) = ∅]
≥ maxα eval(ϕ′ ∪ ϕ1′, α) + maxα eval(ϕ1″ ∪ ϕ2, α)   [from (4.1)]
≥ maxα (eval(ϕ′ ∪ ϕ1′, α) + eval(ϕ1″ ∪ ϕ2, α))   [from Prop. 20]
= maxα eval(ϕ′ ∪ ϕ1′ ∪ ϕ1″ ∪ ϕ2, α)   [from Prop. 19]
= maxα eval(ϕ, α). ∎

Discussion. The ID-APPROX function effectively exploits the observation that, in a potentially very large Q-MAXSAT instance, the unexplored part of the formula can only impact the assignments to the queries via the frontier clauses. In practice, the set of frontier clauses is usually much smaller than the set of all unexplored clauses, resulting in an efficient invocation of the MAXSAT solver in the optimality check. Moreover, ID-APPROX is cheap to compute as it can be implemented via a linear scan of the unexplored clauses.

However, the precision of ID-APPROX may not be satisfactory, causing the optimality check to be overly conservative. In particular, if the clauses in the summarization set generated by ID-APPROX contain any variable that is not used in any clause in the workset, then the check will fail. To overcome this limitation, we next introduce the π-APPROX function.

II The π-APPROX Function

π-APPROX improves the precision of ID-APPROX by exploiting the following observation: if the frontier clauses violated by the current partial model have relatively low weights, even though they may contain new variables, it is very likely the case that we can resolve the queries with the workset. To overcome the limitation of ID-APPROX, π-APPROX generates the summarization set by applying a strengthening function on the frontier clauses violated by the current partial model. We define the strengthening function retain below.

Definition 24 (retain). We define retain(c, V) as follows:

retain(1, V) = 1
retain(0, V) = 0
retain(v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m, V) =
   let V1 = V ∩ {v1, ..., vn} and V2 = V ∩ {v′1, ..., v′m} in
      0,  if V1 = V2 = ∅;
      ⋁u∈V1 u ∨ ⋁u′∈V2 ¬u′,  otherwise.

Definition 25 (π-APPROX). Given a formula ϕ, a formula ϕ′ ⊆ ϕ, and αϕ′ ∈ MaxSAT(ϕ′), π-APPROX is defined as follows:

π-APPROX(ϕ, ϕ′, αϕ′) = {(w, retain(c, Var(ϕ′))) | (w, c) ∈ ϕ \ ϕ′ ∧ Var({(w, c)}) ∩ Var(ϕ′) ≠ ∅ ∧ αϕ′ ⊭ c}.

The corresponding REFINE function is:

π-REFINE(ϕs, ϕ, ϕ′, αϕ′) = {(w, c) ∈ ϕ \ ϕ′ | (w, retain(c, Var(ϕ′))) ∈ ϕs}.

Example. Let (ϕ, Q), where ϕ = {(5, x), (5, ¬x ∨ y), (1, ¬y ∨ z), (5, ¬z)} and Q = {x, y}, be a Q-MAXSAT instance. Assume that the workset is ϕ′ = {(5, x), (5, ¬x ∨ y)}. By invoking a MAXSAT solver on ϕ′, we get αQ = [x ↦ 1, y ↦ 1] with the objective value of ϕ′ being 10. The only clause in ϕ \ ϕ′ that uses x or y is (1, ¬y ∨ z), which is not satisfied by αQ. By applying π-APPROX, we generate the summarization set ψ = {(1, ¬y)}. By invoking a MAXSAT solver on ϕ′ ∪ ψ, we find its optimum objective value to be 10, which is the same as that of ϕ′. Thus, we conclude that αQ is a solution of the Q-MAXSAT instance (ϕ, {x, y}). Indeed, the model [x ↦ 1, y ↦ 1, z ↦ 0], which is a completion of αQ, is a solution of the MAXSAT formula ϕ. On the other hand, the optimality check using the ID-APPROX function would return ψ′ = {(1, ¬y ∨ z)} as the summarization set. By invoking the MAXSAT solver on ϕ′ ∪ ψ′, we get an optimum objective value of 11 with [x ↦ 1, y ↦ 1, z ↦ 1]. As a result, the optimality check with ID-APPROX fails here because of the presence of z in the summarization set.
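The strengthening in this example is easy to reproduce. A sketch of retain and π-APPROX over our illustrative encoding, where an empty frozenset plays the role of the constant 0 (no literal of an empty clause can ever be satisfied), and `opt` is again a brute-force stand-in for a MAXSAT solver:

```python
from itertools import product

def eval_formula(phi, alpha):
    return sum(w for (w, c) in phi if any(alpha.get(v, 0) == p for (v, p) in c))

def opt(phi):
    vs = sorted({v for (_, c) in phi for (v, _) in c})
    return max(eval_formula(phi, dict(zip(vs, bits)))
               for bits in product((0, 1), repeat=len(vs)))

def retain(c, keep):
    # Keep only literals over `keep`; an empty result acts as the constant 0.
    return frozenset((v, p) for (v, p) in c if v in keep)

def pi_approx(phi, workset, alpha):
    keep = {v for (_, c) in workset for (v, _) in c}
    return {(w, retain(c, keep)) for (w, c) in phi - workset
            if {v for (v, _) in c} & keep                  # frontier clause ...
            and not any(v in alpha and alpha[v] == p       # ... violated by the
                        for (v, p) in c)}                  #     partial model

# ϕ = {(5, x), (5, ¬x ∨ y), (1, ¬y ∨ z), (5, ¬z)}, workset ϕ′ = {(5, x), (5, ¬x ∨ y)}
phi = {(5, frozenset({("x", 1)})),
       (5, frozenset({("x", 0), ("y", 1)})),
       (1, frozenset({("y", 0), ("z", 1)})),
       (5, frozenset({("z", 0)}))}
work = {(5, frozenset({("x", 1)})), (5, frozenset({("x", 0), ("y", 1)}))}
alpha = {"x": 1, "y": 1}
```

The strengthened set makes the check pass (both optima are 10), while the unstrengthened frontier clause would push the optimum to 11 and fail it.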

To prove that π-APPROX satisfies the specification of APPROX in Definition 18, we first introduce a decomposition function.

Definition 26 (π-DECOMP). Given a formula ϕ and a set of variables V ⊆ Var(ϕ), let V′ = Var(ϕ) \ V. Then, we define π-DECOMP(ϕ, V) = (ϕ1, ϕ2) such that:

ϕ1 = {(w, retain(c, V)) | (w, c) ∈ ϕ ∧ Var({(w, c)}) ∩ V ≠ ∅},
ϕ2 = {(w, retain(c, V′)) | (w, c) ∈ ϕ ∧ (Var({(w, c)}) ∩ V′ ≠ ∅ ∨ c = 1 ∨ c = 0)}.

Lemma 27. Let π-DECOMP(ϕ, V) = (ϕ1, ϕ2), where V ⊆ Var(ϕ). We have eval(ϕ1 ∪ ϕ2, α) ≥ eval(ϕ, α) for all α.

Proof. To simplify the proof, we first remove the clauses with no variables from both ϕ and ϕ1 ∪ ϕ2. We use the following two sets to represent such clauses:

ϕ3 = {(w, c) ∈ ϕ | c = 1},   ϕ4 = {(w, c) ∈ ϕ | c = 0}.

We also define the following two sets:

ϕ3′ = {(w, c) ∈ ϕ2 | c = 1},   ϕ4′ = {(w, c) ∈ ϕ2 | c = 0}.

Based on the definition of ϕ2, we have ϕ3′ = ϕ3 and ϕ4′ = ϕ4. Therefore, the inequality in the lemma can be rewritten as:

eval(ϕ1 ∪ ϕ2 \ (ϕ3′ ∪ ϕ4′), α) ≥ eval(ϕ \ (ϕ3 ∪ ϕ4), α).

From this point on, we assume ϕ = ϕ \ (ϕ3 ∪ ϕ4) and ϕ2 = ϕ2 \ (ϕ3′ ∪ ϕ4′). Let V′ = Var(ϕ) \ V. We prove the lemma by showing that, for any s = (w, c) ∈ ϕ and any α,

eval({(w, retain(c, V))}, α) + eval({(w, retain(c, V′))}, α) ≥ eval({(w, c)}, α).   (1)

Let S = {(w, c)}, S1 = {(w, retain(c, V))}, and S2 = {(w, retain(c, V′))}. We prove the above claim with respect to three different cases:

1. If Var(S) ∩ V = ∅, then we have S1 = {(w, 0)} and S2 = S. Since ∀w, α. eval({(w, 0)}, α) = 0, we have ∀α. eval(S1, α) + eval(S2, α) = eval(S, α).

2. If Var(S) ∩ V′ = ∅, then we have S1 = S and S2 = {(w, 0)}. Similar to Case 1, we have ∀α. eval(S1, α) + eval(S2, α) = eval(S, α).

3. If Var(S) ∩ V ≠ ∅ and Var(S) ∩ V′ ≠ ∅, we prove the case by converting S, S1, and S2 into their equivalent pseudo-Boolean functions. A pseudo-Boolean function f is a multi-linear polynomial, which maps an assignment of a set of Boolean variables to a real number:

f(v1, ..., vn) = a0 + Σi=1..m (ai Πj=1..p vj),   where ai ∈ R⁺.

Any MAXSAT formula can be converted into a pseudo-Boolean function. The conversion ⟨·⟩ is as follows:

⟨{s1, ..., sn}⟩ = Σi=1..n (fst(si) − fst(si) · ⟨snd(si)⟩)
⟨v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m⟩ = Πi=1..n (1 − vi) · Πj=1..m v′j
⟨1⟩ = 0
⟨0⟩ = 1
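The conversion ⟨·⟩ can be checked exhaustively on small formulae. A sketch over our illustrative encoding (`formula_poly` builds the pseudo-Boolean function Σi wi (1 − ⟨ci⟩) as a closure rather than a symbolic polynomial):

```python
from itertools import product

def clause_poly(c):
    # ⟨c⟩: product of (1 - v) over positive literals and v' over negated ones;
    # the product is 1 exactly when the clause is falsified.
    def g(alpha):
        out = 1
        for (v, p) in c:
            out *= (1 - alpha[v]) if p == 1 else alpha[v]
        return out
    return g

def formula_poly(phi):
    terms = [(w, clause_poly(c)) for (w, c) in phi]
    return lambda alpha: sum(w * (1 - g(alpha)) for (w, g) in terms)

def eval_formula(phi, alpha):
    return sum(w for (w, c) in phi if any(alpha[v] == p for (v, p) in c))

# {(5, ¬w ∨ x), (3, y ∨ z)}  →  5 − 5w(1 − x) + 3 − 3(1 − y)(1 − z)
phi = {(5, frozenset({("w", 0), ("x", 1)})),
       (3, frozenset({("y", 1), ("z", 1)}))}
```

Enumerating all 16 assignments confirms that eval(ϕ, α) = eval(⟨ϕ⟩, α) on this formula.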

For example, the MAXSAT formula {(5, ¬w ∨ x), (3, y ∨ z)} is converted into the function 5 − 5w(1 − x) + 3 − 3(1 − y)(1 − z). We use eval(f, α) to denote the evaluation of a pseudo-Boolean function f under a model α. From the conversion, we can conclude that ∀ϕ, α. eval(ϕ, α) = eval(⟨ϕ⟩, α).

We now prove the claim (1) under the third case above. We rewrite c = v1 ∨ ... ∨ vn ∨ ¬v′1 ∨ ... ∨ ¬v′m as below:

c = ⋁x∈V1 x ∨ ⋁x′∈V2 ¬x′ ∨ ⋁y∈V3 y ∨ ⋁y′∈V4 ¬y′,

where V1 = V ∩ {v1, ..., vn}, V2 = V ∩ {v′1, ..., v′m}, V3 = V′ ∩ {v1, ..., vn}, and V4 = V′ ∩ {v′1, ..., v′m}. Let c1 = ⋁x∈V1 x ∨ ⋁x′∈V2 ¬x′ and c2 = ⋁y∈V3 y ∨ ⋁y′∈V4 ¬y′. Then, we have S1 = {(w, c1)} and S2 = {(w, c2)}. It suffices to prove that ⟨S1⟩ + ⟨S2⟩ − ⟨S⟩ ≥ 0. We have:

⟨S1⟩ + ⟨S2⟩ − ⟨S⟩
= w − w · Πx∈V1 (1 − x) · Πx′∈V2 x′ + w − w · Πy∈V3 (1 − y) · Πy′∈V4 y′ − (w − w · Πx∈V1 (1 − x) · Πx′∈V2 x′ · Πy∈V3 (1 − y) · Πy′∈V4 y′)
= w (1 − Πx∈V1 (1 − x) · Πx′∈V2 x′) − w · Πy∈V3 (1 − y) · Πy′∈V4 y′ · (1 − Πx∈V1 (1 − x) · Πx′∈V2 x′)
= w (1 − Πx∈V1 (1 − x) · Πx′∈V2 x′)(1 − Πy∈V3 (1 − y) · Πy′∈V4 y′)

Since all variables are Boolean variables, we have

1 − Πx∈V1 (1 − x) · Πx′∈V2 x′ ≥ 0 and 1 − Πy∈V3 (1 − y) · Πy′∈V4 y′ ≥ 0.

Hence ⟨S1⟩ + ⟨S2⟩ − ⟨S⟩ ≥ 0, which proves claim (1). ∎

From Lemma 27, Propositions 19 and 20, we can also conclude that

maxα eval(ϕ1, α) + maxα eval(ϕ2, α) ≥ maxα eval(ϕ1 ∪ ϕ2, α) ≥ maxα eval(ϕ, α),

where π-DECOMP(ϕ, V) = (ϕ1, ϕ2).

We now use the lemma to prove that π-APPROX is an APPROX function as defined in Definition 18.

Theorem 28 (Soundness of π-APPROX). π-APPROX(ϕ, ϕ′, αϕ′) summarizes the effect of ϕ \ ϕ′ with respect to αϕ′.

Proof. Let ψ = π-APPROX(ϕ, ϕ′, αϕ′). We show that π-APPROX satisfies the specification of an APPROX function in Definition 18 by proving the inequality below:

maxα eval(ϕ′ ∪ ψ, α) + maxα eval((ϕ \ ϕ′)[αϕ′], α) ≥ maxα eval(ϕ, α).

Without loss of generality, we assume that ϕ does not contain any clause (w, c) where c = 0 or c = 1. Otherwise, we can rewrite the inequality above by removing these clauses from ϕ, ϕ′, and ϕ \ ϕ′.

As in the soundness proof for ID-APPROX, we define ϕ1, ϕ2, ϕ1′, and ϕ1″ as follows:

ϕ1 = {(w, c) ∈ (ϕ \ ϕ′) | Var({(w, c)}) ∩ Var(ϕ′) ≠ ∅},
ϕ2 = {(w, c) ∈ (ϕ \ ϕ′) | Var({(w, c)}) ∩ Var(ϕ′) = ∅},
ϕ1′ = {(w, c) ∈ ϕ1 | αϕ′ ⊭ c},
ϕ1″ = {(w, c) ∈ ϕ1 | αϕ′ ⊨ c}.

We have:

maxα eval(ϕ′ ∪ ψ, α) + maxα eval((ϕ \ ϕ′)[αϕ′], α)
= maxα eval(ϕ′ ∪ ψ, α) + maxα eval(ϕ1′[αϕ′] ∪ ϕ1″[αϕ′] ∪ ϕ2[αϕ′], α)
= maxα eval(ϕ′ ∪ ψ, α) + maxα (eval(ϕ1″[αϕ′], α) + eval(ϕ1′[αϕ′] ∪ ϕ2[αϕ′], α))   [from Prop. 19]
= maxα eval(ϕ′ ∪ ψ, α) + maxα eval(ϕ1″, α) + maxα eval(ϕ1′[αϕ′] ∪ ϕ2[αϕ′], α)   [since ∀(w, c) ∈ ϕ1″. αϕ′ ⊨ c]
= maxα eval(ϕ′ ∪ ψ, α) + maxα eval(ϕ1″, α) + maxα eval(ϕ1′[αϕ′] ∪ ϕ2, α)   [since Var(ϕ′) ∩ Var(ϕ2) = ∅]

Next, we show that π-DECOMP(ϕ′ ∪ ϕ1′ ∪ ϕ2, Var(ϕ′)) = (ϕ′ ∪ ψ, ϕ1′[αϕ′] ∪ ϕ2). Let V1 = Var(ϕ′) and V2 = Var(ϕ) \ V1. For a given clause (w, c) ∈ ϕ′ ∪ ϕ1′ ∪ ϕ2, we show the result of (w, retain(c, V1)) and (w, retain(c, V2)) under the following cases:

1. When (w, c) ∈ ϕ′, retain(c, V1) = c and retain(c, V2) = 0. Hence we have ϕ′ = {(w, retain(c, V1)) | (w, c) ∈ ϕ′ ∧ retain(c, V1) ≠ 0}.

2. When (w, c) ∈ ϕ2, retain(c, V1) = 0 and retain(c, V2) = c. Hence we have ϕ2 = {(w, retain(c, V2)) | (w, c) ∈ ϕ2 ∧ retain(c, V2) ≠ 0}.

3. When (w, c) ∈ ϕ1′, based on the definitions of ψ and ϕ1′, we have ψ = {(w, retain(c, V1)) | (w, c) ∈ ϕ1′ ∧ retain(c, V1) ≠ 0} and ϕ1′[αϕ′] = {(w, retain(c, V2)) | (w, c) ∈ ϕ1′ ∧ retain(c, V2) ≠ 0}.

Therefore, π-DECOMP(ϕ′ ∪ ϕ1′ ∪ ϕ2, Var(ϕ′)) = (ϕ′ ∪ ψ, ϕ1′[αϕ′] ∪ ϕ2). By applying Lemma 27, we can derive

max_α eval(ϕ′ ∪ ψ, α) + max_α eval(ϕ1′[α_ϕ′] ∪ ϕ2, α) ≥ max_α eval(ϕ′ ∪ ϕ1′ ∪ ϕ2, α).    (4.1)

Thus, we can prove the theorem as follows:

max_α eval(ϕ′ ∪ ψ, α) + max_α eval((ϕ \ ϕ′)[α_ϕ′], α)
= max_α eval(ϕ′ ∪ ψ, α) + max_α eval(ϕ1″, α) + max_α eval(ϕ1′[α_ϕ′] ∪ ϕ2, α)
≥ max_α eval(ϕ′ ∪ ϕ1′ ∪ ϕ2, α) + max_α eval(ϕ1″, α)    [from (4.1)]
≥ max_α [eval(ϕ′ ∪ ϕ1′ ∪ ϕ2, α) + eval(ϕ1″, α)]    [from Prop. 20]
≥ max_α eval(ϕ′ ∪ ϕ1′ ∪ ϕ1″ ∪ ϕ2, α)    [from Prop. 19]
= max_α eval(ϕ, α)

Discussion. Similar to ID-APPROX, π-APPROX generates the summarization set from the frontier clauses that are not satisfied by the current solution. Consequently, the performance of the MAXSAT invocation in the optimality check is similar for both π-APPROX and ID-APPROX. π-APPROX further improves the precision of the check by applying the retain operator to strengthen the generated clauses. The retain operator incurs a modest overhead when computing π-APPROX. In practice, however, we find that the additional precision provided by π-APPROX improves the overall performance of the algorithm by terminating the iterative process earlier.
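The retain-based decomposition used throughout these proofs can be sketched concretely. The Python sketch below is our own encoding (a clause is a set of (variable, polarity) literals, and an empty projection stands for the constant-false clause, written 0 in the text); it mirrors the case analysis in the proof above and is not the PILOT implementation:

```python
def retain(clause, variables):
    # Keep only the literals of `clause` whose variable is in `variables`.
    # An empty result is the constant-false clause, written 0 in the text.
    kept = {(v, p) for v, p in clause if v in variables}
    return kept or 0

def pi_decomp(phi, v1, all_vars):
    # Split each weighted clause of phi over V and V' = all_vars \ V,
    # dropping constant-false projections, as in cases 1-3 of the proof.
    v2 = all_vars - v1
    phi1 = [(w, retain(c, v1)) for w, c in phi if retain(c, v1) != 0]
    phi2 = [(w, retain(c, v2)) for w, c in phi if retain(c, v2) != 0]
    return phi1, phi2

# (100, ¬y ∨ z) splits into (100, ¬y) on the V side and (100, z) on the
# V' side; (1, ¬z) survives only on the V' side.
phi = [(100, {('y', False), ('z', True)}), (1, {('z', False)})]
print(pi_decomp(phi, {'x', 'y'}, {'x', 'y', 'z'}))
```

The split is purely syntactic, which is why π-DECOMP can be applied to the large formula without invoking a solver.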

III The γ-APPROX Function

While the ID-APPROX and π-APPROX functions only consider information from the frontier clauses, γ-APPROX further improves precision by also considering information from the non-frontier clauses. Like π-APPROX, γ-APPROX generates the summarization set by applying the retain function to the frontier clauses violated by the current partial model. The key improvement is that γ-APPROX tries to reduce the weights of the generated clauses by exploiting information from the non-frontier clauses.

Before defining the γ-APPROX function, we first introduce some definitions:

PV(v1 ∨ ... ∨ vn ∨ ¬v1′ ∨ ... ∨ ¬vm′) = {v1, ..., vn},    PV(0) = PV(1) = ∅
NV(v1 ∨ ... ∨ vn ∨ ¬v1′ ∨ ... ∨ ¬vm′) = {v1′, ..., vm′},    NV(0) = NV(1) = ∅
link(v, u, ϕ) ⇔ ∃(w, c) ∈ ϕ : v ∈ NV(c) ∧ u ∈ PV(c)
tReachable(v, ϕ) = {v} ∪ {u | (v, u) ∈ R⁺}, where R = {(v, u) | link(v, u, ϕ)}
fReachable(v, ϕ) = {v} ∪ {u | v ∈ tReachable(u, ϕ)}
tPenalty(v, ϕ) = Σ {w | (w, ¬v1′ ∨ ... ∨ ¬vm′) ∈ ϕ ∧ {v1′, ..., vm′} ∩ tReachable(v, ϕ) ≠ ∅}
fPenalty(v, ϕ) = Σ {w | (w, v1 ∨ ... ∨ vn) ∈ ϕ ∧ {v1, ..., vn} ∩ fReachable(v, ϕ) ≠ ∅}

Intuitively, tPenalty(v, ϕ) overestimates the penalty incurred on the objective value of ϕ by setting variable v to true, while fPenalty(v, ϕ) overestimates the penalty incurred on the objective value of ϕ by setting variable v to false.
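As the later discussion notes, these quantities reduce to simple graph reachability. The sketch below is our own Python encoding (a clause is a (weight, positive-vars, negative-vars) triple; the function names mirror the definitions but are ours):

```python
def t_reachable(v, clauses):
    # tReachable(v, phi): v plus every u that v transitively "forces" true,
    # where v forces u if some clause has v among its negative literals and
    # u among its positive literals (the link relation).
    reach, frontier = {v}, [v]
    while frontier:
        x = frontier.pop()
        for _w, pos, neg in clauses:
            if x in neg:
                for u in pos - reach:
                    reach.add(u)
                    frontier.append(u)
    return reach

def t_penalty(v, clauses):
    # tPenalty(v, phi): total weight of the purely-negative clauses that may
    # be falsified once v (and everything it forces) is set to true.
    reach = t_reachable(v, clauses)
    return sum(w for w, pos, neg in clauses if not pos and neg & reach)

# Example: clause ¬a ∨ b (weight 5) makes a "force" b, and ¬b (weight 3) is
# purely negative, so setting a to true risks losing weight 3.
phi = [(5, {'b'}, {'a'}), (3, set(), {'b'})]
print(t_penalty('a', phi))  # -> 3
```

fPenalty is symmetric, using fReachable (the inverse of the tReachable relation) and the purely-positive clauses.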

We next introduce the γ-APPROX approximation function.

Definition 29 (γ-APPROX). Given ϕ, ϕ′ ⊆ ϕ, and a partial model α_ϕ′ ∈ MaxSAT(ϕ′), γ-APPROX is defined as below:

γ-APPROX(ϕ, ϕ′, α_ϕ′) = {(w′, c′) | (w, c) ∈ ϕ \ ϕ′ ∧ Var({(w, c)}) ∩ Var(ϕ′) ≠ ∅ ∧ α_ϕ′ ⊭ c ∧
                          c′ = retain(c, Var(ϕ′)) ∧ w′ = reduce(w, c, ϕ, ϕ′, α_ϕ′)}.

We next define the reduce function. Let c″ = retain(c, Var(ϕ) \ Var(ϕ′)). Then, reduce is defined as follows:

1. If c″ = 0, then reduce(w, c, ϕ, ϕ′, α_ϕ′) = w.

2. If PV(c″) ≠ ∅ and NV(c″) = ∅, then reduce(w, c, ϕ, ϕ′, α_ϕ′) = min(w, min_{v ∈ PV(c″)} tPenalty(v, (ϕ \ ϕ′)[α_ϕ′])).

3. If PV(c″) = ∅ and NV(c″) ≠ ∅, then reduce(w, c, ϕ, ϕ′, α_ϕ′) = min(w, min_{v ∈ NV(c″)} fPenalty(v, (ϕ \ ϕ′)[α_ϕ′])).

4. If PV(c″) ≠ ∅ and NV(c″) ≠ ∅, then reduce(w, c, ϕ, ϕ′, α_ϕ′) = min(min(w, min_{v ∈ PV(c″)} tPenalty(v, (ϕ \ ϕ′)[α_ϕ′])), min_{v ∈ NV(c″)} fPenalty(v, (ϕ \ ϕ′)[α_ϕ′])).

The corresponding REFINE function is:

γ-REFINE(ϕs, ϕ, ϕ′, α_ϕ′) = {(w, c) | (w, c) ∈ ϕ \ ϕ′ ∧ (reduce(w, c, ϕ, ϕ′, α_ϕ′), retain(c, Var(ϕ′))) ∈ ϕs}

Example. Let (ϕ, Q), where ϕ = {(5, x), (5, ¬x ∨ y), (100, ¬y ∨ z), (1, ¬z)} and Q = {x, y}, be a Q-MAXSAT instance. Suppose that the workset is ϕ′ = {(5, x), (5, ¬x ∨ y)}. By invoking a MAXSAT solver on ϕ′, we get α_Q = [x ↦ 1, y ↦ 1], with the objective value of ϕ′ being 10. The only clause in ϕ \ ϕ′ that uses x or y is (100, ¬y ∨ z), which is not satisfied by α_Q. By applying the retain operator, γ-APPROX strengthens (100, ¬y ∨ z) into (100, ¬y); it further transforms this clause into (1, ¬y) via reduce. By comparing the optimal objective values of ϕ′ and ϕ′ ∪ {(1, ¬y)}, CHECK concludes that α_Q is a solution of the Q-MAXSAT instance. Indeed, [x ↦ 1, y ↦ 1, z ↦ 1], which is a completion of α_Q, is a solution of the MAXSAT formula ϕ. On the other hand, applying π-APPROX would fail the check due to the high weight of the clause generated via retain.

We now explain how reduce works on this example. We first generate c″ = retain(¬y ∨ z, {x, y, z} \ {x, y}) = z. Then, we compute (ϕ \ ϕ′)[α_Q] = {(100, ¬y ∨ z), (1, ¬z)}[x ↦ 1, y ↦ 1] = {(100, z), (1, ¬z)}. As is evident, setting z to 1 only violates (1, ¬z), incurring a penalty of 1 on the objective value of (ϕ \ ϕ′)[α_Q]. As a result, tPenalty(z, (ϕ \ ϕ′)[α_Q]) = 1, which is lower than the weight 100 of the summarization clause (100, ¬y). Hence, reduce returns 1 as the new weight for the summarization clause.
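The numbers in this example are small enough to check by brute-force enumeration. The sketch below is our own encoding (not part of PILOT) and confirms why the γ-APPROX check passes while the π-APPROX check would fail:

```python
from itertools import product

def eval_phi(clauses, alpha):
    # Sum of weights of clauses satisfied by the complete model alpha.
    # A clause is (weight, set of (variable, polarity) literals).
    return sum(w for w, lits in clauses
               if any(alpha[v] == p for v, p in lits))

def optimum(clauses, variables):
    # Brute-force the optimal objective value over all assignments.
    return max(eval_phi(clauses, dict(zip(variables, bits)))
               for bits in product([False, True], repeat=len(variables)))

phi = [(5, {('x', True)}), (5, {('x', False), ('y', True)}),
       (100, {('y', False), ('z', True)}), (1, {('z', False)})]
workset = phi[:2]                               # phi' = {(5, x), (5, ¬x ∨ y)}

print(optimum(workset, ['x', 'y']))             # objective of phi': 10
# gamma-APPROX adds (1, ¬y): the optimum stays 10, so CHECK passes.
print(optimum(workset + [(1, {('y', False)})], ['x', 'y']))    # -> 10
# pi-APPROX would add (100, ¬y) instead: the optimum jumps, so CHECK fails.
print(optimum(workset + [(100, {('y', False)})], ['x', 'y']))  # -> 105
print(optimum(phi, ['x', 'y', 'z']))            # full formula: 110
```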

To prove that γ-APPROX satisfies the specification of APPROX in Definition 18, we first prove two lemmas.

Lemma 30. Given (w, c) ∈ ϕ, v ∈ PV(c), and tPenalty(v, ϕ) < w, we construct ϕ′ as follows:

ϕ′ = ϕ \ {(w, c)} ∪ {(w′, c)}, where w′ = tPenalty(v, ϕ).

Then we have

max_α eval(ϕ, α) = max_{α′} eval(ϕ′, α′) + w − w′.

Proof. We first prove the following claim:

∃α : (eval(ϕ, α) = max_{α′} eval(ϕ, α′)) ∧ α ⊨ c.

We prove this by contradiction. Suppose that every model α maximizing the objective value of ϕ satisfies α ⊭ c. Pick such an α and construct a model α1 in the following way:

α1(u) = α(u) if u ∉ tReachable(v, ϕ), and α1(u) = 1 otherwise.

Based on the definition of tPenalty, we have

eval(ϕ, α1) ≥ eval(ϕ, α) − tPenalty(v, ϕ) + w.

Since w > tPenalty(v, ϕ), α1 yields an objective value no worse than that of α, so α1 also maximizes the objective value of ϕ. Moreover, α1 ⊨ c, since v ∈ PV(c) and α1(v) = 1. This contradicts our assumption and proves the claim. Similarly, we can show

∃α′ : (eval(ϕ′, α′) = max_{α″} eval(ϕ′, α″)) ∧ α′ ⊨ c.

Based on the two claims, we can find α ⊨ c and α′ ⊨ c such that

max_{α1} eval(ϕ, α1) = eval(ϕ, α) = w + eval(ϕ \ {(w, c)}, α),
max_{α2} eval(ϕ′, α2) = eval(ϕ′, α′) = w′ + eval(ϕ′ \ {(w′, c)}, α′).

We next show that eval(ϕ \ {(w, c)}, α) = eval(ϕ′ \ {(w′, c)}, α′), again by contradiction. Assuming eval(ϕ \ {(w, c)}, α) > eval(ϕ′ \ {(w′, c)}, α′), we would have eval(ϕ′, α) > eval(ϕ′, α′), because eval(ϕ′, α) = eval(ϕ′ \ {(w′, c)}, α) + w′ and eval(ϕ′ \ {(w′, c)}, α) = eval(ϕ \ {(w, c)}, α). This contradicts the fact that max_{α2} eval(ϕ′, α2) = eval(ϕ′, α′). Thus, we conclude eval(ϕ \ {(w, c)}, α) ≤ eval(ϕ′ \ {(w′, c)}, α′). Similarly, we can show eval(ϕ \ {(w, c)}, α) ≥ eval(ϕ′ \ {(w′, c)}, α′).

Given eval(ϕ \ {(w, c)}, α) = eval(ϕ′ \ {(w′, c)}, α′), we have:

max_{α1} eval(ϕ, α1) = eval(ϕ, α) = w + eval(ϕ \ {(w, c)}, α)
= w + eval(ϕ′ \ {(w′, c)}, α′) = eval(ϕ′, α′) + w − w′
= max_{α2} eval(ϕ′, α2) + w − w′.

Lemma 31. Given (w, c) ∈ ϕ, v ∈ NV(c), and fPenalty(v, ϕ) < w, we construct ϕ′ as follows:

ϕ′ = ϕ \ {(w, c)} ∪ {(w′, c)}, where w′ = fPenalty(v, ϕ).

Then we have

max_α eval(ϕ, α) = max_{α′} eval(ϕ′, α′) + w − w′.

Proof. Analogous to the proof of Lemma 30; we omit the details.

We now prove that γ-APPROX satisfies the specification of APPROX in Definition 18.

Theorem 32 (γ-APPROX). γ-APPROX(ϕ, ϕ′, α_ϕ′) summarizes the effect of ϕ \ ϕ′ with respect to α_ϕ′.

Proof. Let ψ = γ-APPROX(ϕ, ϕ′, α_ϕ′). We show that γ-APPROX satisfies the specification of an APPROX function in Definition 18 by proving the inequality below:

max_α eval(ϕ′ ∪ ψ, α) + max_α eval((ϕ \ ϕ′)[α_ϕ′], α) ≥ max_α eval(ϕ, α)

We define ϕ1 and ϕ2 as follows:

ϕ1 = {(w, c) ∈ (ϕ \ ϕ′) | Var({(w, c)}) ∩ Var(ϕ′) ≠ ∅},
ϕ2 = {(w, c) ∈ (ϕ \ ϕ′) | Var({(w, c)}) ∩ Var(ϕ′) = ∅}.

ϕ1 is the set of frontier clauses, and ϕ2 contains the rest of the clauses in ϕ \ ϕ′. We further split

ϕ1 into the following disjoint sets:

ϕ1¹ = {(w, c) ∈ ϕ1 | α_ϕ′ ⊨ c},
ϕ1² = {(w, c) ∈ ϕ1 | α_ϕ′ ⊭ c ∧ reduce(w, c, ϕ, ϕ′, α_ϕ′) = w},
ϕ1³ = {(w, c) ∈ ϕ1 | α_ϕ′ ⊭ c ∧ w′ < w ∧ ∃v ∈ PV(c) : w′ = tPenalty(v, (ϕ \ ϕ′)[α_ϕ′]) ∧ reduce(w, c, ϕ, ϕ′, α_ϕ′) = w′},
ϕ1⁴ = ϕ1 \ (ϕ1¹ ∪ ϕ1² ∪ ϕ1³).

Effectively, ϕ1², ϕ1³, and ϕ1⁴ contain all the clauses strengthened by γ-APPROX: ϕ1² contains the clauses whose weights are not reduced; ϕ1³ contains the clauses whose weights are reduced through positive literals in them; and ϕ1⁴ contains the clauses whose weights are reduced through negative literals in them.

Further, we define ϕ̂1³ and ϕ̂1⁴ as below:

ϕ̂1³ = {(w′, c) | (w, c) ∈ ϕ1³ ∧ ∃v ∈ PV(c) : w′ = tPenalty(v, (ϕ \ ϕ′)[α_ϕ′]) ∧ reduce(w, c, ϕ, ϕ′, α_ϕ′) = w′},
ϕ̂1⁴ = {(w′, c) | (w, c) ∈ ϕ1⁴ ∧ ∃v ∈ NV(c) : w′ = fPenalty(v, (ϕ \ ϕ′)[α_ϕ′]) ∧ reduce(w, c, ϕ, ϕ′, α_ϕ′) = w′}.

Using Lemmas 30 and 31, we prove the following claim (1):

max_α eval((ϕ \ ϕ′)[α_ϕ′], α)
= max_α eval(ϕ1¹[α_ϕ′] ∪ ϕ1²[α_ϕ′] ∪ ϕ1³[α_ϕ′] ∪ ϕ1⁴[α_ϕ′] ∪ ϕ2[α_ϕ′], α)
≥ max_α eval(ϕ1¹[α_ϕ′] ∪ ϕ1²[α_ϕ′] ∪ ϕ̂1³[α_ϕ′] ∪ ϕ̂1⁴[α_ϕ′] ∪ ϕ2[α_ϕ′], α)
  + Σ {w | (w, c) ∈ ϕ1³ ∪ ϕ1⁴} − Σ {w′ | (w′, c) ∈ ϕ̂1³ ∪ ϕ̂1⁴}

Similar to the soundness proof for π-APPROX, we show:

π-DECOMP(ϕ′ ∪ ϕ1² ∪ ϕ̂1³ ∪ ϕ̂1⁴ ∪ ϕ2, Var(ϕ′)) = (ϕ′ ∪ ψ, ϕ1²[α_ϕ′] ∪ ϕ̂1³[α_ϕ′] ∪ ϕ̂1⁴[α_ϕ′] ∪ ϕ2[α_ϕ′]).

Following Lemma 27, we can derive the following claim (2):

max_α eval(ϕ′ ∪ ψ, α) + max_α eval(ϕ1²[α_ϕ′] ∪ ϕ̂1³[α_ϕ′] ∪ ϕ̂1⁴[α_ϕ′] ∪ ϕ2[α_ϕ′], α)
≥ max_α eval(ϕ′ ∪ ϕ1² ∪ ϕ̂1³ ∪ ϕ̂1⁴ ∪ ϕ2, α).

By contradiction, we can show that the following claim (3) holds:

∀ϕ, (w, c) ∈ ϕ, w′ ≤ w : max_α eval(ϕ, α) ≤ max_α eval(ϕ \ {(w, c)} ∪ {(w′, c)}, α) + w − w′.

Combining claims (1), (2), and (3), we have

max_α eval(ϕ′ ∪ ψ, α) + max_α eval((ϕ \ ϕ′)[α_ϕ′], α)
≥ max_α eval(ϕ′ ∪ ψ, α) + max_α eval(ϕ1¹[α_ϕ′] ∪ ϕ1²[α_ϕ′] ∪ ϕ̂1³[α_ϕ′] ∪ ϕ̂1⁴[α_ϕ′] ∪ ϕ2[α_ϕ′], α)
  + Σ {w | (w, c) ∈ ϕ1³ ∪ ϕ1⁴} − Σ {w′ | (w′, c) ∈ ϕ̂1³ ∪ ϕ̂1⁴}
≥ max_α eval(ϕ′ ∪ ϕ1² ∪ ϕ̂1³ ∪ ϕ̂1⁴ ∪ ϕ2, α) + max_α eval(ϕ1¹, α)
  + Σ {w | (w, c) ∈ ϕ1³ ∪ ϕ1⁴} − Σ {w′ | (w′, c) ∈ ϕ̂1³ ∪ ϕ̂1⁴}
≥ max_α eval(ϕ′ ∪ ϕ1¹ ∪ ϕ1² ∪ ϕ̂1³ ∪ ϕ̂1⁴ ∪ ϕ2, α)
  + Σ {w | (w, c) ∈ ϕ1³ ∪ ϕ1⁴} − Σ {w′ | (w′, c) ∈ ϕ̂1³ ∪ ϕ̂1⁴}
≥ max_α eval(ϕ, α).

Discussion. γ-APPROX is the most precise of the three functions, as it is the only one that considers the effects of non-frontier clauses. This improved precision comes at the cost of computing additional information from the non-frontier clauses. Such information can be computed in polynomial time via graph reachability algorithms. In practice, we find this overhead negligible compared to the performance boost for the overall approach. Therefore, in the empirical evaluation, we use γ-APPROX and its corresponding REFINE function in our implementation.

4.2.5 Empirical Evaluation

This section evaluates our approach on Q-MAXSAT instances generated from several real-world problems in program analysis and information retrieval.

4.2.5.1 Evaluation Setup

We implemented our algorithm for Q-MAXSAT in a tool called PILOT. In all our experiments, we used the MiFuMaX solver [125] as the underlying MAXSAT solver, although PILOT allows using any off-the-shelf MAXSAT solver; we also study the effect of different solvers on the overall performance of PILOT.

All our experiments were performed on a Linux machine with 8 GB RAM and a 3.0 GHz processor. We limited each invocation of the MAXSAT solver to 3 GB RAM and one hour of CPU time. Next, we describe the Q-MAXSAT instances that we considered in our evaluation.

Instances from program analysis. These are instances generated from the aforementioned static bug detection application on two fundamental static analyses for sequential and concurrent programs: a pointer analysis and a datarace analysis. Both analyses are expressed in a framework that combines conventional rules specified by analysis writers with feedback on false alarms provided by analysis users [9]. The framework produces MAXSAT instances whose hard clauses express soundness conditions of the analysis, while soft clauses specify false alarms identified by users. The goal is to automatically generalize user feedback to other false alarms produced by the analysis on the same input program.

Table 4.4: Characteristics of the benchmark programs. Columns "total" and "app" are with and without counting JDK library code, respectively.

benchmark  brief description                         # classes      # methods       bytecode (KB)  source (KLOC)
                                                     app    total   app     total   app    total   app    total
ftp        Apache FTP server                         93     414     471     2,206   29     118     13     130
hedc       web crawler from ETH                      44     353     230     2,134   16     140     6      153
weblech    website download/mirror tool              11     576     78      3,326   6      208     12     194
antlr      generates parsers and lexical analyzers   111    350     1,150   2,370   128    186     29     131
avrora     AVR microcontroller simulator             1,158  1,544   4,234   6,247   222    325     64     193
chart      plots graphs and renders them as PDF      192    750     1,516   4,661   102    306     54     268
luindex    document indexing tool                    206    619     1,390   3,732   102    235     39     190
lusearch   text searching tool                       219    640     1,399   3,923   94     250     40     198
xalan      XML to HTML transforming tool             423    897     3,247   6,044   188    352     129    285

The pointer analysis is a flow-insensitive, context-insensitive, and field-sensitive pointer analysis with allocation site heap abstraction [73]. Each query for this analysis is a Boolean variable that represents whether an unsafe downcast identified by the analysis prior to feedback is a true positive or not. The datarace analysis is a combination of four analyses described in [43]: call-graph, may-alias, thread-escape, and lockset analyses. A query for this analysis is a Boolean variable that represents whether a datarace reported by the analysis prior to feedback is a true positive or not.

We generated Q-MAXSAT instances by running these two analyses on nine Java benchmark programs. Table 4.4 shows the characteristics of these programs. Except for the first three smaller programs in the table, all the other programs are from the DaCapo suite [77].

Table 4.5 shows the numbers of queries, variables, and clauses for the Q-MAXSAT instances corresponding to the above two analyses on these benchmark programs.

Table 4.5: Number of queries, variables, and clauses in the MAXSAT instances generated by running the datarace analysis and the pointer analysis on each benchmark program. The datarace analysis has no queries on antlr and chart as they are sequential programs.

                datarace analysis                   pointer analysis
benchmark   # queries  # variables  # clauses   # queries  # variables  # clauses
ftp         338        1.2M         1.4M        55         2.3M         3M
hedc        203        0.8M         0.9M        36         3.8M         4.8M
weblech     7          0.5M         0.9M        25         5.8M         8.4M
antlr       0          -            -           113        8.7M         13M
avrora      803        0.7M         1.5M        151        11.7M        16.3M
chart       0          -            -           94         16M          22.3M
luindex     3,444      0.6M         1.1M        109        8.5M         11.9M
lusearch    206        0.5M         1M          248        7.8M         10.9M
xalan       11,410     2.6M         4.9M        754        12.4M        18.7M

Instances from information retrieval. These are instances generated from problems in information retrieval. In particular, we consider problems in relational learning, where the goal is to infer new relationships that are likely present in the data based on certain rules. Relational solvers such as Alchemy [81] and Tuffy [80] solve such problems by solving a system of weighted constraints generated from the data and the rules. The weight of

each clause represents the confidence in each inference rule. We consider Q-MAXSAT instances generated from three standard relational learning applications, described next, whose datasets are publicly available [126, 127, 95]. The first application is an advisor recommendation system (AR), which recommends advisors for first year graduate students. The dataset for this application is generated from the AI genealogy project [126] and DBLP [127]. The query specifies whether a professor is

a suitable advisor for a student. The Q-MAXSAT instance generated consists of 10 queries, 0.3 million variables, and 7.9 million clauses. The second application is Entity Resolution (ER), which identifies duplicate entities in a

database. The dataset is generated from the Cora bibliographic dataset [95]. The queries in

this application specify whether two entities in the dataset are the same. The Q-MAXSAT instance generated consists of 25 queries, 3 thousand variables, and 0.1 million clauses. The third application is Information Extraction (IE), which extracts information from

text or semi-structured sources. The dataset is also generated from the Cora dataset. The queries in this application specify extractions of the author, title, and venue of a publication

record. The Q-MAXSAT instance generated consists of 6 queries, 47 thousand variables, and 0.9 million clauses.

Table 4.6: Performance of PILOT and the baseline approach (BASELINE). In all experiments, we used a memory limit of 3 GB and a time limit of one hour for each invocation of the MAXSAT solver in both approaches. Experiments that timed out exceeded either of these limits.

application  benchmark  running time (s)      peak memory (MB)      problem size (thousands)  final solver  iterations
                        PILOT    BASELINE     PILOT     BASELINE    final      max            time (s)
datarace     ftp        53       5            387       589         892        1,400          3             7
analysis     hedc       45       4            260       387         586        925            2             6
             weblech    2        1            199       340         561        937            1             1
             avrora     55       5            416       576         991        1,521          2             6
             luindex    72       4            278       441         657        1,120          2             6
             lusearch   45       3            223       388         575        994            2             8
             xalan      145      21           1,328     1,781       3,649      4,937          15            5
pointer      ftp        16       11           16        1,262       29         2,982          0.1           9
analysis     hedc       23       21           181       1,918       400        4,821          3             7
             weblech    4        timeout      363       timeout     922        8,353          4             1
             antlr      190      timeout      1,405     timeout     3,304      12,993         14            9
             avrora     178      timeout      1,095     timeout     2,649      16,344         13            8
             chart      253      timeout      721       timeout     1,770      22,325         8             6
             luindex    169      timeout      944       timeout     2,175      11,882         12            8
             lusearch   115      timeout      659       timeout     1,545      10,917         8             9
             xalan      646      timeout      1,312     timeout     3,350      18,713         19            8
AR           -          4        timeout      4         timeout     2          7,968          0.3           7
ER           -          13       2            6         44          9          104            0.2           19
IE           -          2        2,760        13        335         27         944            0.05          7

4.2.5.2 Evaluation Result

To evaluate the benefits of being query-guided, we measured the running time and memory consumption of PILOT. We used MiFuMaX as the baseline by running it on the MAXSAT instances obtained from our Q-MAXSAT instances by removing the queries. To better understand the benefits of being query-guided, we also study the size of the clauses posed to the underlying MAXSAT solver in the last iteration of PILOT, and the corresponding solver running time.

Further, to understand the cost of resolving each query, we pick one Q-MAXSAT instance from each domain and evaluate the performance of PILOT by resolving each query separately.

Finally, we study the sensitivity of PILOT’s performance to the underlying MAXSAT solver by running PILOT using three different solvers besides MiFuMaX.

Performance of our approach vs. the baseline approach. Table 4.6 summarizes our evaluation results on Q-MAXSAT instances generated from both domains. Our approach successfully terminated on all instances and significantly outperformed the baseline approach in memory consumption, while the baseline finished on only eleven of the nineteen instances. On the eight largest instances, the baseline approach either ran out of memory (exceeded 3 GB) or timed out (exceeded one hour). Column 'peak memory' shows the peak memory consumption of both approaches on all instances. For the instances on which both approaches terminated, our approach consumed 55.7% less memory than the baseline. Moreover, for large instances containing more than two million clauses, this number further improves to 71.6%. We next compare the running times of both approaches under the 'running time' column. For most instances on which both approaches terminated, our approach does not outperform the baseline in running time. One exception is IE, where our approach terminated in 2 seconds while the baseline approach took 2,760 seconds, a 1,380X speedup. Due to the iterative nature of PILOT, on simple instances where the baseline approach runs efficiently, the overall running time of PILOT can exceed that of the baseline approach.

As column 'iterations' shows, PILOT takes multiple iterations on most instances. However, for challenging instances like IE and other instances where the baseline approach failed to

finish, PILOT yields significant improvement in running time.

To better understand the gains of being query-guided, we study the number of clauses PILOT explored, under the 'problem size' column. Column 'final' shows the number of clauses PILOT posed to the underlying solver in the last iteration, while column 'max' shows the total number of clauses in the Q-MAXSAT instance. On average, PILOT explored only 35.2% of the clauses in each instance. Moreover, for large instances containing over two million clauses, this number improves to 19.4%. Being query-guided allows our approach to lazily explore the problem and only solve the clauses that are relevant to the queries. As

173 1000

800

600

400

200

peak memory (in MB) 0 0 20 40 60 80 100 120 140 160 query index (a) pointer analysis.

5

4

3

2

1

peak memory (in MB) 0 0 2 4 6 8 10 12 query index (b) AR. Figure 4.6: The memory consumption of PILOT when it resolves each query separately on instances generated from (a) pointer analysis and (b) AR. The dotted line represents the memory consumption of PILOT when it resolves all queries together.

a result, our approach significantly outperforms the baseline approach. The benefit of being query-guided is also reflected by the running time of the underlying

MAXSAT solver in our approach. Column ‘final solver time’ shows the running time of the

underlying solver in the last iteration of PILOT. Despite the fact that the baseline approach outperforms PILOT in overall running time for some instances, the running time of the underlying solver is consistently lower than that of the baseline approach. On average, this time is only 42.1% of the time for solving the whole instance.

Effect of resolving queries separately. We studied the cost of resolving each query by evaluating the memory consumption of PILOT when applying it to each query separately.

Figure 4.6 shows the results on one instance generated from the pointer analysis and on the instance generated from AR. The program used in the pointer analysis is avrora. For comparison, the dotted line in the figure shows the memory consumed by PILOT when resolving all queries together. The other instances yield similar trends, and we omit them for brevity.

For the instances generated from the pointer analysis, when each query is resolved separately, except for one outlier, PILOT uses less than 30% of the memory it needs when all queries are resolved together. The result shows that each query is decided by different clauses in the program analysis instances. By resolving them separately, we further improve the performance of PILOT. This is in line with the locality of program analysis problems, which is exploited by various query-driven approaches.

For the instance generated from AR, we observed different results for each query. For eight queries, PILOT uses over 85% of the memory it needs when all queries are resolved together. For the other two queries, however, it uses less than 37% of that memory. Upon further inspection, we found that PILOT uses a similar set of clauses to produce the solutions to the former eight queries. This indicates that for queries correlated with each other, batching them together in PILOT can improve the performance compared to the cumulative cost of resolving them separately.

Table 4.7: Performance of our approach and the baseline approach with different underlying MAXSAT solvers.

                          running time (in sec.)   peak memory (in MB)
instance   solver         PILOT     BASELINE      PILOT     BASELINE
pointer    CCLS2akms      timeout   timeout       timeout   timeout
analysis   Eva500         2,267     timeout       1,379     timeout
           MaxHS          555       timeout       1,296     timeout
           WPM-2014.co    609       timeout       1,127     timeout
           MiFuMaX        178       timeout       1,095     timeout
AR         CCLS2akms      148       timeout       13        timeout
           Eva500         21        timeout       2         timeout
           MaxHS          4         timeout       9         timeout
           WPM-2014.co    6         timeout       2         timeout
           MiFuMaX        4         timeout       4         timeout

Effect of different underlying solvers. To study the effect of the underlying MAXSAT solver, we evaluated the performance of PILOT using CCLS2akms, Eva500, MaxHS, and WPM-2014.co as the underlying solver. CCLS2akms and Eva500 were the winners of the MaxSAT'14 competition for random instances and industrial instances, respectively, while MaxHS ranked third for crafted instances (neither of the two solvers that performed better than MaxHS is publicly available). We used each solver itself as the baseline for comparison. In the evaluation, we used two instances from different domains: one generated from running the pointer analysis on avrora, and the other generated from AR.

Table 4.7 shows the running time and memory consumption of PILOT and the baseline approach under each setting. For convenience, we also include the results with MiFuMaX as the underlying solver in the table. As the table shows, except for the run with CCLS2akms on the pointer analysis instance, PILOT terminated under all the other settings, while none of the baseline approaches finished on any instance. This shows that our approach consistently improves performance regardless of the underlying MAXSAT solver it invokes.

4.2.6 Related Work

The MAXSAT problem is one of the optimization extensions of the SAT problem. The original form of the MAXSAT problem does not allow any hard clauses and requires each soft clause to have a unit weight. A model of this problem is a complete assignment to the variables that maximizes the number of satisfied clauses. Two important variations of this problem are the weighted MAXSAT problem and the partial MAXSAT problem. The weighted MAXSAT problem allows each soft clause to have an arbitrary positive weight instead of limiting it to a unit weight; a model of this problem is a complete assignment that maximizes the sum of the weights of satisfied soft clauses. The partial MAXSAT problem allows hard clauses mixed with soft clauses; a model of this problem is a complete assignment that satisfies all hard clauses and maximizes the number of satisfied soft clauses. The combination of these two variations yields the weighted partial MAXSAT problem, which is commonly referred to simply as the MAXSAT problem, and is the problem addressed in our work.
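The distinctions above can be made concrete with a toy enumerator for the weighted partial variant. This is a naive exponential sketch for illustration only (our own encoding; a real solver would never enumerate assignments):

```python
from itertools import product

def satisfied(clause, alpha):
    # clause: set of (variable, polarity) literals
    return any(alpha[v] == p for v, p in clause)

def weighted_partial_maxsat(hard, soft, variables):
    """Return (best objective, model): the model must satisfy all hard
    clauses and maximize the sum of weights of satisfied soft clauses."""
    best = None
    for bits in product([False, True], repeat=len(variables)):
        alpha = dict(zip(variables, bits))
        if all(satisfied(c, alpha) for c in hard):
            obj = sum(w for w, c in soft if satisfied(c, alpha))
            if best is None or obj > best[0]:
                best = (obj, alpha)
    return best

hard = [{('x', True), ('y', True)}]                # x ∨ y must hold
soft = [(5, {('x', False)}), (2, {('y', False)})]  # prefer ¬x (w=5), ¬y (w=2)
obj, model = weighted_partial_maxsat(hard, soft, ['x', 'y'])
print(obj, model)   # 5, with x=False and y=True
```

Dropping the hard clauses recovers weighted MAXSAT, and fixing all weights to 1 recovers partial MAXSAT, matching the taxonomy above.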

MAXSAT solvers are broadly classified into approximate and exact solvers. Approximate solvers are efficient but do not guarantee optimality (i.e., they may not maximize the objective value) [128] or even soundness (i.e., they may falsify a hard clause) [129], although it is common to provide error bounds on optimality [130]. Our solver, on the other hand, is exact, as it guarantees both optimality and soundness.

There are a number of different approaches for exact MAXSAT solving, including branch-and-bound-based, satisfiability-based, and unsatisfiability-based approaches, as well as their combinations [115, 116, 117, 118, 119, 120, 121, 122, 123]. The most successful of these on real-world instances, as witnessed in the annual MaxSAT evaluations [131], perform iterative solving using a SAT solver as an oracle in each iteration [117, 118]. Such solvers differ primarily in how they estimate the optimal cost (e.g., linear or binary search) and in the kind of information that they use to estimate the cost (e.g., cores, the structure of cores, or satisfying assignments). Many algorithms have been proposed that search on either an upper bound or a lower bound of the optimal cost [118, 117, 121, 116]. Some algorithms efficiently perform a combined search over the two bounds [123, 122]. A drawback of the most sophisticated combined search algorithms is that they modify the formula using expensive Pseudo-Boolean (PB) constraints that increase the size of the formula and, potentially, slow down the solver. A recent promising approach [115] avoids this problem by using succinct formula transformations that do not use PB constraints and can be applied incrementally.

Our approach is similar in spirit to the above exact approaches in aspects such as iterative solving and optimal cost estimation. However, we solve a new and fundamentally different

optimization problem, Q-MAXSAT, which enables our approach to be demand-driven, unlike existing approaches. In particular, it enables our approach to entirely avoid exploring vast parts of a given, very large MAXSAT formula that are irrelevant to deciding the values of a (small set of) queries in some model of the original MAXSAT formula. For this purpose, our approach invokes an off-the-shelf exact MAXSAT solver on small sub-formulae of a much larger MAXSAT formula. Our approach is thus also capable of leveraging advances in solvers for the MAXSAT problem. Conversely, it would be interesting to explore how to integrate our query-guided approach into the search algorithms of existing MAXSAT solvers.

The MAXSAT problem has also been addressed in the context of probabilistic logics for information retrieval [132], such as PSL (Probabilistic Soft Logic) [133] and MLNs (Markov Logic Networks) [105]. These logics seek to reason efficiently with very large knowledge bases (KBs) of imperfect information. Examples of such KBs are the AR, ER, and IE applications in our empirical evaluation (Chapter 4.2.5). In particular, a fully grounded formula in these logics is a MAXSAT formula, and finding a model of this formula corresponds to solving the Maximum-a-Posteriori (MAP) inference problem in those logics [80, 81, 82, 83, 84]. A query-guided approach has been proposed for this problem [134] with the same motivation as ours, namely, to reason about very large MAXSAT formulae. However, this approach, as well as all other (non-query-guided) approaches in the literature on these probabilistic logics, sacrifices optimality as well as soundness.

In contrast, query-guided approaches have witnessed tremendous success in the domain of program reasoning. For instance, program slicing [135] is a popular technique to scale program analyses by pruning away the program statements that do not affect an assertion (query) of interest. Likewise, counterexample-guided abstraction refinement (CEGAR) based model checkers [4, 1, 136] and refinement-based pointer analyses [137, 138, 75, 34] compute a cheap abstraction that is precise enough to answer a given query. However, the vast majority of these approaches target constraint satisfaction problems (i.e., problems with only hard constraints) as opposed to constraint optimization problems (i.e., problems with mixed hard and soft constraints). To our knowledge, our approach is the first to realize the benefits of query-guided reasoning for constraint optimization problems, which are becoming increasingly common in domains such as program reasoning [139, 140, 8, 141, 142].

4.2.7 Conclusion

We introduced a new optimization problem, Q-MAXSAT, that extends the well-known MAXSAT problem with queries. The Q-MAXSAT problem is motivated by the fact that many real-world MAXSAT problems pose scalability challenges to MAXSAT solvers, and by the observation that many such problems are naturally equipped with queries. We proposed efficient exact algorithms for solving Q-MAXSAT instances. The algorithms lazily construct small MAXSAT sub-problems that are relevant to answering the given queries in a much larger MAXSAT problem. We implemented our Q-MAXSAT solver in a tool, PILOT, that uses off-the-shelf MAXSAT solvers to efficiently solve such sub-problems. We demonstrated the effectiveness of PILOT in practice on a diverse set of 19 real-world Q-MAXSAT instances ranging in size from 100 thousand to 22 million clauses. On these instances, PILOT used only 285 MB of memory on average and terminated in 107 seconds on average, whereas conventional MAXSAT solvers timed out on eight of the instances.

CHAPTER 5
FUTURE DIRECTIONS

So far, we have demonstrated three important applications that leverage combined logical and probabilistic reasoning, and new algorithms that enable these applications. In this chapter, we identify a few fundamental challenges in languages, theory, and algorithms that are crucial for this new paradigm to be adopted more broadly.

Languages. Besides Markov Logic Networks, there are many competing languages that combine logic and probability. We chose Markov Logic Networks as their logical fragment is the closest to logical languages such as Datalog that are applied widely in conventional program analyses. However, they possess a few limitations. First, a more fine-grained way to integrate probabilities into logical formulae stands to further improve the performance of these applications and enable new applications. Specifically, in a Markov Logic Network, given a soft constraint (ch, w), all instances in the grounding of the logical formula ch are assigned the same weight w. However, in practice, one can envision scenarios for assigning separate weights to individual instances. For instance, whether a given analysis rule holds is closely related to the context of the code fragment that it is applied to. Secondly, Markov Logic Networks lack built-in support for the least fixed point operator, which is used prevalently in abstract-interpretation-based program analyses. In the static bug detection application, which needs such support, we mitigate the issue by adding additional soft constraints to enforce the least fixed point semantics. However, a more fundamental and elegant solution may be conceivable by refining the language design.
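The uniform-weight behavior, and the hypothetical per-instance hook discussed above, can be illustrated with a minimal grounding sketch (the rule, relation names, and the `weight_fn` parameter are all illustrative, not part of any existing system):

```python
def ground_soft_rule(body, head, weight, domain, weight_fn=None):
    """Ground a one-variable soft rule body(x) => head(x) over `domain`.
    In a standard Markov Logic Network, every ground instance inherits the
    rule's single weight; `weight_fn` is a hypothetical hook that would
    assign a per-instance (context-dependent) weight instead."""
    ground = []
    for x in domain:
        w = weight_fn(x) if weight_fn else weight
        ground.append(((f"{body}({x})", f"{head}({x})"), w))
    return ground
```

With `weight_fn=None` this reproduces the uniform Markov Logic Network semantics; passing a function yields the finer-grained weighting that the paragraph above envisions.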

Guarantees. The correctness of conventional logical analyses is enforced by rigorous safety guarantees, most notably soundness. We can continue to enforce such guarantees in our combined analysis thanks to the logical part. This is an advantage of our approach compared to a pure probabilistic approach, as enforcing safety in the latter is much more challenging and has only been studied recently [143, 144, 145]. However, such logical guarantees alone are often not expressive enough to characterize the quality of results produced by the combined approach. One possible solution to address this challenge is to incorporate statistical guarantees. For instance, our static bug detection application can introduce false negatives after incorporating user feedback, and no longer guarantees soundness. By applying statistical guarantees such as precision and recall, we can effectively quantify the false positive and false negative rates. To enforce such guarantees, we plan to investigate how to leverage the literature on probably approximately correct learning (PAC learning) [146].
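As a concrete illustration of such statistical guarantees, precision and recall over a set of reported alarms can be computed as follows (a standard definition, not specific to our system; low recall corresponds to false negatives, low precision to false positives):

```python
def precision_recall(reported, ground_truth):
    """Precision and recall of a set of reported alarms against the set of
    true alarms: precision = TP / reported, recall = TP / true alarms."""
    tp = len(reported & ground_truth)
    precision = tp / len(reported) if reported else 1.0
    recall = tp / len(ground_truth) if ground_truth else 1.0
    return precision, recall
```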

Explainability. One benefit of the conventional logical approaches is that their results come with explanations such as provenance information. In our combined approach, while one can inspect the ground constraints related to result tuples to get intuitive explanations, there are no systematic approaches to extract explanatory information such as provenance from these constraints. To address this challenge, we plan to investigate provenance of weighted constraints.

Learning. Until now, we have assumed that the logical formulae come from existing logical analyses and the weights are the only learnt parameters. However, there are cases that could benefit from learning the logical formulae as well. For instance, we may lack specifications for certain program properties (e.g., security vulnerabilities). In this case, both the logical and the probabilistic parts of the analysis specification could be learned from labeled data. There are also cases where the specification may exist but is too imprecise, such that simply making the rules probabilistic does not suffice to improve the performance unless new rules are introduced. One possible direction to enable such learning is to leverage the literature on inductive logic programming [147] and program synthesis [148].

Inference. The inference engine is the key component that affects the scalability and accuracy of our approach, as even the learning problem is solved by solving a series of inference problems. However, the inference problem is a combinatorial optimization problem, which is known for its intractability. While the general problem is computationally challenging in theory, it is feasible to build an inference engine that is scalable and accurate in practice by exploiting various domain insights. More concretely, we plan to investigate the following techniques to further improve the effectiveness of the inference engine:

Magic Sets Transformation. Locality is almost universal to program analysis and is the key to the effectiveness of query-driven and demand-driven program analysis. NICHROME currently only exploits locality in the solving phase. Ideally, we want to exploit it in the overall ground-solve framework. One promising idea is to leverage the Magic Sets transformation [149] from the Datalog evaluation literature. The idea is to apply the current framework but rewrite the constraint formulation so that both grounding and solving are driven by the demand of queries. In this way, we are able to only consider the constraints that are related to the queries while leveraging existing efficient algorithms that are agnostic to queries.

Lifted Inference. While our current ground-solve framework effectively leverages advances in MaxSAT solvers, it loses the high-level information while translating problems in our constraint language into low-level propositional formulae. Lifted inference [150, 151, 152, 153, 154] is a technique that aims to solve the constraint problem symbolically without grounding. While lifted inference can effectively avoid grounding large propositional formulae for certain problems, it fails to leverage existing efficient propositional solvers.
One promising direction is to combine lifted inference with our ground-solve approach in a systematic way.

Compositional Solving. By exploiting the modularity of programs, we envision compositional solving as an effective approach to improve the solver efficiency. The idea is to break a constraint problem into more tractable subproblems and solve them independently.

It is motivated by the success of compositional and summary-based analysis techniques in scaling to large programs.

Approximate Solving. Despite all the domain insights we exploit, MAP inference is a combinatorial optimization problem, which is known for its intractability. As a result, there will be pathological cases where none of the aforementioned techniques are effective. One idea to address this challenge is to investigate approximate solving, which trades precision for efficiency. Moreover, to trade precision for efficiency in a controlled manner, it is desirable to design an algorithm with tunable precision.
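A minimal sketch of approximate solving with tunable precision, assuming a GSAT-style local search in place of an exact weighted partial MaxSAT solver; the function name and the `max_flips` knob are illustrative, not part of NICHROME:

```python
import random

def local_search_maxsat(clauses, weights, variables, max_flips, seed=0):
    """Approximate weighted MaxSAT by hill-climbing over single-variable
    flips; `max_flips` trades precision (solution weight) for time."""
    rng = random.Random(seed)
    a = {v: False for v in variables}
    def total(a):
        return sum(w for cl, w in zip(clauses, weights)
                   if any(a[v] == pol for v, pol in cl))
    best, best_w = dict(a), total(a)
    for _ in range(max_flips):
        v = rng.choice(list(variables))
        a[v] = not a[v]
        w = total(a)
        if w >= best_w:
            best, best_w = dict(a), w
        else:
            a[v] = not a[v]  # undo a non-improving flip
    return best, best_w
```

Raising `max_flips` lets the search approach the exact optimum; lowering it bounds the running time, which is the controlled precision/efficiency trade-off discussed above.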

CHAPTER 6
CONCLUSION

This thesis presents a new paradigm for program analysis that combines logical and probabilistic reasoning. While preserving the benefits of conventional logic-based program analyses, the proposed paradigm provides analyses with additional abilities to handle uncertainty, learn, and adapt. To support this paradigm, we described an end-to-end constraint-based system to automatically incorporate probabilities into an existing logical program analysis and demonstrated its effectiveness on three important program analysis applications.

The frontend of our system, PETABLOX (Chapter 3), takes a conventional analysis specified in Datalog as input and incorporates probabilities into it by converting it into a novel analysis specified in Markov Logic Networks, a language from the Artificial Intelligence community for unifying logic and probability. We showed that such treatment allows us to address fundamental challenges in prominent applications, including abstraction selection in automated program verification (Chapter 3.1), user effort reduction in interactive verification (Chapter 3.2), and alarm classification in static bug detection (Chapter 3.3).

To support the above applications, the backend of our system, NICHROME (Chapter 4), serves as a sound, optimal, and scalable inference and learning engine for Markov Logic Networks. To support effective inference and learning, it leverages domain insights to improve two key procedures: grounding and solving. By exploiting the observation that most solutions are sparse and the constraints come from logical analyses, it applies an iterative lazy-eager grounding algorithm (Chapter 4.1); by exploiting the observation that we are often only interested in a few tuples in the solver outputs that are related to analysis results, it applies a query-guided solving algorithm (Chapter 4.2). We demonstrated the effectiveness

of NICHROME not only on constraint problems generated from the aforementioned program analysis applications, but also on problems generated from information retrieval applications.

We believe such a combined paradigm will help address long-standing problems in program analysis as well as enable new applications. This thesis aims to serve as a starting point for this exciting new body of research by showcasing important applications as well as describing a general recipe for enabling this paradigm. We envision wide adoption of the proposed paradigm and further research development in applications, languages, theory, and algorithms.

Appendices

APPENDIX A
PROOFS

A.1 Proofs for Results of Chapter 2

Lemma 33. Let L be a complete lattice and f, g ∈ L → L two monotone functions. Then lfp preserves order, in the sense that

(∀x : f(x) ≤ g(x)) =⇒ lfp f ≤ lfp g

Proof. By the well-known Tarski's theorem, f(x) ≤ x implies lfp f ≤ x. We have f(lfp g) ≤ g(lfp g) = lfp g. Thus, lfp f ≤ lfp g.
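Lemma 33 can be checked concretely on a finite powerset lattice, where the least fixed point of a monotone function is computable by Kleene iteration from the bottom element (a standard construction, shown here only to illustrate the statement; the function names are ours):

```python
def lfp(f, bottom=frozenset()):
    """Least fixed point by Kleene iteration from bottom; terminates for
    monotone functions on a finite powerset lattice."""
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x
        x = y
```

For pointwise-ordered monotone functions such as f(S) = S ∪ {1} and g(S) = S ∪ {1, 2}, the iteration yields lfp f = {1} ⊆ {1, 2} = lfp g, as the lemma predicts.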

Proposition 1 (Monotonicity). If C1 ⊆ C2, then ⟦C1⟧ ⊆ ⟦C2⟧.

Proof. The result holds because P(T) is a complete lattice, the functions F_C1, F_C2 ∈ P(T) → P(T) are pointwise ordered with respect to each other, and each of them is monotone:

FC1 (T ) ⊆ FC2 (T )

T1 ⊆ T2 =⇒ FCk (T1) ⊆ FCk (T2) for k ∈ {1, 2}

These properties can be readily verified from the definitions given in Figure 2.2, and from

the assumption C1 ⊆ C2. By Lemma 33, lfp F_C1 ⊆ lfp F_C2, which is the desired result.

A.2 Proofs of Results of Chapter 3.1

A.2.1 Proofs of Theorems 4 and 5

Recall that for each Datalog constraint c, we have the function ⟦c⟧ : P(T) → P(T) in Figure 2.2, which determines the meaning of the constraint. We first prove a lemma that connects this function with a Markov Logic Network {c}. Note that while the domain of constants of Markov Logic Network constraints varies, the domain of constants of Datalog constraints is always the set of all constants N.

Lemma 34. For every constraint c and all T, T′ ⊆ T such that ⟦c⟧(T) ⊆ T and T′ ⊆ T,

⟦{c}⟧_P(T′) > 0 ⇐⇒ ⟦c⟧(T′) ⊆ T′,

where the domain of constants of Markov Logic Network {c} is constants(T).

Proof. We prove the theorem by proving it separately under two cases: (1) when c has an empty body, and (2) otherwise.

When c has an empty body, we have ∀T″ ⊆ T. ⟦c⟧(T″) = ⟦c⟧(∅). Its grounding is a set of tuples. Moreover, it is ⟦c⟧(∅). To see why this is the case, first notice that grounding(c) = ⟦c⟧(∅) when the domain of constants of c in grounding is N. Secondly, ⟦c⟧(∅) = ⟦c⟧(T) ⊆ T, which in turn indicates that T contains all the constants in ⟦c⟧(∅). Thus, we conclude grounding(c) = ⟦c⟧(∅) when the domain of constants of c in grounding is constants({c} ∪ T). Hence ⟦{c}⟧_P(T′) > 0 is equivalent to ⟦c⟧(∅) ⊆ T′. Moreover, we can rewrite ⟦c⟧(T′) ⊆ T′ as ⟦c⟧(∅) ⊆ T′. Hence, the theorem holds when c has an empty body.

Next, we prove the theorem under the case where c has a nonempty body. Consider c, T′, T such that ⟦c⟧(T) ⊆ T and T′ ⊆ T.

First, we show the only-if direction of the equivalence. Assume

⟦{c}⟧_P(T′) > 0.

Assuming ⟦c⟧(T′) ≠ ∅, pick t ∈ ⟦c⟧(T′). Otherwise, the formula on the right-hand side trivially holds. By the definitions of ⟦c⟧ and grounding(c), there exist t1, . . . , tn such that

1. t1, . . . , tn ∈ T′ ∧ t ∈ ⟦c⟧({t1, . . . , tn}); and

2. when we overload t, t1, . . . , tn and make them refer to variables corresponding to these tuples,

φ = (t1 ∧ ... ∧ tn → t)

is a conjunct in grounding(c) where the domain of constants is constants(T′ ∪ ⟦c⟧(T′)).

Since T′ ⊆ T, we have ⟦c⟧(T′) ⊆ ⟦c⟧(T). This, together with ⟦c⟧(T) ⊆ T, gives T′ ∪ ⟦c⟧(T′) ⊆ T. Since grounding(c) monotonically grows as the domain of constants grows, the formula φ is also a conjunct in grounding(c) where the domain of constants is constants(T). By assumption, ⟦{c}⟧_P(T′) > 0, so T′ |= φ. But all of t1, . . . , tn are in T′. Thus, this satisfaction relationship implies that t ∈ T′.

Next, we prove the if direction. Suppose that ⟦c⟧(T′) ⊆ T′. Pick one conjunct φ in grounding(c) where the domain of constants is constants(T). We have to prove that T′ |= φ. By the definition of grounding(c) and ⟦c⟧(T) ⊆ T, there are tuples t1, . . . , tn ∈ T and a tuple t ∈ T such that when we use t, t1, . . . , tn to refer to variables corresponding to these tuples,

φ = (t1 ∧ ... ∧ tn → t)

and t ∈ ⟦c⟧({t1, . . . , tn}). If some of t1, . . . , tn is missing in T′, we have T′ |= φ, as desired. Otherwise,

t ∈ ⟦c⟧({t1, . . . , tn}) ⊆ ⟦c⟧(T′) ⊆ T′,

where we use the monotonicity of ⟦c⟧ with respect to ⊆. From t ∈ T′, which we have just proved, follows the desired T′ |= φ.

Next, we prove a similar result for Datalog programs C, which is broken into two lemmas. We introduce an auxiliary function F_C : P(T) → P(T) such that F_C(T) = ⋃_{c∈C} ⟦c⟧(T). We prove that this function is closely connected with the Markov Logic Network C.

Lemma 35. For all T, T′ such that F_C(T) = T,

⟦C⟧_P(T′) > 0 =⇒ ⟦C ∪ (T′ ∩ T)⟧ ⊆ T′,

where the domain of constants for Markov Logic Network C is constants(T).

Proof. Consider T, T′ such that F_C(T) = T. Let T″ = T′ ∩ T. Suppose that

⟦C⟧_P(T′) > 0,

where the domain of constants is constants(T). Let

F(T₀) = T₀ ∪ ⋃_{c∈C∪T″} ⟦c⟧(T₀).

Then, ⟦C ∪ T″⟧ = ⋃_{i≥0} F^i(∅). Furthermore, T″ ⊆ T′. Hence, to prove ⟦C ∪ T″⟧ ⊆ T′, it is sufficient to show that

∀n ≥ 0. F^n(∅) ⊆ T″.

We prove this sufficient condition by induction on n.

1. Base case n = 0. Since F^0(∅) = ∅, this case is immediate.

2. Inductive case n > 0. In this case,

F^n(∅) = F^{n−1}(∅) ∪ ⋃_{c∈C∪T″} ⟦c⟧(F^{n−1}(∅)).

Pick t ∈ F^n(∅). If t belongs to the LHS F^{n−1}(∅) of the above union, we have the desired t ∈ T″ by the induction hypothesis. Otherwise, there is c ∈ C ∪ T″ such that t ∈ ⟦c⟧(F^{n−1}(∅)). If c ∈ T″, c must be a tuple and be the same as t. Hence, t ∈ T″, as desired. Otherwise, there exist c ∈ C and tuples

t1, . . . , tk ∈ F^{n−1}(∅)

such that t ∈ ⟦c⟧({t1, . . . , tk}). By the induction hypothesis, t1, . . . , tk ∈ T″. Hence, our proof obligation t ∈ T″ can be discharged if we show that

⟦c⟧(T″) ⊆ T″.

We show this subset relationship using Lemma 34. Specifically, we show the following three properties:

T″ ⊆ T, ⟦c⟧(T) ⊆ T, and ⟦c⟧_P(T″) > 0,

where the domain of constants is constants(T). The first property holds because T″ = T′ ∩ T, and the second is true because T is a fixed point of F_C and c is a constraint in C. We now show the third property. By assumption, ⟦C⟧_P(T′) > 0, where the domain of constants is constants(T). Hence,

⟦c⟧_P(T′) > 0,

where the domain of constants is constants(T). Since ⟦c⟧(T) ⊆ T, for every constraint instance t1 ∧ ... ∧ tn =⇒ t in grounding(c), we have either t ∈ T or ∃ti. ti ∉ T. We can derive the same result for T′ since ⟦c⟧_P(T′) > 0. Based on these two results, we can also derive the same result for T′ ∩ T. The reasoning is as follows: for a given constraint instance t1 ∧ ... ∧ tn =⇒ t in grounding(c), it is either the case that there exists ti such that ti ∉ T ∨ ti ∉ T′, or the case that t ∈ T ∧ t ∈ T′. This result implies

⟦c⟧_P(T′ ∩ T) > 0.

That is, ⟦c⟧_P(T″) > 0, the very property that we want to prove.

Lemma 36. For all T, T′ such that F_C(T) = T and T′ ⊆ T,

⟦C ∪ (T′ ∩ T)⟧ ⊆ T′ =⇒ ⟦C⟧_P(T′) > 0,

where the domain of constants for Markov Logic Network C is constants(T).

Proof. Let T″ = T′ ∩ T. Following the assumption, we have

⟦C ∪ T″⟧ ⊆ T′.

We should show that ⟦C⟧_P(T′) > 0. Suppose not. Then, there exist c ∈ C and tuples t1, . . . , tn ∈ T and t such that

1. t1, . . . , tn ∈ T′ but t ∉ T′; and

2. when t1, . . . , tn, t are used as variables corresponding to these tuples,

φ = (t1 ∧ ... ∧ tn → t)

is in grounding(c) where the domain of constants is constants(T).

Then, t ∈ ⟦c⟧({t1, . . . , tn}). Since T″ = T′ ∩ T,

t1, . . . , tn ∈ T″.

Hence,

t ∈ ⟦{c} ∪ {t1, . . . , tn}⟧ ⊆ ⟦C ∪ T″⟧.

But ⟦C ∪ T″⟧ ⊆ T′. Thus, t ∈ T′. But this contradicts the fact that t ∉ T′.

Theorem 4. Given a set of Datalog constraints C and a set of tuples A, let A′ be a subset of A. Then the Datalog run result ⟦C ∪ A′⟧ is fully determined by the Markov Logic Network C where the domain of constants is constants(⟦C ∪ A⟧), as follows:

t ∈ ⟦C ∪ A′⟧ ⇐⇒ (∀T. (T = MAP(C ∪ A′)) =⇒ t ∈ T),

where constants(⟦C ∪ A⟧) is the domain of constants of Markov Logic Network C ∪ A′.

Proof. Consider tuple sets T, A′ such that A′ ⊆ A and T = ⟦C ∪ A⟧. Throughout the proof, we assume the domains of constants for all Markov Logic Networks are constants(⟦C ∪ A⟧).

First, we prove the only-if direction. Pick t ∈ ⟦C ∪ A′⟧. Consider T″ such that ⟦C ∪ A′⟧_P(T″) > 0. In other words, T″ is a solution to the MAP problem C ∪ A′. This means that

A′ ⊆ T″ and ⟦C⟧_P(T″) > 0. (A.1)

Using these facts, we will prove that

t ∈ T″. (A.2)

The key step of our proof is to show that

⟦C ∪ (T″ ∩ T)⟧ ⊆ T″. (A.3)

This gives the set membership in (A.2), as shown in the following reasoning:

t ∈ ⟦C ∪ A′⟧ ⊆ ⟦C ∪ (T″ ∩ T)⟧ ⊆ T″.

The first subset relationship here holds because A′ is a subset of both T (by the choice of A′ and T) and T″ (by the first conjunct in (A.1)), and ⟦−⟧ is monotone. Now it remains to discharge the subset relationship in (A.3). This is easy, because it is an immediate consequence of Lemma 35 and the fact that ⟦C⟧_P(T″) > 0, the second conjunct in (A.1).

Second, we show the if direction. Consider t ∈ T such that

∀T″ ⊆ T. (T″ = MAP(C ∪ A′)) =⇒ t ∈ T″. (A.4)

Let T″ = ⟦C ∪ A′⟧. Then,

T″ |= ⋀_{t′∈A′} t′ and T″ ⊆ ⟦C ∪ A⟧ = T and ⟦C ∪ (T″ ∩ T)⟧ ⊆ T″. (A.5)

The first conjunct holds because A′ ⊆ T″. The second conjunct holds because of the monotonicity of Datalog. The third conjunct holds because

⟦C ∪ (T″ ∩ T)⟧ ⊆ ⟦C ∪ T″⟧ = T″.

Here the subset relationship follows from the monotonicity of ⟦−⟧ with respect to ⊆, and the equality holds because T″ includes A′, it is a fixed point of F_{C∪A′}, and every fixed point T₀ of F_{C∪A′} equals ⟦C ∪ A′ ∪ T₀⟧. By Lemma 36, the second and third conjuncts in (A.5) imply

⟦C⟧_P(T″) > 0.

This, together with T″ |= ⋀_{t′∈A′} t′, gives

⟦C ∪ A′⟧_P(T″) > 0.

In other words, T″ = MAP(C ∪ A′). What we have proved so far and our choice of t in (A.4) imply that T″ |= t. That is, t ∈ T″. If we unroll the definition of T″, we get the desired set membership: t ∈ ⟦C ∪ A′⟧.

Theorem 5. Let A1, . . . , An be sets of tuples. For all A′ ⊆ ⋃_{i∈[1,n]} Ai,

(∀T. (T = MAP(C ∪ A′)) =⇒ t ∈ T) =⇒ t ∈ ⟦C ∪ A′⟧,

where constants(⟦C ∪ ⋃_{i∈[1,n]} Ai⟧) is the domain of constants of Markov Logic Network C ∪ A′.

Proof. Consider t and T′ such that

t ∈ T′, T′ = {t′ | t′ ∈ ⟦C ∪ A′⟧ ∧ constants({t′}) ⊆ constants(⟦C ∪ ⋃_{i∈[1,n]} Ai⟧)}. (A.6)

It suffices to show T′ = MAP(C ∪ A′).

Suppose it does not hold; then there exists t1 ∧ ... ∧ tn → t ∈ grounding(C ∪ A′), where the domain of constants is constants(⟦C ∪ ⋃_{i∈[1,n]} Ai⟧), such that

{t1, . . . , tn} ⊆ T′ and t ∉ T′.

We can show that t ∈ ⟦C ∪ A′⟧ given {t1, . . . , tn} ⊆ T′ ⊆ ⟦C ∪ A′⟧. Based on the construction of T′, we further conclude t ∈ T′, which contradicts the assumption.

A.2.2 Proof of Theorem 6

Theorem 6. If choose(T, C, Q′) evaluates to an element of the set A \ γ(T, C, Q′) whenever such an element exists, and to impossible otherwise, then Algorithm 1 is partially correct: it returns (R, I) such that R = R(A, Q) and I = Q \ R. In addition, if A is finite, then Algorithm 1 terminates.

Proof. The key part of our proof is to show that Algorithm 1 has the following loop invariant: letting I = Q \ R,

1. R ⊆ R(A,Q);

2. T = ∅ or T = T1 ∪ ... ∪ Tn where T1, ..., Tn are some fixed points of F_C; and

3. R(A \ γ(T,C,I),I) = R(A,I).

Let us start by proving that the invariant holds initially. When the loop of Algorithm 1 is entered for the first time, R = ∅ ∧ T = ∅ ∧ I = Q.

Hence, the first and second conditions of our invariant hold in this case. For the third condition, we notice that

γ(T, C, I) = γ(∅, C, I)
= {A ∈ A | ∀T′. (T′ = MAP(C ∪ (A ∩ ∅))) =⇒ I ⊆ T′, where the domain of constants is ∅}
= {A ∈ A | ∀T′. T′ ⊆ T =⇒ I ⊆ T′}
= {A ∈ A | false} = ∅.

The second step holds as grounding(C ∪ (A ∩ ∅)) = ∅ with ∅ as the domain of constants and any set of tuples satisfies an empty set of constraints. Hence, the third condition holds. Next, we prove that our invariant is preserved by the loop of Algorithm 1. Assume that the invariant holds for R, T , and I. Also, assume that the result A of choose(T,C,I) is not

impossible. Let

R′ = R ∪ (Q \ ⟦C ∪ A⟧), T′ = T ∪ ⟦C ∪ A⟧, I′ = Q \ R′.

We should show that the three conditions of our invariant hold for R′, T′, and I′. The first condition is

R′ ⊆ R(A, Q),

which holds because R ⊆ R(A, Q) and (Q \ ⟦C ∪ A⟧) ⊆ R(A, Q). The second condition also holds because ⟦C ∪ A⟧ is a fixed point of F_C. It remains to prove the third condition:

R(A \ γ(T′, C, I′), I′) = R(A, I′).

For this, we will show the following sufficient condition:

∀A′ ∈ γ(T′, C, I′). I′ \ ⟦C ∪ A′⟧ = ∅.

Pick A′ ∈ γ(T′, C, I′). Then,

∀T″. (T″ = MAP(C ∪ (A′ ∩ T′))) =⇒ I′ ⊆ T″,

where the domain of constants is constants(T′). Since T′ is the union of some fixed points of F_C, by Theorem 5, the above formula implies that

∀t ∈ I′. t ∈ ⟦C ∪ A′⟧.

Hence, I′ \ ⟦C ∪ A′⟧ = ∅, as desired.

We now use our invariant to show the partial correctness of Algorithm 1. Assume that the invariant holds for T, R, and I. Suppose that

choose(T,C,Q \ R) = impossible.

Then, A \ γ(T,C,I) = ∅ by our assumption on choose. Because T , R, and I satisfy our

loop invariant,

R(A, Q \ R) = R(A, I) = R(A \ γ(T, C, I), I) = R(∅, I) = ∅,

where the last equality uses the definition of R. But R ⊆ R(A, Q) by the loop invariant. Hence, by the definition of R, R(A, Q) = R.

This means that when Algorithm 1 returns, its result (R,I) satisfies R = R(A,Q) and I = Q \ R, as claimed in the theorem. Finally, we show that if A is finite, Algorithm 1 terminates. Our proof is based on the fact that the set γ(T,C,Q \ R) ⊆ A is strictly increasing. Consider T,R,I satisfying the loop invariant. Assume that the result

A of choose(T, C, I) is not impossible. Let

R′ = R ∪ (Q \ ⟦C ∪ A⟧), T′ = T ∪ ⟦C ∪ A⟧, I′ = Q \ R′.

Since T is a subset of T′, for any A ∈ A and T″ ⊆ T, we have

T″ = MAP(C ∪ (A ∩ T′)) where the domain of constants is constants(T′) =⇒
T″ = MAP(C ∪ (A ∩ T)) where the domain of constants is constants(T).

Since I′ ⊆ I, for any T″, we have that I ⊆ T″ implies I′ ⊆ T″. These two together imply

{A ∈ A | ∀T″. (T″ = MAP(C ∪ (A ∩ T))) =⇒ I ⊆ T″, where the domain of constants is constants(T)}
⊆ {A ∈ A | ∀T″. (T″ = MAP(C ∪ (A ∩ T′))) =⇒ I′ ⊆ T″, where the domain of constants is constants(T′)}.

Hence, γ(T, C, I) ⊆ γ(T′, C, I′).

It remains to show that this subset relationship is strict. This is the case because A ∉ γ(T, C, I) by the assumption on choose, but it is in γ(T′, C, I′). To see why A ∈ γ(T′, C, I′), notice I′ ⊆ ⟦C ∪ A⟧ and A ⊆ ⟦C ∪ A⟧. Hence, by Theorem 4,

∀T″. (T″ = MAP(C ∪ A)) =⇒ I′ ⊆ T″,

where the domain of constants is constants(⟦C ∪ A⟧). This together with A ⊆ T′ and ⟦C ∪ A⟧ ⊆ T′ implies

∀T″. (T″ = MAP(C ∪ (A ∩ T′))) =⇒ I′ ⊆ T″,

where the domain of constants is constants(T′). This is equivalent to A ∈ γ(T′, C, I′).

A.3 Proofs of Results of Chapter 3.2

Lemma 10. A sound analysis C can derive the ground truth; that is, True = ⟦C⟧ False.

Proof. By (3.2), we have ⟦C⟧ False ⊇ True. By the augmented semantics (Figure 3.11(b)), we also know that ⟦C⟧ False and False are disjoint. The result follows.

Lemma 11. In Algorithm 2, suppose that Heuristic returns all tuples T. Also, assume (3.1), (3.2), and (3.3). Then, Algorithm 2 returns the true alarms A ∩ True.

Proof. The key invariant is that

F ⊆ False ⊆ Q (3.4)

The invariant is established by setting F := ∅ and Q := T. By (3.1), we know that Y ⊆ True and N ⊆ False, on line 5. It follows that the invariant is preserved by removing

Y from Q, on line 6. It also follows that F ∪ N ⊆ False and, by (3.2), that ⟦C⟧ (F ∪ N) ⊇ True. So, the invariant is also maintained by line 7. We conclude that (3.4) is indeed an invariant.

For termination, let us start by showing that |Q \ F| is nonincreasing. According to lines 6 and 7, the values of Q and F in the next iteration will be Q′ := Q \ Y and F′ := (Q \ Y) \ ⟦C⟧ (F ∪ N). We now show that F ⊆ F′ and Q′ ⊆ Q. Consider an arbitrary f ∈ F. By (3.4), f ∈ Q. Using Y ⊆ True and (3.4), we conclude that Y and F are disjoint; using the augmented semantics in Figure 3.11(b), we conclude that ⟦C⟧ (F ∪ N) and F are disjoint. Thus, f ∈ F′, and, since f was arbitrary, we conclude F ⊆ F′. For Q′ ⊆ Q, it suffices to notice that Y ⊆ True.

Now let us show that |Q \ F| is not only nonincreasing but in fact decreasing. By (3.3), we know that R is non-empty, and thus at least one of Y or N is non-empty. If Y is non-empty, then Q′ ⊂ Q. If N is non-empty, then F ⊂ F′. We can now conclude that Algorithm 2 terminates.

When the main loop terminates, we have F = Q. Together with the invariant (3.4), we obtain that F = False. By Lemma 10, it follows that Algorithm 2 returns A ∩ True.
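The loop analyzed in Lemma 11 can be sketched as follows, with `derive` standing in for the augmented semantics ⟦C⟧(·) and `decide` for the feedback oracle (both hypothetical stand-ins; the sketch collapses Heuristic to querying all unresolved tuples at once):

```python
def resolve_alarms(alarms, tuples, derive, decide):
    """Sketch of the Lemma 11 loop: `derive(noise)` returns the tuples the
    analysis still derives when `noise` is assumed false (the augmented
    semantics keeps the two sets disjoint); `decide` labels queried tuples
    as true (True) or false (False). Iterates until the false set F
    catches up with the unresolved set Q."""
    false, q = set(), set(tuples)
    while false != q:
        labels = decide(q - false)              # feedback on unresolved tuples
        yes = {t for t, ok in labels.items() if ok}
        no = {t for t, ok in labels.items() if not ok}
        q = q - yes                             # line 6: drop confirmed-true tuples
        false = q - derive(false | no)          # line 7: grow the false set
    return alarms & derive(false)
```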

A.4 Proofs of Results of Chapter 4.1

Theorem 15 (Optimal initial grounding for Horn constraints). If a Markov Logic Network comprises a set of hard constraints Ch, each of which is a Horn constraint ⋀_{i=1}^{n} li =⇒ l0, whose least solution is desired:

T = lfp λT′. T′ ∪ { ⟦l0⟧(σ) | (⋀_{i=1}^{n} li =⇒ l0) ∈ Ch ∧ ∀i ∈ [1, n]. ⟦li⟧(σ) ∈ T′ ∧ σ ∈ Σ },

then for such a system, (a) Lazy(Ch, ∅) grounds at least |T| constraints, and (b) CEGAR(Ch, ∅) with the initial grounding φ does not ground any more constraints, where

φ = { ⋁_{i=1}^{n} ¬⟦li⟧(σ) ∨ ⟦l0⟧(σ) | (⋀_{i=1}^{n} li =⇒ l0) ∈ Ch ∧ ∀i ∈ [0, n]. ⟦li⟧(σ) ∈ T ∧ σ ∈ Σ }.

Proof. To prove (a), we will show that for each t ∈ T, Lazy(H, ∅) must ground some constraint with t on the r.h.s. Let the sequence of sets of constraints grounded in the iterations of this procedure be C1, ..., Cn. Then, we have:

Proposition (1): each ground constraint ⋀_{i=1}^{m} ti =⇒ t′ in any Cj was added because the previous solution set all ti to true and t′ to false. This follows from the assumption that all rules in H are Horn rules.

Let x ∈ [1..n] be the earliest iteration in whose solution t was set to true. Then, we claim that t must be on the r.h.s. of some ground constraint ρ in Cx. Suppose for the sake of contradiction that no clause in Cx has t on the r.h.s. Then, it must be the case that there is some ground constraint ρ′ in Cx where t is on the l.h.s. (the MAXSAT procedure will not set variables to true that do not even appear in any clause in Cx). Suppose ground constraint ρ′ was added in some iteration y < x. Applying proposition (1) above to ρ′ and j = x, it must be that t was true in the solution to iteration y, contradicting the assumption above that x was the earliest iteration in whose solution t was set to true.

To prove (b), suppose CEGAR(H, ∅) grounds an additional constraint, that is, there exists a (⋀_{i=1}^{n} li =⇒ l0) ∈ H and a σ such that T ⊭ ⋁_{i=1}^{n} ¬⟦li⟧(σ) ∨ ⟦l0⟧(σ). The only way by which this can hold is if ∀i ∈ [1..n]: ⟦li⟧(σ) ∈ T and ⟦l0⟧(σ) ∉ T, but this contradicts the definition of T.

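For already-ground Horn rules (the theorem quantifies over substitutions σ; we elide them here and take each rule as a pair of a body tuple and a head), the least solution T and the initial grounding φ of Theorem 15 can be computed by a simple fixpoint sketch (function name ours):

```python
def least_model_and_grounding(rules):
    """Compute the least model T of a set of ground Horn rules
    (body, head) by fixpoint iteration, together with the Theorem 15
    initial grounding: one clause per rule all of whose body and head
    tuples lie in T."""
    t = set()
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if set(body) <= t and head not in t:
                t.add(head)
                changed = True
    phi = [(body, head) for body, head in rules
           if set(body) <= t and head in t]
    return t, phi
```

Rules whose bodies are never derivable (the third rule in the test below) contribute nothing to T and are therefore absent from the initial grounding, mirroring part (b) of the theorem.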
Theorem 16 (Soundness and Optimality of IPR). For any Markov Logic Network Ch ∪ Cs where the hard constraints Ch are satisfiable, CEGAR(C) produces a sound and optimal solution.

Proof. We extend the function WEIGHT (Chapter 4.1.2) to hard clauses, yielding −∞ if any such clause is violated:

W = λ(T, φ ∪ ψ). if (∃ρ ∈ φ : T ⊭ ρ) then −∞ else WEIGHT(T, ψ).

It suffices to show that the solution produced by CEGAR(H,S) has the same weight as

the solution produced by the eager approach. The eager approach (denoted by Eager) generates the solution by posing clauses generated from full grounding to a WPMS solver.

First, observe that CEGAR(H, S) terminates: in each iteration of the loop in line 6 of Algorithm 7, it must be the case that at least one new ground hard constraint is added to φ or at least one new ground soft constraint is added to ψ, because otherwise the condition on line 12 will hold and the loop will be exited.

Now, suppose that in the last iteration of the loop in line 6 for computing CEGAR(H, S), we have:

(1) gc1 = hgc1 ∪ sgc1 is the set of ground hard and soft constraints accumulated in (φ, ψ) so far (line 9);

(2) T1 = ∅ and w1 = 0 (line 5), or T1 = MAXSAT(hgc1, sgc1) and its weight is w1 (lines 10 and 11);

(3) gc2 = hgc2 ∪ sgc2 is the set of all ground hard and soft constraints that are violated by

T1 (lines 7 and 8);

(4) T2 = MAXSAT(hgc1 ∪ hgc2, sgc1 ∪ sgc2) and its weight is w2 (lines 10 and 11); and the condition on line 12 holds as this is the last iteration:

(5) w1 = w2 and hgc2 = ∅.

Then, the result of CEGAR(H,S) is T1. On the other hand, the result of Eager(H,S) is:

(6) Tf = MAXSAT(hgcf , sgcf ) where:

(7) gcf = hgcf ∪ sgcf is the set of fully grounded hard and soft constraints (Figure 2.4).

Thus, it suffices to show that T1 and Tf are equivalent.

Define gcm = gc2 \ gc1. (8) For any T, we have:

W(T, gc1 ∪ gc2) = W(T, gc1) + W(T, gcm)
               = W(T, gc1) + W(T, hgc2 \ hgc1) + W(T, sgc2 \ sgc1)
               = W(T, gc1) + W(T, sgc2 \ sgc1)    [a]
               ≥ W(T, gc1)                        [b]

where [a] follows from (5), and [b] from W(T, sgc2 \ sgc1) ≥ 0 (i.e., soft constraints do not have negative weights). Instantiating (8) with T1, we have (9): W(T1, gc1 ∪ gc2) ≥ W(T1, gc1).

Combining (2), (4), and (5), we have (10): W(T1, gc1) = W(T2, gc1 ∪ gc2). Combining (9) and (10), we have (11): W(T1, gc1 ∪ gc2) ≥ W(T2, gc1 ∪ gc2). This means T1 is a better solution than T2 on gc1 ∪ gc2. But from (4), we have that T2 is an optimum solution to gc1 ∪ gc2, so we have (12): T1 is also an optimum solution to gc1 ∪ gc2.

It remains to show that T1 is also an optimum solution to the set of fully grounded hard and soft constraints gcf , from which it will follow that T1 and Tf are equivalent. Define gcr = gcf \ (gc1 ∪ gc2). For any T , we have:

W(T, gcf ) = W(T, gc1 ∪ gc2 ∪ gcr)

= W(T, gc1 ∪ gc2) + W(T, gcr)

≤ W(T1, gc1 ∪ gc2) + W(T, gcr)[c]

≤ W(T1, gc1 ∪ gc2) + W(T1, gcr)[d]

= W(T1, gc1 ∪ gc2 ∪ gcr)

= W(T1, gcf )

i.e., ∀T, W(T, gcf) ≤ W(T1, gcf), proving that T1 is an optimum solution to gcf. Inequality [c] follows from (11), that is, T1 is an optimum solution to gc1 ∪ gc2. Inequality [d] holds because from (3), all clauses that T1 possibly violates are in gc2, whence T1 satisfies all constraints in gcr, whence W(T, gcr) ≤ W(T1, gcr).

APPENDIX B
ALTERNATE USE CASE OF URSA: COMBINING TWO STATIC ANALYSES

Another use case of our approach to interactive verification is to combine a fast yet imprecise analysis and a slow yet precise analysis, in order to balance the overall precision and scalability. We can enable this use case by instantiating Decide with a precise yet expensive analysis instead of a human user. Such an iterative combination of the two analyses is beneficial, as it may take a significant amount of time to apply the precise analysis to resolve all potential causes in the imprecise analysis. On the other hand, URSA allows the precise analysis to resolve only the causes that are relevant to the alarms, and to focus on causes with high payoffs. We next demonstrate this use case by combining two pointer analyses as an example. We first describe the base analysis, its notions of alarms and causes, and our implementation of the procedure Heuristic. Then we describe the oracle analysis. Finally, we empirically evaluate our approach.

Base Analysis. Our base pointer analysis is a flow/context-insensitive, field-sensitive, Andersen-style analysis with on-the-fly call-graph construction [73]. It uses allocation sites as the heap abstraction, and comprises 46 rules, 29 input relations, and 18 output relations.
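To give a concrete flavor of such an analysis, here is a minimal Andersen-style fixpoint over four toy statement kinds; it omits the field sensitivity and on-the-fly call-graph construction of the actual base analysis, and all names are illustrative.

```python
from collections import defaultdict

# Minimal Andersen-style, flow/context-insensitive points-to fixpoint over
# four toy statement kinds (illustrative only):
#   ("new",  v, o): v = new o    ("copy", v, w): v = w
#   ("load", v, w): v = *w       ("store", v, w): *v = w
def andersen(stmts):
    pts = defaultdict(set)  # variable or abstract object -> allocation sites
    changed = True
    while changed:          # naive fixpoint iteration
        changed = False
        for kind, a, b in stmts:
            if kind == "store":  # b flows into every object a points to
                for o in set(pts[a]):
                    if pts[b] - pts[o]:
                        pts[o] |= pts[b]
                        changed = True
                continue
            if kind == "new":
                flow = {b}
            elif kind == "copy":
                flow = set(pts[b])
            else:  # load: everything pointed to by *b flows into a
                flow = set().union(*[pts[o] for o in pts[b]])
            if flow - pts[a]:
                pts[a] |= flow
                changed = True
    return {x: set(s) for x, s in pts.items()}

pts = andersen([("new", "x", "o1"), ("copy", "y", "x"),
                ("new", "z", "o2"), ("store", "y", "z"),
                ("load", "w", "x")])
print(sorted(pts["w"]))  # w = *x picks up what o1 points to
```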

We treat points-to facts (denoted by relation pointsTo) and call-graph edges (denoted by callEdge) as the alarms, as they are directly consumed by client analyses built atop the pointer analysis. Given the large number of tuples in these two relations, we further limit the alarms to tuples in the application code. We treat the same set of tuples as the universe of potential causes, since the tuples in these two relations are used to derive each other recursively. We are not aware of effective static heuristics for pointer analysis alarms that are client-agnostic. Therefore, we only provide a Heuristic instantiation that leverages a dynamic analysis:

dynamic() = {pointsTo(v, o) | variable v is accessed in the runs and never points to object o} ∪

{callEdge(p, m) | invocation site p is reached and never invokes method m in the runs}

Table B.1: Numbers of alarms (denoted by |A|) and tuples in the universe of potential causes (denoted by |QU|) of the pointer analysis, where k stands for thousands.

             |A| false  |A| total  false%   |QU|
raytracer    56         950        5.9%     950
montecarlo   5          867        0.6%     867
sor          0          159        0%       159
elevator     3          369        0.8%     369
jspider      925        5.1k       18.1%    5.1k
hedc         1.2k       3.9k       29.8%    3.9k
ftp          5.7k       12.5k      45.5%    12.5k
weblech      440        3.9k       11.2%    3.9k
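The dynamic heuristic could be computed from execution logs roughly as follows; the log format and helper names here are hypothetical, chosen only to illustrate the definition.

```python
# Hypothetical sketch of dynamic(): from execution logs, flag as likely
# false exactly those analysis tuples whose subject (a variable or an
# invocation site) was observed at runtime but whose specific conclusion
# never occurred in any run.
def dynamic(runs):
    """runs: list of logs; each log is a list of observed events,
    ('pt', v, o) for a points-to observation or ('call', p, m)."""
    seen_pt, seen_call = set(), set()
    accessed_vars, reached_sites = set(), set()
    for run in runs:
        for event in run:
            if event[0] == "pt":
                _, v, o = event
                accessed_vars.add(v)
                seen_pt.add((v, o))
            else:
                _, p, m = event
                reached_sites.add(p)
                seen_call.add((p, m))

    def likely_false(tuples):
        out = set()
        for t in tuples:
            if t[0] == "pointsTo":
                _, v, o = t
                if v in accessed_vars and (v, o) not in seen_pt:
                    out.add(t)  # v was observed, but never pointing to o
            else:  # callEdge
                _, p, m = t
                if p in reached_sites and (p, m) not in seen_call:
                    out.add(t)  # p was reached, but never invoked m
        return out
    return likely_false

check = dynamic([[("pt", "v1", "o1"), ("call", "p1", "m1")]])
print(sorted(check({("pointsTo", "v1", "o1"), ("pointsTo", "v1", "o2"),
                    ("callEdge", "p1", "m2"), ("callEdge", "p2", "m3")})))
```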

Oracle Analysis. The oracle analysis is a query-driven k-object-sensitive pointer analysis [8]. This analysis improves upon the precision of the base pointer analysis by being simultaneously context- and object-sensitive, but it achieves this higher precision by targeting queries of interest, which in our setting are individual alarms and potential causes.

Empirical Evaluation. We use a setting that is similar to the one described in Chapter 3.2.6.1. In particular, to evaluate the effectiveness of our approach in reducing false alarm rates, we obtained answers to all the alarms and potential causes offline using the oracle analysis. Table B.1 shows the statistics of alarms and tuples in the universe of potential causes. We next describe the generalization results and prioritization results of our approach.

Figure B.1 shows the generalization results of URSA with dynamic, the only available Heuristic instantiation for the pointer analysis. To evaluate the effectiveness of dynamic, we also show the results of the ideal case where the oracle answers are used to implement Heuristic.

Figure B.1: Number of questions asked over total number of false alarms (denoted by the lower dark part of each bar) and percentage of false alarms resolved (denoted by the upper light part of each bar) by URSA for the pointer analysis.

URSA is able to eliminate 44.8% of the false alarms with an average payoff of 8× per question. Excluding the three small benchmarks with under five false alarms each, the gain rises to 63.2% and the average payoff increases to 11×. These results show that most of the false alarms can indeed be eliminated by inspecting only a few common root causes. By applying the expensive oracle analysis only to these few root causes, URSA can effectively improve the precision of the base analysis without significantly increasing the overall runtime.

In the ideal case, URSA eliminates an additional 15.5% of false alarms. While the improvement is modest for most benchmarks, an additional 39% of false alarms are eliminated on weblech in the ideal case. The reason for this anomaly is that the input set used in dynamic does not yield sufficient code coverage to produce accurate predictions for the desired root causes. We thus conclude that, overall, the dynamic instantiation is effective in identifying common root causes of false alarms.

Figure B.2 shows the prioritization results of URSA. Every point in the plots represents the number of false alarms eliminated (y-axis) and the number of questions asked (x-axis) up to the current iteration.

Figure B.2: Number of questions asked and number of false alarms resolved by URSA in each iteration for the pointer analysis (k = thousands).

As before, we compare the results of URSA to the ideal case. We observe that a few causes often yield most of the benefits. For instance, four causes can resolve 1,133 out of 5,685 false alarms on ftp, yielding a payoff of 283×. URSA successfully identifies these causes with high payoffs and poses them to the oracle analysis

in the earliest few iterations. In fact, for the first three iterations on most benchmarks, URSA asks exactly the same set of questions as the ideal setting. The results of the two settings differ only in later iterations, where the payoff becomes relatively low. We also notice that the set that gives the highest benefit can contain multiple causes (for instance, the aforementioned ftp results). The reason is that there can be multiple derivations of each alarm. If we naively search the potential causes by fixing the number of questions in advance, we can miss such causes. URSA, on the other hand, successfully finds them by solving the optimal root set problem iteratively.
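The ftp payoff quoted above can be checked by simple arithmetic: four questions resolving 1,133 false alarms average about 283 resolved alarms per question, i.e. roughly a fifth of all 5,685 false alarms from just four oracle queries.

```python
# Arithmetic check of the ftp figures quoted above: four causes resolve
# 1,133 of the 5,685 false alarms.
resolved, questions, total_false = 1133, 4, 5685
print(round(resolved / questions))             # payoff per question: ~283x
print(round(100 * resolved / total_false, 1))  # share of all false alarms
```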

The fact that URSA is able to prioritize causes with high payoffs allows us to stop the interaction of the two analyses after a few iterations and still get most of the benefits.

URSA terminates either when the expected payoff drops to one or when there are no more questions to ask. In practice, however, we might choose to stop the interaction even earlier: for instance, we might be satisfied with the current result, or we may find limited payoff in running the oracle analysis, which can take a long time compared to inspecting the remaining alarms manually.

Studying the causes with high payoffs more closely, we find that they are mainly points-to facts and call-graph edges that lead to one or more spurious call-graph edges, which in turn lead to many false tuples. Such spurious tuples are produced due to the context insensitivity of the base analysis.

REFERENCES

[1] T. Ball, V. Levin, and S. K. Rajamani, “A decade of software model checking with SLAM,” Commun. ACM, vol. 54, no. 7, pp. 68–76, 2011.

[2] P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X. Rival, “The ASTRÉE analyzer,” in Programming Languages and Systems, 14th European Symposium on Programming, ESOP 2005, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2005, Edinburgh, UK, April 4-8, 2005, Proceedings, 2005, pp. 21–30.

[3] Coverity, http://www.coverity.com, Accessed: 2017-06-02.

[4] E. M. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith, “Counterexample-guided abstraction refinement for symbolic model checking,” J. ACM, vol. 50, no. 5, pp. 752– 794, 2003.

[5] A. Aiken, “Introduction to set constraint-based program analysis,” Sci. Comput. Program., vol. 35, no. 2, pp. 79–111, 1999.

[6] S. Abiteboul, R. Hull, and V. Vianu, Datalog and Recursion. Addison-Wesley, 1995, ch. 12, pp. 271–310.

[7] Y. Smaragdakis, G. Kastrinis, and G. Balatsouras, “Introspective analysis: Context- sensitivity, across the board,” in Proceedings of the 35th ACM SIGPLAN Confer- ence on Programming Language Design and Implementation, PLDI 2014, Edin- burgh, United Kingdom, June 09-11, 2014, 2014, pp. 485–495.

[8] X. Zhang, R. Mangal, R. Grigore, M. Naik, and H. Yang, “On abstraction refine- ment for program analyses in datalog,” in Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2014, Edinburgh, United Kingdom, June 09-11, 2014, 2014, pp. 239–248.

[9] R. Mangal, X. Zhang, A. V. Nori, and M. Naik, “A user-guided approach to pro- gram analysis,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30-September 4, 2015, 2015, pp. 462–473.

[10] M. Madsen, M. Yee, and O. Lhoták, “From datalog to flix: A declarative language for fixed points on lattices,” in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2016, Santa Barbara, CA, USA, June 13-17, 2016, 2016, pp. 194–208.

[11] H. Jordan, B. Scholz, and P. Subotic, “Soufflé: On synthesis of program analyzers,” in Computer Aided Verification - 28th International Conference, CAV 2016, Toronto, ON, Canada, July 17-23, 2016, Proceedings, Part II, 2016, pp. 422–430.

[12] M. Richardson and P. M. Domingos, “Markov logic networks,” Machine Learning, vol. 62, no. 1-2, pp. 107–136, 2006.

[13] H. G. Rice, “Classes of recursively enumerable sets and their decision problems,” Transactions of the American Mathematical Society, vol. 74, no. 2, pp. 358–366, 1953.

[14] X. Zhang, R. Grigore, X. Si, and M. Naik, “Effective interactive resolution of static analysis alarms,” in Proceedings of the 32nd Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2017, October 25-27, 2017, Vancouver, British Columbia, Canada, 2017.

[15] J. Whaley and M. S. Lam, “Cloning-based context-sensitive pointer alias analy- sis using binary decision diagrams,” in Proceedings of the 25th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2004, Washington, DC, USA, June 9-11, 2004, 2004, pp. 131–144.

[16] M. Bravenboer and Y. Smaragdakis, “Strictly declarative specification of sophisti- cated points-to analyses,” in Proceedings of the 24th Annual ACM SIGPLAN Con- ference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2009, October 25-29, 2009, Orlando, Florida, USA, 2009, pp. 243–262.

[17] G. Kastrinis and Y. Smaragdakis, “Hybrid context-sensitivity for points-to analy- sis,” in Proceedings of the 34th Annual ACM SIGPLAN Conference on Program- ming Language Design and Implementation, PLDI 2013, Seattle, WA, USA, June 16-19, 2013, 2013, pp. 423–434.

[18] Y. Smaragdakis, M. Bravenboer, and O. Lhotak,´ “Pick your contexts well: Under- standing object-sensitivity,” in Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011, 2011, pp. 17–30.

[19] J. Whaley, D. Avots, M. Carbin, and M. S. Lam, “Using datalog with binary de- cision diagrams for program analysis,” in Programming Languages and Systems, Third Asian Symposium, APLAS 2005, Tsukuba, Japan, November 2-5, 2005, Pro- ceedings, 2005, pp. 97–118.

[20] Y. Smaragdakis and M. Bravenboer, “Using datalog for fast and easy program anal- ysis,” in Datalog Reloaded - First International Workshop, Datalog 2010, Oxford, UK, March 16-19, 2010. Revised Selected Papers, 2010, pp. 245–251.

[21] O. Lhoták and L. J. Hendren, “Jedd: A BDD-based relational extension of java,” in Proceedings of the 25th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2004, Washington, DC, USA, June 9-11, 2004, 2004, pp. 158–169.

[22] K. Hoder, N. Bjørner, and L. M. de Moura, “µZ - an efficient engine for fixed points with constraints,” in Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings, 2011, pp. 457–462.

[23] J. Queille and J. Sifakis, “Specification and verification of concurrent systems in CESAR,” in International Symposium on Programming, 5th Colloquium, Torino, Italy, April 6-8, 1982, Proceedings, 1982, pp. 337–351.

[24] T. A. Henzinger, R. Jhala, R. Majumdar, and K. L. McMillan, “Abstractions from proofs,” in Proceedings of the 31st ACM SIGPLAN-SIGACT Symposium on Princi- ples of Programming Languages, POPL 2004, Venice, Italy, January 14-16, 2004, 2004, pp. 232–244.

[25] S. Chaki, E. M. Clarke, A. Groce, S. Jha, and H. Veith, “Modular verification of software components in C,” in Proceedings of the 25th International Conference on Software Engineering, May 3-10, 2003, Portland, Oregon, USA, 2003, pp. 385– 395.

[26] R. Grigore and H. Yang, “Abstraction refinement guided by a learnt probabilistic model,” in Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, 2016, pp. 485–498.

[27] A. Milanova, A. Rountev, and B. G. Ryder, “Parameterized object sensitivity for points-to and side-effect analyses for java,” in Proceedings of the International Symposium on Software Testing and Analysis, ISSTA 2002, Roma, Italy, July 22- 24, 2002, 2002, pp. 1–11.

[28] S. J. Fink, E. Yahav, N. Dor, G. Ramalingam, and E. Geay, “Effective typestate verification in the presence of aliasing,” ACM Trans. Softw. Eng. Methodol., vol. 17, no. 2, 9:1–9:34, 2008.

[29] T. W. Reps, S. Horwitz, and S. Sagiv, “Precise interprocedural dataflow analysis via graph reachability,” in Proceedings of the 22nd ACM SIGPLAN-SIGACT Sym- posium on Principles of Programming Languages, POPL 1995, San Francisco, California, USA, January 23-25, 1995, 1995, pp. 49–61.

[30] X. Zhang, M. Naik, and H. Yang, “Finding optimum abstractions in parametric dataflow analysis,” in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2013, Seattle, WA, USA, June 16-19, 2013, 2013, pp. 365–376.

[31] S. Grebenshchikov, A. Gupta, N. P. Lopes, C. Popeea, and A. Rybalchenko, “HSF(C): A software verifier based on horn clauses - (competition contribution),” in Tools and Algorithms for the Construction and Analysis of Systems - 18th International Conference, TACAS 2012, Held as Part of the European Joint Conferences on The- ory and Practice of Software, ETAPS 2012, Tallinn, Estonia, March 24 - April 1, 2012. Proceedings, 2012, pp. 549–551.

[32] N. Bjørner, K. L. McMillan, and A. Rybalchenko, “On solving universally quan- tified horn clauses,” in Static Analysis - 20th International Symposium, SAS 2013, Seattle, WA, USA, June 20-22, 2013. Proceedings, 2013, pp. 105–125.

[33] T. A. Beyene, C. Popeea, and A. Rybalchenko, “Solving existentially quantified horn clauses,” in Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013. Proceedings, 2013, pp. 869–882.

[34] P. Liang and M. Naik, “Scaling abstraction refinement via pruning,” in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, 2011, pp. 590– 601.

[35] W. Lee, W. Lee, and K. Yi, “Sound non-statistical clustering of static analysis alarms,” in Verification, Model Checking, and Abstract Interpretation - 13th Inter- national Conference, VMCAI 2012, Philadelphia, PA, USA, January 22-24, 2012. Proceedings, 2012, pp. 299–314.

[36] W. Le and M. L. Soffa, “Path-based fault correlations,” in Proceedings of the 18th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2010, Santa Fe, NM, USA, November 7-11, 2010, 2010, pp. 307–316.

[37] Apache FTP Server, http://mina.apache.org/ftpserver-project/.

[38] M. Naik, Chord: A program analysis platform for Java, http : / / jchord . googlecode.com/, 2006.

[39] I. Ab´ıo and P. J. Stuckey, “Encoding linear constraints into SAT,” in Principles and Practice of Constraint Programming - 20th International Conference, CP 2014, Lyon, France, September 8-12, 2014. Proceedings, 2014, pp. 75–91.

[40] B. Livshits, M. Sridharan, Y. Smaragdakis, O. Lhoták, J. N. Amaral, B. E. Chang, S. Z. Guyer, U. P. Khedker, A. Møller, and D. Vardoulakis, “In defense of soundiness: A manifesto,” Commun. ACM, vol. 58, no. 2, pp. 44–46, 2015.

[41] N. Ayewah, D. Hovemeyer, J. D. Morgenthaler, J. Penix, and W. Pugh, “Using static analysis to find bugs,” IEEE Software, vol. 25, no. 5, pp. 22–29, 2008.

[42] T. Copeland, Pmd applied, 2005.

[43] M. Naik, A. Aiken, and J. Whaley, “Effective static race detection for java,” in Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, PLDI 2006, Ottawa, Ontario, Canada, June 11-14, 2006, 2006, pp. 308–319.

[44] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin, “Dynamically discover- ing likely program invariants to support program evolution,” IEEE Trans. Software Eng., vol. 27, no. 2, pp. 99–123, 2001.

[45] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993, ISBN: 1-55860-238-0.

[46] Upwork, http://www.upwork.com, Accessed: 2015-11-19, 2015.

[47] I. Dillig, T. Dillig, and A. Aiken, “Automated error diagnosis using abductive inference,” in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2012, Beijing, China, June 11-16, 2012, 2012, pp. 181–192.

[48] T. Kremenek, K. Ashcraft, J. Yang, and D. Engler, “Correlation exploitation in error ranking,” in Proceedings of the 12th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2004, Newport Beach, CA, USA, October 31 - November 6, 2004, 2004, pp. 83–93.

[49] H. Zhu, T. Dillig, and I. Dillig, “Automated inference of library specifications for source-sink property verification,” in Programming Languages and Systems - 11th Asian Symposium, APLAS 2013, Melbourne, VIC, Australia, December 9-11, 2013. Proceedings, 2013, pp. 290–306.

[50] O. Bastani, S. Anand, and A. Aiken, “Specification inference using context-free language reachability,” in Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2015, Mumbai, In- dia, January 15-17, 2015, 2015, pp. 553–566.

[51] L. N. Q. Do, K. Ali, B. Livshits, E. Bodden, J. Smith, and E. R. Murphy-Hill, “Just- in-time static analysis,” in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2017, Santa Barbara, CA, USA, July 10 - 14, 2017, 2017, pp. 307–317.

[52] O. Padon, K. L. McMillan, A. Panda, M. Sagiv, and S. Shoham, “Ivy: Safety verification by interactive generalization,” in Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2016, Santa Barbara, CA, USA, June 13-17, 2016, 2016, pp. 614–630.

[53] Y. Jung, J. Kim, J. Shin, and K. Yi, “Taming false alarms from a domain-unaware C analyzer by a bayesian statistical post analysis,” in Static Analysis, 12th Inter- national Symposium, SAS 2005, London, UK, September 7-9, 2005, Proceedings, 2005, pp. 203–217.

[54] T. Kremenek and D. Engler, “Z-ranking: Using statistical analysis to counter the impact of static analysis approximations,” in Static Analysis, 10th International Symposium, SAS 2003, San Diego, CA, USA, June 11-13, 2003, Proceedings, 2003, pp. 295–315.

[55] S. Blackshear and S. Lahiri, “Almost-correct specifications: A modular seman- tic framework for assigning confidence to warnings,” in Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementa- tion, PLDI 2013, Seattle, WA, USA, June 16-19, 2013, 2013, pp. 365–376.

[56] S. Hallem, B. Chelf, Y. Xie, and D. R. Engler, “A system and language for build- ing system-specific, static analyses,” in Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2002, Berlin, Germany, June 17-19, 2002, 2002, pp. 69–82.

[57] M. Renieris and S. P. Reiss, “Fault localization with nearest neighbor queries,” in 18th IEEE International Conference on Automated Software Engineering (ASE 2003), 6-10 October 2003, Montreal, Canada, 2003, pp. 30–39.

[58] T. Ball, M. Naik, and S. K. Rajamani, “From symptom to cause: Localizing errors in counterexample traces,” in Proceedings of the 30th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2003, New Orleans, Louisisana, USA, January 15-17, 2003, 2003, pp. 97–105.

[59] B. Liblit, M. Naik, A. X. Zheng, A. Aiken, and M. I. Jordan, “Scalable statistical bug isolation,” in Proceedings of the ACM SIGPLAN 2005 Conference on Program- ming Language Design and Implementation, PLDI 2005, Chicago, IL, USA, June 12-15, 2005, 2005, pp. 15–26.

[60] J. A. Jones and M. J. Harrold, “Empirical evaluation of the tarantula automatic fault-localization technique,” in 20th IEEE/ACM International Conference on Au- tomated Software Engineering (ASE 2005), November 7-11, 2005, Long Beach, CA, USA, 2005, pp. 273–282.

[61] J. A. Jones, M. J. Harrold, and J. T. Stasko, “Visualization of test information to assist fault localization,” in Proceedings of the 24th International Conference on Software Engineering, ICSE 2002, 19-25 May 2002, Orlando, Florida, USA, 2002, pp. 467–477.

[62] D. von Dincklage and A. Diwan, “Optimizing programs with intended semantics,” in Proceedings of the 24th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2009, October 25- 29, 2009, Orlando, Florida, USA, 2009, pp. 409–424.

[63] ——, “Integrating program analyses with programmer productivity tools,” Softw., Pract. Exper., vol. 41, no. 7, pp. 817–840, 2011.

[64] G. Nelson and D. C. Oppen, “Simplification by cooperating decision procedures,” ACM Trans. Program. Lang. Syst., vol. 1, no. 2, pp. 245–257, 1979.

[65] M. Naik, H. Yang, G. Castelnuovo, and M. Sagiv, “Abstractions from tests,” in Proceedings of the 39th ACM SIGPLAN-SIGACT Symposium on Principles of Pro- gramming Languages, POPL 2012, Philadelphia, Pennsylvania, USA, January 22- 28, 2012, 2012, pp. 373–386.

[66] H. Oh, W. Lee, K. Heo, H. Yang, and K. Yi, “Selective x-sensitive analysis guided by impact pre-analysis,” ACM Trans. Program. Lang. Syst., vol. 38, no. 2, 6:1–6:45, 2016.

[67] S. Wei, O. Tripp, B. G. Ryder, and J. Dolby, “Revamping javascript static analysis via localization and remediation of root causes of imprecision,” in Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA, November 13-18, 2016, 2016, pp. 487– 498.

[68] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder, “Automatically classifying benign and harmful data races using replay analysis,” in Proceedings of the 28th ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, PLDI 2007, San Diego, California, USA, June 10-13, 2007, 2007, pp. 22–31.

[69] M. Naik, C. Park, K. Sen, and D. Gay, “Effective static deadlock detection,” in Proceedings of the 31st International Conference on Software Engineering, ICSE 2009, May 16-24, 2009, Vancouver, Canada, Proceedings, 2009, pp. 386–396.

[70] M. C. Martin, V. B. Livshits, and M. S. Lam, “Finding application errors and se- curity flaws using PQL: a program query language,” in Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems,

Languages, and Applications, OOPSLA 2005, October 16-20, 2005, San Diego, CA, USA, 2005, pp. 365–383.

[71] S. Guarnieri and V. B. Livshits, “GATEKEEPER: mostly static enforcement of security and reliability policies for javascript code,” in 18th USENIX Security Sym- posium, Montreal, Canada, August 10-14, 2009, Proceedings, 2009, pp. 151–168.

[72] V. B. Livshits, J. Whaley, and M. S. Lam, “Reflection analysis for java,” in Pro- gramming Languages and Systems, Third Asian Symposium, APLAS 2005, Tsukuba, Japan, November 2-5, 2005, Proceedings, 2005, pp. 139–160.

[73] O. Lhoták, “Spark: A flexible points-to analysis framework for Java,” Master's thesis, McGill University, 2002.

[74] O. Lhoták and L. J. Hendren, “Context-sensitive points-to analysis: Is it worth it?” In Compiler Construction, 15th International Conference, CC 2006, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2006, Vienna, Austria, March 30-31, 2006, Proceedings, 2006, pp. 47–64.

[75] M. Sridharan and R. Bodík, “Refinement-based context-sensitive points-to analysis for java,” in Proceedings of the 26th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2006, Ottawa, Ontario, Canada, June 11-14, 2006, 2006, pp. 387–400.

[76] V. B. Livshits and M. S. Lam, “Finding security vulnerabilities in java applications with static analysis,” in Proceedings of the 14th USENIX Security Symposium, Bal- timore, MD, USA, July 31 - August 5, 2005, 2005.

[77] S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khan, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. L. Hosking, M. Jump, H. B. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanovic, T. VanDrunen, D. von Dincklage, and B. Wiedermann, “The dacapo benchmarks: Java benchmarking development and analysis,” in Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2006, October 22-26, 2006, Portland, Oregon, USA, 2006, pp. 169–190.

[78] Securibench Micro, http://suif.stanford.edu/˜livshits/work/ securibench-micro/index.html.

[79] Pjbench, https://code.google.com/p/pjbench/.

[80] F. Niu, C. Ré, A. Doan, and J. W. Shavlik, “Tuffy: Scaling up statistical inference in markov logic networks using an RDBMS,” PVLDB, vol. 4, no. 6, pp. 373–384, 2011.

217 [81] M. Richardson and P. M. Domingos, “Markov logic networks,” Machine Learning, vol. 62, no. 1-2, pp. 107–136, 2006.

[82] A. T. Chaganty, A. Lal, A. V. Nori, and S. K. Rajamani, “Combining relational learning with SMT solvers using CEGAR,” in Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013. Proceedings, 2013, pp. 447–462.

[83] J. Noessner, M. Niepert, and H. Stuckenschmidt, “Rockit: Exploiting parallelism and symmetry for MAP inference in statistical relational models,” in Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, July 14-18, 2013, Bellevue, Washington, USA, 2013.

[84] S. Riedel, “Improving the accuracy and efficiency of MAP inference for markov logic,” in UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008, 2008, pp. 468–475.

[85] R. Mangal, X. Zhang, A. Kamath, A. V. Nori, and M. Naik, “Scaling relational in- ference using proofs and refutations,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA., 2016, pp. 3278–3286.

[86] E. I. Psallida, “Relational representation of the LLVM intermediate language,” B.S. Thesis, University of Athens, Jan. 2014.

[87] K. Hoder, N. Bjørner, and L. M. de Moura, “µZ - an efficient engine for fixed points with constraints,” in Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings, 2011, pp. 457–462.

[88] T. Kremenek, P. Twohey, G. Back, A. Ng, and D. Engler, “From uncertainty to belief: Inferring the specification within,” in 7th Symposium on Operating Systems Design and Implementation (OSDI 2006), November 6-8, Seattle, WA, USA, 2006, pp. 161–176.

[89] V. B. Livshits, A. V. Nori, S. K. Rajamani, and A. Banerjee, “Merlin: Specifica- tion inference for explicit information flow problems,” in Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementa- tion, PLDI 2009, Dublin, Ireland, June 15-21, 2009, 2009, pp. 75–86.

[90] N. E. Beckman and A. V. Nori, “Probabilistic, modular and scalable inference of typestate specifications,” in Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, 2011, pp. 211–221.

[91] X. Zhang, R. Mangal, A. V. Nori, and M. Naik, “Query-guided maximum satisfiability,” in Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20-22, 2016, 2016, pp. 109–122.

[92] P. Singla and P. M. Domingos, “Discriminative training of markov logic networks,” in Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, July 9-13, 2005, Pittsburgh, Pennsylvania, USA, 2005, pp. 868–873.

[93] ——, “Memory-efficient inference in relational domains,” in Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth In- novative Applications of Artificial Intelligence Conference, July 16-20, 2006, Boston, Massachusetts, USA, 2006, pp. 488–493.

[94] C. Mencía, A. Previti, and J. Marques-Silva, “Literal-based MCS extraction,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 2015, pp. 1973–1979.

[95] M. Bilenko and R. J. Mooney, “Adaptive duplicate detection using learnable string similarity measures,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24 - 27, 2003, 2003, pp. 39–48.

[96] H. Poon, P. M. Domingos, and M. Sumner, “A general method for reducing the complexity of relational inference and its application to MCMC,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, 2008, pp. 1075–1080.

[97] R. de Salvo Braz, E. Amir, and D. Roth, “Lifted first-order probabilistic infer- ence,” in IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005, 2005, pp. 1319–1325.

[98] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling, “Lifted probabilistic inference with counting formulas,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, 2008, pp. 1062–1068.

[99] D. Poole, “First-order probabilistic inference,” in IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mex- ico, August 9-15, 2003, 2003, pp. 985–991.

[100] C. H. Papadimitriou, Computational complexity. Addison-Wesley, 1994, ISBN: 978-0-201-53082-7.

[101] S. A. Cook, “The complexity of theorem-proving procedures,” in Proceedings of the 3rd Annual ACM Symposium on Theory of Computing, May 3-5, 1971, Shaker Heights, Ohio, USA, 1971, pp. 151–158.

[102] J. Gu, P. W. Purdom, J. Franco, and B. W. Wah, “Algorithms for the satisfiabil- ity (SAT) problem: A survey,” in Satisfiability Problem: Theory and Applications, Proceedings of a DIMACS Workshop, Piscataway, New Jersey, USA, March 11-13, 1996, 1996, pp. 19–152.

[103] M. Jose and R. Majumdar, “Cause clue clauses: Error localization using maximum satisfiability,” in Proceedings of the 32nd ACM SIGPLAN Conference on Program- ming Language Design and Implementation, PLDI 2011, San Jose, CA, USA, June 4-8, 2011, 2011, pp. 437–446.

[104] ——, “Bug-assist: Assisting fault localization in ANSI-C programs,” in Computer Aided Verification - 23rd International Conference, CAV 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings, 2011, pp. 504–509.

[105] M. Richardson and P. M. Domingos, “Markov logic networks,” Machine Learning, vol. 62, no. 1-2, pp. 107–136, 2006.

[106] P. M. Domingos, S. Kok, D. Lowd, H. Poon, M. Richardson, and P. Singla, “Markov logic,” in Probabilistic Inductive Logic Programming - Theory and Applications, 2008, pp. 92–117.

[107] S. Miyazaki, K. Iwama, and Y. Kambayashi, “Database queries as combinatorial optimization problems,” in CODAS, 1996, pp. 477–483.

[108] M. Aref, B. Kimelfeld, E. Pasalic, and N. Vasiloglou, “Extending Datalog with analytics in LogicBlox,” in Proceedings of the 9th Alberto Mendelzon International Workshop on Foundations of Data Management, Lima, Peru, May 6-8, 2015, 2015.

[109] H. Xu, R. A. Rutenbar, and K. A. Sakallah, “Sub-SAT: A formulation for relaxed Boolean satisfiability with applications in routing,” in Proceedings of 2002 International Symposium on Physical Design, ISPD 2002, Del Mar, CA, USA, April 7-10, 2002, 2002, pp. 182–187.

[110] A. Graça, I. Lynce, J. Marques-Silva, and A. L. Oliveira, “Efficient and accurate haplotype inference by combining parsimony and pedigree information,” in Algebraic and Numeric Biology - 4th International Conference, ANB 2010, Hagenberg, Austria, July 31 - August 2, 2010, Revised Selected Papers, 2010, pp. 38–56.

[111] D. M. Strickland, E. R. Barnes, and J. S. Sokol, “Optimal protein structure alignment using maximum cliques,” Operations Research, vol. 53, no. 3, pp. 389–402, 2005.

[112] M. Vasquez and J. Hao, “A “logic-constrained” knapsack formulation and a tabu algorithm for the daily photograph scheduling of an earth observation satellite,” Computational Optimization and Applications, vol. 20, no. 2, pp. 137–157, 2001.

[113] Q. Yang, K. Wu, and Y. Jiang, “Learning action models from plan examples using weighted MAX-SAT,” Artificial Intelligence, vol. 171, no. 2-3, pp. 107–143, Feb. 2007.

[114] F. Juma, E. I. Hsu, and S. A. McIlraith, “Preference-based planning via MaxSAT,” in Advances in Artificial Intelligence - 25th Canadian Conference on Artificial Intelligence, Canadian AI 2012, Toronto, ON, Canada, May 28-30, 2012. Proceedings, 2012, pp. 109–120.

[115] N. Bjørner and N. Narodytska, “Maximum satisfiability using cores and correction sets,” in Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, 2015, pp. 246–252.

[116] A. Morgado, C. Dodaro, and J. Marques-Silva, “Core-guided MaxSAT with soft cardinality constraints,” in Principles and Practice of Constraint Programming - 20th International Conference, CP 2014, Lyon, France, September 8-12, 2014. Proceedings, 2014, pp. 564–573.

[117] A. Morgado, F. Heras, M. H. Liffiton, J. Planes, and J. Marques-Silva, “Iterative and core-guided MaxSAT solving: A survey and assessment,” Constraints, vol. 18, no. 4, pp. 478–534, 2013.

[118] C. Ansótegui, M. L. Bonet, and J. Levy, “SAT-based MaxSAT algorithms,” Artificial Intelligence, vol. 196, pp. 77–105, 2013.

[119] J. Marques-Silva and J. Planes, “Algorithms for maximum satisfiability using unsatisfiable cores,” in Design, Automation and Test in Europe, DATE 2008, Munich, Germany, March 10-14, 2008, 2008, pp. 408–413.

[120] R. Martins, S. Joshi, V. M. Manquinho, and I. Lynce, “Incremental cardinality constraints for MaxSAT,” in Principles and Practice of Constraint Programming - 20th International Conference, CP 2014, Lyon, France, September 8-12, 2014. Proceedings, 2014, pp. 531–548.

[121] N. Narodytska and F. Bacchus, “Maximum satisfiability using core-guided MaxSAT resolution,” in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27-31, 2014, Québec City, Québec, Canada, 2014, pp. 2717–2723.

[122] A. Ignatiev, A. Morgado, V. M. Manquinho, I. Lynce, and J. Marques-Silva, “Progression in maximum satisfiability,” in ECAI 2014 - 21st European Conference on Artificial Intelligence, 18-22 August 2014, Prague, Czech Republic - Including Prestigious Applications of Intelligent Systems (PAIS 2014), 2014, pp. 453–458.

[123] F. Heras, A. Morgado, and J. Marques-Silva, “Core-guided binary search algorithms for maximum satisfiability,” in Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2011, San Francisco, California, USA, August 7-11, 2011, 2011.

[124] R. Mangal, X. Zhang, A. V. Nori, and M. Naik, “Volt: A lazy grounding framework for solving very large MaxSAT instances,” in Theory and Applications of Satisfiability Testing - SAT 2015 - 18th International Conference, Austin, TX, USA, September 24-27, 2015, Proceedings, 2015, pp. 299–306.

[125] M. Janota, MiFuMax — a literate MaxSAT solver, 2013.

[126] AI Genealogy Project, http://aigp.eecs.umich.edu.

[127] DBLP: Computer science bibliography, http://dblp.uni-trier.de.

[128] J. Marques-Silva, F. Heras, M. Janota, A. Previti, and A. Belov, “On computing minimal correction subsets,” in Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI 2013, Beijing, China, August 3-9, 2013, 2013, pp. 615–622.

[129] H. A. Kautz, B. Selman, and Y. Jiang, “A general stochastic approach to solving problems with hard and soft constraints,” in Satisfiability Problem: Theory and Applications, Proceedings of a DIMACS Workshop, Piscataway, New Jersey, USA, March 11-13, 1996, 1996, pp. 573–586.

[130] V. V. Vazirani, Approximation algorithms. Springer Science & Business Media, 2013.

[131] Max-SAT Evaluation, http://www.maxsat.udl.cat/.

[132] L. Getoor and B. Taskar, Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press, 2007.

[133] A. Kimmig, S. H. Bach, M. Broecheler, B. Huang, and L. Getoor, “A short introduction to probabilistic soft logic,” in NIPS Workshop on Probabilistic Programming: Foundations and Applications, 2012.

[134] W. Y. Wang, K. Mazaitis, N. Lao, and W. W. Cohen, “Efficient inference and learning in a large knowledge base - reasoning with extracted information using a locally groundable first-order probabilistic logic,” Machine Learning, vol. 100, no. 1, pp. 101–126, 2015.

[135] S. Horwitz, T. W. Reps, and D. Binkley, “Interprocedural slicing using dependence graphs,” in Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, PLDI 1988, Atlanta, Georgia, USA, June 22-24, 1988, 1988, pp. 35–46.

[136] T. A. Henzinger, R. Jhala, R. Majumdar, and G. Sutre, “Software verification with BLAST,” in Model Checking Software, 10th International SPIN Workshop, Portland, OR, USA, May 9-10, 2003, Proceedings, 2003, pp. 235–239.

[137] S. Z. Guyer and C. Lin, “Client-driven pointer analysis,” in Static Analysis, 10th International Symposium, SAS 2003, San Diego, CA, USA, June 11-13, 2003, Proceedings, 2003, pp. 214–236.

[138] M. Sridharan, D. Gopan, L. Shan, and R. Bodík, “Demand-driven points-to analysis for Java,” in Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2005, October 16-20, 2005, San Diego, CA, USA, 2005, pp. 59–76.

[139] N. Bjørner and A. Phan, “νZ - maximal satisfaction with Z3,” in 6th International Symposium on Symbolic Computation in Software Science, SCSS 2014, Gammarth, La Marsa, Tunisia, December 7-8, 2014, 2014, pp. 1–9.

[140] Y. Li, A. Albarghouthi, Z. Kincaid, A. Gurfinkel, and M. Chechik, “Symbolic optimization with SMT solvers,” in The 41st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2014, San Diego, CA, USA, January 20-21, 2014, 2014, pp. 607–618.

[141] D. Larraz, A. Oliveras, E. Rodríguez-Carbonell, and A. Rubio, “Proving termination of imperative programs using Max-SMT,” in Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, 2013, pp. 218–225.

[142] D. Larraz, K. Nimkar, A. Oliveras, E. Rodríguez-Carbonell, and A. Rubio, “Proving non-termination using Max-SMT,” in Computer Aided Verification - 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014. Proceedings, 2014, pp. 779–796.

[143] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient SMT solver for verifying deep neural networks,” in Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, 2017, pp. 97–117.

[144] X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, 2017, pp. 3–29.

[145] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. V. Nori, and A. Criminisi, “Measuring neural net robustness with constraints,” in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, NIPS 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 2613–2621.

[146] L. G. Valiant, “A theory of the learnable,” Commun. ACM, vol. 27, no. 11, pp. 1134– 1142, 1984.

[147] S. Muggleton and L. D. Raedt, “Inductive logic programming: Theory and methods,” J. Log. Program., vol. 19/20, pp. 629–679, 1994.

[148] R. Alur, R. Bodík, G. Juniwal, M. M. K. Martin, M. Raghothaman, S. A. Seshia, R. Singh, A. Solar-Lezama, E. Torlak, and A. Udupa, “Syntax-guided synthesis,” in Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, 2013, pp. 1–8.

[149] K. A. Ross, “Modular stratification and magic sets for datalog programs with negation,” J. ACM, vol. 41, no. 6, pp. 1216–1266, 1994.

[150] D. Poole, “First-order probabilistic inference,” in IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, 2003, pp. 985–991.

[151] R. de Salvo Braz, E. Amir, and D. Roth, “Lifted first-order probabilistic inference,” in IJCAI-05, Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, July 30 - August 5, 2005, 2005, pp. 1319–1325.

[152] P. Singla and P. M. Domingos, “Lifted first-order belief propagation,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, 2008, pp. 1094–1099.

[153] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling, “Lifted probabilistic inference with counting formulas,” in Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, 2008, pp. 1062–1068.

[154] G. Van den Broeck, N. Taghipour, W. Meert, J. Davis, and L. D. Raedt, “Lifted probabilistic inference by first-order knowledge compilation,” in IJCAI 2011, Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16-22, 2011, 2011, pp. 2178–2185.
