Rethinking the Application-Database Interface Alvin K. Cheung
Rethinking the Application-Database Interface
by
Alvin K. Cheung

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2015

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 28, 2015
Certified by: Samuel Madden, Professor, Thesis Supervisor
Certified by: Armando Solar-Lezama, Associate Professor, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students

Abstract

Applications that interact with database management systems (DBMSs) are ubiquitous in our daily lives. Such database applications are usually hosted on an application server and perform many small accesses over the network to a DBMS hosted on a database server to retrieve data for processing. For decades, the database and programming systems research communities have worked on optimizing such applications from different perspectives: database researchers have built highly efficient DBMSs, and programming systems researchers have developed specialized compilers and runtime systems for hosting applications. However, there has been relatively little work that examines the interface between these two software layers to improve application performance. In this thesis, I show how making use of application semantics and optimizing across these layers of the software stack can help us improve the performance of database applications.
In particular, I describe three projects that optimize database applications by looking at both the programming system and the DBMS in tandem. By carefully revisiting the interface between the DBMS and the application, and by applying a mix of declarative database optimization and modern program analysis and synthesis techniques, I show that speedups of multiple orders of magnitude are possible in real-world applications. I conclude by highlighting future work in the area and proposing a vision for automatically generating application-specific data stores.

Thesis Supervisor: Samuel Madden
Title: Professor

Thesis Supervisor: Armando Solar-Lezama
Title: Associate Professor

To my family, friends, and mentors.

Acknowledgments

Carrying out the work in this thesis would not have been possible without the help of many others. First and foremost, I am very fortunate to have two great advisors with whom I interact on a daily basis: Sam Madden and Armando Solar-Lezama. When I was an undergraduate student with close to zero knowledge of database systems, Sam courageously took me under his wing and taught me everything from selecting optimal join orders to thinking positively in life. I am grateful for the amount of academic freedom that Sam allows me in exploring different areas of computer science, and from him I learned the importance of picking problems that are both scientifically interesting and relevant to others outside of the computer science world. After moving out of Boston, I will miss his afternoon “wassup guys” (in a baritone voice) checkups, where he would hang out in our office and discuss everything from the news to the best chocolate in town. In addition to Sam, Armando has been my other constant source of inspiration. Armando knows a great deal about programming systems, computer science, and life in general, and I learned a lot from our many hours of discussions and of working through various intricate proofs and subtle concepts.
Armando has always been accessible, and his is a door I can knock on at any time to discuss any topic. I will always remember the many sunrises that we saw in Stata after pulling all-nighters for conference deadlines. The amount of dedication that Armando puts into his students is simply amazing. I could not have asked for better advisors than Armando and Sam.

I am also indebted to Andrew Myers, who introduced me to the exciting world of programming systems research. Andrew has always been extremely patient in explaining programming language concepts to a novice (me), and I learned so much about attention to detail from his meticulous edits to manuscripts. Thank you also to Michael Stonebraker for his honest and insightful advice, and for agreeing to be on my dissertation committee. I would also like to express gratitude to Martin Rinard and Daniel Jackson, who were members of my qualification exam committee. They have definitely shaped me into a better researcher throughout my graduate career. Finally, I am grateful to my mentor Rakesh Agrawal, who introduced me to computer science research and told me to “go east!” and move to Boston for graduate studies.

My PhD journey would not be complete without the past and present illustrious members of the Database and Computer-Aided Programming groups.
I have received so much intellectual and emotional support from all of them, along with all my friends on the top three floors of Stata: Ramesh Chandra, Shuo Deng, Shyam Gollakota, Shoaib Kamil, Eunsuk Kang, Taesoo Kim, Yandong Mao, Ravi Netravali, Jonathan Perry, Lenin Ravindranath, Meelap Shah, Anirudh Sivaraman, Arvind Thiagarajan, Xi Wang, and many other music group partners and friends on campus and around Boston with whom I shared laughs, music, and food: Clement Chan, Godine Chan, Poting Cheung, Wang-Chi Cheung, Nicole Fong, Hee-Sun Han, Sueann Lee, Jenny Man, Chen Sun, Gary Tam, Liang Jie Wong, and I am sure I have missed many others.

Last but not least, I thank my parents and my family for their endless love and support throughout this journey. I owe all my accomplishments to them.

Contents

1 Introduction
  1.1 Architecture of Database Applications
  1.2 Challenges in Implementing Database Applications
  1.3 Thesis Contributions
2 QBS: Determining How to Execute
  2.1 Interacting with the DBMS
  2.2 QBS Overview
    2.2.1 QBS Architecture
  2.3 Theory of Finite Ordered Relations
    2.3.1 Basics
    2.3.2 Translating to SQL
  2.4 Synthesis of Invariants and Postconditions
    2.4.1 Computing Verification Conditions
    2.4.2 Constraint based synthesis
    2.4.3 Inferring the Space of Possible Invariants
    2.4.4 Creating Templates for Postconditions
    2.4.5 Optimizations
  2.5 Formal Validation and Source Transformation
    2.5.1 Object Aliases
  2.6 Preprocessing of Input Programs
    2.6.1 Generating initial code fragments
    2.6.2 Identifying code fragments for query inference
    2.6.3 Compilation to kernel language
  2.7 Experiments
    2.7.1 Real-World Evaluation
    2.7.2 Performance Comparisons
    2.7.3 Advanced Idioms
  2.8 Summary
3 PYXIS: Deciding Where to Execute
  3.1 Code Partitioning Across Servers
  3.2 Overview
  3.3 Running PyxIL Programs
    3.3.1 A PyxIL Partitioning
    3.3.2 State Synchronization
  3.4 Partitioning PYXIS code
    3.4.1 Profiling
    3.4.2 The Partition Graph
    3.4.3 Optimization Using Integer Programming
    3.4.4 Statement Reordering
    3.4.5 Insertion of Synchronization Statements
  3.5 PyxIL compiler
    3.5.1 Representing Execution Blocks
    3.5.2 Interoperability with Existing Modules
  3.6 PYXIS runtime system
    3.6.1 General Operations
    3.6.2 Program State Synchronization
    3.6.3 Selecting a Partitioning Dynamically
  3.7 Experiments
    3.7.1 TPC-C Experiments
    3.7.2 TPC-W
    3.7.3 Microbenchmark 1
    3.7.4 Microbenchmark 2
  3.8 Summary
4 SLOTH: Reducing Communication Overhead
  4.1 Introduction
  4.2 Overview
  4.3 Compiling to Lazy Semantics
    4.3.1 Code Simplification
    4.3.2 Thunk Conversion
    4.3.3 Compiling Query Calls
    4.3.4 Compiling Method Calls
    4.3.5 Class Definitions and Heap Operations
    4.3.6 Evaluating Thunks
    4.3.7 Limitations
  4.4 Formal Semantics
    4.4.1 Program Model
    4.4.2 Standard Evaluation Semantics
    4.4.3 Semantics of Extended Lazy Evaluation
  4.5 Soundness of Extended Lazy Evaluation
  4.6 Being Even Lazier
    4.6.1 Selective Compilation
    4.6.2 Deferring Control Flow Evaluations
    4.6.3 Coalescing Thunks
  4.7 Implementation
  4.8 Experiments
    4.8.1 Page Load Experiments
    4.8.2 Throughput Experiments
    4.8.3 Time Breakdown Comparisons