Replication of Concurrent Applications in a Shared Memory Multikernel

Yuzhong Wen

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Application

Binoy Ravindran, Chair
Ali R. Butt, Co-Chair
Dongyoon Lee

June 17, 2016
Blacksburg, Virginia

Keywords: State Machine Replication, Runtime Systems, Deterministic System, System Software

Copyright 2016, Yuzhong Wen

Replication of Concurrent Applications in a Shared Memory Multikernel

Yuzhong Wen

(ABSTRACT)

State Machine Replication (SMR) has become the de facto methodology for building replication-based fault-tolerant systems. Current SMR systems usually involve multiple machines, each acting as a replica of the others. However, having multiple machines increases infrastructure cost, both in hardware and in power consumption. To tolerate non-critical CPU and memory failures that do not crash the entire machine, there is no need for extra machines; intra-machine replication is a good fit for this scenario. However, current intra-machine replication approaches do not provide strong isolation among the replicas, which allows faults to propagate from one replica to another.

To provide an intra-machine replication technique with strong isolation, this thesis presents an SMR system built on a multi-kernel OS. We implemented a replication system capable of replicating concurrent applications on different kernel instances of a multi-kernel OS. Modern concurrent applications can be deployed on our system with minimal code modification. Additionally, our system provides two different replication modes that the user can switch between freely according to the application type. Evaluating multiple real-world applications, we show that they can be deployed on our system with 0 to 60 lines of changes to the source code. From a performance perspective, our system introduces only 0.23% to 63.39% overhead compared to non-replicated execution.

This work is supported by AFOSR under grant FA9550-14-1-0163. Any opinions, findings, and conclusions expressed in this thesis are those of the author and do not necessarily reflect the views of AFOSR.

Dedication

This work is dedicated to love and peace.

Acknowledgments

I would like to express my sincere gratitude and appreciation to all the people who helped me along this challenging yet enjoyable path to finishing this research work:

Dr. Binoy Ravindran, for giving me the chance to start my academic research in the System Software Research Group, and for guiding my research along the way.

Dr. Ali Butt and Dr. Dongyoon Lee, for taking time from their busy schedules to serve on my committee.

Dr. Antonio Barbalace, for helping me with his immense knowledge of the Linux kernel, and for leading me into the world of Popcorn Linux.

Dr. Giuliano Losa, for showing me an efficient methodology for doing academic research, and for turning my head in the right direction whenever I got lost in the research.

Marina Sadini, for being a wonderful and cheerful partner all the way, and for introducing me to the world of compiler development.

Everyone in the System Software Research Group, for supporting me and giving me a great time at Virginia Tech.

And finally my parents, for backing me up all the time.
Contents

1 Introduction
   1.1 Contributions
   1.2 Scope
   1.3 Thesis Organization

2 Popcorn Linux Background
   2.1 Hardware Partitioning
   2.2 Inter-Kernel Messaging Layer
   2.3 Popcorn Namespace
      2.3.1 Replicated Execution
      2.3.2 FT_PID
   2.4 Network Stack Replication

3 Shogoki: Deterministic Execution
   3.1 Logical Time Based Deterministic Scheduling
      3.1.1 Eliminating Deadlocks
   3.2 Balancing the Logical Time
      3.2.1 Execution Time Profiling
      3.2.2 Tick Bumping for External Events
   3.3 Related Work
      3.3.1 Deterministic Language Extension
      3.3.2 Software Deterministic Runtime
      3.3.3 Architectural Determinism
      3.3.4 Deterministic System For Replication

4 Nigoki: Schedule Replication
   4.1 Execute-Log-Replay
      4.1.1 Eliminating Deadlocks
   4.2 Related Work

5 Additional Runtime Support
   5.1 System Call Synchronization
      5.1.1 gettimeofday/time
      5.1.2 poll
      5.1.3 epoll_wait
   5.2 Interposing at Pthread Library
      5.2.1 Interposing at Lock Functions
      5.2.2 Interposing at Condition Variable Functions
   5.3 stdin, stdout and stderr
   5.4 Synchronization Elision

6 Evaluation
   6.1 Racey
   6.2 PBZip2
      6.2.1 Results
      6.2.2 Message Breakdown
   6.3 Mongoose Webserver
      6.3.1 Results
      6.3.2 Message Breakdown
   6.4 Nginx Webserver
      6.4.1 Results
      6.4.2 Message Breakdown
   6.5 Discussion
      6.5.1 Benchmark for Nginx's Multi-process Mode
      6.5.2 Deterministic Execution's Dilemma
      6.5.3 Which Replication Mode Should I Use?

7 Conclusion
   7.1 Contributions
   7.2 Future Work
      7.2.1 Precise Tick Bump for Deterministic Execution
      7.2.2 Per-Lock Synchronization
      7.2.3 Arbitrary Number of Replicas
      7.2.4 Hybrid Replication

Bibliography

Appendix A PlusCal Code for Logical Time Bumping

List of Figures

2.1 Popcorn Linux hardware partitioning
2.2 An example of ft_pid
3.1 An example use of the deterministic syscalls
3.2 Simplified implementation of deterministic system calls
3.3 An example of deadlock
3.4 An example of logical time imbalance
3.5 pbzip2 without logical time balancing
3.6 An instrumented basic block in pbzip2 with execution time profiling functions
3.7 An instrumented basic block in pbzip2 with det_tick
3.8 An example of tick bumping
3.9 Simplified implementation of Tick Shepherd
4.1 Simplified implementation of system calls for schedule replication
4.2 An example of Schedule Replication
5.1 poll prototype and pollfd data structure
5.2 epoll_wait prototype and epoll_event data structure
5.3 pthread_mutex_lock in the LD_PRELOAD library
5.4 glibc pthread_cond_wait internal work flow
6.1 pbzip2 concurrent model
6.2 pbzip2 performance
6.3 pbzip2 messages
6.4 mongoose concurrent model
6.5 mongoose performance for 50KB file requests
6.6 mongoose performance for 100KB file requests
6.7 mongoose performance for 200KB file requests
6.8 mongoose messages for 50KB file requests
6.9 mongoose messages for 100KB file requests
6.10 mongoose messages for 200KB file requests
6.11 Nginx thread pool model
6.12 nginx performance for 50KB file requests
6.13 nginx performance for 100KB file requests
6.14 nginx performance for 200KB file requests
6.15 nginx messages for 50KB file requests
6.16 nginx messages for 100KB file requests
6.17 nginx messages for 200KB file requests
6.18 The tradeoff of the two replication modes

List of Tables

6.1 Tracked system calls used by pbzip2
6.2 pbzip2 Overall Overhead of Each Replication Mode
6.3 Tracked system calls used by mongoose
6.4 Mongoose Overall Overhead of Each Replication Mode
6.5 Tracked system calls used by nginx
6.6 Nginx Overall Overhead of Each Replication Mode

Chapter 1

Introduction

Nowadays, the semiconductor industry has pushed the CPU core count to a historically high level. Having a computer system with high