<<

Runtime Support for Improving Reliability in System Software

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Qi Gao, B.E., M.S.

Graduate Program in Computer Science and Engineering
The Ohio State University
2010

Dissertation Committee:

Prof. Feng Qin, Advisor
Prof. Atanas Rountev
Prof. Xiaodong Zhang

Copyright by

Qi Gao

2010

ABSTRACT

As software is becoming increasingly complex, software reliability is becoming more and more important. In particular, the reliability of system software is critical to the overall reliability of computer systems, since system software is designed to provide a platform for the application software running on top of it. Unfortunately, it is very challenging to ensure the reliability of system software, and the defects (bugs) in it can often cause severe impact.

This dissertation proposes to use runtime support for improving system software reliability. Runtime support here refers to techniques that extend the runtime software system with more functionality useful for reliability-oriented tasks, such as instrumentation-based profiling, runtime analysis, checkpointing/re-execution, scheduling control, memory layout control, etc. Leveraging runtime support, this dissertation proposes novel methods for bug manifestation, bug detection, bug diagnosis, failure recovery and error prevention in multiple phases of the software development and deployment cycle. The most preferable phase in which to detect and fix software bugs is the pre-release testing phase. To improve testing effectiveness and efficiency, this dissertation proposes the first method to help manifest the bugs hidden in system software. Facing the real-world fact that some bugs always make their way to deployment sites no matter how rigorous the software testing is, this dissertation proposes the second method to help monitor the system software and detect runtime errors. To handle the runtime errors caused by software bugs, this dissertation proposes the third method to help diagnose the failure, recover the program, and prevent future errors due to the same bugs.

Specifically, we propose a software testing method called 2ndStrike to manifest hidden concurrency typestate bugs in multi-threaded system software. 2ndStrike first profiles certain program runtime events related to the typestate and synchronization. Based on the logs, 2ndStrike then identifies bug candidates that would cause typestate violations if the event order were reversed. Finally, 2ndStrike re-executes the program in multiple iterations with controlled thread interleaving to manifest the bug candidates.

In addition, we propose a deployment-time monitoring and analysis method called DMTracker to detect anomalies in distributed system software running on parallel platforms during production runs. Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check the violations of these invariants. These violations indicate potential bugs such as data races and bugs that manifest themselves in data movements. Utilizing the data movement information, we propose a statistical-rule-based approach to detect anomalies for finding bugs.

Finally, we propose a deployment-time fault tolerance method called First-Aid to recover from failures in system software caused by common memory bugs during production runs and to prevent future errors caused by the same bugs. Upon a failure, First-Aid diagnoses the bug type and identifies the memory objects that trigger the bug. To do so, it rolls back the program to previous checkpoints and uses two types of environmental changes that can prevent or expose memory bug manifestation during re-execution. Based on the diagnosis, First-Aid generates and applies runtime patches to avoid the memory bug and prevent its reoccurrence.

We have designed and implemented software prototypes for the proposed methods and evaluated them with real-world bugs on large open-source system software packages, such as Apache, MySQL, Mozilla, MVAPICH, etc. The experimental results show that the methods proposed in this dissertation can provide great help in improving the reliability of system software in various scenarios. In addition, the results also demonstrate that the runtime support in these methods can bring key advantages such as high efficiency, high accuracy, and high usability.

Dedicated To

My Love, Ying Tu

ACKNOWLEDGMENTS

I am truly lucky to have Professor Feng Qin as my advisor, who is kind, generous, and inspiring. On top of the many good traits that an excellent advisor can have, his most admirable quality, which distinguishes him from other good professors I have seen or heard of, is that he genuinely cares about and respects his students. His insights and guidance have made me a better researcher, while, more importantly, his open mind and optimism have influenced me to become a stronger person.

I am also deeply indebted to Professor Xiaodong Zhang and Professor Atanas Rountev, not only for their time and effort in serving on my dissertation committee, but also for their generous help and insightful feedback during my study. Thanks also go to the other professors who have worked with me and/or passed their knowledge on to me. An incomplete list includes Professor Panda, Professor Sadayappan, Professor Agrawal, Professor Xuan, Professor Arora, Professor Lai, Professor Long, and Professor Soundarajan.

I would also like to thank my groupmates and fellow graduate students. Wenbin Zhang, Zhezhe Chen and Mai Zheng, thank you for your help in the research projects. I thank Wei Huang, Lei Chai, Matthew Koop, Sayantan Sur, and Pavan Balaji for their help in improving my research skills, engineering skills, and English speaking and writing skills. I thank Guoqiang Shu, Zhaozhang Jin, Qian Zhu, Lifeng Sang, Guoqing Xu, Ping Lai, Na Li and Hui Gao for their help in various aspects.

I am grateful to Dr. Min Xu at VMware, Dr. Jason Yang, and Dr. Pete Beckman at Argonne National Lab for being my mentors during my internships. I feel lucky to have worked with them and the other excellent researchers and engineers on their teams. I have learned a lot and greatly broadened my horizons through these valuable experiences.

Those whom I want to thank the most will not mind that I put them last. I am deeply thankful to my parents, Weishi Gao and Xiaoli Liu, who have provided a cultivating and nurturing environment as I grew up. They always believe in me and support me in every possible way. Finally, my very special thanks go to my dearest wife-to-be, Ying Tu, who has been with me throughout the whole journey of coming to a distant country for Ph.D. study. She is smart, diligent, caring, and inspiring, and she creates so much joy in my life. It is her love and support that unleashed my full potential.

VITA

2004 ...... Bachelor in Engineering, Zhejiang University, Hangzhou, China
2009 ...... Master of Science, The Ohio State University, Columbus, OH, USA
2004 – Present ...... Graduate Student, The Ohio State University, Columbus, OH, USA

PUBLICATIONS

Research Publications

Qi Gao, Wenbin Zhang, Yan Tang, and Feng Qin. “First-Aid: Surviving and Preventing Memory Management Bugs during Production Runs”. Proceedings of the ACM SIGOPS/EuroSys 2009 European Conference on Computer Systems (EuroSys'09), 159 - 172, Nuremberg, Germany, 2009.

Yan Tang, Qi Gao, and Feng Qin. “LeakSurvivor: towards safely tolerating memory leaks for garbage-collected languages”. Proceedings of the 2008 USENIX Annual Technical Conference (USENIX'08), 307 - 320, Boston, Massachusetts, 2008.

Qi Gao, Feng Qin, and Dhabaleswar K. Panda. “DMTracker: finding bugs in large-scale parallel programs by detecting anomaly in data movements”. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC'07), Article No. 15, Reno, Nevada, 2007.

Wei Huang, Matthew J. Koop, Qi Gao, and Dhabaleswar K. Panda. “Virtual machine aware communication libraries for high performance computing”. Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC'07), Article No. 9, Reno, Nevada, 2007.

Wei Huang, Qi Gao, Jiuxing Liu, and Dhabaleswar K. Panda. “Virtual machine aware communication libraries for high performance computing”. Proceedings of the 2007 IEEE International Conference on Cluster Computing (Cluster'07), 11 - 20, 2007.

Qi Gao, Wei Huang, Matthew J. Koop, and Dhabaleswar K. Panda. “Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand”. Proceedings of the 2007 International Conference on Parallel Processing (ICPP'07), Article No. 47, 2007.

Matthew J. Koop, Sayantan Sur, Qi Gao, and Dhabaleswar K. Panda. “High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters”. Proceedings of the 21st Annual International Conference on Supercomputing (ICS'07), 180 - 189, Seattle, Washington, 2007.

Lei Chai, Qi Gao, and Dhabaleswar K. Panda. “Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System”. Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid'07), 471 - 478, 2007.

Qi Gao, Weikuan Yu, Wei Huang, and Dhabaleswar K. Panda. “Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand”. Proceedings of the 2006 International Conference on Parallel Processing (ICPP'06), 471 - 478, 2006.

Wei Huang, Gopal Santhanaraman, H. W. Jin, Qi Gao, and Dhabaleswar K. Panda. “Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand”. Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGrid'06), 43 - 48, 2006.

Pavan Balaji, Wu-chun Feng, Qi Gao, Ranjit Noronha, Weikuan Yu, and Dhabaleswar K. Panda. “Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines”. Proceedings of the 2005 IEEE International Conference on Cluster Computing (Cluster'05), 1 - 10, 2005.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Page

Abstract ...... i

Dedication ...... iii

Acknowledgments ...... iv

Vita...... vi

Chapters:

1. INTRODUCTION ...... 1

1.1 Motivation ...... 1
1.2 Contributions ...... 5
1.2.1 Manifesting hidden concurrency bugs in multi-threaded system software in software testing ...... 7
1.2.2 Runtime monitoring and anomaly detection in distributed system software during production runs ...... 8
1.2.3 Failure recovery and future error prevention for common memory bugs in system software during production runs ...... 9
1.3 Outline ...... 10

2. RELATED WORK ...... 11

2.1 Static and Dynamic Bug Detection ...... 11
2.2 Software Testing and Model Checking ...... 14
2.3 Fault Tolerance ...... 15
2.4 Debugging and Problem Diagnosis ...... 17

3. 2NDSTRIKE ...... 19

3.1 Overview ...... 20

3.2 Background: Typestate Bugs ...... 25
3.3 Our Approach: 2ndStrike ...... 26
3.3.1 State Machines for Detecting Concurrency Typestate Bugs ...... 27
3.3.2 Runtime Profiling to Monitor Typestate ...... 29
3.3.3 On-the-fly Analysis to Identify Bug Candidates ...... 30
3.3.4 Scheduling Control to Manifest Bugs ...... 33
3.3.5 Issues and Discussion ...... 34
3.4 Evaluation Methodology ...... 35
3.5 Experimental Results ...... 37
3.5.1 Effectiveness ...... 37
3.5.2 Efficiency ...... 38
3.5.3 Usability ...... 41
3.6 Summary ...... 42

4. DMTRACKER ...... 44

4.1 Overview ...... 45
4.2 Background: Statistical Rule Based Bug Detection ...... 50
4.3 Data Movement-Based Invariants ...... 51
4.3.1 Frequent Chain Invariants ...... 52
4.3.2 Chain Distribution Invariants ...... 54
4.4 Design of DMTracker ...... 56
4.4.1 Lightweight Data Movement Tracking ...... 57
4.4.2 Preprocessing: DM-Chain Formation ...... 58
4.4.3 Invariants Generation ...... 60
4.4.4 FC-invariants Based Anomaly Detection ...... 63
4.4.5 CD-invariants Based Anomaly Detection ...... 65
4.4.6 Issues and Discussions ...... 66
4.5 Evaluation and Case Studies ...... 67
4.5.1 Case 1: Data Corruption in Communication ...... 68
4.5.2 Case 2: Deadlock in Connection Setup ...... 69
4.5.3 Runtime Overhead ...... 70
4.5.4 Sensitivity Study and False Positives ...... 71
4.6 Summary ...... 73

5. FIRST-AID ...... 74

5.1 Overview ...... 75
5.2 First-Aid Approach and Working Scenario ...... 78
5.3 First-Aid System Design ...... 83
5.4 Bug Diagnosis ...... 86
5.4.1 Phase 1: Identify the checkpoint for patching ...... 86
5.4.2 Phase 2: Identify the bug type and patch application points ...... 88
5.4.3 Comparison of First-Aid and Rx bug diagnosis ...... 90

5.5 Patch Validation and Bug Reporting ...... 91
5.6 Discussion ...... 92
5.7 Evaluation and Experimental Results ...... 93
5.7.1 Experimental setup ...... 93
5.7.2 Overall effectiveness ...... 94
5.7.3 Future error prevention ...... 96
5.7.4 Bug report ...... 98
5.7.5 Runtime overhead during normal execution ...... 100
5.7.6 Space overhead ...... 101
5.8 Summary ...... 104

6. FINAL REMARKS ...... 105

Bibliography ...... 108

CHAPTER 1

INTRODUCTION

1.1 Motivation

As software is becoming more and more complex, software reliability becomes increasingly important. From Internet server programs to desktop programs, from scientific applications to financial applications, high reliability is critical to many aspects of our lives. Almost every large organization nowadays depends on the smooth operation of its Internet servers, such as mail servers, web servers, etc. Many people feel frustrated and cannot work productively when there are problems with their desktop programs, such as browsers, text editors, calendars, etc. For scientific applications, reliability problems can lead to a huge waste of computing cycles or even wrong simulation results. For financial applications, downtime directly translates into cost. According to a report from Gartner Group [1], the cost of one hour of downtime for financial applications exceeds six million US dollars.

In particular, system software is the computer software designed to provide and maintain a platform for application software to run on top of. As shown in Figure 1.1, system software runs on top of operating system (OS) kernels or directly works with computer hardware in order to provide application software with higher-level functionalities. There are many categories of system software, including common libraries such as MPI [2] libraries, server programs such as web servers and database servers, utility software such as graphical user interfaces, modern engines, etc. As the glue between the simple and condensed interface provided by OS kernels and the rich functionalities required by applications, system software is extremely important to the overall capability of computer systems.

Figure 1.1: System software in a typical software architecture (layers from top to bottom: Applications, System Software, OS Kernel, Hardware).

Because of its importance, the reliability of system software is critical to the overall reliability of computer systems. First of all, system software is generally very widely deployed and commonly used, based on the requirements of application software and users. Therefore, if a piece of system software has a bug (or defect), it can potentially affect millions of users.

On the other hand, due to its popularity, malicious attackers often attempt to exploit defects in system software to launch security attacks. For example, in 2003 the Slammer worm, which exploited defects in Microsoft SQL Server, brought down tens of thousands of machines running this system software within several minutes [3]. Even for high-end systems, a recent study [4] of more than 20 different high-performance computer systems at Los Alamos National Laboratory has shown that bugs in system software account for as many as 24% of failures.

In addition, since system software is the backbone of many applications, reliability problems in system software often cause severe impact. A bug in system software may cause the whole system to crash, hang, or exhibit other unexpected behaviors, and thus lead to financial loss. Even worse, due to its complexity, bugs in system software typically require experts to diagnose and fix, and meanwhile it is costly for users to avoid using the system software entirely. For example, the Slammer worm incident mentioned above caused hundreds of millions of dollars in losses.

Unfortunately, it is very challenging to ensure the reliability of system software. First, system software is required to provide a rich set of functionalities used by applications, and thus it usually has a very complex design whose correctness is not easy to ensure. Coding errors can happen in any corner of a complex framework and cause problems in production when the particular functionality is exercised. In addition, system software is often required to be high-performing and portable across many platforms, which usually introduces non-intuitive optimizations and adjustments to its implementation. Even a very experienced programmer can make subtle mistakes when writing system software that deals with all this trickiness. Furthermore, system software evolves with the underlying platforms, and changes often lead to errors. It generally requires a collective effort from generations of developers, testers, and operators to improve the reliability of one piece of system software.

Specifically, there are challenging issues involved in each part of the software development and deployment cycle:

• In the software testing phase, manifesting hidden bugs in programs can be quite challenging. This is mainly because: a) the number of possible cases to test is usually huge and it is impractical to cover all of them; and b) software bugs are often introduced when handling corner cases, which are often not covered adequately by tests either. For example, multi-threaded programs have more than one thread running concurrently during the execution. Even if the threads execute the same set of statements, different runtime interleavings can lead to both correct and incorrect results. Since the number of possible interleavings is often astronomically large, it is impractical to cover all cases. Also, the bugs tend to manifest when a rare interleaving happens during execution, which may be extremely difficult to achieve with natural scheduling. As a result, many bugs still slip through the testing phase and are left uncaught.

• After deployment, monitoring and detecting runtime errors during production runs can be challenging. This is mainly because: a) the method must be very lightweight and impose only very small overhead so as not to cause noticeable performance degradation; b) a lightweight method often provides only very limited information, and how to effectively detect problems based on this information is difficult; and c) the platform where the software is deployed may also become a barrier to monitoring and error detection. For example, much system software runs on distributed machines. Monitoring only their individual activities may not be helpful in detecting errors, so the inter-communication among them also needs to be monitored. Monitoring communication is challenging because the naive approach of logging all messages often incurs unacceptable overhead. In addition, how to correlate the logs from distributed machines and detect runtime errors is even more challenging.

• When a bug manifests in production runs, tolerating the fault and keeping the software running healthily can be challenging. This is mainly because: a) the method must be fully automatic and must not put humans on the critical path, because the software can crash at any time in production, regardless of whether it is during human operators' office hours; b) the method must be safe and must not introduce new bugs; c) the method must be lightweight and must not incur large overhead; and d) a good method must be able to prevent the error from being triggered repeatedly. For example, if a program crashes due to some input triggering a deterministic memory bug, the same or a similar input will trigger the bug again next time. In this case, the common restart approach used in practice is not enough for fault tolerance. With all these requirements, finding a good method to tolerate even a specific type of bug is quite challenging.

Facing these challenges, it is imperative for developers and operators to have effective and efficient methods for detecting and diagnosing bugs in system software.

1.2 Contributions

This dissertation proposes to use runtime support for improving system software reliability. Runtime support here refers to techniques that extend the runtime software system with more functionality useful for reliability-oriented tasks, such as instrumentation-based profiling, runtime analysis, checkpointing/re-execution, scheduling control, memory layout control, etc. Leveraging runtime support, this dissertation proposes novel methods for bug manifestation, bug detection, bug diagnosis, failure recovery and error prevention. Specifically, runtime support in this dissertation includes two aspects: system support and dynamic analysis.

Traditionally, a runtime system focuses mainly on delivering good performance. In contrast, system support for reliability means extending the runtime system to provide additional functionality for reliability-oriented tasks. These tasks range from extracting logs and aggregating runtime information to taking checkpoints of the running program and re-executing the program from a previous checkpoint. Proper system support is the cornerstone of several useful techniques and can often greatly increase the efficiency of the tools at minor performance cost.

Compared with static analysis, which focuses on the source code, dynamic analysis focuses only on the program behaviors exhibited at runtime, and is therefore more accurate and provides more detailed information. Because of that, dynamic analysis not only generates more accurate and more informative bug reports, but also allows runtime changes to be made during the execution. These runtime changes enable more sophisticated tasks such as manifesting hidden bugs and preventing future bug occurrences.

This dissertation proposes to use automated bug detection and diagnosis techniques for improving the reliability of system software. In particular, this dissertation focuses on automated methods that combine system support and runtime analysis to improve the effectiveness and efficiency of many tasks such as bug detection, runtime monitoring, failure diagnosis, fault tolerance, and error prevention.

Combining system support and dynamic analysis, this dissertation proposes novel methods for bug manifestation, bug detection, bug diagnosis, failure recovery and error prevention, with the following desirable features:

• High efficiency: The methods proposed in this dissertation for production runs have very low runtime overhead, often causing negligible impact on performance. The methods for testing impose moderate performance overhead and greatly increase testing efficiency compared with the current state of the art.

• High accuracy: The methods proposed in this dissertation generate only a small number of, if any, false positives when performing bug detection. For bug diagnosis, the methods can accurately identify the root causes of the bugs.

• Easy to use: The methods proposed in this dissertation are easy for developers and operators to use because: a) they are fully automated; b) they do not require source code changes to the software; and c) they provide detailed information in bug reports.

Specifically, this dissertation proposes novel methods to address issues in multiple parts of the software development and deployment cycle. The most preferable phase in which to detect and fix software bugs is the pre-release testing phase. To improve testing effectiveness and efficiency, this dissertation proposes the first method to help manifest the bugs hidden in system software. Facing the real-world fact that some bugs always make their way to deployment sites no matter how rigorous the software testing is, this dissertation proposes the second method to help monitor the system software and detect runtime errors. To handle the runtime errors caused by software bugs, this dissertation proposes the third method to help diagnose the failure, recover the program, and prevent future errors due to the same bugs. More specific descriptions are provided as follows.

1.2.1 Manifesting hidden concurrency bugs in multi-threaded system software in software testing

For performance and responsiveness, system software often adopts a multi-threaded design. Unfortunately, concurrency bugs in multi-threaded programs are notoriously difficult to detect in testing, mainly due to the huge number of interleaving possibilities to cover. Stress testing, the de facto method used in current software testing practice, can only cover a small fraction of the interleaving space at a non-trivial cost in computing resources, and is hence not a very desirable solution.

Based on our observation, a large number of concurrency bugs are related to the typestate [5] of a certain object, e.g., invalid reads of a file handler closed by a different thread. To address this, we propose a method called 2ndStrike for manifesting these concurrency typestate bugs based on runtime profiling and actively changing thread scheduling.

Given a state machine describing correct program behavior on certain object typestates, the 2ndStrike tool first profiles certain program runtime events related to the typestate and thread synchronization. Based on the logs, 2ndStrike then identifies bug candidates that would cause typestate violations. A bug candidate here refers to a pair of runtime events that happen in different threads and could cause an error if the order between them is changed. Finally, 2ndStrike re-executes the program in multiple iterations with controlled thread interleaving to manifest the bug candidates.

We have implemented a 2ndStrike prototype and evaluated it for two types of concurrency typestate bugs, i.e., invalid pointer dereference and lock operation violation, with three real-world bugs from three open-source server and desktop programs (i.e., MySQL, Mozilla, pbzip2) on an eight-core machine. Our experimental results have shown that 2ndStrike can effectively and efficiently manifest all three tested software bugs, i.e., within 255 seconds, 178 times faster than stress testing. Additionally, 2ndStrike provides detailed bug reports and can consistently reproduce the bugs after manifesting them once.

1.2.2 Runtime monitoring and anomaly detection in distributed system software during production runs

Detecting runtime errors in distributed or parallel environments is especially challenging. In such environments, tracing only a single process running on a specific node, or monitoring individual processes separately, provides little help in detecting and diagnosing bugs at runtime.

For monitoring and detecting anomalies in distributed parallel system software during production runs, we propose a method called DMTracker. Facing the special challenges of parallel computing platforms, we take the distributed processes in a parallel program as a whole and focus on the data movements among parallel processes. Utilizing the data movement information, we propose a statistical-rule-based approach to detect anomalies for finding bugs.

Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check the violations of these invariants. These violations indicate potential bugs such as data races and memory corruption bugs that manifest themselves in data movements.
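To make the invariant idea above concrete, here is a minimal sketch (illustrative only; the chain signatures, counts, and threshold are hypothetical, and this is not DMTracker's actual algorithm) of flagging a data-movement chain whose frequency deviates sharply from that of its peers:

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>

// Each data-movement chain is summarized by the sequence of call sites it
// passes through; chains seen far less often than the dominant ones are
// reported as anomalies (a frequent-chain style statistical rule).
int main() {
    std::map<std::string, int> chain_count = {
        {"recv@net.c:210 -> copy@buf.c:88 -> send@net.c:455", 4096},
        {"recv@net.c:210 -> copy@buf.c:97 -> send@net.c:455", 3},  // rare variant
    };

    int max_count = 0;
    for (const auto& kv : chain_count) max_count = std::max(max_count, kv.second);

    const double kRareFraction = 0.01;  // report chains under 1% of the dominant count
    for (const auto& kv : chain_count)
        if (kv.second < kRareFraction * max_count)
            std::cout << "anomalous data-movement chain: " << kv.first
                      << " (seen " << kv.second << " times)\n";
}
```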

We have implemented a DMTracker prototype and evaluated it on a cluster with 64 CPUs. Our experiments with two real-world bug cases in MVAPICH/MVAPICH2 [6], a popular MPI library, have shown that DMTracker can effectively detect them and report abnormal data movements to help programmers quickly diagnose the root causes of the bugs. In addition, DMTracker incurs very low runtime overhead, from 0.9% to 6.0%, in our experiments with High Performance Linpack (HPL) [7] and the NAS Parallel Benchmarks (NPB) [8], which indicates that DMTracker can be deployed in production runs.

1.2.3 Failure recovery and future error prevention for common memory bugs in system software during production runs

Common memory management bugs such as buffer overflows and dangling pointer accesses are among the major causes of software failures and exploits in production runs. For example, a recent large-scale security attack on several large companies including Google exploited a dangling pointer bug [9], and it took developers quite a long time to generate a fix.

We propose an automated on-site diagnosis and patching method for memory bugs to avoid the long delays for developers to fix the bugs that manifest in production runs. With a fully automated diagnosis and patching method, the memory bugs manifested in production runs can be treated immediately online, without the need to wait for developers to manually reproduce and fix them.

We have designed and implemented First-Aid, a lightweight runtime system that helps software survive failures caused by common memory management bugs and prevents future failures caused by the same bugs during production runs. Upon a failure, First-Aid diagnoses the bug type and identifies the memory objects that trigger the bug. To do so, it rolls back the program to previous checkpoints and uses two types of environmental changes that can prevent or expose memory bug manifestation during re-execution. Based on the diagnosis, First-Aid generates and applies runtime patches to avoid the memory bug and prevent its reoccurrence. Furthermore, First-Aid validates the consistent effects of the runtime patches and generates on-site diagnostic reports to assist developers in fixing the bugs.
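The following sketch illustrates what such environmental changes can look like for a heap allocator (a simplified illustration with assumed constants, not First-Aid's actual implementation): padding and zero-filling allocations can mask small buffer overflows and uninitialized reads, while delaying the reuse of freed blocks can mask dangling-pointer accesses, so applying or withholding these changes during re-execution helps narrow down the bug type.

```cpp
#include <cstdlib>
#include <cstring>
#include <cstddef>
#include <deque>

// Illustrative prevention-style environmental changes (not thread-safe and
// deliberately simplified): pad and zero-fill allocations, and quarantine
// freed blocks so their memory is not immediately reused.
static const std::size_t kPad = 128;            // slack added around each object
static const std::size_t kQuarantineMax = 1024; // how many freed blocks to hold
static std::deque<void*> quarantine;            // freed blocks kept alive for a while

void* padded_malloc(std::size_t n) {
    char* raw = static_cast<char*>(std::malloc(n + 2 * kPad));
    if (!raw) return nullptr;
    std::memset(raw, 0, n + 2 * kPad);          // zero-fill: hides uninitialized reads
    return raw + kPad;                          // padding: hides small over/underflows
}

void delayed_free(void* p) {
    if (!p) return;
    quarantine.push_back(static_cast<char*>(p) - kPad);  // postpone reuse
    if (quarantine.size() > kQuarantineMax) {            // release the oldest block
        std::free(quarantine.front());
        quarantine.pop_front();
    }
}
```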

We have evaluated the First-Aid prototype with seven applications that contain various types of memory bugs, including buffer overflow, uninitialized read, dangling pointer read/write, and double free. The results show that First-Aid can quickly diagnose the tested bugs and recover the applications from failures (in 0.084 to 3.978 seconds). The results also show that the runtime patches generated by First-Aid can prevent future failures caused by the diagnosed bugs. Additionally, First-Aid provides detailed diagnostic information on both the root cause and the manifestation of the bugs. Furthermore, First-Aid incurs low overhead (0.4-11.6%, with an average of 3.7%) during normal execution for the tested buggy applications, SPEC INT2000, and four allocation-intensive programs.

1.3 Outline

The rest of this dissertation is organized as follows. In Chapter 2, we discuss previous studies that are related to our work. In Chapter 3, we describe the method for manifesting hidden concurrency bugs in multi-threaded system software in software testing. In Chapter 4, we describe the method for runtime monitoring and anomaly detection in distributed system software during production runs. In Chapter 5, we describe the method for failure recovery and future error prevention for common memory bugs in system software during production runs. Finally, in Chapter 6, we provide a summary and draw conclusions.

CHAPTER 2

RELATED WORK

2.1 Static and Dynamic Bug Detection

Generally, static bug detection tools [10–18] scan the program source code and use various static analysis techniques to detect potential software bugs. Evans's LCLint [10] detects inconsistencies between the source code and properties inferred from user-provided annotations. Foster et al. proposed CQual [11], a general framework for checking program invariants specified by customized type qualifiers. Similarly, METAL [13] checks the source code and detects violations of programming rules, either provided by programmers [19] or automatically inferred from the source code itself [20]. Some tools are designed to detect certain types of bugs such as memory leaks and memory corruption. Clouseau [16] detects memory leaks using an object ownership and inference model, while CSSV [15], proposed by Sagiv et al., detects unsafe string operations in C programs with the aid of procedure summaries. In [17,18], a static race detector framework is proposed for Java programs using a special property called "conditional must not aliasing" to reduce overhead. While static tools do not impose runtime overheads, they may miss some bugs and/or generate many false positives because accurate runtime information is unavailable at compile time. Additionally, many static tools do not scale well to large programs.

Dynamic tools, either purely based on software [21–25] or relying on hardware extensions [26–28], detect software bugs at runtime. The state-of-the-art tools Purify [21] and Valgrind [22] detect memory-related bugs such as memory leaks and memory corruption by intercepting every memory access and monitoring every dynamically allocated memory object through binary instrumentation. Jones and Kelly's tool [25], PointGuard [29], SafeC [24] and CRED [30] can detect buffer overflows by dynamically checking each pointer dereference. While these tools do not suffer from the same limitations as static tools, most software-based detection tools incur high runtime overhead, up to 40 times [23, 28], due to the interception of every memory/pointer access.

Some hybrid schemes combine static and dynamic techniques to alleviate the high overhead problem to some extent. CCured [23] is such a hybrid bug detection tool. It first attempts to enforce a strong type system in C programs via static analysis. Portions of the program that cannot be guaranteed safe by the CCured type system are instrumented with run-time checks to monitor the safety of executions. Cyclone [31] is very similar; it changes the pointer representation to detect pointer dereference errors.
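As a generic illustration of such inserted run-time checks (a minimal sketch; the bounds-carrying pointer shown here is a simplification, not CCured's or Cyclone's actual pointer representation), a pointer that static analysis cannot prove safe can be carried together with the bounds of its target and checked on every dereference:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// A "checked" pointer carries the bounds of the object it points into, and
// every access goes through an inserted run-time bounds check.
struct CheckedPtr {
    int* p;      // current pointer
    int* base;   // start of the underlying object
    int* bound;  // one past the end of the underlying object
};

int checked_read(CheckedPtr cp, std::ptrdiff_t i) {
    int* q = cp.p + i;
    if (q < cp.base || q >= cp.bound) {   // inserted run-time check
        std::fprintf(stderr, "out-of-bounds access detected\n");
        std::abort();
    }
    return *q;
}

int main() {
    int buf[4] = {1, 2, 3, 4};
    CheckedPtr cp{buf, buf, buf + 4};
    std::printf("%d\n", checked_read(cp, 2));   // in bounds: fine
    checked_read(cp, 4);                        // out of bounds: caught by the check
}
```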

Another direction for reducing the runtime overhead incurred by software-based dynamic bug detection tools is to rely on hardware support. Some tools, such as ReEnact [26], iWatcher [27], AccMon [28], SafeMem [32], etc., propose hardware extensions to support bug detection.

Besides the bug detection tools based on programming rules, there is another category of tools that detect bugs using statistical methods. They extract rules statistically at program runtime and check for violations of the extracted rules. Statistical rules, or dynamic invariants [33, 34], are properties that likely hold at a certain point or points in a program. One can extract invariants at runtime over a single run or across multiple runs. Several recent works [28, 33, 34] have demonstrated that statistics-rule-based approaches are very promising due to their effectiveness in detecting bugs that do not violate any programming rules. Daikon [33] and DIDUCE [34] extract value-based invariants (i.e., the possible values of a given variable are always within a certain range) at runtime and can detect bugs that generate abnormal values. Similarly, AccMon [28] captures runtime program-counter-based invariants (i.e., a given variable is typically accessed by only a few instructions), and uses them to detect memory-related bugs. Mirgorodskiy et al. have conducted an initial study of using a statistics-rule-based approach to diagnose software problems in large-scale systems [35]. Their approach extracts an invariant of the function time distribution in control flow, and then identifies the abnormal process among a large number of parallel processes.
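As a rough illustration of such value-based invariants (a minimal sketch in the spirit of the range-style invariants described above, not the actual implementation of Daikon or DIDUCE), a tracker can learn the [min, max] range seen for a variable at one program point and flag later values that fall outside that range:

```cpp
#include <cstdio>
#include <limits>

// Learns the observed range of a value during a training phase and reports
// out-of-range values afterwards.
struct RangeInvariant {
    long lo = std::numeric_limits<long>::max();
    long hi = std::numeric_limits<long>::min();
    bool training = true;

    void observe(long v) {
        if (training) {                   // widen the invariant
            if (v < lo) lo = v;
            if (v > hi) hi = v;
        } else if (v < lo || v > hi) {    // check the invariant
            std::fprintf(stderr, "anomalous value %ld (expected [%ld, %ld])\n",
                         v, lo, hi);
        }
    }
};

int main() {
    RangeInvariant inv;
    long training_values[] = {3, 7, 5};
    for (long v : training_values) inv.observe(v);   // training run
    inv.training = false;
    inv.observe(6);      // within range: silent
    inv.observe(4096);   // flagged as an anomaly
}
```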

For detecting concurrency bugs in multi-threaded programs, there are many existing approaches to data race detection [36–40]. Most data race detection methods are based on two categories of algorithms: the happens-before algorithm [36] and the lockset algorithm [37], or a combination of both [39, 40]. However, data races are not necessarily concurrency bugs. As pointed out in [39, 40], many data races are intentionally introduced to reduce locking cost and improve performance. Methods have also been proposed to distinguish benign races from harmful ones [41].
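For reference, the core of the lockset idea can be sketched in a few lines (a simplified, Eraser-style illustration; real detectors also track thread states, initialization phases, and read sharing): each shared variable keeps the set of locks that have protected every access so far, and an empty set signals a potential race.

```cpp
#include <cstdio>
#include <map>
#include <set>

using LockSet = std::set<int>;   // lock identifiers held by a thread

struct VarState {
    bool seen = false;
    LockSet candidates;          // locks that protected every access so far
};

static std::map<const void*, VarState> shadow;

// Called on every monitored access to a shared variable.
void on_access(const void* var, const LockSet& held) {
    VarState& s = shadow[var];
    if (!s.seen) { s.candidates = held; s.seen = true; return; }

    LockSet inter;
    for (int l : s.candidates)
        if (held.count(l)) inter.insert(l);   // intersect with currently held locks
    s.candidates = inter;

    if (s.candidates.empty())
        std::fprintf(stderr, "potential data race on %p: no common lock\n", var);
}

int main() {
    int x = 0;
    on_access(&x, {1, 2});   // access while holding locks 1 and 2
    on_access(&x, {2});      // still consistently protected by lock 2
    on_access(&x, {3});      // no common lock left: potential race reported
}
```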

There are other approaches focusing on the atomicity violation problem [42–44]. This problem is based on the fact that developers write programs with conceptual atomic regions in mind. If, during execution, interleaved accesses from other threads break an atomic region and lead to unserializable results, the program invariants are likely to be violated. Some approaches, like Atomizer [43], require programmers' annotations, while other approaches [42, 44] are based on heuristics gained during execution. In common, these tools report the atomicity violations that happen at runtime to programmers.

2.2 Software Testing and Model Checking

Typically, model checking tools check system properties at the protocol level [45–47] or the implementation level [48–51] by exhaustively searching the system state space. Traditional model checking tools such as SMV [45] and SPIN [46] focus on verifying hardware and software protocols. Although they can detect non-trivial bugs, the requirement of building a model of the system in another language is their major drawback, since the significant amount of manual effort required can easily lead to errors.

Some recent software model checking tools such as Verisoft [48] and CMC [51] systematically execute and check systems at the implementation level. They have been used to check systems for concurrency bugs, e.g., deadlocks, and assertion failures. Yang et al. applied the model checking method to widely-used, heavily-tested file systems and found serious software bugs [52]. KLEE [53] applies symbolic execution to core utilities in Unix systems and discovers several deeply-buried bugs in these commonly used programs. Not requiring an abstract model of the target system is a big advantage. However, the state explosion problem is still a major obstacle for these tools, especially for large and complex systems, due to the enormous state space that needs to be explored and the limited computation resources.

Recently, several approaches [54–57] have been proposed for testing concurrent programs, mainly focusing on how to effectively and efficiently expose concurrency bugs. The common technique these approaches share is control over scheduling, which is also used in our work for exposing deadlock bugs. However, these works are based on different ways of controlling scheduling in testing. CHESS [55] systematically explores the thread interleaving space based on pre-defined criteria. RaceFuzzer [57] is a testing tool for Java programs that triggers the potential data races reported by any imprecise race detector. The key idea is that, based on the potential data race cases, RaceFuzzer forcefully stops a thread at the statement where it can be involved in a data race and waits for other threads to reach the corresponding statement in the data race. When all threads are blocked, it randomly chooses one thread to unblock to maintain progress. CTrigger [56] is a testing tool for C programs that exposes atomicity violation bugs which can only be triggered by very rarely occurring scheduling cases. The key idea is first to profile potential unserializable access interleavings in stress testing and then to forcefully insert delays at special moments so that the unserializable access interleavings can be exercised.

2.3 Fault Tolerance

Whole-program restart [58,59] is usually the first attempt to handle software failures, with the hope that they are caused by non-deterministic bugs, since non-deterministic bugs may disappear during re-execution. However, it may cause a long period of service unavailability [60,61] for server programs that buffer a significant amount of state in main memory (e.g., data buffer caches). Software rejuvenation [62–64] is an interesting approach to reduce the period of unexpected service outage by rejuvenating/restarting the software to a fresh state after a certain period and before it fails. Candea et al. proposed micro-rebooting [65,66] to address this problem to some extent by rebooting only the failed components.

General checkpointing and recovery mechanisms [67–69] have long been proposed for surviving failures. Typically, they checkpoint the program state, roll back the program upon failures, and then re-execute the program. The checkpoints may be taken to disk [70–73] or even to remote memory [74–76]. These checkpoints can be provided with relatively low overhead. If there are messages and operations in flight, logging is also needed [71,77–79]. To deal with resource exhaustion or operating system crashes, monitoring, logging and recovery can be done remotely with support from special network interface cards [80].
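A minimal sketch of one common low-overhead variant, fork-based in-memory checkpointing, is shown below (an illustration under simplifying assumptions, not the implementation of any of the cited systems; real systems must also handle multiple threads, file descriptors, and in-flight messages):

```cpp
#include <csignal>
#include <sys/types.h>
#include <unistd.h>

// fork() creates a copy-on-write snapshot of the address space; the child
// sleeps on a signal and only resumes if a rollback is requested.
static pid_t snapshot_pid = -1;

// Returns in the execution that should keep running: once in the original
// process right after the checkpoint, and once more in the snapshot if a
// rollback is later requested (similar in spirit to setjmp/longjmp).
void take_checkpoint() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_BLOCK, &set, nullptr);        // so the child can sigwait() below

    pid_t pid = fork();
    if (pid < 0) return;                          // checkpoint failed; keep running
    if (pid > 0) { snapshot_pid = pid; return; }  // original process continues

    int sig = 0;
    sigwait(&set, &sig);                          // snapshot: sleep until rollback
    // ...execution resumes here from the checkpointed state
}

void rollback() {                                 // called when a failure is detected
    if (snapshot_pid > 0) kill(snapshot_pid, SIGUSR1);  // wake the snapshot
    _exit(1);                                     // discard the failed execution
}

int main() {
    take_checkpoint();
    // ...run the real workload; call rollback() from a failure handler...
    return 0;
}
```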

Various fault tolerance mechanisms can be used to survive software failures caused by deterministic bugs, a major cause of software failures [81]. The recovery blocks approach [82,83] extends a conventional block with a means of error detection and additional alternates, i.e., different implementation versions of the same block. An alternate is executed upon a failure detected at the end of a block execution. Similarly, n-version programming [84–86] relies on different implementation versions of the same software unit. Unlike recovery blocks, it executes the different versions concurrently, and the result is a consensus of all the versions. Both mechanisms can address software failures caused by deterministic bugs, assuming that different implementation versions fail independently. However, they are too expensive to be deployed in the normal software development process.

Several recent proposals help programs survive failures caused by memory bugs. Failure-oblivious computing [87] discards out-of-bound writes and manufactures an arbitrary value for out-of-bound reads. While this approach may survive failures for certain types of applications, the speculation on programmers' intentions could easily lead to program misbehavior. DieHard [88] and Exterminator [89] probabilistically prevent failures caused by memory bugs via a randomized memory runtime system. However, large time and space overheads restrict them from being adopted for production runs. Archipelago [90] scatters memory objects to different pages to prevent buffer overflows, but its memory consumption becomes non-trivial for many applications. Rx [91] quickly recovers programs from failures by re-executing the program from previous checkpoints with changes to the program's execution environment applied. While Rx is effective and safe in avoiding the occurring memory bugs, it cannot prevent future failures caused by the same bugs. This is because it disables the environmental changes after surviving the current failure, due to their potentially large overhead.

Recently, several studies have been proposed for avoiding concurrency bugs. Programming on transactional memory [92] prevents data races. Isolator [93] and ToleRace [94] exploit data replication to survive atomicity violation bugs during production runs. Yu et al. have proposed a shared-memory multi-processor design to avoid untested thread interleavings that may cause concurrency bugs [95]. Dimmunix [96] prevents programs from re-encountering previously-seen deadlocks. Complementary to these production-phase tools, our work is applied at the development phase.

2.4 Debugging and Problem Diagnosis

Many studies on debugging support focus on off-site tools, either incurring large overhead or relying on extensive human effort. Examples of such tools are delta debugging [97, 98], program slicing [99–101], interactive debugging [102], and deterministic replay [103–108].

A recent tool called Triage [109] focuses on detailed on-site failure diagnosis. It automates several debugging technologies, such as delta debugging, program slicing, memory bug detection, etc., and combines them into one framework to generate useful on-site diagnostic information. Although Triage provides deep and comprehensive analysis of a single failed process, it cannot help much for distributed or parallel programs running in complex environments. In addition, it can only shorten the developers' diagnosis delay to some extent; it cannot help with patch generation and application, so the downtime could still be long.

Many research efforts have been made to help detect bugs in parallel programs, including parallel debuggers [110–113], technologies to support interactive parallel debugging [114–117], and automatic bug detection tools [35, 118–123]. Most of these focus on helping interactive debugging by using automated information collection/aggregation technologies and visualization technologies.

Problem diagnosis in large-scale systems has been studied for many years [35, 124–130]. These works mainly study how to analyze and locate the root causes of system failures or performance problems. The root causes can be hardware failures, configuration problems, software bugs, operator mistakes, etc. Although helpful to some extent in bug detection and diagnosis, these methods tend to focus on monitoring environments or configurations, instead of problems in programs' semantics.

CHAPTER 3

2NDSTRIKE

This chapter discusses manifesting hidden concurrency bugs in multi-threaded system software in software testing. Naturally, catching bugs in the testing phase is the most preferable way to deal with them. However, testing is not always effective, especially when facing the concurrency bugs in multi-threaded programs. Unfortunately, for better performance on the emerging multi-core platforms and better responsiveness when serving multiple application software programs, more and more system software adopts a multi-threaded design. As a result, the effectiveness and efficiency of testing multi-threaded system software is an increasingly problematic issue. In order to address this challenge, we propose a testing framework called 2ndStrike, which helps manifest a common type of hidden concurrency bugs: concurrency typestate bugs. It focuses on a programming-language concept called typestate, and detects the concurrency bugs caused by synchronization errors that could result in typestate violations, e.g., accessing a file descriptor which was closed earlier by another, unsynchronized thread. Based on runtime profiling of the objects of interest and the associated operation events during program execution, 2ndStrike identifies candidates of concurrency typestate bugs and then re-executes the program with controlled thread scheduling to manifest the bug candidates.

This chapter is organized as follows. In Section 3.1, we give an overview of the motivation and our method. In Section 3.2, we describe the background on typestate bugs. In Section 3.3, we present the main idea and the design of 2ndStrike. Then we describe the evaluation methodology and present experimental results in Section 3.4 and Section 3.5, respectively. After that, we provide a brief summary in Section 3.6.

3.1 Overview

Concurrency bugs are becoming increasingly prevalent in the multi-core era. This is mainly because more programs are written or rewritten in a multi-threaded fashion to better utilize multi-core systems. However, it is extremely difficult for developers to clearly reason about all possible thread interleavings of program execution at the coding phase. Consequently, concurrency bugs (e.g., improperly synchronized accesses to shared resources) hidden in some corner cases of thread interleavings are inevitable; they often occur sporadically and cause runtime errors.

Due to their non-deterministic nature, concurrency bugs are notoriously hard to pinpoint at the testing phase. The de facto way to expose and detect concurrency bugs is stress testing, i.e., running a multi-threaded program repetitively for a long time with test inputs. While easy to conduct, stress testing is not efficient due to its large consumption of computing resources. Furthermore, stress testing is usually not very effective because it tends to exercise only a fraction of the possible thread interleavings [55,56].

Much research effort has been spent on detecting common concurrency bugs including data races and atomicity violations. These works include data race detectors [37,40,131] and atomicity violation detectors [42–44]. They mainly focus on detecting memory accesses to shared variables that are not protected by common locks or not serializable under the exercised thread interleavings. The philosophy is that programmers should protect accesses to shared variables via locks and achieve atomicity in order to guarantee correctness. However, these tools can only detect concurrency bugs that occur under the thread interleavings exercised during program execution.

Recently, several active testing techniques [56,57,132] have been proposed to expose data races or atomicity violations in multi-threaded programs even if the bugs do not manifest themselves during program execution. Based on information from external detectors or runtime profiling, these testing techniques actively manipulate the runtime thread scheduling to favor the interleavings that can trigger data races [57] or atomicity violations [56,132]. As a result, these techniques help expose data races or atomicity violations hidden in uncommon thread interleavings.

However, a previous study has shown that large system programs contain a significant number of concurrency bugs that do not involve data races or atomicity violations [133]. Instead of accessing shared variables unprotected or in an unserializable way, many of these concurrency bugs are caused by accessing general runtime resources in their illegal states. Figure 3.1 shows a real concurrency bug of this type in MySQL, a popular database server [134]. For many iterations of program execution, one thread (Thread 1) dereferences a pointer object and later another thread (Thread 2) invalidates the pointer by setting it to NULL. Due to improper synchronization, Thread 2 can set the pointer to NULL before Thread 1 dereferences the pointer, leading to a runtime error. This bug occurs even though both the dereference operation and the nullify operation are protected by the same lock and made atomic.

Figure 3.1: A real-world bug from MySQL. (The figure shows two interleavings of Thread 1 executing statement A, di->transaction->thaw(di), and Thread 2 executing statement B, transaction = NULL, each inside a Lock(sync)/Unlock(sync) region: in the correct order A precedes B, while in the incorrect order B precedes A and the dereference fails.)

file handler object. In multi-threaded programs, there may be multiple threads accessing an

object, atomically or not, which may or may not change object typestates permittedly only

in certain typestates. Concurrency bugs occur when the state manipulating thread is not

well synchronized with other threads that access the same object, causing the object being

applied with unpermitted operations in some typestates. In this chapter, we use the term

concurrency typestate bugs to refer to this type of bugs.

Concurrency typestate bugs are important yet challenging to address. First of all, they are common mistakes in large system programs [133]. In addition, concurrency typestate bugs are often relevant to program semantics because it involves high-level program in- variants. Furthermore, existing tools that help expose and detect data races or atomicity violations are not very effective in dealing with this type of bugs. Since these tools focus on the synchronization at memory access level, they do not take program semantics into consideration. Even there are data races or atomicity violations involved in a concurrency typestate bug, existing tools still provide few insights for understanding root causes of bugs since they do not capture high-level program invariants.

In this work, we propose a method called 2ndStrike to manifest potential concurrency typestate bugs hidden in multi-threaded programs. Our main idea is to profile the objects of interest and the associated operation events during program execution, identify candidates of concurrency typestate bugs based on the runtime profile, and then re-execute the program with controlled thread scheduling to manifest the bug candidates.

Specifically, based on a state machine that specifies object typestates and allowed operations, 2ndStrike performs three main steps. First, it monitors the typestates of objects at runtime and logs corresponding events, including operations performed on the objects and state transitions of the objects. For example, in Figure 3.1, 2ndStrike logs the pointer dereference events in Thread 1 and the pointer nullify events in Thread 2. In normal execution, the pointer dereference event usually happens before the pointer nullify event, so the bug is very difficult to manifest.

Then 2ndStrike analyzes the logged events continuously fed from the profiling step while the program is still running. Specifically, based on the given state machine, 2ndStrike identifies potential bug candidates, i.e., operation pairs in two threads that may cause concurrency typestate bugs if the order of the operation pairs is reversed. In Figure 3.1, a potential bug candidate is a pointer dereference operation from one thread that occurs before the nullify operation on the same pointer from another thread.

In the last step, with the generated potential bug candidates, 2ndStrike re-executes the program and applies scheduling changes during the re-execution that can help manifest the bugs. In Figure 3.1, the scheduling change is to delay the pointer dereference event in Thread 1 and wait for the nullify event in Thread 2 to happen first. If it successfully manifests the bug, 2ndStrike reports the bug with detailed debugging information. If the bug cannot be manifested due to unmonitored synchronization events, 2ndStrike continues to try other bug candidates with the process above.
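A minimal sketch of such a scheduling change is shown below (an illustration of the idea, not 2ndStrike's actual implementation): the instrumentation at the first event of a bug candidate parks the thread until the second event has been observed in another thread, or until a timeout expires because unmonitored synchronization prevents the reversal.

```cpp
#include <pthread.h>
#include <ctime>

static pthread_mutex_t rendezvous_m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  rendezvous_cv = PTHREAD_COND_INITIALIZER;
static bool second_event_seen = false;

// Hook at the first event of the candidate pair (e.g., before statement A).
// Returns true if the order was successfully reversed before the timeout.
bool wait_for_reversal(int timeout_ms) {
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) { deadline.tv_sec++; deadline.tv_nsec -= 1000000000L; }

    pthread_mutex_lock(&rendezvous_m);
    while (!second_event_seen &&
           pthread_cond_timedwait(&rendezvous_cv, &rendezvous_m, &deadline) == 0) {
        // spurious wakeup: keep waiting until the event arrives or we time out
    }
    bool reversed = second_event_seen;
    pthread_mutex_unlock(&rendezvous_m);
    return reversed;
}

// Hook at the second event of the candidate pair (e.g., after statement B).
void signal_second_event() {
    pthread_mutex_lock(&rendezvous_m);
    second_event_seen = true;
    pthread_cond_broadcast(&rendezvous_cv);
    pthread_mutex_unlock(&rendezvous_m);
}
```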

We have implemented a prototype of the 2ndStrike framework and used two common types of concurrency typestate bugs, i.e., invalid pointer dereference and lock operation violation, as examples. We have evaluated 2ndStrike with three real-world bugs from three open-source server and desktop software programs (i.e., MySQL [134], Mozilla [135], pbzip2 [136]) on an eight-core machine. Our experimental results show that 2ndStrike can effectively and efficiently manifest all three tested software bugs. In addition, it provides detailed bug reports and can consistently reproduce the bugs after manifesting them.

In summary, our work has made the following contributions:

• We have proposed a general solution for manifesting program-semantics-related concurrency bugs, i.e., concurrency typestate bugs, in software testing. To the best of our knowledge, 2ndStrike is the first method for this purpose.

• 2ndStrike is effective. It is able to manifest all three tested concurrency bugs, which are difficult to trigger by stress testing.

• 2ndStrike is efficient. Equipped with a stand-alone analyzer running in parallel with the testing program, it analyzes the streaming event logs and performs bug manifestation. Furthermore, with targeted instrumentation, it only needs to instrument a small set of events, which is lighter weight than many existing tools that instrument every memory access.

• 2ndStrike is easy to use. It does not require source code changes and can be easily executed with the existing test suites packaged with the software. In addition, it does not report false positives. Moreover, it provides reproducibility as well as a bug report to help developers diagnose the manifested bugs.

3.2 Background: Typestate Bugs

First introduced in [5], typestate is a temporal extension of the concept of types in programming languages. More specifically, a typestate defines a finite set of states and the state transitions corresponding to different stages of an object's lifetime, along with the operations permitted on objects in certain states. For example, a file handler object can be in two states, i.e., OPEN if the handler refers to an open file and CLOSED if the handler refers to a closed file. Operations such as read and write are only permitted to be applied to an OPEN file handler, not a CLOSED file handler.

Much research has been done on detecting typestate bugs in programs. Some approaches detect typestate bugs via program analysis [5, 137, 138] or symbolic execution [139], while others check typestate properties at runtime [140–143]. These approaches are proposed for detecting typestate bugs in sequential programs.

The situation gets more complicated when concurrency meets typestate bugs. In this work, concurrency typestate bugs are typestate bugs caused by ill-synchronized threads in multi-threaded programs. In general, concurrency typestate bugs are difficult to deal with for several reasons. First, when multiple threads are involved, the possible sequences of state transitions can grow much faster than in the single-threaded case. This often results in negligence when programmers try to enforce the program ordering invariants. Additionally, in relatively well tested programs, the operations performed on the objects are mostly in the correct orders, even when the order is not strictly enforced. This makes the corner-case bugs extremely hard to manifest in testing, so they easily slip into production runs. Furthermore, traditional typestate bug detection tools are not effective at detecting concurrency typestate bugs. On one hand, it is challenging for static tools [137–139] to model threading very well, and thus they may report a huge number of false positives. On the other hand, dynamic tools [140–143] are only effective for the bugs actually manifested in the execution, which itself is often not easy in limited testing environments.

Figure 3.2: 2ndStrike Procedure

3.3 Our Approach: 2ndStrike

Different types of concurrency typestate bugs are related to different aspects of program semantics. In order to manifest a specific type of concurrency typestate bug, 2ndStrike needs a state machine that captures the typestate property and identifies which thread-interleaving conditions can cause a typestate violation (i.e., potential concurrency typestate bugs). Given a state machine for a certain type of concurrency typestate bug, 2ndStrike performs three steps to manifest the target bugs. As shown in Figure 3.2, 2ndStrike first performs profiling to monitor the typestates of runtime objects and generate logs. As the logs are generated, the 2ndStrike analyzer processes them on the fly to generate bug candidates. A bug candidate is a pair of operations from different threads that can cause a concurrency typestate bug if their order is reversed. After the profiling and analysis steps, 2ndStrike performs multiple additional executions with controlled thread scheduling to manifest the potential bugs indicated by the bug candidates. If a bug is successfully manifested, 2ndStrike generates a bug report and remembers the bug candidate information to help programmers later reproduce the bug.

Figure 3.3: State machine for pointer typestates

In the rest of this section, we first present state machines for two types of common concurrency typestate bugs in Section 3.3.1, followed by the design and implementation of the three components (i.e., Profiler, Analyzer, and Manifestor) that perform the above-mentioned three steps in Sections 3.3.2-3.3.4. Finally, we discuss several key design issues.

3.3.1 State Machines for Detecting Concurrency Typestate Bugs

There are several ways to provide state machines for detecting concurrency typestate bugs. Generally, programmers can provide such state machines since they know high-level program semantics better than others. To reduce programmers’ manual effort, researchers have proposed static methods [144] as well as dynamic methods [145] for automatically inferring the likely typestate property of objects. Furthermore, we can derive certain state machines based on common programming rules, such as “cannot read on a closed file handler” and “cannot dereference an invalid pointer”.

In this work, we derive two state machines for two types of concurrency typestate bugs, i.e., concurrent invalid pointer dereference and concurrent invalid lock operation, respectively, as examples to illustrate how 2ndStrike works. Figure 3.3 and Figure 3.4 show the state machines for pointers and locks, respectively. They are relatively simple state machines, good for illustration purposes.

Figure 3.4: State machine for lock typestates

As shown in Figure 3.3, a pointer has three possible states: VALID, DANGLE, and NULL. A pointer in its VALID state points to an object in memory. In this state, there are four operations that can legally be applied to the pointer:

• Dereference: Dereference the pointer, such as accessing a field in a structure.

• Assign Valid: Assign the pointer using the memory address of a valid object in memory. Examples include allocating a new memory object.

• Free: Free the object pointed to by the pointer. This operation will affect the states of other pointers pointing to the same object.

• Assign NULL: Assign the NULL value to the pointer.

Among these operations, Dereference and Assign Valid do not change the state of the pointer. Free changes the pointer state to DANGLE, i.e., the pointer becomes a dangling pointer. Similarly, Assign NULL changes the pointer state to NULL, i.e., the pointer becomes a NULL pointer. In the DANGLE and NULL states, the only legal operations on the pointer are Assign Valid and Assign NULL, while Dereference and Free are illegal. Attempting to perform an illegal operation (e.g., Dereference in the NULL state) indicates a runtime error, such as a NULL pointer dereference, which is likely caused by a bug.
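To make the pointer state machine concrete, the following C++ sketch encodes the three states and the legality of each operation. It is only an illustration of the state machine described above; the type and function names are ours, not part of 2ndStrike.

enum class PtrState { VALID, DANGLE, NULL_ };   // NULL_ avoids the C macro NULL
enum class PtrOp { Dereference, AssignValid, Free, AssignNull };

// Returns true if 'op' is legal in state 's'; on success, 'next' holds the
// resulting state (which may be unchanged for non-state-changing operations).
bool pointer_transition(PtrState s, PtrOp op, PtrState& next) {
    switch (op) {
    case PtrOp::AssignValid: next = PtrState::VALID; return true;   // legal in any state
    case PtrOp::AssignNull:  next = PtrState::NULL_; return true;   // legal in any state
    case PtrOp::Dereference:                                        // legal only on VALID
        next = s;                return s == PtrState::VALID;
    case PtrOp::Free:                                               // legal only on VALID
        next = PtrState::DANGLE; return s == PtrState::VALID;
    }
    return false;
}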

Similarly, as shown in Figure 3.4, a lock has three states: INVALID, UNLOCKED, and LOCKED. A lock in the INVALID state, meaning it is not ready to use, can be initialized (via the operation Initialize) and changes to the UNLOCKED state. The states UNLOCKED and LOCKED can change back and forth via the operations Acquire and Release, respectively. In addition, a lock in either the LOCKED or UNLOCKED state can be invalidated via the operation Invalidate (e.g., destroying a lock), which changes its state back to INVALID.

In multi-threaded programs, synchronization errors can happen when performing state-changing operations on runtime objects. As shown in the example in Figure 3.1, the order between the Dereference operation by Thread 1 and the Assign NULL operation by Thread 2 is not strictly enforced. Therefore, in some rare cases, the pointer state could first change to NULL before the Dereference operation is applied, resulting in a typestate violation.

3.3.2 Runtime Profiling to Monitor Typestate

In the profiling step, 2ndStrike monitors the typestates of runtime objects in multi-threaded programs. More specifically, 2ndStrike logs the state transition information and operations made on objects for later analysis.

The runtime profiling is typestate-specific since it is for manifesting specific types of concurrency typestate bugs. For example, for invalid pointer dereference bugs, 2ndStrike logs the following runtime events: a) creation/destruction of runtime objects, such as malloc and free, to determine whether a memory address is valid or not, b) assignments to pointers, and c) dereferences of pointers. Similarly, for invalid lock operations, 2ndStrike logs: a) initialization/invalidation of locks, and b) lock acquire/release. The runtime overhead for profiling largely depends on the specific typestate being monitored. Since the profiling is targeted, its overhead is in general less than the overhead of monitoring each memory access (see Section 3.5 for details).

Additionally, 2ndStrike logs general synchronization events to prune false bug candidates in the later analysis step (Section 3.3.3). These events include the following categories: a) thread fork/join, b) lock activities, and c) semaphore or condition variable activities.

In addition to logging, 2ndStrike needs to force a specific scheduling order between events at the bug manifestation step (Section 3.3.4). For this purpose, we introduce the concept of an instrumentation point (InstPt) to unify the logging and thread scheduling control. An InstPt is a specific source code location where 2ndStrike either emits a log entry or inserts a delay, depending on the step. In the current prototype of 2ndStrike, we use a compiler framework called LLVM [146] to statically instrument programs.

Each log entry profiled by 2ndStrike includes the following information: a) the event type, such as pointer dereference or lock release, b) the thread ID on which the event happens, c) additional event information, e.g., which pointer is dereferenced or which lock has been released, d) the InstPt corresponding to the event, and e) the callsite indicating the runtime context of the event.
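As a rough illustration of what such a log entry might look like, consider the C++ struct below. The field names and the event-type list are our own, used only to make the later sketches concrete; they are not 2ndStrike's actual data structures.

#include <cstdint>

struct LogEntry {
    enum class EventType { PtrDereference, PtrAssignValid, PtrAssignNull,
                           ObjFree, LockInit, LockAcquire, LockRelease };
    EventType type;        // a) event type, e.g., pointer dereference or lock release
    int       thread_id;   // b) thread on which the event happened
    uintptr_t object_id;   // c) which pointer/lock/object the event refers to
    int       inst_pt;     // d) instrumentation point (static source location id)
    uint64_t  callsite;    // e) hash of the call stack, the runtime context of the event
};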

3.3.3 On-the-fly Analysis to Identify Bug Candidates

With runtime logs, the analyzer performs log analysis in order to identify the bug candidates. We first present criteria for potential concurrency typestate bugs, followed by algorithms for identifying bug candidates, then pruning and ranking methods used in 2ndStrike.

Figure 3.5: An abstract state machine

Criteria for potential concurrency typestate bugs. To manifest concurrency typestate bugs, 2ndStrike needs to identify potential synchronization problems between state-changing operations and the operations that are only permitted on a subset of states. Specifically, suppose we have a state machine with three states Sp, Sq, and Sr. In this state machine, operation Ox is allowed to be applied only on Sp, while operation Oy transfers state Sp to state Sq and operation Oz transfers state Sr to state Sp. Figure 3.5 shows this state machine. In this scenario, a bug candidate can be an operation pair of Ox by one thread and Oy by another thread, where Ox occurs earlier than Oy. This is because the order of Ox and Oy from the two threads can potentially be switched, and the order switch will cause a concurrency typestate violation. Similarly, a bug candidate can also be an operation pair of Oz by one thread and Ox by another, where Oz occurs earlier than Ox, since switching their order will cause a concurrency typestate violation.

Taking the pointer typestates as an example, the Dereference operation is only permitted in the state VALID but not in the state NULL, and the Assign NULL operation changes the pointer state from VALID to NULL. Therefore, if there is a runtime Dereference operation on a pointer by one thread, followed by an Assign NULL operation by another thread, and their order can potentially be switched, there exists an invalid pointer dereference bug.

Identifying bug candidates. Each bug candidate includes the information on the pair of events corresponding to the operations that need to be reordered. The purpose of this information is to allow 2ndStrike to later accurately identify the runtime operation pair to reorder. We call the two events in each bug candidate the preceding event and the subsequent event, reflecting the order they have in normal execution. For each event, the bug candidate includes the InstPt and the callsite information. It can additionally be configured to include the thread id and object id for stricter matching, but since thread id and object id can easily change across multiple runs, 2ndStrike does not use them by default.
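For the pointer state machine, the candidate criterion above boils down to a simple predicate over two logged events. The sketch below is our own illustration of that check, reusing the LogEntry sketch from Section 3.3.2; it is not 2ndStrike's implementation.

// A Dereference in one thread followed by a state-changing Free/Assign NULL on
// the same pointer in another thread is a candidate, because swapping the two
// would dereference a DANGLE or NULL pointer.
bool is_pointer_bug_candidate(const LogEntry& first, const LogEntry& second) {
    bool different_threads = first.thread_id != second.thread_id;
    bool same_pointer      = first.object_id == second.object_id;
    bool uses_valid_state  = first.type == LogEntry::EventType::PtrDereference;
    bool leaves_valid_state =
        second.type == LogEntry::EventType::ObjFree ||        // VALID -> DANGLE
        second.type == LogEntry::EventType::PtrAssignNull;    // VALID -> NULL
    return different_threads && same_pointer && uses_valid_state && leaves_valid_state;
}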

For many typestates being monitored, there are potentially a huge number of log entries emitted at runtime. The log storage can be an issue and may cause significant performance degradation of testing. To increase efficiency, we devise an on-the-fly design in which the log is consumed by the analyzer as soon as it is generated. In this design, the analyzer runs in parallel with the program being profiled and uses a stream processing algorithm to analyze the runtime logs. In this algorithm, the event information is stored in hash tables indexed by both thread id and object id. The lockset and vector clock for each log entry are maintained by the analyzer as well. As a new log entry comes in, the analyzer can quickly look up the tables and find related events to analyze together. Old event information is periodically cleaned up to limit the memory usage and maintain the efficiency of table lookups.
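The streaming analysis can be pictured as the following loop body, which keeps a per-object window of recent events and tests each incoming entry against them. This is a simplified sketch under our own naming (it reuses the LogEntry and is_pointer_bug_candidate sketches above), not 2ndStrike's actual code.

#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>
#include <vector>

// Recent events per object id; old entries are evicted to bound memory use.
std::unordered_map<uintptr_t, std::deque<LogEntry>> recent_by_object;
std::vector<std::pair<LogEntry, LogEntry>> bug_candidates;   // (preceding, subsequent)

void analyze_entry(const LogEntry& e) {
    std::deque<LogEntry>& prior = recent_by_object[e.object_id];
    for (const LogEntry& p : prior)                // pair earlier events with the new one
        if (is_pointer_bug_candidate(p, e))
            bug_candidates.push_back({p, e});
    prior.push_back(e);
    if (prior.size() > 1024)                       // periodic cleanup of old event info
        prior.pop_front();
}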

In addition, the analyzer holds a buffer of recently received log entries to avoid delaying the profiled execution due to jitter in logging. Only when the buffer is full does the profiled execution temporarily block to wait for the analyzer. With this design, the runtime logs only stay in memory and only a small list of bug candidates is written to disk.

Pruning false bug candidates. Among the identified bug candidates, some operation pairs cannot actually be reordered. One main reason is strict ordering between operation pairs enforced by thread synchronization events. To filter out such strictly-ordered operation pairs, 2ndStrike uses techniques similar to previous work [56]. Specifically, 2ndStrike profiles general synchronization events, such as thread fork/join and condition variable wait/signal, and computes a vector clock associated with each operation based on these events and the happens-before relation between operation pairs. Certainly, it is possible for threads to synchronize in customized ways that are not generally traceable, such as via atomic operations on shared variables (see Section 3.3.5 for details).
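The pruning test can be viewed as the standard vector-clock happens-before check: if the preceding operation's clock is componentwise less than or equal to the subsequent operation's clock, synchronization already fixes their order and the pair is discarded. A minimal sketch under that assumption (our own, not 2ndStrike's implementation):

#include <vector>

using VectorClock = std::vector<unsigned>;    // one component per thread

// True if the operation stamped 'a' must happen before the one stamped 'b',
// i.e., a <= b componentwise and a != b (clocks are assumed to have equal length).
bool happens_before(const VectorClock& a, const VectorClock& b) {
    bool strictly_smaller = false;
    for (size_t i = 0; i < a.size(); ++i) {
        if (a[i] > b[i]) return false;
        if (a[i] < b[i]) strictly_smaller = true;
    }
    return strictly_smaller;
}

// A candidate pair is pruned when synchronization already enforces its order.
bool prune_candidate(const VectorClock& preceding, const VectorClock& subsequent) {
    return happens_before(preceding, subsequent);
}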

3.3.4 Scheduling Control to Manifest Bugs

After the list of bug candidates is generated, 2ndStrike performs one round of execution for each bug candidate to try to manifest the bug. In other words, 2ndStrike attempts to reorder the preceding event and the subsequent event in this execution by inserting a delay before the preceding event.

Specifically, during a bug manifesting run, 2ndStrike first checks the operation at each InstPt against the preceding event in the bug candidate by comparing their InstPt ids and callsites. If their InstPt ids and callsites match, 2ndStrike blocks the thread on a semaphore, lets other threads proceed as normal, and looks for the matching subsequent event at each InstPt. In addition to matching InstPt ids and callsites with the subsequent event in the bug candidate, the object id in the operation has to match the one in the delayed operation. If a match is found, 2ndStrike unblocks the blocked thread and lets it commit the typestate violation. Moreover, 2ndStrike records the runtime information at this point for generating the bug report.

It is also common that the matching subsequent event never happens after one thread is blocked before the preceding event. This is mainly because other synchronization mechanisms such as flag variables are used in the program to prevent such a reordering. In this case, 2ndStrike will time out after a certain threshold, unblock the blocked thread, and keep looking for the preceding event. After a certain number of timeouts, 2ndStrike aborts the reordering attempt and stops this execution.
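Conceptually, the logic executed at each InstPt during a manifesting run behaves like the sketch below. The semaphore use, the timeout constants, and all helper names are illustrative simplifications of the behavior described above, not 2ndStrike's actual interface; it reuses the LogEntry sketch from Section 3.3.2.

#include <atomic>
#include <cstdint>
#include <semaphore.h>
#include <time.h>

struct BugCandidate { LogEntry preceding, subsequent; };

static sem_t wakeup;                   // assume sem_init(&wakeup, 0, 0) ran at startup
static std::atomic<bool> thread_blocked{false};
static std::atomic<uintptr_t> blocked_object{0};
static int timeouts = 0;

static bool same_static_location(const LogEntry& ev, const LogEntry& want) {
    return ev.inst_pt == want.inst_pt && ev.callsite == want.callsite;
}

// Called at every InstPt during a bug-manifesting run.
void on_inst_pt(const LogEntry& ev, const BugCandidate& cand) {
    if (thread_blocked.load() && same_static_location(ev, cand.subsequent) &&
        ev.object_id == blocked_object.load()) {
        sem_post(&wakeup);             // reorder achieved: let the delayed thread proceed
        return;                        // and commit the typestate violation
    }
    if (!thread_blocked.load() && same_static_location(ev, cand.preceding) &&
        timeouts < 10) {               // abort the attempt after repeated timeouts
        blocked_object.store(ev.object_id);
        thread_blocked.store(true);
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 5;          // bounded wait for the subsequent event
        if (sem_timedwait(&wakeup, &deadline) != 0)
            ++timeouts;                // subsequent event never came; keep looking
        thread_blocked.store(false);
    }
}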

3.3.5 Issues and Discussion

Un-monitored synchronization. Similar to previous active testing tools [56, 57], 2ndStrike cannot monitor all possible synchronizations that happen in the program execution, especially the ones based on shared flag variables. This causes the 2ndStrike analyzer to generate, as bug candidates, some operation pairs that are impossible to reorder. Although these candidates will be pruned later since 2ndStrike cannot manifest them, this does reduce the testing efficiency. To address this issue, we can either ask developers for hints on their customized synchronization mechanisms or derive such information.

How to more accurately capture the synchronization behavior in multi-threaded programs is a research problem and is left as future work.

Non-determinism. Multi-threaded programs are naturally non-deterministic in their execution, which means the profiling result based on previous executions may not be valid for subsequent executions. Like previous active testing tools [56, 57], 2ndStrike assumes that for one input, the code executed in different runs is mostly the same. In practice, unfortunately, there is a certain degree of non-determinism involved in program executions, especially for large complex software. The result is that in some program executions that are very different from the profiling execution, 2ndStrike may fail to manifest the concurrency typestate bugs because the object is not operated on in the same way. To address this issue, we can integrate 2ndStrike with existing deterministic replay tools [104, 147].

App.      Bug id        Bug description
MySQL     MySQL-ptr     An invalid pointer dereference causing Falcon crash when handling transactions
Mozilla   Mozilla-ptr   An invalid pointer dereference causing JS crash when destroying contexts
pbzip2    pbzip2-lck    An invalid lock usage causing crash when decompressing a file

Table 3.1: Evaluated applications and concurrency typestate bugs

3.4 Evaluation Methodology

We have implemented a prototype of 2ndStrike on Linux and conducted our experiments on an Intel Xeon machine with eight 2GHz processing cores and 16GB memory. The Linux kernel version is 2.6.16. We have evaluated the manifesting capability of 2ndStrike for the two types of concurrency typestate bugs used as examples (i.e., invalid pointer dereference and invalid lock operation). The software benchmarks include three multi-threaded open-source software packages (MySQL, Mozilla and pbzip2) with three real-world bugs, as shown in Table 3.1. For MySQL, we focus on testing its Falcon storage engine, which is a new storage engine for highly concurrent workloads, and for Mozilla, we focus on testing its JavaScript engine, JS.

To evaluate the effectiveness, for each bug, we ran 2ndStrike with the input which can potentially trigger the bug with certain thread interleavings. When possible, we chose to use the default testing framework for the program such as mysql-test, and the default testing inputs such as JavaScript files provided in the Mozilla JavaScript engine release package.

This allows us to evaluate how 2ndStrike integrates with existing test suites, which we consider an important issue for usability. For pbzip2, since it does not have test inputs included in the package, we used a medium-size file as input (a 193KB bz2 file decompressed into a 5MB text file). For comparison, we also performed stress testing by running the tested programs with the same input repetitively.

To evaluate the efficiency of 2ndStrike, we have conducted experiments for measuring both profiling/analyzing overhead, and bug manifesting overhead (i.e., for controlled thread interleavings). For profiling/analyzing overhead, we measure the program execution time under the following four configurations:

• Baseline, executing the programs without 2ndStrike.

• 2ndStrike's profiling only, executing the programs with 2ndStrike's profiling functionality for both types of concurrency typestate bugs and logs not recorded (fed into /dev/null).

• 2ndStrike's profiling and analyzing, executing the programs with 2ndStrike profiling turned on along with the analyzer for stream log processing.

• Memory access profiling, executing the programs with profiling of every memory access and logs not recorded. This is for comparing 2ndStrike's profiling with data race or atomicity violation detectors' profiling.

For bug manifesting overhead, we measure the execution time for a manifesting run with the bug successfully manifested, and the execution time for a manifesting run without the bug being manifested, e.g., when the tested bug candidate does not correspond to a real bug.

In addition, we have evaluated each real bug's reproducibility. With the bug candidate identified by 2ndStrike after the initial successful manifestation, we execute 2ndStrike in manifesting mode multiple times to measure how well the bug can be reproduced. Reproducibility is important to developers when they try to understand and fix a bug.

36 3.5 Experimental Results

3.5.1 Effectiveness

Bug id       2ndStrike  Stress   Data race directed  Atomicity violation directed
MySQL-ptr    255        45,535   Maybe               Yes
Mozilla-ptr  45         No       Maybe               No
pbzip2-lck   1          No       Maybe               Maybe

Table 3.2: Overall effectiveness of 2ndStrike: The '2ndStrike' and 'Stress' columns show the time in seconds needed to manifest the bug (No means bug not manifested in 24 hours). The last two columns show the code analysis results on whether the data race and atomicity violation directed tools can effectively manifest the bug

Table 3.2 describes our evaluation results on the effectiveness of 2ndStrike. The second column shows the overall time in seconds needed for manifesting the tested bug, including profiling, analysis, and the bug manifesting execution. The third column shows the time in seconds needed to manifest the bug using stress testing. 'No' means the bug cannot be manifested by stress testing after executing for 24 hours.

Our results show that 2ndStrike is effective in manifesting the tested concurrency typestate bugs. With 2ndStrike, all three tested bugs are manifested within 255 seconds. In contrast, stress testing cannot manifest two out of the three bugs (Mozilla-ptr and pbzip2-lck). For MySQL-ptr, stress testing is 178 times slower than 2ndStrike at manifesting the bug.

The fourth and fifth columns show the code analysis results on whether data race and atomicity violation directed tools can effectively manifest the bug. 'Maybe' means there could be data races or atomicity violations related to the bug-triggering object; however, just triggering any data race or unserializable access interleaving does not guarantee the manifestation of the bug. This is because there is usually a non-trivial number of accesses on the bug-triggering objects considered as data races or unserializable accesses. Since the bug can be manifested only when specific orders are met in the accesses, only a subset of cases is effective in manifesting the bug. Furthermore, if the data race or atomicity violation directed tools fail to manifest the bug, they may even categorize the bug as 'benign'.

The main reason why 2ndStrike can be effective in manifesting these bugs is that it focuses on the semantic errors instead of only on memory accesses. Take the Mozilla-ptr bug as an example. This bug was caused by an 'order violation' in using and destroying runtime objects in js_DestroyContext, as analyzed in a previous study [133]. Since it does not involve clear unserializable accesses to shared variables, atomicity-directed bug manifesting tools are not effective in manifesting it.

3.5.2 Efficiency

(a) MySQL (b) Mozilla (c) pbzip2

Figure 3.6: Efficiency of 2ndStrike: 'N' means normal execution time without 2ndStrike being applied; 'P' means execution time for 2ndStrike profiling only; 'PMA' means execution time for profiling memory accesses, with logs fed into /dev/null; 'PA' means execution time for 2ndStrike profiling and on-the-fly analysis. 'MS' means execution time for a bug manifesting run that successfully manifests the bug; 'MP-Min', 'MP-Max', and 'MP-Avg' mean the minimal, maximal, and average execution time for a bug manifesting run which passes without manifesting any bug, respectively.

38 Figure 3.6 (a), (b), and (c) show the detailed performance results when using 2ndStrike to manifest the three concurrency typestate bugs, respectively.

We can observe that 2ndStrike causes tolerable runtime overhead for profiling and analysis for software testing purposes. The reason for its profiling overhead (column 'P' versus column 'N', ranging from 2x to 88x) is two-fold. First, to monitor the pointer typestate, every pointer dereference event is logged, which is quite intensive. For some other typestates such as lock and file descriptor, the number of events that need to be monitored is orders of magnitude smaller. In addition, in order to emit runtime events to a single analyzer, all threads need to hold a common lock for doing so, which becomes a bottleneck in profiling executions. Even so, 2ndStrike's profiling overhead is still less than the overhead of logging every memory access (column 'PMA').

Moreover, the results show that 2ndStrike's stream processing design greatly helps reduce the profiling and analysis overhead (see column 'PA' versus column 'P'). We measured the total number of logged events and the size of the logs if dumped to disk. They can be huge for large software. For example, when executing the input on the MySQL Falcon storage engine, whose normal execution only takes 6 seconds, the execution generates 31,539,852 events to log and it takes 5.6 GB of disk space to store the log. Without the stream processing design, the overhead of just writing the logs to disk would be very high, making the overall profiling/analysis time orders of magnitude larger.

Furthermore, the results indicate that the overhead for bug manifesting execution is very small. The last four bars in Figure 3.6 show the execution time for a successful manifesting run with the concurrency typestate bug manifested (column 'MS') and the minimal/maximal/average execution time for a manifesting run that fails to manifest any bug (columns 'MP-Min', 'MP-Max', and 'MP-Avg'). We can see that the execution time for bug manifestation is generally very small because the lock contention issue mentioned above does not hold for bug-manifesting executions. The reason why in some cases the execution time for successfully manifesting the bug is shorter than the normal execution time is that when the bug manifests, the program crashes in the middle without finishing the execution.

App.          Sync    Pointer  Lock    Total 2ndStrike  Load     Store   Total Access
MySQL Falcon  1,476   52,045   1,449   53,521           43,864   13,676  57,540
Mozilla JS    124     18,072   122     18,196           124,552  47,666  172,218
pbzip2        85      483      74      578              982      203     1,185

Table 3.3: Detailed statistics of instrumentation by 2ndStrike: Numbers of instrumentation points are shown. 'Sync' means the instrumentation for synchronization; 'Pointer' means the instrumentation for detecting invalid pointer dereference; 'Lock' means the instrumentation for detecting invalid lock operations; 'Total 2ndStrike' means the total instrumentation made by 2ndStrike. 'Load', 'Store', and 'Total Access' mean the instrumentation of memory reads, writes, and all memory accesses, respectively.

App.          Sync       Pointer     Lock       Total 2ndStrike  Load        Store       Total Access
MySQL Falcon  1,534,786  30,005,066  1,482,286  31,539,852       31,029,421  15,721,396  46,750,817
Mozilla JS    219,444    6,476,891   107,646    6,696,335        45,442,055  16,134,868  61,576,923
pbzip2        77         96,795      53         96,872           193,209     242         193,451

Table 3.4: Detailed statistics of logged runtime events by 2ndStrike. Meanings of columns are the same as for Table 3.3.

Moreover, the results indicate that by only instrumenting the relevant runtime events, 2ndStrike helps reduce the runtime overhead. Table 3.3 shows the detailed statistics of instrumentation by 2ndStrike for each application. For example, for Mozilla JS, 2ndStrike has 18,196 InstPts in total, while tracing all memory accesses would require instrumenting 172,218 places, which is 9 times more. The reduced instrumentation directly results in fewer runtime events to profile, as shown in Table 3.4. For Mozilla JS, 2ndStrike's profiling logs 9 times fewer events than profiling all memory accesses.

40 3.5.3 Usability

Integration with testing frameworks. 2ndStrike can be easily integrated with existing testing frameworks and test inputs. It does not require source code changes or special dynamic instrumentation environments. All it requires is changing the build scripts for the software component. In addition, it can flexibly be applied to only a subset of components in a large software package. In our experiments with MySQL and Mozilla, we only apply 2ndStrike to MySQL's Falcon storage engine and Mozilla's JavaScript engine. We consider that this flexibility can help developers perform more targeted testing and is very important for large software with many components.

Bug reporting. After manifesting a bug, 2ndStrike generates an accurate and detailed bug report, which includes the following pieces of information: a) the failed test input, b) the typestate violation manifested, c) the bug candidate, and d) the runtime 'preceding event' and 'subsequent event' that happened during the bug manifesting execution. Note that since the bug is manifested when the order between the 'preceding event' and 'subsequent event' is flipped, the 'subsequent event' actually happens first in the buggy run while the 'preceding event' happens after. For both events, the bug report includes: a) the InstPt information, which corresponds to a unique source code location, b) the runtime callsite, including multiple levels of function names, c) the thread id, and d) the object id. This information can be very helpful for developers to diagnose the bug and pinpoint the root cause.

No false positives. In addition, 2ndStrike reports no false positives in its final results. This is because all bug candidates that do not correspond to real bugs are pruned automatically in the bug manifesting executions. Only the bug candidates that can trigger real concurrency typestate bugs are reported in the final results. This feature is also critical to developers, who are generally very reluctant to spend effort in vain.

Bug id       Probability to manifest by 2ndStrike
MySQL-ptr    88.24%
Mozilla-ptr  100%
pbzip2-lck   100%

Table 3.5: Reproducibility of the manifested concurrency typestate bugs in 2ndStrike

Bug reproducing. Moreover, 2ndStrike helps developers further by making reproducing the bug easier. As shown in Table 3.5, a manifested concurrency typestate bug candidate can be reproduced by 2ndStrike with high probability after it has been manifested once. The reason why it is not 100% reproducible is non-determinism (see Section 3.3.5 for details). In some of these multi-threaded program executions, the exercised code path is different from the profiling run.

3.6 Summary

In summary, this chapter presents an effective method called 2ndStrike for manifesting concurrency typestate bugs in multi-threaded programs in the testing phase. To do so, 2ndStrike profiles key runtime events that are related to object typestates described by a given state machine, as well as synchronization events. Then 2ndStrike identifies bug candidates by checking the profiled events against the state machine with different orders of possible thread interleavings. Finally, 2ndStrike re-executes the program with controlled thread interleaving to manifest the bug candidates.

Our evaluation of 2ndStrike with three real world bugs from three open-source server and desktop programs (i.e., MySQL, Mozilla, pbzip2) on an eight-core machine has shown that 2ndStrike can effectively and efficiently manifest the three tested software bugs, i.e., within 255 seconds, at least 178 times faster than stress testing. Additionally, 2ndStrike does not generate false positives, provides detailed bug reports, and can consistently reproduce the bugs after manifesting them once. Therefore, it offers great assistance for developers to diagnose and fix the bugs as well.

CHAPTER 4

DMTRACKER

This chapter discusses runtime monitoring and anomaly detection in distributed system software during production runs. No matter how carefully testing is conducted before release, inevitably any non-trivial program, especially system software, will contain bugs after deployment. Sooner or later the bugs will be triggered, causing the system software to exhibit abnormal behaviors. Both operators and developers depend on runtime monitoring and anomaly detection techniques to detect the errors and hopefully find the cause of the bugs. Specifically, for system software that supports parallel programs running on distributed machines like clusters, it is quite challenging to effectively collect useful execution information and use that information to find bugs. To address this challenge, we propose a light-weight tool called DMTracker focusing on the data movements among parallel processes. Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check for violations of these invariants. These violations indicate potential bugs such as data races and memory corruption bugs that manifest themselves in data movements. Based on this idea, DMTracker automatically extracts DM-based invariants and catches anomalies in parallel program execution.

This chapter is organized as follows: In Section 4.1, we give an overview of the motivation and our method. In Section 4.2, we describe the background on statistics-rule-based bug detection. In Section 4.3 and Section 4.4, we present the main idea of DM-based invariants and the design of DMTracker, respectively. Then we discuss the evaluation results in Section 4.5, followed by a brief summary in Section 4.6.

4.1 Overview

Bugs in system software greatly affect the reliability of high performance systems. A recent study [4] of more than 20 different high-performance computer systems at Los Alamos National Laboratory has shown that software bugs account for as many as 24% of failures. Furthermore, software bugs may silently corrupt application data and remain unnoticed until termination of the whole task, which leads to incorrect results and significantly affects overall productivity. With peta-scale and many-core architectures becoming mainstream in High Performance Computing (HPC) systems in the foreseeable future, software will become much more complex and software bugs may cause more frequent and more severe system problems.

Unfortunately, finding software bugs in high-performance systems is a daunting task due to the systems' inherent non-determinism and large scale. Bugs such as data races manifested during one execution may not be triggered during another execution because of various non-deterministic events, such as process execution orders, thread interleaving, signal delivery timing, I/O events, etc. Such non-determinism makes it difficult to reproduce bugs and thus poses a significant challenge for detecting and locating a software bug.

Furthermore, the problem becomes much more complicated due to the ever-increasing scale of high performance systems. For example, some software bugs can only be triggered in very large-scale systems. However, it would cause a huge waste of resources if developers had to occupy the whole large-scale system for inefficient manual debugging. Therefore, it is imperative to have low-overhead tools deployed in production runs to help automatically locate software bugs.

Previous work on detecting bugs at execution time can be classified into two categories: programming-rule-based approaches and statistics-rule-based approaches [28]. While methods in both categories check certain rules during program execution, they focus on different types of rules. Programming-rule-based approaches detect violations of rules imposed by specific languages such as C/C++ or specific interfaces such as the Message Passing Interface (MPI). For example, "bounds of the message buffer cannot exceed its allocated bounds" and "all members of one process group must execute collective operations over the same communicator in the same order" are two rules used by MPI-CHECK [121] and Umpire [119], respectively. Much research has been conducted in this category, such as Purify [21], Valgrind [22], Umpire [119], MARMOT [120], MPI-CHECK [121], etc. While they are effective in detecting some types of software bugs at runtime, their rules are either too general to be able to detect semantics-related bugs [21, 22] or highly dependent on domain-specific expertise and human effort [119–121].

Statistics-rule-based methods extract rules statistically at program runtime and check for violations of the extracted rules. Statistical rules, or dynamic invariants [33, 34], are properties that likely hold at a certain point or points in a program. One can extract invariants at runtime over a single run or across multiple runs. Several recent works [28, 33, 34] have demonstrated that statistics-rule-based approaches are very promising due to their effectiveness in detecting bugs that do not violate any programming rules. Daikon [33] and DIDUCE [34] extract value-based invariants (i.e., the possible values of a given variable are always within a certain range) at runtime and can detect bugs that generate abnormal values. Similarly, AccMon [28] captures runtime program-counter-based invariants (i.e., a given variable is typically accessed by only a few instructions), and uses them to detect memory-related bugs. Many statistical rules extracted from program execution are related to program semantics, which are usually not accurately documented by programmers and are difficult to infer from the code itself [148].

Parallel or distributed programs, the most common type of application running on HPC systems, are especially suitable for applying statistics-rule-based methods. In addition to the temporal dimension explored by previous works [28, 33, 34] (i.e., invariants based on program behaviors in multiple runs or multiple phases of a single run), one can explore the spatial dimension (i.e., invariants based on the behaviors of multiple concurrently-running processes) in parallel systems. For example, in scientific parallel applications, usually similar or even identical tasks are performed in multiple iterations (the temporal dimension), as well as by multiple processes (the spatial dimension). Similarly, in some commercial HPC systems such as web server farms, a group of processes concurrently handle tasks in the same way to achieve high throughput.

Recently, Mirgorodskiy et al. conducted an initial study of using a statistics-rule-based approach to diagnose software problems in large-scale systems [35]. Their approach extracts the invariant of function time distribution in control flow, and then identifies the abnormal process among a large number of parallel processes. They have shown that this invariant is effective for locating problematic processes and functions. However, their approach has two limitations in detecting software bugs. First, they cannot detect bugs that do not cause abnormal function time distribution across multiple processes. For example, data corruption may only cause wrong results without affecting the function time distribution. Similarly, some bugs manifest themselves in all the processes, resulting in the same distorted function time distribution for all the processes, and thereby their approach cannot detect these bugs. Second, the function time distribution invariant is easily affected by system-level noise [149] (e.g., process scheduling, signal delivery, network congestion), and thus it does not reflect the semantics of programs very accurately. For example, processes performing identical tasks on different nodes can show very different function time distributions if the network traffic load is unbalanced across nodes.

In this work, we propose a novel statistics-rule-based technique, called data movement (DM)-based invariants, to find hard-to-detect software bugs that can cause severe problems such as data corruptions and deadlocks in large-scale parallel programs. Our idea is based on the observation that data movements in parallel programs typically follow certain patterns. If we can extract such DM-based invariants at runtime, it is possible to detect abnormal data movements that are caused by potential software bugs.

Our idea is inspired by the fact that data movements in parallel programs are pervasive and bug-inducing. Different from sequential programs, parallel programs require multiple processes to coordinate with each other to perform large tasks. Therefore, processes in parallel programs usually communicate with each other very frequently. Unfortunately, programmers can easily make mistakes in various situations when performing data movements. On the application level, many parallel algorithms require data to be exchanged in non-trivial ways due to subtle boundary condition handling. In addition, parallel programming models, such as MPI, provide a variety of communication interfaces with different semantics such as point-to-point send/receive, collective calls, etc. Often, it is difficult for application programmers to precisely understand the subtle semantic differences and choose the correct interface. Furthermore, to achieve better performance, the applications and libraries may introduce aggressive and error-prone optimizations, which may work correctly most of the time but affect correctness in some corner cases. Therefore, bugs can be easily introduced and manifested in the pervasive data movements in parallel programs.

More specifically, we propose two types of DM-based invariants: frequent chain (FC)-invariants, i.e., frequently occurring data movement chains, and chain distribution (CD)-invariants, i.e., clusters of data movement chain distributions of multiple processes. FC-invariants and CD-invariants focus on temporal and spatial similarity of data movements in parallel programs, respectively. Violations of them, abnormal data movement chains for FC-invariants or an outlier data movement chain distribution in one process for CD-invariants, may indicate potential software bugs such as data corruptions, livelocks, deadlocks, etc. Note that these two types of invariants are based on data movement chains formed by linking individual data movements where the destination of one data movement is the source of a subsequent data movement. The reason for doing so is that data movement chains reflect semantic information better than individual data movements.

Based on these ideas, we have built a tool, called DMTracker, to extract FC-invariants and CD-invariants at runtime and check for violations of them. Our experiments with two real-world bug cases in MVAPICH/MVAPICH2 [6], a popular MPI library, have shown that DMTracker can effectively detect them and report abnormal data movements to help programmers quickly diagnose the root causes of bugs. Complementary to existing programming-rule-based and statistics-rule-based tools, DMTracker has the following unique advantages, some or all of which are unavailable in other tools.

• To the best of our knowledge, DMTracker is the first automatic tool that utilizes statistical rules based on data movements (i.e., DM-based invariants) for detecting hard-to-detect software bugs that can cause severe problems such as data corruptions and deadlocks in parallel programs. To find software bugs, it focuses on a key aspect of parallel programs – the data movements. Based on DM-based invariants, DMTracker can detect various types of severe bugs such as data corruptions and deadlocks that manifest themselves in data movements and help programmers diagnose the root causes.

• DMTracker can detect both deterministic and non-deterministic software bugs that manifest themselves in only a few processes or across all processes. This is because it extracts invariants by exploring both temporal and spatial similarities of data movements. Our experiments have shown that DMTracker is effective in detecting both deterministic and non-deterministic bugs. In contrast, previous work [35] cannot handle software bugs manifested across all the processes, which is demonstrated by our first bug case.

• DMTracker can detect software bugs that do not violate the function time distribution since DM-based invariants capture "how data move" instead of "when data move." Our experiments with the second bug case have shown that the problematic function time distribution can be easily overshadowed by other functions or system-level noise and thus is hard to identify with previous work [35].

• DMTracker incurs low overhead due to our system design and the usage of a low-overhead dynamic instrumentation tool called Pin [150]. Our experimental results show that the runtime overhead of DMTracker is only 0.9-6.0%. Therefore, it is possible to directly apply DMTracker to production runs.

4.2 Background: Statistical Rule Based Bug Detection

Statistics-rule-based bug detection methods extract statistical rules (i.e., dynamic invariants) at program runtime and check for violations of the extracted invariants, which indicate potential software bugs. These approaches are motivated by the observation that hard-to-detect bugs are usually those lurking in a corner case that rarely happens. Because the program behaves correctly in most cases, the bugs in corner cases are more difficult to manifest by testing and thus more likely to cause problems in production runs. Due to their hidden nature, these bugs take much more time to detect and fix.

In recent years, research efforts have been made in this direction [28, 33, 34, 44]. They have introduced several types of dynamic invariants to detect bugs that manifest themselves in different ways. Daikon [33] and DIDUCE [34] focus on the value ranges of variables and use them as invariants to detect abnormal values of variables. AccMon [28] focuses on program counter (PC)-based invariants to detect abnormal instructions accessing a certain memory location. AVIO [44] makes use of Access Interleaving (AI) invariants to detect violations of atomic execution in multi-threaded programs. These works focus on sequential programs, and thus their proposed invariants mainly capture the temporal similarity of program behavior. Recently, in [35], an anomaly-detection method was proposed that uses function-time-based invariants for diagnosing problems in large distributed computing environments. This work focuses on identifying the abnormal process in a parallel application by capturing the spatial similarity of parallel programs.

In our work, we focus on one of the key aspects of parallel programs, the data movements, and propose two data movement-based invariants to capture both temporal similarity and spatial similarity in parallel programs.

4.3 Data Movement-Based Invariants

A data movement is the movement of a chunk of memory data from a source buffer to a destination buffer. For example, copying a chunk of memory data from buffer A to buffer B corresponds to one data movement A → B. Typically, we can capture data movements in parallel programs at different levels: on the application level, by regarding each communication call such as an MPI library call as one data movement, or on the library level, by analyzing each primitive operation such as a memory copy, network send/receive, etc., as one data movement.

For bug detection purposes, we choose the library level because it provides comprehensive information for finding bugs in applications as well as communication libraries and it is decoupled from particular programming models and communication interfaces.

Individual data movements reveal little information about program semantics. Thus it is difficult to extract invariants from them. To address this issue, we link a series of data movements to form a data movement (DM)-chain in such a way that the destination of the previous data movement is the source of the subsequent data movement. Figure 4.1 illustrates a simple DM-chain between two processes. The whole chain in the figure can be caused by a pair of communication calls such as MPI_Send and MPI_Recv on the application level. DM-chains are the basis of our two proposed types of invariants, which are described in the following two sections.

Figure 4.1: A simple DM-chain: A1→B1→B2→

4.3.1 Frequent Chain Invariants

FC-invariants are frequently-occurring DM-chains. This is based on the observation that processes in parallel programs often exhibit temporal similarity, e.g., performing similar or identical tasks in multiple iterations. As a result, similar DM-chains occur many times during program execution. Based on this observation, we can group similar DM-chains together according to their type information, e.g., the call sites of data movements and memory buffers. Then we can use the large-sized groups, i.e., frequently-occurring DM-chains, as FC-invariants.

Based on the FC-invariants, it is possible to detect abnormal data movements, i.e., data movements similar to an FC-invariant but with a slight difference. These abnormal data movements are potentially caused by software bugs and deserve programmers' attention. Typically, abnormal data movements are caused by buffer misuse and can lead to data corruptions as well as other errors such as crashes or deadlocks.

Figure 4.2(a) shows a simplified bug case extracted from a communication library, where data corruption is caused by pointer misuse. While the code path of the common case goes through line 11, the bug is at line 7, in the code path that deals with a rare case, where some data need to be piggybacked onto the packet to notify its peer about some event, e.g., running out of resources. Since the programmer tends to think in bytes rather than data objects when programming network protocols, sizeof is used to calculate the offset. Unfortunately, without parentheses, the type cast takes higher precedence and the address is subtracted in units of sizeof(Piggyback_t), so the actual address change is sizeof(Piggyback_t) × sizeof(Piggyback_t), which causes the pointer to point to another buffer. This bug can easily slip through normal software tests since it deals with an uncommon case that occurs only when some rare event happens. Furthermore, it is difficult to detect this bug during production runs since it may manifest itself as silently sending incorrect data to the remote receiver.

Figure 4.2(b) shows the data movements related to the send() routine, including both the common case and the uncommon case. It clearly shows that the bug manifests itself in an abnormal data movement. During normal program execution, DM-chains related to the send() routine are X → A → B → C → D → Y. When the bug occurs, the DM-chain changes to A0 → B → C → D → Y, similar to the FC-invariant for the common case but with the link from A → B broken. Obviously, it violates the FC-invariant.

1  int send (void* buf, int len, ...) {
2      //Send operation using bottom-fill
3      void *comm_buf_bottom = get_comm_buf_bottom(...);
4      void *data_buf = comm_buf_bottom - len;
5      memcpy (data_buf, buf, len);
6      if (need_piggyback) { //Rare case
7          Piggyback_t *pb = (Piggyback_t*) data_buf-sizeof(Piggyback_t); /*bug here: missing parentheses*/
8          ... /*Fill piggy back structure*/
9          return low_level_send((void*)pb, len+sizeof(Piggyback_t), ...);
10     } else { // Common case
11         return low_level_send(data_buf, len, ...);
12     }
13 }

(a) (b)

Figure 4.2: An abstracted bug case reflected in abnormal data movement chain

FC-invariants can be extracted from one run or based on multiple runs. DMTracker does not require reference data to distinguish abnormal chains from normal ones. However, in some cases, the incorrect DM-chains can be the "common case" in one specific run when the bug happens. To be more effective in detecting bugs in these cases, DMTracker also allows users to provide a "training set", a number of traces known to reflect correct behaviors, so that it extracts patterns only from those traces. Then DMTracker can be more effective in locating incorrect behaviors in traces from problematic runs.
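One way to picture FC-invariant extraction is as frequency counting over chain signatures, where a signature is the concatenation of each link's type information (call site plus buffer allocation sites). Chains whose signature occurs often become invariants, and the remaining low-frequency signatures are the suspects. The C++ sketch below is our own illustration with invented names, not DMTracker's code.

#include <map>
#include <string>
#include <vector>

struct DMOperation {
    std::string callsite;      // where the copy/send/receive was issued
    std::string src_alloc;     // allocation site of the source buffer
    std::string dst_alloc;     // allocation site of the destination buffer
};
using DMChain = std::vector<DMOperation>;

// Signature used to group "similar" chains: the type information of every link.
std::string chain_signature(const DMChain& chain) {
    std::string sig;
    for (const DMOperation& op : chain)
        sig += op.callsite + "|" + op.src_alloc + ">" + op.dst_alloc + ";";
    return sig;
}

// Chains whose signature appears at least 'threshold' times are FC-invariants;
// chains outside this set are candidates for abnormal data movements.
std::vector<std::string> extract_fc_invariants(const std::vector<DMChain>& chains,
                                               size_t threshold) {
    std::map<std::string, size_t> counts;
    for (const DMChain& c : chains) ++counts[chain_signature(c)];
    std::vector<std::string> invariants;
    for (const auto& kv : counts)
        if (kv.second >= threshold) invariants.push_back(kv.first);
    return invariants;
}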

4.3.2 Chain Distribution Invariants

CD-invariants are clusters of chain distributions that the chain distribution of each process should fit in. They are based on the observation that processes in parallel programs often exhibit spatial similarity, e.g., performing similar or identical tasks and following the same symmetric communication patterns across multiple processes. As a result, the distribution of various groups of DM-chains (grouping DM-chains using the same method as for FC-invariants) is similar across multiple processes. Therefore, we can use chain distribution clusters as CD-invariants.

Figure 4.3 demonstrates the chain distributions in all processes for the High Performance Linpack (HPL) benchmark [7] and SP, MG in the NAS Parallel Benchmarks (NPB) [8]. (Most other benchmarks in NPB show similar trends; we omit them due to lack of space.) The x-axis indicates the process ID of each process in the parallel programs, and each column of a graph shows the chain distribution in that process. It is clear that during normal execution, all 64 processes for the tested benchmarks share very similar chain distributions.

Figure 4.3: Distribution of Chains for 64 Processes

Based on CD-invariants, it is possible to capture bugs that happen in a small number of processes. DMTracker compares chain distributions across all processes and automatically locates the manifesting process among a large number of peers, so that the search space for the bug can be greatly narrowed down. Typically, an abnormal chain distribution is caused by some algorithm or protocol error in parallel programs, which can manifest itself as an infinite loop (i.e., livelock), a deadlock, etc.
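A CD-invariant check can be thought of as computing, for each process, a normalized histogram over chain groups and flagging processes whose histogram is far from the others, for example by their average distance to the other processes' distributions. The following is a rough sketch under those assumptions, with our own names and a simple Manhattan distance; it is not DMTracker's actual clustering algorithm.

#include <cmath>
#include <map>
#include <string>
#include <vector>

using ChainDistribution = std::map<std::string, double>;  // chain group -> fraction of chains

// Manhattan distance between two normalized distributions.
double distribution_distance(const ChainDistribution& a, const ChainDistribution& b) {
    double d = 0.0;
    for (const auto& kv : a) {
        auto it = b.find(kv.first);
        d += std::fabs(kv.second - (it == b.end() ? 0.0 : it->second));
    }
    for (const auto& kv : b)
        if (a.find(kv.first) == a.end()) d += kv.second;
    return d;
}

// Report process ranks whose average distance to the other processes exceeds
// 'tolerance' -- these are the suspects for algorithm or protocol errors.
std::vector<int> find_outlier_processes(const std::vector<ChainDistribution>& per_process,
                                        double tolerance) {
    std::vector<int> outliers;
    size_t n = per_process.size();
    for (size_t i = 0; i < n; ++i) {
        double avg = 0.0;
        for (size_t j = 0; j < n; ++j)
            if (i != j) avg += distribution_distance(per_process[i], per_process[j]);
        avg /= (n > 1 ? n - 1 : 1);
        if (avg > tolerance) outliers.push_back(static_cast<int>(i));
    }
    return outliers;
}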

In addition to the chain distributions of the whole run, comparing the chain distributions within a certain time period can be more effective for detecting bugs in some cases. For example, to diagnose a deadlock bug, the chain distributions of all processes in the last phase of execution are especially useful. In a deadlock situation, processes usually stop making progress and also stop communicating, so the distorted chain distribution only shows up in the last phase and can potentially be overshadowed by the chain distributions of earlier phases.

In cases where processes in a parallel application are designed to perform different tasks (e.g., in a master-worker model), the chain distributions for some processes (e.g., rank 0) will be very different from others even in correct runs. In these cases, CD-invariants are only valid for groups of processes which perform similar or identical tasks. DMTracker can be configured to only analyze chain distributions of a specific subset of processes rather than all processes in the whole parallel program.

4.4 Design of DMTracker

DMTracker consists of two major components, the online tracking component and the offline analysis component, as shown in Figure 4.4. The online tracking component (Section 4.4.1) collects data movement (DM)-traces at runtime by leveraging lightweight binary instrumentation. Based on the collected DM-traces, the offline analysis component forms DM-chains (Section 4.4.2), extracts FC-invariants and CD-invariants (Section 4.4.3), and detects anomalies that violate the extracted invariants (Sections 4.4.4 and 4.4.5).

Figure 4.4: DMTracker Design Overview

4.4.1 Lightweight Data Movement Tracking

The online data movement tracking component records data-movement-related information into traces, called DM-traces, by instrumenting the binary code of the parallel programs.

A dynamic instrumentation tool called Pin [150] is used in our current implementation.

Unlike traditional bug detection tools such as Purify [21] that track every memory access, DMTracker only instruments function calls related to memory management (e.g., allocation/deallocation) and data movements (e.g., memory copy and network operations).

Therefore, it incurs very low overhead. Furthermore, by directly instrumenting binary code, DMTracker requires no recompilation of the source code, which avoids some inconvenience in usage.

To capture data movement semantics, DMTracker records the following information for instrumented function calls: (1) key arguments and return values, (2) call sites that contain the stack context when the call is made, (3) thread IDs for multi-threaded processes, and (4) local timestamps when the call is made. More specifically, for memory allocation calls such as malloc, calloc, etc., DMTracker records both the request size and the memory object address; for memory copy calls, it records the source and destination addresses and the copy length; for network operations, it records the buffer, length, and the information to identify the network endpoint. The call sites are useful in analyzing traces and providing more diagnosis information to programmers. The timestamp is used to order local operations. Note that for programs using a customized memory management module, additional instrumentation needs to be done for the customized interfaces.
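For concreteness, a trace record along the lines described above might look like the following C++ struct. The field names are illustrative only; DMTracker's actual trace format is not specified here.

#include <cstdint>
#include <string>
#include <vector>

struct TraceRecord {
    enum class Kind { Alloc, Free, MemCopy, NetSend, NetRecv };
    Kind      kind;
    uint64_t  timestamp;                 // local timestamp when the call was made
    int       thread_id;                 // for multi-threaded processes
    std::vector<std::string> callsite;   // stack context at the call site
    // Key arguments, interpreted according to 'kind':
    uintptr_t src;                       // source address (MemCopy/NetSend) or object address (Alloc/Free)
    uintptr_t dst;                       // destination address (MemCopy/NetRecv)
    size_t    length;                    // request size, copy length, or message length
    int       endpoint;                  // identifies the network connection (NetSend/NetRecv)
};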

The DM-trace grows moderately since DMTracker only records data movement related functions. Its growth rate largely depends on the communication patterns of the parallel application. In our experiments, 20MB of disk space can typically store traces for several minutes of parallel benchmarks such as HPL and NPB, so storing the whole traces on local disks is usually not a problem. In addition, we can leverage existing techniques such as trace compression [151] and streaming processing [152] to reduce the storage overhead if it becomes a concern.

4.4.2 Preprocessing: DM-Chain Formation

Based on the collected DM-traces, DMTracker forms DM-chains by parsing data movement operations, DM-operations, and linking related data movements. For scalability, DMTracker processes individual traces and forms DM-chains in parallel instead of processing all the traces in a central node.

Parsing DM-Operations. DMTracker parses the information of each function call recorded in the DM-traces and correlates each DM-operation to its source and destination buffers' allocation information. For example, a memory copy A → B correlates to the allocation information of A (e.g., malloc at PCa) and B (e.g., memalign at PCb). This correlation information provides contexts for linking data movements and grouping chains, which are discussed in detail later.

In parallel programs, some data movements such as data pack/unpack may contain multiple memory copies. In many cases, data in multiple small non-contiguous buffers are packed into a larger contiguous buffer, or data in a large buffer are unpacked into multiple smaller non-contiguous buffers. Since a memory copy usually requires the source and destination to be contiguous, these operations require multiple memory copies. DMTracker aggregates the multiple memory copies of a pack or unpack operation and parses them as a single DM-operation to better reflect the semantics of the data movement.

For network operations, in this step, DMTracker can only correlate partial allocation information, either the source buffer (for send) or the destination buffer (for receive), to the data movement within each process. In order to link data movements across the network, DMTracker keeps the connection end-point information temporarily.

Movement Concatenation. Based on the parsed DM-operations, DMTracker forms chains by linking related data movements where the destination buffer of a previous DM-operation is the source buffer of a subsequent DM-operation. To form a complete chain, DMTracker performs two steps, intra-process linking and inter-process linking. The intra-process linking links DM-operations within one process. For example, it links memory copies with related memory copies or network operations. To efficiently match the source buffer of a DM-operation with the destination buffer of some previous DM-operation, DMTracker maintains an active chain table for existing DM-chains that can potentially have successive DM-operations. If a match is found, DMTracker extends the matched DM-chain and updates the table correspondingly. Otherwise, it inserts the DM-operation as a new DM-chain into the table.
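The active chain table can be sketched as follows. This is a simplified illustration that assumes exact address matching and ignores buffer offsets and address reuse; the class and method names are hypothetical.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct DMOp { uintptr_t src, dst; size_t len; };   // simplified DM-operation
using DMChain = std::vector<DMOp>;

// Sketch of intra-process linking: a table of "active" chains indexed by the
// destination address of their last operation. A new operation whose source
// matches an active chain's last destination extends that chain; otherwise it
// starts a new chain.
class ActiveChainTable {
public:
    void add(const DMOp& op) {
        auto it = table_.find(op.src);
        if (it != table_.end()) {
            DMChain chain = std::move(it->second);   // extend the matched chain
            table_.erase(it);
            chain.push_back(op);
            table_[op.dst] = std::move(chain);
        } else {
            table_[op.dst] = DMChain{op};            // start a new chain
        }
    }
    std::vector<DMChain> chains() const {
        std::vector<DMChain> out;
        for (const auto& kv : table_) out.push_back(kv.second);
        return out;
    }
private:
    std::unordered_map<uintptr_t, DMChain> table_;   // last dst addr -> chain
};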

The inter-process linking links the chains across different processes by matching send operations with receive operations. To achieve this, DMTracker maintains the information of network connections (e.g., file descriptors for sockets, queue pairs (QPs) for InfiniBand [153]). For FIFO communication channels (e.g., TCP and InfiniBand RC), send and receive can be matched simply by order. For channels which do not guarantee FIFO, further instrumentation needs to be done to track sequence numbers with the packets. When a send matches with a receive, the chain starting with the receive is linked to the end of the chain ending with the send. This way, the DM-chain that reflects the whole process of data movement across multiple processes can be formed.

4.4.3 Invariants Generation

To generate FC-invariants and CD-invariants, DMTracker first groups chains of the same type together based on the information of the data movements in each chain, since chain groups provide better data movement semantics. More precisely, DMTracker regards two chains as the same type and puts them into the same group if the corresponding individual data movements within the two chains have the same call sites for the DM-operations and the same allocation call sites for their source and destination buffers. For example, the chain A_1 → A_2 → A_3 and the chain B_1 → B_2 → B_3 are of the same type and belong to the same chain group if the data movement A_i → A_{i+1} has the same call site as B_i → B_{i+1} and A_j has the same allocation call site as B_j, where i = 1, 2 and j = 1, 2, 3. We use call site information for grouping chains for two reasons: (1) data movements at different program locations usually handle different cases, capturing program semantics in communication, and (2) the same memory allocation site usually allocates the same type of data, capturing characteristics of memory buffers.

Generating FC-invariants: Frequent Chain Pattern Selection

DMTracker extracts FC-invariants from the above-formed chain groups based on two criteria. First, the chains of a group must occur frequently, i.e., the number of chains in the chain group should be relatively large. A naive way is to use an absolute number as the threshold and require every FC-invariant's chain group to contain more chains than that threshold. However, this is difficult in practice since a suitable threshold is highly dependent on the total number of chains in the trace. Therefore, DMTracker uses a relative number, a percentage of the total number of chains, as the threshold.

The second criterion is that the chain type of a chain group must have unique characteristics, so that we can easily determine whether and how well another chain matches the FC-invariant. If a chain type contains only one data movement, there is no point in selecting it as an invariant and comparing other chains with it, because other chains would be either 0% or 100% matches; they cannot be the violations we are interested in, which are mostly similar to, but slightly different from, the invariant chain type. The uniqueness of a chain type is determined by the length of the chain as well as the uniqueness of each individual data movement that forms the chain. Therefore, we define the uniqueness value of a chain type C, Uniqueness(C), as the sum of the uniqueness values of its DM-operations (O_1, O_2, ...):

\[ \mathrm{Uniqueness}(C) = \sum_{i} \mathrm{Uniqueness}(O_i) \tag{4.1} \]

Uniqueness values of different types of data movements should be assigned in different ways since they reflect different semantics. We normalize a single memory copy to 1 and define the uniqueness of a DM-operation O_i, Uniqueness(O_i), as follows:

\[ \mathrm{Uniqueness}(O_i) = \begin{cases} 1, & \text{if } O_i \text{ is a single memory copy;} \\ M, & \text{if } O_i \text{ is a network operation;} \\ \min(N, n), & \text{if } O_i \text{ is a data pack/unpack.} \end{cases} \tag{4.2} \]

In Equation 4.2, n represents the number of segments of data to pack/unpack, and M and N are tunable parameters. Since a network operation involves two processes, we use M = 2 in our experiments. The uniqueness of data pack/unpack is designed this way because: (a) it should reflect the number of segments the operation has, and (b) it should not overwhelm other movements. We use N = 10 in our experiments.

The thresholds used to select FC-invariants should partially depend on the target program. In our experiments, we set the uniqueness threshold to 5, and require the number of chains in a chain group selected as an FC-invariant to account for more than 10% of the total number of chains in the chain groups whose uniqueness value is above the threshold.
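A minimal sketch of this FC-invariant selection is shown below, assuming the parameter values used in our experiments (M = 2, N = 10, uniqueness threshold 5, frequency threshold 10%); the data structures are simplified illustrations rather than the prototype's actual code.

#include <algorithm>
#include <vector>

enum class OpKind { MemCopy, Network, PackUnpack };
struct OpInfo { OpKind kind; int segments; };             // segments used for pack/unpack
struct ChainGroup { std::vector<OpInfo> type; long count; };

// Uniqueness of one DM-operation (Equation 4.2) with M = 2 and N = 10.
double uniqueness(const OpInfo& op) {
    switch (op.kind) {
        case OpKind::MemCopy:    return 1.0;
        case OpKind::Network:    return 2.0;                       // M
        case OpKind::PackUnpack: return std::min(10, op.segments); // min(N, n)
    }
    return 0.0;
}

// Uniqueness of a chain type is the sum over its operations (Equation 4.1).
double uniqueness(const ChainGroup& g) {
    double u = 0.0;
    for (const auto& op : g.type) u += uniqueness(op);
    return u;
}

// Select FC-invariants: uniqueness above the threshold (5) and frequency above
// 10% of the chains in sufficiently unique groups.
std::vector<ChainGroup> selectFCInvariants(const std::vector<ChainGroup>& groups) {
    long total = 0;
    for (const auto& g : groups)
        if (uniqueness(g) >= 5.0) total += g.count;
    std::vector<ChainGroup> invariants;
    for (const auto& g : groups)
        if (uniqueness(g) >= 5.0 && total > 0 && g.count > 0.10 * total)
            invariants.push_back(g);
    return invariants;
}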

Generating CD-invariants: Chain Distribution Clustering

DMTracker extracts CD-invariants, i.e., clusters of chain distributions that the chain distribution of an individual process should fit in, based on the chain distributions. We define the chain distribution in a trace T for a process as a vector, CD(T). Each element in the vector represents the percentage of all the chains that one chain group accounts for:

\[ CD(T) = \left\langle \frac{\mathrm{Count}(C_1, T)}{\mathrm{Count}(T)}, \frac{\mathrm{Count}(C_2, T)}{\mathrm{Count}(T)}, \ldots, \frac{\mathrm{Count}(C_m, T)}{\mathrm{Count}(T)} \right\rangle \tag{4.3} \]

where m is the total number of distinct chain groups for all the processes, Count(C_i, T) represents the number of chains in the i-th chain group C_i in the trace T, and Count(T) represents the total number of chains that occur in T. Even though in many cases a chain goes across multiple processes, to analyze the distribution, we consider a chain to belong to the process in which the chain starts. In this way, we do not count a chain multiple times and still have a symmetric measurement.

To cluster chain distributions, DMTracker uses the Manhattan distance [154] to the k-th nearest neighbor [155] as our metric, which is also used in previous work [35]. The Manhattan distance between two traces T_i and T_j, Distance(T_i, T_j), is the sum of the absolute differences of each element in the chain distribution vector, as defined in Equation 4.4. Distance^k is the distance to the k-th nearest neighbor (Equation 4.5), which reflects how well a chain distribution fits into a cluster with multiple peers. A lower value of Distance^k means the chain distribution fits better into a cluster. Thus, DMTracker uses it to measure how similar a chain distribution in one process is to its peers'. DMTracker uses a certain range of Distance^k(T_i) as the CD-invariant. Similar to [35], in our experiments the value of the parameter k is set to n/4, where n is the total number of processes.

\[ \mathrm{Distance}(T_i, T_j) = \left| \frac{\mathrm{Count}(C_1, T_i)}{\mathrm{Count}(T_i)} - \frac{\mathrm{Count}(C_1, T_j)}{\mathrm{Count}(T_j)} \right| + \ldots + \left| \frac{\mathrm{Count}(C_m, T_i)}{\mathrm{Count}(T_i)} - \frac{\mathrm{Count}(C_m, T_j)}{\mathrm{Count}(T_j)} \right| \tag{4.4} \]

\[ \mathrm{Distance}^k(T_i) = \mathrm{Distance}(T_i, T_{j_k}), \quad \text{where } \mathrm{Distance}(T_i, T_{j_l}) < \mathrm{Distance}(T_i, T_{j_{l+1}}), \; i \neq j_l, \; 1 \le l \le n-1 \tag{4.5} \]

4.4.4 FC-invariants Based Anomaly Detection

Based on the extracted FC-invariants, DMTracker can detect abnormal chain groups potentially caused by software bugs, and validate the reported bug by checking each chain instance in the abnormal groups with its context.

Abnormal Chain Group Detection

DMTracker detects abnormal chain groups by comparing each chain group with the extracted FC-invariants. It considers a chain group C an abnormal case of an FC-invariant P if they are similar enough and C is relatively rare compared to P. On one hand, C needs to be similar to P so that we can determine it to be an abnormal case of P. The more similar they are, the more likely that C is an abnormal case of P, unless C is identical to P. For example, if C matches 90% with P, then we can consider the remaining 10% unmatched part to be abnormal. But if C only matches 10% with P, it is likely to be just an infrequent type of chain rather than an abnormal case of P.

To measure the similarity between a chain group C and an FC-invariant P, DMTracker uses a metric derived from the Jaccard coefficient [156], a widely used similarity metric, to capture the matched and unmatched parts between C and P. Equation 4.6 shows the formal definition of similarity. To find the largest matching part (C ∩ P), we symbolize the DM-operations in the chains and convert the problem into a longest common substring problem, which can be solved in O(m + n) time with the help of a generalized suffix tree [157], where m and n are the lengths of the two strings.

\[ \mathrm{Similarity}(C, P) = \frac{\mathrm{Uniqueness}(C \cap P)}{\mathrm{Uniqueness}(C \cup P)} = \frac{\mathrm{Uniqueness}(C \cap P)}{\mathrm{Uniqueness}(C) + \mathrm{Uniqueness}(P) - \mathrm{Uniqueness}(C \cap P)} \tag{4.6} \]

On the other hand, C needs to be relatively rare compared to P. Otherwise, if C is almost as frequent as, or even more frequent than, P, it is very likely to be another pattern instead of an abnormal case of P. To measure the rareness, DMTracker compares the frequency of C, Frequency(C), with the frequency of P, Frequency(P). There is no point in comparing C with P if Frequency(C) is larger than Frequency(P). We define the rareness, Rareness(C, P), as below:

\[ \mathrm{Rareness}(C, P) = \frac{\mathrm{Frequency}(P) - \mathrm{Frequency}(C)}{\mathrm{Frequency}(P)}, \quad \text{where } \mathrm{Frequency}(P) > \mathrm{Frequency}(C) \tag{4.7} \]

We are looking for abnormal chain groups with high similarity, meaning the chain group is very similar to an FC-invariant, and with high rareness, meaning the chain group is relatively rare compared to that FC-invariant. We use the harmonic mean to combine similarity and rareness because it is a commonly used combination that rewards high scores in both dimensions. Thus, the overall metric of abnormality is defined as follows:

\[ \mathrm{Abnormality}(C, P) = \frac{2}{\dfrac{1}{\mathrm{Similarity}(C, P)} + \dfrac{1}{\mathrm{Rareness}(C, P)}} \tag{4.8} \]

For each pattern P_i, DMTracker detects a list of chain groups, denoted C_{i1}, C_{i2}, ..., with Abnormality(C_{ik}, P_i) > Threshold_Abnormality, C_{ik} ≠ P_i, and Frequency(C_{ik}) < Frequency(P_i). We call each pair (C_{ik}, P_i) a violation. DMTracker then combines the lists of violations for all patterns and ranks them according to the abnormality score. In our experiments, DMTracker reports all violations with an abnormality score higher than 0.7.

Chain Instance Checking

Since not all DM-chains in the abnormal chain groups are necessarily caused by bugs, DMTracker needs to check each chain instance in the context of the chain trace to validate it and to provide more detailed diagnosis information to the programmer.

In this step, DMTracker goes through each chain instance in the abnormal chain groups and examines the DM-operations that happened immediately before/after the chain instance to see whether they match the previously unmatched part between the abnormal chain and its matching FC-invariant. For instance, assume that a chain group C with three DM-operations (X, Y, Z) matches the third to fifth DM-operations of an FC-invariant P (U, V, X, Y, Z), and c is a chain instance of C. If some DM-operations closely before c in the chain trace match the DM-operations U and/or V in P, it strongly suggests a broken chain. DMTracker will highlight it by marking it as a "context match."

Furthermore, DMTracker includes the detailed information of each abnormal instance in the report, and ranks the anomalies with a "context match" and a high abnormality score at the top. The reported information includes the buffer addresses, buffer allocation call sites, data movement call sites, etc., of both the abnormal chains and their contexts (i.e., the data movement chains that happened before/after them). This information is likely to be helpful for diagnosing the problem.

4.4.5 CD-invariants Based Anomaly Detection

Based on the CD-invariants, reflected as a certain range of Distance^k scores, it is relatively straightforward to detect outliers. Different from [155], where the number of outliers is given by users, DMTracker uses another criterion to find all traces that are not similar enough to their peers. If a trace has a significantly larger value of Distance^k than the average, e.g., K times larger, DMTracker reports it as an abnormal trace. In our experiments, we use 5 as the value of the parameter K to reduce false positives. The chain group that contributes the most to Distance(T_i, T_{j_k}) in Equation 4.4 is also included as diagnosis information in the bug report.
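A minimal sketch of this outlier criterion, assuming the Distance^k score of every process has already been computed as in the earlier sketch; K = 5 as in our experiments.

#include <cstddef>
#include <vector>

// Report processes whose Distance^k score is more than K times the average
// over all processes. 'scores[i]' is the Distance^k value of process i.
std::vector<size_t> cdOutliers(const std::vector<double>& scores, double K = 5.0) {
    double avg = 0.0;
    for (double s : scores) avg += s;
    avg /= scores.size();
    std::vector<size_t> outliers;
    for (size_t i = 0; i < scores.size(); ++i)
        if (scores[i] > K * avg) outliers.push_back(i);
    return outliers;
}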

In some cases, the most interesting part of the chain distribution concerns the last phase of execution. For instance, if a parallel program deadlocks and we want to find out which process causes the problem, the chains that happen much earlier in the run may dilute the difference. To deal with this case, we can use a binary search method: when no outliers have been found but users want to do in-depth diagnosis, DMTracker can perform multiple rounds of CD-invariant based detection with the chains in a halved time range.

4.4.6 Issues and Discussions

Additional Inter-process Communication Channels. In communication libraries, in addition to common network send and receive, there are other approaches to performing inter-process communication, such as through a shared memory region on the same host, or using Remote Direct Memory Access (RDMA) to directly access memory on a remote host.

For data movements through shared memory, DMTracker tracks them in the same way as data movements through non-shared memory. However, our current prototype does not yet track shared memory region construction, such as memory mapping to a common file. Therefore, DMTracker does not link chains across processes if they communicate through shared memory regions. This may affect accuracy in some cases, but it is not a major problem in our experiments because the separate chains are still analyzed in each process. We plan to track the construction of shared memory regions in future work.

RDMA, an advanced feature provided by modern networks such as InfiniBand [153] and Quadrics [158], allows one-sided communication. Our current prototype does not link chains through RDMA channels because only the process on the active side has a DM-operation. To address this issue, we would need to modify device drivers or firmware of network interface cards to expose RDMA operations to the user-level process on the passive side.

Figure 4.5: Case 1: Broken Chain
Figure 4.6: Case 2: Distance^k of Processes

Online Analysis and On-the-fly Detection. With more computing power provided by multi-core systems, DMTracker could process and analyze traces using dedicated cores in each node. Since the processing cores can directly access the traces in memory instead of going through expensive file I/O, this can achieve high performance. In addition, the storage overhead can be alleviated because only much smaller intermediate results are needed for further analysis after preliminary local processing. This will enable us to extend DMTracker to perform on-the-fly detection in future work.

4.5 Evaluation and Case Studies

The experiments described in this section were conducted on a 64-processor cluster with 32 nodes. Each node has two 3.6GHz CPUs with 2MB L2 cache and 2GB of memory. The nodes are connected using InfiniBand PCI-Express DDR adapters with 10Gbps peak unidirectional bandwidth. The operating system is Linux with kernel version 2.6.17.7.

To evaluate DMTracker's functionality, we use MVAPICH/MVAPICH2 [6], a popular high-performance open-source MPI library over InfiniBand, with two real-world bug cases: a data corruption bug causing incorrect results and a deadlock bug causing the program to hang. Both the MVAPICH and MVAPICH2 packages are large: each has more than 350,000 lines of C code in more than 1,500 source files.

To evaluate the runtime overhead incurred by the online tracking component of DMTracker, we compare the performance of the High Performance Linpack (HPL) benchmark and the NAS Parallel Benchmarks (NPB) with and without DMTracker.

4.5.1 Case 1: Data Corruption in Communication

The data corruption bug in MVAPICH2 version 0.9.8 was triggered deterministically by running a communication library for linear algebra called BLACS (Basic Linear Algebra Communication Subprograms) [159]. The test program for the BLACS package, called xCbtest, reports an "Invalid element" error in all 64 processes after executing the BSBR (broadcast/send and broadcast/recv) test cases. The data corruption happens silently (i.e., no hang or system failure) and only shows up at the last stage of result verification.

When applied to this scenario, DMTracker reports six abnormal chain groups that violate two of the five extracted FC-invariants. Of the six reported abnormal cases, the two top-ranked anomalies indicate the real bug. In both cases, an FC-invariant (occurring 15,075 times) is violated by rarely-occurring similar chains (154 times). Figure 4.5 shows the frequently-occurring chain (the FC-invariant) on the left side and the rare case (a broken chain caused by the data corruption bug) on the right side. The chain instance context checking confirms that all 154 chain instances are caused by the data corruption bug.

DMTracker not only detects this bug, but also provides useful diagnostic information to programmers for quickly locating it. With the detailed information about the abnormal chain groups, the root cause of the problem, namely that an optimization called header caching for RDMA operations fails to handle a corner case in the communication protocol, can be easily identified. This demonstrates that statistical-rule-based tools like DMTracker can be helpful in detecting rarely-manifesting bugs and locating their root causes, which would otherwise require much more human effort.

Furthermore, this case study shows that DMTracker can also detect software bugs that manifest themselves across all the processes in a similar way. This is because DMTracker exploits temporal similarity within each process as well as spatial similarity across different processes. This data corruption bug has been triggered in all 64 processes in a similar way, indicating that it would be extremely hard, if not impossible, for previous work [35] to detect and diagnose it by only exploring spatial similarity among different processes.

4.5.2 Case 2: Deadlock in Connection Setup

The deadlock bug was triggered by running the FT benchmark when testing an internal version of MVAPICH on 64 processes. The program hung non-deterministically, with a very small probability, during execution.

When applied, DMTracker detects an anomaly that violates the extracted CD-invariant. As shown in Figure 4.6, process 43 has a very high Distance^k score, 1.61, which is significantly higher than the average score of 0.23. This strongly indicates that the chain distribution in process 43 is an outlier of any cluster of chain distributions. DMTracker detects this non-deterministic bug by exploiting the spatial similarity across different processes.

DMTracker not only detects this bug, but also reports useful diagnostic information to programmers for identifying the root cause. It reports that the chain group with a network send operation in the function cm_post_ud_packet() of the file cm.c is the major contributor to the high Distance^k score. This information quickly narrows down the root cause to a specific function. In that function, an access to a variable that was forgotten to be declared volatile turns an intended benign data race into a harmful one with a small probability.

Although the time spent in the function cm_post_ud_packet() in buggy runs is much longer than in normal runs, the bug cannot easily be detected by the function time distribution method [35] due to noise caused by other functions. To achieve better performance, most communication libraries built on current mainstream high-performance networks, including MVAPICH/MVAPICH2, use a polling-based progress mechanism. In this mechanism, when waiting for an event, the process busy-waits in multiple polling functions. The difference in time spent in abnormal functions is easily overshadowed by the time spent in these polling functions. In our experiments, we measured that the total time spent in the abnormal function in the outlier process is less than 0.05% of the total time in progress functions, which does not show up as a visible difference in the function time distribution. To filter out the effect of polling functions, non-trivial work would be required for each library to determine which functions should count and which should not.

4.5.3 Runtime Overhead

In this set of experiments, we evaluated the performance impact of DMTracker's online tracking component with the HPL benchmark at different problem sizes and the NPB benchmarks with class C. We ran these benchmarks on MVAPICH2 version 0.9.8, natively and with tracking, using 32 nodes with 2 processes per node (32x2).

Table 4.1 lists the relative performance degradation for HPL when applying DMTracker as compared to native execution. In general, the runtime overhead is very low, from 2.3% to 3.1%. The overhead is low because DMTracker only tracks a small set of functions related to data movements. In addition, we observe that the overhead decreases as the problem size of the HPL benchmark increases. That is because when the problem size becomes larger, the data movements among processes tend to be in larger chunks, resulting in less frequent data movements and thus lower overhead. Therefore, for long-running applications operating on large data sets, the overhead incurred by DMTracker is expected to be low.

Table 4.1: Runtime Overhead for HPL

Problem Size            40000   50000   60000
Relative Degradation     3.1%    2.7%    2.3%

We also observe that applications with different communication characteristics show different runtime overheads when being tracked by DMTracker. As shown in Table 4.2, the overhead varies considerably across the NPB benchmarks. The benchmarks with very frequent communication, such as CG, IS, and LU, show slightly higher overhead (3.9%-6.0%), while the others, BT, FT, MG, and SP, show almost negligible overhead (0.9%-1.6%). These results are expected since the overhead is caused by tracking data communications in the applications. As demonstrated by both HPL and NPB, DMTracker incurs very low runtime overhead, which indicates that DMTracker can be deployed in production runs.

Table 4.2: Runtime Overhead for NPB

Benchmarks              BT      CG      FT      IS      LU      MG      SP
Relative Degradation   1.0%    3.9%    0.9%    4.2%    6.0%    1.4%    1.6%

Currently, the trace processing and analysis components of DMTracker are still performed offline. Therefore, this section only discusses the runtime overhead incurred by the online tracking component. In the future, we plan to extend DMTracker to detect software bugs on-the-fly by leveraging stream processing algorithms and multi-core architectures.

4.5.4 Sensitivity Study and False Positives

We have also conducted a set of experiments to study the parameter sensitivity of DMTracker. As in most statistical-rule-based tools, parameter and threshold values can affect the balance between the effectiveness and the false positive rate of DMTracker. With lower thresholds, DMTracker reports more potential bugs (and also more false positives), while using higher thresholds reduces the number of false positives at the risk of missing the real bug (a false negative).

For FC-invariants, we tried several options for lowering the parameter thresholds in case 1: (a) the uniqueness requirement for a chain group to be an FC-invariant, from 5 to 3 (the minimum meaningful value); (b) the frequency threshold for a chain group to be an FC-invariant, from 10% to 5% (a very low value); and (c) the abnormality threshold for reporting a violation, from 0.7 to 0.5. The experimental results were sensitive to these parameters to some extent: up to 3 more statistical invariants and up to 15 more violations were reported, but the real bug cases were still ranked at the top and highlighted by the context match. Most of the false positives for FC-invariants that we encountered during the experiments with case 1, NPB, and HPL can be summarized as follows.

Infrequently used communication protocol. Since the flow control algorithm in the MPI library may choose different protocols according to network resource usage, in some cases a fall-back protocol may be used. The data movements in the infrequently used protocol then form a very small group of chains that are very similar to, but different from, an FC-invariant, which DMTracker reports as a violation. This type contributes more than half of the false positives. To prune them, we could use more sophisticated checking steps that incorporate semantic information.

Buffer reuse for control messages. In the MPI library, control messages are passed using the same protocol as small data messages, but the communication buffers for control messages are immediately reused instead of being copied to application buffers. Thus the data movements for control messages may result in a slightly different (usually longer) chain than the FC-invariant, which is considered a violation.

For CD-invariants, we tried different values of the parameter k, such as n/2, n/4, and n/8, where n is the number of processes. The result of case 2 is not affected by the different values. When applied to HPL and NPB, a few false alarms were given for LU and CG, indicating that they have abnormal processes with different data movement statistics. These false alarms can be pruned by checking the data movement statistics of previously known good runs. Note that to correct suboptimal parameter settings, users only need to rerun the analysis component offline, which is generally much cheaper than rerunning the large-scale parallel program.

4.6 Summary

In this chapter, we have discussed an innovative statistical-rule-based approach to automatically find hard-to-detect bugs that can cause severe problems such as data corruption and deadlocks in large-scale parallel programs. Our approach extracts data movement based program invariants at runtime and detects anomalies based on the extracted invariants. Based on this idea, we have built DMTracker to help programmers locate the root causes of software bugs. Our evaluation with two real-world bug cases in MVAPICH/MVAPICH2 shows that DMTracker is effective in detecting them and providing useful diagnosis information. In addition, DMTracker only incurs a very low runtime overhead (0.9%-6.0%), measured with the HPL benchmark and the NAS Parallel Benchmarks, so it is feasible to deploy it in production runs.

CHAPTER 5

FIRST-AID

This chapter discusses failure recovery and future error prevention for common memory bugs in system software during production runs. Memory bugs are a major category of common software defects in programs written in C/C++, and system software is mainly written in these languages for easier interfacing with OS kernels and hardware devices, so it is quite common for a system software program to encounter a failure due to a memory bug in production runs. When a failure happens, a fully automated runtime diagnosis and patching system is highly desirable for availability because it removes human involvement from the critical path of dealing with the manifested bug. For this purpose, we propose a lightweight tool called First-Aid, which allows system software to survive failures caused by common memory management bugs and prevents future errors caused by the same bugs during production runs.

This chapter is organized as follows. In Section 5.1, we give an overview of the motivation and our approach. In Section 5.2, we describe the working scenario of our tool, First-Aid. In Section 5.3, we present the design of First-Aid. We then describe the diagnosis procedure and the validation procedure in Section 5.4 and Section 5.5, respectively, followed by a brief discussion in Section 5.6. In Section 5.7, we discuss the evaluation results, and finally we provide a brief summary in Section 5.8.

5.1 Overview

Memory management bugs, such as buffer overflows and dangling pointers, are a major category of common software defects and severely affect system availability and security. During production runs, these types of bugs can corrupt memory data, leading to program crashes or hangs. Furthermore, malicious users often launch security attacks by exploiting these bugs. According to the US-CERT Vulnerability Notes Database [160], memory management bugs dominate recent security vulnerability reports.

Furthermore, it is a challenging task for developers to diagnose these bugs and release timely fixes because of ever-increasing software complexity and the lack of on-site failure information [109]. Previous studies [161, 162] have shown that it takes several weeks on average to diagnose bugs and generate patches. During this long time window, users have to either continue running the software with bugs and tolerate problems such as intermittent crashes and potential attacks, or stop running the software and experience costly system downtime. Neither option is desirable.

Therefore, it is critical to be able to quickly recover programs from software failures caused by memory bugs, protect programs from future failures due to the same bugs, and provide useful on-site failure information for developers to fix the bugs.

While several recent proposals help programs survive failures caused by memory bugs, they suffer from one or more of the following limitations: unsafe speculation on programmers' intentions [87, 163], inability to prevent subsequent failures due to the same bugs [88, 91], little diagnostic information for developers [90], and large time and space overheads [88, 89].

For example, failure-oblivious computing [87] discards out-of-bound writes and manufactures arbitrary values for out-of-bound reads. While this approach may survive failures for certain types of applications, its speculation on programmers' intentions could easily lead to program misbehavior. DieHard [88] and Exterminator [89] probabilistically prevent failures caused by memory bugs via a randomized memory runtime system. However, large time and space overheads prevent them from being adopted for production runs.

Our previous work Rx [91] can quickly recover programs from failures by re-executing a program from previous checkpoints and applying environmental changes to the program execution. An example of such an environmental change is adding padding to all memory objects during recovery to avoid buffer overflow bugs. While effectively and safely avoiding the memory bugs, Rx has to disable the environmental changes after surviving a failure due to their potentially large overhead. Therefore, Rx cannot prevent future failures caused by the same bug. Furthermore, the on-site failure recovery information provided by Rx (i.e., whether and what environmental changes can work) is quite limited and may mislead developers because failure-surviving environmental changes may not directly relate to the bugs that have occurred.

This work makes three contributions in order to address the limitations of previous work:

Contribution 1: A low-overhead method for surviving and preventing memory management bugs. We propose a system, called First-Aid, that can quickly recover programs from failures caused by memory bugs, prevent future failures due to the same bugs, and provide useful on-site diagnostic information to developers. The main idea is to diagnose the bug type and identify the memory objects that trigger the bug when a failure occurs. Based on the results of the diagnosis, First-Aid generates and applies runtime patches (environmental changes for bug-triggering memory objects) to a small set of memory objects that can potentially trigger the bug, during both program recovery and future program execution. As a result, First-Aid not only helps programs survive the current failure but also prevents future failures caused by the same bug. Furthermore, First-Aid validates the runtime patches to make sure their effects are consistent and generates a bug report to help developers fix the bug.

Contribution 2: Environmental change based failure diagnosis. We propose an online failure diagnosis method that leverages efficient checkpointing and re-execution mechanisms as well as execution environment changes. Specifically, when a bug causes a failure or an error at runtime, First-Aid rolls back the program to previous checkpoints and re-executes the program. During each re-execution, First-Aid dynamically applies two types of environmental changes: exposing changes, which force a certain type of bug to manifest itself, and preventive changes, which prevent a certain type of bug from manifesting itself. Based on bug manifestation after re-execution, First-Aid can conclude whether a certain type of bug is occurring. For example, to determine whether a buffer overflow bug is occurring, First-Aid applies the exposing change for buffer overflow, i.e., padding each memory object and filling the padding with canary values, and the preventive changes for all other memory bug types. The term canary refers to certain memory content patterns that are unlikely to appear during normal program execution. If the canary values in some padding are corrupted, First-Aid knows that the bug is likely to be a buffer overflow. Exposing changes also help First-Aid identify the bug-triggering memory objects. In the same example, based on the corrupted padding, First-Aid can identify the memory objects that are overflowed.

Contribution 3: Evaluation with real-world applications. We have implemented First-Aid on Linux and evaluated its functionality and performance with seven applications: three server applications (Apache, Squid, and CVS) and four desktop applications (Pine, Mutt, M4, and BC). These applications contain various types of memory bugs, including buffer overflow, dangling pointer read/write, double free, and uninitialized read. Additionally, we evaluate First-Aid's performance with the SPEC INT2000 benchmark [164] and four allocation-intensive benchmarks [165]. Our experimental results show that, compared to previous approaches, First-Aid has the following advantages:

• Fast diagnosis and failure recovery. First-Aid can quickly identify the bugs and recover programs from failures using its diagnosis algorithm. Our evaluation with the seven tested applications shows that the time for failure diagnosis and recovery ranges from 0.084 to 3.978 seconds with an average of 0.887 seconds.

• Prevention of bug reoccurrence. First-Aid applies the runtime patches after a program failure and thereby prevents future failures due to the reoccurrence of the same bug. Furthermore, First-Aid stores the generated patches persistently to prevent the bug from occurring in subsequent runs or in other processes running the same program. This improves the overall reliability of the system.

• Low normal-run overhead. First-Aid incurs low runtime overhead during normal program execution, i.e., without bugs being triggered. Our evaluation with applications and benchmarks shows that the runtime overhead ranges from 0.4% to 11.6% with an average of 3.7%.

• Informative bug reports. First-Aid provides programmers with accurate information about the bug: the bug type, the memory objects triggering the bug, their allocation or deallocation sites, and the relevant illegal memory accesses. Such diagnostic information helps programmers understand both the root cause and the manifestation of the bug.

5.2 First-Aid Approach and Working Scenario

Figure 5.1 shows the working scenario of First-Aid. During normal execution, First-Aid periodically takes checkpoints of the running program. Upon a failure, First-Aid diagnoses the bug and generates the corresponding runtime patches. Then it applies the patches to allow the program to recover from the failure and to prevent future failures due to the same bug. After recovery, First-Aid validates the patches and generates a detailed bug report.

Figure 5.1: Working scenario of First-Aid

Bug diagnosis. First-Aid diagnoses the bug by re-executing the failed process from previous checkpoints for multiple iterations. In each iteration, First-Aid rolls back the program to a previous checkpoint, tentatively applies one or more environmental changes to all or a subset of memory objects, and re-executes the program. Based on the results of execution, it narrows down the possible causes of the failure. If First-Aid can identify the bug type and the bug-triggering memory objects, the diagnostic process stops. Otherwise, First-Aid continues the above process until it identifies the bug or times out.

To accurately diagnose the bug, First-Aid uses a combination of two types of environmental changes: exposing changes for forcing a certain type of bug to manifest itself, and preventive changes for preventing a certain type of bug from manifesting. More specifically, to check the occurrence of a bug type b, First-Aid applies the exposing change for b and the preventive changes for all other bug types during re-execution. In this way, First-Aid ensures that only the bug type b can manifest itself. If the manifestation of bugs with type b is observed, First-Aid concludes that a bug of type b occurred before the failure. Otherwise, First-Aid rules out the bug type b. Furthermore, based on the bug manifestation, First-Aid can identify the memory objects that potentially trigger the bug.

Bug type                 Common reason(s)              Preventive change /       Exposing change                  Patch application
                                                       Runtime patch             (Bug manifestation)              point
buffer overflow          1. length underestimation     add padding to objects    pad objects with canary          allocation
                         2. offset miscalculation                                values (canary corruption)
dangling pointer read    1. premature buffer free      delay free                fill objects with canary         deallocation
                         2. forget to set NULL                                   values (failure)
dangling pointer write   (same as above)               delay free                fill objects with canary         deallocation
                                                                                 values (canary corruption)
double free              (same as above)               delay free                check parameters                 deallocation
                                                                                 (freed twice)
uninitialized read       1. assume zeros in buffers    fill objects with zeros   fill objects with canary         allocation
                                                                                 values (failure)

Table 5.1: Memory bug types and corresponding environmental changes

Table 5.1 describes both types of environmental changes for each bug type, including buffer overflow, double free, dangling pointer read/write, and uninitialized read. For example, adding padding to both ends of each newly-allocated memory object can prevent buffer overflow bugs, while adding canary-filled padding can manifest buffer overflow bugs as canary corruption. Delaying the recycling of freed memory objects can prevent dangling pointer read bugs from accessing meaningless data as well as prevent dangling pointer write bugs from corrupting useful data. On the contrary, filling delay-freed memory objects with canary values can manifest dangling pointer read bugs as failures and dangling pointer write bugs as canary corruption. Furthermore, zero-filling new objects can prevent uninitialized read bugs, while canary-filling new objects is likely to manifest uninitialized read bugs as failures.

At the end of the diagnostic process, there are three possible results. First, the failure can be caused by a non-deterministic bug (as indicated by a successful re-execution with only timing-based changes and no memory management changes). In this case, First-Aid lets the program continue its normal execution. Second, the failure can be caused by deterministic memory management bugs. In this case, First-Aid passes the diagnostic results to the next step for patch generation. The third case is deterministic bugs that First-Aid cannot handle. Examples of such bugs include non-memory bugs and some types of memory bugs (details in Section 5.6). In this case, First-Aid times out and resorts to other recovery schemes.

Patch generation and application. Based on the bug diagnostic information, First-Aid generates runtime patches for recovering the program from the failure and preventing future failures caused by the same bug. A runtime patch is a pair of a preventive change corresponding to the identified bug type and a patch application point. As shown in Table 5.1, the preventive change for buffer overflows is to add padding to both ends of bug-triggering objects when these objects are allocated; the preventive change for dangling pointers and double frees is to delay recycling of deallocated bug-triggering objects for a long time, until the memory occupied by these objects reaches a customizable threshold; the preventive change for uninitialized reads is to fill the contents of bug-triggering objects with all 0's.

A patch application point is the allocation or deallocation call-site of the bug-triggering memory objects. In this work, a call-site is defined as the return addresses of the most recent three functions on the stack. It can serve as the "signature" of the bug-triggering memory objects because memory objects allocated or deallocated from the same call-site often have similar characteristics, such as being overflowed [32, 89].
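For illustration, on Linux/glibc such a three-level call-site signature could be captured with backtrace(); the sketch below shows the idea only and is not First-Aid's actual stack-walking code.

#include <array>
#include <cstdint>
#include <execinfo.h>   // backtrace(), glibc

// Capture a call-site signature: the return addresses of the three most
// recent calling functions, skipping this helper and the allocator wrapper
// itself. How many frames to skip depends on how the wrapper is structured.
std::array<uintptr_t, 3> callSiteSignature() {
    void* frames[8] = {};
    int n = backtrace(frames, 8);
    std::array<uintptr_t, 3> sig = {0, 0, 0};
    // frames[0] is this function, frames[1] the wrapper; take the next three.
    for (int i = 2, k = 0; i < n && k < 3; ++i, ++k)
        sig[k] = reinterpret_cast<uintptr_t>(frames[i]);
    return sig;
}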

First-Aid stores the generated patches and applies them to the running process. More specifically, a patch takes effect when a later memory allocation or deallocation call-site matches the patch application point. By applying the identified preventive change at these points, First-Aid protects the program from future failures caused by the same bug. Furthermore, since the patches are specific to the program (not only the running process), First-Aid applies them to subsequent runs of the same program and to other processes running the same executable. This effectively protects the programs from future failures caused by the same bug and improves the overall system reliability.

Some of the runtime patches generated by First-Aid increase the application's memory usage. Filling new objects with 0's incurs no space overhead. The space overhead for added padding depends on the size of the padding and the number of padded memory objects. Delaying frees increases memory usage by accumulating delay-freed objects. For our evaluated applications, First-Aid's runtime patches incur a small memory space overhead (Section 5.7.6). This is because the patches are only applied to a small number of memory objects whose allocation or deallocation sites match the call-sites specified in the patches.

Although the situation never arose during our experiments, the extra memory usage incurred by First-Aid's runtime patches could theoretically exhaust the memory space of a process. To address this issue, First-Aid can disable runtime patching and even start deallocating the oldest delay-freed objects when the memory usage reaches a user-defined threshold. Although deallocating very old delay-freed objects is usually safe for dangling pointer and double free bugs, it can potentially undermine patch effectiveness: the program may fail again. First-Aid allows users to decide how much extra memory space they are willing to pay for better system reliability.

Patch validation and bug reporting. After a runtime patch is generated and applied for a quick recovery, First-Aid performs a further step to validate the patch and collect detailed information for the bug report. This step can be done in parallel on a different processor core based on a snapshot of the program so that it does not delay the failure recovery.

To validate that the applied patch is consistently effective, First-Aid re-executes the buggy program region in multiple iterations with randomized memory allocation, and collects detailed traces of memory management operations, patch triggering, and illegal memory accesses, e.g., accesses through a dangling pointer, reads before initialization, etc. Then it checks the traces to validate that the patch has consistent effects on the program execution, e.g., neutralizing the same number of illegal accesses. If the validation fails, First-Aid removes the corresponding patch and logs the event of the validation failure.

Figure 5.2: Software architecture of First-Aid

Furthermore, First-Aid assists developers in diagnosing and fixing the bugs by providing detailed on-site bug information. For example, the call-sites of bug-triggering memory objects in the diagnosis log and the patch information can help developers locate the memory management code related to the bug. Additionally, the execution traces of memory management operations and illegal accesses can help developers understand the process of bug manifestation.

5.3 First-Aid System Design

Figure 5.2 shows the software architecture of First-Aid. It consists of six main user-level and kernel-level components: (1) a lightweight memory allocator extension for assisting bug diagnosis, patch validation, and patch application, (2) error monitor(s) for detecting errors or failures in applications, (3) a checkpoint/rollback component for taking snapshots of running programs and performing rollbacks for diagnosis and recovery, (4) a diagnostic engine for diagnosing the bugs and generating runtime patches (details in Section 5.4), (5) a patch management module for managing the generated runtime patches and controlling patch application, and (6) a validation engine for validating the consistent effects of the patches and collecting on-site diagnostic information (details in Section 5.5).

Memory allocator extension. The memory allocator extension collects memory object information and applies preventive and/or exposing changes for diagnosing bugs and surviving failures. It operates in one of three modes: normal mode, diagnostic mode, or validation mode.

In normal mode, the memory allocator extension checks whether the current call-site of a memory allocation or deallocation request matches any patch application point in the available patches. If so, it applies the preventive change in the patch to the memory object. Note that the memory allocator extension relies on the underlying memory allocator to fulfill memory management requests.
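A sketch of how the normal-mode allocation path could look is shown below; the Patch structure and lookupPatch() are hypothetical stand-ins for First-Aid's patch pool query, and the bookkeeping needed to free padded objects correctly is omitted.

#include <cstdlib>
#include <cstring>

// Hypothetical patch description; names are illustrative, not First-Aid's API.
struct Patch { size_t padBytes; bool zeroFill; };

// Stub for the patch-pool query keyed by the allocation call-site; the real
// pool would match the multi-level call-site against the stored patches.
static const Patch* lookupPatch(const void* const* /*callSite*/, int /*depth*/) {
    return nullptr;   // no patch matched in this stub
}

// Normal-mode allocation wrapper: if the current call-site matches a patch
// application point, apply the preventive change (padding and/or zero-fill);
// otherwise fall through to the underlying allocator unchanged.
void* firstaid_malloc(size_t size, const void* const* callSite, int depth) {
    const Patch* p = lookupPatch(callSite, depth);
    if (!p)
        return std::malloc(size);
    void* obj = std::malloc(size + 2 * p->padBytes);   // padding on both ends
    if (!obj) return nullptr;
    char* user = static_cast<char*>(obj) + p->padBytes;
    if (p->zeroFill)
        std::memset(user, 0, size);                    // preventive change for uninitialized read
    return user;   // bookkeeping so free() can undo the padding is omitted
}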

In diagnostic mode, the memory allocator extension performs three functions during re-execution. First, it applies preventive and/or exposing changes, as instructed by the diagnostic engine, to all or a subset of memory objects when they are allocated or deallocated. Second, it collects multi-level call-site information for each memory object allocation and deallocation. Such information is used for bug diagnosis and future patch application. Third, for each deallocation request, it checks whether the memory object has been freed previously, to detect double free bugs.

In validation mode, as controlled by the validation engine, the memory allocator extension introduces randomization in the memory allocation algorithm and keeps traces of memory allocation and deallocation as well as the patch triggering information.

Error monitor(s). The error monitors detect errors or failures at runtime and notify the diagnostic engine upon detection. The cheapest way to detect an error or failure is to catch assertion failures as well as exceptions (e.g., access violations) raised from the kernel. Additionally, one can deploy more sophisticated error detectors such as AccMon [28] if they incur low overhead. Our current implementation is based on assertion failures and exceptions.

Lightweight checkpoint/rollback. First-Aid leverages the lightweight checkpointing and re-execution runtime system provided in Rx [91]. More specifically, it takes in-memory checkpoints using a fork-like operation and rolls back the program by reinstating the saved task state. For handling files, it applies ideas similar to previous work [104, 166] by keeping a copy of each accessed file and its file pointer at the beginning of each checkpoint and reinstating them on rollback. Copy-on-write (COW) is also applied here to reduce the overhead. Additionally, First-Aid leverages a network proxy to record network messages during normal execution and replay them during re-execution. More details can be found in the work on Rx [91] and Flashback [104].

Instead of using fixed checkpointing intervals as in Rx, First-Aid dynamically adjusts the checkpointing interval to balance low normal-execution overhead against quick recovery time. It does so by monitoring the copy-on-write (COW) page rate, which directly affects the runtime overhead. If the runtime overhead is higher than a user-specified threshold T_overhead, i.e., the COW page rate is too high, First-Aid gradually increases the checkpointing interval to control the overhead. On the other hand, the recovery time becomes longer when the checkpoint interval is larger. Thus, once the checkpoint interval reaches the user-specified maximal interval T_checkpoint, First-Aid stops increasing it.
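A minimal sketch of this adjustment policy follows; the measured COW overhead, the 1.5x growth factor, and the function name are our own assumptions rather than First-Aid's exact implementation.

#include <algorithm>

// Sketch of the checkpoint-interval adjustment. 'cowOverhead' is the measured
// fraction of time spent on copy-on-write work in the last interval;
// 'maxOverhead' corresponds to T_overhead and 'maxInterval' to T_checkpoint.
double nextCheckpointInterval(double current, double cowOverhead,
                              double maxOverhead, double maxInterval) {
    if (cowOverhead > maxOverhead)
        return std::min(current * 1.5, maxInterval);   // back off to reduce overhead
    return current;                                     // otherwise keep the interval
}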

Patch management. This component manages the patches and makes them available to all the processes that are running the same program. Once the diagnostic engine generates a patch, the patch management component stores it in a central patch pool based on the call-site information. First-Aid maintains a patch pool for each program so that the patches do not mix for different programs. During normal execution, the memory allocator extension queries the patch pool at each allocation or deallocation request.

5.4 Bug Diagnosis

The diagnostic engine uses two phases to diagnose the bug. The first phase (Section 5.4.1) searches for the best checkpoint from which the patch should be applied to survive the bug. The second phase (Section 5.4.2) performs in-depth diagnosis to identify the bug type and the patch application points, i.e., the allocation or deallocation call-sites of the bug-triggering memory objects. In Section 5.4.3, we compare First-Aid's bug diagnosis with that of Rx.

5.4.1 Phase 1: Identify the checkpoint for patching

In order to be both effective and efficient, the patch should take effect from the latest checkpoint before the bug is triggered. To identify that checkpoint, First-Aid rolls back the program to previous checkpoints in reverse chronological order and re-executes the program. It first re-executes the program without any memory management changes. If the program succeeds, then the failure is likely caused by a non-deterministic bug, and First-Aid only logs this event and lets the program continue execution. If the program fails, First-Aid again re-executes the program from the checkpoint with all the preventive changes applied to all subsequently allocated or deallocated memory objects. If the program succeeds this time, i.e., some preventive change is effective, First-Aid stops searching and reports this checkpoint as the latest one before the bug-triggering point. Otherwise, First-Aid continues searching with the checkpoint right before this one. After trying a certain number of previous checkpoints, First-Aid times out and logs the bug as non-patchable. Note that the criterion for a failed or successful re-execution in First-Aid is whether the program execution can pass the original failure region. Its end point is conservatively defined as a certain number (3 in our experiments) of checkpoint intervals after the failure point.

One challenge in this scheme is the possible misidentification of the latest checkpoint before the bug-triggering point. In some cases, when being applied from a checkpoint that is after the bug-triggering point, the preventive changes can appear to be effective by temporarily avoiding the failure. This is because the manifestation of some memory bugs may rely on the heap layout, which can be disturbed when the preventive changes are applied later.

Figure 5.3: A potential misidentification of the checkpoint for patching and the heap marking technique

Figure 5.3 shows one such example. During the original execution, from (a) to (c), the dangling pointer p appears at (b), the bug-triggering point, when the object B is prematurely freed. After (b), a checkpoint C1 is taken, and then the freed space of B is re-allocated to another object E. At the end, a failure occurs because a write via de-referencing p corrupts the data in the object E, illustrated as the black dot in Figure 5.3(c). When the program re-executes from the checkpoint C1, the preventive changes avoid the failure. This is because the preventive change for buffer overflow adds padding to E, denoted as E_padded, making it larger than the original B. As a result, as shown in Figure 5.3(d), the freed space of B is not reused by E_padded, which avoids the memory corruption caused by the dangling pointer write.

To address this issue, we devise a technique called heap marking to verify that the bug indeed occurs after the checkpoint. The key idea is to expose bugs that occur before the checkpoint. To this end, First-Aid marks the old heap region before re-executing the program from the checkpoint. More specifically, as shown in Figure 5.3(e), First-Aid marks all the free chunks in the heap by filling their contents with canary values. Additionally, it adds padding filled with canary values after the last memory object in the heap. This heap marking technique exposes previously triggered dangling pointer write or buffer overflow bugs as canary corruption even when the failure would otherwise be accidentally avoided due to heap layout disturbance. For dangling pointer read bugs, the heap marking technique makes the failure still occur during re-execution due to the canary.

5.4.2 Phase 2: Identify the bug type and patch application points

After identifying the latest checkpoint before the bug-triggering point, First-Aid starts phase 2 for more in-depth analysis. It first identifies the types of the bugs and then identifies the allocation or deallocation call-sites of bug-triggering memory objects. Note that First-Aid takes into consideration the case where multiple types of bugs are triggered and the program will not survive unless all of them are avoided. Therefore, the algorithm carefully separates each bug type.

The procedure is to diagnose each bug type one by one using a combination of preventive changes and exposing changes. We define two sets of bug types, the undecided set Su and the identified set Si. Initially, Su contains all the bug types and Si is empty.

For each bug type b from the undecided set Su, First-Aid applies the following changes to all subsequently allocated or deallocated memory objects: the exposing change for the bug type b, and the preventive changes for all other bug types in the set Su ∪ Si − {b}. This way, the bugs with the type b will manifest themselves during re-execution and the potential interferences from other types of bugs will be prevented. If the bugs of the type b manifest themselves during re-execution, First-Aid will move the bug type b from the undecided set Su to the identified set Si; otherwise, First-Aid will simply remove it from the undecided set Su. At the end of this procedure, the identified set Si will contain all the types of the bugs that have occurred.

After each new bug type is identified, First-Aid checks whether the current identified set Si covers all the types of bugs that have occurred. To do so, First-Aid performs one iteration of re-execution, applying preventive changes for bug types in the identified set along with exposing changes for the undecided set. If no bugs manifest themselves, First-Aid stops searching for more bug types; otherwise, it continues.
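
The following C sketch summarizes this bug-type diagnosis loop. It is a simplified illustration: the enum of bug types, reexecute_with_changes(), and the use of re-execution survival as a stand-in for observing a manifestation are assumptions made only for the example; the real system confirms manifestations through symptoms such as canary corruption.

    /* Sketch of phase-2 bug-type diagnosis using an undecided set Su and
     * an identified set Si, both represented as boolean arrays. */
    #include <stdbool.h>

    enum bug_type { BUF_OVERFLOW, DANGLING_WRITE, DANGLING_READ,
                    UNINIT_READ, DOUBLE_FREE, NUM_BUG_TYPES };

    /* Re-executes from the chosen checkpoint with the exposing change for
     * types in expose[] and the preventive change for types in prevent[];
     * returns true iff the re-execution survives the failure region. */
    extern bool reexecute_with_changes(const bool expose[NUM_BUG_TYPES],
                                       const bool prevent[NUM_BUG_TYPES]);

    void identify_bug_types(bool identified[NUM_BUG_TYPES])
    {
        bool undecided[NUM_BUG_TYPES];
        for (int b = 0; b < NUM_BUG_TYPES; b++) {
            undecided[b] = true;        /* Su starts with all bug types */
            identified[b] = false;      /* Si starts empty              */
        }

        for (int b = 0; b < NUM_BUG_TYPES; b++) {
            if (!undecided[b])
                continue;

            bool expose[NUM_BUG_TYPES] = { false };
            bool prevent[NUM_BUG_TYPES] = { false };
            expose[b] = true;
            for (int o = 0; o < NUM_BUG_TYPES; o++)   /* Su ∪ Si − {b} */
                if (o != b && (undecided[o] || identified[o]))
                    prevent[o] = true;

            /* If bugs of type b manifest, move b from Su to Si;
             * otherwise simply drop b from Su. */
            bool manifested = !reexecute_with_changes(expose, prevent);
            identified[b] = manifested;
            undecided[b] = false;

            if (manifested) {
                /* Early-stop check: prevent identified types, expose the
                 * still-undecided ones; surviving means Si already covers
                 * every bug type that has occurred. */
                bool expose_rest[NUM_BUG_TYPES], prevent_found[NUM_BUG_TYPES];
                for (int o = 0; o < NUM_BUG_TYPES; o++) {
                    expose_rest[o]   = undecided[o];
                    prevent_found[o] = identified[o];
                }
                if (reexecute_with_changes(expose_rest, prevent_found))
                    break;
            }
        }
    }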

After identifying all the bug types, the next step is to identify the call-sites of the bug-triggering memory objects. For buffer overflow and dangling pointer write, First-Aid can directly identify the bug-triggering memory objects by looking for canary corruption in padding and delay-freed memory objects, respectively. For double frees, it identifies the bug-triggering memory objects by checking the parameters passed to memory deallocation operations.

For uninitialized read and dangling pointer read, it is more challenging to identify the call-sites through the bug-triggering memory objects themselves, because these bugs only cause incorrect content reads. To address this problem, First-Aid uses a binary search algorithm to identify such call-sites. Specifically, starting with a search range covering all N call-sites after the checkpoint, in each iteration of re-execution, First-Aid applies the exposing change to half of the call-sites in the search range and the preventive change to the rest of the call-sites. Depending on whether the bug is exposed, i.e., whether the program fails, it narrows down the search range by half and starts another iteration until the search range contains only a single bug-triggering call-site. The number of iterations for this algorithm is O(log N).

By also applying preventive changes in each iteration, First-Aid can prevent interference from undiagnosed bug-triggering call-sites that are outside of the current search range. This is critical for handling the case where multiple call-sites need to be patched at the same time to prevent a failure. In this case, one bug-triggering call-site is guaranteed to be identified by each round of the above binary search, and First-Aid needs to conduct multiple rounds of search and remove the identified call-site from the whole search range after each round. If there are M bug-triggering call-sites, the search algorithm uses O(M ∗ log N) re-executions in total.
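
One round of this binary search can be sketched as follows. The helper bug_exposed_in() is a hypothetical name standing for a re-execution in which the exposing change is applied to the first half of the current range and the preventive change to the rest; the sketch pins down a single culprit call-site per round, as described above.

    /* Sketch of one round of the call-site binary search for
     * uninitialized-read and dangling-pointer-read bugs. */
    #include <stdbool.h>
    #include <stddef.h>

    /* Re-executes with the exposing change applied to call-sites [lo, mid)
     * and the preventive change to [mid, hi); returns true iff the program
     * fails, i.e., the bug was exposed in the first half. */
    extern bool bug_exposed_in(size_t lo, size_t mid, size_t hi);

    /* Returns the index of one bug-triggering call-site among the n
     * call-sites observed after the checkpoint; O(log n) re-executions. */
    size_t find_bug_callsite(size_t n)
    {
        size_t lo = 0, hi = n;              /* search range [lo, hi) */
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            if (bug_exposed_in(lo, mid, hi))
                hi = mid;                   /* culprit is in the exposed half   */
            else
                lo = mid;                   /* culprit is in the prevented half */
        }
        return lo;
    }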

5.4.3 Comparison of First-Aid and Rx bug diagnosis

Our previous work, Rx [91], can also provide quick recovery for failures caused by memory bugs. However, it (intentionally) does not perform in-depth diagnosis since it aims for fast recovery. First-Aid’s goals, on the other hand, are not only to quickly recover programs from failures, but also to prevent reoccurrence of the same bug and provide on-site diagnostic information to developers. Therefore, First-Aid performs accurate diagnosis on memory bugs in the following two respects.

Correctness: First-Aid will not misdiagnose one type of memory bug as another. It determines a bug type by observing both failure symptoms and possible bug manifestations, such as memory content corruption imposed by the exposing change. In contrast, Rx makes its decision based on whether the program survives or fails after applying preventive changes only. It could thus misdiagnose one type of bug as another. For example, Rx may end up using padding, which is for avoiding a buffer overflow bug, to cure a dangling pointer write bug when the memory write through the dangling pointer happens to corrupt some padding instead of useful data. This cannot happen in First-Aid because when diagnosing buffer overflow, the “delay free” change is also applied to prevent dangling pointer writes from corrupting any other places, including the padding.

Exactness: First-Aid identifies a small set of memory objects that potentially trigger the bug, while Rx applies environmental changes to all memory objects during re-execution. For example, Rx stops diagnosis if padding all the new objects can avoid the bug during re-execution. In contrast, First-Aid pinpoints the exact objects where the buffer overflows occur by checking canary values in the padding of all the objects.

5.5 Patch Validation and Bug Reporting

Although First-Aid’s diagnosis algorithm will not misdiagnose one type of memory bugs as another, it may diagnose other types of bugs (e.g. semantic bugs) as memory bugs if the bug manifestation depends on memory layout. For example, if a memory write due to a semantic bug happens at the address right after a newly allocated object, it may be diagnosed as a buffer overflow. Even though the chance of such misdiagnosis is small, it undermines the safety and reliability of the program execution in the long run and can mislead developers when they fix the bug in the source code.

To rule out the possibility of such misdiagnosis, the random side-effects of a patch must be distinguished from the desired effects. First-Aid does this by checking that the effect of a runtime patch is consistent under memory layout randomization. During validation, First-Aid re-executes the buggy region of the program for three more iterations with a randomized allocation algorithm. In each iteration, the memory allocator's activities are logged by First-Aid's allocator extension, and illegal memory accesses are traced using the dynamic instrumentation tool Pin [150]. Specifically, for each memory allocation or deallocation, First-Aid logs the object address and whether a patch is triggered at this operation. If a patch is triggered, First-Aid also traces the illegal accesses corresponding to the patch: the memory writes to the padding, the memory reads and writes to delay-freed objects, and the reads before initialization. With these traces, First-Aid checks whether the effects of the patch are consistent among multiple re-executions, based on the following criteria: a) the patch is triggered the same number of times; b) there are the same number of total illegal accesses prevented by the patch; and c) each illegal access is made by the same instruction at the same offset in the corresponding memory object (the memory object's address is randomized though). If the consistency check fails, First-Aid removes the runtime patch and logs the validation failure.
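
A minimal sketch of the consistency check over two validation traces is shown below. The trace record layout and the function name are assumptions introduced only for illustration; the comparison itself follows criteria a) through c) above.

    /* Sketch of the patch-validation consistency check across re-executions
     * with randomized allocation.  Only the criteria follow the text; the
     * data layout is hypothetical. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct illegal_access {
        uintptr_t insn_addr;    /* instruction that made the illegal access */
        size_t    obj_offset;   /* offset within the (randomized) object    */
    };

    struct patch_trace {
        size_t trigger_count;                    /* criterion (a) */
        size_t access_count;                     /* criterion (b) */
        const struct illegal_access *accesses;
    };

    bool traces_consistent(const struct patch_trace *a,
                           const struct patch_trace *b)
    {
        if (a->trigger_count != b->trigger_count ||
            a->access_count  != b->access_count)
            return false;
        for (size_t i = 0; i < a->access_count; i++)     /* criterion (c) */
            if (a->accesses[i].insn_addr  != b->accesses[i].insn_addr ||
                a->accesses[i].obj_offset != b->accesses[i].obj_offset)
                return false;
        return true;    /* object base addresses themselves may differ */
    }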

The traces collected in the above validation process are organized into a bug report for developers. Specifically, besides the usual bug report (e.g., core dump, event log), First-Aid provides four pieces of information: a) a diagnosis log, which helps developers understand the diagnostic process; b) the runtime patch information (i.e., the bug type and call-sites of the relevant allocation or deallocation operations), which points developers to the critical source code section related to the bug; c) allocation and deallocation traces in the bug region, which show clearly when the runtime patch takes effect and what memory objects are affected; and d) illegal memory accesses in the bug region, which show the instructions that have made illegal memory accesses. The above information about both the root causes of the bug and the bug manifestation process can help developers fix the bug.

5.6 Discussion

Assumptions on memory bugs: First-Aid's diagnosis algorithm is based on several assumptions about the common characteristics of memory bugs. These assumptions include: a buffer overflow bug must corrupt data within a neighboring region of the memory object; canary values must not be coincidentally used as normal values in buggy memory objects; programmers intend to initialize the newly allocated buffers with zeros in the case of an uninitialized read bug. These assumptions cover most common cases in real memory bugs and are also used in previous work [89, 91], although exceptions can happen theoretically and result in failed attempts to patch.

Bug coverage: Although First-Aid covers common memory bugs, there are some types of bugs that First-Aid cannot fix. Specifically, First-Aid cannot deal with memory leak bugs, whose negative effects are cumulative and cannot be reverted by simply rolling back to a recent checkpoint. One way to survive or prevent memory leak bugs is to deploy memory leak detectors and swap the leaked objects onto disk [167, 168]. Additionally, First-Aid cannot handle memory bugs that slip through the deployed error monitors. One way to address this issue is to deploy more rigorous dynamic integrity and correctness checkers as First-Aid's error monitors. This is currently an active research area. Moreover, First-Aid cannot deal with latent bugs, i.e., bugs whose root causes are far away from the error symptoms. Fortunately, this is a rare case, as a previous study [169] shows that most bugs tend to cause quick crashes.

Customized memory allocator in applications: To avoid certain memory bugs or improve performance, some applications use customized memory allocation wrappers or even their own memory allocators. Memory allocation wrappers have little impact on First-Aid because First-Aid's diagnosis and patching are based on multi-level call-sites. If an application-specific allocator is used, First-Aid's memory allocator extension should be ported to the application-specific allocator. We do not see any particular challenge for porting.

5.7 Evaluation and Experimental Results

5.7.1 Experimental setup

Our experimental platform consists of two machines with Intel Xeon 3.00 GHz processors, 2MB L2 cache, 2GB memory, and a 100Mbps Ethernet connection between them. The operating system kernel is Linux 2.4.22, modified with Flashback [104] checkpointing support. We ran servers on one machine and clients on the other. We implemented the memory allocator extension by modifying the Lea allocator [170], the default memory allocator used in the GNU C library.

Application   Ver.      Bug                      LOC    App. Desc.
Apache        2.0.51    dangling pointer read    263K   web server
Apache-uir    2.0.51    uninitialized read       263K   web server
Apache-dpw    2.0.51    dangling pointer write   263K   web server
Squid         2.3       buffer overflow          93K    proxy cache
CVS           1.11.4    double free              114K   version control
Pine          4.44      buffer overflow          330K   email client
Mutt          1.3.99i   buffer overflow          86K    email client
M4            1.4.4     dangling pointer read    17K    macro processor
BC            1.06      buffer overflow          14K    calculator

Table 5.2: Applications and bugs used in evaluation.

We evaluated First-Aid with a range of applications including three server applications (Apache, Squid, and CVS) and four desktop applications (Pine, Mutt, M4, and BC), as shown in Table 5.2. The applications contain various types of memory bugs, including buffer overflow, dangling pointer read/write, double free, and uninitialized read. Seven of these bugs were introduced by the original developers. We have not been able to locate applications that contain dangling pointer write or uninitialized read bugs. To evaluate First-Aid's functionality of handling these two types of bugs, we injected them into the Apache httpd server separately: Apache-uir contains an uninitialized read bug and Apache-dpw contains a dangling pointer write bug.

5.7.2 Overall effectiveness

We executed these seven applications with First-Aid. The checkpoint interval is 200 milliseconds. To simulate bug occurrences in real scenarios, we mixed the bug-triggering inputs and normal inputs.

Application   Diagnosed bugs            Runtime patch (call-sites)   Recovery time (s)   Avoid future errors?   Rollbacks for diagnosis   Validation time (s)
Apache        dangling pointer read     delay free (7)               3.978               Yes                    28                        9.620
Squid         buffer overflow           add padding (1)              0.386               Yes                    7                         14.198
CVS           double free               delay free (1)               0.121               Yes                    6                         3.887
Pine          buffer overflow           add padding (1)              0.722               Yes                    7                         18.276
Mutt          buffer overflow           add padding (1)              0.617               Yes                    7                         10.610
M4            dangling pointer reads    delay free (2)               1.396               Yes                    18                        3.407
BC            two buffer overflows      add padding (3)              0.573               Yes                    6                         2.625
Apache-uir    uninitialized read        fill with zero (1)           0.102               Yes                    9                         5.750
Apache-dpw    dangling pointer write    delay free (1)               0.084               Yes                    7                         5.718

Table 5.3: Overall results for First-Aid in surviving and preventing memory bugs. The recovery time is from when the failure is first caught to when the program changes back to normal mode with the applied runtime patches. The validation time is the extra time taken when enabling a three-iteration validation. Apache-uir and Apache-dpw correspond to the cases with injected uninitialized read and dangling pointer write, respectively.

For each bug case, we measured the failure recovery time, the number of rollbacks for diagnosis, the number of patched call-sites, and the validation time. We also recorded the runtime patches generated by First-Aid and whether they can avoid future errors caused by the same bug. Table 5.3 shows the results.

First-Aid is effective in diagnosing memory bugs. As shown in Table 5.3, for all tested cases, First-Aid correctly identifies the bug type and the call-sites of bug-triggering memory objects. The diagnosis is accurate because First-Aid leverages both preventive and exposing changes for separating the interferences among different bugs.

First-Aid provides quick failure recovery and thereby hides program failures from users.

As shown in Table 5.3, the failure recovery time ranges from 0.084 to 3.978 seconds with an average of 0.887 seconds. This is because of First-Aid's lightweight diagnosis algorithm and checkpointing-and-re-execution mechanism. For example, First-Aid quickly pinpoints the bug in seven cases after 6-9 iterations of program re-execution, resulting in a failure recovery time of less than 1 second. The recovery time for the dangling pointer read case in Apache is relatively long because its bug-triggering point is a little farther (3 checkpoints) from the failure point.

The number of rollbacks First-Aid performs for diagnosis depends on the bug type. For buffer overflow, dangling pointer write, and double-free bugs, the number of rollbacks is small (6-7 times in our evaluation). This is because the bug-triggering objects can be directly identified by checking the integrity of the canary or the parameters of deallocation requests. For uninitialized read and dangling pointer read bugs, more rollbacks may be needed because the patch application points need to be identified via binary search.

Table 5.3 shows that First-Aid is effective in avoiding future errors caused by the same bug. In the experiments, we repeatedly sent the bug-triggering inputs to the applications. First-Aid successfully prevents future memory errors due to the same bug after applying the runtime patches. This is because First-Aid's patch is accurate (Section 5.7.3) and thereby can be enabled for preventing future errors during the entire program execution.

As shown in Table 5.3, First-Aid successfully validates the generated patches within a small amount of time, i.e., 3-18 seconds, for the tested cases. The validation is done in parallel with a snapshot of the application and does not delay the recovery. During the validation process, the runtime patches show consistent effects, i.e., patches are applied to the same number of memory objects and each illegal access matches across different runs. The validation time is small because First-Aid re-executes the program from the checkpoint identified by the diagnostic engine instead of from the beginning.

5.7.3 Future error prevention

We evaluated First-Aid’s capability of future error prevention and compared it with two alternatives: Rx [91] and the restart method [58, 59], using two representative server programs, Apache (the dangling pointer read bug) and Squid (the buffer overflow bug). In

96 First-Aid First-Aid 10 10 5 5

0 Rx 0 Rx 10 10 5 5 0 0 Restart Restart 10 10 Throughput (MB/s) Throughput (MB/s) 5 5

0 0 5 10 15 20 25 0 0 5 10 15 20 25 Elapsed Time (sec) Elapsed Time (sec) (a) Throughput for Apache (b) Throughput for Squid

Figure 5.4: Comparison among First-Aid, Rx, and restart

the experiments, after a certain period of normal execution, we periodically triggered the real bugs by sending bug-triggering requests mixed with normal inputs.

Figure 5.4 shows that First-Aid effectively prevents future errors caused by the same bug while Rx and the restart approach cannot. In the case of Apache (Figure 5.4(a)), First-Aid diagnoses the bug and recovers the program from the first failure in around 4 seconds and then maintains stable performance when facing the same bug-triggering inputs repeatedly. This is because the patch generated by First-Aid is correct and accurate, so it can avoid the same memory bug during future program execution. While Rx recovers the program from failures more quickly than First-Aid does, it suffers the same bug repeatedly during subsequent program execution. This is because Rx disables the environmental changes after passing the buggy program region. The restart approach also suffers the same bug repeatedly since the bug is deterministically triggered during subsequent program execution.

Figure 5.4 (b) shows similar results for Squid. One difference is that First-Aid recovers the program from the first failure faster for Squid than for Apache because of Squid’s shorter error propagation distance.

Table 5.4 further illustrates the accuracy of First-Aid's patches as well as the reason why Rx disables the environmental changes after surviving failures. For the seven tested real bugs, we compare patch application in First-Aid and environmental change application in Rx for the buggy region in terms of the number of call-sites and objects being applied with the changes. As shown in Table 5.4, the patch generated by First-Aid is of much lighter weight. For example, First-Aid only affects 1 to 7 call-sites and 1 to 315 objects, while Rx affects 8 to 380 call-sites and 183 to 5004 objects. Furthermore, after passing the buggy region, First-Aid's patches are less likely to be triggered by normal user inputs. Therefore, First-Aid can enable the lightweight patches during the entire program execution for preventing future errors due to the same bug.

              Call-sites                          Objects
Name          First-Aid   Rx     Ratio           First-Aid   Rx      Ratio
Apache        7           32     21.88%          315         2567    12.23%
Squid         1           61     1.64%           1           3626    0.03%
CVS           1           44     2.27%           17          306     5.56%
Pine          1           380    0.26%           11          2881    0.38%
Mutt          1           216    0.46%           2           5004    0.04%
M4            2           8      25.00%          3           183     1.64%
BC            3           34     8.82%           5           732     0.68%

Table 5.4: The call-sites and memory objects affected by the runtime patch in the buggy region

5.7.4 Bug report

Our manual inspection of the bug reports shows that the on-site diagnostic information is helpful in fixing the bug. Figure 5.5 shows the report generated by First-Aid for the Apache dangling pointer read bug. It consists of five parts: a failure core-dump, a diagnosis log, runtime patch information, memory allocation and deallocation traces, and an illegal access trace. Note that some details of the report are omitted in Figure 5.5.

Bug report:
1. Failure coredump: failure.core
2. Diagnosis summary: recovery: 3.978(s); validation: 9.620(s), Diagnosis log: diag.log
3. Patch applied: delay free x 7 for dangling pointer read
   Patch 1: delay free on callsite:
     0x4022f971@util_ald_free (triggered 44 times)
     0x806437b@util_ald_cache_purge
     0x80646dc@util_ald_cache_insert
   Patch 2: delay free on callsite:
     0x4022f971@util_ald_free (triggered 44 times)
     0x8063eac@util_ldap_search_node_free
     0x8064372@util_ald_cache_purge
   Patch 3: ...
4. Memory allocations/deallocations in buggy region:
   Without patch: mm_trace_orig.log, with patch: mm_trace_patched.log, diff:
     free(0x81865c0)          | free(0x81865c0) (delayed, patch 2)
     free(0x8186360)          | free(0x8186360) (delayed, patch 3)
     ...
     malloc(1000): 0x8186250  | malloc(1000): 0x818e4f8
     ...
5. Illegal access trace in buggy region:
   Summary:
     patch 1: 68 accesses (68 read, 0 write):
       from 2 instructions in util_ald_cache_fetch
       from 4 instructions in util_ald_cache_purge
     patch 2: 90 accesses (90 read, 0 write):
       from 6 instructions in util_ldap_search_node_free
     patch 3: ...
   Detailed access log: illegal_access.log

Figure 5.5: The bug report generated by First-Aid for the Apache dangling pointer read bug

The patch information (item 3 in Figure 5.5) states that the bug is a dangling pointer read and can be avoided by delaying frees in the functions util_ald_cache_purge and util_ldap_search_node_free, which are the callers of a wrapper (util_ald_free) for free in Apache. Additionally, the multi-level call-sites show that all the delayed frees are issued indirectly through util_ald_cache_purge. This information indicates that the bug is related to the LDAP (Lightweight Directory Access Protocol) cache. By comparing the allocation/deallocation traces (item 4 in Figure 5.5) with and without the runtime patch applied, we notice that without the patch, the freed memory is reallocated later (as indicated by the italic font at item 4 in Figure 5.5). Furthermore, the illegal access trace (item 5 in Figure 5.5) in the buggy region clearly shows that in a few LDAP cache related functions, the application accesses the memory objects that have been freed by the application (but not deallocated due to First-Aid's runtime patches). All of these suggest that the bug is in util_ald_cache_purge: dangling pointers are created in the cache cleanup operation.

5.7.5 Runtime overhead during normal execution


Figure 5.6: Overhead for First-Aid during normal execution. The ‘allocator’ bars show the overhead imposed by the memory allocator extension. The ‘overall’ bars show the combined overhead for First-Aid including both the allocator extension and checkpointing.

We evaluated the normal execution overhead incurred by First-Aid using three sets of programs: the seven applications in Table 5.3, the SPEC INT2000 benchmarks [164], and four allocation intensive benchmarks [165]. We executed First-Aid with normal user inputs in two configurations: enabling only the memory allocator extension, and enabling both the memory allocator extension and checkpointing. The default checkpointing interval in the adaptive checkpointing scheme is 200 milliseconds. For the SPEC benchmarks, we used the reference data sets as the workload. For the other programs, we either chose a large testing program distributed along with the package or constructed synthesized workloads based on the commonly exercised operations, e.g., fetching various sizes of pages for the Apache and Squid servers, exporting a directory with files for the CVS server, going through a mail box and reading each email for Pine and Mutt, etc.

Figure 5.6 shows the overhead of First-Aid during normal execution. For desktop programs, we compare the execution time, while for server programs, we compare the average response time. We show the overhead imposed by the memory allocator extension (the second bar) and the overall overhead incurred by both the memory allocator extension and checkpointing (the third bar).

As shown in Figure 5.6, First-Aid incurs low overhead (0.4-11.6% with an average of 3.7%) for the tested applications. This is because both the memory allocator extension and the checkpointing mechanism (the two main sources of runtime overhead) are lightweight. Specifically, First-Aid incurs less than 5% overhead for 17 out of the 22 applications. We observe that the runtime overhead incurred by the allocator extension and the checkpointing mechanism varies for different programs. For some programs that have large memory working sets, such as the SPEC benchmarks, the checkpointing overhead is generally higher due to the frequent copy-on-write page replication. For some programs that perform intensive allocation and deallocation, such as BC and cfrac, the memory allocator extension imposes a relatively larger overhead. This is mainly due to the time spent checking for available patches and maintaining additional meta data on memory management. Since we did not spend much effort on optimization, there is room for improvement.

5.7.6 Space overhead

We evaluated First-Aid’s space overhead, which mainly comes from the runtime patches, the memory allocator extension, and the checkpointing module.

Space overhead of runtime patches: We evaluated the space overhead incurred by the patches for bugs in the seven applications. We mixed the bug-triggering inputs and normal inputs to trigger the bugs. After patches were applied, we monitored all the memory object allocations and deallocations to measure the space overhead. For the padding patch, we measured the maximal memory space occupied by the padding. For the delay free patch, we measured the accumulated memory space occupied by the delay-freed objects.

Name      Heap size (Kbytes)   Patch type    Space overhead (Bytes)   Ratio
Squid     2338                 padding       1016                     0.04%
Pine      651                  padding       1016                     0.15%
Mutt      353                  padding       1016                     0.28%
BC        61                   padding       3048                     4.96%
Apache    825                  delay free    14512                    1.72%
CVS       292                  delay free    1496                     0.50%
M4        16343                delay free    128                      0.0008%

Table 5.5: The space overhead for patches

Table 5.5 shows that the space overhead incurred by patches for each tested application is small, ranging from 128 bytes to 14512 bytes. This is mainly because patches are only applied to a small number of memory objects. For the padding patch, since the padded objects are usually freed very soon, the extra memory space for padding only appears for a very short time. Specifically, for Squid, Pine, and Mutt, there is only one padded object in the heap at a time; for BC, there are at most three padded objects at the same time. Even though the padding used in First-Aid is relatively large (almost 1 KB), the space overhead is still small.

For the delay free patch, the space overhead is accumulated slowly. For Apache, where we triggered the bug aggressively, delay-freeing 315 objects only causes a space overhead of less than 15 KB. Note that First-Aid keeps track of all the delay-freed objects so that they can be freed once the memory consumption incurred by them reaches a customizable threshold (1 MB in our experiments).
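
One way this space budget could be enforced is sketched below. The tracking list, real_free(), and the eviction order are assumptions made only for illustration; the 1 MB budget matches the threshold mentioned above.

    /* Sketch of bounding the space consumed by delay-freed objects. */
    #include <stdlib.h>
    #include <stddef.h>

    #define DELAY_FREE_BUDGET (1024 * 1024)   /* assumed 1 MB threshold */

    struct delayed { void *ptr; size_t size; struct delayed *next; };
    static struct delayed *delayed_list;
    static size_t delayed_bytes;

    extern void real_free(void *p);           /* underlying allocator free */

    void delay_free(void *p, size_t size)
    {
        struct delayed *d = malloc(sizeof *d);
        if (!d) { real_free(p); return; }     /* fall back if tracking fails */
        d->ptr = p; d->size = size; d->next = delayed_list;
        delayed_list = d;
        delayed_bytes += size;

        /* Once the budget is exceeded, actually release tracked objects
         * (simplified here: releases from the head of the list). */
        while (delayed_bytes > DELAY_FREE_BUDGET && delayed_list) {
            struct delayed *old = delayed_list;
            delayed_list = old->next;
            delayed_bytes -= old->size;
            real_free(old->ptr);
            free(old);
        }
    }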

Space overhead of memory allocator extension: We also evaluated the space overhead incurred by the First-Aid memory allocator extension. Detailed results are shown in Table 5.6. In most cases, the memory allocator extension incurs low space overhead, i.e., less than 5%. This is because it only adds 16 bytes of meta data for each memory object.

                      Apache   Squid   CVS     Mutt     Pine     M4       BC
Original heap (MB)    0.806    2.283   0.285   0.345    0.636    15.958   0.059
First-Aid heap (MB)   0.810    2.357   0.285   0.392    0.980    16.001   0.063
Overhead              0.49%    3.24%   0.00%   13.62%   54.09%   0.25%    6.78%

                      cfrac    espresso   lindsay   p2c      gzip      vpr      gcc      mcf
Original heap (MB)    0.205    0.272      1.822     0.461    180.371   20.108   83.686   94.914
First-Aid heap (MB)   0.396    0.354      1.826     0.715    180.374   20.661   83.755   94.914
Overhead              93.17%   30.15%     0.22%     55.10%   0.00%     2.76%    0.08%    0.00%

                      crafty   parser   eon     perlbmk   vortex    bzip2     twolf
Original heap (MB)    0.856    30.111   0.346   56.922    108.644   184.923   3.224
First-Aid heap (MB)   0.856    30.111   0.352   63.047    109.353   184.923   5.251
Overhead              0.00%    0.00%    1.89%   10.76%    0.65%     0.00%     62.88%

Table 5.6: Space overhead incurred by the memory allocator extension

                 Apache   Squid   CVS     Mutt    Pine    M4      BC      cfrac   espresso   lindsay   p2c
MB/checkpoint    0.068    0.211   1.068   0.286   0.345   0.222   0.040   0.210   0.185      0.297     0.055
MB/second        0.341    1.056   4.942   1.429   1.728   1.113   0.200   1.049   0.923      1.484     0.273

                 gzip     vpr     gcc     mcf     crafty  parser  eon     perlbmk   vortex   bzip2    twolf
MB/checkpoint    4.574    1.355   4.488   9.691   0.941   10.870  0.056   4.566     33.390   16.080   1.585
MB/second        6.852    6.765   7.074   7.035   4.657   6.836   0.280   6.732     7.120    6.945    6.305

Table 5.7: Space overhead incurred by checkpointing

However, the relative heap overhead is large in some cases, e.g., 93.17% for cfrac. These applications have a large number of small objects, which makes the relative space overhead large. We expect that optimizations can reduce the meta data from 16 bytes to 8 bytes, which would further reduce the space overhead incurred by the memory allocator extension.

Space overhead of checkpointing: Table 5.7 shows the space overhead for keeping the checkpoints in memory. Our checkpointing tool uses copy-on-write (COW) to save the dirty pages, so the space overhead is directly affected by the working set for each application. For many applications we tested, the checkpointing space overhead is low, less than 1 MB for each checkpoint. For example, keeping 100 checkpoints for Apache and Squid only takes 6.8 MB and 21.1 MB, respectively. However, for several SPEC INT2000 benchmarks, such as vortex and bzip2, the checkpointing overhead is large due to their large working sets. In these cases, First-Aid leverages an adaptive checkpointing scheme to alleviate the space overhead. As a result, the space overhead per second is kept low. When the checkpoint interval is increased and old checkpoints are discarded, First-Aid maintains the same length of history while keeping less data in memory. The downside is that the recovery time is longer when the checkpoint interval is larger.
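
One plausible form of such an adaptive policy is sketched below. The space budget, the interval-doubling step, and the helper names are hypothetical illustration choices, not necessarily the policy First-Aid implements; the sketch only shows how a longer interval combined with discarding old checkpoints can keep the same history length with less data in memory.

    /* Sketch of an adaptive checkpointing policy. */
    #include <stddef.h>

    #define CKPT_SPACE_BUDGET (64UL * 1024 * 1024)   /* assumed 64 MB budget */

    extern size_t checkpoint_space_used(void);
    extern void   discard_every_other_checkpoint(void); /* halves stored count */

    static unsigned ckpt_interval_ms = 200;              /* default interval */

    void maybe_adapt_checkpointing(void)
    {
        while (checkpoint_space_used() > CKPT_SPACE_BUDGET) {
            ckpt_interval_ms *= 2;               /* take checkpoints less often */
            discard_every_other_checkpoint();    /* same history, less data     */
        }
    }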

5.8 Summary

In summary, First-Aid is a lightweight runtime system that provides accurate bug diagnosis, failure recovery, and future failure prevention for common memory bugs, including buffer overflow, uninitialized read, dangling pointer read/write, and double free. Using exposing and preventive environmental changes, First-Aid can accurately identify the bug types and bug-triggering memory objects. Based on such diagnostic information, First-Aid generates runtime patches and applies them to a small set of memory objects to tolerate the bug and prevent future failures caused by the same bug.

Our evaluation based on seven applications shows that First-Aid can successfully diagnose and generate runtime patches for common memory bugs. It provides fast failure recovery, i.e., 0.084-3.978 seconds recovery time with an average of 0.887 seconds. The results also show that First-Aid is effective in preventing future bug reoccurrences. Additionally, First-Aid generates a detailed on-site bug report that helps developers understand the root cause and manifestation of the bug. Furthermore, our evaluation with the seven applications, SPEC INT2000, and four allocation intensive benchmarks shows that First-Aid incurs a low runtime overhead (0.4 to 11.6% with an average of 3.7%) during normal program execution.

CHAPTER 6

FINAL REMARKS

To enhance the reliability of system software, this dissertation proposes to use runtime support. Runtime support here refers to the technique of extending the runtime software system with more functionalities useful for reliability-oriented tasks, such as instrumentation-based profiling, runtime analysis, checkpointing/re-execution, scheduling control, memory layout control, etc. Leveraging runtime support, this dissertation proposes novel methods for bug manifestation, bug detection, bug diagnosis, failure recovery, and error prevention in multiple phases of the software development and deployment cycle.

For system software under testing, we propose a method called 2ndStrike to manifest hidden concurrency typestate bugs in multi-threaded system software. 2ndStrike first profiles certain program runtime events related to the typestate and thread synchronization. Based on the logs, 2ndStrike then identifies bug candidates that would cause typestate violation if event order is reversed. Finally, 2ndStrike re-executes the program in multiple iterations with controlled thread interleaving for manifesting bug candidates.

We have evaluated the 2ndStrike prototype on two types of concurrency typestate bugs, i.e., invalid pointer dereference and lock operation violation, with three real-world bugs from three open-source server and desktop programs (i.e., MySQL, Mozilla, pbzip2) on an eight-core machine. Our experimental results have shown that 2ndStrike can effectively and efficiently manifest all three tested software bugs, i.e., within 255 seconds, 178 times faster than stress testing. Additionally, 2ndStrike provides detailed bug reports and can consistently reproduce the bugs after manifesting them once.

For system software deployed in production, we propose a method called DMTracker to detect anomalies in distributed system software running on parallel platforms. Based on the observation that data movements in parallel programs typically follow certain patterns, our idea is to extract data movement (DM)-based invariants at program runtime and check the violations of these invariants. These violations indicate potential bugs such as data races and memory corruption bugs that manifest themselves in data movements. Utilizing the data movement information, we propose a statistical-rule-based approach to detect anomalies for finding bugs.

We have evaluated the DMTracker prototype on a cluster with 64 CPUs. Our experiments with two real-world bug cases in MVAPICH/MVAPICH2 [6], a popular MPI library, have shown that DMTracker can effectively detect them and report abnormal data movements to help programmers quickly diagnose the root causes of bugs. In addition, DMTracker incurs very low runtime overhead, from 0.9% to 6.0%, in our experiments with High Performance Linpack (HPL) [7] and NAS Parallel Benchmarks (NPB) [8], which indicates that DMTracker can be deployed in production runs.

For system software that fails due to runtime errors, we propose a method called First-Aid to recover from failures caused by common memory bugs during production runs and prevent future errors caused by the same bugs. Upon a failure, First-Aid diagnoses the bug type and identifies the memory objects that trigger the bug. To do so, it rolls back the program to previous checkpoints and uses two types of environmental changes that can prevent or expose memory bug manifestation during re-execution. Based on the diagnosis, First-Aid generates and applies runtime patches to avoid the memory bug and prevent its reoccurrence.

We have evaluated the First-Aid prototype with seven applications that contain various types of memory bugs, including buffer overflow, uninitialized read, dangling pointer read/write, and double free. The results show that First-Aid can quickly diagnose the tested bugs and recover applications from failures (in 0.084 to 3.978 seconds). The results also show that the runtime patches generated by First-Aid can prevent future failures caused by the diagnosed bugs. Additionally, First-Aid provides detailed diagnostic information on both the root cause and the manifestation of the bugs. Furthermore, First-Aid incurs low overhead (0.4-11.6% with an average of 3.7%) during normal execution for the tested buggy applications, SPEC INT2000, and four allocation intensive programs.

To sum up, the runtime methods and tools proposed in this dissertation can provide great help in enhancing the reliability of system software in various scenarios. In addition, combining system support with dynamic analysis in the runtime tools can bring key advantages such as high efficiency, high accuracy, and high usability.

BIBLIOGRAPHY

[1] Donna Scott, “Assessing the costs of application downtime.” Gartner Group, May 1998.

[2] Message Passing Interface Forum, “MPI: A message-passing interface standard,” The International Journal of Supercomputer Applications and High Performance Computing, vol. 19, pp. 399–407, 1994.

[3] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, “Inside the slammer worm,” IEEE Security and Privacy, vol. 1, no. 4, pp. 33–39, 2003.

[4] B. Schroeder and G. A. Gibson, “A large-scale study of failures in high-performance computing systems,” in IEEE/IFIP Intl. Conf. on Dependable Systems and Networks (DSN), (Washington, DC), 2006.

[5] R. E. Strom and S. Yemini, “Typestate: A concept for enhancing software reliability,” IEEE Trans. Softw. Eng., vol. 12, no. 1, pp. 157–171, 1986.

[6] Network-Based Computing Laboratory, “MVAPICH: Mpi for infiniband.” http://mvapich.cse.ohio-state.edu.

[7] A. Petitet and R. C. Whaley and J. Dongarra and A. Cleary, “High performance linpack.” http://www.netlib.org/benchmark/hpl/.

[8] F. C. Wong, R. P. Martin, R. H. Arpaci-Dusseau, and D. E. Culler, “Architectural requirements and scalability of the nas parallel benchmarks,” in ACM/IEEE Conf. on Supercomputing (SC), 1999.

[9] “Common computer vulnerabilities and exposures, cve-2010-0249.” http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-0249, Jan 2010.

[10] D. Evans, J. Guttag, J. Horning, and Y. M. Tan, “LCLint: A tool for using specifications to check code,” in Proceedings of the 2nd ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), pp. 87–96, Dec 1994.

[11] J. S. Foster, T. Terauchi, and A. Aiken, “Flow-sensitive type qualifiers,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 1–12, Jun 2002.

[12] M. Das, S. Lerner, and M. Seigle, “Esp: path-sensitive program verification in polynomial time,” in Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pp. 57–68, 2002.

[13] S. Hallem, B. Chelf, Y. Xie, and D. Engler, “A system and language for building system-specific, static analyses,” in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (PLDI), pp. 69–82, Jun 2002.

[14] W. R. Bush, J. D. Pincus, and D. J. Sielaff, “A static analyzer for finding dynamic programming errors,” Software : Practice and Experience, vol. 30, no. 7, pp. 775–802, 2000.

[15] N. Dor, M. Rodeh, and M. Sagiv, “CSSV: Towards a realistic tool for statically detecting all buffer overflows in C,” in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI), pp. 155–167, Jun 2003.

[16] D. L. Heine and M. S. Lam, “A practical flow-sensitive and context-sensitive C and C++ memory leak detector,” in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI), pp. 168–181, Jun 2003.

[17] M. Naik and A. Aiken, “Effective static race detection for java,” in SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2006.

[18] M. Naik and A. Aiken, “Conditional must not aliasing for static race detection,” in SIGPLAN Symposium on Principles of Programming Languages (POPL), pp. 327–338, 2007.

[19] D. Engler, B. Chelf, A. Chou, and S. Hallem, “Checking system rules using system-specific, programmer-written extensions,” in Symposium on Operating Systems Design and Implementation (OSDI), 2000.

[20] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, “Bugs as deviant behavior: A general approach to inferring errors in systems code,” in ACM Symposium on Operating Systems Principles (SOSP), 2001.

[21] R. Hasting and B. Joyce, “Purify: Fast detection of memory leaks and access errors,” in Proceedings of the USENIX Winter 1992 Technical Conference, pp. 125–136, Dec 1992.

[22] N. Nethercote and J. Seward, “Valgrind: a framework for heavyweight dynamic binary instrumentation,” in Proceedings of ACM SIGPLAN conference on Programming language design and implementation (PLDI’07), pp. 89–100, 2007.

[23] J. Condit, M. Harren, S. McPeak, G. C. Necula, and W. Weimer, “CCured in the real world,” in ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), 2003.

[24] T. M. Austin, S. E. Breach, and G. S. Sohi, “Efficient detection of all pointer and array access errors,” in Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (PLDI), pp. 290–301, Jun 1994.

[25] R. W. M. Jones and P. H. J. Kelly, “Backwards-compatible bounds checking for arrays and pointers in C programs,” in Proceedings of the 3rd International Workshop on Automated and Algorithmic Debugging (AADEBUG), pp. 13–26, May 1997.

[26] M. Prvulovic and J. Torrellas, “Reenact: Using thread-level speculation to debug data races in multithreaded codes,” in Proceedings of the 30th Annual International Symposium on Computer Architecture, Jun 2003.

[27] P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas, “iWatcher: Efficient architectural support for software debugging,” in International Symposium on Computer Architecture (ISCA), 2004.

[28] P. Zhou, W. Liu, L. Fei, S. Lu, F. Qin, Y. Zhou, S. Midkiff, and J. Torrellas, “AccMon: Automatically detecting memory-related bugs via program counter-based invariants,” in Proceedings of IEEE/ACM International Symposium on Microarchitecture (MICRO’04), pp. 269–280, 2004.

[29] C. Cowan, S. Beattie, J. Johansen, and P. Wagle, “PointGuard: Protecting pointers from buffer overflow vulnerabilities,” in Proceedings of the 12th USENIX Security Symposium, pp. 91–104, Aug 2003.

[30] O. Ruwase and M. S. Lam, “A practical dynamic buffer overflow detector,” in the 11th Annual Network and Distributed System Security Symposium (NDSS), pp. 159–169, Feb 2004.

[31] D. Grossman, G. Morrisett, T. Jim, M. Hicks, Y. Wang, and J. Cheney, “Region-based memory management in Cyclone,” in Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation (PLDI), pp. 282–293, Jun 2002.

[32] F. Qin, S. Lu, and Y. Zhou, “SafeMem: Exploiting ECC-memory for detecting memory leaks and memory corruption during production runs,” in IEEE Intl’ Symposium on High-Performance Computer Architecture (HPCA), 2005.

[33] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin, “Dynamically discovering likely program invariants to support program evolution,” IEEE Trans. on Software Engineering, vol. 27, no. 2, pp. 99–123, 2001.

[34] S. Hangal and M. S. Lam, “Tracking down software bugs using automatic anomaly detection,” in ACM Intl’ Conf. on Software Engineering (ICSE), (New York, NY, USA), ACM Press, 2002.

[35] A. V. Mirgorodskiy, N. Maruyama, and B. P. Miller, “Problem diagnosis in large-scale computing environments,” in ACM/IEEE Conf. on Supercomputing (SC), 2006.

[36] A. Dinning and E. Schonberg, “Detecting access anomalies in programs with critical sections,” SIGPLAN Not., vol. 26, no. 12, pp. 85–96, 1991.

[37] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson, “Eraser: A dynamic data race detector for multithreaded programs,” ACM Trans. Comput. Syst., vol. 15, no. 4, pp. 391–411, 1997.

[38] D. Engler and K. Ashcraft, “Racerx: effective, static detection of race conditions and deadlocks,” in ACM SIGOPS symposium on Operating systems principles (SOSP), 2003.

[39] R. O’Callahan and J.-D. Choi, “Hybrid dynamic data race detection,” in Proceedings of the ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP), pp. 167–178, 2003.

[40] Y. Yu, T. Rodeheffer, and W. Chen, “Racetrack: efficient detection of data race conditions via adaptive tracking,” in SOSP ’05: Proceedings of the twentieth ACM symposium on Operating systems principles, pp. 221–234, 2005.

[41] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder, “Automatically classifying benign and harmful data races using replay analysis,” in PLDI ’07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pp. 22–31, 2007.

[42] M. Xu, R. Bodík, and M. D. Hill, “A serializability violation detector for shared-memory server programs,” in Proceedings of the ACM SIGPLAN conference on Programming language design and implementation (PLDI), pp. 1–14, 2005.

[43] C. Flanagan and S. N. Freund, “Atomizer: a dynamic atomicity checker for multithreaded programs,” in Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL), pp. 256–267, 2004.

[44] S. Lu, J. Tucek, F. Qin, and Y. Zhou, “Avio: Detecting atomicity violations via access-interleaving invariants,” in ACM Intl’ Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006.

[45] K. L. McMillan, Symbolic Model Checking. Kluwer Academic Publisher, 1993.

[46] G. J. Holzmann, “The model checker SPIN,” Software Engineering, vol. 23, no. 5, pp. 279–295, 1997.

[47] D. L. Dill, A. J. Drexler, A. J. Hu, and C. H. Yang, “Protocol verification as a hardware design aid,” in Proceedings of IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp. 522–525, 1992.

[48] P. Godefroid, “Model checking for programming languages using verisoft,” in Proceedings of the 24th ACM Symposium on Principles of Programming Languages, Jun 1997.

[49] W. Visser, K. Havelund, G. Brat, and S.-J. Park, “Model checking programs,” in Proceedings of the 15th IEEE International Conference on Automated Software Engineering, 2000.

[50] T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani, “Automatic predicate abstraction of c programs,” in Proceedings of the SIGPLAN’01 Conference on Programming Language Design and Implementation, 2001.

[51] M. Musuvathi, D. Park, A. Chou, D. Engler, and D. L. Dill, “Cmc: A pragmatic approach to model checking real code,” in Proceedings of the 5th Symposium on Operating System Design and Implementation, Dec 2002.

[52] J. Yang, P. Twohey, D. Engler, and M. Musuvathi, “Using model checking to find serious file systems errors,” in Processings of the 6th Symposium on Operating Systems Design and Implementation, 2004.

[53] C. Cadar, D. Dunbar, and D. R. Engler, “Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008.

[54] A. Bron, E. Farchi, Y. Magid, Y. Nir, and S. Ur, “Applications of synchronization coverage,” in PPoPP ’05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 206–212, 2005.

[55] M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu, “Finding and reproducing heisenbugs in concurrent programs,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 267–280, 2008.

[56] S. Park, S. Lu, and Y. Zhou, “Ctrigger: Exposing atomicity violation bugs from their hiding places,” in Proceedings of International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), 2009.

[57] K. Sen, “Race directed random testing of concurrent programs,” in Proceedings of ACM SIGPLAN conference on Programming language design and implementation (PLDI), pp. 11–21, 2008.

[58] J. Gray, “Why do computers stop and what can be done about it?,” in Proceedings of Symposium on Reliable Distributed Systems (RDS’ 86), pp. 3–12, 1986.

[59] M. Sullivan and R. Chillarege, “Software defects and their impact on system availability – A study of field failures in operating systems,” in Proceedings of the Annual Intl. Symposium on Fault-Tolerant Computing (FTC’91), pp. 2–9, Jun 1991.

[60] A. Borg, W. Blau, W. Graetsch, F. Herrmann, and W. Oberle, “Fault tolerance under UNIX,” ACM Transactions on Computer Systems, vol. 7, no. 1, 1989.

[61] W. Vogels, D. Dumitriu, K. Birman, R. Gamache, M. Massa, R. Short, J. Vert, J. Barrera, and J. Gray, “The design and architecture of the Microsoft Cluster Service,” in Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing, Jun 1998.

[62] Y. Huang, C. Kintala, N. Kolettis, and N. D. Fulton, “Software rejuvenation: Analysis, module and applications,” in Proceedings of International Symposium on Fault-Tolerant Computing (FTC’95), pp. 381–390, Jun 1995.

[63] S. Garg, A. Puliafito, M. Telek, and K. S. Trivedi, “On the analysis of software rejuvenation policies,” in Proceedings of the Annual Conference on Computer Assurance (CA’97), pp. 88–96, 1997.

[64] A. Bobbio and M. Sereno, “Fine grained software rejuvenation models,” in Proceedings of Intl. Computer Performance and Dependability Symposium ICPDS ’98, pp. 4–12, 1998.

[65] G. Candea, J. Cutler, A. Fox, R. Doshi, P. Garg, and R. Gowda, “Reducing recovery time in a small recursively restartable system,” in Proceedings of Intl. Conf. on Dependable Systems and Networks (DSN’02), pp. 605–614, 2002.

[66] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox, “Microreboot – A technique for cheap recovery,” in Proceedings of Symposium on Operating System Design and Implementation (OSDI’04), pp. 31–44, 2004.

[67] A. Borg, J. Baumbach, and S. Glazer, “A message system supporting fault tolerance,” in Proceedings of the 9th Symposium on Operating Systems Principles, Oct 1983.

[68] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, “A survey of rollback-recovery protocols in message-passing systems,” ACM Computer Surveys, vol. 34, no. 3, pp. 375–408, 2002.

[69] B. Randell, P. A. Lee, and P. C. Treleaven, “Reliability issues in computing system design,” ACM Computer Surveys, vol. 10, no. 2, pp. 123–165, 1978.

[70] Y. Chen, J. S. Plank, and K. Li, “CLIP: A checkpointing tool for message-passing parallel programs,” in Proceedings of the 1997 ACM/IEEE Supercomputing Conference, Nov 1997.

[71] D. Johnson and W. Zwaenepoel, “Recovery in distributed systems using optimistic message logging and checkpointing,” in Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing, Aug 1988.

[72] K. Li, J. Naughton, and J. Plank, “Concurrent real-time checkpoint for parallel programs,” in Proceedings of the 2nd ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, Mar 1990.

[73] Y.-M. Wang, Y. Huang, K.-P. Vo, P.-Y. Chung, and C. M. R. Kintala, “Checkpointing and its applications,” in Proceedings of the 25th Annual International Symposium on Fault-Tolerant Computing, Jun 1995.

[74] C. Amza, A. Cox, and W. Zwaenepoel, “Data replication strategies for fault tolerance and availability on commodity clusters,” in Proceedings of the 2000 International Conference on Dependable Systems and Networks, Jun 2000.

[75] Y. Zhou, P. M. Chen, and K. Li, “Fast cluster failover using virtual memory-mapped communication,” in Proceedings of the 1999 ACM International Conference on Supercomputing, Jun 1999.

[76] J. S. Plank, K. Li, and M. A. Puening, “Diskless checkpointing,” IEEE Transactions on Parallel and Distributed Systems, vol. 9, no. 10, pp. 972–986, 1998.

[77] K. P. Birman, Building Secure and Reliable Network Applications, ch. 19. Manning, ISBN: 1-884777-29-5, 1996.

[78] D. E. Lowell and P. M. Chen, “Free transactions with rio vista,” in Proceedings of the 16th Symposium on Operating Systems Principles, Oct 1997.

[79] D. E. Lowell, S. Chandra, and P. M. Chen, “Exploring failure transparency and the limits of generic recovery,” in Proceedings of the 4th Symposium on Operating System Design and Implementation, Oct 2000.

[80] A. Bohra, I. Neamtiu, P. Gallard, F. Sultan, and L. Iftode, “Remote repair of operating system state using backdoors,” in Proceedings of the 2004 International Conference on Autonomic Computing, May 2004.

[81] S. Chandra and P. M. Chen, “Whither generic recovery from application faults? A fault study using open-source software,” in Proceedings of the 2000 International Conference on Dependable Systems and Networks, Jun 2000.

[82] J. Horning, H. Lauer, P. Melliar-Smith, and B. Randell, “A program structure for error detection and recovery,” Lecture Notes in Computer Science, vol. 16, pp. 171–187, 1974.

[83] B. Randell, “System structure for software fault tolerance,” IEEE Transactions on Software Engineering, vol. 1, no. 2, pp. 220–232, 1975.

[84] A. Avizienis and L. Chen, “On the implementation of N-version programming for software fault tolerance during execution,” in Proceedings of the 1st International Computer Software and Applications Conference, Nov 1977.

[85] A. Avizienis, “The N-version approach to fault-tolerant software,” IEEE Transactions on Software Engineering, vol. SE-11, no. 12, 1985.

[86] R. Rodrigues, M. Castro, and B. Liskov, “BASE: Using abstraction to improve fault tolerance,” in Proceedings of the 18th Symposium on Operating Systems Principles, Oct 2001.

[87] M. Rinard, C. Cadar, D. Dumitran, D. M. Roy, T. Leu, and W. S. Beebee, “Enhancing server availability and security through failure-oblivious computing,” in Proceedings of Symposium on Operating System Design and Implementation (OSDI ’04), pp. 21–21, Dec 2004.

[88] E. D. Berger and B. G. Zorn, “Diehard: probabilistic memory safety for unsafe languages,” in Proceedings of ACM SIGPLAN conference on Programming language design and implementation (PLDI’06), pp. 158–168, 2006.

[89] G. Novark, E. D. Berger, and B. G. Zorn, “Exterminator: automatically correcting memory errors with high probability,” in Proceedings of ACM SIGPLAN conference on Programming language design and implementation (PLDI’07), pp. 1–11, 2007.

[90] V. B. Lvin, G. Novark, E. D. Berger, and B. G. Zorn, “Archipelago: trading address space for reliability and security,” in Proceedings of Intl. Conf. on Architectural support for programming languages and operating systems (ASPLOS’08), pp. 115–124, 2008.

[91] F. Qin, J. Tucek, J. Sundaresan, and Y. Zhou, “Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failure,” in Proceedings of ACM Symposium on Operating System Principles (SOSP’05), pp. 235–248, Oct 2005.

[92] J. R. Larus and R. Rajwar, “Transactional memory,” Morgan & Claypool, 2006.

[93] S. Rajamani, G. Ramalingam, V. P. Ranganath, and K. Vaswani, “Isolator: dynamically ensuring isolation in concurrent programs,” in ASPLOS ’09: Proceeding of the 14th international conference on Architectural support for programming languages and operating systems, pp. 181–192, 2009.

[94] P. Ratanaworabhan, M. Burtscher, D. Kirovski, B. Zorn, R. Nagpal, and K. Pattabiraman, “Detecting and tolerating asymmetric races,” in PPoPP ’09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 173–184, 2009.

[95] J. Yu and S. Narayanasamy, “A case for an interleaving constrained shared-memory multi-processor,” in ISCA ’09: Proceedings of the 36th annual international symposium on Computer architecture, pp. 325–336, 2009.

[96] H. Jula, D. Tralamazza, C. Zamfir, and G. Candea, “Deadlock immunity: Enabling systems to defend against deadlocks,” in Proc. 8th USENIX Symp. on Operating Systems Design and Implementation (OSDI), 2008.

[97] G. Misherghi and Z. Su, “Hdd: hierarchical delta debugging,” in Proceedings of the Intl. Conf. on Software engineering (ICSE’06), pp. 142–151, 2006.

[98] A. Zeller, “Isolating cause-effect chains from computer programs,” in Proceedings of ACM SIGSOFT symposium on Foundations of software engineering (FSE’02), pp. 1–10, 2002.

[99] H. Agrawal, R. A. DeMillo, and E. H. Spafford, “An execution-backtracking approach to debugging,” IEEE Software, vol. 8, no. 3, pp. 21–26, 1991.

[100] M. Weiser, “Programmers use slices when debugging,” ACM Commun., vol. 25, no. 7, pp. 446–452, 1982.

[101] X. Zhang, R. Gupta, and Y. Zhang, “Precise dynamic slicing algorithms,” in Proceedings of International Conference on Software Engineering (ICSE’03), pp. 319–329, 2003.

[102] GNU, “GDB: The GNU Project Debugger.”

[103] S. T. King, G. W. Dunlap, and P. M. Chen, “Debugging operating systems with time-traveling virtual machines,” in Proceedings of USENIX Annual Technical Conference (USENIX’05), pp. 1–15, 2005.

[104] S. Srinivasan, C. Andrews, S. Kandula, and Y. Zhou, “Flashback: A light-weight extension for rollback and deterministic replay for software debugging,” in Proceedings of the USENIX 2004 Annual Technical Conference (USENIX’04), pp. 29–44, Jun 2004.

[105] M. Xu, M. D. Hill, and R. Bodík, “A regulated transitive reduction (RTR) for longer memory race recording,” in ASPLOS-XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems, pp. 49–60, 2006.

[106] S. Narayanasamy, G. Pokam, and B. Calder, “BugNet: Continuously recording program execution for deterministic replay debugging,” in International Symposium on Computer Architecture (ISCA), 2005.

[107] G. W. Dunlap, S. T. King, S. Cinar, M. A. Basrai, and P. M. Chen, “ReVirt: Enabling intrusion analysis through virtual-machine logging and replay,” in Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), pp. 211–224, 2002.

[108] S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. H. Lee, and S. Lu, “PRES: probabilistic replay with execution sketching on multiprocessors,” in SOSP ’09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pp. 177–192, 2009.

[109] J. Tucek, S. Lu, C. Huang, S. Xanthos, and Y. Zhou, “Triage: diagnosing production run failures at the user’s site,” in Proceedings of Symposium on Operating systems principles (SOSP’07), pp. 131–144, 2007.

[110] Etnus, LLC., “TotalView.” http://www.etnus.com/TotalView.

[111] S. S. Lumetta and D. E. Culler, “The mantis parallel debugger,” in SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT), 1996.

[112] R. Hood, “The p2d2 project: building a portable distributed debugger,” in SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT), 1996.

[113] S. M. Balle, B. R. Brett, C.-P. Chen, and D. LaFrance-Linden, “Extending a traditional debugger to debug massively parallel applications,” J. Parallel Distrib. Comput., vol. 64, no. 5, pp. 617–628, 2004.

[114] J.-D. Choi, B. P. Miller, and R. H. B. Netzer, “Techniques for debugging parallel programs with flowback analysis,” ACM Trans. on Programming Languages and Systems, vol. 13, no. 4, pp. 491–530, 1991.

[115] R. Hood and G. Matthews, “Efficient tracing for on-the-fly space-time displays in a debugger for message passing programs,” in Intl. Symposium on Cluster Computing and the Grid (CCGRID), 2001.

[116] P. C. Roth, D. C. Arnold, and B. P. Miller, “MRNet: A software-based multicast/reduction network for scalable tools,” in ACM/IEEE Conf. on Supercomputing (SC), 2003.

[117] D. C. Arnold, D. H. Ahn, B. R. de Supinski, G. Lee, B. P. Miller, and M. Schulz, “Stack trace analysis for large scale applications,” in IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS), Mar 2007.

[118] J. DeSouza, B. Kuhn, B. R. de Supinski, V. Samofalov, S. Zheltov, and S. Bratanov, “Automated, scalable debugging of MPI programs with Intel Message Checker,” in Workshop on Software Engineering for High Performance Computing System Applications (SE-HPCS), 2005.

[119] J. S. Vetter and B. R. de Supinski, “Dynamic software testing of MPI applications with Umpire,” in ACM/IEEE Conf. on Supercomputing (SC), 2000.

[120] B. Krammer, K. Bidmon, M. S. Müller, and M. M. Resch, “MARMOT: An MPI analysis and checking tool,” in Parallel Computing (PARCO), 2003.

[121] G. Luecke, H. Chen, J. Coyle, J. Hoekstra, M. Kraeva, and Y. Zou, “MPI-CHECK: a tool for checking Fortran 90 MPI programs,” Concurrency and Computation: Practice and Experience, vol. 15, pp. 93–100, 2003.

[122] R. Hood, K. Kennedy, and J. Mellor-Crummey, “Parallel program debugging with on-the-fly anomaly detection,” in ACM/IEEE Conf. on Supercomputing (SC), 1990.

[123] G. Florez, Z. Liu, S. Bridges, R. Vaughn, and A. Skjellum, “Detecting anomalies in high-performance parallel programs,” in Information Technology: Coding and Computing (ITCC), 2004.

[124] W. Dickinson, D. Leon, and A. Podgurski, “Finding failures by cluster analysis of execution profiles,” in Intl. Conf. on Software Engineering (ICSE), 2001.

[125] C. Yuan, N. Lao, J.-R. Wen, J. Li, Z. Zhang, Y.-M. Wang, and W.-Y. Ma, “Automated known problem diagnosis with event traces,” SIGOPS Oper. Syst. Rev., vol. 40, no. 4, pp. 375–388, 2006.

[126] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer, “Pinpoint: Problem determination in large, dynamic internet services,” in IEEE/IFIP Intl. Conf. on Dependable Systems and Networks (DSN), 2002.

[127] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier, “Using Magpie for request extraction and workload modelling,” in Symposium on Operating Systems Design and Implementation (OSDI), 2004.

[128] P. C. Roth and B. P. Miller, “On-line automated performance diagnosis on thousands of processes,” in ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006.

[129] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen, “Performance debugging for distributed systems of black boxes,” in ACM Symposium on Operating Systems Principles (SOSP), 2003.

[130] K. L. Karavanic and B. P. Miller, “Improving online performance diagnosis by the use of historical performance data,” in ACM/IEEE Conf. on Supercomputing (SC), 1999.

[131] J.-D. Choi, K. Lee, A. Loginov, R. O’Callahan, V. Sarkar, and M. Sridharan, “Efficient and precise datarace detection for multithreaded object-oriented programs,” in PLDI ’02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pp. 258–269, 2002.

[132] C.-S. Park and K. Sen, “Randomized active atomicity violation detection in concurrent programs,” in SIGSOFT ’08/FSE-16: Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pp. 135–145, 2008.

[133] S. Lu, S. Park, E. Seo, and Y. Zhou, “Learning from mistakes: a comprehensive study on real world concurrency bug characteristics,” in Proceedings of international conference on Architectural support for programming languages and operating systems (ASPLOS), pp. 329–339, 2008.

[134] “MySQL.” http://www.mysql.com/.

[135] “Mozilla.” http://www.mozilla.org/.

[136] “Parallel bzip2.” http://compression.ca/pbzip2/.

[137] J. S. Foster, T. Terauchi, and A. Aiken, “Flow-sensitive type qualifiers,” in PLDI ’02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pp. 1–12, 2002.

[138] R. DeLine and M. Fähndrich, “Typestates for objects,” in Proceedings of the European Conference on Object-Oriented Programming (ECOOP), 2004.

[139] M. Das, S. Lerner, and M. Seigle, “ESP: path-sensitive program verification in polynomial time,” in PLDI ’02: Proceedings of the ACM SIGPLAN 2002 Conference on Programming language design and implementation, pp. 57–68, 2002.

[140] P. Centonze, G. Naumovich, S. J. Fink, and M. Pistoia, “Role-based access control consistency validation,” in ISSTA ’06: Proceedings of the 2006 international symposium on Software testing and analysis, pp. 121–132, 2006.

[141] M. Arnold, M. Vechev, and E. Yahav, “QVM: an efficient runtime for detecting defects in deployed systems,” in OOPSLA ’08: Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications, pp. 143–162, 2008.

[142] P. Avgustinov, J. Tibble, and O. de Moor, “Making trace monitors feasible,” in OOPSLA ’07: Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications, pp. 589–608, 2007.

[143] F. Chen and G. Roşu, “MOP: an efficient and generic runtime verification framework,” in OOPSLA ’07: Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems and applications, pp. 569–588, 2007.

[144] S. Shoham, E. Yahav, S. Fink, and M. Pistoia, “Static specification mining using automata-based abstractions,” in ISSTA ’07: Proceedings of the 2007 international symposium on Software testing and analysis, pp. 174–184, 2007.

[145] G. Ammons, R. Bodík, and J. R. Larus, “Mining specifications,” in POPL ’02: Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 4–16, 2002.

[146] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in CGO ’04: Proceedings of the international symposium on Code generation and optimization, p. 75, 2004.

[147] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen, “ReVirt: Enabling intrusion analysis through virtual-machine logging and replay,” in Symposium on Operating Systems Design and Implementation (OSDI), 2002.

[148] M. Sullivan and R. Chillarege, “A comparison of software defects in database management systems and operating systems,” in Intl. Symposium on Fault-Tolerant Computing (FTCS), 1992.

[149] P. Beckman, K. Iskra, K. Yoshii, and S. Coghlan, “Operating system issues for petascale systems,” ACM SIGOPS Operating Systems Review, vol. 40, no. 2, pp. 29–33, 2006.

[150] C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: building customized program analysis tools with dynamic instrumentation,” in Proceedings of Programming language design and implementation (PLDI’05), pp. 190–200, 2005.

[151] M. Noeth, F. Mueller, M. Schulz, and B. de Supinski, “Scalable compression and replay of communication traces in massively parallel environments,” in IEEE Int’l Parallel and Distributed Processing Symposium (IPDPS), 2007.

[152] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” in ACM Symposium on Principles of Database Systems (PODS), June 2002.

[153] InfiniBand Trade Association. http://www.infinibandta.org.

[154] E. F. Krause, Taxicab Geometry: An Adventure in Non-Euclidean Geometry. Dover Publications, 1987.

[155] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient algorithms for mining outliers from large data sets,” in ACM SIGMOD Intl. Conf. on Management of Data (SIGMOD), 2000.

[156] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.

[157] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.

[158] Quadrics. http://www.Quadrics.com.

[159] “Basic linear algebra communication subprograms (BLACS).” http://www.netlib.org/blacs/.

[160] US-CERT, “US-CERT vulnerability notes database.” http://www.kb.cert.org/vuls.

[161] W. A. Arbaugh, W. L. Fithen, and J. McHugh, “Windows of vulnerability: A case study analysis,” Computer, vol. 33, no. 12, pp. 52–59, 2000.

[162] Symantec, “Internet security threat report.” http://www.symantec.com/enterprise/threatreport/index.jsp, Sept 2006.

[163] S. Sidiroglou, M. E. Locasto, S. W. Boyd, and A. D. Keromytis, “Building a reactive immune system for software services,” in Proceedings of USENIX Annual Technical Conference (USENIX’05), pp. 149–161, 2005.

[164] SPEC, “SPEC CPU2000.” http://www.spec.org/cpu2000.

[165] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson, “Hoard: a scalable memory allocator for multithreaded applications,” in Proceedings of Intl. Conf. on Architectural support for programming languages and operating systems (ASPLOS’00), pp. 117–128, 2000.

[166] D. E. Lowell and P. M. Chen, “Discount checking: Transparent, low-overhead recovery for general applications,” tech. rep., CSE-TR-410-99, University of Michigan, Jul 1998.

[167] M. D. Bond and K. S. McKinley, “Tolerating memory leaks,” in Proceedings of Intl. Conf. on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA ’08), pp. 109–126, Oct. 2008.

[168] Y. Tang, Q. Gao, and F. Qin, “Leaksurvivor: Towards safely tolerating memory leaks for garbage-collected languages,” in Proceedings of USENIX Annual Technical Confer- ence (USENIX’08), pp. 307–320, Jun. 2008.

[169] W. Gu, Z. Kalbarczyk, R. K. Iyer, and Z. Yang, “Characterization of Linux kernel behavior under errors,” in Proceedings of Intl. Conf. on Dependable Systems and Networks (DSN’03), pp. 459–468, Jun 2003.

[170] D. Lea, “A Memory Allocator,” 1996.
