COORDINATED CHECKPOINT/RESTART PROCESS FAULT TOLERANCE FOR MPI APPLICATIONS ON HPC SYSTEMS

Joshua Hursey

Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Department of Computer Science, Indiana University, July, 2010.

Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Doctoral Committee

Andrew Lumsdaine, Ph.D.

Randall Bramley, Ph.D.

Arun Chauhan, Ph.D.

Beth Plale, Ph.D.

Frank Mueller, Ph.D.

June 30, 2010

Copyright 2010 Joshua Hursey. All rights reserved.

To Dad and Mom for their unwavering support over these many years.

Acknowledgements

A special thank you to my advisor, Andrew Lumsdaine, for providing me with a unique working environment full of opportunities and challenges that have molded me into the researcher that I am today. My research perspective has been strongly influenced by his dedication to open, repeatable scientific exploration. His ability to coalesce and analyze a vast range of scientific domains to highlight new research avenues continues to amaze me.

I would like to express my gratitude to my research committee for their support during the development of the research contained in this dissertation. Whether in a formal classroom setting or informal conversation, their expertise has been an invaluable asset. This dissertation was shaped by their insightful questions and suggestions.

I thank my colleagues in the Computer Science department and Open Systems Laboratory at Indiana University. A special thanks to Jeffrey M. Squyres, Brian Barrett, Torsten Hoefler, DongInn Kim, Timothy I. Mattox, Joseph A. Cottam, Andrew Friedley, Abhishek Kulkarni, and Timothy Prins for their collaborative help over the years. I would also like to thank Laura Hopkins for helping me polish this dissertation.

I would like to thank the Open MPI, Berkeley Lab Checkpoint/Restart (BLCR), and MPI Testing Tool (MTT) developer communities for their patience and support, without which this dissertation would not have been possible. In particular, I would like to thank Ralph Castain, Paul H. Hargrove, Jeffrey M. Squyres, Richard Graham, and Ethan Mallove for the many hours they have spent with me discussing various design decisions. I have learned so much from each of you, for which I am truly grateful.

Many thanks to the staff of the Computer Science department, Computer Systems Group, and Pervasive Technology Institute. Lucy Battersby, Debbie Canada, Becky Curtis, Sherry Kay, Rebecca Lowe, Ann Oxby, and Jacqueline Whaley have been great assets in helping me throughout my graduate career. Rob Henderson, Shing-Shong (Bruce) Shei, and T.J. Jones are some of the best system administrators I have ever had the pleasure of working with. Thank you for your patience and support as I experimented and stressed Odin and Sif beyond their comfort zones.

I owe a debt of gratitude to Charles Peck for sparking my interest in both teaching and research. His energy and passion for education, research, and life have left a lasting impression on me. Thank you for helping me find my path and showing me how to have fun with what I do.

Graduate school has been a taxing endeavor full of both joy and pain. To all of my friends at Indiana University and around the world, thank you for your support during this process. In particular, I would like to thank Craig Shue and Andrew Kalafut for helping me celebrate the good times and get through the rough times.

To my dear wife, Samantha Foley, you have been and continue to be a steadfast source of love, support, encouragement and kindness. Thank you for keeping me grounded and reminding me to not take things too seriously. I love you very much, and appreciate you every day.

Finally, I would like to thank my family not only for their unwavering love and support, but for instilling in me a work ethic and a set of values without which I could not have come this far. I am forever grateful for the hard work and sacrifice of my parents and grandparents, which has provided me the opportunity to achieve something uniquely significant in our family’s history.

This work was made possible by the generous support of the following grants and institutions: the Lilly Endowment; Lawrence Berkeley National Laboratory; Los Alamos National Laboratory; Oak Ridge National Laboratory; National Science Foundation grants NSF ANI-0330620, CDA-0116050, and EIA-0202048; and U.S. Department of Energy grant DE-FC02-06ER25750ˆA003.

Abstract

Scientists use advanced computing techniques to assist in answering the complex questions at the forefront of discovery. The High Performance Computing (HPC) scientific applications created by these scientists are running longer and scaling to larger systems. These applications must be able to tolerate the inevitable failure of a subset of processes (process failures) that occur as a result of pushing the reliability boundaries of HPC systems. HPC system reliability is emerging as a problem in future exascale systems where the time to failure is measured in minutes or hours instead of days or months. Resilient applications (i.e., applications that can continue to run despite process failures) depend on resilient communication and runtime environments to sustain the application across process failures. Unfortunately, these environments are uncommon and not typically present on HPC systems. In order to preserve performance, scalability, and scientific accuracy, a resilient application may choose the invasiveness of the recovery solution, from completely transparent to completely application-directed. Therefore, resilient communication and runtime environments must provide customizable fault recovery mechanisms.

Resilient applications often use rollback recovery techniques for fault tolerance: particularly popular are checkpoint/restart (C/R) techniques. HPC applications commonly use the Message Passing Interface (MPI) standard for communication. This thesis identifies a complete set of capabilities that compose to form a coordinated C/R infrastructure for MPI applications running on HPC systems. These capabilities, when integrated into an MPI implementation, provide applications with transparent, yet optionally application-configurable, fault tolerance. By adding these capabilities to Open MPI we demonstrate support for C/R process fault tolerance, automatic recovery, proactive process migration, and parallel debugging. We also discuss how this infrastructure is being used to support further research into fault tolerance.

Contents

List of Tables xi

List of Figures xiii

List of Acronyms xvii

Chapter 1. Introduction 1

Chapter 2. Background and Related Work 6
1. Distributed Fault Detection 8
2. Fault Prediction 10
3. Fault Recovery: Algorithm Based Fault Tolerance 11
4. Fault Recovery: Checkpoint/Restart 12
5. Fault Recovery: Message Logging 28
6. Fault Recovery: Replication 30
7. Debugging 31
8. Process Migration 34
9. File Systems 35
10. Compiler Based Techniques 39

Chapter 3. Checkpoint/Restart Infrastructure 41
1. Open MPI Architecture 43
2. External Tools and Interfaces 46
3. Checkpoint/Restart Service 51
4. Checkpoint/Restart Coordination Protocol 53
5. Interlayer Notification Callback 58
6. Stable Storage 61
7. File Management 62
8. Snapshot Coordination 63
9. Performance Results 67
10. Conclusion 82

Chapter 4. Process Migration and Automatic Recovery 84
1. Error Management and Recovery Policy 85
2. Error Management and Recovery Policy Implementation 86
3. Runtime Stabilization ErrMgr Component 88
4. Automatic Recovery ErrMgr Component 89
5. Process Migration ErrMgr Component 90
6. Performance Results 92
7. Conclusion 98

Chapter 5. Application Interaction 99
1. Application Programming Interface 100
2. Checkpoint/Restart-Enabled Debugging 104
3. Conclusion 118

Chapter 6. Conclusions 119
1. Future Work 121

Bibliography 124

Appendix A. Checkpoint/Restart Application Programming Interface 148
1. Checkpoint/Restart Interface 148
2. Quiescence Interface 151
3. Process Migration Interface 154
4. Interlayer Notification Callback Callbacks 155

Appendix B. Command Line Tools 158
1. ompi-checkpoint 158
2. ompi-restart 160
3. ompi-migrate 161

Appendix C. self Checkpoint/Restart Service (CRS) 162
1. Interface 163
2. Compiling 164
3. Running 164
4. Example Application 164

Appendix D. Nonblocking Process Creation and Management Operations 168
1. Process Manager Interface 169
2. Starting Processes and Establishing Communication 169
3. Establishing Communication 172

List of Tables

3.1 Interlayer Notification Callback (INC) states used in the ft event function. 59

3.2 NetPIPE 1 byte latency and bandwidth illustrating CRCP framework failure-free overhead. 70

3.3 Checkpoint overhead analysis for NAS Parallel Benchmarks (NAS) Parallel Benchmark LU Class C with 32 processes using the central Stable Storage (SStore) component. Global snapshot is 1GB or 32MB per process. 75

3.4 Checkpoint overhead analysis for NAS Parallel Benchmark EP Class D with 32 processes using the central SStore component. Global snapshot is 102MB or 3.2MB per process. 75

3.5 Checkpoint overhead analysis for NAS Parallel Benchmark BT Class C with 36 processes using the central SStore component. Global snapshot is 4.2GB or 120MB per process. 75

3.6 Checkpoint overhead analysis for NAS Parallel Benchmark SP Class C with 36 processes using the central SStore component. Global snapshot is 1.9GB or 54MB per process. 76

3.7 Checkpoint overhead analysis for GROMACS DPPC running with 8 processes using the central SStore component. Global snapshot is 267MB or 33MB per process. 76

3.8 Checkpoint overhead analysis for GROMACS DPPC running with 16 processes using the central SStore component. Global snapshot is 473MB or 30MB per process. 76

3.9 Effects of staging and compression on application performance and checkpoint overhead on the noop application. 80

3.10 Effects of staging and compression on application performance and checkpoint overhead on the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) application. 81

3.11 Effects of staging and compression on application performance and checkpoint overhead on the Parallel Ocean Program (POP) application. 81

3.12 Effects of staging and compression on application performance and checkpoint overhead on the High-Performance Linpack (HPL) application. 82

5.1 Open MPI INC function callback events and states. 104

5.2 MPIR variable states associated with use case scenarios. 113

A.1 Open MPI INC function callback events and states. 157

B.1 Open MPI ompi-checkpoint arguments. 159

B.2 Open MPI ompi-restart arguments. 160

B.3 Open MPI ompi-migrate arguments. 161

List of Figures

2.1 Illustration of the three communication channel states. White blocks indicate checkpoints of the individual processes. Dashed lines indicate global cuts of the system. (C) is a consistent state. (T) is a transitless state. (SC) is a strongly consistent state. 21

3.1 Illustration of the three layers in Open MPI including some notable Modular Component Architecture (MCA) frameworks. The highlighted, dashed boxes represent the capabilities discussed in this dissertation. 44

3.2 Organization of Open MPI Checkpoint/Restart (C/R) frameworks in a general MPI application. Job-level control is positioned in the Runtime. Process-level control is positioned in the Process. 47

3.3 Local Snapshot Reference Directory Structure and Metadata Contents. 49

3.4 Global Snapshot Reference Directory Structure and Metadata Contents. 50

3.5 Illustration of Open MPI Handling a Checkpoint Request. Includes the optional application layer INC callback. 60

3.6 Illustration of the global, local, and application Snapshot Coordination (SnapC) coordinators in Open MPI. 65

3.7 Illustration of Open MPI C/R frameworks participating in a distributed checkpoint of a running MPI application. 3D boxes represent nodes containing white application processes. Rounded boxes represent runtime support processes. 65

3.8 Continuous latency test with 8 processes exchanging an 8KB message and migrating between different machine configurations. Spikes indicate time spent on disk between the checkpoint and the subsequent restart. 69

3.9 Performance impact of checkpointing NAS Parallel Benchmarks LU and EP to NFS and local disk. 71

3.10 Performance impact of checkpointing NAS Parallel Benchmarks BT and SP to NFS and local disk. 72

3.11 Performance impact of checkpointing 8 and 16 processes with GROMACS DPPC to NFS and local disk. 74

3.12 This illustrates the performance effect of the central, stage, and stage with compression SStore components on application latency. Application latency determined by a continuous latency application using 64 processes exchanging an 8KB message in a ring topology. 77

3.13 Checkpoint overhead impact of various SStore components on three HPC applications: POP, LAMMPS, HPL. 79

4.1 Four MPI processes running on two machines using the C/R functionality in Open MPI to periodically checkpoint the application (white squares indicate checkpoints). Process 2 fails unexpectedly and all of the processes are automatically, transparently rolled back to their last checkpoint and continue execution using the autor Error Management and Recovery Policy (ErrMgr) component. 90

4.2 Four MPI processes running on two machines using the checkpoint/restart functionality in Open MPI to periodically checkpoint the application (white squares indicate checkpoints). The system administrator identifies Node B as going down for maintenance. Open MPI transparently checkpoints the processes running on that machine and migrates them to Node C and continues execution using the crmig ErrMgr component. 91

4.3 Performance impact of various SStore component configurations on automatic recovery. 93

4.4 Performance impact of various SStore component configurations on process migration on a range of job sizes. 96

4.5 Performance impact of various SStore component configurations on process migration on a range of migrating processes in a 64 process job. 97

5.1 Open MPI Checkpoint and Restart Application Programming Interfaces (APIs). 100

5.2 Open MPI Quiescence APIs. 102

5.3 Open MPI Migration API. 102

5.4 Illustration of Open MPI handling a checkpoint request with the application involved in the INC. 103

5.5 Open MPI INC Registration API. 104

5.6 Open MPI self CRS default callbacks, and registration functions. 105

5.7 Illustration of the Debugger Detach and Re-attach Problems. 107

5.8 Debugger MPIR function pseudo code. 108

5.9 MPI registered CRS hook callback function pseudo code. 109

5.10 Illustration of the design for each of the use case scenarios. 110

5.11 Additional MPIR symbols. 112

5.12 Pseudo code of Open MPI’s checkpoint preparation INC function. 116

5.13 Pseudo code of Open MPI’s checkpoint continue/restart INC function. 117

5.14 Open MPI’s signal handler function to support stack modification. 118

A.1 Process Migration API Example. 156

A.2 Open MPI INC Registration API. 156

B.1 Open MPI ompi-checkpoint, ompi-restart, and ompi-migrate commands. 159

B.2 Open MPI ompi-checkpoint example. 160

List of Acronyms

ABFT Algorithm-Based Fault Tolerance ...... 7
API Application Programming Interface ...... 46
BLCR Berkeley Lab Checkpoint/Restart ...... 52
BML BTL Management Layer ...... 45
BTL Byte Transfer Layer ...... 45
CIFTS Coordinated Infrastructure for Fault Tolerant Systems ...... 46
C/R Checkpoint/Restart ...... 41
CRCP Checkpoint/Restart Coordination Protocol ...... 42
CRS Checkpoint/Restart Service ...... 42
ECC Error Correcting Codes ...... 9
ErrMgr Error Management and Recovery Policy ...... 42
FIFO First-In-First-Out ...... 55
FileM File Management ...... 42
FTB Fault Tolerance Backplane ...... 46
GrpComm Group Communication ...... 45
HNP Head Node Process ...... 47
HPC High Performance Computing ...... 43
HPL High-Performance Linpack ...... 68
INC Interlayer Notification Callback ...... 42
IPC Inter-Process Communication ...... 8
LAMMPS Large-scale Atomic/Molecular Massively Parallel Simulator ...... 68
MCA Modular Component Architecture ...... 43
MPI Message Passing Interface ...... 42
MTTF Mean Time to Failure ...... 2
NAS NAS Parallel Benchmarks ...... 67
NFS Network File System ...... 70
ODLS ORTE Daemon Local Launch Subsystem ...... 45
OMPI Open MPI ...... 44
OOB Out-Of-Band ...... 45
OPAL Open Portable Access Layer ...... 44
ORTE Open MPI Runtime Environment ...... 44
PFS Parallel File System ...... 36
PID Process Identifier ...... 47
PLM Process Lifecycle Management ...... 45
PML Point-to-Point Management Layer ...... 44
POP Parallel Ocean Program ...... 68
PVM Parallel Virtual Machine ...... 22
RAID Redundant Array of Independent Disks ...... 61
RAM Random Access Memory ...... 9
RDMA Remote Direct Memory Access ...... 54
RMapS Resource Mapping Subsystem ...... 45
RML Runtime Messaging Layer ...... 45
Routed Routing Table ...... 45
SAN Storage Area Network ...... 36
SnapC Snapshot Coordination ...... 42
SPMD Single-Process Multiple-Data ...... 16
SStore Stable Storage ...... 42

1 Introduction

Scientists use High Performance Computing (HPC) systems to solve complex scientific problems that, due to memory or compute performance limitations, either cannot be solved or are impractical to solve on more traditional computing systems. These HPC systems provide scientific applications with the parallel processing tools and compute resources necessary to distribute the application across many compute resources. These tools allow applications to solve larger problems, potentially in shorter periods of time. For an application, HPC system reliability is typically determined by the number of components in the system being used by that application. As that number of components increases, the system becomes less reliable because the opportunity for component failure increases with the number of components active, although not necessarily linearly. Unfortunately, many complex scientific applications often exceed the reliability of a given HPC system. Administrators of large HPC systems often measure system reliability, in terms of Mean Time to Failure (MTTF), in days or weeks [236]. However, administrators of future exascale HPC systems will likely measure system reliability in minutes or hours, further exposing the application to the risk of failure during normal computation [41]. If the application is not prepared for such inevitable failures, then it runs the risk of losing the entire computation, which may have taken hours or days to generate. The HPC system reliability problem is becoming so prevalent that some HPC system manufacturers are advising application developers to prepare for inevitable failure by incorporating fault tolerance techniques, such as rollback recovery, into the application [94, 244].

Applications that are prepared to handle inevitable failures are called resilient applications. Rollback recovery techniques, often used by resilient applications, allow the application to take proactive action during normal computation, the results of which can be used to recover the computation after a failure. Checkpoint/Restart (C/R) is a particularly popular rollback recovery fault tolerance technique. C/R techniques periodically save application computational state information to a stable storage device during normal computation. Upon failure, the state information can be accessed from stable storage to recover the computation. Ideally, the C/R implementation would be transparent to the application, thus reducing the complexity of the application code. In order to provide a completely transparent C/R implementation on HPC systems, a set of capabilities must be organized in the underlying support software (i.e., Message Passing Interface (MPI) and runtime environments) to ensure correctness and consistency of the checkpoint operation. Previous C/R implementations have often struggled to remain relevant research platforms due to the lack of an extensible architecture. Such an architecture is only formed by first clearly identifying a complete set of required C/R capabilities. Transparent C/R solutions on HPC systems often use a fully coordinated technique to guarantee C/R consistency. Applications can usually improve the performance of the C/R operation by providing optional hints and guidance to the support software regarding application state. However, a transparent, coordinated C/R solution will operate correctly without such guidance from the application. Users of scientific applications and HPC system administrators, so-called end users, often request the following three services from a C/R implementation:

• Reactive fault recovery
• Proactive process migration
• Parallel debugging assistance

Reactive fault recovery allows an application to automatically recover from a failure while running, or manually recover from job termination due to time limitations on a given HPC system. Proactive process migration allows an application to avoid predicted future failure by moving to a more reliable system during normal computation. C/R-enabled parallel debugging reduces the time spent debugging long-running applications by returning a developer to an intermediary point in the computation closer to the bug.

As can be seen in Chapter 2, previous research has investigated techniques for implementing the whole or part of each of these services. Unfortunately, no previous single implementation has presented an organizational infrastructure with clearly defined capabilities that can provide all three services. When we approached the problem of adding C/R support to the Open MPI project, we wanted to design a solution that was maintainable for developers, extensible for researchers, and provided end users with the three services previously mentioned. From previous research and our own experimentation, we identified a complete set of C/R capabilities that, when organized appropriately, achieve those project goals. In the context of this dissertation, a capability is a distinguishable abstraction in the C/R system design. In this dissertation we present an organizational structure for the following identified C/R-related capabilities that compose to provide MPI applications with a transparent, but optionally configurable, coordinated C/R solution:

• Checkpoint/Restart Service (CRS): The interface to the single process C/R system provided by or for the system in order to capture an image of a running process for later recovery.
• Checkpoint/Restart Coordination Protocol (CRCP): The C/R coordination protocol implementation that marshals the network state to guarantee a consistently recoverable distributed state upon restart [48].
• Interlayer Notification Callback (INC): Notifying and coordinating subsystems of the MPI implementation around various checkpoint related activities (e.g., checkpoint, restart, migration).
• Stable Storage (SStore): A logical stable storage device abstraction encapsulating where and when local snapshots are stored in the distributed environment to form a global snapshot.
• File Management (FileM): The movement of snapshot related files and directories to and from storage devices, possibly across file system and node visibility boundaries.
• Snapshot Coordination (SnapC): Checkpoint life-cycle management: distributing the checkpoint request to all participating processes, monitoring their progress, and synchronizing the final local snapshots to a logical stable storage device.
• Error Management and Recovery Policy (ErrMgr): Error reporting and fault recovery management operations, including support for preventative actions such as process migration.
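To make the division of responsibilities concrete, the sketch below shows one way the seven capabilities could be expressed as pluggable component interfaces in C. This is an illustrative sketch only; the structure names and function signatures are hypothetical and do not correspond to the actual Open MPI MCA framework APIs described in Chapter 3.

    #include <sys/types.h>

    /* Hypothetical component interfaces for the seven C/R capabilities.
     * Names and signatures are illustrative; they are not the Open MPI APIs. */

    typedef struct { char *snapshot_ref; } snapshot_t;  /* handle to a snapshot */

    typedef struct {   /* CRS: capture/restore a single process image */
        int (*checkpoint)(pid_t pid, snapshot_t *local);
        int (*restart)(const snapshot_t *local);
    } crs_module_t;

    typedef struct {   /* CRCP: marshal channel state for a consistent cut */
        int (*quiesce)(void);
        int (*resume)(void);
    } crcp_module_t;

    typedef struct {   /* INC: notify MPI subsystems of checkpoint/continue/restart */
        int (*ft_event)(int state);
    } inc_module_t;

    typedef struct {   /* SStore: where/when local snapshots form a global snapshot */
        int (*put)(const snapshot_t *local);
        int (*get)(snapshot_t *local);
    } sstore_module_t;

    typedef struct {   /* FileM: move snapshot files across node/file-system boundaries */
        int (*transfer)(const char *src_uri, const char *dst_uri);
    } filem_module_t;

    typedef struct {   /* SnapC: drive the checkpoint life cycle across all processes */
        int (*request_global_checkpoint)(void);
    } snapc_module_t;

    typedef struct {   /* ErrMgr: error reporting and recovery/migration policy */
        int (*process_fault)(int rank);
        int (*migrate)(int rank, const char *target_node);
    } errmgr_module_t;

In Open MPI each of these capabilities is realized as a framework with interchangeable components, which is what allows, for example, different SStore strategies (central, staged, staged with compression) to be selected at run time.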

Identifying the necessary capabilities that form a C/R fault tolerance implementation will assist future implementations in creating more flexible designs and continue to support research innovation into C/R fault tolerance. Even though a C/R implementation may choose to combine one or more capabilities together, in Open MPI we demonstrated that a C/R implementation can support the three identified services without making such a compromise. Additionally, the seven capabilities presented in this dissertation can also be used to support research into alternative fault tolerance techniques beyond coordinated C/R such as run-through stabilization, replication, and message logging. As HPC reliability declines, research into fault tolerance techniques must quickly adapt to the needs of the applications on these systems. A composable design with clearly identifiable capabilities allows C/R researchers to match their research efforts to the pace of the application’s reliability requirements.

By integrating the seven capabilities into the Open MPI project we confirmed that they form a complete, functional set that is able to support real MPI scientific applications. There is often a performance cost to adding fault tolerance capabilities to an application. In this dissertation we investigated the performance implications of the C/R implementation in Open MPI on a variety of benchmarks and real applications including High-Performance Linpack (HPL), Parallel Ocean Program (POP) and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS). The Open MPI implementation takes care to preserve performance across process recovery by uniquely allowing rediscovery of the interconnects between processes upon recovery. The time to save a checkpoint to stable storage is the predominant performance bottleneck in a transparent C/R solution. We present an analysis of a checkpoint staging technique that reduces checkpoint overhead by overlapping checkpoint establishment with normal computation. We also investigated the impact of checkpoint caching and compression on checkpoint overhead and latency. A novel, composable implementation of the ErrMgr capability allows applications to tailor recovery techniques at run time to best support their requirements.

Finally, we developed a set of Application Programming Interfaces (APIs) and command line tools for Open MPI that were designed to increase end user adoption and third party software integration. Applications are provided a set of optional API functions to guide the C/R related activities from within a process. These tools and APIs are designed to allow end users to interact with the C/R solution without requiring them to know all of the details about how it was deployed on a particular HPC system.

2 Background and Related Work

Reliable computing techniques focus on providing a computing environment that can be trusted to work within expected parameters, usually up to a given time bound [14]. Resilient computing extends this bound by transparently detecting and recovering from a defined set of failures [6, 14]. Beyond this defined set of failures, a best-effort attempt is made to sustain the computing environment either through graceful degradation (a.k.a. partial fault-tolerance, or fail-soft operation) or by breaking transparency by involving the application in recovery. The reliability of a system is usually measured by the mean-time-to-failure (MTTF), which is defined as the expected time of normal operation between two consecutive faults [1].
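To make the scaling argument from Chapter 1 concrete, assume (purely for illustration; this model is not taken from the cited sources) that a system is built from N components whose failures are independent and exponentially distributed with a per-component mean time to failure M_c. The MTTF of the whole system is then approximately

    M_{\mathrm{sys}} \approx \frac{M_c}{N}

so a machine with 10,000 components, each with a five-year MTTF (roughly 43,800 hours), would under this simple model experience a failure about every 4.4 hours, which is in line with the minutes-to-hours projections discussed in Chapter 1.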

Fault intolerance attempts to reduce the unreliability of the computing environment before the application begins execution [14]. This usually involves acquiring the most reliable hardware and software components at the time of construction, maintaining these components over the life of the computing environment, and developing techniques for initializing these components just before application launch in order to provide a reliable computing environment during the entirety of the application execution. Fault intolerance techniques support reliable computing by providing applications with bounds on the sustained reliability the computing environment can provide at the beginning of application execution with a high degree of probability.

Fault tolerance uses protective techniques to provide a resilient computing environment in the presence of an expected set of component failures [14]. Protective techniques to provide fault tolerance can be classified into two broad categories: Algorithm-Based Fault Tolerance (ABFT) and redundancy. ABFT focuses on choosing algorithms that can withstand the failure of one or more computational tasks and continue computation by working around the loss of the data operated on by the lost computational task(s). ABFT approaches may employ some redundancy techniques, but are not strictly required to do so.

Redundancy is a commonly employed technique for providing fault tolerance. Hardware redundancy relies on multiple, similar physical components that can be activated in order to recover transparently from the failure of the peer component [15]. Processor hardware redundancy is one example, where a multi-processor system allows for the loss or replacement of a physical processor while the system is running by transferring all tasks in execution from the lost processor to others on the system without any loss in transparency to the application. Redundant Array of Independent Disks (RAID) [202] based techniques have also been designed to provide transparent failure recovery of hard disks by managing redundant physical disks.

Software redundancy relies on the replication of a computational task that can be used to continue operation when one or more of the replica tasks have been lost due to component failure in the computing environment [6, 157]. The maintenance of the replicas typically involves distributed election and consensus algorithms that grow in complexity in response to not only the size of the computing environment but also to the number and types of failures that need to be handled.

Time-based redundancy or rollback recovery techniques rely on the re-execution of all or part of the application in order to recover from a component failure in the computing environment. In checkpoint and restart rollback recovery, snapshots (or checkpoints) of the application state are saved to a stable storage device during normal execution. A stable storage device is a logical device that survives the maximum number of failures in the system. These checkpoints are then used to restore the application to a previous execution state after a failure [141]. The application must re-execute the amount of lost work from the state restored from the checkpoint up to the point of the failure.

In a distributed computing environment, where individual processes interact with one another through various events (typically Inter-Process Communication (IPC)), care must be taken to maintain transparency of recovery of one or more tasks in the context of a larger dependent network of failure-free processes. Event logging rollback recovery techniques focus on such interactions by writing to stable storage the contents and/or ordering of all external events that influence the execution of the process in order to deterministically replay these events upon re-execution of the failed task. Message logging is a commonly used sub-domain of event logging that focuses just on the logging of messages between processes in the system. When used alone, event logging techniques require the re-execution of the entire task. Since this can cause excessive time delays during recovery, event logging is usually combined with checkpoint/restart techniques to resume re-execution from an intermediary computational state instead of the beginning of execution.

1. Distributed Fault Detection

In a single system, an error is generated when a physical defect, called a fault, is detected. A system failure is when the system cannot deliver its intended function because of one or more errors [1]. A fault-tolerant system will continue to operate normally in the presence of errors. Individual faults are generally classified into one of three categories: permanent, transient, and intermittent [1]. A permanent fault is a fault that continues to exist until it is repaired. A transient fault is a fault that occurs and disappears at an unknown frequency. An intermittent fault is a fault that occurs and disappears at a known frequency.

A distributed system is composed of many systems, each with one or more processes working together towards a shared goal. The failure of one process in a system can lead to the failure of the entire distributed system. A process failure is often classified in one of three categories: fail-stop, omission, and Byzantine [80]. A fail-stop failure is when a process is permanently stopped, often due to a crash. An omission failure is when a process fails to send or receive messages correctly. A Byzantine failure is when a process continues operating but propagates erroneous messages or data [159]. Byzantine failures are often caused by undetected faults or mishandled errors in the system.

Soft errors are transient errors often caused by radiation [180]. Soft errors are often seen in Random Access Memory (RAM) and protected using Error Correcting Codes (ECC) [294], but they have not been well addressed in other parts of the system, including the CPU [180]. Since soft errors can be difficult to detect, they often manifest themselves as Byzantine process faults. Since Byzantine failures are difficult and costly to detect in a distributed system [80], we focus our attention in this dissertation on permanent, fail-stop process failures.

The ability to detect a fail-stop failure in a distributed system is central to any fault tolerant solution. It has been shown that it is impossible to accurately detect even a single failure in an asynchronous system [98] because a failed process is indistinguishable from a process that is running very slowly. Branching from the impossibility result in [98], researchers have explored fault detection mechanisms that allow for unreliable failure detectors [47] and partially synchronous systems [72, 80]. The aspects critical to any fault detection algorithm are those of completeness and accuracy, as outlined by [47]. Completeness requires a failure detector in the system to eventually suspect every process that actually crashes. Accuracy restricts the mistakes that the failure detector can make during the life of the system.

The fundamental programmatic building block of every fault detection mechanism is that of a push or pull model. In a push model, a message is sent from a process to its monitor on a regular, predefined schedule. In a pull model, a message is sent from the monitor to a process, and the process must then respond with a message in a defined amount of time. The pull model contains more overhead in terms of the number of messages that need to be generated per process, but it is able to detect additional failure modes, such as network partition, that push models cannot.

Building on these fundamental concepts, research into scalable fault detection mechanisms seeks to reduce the number of processes being monitored by any single process while still providing high confidence in the fault detection mechanism in terms of completeness, accuracy, and performance. Gossip-style failure detectors distribute fault information to a random (or pseudo-random) limited set of peers that then propagate this information to another random set of peers [220, 271]. Gossip-style failure detectors are closely related to randomization fault detection techniques [121] since each builds upon an epidemic computational model [69].

Hierarchical failure detectors have also been explored to increase performance in a large-scale system by employing a failure detection algorithm through each level of the hierarchy [20]. This research also explores using different fault detection mechanisms at different levels in the hierarchy in order to balance performance and completeness requirements.
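As a minimal sketch of the push model (hypothetical code; it is not drawn from any of the detectors cited above), each process periodically pushes a heartbeat to its monitor, and the monitor suspects any process whose heartbeat has not arrived within a timeout. The sketch also illustrates the completeness/accuracy trade-off: a crashed process is eventually suspected, but a slow process or network can be suspected falsely.

    #include <stdbool.h>
    #include <time.h>

    #define NPROCS  64      /* number of monitored processes (illustrative)         */
    #define TIMEOUT 5.0     /* seconds without a heartbeat before suspecting a peer */

    static time_t last_heartbeat[NPROCS];

    /* Monitor start-up: pretend every process has just been heard from. */
    void monitor_init(void) {
        time_t now = time(NULL);
        for (int r = 0; r < NPROCS; r++)
            last_heartbeat[r] = now;
    }

    /* Invoked whenever a push-model heartbeat message arrives from rank r. */
    void heartbeat_received(int r) {
        last_heartbeat[r] = time(NULL);
    }

    /* Polled periodically by the monitor: true if rank r is currently suspected. */
    bool suspected(int r) {
        return difftime(time(NULL), last_heartbeat[r]) > TIMEOUT;
    }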

2. Fault Prediction

The ability to predict impending failure, and take preventative action, is central to the topic of proactive fault tolerance, which usually involves migrating processes away from the predicted failure [90, 267, 277]. Prediction techniques usually base their models on historical data and real-time events. Historical data is usually derived from mining event logs [124, 199, 206, 236]. Real-time events are usually derived from hardware sensors, which are becoming more prevalent in computing hardware [100].

The prediction algorithms that use these data can take a variety of forms [50, 232]. They must be able to determine both when and where a failure is going to occur with a sufficiently high degree of accuracy [290]. Typically a fault prediction service depends on a separate process in the system to determine the appropriate action to avoid the failure, for example, initiating process migration [168]. This process must account for the additional stress on the system caused by this preventative action, which might affect the reliability of the system.

It has been suggested that there is a strong need for system-level monitoring and prediction services in large HPC systems in order to support fault tolerance activities [223]. It has also been noticed that a failure predictor building upon such information can have a significant impact upon system reliability even at low prediction accuracies [197].

3. Fault Recovery: Algorithm Based Fault Tolerance

Algorithm-Based Fault Tolerance (ABFT) techniques require specialized algorithms that are able to adapt to and recover from process loss due to hardware failure [131]. ABFT techniques typically require data encoding, algorithm redesign, and diskless checkpointing [207]. These techniques require a resilient message passing environment (e.g., FT-MPI [93]) that can continue running when a process is lost, and possibly allow the recovery of the lost process. Matrix operations have been the focus of many applications of ABFT [52, 53, 54, 75, 131, 149, 160], though recently it has been applied to heat transfer problems [171]. Manager/worker programming techniques could be classified as ABFT since the loss of a worker process can be recovered by the manager if it maintains the work unit assigned to the worker [164].

The methods for generating the data encoding or checksum must consider the algorithm that they are encoding, the degree of fault tolerance required (e.g., the number of acceptable concurrent failures it can handle), and performance [53, 54]. Some algorithms forgo data encoding by taking advantage of the ability to recalculate the values from the current solution due to special properties of the algorithm [75]. The method of storing the data encoding on peer processors (also known as diskless checkpointing) affects the performance and degree of fault tolerance of the implementation [55, 149].

There is a slight difference between ABFT and natural fault tolerance as described in [88, 107]. Both require algorithm changes to prepare for process failure. ABFT uses a combination of data encoding and diskless checkpointing to preserve state that could be lost. Natural fault tolerance techniques focus on algorithms that can withstand the loss of a process and still get an approximately correct answer, usually without the use of data encoding or checkpointing. So natural fault tolerance can be viewed as a more general form of ABFT.
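As a concrete example of ABFT-style data encoding, the classic checksum construction for matrix operations (a standard result from the ABFT literature, stated generically rather than as the scheme of any single citation above) appends a checksum row formed by summing the rows of A:

    A_c = \begin{pmatrix} A \\ e^{T} A \end{pmatrix}, \qquad e = (1, 1, \ldots, 1)^{T}

Because the checksum relation is preserved by matrix multiplication, a row of the result lost with a failed process can be reconstructed by subtracting the surviving rows from the checksum row, without rolling the computation back.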

4. Fault Recovery: Checkpoint/Restart

Checkpoint and restart rollback recovery is a technique used to reduce the amount of computation lost to process failure by restoring the computation to a previously-established point in the computation. Applications establish checkpoints during failure-free operation by writing them to a stable storage device. A stable storage device is a logical device that survives the maximum number of failures in the system. Usually this is represented as a centralized file server for recovery of all processes, though peer-based techniques can be used for partial recovery.

Since fault tolerance techniques distract an application from normal execution, they come at an additional cost in terms of application performance. When discussing the cost of checkpoint/restart fault tolerance techniques, we must consider the effect on both the application and the system. The additional execution time required by the application as a result of introducing checkpointing techniques is called the Checkpoint Overhead [206, 265]. The Checkpoint Latency is the time required to create and establish a checkpoint to stable storage [206, 265]. If the application is suspended until the checkpoint is established then the checkpoint overhead is equal to the latency.

C/R and stable storage techniques that overlap checkpoint establishment with application execution can improve application performance. To describe the overhead involved with these techniques it is important to distinguish between checkpoint overhead and latency. Forked checkpointing is one example of such a technique, based on the copy-on-write semantics of modern operating systems, which uses a forked child process to take the checkpoint while the parent continues execution [265]. Reductions in checkpoint overhead result in the largest performance gain to the application by overlapping the establishment of the checkpoint with program execution. However, the interference of these techniques with the application must be accounted for. Interference as a result of potentially sharing a common processor, paging overheads, I/O storage requirements, and impact on shared resources must all be considered when choosing checkpointing techniques and intervals.
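The following sketch shows the forked-checkpointing idea in its simplest form (illustrative only; a production CRS such as BLCR captures far more process state and handles communication libraries via the callbacks discussed later). The write_process_image() routine is an assumed, application- or library-supplied serializer.

    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Assumed helper that serializes the relevant memory regions to 'path'. */
    extern int write_process_image(const char *path);

    int forked_checkpoint(const char *path) {
        pid_t child = fork();          /* copy-on-write duplicate of this process */
        if (child < 0)
            return -1;                 /* fork failed; no checkpoint taken        */
        if (child == 0) {              /* child: sees a frozen image of memory    */
            _exit(write_process_image(path) == 0 ? EXIT_SUCCESS : EXIT_FAILURE);
        }
        /* Parent: continues computing immediately; the checkpoint latency is    */
        /* paid by the child, so checkpoint overhead stays low. Reap the child   */
        /* later (e.g., with waitpid) to learn whether the checkpoint succeeded. */
        return 0;
    }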

4.1. Checkpoint Interval. If the checkpointing overhead were negligible, then the optimal checkpointing strategy would be to checkpoint after every instruction [217]. However, depending on the size of the application state and the checkpointing techniques used, the overhead can become significant. The frequency of checkpointing must be adjusted and modeled appropriately in relation to the checkpoint overhead. The Checkpoint Interval is the time between the establishment of two consecutive checkpoints [265]. Over the years many models have been proposed for choosing the optimal checkpointing interval [63, 64, 77, 108, 109, 141, 265]. It has been noted that the checkpoint interval is usually independent of the checkpoint latency, and strongly dependent upon the checkpoint overhead [206, 265]. Most checkpointing interval models assume a Poisson fault distribution and a static checkpointing interval. In recent years, these two assumptions have been reconsidered [236].

Through the study of system logs, researchers have found that a Poisson fault distribution model does not accurately represent the failure profiles of the systems considered [199, 206, 236]. Instead, studies suggest that a Weibull or gamma distribution serves as a better model for system failures [124]. As a counterpoint, it has been shown through simulation that two of the checkpoint interval models designed with Poisson fault distributions in mind, namely [265] and [285], also perform well with non-Poisson fault distributions [206]. It has also been highlighted that the failures in these systems can be clustered both in time and in space, suggesting that a single failure has an immediate impact on the components (e.g., machines, power supplies) physically located near the affected component [100, 124].
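As a point of reference, one widely used first-order model for a static interval, Young's approximation (quoted here as general background rather than as the specific model of any citation above), balances the checkpoint overhead \delta against the system MTTF M:

    T_{\mathrm{opt}} \approx \sqrt{2\,\delta\,M}

For example, a 5-minute checkpoint overhead on a system with a 24-hour (1440-minute) MTTF suggests checkpointing roughly every \sqrt{2 \cdot 5 \cdot 1440} = 120 minutes. The refinements discussed below adjust or abandon this static interval as the fault distribution and prediction information become better understood.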

Since checkpointing incurs some overhead, the optimal placement of a checkpoint is just before a failure. Unfortunately, without a perfect failure predictor, checkpoints must be taken that will, with high likelihood, never be used in recovery, so-called useless checkpoints, or else run the risk of total loss of the computation. This realization and the refinement of failure models and prediction frameworks have brought into question the static checkpointing interval. Since these models are not completely accurate, most modern checkpoint interval models combine a periodic checkpoint with a fault-aware, probabilistic checkpointing interval [218, 256]. The periodic checkpoint interval provides some protection against inaccurate failure models, while the fault-aware checkpoint interval attempts to checkpoint just before a failure, preserving the maximum amount of program state for later recovery. Other studies suggest that a dynamic checkpoint interval based on fault prediction or application size can improve the checkpointing overhead by reducing load on shared resources, namely stable storage [13, 198]. Some systems track the dynamic memory allocation/deallocation and adjust the checkpointing interval in response to the application’s memory footprint, thus reducing the size of the checkpoint and the resultant I/O requirements [166, 295]. These techniques show lower checkpoint overhead than fixed checkpoint interval algorithms. Even without dynamic checkpoint interval selection, re-evaluating the fixed checkpoint interval after recovery, based on fault history, has shown improvement [231].

Other important metrics to consider when building a checkpoint/restart system include recovery time, disk space consumed, fault coverage, impact on shared resources, and I/O requirements [205, 206]. When studying large-scale HPC systems, the need for high sustainable I/O bandwidth becomes a critical concern both for Productive I/O (e.g., for scientific visualization) and Defensive I/O (e.g., for checkpoint/restart) [252].

4.2. Checkpoint/Restart Services. A Checkpoint/Restart Service (CRS) captures the state of a single process in execution. Used in combination with a parallel runtime and communication environment, a CRS can be used to take a checkpoint of a multiprocess application. CRSs are classified by their degree of transparency, portability, and location of implementation. Though some literature will refer to application-level CRSs as user-level CRSs, for our discussion we will distinguish them by their transparency with regard to the application program, further classified below.

4.2.1. Application-Level. Application-level CRSs interact directly with the application to capture program state. This requires changes to the application code to identify regions of memory and variables that need to be preserved in a checkpoint in order to correctly restart the application. In Section 10, we will discuss how a pre-compiler can be used to automate the augmentation of the program to support application-level CRS. Application-level CRSs are often able to create the smallest possible checkpoint since they can take advantage of knowledge about the program state [239]. This comes at the cost of a loss of transparency, since application modifications are required. The consistency of the checkpoints in the distributed environment is completely left to the application to determine through normal messaging.

One of the major complications of such techniques is the non-standard nature of the interfaces to various CRSs. Some CRSs allow applications to use MPI datatypes for packing and unpacking data to and from a checkpoint service [238, 255]. Some also allow the user to use parallel I/O functionality, as defined in the MPI standard, though they have to define their own synchronization points, which can be difficult [61]. Most use callback functions during the pre-checkpoint and restart/continue phases to help the application encapsulate the checkpoint specific functionality [3, 62, 87]. Some restrict the structure and types of data that can be saved in the checkpointing functions [261, 262]. Some require special member functions in an object oriented programming paradigm [17, 291].

4.2.2. User-Level. A user-level CRS is implemented in user-space and typically provides transparency by virtualizing all system calls into the kernel. Within this virtualized environment the CRS is able to capture the state of the entire process without being tied to the kernel, as with system-level CRSs, but at the cost of a constant overhead during normal operations. In contrast with application-level CRSs, user-level CRSs produce larger checkpoints since they are not able to take advantage of optimizations based upon application-specific semantics. User-level CRSs are seen as more portable than system-level CRSs since they are not tied to a specific kernel revision set, but often they generate checkpoints that are tied to a particular machine or kernel type since data segments can be difficult to manage in a portable manner.

Most user-level CRSs are loaded dynamically in the application by using dynamic libraries that can wrap the main function and system calls [31, 246]. The main function is often wrapped as a mechanism for initializing the CRS and restarting a process from a previous checkpoint. Many user-level CRSs struggle with interprocess communication, process hierarchies, shared memory or libraries, signals, and timers since they can be difficult to properly virtualize [132, 169]. For interprocess communication, most solutions rely on callbacks into higher level message passing environments to support communication channels [43, 61, 210, 254]. Other solutions use specialized virtualization of the sockets interface to handle interprocess communication [11, 12, 154, 230].

Some implementations blur the line between user- and application-level CRS by requiring the application to insert checkpointing function calls into their application in order to use an otherwise transparent user-level CRS service, since they are unable to activate the service external to the process [70, 71].

When dealing with a multiprocess environment, consistency between the checkpoints is important. It can be difficult, if not impossible, for an application to choose proper synchronization points to guarantee consistent checkpoints, especially when the checkpoint consists of the entire process image. Highly structured Single-Process Multiple-Data (SPMD) applications with clearly defined synchronization points may be able to achieve this with a CRS that requires such synchronization [114, 181].

4.2.3. System-Level. A system-level CRS is implemented either inside the kernel or in a kernel module. It is able to directly access the kernel structures representing a process and associated threads, so there is little or no need to virtualize the system call interface. Without the virtualization overhead, system-level CRSs do not suffer from the continual overhead of virtualization-based techniques during normal operation, at the cost of generating checkpoints that are often tied to specific kernel revisions. The system-level CRS must constantly track the kernel development since changes to internal data structures can break functionality.

Kernel modules are a preferred method for implementing a system-level CRS [78, 79, 99, 101, 113, 293] since they require less maintenance and increase user adoption compared to requiring changes to the kernel source code [268]. For interprocess communication, system-level CRSs use callbacks into the higher level communication libraries to handle the communication channel state [37, 140, 234, 235]. Operating system virtualization techniques used by [184] and other environments [155, 273, 275] are also classified as system-level CRSs since they provide checkpoint/restart functionality in the operating system.

One of the challenges when dealing with a CRS is the interface, which changes for each project. Recently an attempt has been made to unify the interfaces to support the non-uniform deployment of CRSs in grid systems [177].
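As an illustration of the application-level style, the sketch below registers the buffers that constitute the application state and the pre-checkpoint/continue/restart callbacks mentioned above. The interface is hypothetical; real services (and the Open MPI self CRS of Appendix C) each define their own registration functions.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical application-level CRS interface (illustrative only). */
    typedef int (*crs_cb_t)(void);
    extern void crs_protect(void *buf, size_t len);   /* include this memory in checkpoints */
    extern void crs_register(crs_cb_t pre_checkpoint, crs_cb_t post_continue,
                             crs_cb_t post_restart);

    static double field[1024 * 1024];   /* the solver state worth preserving */
    static int    iteration;

    static int pre_ckpt(void) { /* e.g., drain pending work, flush buffers   */ return 0; }
    static int cont(void)     { /* nothing special when execution continues  */ return 0; }
    static int restart(void)  { printf("restarted at iteration %d\n", iteration); return 0; }

    void solver_init(void) {
        crs_protect(field, sizeof(field));
        crs_protect(&iteration, sizeof(iteration));
        crs_register(pre_ckpt, cont, restart);
    }

Because only the registered regions are written, the checkpoint can be much smaller than a full process image, at the price of the transparency loss discussed above.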

4.3. Checkpoint/Restart Optimization Techniques. There are many optimizations that checkpoint/restart systems can take advantage of in order to reduce the checkpoint overhead and/or checkpoint latency. In this section we will explore a set of commonly used optimizations.

4.3.1. Forked Checkpointing. Forked checkpointing relies on the copy-on-write semantics of process creation in modern operating systems in order to reduce checkpoint overhead [57, 153, 208, 238, 265]. At the point of the checkpoint, the parent process forks a child process to perform the checkpoint operation while the parent program continues execution. Since the child is only reading pages, which are then written to stable storage as the checkpoint, there is no need for a complete copy of the parent's memory image, only those pages of memory that have changed since the fork operation. By using copy-on-write semantics the child process is able to be created quickly and run concurrently with the parent process, thus reducing the impact on checkpoint overhead.

The performance gains diminish if the parent process writes a large amount of data while the child process is writing the checkpoint, due to paging overhead. Additionally, in memory-constrained systems, where the application requires most or all of main memory, the child process may be forced to page to disk, causing thrashing, which drastically diminishes the performance of both the checkpointing operation and the parent process execution. This technique works well for programs that do not use fork sensitive libraries or resources. Unfortunately, high performance interconnect drivers are typically sensitive to fork calls due to operating system bypass techniques used for optimized memory transfers, making them ill-suited for these techniques.

Similar to forked checkpointing is a mirror copy (a.k.a. continuous checkpointing) technique which seeks to reduce the overhead of copy-on-write paging by keeping a duplicate copy of the process image in memory at all times [59, 127]. This duplicate image is then occasionally written out to stable storage concurrently by the operating system [59, 127]. For real-time systems, the mirror copy technique provides a more consistent checkpoint latency than copy-on-write, making it more appealing to systems with hard deadlines on the completion of application execution [59]. Instead of maintaining this duplicate image throughout the life of the process, a pre-copy technique has been suggested that creates a full duplicate copy of the parent at the point of the checkpoint [83, 257]. It was shown by [83] that forked checkpointing performed better in practice than the pre-copy technique, indicating that the overhead of maintaining the copy-on-write properties of main memory is not a significant enough factor to warrant such an optimization.

4.3.2. Checkpoint Compression. One way to reduce the checkpoint latency is to reduce the amount of checkpoint data that is pushed to stable storage, thus reducing the I/O required to store the checkpoint [166, 208, 236, 238]. This technique may also reduce the amount of space required on stable storage for checkpoint files. Most of these techniques employ in-line compression while writing the checkpoint [208], but few have looked at compression as part of a larger staging process at the stable storage level. The amount to which a checkpoint can be compressed is application state dependent.
Most studies have shown that checkpoints, especially application generated checkpoints, can be significantly compressed. Though these studies typically focus on the size reduc- tions, it is equally important to consider the performance of the checkpointing algorithm 2. BACKGROUND AND RELATED WORK 19 as it impacts checkpoint overhead. [208] notes that while compression may reduce the size of the checkpoint, “... it only improves the overhead of checkpointing if the speed of compression is faster than the speed of disk writes, and if the checkpoint is significantly compressed.” Informally, researchers have found that even when checkpoint overhead is increased by using compression, the reduction in stress on the stable storage system often improves system reliability allowing the application to checkpoint less frequently. 4.3.3. Memory Exclusion. Another method of reducing checkpoint latency is to exclude temporary or unused buffers from the checkpoint [209]. This can reduce the size of the checkpoint and therefore the amount of I/O required to transfer it to stable storage. Most of the techniques focus on augmenting applications to mark regions of memory that should be excluded from the checkpoint [3, 51, 208]. Alternatively the checkpointing library may be able to transparently find memory to exclude by augmenting the memory allocator [51] and/or using pre-compiler analysis [175]. This technique relies on the ability of the application writer to highlight both critical and temporary regions of memory. Usually this is employed by support libraries as a way to reduce their additive impact on the amount of additional state that needs to be saved in the checkpoint as a result of their use by the application [3]. 4.3.4. Incremental Checkpointing. Incremental checkpointing techniques focus on re- ducing checkpoint latency by checkpointing only the changes made by the application from the last checkpoint [208]. During recovery the last N checkpoint files are required in or- der to recreate the process. In order to reduce the number of checkpoint files necessary for recovery, a full checkpoint is taken less often than the incremental checkpoints and used as a base for the next set of incremental checkpoints to be combined against. Sim- ilar to forked based checkpointing these techniques commonly rely on the paging system of modern operating systems, particularly the modified or dirty bit in modern paging hard- ware [51, 83, 87, 95, 113, 127, 208, 215, 243, 273]. When a checkpoint is taken the modified (a.k.a. dirty) bits are cleared, and as the program executes it flips the bits on the pages that have been modified. During the next pass of the CRS, only those pages that have 2. BACKGROUND AND RELATED WORK 20 been modified are included in the incremental checkpoint saved to stable storage. Upon re- covery, the incremental checkpoints are combined with the last full checkpoint to recreate the process state. The smallest incremental checkpoint is based on the smallest segment of memory, the word. Since the overhead involved in tracking changes to individual words in memory is prohibitively expensive most techniques use pages which contain a set of words, and paging hardware to support incremental checkpointing at the page level. However even a single word change may cause the entire page to be included in the incremental checkpoint adding expense to the checkpoint operation. 
Probabilistic checkpointing allows for incremental checkpointing based on blocks of memory which are larger than a word, but smaller than a page, in order to reduce this overhead [3, 186]. These techniques use a memory block encoding algorithm to determine if a block of memory has changed, since they cannot rely on paging hardware to highlight differences. The encoding technique explored often suffers from aliasing, in which differences may be masked because the encoding algorithm encodes the old and modified blocks to the same value. Though the authors of the technique present analysis showing this to be a sufficiently rare event [186], later research found that aliasing occurs much more frequently in practice, making probabilistic techniques dangerous to employ [81]. A user-level sketch of page-granularity incremental checkpointing is shown below.
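To make the dirty-bit idea concrete, the following is a minimal, illustrative sketch (not taken from any of the cited systems) that approximates hardware dirty bits at user level on Linux: the tracked region is write-protected after each checkpoint, the first write to a page marks it dirty via a fault handler, and only the dirty pages are written at the next checkpoint. The region size, file name, and handler structure are assumptions for demonstration only.

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 1024
static char   *region;                      /* tracked memory region      */
static size_t  page_size;
static unsigned char dirty[REGION_PAGES];   /* one flag per page          */

/* Fault handler: the page was written after the last checkpoint, so
 * mark it dirty and re-enable write access so the write can proceed. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + REGION_PAGES * page_size)
        _exit(1);                           /* a real fault, not tracking */
    size_t idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(region + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Write only the dirty pages, clear the flags, and re-protect the
 * region so the next checkpoint interval is tracked as well. */
static void incremental_checkpoint(FILE *out) {
    for (size_t i = 0; i < REGION_PAGES; i++) {
        if (!dirty[i]) continue;
        fwrite(&i, sizeof(i), 1, out);                     /* page index */
        fwrite(region + i * page_size, page_size, 1, out); /* contents   */
        dirty[i] = 0;
    }
    fflush(out);
    mprotect(region, REGION_PAGES * page_size, PROT_READ);
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = { 0 };
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_write_fault;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region, REGION_PAGES * page_size, PROT_READ);  /* start tracking */

    memset(region + 5 * page_size, 0xAB, page_size);        /* dirty one page */
    FILE *out = fopen("incr.ckpt", "wb");
    incremental_checkpoint(out);                             /* saves page 5   */
    fclose(out);
    return 0;
}

On restart, the deltas would be applied on top of the last full checkpoint image, as described above.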

4.4. Checkpoint Coordination Protocols. The state of a distributed system is composed of the state of the processes and all connected channels [48]. In Section 4.2, we presented various techniques for capturing the state of a single process. In this section, we will discuss protocols to account for the channel state in a distributed checkpoint operation. [192] formally defines three communication channel states that can exist in a global cut formed from local checkpoints of two or more processes in a distributed system. These states are strongly based on the theoretical foundations of [48] and [158]. See Figure 2.1 for an illustration of these three communication channel states. A consistent state is a state in which there does not exist any message sent after the global cut, but received before it. A transitless state is a state in which there does not exist any message sent before the global cut, but received after it. A strongly consistent state is a state that is both consistent and transitless, thus empty or quiescent at the point of the global cut.

FIGURE 2.1. Illustration of the three communication channel states. White blocks indicate checkpoints of the individual processes. Dashed lines indicate global cuts of the system. (C) is a consistent state. (T) is a transitless state. (SC) is a strongly consistent state.

RENEW [193] and Egida [221] each presented a set of minimal functional components that can be used to create a checkpoint/restart or message logging protocol. Our approach, described in Chapter 3, was to create a virtualization layer and allow the protocol to be implemented as a single functional component not constrained by a predetermined set of low-level fault tolerance APIs.
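As a small illustration of the channel-state definitions above (a sketch, not part of any cited formalism), the function below classifies a global cut given each process's local checkpoint time and the send/receive times of the exchanged messages. The data layout and example values are assumptions for demonstration.

#include <stdbool.h>
#include <stdio.h>

/* One message: which processes were involved and when it was sent and
 * received relative to each process's local event clock. */
typedef struct {
    int src, dst;          /* sending and receiving process ranks    */
    int send_t, recv_t;    /* local event times of send and receive  */
} message_t;

/* cut[p] is the local time at which process p took its checkpoint. */
static void classify_cut(const int *cut, const message_t *msgs, int n) {
    bool consistent = true, transitless = true;
    for (int i = 0; i < n; i++) {
        const message_t *m = &msgs[i];
        bool sent_before_cut = m->send_t <= cut[m->src];
        bool recv_before_cut = m->recv_t <= cut[m->dst];
        if (!sent_before_cut && recv_before_cut)
            consistent = false;    /* received "from the future"            */
        if (sent_before_cut && !recv_before_cut)
            transitless = false;   /* in transit at the point of the cut    */
    }
    if (consistent && transitless)  puts("strongly consistent");
    else if (consistent)            puts("consistent");
    else if (transitless)           puts("transitless");
    else                            puts("neither");
}

int main(void) {
    int cut[2] = { 10, 10 };
    message_t msgs[] = { { 0, 1, 5, 12 } };    /* crosses the cut forward  */
    classify_cut(cut, msgs, 1);                /* prints "consistent"      */
    return 0;
}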

4.5. Coordinated Protocols. Coordinated checkpoint/restart protocols require that all processes take a checkpoint at logically the same time. These protocols operate before a checkpoint is taken to ensure that the state of all connected channels is either consistent or strongly consistent between all processes. Coordinated protocols are relatively simple in nature, but are often critiqued for their synchronization overhead since all processes must participate in every checkpoint operation. Though many protocols exist for checkpoint/restart coordination, fully coordinated protocols are usually chosen due to their relative simplicity of implementation (it is easier to prove that they work correctly) and in recognition that the synchronization overhead is negligible in comparison with the storage overhead that dominates the checkpointing time [82, 83, 84, 240]. The distributed snapshots (a.k.a. Chandy/Lamport) algorithm allows a process to checkpoint once it has received a special marker token from every process in the system indicating the point in logical time on the channel where communication between the two processes is consistent [48]. This method requires FIFO communication channels between all processes to form a consistent state. Since the system is assumed to be able to control the delivery of the messages, it is assumed that once the marker has been received all other messages on that channel can be delayed by the system until the checkpoint is established. Various systems have implemented the distributed snapshots algorithm, usually with the assistance of a daemon to control message traffic to the process, thus delaying messages after the marker. Sometimes this is called a non-blocking coordinated checkpoint/restart protocol since the channel is not blocked once the marker is received [30]. MPVM [43], ickpt [210], MIST [42], ABARIS [144] and MPICH-Score [105] are examples of systems that use this technique. The ready message algorithm refines the distributed snapshots algorithm by forcing the channels to be shut down once the markers have been received to make a strongly consistent channel state [248]. The name for this protocol was established by the CoCheck implementation which supports both Parallel Virtual Machine (PVM) and MPI applications [248]. This protocol is useful in situations where the checkpointing system cannot control the delivery of messages to the application, and so might accidentally include a message sent on the channel after the marker. An outside daemon is used to "break the silence" after a checkpoint operation has finished. Sometimes this is called a blocking coordinated checkpoint/restart protocol since the channel is blocked once the marker is received. Dynamite followed CoCheck by implementing this protocol for PVM with process migration capabilities [139]. The MPICH-GF project added a coordinated checkpoint/restart protocol to MPICH-G2 for grid based systems using Ethernet communication [283]. The FTMPI project implemented a synchronized checkpointing protocol in MPICH [237]. They used replicated checkpoint protocol control systems, and a checkpoint server to support checkpointing activities. The M3 project implemented a coordinated checkpointing protocol in MPICH-GM supporting Myrinet interconnects [147]. The MVAPICH project also implemented a coordinated checkpointing protocol but supported InfiniBand hardware [103, 104].
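The marker rule at the heart of the distributed snapshots algorithm can be written as a small per-process state machine. The following is an illustrative toy (print statements stand in for real checkpointing and channel recording); it is not the implementation used by any of the systems cited above, and the process count and event sequence in main are assumptions.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 4

/* Per-process snapshot bookkeeping for the distributed snapshots
 * (Chandy/Lamport) marker rule.  Channel recording and the actual
 * checkpoint are represented by stubs. */
typedef struct {
    int  rank;
    bool recording[NPROC];   /* still recording channel from peer p?   */
    int  markers_seen;       /* markers received so far                */
    bool checkpointed;
} snapshot_state_t;

static void take_local_checkpoint(int rank)          { printf("P%d: checkpoint\n", rank); }
static void record_channel_message(int from, int to) { printf("P%d: log msg from P%d\n", to, from); }
static void send_marker(int from, int to)            { printf("P%d: marker -> P%d\n", from, to); }

/* Rule 1: on the first marker (or a local checkpoint request),
 * checkpoint immediately and send a marker on every outgoing channel. */
static void start_snapshot(snapshot_state_t *s) {
    if (s->checkpointed) return;
    take_local_checkpoint(s->rank);
    s->checkpointed = true;
    for (int p = 0; p < NPROC; p++)
        if (p != s->rank) { s->recording[p] = true; send_marker(s->rank, p); }
}

/* Rule 2: a marker from peer p closes channel p; any application
 * message that arrives on a still-open channel belongs to the channel
 * state and must be recorded. */
static void on_receive(snapshot_state_t *s, int from, bool is_marker) {
    if (is_marker) {
        start_snapshot(s);               /* no-op if already checkpointed */
        s->recording[from] = false;
        if (++s->markers_seen == NPROC - 1)
            printf("P%d: snapshot complete\n", s->rank);
    } else if (s->checkpointed && s->recording[from]) {
        record_channel_message(from, s->rank);
    }
}

int main(void) {
    snapshot_state_t p1 = { .rank = 1 };
    start_snapshot(&p1);                 /* P1 initiates the snapshot      */
    on_receive(&p1, 0, false);           /* in-flight message gets logged  */
    on_receive(&p1, 0, true);            /* marker from P0 closes channel  */
    on_receive(&p1, 2, true);
    on_receive(&p1, 3, true);
    return 0;
}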

In shared memory systems a memory barrier is sometimes sufficient to synchronize the checkpoint if the entire memory space is written to disk, as demonstrated by [2]. The MPICH-V [29] project compared pessimistic, sender-based message logging with a distributed snapshots coordinated checkpoint/restart algorithm (the MPICH-V2 and MPICH-Vcl implementations, respectively). They concluded that for low fault frequencies the coordinated checkpoint algorithm performed better due to the reduction in runtime overhead [30]. A year later they expanded their testing to include a causal message logging protocol (the MPICH-Vcausal implementation) and reached the same conclusion with regard to coordinated checkpointing [163]. The same project compared a blocking coordinated checkpoint/restart implementation (MPICH-Pcl) against their non-blocking coordinated checkpoint/restart implementation (MPICH-Vcl), and found that the blocking version provided the best performance overall, possibly due to the message logging overhead, however limited, required by the non-blocking implementation [38, 57]. The C3 project [32] uses a non-blocking coordinated protocol that adapts to different consistency states in the network channels by piggybacking data on all messages [33, 34]. Since they implement their protocol above the MPI layer, they do not use the provided collective operations, because a checkpoint may occur across a collective operation; instead they re-implement the collectives using point-to-point operations [35]. This project also requires non-standard behavior from the MPI implementation, namely the ability to call MPI INIT and MPI FINALIZE multiple times, in order to generate a consistent checkpoint state for the application. The PC3 project extends the C3 project by adding support for heterogeneous checkpointing through application-level checkpointing and paying close attention to type and pointer portability [96]. Sync-and-stop style algorithms exchange message counts between all processes and drain the network to create a strongly consistent channel state between all processes across the checkpoint operation [210]. An outside daemon is used to "break the silence" after a checkpoint operation, as in the ready message protocol. A sketch of this count-and-drain style of coordination is shown below.
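The sketch below illustrates the count-and-drain idea over MPI and is not drawn from any particular system above. It assumes (hypothetically) that the application's point-to-point traffic passes through counting wrappers that maintain the per-peer counters; the drained payloads would normally be stashed for later delivery rather than discarded.

#include <mpi.h>
#include <stdlib.h>

/* Counters maintained by (assumed) send/recv wrappers around the
 * application's point-to-point traffic. */
static long *msgs_sent;       /* msgs_sent[p]: messages sent to rank p      */
static long *msgs_received;   /* msgs_received[p]: messages received from p */

/* Pull in one outstanding application message from `peer`.  In a real
 * system the payload would be queued for later delivery; here it is
 * simply stashed and freed. */
static void drain_one_message(int peer, MPI_Comm comm) {
    MPI_Status st;
    int count;
    MPI_Probe(peer, MPI_ANY_TAG, comm, &st);
    MPI_Get_count(&st, MPI_BYTE, &count);
    char *buf = malloc(count > 0 ? (size_t)count : 1);
    MPI_Recv(buf, count, MPI_BYTE, peer, st.MPI_TAG, comm, MPI_STATUS_IGNORE);
    free(buf);
    msgs_received[peer]++;
}

/* Drain the network before a checkpoint: exchange per-peer send counts,
 * then keep receiving until every peer's count is matched locally. */
static void quiesce_channels(MPI_Comm comm) {
    int size;
    MPI_Comm_size(comm, &size);

    /* expected[p]: how many messages rank p has sent to us so far. */
    long *expected = malloc((size_t)size * sizeof(long));
    MPI_Alltoall(msgs_sent, 1, MPI_LONG, expected, 1, MPI_LONG, comm);

    for (int p = 0; p < size; p++)
        while (msgs_received[p] < expected[p])
            drain_one_message(p, comm);

    /* All channels are now empty: a checkpoint taken here captures a
     * strongly consistent channel state. */
    MPI_Barrier(comm);
    free(expected);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    msgs_sent     = calloc((size_t)size, sizeof(long));
    msgs_received = calloc((size_t)size, sizeof(long));
    /* ... application traffic through the counting wrappers would go here ... */
    quiesce_channels(MPI_COMM_WORLD);    /* safe point for a checkpoint */
    MPI_Finalize();
    return 0;
}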

The LAM/MPI [39] project incorporated transparent checkpoint/restart functionality with support for Ethernet [234] and later Myrinet GM using the Berkeley Lab Checkpoint/Restart (BLCR) library. They used an all-to-all bookmark exchange coordination algorithm implemented as part of the TCP/IP and Myrinet GM interconnect drivers. LAM/MPI was unique in its support for a modular interface to the checkpoint/restart services on a given machine [233, 234]. It supported BLCR and a user-level callback system called SELF. The Open MPI project implemented a coordinated checkpointing protocol similar to the LAM/MPI implementation [136], but supports a wider variety of interconnects [135]. An automatic recovery technique called job pause (a.k.a. automatic, in-place recovery) was added to LAM/MPI that, upon detection of process failure, rolls all processes back to the last checkpoint and restarts the failed process on a spare machine [276]. Chapter 4 describes how this technique was ported to Open MPI [134]. A similar sync-and-stop model was used to migrate PGAS processes using InfiniBand networks in a Xen virtual machine [235]. Transparent checkpoint/restart in MPICH-GM using the Myrinet GM driver has also been demonstrated [147]. They use a 2-phase coordination procedure based, in part, on CoCheck's Ready Message protocol [248]. They implement their coordination algorithm as part of the GM driver, relying on the First-In-First-Out (FIFO) message ordering provided therein. The GM driver supports shared memory communication for peers on the same node. Even though their coordination algorithm is influenced by the Remote Direct Memory Access (RDMA) semantics of the GM driver, they do support reconfiguration between shared memory and Myrinet GM communication, though they require Myrinet to be available upon restart to all processes in the application. Chapter 4 describes how, by lifting the coordination protocol out of the interconnect drivers, additional drivers and features can be supported [134]. The MPICH-V project focuses its effort on message logging techniques with the MPICH-V1, MPICH-V2 and MPICH-Vcausal implementations [29]. They also have an MPICH-Vcl implementation that supports coordinated checkpoint/restart using a Chandy/Lamport coordination algorithm [48]. In their papers, they are able to demonstrate their implementations running on clusters with 100 Mbit/s Ethernet, Myrinet, and SCI networks [57, 163].

However, to run on the Myrinet and SCI networks, they take advantage of the Ethernet emulation provided by these interconnects and do not interact directly with the hardware drivers. Though this reduces the overall complexity of the solution, it comes at a significant performance loss. MVAPICH2 demonstrated transparent MPI checkpoint/restart functionality over InfiniBand [138] interconnects [104]. This work highlights the complexity of dealing with the OS bypass technique used by InfiniBand. They show that the only way to properly handle such an interconnect driver is to completely shut down all network connections before a checkpoint and re-establish them directly after a checkpoint. They implement a Chandy/Lamport [48] style coordination algorithm at the InfiniBand driver level operating on network-level messages instead of MPI-level messages. This coordination algorithm relies on FIFO message ordering provided by the interconnect driver. In a later study they use a group-based coordination algorithm [103]. This algorithm is similar to the staggered algorithm presented by Vaidya [266], which is used to minimize the stress on the file system during checkpointing. Some modifications to the distributed snapshots algorithm involve using grid and tree based mechanisms for collecting markers [106]. The goal of these techniques is to reduce the message size and accounting space overhead required in each process as the distributed system grows. Some coordinated checkpoint protocols rely on a virtualized sockets interface to handle the channel state [12, 140, 154, 230, 293]. Typically, this involves the sender logging the message, and then removing the message from the log after receiving a confirmation that the message was delivered correctly on the remote side [230]. Alternatively, the socket buffers can be drained to the receiver before a checkpoint and then reflected off the sender on restart to refill the receiver's internal buffers [12, 154]. Crak virtualizes the UDP layer to provide message re-transmission and channel marshaling [37]. The Cruz project uses the Zap CRS and modifies the network stack to create a migratable process with a virtual routable address [140]. The ZapC project also used Zap, but implemented the network virtualization completely in the socket virtualization without any protocol restrictions [154]. The DejaVu implementation transparently supports the coordinated checkpoint/restart of processes using sockets through virtualization, but required modifications to the MVAPICH library to support InfiniBand communication due to the complexity of the interface and other implementation requirements [230]. The DMTCP project uses a distributed snapshots algorithm and socket virtualization to provide coordinated checkpointing [11, 12]. In the BCS implementation, the consistent channel state is established due to special properties of the network that allow messages to be sent only at defined intervals [204]. Between each of these intervals the channel is in a strongly consistent state, therefore any checkpoint generated during this time is globally consistent [113]. Other implementations rely on the application to define synchronization points that are globally consistent, sometimes referred to as "safe points". CLIP uses a user-level semi-transparent library where the user must explicitly request a checkpoint, but does not need to identify the state to save [51].
Similarly, the MPI Ckpt library provides the same functionality and semantics, and has been implemented with an application-level CRS [114] and a user-level CRS [181]. The CPPC project advances this work by using compiler analysis to automatically determine "safe points" in the program at which to insert calls into the MPI Ckpt library using a pre-compiler [226]. Adaptive MPI [129, 130] implements migratable virtual MPI processes as user-level threads. Adaptive MPI provides a checkpoint/restart solution closer to application-level checkpointing than system-level checkpointing since it is not transparent to the user-level application. The MPI application must place checkpoint function calls into its code at points when it can guarantee that no messages are being transferred between processes [128, 291]. Adaptive MPI then saves the state of the user-level threads to disk as a checkpoint. Adaptive MPI is implemented on top of Charm++ which, in turn, is implemented on top of a native MPI implementation. Being this far removed from the interconnects significantly degrades performance. Additionally, since MPI processes are implemented as user-level threads and share the global address space, Adaptive MPI places additional constraints on the MPI application such as prohibiting the use of global variables. A minimal example of this safe-point style of application-level checkpointing is sketched below.
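The sketch below shows the general safe-point pattern in a toy MPI iterative code; it does not use the APIs of any of the libraries above. The state layout, file names, iteration count, and checkpoint interval are assumptions, and the allreduce is used here as a convenient point at which no application messages from the current iteration remain in flight.

#include <mpi.h>
#include <stdio.h>

#define N_LOCAL 1000

/* Write this rank's portion of the solver state at a point where no
 * messages are in flight (every rank calls this at the same iteration). */
static void app_checkpoint(int rank, int iter, const double *x, int n) {
    char name[64];
    snprintf(name, sizeof(name), "state.%d.ckpt", rank);
    FILE *f = fopen(name, "wb");
    fwrite(&iter, sizeof(iter), 1, f);
    fwrite(x, sizeof(double), (size_t)n, f);
    fclose(f);
}

/* Try to resume from a previous checkpoint; returns the iteration to
 * restart from, or 0 if no usable checkpoint exists. */
static int app_restart(int rank, double *x, int n) {
    char name[64];
    snprintf(name, sizeof(name), "state.%d.ckpt", rank);
    FILE *f = fopen(name, "rb");
    if (!f) return 0;
    int iter = 0;
    if (fread(&iter, sizeof(iter), 1, f) != 1 ||
        fread(x, sizeof(double), (size_t)n, f) != (size_t)n)
        iter = 0;
    fclose(f);
    return iter;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double x[N_LOCAL] = { 0 };
    int start = app_restart(rank, x, N_LOCAL);

    for (int iter = start; iter < 100; iter++) {
        for (int i = 0; i < N_LOCAL; i++)          /* local work */
            x[i] += 1.0;
        double local = x[0], global;
        /* After the allreduce returns, this toy code has no application
         * messages in flight, so the next statement is a safe point. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (iter % 25 == 0)
            app_checkpoint(rank, iter + 1, x, N_LOCAL);
    }
    MPI_Finalize();
    return 0;
}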

4.6. Uncoordinated Protocols. Uncoordinated checkpoint/restart protocols do not coordinate the checkpoints between processes and instead use specialized algorithms to determine the set of consistent checkpoints on restart. They do not suffer the synchronization overhead during failure-free operation seen with coordinated protocols, but at the cost of complex recovery algorithms. Uncoordinated checkpoint/restart protocols need to keep most or all of the generated checkpoints on stable storage since it is not known until restart which set of those checkpoints is required for restart. This is in comparison with coordinated checkpoint/restart protocols, which require only one or two (for 2-phase commit) checkpoints from each process to be stored since recovery is guaranteed to be possible from the most recent, stable checkpoint. Checkpoints that are never used for recovery are called useless checkpoints. On restart the independent checkpoints are analyzed to determine the most recent set of independent checkpoints that form a globally consistent snapshot. Zig-Zag Paths (Z-Paths) [191, 192] and Rollback-Dependency Graphs (R-graphs) [278] are used to determine the globally consistent states. A Z-code [173] technique can also be used to determine all sets of consistent global states during a region of time, which can be useful in focusing the search space and for on-line garbage collection of uncoordinated checkpoints. T-graphs and S-graphs have also been used to find globally transitless states [126]. Often uncoordinated checkpoint/restart protocols are used in combination with message logging techniques. The message logging techniques make the recovery algorithm simpler since most or all messages are accounted for in the logs, reducing or eliminating the opportunity for orphan processes. When using a pessimistic message logging protocol, the recovery line is easy to form: since no orphans are possible, it is the last checkpoint and most recent log from the failed process. For causal and optimistic message logging protocols, a maximal recoverable state must be determined from a list of available checkpoints and message logs to ensure that the restarted state is consistent [145]. See Section 5 for more discussion of message-logging protocols. A simplified sketch of the rollback propagation used to find a recovery line is shown below.
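The following is a deliberately simplified sketch of rollback propagation over recorded dependencies; it is not the Z-path or R-graph algorithms of the cited works. The dependency representation, process count, and example checkpoint numbers are assumptions chosen to show how rollbacks can cascade.

#include <stdbool.h>
#include <stdio.h>

#define NPROC 3

/* A dependency recorded during execution: a message sent by `src`
 * after its checkpoint number `after_ckpt` was received by `dst`
 * before `dst`'s checkpoint number `before_ckpt`.  If src rolls back
 * past the send, dst's checkpoint would contain an orphan message. */
typedef struct { int src, after_ckpt, dst, before_ckpt; } dep_t;

/* Roll processes back until no orphan dependencies remain.  latest[p]
 * is the newest checkpoint available for process p; the result in
 * line[p] is the checkpoint each process must restart from. */
static void recovery_line(const int *latest, const dep_t *deps, int ndeps, int *line) {
    for (int p = 0; p < NPROC; p++) line[p] = latest[p];
    bool changed = true;
    while (changed) {
        changed = false;
        for (int d = 0; d < ndeps; d++) {
            const dep_t *e = &deps[d];
            /* The send is undone (src restarts at or before after_ckpt)
             * but the receive is still reflected in dst's state. */
            if (line[e->src] <= e->after_ckpt && line[e->dst] >= e->before_ckpt) {
                line[e->dst] = e->before_ckpt - 1;    /* dst must roll back */
                changed = true;
            }
        }
    }
}

int main(void) {
    int latest[NPROC] = { 2, 3, 3 };
    dep_t deps[] = {
        { 0, 2, 1, 3 },    /* P0 sent after ckpt 2; P1 received before ckpt 3 */
        { 1, 2, 2, 3 },    /* P1 sent after ckpt 2; P2 received before ckpt 3 */
    };
    int line[NPROC];
    recovery_line(latest, deps, 2, line);
    for (int p = 0; p < NPROC; p++)
        printf("P%d restarts from checkpoint %d\n", p, line[p]);
    return 0;
}

In this example the rollback of P0 forces P1 back, which in turn forces P2 back, illustrating how the cascading behavior of the domino effect arises.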

4.7. Message Induced Protocols. Message induced (a.k.a. communication induced) checkpoint/restart protocols attempt to combine the positive aspects of coordinated checkpointing (only the last set of checkpoints is required) and uncoordinated checkpointing (no failure-free synchronization of checkpoints). Forced checkpoints are checkpoints that the protocol requires an application to take before or after receiving a message in order to maintain a globally consistent state. Message induced checkpointing relies on piggybacking dependency information on messages between processes to determine when to take forced checkpoints. Processes are allowed to checkpoint independently, but must piggyback information about this action on all messages in case it causes an inconsistent global state that would require a peer process to take a forced checkpoint [151, 174]. To determine whether or not to take a forced checkpoint, typically a dependency graph (e.g., Z-path [191, 192]) is piggybacked on all messages [125]. Sometimes the message dependency tracking is less explicit, and forced checkpoints are generated only by analyzing the communication pattern of the application to determine when a globally consistent checkpoint may be taken [238]. The challenge is that, with dynamic communication patterns, it is possible that a checkpoint may never be generated. Analysis of message induced protocols has largely concluded that such protocols generate a large number of forced checkpoints, further stressing the storage system [10]. Additionally, the amount of data that must be piggybacked can incur intolerable messaging overheads, and coordinated protocols provide a more consistent overhead while guaranteeing the usefulness of all checkpoints generated [286]. The storage and message overheads also challenge the assumption that these techniques scale better than coordinated algorithms simply because they do not require explicit global synchronization [10].
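As a small illustration of the forced-checkpoint idea, the sketch below implements a simple index-based rule (each message carries the sender's checkpoint interval number, and a receiver in an older interval is forced to checkpoint before delivery). This is only one simplified variant of the piggybacking schemes cited above, and the data structures are assumptions.

#include <stdio.h>

/* Per-process state for a simple index-based message-induced protocol:
 * sn is the local checkpoint interval number piggybacked on every
 * outgoing message. */
typedef struct { int rank; int sn; } proc_t;

static void take_checkpoint(proc_t *p, const char *why) {
    printf("P%d: checkpoint %d (%s)\n", p->rank, p->sn, why);
}

/* Local (independent) checkpoint: bump the interval number first. */
static void local_checkpoint(proc_t *p) {
    p->sn++;
    take_checkpoint(p, "local");
}

/* Every message carries the sender's interval number. */
static int piggyback(const proc_t *p) { return p->sn; }

/* Receive rule: if the sender is already in a newer checkpoint
 * interval, take a forced checkpoint *before* delivering the message
 * so the delivery cannot create an inconsistent (orphan) state. */
static void on_deliver(proc_t *p, int msg_sn) {
    if (msg_sn > p->sn) {
        p->sn = msg_sn;
        take_checkpoint(p, "forced");
    }
    /* ... deliver the message to the application here ... */
}

int main(void) {
    proc_t p0 = { 0, 0 }, p1 = { 1, 0 };
    local_checkpoint(&p0);             /* P0 moves to interval 1            */
    on_deliver(&p1, piggyback(&p0));   /* P1 is forced to checkpoint first  */
    return 0;
}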

5. Fault Recovery: Message Logging

Message logging techniques record message events to a log that can be replayed at a later time to recover a failed process from its initial state. Message logging is a sub-domain of event logging focused on recording message state instead of all non-deterministic events that impact application execution. Message logging is usually combined with checkpointing to allow for recovery from an intermediary state of execution instead of from the beginning of execution. Message logging techniques fall into one of three broad categories: pessimistic, optimistic, and causal. All message logging techniques require the application to adhere to the piecewise deterministic assumption, which states that "...the state of a process is determined by its initial state and by the sequence of messages it delivers" [222]. For message logging techniques, this assumption requires the process recovery to be repeatable (deterministic) and message driven (no need to track other events) [250]. When using uncoordinated checkpointing and message logging, orphan processes become problematic and influence the choice of a message logging strategy. Orphan processes are surviving processes whose state is inconsistent with the recovered state of a failed process [9]. When discovered, orphan processes are forced to roll back to the previous checkpoint. The cascading rollback of orphan processes back to the initial state is called the domino effect and can negate all of the benefits of an uncoordinated checkpointing technique [219]. The always-no-orphans property, which is critical to proving the stability of both pessimistic and causal message logging protocols, is defined in [9]. The log must contain information about the ordering, source, destination, sequence number, and data contents of a message in order to correctly replay the message. This tuple is called the message's determinant. Some applications of message logging focus on separating the ordering information from the contents in the determinant. By preserving just the order, and not the contents, these techniques require that messages are resent on replay [227]. Pessimistic message logging techniques require a process to synchronously log the message to stable storage before sending it to the recipient or before delivering the message locally to the application [24, 27, 29, 46, 68, 86, 167, 194, 247]. These techniques never create orphan processes since no messages are ever lost. They do, however, incur the performance penalty of synchronously logging every message to stable storage. A sketch of the synchronous, sender-based flavor of this approach is shown below.
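The sketch below illustrates only the synchronous payload-logging half of a pessimistic, sender-based scheme: the message's determinant (destination, tag, size) and payload are forced to a per-process log before the send is allowed to complete. Receive-order determinants, garbage collection, and replay are omitted, and the wrapper name and log layout are assumptions rather than the API of any cited system.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

static FILE *msg_log;    /* per-process log on (assumed) stable storage */

/* Pessimistic, sender-based logging: nothing leaves this process until
 * its record is safely on the log. */
static int logged_send(const void *buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm) {
    int type_size;
    MPI_Type_size(type, &type_size);
    long bytes = (long)count * type_size;

    fwrite(&dest,  sizeof(dest),  1, msg_log);
    fwrite(&tag,   sizeof(tag),   1, msg_log);
    fwrite(&bytes, sizeof(bytes), 1, msg_log);
    fwrite(buf, 1, (size_t)bytes, msg_log);
    fflush(msg_log);
    fsync(fileno(msg_log));           /* the synchronous (pessimistic) step */

    return MPI_Send(buf, count, type, dest, tag, comm);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char name[64];
    snprintf(name, sizeof(name), "msglog.%d", rank);
    msg_log = fopen(name, "wb");

    int token = 42;
    if (size > 1) {
        if (rank == 0)
            logged_send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    fclose(msg_log);
    MPI_Finalize();
    return 0;
}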

Optimistic message logging techniques asynchronously log messages to stable storage in order to reduce the performance penalty of pessimistic techniques [145, 193, 250]. Since there is an opportunity for messages to be lost before they reach the log, optimistic message logging techniques suffer from the possibility of orphan processes and subsequently the domino effect. Causal message logging techniques attempt to combine the positive aspects of pessimistic (no orphans) and optimistic (low overhead) message logging by piggybacking volatile determinants on messages between processes [7, 8, 9, 21, 22, 28, 29, 85, 163, 222]. The cost of causal message logging techniques is the overhead of piggybacking volatile determinants on every message in the system. Adaptive (or hybrid) message logging techniques track dependency information in relation to checkpoints in order to log only those messages that cross a recovery line, and therefore could produce an orphan process [56, 189, 190, 287]. This is intended to reduce the stable storage overhead of logging by reducing the number of messages that need to be logged. A set of optimal message logging parameters has been presented in [9]: the number of forced rollbacks should be close to zero, as in pessimistic protocols; the amount of idle time added to a process during failure-free execution should be close to zero, as in optimistic protocols; the number of additional messages should be close to zero, as in optimistic protocols; and the additional size of the existing messages should be close to zero, as in pessimistic protocols.

6. Fault Recovery: Replication

Process replication keeps multiple copies (or replicas) of a process running and synchronized so that when one process fails a copy can continue execution in its place. A single-primary, multiple-backup replication technique uses a single process for communication, but keeps multiple backup copies synchronized with the state of the interaction [6, 86]. In contrast to a primary point of contact, the client can contact all replicas directly using multicast operations [120]. Voting and consensus algorithms are often used to determine acceptance of transactions or values to distribute to external sources [112, 157]. The dependence on total ordering of the multicast operation is critical to maintaining consistency among the replicas. Depending on the style of replication employed, scalability and performance problems may be an issue, leading some researchers to present a hierarchical application of a variety of techniques [117]. The degree of fault tolerance depends directly on the number of replicas maintained for each process in the system. A system can only withstand the concurrent loss of one less than the number of replicas of a process, since at least one must remain in order to continue computation and possibly recover the lost replicas. Roll-forward recovery techniques use duplex replication to verify program behavior and checkpoints in the case of soft errors during computation [212, 213]. In the case of process loss, a checkpoint can be used to restore a process and bring it into a consistent state with the cooperation of its replica peer. Passive replication combined with pessimistic message logging is another way to enable forward recovery [194]. n-Modular Redundancy is a popular technique to provide high availability in server and multiprocess environments [15, 16, 44, 89, 110, 162]. MPI/FT [16], VolpexMPI [162], P2P-MPI [110], and rMPI [97] have all attempted to apply replication to the MPI environment.

7. Debugging

Debugging has a long history in software engineering [60, 116, 118]. Reverse execution or back-stepping allows a debugger to step backwards through the program execution to a previous state, in addition to stepping forwards to the next state. Reverse execution is commonly achieved through the use of checkpointing [95, 281, 282], event/message logging [26, 161, 228, 292], or a combination of the two techniques [23, 150, 152, 201, 272]. When used in combination, the parallel debugger restarts the program from a checkpoint and replays the execution up to the breakpoint. A less common implementation technique is the actual execution of the program code in reverse without the use of checkpoints [4]. This technique faces challenges in handling complex logical program structures, which can interfere with the end user interaction and the applicability to certain programs. Though most of these techniques focus on transparently providing reverse execution, some have also explored language and compiler extensions to support such activities [289]. There are three predominant axes to consider when debugging large-scale applications [152]. The first axis is that of runtime, which becomes a factor when attempting to address a bug that only becomes apparent after hours of computation because of a race condition, or changing computational state in the parallel program. The second axis is the number of processes involved, which becomes a significant factor in applications that dynamically adjust the algorithms employed in response to the scale at which the application is run. Finally, the third axis is the program size in terms of lines of code involved in the analysis. Event logging is used to provide a deterministic re-execution of the program while debugging. Often this allows the debugger to reduce the number of processes involved in the debugging operation by simulating their presence through replaying events from the log. This is useful when debugging an application with a large number of processes. Event logging has also been used to allow the user to view a historical trace of program execution that can be inspected while debugging to trace the changes of a variable in reverse without re-execution [259, 260]. Message logging is a sub-domain of event logging in which only messages are logged instead of all non-deterministic events that might influence the application. This has been implemented above the MPI interface [68] and within the implementation [26, 176] for the explicit purpose of supporting debugging operations. There are two core techniques for event replay: contents-based replay and ordering-based replay [229]. In contents-based replay, the traces include the values of the events received or variables read. This typically produces larger trace files, but does not require the participation of all processes during replay since values do not need to be recomputed. In ordering-based replay, the traces include the relative order in which the events occurred. This produces smaller trace files at the cost of recomputing the values every time. The relative partial ordering of events is based on Lamport clocks [158, 227]. Both [161] and [229] present algorithms for ordering-based replay, whereas [176] presents an algorithm for content-based techniques. A sketch of the ordering-based approach is shown below.
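The sketch below gives a minimal flavor of ordering-based record and replay for wildcard receives: in record mode the source and tag of each completed receive are appended to a trace; in replay mode the trace pins each receive to the recorded source. It is not the algorithm of any cited work; the wrapper name, trace format, and command-line convention are assumptions.

#include <mpi.h>
#include <stdio.h>

static FILE *trace;
static int   replaying;    /* 0 = record, 1 = replay */

/* Ordering-based replay logs only the (source, tag) order of
 * non-deterministic receives, never the message contents. */
static int replay_recv(void *buf, int count, MPI_Datatype type,
                       int src, int tag, MPI_Comm comm, MPI_Status *st) {
    if (replaying && src == MPI_ANY_SOURCE) {
        /* Force the receive to the source recorded in the trace. */
        if (fscanf(trace, "%d %d", &src, &tag) != 2)
            return MPI_ERR_OTHER;
    }
    int rc = MPI_Recv(buf, count, type, src, tag, comm, st);
    if (!replaying)
        fprintf(trace, "%d %d\n", st->MPI_SOURCE, st->MPI_TAG);
    return rc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    replaying = (argc > 1);                      /* any argument => replay */
    char name[64];
    snprintf(name, sizeof(name), "order.%d.trace", rank);
    trace = fopen(name, replaying ? "r" : "w");
    if (!trace) MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0) {
        int v;
        MPI_Status st;
        for (int i = 1; i < size; i++)           /* arrival order varies */
            replay_recv(&v, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                        MPI_COMM_WORLD, &st);
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    }
    fclose(trace);
    MPI_Finalize();
    return 0;
}

Because only the ordering is recorded, the senders must re-execute and resend their messages during replay, which is the trade-off against contents-based replay described above.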

Adaptive message logging and checkpointing techniques have also been explored to reduce the size of the message logs [189, 190, 258, 287]. Checkpoint/restart is used to return the debugging session to an intermediary point in the program execution without replaying from the beginning of execution. For programs that run for a long period of time before exhibiting a bug, checkpointing can be used to focus the debugging session on a smaller period of time closer to the bug. Checkpoint/restart techniques are also useful for program validation and verification techniques that may be allowed to run concurrently with the parallel program on smaller sections of the execution space [243]. [192] discusses how to achieve replaying without using message logging by employing a technique similar to message induced checkpointing (see Section 4.7). In addition to reducing the amount of time and number of processes involved in the debugging process, program slicing is often used to reduce the amount of code that needs to be analyzed [280]. This is a useful technique when debugging large software systems such as operating systems. Some of this work has focused on the automatic validation of message passing pro- grams. Since all messages are traced, debuggers can apply algorithms to detect common parallel programming bugs such as race conditions and deadlocks [65, 188, 243, 287]. A technique called flowback analysis assists debuggers in finding race conditions based on the causal relationship between events [179]. For HPC applications, the MPI standard [178] has become the de facto standard message passing programming interface. Even though some parallel debuggers support MPI appli- cations, there is no official standard interface for the interaction between the parallel de- bugger and the MPI implementation. However, the MPI implementation community has in- formally adopted some consistent interfaces and behaviors for such interactions [58, 115]. The MPI Forum is discussing including these interactions into a future MPI standard. Chap- ter 5 will discuss how Open MPI was extended to include a design that supports C/R- enabled parallel debugging. 2. BACKGROUND AND RELATED WORK 34

8. Process Migration

Process migration is the ability to move, or migrate, a process from one resource to another. Though often used for fault tolerance there are other ways to apply this tech- nique [91]. Process migration plays a central role in proactive fault tolerance [267, 277] (sometimes called reconfiguration) in which a set of processes are moved away from a pre- dicted system failure either transparently or with the assistance of the application [241]. Additionally, process migration may support resource sharing by migrating processes to and from a shared resource on-demand in order to share the resource among a set of processes in a system. Load balancing uses process migration to dynamically manage the load on a distributed system moving processes away from heavily loaded machines to lighter loaded machines [129, 130]. The performance overhead in process migration is primarily attributed to the storage aspect of the checkpoint overhead involved when migrating a process from one resource to another. In addition to checkpoint overhead, the issue of residual dependencies is also an important consideration. A residual dependency is a dependency on the source machine after the process has migrated to the destination machine. For proactive fault tolerance, the technique must be free of residual dependencies since the source machine is likely to fail shortly after the migration. One residual dependency often forgotten about is that of the network state, and the connections between processes. The marshaling of the net- work state becomes a central issue in checkpointing distributed systems in general, but also specifically in high performance networking in which connectivity information may be dy- namic in nature [235]. Above all other considerations, the process must be represented accurately during any checkpoint or migration of the process. The numerical stability of the checkpointing algorithm, especially when migrating between heterogeneous machines, has a direct effect on the stability of scientific computation [142, 172]. There are a variety of techniques to reduce the checkpoint overhead due to the I/O requirements of moving a process from the source to the destination machine at the cost of checkpoint latency for process migration activities. Many of these techniques harness the 2. BACKGROUND AND RELATED WORK 35 copy-on-write semantics of operating system-level paging. The eager technique is the most direct in which the entire process image is transferred to the destination machine before resuming computation [139, 170, 261]. A pre-copy technique copies the process image to the destination machine while it is still running on the source machine [257, 277]. After a significant portion of the memory is copied, the process on the source machine is suspended and the remaining pages of memory are transferred along with control of the execution to the destination machine. A lazy technique transfers just enough of the process state to the destination machine to resume execution, then pages are transferred from the source machine to the destination machine using a demand-driven copy-on-reference approach [269, 288]. The core problem with this technique is the residual dependency on the source machine long after the process has been migrated. A post-copy technique builds upon the lazy technique by requiring that all of the memory pages are eventually transferred to the destination machine, giving higher priority to pages immediately referenced by the process running on the destination machine [224]. 
This removes the residual dependency restriction of the lazy technique. However, as with the lazy technique, it can become difficult to determine the minimal amount of state required to start the process running on the destination machine. A quasi-asynchronous migration technique allows non-migrating processes to continue communication while blocking communication to migrating processes [66]. This is ad- vantageous over a synchronous migration technique since non-migrating processes are not stopped for the duration of the migration. Additionally, this does not require any resid- ual dependencies as in many asynchronous migration techniques that require a daemon to forward messages in the system.
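To make the pre-copy idea concrete, the following toy sketch iteratively transfers dirty pages while the process keeps running on the source, then performs a short stop-and-copy phase for the remainder. The dirty-page tracking and transport are stand-ins (a shrinking counter simulates the dirty set converging), and the thresholds and round limit are assumptions, not values from any cited system.

#include <stdio.h>

#define NPAGES      4096
#define STOP_THRESH 16       /* stop-and-copy when this few pages remain */
#define MAX_ROUNDS  8

/* Stand-ins for real dirty-page tracking and a real transport. */
static int simulated_dirty = NPAGES;
static int collect_dirty_pages(void) {
    int d = simulated_dirty;
    simulated_dirty /= 8;     /* pretend most pages settle each round */
    return d;
}
static void send_pages_to_destination(int n) { printf("  sent %d pages\n", n); }
static void suspend_process(void)            { puts("source suspended"); }
static void resume_on_destination(void)      { puts("destination resumes"); }

/* Pre-copy migration: transfer pages while the source keeps running;
 * each round re-sends only the pages dirtied since the previous round,
 * then a short stop-and-copy phase moves the remainder and control. */
static void precopy_migrate(void) {
    int remaining = collect_dirty_pages();            /* round 0: everything */
    for (int round = 0; round < MAX_ROUNDS && remaining > STOP_THRESH; round++) {
        printf("round %d: %d dirty pages\n", round, remaining);
        send_pages_to_destination(remaining);
        remaining = collect_dirty_pages();            /* dirtied meanwhile   */
    }
    suspend_process();
    send_pages_to_destination(remaining);             /* final, small copy   */
    resume_on_destination();
}

int main(void) { precopy_migrate(); return 0; }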

9. File Systems

Processes that frequently interact with files face many difficulties when choosing a fault tolerance strategy. If a process creates a checkpoint, it must decide how to account for the state of the file system at the point of the checkpoint so that the application can recover a consistent file state in addition to the computation state. Most transparent checkpointing solutions will preserve the file descriptors and seek positions, which is adequate for applications that access files in a sequential manner, never delete files, and only append data to files [111]. If, for example, a process randomly accesses a file for both reading and writing, then these techniques do not apply. The checkpoint of the process could also include the entire file; however, depending upon the size of the file this may be a prohibitively expensive operation both in terms of disk space used and checkpoint latency. Versioned file systems keep multiple revisions of files as backups that are created and retained automatically by the file system [183]. A checkpoint/restart service can use this feature to account for the version of the file in the checkpoint generated, and transparently restore that version when restarting the application in cooperation with the file system [225, 279]. Compression techniques can also be used to reduce the size of backup files in the file system [45, 183]. Recently, it has been shown that by using specialized hardware and file handling techniques, checkpoint and restart rollback recovery techniques can be a viable solution for large-scale computation (specifically petascale and exascale machines) [73, 74]. This work echoes some of the remarks made in an earlier publication regarding the viability of coordinated checkpointing on petascale systems [84].

9.1. Checkpoint Optimized Stable Storage. Stable storage is a storage device that survives the maximum number of acceptable faults in the system. Typically stable storage is represented by a logically centralized file system such as a Storage Area Network (SAN) or Parallel File System (PFS). The I/O requirements of checkpointing to stable storage account for a significant portion of the overhead in checkpointing. This is usually because as multiple processes attempt to write checkpoint files concurrently to the stable storage device the network becomes congested, and the bandwidth is quickly exhausted. 9.1.1. Diskless Checkpointing. Diskless checkpointing removes the physical disk from the stable storage operation replacing it with unused main memory on peer machines [207]. This requires that processes do not use all of main memory, and take advantage of fast write 2. BACKGROUND AND RELATED WORK 37 speeds to main memory versus disk based techniques. Though these techniques cannot sur- vive the loss of all processes, they can survive the loss of a subset of the processes by repli- cating the checkpoint image amongst multiple peers. The replication degree and technique used determines exactly how many failures can be handled by the system and overhead in- curred [105, 223]. Diskless techniques are often used by ABFT techniques as a way of stor- ing checkpoint information about the computation in progress [53, 54, 88, 107, 160, 171]. Skewed checkpointing takes a more dynamic approach to choosing peers to replicate with in order to improve fault coverage without additional replication [185]. 9.1.2. Staging, Staggering, and Striping. The concurrent writing of checkpoint data from multiple processes to stable storage quickly exhausts the bandwidth to stable stor- age, especially when it is centralized. To address the bandwidth concerns three techniques have been presented in literature: staging, staggering, and striping. Checkpoint staging uses the local memory or disk to quickly write the checkpoint data reducing checkpoint overhead [264, 266]. This local checkpoint is then copied to stable storage concurrently while the application continues execution [36, 45, 205]. Checkpoint staggering reduces contention on the stable storage medium by reducing the number of processes concurrently writing to the stable storage device [49, 143]. Usually this is controlled by passing a token around the checkpointing processes that allows them permission to write to the stable storage device. Checkpoint striping builds upon checkpoint staggering by determining how many pro- cesses may write to the stable storage medium at a time [49, 143]. In the traditional checkpoint staggering approach only one process is allowed to write to stable storage at a time [264, 266]. Stable storage devices built from PFS typically use multiple I/O nodes to improve bandwidth to the file system. By adjusting the number of processes concur- rently writing, or the stripe, to accommodate the characteristics of the stable storage device checkpoint performance can be further improved [103]. 9.1.3. Checkpoint File Replication. Replicating checkpoint files in the stable storage medium improves the stability of the medium at the cost of additional copies of the data. 2. BACKGROUND AND RELATED WORK 38

Replication, in the form of RAID [202], has proven to be an effective technique when applied to disk based storage systems. Distributed file systems use a set of machines with local storage to form a single logical file system. Distributed file systems typically use RAID techniques to provide the fault tolerance necessary to serve as stable storage mediums [123, 216, 249]. Some distributed file systems dynamically manage the placement of replicas in order to improve access time and availability in the distributed file system, using replication both for access time and fault tolerance purposes [123, 156]. The management of replication is usually handled by a central broker, or distributed using a distributed voting or election algorithm to determine the most recent version of the file [112]. RAID-1, or checkpoint mirroring, is a common technique used to replicate checkpoints amongst peers in the system [5, 36, 40, 67, 76, 185, 205, 242, 291]. The peers may be processes in the application or dedicated checkpoint servers [25, 214, 274]. Incremental versions of checkpoint mirroring have been explored by various file systems; they typically operate on blocks of data within files that are overwritten, which may happen during a checkpoint operation that uses the same file for all checkpoints [5]. The replication of blocks of a file instead of the whole file has performance benefits at the cost of a more complex replication maintenance algorithm [111, 153]. RAID-5 or parity based techniques (including N+1 parity and Reed-Solomon coding) are used to withstand one or more concurrent failures depending on the parity encoding technique used [36, 40, 67, 205, 242]. Often the replication technique is applied transparently to all files written to the file system. However, some file systems allow the user to directly choose the replication strategy per file, and even change this strategy over the lifetime of the file in the file system [40]. Combining the various techniques to form a hierarchy is a common way to balance the expense of replication with the fault tolerance coverage requirements [25, 137, 156, 242]. The use of overlay networks and lightweight storage mediums has also shown promise as a technique for optimizing checkpoint storage [195, 196].
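The sketch below illustrates the simplest peer-based checkpoint mirroring scheme (RAID-1 style): each rank stores its own local snapshot and a copy of a partner's snapshot, so a single node failure leaves at least one copy of every snapshot. The pairing scheme, file names, and helper function are illustrative assumptions, not the design of any cited system.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void write_file(const char *name, const void *buf, int len) {
    FILE *f = fopen(name, "wb");
    fwrite(buf, 1, (size_t)len, f);
    fclose(f);
}

/* Write the local snapshot, then exchange snapshots with a partner
 * rank and store the partner's copy as a mirror. */
static void mirror_checkpoint(const void *snap, int len, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int partner = rank ^ 1;               /* pair even/odd ranks         */
    if (partner >= size) partner = rank;  /* odd process count: no peer  */

    char name[64];
    snprintf(name, sizeof(name), "snap.%d.local", rank);
    write_file(name, snap, len);          /* my own local snapshot       */

    if (partner != rank) {
        /* Exchange snapshot sizes, then the snapshots themselves. */
        int peer_len = 0;
        MPI_Sendrecv(&len, 1, MPI_INT, partner, 0,
                     &peer_len, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        char *peer_snap = malloc((size_t)peer_len);
        MPI_Sendrecv(snap, len, MPI_BYTE, partner, 1,
                     peer_snap, peer_len, MPI_BYTE, partner, 1,
                     comm, MPI_STATUS_IGNORE);
        snprintf(name, sizeof(name), "snap.%d.mirror", partner);
        write_file(name, peer_snap, peer_len);
        free(peer_snap);
    }
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    char state[256];
    memset(state, 7, sizeof(state));      /* stand-in for process state  */
    mirror_checkpoint(state, (int)sizeof(state), MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}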

PLFS [18, 19] found that by transparently turning an N-to-1 checkpoint write operation (all processes write to a single file) into an N-to-N operation (all processes write to indi- vidual files) they were able to significantly improve the overall performance of the check- point write and restart read operations at scale. Additionally, the stdchk [5] and SCR [36] projects showed that by using peer based and staging storage techniques for N-to-N check- point write operations they were able to significantly improve the checkpoint write and restart read performance. SCR also reported improvements to overall system reliability as a result of the reduced stress on the PFS [36].

10. Compiler Based Techniques

Compiler based approaches to checkpoint/restart fault tolerance are typically composed of two components: a pre-compiler, and a runtime support library. The pre-compiler is a source-to-source compiler that augments an existing application with calls into the associ- ated runtime support library in order to provide transparent checkpoint/restart capabilities. Typically the application may also identify potential checkpoint locations, and protected re- gions of execution, though they are not required to. To support distributed object migration both the DOME [17] and Charm++ [46] projects require the application to augment data structures with pack and unpack routines to assist checkpointing and restarting the processes/objects. The compiler then uses these markers to support checkpoint/restart activities during execution. In [166] these mark- ers were used to dynamically throttle the checkpoint interval based on time from the last checkpoint and the changing process size. Systems like Porch [251], CATCH [166] and the C3 [33, 35, 175] place po- tential checkpoint function calls throughout the code during the pre-compiler stage. These potential checkpoints are activated by the runtime library in accordance with the prede- fined checkpoint interval, and other system and application metrics. The system presented in [148] supports heterogeneous checkpointing of sequential programs. 2. BACKGROUND AND RELATED WORK 40

The IMPACT compiler focuses on recovery from transient processor failure instead of entire process failure [165]. The compiler allows the application to adjust the sliding window of instructions that can be retried upon processor failure. This provides a dynamic, compiler based alternative to hardware delay write buffers, with seemingly comparable performance.

3. Checkpoint/Restart Infrastructure

Checkpoint/Restart (C/R) rollback recovery is a technique used to reduce the amount of computation lost to process failure by restoring processes from a previously established point in the computation. Distributed C/R techniques rely on various coordination proto- cols to produce consistently recoverable parallel application states. When designing a C/R system it is important to first identify the set of capabilities that compose to define such an infrastructure. This chapter identifies the C/R capabilities that were composed to produce a fully coordinated C/R infrastructure, realized in the Open MPI implementation. We do not claim that these seven capabilities are a minimal or maximal set since future research may find that finer-grained or additional capabilities are necessary in order to adapt to future systems.


Seven C/R capabilities were distilled from experimentation, a review of previous C/R-enabled Message Passing Interface (MPI) implementations, and related literature. Six capabilities are detailed in this chapter, with the seventh capability, the recovery service, explored in Chapter 4. Below is a summary of the C/R capabilities in the order they are presented:

• Checkpoint/Restart Service (CRS): The interface to the single process C/R system provided by or for the system in order to capture an image of a running process for later recovery.
• Checkpoint/Restart Coordination Protocol (CRCP): The C/R coordination protocol implementation that marshals the network state to guarantee a consistently recoverable distributed state upon restart [48].
• Interlayer Notification Callback (INC): Notifying and coordinating subsystems of the MPI implementation around various checkpoint related activities (e.g., checkpoint, restart, migration).
• Stable Storage (SStore): A logical stable storage device abstraction encapsulating where and when local snapshots are stored in the distributed environment to form a global snapshot.
• File Management (FileM): The movement of snapshot related files and directories to and from storage devices, possibly across file system and node visibility boundaries.
• Snapshot Coordination (SnapC): Checkpoint life-cycle management: distributing the checkpoint request to all participating processes, monitoring their progress, and synchronizing the final local snapshots to a logical stable storage device.
• Error Management and Recovery Policy (ErrMgr): Error reporting and fault recovery management operations, including support for preventative actions such as process migration.

Previous C/R-enabled MPI implementations often combined instantiations of the necessary C/R capabilities mentioned above into their implementation. This inseparable design made it difficult for researchers to explore alternative techniques inside the same implementation. Identifying the C/R capabilities and describing their relation to one another will help C/R fault tolerance researchers design better implementations that are easier to maintain, spur innovation, and remain flexible enough to meet the demands of the High Performance Computing (HPC) community. In the Open MPI project, we clearly differentiate the various capabilities in the implementation and still demonstrate full operational support. This indicates that the inseparable design chosen by previous implementations is not required for a correct C/R implementation. As with most other C/R-enabled MPI implementations, the C/R capabilities and corresponding implementations in Open MPI only support MPI-1 standard functionality. The MPI-1 standard tends to be sufficient for many MPI applications, so we feel comfortable with the application coverage this restriction implies. The Open MPI implementation of the C/R capabilities provides a solid foundation of MPI-1 support, and support for MPI collective routines that are internally layered over point-to-point communication. This foundation was designed to be built upon in the future to support additional portions of the MPI standard, and component advancements (e.g., hardware collectives).

1. Open MPI Architecture

Open MPI is an open-source, high-performance, MPI-2 compliant implementation of the MPI standard [102, 178]. Open MPI is also dedicated to supporting fault tolerance research and aspires to provide users with a variety of optional, high performance, scalable, (semi-)transparent fault tolerance solutions. We have augmented the Open MPI implementation to provide users with the option of using a coordinated C/R fault tolerance technique. This chapter discusses the various C/R capabilities involved in such a technique and describes how these capabilities were realized in the Open MPI implementation. Open MPI is designed around the Modular Component Architecture (MCA) [245]. The MCA provides a set of component frameworks in which a variety of point-to-point, collective, and other MPI and runtime-related algorithms can be implemented. The MCA allows for runtime selection of the best set of components (implementations of the framework interfaces) to properly support an MPI application in execution.

FIGURE 3.1. Illustration of the three layers in Open MPI including some notable MCA frameworks. The highlighted, dashed boxes represent the capabilities discussed in this dissertation.

By isolating the various C/R capabilities in individual frameworks, researchers can focus on the development of an individual component instead of the development of an entire MPI library implementation in order to experiment with a new technique. Runtime component selection allows researchers to produce a reproducible, accurate comparison of two variants of a capability without recompiling the application. MCA frameworks in Open MPI are divided into three distinct layers: Open Portable Access Layer (OPAL), Open MPI Runtime Environment (ORTE), and Open MPI (OMPI). OPAL is composed of frameworks that are concerned with portability across various operating systems and system configurations along with various software development support utilities (e.g., linked lists). ORTE is composed of frameworks that are concerned with the launching, monitoring, and cleaning up of processes in the HPC environment. The OMPI layer is composed of frameworks that support the MPI interfaces exposed to the application layer. Figure 3.1 illustrates the positions of these three layers.

1.1. OMPI Layer. In the OMPI layer, most frameworks (e.g., Collectives, File I/O) sit above the Point-to-Point Management Layer (PML). The PML controls all point-to-point 3. CHECKPOINT/RESTART INFRASTRUCTURE 45 communication in Open MPI and exposes MPI point-to-point semantics. The PML frame- work is a stack of three frameworks all working together to provide flexibility and perfor- mance to the application. The PML framework breaks messages, with potentially complex data types, into byte streams that can be consumed by the lower layers. The Byte Trans- fer Layer (BTL) framework encapsulates each of the supported interconnects in Open MPI (e.g., shared memory, Ethernet, InfiniBand, Myrinet MX, etc.). Between the PML and BTL is the BTL Management Layer (BML) that provides the ability to stripe a message across multiple interconnects. This may include using multiple paths between peers, and using multiple interfaces of a single interconnect driver. The PML exchanges connectivity information, at startup, during a module exchange or modex procedure. This connectivity information is used to determine the best possible routes between all peers in the system. This provides the necessary information to all processes in the MPI application so that they can establish communication with any other peer in the MPI application.

1.2. ORTE Layer. Many frameworks are combined to form the ORTE layer. The com- munication frameworks in ORTE combine to provide a resilient, scalable, out-of-band com- munication path for the runtime environment. The Out-Of-Band (OOB) framework pro- vides low-level interconnect support (currently TCP/IP based). The Runtime Messaging Layer (RML) framework provides a higher-level point-to-point communication interface in- cluding basic datatype support. The Group Communication (GrpComm) framework pro- vides group communication, collective-like operations among various processes active in the runtime environment. The Routing Table (Routed) framework provides a scalable rout- ing topology for the GrpComm framework. The Process Lifecycle Management (PLM) framework is responsible for launching, mon- itoring, and terminating processes in the runtime environment. The ORTE Daemon Local Launch Subsystem (ODLS) framework provides the same services as the PLM, but on a lo- cal node level. The Resource Mapping Subsystem (RMapS) framework is responsible for mapping a set of processes onto the currently available resources. 3. CHECKPOINT/RESTART INFRASTRUCTURE 46

The ErrMgr framework is accessible throughout the ORTE and OMPI layers. This framework provides a central reporting location for detected or suspected process or communication failures internal to a process. The Notifier framework works with the ErrMgr framework and provides an interface for processes to send and receive reports on abnormal events in the system, including process failure and communication loss. The Notifier framework also interfaces with external detectors such as the Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) Fault Tolerance Backplane (FTB) [122]. In Chapter 4, we will describe how the ErrMgr has been extended to include support for a composable set of recovery policies. Most of the remaining C/R capabilities are realized in the ORTE layer since they do not require access to the MPI layer capabilities, but do require access to the other processes.

1.3. OPAL Layer. As with the OMPI and ORTE layers, many frameworks are combined to form the OPAL layer. The OPAL layer is concerned with supporting the current process, and has little to no knowledge of other processes in the execution environment. The OPAL CRS framework provides an Application Programming Interface (API) to the single process C/R service on a particular machine, a fundamental capability in any C/R design.

1.4. Control Flow. The C/R infrastructure integrated into Open MPI is defined by pro- cess and job levels of control. These two levels of control work in concert to create stable job-level checkpoints (called global snapshots) from each individual process-level check- point (called local snapshots). The various C/R frameworks are represented in Figure 3.2 in their relative position of control.

2. External Tools and Interfaces

Defining clear interfaces and abstract tools is important for end user adoption. APIs provide the application with direct control from within the program over when C/R related activities occur and with notification of their progress. Command line tools allow end users to request C/R related activities asynchronously during normal application execution. The end user is defined as a person (e.g., software developer, system administrator, application user) or an external software agent (e.g., scheduler, resource manager).

[Figure 3.2 layout: Process-level frameworks: INC, ErrMgr, CRS, CRCP. Runtime-level frameworks: ErrMgr, Notifier, SnapC, SStore, FileM.]

FIGURE 3.2. Organization of Open MPI C/R frameworks in a general MPI application. Job-level control is positioned in the Runtime. Process-level control is positioned in the Process.

2.1. Application Programming Interface. The application developer can directly request a checkpoint, restart, or migration of the application from within the program using non-MPI-standard APIs provided by Open MPI through the Open MPI Extended Interfaces. These APIs also allow an application to monitor and participate in the C/R activities. The specific APIs are described in more detail in Chapter 5 and Appendix A.
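
As an illustration only, an application-driven checkpoint request could be wrapped as in the sketch below. The function name OMPI_CR_Checkpoint and its output parameters are placeholders standing in for the actual extended interfaces, which are specified in Chapter 5 and Appendix A.

/* Hypothetical sketch of an application-driven checkpoint request.
 * The name and signature of OMPI_CR_Checkpoint are placeholders for
 * the actual Open MPI Extended Interfaces (Chapter 5, Appendix A). */
#include <stdio.h>
#include <mpi.h>

/* Placeholder prototype standing in for the extended interface. */
int OMPI_CR_Checkpoint(char **global_snapshot_ref, int *seq, int options);

static void checkpoint_now(void)
{
    char *snapshot_ref = NULL;
    int   seq = -1;

    /* Ask the MPI library to checkpoint the whole job and report the
     * resulting global snapshot reference and sequence number. */
    if (MPI_SUCCESS == OMPI_CR_Checkpoint(&snapshot_ref, &seq, 0)) {
        printf("Checkpoint %d stored as %s\n", seq, snapshot_ref);
    }
}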

2.2. Command Line Tools. The end user can asynchronously request a checkpoint of a running MPI application by using the ompi-checkpoint command. This command line tool only requires the end user to know the Process Identifier (PID) of the Head Node Process (HNP) process (a.k.a. mpirun), and does not require the end user to know any of the internal details of how the checkpoint will occur. This is in contrast to other C/R-enabled MPI implementations which require the end user to track this metadata by hand.

To restart the MPI application using a previously established checkpoint the end user is provided the ompi-restart command. The end user passes the global snapshot reference reported by ompi-checkpoint to the ompi-restart command. The ompi-restart command uses the metadata stored at checkpoint time to restart the MPI application.

The ompi-migrate tool provides a command line interface for end users to request a process migration within a running MPI application. In addition to allowing the end user to specify which processes to migrate, this command line interface allows an end user to provide a suggested list of target nodes to use in place of the affected nodes. The ability to suggest destination nodes allows a system administrator, for example, to move processes from a set of nodes going down for maintenance to a set of nodes dedicated to the process for the duration of the maintenance activity. This tool also allows end users to experiment with using process migration for load balancing, since they can also specify specific process ranks in MPI_COMM_WORLD instead of just nodes for migration. The interface to these three tools is described in more detail in Appendix B.

2.3. Snapshot Representation. One of the hurdles standing in the path of wider end user fault tolerance adoption is usability and integration overhead. The fault tolerance interfaces provided by a system should be convenient, intuitive, and easy to use without sacrificing robustness and flexibility. Previous C/R-enabled MPI implementations required the user to remember specifics regarding exactly how the application was started in order to successfully checkpoint and/or restart the parallel application. A system administrator or scheduler has no access to this runtime information, and must consult the user for these parameters before taking action on a job.

Additionally, these same implementations tended to burden the end user with the responsibility of tracking the location of all the individual checkpoint-generated files. Depending on the CRS, the file set may be one or many files with specific naming conventions. Tracking these sets across checkpoint requests quickly becomes tedious and error prone. This task may be considered an abstraction violation since the end user, to find the restart files, must know the CRS(s) used by the C/R-enabled MPI implementation for the application.

Our design addresses both of these issues by introducing an abstract snapshot reference. A snapshot reference is a single named reference to the checkpoint that was taken of a single process or a parallel job. There are two types of snapshot references: local and global. A local snapshot reference identifies a single process checkpoint; it refers to a directory containing a metadata file describing: the CRS used; any application-specific parameters; and a unique checkpoint identifier.

shell$ ls opal_snapshot_0.ckpt
ompi_blcr_context.29121  snapshot_meta.data
shell$ cat opal_snapshot_0.ckpt/snapshot_meta.data
# Timestamp: Thu Apr 01 01:02:03 2010
# PID: 29121
# OPAL CRS Component: blcr
# CONTEXT: ompi_blcr_context.29121
# Timestamp: Thu Apr 01 01:02:04 2010

FIGURE 3.3. Local Snapshot Reference Directory Structure and Metadata Contents.

The directory also contains all of the single process C/R specific files. Each snapshot generated is designated a unique identifier, usually the rank in MPI_COMM_WORLD, that differentiates one local snapshot from another. Figure 3.3 shows the local snapshot directory structure and metadata contents in Open MPI.

The second type of snapshot reference is the global snapshot reference, which refers to a collection of local snapshots resulting from a single checkpoint request of an application. The global snapshot reference is represented as a directory containing a metadata file describing: the aggregated local snapshot references; process information (e.g., last known rank in MPI_COMM_WORLD); runtime parameters; and a global checkpoint sequence number. The sequence number is a monotonically increasing number, starting at 0 and unique to the job, that differentiates subsequent checkpoints from one another. The global snapshot directory also contains the physical set of local snapshots, one from each process in the checkpoint interval. Figure 3.4 shows the global snapshot directory structure and metadata contents in Open MPI.

The global and local snapshot references abstract the user away from the number and names of the checkpoint-generated files, alleviating the need for end users to track multiple files for a single distributed checkpoint interval.

shell$ ls ompi_global_snapshot_10587.ckpt
0  global_snapshot_meta.data
shell$ ls ompi_global_snapshot_10587.ckpt/0
opal_snapshot_0.ckpt  opal_snapshot_1.ckpt  opal_snapshot_2.ckpt  opal_snapshot_3.ckpt
shell$ cat global_snapshot_meta.data
# Seq: 0
# Local Snapshot Format Reference: opal_snapshot_%d.ckpt
# Timestamp: Thu Apr 01 01:02:03 2010
# AMCA: ft-enable-cr
# Process: 1659895809.0
# OPAL CRS Component: blcr
# Process: 1659895809.1
# OPAL CRS Component: blcr
# Process: 1659895809.2
# OPAL CRS Component: blcr
# Process: 1659895809.3
# OPAL CRS Component: blcr
# Timestamp: Thu Apr 01 01:02:05 2010
# Finished Seq: 0

FIGURE 3.4. Global Snapshot Reference Directory Structure and Metadata Contents.

The user is only responsible for the preservation of a directory containing all the relevant checkpoint information. Additionally, the end user does not need to know the underlying CRS(s) used in order to properly preserve the checkpoint files. This design also alleviates the need for the end user to know which runtime parameters the job was originally started with, by automatically detecting them at checkpoint time and placing a reference to the parameters in the metadata files in the snapshot references. During restart the metadata files are used to determine how to restart the entire job properly.
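
For example, a restart tool only needs to read the metadata file to learn which CRS was used for each process. The minimal sketch below parses a global snapshot metadata file, assuming the key/value layout shown in Figure 3.4; it is an illustration, not the parser used by ompi-restart.

/* Minimal sketch: read a global_snapshot_meta.data file (format as in
 * Figure 3.4) and print each process with the CRS component it used. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char line[256], process[128] = "", crs[64];
    FILE *fp = fopen(argc > 1 ? argv[1] : "global_snapshot_meta.data", "r");
    if (NULL == fp) { perror("fopen"); return 1; }

    while (NULL != fgets(line, sizeof(line), fp)) {
        if (1 == sscanf(line, "# Process: %127s", process)) {
            continue;                          /* remember the process name */
        }
        if (1 == sscanf(line, "# OPAL CRS Component: %63s", crs)) {
            printf("%s restarts with CRS '%s'\n", process, crs);
        }
    }
    fclose(fp);
    return 0;
}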

The level of abstraction provided by the snapshot references allows for the possibility of heterogeneous CRS support. Single process CRSs tend to be closely tied to the operating system on which they run, and generate binary files intended to be restarted on the same type of system. A job spanning a heterogeneous environment must incorporate the checkpoints produced by potentially different CRSs into a single global snapshot. Similarly, in homogeneous environments it may be advantageous to use one CRS on a subset of the processes in the job and another on the rest of the processes. The files generated from these distinct CRSs are likely to be incompatible due to implementation differences, but can still be incorporated into the same global snapshot if the restart mechanism is able to properly map onto the heterogeneous environment as required by the global snapshot.

3. Checkpoint/Restart Service

The Checkpoint/Restart Service (CRS) capability captures an image of a single running process (and all associated threads) for later recovery. System-level CRS implementations tend to be tied to a specific operating system type or revision. This tight coupling provides them with a more detailed view of the target process, allowing for more detailed coverage of the process in the checkpoint. User-level CRS implementations tend to exist above the operating system, making them more portable but sacrificing their ability to directly view some process details without virtualization. Both varieties of CRSs capture a snapshot of a single process on the system and save it to an SStore designated storage device (discussed in Section 6). Often the CRS is unable to account for the state of objects that exist outside of the process scope, such as file system or network interconnect states, which may be required for the proper recovery of an application. For further discussion of single process CRSs, see Chapter 2.

In essence, a local CRS is required to provide two tasks: checkpoint and restart. The CRS must allow an implementation to request a checkpoint of a specific process identified by its PID, and return a reference to the generated local snapshot for later restart. The PID may be that of the requesting process or another process on the same machine. The CRS must allow an implementation to request a restart of a process on the local machine provided a local snapshot reference generated by the CRS during a previous checkpoint operation. The CRS capability is activated by the INC capability from within the process when the system has been sufficiently prepared for the checkpoint. The CRS will interact with the SStore capability to determine where the local snapshot should be written.
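
In other words, a CRS only needs to expose something with the following shape. The type and function names below are illustrative only, not the actual OPAL CRS framework symbols (see Section 3.1).

/* Illustrative shape of the two required CRS operations; names are
 * placeholders, not the actual OPAL CRS framework symbols. */
#include <sys/types.h>

typedef struct {
    char snapshot_ref[256];   /* local snapshot reference (directory name) */
    int  state;               /* CONTINUE, RESTART, or ERROR after return  */
} crs_snapshot_t;

/* Checkpoint the process identified by 'pid' (the caller itself or
 * another local process) and fill in the local snapshot reference
 * that can be used for a later restart. */
int crs_checkpoint(pid_t pid, crs_snapshot_t *snapshot);

/* Restart a process on this machine from a previously generated
 * local snapshot reference. */
int crs_restart(const crs_snapshot_t *snapshot);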

3.1. Implementation. Open MPI provides a single process CRS framework in the OPAL layer. It is implemented at the OPAL layer since the CRS framework's functionality is limited to a single machine. The OPAL CRS framework provides a consistent API for Open MPI to use internally regardless of the underlying CRS API available on a specific machine. Each such system implements a component in the OPAL CRS framework that matches the framework API to the CRS's API. The CRS framework API provides the two basic operations of checkpoint and restart. In addition, the OPAL CRS framework requires components to implement the ability to enable and disable checkpointing in the system to assist in protecting non-checkpointable sections of code. In the future, additional functionality may be added to the CRS framework to support advanced features such as memory inclusion and exclusion hints for CRSs that support such operations [209]. In Open MPI checkpointing is enabled upon completion of MPI_INIT and disabled upon entry into MPI_FINALIZE. This restriction allows checkpointing only while MPI is active, since the C/R framework is a part of the MPI infrastructure and is therefore initialized and finalized within the library.

There currently exist two components of the OPAL CRS framework. The first component is a Berkeley Lab Checkpoint/Restart (BLCR) implementation. BLCR is a system-level transparent CRS that operates as a kernel module. The second component is a SELF CRS component supporting application-level checkpointing by providing the application callbacks upon checkpoint, restart, and continue operations. The API for the SELF CRS component is detailed in Appendix C.
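
As a rough illustration of what SELF-style application-level checkpointing looks like, the sketch below shows three callbacks an application might supply. The callback names and the restart_cmd parameter are hypothetical; the actual registration mechanism and naming conventions are given in Appendix C.

/* Sketch of SELF-style application-level callbacks; names and the
 * restart_cmd parameter are hypothetical (see Appendix C for the
 * actual interface). */
#include <stddef.h>

/* Called before the checkpoint is taken: save application state. */
int my_app_checkpoint(char **restart_cmd)
{
    *restart_cmd = NULL;        /* let Open MPI decide how to restart */
    /* ... write application state to a file ... */
    return 0;
}

/* Called in the original process after the checkpoint completes. */
int my_app_continue(void)
{
    return 0;                   /* typically nothing to do */
}

/* Called in the restarted process: reload the saved state. */
int my_app_restart(void)
{
    /* ... read application state back from the file ... */
    return 0;
}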

4. Checkpoint/Restart Coordination Protocol

A snapshot of a process is defined as the state of the process and all connected communication channels [48]. The CRCP capability encapsulates the algorithm necessary to marshal the state of the connected communication channels. Local CRSs are unable to account for the state of communication channels, as that requires both knowledge of and the ability to coordinate with remote processes. Given this restriction, a distinct capability is required to coordinate all the processes to create known channel states. Knowing the state of all connected communication channels is critical when forming a consistent global snapshot of the parallel job from which the processes can be accurately restarted at a later time. Many CRCPs exist and can be generally classified into one of three categories, as detailed in Chapter 2: coordinated, uncoordinated, and communication- or message-induced. Each protocol balances the demand for low overhead failure-free operation with the complexity of recovery in the event of unexpected process termination due to system failure.

The MPI implementation must be provided with an API to use internally when interacting with various coordination protocol implementations. Many protocols require the ability to track all point-to-point messages in the system to aid in recovery [278]. Other protocols require the ability to piggyback data on outgoing messages, and take action on incoming messages such as taking forced checkpoints [174]. Therefore these coordination services need access to the MPI implementation's internal point-to-point layer. This allows the coordination services to watch the network traffic as it moves through the system and take the necessary actions.

The INC capability (discussed in Section 5) activates the CRCP capability before the CRS. This ensures that the network state is accounted for before the process state is captured.

4.1. Implementation. The Open MPI CRCP framework implements coordination protocols that control for in-flight messages [48]. The CRCP framework is positioned in the OMPI layer above the PML framework and tracks all messages moving in and out of the point-to-point stack. The components are provided access to the internal PML framework [284] by way of a wrapper PML component. The wrapper PML component allows the OMPI CRCP components the opportunity to take action before and after each message is processed by the actual PML component.

The CRCP bkmrk component in Open MPI is an all-to-all bookmark exchange algorithm similar to the one used by LAM/MPI [234], except that instead of operating on bytes it operates on entire MPI messages. In this algorithm, processes exchange message totals between all peers on checkpoint, then wait for the totals to equalize. This equilibrium indicates a quiet or quiescent channel and guarantees no in-flight messages. Care is taken in the coordination algorithm to avoid deadlock during the draining process, and to preserve MPI semantics.

Once the OMPI CRCP component has completed its coordination of the processes, the PML's ft_event function is called. The PML ft_event function (part of the INC described in Section 5) involves shutting down interconnect libraries that cannot be checkpointed and reconnecting peers when restarting in new process topologies. Coordination services should receive checkpoint notification before any MPI subsystem. This ordering provides coordination services flexibility in their protocol implementation by not restricting the MPI subsystems available.

Positioning the coordination algorithm above the point-to-point stack differs from most transparent C/R-enabled MPI implementations. Most C/R-enabled MPI implementations position their coordination protocols in the individual interconnect drivers. By implementing the algorithm in the interconnect driver, the algorithm needs to track bytes being moved through the interface, monitor special behaviors of the device such as Remote Direct Memory Access (RDMA) operations, and, for the most part, does not need to worry as much about MPI-level semantics. However, forcing the state to be saved as part of the interconnect driver within the point-to-point stack requires the application to be restarted in a similar process layout. This technique hinders performance by restricting the ability to choose the fastest routes upon restart. Additionally, the coordination algorithm often becomes muddled with device driver restrictions and requires the driver to provide strict First-In-First-Out (FIFO) ordering of messages.

By lifting the algorithm out of the interconnect driver and placing it above the point-to-point stack we are able to save the point-to-point state at a high enough level to allow for the reconfiguration of the interconnects upon restart. This reconfiguration enables us to adapt to the new process layout to achieve better performance. This comes at the cost of a slightly more complex implementation of the coordination algorithm, since it must operate on entire messages with MPI semantic restrictions.
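
The bookmark exchange can be summarized with the following simplified sketch. The helpers exchange_bookmarks and drain_one_message are hypothetical stand-ins for the out-of-band exchange and the PML-level receive performed by the real bkmrk component.

/* Simplified sketch of the all-to-all bookmark exchange quiescence
 * used by the bkmrk CRCP component.  exchange_bookmarks() and
 * drain_one_message() are hypothetical helpers. */
typedef struct {
    unsigned long sent;       /* messages this process sent to the peer  */
    unsigned long recvd;      /* messages this process received from it  */
} bookmark_t;

extern int num_peers;
extern bookmark_t local[];                      /* indexed by peer rank  */

void exchange_bookmarks(bookmark_t *remote);    /* all-to-all exchange   */
void drain_one_message(int peer);               /* receive one in-flight */

static void crcp_quiesce(void)
{
    bookmark_t remote[num_peers];

    /* 1. Tell every peer how many messages we have sent and received. */
    exchange_bookmarks(remote);

    /* 2. Drain until everything each peer sent to us has arrived. */
    for (int peer = 0; peer < num_peers; peer++) {
        while (local[peer].recvd < remote[peer].sent) {
            drain_one_message(peer);
            local[peer].recvd++;
        }
    }
    /* All channels are now quiescent: no in-flight messages remain. */
}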

4.2. Interconnect Driver Support. There are three distinct phases of a checkpoint operation: pre-checkpoint, continue, and restart. In the pre-checkpoint phase the process is provided an opportunity to prepare for a requested checkpoint. This involves bringing external resources (e.g., files, network connections) to a stable, checkpointable state. The continue phase occurs just after a checkpoint has been taken and allows the process to recover any external resources that it may have stabilized or suspended during the pre-checkpoint phase. The restart phase occurs when the process is restarted from stable storage and provides the process with an opportunity to flush caches and reconstruct any information needed to continue normal operation.

The CRCP framework is positioned above the PML framework and ensures that the lower levels of the point-to-point stack have been drained of all messages coming from and going to this peer process. We can take advantage of this assurance when moving the point-to-point stack through the three phases of checkpointing. It should be noted that we do not require that all processes drain their messages before checkpointing any individual process, only that the individual process taking the checkpoint, at that moment, must be drained of messages. Future work may explore other coordination algorithms that allow for even looser synchrony between processes, such as [103, 263, 266].

After the CRCP capability has completed its pre-checkpoint quiescence operation the INC (described in Section 5) may choose to shut down all active interconnect drivers in order to avoid problematic interactions between them and the CRSs. However, in order to minimize the failure-free overhead, the INC will want to do only the minimum amount of work necessary to bring the process into a checkpointable state. As a performance optimization, our implementation allows interconnect drivers to indicate if they are checkpoint friendly, meaning that they can be safely checkpointed while active, given the CRS currently in use on the system. This optimization is a result of recognizing that shutting down and re-initializing all of the interconnect drivers during a checkpoint operation is often expensive and not always necessary. Since this optimization occurs in the INC and outside of the CRCP, the CRCP remains interconnect agnostic even when this optimization is enabled [135]. In the restart phase, Open MPI needs to first clear out the previous set of interconnects and connectivity (i.e., modex) information, since the machine set and corresponding network addresses will typically have changed. Then Open MPI re-exchanges the modex to get the new connectivity information and reconnects the processes, selecting a new set of interconnect drivers (i.e., BTLs) that best matches the new process layout.

4.2.1. TCP/Ethernet Driver. The Ethernet driver in Open MPI (tcp BTL) is classified as checkpoint friendly since it does not need to do anything during the pre-checkpoint phase. This is because the file descriptors of the open sockets are preserved across a checkpoint operation by most CRSs. Since nothing was closed during the pre-checkpoint phase, nothing needs to occur during the continue phase. The restart phase must make sure to close the old socket file descriptors before the modex can occur so as not to waste resources.

4.2.2. Shared Memory Driver.
The shared memory driver in Open MPI (sm BTL) is also checkpoint friendly since it also does not need to do anything during the pre-checkpoint phase. This is because the CRS will not checkpoint the contents of open files or shared memory segments, but just preserve the file descriptors to these resources. It should be noted that some CRSs (e.g., BLCR) may, optionally, preserve a shared memory segment shared by processes of the same group (family of processes) as long as they are restarted together. Since the goal of our implementation is flexibility in the process layout on restart, Open MPI does not take advantage of this feature. Instead, we force such CRSs to keep only the file descriptor reference to the memory-mapped shared memory file, but not its contents. Upon restart, we must make sure to close the stale file descriptor before the modex operation. The modex reestablishes the shared memory segment with the set of peers on the node at restart time, which may be different than when originally checkpointed.

4.2.3. InfiniBand Driver. The InfiniBand driver in Open MPI (openib BTL) uses the OpenFabrics [200] interface to interact with a wide range of InfiniBand hardware. In the pre-checkpoint phase, Open MPI must close the driver, which entails closing the ports and releasing all resources allocated on the InfiniBand card, as confirmed by [104]. Experimentally, if the BLCR CRS is used with the InfiniBand driver still active, then the kernel becomes unstable and panics, which results in node loss. Since Open MPI closes the connections during pre-checkpoint, on continue it must reopen the InfiniBand driver and reestablish connections with its peers. So, because it has to re-exchange the modex and reconnect peers, the continue and restart phases look similar at the interconnect level. However, the continue operation can take advantage of the fact that the processes have not changed position, and are likely still reachable via the previously established set of interconnects. So there is no need, for example, to reconnect shared memory segments since processes on the same node have not moved. On restart, Open MPI needs to consider the new topology information when establishing routes in order to account for peer processes that may have moved to different machines.

4.2.4. Myrinet Driver. The Myrinet driver in Open MPI (mx BTL) uses the Myrinet MX interface to interact with Myrinet interconnect hardware. In the pre-checkpoint phase Open MPI must close the driver, which entails closing the ports and releasing all resources allocated on the Myrinet card; this is similar to the procedure with the InfiniBand driver. However, with the Myrinet driver, the limitation is not kernel instability (as it is with the InfiniBand driver), but reconnecting to the device driver upon restart. On restart, the BLCR CRS will attempt to reallocate the open endpoint file descriptors with the network card, but will receive a permission denied error and fail to restart the application. To avoid this limitation, Open MPI shuts down the mx BTL before a checkpoint. On continue it must reopen the connection to the Myrinet card and reestablish connections with its peers. This requires a modex operation to re-exchange the new contact information. Future work may look into ways around this limitation.
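
The per-driver behavior described above can be summarized as an ft_event handler. The sketch below shows the pattern for a driver that is not checkpoint friendly (e.g., openib or mx); the helper functions and the numeric values of the state constants are illustrative, not the actual Open MPI symbols.

/* Sketch of an ft_event handler for a BTL that is not checkpoint
 * friendly.  Helper names and state values are illustrative only;
 * checkpoint-friendly drivers (tcp, sm) simply return success for the
 * CHECKPOINT and CONTINUE states. */
enum { OPAL_CRS_NONE, OPAL_CRS_CHECKPOINT, OPAL_CRS_CONTINUE,
       OPAL_CRS_RESTART, OPAL_CRS_TERM, OPAL_CRS_ERROR };

int example_btl_close_device(void);            /* release card resources */
int example_btl_reopen_device(void);           /* reopen after continue  */
int example_btl_discard_stale_endpoints(void); /* drop old endpoints     */

int example_btl_ft_event(int state)
{
    switch (state) {
    case OPAL_CRS_CHECKPOINT:
        /* Pre-checkpoint: close ports and release all resources on the
         * card so the CRS never captures live device state. */
        return example_btl_close_device();
    case OPAL_CRS_CONTINUE:
        /* Same process layout: reopen the device and reconnect peers
         * once the modex has been re-exchanged. */
        return example_btl_reopen_device();
    case OPAL_CRS_RESTART:
        /* Possibly new layout: discard stale endpoints before the
         * modex so that new routes can be selected. */
        return example_btl_discard_stale_endpoints();
    default:
        return 0;   /* nothing to do for NONE, TERM, or ERROR here */
    }
}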

5. Interlayer Notification Callback

The Interlayer Notification Callback (INC) capability represents the internal coordination necessary to marshal the many active services within a process to prepare for a checkpoint, continue after a checkpoint, or restart a process from a checkpoint. These operations are also referred to as the pre-checkpoint, continue, and restart phases, respectively. This coordination may include flushing caches and activating C/R-specific code paths. The INC provides the infrastructure for process-level control of frameworks to cooperate around a C/R-related operation. The INC is in charge of activating the CRCP capability to marshal the network state, and the CRS to capture the process state. The INC is managed by the SnapC capability, which listens for checkpoint requests in the system.

5.1. Implementation. A single process CRS may only preserve a subset of the process state. As previously mentioned, CRSs tend not to account for the state of communication channels. Therefore subsystems within an MPI implementation need to receive notification around C/R requests. In the Open MPI design each subsystem that requires such a notification implements an ft_event function defined as follows:

int ft_event(int state);

This function is meant to encapsulate most, if not all, of the subsystem-specific logic needed to respond to a C/R-related activity notification. By isolating this logic in this function, C/R notifications can have a minimal impact on the implementation of the subsystem, making the entire C/R integration more maintainable. The ft_event function takes a single state argument indicating the state of the C/R protocol at the time of the function call. The supported states are presented in Table 3.1.

An INC driver notification routine is responsible for calling each subsystem's ft_event function in the proper order upon receiving a checkpoint or restart request. These driver notification routines are called Interlayer Notification Callbacks. For a monolithic library design only a single INC may be needed. For library designs involving multiple layers of abstraction, such as Open MPI, one INC may be needed for each layer.

State                  Description
OPAL_CRS_NONE          No checkpoint in progress.
OPAL_CRS_CHECKPOINT    Pre-checkpoint phase.
OPAL_CRS_CONTINUE      Continue after a checkpoint.
OPAL_CRS_RESTART       Restarting from a checkpoint.
OPAL_CRS_TERM          Prepare for termination after a checkpoint operation.
OPAL_CRS_ERROR         Error has occurred during the checkpoint.

TABLE 3.1. INC states used in the ft_event function.

Once the INC finishes preparing the library for a checkpoint, it then calls the single process CRS. Once the checkpoint has completed, it uses the ft_event function to notify the subsystems of the resulting state of the process.

The design presented so far provides the MPI library the opportunity to prepare for and respond to C/R requests. The application can be viewed as a layer existing above the MPI library. Since some resilient HPC applications may also want to be notified of such INC state events, Open MPI allows them to register INC callback functions, described in Chapter 5 and Appendix A. Multi-layered MPI libraries can use this mechanism to register their INCs. INC callbacks have a function definition similar to ft_event, and are of the form:

int layer_inc(int state);

INCs take the same argument as that passed to the ft_event function. INCs are registered by calling a registration function which returns the previously registered callback. It is the newly registered INC's responsibility to call the previous INC from within its own INC callback. This responsibility ensures a stack-like ordering of INC calls, and gives an INC callback the opportunity to take action before and after calling the previous INC callback, which typically belongs to a lower level of the software hierarchy.
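
The stack-like chaining can be illustrated with the sketch below. The registration function name is a placeholder for the actual API described in Chapter 5 and Appendix A.

/* Sketch of stack-like INC chaining.  The registration call is a
 * placeholder for the actual API (Chapter 5, Appendix A). */
#include <stddef.h>

typedef int (*inc_callback_t)(int state);

/* Placeholder: registers 'cb' and returns the previously registered INC. */
inc_callback_t register_inc_callback(inc_callback_t cb);

static inc_callback_t previous_inc = NULL;

static int app_inc(int state)
{
    int ret = 0;

    /* Take application-level action before the lower layers ... */

    /* ... then hand control down the stack ... */
    if (NULL != previous_inc) {
        ret = previous_inc(state);
    }

    /* ... and optionally act again after the lower layers return. */
    return ret;
}

static void install_app_inc(void)
{
    previous_inc = register_inc_callback(app_inc);
}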


FIGURE 3.5. Illustration of Open MPI Handling a Checkpoint Request. Includes the optional application layer INC callback.

In Open MPI, each process in the parallel job typically has a thread running in it waiting for a checkpoint request. This thread is called the checkpoint notification thread. The thread receives a checkpoint request notification from the runtime environment via the SnapC framework and proceeds into the OPAL entry point function to begin the notification process, as seen in Figure 3.5. If the C/R notification thread is not active, then the C/R request is processed when the application process enters or leaves the MPI library. The OPAL entry point function then calls the topmost registered INC function. There are three INC functions in Open MPI, one for each layer in the software stack. If the application registered an INC then it has the opportunity to use the full suite of MPI functionality before allowing the library to prepare for a checkpoint. The API provided to MPI applications for INC registration is described in Chapter 5 and Appendix A.

Since the checkpoint notification thread executes concurrently with other threads in the process, the notification procedure typically does not interfere with the progress of the process. A thread in the process is only stopped when it tries to access a part of the Open MPI library that has been notified and restricts that particular operation from continuing until the checkpoint is complete. For example, the point-to-point layer may not allow a call to MPI_SEND to begin between when a checkpoint was requested and its completion. In Open MPI, each INC uses the ft_event function to notify framework components of the checkpoint request. This function is an extension to existing framework APIs. Using a separate function for this type of notification has proven useful in isolating fault tolerance specific logic, thereby improving maintainability.

6. Stable Storage

A stable storage device is defined as a storage medium that ensures that the recovery information persists through the tolerated failures and their corresponding recoveries [82]. The Stable Storage (SStore) capability encapsulates the technique required to provide a logical stable storage abstraction. In practice, non-transient failure of one or more machines in the system needs to be tolerated by the SStore. Therefore many administrators provide a shared Redundant Array of Independent Disks (RAID) file system that persists past the failure of any machine in the system.

A global snapshot is said to be established on stable storage when it can be used to recover the application from stable storage even after the anticipated number of failures in the system. For example, in a centralized stable storage environment (e.g., SAN) that is able to handle the loss of the entire job, the global snapshot is established when all of the local snapshots have successfully been written to the centralized stable storage environment. In a peer-based, node-local stable storage medium that replicates local snapshots among N peers, the global snapshot is established upon verified completion of the replication stage.

The SStore capability uses the FileM capability, as necessary, to move files and directories between storage devices. The SStore capability abstracts the SnapC and CRS capabilities away from the underlying mechanism of how snapshots are established on the stable storage device.

6.1. Implementation. The SStore framework in Open MPI currently supports two components: central and stage. The central component stores local snapshots directly to the logically centralized stable storage device (e.g., SAN, parallel file system). The application is stalled until all of the local snapshots have been established on the storage device. If the application were not stalled, a fast process could start messaging a slower process, potentially corrupting the checkpoint being generated. The stage component uses node-local storage (e.g., local disk, RAM disk) as a staging location for moving local snapshots back to the logically centralized stable storage device. The application is allowed to continue execution once all of the local snapshots have been written to the node-local storage devices. The ORTE daemon then concurrently moves the local snapshots back to the logically centralized stable storage device, overlapping snapshot movement with application execution and often reducing the checkpoint overhead.

To improve the performance of C/R based automatic recovery and process migration (discussed in Chapter 4), Open MPI has also implemented local snapshot caching and compression in the stage component. Local snapshot caching requires processes to keep a copy of the last N local snapshots (default 2) on node-local storage. This improves the performance of automatic recovery since the non-failed processes are not moved in the system, and can use the locally cached copy of the snapshot instead of going to the stable storage device.

Local snapshot compression adds one more step in the staging pipeline. After the application writes the local snapshot to node-local storage it is able to continue execution. Once all of the processes on the node have finished checkpointing, the ORTE daemon compresses the local snapshot using one of a variety of compression utilities (e.g., bzip, gzip, zlib). After the compression stage, the compressed local snapshot is moved to the logically centralized stable storage device. Future work may add support for other checkpoint-specific stable storage file systems (e.g., stdchk [5]) and additional compression algorithms.
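
The staging pipeline can be summarized as follows. The helper names are hypothetical stand-ins for the CRS write, the compression utility, and the FileM transfer described above; in the real implementation the steps after the local write are carried out by the ORTE daemon, not the application process.

/* Sketch of the 'stage' SStore pipeline with optional compression.
 * Helper names are hypothetical; in Open MPI steps 3 and 4 run in the
 * ORTE daemon, overlapped with application execution. */
#include <stdbool.h>

void write_local_snapshot(const char *node_local_dir);    /* CRS output   */
void resume_application(void);                             /* app resumes  */
void compress_snapshot(const char *node_local_dir);        /* e.g., gzip   */
void filem_gather_to_stable_storage(const char *dir);      /* rcp/scp copy */

static void stage_checkpoint(bool compression_enabled)
{
    /* 1. Establish the local snapshot on node-local storage (fast). */
    write_local_snapshot("/tmp/ckpt");

    /* 2. The application continues immediately. */
    resume_application();

    /* 3. Optionally compress once every local process has finished. */
    if (compression_enabled) {
        compress_snapshot("/tmp/ckpt");
    }

    /* 4. Drain the (possibly compressed) snapshot back to stable storage. */
    filem_gather_to_stable_storage("/tmp/ckpt");
}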

7. File Management

The FileM capability represents the mechanism used to move local snapshots to and from storage devices, possibly across file system and node visibility boundaries. Remote file management enables the runtime system to preload files or binaries on remote systems before starting remote processes, providing usability conveniences in addition to stable storage abstractions.

The FileM capability must support broadcast, gather, and remove operations. The broadcast operation supports the preloading of checkpoint-related files on remote machines during process recovery. The gather operation supports the movement of remote local snapshots to a stable storage medium determined by SStore. The remove operation allows for cleanup of temporary checkpoint data that was preloaded on a remote machine.

7.1. Implementation. Many methods exist for physically moving a file from a local to a potentially remote file system, including standard UNIX and RSH copy commands. The Open MPI FileM framework interface allows multiple file management requests to be given to the file management system at the same time. This interface also allows a FileM component to potentially use collective algorithms to optimize the operation. This implementation requires knowledge of all of the machines in the job, but does not require knowledge of MPI semantics; therefore it is implemented as a part of the ORTE layer. The framework interface provides Open MPI the ability to pass a list of peers, local file names, and remote file names. If the remote file location is unknown to the requesting process, then the remote process is queried for its location.

The ORTE FileM framework currently uses RSH/SSH remote execution and copy commands (i.e., rsh/ssh, rcp/scp) to perform the necessary operations. Future work may add components to support standard UNIX commands and high performance out-of-band communication channels.
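
For instance, a gather operation in the rsh-style component reduces to constructing and running a remote copy command per file. The minimal sketch below is illustrative only; the paths, helper name, and reliance on scp being in the PATH are assumptions.

/* Minimal sketch of an rsh/scp-based FileM gather: copy one remote
 * local snapshot back to the stable storage directory.  Paths and the
 * helper name are illustrative only. */
#include <stdio.h>
#include <stdlib.h>

static int filem_rsh_gather(const char *peer_host,
                            const char *remote_path,
                            const char *stable_storage_dir)
{
    char cmd[1024];

    /* Build: scp -r <peer>:<remote_path> <stable_storage_dir>/ */
    snprintf(cmd, sizeof(cmd), "scp -r %s:%s %s/",
             peer_host, remote_path, stable_storage_dir);
    return system(cmd);
}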

8. Snapshot Coordination

The Snapshot Coordination (SnapC) capability manages a job checkpoint request in the system. It provides job-level control of the C/R capabilities and process-level control over the activation of the INC capability. The SnapC capability controls the checkpoint life-cycle, from distributing the checkpoint request to all processes, to monitoring their progress, to synchronizing the final snapshots to stable storage in cooperation with the SStore capability. Implementations of the SnapC capability support distributed C/R by assuming responsibility for the following tasks upon receiving distributed C/R requests:

(1) Initiate the per-process local checkpoints;
(2) Monitor the progress of the global checkpoint operation;
(3) Aggregate the local snapshots into a global snapshot; and
(4) Synchronize the various snapshots to stable storage.

Each C/R-enabled MPI implementation handles these tasks differently depending upon its C/R needs and infrastructure restrictions. The techniques used tend to be tightly integrated into the system, so much so that the coordination technique cannot be separated from the rest of the system. Implementations of the SnapC C/R capability should be given the flexibility to support a wide variety of snapshot coordination techniques. Example techniques include the spawning of replicated checkpoint servers, initiating multiple local checkpoints concurrently in a hierarchical tree structure, and grouping remote file movement requests so as to avoid network congestion.

The SnapC capability should allow processes to choose whether or not they can be checkpointed. Processes may choose not to be checkpointable for various reasons, including the use of unsupported algorithms such as hardware collectives or dynamic operations. The SnapC implementation is responsible for taking the checkpoint request from the user and checking it against the processes that have identified themselves as able to be checkpointed. If any of the processes in the checkpoint request cannot be checkpointed then the end user should be notified and no processes participating in the request should be affected.

8.1. Implementation. Open MPI provides the ORTE SnapC framework to compartmentalize these techniques into components with a common framework API. This compartmentalization allows for a side-by-side comparison of these techniques in a constant runtime environment. The initial ORTE SnapC component, full, implements a centralized coordination approach. It involves three sub-coordinators: a global coordinator, a set of local coordinators, and a set of application coordinators. Each sub-coordinator is positioned differently in the runtime environment, as shown in Figure 3.6. Figure 3.7 provides a detailed diagram of how the various C/R frameworks in Open MPI work together to take a checkpoint of a running MPI application.

FIGURE 3.6. Illustration of the global, local, and application SnapC coordinators in Open MPI.

FIGURE 3.7. Illustration of Open MPI C/R frameworks participating in a distributed checkpoint of a running MPI application. 3D boxes represent nodes containing white application processes. Rounded boxes represent runtime support processes.

The global coordinator is a part of the HNP (a.k.a. mpirun) process. It is responsible for interacting with the command line tools (Figure 3.7-A), generating the global snapshot reference, aggregating the remote files into a global snapshot stored to stable storage using the SStore framework (Figure 3.7-F), and monitoring the progress of the entire checkpoint request (Figures 3.7-B,E).

The local coordinator is a part of the ORTE per node daemons (a.k.a. orted). Each local coordinator works with the global coordinator to initiate the checkpoint of each process on its respective machine (Figure 3.7-C), and to move the files back to the global coordinator for storage as a part of the global snapshot in cooperation with the SStore framework (Figure 3.7-F).

The application coordinator is a part of each application process in the distributed system. This coordinator is responsible for starting the single process checkpoint. Such a responsibility involves interpreting any parameters that have been passed down from the user (e.g., checkpoint and terminate, checkpoint and stop), and calling the OPAL entry point function which starts the INC shown in Figure 3.5.

Once the application coordinator has completed the INC and generates a checkpoint with the CRS, it synchronizes the local snapshot to stable storage using the SStore framework. It then notifies the local coordinator (Figure 3.7-D) that in turn notifies the global coordinator (Figure 3.7-E). The global coordinator then requests the synchronization of the local snapshots to stable storage using the SStore framework (Figure 3.7-F). The SStore framework may use the FileM framework, as necessary, to move local snapshots across file system boundaries. Once these local snapshots have been aggregated and saved to stable storage the global snapshot reference is returned to the user (Figure 3.7-A).

If a failure occurs during the checkpoint operation and there is no recovery option enabled (described in Chapter 4) then the checkpoint sequence number is marked as failed in the metadata by the SnapC and the job terminates. If the SnapC is terminated before it can update the metadata, then the metadata is incomplete for that checkpoint sequence number. On restart, incomplete sequence numbers in the metadata are skipped since they represent invalid or failed checkpoint operations. Only properly finalized checkpoint sequence numbers are used during restart.

9. Performance Results

This section explores the performance of the implementation of the various C/R capabilities as realized in Open MPI. This section demonstrates the migration of processes between different network topologies. Additionally, this section looks at the failure-free overhead, checkpoint latency, and checkpoint overhead in this implementation using a variety of micro-benchmarks and real HPC applications.

Experimental results were generated using the Odin and Sif clusters at Indiana University. Sif is an 8 node Dual Intel 1.86 GHz Quad-Core Xeon machine with 16 GB of memory per compute node. It is connected with gigabit Ethernet, InfiniBand SDR, and Myrinet 10G. It is running RedHat Linux 2.6.18-53, BLCR 0.7.3, and a modified version of Open MPI. Odin is a 128 node, Dual AMD 2.0 GHz Dual-Core Opteron machine with 4 GB of memory per compute node. Compute nodes are connected with gigabit Ethernet and InfiniBand SDR. It is running RedHat Linux 2.6.18-53, BLCR 0.8.1, and a modified version of Open MPI. In this analysis the benchmarks and applications were run using half of the available cores on each of the compute nodes, so as to alleviate some of the memory contention on the nodes. For the compression analysis, the local snapshots were compressed (using gzip) on the same node as the application process. Future work may assess the benefits of using an intermediary node to assist in the compression process.

The stable storage overhead discussions primarily used a noop program that can be configured with a variable-sized random matrix to emulate various process sizes. In these experiments, the noop program was given a 10MB random matrix per process.

In the failure-free overhead analysis, we used the NAS Parallel Benchmarks (NAS) (version 2.4) [187] to represent a class of typical HPC application kernels. We used 4 of the 8 kernels, specifically the LU class C, EP class D, BT class C, and SP class C kernels. This decision was based on two criteria. First, the kernel must run long enough to perform the test at 32 or 36 processes on Sif, which meant running for longer than one minute, thus excluding the IS kernel. Second, due to storage space restrictions on the SAN, the kernel's snapshot image must be less than 15 GB in total size, which excluded the MG class C, FT class D, and CG class D kernels.

We assessed the impact of checkpointing real applications by using the GROMACS [270], Parallel Ocean Program (POP) [146], High-Performance Linpack (HPL) [203], and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) [211] HPC software packages. GROMACS is a molecular dynamics application. In our analysis of GROMACS, we used the DPPC benchmark from version 3.0 of their benchmark suite [270]. We assessed the impact of SStore configurations, including compression, on the POP, LAMMPS, and HPL HPC software packages. POP is an ocean circulation model [146]. In our analysis of POP, we used the bench01.tacc benchmark over 5 days of simulation. LAMMPS is a particle dynamics code that supports a wide variety of simulation techniques applicable to biology, chemistry, and material sciences [211]. In our analysis of LAMMPS, we used a scaled version of the metal benchmark eam involving 11 million atoms over 400 steps. HPL is a popular dense linear algebra benchmark [203]. In our analysis of HPL, we used a variety of problem sizes: 40,000 for 64 processes, 55,000 for 128 processes, and 70,000 for 192 processes.

9.1. Network Topology Migration. To demonstrate the migration of an MPI application between different network topologies we took advantage of the large SMP base of Sif and forced Open MPI to choose certain network drivers, simulating different network configurations and availabilities. We used a continuous latency test that measures the time taken for an 8KB message to travel around a ring of 8 processes. This test allows us to account for the shared memory and interconnect latency as if they were a single value. The noise on the Ethernet network seen in Figure 3.8 is caused by administrative traffic on this particular machine.

In our demonstration we checkpointed and terminated the MPI application every 100 steps. Then we restarted it in a different network topology and let it run for another 100 steps before checkpointing and terminating it again.

FIGURE 3.8. Continuous latency test with 8 processes exchanging an 8KB message and migrating between different machine configurations. Spikes indicate time spent on disk between the checkpoint and the subsequent restart.

The network topology progression in Figure 3.8 is as follows:

(A) 8 processes on 1 node using shared memory
(B) 8 processes, 1 on each of 8 nodes, using Ethernet
(C) 8 processes, 2 on each of 4 nodes, using Ethernet and shared memory
(D) 8 processes, 4 on each of 2 nodes, using Ethernet and shared memory
(E) 8 processes, 1 on each of 8 nodes, using InfiniBand
(F) 8 processes, 2 on each of 4 nodes, using InfiniBand and shared memory
(G) 8 processes, 4 on each of 2 nodes, using InfiniBand and shared memory
(H) 8 processes on 1 node using shared memory
(I) 8 processes, 4 on each of 2 nodes, using Myrinet and shared memory
(J) 8 processes, 2 on each of 4 nodes, using Myrinet and shared memory
(K) 8 processes, 1 on each of 8 nodes, using Myrinet

(a) NetPIPE 1 byte latency overhead

Interconnect      No C/R        With C/R      Overhead %
Ethernet (TCP)    49.92 usec    50.01 usec    0.09 usec (0.2%)
InfiniBand         8.25 usec     8.78 usec    0.53 usec (6.4%)
Myrinet MX         4.23 usec     4.81 usec    0.58 usec (13.7%)
Shared Memory      1.84 usec     2.15 usec    0.31 usec (16.8%)

(b) NetPIPE bandwidth overhead

Interconnect      No C/R        With C/R      Overhead %
Ethernet (TCP)    738 Mbps      738 Mbps      0 Mbps (0%)
InfiniBand        4703 Mbps     4703 Mbps     0 Mbps (0%)
Myrinet MX        8000 Mbps     7985 Mbps     15 Mbps (0.2%)
Shared Memory     5266 Mbps     5258 Mbps     8 Mbps (0.2%)

TABLE 3.2. NetPIPE 1 byte latency and bandwidth illustrating CRCP framework failure-free overhead.

9.2. Failure-Free Overhead. Failure-free overhead is the overhead seen by the application during normal operations when a failure does not occur. This overhead includes the time taken by the CRCP protocol monitoring the message traffic, and the additional time to completion for the application when a checkpoint is taken. One of our goals is to minimize the failure-free overhead seen by the application.

One assessment of the failure-free overhead is the overhead seen in the latency and bandwidth parameters between two communicating peers without taking a checkpoint. This measurement accounts for the overhead of the CRCP framework monitoring the message traffic through the system. Using NetPIPE, we assessed the latency and bandwidth effects of wrapping the PML layer; the results are shown in Table 3.2. The NetPIPE results show that the differences in bandwidth are negligible, and the 1 byte latency overhead varies between 0.09 and 0.58 microseconds depending on the interconnect.

9.3. Checkpoint Overhead. Another assessment of the failure-free overhead is the performance impact on completion time for an application when various numbers of checkpoints are taken. For this assessment we look at checkpointing to both a globally accessible Network File System (NFS) disk using the central SStore component, and only to the local disk on each machine using a modified stage SStore component that only saves locally.

(a) LU Class C 32 Processes

(b) EP Class D 32 Processes

FIGURE 3.9. Performance impact of checkpointing NAS Parallel Benchmarks LU and EP to NFS and local disk.

The local disk checkpoint time is provided as a basis for comparison in order to highlight the impact of the file system on the performance of the checkpoint operation. In Figure 3.9(a), we look at the effect of checkpointing the NAS Parallel Benchmark LU Class C with 32 processes. The size of the checkpoint is 1 GB, or about 32 MB per process. For a single checkpoint, Figure 3.9(a) indicates that there is an overhead of 17% when checkpointing to NFS and 3% when checkpointing to local disk. For four checkpoints there is an overhead of 84% and 6%, respectively, highlighting the importance of the stable storage file system.

(a) BT Class C 36 Processes

(b) SP Class C 36 Processes

FIGURE 3.10. Performance impact of checkpointing NAS Parallel Benchmarks BT and SP to NFS and local disk.

The checkpoint frequency, or time between checkpoints, for an HPC application is typically measured in hours; in these experiments we are forced to checkpoint more frequently due to the limited runtime of these applications.

In Figure 3.9(b), we look at the effect of checkpointing the NAS Parallel Benchmark EP Class D with 32 processes. This benchmark creates a checkpoint of 102 MB, or about 3.2 MB per process. Figure 3.9(b) shows us that the checkpoint overhead is almost negligible.

This is due to the small memory footprint and infrequent communication pattern of the benchmark.

In Figure 3.10(a), we look at the effect of checkpointing the NAS Parallel Benchmark BT Class C with 36 processes. The size of the checkpoint is 4.2 GB, or about 120 MB per process. For a single checkpoint, Figure 3.10(a) indicates an 18% overhead on NFS and 3% overhead on local disk. For four checkpoints there is a 74% and 8% overhead, respectively. In Figure 3.10(b), we look at the effect of checkpointing the NAS Parallel Benchmark SP Class C with 36 processes. The size of the checkpoint is 1.9 GB, or about 54 MB per process. For a single checkpoint, Figure 3.10(b) indicates a 17% overhead on NFS and 3% overhead on local disk. For four checkpoints there is a 69% and 6% overhead, respectively.

Next, we assess the performance impact of checkpointing GROMACS with the DPPC benchmark running with 8 and 16 processes. This benchmark creates a checkpoint of 267MB for 8 processes, or about 33MB per process. For 16 processes the checkpoint is 473MB, or 30MB per process. The overhead of adding the CRCP layer is negligible, adding at most 1 second to the application runtime for both 8 and 16 processes. Figure 3.11 shows the performance impact of checkpointing with between one and four checkpoints spaced evenly throughout the execution. The performance impact grows with each checkpoint, but is relatively small for any single checkpoint.

9.4. Checkpoint Overhead Analysis. Analyzing the impact of checkpointing on HPC applications is important. It is equally important to dissect the checkpoint overhead to determine where the checkpoint operation spends the most time. In this analysis we split the pre-checkpoint phase into two phases: CRCP Protocol and Suspend BTLs. The former is the time spent in the Checkpoint/Restart Coordination Protocol (CRCP) and the latter is the time spent suspending interconnect drivers. We expect the time to suspend the Ethernet and shared memory networks to be near zero since no action is taken during the pre-checkpoint phase. Since InfiniBand and Myrinet each have to tear down their network connections, they each are required to spend some time during the suspend phase. The CRCP Protocol time can vary slightly depending on when processes enter the coordination algorithm and whether a process is forced to wait on messages to drain from the network.

(a) 8 Processes

(b) 16 Processes

FIGURE 3.11. Performance impact of checkpointing 8 and 16 processes with GROMACS DPPC to NFS and local disk.

To control for process skew in this particular analysis, we introduced a barrier between each of the individual operations highlighted here. The checkpoint operation is the time it takes BLCR to save the process image to stable storage using the central SStore component. During the continue phase Open MPI may have to rebuild the PML by re-exchanging the modex, which is an all-to-all collective operation involving all processes in the application.

lu.C.32          Ethernet & Shmem      InfiniBand & Shmem    Myrinet & Shmem
CRCP Protocol    0.01 sec (0.0%)       0.02 sec (0.1%)       0.02 sec (0.1%)
Suspend BTLs     0.02 sec (0.1%)       0.04 sec (0.2%)       0.02 sec (0.1%)
Checkpoint       17.00 sec (99.9%)     18.43 sec (99.3%)     16.81 sec (99.2%)
Rebuild PML      0.00 sec (0%)         0.08 sec (0.4%)       0.10 sec (0.6%)
Total            17.03 sec             18.57 sec             16.95 sec

TABLE 3.3. Checkpoint overhead analysis for NAS Parallel Benchmark LU Class C with 32 processes using the central SStore component. Global snapshot is 1GB or 32MB per process.

ep.D.32          Ethernet & Shmem      InfiniBand & Shmem    Myrinet & Shmem
CRCP Protocol    0.10 sec (8.8%)       0.09 sec (3.4%)       0.09 sec (5.4%)
Suspend BTLs     0.02 sec (1.8%)       0.03 sec (1.1%)       0.03 sec (1.8%)
Checkpoint       1.02 sec (89.5%)      2.04 sec (76.1%)      1.03 sec (61.3%)
Rebuild PML      0.00 sec (0%)         0.52 sec (19.4%)      0.53 sec (31.5%)
Total            1.14 sec              2.68 sec              1.68 sec

TABLE 3.4. Checkpoint overhead analysis for NAS Parallel Benchmark EP Class D with 32 processes using the central SStore component. Global snapshot is 102MB or 3.2MB per process.

bt.C.32          Ethernet & Shmem      InfiniBand & Shmem    Myrinet & Shmem
CRCP Protocol    0.04 sec (0.1%)       0.07 sec (0.1%)       0.06 sec (0.1%)
Suspend BTLs     0.02 sec (0.0%)       0.08 sec (0.1%)       0.03 sec (0.0%)
Checkpoint       67.71 sec (99.9%)     68.39 sec (99.63%)    69.28 sec (99.7%)
Rebuild PML      0.00 sec (0%)         0.11 sec (0.2%)       0.12 sec (0.2%)
Total            67.76 sec             68.65 sec             69.49 sec

TABLE 3.5. Checkpoint overhead analysis for NAS Parallel Benchmark BT Class C with 36 processes using the central SStore component. Global snapshot is 4.2GB or 120MB per process.

In Table 3.3, we look at the overhead involved when checkpointing the LU Class C NAS parallel benchmark. In Table 3.4, we look at the overhead involved for the EP Class D NAS parallel benchmark. In Table 3.5 and Table 3.6, we look at the overhead involved when checkpointing the BT and SP Class C NAS parallel benchmarks, respectively.

sp.C.32         Ethernet & Shmem     InfiniBand & Shmem   Myrinet & Shmem
CRCP Protocol   0.02 sec  (0.1%)     0.03 sec  (0.1%)     0.12 sec  (0.4%)
Suspend BTLs    0.02 sec  (0.1%)     0.08 sec  (0.3%)     0.03 sec  (0.1%)
Checkpoint      29.76 sec (98.9%)    33.02 sec (99.4%)    32.19 sec (99.2%)
Rebuild PML     0.00 sec  (0%)       0.09 sec  (0.3%)     0.12 sec  (0.4%)
Total           29.80 sec            33.21 sec            32.45 sec

TABLE 3.6. Checkpoint overhead analysis for NAS Parallel Benchmark SP Class C with 36 processes using the central SStore component. Global snapshot is 1.9GB or 54MB per process.

GROMACS DPPC    Ethernet & Shmem     InfiniBand & Shmem   Myrinet & Shmem
CRCP Protocol   0.01 sec  (0.2%)     0.02 sec  (0.4%)     0.04 sec  (0.8%)
Suspend BTLs    0.01 sec  (0.2%)     0.01 sec  (0.2%)     0.01 sec  (0.2%)
Checkpoint      4.76 sec (99.6%)     5.24 sec (98.4%)     4.95 sec (97.8%)
Rebuild PML     0.00 sec  (0%)       0.05 sec  (1.0%)     0.06 sec  (1.2%)
Total           4.78 sec             5.32 sec             5.06 sec

TABLE 3.7. Checkpoint overhead analysis for GROMACS DPPC running with 8 processes using the central SStore component. Global snapshot is 267MB or 33MB per process.

GROMACS DPPC    Ethernet & Shmem     InfiniBand & Shmem   Myrinet & Shmem
CRCP Protocol   0.01 sec  (0.1%)     0.02 sec  (0.2%)     0.02 sec  (0.3%)
Suspend BTLs    0.03 sec  (0.4%)     0.04 sec  (0.5%)     0.02 sec  (0.3%)
Checkpoint      8.07 sec (99.5%)     7.88 sec (98.4%)     7.48 sec (97.9%)
Rebuild PML     0.00 sec  (0%)       0.07 sec  (0.9%)     0.12 sec  (1.6%)
Total           8.11 sec             8.01 sec             7.64 sec

TABLE 3.8. Checkpoint overhead analysis for GROMACS DPPC running with 16 processes using the central SStore component. Global snapshot is 473MB or 30MB per process.

For all of these benchmarks we can see that the time to create the checkpoint and save it to the central stable storage device dominates the checkpoint time. This is closely followed by the time needed to re-exchange the modex during the continue phase. The overhead of the CRCP is a factor, but not as severe as the time spent in the checkpoint and modex operations.

FIGURE 3.12. The performance effect of the central, stage, and stage with compression SStore components on application latency (per-step latency in microseconds versus application step). Application latency is measured by a continuous latency test using 64 processes exchanging an 8KB message in a ring topology.

In Tables 3.7 and 3.8, we analyze the overhead involved when checkpointing the GROMACS DPPC benchmark with 8 and 16 processes, respectively. This data confirms what was seen with the NAS benchmarks, most notably that the time taken by the CRCP is overshadowed by the time needed to store the snapshots to stable storage.

9.5. SStore Overhead. In this section we assess checkpoint overhead by measuring the impact of various stable storage strategies on application performance. We used a continuous latency test that measures the time taken for an 8KB message to travel around a ring of 64 processes on Odin. With this test we are able both to illustrate the impact of the stable storage strategy on the application and to measure the impact of the checkpoint overhead. Figure 3.12 illustrates the impact of using the following SStore components:

• central
• stage
• stage with compression enabled

The central SStore component adds approximately 1018 microseconds of half round-trip point-to-point latency overhead across the checkpoint operation, which takes 11.1 seconds. The stage SStore component adds approximately 244 microseconds of overhead spread over 21 steps of computation.

The checkpoint took 8.3 seconds, spending only 0.8 seconds establishing the local snapshot on the local disk and 7.4 seconds staging the local snapshots back to stable storage. The compression enabled stage SStore component adds approximately 257 microseconds of overhead spread over 4 steps of computation. That checkpoint took 4.1 seconds, spending only 0.7 seconds establishing the local snapshot and 2.7 seconds staging the local snapshots back to stable storage. Compression adds slightly more to the checkpoint overhead in comparison with the default stage component, but reduces the duration of the effect. The checkpoint latency is reduced from 8.3 to 4.1 seconds by enabling compression, since this application is highly compressible, resembling the compression rate of the noop application presented in Table 3.9.

The checkpoint latency is reduced from 11.1 to 8.3 seconds by switching from direct central storage (i.e., central) to the staging protocol (i.e., stage). This reduction in checkpoint latency is largely caused by the flow control in the stage SStore and rsh FileM components, which constrains the number of concurrent files in flight to 10 by default. The flow control focuses the write operations so that only a subset of the nodes use the bandwidth to stable storage at any one time, instead of all nodes fighting for the same exhausted bandwidth.

The performance impact on the application runtime for the various SStore component configurations is shown in Figure 3.13(a) for POP, Figure 3.13(b) for LAMMPS, and Figure 3.13(c) for HPL. All of these figures demonstrate that the stage SStore component is an improvement over the central component, especially for large checkpoint sizes. Interestingly, the compression enabled stage component may slightly improve the checkpoint overhead in comparison with the default stage component. Since the compression occurs on-node and competes for computational cycles, one might expect the opposite effect. However, if the application is sufficiently compressible, the overhead involved in checkpointing is regained by reducing the time to establish the checkpoint to stable storage, reducing the overall impact of checkpointing on the network and application.
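For readers who want to reproduce the qualitative behavior above, the following is a minimal sketch of a continuous-latency ring microbenchmark of the kind used here. The message size matches the 8KB used in these experiments, but the step count, timing granularity, and output format are illustrative rather than the actual test harness.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MSG_SIZE 8192   /* 8KB message, as in the experiments above */
#define STEPS    100    /* number of timed ring traversals (illustrative) */

int main(int argc, char **argv) {
    int rank, size, step;
    char buf[MSG_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, 0, sizeof(buf));

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    for (step = 0; step < STEPS; ++step) {
        double start = MPI_Wtime();
        if (rank == 0) {
            /* Rank 0 starts the ring and times one full traversal */
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, next, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("step %d: %.1f usec\n", step, (MPI_Wtime() - start) * 1.0e6);
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, next, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}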

FIGURE 3.13. Checkpoint overhead impact of various SStore components (No Ckpt, Central, Stage, Stage/Zip) on total application time for three HPC applications at 64, 128, and 192 processes: (a) POP, (b) LAMMPS, (c) HPL.

9.6. Compression. Checkpoint compression can improve checkpoint latency by reducing the amount of data that needs to traverse the network to and from stable storage. The benefit of compression, in terms of improving checkpoint latency, is determined by how well the process's address space represented in the local snapshot can be compressed.

Application Performance (sec.)
NP     Central    Stage (%)           Zip (%)
64     0.00       0.00 (  0.0 %)      0.00 (  0.0 %)
128    0.00       0.00 (  0.0 %)      0.00 (  0.0 %)
192    0.00       0.00 (  0.0 %)      0.00 (  0.0 %)

Checkpoint Latency (sec.)
NP     Central    Stage (%)           Zip (%)
64     10.3       7.6  ( 25.8 %)      4.4  ( 57.5 %)
128    21.8       17.7 ( 18.8 %)      6.9  ( 68.4 %)
192    40.9       28.9 ( 29.5 %)      11.7 ( 71.4 %)

Compression Rate (MB)
NP     Normal     Zip       %
64     258.5      21.0      91.9 %
128    593.8      48.3      91.9 %
192    1167.4     93.3      92.0 %

TABLE 3.9. Effects of staging and compression on application performance and checkpoint overhead on the noop application.

If the checkpoint does not compress well, then this can negate much or all of the benefit of including compression in the staging pipeline. The checkpoint overhead may also increase since, if the compression occurs on the same machine as the process in execution, the two processes compete for CPU cycles. In order to assess the impact of compression on the checkpoint overhead and latency, we looked at four applications.

As a baseline we benchmarked the noop program with zero additional bytes of data, effectively a “hello world” style MPI program. Table 3.9 presents the experimental data showing considerable improvements in the checkpoint latency when enabling compression. Since the checkpoint is 92% compressible, the checkpoint latency is reduced by 71.4% for a 192 process MPI job. Since the noop program waits until it is signaled to finish, the Application Performance numbers are not meaningful for this application.

Next we consider the effect of compression on the metal benchmark for the LAMMPS software package. Table 3.10 shows that compression can reduce the size of the LAMMPS global snapshot by up to 67%, from 5.3 GB to 1.7 GB, reducing not only the checkpoint latency by 67%, but also the amount of stable storage disk space required to store the checkpoint.

Application Performance (sec.)
NP     Central    Stage (%)            Zip (%)
64     567.51     513.49 (  9.5 %)     514.97 (  9.3 %)
128    342.29     293.13 ( 14.4 %)     277.10 ( 19.0 %)
192    277.34     194.26 ( 30.0 %)     183.67 ( 33.8 %)

Checkpoint Latency (sec.)
NP     Central    Stage (%)            Zip (%)
64     69.1       77.3  (-11.9 %)      38.0 ( 45.1 %)
128    87.1       90.7  ( -4.1 %)      36.7 ( 57.9 %)
192    107.1      104.1 (  2.7 %)      35.8 ( 66.6 %)

Compression Rate (MB)
NP     Normal     Zip        %
64     4029.4     1474.6     63.4 %
128    4638.7     1781.8     61.6 %
192    5427.2     1786.9     67.1 %

TABLE 3.10. Effects of staging and compression on application performance and checkpoint overhead on the LAMMPS application.

Application Performance (sec.)
NP     Central    Stage (%)            Zip (%)
64     310.69     294.62 (  5.2 %)     293.12 (  5.7 %)
128    327.74     297.88 (  9.1 %)     296.99 (  9.4 %)
192    269.96     222.27 ( 17.7 %)     221.10 ( 18.1 %)

Checkpoint Latency (sec.)
NP     Central    Stage (%)            Zip (%)
64     19.7       13.8 ( 29.9 %)       6.1  ( 69.0 %)
128    35.3       27.9 ( 20.8 %)       8.5  ( 75.8 %)
192    53.6       39.8 ( 25.9 %)       14.8 ( 72.3 %)

Compression Rate (MB)
NP     Normal     Zip       %
64     609.3      106.2     82.6 %
128    1105.9     144.5     86.9 %
192    1802.2     205.9     88.6 %

TABLE 3.11. Effects of staging and compression on application performance and checkpoint overhead on the POP application.


Application Performance (sec.)
NP     Central     Stage (%)            Zip (%)
64     452.25      300.81 ( 33.5 %)     302.68 ( 33.1 %)
128    692.32      457.59 ( 33.9 %)     458.06 ( 33.8 %)
192    1020.76     628.66 ( 38.4 %)     617.15 ( 39.5 %)

Checkpoint Latency (sec.)
NP     Central     Stage (%)            Zip (%)
64     182.4       218.3 (-19.7 %)      237.2 (-30.1 %)
128    301.8       371.6 (-23.1 %)      347.4 (-15.1 %)
192    493.8       561.8 (-13.8 %)      551.0 (-11.6 %)

Compression Rate (MB)
NP     Normal      Zip         %
64     13557.8     13020.2     4.0 %
128    24581.1     23234.6     5.5 %
192    40105.0     37411.8     6.7 %

TABLE 3.12. Effects of staging and compression on application performance and checkpoint overhead on the HPL application.

Next we consider the effect of compression on the bench01.tacc benchmark of the POP software package. Table 3.11 shows that compression reduces the size of the global snapshot by up to 89% and the checkpoint latency by up to 76%, constituting considerable savings for this application. Finally, we consider the effect of compression on the HPL software package. Table 3.12 shows that compression reduces the size of the global snapshot by only up to 7%. Due to the low compression rate and the large checkpoint size, the checkpoint latency increases, but the checkpoint overhead is reduced by up to 40%. All of the application studies show that the checkpoint overhead can be reduced by using a staging approach to stable storage in place of a direct approach.
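To make the role of compression in the staging pipeline concrete, the sketch below compresses a local snapshot file with zlib before it would be copied back to stable storage. This is an assumption-laden illustration, not the stage SStore implementation: the function and file names are invented, the file is handled in a single buffer, and the real component streams data and applies flow control.

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Compress a local snapshot file before staging it back to stable storage.
 * Illustrative only: names are invented; the real stage component streams
 * and rate-limits these transfers rather than using one buffer.           */
int stage_compress(const char *local_snapshot, const char *staged_file) {
    FILE *in = fopen(local_snapshot, "rb");
    if (in == NULL) return -1;

    fseek(in, 0, SEEK_END);
    long len = ftell(in);
    fseek(in, 0, SEEK_SET);

    unsigned char *src = malloc((size_t)len);
    if (src == NULL || fread(src, 1, (size_t)len, in) != (size_t)len) {
        fclose(in); free(src);
        return -1;
    }
    fclose(in);

    uLongf dst_len = compressBound((uLong)len);
    unsigned char *dst = malloc(dst_len);

    /* Trade CPU cycles on the compute node for less data on the network */
    if (dst == NULL ||
        compress2(dst, &dst_len, src, (uLong)len, Z_BEST_SPEED) != Z_OK) {
        free(src); free(dst);
        return -1;
    }

    FILE *out = fopen(staged_file, "wb");
    if (out == NULL) { free(src); free(dst); return -1; }
    fwrite(dst, 1, dst_len, out);
    fclose(out);

    printf("compressed %ld bytes to %lu bytes (%.1f%% smaller)\n",
           len, (unsigned long)dst_len, 100.0 * (1.0 - (double)dst_len / (double)len));
    free(src); free(dst);
    return 0;
}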

10. Conclusion

This chapter introduced six of the seven C/R capabilities that compose to form a C/R solution for MPI applications on HPC systems. These six capabilities were realized in the Open MPI implementation of the MPI standard and support basic C/R fault tolerance activities.

Using these capabilities we explored the performance implications of C/R on a variety of real HPC applications including HPL, POP, LAMMPS, NAS, and GROMACS. We demonstrated low failure-free overhead and the ability to restart on interconnects different from those in use when the checkpoint was taken. We investigated the performance implications of SStore implementation choices after demonstrating that the stable storage operation can represent up to 98% of the checkpoint overhead. By using a staging SStore technique and compression we were able to improve the checkpoint overhead and latency of our solution for real MPI applications.

4 Process Migration and Automatic Recovery

Resilient applications (i.e., applications that can continue to run despite process failure) depend on resilient runtime and communication environments to sustain them across process failures. In typical HPC environments, communication is provided to the application by an MPI implementation [178]. Process launch, monitoring, and cleanup are provided either by the corresponding MPI runtime or by a system-provided runtime. Therefore, a resilient MPI implementation depends on a stable and recoverable runtime environment that can sustain both the MPI implementation and the application. Unfortunately, resilient runtime environments and resilient MPI implementations are uncommon today. As a result, even applications that are designed to be resilient are forcibly removed from the system upon process failure.


The six capabilities presented in Chapter 3 form the foundation of a coordinated C/R infrastructure which provides basic fault tolerance support for MPI applications running on HPC systems. In this chapter we introduce the seventh capability, the Error Management and Recovery Policy (ErrMgr), which enables resilient runtime and communication environments to support resilient MPI applications [134]. The ErrMgr capability encapsulates the various recovery policies that are supported on the system. The application can then choose an appropriate subset of those policies to support the application's reliability requirements. Ideally, the application will be able to choose more than one process fault recovery policy at runtime to best tailor the recovery policy solution. We also discuss three currently available recovery policy options implemented as part of Open MPI's ErrMgr framework: run-through stabilization, automatic process recovery, and preemptive process migration. Run-through stabilization supports continuing research into fault tolerant MPI semantics, allowing the application to continue running without requiring the recovery of lost processes. Automatic process recovery recovers failed processes in-place using a previously established checkpoint without having to resubmit the job. Preemptive process migration uses the C/R infrastructure to transparently move processes between resources during normal execution. A logical stable storage device, represented by the SStore capability, provides recovery policies with a reliable location to place recovery information during normal execution that can be used later during recovery. The implementation of the stable storage device can have considerable impact on both the failure-free and recovery performance overheads of a recovery policy. Accordingly, this chapter analyzes the recovery performance tradeoffs for various stable storage strategies including staging, caching, and compression.

1. Error Management and Recovery Policy

The Error Management and Recovery Policy (ErrMgr) capability encapsulates the various recovery policies supported on a system. It provides an application with the ability to select a set of process fault recovery policies that best support the application on a given system. Examples of process fault recovery policies include run-through stabilization, proactive process migration, automatic recovery from a previous checkpoint, and switching to use a replica process. The ErrMgr capability should work with either an external or internal fault notification service. The notification service will inform the ErrMgr of a process failure that it may or may not already be aware of. A good implementation of the ErrMgr capability would allow for the fail-over of one recovery policy onto another. For example, this would allow two recovery policies to work together as a light-weight and heavy-weight pair: the light-weight recovery policy handles small groups of concurrent failures, and then defers to the heavy-weight recovery policy that can handle larger groups of concurrent failures. Individual recovery policies may interact with other capabilities in an implementation. Since the process migration and automatic recovery ErrMgr policies are based in C/R, they interact directly with the SnapC and SStore capabilities to help them migrate and recover lost processes in the system.

2. Error Management and Recovery Policy Implementation

In Open MPI, we extended the existing ErrMgr framework to support a variety of recovery policies. Before this extension, the ErrMgr framework provided processes and daemons a stable interface to report process and communication failure so that the proper job abort procedure could be taken. Since recovery is a non-MPI-standard feature, the new ErrMgr framework preserves, by default, the termination of the job upon process failure. We have implemented the following, optional, process fault recovery policies in the new ErrMgr framework: runtime stabilization, automatic recovery, and process migration. The Open MPI ErrMgr framework unites individual recovery policies and techniques explored in previous research [92, 144] to support a wide range of policies in a more extensible manner. This unique framework allows policies to be composed in a customizable and interdependent manner so applications can choose from a wide range of recovery policy options instead of just one, as with most previous contributions. The ErrMgr framework is a composite framework that allows more than one recovery policy component to be active at the same time. Active components stack themselves in priority order and use a status vector to communicate actions taken by a component higher in the stack to a component lower in the stack. The status vector is implemented as a bit field that can be inspected and modified in an ErrMgr framework-defined manner. If the status vector indicates that no active component was able to recover from a process failure, the framework activates the job abort procedure. Lighter-weight recovery policies should be ordered higher in the stack than heavier-weight policies. This gives the lighter-weight policies the first opportunity to recover from the failure. If a component is able to recover from the process failure, it sets the appropriate bit in the status vector before passing it down the stack. Lower levels check the status vector before taking any action necessary regarding the process failure. The composable nature of the ErrMgr framework implementation in Open MPI is used by the process migration component to fail-over onto the automatic recovery component when it sees an unexpected failure during a process migration operation. A nice side effect of the composable framework design is that it reduces the recovery policy development burden by supporting and encouraging code reuse. Fault detection in Open MPI is currently communication and, optionally, heartbeat based. The structure of the detection is determined by the Routed framework. Often the HNP watches the daemons, and the daemons watch the processes on their local machine. If an MPI process detects process failure (i.e., by way of communication timeouts or errors) it reports a process fault to the ErrMgr framework. The Notifier provides a generic interface for internal and external event notification. The ErrMgr works in concert with the Notifier framework to propagate the internally detected fault throughout the system. Reciprocally, the Notifier translates external fault events into properly formatted events for the ErrMgr framework. The external fault events could be generated by a system-provided fault monitoring and detection service such as the CIFTS FTB [122]. The current set of ErrMgr components decide globally how to recover from a failure, so all fault events are forwarded to the ErrMgr components active in the HNP process.
The HNP process serves as a global leader in the recovery operation since it makes all decisions about the state of the system and how to recover it. It should be noted that even though the current set of components relies on a global leader, the ErrMgr framework is designed to be general enough to support a more localized or distributed recovery policy.
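A minimal sketch of the priority-ordered, composable policy stack and status vector described above is given below. The structure, bit value, and component callbacks are illustrative inventions for exposition; they are not Open MPI's actual ErrMgr interfaces.

#include <stdio.h>

/* Illustrative status bit; the real ErrMgr framework defines its own vector */
#define ERRMGR_STATUS_RECOVERED  0x1

typedef struct errmgr_component {
    const char *name;
    /* Examine a failure; return the (possibly updated) status vector */
    int (*process_fault)(int failed_rank, int status);
} errmgr_component_t;

/* Light-weight policy: in this sketch it never recovers on its own */
static int crmig_fault(int failed_rank, int status) {
    (void)failed_rank;
    return status;                      /* pass the status vector down unchanged */
}

/* Heavy-weight policy: restart from the last global snapshot if nothing
 * above it in the stack has already handled the failure                 */
static int autor_fault(int failed_rank, int status) {
    if (status & ERRMGR_STATUS_RECOVERED) return status;
    printf("autor: restarting job to recover rank %d\n", failed_rank);
    return status | ERRMGR_STATUS_RECOVERED;
}

/* Walk the stack from lighter- to heavier-weight policies; abort if the
 * status vector shows that no component recovered the failure           */
static int errmgr_process_fault(errmgr_component_t *stack, int n, int failed_rank) {
    int status = 0;
    for (int i = 0; i < n; ++i) {
        status = stack[i].process_fault(failed_rank, status);
    }
    if (!(status & ERRMGR_STATUS_RECOVERED)) {
        printf("no component recovered rank %d: aborting job\n", failed_rank);
        return -1;                      /* job abort procedure */
    }
    return 0;
}

int main(void) {
    /* Components listed in priority order: lighter weight first */
    errmgr_component_t stack[] = {
        { "crmig", crmig_fault },
        { "autor", autor_fault },
    };
    return errmgr_process_fault(stack, 2, 3) == 0 ? 0 : 1;
}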

3. Runtime Stabilization ErrMgr Component

The default behavior of the ErrMgr framework is to terminate the entire MPI job upon the detection of process failure. The sustain ErrMgr component provides an application the ability to choose to run through the process failure by, instead of terminating the job, simply stabilizing the runtime environment and continuing execution. The Routed, GrpComm, and RML frameworks have been extended to include interfaces for the ErrMgr to notify them of process failure for framework- and component-level stabilization. The individual components are then responsible for recovering from the failure. If a component cannot recover from the loss, indicating an unrecoverable runtime, it can return an error value which will terminate the job. MPI processes always route ORTE layer out-of-band communication through their local daemon. Therefore, to reduce the per-process recovery overhead, the daemon contains the bulk of the recovery logic for re-routing, delaying, or dropping communication during recovery from process loss. Dropped communication will return as a communication error to the sending or receiving process. As part of the runtime stabilization, an up-call is made available to the OMPI layer from the ErrMgr component to indicate that a process has been lost and that the OMPI layer stabilization and recovery procedures should be activated. Stabilization at the OMPI layer often includes, but is certainly not limited to, flushing communication buffers involving the failed peer(s), activating error handlers and error reporting paths back to the application, and stabilizing communicators and other opaque MPI data structures. The semantics of how MPI functions behave across process failure are still an active area of research and are currently under consideration by the Fault Tolerance Working Group in the MPI Forum [182]. The ErrMgr framework was designed to support this effort by providing well-defined stabilization procedures for the runtime environment. The OMPI layer builds upon the stabilized runtime to support research into MPI fault tolerance semantics.

In [119], Gropp and Lusk described how a manager/worker style MPI program might recover from process loss by using intercommunicators and forgetting about communicators to lost processes. As an extension to their work, the manager process could decide, after stabilization, to create replacement processes using the dynamic process management (e.g., MPI_COMM_SPAWN) interfaces, which create intercommunicators between the manager and a new group of worker processes. If the manager is recovering from many failures, it will want to overlap the creation of replacement processes with other computation. Since the current dynamic process creation interfaces are all blocking, the manager must either block for each process recovery or use a separate recovery thread to gain this type of concurrency. We have proposed a nonblocking process creation interface that would allow the manager process to request processes to be created without blocking. The manager can then wait for completion of the process creation requests alongside waiting for completion of normal work units from non-faulty workers. Appendix D describes the nonblocking process management interfaces currently under consideration by the MPI Forum.
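For reference, the blocking recovery path described above, in which the manager respawns a replacement group of workers over a new intercommunicator, might look like the following sketch. The worker executable name and error handling are illustrative; the blocking behavior of MPI_Comm_spawn shown here is precisely what the proposed nonblocking interface in Appendix D is intended to relax.

#include <mpi.h>
#include <stdio.h>

/* Manager-side recovery: respawn a replacement group of workers after
 * stabilization.  "worker" is an illustrative executable name.         */
int respawn_workers(int num_lost, MPI_Comm *new_workers) {
    /* Blocks until the replacement processes are launched; the manager can
     * do no other work here, which is what motivates a nonblocking variant. */
    int rc = MPI_Comm_spawn("worker", MPI_ARGV_NULL, num_lost, MPI_INFO_NULL,
                            0, MPI_COMM_SELF, new_workers, MPI_ERRCODES_IGNORE);
    if (rc != MPI_SUCCESS) {
        fprintf(stderr, "failed to respawn %d workers\n", num_lost);
        return -1;
    }
    return 0;   /* *new_workers is an intercommunicator to the new workers */
}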

4. Automatic Recovery ErrMgr Component

Instead of running through a failure, the application may choose to recover from the loss by automatically restarting from the last established global snapshot of the application, by enabling the autor ErrMgr component. When the autor ErrMgr component is notified of a process failure, by default, it places a failed process on a different node than the one it resided on before the failure. This avoids repeated failures due to node-specific component failure that may have caused the original process failure. Figure 4.1 illustrates a single process recovery in Open MPI. Since this implementation of automatic recovery is based on a coordinated C/R implementation, all processes must be restarted from a previously established global snapshot in order to provide a consistent state on recovery. The autor component works with the active SnapC component to restart a failed job. Depending on the SStore component active in the application, the local snapshots may be pulled directly from logically centralized stable storage or staged to node-local storage before restart. If there is a locally cached copy of the local snapshot, a process can improve recovery time by using the cached copy, reducing the performance bottleneck at central storage.

FIGURE 4.1. Four MPI processes running on two machines using the C/R functionality in Open MPI to periodically checkpoint the application (white squares indicate checkpoints). Process 2 on Node B fails unexpectedly, and all of the processes are automatically and transparently rolled back to their last checkpoint and continue execution using the autor ErrMgr component.

If, for some reason, the autor ErrMgr component is not able to recover the job, and no other component is active, then the job aborts. If the stabilize component is active, then the runtime will be stabilized and an error will be returned to the application. This allows an application to rely on transparent recovery up to the point it is no longer feasible, and then fall back on an application-involved approach.

5. Process Migration ErrMgr Component

When a process or node failure is anticipated, the ErrMgr components are notified via the predicted_fault() interface. This interface provides the ErrMgr components with a list of anticipated process and node faults reported by an external fault prediction service or a system administrator, usually through the Notifier framework. The external fault prediction service can express with each prediction both an assurance level and an estimated time bound. The estimated time bound allows the process migration ErrMgr component, crmig, to determine whether it has enough time to migrate the processes or whether it should defer to the autor component for failure recovery. The ability to fail-over onto the autor component is provided by the composable design of the ErrMgr framework.

FIGURE 4.2. Four MPI processes running on two machines using the checkpoint/restart functionality in Open MPI to periodically checkpoint the application (white squares indicate checkpoints). The system administrator identifies Node B as going down for maintenance. Open MPI transparently checkpoints the processes running on that machine, migrates them to Node C, and continues execution using the crmig ErrMgr component.

The ompi-migrate tool provides a command line interface for end users to request a process migration within a running MPI application, as described in Chapter 3 and Appendix B. Figure 4.2 illustrates the migration of two processes in Open MPI. The ompi-migrate command line interface allows an end user the opportunity to provide a suggested list of target nodes to use as replacements for the affected nodes. The RMapS framework uses the suggest_map_targets() interface to allow the ErrMgr components the opportunity to suggest nodes for each recovering process. The ability to suggest destination nodes allows a system administrator, for example, to move processes from a set of nodes going down for maintenance to a set of nodes dedicated to the processes for the duration of the maintenance activity. This tool also allows end users to experiment with using process migration for load balancing, since they can specify specific process ranks in MPI_COMM_WORLD instead of just nodes for migration. The process migration API, provided as part of the Open MPI Extended Interfaces, allows the application to migrate processes within a communicator. This API is described in more detail in Chapter 5 and Appendix A. The crmig ErrMgr component implements an eager copy process migration protocol without residual dependencies, based on the coordinated C/R infrastructure in Open MPI. This component works with the SnapC component to checkpoint only the migrating processes. All other processes are paused after the checkpoint coordination. The migrating processes are restarted on their replacement nodes using the SStore and FileM components as necessary. Then the non-migrating processes are released and computation is allowed to resume. Future work may explore other process migration protocols like pre-copy [257, 277] and quasi-asynchronous [66]. If an unexpected failure occurs during process migration, then the migration is canceled and the autor component is allowed to recover the job from the last fully established global snapshot. If the autor component is not active, and no other recovery policy is enabled, then the job will terminate.
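The decision the crmig component faces when a fault is predicted, migrate if the estimated time bound allows or otherwise defer to autor, can be sketched as follows. The structure and the migration-time estimate are illustrative and do not reflect the actual predicted_fault() interface.

#include <stdio.h>

/* Illustrative form of a fault prediction delivered through the Notifier */
typedef struct fault_prediction {
    const char *node;       /* node expected to fail            */
    double assurance;       /* confidence in the prediction     */
    double time_bound_sec;  /* estimated time until the failure */
} fault_prediction_t;

/* Choose between preemptive migration (crmig) and checkpoint-based
 * recovery (autor) for a predicted fault.                          */
void handle_predicted_fault(const fault_prediction_t *p, double est_migration_sec) {
    if (p->time_bound_sec > est_migration_sec) {
        /* crmig: checkpoint the affected processes and restart them elsewhere */
        printf("migrating processes off %s before the predicted failure\n", p->node);
    } else {
        /* autor: let the failure happen and recover from the last global snapshot */
        printf("not enough time to migrate off %s; deferring to autor\n", p->node);
    }
}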

6. Performance Results

The experimental setup is the same as in Chapter 3. This section presents an analysis of the effect of stable storage configuration on recovery policy performance. These experiments were conducted on the Odin machine at Indiana University.

6.1. Automatic Recovery. In this section we assess the performance of the automatic recovery ErrMgr component, autor, for various SStore component configurations. For this assessment, we used the noop application to focus our investigation. The noop application is a naturally quiescent application (since it does not explicitly communicate) with a fixed process size, which allowed us to focus the analysis on the overheads specific to automatic recovery. In this experiment, failed processes were placed on different nodes than the ones they resided on before the failure. Processes are forcibly terminated by sending SIGKILL to the target processes from an external agent.

FIGURE 4.3. Performance impact of various SStore component configurations (Central, Stage, Stage/Zip, Stage/Cache, Stage/Cache (inplace)) on automatic recovery: (a) impact on a range of job sizes, (b) impact on a range of concurrent process failures in a 64 process job.

First, we assessed the performance implications of automatically recovering from a single process failure for various job sizes, restarting on spare machines in the allocation. Figure 4.3(a) presents the effect of using the following configurations of SStore components for automatic recovery:

• central
• stage
• stage with compression enabled
• stage with caching enabled

From Figure 4.3(a), we can see that the cache enabled stage component has drastic performance benefits as the job size increases, providing a constant recovery time of approximately 3 seconds. Caching reduces the recovery pressure on stable storage by allowing the non-failed processes to restart from node-local storage while only the failed processes are forced to stage-in their local snapshots from stable storage. The central component outperforms the cache enabled stage component for small job sizes. This is because, even with a caching enabled stage component, the failed processes must copy the local snapshot to node-local storage before restarting from it. This is in contrast to the central component, which avoids the copy to node-local storage by directly referencing the local snapshot. Notice also that the compression enabled stage component begins to outperform the default stage configuration at larger job sizes since it reduces the amount of data being transferred between the stable storage device and node-local storage. Figure 4.3(b) presents the recovery time for a variety of concurrent failures in a fixed size job, in this case 64 processes. The time to recover the job is not changed by the number of failures since, for all of the non-cache enabled SStore components, the entire job is terminated and recovered from stable storage. The main variable in Figure 4.3(b) is the recovery time when caching is enabled. We can see that caching continues to provide performance benefits up to about half of the job failing, at which point the central component begins to perform better. Interestingly, even up to 62 concurrent process failures, a caching enabled stage component still performs better than the default stage component. This indicates that even with a few processes taking advantage of the node-local cache, in this case two processes, there are still performance benefits to not further stressing the stable storage device. If we allow failed processes to be restarted on the node on which they previously failed, the benefits of caching become even more significant. The time to restart becomes approximately 1.5 seconds regardless of the number of concurrent failures or the job size. This is a slightly unrealistic use case for typical deployments since processes that crash due to node failure often cannot access the original node from which to restart. However, this is an interesting data point for future investigations into peer-based, node-local storage that eliminates or reduces the need for logically centralized stable storage devices, confirming much of the previous literature [5, 36].

6.2. Process Migration. In this section we assess the performance of the eager copy process migration ErrMgr component, crmig. As in Section 6.1, we use the noop application to focus the analysis on the performance overheads involved in process migration. In this experiment, processes were migrated from a source set of machines to a destination set of machines that was distinct from the source set. So, in this experiment, caching will not provide any benefit since the source node is never the same as the destination node for any of the migrating processes. Process migration is often used to move an entire node's worth of processes in anticipation of a node failure. With this use case in mind, we assessed the performance impact of migrating two processes while varying the size of the MPI job. In Figure 4.4(a), we can see that the central component performed better than either of the stage component configurations. Looking at the breakdown in Figure 4.4(b), we see that the time to restart the processes remains fairly constant regardless of the job size for each SStore component. This is due to the relatively low bandwidth requirement of pulling two local snapshots from stable storage. The significant difference comes in the time to stage the local snapshot to and from stable storage. The reduction in the checkpoint overhead is beneficial when checkpointing for fault recovery, but contributes additional overhead to the process migration performance. Process migration is limited by the time to move the checkpoint from the source to the destination system. Both the central and stage SStore components rely on a logically centralized stable storage device. As such, they copy the local snapshot to the stable storage device from the source machine and then immediately copy it back from stable storage to the destination machine. As future work, we may investigate direct copy techniques that remove the logically centralized stable storage device from the process migration procedure. Our discussion in this section focuses on the performance implications of two common stable storage techniques provided by C/R-enabled MPI implementations.

FIGURE 4.4. Performance impact of various SStore component configurations (Central, Stage, Stage/Zip) on process migration on a range of job sizes: (a) performance impact of SStore components, (b) breakdown of the performance impact into checkpoint and restart time.

So, for a small number of migrating processes, the central component requires the fewest copies during the migration, and given the limited bandwidth requirements of migrating only two processes, this SStore component is the best performing option. Next we assessed the performance implications of process migration by varying the number of migrating processes for a fixed size job, in this case 64 processes. Figure 4.5(a) shows that as the number of migrating processes increases, the compression enabled stage component begins to outperform the default stage component and approaches the performance of the central SStore component.

FIGURE 4.5. Performance impact of various SStore component configurations (Central, Stage, Stage/Zip) on process migration on a range of migrating processes in a 64 process job: (a) performance impact of SStore components, (b) breakdown of the performance impact into checkpoint and restart time.

If we look at the breakdown of the migration overhead in Figure 4.5(b), we can see that the time to checkpoint increases as we increase the number of migrating processes, since we are checkpointing more processes and putting more pressure on the bandwidth to stable storage. Reciprocally, we can see that the time to restart the migrating processes increases as the number of migrating processes increases. So it makes sense that the compression enabled stage component begins to approach the performance of the central component, since it reduces the amount of data traveling over the network to and from stable storage.

7. Conclusion

This chapter presented the seventh and final C/R capability, the Error Management and Recovery Policy (ErrMgr). The ErrMgr encapsulates the recovery policies available in the system. We discussed how the composable design of the ErrMgr implementation in Open MPI allows for recovery policy fail-over. We analyzed the effect of stable storage configurations on recovery performance. We found that caching local snapshots significantly improves the performance of automatic recovery. We highlighted the scalability implications of various SStore solutions and recovery policies.

5 Application Interaction

Previous chapters have focused on presenting capabilities that form a transparent, coordinated C/R solution for MPI applications on HPC systems. Application-transparent techniques can greatly benefit from even minimal application interaction in the form of guidance. For example, applications can identify temporary buffers that can be removed from the checkpoint or highlight regions of program execution where the state is minimal or naturally synchronized. Applications can also benefit by directly interacting with the C/R infrastructure to determine not only when to checkpoint, but also when and where to migrate for fault avoidance or load balancing. In this chapter, we introduce application and debugger APIs that expose the underlying C/R features to the end user. All of the interfaces presented in this chapter are optional for the application, but provide the application with the opportunity to have greater influence on when and how C/R-related activities are executed in the system. Section 1 focuses on an API for the application. Section 2 focuses on an API for parallel debuggers, which operate transparently to the application.


// Request a Checkpoint
int OMPI_CR_CHECKPOINT(char **handle, int *seq, MPI_Info info);
// Request a Restart
int OMPI_CR_RESTART(char *handle, int seq, MPI_Info info);

FIGURE 5.1. Open MPI Checkpoint and Restart APIs.

1. Application Programming Interface

This section focuses on optional, non-MPI-standard APIs for the application. These are not part of the MPI standard, so they are implemented as part of the Open MPI Extended Interfaces. As such, they are prefixed with OMPI_CR_. Though many of the interfaces provide an optional info argument, no keys are defined at this time except for the migration API. Examples and further details are provided in Appendix A.

1.1. Checkpoint and Restart Requests. The application can request a checkpoint of the application by using the OMPI_CR_CHECKPOINT API, seen in Figure 5.1. The checkpoint API allows an application to request a checkpoint when convenient to the application; for example, when the application is at a minimal or naturally synchronized point in time in order to, potentially, reduce the checkpoint overhead. Additionally, the application can request to restart itself by using the OMPI_CR_RESTART API, also seen in Figure 5.1. The restart interface allows an application to return the computation to a previously established checkpoint in order to work around application-detected faults. For example, this interface is useful for an application that can detect a process behaving in a Byzantine faulty manner, possibly due to soft errors. Once detected, an application could use the restart API to restart the computation from a known, good, previously established state in the computation before the occurrence of the soft error.

The OMPI_CR_CHECKPOINT function returns a handle to the global snapshot reference and a unique sequence number. Since the C/R implementation in Open MPI is fully coordinated, the checkpoint call is collective over all processes in the MPI application. The OMPI_CR_RESTART function takes as arguments the handle and sequence number returned by OMPI_CR_CHECKPOINT or by an external tool like ompi-checkpoint, described in Chapter 3 and Appendix A.
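A minimal usage sketch of these two interfaces is shown below. It assumes the OMPI_CR_ extended interfaces are exposed through a header such as <mpi-ext.h> and that they return MPI_SUCCESS on success; both assumptions are for illustration only.

#include <mpi.h>
#include <mpi-ext.h>   /* assumed location of the OMPI_CR_ extended interfaces */
#include <stdio.h>

int main(int argc, char **argv) {
    char *handle = NULL;
    int seq = 0;

    MPI_Init(&argc, &argv);

    /* ... compute until a minimal or naturally synchronized point ... */

    /* Collective over all processes: request a checkpoint when convenient */
    if (OMPI_CR_CHECKPOINT(&handle, &seq, MPI_INFO_NULL) == MPI_SUCCESS) {
        printf("checkpoint %s (sequence %d) established\n", handle, seq);
    }

    /* If an application-detected fault is found later, roll back with:
     *   OMPI_CR_RESTART(handle, seq, MPI_INFO_NULL);                   */

    MPI_Finalize();
    return 0;
}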

1.2. Quiescence Regions. The quiescence functions allow an application to define a region of the program during which the MPI implementation guarantees that no messages are in-flight with regard to the calling process. A message is said to be in-flight if it has been sent, but not yet received by the recipient; the message may be on the sender side, the receiver side, or somewhere in between. To have no messages in-flight means that all messages that have been sent to this process have been cached inside the MPI implementation, and any messages this process has sent have been cached on the intended receiver side. Or, said another way, any message sent by one process on the specified communicator has been transmitted to the recipient. The recipient may either buffer the message in the MPI implementation (if no receive has been posted) or place it in the recipient's buffer (if a receive has been posted). The MPI implementation can choose how best to buffer the message contents (e.g., network card, internal data structure, specialized hardware). The quiescent region of code is defined between the start and end functions in Figure 5.2. In this region, the use of the MPI interface is restricted to maintain guarantees about in-flight messages between processes. The MPI process may use any functionality that is local in nature and may post new receives. It may also use the request completion functions. The application may not post any new sends or enter collective operations during this time. This functionality is useful when checkpointing an application, or when preparing buffer space for additional communication. The quiescence functions are collective, so all MPI processes defined in the corresponding communicator must call these functions before any of them can complete. Though the interface allows any communicator to be passed to the functions, Open MPI requires it to be MPI_COMM_WORLD.

// Quiescent Region Start
int OMPI_CR_QUIESCE_START(MPI_Comm comm, MPI_Info info);
// Quiescent Region End
int OMPI_CR_QUIESCE_END(MPI_Comm comm, MPI_Info info);
// Request a checkpoint during a quiescent region.
int OMPI_CR_QUIESCE_CHECKPOINT(MPI_Comm comm, char **handle, int *seq, MPI_Info info);

FIGURE 5.2. Open MPI Quiescence APIs.

// Migration Request
int OMPI_CR_MIGRATE(MPI_Comm comm, char *hostname, int rank, MPI_Info info);

FIGURE 5.3. Open MPI Migration API.

The quiescence operations on communicators serve a similar purpose as MPI_WIN_FENCE does for one-sided RMA operations on a window. This collective operation allows the application to force all outstanding communication to the intended recipient of the communication. The OMPI_CR_QUIESCE_CHECKPOINT function allows the application to optionally choose to use the MPI provided CRS. This interface assumes that the MPI library has been quiesced by a previous call to the quiescence start function, in contrast to the OMPI_CR_CHECKPOINT interface presented in Section 1.1.
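As a usage sketch (same header assumptions as the checkpoint example in Section 1.1), a quiescent checkpoint on MPI_COMM_WORLD could be wrapped as follows:

/* Sketch: checkpoint inside a quiescent region on MPI_COMM_WORLD */
static int quiescent_checkpoint(char **handle, int *seq) {
    /* Collective: drain in-flight messages involving this process */
    OMPI_CR_QUIESCE_START(MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Only local operations, posting receives, and request completion
     * are allowed in this region; no new sends or collectives.       */
    int rc = OMPI_CR_QUIESCE_CHECKPOINT(MPI_COMM_WORLD, handle, seq, MPI_INFO_NULL);

    OMPI_CR_QUIESCE_END(MPI_COMM_WORLD, MPI_INFO_NULL);
    return rc;
}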

1.3. Migration. The application can migrate a subset of the processes in a job by using the OMPI_CR_MIGRATE function, depicted in Figure 5.3. The application may want to do so between different phases of computation to better load balance the processes. The group of migrating processes is defined by the communicator passed to the migration function. The application can optionally specify a hostname or rank to move the current rank onto. Future implementations may also allow processes to be moved close to other ranks, for some definition of close to. The application can set the CR_OFF_NODE MPI_Info key to true in order to indicate that it wishes to migrate off of the current node. In order to reduce the time required to migrate, the mapping algorithm should attempt to reduce the number of processes that have to be moved in order to satisfy the migration request.
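A hedged usage sketch of this interface is shown below; the CR_OFF_NODE info key follows the description above, while the NULL hostname and negative rank used to leave the destination to the mapping algorithm are an illustrative convention, not a documented one.

/* Sketch: ask the runtime to move this rank off of its current node.
 * The communicator defines the group of migrating processes.         */
static int migrate_off_node(MPI_Comm comm) {
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "CR_OFF_NODE", "true");  /* key as described in the text */

    /* NULL hostname and a negative rank leave the destination up to the
     * mapping algorithm (illustrative convention, not a documented one). */
    int rc = OMPI_CR_MIGRATE(comm, NULL, -1, info);

    MPI_Info_free(&info);
    return rc;
}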

FIGURE 5.4. Illustration of Open MPI handling a checkpoint request with the application involved in the INC.

1.4. INC Callbacks. As mentioned in Chapter 3, the application can choose to register a callback function to be notified of C/R-related activities during the INC coordination procedure. The application may use this as an opportunity to synchronize additional libraries (e.g., accelerators [253]) or files that would not otherwise be accounted for in the local snapshot. Figure 5.4 shows how the application fits into the INC coordination procedure. The application is provided the first and last opportunity to take action in all three phases of C/R traversed by the INC. By registering a callback function, the application assumes the responsibility for calling the previous callback function, namely ompi_inc, which is returned by the registration function. Figure 5.5 presents the function signature and registration functions for INC callbacks. Table 5.1 shows the various parameters used when registering and de-registering callbacks.

1.5. self Checkpoint/Restart Service (CRS). Though much of the focus in this document has been on using the BLCR CRS for transparent single process checkpointing, Open MPI also provides an application-level CRS called self. self provides callbacks into the application for checkpoint, restart, and continue operations at the point where the CRS would normally be invoked in the INC (shown in the middle of Figure 5.4). These callbacks allow the application to write its checkpoint into a black box directory that is stored with the local snapshot to stable storage, as defined by the active SStore component.

// INC Registration Function
int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
                                  OMPI_CR_INC_callback_function function,
                                  OMPI_CR_INC_callback_function *prev_function);
// INC Callback Function Signature
typedef int (*OMPI_CR_INC_callback_function)(OMPI_CR_INC_callback_event_t event,
                                             OMPI_CR_INC_callback_state_t state);

FIGURE 5.5. Open MPI INC Registration API.

OMPI_CR_INC_callback_event_t State    Description
OMPI_CR_INC_PRE_CRS_PRE_MPI           Pre-checkpoint, before OMPI INC.
OMPI_CR_INC_PRE_CRS_POST_MPI          Pre-checkpoint, after OMPI INC.
OMPI_CR_INC_POST_CRS_PRE_MPI          Continue/Restart, before OMPI INC.
OMPI_CR_INC_POST_CRS_POST_MPI         Continue/Restart, after OMPI INC.

OMPI_CR_INC_callback_state_t State    Description
OMPI_CR_INC_STATE_PREPARE             Pre-checkpoint
OMPI_CR_INC_STATE_CONTINUE            Continue
OMPI_CR_INC_STATE_RESTART             Restart
OMPI_CR_INC_STATE_ERROR               Error

TABLE 5.1. Open MPI INC function callback events and states.
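Putting the registration interface in Figure 5.5 together with the events and states in Table 5.1, an application callback that chains to the previously registered (ompi_inc) callback might be set up as in the sketch below; the choice of event and the return convention are illustrative.

/* Application INC callback: synchronize external state (e.g., accelerator
 * buffers or open files) before a checkpoint, then chain to the previously
 * registered (ompi_inc) callback as required by the framework.            */
static OMPI_CR_INC_callback_function prev_inc = NULL;

static int app_inc_callback(OMPI_CR_INC_callback_event_t event,
                            OMPI_CR_INC_callback_state_t state) {
    if (state == OMPI_CR_INC_STATE_PREPARE) {
        /* flush any state that would not otherwise be in the local snapshot */
    }
    return (prev_inc != NULL) ? prev_inc(event, state) : 0;
}

static void register_app_inc_callback(void) {
    OMPI_CR_INC_register_callback(OMPI_CR_INC_PRE_CRS_PRE_MPI,
                                  app_inc_callback, &prev_inc);
}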

Figure 5.6 presents the three default functions that the self CRS looks for in the application binary (using dlsym). Additionally, the application can explicitly register the functions using the registration functions (also shown in Figure 5.6). Appendix C describes the interfaces to the self CRS in more detail and provides a code example.
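As a sketch of the explicit registration path (what the application stores, and how it uses restart_cmd, is application defined and left out here), the self CRS callbacks might be registered as follows:

/* Application-level checkpoint callback for the self CRS (sketch).
 * Where the state is written and what goes into restart_cmd are
 * application defined; both are left out here.                    */
static int my_self_checkpoint(char **restart_cmd) {
    (void)restart_cmd;
    /* write application state into the snapshot's black box directory */
    return 0;
}

static int my_self_restart(void) {
    /* re-read the state written by my_self_checkpoint() */
    return 0;
}

static void register_self_callbacks(void) {
    OMPI_CR_self_register_checkpoint_callback(my_self_checkpoint);
    OMPI_CR_self_register_restart_callback(my_self_restart);
}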

2. Checkpoint/Restart-Enabled Debugging

The most time consuming part of the software development life-cycle is application debugging. Long-running, large-scale HPC parallel applications compound the time complexity of the debugging process by adding more processes interacting in dynamic ways for longer periods of time.

// Default Checkpoint Callback
int opal_crs_self_user_checkpoint(char **restart_cmd);
// Default Continue Callback
int opal_crs_self_user_continue(void);
// Default Restart Callback
int opal_crs_self_user_restart(void);

// SELF CRS Checkpoint Registration Function
int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_checkpoint_fn)(char **restart_cmd);

// SELF CRS Continue Registration Function
int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_continue_fn)(void);

// SELF CRS Restart Registration Function
int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_restart_fn)(void);

FIGURE 5.6. Open MPI self CRS default callbacks and registration functions.

Cyclic or iterative debugging, a commonly used debugging technique, involves repeated program executions that assist the developer in gaining an understanding of the causes of the bug. Software developers can save hours, even days, spent debugging by checkpointing and restarting the parallel debugging session at intermediate points in the debugging cycle. For MPI applications, the parallel debugger must cooperate with the MPI implementation and the CRS, which account for the network state and process image. In this section we present a design specification for this cooperative relationship to provide C/R-enabled parallel debugging [133]. The C/R-enabled parallel debugging design supports multi-threaded MPI applications without requiring any application modifications. Additionally, all checkpoints, whether generated with or without a debugger attached, are usable both within a debugging session and during normal execution. We highlight the debugger detach and debugger re-attach problems that may lead to inconsistent views of the debugging session, and describe how our design addresses these problems. A discussion of related work is presented in Chapter 2.

2.1. Design. C/R-enabled parallel debugging of MPI applications requires the cooperation of the parallel debugger, the MPI implementation, and the CRS to provide consistently recoverable application states. The debugger provides the interface to the user and maintains the state of the parallel debugging session (e.g., breakpoints, watchpoints). Additionally, the debugger may provide the user with additional interfaces to take or return to a checkpoint. The C/R-enabled MPI implementation marshals the network channels around C/R operations for the application. Though the network channels are often marshaled in a fully coordinated manner, this design does not require full coordination; therefore, the design is applicable to other checkpoint coordination protocol implementations (e.g., uncoordinated). The CRS captures the state of a single process in the parallel application. This can be implemented at the user- or system-level. This protocol requires an MPI-application-transparent CRS, which excludes application-level CRSs. If the CRS is not transparent to the application, then taking the checkpoint would alter the state of the program being debugged, potentially confusing the user. One goal of this design is to create always usable checkpoints. This means that regardless of whether the checkpoint was generated with the debugger attached or not, it must be usable on restart with or without the debugger. To facilitate the always usable checkpoints condition, the checkpoints generated by the CRS with the debugger attached must be able to exclude the debugger state. To achieve this, the debugger must detach from the process before a checkpoint and re-attach, if desired, after the checkpoint has finished, similar to the technique used in [150]. Since we are separating the CRS from the debugger, we must consider the needs of both in our design.

FIGURE 5.7. Illustration of the Debugger Detach and Re-attach Problems.

Detaching the debugger before a checkpoint can allow the application to run uninhibited for a period of time before the actual checkpoint is taken; we call this the debugger detach problem. Similarly, once restarted, the MPI application may run uninhibited for a period of time before the debugger attaches; we call this the debugger re-attach problem. See Figure 5.7 for an illustration of these problems. Preserving the exact user-perceived state of the program counter in the application across a checkpoint is critical to providing the end user with a seamless and consistent view of the debugging process. Interfaces are prefixed with MPIR to fit the existing naming convention for debugging symbols in MPI implementations.

2.1.1. Preparing for a Checkpoint. The C/R-enabled MPI implementation may receive a checkpoint request internally or externally from the debugger, user, or system administrator. The MPI implementation communicates the checkpoint request to the specified processes (usually all processes) in the MPI application. The MPI processes typically prepare for the checkpoint by marshaling the network state and flushing caches before requesting a checkpoint from the CRS. If the MPI process is under debugger control at the time of the checkpoint, then the debugger must allow the MPI process to prepare for the checkpoint uninhibited by the debugger. If the debugger remains attached, it may interfere with the techniques used by the CRS to preserve the application state (e.g., by masking signals). Additionally, by detaching the debugger before the checkpoint, the implementation can provide the always usable checkpoints condition by ensuring that it does not inadvertently include any debugger state in the CRS-generated checkpoint.

volatile int MPIR_checkpoint_debug_gate = 0;
volatile int MPIR_debug_with_checkpoint = 0;

// Detach Function
int MPIR_checkpoint_debugger_detach(void) {
    return 0;
}

// Thread Wait Function
void MPIR_checkpoint_debugger_waitpoint(void) {
    // MPI Designated Threads are released early,
    // All other threads enter the breakpoint below
    MPIR_checkpoint_debug_gate = 0;
    MPIR_checkpoint_debugger_breakpoint();
}

// Debugger Breakpoint Function
void MPIR_checkpoint_debugger_breakpoint(void) {
    while( MPIR_checkpoint_debug_gate == 0 ) {
        sleep(1);
    }
}

// CRS Hook Callback Function
void MPIR_checkpoint_debugger_crs_hook(int state) {
    if( MPIR_debug_with_checkpoint ) {
        MPIR_checkpoint_debug_gate = 0;
        MPIR_checkpoint_debugger_waitpoint();
    } else {
        MPIR_checkpoint_debug_gate = 1;
    }
}

FIGURE 5.8. Debugger MPIR function pseudo code.

The MPI process must inform the debugger of when to detach since the debugger is required to do so before a checkpoint is requested. The MPI process informs the debugger by calling the MPIR_checkpoint_debugger_detach() function when it requires the debugger to detach. This is an empty function that the debugger can reference in a breakpoint. It is left to the discretion of the MPI implementation when to call this function while preparing for the checkpoint, but it must be invoked before the checkpoint is requested from the CRS.

// CRS hook callback function pseudo code
void MPIR_checkpoint_debugger_crs_hook(int state) {
    if( MPIR_debug_with_checkpoint ) {
        MPIR_checkpoint_debug_gate = 0;
        MPIR_checkpoint_debugger_waitpoint();
    } else {
        MPIR_checkpoint_debug_gate = 1;
    }
}

FIGURE 5.9. MPI registered CRS hook callback function pseudo code.

The period of time between when the debugger detaches from the MPI process and when the checkpoint is created by the CRS may allow the application to run uninhibited; this is the debugger detach problem (see Figure 5.7 for an illustration). To provide a seamless and consistent view to the user, the debugger must make a best-effort attempt at preserving the exact position of the program counter(s) across a checkpoint operation. To address the debugger detach problem, the debugger forces all threads into a waiting function (called MPIR_checkpoint_debugger_waitpoint()) at the current debugging position before detaching from the MPI process. By forcing all threads into a waiting function the debugger prevents the program from making any progress when it returns from the checkpoint operation. Section 2.3.2 describes techniques by which the debugger might achieve this.

The waiting function must allow certain designated threads to complete the checkpoint operation. In a single-threaded application, this would be the main thread, but in a multi-threaded application this would be the thread(s) designated by the MPI implementation to prepare for and request the checkpoint from the CRS. The MPI implementation must provide an “early release” check for the designated thread(s) in the waiting function. All other threads directly enter the MPIR_checkpoint_debugger_breakpoint() function, which waits in a loop for release by the debugger. Designated thread(s) are allowed to continue normal operation, but must enter the breakpoint function after the checkpoint has completed to provide a consistent state across all threads to the debugger, if it intends on re-attaching. Figure 5.8 presents a pseudo code implementation of these functions.
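The pseudo code in Figure 5.8 only notes the early release of the designated thread(s) in a comment. A minimal sketch of that check is shown below; the predicate MPIR_checkpoint_designated_thread() is an assumption standing in for however the MPI implementation identifies its checkpoint thread(s), and is not part of the defined interface.

// Sketch: early-release check for the MPI-designated checkpoint thread(s).
// MPIR_checkpoint_designated_thread() is an assumed, implementation-provided
// predicate; everything else follows the symbols in Figure 5.8.
void MPIR_checkpoint_debugger_waitpoint(void) {
    if( MPIR_checkpoint_designated_thread() ) {
        // Designated thread(s) skip the breakpoint so they can prepare for
        // and request the checkpoint; they re-enter the breakpoint later.
        return;
    }
    // All other threads close the gate and wait for the debugger.
    MPIR_checkpoint_debug_gate = 0;
    MPIR_checkpoint_debugger_breakpoint();
}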


FIGURE 5.10. Illustration of the design for each of the use case scenarios.

The breakpoint function loop is controlled by the MPIR_checkpoint_debug_gate variable. When this variable is set to 0 the gate is closed, keeping threads waiting for the gate to be opened by the debugger. To open the gate, the debugger sets the variable to a non-zero value and steps each thread out of the loop and the breakpoint function. Once all threads pass through the gate, the debugger closes it once again by setting the variable back to 0.

2.1.2. Resuming After a Checkpoint. An MPI program either continues after a requested checkpoint in the same program, or is restarted from a previously established checkpoint saved on stable storage. In both cases the MPI designated thread(s) are responsible for recovering the internal MPI state, including reconnecting processes in the network [135].

If the debugger intends to attach, the designated thread(s) must inform the debugger when it is safe to attach after restoring the MPI state of the process. If the debugger attaches too early, it may compromise the state of the checkpoint or the restoration of the MPI state. The designated thread(s) notify the debugger that it is safe to attach to the process by printing to stderr the hostname and PID of each recovered process. The message is prefixed by “MPIR debug info)” to distinguish it from other output. Afterwards, the designated thread(s) enter the breakpoint function.
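As a concrete illustration of this notification, the following is a minimal sketch of the stderr message; the function name, buffer size, and exact formatting are assumptions rather than part of the defined interface.

#include <stdio.h>
#include <unistd.h>

// Sketch: tell the debugger it is safe to attach by printing the hostname
// and PID of the recovered process to stderr with the agreed prefix.
static void MPIR_print_reattach_message(void) {
    char hostname[256];
    if( gethostname(hostname, sizeof(hostname)) != 0 ) {
        hostname[0] = '\0';
    }
    // e.g., "MPIR debug info) localhost:789"
    fprintf(stderr, "MPIR debug info) %s:%d\n", hostname, (int)getpid());
    fflush(stderr);
}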

The period of time between when the MPI process is restarted by the CRS and when the debugger attaches may allow the application to run uninhibited; this is the debugger re-attach problem (see Figure 5.7 for an illustration). If the MPI process was being debugged before the checkpoint was requested, then the threads are already being held in the breakpoint function, thus preventing them from running uninhibited. However, if the MPI process was not being debugged before the checkpoint then the user may experience inconsistent behavior due to the race to attach the debugger upon multiple restarts from the same checkpoint.

To address this problem, the CRS must provide a hook callback function that is pushed onto the stack of all threads before returning them to the running state. This technique preserves the individual thread’s program counter position at the point of the checkpoint, providing a best-effort attempt at a consistent recovery position upon multiple restarts. The MPI implementation registers a hook callback function that will place all threads into the waiting function if the debugger intends to re-attach. The intention of the debugger to re-attach is indicated by the MPIR_debug_with_checkpoint variable. Since the hook function is the same function used when preparing for a checkpoint, the release of the threads from the waiting function is consistent from the perspective of the debugger.

If the debugger is not going to attach after the checkpoint or on restart, the hook callback does not need to enter the waiting function, again indicated by the MPIR_debug_with_checkpoint variable. Since the checkpoint could have been generated with a debugger previously attached, the hook function must release all threads from the breakpoint function by setting the MPIR_checkpoint_debug_gate variable to 1. The structure of the hook callback function allows checkpoints generated while debugging to be used without debugging, and vice versa. See Figure 5.8 for an example of the hook callback function. The MPI implementation may want to prevent the threads from entering the waiting function multiple times by determining if the threads were already placed in the waiting function during checkpoint preparation.

2.1.3. Additional MPIR Symbols. In addition to the detach, waiting, and breakpoint functions, this design defines a series of support variables to allow the debugger greater generality when interfacing with a C/R-enabled MPI implementation. These interfaces, with their Open MPI representation, are presented in Figure 5.11 and described in this section.

(gdb) p MPIR_checkpointable
$1 = 1
(gdb) p MPIR_checkpoint_command
$1 = ompi-checkpoint --crdebug --hnp-jobid 1234
(gdb) p MPIR_restart_command
$1 = ompi-restart --crdebug
(gdb) p MPIR_checkpoint_listing_command
$1 = ompi-checkpoint -l --crdebug
(gdb) p MPIR_controller_hostname
$1 = localhost
(gdb) set MPIR_debug_with_checkpoint = 0
(gdb) detach

FIGURE 5.11. Additional MPIR symbols.

The MPIR_checkpointable variable indicates to the debugger that the MPI implementation is C/R-enabled and supports this design when set to 1. The MPIR_debug_with_checkpoint variable indicates to the MPI implementation whether the debugger intends to attach. If the debugger wishes to detach from the program, it sets this value to 0 before detaching, indicating that it no longer intends to debug the program. This value is set to 1 when the debugger attaches to the job either while running or on restart.

The MPIR_checkpoint_command variable specifies the command to be used to initiate a checkpoint of the MPI process. The output of the checkpoint command must be formatted such that the debugger can use it directly as an argument to the restart command. The output on stderr is prefixed with “MPIR checkpoint handle)” to distinguish it from other output. The MPIR_restart_command variable specifies the restart command that is prefixed to the output of the checkpoint command to restart an MPI application. The MPIR_controller_hostname variable specifies the host on which to execute the MPIR_checkpoint_command and MPIR_restart_command commands. The MPIR_checkpoint_listing_command variable specifies the command that lists the available checkpoints on the system.
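To make the intended use of these command variables concrete, the following is a minimal sketch of how a debugger-side tool might combine MPIR_restart_command with a checkpoint handle reported by the checkpoint command; the helper name, buffer handling, and error convention are assumptions rather than part of the design.

#include <stdio.h>

// Sketch: build the restart command line by prefixing the restart command
// to the handle printed by the checkpoint command (the line prefixed with
// "MPIR checkpoint handle)"), e.g. "<restart command> <handle>".
static int build_restart_command(const char *restart_command,
                                 const char *checkpoint_handle,
                                 char *out, size_t out_len) {
    int n = snprintf(out, out_len, "%s %s", restart_command, checkpoint_handle);
    return (n > 0 && (size_t)n < out_len) ? 0 : -1;
}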

                              Scenario 1     Scenario 2     Scenario 3     Scenario 4
Before/After the Checkpoint   Pre    Post    Pre    Post    Pre    Post    Pre    Post
MPIR_checkpointable           T      T       T      T       T      T       T      T
MPIR_debug_with_checkpoint    F      F       F      T       T      T       T      F
MPIR_checkpoint_debug_gate    0      1       0      0       0      0       0      1
Threads waiting?              F      F       F      T       T      T       T      F

TABLE 5.2. MPIR variable states associated with use case scenarios.

2.2. Use Case Scenarios. To better illustrate how the various components cooperate to provide C/R-enabled parallel debugging we present a set of use case scenarios. Table 5.2 describes the state of the various MPIR variables both before a checkpoint occurs and afterwards (in either the continue or the restart state). Figure 5.10 presents an illustration of the design for each scenario.

2.2.1. Scenario 1: No Debugger Involvement. This is the standard C/R scenario in which the debugger is neither involved before a checkpoint nor afterwards (a transition from the upper-left to upper-right quadrants in Figure 5.10). The MPI processes involved in the checkpoint will prepare the internal MPI state and request a checkpoint from the CRS, then continue free execution afterwards. Before requesting a checkpoint the MPI designated checkpoint thread(s) call the MPIR_checkpoint_debugger_detach() function, which has no effect on the process since a debugger is not attached. After the checkpoint is finished, the hook callback function is executed in each thread in the MPI process by the CRS. Since no debugger intends on attaching, these functions exit without waiting and the MPI program is allowed to continue execution as normal.

2.2.2. Scenario 2: Debugger Attaches on Restart. In this scenario, the debugger is attaching to a restarting MPI process from a checkpoint that was generated without the debugger (a transition from the upper-left to lower-right quadrants in Figure 5.10). This scenario is useful when repurposing, for debugging, checkpoints originally generated for fault tolerance purposes. The process of creating the checkpoints is the same as in Scenario 1. On restart, the hook callback function is called by the CRS in each thread to preserve their program counter positions.

When the MPI process is restarted, the hook callback function is executed in each thread in the MPI process by the CRS, thus preserving the position of the program counter in each thread consistently at the point of the checkpoint. The MPI designated checkpoint thread(s) are allowed to exit the hook function to reconstruct the MPI state, while all other threads wait in the breakpoint function. After reconstructing the MPI state the designated thread(s) indicate to the debugger that it is safe to attach (on stderr), and enter the breakpoint function.

The debugger then attaches to the MPI process and walks all threads out of the breakpoint function. It is then in full control of the MPI process and able to resume debugging at the same point in the application execution for every restart from the same checkpoint.

2.2.3. Scenario 3: Debugger Attached While Checkpointing. In this scenario, the debugger is attached when a checkpoint is requested of the MPI process (a transition from the lower-left to lower-right quadrants in Figure 5.10). This scenario is useful when creating checkpoints while debugging that can be returned to in later iterations of the debugging cycle, or to provide backstepping functionality while debugging. The MPI processes involved in the checkpoint will prepare the internal MPI state, and the MPI designated thread(s) call the detach function just before requesting the checkpoint from the CRS. Alternatively, the detach function may be called before the MPI designated thread(s) prepare the internal MPI state, whichever makes the most sense for the MPI implementation. The debugger will notice the call to the detach function and call the waiting function in all threads in the MPI process. The MPI designated checkpoint thread(s) are allowed to continue through this function in order to request the checkpoint from the CRS, while all other threads wait there for later release.

After the checkpoint is finished, the hook callback function is executed in each thread in the MPI process by the CRS. Since all threads were placed in the breakpoint function before the checkpoint was requested there is no need to reenter the function, so threads may be allowed to fall through this function and continue waiting at the previous breakpoint function call. After reconstructing the MPI state, the designated checkpoint thread(s) indicate to the debugger that it is safe to reattach (via a message on stderr, see Section 2.1.2) then enter the breakpoint function. The debugger then attaches to the MPI process as in Scenario 2.

2.2.4. Scenario 4: Debugger Detached on Restart. In this scenario, the debugger is attached when the checkpoint is requested of the MPI process, but is not when the MPI process is restarted from the checkpoint (a transition from the lower-left to upper-right quadrants in Figure 5.10). This scenario is useful when analyzing the uninhibited behavior of an application, periodically inspecting checkpoints for validation purposes, or, possibly, introducing tracing functionality to a running program. The process of creating the checkpoint is the same as in Scenario 3. By inspecting the MPIR_debug_with_checkpoint variable, the MPI processes know to let themselves out of the waiting function afterwards. When the MPI process is restarted, the hook callback function is executed in each thread in the MPI process by the CRS.
Since the debugger does not intend to attach, the threads are allowed to exit the function without waiting. In order to escape the breakpoint function that the threads were placed in by the debugger before the checkpoint, they must open the MPIR_checkpoint_debug_gate before exiting the hook function. The debugger will reset this value if and when it attaches at a later point in time.

2.3. Implementation. The design described in this section was implemented using GNU’s GDB debugger, Allinea’s DDT Parallel Debugger, the Open MPI implementation of the MPI standard, and the BLCR Checkpoint/Restart Service (CRS). Open MPI implements a fully coordinated C/R protocol [136], so when a checkpoint is requested of one process all processes in the MPI job are also checkpointed. It should be pointed out again that nothing about the specification requires a coordinated protocol, so other coordination protocols (e.g., uncoordinated) can be used at the discretion of the MPI implementation.

2.3.1. Interlayer Notification Callback Functions. The Interlayer Notification Callback (INC) functions are used in Open MPI to coordinate the internal state of the MPI implementation before and after a checkpoint operation. After receiving notification of a checkpoint request, Open MPI calls the INC_checkpoint_prep() function (see Figure 5.12). This function quiesces the network and prepares various components for a checkpoint operation [135].

void INC_checkpoint_prep() {
    // Prepare the network and MPI state
    Quiesce_Network();
    Prep_Components();
    // Identify this thread as a special thread
    free_threads[0] = thread_self();
    // Detach the Debugger
    MPIR_checkpoint_debugger_detach();
    // Request the checkpoint
    CRS_request_checkpoint();
}

FIGURE 5.12. Pseudo code of Open MPI’s checkpoint preparation INC function.
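The free_threads[] bookkeeping above is what an early-release predicate in the waiting function could consult (such as the MPIR_checkpoint_designated_thread() check sketched in Section 2.1.1). A minimal sketch follows, assuming POSIX threads; the array bound, the counter, and the use of pthread_self()/pthread_equal() in place of the generic thread_self() are assumptions, not Open MPI's actual implementation.

#include <pthread.h>

// Sketch: designated-thread predicate matching the free_threads[] bookkeeping
// in Figure 5.12. The waitpoint early-release check can call this to decide
// whether the current thread may skip the breakpoint.
#define MPIR_MAX_FREE_THREADS 4
static pthread_t free_threads[MPIR_MAX_FREE_THREADS];
static int       num_free_threads = 0;

static int MPIR_checkpoint_designated_thread(void) {
    for( int i = 0; i < num_free_threads; ++i ) {
        if( pthread_equal(free_threads[i], pthread_self()) ) {
            return 1;
        }
    }
    return 0;
}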

Once the INC is finished, it designates a checkpoint thread, calls the MPIR_checkpoint_debugger_detach() function, and requests the checkpoint from the CRS, in this case BLCR. After the checkpoint is created (called the continue state), or when the MPI process is restarted (called the restart state), BLCR calls the hook callback function in each thread (see Figure 5.8). The thread designated by Open MPI in the INC_checkpoint_prep() function is allowed to exit this function without waiting, while all other threads must wait if the debugger intends on attaching.

The designated thread then calls the INC function for either the continue or the restart phase, depending on whether the MPI process is continuing after a checkpoint or restarting from a checkpoint previously saved to stable storage. In Open MPI, these functions operate in a similar manner, though they differ slightly in complexity of operation. In the continue phase, the MPI implementation can assume that most of the cached system and network information remains valid, while in the restart phase it cannot make such assumptions since processes have likely moved around in the system. For brevity, we present the pseudo code for the INC function that is applicable to both of these states in Figure 5.13.

If the debugger intends on attaching to the MPI process, then after reconstructing the MPI state, the designated thread notifies the debugger that it is safe to attach by printing to stderr the hostname and PID of each recovered process, prefixed with “MPIR debug info)” as mentioned in Section 2.1.2.

void INC_checkpoint_recovery() {
    // Reattach processes
    Recover_Network();
    Recover_Components();
    // If the debugger is going to reattach
    if( MPIR_debug_with_checkpoint ) {
        // Notify Debugger that it is ok to reattach
        // MPIR debug info) localhost:789
        print_reattach_message();
        // Wait for Debugger to reattach
        MPIR_checkpoint_debugger_breakpoint();
    }
}

FIGURE 5.13. Pseudo code of Open MPI’s checkpoint continue/restart INC function.

The debugger can then attach and walk the threads out of the breakpoint function (by opening the debug gate) and resume debugging. If the debugger does not intend to attach to the MPI process, then after reconstructing the MPI state the designated thread releases all the other threads and resumes normal execution.

2.3.2. Stack Modification. In Section 2.1.1, the debugger was required to force all threads to call the waiting function before detaching ahead of a checkpoint in order to preserve the program counter in all threads across a checkpoint operation. We explored two different ways to do this in the GDB debugger. The first required the debugger to force the function onto the call stack of each thread. In GDB we executed the following command in each thread:

call MPIR_checkpoint_debugger_waitpoint()

Unfortunately this became brittle and corrupted the stack in GDB 6.8. In response to this, we explored an alternative technique based on signals. Open MPI registered a signal callback function (see Figure 5.14) that calls the MPIR_checkpoint_debugger_waitpoint() function. The debugger can then send a designated signal (e.g., SIGTSTP) to each thread in the application, and the program will place itself in the waiting function.

// Signal handler for SIGTSTP
void MPIR_checkpoint_debugger_signal_handler(int num) {
    MPIR_checkpoint_debugger_waitpoint();
}

FIGURE 5.14. Open MPI’s signal handler function to support stack modification.

Though the signal-based technique worked best for GDB, other debuggers may have other techniques at their disposal to achieve this goal.
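For completeness, a minimal sketch of how such a handler might be installed using the standard sigaction interface is shown below; the handler name follows Figure 5.14, while the wrapper function, its placement, and the choice of flags are assumptions about the implementation.

#include <signal.h>
#include <string.h>

// Sketch: install the waitpoint signal handler for SIGTSTP so the debugger
// can drive every thread into the waiting function by sending that signal.
// Error handling beyond the return code is omitted.
static int install_waitpoint_signal_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = MPIR_checkpoint_debugger_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   // restart interrupted system calls where possible
    return sigaction(SIGTSTP, &sa, NULL);
}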

3. Conclusion

This chapter focuses on presenting interfaces to the C/R functionality (described in Chapters 3 and 4) to applications and parallel debuggers. The application interfaces allow an application to define quiescent regions and to request checkpoints, restarts, and process migrations. The INC callbacks allow an application to participate in the checkpoint and restart life-cycle. The self CRS even allows application-layer C/R mechanisms to take advantage of the coordination (via SnapC) and storage (via SStore) features provided by Open MPI. The C/R-enabled parallel debugging protocol reduces the time spent debugging long-running applications by returning a developer to an intermediary point in the computation closer to the bug. The discussion of this protocol highlighted the debugger detach and reattach problems and presented a solution that preserves the always usable checkpoints goal.

6 Conclusions

Our research defined seven capabilities that, when taken together, form a complete, transparent, coordinated C/R fault tolerance infrastructure for MPI implementations to support resilient MPI applications on HPC systems. Our implementation of these seven capabilities in Open MPI provided end users with three often-requested C/R-related services in a single implementation. First, we provided applications with automatic and manual reactive fault recovery services, allowing them to recover from unexpected process failure. Next, we provided applications with a proactive process migration service that allows them to move processes away from anticipated failure. Lastly, we provided a protocol for C/R-enabled parallel debugging that reduces the time spent debugging long-running applications by allowing the developer to return to an intermediary point in the computation closer to the bug.

A significant contribution of our transparent C/R infrastructure is its ability to turn a traditional MPI application into a resilient MPI application without application modification. Our implementation in Open MPI is unique in that it provides reactive fault recovery, proactive process migration, and C/R-enabled parallel debugging services within a single implementation with clearly defined capabilities.

Application developers are concerned about the performance impact of incorporating any fault tolerance techniques into their application, particularly transparent fault tolerance techniques. With this in mind, we investigated the performance implications of our implementation on a variety of benchmarks and real applications including HPL, POP, LAMMPS, NAS, and GROMACS. We verified that stable storage is the predominant C/R performance bottleneck, representing up to 98% of the checkpoint overhead in a traditional configuration. We demonstrated a significant reduction in checkpoint overhead (up to 38% for some applications) by replacing the traditional stable storage approach with a staging model that overlapped checkpoint establishment with normal computation. We assessed the impact of using node-local caching and compression on automatic recovery and checkpoint overhead. We found that node-local caching can significantly improve the performance and scalability of automatic recovery. We showed that including off-line, node-local compression in the staging pipeline not only saved stable storage space, but also improved the checkpoint latency and overhead (up to 72% and 40%, respectively, for some applications). Post-restart application performance is just as important as checkpoint performance. With this in mind, we demonstrated a technique that preserves performance across process recovery by allowing rediscovery of interconnects between processes to better adapt to availability and process layout adjustments.

Applications can often further improve the performance of a C/R implementation by guiding the checkpoint, restart, and migration operations. Our implementation in Open MPI provides a set of APIs and command line tools that were designed to increase end user adoption and third party software integration. Such interfaces allow the application to, optionally, break the transparency barrier to customize the C/R solution to the application’s reliability and performance requirements.

The C/R infrastructure in Open MPI separates each of the seven capabilities into individual frameworks; this makes the entire infrastructure more maintainable for MPI developers and more extensible for fault tolerance researchers. Future implementers can build upon our research to create more maintainable, extensible, and flexible designs. Further, our modular design lowers the barrier to entry for C/R researchers by allowing them to focus their development on a subset of capabilities, spurring innovation. Lastly, applications can immediately take advantage of this ongoing research by using our C/R-enabled MPI infrastructure in Open MPI.

The capabilities defined for C/R fault tolerance are proving useful in supporting research into alternative fault tolerance techniques. The run-through stabilization ErrMgr component is being used to support research into fault tolerant MPI semantics. The non-blocking process management MPI interface presented in this dissertation, combined with the stabilization ErrMgr component, will allow an application to overlap process recovery with normal computation, preserving performance during recovery. As HPC reliability declines for long-running and scalable scientific applications, fault tolerance researchers must quickly adapt to an application’s reliability requirements. A composable design with clearly identifiable capabilities allows fault tolerance researchers to match their research efforts to the pace of the application’s reliability requirements.

1. Future Work

Future extensions to the work presented in this dissertation are inevitable as technology advances and research scrambles to adapt to future HPC systems. As applications begin to experiment with MPI-2 and future MPI standard interfaces, implementations of the capabilities presented in this dissertation will likely need to be extended and the relationships between capabilities will need to be reassessed. For example, one-sided communication and parallel I/O operations will require special attention when checkpointing to preserve performance and account for file contents.

CRS implementations are becoming more available and are maturing to include advanced features such as memory inclusion/exclusion and incremental checkpointing [209].

The CRS implementation in Open MPI will need to evolve to incorporate these features and expose them to the application as necessary. The interaction between the CRS and SStore capabilities will need to be adapted to better support incremental checkpointing, since a checkpoint carries a dependency on previous incremental checkpoints.

Since stable storage is the primary bottleneck in a C/R solution, future work should investigate advanced staging and peer-based storage techniques. This dissertation highlighted some of the benefits of including off-line compression in the staging pipeline. A formal model of the impact of compression in a staging pipeline will assist end users in determining if and how compression might benefit their application. Future analysis of compression techniques may wish to consider the benefits of using intermediary nodes and alternative compression techniques. Future work into checkpoint storage should investigate the benefits of peer-based storage (e.g., SCR [36]) and checkpoint-specific file systems (e.g., stdchk [5]). Future work may also extend the implementation of the FileM capability by supporting standard UNIX copy commands and high performance out-of-band communication channels for checkpoint file and directory management.

Even though stable storage is the primary bottleneck in a C/R solution, future work should also explore alternative checkpoint coordination algorithms in the CRCP capability. Alternative checkpoint coordination algorithms often allow for looser synchrony between processes, which may lead to a more scalable C/R solution [103, 263, 266]. Due to the checkpoint-unfriendly nature of high-speed interconnects like InfiniBand and Myrinet, the Open MPI implementation is forced to shut down these interconnects at every checkpoint request. Future work should improve the interaction between the interconnect drivers and CRSs to reduce the checkpoint overhead by eliminating the need to shut down and reconnect these drivers across a checkpoint operation.

In this dissertation we assessed the impact of an eager process migration protocol. Future work may wish to investigate direct copy techniques that remove the logically centralized stable storage device from the process migration procedure in our implementation. Future work may also wish to investigate other process migration protocols like pre-copy [257, 277] and quasi-asynchronous [66]. These techniques can potentially reduce the migration overhead by overlapping the migration of a process with the transfer of state and normal computation.

Bibliography

[1] Mostafa Abd-El-Barr. Design And Analysis of Reliable And Fault-tolerant Computer Systems. Imperial Col- lege Press, 2007. [2] Hazim Abdel-Shafi, Evan Speight, and John K. Bennett. Efficient user-level thread migration and check- pointing on Windows NT clusters. In WINSYM’99: Proceedings of the 3rd conference on USENIX Windows NT Symposium, Berkeley, CA, USA, 1999. USENIX Association. [3] Saurabh Agarwal, Rahul Garg, Meeta S. Gupta, and Jose E. Moreira. Adaptive incremental checkpointing for massively parallel systems. In ICS ’04: Proceedings of the 18th annual international conference on Supercomputing, pages 277–286, New York, NY, USA, 2004. ACM. [4] Hiralal Agrawal, Richard A. DeMillo, and Eugene H. Spafford. An execution-backtracking approach to debugging. IEEE Software, 8(3):21–26, 1991. [5] Samer Al-Kiswany, Matei Ripeanu, Sudharshan S. Vazhkudai, and Abdullah Gharaibeh. stdchk: A check- point storage system for desktop grid computing. International Conference on Distributed Computing Systems, pages 613–624, 2008. [6] Peter A. Alsberg and John D. Day. A principle for resilient sharing of distributed resources. In ICSE ’76: Proceedings of the 2nd International Conference on Software Engineering, pages 562–570, Los Alamitos, CA, USA, 1976. IEEE Computer Society Press. [7] Lorenzo Alvisi, B. Hoppe, and Keith Marzullo. Nonblocking and orphan-free message logging protocols. In The 23rd International Symposium on Fault-Tolerant Computing (FTCS-23), pages 145–154, Toulouse, France, June 1993. [8] Lorenzo Alvisi and Keith Marzullo. Trade-offs in implementing causal message logging protocols. In PODC ’96: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 58–67, New York, NY, USA, 1996. ACM. [9] Lorenzo Alvisi and Keith Marzullo. Message logging: Pessimistic, optimistic, causal, and optimal. IEEE Transactions on Software Engineering, 24(2):149–159, 1998.


[10] Lorenzo Alvisi, Sriram Rao, Syed Amir Husain, Asanka de Mel, and Elmootazbellah Elnozahy. An analysis of communication-induced checkpointing. In FTCS ’99: Proceedings of the Twenty-Ninth Annual Interna- tional Symposium on Fault-Tolerant Computing, Washington, DC, USA, 1999. IEEE Computer Society. [11] Jason Ansel, Kapil Aryay, and Gene Cooperman. DMTCP: Transparent checkpointing for cluster compu- tations and the desktop. International Parallel and Distributed Processing Symposium, pages 1–12, 2009. [12] Jason Ansel, Michael Rieker, and Gene Cooperman. User-level socket-based checkpointing for distributed and parallel computation. The Computing Research Repository, 0701037, 2007. [13] Sarala Arunagiri, John T. Daly, and Patricia J. Teller. Modeling and analysis of checkpoint I/O operations. In ASMTA ’09: Proceedings of the 16th International Conference on Analytical and Stochastic Modeling Techniques and Applications, pages 386–400, Berlin, Heidelberg, 2009. Springer-Verlag. [14] Algirdas Avizienis. Fault-tolerance and fault-intolerance: Complementary approaches to reliable com- puting. In Proceedings of the International Conference on Reliable Software, pages 458–464, New York, NY, USA, 1975. ACM Press. [15] Wendy Bartlett and Lisa Spainhower. Commercial fault tolerance: A tale of two systems. IEEE Transac- tions on Dependable Secure Computing, 1(1):87–96, 2004. [16] Rajanikanth Batchu, Anthony Skjellum, Zhenqian Cui, Murali Beddhu, Jothi P. Neelamegam, Yoginder Dandass, and Manoj Apte. MPI/FT: Architecture and taxonomies for fault-tolerant, message-passing middleware for performance-portable parallel computing. In CCGRID ’01: Proceedings of the 1st Inter- national Symposium on Cluster Computing and the Grid, Washington, DC, USA, 2001. IEEE Computer Society. [17] Adam Beguelin, Erik Seligman, and Peter Stephan. Application level fault tolerance in heterogeneous networks of workstations. Journal of Parallel and Distributed Computing, 43(2):147–155, 1997. [18] John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. PLFS: A checkpoint filesystem for parallel applications. In SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–12, New York, NY, USA, 2009. ACM. [19] John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. PLFS: A checkpoint filesystem for parallel applications. Technical Report LA-UR- 09-02117, Los Alamos National Laboratory, April 2009. [20] Marin Bertier, Olivier Marin, and Pierre Sens. Performance analysis of a hierarchical failure detector. International Conference on Dependable Systems and Networks, 2003. [21] K. Bhatia, K. Marzullo, and L. Alvisi. The relative overhead of piggybacking in causal message logging protocols. In SRDS ’98: Proceedings of the The 17th IEEE Symposium on Reliable Distributed Systems, Washington, DC, USA, 1998. IEEE Computer Society. BIBLIOGRAPHY 126

[22] Karan Bhatia, Keith Marzullo, and Lorenzo Alvisi. Scalable causal message logging for wide-area environ- ments. Proceedings of the European Conference on Parallel Computing (Euro-Par 2001), pages 864–873, 2001. [23] Bob Boothe. Efficient algorithms for bidirectional debugging. In PLDI ’00: Proceedings of the ACM SIG- PLAN 2000 Conference on Programming Language Design and Implementation, pages 299–310, New York, NY, USA, 2000. ACM Press. [24] George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, and Anton Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Supercomputing ’02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1–18, Los Alamitos, CA, USA, 2002. IEEE Computer Society Press. [25] Fatiha Bouabache, Thomas Herault, Gilles Fedak, and Franck Cappello. Hierarchical replication tech- niques to ensure checkpoint storage reliability in grid environment. In CCGRID ’08: Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid, pages 475–483, Washing- ton, DC, USA, 2008. IEEE Computer Society. [26] Aurelien Bouteiller, George Bosilca, and Jack Dongarra. Retrospect: Deterministic replay of MPI appli- cations for interactive distributed debugging. Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 297–306, 2007. [27] Aurelien Bouteiller, Franck Cappello, Thomas Herault, Geraud Krawezik, Pierre Lemarinier, and Frederic Magniette. MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In SC ’03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2003. IEEE Computer Society. [28] Aurelien Bouteiller, Boris Collin, Thomas Herault, Pierre Lemarinier, and Franck Cappello. Impact of event logger on causal message logging protocols for fault tolerant MPI. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05), Washington, DC, USA, 2005. IEEE Computer Society. [29] Aurelien Bouteiller, Thomas Herault, G. Krawezik, Pierre Lemarinier, and Franck Cappello. MPICH-V project: A multiprotocol automatic fault-tolerant MPI. In International Journal of High Performance Com- puting Applications, volume 20, pages 319–333. Sage Publications, Inc., 2006. [30] Aurelien Bouteiller, Pierre Lemarinier, Geraud Krawezik, and Franck Cappello. Coordinated checkpoint versus message log for fault tolerant MPI. IEEE International Conference on Cluster Computing, 2003. [31] M. Bozyigit and M. Wasiq. User-level process checkpoint and restore for migration. SIGOPS Operating Systems Review, 35(2):86–96, 2001. BIBLIOGRAPHY 127

[32] Greg Bronevetsky, Rohit Fernandes, Daniel Marques, Keshav Pingali, and Paul Stodghill. Recent advances in checkpoint/recovery systems. In International Parallel and Distributed Processing Symposium (IPDPS), 2006. Next Generation Systems Program Workshop. [33] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. Automated application-level checkpointing of MPI programs. In PPoPP ’03: Proceedings of the 9th ACM SIGPLAN Symposium on Prin- ciples and Practice of Parallel Programming, pages 84–94, New York, NY, USA, 2003. ACM Press. [34] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. C3: A system for automating application-level checkpointing of MPI programs. Proceedings of the 16th International Workshop on Lan- guages and Compilers for Parallel Computers (LCPC’03), pages 357–373, 2003. [35] Greg Bronevetsky, Daniel Marques, Keshav Pingali, and Paul Stodghill. Collective operations in application-level fault-tolerant MPI. In ICS ’03: Proceedings of the 17th Annual International Conference on Supercomputing, pages 234–243, New York, NY, USA, 2003. ACM Press. [36] Greg Bronevetsky and Adam Moody. Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O. Technical Report LLNL-TR-415791, Lawrence Livermore National Laboratory, 2009. [37] Clairton Buligon, Sergio Cechin, and Ingrid Jansch-Porto. Implementing rollback-recovery coordinated checkpoints. Advanced Distributed Systems, 3563:246–257, August 2005. [38] Darius Buntinas, Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI protocols. Future Generation Computer Systems, 24(1), January 2008. [39] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of Super- computing Symposium, pages 379–386, 1994. [40] A. Calderon, F. Garcia-Carballeira, J. Carretero, J. M. Perez, and L. M. Sanchez. A fault tolerant MPI-IO implementation using the expand parallel file system. In PDP ’05: Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-Based Processing, pages 274–281, Washington, DC, USA, 2005. IEEE Computer Society. [41] Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward exascale resilience. International Journal of High Performance Computing Applications, 23(4):374–388, 2009. [42] Jeremy Casas, Dan Clark, Phil Galbiati, Ravi Konuru, Steve Otto, Robert Prouty, and Jonathan Walpole. MIST: PVM with transparent migration and checkpointing. In In 3rd Annual PVM Users’ Group Meeting, 1995. [43] Jeremy Casas, Dan Clark, Rabi Konuru, Steve Otto, Robert Prouty, and Jonathan Walpole. MPVM: A migration transparent version of PVM. Technical Report CSE-95-002, Oregon Graduate Institute School of Science and Engineering, February 1995. BIBLIOGRAPHY 128

[44] Miguel Castro, Rodrigo Rodrigues, and Barbara Liskov. BASE: Using abstraction to improve fault toler- ance. ACM Transactions Computer Systems, 21(3):236–269, 2003. [45] Vincent Cate and Thomas Gross. Combining the concepts of compression and caching for a two-level filesystem. SIGPLAN Notices, 26(4):200–211, 1991. [46] Sayantan Chakravorty and L. V. Kale. A fault tolerance protocol with fast fault recovery. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium. IEEE Press, 2007. [47] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996. [48] K. Mani Chandy and Leslie Lamport. Distributed snapshots: Determining global states of distributed systems. ACM Transactions Computer Systems, 3(1):63–75, 1985. [49] Yun Seok Chang, Sun Young Cho, and Bo Yeon Kim. Performance evaluation of the striped checkpointing algorithm on the distributed RAID for cluster computer. In Computational Science (ICCS 2003), volume 2658, pages 955–962, January 2003. [50] Kulathep Charoenpornwattana, Chokchai Leangsuksun, Geoffroy Vall´ee, Anand Tikotekar, and Stephen L. Scott. A neural networks approach for intelligent fault prediction in HPC environments. In HAPCW’08: High Availability and Performance Computing Workshop, Denver, Colorado, USA, 2008. [51] Yuqun Chen, James S. Plank, and Kai Li. CLIP: A checkpointing tool for message-passing parallel programs. In Supercomputing ’97: Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), pages 1–11, New York, NY, USA, 1997. ACM. [52] Zizhong Chen and Jack Dongarra. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19(12):1628–1641, 2008. [53] Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra. Building fault survivable MPI programs with FT-MPI using diskless checkpointing. Technical Report UT-CS-04-540, University of Tennessee, December 2004. [54] Zizhong Chen, Graham E. Fagg, Edgar Gabriel, Julien Langou, Thara Angskun, George Bosilca, and Jack Dongarra. Fault tolerant high performance computing by a coding approach. In PPoPP ’05: Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 213–223, New York, NY, USA, 2005. ACM Press. [55] Zizhong Chen, Ming Yang, Guillermo Francia, and Jack Dongarra. Self adaptive application level fault tolerance for parallel and distributed computing. IEEE International Parallel and Distributed Processing Symposium, pages 1–8, 2007. [56] Kwang-Sik Chung, Kibom Kim, Chong-Sun Hwang, Jin Gon Shon, and HeonChang Yu. Hybrid check- pointing protocol based on selective-sender-based message logging. In ICPADS ’97: Proceedings of the BIBLIOGRAPHY 129

1997 International Conference on Parallel and Distributed Systems, pages 788–793, Washington, DC, USA, 1997. IEEE Computer Society. [57] Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, New York, NY, USA, 2006. ACM. [58] James Cownie and William Gropp. A standard interface for debugger access to message queue infor- mation in MPI. In Proceedings of the 6th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 51–58, London, UK, 1999. Springer-Verlag. [59] Antonio Cunei and Jan Vitek. A new approach to real-time checkpointing. In VEE ’06: Proceedings of the 2nd International Conference on Virtual Execution Environments, pages 68–77, New York, NY, USA, 2006. ACM. [60] Bill Curtis. Fifteen years of psychology in software engineering: Individual differences and cognitive science. In ICSE ’84: Proceedings of the 7th international conference on Software engineering, pages 97– 106, Piscataway, NJ, USA, 1984. IEEE Press. [61] Paweł Czarnul and Marcin Fraczak. New user-guided and ckpt-based checkpointing libraries for parallel MPI applications. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 3666:351– 358, September 2005. [62] Pawel Czarnul, Arkadiusz Urbaniak, Marcin Fraczak, Maciej Dyczkowski, and Bartlomiej Balcerek. To- wards easy-to-use checkpointing of MPI applications within CLUSTERIX. In PARELEC ’04: Proceedings of the International Conference on Parallel Computing in Electrical Engineering, pages 390–393, Washington, DC, USA, 2004. IEEE Computer Society. [63] John T. Daly. A strategy for running large scale applications based on a model that optimizes the check- point interval for restart dumps. Software Engineering for High Performance Computing System (HPCS) Applications at the 26th International Conference on Software Engineering, pages 70–74, May 2004. [64] John T. Daly. Quantifying checkpoint efficiency. Technical Report LALP-07-041, Los Alamos National Laboratory, 2007. [65] Suresh K. Damodaran-Kamal and Joan M. Francioni. Nondeterminancy: Testing and debugging in mes- sage passing parallel programs. In PADD ’93: Proceedings of the 1993 ACM/ONR Workshop on Parallel and Distributed Debugging, pages 118–128, New York, NY, USA, 1993. ACM. [66] Pei Dan, Wang Dongsheng, Zhang Youhui, and Shen Meiming. Quasi-asynchronous migration: A novel migration protocol for PVM tasks. SIGOPS Operating Systems Review, 33(2):5–14, 1999. [67] Raphael Y. de Camargo, Fabio Kon, and Renato Cerqueira. Strategies for checkpoint storage on oppor- tunistic grids. IEEE Distributed Systems Online, 7(9), 2006. BIBLIOGRAPHY 130

[68] Jacques Chassin de Kergommeaux, Michiel Ronsse, and Koenraad De Bosschere. MPL*: Efficient record/- play of nondeterministic features of message passing libraries. In Proceedings of the 6th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing In- terface, pages 141–148, London, UK, 1999. Springer-Verlag. [69] Alan Demers, Dan Greene, Carl Houser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic algorithms for replicated database maintenance. SIGOPS Operating Systems Review, 22(1):8–32, 1988. [70] William R. Dieter and James E. Lumpp, Jr. User-level checkpointing for LinuxThreads programs. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, pages 81–92, Berkeley, CA, USA, 2001. USENIX Association. [71] William R. Dieter and James E. Lumpp Jr. A user-level checkpointing library for POSIX threads pro- grams. In FTCS ’99: Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Washington, DC, USA, 1999. IEEE Computer Society. [72] Danny Dolev, Cynthia Dwork, and Larry Stockmeyer. On the minimal synchronism needed for distributed consensus. Journal of the ACM, 34(1):77–97, 1987. [73] Xiangyu Dong, Naveen Muralimanohar, Norm Jouppi, Richard Kaufmann, and Yuan Xie. Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems. In SC ’09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pages 1–12, New York, NY, USA, 2009. ACM. [74] Xiangyu Dong, Yuan Xie, Naveen Muralimanohar, and Norman P. Jouppi. A case study of incremental and background hybrid in-memory checkpointing. The Exascale Evaluation and Research Techniques Workshop in conjunction with 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010. [75] Yunfei Du, Panfeng Wang, Hongyi Fu, Jia Jia, Haifang Zhou, and Xuejun Yang. Building single fault survivable parallel algorithms for matrix operations using redundant parallel computation. International Conference on Computer and Information Technology, pages 285–290, 2007. [76] Angelo Duarte, Delores Rexachs, and Emilio Luque. An intelligent management of fault tolerance in clus- ter using RADICMPI. Recent Advances in Parallel Virtual Machine and Message Passing Interface, 4192:150– 157, 2006. [77] Andrzej Duda. The effects of checkpointing on program execution time. Information Processing Letters, 16:221–229, June 1983. [78] Jason Duell, Paul Hargrove, and Eric Roman. The design and implementation of Berkeley Lab’s Linux Checkpoint/Restart. Technical Report LBNL-54941, Lawrence Berkeley National Laboratory, December 2002. BIBLIOGRAPHY 131

[79] Jason Duell, Paul Hargrove, and Eric Roman. Requirements for Linux checkpoint/restart. Technical Re- port LBNL-49659, Lawrence Berkeley National Laboratory, 2002. [80] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323, 1988. [81] Elmootazbellah N. Elnozahy. How safe is probabilistic checkpointing? In FTCS ’98: Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Washington, DC, USA, 1998. IEEE Computer Society. [82] Elmootazbellah N. Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A survey of rollback- recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375–408, 2002. [83] Elmootazbellah N. Elnozahy, David B. Johnson, and Willy Zwaenpoel. The performance of consistent checkpointing. In Proceedings of the 11th IEEE Symposium on Reliable Distributed Systems, pages 39–47, Houston, Texas, October 1992. [84] Elmootazbellah N. Elnozahy and James S. Member-Plank. Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable Secure Computing, 1(2):97–108, 2004. [85] Elmootazbellah N. Elnozahy and Willy Zwaenepoel. Manetho: Transparent rollback-recovery with low overhead, limited rollback and fast output commit. IEEE Transactions on Computers, 41(5):526–531, May 1992. [86] Elmootazbellah N. Elnozahy and Willy Zwaenepoel. Replicated distributed processes in Manetho. In Proceedings of the Twenty Second Annual International Symposium on Fault-Tolerant Computing, FTCS-22, pages 18–27, July 1992. [87] Jakob Engblom. Checkpointing. Technical report, VirtuTech, July 2009. [88] Christian Engelmann and Al Geist. Super-scalable algorithms for computing on 100,000 processors. In Proceedings of International Conference on Computational Science (ICCS), volume 3514, pages 313–320, May 2005. [89] Christian Engelmann and Stephen L. Scott. High availability for ultra-scale high-end scientific computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, 2005. [90] Christian Engelmann, Geoffroy R. Vallee, Thomas Naughton, and Stephen L. Scott. Proactive fault tol- erance using preemptive migration. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, pages 252–257, 2009. [91] M. Rasit Eskicioglu. Design issues of process migration facilities in distributed systems. In and Load Balancing in Parallel and Distributed Systems, pages 414–424, 1995. BIBLIOGRAPHY 132

[92] Graham E. Fagg, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, and Jack J. Dongarra. Scalable fault tolerant MPI: Extending the recovery algorithm. In Proceedings of Recent Advances in Parallel Vir- tual Machine and Messaging Passing Interface Users’ Group Meeting Euro PVMMPI, pages 67–75. Springer Heidelberg, Lecture Notes in Computer Science, 2005. [93] Graham E. Fagg, Edgar Gabriel, Zizhong Chen, Thara Angskun, George Bosilca, Jelena Pjesivac-Grbovic, and Jack J. Dongarra. Process fault-tolerance: Semantics, design and applications for high performance computing. International Journal for High Performance Applications and Supercomputing, 19(4):465–478, 2005. [94] Michael Feldman. The other exascale challenge, June 2010. http://www.hpcwire.com/blogs/ The-Other-Exascale-Challenge-96104409.html. [95] Stuart I. Feldman and Channing B. Brown. IGOR: A system for program debugging via reversible ex- ecution. In PADD ’88: Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 112–123, New York, NY, USA, 1988. ACM Press. [96] Rohit Fernandes, Keshav Pingali, and Paul Stodghill. Mobile MPI programs in computational grids. ACM 2006 Symposium on Principles and Practice of Parallel Programming, March 2006. [97] Kurt Ferreira, Rolf Riesen, Ron Oldfield, Jon Stearley, James Laros, Kevin Pedretti, Todd Kordenbrock, and Ron Brightwell. Increasing fault resiliency in a message-passing environment. Technical Report SAND2009-6753, Sandia National Laboratories, October 2009. [98] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, 1985. [99] Mark S. Foster and Joseph N. Wilson. Pursuing the three AP’s to checkpointing with UCLiK. In 10th International Linux System Technology Conference, 2003. [100] Song Fu and Cheng-Zhong Xu. Exploring event correlation for failure prediction in coalitions of clusters. In SC ’07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM. [101] Future Technologies Group. Berkeley Lab Checkpoint/Restart (BLCR). http://ftg.lbl.gov/ checkpoint/. [102] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Proceedings of the 11th European PVM/MPI Users’ Group Meeting, pages 97–104, Budapest, Hungary, September 2004. BIBLIOGRAPHY 133

[103] Qi Gao, Wei Huang, M. Koop, and Dhabaleswar K Panda. Group-based coordinated checkpointing for MPI: A case study on InfiniBand. In ICPP’06: Proceedings of the 35th International Conference on Parallel Processing, January 2007. [104] Qi Gao, Weikuan Yu, Wei Huang, and Dhabaleswar K Panda. Application-transparent checkpoint/restart for MPI programs over InfiniBand. International Conference on Parallel Processing, pages 471–478, Jan 2006. [105] Wen Gao, Mingyu Chen, and Takashi Nanya. A faster checkpointing and recovery algorithm with a hierarchical storage approach. In HPCASIA ’05: Proceedings of the Eighth International Conference on High-Performance Computing in Asia-Pacific Region, Washington, DC, USA, 2005. IEEE Computer Society. [106] Rahul Garg, Vijay K. Garg, and Yogish Sabharwal. Scalable algorithms for global snapshots in distributed systems. In ICS ’06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 269–277, New York, NY, USA, 2006. ACM. [107] Al Geist and Christian Engelmann. Development of naturally fault tolerant algortihms for computing on 100,000 processors. Journal of Parallel and Distributed Computing, 2002. [108] Erol Gelenbe. A model of roll-back recovery with multiple checkpoints. In ICSE ’76: Proceedings of the 2nd International Conference on Software Engineering, pages 251–255, Los Alamitos, CA, USA, 1976. IEEE Computer Society Press. [109] Erol Gelenbe. On the optimum checkpoint interval. Journal of the ACM, 26(2):259–270, 1979. [110] Stephane Genaud and Choopan Rattanapoka. P2P-MPI: A peer-to-peer framework for robust execution of message passing parallel programs on grids. Journal of Grid Computing, 5(1):27–42, 2007. [111] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Operating Systems Review, 37(5):29–43, 2003. [112] David K. Gifford. Weighted voting for replicated data. In SOSP ’79: Proceedings of the 7th ACM Symposium on Operating Systems Principles, pages 150–162, New York, NY, USA, 1979. ACM. [113] Roberto Gioiosa, Jose Carlos Sancho, Song Jiang, and Fabrizio Petrini. Transparent, incremental check- pointing at kernel level: A foundation for fault tolerance for parallel computers. In SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2005. IEEE Computer Society. [114] Patricia Gonzalez. MPI Ckpt: Checkpoint and rollback-recovery library. Technical Report TR-001-2004, University of A Coruna, 2004. [115] Christopher L. Gottbrath, Brian Barrett, Bill Gropp, Ewing “Rusty” Lusk, and Jeff Squyres. An interface to support the identification of dynamic MPI 2 processes for scalable parallel debugging. In Proceedings of the 13th European PVM/MPI Users’ Group Meeting, Lecture Notes in Computer Science, Bonn, Germany, September 2006. Springer-Verlag. BIBLIOGRAPHY 134

[116] J. D. Gould. Some psychological evidence on how people debug computer programs. International Jour- nal of Man-Machine Studies, 7(2):151–182, March 1975. [117] Jim Gray, Pat Helland, Patrick O’Neil, and Dennis Shasha. The dangers of replication and a solution. In SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 173–182, New York, NY, USA, 1996. ACM. [118] T. R. G. Green, M. E. Sime, and M. J. Fitter. The problems the programmer faces. Ergonomics, 23(9):893– 907, September 1980. [119] William Gropp and Ewing Lusk. Fault tolerance in message passing interface programs. International Journal of High Performance Computing Applications, 18(3):363–372, 2004. [120] Rachid Guerraoui and Andr´e Schiper. Software-based replication for fault tolerance. Computer, 30(4):68–74, 1997. [121] Indranil Gupta, Tushar D. Chandra, and German S. Goldszmidt. On scalable and efficient distributed failure detectors. In PODC ’01: Proceedings of the 20th annual ACM Symposium on Principles of Distributed Computing, pages 170–179, New York, NY, USA, 2001. ACM Press. [122] R. Gupta, P. Beckman, H. Park, E. Lusk, P. Hargrove, A. Geist, D. K. Panda, A. Lumsdaine, and J. Don- garra. CIFTS: A coordinated infrastructure for fault-tolerant systems. International Conference on Parallel Processing (ICPP), 2009. [123] Richard G. Guy, John S. Heidemann, Wai Mak, Thomas W. Page, Jr., Gerald J. Popek, and Dieter Roth- meier. Implementation of the Ficus replicated file system. In USENIX Conference Proceedings, pages 63– 71, Anaheim, CA, June 1990. USENIX. [124] Thomas J. Hacker, Fabian Romero, and Christopher D. Carothers. An analysis of clustered failures on large supercomputing systems. Journal of Parallel and Distributed Computing, 69(7):652–665, 2009. [125] Jean-Michel Helary, Achour Mostefaoui, and Michel Raynal. Preventing useless checkpoints in dis- tributed computations. In SRDS ’97: Proceedings of the 16th Symposium on Reliable Distributed Systems, Washington, DC, USA, 1997. IEEE Computer Society. [126] Jean-Michel Helary, Robert H. B. Netzer, and Michel Raynal. Consistency issues in distributed check- points. IEEE Transactions on Software Engineering, 25(2):274–281, 1999. [127] Shang-Te Hsu and Ruei-Chuan Chang. Continuous checkpointing: Joining the checkpointing with virtual memory paging. Software: Practice and Experience, 27(9):1103–1120, 1997. [128] Chao Huang. System support for checkpoint and restart of Charm++ and AMPI applications. Master’s thesis, Dept. of Computer Science, University of Illinois, 2004. [129] Chao Huang, Gengbin Zheng, and Laxmikant V. Kale. Supporting adaptivity in MPI for dynamic par- allel applications. Technical Report 07-08, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 2007. BIBLIOGRAPHY 135

[130] Chao Huang, Gengbin Zheng, Sameer Kumar, and Laxmikant V. Kal´e. Performance evaluation of Adap- tive MPI. In Proceedings of ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2006, March 2006. [131] Kuang-Hua Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans- actions on Computers, 33(6):518–528, 1984. [132] Yennun Huang, P. Emerald Chung, Chandra Kintala, Chung-Yih Wang, and De-Ron Liang. NT-SwiFT: Software implemented fault tolerance on Windows NT. In WINSYM’98: Proceedings of the 2nd Conference on USENIX Windows NT Symposium, Berkeley, CA, USA, 1998. USENIX Association. [133] Joshua Hursey, Chris January, Mark O’Connor, Paul H. Hargrove, David Lecomber, Jeffrey M. Squyres, and Andrew Lumsdaine. Checkpoint/restart-enabled parallel debugging. Proceedings of the European MPI Users Group Conference (EuroMPI), September 2010. [134] Joshua Hursey and Andrew Lumsdaine. A composable runtime recovery policy framework supporting resilient HPC applications. Under submission, 2010. [135] Joshua Hursey, Timothy I. Mattox, and Andrew Lumsdaine. Interconnect agnostic checkpoint/restart in Open MPI. In HPDC ’09: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, pages 49–58, New York, NY, USA, 2009. ACM. [136] Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The design and imple- mentation of checkpoint/restart process fault tolerance for Open MPI. In Proceedings of the 21st IEEE In- ternational Parallel and Distributed Processing Symposium (IPDPS). IEEE Computer Society, March 2007. [137] Kai Hwang, Hai Jin, Edward Chow, Cho-Li Wang, and Zhiwei Xu. Designing SSI clusters with hierarchical checkpointing and single I/O space. IEEE Concurrency, 7(1):60–69, 1999. [138] InfiniBand Trade Association. InfiniBand. http://www.infinibandta.org. [139] K. A. Iskra, F. van der Linden, Z. W. Hendrikse, B. J. Overeinder, G. D. van Albada, and P. M. A. Sloot. The implementation of dynamite: An environment for migrating PVM tasks. SIGOPS Operating Systems Review, 34(3):40–55, 2000. [140] G. (John) Janakiraman, Jose Renato Santos, and Dinesh Subhraveti. Cruz: Application-transparent dis- tributed checkpoint-restart on standard operating systems. In DSN ’05: Proceedings of the 2005 Inter- national Conference on Dependable Systems and Networks, pages 260–269, Washington, DC, USA, 2005. IEEE Computer Society. [141] David P. Jasper. A discussion of checkpoint/restart. Software Age, pages 9–14, October 1969. [142] Hai Jiang and Vipin Chaudhary. Process/thread migration and checkpointing in heterogeneous dis- tributed systems. Hawaii International Conference on System Sciences, 9, 2004. BIBLIOGRAPHY 136

[143] Hai Jin and Kai Hwang. Distributed checkpointing on clusters with dynamic striping and staggering. In ASIAN ’02: Proceedings of the7th Asian Computing Science Conference on Advances in Computing Science, pages 19–33, London, UK, 2002. Springer-Verlag. [144] Hideyuki Jitsumoto, Toshio Endo, and Satoshi Matsuoka. ABARIS: An adaptable fault detection/recovery component framework for MPIs. International Parallel and Distributed Processing Symposium, 2007. [145] David B. Johnson and Willy Zwaenepoel. Recovery in distributed systems using optimistic message log- ging and check-pointing. Journal of Algorithms, 11(3):462–491, 1990. [146] P. W. Jones, P. H. Worley, Y. Yoshida, J. B. White, III, and J. Levesque. Practical performance portabil- ity in the parallel ocean program (POP): Research articles. Concurrency and Computation: Practice & Experience, 17(10):1317–1327, 2005. [147] Hyungsoo Jung, Dongin Shin, Hyuck Han, Jai W Kim, Heon Y Yeom, and Jongsuk Lee. Design and imple- mentation of multiple fault-tolerant MPI over Myrinet (M3). SC ’05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, November 2005. [148] F. Karablieh, R. Bazzi, and M. Hicks. Compiler-assisted heterogeneous checkpointing. IEEE Symposium on Reliable Distributed Systems, 2001. [149] Youngbae Kim, James S. Plank, and Jack J. Dongarra. Fault tolerant matrix operations for networks of workstations using multiple checkpointing. International Conference on High Performance Computing and Grid in Asia Pacific Region, pages 460–465, 1997. [150] Samuel T. King, George W. Dunlap, and Peter M. Chen. Debugging operating systems with time-traveling virtual machines. In ATEC’05: Proceedings of the USENIX Annual Technical Conference 2005 on USENIX Annual Technical Conference, Berkeley, CA, USA, 2005. USENIX Association. [151] Richard Koo and Sam Toueg. Checkpointing and rollback-recovery for distributed systems. In ACM ’86: Proceedings of 1986 ACM Fall Joint Computer Conference, pages 1150–1158, Los Alamitos, CA, USA, 1986. IEEE Computer Society Press. [152] Dieter Kranzlmuller, Nam Thoai, and Jens Volkert. Debugging large-scale, long-running parallel pro- grams. In ICCS ’02: Proceedings of the International Conference on Computational Science-Part II, pages 913–922, London, UK, 2002. Springer-Verlag. [153] YongChul Kwon, Magdalena Balazinska, and Albert Greenberg. Fault-tolerant stream processing using a distributed, replicated file system. Proceedings of the VLDB Endowment, 1(1):574–585, 2008. [154] O. Laadan, D. Phung, and J. Nieh. Transparent checkpoint-restart of distributed applications on com- modity clusters. IEEE International Conference on Cluster Computing, pages 1–13, 2005. [155] Oren Laadan and Jason Nieh. Transparent checkpoint-restart of multiple processes on commodity oper- ating systems. In ATC’07: 2007 USENIX Annual Technical Conference on Proceedings of the USENIX Annual Technical Conference, pages 1–14, Berkeley, CA, USA, 2007. USENIX Association. BIBLIOGRAPHY 137

[156] H. Lamehamedi, B. Szymanski, Z. Shentu, and E. Deelman. Data replication strategies in grid environ- ments. In ICA3PP ’02: Proceedings of the Fifth International Conference on Algorithms and Architectures for Parallel Processing, Washington, DC, USA, 2002. IEEE Computer Society. [157] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978. [158] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978. [159] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982. [160] J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iterative methods in a parallel unstable environment. SIAM Journal of Scientific Computing, 30(1):102–116, 2007. [161] T.J. Leblanc and J.M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Transac- tions on Computers, 36(4):471–482, 1987. [162] Troy LeBlanc, Rakhi Anand, Edgar Gabriel, and Jaspal Subhlok. VolpexMPI: an MPI library for execution of parallel applications on volatile nodes. Recent Advances in Parallel Virtual Machine and Message Passing Interface, September 2009. [163] P. Lemarinier, A. Bouteiller, T. Herault, G. Krawezik, and F. Cappello. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In CLUSTER ’04: Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 115–124, Washington, DC, USA, 2004. IEEE Computer Society. [164] Claudia Leopold and Michael Sub. Observations on MPI-2 support for hybrid master/slave applications in dynamic and heterogeneous environments. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 4192, pages 285–292, September 2006. [165] C. C. Li, S. K. Chen, W. K. Fuchs, and W. W. Hwu. Compiler-assisted multiple instruction retry. IEEE Transactions on Computers, 44(1), January 1995. [166] Chung-Chi Jim Li, Elliot M. Stewart, and W. Kent Fuchs. Compiler-assisted full checkpointing. Software: Practice and Experience, 24(10):871–886, 1994. [167] Wei-Jih Li and Jyh-Jong Tsay. Checkpointing message-passing interface (MPI) parallel programs. In PRFTS ’97: Proceedings of the 1997 Pacific Rim International Symposium on Fault-Tolerant Systems, Wash- ington, DC, USA, 1997. IEEE Computer Society. [168] Yawei Li and Zhiling Lan. Proactive fault manager for high performance computing. In Proceedings of the International Conference on Dependable Systems and Networks (Fast Abstract), Yokohama, Japan, 2005. [169] Michael Litzkow and Marvin Solomon. Supporting checkpointing and process migration outside the UNIX kernel. Mobility: Processes, Computers, and Agents, pages 154–162, 1999. BIBLIOGRAPHY 138

[170] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report 1346, University of Wisconsin- Madison, 1997. [171] Hatem Ltaief, Edgar Gabriel, and Marc Garbey. Fault tolerant algorithms for heat transfer problems. Journal of Parallel and Distributed Computing, 68(5):663–677, 2008. [172] Gerald Q. Maguire, Jr. and Jonathan M. Smith. Process migration: Effects on scientific computation. SIGPLAN Notices, 23(3):102–106, 1988. [173] D. Manivannan, Robert H. B. Netzer, and Mukesh Singhal. Finding consistent global checkpoints in a distributed computation. IEEE Transactions on Parallel and Distributed Systems, 8(6):623–627, 1997. [174] D. Manivannan and Mukesh Singhal. Quasi-synchronous checkpointing: Models, characterization, and classification. IEEE Transactions on Parallel and Distributed Systems, 10(7):703–713, 1999. [175] Daniel Marques, Greg Bronevetsky, Rohit Fernandes, Keshav Pingali, and Paul Stodghil. Optimizing checkpoint sizes in the C3 system. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium - Workshop 10, Washington, DC, USA, 2005. IEEE Computer Society. [176] M. Maruyama, T. Tsumura, and H. Nakashima. Parallel program debugging based on data-replay. In Proceedings of the Parallel and Distributed Computing and Systems (PDCS), Phoenix, AZ, November 2005. [177] John Mehnert-Spahn, Thomas Ropars, Michael Schoettner, and Christine Morin. The architecture of the XtreemOS grid checkpointing service. Euro-Par 2009, 2009. [178] Message Passing Interface Forum. MPI: A Message Passing Interface. In Proceedings of Supercomputing ’93, pages 878–883. IEEE Computer Society Press, November 1993. [179] Barton P. Miller and Jong-Deok Choi. A mechanism for efficient debugging of parallel programs. SIGPLAN Notices, 24(1):141–150, 1989. [180] Subhasish Mitra, Norbert Seifert, Ming Zhang, Quan Shi, and Kee Sup Kim. Robust system design with built-in soft-error resilience. Computer, 38:43–52, 2005. [181] J. C. Mourino, M. J. Martin, P. Gonzalez, and R. Doallo. Fault-tolerant solutions for a MPI compute intensive application. In PDP ’07: Proceedings of the 15th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, pages 246–253, Washington, DC, USA, 2007. IEEE Computer Society. [182] MPI Forum. MPI 3.0 Fault Tolerance Working Group. http://meetings.mpi-forum.org/mpi3.0 ft. php. [183] Kiran-Kumar Muniswamy-Reddy, Charles P. Wright, Andrew Himmer, and Erez Zadok. A versatile and user-oriented versioning file system. In FAST ’04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies, pages 115–128, Berkeley, CA, USA, 2004. USENIX Association. BIBLIOGRAPHY 139

[184] Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive fault toler- ance for HPC with Xen virtualization. In ICS ’07: Proceedings of the 21st annual international conference on Supercomputing, pages 23–32, New York, NY, USA, 2007. ACM. [185] Hiroshi Nakamura, Takuro Hayashida, Masaaki Kondo, Yuya Tajima, Masashi Imai, and Takashi Nanya. Skewed checkpointing for tolerating multi-node failures. In SRDS ’04: Proceedings of the 23rd IEEE Inter- national Symposium on Reliable Distributed Systems, pages 116–125, Washington, DC, USA, 2004. IEEE Computer Society. [186] Hyo-chang Nam, Jong Kim, SungJe Hong, and Sunggu Lee. Probabilistic checkpointing. In Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS ’97), Washington, DC, USA, 1997. IEEE Computer Society. [187] National Aeronautics and Space Administration. NAS parallel benchmarks. http://www.nas.nasa.gov/ Resources/Software/npb.html. [188] Robert H. B. Netzer and Barton P. Miller. Optimal tracing and replay for debugging message-passing parallel programs. Journal of Supercomputing, 8(4):371–388, 1995. [189] Robert H. B. Netzer, Sairam Subramanian, and Jian Xu. Critical-path-based message logging for incre- mental replay of message-passing programs. Technical Report CS-94-19, Brown University, Providence, RI, USA, 1994. [190] Robert H. B. Netzer and Jian Xu. Adaptive message logging for incremental program replay. IEEE Parallel Distributed Technology, 1(4):32–39, 1993. [191] Robert H. B. Netzer and Jian Xu. Necessary and sufficient conditions for consistent global snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165–169, 1995. [192] Robert H. B. Netzer and Yikang Xu. Replaying distributed programs without message logging. In HPDC ’97: Proceedings of the 6th IEEE International Symposium on High Performance Distributed Computing, Washington, DC, USA, 1997. IEEE Computer Society. [193] N. Neves and W. K. Fuchs. RENEW: A tool for fast and efficient implementation of checkpoint protocols. In FTCS ’98: Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Washington, DC, USA, 1998. IEEE Computer Society. [194] Anh Nguyen-Tuong and Andrew S. Grimshaw. Using reflection for incorporating fault-tolerance tech- niques into distributed applications. Technical Report CS-98-34, University of Virginia, Charlottesville, VA, USA, 1998. [195] Ron Oldfield. Investigating lightweight storage and overlay networks for fault tolerance. High Availability and Performance Computing Workshop 2006 (HAPCW06), October 2006. BIBLIOGRAPHY 140

[196] Ron A. Oldfield, Sarala Arunagiri, Patricia J. Teller, Seetharami Seelam, Maria Ruiz Varela, Rolf Riesen, and Philip C. Roth. Modeling the impact of checkpoints on next-generation systems. In MSST ’07: Pro- ceedings of the 24th IEEE Conference on Mass Storage Systems and Technologies, pages 30–46, Washington, DC, USA, 2007. IEEE Computer Society. [197] Adam J. Oliner, Larry Rudolph, and Ramendra K. Sahoo. Cooperative checkpointing: A robust approach to large-scale systems reliability. In ICS ’06: Proceedings of the 20th Annual International Conference on Supercomputing, pages 14–23, New York, NY, USA, 2006. ACM. [198] Adam J. Oliner, R. K. Sahoo, J. E. Moreira, and M. Gupta. Performance implications of periodic check- pointing on large-scale cluster systems. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium - Workshop 18, Washington, DC, USA, 2005. IEEE Computer Soci- ety. [199] Adam J. Oliner and Jon Stearley. What say: A study of five system logs. In DSN ’07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 575–584, Washington, DC, USA, 2007. IEEE Computer Society. [200] OpenFabrics Alliance. OpenFabrics. http://www.openfabrics.org/. [201] Douglas Z. Pan and Mark A. Linton. Supporting reverse execution for parallel programs. In PADD ’88: Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 124–129, New York, NY, USA, 1988. ACM Press. [202] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant arrays of inexpensive disks (RAID). In SIGMOD ’88: Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, pages 109–116, New York, NY, USA, 1988. ACM. [203] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. HPL: A portable implementation of the High- Performance Linpack benchmark for distributed-memory computers, September 2008. [204] Fabrizio Petrini, Jarek Nieplocha, and Vinod Tipparaju. SFT: Scalable fault tolerance. SIGOPS Operating Systems Review, 40(2):55–62, 2006. [205] J. S. Plank. Improving the performance of coordinated checkpointers on networks of workstations us- ing RAID techniques. In SRDS ’96: Proceedings of the 15th Symposium on Reliable Distributed Systems, Washington, DC, USA, 1996. IEEE Computer Society. [206] J. S. Plank and W. R. Elwasif. Experimental assessment of workstation failures and their impact on check- pointing systems. In FTCS ’98: Proceedings of the The Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Washington, DC, USA, 1998. IEEE Computer Society. [207] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972–986, October 1998. BIBLIOGRAPHY 141

[208] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent checkpointing under Unix. In Usenix Winter Technical Conference, pages 213–223, January 1995. [209] James S. Plank, Yuqun Chen, Kai Li, Micah Beck, and Gerry Kingsley. Memory exclusion: Optimizing the performance of checkpointing systems. In Software – Practice and Experience, volume 29, pages 125–142, 1999. [210] James S. Plank and Kai Li. ickp: A consistent checkpointer for multicomputers. IEEE Concurrency, 2(2):62–67, 1994. [211] Steve J. Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117:1–19, 1995. [212] Dhiraj K. Pradhan and Nitin H. Vaidya. Roll-forward and rollback recovery: Performance-reliability trade- off. IEEE Transactions on Computers, 46(3):372–378, 1997. [213] D.K. Pradhan and N.H Vaidya. Roll-forward checkpointing scheme: A novel fault-tolerant architecture. IEEE Transactions on Computers, 43(10):1163–1174, 1994. [214] Jim Pruyne and Miron Livny. Managing checkpoints for parallel programs. In IPPS ’96: Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, pages 140–154, London, UK, 1996. Springer-Verlag. [215] Milos Prvulovic, Zheng Zhang, and Josep Torrellas. ReVive: Cost-effective architectural support for roll- back recovery in shared-memory multiprocessors. SIGARCH Computer Architecture News, 30(2):111–122, 2002. [216] C. Pu, J.D. Noe, and A. Proudfoot. Regeneration of replicated objects: A technique and its Eden imple- mentation. IEEE Transactions on Software Engineering, 14(7):936–945, 1988. [217] Sasikumar Punnekkat, Alan Burns, and Robert Davis. Analysis of checkpointing for real-time systems. Real-Time Systems, 20(1):83–102, 2001. [218] Francesco Quaglia. Combining periodic and probabilistic checkpointing in optimistic simulation. In PADS ’99: Proceedings of the 13th Workshop on Parallel and Distributed Simulation, pages 109–116, Washing- ton, DC, USA, 1999. IEEE Computer Society. [219] B. Randell. System structure for software fault tolerance. In Proceedings of the International Conference on Reliable Software, pages 437–449, New York, NY, USA, 1975. ACM Press. [220] Sridharan Ranganathan, Alan D. George, Robert W. Todd, and Matthew C. Chidester. Gossip-style failure detection and distributed consensus for scalable heterogeneous clusters. Cluster Computing, 4(3):197– 209, 2001. [221] Sriram Rao, Lorenzo Alvisi, and Harrick M. Vin. Egida: An extensible toolkit for low-overhead fault- tolerance. In FTCS ’99: Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Washington, DC, USA, 1999. IEEE Computer Society. BIBLIOGRAPHY 142

[222] Sriram Rao, Lorenzo Alvisi, and Harrick M. Vin. The cost of recovery in message logging protocols. IEEE Transactions on Knowledge and Data Engineering, 12(2):160–173, 2000. [223] Daniel A. Reed, Charng da Lu, and Celso L. Mendes. Reliability challenges in large systems. In Future Generation Computer Systems, volume 22, pages 293–302, February 2006. [224] Michael Richmond and Michael Hitchens. A new process migration algorithm. SIGOPS Operating Systems Review, 31(1):31–42, 1997. [225] Pierre Riteau, Adrien Lebre, and Christine Morin. Handling persistent states in process check- point/restart mechanisms for HPC systems. IEEE International Symposium on Cluster Computing and the Grid, May 2009. [226] G. Rodriguez, M.J. Martin, P. Gonzalez, J. Tourino, and R. Doallo. CPPC: A compiler-assisted tool for portable checkpointing of message-passing applications. Proceedings of the International Workshop on Scalable Tools for High-End Computing (STHEC 2008), pages 1–12, June 2008. [227] M.A. Ronsse and D.A. Kranzlmuller.¨ RoltMP-replay of Lamport timestamps for message passing systems. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, 1998. [228] Michiel Ronsse, Koen De Bosschere, and Jacques Chassin de Kergommeaux. Execution replay and debug- ging. In Proceedings of the Fourth International Workshop on Automated Debugging (AADebug), Munich, Germany, August 2000. [229] Michiel Ronsse, Mark Christiaens, and Koen De Bosschere. Debugging shared memory parallel programs using record/replay. Future Generation Computer Systems, 19(5):679–687, 2003. [230] Joseph Ruscio, Michael Heffner, and Srinidhi Varadarajan. DejaVu: Transparent user-level checkpointing, migration and recovery for distributed systems. SC ’06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, November 2006. [231] Sang-Moon Ryu and Dong-Jo Park. Checkpointing for the reliability of real-time systems with on-line fault detection. Embedded and Ubiquitous Computing, 3824, 2005. [232] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Sivasubramaniam. Critical event prediction for proactive management in large-scale computer clusters. In KDD ’03: Pro- ceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 426–435, New York, NY, USA, 2003. ACM. [233] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Checkpoint-restart sup- port system services interface (SSI) modules for LAM/MPI. Technical Report TR578, Indiana University, Computer Science Department, 2003. [234] Sriram Sankaran, Jeffrey M. Squyres, Brian Barrett, Andrew Lumsdaine, Jason Duell, Paul Hargrove, and Eric Roman. The LAM/MPI checkpoint/restart framework: System-initiated checkpointing. International Journal of High Performance Computing Applications, 19(4):479–493, Winter 2005. BIBLIOGRAPHY 143

[235] D.P. Scarpazza, P. Mullaney, O. Villa, F. Petrini, V. Tipparaju, D.M.L. Brown, and J. Nieplocha. Transparent system-level migration of PGAS applications using Xen on InfiniBand. IEEE International Conference on Cluster Computing, pages 74–83, 2007. [236] Bianca Schroeder and Garth A Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78, 2007. [237] A. David Selvakumar, P. M. Sobha, G. C. Ravindra, and R. Pitchiah. Design, implementation and per- formance of fault-tolerant message passing interface (MPI). In Proceedings of the Seventh International Conference on High Performance Computing and Grid in Asia Pacific Region (HPCASIA ’04), pages 120–129, Washington, DC, USA, 2004. IEEE Computer Society. [238] L. M. Silva, J. G. Silva, S. Chapple, and L. Clarke. Portable checkpointing and recovery. In Proceedings of the Fourth IEEE International Symposium on High Performance Distributed Computing (HPDC ’95), Washington, DC, USA, 1995. IEEE Computer Society. [239] L.M. Silva and J.G. Silva. System-level versus user-defined checkpointing. IEEE Symposium on Reliable Distributed Systems, 1998. [240] Luis Moura Silva and Joao Gabriel Silva. The performance of coordinated and independent checkpoint- ing. In IPPS ’99/SPDP ’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 280–284, Washington, DC, USA, 1999. IEEE Computer Society. [241] Jonathan M. Smith. A survey of process migration mechanisms. SIGOPS Operating Systems Review, 22(3):28–40, 1988. [242] Peter Sobe. Stable checkpointing in distributed systems without shared disks. In IPDPS ’03: Proceedings of the 17th International Symposium on Parallel and Distributed Processing, Washington, DC, USA, 2003. IEEE Computer Society. [243] Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill, and David A. Wood. SafetyNet: Improving the availabil- ity of shared memory multiprocessors with global checkpoint/recovery. SIGARCH Computer Architecture News, 30(2):123–134, 2002. [244] Carlos Sosa. IBM system Blue Gene solution: Blue Gene/P application development. Technical report, IBM, December 2008. [245] Jeffrey M. Squyres and Andrew Lumsdaine. The component architecture of Open MPI: Enabling third- party collective algorithms. In Proceedings of the 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, pages 167–185, St. Malo, France, July 2004. Springer. [246] Johny Srouji, Paul Schuster, maury Bach, and Yulik Kuzmin. A transparent checkpoint facility on NT. In Proceedings of the 2nd USENIX Windows NT Symposium. USENIX, August 1998. BIBLIOGRAPHY 144

[247] Kuo-Feng Ssu, Bin Yao, and W. Kent Fuchs. An adaptive checkpointing protocol to bound recovery time with message logging. IEEE Symposium on Reliable Distributed Systems, page 244, 1999. [248] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In 10th International Parallel Pro- cessing Symposium (IPPS ’96), 1996. [249] Michael Stonebraker and Gerhard A. Schloss. Distributed RAID - a new multiple copy algorithm. In Proceedings of the Sixth International Conference on Data Engineering, pages 430–437, Washington, DC, USA, 1990. IEEE Computer Society. [250] Rob Strom and Shaula Yemini. Optimistic recovery in distributed systems. ACM Transactions Computer Systems, 3(3):204–226, 1985. [251] Volker Strumpen. Portable and fault-tolerant software systems. IEEE Micro, 18(5):22–32, 1998. [252] Rajagopal Subramaniyan, Eric Grobelny, Scott Studham, and Alan D. George. Optimization of checkpointing-related I/O for high-performance parallel and distributed computing. Journal of Super- computing, 46(2):150–180, 2008. [253] Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi. CheCUDA: A check- point/restart tool for CUDA applications. In PDCAT ’09: Proceedings of the 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 408–413, Washington, DC, USA, 2009. IEEE Computer Society. [254] C. P. Tan, Weng-Fai Wong, and Chung-Kwong Yuen. tmPVM - task migratable PVM. In IPPS ’99/SPDP ’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 196–202, Washington, DC, USA, 1999. IEEE Computer Society. [255] Yuan Tang, Graham E. Fagg, and Jack J. Dongarra. Proposal of MPI operation level checkpoint/rollback and one implementation. In CCGRID ’06: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, pages 27–34, Washington, DC, USA, 2006. IEEE Computer Society. [256] Asser N. Tantawi and Manfred Ruschitzka. Performance analysis of checkpointing strategies. ACM Trans- actions Computer Systems, 2(2):123–144, 1984. [257] Marvin M. Theimer, Keith A. Lantz, and David R. Cheriton. Preemptable remote execution facilities for the V-system. SIGOPS Operating Systems Review, 19(5):2–12, 1985. [258] Nam Thoai, Dieter Kranzlmuller, and Jens Volkert. ROS: The rollback-one-step method to minimize the waiting time during debugging long-running parallel programs. High Performance Computing for Computational Science (VECPAR 2002), 2565:87–94, 2002. [259] TotalView Technologies. ReplayEngine, 2009. [260] Undo Ltd. UndoDB - Reversible debugging for Linux, 2009. [261] Sathish S. Vadhiyar and Jack J. Dongarra. A performance oriented migration framework for the grid. IEEE International Symposium on Cluster Computing and the Grid, 2003. BIBLIOGRAPHY 145

[262] Sathish S. Vadhiyar and Jack J. Dongarra. SRS - a framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters, 13(2):291–312, 2003. [263] Nitin H Vaidya. A case for two-level distributed recovery schemes. In SIGMETRICS ’95/PERFORMANCE ’95: Proceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Mod- eling of Computer Systems, pages 64–73, 1995. [264] Nitin H. Vaidya. On staggered checkpointing. In SPDP ’96: Proceedings of the 8th IEEE Symposium on Parallel and Distributed Processing (SPDP ’96), Washington, DC, USA, 1996. IEEE Computer Society. [265] Nitin H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Trans- actions on Computers, 46(8):942–947, 1997. [266] Nitin H Vaidya. Staggered consistent checkpointing. IEEE Transactions on Parallel and Distributed Systems, 10(7):694–702, 1999. [267] Geoffroy Vallee, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai Leangsuksun, Thomas Naughton, and Stephen L. Scott. A framework for proactive fault tolerance. Inter- national Conference on Availability, Reliability and Security, pages 659–664, 2008. [268] Geoffroy Vallee, Renaud Lottiaux, David Margery, and Christine Morin. Ghost process: a sound basis to implement process duplication, migration and checkpoint/restart in Linux clusters. International Sympo- sium on Parallel and Distributed Computing, pages 97–104, 2005. [269] Geoffroy Vallee, Christine Morin, Renaud Lottiaux, Jean-Yves Berthou, and Ivan Dutka Malen. Process migration based on Gobelins distributed shared memory. IEEE International Symposium on Cluster Com- puting and the Grid, 2002. [270] David Van Der Spoel, Eric Lindahl, Berk Hess, Gerrit Groenhof, Alan E. Mark, and Herman J. Berendsen. GROMACS: Fast, flexable, and free. Journal of Computational Chemistry, 26(16):1701–1718, 2005. [271] Robbert van Renesse, Yaron Minsky, and Mark Hayden. A gossip-based failure detection service. In Mid- dleware’98, IFIP International Conference on Distributed Systems Platforms and Open Distributed Process- ing, pages 55–70, England, September 1998. [272] VirtuTech. Reversible execution and debugging. Technical report, VirtuTech, June 2007. [273] John Paul Walters and Vipin Chaudhary. A fault-tolerant strategy for virtualized HPC clusters. Journal of Supercomputing, December 2008. [274] John Paul Walters and Vipin Chaudhary. Replication-based fault tolerance for MPI applications. IEEE Transactions on Parallel and Distributed Systems, 20(7):997–1010, 2009. [275] John Paul Walters, Vipin Chaudhary, Minsuk Cha, Salvatore Guercio Jr., and Steve Gallo. A comparison of virtualization technologies for HPC. In AINA ’08: Proceedings of the 22nd International Conference on Advanced Information Networking and Applications, pages 861–868, Washington, DC, USA, 2008. IEEE Computer Society. BIBLIOGRAPHY 146

[276] Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A job pause service under LAM/MPI+BLCR for transparent fault tolerance. International Parallel and Distributed Processing Sympo- sium, 2007. [277] Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive process-level live mi- gration in HPC environments. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press. [278] Yi-Min Wang. Consistent global checkpoints that contain a given set of local checkpoints. IEEE Transac- tions on Computers, 46(4):456–468, 1997. [279] Yi-Min Wang, Yennun Huang, Kiem-Phong Vo, Pe-Yu Chung, and C. Kintala. Checkpointing and its appli- cations. In FTCS ’95: Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Comput- ing, Washington, DC, USA, 1995. IEEE Computer Society. [280] Mark Weiser. Program slicing. In ICSE ’81: Proceedings of the 5th International Conference on Software Engineering, pages 439–449, Piscataway, NJ, USA, 1981. IEEE Press. [281] Larry Wittie. The Bugnet distributed debugging system. In EW 2: Proceedings of the 2nd Workshop on Making Distributed Systems Work, pages 1–3, New York, NY, USA, 1986. ACM. [282] Larry D. Wittie. Debugging distributed C programs by real time reply. In PADD ’88: Proceedings of the 1988 ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 57–67, New York, NY, USA, 1988. ACM. [283] Namyoon Woo, Soonho Choi, Hyungsoo Jung, Jungwhan Moon, Heon Y. Yeom, Taesoon Park, and Hyungwoo Park. MPICH-GF: Providing fault tolerance on grid environments. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2003), 2003. [284] Timothy S. Woodall, Galen M. Shipman, George Bosilca, Richard L. Graham, and Arthur B. Maccabe. High performance RDMA protocols in HPC. In Proceedings of EuroPVM-MPI 2006, volume 4192/2006 of Lecture Notes in Computer Science, pages 76–85. Springer berlin /Heidelberg, September 2006. [285] John W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530–531, 1974. [286] F. Zambonelli. On the effectiveness of distributed checkpoint algorithms for domino-free recovery. Inter- national Symposium on High-Performance Distributed Computing, 1998. [287] Franco Zambonelli. An efficient logging algorithm for incremental replay of message. In IPPS ’99/SPDP ’99: Proceedings of the 13th International Symposium on Parallel Processing and the 10th Symposium on Parallel and Distributed Processing, pages 392–398, Washington, DC, USA, 1999. IEEE Computer Society. [288] E. Zayas. Attacking the process migration bottleneck. SIGOPS Operating Systems Review, 21(5):13–24, 1987. [289] M. V. Zelkowitz. Reversible execution. Communications of the ACM, 16(9), 1973. BIBLIOGRAPHY 147

[290] Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, and Ramendra K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In Job Scheduling Strategies for Parallel Processing, volume 3277, pages 233–252, May 2004. [291] Gengbin Zheng, Chao Huang, and Laxmikant V. Kalé. Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. ACM SIGOPS Operating Systems Review: Operating and Runtime Systems for High-end Computing Systems, 40(2), April 2006. [292] Qilong Zheng, Guoliang Chen, and Liusheng Huang. Optimal record and replay for debugging of non-deterministic MPI/PVM programs. International Conference on High-Performance Computing in the Asia-Pacific Region, 1:473, 2000. [293] Hua Zhong and Jason Nieh. CRAK: Linux checkpoint/restart as a kernel module. Technical Report CUCS-014-01, Columbia University, November 2001. [294] J. F. Ziegler and W. A. Lanford. Effect of cosmic rays on computer memories. Science, 206(4420):776–788, November 1979. [295] Avi Ziv and Jehoshua Bruck. An on-line algorithm for checkpoint placement. IEEE Transactions on Computers, 46(9):976–985, 1997.

A Checkpoint/Restart Application Programming Interface

Chapter 5 presented a variety of non-MPI-standard Application Programming Interfaces (APIs) that are available to the application as part of Open MPI’s Extended Interfaces. This appendix provides more details and examples regarding these APIs.

1. Checkpoint/Restart Interface

The Checkpoint/Restart (C/R) API allows an MPI application to request a checkpoint, or a restart from a previous checkpoint, from within the program. Unless otherwise specified, these calls are collective over all processes in the MPI application.


1.1. Interface. The application can request a checkpoint by using the OMPI_CR_CHECKPOINT API shown below. OMPI_CR_CHECKPOINT returns a handle to the global snapshot reference and a unique sequence number. No info keywords are defined at this time.

OMPI_CR_CHECKPOINT(handle, seq, info)
  OUT   handle  Global snapshot reference (string)
  OUT   seq     Sequence number (int)
  INOUT info    A set of key-value pairs providing hints to the MPI implementation regarding how this function should behave (handle, significant on all ranks)

int OMPI_CR_CHECKPOINT(char **handle, int *seq, MPI_Info info);

Additionally, the application can use the OMPI_CR_RESTART API, also shown below, to request a restart of the application. The OMPI_CR_RESTART function takes as arguments the handle and sequence number returned by OMPI_CR_CHECKPOINT or by an external tool such as ompi-checkpoint. The restart request is not collective and may be made by any process. The C/R infrastructure will act on each restart request in the order received. No info keywords are defined at this time.

OMPI_CR_RESTART(handle, seq, info)
  IN    handle  Global snapshot reference (string)
  IN    seq     Sequence number (int)
  INOUT info    A set of key-value pairs providing hints to the MPI implementation regarding how this function should behave (handle, significant on all ranks)

int OMPI_CR_RESTART(char *handle, int seq, MPI_Info info);

1.2. Examples. This section presents some pseudo code examples of how the C/R interfaces might be used in an MPI application.
1.2.1. Example 1: Application Directed Checkpoint. The application identifies a good point in the execution to checkpoint. The application may optionally use CRS-specific interfaces to identify temporary buffers that may be excluded from the checkpoint.

#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>  // Open MPI extended interfaces (header name assumed)
#endif

int main(int argc, char *argv[]) {
    char *handle = NULL;
    int seq = 0, i, max_iter = 100;  // iteration count (value illustrative)

    MPI_Init(&argc, &argv);
    for (i = 0; i < max_iter; ++i) {
#ifdef OMPI_HAVE_MPI_EXT_CR
        // Request a checkpoint before every step
        OMPI_CR_Checkpoint(&handle, &seq, MPI_INFO_NULL);
#endif
        // Resume normal operation.
    }
    MPI_Finalize();
    return 0;
}

1.2.2. Example 2: Application Directed Restart. The application identifies a possible problem with one of the cooperating processes and requests a restart of the entire application. The application may have noticed erroneous results from a peer (possibly indicating the effect of a soft error) or received notification of a process failure from the MPI interface (e.g., MPI_SEND failed due to process loss).

#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>  // Open MPI extended interfaces (header name assumed)
#endif

int main(int argc, char *argv[]) {
    char *handle = NULL;
    int seq = 0, i, max_iter = 100;  // iteration count (value illustrative)

    MPI_Init(&argc, &argv);
    for (i = 0; i < max_iter; ++i) {
#ifdef OMPI_HAVE_MPI_EXT_CR
        // Request a checkpoint before every step
        OMPI_CR_Checkpoint(&handle, &seq, MPI_INFO_NULL);
#endif
        // Resume normal operation.
        if (MPI_SUCCESS != MPI_Send(...)) {
#ifdef OMPI_HAVE_MPI_EXT_CR
            // Restart from the last checkpoint, and keep processing
            OMPI_CR_Restart(handle, seq, MPI_INFO_NULL);
#else
            MPI_Abort(MPI_COMM_WORLD, -1);
#endif
        }
    }
    MPI_Finalize();
    return 0;
}

2. Quiescence Interface

Parallel applications can benefit from a synchronization call that completes all outstanding communication on a communicator. The quiesce operation on communicators serves a purpose similar to that of MPI_WIN_FENCE for one-sided RMA operations on a window. The quiescence collective operations force all outstanding communication to be delivered to its intended recipient. A region of quiescence is defined between the start and end calls. This allows the application to define a region of time in which it is assured that no communication occurs over the designated communicator. This can be useful when checkpointing an application or when preparing buffer space for additional communication.

2.1. Interface. The interface to the blocking version of the OMPI_CR_QUIESCE_START operation is shown below. Though a communicator argument is provided, it is required to be MPI_COMM_WORLD in the prototype implementation in Open MPI. No info keywords are defined at this time.

OMPI_CR_QUIESCE_START(comm, info)
  IN    comm  Communicator (handle)
  INOUT info  A set of key-value pairs providing hints to the MPI implementation regarding how this function should behave (handle, significant on all ranks)

int OMPI_CR_QUIESCE_START(MPI_Comm comm, MPI_Info info);

Below is the interface to the blocking version of the OMPI_CR_QUIESCE_END operation. Though a communicator argument is provided, the prototype implementation in Open MPI requires it to be MPI_COMM_WORLD. No info keywords are defined at this time.

OMPI_CR_QUIESCE_END(comm, info)
  IN    comm  Communicator (handle)
  INOUT info  A set of key-value pairs providing additional information to the MPI implementation regarding how to continue after quiescence (handle, significant on all ranks)

int OMPI_CR_QUIESCE_END(MPI_Comm comm, MPI_Info info);

The OMPI_CR_QUIESCE_CHECKPOINT function allows the application to optionally choose to use the CRS provided by the MPI implementation. This interface assumes that the MPI library has been quiesced by a previous call to the quiescence start function, in contrast to the OMPI_CR_CHECKPOINT API (presented in Section 1), which has no such requirement. Though a communicator argument is provided, it is required to be MPI_COMM_WORLD in the prototype implementation in Open MPI. No info keywords are defined at this time.

OMPI_CR_QUIESCE_CHECKPOINT(comm, handle, seq, info)
  IN    comm    Communicator (handle)
  OUT   handle  Global snapshot reference (string)
  OUT   seq     Sequence number (int)
  INOUT info    A set of key-value pairs providing hints to the MPI implementation regarding how this function should behave (handle, significant on all ranks)

int OMPI_CR_QUIESCE_CHECKPOINT(MPI_Comm comm, char **handle, int *seq, MPI_Info info);

2.2. Examples. Next we present some pseudo code examples of how the quiescence interfaces might be used in MPI applications.
2.2.1. Example 1: Application-Level Checkpoint/Restart. The application requires assurance that all sent messages have been either received or cached by the recipient on the communicator before taking an application-level checkpoint. On restart, the application is responsible for distributing the checkpoint data and, if necessary, indicating to the application that it is restarting.

#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>  // Open MPI extended interfaces (header name assumed)
#endif

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
#ifdef OMPI_HAVE_MPI_EXT_CR
    OMPI_CR_Quiesce_start(MPI_COMM_WORLD, MPI_INFO_NULL);
    // Prepare application for application-level checkpoint.
    // Wait on any important outstanding receives
    // Save application state
    OMPI_CR_Quiesce_end(MPI_COMM_WORLD, MPI_INFO_NULL);
#endif
    // Resume normal operation.
    MPI_Finalize();
    return 0;
}

2.2.2. Example 2: Application Controlled System-Level Checkpoint/Restart. The application requires the ability to perform actions during the quiescent region before or after the system-level checkpoint is requested from the CRS provided by the MPI implementation. For example, the application may want to mark temporary buffers that should not be preserved in the checkpoint in order to reduce the size of the checkpoint.

#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>  // Open MPI extended interfaces (header name assumed)
#endif

int main(int argc, char *argv[]) {
    char *handle = NULL;
    int seq = 0;

    MPI_Init(&argc, &argv);
#ifdef OMPI_HAVE_MPI_EXT_CR
    OMPI_CR_Quiesce_start(MPI_COMM_WORLD, MPI_INFO_NULL);
    // Prepare application for checkpoint.
    // Wait on any important outstanding receives
    // Mark some memory regions for exclusion
    OMPI_CR_Quiesce_checkpoint(MPI_COMM_WORLD, &handle, &seq, MPI_INFO_NULL);
    OMPI_CR_Quiesce_end(MPI_COMM_WORLD, MPI_INFO_NULL);
#endif
    // Resume normal operation.
    MPI_Finalize();
    return 0;
}

3. Process Migration Interface

The process migration API allows an MPI application to migrate the processes defined by a communicator, possibly away from their current resources. The MPI application may do so to avoid a predicted node failure or to better balance the load of the application.

3.1. Interface. The application can request a migration by using the OMPI_CR_MIGRATE API shown below. The group of migrating processes is defined by the communicator passed. The application can optionally specify a hostname or rank to move the current rank onto or close to. If the CR_OFF_NODE MPI_Info keyword is set to true, then the process is moved away from the current host to a spare or sparsely loaded host. The operation is collective across the communicator provided.

OMPI_CR_MIGRATE(comm, hostname, rank, info)
  IN    comm      Communicator of processes to migrate (handle)
  IN    hostname  Name of the machine to move this rank onto. May be NULL. (string)
  IN    rank      Process rank to move this rank close to. May be negative, indicating NULL. (int)
  INOUT info      A set of key-value pairs providing hints to the MPI implementation regarding how this function should behave (handle, significant on all ranks)

int OMPI_CR_MIGRATE(MPI_Comm comm, char *hostname, int rank, MPI_Info info);

3.2. Examples. This section presents some pseudo code examples of how the process migration interfaces might be used in MPI applications.
3.2.1. Example 1: Application Directed Fault Avoidance. Individual ranks in the application receive external notification that their machine is going to fail in the near future. The processes can then use previously created communicators to migrate, or individually choose to move using MPI_COMM_SELF.

#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>  // Open MPI extended interfaces (header name assumed)
#endif

int main(int argc, char *argv[]) {
    MPI_Info qinfo;
    int i, max_iter = 100;  // iteration count (value illustrative)

    MPI_Init(&argc, &argv);
    MPI_Info_create(&qinfo);

    for (i = 0; i < max_iter; ++i) {
        // Receive notification that this node is going to fail
#ifdef OMPI_HAVE_MPI_EXT_CR
        // Ask to be migrated anywhere else in the system,
        // except this node.
        MPI_Info_set(qinfo, "CR_OFF_NODE", "true");
        OMPI_CR_MIGRATE(MPI_COMM_SELF, NULL, -1, qinfo);
#endif
        // Resume normal operation.
    }

    MPI_Info_free(&qinfo);
    MPI_Finalize();
    return 0;
}

3.2.2. Example 2: Application Directed Load Balancing. In Figure A.1 the application identifies a period of time when it wishes to reposition the processes in the computation to reduce message delay. This is useful for applications that have multiple distinct phases of computation with different communication patterns.

4. Interlayer Notification Callbacks

Figure A.2 presents the INC registration functions. Table A.1 presents the various arguments passed to these callback functions. Applications can use these callbacks to synchronize additional libraries (e.g., accelerators [253]) or files that would not otherwise be accounted for in the local snapshot. The INC callbacks are called before and after the underlying MPI library's callbacks. This allows the application to use MPI functions to coordinate processes during checkpoint preparation and upon recovery, if necessary.

#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>  // Open MPI extended interfaces (header name assumed)
#endif

int main(int argc, char *argv[]) {
    int i, max_iter, neighbor_rank, my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    ...
    // Stage 1: Communication Pattern A
    for (i = 0; i < max_iter; ++i) {
        ...
    }

#ifdef OMPI_HAVE_MPI_EXT_CR
    // Since the communication pattern is changing,
    // re-position my processes by using process migration.
    neighbor_rank = get_best_neighbor(my_rank);
    OMPI_CR_MIGRATE(MPI_COMM_WORLD, NULL, neighbor_rank, MPI_INFO_NULL);
#endif

    // Stage 2: Communication Pattern B
    for (i = 0; i < max_iter; ++i) {
        ...
    }

    MPI_Finalize();
    return 0;
}

FIGURE A.1. Process Migration API Example.

// INC Registration Function
int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
                                  OMPI_CR_INC_callback_function function,
                                  OMPI_CR_INC_callback_function *prev_function);

// INC Callback Function Signature
typedef int (*OMPI_CR_INC_callback_function)(OMPI_CR_INC_callback_event_t event,
                                             OMPI_CR_INC_callback_state_t state);

FIGURE A.2. Open MPI INC Registration API. A. Checkpoint/Restart Application Programming Interface 157
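As an illustration of how these registration functions might be used (this sketch is not one of the dissertation's own examples; the log-file name, the callback body, and the decision to register the same function for two events are assumptions), the following code closes an application log file before a checkpoint is taken and reopens it when execution continues or restarts, using the events and states listed in Table A.1:

#include <stdio.h>
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>   /* Open MPI extended interfaces (header name assumed) */
#endif

static FILE *app_log = NULL;   /* hypothetical application log file */

/* INC callback: flush and close the log before the checkpoint is taken,
 * and reopen it on continue/restart so the snapshot never contains a
 * half-written file. */
static int app_inc_callback(OMPI_CR_INC_callback_event_t event,
                            OMPI_CR_INC_callback_state_t state)
{
    if (OMPI_CR_INC_PRE_CRS_PRE_MPI == event &&
        OMPI_CR_INC_STATE_PREPARE == state) {
        if (NULL != app_log) {
            fflush(app_log);
            fclose(app_log);
            app_log = NULL;
        }
    } else if (OMPI_CR_INC_POST_CRS_POST_MPI == event &&
               (OMPI_CR_INC_STATE_CONTINUE == state ||
                OMPI_CR_INC_STATE_RESTART  == state)) {
        app_log = fopen("app.log", "a");   /* file name is hypothetical */
    }
    return 0;
}

/* Called once, e.g., shortly after MPI_Init(). */
static void register_app_inc(void)
{
    OMPI_CR_INC_callback_function prev = NULL;
    OMPI_CR_INC_register_callback(OMPI_CR_INC_PRE_CRS_PRE_MPI,
                                  app_inc_callback, &prev);
    OMPI_CR_INC_register_callback(OMPI_CR_INC_POST_CRS_POST_MPI,
                                  app_inc_callback, &prev);
}

Because the POST_CRS_POST_MPI callback runs after the MPI library's own recovery callback, the application is free to use MPI calls at that point to coordinate with its peers, as noted above.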

OMPI_CR_INC_callback_event_t
  State                          Description
  OMPI_CR_INC_PRE_CRS_PRE_MPI    Pre-checkpoint, before OMPI INC.
  OMPI_CR_INC_PRE_CRS_POST_MPI   Pre-checkpoint, after OMPI INC.
  OMPI_CR_INC_POST_CRS_PRE_MPI   Continue/Restart, before OMPI INC.
  OMPI_CR_INC_POST_CRS_POST_MPI  Continue/Restart, after OMPI INC.

OMPI_CR_INC_callback_state_t
  State                          Description
  OMPI_CR_INC_STATE_PREPARE      Pre-checkpoint
  OMPI_CR_INC_STATE_CONTINUE     Continue
  OMPI_CR_INC_STATE_RESTART      Restart
  OMPI_CR_INC_STATE_ERROR        Error

TABLE A.1. Open MPI INC function callback events and states.

B Command Line Tools

This chapter describes the command line tools: ompi-checkpoint, ompi-restart, and ompi-migrate. These commands are used by an end user (e.g., scheduler, resource manager, system administrator, developer) to interact with the C/R functionality in Open MPI from the command line. These commands transparently activate the appropriate C/R functionality in the running MPI application.

1. ompi-checkpoint

The ompi-checkpoint command is provided to checkpoint an MPI application. The one required argument to this command is the PID of the mpirun process. This command must be launched on the same machine as the running mpirun process. Once a checkpoint request has completed, ompi-checkpoint will return a global snapshot reference and a sequence number.

shell$ ompi-checkpoint PID_OF_MPIRUN [OPTIONS]
shell$ ompi-restart GLOBAL_SNAPSHOT_REF [OPTIONS]
shell$ ompi-migrate PID_OF_MPIRUN [OPTIONS]

FIGURE B.1. Open MPI ompi-checkpoint, ompi-restart, and ompi-migrate commands.

Argument        Description
PID_OF_MPIRUN   PID of the mpirun process
-h --help       Display help
-v --verbose    Display verbose output
-V #            Display verbose output up to a specified level
--term          Terminate the application after the checkpoint.
-s --status     Display status progression messages of the checkpoint.
-l --list       Display a list of checkpoint files available on this machine.
--stop          Send SIGSTOP to the application just after the checkpoint.
--detach        Do not wait for the debugger to re-attach after a checkpoint.
--attach        Wait for the debugger to attach after a checkpoint.

TABLE B.1. Open MPI ompi-checkpoint arguments.

This information will allow the end user to properly restart the MPI application at a later time. End users familiar with the LAM/MPI checkpoint/restart commands should notice that ompi-checkpoint does not require the user to tell it which CRS (e.g., BLCR or SELF) to use when checkpointing the application. This information is automatically detected and stored with the checkpoint snapshots.

1.1. Interface. Figure B.1 presents the interface to the ompi-checkpoint command. Table B.1 explains the command line arguments for this command.

1.2. Example. Figure B.2 presents a brief example of how the ompi-checkpoint command could be used to checkpoint a running MPI application. The checkpoint command generated two checkpoints with sequence numbers 0 and 1, and a global snapshot reference of ompi-global-snapshot-1234.

shell$ mpirun my-app &
shell$ export PID_OF_MPIRUN=1234
shell$ ompi-checkpoint $PID_OF_MPIRUN
Snapshot Ref.: 0 ompi-global-snapshot-1234
shell$ ompi-checkpoint $PID_OF_MPIRUN
Snapshot Ref.: 1 ompi-global-snapshot-1234

FIGURE B.2. Open MPI ompi-checkpoint example.

Argument                   Description
GLOBAL_SNAPSHOT_REF        Global snapshot reference
-h --help                  Display help
-v --verbose               Display verbose output
-a --apponly               Only create the app context file, do not restart from it.
-s --seq                   The sequence number of the checkpoint to start from. (Default: -1, or most recent)
--hostfile --machinefile   Provide a hostfile to use for launch.
-i --info                  Display information about the checkpoint.
--mpirun_opts              Options to pass directly to mpirun.
--crdebug                  Restart and wait for the debugger to attach.

TABLE B.2. Open MPI ompi-restart arguments.

2. ompi-restart

The ompi-restart command is provided to restart a previously checkpointed MPI application. The one required argument to this command is the global snapshot reference returned by ompi-checkpoint. The global snapshot reference contains all of the necessary information to properly restart an MPI application. Invoking ompi-restart results in a new mpirun being exec()'ed in its place. End users familiar with the LAM/MPI checkpoint/restart commands should notice that ompi-restart does not require the user to tell it which CRS (e.g., BLCR or SELF) was used when checkpointing the application. This information is stored with the checkpoint snapshot and automatically used by the ompi-restart command.

2.1. Interface. Figure B.1 presents the interface to the ompi-restart command. Table B.2 explains the command line arguments.

Argument        Description
PID_OF_MPIRUN   PID of the mpirun process
-h --help       Display help
-v --verbose    Display verbose output
-x --off        List of nodes to migrate off of (comma separated).
-r --ranks      List of MPI_COMM_WORLD ranks to migrate (comma separated).
-t --onto       List of nodes to migrate onto (comma separated).

TABLE B.3. Open MPI ompi-migrate arguments.

2.2. Example. Below is a brief example of how the ompi-restart command could be used to restart a previously checkpointed MPI application. ompi-global-snapshot-1234 is the global snapshot reference returned by ompi-checkpoint; this checkpoint has two sequence numbers. The first example restarts from the most recent sequence number (1). The second example restarts from the first sequence number (0).

shell$ ompi-restart ompi-global-snapshot-1234
shell$ ompi-restart -s 0 ompi-global-snapshot-1234

3. ompi-migrate

The ompi-migrate command is provided to migrate, using the C/R infrastructure, a group of MPI processes in a running application. The two required arguments to this command are the PID of the mpirun process and the hosts or MPI_COMM_WORLD ranks to migrate. Optionally, the end user can specify a destination set of resources. This command must be launched on the same machine as the running mpirun process.

3.1. Interface. Figure B.1 presents the interface to the ompi-migrate command. Table B.3 explains the command line arguments for this command.

3.2. Example. Below is a brief example of how the ompi-migrate command could be used to migrate a C/R-enabled MPI application. In this example, we are requesting that all processes be migrated off of node1 and node2.

shell$ ompi-migrate $PID_OF_MPIRUN --off node1,node2

C self CRS

The self Checkpoint/Restart Service component will invoke user-defined functions to save and restore checkpoints. It is simply a mechanism for user-defined functions to be invoked at Open MPI's checkpoint, continue, and restart phases to support application-level C/R. Hence, the only data saved during the checkpoint is what is written by the application's checkpoint function; no MPI library state is saved. As such, the model for the self component is slightly different from that of, for example, the BLCR component. Specifically, the restart function is not invoked in the same process image as the process that was checkpointed. Instead, the restart phase is invoked during MPI_INIT of a new instance of the application (i.e., it starts over from main()).


This chapter presents an example application that takes advantage of the self CRS. Checkpointing and restarting of the MPI job occurs exactly as in the transparent cases (i.e., by using the ompi-checkpoint and ompi-restart tools or the API).

1. Interface

Open MPI uses dlsym to search for functions with known names and signatures that can be used by the self CRS. This saves the application the inconvenience of having to explicitly register these functions in the code.

// Default Checkpoint Callback
int opal_crs_self_user_checkpoint(char **restart_cmd);
// Default Continue Callback
int opal_crs_self_user_continue(void);
// Default Restart Callback
int opal_crs_self_user_restart(void);

If the application would rather explicitly register these functions, it may do so using the registration functions below.

// SELF CRS Checkpoint Registration Function
int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_checkpoint_fn)(char **restart_cmd);

// SELF CRS Continue Registration Function
int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_continue_fn)(void);

// SELF CRS Restart Registration Function
int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function);
// SELF CRS Callback Function Signature
typedef int (*OMPI_CR_self_restart_fn)(void);
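As a minimal sketch of the explicit registration path (the callback bodies, the registration point after MPI_Init, and the reuse of the OMPI_HAVE_MPI_EXT_CR guard from Appendix A are assumptions; only the registration functions and callback typedefs above come from the interface), an application might register its own callbacks as follows:

#include <stdio.h>
#include <mpi.h>
#ifdef OPEN_MPI
#include <mpi-ext.h>   /* Open MPI extended interfaces (header name assumed) */
#endif

/* Hypothetical application-level callbacks. */
static int app_checkpoint(char **restart_cmd)
{
    *restart_cmd = NULL;     /* let the C/R infrastructure restart us */
    /* ... write the application state to a file ... */
    return 0;
}

static int app_continue(void)
{
    /* Nothing to do; the process continues from its current state. */
    return 0;
}

static int app_restart(void)
{
    /* ... read the application state back from the file ... */
    return 0;
}

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
#ifdef OMPI_HAVE_MPI_EXT_CR
    /* Register explicitly instead of relying on the dlsym-based defaults. */
    OMPI_CR_self_register_checkpoint_callback(app_checkpoint);
    OMPI_CR_self_register_continue_callback(app_continue);
    OMPI_CR_self_register_restart_callback(app_restart);
#endif
    /* ... application work ... */
    MPI_Finalize();
    return 0;
}

The example in Section 4 takes the dlsym route instead, supplying functions with the default opal_crs_self_user_* names (or a prefix selected with the crs_self_prefix MCA parameter) rather than calling these registration functions.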

2. Compiling

For the dlsym functionality to work, the application must export its symbols at compile time. Below is an example of exporting the symbols using gcc and the Open MPI wrapper compiler, mpicc.

shell$ mpicc my-app.c -export -export-dynamic -o my-app

3. Running

The application is launched as normal by specifying the ft-enable-cr AMCA parameter to mpirun. Optionally, the end user can specify the crs_self_prefix MCA parameter to help the dlsym search find callback functions that are prefixed with something other than the default names.

shell$ mpirun -np 2 -am ft-enable-cr my-app
shell$ mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

4. Example Application

/*
 * Example Open MPI CRS 'self' program
 * Author: Josh Hursey
 */
#define _GNU_SOURCE   /* for asprintf() on glibc (added to make the example self-contained) */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <mpi.h>

#define LIMIT 100

/*** Function Declarations ***/
void signal_handler(int sig);

/* Default OPAL crs self callback functions */
int opal_crs_self_user_checkpoint(char **restart_cmd);
int opal_crs_self_user_continue(void);
int opal_crs_self_user_restart(void);

/*
 * OPAL CRS self callback functions. Use the following MCA parameter:
 *   crs_self_prefix=my_personal
 */
int my_personal_checkpoint(char **restart_cmd);
int my_personal_continue(void);
int my_personal_restart(void);

/*** Global Variables ***/
int  am_done      = 1;
int  current_step = 0;
char ckpt_file[128]    = "my-personal-cr-file.ckpt";
char restart_path[128] = "/full/path/to/personal-cr";

/*** Main ***/
int main(int argc, char *argv[]) {
    int rank, size;

    current_step = 0;
    MPI_Init(&argc, &argv);

    /* So we can exit cleanly */
    signal(SIGINT,  signal_handler);
    signal(SIGTERM, signal_handler);

    for(; current_step < LIMIT; current_step += 1) {
        printf("%d) Step %d\n", getpid(), current_step);
        sleep(1);
        if(0 == am_done) {
            break;
        }
    }

    MPI_Finalize();
    return 0;
}

void signal_handler(int sig) {
    printf("Received Signal %d\n", sig);
    am_done = 0;
}

/* OPAL crs self default callbacks for checkpoint */
int opal_crs_self_user_checkpoint(char **restart_cmd) {
    printf("opal_crs_self_user_checkpoint callback...\n");
    my_personal_checkpoint(restart_cmd);
    return 0;
}

int opal_crs_self_user_continue(void) {
    printf("opal_crs_self_user_continue callback...\n");
    my_personal_continue();
    return 0;
}

int opal_crs_self_user_restart(void) {
    printf("opal_crs_self_user_restart callback...\n");
    my_personal_restart();
    return 0;
}

/* OPAL crs self callback for checkpoint */
int my_personal_checkpoint(char **restart_cmd) {
    FILE *fp;
    *restart_cmd = NULL;

    printf("my_personal_checkpoint callback...\n");

    /* Open our checkpoint file */
    if( NULL == (fp = fopen(ckpt_file, "w")) ) {
        fprintf(stderr, "Error: Unable to open file (%s)\n", ckpt_file);
        return -1;
    }

    /* Save the process state */
    fprintf(fp, "%d\n", current_step);

    /* Close the checkpoint file */
    fclose(fp);

    /* Figure out the restart command */
    asprintf(restart_cmd, "%s", restart_path);

    return 0;
}

int my_personal_continue(void) {
    printf("my_personal_continue callback...\n");
    /* Don't need to do anything here since we are in the
     * state that we want to be in already. */
    return 0;
}

int my_personal_restart(void) {
    FILE *fp;

    printf("my_personal_restart callback...\n");

    /* Open our checkpoint file */
    if( NULL == (fp = fopen(ckpt_file, "r")) ) {
        fprintf(stderr, "Error: Unable to open file (%s)\n", ckpt_file);
        return -1;
    }

    /*
     * Access the process state that we saved and
     * update the current_step variable.
     */
    fscanf(fp, "%d", &current_step);
    fclose(fp);

    printf("my_personal_restart: Restarting from step %d\n", current_step);
    return 0;
}

D Nonblocking Process Creation and Management Operations

It is well established that many applications can see performance improvements by overlapping communication and computation. Nonblocking process management routines allow an application to overlap the creation of processes and/or the establishment of communication channels with other computation. The amount of overlap can become substantial when creating or connecting a large number of processes. The proposal presented in this chapter focuses on adding nonblocking interfaces to the existing interfaces described in Chapter 10 of the MPI 2.2 standard, entitled “Process Creation and Management”.


The section titles in this chapter match those in the current standard to aid the reader in locating modifications to the original text.

1. Process Manager Interface

For all of the nonblocking routines described in this section, an MPI_REQUEST object is returned. All of the completion calls (e.g., MPI_WAIT, MPI_TEST) are supported for these nonblocking routines. Upon returning from a completion call, the MPI_ERROR field of the MPI_STATUS object is set appropriately. The values of the MPI_SOURCE and MPI_TAG fields are undefined. The parameters marked as OUT should not be accessed until the request has been completed by one of the completion calls.

Matching blocking and nonblocking versions of the MPI_ICOMM_SPAWN, MPI_ICOMM_SPAWN_MULTIPLE, MPI_ICOMM_ACCEPT, MPI_ICOMM_CONNECT, and MPI_ICOMM_JOIN operations is not allowed.

Rationale: The algorithms implemented for the blocking and nonblocking versions of these operations may not be equivalent for reasons of efficiency.

2. Starting Processes and Establishing Communication

MPI_ICOMM_SPAWN(command, argv, maxprocs, info, root, comm, intercomm, array_of_errcodes, request)

IN   command             name of program to be spawned (string, significant only at root)
IN   argv                arguments to command (array of strings, significant only at root)
IN   maxprocs            maximum number of processes to start (integer, significant only at root)
IN   info                a set of key-value pairs telling the runtime system where and how to start the processes (handle, significant only at root)
IN   root                rank of process in which previous arguments are examined (integer)
IN   comm                intracommunicator containing group of spawning processes (handle)
OUT  intercomm           intercommunicator between original group and the newly spawned group (handle)
OUT  array_of_errcodes   one error code per process (array of integer)
OUT  request             process creation request (handle)

This call starts a nonblocking variant of MPI_COMM_SPAWN. It is erroneous to call MPI_REQUEST_FREE for the MPI_REQUEST associated with the MPI_ICOMM_SPAWN operation.

If an MPI_REQUEST for MPI_ICOMM_SPAWN or MPI_ICOMM_SPAWN_MULTIPLE is marked for cancellation using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the processes were launched and the communicator was created, or that the operation was canceled, in which case the processes are not launched and the communicator is not created. MPI_CANCEL can be called from any participating process.

Advice to implementors: The root can decide whether it is able to, or safe to, cancel the request when it is notified of the cancellation request. If the processes have already started to be launched, the implementation is allowed to refuse the cancellation request if it is unable or unwilling to terminate the new processes. Alternatively, the implementation may decide that it is willing and able to terminate a newly launched process. Care should be taken when doing so, since the process may cause side effects in the system. For example, if the launched process interacts with the file system before calling MPI_INIT, this may influence already running processes.

Side Note: MPI_COMM_GET_PARENT does not have a nonblocking counterpart since it is a completely local operation.
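As an illustration of these cancellation semantics, the sketch below starts a nonblocking spawn with the proposed MPI_ICOMM_SPAWN routine and attempts to cancel it if the workers turn out not to be needed. The command name "worker" and the routine do_other_work are hypothetical, and the request is completed with a completion call either way.

#include <mpi.h>

int do_other_work(void);   /* hypothetical application routine */

int spawn_workers_if_needed(void)
{
    MPI_Comm    intercomm;
    MPI_Request request;
    MPI_Status  status;
    int         errcodes[4];

    /* Proposed nonblocking spawn of four worker processes. */
    MPI_Icomm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                    MPI_COMM_SELF, &intercomm, errcodes, &request);

    if (!do_other_work()) {
        /* The implementation may refuse the cancellation if the worker
         * processes have already been launched. */
        MPI_Cancel(&request);
    }

    /* Completes either normally (workers launched, intercommunicator
     * created) or as canceled (no workers, no intercommunicator). */
    MPI_Wait(&request, &status);
    return status.MPI_ERROR;
}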

2.1. Starting Multiple Executables and Establishing Communication. This call starts a nonblocking variant of MPI_COMM_SPAWN_MULTIPLE.

MPI_ICOMM_SPAWN_MULTIPLE(count, array_of_commands, array_of_argv, array_of_maxprocs, array_of_info, root, comm, intercomm, array_of_errcodes, request)

IN   count               number of commands (positive integer, significant to MPI only at root)
IN   array_of_commands   programs to be executed (array of strings, significant only at root)
IN   array_of_argv       arguments for commands (array of array of strings, significant only at root)
IN   array_of_maxprocs   maximum number of processes to start for each command (array of integer, significant only at root)
IN   array_of_info       info objects telling the runtime system where and how to start processes (array of handles, significant only at root)
IN   root                rank of process in which previous arguments are examined (integer)
IN   comm                intracommunicator containing group of spawning processes (handle)
OUT  intercomm           intercommunicator between original group and newly spawned group (handle)
OUT  array_of_errcodes   one error code per process (array of integers)
OUT  request             process creation request (handle)

If an MPI_REQUEST for MPI_ICOMM_SPAWN or MPI_ICOMM_SPAWN_MULTIPLE is marked for cancellation using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the processes were launched and the communicator was created, or that the operation was canceled, in which case the processes are not launched and the communicator is not created. MPI_CANCEL can be called from any participating process.

2.2. Nonblocking Spawn Example. Below is a pseudocode example of a use of the nonblocking spawn interface.

int foo(int num_children) {
    for(i = 0; i < num_children; ++i) {
        MPI_Icomm_spawn("myapp", argv, 1, info, 0, MPI_COMM_SELF,
                        &intercomm[i], &array_of_errcodes[i], &(requests[i]));
    }
    prepare_work_units();
    MPI_Waitall(num_children, requests, statuses);
    begin_work();
}

3. Establishing Communication

3.1. Server Routines. This call starts a nonblocking variant of MPI_OPEN_PORT. Since this might involve communication with an external name service, an application may want to perform computation while waiting on a response.

MPI_IOPEN_PORT(info, port_name, request)

IN   info        implementation-specific information on how to establish an address (handle)
OUT  port_name   newly established port (string)
OUT  request     connection request (handle)

If an MPI_REQUEST for MPI_IOPEN_PORT is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the port is opened, or that the operation is canceled, in which case the port is not opened.

MPI_ICLOSE_PORT(port_name, request)

IN   port_name   a port (string)
OUT  request     connection request (handle)

This call starts a nonblocking variant of MPI_CLOSE_PORT. Again, since this might involve communication with an external name service, computation overlap may be advantageous.

If an MPI_REQUEST for MPI_ICLOSE_PORT is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the port is closed, or that the operation is canceled, in which case the port remains in its previous state.

MPI_ICOMM_ACCEPT(port_name, info, root, comm, newcomm, request)

IN   port_name   port name (string, used only on root)
IN   info        implementation-specific information (handle, used only on root)
IN   root        rank in comm of the root node (integer)
IN   comm        intracommunicator over which call is collective (handle)
OUT  newcomm     intercommunicator with client as remote group (handle)
OUT  request     connection request (handle)

This call starts a nonblocking variant of MPI_COMM_ACCEPT.

If an MPI_REQUEST for MPI_ICOMM_ACCEPT is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the connection is established, or that the operation is canceled, in which case any pending requests will be canceled, returning MPI_ERR_NOT_CONNECTED to any processes waiting in MPI_COMM_CONNECT or MPI_ICOMM_CONNECT. MPI_CANCEL can be called from any participating process.
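The sketch below illustrates the intended server-side usage: the existing blocking MPI_OPEN_PORT is followed by the proposed nonblocking accept, and the wait for a client is overlapped with local computation. The names serve_one_client and refine_mesh_locally are hypothetical.

#include <mpi.h>

void refine_mesh_locally(void);   /* hypothetical application routine */

int serve_one_client(MPI_Comm *client)
{
    char        port_name[MPI_MAX_PORT_NAME];
    MPI_Request request;
    int         flag = 0;

    /* Existing blocking call to establish a port. */
    MPI_Open_port(MPI_INFO_NULL, port_name);

    /* Proposed nonblocking accept. */
    MPI_Icomm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                     client, &request);

    /* Overlap the wait for a client with useful local computation. */
    while (!flag) {
        refine_mesh_locally();
        MPI_Test(&request, &flag, MPI_STATUS_IGNORE);
    }

    MPI_Close_port(port_name);
    return MPI_SUCCESS;
}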

3.2. Client Routines. This call starts a nonblocking variant of the MPI_COMM_CONNECT operation.

MPI_ICOMM_CONNECT(port_name, info, root, comm, newcomm, request)

IN   port_name   network address (string, used only on root)
IN   info        implementation-specific information (handle, used only on root)
IN   root        rank in comm of the root node (integer)
IN   comm        intracommunicator over which call is collective (handle)
OUT  newcomm     intercommunicator with server as remote group (handle)
OUT  request     connection request (handle)

If an MPI_REQUEST for MPI_ICOMM_CONNECT is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the connection is established, or that the operation is canceled, in which case the connection is not established. If the connection was being established when the cancellation occurred, then the matching MPI_COMM_ACCEPT or MPI_ICOMM_ACCEPT call may return MPI_ERR_NOT_CONNECTED. MPI_CANCEL can be called from any participating process.
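A matching client-side sketch is shown below, overlapping the proposed nonblocking connect with local work. The names connect_to_server and assemble_first_request are hypothetical, and the port name is assumed to have been obtained out of band.

#include <mpi.h>

void assemble_first_request(void);   /* hypothetical application routine */

int connect_to_server(char *port_name, MPI_Comm *server)
{
    MPI_Request request;

    /* Proposed nonblocking connect to the server's published port. */
    MPI_Icomm_connect(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                      server, &request);

    /* Do useful work while the connection is being established. */
    assemble_first_request();

    MPI_Wait(&request, MPI_STATUS_IGNORE);
    return MPI_SUCCESS;
}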

3.3. Name Publishing. This call starts a nonblocking variant of the MPI_PUBLISH_NAME operation.

MPI_IPUBLISH_NAME(service_name, info, port_name, request)

IN   service_name   a service name to associate with the port (string)
IN   info           implementation-specific information (handle)
IN   port_name      a port name (string)
OUT  request        connection request (handle)

If an MPI_REQUEST for MPI_IPUBLISH_NAME is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the port is published, or that the operation is canceled, in which case the port remains unpublished.

MPI_IUNPUBLISH_NAME(service_name, info, port_name, request)

IN   service_name   a service name (string)
IN   info           implementation-specific information (handle)
IN   port_name      a port name (string)
OUT  request        connection request (handle)

This call starts a nonblocking variant of MPI_UNPUBLISH_NAME.

If an MPI_REQUEST for MPI_IUNPUBLISH_NAME is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the port is unpublished, or that the operation is canceled, in which case the port remains in its previous state before the call to MPI_IUNPUBLISH_NAME.

MPI_ILOOKUP_NAME(service_name, info, port_name, request)

IN   service_name   a service name (string)
IN   info           implementation-specific information (handle)
OUT  port_name      a port name (string)
OUT  request        connection request (handle)

This call starts a nonblocking variant of MPI_LOOKUP_NAME.

If an MPI_REQUEST for MPI_ILOOKUP_NAME is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the port_name variable contains the value found, or that the operation is canceled, in which case the port_name variable remains in its previous state before the call to MPI_ILOOKUP_NAME.
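The sketch below illustrates a client overlapping the proposed nonblocking name lookup with local setup before connecting with the existing blocking MPI_COMM_CONNECT. The service name "ocean_service" and the routines find_and_connect and build_input_deck are hypothetical.

#include <mpi.h>

void build_input_deck(void);   /* hypothetical application routine */

int find_and_connect(MPI_Comm *server)
{
    char        port_name[MPI_MAX_PORT_NAME];
    MPI_Request request;

    /* Proposed nonblocking lookup of a previously published service name. */
    MPI_Ilookup_name("ocean_service", MPI_INFO_NULL, port_name, &request);

    /* Overlap the (possibly remote) name-service query with local setup. */
    build_input_deck();
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    /* Existing blocking connect using the resolved port name. */
    return MPI_Comm_connect(port_name, MPI_INFO_NULL, 0,
                            MPI_COMM_WORLD, server);
}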

3.4. Reserved Key Values. The timeout key value is defined in an Advice to Implementors note in the MPI 2.0 standard notes for MPI_COMM_CONNECT in Chapter 5, Section 5.4.3. We strengthen this commentary by standardizing a timeout key for the MPI_INFO object that can be passed to either the blocking or nonblocking versions of both MPI_COMM_ACCEPT and MPI_COMM_CONNECT.

The timeout reserved key indicates the timeout period for the MPI_COMM_ACCEPT, MPI_COMM_CONNECT, MPI_OPEN_PORT, MPI_PUBLISH_NAME, MPI_LOOKUP_NAME, and MPI_UNPUBLISH_NAME operations. The value is specified as a positive integer representing the timeout in units of MPI_WTICK. The timeout is defined as the time it takes for the entire operation to complete. If the value is set to 0, then the timeout is set to an MPI implementation-defined limit. The keyword is only meaningful at the root. If an implementation supports this reserved key, it must behave as specified in this section. The standard does not specify how long an implementation should take to return to the application after a timeout. A good quality implementation will return to the application after a timeout within an implementation-defined time bound.

Rationale: The server might stall for a variety of reasons (e.g., fault recovery, overload). The client may want to limit its sensitivity to slow servers. The timeout key allows it to cancel a blocking connection establishment operation.

Advice to implementors: If the MPI implementation provides the timeout key, then it is suggested to set the default timeout to an indefinite timeout value. Additionally, the timeout for a collective operation should start once all processes arrive. This is important for the nonblocking variations of these connection establishment calls in order to account for process skew. Once the timeout has started, the timeout is between the roots representing the collective group in the paired operations. For example, the timeout starts on the server side once all processes have called MPI_COMM_ACCEPT or MPI_ICOMM_ACCEPT. The timeout starts on the client side once all processes have called MPI_COMM_CONNECT or MPI_ICOMM_CONNECT. The connection establishment operation is timed out between the root in MPI_COMM_ACCEPT and the root in MPI_COMM_CONNECT, not any other process in either intracommunicator.
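For illustration, the sketch below shows one way an application might set the proposed timeout reserved key before a blocking connect. The helper name connect_with_timeout and the 30-second budget are arbitrary choices for the example.

#include <stdio.h>
#include <mpi.h>

int connect_with_timeout(char *port_name, MPI_Comm *server)
{
    MPI_Info info;
    char     value[64];
    int      rc;

    /* Express roughly 30 seconds as a positive integer number of
     * MPI_WTICK units, as the proposed key requires. */
    snprintf(value, sizeof(value), "%.0f", 30.0 / MPI_Wtick());

    MPI_Info_create(&info);
    MPI_Info_set(info, "timeout", value);

    rc = MPI_Comm_connect(port_name, info, 0, MPI_COMM_WORLD, server);

    MPI_Info_free(&info);
    return rc;
}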

MPI_COMM_CONNECT will try for the specified amount of time to establish a connection to the remote server. If it cannot connect within the specified time, the library will raise either the error class MPI_ERR_NOT_RESPONDING or the error class MPI_ERR_NOT_AVAILABLE. The error class MPI_ERR_NOT_RESPONDING indicates that the remote server is alive, but has not responded to the connection request within the timeout bounds (i.e., no MPI_COMM_ACCEPT posted at the remote server). The MPI_ERR_NOT_AVAILABLE error class indicates that the remote server state is unknown, and it has not responded to the connection request within the timeout bounds.

Advice to implementors: If the MPI implementation does not have the ability to distinguish between an active server and an unresponsive server, the implementation is allowed to return the stricter of the two error classes, namely MPI_ERR_NOT_AVAILABLE.

MPI_COMM_CONNECT will raise the error class MPI_ERR_NOT_CONNECTED if the connection to the remote server has started but could not be finished. This situation may be caused by a failure of the remote server during the connection establishment handshake. Additionally, it could be caused by a timeout on the operation when set by the application.

MPI_COMM_ACCEPT will try for the specified amount of time to accept a connection from a remote client. It is valid behavior for the MPI_COMM_ACCEPT call to time out without accepting a connection, and this should not be considered an error. This scenario is indicated by setting newcomm to MPI_COMM_NULL and returning MPI_SUCCESS. MPI_COMM_ACCEPT will raise the error class MPI_ERR_NOT_CONNECTED if the connection to the remote client has started but could not be finished. This situation may be caused by a failure of the remote client during the connection establishment handshake.

If MPI_ERRORS_RETURN is not set on the communicator, then MPI_ERR_NOT_RESPONDING, MPI_ERR_NOT_AVAILABLE, and MPI_ERR_NOT_CONNECTED are fatal, as defined by the default of MPI_ERRORS_ARE_FATAL, even though the error code may have been caused by the timeout on the operation.

Rationale: It was recognized that the application, by setting the timeout key value, may expect that errors caused by the timeout of the operation should not be fatal. However, making this class of errors non-fatal would cause previously standard-conformant applications to break, since functions that they assumed would only return MPI_SUCCESS would then return a wider class of errors.
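The sketch below shows how an application might distinguish the non-error timeout case of MPI_COMM_ACCEPT from an actual failure. It assumes that MPI_ERRORS_RETURN has been set on the communicator and that the implementation defines the proposed error classes; accept_with_timeout is a hypothetical helper.

#include <mpi.h>

/*
 * Returns 1 if a client connected, 0 if the accept timed out without a
 * connection (a valid, non-error outcome per the text above), and -1 on
 * failure.
 */
int accept_with_timeout(char *port_name, MPI_Info info,
                        MPI_Comm comm, MPI_Comm *client)
{
    int rc, err_class;

    rc = MPI_Comm_accept(port_name, info, 0, comm, client);

    if (MPI_SUCCESS == rc) {
        if (MPI_COMM_NULL == *client) {
            return 0;   /* timed out: newcomm is MPI_COMM_NULL */
        }
        return 1;       /* connected */
    }

    MPI_Error_class(rc, &err_class);
    if (MPI_ERR_NOT_CONNECTED == err_class) {
        /* Handshake started but could not be completed. */
    }
    return -1;
}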

3.5. Releasing Connections. This call starts a nonblocking variant of MPI_COMM_DISCONNECT.

MPI_ICOMM_DISCONNECT(comm, request)

INOUT  comm      communicator (handle)
OUT    request   connection request (handle)

If an MPI_REQUEST for MPI_ICOMM_DISCONNECT is marked for cancellation using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the communicator is disconnected, or that the operation is canceled, in which case the communicator is still valid.

3.6. Another Way to Establish MPI Communication. This call starts a nonblocking variant of MPI_COMM_JOIN.

MPI_ICOMM_JOIN(fd, intercomm, request)

IN   fd          socket file descriptor
OUT  intercomm   new intercommunicator (handle)
OUT  request     connection request (handle)

If an MPI_REQUEST for MPI_ICOMM_JOIN is marked for cancellation by using MPI_CANCEL, then it must be the case that either the operation completed normally, in which case the communicator is created, or that the operation is canceled, in which case the communicator is not created.

The file descriptor must not be used between the call to MPI_ICOMM_JOIN and the corresponding request completion call. Upon return from the completion call on the request from MPI_ICOMM_JOIN, the file descriptor will be open and quiescent.

Joshua J. Hursey

Contact Information

Computer Science Department, Lindley Hall 215, 150 S. Woodlawn Ave., Bloomington, IN 47405-7104
Email: [email protected]
Web: http://www.cs.indiana.edu/~jjhursey

Research Interests

My research interests span parallel and distributed systems, scientific computing, and software engineering. I am currently focusing on the development of scalable fault tolerance and resiliency techniques for High Performance Computing applications.

Education

July 2010: Ph.D., Computer Science, Indiana University, Bloomington, IN

May 2006: M.S., Computer Science, Indiana University, Bloomington, IN

May 2003: B.A., Computer Science, Mathematics Minor, Earlham College, Richmond, IN

Publications

Journal

• Alex Breuer, Joshua J. Hursey, Tonya Stroman, and Arvind Verma. Visualization of criminal activity in an urban population. Artificial Crime Analysis Systems: Using Computer Simulations and Geographic Information Systems, January 2008.

Conference and Workshop

• Joshua Hursey, Chris January, Mark O’Connor, Paul H. Hargrove, David Lecomber, Jeffrey M. Squyres, and Andrew Lumsdaine. Checkpoint/restart-enabled parallel debugging. Proceedings of the European MPI Users Group Conference (EuroMPI), September 2010.

• Joshua Hursey, Timothy I. Mattox, and Andrew Lumsdaine. Interconnect agnostic checkpoint/restart in Open MPI. In Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC 2009), June 2009.

• Joseph A. Cottam, Andrew Lumsdaine, and Joshua Hursey. Representing unit test data for large scale software development. In ACM Symposium on Software Visualization (SoftVis 2008), September 2008.

• Joshua Hursey, Ethan Mallove, Jeffrey M. Squyres, and Andrew Lumsdaine. An extensible framework for distributed testing of MPI implementations. In Proceedings of the 14th European PVM/MPI Conference, October 2007.

• Yvonne Rogers, Kay Connelly, Lenore Tedesco, William Hazlewood, Andrew Kurtz, Robert E. Hall, Joshua Hursey, and Tammy Toscos. Why it’s worth the hassle: The value of in-situ studies when designing Ubicomp. In UbiComp 2007: Ubiquitous Computing, September 2007. Nominated for the Best Paper Award.

• Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS), March 2007.

• David S. Wise, Craig Citro, Joshua Hursey, Fang Liu, and Michael Rainey. A paradigm for parallel matrix algorithms: Scalable Cholesky. In Proceedings of the 11th International Euro-Par Conference, August 2005.

Technical Reports

• Joshua Hursey, Jeffrey M. Squyres, and Andrew Lumsdaine. A checkpoint and restart service specification for Open MPI. Technical Report TR635, Indiana University, Bloomington, Indiana, USA, July 2006.

• Craig Shue, Joshua Hursey, and Arun Chauhan. MPI over scripting languages: Usability and performance tradeoffs. Technical Report TR631, Indiana University, Bloomington, Indiana, USA, February 2006.

Posters

• Joshua Hursey, Scott Hampton, Pratul Agarwal, and Andrew Lumsdaine. An adaptive checkpoint/restart library for large scale HPC applications. In SIAM Conference on Parallel Processing and Scientific Computing (PP10), February 2010.

• Ralph Castain, Joshua Hursey, Timothy I. Mattox, Chase Cotton, Robert M. Broberg, and Jonathan M. Smith. A resilient runtime environment for HPC and internet core router systems. In IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC09), November 2009.

• Joseph A. Cottam, Joshua Hursey, and Andrew Lumsdaine. SeeTest: Unit test visualization. In IEEE Symposium on Information Visualization (InfoVis 2008), November 2008.

• Joseph A. Cottam, Joshua Hursey, and Andrew Lumsdaine. SeeTest: Unit test visualization. Indiana University Computer Science and Informatics Graduate Research Poster Session, May 2008. Received honorable mention award.

• Joshua Hursey, Jeffrey M. Squyres, Timothy I. Mattox, and Andrew Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. Indiana University Computer Science and Informatics Graduate Research Poster Session, March 2007.

• Charles Peck, Joshua Hursey, Joshua McCoy, and John Schaefer. Calculating 1/sqrt(x) for molecular dynamics packages on commodity vector architectures. SIAM Conference on Computational Science and Engineering, February 2005.

• Charles Peck, Joshua Hursey, Joshua McCoy, and John Schaefer. Folding@Clusters: Harnessing grid-based parallel computing resources for molecular dynamics simulations. SIAM Conference on Computational Science and Engineering, February 2005.

• Charles Peck, Joshua Hursey, and Joshua McCoy. Benchmarking and tuning the GROMACS molecular dynamics package on beowulf clusters. SIAM Conference on Parallel Processing for Scientific Computing, February 2004. Received Best Poster Award.

• Charles Peck, Joshua Hursey, and Joshua McCoy. Benchmarking and tuning the GROMACS molecular dynamics package on beowulf clusters. Earlham College Undergraduate Research Conference, November 2003.

• Joshua Hursey. Fingerprint minutiae classification: A multi-layer feed-forward artificial neural network approach. Butler University’s 14th Annual Undergraduate Research Conference, April 2003.

Work Under Submission

• Joshua Hursey and Andrew Lumsdaine. A composable runtime recovery policy framework supporting resilient HPC applications. Under submission, 2010.

Invited Talks

• Joshua Hursey, Jeffrey M. Squyres, Abhishek Kulkarni, and Andrew Lumsdaine. Open MPI tutorial. Indiana University Booth at IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC09), November 2009.

• Joshua Hursey. A transparent process migration framework for Open MPI. Cisco Systems, Inc. Booth at IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC09), November 2009.

• Joshua Hursey. Fault tolerance in Open MPI. Los Alamos National Laboratory Resilience Seminar, November 2009.

• Joshua Hursey. Fault tolerance in High Performance Computing: MPI and checkpoint/restart. Indiana University Booth at IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC08), November 2008.

• Joshua Hursey. MPI implementation health assessment through multi-institutional distributed testing. Cisco Systems, Inc. Booth at IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC08), November 2008.

• Joshua Hursey. Checkpoint/restart support in Open MPI. Sun Microsystems, Inc. Technical Talk Series, May 2008.

• Joshua Hursey. Process fault tolerance and Open MPI. Oak Ridge National Laboratory, June 2007.

• Joshua Hursey. Process fault tolerance in Open MPI. Innovative Computing Laboratory (ICL) Friday Lunch Speaker Series, University of Tennessee, Knoxville, February 2007.

• Joshua Hursey. Dealing with disaster: Fault tolerance in Open MPI. Indiana University Booth at IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC06), November 2006.

• Joshua Hursey. Fault tolerance and MPI. Los Alamos National Laboratory CCS1 Brown Bag Speaker Series, July 2006.

Selected Honors and Awards

• Honorable Mention, Computer Science and Informatics Graduate Research Symposium, Indiana University, 2008.
• Nominated for Best Paper Award, UbiComp 2007: Ubiquitous Computing, 2007.
• Best Poster Award, SIAM Conference on Parallel Processing for Scientific Computing, 2004.
• Academic and Departmental Honors Degree, Earlham College, 2003.

Research Experience

Jan. 2005 - Present: Research Assistant, Indiana University, Bloomington, IN. Advisor: Professor Andrew Lumsdaine.

Conducted research into scalable process fault tolerance techniques for High Performance Computing (HPC). Developed a transparent, coordinated checkpoint/restart technique for the Open MPI project. Assisted in integrating support for the CIFTS FTB project into Open MPI.

May 2009 - Aug. 2009: Research Assistant, Oak Ridge National Labs., Oak Ridge, TN. Advisor: Dr. Richard Graham.

Conducted research into scalable, resilient runtime environments for HPC. Contributed to the Open MPI and Scalable Tools Communication Infrastructure (STCI) projects.

June 2007 - Aug. 2007: Research Assistant, Oak Ridge National Labs., Oak Ridge, TN. Advisor: Dr. Richard Graham.

Integrated Cray Application Level Placement Scheduler (ALPS) support into the Open MPI project to support the Cray XT4/5 (Jaguar). Supported Open MPI development and deployment at ORNL. Contributed to the MPI Testing Tool (MTT) to support Cray environments, back-end database enhancements, and user interface improvements.

May 2006 - Aug. 2006: Research Assistant, Los Alamos National Labs., Los Alamos, NM. Advisor: Dr. David Daniel.

Conducted research into resilient runtime environments, and checkpoint/restart techniques for HPC. Designed and developed transparent, coordinated checkpoint/restart process fault tolerance for the Open MPI project.

May 2005 - Aug. 2005: Research Assistant, Lawrence Berkeley National Labs., Berkeley, CA. Advisor: Dr. Paul H. Hargrove.

Supported efforts to make the K42 operating system suitable for HPC. Conducted research into checkpoint/restart process fault tolerance for the Open MPI project.

Aug. 2004 - Dec. 2004: Research Assistant, Indiana University, Bloomington, IN. Advisor: Professor David Wise.

Conducted research into cache oblivious matrix multiplication algorithms on HPC clusters.

Aug. 2003 - Aug. 2004: Research Assistant, Earlham College, Richmond, IN. Advisor: Professor Charles Peck.

Designed and developed the Folding@Clusters distributed computing project. Developed methodologies for benchmarking and tuning interdependent scientific software packages on Beowulf clusters.

Nov. 2002 - May 2005: Software Consultant, Safe Passage Comm., Inc., Richmond, IN. Manager: Charles Peck.

Designed and developed mobility applications for PDAs, cellular phones, Symbol scanners, and other wireless devices. Contributed to open source projects supporting these applications.

June 2002 - Aug. 2003: Research Assistant, Earlham College, Richmond, IN. Advisor: Professor Charles Peck.

Conducted research into software design and structure of molecular dynamics applications for Beowulf clusters.

Aug. 2002 - Dec. 2002: Senior Seminar Project, Earlham College, Richmond, IN. Advisor: Professor Jim Rogers.

Conducted research into fingerprint biometric recognition using artificial neural networks.

Teaching Experience

Sept. 2008 - Dec. 2008: Primary Instructor, Introduction to Operating Systems (ACM/IEEE-CS CS222w), Indiana University, Bloomington, IN.

Text: Silberschatz, A., Galvin, P.B., Gagne, G. Operating System Concepts, Eighth Edition, John Wiley & Sons, Inc., 2008.

Jan. 2004 - June 2004: Adjunct Instructor, Introduction to C++ Programming (ACM/IEEE-CS CS101O), Purdue University, Richmond, IN.

Text: Lafore, R. Object-Oriented Programming in C++, Fourth Edition, Sams, 2002.

Aug. 2003 - Dec. 2003: Adjunct Instructor, Programming and Problem Solving (ACM/IEEE-CS CS101I), Earlham College, Richmond, IN.

Text: Savitch, W. Problem Solving with C++: The Object of Programming, Fourth Edition, Addison Wesley, 2003.

May 2003 - June 2003: Teaching Assistant, Co-Organizer, Service Learning Trip: Jamaica, Earlham College, Richmond, IN.

Service learning trip to Jamaica with 15 students and faculty members.

Service

Professional
• Program Committee Member: EuroMPI Users’ Group Conference, 2010.
• Reviewer: IEEE Transactions on Parallel and Distributed Systems, 2008; IEEE International Conference on e-Science and Grid Computing, 2007; Journal of Parallel and Distributed Computing, 2007; Concurrency and Computation: Practice and Experience, 2007; IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2007.

Computer Science Department
• Moderator, Academic Survival Graduate Student Colloquia Series (ASGSC) (a series of research and professional development events for graduate students including a Faculty Lightning Talks event involving 18 faculty members), 2008-2009.

• Coordinator, Kill-the-Week Social Hour, Computer Science Graduate Student Association, 2005 - 2007.

Affiliations

• Association for Computing Machinery (ACM) (2003 - Present)

• Institute of Electrical and Electronics Engineers (IEEE) (2008 - Present)

• Society for Industrial and Applied Mathematics (SIAM) (2009 - Present)

References

Available upon request.