A Small-Scale Testbed for Large-Scale Reliable Computing
Jason R. St. John
Purdue University, Purdue e-Pubs: Open Access Theses (Theses and Dissertations), 12-2016

Recommended Citation:
St. John, Jason R., "A small-scale testbed for large-scale reliable computing" (2016). Open Access Theses. 899. https://docs.lib.purdue.edu/open_access_theses/899

PURDUE UNIVERSITY GRADUATE SCHOOL
Thesis/Dissertation Acceptance (Graduate School Form 30)

This is to certify that the thesis/dissertation prepared

By: Jason R. St. John

Entitled: A SMALL-SCALE TESTBED FOR LARGE-SCALE RELIABLE COMPUTING

For the degree of: Master of Science

Is approved by the final examining committee:
Thomas Hacker, Chair
Eric Matson
John Springer

To the best of my knowledge and as understood by the student in the Thesis/Dissertation Agreement, Publication Delay, and Certification Disclaimer (Graduate School Form 32), this thesis/dissertation adheres to the provisions of Purdue University's "Policy of Integrity in Research" and the use of copyright material.

Approved by Major Professor(s): Thomas Hacker
Approved by: Jeffrey Whitten, Head of the Departmental Graduate Program, 11/29/2016

A SMALL-SCALE TESTBED FOR LARGE-SCALE RELIABLE COMPUTING

A Thesis Submitted to the Faculty of Purdue University by Jason R. St. John

In Partial Fulfillment of the Requirements for the Degree of Master of Science

December 2016
Purdue University
West Lafayette, Indiana

ACKNOWLEDGMENTS

I wish to gratefully acknowledge my thesis committee, my family, and my fellow graduate students for their help and support, encouragement, and insight.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS
ABSTRACT
CHAPTER 1. INTRODUCTION
  1.1 Scope
  1.2 Significance
  1.3 Research Question
  1.4 Assumptions
  1.5 Limitations
  1.6 Delimitations
  1.7 Definitions
  1.8 Summary
CHAPTER 2. REVIEW OF RELEVANT LITERATURE
  2.1 HPC System Reliability
  2.2 HPC, Cloud Computing, and Virtualization
  2.3 Error Injection
  2.4 Summary
CHAPTER 3. FRAMEWORK AND METHODOLOGY
  3.1 Study Design
    3.1.1 Data Sets
    3.1.2 Overview of Statistical Model Creation
    3.1.3 System Log Analysis
      3.1.3.1 Coates: Log Processing and Sorting
      3.1.3.2 Coates: Timestamp Standardization
      3.1.3.3 Coates: Node ID Standardization
      3.1.3.4 Coates: Scatter Plots of Memory Failure Events
      3.1.3.5 Coates: Declustering of Failure Events
      3.1.3.6 Carter: Scatter Plots of Hard Disk Failure Events
      3.1.3.7 Carter: Declustering of Failure Events
      3.1.3.8 Distribution Fitting
    3.1.4 Fault Injection
      3.1.4.1 Memory Error Injection Procedure
      3.1.4.2 Machine Check Registers
      3.1.4.3 Memory Error Injection Strings
      3.1.4.4 QEMU Modifications and Hard Disk Error Simulation Procedure
      3.1.4.5 Hard Disk Errors
      3.1.4.6 Tunable Fault Injection Frequency
      3.1.4.7 Tuning the Three-Parameter Weibull Distribution
      3.1.4.8 Tuning the Lomax Distribution
      3.1.4.9 Experimental Setup
  3.2 Unit & Sampling
    3.2.1 Hypothesis
    3.2.2 Population
    3.2.3 Sample
    3.2.4 Variables
    3.2.5 Measure for Success
  3.3 Summary
CHAPTER 4. PRESENTATION OF DATA
  4.1 Statistical Models
    4.1.1 Coates: Memory Errors
    4.1.2 Carter: Hard Disk Errors
  4.2 Experiments
    4.2.1 System Log Samples
    4.2.2 Quantitative Results
CHAPTER 5. CONCLUSIONS, DISCUSSION, AND RECOMMENDATIONS
  5.1 Conclusions and Discussion
  5.2 Recommendations
APPENDIX A. FAULT INJECTION SCRIPT
APPENDIX B. QEMU SOURCE CODE MODIFICATIONS
LIST OF REFERENCES

LIST OF TABLES
3.1  Comparison of various timestamp formats
3.2  Polynomial coefficients
3.3  Polynomial roots
3.4  Number of centers vs. WSSE
3.5  Supported memory errors and MCi_STATUS register values
3.6  Supported hard disk errors and their descriptions
4.1  Parameters of the three-parameter Weibull distribution for integrated memory controller errors on the Coates system
4.2  Parameters of the Lomax distribution for hard disk errors on the Carter system
4.3  Number of attempted and successful fault injections

LIST OF FIGURES
3.1  Time of Memory Error Events vs. Node ID
3.2  Time of Hard Disk Error Events vs. Node ID
3.3  WSSE vs. Number of Clusters
3.4  Configuration file for fio
3.5  systemd service file for fio, installed on the VMs
3.6  systemd service file for changing the default polling interval for machine checks
3.7  Bash script that changes the default polling frequency for machine checks
4.1  System log screenshot of VM "centos4" experiencing an ENOMEDIUM error injection

ABBREVIATIONS

ADDR   memory address
BSD    Berkeley Software Distribution
CPU    central processing unit
EC2    Amazon's Elastic Compute Cloud
ECC    error-correcting code
DRAM   dynamic random-access memory
fio    Flexible I/O Tester
GART   Graphics Address Relocation Table
GUI    graphical user interface
HMP    QEMU's Human Monitor Protocol
HPC    high performance computing
IaaS   infrastructure as a service
I/O    input/output
IP     Internet Protocol
ISO    International Organization for Standardization
IT     information technology
MAC    media access control
MC     machine check
MCA    machine check architecture
MCE    machine check exception
MTBF   mean time between failures
MTTF   mean time to failure
RAM    random-access memory
RCAC   Rosen Center for Advanced Computing
RFC    Request For Comments
RHEL   Red Hat Enterprise Linux
TLB    Translation Lookaside Buffer
UTC    Coordinated Universal Time
VM     virtual machine
VMM    virtual machine monitor
WSSE   within groups sum of squared error

ABSTRACT

St. John, Jason R. M.S., Purdue University, December 2016. A Small-Scale Testbed For Large-Scale Reliable Computing. Major Professor: Thomas J. Hacker.

High performance computing (HPC) systems frequently suffer errors and failures from hardware components that negatively impact the performance of jobs run on these systems. We analyzed system logs from two HPC systems at Purdue University and created statistical models for memory and hard disk errors. We created a small-scale error injection testbed (using a customized QEMU build, libvirt, and Python) for HPC application programmers to test and debug their programs in a faulty environment, so that programmers can write more robust and resilient programs before deploying them on an actual HPC system. The deliverables for this project are the fault injection program, the modified QEMU source code, and the statistical models used for driving the injection.

CHAPTER 1. INTRODUCTION

High performance computing (HPC) systems are a major component of academic and industrial scientific research, and the reliability of these systems is one of the foremost areas of research in the HPC community. The HPC community is continually trying to build larger, more powerful HPC systems, and the poor reliability of commodity HPC clusters is one of the biggest hindrances to the adoption of petascale and exascale computing systems. This chapter provides the scope, significance, research question, and other background information on the thesis.

1.1 Scope

This study examined the system logs of large-scale HPC systems operated by the Rosen Center for Advanced Computing (RCAC) at Purdue University. The systems studied are all commodity-based computing clusters running Red Hat Enterprise Linux (RHEL) 5, which is a GNU/Linux distribution based on version 2.6.18 of the Linux kernel. This study determined the frequency at which component failures occurred and produced statistical models of the recorded failure events. The statistical models were used as the driving input to a synthetic fault generator
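
To make this workflow concrete, the following is a minimal sketch of how fitted failure distributions can drive injection timing: an inter-arrival time is drawn from a three-parameter Weibull model (memory errors) or a Lomax model (hard disk errors), and a machine check is then sent to a guest through libvirt's virsh wrapper around QEMU's Human Monitor Protocol. This is not the fault injection program of Appendix A; the distribution parameters, helper names, the domain name "centos4", and the machine check register values below are illustrative assumptions rather than the values derived in Chapters 3 and 4.

```python
#!/usr/bin/env python3
"""Sketch of distribution-driven fault injection (illustrative only)."""
import subprocess
import time

from scipy.stats import lomax, weibull_min

# Placeholder parameters; the thesis fits its own values to the Coates and
# Carter system logs (see Tables 4.1 and 4.2).
WEIBULL_SHAPE, WEIBULL_LOC, WEIBULL_SCALE = 0.7, 0.0, 3600.0  # memory errors
LOMAX_SHAPE, LOMAX_SCALE = 1.5, 7200.0                        # hard disk errors


def next_interarrival_seconds(kind: str) -> float:
    """Draw the delay until the next synthetic fault from the fitted model."""
    if kind == "memory":
        return float(weibull_min.rvs(WEIBULL_SHAPE, loc=WEIBULL_LOC,
                                     scale=WEIBULL_SCALE))
    return float(lomax.rvs(LOMAX_SHAPE, scale=LOMAX_SCALE))


def inject_memory_error(domain: str) -> None:
    """Send an HMP 'mce' command to a guest via virsh qemu-monitor-command.

    Arguments are cpu, bank, status, mcg_status, addr, misc; the values here
    are placeholders, not the encodings listed in Table 3.5.
    """
    hmp_cmd = "mce 0 2 0xb200000000000000 0x0 0x0 0x0"
    subprocess.run(
        ["virsh", "qemu-monitor-command", "--hmp", domain, hmp_cmd],
        check=True,
    )


if __name__ == "__main__":
    # Wait for a model-driven delay, then inject one memory error.
    time.sleep(next_interarrival_seconds("memory"))
    inject_memory_error("centos4")
```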