Characterizing and Understanding HPC Job Failures over the 2K-day Life of IBM BlueGene/Q System
Sheng Di,* Hanqi Guo,* Eric Pershey,* Marc Snir,† Franck Cappello*†
*Argonne National Laboratory, IL, USA
†University of Illinois at Urbana-Champaign, IL, USA

Abstract—An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical to large-scale system maintenance and to improvement of the service quality for users. In this paper, we investigate the features of hundreds of thousands of jobs in one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, based on 2001 days of observations with a total of over 32.44 billion core-hours. We study the impact of the system's events on the jobs' execution in order to understand the system's reliability from the perspective of jobs and users. The characterization involves a joint analysis based on multiple data sources, including the reliability, availability, and serviceability (RAS) log; the job scheduling log; the log regarding each job's physical execution tasks; and the I/O behavior log. We present 22 valuable takeaways based on our in-depth analysis. For instance, 99,245 job failures are reported in the job-scheduling log, a large majority (99.4%) of which are due to user behavior (such as bugs in code, wrong configuration, or misoperations). The job failures are correlated with multiple metrics and attributes, such as users/projects and job execution structure (number of tasks, scale, and core-hours). The best-fitting distributions of a failed job's execution length (or interruption interval) include Weibull, Pareto, inverse Gaussian, and Erlang/exponential, depending on the types of errors (i.e., exit codes). The RAS events affecting job executions exhibit a high correlation with users and core-hours and have a strong locality feature. In terms of the failed jobs, our similarity-based event-filtering analysis indicates that the mean time to interruption is about 3.5 days.

I. INTRODUCTION

Since many of today's science research problems are too complicated to resolve by theoretical analysis, scientists have to perform large-scale (or extreme-scale) simulations on supercomputers. Large-scale simulations, however, have a high likelihood of encountering failures during lengthy execution [1]. In order to improve the service quality of HPC systems, it is critical to deeply understand the features and behaviors of failed jobs and their correlation with system reliability.

In this paper, we characterize job failures in one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, which is deployed at Argonne National Laboratory. Our study is based on a set of system logs spanning 5.5 years (from 04/09/2013 to 09/30/2018). The IBM Blue Gene/Q Mira was ranked as the third fastest supercomputer in 2013, and it is still ranked 21st in the world based on the latest TOP500 report. Understanding the job failure features on this supercomputer has broad significance because multiple supercomputers, such as Sequoia (USA), Vulcan (USA), and Blue Joule (UK), also adopt the same Blue Gene/Q system architecture and are still in operation.

Studying the job failure features in a large-scale system is nontrivial in that it involves numerous messages logged across multiple data sources, and many messages are heavily duplicated [2]. In our work, we performed a joint analysis by leveraging four different data sources: the reliability, availability, and serviceability (RAS) log; the task execution log; the job scheduling log; and the I/O behavior log. The RAS log is the most important system log related to system reliability, covering issues such as node failures, power outages, and coolant problems. Over the 5.5 years of observation, the RAS log contains 80,665,723 messages with three severity levels (Fatal, Warn, and Info). In the Mira system, users who wish to run a high-performance computing (HPC) application or simulation must submit a job to the Cobalt system [3] (a job-scheduling system similar to Torque [4]); the submitted job is then split into multiple tasks during the whole execution. The user-submitted jobs are called user jobs or Cobalt jobs in the following text. The job-scheduling log records the status of each job, such as queuing status, running status, number of nodes or cores used, completion time, and exit status. In our study, the job-scheduling log involves up to 32.44 billion core-hours, which is, to the best of our knowledge, the largest compute resource usage in a resilience study to date. The task execution log contains detailed information such as which physical execution block was assigned to the job and which rank ran into errors if the job failed. To determine the jobs' I/O behaviors, we analyze the I/O characterization logs produced by Darshan [5], [6], such as the number of bytes read/written by each job and the potential correlation with job failures. We combined all four data sources to better understand the behavior of a failed job and how fatal system events affect job execution.
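To make the joint analysis concrete, the following minimal Python sketch shows one way a Cobalt job record could be cross-referenced with fatal RAS events by time overlap and allocated block. This is not the authors' tooling; every field name (cobalt_jobid, block, start, end, severity, event_time, message_id) and every record value below is an illustrative assumption about the log schemas.

```python
# Minimal sketch (not the authors' code): attribute fatal RAS events to a job
# by overlapping each event's timestamp and location with the job's allocation.
# All field names and values below are illustrative assumptions.
from datetime import datetime

def parse_ts(s):
    """Parse a timestamp string such as '2013-04-09 12:30:05'."""
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

def fatal_events_during_job(job, ras_events):
    """Return the fatal RAS events that overlap a job's run window and block."""
    start, end = parse_ts(job["start"]), parse_ts(job["end"])
    hits = []
    for ev in ras_events:
        if ev["severity"] != "FATAL":
            continue
        t = parse_ts(ev["event_time"])
        # Simplified attribution: the event happened while the job was running
        # and on the job's allocated block (real location matching is richer).
        if start <= t <= end and ev["block"] == job["block"]:
            hits.append(ev)
    return hits

# Tiny fabricated records for the example only.
job = {"cobalt_jobid": 123456, "block": "MIR-00000-33FF1-512",
       "start": "2013-04-09 12:00:00", "end": "2013-04-09 14:00:00",
       "exit_status": 137}
ras_events = [
    {"event_time": "2013-04-09 13:05:42", "severity": "FATAL",
     "block": "MIR-00000-33FF1-512", "message_id": "00010001"},
    {"event_time": "2013-04-09 18:30:00", "severity": "WARN",
     "block": "MIR-00000-33FF1-512", "message_id": "00040021"},
]

print(fatal_events_during_job(job, ras_events))  # -> only the 13:05:42 fatal event
```

At the scale of tens of millions of RAS messages, such a join would of course be done with indexed or partitioned queries rather than a linear scan; the sketch only illustrates the matching criterion.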
Our characterization/analysis results have been approved by the Mira system administrator, who is an expert in log analysis. Based on our in-depth study of HPC jobs running on the IBM Blue Gene/Q Mira system, we address the following questions. These questions are critical to large-scale system maintenance, an in-depth understanding of HPC job failures, and improvement of the resource provisioning quality.

• Analysis of generic job features: What are the statistical features of the HPC jobs over a long-term period of observation of Mira? Specifically, we characterize the distribution of execution time, the best-fit distribution type of execution time by maximum likelihood estimation (MLE) (see the fitting sketch after this list), jobs' I/O behaviors, and the resource usage of jobs across users and projects. This characterization indicates the job features in IBM Blue Gene/Q systems, in comparison with other systems with different architectures [7]–[9].

• Analysis of failed jobs' features: What are the statistical features of the failed jobs on the petascale system from a long-term view? To address this question, we provide an in-depth, comprehensive study of the correlation between specific job exit statuses and other important attributes using multiple logs from the IBM Blue Gene/Q Mira; this approach is in contrast with general characterization work [2], [7] focusing mainly on the system level. Our work also differs from the existing application resilience study in [8], which focused on a statistical analysis of job failures on Blue Waters [10]. Specifically, not only do we characterize the distribution of failed jobs, but we also explore the best-fit distribution type of the execution lengths based on specific exit statuses. We also identify the relationship between job failure status and other critical attributes, such as the execution scale, users/projects, the job's execution tasks, jobs' I/O behaviors, job execution time, and resource allocation locations.

• Impacts of fatal system events on job executions: How do fatal system events impact job executions from the perspective of both the job scheduling system and user job executions? In contrast to related work [2] focusing only on the correlation among system events (i.e., RAS events) [11], we investigate the correlation between the system's RAS events and job executions. This new analysis is important to system administrators, application users, and fault tolerance researchers because it indicates the system's reliability from the perspective of users and jobs.
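As a rough illustration of the MLE-based best-fit analysis referenced in the first two bullets, the sketch below fits several candidate distributions to a set of failed-job execution lengths and ranks them by AIC using scipy. The data are synthetic, the candidate set simply mirrors the distribution families named in the abstract (Weibull, Pareto, inverse Gaussian, exponential), and the paper's actual per-exit-code methodology may differ.

```python
# Sketch of MLE-based best-fit selection for failed-job execution lengths.
# Synthetic data only; not the paper's actual fitting pipeline.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lengths = rng.weibull(0.8, size=5000) * 3600.0   # synthetic run lengths (seconds)

candidates = {
    "weibull":     stats.weibull_min,
    "pareto":      stats.pareto,
    "inv_gauss":   stats.invgauss,
    "exponential": stats.expon,
}

results = []
for name, dist in candidates.items():
    params = dist.fit(lengths, floc=0)            # MLE fit with location pinned at 0
    loglik = np.sum(dist.logpdf(lengths, *params))
    k = len(params) - 1                           # do not count the pinned loc
    aic = 2 * k - 2 * loglik                      # penalize extra free parameters
    results.append((aic, name, params))

for aic, name, params in sorted(results):
    print(f"{name:12s} AIC={aic:12.1f} params={params}")
```

Repeating such a ranking separately for each exit code would correspond to the per-exit-status fitting of execution lengths described above.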
example, used a tool called CrashFinder to analyze the faults causing long-latency crashes in user programs; they conducted their experiments on an Intel Xeon E5 machine with simulated faults injected by the open fault injector LLFI [13]. Siddiqua et al. [14], [15] characterized DRAM/SRAM faults collected over the lifetime of the LANL Cielo supercomputer [16]. Nie et al. [17], [18] characterized and quantified different kinds of soft errors on the Titan supercomputer's GPU nodes and also developed machine learning methods to predict the occurrence of GPU errors. Sridharan et al. [19], [20] examined the impact of errors and aging on DRAM and identified a significant intervendor effect on DRAM fault rates based on the LANL Cielo system and the ORNL Jaguar system [21]. Martino et al. [7] studied hardware/firmware errors on Blue Waters [10], showing that its processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are robust. Arguably, some large-scale system failure studies [9], [22]–[26] have been conducted on the IBM Blue Gene series of supercomputers; however, their analyses generally focus on specific issues such as memory errors, temperature, power, and soft-error behaviors, or on small or medium-sized supercomputers. Hwang et al. [27], for example, characterized DRAM errors and their implications for system design based on four supercomputers: IBM Blue Gene/L, IBM Blue Gene/P, SciNet, and a Google data center. Di et al. [2] characterized the resilience features of fatal system events for the IBM Blue Gene/Q Mira, but that study was based on a single data source (the RAS event log). Zheng et al. [28] provided a coanalysis of RAS logs and job logs on a Blue Gene/P system [29]; their study, however, was based on an older, smaller cluster (163k cores) with a short logging period (273 days). By comparison, we provide a much more comprehensive, fine-grained analysis of the correlation between various job failure types and multiple attributes (such as users and projects, job execution structure, locality, job's I/O behaviors, and RAS