Characterizing and Understanding HPC Job Failures over The 2K-day Life of IBM BlueGene/Q System

Sheng Di,∗ Hanqi Guo,∗ Eric Pershey,∗ Marc Snir,† Franck Cappello∗† ∗Argonne National Laboratory, IL, USA [email protected], [email protected], [email protected], [email protected] †University of Illinois at Urbana-Champaign, IL, USA [email protected]

Abstract—An in-depth understanding of the failure features of HPC jobs in a supercomputer is critical to the maintenance of large-scale systems and to the improvement of the quality of service for users. In this paper, we investigate the features of hundreds of thousands of jobs on one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, based on 2,001 days of observations with a total of over 32.44 billion core-hours. We study the impact of the system's events on the jobs' execution in order to understand the system's reliability from the perspective of jobs and users. The characterization involves a joint analysis based on multiple data sources, including the reliability, availability, and serviceability (RAS) log; the job scheduling log; the log regarding each job's physical execution tasks; and the I/O behavior log. We present 22 valuable takeaways based on our in-depth analysis. For instance, 99,245 job failures are reported in the job-scheduling log, a large majority (99.4%) of which are due to user behavior (such as bugs in code, wrong configurations, or misoperations). The job failures are correlated with multiple metrics and attributes, such as users/projects and job execution structure (number of tasks, scale, and core-hours). The best-fitting distributions of a failed job's execution length (or interruption interval) include Weibull, Pareto, inverse Gaussian, and Erlang/exponential, depending on the type of error (i.e., exit code). The RAS events affecting job executions exhibit a high correlation with users and core-hours and have a strong locality feature. In terms of the failed jobs, our similarity-based event-filtering analysis indicates that the mean time to interruption is about 3.5 days.

I. INTRODUCTION

Since many of today's science research problems are too complicated to resolve by theoretical analysis, scientists have to perform large-scale (or extreme-scale) simulations on supercomputers. Large-scale simulations, however, have a high likelihood of encountering failures during lengthy executions [1]. In order to improve the service quality of HPC systems, it is critical to deeply understand the features and behaviors of failed jobs and their correlation with system reliability.

In this paper, we characterize job failures on one of the most powerful supercomputers, the IBM Blue Gene/Q Mira, which is deployed at Argonne National Laboratory. Our study is based on a set of system logs spanning 5.5 years (from 04/09/2013 to 09/30/2018). The IBM Blue Gene/Q Mira was ranked as the third fastest supercomputer in 2013, and it is still ranked 21st in the world based on the latest TOP500 report. Understanding the job failure features on this supercomputer has broad significance because multiple supercomputers, such as Sequoia (USA), Vulcan (USA), and Blue Joule (UK), also adopt the same Blue Gene/Q system architecture, and they are still in operation.

Studying the job failure features in a large-scale system is nontrivial in that it involves numerous messages logged across multiple data sources, and many messages are heavily duplicated [2]. In our work, we performed a joint analysis by leveraging four different data sources: the reliability, availability, and serviceability (RAS) log; the task execution log; the job scheduling log; and the I/O behavior log. The RAS log is the most important system log related to system reliability, covering issues such as node failures, power outages, and coolant problems. In the 5.5 years of observation, the RAS log has 80,665,723 messages, which have three severity levels (Fatal, Warn, and Info). In the Mira system, users who wish to run a high-performance computing (HPC) application or simulation must submit a job to the Cobalt system [3] (a job-scheduling system similar to Torque [4]); the submitted job is then split into multiple tasks during the whole execution. The user-submitted jobs are called user jobs or Cobalt jobs in the following text. The job-scheduling log records the status of each job, such as queuing status, running status, number of nodes or cores used, completion time, and exit status. In our study, the job-scheduling log involves up to 32.44 billion core-hours, which is the largest compute resource usage in a resilience study to date, to the best of our knowledge. The task execution log contains detailed information such as which physical execution block was assigned to the job and which rank ran into errors if the job failed. To determine the jobs' I/O behaviors, we analyze the I/O characterization logs produced by Darshan [5], [6], such as the number of bytes read/written by each job and the potential correlation with job failures. We combined all four data sources to better understand the behavior of a failed job and how fatal system events affect job execution. Our characterization/analysis results have been approved by the Mira system administrator, who is an expert in log analysis.

Based on our in-depth study of HPC jobs running on the IBM Blue Gene/Q Mira system, we address the following questions. These questions are critical to large-scale system maintenance, in-depth understanding of HPC job failures, and improvement of the resource provisioning quality.

• Analysis of generic job features: What are the statistical features of the HPC jobs from the perspective of a long-term period of observation of Mira? Specifically,

we characterize the distribution of execution time, the best-fit distribution type of execution time by maximum likelihood estimation (MLE), jobs' I/O behaviors, and the resource usage of jobs across users and projects. This characterization indicates the job features of IBM Blue Gene/Q systems, in comparison with other systems with different architectures [7]–[9].

• Analysis of failed jobs' features: What are the statistical features of the failed jobs on the petascale system from a long-term view? To address this question, we provide an in-depth, comprehensive study of the correlation between the specific job exit statuses and other important attributes, using multiple logs from the IBM Blue Gene/Q Mira; this approach is in contrast with general characterization work [2], [7] focusing mainly on the system level. Our work also differs from the existing application resilience study in [8], which was focused on a statistical analysis of job failures on Blue Waters [10]. Specifically, not only do we characterize the distribution of failed jobs, but we also explore the best-fit distribution type of the execution lengths based on specific exit statuses. We also identify the relationship between job failure status and other critical attributes, such as the execution scale, users/projects, jobs' execution tasks, jobs' I/O behaviors, job execution time, and resource allocation locations.

• Impacts of fatal system events on job executions: How do the fatal system events impact the job executions from the perspective of both the job scheduling system and user job executions? In contrast to related work [2] focusing only on the correlation among system events (i.e., RAS events) [11], we investigate the correlation between the system's RAS events and job executions in this paper. This new analysis is important to system administrators, application users, and fault tolerance researchers because it indicates the system's reliability from the perspective of users and jobs. On the one hand, system administrators can better understand fatal system events and diagnose issues more effectively by taking into account their impact on the users. On the other hand, application users and researchers can get a more accurate estimation of the mean time to interruption (MTTI), such that more efficient fault tolerance strategies can be developed accordingly.

The remainder of this paper is organized as follows. In Section II, we discuss related work. In Section III, we describe the IBM Blue Gene/Q Mira system and the data sources (logs). In Section IV, we describe our analysis methodology. In Section V, we characterize the features of job executions in Mira and investigate the failure properties. In Section VI, we analyze the correlation between the system's fatal events and job executions and their locality features. In Section VII, we conclude the paper with a brief discussion of future work.

II. RELATED WORK

Although researchers have analyzed supercomputers' reliability, their analysis results cannot be applied to our context because of different systems or architectures. Li et al. [12], for example, used a tool called CrashFinder to analyze the faults causing long-latency crashes in user programs; they conducted their experiments on an Intel Xeon E5 machine with simulated faults injected by the open fault injector LLFI [13]. Siddiqua et al. [14], [15] characterized DRAM/SRAM faults collected over the lifetime of the LANL Cielo supercomputer [16]. Nie et al. [17], [18] characterized and quantified different kinds of soft errors on the Titan supercomputer's GPU nodes and also developed machine learning methods to predict the occurrence of GPU errors. Sridharan et al. [19], [20] examined the impact of errors and aging on DRAM and identified a significant intervendor effect on DRAM fault rates based on the LANL Cielo system and the ORNL Jaguar system [21]. Martino et al. [7] studied hardware/firmware errors on Blue Waters [10], showing that its processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are robust.

Arguably, some large-scale system failure studies [9], [22]–[26] have been conducted on the IBM Blue Gene series of supercomputers; however, their analyses generally focus on specific issues such as memory errors, temperature, power, and soft-error behaviors, or on small or medium-sized supercomputers. Hwang et al. [27], for example, characterized DRAM errors and their implications for system design, based on four systems: IBM Blue Gene/L, IBM Blue Gene/P, SciNet, and a Google data center. Di et al. [2] characterized the resilience features of fatal system events for the IBM Blue Gene/Q Mira, but the study was based on a single data source (the RAS event log). Zheng et al. [28] provided a coanalysis of RAS logs and job logs on a Blue Gene/P system [29]; their study, however, was based on an older, smaller cluster (163k cores) with a short logging period (273 days). By comparison, we provide a much more comprehensive, fine-grained analysis of the correlation between various job failure types and multiple attributes (such as users and projects, job execution structure, locality, jobs' I/O behaviors, and RAS events), as well as the best distribution fitting for job length and locality features.

Our analysis also differs from the characterization work in [8], which focused on applications running on Blue Waters [10]. That work studied application resilience (including both CPU usage and GPU usage) and categorized failure reasons in terms of resource usage such as core-hours. However, it did not characterize the detailed correlation between exit codes and specific key attributes (such as user names, number of tasks per job, and MLE-based best distribution fitting on job length) using contingency tables. Moreover, we also characterize failures vs. I/O behaviors and the detailed correlation between job failures and specific fatal system events, as well as the locality features. Their analysis was based on Cray-series systems whose architecture and users differ from those of Mira, such that their results cannot be applied to our study directly: for example, in their study on Blue Waters, only 14% of failed applications are due to timeout, whereas 55.8% of failed jobs are attributed to timeout on Mira. Our study also involves far more core-hours than that work did: 32.44 billion core-hours vs. 6.8 billion (=2.12E8 node-hours × 32) core-hours.

III. BACKGROUND

In this section, we describe the Blue Gene/Q supercomputer Mira and the system logs used in our study. An organizational diagram of the Blue Gene/Q system can be found in the IBM BG/Q administrator guide [11] (see Fig. 1-2 in that document).

A. Mira

Mira is a 10-petaflops IBM Blue Gene/Q system operated by Argonne National Laboratory. Mira consists of 49,152 compute nodes across 48 racks. Each rack contains two midplanes, each of which has 32 compute cards (or compute nodes). Every compute node has a PowerPC A2 1600 MHz processor with 16 active cores and 16 GB of DDR3 memory, bringing the total to 786,432 cores for the entire machine. Each compute rack has an I/O drawer, which comes with 8 I/O cards, 8 PCIe Gen2 x8 slots, optical modules, a link module, and a fan assembly.

The Mira system uses IBM's 5D torus network with 2 GB/s chip-to-chip links for connecting the nodes and uses a single network for point-to-point, collective, and barrier communication (in contrast to prior generations of Blue Gene systems). Each node has 10 links with 2 Gb/s bandwidth, with an additional 11th link for communication with the I/O nodes. Links between the midplanes are optical, and links within a midplane are electrical. The 48 compute racks are denoted by R00-R0F, R10-R1F, and R20-R2F, and each rack is composed of two midplanes (denoted M0 and M1). The compute resources are allocated to jobs at the granularity of midplanes. Each compute resource assignment exhibits an allocation block, which is represented as x1x2x3x4x5–y1y2y3y4y5, where xi and yi denote the first and last node index in the ith dimension of the 5D torus network, respectively.

B. Data Source Description: Job, Task, and I/O Behavior Logs

In our study, we combine four system logs—the RAS log, job scheduling log, task execution log, and I/O log—to explore the features of the user jobs based on their execution statuses. All four logs are available for download from the Argonne Leadership Computing Facility (ALCF) website [30].

1) RAS Log: The RAS log is one of the most important system logs because it indicates system reliability. Each item in the RAS log is represented as a specific event with one of three severity levels (INFO, WARN, or FATAL). The fatal events are the most important category because they imply potential system errors [11]. Of the 14 fields, only a few (such as message ID, task ID, and timestamp) are critical to our study; other fields (such as record ID) either are not needed in our analysis or can be derived from fields already included (e.g., the category value is determined by the message ID).

2) Cobalt Job Log: The job-scheduling log, or Cobalt log, is another critical log. It contains detailed information about the submitted jobs, including submission, scheduling, and completion timestamps; the number of nodes or cores requested or used; the physical execution tasks during execution; applications and projects; and the termination status.

The Cobalt job log comprises 57 fields, of which 15 fields are selected in our study, as listed in Table I. Other fields either are not needed (e.g., machine name is always valued as "mira") or can be derived from the 15 key fields: for instance, job name is always named as cobalt jobID.mira; start date (such as 20130401) can be derived from start timestamp (such as 2013-04-01 00:01:11.000000); and the number of cores used is equal to 16×nodes used, because each node has 16 cores.

TABLE I
KEY FIELDS OF THE COBALT JOB LOG
Field | Description | Examples
cobalt jobID | user's job ID | 67928, 67931
queued timestamp | job submission moment | 2013-03-31 21:16:27.000000
start timestamp | start execution moment | 2013-03-31 21:36:43.000000
end timestamp | end execution moment | 2013-04-01 00:07:38.000000
user ID | hashed user name | 50587932556210
project ID | hashed project name | 3041172680929
queue name | queue name | prod-short, backfill, prod-long
wall time | requested wall time | 4200, 9000, 21600
runtime | execution length | 4200, 9000, 21600
nodes used | number of nodes used | 1024, 2048, 32768
nodes requested | number of nodes requested | 1024, 2048, 32768
location | execution block | MIR-48400-7B771-1024
exit code | exit status | 0, 143, 137, 139
mode | execution mode | script, c1, c4
num of tasks | number of execution tasks | 1, 9, 12, 21

During the 5.5 years (2,001 days: from 04/09/2013 to 09/30/2018), there are a total of 377,531 jobs, with a rather nonuniform distribution of the number of jobs submitted each day. Specifically, the minimum and maximum job-submission counts within one day are 0 and 1,788, respectively. Note that the minimum compute resource allocation per job (i.e., the number of cores actually used per job) is required to be no less than one midplane (i.e., 8,192 cores), leading to a huge total number of core-hours consumed—specifically, up to 32.44 billion core-hours in total in the Mira system during the 2,001 days—arguably the largest amount in any job resilience study, to the best of our knowledge.

3) Task Execution Log: The task execution log involves physical execution information such as the allocation block, the detailed execution status, and the corresponding Cobalt job ID. We can combine the RAS log and the Cobalt job log to do a joint analysis.

Each job may go through two stages—queuing and execution—throughout its lifetime. The job execution stage is composed of one or multiple consecutive or parallel execution phases, each handled by a particular task. A task is a finer execution unit that completes the work for a job. Hence, by investigating the task execution log, one can understand a job's detailed execution history. The task execution log comprises 21 fields. Our study involves mainly 7 of these fields (as listed in Table II) because the other fields are not relevant to our study. The whole log includes over 2.6 million task execution records, which means that each job involves 7 tasks on average.

TABLE II
KEY FIELDS OF THE TASK EXECUTION LOG
Field | Description | Examples
taskID | ID of task | 184483
userID | user's ID | 82945435253412
location | allocated execution block | MIR-48400-7B771-1024
start timestamp | starting moment | 2013-04-01 00:04:31.760919
end timestamp | ending moment | 2013-04-01 00:07:18.536998
cobalt jobID | ID of user's job | 67928, 67931
exit signal | exit status of task | 9, 15
err text | description of exit status | "abnormal termination by signal 9 from rank 13461"

4) Darshan I/O Characterization Log: The I/O log used in our analysis was generated by the lightweight I/O behavior-monitoring tool Darshan [5], [6] (which received an R&D 100 award in 2018). The Darshan log records the I/O behavior of Cobalt jobs in the system (since 01/01/2014), including properties such as patterns of access within files. This characterization can shed important light on the I/O behavior of applications at extreme scale. Based on the Darshan log, we analyze the I/O behaviors of the user jobs and their potential correlations with the execution status.

The log comprises a total of 149 fields. The key fields include the total number of bytes read/written, the highest offset in the file that was read/written, and the number of POSIX/MPI reads/writes. Other fields either are not needed for our study (e.g., the values of MACHINE NAME and CP DEVICE are always mira and 0, respectively) or can be derived from other information (e.g., RUN DATE ID can be derived from the job execution timestamp).
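To make the joint analysis concrete, the sketch below shows one way the four logs could be combined around the Cobalt job ID. The column names follow Tables I and II, but the CSV file names and the exact export format are assumptions for illustration; they are not part of the ALCF release.

```python
# Sketch of the joint analysis, assuming the logs were exported as CSV files
# with the field names of Tables I and II (file names are hypothetical).
import pandas as pd

jobs  = pd.read_csv("cobalt_job_log.csv")      # one row per Cobalt job
tasks = pd.read_csv("task_execution_log.csv")  # ~7 task records per job on average
io    = pd.read_csv("darshan_io_log.csv")      # one row per instrumented job

# Attach every execution task to its job (a job may have many tasks).
job_tasks = tasks.merge(jobs, on="cobalt_jobID", how="left",
                        suffixes=("_task", "_job"))

# Attach the Darshan I/O counters; jobs without Darshan records keep NaN.
job_full = job_tasks.merge(io, on="cobalt_jobID", how="left")

# Treat a job as failed if any of its tasks ended with a nonzero signal.
failed_ids = job_full.loc[job_full["exit_signal"] != 0, "cobalt_jobID"].unique()
print(f"{len(failed_ids)} jobs have at least one abnormal task")
```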

IV. JOB FAILURE FEATURE ANALYSIS METHODOLOGY

In this section, we describe our analysis method, which combines four data sources—RAS events, user jobs, execution tasks, and I/O behaviors. Fig. 1 illustrates the process.

[Fig. 1. Illustration of System Logs and Joint Analysis: the task execution log, Cobalt job scheduling log, RAS log, and Darshan I/O log are joined by mapping tasks to jobs, RAS events to tasks, exit signals to jobs, and I/O behaviors to jobs; the combined data feed the analysis of fatal events and of the correlation between fatal events and failed jobs (Section VI), and the overall job features and job failure features (Section V).]

First, we need to identify the abnormal job terminations as well as the different failure types, based on the Cobalt job-scheduling log and the task execution log. Accurately extracting the jobs' exit statuses (normal or failure) is nontrivial because the nonzero exit status values recorded in the Cobalt log refer to all the possible failed jobs from the perspective of the scheduling system. In fact, we observe that many jobs terminate with nonzero exit codes according to the Cobalt log, yet the description field of the task execution log indicates that they terminated normally from the perspective of the users. The nonzero status values in the Cobalt log could be due to the users' customized exit statuses or to missing exit codes.

To ensure that our analysis is based on the correct termination status values of the jobs, we map all task executions to their corresponding jobs and determine each job's exit status by the termination signals of the execution task(s) and the corresponding descriptions in the task execution log. We categorize the failed jobs as follows.

• Timeout: We first select the failed jobs with at least one abnormal execution task (recorded in the task execution log) and then compare each job's real execution time (calculated from the Cobalt log) with its wall time request. The failed jobs with overlong execution times are categorized as timeout jobs.

• Bug: The jobs with serious termination signals, such as SIGABRT (due to an abort assertion or a double free of memory) and SIGSEGV (segmentation fault), are grouped in the bug category.

• Kill: Some jobs were killed in the middle of execution (with signal 9), although they were not due to a code bug or to the execution time exceeding the wall time. We group such jobs in the kill category.

• IO: We observed that a number of abnormal tasks are related to files stored on I/O nodes, so we group them in the IO category. Note that the corresponding failed jobs are not due to file system issues but to user mistakes in managing files or operations. The following are examples of task failure messages: "Load failed on Q1G-I4-J00: Changing to working directory failed, errno 2 No such file or directory", "Load failed on Q0H-I2-J06: Reading data from application executable failed, errno 21 Is a directory", and "Load failed on Q0H-I4-J06: No authority to application executable, errno 13 Permission denied".

• RAS: According to the task execution log, some tasks were killed because of system reliability issues (i.e., RAS events). We classify the corresponding jobs in the RAS category. These jobs are critical to understanding the impact of the system events on the job executions.

• Unknown: A few jobs terminate for unknown reasons (e.g., missing messages).

• SIGILL/SIGTRAP/SIGFPE: Three more types of termination signals exist, which we put in the categories SIGILL, SIGTRAP, and SIGFPE, respectively. They correspond to signals 4, 5, and 8, to be detailed later.

In order to reveal the mean time to interruption (MTTI) from the perspective of users, studying only RAS fatal events is not enough, because many fatal events may not actually affect the jobs' execution. As mentioned in Di et al.'s analysis work [2], the fatal events recorded in the RAS log refer to potential issues, which could be fixed by the system's self-healing mechanism automatically in time or may not affect any submitted jobs at runtime. Accordingly, to understand the real impact of the system's fatal events on users' jobs, we must investigate the job-scheduling log and the task execution log. In addition, to understand the I/O behaviors of the failed jobs, we combine the Darshan I/O behavior log with the job log.

Based on the Cobalt job-scheduling log, we analyze statistical features such as the distribution of jobs based on exit statuses for both normal jobs and failed jobs. We also characterize the distributions of specific metrics such as job queuing time, execution time, and I/O behavior, and we explore the best-fit distribution of a job's execution time by using the maximum likelihood estimation (MLE) method. These analyses disclose the job failure features from the perspective of users, leading to significant benefits: (1) helping system administrators understand the behavior and root cause of the failed jobs more deeply, as well as make a comparison with the normal jobs; and (2) helping fault tolerance researchers or users understand the system's reliability with respect to user jobs and the MTTI, in order to improve fault tolerance for their applications.
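The categorization rule above can be expressed as a small decision function. The sketch below is a simplified version under our own assumptions: the signal numbers (4, 5, 6, 8, 9, 11) are standard POSIX values, while the timeout test and the err_text keywords are illustrative stand-ins rather than the paper's exact rules.

```python
# Simplified job categorization based on a job's task termination signals,
# the err_text descriptions, and its runtime vs. requested wall time.
def classify_job(signals, err_texts, runtime, walltime):
    """signals: exit signals of the job's tasks; err_texts: their descriptions."""
    if all(s == 0 for s in signals):
        return "Normal"
    if runtime >= walltime:                         # killed for exceeding wall time
        return "Timeout"
    if any("RAS" in t for t in err_texts):          # killed by a fatal RAS event
        return "RAS"
    if any(s in (6, 11) for s in signals):          # SIGABRT / SIGSEGV
        return "Bug"
    if any("Load failed" in t for t in err_texts):  # file/permission mistakes
        return "IO"
    if any(s == 9 for s in signals):                # plain SIGKILL
        return "Kill"
    return {4: "SIGILL", 5: "SIGTRAP", 8: "SIGFPE"}.get(signals[0], "Unknown")

# Example: a task killed by signal 9 after exceeding its 4200 s wall time.
print(classify_job([9], ["killed by signal 9"], runtime=4500, walltime=4200))
```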

We also explore how the jobs' normal terminations and different types of abnormal terminations correlate with other significant attributes and metrics, including wall time requested, real execution time, number of nodes used, total core-hours, user name, project name, queue name, allocated resource locations, and machine partition. We construct contingency tables (a contingency table is a matrix that displays the frequency distribution, i.e., the number of value combinations, across two fields) for the attributes and metrics versus a job's exit status. A contingency table contains rich information related to the mutual correlation, based on which one can understand the detailed frequency for each value combination, so we mainly demonstrate the correlations using contingency tables in our analysis. In addition, we adopt a χ2 statistic analysis to assess whether some attribute or field is likely correlated with the exit status. Specifically, if the calculated χ2 statistic is greater than the critical value (with a confidence level of 99.9%) from the chi-square distribution, we can claim that the two categories are not independent.

We split the attributes into two categories—identity type and number type—that are handled separately in our analysis. The identity-type attributes each have a relatively low number of values, and each value can be represented as a text string. For instance, user name, project name, and machine partition all belong to the identity-type category. In our analysis, we build a hash table for each identity-type attribute and calculate the corresponding probability regarding different execution exit statuses. By comparison, the number-type attributes (such as execution time and core-hours) are the fields whose values are recorded in the form of numbers (e.g., integer or floating-point values). Because of the unlimited or numerous values for each of the number-type attributes, we split their values into consecutive intervals in log scale, which significantly improves the analysis efficiency because of the considerably reduced time complexity and memory overhead. The execution times, for instance, are split into 10 intervals: [0, 10 minutes), [10 minutes, 20 minutes), [20 minutes, 40 minutes), ··· for our probability analysis. Table III lists the four number-type attributes and their consecutive intervals in our study.

TABLE III
CLASSIFICATION POLICIES OF NUMBER-TYPE ATTRIBUTES
Attribute | Policy | # Intervals
real execution time | [0,10m), [10m,20m), [20m,40m), ···, [85.3h,∞) | 10
core-hours | [0,1000), [1000,2000), [2000,4000), ···, [512,000,∞) | 10
# consecutive tasks | [0], [1,10], [11,20], [21,30], ···, [641,1320], [1320,∞) | 10
# multilocation tasks | [0], [1,10], [11,20], [21,30], ···, [641,1320], [1320,∞) | 10

We also explore the correlation between RAS fatal events and job exit status. The RAS log has many duplicated messages, so we have to perform duplication filtering before analyzing the system fatal events. To this end, we adopt the weighted-similarity-based spatiotemporal message filter that was developed in the open-source LogAider tool [31]. We also improve the filtering ability by excluding fatal events occurring during the system maintenance periods and by taking into account the system reservation periods marked by the system administrator. The total number of fatal events can be reduced to 1,299, compared with the original 2.6 million duplicated fatal messages in the 2,001-day logging period. The mean time between fatal events (MTBFE) is about 1.54 days. Note that the fatal events here do not represent system failures or interruptions from the perspective of users but represent all the "potential" severe issues of the system. In fact, some fatal events may not affect user jobs at all, although they really caused malfunctions in some parts of the system. Using our elaborate mapping from RAS events to job failures, we calculate the MTTI to be 3.5 days for the whole Mira system in terms of user jobs. We also identify the locality features of the system's fatal events that affect the job executions.

V. EXPLORATION OF JOB FAILURE PROPERTIES

In this section, we explore the job failure properties by investigating the Cobalt job log, task execution log, and I/O behavior log. In addition, we compare the overall job features and normal job features, in order to identify the specific features of the failed jobs and their correlations with other metrics. We highlight takeaways/lessons in the following text.

A total of 377,144 jobs were submitted or scheduled during the 5.5 years considered for this study. Based on our analysis using the task execution log, we classified 99,245 of these jobs as abnormal, although 116,787 jobs terminated with nonzero exit statuses according to the Cobalt job-scheduling log. We summarize the 10 most frequent types of terminations (including normal jobs) in Table IV. We observe that about three-fourths of the jobs exited normally in the end. Of the one-fourth of the jobs that terminated abnormally, a large majority, 55.8%, were killed by the system because of timeout (i.e., execution time exceeding the wall time requested). This cause differs significantly from the characterization of Blue Waters jobs [8], of which 14% of the failed jobs were due to timeout. The jobs that exited because of code bugs constituted 7.25%, and the jobs terminating because of a problematic script or operation (e.g., no such file/directory) constituted about 0.94%. The task execution log records the related RAS event ID for any job that failed because of system reliability, as well as the corresponding Cobalt job ID, based on which we can calculate the system-reliability-related job failures. Our characterization shows that only 0.17% of the jobs failed because of system reliability, representing about 0.6% of all the failed jobs. (Takeaway 1): A large majority (99.4%) of job failures were attributed to user behaviors instead of system reliability. This compares with other reports (98.5% on Blue Waters [8] and 97% on Franklin (at NERSC) [32]).

In the following text, we divide the job termination types into a fine granularity (9 categories) and determine their correlations with many important attributes, including user name, project name, job execution structure (core-hours, execution scale, job length), resource allocation, I/O behavior, and RAS events. This analysis differs from other work [8] focusing mainly on overall statistics regarding resource usage (such as core-hours).
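The χ2 test and the log-scale binning can be reproduced with standard tools. The sketch below uses scipy's chi-square test on a contingency table built from synthetic, illustrative counts (not the actual Mira numbers); the binning follows the spirit of Table III.

```python
# Log-scale binning of a number-type attribute and a chi-square independence
# test on the resulting contingency table (synthetic data for illustration).
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "corehours": rng.lognormal(mean=9, sigma=2, size=10_000),
    "exit": rng.choice(["NM", "TO", "BG"], size=10_000, p=[0.74, 0.15, 0.11]),
})

# Consecutive log-scale intervals [0,1k), [1k,2k), [2k,4k), ..., [256k, inf).
edges = [0] + [1000 * 2 ** i for i in range(9)] + [np.inf]
df["ch_bin"] = pd.cut(df["corehours"], bins=edges, right=False).astype(str)

table = pd.crosstab(df["ch_bin"], df["exit"])        # contingency table
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")
# As in the paper, the attribute and the exit status are treated as
# non-independent when chi2 exceeds the 99.9% critical value for this dof.
```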

TABLE IV
TASKS' EXIT STATUS AND DESCRIPTION
Root Cause | Percentage | Description
Normal | 73.79% | normal job termination (e.g., with exit code 0)
Timeout | 14.62% | timeout, i.e., killed because of exceeding the wall time
Bug | 7.35% | termination because of serious bugs (such as SIGABRT and SIGSEGV)
Kill | 2.84% | SIGKILL signal: killed in the middle of execution
IO | 0.94% | I/O-related issue, such as 'no such file/directory' and 'permission issue'
RAS | 0.17% | jobs killed by the system's RAS fatal events
Unknown | 0.13% | abnormal termination with the exit signal 36
SIGILL | 0.065% | signal 4: illegal instruction (e.g., mismatched CPU architecture or permission issue)
SIGTRAP | 0.062% | signal 5: caught exceptions due to possible bugs during debugging
SIGFPE | 0.014% | signal 8: erroneous arithmetic operation, such as division by zero

A. Features Based on Users and Projects

Mira had a total of 1,295 users and 627 projects throughout the 2,001 days of usage. To understand their distinct features, we characterize the distribution of the number of jobs and core-hours across users and projects, as shown in Fig. 2. In the figure, we observe that the job counts and core-hours per user and per project differ widely, following a typical Pareto principle, or 80/20 rule (i.e., a large majority of the jobs or core-hours are actually attributed to only a very small population). In absolute terms, we have the following takeaway. (Takeaway 2): Only 15% of the users contribute 80.4% of the jobs and 88% of the core-hours, which means that the job count features actually roughly follow an 85/15 rule.

[Fig. 2. Job Count/Core-Hours Based on Users/Projects — (a) Distribution of Job Count; (b) Distribution of Core-Hours (CDFs across users and projects).]

In addition, our characterization shows that for most users and projects, the number of normal jobs is about 1.5∼2X as high as the number of failed jobs, and the job failure ratios (i.e., the ratio of the failed job count to the total job count per user/project) can differ significantly across users/projects. Specifically, about 12% of the users have 0 failures, and about 35% of the users suffer from a failure ratio of 50+%. Hence, it is worth investigating the failed-job features based on users and projects. We will discuss this issue with specific exit codes in the following text.

In what follows, we divide the job failure counts based on different exit statuses in terms of the task execution log. Because of space limits, we use abbreviations (or exit codes) to represent the exit statuses listed in Table IV, in the same order. For instance, NM, TO, and BG refer to "Normal," "Timeout," and "Bug," respectively. We present the contingency tables with user/project and exit codes in Table V and Table VI, respectively. Since there are many users and exit codes, we select the 10 most frequent exit codes and the top 10 users with the highest numbers of jobs for our analysis, without loss of generality. From these two tables, we observe that the majority of jobs terminate normally for each user or project. The χ2 significance value based on Table V, for example, is calculated as 50,026. Hence, we have the following takeaway. (Takeaway 3): Users and projects exhibit a strong correlation with the exit statuses with a confidence level of 99.9% (corresponding to the critical value of 126.1), indicating that particular users and projects often fail for specific reasons (or exit codes).

TABLE V
CONTINGENCY TABLE WITH USER NAME AND EXIT CODE
user | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
u1 | 26689 | 348 | 17 | 31 | 59 | 1 | 1 | 0 | 0 | 0
u2 | 26668 | 68 | 7 | 62 | 13 | 7 | 5 | 4 | 0 | 0
u3 | 13062 | 519 | 304 | 716 | 81 | 50 | 0 | 0 | 0 | 0
u4 | 1963 | 3436 | 1647 | 73 | 139 | 23 | 0 | 0 | 0 | 0
u5 | 6555 | 228 | 43 | 99 | 31 | 25 | 0 | 0 | 0 | 0
u6 | 5139 | 64 | 1310 | 41 | 0 | 5 | 20 | 0 | 0 | 0
u7 | 4719 | 1173 | 27 | 83 | 150 | 3 | 0 | 0 | 0 | 0
u8 | 5536 | 167 | 91 | 79 | 4 | 0 | 1 | 3 | 0 | 0
u9 | 5452 | 103 | 12 | 14 | 49 | 1 | 0 | 0 | 0 | 0
u10 | 4132 | 484 | 775 | 143 | 20 | 0 | 0 | 0 | 0 | 0

TABLE VI
CONTINGENCY TABLE WITH PROJECT NAME AND EXIT CODE
project | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
p1 | 26951 | 679 | 74 | 52 | 75 | 2 | 0 | 0 | 3 | 0
p2 | 19765 | 431 | 50 | 554 | 48 | 40 | 4 | 0 | 0 | 0
p3 | 16168 | 194 | 54 | 173 | 43 | 25 | 2 | 4 | 0 | 0
p4 | 15525 | 428 | 117 | 134 | 52 | 40 | 0 | 0 | 0 | 0
p5 | 8407 | 99 | 88 | 12 | 66 | 1 | 0 | 0 | 0 | 0
p6 | 6764 | 501 | 706 | 151 | 87 | 1 | 3 | 15 | 9 | 4
p7 | 6028 | 274 | 389 | 110 | 8 | 3 | 22 | 5 | 0 | 0
p8 | 4774 | 831 | 907 | 144 | 20 | 2 | 0 | 0 | 0 | 0
p9 | 5755 | 203 | 306 | 55 | 21 | 4 | 11 | 3 | 0 | 0
p10 | 4931 | 847 | 81 | 120 | 22 | 6 | 0 | 4 | 1 | 0

We also characterize the exit codes based on users and projects, as presented in Fig. 3. From the figure, we derive the following takeaway. (Takeaway 4): The majority of job failures are attributed to four categories related to user behaviors: 'timeout,' 'bug,' 'kill,' and 'IO' (or misoperations). Various users (e.g., u1, u3, u6, and u10) have different most-frequent exit codes. Such a feature can help system administrators identify job failures quickly based on the exit code category and the users or projects, thereby improving the daily diagnosis efficiency.

[Fig. 3. Exit Code Distribution Based on Users/Projects — (a) Exit Code Based on Users; (b) Exit Code Based on Projects (fraction of each failure category per user/project).]

B. Features Based on Job Execution Structure

Based on our characterization of all jobs, we formulated the following takeaway. (Takeaway 5): Different jobs may have largely different job execution structures, such as the number of nodes and the number of tasks, as shown in Fig. 4. The majority of the jobs have relatively small or medium numbers of nodes or tasks, while a few jobs each may have numerous nodes or tasks. This situation motivates us to explore the correlations between job structure and failure types.
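A quick way to check the 85/15 behavior in Takeaway 2 is to rank users by job count and core-hours and measure the share contributed by the top 15%. The sketch below assumes a jobs DataFrame with the user_ID, nodes_used, and runtime fields of Table I, with runtime assumed to be in seconds.

```python
# Share of jobs and core-hours contributed by the top 15% of users
# (illustrative check of Takeaway 2; column names follow Table I).
import pandas as pd

def top_share(jobs: pd.DataFrame, frac: float = 0.15):
    # core-hours = nodes * 16 cores/node * runtime (assumed in seconds) / 3600
    jobs = jobs.assign(corehours=jobs["nodes_used"] * 16 * jobs["runtime"] / 3600)
    per_user = jobs.groupby("user_ID").agg(njobs=("corehours", "size"),
                                           corehours=("corehours", "sum"))
    k = max(1, int(frac * len(per_user)))
    return (per_user["njobs"].nlargest(k).sum() / per_user["njobs"].sum(),
            per_user["corehours"].nlargest(k).sum() / per_user["corehours"].sum())

# Toy example: a few heavy users dominate both job count and core-hours.
toy = pd.DataFrame({"user_ID": [1] * 50 + [2] * 30 + [3] * 15 + list(range(4, 24)),
                    "nodes_used": 512, "runtime": 3600})
print(top_share(toy))   # roughly (0.83, 0.83) for this toy log
```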

[Fig. 4. Execution Scale of All Jobs — (a) Number of Nodes (CDF over the number of nodes used and requested); (b) Number of Tasks (CDF over the number of consecutive and multilocation tasks per job).]

In this study, we focused on four important job structures: the job's core-hours, the job's tasks, the number of nodes, and the execution length. Specifically, we explored the correlations between each of these structures and the different exit statuses.

1) Features Based on Job's Core-Hours: Table VII shows the contingency table between jobs' core-hours and exit codes with log-scale classification. The table presents the top 10 core-hour intervals with the highest numbers of jobs. (Takeaway 6): We can observe a very strong correlation between the core-hours and the exit codes. Most of the normal jobs consumed either a small or a medium amount of core-hours (in the range of [0,1k) and [8k,16k), respectively), while the job-failure exit codes exhibit diverse job distributions. For instance, the "Bug" jobs mainly have small core-hours, most of the "timeout" jobs have moderate-sized core-hours (e.g., [32k,64k)), and the majority of the "RS" jobs have fairly large core-hours (such as [256k,512k)). This diversity is due to the diverse purposes or features of various job executions: for instance, the jobs with small core-hours are likely used to debug codes, and the jobs with large core-hours may have a higher chance of being affected by the system's fatal RAS events. Based on Fig. 5, we conclude that jobs with small core-hours tend to have bugs or misoperations, while jobs with large core-hours generally have bug-free implementations but are likely to be affected by timeout or system reliability issues.

TABLE VII
CONTINGENCY TABLE WITH CORE-HOURS AND EXIT CODE
core-hours | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
[0,1k) | 67580 | 652 | 10466 | 2420 | 2415 | 20 | 152 | 213 | 164 | 22
[1k,2k) | 17631 | 1356 | 2900 | 1688 | 283 | 46 | 48 | 15 | 13 | 1
[2k,4k) | 18782 | 1511 | 2078 | 1568 | 232 | 21 | 47 | 4 | 19 | 1
[4k,8k) | 34040 | 2863 | 3412 | 1265 | 168 | 48 | 89 | 1 | 8 | 1
[8k,16k) | 52228 | 7681 | 2383 | 935 | 118 | 29 | 32 | 2 | 6 | 10
[16k,32k) | 26498 | 8859 | 1608 | 765 | 97 | 38 | 44 | 6 | 18 | 3
[32k,64k) | 23087 | 12853 | 1586 | 761 | 62 | 46 | 39 | 6 | 3 | 7
[64k,128k) | 15732 | 7508 | 1135 | 469 | 60 | 68 | 17 | 0 | 3 | 0
[128k,256k) | 9493 | 4911 | 1328 | 281 | 27 | 74 | 6 | 0 | 0 | 3
[256k,512k) | 13209 | 6926 | 833 | 556 | 72 | 251 | 15 | 0 | 1 | 6

[Fig. 5. Exit Code Distribution Based on Core-Hours — fraction of each exit code in each core-hour interval from [0,1k) to [256k,512k).]

In addition, we compute the sum of core-hours based on the different exit codes, as shown in Table VIII, in order to characterize the possible core-hour waste caused by the various job failures. (Takeaway 7): Normal jobs contributed the major share of core-hours (58% of the total), while the timeout job failures also take a considerable portion of core-hours (32.86% of the total), implying the high significance of a checkpointing/restart mechanism across job boundaries for the execution continuation of HPC applications. In fact, many existing applications (such as HACC [33]) on Mira have their own checkpointing mechanisms to avoid expensive recomputation upon failures or timeout issues.

TABLE VIII
SUM OF CORE-HOURS BASED ON EXIT CODE
NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
1.87E10 | 1.07E10 | 1.4E9 | 9.65E8 | 1.08E8 | 4.27E8 | 2.85E7 | 4.75E5 | 2.76E6 | 7.4E6

We further characterize the breakdown of core-hours and other statistics based on log-scale execution time intervals in Table IX. We have the following takeaway. (Takeaway 8): The majority of the core-hours consumed by timeout jobs are contributed by the jobs with relatively long execution times, in the range of [5h20m, 42h40m). In order to mitigate the possibly lost core-hours, the long-execution jobs are highly recommended to output their simulation data during the execution or to be protected by fault-tolerance techniques such as a checkpointing/restart mechanism.

TABLE IX
STATISTICS OF TIMEOUT JOBS IN LOG-SCALE EXECUTION INTERVALS
Exe. Time Intervals | Job Counts | Sum of Core-hours | Mean # cores | Mean Exe. Time
[0,10m) | 1,104 | 3824449 | 32278.3 | 0.1 h
[10m,20m) | 2,664 | 2.11632E7 | 36969 | 0.21 h
[20m,40m) | 4,691 | 1.09938E8 | 49063 | 0.47 h
[40m,1h20m) | 10,992 | 2.997808E8 | 28162 | 0.99 h
[1h20m,2h40m) | 6,762 | 4.658492E8 | 36272 | 1.9 h
[2h40m,5h20m) | 10,302 | 9.362381E8 | 24073.4 | 3.86 h
[5h20m,10h40m) | 12,391 | 2.4776273E9 | 31256.7 | 6.25 h
[10h40m,21h20m) | 5,290 | 2.570157E9 | 39687.1 | 12.1 h
[21h20m,42h40m) | 922 | 3.7690924E9 | 170326.1 | 24.1 h
[42h40m,...) | 2 | 8304012 | 98304.0 | 71.25 h

2) Features Based on Job Tasks: We identify three types of jobs in terms of their tasks: single-task jobs, consecutive-task jobs, and multilocation-task jobs, which are completely controlled by users (specifically, by tuning the submission mode such as "script" or "cn"). Exploring the features based on job tasks may shed light on how to mitigate job failures by tuning the job execution types. (Takeaway 9): The system-reliability-related job failures (category "RAS") happen mainly to jobs with small task counts (either consecutive or multilocation tasks). In fact, the jobs with few tasks may also have many node-hours, explaining their more frequent failures. This takeaway suggests that users should not submit jobs each with very few tasks, in order to mitigate unexpected job failures.

TABLE X
CONTINGENCY TABLE WITH CONSECT./MULTIL. TASKS AND EXIT CODE
# tasks | NM | TO | BG | IO | KL | UK | RS | ST | SI
Consecutive Tasks
[1,10] | 19763 | 1555 | 1114 | 284 | 365 | 51 | 31 | 25 | 7
[11,20] | 7004 | 769 | 184 | 112 | 62 | 2 | 1 | 0 | 5
[21,40] | 1947 | 904 | 136 | 46 | 61 | 2 | 1 | 0 | 0
[41,80] | 1155 | 620 | 64 | 90 | 26 | 1 | 1 | 0 | 0
[81,160] | 307 | 318 | 60 | 17 | 15 | 0 | 0 | 0 | 0
[161,320] | 179 | 181 | 27 | 21 | 10 | 0 | 2 | 0 | 2
[321,640] | 49 | 78 | 12 | 2 | 2 | 0 | 0 | 0 | 0
[641,1280] | 21 | 58 | 9 | 3 | 2 | 0 | 0 | 0 | 0
[1280,2560] | 81 | 29 | 1 | 2 | 3 | 0 | 0 | 0 | 0
[2561,...] | 14873 | 78 | 0 | 2 | 1 | 0 | 0 | 0 | 0
Multilocation Tasks
[1,10] | 4905 | 871 | 296 | 55 | 156 | 0 | 28 | 1 | 0
[11,20] | 705 | 345 | 274 | 11 | 56 | 3 | 4 | 0 | 0
[21,40] | 715 | 270 | 123 | 6 | 90 | 0 | 1 | 0 | 0
[41,80] | 375 | 272 | 125 | 5 | 37 | 0 | 0 | 0 | 0
[81,160] | 330 | 189 | 30 | 4 | 34 | 1 | 1 | 0 | 0
[161,320] | 97 | 117 | 53 | 5 | 34 | 0 | 2 | 0 | 2
[321,640] | 39 | 86 | 10 | 2 | 4 | 0 | 0 | 0 | 0
[641,1280] | 15 | 66 | 9 | 5 | 3 | 0 | 0 | 0 | 0
[1281,2560] | 67 | 27 | 0 | 0 | 3 | 0 | 0 | 0 | 0
[2561,...] | 14873 | 41 | 1 | 2 | 1 | 0 | 0 | 0 | 0

[Fig. 6. Exit Code Distribution Based on Consecutive/Multilocation Tasks — (a) Consecutive Tasks; (b) Multilocation Tasks (x-axis: # tasks per job; y-axis: exit code distribution).]

3) Features Based on Job Execution Scale: We characterize the contingency table of resource location versus exit code in Table XI. The resource location is represented as xxxxx-yyyyy-NNNN, where xxxxx-yyyyy refers to the resource block allocated to the submitted job and NNNN indicates the number of nodes. For instance, 44000-77FF1-8192 corresponds to the block with 16 midplanes (R18-R1F). (Takeaway 10): The table shows that the exit codes have a relatively strong correlation with the number of nodes. For instance, most of the exit codes, such as "TO," "BG," and "RS," occur more frequently on the 8192-node blocks than on the 512-node blocks, while the exit codes "SI" and "ST" exhibit a higher frequency on the 512-node blocks. This feature can also be verified by the contingency table based on the number of nodes (as shown in Table XII). However, the correlation is hardly observed among same-size blocks (such as xxxxx-yyyyy-8192), especially for the high-frequency exit codes (such as "NM" and "TO"), indicating that the user-behavior-based job failures are not correlated with locality. This observation is consistent with our χ2 significance test, where the χ2 statistic of the locations vs. exit code in the table is far greater than the 99.9%-confidence-level threshold (437 vs. 126), but the χ2 values of the same-size locations vs. exit codes are lower than the χ2 thresholds, meaning that the exit codes cannot be considered correlated with the same-block-size locations. In Table XI, for example, the χ2 value of all the 512-node blocks is 64.6, which is lower than the χ2 threshold (80.1).

TABLE XI
CONTINGENCY TABLE WITH RESOURCE LOCATION AND EXIT CODE
location | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
MIR-48000-7BFF1-8192 | 2739 | 693 | 351 | 118 | 40 | 36 | 3 | 1 | 2 | 4
MIR-04000-37FF1-8192 | 2583 | 666 | 353 | 118 | 29 | 21 | 4 | 1 | 0 | 4
MIR-08000-3BFF1-8192 | 2463 | 739 | 299 | 116 | 31 | 42 | 6 | 0 | 0 | 3
MIR-44000-77FF1-8192 | 2174 | 694 | 305 | 128 | 23 | 32 | 5 | 0 | 1 | 2
MIR-40C40-73F71-512 | 2621 | 588 | 195 | 96 | 42 | 1 | 2 | 6 | 2 | 0
MIR-40C00-73F31-512 | 2423 | 533 | 217 | 93 | 37 | 0 | 2 | 2 | 3 | 2
MIR-40800-73B31-512 | 2384 | 493 | 217 | 94 | 36 | 1 | 5 | 5 | 5 | 0
MIR-00C80-33FB1-512 | 2339 | 461 | 253 | 74 | 22 | 3 | 6 | 1 | 6 | 0
MIR-40CC0-73FF1-512 | 2308 | 498 | 206 | 90 | 23 | 2 | 1 | 1 | 3 | 1
MIR-00800-33B31-512 | 2282 | 497 | 218 | 76 | 31 | 4 | 2 | 3 | 1 | 0

TABLE XII
CONTINGENCY TABLE WITH # NODES USED AND EXIT CODE
# nodes | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
512 | 144641 | 29145 | 12771 | 5918 | 2379 | 99 | 229 | 222 | 166 | 25
1024 | 66732 | 9074 | 6265 | 1782 | 299 | 54 | 76 | 5 | 8 | 0
2048 | 29839 | 8047 | 3611 | 1307 | 318 | 62 | 59 | 7 | 18 | 7
4096 | 16289 | 3676 | 2024 | 723 | 218 | 82 | 32 | 5 | 30 | 8
8192 | 12548 | 3499 | 1639 | 614 | 161 | 186 | 23 | 4 | 4 | 14
12288 | 1178 | 144 | 89 | 39 | 11 | 23 | 0 | 0 | 1 | 0
16384 | 4574 | 1070 | 885 | 237 | 82 | 65 | 59 | 4 | 6 | 0
24576 | 263 | 40 | 29 | 8 | 3 | 3 | 0 | 0 | 0 | 0
32768 | 1485 | 335 | 323 | 63 | 35 | 53 | 9 | 0 | 2 | 0
49152 | 711 | 90 | 93 | 17 | 28 | 14 | 2 | 0 | 0 | 0

In Fig. 7(a) we characterize the exit code distribution based on the number of nodes. We can clearly observe the following. (Takeaway 11): The fractions of the three most frequent exit codes ("TO," "Bug," and "Kill") are relatively consistent across different numbers of nodes; in contrast, the "RS" exit code exhibits a higher chance on relatively large numbers of nodes used per job (the most error-prone execution scales are 12,288 nodes and 32,768 nodes), in that a larger execution scale involves more resources.

By contrast, as shown in Fig. 7(b), we have the following takeaway. (Takeaway 12): The exit code distribution exhibits a very high diversity across different queues, because different queues are created to hold specific groups of jobs with similar features. The jobs in the prod-1024-torus (prodt) queue, for example, likely have bugs, according to the figure. The failed jobs belonging to prod-long are likely due to "timeout" issues. More detailed correlations can be observed in the contingency table of queue name vs. exit code (Table XIII). For instance, the prod-short queue deals mainly with relatively short jobs, so it involves the most jobs for each exit code; SC13 prep is a particular queue reserved for the research prepared for the SC13 conference, so it has a small number of jobs (Table XIII).

TABLE XIII
CONTINGENCY TABLE WITH QUEUE NAME AND EXIT CODE
queue | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
prod-short | 175506 | 38154 | 20659 | 7472 | 2565 | 198 | 331 | 200 | 195 | 33
backfill | 61807 | 5637 | 1379 | 1132 | 354 | 30 | 33 | 6 | 10 | 0
prod-capability | 18872 | 4881 | 2901 | 860 | 269 | 318 | 84 | 7 | 13 | 14
prod-long | 13149 | 5691 | 1368 | 708 | 193 | 63 | 17 | 1 | 9 | 7
prod-1024-torus | 1854 | 92 | 863 | 39 | 5 | 8 | 0 | 0 | 0 | 0
R.bc | 712 | 101 | 140 | 76 | 22 | 1 | 16 | 32 | 0 | 0
R.pm | 814 | 9 | 6 | 37 | 34 | 2 | 1 | 0 | 0 | 0
SC13 prep | 579 | 63 | 111 | 70 | 11 | 1 | 2 | 0 | 0 | 0
backfill-1024-torus | 739 | 11 | 1 | 22 | 2 | 0 | 0 | 0 | 0 | 0
training | 477 | 119 | 43 | 62 | 23 | 0 | 0 | 0 | 7 | 0

[Fig. 7. Exit Code Distribution Based on # of Nodes and Queue Name (we use abbreviations to denote queue names; full names are in Table XIII) — (a) Exit Code Based on # of Nodes; (b) Exit Code Based on Queue Name.]

4) Features Based on Job Queuing/Execution Length: In Fig. 8 we characterize the distribution (CDF) of all jobs' queuing/execution lengths. Based on this, we formulate the following takeaway. (Takeaway 13): Most jobs request a relatively short wall time (50% of the jobs request wall times of less than 1 h), while 5% of the jobs have relatively large wall time requests (≥6 h). The real execution lengths are shorter than the job queuing lengths, since the execution length is bounded by the requested wall time; in contrast, the queuing time can be fairly long (see Fig. 9(a)) because it has no upper bound.

[Fig. 8. General Job Queuing/Execution Time — (a) Queuing Time; (b) Execution Time (CDFs, in hours, with the requested wall time shown for comparison).]

We divide the normal jobs vs. the failed jobs in terms of queuing length and execution length, respectively, in Figs. 9(a) and (b). Based on these figures, we formulate the following takeaway. (Takeaway 14): Long jobs (with either long queuing times or long execution times) tend to have more failure events during their executions. In absolute terms, at the same percentile among normal jobs and failed jobs, the queuing time and execution time of the former are generally only 2/3 and 1/3 as long as those of the latter, respectively.

[Fig. 9. Queuing/Execution Times of Normal Jobs vs. Failed Jobs — (a) Queuing Time; (b) Execution Time (CDFs, in hours).]

We present in Table XIV a detailed breakdown of the exit codes in consecutive log-scale execution time intervals. (Takeaway 15): Unlike the job failure features observed based on execution scales, a long job tends to fail, and the root cause is attributed to user behaviors such as 'timeout,' 'bugs,' and 'misoperations' instead of system reliability. For instance, if a job with an execution time in the range [21h20m, 42h40m) fails, it is likely because of 'timeout,' as shown in Fig. 10. The intuitive explanation is that a relatively long job's execution time is harder to estimate accurately. In this case, the users are recommended to reserve more wall time for their jobs, such that the timeout can be mitigated.

TABLE XIV
CONTINGENCY TABLE WITH EXE. TIME AND EXIT CODE
exe. time | NM | TO | BG | KL | IO | RS | UK | SI | ST | SF
[0,10m) | 91050 | 1104 | 16133 | 4550 | 3088 | 71 | 302 | 235 | 208 | 23
[10m,20m) | 22965 | 2664 | 1933 | 1719 | 105 | 101 | 28 | 5 | 9 | 12
[20m,40m) | 44432 | 4691 | 2763 | 1382 | 95 | 66 | 63 | 2 | 15 | 10
[40m,1h20m) | 44753 | 10992 | 2993 | 959 | 73 | 69 | 57 | 2 | 0 | 0
[1h20m,2h40m) | 28009 | 6762 | 1253 | 695 | 69 | 89 | 8 | 0 | 1 | 1
[2h40m,5h20m) | 24861 | 10302 | 994 | 731 | 49 | 93 | 13 | 3 | 2 | 3
[5h20m,10h40m) | 16646 | 12391 | 1441 | 491 | 41 | 80 | 12 | 0 | 0 | 3
[10h40m,21h20m) | 4727 | 5290 | 204 | 141 | 9 | 57 | 6 | 0 | 0 | 2
[21h20m,42h40m) | 821 | 922 | 14 | 40 | 5 | 15 | 0 | 0 | 0 | 0
[42h40m,...) | 16 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0

[Fig. 10. Exit Code Distribution Based on Execution Times — fraction of each exit code per log-scale execution time interval.]

In addition to characterizing the exit code distributions over job execution length, we explore the best-fit distribution type for job length (or time to interruption) based on both overall job failures and different exit codes, by leveraging the MLE method. The distribution fitting results (the top 3 out of 20 candidate distributions) are presented in Fig. 11 and Fig. 12. (Takeaway 16): The best-fitting distributions of failed jobs' execution lengths are Weibull, Pareto, inverse Gaussian, LogNormal, and Erlang/exponential, for RAS-based job failures (i.e., related to system reliability), bug-based job failures (due to code bugs), I/O-based job failures (due to users' misoperations), and timeout job failures (due to wrong submission settings), respectively. Our distribution-fitting analysis provides the specific best-fit distributions for specific job failures. Our approach differs from traditional best-fit distribution analysis focusing mainly on either potential fatal events [2] or overall failure rates based on different types of applications [8]. Our analysis can help fault tolerance researchers emulate job failure events or intervals.

[Fig. 11. MLE Fitting of Job Runtime Distribution — (a) Normal Jobs; (b) Abnormal/Failed Jobs (CDFs of job length, in hours, with the top three MLE-fitted distributions).]
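The per-category distribution fitting can be reproduced with maximum likelihood estimates from scipy. The sketch below fits a few candidate distributions to one set of failed-job lengths and ranks them by log-likelihood; the data are synthetic, and only four of the twenty candidate distributions used in the paper are shown.

```python
# MLE fitting of failed-job execution lengths and ranking of candidate
# distributions by log-likelihood (synthetic lengths, in hours).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lengths = rng.weibull(1.3, size=2000) * 6.0          # stand-in for one exit code

candidates = {
    "weibull":     stats.weibull_min,
    "lognormal":   stats.lognorm,
    "invgauss":    stats.invgauss,
    "exponential": stats.expon,
}

scores = {}
for name, dist in candidates.items():
    params = dist.fit(lengths, floc=0)               # MLE with location fixed at 0
    scores[name] = np.sum(dist.logpdf(lengths, *params))

best = max(scores, key=scores.get)
print(f"best fit by log-likelihood: {best}")
```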

9 1 1 1 1 0.9 0.9 0.94 0.7 0.92 0.8 0.8 0.8 0.68 0.8 0.9 0.66 0.88 0.6 0.6 0.7 0.64 0.7 0.86 0.62 0.4 0.4 0.6 0.6 0.6 0.84 0.58 0.82 0.2 0.2 0.5 0.56 0.5 0.8 0 0 0.54 CDF 0.78 CDF [0,512MB)[512MB,1GB)[1GB,2GB)[2GB,4GB)[4GB,8GB)[8GB,16GB)[16GB,32GB)[32GB,64GB)[128GB,256GB)[256GB,512GB) [0,512MB)[512MB,1GB)[1GB,2GB)[2GB,4GB)[4GB,8GB)[8GB,16GB)[16GB,32GB)[32GB,64GB)[64GB,128GB)[128GB,256GB) 0.4 0.52 0.4 0.76 Exit Code Distribution Exit Code Distribution 0.5 0.3 1 1.5 2 2.5 3 3.5 4 4.5 0.3 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 Real Failed Job Length Real Failed Job Length 0.2 1.Weibull 0.2 1.InverseGaussian 0.1 2.Pearson6 0.1 2.Weibull 3.Lognormal 3.Lognormal 0 0 # Bytes Read # Bytes Written 0 5 10 15 20 25 0 5 10 15 20 25 30 35 40 45 50 Job Length (in hours) Job Length (in hours) Timeout Kill Unknown Timeout Kill Unknown Bug IO RS Bug IO RS (a) RAS-based job failure (b) Bug-based job failure (a) Distr. with # Bytes Read (b) Distr. with # Bytes Written 1 1 0.9 0.9 0.98 0.9 Fig. 13. Exit Code Distribution Based on # Bytes Read/Written 0.96 0.85 0.8 0.94 0.8 0.8 0.92 0.7 0.7 0.75 TABLE XVII 0.9 0.7 0.6 0.88 0.6 BREAKDOWNOF RASEVENTS BASEDON CATEGORY/COMPONENT 0.86 0.65 0.5 0.84 0.5 0.6 Category percent Component percent CDF 0.4 0.82 CDF 0.4 0.55 0.8 0.5 Blue Gene/Q compute card 78.16% FIRMWARE 62.1% 0.3 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 0.3 1 2 3 4 5 6 7 8 9 10 Real Failed Job Length Real Failed Job Length Software Error 16.22% Machine Controller on Service Node 21.53% 0.2 1.Pareto 0.2 1.Erlang Message Unit (MU) 2.81% A Kernel Panic 11.39% 0.1 2.InverseGaussian 0.1 2.Exponential Generic Card/Board 2.65% Compute Node Kernel 4.52% 3.Weibull 3.Gamma 0 0 Bulk Power Supply 0.16% Memory Unit 0.32% 0 5 10 15 20 25 0 10 20 30 40 50 60 70 80 90 Control System on Service Node 0.16% Job Length (in hours) Job Length (in hours) (c) I/O-based job failure (d) Timeout job failure terminated involving 25% message IDs (i.e., the first two Fig. 12. MLE Fitting of Runtime Distribution Based on Job Failure Types rows in Table XVI). (2) The most frequent RAS events TABLE XV CONTINGENCY TABLEWITH #READ/WRITTEN BYTESAND EXIT CODE are node errors (such as 00080014 and 0008000B), which X XX exit involve over 78.16% system-reliability-based job failures XX NM TO BG KL IO UK RS # bytes XXX (as presented in Table XVII). In addition, message unit (MU) Total # Bytes Read vs. Exit Code [0,512MB) 258588 80842 13164 4494 1788 125 144 is used to move data between the memory and the 5D torus [512MB,1GB) 8066 10010 2903 256 128 4 0 network. Accordingly, Table XVII shows that (3) network- [1GB,2GB) 22812 3319 438 482 58 9 0 [2GB,4GB) 19278 1673 273 5 55 1159 0 related errors constitute 2.81% of the system-reliability- [4GB,8GB) 24400 3128 1685 34 82 30 169 [8GB,16GB) 16467 1374 576 6 91 1 8 related job failures; and the errors occurring in the system [16GB,32GB) 7169 433 426 18 67 1 4 such as kernels (denoted as Software Error in the table) [32GB,64GB) 6304 424 148 74 9 0 49 [128GB,256GB) 13707 387 88 27 6 23 12 constitute about 16.22%. (4) As for the system component, [256GB,512GB) 6437 2115 61 93 8 0 0 “firmware” is the major root cause of terminating jobs. Total # Bytes Written vs. Exit Code [0,512MB) 277555 79155 14504 4508 2139 1218 23 Firmware refers to a specific class of software that provides [512MB,1GB) 11588 1822 445 23 11 4 127 the low-level control for the device’s specific hardware. 
TABLE XVI
FRACTION OF THE MESSAGE IDS OF RAS EVENTS AFFECTING JOBS

msg ID     percent   msg ID     percent   msg ID     percent
00080014   26.37%    0008000B   14.04%    000A000D   11.39%
00040106   11.23%    00080007   7.96%     0008001A   6.24%
00010010   4.21%     00040157   3.74%     000400CD   2.81%
0008000C   2.65%     00080019   2.5%      00080008   1.72%
00040037   1.40%     000400B1   1.09%     000400ED   0.78%
00080016   0.31%     000C0042   0.31%     00040143   0.31%
0004014D   0.16%     00080017   0.16%     00010001   0.16%
00080004   0.16%     0001000A   0.16%     00061012   0.16%

TABLE XVII
BREAKDOWN OF RAS EVENTS BASED ON CATEGORY/COMPONENT

Category                   percent   Component                            percent
Blue Gene/Q compute card   78.16%    FIRMWARE                             62.1%
Software Error             16.22%    Machine Controller on Service Node   21.53%
Message Unit (MU)          2.81%     Kernel Panic                         11.39%
Generic Card/Board         2.65%     Compute Node Kernel                  4.52%
Bulk Power Supply          0.16%     Memory Unit                          0.32%
                                     Control System on Service Node       0.16%

VI. ANALYSIS OF CORRELATION BETWEEN JOB EXECUTIONS AND SYSTEM RELIABILITY

In this section, we explore how the system's fatal events affect job executions statistically, as well as the practical mean time to interruption. We first extract the 24 message IDs that pertain to one or more failed jobs. As mentioned in Section IV, the message ID is the key field determining the nature of a group of events in the RAS log. Table XVI lists all 24 message IDs and their corresponding fractions; the detailed meaning of each message ID can be found in the IBM Blue Gene/Q RAS book [34]. From Table XVI, we formulate the following takeaway. (Takeaway 18): (1) The message IDs of the job-affected RAS events follow a Pareto-like principle, or 75/25 rule: among all the jobs that failed because of system reliability, about 75% were terminated by only 25% of the message IDs (i.e., the first two rows in Table XVI). (2) The most frequent RAS events are node errors (such as 00080014 and 0008000B), which involve over 78.16% of the system-reliability-based job failures (as presented in Table XVII). (3) The message unit (MU) is used to move data between the memory and the 5D torus network; accordingly, Table XVII shows that such network-related errors constitute 2.81% of the system-reliability-related job failures, while errors occurring in system software such as kernels (denoted Software Error in the table) constitute about 16.22%. (4) As for the system component, firmware is the major root cause of terminated jobs; firmware refers to the class of software that provides low-level control of a device's specific hardware.
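The 75/25 observation in Takeaway 18 can be checked directly from the fractions in Table XVI: the six most frequent message IDs (the first two rows, i.e., 6 of the 24 IDs, or 25%) already cover roughly three quarters of the system-reliability-related job terminations. A small arithmetic check, using only the percentages from Table XVI:

```python
# Percentages from Table XVI, in descending order (24 message IDs in total).
fractions = [26.37, 14.04, 11.39, 11.23, 7.96, 6.24,
             4.21, 3.74, 2.81, 2.65, 2.5, 1.72,
             1.40, 1.09, 0.78, 0.31, 0.31, 0.31,
             0.16, 0.16, 0.16, 0.16, 0.16, 0.16]

top6 = sum(fractions[:6])                 # first two rows of Table XVI
print(f"top 6 of {len(fractions)} message IDs "
      f"({6 / len(fractions):.0%}) cover {top6:.2f}% of failures")
# -> top 6 of 24 message IDs (25%) cover 77.23% of failures
```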

In Table XVIII and in Fig. 14(a), we show how the job-affected RAS message IDs correlate with users. We formulate the following takeaway. (Takeaway 19): Different users tend to be affected by specific types of RAS events, likely because of their jobs' particular features, such as execution scale, run-time settings, or the nature of the application (computation-intensive or memory-intensive). For instance, u5 and u10 tend to be affected by message ID 00080014 (node errors), while u2 and u3 are more likely affected by 000A000D (kernel panic/errors). Since RAS events indicate system-reliability issues (firmware/hardware), this takeaway is not self-evident.

Similarly, we formulate the following takeaway. (Takeaway 20): The job-affected RAS events are also correlated to a certain extent with the core-hours consumed, as presented in Table XIX and Fig. 14(b). For example, Table XIX shows that 00080014 (node error) mainly happens to large-core-hour jobs. According to Fig. 14(b), if a job's core-hours fall in [1k,2k) and the job failed because of a fatal system event, the cause is likely 000A000D (kernel panic/error).
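Associations of the kind summarized in Table XVIII, Table XIX, and Fig. 14 can be examined with standard contingency-table tooling. Below is a minimal sketch, assuming a per-failure DataFrame with hypothetical columns user, corehours, and msg_id; the chi-square test of independence is our addition for illustration and is not a step claimed by the study.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical records of job failures attributed to fatal RAS events.
rng = np.random.default_rng(1)
failures = pd.DataFrame({
    "user":      rng.choice([f"u{i}" for i in range(1, 11)], size=500),
    "corehours": rng.lognormal(mean=10, sigma=2, size=500),
    "msg_id":    rng.choice(["00080014", "0008000B", "000A000D",
                             "00040106", "00080007", "0008001A"], size=500),
})

# User x message-ID contingency table (cf. Table XVIII / Fig. 14(a)).
by_user = pd.crosstab(failures["user"], failures["msg_id"])

# Log-scale core-hour bins x message ID (cf. Table XIX / Fig. 14(b)).
edges = [0, 1e3, 2e3, 4e3, 8e3, 16e3, 32e3, 64e3, 128e3, 256e3, np.inf]
failures["ch_bin"] = pd.cut(failures["corehours"], bins=edges, right=False)
by_corehours = pd.crosstab(failures["ch_bin"], failures["msg_id"])

# Chi-square test of independence between user and message ID.
chi2, pvalue, dof, _ = chi2_contingency(by_user)
print(by_user)
print(by_corehours)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {pvalue:.3g}")
```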

TABLE XVIII
CONTINGENCY TABLE WITH USER NAME AND RAS MSG ID

User   00080014  0008000B  000A000D  00040106  00080007  0008001A
u1     16        15        0         1         5         5
u2     0         1         22        0         0         0
u3     0         0         21        0         0         0
u4     4         10        1         0         0         3
u5     13        0         0         1         2         2
u6     0         0         17        0         0         0
u7     8         3         0         2         1         1
u8     3         8         0         1         1         0
u9     1         10        0         0         1         0
u10    10        1         0         0         1         0

[Fig. 15. Spatial Distribution of Job-Affected RAS Events on Compute Racks: heat map over the Mira compute racks (R00 to R2F, each with midplanes M0/M1); the color scales indicate 6 to 35 events per rack, 1 to 27 events per midplane, and 1 to 19 events per node.]

TABLE XIX
CONTINGENCY TABLE WITH CORE-HOURS AND RAS MSG ID

corehours      00080014  0008000B  000A000D  00040106  00080007  0008001A
[0,1k)         11        1         4         0         3         0
[1k,2k)        7         1         33        1         2         0
[2k,4k)        1         2         9         3         1         2
[4k,8k)        9         3         24        2         0         1
[8k,16k)       10        5         0         2         2         2
[16k,32k)      13        9         1         1         7         5
[32k,64k)      13        9         0         5         2         3
[64k,128k)     24        16        0         9         3         4
[128k,256k)    16        7         1         12        11        7
[256k,...)     65        37        1         37        20        16

[Fig. 14. RAS Event Distribution Based on Users/Corehours: stacked fractions of the job-affected RAS message IDs (00080014, 0008000B, 000A000D, 00040106, 00080007, 0008001A, 00010010, 00040157, 000400CD, 0008000C) for (a) each user u1 to u10 and (b) each core-hour range from [0,1k) to [256k,512k).]

[Fig. 16. Spatial Distribution of Job-Affected RAS Events on IO Racks: heat map over the Mira I/O racks (Q0G/Q0H to Q2G/Q2H, each with I/O drawers I0 to I8); the color scales indicate 6 to 21 events per rack, 1 to 7 events per I/O drawer, and 1 to 4 events per compute card.]

Fig. 15 and Fig. 16 show the locality features of the job-affected RAS events in the compute racks and I/O racks, respectively. (Takeaway 21): Such RAS events follow a fairly nonuniform spatial distribution: some midplanes (such as R03-M0 and R1C-M0) have far more frequent issues than others (such as R11-M0). The most error-prone rack and midplane have 35 and 27 fatal RAS events, respectively, whereas the minimum numbers of fatal events per rack and per midplane are 6 and 1, respectively. Some nodes fail frequently (e.g., one node in R03 accounts for 19 failures), while many nodes (shown as white squares in Fig. 15) have no failure-related events at all.

After a spatiotemporal filter is used to remove duplicated messages, the number of job-affected RAS events is 635, indicating that job failures caused by RAS events occur about once every 3.15 days. As mentioned in Section IV, we further remove highly correlated (or similar) messages by leveraging LogAider [31], based on the similarity of two events occurring at close timestamps. Specifically, if two events occur within 2 hours and exhibit fairly similar properties, we merge them into one event because they likely share the same source (or root cause). The total number of failures can then be narrowed to the range [570, 596] for similarity thresholds in [0.2, 0.8] (the similarity threshold is a parameter controlling how similar two messages must be in order to be merged [31]). According to the system administrators, a similarity threshold of 0.5 leads to high-fidelity filtering results, corresponding to 575 fatal events in total (518 related to compute racks and 57 related to I/O racks). Hence we have the following takeaway. (Takeaway 22): The MTTI is about 3.48 days from the perspective of jobs/users of Mira, disclosing the real system-related failure rate experienced by users.
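The MTTI figure follows directly from the filtered event count and the observation window: 2,001 days / 575 events is about 3.48 days, and 2,001 / 635 is about 3.15 days with spatiotemporal deduplication only. A minimal sketch of this last step is shown below; the 2-hour merge window is taken from the text, while the event list and the simple time-only merging rule (a stand-in for LogAider's similarity-based filtering) are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps of job-affecting fatal RAS events; in the study these
# come from the RAS log joined with the job-scheduling and task logs.
events = [
    datetime(2013, 4, 9, 3, 15),
    datetime(2013, 4, 9, 4, 40),   # within 2 hours of the previous -> merged
    datetime(2013, 4, 12, 22, 5),
    datetime(2013, 4, 16, 7, 30),
]
WINDOW = timedelta(hours=2)

# Keep an event only if it is more than 2 hours after the last kept one
# (a simplified, time-only stand-in for the similarity-based filtering).
kept = []
for t in sorted(events):
    if not kept or t - kept[-1] > WINDOW:
        kept.append(t)
print(f"{len(events)} raw events -> {len(kept)} after merging")

# MTTI = observation window / number of filtered fatal events.
OBSERVATION_DAYS = 2001
print(f"dedup only   : {OBSERVATION_DAYS / 635:.2f} days")   # ~3.15 days
print(f"threshold 0.5: {OBSERVATION_DAYS / 575:.2f} days")   # ~3.48 days
```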
VII. CONCLUSION AND FUTURE WORK

In this paper, we conduct a joint analysis with multiple data sources to explore the job failure features over the 2,001 days of the IBM Blue Gene/Q Mira. We present 22 valuable takeaways, which we believe are helpful for understanding the job failure features, how users and system events affect job executions, and the locality features of the job-affected RAS events in both compute racks and I/O racks. To the best of our knowledge, our analysis involves the largest number of core-hours in any resilience study to date. As future work, we plan to study more supercomputers for a comprehensive comparison.

ACKNOWLEDGMENTS

This research was supported under Contract DE-AC02-06CH11357 by the U.S. Department of Energy. We used data of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357 by the U.S. Department of Energy.

REFERENCES

[1] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. Debardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen, "Addressing failures in exascale computing," Int. J. High Perform. Comput. Appl., vol. 28, no. 2, pp. 129–173, May 2014. [Online]. Available: http://dx.doi.org/10.1177/1094342014522573
[2] S. Di, H. Guo, R. Gupta, E. R. Pershey, M. Snir, and F. Cappello, "Exploring properties and correlations of fatal events in a large-scale HPC system," IEEE Transactions on Parallel and Distributed Systems, pp. 1–14, 2018.
[3] "Cobalt: Component-based lightweight toolkit," https://trac.mcs.anl.gov/projects/cobalt, online.
[4] "Torque resource manager," http://www.adaptivecomputing.com/products/torque/, online.
[5] "Darshan project," https://www.mcs.anl.gov/research/projects/darshan/publications/, online.
[6] P. Carns, K. Harms, W. Allcock, C. Bacon, S. Lang, R. Latham, and R. Ross, "Understanding and improving computational science storage access through continuous characterization," Trans. Storage, vol. 7, no. 3, pp. 8:1–8:26, Oct. 2011.
[7] C. D. Martino, Z. Kalbarczyk, R. K. Iyer, F. Baccanico, J. Fullop, and W. Kramer, "Lessons learned from the analysis of system failures at petascale: The case of Blue Waters," in Proceedings of the 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, ser. DSN '14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 610–621.
[8] C. D. Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer, "Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs," in 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2015, pp. 25–36.
[9] R.-T. Liu and Z.-N. Chen, "A large-scale study of failures on petascale supercomputers," Journal of Computer Science and Technology, vol. 33, no. 1, pp. 24–41, Jan 2018.
[10] "Blue Waters supercomputer," https://bluewaters.ncsa.illinois.edu/, online.
[11] G. Lakner and B. Knudson, "IBM System Blue Gene Solution: Blue Gene/Q system administration," http://www.redbooks.ibm.com/abstracts/sg247869.html, online.
[12] G. Li, Q. Lu, and K. Pattabiraman, "Fine-grained characterization of faults causing long latency crashes in programs," in 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, June 2015, pp. 450–461.
[13] Q. Lu, M. Farahani, J. Wei, A. Thomas, and K. Pattabiraman, "LLFI: An intermediate code-level fault injection tool for hardware faults," in 2015 IEEE International Conference on Software Quality, Reliability and Security, Aug 2015, pp. 11–16.
[14] T. Siddiqua, V. Sridharan, S. E. Raasch, N. DeBardeleben, K. B. Ferreira, S. Levy, E. Baseman, and Q. Guan, "Lifetime memory reliability data from the field," in 2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Oct 2017, pp. 1–6.
[15] S. Levy, K. B. Ferreira, N. DeBardeleben, T. Siddiqua, V. Sridharan, and E. Baseman, "Lessons learned from memory errors observed over the lifetime of Cielo," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, ser. SC '18. Piscataway, NJ, USA: IEEE Press, 2018, pp. 43:1–43:12. [Online]. Available: http://dl.acm.org/citation.cfm?id=3291656.3291714
[16] "Cielo NNSA capability supercomputer," https://www.lanl.gov/projects/cielo/, online.
[17] B. Nie, D. Tiwari, S. Gupta, E. Smirni, and J. H. Rogers, "A large-scale study of soft-errors on GPUs in the field," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Barcelona, Spain, March 12-16, 2016, pp. 519–530.
[18] B. Nie, J. Xue, S. Gupta, T. Patel, C. Engelmann, E. Smirni, and D. Tiwari, "Machine learning models for GPU error prediction in a large scale HPC system," in 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), June 2018, pp. 95–106.
[19] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi, "Feng Shui of supercomputer memory: positional effects in DRAM and SRAM faults," in SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Nov 2013, pp. 1–11.
[20] V. Sridharan, N. DeBardeleben, S. Blanchard, K. B. Ferreira, J. Stearley, J. Shalf, and S. Gurumurthi, "Memory errors in modern systems: The good, the bad, and the ugly," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '15. New York, NY, USA: ACM, 2015, pp. 297–310.
[21] "ORNL Jaguar," https://www.olcf.ornl.gov/tag/jaguar/, online.
[22] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, L. Carro, and A. Bland, "Understanding GPU errors on large-scale HPC systems and the implications for system design and operation," in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Feb 2015, pp. 331–342.
[23] S. Gupta, D. Tiwari, C. Jantzi, J. Rogers, and D. Maxwell, "Understanding and exploiting spatial properties of system failures on extreme-scale HPC systems," in Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, ser. DSN '15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 37–44.
[24] D. Tiwari, S. Gupta, G. Gallarno, J. Rogers, and D. Maxwell, "Reliability lessons learned from GPU experience with the Titan Supercomputer at Oak Ridge Leadership Computing Facility," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '15. New York, NY, USA: ACM, 2015, pp. 38:1–38:12.
[25] B. Nie, J. Xue, S. Gupta, C. Engelmann, E. Smirni, and D. Tiwari, "Characterizing temperature, power, and soft-error behaviors in data center systems: Insights, challenges, and opportunities," in 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Sept 2017, pp. 22–31.
[26] S. Gupta, T. Patel, C. Engelmann, and D. Tiwari, "Failures in large scale systems: Long-term measurement, analysis, and implications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '17. New York, NY, USA: ACM, 2017, pp. 44:1–44:12.
[27] A. A. Hwang, I. A. Stefanovici, and B. Schroeder, "Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design," SIGPLAN Not., vol. 47, no. 4, pp. 111–122, Mar. 2012.
[28] Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, "Co-analysis of RAS log and job log on Blue Gene/P," in 2011 IEEE International Parallel & Distributed Processing Symposium, May 2011, pp. 840–851.
[29] "Intrepid at Argonne (Blue Gene/P)," https://www.alcf.anl.gov/intrepid, online.
[30] "Mira system logs," https://reports.alcf.anl.gov/data/mira.html, online.
[31] S. Di, R. Gupta, M. Snir, E. Pershey, and F. Cappello, "LogAider: A tool for mining potential correlations of HPC log events," in 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), May 2017, pp. 442–451.
[32] W.-S. Yang, H.-C. W. Lin, and Y. H. He, "Franklin job completion analysis," in CUG, Edinburgh, UK, 2010.
[33] S. Habib, V. Morozov, N. Frontiere, H. Finkel, A. Pope, K. Heitmann, K. Kumaran, V. Vishwanath, T. Peterka, J. Insley et al., "HACC: extreme scaling and performance across diverse architectures," Communications of the ACM, vol. 60, no. 1, pp. 97–104, 2016.
[34] "IBM Blue Gene/Q RAS book," https://reports.alcf.anl.gov/data/datadictionary/RasEventBook.html, online.
