Cray® XC40™ System Diagnosability: Functionality, Performance, and Lessons Learned

Jeffrey J. Schutkoske
Platform Services Group (PSG)
Cray, Inc.
Bloomington, MN, USA
[email protected]

Abstract— The Intel® Xeon Phi™ CPU 7250 processor presents new opportunities for diagnosing the nodes based on this processor in the Cray® XC™ system. This processor supports a new high-bandwidth on-package MCDRAM memory and related interfaces. It also provides the ability to support different Non-Uniform Memory Access (NUMA) configurations. The new Cray Processor Daughter Card (PDC) also supports an optional PCIe SSD card. This processor requires new BIOS, administrative commands, and power and thermal limits, as well as new diagnostics to validate functionality and performance. This paper describes the diagnostic tool chain changes required to support this processor and the PCIe SSD card. It describes the functionality, performance, and lessons learned from diagnosing problems at scale. It also provides detailed examples of how to diagnose node faults within the Cray® XC40™ system.

Keywords: Cray® XC40™ system, Cray Linux Environment, CLE, Intel® Xeon Phi™ CPU 7250 processor, MCDRAM, diagnostic, diagnosability

I. INTRODUCTION

The Cray® XC40™ system includes the new Cray Processor Daughter Card (PDC), which supports the new Intel® Xeon Phi™ CPU 7250 processor and an optional PCIe SSD card. Previous work [1] [2] [3] has outlined Cray system diagnosability for the Cray® XC™ Series. A number of enhancements are required to support the new node processor and optional SSD card in the Cray® XC40™ system.

The Cray Hardware Supervisory System (HSS) supports the new high-bandwidth on-package MCDRAM memory and interfaces. It supports the on-demand configuration of the MCDRAM and NUMA. The MCDRAM and NUMA configurations, as well as SSD enable/disable, are configurable both from the command line on the System Management Workstation (SMW) and by a workload manager (WLM) running under the Cray Linux Environment (CLE) utilizing the Cray Advanced Platform Monitoring and Control (CAPMC) interface.

The HSS commands, utilities, and diagnostics are used to diagnose a faulty node. Initially, BIOS is used to verify that the node has powered on correctly, that the node is initialized, and that all interface ports are trained successfully. It also validates that any on-demand configuration settings are valid. Any errors or warnings are reported to HSS.

Once the node is successfully initialized by BIOS, CLE is booted. The Cray Linux Environment (CLE) utilizes the Resiliency Communication Agent (RCA) interface to communicate between HSS and the nodes. The Hardware Error Log channel is used to send hardware errors to HSS.

HSS monitors a number of power and thermal sensors within the Cray® XC40™ system at the blade and cabinet level. These sensors and devices are managed via the out-of-band paths on the blade by HSS. HSS also utilizes a number of Intel Reliability, Availability, and Serviceability (RAS) features, including the Platform Environment Control Interface (PECI) and the In-Target Probe (ITP).

New Intel® Xeon Phi™ CPU 7250 processor on-line diagnostics have been written to validate the node and MCDRAM functionality and performance. The diagnostics validate the node based on the MCDRAM and NUMA configurations. The Workload Test Suite (WTS) has also been updated to detect and diagnose Intel® Xeon Phi™ CPU 7250 processor problems under CLE.

There are new utilities and diagnostics to support the PCIe SSD card. A new diagnostic utility executes in CLE on the DataWarp node or the compute node to support the SSD. This diagnostic utility is periodically scheduled to check the health of the SSD. It reports the status to the SMW via RCA. The system administrator can query and display the current SSD health, as well as historical data. The results of the SSD diagnostic utility can also be viewed on the SMW.

This paper describes the tool chain changes required to support the new blade with the Intel® Xeon Phi™ CPU 7250 processor and optional PCIe SSD cards. It also provides detailed examples of how to diagnose Intel® Xeon Phi™ CPU 7250 processor faults within the Cray® XC40™ system.

Section II describes the on-demand configuration options for the node and the attached SSD cards. BIOS initialization is covered in Section III. The Resiliency Communication Agent (RCA) interface and monitoring are covered in Section IV and Section V, respectively. Section VI covers the HSS Out-Of-Band (OOB) debug, Section VII covers the online diagnostics, and Section VIII explains the Workload Test Suite (WTS). Section IX describes the PCIe SSD cards. Related work is described in Section X. Finally, Section XI provides a summary.
II. ON-DEMAND CONFIGURATION

The Intel® Xeon Phi™ CPU 7250 processor can be reconfigured on demand using the Cray Advanced Platform Monitoring and Control (CAPMC) interface [11]. The new configuration is applied on the next cold boot of the node or nodes. This processor supports five different NUMA modes: All-to-All, Sub-NUMA Cluster 2, Sub-NUMA Cluster 4, Hemisphere, and Quadrant. It also supports four different MCDRAM modes: Cache, Flat, Hybrid Equal, and Hybrid Split. All permutations and combinations of these NUMA and MCDRAM modes are supported. Note that not all modes and combinations have proven effective, but that discussion is outside the scope of this paper [21]. The PCIe SSD cards can be enabled or disabled via the CAPMC interface.

CAPMC provides a published interface [12], shown in TABLE I, that allows the WLM to make reconfiguration requests to HSS. System administrators and operators can also use the CAPMC interface on the SMW. The changed settings are stored in the power management database (PMDB), and the changes are applied on the next reboot. Typically, the WLM requests the reboot via the CAPMC node_reinit command.

TABLE I. CRAY CAPMC CONFIGURATION REQUESTS

Command                      Description
get_ssd_enable               Return list of enabled SSD cards
set_ssd_enable               Enable/disable SSD
get_mcdram_cfg_capabilities  Return supported MCDRAM configurations
get_mcdram_cfg               Return current MCDRAM configuration
set_mcdram_cfg               Set new MCDRAM configuration
get_numa_cfg_capabilities    Return supported NUMA configurations
get_numa_cfg                 Return current NUMA configuration
set_numa_cfg                 Set new NUMA configuration
node_reinit                  Reinitialize node
node_status                  Return current node status

Once HSS receives the CAPMC node_reinit request, the node is stepped through the standard states of ready, halt, on, and standby, returning to the ready state. Any node that fails while in a state remains in that state. For example, if a node fails to complete the CLE boot, the node persists in the standby state. The WLM can check the current state of the node using the CAPMC node_status command.

Failures that occur during a reboot initiated by CAPMC are logged in /var/opt/cray/log/xtremoted-YYYYMMDD. xtremoted logs the full xtbounce output when xtbounce returns non-zero. xtremoted logs the full xtcli boot output regardless of return code.
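The requests in TABLE I are made as JSON payloads over an authenticated connection to the CAPMC service. The following C sketch shows the shape of a WLM-side node_reinit call using libcurl. The endpoint URL, port, and JSON field names here are illustrative assumptions, not the published interface; the authoritative request and authentication details are in the CAPMC API documentation [12].

/*
 * Hypothetical CAPMC node_reinit request via libcurl.
 * URL, port, and JSON schema are assumptions for illustration;
 * see the CAPMC API documentation [12] for the real interface.
 */
#include <stdio.h>
#include <curl/curl.h>

int main(void)
{
    /* Ask HSS to reinitialize nid 40; "reason" is free-form text. */
    const char *body =
        "{ \"nids\": [40], \"reason\": \"apply new MCDRAM/NUMA config\" }";

    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl)
        return 1;

    struct curl_slist *hdrs = NULL;
    hdrs = curl_slist_append(hdrs, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://smw.example.com:8443/capmc/node_reinit");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    CURLcode rc = curl_easy_perform(curl);   /* POST the request */
    if (rc != CURLE_OK)
        fprintf(stderr, "node_reinit failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}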

III. BIOS INITIALIZATION

The Cray BIOS is used to initialize and configure the Intel® Xeon Phi™ CPU 7250 processor node as requested by CAPMC. When HSS brings the node out of reset, BIOS is loaded and executed. HSS passes the new configuration settings to BIOS, which uses them to configure the NUMA and MCDRAM modes as requested.

BIOS is used to initialize the Intel® Xeon Phi™ CPU 7250 processor, initialize and train the attached memory and MCDRAM, and initialize and train any attached PCIe devices, including the Aries network interface (NIC).

BIOS executes the Rank Margining Tool (RMT) as part of the normal BIOS process. The RMT generates, calculates, and reports out the system memory interface cold boot signal timing and voltage reference (VREF) margins. This ensures that the DDR4 and MCDRAM interfaces can support the expected memory performance. If BIOS determines that RMT was successful on the previous cold boot, BIOS skips the RMT, thereby enabling a fast cold boot. RMT is run by default after each BIOS flash.

The Cray BIOS reports any MCDRAM or DIMM failures during memory training, including any Machine Check Architecture (MCA) errors that were detected. NUMA and MCDRAM configurations are verified and reported.

BIOS-detected errors and failures are logged to the BIOS log on the Blade Controller (BC). Controller log forwarding with the Lightweight Log Manager (LLM) can be configured to forward the BIOS logs to the SMW. The Blade and Cabinet Controllers are diskless Linux systems, so log forwarding is enabled by default.

CrayBdsHook.Entry(B9E00300)
AddLpcFaultEntry: PostCode 0x68
LPC_SCRATCH_FAULT_REPORT_ENTRY
  FaultNum: 1
  Type: 11
  Flags: 0x00
  CodeMajor: 0x68
  CodeMinor: 0x00
  ApicId: 0x00
  CpuNum: 0
  Timestamp: 04/05/2017 18:18:20
  LogData: 0x00000020
  FaultMsg: BIST disabled via POC_RESET_STRAPS
LPC_SCRATCH_FAULT_INFO_ENTRY
  LineNum: 1085
  FileName: z:\knl\CrayPkg\Dxe\CrayBdsHook\CrayBdsHook.c

Figure 1. BIOS Hardware Failure Detected

In Figure 1, BIOS requested that the Intel® Xeon Phi™ CPU 7250 processor BIST execute, and BIOS detected that it did not execute. In Figure 2, BIOS detected multiple MCDRAM failures where the MCDRAM did not train successfully, and it reported the errors.

** EDC-0 Memory Init: cmdcrc_err = 1
** EDC-4 Memory Init: cmdcrc_err = 1
EDC Meminit Time Elapsed: 99ms
EDC-0: memory Init Status 0x0003

LPC_SCRATCH_FAULT_REPORT_ENTRY
  FaultNum: 1
  Type: 9
  Flags: 0x00
  CodeMajor: 0xA1
  CodeMinor: 0x06
  ApicId: 0x00
  CpuNum: 0
  Timestamp: 07/29/2015 22:29:22
  LogData: 0x0000FFFF
  FaultMsg: CRAY_MCDRAM_WARNING
A warning has been logged! Warning Code = 0xA1, Minor Warning Code = 0x6, Data = 0xFFFF

S0 Ch0
EDC-3: memory Init Status 0x0001
EDC-4: memory Init Status 0x0003

LPC_SCRATCH_FAULT_REPORT_ENTRY
  FaultNum: 2
  Type: 9
  Flags: 0x00
  CodeMajor: 0xA1
  CodeMinor: 0x06
  ApicId: 0x00
  CpuNum: 0
  Timestamp: 07/29/2015 22:29:22
  LogData: 0x0004FFFF
  FaultMsg: CRAY_MCDRAM_WARNING
A warning has been logged! Warning Code = 0xA1, Minor Warning Code = 0x6, Data = 0x4FFFF

Figure 2. BIOS MCDRAM Failure Detected

Any errors are logged and reported to the SMW command, xtbounce, as part of the system boot sequence. The node is prevented from booting if any hardware failure is detected during BIOS execution. The Cray BIOS logs are copied to the Cray SMW on any BIOS-detected error or failure for further analysis. This ensures that at the time of system boot all devices are properly discovered, initialized, and trained successfully, or flagged as failed, without stopping the boot process.

HWERR[c1-0c2s14n1]:0xfd0b:Uncorrectable:MFG[0]:CPUID[50671]:SOCKET[0]:APIC[0]:BANK[11]:STATUS[0xf60000800040009e]:MISC[0x0]:ADDR[0x153fffc300]:CTL2[0x0]
HWERR[c1-0c1s6n0]:0xfd07:Uncorrectable:MFG[0]:CPUID[50671]:SOCKET[0]:APIC[0]:BANK[7]:STATUS[0xf60000800040009e]:MISC[0x0]:ADDR[0x176477c000]:CTL2[0x0]
HWERR[c1-0c1s3n2]:0xfd09:Uncorrectable:MFG[0]:CPUID[50671]:SOCKET[0]:APIC[0]:BANK[9]:STATUS[0xf60001000040009e]:MISC[0x0]:ADDR[0x17647f8000]:CTL2[0x0]
HWERR[c0-0c0s11n2]:0xfd0d:Uncorrectable:MFG[0]:CPUID[50671]:SOCKET[0]:APIC[0]:BANK[13]:STATUS[0xf60000400040009e]:MISC[0x0]:ADDR[0x17647a2000]:CTL2[0x0]

Figure 3. xthwerrlog Output

The SMW command, xtmcadecode, decodes and provides a detailed explanation of Intel MCA errors, as shown in Figure 4.

xtmcadecode -t knl 16 84000040000800C0

Bank 16: IMC1: Integrated Memory Controller 1:
MCA Status = 0x84000040000800c0:
  MCACOD = 0x00c0, MSCOD = 0x0008
  Other Info = 0x00
  Corrected Error Count = 1
  Threshold-based Error Status = 0, No Tracking
Common Status Info:
  VALID = 1 = Valid Error Detected
  OVER  = 0 = No overflow
  UC    = 0 = Error Corrected by HW
  EN    = 0 =
  MISCV = 0 =
  ADDRV = 1 = Error address in MCi_ADDR
  PCC   = 0 =
  S     = 0 =
  AR    = 0 =
Model-specific error: Correctable Patrol Scrub
Channel: 0
MCA: Undefined Error

Figure 4. xtmcadecode Output
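To make the Figure 4 decode concrete, the following minimal C sketch extracts the architectural IA32_MCi_STATUS fields for the same status word, using the standard machine-check bit layout from the Intel Software Developer's Manual. It reproduces the VALID, ADDRV, corrected-error-count, MSCOD, and MCACOD fields shown above; the model-specific interpretation (e.g., "Correctable Patrol Scrub") requires per-processor tables that only xtmcadecode carries.

/* Decode the architectural fields of an IA32_MCi_STATUS value.
 * Bit positions follow the Intel SDM machine-check architecture;
 * model-specific decoding is intentionally omitted. */
#include <stdio.h>
#include <stdint.h>

static void decode_mci_status(uint64_t s)
{
    printf("VALID = %u\n", (unsigned)((s >> 63) & 1)); /* error logged    */
    printf("OVER  = %u\n", (unsigned)((s >> 62) & 1)); /* overflow        */
    printf("UC    = %u\n", (unsigned)((s >> 61) & 1)); /* uncorrected     */
    printf("EN    = %u\n", (unsigned)((s >> 60) & 1)); /* error enabled   */
    printf("MISCV = %u\n", (unsigned)((s >> 59) & 1)); /* MCi_MISC valid  */
    printf("ADDRV = %u\n", (unsigned)((s >> 58) & 1)); /* MCi_ADDR valid  */
    printf("PCC   = %u\n", (unsigned)((s >> 57) & 1)); /* context corrupt */
    printf("Corrected Error Count = %u\n",
           (unsigned)((s >> 38) & 0x7fff));            /* bits 52:38      */
    printf("MSCOD  = 0x%04x\n", (unsigned)((s >> 16) & 0xffff));
    printf("MCACOD = 0x%04x\n", (unsigned)(s & 0xffff));
}

int main(void)
{
    decode_mci_status(0x84000040000800c0ULL);  /* value from Figure 4 */
    return 0;
}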

IV. RESILIENCY COMMUNICATION AGENT (RCA)

The CLE kernel captures node hardware errors. It reads MCA errors (correctable and uncorrectable) and logs them in the node console log, which is saved on the SMW. It also reports them via the Hardware Error Log Channel connected from the node to the BC, which forwards them to the SMW Hardware Error Logger Daemon (xthwerrlogd). The SMW command, xthwerrlog, displays all hardware errors logged via the Hardware Error Log Channel, including MCA errors, as shown in Figure 3. HSS provides the decoding for all Intel® Xeon Phi™ CPU 7250 processor machine check registers. This includes banks 7-14, where MCDRAM errors are reported.

Advanced Error Reporting (AER) is enabled in the CLE kernel by default. With AER enabled, the PCIe SSD card errors are reported to the SMW. The Compute Node Linux (CNL) kernel reads these PCIe errors and logs them in the node console log, which is saved on the SMW. It also reports them via the Hardware Error Log Channel connected from the node to the BC, which forwards them to xthwerrlogd on the SMW. The PCIe errors are viewable using the SMW command, xtpcimon.

V. MONITORING

The Cray Processor Daughter Card (PDC) has 26 sensors available for monitoring node power and memory power [14]. The sensors are connected to seven separate I2C buses. This topology allows the node power readings to be scanned both at a higher rate and in greater detail than was possible in previous designs.

The HSS utility System Environment Data Collections (SEDC) [10] monitors the system health and records the environmental data coming from the system's hardware components. The telemetry data is logged to the PMDB.

To validate the HSS hardware and software, the SMW HSS debug utility, xtcheckhss, is used. The command validates the Cray® XC40™ system HSS infrastructure by checking each blade and each cabinet. It can be used to get a quick validation that the HSS infrastructure is functioning normally, and it can also be used to troubleshoot a given blade or cabinet. On a blade, it validates the basic blade functionality. It can also validate HSS voltages, Aries voltages and current, temperatures, and PDC sensors, as well as the Intel processor and DIMM temperatures and voltages.

The SMW SEDC command, xtgetsedcvalues, returns the available SEDC values. To view the SEDC temperature telemetry data collected for a node as stored in the PMDB, use a database query as follows:

SELECT value FROM pmdb.bc_sedc_data
WHERE sensor_id >= 1300 AND sensor_id <= 1306;

The results of the above query are shown in TABLE II.

TABLE II. NODE TEMPERATURE OUTPUT

SEDC PMDB Temperature Data
Node       Sensor ID  Sensor Name           Value
c1-0c0s10  1306       BC_T_NODE3_CPU0_TEMP  36
c1-0c0s10  1304       BC_T_NODE2_CPU0_TEMP  34
c1-0c0s10  1302       BC_T_NODE1_CPU0_TEMP  36
c1-0c0s10  1300       BC_T_NODE0_CPU0_TEMP  38
c1-0c0s14  1306       BC_T_NODE3_CPU0_TEMP  27
c1-0c0s14  1304       BC_T_NODE2_CPU0_TEMP  25
c1-0c0s14  1302       BC_T_NODE1_CPU0_TEMP  29
c1-0c0s14  1300       BC_T_NODE0_CPU0_TEMP  27

To view the SEDC processor global power telemetry data collected for a node as stored in the PMDB, use a database query as follows:

SELECT value FROM pmdb.bc_sedc_data
WHERE sensor_id >= 2776 AND sensor_id <= 2779;

The results of the above query are shown in TABLE III.

TABLE III. NODE GLOBAL POWER USAGE

SEDC PMDB Power Data
Node       Sensor ID  Sensor Name                   Value
c1-0c0s10  2776       BC_P_NODE0_GLOBAL_PROC_POWER  23
c1-0c0s10  2777       BC_P_NODE1_GLOBAL_PROC_POWER  22
c1-0c0s10  2778       BC_P_NODE2_GLOBAL_PROC_POWER  23
c1-0c0s10  2779       BC_P_NODE3_GLOBAL_PROC_POWER  23
c1-0c0s14  2776       BC_P_NODE0_GLOBAL_PROC_POWER  25
c1-0c0s14  2777       BC_P_NODE1_GLOBAL_PROC_POWER  25
c1-0c0s14  2778       BC_P_NODE2_GLOBAL_PROC_POWER  25
c1-0c0s14  2779       BC_P_NODE3_GLOBAL_PROC_POWER  27

It is also possible to search for a specific window of time and for specific nodes. This data is invaluable when troubleshooting hardware or software failures.
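Since the PMDB is a PostgreSQL database, the same telemetry can also be pulled programmatically. A minimal libpq sketch, assuming the pmdb.bc_sedc_data schema used in the queries above; the "ts" time-window column and the connection string are assumptions that will vary by site:

/* Query SEDC temperature telemetry from the PMDB with libpq.
 * Table name and sensor_id range follow the queries above; the
 * "ts" column and connection parameters are assumptions. */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=pmdb user=pmdbuser");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connect failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    PGresult *res = PQexec(conn,
        "SELECT sensor_id, value FROM pmdb.bc_sedc_data "
        "WHERE sensor_id BETWEEN 1300 AND 1306 "
        "AND ts BETWEEN '2017-04-05 18:00' AND '2017-04-05 19:00';");

    if (PQresultStatus(res) == PGRES_TUPLES_OK) {
        for (int i = 0; i < PQntuples(res); i++)   /* one row per sample */
            printf("%s %s\n", PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));
    } else {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
    }

    PQclear(res);
    PQfinish(conn);
    return 0;
}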

VI. HSS OUT-OF-BAND (OOB) DEBUG

The HSS system management platform performs monitoring and management out-of-band to the compute node and high-speed network. HSS also implements interfaces to various vendors' out-of-band interfaces to enhance OOB debug of the Cray® XC40™ system. These interfaces are as follows:

• Intel Platform Environment Control Interface (PECI)
• Intel In-Target Probe (ITP)

A. Intel Platform Environment Control Interface (PECI) Debug

Intel processors provide a single-wire signal for their Platform Environment Control Interface (PECI). During some early debug, Cray used PECI to dump out various Intel® Xeon Phi™ CPU 7250 processor registers to aid in debugging when a node failed and no other dump mechanism was available.

Intel uses the Intelligent Platform Management Interface (IPMI) Specification [4] as the protocol to connect to the Intel Management Engine (ME) within the Intel Wellsburg Platform Controller Hub (PCH). This interface is primarily used to collect power, thermal, and status information by the HSS utility SEDC, which monitors the system health and records the environmental data coming from the system's hardware components. Finally, the data can be viewed in real time using the HSS debug utility, xtcheckhss.

To display the HSS telemetry data for a selected blade in the system, use the xtcheckhss command line as follows:

xtcheckhss --cclist=none --bclist=c0-0c2s0 --detail=f

This command outputs the details of all the components that are part of the blade. The partial output shown in Figure 5 focuses on the DIMM temperature, node power, and node temperature readings for two of the nodes on the blade.

c0-0c2s0n0 kpdc1_n1_s0 dimm0_temp n/a n/a n/a 30 degc n/a n/a
c0-0c2s0n0 kpdc1_n1_s0 dimm1_temp n/a n/a n/a 29 degc n/a n/a
c0-0c2s0n0 kpdc1_n1_s0 dimm2_temp n/a n/a n/a 29 degc n/a n/a
c0-0c2s0n0 kpdc1_n1_s0 dimm3_temp n/a n/a n/a 30 degc n/a n/a
c0-0c2s0n0 kpdc1_n1_s0 knl_power  n/a n/a n/a 83252 w*1000 n/a n/a
c0-0c2s0n0 kpdc1_n1_s0 knl_temp   n/a n/a n/a 38 n/a n/a n/a

c0-0c2s0n1 kpdc0_n1_s0 dimm0_temp n/a n/a n/a 25 degc n/a n/a
c0-0c2s0n1 kpdc0_n1_s0 dimm1_temp n/a n/a n/a 25 degc n/a n/a
c0-0c2s0n1 kpdc0_n1_s0 dimm2_temp n/a n/a n/a 25 degc n/a n/a
c0-0c2s0n1 kpdc0_n1_s0 dimm3_temp n/a n/a n/a 25 degc n/a n/a
c0-0c2s0n1 kpdc0_n1_s0 knl_power  n/a n/a n/a 29000 w*1000 n/a n/a
c0-0c2s0n1 kpdc0_n1_s0 knl_temp   n/a n/a n/a 26 n/a n/a n/a

Figure 5. xtcheckhss Blade Output

The detailed output definitions for xtcheckhss are described in TABLE IV.

TABLE IV. XTCHECKHSS DETAIL DESCRIPTIONS

xtcheckhss Detail Column Descriptions
Name       Description
Component  Cray component name (cname)
Sensor     Hardware sensor name
FRSH       Freshness counter
HLMIN      Hardware limit minimum (cannot be modified)
SLMIN      Software limit minimum (software configurable)
DATA       Data value
UNIT       Data units (e.g., degrees Celsius, Watts)
SLMAX      Software limit maximum (software configurable)
HLMAX      Hardware limit maximum (cannot be modified)

xtcheckhss reports the component, sensor, data, and unit for all detailed telemetry data. There are a number of HSS devices where the additional fields become important. In the case below, where a Voltage Regulator Module (VRM) is reporting a low voltage reading, the VRM is below the SLMIN but higher than the HLMIN, where a VRM fault would occur. When any device reading is outside of these limits, a message is sent to the SMW. xtcheckhss is used to verify that the device reading is out of the expected range, as shown in Figure 6.

c0-0c0s7n2 qpdc0_n0_s0_mem_vrm vdd_vdr01_s0_c_i 1200 1350 1339 v*1000 1650 1800

Figure 6. HSS Low Voltage Output

B. Intel In-Target Probe (ITP) Debug

The Intel In-Target Probe (ITP) [5] is a JTAG bus with some Intel-specific signals and protocol added. Typically, this is done with a physical Intel ITP interface connected to the processor via the eXtended Debug Port (XDP). Cray has implemented an embedded ITP so that no external hardware needs to be connected to the Cray® XC40™ system. The embedded ITP is used as a processor hardware debug tool by launching and executing scripts to debug the node. These scripts reside on the SMW and are executed via the SMW command, xtitp.

Many of the scripts provide useful hardware debug information about the PCIe configuration and status, processor information, MCA errors, and model-specific register (MSR) data.

To query for any machine check errors in node 0, use the xtitp command as follows:

xtitp -t c3-0c2s15 mca-error-check-all 0

Sometimes a node can crash due to an MCDRAM error. In this case, xtitp can be used to pull the MCDRAM error, as shown in Figure 7.

xtitp -t c0-0c0s13 mca-error-check-all 2

Check for MCA errors - Node 2, Socket 0
MCA_ERR_SRC_LOG = 0x00000000
MCA found in bank 14, socket 0, core 0
IA32_MC14_STATUS = 0xf40000400040009e
IA32_MC14_ADDR = 0x18be8bcbc0
MCA found in bank 0, socket 0, core 46
IA32_MC0_STATUS = 0xf600000092000810
IA32_MC0_ADDR = 0x18be8bcd00
MCA found in bank 0, socket 0, core 47
IA32_MC0_STATUS = 0xf600000092000810
IA32_MC0_ADDR = 0x18be8bcd00

Figure 7. xtitp MCDRAM Error Output

The xtitp command can be used to read the CPUID, as shown in Figure 8.

xtitp -t c3-0c2s15 cpuid-brand-string 3

CPUID Processor Brand String (Functions 0x80000002, 0x80000003, 0x80000004)
Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz

Figure 8. xtitp CPUID Output
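The same brand string that the cpuid-brand-string script retrieves over ITP can be read on any x86 node with the GCC cpuid.h helpers; CPUID leaves 0x80000002 through 0x80000004 return the 48-byte string shown in Figure 8. A minimal sketch:

/* Read the processor brand string via CPUID leaves 0x80000002-4,
 * the same functions referenced in Figure 8. */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned int regs[12];
    char brand[49];

    for (unsigned int i = 0; i < 3; i++)
        __get_cpuid(0x80000002 + i,
                    &regs[4 * i], &regs[4 * i + 1],
                    &regs[4 * i + 2], &regs[4 * i + 3]);

    memcpy(brand, regs, 48);   /* registers hold the ASCII string */
    brand[48] = '\0';
    printf("%s\n", brand);     /* e.g. Intel(R) Xeon Phi(TM) CPU ... */
    return 0;
}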

The xtitp command can also be used to read the Protected Processor Inventory Number (PPIN), as shown in Figure 9.

xtitp -t c3-0c2s15 msr-read 0 0x4f

>>> MSR 0x0000004f (Socket 0, Core 0, Thread 0) <<<
0x5a8e4e432e06638e

Figure 9. xtitp PPIN Output

Executing any script via the xtitp command on the SMW temporarily pauses the node until the data is read from the processor, and it resumes the processor once the read is complete. Running any script therefore causes a temporary performance degradation and should be avoided during normal operations. It is also important to note that if the processor is paused for more than 30 seconds, HSS loses the node heartbeat and considers the node down.

VII. XEON PHI DIAGNOSTICS

The Cray® XC40™ system provides four diagnostic tests for the compute nodes that validate the node's functionality and performance. These diagnostic tests execute under CLE and are installed in /opt/cray/diag/bin.

A. Xeon Phi Memory

The Cray Xeon Phi memory test, xtphimemory, targets all external DDR memory and internal MCDRAM memory. xtphimemory is an effective user-space memory test for stress-testing the memory subsystem. The maximum amount of memory that xtphimemory can test is less than the total amount of memory installed in the system; the kernel, libraries, and other system usage limit the memory available for testing.

The xtphimemory diagnostic provides four primary test algorithms as follows:

• DDR4 Memory Validation: Tests access from each core, L2 cache, and Caching Agent (CHA) to DDR4 DIMM
• MCDRAM Memory Validation: Tests access from each core, L2 cache, and CHA to MCDRAM module
• DDR4 Memory Stress: Stresses all DDR4 memory channels. Optionally, the user can define a stride to use when performing memory accesses, to allow flexibility in creating additional conflicts
• MCDRAM Memory Stress: Stresses all MCDRAM memory channels

The parent process allocates the shared memory region to test, and one child thread is created per division of the shared memory region. For example, a 68-core processor supports 272 threads with Intel® Hyper-Threading Technology enabled. If the node is configured to support four NUMA nodes, the test creates 68 threads per NUMA node. If the user selected two divisions on the command line (ddr_num_slices = 2), each division would support 34 threads. The divisions are a way to group threads within a NUMA node to allow for better failure isolation within the node and memory.
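The division scheme can be pictured with a small pthread sketch: one worker per division, pinned to its own logical CPU, validating a private slice of the shared region so that a miscompare maps back to a specific core. This is only an illustration of the structure described above, not xtphimemory source; the thread count, slice size, and test pattern are arbitrary.

/* Illustrative division scheme: one pinned thread per slice of a
 * shared test region (in the style described for xtphimemory). */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NTHREADS 8            /* one thread per division (arbitrary) */
#define WORDS    (1u << 20)   /* words per division (arbitrary)      */

static uint64_t *region;      /* shared test region */

static void *worker(void *arg)
{
    long id = (long)arg;
    uint64_t *slice = region + id * WORDS;

    /* Pin this thread to one logical CPU for failure isolation. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (uint64_t i = 0; i < WORDS; i++)      /* write pattern  */
        slice[i] = i ^ (uint64_t)id;
    for (uint64_t i = 0; i < WORDS; i++)      /* verify pattern */
        if (slice[i] != (i ^ (uint64_t)id))
            fprintf(stderr, "cpu %ld: miscompare at word %llu\n",
                    id, (unsigned long long)i);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    region = malloc((size_t)NTHREADS * WORDS * sizeof(uint64_t));
    if (!region) return 1;

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    free(region);
    return 0;
}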

It is extremely rare for the xtphimemory test to provide failure information. If the hardware indicates a single-bit error (SBE) has occurred, it is corrected, and the diagnostic does not see the SBE. If the diagnostic forces a double-bit error (DBE), the diagnostic is terminated with an EC_NODE_FAILED status by the kernel. Therefore, it is important to review the SBEs and DBEs that are logged on the SMW. The SMW command, xthwerrlog, can be used to display the output on the SMW for the time period when the diagnostic executes on the node under CLE. Single-bit MCDRAM errors are silent on the node (no machine checks are generated). Double-bit MCDRAM errors cause a correctable MCA to be generated. Any greater-than-two-bit MCDRAM errors are uncorrectable. The MCDRAM errors are logged to the SMW and are viewable using the xthwerrlog command.

The diagnostic reports a data miscompare as shown in Figure 10.

c0-0c2s9n3, nid00167,
testName: testRandomValue,
loopNum: 2, CpuId [88] compare regions test FAILURE:
0x28000 != 0x50000 at offset 0xba62f0.

Figure 10. Memory Diagnostic Data Miscompare Output

B. Xeon Phi Non-Uniform Memory Access (NUMA)

The Cray NUMA test, xtphinuma, validates the NUMA capabilities of a node for any given memory model that is configured with more than one NUMA node. xtphinuma validates both local memory and remote memory. It also provides an interleave test where memory is allocated across all the NUMA nodes. Remote memory in the context of this test does not include memory connected to other nodes that may be accessible via the Aries network; it refers only to memory that is in a "remote" NUMA node within the physical node. To perform the standard NUMA verification, execute the command as follows:

xtphinuma -s 0x9f

The diagnostic allocates buffers of memory and validates the ability of the Intel® Xeon Phi™ CPU 7250 processor to properly read and write the buffers. The libnuma API is used to control the NUMA allocations to local or remote NUMA nodes. The diagnostic uses threads pinned to logical cores to validate that each logical core can access memory allocated under the desired NUMA memory policy.
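A minimal sketch of the libnuma pattern just described: allocate a buffer on a specific (possibly remote) NUMA node, touch it, and verify the contents. The buffer size and node choice are arbitrary; in Flat MCDRAM mode, the highest-numbered NUMA node is typically the MCDRAM. This is illustrative only, not xtphinuma source.

/* Local/remote NUMA allocation check in the style described for
 * xtphinuma (illustrative). Link with -lnuma. */
#include <stdio.h>
#include <stdint.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this node\n");
        return 1;
    }

    size_t bytes = 64u << 20;        /* 64 MiB (arbitrary)           */
    int target = numa_max_node();    /* "remote" node to exercise    */

    uint64_t *buf = numa_alloc_onnode(bytes, target);
    if (!buf) return 1;

    size_t words = bytes / sizeof(uint64_t);
    for (size_t i = 0; i < words; i++)      /* write */
        buf[i] = (uint64_t)i;
    for (size_t i = 0; i < words; i++)      /* read back and verify */
        if (buf[i] != (uint64_t)i)
            fprintf(stderr, "miscompare at word %zu\n", i);

    numa_free(buf, bytes);
    return 0;
}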
xtphinuma provides a bandwidth test that performs writes and reads for local and remote NUMA nodes for each core. To perform the NUMA bandwidth verification, execute the command as follows:

xtphinuma -s 0x60

xtphinuma also provides a stress test that exercises all cores simultaneously to stress the memory paths from each CPU to local and remote memory. To perform the NUMA stress verification, execute the command as follows:

xtphinuma -s 0x100

The diagnostic reports all data miscompares that are detected, as shown in Figure 11.

c0-0c2s15n1, nid00189, Fail: Data Failure for word: 1245416 Exp Data: 1245416 Act Data: 0
c0-0c2s15n1, nid00189, Fail: Data Failure for word: 1245417 Exp Data: 1245417 Act Data: 0
c0-0c2s15n1, nid00189, Fail: Data Failure for word: 1245418 Exp Data: 1245418 Act Data: 0

Figure 11. NUMA Diagnostic Data Miscompare Output

The diagnostic may also fail when allocating memory, in which case it outputs an appropriate error message such as the following:

• allocateNumaMemoryInterleave failed
• allocateNumaMemory failed
• allocateNumaMemory failed for expected data
• allocateNumaMemory failed for read buffer
• allocateNumaMemoryInterleave failed for write buffer
• allocateNumaMemory failed for write buffer

C. Xeon Phi Processor Performance

The Cray Xeon Phi performance test, xtphiperf, provides a computationally intensive processor test to validate the node. This test outputs the performance and the power for the processor during each pass of the diagnostic test. The test uses the standard CBLAS DGEMM and validates the results of the DGEMM matrix multiply. The diagnostic supports a command line option to enable MPI. When MPI is enabled, the diagnostic compares the results across the other nodes under test using the same input data.

Three arrays are allocated to maximize memory usage, as shown in Figure 12.

A x B = C

Array A:[k,m], B:[n,k], and C:[n,m]

Figure 12. DGEMM Maximum Memory Usage

An extra matrix is used to store the initial C matrix data, as shown in Figure 13.

So {k x m} + {n x k} + {n x m} + {n x m} < max mem

Where the default values are: m = 8192, n = 8192, k = 8192

Figure 13. DGEMM Matrix Multiply

xtphiperf can target the DDR4, the MCDRAM, or both. This is controlled through the command line option -t. To effectively target both DDR4 and MCDRAM, the user must select a matrix size (m, n, k) greater than the available DDR4 memory, or the -d option should be set to option 3, NUMA node.

The xtphiperf diagnostic provides significant control over testing each processor core. The diagnostic outputs the performance, power, and temperature for each iteration, as shown in Figure 14.

c0-0c0s14n3, nid00059, Iteration, GFLOPS, Power(W), Temp(C)
c0-0c0s14n3, nid00059, 0, 1964, 215, 38
c0-0c0s14n3, nid00059, 1, 1968, 224, 40
c0-0c0s14n3, nid00059, 2, 1968, 223, 42
c0-0c0s14n3, nid00059, 3, 1978, 222, 44
c0-0c0s14n3, nid00059, 4, 1978, 219, 44

Figure 14. Performance Diagnostic Standard Output

When the xtphiperf diagnostic detects a residual error, it returns a non-zero return code and prints the actual and expected values, as shown in Figure 15.

13:52:20, c0-0c2s12n2, nid00178, 3, 2039.3, 197.806, 44
13:52:21, c0-0c2s12n2, nid00178, Failed: CPU actual: 502.630097504761,
CPU expected: 502.63009941210

Figure 15. Performance Diagnostic Residual Output

When the xtphiperf diagnostic detects a performance issue, the diagnostic prints the actual GFLOPS and the expected GFLOPS, as shown in Figure 16.

13:55:22, c0-0c2s12n2, nid00178, 1, 1999.9, 197.906, 46
13:55:22, c0-0c2s12n2, nid00178, Failed:
actual: 1899.9 GFLOPS, expected greater than: 1900 GFLOPS

Figure 16. Performance Diagnostic Performance Failure Output

The user's expected performance target is defined in the xtphiperf.ini file. It is important to note that xtphiperf assumes that the node is dedicated to performance testing. Any other jobs that are running on the node affect the performance of the node.
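To make the sizing rule in Figure 13 concrete: with the default m = n = k = 8192, each double-precision matrix occupies 8192 x 8192 x 8 bytes = 512 MiB, so the four matrices together need 2 GiB. The sketch below shows the validation pattern at a reduced size; the input values, dimensions, and bitwise comparison are illustrative stand-ins for xtphiperf's actual input sets and residual tolerance.

/* DGEMM validation in the style of xtphiperf (illustrative).
 * Computes C = A x B twice and compares against the saved first
 * result. Link with a CBLAS implementation. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    int m = 1024, n = 1024, k = 1024;   /* reduced from the 8192 defaults */
    /* Footprint per Figure 13: {k*m} + {n*k} + {n*m} + {n*m} doubles. */
    double *A  = malloc(sizeof(double) * k * m);
    double *B  = malloc(sizeof(double) * n * k);
    double *C  = malloc(sizeof(double) * n * m);
    double *C0 = malloc(sizeof(double) * n * m);  /* saved result */
    if (!A || !B || !C || !C0) return 1;

    for (int i = 0; i < k * m; i++) A[i] = (double)(i % 7);
    for (int i = 0; i < n * k; i++) B[i] = (double)(i % 5);

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C0, n);   /* reference */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A, k, B, n, 0.0, C, n);    /* test pass */

    for (int i = 0; i < m * n; i++)       /* repeated run should match */
        if (C[i] != C0[i]) {
            printf("Failed: CPU actual: %.12f, CPU expected: %.12f\n",
                   C[i], C0[i]);
            return 1;
        }
    printf("pass\n");
    free(A); free(B); free(C); free(C0);
    return 0;
}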

D. Xeon Phi Processor Stress Test

The diagnostic, xtphinls, is both a validation test and a stress test for the node. This test is a collection of diagnostic tests that validate the functionality of a specific node. The test supports concurrent execution of independent test programs exercising all or part of the node resources. Two diagnostic tests are executed concurrently on different cores: xtphimemory and xtphiperf. These are the same diagnostics that were previously discussed; however, in this case the diagnostics are run as single-threaded tests, with multiple copies of these tests executing in different threads on a core. Currently, the odd threads run the memory test and the even threads run the performance test. In this way, one core is executing a computationally intense test while the next core is executing a memory-intensive test. The combination of these two tests running on alternating cores stresses the node. Each test is pinned to a thread, which is pinned to a core, so on failure the failing core is properly identified.

Each thread also reports a heartbeat status to the parent thread to indicate the general thread health. The thread manager within the diagnostic watches for the heartbeat and reports any stalled or failed threads.

xtphinls only verifies the results, so when a test fails the validation check of the diagnostic, it outputs a failed message, as shown in Figure 17.

aprun -n 1 -N 1 -L 225 --cc=none ./xtphinls -v 1
xtphinls -v 1

c1-0c0s8n1, nid00225, Version 1.0
c1-0c0s8n1, nid00225, Test, ThreadID, Status
c1-0c0s8n1, nid00225, xtphimemory, 17532, Pass
c1-0c0s8n1, nid00225, xtphimemory, 17534, Failed
.
.
c1-0c0s8n1, nid00225, xtphiperf, 17629, Failed
c1-0c0s8n1, nid00225, xtphiperf, 17757, Pass
c1-0c0s8n1, nid00225, Number of passing tests: 128
c1-0c0s8n1, nid00225, Number of failing tests: 128
c1-0c0s8n1, nid00225, Test Failed

Figure 17. Node Stress Diagnostic Failure Output
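The alternating odd/even placement and the heartbeat reporting can be sketched as follows. The thread count, work stubs, and heartbeat counters are illustrative stand-ins for the real memory and performance tests.

/* Alternating stress pattern in the style of xtphinls (illustrative):
 * even threads run a compute stub, odd threads a memory stub, and
 * each posts a heartbeat that the parent checks for stalls. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NCORES 8                       /* arbitrary for the sketch */
static atomic_ulong heartbeat[NCORES]; /* one counter per worker   */

static void *worker(void *arg)
{
    long core = (long)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET((int)core, &set);          /* pin thread to its core */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    volatile double x = 1.0;
    static volatile char mem[1 << 22]; /* shared memory-stress stub */
    for (;;) {
        if (core & 1)                          /* odd: memory stub   */
            for (size_t i = 0; i < sizeof mem; i += 64) mem[i]++;
        else                                   /* even: compute stub */
            for (int i = 0; i < 1000000; i++) x = x * 1.0000001 + 0.5;
        atomic_fetch_add(&heartbeat[core], 1); /* report liveness */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NCORES];
    unsigned long last[NCORES] = {0};

    for (long i = 0; i < NCORES; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    for (int pass = 0; pass < 10; pass++) {    /* thread manager */
        sleep(5);
        for (int i = 0; i < NCORES; i++) {
            unsigned long now = atomic_load(&heartbeat[i]);
            if (now == last[i])
                fprintf(stderr, "thread on core %d stalled\n", i);
            last[i] = now;
        }
    }
    return 0;   /* exiting main terminates the worker threads */
}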
E. Xeon Phi Processor Check

The diagnostic utility, xtphicheck, provides a simple verification of the nodes within the system. It uses MPI to gather basic information about each node and returns the data to stdout. Any nodes that report values different from the root node (the first node in the list) are reported with the cname, nid, and value. Sample output is shown in Figure 18.

aprun -n 2 -N 1 -L 50-51 /opt/cray/diag/bin/knl/xtphicheck -v 3

/opt/cray/diag/bin/knl/xtphicheck -v 3
/opt/cray/diag/bin/knl/xtphicheck -v 3

c0-0c0s12n2 nid00050 2 node(s) Sockets : 1
c0-0c0s12n2 nid00050 2 node(s) Processor : KnightsLanding
c0-0c0s12n2 nid00050 2 node(s) Cores Per Socket : 68
c0-0c0s12n2 nid00050 2 node(s) CPU MHz : 1401.000
c0-0c0s12n2 nid00050 2 node(s) L1d cache : 32K
c0-0c0s12n2 nid00050 2 node(s) L1i cache : 32K
c0-0c0s12n2 nid00050 2 node(s) L2 cache : 1024K
c0-0c0s12n2 nid00050 2 node(s) L3 cache : N/A
c0-0c0s12n2 nid00050 2 node(s) Memory Total : 98880724kB
c0-0c0s12n2 nid00050 2 node(s) Memory cluster config: a2a
c0-0c0s12n2 nid00050 2 node(s) Memory cache config: cache
c0-0c0s12n2 nid00050 2 node(s) Mcdram Total : 0x400000000
c0-0c0s12n2 nid00050 2 node(s) Numa Nodes : 1

Figure 18. Node Diagnostic Utility Output
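The cross-node comparison follows the usual MPI pattern: every rank collects a local value, the root gathers them, and any rank whose value differs from rank 0 is reported. A minimal sketch using the logical CPU count as the compared value; the real utility compares the full set of fields shown in Figure 18.

/* Gather one per-node value and flag ranks that differ from the
 * root, in the style of xtphicheck (illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long local = sysconf(_SC_NPROCESSORS_ONLN);  /* one node attribute */
    long *all = NULL;
    if (rank == 0)
        all = malloc(sizeof(long) * size);

    MPI_Gather(&local, 1, MPI_LONG, all, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 1; i < size; i++)           /* compare to root */
            if (all[i] != all[0])
                printf("rank %d differs: %ld != %ld\n", i, all[i], all[0]);
        free(all);
    }
    MPI_Finalize();
    return 0;
}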

The xtphicheck utility proved extremely useful during initial bring-up and testing of nodes within the system. It also provides a simple check to validate that the node's configuration is set as expected.

VIII. WORKLOAD TEST SUITE (WTS)

The Workload Test Suite (WTS) consists of a control script, xtsystest, and a number of benchmarks and diagnostics. The benchmarks and diagnostic tests are used to simulate a generic application workload to verify that the system is ready to execute user applications. In some cases, a customer may have one or two applications that are representative of the workload on-site. In other cases, standard benchmarks are used as follows:

1. Intel MPI Benchmarks (IMB): Performs a set of MPI performance measurements for point-to-point and global communication operations for a range of message sizes [8].
2. High Performance Computing Challenge (HPCC): Consists of seven tests including HPL, DGEMM, STREAM, PTRANS, Random Access, FFT, and communication bandwidth and latency [6].
3. High Performance Linpack (HPL): Cray uses HPL as two different tests: one to test out the processor running HPL (DGEMM), and the second to use as large a memory footprint as possible [7].

These benchmarks have been updated to support the Intel® Xeon Phi™ CPU 7250 processor. The Workload Test Suite is comprised of a script, xtsystest.py, a default configuration file, xtsystest.ini, a set of standard component test modules located under the tests folder, and a set of utilities located under the util folder. When executed, the tool sequentially executes the defined set of benchmarks, diagnostics, and applications. The site can also define a custom configuration file. Upon completion, the script outputs a summary of all the tests that have been executed.

By default, xtsystest.py continuously runs the set of tests defined in xtsystest.ini on the maximum number of compute nodes available until a signal is sent. However, it is possible to limit the set of tests that are executed, the number of times each test is executed, and the set of resources under test, using the available command-line options. There is also a command line option to show the results of a given session upon completion. Each test script has a corresponding check utility that provides additional information about a failure.

On larger systems, especially systems with multiple rows, it is recommended that an instance of the workload script be targeted to each row within the system. Each instance of the WTS script should also be initiated from a separate login node. The xtsystest.py script launches a diagnostic or benchmark on each node within the system concurrently, and it logs each instance individually. Therefore, a 10k-node system will start 10k copies of xhpl and will generate 10k output files. This has been shown to quickly consume the login node and WLM resources.

The Workload Test Suite executes on CLE and supports ALPS and SLURM. SLURM support is available as a patch for CLE 6.0 UP02, CLE 6.0 UP03, and CLE 6.0 UP04.

IX. SSD DIAGNOSTIC UTILITIES

There are new utilities and diagnostics used to debug and verify the configured PCIe SSD cards within the Cray® XC™ Series. There is a new diagnostic utility, xtcheckssd, that executes in CLE on the DataWarp node or the compute node to verify the SSD card [13]. This diagnostic utility is periodically scheduled to check the health of the SSD cards on DataWarp nodes. On compute nodes, the SSD utility is run manually as required by the site. It reports the status to the SMW via RCA. The system administrator can query and display the current SSD health, as well as the historical data. They can also query and display the results of the SSD diagnostic utility on the SMW.

The two tools available under CLE are described in TABLE V.

TABLE V. CLE SSD DIAGNOSTIC TOOLS AND UTILITIES

Name          Description
xtcheckssd    Report SSD health
xtiossdflash  Update the firmware on the SSD card

xtcheckssd reports the health of the attached PCIe SSD cards. The diagnostic utility supports both the DataWarp SSD cards and the optional SSD cards attached to the node on the Cray PDC. The xtcheckssd output is shown in Figure 19.

PCIe slot#:1,Name:INTEL SSDPECME040T4,
SN:CVF8515300094P0DGN-1, Size: 4000GB, Remaining life:100%,
Temperature:22(c)
PCIe slot#:1,Name:INTEL SSDPECME040T4,
SN:CVF8515300094P0DGN-2, Size: 4000GB, Remaining life:100%,
Temperature:24(c)
PCIe slot#:0,Name:INTEL SSDPECME040T4,
SN:CVF85153001V4P0DGN-1, Size: 4000GB, Remaining life:100%,
Temperature:22(c)
PCIe slot#:0,Name:INTEL SSDPECME040T4,
SN:CVF85153001V4P0DGN-2, Size: 4000GB, Remaining life:100%,
Temperature:24(c)

Figure 19. xtcheckssd SSD Health Output

The SSD utility, xtssdconfig, which is used to display the SSD configuration information, is available on the SMW.

When a DataWarp SSD reaches 90% of its life expectancy, a message is written to the console log file. If enabled, the Simple Event Correlator (SEC) monitors system log files for significant events such as this and sends a notification (by email, IRC, writing to a file, or some user-configurable combination of all three) that this has happened [9]. The xtcheckssd results are logged on the SMW in the xtdiag log.

Additionally, xtcheckhss reports the PCIe-attached SSD cards, including the targeted and trained PCIe speed and width, as shown in Figure 20.

====== PCIe Card Info ======
Node        Slot  Name                   Target Gen  Trained Gen  Target Width  Trained Width
c0-0c2s0n0  0     Samsung_SM951_M.2_SSD  Gen2        Gen2         x4            x4
c0-0c2s0n1  0     Samsung_SM951_M.2_SSD  Gen2        Gen2         x4            x4
c0-0c2s0n2  0     Samsung_SM951_M.2_SSD  Gen2        Gen2         x4            x4
c0-0c2s0n3  0     Samsung_SM951_M.2_SSD  Gen2        Gen2         x4            x4

Figure 20. xtcheckhss SSD Configuration

This ensures that the state of the PCIe SSD card is known at the time of boot.
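The trained link state that xtcheckhss reports in Figure 20 can also be cross-checked from CLE on the node itself, since recent Linux kernels expose the current and maximum PCIe link speed and width per device in sysfs. The bus/device/function address below is a placeholder for the SSD slot:

/* Read the negotiated PCIe link speed/width for one device from
 * sysfs, to cross-check the trained values in Figure 20. The
 * device address is a placeholder. */
#include <stdio.h>

static void show(const char *attr)
{
    char path[256], buf[64];
    snprintf(path, sizeof path,
             "/sys/bus/pci/devices/0000:02:00.0/%s", attr);
    FILE *f = fopen(path, "r");
    if (f && fgets(buf, sizeof buf, f))
        printf("%-18s %s", attr, buf);
    if (f) fclose(f);
}

int main(void)
{
    show("max_link_speed");      /* target */
    show("max_link_width");
    show("current_link_speed");  /* trained */
    show("current_link_width");
    return 0;
}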
X. RELATED WORK

As mentioned previously, an earlier paper described in detail the Cray® XC™ system-level diagnosability toolset [1]. That paper focused on the Aries high-speed network (HSN), compute processors, co-processors, GPUs, and Cray® XC™ cabinet power and cooling. The system diagnostics included boot, confidence, stress, performance, and workload diagnostics, as well as error and data reporting.

A subsequent paper then described the Cray® XC™ system node-level diagnosability toolset [2] in greater detail, with various detailed troubleshooting examples. The Cray® XC™ system diagnosability roadmap [3] was also presented. The roadmap covered Intel® node, Nvidia® GPU, and I/O diagnosability enhancements. It also covered the Workload Test Suite (WTS), HSS diagnostic utilities, telemetry data, HSS controller monitoring, notification, and the Cray® Data Virtualization Service (Cray DVS).

D. Petesch and M. Swan [16] described mechanisms to properly instrument a user-level application to quickly check out the hardware and software components in a large external Lustre® file system [18]. That paper also provided insights into full-scale performance issues and how IOR [17] can be used to troubleshoot them.

M. Swan [15] described the mechanisms and tool sets required to tune and analyze Cray® Sonexion 1600 performance issues in a Cray® XC™ system.

J. Fullop and R. Sisneros [19] created a framework for identifying causes of observed differences in job performance from one run to another. Their approach is to analyze the various logs over a period of time, looking for any noteworthy events. This work was done at the University of Illinois at Urbana-Champaign on the Blue Waters system.

Finally, A. DeConinck et al. [20] described a scalable monitoring system for Trinity. The authors' work with the Baler log-file analysis tool to extract a greatly reduced set of patterns from the voluminous number of logs on a Cray® XC™ system is particularly important as the number of nodes and controllers increases and the requirement for additional automated analysis grows.

XI. SUMMARY

Cray has provided a number of diagnostics, commands, and utilities that enhance the diagnosability of the Cray® XC40™ system. The focus of system diagnosability has been on ensuring that each component is functioning properly, by ensuring that each component can be validated and that all data is captured at the time of failure.

Each aspect of the tool chain has been enhanced on the Cray® XC40™ system to better ensure that the system is performing as expected. The changes to support the Intel® Xeon Phi™ CPU 7250 processor include BIOS, SMW commands, utilities, and diagnostics, as well as power and thermal telemetry data and event logs. Future enhancements are planned to continue to improve the system diagnosability of the Cray® XC40™ system.

REFERENCES

[1] J. Schutkoske, "Cray XC System Level Diagnosability: Commands, Utilities and Diagnostic Tools for the Next Generation of HPC Systems", Proceedings of the Cray User Group (CUG), 2014, https://cug.org/proceedings/cug2014_proceedings/includes/files/pap120.pdf
[2] J. Schutkoske, "Cray XC System Node Level Diagnosability", Proceedings of the Cray User Group (CUG), 2015, https://cug.org/proceedings/cug2015_proceedings/includes/files/pap130.pdf
[3] J. Schutkoske, "Cray XC System Diagnosability Roadmap", Proceedings of the Cray User Group (CUG), 2015, https://cug.org/proceedings/cug2015_proceedings/includes/files/pap131.pdf
[4] Intelligent Platform Management Interface Specification, version 2.0, 2004, http://www.intel.com/design/servers/ipmi/spec.htm
[5] ITP700 Debug Port, Design Guide, February 2004, http://download.intel.com/support/processors/xeon/sb/24967914.pdf
[6] High Performance Computing Challenge (HPCC) Benchmark, Version 1.4.2, [Online], http://icl.cs.utk.edu/hpcc/
[7] High Performance Linpack (HPL) Benchmark, Version 2.1, [Online], http://www.netlib.org/benchmark/hpl/
[8] Intel MPI Benchmark, Version 3.2.4, [Online], https://software.intel.com/en-us/articles/intel-mpi-benchmarks
[9] "XC™ Series SEC Software Configuration Guide", (CLE 6.0.UP03), S-2542, http://docs.cray.com/PDF/XC_Series_SEC_Software_Configuration_Guide_CLE60UP03_S-2542.pdf
[10] "XC™ Series System Environment Data Collections (SEDC) Administration Guide", (CLE 6.0.UP03), S-2491, http://docs.cray.com/PDF/XC_Series_System_Environment_Data_Collections_SEDC_Administration_Guide_CLE60UP03_S-2491.pdf
[11] "XC™ Series Power Management Administration Guide", (CLE 6.0.UP03), S-0043, http://docs.cray.com/PDF/XC_Series_Power_Management_Administration_Guide_CLE60UP03_S-0043.pdf
[12] "CAPMC API Documentation", http://docs.cray.com/PDF/CAPMC_API_Documentation_1.2.pdf
[13] "XC™ Series DataWarp™ Installation and Administration Guide", (CLE 6.0.UP03), S-2564, http://docs.cray.com/PDF/XC_Series_DataWarp_Installation_and_Administration_Guide_CLE60UP03_S-2564.pdf
[14] S. Martin, D. Rush, M. Kappel, M. Sandstedt, and J. Williams, "Cray XC40 Power Monitoring and Control for Knights Landing", Proceedings of the Cray User Group (CUG), 2016, https://cug.org/proceedings/cug2016_proceedings/includes/files/tut103.pdf
[15] M. Swan, "Tuning and Analyzing Sonexion Performance", Proceedings of the Cray User Group (CUG), 2014, https://cug.org/proceedings/cug2014_proceedings/includes/files/pap119.pdf
[16] D. Petesch and M. Swan, "Instrumenting IOR to Diagnose Performance Issues on Lustre File Systems", Proceedings of the Cray User Group (CUG), 2013, https://cug.org/proceedings/cug2013_proceedings/includes/files/pap152.pdf
[17] IOR HPC Benchmark, Version 2.10.3, [Online], https://github.com/LLNL/ior
[18] Lustre filesystem, [Online], http://lustre.org
[19] J. Fullop and R. Sisneros, "A Diagnostic Utility For Analyzing Periods of Degraded Job Performance", Proceedings of the Cray User Group (CUG), 2014, https://cug.org/proceedings/cug2014_proceedings/includes/files/pap161.pdf
[20] A. DeConinck, A. Bonnie, K. Kelly, S. Sanchez, C. Martin, M. Mason, J. Brandt, A. Gentile, B. Allan, A. Agelastos, M. Davis, and M. Berry, "Design and implementation of a scalable monitoring system for Trinity", Proceedings of the Cray User Group (CUG), 2016, https://cug.org/proceedings/cug2016_proceedings/includes/files/pap126.pdf
[21] P. Hill, C. Snyder, and J. Sygulla, "KNL System Software", Proceedings of the Cray User Group (CUG), 2017
