Cross-Core Event Monitoring for Processor Failure Prediction
Felix Salfner, Peter Tröger, and Steffen Tschirpke
Humboldt University Berlin
{salfner,troeger,tschirpke}@informatik.hu-berlin.de

ABSTRACT

A recent trend in the design of commodity processors is the combination of multiple independent execution units on one chip. With the resulting increase of complexity and transistor count, it becomes more and more likely that a single execution unit on a processor becomes faulty. In order to tackle this situation, we propose an architecture for dependable process management in chip-multiprocessing machines. In our approach, execution units monitor each other to anticipate future hardware failures. The prediction relies on the analysis of processor hardware performance counters by a statistical rank-sum test. Initial experiments with the Intel Core processor platform proved the feasibility of the approach, but also showed the need for further investigation due to a high variation in prediction quality in most of the cases.

KEYWORDS: multi-core, fault injection, performance counter, failure prediction

1. INTRODUCTION

The support for multiple parallel activities has a long history in computer hardware design. Besides the traditional support for multiple processors in one system, there was always also a class of solutions described as chip multi-threading [1]. It exploits parallelism inside the processor chip by instruction-level parallelism or simultaneous multithreading.

The most recent extension for chip multi-threading is the chip multi-processing (CMP) approach, where a set of independent execution units ('cores') is packaged as one processor chip. The widely promoted new 'era' of multi-core systems basically focuses on the introduction of such CMP design in standard desktop processors. A report from Berkeley predicts CMP processors with thousands of parallel execution units as the mainstream hardware of the future [2].

With the ever-increasing number of transistors on these chips, hardware reliability is about to become a pressing issue in the upcoming years. AMD already sells triple-core processors that were originally intended as quad-core processors, but contain one disabled defective execution unit. This shows that new paradigms and approaches are needed to ensure dependability of modern CMP-based computer systems.

One way to improve dependability (meaning performance and availability) at run-time in imperfect systems is the proactive failure management approach. In contrast to classical fault tolerance techniques, which typically react after the problem has occurred, it relies on the short-term anticipation of upcoming failures. This is realized by a continuous evaluation of the system state at runtime. A successful prediction initiates a subsequent proactive phase, where the effects of a possible failure are compensated in advance. A typical countermeasure is migrating workload or data to redundant resources.

In our presented online failure prediction approach, we treat the multiple execution units of modern CMP hardware as a set of redundant computational resources. They are either used for load balancing as originally intended or as a spare resource in case of a partial processor failure. The proactive part can be realized by process migration or controlled application shutdown.

Fault anticipation requires continuous system state information, in order to detect patterns of anomalies that indicate an upcoming failure. System monitoring usually relies on either operating system-specific or application-specific solutions. The technique investigated in this work operates on the processor hardware level only, which makes it operating system as well as application independent. The cores monitor each other and predict upcoming failures by analyzing hardware event sampling data.

This paper presents our initial experiences with this approach. It is organized as follows: Section 2 describes our approach in detail, Section 3 explains the experiment setup, Section 4 discusses the obtained measurement results, Section 5 discusses some related work, and Section 6 analyzes the relevant next steps towards a sufficient processor failure prediction solution.

2. APPROACH

From the different information sources available in modern processor hardware, performance event monitoring provides the most detailed information. Performance events are signaled by hardware components in the execution engine, for example after the completion (or retirement) of a microinstruction, a cache miss, a branch misprediction, or a misaligned memory access. Performance events of a typical CMP processor are monitored by configuring built-in hardware counters with an event type and overflow threshold value. Depending on the particular processor type, multiple parallel counters can be activated at the same time. With each counter overflow, a hardware interrupt is triggered that can be used to save a sample of the processor state at the time of overflow. Each sample contains all monitored counter and register values, including the current instruction pointer. The primary idea of these functionalities is to support performance profiling tools, which map the context information to an investigated running application. Our approach, however, aims at the utilization of these values as a representation of the whole execution engine state.

An initial investigation showed that hardware performance event monitoring is available in all major CMP platforms. Each vendor provides corresponding hardware counter support; the solutions mainly differ in the set of monitorable events per execution engine. The Intel Core technology architecture distinguishes between model-specific ('non-architectural') and somewhat standardized ('architectural') processor performance events [3]. The AMD multi-core processor families provide a comparable, but smaller, set of performance events. The SPARC CMP processor series offers performance instrumentation counters (PIC) for different event types [4]. Different versions of the IBM POWER processor line also support hardware performance counters [5]. Overall it can be assumed that hardware performance event counting is a common feature of modern CMP processors.

2.1. CROSS-CORE MONITORING

Our research hypothesis assumes that there is a detectable change in the behavior of hardware performance events before a failure of the particular core occurs. If a failure prediction algorithm is able to detect such a change in the event pattern early enough, it can trigger preventive actions accordingly. The general idea is to let the monitored core (that is also running the workload) periodically trigger an interrupt routine that moves observed data samples to a small buffer on the second core. The second core performs the failure prediction based on the monitoring values in the buffer. If the prediction algorithm detects a problem that might lead to a core failure, a warning signal is sent to the workload application. It can then perform actions to cope with the failure-prone situation, e.g., it might be check-pointed or moved by an external entity such as the operating system scheduler.

The concept of cross-core event monitoring demands some specific conditions. The chosen algorithm must work with a relatively small computational overhead, in order to perform the failure prediction as a background task during normal operation. The processor hardware needs to support event counting with comparatively low overhead. Finally, predictions must be accurate enough to justify preventive actions, especially with respect to false warnings.

2.2. EVENT TYPE REDUCTION

All CMP platforms with hardware performance monitoring support offer a large number of events. However, only a very limited number can be monitored at the same time, due to the restricted number of hardware counting units and their internal wiring [6]. It is therefore necessary to identify a very small indicative set of event counters in the first step. We analyzed correlations among counters available in our experiment setup, in order to remove those that behave very similarly to other counters. This allows measuring as many independent event types as possible on one processor core at the same time.

The analysis is realized by running the chosen workload with different counter configurations, each containing one possible combination of event types. The two sampled data sets are then analyzed for their Spearman's rank correlation coefficient. This specific measure has the advantage of not assuming any frequency distribution of the variables, which is in fact not known for performance event samples. The result is a list of unique event type combinations to be monitored during run-time.

Additionally, a qualitative reduction of the counter set can be done. Events with an obviously strictly monotonic behavior, such as the number of cycles elapsed, can be sorted out. They would not show any significantly different behavior in the failure case, which renders them irrelevant for the prediction task.

2.3. SAMPLING RATE

Each sampling approach demands the choice of an appropriate sampling rate, either based on a time or event count threshold. A time-based approach would lead to major technical difficulties, since modern processors have [...]

[...] position in the combined data set. The sum of the created new ranks is compared to a pre-computed threshold. The predictor issues a failure warning if the rank sum deviates from its expected value by more than a threshold. In the example, the test values tend to be larger than the