Cross-Core Event Monitoring for Processor Failure Prediction

Felix Salfner, Peter Tröger, and Steffen Tschirpke
Humboldt University Berlin
{salfner,troeger,tschirpke}@informatik.hu-berlin.de

ABSTRACT

A recent trend in the design of commodity processors is the combination of multiple independent execution units on one chip. With the resulting increase of complexity and transistor count, it becomes more and more likely that a single execution unit on a processor gets faulty. In order to tackle this situation, we propose an architecture for dependable management in chip-multiprocessing machines. In our approach, execution units survey each other to anticipate future hardware failures. The prediction relies on the analysis of processor hardware performance counters by a statistical rank-sum test. Initial experiments with the Intel Core processor platform proved the feasibility of the approach, but also showed the need for further investigation due to a high prediction quality variation in most of the cases.

KEYWORDS: multi-core, fault injection, performance counter, failure prediction

1. INTRODUCTION

The support for multiple parallel activities has a long history in computer design. Beside the traditional support for multiple processors in one system, there was always also a class of solutions described as chip multi-threading [1]. It exploits parallelism inside the processor chip by instruction-level parallelism or simultaneous multithreading.

The most recent extension of chip multi-threading is the chip multi-processing (CMP) approach, where a set of independent execution units ('cores') is packaged as one processor chip. The widely promoted new 'era' of multi-core systems basically focuses on the introduction of such CMP designs in standard desktop processors. A report from Berkeley predicts CMP processors with thousands of parallel execution units as the mainstream hardware of the future [2].

With the ever-increasing number of transistors on these chips, hardware reliability is about to become a pressing issue in the upcoming years. AMD already sells triple-core processors that were originally intended as quad-core processors, but contain one disabled defective execution unit. This shows that new paradigms and approaches are needed to ensure the dependability of modern CMP-based computer systems.

One way to improve dependability (meaning performance and availability) at run-time in imperfect systems is the proactive failure management approach. In contrast to classical fault tolerance techniques, which typically react after the problem has occurred, it relies on the short-term anticipation of upcoming failures. This is realized by a permanent evaluation of the system state at runtime. A successful prediction initiates a subsequent proactive phase, in which the effects of a possible failure are compensated in advance. One typical countermeasure is migrating workload or data to redundant resources.

In our online failure prediction approach, we treat the multiple execution units of modern CMP hardware as a set of redundant computational resources. They are either used for load balancing, as originally intended, or as spare resources in case of a partial processor failure. The proactive part can be realized by process migration or controlled application shutdown.

Fault anticipation requires continuous system state information in order to detect patterns of anomalies that indicate an upcoming failure. System monitoring usually relies on either operating-system-specific or application-specific solutions. The technique investigated in this work operates on the processor hardware level only, which makes it operating system as well as application independent. The cores monitor each other and predict upcoming failures by analyzing hardware event sampling data.

This paper presents our initial experiences with this approach. It is organized as follows: Section 2 describes our approach in detail, Section 3 explains the experiment setup, Section 4 discusses the obtained measurement results, Section 5 discusses related work, and Section 6 analyses the relevant next steps towards a sufficient processor failure prediction solution.

2. APPROACH

Of the different information sources available in modern processor hardware, performance event monitoring provides the most detailed information. Performance events are signaled by hardware components in the execution engine, for example after the completion (resp. retirement) of a microinstruction, a cache miss, a branch misprediction, or a misaligned memory access. Performance events of a typical CMP processor are monitored by configuring built-in hardware counters with an event type and an overflow threshold value. Depending on the particular processor type, multiple parallel counters can be activated at the same time. With each counter overflow, a hardware interrupt is triggered that can be used to save a sample of the processor state at the time of overflow. Each sample contains all monitored counter and register values, including the current instruction pointer.

The primary purpose of these facilities is to support performance profiling tools, which map the context information to an investigated running application. Our approach, however, aims at the utilization of these values as a representation of the whole execution engine state.

An initial investigation showed that hardware performance event monitoring is available on all major CMP platforms. Each vendor provides the according hardware counter support; the solutions mainly differ in the set of monitorable events per execution engine. The Intel Core technology architecture distinguishes between model-specific ('non-architectural') and somewhat standardized ('architectural') processor performance events [3]. The AMD multi-core processor families provide a comparable, but smaller, set of performance events. The SPARC CMP processor series offers performance instrumentation counters (PIC) for different event types [4]. Different versions of the IBM POWER processor line also support hardware performance counters [5]. Overall, it can be assumed that hardware performance event counting is a common feature of modern CMP processors.

2.1. CROSS-CORE MONITORING

Our research hypothesis assumes that there is a detectable change in the behavior of hardware performance events before a failure of the particular core occurs. If a failure prediction is able to detect such a change in the event pattern early enough, it can trigger preventive actions accordingly. The general idea is to let the monitored core (which is also running the workload) periodically trigger an interrupt routine that moves observed data samples to a small buffer on the second core. The second core performs the failure prediction based on the monitoring values in the buffer. If the prediction algorithm detects a problem that might lead to a core failure, a warning signal is sent to the workload application. It can then perform actions to cope with the failure-prone situation, e.g., it might be check-pointed or moved by an external entity such as the operating system scheduler.
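To make the buffering scheme concrete, the following minimal sketch models the two roles as plain Python threads. It is an illustration only, not the authors' implementation: the counter-overflow interrupt is imitated by whoever calls push(), and all names (SampleBuffer, prediction_core_loop, the predict and warn callbacks) are hypothetical.

    import threading
    from collections import deque

    class SampleBuffer:
        """Bounded buffer holding the most recent counter samples;
        stands in for the small per-core sample buffer described above."""
        def __init__(self, size):
            self.samples = deque(maxlen=size)   # oldest samples fall out
            self.lock = threading.Lock()

        def push(self, sample):
            """Called by the 'interrupt routine' on the monitored core."""
            with self.lock:
                self.samples.append(sample)

        def snapshot(self):
            """Called by the predicting core."""
            with self.lock:
                return list(self.samples)

    def prediction_core_loop(buffer, predict, warn, stop, period=0.1):
        """Runs on the surveying core: periodically test the latest
        samples and raise a warning signal on a suspicious pattern."""
        while not stop.is_set():                # stop is a threading.Event
            window = buffer.snapshot()
            if window and predict(window):      # predict() returns True on anomaly
                warn()                          # e.g. signal the workload application
            stop.wait(period)

In the real system the push() side would be driven by the hardware counter-overflow interrupt rather than by a thread, and the prediction loop would be pinned to the surveying core.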
The concept of cross-core event monitoring demands some specific conditions. The chosen algorithm must work with a relatively small computational overhead, in order to perform the failure prediction as a background task during normal operation. The processor hardware needs to support event counting with comparatively low overhead. Finally, predictions must be accurate enough to justify preventive actions, especially with respect to false warnings.

2.2. EVENT TYPE REDUCTION

All CMP platforms with hardware performance monitoring support offer a large number of events. However, only a very limited number can be monitored at the same time, due to the restricted number of hardware counting units and their internal wiring [6]. It is therefore necessary to identify a small but indicative set of counters in a first step. We analyzed correlations among the counters available in our experiment setup, in order to remove those that behave very similarly to other counters. This allows measuring as many independent event types as possible on one processor core at the same time.

The analysis is realized by running the chosen workload with different counter configurations, each containing one possible combination of event types. The sampled data sets are then analyzed pairwise for their Spearman rank correlation coefficient. This measure has the advantage of not assuming any frequency distribution of the variables, which is in fact not known for performance event samples. The result is a list of unique event type combinations to be monitored during run-time.

Additionally, a qualitative reduction of the counter set can be done. Events with an obviously strictly monotonic behavior, such as the number of cycles elapsed, can be sorted out. They would not show any significantly different behavior in the failure case, which renders them irrelevant for the prediction task.
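As an illustration of the correlation-based reduction described above, the sketch below drops one counter of every pair whose sample traces are strongly rank-correlated. The inputs are hypothetical: a dict mapping event type names to equally long sample traces, and a cutoff of 0.9 that is our illustrative choice, not a value from the paper.

    from scipy.stats import spearmanr

    def reduce_event_types(traces, cutoff=0.9):
        """traces: {event_name: [sample values]}; returns the names kept.
        Spearman's rho is used because it makes no assumption about the
        frequency distribution of the counter values."""
        kept = []
        for name, series in traces.items():
            redundant = False
            for other in kept:
                rho, _ = spearmanr(series, traces[other])
                if abs(rho) >= cutoff:      # behaves like an already kept counter
                    redundant = True
                    break
            if not redundant:
                kept.append(name)
        return kept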

2.3. SAMPLING RATE

Each sampling approach demands the choice of an according sampling rate, based on either a time or an event count threshold. A time-based approach would lead to major technical difficulties, since modern processors have no fixed timing behavior due to frequency scaling, pipelining, and out-of-order execution. Even the classical Pentium time stamp counter (TSC) is not guaranteed to provide a constant rate on some processor models [3]. Our experimental setup therefore relies on a threshold for the number of processed instructions on the core. This gives us a constant scale with respect to the execution of the load application. Workload-based sampling has also been shown to be more appropriate in other areas of failure prediction [7,8].

2.4. FAILURE PREDICTION ALGORITHM

We used the Wilcoxon statistical rank-sum test as failure prediction approach. A one-sided version of the test has successfully been applied to a comparable scenario [9]. The test has the relevant characteristics for our environment: No assumptions about the form of the distribution have to be made, e.g., the counter values need not be normally distributed (it is a nonparametric test). It is based on ordinal statistics, i.e., it can handle outliers more robustly. It is computationally light, meaning that it can be performed in parallel to other workloads on the predicting core.

The Wilcoxon rank-sum test compares a test data set to a reference data set in order to determine whether the medians are about the same. In our case, the reference data set corresponds to CPU counter values that have been measured and stored during normal operation without failures. The test data set consists of the samples that were measured during runtime. Note that the test data set can be much smaller than the reference data set.

Figure 1. Example for Rank Sum Test

Figure 1 shows an explanatory example. A test data set, which is observed during runtime, is merged and sorted with the stored reference data set. This is done to determine the ranks of the test values according to their position in the combined data set. The sum of these ranks is compared to a pre-computed threshold: the predictor issues a failure warning if the rank sum deviates from its expected value by more than the threshold. In the example, the test values tend to be larger than the reference data, and the resulting rank sum is hence larger than expected. In our specific failure prediction scenario, the last n samples from the monitored core are compared to the reference data set. The assumption is that prior to a failure the counter's median deviates significantly from the median of the reference data set.
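The following sketch mirrors this procedure directly: it ranks the combined data, sums the ranks of the test window, and warns when the sum deviates too far from its expected value under the no-change hypothesis. Expressing the warning threshold as a multiple of the rank sum's standard deviation is our illustrative parameterization, not taken from the paper.

    import numpy as np
    from scipy.stats import rankdata

    def rank_sum_warning(reference, test, z_threshold=3.0):
        """Wilcoxon rank-sum check of the last n samples against the
        reference data set. Returns True if a failure warning is due."""
        reference = np.asarray(reference, dtype=float)
        test = np.asarray(test, dtype=float)
        n, m = len(test), len(reference)
        ranks = rankdata(np.concatenate([test, reference]))  # ties get mid-ranks
        w = ranks[:n].sum()                  # rank sum of the test window
        expected = n * (n + m + 1) / 2.0     # E[W] if both sets share a median
        std = np.sqrt(n * m * (n + m + 1) / 12.0)
        return abs(w - expected) > z_threshold * std

A one-sided variant, as used in [9], would compare only w - expected (without the absolute value) against the threshold.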

3. EXPERIMENT

In order to set up an experimental environment for our concept, several choices had to be made, including the choice of a particular hardware platform, a performance event monitoring approach, and a fault injection technique.

We performed our experiments with an Intel Core2 Quad CPU (Q6600) at 2.40 GHz with 2 GB of memory and a Linux 2.6 64-bit operating system. The tested system was connected to a monitoring computer by a custom-made heartbeat line. The monitoring system had the responsibility to automatically reset the tested multi-core system, either in case of non-reachability by ICMP ping or after a maximum time period. Every reset initiated a new test run on the multi-core machine, which allowed us to run automated tests with different combinations of monitored performance events. The performance counter monitoring was realized with a modified version of the perfmon2 toolkit for Linux.
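The external reset logic just described can be summarized in a few lines. The sketch below is a hypothetical stand-in for the monitoring computer's watchdog: it assumes the test machine is probed by ICMP ping and that some out-of-band reset mechanism exists (the hard_reset hook is invented for illustration; the paper's custom heartbeat line is not modeled).

    import subprocess
    import time

    def ping(host):
        """One ICMP echo request; True if the host answered."""
        return subprocess.call(["ping", "-c", "1", "-W", "2", host],
                               stdout=subprocess.DEVNULL) == 0

    def watchdog(host, hard_reset, max_run_seconds=3600, poll=10):
        """Reset the machine under test when it stops answering pings
        or when the maximum duration of a test run is exceeded."""
        started = time.time()
        while True:
            if not ping(host) or time.time() - started > max_run_seconds:
                hard_reset()             # out-of-band reset starts a new test run
                started = time.time()
            time.sleep(poll)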

Performance data collection on Intel processors (since the Pentium 4) can operate in three different modes: Event counting instructs one of the processor counters to count an event type; the monitoring software periodically fetches the counter value from a register. Non-precise event-based sampling instructs a counter to generate a performance monitoring interrupt (PMI) on overflow. With precise event-based sampling (PEBS), the CPU stores its architectural state on overflow in a special memory buffer on its own. Even though PEBS provides better measurement accuracy [6], it is only available for a small subset of events. We therefore decided to stick with the PMI sampling approach.

Since the set of somewhat standardized 'architectural' performance events of the Intel Core technology architecture is very limited [3], we were forced to use the model-specific ('non-architectural') event types. After the reduction step, we ended up with 31 distinct Intel performance event types to be tested.

All Intel processors based on the Core micro-architecture work with an L2 cache shared between two cores on one die [10]. The quad-core models are realized as a combination of two such dies, so that each die has its own shared L2 cache, independent of the other die [11]. We accounted for this hardware design by excluding one die's cores from the regular operating system scheduler.

3.1. FAULT INJECTION AND WORKLOAD

The experimental analysis of hardware failure prediction demands a suitable fault injection technique. Since failures occur too rarely under normal conditions, erroneous hardware behavior must be triggered explicitly. The most obvious choice is over-clocking of the processor hardware; due to the danger of permanent damage to the system hardware, we rejected this approach. Instead, we opted for an under-volting approach, where the CPU core voltage is set to a level below normal operation. This option is offered by many motherboards for gaming and over-clocking.

In order to generate workload on the monitored core, we used the Mersenne prime number test application MPRIME. Initial experiments showed that within a certain voltage reduction range (25%-30% in our case), the system is in a semi-stable operational mode where it starts to generate machine check exceptions (MCE) for the particular core during the execution of MPRIME, but not during normal operation. This allowed us to boot the tested machine, start the performance measurement, and trigger the core failure by executing the load application on one of the cores. Figure 2 illustrates the setup.

Figure 2. Experiment Setup

MCEs normally express an unrecoverable failure in the hardware operation; operating systems therefore stop the machine in case of such an event. In order to be able to save the relevant last samples before the core failure, we reconfigured the Linux kernel to continue operating as far as possible in case of an MCE. This resulted in a behavior where the CPU continued to operate for a short period of time after the initial MCE (there were always subsequent fatal ones). Since the inherent MCE logging of the Linux kernel provided us with a core-specific TSC value, we were able to identify the last performance sample before the actual processor failure.

3.2. PREDICTION EVALUATION

In order to determine how accurate a failure prediction algorithm is, three data sets of counter values are needed. The first is a reference data set, containing values from normal operation. The second is a test data set containing values recorded prior to CPU failures; several runs are needed in order to obtain a statistically significant number of such failure data sets. The third data set contains test values from normal operation; it is necessary in order to determine the false positive rate of the prediction implementation.

After the three sets have been recorded, we analyzed the data offline in order to determine the quality of the failure predictor for different event types. Data recording and analysis had to be performed for each CPU counter.

Several metrics exist to express the accuracy of a prediction; one of them is called accuracy itself. However, it can be shown that accuracy is not an appropriate metric to evaluate predictions in the case of skewed classes. Since failures occur far more rarely than non-failure examples, this is the case in our target scenario. We therefore decided on receiver operating characteristic (ROC) curves and the corresponding area under curve (AUC) metric as suitable measures for evaluation.

ROC curves express the true positive rate (tpr) over the false positive rate (fpr). The true positive rate denotes the fraction of true failures that have been predicted, i.e., for which a warning has been issued; the false positive rate is the fraction of non-failure cases for which a failure warning was issued although no failure was coming up:

tpr = (failure warnings on true failures) / (all true failures)
fpr = (failure warnings on non-failures) / (all non-failure cases)

Both quantities can be estimated from the three recorded data sets, leading to one (fpr; tpr) tuple per performance event type. A perfect predictor would achieve (0; 1), which means that all true failures are predicted (tpr = 1) without any false alarms (fpr = 0). By varying the rank-sum comparison threshold, a tuple can be determined for each threshold, resulting in the ROC curve. Since curves cannot be compared numerically, the area under the ROC curve is calculated for comparison. A perfect predictor would achieve an AUC of one. A random predictor, i.e., a predictor that randomly warns about an upcoming failure, results in a linear ROC curve with slope one (tpr = fpr) and AUC = 0.5.
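The ROC construction can be written down compactly: sweep the warning threshold, compute one (fpr, tpr) tuple per threshold, and integrate. The sketch below assumes hypothetical inputs, namely one rank-sum deviation score per failure and per non-failure sequence; a warning is issued whenever the score exceeds the threshold.

    import numpy as np

    def roc_auc(failure_scores, nonfailure_scores):
        """Sweep the warning threshold over all observed scores and
        return the ROC points plus the area under the curve."""
        failure_scores = np.asarray(failure_scores, dtype=float)
        nonfailure_scores = np.asarray(nonfailure_scores, dtype=float)
        thresholds = np.unique(np.concatenate([failure_scores,
                                               nonfailure_scores]))
        points = [(1.0, 1.0)]                    # threshold below all scores
        for t in thresholds:
            tpr = np.mean(failure_scores > t)    # warnings on failures / failures
            fpr = np.mean(nonfailure_scores > t) # false alarms / non-failures
            points.append((fpr, tpr))
        points.append((0.0, 0.0))                # threshold above all scores
        points = sorted(points)
        xs, ys = zip(*points)
        return points, np.trapz(ys, xs)          # trapezoidal AUC estimate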

Failure prediction evaluation, i.e., the ROC curve generation and subsequent AUC computation, is performed on a finite set of event sampling data. It follows that the obtained results are only a stochastic estimate of the true prediction accuracy. In order to assess the exactness of the evaluation, we apply bootstrapping techniques [10]. Bootstrapping involves repeated random resampling in order to determine means and confidence intervals of an estimate. We plotted average ROC curves and added box-whisker plots at selected threshold values. The boxes of the plots show the first quartile, median, and third quartile. The whiskers indicate the minimum and maximum of non-outlier data, and circles denote mild outliers (between 1.5 and three times the inter-quartile range below the first or above the third quartile).
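A minimal version of such a bootstrap, assuming the roc_auc helper from the previous sketch and the same hypothetical score arrays: resample the recorded sequences with replacement, recompute the AUC each time, and read the quartiles off the resulting distribution.

    import numpy as np

    def bootstrap_auc(failure_scores, nonfailure_scores,
                      repetitions=1000, seed=0):
        """Bootstrap distribution of the AUC estimate; returns the
        first quartile, median, and third quartile."""
        failure_scores = np.asarray(failure_scores, dtype=float)
        nonfailure_scores = np.asarray(nonfailure_scores, dtype=float)
        rng = np.random.default_rng(seed)
        aucs = []
        for _ in range(repetitions):
            f = rng.choice(failure_scores, size=len(failure_scores),
                           replace=True)
            n = rng.choice(nonfailure_scores, size=len(nonfailure_scores),
                           replace=True)
            _, auc = roc_auc(f, n)           # helper from the previous sketch
            aucs.append(auc)
        return np.percentile(aucs, [25, 50, 75])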

4. RESULTS

We investigated 31 performance counters with unique behavior and collected about 30 failures for each of the counters. Several sources of randomness occurred in the experiment. The reference data set was extracted from a random test run; within that reference test run, it was extracted from a random timestamp within a fault-free region. In order to determine the fpr, non-failure sequences were tested against the reference data set. Since performing predictions on all monitored samples is computationally not feasible, non-failure sequences were sampled from random positions within fault-free regions of randomly selected test runs.

Since failure sequences were rare due to technical limitations (about 30 per counter), all failure sequences were used to determine tpr. We also determined the variability introduced by these random factors with up to 3250 repetitions of the prediction step.

4.1. ROC CURVES AND AUC

Our experimental work has shown three major groups of event types, each behaving differently in terms of prediction feasibility. The first group of counters acted like purely random predictors. 24 out of 31 counters were in this category.

A second group sometimes performed very well, but sometimes also worked very badly; these counters showed a great variability among experiments. A third group of counters always worked better than a random predictor; however, their variability was also comparatively high.

Figure 3. Example for High Variability Group

Figure 3 shows an example for the second group, the highly varying counters. The solid line shows that on average across all experiments, the counter behaves like a random predictor. However, taking into account that a predictor with AUC = 0.23 can be turned into a predictor with AUC = 1 - 0.23 = 0.77 by simply inverting the predictor's output, rather good predictions could be obtained quite frequently. It seemed as if the quality of the predictor depends quite heavily on the reference data set. It remains an open question at this point whether these counters can be turned into consistently good predictors. Validation and model selection techniques could be used to sort out reference data sets that do not perform well. In our experiments, five out of the 31 counters belong to this group.

Figure 4. Example for High AUC Group

Two out of the 31 counters were situated in the third group of predictors that consistently perform better than a random prediction (minimum AUC value of 0.68). Figure 4 shows a comparatively good predictor achieving AUC values of up to 0.91. It can be seen from the figure that when a false positive rate of 20% is accepted (fpr = 0.2), in one experiment about 90% (maximum whisker of the plot) of all failures could be predicted. While a false positive rate of 20% is not acceptable in many fault-tolerance scenarios, it might be less of a problem for the intended cross-core surveillance scenario, since the cost of an unnecessarily performed proactive recovery is comparatively low. Nevertheless, similar to the second group, variability was still quite high here. Validation and model selection techniques might also help to arrive at a more concise predictor. Note that these techniques are applied offline and do not make the computation of predictions during runtime more complex.

One of the reasons why only two counters were in the third group was the fault injection approach of this work. In most cases, it generated exactly the same kind of MCE, specifically a hardware problem with the bus used for memory access. The RESOURCE_STALLS.ROB_FULL performance event shown in Figure 4 is clearly related to this particular failure. It expresses the number of cycles the processor was stalled because the re-order buffer was full, which happens when long-latency operations (such as loads and stores without a cache hit) prevent progress in the pipeline. The memory access hardware problem triggered by the under-voltage operation advertised itself here by an unusual change in this particular counter value.

5. RELATED WORK

There has been exhaustive research on monitoring and predicting failures of specific components in computer systems in the past.

A widely known approach is the S.M.A.R.T. monitoring of hard disk hardware parameters. A study by Pinheiro et al. [11] showed that the relevant counters do not provide any useful prediction in over 56% of the cases. Hughes et al. [9] accordingly investigated other techniques, such as the rank-sum test, to improve prediction quality for these counters.

Liang et al. [12] analyzed more than 100 days of failure logs of a BlueGene/L parallel computer system and derived matching failure prediction methods. The developed strategies relied on the burst nature and spatial skewness of failure events, properties that are not directly applicable to hardware performance events as in our case.

Vaidyanathan et al. [13] estimated the exhaustion of system resources as a function of time and system workload. This is realized by constructing a semi-Markov reward model, a typical approach that is infeasible for our online surveillance approach.

Vilalta and Ma's approach [14] extracts a set of indicative error events in order to predict the occurrence of target failure events. A multi-stage sorting approach detects event sets that are typical before a chosen target event. The opposite approach is the dispersion frame technique by Lin [15], which relies only on the interval time between successive error events. Both approaches rely on an understanding of error events, a kind of information that is not provided by hardware performance counters. It remains an open question whether alternative information sources about the processor might render one of these approaches feasible for our scenario.

Some researchers use hardware performance counters for online performance analysis. Azimi et al. [16] showed an approach where application stall times are used for sampling activities, which reduced the application interference through sampling to zero. In our experiments, we achieved the same effect by simply adjusting the sampling rate to a value small enough not to influence the execution of the application.

Hardware fault injection for processors is a comparatively unusual approach, since most methodologies rely on software-implemented fault injection. One example is the work by Carreira et al. [17], who work on the pin level for pure hardware fault injection.

6. CONCLUSION AND FUTURE WORK

We presented the concept and design of a proactive management framework for partial processor hardware failures, based on cross-core event monitoring and failure prediction. Our initial experiments with the Intel Core technology proved the feasibility of the concepts, but leave room for further improvements and investigations.

Future work first needs to focus on the improvement of the fault injection approach, in order to generate more types of (partial) processor failures. So far, the event types that work for prediction match exactly the processor failure triggered by the under-voltage fault injection. In order to be able to react to other MCEs as well, alternative fault injection mechanisms must be applied.

Using non-failed cores as spare resources also opens the question of fault containment. A hardware problem within one execution unit might also affect the other cores in the system, making them unavailable for recovery activities. The degree of spreading is related to the nature of the underlying fault (e.g., an L1 cache defect vs. signaling defects). A discussion of the according fault models was left out in favor of the prediction mechanism analysis, but should be included in future research.

It should also be noted that the presented approach is less portable than classical software event monitoring for failure prediction purposes. Hardware counters are specific to each particular processor platform, so the right set of event types is specific per CPU design. On the other hand, these information sources are completely decoupled from the software executed on the machine. We will work on a meta-model for the hardware performance events of different CMP architectures, in order to render our approach more general.

ACKNOWLEDGEMENTS

This work was sponsored in part by Intel.

REFERENCES

[1] L. Spracklen and S. G. Abraham, "Chip Multithreading: Opportunities and Challenges," in 11th International Symposium on High-Performance Computer Architecture (HPCA-11), 2005, pp. 248-252.

[2] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech. Rep. UCB/EECS-2006-183, December 2006.

[3] Intel Corporation, "Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2," Nov. 2008.

[4] Sun Microsystems Inc., "UltraSPARC Architecture 2007," Nov. 2008.

[5] B. Sprunt, "Managing The Complexity Of Performance Monitoring Hardware: The Brink And Abyss Approach," Int. J. High Perform. Comput. Appl., vol. 20, no. 4, pp. 533-540, 2006.

[6] B. Sprunt, "Pentium 4 Performance-Monitoring Features," IEEE Micro, vol. 22, no. 4, pp. 72-82, 2002.

[7] K. S. Trivedi, K. Vaidyanathan, and K. Goseva-Popstojanova, "Modeling and Analysis of Software Aging and Rejuvenation," in Proceedings of the IEEE Annual Simulation Symposium, Apr. 2000.

[8] A. Andrzejak and L. Silva, "Deterministic Models of Software Aging and Optimal Rejuvenation Schedules," in 10th IEEE/IFIP International Symposium on Integrated Network Management (IM '07), May 2007, pp. 159-168.

[9] G. Hughes, J. Murray, K. Kreutz-Delgado, and C. Elkan, "Improved disk-drive failure warnings," IEEE Transactions on Reliability, vol. 51, no. 3, pp. 350-357, Sep. 2002.

[10] B. Efron, "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, vol. 7, no. 1, pp. 1-26, 1979.

[11] E. Pinheiro, W. D. Weber, and L. A. Barroso, "Failure Trends in a Large Disk Drive Population," in Proc. of the FAST '07 Conference on File and Storage Technologies, 2007.

[12] Y. Liang, Y. Zhang, A. Sivasubramaniam, M. Jette, and R. Sahoo, "BlueGene/L Failure Analysis and Prediction Models," in IEEE Proceedings of the International Conference on Dependable Systems and Networks (DSN 2006), Jun. 2006, pp. 425-434.

[13] K. Vaidyanathan and K. S. Trivedi, "A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems," in Proceedings of the International Symposium on Software Reliability Engineering (ISSRE), Nov. 1999.

[14] R. Vilalta and S. Ma, "Predicting Rare Events in Temporal Domains," in Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM '02). Washington, DC, USA: IEEE Computer Society, 2002, pp. 474-482.

[15] T.-T. Y. Lin, "Design and evaluation of an on-line predictive diagnostic system," Ph.D. dissertation, Department of Electrical and Computer Engineering, Carnegie-Mellon University, Pittsburgh, PA, Apr. 1988.

[16] R. Azimi, M. Stumm, and R. W. Wisniewski, "Online performance analysis by statistical sampling of microprocessor performance counters," in ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing. New York, NY, USA: ACM, 2005, pp. 101-110.

[17] J. Carreira, D. Costa, and J. Silva, "Fault injection spot-checks computer system dependability," IEEE Spectrum, vol. 36, pp. 50-55, Aug. 1999.