
Delft University of Technology Parallel and Distributed Systems Report Series

Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo, Alexandru Iosup, and Dick Epema
The Failure Trace Archive. Web: fta.inria.fr Email: [email protected]

Report number PDS-2010-004


ISSN 1387-2109

Published and produced by:
Parallel and Distributed Systems Section
Faculty of Information Technology and Systems
Department of Technical Mathematics and Informatics
Delft University of Technology
Zuidplantsoen 4
2628 BZ Delft
The Netherlands

Information about Parallel and Distributed Systems Report Series: [email protected]
Information about Parallel and Distributed Systems Section: http://pds.twi.tudelft.nl/

© 2010 Parallel and Distributed Systems Section, Faculty of Information Technology and Systems, Department of Technical Mathematics and Informatics, Delft University of Technology. All rights reserved. No part of this series may be reproduced in any form or by any means without prior written permission of the publisher.


Abstract

The analysis and modeling of the failures bound to occur in today's large-scale production systems is invaluable in providing the understanding needed to make these systems fault-tolerant yet efficient. Many previous studies have modeled failures without taking into account the time-varying behavior of failures, under the assumption that failures are identically and independently distributed. However, the presence of time correlations between failures (such as peak periods with increased failure rate) refutes this assumption and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, a proactive fault-tolerance mechanism is more effective if the failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the time-varying behavior of failures. In this study we analyze and model the time-varying behavior of failures in large-scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods occurring in real large-scale distributed systems. Our model characterizes the duration of peaks, the peak inter-arrival time, the inter-arrival time of failures during the peaks, and the duration of failures during peaks; for each parameter we determine the best-fitting distribution from a set of several candidate distributions, and present the parameters of the best fit. Last, we validate our model against the nineteen real failure traces, and find that the failures it characterizes are responsible on average for over 50% and up to 95% of the downtime of these systems.



Contents

1 Introduction
2 Method
  2.1 Failure Datasets
  2.2 Analysis
  2.3 Modeling
3 Analysis of Autocorrelation
  3.1 Failure Autocorrelations in the Traces
  3.2 Discussion
4 Modeling the Peaks of Failures
  4.1 Peak Periods Model
  4.2 Results
5 Related Work
6 Conclusion



List of Figures

1 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions
2 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.)
3 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.)
4 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.)
5 Daily and hourly failure rates for the Overnet, Grid'5000, Notre-Dame (CPU), and PlanetLab platforms
6 Parameters of the peak periods model
7 Peak duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter

List of Tables

1 Summary of nineteen data sets in the Failure Trace Archive
2 Average values for the model parameters
3 Empirical distribution for the peak duration parameter
4 Peak model: the parameter values for the best-fitting distributions for all studied systems
5 The average duration and average IAT of failures for the entire traces and for the peaks
6 Fraction of downtime and fraction of number of failures due to failures that originate during peaks (k = 1)



1 Introduction

Large-scale distributed systems have reached an unprecedented scale and complexity in recent years. At this scale failures inevitably occur: networks fail, disks crash, packets get lost, bits get flipped, software misbehaves, or systems crash due to misconfiguration and other human errors. Deadline-driven or mission-critical services are part of the typical workload for these infrastructures, which thus need to be available and reliable despite the presence of failures. Researchers and system designers have already built numerous fault-tolerance mechanisms that have been proven to work under various assumptions about the occurrence and duration of failures. However, most previous work focuses on failure models that assume the failures to be non-correlated, which may not be realistic for the failures occurring in large-scale distributed systems. For example, such systems may exhibit peak failure periods, during which the failure rate increases, affecting in turn the performance of fault-tolerance solutions. To investigate such time correlations, we perform in this work a detailed investigation of the time-varying behavior of failures using nineteen traces obtained from several large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids.

Recent studies report that in production systems failure rates can reach over 1,000 failures per year and that, depending on the root cause of the corresponding problems, the time to repair can range from hours to days [16]. The increasing scale of deployed distributed systems causes the failure rates to increase, which in turn can have a significant impact on performance and cost, such as degraded response times [21] and an increased Total Cost of Operation (TCO) due to increased administration costs and human resource needs [1]. This situation also motivates the need for further research in failure characterization and modeling.

Previous studies [18, 13, 12, 14, 20, 16] focused on characterizing failures in several different distributed systems. However, most of these studies assume that failures occur independently or disregard the time correlation of failures, despite the practical importance of these correlations [11, 19, 15]. First of all, understanding whether failures are time correlated has significant implications for proactive fault-tolerance solutions. Second, understanding the time-varying behavior of failures and the peaks observed in failure patterns is required for evaluating design decisions. For example, redundant submissions may all fail during a failure peak period, regardless of the quality of the resubmission strategy. Third, understanding the temporal correlations and exploiting them for smart checkpointing and scheduling decisions provides new opportunities for enhancing conventional fault-tolerance mechanisms [21, 7]. For example, a simple scheduling policy could be to stop scheduling large parallel jobs during failure peaks. Finally, it is possible to devise adaptive fault-tolerance mechanisms that adjust their policies based on information about peaks. For example, an adaptive fault-tolerance mechanism can migrate the computation at the beginning of a predicted peak.

To understand the time-varying behavior of failures in large-scale distributed systems, we perform a detailed investigation using data sets from diverse large-scale distributed systems, including more than 100K hosts and 1.2M failure events spanning over 15 years of system operation. Our main contribution is threefold:

1. We make four new failure traces publicly available through the Failure Trace Archive (Section 2).
2. We present a detailed evaluation of the time correlation of failure events observed in traces taken from nineteen (production) distributed systems (Section 3).
3. We propose a model for peaks observed in the failure rate process (Section 4).

2 Method

2.1 Failure Datasets

In this work we use and contribute to the data sets in the Failure Trace Archive (FTA) [9]. The FTA is an online public repository of availability traces taken from diverse parallel and distributed systems. The FTA makes failure traces available online in a unified format, which records the occurrence time and duration of resource failures as an alternating sequence of availability and unavailability intervals.



Table 1: Summary of nineteen data sets in the Failure Trace Archive.

System            Type          Nodes    Period     Year       Events
Grid'5000         Grid          1,288    1.5 years  2005-2006  588,463
COND-CAE (1)      Grid          686      35 days    2006       7,899
COND-CS (1)       Grid          725      35 days    2006       4,543
COND-GLOW (1)     Grid          715      33 days    2006       1,001
TeraGrid          Grid          1,001    8 months   2006-2007  1,999
LRI               Desktop Grid  237      10 days    2005       1,792
Deug              Desktop Grid  573      9 days     2005       33,060
Notre-Dame (2)    Desktop Grid  700      6 months   2007       300,241
Notre-Dame (3)    Desktop Grid  700      6 months   2007       268,202
Microsoft         Desktop Grid  51,663   35 days    1999       1,019,765
UCB               Desktop Grid  80       11 days    1994       21,505
PlanetLab         P2P           200-400  1.5 years  2004-2005  49,164
Overnet           P2P           3,000    2 weeks    2003       68,892
Skype             P2P           4,000    1 month    2005       56,353
Websites          Web servers   129      8 months   2001-2002  95,557
LDNS              DNS servers   62,201   2 weeks    2004       384,991
SDSC              HPC clusters  207      12 days    2003       6,882
LANL              HPC clusters  4,750    9 years    1996-2005  43,325
PNNL              HPC cluster   1,005    4 years    2003-2007  4,650

(1) The COND-* data sets denote the Condor data sets. (2) This is the host availability version, which follows the multi-state availability model of Brent Rood. (3) This is the CPU availability version.

Each availability or unavailability event in a trace records the start and the end of the event, and the resource that was affected by the event. Depending on the trace, the resource affected by the event can be either a node of a distributed system, such as a node in a grid, or a component of a node, such as the CPU or memory. Prior to the work leading to this article, the FTA made fifteen failure traces available in its standard format; as a result of our work, the FTA now makes nineteen failure traces available. Table 1 summarizes the characteristics of these nineteen traces, which we use throughout this work. The traces originate from systems of different types (multi-cluster grids, desktop grids, peer-to-peer systems, DNS and web servers) and sizes (from hundreds to tens of thousands of resources), which makes them ideal for a study across different distributed systems. Furthermore, many of the traces cover several months of system operation. A more detailed description of each trace is available on the FTA web site (http://fta.inria.fr).
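
To make this format concrete, here is a minimal sketch of one way to represent such an event list in Python; the type and field names are illustrative choices, not the FTA's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Event:
    resource_id: str   # node (or node component) affected by the event
    start: float       # event start time, in seconds since the trace start
    end: float         # event end time, in seconds since the trace start
    available: bool    # True for an availability interval, False for unavailability

def total_downtime(events):
    """Total unavailable time, summed over all resources."""
    return sum(e.end - e.start for e in events if not e.available)
```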

2.2 Analysis

In our analysis, we use the autocorrelation function (ACF) to measure the degree of correlation of the failure time series data with itself at different time lags. The ACF takes on values between -1 (high negative correlation) and 1 (high positive correlation). In addition, the ACF reveals whether the failures are random or periodic: for random data the correlation coefficients are close to zero, while a periodic component in the ACF reveals that the failure data is periodic or at least has a periodic component.
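
As an illustration, the following sketch derives the hourly failure rate process from a list of failure start times (assumed to be in seconds from the trace start) and computes its sample ACF:

```python
import numpy as np

def hourly_failure_rate(failure_starts, trace_length):
    """Count failure events per one-hour bin (the failure rate process)."""
    bins = np.arange(0, trace_length + 3600, 3600)
    rate, _ = np.histogram(failure_starts, bins=bins)
    return rate

def acf(x, max_lag):
    """Sample autocorrelation of the series x for lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.sum(x * x)  # lag-0 autocovariance (unnormalized)
    return np.array([np.sum(x[:-k] * x[k:]) / var for k in range(1, max_lag + 1)])
```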

2.3 Modeling

In the modeling phase, we statistically model the peaks observed in the failure rate process, i.e., the number of failure events per time unit. Towards this end we use the Maximum Likelihood Estimation (MLE) method [2] for fitting the probability distributions to the empirical data, as it delivers good accuracy for the large data samples specific to failure traces.



After we determine the best fit for each candidate distribution for all data sets, we perform goodness-of-fit tests to assess the quality of the fit for each distribution, and to establish the best fit. As the goodness-of-fit tests, we use both the Kolmogorov-Smirnov (KS) and the Anderson-Darling (AD) tests, which essentially assess how close the cumulative distribution function (CDF) of the fitted distribution is to the CDF of the empirical data. For each candidate distribution with the parameters found during the fitting process, we formulate the hypothesis that the empirical data are derived from it (the null-hypothesis of the goodness-of-fit test). Neither the KS nor the AD test can confirm the null-hypothesis, but both are useful in understanding the goodness-of-fit. For example, the KS test provides a test statistic, D, which characterizes the maximal distance between the CDF of the empirical distribution of the input data and that of the fitted distribution; distributions with a lower D value across different failure traces are better. Similarly, the tests return p-values, which are used either to reject the null-hypothesis if the p-value is smaller than or equal to the significance level, or to confirm that the observation is consistent with the null-hypothesis if the p-value is greater than the significance level. Consistent with the standard method for computing p-values [12, 9], we average 1,000 p-values, each of which is computed by selecting 30 samples randomly from the data set, to calculate the final p-value for the goodness-of-fit tests.
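
A sketch of this procedure using SciPy follows, with the five candidate distributions of Section 4.1. The function name fit_and_test is ours, SciPy's pareto stands in for the paper's (unspecified) Pareto variant, and the 30-sample subsets and 1,000 rounds follow the text.

```python
import numpy as np
from scipy import stats

CANDIDATES = {
    "exponential": stats.expon,
    "weibull": stats.weibull_min,
    "pareto": stats.pareto,       # assumed variant; the paper only says "Pareto"
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}

def fit_and_test(data, n_rounds=1000, subsample=30, seed=0):
    """MLE-fit each candidate distribution, then report the KS statistic D
    on the full data and the p-value averaged over random 30-sample subsets."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    results = {}
    for name, dist in CANDIDATES.items():
        params = dist.fit(data)  # maximum likelihood estimation
        d = stats.kstest(data, dist.cdf, args=params).statistic
        pvals = [stats.kstest(rng.choice(data, size=subsample, replace=False),
                              dist.cdf, args=params).pvalue
                 for _ in range(n_rounds)]
        results[name] = (d, float(np.mean(pvals)))
    return results
```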

3 Analysis of Autocorrelation

In this section we present the analysis of the autocorrelation in failures using traces obtained from grids, desktop grids, P2P systems, web servers, DNS servers, and HPC clusters. We consider the failure rate process, that is, the number of failure events per time unit.

3.1 Failure Autocorrelations in the Traces

Our aim is to investigate whether the occurrence of failures is repetitive in our data sets. Toward this end, we compute the autocorrelation of the failure rate for different time lags including hours, weeks, and months. Figure 1 shows for several platforms the failure rate at different time granularities, and the corresponding autocorrelation functions. Overall, we find that many systems exhibit strong correlation from small to moderate time lags, confirming that failures are indeed repetitive in many of the systems.

Many of the systems investigated in this work exhibit strong autocorrelation for hourly and weekly lags. Figures 1(a)-1(f), 2(a)-2(f), and 3(a)-3(c) show the failure rates and autocorrelation functions for the Grid'5000, Condor (CAE), LRI, Deug, UCB, Notre-Dame (CPU), Microsoft, PlanetLab, Overnet, Skype, Websites, LDNS, SDSC, and LANL systems, respectively. The Grid'5000 data set is a one and a half year long trace collected from an academic research grid. Since this system is mostly used for experimental purposes, and is large-scale (∼3K processors), the failure rate is quite high. In addition, since most of the jobs are submitted through the OAR resource manager, hence without direct user interaction, the daily pattern is not clearly observable. However, during the summer the failure rate decreases, which indicates a correlation between the system load and the failure rate. Finally, as the system size increases over the years, the failure rate does not increase significantly, which indicates system stability. The Condor (CAE) data set is a one month long trace collected from a desktop grid using the Condor cycle-stealing scheduler. As expected from a desktop grid, this trace exhibits daily peaks in the failure rate, and hence in the autocorrelation function. In contrast to other desktop grids, the failure rate is lower. The LRI data set is a ten day long trace collected from a desktop grid. The main trend of this trace is enforced by users who power their machines off during weekends. The Deug and UCB data sets are nine and eleven day long traces collected from desktop grids. In contrast to LRI, machines in these systems are shut down every night. This difference is captured well by the autocorrelation functions: LRI has moderate peaks for lags corresponding to one or two days, while a clear daily pattern is visible in the Deug and UCB systems.



[Figure 1 panels: (a) Grid'5000, (b) Condor (CAE), (c) LRI, (d) Deug, (e) UCB, (f) Notre-Dame (CPU).]

Figure 1: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions.

The Notre-Dame (CPU) data set is a six month long trace collected from a desktop grid. This trace models the CPU availability of the nodes in the system. Similar to other desktop grids, Notre-Dame (CPU) presents clear daily and weekly patterns, with high autocorrelation for short time lags. The Microsoft data set is a thirty-five day long trace collected from 51,663 desktop machines of Microsoft employees during summer 1999. It shows similar behavior to other desktop grids, with high autocorrelation for short time lags, and a clear daily pattern with few failures during the weekends. We observe a correlation between the workload intensity and the failure rate in this data set, as the failure rate is higher on work days than on non-work days. The PlanetLab data set is a one and a half year long trace collected from a desktop grid of 692 nodes distributed throughout several academic institutions around the world. Failures are highly correlated for small time lags, but this correlation is not stable over time; the average availability changes on large time scales. Moreover, the user presence, and hence the increased load, has an impact on the correlation: the failure rate strongly decreases during the summer. The Overnet data set is a one week long trace collected from a P2P system of 3,000 clients. In such a P2P system the clients may frequently join or leave the network, and all clients which are not connected at initialization are considered as down, explaining the initial peak of failures. We observe strong autocorrelation for short time lags, similar to other P2P systems.



[Figure 2 panels: (a) Microsoft, (b) PlanetLab, (c) Overnet, (d) Skype, (e) Websites (hourly), (f) Websites (weekly).]

Figure 2: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).



[Figure 3 panels: (a) LDNS, (b) SDSC, (c) LANL failure rates for 2002, 2004, and the whole trace, (d) the corresponding LANL ACFs.]

Figure 3: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).



[Figure 4 panels: (a) TeraGrid, (b) Notre-Dame, (c) PNNL (hourly).]

Figure 4: Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions (Cont.).

The Skype data set is a one month long trace collected from a P2P system used by 4,000 clients. Similar to Overnet, clients of Skype may also join or leave the system, and clients that are not online are considered as unavailable in this trace. Similar to desktop grids, there is high autocorrelation at small time lags, and the daily and weekly peaks are more pronounced. The Websites data set is a seven month long trace collected from web servers. While the average availability of the web servers remains almost stable (around 80%), there are a few peaks in the failure rate, probably due to network problems. In addition, we do not observe clear hourly and weekly patterns. The LDNS data set is a two week long trace collected from DNS servers. Unlike P2P systems and desktop grids, DNS servers do not exhibit strong autocorrelation for short time lags with periodic behavior. In addition, as the workload intensity increases during the peak hours of the day, we observe that the failure rate also increases. The SDSC data set is a twelve day long trace collected from the San Diego Supercomputer Center (HPC clusters). Similar to other HPC clusters and grids, we observe strong correlation for small to moderate time lags. Finally, the LANL data set is a ten year long trace collected from production HPC clusters. The weekly failure rate is quite low compared to Grid'5000. We do not observe a clear yearly pattern, as the failure rate increases during summer 2002, whereas it decreases during summer 2004. Since around 3,000 nodes were added to the LANL system between 2002 and 2003, the failure rate also increases correspondingly.

Last, a few systems exhibit weak autocorrelation in failure occurrence. Figures 4(a), 4(b), and 4(c) show the failure rate and the corresponding autocorrelation function for the TeraGrid, Notre-Dame, and PNNL systems. The TeraGrid data set is an eight month long trace collected from an HPC cluster that is part of a grid. We observe weak autocorrelation at all time lags, which implies that the failure rates observed over time are independent. In addition, there are no clear hourly or daily patterns, which gives evidence of an erratic occurrence of failures in this system. The Notre-Dame data set is a six month long trace collected from a desktop grid.



The failure events in this data set consist of the availability/unavailability events of the hosts in this system. Similar to other desktop grids, we observe clear daily and weekly patterns. However, the autocorrelation is low when compared to other desktop grids. Finally, the PNNL data set is a four year long trace collected from HPC clusters. We observe a correlation between the intensity of the workload and the hourly failure rate, since the failure rate decreases during the summer. The weak autocorrelation indicates independent failures, as expected from HPC clusters.

3.2 Discussion

As we have shown in the previous section, many systems exhibit strong correlation from small to moderate time lags, which probably indicates a high degree of predictability. In contrast, a small number of systems (Notre-Dame, PNNL, and TeraGrid) exhibit weak autocorrelation; only for these systems, the failure rates observed over time are independent. We have found that similar systems have similar time-varying behavior, e.g., desktop grids and P2P systems have daily and weekly periodic failure rates, and these systems exhibit strong temporal correlation at hourly time lags. Some systems (Notre-Dame and Condor (CAE)) have direct user interaction, which produces clear daily and weekly patterns in both the system load and the occurrence of failures: the failure rate increases during work hours and days, and decreases during free days and holidays (the summer). Finally, not all systems exhibit a correlation between work hours and days, and the failure rate. In the examples depicted in Figure 5, while Grid'5000 exhibits this correlation, PlanetLab exhibits irregular/erratic hourly and daily failure behavior.

Our results are consistent with previous studies [14, 4, 8, 17], as in many traces we observe strong autocorrelation at small time lags, and we observe a correlation between the intensity of the workload and the failure rate.

4 Modeling the Peaks of Failures

In this section we present a model for the peaks observed in the failure rate process in diverse large-scale distributed systems.

4.1 Peak Periods Model

Our model of peak failure periods comprises four parameters, as shown in Figure 6: the peak duration, the time between peaks (inter-peak time), the inter-arrival time of failures during peaks, and the duration of failures during peaks:

1. Peak Duration: The duration of peaks observed in a data set.
2. Time Between Peaks (inter-peak time): The time from the end of a previous peak to the start of the next peak.
3. Inter-arrival Time of Failures During Peaks: The inter-arrival time of failure events that occur during peaks.
4. Failure Duration During Peaks: The duration of failure events that start during peaks. These failure events may last longer than a peak.

Our modeling process is based on analyzing the failure traces taken from real distributed systems in two steps, which we describe in turn.



[Figure 5 panels: day-of-week and hour-of-day failure counts for (a) Overnet, (b) Grid'5000, (c) Notre-Dame (CPU), (d) PlanetLab.]

Figure 5: Daily and hourly failure rates for the Overnet, Grid'5000, Notre-Dame (CPU), and PlanetLab platforms.




Figure 6: Parameters of the peak periods model. The numbers in the figure match the (numbered) model parameters in the text.

The first step is to identify for each trace the peaks of hourly failure rates. Since there is no rigorous mathematical definition of peaks in time series, to identify the peaks we define a threshold value as µ + kσ, where µ is the average and σ is the standard deviation of the failure rate, and k is a positive integer; a period with a failure rate above the threshold is a peak period. We adopt this threshold to achieve a good balance between capturing extreme system behavior in the model, and characterizing with our model an important part of the system failures (either the number of failures or the downtime caused to the system). A threshold excluding all but a few periods, for example defining peak periods as distributional outliers, may capture too few periods and explain only a small fraction of the system failures. A more inclusive threshold would lead to the inclusion of more failures, but the data may come from periods with very different characteristics, which is contrary to the goal of building a model for peak failure periods.

In the second step we extract the model parameters from the data sets using the peaks that we identified in the first step. Then we try to find a good fit, that is, a well-known probability distribution and the parameters that lead to the best fit between that distribution and the empirical data. When selecting the probability distributions, we consider the degrees of freedom (number of parameters) of each distribution. Although a distribution with more degrees of freedom may provide a better fit for the data, such a distribution can result in a complex model, and hence it may be difficult to analyze the model mathematically. In this study we fit five probability distributions to the empirical data: exponential, Weibull, Pareto, lognormal, and gamma. For the modeling process, we follow the methodology described in Section 2.3.
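
A minimal sketch of the first step, under the threshold definition above: it marks the hours whose failure rate exceeds µ + kσ and extracts the peak durations and inter-peak times (model parameters 1 and 2). Extracting parameters 3 and 4 would additionally require the raw failure events; the function names are ours.

```python
import numpy as np

def peak_periods(rate, k=1):
    """Given a 1-D array of hourly failure counts, return (start, end)
    hour-index pairs (end exclusive) of maximal runs of consecutive hours
    whose failure rate exceeds the threshold mu + k * sigma."""
    rate = np.asarray(rate, dtype=float)
    threshold = rate.mean() + k * rate.std()
    periods, start = [], None
    for i, above in enumerate(rate > threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            periods.append((start, i))
            start = None
    if start is not None:
        periods.append((start, len(rate)))
    return periods

def durations_and_gaps(periods, hour=3600):
    """Peak durations and inter-peak times (model parameters 1 and 2), in seconds."""
    durations = [(end - start) * hour for start, end in periods]
    gaps = [(s2 - e1) * hour for (_, e1), (s2, _) in zip(periods, periods[1:])]
    return durations, gaps
```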

4.2 Results

After applying the modeling methodology described in the previous section and in Section 2.3, we present in this section the peak model that we derived from diverse large-scale distributed systems. Table 2 shows the average values for all the model parameters for all platforms.



Table 2: Average values for the model parameters.

System            Avg. Peak     Avg. Failure IAT   Avg. Time          Avg. Failure Duration
                  Duration [s]  During Peaks [s]   Between Peaks [s]  During Peaks [s]
Grid'5000         5,047         13                 55,101             20,984
Condor (CAE)      5,287         23                 87,561             4,397
Condor (CS)       3,927         4                  241,920            20,740
Condor (GLOW)     4,200         14                 329,040            75,672
TeraGrid          3,680         35                 526,500            368,903
LRI               4,080         78                 58,371             31,931
Deug              14,914        10                 103,800            1,091
Notre-Dame        3,942         21                 257,922            280,593
Notre-Dame (CPU)  7,520         33                 47,075             22,091
Microsoft         7,200         0                  75,315             90,116
UCB               21,272        23                 77,040             332
PlanetLab         4,810         264                47,124             166,913
Overnet           3,600         1                  14,400             382,225
Skype             4,254         11                 112,971            26,402
Websites          5,211         103                104,400            3,476
LDNS              4,841         8                  42,042             30,212
SDSC              4,984         26                 84,900             6,114
LANL              4,122         653                94,968             21,193

The average peak duration varies across different systems, and even for the same type of system. For example, UCB, Microsoft, and Deug are all desktop grids, but the average peak duration varies widely among these platforms. In contrast, for the SDSC, LANL, and PNNL platforms, which are HPC clusters, the average peak duration values are relatively close. The Deug and UCB platforms have a small number of long peaks, resulting in higher average peak durations compared to the other platforms. Finally, as there are two peaks of zero length (a single data point) in the Overnet system, the average peak duration is zero.

The average inter-arrival time during peaks is rather low, as expected, since the failure rates are higher during peaks than during off-peak periods. For the Microsoft platform, as all failures arrive as bursts during peaks, the average inter-arrival time during peaks is zero. Similar to the average peak duration, the average time between peaks is also highly variable across different systems. For some systems, like TeraGrid, this parameter is on the order of days, and for some systems, like Overnet, it is on the order of hours. Similarly, the duration of failures during peaks varies highly even across similar platforms. For example, the difference in the average failure duration during peaks between the UCB and the Microsoft platforms, which are both desktop grids, is huge, because the machines in the UCB platform leave the system less often than the machines in the Microsoft platform. In addition, in some platforms, like Overnet and TeraGrid, the average failure duration during peaks is on the order of days, showing the impact of space-correlated failures, that is, multiple nodes failing nearly simultaneously.

Using the AD and KS tests we next determine the best fitting distributions for each model parameter and each system. Since we determine the hourly failure rates using fixed time windows of one hour, the peak duration and the inter-peak time are multiples of one hour. In addition, as the peak duration parameter is mostly in the range [1h-5h], and for several systems this parameter is mostly 1h, causing the empirical distribution to have a peak at 1h, none of the distributions provide a good fit for the peak duration parameter (see Figure 7). Therefore, for the peak duration model parameter, we only present an empirical distribution in Table 3.



Table 3: Empirical distribution for the peak duration parameter; h denotes hours.

Platform          1h        2h       3h       4h      5h      6h      ≥7h
Grid'5000         80.56 %   13.53 %  3.38 %   1.33 %  0.24 %  0.12 %  0.84 %
Condor (CAE)      93.75 %   3.13 %   –        –       –       –       3.12 %
Condor (CS)       90.91 %   9.09 %   –        –       –       –       –
Condor (GLOW)     83.33 %   16.67 %  –        –       –       –       –
TeraGrid          97.78 %   2.22 %   –        –       –       –       –
LRI               86.67 %   13.33 %  –        –       –       –       –
Deug              28.57 %   –        28.57 %  –       14.29 % –       28.57 %
Notre-Dame        90.48 %   9.52 %   –        –       –       –       –
Notre-Dame (CPU)  56.83 %   17.78 %  9.84 %   3.49 %  5.40 %  3.49 %  3.17 %
Microsoft         35.90 %   33.33 %  25.64 %  5.13 %  –       –       –
UCB               9.09 %    9.09 %   –        –       –       9.09 %  72.73 %
PlanetLab         80.17 %   13.36 %  3.71 %   1.27 %  0.53 %  0.53 %  0.43 %
Overnet           100.00 %  –        –        –       –       –       –
Skype             90.91 %   4.55 %   –        4.55 %  –       –       –
Websites          76.74 %   13.95 %  5.23 %   2.33 %  0.58 %  –       1.16 %
LDNS              75.86 %   13.79 %  10.34 %  –       –       –       –
SDSC              69.23 %   23.08 %  7.69 %   –       –       –       –
LANL              88.35 %   9.24 %   2.06 %   0.25 %  0.06 %  0.03 %  –
PNNL              85.99 %   10.35 %  1.75 %   0.96 %  0.64 %  0.16 %  0.16 %
Avg               74.79 %   11.30 %  5.16 %   1.01 %  1.14 %  0.70 %  5.85 %

We find that the peak duration for almost all platforms is less than 3h. Table 4 shows the best fitting distributions for the model parameters for all data sets investigated in this study.

To generate synthetic yet realistic traces without using a single system as a reference, we create the average system model, which has the average characteristics of all systems we investigate. We create the average system model as follows. First, we determine the candidate distributions for a model parameter as the distributions having the smallest D values for each system. Then, for each model parameter, we select among the candidate distributions the best fitting distribution, which has the lowest average D value over all data sets. After we determine the best fitting distribution for the average system model, each data set is fit independently to this distribution to find the set of best-fit parameters. The parameters of the average system model shown in the "Avg" row represent the average of this set of parameters.

For the IAT during peaks parameter, several platforms do not have a best fitting distribution, since for these platforms most of the failures during peaks occur as bursts, hence having inter-arrival times of zero. Similarly, for the time between peaks parameter, some platforms (all Condor platforms, and the Deug, Overnet, and UCB platforms) do not have best fitting distributions, since these platforms have an inadequate number of samples to generate a meaningful model. For the failure duration during peaks parameter, some platforms do not have a best fitting distribution due to the nature of the data. For example, for all Condor platforms the failure duration is a multiple of a monitoring interval, creating peaks in the empirical distribution at that monitoring interval. As a result, none of the distributions we investigate provide a good fit.

In our model we find that the model parameters do not follow a heavy-tailed distribution, since the p-values for the Pareto distribution are very low. For the IAT during peaks parameter, the Weibull distribution provides a good fit for most of the platforms. For the time between peaks parameter, we find that the platforms can be modeled by either the lognormal distribution or the Weibull distribution. Similar to our previous model [9], which is derived from both peak and off-peak periods, for the failure duration during peaks parameter we find that the lognormal distribution provides a good fit for most of the platforms. To conclude, for all the model parameters, we find that either the lognormal or the Weibull distribution provides a good fit for the average system model.
Similar to the average system models built for other systems [10], we cannot claim that our average system model represents the failure behavior of an actual system. However, the main strength of the average system model is that it represents a common basis for the traces from which it has been extracted. To generate failure traces for a specific system, the individual best fitting distributions and their parameters shown in Table 4 may be used instead of the average system model.
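
The selection step for the average system model can be sketched as follows, reusing the per-system KS statistics D produced by a fitting routine such as fit_and_test in Section 2.3; the helper name is ours.

```python
import numpy as np

def average_system_best_fit(d_per_system):
    """For one model parameter, pick the candidate distribution with the
    lowest average KS statistic D across all systems.
    `d_per_system` maps system name -> {distribution name -> D value}."""
    names = set.intersection(*(set(d) for d in d_per_system.values()))
    return min(names,
               key=lambda n: np.mean([d[n] for d in d_per_system.values()]))
```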



Table 4: Peak model: the parameter values for the best fitting distributions for all studied systems. E, W, LN, and G stand for the exponential, Weibull, lognormal, and gamma distributions, respectively.

System            IAT During Peaks  Time Between Peaks  Failure Duration During Peaks
Grid'5000         – (see text)      LN(10.30,1.09)      – (see text)
Condor (CAE)      – (see text)      – (see text)        – (see text)
Condor (CS)       – (see text)      – (see text)        – (see text)
Condor (GLOW)     – (see text)      – (see text)        – (see text)
TeraGrid          – (see text)      LN(12.40,1.42)      LN(10.27,1.90)
LRI               LN(3.49,1.86)     LN(10.51,0.98)      – (see text)
Deug              W(9.83,0.95)      – (see text)        LN(5.46,1.29)
Notre-Dame        – (see text)      W(247065.52,0.92)   LN(9.06,2.73)
Notre-Dame (CPU)  – (see text)      W(44139.20,0.89)    LN(7.19,1.35)
Microsoft         – (see text)      G(1.50,50065.81)    W(55594.48,0.61)
UCB               E(23.77)          – (see text)        LN(5.25,0.99)
PlanetLab         – (see text)      LN(10.13,1.03)      LN(8.47,2.50)
Overnet           – (see text)      – (see text)        – (see text)
Skype             – (see text)      W(123440.05,1.37)   – (see text)
Websites          W(66.61,0.60)     LN(10.77,1.25)      – (see text)
LDNS              W(8.97,0.98)      LN(10.38,0.79)      LN(9.09,1.63)
SDSC              W(16.27,0.46)     E(84900)            LN(7.59,1.68)
LANL              G(1.35,797.42)    LN(10.63,1.16)      LN(8.26,1.53)
PNNL              – (see text)      E(160237.32)        – (see text)
Avg               W(193.91,0.83)    LN(10.89,1.08)      LN(8.09,1.59)
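
To illustrate how the "Avg" row of Table 4 could drive a synthetic trace generator, the sketch below samples the inter-peak time and the failure inter-arrival times and durations for one peak period. We read W(a, b) as (scale, shape) and LN(m, s) as the mean and standard deviation of the underlying normal distribution; both readings are our assumption, as the table does not state its notation.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_inter_peak_time():   # LN(10.89, 1.08), in seconds
    return rng.lognormal(mean=10.89, sigma=1.08)

def sample_failure_iat():       # W(193.91, 0.83) read as (scale, shape), in seconds
    return 193.91 * rng.weibull(0.83)

def sample_failure_duration():  # LN(8.09, 1.59), in seconds
    return rng.lognormal(mean=8.09, sigma=1.59)

def synthetic_peak(peak_duration=3600):
    """(start, duration) pairs of failures for one peak period; the peak
    length itself can be drawn from the empirical distribution of Table 3."""
    t, failures = 0.0, []
    while True:
        t += sample_failure_iat()
        if t >= peak_duration:
            return failures
        failures.append((t, sample_failure_duration()))
```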

Table 5: The average duration and average IAT of failures for the entire traces and for the peaks.

System            Avg. Failure  Avg. Failure  Avg. Failure  Avg. Failure
                  Duration [h]  Duration [h]  IAT [s]       IAT [s]
                  (Entire)      (Peaks)       (Entire)      (Peaks)
Grid'5000         7.41          5.83          160           13
Notre-Dame (CPU)  4.25          6.14          119           33
Microsoft         16.49         25.03         6             0
PlanetLab         49.61         46.36         1880          264
Overnet           11.98         106.17        17            1
Skype             14.30         7.33          91            11
Websites          1.17          0.97          380           103
LDNS              8.61          8.39          12            8
LANL              5.88          5.89          13874         653

Next, we compute the average failure duration and inter-arrival time over each entire data set and only during peaks (Table 5). We compare only the data sets used both in this study and in our previous study [9], where we modeled each data set individually without isolating peaks. We observe that the average failure duration per data set can be twice as long as the average duration during peaks. In addition, the average failure inter-arrival time per data set is on average nine times the average failure inter-arrival time during peaks. This implies that the distribution per data set is significantly different from the distribution for peaks, and that fault detection mechanisms must be significantly faster during peaks. Likewise, fault-tolerance mechanisms during peaks must have considerably lower overhead than during non-peak periods.

Finally, we investigate the fraction of downtime caused by failures that originate during peaks, and the fraction of the number of failures that originate during peaks (Table 6). We find that on average over 50% and up to 95% of the downtime of the systems we investigate is caused by failures that originate during peaks.
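
The downtime fractions reported in Table 6 can be computed along the following lines, with failures as (start, duration) pairs and peak periods as returned by the peak_periods sketch in Section 4.1; the function name is ours.

```python
def peak_downtime_fraction(failures, periods, hour=3600):
    """Fraction of total downtime due to failures that start during a peak.
    `failures` are (start, duration) pairs in seconds; `periods` are the
    (start_hour, end_hour) pairs returned by peak_periods."""
    def starts_in_peak(t):
        return any(s * hour <= t < e * hour for s, e in periods)
    total = sum(d for _, d in failures)
    in_peak = sum(d for t, d in failures if starts_in_peak(t))
    return in_peak / total if total else 0.0
```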

5 Related Work

Much work has been dedicated to characterizing and modeling system failures [18, 13, 12, 14, 20, 16]. While the correlation among failure events has received attention since the early 1990s [18], previous studies focus mostly on space-correlated failures, that is, on multiple nodes failing nearly simultaneously.




Figure 7: Peak Duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter. None of the distributions provide a good fit due to a peak at 1h.

Table 6: Fraction of downtime and fraction of the number of failures due to failures that originate during peaks, for several thresholds µ + kσ. Each cell shows the downtime percentage / the percentage of the number of failures.

System            k=0.5        k=0.9        k=1.0        k=1.1        k=1.25       k=1.5        k=2.0
Grid'5000         61.93/74.43  52.11/64.58  49.19/62.56  47.27/60.84  35.93/57.93  33.25/53.60  27.63/44.99
Condor (CAE)      63.03/90.91  62.93/90.52  62.93/90.52  62.73/89.88  62.73/89.88  62.57/89.13  62.10/87.08
Condor (CS)       80.09/89.56  80.04/88.63  80.04/88.63  80.04/88.63  80.04/88.63  80.04/88.63  80.04/88.63
Condor (GLOW)     39.53/70.51  39.37/69.70  39.37/69.70  39.37/69.70  39.37/69.70  39.37/69.70  37.28/67.47
TeraGrid          100/77.10    66.90/77.10  66.90/77.10  66.90/77.10  66.90/77.10  66.90/77.10  62.98/70.61
LRI               87.42/71.87  84.73/62.26  84.73/62.26  77.92/59.47  75.66/57.94  75.66/57.94  73.91/55.71
Deug              47.31/83.61  26.07/66.46  25.07/63.03  25.07/63.03  22.28/57.76  20.94/53.31  16.83/41.11
Notre-Dame        62.62/73.06  58.53/69.72  53.40/69.09  45.19/67.70  45.12/67.19  43.69/64.47  41.79/62.61
Notre-Dame (CPU)  73.77/56.92  63.85/43.56  57.92/40.15  56.19/37.93  47.61/33.32  41.88/26.21  28.76/14.99
Microsoft         52.16/40.26  37.44/25.10  35.54/23.41  32.20/20.08  28.78/16.81  23.76/12.80  15.35/6.73
UCB               100/100      97.70/97.78  95.31/96.10  95.31/96.10  93.19/94.18  86.75/87.62  54.48/57.88
PlanetLab         50.55/54.70  38.07/41.81  38.07/41.81  30.02/32.34  30.02/32.34  26.69/24.86  24.02/20.27
Overnet           68.69/12.90  65.97/7.44   65.97/7.44   65.97/7.44   65.97/7.44   65.97/7.44   65.97/7.44
Skype             32.87/36.01  12.06/18.93  7.65/14.93   6.14/13.81   4.32/12.05   3.29/10.74   3.29/10.74
Websites          24.41/31.45  15.33/18.87  13.60/16.59  12.85/14.97  12.33/13.85  9.27/11.92   8.53/10.06
LDNS              38.21/38.80  13.74/13.85  10.14/10.41  7.80/8.28    5.07/5.61    3.25/3.57    2.06/2.45
SDSC              87.57/86.13  67.86/65.25  67.46/64.02  67.46/64.02  65.24/61.01  64.91/59.23  52.20/48.78
LANL              100/44.68    44.69/44.68  44.69/44.68  44.69/44.68  44.69/44.68  44.69/44.68  22.96/21.41
Avg.              65.01/62.93  51.52/53.67  49.88/52.35  47.95/50.88  45.84/49.30  44.04/46.83  37.78/39.94

Although the time correlation of failure events deserves a detailed investigation due to its practical importance [11, 19, 15], relatively little attention has been given to characterizing the time correlation of failures in distributed systems. Our work is the first to investigate the time correlation between failure events across a broad spectrum of large-scale distributed systems. In addition, we also propose a model for the peaks observed in the failure rate process, derived from several distributed systems. Previous failure studies [18, 13, 12, 14, 20] used few data sets or even data from a single system; their data also span relatively short periods of time. In contrast, we perform a detailed investigation using data sets from diverse large-scale distributed systems, including more than 100K hosts and 1.2M failure events spanning over 15 years of system operation. In our recent work [6], we proposed a model for space-correlated failures using fifteen FTA data sets, and we showed that space-correlated failures are dominant in most of the systems that we investigated. In this work, we extend our recent work with a detailed time-correlation analysis, and we propose a model for peak periods. Closest to our work, Schroeder and Gibson [16] present an analysis using a large set of failure data obtained from a high performance computing site. However, this study lacks a time correlation analysis and focuses on well-known failure characteristics like MTTF and MTTR. Sahoo et al. [14] analyze one year of failure data obtained from a single cluster. Similar to the results of our analysis, they report that there is strong correlation with significant periodic behavior. Bhagwan et al. [3] present a characterization of the availability of the Overnet P2P system. Their study and other studies [5] show that the availability of P2P systems has diurnal patterns, but they do not characterize the time correlations of failure events.



Traditional failure analysis studies [4, 8] report strong correlation between the intensity of the workload and failure rates. Our analysis brings further evidence supporting the existence of this correlation: we observe more failures during the peak hours of the day and during work days in most of the (interactive) traces.

6 Conclusion

In the era of cloud and peer-to-peer computing, large-scale distributed systems may consist of hundreds of thousands of nodes. At this scale, providing highly available service is a difficult challenge, and overcoming it depends on the development of efficient fault-tolerance mechanisms. To develop new and improve existing fault-tolerance mechanisms, we need realistic models of the failures occurring in large-scale distributed systems. Traditional models have investigated failures in distributed systems at much smaller scale, and often under the assumption of independence between failures. However, more recent studies have shown evidence that there exist time patterns and other time-varying behavior in the occurrence of failures. Thus, we have investigated in this work the time-varying behavior of failures in large-scale distributed systems, and proposed and validated a model for time-correlated failures in such systems.

First, we have assessed the presence of time-correlated failures, using traces from nineteen (production) systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We found for most of the studied systems that, while the failure rates are highly variable, the failures still exhibit strong periodic behavior and time correlation. Second, to characterize the periodic behavior of failures and the peaks in failures, we have proposed a peak model with four attributes: the peak duration, the failure inter-arrival time during peaks, the time between peaks, and the failure duration during peaks. We found that the peak failure periods explained by our model are responsible on average for over 50% and up to 95% of the system downtime. We also found that the Weibull and the lognormal distributions provide good fits for the model attributes. We have provided best-fitting parameters for these distributions, which will be useful to the community when designing and evaluating fault-tolerance mechanisms in large-scale distributed systems. Last but not least, we have made four new traces available in the Failure Trace Archive, which we hope will encourage others to use the archive and to contribute new failure traces to it.



References

[1] The UC Berkeley/Stanford Recovery-Oriented Computing (ROC) project. http://roc.cs.berkeley.edu/.
[2] J. Aldrich. R. A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12(3):162-176, 1997.
[3] R. Bhagwan, S. Savage, and G. M. Voelker. Understanding availability. In M. F. Kaashoek and I. Stoica, editors, IPTPS, volume 2735 of Lecture Notes in Computer Science, pages 256-267. Springer, 2003.
[4] X. Castillo and D. Siewiorek. Workload, performance, and reliability of digital computing systems. Pages 84-89, June 1981.
[5] J. Chu, K. Labonte, and B. N. Levine. Availability and locality measurements of peer-to-peer file systems. In Proceedings of ITCom: Scalability and Traffic Control in IP Networks, 2002.
[6] M. Gallet, N. Yigitbasi, B. Javadi, D. Kondo, A. Iosup, and D. Epema. A model for space-correlated failures in large-scale distributed systems. In Euro-Par, 2010. To appear.
[7] T. Z. Islam, S. Bagchi, and R. Eigenmann. Falcon: a system for reliable checkpoint recovery in shared grid environments. In SC '09, pages 1-12. ACM, 2009.
[8] R. K. Iyer, D. J. Rossetti, and M. C. Hsueh. Measurement and modeling of computer reliability as affected by system activity. ACM Trans. Comput. Syst., 4(3):214-237, 1986.
[9] D. Kondo, B. Javadi, A. Iosup, and D. Epema. The Failure Trace Archive: Enabling comparative analysis of failures in diverse distributed systems. In CCGRID, pages 1-10, 2010.
[10] U. Lublin and D. G. Feitelson. The workload on parallel supercomputers: modeling the characteristics of rigid jobs. J. Parallel Distrib. Comput., 63(11):1105-1122, 2003.
[11] J. W. Mickens and B. D. Noble. Exploiting availability prediction in distributed systems. In NSDI '06, pages 6-6. USENIX Association, 2006.
[12] D. Nurmi, J. Brevik, and R. Wolski. Modeling machine availability in enterprise and wide-area distributed computing environments. In Euro-Par, pages 432-441, 2005.
[13] D. Oppenheimer, A. Ganapathi, and D. A. Patterson. Why do Internet services fail, and what can be done about it? In USITS '03, pages 1-1. USENIX Association, 2003.
[14] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In DSN '04, page 772, 2004.
[15] F. Salfner, M. Lenk, and M. Malek. A survey of online failure prediction methods. ACM Comput. Surv., 42(3):1-42, 2010.
[16] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In DSN '06, pages 249-258, 2006.
[17] B. Schroeder and G. A. Gibson. Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you? In FAST '07, page 1. USENIX Association, 2007.
[18] D. Tang, R. Iyer, and S. Subramani. Failure analysis and modeling of a VAXcluster system. Pages 244-251, June 1990.
[19] K. Vaidyanathan, R. E. Harper, S. W. Hunter, and K. S. Trivedi. Analysis and implementation of software rejuvenation in cluster systems. In SIGMETRICS '01, pages 62-71. ACM, 2001.
[20] J. Xu, Z. Kalbarczyk, and R. K. Iyer. Networked Windows NT system field failure data analysis. In PRDC '99, page 178, 1999.
[21] Y. Zhang, M. S. Squillante, and R. K. Sahoo. Performance implications of failures in large-scale cluster scheduling. In JSSPP, pages 233-252, 2004.
