Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems

Total Page:16

File Type:pdf, Size:1020Kb

Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems Delft University of Technology Parallel and Distributed Systems Report Series Analysis and Modeling of Time-Correlated Failures in Large-Scale Distributed Systems Nezih Yigitbasi, Matthieu Gallet, Derrick Kondo, Alexandru Iosup, and Dick Epema The Failure Trace Archive. Web: fta.inria.fr Email: [email protected] report number PDS-2010-004 PDS ISSN 1387-2109 Published and produced by: Parallel and Distributed Systems Section Faculty of Information Technology and Systems Department of Technical Mathematics and Informatics Delft University of Technology Zuidplantsoen 4 2628 BZ Delft The Netherlands Information about Parallel and Distributed Systems Report Series: [email protected] Information about Parallel and Distributed Systems Section: http://pds.twi.tudelft.nl/ c 2010 Parallel and Distributed Systems Section, Faculty of Information Technology and Systems, Department of Technical Mathematics and Informatics, Delft University of Technology. All rights reserved. No part of this series may be reproduced in any form or by any means without prior written permission of the publisher. PDS N. Yigitbasi. et al. Wp Wp Time-Correlated Failures in Distributed SystemsWp Wp Abstract The analysis and modeling of the failures bound to occur in today’s large-scale production systems is invaluable in providing the understanding needed to make these systems fault-tolerant yet efficient. Many previous studies have modeled failures without taking into account the time-varying behavior of failures, under the assumption that failures are identically, but independently distributed. However, the presence of time correlations between failures (such as peak periods with increased failure rate) refutes this assump- tion and can have a significant impact on the effectiveness of fault-tolerance mechanisms. For example, the performance of a proactive fault-tolerance mechanism is more effective if the failures are periodic or predictable; similarly, the performance of checkpointing, redundancy, and scheduling solutions depends on the frequency of failures. In this study we analyze and model the time-varying behavior of failures in large- scale distributed systems. Our study is based on nineteen failure traces obtained from (mostly) production large-scale distributed systems, including grids, P2P systems, DNS servers, web servers, and desktop grids. We first investigate the time correlation of failures, and find that many of the studied traces exhibit strong daily patterns and high autocorrelation. Then, we derive a model that focuses on the peak failure periods occurring in real large-scale distributed systems. Our model characterizes the duration of peaks, the peak inter-arrival time, the inter-arrival time of failures during the peaks, and the duration of failures during peaks; we determine for each the best-fitting probability distribution from a set of several candidate distri- butions, and present the parameters of the (best) fit. Last, we validate our model against the nineteen real failure traces, and find that the failures it characterizes are responsible on average for over 50% and up to 95% of the downtime of these systems. Wp 1 http://www.st.ewi.tudelft.nl/∼nezih/ PDS N. Yigitbasi. et al. Wp Wp Time-Correlated Failures in Distributed SystemsWp WpContents Contents 1 Introduction 4 2 Method 4 2.1 FailureDatasets ................................. ............ 4 2.2 Analysis........................................ .......... 5 2.3 Modeling ........................................ ......... 5 3 Analysis of Autocorrelation 6 3.1 FailureAutocorrelationsintheTraces . ................... 6 3.2 Discussion...................................... ........... 11 4 Modeling the Peaks of Failures 11 4.1 PeakPeriodsModel ................................ ........... 11 4.2 Results......................................... .......... 13 5 Related Work 16 6 Conclusion 18 Wp 2 http://www.st.ewi.tudelft.nl/∼nezih/ PDS N. Yigitbasi. et al. Wp Wp Time-Correlated Failures in Distributed SystemsWp WpList of Figures List of Figures 1 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions. ......................................... ........ 7 2 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions(Cont.)................................... ........... 8 3 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions(Cont.)................................... ........... 9 4 Failure rates at different time granularities for all platforms and the corresponding autocorrelation functions(Cont.)................................... ........... 10 5 Daily and hourly failure rates for Overnet,Grid’5000,Notre-Dame (CPU), and PlanetLab platforms. ......................................... ....... 12 6 Parameters of the peak periods model. The numbers in the figure match the (numbered) model parametersinthetext................................ ........... 13 7 Peak Duration: CDF of the empirical data for the Notre-Dame (CPU) platform and the distributions investigated for modeling the peak duration parameter. None of the distributions provide a good fit due to a peak at 1h.................................. 17 List of Tables 1 Summary of nineteen data sets in the Failure Trace Archive. ..................... 5 2 Averagevaluesforthemodelparameters. ................ 14 3 Empirical distribution for the peak duration parameter. ...................... 15 4 Peak model: The parameter values for the best fitting distributions for all studied systems. 16 5 The average duration and average IAT of failures for the entire traces and for the peaks. 16 6 Fraction of downtime and fraction of number of failures due to failures that originate during peaks (k =1)................................................ 17 Wp 3 http://www.st.ewi.tudelft.nl/∼nezih/ PDS N. Yigitbasi. et al. Wp Wp Time-Correlated Failures in Distributed SystemsWp Wp1. Introduction 1 Introduction Large-scale distributed systems have reached an unprecedented scale and complexity in recent years. At this scale failures inevitably occur—networks fail, disks crash, packets get lost, bits get flipped, software misbehaves, or systems crash due to misconfiguration and other human errors. Deadline-driven or mission-critical services are part of the typical workload for these infrastructures, which thus need to be available and reliable despite the presence of failures. Researchers and system designers have already built numerous fault-tolerance mecha- nisms that have been proven to work under various assumptions about the occurrence and duration of failures. However, most previous work focuses on failure models that assume the failures to be non-correlated, but this may not be realistic for the failures occurring in large-scale distributed systems. For example, such systems may exhibit peak failure periods, during which the failure rate increases, affecting in turn the performance of fault-tolerance solutions. To investigate such time correlations, we perform in this work a detailed investiga- tion of the time-varying behavior of failures using nineteen traces obtained from several large-scale distributed systems including grids, P2P systems, DNS servers, web servers, and desktop grids. Recent studies report that in production systems, failure rates can be of over 1000 failures per year and, depending on the root cause of the corresponding problems, the mean time to repair can range from hours to days [16]. The increasing scale of deployed distributed systems causes the failure rates to increase, which in turn can have a significant impact on the performance and cost, such as degraded response times [21] and increased Total Cost of Operation (TCO) due to increased administration costs and human resource needs [1]. This situation also motivates the need for further research in failure characterization and modeling. Previous studies [18,13,12,14,20,16] focused on characterizing failures in several different distributed systems. However, most of these studies assume that failures occur independently or disregard the time correlation of failures, despite the practical importance of these correlations [11, 19, 15]. First of all, understanding if failures are time correlated has significant implications for proactive fault tolerance solutions. Second, understanding the time-varying behavior of failures and peaks observed in failure patterns is required for evaluating design decisions. For example, redundant submissions may all fail during a failure peak period, regardless of the quality of the resubmission strategy. Third, understanding the temporal correlations and exploiting them for smart checkpointing and scheduling decisions provides new opportunities for enhancing conventional fault-tolerance mechanisms [21,7]. For example, a simple scheduling policy could be to stop scheduling large parallel jobs during failure peaks. Finally, it is possible to devise adaptive fault-tolerance mechanisms that adjust the policies based on the information related to peaks. For example, an adaptive fault-tolerance mechanism can migrate the computation at the beginning of a predicted peak. To understand the time-varying behavior of failures in large-scale distributed systems, we perform a detailed investigation using data sets from diverse large-scale distributed systems including more than 100K hosts and 1.2M failure events spanning over 15 years of system operation. Our main contribution is threefold: 1. We
Recommended publications
  • Research Participants' Rights to Data Protection in the Era of Open Science
    DePaul Law Review Volume 69 Issue 2 Winter 2019 Article 16 Research Participants’ Rights To Data Protection In The Era Of Open Science Tara Sklar Mabel Crescioni Follow this and additional works at: https://via.library.depaul.edu/law-review Part of the Law Commons Recommended Citation Tara Sklar & Mabel Crescioni, Research Participants’ Rights To Data Protection In The Era Of Open Science, 69 DePaul L. Rev. (2020) Available at: https://via.library.depaul.edu/law-review/vol69/iss2/16 This Article is brought to you for free and open access by the College of Law at Via Sapientiae. It has been accepted for inclusion in DePaul Law Review by an authorized editor of Via Sapientiae. For more information, please contact [email protected]. \\jciprod01\productn\D\DPL\69-2\DPL214.txt unknown Seq: 1 21-APR-20 12:15 RESEARCH PARTICIPANTS’ RIGHTS TO DATA PROTECTION IN THE ERA OF OPEN SCIENCE Tara Sklar and Mabel Crescioni* CONTENTS INTRODUCTION ................................................. 700 R I. REAL-WORLD EVIDENCE AND WEARABLES IN CLINICAL TRIALS ....................................... 703 R II. DATA PROTECTION RIGHTS ............................. 708 R A. General Data Protection Regulation ................. 709 R B. California Consumer Privacy Act ................... 710 R C. GDPR and CCPA Influence on State Data Privacy Laws ................................................ 712 R III. RESEARCH EXEMPTION AND RELATED SAFEGUARDS ... 714 R IV. RESEARCH PARTICIPANTS’ DATA AND OPEN SCIENCE . 716 R CONCLUSION ................................................... 718 R Clinical trials are increasingly using sensors embedded in wearable de- vices due to their capabilities to generate real-world data. These devices are able to continuously monitor, record, and store physiological met- rics in response to a given therapy, which is contributing to a redesign of clinical trials around the world.
    [Show full text]
  • Overview of Reliability Engineering
    Overview of reliability engineering Eric Marsden <[email protected]> Context ▷ I have a fleet of airline engines and want to anticipate when they mayfail ▷ I am purchasing pumps for my refinery and want to understand the MTBF, lambda etc. provided by the manufacturers ▷ I want to compare different system designs to determine the impact of architecture on availability 2 / 32 Reliability engineering ▷ Reliability engineering is the discipline of ensuring that a system will function as required over a specified time period when operated and maintained in a specified manner. ▷ Reliability engineers address 3 basic questions: • When does something fail? • Why does it fail? • How can the likelihood of failure be reduced? 3 / 32 The termination of the ability of an item to perform a required function. [IEV Failure Failure 191-04-01] ▷ A failure is always related to a required function. The function is often specified together with a performance requirement (eg. “must handleup to 3 tonnes per minute”, “must respond within 0.1 seconds”). ▷ A failure occurs when the function cannot be performed or has a performance that falls outside the performance requirement. 4 / 32 The state of an item characterized by inability to perform a required function Fault Fault [IEV 191-05-01] ▷ While a failure is an event that occurs at a specific point in time, a fault is a state that will last for a shorter or longer period. ▷ When a failure occurs, the item enters the failed state. A failure may occur: • while running • while in standby • due to demand 5 / 32 Discrepancy between a computed, observed, or measured value or condition and Error Error the true, specified, or theoretically correct value or condition.
    [Show full text]
  • Lochner</I> in Cyberspace: the New Economic Orthodoxy Of
    Georgetown University Law Center Scholarship @ GEORGETOWN LAW 1998 Lochner in Cyberspace: The New Economic Orthodoxy of "Rights Management" Julie E. Cohen Georgetown University Law Center, [email protected] This paper can be downloaded free of charge from: https://scholarship.law.georgetown.edu/facpub/811 97 Mich. L. Rev. 462-563 (1998) This open-access article is brought to you by the Georgetown Law Library. Posted with permission of the author. Follow this and additional works at: https://scholarship.law.georgetown.edu/facpub Part of the Intellectual Property Law Commons, Internet Law Commons, and the Law and Economics Commons LOCHNER IN CYBERSPACE: THE NEW ECONOMIC ORTHODOXY OF "RIGHTS MANAGEMENT" JULIE E. COHEN * Originally published 97 Mich. L. Rev. 462 (1998). TABLE OF CONTENTS I. THE CONVERGENCE OF ECONOMIC IMPERATIVES AND NATURAL RIGHTS..........6 II. THE NEW CONCEPTUALISM.....................................................18 A. Constructing Consent ......................................................18 B. Manufacturing Scarcity.....................................................32 1. Transaction Costs and Common Resources................................33 2. Incentives and Redistribution..........................................40 III. ON MODELING INFORMATION MARKETS........................................50 A. Bargaining Power and Choice in Information Markets.............................52 1. Contested Exchange and the Power to Switch.............................52 2. Collective Action, "Rent-Seeking," and Public Choice.......................67
    [Show full text]
  • Understanding Functional Safety FIT Base Failure Rate Estimates Per IEC 62380 and SN 29500
    www.ti.com Technical White Paper Understanding Functional Safety FIT Base Failure Rate Estimates per IEC 62380 and SN 29500 Bharat Rajaram, Senior Member, Technical Staff, and Director, Functional Safety, C2000™ Microcontrollers, Texas Instruments ABSTRACT Functional safety standards like International Electrotechnical Commission (IEC) 615081 and International Organization for Standardization (ISO) 262622 require that semiconductor device manufacturers address both systematic and random hardware failures. Systematic failures are managed and mitigated by following rigorous development processes. Random hardware failures must adhere to specified quantitative metrics to meet hardware safety integrity levels (SILs) or automotive SILs (ASILs). Consequently, systematic failures are excluded from the calculation of random hardware failure metrics. Table of Contents 1 Introduction.............................................................................................................................................................................2 2 Types of Faults and Quantitative Random Hardware Failure Metrics .............................................................................. 2 3 Random Failures Over a Product Lifetime and Estimation of BFR ...................................................................................3 4 BFR Estimation Techniques ................................................................................................................................................. 4 5 Siemens SN 29500 FIT model
    [Show full text]
  • Probability Analysis in Determining the Behaviour of Variable Amplitude Strain Signal Based on Extraction Segments
    2141 Probability Analysis in Determining the Behaviour of Variable Amplitude Strain Signal Based on Extraction Segments Abstract M. F. M. Yunoh a This paper focuses on analysis in determining the behaviour of vari- S. Abdullah b, * able amplitude strain signals based on extraction of segments. The M. H. M. Saad c constant Amplitude loading (CAL), that was used in the laboratory d Z. M. Nopiah tests, was designed according to the variable amplitude loading e (VAL) from Society of Automotive Engineers (SAE). The SAE strain M. Z. Nuawi signal was then edited to obtain those segments that cause fatigue a damage to components. The segments were then sorted according to Department of Mechanical and Materials their amplitude and were used as a reference in the design of the Engineering, Faculty of Engineering & CAL loading for the laboratory tests. The strain signals that were Built Enviroment, Universiti Kebangsaan obtained from the laboratory tests were then analysed using fatigue Malaysia, 43600 UKM Bangi, Selangor, life prediction approach and statistics, i.e. Weibull distribution anal- Malaysia. [email protected] b ysis. Based on the plots of the Probability Density Function (PDF), [email protected] c Cumulative Distribution Function (CDF) and the probability of fail- [email protected]; d ure in the Weibull distribution analysis, it was shown that more than [email protected]; e 70% failure occurred when the number of cycles approached 1.0 x [email protected] 1011. Therefore, the Weibull distribution analysis can be used as an alternative to predict the failure probability. * Corresponding author: S.
    [Show full text]
  • Individual Failure Rate Modelling and Exploratory Failure Data Analysis for Power System Components
    INDIVIDUAL FAILURE RATE MODELLING AND EXPLORATORY FAILURE DATA ANALYSIS FOR POWER SYSTEM COMPONENTS REPORT 2018:549 RISK- OCH TILLFÖRLITLIGHETSANALYS Individual Failure Rate Modelling and Exploratory Failure Data Analysis for Power System Components JAN HENNING JÜRGENSEN ISBN 978-91-7673-549-7 | © Energiforsk December 2018 Energiforsk AB | Phone: 08-677 25 30 | E-mail: [email protected] | www.energiforsk.se INDIVIDUAL FAILURE RATE MODELLING AND EXPLORATORY FAILURE DATA ANALYSIS FOR POWER SYSTEM COMPONENTS Foreword Detta projekt är en fortsättning på ett doktorandprojekt från tidigare programperiod. Etappen Riskanalys III hanterade projektet från licentiat till doktorsavhandling genom att stötta en slutrapportering. Projektet studerade om felfrekvensnoggrannheten på en enskild komponentnivå kan öka trots att det tillgängliga historiska feldatat är begränsat. Resultatet var att effekterna av riskfaktorer på felfrekvens med regressionsmodeller kvantifierades, samt en ökad förståelse av riskfaktorer och högre noggrannhet av felfrekvensen. Statistisk data-underbyggda metoder används fortfarande sällan i kraftsystemen samtidigt som metoder behövs för att förbättra felfrekvensnoggrannheten trots begränsade tillgängliga data. Metoden beräknar individuella felfrekvenser. Jan Henning Jürgensen från Kungliga Tekniska Högskolan, har varit projektledare för projektet. Han har arbetat under handledning av docent Patrik Hilber och professor Lars Nordström, bägge på KTH. Stort tack till följande programstyrelse för all hjälp och vägledning: • Jenny
    [Show full text]
  • The Effects of Autocorrelation in the Estimation of Process Capability Indices." (1998)
    Louisiana State University LSU Digital Commons LSU Historical Dissertations and Theses Graduate School 1998 The ffecE ts of Autocorrelation in the Estimation of Process Capability Indices. Lawrence Lee Magee Louisiana State University and Agricultural & Mechanical College Follow this and additional works at: https://digitalcommons.lsu.edu/gradschool_disstheses Recommended Citation Magee, Lawrence Lee, "The Effects of Autocorrelation in the Estimation of Process Capability Indices." (1998). LSU Historical Dissertations and Theses. 6847. https://digitalcommons.lsu.edu/gradschool_disstheses/6847 This Dissertation is brought to you for free and open access by the Graduate School at LSU Digital Commons. It has been accepted for inclusion in LSU Historical Dissertations and Theses by an authorized administrator of LSU Digital Commons. For more information, please contact [email protected]. INFORMATION TO USERS This manuscript has been reproduced from the microfilm master. U M I films the text directly from die original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type o f computer printer. The quality o f this reproduction is dependent upon the quality o f the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. hi the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note w ill indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps.
    [Show full text]
  • Chapter 2 Failure Models Part 1: Introduction
    Chapter 2 Failure Models Part 1: Introduction Marvin Rausand Department of Production and Quality Engineering Norwegian University of Science and Technology [email protected] Marvin Rausand, March 14, 2006 System Reliability Theory (2nd ed), Wiley, 2004 – 1 / 31 Introduction Rel. measures Life distributions State variable Time to failure Distr. function Probability density Distribution of T Reliability funct. Failure rate Introduction Bathtub curve Some formulas MTTF Example 2.1 Median Mode MRL Example 2.2 Discrete distributions Life distributions Marvin Rausand, March 14, 2006 System Reliability Theory (2nd ed), Wiley, 2004 – 2 / 31 Reliability measures Introduction In this chapter we introduce the following measures: Rel. measures Life distributions State variable Time to failure ■ Distr. function The reliability (survivor) function R(t) Probability density ■ The failure rate function z(t) Distribution of T ■ Reliability funct. The mean time to failure (MTTF) Failure rate ■ The mean residual life (MRL) Bathtub curve Some formulas MTTF Example 2.1 of a single item that is not repaired when it fails. Median Mode MRL Example 2.2 Discrete distributions Life distributions Marvin Rausand, March 14, 2006 System Reliability Theory (2nd ed), Wiley, 2004 – 3 / 31 Life distributions Introduction The following life distributions are discussed: Rel. measures Life distributions State variable ■ The exponential distribution Time to failure ■ Distr. function The gamma distribution Probability density ■ The Weibull distribution Distribution of
    [Show full text]
  • Power Normal Distribution
    Power Normal Distribution Debasis Kundu1 and Rameshwar D. Gupta2 Abstract Recently Gupta and Gupta [10] proposed the power normal distribution for which normal distribution is a special case. The power normal distribution is a skewed distri- bution, whose support is the whole real line. Our main aim of this paper is to consider bivariate power normal distribution, whose marginals are power normal distributions. We obtain the proposed bivariate power normal distribution from Clayton copula, and by making a suitable transformation in both the marginals. Lindley-Singpurwalla dis- tribution also can be used to obtain the same distribution. Different properties of this new distribution have been investigated in details. Two different estimators are proposed. One data analysis has been performed for illustrative purposes. Finally we propose some generalizations to multivariate case also along the same line and discuss some of its properties. Key Words and Phrases: Clayton copula; maximum likelihood estimator; failure rate; approximate maximum likelihood estimator. 1 Department of Mathematics and Statistics, Indian Institute of Technology Kanpur, Pin 208016, India. Corresponding author. e-mail: [email protected]. Part of his work has been supported by a grant from the Department of Science and Technology, Government of India. 2 Department of Computer Science and Applied Statistics. The University of New Brunswick, Saint John, Canada, E2L 4L5. Part of the work was supported by a grant from the Natural Sciences and Engineering Research Council. 1 1 Introduction The skew-normal distribution proposed by Azzalini [3] has received a considerable attention in the recent years. Due to its flexibility, it has been used quite extensively for analyzing skewed data on the entire real line.
    [Show full text]
  • Basic Reliability Theory Including Reliability Models
    Practical Applications of Reliability Theory George Dodson Spallation Neutron Source ORNL is managed by UT-Battelle for the US Department of Energy Topics • Reliability Terms and Definitions • Reliability Modeling as a tool for evaluating system performance – In the design phase what are the tradeoffs of cost vs. reliability performance? – In the operational phase, does the performance meet expectations? • Analysis of the failure rate of systems or components – How do systems fail? – Is the failure rate “reasonable” ? • Analytical calculation for the number of Spares – What kinds of spares are there? – What is a “reasonable” number of spares? 2 Presentation_name Reliability Terms • Mean Time To Failure (MTTF) for non-repairable systems • Mean Time Between Failures for repairable systems (MTBF) • Reliability Probability (survival) R(t) • Failure Probability (cumulative density function ) F(t)=1-R(t) • Failure Probability Density f(t) • Failure Rate (hazard rate) λ(t) • Mean residual life (MRL) 3 Presentation_name Important Relationships R( t ) F ( t ) 1 t t f( t ) ( t )exp - ( u ) du dF ( t ) / dt F( t ) f ( u ) du , 0 0 t R( t ) 1- F ( t ) exp - ( u ) du ( t ) f ( t ) / R ( t ) 0 Where ()t is the failure rate function 4 Presentation_name MTBF The MTBF is widely used as the measurement of equipment's reliability and performance. This value is often calculated by dividing the total operating time of the units by the total number of failures encountered. This metric is valid only when the data is exponentially distributed. This is a poor assumption which implies that the failure rate is constant if it is used as the sole measure of equipment's reliability.
    [Show full text]
  • Fault Detection of Power Electronic Circuit Using Wavelet Analysis in Modelica
    Fault Detection of Power Electronic Circuit using Wavelet Analysis in Modelica Jianbo Gao*, Yang Ji**, Johann Bals**, Ralph Kennel* *Technische Universität München Arcisstr. 21, 80333 Munich, Germany [email protected], [email protected] ** German Aerospace Center Muenchner Str. 20, 82234 Wessling, Germany [email protected], [email protected] obtain maximum flight reliability and minimum Abstract maintenance efforts, advanced failure analysis tech- nologies shall be applied to ensure correct and quick In more electric aircrafts (MEA) the electric pow- er network is important for the reliability. To prevent fault detection and isolation. It is well known that an severe faults it is the key issue to identify the faults output voltage regulated DC/DC power converter in the early stage before a complete failure happens. supplying constant power loads could de-stabilize In this paper an early stage fault detection method the network stability due to the degraded perfor- using wavelet multi-resolution analysis (MRA) for a mance of its input filter. The sensitivity study of in- regulated buck DC-DC converter is studied. Specifi- put filter parameters concerning the network stability cally, the electrolyte input capacitor is diagnosed. addressed in [2] reveals that the observation of de- The study was carried out using simulation with graded degree of the capacitor in the input filter can Modelica / Dymola. The fault features that were ex- significantly increase the network reliability. tracted from different levels of wavelet decomposi- tion provided clear information for both fast and 1.2 State of the art slow occurring faults. This method showed signifi- cant advantages compared with filter techniques.
    [Show full text]
  • System Reliability Concepts
    2 System Reliability Concepts The analysis of the reliability of a system must be based on precisely defined concepts. Since it is readily accepted that a population of supposedly identical systems, operating under similar conditions, fall at different points in time, then a failure phenomenon can only be described in probabilistic terms. Thus, the fundamental definitions of reliability must depend on concepts from probability theory. This chapter describes the concepts of system reliability engineering. These concepts provide the basis for quantifying the reliability of a system. They allow precise comparisons between systems or provide a logical basis for improvement in a failure rate. Various examples reinforce the definitions as presented in Section 2.1. Section 2.2 examines common distribution functions useful in reliability engineering. Several distribution models are discussed and the resulting hazard functions are derived. Section 2.3 describes a new concept of systemability. Several systemability functions of various system configurations such as series, parallel, and k-out-of-n, are presented. Section 2.4 discusses various reliability aspects of systems with multiple failure modes. Stochastic processes including Markov process, Poisson process, renewal process, quasi-renewal process, and nonhomogeneous Poisson process are discussed in Sections 2.5 and 2.6. In general, a system may be required to perform various functions, each of which may have a different reliability. In addition, at different times, the system may have a different probability of successfully performing the required function under stated conditions. The term failure means that the system is not capable of performing a function when required. The term capable used here is to define if the system is capable of performing the required function.
    [Show full text]