(Raid) Incorporating Latent Defects and Non-Homogeneous Poisson Process Events
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Title of Document: RELIABILITY MODEL AND ASSESSMENT OF REDUNDANT ARRAYS OF INEXPENSIVE DISKS (RAID) INCORPORATING LATENT DEFECTS AND NON-HOMOGENEOUS POISSON PROCESS EVENTS. Jon Garrett Elerath Doctor of Philosophy, 2007 Directed By: Professor Michael G. Pecht Department of Mechanical Engineering Today’s most reliable data storage systems are made of redundant arrays of inexpensive disks (RAID). The quantification of RAID system reliability is often based on models that omit critical hard disk drive failure modes, assume all failure and restoration rates are constant (exponential distributions), and assume the RAID group times to failure follow a homogeneous Poisson process (HPP). This paper presents a comprehensive reliability model that accounts for numerous failure causes for today’s hard disk drives, allows proper representation of repair and restoration, and does not rely on the assumption of a HPP for the RAID group. The model does not assume hard disk drives have constant transition rates, but allows each hard disk drive "slot" in the RAID group to have its own set of distributions, closed form or user defined. Hard disk drive (HDD) failure distributions derived from field usage are presented, showing that failure distributions are commonly non-homogeneous, frequently having increasing hazard rates from time zero. Hard disks drive failure modes and causes are presented and used to develop a model that reflects not only complete failure, but also degraded conditions due to undetected, but corrupted data (latent defects). The model can represent user defined distributions for completion of “background scrubbing" to correct (remove) corrupted data. Sequential Monte Carlo simulation is used to determine the number of double disk failures expected as a function of time. RAID group can be any size up to 25. The results are presented as mean cumulative failure distributions for the RAID group. Results estimate the number of double disk failures can be as much as 5000 times greater than that predicted over 10 years when using the mean time to data loss method or Markov models when the characteristic lives of the input distributions is the same. Model results are compared to actual field data for two HDD families and two different RAID group sizes and show good correlation. Results show the rate of occurrence of failure for the RAID group may be increasing, decreasing or constant depending on the parameters used for the four input distributions. RELIABILITY MODEL AND ASSESSMENT OF REDUNDANT ARRAYS OF INEXPENSIVE DISKS (RAID) INCORPORATING LATENT DEFECTS AND NON-HOMOGENEOUS POISSON PROCESS EVENTS. By Jon Garrett Elerath Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2007 Advisory Committee: Professor Michael Pecht, Chair Professor David Barbe, Dean’s Representative Dr. Diganta Das Associate Professor Patrick McCluskey Associate Professor Peter Sandborn © Copyright by Jon Garrett Elerath 2007 Preface “All models are wrong, some models are useful.” - George Box ii Dedication This is dedicated to my father, B. Everett Elerath, and my mother, Betty Jean Elerath. My father, who passed away in December 2002, steadfastly believed he could not help me through every situation I would encounter in life. So, when I would ask how to do something, he often would help only by saying "figure it out". My mother passed away in April, 2004 had a tender heart and cared deeply for her children. My parents both passed while I pursued my Doctor of Philosophy. I miss you Mom and Dad. ......and Dad, I figured it out. iii Acknowledgements I want to thank my advisor, Dr. Michael Pecht, for encouraging me to pursue my Doctoral degree. His critical eye for detail has only helped me grow professionally. As a distance-learning student residing in California for the duration of my studies, it was often difficult to navigate the maze of written University and Graduate College requirements as well as other unwritten expectations. I want to thank Elyse Beaulieu-Lucy for her help in navigating the objective rules and regulations. I want to thank Dr. Diganta Das for his incessant work, attention to detail, wisdom in vague situations, and advice on subjective requirements. Without his help I’d probably still be filing my study plan. I want to acknowledge Sandeep Shah and Steven Magie, fellow Reliability Engineers, and Steve Kleiman, Executive Vice-President and Chief Technical Officer of Network Appliance, for their comments and suggestions. I thank Alan Wood, a good friend for many years, for giving up valuable family time to evaluate my first, very broken model. I thank the six managers I have had during this four and one-half year adventure, all of whom have supported me in my quest; Bob Weisickle, Ko Yamamoto, Bill Jacobsen, Dave Barney, Jim Ward and Dave Alexander. They also provided me with data for my analysis of time-between-managers. My brother Doug earned a Ph.D. in mathematics from L.S.U. One day I asked him, "Doug, what do you do with a Ph.D. in mathematics?" He looked at me, and very seriously said, "Prove theorems." Being a practical sort of person I then asked "What do you do with these theorems?" He again looked at me, waved his hands and iv said, "Jon, that is for you engineers to figure out." Here is to the practical side of mathematics. Thanks, Doug. Lastly and most importantly, I want to thank my wife, Debra, for her tolerance and encouragement in what seemed like a journey without end. I love you. v Table of Contents Preface........................................................................................................................... ii Dedication....................................................................................................................iii Acknowledgements...................................................................................................... iv Table of Contents......................................................................................................... vi List of Tables .............................................................................................................viii List of Figures.............................................................................................................. ix List of Figures.............................................................................................................. ix Chapter 1 Introduction............................................................................................... 1 1.1 Thesis Problem Statement ............................................................................ 2 1.2 Thesis Organization...................................................................................... 3 Chapter 2 Hard Disk Drives, Failure Modes and Causes.......................................... 5 2.1 HDD Description.......................................................................................... 5 2.2 Data Layout................................................................................................... 9 2.3 HDD Failure Causes................................................................................... 11 2.3.1 Scratches from Hard Particle Contaminants:...................................... 11 2.3.2 Media damage from “Soft Particles” .................................................. 12 2.3.3 Thermal Effects on Read Heads ......................................................... 12 2.3.4 Degraded/Unstable Heads................................................................... 14 2.3.5 Contamination from Lubrication ........................................................ 17 2.3.6 Other Causes....................................................................................... 18 2.4 HDD Failure Modes (Consequences) ......................................................... 19 2.4.1 Cannot Find Data ................................................................................ 19 2.4.2 Data Missing....................................................................................... 21 Chapter 3 RAID System Architecture..................................................................... 24 3.1 General RAID Concepts ............................................................................. 24 3.2 RAID Definitions........................................................................................ 25 3.3 Data Loss and Error Correction .................................................................. 27 Chapter 4 Previous Work ........................................................................................ 29 4.1 MTTDL and Markov Models ..................................................................... 29 4.2 Assumptions and Errors in Past RAID Models .......................................... 34 4.2.1 Hazard Rate versus Rate of Occurrence of Failure............................. 35 4.2.2 Non-Homogeneous Processes............................................................. 39 4.2.3 Errors Resulting from the HPP Assumption....................................... 41 Chapter 5 Realities of Field Data ............................................................................ 44 5.1 The Bathtub Myth....................................................................................... 45 5.2 The Importance of Vintage Analyses ......................................................... 46 5.3 Competing Risks versus Distribution