An Analysis of Data Corruption in the Storage Stack
Garth Goodson, NetApp, Inc.
Lakshmi Bairavasundaram, University of Wisconsin-Madison
Bianca Schroeder, University of Toronto
Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin-Madison
Storage Developer Conference 2008


Corruption Anecdote
- There is much anecdotal evidence of data corruption, e.g., a corrupted photo stored on an author's laptop.
- System designers know of similar occurrences, and data protection is often based on anecdotes.
- Anecdotes are interesting, but not enough for system design; a more rigorous understanding is needed.

Our Analysis
- First large-scale study of data corruption: 1.53 million disks in thousands of NetApp systems.
- Time period: 41 months (Jan 2004 – Jun 2007).
- Corruption detected using various data protection techniques.
- Data comes from the NetApp Autosupport Database, also used in the latent sector error [Bairavasundaram07] and disk and storage failure [Jiang08] studies.

Questions we had about corruption
- What kinds of corruption occur, and how often?
- Does disk class matter? Expensive enterprise (FC) disks versus cheaper nearline (SATA) disks.
- Does disk drive family/product matter?
- Are corruption instances independent?
- Do corruption instances have spatial locality?

Talk Outline
- Introduction
- Background: data corruption, protection techniques
- Results
- Lessons
- Conclusion

Should we care about disk errors?
- Joint UIUC/NetApp system failure analysis: 44 months, 39,000 systems, 1.8 million disks.*
- [Pie charts: breakdown of system failures into performance, protocol, disk, and physical interconnect failures, for high-end and nearline systems.]
- *W. Jiang et al., "Are disks the dominant contributor of storage failures?", USENIX FAST, 2008.

Disk system failure rates
- From the failure rate pie charts: in high-end systems, 29% of system errors are disk errors; in nearline systems, 57%.
- What's going on? The software is generally the same and the hardware platforms are somewhat different, but the real difference is the type of disk in use, i.e., Fibre Channel vs. SATA.

Types of disk errors
- Operational/component failures: a fundamental problem with the drive hardware (bad servo, head, electronics, etc.).
- Firmware bugs: failure to flush the cache on power-down, etc.
- Partial failures: affect only a small subset of disk sectors.
  - Errors during writing: bad media, high-fly writes, vibration, etc.
  - Errors during reading (the write was successful): scratches, corrosion, thermal asperities, etc.

Unreported Disk Errors
- Operational failures are easy to detect: they are usually fail-stop, and something stops working.
- Latent sector errors are reported via SCSI errors when a disk sector is read.
- What about errors that go undetected? These are errors not corrected by the disk's ECC; they cannot be corrected unless they are detected first, and the result is usually some form of corruption.

Data Corruption
- Data stored on a disk block is incorrect.
- Many sources: software bugs (file system, software RAID, device drivers, etc.) and firmware bugs (disk drives, shelf controllers, adapters, etc.).
- Corruption is silent: it is not reported by the disk drive, and it could have a greater impact than other errors.

Forms of Data Corruption
- Bit corruption: the contents of an existing disk block are modified, or the data being written to a disk block is corrupted.
- Lost writes: data is not written, but completion is reported.
- Misdirected writes: data is written to the wrong disk block.
- Torn writes: data is only partially written, but completion is reported.
- In all cases, the data passes the disk's internal ECC.

Detecting data corruption
- Basic idea (see the sketch below):
  1. Generate a checksum of the data (64 bytes per 4 KB).
  2. Store the checksum along with the data (4 KB FS block).
  3. Verify the checksum whenever reading the data.
- A simple checksum has limited protection: it detects bit corruption and torn (partial) writes, but gives no protection against lost or misdirected writes, since the data was not overwritten.
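A minimal sketch of this write-path/read-path check. It is not NetApp's on-disk format, and the deck does not name the checksum function, so SHA-512 is used here only because it happens to produce 64 bytes:

```python
import hashlib
import os

BLOCK_SIZE = 4096      # 4 KB file system block
CHECKSUM_SIZE = 64     # 64-byte checksum area stored alongside each block

def make_checksum(data: bytes) -> bytes:
    # SHA-512 is an arbitrary stand-in that produces exactly 64 bytes; the deck
    # does not specify which checksum function the real system uses.
    return hashlib.sha512(data).digest()

def write_block(disk: dict, block_no: int, data: bytes) -> None:
    assert len(data) == BLOCK_SIZE
    cksum = make_checksum(data)            # step 1: generate checksum
    assert len(cksum) == CHECKSUM_SIZE
    disk[block_no] = (data, cksum)         # step 2: store checksum with the data

def read_block(disk: dict, block_no: int) -> bytes:
    data, stored = disk[block_no]
    if make_checksum(data) != stored:      # step 3: verify on every read
        raise IOError(f"checksum mismatch on block {block_no}")
    return data

# Flip one bit "on disk": the next read reports a checksum mismatch.
disk = {}
write_block(disk, 7, os.urandom(BLOCK_SIZE))
data, cksum = disk[7]
disk[7] = (data[:100] + bytes([data[100] ^ 0x01]) + data[101:], cksum)
try:
    read_block(disk, 7)
except IOError as err:
    print(err)
```

As the slide notes, a mismatch here flags bit corruption or a torn write; a lost write that leaves the old block and its old checksum in place still verifies cleanly, which is the problem the next slides address.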
Checksum problems: lost writes
- [Diagram: a stripe of data blocks A, B, C plus parity, each with a block checksum. An overwrite of C with C' is lost: the disk still holds C and cksum(C), which are self-consistent, so a later read of file ABC' passes the checksum check and returns corrupt data (C instead of C').]

Write verify: a partial solution
- An attempt to solve the lost write problem; a costly solution from which one would expect good protection.
- Procedure:
  1. Write the data to disk.
  2. Read it back to verify.
  3. If a lost write is detected, write the data again or remap it to a new location.
- [Diagram: the overwrite of C with C' is lost; reading back returns C, the lost write is detected, and C' is written again.]

Lost write protection: a better way
- Need logical information pertaining to block identity, something external to the data being stored (see the sketch below).
- Store the inode and FS block number within the checksum area; the file system verifies them at read time.
- We also add a checksum of the checksum structure itself.
- The 64-byte checksum area thus contains: a block checksum (protects the 4 KB FS block), block identity data (protects against lost writes), and an embedded checksum (protects the checksum structure).
- [Diagram: a 4 KB file system block and its 64-byte checksum laid out across eight 520-byte sectors.]
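A minimal sketch of such an identity-protected checksum area, with illustrative hash functions and field widths (not WAFL's actual 64-byte layout): the file system knows which inode and file block number it expects when it issues the read, so a stale or misplaced block whose stored identity disagrees is flagged even though its own block checksum is self-consistent.

```python
import hashlib
import struct
from dataclasses import dataclass

@dataclass
class ChecksumArea:
    """Illustrative stand-in for the 64-byte checksum area: block checksum,
    block identity (inode and file block number), and an embedded checksum
    over the structure itself. Not WAFL's actual on-disk layout."""
    block_cksum: bytes     # protects the 4 KB file system block
    inode: int             # identity: which file this block belongs to
    fbn: int               # identity: file block number (offset within the file)
    embedded_cksum: bytes  # protects the checksum structure itself

def make_area(data: bytes, inode: int, fbn: int) -> ChecksumArea:
    block_cksum = hashlib.sha256(data).digest()
    identity = struct.pack(">QQ", inode, fbn)
    embedded = hashlib.sha256(block_cksum + identity).digest()[:8]
    return ChecksumArea(block_cksum, inode, fbn, embedded)

def verify(data: bytes, area: ChecksumArea, want_inode: int, want_fbn: int) -> str:
    identity = struct.pack(">QQ", area.inode, area.fbn)
    if hashlib.sha256(area.block_cksum + identity).digest()[:8] != area.embedded_cksum:
        return "checksum-structure corruption"
    if hashlib.sha256(data).digest() != area.block_cksum:
        return "checksum mismatch"        # bit corruption or torn write
    if (area.inode, area.fbn) != (want_inode, want_fbn):
        return "identity discrepancy"     # lost or misdirected write
    return "ok"

# A block written for inode 42, offset 0 turns up where inode 99, offset 5 was
# expected (e.g., a misdirected or lost write left stale data behind):
block = b"x" * 4096
area = make_area(block, inode=42, fbn=0)
print(verify(block, area, want_inode=99, want_fbn=5))   # -> identity discrepancy
```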
Summary: Data Corruption Classes
- Checksum mismatch. Causes: bit corruption, torn/misdirected write. Detection: block checksum mismatch.
- Identity mismatch. Causes: lost or misdirected write. Detection: block identity mismatch.
- Parity mismatch. Causes: lost write, bad parity. Detection: RAID parity computation mismatch.

Talk Outline
- Introduction
- Background
- Results: system architecture, overall results, checksum mismatch results
- Lessons and Conclusion

NetApp® System
- [Architecture diagram: the client interface (NFS) sits on top of the WAFL® file system, the RAID layer, the storage layer, and the disk drives, with Autosupport collecting error data.]
- (3) WAFL® file system: stores and verifies block identity (inode X, offset Y); detects identity discrepancies (lost or misdirected writes).
- (2) RAID layer: parity generation, reconstruction on failure, and data scrubbing (read blocks, verify parity); detects parity inconsistencies (lost or misdirected writes, parity miscalculations).
- (1) Storage layer: stores and verifies the block checksum on the disk drives; detects checksum mismatches (bit corruptions, torn writes).

Overall Numbers
- What percentage of disks are affected by the different kinds of corruption?

Overall Numbers (% disks affected in 17 months of use)

  Corruption type            Nearline (SATA)   Enterprise (FC)
  1. Checksum mismatches     0.661%            0.059%
  2. Parity inconsistencies  0.147%            0.017%
  3. Identity discrepancies  0.042%            0.006%

- Roughly 10 times fewer disks are affected than by latent sector errors.
- A higher percentage of nearline disks are affected: an order of magnitude more than enterprise disks.
- Bit corruptions or torn writes affect more disks than lost or misdirected writes.

Checksum Mismatch (CM) Analysis
1. Factors: disk class (nearline/enterprise), disk model, disk age, disk size (capacity), workload.
2. Characteristics: CMs per corrupt disk, independence, spatial locality, temporal locality.
3. Correlations with other errors: not-ready conditions, latent sector errors, system resets, etc.
4. Request type: scrubs vs. FS reads, etc.

Checksum Mismatch (CM) Analysis (1. Factors)
- Factors examined: disk class, disk model, disk age, disk size, workload.

Factors
- Do disk class, model, or age affect the development of checksum mismatches?
  - Disk class: nearline (SATA) or enterprise (FC).
  - Disk model: a specific disk drive product (say, vendor V's disk product P of capacity 80 GB).
  - Disk age: time in the field since the ship date.
- Can we use these factors to determine corruption handling policies or mechanisms? For example, aggressive scrubbing for some disks.

Class, Model, Age – Nearline
- [Chart: % of disks with at least 1 CM versus disk age (0-18 months), per nearline disk model.]
- The fraction of disks affected varies across models, from 0.27% to 3.51%.
- More than 3%: 4 out of 6 models.
- The response to age also varies.

Class, Model, Age – Enterprise
- [Chart: % of disks with at least 1 CM versus disk age (0-18 months), per enterprise disk model.]
- The fraction of disks affected varies across models, from 0% to 0.17%; all are lower than the lowest nearline model (0.27%).
- The response to age also varies.

Factors – Summary
- Class and model matter; nearline disks require greater attention.
- The effect of age is unclear, so age-specific corruption handling cannot be used.

Checksum Mismatch (CM) Analysis (2. Characteristics)
- Characteristics examined: CMs per corrupt disk, independence, spatial locality, temporal locality.

Checksum Mismatches per Corrupt Disk
- Corrupt disk: a disk with at least 1 checksum mismatch (CM).
- How many CMs does a corrupt disk have? Should we "fail out" disks when one corruption is detected?

CMs per Corrupt Disk – Nearline
- [Chart: cumulative % of corrupt disks with at most X checksum mismatches, for X from 1 to 1,000.]
- The number of CMs per corrupt disk is low: 50% of corrupt disks have 2 or fewer CMs, and 90% have 100 or fewer.
- Anomaly: model E-1 develops many CMs.
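Of the three corruption classes summarized earlier, the checksum and identity checks were sketched above; the third, parity mismatch, is what the RAID-layer data scrub detects by reading the blocks of a stripe, recomputing parity, and comparing it with the stored parity block. A minimal sketch of that check, assuming simple XOR parity rather than NetApp's actual RAID implementation:

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together (simple RAID-4/5 style parity)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

def scrub_stripe(data_blocks, stored_parity):
    """Recompute parity from the data blocks and compare with the stored parity
    block. A mismatch flags a parity inconsistency somewhere in the stripe
    (lost write, misdirected write, or bad parity); it does not by itself
    identify which block is stale."""
    return xor_blocks(data_blocks) == stored_parity

# A lost write: parity was updated for the new contents of block C, but the
# write of C itself never reached the disk, so a later scrub finds a mismatch.
a, b, old_c, new_c = b"\x01" * 8, b"\x02" * 8, b"\x04" * 8, b"\x08" * 8
stored_parity = xor_blocks([a, b, new_c])          # parity reflects the intended write
print(scrub_stripe([a, b, old_c], stored_parity))  # -> False: parity inconsistency
```

A parity mismatch only says the stripe is inconsistent; pinpointing and repairing the stale block relies on the block checksums and identity information described earlier.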