Oracle ZFS Storage--Data Integrity White Paper


Oracle White Paper | May 2017

Table of Contents

Introduction
Overview
Primary Design Goals
Shortcomings of Traditional Data Storage
RAID Design Problems
Traditional RAID Data Integrity Study
ZFS Data Integrity
ZFS Transactional Processing
ZFS Ditto Blocks
ZFS Self-Healing
University Research Testing and Validation of ZFS Data Integrity
Defining Corruption and How Often Corruption Happens
Problems Identified with Other File Systems and RAID
Research Testing Methodology
Testing Configuration and Data Layout
Oracle ZFS Storage Appliance Robustness
Reducing Risk Through Advanced Data Integrity
Robust Protection from Hardware Failures
ZFS Data Encryption
Conclusion
Related Links

Introduction

Studies have shown that traditional RAID technology, while effective for disk space savings, cannot provide sufficient end-to-end data integrity to ensure against data loss or corruption in contemporary cloud-scale data storage environments. For example, traditional RAID technology is unable to isolate and protect against firmware or driver bugs that can have a substantial impact on a storage system's ability to protect against data corruption or loss.

Modern storage architectures, like those incorporated into Oracle ZFS Storage Appliance, protect against these failure modes by providing advanced data integrity technology such as hierarchical checksums, redundant metadata, transactional processing, and integrated redundancy. These features provide more comprehensive and reliable protection against data corruption or loss.
This paper explores inherent design flaws in traditional RAID products that cause data loss, provides case study evidence of these deficiencies, and demonstrates how modern technology like Oracle ZFS Storage is better suited to today's cloud-scale data storage environments.[1]

[1] Throughout this paper, Oracle ZFS Storage is used as an umbrella term that covers numerous products, such as the Oracle ZFS Storage Appliance systems, which use the Oracle Solaris ZFS file system.

Overview

Today's vast repositories of data represent valuable intellectual property that must be protected to ensure the livelihood of the enterprise. While backup and archival capabilities are designed to guard against data loss, they do not necessarily protect against silent data corruption. Furthermore, in the event of a problem, the process of restoring archived data generally entails downtime and thus lost productivity. Even worse, not all restore operations are successful, and this can result in permanent data loss.

Primary Design Goals

A primary design goal and foundation of Oracle ZFS Storage systems is data integrity. The modern end-to-end data integrity architecture of the ZFS file system is designed to overcome the deficiencies of traditional storage products. The Oracle ZFS Storage Appliance architecture delivers industry-leading performance while providing end-to-end data protection and data integrity capabilities that guard against data loss better than traditional storage systems.

» Redirect-on-write architecture: Data is never overwritten in place.
» Snapshots provide continuous file system replication.
» Checksum protection occurs throughout the data path.
» "Self-healing" ability replaces damaged data from redundant configurations.
» Multiple levels of RAID protection meet modern capacities.
» ZFS data path reliability is available throughout the OS, application, and hardware stack.
Shortcomings of Traditional Data Storage

The problem with traditional data storage products is that there is no defense against silent data corruption. Any defect in disk, controller, cable, driver, laser, or firmware can corrupt data silently.

» File systems rely on underlying hardware to detect and report errors.
» Hardware doesn't know that a firmware bug occurred.
» If a disk returns bad data, a traditional file system won't detect it.

Even without hardware problems, data is vulnerable to in-transit damage from causes such as controller bugs, DMA parity errors, and SAN network bugs. Block-level checksums only prove that a block is self-consistent. They do not ensure that it's the right block. There is no fault isolation between the data and the checksum that is supposed to protect it, as illustrated in Figure 1 below.

RAID Design Problems

RAID 5 and other RAID parity schemes include a fatal flaw known as the RAID 5 write hole. When data in a RAID stripe is updated, the parity must also be updated, because the parity is what allows the data to be reconstructed when a disk fails. Because no way exists to update two or more disks atomically, a RAID stripe can become damaged if the system crashes or a power outage occurs after the data has been written but before the parity has been written. The data and the parity are then inconsistent and remain inconsistent, so a later reconstruction from that parity produces wrong data. This is a silent failure that corrupts data. RAID systems can provide protection with NVRAM that survives power loss, but this solution is expensive, and if the battery-backed NVRAM cache fails, a risk of data inconsistency still exists.

A well-known performance problem with RAID systems is that if they do partial-stripe writes, where the data updated is less than a single RAID stripe, the RAID system must read the old data and parity to calculate the new parity. This recalculation slows performance.
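The parity mechanics described above can be illustrated with a short sketch. It assumes single-parity XOR over a three-disk stripe with four-byte blocks; the helper name xor_blocks and the block contents are hypothetical and chosen only for illustration, not taken from any RAID product.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# A three-disk stripe: two data blocks plus one parity block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)

# Reconstruction: if the disk holding d1 fails, d1 is the XOR of the survivors.
assert xor_blocks(d0, parity) == d1

# Partial-stripe write: updating only d0 means reading the old data and
# the old parity first, then computing the new parity from them.
new_d0 = b"CCCC"
new_parity = xor_blocks(parity, d0, new_d0)
assert new_parity == xor_blocks(new_d0, d1)

# Write hole: the new data reaches disk, then the system crashes before
# the parity write. The stale parity no longer matches the stripe.
on_disk_d0, on_disk_parity = new_d0, parity
reconstructed_d1 = xor_blocks(on_disk_d0, on_disk_parity)
write_hole_corrupts = reconstructed_d1 != d1
```

In this sketch nothing reports an error after the simulated crash. The inconsistency only surfaces if the disk holding d1 later fails and the stale parity is used to rebuild it, which is why the failure is silent.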
A possible solution is for the RAID system to buffer the partial-stripe writes in NVRAM, which hides the latency from the user, but the NVRAM cache can fill up. One RAID vendor's solution is to sell more expensive cache. Today's savvy enterprise customers are looking for a storage solution without the costs associated with silent data corruption or slow performance.

Figure 1. Traditional RAID approach leaves I/O path vulnerable.

Traditional RAID Data Integrity Study

A CERN research study on data integrity with traditional RAID solutions found that data corruption occurs at lower levels and that probes and monitors must be implemented and deployed, leading to increased OpEx and CapEx costs.[2]

» Testing involved writing 2 GB files with special bit patterns and afterwards reading the files back to compare the patterns. The testing was deployed on more than 3,000 nodes (disk server, CPU server, database server, etc.) and run every two hours. After about five weeks of testing, the results revealed 500 errors on 100 nodes.
» The study's findings uncovered the following RAID problems:
  » RAID controllers don't always check parity when reading data from RAID 5 file systems.
  » RAID controllers don't always report problems at the disk level to the upper-level OS.
  » Running a verify operation for the RAID controller reads all data from disk and recalculates the RAID 5 checksum, but it doesn't have a notion of what "correct" data is from a user point of view.
  » Running the verify operation on 492 systems over four weeks resulted in the fix of ~300 block problems.
  » The discovery of a disk firmware problem required a manual update of the firmware on 3,000 disks.
» CERN concluded that they must implement and deploy constant RAID monitoring and intensive probes that would double their original disk I/O performance requirements, and that they must also increase CPU capacity on storage servers by 50 percent to accommodate the RAID monitoring.

[2] Bernd Panzer-Steindel, CERN/IT, Data Integrity, April 2007

ZFS Data Integrity

ZFS is a modern file system that provides the following data integrity components to eliminate the problems with previous storage products:

» End-to-end data integrity requires each data block to be verified against an independent checksum after the data has arrived in the host's memory.
» The ZFS design provides fault isolation between data and checksum by storing the checksum of each block in its parent block pointer, not in the block itself.
» ZFS organizes its blocks as a self-validating Merkle tree, a structure proven to provide cryptographically strong authentication for any component of the tree and for the tree as a whole.
» Each ZFS block contains checksums for all its children, meaning the entire pool is self-validating.
» ZFS knows to trust the checksum because it is part of some other block at a level higher in the tree, and that block has already been validated.
» ZFS uses this end-to-end checksum hierarchy to detect and correct silent data corruption:
  » If a disk returns bad data transiently, ZFS detects it and retries the read.
  » If the disk is part of a mirror or RAIDZ group, ZFS both detects and corrects the error: it uses the checksum to determine which copy is correct, provides good data to the application, and repairs the damaged copy.

Figure 2. The ZFS hierarchical checksums can detect more types of errors than traditional block-level checksums.
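The combination of parent-held checksums and self-healing described above can be sketched as follows. This is a toy model, not ZFS code: the Mirror class, the dictionary used as a block pointer, and the choice of SHA-256 are all assumptions made for illustration. For simplicity the sketch checks every side of the mirror on each read, whereas a real system would typically read one copy and fall back to another only on a mismatch.

```python
import hashlib

def checksum(data: bytes) -> bytes:
    # SHA-256 is an illustrative choice of checksum function.
    return hashlib.sha256(data).digest()

class Mirror:
    """Toy two-way mirror: each side maps a block address to its contents."""

    def __init__(self):
        self.sides = [{}, {}]

    def write(self, addr: int, data: bytes) -> bytes:
        for side in self.sides:
            side[addr] = data
        # The checksum goes back to the caller, which keeps it in the
        # parent block pointer rather than next to the data.
        return checksum(data)

    def read(self, addr: int, expected: bytes) -> bytes:
        good = None
        bad_sides = []
        for side in self.sides:
            data = side.get(addr)
            if data is not None and checksum(data) == expected:
                good = data
            else:
                bad_sides.append(side)
        if good is None:
            raise IOError(f"unrecoverable checksum error at block {addr}")
        for side in bad_sides:
            side[addr] = good  # self-healing: repair the damaged copy
        return good

mirror = Mirror()
# Hypothetical parent block pointer: address plus the child's checksum.
pointer = {"addr": 7, "checksum": mirror.write(7, b"payroll records")}

# Simulate silent corruption on one side of the mirror.
mirror.sides[0][7] = b"payroll recordz"

data = mirror.read(pointer["addr"], pointer["checksum"])
```

Because the expected checksum lives in the parent pointer, a block that is internally consistent but wrong (for example, a misdirected or stale write) still fails verification, which a checksum stored alongside the data could not detect.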
ZFS Transactional Processing

ZFS maintains data consistency in the event of system crashes by using a redirect-on-write transactional update model. The model is similar to a copy-on-write (COW) model but more efficient, because it generates one fewer copy. ZFS file system metadata and data are represented as objects and are grouped into a transaction group for modification. New copies are created for all the modified blocks (in a Merkle tree) when a transaction group is committed to disk. See Figure 1. The root of the tree structure (uberblock) is updated atomically, maintaining an always-consistent disk state. The redirect-on-write transactional processing model, along with hierarchical checksums, means there is no need for the journal that is part of many traditional file systems.

ZFS Ditto Blocks

ZFS block pointers locate data blocks on disk by means of disk virtual addresses (DVAs). A ZFS block pointer identifies not just one DVA, but up to three DVAs.
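The redirect-on-write update described under ZFS Transactional Processing can be sketched with a toy block store. The Pool class, its two-level tree, and the text encoding of the parent block are hypothetical simplifications; only the ordering of the steps (new blocks first, root switch last) reflects the model described above.

```python
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Pool:
    """Toy block store in which blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}       # address -> bytes
        self.next_addr = 0
        self.uberblock = None  # (root address, root checksum)

    def _alloc_write(self, data: bytes) -> int:
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr

    def commit(self, file_data: bytes) -> None:
        # 1. Write the new data block to a fresh location.
        data_addr = self._alloc_write(file_data)
        # 2. Write a new parent block holding the child's address and checksum.
        parent = f"{data_addr}:{digest(file_data)}".encode()
        root_addr = self._alloc_write(parent)
        # 3. Switch the root in a single step. Until this assignment,
        #    readers still see the previous, fully consistent tree.
        self.uberblock = (root_addr, digest(parent))

    def read(self) -> bytes:
        root_addr, root_sum = self.uberblock
        parent = self.blocks[root_addr]
        if digest(parent) != root_sum:
            raise IOError("root block failed checksum verification")
        data_addr, data_sum = parent.decode().split(":")
        data = self.blocks[int(data_addr)]
        if digest(data) != data_sum:
            raise IOError("data block failed checksum verification")
        return data

pool = Pool()
pool.commit(b"version 1")
old_root = pool.uberblock
pool.commit(b"version 2")
current = pool.read()
```

In this sketch a crash between steps 1 and 3 leaves unreferenced blocks but never a half-updated tree, which illustrates why such a model does not need a journal. The old blocks remain in place after the second commit, which is also what makes snapshots inexpensive in this style of design.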