Disk Scrubbing in Large Archival Storage Systems

Total pages: 16

File type: PDF, size: 1020 KB

Disk Scrubbing in Large Archival Storage Systems

Thomas J. E. Schwarz, S.J.1,2† Qin Xin2‡ Ethan L. Miller2‡ Darrell D. E. Long2‡ Andy Hospodor1 Spencer Ng3

1Computer Engineering Department, Santa Clara University, Santa Clara, CA 95053
2Storage Systems Research Center, University of California, Santa Cruz, CA 95064
3Hitachi Global Storage Technologies, San Jose Research Center, San Jose, CA 95120

† Supported in part by an SCU-internal IBM research grant.
‡ Supported in part by Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratory under contract B520714.

Proceedings of the 12th IEEE/ACM International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2004), Volendam, Netherlands, October 2004.

Abstract

Large archival storage systems experience long periods of idleness broken up by rare data accesses. In such systems, disks may remain powered off for long periods of time. These systems can lose data for a variety of reasons, including failures at both the device level and the block level. To deal with these failures, we must detect them early enough to be able to use the redundancy built into the storage system. We propose a process called "disk scrubbing" in a system in which drives are periodically accessed to detect drive failure. By scrubbing all of the data stored on all of the disks, we can detect block failures and compensate for them by rebuilding the affected blocks. Our research shows how the scheduling of disk scrubbing affects overall system reliability, and that "opportunistic" scrubbing, in which the system scrubs disks only when they are powered on for other reasons, performs very well without the need to power on disks solely to check them.

1. Introduction

As disks outpace tapes in capacity growth, large scale storage systems based on disks are becoming increasingly attractive for archival storage systems. Such systems will encompass a very large number of disks and store petabytes of data (10^15 bytes). Because the disks store archival data, they will remain powered off between accesses, conserving power and extending disk lifetime. Colarelli and Grunwald [2] called such a system a Massive Array of mainly Idle Disks (MAID).

In a large system encompassing thousands or even tens of thousands of active disks, disk failure will become frequent, necessitating redundant data storage. For instance, we can mirror disks or mirror blocks of data, or even use higher degrees of replication. For better storage efficiency, we collect data into large reliability blocks, and group m of these large reliability blocks together in a redundancy group to which we add k parity blocks. We store all reliability blocks in a redundancy group on different disks. We calculate the parity blocks with an erasure correcting code. The data blocks in a redundancy group can be recovered if we can access m out of the n = m + k blocks making up the redundancy group, making our system k-available.

Our redundancy scheme implies that the availability of data depends on the availability of data in another block. The extreme scenario is one where we try to access data, discover that the disk containing the block has failed or that a sector in the block can no longer be read, access the other blocks in the redundancy group, and then find that enough of them have failed so that we are no longer able to reconstruct the data, which is now lost. In this scenario, the reconstruction unmasks failures elsewhere in the system.

To maintain high reliability, we must detect these masked failures early. Not only do we have to access disks periodically to test whether the device has failed, but we also have to periodically verify all of the data on it [6, 7], an operation called "scrubbing." By merely reading the data, we verify that we can access them. Since we are already reading the data, we should also verify its accuracy using signatures—small bit strings calculated from a block of data. We can use signatures to verify the coherence between mirrored copies of blocks or between client data and generalized parity data blocks. Signatures can also detect corrupted data blocks with a tiny probability of error. Our work investigates the impact of disk scrubbing through analysis and simulation.
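As a concrete illustration of the m-out-of-n recovery rule (an editorial sketch, not code from the paper), the snippet below implements the simplest case k = 1, where the single parity block is the bitwise XOR of the m data blocks; any one lost block, data or parity, can then be rebuilt from the remaining m. All names are illustrative.

    from functools import reduce

    def xor_blocks(blocks):
        # Bitwise XOR of equal-length byte strings.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    def make_group(data_blocks):
        # A redundancy group: m data blocks plus one XOR parity block (k = 1).
        return data_blocks + [xor_blocks(data_blocks)]

    def recover(group, lost):
        # XOR of the surviving n - 1 = m blocks rebuilds the missing one.
        return xor_blocks([b for i, b in enumerate(group) if i != lost])

    m = 4
    data = [bytes([i]) * 8 for i in range(m)]   # four 8-byte reliability blocks
    group = make_group(data)                    # n = m + k = 5 blocks, on 5 disks
    assert recover(group, 2) == group[2]        # any single loss is recoverable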
2. Disk Failures

Storage systems must deal with defects and failures in rotating magnetic media. We can distinguish three principal failure modes that affect disks: block failures, also known as media failures; device failures; and data corruption, or unnoticed read errors.

On a modern disk drive, each block consisting of one or more 512 B sectors contains error correcting code (ECC) information that is used to correct errors during a read. If there are multiple errors within a block, the ECC can either successfully correct them, flag the read as unsuccessful, or in the worst case, mis-correct them. The latter is an instance of data corruption, which fortunately is extremely rare. If the ECC is unable to correct the error, a modern disk retries the read several times. If a retry is successful, the error is considered a soft error; otherwise, the error is hard. Failures to read data may reflect physical damage to the disk or result from faulty disk operations. For example, if a disk suffers a significant physical shock during a write operation, the head might swing off the track and thus destroy data in a number of sectors adjacent to the one being written.

2.1. Block Failures

Many block defects are the result of imperfections in the machining of the disk substrate, non-uniformity in the magnetic coating, and contaminants within the head disk assembly. These manufacturing defects cause repeatable hard errors detected during read operations. Disk drive manufacturers attempt to detect and compensate for these defects during a manufacturing process called self-scan. Worst-case data patterns are written and read across all surfaces of the disk drive to form a map of known defects called the P-List. After the defect map is created (during self-scan), the drive is then formatted in a way that eliminates each defect from the set of logical blocks presented to the user. These manufacturing defects are said to be "mapped-out" and should not affect the function of the drive.

Over the lifetime of a disk, additional defects can be encountered. Additional spare sectors are reserved at the end of each track or cylinder for these grown defects. A special grown defect list, or G-List, maps the grown defects and complements the P-List. In normal operation, the user should rarely encounter repeatable errors. An increase in repeatable errors, and subsequent grown defects in the G-List, can indicate a systemic problem in the heads, media, servo or recording channel. Self-Monitoring, Analysis and Reporting Technology (SMART) [13] allows users to monitor these error rates.

Note that a drive only detects errors after reading the affected block, so an error can lie hidden in an unread block. Therefore, scrubbing and subsequent reconstruction of failed blocks is necessary to ensure long-term data viability.

2.2. Disk Failure Rates

Some block failures are correlated. For example, if a particle left over from the manufacturing process remains within the disk drive, then this particle will be periodically caught between a head and a surface, scratching the surface and damaging the head [3]. The user would observe a burst of hard read failures, followed eventually by the failure of the drive. However, most block failures are not related, so we can model block failures with a constant failure rate λbf.

Device failure rates are specified by disk drive manufacturers as MTBF values. The actual observed values depend in practice heavily on operating conditions that are frequently worse than the manufacturers' implicit assumptions. Block failure rates are published as failures per number of bits read. A typical value of 1 in 10^14 bits for a commodity ATA drive means that if we access data under normal circumstances at a rate of 10 TB per year, we should expect one uncorrectable error per year.

Block errors may occur even if we do not access the disk. Lack of measurements, lack of openness, and most importantly, lack of a tested failure model make conjectures on how to translate the standard errors-per-bits-read rate into a failure rate per year hazardous. Based on interviews with data storage professionals, we propose the following model: we assume that for server drives, about 1/3 of all field returns are due to hard errors. RAID system users, who buy 90% of the server disks, do not return drives with hard errors. Thus, 10% of the disks sold account for one third of the hard errors reported. Therefore, the mean-time-to-block-failure is 3/10 times the MTBF of all disk failures and the mean-time-to-disk-failure is 3/2 of the MTBF. If a drive is rated at 1 million hours, then the mean-time-to-block-failure is 3 × 10^5 hours and the mean-time-to-disk-failure is 1.5 × 10^6 hours. Thus, we propose that block failure rates are about five times higher than disk failure rates.
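The numbers in this model can be checked mechanically. The short script below (an editorial aid; the inputs are the assumptions stated above, not measurements) reproduces both the five-to-one ratio and the one-error-per-year estimate:

    mtbf = 1e6                     # rated MTBF in hours
    mttbf = 3 / 10 * mtbf          # mean-time-to-block-failure: 3 x 10^5 hours
    mttdf = 3 / 2 * mtbf           # mean-time-to-disk-failure: 1.5 x 10^6 hours
    print(mttdf / mttbf)           # 5.0 -> block failures ~5x more frequent

    bits_read = 10 * 8e12          # 10 TB read per year, expressed in bits
    print(bits_read / 1e14)        # 0.8 -> roughly one uncorrectable error per
                                   # year at 1 error per 10^14 bits read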
3. System Overview

We investigate the reliability of a very large archival storage system that uses disks. We further assume that disks are powered down when they are not serving requests, as in a MAID [2]. Disk drives are not built for high reliability [1], and any storage system containing many disks needs to detect failures actively.
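The opportunistic policy from the abstract, scrubbing a disk only when it is already powered on for a normal request, can be sketched as follows (an editorial sketch; the interval and names are assumptions, not the paper's parameters):

    SCRUB_INTERVAL_HOURS = 24 * 14              # assumed policy threshold

    class Disk:
        def __init__(self):
            self.last_scrub = 0.0               # hours since system start

        def scrub(self, now):
            # Read every block, check signatures, and rebuild failed blocks
            # from the redundancy group (details elided in this sketch).
            self.last_scrub = now

    def serve_request(disk, now, io):
        io(disk)                                # the disk is powered on anyway
        if now - disk.last_scrub >= SCRUB_INTERVAL_HOURS:
            disk.scrub(now)                     # piggyback a scrub pass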
Recommended publications
  • An Analysis of Data Corruption in the Storage Stack
    An Analysis of Data Corruption in the Storage Stack
    Lakshmi N. Bairavasundaram∗, Garth R. Goodson†, Bianca Schroeder‡, Andrea C. Arpaci-Dusseau∗, Remzi H. Arpaci-Dusseau∗
    ∗University of Wisconsin-Madison, †Network Appliance, Inc., ‡University of Toronto
    {laksh, dusseau, remzi}@cs.wisc.edu, [email protected], [email protected]

    Abstract: An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this paper, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most. We find more than 400,000 instances of checksum mismatches over the 41-month period.

    ... latent sector errors, within disk drives [18]. Latent sector errors are detected by a drive's internal error-correcting codes (ECC) and are reported to the storage system. Less well-known, however, is that current hard drives and controllers consist of hundreds of thousands of lines of low-level firmware code. This firmware code, along with higher-level system software, has the potential for harboring bugs that can cause a more insidious type of disk error – silent data corruption, where the data is silently corrupted with no indication from the drive that an error has occurred. Silent data corruptions could lead to data loss more often than latent sector errors, since, unlike latent sector errors,
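    The checksum mismatches the study counts come from a simple end-to-end mechanism: the storage system keeps a checksum next to each block and re-verifies it on every read, so corruption the drive never reported is still caught. A minimal sketch (illustrative names, not the paper's code):

        import zlib

        def write_block(store, addr, data):
            store[addr] = (data, zlib.crc32(data))     # block plus its checksum

        def read_block(store, addr):
            data, crc = store[addr]
            if zlib.crc32(data) != crc:                # drive reported no error,
                raise IOError(f"checksum mismatch at block {addr}")  # yet data changed
            return data

        store = {}
        write_block(store, 7, b"payload")
        store[7] = (b"paYload", store[7][1])           # simulate silent corruption
        try:
            read_block(store, 7)
        except IOError as e:
            print(e)                                   # checksum mismatch at block 7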
  • Data & Computer Recovery Guidelines
    Data & Computer Recovery Guidelines

    This document contains general guidelines for restoring computer operations following certain types of disasters. It should be noted these guidelines will not fit every type of disaster or every organization, and you may need to seek outside help to recover and restore your operations. This document is divided into five parts. The first part provides general guidelines which are independent of the type of disaster; the next three sections deal with issues surrounding specific disaster types (flood/water damage, power surge, and physical damage). The final section deals with general recommendations to prepare for the next disaster.

    General Guidelines

    These are general guidelines for recovering after any type of disaster or computer failure. If you have a disaster recovery plan, then you should be prepared; however, there may be things that were not covered to help you recover. This section is divided into two sections (computer system recovery, data recovery).

    Computer System Recovery

    The first step is to get your physical computer systems running again.

    2. Your first step is to restore the computing equipment. If you do try to power on the existing equipment, it is best to remove the hard drive(s) first to make sure the system will power on. Once you have determined the system powers on, you can reinstall the hard drive and power the system back on. Hopefully, everything works at that point. Note: this should not be tried in the case of water or extreme heat damage.

    3. If the computer will not power on then you can either try to fix the computer or in many cases it is easier,
  • Data Remanence in Non-Volatile Semiconductor Memory (Part I)
    Data remanence in non-volatile semiconductor memory (Part I)
    Sergei Skorobogatov, Security Group
    Web: www.cl.cam.ac.uk/~sps32/  Email: [email protected]

    Introduction

    Data remanence is the residual physical representation of data that has been erased or overwritten. In non-volatile programmable devices, such as UV EPROM, EEPROM or Flash, bits are stored as charge in the floating gate of a transistor. After each erase operation, some of this charge remains. It shifts the threshold voltage (VTH) of the transistor, which can be detected by the sense amplifier while reading data.

    Microcontrollers use a 'protection fuse' bit that restricts unauthorized access to on-chip memory if activated. Very often, this fuse is embedded in the main memory array. In this case, it is erased simultaneously with the memory. Better protection can be achieved if the fuse is located close to the memory but has a separate control circuit. This allows it to be permanently monitored as well as hardware protected from being erased too early, thus making sure that by the time the fuse is reset no data is left inside the memory.

    [Figure: structure, cross-section and operation modes for different memory types (UV EPROM, EEPROM, Flash EEPROM)]

    In some smartcards and microcontrollers, a password-protected bootloader restricts firmware updates and data access to authorized users only. Usually, the on-chip operating system erases both code and data memory before uploading new code, thus preventing any new application from accessing previously stored secrets. How much residual charge is left inside the memory cells after a standard erase operation? Is it possible to recover data
  • DATABridge ETL Solution Datasheet
    DATASHEET

    Extract and Transform MCP Host Data for Improved Analysis and Decision Support

    Fast, well-informed business decisions require access to your organization's key performance indicators residing on critical database systems. But the prospect of exposing those systems inevitably raises concerns around security, data integrity, cost, and performance. For organizations using the Unisys ClearPath MCP server and its non-relational DMSII database, there's an additional challenge: most business intelligence tools support only relational databases.

    The Only True ETL Solution for DMSII Data

    That's why businesses like yours are turning to Attachmate® DATABridge™. It's the only true Extract, Transform, Load (ETL) solution that securely integrates Unisys MCP DMSII and non-DMSII data into a secondary system. With DATABridge, you can easily integrate production data into a relational database or another DMSII database located on an entirely different Unisys host system. And because DATABridge clients for Oracle and Microsoft SQL Server support a breadth of operating environments (including Windows 7, Windows Server 2012, Windows Server 2008, UNIX, AIX, SUSE Linux, and Red Hat Linux), DATABridge solutions fit seamlessly into your existing infrastructure.

    KEY FEATURES
    • Client configuration tool for easy customization of table layout.
    • Dynamic before-and-after images (BI-AI) based on key change.
    • 64-bit clients.
    • Client-side management console.
    • Ability to run the client as a service or a daemon.
    • Multi-threaded clients to increase processing speed.
    • Support for Windows Server 2012.
    • Secure automation of Unisys MCP data replication.
    • Seamless integration of both DMSII and non-DMSII data with Oracle, Microsoft SQL, and other relational databases.
  • Database Analyst II
    Recruitment No.: 20.186  Date Opened: 5/25/2021

    DATABASE ANALYST II

    SALARY: $5,794 to $8,153 monthly (26 pay periods annually)
    FINAL FILING DATE: We are accepting applications until closing at 5 pm, June 8, 2021
    IT IS MANDATORY THAT YOU COMPLETE THE SUPPLEMENTAL QUESTIONNAIRE. YOUR APPLICATION WILL BE REJECTED IF YOU DO NOT PROVIDE ALL NECESSARY INFORMATION.

    THE POSITION

    The Human Resources Department is accepting applications for the position of Database Analyst II. The current opening is for a limited term, benefitted and full-time position in the Information Technology department, but the list may be utilized to fill future regular and full-time vacancies for the duration of the list. The term length for the current vacancy is not guaranteed but cannot exceed 36 months. The normal work schedule is Monday through Friday, 8 am to 5 pm; a flex schedule may be available.

    The Information Technology department is looking for a full-time, limited-term Database Analyst I/II to develop and manage the City's Open Data platform. Initiatives include tracking city council goals, presenting data related to capital improvement projects, and measuring budget performance. This position is in the Data Intelligence Division. Our team sees data as more than rows and columns; it tells stories that yield invaluable insights that help us solve problems, make better decisions, and create solutions. This position is responsible for building and maintaining systems that unlock the power of data. The successful candidate will be able to create data analytics & business
  • Increasing Reliability and Fault Tolerance of a Secure Distributed Cloud Storage
    Increasing reliability and fault tolerance of a secure distributed cloud storage
    Nikolay Kucherov1, Mikhail Babenko1, Andrei Tchernykh2, Viktor Kuchukov1 and Irina Vashchenko1
    1North-Caucasus Federal University, Stavropol, Russia
    2CICESE Research Center, Ensenada, Mexico
    E-mail: [email protected], [email protected], [email protected], [email protected], [email protected]

    Abstract. The work develops the architecture of a multi-cloud data storage system based on the principles of modular arithmetic. This modification of the data storage system increases the reliability of data storage and the fault tolerance of the cloud system. To increase fault tolerance, adaptive data redistribution between the available servers is applied, which is possible thanks to the introduction of additional redundancy. This model allows stored data to be restored in case of failure of one or more cloud servers. It is shown how the proposed scheme lets you tune reliability and redundancy, and reduce the overhead costs of data storage, by adapting the parameters of the residue number system.

    1. Introduction

    Currently, cloud services such as Google, Amazon, Dropbox, and Microsoft OneDrive, which provide cloud storage and data processing services, are gaining high popularity. The main reason for using cloud products is the convenience and accessibility of the services offered. Thanks to the use of cloud technologies, it is possible to save the financial costs of maintaining servers for storing and securing information. All problems arising during the storage and processing of data are transferred to the cloud provider [1]. Distributed infrastructure represents the conditions in which competition for resources between high-priority computing tasks of data analysis occurs regularly [2].
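    To see how residue-number-system redundancy can survive server failures, here is a minimal sketch (moduli and names invented for illustration, not the paper's parameters): a value is stored as one residue per cloud server, and adding r redundant moduli tolerates the loss of any r residues.

        from math import prod

        MODULI = [7, 11, 13, 17, 19]    # pairwise coprime; first 3 carry data
        R = 2                           # last 2 moduli are redundant
        # Legal data range: 7 * 11 * 13 = 1001 values.

        def encode(x):
            return {m: x % m for m in MODULI}          # one residue per server

        def decode(residues):
            # Chinese Remainder Theorem over the surviving moduli.
            M = prod(residues)
            x = 0
            for m, r in residues.items():
                Mi = M // m
                x += r * Mi * pow(Mi, -1, m)           # modular inverse (Python 3.8+)
            return x % M

        value = 987
        shares = encode(value)
        del shares[11], shares[19]                     # two cloud servers fail
        assert decode(shares) == value                 # data still recoverable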
  • Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives
    Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid-State Drives
    Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu

    Abstract—NAND flash memory is ubiquitous in everyday life today because its capacity has continuously increased and cost has continuously decreased over decades. This positive growth is a result of two key trends: (1) effective process technology scaling, and (2) multi-level (e.g., MLC, TLC) cell data coding. Unfortunately, the reliability of raw data stored in flash memory has also continued to become more difficult to ensure, because these two trends lead to (1) fewer electrons in the flash memory cell (floating gate) to represent the data and (2) larger cell-to-cell interference and disturbance effects. Without mitigation, worsening reliability can reduce the lifetime of NAND flash memory. As a result, flash memory controllers in solid-state drives (SSDs) have become

    The transistor traps charge within its floating gate, which dictates the threshold voltage level at which the transistor turns on. The threshold voltage level of the floating gate is used to determine the value of the digital data stored inside the transistor. When the manufacturing process scales down to a smaller technology node, the size of each flash memory cell, and thus the size of the transistor, decreases, which in turn reduces the amount of charge that can be trapped within the floating gate. Thus, process scaling increases storage density by enabling more cells to be placed in a given area, but it also causes reliability issues, which are the focus of this article.
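    The threshold-voltage encoding described in this excerpt can be illustrated with a toy read operation: a multi-level cell's analog threshold voltage is compared against reference voltages to pick one of four 2-bit values. The voltages and the Gray-coded assignment below are invented for the example, not device parameters:

        REFS = [1.0, 2.0, 3.0]               # assumed read reference voltages (V)
        LEVELS = ["11", "10", "00", "01"]    # one common Gray-coded assignment

        def read_cell(vth):
            # Count how many reference voltages the cell's threshold exceeds.
            level = sum(vth >= r for r in REFS)
            return LEVELS[level]

        print(read_cell(0.4))    # "11" (erased state, least charge)
        print(read_cell(2.5))    # "00"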
  • EEPROM Emulation
    EEPROM Emulation (AN0019 Application Note)

    Introduction

    This application note demonstrates a way to use the flash memory of the EFM32 to emulate single variable rewritable EEPROM memory through software. The example API provided enables reading and writing of single variables to non-volatile flash memory. The erase-rewrite algorithm distributes page erases and thereby performs wear leveling.

    This application note includes:
    • This PDF document
    • Source files (zip)
    • Example C-code
    • Multiple IDE projects

    1 General Theory

    1.1 EEPROM and Flash Based Memory

    EEPROM stands for Electrically Erasable Programmable Read-Only Memory and is a type of non-volatile memory that is byte erasable and therefore often used to store small amounts of data that must be saved when power is removed. The EFM32 microcontrollers do not include an embedded EEPROM module for byte erasable non-volatile storage, but all EFM32s do provide flash memory for non-volatile data storage. The main difference between flash memory and EEPROM is the erasable unit size. Flash memory is block-erasable, which means that bytes cannot be erased individually; instead a block consisting of several bytes needs to be erased at the same time. Through software, however, it is possible to emulate individually erasable rewritable byte memory using block-erasable flash memory. To provide EEPROM functionality for the EFM32s in an application, there are at least two options available. The first one is to include an external EEPROM module when designing the hardware layout of the application.
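    The erase-rewrite idea can be modeled in a few lines: variable writes are appended to the active block-erasable page, and when it fills up, only the newest record of each variable is copied to the other page before the full page is erased, spreading erases across pages. A toy model follows (an editorial sketch in Python, not the application note's C API):

        PAGE_SIZE = 8                              # records per page (assumed)

        class EepromEmulation:
            def __init__(self):
                self.pages = [[], []]              # two block-erasable flash pages
                self.active = 0

            def write(self, var_id, value):
                if len(self.pages[self.active]) == PAGE_SIZE:
                    latest = dict(self.pages[self.active])   # newest record wins
                    self.active ^= 1                         # switch pages
                    self.pages[self.active] = list(latest.items())
                    self.pages[self.active ^ 1] = []         # block-erase old page
                self.pages[self.active].append((var_id, value))

            def read(self, var_id):
                return dict(self.pages[self.active])[var_id]

        ee = EepromEmulation()
        for i in range(20):
            ee.write("counter", i)                 # rewrites without byte erase
        assert ee.read("counter") == 19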
  • PROTECTING DATA from RANSOMWARE and OTHER DATA LOSS EVENTS a Guide for Managed Service Providers to Conduct, Maintain and Test Backup Files
    PROTECTING DATA FROM RANSOMWARE AND OTHER DATA LOSS EVENTS
    A Guide for Managed Service Providers to Conduct, Maintain and Test Backup Files

    OVERVIEW

    The National Cybersecurity Center of Excellence (NCCoE) at the National Institute of Standards and Technology (NIST) developed this publication to help managed service providers (MSPs) improve their cybersecurity and the cybersecurity of their customers. MSPs have become an attractive target for cyber criminals. When an MSP is vulnerable its customers are vulnerable as well. Often, attacks take the form of ransomware. Data loss incidents—whether a ransomware attack, hardware failure, or accidental or intentional data destruction—can have catastrophic effects on MSPs and their customers. This document provides recommendations to help MSPs conduct, maintain, and test backup files in order to reduce the impact of these data loss incidents. A backup file is a copy of files and programs made to facilitate recovery. The recommendations support practical, effective, and efficient back-up plans that address the NIST Cybersecurity Framework Subcategory PR.IP-4: Backups of information are conducted, maintained, and tested. An organization does not need to adopt all of the recommendations, only those applicable to its unique needs. This document provides a broad set of recommendations to help an MSP determine:
    • items to consider when planning backups and buying a backup service/product
    • issues to consider to maximize the chance that the backup files are useful and available when needed
    • issues to consider regarding business disaster recovery

    CHALLENGE: Backup systems implemented and not tested or planned increase operational risk for MSPs.

    APPROACH: NIST Interagency Report 7621 Rev. 1, Small Business
  • Active Storage Activeraid ES Data Sheet
    RAID Storage Evolved

    Welcome to RAID Storage Evolved

    Introducing the ActiveRAID ES — the affordable high performance RAID storage solution for Apple® users. The ActiveRAID ES was designed using the same principles of the original ActiveRAID, but represents a previously unobtainable level of value by providing the capabilities of the ActiveRAID in a lower cost configuration that does not sacrifice the performance, reliability or ease of management of the original ActiveRAID. The ActiveRAID ES was carefully designed for use in business critical applications where organizations require a high level of data integrity with ease of installation and management, but not the 24/7 365 day duty cycles of a large enterprise or Post-Production or Broadcast facility.

    The ActiveRAID ES provides redundant power, redundant cooling and redundant Fibre Channel connections, and the same powerful, yet easy to use management suite of the original ActiveRAID. The ES differs from the original ActiveRAID in that it has a single high performance RAID controller and uses business class hard drives with a smaller cache size. This results in a very robust RAID solution ideal for business class applications. Performance of the ActiveRAID ES is impressive at up to 774 MB/s*† throughput and will easily handle most Mac OS X Server applications. The ActiveRAID ES is the ideal storage for business class applications such as File serving, database services, web hosting, calendar, Wiki, Mail, iChat and much more.

    Key Features
    • Optimized RAID performance in Apple Mac OS X® Applications
    • Modern Linux RAID kernel and RAID controller design
    • Proprietary self-optimizing architecture
    • Native Mac OS X Storage management suite
    • Dashboard Widget monitoring
    • Enterprise performance and reliability without Enterprise complexity
    • No long Quick Start guides to memorize
    • Performance profiling and statistical planning tools
  • Algorithms for Data Cleaning in Knowledge Bases
    ALGORITHMS FOR DATA CLEANING IN KNOWLEDGE BASES
    Adeel Ashraf1, Sarah Ilyas1, Khawaja Ubaid Ur Rehman1 and Shakeel Ahmad2
    1Department of Computer Science, University of Management and Technology, Punjab, Pakistan
    2Department of Computer Sciences, Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah
    Email: [email protected]
    VFAST Transactions on Software Engineering, http://vfast.org/journals/index.php/VTSE, 2018, ISSN(e): 2309-6519, ISSN(p): 2411-6327, Volume 15, Number 2, May-August 2018, pp. 78-83

    ABSTRACT. Data cleaning is the process of identifying and correcting inconsistencies and errors in a data warehouse; the term data scrubbing is used interchangeably. Data scrubbing is used to obtain high-quality data and is one step of the data ETL (extraction, transformation and loading) tools. Today there is a need for authentic information for better decision-making, so we conducted a review in which six papers related to data cleaning are examined. The reviewed papers discuss different algorithms, methods, problems, solutions and approaches. Each paper has its own method of solving a problem efficiently, but all of them address the common problem of data cleaning and inconsistencies. These papers discuss data inconsistencies, identification of errors, and conflicting and duplicate records in detail, and also provide solutions. These algorithms increase the quality of data.
  • i2WAP: Improving Non-Volatile Cache Lifetime by Reducing Inter- and Intra-Set Write Variations
    i2WAP: Improving Non-Volatile Cache Lifetime by Reducing Inter- and Intra-Set Write Variations
    Jue Wang1, Xiangyu Dong2, Yuan Xie1,3, Norman P. Jouppi4
    1Pennsylvania State University, 2Qualcomm Technology, Inc., 3AMD Research, 4Hewlett-Packard Labs
    1{jzw175,yuanxie}@cse.psu.edu, [email protected], [email protected], [email protected]

    Abstract

    Modern computers require large on-chip caches, but the scalability of traditional SRAM and eDRAM caches is constrained by leakage and cell density. Emerging non-volatile memory (NVM) is a promising alternative to build large on-chip caches. However, limited write endurance is a common problem for non-volatile memory technologies. In addition, today's cache management might result in unbalanced write traffic to cache blocks, causing heavily-written cache blocks to fail much earlier than others. Unfortunately, existing wear-leveling techniques for NVM-based main memories cannot be simply applied to NVM-based on-chip caches because cache writes have intra-set variations as well as inter-set variations.

    ... standby power, better scalability, and non-volatility. For example, among many ReRAM prototype demonstrations [11, 14, 21–25], a 4 Mb ReRAM macro [21] can achieve a cell size of 9.5F² (15X denser than SRAM) and a random read/write latency of 7.2 ns (comparable to SRAM caches with similar capacity). Although its write access energy is usually on the order of 10 pJ per bit (10X of SRAM write energy, 5X of DRAM), the actual energy saving comes from its non-volatility property. Non-volatility can eliminate the standby leakage energy, which can be as high as 80% of the total energy consumption for an SRAM L2 cache [10]. Given the potential energy and cost saving opportunities (via reducing cell size) from the adoption of non-volatile technologies, replacing SRAM and eDRAM with them can
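    The two kinds of variation the title refers to can be quantified as coefficients of variation over a per-line write-count matrix (rows are cache sets, columns are ways). The formulas below follow the common definition and are an editorial assumption, not copied from the paper:

        import statistics as st

        def inter_set_variation(counts):
            # Variation of the average write count across sets.
            means = [st.mean(row) for row in counts]
            return st.pstdev(means) / st.mean(means)

        def intra_set_variation(counts):
            # Average, over sets, of the variation across ways within a set.
            return st.mean(st.pstdev(row) / st.mean(row) for row in counts)

        writes = [[100, 90, 110, 100],    # a balanced set
                  [400, 10, 10, 10]]      # a set with one heavily written line
        print(inter_set_variation(writes))   # ~0.04: sets carry similar load
        print(intra_set_variation(writes))   # ~0.82: the hot line skews set 1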