Architectural Techniques for Improving NAND Reliability

Yixin Luo

CMU-CS-18-101 March 2018

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Onur Mutlu, Chair
Phillip B. Gibbons
James C. Hoe
Yu Cai, SK Hynix
Erich F. Haratsch, Seagate Technology

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2018 Yixin Luo

This research was sponsored by Samsung, Qualcomm, AMD, the Oracle Software Engineering Innovation Foundation, Facebook, and the National Science Foundation under grant numbers CCF-0953246, CNS-1065112, CNS-1320531, and CNS-1409723. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity.

Keywords: Flash Memory, NAND Flash Memory, 3D NAND Flash Memory, Solid-state Drives (SSD), Nonvolatile Memory (NVM), Integrated Circuit Reliability, Error Characterization, Error Mitigation, Error Correction, Error Recovery, Data Recovery, Process Variation, Data Retention, Memory Systems, Memory Controllers, Systems, Fault Tolerance, Computer Architecture

Abstract

Over the past decade, NAND flash memory has rapidly grown in popularity within modern computing systems, thanks to its short random access latency, high internal parallelism, low static power consumption, and small form factor. Today, NAND flash memory is widely used as the primary storage medium for smartphones, personal laptops, and data center servers. This growth in flash memory popularity has been sustained by the decreasing cost per bit of NAND flash memory devices over each technology generation, which is due to the increased storage density. The higher density, however, comes at a cost of reduced storage reliability. Unreliable primary data storage could lead to permanent loss of valuable data. Thus, we must improve NAND flash memory reliability to prevent data loss.

Existing techniques to keep NAND flash memory reliable are costly. For example, a strong error correcting code (ECC), such as low-density parity-check (LDPC) code, is typically used to tolerate up to a relatively high raw bit error rate (e.g., between 10⁻³ and 10⁻²) from the flash memory. However, such ECC requires significant redundancy, high latency, and high area overhead in today's designs. To tolerate more errors in future generations of NAND flash memories, a stronger ECC is needed, which requires an even larger amount of data redundancy, latency, and area overhead than the ECC used in today's flash memory. Such ECC is not only very costly, but it also may not be as effective as other novel techniques that are specialized for different error types.

Our goal in this dissertation is to greatly improve flash memory reliability at low cost. We identify three opportunities to improve the cost-efficiency of flash reliability enhancement techniques. First, we can adapt the flash controller to various NAND flash memory error characteristics, or even to the error characteristics of each individual flash chip. Second, we can adapt the flash controller to how the host uses the NAND flash memory, e.g., application access patterns and environmental temperature. Third, the flash chips are typically managed by a powerful controller within the Solid-State Drive (SSD). This powerful computing resource is underutilized when the SSD is idle or when the workload has low access intensity. We can use the flash controller to optimize flash reliability in the background without capacity or performance loss.

To exploit these opportunities in improving flash memory reliability, the main thesis of our approach is to specialize the flash controller algorithms to the device and workload characteristics, rather than using powerful but expensive generic error tolerance techniques such as ECC. In this dissertation, we (1) develop a new understanding of the NAND flash memory error characteristics and the workload behavior through rigorous experimental characterization, and (2) design smart flash controller algorithms based on this new understanding to improve flash reliability at low cost. To this end, we make four major contributions.

First, we propose a new technique called WARM to improve flash memory lifetime. The key idea is to identify and exploit the write-hotness of the workload in the flash controller in order to improve flash reliability. We show that existing write-hotness agnostic techniques lead to redundant refresh operations that degrade flash lifetime. WARM manages write-hot data and write-cold data differently, and effectively improves flash lifetime with low hardware and performance overhead.
Second, we propose a new framework to learn an online flash channel model that predicts the underlying threshold voltage distribution of each flash chip. The threshold voltage distribution decides the error characteristics of the flash chip and changes over time due to flash memory wearout. We show that existing analytical threshold voltage distribution models are unsuitable for online flash channel modeling, as they are either inaccurate or expensive to compute. We show that Student's t-distribution and the power law distribution can be used to model the static and dynamic (changing) behavior of the threshold voltage distribution with high accuracy and low latency. We also show that a variety of existing techniques can be tuned using this online model to improve flash memory lifetime.

Third, we perform the first detailed, comprehensive characterization and analysis of 3D NAND flash memory errors. Through this analysis, we identify three new error characteristics in 3D NAND flash memory due to its unique structure and cell design. We develop models for two of the new error characteristics that are significant in current-generation 3D NAND flash chips. We develop four new mechanisms within the flash controller to mitigate the three new error types, and thus greatly reduce the error rate, at low cost.

Fourth, we perform the first experimental characterization of the self-recovery effect on 3D NAND flash memory and show that dwell time, i.e., the idle time between write cycles, and temperature significantly impact retention loss speed and program accuracy. We develop a new unified model of these effects, called the Unified self-Recovery and Temperature model (URT). Using this model, we propose a new technique called HeatWatch to mitigate errors due to early retention loss in 3D NAND flash memory. HeatWatch reduces the raw bit error rate by tuning the read reference voltages to the dwell time of the workload and the operating temperature of the flash memory. We show that HeatWatch efficiently tracks the temperature and dwell time of NAND flash memory and greatly mitigates retention errors in 3D NAND flash memory using this information.

Overall, this dissertation (1) deepens the understanding of the error characteristics of both planar and 3D NAND flash memory through rigorous experimental characterization, and (2) develops new flash controller algorithms that improve NAND flash memory reliability (both lifetime and error rate) at low cost by taking advantage of the flash device and workload characteristics that we find based on our new understandings.

Acknowledgments

First of all, I would like to thank my adviser, Onur Mutlu, for always trusting me to do good work, encouraging me whenever a paper gets rejected, giving me enough resources and opportunities to do great research, and pushing me to keep improving my presentation and writing skills.

I am grateful to Saugata Ghose, Yu Cai, and Erich Haratsch for being both my mentors and collaborators. I am grateful to the members of my PhD committee: Phillip Gibbons and James Hoe for their valuable feedback and for making the final steps towards my PhD very smooth. I am grateful to Deb Cavlovich who allowed me to focus on my research by magically solving all other problems.

I am grateful to SAFARI group members who were more than just lab mates. Saugata Ghose was not only a great mentor and collaborator to me who helped me improve in all aspects, but also a great friend who was always available for help. Hongyi Xin was like a big brother to me who was always supportive and gave me kind advice that kept me on track during my early struggles, and was never disappointing to have a fun discussion with about bioinformatics or history. Rachata Ausavarungnirun was a fantastic cook and was always kind to share his delicious food that reminded me of my grandmother's cooking. Donghyuk Lee encouraged me when it was most needed, and taught me his highest work ethic and kindness by example. Justin Meza helped me improve my writing and presentation skills during my early years in a very friendly manner. Chris Fallin was kind and patient enough to share his wisdom and time for insightful research discussions, which helped to improve my research skills. Kevin Chang was a perfect role model for me to follow, who also happened to share similar hobbies with me. Yoongu Kim pushed me to work harder and keep improving my research, and set a high standard for the group in terms of quality of research, writing, and presentation. Vivek Seshadri, Gennady Pekhimenko, and Samira Khan were always eager to share valuable research and career advice. Kevin Hsieh was a great roommate at the PDL Retreat, and was always friendly to chat with. Amirali Boroumand was a nice buddy until he cold-bloodedly left us for nicer weather in California. I also thank other members of the SAFARI group for their assistance and support: Yang Li, Nandita Vijaykumar, Jamie Liu, HanBin Yoon, Ben Jaiyen, Damla Senol, Arash Tavakkol, Minesh Patel, and Jeremie Kim.

During my time at Carnegie Mellon, I met a lot of wonderful people: Haoxian Chen, Yan Gu, Yihan Sun, Yu Zhao, Yong He, Yanzhe Yang, Junchen Jiang, Xuezhi Wang, Abutalib Aghayev, Jin Kyu Kim, Joy Arulraj, Lin Ma, Michael Zhang, Huanchen Zhang, Jinliang Wei, Dominic Chen, Guangshuo Liu, Junyan Zhu, Tianshi Li, Jiyuan Zhang, and many others who helped and supported me in many different ways. I am also grateful to people in the PDL and CALCM groups, for accepting me into their communities and for all the feedback and comments that helped improve my work significantly.

I am grateful to my internship mentors for giving me the opportunities and resources to do research that is not only practical enough to benefit the company but also forward-looking enough to lead to this thesis. At Microsoft Research, I had the privilege to closely work with Jie Liu, Sriram Govindan, Bikash Sharma, Mark Santaniello, and Aman Kansal. At Seagate, I had the privilege to closely work with Erich Haratsch, Ludovic Danjean, Thuy Nguyen, Hakim Alhussien, Sundararajan Sankaranarayanan, Lei Chen, Hongmei Xie, and many others.
I would like to acknowledge the enormous love and support that I received from my friends and family. Firstly, I would like to thank my girlfriend, Fan Yang, for all her understanding and support that made me brave and focused enough to finish my PhD. I would like to thank my friends all over the world: Kaiyu Shen, Zixiao Chen, Siyuan Sun, Haishan Zhu, Yifei Huang, Yuqin Mu, Pengyao Chen, Xuanxuan and her lovely family, and many others who really helped me through my hardest days and brought me a lot of joy whenever I visited.

Lastly, I would like to thank my parents, who raised me to be the person I am today. I thank my mother, Yi, for her encouragement, support, love, and sacrifice. I thank my father, Wenping, for setting a high standard for success.

Contents

Abstract

Acknowledgments

1 Introduction
  1.1 The Problem: The Cost of Flash Reliability
    1.1.1 Flash Reliability Problems
    1.1.2 The Cost of Improving Flash Reliability
  1.2 Related Work
    1.2.1 NAND Flash Memory Error Characterization
    1.2.2 Improving Flash Reliability with Device Awareness
    1.2.3 Improving Flash Reliability with Workload Awareness
    1.2.4 Summary
  1.3 Thesis Statement and Overview
    1.3.1 WARM—Write-hotness Aware Retention Management
    1.3.2 Online Flash Channel Modeling and Its Applications
    1.3.3 3D NAND Flash Memory Error Characterization and Mitigation
    1.3.4 HeatWatch: Self-Recovery and Temperature Aware Retention Error Mitigation
  1.4 Thesis Outline
  1.5 Contributions

2 Basics of Modern SSDs and NAND Flash Memory
  2.1 State-of-the-Art SSD Architecture
    2.1.1 Flash Memory Organization
    2.1.2 Memory Channel
    2.1.3 SSD Controller
    2.1.4 Design Tradeoffs for Reliability
  2.2 NAND Flash Memory Basics
    2.2.1 Storing Data in a Flash Cell
    2.2.2 Flash Block Design
    2.2.3 Read Operation
    2.2.4 Program and Erase Operations

3 Flash Memory Reliability: Background and Related Work
  3.1 NAND Flash Memory Error Characteristics
    3.1.1 P/E Cycling Errors
    3.1.2 Program Errors
    3.1.3 Cell-to-Cell Program Interference Errors
    3.1.4 Data Retention Errors
    3.1.5 Read Disturb Errors
    3.1.6 Self-Recovery Effect
    3.1.7 Large-Scale Studies on SSD Errors
  3.2 Error Mitigation
    3.2.1 Shadow Program Sequencing
    3.2.2 Neighbor-Cell Assisted Error Correction
    3.2.3 Refresh Mechanisms
    3.2.4 Read-Retry
    3.2.5 Voltage Optimization
    3.2.6 Hot Data Management
    3.2.7 Adaptive Error Mitigation Mechanisms
  3.3 Error Correction and Data Recovery Techniques
    3.3.1 Error-Correcting Codes Used in SSDs
    3.3.2 Error Correction Flow
    3.3.3 BCH and LDPC Error Correction Strength
    3.3.4 SSD Data Recovery
  3.4 Emerging Reliability Issues for 3D NAND Flash Memory
    3.4.1 3D NAND Flash Design and Operation
    3.4.2 Errors in 3D NAND Flash Memory
    3.4.3 Changes in Error Mitigation for 3D NAND Flash Memory
  3.5 Similar Errors in Other Memory Technologies
    3.5.1 Cell-to-Cell Interference Errors in DRAM
    3.5.2 Data Retention Errors in DRAM
    3.5.3 Read Disturb Errors in DRAM
    3.5.4 Large-Scale DRAM Error Studies
    3.5.5 Latency-Related Errors in DRAM
    3.5.6 Error Correction in DRAM
    3.5.7 Errors in Emerging Nonvolatile Memory Technologies

4 WARM—Write-hotness Aware Retention Management
  4.1 Motivation
    4.1.1 Retention Time Relaxation
    4.1.2 Refresh Overhead Mitigation
    4.1.3 Opportunities to Exploit Write-Hotness
  4.2 Mechanism
    4.2.1 Partitioning Data Using Write-Hotness
    4.2.2 Flash Management Policies
    4.2.3 Implementation and Overheads

  4.3 Methodology
  4.4 Evaluations
    4.4.1 Hot Pool and Cooldown Window Sizes
    4.4.2 Lifetime Improvement
    4.4.3 Improvement in Endurance Capacity
    4.4.4 Reduction of Refresh Operations
    4.4.5 Impact on Performance
    4.4.6 Sensitivity Studies
  4.5 Limitations
  4.6 Conclusion

5 Online Flash Channel Modeling and Its Applications
  5.1 Motivation
  5.2 Characterization Methodology
  5.3 Static Distribution Model
    5.3.1 Gaussian-based Model
    5.3.2 Normal-Laplace-based Model
    5.3.3 Student's t-based Model
    5.3.4 Model Validation and Comparison
  5.4 Dynamic Modeling
    5.4.1 Static Model Trends Over P/E Cycles
    5.4.2 Power Law-based Model
    5.4.3 Model Validation
  5.5 Example Applications
    5.5.1 Raw Bit Error Rate Estimation
    5.5.2 Optimal Read Reference Voltage Prediction
    5.5.3 Expected Lifetime Estimation
    5.5.4 Soft Information Estimation for LDPC Codes
    5.5.5 Improving Flash Performance
  5.6 Related Work
  5.7 Limitations
  5.8 Conclusion

6 3D NAND Flash Memory Error Characterization and Mitigation
  6.1 3D NAND Error Characterization Overview
    6.1.1 Methodology
  6.2 Key Characterization Results
    6.2.1 Layer-to-Layer Process Variation
    6.2.2 Early Retention Loss
    6.2.3 Retention Interference
    6.2.4 Summary
  6.3 Comprehensive Characterization Results
    6.3.1 Write-Induced Errors
    6.3.2 Early Retention Loss

    6.3.3 Read-Induced Errors
    6.3.4 Layer-To-Layer Process Variation
    6.3.5 Bitline-to-Bitline Process Variation
  6.4 3D NAND Error Models
    6.4.1 Process Variation Model
    6.4.2 Retention Loss Model
  6.5 3D NAND Error Mitigation Techniques
    6.5.1 LaVAR: Layer Variation Aware Reading
    6.5.2 LI-RAID: Layer-Interleaved RAID
    6.5.3 ReMAR: Retention Model Aware Reading
    6.5.4 ReNAC: Mitigating Retention Interference
    6.5.5 Implications on Systems Reliability
  6.6 Limitations
  6.7 Conclusion

7 HeatWatch: Self-Recovery and Temperature Aware Retention Error Mitigation
  7.1 Characterizing the Self-Recovery Effect
    7.1.1 Characterization Methodology
    7.1.2 Characterizing the Dwell Time Effect
    7.1.3 Characterizing the Temperature Effect
    7.1.4 Characterizing the Recovery Cycle Effect
    7.1.5 Summary of Key Observations
  7.2 Self-Recovery Effect Modeling
    7.2.1 Program Variation Component
    7.2.2 Effective Retention/Dwell Time Component
    7.2.3 Self-Recovery and Retention Component
  7.3 Improving 3D NAND Reliability
    7.3.1 Observations
    7.3.2 HeatWatch Mechanism
    7.3.3 Evaluation
  7.4 Related Work
  7.5 Limitations
  7.6 Conclusion

8 System-Level Implications and Lessons Learned
  8.1 System-Level Implications
    8.1.1 Impact on Tolerable Write Frequency
    8.1.2 Impact on ECC Cost
    8.1.3 Impact on Performance and Flash Management Policies
  8.2 Lessons Learned
    8.2.1 Combining Large-Scale and Small-Scale Characterization Studies
    8.2.2 Improve Systems Reliability Rather Than Device Reliability Alone

9 Conclusions

10 Future Research Directions
  10.1 Temperature Effects on Read Operations
  10.2 SSD Errors At Scale
    10.2.1 3D NAND Errors In the Field
    10.2.2 Predicting and Preventing SSD Failures
    10.2.3 Tolerating Reliability Variation Across SSDs
  10.3 Enabling Cold Storage in SSDs
    10.3.1 Identifying Suitable Data for SSD Cold Storage
    10.3.2 Increasing SSD Retention Time
    10.3.3 Increasing SSD Capacity

Other Works of This Author

Bibliography

List of Figures

1.1 (a) Flash reliability (i.e., P/E cycle lifetime and raw bit error rate) and (b) ECC redundancy for each NAND flash memory technology node (for single-level cell, i.e., SLC devices for 90 nm to 72 nm technology nodes; for multi-level cell, i.e., MLC devices for 50 nm technology node; for triple-level cell, i.e., TLC devices for 32 nm to 20 nm technology nodes)
1.2 Threshold voltage distribution for MLC NAND flash memory
1.3 Shifted threshold voltage distribution for MLC NAND flash memory

2.1 (a) SSD system architecture, showing controller (Ctrl) and chips. (b) Detailed view of connections between controller components and chips
2.2 Flash memory organization
2.3 Data path protection employed within the controller
2.4 Example layout of ECC codewords, logical blocks, and superpage-level parity for superpage n in superblock m. In this example, we assume that a logical block contains two codewords
2.5 Relationship between write amplification (WA) and the overprovisioning factor (OP)
2.6 Flash cell (i.e., floating gate transistor) cross section
2.7 Threshold voltage distribution of MLC (top) and TLC (bottom) NAND flash memory
2.8 Internal organization of a flash block
2.9 Voltages applied to flash cell transistors on a bitline to perform (a) read, (b) program, and (c) erase operations
2.10 Two-step programming algorithm for MLC flash
2.11 Foggy-fine programming algorithm for TLC flash

3.1 Pictorial depiction of errors accumulating within a NAND flash block as P/E cycle count increases
3.2 Threshold voltage distribution shifts and widening can cause the distributions of two neighboring states to overlap with each other (compare to Figure 2.7), leading to read errors
3.3 Impact of program errors during two-step programming on cell threshold voltage distribution
3.4 Immediately-adjacent cells that can induce program interference on a victim cell that is on wordline N and bitline M

3.5 Distribution of uncorrectable errors across SSDs used in Facebook's data centers
3.6 Pictorial and abstract depiction of the pattern of SSD failure rates observed in real SSDs operating in a modern data center. An SSD fails at different rates during distinct periods throughout the SSD lifetime
3.7 SSD failure rate vs. the amount of data written to the SSD. The three periods of failure rates, shown pictorially and abstractly in Figure 3.6, are annotated on each graph: (1) early detection, (2) early failure, and (3) useful life/wearout
3.8 SSD failure rate vs. operating temperature
3.9 Order in which the pages of each wordline (WL) are programmed using (a) a bad programming sequence, and using shadow sequencing for (b) MLC and (c) TLC NAND flash. The bold page programming operations for WL1 induce cell-to-cell program interference when WL0 is fully programmed
3.10 Overview of neighbor-cell-assisted error correction (NAC)
3.11 Overview of in-place refresh mechanism for MLC NAND flash memory
3.12 Finding the optimal read reference voltage after the threshold voltage distributions overlap (left), and raw bit error rate as a function of the selected read reference voltage (right)
3.13 Disparity-based read reference voltage approximation to find Vinitial for MLC NAND flash memory. Each circle represents a cell, where a dashed border indicates that the LSB is undetermined, a solid border indicates that the LSB is known, a hollow circle indicates that the MSB is unknown, and a filled circle indicates that the MSB is known
3.14 Dynamic pass-through voltage tuning at different retention ages
3.15 Comparison of space used for user data, overprovisioning, and ECC between a fixed ECC and a multi-rate ECC mechanism
3.16 Illustration of how multi-rate ECC switches to different ECC codewords (i.e., ECCi) as the RBER grows. OPi is the overprovisioning factor used for engine ECCi, and WAi is the resulting write amplification value
3.17 Lifetime improvements of using multi-rate ECC over using a fixed ECC coding rate
3.18 States used when a TLC cell (with 8 states) is downgraded to an MLC cell (with 4 states)
3.19 BCH decoding steps
3.20 Example LDPC code for a seven-bit codeword with a four-bit data message (stored in bits c0, c1, c2, and c3) and three parity check equations (i.e., n = 7, k = 4), represented as (a) an H matrix and (b) a Tanner graph
3.21 LDPC decoding steps for a single level of hard or soft decoding
3.22 (a) Example error correction flow using BCH codes and LDPC codes, with average latency of each BCH/LDPC stage. (b) The corresponding codeword failure rate for each LDPC stage
3.23 LDPC hard decoding and the first two levels of LDPC soft decoding, showing the Vref value added at each level, and the resulting threshold voltage ranges (R0–R3) used for flash cell categorization

3.24 Raw bit error rate versus uncorrectable bit error rate for BCH codes, hard LDPC codes, and soft LDPC codes
3.25 Some retention-prone (P) and retention-resistant (R) cells are incorrectly read after charge leakage due to retention time. RFR identifies and corrects the incorrectly read cells based on their leakage behavior
3.26 Cross section of a charge trap transistor, used as a flash cell in 3D charge trap NAND flash memory
3.27 Organization of flash cells in an M-layer 3D charge trap NAND flash memory chip, where each block consists of M wordlines and N bitlines
3.28 DRAM retention time vs. operating temperature, normalized to the retention time of each DRAM cell at 50 °C
3.29 Negative performance and power consumption effects of refresh in contemporary and future DRAM devices. We expect that as the capacity of each DRAM chip increases, (a) the refresh latency, (b) the DRAM throughput lost during refresh operations, and (c) the power consumed by refresh will all increase
3.30 Cumulative distribution of the number of cells in a DRAM module with a retention time less than the value on the x-axis, plotted for seven different DRAM modules
3.31 RowHammer error rate vs. manufacturing dates of 129 DRAM modules we tested
3.32 Number of victim cells (i.e., number of bit errors) when an aggressor row is repeatedly activated, for three representative DRAM modules from three major manufacturers. We label the modules in the format X_n^yyww, where X is the manufacturer (A, B, or C), yyww is the manufacture year (yy) and week of the year (ww), and n is the number of the selected module
3.33 Distribution of memory errors among servers with errors (a), which resembles a power law distribution. Memory errors follow a Pareto distribution among servers with errors (b)
3.34 Relative failure rate for servers with different chip densities. Higher densities (related to newer technology nodes) show a trend of higher failure rates
3.35 Bit error rates of tested DRAM modules as we reduce the DRAM access latency (i.e., the tRCD timing parameter)

4.1 P/E cycle endurance from different amounts of internal retention time without refresh
4.2 Fraction of P/E cycles consumed by refresh operations
4.3 Cumulative distribution function of writes to pages for 16 evaluated workload traces. Total data footprints for our workloads are 217.6 GB, i.e., 1.0% on the x-axis represents 2.176 GB of data
4.4 Write-hot data identification algorithm using two virtual queues and monitoring windows
4.5 Write-hotness aware retention management policy overview
4.6 Absolute flash memory lifetime for Baseline, WARM, FCR, WARM+FCR, ARFCR, and WARM+ARFCR configurations. Note that the y-axis uses a log scale

4.7 Normalized flash memory lifetime improvement when WARM is applied on top of Baseline, FCR, and ARFCR configurations
4.8 WARM endurance capacity, normalized to Baseline
4.9 Flash writes for FCR (left bar) and WARM+FCR (right bar), broken down into host writes to the hot/cold pool (host_hot/host_cold), garbage collection writes (gc), refresh writes (ref), and writes generated by WARM for migrations from the hot pool to the cold pool (hot2cold)
4.10 WARM average response time, normalized to Baseline
4.11 Flash memory lifetime improvement for WARM, FCR, WARM+FCR, ARFCR, and WARM+ARFCR configurations under different amounts of over-provisioning, normalized to the Baseline lifetime for each over-provisioning amount. Note that the y-axis uses a log scale
4.12 Flash memory lifetime improvements for WARM+FCR over FCR under different refresh rate assumptions
4.13 Fraction of P/E cycles consumed by refresh operations after applying WARM+FCR for a (a) 3-day, (b) 3-week, or (c) 3-month refresh period. Solid trend lines show the fraction consumed by FCR only, from Figure 4.2, for comparison. Note that the x-axis uses a log scale

5.1 Methodology for finding the threshold voltage of an MLC NAND flash
5.5 Modeling error of the evaluated threshold voltage distribution models, at various P/E cycle counts
5.6 Overall latency breakdown of the three evaluated threshold voltage distribution models for static modeling
5.7 Change in mean value of each state's threshold voltage distribution as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line)
5.8 Change in standard deviation of each state's threshold voltage distribution as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line)
5.9 Change in tail values (ν) of each state's threshold voltage distribution as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line)
5.10 Change in log value of the program error probability as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line)
5.11 Threshold voltage distribution as predicted by our dynamic model for 20K P/E cycles, using characterization data from 2.5K, 5K, 7.5K, and 10K P/E cycles, shown as solid/dashed lines. Markers represent data measured from real NAND flash chips at 20K P/E cycles
5.12 Modeling error of predicted threshold voltage distribution for our dynamic model at 20K P/E cycles, using characterization data from N different P/E cycles

5.13 Actual and modeled raw bit error rate using the three evaluated threshold voltage distribution models when reading with fixed default read reference voltages (Vref), across different P/E cycle counts
5.14 Actual and modeled optimal read reference voltages (Vopt) using the three evaluated threshold voltage distribution models at different P/E cycle counts
5.15 RBER achieved by actual and modeled optimal read reference voltages (Vopt) using the three evaluated threshold voltage distribution models at different P/E cycle counts

6.1 3D NAND threshold voltage distribution before (black) and after (red) the data is subject to a high number of errors
6.2 Layer-to-layer variation of RBER
6.3 Optimal read reference voltage variation across layers
6.4 Retention error rate comparison between 3D NAND and planar NAND flash memory
6.5 Optimal read reference voltages, for varying retention ages
6.6 Retention interference at 10K P/E cycles
6.7 Program/erase variation errors vs. P/E cycles
6.8 Mean of distribution for program/erase variation error model, as the P/E cycle count increases
6.9 Standard deviation of distribution for program/erase variation error model, as the P/E cycle count increases
6.10 Optimal read reference voltages vs. P/E cycles
6.11 Interference correlation for a victim cell, as a result of programming on aggressor cells of varying distance
6.12 Interference vs. P/E cycle
6.13 RBER variation across retention age, broken down by (1) MSB or LSB page, and by (2) the state transition of each flash cell
6.14 Mean of distribution for retention loss error model, as retention age increases
6.15 Standard deviation of distribution for retention loss error model, as retention age increases
6.16 Read variation error, varying over the RBER
6.17 Read variation error vs. read offset
6.18 RBER vs. read disturb counts
6.19 Distribution mean vs. read disturb counts
6.20 Distribution standard deviation vs. read disturb counts
6.21 Optimal read reference voltages vs. read disturb counts
6.22 Read disturb error increase rate vs. P/E cycle
6.23 Distribution mean increase rate vs. P/E cycle
6.24 Layer-to-layer variation of distribution mean
6.25 Layer-to-layer variation of distribution width
6.26 RBER variation across bitline
6.27 Distribution mean variation across bitlines
6.28 RBER distribution within a flash block

6.29 RBER reduction using process-variation-aware optimal read reference voltage
6.30 LI-RAID layout example for an SSD with 4 chips and with 4 wordlines in each flash block
6.31 Worst-case RBER at 10,000 P/E cycles when applying different error mitigation techniques
6.32 Achievable 3D NAND endurance when refreshing periodically at different intervals
6.33 RBER reduction using retention-aware optimal read reference voltage
6.34 RBER reduction using retention-aware optimal read reference voltage

7.1 Change in RBER over retention time for flash pages that were programmed using different dwell times
7.2 Threshold voltage distribution before and after a long retention time, for different dwell times
7.3 Threshold voltage distribution mean vs. retention time for different dwell times
7.4 Retention loss speed (left) and program offset (right), for different dwell times
7.5 RBER over retention time at 10,000 P/E cycles under different programming temperatures
7.6 Threshold voltage distribution right after programming at different programming temperatures, predicted by our retention loss model (Equation 7.1)
7.7 Retention loss speed (left) and program offset (right) across different programming temperatures
7.8 Retention loss speed vs. recovery cycles
7.9 SRRM prediction accuracy
7.10 Change in flash lifetime due to write intensity and environmental temperature (tr = 3 months)
7.11 RBER vs. |Vref − Vopt| distance
7.12 Measured and URT-predicted Vopt
7.13 RBER vs. P/E cycle count
7.14 P/E cycle lifetime for each workload

List of Tables

2.1 Tradeoff between strength of error correction configuration and amount of SSD space left for overprovisioning

3.1 List of different types of errors mitigated by various NAND flash error mitigation mechanisms

4.1 Parameters of the simulated flash-based SSD
4.2 Source and description of simulated traces
4.3 Hot pool and cooldown window sizes as set dynamically by WARM. H%: Hot pool size as a percentage of total flash drive capacity. CW: Cooldown window size in number of blocks

5.1 Modeling error of the evaluated threshold voltage distribution models, at various P/E cycle counts
5.2 Computation and storage complexity comparison for the three evaluated threshold distribution models

6.1 Summary of flash error characteristics of 3D NAND and planar NAND flash memory
6.2 Model parameters for 3D NAND retention loss. t is retention time, PEC is P/E cycle lifetime

Chapter 1

Introduction

1.1 The Problem: The Cost of Flash Reliability

In many modern servers and mobile devices, NAND flash memory (i.e., “flash” or “NAND flash”) is used as the primary persistent storage device due to its lower access latency compared to a magnetic disk drive. As we generate more data at an increasingly faster speed, the need for higher density NAND flash memory also increases. In the past decade, flash density has increased by more than 1000× through aggressive process technology scaling from 90 nm to 15 nm. This rapid increase in flash density, however, has come at the cost of severely degraded flash memory reliability and lifetime, as is shown in Figure 1.1a. For example, the number of times that a cell can be reliably programmed and erased before wearing out (i.e., P/E cycle lifetime, shown as the blue curve in Figure 1.1a) has dropped from 10,000 times for 72 nm NAND flash (for a single-level cell, i.e., SLC device) to only 1,000 times for 20 nm NAND flash (for a triple-level cell, i.e., TLC device). In the meantime, data reliability, measured by the number of bit errors in a unit of data stored within the flash memory (i.e., raw bit error rate, shown as the orange curve in Figure 1.1a), increased by seven orders of magnitude from technology node 90 nm to 20 nm. Since the data stored on primary storage is often the only copy, unreliable data storage could lead to permanent loss of valuable data. Thus, we must improve NAND flash memory reliability to prevent data loss.

1.1.1 Flash Reliability Problems

First, we briefly introduce the basics of NAND flash memory to better understand the flash reliability problems. More detailed background on modern flash-memory-based solid-state drives (SSDs) and NAND flash memory can be found in Chapter 2. In NAND flash memory, each flash cell consists of a transistor that can store charge. A flash cell represents a certain data value based on the threshold voltage (Vth) of its transistor. In multi-level cell (MLC) flash memory, each cell stores two bits of data. A threshold voltage window (i.e., a state) is assigned to each possible two-bit value. Figure 1.2 shows the four possible states (i.e., ER, P1, P2, P3) in MLC flash memory, along with their corresponding two-bit values (i.e., the most significant bit, MSB, and the least significant bit, LSB). The state of each flash cell can be read by applying one of three read


Figure 1.1: (a) Flash reliability (i.e., P/E cycle lifetime and raw bit error rate) and (b) ECC redundancy for each NAND flash memory technology node (for single-level cell, i.e., SLC devices for 90 nm to 72 nm technology nodes; for multi-level cell, i.e., MLC devices for 50 nm technology node; for triple-level cell, i.e., TLC devices for 32 nm to 20 nm technology nodes). Reproduced from [244].

reference voltages (i.e., Va, Vb, and Vc) to the cell. Before each flash cell can be programmed to a new state, i.e., be written, the flash cell needs to be erased to the ER state. Due to the combination of program variation, i.e., the variation in program operations, and manufacturing process variation, the threshold voltages of cells programmed to the same state follow a Gaussian-like distribution across the voltage window of the state [23, 195, 260], depicted as the curve for each state in Figure 1.2.

Figure 1.2: Threshold voltage distribution for MLC NAND flash memory.
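As a concrete illustration of the read procedure described above, the short sketch below maps a cell's threshold voltage to one of the four MLC states by comparing it against the three read reference voltages. It is only an illustration, not vendor firmware; the specific voltage values are assumptions made for this example.

```python
# Illustrative sketch of an MLC read: compare a cell's threshold voltage (Vth)
# against the three read reference voltages Va < Vb < Vc to decide which of the
# four states (and which two-bit value) the cell holds. The voltage values are
# made up for illustration; real chips use internal, vendor-specific levels.

STATES = [("ER", "11"), ("P1", "01"), ("P2", "00"), ("P3", "10")]  # as in Figure 1.2

def read_cell(v_th, v_a=1.0, v_b=2.0, v_c=3.0):
    """Return the (state, MSB/LSB bits) read out for a cell with threshold voltage v_th."""
    if v_th < v_a:
        return STATES[0]   # ER
    if v_th < v_b:
        return STATES[1]   # P1
    if v_th < v_c:
        return STATES[2]   # P2
    return STATES[3]       # P3

print(read_cell(1.7))  # -> ('P1', '01')
```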

The reliability problems in NAND flash memory are caused by various types of noise that arise from writing, reading, or idling of the NAND flash memory, or from process variation [21, 22, 23, 24, 26, 27, 33, 35, 219, 252]. Such noise shifts the threshold voltage distribution across the read reference voltages, as we show by comparing the relative position of the original and the shifted distributions in Figure 1.3. As a consequence of this shift, some of the cells are misread as being in a different state than the state they were programmed to. This phenomenon leads to a number of raw bit errors. More detailed background on the various types of NAND flash memory

errors can be found in Section 3.1. These errors include P/E cycling errors [23, 195, 260], cell-to-cell program interference errors [24, 26], program errors [34, 195, 260], read disturb errors [35, 260], retention errors [22, 27], and process variation errors [21, 269]. For instance, a flash cell wears out each time we write data to it via program or erase operations. Thus, the P/E cycling error rate increases over multiple program/erase operation cycles, or P/E cycles. As another example, the electrical charge stored in a flash cell leaks over time. Thus, the retention error rate increases during the idle time after the data is programmed to the cell, or retention time. To limit the raw bit error rate to a tolerable level, flash vendors guarantee reliable operation only for a limited number of P/E cycles and a limited amount of retention time [124].

Figure 1.3: Shifted threshold voltage distribution for MLC NAND flash memory. Reproduced from [199].
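The connection between a distribution shift and raw bit errors can be made concrete with a toy simulation: program a population of cells to one state, shift their threshold voltages downward as retention loss would, re-read them, and count the flipped bits. Every number in the sketch (the read reference voltages, the spread of the state, the size of the shift) is an arbitrary illustrative assumption, not measured data.

```python
# Illustrative sketch: how a threshold voltage shift turns into raw bit errors.
# Cells programmed to state P2 leak charge (retention loss), some drift below
# the read reference voltage Vb, and are misread as P1, flipping one of their
# two bits. All numbers are made up for illustration.
import random

random.seed(0)
V_A, V_B, V_C = 1.0, 2.0, 3.0
BITS = {"ER": "11", "P1": "01", "P2": "00", "P3": "10"}

def read_state(v_th):
    return "ER" if v_th < V_A else "P1" if v_th < V_B else "P2" if v_th < V_C else "P3"

# 10,000 cells programmed to P2, centered in the P2 voltage window
cells = [random.gauss(2.5, 0.15) for _ in range(10_000)]
shifted = [v - 0.4 for v in cells]          # charge leakage over retention time

bit_errors = sum(
    a != b
    for v in shifted
    for a, b in zip(BITS["P2"], BITS[read_state(v)])
)
rber = bit_errors / (2 * len(cells))        # two bits stored per MLC cell
print(f"raw bit error rate after the shift: {rber:.4f}")
```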

1.1.2 The Cost of Improving Flash Reliability

To improve the reliable operation of NAND flash memory in the presence of raw bit errors, the common modern approach is to trade off storage density or performance for higher reliability. We call this the “naive approach”. For example, a strong error correcting code (ECC) or redundant array of independent disks (RAID) techniques use data redundancy to detect and correct raw bit errors, trading off storage density for better reliability. As Figure 1.1b shows, in order to tolerate the significant increase in raw bit error rate that happens when going from a 90 nm technology node to a 20 nm technology node, the amount of redundant storage reserved for ECC (i.e., ECC redundancy, shown as the green curve in Figure 1.1b) has to increase exponentially from ∼2% to ∼14%. Today’s NAND flash memory-based solid-state drives (SSDs) use low-density parity-check (LDPC) codes. The ECC redundancy required by LDPC codes for a certain error correction capability, i.e., the ability to correct a certain number of errors, is already approaching the theoretical limit, known as the Shannon limit [203, 204, 294]. Further improving the error correction capability of the ECC requires either higher ECC redundancy or larger coding granularity. The former greatly degrades storage density. The latter greatly degrades performance and also increases the hardware overhead of the ECC decoder/encoder [22, 25]. Another naive way to improve NAND flash memory reliability is to increase the precision of program and erase operations to mitigate write-induced errors. However, this method significantly increases write latency, degrading SSD performance.

Compared to the naive approach of providing better ECC at high cost and high performance overhead, we believe it is more appealing to develop new techniques to achieve better flash reliability without trading off storage density or performance. These new techniques would enable flash vendors to scale NAND flash memory density more aggressively and would make NAND

flash technology appealing for even more use cases and applications in future computing systems. These new techniques would also enable us to increase the storage density or performance by tuning down flash reliability when it is not required. For example, in geo-replicated data centers, data reliability is already handled by application-level error tolerance techniques. Thus, in these data centers, single-device reliability can be lowered to improve storage density by using quadruple-level cell (QLC) technology, which is very unreliable. If we could accomplish this without degrading data loss guarantees, we could buy back the data redundancy caused by data center geo-replication, reducing the total cost of ownership (TCO) of the data center. Our goal in this dissertation is to build a fundamental understanding of flash devices and application behavior to enable such techniques.
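To make the density cost quoted above concrete, the short calculation below converts an ECC redundancy fraction into parity bytes per codeword and into the user capacity left on a drive. The 2 KiB codeword payload and 512 GB raw capacity are assumed round numbers for illustration, not figures from a specific product.

```python
# Back-of-the-envelope sketch of the density cost of ECC redundancy, using the
# ~2% -> ~14% range quoted above. The 2 KiB codeword payload and 512 GB raw
# capacity are assumed round numbers for illustration only.

def parity_bytes(data_bytes, redundancy):
    # redundancy = parity / (data + parity)  =>  parity = data * r / (1 - r)
    return data_bytes * redundancy / (1.0 - redundancy)

RAW_GB = 512
DATA_BYTES = 2048
for r in (0.02, 0.14):
    print(f"{r:4.0%} redundancy: {parity_bytes(DATA_BYTES, r):6.1f} parity bytes per "
          f"{DATA_BYTES}-byte payload, {RAW_GB * (1 - r):5.1f} GB usable out of {RAW_GB} GB raw")
```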

1.2 Related Work

Several prior works characterize NAND flash memory errors and develop mechanisms to improve flash reliability. In this section, we discuss some closely related prior approaches. We group prior works based on their high-level approach and discuss their shortcomings.

1.2.1 NAND Flash Memory Error Characterization

Prior work has characterized all types of raw bit errors in planar (or 2D) NAND flash memory, including P/E cycling errors [23, 33, 195, 219, 260], program errors [34, 195, 260], cell-to-cell program interference errors [24, 26, 33], retention errors [22, 27, 33, 219], read disturb errors [33, 35, 219, 252], and process variation errors [21, 269]. The findings from these characterizations can be found in Section 3.1.

While these works altogether have comprehensively studied the error characteristics of planar NAND flash memory, the raw data from these past characterization works are unavailable to the public for further analysis. These past characterizations were also done on devices that are old by today's standards. Our new and more detailed characterization of much newer planar NAND flash memory in Chapter 5 allows us to develop new, and more accurate, online models and mitigation techniques for raw bit errors in more modern flash memory chips. Our characterization in Chapter 6 covers all these errors in new 3D NAND flash memory (or 3D NAND) devices and compares the error characteristics of 3D NAND with the results in prior work to uncover the differences between 3D NAND and planar NAND flash memories.

1.2.2 Improving Flash Reliability with Device Awareness

Flash memory device characteristics (i.e., NAND flash memory error characteristics) significantly affect flash reliability [21, 23, 27, 32, 33, 35, 219, 252]. Many prior works propose mechanisms to take advantage of these device characteristics to improve flash reliability. For example, prior work proposes mechanisms that periodically learn and adapt the read reference voltage to the threshold voltage distribution shift to mitigate P/E cycling errors, retention errors, and read disturb errors [27, 35, 252]. To mitigate cell-to-cell program interference errors, prior work proposes a technique that reads the values stored in neighboring cells to assist correcting

errors when a normal read operation fails [24, 26]. To reduce program errors in modern MLC NAND flash memory, recent work proposes to buffer the data stored in the LSB page within the flash chip [34]. To mitigate retention errors, prior work proposes to use various flash refresh techniques that periodically rewrite all data with high retention age [22, 25]. To mitigate read disturb errors, prior work proposes to opportunistically reduce the pass-through voltage, i.e., the highest possible read reference voltage, when the retention age is low [35].

While these works demonstrate that improving flash reliability with device awareness has many benefits in planar NAND, none of these works exploit an online model of flash device characteristics or design mechanisms customized for 3D NAND. In Chapter 5, we propose new techniques that further increase awareness of flash device characteristics to improve flash reliability, by learning an online model of the flash cells within the flash controller and adapting flash controller policies to this online model. The mechanisms we propose in Chapters 6 and 7 are designed for the unique error characteristics that we find in 3D NAND.

1.2.3 Improving Flash Reliability with Workload Awareness

Flash reliability varies significantly depending on the workload data access pattern [22, 24, 26, 32, 33, 35]. Thus, to improve flash reliability and lifetime, prior work proposes better flash management techniques that ensure a friendly data access pattern by optimizing flash controller algorithms. For example, to reduce unnecessary erase operations, prior work optimizes the flash page allocation policy to achieve higher spatial locality of write operations [94, 201, 202, 254]. Prior work also proposes techniques to minimize P/E cycles consumed by metadata within the flash controller [71, 255]. To mitigate program interference errors, prior work proposes to use a certain program sequence, managed by the controller, instead of allowing random writes [24, 26, 255]. To mitigate read disturb errors, prior work proposes to redistribute read-hot pages across different flash blocks [185].

While these works demonstrate several effective examples of improving flash reliability with workload awareness, none of these works design mechanisms to mitigate retention errors. The technique we propose in Chapter 4 partitions data with different write-hotness and applies the most suitable flash management policy for each partition to mitigate retention errors. The techniques we propose in Chapter 7 exploit workload write-intensity awareness and environmental temperature awareness to mitigate retention errors in 3D NAND.

1.2.4 Summary

As demonstrated by these prior works, exploiting both device characteristics and application behavior in the flash controller often leads to significant reliability improvements with low overhead. This dissertation further improves, expands, or complements these techniques by exploiting newly-discovered (1) modern device characteristics and (2) application behavior characteristics, to greatly improve both the bit error rate and the lifetime of state-of-the-art flash memory devices.

1.3 Thesis Statement and Overview

Our goal in this dissertation is to improve flash reliability at low cost and with low performance overhead. As we can see, the naive approach of providing better ECC trades off storage density or performance for improving flash reliability, and thus does not meet our goal of low cost and low performance overhead. To this end, our thesis statement is that

    NAND flash memory reliability can be improved at low cost and with low performance overhead by deploying various architectural techniques that are aware of higher-level application behavior and underlying flash device characteristics.

Our approach is to understand flash memory device error characteristics and workload behavior through rigorous experimental characterization, and to design intelligent and efficient flash controller algorithms that utilize this understanding to improve flash reliability. This approach is based on three observations. First, we can take advantage of higher-level application behavior such as write frequency and locality, and develop the most suitable and customized flash reliability techniques for different types of data stored in the NAND flash memory. Second, we can take advantage of underlying device characteristics, such as variations in temperature, retention time, or error rate, to develop more efficient device-aware reliability techniques. Third, we can take advantage of the unused computing resources in the flash controller during idle time to enable more effective flash reliability techniques. In this dissertation, we investigate four directions to exploit the above three observations and efficiently improve flash reliability.

1.3.1 WARM—Write-hotness Aware Retention Management

Due to charge leakage from the flash cells, data retention errors increase with the time after the data has been programmed onto NAND flash memory (i.e., the retention time). To mitigate retention errors, we limit the amount of retention time in NAND flash memory by providing a limited amount of guaranteed internal retention time, the duration for which the flash memory correctly holds data within its P/E cycle lifetime. Flash lifetime can be extended by relaxing this internal retention time. However, such relaxation cannot be exposed externally to the workload, to avoid altering the expected data integrity property of a flash memory device, or the non-volatility property expected from a storage device. Reliability mechanisms, most prominently refresh, restore the duration of data integrity, but greatly reduce the lifetime improvement from retention time relaxation by performing a large number of write operations. We find that retention time relaxation can be achieved more efficiently by exploiting heterogeneity in write-hotness, i.e., the frequency at which each page is written.

We propose WARM, a write-hotness aware retention management policy for flash memory. This is an example of our approach that exploits application-level write-hotness and device-level retention characteristics to improve flash lifetime. The key idea of WARM is to identify and to physically group together write-hot data within the flash device, allowing the flash controller to selectively perform retention time relaxation with little cost. When applied alone, WARM improves overall flash lifetime by an average of 3.24× over a baseline that does not relax the internal retention time, across a variety of real I/O workload traces. When WARM is applied together with an adaptive refresh mechanism, the average lifetime improves by 12.9× over the baseline. More details are in Chapter 4 and our MSST 2015 paper [194].
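To make the idea of write-hotness identification concrete, the sketch below keeps a rolling window of recent logical-page writes and flags pages written frequently within that window as write-hot. This is only an illustrative stand-in with an arbitrary window length and hot threshold; the actual WARM mechanism, with its virtual queues and cooldown windows, is described in Chapter 4.

```python
# Minimal sketch of write-hotness identification in the spirit of WARM: pages
# written more than a threshold number of times within a sliding window of
# recent writes are treated as write-hot (and could be placed in a separate
# hot pool of blocks). The window length and threshold are arbitrary
# illustrative values, not the parameters used in Chapter 4.
from collections import Counter, deque

class WriteHotnessTracker:
    def __init__(self, window=10_000, hot_threshold=4):
        self.window = deque(maxlen=window)   # recent logical page writes
        self.counts = Counter()
        self.hot_threshold = hot_threshold

    def record_write(self, page):
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1   # about to be evicted by append()
        self.window.append(page)
        self.counts[page] += 1

    def is_write_hot(self, page):
        return self.counts[page] >= self.hot_threshold

tracker = WriteHotnessTracker()
for page in [1, 2, 1, 3, 1, 1, 2]:               # toy write trace
    tracker.record_write(page)
print(tracker.is_write_hot(1), tracker.is_write_hot(3))   # True False
```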

1.3.2 Online Flash Channel Modeling and Its Applications NAND flash memory can be treated as a noisy channel. Each flash cell stores data as the thresh- old voltage of a floating gate transistor. The threshold voltage can shift as a result of various types of circuit-level noise, introducing errors when data is read from the channel and ultimately reducing flash lifetime. An accurate model of the threshold voltage distribution across flash cells can enable mechanisms within the flash controller that improve channel reliability and device lifetime. Unfortunately, existing threshold voltage distribution models are either not accurate enough or have high computational complexity, which makes them unsuitable for online imple- mentation within the controller. We propose a new, low-complexity flash memory channel model, built upon a modified ver- sion of the Student’s t-distribution and the power law, which captures the threshold voltage dis- tribution and predicts future distribution shifts as wear increases. This is an example of our ap- proach that exploits the unused computing resources in the flash controller and enables greater device-awareness. Using our experimental characterization of the state-of-the-art 1X nm (i.e., 15–19 nm) multi-level cell planar NAND flash chips, we show that our model is highly accurate (with an average modeling error of 0.68%), and also simple to compute within the flash controller (requiring 4.41 times less computation time than the most accurate prior model, with negligible decrease in accuracy). Our model also predicts future threshold voltage distribution shifts with a 2.72% average modeling error. We demonstrate several example applications of our new model in the flash controller, which improve flash channel reliability significantly, including a new mechanism to predict the remain- ing lifetime of a flash device. Our evaluations for two of these applications show that our model: (1) helps improve flash memory lifetime by 48.9% and/or (2) enables the flash device to safely sustain 69.9% more write operations than manufacturer specifications. More details are in Chap- ter 5 and our IEEE JSAC 2016 paper [195].

1.3.3 3D NAND Flash Memory Error Characterization and Mitigation Compared to planar NAND flash memory, 3D NAND flash memory uses a new flash cell design, and vertically stacks dozens of silicon layers in a single chip. This allows 3D NAND flash mem- ory to increase storage density using a much less aggressive manufacturing process technology than planar NAND flash memory. The circuit-level and structural changes in 3D NAND flash memory significantly alter how different error sources affect the reliability of the memory. Through experimental characterization of real, state-of-the-art 3D NAND flash memory chips, we find that 3D NAND flash memory exhibits three new error sources that were not pre- viously observed in planar NAND flash memory: (1) layer-to-layer process variation, where the error rate of each layer of memory in the 3D stack is very different; (2) early retention loss, where charge leaks quickly out of a flash cell soon after the cell is programmed; and (3) retention interference, where the charge leakage speed of a flash cell depends on the value stored in the neighboring cell.

1.3.3 3D NAND Flash Memory Error Characterization and Mitigation

Compared to planar NAND flash memory, 3D NAND flash memory uses a new flash cell design, and vertically stacks dozens of silicon layers in a single chip. This allows 3D NAND flash memory to increase storage density using a much less aggressive manufacturing process technology than planar NAND flash memory. The circuit-level and structural changes in 3D NAND flash memory significantly alter how different error sources affect the reliability of the memory.

Through experimental characterization of real, state-of-the-art 3D NAND flash memory chips, we find that 3D NAND flash memory exhibits three new error sources that were not previously observed in planar NAND flash memory: (1) layer-to-layer process variation, where the error rate of each layer of memory in the 3D stack is very different; (2) early retention loss, where charge leaks quickly out of a flash cell soon after the cell is programmed; and (3) retention interference, where the charge leakage speed of a flash cell depends on the value stored in the neighboring cell.

Based on our experimental results, we develop new analytical models for layer-to-layer process variation and retention loss effects in 3D NAND flash memory. Motivated by our new findings and models, we develop four new techniques to mitigate process variation and early retention loss in 3D NAND flash memory. These techniques are examples of our approach that exploits device awareness in the flash controller. Our first technique, layer-to-layer variation aware reading (LaVAR), reduces the effect of layer-to-layer process variation by tuning the read reference voltage for each layer. Our second technique, layer-interleaved RAID (LI-RAID), reorganizes how pages from different layers are paired together for RAID to reduce the overall error count. Our third technique, retention model aware reading (ReMAR), uses our retention model to adapt the read reference voltage to each cell based on the cell's retention age. Our fourth technique, neighbor-cell assisted retention interference correction (NARIC), mitigates retention interference by predicting and adapting the read reference voltages to the amount of interference during each read operation. Compared with similar state-of-the-art error mitigation techniques developed for planar NAND flash memory, LaVAR and ReMAR reduce the average raw bit error rate by 51.9% and 43.3%, respectively, while LI-RAID reduces the worst-case RBER by 66.9%. We conclude that our newly-proposed techniques successfully mitigate the new error patterns that we discover in 3D NAND flash memory. More details are in Chapter 6 and our paper under submission [198].
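To give a flavor of the first of these techniques, the sketch below applies a per-layer offset to a default read reference voltage, in the spirit of LaVAR. The default voltage, the offset values, and the table-based form are assumptions made for illustration; the actual technique, and how the offsets are learned, are described in Chapter 6.

```python
# Illustrative sketch of the idea behind layer variation aware reading: keep a
# small per-layer offset table, learned from per-layer error characterization,
# and add the offset to the default read reference voltage when reading a
# wordline on that layer. The numbers below are hypothetical.

DEFAULT_V_REF = 2.00                                       # assumed default read point (V)
LAYER_OFFSET = {0: +0.03, 1: +0.01, 2: 0.00, 3: -0.02}     # hypothetical per-layer offsets (V)

def read_reference_for(layer):
    """Per-layer read reference voltage = default + learned layer offset."""
    return DEFAULT_V_REF + LAYER_OFFSET.get(layer, 0.0)

for layer in sorted(LAYER_OFFSET):
    print(f"layer {layer}: use Vref = {read_reference_for(layer):.2f} V")
```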

1.3.4 HeatWatch: Self-Recovery and Temperature Aware Retention Error Mitigation

NAND flash memory wearout can be partially repaired on its own during the idle time between program or erase operations (known as the dwell time), via a phenomenon known as the self-recovery effect [224, 337]. We can exploit the self-recovery effect to significantly improve flash lifetime, by applying high temperature to the flash memory during P/E cycling to amplify the self-recovery effect. As NAND flash memory lifetime continues to reduce due to reduced reliability, the self-recovery effect provides an appealing opportunity to mitigate the poor lifetime. While flash self-recovery has been studied for planar 2D NAND flash memory in the past [224, 337], the self-recovery effect in 3D NAND flash memory is not well known, despite the rapidly-growing commercial popularity of 3D NAND flash memory. We close this gap by characterizing the effects of self-recovery and temperature on real, state-of-the-art 3D NAND devices. We show that these effects significantly change the program accuracy (i.e., the robustness of flash program operations) and the retention loss speed (i.e., the speed at which a flash cell leaks charge). We demonstrate that self-recovery and temperature affect 3D NAND flash quite differently from how they affect planar NAND flash, rendering prior models of self-recovery and temperature ineffective for 3D NAND flash. Using our characterization results, we develop a new unified model for 3D NAND self-recovery and temperature effects called URT (Unified self-Recovery and Temperature model). URT provides a model for raw bit error rate and threshold voltage distribution based on wearout, retention time, dwell time, and temperature. Based on our new findings and our new model, we propose HeatWatch, a mechanism that aims to improve 3D NAND flash reliability and lifetime. This is an example of our approach that improves flash reliability by exploiting device-level behavior.

The key idea of HeatWatch is to optimize read operations by adapting to the dwell time of the workload and the current operating temperature. HeatWatch first efficiently tracks flash temperature and the time of each operation online. Then, HeatWatch uses this information to apply URT in order to optimize the read reference voltages. Our detailed experimental evaluations show that HeatWatch reduces the raw bit error rate by 93.5% and improves flash lifetime by 3.85× over a baseline using a fixed read voltage, averaged across 28 real workload traces. More details are in Chapter 7 and our HPCA 2018 paper [199].

1.4 Thesis Outline

In Chapter 2, we introduce the basics of modern SSDs and NAND flash memory. In Chapter 3, we provide additional background and discuss related work on NAND flash memory reliability. A significant fraction of the material in Chapter 2 and Chapter 3 is borrowed from the author’s co-authored work that appears as an invited paper in the Proceedings of the IEEE [33] and placed on arxiv.org [32]. In Chapter 4, we introduce WARM, which is published in MSST [194]. In Chapter 5, we introduce online flash channel modeling, which is published in JSAC [195]. In Chapter 6, we introduce our characterization, modeling, and mitigation techniques for 3D NAND flash memory, which is currently under submission [198]. In Chapter 7, we introduce HeatWatch, which is published in HPCA [199].

1.5 Contributions

To our knowledge, this dissertation is the first to propose various device- and workload-aware techniques in the flash controller that significantly improve both planar NAND and 3D NAND reliability at low cost. This dissertation makes the following major contributions to the field:

1. We propose a new technique called WARM that exploits the write-hotness of the workload in the flash controller to eliminate redundant flash refresh operations. Chapter 4 describes WARM in detail.

1.1. We propose Write-hotness Aware Retention Management (WARM), a heterogeneous retention management policy for NAND flash memory that physically partitions write-hot data and write-cold data into two separate groups of flash blocks, so that we can relax the retention time for only the write-hot data without the need for refreshing such data. We show that doing so improves flash lifetime by 3.24× on average across a variety of server and system workloads, over a write-hotness-oblivious baseline that does not perform any refresh. WARM can also be combined with an adaptive refresh mechanism [22, 25] to further improve flash lifetime by 12.9× over the baseline.

1.2. We propose a mechanism that combines write-hotness-aware retention management with an adaptive refresh mechanism [22, 25]. By using WARM and applying refresh to write-cold data, we can further improve flash lifetime due to the benefits of both techniques. We show that the combined approach improves flash lifetime by 1.21× over using adaptive refresh alone homogeneously across the entire flash memory.

1.3. We propose a simple, yet effective, window-based online algorithm to identify frequently-written pages. This mechanism can dynamically adapt to workload behavior and correctly size the identified subset of write-hot pages. This mechanism fully exploits the existing data structures in the flash controller to keep track of the write-hotness to reduce the tracking overhead. We believe this mechanism can also be used for purposes other than lifetime management, such as cache performance and energy management.

2. We propose a new framework that effectively learns an online model for the flash channel, while the controller and memory are under operation. Our new model can be learned with low performance overhead and can accurately emulate the error characteristics of each flash chip. Chapter 5 describes our framework and its applications.

2.1. We provide an experimental characterization of the threshold voltage distribution, and how the distribution changes with wear, for state-of-the-art 1X-nm MLC planar NAND flash memory chips. Like prior work, we find that program errors can cause the tail of the distribution to fatten significantly, but, unlike prior work, we observe that this fat tail can show up much earlier in the lifetime of the flash device than previously thought.

2.2. We propose a new, simple, and accurate static model for the threshold voltage distribution of MLC NAND flash memory at a particular P/E cycle count, based upon our modified version of the Student's t-distribution. The model is capable of accurately capturing the threshold voltage distribution, with a 0.68% average modeling error, while requiring little computation in the flash controller.

2.3. We propose a new model to dynamically estimate how the threshold voltage distribution shifts as a function of P/E cycles. This model works in conjunction with our proposed static model, and it accurately predicts how the threshold voltage distribution changes in the future, with an average modeling error of 2.72%.

2.4. We demonstrate several practical uses of our online threshold voltage distribution model in a flash controller, which allows the flash controller to dynamically adapt its policies to threshold voltage shifts and thereby better improve flash memory reliability. We propose a new mechanism to estimate the actual remaining flash lifetime, based on the expected growth in bit error rate. Our mechanisms improve flash memory lifetime by 48.9% and/or enable the flash device to safely endure 69.9% more P/E cycles than the manufacturer specification.

3. We perform a detailed, comprehensive characterization and analysis of 3D NAND flash memory errors using state-of-the-art, real 3D NAND flash memory chips. Through this analysis, we observe three new error characteristics in 3D NAND flash memory due to its unique structure and cell design. To mitigate these errors, we develop four new mechanisms within the flash controller, called LaVAR, LI-RAID, ReMAR, and NARIC. Chapter 6 describes our characterization and analysis, and the proposed mechanisms in detail.

3.1. We perform the first comprehensive experimental characterization of 3D NAND flash memory errors using real, state-of-the-art MLC 3D NAND flash memory chips. Based on this characterization, we present an in-depth comparison and analysis of the different error characteristics of the well-known error types between 3D NAND and planar NAND flash memories.

3.2. We present the first in-depth analysis of layer-to-layer process variation, early retention loss, and retention interference, three new error characteristics inherent to 3D NAND flash memory.

3.3. We develop new analytical models for (1) layer-to-layer process variation and (2) early retention loss in 3D NAND flash memory.

3.4. We propose a new mechanism called Layer Variation Aware Reading (LaVAR) to mitigate the effect of layer-to-layer process variation. LaVAR uses our layer-to-layer process variation model to fine-tune the read reference voltage independently for each 3D stack memory layer. On average, LaVAR reduces raw bit error rate by 43.3% over a variation-agnostic baseline.

3.5. We propose a new RAID (i.e., Redundant Array of Independent Disks) scheme called Layer-Interleaved RAID (LI-RAID) to mitigate the error rate variation between pages within each block caused by manufacturing process variation and MSB-LSB variation. LI-RAID eliminates the page with the worst-case error rate within a block by pairing up pages from different layers and from different bits within a cell. By doing this, LI-RAID reduces 99th-percentile tail raw bit error rate by 63.8% over a baseline using LaVAR.

3.6. We propose a new mechanism called Retention Model Aware Reading (ReMAR) to mitigate early retention errors in 3D NAND. ReMAR tracks the retention age of the data by recording the programming time of each block and uses our retention model to adapt the read reference voltage to the retention age of the data. On average, ReMAR reduces raw bit error rate by 51.9% over a retention-agnostic baseline.

3.7. We propose Neighbor-cell Assisted Retention Interference Correction (NARIC) to mitigate retention interference in 3D NAND devices. NARIC improves upon Neighbor-cell Assisted Correction (NAC) [26], previously proposed to mitigate only cell-to-cell program interference. In addition to NAC, NARIC also predicts and adapts the read reference voltages to the amount of retention interference based on the threshold voltage of the cells on adjacent wordlines.

4. We propose a new technique called HeatWatch, a mechanism to improve 3D NAND flash memory reliability. We observe that, due to self-recovery, data retention in 3D NAND is significantly affected by temperature and dwell time. HeatWatch significantly improves flash reliability in 3D NAND by adapting the flash controller algorithms to self-recovery and temperature. Chapter 7 describes our characterization of the self-recovery effect and HeatWatch in detail.

4.1. Using real, state-of-the-art 3D charge trap NAND flash chips from a major vendor, we experimentally characterize the effects of self-recovery and temperature on retention loss speed and program variation. We show that 3D NAND flash memory exhibits different self-recovery and temperature effects than planar NAND flash memory.

4.2. Based on our experimental characterization data, we construct URT, a unified model for retention loss, wearout, self-recovery, and temperature in 3D NAND flash memory. Our model quantifies these four effects to accurately predict the raw bit error rate and threshold voltage shift.

4.3. We propose HeatWatch, a mechanism for 3D NAND flash memory that improves flash reliability and lifetime. HeatWatch (1) tracks the temperature, dwell time, and retention time online, and (2) sends this information to URT to accurately predict the optimal read reference voltage. By using the predicted optimal read reference voltage for flash read operations, HeatWatch reduces the raw bit error rate by 93.5%, and improves flash lifetime by 3.85×, over a baseline that uses a fixed read reference voltage.

Chapter 2

Basics of Modern SSDs and NAND Flash Memory

In this chapter, we introduce the basic background of how modern SSDs work internally (Section 2.1), and how NAND flash memory operates to store data (Section 2.2). This background knowledge helps us understand the root cause of reliability issues for flash-memory-based SSDs as well as the state-of-the-art techniques to mitigate these issues, which are introduced in Chapter 3, and the new techniques we propose in Chapters 4–7.

2.1 State-of-the-Art SSD Architecture

In order to understand the root causes of reliability issues within SSDs, we first provide an overview of the system architecture of a state-of-the-art SSD. The SSD consists of a group of NAND flash memories (or chips) and a controller, as shown in Figure 2.1. A host computer communicates with the SSD through a high-speed host interface (e.g., AHCI, NVMe; see Section 2.1.3), which connects to the SSD controller. The controller is then connected to each of the NAND flash chips via memory channels.

Figure 2.1: (a) SSD system architecture, showing controller (Ctrl) and chips. (b) Detailed view of connections between controller components and chips. Adapted from [32].

2.1.1 Flash Memory Organization

Figure 2.2 shows an example of how NAND flash memory is organized within an SSD. The flash memory is spread across multiple flash chips, where each chip contains one or more flash dies, which are individual pieces of silicon wafer that are connected together to the pins of the chip. Contemporary SSDs typically have 4–16 chips per SSD, and can have as many as 16 dies per chip. Each chip is connected to one or more physical memory channels, and these memory channels are not shared across chips. A flash die operates independently of other flash dies, and contains between one and four planes. Each plane contains hundreds to thousands of flash blocks. Each block is a 2D array that contains hundreds of rows of flash cells (typically 256–1024 rows) where the rows store contiguous pieces of data. Much like banks in a multi-bank memory (e.g., DRAM banks [40, 150, 152, 166, 168, 169, 170, 225, 234, 235]), the planes can execute flash operations in parallel, but the planes within a die share a single set of data and control buses [2]. Hence, an operation can be started in a different plane in the same die in a pipelined manner, every cycle. Figure 2.2 shows how blocks are organized within chips across multiple channels. In the rest of this work, without loss of generality, we assume that a chip contains a single die.

Figure 2.2: Flash memory organization. Reproduced from [32].

Data in a block is written at the unit of a page, which is typically between 8 and 16 kB in size in NAND flash memory. All read and write operations are performed at the granularity of a page. Each block typically contains hundreds of pages. Blocks in each plane are numbered with an ID that is unique within the plane, but is shared across multiple planes. Within the block, each page is numbered in sequence. The controller firmware groups blocks with the same ID number across multiple chips and planes together into a superblock. Within each superblock, the pages with the same page number are considered a superpage. The controller opens one superblock (i.e., an empty superblock is selected for write operations) at a time, and typically writes data to the NAND flash memory one superpage at a time to improve sequential read/write performance and make error correction efficient, since some parity information is kept at superpage granularity (see Section 2.1.3). Having the ability to write to all of the pages in a superpage simultaneously, the SSD can fully exploit the internal parallelism offered by multiple planes/chips, which in turn maximizes write throughput.
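As an illustration of the grouping just described, the following Python sketch enumerates the physical pages that form one superpage, assuming a small hypothetical geometry; real controllers use much larger geometries and richer metadata.

    # Minimal sketch (not the controller's actual data structures): group the pages with
    # the same (block ID, page number) across all channels, chips, and planes into one
    # superpage. The geometry constants below are hypothetical toy values.
    from collections import namedtuple

    PhysicalPage = namedtuple("PhysicalPage", "channel chip plane block page")

    N_CHANNELS, CHIPS_PER_CHANNEL, PLANES_PER_CHIP = 2, 2, 2  # toy geometry

    def superpage(block_id: int, page_num: int):
        """Return all physical pages that form superpage `page_num` of superblock `block_id`."""
        return [
            PhysicalPage(ch, chip, plane, block_id, page_num)
            for ch in range(N_CHANNELS)
            for chip in range(CHIPS_PER_CHANNEL)
            for plane in range(PLANES_PER_CHIP)
        ]

    # Writing one superpage touches every plane in every chip exactly once, so all
    # planes/chips can program their page in parallel.
    print(len(superpage(block_id=42, page_num=7)))  # 2 * 2 * 2 = 8 physical pages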

2.1.2 Memory Channel

Each flash memory channel has its own data and control connection to the SSD controller, much like a main memory channel has to the DRAM controller [87, 102, 103, 117, 150, 151, 153, 155, 226, 231, 234, 235, 305, 306, 307]. The connection for each channel is typically an 8- or 16-bit wide bus between the controller and one of the flash memory chips [2]. Both data and flash commands can be sent over the bus. Each channel also contains its own control signal pins to indicate the type of data or command that is on the bus. The address latch enable (ALE) pin signals that the controller is sending an address, while the command latch enable (CLE) pin signals that the controller is sending a flash command. Every rising edge of the write enable (WE) signal indicates that the flash memory should write the piece of data currently being sent on the bus by the SSD controller. Similarly, every rising edge of the read enable (RE) signal indicates that the flash memory should send the next piece of data from the flash memory to the SSD controller. Each flash memory die connected to a memory channel has its own chip enable (CE) signal, which selects the die that the controller currently wants to communicate with. On a channel, the bus broadcasts address, data, and flash commands to all dies within the channel, but only the die whose CE signal is active reads the information from the bus and executes the corresponding operation.
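The following simplified Python sketch illustrates how command and address bytes are latched over such a channel. It models only the CLE/ALE/WE/CE behavior described above (no bus timing or RE path), and the opcode value is hypothetical.

    # Simplified illustration (not a timing-accurate model): CLE marks command bytes,
    # ALE marks address bytes, and each rising WE edge latches one byte from the shared
    # bus into the die whose CE signal is asserted. The opcode below is hypothetical.
    class NandDie:
        def __init__(self, name):
            self.name = name
            self.ce = False          # chip enable
            self.latched = []        # (kind, byte) pairs this die has latched

        def we_edge(self, cle, ale, bus_byte):
            """Model one rising WE edge: only an enabled die latches the bus contents."""
            if not self.ce:
                return
            kind = "CMD" if cle else ("ADDR" if ale else "DATA")
            self.latched.append((kind, bus_byte))

    dies = [NandDie("die0"), NandDie("die1")]
    dies[0].ce = True                          # select die 0; die 1 ignores the bus

    READ_CMD = 0x00                            # hypothetical opcode
    address = [0x01, 0x02, 0x03, 0x04, 0x05]   # hypothetical 5-cycle address

    def bus_cycle(cle, ale, byte):
        for d in dies:
            d.we_edge(cle, ale, byte)

    bus_cycle(cle=True, ale=False, byte=READ_CMD)   # command cycle
    for a in address:
        bus_cycle(cle=False, ale=True, byte=a)      # address cycles

    print(dies[0].latched)   # die 0 latched the command and address bytes
    print(dies[1].latched)   # die 1 latched nothing (CE deasserted)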

2.1.3 SSD Controller

The SSD controller, shown in Figure 2.1b, is responsible for (1) handling I/O requests received from the host, (2) ensuring data integrity and efficient storage, and (3) managing the underlying NAND flash memory. To perform these tasks, the controller runs firmware, which is often referred to as the flash translation layer (FTL). FTL tasks are executed on one or more embedded processors that exist inside the controller. The controller has access to DRAM, which can be used to store various controller metadata (e.g., how host memory addresses map to physical SSD addresses) and to cache relevant (e.g., frequently accessed) SSD pages [213, 279]. When the controller handles I/O requests, it performs a number of operations on both the requests and the data. For requests, the controller schedules them in a manner that ensures correctness and provides high/reasonable performance. For data, the controller scrambles the data to improve raw bit error rates, performs ECC encoding/decoding, and in some cases compresses/decompresses and/or encrypts/decrypts the data and employs superpage-level data parity. To manage the NAND flash memory, the controller runs firmware that maps host data to physical NAND flash pages, performs garbage collection on flash pages that have been invalidated, applies wear leveling to evenly distribute the impact of writes on NAND flash reliability across all pages, and manages bad NAND flash blocks. We briefly examine the various tasks of the SSD controller.

Scheduling Requests

The controller receives I/O requests over a host controller interface (shown as Host Interface in Figure 2.1b), which consists of a system I/O bus and the protocol used to communicate along the bus.

When an application running on the host system needs to access the SSD, it generates an I/O request, which is sent by the host over the host controller interface. The SSD controller receives the I/O request, and inserts the request into a queue. The controller uses a scheduling policy to determine the order in which the controller processes the requests that are in the queue. The controller then sends the request selected for scheduling to the FTL (part of the Firmware shown in Figure 2.1b).

The host controller interface determines how requests are sent to the SSD and how the requests are queued for scheduling. Two of the most common host controller interfaces used by modern SSDs are the Advanced Host Controller Interface (AHCI) [116] and NVM Express (NVMe) [247, 316, 317]. AHCI builds upon the Serial Advanced Technology Attachment (SATA) system bus protocol [290], which was originally designed to connect the host system to magnetic hard disk drives. AHCI allows the host to use advanced features with SATA, such as native command queuing (NCQ). When an application executing on the host generates an I/O request, the application sends the request to the operating system (OS). The OS sends the request over the SATA bus to the SSD controller, and the controller adds the request to a single command queue. NCQ allows the controller to schedule the queued I/O requests in a different order than the order in which requests were received (i.e., requests are scheduled out of order). As a result, the controller can choose requests from the queue in a manner that maximizes the overall SSD performance (e.g., a younger request can be scheduled earlier than an older request that requires access to a plane that is occupied with serving another request). A major drawback of AHCI and SATA is the limited throughput they enable for SSDs [346], as the protocols were originally designed to match the much lower throughput of magnetic hard disk drives. For example, a modern magnetic hard drive has a sustained read throughput of 300 MB/s [289], whereas a modern SSD has a read throughput of 3500 MB/s [283]. However, AHCI and SATA are widely deployed in modern computing systems, and they currently remain a common choice for the SSD host controller interface.

To alleviate the throughput bottleneck of AHCI and SATA, many manufacturers have started adopting host controller interfaces that use the PCI Express (PCIe) system bus [264]. A popular standard interface for the PCIe bus is the NVM Express (NVMe) interface [247]. Unlike AHCI, which requires an application to send I/O requests through the OS, NVMe directly exposes multiple SSD I/O queues to the applications executing on the host. By directly exposing the queues to the applications, NVMe simplifies the software I/O stack, eliminating most OS involvement [346], which in turn reduces communication overheads. An SSD using the NVMe interface maintains a separate set of queues for each application (as opposed to the single queue used for all applications with AHCI) within the host interface. With more queues, the controller (1) has a larger number of requests to select from during scheduling, increasing its ability to utilize idle resources (i.e., channels, dies, planes; see Section 2.1.1); and (2) can more easily manage and control the amount of interference that an application experiences from other concurrently-executing applications. Currently, NVMe is used by modern SSDs that are designed mainly for high-performance systems (e.g., enterprise servers, data centers [345, 346]). Recent work describes the state-of-the-art request scheduling algorithms in more detail [316, 317].

Flash Translation Layer

The main duty of the FTL (which is part of the Firmware shown in Figure 2.1) is to manage the mapping of logical addresses (i.e., the address space utilized by the host) to physical addresses in the underlying flash memory (i.e., the address space for actual locations where the data is stored, visible only to the SSD controller) for each page of data [63, 94]. By providing this indirection between address spaces, the FTL can remap the logical address to a different physical address (i.e., move the data to a different physical address) without notifying the host. Whenever a page of data is written to by the host or moved for underlying SSD maintenance operations (e.g., garbage collection [44, 350]; see Section 2.1.3), the old data (i.e., the physical location where the overwritten data resides) is simply marked as invalid in the physical block's metadata, and the new data is written to a page in the flash block that is currently open for writes (see Section 2.2.4 for more detail on how writes are performed). The FTL is also responsible for wear leveling, to ensure that all of the blocks within the SSD are evenly worn out [44, 350]. By evenly distributing the wear (i.e., the number of P/E cycles that take place) across different blocks, the SSD controller reduces the heterogeneity of the amount of wearout across these blocks, thereby extending the lifetime of the device. The wear-leveling algorithm is invoked when the current block that is being written to is full (i.e., no more pages in the block are available to write to), and it enables the controller to select a new block from the free list to direct the future writes to. The wear-leveling algorithm dictates which of the blocks from the free list is selected. One simple approach is to select the block in the free list with the lowest number of P/E cycles to minimize the variance of the wearout amount across blocks, though many algorithms have been developed for wear leveling [43, 83].
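The following Python sketch illustrates the two FTL duties described above, logical-to-physical remapping and wear-leveling-based block selection, under heavily simplified assumptions (page-level mapping, no caching, and no garbage collection shown here); it is not a real controller implementation.

    # Minimal FTL sketch under simplifying assumptions: host writes are remapped out of
    # place, the old physical page is invalidated, and wear leveling picks the free block
    # with the fewest P/E cycles when a new block must be opened. Erase and garbage
    # collection are not modeled here.
    class SimpleFTL:
        PAGES_PER_BLOCK = 4  # toy value

        def __init__(self, num_blocks):
            self.l2p = {}                                  # logical page -> (block, page)
            self.invalid = set()                           # physical pages holding stale data
            self.pe_cycles = [0] * num_blocks              # per-block wear
            self.free_blocks = list(range(num_blocks))
            self.open_block, self.next_page = self._pick_block(), 0

        def _pick_block(self):
            # Wear leveling: choose the free block with the lowest P/E cycle count.
            blk = min(self.free_blocks, key=lambda b: self.pe_cycles[b])
            self.free_blocks.remove(blk)
            return blk

        def write(self, logical_page):
            if logical_page in self.l2p:                   # out-of-place update
                self.invalid.add(self.l2p[logical_page])   # mark old copy invalid
            if self.next_page == self.PAGES_PER_BLOCK:     # open block is full
                self.open_block, self.next_page = self._pick_block(), 0
            self.l2p[logical_page] = (self.open_block, self.next_page)
            self.next_page += 1

    ftl = SimpleFTL(num_blocks=8)
    for lpn in [0, 1, 0, 2, 1]:                            # rewrites of 0 and 1 go out of place
        ftl.write(lpn)
    print(ftl.l2p, ftl.invalid)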

Garbage Collection

When the host issues a write request to a logical address stored in the SSD, the SSD controller performs the write out of place (i.e., the updated version of the page data is written to a different physical page in the NAND flash memory), because in-place updates cannot be performed (see Section 2.2.4). The old physical page is marked as invalid when the out-of-place write completes. Fragmentation refers to the waste of space within a block due to the presence of invalid pages. In a fragmented block, a fraction of the pages are invalid, but these pages are unable to store new data until the page is erased. Due to circuit-level limitations, the controller can perform erase operations only at the granularity of an entire block (see Section 2.2.4 for details). As a result, until a fragmented block is erased, the block wastes physical space within the SSD. Over time, if fragmented blocks are not erased, the SSD will run out of pages that it can write new data to. The problem becomes especially severe if the blocks are highly fragmented (i.e., a large fraction of the pages within a block are invalid). To reduce the negative impact of fragmentation on usable SSD storage space, the FTL periodically performs a process called garbage collection. Garbage collection finds highly-fragmented flash blocks in the SSD and recovers the wasted space due to invalid pages. The basic garbage collection algorithm [44, 350] (1) identifies the highly-fragmented blocks (which we call the selected blocks), (2) migrates any valid pages in a selected block (i.e., each valid page is written to a new block, its virtual-to-physical address mapping is updated, and the page in the selected block is marked as invalid), (3) erases each selected block (see Section 2.2.4), and (4) adds a pointer to each selected block into the free list within the FTL.

The garbage collection algorithm typically selects blocks with the highest number of invalid pages. When the controller needs a new block to write pages to, it selects one of the blocks currently in the free list. We briefly discuss five optimizations that prior works propose to improve the performance and/or efficiency of garbage collection [2, 61, 94, 98, 105, 194, 271, 335, 350]. First, the garbage collection algorithm can be optimized to determine the most efficient frequency to invoke garbage collection [271, 350], as performing garbage collection too frequently can delay I/O requests from the host, while not performing garbage collection frequently enough can cause the controller to stall when there are no blocks available in the free list. Second, the algorithm can be optimized to select blocks in a way that reduces the number of page copy and erase operations required each time the garbage collection algorithm is invoked [98, 271]. Third, some works reduce the latency of garbage collection by using multiple channels to perform garbage collection on multiple blocks in parallel [2, 105]. Fourth, the FTL can minimize the latency of I/O requests from the host by pausing erase and copy operations that are being performed for garbage collection, in order to service the host requests immediately [61, 335]. Fifth, pages can be grouped together such that all of the pages within a block become invalid around the same time [94, 105, 194]. For example, the controller can group pages with (1) a similar degree of write-hotness (i.e., the frequency at which a page is updated; see Section 3.2.6) or (2) a similar death time (i.e., the time at which a page is overwritten). Garbage collection remains an active area of research.
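As a concrete illustration of the basic algorithm described above, the following Python sketch performs one garbage collection pass over a toy block representation; mapping-table updates and the optimizations discussed above are omitted.

    # Sketch of the basic garbage collection policy, under simplifying assumptions: each
    # block is a list of page states, the most-fragmented block (most invalid pages) is
    # selected, its valid pages are migrated to an open block, and the selected block is
    # erased and returned to the free list.
    def garbage_collect(blocks, free_list, open_block, pe_cycles):
        """blocks: dict block_id -> list of 'valid'/'invalid'/'free' page states."""
        # (1) Select the block with the most invalid pages.
        victim = max(blocks, key=lambda b: blocks[b].count("invalid"))
        # (2) Migrate valid pages to the open block (mapping-table updates omitted here).
        for i, state in enumerate(blocks[victim]):
            if state == "valid":
                open_block.append(("migrated", victim, i))
        # (3) Erase the victim block, wearing it out by one P/E cycle.
        blocks[victim] = ["free"] * len(blocks[victim])
        pe_cycles[victim] += 1
        # (4) Add the erased block back to the free list.
        free_list.append(victim)
        return victim

    blocks = {0: ["invalid", "valid", "invalid", "invalid"],
              1: ["valid", "valid", "invalid", "valid"]}
    free_list, open_block, pe = [], [], {0: 10, 1: 12}
    print(garbage_collect(blocks, free_list, open_block, pe))  # selects block 0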

Flash Reliability Management

The SSD controller performs many background optimizations that improve flash reliability. These flash reliability management techniques, as we will discuss in more detail in Section 3.2, can effectively improve flash lifetime at a very low cost, since the optimizations are usually performed during idle times, when the interference with the running workload is minimized. These management techniques sometimes require small metadata storage in memory (e.g., for storing the near-optimal read reference voltages [27, 35, 195]), or require a timer (e.g., for triggering refreshes in time [22, 25]).

Compression

Compression can reduce the size of the data written to minimize the number of flash cells worn out by the original data. Some controllers provide compression, as well as decompression, which reconstructs the original data from the compressed data stored in the flash memory [181, 364]. The controller may contain a compression engine, which, for example, performs the LZ77 or LZ78 algorithms. Compression is optional, as some types of data being stored by the host (e.g., JPEG images, videos, encrypted files, files that are already compressed) may not be compressible.

18 Data Scrambling and Encryption

The occurrence of errors in flash memory is highly dependent on the data values stored into the memory cells [21, 24, 26]. To reduce the dependence of the error rate on data values, an SSD controller first scrambles the data before writing it into the flash chips [36, 142]. The key idea of scrambling is to probabilistically ensure that the actual value written to the SSD contains an equal number of randomly distributed zeroes and ones, thereby minimizing any data-dependent behavior. Scrambling is performed using a reversible process, and the controller descrambles the data stored in the SSD during a read request. The controller employs a linear feedback shift register (LFSR) to perform scrambling and descrambling. An n-bit LFSR generates 2^n − 1 bits worth of pseudo-random numbers without repetition. For each page of data to be written, the LFSR can be seeded with the logical address of that page, so that the page can be correctly descrambled even if maintenance operations (e.g., garbage collection) migrate the page to another physical location, as the logical address is unchanged. (This also reduces the latency of maintenance operations, as they do not need to descramble and rescramble the data when a page is migrated.) The LFSR then generates a pseudo-random number based on the seed, which is then XORed with the data to produce the scrambled version of the data. As the XOR operation is reversible, the same process can be used to descramble the data. In addition to the data scrambling employed to minimize data value dependence, several SSD controllers include data encryption hardware [64, 104, 331]. An SSD that contains data encryption hardware within its controller is known as a self-encrypting drive (SED). In the controller, data encryption hardware typically employs AES encryption [64, 69, 243, 331], which performs multiple rounds of substitutions and permutations to the unencrypted data in order to encrypt it. AES employs a separate key for each round [69, 243]. In an SED, the controller contains hardware that generates the AES keys for each round, and performs the substitutions and permutations to encrypt or decrypt the data using dedicated hardware [64, 104, 331].
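The following Python sketch illustrates the LFSR-based scrambling scheme described above, assuming a 16-bit maximal-length Galois LFSR; a real controller would use a longer, vendor-specific polynomial implemented in hardware.

    # Sketch of LFSR-based scrambling. The 16-bit Galois LFSR below (taps
    # x^16 + x^14 + x^13 + x^11 + 1) is an assumption for illustration only. Seeding
    # with the logical page address keeps the keystream stable across page migrations,
    # and XOR makes descrambling identical to scrambling.
    def lfsr16_keystream(seed: int, nbytes: int) -> bytes:
        state = (seed & 0xFFFF) or 1          # LFSR state must never be all zeros
        out = bytearray()
        for _ in range(nbytes):
            byte = 0
            for bit in range(8):
                lsb = state & 1
                state >>= 1
                if lsb:
                    state ^= 0xB400           # feedback taps (Galois form)
                byte |= lsb << bit
            out.append(byte)
        return bytes(out)

    def scramble(data: bytes, logical_page_addr: int) -> bytes:
        key = lfsr16_keystream(logical_page_addr, len(data))
        return bytes(d ^ k for d, k in zip(data, key))

    page = b"\x00" * 8 + b"\xff" * 8                      # highly biased data
    scrambled = scramble(page, logical_page_addr=0x1234)
    assert scramble(scrambled, 0x1234) == page            # XOR is its own inverse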

Error-Correcting Codes

ECC is used to detect and correct the raw bit errors that occur within flash memory. A host writes a page of data, which the SSD controller splits into one or more chunks. For each chunk, the controller generates a codeword, consisting of the chunk and a correction code. The strength of protection offered by ECC is determined by the coding rate, which is the chunk size divided by the codeword size. A higher coding rate provides weaker protection, but consumes less storage, representing a key reliability tradeoff in SSDs. The ECC algorithm employed (typically BCH [14, 107, 177, 297] or LDPC [84, 85, 203, 204, 297, 362]; see Section 3.3), as well as the length of the codeword and the coding rate, determine the total error correction capability, i.e., the maximum number of raw bit errors that can be corrected by ECC. ECC engines in contemporary SSDs are able to correct data with a relatively high raw bit error rate (e.g., between $10^{-3}$ and $10^{-2}$ [124]) and return data to the host at an error rate that meets traditional data storage reliability requirements (e.g., a post-correction error rate of $10^{-15}$ in the JEDEC standard [121]).

The error correction failure rate (P_ECFR) of an ECC implementation, with a codeword length of l, where the codeword has an error correction capability of t bits, can be modeled as:

    P_{ECFR} = \sum_{k=t+1}^{l} \binom{l}{k} (1 - BER)^{l-k} BER^{k}    (2.1)

where BER is the bit error rate of the NAND flash memory. We assume in this equation that errors are independent and identically distributed. In addition to the ECC information, a codeword contains cyclic redundancy checksum (CRC) parity information [279]. When data is being read from the NAND flash memory, there may be times when the ECC algorithm incorrectly indicates that it has successfully corrected all errors in the data, when uncorrected errors remain. To ensure that incorrect data is not returned to the user, the controller performs a CRC check in hardware to verify that the data is error free [266, 279].
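A numerical instance of Equation 2.1 can be computed directly from the binomial tail, as in the following Python sketch; the codeword length, correction capability, and bit error rates used here are illustrative only.

    # Equation 2.1 as the upper tail of a binomial distribution, assuming independent
    # and identically distributed bit errors. Parameter values are illustrative.
    from scipy.stats import binom

    def p_ecfr(l: int, t: int, ber: float) -> float:
        """Probability that more than t of the l codeword bits are in error."""
        return binom.sf(t, l, ber)   # = sum_{k=t+1}^{l} C(l,k) (1-BER)^(l-k) BER^k

    # Example: an 8,192-bit codeword that can correct up to 100 raw bit errors.
    print(p_ecfr(l=8192, t=100, ber=1e-3))   # essentially zero (mean ~8 errors)
    print(p_ecfr(l=8192, t=100, ber=2e-2))   # near 1 once the mean error count exceeds t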

Data Path Protection

In addition to protecting the data from raw bit errors within the NAND flash memory, newer SSDs incorporate error detection and correction mechanisms throughout the SSD controller, in order to further improve reliability and data integrity [279]. These mechanisms are collectively known as data path protection, and protect against errors that can be introduced by the various SRAM and DRAM structures that exist within the SSD.1 Figure 2.3 illustrates the various structures within the controller that employ data path protection mechanisms. There are three data paths that require protection: (1) the path for data written by the host to the flash memory, shown as a red solid line in Figure 2.3; (2) the path for data read from the flash memory by the host, shown as a green dotted line; and (3) the path for metadata transferred between the firmware (i.e., FTL) processors and the DRAM, shown as a blue dashed line.

Figure 2.3: Data path protection employed within the controller. Reproduced from [32].

In the write data path of the controller (the red solid line shown in Figure 2.3), data received from the host interface (❶ in the figure) is first sent to a host FIFO buffer (❷).

1 See Section 3.5 for a discussion on the possible types of errors that can be present in DRAM.

Before the data is written into the host FIFO buffer, the data is appended with memory protection ECC (MPECC) and host FIFO buffer (HFIFO) parity [279]. The MPECC parity is designed to protect against errors that are introduced when the data is stored within DRAM (which takes place later along the data path), while the HFIFO parity is designed to protect against SRAM errors that are introduced when the data resides within the host FIFO buffer. When the data reaches the head of the host FIFO buffer, the controller fetches the data from the buffer, uses the HFIFO parity to correct any errors, discards the HFIFO parity, and sends the data to the DRAM manager (❸). The DRAM manager buffers the data (which still contains the MPECC information) within DRAM (❹), and keeps track of the location of the buffered data inside the DRAM. When the controller is ready to write the data to the NAND flash memory, the DRAM manager reads the data from DRAM. Then, the controller uses the MPECC information to correct any errors, and discards the MPECC information. The controller then encodes the data into an ECC codeword (❺), generates CRC parity for the codeword, and then writes both the codeword and the CRC parity to a NAND flash FIFO buffer (❻) [279]. When the codeword reaches the head of this buffer, the controller uses CRC parity to detect any errors in the codeword, and then dispatches the data to the flash interface (❼), which writes the data to the NAND flash memory. Until the controller successfully writes the data to the NAND flash memory, the data write is not considered durable. This is because the data stored in the DRAM or FIFO buffers can be lost if there is a power failure event [3]. The read data path of the controller (the green dotted line shown in Figure 2.3) performs the same procedure as the write data path, but in reverse order [279]. Aside from buffering data along the write and read paths, the controller uses the DRAM to store essential metadata, such as the table that maps each host data address to a physical block address within the NAND flash memory [213, 279]. In the metadata path of the controller (the blue dashed line shown in Figure 2.3), the metadata is often read from or written to DRAM by the firmware processors. In order to ensure correct operation of the SSD, the metadata must not contain any errors. As a result, the controller uses memory protection ECC (MPECC) for the metadata stored within DRAM [193, 279], just as it did to buffer data along the write and read data paths. Due to the lower rate of errors in DRAM compared to NAND flash memory (see Section 3.5), the employed memory protection ECC algorithms are not as strong as BCH or LDPC. We describe common ECC algorithms employed for DRAM error correction in Section 3.5.

Bad Block Management

Due to process variation or uneven wearout, a small number of flash blocks may have a much higher raw bit error rate (RBER) than an average flash block. Mitigating or tolerating the RBER on these flash blocks often requires a much higher cost than the benefit of using them. Thus, it is more efficient to identify and record these blocks as bad blocks, and avoid using them to store useful data. There are two types of bad blocks: original bad blocks (OBBs), which are defective due to manufacturing issues (e.g., process variation), and growth bad blocks (GBBs), which fail during runtime [318]. The flash vendor performs extensive testing, known as bad block scanning, to identify OBBs when a flash chip is manufactured [218]. Initially, all blocks are kept in the erased state, and contain the value 0xFF in each byte (see Section 2.2.1). Inside each OBB, the bad block scanning procedure writes a specific data value (e.g., 0x00) to a specific byte location within the block that indicates the block status. A good block (i.e., a block without defects) is not modified, and thus its block status byte remains at the value 0xFF.

When the SSD is powered up for the first time, the SSD controller iterates through all blocks and checks the value stored in the block status byte of each block. Any block that does not contain the value 0xFF is marked as bad, and is recorded in a bad block table stored in the controller. A small number of blocks in each plane are set aside as reserved blocks (i.e., blocks that are not used during normal operation), and the bad block table automatically remaps any operation originally destined to an OBB to one of the reserved blocks. The bad block table remaps an OBB to a reserved block in the same plane, to ensure that the SSD maintains the same degree of parallelism when writing to a superpage, thus avoiding performance loss. Less than 2% of all blocks in the SSD are expected to be OBBs [248]. The SSD identifies growth bad blocks during runtime by monitoring the status of each block. Each superblock contains a bit vector indicating which of its blocks are GBBs. After each program or erase operation to a block, the SSD reads the status reporting registers to check the operation status. If the operation has failed, the controller marks the block as a GBB in the superblock bit vector. At this point, the controller uses superpage-level parity to recover the data that was stored in the GBB (see Section 2.1.3), and all data in the superblock is copied to a different superblock. The superblock containing the GBB is then erased. When the superblock is subsequently opened, blocks marked as GBBs are not used, but the remaining blocks can store new data.
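The following Python sketch illustrates the power-up OBB scan and remapping described above, under the simplifying assumption that the block status bytes have already been read from the chip; the same-plane remapping constraint is omitted for brevity.

    # Sketch of power-up original-bad-block (OBB) discovery and remapping. Real
    # controllers read the status byte from a vendor-defined page/byte location and
    # remap each OBB to a reserved block in the same plane (omitted here).
    GOOD_STATUS = 0xFF          # erased, untouched status byte

    def build_bad_block_table(status_bytes, reserved_blocks):
        """status_bytes: dict block_id -> block-status byte read from the chip.
        Returns a remap table from each OBB to a reserved block."""
        remap = {}
        spares = list(reserved_blocks)
        for block_id, status in sorted(status_bytes.items()):
            if status != GOOD_STATUS:            # factory marked this block as bad
                remap[block_id] = spares.pop(0)  # redirect future accesses to a spare
        return remap

    status = {0: 0xFF, 1: 0x00, 2: 0xFF, 3: 0x00}    # blocks 1 and 3 are OBBs
    print(build_bad_block_table(status, reserved_blocks=[100, 101, 102]))
    # {1: 100, 3: 101}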

Superpage-Level Parity

In addition to ECC to protect against bit-level errors, many SSDs employ RAID-like parity [76, 132, 217, 262]. The key idea is to store parity information within each superpage to protect data from ECC failures that occur within a single chip or plane. Figure 2.4 shows an example of how the ECC and parity information are organized within a superpage. For a superpage that spans across multiple chips, dies, and planes, the pages stored within one die or one plane (depending on the implementation) are used to store parity information for the remaining pages. Without loss of generality, we assume for the rest of this section that a superpage that spans c chips and d dies per chip stores parity information in the pages of a single die (which we call the parity die), and that it stores user data in the pages of the remaining (c × d) − 1 dies. When all of the user data is written to the superpage, the SSD controller XORs the data together one plane at a time (e.g., in Figure 2.4, all of the pages in Plane 0 are XORed with each other), which produces the parity data for that plane. This parity data is written to the corresponding plane in the parity die, e.g., the Plane 0 page in Die (c × d) − 1 in the figure.

Figure 2.4: Example layout of ECC codewords, logical blocks, and superpage-level parity for superpage n in superblock m. In this example, we assume that a logical block contains two codewords. Reproduced from [32].

The SSD controller invokes superpage-level parity when an ECC failure occurs during a host software (e.g., OS, file system) access to the SSD. The host software accesses data at the granularity of a logical block (LB), which is indexed by a logical block address (LBA). Typically, an LB is 4 kB in size, and consists of several ECC codewords (which are usually 512 B to 2 kB in size) stored consecutively within a flash memory page, as shown in Figure 2.4. During the LB access, a read failure can occur for one of two reasons. First, it is possible that the LB data is stored within a hidden GBB (i.e., a GBB that has not yet been detected and excluded by the bad block manager). The probability of storing data in a hidden GBB is quantified as P_HGBB. Note that because bad block management successfully identifies and excludes most GBBs, P_HGBB is much lower than the total fraction of GBBs within an SSD. Second, it is possible that at least one ECC codeword within the LB has failed (i.e., the codeword contains an error that cannot be corrected by ECC). The probability that a codeword fails is P_ECFR (see Section 2.1.3). For an LB that contains K ECC codewords, we can model P_LBFail, the overall probability that an LB access fails (i.e., the rate at which superpage-level parity needs to be invoked), as:

    P_{LBFail} = P_{HGBB} + [1 - P_{HGBB}] \times [1 - (1 - P_{ECFR})^{K}]    (2.2)

In Equation 2.2, P_LBFail consists of (1) the probability that an LB is inside a hidden GBB (left side of the addition); and (2) for an LB that is not in a hidden GBB, the probability of any codeword failing (right side of the addition). When a read failure occurs for an LB in plane p, the SSD controller reconstructs the data using the other LBs in the same superpage. To do this, the controller reads the LBs stored in plane p in the other (c × d) − 1 dies of the superpage, including the LBs in the parity die. The controller then XORs all of these LBs together, which retrieves the data that was originally stored in the LB whose access failed. In order to correctly recover the failed data, all of the LBs from the (c × d) − 1 dies must be correctly read.

The overall superpage-level parity failure probability P_parity (i.e., the probability that more than one LB contains a failure) for an SSD with c chips of flash memory, with d dies per chip, can be modeled as [262]:

    P_{parity} = P_{LBFail} \times [1 - (1 - P_{LBFail})^{(c \times d) - 1}]    (2.3)

Thus, by designating one of the dies to contain parity information (in a fashion similar to RAID 4 [262]), the SSD can tolerate the complete failure of the superpage data in one die without experiencing data loss during an LB access.
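The following Python sketch instantiates Equations 2.2 and 2.3 with illustrative (not measured) parameter values, to show how rarely superpage-level parity itself is expected to fail.

    # Equations 2.2 and 2.3 with illustrative parameter values; the hidden-GBB
    # probability, codeword failure rate, and geometry below are assumptions.
    def p_lb_fail(p_hgbb: float, p_ecfr: float, K: int) -> float:
        """Equation 2.2: probability that a logical block access fails and parity is invoked."""
        return p_hgbb + (1 - p_hgbb) * (1 - (1 - p_ecfr) ** K)

    def p_parity_fail(p_lbfail: float, c: int, d: int) -> float:
        """Equation 2.3: probability that superpage-level parity reconstruction also fails."""
        return p_lbfail * (1 - (1 - p_lbfail) ** (c * d - 1))

    plb = p_lb_fail(p_hgbb=1e-8, p_ecfr=1e-6, K=2)     # K codewords per 4 kB logical block
    print(plb)                                          # ~2e-6
    print(p_parity_fail(plb, c=8, d=2))                 # orders of magnitude lower than plb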

2.1.4 Design Tradeoffs for Reliability

Several design decisions impact the SSD lifetime (i.e., the duration of time that the SSD can be used within a bounded probability of error without exceeding a given performance overhead). To capture the tradeoff between these decisions and lifetime, SSD manufacturers use the following model:

    \text{Lifetime (Years)} = \frac{PEC \times (1 + OP)}{365 \times DWPD \times WA \times R_{compress}}    (2.4)

In Equation 2.4, the numerator is the total number of full drive writes the SSD can endure (i.e., for a drive with an X-byte capacity, the number of times X bytes of data can be written). The number of full drive writes is calculated as the product of PEC, the total P/E cycle endurance of each flash block (i.e., the number of P/E cycles the block can sustain before its raw error rate exceeds the ECC correction capability), and 1 + OP, where OP is the overprovisioning factor selected by the manufacturer. Manufacturers overprovision the flash drive by providing more physical block addresses, or PBAs, to the SSD controller than the advertised capacity of the drive, i.e., the number of logical block addresses (LBAs) available to the operating system. Overprovisioning improves performance and endurance, by providing additional free space in the SSD so that maintenance operations can take place without stalling host requests. OP is calculated as:

    OP = \frac{\text{PBA count} - \text{LBA count}}{\text{LBA count}}    (2.5)

The denominator in Equation 2.4 is the number of full drive writes per year, which is calculated as the product of days per year (i.e., 365), DWPD, and the ratio between the total size of the data written to flash media and the size of the data sent by the host (i.e., WA × R_compress). DWPD is the number of full disk writes per day (i.e., the number of times per day the OS writes the advertised capacity's worth of data). DWPD is typically less than 1 for read-intensive applications, and could be greater than 5 for write-intensive applications [22]. WA (write amplification) is the ratio between the amount of data written into NAND flash memory by the controller over the amount of data written by the host machine. Write amplification occurs because various procedures (e.g., garbage collection [44, 350]; and remapping-based refresh, Section 3.2.3) in the SSD perform additional writes in the background. For example, when garbage collection selects a block to erase, the pages that are remapped to a new block require background writes. R_compress, or the compression ratio, is the ratio between the size of the compressed data and the size of the uncompressed data, and is a function of the entropy of the stored data and the efficiency of the compression algorithms employed in the SSD controller. In Equation 2.4, DWPD and R_compress are largely determined by the workload and data compressibility, and cannot be changed to optimize flash lifetime. For controllers that do not implement compression, we set R_compress to 1.
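Before turning to the individual tradeoffs, the following Python sketch instantiates Equations 2.4 and 2.5; all parameter values used here are illustrative.

    # Worked instance of Equations 2.4 and 2.5 with illustrative parameter values.
    def overprovisioning(pba_count: int, lba_count: int) -> float:
        """Equation 2.5: overprovisioning factor."""
        return (pba_count - lba_count) / lba_count

    def lifetime_years(pec: int, op: float, dwpd: float, wa: float,
                       r_compress: float = 1.0) -> float:
        """Equation 2.4: drive lifetime in years."""
        return (pec * (1 + op)) / (365 * dwpd * wa * r_compress)

    op = overprovisioning(pba_count=600_000_000, lba_count=500_000_000)   # 20% OP
    print(op)                                                             # 0.2
    # 3,000 P/E cycles, 1 full drive write per day, write amplification of 4, no compression:
    print(lifetime_years(pec=3000, op=op, dwpd=1.0, wa=4.0))              # ~2.47 years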

However, the SSD controller can trade off other parameters between one another to optimize flash lifetime. We discuss the most salient tradeoffs next.

Tradeoff Between Write Amplification and Overprovisioning. As mentioned in Section 2.1.3, due to the granularity mismatch between flash erase and program operations, garbage collection occasionally remaps remaining valid pages from a selected block to a new flash block, in order to avoid block-internal fragmentation. This remapping causes additional flash memory writes, leading to write amplification. In an SSD with more overprovisioned capacity, the amount of write amplification decreases, as the blocks selected for garbage collection are older and tend to have fewer valid pages. For a greedy garbage collection algorithm and a random-access workload, the correlation between WA and OP can be calculated [75, 108], as shown in Figure 2.5. In an ideal SSD, both WA and OP should be minimal, i.e., WA = 1 and OP = 0%, but in reality there is a tradeoff between these parameters: when one increases, the other decreases. As Figure 2.5 shows, WA can be reduced by increasing OP, and with an infinite amount of OP, WA converges to 1. However, the reduction of WA is smaller when OP is large, resulting in diminishing returns.

Figure 2.5: Relationship between write amplification (WA) and the overprovisioning factor (OP). Reproduced from [32].

In reality, the relationship between WA and OP is also a function of the storage space utilization of the SSD. When the storage space is not fully utilized, many more pages are available, reducing the need to invoke garbage collection, and thus WA can approach 1 without the need for a large amount of OP.

Tradeoff Between P/E Cycle Endurance and Overprovisioning. PEC and OP can be traded against each other by adjusting the amount of redundancy used for error correction, such as ECC and superpage-level parity (as discussed in Section 2.1.3). As the error correction capability increases, PEC increases because the SSD can tolerate the higher raw bit error rate that occurs at a higher P/E cycle count. However, this comes at a cost of reducing the amount of space available for OP, since a stronger error correction capability requires higher redundancy (i.e., more space). Table 2.1 shows the corresponding OP for four different error correction configurations for an example SSD with 2.0 TB of advertised capacity and 2.4 TB (20% extra) of physical space. In this table, the top two configurations use ECC-1 with a coding rate of 0.93, and the bottom two configurations use ECC-2 with a coding rate of 0.90, which has higher redundancy than ECC-1. Thus, the ECC-2 configurations have a lower OP than the top two. ECC-2, with its higher redundancy, can correct a greater number of raw bit errors, which in turn increases the P/E cycle endurance of the SSD.

Similarly, the two configurations with superpage-level parity have a lower OP than configurations without superpage-level parity, as parity uses a portion of the overprovisioned space to store the parity bits.

Table 2.1: Tradeoff between strength of error correction configuration and amount of SSD space left for overprovisioning.

Error Correction Configuration                Overprovisioning Factor
ECC-1 (0.93), no superpage-level parity       11.6%
ECC-1 (0.93), with superpage-level parity      8.1%
ECC-2 (0.90), no superpage-level parity        8.0%
ECC-2 (0.90), with superpage-level parity      4.6%

When the ECC correction strength is increased, the amount of overprovisioning in the SSD decreases, which in turn increases the amount of write amplification that takes place. Manufacturers must find and use the correct tradeoff between ECC correction strength and the overprovisioning factor, based on which of the two is expected to provide greater reliability for the target applications of the SSD.
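The OP values in Table 2.1 can be approximated with a back-of-the-envelope calculation, as in the following Python sketch. Treating ECC redundancy and superpage-level parity as simple reductions of usable physical space is a simplification, and the assumption that parity consumes 1/32 of that space (e.g., one parity die out of 32) is hypothetical, chosen only so that the numbers line up with the table.

    # Rough reproduction of Table 2.1 for the example 2.0 TB / 2.4 TB SSD. The
    # parity_fraction of 1/32 is a hypothetical assumption, not a vendor value.
    ADVERTISED_TB, PHYSICAL_TB = 2.0, 2.4

    def op_factor(coding_rate: float, parity_fraction: float = 0.0) -> float:
        usable = PHYSICAL_TB * coding_rate * (1 - parity_fraction)
        return (usable - ADVERTISED_TB) / ADVERTISED_TB

    for rate, parity, label in [(0.93, 0.0,  "ECC-1, no parity"),
                                (0.93, 1/32, "ECC-1, with parity"),
                                (0.90, 0.0,  "ECC-2, no parity"),
                                (0.90, 1/32, "ECC-2, with parity")]:
        print(f"{label:20s} OP = {op_factor(rate, parity):5.1%}")
    # ECC-1, no parity     OP = 11.6%
    # ECC-1, with parity   OP =  8.1%
    # ECC-2, no parity     OP =  8.0%
    # ECC-2, with parity   OP =  4.6%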

2.2 NAND Flash Memory Basics

A number of underlying properties of the NAND flash memory used within the SSD affect SSD management, performance, and reliability [12, 16, 219]. In this section, we present a primer on NAND flash memory and its operation, to prepare the reader for understanding our further discussion on error sources (Section 3.1) and mitigation mechanisms (Section 3.2). Recall from Section 2.1.1 that within each plane, flash cells are organized as multiple 2D arrays known as flash blocks, each of which contains multiple pages of data, where a page is the granularity at which the host reads and writes data. We first discuss how data is stored in NAND flash memory. We then introduce the three basic operations supported by NAND flash memory: read, program, and erase.

2.2.1 Storing Data in a Flash Cell

NAND flash memory stores data as the threshold voltage of each flash cell, which is made up of a floating gate transistor. Figure 2.6 shows a cross section of a floating gate transistor. On top of a flash cell is the control gate (CG) and below is the floating gate (FG). The floating gate is insulated on both sides, on top by an inter-poly oxide layer and at the bottom by a tunnel oxide layer. As a result, the electrons programmed on the floating gate do not discharge even when flash memory is powered off.

Figure 2.6: Flash cell (i.e., floating gate transistor) cross section. Reproduced from [32].

For single-level cell (SLC) NAND flash, each flash cell stores a 1-bit value, and can be programmed to one of two threshold voltage states, which we call the ER and P1 states. Multi-level cell (MLC) NAND flash stores a 2-bit value in each cell, with four possible states (ER, P1, P2, and P3), and triple-level cell (TLC) NAND flash stores a 3-bit value in each cell with eight possible states (ER, P1–P7). Each state represents a different value, and is assigned a voltage window within the range of all possible threshold voltages. Due to variation across program operations, the threshold voltage of flash cells programmed to the same state is initially distributed across this voltage window. Figure 2.7 illustrates the threshold voltage distribution of MLC (top) and TLC (bottom) NAND flash memories. The x-axis shows the threshold voltage (Vth), which spans a certain voltage range. The y-axis shows the probability density of each voltage level across all flash memory cells. The threshold voltage distribution of each threshold voltage state can be represented as a probability density curve that spans over the state's voltage window.

Figure 2.7: Threshold voltage distribution of MLC (top) and TLC (bottom) NAND flash memory. Reproduced from [32].

We label the distribution curve for each state with the name of the state and a corresponding bit value. Note that some manufacturers may choose to use a different mapping of values to different states. The bit values of adjacent states are separated by a Hamming distance of 1. We break down the bit values for MLC into the most significant bit (MSB) and least significant bit (LSB), while TLC is broken down into the MSB, the center significant bit (CSB), and the LSB.

27 The boundaries between neighboring threshold voltage windows, which are labeled as Va, Vb, and Vc for the MLC distribution in Figure 2.7, are referred to as read reference voltages. These voltages are used by the SSD controller to identify the voltage window (i.e., state) of each cell upon reading the cell.

2.2.2 Flash Block Design

Figure 2.8 shows the high-level internal organization of a NAND flash memory block. Each block contains multiple rows of cells (typically 128–512 rows). Each row of cells is connected together by a common wordline (WL, shown horizontally in Figure 2.8), typically spanning 32K–64K cells. All of the cells along the wordline are logically combined to form a page in an SLC NAND flash memory. For an MLC NAND flash memory, the MSBs of all cells on the same wordline are combined to form an MSB page, and the LSBs of all cells on the wordline are combined to form an LSB page. Similarly, a TLC NAND flash memory logically combines the MSBs on each wordline to form an MSB page, the CSBs on each wordline to form a CSB page, and the LSBs on each wordline to form an LSB page. In MLC NAND flash memory, each flash block contains 256–1024 flash pages, each of which is typically 8 kB to 16 kB in size.
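As a simplified illustration of this page organization (not taken from any particular device), the sketch below splits the 2-bit values stored on one MLC wordline into an MSB page and an LSB page.

```python
# Simplified illustration of page formation on an MLC wordline.
# Each cell on the wordline stores (MSB, LSB); the MSBs of all cells form the
# MSB page and the LSBs form the LSB page. Sizes here are illustrative only.
from typing import List, Tuple

def split_mlc_wordline(cells: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Return (msb_page, lsb_page) for a list of (msb, lsb) cell values."""
    msb_page = [msb for msb, _ in cells]
    lsb_page = [lsb for _, lsb in cells]
    return msb_page, lsb_page

# Example: a toy wordline with 8 cells (real wordlines span 32K-64K cells).
wordline = [(1, 1), (0, 1), (0, 0), (1, 0), (1, 1), (0, 0), (1, 0), (0, 1)]
msb_page, lsb_page = split_mlc_wordline(wordline)
```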


Figure 2.8: Internal organization of a flash block. Reproduced from [32].

Within a block, all cells in the same column are connected in series to form a bitline (BL, shown vertically in Figure 2.8) or string. All cells in a bitline share a common ground (GND) on one end, and a common sense amplifier (SA) on the other for reading the threshold voltage of one of the cells when decoding data. Bitline operations are controlled by turning the ground select line (GSL) and string select line (SSL) transistor of each bitline on or off. The SSL transistor is used to enable operations on a bitline, and the GSL transistor is used to connect the bitline to ground during a read operation [223]. The use of a common bitline across multiple rows reduces the amount of circuit area required for read and write operations to a block, improving storage density.

2.2.3 Read Operation

Data can be read from NAND flash memory by applying read reference voltages onto the control gate of each cell, to sense the cell's threshold voltage. To read the value stored in a single-level cell, we need to distinguish only the state with a bit value of 1 from the state with a bit value of 0. This requires us to use only a single read reference voltage. Likewise, to read the LSB of a multi-level cell, we need to distinguish only the states where the LSB value is 1 (ER and P1) from the states where the LSB value is 0 (P2 and P3), which we can do with a single read reference voltage (Vb in the top half of Figure 2.7). To read the MSB page, we need to distinguish the states with an MSB value of 1 (ER and P3) from those with an MSB value of 0 (P1 and P2). Therefore, we need to determine whether the threshold voltage of the cell falls between Va and Vc, requiring us to apply each of these two read reference voltages (which can require up to two consecutive read operations) to determine the MSB. Reading data from a triple-level cell is similar to the data read procedure for a multi-level cell. Reading the LSB for TLC again requires applying only a single read reference voltage (Vd in the bottom half of Figure 2.7). Reading the CSB requires two read reference voltages to be applied, and reading the MSB requires four read reference voltages to be applied. As Figure 2.8 shows, cells from multiple wordlines (WL in the figure) are connected in series on a shared bitline (BL) to the sense amplifier, which drives the value that is being read from the block onto the memory channel for the plane. In order to read from a single cell on the bitline, all of the other cells (i.e., unread cells) on the same bitline must be switched on to allow the value that is being read to propagate through to the sense amplifier. The NAND flash memory achieves this by applying the pass-through voltage onto the wordlines of the unread cells, as shown in Figure 2.9a. When the pass-through voltage (i.e., the maximum possible threshold voltage Vpass) is applied to a flash cell, the source and the drain of the cell transistor are connected, regardless of the voltage of the floating gate. Modern flash memories guarantee that all unread cells are passed through to minimize errors during the read operation [35].
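The read decision logic described above can be summarized with a small illustrative model. The sketch below is not controller firmware, and the reference voltage values are arbitrary placeholders: it decodes the LSB of an MLC cell with a single comparison against Vb, and the MSB with two comparisons against Va and Vc, following Figure 2.7.

```python
# Illustrative model of MLC page reads using the read reference voltages
# from Figure 2.7 (Va < Vb < Vc). The voltage values are assumed examples.
VA, VB, VC = 1.0, 2.0, 3.0   # placeholder read reference voltages

def read_lsb(vth: float) -> int:
    """LSB read needs a single comparison against Vb:
    ER/P1 store LSB = 1, P2/P3 store LSB = 0."""
    return 1 if vth < VB else 0

def read_msb(vth: float) -> int:
    """MSB read needs two comparisons (against Va and Vc):
    ER and P3 store MSB = 1, while P1 and P2 (Va <= Vth < Vc) store MSB = 0."""
    return 0 if VA <= vth < VC else 1

# Example: a cell programmed to P1 (value 01) reads back MSB = 0, LSB = 1.
vth_p1 = 1.5
assert (read_msb(vth_p1), read_lsb(vth_p1)) == (0, 1)
```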


Figure 2.9: Voltages applied to flash cell transistors on a bitline to perform (a) read, (b) program, and (c) erase operations. Reproduced from [32].

2.2.4 Program and Erase Operations

The threshold voltage of a floating gate transistor is controlled through the injection and ejection of electrons through the tunnel oxide of the transistor, which is enabled by the Fowler-Nordheim (FN) tunneling effect [12, 81, 263]. The tunneling current (JFN) [16, 263] can be modeled as:

JFN = αFN Eox² exp(−βFN / Eox)   (2.6)

In Equation 2.6, αFN and βFN are constants, and Eox is the electric field strength in the tunnel oxide. As Equation 2.6 shows, JFN is exponentially correlated with Eox.

During a program operation, electrons are injected into the floating gate of the flash cell from the substrate when applying a high positive voltage to the control gate (see Figure 2.6 for a diagram of the flash cell). The pass-through voltage is applied to all of the other cells on the same bitline as the cell that is being programmed, as shown in Figure 2.9b. When data is programmed, charge is transferred into the floating gate through FN tunneling by repeatedly pulsing the programming voltage, in a procedure known as incremental step-pulse programming (ISPP) [12, 219, 308, 327]. During ISPP, a high programming voltage (Vprogram) is applied for a very short period, which we refer to as a step-pulse. ISPP then verifies the current voltage of the cell using the voltage Vverify. ISPP repeats the process of applying a step-pulse and verifying the voltage until the cell reaches the desired target voltage. In modern all-bitline NAND flash memory, all flash cells in a single wordline are programmed concurrently. During programming, when a cell along the wordline reaches its target voltage but other cells have yet to reach their target voltage, ISPP inhibits programming pulses to the cell by turning off the SSL transistor of the cell's bitline.

In SLC NAND flash and older MLC NAND flash, one-shot programming is used, where all of the ISPP step-pulses required to program a cell are applied back to back until all cells in the wordline are fully programmed. One-shot programming does not interleave the program operations to a wordline with the program operations to another wordline. In newer MLC NAND flash, the lack of interleaving between program operations can introduce a significant amount of cell-to-cell program interference on the cells of immediately-adjacent wordlines (see Section 3.1.3). To reduce the impact of program interference, the controller employs two-step programming for sub-40 nm MLC NAND flash [24, 256]: it first programs the LSBs into the erased cells of an unprogrammed wordline, and then programs the MSBs of the cells using a separate program operation [23, 34, 255, 256]. Between the programming of the LSBs and the MSBs, the controller programs the LSBs of the cells in the wordline immediately above [23, 34, 255, 256].

Figure 2.10 illustrates the two-step programming algorithm. In the first step, a flash cell is partially programmed based on its LSB value, either staying in the ER state if the LSB value is 1, or moving to a temporary state (TP) if the LSB value is 0. The TP state has a mean voltage that falls between states P1 and P2. In the second step, the LSB data is first read back into an internal buffer register within the flash chip to determine the cell's current threshold voltage state, and then further programming pulses are applied based on the MSB data to increase the cell's threshold voltage to fall within the voltage window of its final state. Programming in MLC NAND flash is discussed in detail in [34] and [23].
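The ISPP procedure lends itself to a simple behavioral sketch. The code below uses assumed parameters (step size, pulse count limit) and models program inhibition by simply skipping cells that have passed verification; it is a conceptual illustration, not a circuit model.

```python
# Simplified behavioral model of incremental step-pulse programming (ISPP).
# All parameters (step size, pulse limit, target voltages) are assumed.

def ispp_program_wordline(cell_voltages, target_voltages, step=0.2, max_pulses=64):
    """Repeatedly apply a program step-pulse to every cell on the wordline
    that has not yet reached its verify (target) voltage; cells that have
    reached their target are inhibited (modeled here by skipping them)."""
    inhibited = [v >= t for v, t in zip(cell_voltages, target_voltages)]
    for _ in range(max_pulses):
        if all(inhibited):
            break
        for i, (v, t) in enumerate(zip(cell_voltages, target_voltages)):
            if not inhibited[i]:
                cell_voltages[i] = v + step          # apply one step-pulse
        # Verify step: check which cells have now reached their targets.
        inhibited = [v >= t for v, t in zip(cell_voltages, target_voltages)]
    return cell_voltages

# Example: program four cells of one wordline to different target voltages.
voltages = ispp_program_wordline([0.0, 0.0, 0.0, 0.0], [0.4, 1.2, 2.0, 2.8])
```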

Figure 2.10: Two-step programming algorithm for MLC flash. Reproduced from [32].

TLC NAND flash takes a similar approach to the two-step programming of MLC, with a mechanism known as foggy-fine programming [182], which is illustrated in Figure 2.11. The flash cell is first partially programmed based on its LSB value, using a binary programming step in which very large ISPP step-pulses are used to significantly increase the voltage level. Then, the flash cell is partially programmed again based on its CSB and MSB values to a new set of temporary states (these steps are referred to as foggy programming, which uses smaller ISPP step-pulses than binary programming). Due to the higher potential for errors during TLC programming as a result of the narrower voltage windows, all of the programmed bit values are buffered after the binary and foggy programming steps into SLC buffers that are reserved in each chip/plane. Finally, fine programming takes place, where these bit values are read from the SLC buffers, and the smallest ISPP step-pulses are applied to set each cell to its final threshold voltage state. The purpose of this last fine programming step is to fine tune the threshold voltage such that the threshold voltage distributions are tightened (bottom of Figure 2.11).
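As a rough behavioral sketch of this flow (with invented step sizes and target voltages, purely for illustration), the code below models the binary, foggy, and fine passes for one TLC cell and keeps the data in a buffer between passes, mirroring the role of the SLC buffers described above.

```python
# Rough behavioral sketch of foggy-fine programming for one TLC cell.
# All voltages and step sizes are invented for illustration only.

TLC_TARGETS = {  # assumed final target voltage per (MSB, CSB, LSB) value
    (1, 1, 1): 0.0, (0, 1, 1): 0.8, (0, 0, 1): 1.6, (1, 0, 1): 2.4,
    (1, 0, 0): 3.2, (0, 0, 0): 4.0, (0, 1, 0): 4.8, (1, 1, 0): 5.6,
}

def ispp(vth, target, step):
    """Apply ISPP step-pulses of the given size until vth reaches the target."""
    while vth < target:
        vth += step
    return vth

def foggy_fine_program(bits):
    vth = 0.0                          # cell starts in the erased (ER) state
    slc_buffer = bits                  # data parked in an SLC buffer page
    # 1. Binary pass: very large pulses, driven by the LSB only.
    if bits[2] == 0:
        vth = ispp(vth, 2.8, step=0.6)
    # 2. Foggy pass: smaller pulses toward a temporary state near the target.
    vth = ispp(vth, max(TLC_TARGETS[bits] - 0.4, 0.0), step=0.3)
    # 3. Fine pass: the buffered bits are read back, and the smallest pulses
    #    tighten the cell onto its final target voltage.
    vth = ispp(vth, TLC_TARGETS[slc_buffer], step=0.05)
    return vth

vth = foggy_fine_program((1, 0, 1))    # program value 101 (state P3 in Fig. 2.7)
```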

Figure 2.11: Foggy-fine programming algorithm for TLC flash. Reproduced from [32].

Though programming sets a flash cell to a specific threshold voltage using programming pulses, the voltage of the cell can drift over time after programming. When no external voltage is applied to any of the electrodes (i.e., CG, source, and drain) of a flash cell, an electric field still exists between the FG and the substrate, generated by the charge present in the FG. This is called the intrinsic electric field [16], and it generates stress-induced leakage current (SILC) [12, 73, 242], a weak tunneling current that leaks charge away from the FG. As a result, the voltage that a cell is programmed to may not be the same as the voltage read for that cell at a subsequent time.

In NAND flash, a cell can be reprogrammed with new data only after the existing data in the cell is erased. This is because ISPP can only increase the voltage of the cell. The erase operation resets the threshold voltage state of all cells in the flash block to the ER state. During an erase operation, electrons are ejected from the FG of the flash cell into the substrate by inducing a high negative voltage on the cell transistor. The negative voltage is induced by setting the CG of the transistor to GND, and biasing the transistor body (i.e., the substrate) to a high voltage (Verase), as shown in Figure 2.9c. Because all cells in a flash block share a common transistor substrate (i.e., the bodies of all transistors in the block are connected together), a flash block must be erased in its entirety [223].

Chapter 3

Flash Memory Reliability: Background and Related Work

In this chapter, we provide the background and related work on reliability issues in NAND flash memory. First, we provide a thorough introduction to NAND flash memory error characteristics, drawing on findings from prior work (Section 3.1). Second, we survey state-of-the-art techniques in modern SSDs that mitigate NAND flash memory errors (Section 3.2). Third, we introduce state-of-the-art error correction and recovery techniques that tolerate NAND flash memory errors (Section 3.3). Fourth, we introduce the state-of-the-art 3D NAND flash memory technology and the emerging reliability issues for 3D NAND devices using this technology (Section 3.4). Fifth, we discuss similar errors in other memory technologies and how we tolerate them in modern computing systems (Section 3.5).

3.1 NAND Flash Memory Error Characteristics

Each block in NAND flash memory is used in a cyclic fashion, as is illustrated by the observed raw bit error rates seen over the lifetime of a flash memory block in Figure 3.1. At the beginning of a cycle, known as a program/erase (P/E) cycle, an erased block is opened (i.e., selected for programming). Data is then programmed into the open block one page at a time. After all of the pages are programmed, the block is closed, and none of the pages can be reprogrammed until the whole block is erased. At any point before erasing, read operations can be performed on a valid programmed page (i.e., a page containing data that has not been modified by the host). A page is marked as invalid when the data stored at that page's logical address by the host is modified. As ISPP can only inject more charge into the floating gate but cannot remove charge from the gate, it is not possible to modify data to a new arbitrary value in place within existing NAND flash memories. Once the block is erased, the P/E cycling behavior repeats until the block is worn out (i.e., the block can no longer avoid data loss over the course of the minimum data retention period guaranteed by the manufacturer). Although the 5x-nm (i.e., 50 nm to 59 nm) generation of MLC NAND flash could endure ∼10,000 P/E cycles per block before being worn out, modern 1x-nm (i.e., 15 nm to 19 nm) MLC and TLC NAND flash can endure only ∼3,000 and ∼1,000 P/E cycles per block, respectively [157, 205, 260, 355].
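A minimal sketch of this block lifecycle is shown below; the page count and endurance limit are placeholders rather than values for any particular device. It captures the constraints described above: a programmed page cannot be updated in place, the whole block must be erased before reprogramming, and the block is retired once it exceeds its rated P/E cycle endurance.

```python
# Minimal sketch of the P/E cycling lifecycle of one flash block.
# PAGES_PER_BLOCK and ENDURANCE are placeholders, not real device values.
PAGES_PER_BLOCK = 256
ENDURANCE = 3000          # rated P/E cycles for a hypothetical MLC part

class FlashBlock:
    def __init__(self):
        self.pe_cycles = 0
        self.programmed = [False] * PAGES_PER_BLOCK

    def program_page(self, page: int, data: bytes) -> None:
        # In-place update is impossible: ISPP can only add charge, so a
        # programmed page cannot be reprogrammed until the block is erased.
        if self.programmed[page]:
            raise RuntimeError("page already programmed; erase block first")
        self.programmed[page] = True

    def erase(self) -> None:
        # Erase resets every cell in the block and consumes one P/E cycle.
        if self.pe_cycles >= ENDURANCE:
            raise RuntimeError("block worn out; retire it")
        self.programmed = [False] * PAGES_PER_BLOCK
        self.pe_cycles += 1

blk = FlashBlock()
blk.program_page(0, b"host data")   # reads are allowed at any point before erase
blk.erase()                         # whole-block erase starts the next P/E cycle
```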


Figure 3.1: Pictorial depiction of errors accumulating within a NAND flash block as P/E cycle count increases. Reproduced from [32].

As shown in Figure 3.1, several different types of errors can be introduced at any point during the P/E cycling process: P/E cycling errors, program errors, errors due to cell-to-cell program interference, data retention errors, and errors due to read disturb. As discussed in Section 2.2.1, the threshold voltage of flash cells programmed to the same state is distributed across a voltage window due to variation across program operations and across different flash cells. Several types of errors introduced during the P/E cycling process, such as data retention and read disturb, cause the threshold voltage distribution of each state to shift and widen. Due to the shift and widening, the tails of the distributions of each state can enter the margin that originally existed between each of the two neighboring states’ distributions. Thus, the threshold voltage distributions of different states can start overlapping, as shown in Figure 3.2. When the distributions overlap with each other, the read reference voltages can no longer correctly identify the state of some flash cells in the overlapping region, leading to raw bit errors during a read operation.

Figure 3.2: Threshold voltage distribution shifts and widening can cause the distributions of two neighboring states to overlap with each other (compare to Figure 2.7), leading to read errors. Reproduced from [32].
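To make the link between distribution overlap and raw bit errors concrete, the sketch below idealizes two neighboring states as Gaussian distributions (real distributions are not exactly Gaussian, and all numbers here are assumed) and estimates the probability of a misread at a fixed read reference voltage; widening the distributions increases the overlap and therefore the error probability.

```python
# Idealized illustration: model two neighboring threshold voltage states as
# Gaussians and estimate the raw bit error probability at a read reference
# voltage placed between them. All numbers are assumed, not measured.
from math import erf, sqrt

def gaussian_cdf(x: float, mean: float, sigma: float) -> float:
    return 0.5 * (1.0 + erf((x - mean) / (sigma * sqrt(2.0))))

def misread_probability(v_ref, mean_lo, sigma_lo, mean_hi, sigma_hi):
    """P(lower-state cell reads above v_ref) plus P(higher-state cell reads
    below v_ref), assuming the two states are equally likely."""
    tail_lo = 1.0 - gaussian_cdf(v_ref, mean_lo, sigma_lo)   # read too high
    tail_hi = gaussian_cdf(v_ref, mean_hi, sigma_hi)         # read too low
    return 0.5 * (tail_lo + tail_hi)

# Widening the distributions (larger sigma) increases the overlap and the
# resulting raw bit error rate at the same read reference voltage.
fresh = misread_probability(2.0, mean_lo=1.5, sigma_lo=0.10, mean_hi=2.5, sigma_hi=0.10)
worn  = misread_probability(2.0, mean_lo=1.5, sigma_lo=0.25, mean_hi=2.5, sigma_hi=0.25)
assert worn > fresh
```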

In this section, we discuss the causes of each type of error in detail. We later discuss mitigation techniques for these flash memory errors in Section 3.2, and provide procedures to recover in the event of data loss in Section 3.3.

3.1.1 P/E Cycling Errors

A P/E cycling error occurs when either (1) an erase operation fails to reset a cell to the ER state, or (2) a program operation fails to set the cell to the desired target state. P/E cycling errors occur because electrons become trapped in the tunnel oxide after stress from repeated P/E cycles. Errors due to such electron trapping (which we refer to as P/E cycling noise) continue to accumulate over the lifetime of a NAND flash block. This behavior is called wearout, and it refers to the phenomenon where, as more writes are performed to a block, there are a greater number of raw bit errors that must be corrected, exhausting more of the fixed error correction capability of the ECC (see Section 2.1.3). More findings on the nature of wearout and the impact of wearout on NAND flash memory errors and lifetime can be found in prior works from our research group [19, 21, 23, 195]. Recent work studies P/E cycling errors and proposes to change the ER state distribution to securely hide data in flash [365].

3.1.2 Program Errors

Program errors occur when data read directly from the NAND flash array contains errors, and the erroneous values are used to program the new data. Program errors occur in two major cases: (1) partial programming during two-step or foggy-fine programming, and (2) copyback (i.e., when data is copied inside the NAND flash memory during a maintenance operation) [109]. During two-step programming for MLC NAND flash memory (see Figure 2.10), in between the LSB and MSB programming steps of a cell, threshold voltage shifts can occur on the partially-programmed cell. These shifts occur because several other read and program operations to cells in other pages within the same block may take place, causing interference to the partially-programmed cell. Figure 3.3 illustrates how the threshold distribution of the ER state widens and shifts to the right after the LSB value is programmed (step 1 in the figure). The widening and shifting of the distribution causes some cells that were originally partially programmed to the ER state (with an LSB value of 1) to be misread as being in the TP state (with an LSB value of 0) during the second programming step (step 2 in the figure). As shown in Figure 3.3, the misread LSB value leads to a program error when the final cell threshold voltage is programmed [34, 195, 260]. Some cells that should have been programmed to the P1 state (representing the value 01) are instead programmed to the P2 state (with the value 00), and some cells that should have been programmed to the ER state (representing the value 11) are instead programmed to the P3 state (with the value 10).


Figure 3.3: Impact of program errors during two-step programming on cell threshold voltage distribution. Reproduced from [32].

More findings on the nature of program errors and the impact of program errors on NAND flash memory lifetime can be found in prior works from our research group [34, 195].

3.1.3 Cell-to-Cell Program Interference Errors

Program interference refers to the phenomenon where the programming of a flash cell induces errors on adjacent flash cells within a flash block [24, 26, 68, 88, 174]. The interference occurs due to parasitic capacitance coupling between these cells. As a result, when the threshold voltage of an adjacent flash cell increases, the threshold voltage of the victim cell increases as well. The unintended threshold voltage shifts can eventually move a cell into a different state than the one it was originally programmed to, leading to a bit error. Prior works have shown, based on the experimental analysis of modern MLC NAND flash memory chips, that the threshold voltage change of the victim cell can be accurately modeled as a linear combination of the threshold voltage changes of the adjacent cells when they are programmed, using linear regression with least-square-error estimation [24, 26]. The cells that are physically located immediately next to the victim cell (called the immediately-adjacent cells) are the major contributors to the cell-to-cell interference of a victim cell [24]. Figure 3.4 shows the eight immediately-adjacent cells for a victim cell in 2D planar NAND flash memory.


Figure 3.4: Immediately-adjacent cells that can induce program interference on a victim cell that is on wordline N and bitline M. Reproduced from [32].

The amount of interference that program operations to the immediately-adjacent cells can induce on the victim cell is expressed as:

∆Vvictim = ∑X KX ∆VX   (3.1)

where ∆Vvictim is the change in voltage of the victim cell due to cell-to-cell program interference, KX is the coupling coefficient between cell X and the victim cell, and ∆VX is the threshold voltage change of cell X during programming. The coupling coefficient is directly related to the effective capacitance C between cell X and the victim cell, which can be calculated as:

C = εS/d   (3.2)

where ε is the permittivity, S is the effective cell area of cell X that faces the victim cell, and d is the distance between the cells. Of the immediately-adjacent cells, the wordline neighbor cells have the greatest coupling capacitance with the victim cell, as they likely have a large effective facing area to, and a small distance from, the victim cell compared to other surrounding cells.

The coupling coefficient grows as the feature size decreases [24, 26]. As NAND flash memory process technology scales down to smaller feature sizes, cells become smaller and get closer to each other, which increases the effective capacitance between them. As a result, at smaller feature sizes, it is easier for an immediately-adjacent cell to induce program interference on a victim cell. We conclude that (1) the program interference an immediately-adjacent cell induces on a victim cell is primarily determined by the distance between the cells and the immediately-adjacent cell's effective area facing the victim cell; and (2) the wordline neighbor cell causes the highest such interference, based on empirical measurements. More findings on the nature of cell-to-cell program interference and the impact of cell-to-cell program interference on NAND flash memory errors and lifetime can be found in prior works from our research group [19, 24, 26, 34].
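Equations 3.1 and 3.2 can be illustrated with a short sketch. The coupling coefficients and neighbor voltage shifts below are arbitrary placeholders; the point is only the form of the model: the victim's threshold voltage shift is a weighted sum of its neighbors' programming shifts, with the wordline neighbors typically carrying the largest weights.

```python
# Illustration of the linear cell-to-cell interference model (Equation 3.1).
# Coupling coefficients K_X and neighbor voltage changes are placeholders.

def victim_voltage_shift(coupling: dict, delta_v: dict) -> float:
    """delta_V_victim = sum over neighbors X of K_X * delta_V_X."""
    return sum(coupling[x] * delta_v.get(x, 0.0) for x in coupling)

def coupling_coefficient(epsilon: float, facing_area: float, distance: float) -> float:
    """Effective capacitance C = epsilon * S / d (Equation 3.2); the coupling
    coefficient is proportional to C, so it grows as cells shrink and move
    closer together."""
    return epsilon * facing_area / distance

# Wordline neighbors typically have the largest K_X (large facing area,
# small distance); diagonal neighbors the smallest.
K = {"wordline": 0.10, "bitline": 0.04, "diagonal": 0.01}
dV = {"wordline": 2.0, "bitline": 1.5, "diagonal": 2.5}
shift = victim_voltage_shift(K, dV)    # = 0.10*2.0 + 0.04*1.5 + 0.01*2.5
```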

3.1.4 Data Retention Errors

Retention errors are caused by charge leakage over time after a flash cell is programmed, and are the dominant source of flash memory errors, as demonstrated previously [21, 22, 25, 27, 219, 312]. As flash memory process technology scales to smaller feature sizes, the capacitance of a flash cell, and the number of electrons stored on it, decreases. State-of-the-art (i.e., 1x-nm) MLC flash memory cells can store only ∼100 electrons [355]. Gaining or losing several electrons on a cell can significantly change the cell's voltage level and eventually alter its state.

Charge leakage is caused by the unavoidable trapping of charge in the tunnel oxide [27, 173]. The amount of trapped charge increases with the electrical stress induced by repeated program and erase operations, which degrade the insulating property of the oxide. Two failure mechanisms of the tunnel oxide lead to retention loss. Trap-assisted tunneling (TAT) occurs because the trapped charge forms an electrical tunnel, which exacerbates the weak tunneling current, SILC (see Section 2.2.4). As a result of this TAT effect, the electrons present in the floating gate (FG) leak away much faster through the intrinsic electric field. Hence, the threshold voltage of the flash cell decreases over time. As the flash cell wears out with increasing P/E cycles, the amount of trapped charge also increases [27, 173], and so does the TAT effect. At high P/E cycles, the amount of trapped charge is large enough to form percolation paths that significantly hamper the insulating properties of the gate dielectric [27, 73], resulting in retention failure. Charge detrapping, where charge previously trapped in the tunnel oxide is freed spontaneously, can also occur over time [27, 73, 173, 347]. The charge polarity can be either negative (i.e., electrons) or positive (i.e., holes). Hence, charge detrapping can either decrease or increase the threshold voltage of a flash cell, depending on the polarity of the detrapped charge. More findings on the nature of data retention and the impact of data retention behavior on NAND flash memory errors and lifetime can be found in prior works from our research group [19, 21, 22, 25, 27].

3.1.5 Read Disturb Errors

Read disturb is a phenomenon in NAND flash memory where reading data from a flash cell can cause the threshold voltages of other (unread) cells in the same block to shift to a higher value [21, 35, 68, 88, 219, 252, 310]. While a single threshold voltage shift is small, such shifts can accumulate over time, eventually becoming large enough to alter the state of some cells and hence generate read disturb errors.

The failure mechanism of a read disturb error is similar to the mechanism of a normal program operation. A program operation applies a high programming voltage (e.g., +15 V) to the cell to change the cell's threshold voltage to the desired range. Similarly, a read operation applies a high pass-through voltage (e.g., +6 V) to all other cells that share the same bitline with the cell that is being read. Although the pass-through voltage is not as high as the programming voltage, it still generates a weak programming effect on the cells it is applied to [35], which can unintentionally change these cells' threshold voltages. More findings on the nature of read disturb and the impact of read disturb on NAND flash memory errors and lifetime can be found in prior works from our research group [35].

3.1.6 Self-Recovery Effect

NAND flash memory has a limited lifetime because of transistor wearout as a flash cell is repeatedly programmed and erased. After each additional P/E cycle, a greater number of electrons get inadvertently trapped within the flash cell, which changes the threshold voltage of the transistor [32, 33]. This threshold voltage change introduces errors and, thus, reduces the flash lifetime. Some of these inadvertently-trapped electrons gradually escape during the idle time between consecutive P/E cycles, i.e., the dwell time. The escape (i.e., detrapping) of the inadvertently-trapped electrons is known as the self-recovery effect [220], as it partially undoes (i.e., repairs) the wearout of the cell.

The self-recovery effect repairs the damage caused by flash wearout during the time between two P/E cycles, by detrapping some of the inadvertently-trapped charge [50, 176, 220, 224, 337, 338]. In this dissertation, we refer to the delay between consecutive program operations as the dwell time. The amount of repair done by self-recovery is affected by two factors: (1) dwell time and (2) operating temperature.

During the dwell time of a flash cell, a fraction of the charge that was inadvertently trapped in the tunnel oxide is slowly detrapped [220]. The reduction of inadvertently-trapped charge in a cell reduces the number of retention and program variation errors, and thus can extend the NAND flash memory lifetime. For a fixed retention time, a larger dwell time reduces the number of retention errors [220]. A recovery cycle refers to a P/E cycle where the program operation is followed by an extended dwell time. Since 3D NAND flash memory errors are dominated by retention errors [55, 221, 353], reducing the retention error rate by performing recovery cycles can increase flash lifetime significantly.

A higher operating temperature for NAND flash memory increases electron mobility [27, 220]. As a result, a short retention time at high temperature has the same retention loss effect as a longer retention time at room temperature [27], which we call the effective retention time. Similarly, a short dwell time at high temperature has the same self-recovery effect as a longer dwell time at room temperature [220], called the effective dwell time. The equivalence between time elapsed at a certain temperature and the corresponding effective time at room temperature can be modeled using Arrhenius' Law [6, 27, 126, 220]:

AF(T1, T2) = t1 / t2 = exp( (Ea / kB) · (1/T1 − 1/T2) )   (3.3)

In Equation 3.3, AF is the acceleration factor between t1 and t2, where t1 is the retention or dwell time under temperature T1, and t2 is the retention or dwell time under temperature T2. kB is the Boltzmann constant, which is 8.62 × 10⁻⁵ eV/K. Ea is the activation energy, which is a manufacturing-process-dependent constant. For a planar NAND flash memory device, Ea = 1.1 eV [124]. To our knowledge, there is no public literature that reports the value of Ea for 3D NAND flash memory.

Prior work has proposed idealized circuit-level models for the self-recovery (or self-healing) effect [224, 337], demonstrating significant opportunities for using the self-recovery effect to improve flash reliability and lifetime. Based on assumptions about how the self-recovery effect works, prior work has also proposed techniques that exploit this effect to improve flash lifetime, such as heal-leveling [45], write throttling [176], and heat-accelerated self-recovery [337]. However, these previous results are not yet convincing enough to show that the self-recovery effect can successfully improve flash lifetime on real devices, because they lack real experimental data and evidence supporting the self-recovery effect on modern flash devices. Our characterization in Chapter 7 is the first to demonstrate and comprehensively evaluate the benefit of the self-recovery effect using experimental data from real 3D NAND flash memory chips.
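As a concrete use of Equation 3.3, the sketch below computes the acceleration factor between room temperature and a higher operating temperature, and converts a retention (or dwell) time at the higher temperature into its effective room-temperature equivalent. The activation energy is the planar NAND value cited above and serves only as a placeholder for 3D NAND, whose Ea is not publicly reported.

```python
# Acceleration factor from Arrhenius' Law (Equation 3.3).
from math import exp

K_B = 8.62e-5   # Boltzmann constant in eV/K
E_A = 1.1       # activation energy in eV (planar NAND [124]; placeholder for 3D NAND)

def acceleration_factor(t1_kelvin: float, t2_kelvin: float, ea: float = E_A) -> float:
    """AF(T1, T2) = t1 / t2 = exp((Ea / kB) * (1/T1 - 1/T2))."""
    return exp((ea / K_B) * (1.0 / t1_kelvin - 1.0 / t2_kelvin))

# Example: one hour of retention (or dwell) at 85 degrees C corresponds to a
# much longer effective time at room temperature (25 degrees C).
af = acceleration_factor(273.15 + 25.0, 273.15 + 85.0)
effective_room_temp_hours = 1.0 * af
```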

3.1.7 Large-Scale Studies on SSD Errors

The error characterization studies we have discussed so far examine the susceptibility of real NAND flash memory devices to specific error sources, by conducting controlled experiments on individual flash devices in controlled environments. To examine the aggregate effect of these error sources on flash devices that operate in the field, several recent studies have analyzed the reliability of SSDs deployed at a large scale (e.g., hundreds of thousands of SSDs) in production data centers [213, 241, 285]. Unlike the controlled low-level error characterization studies discussed in Sections 3.1.1 through 3.1.5, these large-scale studies analyze the observed errors and error rates in an uncontrolled manner, i.e., based on real data center workloads operating at field conditions (as opposed to carefully controlling access patterns and operating conditions). As such, these large-scale studies can study flash memory behavior and reliability using only a black-box approach, where they are able to access only the registers used by the SSD to record select statistics. Because of this, their conclusions are usually correlational in nature, as opposed to identifying the underlying causes behind the observations. On the other hand, these studies incorporate the effects of a real system, including the system software stack and real workloads [213] and real operational conditions in data centers, on the flash memory devices, which is not present in the controlled small-scale studies.

These recent large-scale studies have made a number of observations across large sets of SSDs employed in the data centers of large internet companies: Facebook [213], Google [285], and Microsoft [241]. We highlight six key observations from these studies about the SSD failure rate, which is the fraction of SSDs that have experienced at least one uncorrectable error.

First, the number of uncorrectable errors observed varies significantly for each SSD. Figure 3.5 shows the distribution of uncorrectable errors per SSD across a large set of SSDs used by Facebook. The distributions are grouped into six different platforms that are deployed in Facebook's data center.1 For every platform, prior work observes that the top 10% of SSDs, when sorted by their uncorrectable error count, account for over 80% of the total uncorrectable errors observed across all SSDs for that platform. Prior work finds that the distribution of uncorrectable errors across all SSDs belonging to a platform follows a Weibull distribution, which is shown using a solid black line in Figure 3.5.

1Each platform has a different combination of SSDs, host controller interfaces, and workloads. The six platforms are described in detail in [213].

Figure 3.5: Distribution of uncorrectable errors across SSDs used in Facebook's data centers. Reproduced from [213].

Second, the SSD failure rate does not increase monotonically with the P/E cycle count. Instead, prior work observes several distinct periods of reliability, as illustrated pictorially and abstractly in Figure 3.6, which is based on data obtained from analyzing errors in SSDs used in Facebook's data centers [213]. The failure rate increases when the SSDs are relatively new (shown as the early detection period in Figure 3.6), as the SSD controller identifies unreliable NAND flash cells during the initial read and write operations to the devices and removes them from the address space (see Section 2.1.3). As the SSDs are used more, they enter the early failure period, where failures are less likely to occur. When the SSDs approach the end of their lifetime (useful life/wearout in the figure), the failure rate increases again, as more cells become unreliable due to wearout. Figure 3.7 shows how the measured failure rate changes as more writes are performed to the SSDs (i.e., how real data collected from Facebook's SSDs corresponds to the pictorial depiction in Figure 3.6) for the same six platforms shown in Figure 3.5. Prior work observes that the failure rates in each platform exhibit the distinct periods that are illustrated in Figure 3.6. For example, let us consider the SSDs in Platforms A and B, which have more data written to their cells than SSDs in other platforms. Prior work observes from Figure 3.7 that for SSDs in Platform A, there is an 81.7% increase from the failure rate during the early detection period to the failure rate during the wearout period [213].

Figure 3.6: Pictorial and abstract depiction of the pattern of SSD failure rates observed in real SSDs operating in a modern data center. An SSD fails at different rates during distinct periods throughout the SSD lifetime. Reproduced from [213].

Figure 3.7: SSD failure rate vs. the amount of data written to the SSD. The three periods of failure rates, shown pictorially and abstractly in Figure 3.6, are annotated on each graph: (1) early detection, (2) early failure, and (3) useful life/wearout. Reproduced from [213].

Third, the raw bit error rate grows with the age of the device even if the P/E cycle count is held constant, indicating that mechanisms such as silicon aging likely contribute to the error rate [241].

Fourth, the observed failure rate of SSDs has been noted to be significantly higher than the failure rates specified by the manufacturers [285].

Fifth, higher operating temperatures can lead to higher failure rates, but modern SSDs employ throttling techniques that reduce the access rates to the underlying flash chips, which can greatly reduce the negative reliability impact of higher temperatures [213]. For example, Figure 3.8 shows the SSD failure rate as the SSD operating temperature varies, for SSDs from the same six platforms shown in Figure 3.5 [213]. Prior work observes that at an operating temperature range of 30 °C to 40 °C, SSDs either (1) have similar failure rates across the different temperatures, or (2) experience slight increases in the failure rate as the temperature increases. As the temperature increases beyond 40 °C, the SSDs fall into three categories: (1) temperature-sensitive with increasing failure rate (Platforms A and B), (2) less temperature-sensitive (Platforms C and E), and (3) temperature-sensitive with decreasing failure rate (Platforms D and F). There are two factors that affect the temperature sensitivity of each platform: (1) some, but not all, of the platforms employ techniques to throttle SSD activity at high operating temperatures to reduce the failure rate (e.g., Platform D); and (2) the platform configuration (e.g., the number of SSDs in each machine, system airflow) can shorten or prolong the effects of higher operating temperatures.

Figure 3.8: SSD failure rate vs. operating temperature. Reproduced from [213].

Sixth, while SSD failure rates are higher than specified by the manufacturers, the overall occurrence of uncorrectable errors is lower than expected [213] because (1) effective bad block management policies (see Section 2.1.3) are implemented in SSD controllers; and (2) certain types of error sources, such as read disturb [213, 241] and incomplete erase operations [241], have yet to become a major source of uncorrectable errors at the system level.
iningining thisining figure this figure this confirms figure confirms that confirms that Platforms that Platforms Platforms A and A B, and A where B, and where B, no where no no machinesmachinesmachines or few or machines few or machines few machineshave have been have been throttled, been throttled, throttled, exhibit exhibit behav- exhibit behav-5.2 behav-5.2 Bus5.2 Bus Power Bus Power Consumption Power Consumption Consumption ior thatior that isior typical3.2 that is typical is for Errortypical SSDs for SSDs withoutfor SSDs Mitigation without much without much preventative much preventative preventative action action action againstagainst temperatureagainst temperature temperature increases. increases. increases. In these In these platforms, In these platforms, platforms, as temper- as temper- as temper-AccordingAccordingAccording to the to PCIe the to PCIe the standard, PCIe standard, standard, the nominal the nominal the bus nominal power bus power bus power consumptionconsumptionconsumption for the for PCIe the for PCIe the4 PCIeSSDs4 SSDs that4 SSDs that we analyze that we analyze we is analyze 10 is W 10 is W 10 W atureature increases,ature increases, increases, failure failure rate failure ofrate SSDs ofrate SSDs increases. of SSDs increases. increases. × × × In contrastIn contrastSeveralIn to contrast Platforms to different Platforms to Platforms A and Atypes B, and Platforms A B, and of Platforms errors B, Platforms C and can C E, and occur which C E, and which E, in(regardless which NAND(regardless(regardless offlash PCIe of memory, PCIe version of PCIe version 1 versionasor 2).1 we or Our 2).1 described or infrastructureOur 2). infrastructureOur in infrastructure Sec- allows allows allows are lessare temperature-sensitive, lessaretion temperature-sensitive, less 3.1. temperature-sensitive, As NAND throttle flash throttle memorytheir throttle their SSDs their continues SSDs more SSDs more ag- more to ag-us scale to ag-us measure to tous measure smaller to the measure average the technology average the amount average amount nodes, of amount power of the power consumed of magnitude power consumed consumed on the on the on the gressivelygressivelygressively acrossof these across a range across errors a range of a temperatures. hasrange of temperatures. been of temperatures. increasing From From Figure [205,From Figure 10 Figure we260, 10 wePCIe 354]. 10 wePCIe bus This,PCIe by bus SSDs. by in bus SSDs. turn, by As SSDs. 
power uses As power As consumption up power the consumption limited consumption in servers error in servers in can cor- servers lead can lead can lead can seecan that seecan that throttled see that throttled throttledSSDs SSDs have SSDs have lower have lower failure lower failure rates failure rates (in Plat- rates (in Plat- (into Plat- increasedto increasedto increased electricity electricity electricity use for use operation for use operation for operation and cooling and cooling and in cooling data in data in data formsforms C andformsrection C E) and C compared E) and capability compared E) compared to SSDs ofto SSDs ECC that to SSDs that are more throttled that are rapidly throttled are throttled less than or less incenters, or less pastcenters, or flashwecenters, are we memory interested are we interested are generationsinterested in understanding in understanding in understanding and the shortens role the that role the the thatSSD role thatSSD SSD not throttlednot throttlednotlifetime throttled at all at (Platformsof all at modern (Platforms all (Platforms A SSDs. and A B).and A To We B).and overcome attribute We B). attribute We attribute the the the decreasepower thepower drawpower in draw plays lifetime, draw plays with plays with respect a number with respect to respect errors. to of errors. error to errors. mitigation relativelyrelativelyrelatively lowtechniques SSD low SSD failure low SSD failure have rate failure forbeenrate Platforms forrate designed. Platforms for Platforms C and C These E and we C E and have techniqueswe E have weFigure haveFigure exploit 12Figure plots 12 intrinsic plots the 12 plots failure the properties failure the rate failure ratefor SSDs rateoffor the SSDs for that different SSDs that operate that operate at operate at at observedobserved inobserved our in measurements our in measurements our measurements to the to very the to veryaggressive the veryaggressive aggressive throttling throttling throttlingdifferentdifferent averagedifferent average amounts average amounts of amounts bus of power bus of power bus consumption. power consumption. consumption. Recall Recall Recall thatthat occursthat occurstypes for occurs SSDs for of SSDs errors forin these SSDs in to these two inreduce these platforms. two platforms. thetwo rate platforms. Such at Such throttling which Such throttling they throttlingthat leadthat Platforms tothat Platforms raw Platforms bit A and errors. A Band A use InB and PCIe use this B PCIe use v1section, and PCIe v1 thatand v1 we thatand Platforms discuss that Platforms Platforms C C C couldcould potentiallycould potentiallyhow potentially the reduce flash reduce performance, reduce controller performance, performance, though mitigates though we though are weeach not are we ofable not are the able notthrough error ablethrough types Fthrough use F via PCIe use F various PCIe use v2. PCIe We v2. proposed make We v2. make We three makeerror three observations three mitigation observations observations about about about the operationthe operationthe operation of these of these SSDs. of these SSDs. First, SSDs. First, PCIe First, PCIe v2 SSDs PCIe v2 SSDs (Platforms v2 SSDs (Platforms (Platforms to examineto examinetomechanisms. this examine effect. this effect. this effect. 
Table 3.1 shows the techniques we overview and which errors (from Section 3.1) SSDsSSDs in PlatformsSSDs in Platforms in Platforms D and D F and employ D F and employ F a employrelatively a relatively a relatively low amount low amount low amountC throughC throughC F) through support F) support F) twice support twice the twice bandwidth the bandwidth the bandwidth of PCIe of PCIe v1 of SSDs PCIe v1 SSDs v1 SSDs of throttlingof throttlingofthey throttling (Figure mitigate. (Figure 11), (Figure 11), but 11),exhibit but exhibitbut the exhibit counter-intuitive the counter-intuitive the counter-intuitive(Platforms(Platforms(Platforms A and A B) and byA B) and operating by B) operating by operating at twice at twice the at frequency,twice the frequency, the frequency, lead- lead- lead- trendtrend of decreasingtrend of decreasing of decreasing failure failure rate failure ratewith ratewith increased with increased increased temperature. temperature. temperature.ing toing around toing around to twice around twice the twice amount the amount the of amount power of power consumedof power consumed consumed between between between RecallRecall fromRecall from Section from Section 4.1 Section that 4.1 that these 4.1 that these SSDs these SSDs are SSDs predominantly are predominantly are predominantlythe twothe sets twothe ofsets two SSDs: ofsets SSDs: of SSDs SSDs: SSDs supporting SSDs supporting supporting PCIe PCIe v1 operatePCIe v1 operate v1 in operate in in in theirin their earlyin their early detection early detection detection and early and early and failure early failure periods failure periods and periods so and the so and thethe so rangethethe rangethe of 4 range toof 47.5 toof W 47.5 andto W 7.5 SSDs and W SSDs and supporting SSDs supporting supporting PCIe PCIe v2 oper- PCIe v2 oper- v2 oper- failurefailure ratesfailure rates for mostrates for most SSDsfor most SSDs in these SSDs in these platforms in these platforms platforms are relatively are relatively are relativelyate inate the inate range the in range the of 8 range toof 14.58 toof W. 14.58 to Second, W. 14.5 Second, W. we Second, find we thatfind we thatfindthe bus thatthe bus the bus highhigh comparedhigh compared compared to their to their peers to their peers in Platforms peers in Platforms in Platforms C and C E. and ItC E. isand likely It E. is likely It ispower likelypower thatpower that SSDs that SSDs consume SSDs consume canconsume vary can vary canover vary over a range over a range of a around range of around of around 2 between2 between2 thebetween SSDs the SSDs the that SSDs that consume thatconsume theconsume least the least busthe leastpower bus power bus and power and and × × ×

PlatformPlatform A Platform A A PlatformPlatform B Platform B B PlatformPlatform C Platform C 43 C PlatformPlatform D Platform D D PlatformPlatform E Platform E E PlatformPlatform F Platform F F

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ● ● ● ● Throttled SSD fraction Throttled SSD fraction Throttled SSD fraction ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 0.00 0.50 1.00 3030 4030 4050 4050 60 506030 35 6030 40 35 4530 40 50 35 45 55 40 50 45 5535 50 40 5535 45 40 5035 45 55 40 50 60 45 55 50 6035 55 6035 4535 4555 4555 65 556530 40 6530 50 4030 60 50 40 70 60 50 7035 60 40 7035 45 40 5035 45 55 40 50 60 45 55 50 60 55 60

AverageAverage temperatureAverage temperature (°C) temperature (°C)Average (°C)Average temperatureAverage temperature (°C) temperature (°C)Average (°C)Average temperatureAverage temperature (°C) temperature (°C)Average (°C)Average temperatureAverage temperature (°C) temperature (°C)Average (°C)Average temperatureAverage temperature (°C) temperature (°C)Average (°C)Average temperatureAverage temperature (°C) temperature (°C) (°C)

FigureFigure 11:Figure Fraction11: Fraction11: Fraction of SSDs of SSDs of ever SSDs ever throttled ever throttled throttled vs. SSD vs. SSD vs. temperature. SSD temperature. temperature. While While SSDs While SSDs in SSDs some in some in platforms some platforms platforms are neverare never are never throttledthrottledthrottled (A and (A and B), (A andothers B), others B), are others throttled are throttled are throttled more more aggressively more aggressively aggressively (C and (C and E). (C and E). E). Table 3.1: List of different types of errors mitigated by various NAND flash error mitigation mechanisms.

Error Type 3.1.4) § 3.1.3) 3.1.5) § § 3.1.2) § 3.1.1) §

Mitigation

Mechanism P/E Cycling [21, 23, 195] ( Program [34, 195, 260] ( Cell-to-Cell Interference [21, 24, 26, 174] ( Data Retention [21, 22, 25, 27, 219] ( Read Disturb [21, 35, 88, 219] ( Shadow Program Sequencing X [24, 34] (Section 3.2.1) Neighbor-Cell Assisted Error X Correction [26] (Section 3.2.2) Refresh X X [22, 25, 222, 250] (Section 3.2.3) Read-Retry X X X [23, 82, 349] (Section 3.2.4) Voltage Optimization X X X [27, 35, 122] (Section 3.2.5) Hot Data Management X X X X X [95, 96, 194] (Section 3.2.6) Adaptive Error Mitigation X X X X X [31, 51, 99, 332, 336] (Section 3.2.7)

3.2.1 Shadow Program Sequencing

As discussed in Section 3.1.3, cell-to-cell program interference is a function of the distance between the cells of the wordline that is being programmed and the cells of the victim wordline. The impact of program interference is greatest on a victim wordline when either of the victim’s immediately-adjacent wordlines is programmed (e.g., if we program WL1 in Figure 2.8, WL0 and WL2 experience the greatest amount of interference). Early MLC flash memories used one-shot programming, where both the LSB and MSB pages of a wordline are programmed at the same time. As flash memory scaled to smaller process technologies, one-shot programming resulted in much larger amounts of cell-to-cell program interference. As a result, manufacturers introduced two-step programming for MLC NAND flash (see Section 2.2.4), where the SSD controller writes values of the two pages within a wordline in two independent steps.

The SSD controller minimizes the interference that occurs during two-step programming by using shadow program sequencing [24, 34, 255] to determine the order that data is written to different pages in a block. If we program the LSB and MSB pages of the same wordline back to back, as shown in Figure 3.9a, both programming steps induce interference on a fully-programmed wordline (i.e., a wordline where both the LSB and MSB pages are already written).

For example, if the controller programs both pages of WL1 back to back, shown as bold page programming operations in Figure 3.9a, the program operations induce a high amount of interference on WL0, which is fully programmed. The key idea of shadow program sequencing is to ensure that a fully-programmed wordline experiences interference minimally, i.e., only during MSB page programming (and not during LSB page programming). In shadow program sequencing, prior work assigns a unique page number to each page within a block, as shown in Figure 3.9b. The LSB page of wordline i is numbered page 2i − 1, and the MSB page is numbered page 2i + 2. The only exceptions to the numbering are the LSB page of wordline 0 (page 0) and the MSB page of the last wordline n (page 2n + 1). Two-step programming writes to pages in increasing order of page number inside a block [24, 34, 255], such that a fully-programmed wordline experiences interference only from the MSB page programming of the wordline directly above it, shown as the bold page programming operation in Figure 3.9b. With this programming order/sequence, the LSB page of the wordline above, and both pages of the wordline below, do not cause interference to fully-programmed data [24, 34, 255], as these two pages are programmed before programming the MSB page of the given wordline. Foggy-fine programming in TLC NAND flash (see Section 2.2.4) uses a similar ordering to reduce cell-to-cell program interference, as shown in Figure 3.9c.

        (a) Bad MLC           (b) MLC shadow        (c) TLC shadow
        program sequence      program sequence      program sequence
        LSB  MSB              LSB  MSB              LSB  CSB  MSB
WL 0      0    1                0    2                0    2    5
WL 1      2    3                1    4                1    4    8
WL 2      4    5                3    6                3    7   11
WL 3      6    7                5    8                6   10   14
WL 4      8    9                7   10                9   13   16
WL 5     10   11                9   11               12   15   17

Figure 3.9: Order in which the pages of each wordline (WL) are programmed using (a) a bad programming sequence, and using shadow sequencing for (b) MLC and (c) TLC NAND flash. The bold page programming operations for WL1 induce cell-to-cell program interference when WL0 is fully programmed. Reproduced from [32].

Shadow program sequencing is an effective solution to minimize cell-to-cell program interference on fully-programmed wordlines during two-step programming, and is employed in commercial SSDs today.
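To make the page numbering rules above concrete, the following Python sketch (our own illustration, not code from [24, 34, 255]; the function names are hypothetical) computes the MLC shadow program order of Figure 3.9b.

    def mlc_shadow_page_number(wl, page, last_wl):
        # Page number assigned to (wordline, page) by MLC shadow program sequencing.
        # page is 'LSB' or 'MSB'; wordlines are indexed 0..last_wl.
        if page == 'LSB':
            return 0 if wl == 0 else 2 * wl - 1
        return 2 * last_wl + 1 if wl == last_wl else 2 * wl + 2

    def shadow_program_order(last_wl):
        # Return the (wordline, page) pairs in the order they are programmed.
        pages = [(wl, p) for wl in range(last_wl + 1) for p in ('LSB', 'MSB')]
        return sorted(pages, key=lambda wp: mlc_shadow_page_number(wp[0], wp[1], last_wl))

    # Reproduces Figure 3.9b for a 6-wordline block: WL0 holds pages 0 and 2,
    # WL1 holds pages 1 and 4, ..., WL5 holds pages 9 and 11.
    print(shadow_program_order(5))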

3.2.2 Neighbor-Cell Assisted Error Correction

The threshold voltage shift that occurs due to program interference is highly correlated with the values stored in the cells of the immediately-adjacent wordlines, as we discussed in Section 3.1.3. Due to this correlation, knowing the value programmed in the immediately-adjacent cell (i.e., a neighbor cell) makes it easier to correctly determine the value stored in the flash cell that is being

read [26]. We describe a recently-proposed error correction method that takes advantage of this observation, called neighbor-cell-assisted error correction (NAC). The key idea of NAC is to use the data values stored in the cells of the immediately-adjacent wordline to determine a better set of read reference voltages for the wordline that is being read. Doing so leads to a more accurate identification of the logical data value that is being read, as the data in the immediately-adjacent wordline was partially responsible for shifting the threshold voltage of the cells in the wordline that is being read when the immediately-adjacent wordline was programmed.

Figure 3.10 shows an operational example of NAC that is applied to eight bitlines (BL) of an MLC flash wordline. The SSD controller first reads a flash page from a wordline using the standard read reference voltages (step 1 in Figure 3.10). The bit values read from the wordline are then buffered in the controller. If there are no errors uncorrectable by ECC, the read was successful, and nothing else is done. However, if there are errors that are uncorrectable by ECC, we assume that the threshold voltage distribution of the page shifted due to cell-to-cell program interference, triggering further correction. In this case, NAC reads the LSB and MSB pages of the wordline immediately above the requested page (i.e., the adjacent wordline that was programmed after the requested page) to classify the cells of the requested page (step 2). NAC then identifies the cells adjacent to (i.e., connected to the same bitline as) the ER cells (i.e., cells in the immediately above wordline that are in the ER state), such as the cells on BL1, BL3, and BL7 in Figure 3.10. NAC rereads these cells using read reference voltages that compensate for the threshold voltage shift caused by programming the adjacent cell to the ER state (step 3). If ECC can correct the remaining errors, the controller returns the corrected page to the host. If ECC fails again, the process is repeated using a different set of read reference voltages for cells that are adjacent to the P1 cells (step 4). If ECC continues to fail, the process is repeated for cells that are adjacent to P2 and P3 cells (steps 5 and 6, respectively, which are not shown in the figure) until either ECC is able to correct the page or all possible adjacent values are exhausted.

                                     BL0  BL1  BL2  BL3  BL4  BL5  BL6  BL7
    Originally-programmed value       11   00   01   10   11   00   01   00
    1. Read (using Vopt) with errors  01   00   00   00   11   10   00   01
    2. Read adjacent wordline         P2   ER   P2   ER   P1   P3   P1   ER
    3. Correct cells adjacent to ER   01   00   00   10   11   10   00   00
    4. Correct cells adjacent to P1   01   00   00   10   11   10   01   00

Figure 3.10: Overview of neighbor-cell-assisted error correction (NAC). Reproduced from [32].

NAC extends the lifetime of an SSD by reducing the number of errors that need to be corrected using the limited correction capability of ECC. With the use of experimental data collected from real MLC NAND flash memory chips, prior work shows that NAC extends the NAND flash memory lifetime by 33% [26]. Previous work from our research group [26] provides a detailed description of NAC, including a theoretical treatment of why it works and a practical implementation that minimizes the number of reads performed, even in the case when the neighboring wordline itself has errors.
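The control flow of NAC can be summarized by the following Python sketch. It is only an illustration of the retry loop described above, under the assumption of hypothetical controller hooks (read_page, read_neighbor_states, reread_cells_with_shifted_vref, ecc_decode); it is not the implementation from [26].

    MLC_STATES = ('ER', 'P1', 'P2', 'P3')   # possible states of the adjacent (neighbor) cells

    def nac_read(addr, read_page, read_neighbor_states,
                 reread_cells_with_shifted_vref, ecc_decode):
        # read_page returns a mutable list with one raw value per bitline;
        # ecc_decode returns (success, corrected_data).
        raw = read_page(addr)                      # step 1: read with standard Vref
        ok, data = ecc_decode(raw)
        if ok:
            return data                            # no uncorrectable errors: done
        neighbor = read_neighbor_states(addr)      # step 2: wordline programmed after this one
        for state in MLC_STATES:                   # steps 3-6: one retry per neighbor state
            bitlines = [bl for bl, s in enumerate(neighbor) if s == state]
            # Re-read only the cells whose neighbor is in `state`, using a Vref that
            # compensates for the interference caused by programming that state.
            for bl, value in reread_cells_with_shifted_vref(addr, state, bitlines):
                raw[bl] = value
            ok, data = ecc_decode(raw)
            if ok:
                return data
        raise IOError("page uncorrectable even after NAC")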

3.2.3 Refresh Mechanisms

As we see in Figure 3.1, during the time period after a flash page is programmed, retention (Section 3.1.4) and read disturb (Section 3.1.5) can cause an increasing number of raw bit errors to accumulate over time. This is particularly problematic for a page that is not updated frequently. Due to the limited error correction capability, the accumulation of these errors can potentially lead to data loss for a page with a high retention age (i.e., a page that has not been programmed for a long time). To avoid data loss, refresh mechanisms have been proposed, where the stored data is periodically read, corrected, and reprogrammed, in order to eliminate the retention and read disturb errors that have accumulated prior to this periodic read/correction/reprogramming (i.e., refresh). The concept of refresh in flash memory is thus conceptually similar to the refresh mechanisms found in DRAM [39, 120, 186, 187]. By performing refresh and limiting the number of retention and read disturb errors that can accumulate, the lifetime of the SSD increases significantly. In this section, we describe three types of refresh mechanisms used in modern SSDs: remapping-based refresh, in-place refresh, and read reclaim.

Remapping-Based Refresh. Flash cells must first be erased before they can be reprogrammed, due to the fact that programming a cell via ISPP can only increase the charge level of the cell but not reduce it (Section 2.2.4). The key idea of remapping-based refresh is to periodically read data from each valid flash block, correct any data errors, and remap the data to a different physical location, in order to prevent the data from accumulating too many retention errors [19, 22, 25, 222, 250]. During each refresh interval, a block with valid data that needs to be refreshed is selected. The valid data in the selected block is read out page by page and moved to the SSD controller. The ECC engine in the SSD controller corrects the errors in the read data, including retention errors that have accumulated since the last refresh. A new block is then selected from the free list (see Section 2.1.3), the error-free data is programmed to a page within the new block, and the logical address is remapped to point to the newly-programmed physical page. By reducing the accumulation of retention and read disturb errors, remapping-based refresh increases SSD lifetime by an average of 9x for a variety of disk workloads [22, 25].

Prior work proposes extensions to the basic remapping-based refresh approach. One work, refresh SSDs, proposes a refresh scheduling algorithm based on an earliest deadline first policy to guarantee that all data is refreshed in time [222]. The quasi-nonvolatile SSD proposes to use remapping-based refresh to choose between improving flash endurance and reducing the flash programming latency (by using larger ISPP step-pulses) [250]. In the quasi-nonvolatile SSD, refresh requests are deprioritized, scheduled at idle times, and can be interrupted after refreshing any page within a block, to minimize the delays that refresh can cause for the response time of pending workload requests to the SSD. A refresh operation can also be triggered proactively based on the data read latency observed for a page, which is indicative of how many errors the page has experienced [30]. Triggering refresh proactively based on the observed read latency (as opposed to doing so periodically) improves SSD latency and throughput [30]. Whenever the read latency for a page within a block exceeds a fixed threshold, the valid data in the block is refreshed, i.e., remapped to a new block [30].
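A minimal sketch of one remapping-based refresh pass is shown below. The block/page attributes and helper callables are hypothetical FTL hooks used only for illustration, not an actual controller API.

    def remapping_based_refresh(blocks, read_page, ecc_correct,
                                allocate_free_block, program_page, remap):
        # One refresh pass over all blocks whose data is old enough to need refresh.
        for block in blocks:
            if not block.needs_refresh:            # e.g., retention age above a threshold
                continue
            new_block = allocate_free_block()      # taken from the free list (Section 2.1.3)
            for page in block.valid_pages:
                raw = read_page(page)
                data = ecc_correct(raw)            # strip accumulated retention/read disturb errors
                new_page = program_page(new_block, data)
                remap(page.logical_addr, new_page) # point the logical address at the new copy
            block.mark_for_erase()                 # old block is reclaimed by garbage collection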

In-Place Refresh. A major drawback of remapping-based refresh is that it performs additional writes to the NAND flash memory, accelerating wearout. To reduce the wearout overhead

of refresh, prior work proposes in-place refresh [19, 22, 25]. As data sits unmodified in the SSD, data retention errors dominate [21, 25, 312], leading to charge loss and causing the threshold voltage distribution to shift to the left, as we showed in Section 3.1.4. The key idea of in-place refresh is to incrementally replenish the lost charge of each page at its current location, i.e., in place, without the need for remapping.

Figure 3.11 shows a high-level overview of in-place refresh for a wordline. The SSD controller first reads all of the pages in the wordline (❶ in Figure 3.11). The controller invokes the ECC decoder to correct the errors within each page (❷), and sends the corrected data back to the flash chips (❸). In-place refresh then invokes a modified version of the ISPP mechanism (see Section 2.2.4), which we call Verify-ISPP (V-ISPP), to compensate for retention errors by restoring the charge that was lost. In V-ISPP, we first verify the voltage currently programmed in a flash cell (❹). If the current voltage of the cell is lower than the target threshold voltage of the state that the cell should be in, V-ISPP pulses the programming voltage in steps, gradually injecting charge into the cell until the cell returns to the target threshold voltage (❺). If the current voltage of the cell is higher than the target threshold voltage, V-ISPP inhibits the programming pulses to the cell.

[Figure 3.11 depicts the flash chip and SSD controller: ❶ read the MSB and LSB pages; ❷ correct all errors using the ECC decoder; ❹ verify the current Vth value (filters out most cells); ❺ pulse the program voltage (few pulses needed).]

Figure 3.11: Overview of in-place refresh mechanism for MLC NAND flash memory. Reproduced from [32].

When the controller invokes in-place refresh, it is unable to use shadow program sequencing (Section 3.2.1), as all of the pages within the wordline have already been programmed. However, unlike traditional ISPP, V-ISPP does not introduce a high amount of cell-to-cell program interference (Section 3.1.3) for two reasons. First, V-ISPP programs only those cells that have retention errors, which typically account for less than 1% of the total number of cells in a wordline selected for refresh [22]. Second, for the small number of cells that are selected to be refreshed, their threshold voltage is usually only slightly lower than the target threshold voltage, which means that only a few programming pulses need to be applied. As cell-to-cell interference is linearly correlated with the threshold voltage change to immediately-adjacent cells [24, 26], the small voltage change on these in-place refreshed cells leads to only a small interference effect.

One issue with in-place refresh is that it is unable to correct retention errors for cells in lower-voltage states. Retention errors cause the threshold voltage of a cell in a lower-voltage state to increase, but V-ISPP cannot decrease the threshold voltage of a cell. To achieve a balance between the wearout overhead due to remapping-based refresh and errors that increase the threshold voltage due to in-place refresh, prior work proposes hybrid in-place refresh [19, 22, 25]. The key idea is to use in-place refresh when the number of program errors (caused due to reprogramming) is within the correction capability of ECC, but to use remapping-based refresh if the number of program errors is too large to tolerate. To accomplish this, the controller tracks the number of

right-shift errors (i.e., errors that move a cell to a higher-voltage state) [22, 25]. If the number of right-shift errors remains under a certain threshold, the controller performs in-place refresh; otherwise, it performs remapping-based refresh. Such a hybrid in-place refresh mechanism increases SSD lifetime by an average of 31x for a variety of disk workloads [22, 25].
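The hybrid policy can be summarized by the small decision sketch below; the threshold value and the helper names are assumptions made only for illustration.

    RIGHT_SHIFT_THRESHOLD = 8   # assumed value; in practice tied to the ECC correction capability

    def hybrid_refresh(block, count_right_shift_errors,
                       in_place_refresh, remapping_based_refresh):
        # Right-shift errors (cells that moved to a higher-voltage state) cannot be
        # repaired by V-ISPP, which can only add charge to a cell.
        if count_right_shift_errors(block) <= RIGHT_SHIFT_THRESHOLD:
            in_place_refresh(block)            # cheap: replenish lost charge in place
        else:
            remapping_based_refresh(block)     # costly: rewrite the block elsewhere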

Read Reclaim to Reduce Read Disturb Errors. We can also mitigate read disturb errors using an idea similar to remapping-based refresh, known as read reclaim. The key idea of read reclaim is to remap the data in a block to a new flash block, if the block has experienced a high number of reads [95, 96, 148]. To bound the number of read disturb errors, some flash vendors specify a maximum number of tolerable reads for a flash block, at which point read reclaim rewrites the data to a new block (just as is done for remapping-based refresh).

Adaptive Refresh and Read Reclaim Mechanisms. For the refresh and read reclaim mechanisms discussed above, the SSD controller can (1) invoke the mechanisms at fixed regular intervals; or (2) adapt the rate at which it invokes the mechanisms, based on various conditions that impact the rate at which data retention and read disturb errors occur. By adapting the mechanisms based on the current conditions of the SSD, the controller can reduce the overhead of performing refresh or read reclaim. The controller can adaptively adjust the rate that the mechanisms are invoked based on (1) the wearout (i.e., the current P/E cycle count) of the NAND flash memory [22, 25]; or (2) the temperature of the SSD [21, 27].

As we discuss in Section 3.1.4, for data with a given retention age, the number of retention errors grows as the P/E cycle count increases. Exploiting this P/E cycle dependent behavior of retention time, the SSD controller can perform refresh less frequently (e.g., once every year) when the P/E cycle count is low, and more frequently (e.g., once every week) when the P/E cycle count is high, as proposed and described in prior works from our research group [22, 25]. Similarly, for data with a given read disturb count, as the P/E cycle count increases, the number of read disturb errors increases as well [35]. As a result, the SSD controller can perform read reclaim less frequently (i.e., it increases the maximum number of tolerable reads per block before read reclaim is triggered) when the P/E cycle count is low, and more frequently when the P/E cycle count is high.

Prior works demonstrate that for a given retention time, the number of data retention errors increases as the NAND flash memory’s operating temperature increases [21, 27]. To compensate for the increased number of retention errors at high temperature, a state-of-the-art SSD controller adapts the rate at which it triggers refresh. The SSD contains sensors that monitor the current environmental temperature every few milliseconds [213, 329]. The controller then uses the Arrhenius equation [6, 222, 344] to estimate the rate at which retention errors accumulate at the current temperature of the SSD. Based on the error rate estimate, the controller decides if it needs to increase the rate at which it triggers refresh to ensure that the data is not lost.
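As a rough illustration of temperature-adaptive refresh, the sketch below scales the refresh interval by an Arrhenius acceleration factor. The activation energy is an assumed placeholder; the real value is device-specific and calibrated by the manufacturer.

    import math

    K_B = 8.617e-5    # Boltzmann constant (eV/K)
    E_A = 1.1         # assumed activation energy (eV); device-specific in practice

    def arrhenius_acceleration(temp_c, ref_temp_c=40.0):
        # Factor by which retention errors accumulate faster at temp_c than at ref_temp_c.
        t, t_ref = temp_c + 273.15, ref_temp_c + 273.15
        return math.exp((E_A / K_B) * (1.0 / t_ref - 1.0 / t))

    def refresh_interval_days(base_interval_days, temp_c):
        # Shrink the refresh interval when the SSD runs hotter than the reference point.
        return base_interval_days / arrhenius_acceleration(temp_c)

    print(refresh_interval_days(7.0, 55.0))   # a hotter SSD is refreshed more often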

By employing adaptive refresh and/or read reclaim mechanisms, the SSD controller can successfully reduce the mechanism overheads while effectively mitigating the larger number of data retention errors that occur under various conditions.

3.2.4 Read-Retry

In earlier generations of NAND flash memory, the read reference voltage values were fixed at design time [23, 219]. However, several types of errors cause the threshold voltage distribution to shift, as shown in Figure 3.2. To compensate for threshold voltage distribution shifts, a mechanism called read-retry has been implemented in modern flash memories (typically those below 30 nm for planar flash [23, 82, 296, 349]). The read-retry mechanism allows the read reference voltages to dynamically adjust to changes in distributions. During read-retry, the SSD controller first reads the data out of NAND flash memory with the default read reference voltage. It then sends the data for error correction. If ECC successfully corrects the errors in the data, the read operation succeeds. Otherwise, the SSD controller reads the memory again with a different read reference voltage. The controller repeats these steps until it either successfully reads the data using a certain set of read reference voltages or is unable to correctly read the data using all of the read reference voltages that are available to the mechanism.

While read-retry is widely implemented today, it can significantly increase the overall read operation latency due to the multiple read attempts it causes [27]. Mechanisms have been proposed to reduce the number of read-retry attempts while taking advantage of the effective capability of read-retry for reducing read errors, and read-retry has also been used to enable mitigation mechanisms for various other types of errors, as we describe in Section 3.2.5. As a result, read-retry is an essential mechanism in modern SSDs to mitigate read errors (i.e., errors that manifest themselves during a read operation).
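The read-retry flow amounts to the short loop sketched below (an illustration with hypothetical hooks, not vendor firmware).

    def read_retry(addr, read_with_vref, ecc_decode, vref_candidates):
        # Try each available read reference voltage until ECC succeeds.
        # vref_candidates starts with the default Vref used for the first read.
        for vref in vref_candidates:
            ok, data = ecc_decode(read_with_vref(addr, vref))
            if ok:
                return data, vref            # report which voltage finally worked
        raise IOError("data unreadable with all available read reference voltages")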

3.2.5 Voltage Optimization

Many raw bit errors in NAND flash memory are affected by the various voltages used within the memory to enable reading of values. We give two examples. First, a suboptimal read reference voltage can lead to a large number of read errors (Section 3.1), especially after the threshold voltage distribution shifts. Second, as we saw in Section 3.1.5, the pass-through voltage can have a significant effect on the number of read disturb errors that occur. As a result, optimizing these voltages such that they minimize the total number of errors that are induced can greatly mitigate error counts. In this section, we discuss mechanisms that can discover and employ the optimal² read reference and pass-through voltages.

² Or, more precisely, near-optimal, if the read-retry steps are too coarse-grained to find the optimal voltage.

Optimizing Read Reference Voltages Using Disparity-Based Approximation and Sampling. As we discussed in Section 3.2.4, when the threshold voltage distribution shifts, it is important to move the read reference voltage to the point where the number of read errors is minimized. After the shift occurs and the threshold voltage distribution of each state widens, the distributions of different states may overlap with each other, causing many of the cells within the overlapping regions to be misread. The number of errors due to misread cells can be minimized by setting the read reference voltage to be exactly at the point where the distributions of two neighboring states intersect, which we call the optimal read reference voltage (Vopt) [24, 26, 27, 195, 252], illustrated in Figure 3.12. Once the optimal read reference voltage is applied, the raw bit error rate is minimized, improving the reliability of the device.

[Figure 3.12 (left) shows the threshold voltage distributions of two neighboring states Px and P(x+1) with candidate read reference voltages V1, V2, Vopt, and Vinitial; (right) shows the raw bit error rate as a function of the read reference voltage, with the ECC correction capability marked.]

Figure 3.12: Finding the optimal read reference voltage after the threshold voltage distributions overlap (left), and raw bit error rate as a function of the selected read reference voltage (right). Reproduced from [32].

One approach to finding Vopt is to adaptively learn and apply the optimal read reference voltage for each flash block through sampling [27, 52, 65, 339]. The key idea is to periodically (1) use disparity information (i.e., the ratio of 1s to 0s in the data) to attempt to find a read reference voltage for which the error rate is lower than the ECC correction capability; and to (2) use sampling to efficiently tune the read reference voltage to its optimal value to reduce the read operation latency. Prior characterization of real NAND flash memory [27, 252] found that the value of Vopt does not shift greatly over a short period of time (e.g., a day), and that all pages within a block experience similar amounts of threshold voltage shifts, as they have the same amount of wearout and are programmed around the same time [27, 252]. Therefore, we can invoke the Vopt learning mechanism periodically (e.g., daily) to efficiently tune the initial read reference voltage (i.e., the first read reference voltage used when the controller invokes the read-retry mechanism, described in Section 3.2.4) for each flash block, ensuring that the initial voltage used by read-retry stays close to Vopt even as the threshold voltage distribution shifts.

The SSD controller searches for Vopt by counting the number of errors that need to be corrected by ECC during a read. However, there may be times where the initial read reference voltage (Vinitial) is set to a value at which the number of errors during a read exceeds the ECC correction capability, such as the raw bit error rate for Vinitial in Figure 3.12 (right). When the ECC correction capability is exceeded, the SSD controller is unable to count how many errors exist in the raw data. The SSD controller uses disparity-based read reference voltage approximation [52, 65, 339] for each flash block to try to bring Vinitial to a region where the number of errors does not exceed the ECC correction capability. Disparity-based read reference voltage approximation takes advantage of data scrambling. Recall from Section 2.1.3 that to minimize data value dependencies for the error rate, the SSD controller scrambles the data written to the SSD to probabilistically ensure that an equal number of 0s and 1s exist in the flash memory cells. The key idea of disparity-based read reference voltage approximation is to find the read reference voltages that result in approximately 50% of the cells reading out bit value 0, and the other 50% of the cells reading out bit value 1. To achieve this, the SSD controller employs a binary search algorithm, which tracks the ratio of 0s to 1s for each read reference voltage it tries. The binary

search tests various read reference voltage values, using the ratios of previously tested voltages to narrow down the range where the read reference voltage can have an equal ratio of 0s to 1s. The binary search algorithm continues narrowing down the range until it finds a read reference voltage that satisfies the ratio. The usage of the binary search algorithm depends on the type of NAND flash memory used within the SSD. For SLC NAND flash, the controller searches for only a single read reference voltage. For MLC NAND flash, there are three read reference voltages: the LSB is determined using Vb, and the MSB is determined using both Va and Vc (see Section 2.2.3). Figure 3.13 illustrates the search procedure for MLC NAND flash. First, the controller uses binary search to find Vb, choosing a voltage that reads the LSB of 50% of the cells as data value 0 (step 1 in Figure 3.13). For the MSB, the controller uses the discovered Vb value to help search for Va and Vc. Due to scrambling, cells should be equally distributed across each of the four voltage states. The controller uses binary search to set Va such that 25% of the cells are in the ER state, by ensuring that half of the cells to the left of Vb are read with an MSB of 0 (step 2). Likewise, the controller uses binary search to set Vc such that 25% of the cells are in the P3 state, by ensuring that half of the cells to the right of Vb are read with an MSB of 0 (step 3). This procedure is extended in a similar way to approximate the voltages for TLC NAND flash.

[Figure 3.13 panels: 1. Find Vb that reads 50% of LSBs as 0s. 2. Use Vb to find Va that reads 50% of MSBs to the left of Vb as 0s. 3. Use Vb to find Vc that reads 50% of MSBs to the right of Vb as 0s.]

Figure 3.13: Disparity-based read reference voltage approximation to find Vinitial for MLC NAND flash memory. Each circle represents a cell, where a dashed border indicates that the LSB is undetermined, a solid border indicates that the LSB is known, a hollow circle indicates that the MSB is unknown, and a filled circle indicates that the MSB is known. Reproduced from [32].
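The binary search for a balanced read reference voltage can be sketched as follows. Here, read_ones_fraction is a hypothetical hook that reads the page at a candidate voltage and returns the fraction of cells that read out as 1, and the sketch assumes the convention that cells with Vth below the applied voltage read as 1.

    def find_balanced_vref(read_ones_fraction, v_lo, v_hi, tolerance=0.01, max_steps=16):
        # Binary search for a Vref at which roughly half of the (scrambled) cells read as 1.
        for _ in range(max_steps):
            v_mid = (v_lo + v_hi) / 2.0
            ones = read_ones_fraction(v_mid)
            if abs(ones - 0.5) <= tolerance:
                return v_mid
            if ones > 0.5:
                v_hi = v_mid      # too many cells below Vref: lower the voltage
            else:
                v_lo = v_mid      # too many cells above Vref: raise the voltage
        return (v_lo + v_hi) / 2.0

For MLC, this routine would be invoked first for Vb and then, restricted to the cells on either side of Vb, for Va and Vc, mirroring the three steps of Figure 3.13.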

If disparity-based approximation finds a value for Vinitial where the number of errors during a read can be counted by the SSD controller, the controller invokes sampling-based adaptive Vopt discovery [27] to minimize the error count, and thus reduce the read latency. Sampling-based adaptive Vopt discovery learns and records Vopt for the last-programmed page in each block. Prior work samples only the last-programmed page because it is the page with the lowest data

retention age in the flash block. As retention errors cause the higher-voltage states to shift to the left (i.e., to lower voltages), the last-programmed page usually provides an upper bound of Vopt for the entire block.

During sampling-based adaptive Vopt discovery, the SSD controller first reads the last-programmed page using Vinitial, and attempts to correct the errors in the raw data read from the page. Next, it records the number of raw bit errors as the current lowest error count NERR, and sets the applied read reference voltage (Vref) as Vinitial. Since Vopt typically decreases over retention age, the controller first attempts to lower the read reference voltage for the last-programmed page, decreasing the voltage to Vref − ∆V and reading the page. If the number of corrected errors in the new read is less than or equal to the old NERR, the controller updates NERR and Vref with the new values. The controller continues to lower the read reference voltage until the number of corrected errors in the data is greater than the old NERR or the lowest possible read reference voltage is reached. Since the optimal threshold voltage might increase in rare cases, the controller also tests increasing the read reference voltage. It increases the voltage to Vref + ∆V and reads the last-programmed page to see if NERR decreases. Again, it repeats increasing Vref until the number of corrected errors in the data is greater than the old NERR or the highest possible read reference voltage is reached. The controller sets the initial read reference voltage of the block as the value of Vref at the end of this process so that the next time an uncorrectable error occurs, read-retry starts at a Vinitial that is hopefully closer to the optimal read reference voltage (Vopt).

During the course of the day, as more retention errors (the dominant source of errors on already-programmed blocks) accumulate, the threshold voltage distribution shifts to the left (i.e., voltages decrease), and the initial read reference voltage (i.e., Vinitial) is now an upper bound for the read-retry voltages. Therefore, whenever read-retry is invoked, the controller now needs to only decrease the read reference voltages (as opposed to traditional read-retry, which tries both lower and higher voltages [27]). Sampling-based adaptive Vopt discovery improves the endurance (i.e., the number of P/E cycles before the ECC correction capability is exceeded) of the NAND flash memory by 64% and reduces error correction latency by 10% [27], and is employed in some modern SSDs today.
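The downward-then-upward search can be captured by the sketch below, where read_errors_at is a hypothetical hook that reads the last-programmed page at a given voltage and returns the number of raw bit errors corrected by ECC.

    def sample_vopt(read_errors_at, v_initial, v_step, v_min, v_max):
        best_v = v_initial
        best_err = read_errors_at(v_initial)
        # Vopt usually drifts downward as retention errors accumulate, so search
        # lower voltages first; the upward pass covers the rare opposite case.
        for direction in (-v_step, +v_step):
            v = best_v + direction
            while v_min <= v <= v_max:
                err = read_errors_at(v)
                if err > best_err:
                    break                      # error count got worse: stop this direction
                best_v, best_err = v, err
                v += direction
        return best_v                          # becomes Vinitial for read-retry on this block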

Other Approaches to Optimizing Read Reference Voltages. One drawback of the sampling-based adaptive technique is that it requires time and storage overhead to find and record the per-block initial voltages. To avoid this, the SSD controller can employ an accurate online threshold voltage distribution model [19, 23], which can efficiently track and predict the shift in the distribution over time. The model represents the threshold voltage distribution of each state as a probability density function (PDF), and the controller can use the model to calculate the intersection of the different PDFs. The controller uses the PDF in place of the threshold voltage sampling, determining Vopt by calculating the intersection of the distribution of each state in the model. Chapter 5 demonstrates an example of this approach.

Other prior work examines adapting read reference voltages based on P/E cycle count, retention age, or read disturb. In one such work, the controller periodically learns read reference voltages by testing three read reference voltages on six pages per block, which the work demonstrates to be sufficiently accurate [252]. Similarly, error correction using LDPC soft decoding (see Section 3.3.2) requires reading the same page using multiple sets of read reference voltages to provide fine-grained information on the probability of each cell representing a bit value 0 or

a bit value 1. Another prior work optimizes the read reference voltages to increase the ECC correction capability without increasing the coding rate [326].

Optimizing Pass-Through Voltage to Reduce Read Disturb Errors. As we discussed in Section 3.1.5, the vulnerability of a cell to read disturb is directly correlated with the voltage difference (Vpass − Vth) through the cell oxide [35]. Traditionally, a single Vpass value is used globally for the entire flash memory, and the value of Vpass must be higher than all potential threshold voltages within the chip to ensure that unread cells along a bitline are turned on during a read operation (see Section 2.2.3). To reduce the impact of read disturb, we can tune Vpass to reduce the size of the voltage difference (Vpass − Vth). However, it is difficult to reduce Vpass globally, as any cell with a value of Vth > Vpass introduces an error during a read operation (which we call a pass-through error). In prior work, we propose a mechanism that can dynamically lower Vpass while ensuring that it can correct any new pass-through errors introduced. The key idea of the mechanism is to lower Vpass only for those blocks where ECC has enough leftover error correction capability (see Section 2.1.3) to correct the newly introduced pass-through errors.

When the retention age of the data within a block is low, prior work finds that the raw bit error rate of the block is much lower than the rate for the block when the retention age is high, as the number of data retention and read disturb errors remains low at low retention age [35, 96]. As a result, a block with a low retention age has significant unused ECC correction capability, which we can use to correct the pass-through errors we introduce when we lower Vpass, as shown in Figure 3.14. Thus, when a block has a low retention age, the controller lowers Vpass aggressively, making it much less likely for read disturbs to induce an uncorrectable error. When a block has a high retention age, the controller also lowers Vpass, but does not reduce the voltage aggressively, since the limited ECC correction capability now needs to correct retention errors, and might not have enough unused correction capability to correct many new pass-through errors. By reducing Vpass aggressively when a block has a low retention age, we can extend the time before the ECC correction capability is exhausted, improving the flash lifetime.

[Figure 3.14 panels: (top) low retention age, with Vpass lowered and the resulting pass-through errors marked; (bottom) high retention age, with retention errors marked. Both panels show the P2 (00) and P3 (01) distributions over threshold voltage (Vth).]

Figure 3.14: Dynamic pass-through voltage tuning at different retention ages. Reproduced from [32].

The previously-proposed read disturb mitigation mechanism [35] learns the minimum pass-through voltage for each block, such that all data within the block can be read correctly with ECC. The previously-proposed learning mechanism works online and is triggered periodically

(e.g., daily). The mechanism is implemented in the controller, and has two components. It first finds the size of the ECC margin M (i.e., the unused correction capability) that can be exploited to tolerate additional read errors for each block. Once it knows the available margin M, the previously-proposed mechanism calibrates Vpass on a per-block basis to find the lowest value of Vpass that introduces no more than M additional raw errors (i.e., there are no more than M cells where Vth > Vpass). The findings on MLC NAND flash memory show that the mechanism can improve flash endurance by an average of 21% for a variety of disk workloads [35].

Programming and Erase Voltages. Prior work also examines tuning the programming and erase voltages to extend flash endurance [122]. By decreasing the two voltages when the P/E cycle count is low, the accumulated wearout for each program or erase operation is reduced, which, in turn, increases the overall flash endurance. Decreasing the programming voltage, however, comes at the cost of increasing the time required to perform ISPP, which, in turn, increases the overall SSD write latency [122].
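Returning to the per-block Vpass calibration described above, the two components of the mechanism can be sketched as follows; ecc_capability, worst_page_raw_errors, count_cells_above, and the candidate list are hypothetical names used only for illustration.

    def tune_vpass(block, ecc_capability, worst_page_raw_errors,
                   count_cells_above, vpass_candidates):
        # M = unused correction capability of the block's most error-prone page.
        margin = ecc_capability - worst_page_raw_errors(block)
        best = max(vpass_candidates)               # default: the highest (safest) Vpass
        for vpass in sorted(vpass_candidates):     # try the lowest voltages first
            # Every cell with Vth > Vpass becomes a pass-through error on each read.
            if count_cells_above(block, vpass) <= margin:
                best = vpass                       # lowest Vpass that stays within the margin
                break
        return best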

3.2.6 Hot Data Management

The data stored in different locations of an SSD can be accessed by the host at different rates. Pages that are written to frequently exhibit high temporal write locality, and are called write-hot pages. Likewise, pages with a high amount of temporal read locality (i.e., pages that are accessed by a large fraction of the read operations) are called read-hot pages. A number of issues can arise when an SSD does not distinguish between write-hot pages and write-cold pages (i.e., pages with low temporal write locality), or between read-hot pages and read-cold pages (i.e., pages with low temporal read locality). For example, if write-hot pages and write-cold pages are stored within the same block, refresh mechanisms (which operate at the block level; see Section 3.2.3) cannot avoid refreshes to pages that were overwritten recently. This increases not only the energy consumption but also the write amplification due to remapping-based refresh [194]. Likewise, if read-hot and read-cold pages are stored within the same block, read-cold pages are unnecessarily exposed to a high number of read disturb errors [95, 96].

Hot data management refers to a set of mechanisms that can identify and exploit write-hot or read-hot pages in the SSD. The key idea common to such mechanisms is to apply special SSD management policies by placing hot pages and cold pages into separate flash blocks. Chapter 4 demonstrates an example technique using this idea. Prior work [325] proposes to reuse the correctly functioning flash pages within bad blocks (see Section 2.1.3) to store write-cold data. This technique increases the total number of usable blocks available for overprovisioning, and extends flash lifetime by delaying the point at which each flash chip reaches the upper limit of bad blocks it can tolerate. RedFTL identifies and replicates read-hot pages across multiple flash blocks, allowing the controller to evenly distribute read requests to these pages across the replicas [95]. Other works reduce the number of read reclaims (see Section 3.2.3) that need to be performed by mapping read-hot data to particular flash blocks and lowering the maximum possible threshold voltage for such blocks [29, 96]. By lowering the maximum possible threshold voltage for these blocks, the SSD controller can use a lower Vpass value (see Section 3.2.5) on the blocks without introducing any additional errors during a read operation. To lower the maximum threshold voltage in these blocks, the width of the voltage window for each voltage state is decreased, and each voltage

window shifts to the left [29, 96]. Another work applies stronger ECC encodings to only read-hot blocks based on the total read count of the block, in order to increase SSD endurance without significantly reducing the amount of overprovisioning [28] (see Section 2.1.4 for a discussion on the tradeoff between ECC strength and overprovisioning).
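As one possible illustration of hot data management (a generic recency-based sketch, not the WARM or RedFTL policy), the controller could track recently written logical pages and steer them to a separate block:

    from collections import OrderedDict

    class WriteHotnessFilter:
        # A logical page is treated as write-hot if it was rewritten recently (LRU window).
        def __init__(self, capacity=4096):
            self.capacity = capacity
            self.recent = OrderedDict()            # logical page number -> None, in LRU order

        def record_write(self, lpn):
            self.recent.pop(lpn, None)
            self.recent[lpn] = None                # move (or add) lpn to the most-recent end
            if len(self.recent) > self.capacity:
                self.recent.popitem(last=False)    # evict the least recently written page

        def is_write_hot(self, lpn):
            return lpn in self.recent

    def choose_destination_block(lpn, hotness, hot_block, cold_block):
        # Keep write-hot and write-cold pages in separate flash blocks.
        return hot_block if hotness.is_write_hot(lpn) else cold_block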

3.2.7 Adaptive Error Mitigation Mechanisms

Due to the many different factors that contribute to raw bit errors, error rates in NAND flash memory can be highly variable. Adaptive error mitigation mechanisms are capable of adapting error tolerance capability to the error rate. They provide stronger error tolerance capability when the error rate is higher, improving flash lifetime significantly. When the error rate is low, adaptive error mitigation techniques reduce error tolerance capability to lower the cost of the error mitigation techniques. In this section, we examine two types of adaptive techniques: (1) multi-rate ECC and (2) dynamic cell levels.

Multi-Rate ECC. Some works propose to employ multiple ECC algorithms in the SSD controller [31, 51, 99, 111, 336]. Recall from Section 2.1.4 that there is a tradeoff between ECC strength (i.e., the coding rate; see Section 2.1.3) and overprovisioning, as a codeword (which contains a data chunk and its corresponding ECC information) uses more bits when stronger ECC is employed. The key idea of multi-rate ECC is to employ a weaker codeword (i.e., one that uses fewer bits for ECC) when the SSD is relatively new and has a smaller number of raw bit errors, and to use the saved SSD space to provide additional overprovisioning, as shown in Figure 3.15.

[Figure 3.15 layout: the fixed ECC scheme divides the capacity into user data, OP space, and ECC; the multi-rate ECC scheme shows the same division for each engine ECC1 through ECC4, with capacity markers at 0%, 84%, 90%, and 100%.]

Figure 3.15: Comparison of space used for user data, overprovisioning, and ECC between a fixed ECC and a multi-rate ECC mechanism. Reproduced from [32].

[Figure 3.16 axes: raw bit error rate vs. P/E cycles, with switching thresholds T1–T4 and switching points PE1–PE3 marked.]

Figure 3.16: Illustration of how multi-rate ECC switches to different ECC codewords (i.e., ECCi) as the RBER grows. OPi is the overprovisioning factor used for engine ECCi, and WAi is the resulting write amplification value. Reproduced from [32].

Let us assume that the controller contains a configurable ECC engine that can support n different types of ECC codewords, which we call ECCi. Figure 3.15 shows an example of multi-rate ECC that uses four ECC engines, where ECC1 provides the weakest protection but has the smallest codeword, while ECC4 provides the strongest protection with the largest codeword. We need to ensure that the NAND flash memory has enough space to fit the largest codewords, e.g., those for ECC4 in Figure 3.15. Initially, when the raw bit error rate (RBER) is low, the controller employs ECC1, as shown in Figure 3.16. The smaller codeword size for ECC1 provides additional space for overprovisioning, as shown in Figure 3.15, and thus reduces the effects of write amplification.

Multi-rate ECC works on an interval-by-interval basis. Every interval (in this case, a predefined number of P/E cycles), the controller measures the RBER. When the RBER exceeds the threshold set for transitioning from a weaker ECC to a stronger ECC, the controller switches to the stronger ECC. For example, when the SSD exceeds the first RBER threshold for switching (T1 in Figure 3.16), the controller starts switching from ECC1 to ECC2. When switching between ECC engines, the controller uses the ECC1 engine to decode data the next time the data is read out, and stores a new codeword using the ECC2 engine. This process is repeated during the lifetime of flash memory for each stronger engine ECCi, where each engine has a corresponding threshold that triggers switching [31, 51, 99], as shown in Figure 3.16.

Multi-rate ECC allows the same maximum P/E cycle count for each block as if ECCn was used throughout the lifetime of the SSD, but reduces write amplification and improves performance during the periods where the lower strength engines are employed, by providing additional overprovisioning (see Section 2.1.4) during those times. As the lower-strength engines use smaller codewords (e.g., ECC1 versus ECC4 in Figure 3.15), the resulting free space can instead be employed to further increase the amount of overprovisioning within the NAND flash memory, which in turn increases the total lifetime of the SSD. We compute the lifetime improvement by modifying Equation 2.4 (Section 2.1.4) to account for each engine, as follows:

Lifetime = \sum_{i=1}^{n} \frac{PEC_i \times (1 + OP_i)}{365 \times DWPD \times WA_i \times R_{compress}}    (3.4)

In Equation 3.4, WAi and OPi are the write amplification and overprovisioning factor for ECCi, and PECi is the number of P/E cycles that ECCi is used for. Manufacturers can set parameters to maximize SSD lifetime in Equation 3.4, by optimizing the values of WAi and OPi.
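Equation 3.4 can be evaluated directly, as in the sketch below; the numbers in the usage example are made up purely for illustration.

    def multi_rate_ecc_lifetime(pec, op, wa, dwpd, r_compress):
        # Equation 3.4: SSD lifetime in years under multi-rate ECC.
        # pec[i], op[i], wa[i]: P/E cycles, overprovisioning factor, and write
        # amplification while engine ECC_(i+1) is in use.
        return sum(pec_i * (1.0 + op_i) / (365.0 * dwpd * wa_i * r_compress)
                   for pec_i, op_i, wa_i in zip(pec, op, wa))

    # Two engines, 3000 then 2000 P/E cycles, 1 full drive write per day, no compression.
    print(multi_rate_ecc_lifetime(pec=[3000, 2000], op=[0.28, 0.15],
                                  wa=[2.0, 3.0], dwpd=1.0, r_compress=1.0))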

Figure 3.17 shows the lifetime improvements for a four-engine multi-rate ECC, with the coding rates for the four ECC engines (ECC1–ECC4) set to 0.90, 0.88, 0.86, and 0.84 (recall that a lower coding rate provides stronger protection; see Section 2.1.4), over a fixed ECC engine that employs a coding rate of 0.84. We see that the lifetime improvements of using multi-rate ECC are: (1) significant, with a 31.2% increase if the baseline NAND flash memory has 15% overprovisioning; and (2) greater when the SSD initially has a smaller amount of overprovisioning.

[Figure 3.17 axes: normalized lifetime (1.00–1.75) vs. baseline overprovisioning (0%–30%).]

Figure 3.17: Lifetime improvements of using multi-rate ECC over using a fixed ECC coding rate. Reproduced from [32].

Dynamic Cell Levels. A major reason that errors occur in NAND flash memory is that the threshold voltage distribution of each state overlaps more with those of neighboring states as the distributions widen over time. Distribution overlaps are a greater problem when more states are encoded within the same voltage range. Hence, TLC flash has a much lower endurance than MLC, and MLC has a much lower endurance than SLC (assuming the same process technology node). If we can increase the margins between the states’ threshold voltage distributions, the amount of overlap can be reduced significantly, which in turn reduces the number of errors.

Prior work proposes to increase margins by dynamically reducing the number of bits stored within a cell, e.g., by going from three bits that encode eight states (TLC) to two bits that encode four states (equivalent to MLC), or to one bit that encodes two states (equivalent to SLC) [29, 332]. Recall that TLC uses the ER state and states P1–P7, which are spaced out approximately equally. When we downgrade a flash block (i.e., reduce the number of states its cells can represent) from eight states to four, the cells in the block now employ only the ER state and states P3, P5, and P7. As we can see from Figure 3.18, this provides large margins between states P3, P5, and P7, and provides an even larger margin between ER and P3. The SSD controller maintains a list of all of the blocks that have been downgraded. For each read operation, the SSD controller checks if the target block is in the downgraded block list, and uses this information to interpret the data that it reads out from the wordline of the block.

Figure 3.18: States used when a TLC cell (with 8 states) is downgraded to an MLC cell (with 4 states). Reproduced from [32].

A cell can be downgraded to reduce various types of errors (e.g., wearout, read disturb). To reduce wearout, a cell is downgraded when it has high wearout. To reduce read disturb, a cell can be downgraded if it stores read-hot data (i.e., the most frequently read data in the SSD). By using fewer states for a block that holds read-hot data, we can reduce the impact of read disturb because it becomes harder for the read disturb mechanism to affect the distributions enough for them to overlap. As an optimization, the SSD controller can employ various hot-cold data partitioning mechanisms (e.g., [28, 29, 95, 194]) to keep read-hot data in specially designated blocks [28, 29, 95, 96], allowing the controller to reduce the size of the downgraded block list and isolate the impact of read disturb from read-cold (i.e., infrequently read) data.

Another approach to dynamically increasing the distribution margins is to perform program and erase operations more slowly when the SSD write request throughput is low [29, 122]. Slower program/erase operations allow the final voltage of a cell to be programmed more precisely, and reduce the amount of oxide degradation that occurs during programming. As a result, the distribution of each state is initially much narrower, and subsequent widening of the distributions results in much lower overlap for a given P/E cycle count. This technique improves the SSD lifetime by an average of 61.2% for a variety of disk workloads [122]. Unfortunately, the slower program/erase operations come at the cost of higher SSD latency, and are thus not applied during periods of high write traffic. One way to mitigate the impact of the higher write latency is to perform slower program/erase operations only during garbage collection, which ensures that the higher latency occurs only when the SSD is idle [29]. As a result, read and write requests from the host do not experience any additional delays.
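The downgraded-block bookkeeping described earlier in this subsection (see Figure 3.18) can be illustrated with a small sketch. The class name, threshold value, and helper methods below are hypothetical; they are not part of any real controller firmware and only show one possible way to track downgraded blocks and select the decoding mode on each read.

    # A simple sketch of tracking downgraded (TLC-to-MLC-mode) blocks and picking
    # the interpretation mode on a read. All names and values are hypothetical.

    class DowngradeTracker:
        def __init__(self, wearout_threshold):
            self.downgraded_blocks = set()       # blocks using only ER, P3, P5, P7
            self.wearout_threshold = wearout_threshold

        def maybe_downgrade(self, block_id, pe_cycles, is_read_hot):
            # Downgrade a block when it is heavily worn or holds read-hot data.
            if pe_cycles >= self.wearout_threshold or is_read_hot:
                self.downgraded_blocks.add(block_id)

        def read_mode(self, block_id):
            # Each read checks the downgraded-block list to interpret cell states.
            return "MLC-mode (4 states)" if block_id in self.downgraded_blocks \
                   else "TLC-mode (8 states)"

    tracker = DowngradeTracker(wearout_threshold=2500)
    tracker.maybe_downgrade(block_id=42, pe_cycles=2600, is_read_hot=False)
    print(tracker.read_mode(42))   # MLC-mode (4 states)
    print(tracker.read_mode(43))   # TLC-mode (8 states)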

3.3 Error Correction and Data Recovery Techniques

Now that we have described a variety of error mitigation mechanisms that can target various types of error sources, we turn our attention to the error correction flow that is employed in modern SSDs as well as data recovery techniques that can be employed when the error correction flow fails to produce correct data. In this section, we briefly overview the major error correction steps an SSD performs when reading data. We first discuss two ECC encodings that are typically used by modern SSDs: Bose–Chaudhuri–Hocquenghem (BCH) codes [14, 107, 177, 297] and low-density parity-check (LDPC) codes [84, 85, 203, 204, 297] (Section 3.3.1). Next, we go through example error correction flows for an SSD that uses either BCH codes or LDPC codes (Section 3.3.2). Then, we compare the error correction strength (i.e., the number of errors that ECC can correct) when we employ BCH codes or LDPC codes in an SSD (Section 3.3.3). Finally, we discuss techniques that can rescue data from an SSD when the BCH/LDPC decoding fails to correct all errors (Section 3.3.4).

3.3.1 Error-Correcting Codes Used in SSDs

Modern SSDs typically employ one of two types of ECC. Bose–Chaudhuri–Hocquenghem (BCH) codes allow for the correction of multiple bit errors [14, 107, 177, 297], and are used to correct the errors observed during a single read from the NAND flash memory [177]. Low-density parity-check (LDPC) codes employ information accumulated over multiple read operations to determine the likelihood of each cell containing a bit value 1 or a bit value 0 [84, 85, 203, 204, 297], providing stronger protection at the cost of greater decoding latency and storage overhead [326, 362]. Next, we describe the basics of BCH and LDPC codes.

Bose–Chaudhuri–Hocquenghem (BCH) Codes

BCH codes [14, 107, 177, 297] have been widely used in modern SSDs during the past decade due to their ability to detect and correct multi-bit errors while keeping the latency and hardware cost of encoding and decoding low [48, 177, 207, 215]. For SSDs, BCH codes are designed to be systematic, which means that the original data message is embedded verbatim within the codeword. Within an n-bit codeword (see Section 2.1.3), error-correcting codes use the first k bits of the codeword, called data bits, to hold the data message bits, and the remaining (n − k) bits, called check bits, to hold error correction information that protects the data bits. BCH codes are designed to guarantee that they correct up to a certain number of raw bit errors (e.g., t error bits) within each codeword, which depends on the values chosen for n and k. A stronger error correction strength (i.e., a larger t) requires more redundant check bits (i.e., (n − k)) or a longer codeword length (i.e., n).

A BCH code [14, 107, 177, 297] is a linear block code that consists of check bits generated by an algorithm. The codeword generation algorithm ensures that the check bits are selected such that the check bits can be used during a parity check to detect and correct up to t bit errors in the codeword. A BCH code is defined by (1) a generator matrix G, which informs the generation algorithm of how to generate each check bit using the data bits; and (2) a parity check matrix H, which can be applied to the codeword to detect if any errors exist. In order for a BCH code to guarantee that it can correct t errors within each codeword, the minimum separation d (i.e., the Hamming distance) between valid codewords must be at least d = 2t + 1 [297].

BCH Encoding. The codeword generation algorithm encodes a k-bit data message m into an n-bit BCH codeword c, by computing the dot product of m and the generator matrix G (i.e., c = m · G). G is defined within a finite Galois field GF(2^d) = {0, α^0, α^1, . . . , α^(2^d−1)}, where α is a primitive element of the field and d is a positive integer [77]. An SSD manufacturer constructs G from a set of polynomials g_1(x), g_2(x), . . . , g_{2t}(x), where g_i(α^i) = 0. Each polynomial generates a parity bit, which is used during decoding to determine if any errors were introduced. The i-th row of G encodes the i-th polynomial g_i(x). When decoding, the codeword c can be viewed as a polynomial c(x). Since c(x) is generated by g_i(x), which has a root α^i, α^i should also be a root of c(x). The parity check matrix H is constructed such that cH^T calculates c(α^i). Thus, the element in the i-th row and j-th column of H is H_ij = α^((j−1)(i+1)). This allows the decoder to use H to quickly determine if any of the parity bits do not match, which indicates that there are errors within the codeword. BCH codes in SSDs are typically designed to be systematic, which guarantees that a verbatim copy of the data message is embedded within the codeword. To form a systematic BCH code, the generator matrix and the parity check matrix are transformed such that they contain the identity matrix.

BCH Decoding. When the SSD controller is servicing a read request, it must extract the data bits (i.e., the k-bit data message m) from the BCH codeword that is stored in the NAND flash memory chips. Once the controller retrieves the codeword, which we call r, from NAND flash memory, it sends r to a BCH decoder. The decoder performs five steps, as illustrated in Figure 3.19, which correct the retrieved codeword r to obtain the originally-written codeword c, and then extract the data message m from c. In Step 1, the decoder uses syndrome calculation to detect if any errors exist within the retrieved codeword r. If no errors are detected, the decoder uses the retrieved codeword as the original codeword, c, and skips to Step 5. Otherwise, the decoder continues on to correct the errors and recover c. In Step 2, the decoder uses the syndromes from Step 1 to construct an error location polynomial, which encodes the locations of each detected bit error within r. In Step 3, the decoder extracts the specific location of each detected bit error from the error location polynomial. In Step 4, the decoder corrects each detected bit error in the retrieved codeword r to recover the original codeword c. In Step 5, the decoder extracts the data message from the original codeword c. We describe the algorithms most commonly used by BCH decoders in SSDs [56, 177, 191] for each step in detail below.


Figure 3.19: BCH decoding steps.
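Before detailing each step, the matrix operations involved can be made concrete with a toy example. The sketch below uses the (7,4) Hamming code, which can be viewed as the simplest single-error-correcting binary code of this family; it is only an illustration of systematic encoding and syndrome-based correction, not the long, multi-error-correcting BCH code an actual SSD would use, and the specific G and H matrices are chosen for this sketch.

    # An illustrative systematic linear block code showing encoding (c = m * G)
    # and syndrome calculation (as in Steps 1, 4, and 5). Real SSD BCH codes use
    # much longer codewords and correct many errors per codeword.

    import numpy as np

    P = np.array([[1,1,0],[1,0,1],[0,1,1],[1,1,1]])            # parity part
    G = np.hstack([np.eye(4, dtype=int), P])                    # G = [I_k | P]
    H = np.hstack([P.T, np.eye(3, dtype=int)])                  # H = [P^T | I_{n-k}]

    m = np.array([1, 0, 1, 1])                                   # 4-bit data message
    c = m.dot(G) % 2                                              # systematic codeword

    r = c.copy()
    r[5] ^= 1                                                     # inject one bit error

    syndrome = r.dot(H.T) % 2                                     # Step 1: detect errors
    if syndrome.any():
        # For a single error, the syndrome matches the column of H at the error position.
        err_pos = next(j for j in range(7) if (H[:, j] == syndrome).all())
        r[err_pos] ^= 1                                           # Step 4: correct the bit

    assert (r == c).all()
    print("decoded message:", r[:4])                              # Step 5: first k bits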

Step 1—Syndrome Calculation: To determine whether the retrieved codeword r contains any errors, the decoder computes the syndrome vector, S, which indicates how many of the parity check polynomials no longer match with the parity bits originally computed during encoding. The i-th syndrome, S_i, is set to one if parity bit i does not match its corresponding polynomial, and to zero otherwise. To calculate S, the decoder calculates the dot product of r and the parity check matrix H (i.e., S = r · H). If every syndrome in S is set to 0, the decoder does not detect any errors within the codeword, and skips to Step 5. Otherwise, the decoder proceeds to Step 2.

Step 2—Constructing the Error Location Polynomial: A state-of-the-art BCH decoder uses the Berlekamp–Massey algorithm [11, 48, 209, 280] to construct an error location polynomial, σ(x), whose roots encode the error locations of the codeword:

σ(x) = 1 + σ_1 · x + σ_2 · x^2 + . . . + σ_b · x^b    (3.5)

In Equation 3.5, b is the number of raw bit errors in the codeword. The polynomial is constructed using an iterative process. Since b is not known initially, the algorithm initially assumes that b = 0 (i.e., σ(x) = 1). Then, it updates σ(x) by adding a correction term to the equation in each iteration, until σ(x) successfully encodes all of the errors that were detected during syndrome calculation. In each iteration, a new correction term

is calculated using both the syndromes from Step 1 and the σ(x) equations from prior iterations of the algorithm, as long as these prior values of σ(x) satisfy certain conditions. This algorithm successfully finds σ(x) after n = (t + b)/2 iterations, where t is the maximum number of bit errors correctable by the BCH code [77]. Note that (1) the highest order of the polynomial, b, is directly correlated with the number of errors in the codeword; (2) the number of iterations, n, is also proportional to the number of errors; (3) each iteration is compute-intensive, as it involves several multiply and add operations; and (4) this algorithm cannot be parallelized across iterations, as the computation in each iteration is dependent on the previous ones.

Step 3—Extracting Bit Error Locations from the Error Polynomial: A state-of-the-art decoder applies the Chien search [53, 297] on the error location polynomial to find the location of all raw bit errors that have been detected during Step 1 in the retrieved codeword r. Each bit error location is encoded with a known function f [280]. The error polynomial from Step 2 is constructed such that if the i-th bit of the codeword has an error, the error location polynomial σ(f(i)) = 0; otherwise, if the i-th bit does not have an error, σ(f(i)) ≠ 0. The Chien search simply uses trial-and-error (i.e., tests if σ(f(i)) is zero), testing each bit in the codeword starting at bit 0. As the decoder needs to correct only the first k bits of the codeword that contain the data message m, the Chien search needs to evaluate only k different values of σ(f(i)). The algorithm builds a bit vector e, which is the same length as the retrieved codeword r, where the i-th bit of e is set to one if bit i of r contains a bit error, and is set to zero if bit i of r does not contain an error, or if i ≥ k (since there is no need to correct the parity bits). Note that (1) the calculation of σ(f(i)) is compute-intensive, but can be parallelized because the calculation of each bit i is independent of the other bits, and (2) the complexity of Step 3 is linearly correlated with the number of detected errors in the codeword.

Step 4—Correcting the Bit Errors: The decoder corrects each detected bit error location by flipping the bit at that location in the retrieved codeword r. This simply involves XORing r with the error vector e created in Step 3. After the errors are corrected, the decoder now has the estimated value of the originally-written codeword c (i.e., c = r ⊕ e). The decoded version of c is only an estimate of the original codeword, since if r contains more bit errors than the maximum number of errors (t) that the BCH code can correct, there may be some uncorrectable errors that were not detected during syndrome calculation (Step 1). In such cases, the decoder cannot guarantee that it has determined the actual original codeword. In a modern SSD, the bit error rate of a codeword after BCH correction is expected to be less than 10^−15 [121].

Step 5—Extracting the Message from the Codeword: As we discuss above, during BCH codeword encoding, the generator matrix G contains the identity matrix, to ensure that the k-bit message m is embedded verbatim into the codeword c. Therefore, the decoder recovers m by simply truncating the last (n − k) bits from the n-bit codeword c.

BCH Decoder Latency Analysis. We can model the latency of the state-of-the-art BCH decoder (T_BCH^dec) that we described above as:

T_BCH^dec = T_Syndrome + N · T_Berlekamp + (k/p) · T_Chien    (3.6)

In Equation 3.6, T_Syndrome is the latency for calculating the syndrome, which is determined by the size of the parity check matrix H; T_Berlekamp is the latency of one iteration of the Berlekamp–Massey algorithm; N is the total number of iterations that the Berlekamp–Massey algorithm performs; T_Chien is the latency for deciding whether or not a single bit location contains an error, using the Chien search; k is the length of the data message m; and p is the number of bits that are processed in parallel in Step 3. In this equation, T_Syndrome, T_Berlekamp, k, and p are constants for a BCH decoder implementation, while N and T_Chien are proportional to the raw bit error count of the codeword. Note that Steps 4 and 5 can typically be implemented such that they take less than one clock cycle in modern hardware, and thus their latencies are not included in Equation 3.6.
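The latency model in Equation 3.6 can be evaluated with a few lines of Python. The timing constants below (syndrome latency, per-iteration Berlekamp–Massey latency, per-bit Chien latency, k, and p) are hypothetical; the real values depend on the decoder implementation and clock frequency.

    # A quick evaluation of the BCH decoding latency model in Equation 3.6,
    # using hypothetical timing constants.

    def bch_decode_latency_ns(t_syndrome, t_berlekamp, n_iterations, t_chien, k, p):
        # T_BCH^dec = T_Syndrome + N * T_Berlekamp + (k / p) * T_Chien
        return t_syndrome + n_iterations * t_berlekamp + (k / p) * t_chien

    # More raw bit errors -> more Berlekamp-Massey iterations (N), so the
    # decoding latency grows with the error count of the codeword.
    for n_iter in (1, 4, 16):
        latency = bch_decode_latency_ns(t_syndrome=200.0, t_berlekamp=50.0,
                                        n_iterations=n_iter, t_chien=0.5,
                                        k=8192, p=64)
        print("N = %2d iterations -> %.0f ns" % (n_iter, latency))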

Low-Density Parity-Check (LDPC) Codes

LDPC codes [84, 85, 203, 204, 297] are now used widely in modern SSDs, as LDPC codes provide a stronger error correction capability than BCH codes, albeit at a greater storage cost [326, 362]. LDPC codes are one type of capacity-approaching codes, which are error-correcting codes that come close to the Shannon limit, i.e., the maximum number of data message bits (k_max) that can be delivered without errors for a certain codeword size (n) under a given error rate [292, 293]. Unlike BCH codes, LDPC codes cannot guarantee that they will correct a minimum number of raw bit errors. Instead, a good LDPC code guarantees that the failure rate (i.e., the fraction of all reads where the LDPC code cannot successfully correct the data) is less than a target rate for a given number of bit errors. Like BCH codes, LDPC codes for SSDs are designed to be systematic, i.e., to contain the data message verbatim within the codeword.

An LDPC code [84, 85, 203, 204, 297] is a linear code that, like a BCH code, consists of check bits generated by an algorithm. For an LDPC code, these check bits are used to form a bipartite graph, where one side of the graph contains nodes that represent each bit in the codeword, and the other side of the graph contains nodes that represent the parity check equations used to generate each parity bit. When a codeword containing errors is retrieved from memory, an LDPC decoder applies belief propagation [265] to iteratively identify the bits within the codeword that are most likely to contain a bit error.

An LDPC code is defined using a binary parity check matrix H, where H is very sparse (i.e., there are few ones in the matrix). Figure 3.20a shows an example H matrix for a seven-bit codeword c (see Section 2.1.3). For an n-bit codeword that encodes a k-bit data message, H is sized to be an (n − k) × n matrix. Within the matrix, each row represents a parity check equation, while each column represents one of the seven bits in the codeword. As our example matrix has three rows, this means that our error correction uses three parity check equations (denoted as f). A bit value 1 in row i, column j indicates that parity check equation f_i contains bit c_j. Each parity check equation XORs all of the codeword bits in the equation to see whether the output is zero. For example, parity check equation f_1 from the H matrix in Figure 3.20a is:

f_1 = c_1 ⊕ c_2 ⊕ c_4 ⊕ c_5 = 0    (3.7)

This means that c is a valid codeword only if H · c^T = 0, where c^T is the transpose matrix of the codeword c.

(a) H matrix (rows f_0, f_1, f_2; columns c_0–c_6):
    [ 1 0 1 1 1 0 0 ]
    [ 0 1 1 0 1 1 0 ]
    [ 1 1 0 1 0 1 1 ]
(b) Tanner graph: check nodes F_0–F_2 connected to the bit nodes C_0–C_6 that appear in the corresponding parity check equations.

Figure 3.20: Example LDPC code for a seven-bit codeword with a four-bit data message (stored in bits c0, c1, c2, and c3) and three parity check equations (i.e., n = 7, k = 4), represented as (a) an H matrix and (b) a Tanner graph.

In order to perform belief propagation, H can be represented using a Tanner graph [313]. A Tanner graph is a bipartite graph that contains check nodes, which represent the parity check equations, and bit nodes, which represent the bits in the codeword. An edge connects a check node Fi to a bit node Cj only if parity check equation fi contains bit c j. Figure 3.20b shows the Tanner graph that corresponds to the H matrix in Figure 3.20a. For example, since parity check equation f1 uses codeword bits c1, c2, c4, and c5, the F1 check node in Figure 3.20b is connected to bit nodes C1, C2, C4, and C5.
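The correspondence between H, the parity check equations, and the Tanner graph can be verified with a few lines of Python. The sketch below uses the example H matrix from Figure 3.20a; the variable names and test vectors are chosen only for illustration.

    # Check whether a 7-bit word satisfies the three parity check equations of
    # the example H matrix (Figure 3.20a), and derive the Tanner-graph adjacency.

    import numpy as np

    H = np.array([[1, 0, 1, 1, 1, 0, 0],    # f0
                  [0, 1, 1, 0, 1, 1, 0],    # f1
                  [1, 1, 0, 1, 0, 1, 1]])   # f2

    def is_valid_codeword(c):
        # c is valid only if H . c^T = 0 (mod 2), i.e., every parity check is satisfied.
        return not (H.dot(c) % 2).any()

    # Tanner graph: check node F_i is connected to bit node C_j iff H[i][j] == 1.
    edges = {f"F{i}": [f"C{j}" for j in range(H.shape[1]) if H[i, j]]
             for i in range(H.shape[0])}
    print(edges["F1"])                       # ['C1', 'C2', 'C4', 'C5'], as in Figure 3.20b

    print(is_valid_codeword(np.zeros(7, dtype=int)))            # True: all-zero word
    print(is_valid_codeword(np.array([1, 0, 0, 0, 0, 0, 0])))   # False: violates f0 and f2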

LDPC Encoding. As was the case with BCH, the LDPC codeword generation algorithm encodes a k-bit data message m into an n-bit LDPC codeword c by computing the dot product of m and a generator matrix G (i.e., c = m · G). For an LDPC code, the generator matrix is designed to (1) preserve m verbatim within the codeword, and (2) generate the parity bits for each parity check equation in H. Thus, G is defined using the parity check matrix H. With linear algebra based transformations, H can be expressed in the form H = [A, I_(n−k)], where H is composed of A, an (n − k) × k binary matrix, and I_(n−k), an (n − k) × (n − k) identity matrix [129]. The generator matrix G can then be created using the composition G = [I_k, A^T], where A^T is the transpose matrix of A.
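The sketch below illustrates this construction, assuming the parity check matrix is already given in the systematic form H = [A, I_(n−k)]. The A matrix here is invented for illustration; a real LDPC code uses a much larger and sparser H.

    # Systematic encoding from H = [A, I_(n-k)]: build G = [I_k, A^T], encode a
    # message, and verify that every parity check is satisfied.

    import numpy as np

    A = np.array([[1, 0, 1, 1],      # (n-k) x k = 3 x 4 binary matrix (illustrative)
                  [1, 1, 1, 0],
                  [0, 1, 1, 1]])
    n_minus_k, k = A.shape

    H = np.hstack([A, np.eye(n_minus_k, dtype=int)])   # H = [A, I_(n-k)]
    G = np.hstack([np.eye(k, dtype=int), A.T])         # G = [I_k, A^T]

    m = np.array([1, 1, 0, 1])                          # k-bit data message
    c = m.dot(G) % 2                                    # c = m . G (message kept verbatim)

    assert (c[:k] == m).all()                           # systematic: data bits come first
    assert not (H.dot(c) % 2).any()                     # every parity check is satisfied
    print("codeword:", c)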

LDPC Decoding. When the SSD controller is servicing a read request, it must extract the k-bit data message from the LDPC codeword r that is stored in NAND flash memory. In an SSD, an LDPC decoder performs multiple levels of decoding [77, 323, 362], which correct the retrieved codeword r to obtain the originally-written codeword c and extract the data message m from c. Initially, the decoder performs a single level of hard decoding, where it uses the information from a single read operation on the codeword to attempt to correct the codeword bit errors. If the decoder cannot correct all errors using hard decoding, it then initiates the first level of soft decoding, where a second read operation is performed on the same codeword using a different set of read reference voltages. The second read provides additional information on the probability that each bit in the codeword is a zero or a one. An LDPC decoder typically uses multiple levels of soft decoding, where each new level performs an additional read operation to calculate a more accurate probability for each bit value. We discuss multi-level soft decoding in detail in Section 3.3.2. For each level, the decoder performs five steps, as illustrated in Figure 3.21. At each level,

64 the decoder uses two pieces of information to determine which bits are most likely to contain errors: (1) the probability that each bit in r is a zero or a one, and (2) the parity check equations. In Step 1 (Figure 3.21), the decoder computes an initial log likelihood ratio (LLR) for each bit of the stored codeword. We refer to the initial codeword LLR values as L, where L j is the LLR value for bit j of the codeword. L j expresses the likelihood (i.e., confidence) that bit j should be a zero or a one, based on the current threshold voltage of the NAND flash cell where bit j is stored. The decoder uses L as the initial LLR message generated using the bit nodes. An LLR message consists of the LLR values for each bit, which are updated by and communicated between the check nodes and bit nodes during each step of belief propagation.3 In Steps 2 through 4, the belief propagation algorithm [265] iteratively updates the LLR message, using the Tanner graph to identify those bits that are most likely to be incorrect (i.e., the codeword bits whose (1) bit nodes are connected to the largest number of check nodes that currently contain a parity error, and (2) LLR values indicate low confidence). Several decoding algorithms exist to perform belief propagation for LDPC codes. The most commonly-used belief propagation algorithm is the min-sum algorithm [49, 80], a simplified version of the original sum-product algorithm for LDPC [84, 85] with near-equivalent error correction capability [5]. During each iteration of the min-sum algorithm, the decoder identifies a set of codeword bits that likely contain errors and thus need to be flipped. The decoder accomplishes this by (1) having each check node use its parity check information to determine how much the LLR value of each bit should be updated by, using the most recent LLR messages from the bit nodes; (2) having each bit node gather the LLR updates from each bit to generate a new LLR value for the bit, using the most recent LLR messages from the check nodes; and (3) using the parity check equations to see if the values predicted by the new LLR message for each node are correct. The min-sum algorithm terminates under one of two conditions: (1) the predicted bit values after the most recent iteration are all correct, which means that the decoder now has an estimate of the original codeword c, and can advance to Step 5; or (2) the algorithm exceeds a predetermined number of iterations, at which point the decoder moves onto the next decoding level, or returns a decoding failure if the maximum number of decoding levels have been performed. In Step 5, once the errors are corrected, and the decoder has the original codeword c, the decoder extracts the k-bit data message m from the codeword. We describe the steps used by a state-of-the-art decoder in detail below, which uses an optimized version of the min-sum algorithm that can be implemented efficiently in hardware [92, 93].

3Note that an LLR message is not the same as the k-bit data message. The data message refers to the actual data stored within the SSD, which, when read, is modeled in information theory as a message that is transmitted across a noisy communication channel. In contrast, an LLR message refers to the updated LLR values for each bit of the codeword that are exchanged between the check nodes and the bit nodes during belief propagation. Thus, there is no relationship between a data message and an LLR message.


Figure 3.21: LDPC decoding steps for a single level of hard or soft decoding.

Step 1—Computing the Log Likelihood Ratio (LLR): The LDPC decoder uses the probability (i.e., likelihood) that a bit is a zero or a one to identify errors, instead of using the bit values directly. The log likelihood ratio (LLR) is the probability that a certain bit is zero, i.e., P(x = 0 | V_th), over the probability that the bit is one, i.e., P(x = 1 | V_th), given a certain threshold voltage range (V_th) bounded by two threshold voltage values (i.e., the maximum and the minimum voltage of the threshold voltage range) [326, 362]:

LLR = log [ P(x = 0 | V_th) / P(x = 1 | V_th) ]    (3.8)

The sign of the LLR value indicates whether the bit is likely to be a zero (when the LLR value is positive) or a one (when the LLR value is negative). A larger magnitude (i.e., absolute value) of the LLR value indicates a greater confidence that a bit should be zero or one, while an LLR value closer to zero indicates low confidence. The bits whose LLR values have the smallest magnitudes are the ones that are most likely to contain errors.

There are several alternatives for how to compute the LLR values. A common approach for LLR computation is to treat a flash cell as a communication channel, where the channel takes an input program signal (i.e., the target threshold voltage for the cell) and outputs an observed signal (i.e., the current threshold voltage of the cell) [23]. The observed signal differs from the input signal due to the various types of NAND flash memory errors. The communication channel model allows us to break down the threshold voltage of a cell into two components: (1) the expected signal; and (2) the additive signal noise due to errors. By enabling the modeling of these two components separately, the communication channel model allows us to estimate the current threshold voltage distribution of each state [23]. The threshold voltage distributions can be used to predict how likely a cell within a certain voltage region is to belong to a particular voltage state.
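Equation 3.8 itself is a one-line computation once the per-range probabilities are known. The probabilities in the sketch below are invented for illustration; in practice they would come from characterization of the threshold voltage distributions, as discussed next.

    # A minimal sketch of Equation 3.8: turning per-bin probabilities into LLR values.

    import math

    def llr(p_zero_given_vth, p_one_given_vth):
        # LLR = log( P(x = 0 | Vth) / P(x = 1 | Vth) ); positive -> likely 0,
        # negative -> likely 1, magnitude -> confidence.
        return math.log(p_zero_given_vth / p_one_given_vth)

    # Example: a cell read into a range deep inside the "0" distribution vs. a cell
    # read into a range where the two distributions overlap.
    print(llr(0.999, 0.001))   # ~ +6.9: high confidence that the bit is 0
    print(llr(0.45, 0.55))     # ~ -0.2: low confidence, a likely error candidate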

One popular variant of the communication channel model assumes that the threshold voltage distribution of each state can be modeled as a Gaussian distribution [23]. If we use the mean observed threshold voltage of each state (denoted as µ) to represent the signal, we find that the P/E cycling noise (i.e., the shift in the distribution of threshold voltages due to the accumulation of charge from repeated programming operations; see Section 3.1.1) can be modeled as additive white Gaussian noise (AWGN) [23], which is represented by the standard deviation of the distribution (denoted as σ). The closed-form AWGN-based model can be used to determine the LLR value for a cell with threshold voltage y, as follows:

LLR(y) = (µ_1^2 − µ_0^2) / (2σ^2) + y(µ_0 − µ_1) / σ^2    (3.9)

where µ_0 and µ_1 are the mean threshold voltages for the distributions of the threshold voltage states for bit value 0 and bit value 1, respectively, and σ is the standard deviation of both distributions (assuming that the standard deviation of each threshold voltage state distribution is equal). Since the SSD controller uses threshold voltage ranges to categorize a flash cell, we can substitute µ_Rj, the mean threshold voltage of the threshold voltage range R_j, in place of y in Equation 3.9. The AWGN-based LLR model in Equation 3.9 provides only an estimate of the LLR, because (1) the actual threshold voltage distributions observed in NAND flash memory are not perfectly Gaussian in nature [23, 195]; (2) the controller uses the mean voltage of the threshold voltage range to approximate the actual threshold voltage of a cell; and (3) the standard deviations of each threshold voltage state distribution are not perfectly equal. A number of methods have been proposed to improve upon the AWGN-based LLR estimate by: (1) using nonlinear transformations to convert the AWGN-based LLR into a more accurate LLR value [341]; (2) scaling and rounding the AWGN-based LLR to compensate for the estimation error [342]; (3) initially using the AWGN-based LLR to read the data, and, if the read fails, using the ECC information from the failed read attempt to optimize the LLR and to perform the read again with the optimized LLR [66]; and (4) using online and offline training to empirically determine the LLR values under a wide range of conditions (e.g., P/E cycle count, retention time, read disturb count) [340]. The SSD controller can either compute the LLR values at runtime, or statically store precomputed LLR values in a table. Once the decoder calculates the LLR values for each bit of the codeword, which we call the initial LLR message L, the decoder starts the first iteration of the min-sum algorithm (Steps 2–4 below).

Step 2—Check Node Processing: In every iteration of the min-sum algorithm, each check node i (see Figure 3.20) generates a revised check node LLR message R_ij to send to each bit node j (see Figure 3.20) that is connected to check node i. The decoder computes R_ij as:

R_ij = δ_ij κ_ij    (3.10)

where δ_ij is the sign of the LLR message, and κ_ij is the magnitude of the LLR message. The decoder determines the values of both δ_ij and κ_ij using the bit node LLR message Q′_ji. At a high level, each check node collects LLR values sent from each bit node (Q′_ji), and then determines how much each bit's LLR value should be adjusted using the parity information available at the

check node. These LLR value updates are then bundled together into the LLR message R_ij. During the first iteration of the min-sum algorithm, the decoder sets Q′_ji = L_j, the initial LLR value from Step 1. In subsequent iterations, the decoder uses the value of Q′_ji that was generated in Step 3 of the previous iteration. The decoder calculates δ_ij, the sign of the check node LLR message, as:

δ_ij = ∏_J sgn(Q′_Ji)    (3.11)

where J represents all bit nodes connected to check node i except for bit node j. The sign of a bit node indicates whether the value of a bit is predicted to be a zero (if the sign is positive) or a one (if the sign is negative). The decoder calculates κ_ij, the magnitude of the check node LLR message, as:

κ_ij = min_J |Q′_Ji|    (3.12)

In essence, the smaller the magnitude of Q′_ji is, the more uncertain we are about whether the bit should be a zero or a one. At each check node, the decoder updates the LLR value of each bit node j, adjusting the LLR by the smallest value of Q′ for any of the other bits connected to the check node (i.e., the LLR value of the most uncertain bit aside from bit j).

Step 3—Bit Node Processing: Once each check node generates the LLR messages for each bit node, we combine the LLR messages received by each bit node to update the LLR value of the bit. The decoder first generates the LLR messages to be used by the check nodes in the next iteration of the min-sum algorithm. The decoder calculates the bit node LLR message Q_ji to send from bit node j to check node i as follows:

Q_ji = L_j + ∑_I R_Ij    (3.13)

where I represents all check nodes connected to bit node j except for check node i, and L_j is the original LLR value for bit j generated in Step 1. In essence, for each check node, the bit node LLR message combines the LLR messages from the other check nodes to ensure that all of the LLR value updates are propagated globally across all of the check nodes.

Step 4—Parity Check: After the bit node processing is complete, the decoder uses the revised LLR information to predict the value of each bit. For bit node j, the predicted bit value P_j is calculated as:

P_j = L_j + ∑_i R_ij    (3.14)

where i represents all check nodes connected to bit node j, including check node i, and L_j is the original LLR value for bit j generated in Step 1. If P_j is positive, bit j of the original codeword c is predicted to be a zero; otherwise, bit j is predicted to be a one. Once the predicted values have been computed for all bits of c, the H matrix is used to check the parity, by computing H · c^T. If H · c^T = 0, then the predicted bit values are correct, the min-sum algorithm terminates, and the decoder goes to Step 5. Otherwise, at least one bit is still incorrect, and the decoder goes back to Step 2 to perform the next iteration of the min-sum algorithm. In the next iteration, the min-sum algorithm uses the updated LLR values from the current iteration to identify the next set of bits that are most likely incorrect and need to be flipped.

The current decoding level fails to correct the data when the decoder cannot determine the correct codeword bit values after a predetermined number of min-sum algorithm iterations. If the decoder has more soft decoding levels left to perform, it advances to the next soft decoding level. For the new level, the SSD controller performs an additional read operation using a different set of read reference voltages than the ones it used for the prior decoding levels. The decoder then goes back to Step 1 to generate the new LLR information, using the output of all of the read operations performed for each decoding level so far. We discuss how the number of decoding levels and the read reference voltages are determined, as well as what happens if all soft decoding levels fail, in Section 3.3.2.

Step 5—Extracting the Message from the Codeword: As we discuss above, during LDPC codeword encoding, the generator matrix G contains the identity matrix, to ensure that the codeword c includes a verbatim version of m. Therefore, the decoder recovers the k-bit data message m by simply truncating the last (n − k) bits from the n-bit codeword c.
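Steps 2 through 4 can be pulled together into a compact, illustrative min-sum decoder over the example H matrix of Figure 3.20a. The initial LLR values below are invented (the stored codeword is the all-zero codeword, with bit c_2 read with low confidence toward "1"); a production decoder would use the optimized hardware formulation cited above rather than this dense-matrix sketch.

    # A compact sketch of one level of min-sum decoding (Steps 2-4).

    import numpy as np

    H = np.array([[1, 0, 1, 1, 1, 0, 0],
                  [0, 1, 1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0, 1, 1]])
    L = np.array([4.0, 4.0, -1.0, 4.0, 4.0, 4.0, 4.0])   # Step 1: initial LLR message

    def min_sum_decode(H, L, max_iters=10):
        n_checks, n_bits = H.shape
        Q = H * L                  # bit-to-check messages, initialized to L_j on each edge
        for _ in range(max_iters):
            # Step 2: check node processing -> R[i, j] = sign product * min magnitude
            R = np.zeros_like(Q)
            for i in range(n_checks):
                cols = np.flatnonzero(H[i])
                for j in cols:
                    others = [Q[i, jj] for jj in cols if jj != j]
                    R[i, j] = np.prod(np.sign(others)) * min(abs(v) for v in others)
            # Step 4 (using the Step 2 messages): posterior LLRs and parity check
            P = L + R.sum(axis=0)
            c_hat = (P < 0).astype(int)                 # positive LLR -> 0, negative -> 1
            if not (H.dot(c_hat) % 2).any():
                return c_hat                            # all parity checks satisfied
            # Step 3: bit node processing -> Q[i, j] = L_j + sum of other checks' messages
            Q = H * (L + R.sum(axis=0)) - R
        return None                                     # this decoding level failed

    print(min_sum_decode(H, L))   # [0 0 0 0 0 0 0]: the unreliable bit is corrected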

3.3.2 Error Correction Flow

For both BCH and LDPC codes, the SSD controller performs several stages of error correction to retrieve the data, known as the error correction flow. The error correction flow is invoked when the SSD performs a read operation. The SSD starts the read operation by using the initial read reference voltages (V_initial; see Section 3.2.5) to read the raw data stored within a page of NAND flash memory into the controller. Once the raw data is read, the controller starts error correction.

Algorithm 1 lists the three stages of an example error correction flow, which can be used to decode either BCH codes or LDPC codes. In the first stage, the ECC engine performs hard decoding on the raw data. In hard decoding, the ECC engine uses only the hard bit value information (i.e., either a 1 or a 0) read for a cell using a single set of read reference voltages. If the first stage succeeds (i.e., the controller detects that the error rate of the data after correction is lower than a predetermined threshold), the flow finishes. If the first stage fails, then the flow moves on to the second stage of error correction. The second stage differs significantly for BCH and for LDPC, which we discuss below. If the second stage succeeds, the flow terminates; otherwise, the flow moves to the third stage of error correction. In the third stage, the controller tries to correct the errors using the more expensive superpage-level parity recovery (see Section 2.1.3). The steps for superpage-level parity recovery are shown in the third stage of Algorithm 1. If the data can be extracted successfully from the other pages in the superpage, the data from the target page can be recovered. Whenever data is successfully decoded or recovered, the data is sent to the host (and it is also reprogrammed into a new physical page to ensure that the corrected data values are stored for the logical page). Otherwise, the SSD controller reports an uncorrectable error to the host. Figure 3.22 compares the error correction flow with BCH codes to the flow with LDPC codes. Next, we discuss the flows used with both BCH codes (Section 3.3.2) and LDPC codes (Section 3.3.2).

Algorithm 1: Example BCH/LDPC Error Correction Procedure
  First Stage: BCH/LDPC Hard Decoding
    Controller gets stored V_initial values to use as V_ref
    Flash chips read page using V_ref
    ECC decoder decodes BCH/LDPC
    if ECC succeeds then
      Controller sends data to host; exit algorithm
    else if number of stage iterations not exceeded then
      Controller invokes V_ref optimization to find new V_ref; repeats stage
    end
  Second Stage (BCH only): NAC
    Controller reads immediately-adjacent wordline W
    while ECC fails and all possible voltage states for adjacent wordline not yet tried do
      Controller goes to next neighbor voltage state V
      Controller sets V_ref based on neighbor voltage state V
      Flash chips read page using V_ref
      Controller corrects cells adjacent to W's cells that were programmed to V
      ECC decoder decodes BCH
      if ECC succeeds then
        Controller sends data to host; exit algorithm
      end
    end
  Second Stage (LDPC only): Level X LDPC Soft Decoding
    while ECC fails and X < maximum level N do
      Controller selects optimal value of V_ref^X
      Flash chips do read-retry using V_ref^X
      Controller recomputes LLR_X^R0 to LLR_X^RX
      ECC decoder decodes LDPC
      if ECC succeeds then
        Controller sends data to host; exit algorithm
      else
        Controller goes to soft decoding level X+1
      end
    end
  Third Stage: Superpage-Level Parity Recovery
    Flash chips read all other pages in the superpage
    Controller XORs all other pages in the superpage
    if data extraction succeeds then
      Controller sends data to host
    else
      Controller reports uncorrectable error
    end

Figure 3.22: (a) Example error correction flow using BCH codes and LDPC codes. (b) The corresponding average latency and codeword failure rate of each BCH/LDPC stage. Adapted from [32].

Flow Stages for BCH Codes. An example flow of the stages for BCH decoding is shown on the left-hand side of Figure 3.22a. In the first stage, the ECC engine performs BCH hard decoding on the raw data, which reports the total number of bit errors in the data. If the data cannot be corrected by the implemented BCH codes, many controllers invoke read-retry (Section 3.2.4) or read reference voltage optimization (Section 3.2.5) to find a new set of read reference voltages (V_ref) that lower the raw bit error rate of the data from the error rate when using V_initial. The controller uses the new V_ref values to read the data again, and then repeats the BCH decoding. We discuss the algorithm used to perform BCH decoding in Section 3.3.1. If the controller exhausts the maximum number of read attempts (specified as a parameter in the controller), it employs correction techniques such as neighbor-cell-assisted correction (NAC; see Section 3.2.2) to further reduce the error rate, as shown in the second BCH stage of Algorithm 1. If NAC cannot successfully read the data, then the controller tries to correct the errors using the more expensive superpage-level parity recovery (see Section 2.1.3).

Flow Stages for LDPC Codes. An example flow of the stages for LDPC decoding is shown on the right-hand side of Figure 3.22a. LDPC decoding consists of three major steps. First, the SSD controller performs LDPC hard decoding, where the controller reads the data using the optimal read reference voltages. The process for LDPC hard decoding is similar to that of BCH hard decoding (as shown in the first stage of Algorithm 1), but does not typically invoke read-retry if the first read attempt fails. Second, if LDPC hard decoding cannot correct all of the errors, the controller uses LDPC soft decoding to decode the data (which we describe in detail below). Third, if LDPC soft decoding also cannot correct all of the errors, the controller invokes superpage-level parity. We discuss the algorithm used to perform hard and soft decoding for LDPC codes in Section 3.3.1.

Soft Decoding. Unlike BCH codes, which require the invocation of expensive superpage-level parity recovery immediately if the hard decoding attempts (i.e., BCH hard decoding with read-retry or NAC) fail to return correct data, LDPC decoding fails more gracefully: it can perform multiple levels of soft decoding (shown in the second stage of Algorithm 1) after hard decoding fails before invoking superpage-level parity recovery [326, 362].
The key idea of soft decoding is to use soft information for each cell (i.e., the probability that the cell contains a 1 or a 0) obtained from multiple reads of the cell via the use of different sets of read reference voltages [77, 84, 85, 203, 204, 297, 362]. Soft information is typically represented by the log likelihood ratio (LLR; see Section 3.3.1). Every additional level of soft decoding (i.e., the use of a new set of read reference voltages, which we call V_ref^X for level X) increases the strength of the error correction, as the level adds new information about the cell (as opposed to hard decoding, where a new decoding step simply replaces prior information about the cell). The new read reference voltages, unlike the ones used for hard decoding, are optimized such that the amount of useful information (or mutual information) provided to the LDPC decoder is maximized [326]. Thus, the use of soft decoding reduces the frequency at which superpage-level parity needs to be invoked.

Figure 3.23 illustrates the read reference voltages used during LDPC hard decoding and during the first two levels of LDPC soft decoding. At each level, a new read reference voltage is applied, which divides an existing threshold voltage range into two ranges. Based on the bit values read using the various read reference voltages, the SSD controller bins each cell into a certain V_th range, and sends the bin categorization of all the cells to the LDPC decoder. For each cell, the decoder applies an LLR value, precomputed by the SSD manufacturer, which corresponds to the cell's bin and decodes the data. For example, as shown in the bottom of Figure 3.23, the three read reference voltages in Level 2 soft decoding form four threshold voltage ranges (i.e., R0–R3). Each of these ranges corresponds to a different LLR value (i.e., LLR_2^R0 to LLR_2^R3, where LLR_i^Rj is the LLR value for range Rj in soft decoding level i). Compared with hard decoding (shown at the top of Figure 3.23), which has only two LLR values, Level 2 soft decoding provides more accurate information to the decoder, and thus has stronger error correction capability.
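The binning and table lookup can be sketched as follows. Note that a real controller does not measure a cell's threshold voltage directly; the bin is inferred from the sequence of read results at each applied V_ref. The voltage positions and LLR table entries below are illustrative placeholders.

    # A sketch of mapping a cell into one of the Level-2 threshold voltage ranges
    # (R0-R3) and looking up a precomputed LLR for that range.

    V_REF = [0.8, 1.0, 1.2]                  # positions of the three Level-2 read voltages (illustrative)
    LLR_TABLE = [6.5, 1.2, -1.1, -6.3]       # one precomputed LLR per range R0..R3 (illustrative)

    def bin_cell(vth):
        # Count how many read reference voltages the cell's threshold voltage exceeds;
        # that count is the range index (R0 to R3). In hardware, this count is
        # derived from the read results rather than from a measured voltage.
        return sum(vth > v for v in sorted(V_REF))

    for vth in (0.5, 0.9, 1.1, 1.5):
        r = bin_cell(vth)
        print(f"Vth = {vth:.1f} V -> range R{r}, LLR = {LLR_TABLE[r]:+.1f}")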

Figure 3.23: LDPC hard decoding and the first two levels of LDPC soft decoding, showing the V_ref value added at each level, and the resulting threshold voltage ranges (R0–R3) used for flash cell categorization. Adapted from [32].

Determining the Number of Soft Decoding Levels. If the final level of soft decoding, i.e., level N in Figure 3.22a, fails, the controller attempts to read the data using superpage-level parity (see Section 2.1.3). The number of levels used for soft decoding depends on the improved reliability that each additional level provides, taking into account the latency of performing additional decoding. Figure 3.22b shows a rough estimation of the average latency and the codeword failure rate for each stage. There is a tradeoff between the number of levels employed for soft decoding and the expected read latency. For a smaller number of levels, the additional reliability can be worth the latency penalty. For example, while a five-level soft decoding step requires up to 480 µs, it effectively reduces the codeword failure rate by five orders of magnitude. This not only improves overall reliability, but also reduces the frequency of triggering expensive superpage-level parity recovery, which can take around 10 ms [99]. However, manufacturers limit the number of levels, as the benefit of employing an additional soft decoding level (which requires more read operations) becomes smaller due to diminishing returns in the number of additional errors corrected.
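This tradeoff can be illustrated with a back-of-the-envelope calculation of the expected read latency as a function of the number of permitted soft decoding levels. The per-level latency, parity recovery latency, and per-level failure rates below are illustrative placeholders loosely in the range suggested by Figure 3.22b, not measured values.

    # Expected read latency vs. number of allowed soft decoding levels, under
    # assumed per-stage latencies and failure rates (all values are illustrative).

    T_LEVEL_US = 80.0          # assumed latency per hard/soft decoding level
    T_PARITY_US = 10_000.0     # assumed latency of superpage-level parity recovery
    FAIL_RATE = [1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9]   # assumed failure rate after level i

    def expected_latency_us(max_soft_levels):
        latency = T_LEVEL_US                       # hard decoding always happens
        p_fail = FAIL_RATE[0]                      # probability hard decoding fails
        for lvl in range(1, max_soft_levels + 1):
            latency += p_fail * T_LEVEL_US         # extra read + decode only on failure
            p_fail = FAIL_RATE[lvl]
        latency += p_fail * T_PARITY_US            # remaining failures invoke parity recovery
        return latency

    for levels in range(0, 6):
        print(f"{levels} soft levels -> {expected_latency_us(levels):9.4f} us expected")

The output shows the diminishing returns discussed above: the first few soft levels sharply cut the expected cost of invoking superpage-level parity recovery, while later levels add reads that rarely change the outcome.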

3.3.3 BCH and LDPC Error Correction Strength

BCH and LDPC codes provide different strengths of error correction. While LDPC codes can offer a stronger error correction capability, soft LDPC decoding can lead to a greater latency for error correction. Figure 3.24 compares the error correction strength of BCH codes, hard LDPC codes, and soft LDPC codes [101]. The x-axis shows the raw bit error rate (RBER) of the data being corrected, and the y-axis shows the uncorrectable bit error rate (UBER), or the error rate after correction, once the error correction code has been applied. The UBER is defined as the ECC codeword (see Section 2.1.3) failure rate divided by the codeword length [124]. To ensure

73 a fair comparison, we choose a similar codeword length for both BCH and LDPC codes, and use a similar coding rate (0.935 for BCH, and 0.936 for LDPC) [101]. We make two observations from Figure 3.24.


Figure 3.24: Raw bit error rate versus uncorrectable bit error rate for BCH codes, hard LDPC codes, and soft LDPC codes. Reproduced from [32].

First, we observe that the error correction strength of the hard LDPC code is similar to that of the BCH codes. Thus, on its own, hard LDPC does not provide a significant advantage over BCH codes, as it provides an equivalent degree of error correction with similar latency (i.e., one read operation). Second, we observe that soft LDPC decoding provides a significant advantage in error correction capability. Contemporary SSD manufacturers target a UBER of 10^−16 [124]. The example BCH code with a coding rate of 0.935 can successfully correct data with an RBER of 1.0 × 10^−3 while remaining within the target UBER. The example LDPC code with a coding rate of 0.936 is more successful with soft decoding, and can correct data with an RBER as high as 5.0 × 10^−3 while remaining within the target UBER, based on the error rate extrapolation shown in Figure 3.24. While soft LDPC can tolerate up to five times the raw bit errors as BCH, this comes at a cost of latency (not shown on the graph), as soft LDPC can require several additional read operations after hard LDPC decoding fails, while BCH requires only the original read.

To understand the benefit of LDPC codes over BCH codes, we need to consider the combined effect of hard LDPC decoding and soft LDPC decoding. As discussed in Section 3.3.2, soft LDPC decoding is invoked only when hard LDPC decoding fails. To balance error correction strength with read performance, SSD manufacturers can require that the hard LDPC failure rate cannot exceed a certain threshold, and that the overall read latency (which includes the error correction time) cannot exceed a certain target [99, 101]. For example, to limit the impact of error correction on read performance, a manufacturer can require 99.99% of the error correction operations to be completed after a single read. To meet our example requirement, the hard LDPC failure rate should not be greater than 10^−4 (i.e., 99.99% of codewords are corrected by hard decoding alone), which corresponds to an RBER of 2.0 × 10^−3 and a UBER of 10^−8 (shown as Soft LDPC Trigger Point in Figure 3.24). For only the data that contains one or more failed codewords, soft LDPC is invoked (i.e., soft LDPC is invoked only 0.01% of the time). For our example LDPC code with a coding rate of 0.936, soft

LDPC decoding is able to correct these codewords: for an RBER of 2.0 × 10^−3, using soft LDPC results in a UBER well below 10^−16, as shown in Figure 3.24.

To gauge the combined effectiveness of hard and soft LDPC codes, we calculate the overhead of using the combined LDPC decoding over using BCH decoding. If 0.01% of the codeword corrections fail, we can assume that in the worst case, each failed codeword resides in a different flash page. As the failure of a single codeword in a flash page causes soft LDPC to be invoked for the entire flash page, our assumption maximizes the number of flash pages that require soft LDPC decoding. For an SSD with four codewords per flash page, our assumption results in up to 0.04% of the data reads requiring soft LDPC decoding. Assuming that the example soft LDPC decoding requires seven additional reads, this corresponds to 0.28% more reads when using combined hard and soft LDPC over BCH codes. Thus, with a 0.28% overhead in the number of reads performed, the combined hard and soft LDPC decoding provides twice the error correction strength of BCH codes (shown as Improvement in RBER in Figure 3.24).

In our example, the lifetime of an SSD is limited by both the UBER and whether more than 0.01% of the codeword corrections invoke soft LDPC, to ensure that the overhead of error correction does not significantly increase the read latency [101]. In this case, when the lifetime of the SSD ends, we can still read out the data correctly from the SSD, albeit at an increased read latency. This is because even though we capped the SSD lifetime to an RBER of 2.0 × 10^−3 in our example shown in Figure 3.24, soft LDPC is able to correct data with an RBER as high as 5.0 × 10^−3 while still maintaining an acceptable UBER (10^−16) based on the error rate extrapolation shown. Thus, LDPC codes have a margin, which we call the reliability margin and show in Figure 3.24. This reliability margin enables us to trade off lifetime with read latency.

We conclude that with a combination of hard and soft LDPC decoding, an SSD can offer a significant improvement in error correction strength over using BCH codes.
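The overhead arithmetic above can be reproduced with a short script, under the same worst-case assumption that every failed codeword lands in a different flash page; the failure rate, codewords per page, and soft-decoding depth are the example values used in the discussion, not universal constants.

    # Reproducing the worst-case soft LDPC invocation overhead.

    hard_ldpc_failure_rate = 1e-4     # 0.01% of codewords need soft decoding
    codewords_per_page = 4
    extra_reads_per_soft_decode = 7   # example soft decoding depth

    # Worst case: each failed codeword is in a different page, so the fraction of
    # page reads that trigger soft decoding is (failure rate x codewords per page).
    pages_needing_soft = hard_ldpc_failure_rate * codewords_per_page
    read_overhead = pages_needing_soft * extra_reads_per_soft_decode

    print(f"pages needing soft decoding: {pages_needing_soft:.2%}")   # 0.04%
    print(f"additional reads vs. BCH:    {read_overhead:.2%}")        # 0.28%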

3.3.4 SSD Data Recovery

When the number of errors in data exceeds the ECC correction capability and the error correction techniques described earlier in Section 3.3 are unable to correct the read data, data loss can occur. At this point, the SSD is considered to have reached the end of its lifetime. In order to avoid such data loss and recover (or, rescue) the data from the SSD, we can harness our understanding of data retention and read disturb behavior. The SSD controller can employ two conceptually similar mechanisms, Retention Failure Recovery (RFR) [27] and Read Disturb Recovery (RDR) [35], to undo errors that were introduced into the data as a result of data retention and read disturb, respectively. The key idea of both of these mechanisms is to exploit the wide variation among flash cells in their susceptibility to data retention loss and read disturbance, respectively, in order to correct some of the errors without the assistance of ECC, so that the remaining error count falls within the ECC error correction capability.

When a flash page read fails (i.e., uncorrectable errors exist), RFR and RDR record the current threshold voltages of each cell in the page using the read-retry mechanism (see Section 3.2.4), and identify the cells that are susceptible to generating errors due to retention and read disturb (i.e., cells that lie at the tails of the threshold voltage distributions of each state, where the distributions overlap with each other), respectively. We observe that some flash cells are more likely to be affected by retention leakage and read disturb than others, as a result of process variation [27, 35].

We call these cells retention/read disturb prone, while cells that are less likely to be affected are called retention/read disturb resistant. RFR and RDR classify the susceptible cells as retention/read disturb prone or resistant by inducing even more retention and read disturb on the failed flash page, and then recording the new threshold voltages of the susceptible cells. We classify the susceptible cells by observing the magnitude of the threshold voltage shift due to the additional retention/read disturb induction.

Figure 3.25: Some retention-prone (P) and retention-resistant (R) cells are incorrectly read after charge leakage due to retention time. RFR identifies and corrects the incorrectly read cells based on their leakage behavior. Reproduced from [32].

Figure 3.25 shows how the threshold voltage of a retention-prone cell (i.e., a fast-leaking cell, labeled P in the figure) decreases over time (i.e., the cell shifts to the left) due to retention leakage, while the threshold voltage of a retention-resistant cell (i.e., a slow-leaking cell, labeled R in the figure) does not change significantly over time. Retention Failure Recovery (RFR) uses this classification of retention-prone versus retention-resistant cells to correct the data from the failed page without the assistance of ECC. Without loss of generality, let us assume that we are studying susceptible cells near the intersection of two threshold voltage distributions X and Y, where Y contains higher voltages than X. Figure 3.25 highlights the region of cells considered susceptible by RFR using a box, labeled Susceptible. A susceptible cell within the box that is retention prone likely belongs to distribution Y, as a retention-prone cell shifts rapidly to a lower voltage (see the circled cell labeled P within the susceptible region in the figure). A retention-resistant cell in the same susceptible region likely belongs to distribution X (see the boxed cell labeled R within the susceptible region in the figure). Similarly, Read Disturb Recovery (RDR) uses the classification of read disturb prone versus read disturb resistant cells to correct data. For RDR, disturb-prone cells shift more rapidly to higher voltages, and are thus likely to belong to distribution X, while disturb-resistant cells shift little and are thus likely to belong to distribution Y. Both RFR and RDR correct the bit errors for the susceptible cells based on such expected behavior, reducing the number of errors that ECC needs to correct. RFR and RDR are highly effective at reducing the error rate of failed pages, reducing the raw bit error rate by 50% and 36%, respectively, as shown in prior works from our research group [27, 35], where more detailed information and analyses can be found.
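To make the classification step above concrete, the sketch below gives a conceptual outline (not the exact controller algorithm from [27, 35]) of labeling susceptible cells by how far their threshold voltages shift after additional retention time or read disturb is deliberately induced. The voltage window, shift cutoff, and data structures are illustrative assumptions.

    def classify_susceptible_cells(vth_before, vth_after, susceptible_window, shift_cutoff):
        """vth_before / vth_after: dict of cell_id -> threshold voltage, read via
        read-retry before and after inducing extra retention time (RFR) or read
        disturb (RDR). susceptible_window: (lo, hi) voltage range where two
        neighboring state distributions overlap. Returns cell_id -> label."""
        lo, hi = susceptible_window
        labels = {}
        for cell, v_orig in vth_before.items():
            if not (lo <= v_orig <= hi):
                continue                              # only overlap-region cells are ambiguous
            shift = abs(vth_after[cell] - v_orig)     # prone cells shift much more than resistant ones
            labels[cell] = "prone" if shift >= shift_cutoff else "resistant"
        return labels

For RFR, a cell labeled prone inside the overlap region most likely leaked down from the higher-voltage state, so its bit is re-interpreted accordingly before the page is handed back to ECC; RDR applies the mirrored reasoning for upward read disturb shifts.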

3.4 Emerging Reliability Issues for 3D NAND Flash Memory

While the demand for NAND flash memory capacity continues to grow, manufacturers have found it increasingly difficult to rely on manufacturing process technology scaling to achieve increased capacity [257]. Due to a combination of limitations in manufacturing process technology and the increasing reliability issues as manufacturers move to smaller process technology nodes, planar (i.e., 2D) NAND flash scaling has become difficult for manufacturers to sustain. This has led manufacturers to seek alternative approaches to increase NAND flash memory capacity. Recently, manufacturers have begun to produce SSDs that contain three-dimensional (3D) NAND flash memory [115, 131, 214, 216, 257, 353]. In 3D NAND flash memory, multiple layers of flash cells are stacked vertically to increase the density and to improve the scalability of the memory [353]. In order to achieve this stacking, manufacturers have changed a number of underlying properties of the flash memory design. In this section, we examine these changes, and discuss how they affect the reliability of the flash memory devices. In Section 3.4.1, we discuss the flash memory cell design commonly used in contemporary 3D NAND flash memory, and how these cells are organized across the multiple layers. In Section 3.4.2, we discuss how the reliability of 3D NAND flash memory compares to the reliability of the planar NAND flash memory that we have discussed so far in this chapter. In Section 3.4.3, we briefly discuss error mitigation mechanisms that cater to emerging reliability issues in 3D NAND flash memory. In Chapters 6 and 7, we perform a comprehensive error characterization for 3D NAND using real, state-of-the-art 3D NAND devices, and propose new techniques to mitigate raw bit errors in 3D NAND flash memory.

3.4.1 3D NAND Flash Design and Operation

As we discuss in Section 2.2.1, NAND flash memory stores data as the threshold voltage of each flash cell. In planar NAND flash memory, we achieve this using a floating-gate transistor as a flash cell, as shown in Figure 2.6. The floating-gate transistor stores charge in the floating gate of the cell, which consists of a conductive material. The floating gate is surrounded on both sides by an oxide layer. When high voltage is applied to the control gate of the transistor, charge can migrate through the oxide layers into the floating gate due to Fowler-Nordheim (FN) tunneling [81] (see Section 2.2.4). Most manufacturers use a charge trap transistor as the flash cell in 3D NAND flash memories, instead of using a floating-gate transistor. Figure 3.26 shows the cross section of a charge trap transistor. Unlike a floating-gate transistor, which stores data in the form of charge within a conductive material, a charge trap transistor stores data as charge within an insulating material, known as the charge trap. In a 3D circuit, the charge trap wraps around a cylindrical transistor substrate, which contains the source (labeled S in Figure 3.26) and drain (labeled D in the figure), and a control gate wraps around the charge trap. This arrangement allows the channel between the source and drain to form vertically within the transistor. As is the case with a floating-gate transistor, a tunnel oxide layer exists between the charge trap and the substrate, and a gate oxide layer exists between the charge trap and the control gate.

Figure 3.26: Cross section of a charge trap transistor, used as a flash cell in 3D charge trap NAND flash memory.

Despite the change in cell structure, the mechanism for transferring charge into and out of the charge trap is similar to the mechanism for transferring charge into and out of the floating gate. In 3D NAND flash memory, the charge trap transistor typically employs FN tunneling to change the threshold voltage of the charge trap [136, 257]. (Note that not all charge trap transistors rely on FN tunneling: charge trap transistors used for NOR flash memory change their threshold voltage using channel hot electron injection, also known as hot carrier injection [200].) When high voltage is applied to the control gate, electrons are injected into the charge trap from the substrate, similar to how electrons are injected into a floating gate. As this behavior is similar for both planar NAND flash memory and 3D NAND flash memory, the read, program, and erase operations remain similar for both types of NAND flash memory.

Figure 3.27 shows how multiple charge trap transistors are physically organized within 3D NAND flash memory to form flash blocks, wordlines, and bitlines (see Section 2.2.2). As mentioned above, the channel within a charge trap transistor forms vertically, as opposed to the horizontal channel that forms within a floating-gate transistor. The vertical orientation of the channel allows us to stack multiple transistors on top of each other (i.e., along the z-axis) within the chip, using 3D-stacked circuit integration. The vertically-connected channels form the bitlines of 3D NAND flash memory. Unlike in planar NAND flash memory, where only the substrates of the flash cells on the same bitline are connected together, flash cells along the same bitline in 3D NAND flash memory share a common substrate and a common insulator (i.e., charge trap). The FN tunneling induced by the control gate of a flash cell takes place only in a local region of the insulator and, thus, electrons are injected only into that local region. Due to the strong insulating properties of the material used for the insulator, different local regions of a single insulator can store different amounts of charge and, thus, different data values. This means that the data of each flash cell is stored reliably within its own local region of the shared insulator.

Figure 3.27: Organization of flash cells in an M-layer 3D charge trap NAND flash memory chip, where each block consists of M wordlines and N bitlines.

Each cell along a bitline belongs to a different layer of the flash memory chip. Thus, a bitline crosses all of the layers within the chip. Contemporary 3D NAND flash memory contains 24–96 layers [79, 131, 143, 257, 319, 353]. Along the y-axis, the control gates of cells within a single layer are connected together to form one wordline of a flash block. As we show in Figure 3.27, a block in 3D NAND flash memory consists of all of the flash cells within the same y-z plane (i.e., all cells that have the same coordinate along the x-axis). Note that, while not depicted in Figure 3.27, each bitline within a 3D NAND flash block includes a sense amplifier and two selection transistors used to select the bitline (i.e., the SSL and GSL transistors; see Section 2.2.2). The sense amplifier and selection transistors are connected in series with the charge trap transistors that belong to the same bitline, in a similar manner to the connections shown for a planar NAND flash block in Figure 2.8. More detail on the circuit-level design of 3D NAND flash memory can be found in [119, 136, 159, 311]. Due to the use of multiple layers of flash cells within a single NAND flash memory chip, which greatly increases capacity per unit area, manufacturers can achieve high cell density without the need to use small manufacturing process technologies. For example, state-of-the-art planar NAND flash memory uses the 15 nm to 19 nm feature size [195, 260]. In contrast, contemporary 3D NAND flash memory uses larger feature sizes (e.g., 30 nm to 50 nm) [282, 353]. The larger feature sizes reduce manufacturing costs, as their corresponding manufacturing process technologies are much more mature and have a higher yield than the process technologies used for small feature sizes. As we discuss in Section 3.4.2, the larger feature size also has an effect on the reliability of 3D NAND flash memory.

3.4.2 Errors in 3D NAND Flash Memory

While the high-level behavior of 3D NAND flash memory is similar to the behavior of 2D planar NAND flash memory, there are a number of differences between the reliability of 3D NAND flash and planar NAND flash, which we will look into in Chapter 6.

There are two reasons for the differences in reliability: (1) the use of charge trap transistors instead of floating-gate transistors, and (2) moving to a larger manufacturing process technology. We categorize the changes based on the reason for the change below.

Effects of Charge Trap Transistors. Compared to the reliability issues discussed in Section 3.1 for planar NAND flash memory, the use of charge trap transistors introduces two key differences: (1) early retention loss [55, 221, 353], and (2) a reduction in P/E cycling errors [257, 353]. First, early retention loss refers to the rapid leaking of electrons from a flash cell soon after the cell is programmed [55, 353]. Early retention loss occurs in 3D NAND flash memory because charge can now migrate out of the charge trap in three dimensions. In planar NAND flash memory, charge leakage due to retention occurs across the tunnel oxide, which occupies two dimensions (see Section 3.1.4). In 3D NAND flash memory, charge can leak across both the tunnel oxide and the insulator that is used for the charge trap, i.e., across three dimensions. The additional charge leakage takes place for only a few seconds after cell programming. After a few seconds have passed, the impact of leakage through the charge trap decreases, and the long-term cell retention behavior is similar to that of flash cells in planar NAND flash memory [55, 221, 353]. Second, P/E cycling errors (see Section 3.1.1) are reduced in 3D NAND flash memory because the tunneling oxide in charge trap transistors is less susceptible to breakdown than the oxide in floating-gate transistors during high-voltage operation [221, 353]. As a result, the oxide is less likely to contain trapped electrons once a cell is erased, which in turn makes it less likely that the cell is subsequently programmed to an incorrect threshold voltage. One benefit of the reduction in P/E cycling errors is that the endurance (i.e., the maximum P/E cycle count) for a 3D flash memory cell has increased by more than an order of magnitude [258, 259].

Effects of Larger Manufacturing Process Technologies. Due to the use of larger manufacturing process technologies for 3D NAND flash memory, many of the errors in 2D planar NAND flash (see Section 3.1) are not as prevalent in 3D NAND flash memory. For example, while read disturb is a prominent source of errors at small feature sizes (e.g., 20 nm to 24 nm), its effects are small at larger feature sizes [35]. Likewise, there are far fewer errors due to cell-to-cell program interference (see Section 3.1.3) in 3D NAND flash memory, as the physical distance between neighboring cells is much larger due to the increased feature size. As a result, both cell-to-cell program interference and read disturb are currently not major issues in 3D NAND flash memory reliability [257, 259, 353]. One advantage of the lower cell-to-cell program interference is that 3D NAND flash memory can use the older one-shot programming algorithm [258, 259, 356] (see Section 2.2.4). In planar NAND flash memory, one-shot programming was replaced by two-step programming (for MLC) and foggy-fine programming (for TLC) in order to reduce the impact of cell-to-cell program interference on fully-programmed cells (as we describe in Section 2.2.4). The lower interference in 3D NAND flash memory makes two-step and foggy-fine programming unnecessary. As a result, none of the cells in 3D NAND flash memory are partially-programmed, significantly reducing the number of program errors (see Section 3.1.2) that occur [259].

Unlike the effects on reliability due to the use of a charge trap transistor, which are likely longer-term, the effects on reliability due to the use of larger manufacturing process technologies are expected to be shorter-term. As manufacturers seek to further increase the density of 3D NAND flash memory, they will reach an upper limit for the number of layers that can be integrated within a 3D-stacked flash memory chip, which is currently projected to be in the range of 300–512 layers [162, 175]. At that point, manufacturers will once again need to scale down the chip to smaller manufacturing process technologies [353], which, in turn, will reintroduce high amounts of read disturb and cell-to-cell program interference (just as happened for planar NAND flash memory [24, 26, 35, 154, 256]).

3.4.3 Changes in Error Mitigation for 3D NAND Flash Memory

Due to the reduction in a number of sources of errors, fewer error mitigation mechanisms are currently needed for 3D NAND flash memory. For example, because the number of errors introduced by cell-to-cell program interference is currently low, manufacturers have reverted to using one-shot programming (see Section 2.2.4) for 3D NAND flash [258, 259, 356]. As a result of the currently small effect of read disturb errors, mitigation and recovery mechanisms for read disturb (e.g., pass-through voltage optimization in Section 3.2.5, Read Disturb Recovery in Section 3.3.4) may not be needed, for the time being. We expect that once 3D NAND flash memory begins to scale down to smaller manufacturing process technologies, approaching the current feature sizes used for planar NAND flash memory, there will be a significant need for 3D NAND flash memory to use many, if not all, of the error mitigation mechanisms we discuss in Section 3.2. To our knowledge, no mechanisms have been designed yet to reduce the impact of early retention loss, which is a new error mechanism in 3D NAND flash memory. This is in part due to the reduced overall impact of retention errors in 3D NAND flash memory compared to planar NAND flash memory [55], since a larger cell contains a greater number of electrons than a smaller cell at the same threshold voltage. As a result, existing refresh mechanisms (see Section 3.2.3) can be used to tolerate errors introduced by early retention loss with little modification. However, as 3D NAND flash memory scales into future smaller technology nodes, the early retention loss problem may require new mitigation techniques. While new error mitigation mechanisms have yet to emerge for 3D NAND flash memory, rigorous studies that examine the error characteristics of and error mitigation techniques for 3D NAND flash memories are yet to be published. These studies (1) may expose additional sources of errors that have not yet been observed, and that may be unique to 3D NAND flash memory; and (2) can enable a solid understanding of current error mechanisms in 3D NAND flash memory so that appropriate specialized mitigation mechanisms can be developed. We expect that future works will experimentally examine such sources of errors, and will potentially introduce novel mitigation mechanisms for these errors. Thus, the field (both academia and industry) is currently in much need of rigorous experimental characterization and analysis of 3D NAND flash memory devices. Our characterization in Chapter 6 is the first in the open literature to comprehensively characterize all types of NAND flash memory errors in 3D NAND, using real, state-of-the-art MLC 3D charge trap NAND flash memory chips.

3.5 Similar Errors in Other Memory Technologies

As we discussed in Section 3.1, there are five major sources of errors in flash-memory-based SSDs. Many of these error sources can also be found in other types of memory and storage technologies. In this section, we take a brief look at the major reliability issues that exist within DRAM and in emerging nonvolatile memories. In particular, we focus on DRAM in our discussion, as modern SSD controllers have access to dedicated DRAM of considerable capacity (e.g., 1 GB for every 1 TB of SSD capacity), which exists within the SSD package (see Section 2.1). Major sources of errors in DRAM include data retention, cell-to-cell interference, and read disturb. There is a wide body of work on mitigation mechanisms for the DRAM and emerging memory technology errors we describe in this section, but we explicitly discuss only a select number of them here, since a full treatment of such mechanisms is out of the scope of this current chapter.

3.5.1 Cell-to-Cell Interference Errors in DRAM

One similarity between the capacitive DRAM cell and the floating-gate cell in NAND flash memory is that they are both vulnerable to cell-to-cell interference. In DRAM, one important way in which cell-to-cell interference exhibits itself is the data-dependent retention behavior, where the retention time of a DRAM cell is dependent on the values written to nearby DRAM cells [137, 138, 139, 140, 187, 261]. This phenomenon is called data pattern dependence (DPD) [187]. Data pattern dependence in DRAM is similar to the data-dependent nature of program interference that exists in NAND flash memory (see Section 3.1.3). Within DRAM, data dependence occurs as a result of parasitic capacitance coupling (between DRAM cells). Due to this coupling, the amount of charge stored in one cell's capacitor can inadvertently affect the amount of charge stored in an adjacent cell's capacitor [137, 138, 139, 140, 187, 261]. As DRAM cells become smaller with technology scaling, cell-to-cell interference worsens because parasitic capacitance coupling between cells increases [137, 187]. More findings on cell-to-cell interference and the data-dependent nature of cell retention times in DRAM, along with experimental data obtained from modern DRAM chips, can be found in prior works from our research group [38, 137, 138, 139, 140, 187, 261, 272].

3.5.2 Data Retention Errors in DRAM

DRAM uses the charge within a capacitor to represent one bit of data. Much like the floating gate within NAND flash memory, charge leaks from the DRAM capacitor over time, leading to data retention issues. Charge leakage in DRAM, if left unmitigated, can lead to much more rapid data loss than the leakage observed in a NAND flash cell. While leakage from a NAND flash cell typically leads to data loss after several days to years of retention time (see Section 3.1.4), leakage from a DRAM cell leads to data loss after a retention time on the order of milliseconds to seconds [187]. The retention time of a DRAM cell depends upon several factors, including (1) manufacturing process variation and (2) temperature [187]. Manufacturing process variation affects the amount of current that leaks from each DRAM cell's capacitor and access transistor [187].

As a result, the retention times of the cells within a single DRAM chip vary significantly, resulting in strong cells that have high retention times and weak cells that have low retention times within each chip. The operating temperature affects the rate at which charge leaks from the capacitor. As the operating temperature increases, the retention time of a DRAM cell decreases exponentially [97, 187]. Figure 3.28 shows the change in retention time as we vary the operating temperature, as measured from real DRAM chips [187]. In Figure 3.28, prior work normalizes the retention time of each cell to its retention time at an operating temperature of 50 °C. As the number of cells is large, prior work groups the normalized retention times into bins, and plots the density of each bin. Prior work draws two exponential-fit curves: (1) the peak curve, which is drawn through the most populous bin at each temperature measured; and (2) the tail curve, which is drawn through the lowest non-zero bin for each temperature measured. Figure 3.28 provides us with three major conclusions about the relationship between DRAM cell retention time and temperature. First, both of the exponential-fit curves fit well, which confirms the exponential decrease in retention time as the operating temperature increases in modern DRAM devices. Second, the retention times of different DRAM cells are affected very differently by changes in temperature. Third, the variation in retention time across cells increases greatly as temperature increases. More analysis of factors that affect DRAM retention times can be found in recent works from our research group [70, 137, 138, 139, 140, 187, 261, 272].
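As a rough numerical illustration of this exponential dependence, the sketch below extrapolates a measured retention time to a hotter operating point using a simple exponential decay model. The per-degree decay factor is back-computed from the 3 s at 45 °C to roughly 246 ms at 85 °C example discussed later in this section; it is an illustrative fit, not a device constant.

    import math

    def extrapolate_retention(t_measured_s, temp_measured_c, temp_target_c, decay_per_c):
        """Exponential model: t(T) = t(T0) * exp(-decay_per_c * (T - T0))."""
        return t_measured_s * math.exp(-decay_per_c * (temp_target_c - temp_measured_c))

    # Decay factor implied by the 3 s @ 45 C -> ~0.246 s @ 85 C data point.
    decay_per_c = math.log(3.0 / 0.246) / (85 - 45)          # ~0.063 per degree Celsius

    print(extrapolate_retention(3.0, 45, 85, decay_per_c))   # ~0.246 s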

Figure 3.28: DRAM retention time vs. operating temperature, normalized to the retention time of each DRAM cell at 50 °C. Reproduced from [187].

Due to the rapid charge leakage from DRAM cells, a DRAM controller periodically refreshes all DRAM cells in place [39, 120, 137, 186, 187, 261, 272] (similar to the techniques discussed in Section 3.2.3, but at a much smaller time scale). DRAM standards require a DRAM cell to be refreshed once every 64 ms [120]. As the density of DRAM continues to increase over successive product generations (e.g., by 128x between 1999 and 2017 [38, 41]), enabled by the scaling of

DRAM to smaller manufacturing process technology nodes [206], the performance and energy overheads required to refresh an entire DRAM module have grown significantly [39, 186]. It is expected that the refresh problem will get worse and limit DRAM density scaling, as described in a recent work by Samsung and Intel [134] and by our group [186]. Refresh operations in DRAM cause both (1) performance loss and (2) energy waste, both of which together lead to a difficult technology scaling challenge. Refresh operations degrade performance for three major reasons. First, refresh operations increase the memory latency, as a request to a DRAM bank that is refreshing must wait for the refresh latency before it can be serviced. Second, they reduce the amount of bank-level parallelism available to requests, as a DRAM bank cannot service requests during refresh. Third, they decrease the row buffer hit rate, as a refresh operation causes all open rows in a bank to be closed. When a DRAM chip scales to a greater capacity, there are more DRAM rows that need to be refreshed. As Figure 3.29a shows, the amount of time spent on each refresh operation scales linearly with the capacity of the DRAM chip. The additional time spent on refresh causes the DRAM data throughput loss due to refresh to become more severe in denser DRAM chips, as shown in Figure 3.29b. For a chip with a density of 64 Gbit, nearly 50% of the data throughput is lost due to the high amount of time spent on refreshing all of the rows in the chip. The increased refresh time also increases the effect of refresh on power consumption. As prior work observes from Figure 3.29c, the fraction of DRAM power spent on refresh is expected to be the dominant component of the total DRAM power consumption, as DRAM chip capacity scales to become larger. For a chip with a density of 64 Gbit, nearly 50% of the DRAM chip power is spent on refresh operations. Thus, refresh poses a clear challenge to DRAM scalability.
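A back-of-the-envelope calculation makes the scaling trend concrete: since every row must be refreshed once per retention window, the raw number of row refreshes per second grows linearly with the number of rows in the chip. The row size assumed below is an illustrative value, not a datasheet parameter.

    ROW_SIZE_BITS = 8 * 1024 * 8     # assumed bits per DRAM row (8 KB)
    RETENTION_WINDOW_S = 0.064       # standard-mandated 64 ms refresh window

    for density_gbit in (2, 8, 32, 64):
        rows = density_gbit * (1024 ** 3) // ROW_SIZE_BITS
        refreshes_per_s = rows / RETENTION_WINDOW_S
        print(f"{density_gbit:2d} Gbit chip: {rows:,} rows, "
              f"{refreshes_per_s:,.0f} row refreshes per second")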

Figure 3.29: Negative performance and power consumption effects of refresh in contemporary and future DRAM devices. We expect that as the capacity of each DRAM chip increases, (a) the refresh latency, (b) the DRAM throughput lost during refresh operations, and (c) the power consumed by refresh will all increase. Reproduced from [186].

To combat the growing performance and energy overheads of refresh, two classes of techniques have been developed. The first class of techniques reduces the frequency of refresh operations without sacrificing the reliability of data stored in DRAM (e.g., [8, 118, 137, 139, 140, 186, 261, 272, 324]). Various experimental studies of real DRAM chips (e.g., [102, 137, 138, 147, 172, 186, 187, 261, 272]) have studied the data retention time of DRAM cells in modern chips. Figure 3.30 shows the retention time measured from seven different real DRAM modules (by manufacturers A, B, C, D, and E) at an operating temperature of 45 °C, as a cumulative distribution (CDF) of the fraction of cells that have a retention time less than the x-axis value [187].

Prior work observes from the figure that, even for the DRAM module whose cells have the worst retention time (i.e., whose CDF is the highest), fewer than 0.001% of the total cells have a retention time smaller than 3 s at 45 °C. As shown in Figure 3.28, the retention time decreases exponentially as the temperature increases. We can extrapolate the observations from Figure 3.30 to the worst-case operating conditions by using the tail curve from Figure 3.28. DRAM standards specify that the operating temperature of DRAM should not exceed 85 °C [120]. Using the tail curve, prior work finds that a retention time of 3 s at 45 °C is equivalent to a retention time of 246 ms at the worst-case temperature of 85 °C. Thus, the vast majority of DRAM cells can retain data without loss for much longer than the 64 ms retention time specified by DRAM standards. Other experimental studies of DRAM chips have validated this observation as well [102, 137, 138, 147, 172, 186, 261, 272].

Figure 3.30: Cumulative distribution of the number of cells in a DRAM module with a retention time less than the value on the x-axis, plotted for seven different DRAM modules. Reproduced from [187].

A number of works take advantage of this variability in data retention time behavior across DRAM cells by introducing heterogeneous refresh rates, i.e., different refresh rates for different DRAM rows. These works can thus reduce the frequency at which the vast majority of DRAM rows within a module are refreshed (e.g., [8, 118, 137, 139, 186, 187, 261, 272, 324]). For example, the key idea of RAIDR [186] is to refresh the strong DRAM rows (i.e., those rows that can retain data for much longer than the minimum 64 ms retention time in the DDR4 standard [120]) less frequently, and refresh the weak DRAM rows (i.e., those rows that can retain data only for the minimum retention time) more frequently. The major challenge in such works is how to accurately identify the retention time of each DRAM row. To solve this challenge, many recent works examine (online) DRAM retention time profiling techniques [137, 138, 140, 187, 261, 272].

The second class of techniques reduces the interference caused by refresh requests on demand requests (e.g., [39, 230, 304]). These works either change the scheduling order of refresh requests [39, 230, 304] or slightly modify the DRAM architecture to enable the servicing of refresh and demand requests in parallel [39]. One critical challenge in developing techniques to reduce refresh overheads is that it is getting significantly more difficult to determine the minimum retention time of a DRAM cell, as prior works have shown experimentally on modern DRAM chips [137, 138, 187, 261, 272]. Thus, determining the correct rate at which to refresh DRAM cells has become more difficult, as also indicated by industry [134]. This is due to two major phenomena, both of which get worse (i.e., become more prominent) with manufacturing process technology scaling.
The first phenomenon is variable retention time (VRT), where the retention time of some DRAM cells can change drastically over time, due to a memoryless random process that results in very fast charge loss via a phenomenon called trap-assisted gate-induced drain leakage [187, 272, 278, 348]. VRT, as far as we know, is very difficult to test for, because there seems to be no way of determining that a cell exhibits VRT until that cell is observed to exhibit VRT, and the time scale of a cell exhibiting VRT does not seem to be bounded, based on the current experimental data on modern DRAM devices [187, 261]. The second phenomenon is data pattern dependence (DPD), which we discuss in Section 3.5.1. Both of these phenomena greatly complicate the accurate determination of the minimum data retention time of DRAM cells. Therefore, data retention in DRAM continues to be a vulnerability that can greatly affect DRAM technology scaling (and thus performance and energy consumption) as well as the reliability and security of current and future DRAM generations. More findings on the nature of DRAM data retention and associated errors, as well as relevant experimental data from modern DRAM chips, can be found in prior works from our research group [38, 39, 102, 137, 138, 139, 140, 172, 186, 187, 233, 261, 272].

3.5.3 Read Disturb Errors in DRAM

Commodity DRAM chips that are sold and used in the field today exhibit read disturb errors [156], also called RowHammer-induced errors [233], which are conceptually similar to the read disturb errors found in NAND flash memory (see Section 3.1.5). Repeatedly accessing the same row in DRAM can cause bit flips in data stored in adjacent DRAM rows. In order to access data within DRAM, the row of cells corresponding to the requested address must be activated (i.e., opened for read and write operations). This row must be precharged (i.e., closed) when another row in the same DRAM bank needs to be activated. Through experimental studies on a large number of real DRAM chips, prior work shows that when a DRAM row is activated and precharged repeatedly (i.e., hammered) enough times within a DRAM refresh interval, one or more bits in physically-adjacent DRAM rows can be flipped to the wrong value [156]. Prior work tested 129 DRAM modules manufactured by three major manufacturers (A, B, and C) between 2008 and 2014, using an FPGA-based experimental DRAM testing infrastructure [102] (more detail on the experimental setup, along with a list of all modules and their characteristics, can be found in the original RowHammer paper [156]). Figure 3.31 shows the rate of RowHammer errors, with the 129 modules that prior work tested categorized based on their manufacturing date.

Prior work finds that 110 of the tested modules exhibit RowHammer errors, with the earliest such module dating back to 2010. In particular, prior work finds that all of the modules manufactured in 2012–2013 that prior work tested are vulnerable to RowHammer. Like with many NAND flash memory error mechanisms, especially read disturb, RowHammer is a recent phenomenon that especially affects DRAM chips manufactured with more advanced manufacturing process technology generations.

Figure 3.31: RowHammer error rate vs. manufacturing dates of 129 DRAM modules we tested. Reproduced from [156].

Figure 3.32 shows the distribution of the number of rows (plotted in log scale on the y-axis) within a DRAM module that flip the number of bits along the x-axis, as measured for example DRAM modules from three different DRAM manufacturers [156]. Prior work makes two observations from the figure. First, the number of bits flipped when we hammer a row (known as the aggressor row) can vary significantly within a module. Second, each module has a different distribution of the number of rows. Despite these differences, prior work finds that this DRAM failure mode affects more than 80% of the DRAM chips prior work tested [156]. As indicated above, this read disturb error mechanism in DRAM is popularly called RowHammer [233].

Figure 3.32: Number of victim cells (i.e., number of bit errors) when an aggressor row is repeatedly activated, for three representative DRAM modules from three major manufacturers. We label the modules in the format X_n^yyww, where X is the manufacturer (A, B, or C), yyww is the manufacture year (yy) and week of the year (ww), and n is the number of the selected module. Reproduced from [156].

Various recent works show that RowHammer can be maliciously exploited by user-level software programs to (1) induce errors in existing DRAM modules [156, 233] and (2) launch attacks to compromise the security of various systems [15, 18, 90, 91, 233, 276, 287, 288, 322, 343]. For example, by exploiting the RowHammer read disturb mechanism, a user-level program can gain kernel-level privileges on real laptop systems [287, 288], take over a server vulnerable to RowHammer [90], take over a victim virtual machine running on the same system [15], and take over a mobile device [322]. Thus, the RowHammer read disturb mechanism is a prime (and perhaps the first) example of how a circuit-level failure mechanism in DRAM can cause a practical and widespread system security vulnerability. We believe similar (yet likely more difficult to exploit) vulnerabilities exist in MLC NAND flash memory as well, as described in recent work from our research group [34]. Note that various solutions to RowHammer exist [149, 156, 233], but we do not discuss them in detail here. Recent work from our research group [233] provides a comprehensive overview. A very promising proposal is to modify either the memory controller or the DRAM chip such that it probabilistically refreshes the physically-adjacent rows of a recently-activated row, with very low probability. This solution is called Probabilistic Adjacent Row Activation (PARA) [156]. Prior work shows that this low-cost, low-complexity solution, which does not require any storage overhead, greatly closes the RowHammer vulnerability [156]. The RowHammer effect in DRAM worsens as the manufacturing process scales down to smaller node sizes [156, 233]. More findings on RowHammer, along with extensive experimental data from real DRAM devices, can be found in prior works from our research group [149, 156, 233].
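A minimal sketch of the PARA idea described above is shown below: on every row activation, the controller refreshes one of the physically adjacent rows with a small probability, so aggressor rows that are hammered many times have their neighbors refreshed often in expectation. The probability value, the adjacency function, and the refresh_row hook are illustrative assumptions, not the exact parameters from [156].

    import random

    PARA_PROBABILITY = 0.001   # example value; small enough to keep the refresh overhead negligible

    def on_activate(row, refresh_row, rng=random):
        """Hook invoked by the memory controller whenever `row` is activated."""
        if rng.random() < PARA_PROBABILITY:
            neighbor = row + rng.choice((-1, 1))   # pick one of the two physically adjacent rows
            refresh_row(neighbor)

An aggressor row activated N times thus has each neighbor refreshed roughly N * PARA_PROBABILITY / 2 times in expectation, regardless of the access pattern, which is what makes this style of defense access-pattern-agnostic.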

3.5.4 Large-Scale DRAM Error Studies

Like flash memory, DRAM is employed in a wide range of computing systems, at scale. Thus, there is a similar need to study the aggregate behavior of errors observed in a large number of DRAM chips deployed in the field. Akin to the large-scale flash memory SSD reliability studies discussed in Section 3.1.7, a number of experimental studies characterize the reliability of DRAM at large scale in the field (e.g., [113, 211, 284, 301, 302]). We highlight three notable results from these studies. First, as prior work saw for large-scale studies of SSDs (see Section 3.1.7), the number of errors observed varies significantly for each DRAM module [211]. Figure 3.33a shows the distribution of correctable errors across the entire fleet of servers at Facebook over a fourteen-month period, omitting the servers that did not exhibit any correctable DRAM errors. The x-axis shows the normalized device number, with devices sorted based on the number of errors they experienced in a month. As prior work saw in the case of SSDs, a small number of servers accounts for the majority of errors. As prior work sees from Figure 3.33a, the top 1% of servers account for 97.8% of all observed correctable DRAM errors. The distribution of the number of errors among servers follows a power law model. Prior work shows the probability density distribution of correctable errors in Figure 3.33b, which indicates that the distribution of errors across servers follows a Pareto distribution, with a decreasing hazard rate [211]. This means that a server that has experienced more errors in the past is likely to experience more errors in the future.
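The decreasing hazard rate property can be checked with a line of algebra: for a Pareto distribution with shape alpha and scale x_m, the hazard rate h(x) = pdf(x) / survival(x) simplifies to alpha / x, which falls as x grows. The short sketch below just evaluates this numerically; the parameter values are arbitrary and purely illustrative.

    alpha, x_m = 1.5, 1.0      # illustrative Pareto shape and scale

    def hazard(x):
        pdf = alpha * x_m**alpha / x**(alpha + 1)
        survival = (x_m / x) ** alpha
        return pdf / survival   # algebraically equal to alpha / x

    for x in (1, 10, 100, 1000):
        print(x, hazard(x))     # monotonically decreasing in x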


Figure 3.33: Distribution of memory errors among servers with errors (a), which resembles a power law distribution. Memory errors follow a Pareto distribution among servers with errors (b). Reproduced from [211].

Second, unlike SSDs, DRAM does not seem to show any clearly discernible trend where higher utilization and age lead to a greater raw bit error rate [211]. Third, the increase in the density of DRAM chips with technology scaling leads to higher error rates [211]. The latter is illustrated in Figure 3.34, which shows how different DRAM chip densities are related to device failure rate. We can see that there is a clear trend of increasing failure rate with increasing chip density. Prior work finds that the failure rate increases because, despite small improvements in the reliability of an individual cell, the quadratic increase in the number of cells per chip greatly increases the probability of observing a single error in the whole chip [211].

Figure 3.34: Relative failure rate for servers with different chip densities. Higher densities (related to newer technology nodes) show a trend of higher failure rates. Reproduced from [211]. See Section II-E of [211] for the complete definition of the metric plotted on the y-axis, i.e., relative server failure rate.

3.5.5 Latency-Related Errors in DRAM

Various experimental studies examine the tradeoff between DRAM reliability and latency [37, 38, 41, 42, 102, 146, 167, 171, 172]. These works perform extensive experimental studies on real DRAM chips to identify the effect of (1) temperature, (2) supply voltage, and (3) manufacturing process variation that exists in DRAM on the latency and reliability characteristics of different DRAM cells and chips. The temperature, supply voltage, and manufacturing process variation all dictate the amount of time that each cell needs to safely complete its operations. Several of the works from our research group [41, 42, 171, 172] examine how one can reliably exploit different effects of variation to improve DRAM performance or energy consumption.

Adaptive-Latency DRAM (AL-DRAM) [172] shows that significant variation exists in the access latency of (1) different DRAM modules, as a result of manufacturing process variation; and (2) the same DRAM module over time, as a result of varying operating temperature, since at low temperatures DRAM can be accessed faster. The key idea of AL-DRAM is to adapt the DRAM latency to the operating temperature and the DRAM module that is being accessed. Experimental results show that AL-DRAM can reduce DRAM read latency by 32.7% and write latency by 55.1%, averaged across 115 DRAM modules operating at 55 °C [172].

Voltron [42] identifies the relationship between the DRAM supply voltage and access latency variation. Voltron uses this relationship to identify the combination of voltage and access latency that minimizes system-level energy consumption without exceeding a user-specified threshold for the maximum acceptable performance loss. For example, at an average performance loss of only 1.8%, Voltron reduces the DRAM energy consumption by 10.5%, which translates to a reduction in the overall system energy consumption of 7.3%, averaged over seven memory-intensive quad-core workloads [42].

Flexible-Latency DRAM (FLY-DRAM) [41] captures access latency variation across DRAM cells within a single DRAM chip due to manufacturing process variation. For example, Figure 3.35 shows how the bit error rate (BER) changes if we reduce one of the timing parameters used to control the DRAM access latency below the minimum value specified by the manufacturer [41]. Prior work uses an FPGA-based experimental DRAM testing infrastructure [102] to measure the BER of 30 real DRAM modules, over a total of 7500 rounds of tests, as we lower the tRCD timing parameter (i.e., how long it takes to open a DRAM row) below its standard value of 13.125 ns.5 In this figure, prior work uses a box plot to summarize the bit error rate measured during each round. For each box, the bottom, middle, and top lines indicate the 25th, 50th, and 75th percentile of the population. The ends of the whiskers indicate the minimum and maximum BER of all modules for a given tRCD value. Each round of BER measurement is represented as a single point overlaid upon the box. From the figure, prior work makes three observations. First, the BER increases exponentially as we reduce tRCD. Second, there are no errors when tRCD is at 12.5 ns or at 10.0 ns, indicating that manufacturers provide a significant latency guardband to provide additional protection against process variation. Third, the BER variation across different models becomes smaller as tRCD decreases. The reliability of a module operating at tRCD = 7.5 ns varies significantly based on the DRAM manufacturer and model.
This variation occurs because the number of DRAM cells that experience an error within a DRAM chip varies significantly from module to module.

5 More detail on the experimental setup, along with a list of all modules and their characteristics, can be found in the original FLY-DRAM paper [41].

Yet, the BER variation across different modules operating at tRCD = 2.5 ns is much smaller, as most modules fail when the latency is reduced so significantly. From other experiments described in the FLY-DRAM paper [41], prior work finds that there is spatial locality in the slower cells, resulting in fast regions (i.e., regions where all DRAM cells can operate at significantly-reduced access latency without experiencing errors) and slow regions (i.e., regions where some of the DRAM cells cannot operate at significantly-reduced access latency without experiencing errors) within each chip. To take advantage of this heterogeneity in the reliable access latency of DRAM cells within a chip, FLY-DRAM (1) categorizes the cells into fast and slow regions; and (2) lowers the overall DRAM latency by accessing fast regions with a lower latency. FLY-DRAM lowers the timing parameters used for the fast region by as much as 42.8% [41]. FLY-DRAM improves system performance for a wide variety of real workloads, with the average improvement for an eight-core system ranging between 13.3% and 19.5%, depending on the amount of variation that exists in each module [41].
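The sketch below outlines the region-categorization idea conceptually (it is not FLY-DRAM's actual mechanism): profile each region of a chip at progressively lower tRCD values, and then serve it at the lowest latency at which the profiler observed no errors. The region granularity, the candidate latencies, and the profiling function are illustrative assumptions.

    STANDARD_TRCD_NS = 13.125
    CANDIDATE_TRCD_NS = (7.5, 10.0, 12.5)   # reduced-latency candidates to try

    def build_latency_map(regions, count_profile_errors):
        """count_profile_errors(region, trcd_ns) -> number of errors observed while
        testing `region` at the given tRCD. Returns region -> chosen tRCD (ns)."""
        latency_map = {}
        for region in regions:
            chosen = STANDARD_TRCD_NS               # fall back to the standard timing (slow region)
            for trcd in sorted(CANDIDATE_TRCD_NS):  # try the most aggressive latency first
                if count_profile_errors(region, trcd) == 0:
                    chosen = trcd                   # fast region: safe at this reduced latency
                    break
            latency_map[region] = chosen
        return latency_map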

Figure 3.35: Bit error rates of tested DRAM modules as we reduce the DRAM access latency (i.e., the tRCD timing parameter). Reproduced from [41].

Design-Induced Variation-Aware DRAM (DIVA-DRAM) [171] identifies the latency variation within a single DRAM chip that occurs due to the architectural design of the chip. For example, a cell that is further away from the row decoder requires a longer access time than a cell that is close to the row decoder. Similarly, a cell that is farther away from the wordline driver requires a larger access time than a cell that is close to the wordline driver. DIVA-DRAM uses design-induced variation to reduce the access latency to different parts of the chip. One can further reduce latency by sacrificing some amount of reliability and performing error correction to fix the resulting errors [171]. Experimental results show that DIVA-DRAM can reduce DRAM read latency by 40.0% and write latency by 60.5% [171]. In an eight-core system running a wide variety of real workloads, DIVA-DRAM improves system performance by an average of 13.8% [171]. More information about the errors caused by reduced-latency and reduced-voltage operation in DRAM chips, and about the tradeoff between reliability, latency, and voltage, can be found in prior works from our research group [38, 41, 42, 102, 167, 171, 172, 193].

3.5.6 Error Correction in DRAM

In order to protect the data stored within DRAM from various types of errors, some (but not all) DRAM modules employ ECC [193]. The ECC employed within DRAM is much weaker than the ECC employed in SSDs (see Section 3.3) for various reasons. First, DRAM has a much lower access latency, and error correction mechanisms should be designed to ensure that DRAM access latency does not increase significantly. Second, the error rate of a DRAM chip tends to be lower than that of a flash memory chip. Third, the granularity of access is much smaller in a DRAM chip than in a flash memory chip, and hence sophisticated error correction can come at a high cost. The most common ECC algorithm used in commodity DRAM modules is SECDED (single error correction, double error detection) [193]. Another ECC algorithm available for some commodity DRAM modules is Chipkill, which can tolerate the failure of an entire DRAM chip within a module [74] at the expense of higher storage overhead and higher latency. For both SECDED and Chipkill, the ECC information is stored on one or more extra chips within the DRAM module, and, on a read request, this information is sent alongside the data to the memory controller, which performs the error detection and correction.

As DRAM scales to smaller technology nodes, its error rate continues to increase [134, 156, 206, 211, 232, 233, 236]. Effects like read disturb [156], cell-to-cell interference [137, 138, 139, 140, 187, 261], and variable retention time [137, 187, 261, 272] become more severe [134, 156, 232, 233, 236]. As a result, there is an increasing need for (1) employing ECC algorithms in all DRAM chips/modules; (2) developing more sophisticated and efficient ECC algorithms for DRAM chips/modules; and (3) developing error-specific mechanisms for error correction. To this end, recent work follows various directions. First, in-DRAM ECC, where correction is performed within the DRAM module itself (as opposed to in the controller), is proposed [134]. One work shows how exposing this in-DRAM ECC information to the memory controller can provide Chipkill-like error protection at much lower overhead than the traditional Chipkill mechanism [238]. Second, various works explore and develop stronger ECC algorithms for DRAM (e.g., [144, 145, 330]), and explore how to make ECC more efficient based on the current DRAM error rate (e.g., [4, 58, 74, 171, 320]). Third, recent work shows how the cost of ECC protection can be reduced by (1) exploiting heterogeneous reliability memory [193], where different portions of DRAM use different strengths of error protection based on the error tolerance of different applications and different types of data [190, 193], and (2) using the additional DRAM capacity that is otherwise used for ECC to improve system performance when reliability is not as important for the given application and/or data [197].

Many of these works that propose error mitigation mechanisms for DRAM do not distinguish between the characteristics of different types of errors. We believe that, in addition to providing sophisticated and efficient ECC mechanisms in DRAM, there is also significant value in and opportunity for exploring specialized error mitigation mechanisms that are customized for different error types, just as it is done for flash memory (as we discussed in Section 3.2).
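As a concrete illustration of what SECDED provides, the sketch below implements an extended Hamming code over a 4-bit value: any single flipped bit is located and corrected, and any two flipped bits are detected as uncorrectable. Commodity DRAM ECC typically uses a (72, 64) SECDED code over 64-bit words; the tiny (8, 4) code here is chosen only to keep the example short.

    def secded_encode(data4):
        """Encode 4 data bits [a, b, c, d] into an 8-bit extended Hamming codeword."""
        a, b, c, d = data4
        p1 = a ^ b ^ d               # parity over codeword positions 1, 3, 5, 7
        p2 = a ^ c ^ d               # parity over codeword positions 2, 3, 6, 7
        p4 = b ^ c ^ d               # parity over codeword positions 4, 5, 6, 7
        hamming = [p1, p2, a, p4, b, c, d]         # Hamming(7,4), positions 1..7
        p0 = 0
        for bit in hamming:
            p0 ^= bit                              # overall parity enables double-error detection
        return [p0] + hamming                      # index 0 holds p0; indices 1..7 hold positions 1..7

    def secded_decode(code8):
        """Return (status, data bits). status is 'ok', 'corrected', or 'double_error'."""
        p0, p1, p2, a, p4, b, c, d = code8
        s1 = p1 ^ a ^ b ^ d
        s2 = p2 ^ a ^ c ^ d
        s4 = p4 ^ b ^ c ^ d
        syndrome = s1 + 2 * s2 + 4 * s4            # nonzero syndrome points at the flipped position
        overall = 0
        for bit in code8:
            overall ^= bit                          # 0 if total parity is still consistent
        if syndrome == 0:
            # either error-free, or only the overall parity bit p0 itself flipped
            return ('ok' if overall == 0 else 'corrected'), [a, b, c, d]
        if overall == 1:
            fixed = list(code8)
            fixed[syndrome] ^= 1                    # single error: flip the indicated bit back
            _, _, _, a, _, b, c, d = fixed
            return 'corrected', [a, b, c, d]
        return 'double_error', [a, b, c, d]         # two errors: detected but not correctable

    cw = secded_encode([1, 0, 1, 1])
    cw[5] ^= 1
    print(secded_decode(cw))                        # ('corrected', [1, 0, 1, 1])
    cw[6] ^= 1                                      # a second flip in the same codeword
    print(secded_decode(cw))                        # ('double_error', ...)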
One example of such a specialized error mitigation mechanism targets the RowHammer read disturb mechanism: Probabilistic Adjacent Row Activation (PARA) [156, 233], which we discussed earlier. Recall that the key idea of PARA is to refresh the rows that are physically adjacent to an activated row, with a very low probability. PARA is shown to be very effective in fixing the RowHammer problem at no storage cost and at very low performance overhead [156].

PARA is a specialized yet very effective solution for fixing a specific error mechanism that is important and prevalent in modern DRAM devices.

3.5.7 Errors in Emerging Nonvolatile Memory Technologies

DRAM operations are several orders of magnitude faster than SSD operations, but DRAM has two major disadvantages. First, DRAM offers orders of magnitude less storage density than NAND-flash-memory-based SSDs. Second, DRAM is volatile (i.e., the stored data is lost on a power outage). Emerging nonvolatile memories, such as phase-change memory (PCM) [163, 164, 165, 273, 333, 351, 363], spin-transfer torque magnetic RAM (STT-RAM or STT-MRAM) [161, 237], metal-oxide resistive RAM (RRAM) [334], and memristors [62, 303], are expected to bridge the gap between DRAM and SSDs, providing DRAM-like access latency and energy, and at the same time SSD-like large capacity and nonvolatility (and hence SSD-like data persistence). These technologies are also expected to be used as part of hybrid memory systems (also called heterogeneous memory systems), where one part of the memory consists of DRAM modules and another part consists of modules of emerging technologies [46, 59, 60, 128, 183, 210, 212, 267, 273, 274, 275, 351, 352, 357, 360]. PCM-based devices are expected to have a limited lifetime, as PCM can only endure a certain number of writes [163, 273, 333], similar to the P/E cycling errors in NAND-flash-memory-based SSDs (though PCM's write endurance is higher than that of SSDs). PCM suffers from (1) resistance drift [114, 268, 333], where the resistance used to represent the value becomes higher over time (and eventually can introduce a bit error), similar to how charge leakage in NAND flash memory and DRAM leads to retention errors over time; and (2) write disturb [127], where the heat generated during the programming of one PCM cell dissipates into neighboring cells and can change the value that is stored within the neighboring cells. STT-RAM suffers from (1) retention failures, where the value stored for a single bit (as the magnetic orientation of the layer that stores the bit) can flip over time; and (2) read disturb (a conceptually different phenomenon from the read disturb in DRAM and flash memory), where reading a bit in STT-RAM can inadvertently induce a write to that same bit [237]. Due to the nascent nature of emerging nonvolatile memory technologies and the lack of availability of large-capacity devices built with them, extensive and dependable experimental studies have yet to be conducted on the reliability of real PCM, STT-RAM, RRAM, and memristor chips. However, we believe that error mechanisms conceptually or abstractly similar to those we discussed in this chapter for flash memory and DRAM are likely to be prevalent in emerging technologies as well (as supported by some recent studies [7, 127, 141, 237, 298, 299, 361]), albeit with different underlying mechanisms and error rates.

Chapter 4

WARM—Write-hotness Aware Retention Management

As we have introduced in Section 3.1, retention errors are one of the most dominant error sources in NAND flash memory. Retention errors are caused by charge leakage from the flash cells after the data has been programmed into the cells. Our prior work has shown that retention errors not only degrade data reliability, but also lead to performance degradation due to increased read-retry attempts [27]. In Section 3.2, we have surveyed many techniques proposed by prior work that can mitigate retention errors, including refresh, read-retry, and voltage optimization. Among these techniques, refresh is considered to be the most effective in improving flash lifetime and has recently been implemented in real SSDs [22, 25, 295]. However, we observe that, while refresh improves flash lifetime significantly by relaxing the retention time constraint, the refresh operation itself can consume the majority of the improved flash lifetime, wasting the opportunity for further flash lifetime improvements.

In this chapter, we introduce Write-hotness Aware Retention Management (WARM), which exploits workload write-hotness and device data retention characteristics to improve flash lifetime. The goal of WARM is to eliminate redundant refreshes for write-hot pages with minimal storage and performance overhead. To this end, this chapter proposes WARM, a write-hotness-aware flash memory retention management policy. The first key idea of WARM is to effectively partition the pages stored in flash memory into two groups based on the write frequency of the pages. The second key idea of WARM is to apply different management policies to the two different groups of flash pages/blocks to improve the lifetime of the flash device.

First, we look into flash memory data retention characteristics and SSD workload characteristics to show that redundant flash refresh operations are expensive (Section 4.1). Second, we discuss a novel, lightweight approach to dynamically identifying and partitioning write-hot versus write-cold pages (Section 4.2.1). Third, we describe how WARM optimizes flash management policies, such as garbage collection and wear-leveling, in a partitioned flash memory, and show how WARM integrates with a refresh mechanism to provide further flash lifetime improvements (Section 4.2.2). Fourth, we evaluate the flash lifetime improvement delivered by WARM, and the hardware and performance overhead required to implement WARM (Section 4.4). Finally, we conclude with the contributions of WARM (Section 4.6).

4.1 Motivation

As flash memory density has continued to increase, the endurance of flash devices has been rapidly decreasing. Relaxing the internal retention time of flash devices can significantly improve upon this endurance (Section 4.1.1), but this relaxation cannot simply be exposed externally, as it would impact the data integrity guarantee. Periodically performing data refresh allows the flash device to relax the internal retention time while maintaining the data integrity guarantee [21, 22, 25, 188, 189, 222, 250]. Unfortunately, for real-world workloads, these refresh operations consume a large portion of the extra endurance gained from internal retention time relaxation, as we describe in Section 4.1.2. In order to buy back the endurance, we aim to eliminate redundant refresh operations on write-hot data, as the write-hot flash pages incur the vast majority of writes (Section 4.1.3). We use the insights from this section to design a write-hotness aware retention management policy for flash memory, which we describe in Section 4.2.

4.1.1 Retention Time Relaxation

Traditionally, data stored within a block is retained for some amount of time. This retention time is dependent on a number of factors (e.g., the number of P/E cycles already performed on the block, process variation). Flash devices guarantee a minimum data integrity time. The endurance of flash memory is determined by how many P/E cycles can take place before the internal retention time falls below this minimum guarantee. Prior work has shown that the P/E cycle endurance of flash memory can be significantly improved by relaxing the internal retention time [21, 22, 25, 188, 189, 222, 250].

We extrapolate the endurance numbers under different internal retention times and plot them in Figure 4.1. The horizontal axis shows the flash endurance, expressed as the number of P/E cycles before the device experiences retention failures. Each bar shows the number of P/E cycles a flash block can tolerate for a given internal retention time. With a three-year internal retention time, which is the typical retention period used in today's flash drives, each flash cell can endure only 3,000 P/E cycles. However, if we relax the internal retention time to three days, flash endurance improves by up to 50× (i.e., 150,000 P/E cycles). Hence, relaxing the amount of time that flash memory is required to internally retain data can potentially lead to great improvements in endurance.

Figure 4.1: P/E cycle endurance from different amounts of internal retention time without refresh. (Data extrapolated from prior work [22, 25].) Endurance per internal retention time: 3-year, 3,000 P/E cycles; 3-month, 8,000; 3-week, 20,000; 3-day, 150,000.

4.1.2 Refresh Overhead Mitigation

In order to compensate for the reduced internal retention time, refresh operations are introduced to maintain the data integrity guarantee provided to the user [22, 25]. When the internal retention time of a flash block expires, the data stored in the block can simply be remapped to another block (i.e., all valid pages within a block are read, corrected, and then reprogrammed to a different block) to extend the duration of data integrity. Several variants of refresh have been proposed for flash memory [22, 25, 188, 189, 222, 250]. Remapping-based flash correct-and-refresh (FCR) involves the lowest implementation overhead, by triggering refreshes at a fixed refresh frequency to guarantee that the retention time never falls below a predetermined threshold [22, 25].

Although relaxing the internal retention time increases flash endurance, each refresh operation consumes a portion of this extra endurance, leading to significantly reduced lifetime improvements. The curves in Figure 4.2 plot the fraction of the extra endurance cycles consumed by refresh operations for a 256 GB flash drive as a function of the write intensity of the workload (expressed as the average number of writes the workload issues to the drive per day), for a refresh mechanism with various refresh intervals (ranging from three days to three years). When the write intensity is as low as 10^5 writes/day, refresh operations can consume up to 99% of the total endurance when the data is refreshed every three days (regardless of how recently the data was written). The data points in Figure 4.2 show the actual fraction of writes that are due to refresh for each workload that we evaluate in Section 4.4. Fourteen of the sixteen workloads are disk traces from real applications, and they all have a write frequency less than or equal to 10^7 writes/day. The remaining two workloads are I/O benchmarks (iozone and postmark) with higher write frequencies, which do not represent the typical usage of flash devices. Unfortunately, refresh operations consume a significant fraction of the extra endurance for all fourteen real-world workloads. In this chapter, we aim to reduce the fraction of endurance consumed as overhead, in order to better utilize the extra endurance gained from retention time relaxation and thus improve flash lifetime.

Figure 4.2: Fraction of P/E cycles consumed by refresh operations, plotted against workload write frequency (writes/day), for 3-day, 3-week, 3-month, and 3-year refresh intervals; data points mark the sixteen evaluated workloads.
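To make the scale of this overhead concrete, the short Python sketch below reproduces the ~99% figure for a low-intensity workload. The drive parameters and the simplifying assumption that every page of a full drive is remapped exactly once per refresh interval (regardless of hotness) are ours, for illustration only; they are not the exact calculation used to produce Figure 4.2.

    def refresh_overhead_fraction(drive_bytes, page_bytes,
                                  host_writes_per_day, refresh_interval_days):
        # Pages remapped by refresh per day, assuming a full drive whose every
        # page is rewritten once per refresh interval, regardless of hotness.
        pages = drive_bytes // page_bytes
        refresh_writes_per_day = pages / refresh_interval_days
        # Fraction of all page writes (host + refresh) consumed by refresh.
        return refresh_writes_per_day / (refresh_writes_per_day + host_writes_per_day)

    # Example: 256 GB drive, 8 KB pages, 10^5 host writes/day, 3-day refresh.
    print(f"{refresh_overhead_fraction(256 * 2**30, 8 * 2**10, 1e5, 3):.1%}")  # ~99%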

4.1.3 Opportunities to Exploit Write-Hotness

Many of the management policies for flash memory are focused on writes, as write operations reduce the lifetime of the device. While these algorithms were designed to evenly distribute device wear-out across blocks, they crucially ignore the fine-grained behavior of writes across pages within an application. We observe that write frequency can be quite heterogeneous across different pages. While some pages are often written to (write-hot pages), other pages are infrequently updated (write-cold pages). Figure 4.3 shows the write distribution for all sixteen of our applications (described in Table 4.2). We observe that for all but one of our workloads (postmark), only a very small fraction (i.e., less than 1%) of the total application data receives the vast majority of the write requests. In fact, from Figure 4.3, we observe that for ten of our applications, a very small fraction of all data pages (i.e., less than 1%) are the destination of nearly 100% of the write requests. Note that our workloads each use a total memory footprint of 217.6 GB, and that 1% of the total application data represents 2.176 GB. We conclude from this figure that only a small portion of the pages in each application are write-hot, and that the discrepancy between the write rate to write-hot pages and the write rate to write-cold pages is large.

Figure 4.3: Cumulative distribution function of writes to pages for the 16 evaluated workload traces. Total data footprints for our workloads are 217.6 GB, i.e., 1.0% on the x-axis represents 2.176 GB of data.

In a typical flash device, page allocation policies are oblivious to the frequency of writes, resulting in blocks that contain a random distribution of interspersed write-hot and write-cold pages. We find that such obliviousness to the program write patterns forces several of our “general” flash management algorithms to be inefficient. One such example is refresh, where the increased number of program/erase operations greatly limits the potential endurance gains that can be achieved. Refreshes are only necessary to maintain the integrity of data that has not yet been overwritten, as a new write operation naturally refreshes the data by placing the page into a new block. In other words, if a page is written to often enough (i.e., at a higher frequency than the refresh rate), any refresh operation to that page is redundant. Our goal is to enable a mechanism that can eliminate refresh operations to write-hot pages. Unfortunately, remapping-based refresh operations are performed at a block granularity, which is much coarser than page granularity, as flash devices only provide a block-granularity

erase mechanism. Therefore, if at least one page within the block is write-cold, the whole block must be refreshed, foregoing any potential P/E cycle savings from skipping the refresh operation. As the conventional page allocation algorithm is oblivious to write-hotness, there is a very low probability that a block contains only write-hot pages. Unless we change the page allocation policy, it is impractical to simply modify the refresh mechanism to skip refreshes for blocks containing only write-hot pages.

If flash management policies were made aware of the write-hotness of a page, we could group write-hot pages together in a block such that the entire block does not require a refresh operation. This would allow us to efficiently skip refreshes to a much larger number of flash blocks that are formed as such. In addition, since write-cold pages would be grouped together into blocks, we could also reduce wear-out on these blocks through more efficient garbage collection. As write-cold pages are rarely overwritten, all of the pages within a write-cold block are more likely to remain valid, requiring much less frequent compaction. Our goal for WARM, our proposed management policy, is to physically separate write-hot pages from write-cold pages into disjoint sets of flash blocks, so that we can significantly improve flash lifetime.

4.2 Mechanism

In this section, we introduce WARM, our proposed write-hotness-aware flash memory retention management policy. The first key idea of WARM is to effectively partition pages stored in flash into two groups based on the write frequency of the pages. The second key idea of WARM is to apply different management policies to the two different groups of pages/blocks. We first discuss a novel, lightweight approach to dynamically identifying and partitioning write-hot versus write-cold pages (Section 4.2.1). We then describe how WARM optimizes flash management policies, such as garbage collection and wear-leveling, in a partitioned flash memory, and show how WARM integrates with a refresh mechanism to provide further flash lifetime improvements (Section 4.2.2). We discuss the hardware overhead required to implement WARM (Section 4.2.3). We show in Section 4.4 that WARM is effective at delivering significant flash lifetime improvements (by an average of 3.24× over a conventional management policy without refresh), and can do so with a minimal performance overhead (averaging 1.3%).

4.2.1 Partitioning Data Using Write-Hotness

Identifying Write-Hot and Write-Cold Data

Figure 4.4 illustrates the high-level concept of our write-hot data identification mechanism. We maintain two virtual queues, one for write-hot data and another for write-cold data, which order all of the hot and cold data, respectively, by the time of the last write. The purpose of the virtual queues is to partition write-hot and write-cold data in a space-efficient way. The partitioning mechanism provides methods for promoting data from the cold virtual queue to the hot virtual queue, and for demoting data from the hot virtual queue to the cold virtual queue. The promotion and demotion decisions are made such that write-hot pages are quickly identified (after two writes in quick succession to the page), and write-cold pages are seldom misidentified as write-hot pages

Figure 4.4: Write-hot data identification algorithm using two virtual queues and monitoring windows. (The figure shows the hot virtual queue, bounded by the hot window, and the cold virtual queue, whose tail portion forms the cooldown window; the labeled transitions ①–⑥ are referenced in the text.)

(and are quickly demoted if they are). Note that the cold virtual queue is divided into two parts, with the part closer to the tail known as the cooldown window. The purpose of the cooldown window is to identify those pages that are most recently written to. The pages in the cooldown window are the only ones that can be immediately promoted to the hot virtual queue (as soon as they receive a write request). We walk through examples for both of these migration decisions.

Initially, all data is stored in the cold virtual queue. Any data stored in the cold virtual queue is defined to be cold. When data (which we call Page C) is first identified as cold, a corresponding queue entry is pushed into the tail of the cold virtual queue (①). This entry progresses forward in the queue as other cold data is written. If Page C is written to again after it leaves the cooldown window (②), then its queue entry will be removed from the cold virtual queue and reinserted at the queue tail (①). This allows the queue to maintain ordering based on the time of the most recent write to each page.

If a cold page starts to become hot (i.e., it starts being written to frequently), a cooldown window at the tail end of the cold virtual queue provides these pages with a chance to be promoted into the hot virtual queue. The cooldown window monitors the most recently inserted (i.e., most recently written) cold data. Let us assume that Page C has just been inserted into the tail of the cold virtual queue (①). If Page C is written to again while it is still within the cooldown window, it will be immediately promoted to the hot virtual queue (③). If, on the other hand, Page C is not written to again, then Page C will eventually be pushed out of the cooldown window portion of the cold virtual queue, at which point Page C is determined to be cold. Requiring a two-step promotion process from cold to hot (with the use of a cooldown window) allows us to avoid incorrectly promoting cold pages due to infrequent writes. This is important for two reasons: (1) hot storage capacity is limited, and (2) promoted pages will not be refreshed, which for cold pages could result in data loss. With our two-step approach, if Page C is cold and is written to only once, it will remain in the cold queue, though it will be moved into the cooldown window (②) to be monitored for subsequent write activity.

Any data stored in the hot virtual queue is identified as hot. Newly-identified hot data, which we call Page H, is inserted into the tail of the hot virtual queue (④). The hot virtual queue length is maximally bounded by a hot window size to ensure that the most recent writes to all hot data pages were performed within a given time period. (We discuss how this window is sized in Section 4.2.1.) The assumption here is that infrequently-written pages in the hot virtual queue will eventually progress to the head of the queue (⑤). If the entry for Page H in the hot virtual queue reaches the head of the queue and must now be evicted, we demote Page H into the cooldown window of the cold virtual queue (①), and move the page out of the hot virtual queue.

In contrast, a write to a page in the hot virtual queue simply moves that page to the tail of the hot virtual queue (⑥).
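For concreteness, the identification logic above can be sketched in a few lines of Python. The class and variable names are our own, and the sketch tracks individual pages purely for clarity; it is not the controller's actual data structure.

    from collections import OrderedDict

    class WriteHotnessIdentifier:
        """Two-virtual-queue write-hot data identifier (illustrative sketch)."""

        def __init__(self, hot_capacity, cooldown_capacity):
            self.hot = OrderedDict()     # hot virtual queue, ordered by last write
            self.cold = OrderedDict()    # cold virtual queue; its newest entries
            self.hot_capacity = hot_capacity          # form the cooldown window
            self.cooldown_capacity = cooldown_capacity

        def _in_cooldown(self, page):
            # The cooldown window is the most recently written portion of the
            # cold virtual queue (its tail end).
            return page in list(self.cold.keys())[-self.cooldown_capacity:]

        def on_write(self, page):
            if page in self.hot:                         # (6) hot-queue hit
                self.hot.move_to_end(page)
            elif self._in_cooldown(page):                # (3) promote to hot queue
                del self.cold[page]
                self.hot[page] = True
                if len(self.hot) > self.hot_capacity:    # (5) demote oldest hot page
                    evicted, _ = self.hot.popitem(last=False)
                    self.cold[evicted] = True            # (1) into the cooldown window
            else:                                        # (1)/(2) (re)insert cold page
                self.cold.pop(page, None)                # at the cold-queue tail to
                self.cold[page] = True                   # watch for a second write
            return page in self.hot

In the actual design, consecutive hot writes land in the same flash block, so the hot virtual queue is tracked at block granularity rather than per page, which keeps this bookkeeping very small (see Section 4.2.3).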

Partitioning the Flash Device

Figure 4.5 shows how we apply the identification mechanism from Section 4.2.1 to perform physical page partitioning inside flash, with labels that correspond to the actions from Figure 4.4. We first separate all of the flash blocks into two allocation pools, one for hot data and another for cold data. The hot pool contains enough blocks to store every page in the hot virtual queue (whose sizing is described in Section 4.2.1), as well as some extra blocks to tolerate management overhead (e.g., erasing on garbage collection). The cold pool contains all of the remaining flash blocks. Note that blocks can be moved between the two pools when the queues are resized.

Figure 4.5: Write-hotness aware retention management policy overview. (The figure shows the cold block pool, whose most recently written blocks form the cooldown window, and the hot block pool, which acts as the hot window; the labeled page movements correspond to transitions ②, ③, ⑤, and ⑥ in Figure 4.4.)

To simplify the hardware required to implement the virtual queues, we exploit the fact that pages are written sequentially into the hot pool blocks. Consecutive writes to hot pages will be placed in the same block, which means that a single block in the hot virtual queue will hold all of the oldest pages. As a result, we can track the hot virtual queue at a block granularity instead of a page granularity, which allows us to significantly reduce the size of the hot virtual queue.

Tuning the Partition Boundary

Since the division between hot and cold data can be dependent on both application and phase characteristics, we need to provide a method for dynamically adjusting the size of our hot and cold pools periodically. Every block is allocated to one of the two pools, so any increase in the hot pool size will always be paired with a corresponding decrease in the cold pool size, and vice versa. Our dynamic sizing mechanism must ensure that: (1) the hot pool is sized such that every page in the hot pool is written to at least once within the hot pool retention time (which is relaxed, as the hot pool does not employ refresh), and (2) the lifetime of the blocks in the cold pool is maximized. To this end, we describe an algorithm that tunes the partitioning of blocks between the hot and cold pools.

The partitioning algorithm starts by setting an upper bound for the hot window, to ensure that every page in the window will be written to at a greater rate than the fixed hot pool retention time. Recall that the hot pool retention time is relaxed to provide greater endurance (Section 4.1.1). We estimate this size by collecting the number of writes to the hot pool, to find the average write frequency and estimate the time it takes to fill the hot window. We compare the time to fill the

window to the hot pool retention time, and if the fill time exceeds the retention time, we shrink the hot pool size to reduce the required fill time. This hot pool size determines the initial partition boundary between the hot pool and the cold pool.

We then tune this partition boundary to maximize the lifetime of the cold pool, since we do not relax the retention time for the blocks in the cold pool. Assuming that wear-leveling evenly distributes the page writes within the cold pool, we can use the endurance capacity metric (i.e., the total number of writes the cold pool can service), which is the product of the remaining endurance of a block1 and the cold pool size, to estimate the lifetime of blocks in the cold pool:

    Endurance Capacity = Remaining Endurance × Cold Pool Size    (4.1)

    Lifetime = Endurance Capacity / Cold Write Frequency ∝ Cold Pool Size / Cold Write Frequency    (4.2)

We divide the endurance capacity by the cold write frequency (writes per day) to determine the number of days remaining before the cold pool is worn out. We use hill climbing to find the partition boundary at which the cold pool size maximizes the flash lifetime. The cold write frequency is dependent on the cold pool size, because as the cold pool size increases, the hot pool size correspondingly shrinks, shifting writes of higher frequency into the cold pool.

Finally, once the partition boundary converges to obtain the maximum lifetime, we must adjust what portion of the cold pool belongs in the cooldown window. We size this window to minimize the ping-ponging of requests between the hot and cold pools. For this, we want to maximize the number of hot virtual queue hits (⑥ in Figure 4.4), while minimizing the number of requests evicted from the hot window (⑤ in Figure 4.4). We maintain a counter for each of these events, and then use hill climbing on the cooldown window size to maximize the utility function Utility = ⑥ − ⑤.

In our work, we limit the hot pool size to the number of over-provisioned blocks within the flash device (i.e., the extra blocks beyond the visible capacity of the device). While the hot pages are expected to represent only a small portion of the total flash capacity (see Section 4.1.3), there may be rare cases where the size limit prevents the hot pool from holding all of the hot data (i.e., the hot pool is significantly undersized). In such a case, some less-hot pages are forced to reside in the cold pool, and lose the benefits of WARM (i.e., endurance improvements from relaxed retention times). WARM will not, however, incur any further write overhead from keeping the less-hot pages in the cold pool. For example, the dynamic sizing of the cooldown window prevents the less-hot pages from going back and forth between the hot and cold pools.

1 Due to wear-leveling, the remaining endurance (i.e., the number of P/E operations that can still be performed on the block) is the same across all of the blocks.
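Once the upper bound on the hot pool size has been set, the boundary-tuning loop can be sketched as follows. This is a simplified Python illustration: the names, the step size, and the cold_write_rate_for callback are our own assumptions, not the controller's actual interface; the scoring follows Equations 4.1 and 4.2.

    def estimated_cold_pool_lifetime(cold_pool_blocks, remaining_endurance,
                                     cold_writes_per_day):
        # Equations 4.1 and 4.2: lifetime (days) = endurance capacity / write rate.
        endurance_capacity = remaining_endurance * cold_pool_blocks
        return endurance_capacity / max(cold_writes_per_day, 1)

    def tune_partition_boundary(total_blocks, max_hot_blocks, remaining_endurance,
                                cold_write_rate_for, step=64):
        # Hill-climb on the hot pool size (in increments of `step` blocks),
        # scoring each candidate by the estimated cold-pool lifetime.
        # `cold_write_rate_for(h)` is an assumed callback returning the cold-pool
        # write frequency (writes/day) observed when the hot pool holds h blocks.
        def lifetime(h):
            return estimated_cold_pool_lifetime(total_blocks - h,
                                                remaining_endurance,
                                                cold_write_rate_for(h))
        hot_blocks, best = max_hot_blocks, lifetime(max_hot_blocks)
        while True:
            improved = False
            for h in (hot_blocks - step, hot_blocks + step):
                if 0 < h <= max_hot_blocks and lifetime(h) > best:
                    hot_blocks, best, improved = h, lifetime(h), True
            if not improved:
                return hot_blocks    # converged hot/cold partition boundary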

4.2.2 Flash Management Policies

WARM partitions all of the blocks in a flash device into two pools, storing write-hot data in the blocks belonging to the hot pool, and storing write-cold data in the blocks belonging to the cold pool. Because of the different degrees of write-hotness of the data in each pool, WARM also applies different management policies (i.e., refresh, garbage collection, and wear-leveling) to each pool, to best extend their lifetime. We next describe these management policies for each pool, both when WARM is applied alone and when WARM is applied along with refresh.

WARM-Only Management

WARM relaxes the internal retention time of only the blocks in the hot pool, without requiring a refresh mechanism for the hot pool. Within the cold pool, WARM applies conventional garbage collection (i.e., finding the block with the fewest valid pages to minimize unnecessary data movement) and wear-leveling policies. Since the flash blocks in the cold pool contain data with much lower write frequencies, they (1) consume a smaller number of P/E cycles, and (2) experience much lower fragmentation (which only occurs when a page is updated), thus reducing garbage collection activities. As such, the lifetime of blocks in the cold pool increases even when conventional management policies are applied.

Within the hot pool, WARM applies simple, in-order garbage collection (i.e., finding the oldest block) and no wear-leveling policy. WARM performs writes to hot pool blocks in block order (i.e., it starts on the block with the lowest ID number, and then advances to the block with the next lowest ID number) to maintain a sequential ordering by write time. Writing pages in block order enables garbage collection in the hot pool to also be performed in block order. Due to the higher write frequency in the hot pool, all data in the hot pool is valid for a shorter amount of time. Most of the pages in the oldest block are already invalid when the block is garbage collected, increasing garbage collection efficiency. Since both writing and garbage collection are performed in block order, each of the blocks will be naturally wear-leveled, as they will all incur the same number of P/E cycles. Thus, we do not need to apply any additional wear-leveling policy.
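The contrast between the two pools' victim-selection policies can be captured in a few lines (a Python sketch with invented names; a real FTL would consult its block metadata tables rather than Python objects):

    # Victim selection for garbage collection in each pool (illustrative only;
    # `cold_blocks` holds objects with a `valid_pages` count, and `hot_block_ids`
    # is the in-order list of hot-pool block IDs, oldest first).

    def pick_cold_victim(cold_blocks):
        # Conventional greedy policy: reclaim the block with the fewest valid
        # pages, minimizing the data copied out before the erase.
        return min(cold_blocks, key=lambda blk: blk.valid_pages)

    def pick_hot_victim(hot_block_ids):
        # In-order policy: hot-pool writes proceed in block-ID order, so the
        # oldest block holds mostly-invalidated data and is erased next; this
        # also wear-levels the hot pool without any extra policy.
        return hot_block_ids[0]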

Combining WARM with Refresh

WARM can also be used in conjunction with a refresh mechanism to reap additional endurance benefits. WARM, on its own, can significantly extend the lifetime of a flash device by enabling retention time relaxation on only the write-hot pages. However, these benefits are limited, as the cold pool blocks will eventually exhaust their endurance at the original internal retention time. (Recall from Figure 4.1 that endurance decreases significantly as the selected internal retention time increases.) While WARM cannot enable retention time relaxation on the cold pool blocks due to infrequent writes to such blocks, a refresh mechanism can enable the relaxation, greatly extending the endurance of the cold pool blocks. WARM still provides benefits over a refresh mechanism for the hot pool blocks, since it avoids unnecessary write operations that refresh operations would incur.

When WARM and refresh are combined, we split the lifetime of the flash device into two phases. The flash device starts in the pre-refresh phase, during which the same management policies as WARM-only are applied. Note that during this phase, the internal retention time is only relaxed for the hot pool blocks. Once the endurance at the original retention time is exhausted, we enter the refresh phase, during which the same management policies as WARM-only are applied and a refresh policy (such as FCR [22]) is applied to the cold pool to avoid data loss. During this phase, the retention time is relaxed for all blocks. Note that during both phases, the internal retention time for hot pool blocks is always relaxed without the need for a refresh policy.

During the refresh phase, WARM also performs global wear-leveling to prevent the hot pool from being prematurely worn out. The global wear-leveling policy rotates the entire hot pool to a

new set of physical flash blocks (which were previously part of the cold pool) every 1K hot-block P/E cycles. Over time, this rotation will use all of the flash blocks in the device for the hot pool for one 1K P/E cycle interval. Thus, WARM wears out all of the flash blocks equally despite the heterogeneity in write frequency between the two pools.
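A minimal sketch of this rotation, assuming (as in Section 4.2.3) that the hot pool occupies a contiguous run of block IDs; the function name and the wrap-around behavior are our own illustration:

    def maybe_rotate_hot_pool(hot_start, hot_size, total_blocks,
                              hot_pool_pe_cycles, rotation_period=1000):
        # Every 1K hot-pool P/E cycles, shift the hot pool's contiguous block-ID
        # range onto blocks that previously belonged to the cold pool, wrapping
        # around so that every block eventually serves a turn in the hot pool.
        if hot_pool_pe_cycles == 0 or hot_pool_pe_cycles % rotation_period:
            return hot_start                 # not yet time to rotate
        return (hot_start + hot_size) % total_blocks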

4.2.3 Implementation and Overheads

The logic overhead of the proposed mechanism is minimal. Thanks to the simplicity of the write-hot data identification algorithm, WARM can be integrated within an existing FTL, allowing it to be implemented in the flash controller that already exists in modern flash drives. For our dynamic window tuning mechanism, four 32-bit counters are required. Two counters track the number of writes to the hot and cold pools. A third counter tracks the number of hot virtual queue write hits (⑥ in Figure 4.4). The fourth counter tracks the number of pages moved from the hot virtual queue into the cooldown window (⑤ in Figure 4.4).

The memory and storage overheads for the proposed mechanism are small. Recall that the cooldown window can remain relatively small. We need to store data that tracks which blocks belong to the cooldown window, and which blocks belong to the hot pool. From our evaluation, we find that a 128-block maximum cooldown window size is sufficient. This requires us to store the block IDs of 128 blocks, for a storage overhead of 128 × 8 B = 1 KB. As the blocks belonging to the hot pool are written to in order of block ID, we require even lower overhead for them. We allocate a contiguous series of block IDs to the hot pool, reducing the tracking overhead to four registers totaling 32 B: the starting ID of the series, the current size of the pool, a pointer to the most recently written block (i.e., the block at the tail of the hot virtual queue), and a pointer to the oldest block yet to be erased (i.e., the block at the head of the hot virtual queue). All of this information can be buffered inside the memory of the flash controller in order to accelerate write operations.

While WARM saves a significant number of unnecessary refreshes in the hot data pool, the proposed mechanism has the potential to indirectly generate extra write operations that consume some endurance cycles. First, WARM generates extra write operations when demoting a hot page to the cold data pool (⑤). Second, partitioning flash blocks into two allocation pools can sometimes increase garbage collection activities. This is because one of the pools may have a smaller number of blocks available in its free list, requiring more frequent invocation of garbage collection. All of these overheads are accounted for in our evaluation in Section 4.4, and our results factor in all additional writes. As we show in Section 4.4, WARM is designed to minimize these overheads such that lifetime improvements are not overshadowed, and the resulting impact on response time is minimal.
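The controller state described above is small enough to write out explicitly. The Python dataclass below mirrors the four counters, the 1 KB cooldown-window block list, and the 32 B of hot-pool registers; the field names are our own.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class WarmControllerState:
        """WARM metadata kept in flash-controller memory (illustrative field names)."""
        # Four 32-bit counters used for dynamic window tuning.
        hot_pool_writes: int = 0     # writes serviced by the hot pool
        cold_pool_writes: int = 0    # writes serviced by the cold pool
        hot_queue_hits: int = 0      # event (6): write hits in the hot virtual queue
        hot_evictions: int = 0       # event (5): pages demoted to the cooldown window
        # Cooldown window: IDs of up to 128 recently written cold blocks (128 x 8 B = 1 KB).
        cooldown_block_ids: List[int] = field(default_factory=list)
        # Hot pool tracked as a contiguous run of block IDs (four registers, 32 B total).
        hot_pool_start_id: int = 0   # first block ID allocated to the hot pool
        hot_pool_size: int = 0       # current number of hot-pool blocks
        hot_tail_block: int = 0      # most recently written hot block (queue tail)
        hot_head_block: int = 0      # oldest not-yet-erased hot block (queue head)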

4.3 Methodology

We use DiskSim-4.0 [17] with SSD extensions [2] to evaluate WARM. Table 4.1 lists the parameters of our simulated NAND flash-based SSD. The latencies (the first four rows of the table) are from real NAND flash chip measurements [106]. The sizes (rows 5–8) represent a modern commercial NAND flash specification [1]. Flash endurance and refresh period are measured from

real NAND flash devices [22, 25].

Table 4.1: Parameters of the simulated flash-based SSD.

Parameter                                           Value
Page read to register latency                       25 µs
Page write from register latency                    200 µs
Block erase latency                                 1.5 ms
Data bus latency                                    50 µs
Page/block size                                     8 KB / 1 MB
Die/package size                                    8 GB / 64 GB
Total storage capacity (incl. over-provisioning)    256 GB
Over-provisioning                                   15%
Endurance for 3-year retention time (P/E cycles)    3,000
Endurance for 3-day retention time (P/E cycles)     150,000

We run each simulation with I/O traces collected from a wide range of real workloads with different use cases [158, 239, 321]. We also select two popular synthetic file system benchmarks to stress our mechanism with higher write rate applications [135, 246]. Table 4.2 lists the name, source, length, and description of each trace. To compute the lifetime of each configuration, we assume the trace is repeated until the flash drive fails. We fill all the usable space of the flash drive with data, to mimic worst-case usage conditions and to trigger garbage collection activities within the trace duration. Similar to the approach employed in prior work [21, 22, 25], the overall flash lifetime is derived using the average write frequency of one run, which consists of writes generated by the trace and by garbage collection, as well as by refresh operations during the refresh phase. We use this methodology since it is impossible to simulate multi-year-long traces that drain the flash lifetime.
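The lifetime derivation described above amounts to dividing the drive's total endurance capacity by the average daily write rate measured over one run. A minimal sketch (our own simplification of what the simulator computes, with invented names):

    def projected_lifetime_days(block_endurance_pe, num_blocks, pages_per_block,
                                avg_writes_per_day):
        # Total endurance capacity (in page writes) divided by the average daily
        # write rate measured over one trace run, where `avg_writes_per_day`
        # already includes host, garbage-collection, and refresh writes.
        endurance_capacity = block_endurance_pe * num_blocks * pages_per_block
        return endurance_capacity / avg_writes_per_day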

4.4 Evaluations

In this section, we evaluate and compare six configurations:
• Baseline does not include WARM or refresh, and uses conventional garbage collection and wear-leveling policies, as described in Section 4.1.
• WARM uses the proposed write-hotness aware retention management policy that we described in Section 4.2.
• FCR adds a remapping-based refresh mechanism to Baseline. Our refresh mechanism is similar to the remapping-based FCR described in prior work [22, 25], but refresh is not performed in the pre-refresh phase (see Section 4.2.2) to reduce unnecessary overhead. During the refresh phase (see Section 4.2.2), FCR refreshes all valid blocks every three days, which yields the best endurance improvement.
• WARM+FCR uses write-hotness aware retention management alongside 3-day refresh (Section 4.2.2) to achieve maximum lifetime.

Table 4.2: Source and description of simulated traces.

Trace       Source           Length    Workload Description
Synthetic Workloads
iozone      IOzone [246]     16 min    File system benchmark
postmark    Postmark [135]   8.3 min   File system benchmark
Real-World Workloads
financial   UMass [321]      1 day     Online transaction processing
homes       FIU [158]        21 days   Research group activities
web-vm      FIU [158]        21 days   Web mail proxy server
hm          MSR [239]        7 days    Hardware monitoring
prn         MSR [239]        7 days    Print server
proj        MSR [239]        7 days    Project directories
prxy        MSR [239]        7 days    Firewall/web proxy
rsrch       MSR [239]        7 days    Research projects
src         MSR [239]        7 days    Source control
stg         MSR [239]        7 days    Web staging
ts          MSR [239]        7 days    Terminal server
usr         MSR [239]        7 days    User home directories
wdev        MSR [239]        7 days    Test web server
web         MSR [239]        7 days    Web/SQL server

• ARFCR adds the ability to progressively increase the refresh frequency on top of the remapping-based refresh mechanism (similar to adaptive-rate FCR [22, 25]). The refresh frequency increases as the retention capabilities of the flash memory decrease, in order to minimize the overhead of write-hotness-oblivious refresh.
• WARM+ARFCR adds WARM alongside the adaptive-rate refresh mechanism.

To provide insights into our results, we first show the hot pool sizes and the cooldown window sizes as determined by WARM for each of the configurations (Section 4.4.1). We then use four metrics to show the benefits and costs associated with our mechanism:
• We evaluate all configurations in terms of overall lifetime (Section 4.4.2).
• We evaluate the gain in endurance capacity, the aggregate number of write requests that the flash device can endure across all pages, for WARM with respect to Baseline (Section 4.4.3). We use this metric as an indicator of how many additional writes we can sustain to the flash device with our mechanism.
• We evaluate and break down the total number of writes consumed by FCR and WARM+FCR during the refresh phase, to demonstrate how our mechanism reduces the write overhead of retention time relaxation (Section 4.4.4).
• We evaluate the average response time, the mean latency for the flash device to service a host request, for both Baseline and WARM to demonstrate the performance overhead of using WARM (Section 4.4.5).

Finally, we show sensitivity studies on flash memory over-provisioning and the refresh rate (Section 4.4.6).

4.4.1 Hot Pool and Cooldown Window Sizes

Table 4.3 lists the hot pool and the cooldown window sizes learned by WARM for each of our WARM-based configurations. To allow WARM to quickly adapt to different workload behaviors, we set the smallest step size by which the hot pool size can change to 2% of the total flash drive capacity, and we restrict the cooldown window sizes to power-of-two block counts. For WARM+ARFCR, as the refresh frequency of the flash cells increases (going to the right in Table 4.3), the hot pool size generally decreases. This is because WARM automatically selects a smaller hot pool size to ensure that the data in the hot pool has a high enough write intensity to skip refreshes. Naturally, as the internal retention time of a cell decreases, previously write-hot pages with a write rate slower than the new retention time no longer qualify as hot, thereby reducing the number of pages that need to be maintained in the hot pool. WARM adaptively selects different hot pool sizes based on the fraction of write-hot data in each particular workload. Similarly, WARM intelligently selects the best cooldown window size for each workload, such that it minimizes the number of cold pages that are misidentified as hot and considered for promotion to the hot pool. As such, our analysis indicates that WARM can intelligently and adaptively adjust the hot pool size and the cooldown window size to achieve maximum lifetime.

Table 4.3: Hot pool and cooldown window sizes as set dynamically by WARM. H%: Hot pool size as a percentage of total flash drive capacity. CW: Cooldown window size in number of blocks.

Trace        WARM        WARM+FCR    WARM+ARFCR
                                     3-month     3-week      3-day
             H%    CW    H%    CW    H%    CW    H%    CW    H%    CW
iozone       10     8    10     8    10     8    10     8    10     8
postmark      2   128     4   128     4   128     4   128     4   128
financial    10     4    10     4    10     4    10     4    10     4
homes        10   128     4    32    10     4    10     4     4    32
web-vm       10   128    10    32    10     4    10     4    10    32
hm           10   128    10   128    10    32    10    32    10   128
prn          10   128    10   128     8     4    10   128    10   128
proj         10     4    10     4    10     4    10     4    10     4
prxy         10     4    10     4    10     4    10     4    10     4
rsrch         6   128     6   128    10     4    10     4     6   128
src          10   128     8   128    10    32    10    32     8   128
stg          10   128     8   128    10     4    10     4     8   128
ts           10   128     6   128    10     4    10     4     6   128
usr           6   128     6   128    10     4    10     4     6   128
wdev          6   128     4   128    10   128    10   128     4   128
web           6   128     6   128    10     4    10     4     6   128

4.4.2 Lifetime Improvement

Figure 4.6 shows the lifetime in days (using a logarithmic scale) for all six of our evaluated configurations. Figure 4.7a shows the lifetime improvement of WARM when normalized to the lifetime of Baseline. The mean lifetime improvement for WARM across all of our workloads is 3.24× over Baseline. In addition, WARM+FCR improves the mean lifetime over FCR alone by 1.30× (Figure 4.7b), leading to a mean improvement for combined WARM and FCR over Baseline of 10.4× (as opposed to 8.0× with FCR alone). WARM+ARFCR improves the mean lifetime over ARFCR by 1.21× (Figure 4.7c), leading to a mean improvement for combined WARM and ARFCR over Baseline of 12.9× (as opposed to 10.7× with ARFCR alone). Even for our worst performing workload, postmark, in which the amount of hot data and the fraction of writes due to refresh are very low (as discussed in Sections 4.4.3 and 4.4.4), the overall lifetime improves by 8% when WARM is applied without refresh, and remains unaffected with respect to FCR when WARM+FCR is applied. We conclude that WARM can adjust to workload behavior and effectively improve overall flash lifetime, either when used on its own or when used together with a refresh mechanism, without adverse impacts.

Figure 4.6: Absolute flash memory lifetime for Baseline, WARM, FCR, WARM+FCR, ARFCR, and WARM+ARFCR configurations. Note that the y-axis uses a log scale.

4.4.3 Improvement in Endurance Capacity

Figure 4.8 plots the normalized endurance capacity of WARM for each workload, split up by the endurance for both the hot and cold data pools. The endurance capacity is defined as the total number of write operations the entire flash device can sustain before wear-out. On average, WARM improves the total endurance capacity by 3.6× over Baseline. Note that the endurance capacity varies across different workloads, in relation to the number of hot writes that can be identified by the mechanism. For example, postmark contains only a limited amount of write-hot data (as is shown in Figure 4.3), which results in only a minor endurance capacity improvement (8%). Unlike the other workloads, the majority of the endurance capacity for postmark remains within the cold pool, as the workload exhibits very low write locality.

Figure 4.7: Normalized flash memory lifetime improvement when WARM is applied on top of Baseline, FCR, and ARFCR configurations: (a) WARM over Baseline, (b) WARM+FCR over FCR, (c) WARM+ARFCR over ARFCR.

Figure 4.8: WARM endurance capacity, normalized to Baseline, broken down into the cold pool and hot pool contributions.

In contrast, the endurance capacity for all of our other workloads mainly comes from the hot pool, despite the size of the hot pool being significantly smaller than that of the cold pool. WARM in essence “converts” blocks from normal internal retention time (those in the cold pool) into relaxed internal retention time (hot pool) for the write-hot portion of data. Blocks with a relaxed retention time can tolerate a much larger number of writes (as shown in Figure 4.1). As Figure 4.3 shows, the vast majority of overall writes are to a small fraction of pages that are write-hot. This allows WARM to improve the overall flash endurance capacity by using a small number of blocks with a relaxed retention time to house the write-hot pages. We conclude that WARM can effectively improve endurance capacity even when applied on its own.

4.4.4 Reduction of Refresh Operations

Figure 4.9 breaks down the percentage of endurance (P/E cycles) used for the host's write requests, for management operations, and for refresh requests during the refresh phase. Two bars are shown side by side for each application. The first bar shows the number of total writes for FCR, normalized to 100%. The second bar shows a similar breakdown for WARM+FCR, normalized to the number of writes for FCR. Although the two synthetic workloads (iozone and postmark) do not show much reduction in total write frequency (because host writes dominate their flash endurance usage, as shown in Figure 4.2), the number of writes across all sixteen of our workloads is reduced by an average of 5.3%.

Figure 4.9: Flash writes for FCR (left bar) and WARM+FCR (right bar), broken down into host writes to the hot/cold pool (host_hot/host_cold), garbage collection writes (gc), refresh writes (ref), and writes generated by WARM for migrations from the hot pool to the cold pool (hot2cold).

From the breakdown of the write requests, we can see that the reduction in write count mainly comes from the decreased number of refresh requests after applying WARM. In contrast, the additional overhead in WARM due to migrating data from the hot pool to the cold pool is minimal (shown as WARM hot2cold). This suggests that the write locality behavior within many of the

hot pool pages lasts throughout the lifetime of the application, and thus these pages do not need to be evicted from the hot pool,2 as explained in Section 4.2.1. We conclude that WARM+FCR, by providing refresh-free retention time relaxation for hot data, can reduce a significant fraction of unnecessary refresh writes, and that WARM+FCR can utilize the flash endurance more effectively during the refresh phase.

4.4.5 Impact on Performance

As we discussed in Section 4.2.3, WARM has the potential to generate additional write operations. First, when a page is demoted from the hot pool to the cold pool (which happens when another page is being promoted into the hot pool), an extra write will be required to move the page into a block in the cold pool.3 Second, as one of the pools may have fewer blocks available in its free list (which is dependent on how our partitioning algorithm splits up the flash blocks into the hot and cold pools), garbage collection may need to be invoked more frequently when a new page is required. To understand the impact of these additional writes, we evaluate how WARM affects the average response time of the FTL.

Figure 4.10 shows the average response time for WARM, normalized to the Baseline response time. Across all of our workloads, average performance degrades by only 1.3%. Even in the worst case (homes), WARM incurs a performance penalty of only 5.8% over Baseline. The relatively significant overhead for homes is due to the write-hot portion of its data changing frequently over time within the trace. This is likely because the user operates on different files, which effectively shifts the write locality to an entirely different set of pages whenever a new file is operated on. The shifting of the write-hot page set evicts all of the write-hot pages from the hot pool, which, as we stated above, incurs several additional writes.

Figure 4.10: WARM average response time, normalized to Baseline.

For most of the workloads, any performance degradation is negligible (<2%), and is a result of the increased garbage collection that occurs in the hot pool due to its small free list size. For some other workloads, such as prxy, we find that the performance actually improves slightly with WARM, because of the reduction in data movement induced by garbage collection. This savings

2 Migrations from the cold pool to the hot pool are not broken down separately, as such migrations are performed during the host write request itself and do not incur additional writes.
3 In contrast, promoting a page from the cold pool to the hot pool does not incur additional writes, as promotion only occurs when that page is being written. Since a write was needed regardless, the promotion is free.

is thanks to grouping write-cold data together, which greatly lessens the degree of fragmentation within the majority of the flash blocks (those within the cold pool). Overall, we conclude that across all of our workloads, the performance penalty of using WARM is minimal.

4.4.6 Sensitivity Studies

Figure 4.11 compares the flash memory lifetime under different capacity over-provisioning assumptions. In high-end server-class flash drives, the amount of capacity over-provisioning is higher than that in consumer-class flash drives, to provide an overall longer lifetime and higher reliability. In this figure, we evaluate the lifetime improvement of the same six configurations using 30% of the flash blocks for over-provisioning, to represent a server-class flash drive (all other parameters from Table 4.1 remain the same). We also show the lifetime of the six configurations on a consumer-class flash drive with 15% over-provisioning (which we assumed in our evaluations until now). We find that the lifetime improvements of WARM become more significant as over-provisioning increases. The lifetime improvement delivered by WARM over Baseline, for example, increases to 4.1×, while the improvement of WARM+ARFCR over Baseline increases to 14.4×. We conclude that WARM delivers higher lifetime improvements as over-provisioning increases.

Figure 4.11: Flash memory lifetime improvement for WARM, FCR, WARM+FCR, ARFCR, and WARM+ARFCR configurations under different amounts of over-provisioning (15% and 30%), normalized to the Baseline lifetime for each over-provisioning amount. Note that the y-axis uses a log scale.

Figure 4.12 compares the flash memory lifetime improvement for WARM+FCR over FCR under different refresh rate assumptions. Our evaluation has so far assumed a three-day refresh period for FCR. In this figure, we change this assumption to three-month and three-week refresh periods, and compare the corresponding lifetime improvement. As we see from this figure, the lifetime improvement delivered by WARM+FCR drops significantly as the refresh period becomes longer. This is because a smaller fraction of the endurance is consumed by refresh operations as the rate of refresh decreases (as shown in Figure 4.2), which is where our major savings come from.

Figure 4.12: Flash memory lifetime improvements for WARM+FCR over FCR under different refresh rate assumptions (3-month, 3-week, and 3-day refresh periods).

Figure 4.13 illustrates how WARM+FCR reduces the fraction of P/E cycles consumed by refresh operations, over FCR only, as we sweep over longer refresh periods. Note that the x-axis in the figure uses a log scale. The solid lines in the figure illustrate the fraction of P/E cycles consumed by refresh for FCR only, as was shown in Figure 4.2. The figure shows that, as the refresh interval increases, WARM+FCR remains effective at reducing the number of writes that are consumed by refresh, but these refresh writes make up a smaller portion of the total P/E cycles, hence the smaller improvements over FCR alone. As flash memory becomes denser and less reliable, we expect it to require more frequent refreshes in order to maintain a useful lifetime, at which point WARM can deliver greater improvements. We conclude that WARM+FCR delivers higher lifetime improvements as the refresh rate increases.

Figure 4.13: Fraction of P/E cycles consumed by refresh operations after applying WARM+FCR for a (a) 3-day, (b) 3-week, or (c) 3-month refresh period. Solid trend lines show the fraction consumed by FCR only, from Figure 4.2, for comparison. Note that the x-axis uses a log scale.

4.5 Limitations

WARM relies on the existence of write-hot data, which is present in many write-intensive workloads. If the workload is not write-intensive, or if all data in the workload is equally write-hot, WARM cannot identify any write-hot data, and thus cannot skip any refreshes. In these unlikely cases, WARM provides neither a benefit nor a penalty to flash lifetime.

4.6 Conclusion

In this chapter, we introduce WARM, a write-hotness aware retention management policy for NAND flash memory that is designed to extend its lifetime. We find that pages with different degrees of write-hotness have widely ranging retention time requirements. WARM allows us to relax the flash retention time for write-hot data without the need for refresh, by exploiting

the high write frequency of this data. On its own, WARM improves the lifetime of flash by an average of 3.24× over a conventionally-managed flash device, across a range of real I/O workload traces. When combined with refresh mechanisms, WARM eliminates redundant refresh operations to recently written data. In conjunction with an adaptive refresh mechanism, WARM extends the average flash lifetime by 12.9×, which represents a 1.21× increase over using the adaptive refresh mechanism alone. We conclude that WARM is an effective policy for improving overall flash lifetime with or without refresh mechanisms.

Chapter 5

Online Flash Channel Modeling and Its Applications

As we have introduced in Section 2.2, the threshold voltage value of a flash cell is used to represent the data that is stored within the cell. In Section 3.1, we have shown that the threshold voltage of the cell changes (i.e., shifts) as a result of many different types of circuit-level noise. Some cells' threshold voltages might shift enough to cross over to neighboring voltage windows. The value of such a cell would be misread on a flash memory channel read, causing an error. Thus, knowing how the threshold voltage distribution shifts within the flash controller can help to mitigate these errors. In this chapter, we introduce a new framework that exploits the unused computing resources in the flash controller to enable greater device awareness by exploiting an online flash channel model.

Our goal is to build an accurate and easy-to-compute model of the threshold voltage distribution of modern MLC NAND flash memory. This model must be practical to implement, as we intend to use it online to design flash controllers that can adapt to the changing NAND flash memory behavior. Our model can (1) statically determine the threshold voltage distribution at a given level of wear-out (i.e., a given P/E cycle count), and (2) dynamically predict how this threshold voltage distribution shifts over time as a result of the P/E cycling effect. The key idea is to learn this online flash channel model with low overhead and use this model in multiple components within the flash controller to help improve flash reliability.

Section 5.1 describes the motivation for developing an online flash channel model. To build our flash channel model, we first use the methodology described in Section 5.2 to characterize the threshold voltage distribution (i.e., the flash channel) under different P/E cycles using real 1X-nm MLC NAND flash chips. Second, in Section 5.3, we construct a static threshold voltage distribution model that can fit the characterized distribution under any given P/E cycle count. Third, in Section 5.4, we construct a dynamic P/E cycling model that predicts how each parameter of the static distribution model changes after some number of further P/E cycles. Finally, in Section 5.5, we demonstrate several example use cases in the flash controller that utilize the complete model to enhance the performance and reliability of the NAND flash memory device. We conclude with the contributions of our online flash channel model in Section 5.8.

5.1 Motivation

Having online information on the current threshold voltage distribution across all of the flash cells within a flash memory chip (i.e., the static distribution), as well as how this distribution changes over time (i.e., the dynamic distribution), is important to quantify errors and develop techniques to improve the reliability of the flash device. First, the static distribution can be used to determine the number of errors that would occur for any read reference voltage that is applied. This data can be used by the flash controller to select the read reference voltage that minimizes the error rate. Lowering the error rate increases the lifetime of the flash device, as it delays the time at which the number of errors becomes too large for the built-in ECC mechanism to successfully correct. Second, knowing how the dynamic distribution changes over time (i.e., as more writes are performed) is important, as it can guide flash controller mechanisms that adjust various flash parameters online (e.g., ECC strength [101, 188, 336], read reference voltages [27], pass-through voltage [35]) to increase the flash memory lifetime. Prior proposals to adjust these parameters (e.g., [27, 35]) rely on a trial-and-error approach to select parameters with low error rates, which can be inaccurate, high-latency, and suboptimal in terms of lifetime improvement. In both cases (static and dynamic), the threshold voltage distribution must be determined at runtime by the flash controller. Therefore, it is critical to design a practical and low-complexity mechanism to determine the distribution and the shifts in the distribution.

In order for the model to be useful to flash controller algorithms, it should have certain properties. First, the model needs to be accurate for all threshold voltages and at all P/E cycles. This is because inaccurate information can lead a flash controller to make suboptimal decisions, hurting lifetime improvements. Second, the model needs to be easy to compute, because the flash controller has only limited computational resources. While we base our model on a distribution that is easy to compute, we can further reduce online computation with the dynamic component of our model, by performing only a few online static characterizations of the threshold voltage distribution, and then using the simpler dynamic model to predict shifts in these initial characterizations at very low cost over P/E cycles.

5.2 Characterization Methodology

To build our model, we perform an experimental characterization of the threshold voltage distribution on real state-of-the-art 1X-nm (i.e., 15-19nm) MLC NAND flash chips. This characterization is essential to verify that the model we develop accurately captures the behavior of a real, modern device. We collect experimental characterization data on the threshold voltage distribution using an FPGA-based NAND flash testing platform [20] with state-of-the-art 1X-nm MLC NAND flash chips. We use the read-retry technique [23, 27] (described in Section 3.2.4) to sweep all possible read reference voltages and determine the threshold voltage value for each cell. We program and erase the tested flash blocks to 11 different wear levels, up to 20K P/E cycles, using known pseudo-random data. The manufacturer-specified P/E cycle endurance for the tested flash chips is 3000 P/E cycles. All tests are performed at room temperature with a 5-second dwell time.1

1Dwell time is the time duration between an erase operation and the next program operation to the same flash cell.

Figure 5.1 shows the threshold voltage distribution for each of the cell states. The read-retry capability on the MLC NAND flash memory chip allows us to fine-tune each read reference voltage (Va, Vb, and Vc) to one of 101 different steps (a total of 303 read reference voltage steps, labeled as V1 to V303 from left to right). Note that V1 does not extend all the way to the lowest possible threshold voltage for the ER state, and V303 does not extend to the highest possible threshold voltage for the P3 state. In this chapter, we normalize the threshold voltage values such that the distance between most of the adjacent read reference voltage steps is one, as the exact values are proprietary information. The distances between steps V101 and V102 and between steps V202 and V203 are much larger than the typical distance between voltage steps,2 as shown in Figure 5.1. As a result, the voltage step V303 has a normalized voltage value that is greater than 303. Overall, the threshold voltage range is divided by these read reference voltages into 304 bins, labeled as bin0 to bin303. Each flash cell can be classified into one of these bins based on the threshold voltage value read from the cell. If the read reference voltage is higher than the threshold voltage of the cell, the value read out from the flash device is 1; otherwise, the value read out is 0. For a cell whose threshold voltage falls between two neighboring read reference voltages (Vk and Vk+1), the cell is placed into bink, as illustrated in Figure 5.1.

2Some flash vendors choose to provide fewer read reference voltages near the peak of the distribution of each state. This is because flash cells near the peak have threshold voltages far away from the default read reference voltages, and hence are less likely to have errors.

Figure 5.1: Methodology for finding the threshold voltage of an MLC NAND flash memory cell. (The figure shows the voltage steps V1 through V303 along the threshold voltage (Vth) axis, the two larger gaps between V101/V102 and V202/V203, and the resulting bins bin0 through bin303 under the ER, P1, P2, and P3 state distributions.)

After classifying every cell using the methodology described above, we count the number of flash cells with state X ∈ {ER, P1, P2, P3} in bin k as H_k(X). Equation 5.1 shows how we then normalize the bin counts as the probability density of each bin, P_k(X). Note that in our characterization, we assign each flash cell to the threshold voltage distribution of the correct state that it was originally programmed to, as we know the data value that we programmed. The characterized bin density can be viewed as a discretized version of the measured distribution, which our model is constructed to fit.

P_k(X) = H_k(X) / Σ_{i=0}^{303} H_i(X)    (5.1)
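As a minimal illustration of Equation 5.1, the bin counts from one characterization round might be normalized as follows (a sketch in NumPy; the array shapes and the randomly generated counts are illustrative stand-ins for real characterization data):

```python
import numpy as np

def normalize_bin_counts(bin_counts):
    """Convert per-bin cell counts H_k(X) into per-bin densities P_k(X)
    following Equation 5.1 (one row per state, one column per bin)."""
    bin_counts = np.asarray(bin_counts, dtype=float)
    totals = bin_counts.sum(axis=1, keepdims=True)  # sum over all 304 bins per state
    return bin_counts / totals

# Illustrative counts for the four states (ER, P1, P2, P3) across 304 bins.
H = np.random.randint(1, 1000, size=(4, 304))
P = normalize_bin_counts(H)
assert np.allclose(P.sum(axis=1), 1.0)              # each state's densities sum to 1
```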

5.3 Static Distribution Model

We construct a static threshold voltage distribution model that can fit the characterized threshold voltage distribution well under any P/E cycle count, based on data collected using the methodology described in Section 5.2. Recall that this model needs to be (1) accurate for all threshold voltages and at any given P/E cycle count, and (2) easy to evaluate within the flash controller. While a more complex model can satisfy the accuracy requirements, it can be difficult to compute the model on the fly given the limited computational resources in a flash controller. In this section, we first describe two state-of-the-art models, each of which meets only one of our two requirements. The first previously-proposed model [23], based on a Gaussian distribution, is simple and easy to compute, but is not accurate enough for raw bit error rate estimation (Section 5.3.1). The second previously-proposed model [260], based on a normal-Laplace distribution, is accurate, but requires significant computational resources, taking 10.7x the computation time of the Gaussian-based model (Section 5.3.2). We propose a new model, based on our modified version of the Student's t-distribution [300], which satisfies both of our requirements, maintaining the accuracy of the normal-Laplace-based model while requiring 4.41x less computation time (Section 5.3.3). Finally, we validate and compare the three models (Section 5.3.4).

5.3.1 Gaussian-based Model

The Gaussian-based model assumes that the threshold voltage distribution of each state follows a Gaussian (i.e., normal) distribution [23]. Equation 5.2 shows how the Gaussian-based model estimates the probability density for state X (i.e., ER, P1, P2, and P3) in each bin k, denoted as G_k(X):

G_k(X) = GCDF(V_k, µ_X, σ_X) − GCDF(V_{k−1}, µ_X, σ_X)    (5.2)

The density G_k(X) is calculated as the difference between the Gaussian cumulative distribution function (GCDF) evaluated at the bin's two boundaries, V_k and V_{k−1}. The Gaussian-based model has two variables for each state: µ_X is the mean of the distribution, and σ_X is the standard deviation of the distribution. In total, the Gaussian threshold voltage distribution model has eight parameters. The intuition behind using a Gaussian distribution is twofold. First, the threshold voltage distribution is a result of physical noise and manufacturing process variation, which naturally follow a Gaussian distribution. During a program operation, the flash controller uses ISPP (see Section 2.2.4), iteratively increasing the threshold voltage until the desired threshold voltage level is achieved. Each programming step increases the threshold voltage of a cell by a small random amount. As programming subjects the cell to random physical noise, the threshold voltage distribution of each state naturally approximates a Gaussian distribution [23]. Second, the Gaussian-based model can be computed quickly, and is easily implementable in the flash controller hardware if we use a z-table, a lookup table that stores the precomputed cumulative distribution function of the standard Gaussian distribution. Equation 5.3 shows how the z-table simplifies the computation of GCDF. First, we calculate the z-score Z = (V − µ)/σ for V_k and V_{k−1}. Then, we calculate Φ(Z), the precomputed cumulative distribution function of Z, by looking up the z-score in the z-table. The two z-scores (one each for V_k and V_{k−1}) are then combined to get G_k(X), using Equation 5.2.

GCDF(V, µ, σ) = Φ(Z) = z-table(Z)    (5.3)
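For concreteness, the per-bin density of the Gaussian-based model (Equations 5.2 and 5.3) might be computed as in the following sketch, where SciPy's standard normal CDF stands in for the precomputed z-table and all boundary and parameter values are illustrative:

```python
import numpy as np
from scipy.stats import norm

def gaussian_bin_density(boundaries, mu, sigma):
    """G_k(X) = GCDF(V_k) - GCDF(V_{k-1}) for one state (Equation 5.2).
    `boundaries` lists the bin boundaries in increasing order."""
    z = (np.asarray(boundaries, dtype=float) - mu) / sigma   # z-scores (Equation 5.3)
    cdf = norm.cdf(z)              # stands in for the z-table lookup
    return np.diff(cdf)            # CDF difference at each bin's two boundaries

# 303 read reference voltage steps plus open-ended outer bins give 304 bins.
V_steps = np.arange(1.0, 304.0)                                # illustrative V1..V303
boundaries = np.concatenate(([-np.inf], V_steps, [np.inf]))
G_P1 = gaussian_bin_density(boundaries, mu=130.0, sigma=18.0)  # illustrative P1 fit
```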

The goal of static modeling is to fit the estimated distribution G_k(X) to the measured distribution P_k(X). We use the Kullback-Leibler divergence [160] to estimate the accuracy of the model (i.e., the error between the estimated and measured distributions). The Kullback-Leibler divergence between the measured and the estimated probability density for each bin (P_k and G_k, respectively) can be mathematically defined as:

D_{KL} = Σ_{k=1}^{N_{bin}} P_k log(P_k / G_k)    (5.4)
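A minimal sketch of how Equation 5.4 can be evaluated and minimized with the Nelder-Mead simplex method (described in the next paragraph); the Gaussian helper, the small epsilon guard, and the initial guesses are assumptions made for the sketch:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def gaussian_bin_density(boundaries, mu, sigma):
    """G_k = GCDF(V_k) - GCDF(V_{k-1}) for one state (Equation 5.2)."""
    return np.diff(norm.cdf((np.asarray(boundaries, dtype=float) - mu) / sigma))

def kl_divergence(P, G, eps=1e-12):
    """Kullback-Leibler divergence between measured (P) and modeled (G)
    per-bin densities (Equation 5.4); eps avoids log(0) in empty bins."""
    P, G = np.asarray(P) + eps, np.asarray(G) + eps
    return float(np.sum(P * np.log(P / G)))

def fit_gaussian_state(P_measured, boundaries, mu0, sigma0):
    """Learn (mu, sigma) for one state by minimizing the K-L divergence
    with the Nelder-Mead simplex method, starting from a reasonable guess."""
    def objective(params):
        mu, sigma = params
        if sigma <= 0:                    # keep the simplex away from invalid widths
            return np.inf
        return kl_divergence(P_measured, gaussian_bin_density(boundaries, mu, sigma))
    return minimize(objective, x0=[mu0, sigma0], method="Nelder-Mead").x
```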

We use the Nelder-Mead simplex method [245] to minimize this error, in order to learn the model parameters under different P/E cycle counts. We use a reasonable initial guess of the parameters for the Nelder-Mead simplex method, allowing us to quickly approach the best fit.

Figure 5.2 shows the distribution measured by our experimental characterization using markers, and shows how the Gaussian-based model (the curves depicted with solid or dashed lines) fits to this data at different P/E cycle counts. The x-axis is the normalized threshold voltage, and the y-axis is the probability density function at each normalized threshold voltage in log scale. In this figure, from left to right, we show the threshold voltage distribution of the ER state, the P1 state, the P2 state, and the P3 state. We show the modeled distributions of the ER and P2 states using solid lines, and the modeled distributions of the P1 and P3 states using dashed lines.

We observe that the Gaussian-based model has two limitations, which are demonstrated in Figure 5.2. First, the threshold voltage distribution of each state as measured from real flash chips has a fatter tail than that of a Gaussian distribution, and the left and right tails of each state have different sizes. We observe the fat tail by comparing the measured distribution of each state to the modeled distribution when the probability density is low (i.e., less than 10^−4), and find that the measured distribution has a much greater density than the modeled distribution at the tail. This is because the Gaussian distribution has only two parameters for each state, which capture only the center (µ) and the width (σ) of the distribution. We observe the asymmetric tail by comparing the densities of the left and right tails of the P2 state distribution. Unfortunately, the Gaussian distribution has no way to fine-tune the ratio between the left and right tails, or the ratio between the tails and the body of the distribution. Second, the measured distribution demonstrates large second peaks in the distributions of the ER and P1 states, which are not captured by the Gaussian-based model. These second peaks are evidence of a significant number of program errors (see Section 2.2.4). Figure 5.2 shows that the ER state distribution (the leftmost distribution) has a second peak that shows up under the P3 state distribution, and that the P1 state distribution (the second distribution from the left) has a second peak under the P2 state distribution. These second peaks occur as a result of the two-step programming mechanism used in MLC NAND flash memory. As we discuss in Section 2.2.4, program errors can be introduced for the ER and P1 states as a result of intermediate operations that take place while a cell is partially programmed, which causes the LSB to be misread.

Figure 5.2: Gaussian-based model (solid/dashed lines) vs. data measured from real NAND flash chips (markers) under different P/E cycle counts. (Panels (a)–(d) correspond to 2.5K, 5K, 10K, and 20K P/E cycles; x-axis: normalized Vth, y-axis: probability density.)

As we observe in Figure 5.2, both types of inaccuracies occur throughout all P/E cycle counts (from 2.5K to 20K), and are not exclusive to high wear-out scenarios (e.g., when the P/E cycle count is higher than the vendor-specified lifetime), as prior work had suggested [260]. The magnitudes of the fatter tails and program error peaks increase as wear-out (i.e., P/E cycle count) increases. As we can see, even though the Gaussian-based model captures the general trend of the threshold voltage distribution and is easy to compute, it is limited in its accuracy, especially at higher P/E cycle counts. Section 5.3.4 quantifies the modeling error and computational requirements of the Gaussian-based model.

5.3.2 Normal-Laplace-based Model

To overcome the limitations of the Gaussian-based model, prior work [260] proposes to modify the model to increase its accuracy. This modified threshold voltage distribution model assumes that the distribution of each state follows a normal-Laplace distribution, and accounts for the peaks that result from misprogramming some cells that should be in the ER and P1 states into the P3 and P2 states, respectively [260].

The normal-Laplace distribution combines the normal (Gaussian) distribution with the Laplace distribution, which adds an exponential component to both tails of the distribution. As we observe in Figure 5.2, this is similar to the measured behavior of the threshold voltage distribution. Note that the figure is in log scale, and as a result, the exponential component at the tails of the model appears as a straight line in the figure. By combining the two probability distributions, we can maintain the Gaussian distribution at the center of the distribution, and also model the fat tail more accurately. However, computing the normal-Laplace distribution is much more complex than computing the Gaussian distribution, as the normal-Laplace distribution is not a simple superposition of the Gaussian and Laplace distributions. Equation 5.5 shows how we compute the cumulative distribution function for the normal-Laplace distribution [277]:

NCDF(V, µ, σ, α, β) = Φ(Z) − φ(Z) · [β·R(ασ − Z) − α·R(βσ + Z)] / (α + β)    (5.5)

This distribution adds two new parameters, α and β, which can be adjusted to model the right and left tail sizes, respectively. In Equation 5.5, Z = (V − µ)/σ is the z-score; Φ and φ are the cumulative distribution function and probability density function of the standard Gaussian distribution, respectively; and R(x) = (1 − Φ(x))/φ(x) is Mills' ratio for the Gaussian distribution. Φ(Z) can be obtained by looking up the z-table, as was done for Equation 5.3. φ(Z) can be approximated as φ(Z) = [Φ(Z + δ) − Φ(Z − δ)] / (2δ). The normal-Laplace-based model adds two further parameters, λ_ER and λ_P1, to model the probability of program errors occurring for cells programmed to the ER and P1 states, respectively. This model assumes that the threshold voltage distribution of the cells with program errors has the same shape (i.e., the same parameters) as the distribution of the state the cells were incorrectly programmed into (e.g., the cells that should be in the ER state but were programmed into the P3 state will have a distribution with the same shape as the correct cells in the P3 state). This is because once the cells are incorrectly programmed to another state, they are treated as if they belong to that other state, and thus it is natural for them to follow the same distribution as the correct cells in that state. Equation 5.6 shows how the normal-Laplace-based model estimates the probability density for state X in bin k, accounting for cells misprogrammed into state Y, which is denoted as N_k(X):

N_k(X) = (1 − λ_X)·NCDF(V_k, µ_X, σ_X, α_X, β_X) + λ_X·NCDF(V_k, µ_Y, σ_Y, α_Y, β_Y) − (1 − λ_X)·NCDF(V_{k−1}, µ_X, σ_X, α_X, β_X) − λ_X·NCDF(V_{k−1}, µ_Y, σ_Y, α_Y, β_Y)    (5.6)

This density is calculated as the difference of the NCDF at the bin's two boundaries, V_k and V_{k−1}. The normal-Laplace-based model allows each state to have at most five parameters (20 parameters over all four states). µ and σ are the mean and standard deviation, respectively; α and β are the tail sizes; and λ_X is the probability that a cell that should actually be in state X is incorrectly programmed. Following prior work [260], we eliminate four unnecessary parameters of the model: λ_P2, λ_P3, β_ER, and α_P3. λ_P2 and λ_P3 are estimated as zero, as program errors for cells that should be in the P2 or P3 states seldom occur. We also assume that the left and right tails are the same size for the ER and P3 states (i.e., β_ER = α_ER and α_P3 = β_P3), because the read-retry mechanism prevents us from measuring the left tail of the ER state and the right tail of the P3 state. As we did in Section 5.3.1, and following prior work [260], we use the Kullback-Leibler divergence error [160] as the objective function, and we use the Nelder-Mead simplex method [245] with a reasonable initial guess to learn the best parameters under different P/E cycle counts.

Figure 5.3 shows the modeled distribution of each state as curves with solid or dashed lines, and shows the distribution measured from real chips using markers. As we can see, the normal-Laplace-based model fits the measured distribution much better than the Gaussian-based model. The modeled tails for the ER, P2, and P3 states follow the measured distribution very closely, thanks to the tail size parameters. Also, the distributions of the ER and P1 states take the program error rate into account, which allows the model to correctly include two peaks for the distributions of the ER and P1 states.

Unfortunately, although the normal-Laplace-based model is based on the Gaussian model, its computational requirements are much higher. This is not only because the model adds three more parameters for each state, but also because we can no longer eliminate µ and σ using the z-score. Thus, directly computing the model requires many more floating-point operations than the Gaussian-based model (as we demonstrate in Section 5.3.4). As such, even though the normal-Laplace-based model fits the measured threshold voltage distribution accurately, it is less practical to implement. Section 5.3.4 quantifies the modeling error and computational requirements of the normal-Laplace-based model.
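For concreteness, Equation 5.5 might be evaluated as in the following sketch, which uses SciPy's exact Φ and φ rather than the z-table and the finite-difference approximation for φ(Z) described above; all parameter values are illustrative rather than fitted:

```python
from scipy.stats import norm

def mills_ratio(x):
    """Mills' ratio R(x) = (1 - Phi(x)) / phi(x) for the standard Gaussian.
    This direct form is fine for moderate x; extreme tail arguments would need
    a more careful evaluation to avoid underflow."""
    return norm.sf(x) / norm.pdf(x)

def ncdf(v, mu, sigma, alpha, beta):
    """Normal-Laplace cumulative distribution function (Equation 5.5),
    with z-score Z = (v - mu) / sigma and tail parameters alpha and beta."""
    z = (v - mu) / sigma
    tail = (beta * mills_ratio(alpha * sigma - z)
            - alpha * mills_ratio(beta * sigma + z)) / (alpha + beta)
    return norm.cdf(z) - norm.pdf(z) * tail

# Illustrative parameters (not fitted values from this chapter).
print(ncdf(v=120.0, mu=130.0, sigma=18.0, alpha=0.2, beta=0.2))
```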

Figure 5.3: Normal-Laplace-based model (solid/dashed lines) vs. data measured from real NAND flash chips (markers) under different P/E cycle counts. (Panels (a)–(d) correspond to 2.5K, 5K, 10K, and 20K P/E cycles; x-axis: normalized Vth, y-axis: probability density.)

5.3.3 Student's t-based Model

Recall that we need a threshold voltage model that is both accurate and easy to compute. As we can see, the Gaussian-based model is simple and fast, but does not meet the accuracy requirement. In contrast, the normal-Laplace-based model fixes the accuracy problem, but uses significantly more complex calculations. We thus aim to develop a model that meets both of our requirements at the same time.

We propose to modify the Student's t-distribution [300] so that it can be used to model the threshold voltage distribution. The Student's t-distribution is a well-known distribution used in statistics that describes samples drawn from a normally-distributed population. The Student's t-distribution is typically used to estimate the true mean of a large, normally-distributed population whose standard deviation is unknown, using only a small sample from the population. Compared to the standard normal distribution, the Student's t-distribution uses an extra parameter, ν, to represent the degrees of freedom (i.e., the ratio of the sample size relative to the population size). As ν increases (i.e., the sample size becomes larger), the Student's t-distribution moves closer to a standard normal distribution. However, instead of using the distribution for its original purpose, we use this distribution for a completely different role. We find that ν can be used to tune the size of the distribution tail. When ν → +∞, the Student's t-distribution becomes a standard Gaussian distribution, which has a smaller tail. When ν → 0, the distribution instead has a fatter tail. We generalize the standard Student's t-distribution using the z-score Z = (V − µ)/σ, such that the center and the width of the distribution can be scaled (as was done for the Gaussian distribution). We also allow the left and right tails of the distribution to have different values of ν, which we denote as β and α for the left and right tails, respectively. Thus, our modified Student's t-distribution can fit our measured threshold voltage distribution better than the original Student's t-distribution.

We use precomputation to simplify the calculation of the cumulative distribution function for our modified Student's t-distribution (TCDF). Similar to the precomputed z-tables available for the Gaussian-based model, we look up values in the precomputed t-tables commonly available for the Student's t-distribution to determine the TCDF values. Each t-table contains TCDF values over a range of Z values for a single ν.3 Equation 5.7 shows how we calculate TCDF using the precomputed t-table:

TCDF(V, µ, σ, α, β) = t-table_β(Z) if V ≤ µ,  and t-table_α(Z) if V > µ    (5.7)

We first compare V with µ to observe whether V is on the left side or the right side of the distribution. Then, depending on the result of the comparison, we use the corresponding tail parameter α or β as ν to select the correct t-table. Finally, we compute the z-score Z to look up TCDF in the selected t-table.

Equation 5.8 shows how our Student's t-based model estimates the density for state X in bin k, accounting for cells that should be in state X but are incorrectly programmed to state Y, which is denoted as T_k(X):

T_k(X) = (1 − λ_X)·TCDF(V_k, µ_X, σ_X, α_X, β_X) + λ_X·TCDF(V_k, µ_Y, σ_Y, α_Y, β_Y) − (1 − λ_X)·TCDF(V_{k−1}, µ_X, σ_X, α_X, β_X) − λ_X·TCDF(V_{k−1}, µ_Y, σ_Y, α_Y, β_Y)    (5.8)

Similar to the normal-Laplace-based model (Section 5.3.2), our Student's t-based model uses λ to estimate the program errors caused by the two-step programming mechanism (see Section 2.2.4). Again, like the normal-Laplace-based model, our Student's t-based model assumes that the distribution of these cells has the same parameters as the cells correctly programmed into state Y. We set λ_P2 and λ_P3 to zero, β_ER = α_ER, and α_P3 = β_P3 for the same reasons as the normal-Laplace-based model (see Section 5.3.2). Putting everything together, we use the Kullback-Leibler divergence error [160] as our objective function, and use the Nelder-Mead simplex method [245] with a reasonable initial guess to learn the best parameters for the model under different P/E cycle counts, as described in Section 5.3.1.
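The table-lookup structure of Equations 5.7 and 5.8 might be sketched as follows, with SciPy's Student's t CDF standing in for the precomputed t-tables; the parameter tuples and the program error probability are illustrative values, not fitted ones:

```python
import numpy as np
from scipy.stats import t as student_t

def tcdf(v, mu, sigma, alpha, beta):
    """Modified Student's t CDF (Equation 5.7): use the left-tail parameter
    beta when V <= mu, and the right-tail parameter alpha when V > mu."""
    z = (np.asarray(v, dtype=float) - mu) / sigma
    nu = np.where(z <= 0, beta, alpha)       # degrees of freedom per side
    return student_t.cdf(z, df=nu)           # stands in for the t-table lookups

def t_bin_density(v_lo, v_hi, params_X, params_Y, lam):
    """Per-bin density T_k(X) of Equation 5.8: a (1 - lam)/lam mixture of the
    correctly programmed state X and the misprogrammed state Y, differenced at
    the bin boundaries V_{k-1} (v_lo) and V_k (v_hi)."""
    mix = lambda v: (1 - lam) * tcdf(v, *params_X) + lam * tcdf(v, *params_Y)
    return mix(v_hi) - mix(v_lo)

# Illustrative (mu, sigma, alpha, beta) tuples for the P1 and P2 states.
p1, p2 = (130.0, 18.0, 10.0, 12.0), (260.0, 16.0, 8.0, 15.0)
print(t_bin_density(120.0, 125.0, p1, p2, lam=1e-4))
```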

3The t-table can also be thought of as a two-dimensional array, where each entry corresponds to a unique pair of (Z, ν) values.

Figure 5.4 shows our modeled Student's t-based distribution as curves with solid or dashed lines, once again showing the distribution measured from real chips with markers. The figure shows that our Student's t-based model fits perfectly when the probability density is greater than 10^−4. The differences between our Student's t-based model and the measured distribution are within 10^−6. The difference becomes non-trivial only for the left tail of the P1 state and the right tail of the P2 state. The accuracy improvements over the Gaussian-based model are similar to those of the normal-Laplace-based model. This shows that our Student's t-based model, like the normal-Laplace-based model, makes good use of its extra parameters (both models have 16 parameters) to capture the program errors and fat tails that the measured distribution exhibits.

Figure 5.4: Our new Student's t-based model (solid/dashed lines) vs. data measured from real NAND flash chips (markers) under different P/E cycle counts. (Panels (a)–(d) correspond to 2.5K, 5K, 10K, and 20K P/E cycles; x-axis: normalized Vth, y-axis: probability density.)

5.3.4 Model Validation and Comparison

Accuracy. To quantitatively compare the accuracy of each model and validate them, we compute the Kullback-Leibler (K-L) divergence [160] between the modeled and the measured distributions, as K-L divergence measures the difference between two distributions (see Section 5.3.1). Table 5.1 and Figure 5.5 show the modeling error of the three models across a range of P/E cycle counts. We observe two types of behavior shared by all three models. First, as the P/E cycle count increases, the modeling error increases. Second, the increase in modeling error is more rapid at smaller P/E cycle counts, and slower at higher P/E cycle counts. As we see in Section 5.4, this is because the threshold voltage distribution is affected by the P/E cycling effect more significantly at smaller P/E cycle counts.

Table 5.1: Modeling error of the evaluated threshold voltage distribution models, at various P/E cycle counts.

P/E Cycles       0      2.5K   5K     7.5K   10K    12K    14K    16K    18K    20K    AVG
Gaussian         .99%   1.8%   1.6%   1.8%   1.9%   2.4%   3.1%   8.7%   2.1%   2.3%   2.6%
Normal-Laplace   .34%   .46%   .55%   .61%   .63%   .67%   .68%   .70%   .67%   .67%   .61%
Student's t      .37%   .51%   .61%   .68%   .70%   .76%   .76%   .78%   .76%   .78%   .68%

Figure 5.5: Modeling error of the evaluated threshold voltage distribution models, at various P/E cycle counts. (x-axis: P/E cycles, 0 to 20K; y-axis: modeling error percentage, 0% to 9%.)

Comparing the three models in Figure 5.5, we make two observations. First, the average Kullback-Leibler divergence error for the Gaussian-based model is 2.64%, which is 4.32x and 3.88x greater than the error of the normal-Laplace-based and our Student's t-based models, respectively. We also see that this error can be as large as 8.7%, leading to high inaccuracy. This is mainly due to two limitations of the Gaussian-based model. As mentioned in Section 5.3.1, the Gaussian-based model (1) cannot adjust its tail size to fit the fat and asymmetric tails of the observed distributions of the voltage states, and (2) does not account for misprogrammed cells that form a second peak in the distributions of the ER and P1 states. Second, the modeling errors of the normal-Laplace-based and our Student's t-based models are very close (averaged across all tested P/E cycle counts, 0.61% for the normal-Laplace-based model, and 0.68% for our Student's t-based model). The maximum difference in error between these two models is 0.11%, at 20K P/E cycles (already well beyond the rated lifetime of the flash chip, which is 3K P/E cycles).

Complexity. All three of the models require online iterative computation whenever the flash controller needs to generate a characterization of the threshold voltage distribution at a new P/E cycle count. For each iterative computation, hundreds to thousands of iterations of the model computation algorithm must be executed before the model reaches high accuracy (i.e., before the model converges). As we alluded to in Section 5.3.2, while the normal-Laplace-based model is accurate, it requires significant computation during each iteration, and cannot be precomputed and stored in a lookup table, making it less practical for use within a flash controller. To compare the complexity of the three models, we summarize their computation overhead in terms of the number of floating-point operations and table lookups performed for each iteration, as well as their storage overhead in terms of lookup table size. Table 5.2 compares the three models. As we can see, the normal-Laplace-based model requires 91,200 operations per iteration (which involves computing N_k(X) for four states in each of the 304 threshold voltage bins). Assuming that each floating-point operation takes the same number of cycles, the normal-Laplace-based model is 10.71x slower than the Gaussian-based model. In contrast, our Student's t-based model takes 4.41x less computation time than the normal-Laplace-based model, with near-identical accuracy. Our Student's t-based model is only 2.43x slower than the Gaussian-based model, but has a 74% smaller modeling error.

Table 5.2: Computation and storage complexity comparison for the three evaluated threshold distribution models.

Model        Gaussian   Normal-Laplace   Student's t
Operations   8512       91200            20672
Storage      640B       3.84KB           25.6KB

The third row of the table shows the storage overhead for the z-table or t-tables used by each model (which are populated in the flash controller's DRAM when the flash device is powered up). The Gaussian-based model needs only 640B to store the useful range of the z-table. The normal-Laplace-based model requires a larger lookup range for the z-table, increasing the storage overhead to 3.84KB. Our Student's t-based model requires storing multiple t-tables (one table per value of ν), and uses 25.6KB of storage in total. We find that all three storage overhead values are negligible, as these tables are easily stored within the flash controller's DRAM, which is usually sized to be a fixed fraction of the flash storage capacity (e.g., 1GB memory for a 512GB drive).

Latency. The flash controller builds a threshold voltage distribution model in two steps: characterization and model computation. First, we identify the threshold voltages of each cell in a sampled flash wordline by performing 303 read operations, one read for each read reference voltage level (using the approach described in Section 5.2). This characterization takes 30.3 ms for the wordline, assuming a typical read latency of 100µs. Second, once characterization is complete, the controller computes the model, using a combination of the precomputed tables stored in DRAM and the characterized voltages. To calculate the overhead, we assume that each of the models takes 1000 iterations to converge, and that computation is performed on a 1GHz embedded processor that completes one instruction per cycle.

Figure 5.6 shows the overall latency for the three models we evaluate, broken down into characterization latency (which is the same for all three models) and model computation latency. The computation overhead of the normal-Laplace-based model dominates its overall latency, while the computation overhead of our Student's t-based model is much smaller than the characterization latency. As a result, our Student's t-based model has a 58.0% lower overall latency than the normal-Laplace-based model. Since the fixed characterization latency dominates overall latency in both our Student's t-based model and the Gaussian-based model, our model is only 31.3% slower in overall latency than the Gaussian-based model, while it reduces modeling error by 74%.
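For concreteness, the overall latencies behind these percentages can be reproduced from the per-iteration operation counts in Table 5.2, assuming 1000 iterations on the 1GHz, one-instruction-per-cycle processor described above (a back-of-the-envelope reconstruction, not additional measured data):

```python
ops_per_iteration = {"Gaussian": 8512, "Normal-Laplace": 91200, "Student's t": 20672}
t_char_ms = 303 * 0.1                            # 303 reads x 100 us = 30.3 ms
for model, ops in ops_per_iteration.items():
    t_comp_ms = 1000 * ops / 1e9 * 1e3           # 1000 iterations at 1 GHz, in ms
    print(f"{model}: {t_char_ms:.1f} ms + {t_comp_ms:.1f} ms "
          f"= {t_char_ms + t_comp_ms:.1f} ms overall")
```

The resulting totals (roughly 38.8 ms, 121.5 ms, and 51.0 ms) are consistent with the 58.0% reduction relative to the normal-Laplace-based model (1 − 51.0/121.5) and the 31.3% slowdown relative to the Gaussian-based model (51.0/38.8 − 1) reported above.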

Figure 5.6: Overall latency breakdown (characterization latency and online computation latency, in ms) of the three evaluated threshold voltage distribution models for static modeling.

The frequency with which the characterization and modeling procedure is triggered depends purely on the application making use of the threshold voltage distribution model. Note that the choice of model should not change the frequency at which the procedure is executed (as each model provides an equivalent end result). As an example, we can determine the amortized overhead per 4KB read/write operation for one application of our model, which predicts the optimal read reference voltage (see Section 5.5.2). The prediction mechanism requires us to repeat the characterization and modeling procedure only once every 1000 P/E cycles. For a flash device with 512 pages per block, if we conservatively assume a read-to-write ratio of 1:1, the average overhead amortized over each read/write operation is 49.8ns using our static Student's t-based model [196].

Summary. In summary, the majority of the accuracy improvement over the Gaussian-based model comes from (1) accounting for the program errors for the erased and P1 states, and (2) accounting for the fat tails of each state. Our Student's t-based model, as well as the previously-proposed normal-Laplace-based model, both contain these improvements, and hence achieve similar accuracy. Our Student's t-distribution based model has much lower complexity than the normal-Laplace-based model due to its ability to exploit precomputation. We show in Section 5.3.3 that the CDF of the Student's t-distribution can be simplified into a simple table lookup using the z-score, Z, and the degrees of freedom, ν. We are unaware of a similar precomputation-based approach that can be applied to the normal-Laplace model. We conclude that our new Student's t-based model achieves the high accuracy of the normal-Laplace-based model while requiring significantly less complexity and latency to compute. As such, we believe that our Student's t-based model meets the requirements of accuracy and simplicity, and is a practical model for implementation within the flash controller.

5.4 Dynamic Modeling

We now construct a dynamic threshold voltage distribution model, building off of our Student’s t-based static model in Section 5.3.3, to capture how the threshold voltage distribution changes as the program/erase (P/E) cycle count increases. Again, we must ensure that this dynamic model is accurate, and that it is easy to compute, as we aim to implement the model within the flash controller. To construct the model, we first analyze how each of the individual parameters making up our Student’s t-based model change over the P/E cycle count (Section 5.4.1). By analyzing the meaning of each parameter and observing how it changes, we gain new insights on how the threshold voltage shifts with increasing P/E cycle count. We then use these new insights to construct a model using the power law, which can successfully predict the future changes to each of these parameters based on the current threshold voltage distribution (Section 5.4.2). Finally, we validate this model (Section 5.4.3).

5.4.1 Static Model Trends Over P/E Cycles

In order to analyze and observe how the parameters for our Student's t-based model change as the P/E cycle count increases, we first need to understand what each parameter means. As we discuss in Section 5.3.3, our Student's t-based model has 16 parameters. Four of them are the mean values for each state X's threshold voltage distribution (µ_X). Another four parameters are the standard deviation values of the threshold voltage distribution of each state X (σ_X). Three of them are the left tail sizes of the P1, P2, and P3 state distributions (β_X), and another three are the right tail sizes of the ER, P1, and P2 state distributions (α_X). (Recall from Section 5.3.2 that the left tail of the ER state and the right tail of the P3 state cannot be observed experimentally, so we assume that they equal the right tail of the ER state and the left tail of the P3 state, respectively.) The remaining two parameters are the probabilities of program errors occurring for cells programmed into the ER and P1 states (λ_X).

Mean. The mean value of each state represents the center of the distribution. In our Student's t-based model, the majority of the mass of the threshold voltage distribution for each state is near the center. Thus, a change in the mean reflects how the P/E cycle count generally affects the threshold voltages of all cells in each state. Figure 5.7 plots the mean values obtained from sample Student's t-based models constructed over a range of 20K P/E cycles, shown as circles. The x-axis shows the P/E cycle count, while the y-axis shows the normalized threshold voltage of the mean. Each graph plots the mean value for a different state, which is labeled at the top of the graph. We make three observations from this figure. First, the mean value of each state's distribution increases monotonically with P/E cycle count. Second, the mean value increases faster at lower P/E cycle counts, then slows down to a constant rate of increase after 5K P/E cycles. Third, the mean value shifts more quickly for lower threshold voltage states (ER, P1).

Figure 5.7: Change in mean value of each state's threshold voltage distribution as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line). (One panel per state: ER, P1, P2, P3; x-axis: P/E cycles; y-axis: mean of the normalized threshold voltage.)

Standard Deviation. The standard deviation of each state represents the width of the distribution. Similar to the Gaussian distribution, the Student's t-distribution contains the vast majority (∼95%) of its mass within two standard deviations. Thus, the change in standard deviation reflects how the P/E cycle count affects the threshold voltage variation among flash cells. Figure 5.8 plots the standard deviation values obtained from our Student's t-based model as circles. For this figure, the y-axis shows the standard deviation in terms of normalized threshold voltage. We make three observations from this figure. First, the standard deviation of each state's distribution increases monotonically with P/E cycle count. Second, the standard deviations of the P1 and P2 states increase linearly with P/E cycle count. Third, like the mean, the standard deviation increases faster at lower P/E cycle counts, then slows down to a constant rate of increase after 5K P/E cycles.

Figure 5.8: Change in standard deviation of each state's threshold voltage distribution as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line). (One panel per state: ER, P1, P2, P3; x-axis: P/E cycles; y-axis: standard deviation of the normalized threshold voltage.)

Tail Values. The tail values of each state represent the size and shape of the distribution tail. Recall from Section 5.3.3 that we use ν, which actually represents the degrees of freedom, to control how fat the tail of the model is. Thus, the tail value reflects how the P/E cycle count affects the number of outlier cells (i.e., the number of cells that lie at the tail). Figure 5.9 plots the tail values obtained from our Student’s t-based model as circles. In this figure, the y-axis shows the value of ν, where a lower value of ν corresponds to a fatter tail. We make three observations. First, the range of values for the tail sizes of the ER and P3 states is much smaller in comparison to the tail sizes of the distributions of the other states. Second, the sizes of both tails for the P1 state increase with P/E cycle count. Third, the tail sizes of the P2 state decrease as P/E cycle count increases.

Figure 5.9: Change in tail values (ν) of each state's threshold voltage distribution as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line). (Panels: α_ER, α_P1, β_P1, α_P2, β_P2, β_P3; x-axis: P/E cycles; y-axis: ν.)

Probability of Program Errors. The program error probability λ_X represents the percentage of cells that should be programmed into state X, but are instead misprogrammed to a different state, as a result of two-step programming (see Section 2.2.4). In our model, we assume that only certain types of program errors exist (ER → P3 and P1 → P2), as program errors flip the value of only the LSB within a cell and can only increase the threshold voltage. Figure 5.10 plots the program error probability obtained from our Student's t-based model as circles. For this graph, the y-axis shows the log10 value of the program error probability. We make two observations. First, the program error rate increases with P/E cycle count. Second, the number of program errors increases more rapidly at lower P/E cycle counts, and then slows down to a constant rate of increase at higher P/E cycle counts.

Figure 5.10: Change in log value of the program error probability as P/E cycle count increases, for the static Student's t-based model (blue circles) and the dynamic model (red line). (Panels: ER and P1 states; x-axis: P/E cycles; y-axis: log10 of the program error probability.)

5.4.2 Power Law-based Model

Now that we have characterized how each of the parameters for our Student's t-based model changes with respect to the P/E cycle count, we use this characterization to develop a dynamic model of the threshold voltage distribution. A dynamic model can reduce the total computation effort for the threshold voltage distribution significantly, by requiring as little as a single static model characterization for the entire lifetime of the flash device. The dynamic model takes the static characterization-based model(s) generated in the past, and simply adjusts the model parameters at higher P/E cycle counts based on its prediction of how each parameter would change with P/E cycle count (without requiring any further characterization). Without the dynamic model, a static model of the characterization must be generated every time a new threshold voltage distribution is requested by the controller (e.g., after a fixed number of P/E cycles have occurred), with each characterization requiring a large number of read-retry operations (see Section 5.3.4). These read-retry operations increase the accuracy of the model, but interfere with and slow down host commands. Our goal is to build a dynamic model that is accurate and easy to compute (such that it requires only a small number of characterizations), so that it can be used within the flash controller.

In Section 5.4.1, we observe that all of the parameters can increase, decrease, or remain relatively constant. We also observe that the rate at which increases and decreases occur differs between lower P/E cycle counts and higher P/E cycle counts. Our dynamic model must be able to represent all of these behaviors. We find that the power law satisfies all of these characteristics. Equation 5.9 shows the power law function, which models each parameter from our Student's t-based model, Y, as a function of the P/E cycle count (x):

Y = a · x^b + c    (5.9)

The power, b, can be set to a positive value to represent an increasing trend, or can be set to a negative value to represent a decreasing trend. b can also control the difference in slope at different P/E cycle counts. For example, when b < 1, the modeled parameter Y changes faster at lower P/E cycle counts, and when b > 1, Y changes faster at higher P/E cycle counts. To observe how well the power law models changes to the parameters of our Student's t-based model, we fit the power law to the values of each of the parameters as measured over several P/E cycle counts (see Section 5.4.1).4 We use the mean squared error (MSE) to estimate the error, where the divergence between the measured and estimated parameters (Y_i and Ŷ_i, respectively) can be mathematically defined as MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)^2. We use the Nelder-Mead simplex method [245], with a reasonable initial guess, to fit the trend.

Figures 5.7, 5.8, 5.9, and 5.10 show the power law-based models fit to the trends of each of our parameters as solid lines. We fit the power law to the static model parameter values generated over a range of 20K P/E cycles. We observe that the predictions from the power law fit very well with the actual parameters measured from our Student's t-based model, which are shown as blue circles. We next quantify the accuracy of our power law-based dynamic model.
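A minimal sketch of this fitting and extrapolation step, using illustrative sample values rather than the measured parameter trends of Figures 5.7–5.10:

```python
import numpy as np
from scipy.optimize import minimize

def fit_power_law(pe_cycles, param_values, x0=(1.0, 0.5, 0.0)):
    """Fit Y = a * x^b + c (Equation 5.9) to one static-model parameter by
    minimizing the mean squared error with the Nelder-Mead simplex method."""
    x = np.asarray(pe_cycles, dtype=float)
    y = np.asarray(param_values, dtype=float)
    def mse(params):
        a, b, c = params
        return float(np.mean((y - (a * np.power(x, b) + c)) ** 2))
    return minimize(mse, x0=x0, method="Nelder-Mead").x   # fitted (a, b, c)

def predict_parameter(a, b, c, pe_cycle_count):
    """Extrapolate the parameter to a future P/E cycle count."""
    return a * pe_cycle_count ** b + c

# Illustrative mean-value trend characterized at four early wear levels.
pe = [2500, 5000, 7500, 10000]
mu_p1 = [128.0, 131.5, 133.8, 135.5]
a, b, c = fit_power_law(pe, mu_p1)
print(predict_parameter(a, b, c, 20000))   # extrapolated mean at 20K P/E cycles
```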

5.4.3 Model Validation

We validate our dynamic model by using it to predict the threshold voltage distribution at 20K P/E cycles. We perform threshold voltage distribution characterizations at 2.5K, 5K, 7.5K, and 10K P/E cycles, and use these parameters to predict the distribution at 20K P/E cycles. Figure 5.11 shows the comparison between the actual characterized distribution (markers) and the distribution predicted by our dynamic model (solid or dashed curves) at 20K P/E cycles. The modeling error for the dynamic model is only 2.72%, which is close to the modeling error of directly using a static Gaussian-based model at 20K P/E cycles. The dynamic model avoids the need to perform the extensive read-retry characterization that all static models, including the Gaussian-based model, would require.

4We exclude 0 P/E cycle results when modeling, as they show a completely different behavior than results at any other P/E cycle count.

Figure 5.11: Threshold voltage distribution as predicted by our dynamic model for 20K P/E cycles, using characterization data from 2.5K, 5K, 7.5K, and 10K P/E cycles, shown as solid/dashed lines. Markers represent data measured from real NAND flash chips at 20K P/E cycles. (x-axis: normalized Vth; y-axis: probability density.)

Figure 5.12 shows how the modeling error of our dynamic model decreases for a prediction at 20K P/E cycles as the number of characterized data points increases. The number of characterized data points represents the N earliest static models out of a range that consists of static models for 2.5K, 5K, 7.5K, 10K, 12K, 14K, 16K, 18K, and 19K P/E cycles. Note that we start with three characterization data points, which allows the dynamic model to observe a trend in the change of each parameter. This figure shows that the error rate decreases rapidly as we increase the number of data points used to train the dynamic model.


Figure 5.12: Modeling error of predicted threshold voltage distribution for our dynamic model at 20K P/E cycles, using characterization data from N different P/E cycles.

We conclude that our dynamic model successfully predicts an accurate threshold voltage distribution for a P/E cycle count it has not observed, based on only prior characterization data, and is thus practical for use in the flash controller.

5.5 Example Applications

Now that we have developed our threshold voltage distribution model, we demonstrate several example applications within the flash controller that take advantage of the model to enhance the reliability of the flash device. The first application, described in Section 5.5.1, uses our model to accurately estimate the raw bit error rate. The second application, described in Section 5.5.2, uses our model to accurately predict the optimal read reference voltage. The third application, described in Section 5.5.3, uses our model to estimate the expected lifetime of the flash memory device, to safely achieve higher P/E cycle endurance than the manufacturer specification. Sections 5.5.4 and 5.5.5 discuss two further uses of the model: estimating soft information for LDPC codes, and improving flash performance.

5.5.1 Raw Bit Error Rate Estimation

The raw bit error rate (i.e., the probability of reading an incorrect state for a flash cell), or RBER, is important not only because it is a measure of the reliability of a flash device, but also because it can be used to determine the lifetime and performance of the flash drive [27]. The raw bit error rate can be used to enable several optimizations in the flash controller. For example, accurately estimating the current raw bit error rate allows us to safely utilize the currently unused ECC correction capability to accelerate program operations [122], relax the retention time [27, 188], and reduce the effects of read disturbance [35]. Accurate estimation of the raw bit error rate enables other optimizations, such as predicting the optimal read reference voltage [24] or performing error-rate-based wear-leveling.

To estimate the raw bit error rate based on the static threshold voltage distribution model, we use the static model to calculate the cumulative distribution function (CDF) for each state at each of our read reference voltages (Va, Vb, and Vc), and use this data to determine how many cells are misread. For example, if there are cells in the distribution of the ER state whose threshold voltages are greater than Va, they will be misread. By calculating the ER state CDF up to Va, we know what percentage of cells will be correctly read. We subtract this value from 1 to obtain the percentage of cells that will be misread (and will thus contribute to the raw bit error rate).

Figure 5.13 shows the actual measured raw bit error rate and the modeled raw bit error rates using the three static models from Section 5.3, for different P/E cycle counts. The x-axis shows the P/E cycle count, and the y-axis shows the measured or model-predicted raw bit error rate. The three graphs show the average error rate for only the LSB pages, only the MSB pages, and for all of the pages. We make two observations from this data. First, the normal-Laplace-based and our Student's t-based models give a much better estimate of the raw bit error rate than the Gaussian-based model. Averaged across all P/E cycle counts, our Student's t-based model estimates the RBER for all pages within 13.0% of the actual measured RBER, while the normal-Laplace-based model is within 14.9% and the Gaussian-based model is only within 44.7%. This is due to the limitations of the Gaussian-based model, as it cannot adjust the tail size or take program errors into account. Second, the normal-Laplace-based and our Student's t-based models tend to overestimate the error rate, which is usually safe for the purposes of many optimizations, because overestimation results in more than adequate ECC correction capability remaining available for these errors. In contrast, the Gaussian-based model always underestimates the raw bit error rate, which, if used for an optimization that relies on an RBER estimate, can cause the number of errors to exceed the correction capability of the ECC, resulting in uncorrectable errors during reads.

We conclude that our Student's t-based model is effective at providing an accurate estimate of the raw bit error rate for use by the flash controller.
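As an illustration of this procedure, the following sketch estimates the fraction of misread cells from the modeled CDFs at the three read reference voltages; the equal-population assumption, the Gaussian stand-in CDFs, and the step from misread cells to a per-bit RBER (which depends on the state-to-bit coding) are all simplifications made for the sketch:

```python
from functools import partial
from scipy.stats import norm

def estimate_misread_cell_fraction(state_cdfs, read_refs):
    """Estimate the fraction of cells that land on the wrong side of a read
    reference voltage, assuming the four states hold equal numbers of cells.
    state_cdfs: CDF callables for (ER, P1, P2, P3); read_refs: (Va, Vb, Vc).
    Turning this into a per-bit RBER additionally depends on the state-to-bit
    (Gray code) mapping, which this sketch omits."""
    frac = 0.0
    for i, v in enumerate(read_refs):
        lower, upper = state_cdfs[i], state_cdfs[i + 1]
        frac += 1.0 - lower(v)     # lower-state cells misread as the upper state
        frac += upper(v)           # upper-state cells misread as the lower state
    return frac / len(state_cdfs)  # each state holds ~1/4 of the cells

# Illustrative Gaussian CDFs standing in for the fitted model of each state,
# evaluated at the default read reference voltages given in footnote 5.
cdfs = [partial(norm.cdf, loc=m, scale=s)
        for m, s in [(-140, 65), (130, 18), (260, 16), (420, 15)]]
print(estimate_misread_cell_fraction(cdfs, read_refs=(50, 190, 330)))
```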

Figure 5.13: Actual and modeled raw bit error rate using the three evaluated threshold voltage distribution models when reading with fixed default read reference voltages (Vref), across different P/E cycle counts. (Panels: LSB pages, MSB pages, and all pages; x-axis: P/E cycles; y-axis: predicted RBER at the default Vref.)

5.5.2 Optimal Read Reference Voltage Prediction

As we discussed in Section 3.1, when the threshold voltage distribution shifts, it is important to move the read reference voltage to the point where the number of read errors is minimized. After the shift occurs, the threshold voltage distributions of each state may overlap with each other, causing many of the cells within the overlapping regions to be misread. The number of errors due to misread cells can be minimized by setting the read reference voltage to be at the point where the distributions of two neighboring states intersect, which we call the optimal read reference voltage (Vopt) [24]. Once the optimal read reference voltage is applied, the raw bit error rate is minimized, improving the reliability of the device. Furthermore, since fewer errors are corrected, and fewer read-retries are needed, read latency is also significantly reduced [27].

Prior work proposes to learn and record the optimal read reference voltage periodically [27, 252, 309] by sampling the threshold voltages of some of the cells in each flash block, but this sampling requires time and storage overhead. With our new distribution model, we can determine the optimal read reference voltage from the model and predict how it changes, without having to exhaustively learn it for each block. From our threshold voltage distribution model, we can predict the optimal read reference voltage by finding the point at which the probability density functions of the distributions of two neighboring states are the same (i.e., the intersection of the two distributions).
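One simple way to locate this intersection numerically, given any two neighboring modeled densities, is a bracketed root search on their difference (a sketch; the Gaussian densities and parameter values stand in for whichever fitted model the controller uses):

```python
from scipy.optimize import brentq
from scipy.stats import norm

def predict_vopt(pdf_lower, pdf_upper, v_lo, v_hi):
    """Predict the optimal read reference voltage as the point between the two
    state means where the neighboring modeled densities are equal."""
    diff = lambda v: pdf_lower(v) - pdf_upper(v)
    return brentq(diff, v_lo, v_hi)   # root of the density difference

# Illustrative P1/P2 densities standing in for the fitted model.
pdf_p1 = lambda v: norm.pdf(v, loc=130.0, scale=18.0)
pdf_p2 = lambda v: norm.pdf(v, loc=260.0, scale=16.0)
print(predict_vopt(pdf_p1, pdf_p2, v_lo=130.0, v_hi=260.0))   # predicted Vb
```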

Figure 5.14 plots the actual measured and modeled optimal read reference voltages using the three static models from Section 5.3, at different P/E cycle counts.5 Each graph shows the voltage chosen for one of the three read reference voltages (Va, Vb, and Vc) used to distinguish between the distributions of two neighboring states. The x-axis shows the P/E cycle count, while the y-axis shows the normalized optimal read reference voltage. We make three observations from this result. First, the normal-Laplace-based and our Student's t-based models slightly overestimate all three optimal read reference voltages. Second, the Gaussian-based model underestimates the optimal read reference voltages in most cases, and has glitches of underestimation as large as 17 voltage steps. We suspect that this is because the Gaussian-based model cannot capture the asymmetric tail sizes of the distribution. Third, at 0 P/E cycles, the read reference voltages predicted using the normal-Laplace-based model deviate significantly from the actual optimal read reference voltages. We find that the normal-Laplace-based model has difficulty converging to a good value at 0 P/E cycles, while our Student's t-based model does not experience any such difficulty.

Figure 5.14: Actual and modeled optimal read reference voltages (Vopt) using the three evaluated threshold voltage distribution models at different P/E cycle counts. (Panels: Va, Vb, and Vc; x-axis: P/E cycles; y-axis: normalized Vopt.)

Figure 5.15 shows the RBER when we use the actual optimal read reference voltage to read data, as well as the RBER when we use the optimal read reference voltages predicted by each of the three static models from Section 5.3, at different P/E cycle counts. As we did for Figure 5.13, we show the average error rate for only the LSB pages, only the MSB pages, and for all of the pages. We observe that the prediction generated from the Gaussian-based model results in a significantly higher MSB error rate than the actual optimal voltage. The normal-Laplace-based and our Student's t-based models generate read reference voltage predictions that result in near-optimal RBER (within 1.5% and 1.1%, respectively, of the optimal RBER), despite some difference between the actual optimal read reference voltages and the model-predicted voltages.

5 Note that the default read reference voltages are (Va,Vb,Vc) = (50,190,330). We observe that the actual optimal read reference voltage can be higher than the default read reference voltage by as much as 27 voltage steps.

138 −3 −3 −3 x 10 LSB pages x 10 MSB pages x 10 ALL pages 2.5 2.5 2.5

Actual Gaussian 2 Normal−Laplace 2 2 Student’s t

of each model 1.5 1.5 1.5 opt

1 1 1

0.5 0.5 0.5 Actual RBER at V

0 0 0 0 0.5 1 1.5 2 0 0.5 1 1.5 2 0 0.5 1 1.5 2 4 4 4 P/E cycles x 10 P/E cycles x 10 P/E cycles x 10

Figure 5.15: RBER achieved by actual and modeled optimal read reference voltages (Vopt) using the three evaluated threshold voltage distribution models at different P/E cycle counts.

We evaluate how using the optimal voltages predicted by each model can improve flash lifetime compared to using the default read reference voltages. We assume that we have a state-of-the-art LDPC decoder, which can tolerate a raw bit error rate as high as 5 × 10^−3 [100] while still keeping the required uncorrectable error rate below 10^−15 [121] during the flash device's lifetime.6 We also assume that our flash device refreshes its data every three weeks, limiting the number of retention and read disturb errors that occur [22]. Using the actual (i.e., ideal) optimal read reference voltage, flash lifetime improves by 50.6%. Both our Student's t-based model and the normal-Laplace-based model come very close to the ideal improvement, providing a 48.9% lifetime improvement with our Student's t-based model. Due to its lower accuracy, the Gaussian-based model achieves only a 38.5% improvement.

5.5.3 Expected Lifetime Estimation

Due to the increasing raw bit error rate at higher P/E cycle counts, flash memory can endure only a limited number of writes. To make sure that all data stored in the flash drive is reliable over the course of a predefined device lifetime (typically several years), enterprise users limit the number of writes to each flash drive. Due to process variation, different flash chips can have different raw bit error rates and thus different P/E cycle endurance. However, flash vendors conservatively set the flash drive's P/E cycle endurance to the worst case (i.e., to the lowest endurance value out of all of the chips that they produce), as they do not know how fast each individual flash chip wears out over time. In fact, prior work has tested six commercial flash drives, and found that they all surpassed their official endurance specifications by an average of 81% [86].

6 To tolerate variation in the raw bit error rate, we assume that 10% of the total ECC correction capability is reserved, lowering the maximum tolerable raw bit error rate.

The model proposed in this chapter enables raw bit error rate prediction for future P/E cycle counts. Thus, the controller can predict the endurance limit of each flash chip by iterating through our dynamic model to predict the point at which the raw bit error rate exceeds the ECC correction capability (i.e., when the lifetime actually ends). The flash controller then communicates this prediction to the file system to allow higher write intensity to the flash drive. We estimate the lifetime improvements using this technique, with the same assumptions we made in Section 5.5.2 and the data shown in Section 5.5.1. With our dynamic model, we safely achieve 69.9% higher P/E cycle endurance than the manufacturer specification. This translates to 69.9% more tolerable writes per day, if we assume that the flash device will be used for the same number of years (i.e., lifetime) as before.
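To make the iteration concrete, the sketch below steps a hypothetical power-law RBER model forward in P/E cycles until the predicted RBER crosses the ECC correction limit. The model coefficients, the specification value, and the 10% reserved margin are illustrative placeholders, not the fitted parameters from our characterization.

```python
import numpy as np

# Hypothetical per-chip dynamic model: RBER grows with P/E cycles following a
# power law, RBER(pec) = a * pec**b + c. The coefficients are made-up
# placeholders; a real controller would learn them from online characterization.
a, b, c = 2.0e-9, 1.55, 1.0e-5

def predicted_rber(pe_cycles):
    return a * pe_cycles ** b + c

# LDPC correction limit of 5e-3 raw bit error rate, with a 10% reserved margin
# (both values follow the assumptions stated in Section 5.5.2).
ecc_limit = 5e-3 * 0.9

def predict_endurance(step=100, max_pec=100_000):
    """Iterate the model forward until the predicted RBER exceeds the ECC limit."""
    for pec in range(0, max_pec, step):
        if predicted_rber(pec) > ecc_limit:
            return pec
    return max_pec

if __name__ == "__main__":
    spec_endurance = 10_000  # hypothetical manufacturer specification
    predicted = predict_endurance()
    print(f"Predicted endurance: {predicted} P/E cycles "
          f"({100.0 * (predicted / spec_endurance - 1):.1f}% above specification)")
```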

5.5.4 Soft Information Estimation for LDPC Codes

To tolerate flash errors more efficiently, today's flash controllers use LDPC codes to detect and correct multiple raw bit errors in the data read from the flash memory channel [78, 326, 362]. An LDPC code can use soft information about each bit to increase the probability of correcting the raw bit errors. This soft information is provided by the flash controller, which estimates the probability of each bit being a 1 or a 0 using the threshold voltage of a cell. A modern flash controller typically obtains this probability from a Gaussian-based model for the threshold voltage distribution, since the soft information can be computed as a quadratic function of the threshold voltage. However, as we have shown in Section 5.3, the Gaussian-based model underestimates the probability density at the tail of the distribution, and does not model program errors. Thus, the model can provide inaccurate information to the LDPC decoder. This compromises the error correction capability of the LDPC codes, thus reducing the reliability and performance of the flash drive.

With the models proposed in this chapter, we now have an accurate threshold voltage distribution model that adapts to the P/E cycle count of each block, and can be implemented within the flash controller. Using our Student's t-based model, we can accurately and efficiently compute the probability density for any threshold voltage range to provide accurate soft information to the flash controller. By increasing the accuracy of this soft information, we effectively increase the error correction capability of the LDPC code, which can lead to longer flash lifetime and better read performance [78, 326, 362]. We leave the precise implementation of such a mechanism for future work.
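As an illustration of how per-bit soft information could be derived from such a model, the sketch below computes a log-likelihood ratio (LLR) for the LSB of a cell from per-state probability densities. It uses SciPy's standard location-scale Student's t distribution as a stand-in for our modified Student's t-based model; all state parameters and the bit mapping are hypothetical placeholders rather than values from our characterization.

```python
import numpy as np
from scipy.stats import t as student_t

# Hypothetical per-state parameters (location, scale, degrees of freedom) for a
# location-scale Student's t model of each state's threshold voltage
# distribution. Real parameters would come from the online model fit.
STATES = {
    "ER": (0.0,   18.0, 5.0),
    "P1": (110.0, 11.0, 5.0),
    "P2": (175.0, 11.5, 5.0),
    "P3": (245.0, 12.0, 5.0),
}
# Assumed (hypothetical) bit mapping in which the LSB flips between P1 and P2.
LSB = {"ER": 1, "P1": 1, "P2": 0, "P3": 0}

def lsb_llr(v_th):
    """Log-likelihood ratio of the LSB for a cell read at threshold voltage v_th."""
    p1 = sum(student_t.pdf(v_th, df, loc=loc, scale=sc)
             for s, (loc, sc, df) in STATES.items() if LSB[s] == 1)
    p0 = sum(student_t.pdf(v_th, df, loc=loc, scale=sc)
             for s, (loc, sc, df) in STATES.items() if LSB[s] == 0)
    return np.log(p1 + 1e-300) - np.log(p0 + 1e-300)  # guard against log(0)

if __name__ == "__main__":
    for v in (120.0, 142.0, 165.0):  # voltages near the P1/P2 boundary
        print(f"Vth={v:6.1f}  LSB LLR={lsb_llr(v):+.2f}")
```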

5.5.5 Improving Flash Performance

While the applications of our threshold voltage distribution model that we have discussed in Sections 5.5 and 5.5.4 aim to improve reliability, they can also improve flash performance. For example, by predicting and applying the optimal read reference voltage (Section 5.5.2), we can greatly lower the probability that read-retries need to be performed for a read operation, which also reduces the number of ECC decoding iterations, both of which lead to a lower read latency [27].

Other applications can also take advantage of our model to improve flash performance. For example, we can minimize the ECC decoding latency by adaptively applying a weaker ECC code when the raw bit error rate indicated by our model is low [100, 101, 336, 358]. We expect and hope that future work will evaluate the performance benefits of these applications, and propose other new applications of our online model that can improve flash performance.

5.6 Related Work

To our knowledge, our work in this chapter is the first to (1) propose a threshold voltage distribution model that is both highly accurate and computationally efficient, (2) propose a dynamic threshold voltage distribution model that predicts how the parameters of this model change with increasing program/erase cycle count, and (3) demonstrate several new practical uses of this threshold voltage distribution model within a flash controller to improve flash memory reliability. We have already comprehensively compared our Student's t-based static model to the two most relevant models based on real characterization results, the Gaussian-based model [23, 227] and the normal-Laplace-based model [260], in Sections 5.3.1, 5.3.2, and 5.5. We show that our Student's t-based model has an error rate within 0.11% of the error rate of the highly-accurate normal-Laplace model, while requiring 4.41× less computation time. Several prior works fit the threshold voltage distribution to other models that are either less accurate or more complex, such as the beta distribution [23], gamma distribution [23], log-normal distribution [23], Weibull distribution [23], and beta-binomial probability distribution [315]. Other prior works model the threshold voltage distribution based on idealized circuit-level models [78, 227, 251]. These models capture some of the desired threshold voltage distribution behavior, but are less accurate than those derived from real characterization. A few works also propose dynamic models of the threshold voltage distribution shifts based on the power law [23, 24, 260]. While these models are sufficient for offline analysis, they are unsuitable for deployment in today's flash controllers, as they fail to achieve high accuracy and low computational complexity at the same time. Our dynamic model also uses the power law, but is based on our new, accurate, and low-complexity Student's t-based static model. We show that our model has an error rate of only 2.72% when estimating the distribution at 20K P/E cycles, even though it uses characterization data collected at only four different P/E cycle counts from the past (up to 10K P/E cycles). While other dynamic models based on idealized circuit models exist [78, 251], they are not validated with real characterization data, and cannot achieve the same accuracy as our model. Prior works propose and evaluate techniques for raw bit error rate estimation [260, 270], optimal read reference voltage estimation [27, 252, 253, 309], and LDPC soft decoding [78, 326, 362]. These works utilize a threshold voltage distribution model only offline, or do not utilize a threshold voltage distribution model at all. We show that, by utilizing our model, we can effectively and practically guide such flash reliability mechanisms online in the flash controller. We also provide a new mechanism to exploit process variation for higher flash endurance, by predicting and safely utilizing the remaining lifetime of a flash device online. In contrast, prior works propose only to tolerate error rate variation and process variation to improve flash lifetime [180, 228, 229].

We note that several prior works have already extensively studied the impact of retention behavior on the threshold voltage distribution using real hardware [21, 22, 25, 27]. They show that commonly-employed refresh mechanisms in flash devices can successfully mitigate most of the impact of retention on the threshold voltage distribution [22, 25, 194]. As a result, we expect that even without capturing the effects of retention, our proposed threshold voltage distribution model will work well in practice.

5.7 Limitations

Our online model captures only P/E cycling and two-step programming effects, which are two of the most dominant error sources in 1X-nm MLC planar NAND flash memory. Currently, our online model does not model and mitigate retention and read disturb errors because they are successfully mitigated by commonly-employed flash refresh mechanisms [22, 222, 250], which can be combined with our online model to provide a greater reduction in raw bit errors. Furthermore, our online model can be extended to model the threshold voltage shift due to retention or read disturb. Online modeling effectively reduces retention errors when refresh becomes less effective in 3D NAND, as we show in Chapters 6 and 7.

5.8 Conclusion

In this chapter, we introduce a new threshold voltage distribution model for modern NAND flash memory devices. Our model is based on a new experimental characterization of the threshold voltage distribution and how it shifts over time using state-of-the-art 1X-nm MLC NAND flash chips. Our characterization shows that the threshold voltage distribution can be approximated using our modified version of the Student’s t-distribution, and that the amount by which the distribution shifts as the P/E cycle count increases is governed by the power law. Our new model, which combines these two observations in its static and dynamic components, is capable of accurately capturing the current and predicting the future threshold voltage distribution of flash memory cells. We show that our model achieves low modeling error, and is computationally simple enough to implement online in a flash controller. We demonstrate various applications of our model in a flash controller. We show that these applications improve flash lifetime by 48.9% and/or enable the flash device to safely utilize 69.9% more P/E cycles than manufacturer specification. We conclude that our proposed threshold voltage distribution model for modern MLC NAND flash memory devices is practical and effective. We hope that this dissertation inspires future work to improve upon our online flash channel model, and to develop and evaluate new techniques that take advantage of such a model to increase flash memory reliability and performance.

Chapter 6

3D NAND Flash Memory Error Characterization and Mitigation

In order for planar NAND flash memory to continually increase the SSD capacity and decrease the cost-per-bit of the SSD, flash vendors have to aggressively scale NAND flash memory to smaller manufacturing process technologies. This, however, comes at the cost of decreasing flash reliability [21, 33, 219], as we have shown in Section 3.1. Due to a combination of manufacturing process limitations and decreasing reliability, it has become increasingly difficult for manufacturers to continue to scale the density of planar NAND flash memory [257]. To overcome this scaling challenge, 3D NAND flash memory has recently been introduced [115, 131, 257] (see Section 3.4 for a comparison between planar NAND and 3D NAND technology). Previous publicly-available experimental studies on NAND flash memory errors using real flash memory chips (e.g., [21, 22, 23, 24, 26, 27, 34, 35, 195, 219, 260]) have all been on planar NAND devices. As 3D NAND flash memory is already being deployed at a large scale in new computer systems, there is a lack of available knowledge on the error characteristics of real 3D NAND flash chips, which makes it harder to estimate the reliability characteristics of systems that employ such chips.

In this chapter, our goal is to (1) identify and understand the new error characteristics of 3D NAND flash memory (i.e., those that did not exist previously in planar NAND flash memory), and (2) propose new mechanisms to mitigate prevailing 3D NAND flash errors. We aim to achieve these goals via rigorous experimental characterization of real, state-of-the-art 3D NAND flash memory chips from a major flash vendor. Based on our comprehensive characterization and analysis, we identify three new error characteristics that were not previously observed in planar NAND flash memory, but are fundamental to the new architecture of 3D NAND flash memory. (1) 3D NAND flash exhibits layer-to-layer process variation, a new phenomenon specific to the 3D nature of the device, where the average error rate of each 3D-stacked layer in a chip is significantly different (Section 6.2.1). We are the first to provide detailed experimental characterization results of layer-to-layer process variation in real flash devices in the open literature. (2) 3D NAND flash memory experiences early retention loss, a new phenomenon where the number of errors due to charge leakage increases quickly within several hours after programming, but then increases at a much slower rate (Section 6.2.2). We are the first to perform an extended-duration observation of early retention loss. While prior studies examine the impact of early retention

loss over only the first 5 minutes after data is written, we examine the impact of early retention loss over 24 days. (3) 3D NAND flash memory experiences retention interference, a new phenomenon where the rate at which charge leaks from a flash cell is dependent on the amount of charge stored in neighboring flash cells (Section 6.2.3). Our experimental observations indicate that we must revisit the error models and the error mitigation mechanisms devised for planar NAND flash, as they are no longer accurate for 3D NAND flash behavior. To this end, we develop new analytical models of (1) the layer-to-layer process variation in 3D NAND flash memory (Section 6.4.1), and (2) retention loss in 3D NAND flash memory (Section 6.4.2). Both models are useful for developing techniques to mitigate raw bit errors in 3D NAND flash memory. Our models estimate the raw bit error rate (RBER), the threshold voltage distribution, and the optimal read reference voltage (i.e., the voltage at which the raw bit errors are minimized when applied during a read operation) for each flash page.

We propose four new techniques to mitigate the unique layer-to-layer process variation and early retention loss errors observed in 3D NAND flash memory. Our first technique, LaVAR, reduces process variation by fine-tuning the read reference voltage independently for each layer (Section 6.5.1). Our second technique, LI-RAID, is a new RAID scheme that eliminates the page with the worst-case reliability within each block by changing how we pair up pages from different flash blocks (Section 6.5.2). Our third technique, ReMAR, reduces retention errors in 3D NAND flash memory by tracking the retention age of the data using our retention model and adapting the read reference voltage to the data age (Section 6.5.3). Our fourth technique, ReNAC, predicts and adapts the read reference voltage to the amount of retention interference during each read operation (Section 6.5.4). These four techniques are complementary, and can be combined together to significantly improve NAND flash reliability. Compared to a state-of-the-art baseline, our techniques provide a combined 3D NAND flash memory lifetime improvement of 85.0%. Alternatively, in the case where a NAND flash manufacturer wants to keep the lifetime of the 3D NAND flash memory device constant, our techniques reduce the storage overhead required to hold error correction information by 78.9%.

6.1 3D NAND Error Characterization Overview

Our goal is to identify and understand new error characteristics in 3D NAND flash memory, through rigorous experimental characterization of real, state-of-the-art 3D NAND flash memory chips. We use the observations and analysis of this characterization to (1) compare how the reliability of a 3D NAND flash memory chip differs from that of a planar NAND flash memory chip, (2) develop a model of how each new error source affects the error rate of 3D NAND flash memory, (3) understand if and how these reliability characteristics will change with future generations of 3D NAND flash memory, and (4) develop mechanisms that can mitigate new error sources in 3D NAND flash memory.

For our characterization, we use the methodology discussed in Section 6.1.1. First, we perform a detailed characterization and analysis of three error characteristics that are drastically different in 3D NAND flash memory than in planar NAND flash: process variation (Section 6.2.1), retention errors (Section 6.2.2), and retention interference (Section 6.2.3). In addition to identifying new error sources in 3D NAND flash memory, we use our methodology to corroborate

and quantify 3D NAND error characteristics that are a result of error sources that were previously identified in planar NAND flash memory, including data retention [22, 27, 55, 257], P/E cycling [23, 195, 257, 260], program interference [24, 26, 257], read disturb [35, 260], and process variation [21, 269]. We summarize our findings for these error types in Section 6.2.4, and provide detailed results on our characterization of these previously-identified error sources in Section 6.3.

6.1.1 Methodology

We experimentally characterize several real, state-of-the-art 3D MLC NAND flash memory chips from a single vendor.1 We use a NAND flash characterization platform similar to prior work [20], which allows us to issue read-retry commands directly to the flash chip. The read-retry command allows us to fine-tune the read reference voltage used for each read operation. The smallest amount by which we can change the read reference voltage is called a voltage step. We conduct all experiments at room temperature (20 °C).

We use two metrics to evaluate 3D NAND reliability. First, we show the raw bit error rate (RBER), which is the rate at which errors occur in the data before error correction. We show the RBER when we read data using the optimal read reference voltage (Vopt), which is the read reference voltage that generates the fewest errors in the data.2 Second, we show how the various error sources change the threshold voltage distribution. These changes (i.e., shifting and widening) in the threshold voltage distribution directly lead to raw bit errors in the flash memory. To obtain the distribution, we first use the read-retry command to sweep over all possible voltage values, in order to identify the threshold voltage of each cell.3 Then we use this data to calculate the probability density of each state at every possible threshold voltage value. As part of our analysis, we fit the threshold voltage distribution of each state to a Gaussian distribution. We use the mean of the Gaussian model to represent how the distribution shifts as a result of errors, and we use the standard deviation of the model to represent how the distribution widens. Throughout this chapter, we present normalized voltage values instead of the actual voltage values, as the latter are proprietary to NAND flash memory manufacturers. A normalized voltage of 1 represents a single voltage step.

We show two examples in Figure 6.1 to visualize how well this simple Gaussian model captures the change in the measured threshold voltage distribution. Figure 6.1 shows the measured and modeled distributions under two conditions: (1) 0 P/E cycles, 0-day retention, and 0 read disturbs (i.e., the data contains few errors); and (2) 10K P/E cycles, 3-day retention, and 900K read disturbs (i.e., the data contains a high number of errors). We plot the distribution read from 3D NAND chips using dots. We use a solid line to show a fitted Gaussian distribution for each state. The average root mean square error of the fitted distributions is 1.6 × 10^−3.

1 The trends we observe from the characterization are expected to be similar for 3D charge trap flash manufactured by different vendors, as their 3D flash architectures are similar in design.
2 We show RBER at the optimal read reference voltage to accurately represent the reliability of NAND flash memory, as SSD controllers tune the read reference voltage to a near-optimal point to extend the NAND flash lifetime [27, 195, 252].
3 We refer to Section 5.2 for more detail on the methodology to obtain the threshold voltage distribution [195, 260].

We observe from this figure that after the chip is worn out, the threshold voltage distribution is shifted by all types of noise, reducing the error margins between neighboring states, which leads to more raw bit errors in the data. Thus, showing how these distributions are affected by various noise sources helps us understand how raw bit errors occur and mitigate these errors more effectively.


Figure 6.1: 3D NAND threshold voltage distribution before (black) and after (red) the data is subject to a high number of errors.

In the following sections, we directly show the mean and the standard deviation of the fitted threshold voltage distributions, instead of the distribution itself, to simplify the presentation of the results.
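The Gaussian fitting step described in Section 6.1.1 can be sketched as follows. The per-cell threshold voltages here are synthetic stand-ins for the read-retry measurements, and the state parameters are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Synthetic stand-in for the measured per-cell threshold voltages of one state
# (in normalized voltage steps); real data would come from read-retry sweeps.
v_th = rng.normal(loc=110.0, scale=11.0, size=50_000)

# Empirical probability density over the normalized voltage range.
bins = np.arange(0, 301)
hist, edges = np.histogram(v_th, bins=bins, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Fit a Gaussian by matching the sample mean and standard deviation, then
# report the root mean square error of the fit against the empirical density.
mu, sigma = v_th.mean(), v_th.std()
model = norm.pdf(centers, loc=mu, scale=sigma)
rmse = np.sqrt(np.mean((hist - model) ** 2))

print(f"fitted mean={mu:.1f}, stdev={sigma:.1f}, RMS error={rmse:.2e}")
```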

6.2 Key Characterization Results

6.2.1 Layer-to-Layer Process Variation

Process variation refers to the variation in the attributes of flash cells when they are fabricated (see Section 3.1). Due to process variation, some flash cells can have a higher RBER than others, making these cells the limiting factor of overall flash memory reliability. In 3D NAND flash memory, process variation can occur along all three axes of the memory (see Figure 3.27). Among the three axes, we expect the variation along the z-axis (i.e., layer-to-layer variation) to be the most significant, due to the new challenge of stacking multiple flash cells across layers. Prior work has shown that current circuit etching technologies are unable to produce identical 3D NAND cells when punching through multiple stacked layers, leading to significant variation in the error characteristics of flash cells that reside in different layers [112, 328].

To characterize layer-to-layer process variation errors within a flash block, we first wear out the block by programming random data to each page in the block until the block reaches 10K P/E cycles. Then, we compare the collective characteristics of the flash cells in one layer with those in another layer. We repeat this experiment for flash blocks on multiple chips to verify all of our findings.

Observations. Figure 6.2 shows the RBER variation along the z-axis (i.e., across layers)

for a flash block with 10K P/E cycles.4 The top graph breaks down the errors into MSB and LSB page errors; the bottom graph breaks down the errors according to the original and current state of each cell. In the top graph, the solid curve and the dotted curve show the results for two blocks that were randomly selected from two different flash chips. We make five observations from Figure 6.2. First, both the MSB and LSB error rates vary significantly across layers. For example, the MSB page on normalized layer 55, in the middle, has an RBER 6× higher than normalized layer 0. Second, ER↔P1 and P1↔P2 errors vary significantly across layers, while P2↔P3 errors remain similar across layers. Recall from Figure 2.7 that ER↔P1 errors are MSB errors and P1↔P2 errors are LSB errors. Third, the top half of the layers have lower error rates than the bottom half. Fourth, the middle layers have a much higher RBER than the other layers. Fifth, the RBER variation we observe is consistent across two randomly selected blocks from two different chips.


Figure 6.2: Layer-to-layer variation of RBER.

Figure 6.3 shows how the optimal read reference voltages vary across layers. The three subfigures show the optimal read reference voltages for Va, Vb, and Vc. We make two observations from this figure. First, the optimal voltages for Va and Vb vary significantly across layers, but the optimal Vc does not change by much. Second, the optimal values of Va and Vb increase in the top half of the layers, but decrease in the bottom half.

4 We normalize the number of layers from 0 (the top-most layer) to 100 (the bottom-most layer). The chips we use for characterization have 30 to 40 layers.


Figure 6.3: Optimal read reference voltage variation across layers.

Insights. We show that the layer-to-layer process variation is significant, which is unique to 3D NAND flash memory. In the future, as 3D NAND flash devices scale along the z-axis, more layers will be stacked vertically along each bitline. This will further exacerbate the effect of layer-to-layer process variation, making it even more important to study and mitigate this effect further.
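A mechanism that exploits this variation, such as the per-layer read voltage tuning discussed later in this chapter (Section 6.5.1), needs per-layer error statistics. The sketch below shows one way such statistics could be aggregated from per-page error counts; the data is synthetic and merely mimics the middle-layer peak in Figure 6.2, not measured values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: per-page bit error counts and the normalized layer (0-100)
# each page resides in. Real values would come from chip characterization.
n_pages, bits_per_page = 4096, 8 * 16384
layer_of_page = rng.integers(0, 101, size=n_pages)
# Make the middle layers worse, mimicking the trend in Figure 6.2.
base_rber = 1e-4 * (1.0 + 2.0 * np.exp(-((layer_of_page - 55) / 10.0) ** 2))
errors = rng.binomial(bits_per_page, base_rber)

# Aggregate the measured RBER per normalized layer.
per_layer_rber = {
    layer: errors[layer_of_page == layer].sum() /
           (bits_per_page * np.count_nonzero(layer_of_page == layer))
    for layer in np.unique(layer_of_page)
}
worst = max(per_layer_rber, key=per_layer_rber.get)
print(f"worst layer: {worst}, RBER={per_layer_rber[worst]:.2e}")
```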

6.2.2 Early Retention Loss

Retention errors are flash errors that accumulate after data has been programmed to the flash cells [22, 27] (see Section 3.1). Because 3D NAND flash memory typically uses a different cell design (i.e., the charge trap cell described in Section 3.4) than planar NAND flash memory (which uses floating-gate cells), it has drastically different retention error characteristics. The charge trap flash cells used in 3D NAND flash memory suffer from early retention loss, i.e., fast charge loss within a few seconds. This phenomenon has been observed by prior works using circuit-level characterization [47, 55]. However, due to limitations of the methodology used by these prior works, openly-available characterizations of early retention loss in 3D charge trap NAND flash devices document retention loss behavior for up to only 5 minutes after the data is written (i.e., for a maximum retention age of 5 minutes). This limited window is insufficient for understanding early retention loss under real workloads, which typically have much longer retention ages [194].

Our goal is to experimentally characterize early retention loss in 3D NAND flash memory for a large range of retention ages (e.g., from several minutes to several weeks). First, we randomly select 11 flash blocks within each chip and write pseudo-random data to each page within the block to wear the blocks out. We wear each block to a different number of P/E cycles, so that we have error data for every 1K P/E cycles between 0 and 10K P/E cycles.5 Then we program pseudo-random data to each flash block, and wait for up to 24 days at room temperature. To characterize retention loss, we measure the RBER and the threshold voltage distribution at

5 For all experiments throughout the chapter, we consistently assume a 0.5-second dwell time, which is the length of time between consecutive program/erase operations.

nine different retention ages, ranging from 7 minutes to 24 days. To minimize the impact of other errors, and to allow us to include very low retention ages, we characterize only the first 72 flash pages within each block. We believe that the observations we make on these flash cells are representative of the entire chip, and we can generalize the observations to all 3D NAND cells. Our threshold voltage distribution analysis is provided in Section 6.3.2.

Observations. Figure 6.4 shows the comparison between the retention error rate of 3D NAND and planar NAND flash memory at 10,000 P/E cycles. To do this comparison, we perform the same experiment as above for planar NAND flash memory chips, and extend the retention error rate trend to the same time scale using a linear fit. We observe that the retention error rate changes much more slowly for planar NAND than for 3D NAND flash memory. Although the 3D NAND flash chip has a lower RBER than the planar NAND flash chip shortly after programming, the RBER becomes higher on the 3D NAND flash chip after ∼2 hours of retention time. This demonstrates the early retention loss in 3D NAND flash memory, where the RBER quickly increases by an order of magnitude in ∼3 hours, and by another order of magnitude in ∼11 days. In contrast, RBER increases slowly over retention time for planar NAND flash.


Figure 6.4: Retention error rate comparison between 3D NAND and planar NAND flash memory.

Figure 6.5 plots how the optimal read reference voltage changes with retention age. The three subfigures show the optimal voltages for Va, Vb, and Vc. We make three observations from this figure. First, the relation between the optimal read reference voltage for Vb or Vc and the retention age can be modeled as V = A · log(t) + B. Second, the optimal read reference voltages for Vb and Vc decrease significantly as retention time increases, whereas Va remains relatively constant. Third, the optimal read reference voltages of Vb and Vc change rapidly when the retention age is low, but they change slowly when the retention age is high.


Figure 6.5: Optimal read reference voltages, for varying retention ages.
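The logarithmic relation above can be fit with ordinary least squares. The sketch below does so for Vb using made-up (retention age, Vopt) pairs that only loosely follow the trend in Figure 6.5; the values are not the measured data.

```python
import numpy as np

# Hypothetical measurements of the optimal Vb read reference voltage (in
# normalized voltage steps) at several retention ages (in seconds).
retention_age = np.array([4.2e2, 1.7e3, 1.1e4, 8.6e4, 6.0e5, 2.1e6])
v_opt_b       = np.array([151.0, 149.0, 146.0, 143.0, 140.0, 138.0])

# Least-squares fit of V = A * log(t) + B.
A, B = np.polyfit(np.log(retention_age), v_opt_b, deg=1)
print(f"A={A:.2f}, B={B:.2f}")

# Predict the optimal Vb for a 24-day retention age (~2.07e6 seconds).
t = 24 * 24 * 3600
print(f"predicted Vopt(Vb) at 24 days: {A * np.log(t) + B:.1f}")
```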

Insights. We compare the errors caused by retention loss in 3D NAND to those in planar NAND, using our results in Figure 6.4 as well as the results reported in prior work [22, 27, 219]. We find two major differences in 3D NAND; more results and insights are in Section 6.3.2.

First, the 3D NAND error rate increases much faster when the retention age is low than when the retention age is high. As we have shown in Figure 6.4, both the logarithm of the RBER and the shift in the threshold voltage approximately fit a linear function of the logarithm of the retention age. This means that the retention loss is steep when the retention age is low, but the retention loss flattens out when the retention age is high. This is a result of the early retention loss phenomenon in 3D NAND flash memory. There are two possible reasons for early retention loss. First, the tunnel oxide layer is thinner in 3D NAND [282, 359]. This is because a 3D charge trap cell uses an insulator to store charge, which makes the cell immune to the short circuiting caused by stress-induced leakage current (SILC). Thus, the tunnel oxide layer in a 3D charge trap cell is designed to be thinner to improve program speed. Second, the charge can now migrate out of the charge trap in three dimensions [55]. In planar NAND flash memory, charge leakage due to retention occurs across the tunnel oxide. In 3D NAND flash memory, the charge can leak across both the tunnel oxide and the insulator being used for the charge trap, which we discuss further in Section 6.2.3. However, as we show in Figure 6.4 for planar NAND flash memory, we do not observe a large difference in retention loss between low and high retention ages [27, 219].

Second, the optimal read reference voltage for Vb in 3D NAND flash memory shifts with retention age. However, in planar NAND flash memory, the optimal voltage of Vb does not change by much [27]. This makes adjusting the optimal read reference voltages even more important for 3D NAND flash memory than for planar NAND flash memory.

6.2.3 Retention Interference

Retention interference is the phenomenon that the speed of retention loss for a cell depends on the threshold voltage of a neighboring cell. Retention interference is unique to 3D NAND flash

memory, as cells along the same bitline in 3D NAND flash memory share the same charge trap layer. If two neighboring cells are at different threshold voltages, charge can leak away from the cell with a higher threshold voltage to the cell with a lower threshold voltage over time [55].

We use the same data used for retention loss in Section 6.2.2 to observe the effect of retention interference. To eliminate any noise due to program interference, we use neighboring cells on the previous wordline to establish the interference correlation, as these cells are not greatly affected by program interference. We also ignore victim cells that are in the ER state, as they are significantly affected by program interference on both sides along the bitline. By eliminating program interference issues, the cells should experience a similar threshold voltage shift except for the effects of retention interference. To find the retention interference, we first group all the cells based on their current state, and the state stored in the previous wordline. Then, we compare the amount by which the threshold voltages shift over a 24-day retention age, for each group, to observe how the cells are impacted by neighboring cells.

Observations. Figure 6.6 shows the average threshold voltage shift over a 24-day retention age, broken down by the state of the victim cell (V) and the state of the neighboring cell (N). Each bar represents a different (V, N) pair. Different shades represent the different states of the neighboring cell, as labeled in the legend. Every 4 bars are grouped by the state of the victim cell, as labeled on the right side. From Figure 6.6, we observe that the threshold voltage shift over retention age is lower when the neighboring cell is in a higher-voltage state (e.g., the P3 state).


Figure 6.6: Retention interference at 10K P/E cycles.
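The grouping step described above can be sketched as follows, using synthetic per-cell records in place of the measured data; the shift values are placeholders constructed only to mimic the trend in Figure 6.6.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in: for each victim cell, its programmed state, the state of
# the neighboring cell on the previous wordline, and the threshold voltage
# shift measured over a 24-day retention age (in voltage steps).
victim_states = np.array(["P1", "P2", "P3"])
victim   = rng.choice(victim_states, size=100_000)
neighbor = rng.choice(np.array(["ER", "P1", "P2", "P3"]), size=100_000)

level = {"ER": 0, "P1": 1, "P2": 2, "P3": 3}
neigh_level = np.vectorize(level.get)(neighbor)
# Placeholder measurements: less charge loss when the neighbor holds more
# charge, mimicking the retention interference trend in Figure 6.6.
shift = rng.normal(-14.0 + 1.5 * neigh_level, 3.0)

# Average threshold voltage shift for each (victim state, neighbor state) pair.
for v in victim_states:
    for n in ("ER", "P1", "P2", "P3"):
        mask = (victim == v) & (neighbor == n)
        print(f"V={v} N={n}: mean shift = {shift[mask].mean():+.2f}")
```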

Insights. We are the first to discover and characterize retention interference in 3D NAND flash memory. Our observation from Figure 6.6 shows that the amount of retention loss for a flash cell is correlated with its neighboring cell’s state. We expect retention interference to become stronger as we shrink the manufacturing process technology in future 3D NAND flash memory devices. This is because the distance between neighboring cells will decrease, and fewer electrons will be stored within each flash cell, increasing the susceptibility of the cells to interference from neighboring cells.

6.2.4 Summary

In addition to the three new error sources we find in 3D NAND flash memory, we also characterize the behavior of other known error sources in 3D NAND flash memory and compare them to their behavior in planar NAND flash memory. We present a high-level summary of our findings for these errors here, and provide detailed results and analyses on these error sources in Section 6.3:

• Unlike in planar NAND, we do not observe any programming errors in 3D NAND [34, 195, 260] (Section 6.1.1).
• The P/E cycling effect in 3D NAND flash memory follows a linear trend, which is similar to that in planar NAND flash memory using an older manufacturing process technology (e.g., 20 nm to 24 nm) [23]. However, in sub-20 nm planar NAND flash memory, the P/E cycling effect exhibits a power-law trend [195, 260] (Section 6.3.1).
• 3D NAND flash memory experiences 40% less program interference than 20 nm to 24 nm planar NAND flash memory [24, 26] (Section 6.3.1).
• 3D NAND flash memory experiences 96.7% weaker read disturb than 20 nm to 24 nm planar NAND flash memory [35]. The impact of read disturb is low enough in 3D NAND flash memory that it does not require significant error mitigation (Section 6.3.3).

Note that these differences are mainly due to the larger manufacturing process technology currently used in 3D NAND flash memory, and thus are not the focus of this chapter. In comparison, the new error characteristics that we focus on (layer-to-layer process variation, early retention loss, retention interference) are caused by fundamental changes introduced in 3D NAND flash memory.

We summarize the key differences between 3D NAND and planar NAND flash memory, in terms of error characteristics and the expected trends for future 3D NAND devices, in Table 6.1. The first column of this table lists the attributes we study. The second column shows the key differences in the observations that we make in 3D NAND flash memory. The third column shows the fundamental cause of each difference. The last column shows the expected trend of each difference in future 3D NAND flash devices. We provide the necessary characterizations and models that help us quantitatively understand these differences in Section 6.3.

Process Variation (Sections 6.2.1, 6.3.4, 6.3.5)
  Observation in 3D NAND: Layer-to-layer process variation is significant.
  Cause of difference: Vertical stacking of flash cells.
  Future trend: Process variation will increase as we stack more cells vertically.

Retention Loss (Sections 6.2.2, 6.2.3, 6.3.2)
  Observation in 3D NAND: Early retention loss.
  Cause of difference: Charge trap cell.
  Future trend: Early retention loss will continue if the charge trap cell is used.

  Observation in 3D NAND: Retention interference.
  Cause of difference: Vertical stacking of flash cells.
  Future trend: Retention interference will increase when smaller process technology is used.

P/E Cycling (Section 6.3.1)
  Observation in 3D NAND: Distribution parameters change over P/E cycles following a linear trend instead of a power-law trend.
  Cause of difference: Larger manufacturing process technology.
  Future trend: The P/E cycling trend will go back to a power-law trend when smaller process technology is used.

Program Interference (Section 6.3.1)
  Observation in 3D NAND: Wordline-to-wordline interference along the z-axis.
  Cause of difference: Vertical stacking of flash cells.
  Future trend: Will stay true in 3D NAND.

  Observation in 3D NAND: 40% lower program interference correlation.
  Cause of difference: Larger manufacturing process technology.
  Future trend: Program interference correlation will increase when smaller process technology is used.

Vth Distribution (Section 6.1.1)
  Observation in 3D NAND: ER and P1 states have no programming errors.
  Cause of difference: Use of one-shot programming instead of two-step programming.
  Future trend: Programming errors may come back if two-step programming is used.

Read Disturb (Section 6.3.3)
  Observation in 3D NAND: 96.7% smaller read disturb effect.
  Cause of difference: Larger manufacturing process technology.
  Future trend: Read disturb effect will increase when smaller process technology is used.

Table 6.1: Summary of flash error characteristics of 3D NAND and planar NAND flash memory.

6.3 Comprehensive Characterization Results

6.3.1 Write-Induced Errors

We now analyze how each type of write-induced error affects the RBER and the threshold voltage distribution of 3D NAND flash memory.

Program/Erase Variation Errors

A program/erase variation error, or P/E cycling error, occurs when the SSD cannot properly erase or program a cell because some electrons are trapped within the cell [23, 219] (see Section 3.1). To study the impact of program/erase variation errors, we randomly select a flash block within each 3D NAND chip, and wear out the block by programming random data to each page in the block until the block reaches 16K P/E cycles.6 Using the methodology described in Section 6.1.1, we obtain the overall RBER and the threshold voltage of each cell at various P/E cycle counts.7

Observations. Figure 6.7 shows how the RBER increases as the P/E cycle count increases. The top graph breaks down the errors into which page (i.e., LSB or MSB) they occur in. The bottom graph breaks down the errors based on how the error changed the cell state due to a shift in the cell threshold voltage. If the error resulted in either the LSB or MSB (but not both) being incorrect, we refer to that as a single-bit error (e.g., an error that shifts a cell originally programmed in the ER state to the P1 state, or vice versa; labeled ER↔P1 in the graph). If both the LSB and MSB are incorrect as a result of the shift, we refer to that as a multi-bit error. We make four observations from Figure 6.7. First, both LSB and MSB errors increase as the P/E cycle count increases, following an exponential trend. Second, ER↔P1 errors increase at a much faster rate as the P/E cycle count increases, with respect to the other types of cell state changes, and ER↔P1 errors become the dominant MSB error type when the P/E cycle count reaches 6K P/E cycles. Third, multi-bit errors are rare, and they occur as early as 1K P/E cycles. Fourth, MSB pages have a 2.1× higher error rate than LSB pages.

6 For all experiments throughout this chapter, we consistently assume a 0.5-second dwell time, which is the length of time between consecutive program/erase operations.
7 Due to limitations with our experiment platform, each data point at a particular P/E cycle count has an average retention age of 50 minutes.


Figure 6.7: Program/erase variation errors vs. P/E cycles.

Figures 6.8 and 6.9 show how the Gaussian model mean and standard deviation, respectively, change with the P/E cycle count for the threshold voltage distribution. Each figure is broken down into four subfigures, with each subfigure showing the model parameter change for a different state. We make four observations from Figures 6.8 and 6.9. First, the mean and standard deviation of all states follow a linear trend.8 Second, the threshold voltage distributions for the ER state and the P1 state shift to higher voltages, while the distributions for the P2 state and the P3 state shift to lower voltages. Third, the threshold voltage distributions for all four states become wider as the P/E cycle count increases (i.e., the standard deviation increases). Fourth, the magnitude of the shift and widening is much larger for the ER state than it is for the other three states (i.e., P1, P2, P3). The shift and the widening of the distributions explain what we observe for the RBER in Figure 6.7. The RBER between two neighboring states is equivalent to the overlapping area of the two states' distributions. Since the overlapping tail of each distribution is an exponential function [195, 260], and the distributions shift towards each other and widen at a linear rate, the overlapping area should increase exponentially, as we observe in Figure 6.7.

8 For the ER state, a linear fit has a 5.9% higher error rate than a power-law fit. However, we choose the linear fit due to its simplicity.
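The relationship between linearly shifting and widening distributions and an exponentially growing RBER can be illustrated with a small numerical sketch. The linear parameter models and the fixed read reference voltage below are hypothetical placeholders, not the fitted values from Figures 6.8 and 6.9.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical linear models for the ER and P1 distributions as functions of
# the P/E cycle count (means and standard deviations in voltage steps).
def er_params(pec):
    return -2.0 + 1.6e-3 * pec, 17.5 + 2.0e-4 * pec   # mean, stdev

def p1_params(pec):
    return 112.0 - 2.0e-4 * pec, 9.5 + 2.0e-4 * pec

v_a = 75.0   # fixed read reference voltage between the ER and P1 states

for pec in (0, 4000, 8000, 12000, 16000):
    mu_er, sd_er = er_params(pec)
    mu_p1, sd_p1 = p1_params(pec)
    # Misread probability = ER tail above Va + P1 tail below Va, i.e., the
    # overlapping area of the two distributions split at Va (equal state priors).
    p_err = 0.5 * (norm.sf(v_a, mu_er, sd_er) + norm.cdf(v_a, mu_p1, sd_p1))
    print(f"P/E={pec:6d}  ER<->P1 error probability = {p_err:.2e}")
```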


Figure 6.8: Mean of distribution for program/erase variation error model, as the P/E cycle count increases.


Figure 6.9: Standard deviation of distribution for program/erase variation error model, as the P/E cycle count increases.

Figure 6.10 shows how the optimal read reference voltages change as the P/E cycle count increases. This figure contains three subfigures, which show the optimal voltages for Va, Vb, and Vc (see Figure 2.7). We make two observations from this figure. First, the optimal voltage for Va increases rapidly as the P/E cycle count increases: after 16K P/E cycles, the voltage goes up by more than 20 voltage steps. Second, the optimal voltages for Vb and Vc remain almost constant as the P/E cycle count increases: neither voltage changes by more than 4 voltage steps after 16K P/E cycles.


Figure 6.10: Optimal read reference voltages vs. P/E cycles.

Insights. To compare the error characteristics of 3D NAND to that of planar NAND flash memory, we refer to equivalent observations on planar NAND reported by prior works [23, 195, 260], and compare them to our findings for 3D NAND. We find two key differences. First, for 3D NAND, the threshold voltage distributions for the P2 state and the P3 state shift to lower voltages as the P/E cycle count increases. In contrast, for planar NAND flash memory, the distributions of both states shift to higher voltages [23, 195, 260]. This could be due to the retention noise that occurs during characterization, which lowers the threshold voltage of cells in higher voltage states (i.e., P2 and P3) [55]. Second, for 3D NAND, the change in the mean of each state distribution exhibits a linear trend. However, in sub-20 nm planar NAND flash memory, the change in the mean exhibits a power-law trend [195, 260]. In sub-20 nm planar NAND flash memory, the mean of each state distribution increases more rapidly at lower P/E cycle counts than in higher P/E cycle counts, resulting in power-law behavior. However, we note that planar NAND flash memory using an older manufacturing process technology (e.g., 20 nm to 24 nm) exhibits a linear trend for the distribution mean [23], just as we see for 3D NAND. Thus, we believe that when the manufacturing process technology scales below a certain size, the change in the distribution mean transitions from linear behavior to power-law behavior. As a result, when future 3D NAND scales down to a sub-20 nm manufacturing process technology, we expect that it too will exhibit power-law behavior.

Program Interference

When a cell (which we call the aggressor cell) is being programmed, cell-to-cell program interference can cause the threshold voltage of nearby flash cells (which we call victim cells) to increase unintentionally [24, 26] (see Section 3.1). In 3D NAND, there are two types of program interference that can occur. The first, wordline-to-wordline program interference, affects victim cells along the z-axis from the cell being programmed (see Figure 3.27). These victim cells are physically next to the cell being programmed, and belong to the same bitline (and thus the same flash block). The second, bitline-to-bitline program interference, affects victim cells along the x-axis or y-axis from the cell being programmed. Bitline-to-bitline program interference can affect victim cells in the same wordline (i.e., cells on the y-axis), or it can affect victim cells that belong to other flash blocks (i.e., cells on the x-axis).

To quantitatively analyze the effect of program interference, we use the same experimental data that we have for program/erase variation errors (see Section 6.3.1). The amount of program interference on a victim cell correlates with the threshold voltage change of the aggressor cell [24]. This interference correlation makes the threshold voltage of a victim cell dependent on the value of the aggressor cell. The strength of this correlation can be quantified as ∆Vvictim/∆Vaggressor, which is a property of the NAND device and is largely dependent on the distance between the cells [174]. We estimate ∆Vaggressor by calculating the threshold voltage difference between the aggressor cell's final state and the ER state. We estimate ∆Vvictim by calculating the difference between the victim cell's threshold voltage with and without program interference.

Observations. Figure 6.11 shows the interference correlation for a victim cell, as a result of programming on aggressor cells of varying distances and directions. We make two observations from this figure. First, for a victim cell (i.e., the cell in BL M, WL N in Figure 6.11), program interference is dominated by wordline-to-wordline interference from the next wordline (i.e., BL M, WL N+1). Second, all of the other types of interference have a much smaller interference correlation.
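Estimating this ratio amounts to a linear regression of ∆Vvictim against ∆Vaggressor. The sketch below does so on synthetic data whose 2.7% slope only mimics the next-wordline correlation reported in Figure 6.11; the numbers are placeholders, not measured data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in: threshold voltage change of the aggressor cell during
# programming, and the resulting change on the victim cell (voltage steps).
dv_aggressor = rng.uniform(0.0, 250.0, size=20_000)
dv_victim = 0.027 * dv_aggressor + rng.normal(0.0, 1.0, size=20_000)

# The interference correlation is the slope of dV_victim vs. dV_aggressor.
slope, intercept = np.polyfit(dv_aggressor, dv_victim, deg=1)
print(f"estimated interference correlation: {100.0 * slope:.2f}%")
```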

Figure 6.11: Interference correlation for a victim cell, as a result of programming on aggressor cells of varying distance.

Figure 6.12 shows how much the threshold voltage of a victim cell shifts when a neighboring aggressor cell is programmed to the P3 state, which generates the largest possible program interference. We separate the threshold voltage shifts by both the interference type and the state of the victim cell. We ignore any case where the shift is less than 1 voltage step. This figure also shows how the threshold voltage shift due to program interference changes with the P/E cycle count. We make three observations from Figure 6.12. First, the effect of program interference decreases as the P/E cycle count increases. Second, the program interference induced by an aggressor cell in the next wordline decreases when the victim cell is in a higher-voltage state. Third, the program interference induced by an aggressor cell in the previous wordline (i.e., WL N-1 in Figure 6.11) affects the threshold voltage distribution of the ER state for a victim cell, but it has little effect on the distributions of the other three states (i.e., P1, P2, P3).


Figure 6.12: Interference vs. P/E cycle.

Insights. We compare the program interference in 3D NAND to the program interference observed in planar NAND, as reported in prior work [24, 26]. We find one major difference between them. The interference correlation from a directly-adjacent cell in 3D NAND is 40% lower than the interference correlation in planar NAND flash memory, which is around 4.5% in 20 nm to 24 nm planar NAND flash memory [24]. A similar conclusion is drawn by prior work [257], which shows that 3D NAND has 84% lower program interference than 15 nm to 19 nm planar NAND flash memory. However, this reduction in interference correlation is a result of the larger manufacturing process technology used in the current generation of 3D NAND. We are unable to find literature that reports program interference correlation in 30 nm to 50 nm planar NAND, which could provide a direct comparison for how we expect interference to change in 3D NAND. However, prior work has shown that the strength of interference correlation between neighboring cells is correlated with the distance between the cells [174]. Thus, as 3D NAND scales to a smaller manufacturing process technology, the cells move closer to each other, and the program interference effect increases, as it did in planar NAND flash memory.

Note that we are the first to compare how the threshold voltage shift caused by program interference changes with the P/E cycle count. As we discuss in our first observation for Figure 6.12, the program interference effect surprisingly decreases as the P/E cycle count increases. This observation is a result of the ER state distribution shift due to wearout (see Section 6.3.1). As the threshold voltage distribution of the ER state shifts higher with increasing P/E cycle count, the difference in voltage between the ER state and the P3 state decreases. This reduces ∆Vaggressor and thus also ∆Vvictim, which in turn reduces the interference correlation.

6.3.2 Early Retention Loss

In this section, we present the results and analysis of retention loss in 3D NAND, in addition to the key findings from Section 6.2.2. We use the same methodology as described in Section 6.2.2.

Observations. Figure 6.13 shows how the RBER increases with retention age for a block at 10K P/E cycles. The top graph breaks down the errors into MSB and LSB page errors; the bottom graph breaks down the errors according to the change in cell state as a result of the errors. We make four observations from Figure 6.13. First, the logarithm of RBER is linearly correlated

with the logarithm of retention age. Second, the MSB error rate increases faster than the LSB error rate as the retention age increases. Third, retention errors are dominated by P2↔P3 and P1↔P2 errors. Fourth, the error rate between any two neighboring states increases with retention age. We see that ER↔P1 and P2↔P3 errors increase faster than P1↔P2 errors.


Figure 6.13: RBER variation across retention age, broken down by (1) MSB or LSB page, and by (2) the state transition of each flash cell.

Figures 6.14 and 6.15 show how the mean and the standard deviation, respectively, of the Gaussian model for retention errors change with retention age. Each subfigure shows the parameter for a different state, as labeled at the top. We make five observations from these two figures. First, the correlation between any distribution parameter (P) and the retention age (t) can be modeled as P = A · log(t) + B. Second, the threshold voltage distribution shifts much faster when the retention age is low. Third, the distributions of the P1, P2, and P3 states shift lower with retention age, but the distribution of the ER state shifts higher. Fourth, the distributions of the ER and P3 states shift faster than the distributions of the P1 and P2 states as the retention age increases. Fifth, retention has little effect on the distribution width.

Insights. We compare the errors due to retention loss in 3D NAND to those in planar NAND flash memory, as reported in prior work [22, 27, 219]. We find another major difference in 3D NAND in terms of the threshold voltage distribution, in addition to those discussed in Section 6.2.2. We find that retention loss in 3D NAND shifts the threshold voltage distributions of the P1, P2, and P3 programmed states lower, and has little effect on the width of the distribution of each state. In contrast, the retention loss in planar NAND flash memory does not shift the P1 and P2 state distributions by much, but it increases the width of each state's distribution significantly [27]. This indicates that a mechanism that adjusts the read reference voltage to the threshold voltage shift caused by retention can be more effective on 3D NAND than on planar NAND.


Figure 6.14: Mean of distribution for retention loss error model, as retention age increases.


Figure 6.15: Standard deviation of distribution for retention loss error model, as retention age increases.

6.3.3 Read-Induced Errors

We now analyze how each type of read-induced error affects the RBER and the threshold voltage distribution of 3D NAND flash memory.

Read Variation Errors

Read variation errors are random raw flash errors that occur due to random fluctuation of the sense amplifier (see Section 3.1). Read variation errors are not well studied by prior work. However, since they add uncertainty to the outcome of each read operation, read variation errors are important because they can affect threshold voltage characterization results if not handled properly. To quantify read variation errors, we use the data from Section 6.2.2. Since we sweep the read reference voltage to obtain the Vth of each cell, we can simulate reads by comparing Vref to Vth directly. Since the Vth characterization involves multiple reads, the simulated reads are inherently resistant to read variation errors. Thus, we can identify read variation errors by comparing the value actually read out from the flash chip with the value from the simulated read.

Observations. Figure 6.16 shows the correlation between the read variation error rate and the total RBER for LSB and MSB page errors. We make two observations from this figure. First, the read variation error rate for both LSB and MSB pages is linearly correlated with the overall RBER. Second, the MSB page has a higher read variation error rate than the LSB page.


Figure 6.16: Read variation error, varying over the RBER.
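The comparison between actual and simulated reads can be sketched as follows, using synthetic threshold voltages and a made-up flip-probability model for the sense amplifier; none of the numbers correspond to the measured chips.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in: characterized threshold voltage of each cell, and the
# hard read decision actually returned by the chip at read reference V_ref.
# The 'actual' read flips with a probability that shrinks as |V_ref - V_th|
# grows, mimicking sense-amplifier fluctuation.
n_cells = 200_000
v_th = rng.normal(110.0, 11.0, size=n_cells)
v_ref = 75.0

simulated = v_th > v_ref                     # noise-free read from characterization
flip_prob = 0.5 * np.exp(-np.abs(v_th - v_ref) / 2.0)
actual = simulated ^ (rng.random(n_cells) < flip_prob)

# A read variation error is a disagreement between the actual and simulated read.
rate = np.mean(actual != simulated)
print(f"read variation error rate: {rate:.2e}")
```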

Figure 6.17 shows the correlation between the read variation error rate and the read offset, which is Vref − Vth. We observe that as the read offset increases, the read variation error rate decreases exponentially. This corroborates the first observation from Figure 6.16 because, when the RBER is high, the threshold voltage distributions of neighboring states overlap with each other by a greater amount. This causes a larger number of cells to be close to the read reference voltage value, increasing the probability that a read variation error occurs.


Figure 6.17: Read variation error vs. read offset.

Insights. We are the first to discover and quantify the extent of read variation errors, and to show the correlation of these errors with the RBER and with the read reference voltage. For cells with a threshold voltage close to the read reference voltage, it is more difficult for the sense amplifier to detect whether the cell is on or off (due to the small delta), which causes the random fluctuations. We find that read variation errors are one of the dominant sources of errors in 3D NAND flash memory.

Read Disturb Errors

Read disturb errors accumulate in a cell when any other cell on the same bitline is read [35, 252] (see Section 3.1). Read disturb errors are caused by the high pass-through voltage applied on the unread cells. To characterize read disturb errors, we first randomly select 11 flash blocks and use random data to wear out each block to between 0 and 10K P/E cycles. Then, we program random data to each flash block. To minimize the impact of other errors, especially retention errors due to early retention loss, we wait until the data has a 2-day retention age before inducing read disturb. This ensures that, according to our results in Section 6.2.2, retention loss can shift the distribution of each state by at most 1 voltage step during the characterization process. To induce read disturb in the flash block, we repeatedly read from a wordline within the block up to 900K times (i.e., up to 900K read disturbs). During this process, to characterize the read disturb effect, we obtain the RBER and the threshold voltage distribution at ten different read disturb counts from 0 to 900K. Observations. Figure 6.18 plots how the RBER increases with the read disturb count at 10K P/E cycles. The top figure breaks down raw bit errors into LSB and MSB page errors; the bottom figure breaks down the errors according to the Vth state transition. We make three observations from Figure 6.18. First, LSB and MSB errors can be modeled as a linear function of the read disturb count. Second, LSB errors increase faster than MSB errors. Third, the increase in LSB errors is caused by the significant increase in ER↔P1 errors, while P2↔P3 errors slightly decrease over read disturb.


Figure 6.18: RBER vs. read disturb counts.
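As a rough illustration of the linear model mentioned above, the sketch below fits RBER = slope · (read disturb count) + intercept to a few synthetic data points; the fitted slope is the kind of per-read sensitivity that Figure 6.22 later tracks across P/E cycle counts. The numbers are placeholders, not measured values.

import numpy as np

# Hypothetical (read disturb count, LSB-page RBER) samples; only the linear
# relationship RBER = slope * reads + intercept is taken from the text.
reads = np.array([0, 1e5, 2e5, 4e5, 6e5, 9e5])
lsb_rber = np.array([1.0e-4, 1.3e-4, 1.6e-4, 2.2e-4, 2.8e-4, 3.7e-4])

slope, intercept = np.polyfit(reads, lsb_rber, deg=1)
print(f"LSB RBER ~= {slope:.3e} * read_disturbs + {intercept:.3e}")

# The fitted slope is the per-read sensitivity to read disturb; repeating this
# fit at several P/E cycle counts gives a slope-vs-wearout trend.
print("extra RBER after 500K reads:", slope * 5e5)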

Figures 6.19 and 6.20 show how the distribution parameters change over read disturb. Each subfigure shows the parameter for a different state, labeled on the top. We make four observations from these two figures. First, the change in distribution parameters can be modeled as a linear function of the read disturb count. Second, the ER state distribution shifts significantly higher over read disturb, whereas the programmed state distributions do not shift much over read disturb. The increase in the distribution mean is lower for a higher Vth state; in fact, the P3 state distribution even shifts slightly lower. Third, the distribution width of each state (i.e., the standard deviation) decreases slightly with the read disturb count. Fourth, the distribution width of a higher Vth state decreases faster than that of a lower Vth state.


Figure 6.19: Distribution mean vs. read disturb counts.


Figure 6.20: Distribution standard deviation vs. read disturb counts.

Figure 6.21 shows how the optimal read reference voltages change due to read disturb. Three subfigures show the optimal voltages for Va, Vb, and Vc. We make two observations from this figure. First, the optimal voltage for Va increases linearly over read disturb. Second, the optimal voltages of Vb and Vc change by less than 3 voltage steps over 900K read disturbs.


Figure 6.21: Optimal read reference voltages vs. read disturb counts.

Figure 6.22 plots the slopes of RBER vs. read disturb count (e.g., the slopes of the fitted curves in Figure 6.18) at different P/E cycle counts. We show the slope separately for the LSB and MSB RBER. The steepness of the slope shows the sensitivity of the RBER to the read disturb effect. We make two observations from this figure. First, the rate of LSB errors increases much faster than the rate of MSB errors as the number of read disturbs increases. Second, for both LSB and MSB errors, the slope increases super-linearly as the P/E cycle count increases.


Figure 6.22: Read disturb error increase rate vs. P/E cycle.

Figure 6.23 plots the slopes of distribution mean vs. read disturb count (e.g., the slopes of the fitted curves in Figure 6.19) at different P/E cycle counts. Each subfigure shows the slope for a different state, labeled on the top, reflecting how fast the distribution shifts due to read disturb. We make two observations from this figure. First, the slope for each state increases linearly over P/E cycles. Second, the slope for the ER state increases significantly over P/E cycles, whereas the slopes for the programmed states increase only slightly.


Figure 6.23: Distribution mean increase rate vs. P/E cycle.

Insights. We compare the read disturb effect in 3D NAND to that in planar NAND reported in prior work [35]. We observe that, although the RBER increases linearly with the read disturb count in both 3D NAND and planar NAND, the slope of the increase (i.e., the sensitivity to read disturb) at 10K P/E cycles is 96.7% lower in 3D NAND than in planar NAND [35]. We believe that this difference in sensitivity to read disturb is due to the use of a larger process technology (30 nm to 40 nm) in 3D NAND. In contrast, the results from prior work are for 20 nm to 24 nm planar NAND. Thus, we expect the sensitivity to read disturb in 3D NAND to increase in the future as we shrink the process technology.

6.3.4 Layer-To-Layer Process Variation

In this section, we present the results and analysis of process variation in 3D NAND, in addition to the key findings from Section 6.2.1. We use the same methodology as described in Section 6.2.1. Figures 6.24 and 6.25 show the change in the threshold voltage distribution mean and standard deviation of each state, respectively, for a model of layer-to-layer process variation, for a flash block with 10K P/E cycles. Each subfigure shows the parameter for a different state, as labeled on the top. We make three observations from these two figures. First, the ER state distribution is shifted by as much as 25 voltage steps across layers, while the distributions of the other three states do not shift by much. Second, the mean of the ER state keeps increasing in the top half of the layers, but remains constant in the bottom half. Third, the distribution width of the P1 state increases in the top half of the layers, and decreases in the bottom half. However, the distribution widths of the P2 and P3 states vary by a smaller amount.


Figure 6.24: Layer-to-layer variation of distribution mean.


Figure 6.25: Layer-to-layer variation of distribution width.

6.3.5 Bitline-to-Bitline Process Variation

We perform a similar analysis on the variation of RBER and threshold voltage distribution along the y-axis (i.e., across groups of bitlines) for a flash block with 10K P/E cycles. Figure 6.26 shows the RBER variation along the y-axis for a flash block with 10K P/E cycles. To reduce the impact of random process variation noise, we normalize the bitlines from 0 to 88 by averaging the RBER of every ∼3000 neighboring bitlines. The top graph breaks down the errors into MSB and LSB page errors; the bottom graph breaks down the errors according to the original and current state of each cell. Figure 6.27 shows the change in the threshold voltage distribution mean of each state along the y-axis. Each subfigure shows the parameter for a different state, as labeled on the top. We make two observations from this analysis. First, the variation along the y-axis is much smaller compared to the variation across the z-axis. Second, for every ∼64K consecutive bitlines within a flash block, we observe similar process variation, indicating some form of repetition.


Figure 6.26: RBER variation across bitline.


Figure 6.27: Distribution mean variation across bitlines.

6.4 3D NAND Error Models

In previous sections, we have established a basic understanding of the similarities and differences between 3D NAND and planar NAND flash memory in terms of reliability. In this section, we quantify these differences by developing analytical models of the process variation (Section 6.4.1) and retention loss (Section 6.4.2) in 3D NAND flash memory. The insights from these models motivate us to develop new error mitigation mechanisms for 3D NAND flash memory. The retention model and the model parameters are also useful for comparing the reliability of newer or older generations of NAND flash memory with our tested 3D NAND flash chips.

6.4.1 Process Variation Model

We first construct a model of the optimal read reference voltage (Vopt) variation across different layers, based on the results in Section 6.2.1. Since there are only a limited number of layers, we use a non-parametric model that estimates the offset between the Vopt on each layer and the overall Vopt for the entire flash block, using the results across 10 different P/E cycles. The result of the model is shown in Figure 6.3. Since the overall per-block Vopt does not take layer-to-layer process variation into account, we call it the variation-agnostic Vopt. The per-layer Vopt can be calculated by adding the modeled offset to the variation-agnostic Vopt, which accounts for the process variation; thus, we call it the variation-aware Vopt. Many prior works already provide models and mechanisms to obtain the variation-agnostic Vopt at low cost [27, 195, 252]. Since the layer-to-layer variation in 3D NAND causes variation in RBER within a flash block, the overall RBER is no longer sufficient to represent the reliability of all pages within that block. Instead, we model the variation in per-page RBER within a flash block as a gamma distribution, i.e., gamma(x; k, θ) = x^(k−1) · e^(−x/θ) / (Γ(k) · θ^k). Figure 6.28 shows the probability density for per-page RBER within a block with 10K P/E cycles. The bars show the measured per-page RBERs categorized into 50 bins, and the curves are the fitted gamma distributions, whose parameters are shown in the legend. The blue and red colors represent the RBER distributions when reading with the variation-agnostic Vopt and with the variation-aware Vopt, respectively. We make two observations from this figure. First, the average RBER is reduced from 1.6 × 10⁻⁴ to 1.4 × 10⁻⁴ by applying the variation-aware Vopt. Second, a few flash pages have a much higher RBER than the average RBER (e.g., > 4 × 10⁻⁴). These pages are LSB pages that reside in the middle layers, as is shown in Figure 6.2.

[Figure 6.28 legend: variation-agnostic Vopt, fit = gamma(2.19, 7.45e-05); variation-aware Vopt, fit = gamma(1.86, 7.62e-05).]

Figure 6.28: RBER distribution within a flash block.
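The sketch below shows one simple way to fit the gamma distribution discussed above to a set of per-page RBERs, using the method of moments. This is only a stand-in for whatever fitting procedure produced the curves in Figure 6.28; the synthetic data and parameter values are illustrative.

import numpy as np

def fit_gamma_moments(rber_samples):
    """Method-of-moments fit of a gamma(k, theta) distribution to per-page RBERs."""
    mean = np.mean(rber_samples)
    var = np.var(rber_samples)
    k = mean * mean / var          # shape parameter
    theta = var / mean             # scale parameter
    return k, theta

# Toy usage: synthetic per-page RBERs for one block (illustrative values only).
rng = np.random.default_rng(1)
per_page_rber = rng.gamma(2.0, 8e-5, size=512)   # shape=2.0, scale=8e-5
k, theta = fit_gamma_moments(per_page_rber)
print(f"fitted gamma shape k = {k:.2f}, scale theta = {theta:.2e}")
print(f"mean per-page RBER   = {k * theta:.2e}")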

6.4.2 Retention Loss Model

To quantify the retention loss in 3D NAND, we construct a model of retention loss in 3D NAND as a function of retention time (t) and P/E cycle lifetime (PEC). We use multivariate linear regression to fit the model and use the root mean square error as the loss function. Following the observations in Section 6.2.2, we perform a natural logarithm transformation on the RBER and the retention time. Also following the observations, we break down our model into two steps. The first step models the retention loss at a certain P/E cycle count as a logarithmic function of retention time. The second step models how the P/E cycle count changes the parameters of the retention loss model. Table 6.2 shows all of the parameters we use to model the RBER and the threshold voltage as a function of the P/E cycle count (PEC) and the retention time (t). In this table, the first column shows the modeled variable for each row. The second column shows how we model retention loss, as we have shown in Figures 6.13, 6.14 and 6.15. The third and fourth columns show how each parameter in the retention model changes over P/E cycles. The last column shows the root mean square error of our model.
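A minimal sketch of this two-step fitting procedure is shown below, assuming synthetic measurements: step 1 fits value = A · log(t) + B at each P/E cycle count, and step 2 fits A and B as linear functions of PEC. The numbers fed into the example are placeholders, not the values in Table 6.2.

import numpy as np

def fit_retention_two_step(measurements):
    """Two-step retention model fit.

    measurements maps a P/E cycle count to a list of (retention_time_s, value)
    pairs, where value is e.g. log(RBER) or a distribution mean. Step 1 fits
    value = A*log(t) + B at each P/E cycle count; step 2 fits A and B as linear
    functions of PEC.
    """
    pecs, As, Bs = [], [], []
    for pec, samples in sorted(measurements.items()):
        t = np.array([s[0] for s in samples])
        y = np.array([s[1] for s in samples])
        A, B = np.polyfit(np.log(t), y, deg=1)      # step 1
        pecs.append(pec); As.append(A); Bs.append(B)
    a1, a0 = np.polyfit(pecs, As, deg=1)            # step 2: A = a1*PEC + a0
    b1, b0 = np.polyfit(pecs, Bs, deg=1)            #         B = b1*PEC + b0
    return (a1, a0), (b1, b0)

def predict(coeffs, pec, t):
    (a1, a0), (b1, b0) = coeffs
    return (a1 * pec + a0) * np.log(t) + (b1 * pec + b0)

# Synthetic example: three P/E cycle points, four retention times each.
data = {
    1000:  [(1e2, 110.0), (1e4, 108.0), (1e6, 106.1), (1e7, 105.0)],
    5000:  [(1e2, 112.0), (1e4, 109.2), (1e6, 106.5), (1e7, 105.1)],
    10000: [(1e2, 114.0), (1e4, 110.3), (1e6, 106.8), (1e7, 105.2)],
}
coeffs = fit_retention_two_step(data)
print("predicted value at 8K P/E cycles, 3-day retention:",
      predict(coeffs, 8000, 3 * 24 * 3600))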

[Table 6.2 body: for each modeled variable (MSB RBER, LSB RBER, the ER/P1/P2/P3 distribution means, the ER/P1/P2/P3 distribution standard deviations, and the optimal read reference voltages Va, Vb, and Vc), the table lists the retention model form, the fitted parameters A and B (which vary with the P/E cycle count), and the root mean square error of the fit. The RBER, distribution-parameter, Vb, and Vc models have the form A · log(t) + B, while the optimal Va is modeled as A · PEC + B because it is not affected by retention age.]

Table 6.2: Model parameters for 3D NAND retention loss. t is retention time, PEC is P/E cycle lifetime.

6.5 3D NAND Error Mitigation Techniques

Motivated by our new findings in Section 6.1, we aim to design new techniques that mitigate the three unique error effects in 3D NAND flash memory. We propose four error mitigation mechanisms. To mitigate layer-to-layer process variation, we propose LaVAR and LI-RAID. LaVAR learns the process variation model we developed in Section 6.4.1 online in the SSD controller, and uses this model to predict and apply an optimal read reference voltage that is fine-tuned for each layer (Section 6.5.1). LI-RAID is a new RAID scheme that reduces the RBER variation induced by layer-to-layer process variation in 3D NAND flash memory (Section 6.5.2). To mitigate retention loss in 3D NAND flash memory, we propose ReMAR, a new technique that tracks retention age information within the SSD controller and uses the retention model we developed in Section 6.4.2 to predict and apply the optimal read reference voltage fine-tuned for the retention age of the data (Section 6.5.3). To mitigate retention interference, we propose ReNAC, which is adapted from an existing technique originally designed to reduce program interference in planar NAND, to also account for retention interference in 3D NAND flash memory (Section 6.5.4).

6.5.1 LaVAR: Layer Variation Aware Reading

Motivation: In planar NAND, existing techniques assume that Vopt is the same across all pages within a flash block [27, 252]. However, as our results in Section 6.2.1 show, this assumption no longer holds in 3D NAND because of layer-to-layer process variation. To mitigate this problem in 3D NAND, we devise a new mechanism, LaVAR, that uses the process variation model we developed in Section 6.4.1 to fine-tune the read reference voltage in the flash controller. Mechanism: The key idea of LaVAR is to identify which layer the flash controller is reading and apply the variation-aware Vopt predicted by the model in Section 6.4.1. Since the process variation is similar across blocks and is consistent across P/E cycles, the Vopt model can be learned offline for each chip through an extensive characterization of a single flash block. To do this, the SSD controller randomly picks a flash block and records the difference between the variation-agnostic Vopt and the variation-aware Vopt. The controller then computes and stores the average Vopt difference for each layer in a lookup table. Note that Vc variation does not need to be modeled, since Vc is unaffected by layer-to-layer process variation. After this lookup table is constructed, the SSD controller simply looks up the Vopt difference according to the layer of each read operation and adds the offset to the per-block Vopt predicted by existing techniques [27, 195, 252]. By applying the variation-aware Vopt, LaVAR applies a more accurate Vopt for 3D NAND than existing techniques, and reduces the RBER. Overhead: Assuming that the 3D NAND device has N layers and each Vopt difference is one byte, the memory overhead of caching the lookup table in the SSD controller is 2N bytes. The latency overhead of each read operation is negligible, as LaVAR only requires a table lookup and an addition to obtain the variation-aware Vopt, which take less than 100 ns. Since the lookup table is shared across all blocks, it only needs to be learned once, and the learning can be finished gradually in the background. Thus, the performance overhead is negligible. Evaluation: Figure 6.29 compares the RBER achieved by LaVAR (process variation aware) to that achieved by an existing read reference voltage tuning technique (process variation agnostic) designed for planar NAND [27, 195, 252] that applies a per-block Vopt. We evaluate the average RBER achieved by each mechanism by simulating reads using our characterization data in Section 6.2.1. On average, LaVAR reduces RBER by 43.3%. The benefit comes from tuning the read reference voltage towards the per-page optimal read reference voltage by an offset learned by our model. The percentage of RBER reduction becomes smaller as the P/E cycle count increases because the overall RBER increases exponentially as the flash memory wears out, decreasing the fraction of process variation errors.

Figure 6.29: RBER reduction using process-variation-aware optimal read reference voltage.
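The following minimal sketch illustrates the structure of LaVAR described above: a per-layer offset table for Va and Vb that is learned once per chip and then added to the per-block Vopt on every read. The class and its values are hypothetical; only the lookup-plus-addition structure is taken from the text.

class LaVAR:
    def __init__(self, num_layers):
        # One signed offset per layer for Va and Vb (Vc is unaffected by
        # layer-to-layer variation), learned once per chip from a sampled block.
        self.offset_va = [0] * num_layers
        self.offset_vb = [0] * num_layers

    def learn(self, layer, va_diff, vb_diff):
        # Record the measured difference between the per-layer Vopt and the
        # per-block (variation-agnostic) Vopt for one sampled wordline.
        self.offset_va[layer] = va_diff
        self.offset_vb[layer] = vb_diff

    def read_voltages(self, block_va, block_vb, block_vc, layer):
        # Per-read adjustment: a table lookup plus an addition.
        return (block_va + self.offset_va[layer],
                block_vb + self.offset_vb[layer],
                block_vc)

# Usage: adjust the per-block Vopt (from an existing technique) for layer 17.
lavar = LaVAR(num_layers=40)
lavar.learn(layer=17, va_diff=-3, vb_diff=1)
print(lavar.read_voltages(block_va=75, block_vb=141, block_vc=207, layer=17))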

6.5.2 LI-RAID: Layer-Interleaved RAID

Motivation: As we show in Section 6.4.1, even after applying the variation-aware Vopt, the worst-case per-page RBER is more than 2× higher than the average per-block RBER due to layer-to-layer process variation. In addition, neighboring MSB pages and LSB pages also have very different RBERs and different tolerance to different types of errors, as shown in Sections 6.2.1 and 6.2.2. In enterprise SSDs, RAID [9, 262] is applied in addition to ECC across multiple flash chips to tolerate process variation across chips. However, state-of-the-art RAID schemes do not consider layer-to-layer variation and MSB–LSB variation. These schemes group MSB or LSB pages in the same layer together in a RAID group. As a result, the reliability of the SSD is limited by the RBER of the weakest (i.e., the least reliable) RAID group, which contains the MSB or LSB pages from the least reliable layer across all chips. We aim to develop a new RAID scheme that eliminates these low-reliability RAID groups by equalizing the RBER among different RAID groups. Mechanism: We propose LI-RAID, a new RAID scheme that can tolerate the process variation across layers in addition to the variation across chips. Instead of grouping pages in the same layer together, we select pages from different chips and different layers and group them together, such that the pages from the least reliable layer are interleaved across multiple RAID groups. Thus, the new groups formed by LI-RAID have a more evenly-distributed reliability than the groups formed using traditional RAID schemes. We assume, without loss of generality, that there are m chips in the SSD, and each RAID group contains m pages, one from each chip. We also assume that each block contains n wordlines, and that the layer numbers of the wordlines are in ascending order (e.g., the wordline in layer i has a lower wordline number than its neighboring wordline in layer i + 1). Thus, LI-RAID groups together the MSB page of wordline 0, the LSB page of wordline n/m, the MSB page of wordline 2·n/m, . . . , and the MSB page of wordline (m − 1)·n/m. Figure 6.30 shows an example LI-RAID layout on an SSD with 4 chips and with 4 wordlines within each flash block. Flash pages in the same RAID group are highlighted in the same color. In this way, LI-RAID redistributes the less reliable pages within each chip across different RAID groups, eliminating the worst-case RAID groups that bottleneck SSD reliability.

Page/Wordline    Chip 0    Chip 1    Chip 2    Chip 3
MSB/Wordline 0   Group 0   Blank     Group 4   Group 3
LSB/Wordline 0   Group 1   Blank     Group 5   Group 2
MSB/Wordline 1   Group 2   Group 1   Blank     Group 5
LSB/Wordline 1   Group 3   Group 0   Blank     Group 4
MSB/Wordline 2   Group 4   Group 3   Group 0   Blank
LSB/Wordline 2   Group 5   Group 2   Group 1   Blank
MSB/Wordline 3   Blank     Group 5   Group 2   Group 1
LSB/Wordline 3   Blank     Group 4   Group 3   Group 0

Figure 6.30: LI-RAID layout example for an SSD with 4 chips and with 4 wordlines in each flash block.

Note that for some chips, the LI-RAID layout may potentially violate the program sequence recommended by flash vendors, where wordlines within each flash block must be programmed in order to minimize harmful program interference [24, 26, 33, 255]. For example, in Chip 2 in Figure 6.30, Wordline 0 is programmed after Wordline 2. If we were to write to Wordline 2 after Wordline 1 is programmed, Wordline 2 would experience approximately double the program interference (i.e., from programming both Wordline 3 and Wordline 1). To avoid introducing any additional program interference, we deliberately leave the flash pages in the last RAID group empty, which are shown as "Blank" pages shaded in gray in Figure 6.30. As a result, LI-RAID provides the same reliability guarantee as the recommended program sequence, by guaranteeing that any data stored in a flash page experiences program interference from at most one neighboring wordline. Overhead: The wordlines left blank in LI-RAID incur a small additional storage overhead compared to a conventional RAID scheme. Only one wordline within a flash block is left blank. In modern NAND flash memory, each flash block typically contains more than 250 flash pages. Thus, the additional storage overhead is less than 0.4%. LI-RAID does not incur additional computational overhead because it computes parity in the same way as a conventional RAID scheme, and only organizes the RAID groups differently. Because we do not change the data layout across flash blocks, the flash translation layer (FTL) and the garbage collection (GC) algorithm remain the same as in a conventional RAID scheme.
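The sketch below shows one way to generate a layer-interleaved grouping of the kind described above; for the 4-chip, 4-wordline case it reproduces the example layout in Figure 6.30, but an actual controller could use a different (equally valid) rotation. The function name and representation are hypothetical.

def li_raid_layout(num_chips, num_wordlines):
    """Return layout[chip][(wordline, page)] -> RAID group id, or None for blank.

    page is "MSB" or "LSB". One wordline per chip-block is left blank, and the
    group pairs are rotated by one wordline per chip so that each group collects
    pages from different layers and mixes MSB and LSB pages.
    """
    assert num_chips == num_wordlines  # simplifying assumption for this sketch
    layout = [dict() for _ in range(num_chips)]
    for chip in range(num_chips):
        blank_wl = (num_wordlines - 1 + chip) % num_wordlines
        layout[chip][(blank_wl, "MSB")] = None
        layout[chip][(blank_wl, "LSB")] = None
        for i in range(num_wordlines - 1):
            wl = (blank_wl + 1 + i) % num_wordlines
            even_group, odd_group = 2 * i, 2 * i + 1
            if chip % 2 == 0:
                layout[chip][(wl, "MSB")] = even_group
                layout[chip][(wl, "LSB")] = odd_group
            else:
                layout[chip][(wl, "MSB")] = odd_group
                layout[chip][(wl, "LSB")] = even_group
    return layout

# Print the layout row by row, in the same format as Figure 6.30.
layout = li_raid_layout(4, 4)
for wl in range(4):
    for page in ("MSB", "LSB"):
        row = [layout[c][(wl, page)] for c in range(4)]
        print(f"{page}/Wordline {wl}:",
              ["Blank" if g is None else f"Group {g}" for g in row])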

Evaluation: Figure 6.31 plots the 99th-percentile RBER when applying different error mitigation techniques at 10,000 P/E cycles. As we show in Figure 6.28, due to layer-to-layer variation, the per-page RBER is distributed over a wide range according to a gamma distribution. Thus, several worst-case flash pages within a block may become unusable (i.e., their RBER exceeds the ECC correction capability) before the overall RBER of the flash chip exceeds the ECC correction capability. We use the 99th-percentile RBER to represent the reliability of these worst-case flash pages. In this figure, the baseline applies the per-block variation-agnostic optimal read reference voltage (i.e., the variation-agnostic Vopt), achieving a 99th-percentile RBER of 4.8 × 10⁻⁴. When we apply the variation-aware Vopt proposed in Section 6.5.1, the 99th-percentile RBER is reduced by 9.6% over the baseline, to 4.3 × 10⁻⁴. LI-RAID reduces the 99th-percentile RBER by 66.9% over the baseline, to only 1.6 × 10⁻⁴. Thus, by grouping flash pages on less reliable layers with pages on more reliable layers, and by grouping MSB pages with LSB pages, LI-RAID reduces the probability of unusable pages within a block, reducing the number of retired flash blocks due to ECC failures. This greatly reduces the probability of data loss and increases the available storage space near the end of the SSD lifetime.

Figure 6.31: Worst-case RBER at 10,000 P/E cycles when applying different error mitigation techniques.

6.5.3 ReMAR: Retention Model Aware Reading

Motivation: As we show in Section 6.2.2, due to early retention loss, retention errors increase much faster in 3D NAND than in planar NAND. Thus, mitigating retention errors has become more important in 3D NAND than in planar NAND, as it has a greater impact on SSD reliability. However, as we show in our model in Section 6.4.2, early retention loss is proportional to the logarithm of time. This means that the majority of the retention errors and threshold voltage shifts happen shortly after programming. As a result, traditional retention mitigation techniques developed for planar NAND may become less effective on 3D NAND. For example, FCR [22, 250], a mechanism that remaps all data periodically, allows planar NAND to tolerate 50× more P/E cycles with a 3-day refresh period. But this number reduces to only 2.7× for 3D NAND due to early retention loss. Figure 6.32 shows that exponentially increasing the refresh period improves 3D NAND endurance only linearly. This motivates us to explore new ways to mitigate retention errors in 3D NAND.

Figure 6.32: Achievable 3D NAND endurance when refreshing periodically at different intervals.

Mechanism: The key idea of ReMAR is to accurately track the retention age of the data and apply the optimal read reference voltage predicted by our model in Section 6.4.2. First, ReMAR constructs the same linear models proposed in Section 6.4.2 online to accurately predict the optimal Va, Vb, and Vc. Similar to the distribution parameter model used in Section 6.4.2, we model the optimal Vb and Vc as: V = (a · PEC + b) · log(t) + c · PEC + d. We model the optimal Va as: Va = a · PEC + b, since Va is not affected by retention age. To construct this model online, the controller randomly selects a block during each idle period and records the optimal read reference voltages of the block, learned by sweeping the read reference voltages as is done in prior work [27], along with the block's P/E cycle count (PEC) and retention age (t). Over time, these data samples should be able to cover a range of P/E cycles and retention ages.9 Note that as the P/E cycle count of the SSD increases, the accuracy of the model increases, because more data samples are collected. Once this online model is constructed, it is used in the controller to predict the optimal read reference voltage on each read operation. To do this, the SSD controller stores the P/E cycle count and the program time of each block as metadata. During each read operation, the controller computes the retention age of the data by subtracting the program time from the read time. Using the recorded P/E cycle count and the computed retention age of the data, ReMAR applies the online model to predict Va, Vb, and Vc. By accurately predicting and applying the optimal read reference voltages, ReMAR aims to increase the accuracy of read operations and thereby decrease the raw bit error rate. Overhead: The memory and storage overhead is 800 KB for a 1 TB SSD, assuming a 5 MB block size and a 4 B timestamp per block. The performance overhead of each read operation is small, as we only need a few CPU cycles (on the order of 100 ns) in the flash controller to compute Vopt, which is negligible compared to the flash read latency (on the order of 10 µs). The performance overhead of learning the model can be hidden by learning only during idle time. The controller uses universal time for program and read times, so that the recorded time remains valid after a reboot. To do this, the controller needs a real-time clock to keep track of the current time. Without a power source on the SSD, the controller needs a special command to synchronize the current time with the host when it boots up. The program time of each block is cached in the memory of the controller, along with other metadata that already exists, such as the logical

9The SSD controller can also perform explicit characterization if a certain data range is missing.

address map and the P/E cycle count of each block. Evaluation: Figure 6.33 compares the RBER achieved by ReMAR (retention aware) to that achieved by the state-of-the-art read reference voltage tuning technique for planar NAND that only adapts Va to the P/E cycle count (retention agnostic). The RBER results are obtained by simulating reads on the characterization data used in Section 6.2.2. We assume that the average retention time of the data is 24 days. On average, ReMAR reduces RBER by 51.9%. As the P/E cycle count increases, the benefit of ReMAR becomes larger because the threshold voltage shifts faster over retention age. By accurately tracking the retention age, and by using our online model, ReMAR can adapt to this shift and reduce the RBER.

Figure 6.33: RBER reduction using retention-aware optimal read reference voltage.
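The following sketch shows how ReMAR's per-read prediction could be structured, using the model forms from Section 6.4.2 (Va = a · PEC + b; Vb and Vc = (a · PEC + b) · log(t) + c · PEC + d). All coefficient values below are made-up placeholders standing in for the parameters that ReMAR learns online.

import math, time

class ReMAR:
    def __init__(self, va_coeffs, vb_coeffs, vc_coeffs):
        self.va = va_coeffs            # (a, b)
        self.vb = vb_coeffs            # (a, b, c, d)
        self.vc = vc_coeffs            # (a, b, c, d)

    def predict(self, pec, program_time, read_time):
        # Retention age in seconds, computed from per-block metadata.
        t = max(read_time - program_time, 1.0)
        a, b = self.va
        va = a * pec + b               # Va does not depend on retention age
        vb = self._log_model(self.vb, pec, t)
        vc = self._log_model(self.vc, pec, t)
        return va, vb, vc

    @staticmethod
    def _log_model(coeffs, pec, t):
        a, b, c, d = coeffs
        return (a * pec + b) * math.log(t) + c * pec + d

# Usage with placeholder coefficients and a 24-day-old block at 8K P/E cycles.
remar = ReMAR(va_coeffs=(0.001, 60.0),
              vb_coeffs=(2e-6, -0.8, 3e-4, 150.0),
              vc_coeffs=(3e-5, -1.0, 5e-4, 230.0))
now = time.time()
print(remar.predict(pec=8000, program_time=now - 24 * 86400, read_time=now))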

6.5.4 ReNAC: Mitigating Retention Interference

Motivation: As we observe in Section 6.2.3, due to retention interference, the amount of threshold voltage shift of a victim cell during a certain amount of retention time is affected by the value stored in the neighboring cell. This presents a similar data dependency to that induced by program interference, where the amount of threshold voltage shift of a victim cell during a programming operation also depends on the value stored in the neighboring cell [24, 26]. To account for this data dependency, prior work proposes neighbor-cell assisted correction (NAC), which compensates for this threshold voltage shift by adjusting the read reference voltage based on the neighboring cell's value [26]. However, this mechanism does not account for retention interference, which is new in 3D NAND flash memory. Thus, we adapt NAC for 3D NAND flash memory to account for the new retention interference. Mechanism: The key idea of ReNAC is to use the data stored in the neighbor cell to predict and adapt to the amount of retention interference on a victim cell. Using similar techniques from Section 6.4.2, ReNAC first develops an online model of retention interference as a function of the retention age and the neighbor cell's state. The SSD controller obtains the retention age of each block using a similar mechanism to ReMAR, and computes and applies the neighbor-cell-dependent read offset at that retention age from the model. We expect that, as retention interference increases in future 3D NAND devices, the benefit of ReNAC will increase. We leave a quantitative evaluation of ReNAC for future work.

6.5.5 Implications on Systems Reliability

The mechanisms we propose in this section can be combined together to achieve a significant reduction in average and worst-case RBER. We demonstrate the benefits that these reliability improvements can provide for computer systems, such as flash memory lifetime, sustainable workload write intensity, and error correction. In Figure 6.34, we compare and contrast the reliability of three example SSDs: (1) Baseline, an SSD that uses a fixed, default read reference voltage and deploys a conventional RAID scheme; (2) State-of-the-art, an SSD that uses the optimal read reference voltage predicted by existing mechanisms designed for planar NAND [27, 195, 252, 260] and deploys a conventional RAID scheme; (3) This Work, an SSD that uses the optimal read reference voltage predicted by LaVAR and ReMAR, and deploys the LI-RAID scheme. In this figure, we plot the worst-case RBER instead of the average RBER, because the worst-case RBER limits flash memory lifetime. Because the RBER increases with the P/E cycle count, if the worst-case RAID group has a high enough RBER, NAND flash memory can no longer guarantee reliable operation.


Figure 6.34: Worst-case RBER as the P/E cycle count increases, for the Baseline, State-of-the-art, and This Work SSDs, compared against the ECC limit.

Assuming that the ECC deployed on the SSD can correct up to an RBER of 3 × 10⁻³ [27, 33] (i.e., the ECC limit, the purple dotted line in Figure 6.34), we can calculate the lifetime of each example SSD. The flash memory lifetime ends when the worst-case RBER goes above the ECC limit. We find that State-of-the-art improves flash lifetime by 23.8%, and This Work improves flash lifetime by 85.0% over the Baseline. When the SSD is deployed in a server, which has a fixed device lifetime, the server has to throttle the write frequency to a certain number of drive writes per day (DWPD) to ensure that the SSD can operate reliably during the fixed lifetime. In this case, the flash lifetime improvement translates into an 85.0% higher DWPD threshold for SSDs in a server. In modern SSDs, the storage overhead for error correction, i.e., ECC redundancy, increases in each generation to better tolerate the degraded flash reliability due to aggressive scaling. For example, to tolerate an RBER of up to 3 × 10⁻³ for the Baseline SSD at the end of its lifetime, a modern BCH code [107] requires 12.8% storage overhead for the redundant ECC bits [72]. By deploying various error mitigation techniques like the ones we propose in this section, we can significantly reduce this error correction overhead without sacrificing flash reliability. Assuming all three example SSDs achieve the same lifetime, and the same reliability (i.e., uncorrectable error rate) at the end of their lifetime, State-of-the-art reduces the ECC redundancy to 7.4%, and This Work reduces the ECC redundancy to 2.7%, which is 78.9% lower than Baseline. We leave the evaluation of the performance improvement due to a weaker ECC requirement for future work [48, 177].

6.6 Limitations

The observations and analysis results in this chapter apply only to 3D NAND based on a charge trap cell design. The error characteristics of future 3D NAND chips may also change when the manufacturing process technology starts shrinking again, if adding more stacked layers becomes less economically viable. In a smaller process technology node, two-step programming errors, read disturb errors, and retention interference errors may become dominant; thus, our model needs to be extended to consider all of them. Our techniques focus on mitigating layer-to-layer process variation and retention errors that are unique to 3D NAND flash memories. Our techniques do not reduce other errors caused by read disturb or program interference, because prior works already provide techniques to mitigate them [24, 35]. Our retention model and ReMAR do not consider self-recovery and temperature effects, which can affect flash reliability significantly. In Chapter 7, we extend our retention model to consider these effects, and provide mechanisms to track the amount of self-recovery and temperature effects.

6.7 Conclusion

We develop an understanding of three new error characteristics in 3D NAND flash memory through rigorous experimental characterization of real, state-of-the-art 3D NAND flash memory chips: layer-to-layer process variation, early retention loss, and retention interference. We analyze and show that these new error characteristics are fundamentally caused by changes introduced in the 3D NAND flash memory architecture, and predict that these errors will increase in future 3D NAND flash memories. To handle these three new error characteristics in 3D NAND, we develop new analytical models for layer-to-layer process variation and early retention loss in 3D NAND flash memory, and propose four new techniques that utilize our models to improve the reliability of 3D NAND flash memory. Our evaluations show that our newly-proposed techniques successfully mitigate the new error patterns that we discover in 3D NAND flash memory. We hope that the error characterization and analysis performed in this work motivates future studies on the reliability of 3D NAND, and that it inspires new error mitigation mechanisms that cater to 3D NAND flash memory.

Chapter 7

HeatWatch: Self-Recovery and Temperature Aware Retention Error Mitigation

As we have shown in Chapter 6, retention errors dominate 3D NAND flash memory errors. Thus, mitigating retention errors in 3D NAND can yield significant flash reliability improvements. We find a promising opportunity to mitigate retention errors by exploiting self-recovery, which we have introduced in Section 3.1.6. This opportunity has largely been ignored in recent literature, as the self-recovery effect is not well understood, especially in 3D NAND flash memory. In this chapter, we exploit the 3D NAND flash memory's device-level self-recovery behavior to improve flash reliability. The goal of this chapter is to (1) perform a detailed experimental characterization of the self-recovery effect using real, state-of-the-art 3D NAND flash memory devices, and (2) exploit our findings by developing HeatWatch, a new mechanism that takes advantage of self-recovery in 3D NAND flash memory. The key idea of HeatWatch is to adapt the read reference voltage to the dwell time (i.e., the idle time between consecutive program cycles) and retention time of the workload, as well as to the SSD's operating temperature. First, we extensively characterize how self-recovery and temperature affect 3D NAND flash reliability using real, state-of-the-art MLC 3D NAND flash memory chips (Section 7.1). Second, based on the findings of our characterization, we propose URT, a new unified model of self-recovery, temperature, retention loss, and wearout for 3D NAND flash memory (Section 7.2). Third, using URT, we propose and evaluate HeatWatch, a mechanism to improve flash reliability and lifetime by optimizing read operations in 3D NAND flash memory for the amount of self-recovery allowed by the workload and the operating temperature of the flash memory (Section 7.3).

7.1 Characterizing the Self-Recovery Effect

To understand the behavior of the self-recovery effect in 3D NAND flash memory, we perform an extensive characterization of the effect using real, state-of-the-art 3D NAND flash memory chips. Our goal in this characterization is to answer the following research questions:

• How does the dwell time affect retention and program variation errors?
• What is the correlation between dwell time and the magnitude of the self-recovery effect?
• How does the operating temperature affect the number of retention and program variation errors?
• How do the benefits of self-recovery change based on the number of performed recovery cycles?

We make all of our characterization data, including results not shown in this chapter for brevity, available in an extended report [192] and online [281]. We use the observations and analysis from our characterization to drive the design of a new model of 3D NAND flash memory reliability, as described in Section 7.2.

7.1.1 Characterization Methodology

To answer the above research questions, we design new experiments to test how flash reliability changes with different dwell times and temperatures. In these experiments, we use state-of-the-art 30- to 40-layer 3D charge trap NAND flash chips from a major vendor.1 We use a NAND flash memory characterization platform that fits in the SSD form factor, and contains a special version of the firmware in the SSD controller. We use a server machine to issue remote procedure calls (RPCs) [13] to the firmware over a Serial ATA (SATA) [290] connection. These RPCs trigger commands to be sent directly to the flash chips, and can transfer the raw data (i.e., data with raw bit errors) directly from the flash chips to the server without applying error correction (ECC). This allows us to observe the effect of dwell time and operating temperature on the raw bit error rate of each flash chip. We use two metrics to evaluate flash reliability. First, we measure the raw bit error rate (RBER), which is the rate at which errors occur in the data before error correction. To calculate the RBER, we read data from a NAND flash memory chip using the default read reference voltage, and compare the data against a pristine server-side copy of the data that was originally written to the chip. Second, we show the statistical mean of the threshold voltage distribution of each high-voltage state (i.e., P1, P2, P3). As we mention in Section 3.1, retention loss and program variation cause the threshold voltages of cells to shift, which leads to raw bit errors.2 To obtain the threshold voltage distribution of a flash page, we perform multiple read operations to sweep the range of all positive read reference voltages, using the read-retry command on the NAND flash memory chip [23, 27, 82].3 Read-retry allows us to fine-tune the read reference voltage used for each read operation. The smallest amount by which we can change the read reference voltage is called a voltage step. We normalize each threshold voltage value to the number of voltage steps needed to reach that particular voltage value.1 Limitations. In our experiments, we characterize 3D NAND flash memory chips of the same

1We do not disclose the exact number of stacked layers in the chips, to protect the anonymity of the flash vendor, and we cannot disclose the exact voltage values, as this is proprietary information. 2We are unable to show the full threshold voltage distribution for the ER state, because the ER state threshold voltages are negative, and our platform cannot measure negative voltage values. This is similar to prior works [23, 27, 33, 35]. 3Due to space limitations, we refer the reader to prior works [195, 260] for a detailed methodology on how to obtain the threshold voltage distribution.

model from one major vendor. Our approach ensures that any variation that we observe in our characterization is the result of only manufacturing process variation, and not a result of differences in flash chip architecture or different manufacturing techniques used by different vendors. While we do not take vendor-related variation into account, we believe that our general qualitative findings on the effects of self-recovery and temperature can be generalized to 3D charge trap NAND flash memory manufactured by other vendors. This is because the underlying structure of 3D charge trap cells (see Section 3.4) is similar across different vendors [55, 221, 257]. Thus, while the exact numbers reported in this chapter may differ from vendor to vendor, our qualitative findings, which are a result of charge detrapping from the tunnel oxide (see Section 3.1.6), should be similar across vendors. We are unable to perform repeated runs of our test procedures on the same block, as each run of a test procedure induces further wearout on a block. The amount of wearout affects the error rate of NAND flash memory [23, 32, 33, 195, 219, 260]. To ensure an accurate comparison between multiple test procedure runs, we use eight target wordlines in the same stack layer from eight randomly-selected flash blocks that are set to the same level of wearout for the same test procedure. By selecting wordlines in the same layer, we eliminate the potential impact of cross-layer process variation. Note that we do not characterize chip-to-chip process variation, as an accurate study of such variation requires a large-scale study of a large number (e.g., hundreds) of 3D NAND flash memory chips, which we do not have access to. Hence, we leave such a large-scale study for future work.

7.1.2 Characterizing the Dwell Time Effect

To measure the effect of dwell time on flash reliability (see Section 3.1.6), we characterize the RBER and the threshold voltage distribution. We wear out each of our target blocks by repeatedly writing pseudorandom data until the block reaches 3,000 P/E cycles. For the last 300 P/E cycles, we use a different dwell time for each block, spanning a range of 64 s to 8192 s. Prior work shows that the magnitude of the self-recovery effect is correlated with the dwell time for only the last 10% of P/E cycles performed on a block [220]. We show in Section 7.1.4 that the dwell time used during only the last 20 P/E cycles affects self-recovery. We measure how the dwell time affects retention loss speed and program variation by performing a retention test on each target wordline immediately after the block containing the wordline reaches 3,000 P/E cycles. In this test, we program pseudorandom data to the target wordline, and repeatedly measure the threshold voltage distribution using the methodology described in Section 7.1.1 for up to 71 min (i.e., 4260 s) after the data was written. We conduct this experiment at an environmental temperature of 70 ◦C, which accelerates both self-recovery and retention loss, to reduce the experiment duration to a reasonable length [220].4 We later repeat a small portion of the test at room temperature (20 ◦C), and verify that all of our observed trends remain the same. Effect on RBER. First, we study how self-recovery affects the raw bit error rate. Figure 7.1 shows the RBER as the retention time (tr) increases, for different dwell times (td) used for the last

4Based on Arrhenius’ Law [6], the same experiment would take more than 11 years to finish had we performed it at room temperature (20 ◦C).

300 P/E cycles. We use a color gradient for the curves, where the reddest (topmost) curve has the shortest dwell time, and the blackest (bottommost) curve has the longest dwell time. Note that the x-axis and y-axis are both in log scale.


Figure 7.1: Change in RBER over retention time for flash pages that were programmed using different dwell times.

We make two observations from Figure 7.1. First, when the retention time is short (i.e., tr < 10 s), the RBER is similar across different dwell times. During this time, the RBER is dominated by program variation errors [21, 195, 219]. Recall that a longer dwell time increases the amount of detrapped charge during self-recovery. However, since the RBER is similar across different curves regardless of the dwell time, this means that self-recovery does not mitigate program variation errors. Therefore, unlike previous works [50, 224, 337], we conclude that self-recovery does not repair all of the errors that occur due to wearout in 3D NAND flash memory. Second, when the retention time is large (i.e., tr > 10 s), there is a strong correlation between a longer dwell time and a lower RBER. During this time, the RBER is dominated by retention errors (hence the growth in RBER as the retention time increases). We conclude that a longer dwell time after an erase operation mitigates retention errors, but not program variation errors, in 3D NAND flash memory. Effect on Threshold Voltage. Next, we study the threshold voltage distribution of the flash pages under test, to understand how self-recovery affects the threshold voltage shift (and thus the RBER) due to retention loss. Figure 7.2 shows the threshold voltage distribution before (black dots, tr = 1 min) and after (red dots, tr = 71 min) a large retention time elapses, for a flash page programmed using a 64 s dwell time (top plot), and for a flash page programmed using a 8192 s dwell time (bottom plot). We observe from the figure that when the dwell time is higher, the threshold voltage distribution shift due to retention loss is significantly smaller. To quantify the threshold voltage shift as a function of the dwell time, we plot the statistical means of the threshold voltage distributions of cells programmed to the P1, P2, and P3 states in Figure 7.3. We use the same color gradient that we used in Figure 7.1 to represent the different dwell times. Note that for these experiments, the smallest retention time that we show on the x-axis (tr = 64 s) is much larger than the smallest retention time shown in Figure 7.1 (tr = 0.5 s), because it takes significantly longer for us to sweep the threshold voltage of the cells in a wordline


Figure 7.2: Threshold voltage distribution before and after a long retention time, for different dwell times.

(as in Figure 7.3), compared to simply measuring the RBER of the wordline (as in Figure 7.1). We make two observations from the figure. First, for all three states, the mean threshold voltage changes by a smaller amount when the dwell time is higher, corroborating the threshold voltage distribution shifts shown in Figure 7.2. Second, for a fixed dwell time, the change in voltage is linearly related to the log of the retention time.5 We use this relationship to develop a simple model that can quantify how retention loss speed changes with dwell time (see below). We also calculate how the width of the distribution changes due to retention loss for different dwell times (not plotted). We observe that the change in the distribution width is relatively small, and thus choose to ignore it to simplify the analysis. Effect on Retention Loss Speed and Program Variation. To quantify how self-recovery changes (1) retention loss speed and (2) program variation, we construct a simple model of how the threshold voltage and RBER change due to these two factors. As we observe in Figure 7.3, the threshold voltage distribution mean is linearly correlated with the logarithm of the retention time (tr). Thus, we fit our measurements to the following linear model, for a given dwell time:

Y(tr) = α · ln(tr) + β    (7.1)

In this model, Y can represent either (1) the mean of the threshold voltage distribution of each high-voltage state (i.e., P1/P2/P3); or (2) the logarithm of the RBER, i.e., log(RBER);6 based on

5A similar linear relationship between the change in threshold voltage and the log of the retention time is observed for planar NAND flash memory in past works [124, 220]. 6We model the logarithm of the RBER because, when retention loss linearly shifts the threshold voltage distribution, which roughly follows a Gaussian distribution [195], the linear distribution shift results in a logarithmic change in RBER, which is quantified as the overlapping area between two neighboring distributions.

Figure 7.3: Threshold voltage distribution mean vs. retention time, for different dwell times.

the values chosen for α and β. α represents the retention loss speed. β represents the program offset, which is the initial value of Y immediately after programming. We use the absolute value of the program offset (i.e., |β|) to quantify the impact of program variation. For the threshold voltage distribution mean of each high-voltage state, Y and β are positive, and a more positive program offset results in a higher initial mean. As we observe under Effect on Threshold Voltage in Section 7.1.3, when the mean voltages of neighboring distributions increase, the overlap between the distributions decreases, which in turn reduces the number of program variation errors. For log(RBER), Y and β are negative, because the RBER is always less than 1. A more negative program offset (i.e., a greater |β|) corresponds to a more negative initial value of log(RBER) (i.e., fewer errors). For each dwell time, we fit Equation 7.1 to our experimental characterization data in order to calculate the values of α and |β| when Y represents (1) the mean voltage of each higher-voltage state, or (2) log(RBER). Figure 7.4 (left) illustrates the relation between dwell time and retention loss speed (α), normalized to the greatest observed retention loss speed. Figure 7.4 (right) illustrates the relation between dwell time and program offset (|β|), normalized to the greatest observed program offset. Note that the x-axis (i.e., the dwell time) is in log scale. In both figures, the markers represent our measured data points from real NAND flash memory chips, and the dashed lines show a linear trend line for the data. We make two key observations from Figure 7.4. First, the self-recovery effect reduces the retention loss speed linearly with the logarithm of dwell time. We observe, however, that the change in retention loss speed is different for each state. As Figure 7.4 (left) shows, our data points follow the linear trend line closely (with an R² value larger than 90% for each state and for the RBER). Second, as we concluded from Figure 7.1, self-recovery has little effect on program variation within the tested dwell time range. As Figure 7.4 (right) shows, the maximum

Figure 7.4: Retention loss speed (left) and program offset (right), for different dwell times.

difference in program offset for any of our threshold voltage states is less than 2.1%. Note that our second finding is new, and it shows that, unlike previous findings for planar NAND flash memory [50, 224, 337, 338], self-recovery does not reduce the number of program variation errors, and hence the amount of wear, in 3D NAND flash memory.
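To make Equation 7.1 and the trend in Figure 7.4 concrete, the sketch below fits Y(tr) = α · ln(tr) + β separately for several dwell times, and then regresses the fitted retention loss speed α against ln(td). The measurement values are synthetic placeholders; only the model form and the qualitative trend come from the characterization.

import numpy as np

def fit_eq_7_1(retention_times_s, values):
    """Fit Y(t_r) = alpha * ln(t_r) + beta (Equation 7.1) for one dwell time."""
    alpha, beta = np.polyfit(np.log(retention_times_s), values, deg=1)
    return alpha, beta

# Synthetic measurements: log(RBER) vs. retention time for three dwell times.
t_r = np.array([1, 10, 100, 1000, 4000])
samples = {
    64:   np.log([1.0e-5, 2.0e-5, 4.2e-5, 8.5e-5, 1.4e-4]),
    1024: np.log([1.0e-5, 1.8e-5, 3.3e-5, 6.0e-5, 9.0e-5]),
    8192: np.log([1.0e-5, 1.6e-5, 2.7e-5, 4.4e-5, 6.2e-5]),
}
alphas, dwells = [], []
for t_d, y in samples.items():
    alpha, beta = fit_eq_7_1(t_r, y)
    print(f"dwell {t_d:>5} s: retention loss speed alpha={alpha:.3f}, "
          f"program offset |beta|={abs(beta):.2f}")
    alphas.append(alpha); dwells.append(t_d)

# Retention loss speed decreases roughly linearly with the log of the dwell time.
slope, intercept = np.polyfit(np.log(dwells), alphas, deg=1)
print(f"alpha ~= {slope:.4f} * ln(t_d) + {intercept:.3f}")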

7.1.3 Characterizing the Temperature Effect

Next, we measure the effect of temperature on self-recovery and flash reliability (see Section 3.1.6). Similar to the experiment in Section 7.1.2, we use eight target wordlines in the same stack layer from randomly-selected flash blocks. First, for each block, we wear out the block in 1,000 P/E cycle intervals up to a total of 10,000 P/E cycles, writing pseudorandom data and using a fixed dwell time of 0.5 s. We then put the test chip in a temperature-controlled box, and set a target temperature. After the temperature of the test chip settles to the target temperature, we perform 20 more P/E cycles on each target wordline at the target temperature, using a 0.5 s dwell time. Though these P/E cycles are performed at different temperatures for each test, the dwell time we use is small, and thus we assume that the difference between the equivalent dwell times at room temperature is small across our tests. Then, we perform the retention test described in Section 7.1.2 for all target wordlines up to a retention time of 71 min. We repeat the retention test under a range of target temperatures in each round: 0, 10, 20, 28, 35, 50, 60, and 70 ◦C. During the retention test, data is both programmed and read under the target temperature. Effect on RBER. First, we study how the RBER changes with retention time under different temperatures, as shown in Figure 7.5 for a wordline with 10,000 P/E cycles. Each curve represents the RBER under a different programming temperature. We use a color gradient to indicate the temperature: the reddest color represents the highest temperature (70 ◦C) and the blackest represents the lowest temperature (0 ◦C). We make two key observations from the figure. First, when the retention time is small (i.e., tr < 2 × 10²), the RBER is lower for higher temperatures (i.e., the red curves). Recall that when the retention time is small, the RBER is dominated by program variation errors [21, 195, 219]. Thus, we expect that the RBER decreases with higher temperatures because higher programming temperatures

Figure 7.5: RBER over retention time at 10,000 P/E cycles under different programming tem- peratures. peratures decrease the number of program variation errors (we discuss this in more detail under Effect on Threshold Voltage below). Second, when the retention time is larger (i.e., t > 2 102), r · the RBER becomes higher for higher programming temperatures. This is because as the tem- perature increases, the retention errors increase at a faster rate. Due to this faster growth, the RBER for a higher temperature overtakes the RBER for a lower temperature at a retention time between e2 s to e3 s. This indicates that the threshold voltage shift due to high-temperature re- tention is faster than that for low-temperature retention, which is in line with Arrhenius’ Law [6] (see Section 3.1.6). Effect on Threshold Voltage. Next, we study how the programming temperature affects the threshold voltage of a flash cell. We begin by studying how the initial threshold voltage (i.e., the threshold voltage immediately after programming) changes with temperature. We measure the threshold voltage distribution under different programming temperatures, and then fit our data to Equation 7.1 to compensate for any retention loss that occurs during the measurements. Fig- ure 7.6 shows the resulting threshold voltage distribution for each state, at 0 ◦C (the black dotted curves) and at 70 ◦C (the red curves). Note that the ER state distribution at 70 ◦C completely falls below the lowest voltage that we can measure, and hence is not shown.

" !° " ° " "    

"

                  Vth

Figure 7.6: Threshold voltage distribution right after programming at different programming temperatures, predicted by our retention loss model (Equation 7.1).

We make two observations from Figure 7.6. First, the higher programming temperature shifts the P1, P2, and P3 states to higher threshold voltages, and the ER state to lower threshold voltages. The threshold voltage shifts are likely due to the increased electron mobility under high temperature, which improves the speed of the program operation (and likely the erase operation as well [220, 332]). As a result, each programming pulse adds a greater amount of charge. Second, due to the threshold voltage distribution shifts, the amount of overlap between the P1 and P2 distributions, and between the P2 and P3 distributions, decreases at a higher programming temperature. This is because while the distribution means shift to higher voltages at a higher programming temperature, the distribution widths do not change. Because of the smaller amount of overlap between two neighboring distributions, there are fewer program variation errors at higher temperatures, as we have shown in Figure 7.5.

Next, we study how threshold voltage shifts due to retention loss change with the programming temperature. For brevity, we do not plot these results. We observe that when the retention time is large (t_r > 2 · 10²), the retention loss speed increases due to high temperature, which is similar in nature to the effect of programming temperature on RBER.

Effect on Retention Loss Speed and Program Variation. We use our model from Equation 7.1 to calculate the retention loss speed and program offset for each programming temperature, based on our characterization data. Figure 7.7 illustrates how the programming temperature affects retention loss speed (left) and program offset (right). We fit a quadratic trend line for retention loss speed, and a linear trend line for program offset (shown as dashed lines).


Figure 7.7: Retention loss speed (left) and program offset (right) across different programming temperatures.

We make two observations from Figure 7.7. First, a higher temperature accelerates retention loss at a superlinear rate. We show in Section 7.2 that this trend complies with Arrhenius' Law [6]. Second, we find that the programming temperature affects program variation. This effect has not been accounted for in prior work, which usually assumes that program operations are performed at room temperature [124]. In Figure 7.7 (right), we find that the program offset is higher at higher programming temperatures. As already shown in Figure 7.6, the higher initial threshold voltage at higher programming temperatures reduces the amount of overlap between neighboring threshold voltage distributions, which in turn reduces the number of program variation errors. We conclude that higher temperature increases retention errors but reduces program variation errors.

7.1.4 Characterizing the Recovery Cycle Effect

We measure the effect of recovery cycles, i.e., P/E cycles performed with a long dwell time, on self-recovery and flash reliability. We measure how the number of recovery cycles affects retention loss speed. We focus on retention loss speed in this experiment because, as we saw in Section 7.1.2, the dwell time does not affect program variation. We first wear out each block by repeatedly writing pseudorandom data with a dwell time of 0.5 s, until the block reaches 3,000 P/E cycles. Then, we start self-recovery, performing recovery cycles using a 6-hour dwell time. During the idle time of each recovery cycle, we perform our 71-minute retention test at an operating temperature of 70 °C to measure the current retention loss speed. We keep performing recovery cycles until the change in retention loss speed is less than 1%, which indicates that an additional recovery cycle does not significantly increase the effect of self-recovery.

Effect on Retention Loss Speed. Based on our characterization data, we calculate the retention loss speed (α) after each recovery cycle. We use Equation 7.1 to calculate α, as we did in Figures 7.4 and 7.7. Figure 7.8 shows how the retention loss speed changes as a function of the number of recovery cycles performed. We fit a power-law trend line to the data, shown as a dashed line in the figure.


Figure 7.8: Retention loss speed vs. recovery cycles.

We make two key observations from Figure 7.8. First, to our surprise, the reduction in retention loss speed due to self-recovery becomes insignificant very quickly. We find that, for wordlines that have endured 3,000 P/E cycles, most of the benefits of self-recovery occur within 22 recovery cycles for 3D NAND flash memory. This is much smaller than the number of cycles required according to prior work [220], which finds that most of the benefits of self-recovery occur only when the number of recovery cycles is 10% of the total P/E cycle count. In other words, according to prior work, it should have taken at least 300 recovery cycles to achieve most of the benefits of self-recovery. Importantly, this implies that we can reap the benefits of self-recovery with a much lower overhead (i.e., significantly fewer recovery cycles) than previously-proposed mechanisms [67, 125, 220]. Thus, we can improve the overall performance of NAND flash memory devices that perform self-recovery. Second, the RBER does not change significantly until after the first three recovery cycles. To reduce the latency of self-recovery, prior works [50, 224, 337, 338] distribute recovery cycles throughout the flash lifetime, and periodically perform only a single recovery cycle. Based on our observation, performing only one recovery cycle may not change the RBER significantly, and these mechanisms may not significantly improve the flash lifetime as currently designed.

7.1.5 Summary of Key Observations

We conclude that (1) the self-recovery effect reduces retention loss speed linearly with the logarithm of dwell time, and has little effect on program variation; (2) the temperature effect increases retention loss speed at a superlinear rate, and increases program variation; and (3) the reduction in retention loss speed due to self-recovery becomes insignificant after 20 recovery cycles.

7.2 Self-Recovery Effect Modeling

We use our observations from Section 7.1 to construct Unified Recovery and Temperature (URT), a comprehensive analytical model of the impact of retention, wearout, self-recovery, and temperature on two output parameters: (1) the threshold voltage of a flash cell, and (2) the raw bit error rate (RBER) of a flash page. URT calculates each output parameter Y as:

Y = Y_0 + ΔY(t_r · AF, t_d · AF, PEC)    (7.2)

In the equation, Y_0 is the initial value of the output parameter immediately after a cell is programmed, and ΔY is the change in the output parameter due to retention loss. ΔY is a function of (1) the retention time (t_r) and (2) the dwell time (t_d), both of which are scaled by an acceleration factor AF (see below), and (3) the P/E cycle count (PEC). URT consists of three components. The program variation component (PVM; Section 7.2.1) predicts Y_0 based on the amount of program variation that took place. The effective retention/dwell time component (RDTM; Section 7.2.2) computes AF, which scales the retention or dwell time at the current temperature of the NAND flash memory to the effective time at room temperature that has the same impact on Y. The self-recovery and retention component (SRRM; Section 7.2.3) predicts ΔY based on the effective retention/dwell time and the P/E cycle count. We show how URT can be used to improve flash reliability in Section 7.3.

7.2.1 Program Variation Component

First, we build a program variation model (PVM) to predict the initial values (Y_0 in Equation 7.2) of the threshold voltage and RBER immediately after a cell is programmed. The initial values are determined by (1) the target programming voltage, which is fixed for each state, and (2) the program offset (Section 7.1). Recall that program offset is affected by the programming temperature (Section 7.1.3). Prior work shows that the P/E cycle count significantly affects program offset as well [23, 24, 195, 219].

To account for both variables (i.e., programming temperature and P/E cycle count), we use a multivariate linear regression model to model program variation:

Y_0 = A · T_p · PEC + B · T_p + C · PEC + D    (7.3)

In PVM, Y_0 is a function of the P/E cycle count of the flash cell (PEC) and the programming temperature (T_p), which are input parameters. A, B, C, and D are model constants that change based on which value we model (e.g., initial threshold voltage, initial log of RBER). We fit PVM to our characterization data using the ordinary least squares method implemented in Statsmodels [286], and conclude that the model fits well, with an R² value of 91.7%. We provide the values of all the fitted model parameters online [192].
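For illustration, the following is a minimal sketch of how such a fit can be set up with Statsmodels. The data frame values are hypothetical placeholders standing in for our characterization samples (e.g., initial log(RBER) at each programming temperature and P/E cycle count); the formula encodes Equation 7.3.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical characterization samples: initial log10(RBER) measured at different
# programming temperatures (Tp, in degC) and P/E cycle counts (PEC).
df = pd.DataFrame({
    "Tp":  [0, 0, 20, 20, 40, 40, 70, 70],
    "PEC": [1000, 10000, 1000, 10000, 1000, 10000, 1000, 10000],
    "Y0":  [-4.1, -3.2, -4.2, -3.3, -4.3, -3.4, -4.5, -3.6],
})

# PVM (Equation 7.3): Y0 = A*Tp*PEC + B*Tp + C*PEC + D, fit with ordinary least squares.
pvm = smf.ols("Y0 ~ Tp:PEC + Tp + PEC", data=df).fit()
print(pvm.params)     # estimates of A (Tp:PEC), B (Tp), C (PEC), and D (Intercept)
print(pvm.rsquared)   # goodness of fit (R^2)
```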

7.2.2 Effective Retention/Dwell Time Component

Next, we build an effective retention/dwell time model (RDTM) to calculate the acceleration factor (AF in Equation 7.2), which scales the retention time or dwell time under any temperature (t_real) to the effective time under room temperature (t_room). Arrhenius' Law [6] (see Section 3.1.6) is commonly used by prior works to scale the retention time and dwell time of flash memory across different temperatures (e.g., [22, 25, 27, 126, 220]). RDTM uses the same general equation as Arrhenius' Law (Equation 3.3):

AF = t_real / t_room = exp( (E_a / k_B) · (1 / T_real − 1 / T_room) )    (7.4)

In RDTM, AF is a function of the room temperature (T_room), the current temperature of the NAND flash memory (T_real), and the activation energy (E_a). k_B is the Boltzmann constant. To accurately model the acceleration factor, we (1) experimentally calculate a new value of E_a that we can use for 3D NAND flash memory; and (2) verify the accuracy of Arrhenius' Law through experimental characterization, which no previous work has done for 3D NAND flash memory. While E_a is commonly accepted to be 1.1 eV for planar NAND flash memory [27, 126], we cannot use the same value of E_a for 3D NAND flash memory, due to changes in materials and in the manufacturing process. Fortunately, we have extensive real device characterization data on retention loss at different temperatures (Section 7.1.3), which enables us to determine the correct E_a for 3D NAND flash memory. We define t_1 as the time required for a 3D NAND flash memory device to experience a fixed amount of retention loss, ΔY, at temperature T_1. We define t_2 as the time required for the same amount of retention loss to occur at temperature T_2. Using Equation 3.3, the activation energy (E_a) can be calculated as:

E_a = k_B · ln(t_1 / t_2) · T_1 · T_2 / (T_2 − T_1)    (7.5)

We choose T_2 = 343.15 K (70 °C) as the reference temperature, and t_2 = 3600 s as the reference retention time, as our model is more resilient to noise at a larger retention time.

Using our characterization data from Section 7.1.3, we find the equivalent t_1 for different temperature values of T_1, spanning 20 °C to 70 °C, and for different P/E cycle counts, spanning 1,000–10,000 P/E cycles. We use ordinary least squares estimates to fit Equation 7.5 to these data points. From the fit, we determine that the best-fit value is E_a = 1.04 eV for the 3D NAND flash memory chips that we test. The 95% confidence interval for E_a is 1.01 eV to 1.08 eV. The value of E_a is based on the materials used for the cell, and should be similar for 3D charge trap cells manufactured by other vendors [55, 123, 221].

Next, we verify that Arrhenius' Law holds for 3D NAND flash memory, by calculating the coefficient of determination (R²) of the fit to the equation for Arrhenius' Law. We find that R² = 0.76. This means that Arrhenius' Law explains 76% of the variation due to temperature. This is a good fit given that we use a single value for E_a (the best fit) across all of our data points, because it is known that the activation energy changes across different temperatures and P/E cycle counts [10, 208]. We use a single value of E_a regardless of temperature and P/E cycle count to simplify RDTM. We leave more accurate activation energy modeling for future work.
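For illustration, the following is a minimal sketch of the Arrhenius fit described above. The (T_1, t_1) pairs are hypothetical placeholders, not our measured data; Equation 7.5 is rearranged into a straight-line fit through the origin whose slope is E_a / k_B.

```python
import numpy as np

K_B = 8.617333262e-5     # Boltzmann constant (eV/K)
T2, t2 = 343.15, 3600.0  # reference point: 70 degC, 3600 s

# Hypothetical (temperature T1 in K, equivalent retention time t1 in s) pairs; the real
# values come from the characterization data in Section 7.1.3.
T1 = np.array([293.15, 301.15, 308.15, 323.15, 333.15])
t1 = np.array([1.5e6, 4.9e5, 2.0e5, 3.2e4, 1.0e4])

# Equation 7.5 rearranged: ln(t1 / t2) = (Ea / kB) * (1/T1 - 1/T2).
x = 1.0 / T1 - 1.0 / T2
y = np.log(t1 / t2)
slope = np.sum(x * y) / np.sum(x * x)   # least-squares slope through the origin
print(f"fitted activation energy: {slope * K_B:.2f} eV")
```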

7.2.3 Self-Recovery and Retention Component

We build a self-recovery and retention model (SRRM) to accurately predict the threshold voltage shift and change in RBER (ΔY in Equation 7.2) due to retention loss. SRRM predicts the effect of (1) effective retention time, (2) effective dwell time, and (3) P/E cycle count, which all directly affect retention loss speed (see Section 7.1.2).

To construct SRRM, we repeat our dwell time experiments from Section 7.1.2 at room temperature. We cover a slightly larger dwell time range than our previous experiments, testing from 32 s to 4.6 h. To include the effect of the P/E cycle count, we perform the retention test described in Section 7.1.2 for up to a 24-day retention time under room temperature, using eight randomly-selected flash blocks, and spanning a range between 1,000 and 10,000 P/E cycles at 1,000-P/E-cycle intervals. We observe similar trends in terms of retention time, dwell time, and temperature sensitivity as the findings we observe at a higher temperature in Section 7.1. For brevity, we do not plot these results, but we provide them online [192].

From an analysis of the results of these experiments, we find that the threshold voltage shift in 3D NAND flash memory is much less sensitive to the P/E cycle count than planar NAND flash memory. Thus, we develop a new model that predicts the impact of retention and self-recovery on 3D NAND flash memory, instead of relying on prior models for planar NAND flash memory. Our SRRM model is as follows:

ΔY(t_er, t_ed, PEC) = b · (PEC + c) · ln( 1 + t_er / (t_0 + a · t_ed) )    (7.6)

In SRRM, ΔY is a function of three input parameters: (1) the effective retention time of the data stored in the cell (t_er), (2) the effective dwell time (t_ed), and (3) the P/E cycle count for a flash cell (PEC). The model has four constants, whose values change depending on which output parameter (ΔY) we are evaluating: b and c control the impact of P/E cycle count on retention loss speed, and t_0 and a control the impact of dwell time on retention loss speed. To determine the values for these constants, we use the non-linear least squares algorithm implemented in SciPy [130, 179] to fit SRRM to the characterization data we collected.
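For illustration, the following is a minimal sketch of such a non-linear least squares fit with SciPy. The input grid and the "measurements" are synthetic (generated from assumed constants plus noise), standing in for our characterization data.

```python
import numpy as np
from scipy.optimize import curve_fit

# SRRM (Equation 7.6): Delta Y = b * (PEC + c) * ln(1 + t_er / (t0 + a * t_ed)).
def srrm(x, b, c, t0, a):
    t_er, t_ed, pec = x
    return b * (pec + c) * np.log(1.0 + t_er / (t0 + a * t_ed))

# Synthetic stand-in for the characterization samples (effective retention time,
# effective dwell time, P/E cycle count) and the measured change in log(RBER).
rng = np.random.default_rng(0)
t_er = 10.0 ** rng.uniform(2.0, 6.0, 200)      # effective retention time (s)
t_ed = 10.0 ** rng.uniform(1.5, 4.2, 200)      # effective dwell time (s)
pec = rng.integers(1000, 10001, 200).astype(float)
assumed_truth = (2e-5, 500.0, 50.0, 0.2)       # assumed ground-truth constants
dY = srrm((t_er, t_ed, pec), *assumed_truth) + rng.normal(0.0, 0.01, 200)

# Non-linear least squares fit of the four SRRM constants.
popt, _ = curve_fit(srrm, (t_er, t_ed, pec), dY, p0=[1e-5, 100.0, 10.0, 1.0])
print(dict(zip(["b", "c", "t0", "a"], np.round(popt, 6))))
```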

Figure 7.9 illustrates how predictions from SRRM compare with our measured characterization data. Figure 7.9a compares the SRRM predictions and measured values of the threshold voltage shift for cells in state P1, P2, or P3. Figure 7.9b compares the SRRM predictions and measured values of the change in the log of RBER. Each cross ('x') in the figure represents a data point, where the x-coordinate of each data point is the value predicted by SRRM, and the y-coordinate of each data point is the value measured during our characterization. The dashed line shows the perfect fit (i.e., where the predicted and measured values are equal). We observe that for both the threshold voltage shift and the change in RBER, all of the data points are very close to the perfect fit line. Thus, SRRM accurately predicts both output parameters. Overall, the percentage root mean square error (%RMSE) for SRRM is 4.9%. As a comparison, if we were to predict these two output parameters using UDM [220], a previously-proposed model for retention loss in planar NAND flash memory, the average %RMSE of the predictions would be 17.8%, which is much less accurate than the predictions from our model. We conclude that SRRM is highly accurate for predicting the change in RBER and the threshold voltage shift at room temperature in 3D NAND flash memory.


Figure 7.9: SRRM prediction accuracy.

7.3 Improving 3D NAND Reliability

Our goal in this section is to improve flash reliability and lifetime by developing a new mechanism that makes use of our findings (Section 7.1) and our new model (Section 7.2). Our new mechanism is called HeatWatch.

7.3.1 Observations

We make three key observations from the following three experiments that lead to the design of HeatWatch. First, we observe that SSD write intensity and the SSD environmental temperature significantly impact flash lifetime. The write intensity of an SSD is measured as the number of full drive writes per day. Given a fixed SSD size, the write intensity is inversely proportional to the average dwell time, thus affecting flash lifetime (Section 7.1.2). This is because modern SSDs use wear-leveling techniques to keep all flash blocks in the SSD at a similar P/E cycle count [32, 33, 43, 83]. The environmental temperature affects the flash lifetime (Section 7.1.3), because it changes the temperature of the underlying NAND flash memory.

Figure 7.10 shows the flash lifetime under different write intensities and environmental temperatures, assuming that the vendor guarantees a three-month retention time for the data, which is typical for enterprise SSDs [22, 27, 32, 33, 250]. The SSD write intensity is shown on the x-axis in log scale. We plot the results by using separate curves for each temperature that we evaluate, which ranges from 0 °C to 70 °C.


Figure 7.10: Change in flash lifetime due to write intensity and environmental temperature (t_r = 3 months).

From the figure, we see that the flash lifetime initially decreases as SSD write intensity increases, but stops decreasing at around 5,000 P/E cycles. When the write intensity is low (< 10⁴ drive writes/day, which covers the range of write intensities of most contemporary workloads [22]), a higher temperature is desirable, as it improves both the effective dwell time and program variation and thus leads to a longer lifetime. In contrast, when the write intensity is high, a lower temperature is better due to an improved effective retention time. Note that these curves drastically shift (not shown) (1) with different assumptions about retention time, or (2) when the temperature is not constant. Thus, we find that no single temperature or temperature range is ideal.

Second, we observe that the choice of the read reference voltage (V_ref) significantly affects RBER and flash lifetime. The voltage at which the lowest RBER can be achieved is called the optimal read reference voltage (V_opt). V_opt changes based on conditions such as retention time and P/E cycle count. Adapting the optimal read reference voltage to these changing conditions significantly increases flash lifetime [23, 24, 27, 32, 33, 195, 260]. Based on our experiments under room temperature, Figure 7.11 shows how the RBER increases as the applied read reference voltage becomes further away from the optimal read reference voltage (which we refer to as the |V_ref − V_opt| distance). We find that the RBER increases exponentially as the |V_ref − V_opt| distance increases. We conclude, as prior works have [24, 27, 32, 33, 195], that it is important to accurately predict the optimal read reference voltage, as even a small |V_ref − V_opt| distance can greatly increase the error rate (and thus hurt the flash lifetime).

Figure 7.11: RBER vs. |V_ref − V_opt| distance.

Third, we observe that the optimal read reference voltage in 3D NAND flash memory changes over time, and can be accurately predicted as the value that falls in the middle of two neighboring threshold voltage distributions. Figure 7.12 shows the measured V_opt from our characterization (blue dots in the figure), and the value of V_opt calculated by using our URT model to predict the threshold voltage distributions of each state (orange curve), for read reference voltages V_b and V_c (see Section 3.4). The x-axis shows the retention time of the data in log scale. We see that after 4000 s of retention time, the measured optimal read reference voltages for V_b and V_c change by 8 and 11 voltage steps, respectively.⁷ Comparing the blue dots with the orange curves, we find that our URT-based V_opt prediction is always within 1 voltage step of the empirical optimal read reference voltage. We conclude that URT accurately predicts the optimal read reference voltage.


Figure 7.12: Measured and URT-predicted V_opt.

7.3.2 HeatWatch Mechanism

Motivated by our observations in Section 7.3.1, we propose HeatWatch, a mechanism that improves flash reliability and lifetime by predicting and applying the optimal read reference voltage using our URT model from Section 7.2. HeatWatch consists of (1) three tracking components, which monitor and efficiently record the SSD temperature, dwell time, retention time, and P/E cycle count; and (2) two prediction components, which compute the URT model using this tracked information to accurately predict the optimal read reference voltage for each read operation.

⁷ Our characterization shows that V_a does not change significantly based on retention time, so we do not plot it. We find that URT accurately predicts the lack of change in V_a.

Tracking Component 1: SSD Temperature. Modern SSDs already contain multiple temperature sensors to prevent overheating [178, 213]. HeatWatch efficiently logs the readings from these sensors, which the RDTM component of URT (see Section 7.2.2) uses to scale the dwell time and retention time. HeatWatch records the temperature every second, and precomputes the acceleration factors (AF_i) for every logarithmic time interval i. HeatWatch uses logarithmic intervals because the effects of dwell time and retention time are logarithmic with respect to their duration (see Section 7.1.2). HeatWatch uses RDTM to calculate AF_0. For all other intervals, AF_i is calculated as a piecewise integral, by summing up the two most recent values of AF_{i−1}, since interval i covers double the amount of time as interval i−1. Therefore, for each interval, HeatWatch stores the current and previous values of AF_i, in an acceleration factor log.
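For illustration, the following is a minimal sketch of one way such an acceleration factor log could be maintained. The class name, the length of the base period, the 20 °C room-temperature reference, and the exact update policy are our assumptions, not details taken from the implementation.

```python
import math

K_B = 8.617333262e-5   # Boltzmann constant (eV/K)
E_A = 1.04             # activation energy for 3D NAND (eV, Section 7.2.2)
T_ROOM_K = 293.15      # assumed room-temperature reference (20 degC)
NUM_LEVELS = 26        # logarithmic intervals, roughly 0.5 s up to ~1 year

def rdtm_af(temp_celsius):
    # RDTM (Equation 7.4): acceleration factor for the current temperature.
    t_k = temp_celsius + 273.15
    return math.exp((E_A / K_B) * (1.0 / t_k - 1.0 / T_ROOM_K))

class AccelerationFactorLog:
    """Hierarchical log: level i keeps the current and previous accumulated factors
    for a window that covers twice as much time as level i-1."""
    def __init__(self):
        self.levels = [[0.0, 0.0] for _ in range(NUM_LEVELS)]  # [current, previous]
        self.samples = 0

    def record(self, temp_celsius):
        # Called once per base period with the latest temperature reading.
        self.samples += 1
        self.levels[0][1] = self.levels[0][0]
        self.levels[0][0] = rdtm_af(temp_celsius)
        # Every 2^i base periods, level i becomes the piecewise sum of the two most
        # recent level i-1 values, since its interval covers double the time.
        for i in range(1, NUM_LEVELS):
            if self.samples % (1 << i) != 0:
                break
            self.levels[i][1] = self.levels[i][0]
            self.levels[i][0] = sum(self.levels[i - 1])

# Usage: feed one temperature sample per base period.
log = AccelerationFactorLog()
for temp in (35.0, 36.0, 38.0, 40.0):
    log.record(temp)
print(log.levels[0], log.levels[1], log.levels[2])
```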

Tracking Component 2: Dwell Time. The self-recovery effect is dependent on the average dwell time across multiple P/E cycles (Section 7.3.1). The average dwell time is determined by the workload write intensity. Thus, we use the SSD controller to (1) monitor the workload write intensity online, and (2) calculate the average dwell time for each flash block. HeatWatch does not need to track the variation in dwell time across different flash pages within the same block, as we find from our experimental characterization that the effect of page-to-page dwell time variation is negligible (Section 7.3.1). HeatWatch approximates the effective dwell time by taking the average unscaled dwell time across the last 20 P/E cycles, and scaling it using RDTM.

The SSD controller keeps a counter that tracks the amount of data written to the SSD, and logs the timestamps of the last 20 full drive writes. When a flash block is programmed during drive write n, the SSD controller calculates the average unscaled dwell time as the difference between the current time and the timestamp of drive write n − 20. Then, the SSD controller computes and stores the effective room temperature dwell time by scaling it using the aforementioned acceleration factor log.
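For illustration, the following is a minimal sketch of this dwell-time tracking; the class and method names and the use of wall-clock timestamps are illustrative assumptions. Following the text, the average unscaled dwell time is taken as the time elapsed since drive write n − 20, and it would then be scaled to room temperature with the acceleration factor log sketched above.

```python
import time
from collections import deque

class DwellTimeTracker:
    # Sketch of Tracking Component 2: count host writes and log a timestamp for each
    # completed full drive write, keeping only the most recent 20 timestamps.
    def __init__(self, drive_capacity_bytes, window=20):
        self.capacity = drive_capacity_bytes
        self.bytes_written = 0
        self.full_write_times = deque([time.time()], maxlen=window)

    def on_host_write(self, num_bytes):
        self.bytes_written += num_bytes
        while self.bytes_written >= self.capacity:
            self.bytes_written -= self.capacity
            self.full_write_times.append(time.time())

    def average_unscaled_dwell_time(self):
        # Time elapsed since drive write n - 20 (the oldest logged full drive write).
        return time.time() - self.full_write_times[0]

# Usage example with a hypothetical 1 TB drive.
tracker = DwellTimeTracker(drive_capacity_bytes=1 << 40)
tracker.on_host_write(512 << 30)   # half a drive write
print(tracker.average_unscaled_dwell_time())
```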

Tracking Component 3: P/E Cycles and Retention Time. The SSD controller already records the P/E cycle count of each block to use in wear-leveling algorithms [32, 33, 43, 83]. To track the retention time of each flash block, HeatWatch simply logs the timestamp when the block is selected for programming. Then, HeatWatch calculates the effective retention time for each read operation as the difference between the program time and read time, scaled by RDTM.

Prediction Component 1: Optimizing the Read Reference Voltage. The optimal read reference voltage between two states can be predicted accurately by averaging the means of the threshold voltage distributions for each state (Section 7.3.1). As HeatWatch knows the P/E cycle count, programming temperature, effective dwell time, and effective retention time, it can use the URT model from Section 7.2 to predict the means of the threshold voltage distributions for each state, and thus the optimal read reference voltage. For each read operation, the SSD controller simply gathers all the metadata for the flash block that is to be read, and then predicts and applies the optimal read reference voltage. The information gathering and prediction happen after the FTL translates the logical address of the read to a physical address (see [32, 33] for detail), since the information for the flash block is indexed using the physical address.
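For illustration, a minimal sketch of this prediction step is shown below; the state means are hypothetical values standing in for the distribution means that URT would predict for the block's current P/E cycle count, programming temperature, and effective dwell/retention times.

```python
def predict_v_opt(mean_lower_state, mean_higher_state):
    # Optimal read reference voltage between two neighboring states, predicted as the
    # midpoint of their (URT-predicted) threshold voltage distribution means.
    return 0.5 * (mean_lower_state + mean_higher_state)

# Hypothetical predicted means (in voltage steps) for the P1 and P2 state distributions.
print(predict_v_opt(135.0, 265.0))  # predicted Vb -> 200.0 voltage steps
```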

Prediction Component 2: Fine-Tuning URT Model Parameters Online. To accommodate chip-to-chip variation, URT learns its model parameters online. We initialize the URT model parameters using a set of parameters that have been learned offline, which the vendor can provide at manufacture time. Over time, URT fine-tunes its model parameters by (1) sampling a number of flash wordlines in the chip (10 in our evaluations), which are selected at random from blocks that span a range of different P/E cycle counts, effective dwell/retention times, and programming temperatures; (2) learning the optimal read reference voltages for the sampled flash wordlines online, using a technique similar to Retention Optimized Reading (ROR) [27]; and (3) using the sampled data to fit the fine-tuned URT model parameters for each chip, which can be done relatively easily in the firmware with little overhead [195]. The overall latency for online training is dominated by the latency to predict the optimal read reference voltage for each wordline. In the worst case, ROR performs 300 read operations per wordline, using a different voltage step for each read. For the 10 wordlines sampled by URT model tuning, assuming a read latency of 100 µs, the total worst-case latency of URT model tuning is 0.3 s. Note that this tuning procedure needs to be performed only infrequently (e.g., every 1,000 P/E cycles), and can be performed in the background (i.e., when the chip is idle), thus incurring negligible performance overhead.

Storage Overhead. HeatWatch needs to store three sets of information. (1) HeatWatch stores the acceleration factor for only logarithmic time intervals from 0.5 s to 1 year (26 intervals in total). HeatWatch stores the current and previous value of each acceleration factor, as described in Tracking SSD Temperature above. Assuming that each acceleration factor is stored as a 4 B floating-point number, the total log requires 208 B of storage. (2) HeatWatch stores the programming temperature, dwell time, and program time for each flash block. Assuming that each piece of information uses 4 B, for a 1 TB SSD with an 8 MB flash block size, HeatWatch uses 1.5 MB of storage in total to store this information. (3) HeatWatch needs a 32-bit counter, and must store the timestamps for the last 20 full disk writes (Section 7.1.4), which requires 84 B of storage. In total, the three sets of information require less than 1.6 MB of storage. To minimize the performance overhead of accessing this data, HeatWatch buffers the data in the on-board DRAM in the SSD controller [32, 33]. The storage overhead is very small, as a 1 TB SSD typically contains 1 GB of DRAM for caching [32, 33].
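For reference, the storage-overhead arithmetic above can be checked with a few lines; the sizes are the ones assumed in the text.

```python
# Quick check of the storage-overhead arithmetic (sizes as assumed in the text).
af_log_bytes = 26 * 2 * 4                              # 26 intervals x 2 values x 4 B = 208 B

ssd_bytes, block_bytes = 1 << 40, 8 << 20              # 1 TB SSD, 8 MB flash blocks
per_block_bytes = (ssd_bytes // block_bytes) * 3 * 4   # 3 fields x 4 B per block = 1.5 MiB

write_log_bytes = 4 + 20 * 4                           # 32-bit counter + 20 timestamps = 84 B

total = af_log_bytes + per_block_bytes + write_log_bytes
print(af_log_bytes, per_block_bytes / (1 << 20), write_log_bytes, total / (1 << 20))
```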

Latency Overhead. HeatWatch performs two operations that contribute to its latency. (1) Every second, HeatWatch updates the acceleration factor log with the latest temperature reading. This update can be done in the background, and, thus, its performance overhead is negligible. (2) HeatWatch computes the URT model during each read operation, which involves performing only 16 arithmetic computations in the SSD controller (Section 7.2). The model computation latency is negligible compared to the flash read latency (>25 µs [89]).

7.3.3 Evaluation

To evaluate HeatWatch, we examine the raw bit error rate (RBER) and lifetime of four different configurations:

• Fixed V_ref, which always uses the default read reference voltage to read the data.

• Retention-Only, which predicts the optimal read reference voltage based on a model that considers only P/E cycle count and retention time [23, 24, 27, 195, 252, 260]. Note that this model always assumes a fixed retention loss speed, regardless of the dwell time and temperature.

• HeatWatch, our proposed mechanism to accurately predict the optimal read reference voltage by tracking dwell time and temperature and using our URT model.

• Oracle, which always ideally uses the measured optimal read reference voltage, and does not incur any performance overhead. Note that this is not implementable.

We evaluate these four configurations using 28 commonly-used real storage traces [239], which have varying write intensities. Each trace represents seven days of workload data, and contains timestamps we can use to calculate the dwell time and retention time of each access. We simulate temperature variation over the course of a day as the superposition of a sinusoidal function and some Gaussian noise. The sinusoidal model has a mean of 35 °C, an amplitude of 15 °C, and a 1-day period, representing how the temperature changes during a daily cycle. The Gaussian noise model that we use has a standard deviation of 3 °C.

Figure 7.13: RBER vs. P/E cycle count.

Figure 7.13 shows how the RBER increases with P/E cycle count for our four evaluated configurations, using a workload that appends all 28 disk traces together to mimic an SSD that runs multiple workloads back-to-back. The dotted line shows an error correction capability (see Section 3.1) of 2 · 10⁻³ errors per bit [32, 33]. We determine the lifetime for each configuration using the point at which the RBER intersects the error correction capability. We use Fixed V_ref, which has the highest RBER, as our baseline. From the figure, we see that Retention-Only reduces the RBER by 83.1%, on average across all P/E cycles, compared to the baseline. HeatWatch reduces the RBER by 93.5% compared to the baseline. This is very close to the average RBER improvement under Oracle (93.9%). HeatWatch significantly improves lifetime due to its RBER improvement. The lifetime with HeatWatch is 21,065 P/E cycles, which is 3.19× and 1.29× the lifetime with Fixed V_ref and Retention-Only, respectively. HeatWatch comes within only 200 P/E cycles of the Oracle lifetime.
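For reference, the following is a minimal sketch of the simulated temperature trace described above (a daily sinusoid plus Gaussian noise); the one-second sampling step and the random seed are our assumptions.

```python
import numpy as np

seconds_per_day = 24 * 3600
t = np.arange(7 * seconds_per_day)   # seven days of workload, one sample per second
rng = np.random.default_rng(42)

# Daily sinusoid (mean 35 degC, amplitude 15 degC, 1-day period) plus Gaussian noise
# with a standard deviation of 3 degC.
temperature_c = (35.0
                 + 15.0 * np.sin(2.0 * np.pi * t / seconds_per_day)
                 + rng.normal(0.0, 3.0, size=t.size))
print(temperature_c.mean(), temperature_c.min(), temperature_c.max())
```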

We repeat the same experiment for each workload individually, and determine the lifetime for each workload under the four configurations, as shown in Figure 7.14. On average across all of our workloads, the lifetime under Retention-Only is 3.11× the lifetime of the Fixed V_ref baseline. HeatWatch improves the lifetime further, with a lifetime that is 3.85× the baseline lifetime. Again, this is very close to the lifetime improvement of Oracle (3.89×).

Figure 7.14: P/E cycle lifetime for each workload.

We conclude that by incorporating dwell time and temperature information to predict the optimal read reference voltage, HeatWatch improves the lifetime of 3D NAND flash memory devices over a state-of-the-art mechanism [252], and approaches the lifetime of an ideal mechanism that has perfect knowledge of the optimal read reference voltage.

7.4 Related Work

To our knowledge, this dissertation is the first to (1) experimentally characterize and accurately model the self-recovery and temperature effects in 3D NAND flash memory; and (2) devise a mechanism that improves 3D NAND flash memory lifetime by comprehensively taking into account retention time, wearout, self-recovery, and temperature. We discuss closely related works below.

Data Retention in NAND Flash Memory. Many prior works focus on flash memory retention loss and retention errors, and show that retention loss is the most dominant source of errors in modern NAND flash memory [22, 27, 32, 33, 57, 219, 250, 252]. These works do not consider the effects of self-recovery and temperature on retention loss. Our work investigates these effects through an extensive characterization of state-of-the-art 3D NAND flash memory chips.

3D NAND Flash Memory Characterization. Recent works study the error characteristics of 3D NAND flash memory, and identify differences between 3D and planar NAND flash memory due to memory cell design and architectural changes [32, 33, 55, 221, 257]. None of these works provide a detailed characterization of the impact of self-recovery, retention, P/E cycle count, and temperature on real 3D NAND flash memory.

Retention Loss Models. Our URT model is inspired by and improves upon the Unified Detrapping Metric (UDM) model [220]. There are three reasons why prior models developed for planar NAND flash memory, such as UDM, are insufficient for 3D NAND flash memory. First, 3D charge trap cells are more resilient to P/E cycling than the floating-gate cells used by planar NAND flash memory [55, 221, 257]. Thus, the PEC component in our model (Equation 7.6) is different from the equivalent component in UDM. Second, 3D NAND flash memory has a different activation energy than planar NAND flash memory, as we experimentally show in Section 7.2.2. Third, 3D NAND flash memory reliability is affected by the programming temperature, as we show in Section 7.2.1. Because UDM does not accurately capture these changes, its error rate for 3D NAND flash memory is 3.6× greater than the error rate for the SRRM component of our new URT model, as we show in Section 7.2.3.

Improving Flash Reliability. Many prior works propose mechanisms to improve flash lifetime and reduce raw bit errors (see [32, 33] for a detailed survey). For example, flash refresh techniques limit the number of retention errors to achieve higher P/E cycle lifetime [22, 25, 194, 250]. Prior work also adjusts the read reference voltage according to P/E cycle count and retention time to reduce the RBER [23, 24, 27, 195, 252, 260]. Different from prior work, we develop a new mechanism that tracks workload write intensity and SSD temperature online and adjusts the read reference voltage accordingly to improve flash lifetime. We compare our mechanism to prior mechanisms that are agnostic to these factors and show that it can significantly reduce RBER and improve flash lifetime.

7.5 Limitations

Neither our URT model nor the HeatWatch mechanism considers other error sources in 3D NAND flash memory, such as retention interference, process variation, and read disturb, but they can be augmented in the future to consider all possible error types to further improve 3D NAND reliability.

The activation energy may change under different temperatures and P/E cycle counts. To simplify our URT model, we use a single static value for the activation energy, which is good enough for our use case. When the temperature variation is larger, we may need a dynamic model that estimates the activation energy to more accurately predict the temperature effect.

We do not consider temperature variation across layers, which can happen if the NAND flash chip heats up from inside during program/erase operations. We expect future work to improve our model and technique to consider such temperature variation.

It is possible that certain FTL algorithms, such as WARM, may write to certain flash blocks more frequently, which makes the average dwell time of these blocks significantly shorter than the drive-wide average. This limits the accuracy of the HeatWatch dwell time tracking component. For these FTL algorithms, we may need to track dwell time differently for the frequently-written blocks.

7.6 Conclusion

We perform the first detailed experimental characterization of the impact of self-recovery and temperature on the reliability of 3D NAND flash memory. We find that due to significant changes in the memory design and manufacturing process, prior findings and models for planar NAND flash memory are not accurate for 3D NAND flash memory. We use key findings from our characterization to develop URT, a unified and accurate cell threshold voltage and raw bit error rate model that takes into account the combined effects of self-recovery, temperature, retention loss, and wearout. We develop a new mechanism, HeatWatch, that uses URT to dynamically adapt the read reference voltage to the data retention time, dwell time, SSD temperature, and wearout. We show that HeatWatch greatly reduces the raw bit error rate and improves flash lifetime. We conclude that the effects of self-recovery and temperature in 3D NAND flash memory can be accurately modeled and successfully used to improve flash reliability. We hope that our data, model, and new mechanism inspire others to develop other new mechanisms that take advantage of the self-recovery and temperature effects in 3D NAND flash memory.

Chapter 8

System-Level Implications and Lessons Learned

This dissertation proposes several mechanisms that improve flash reliability by mitigating raw bit errors that occur on flash devices. In this chapter, we discuss the system-level implications of these mechanisms and summarize the lessons learned.

8.1 System-Level Implications

Throughout this dissertation, we have focused on evaluating the traditional metric of flash reliability, P/E cycle lifetime. This P/E cycle metric assumes that the same workload runs on each individual SSD (i.e., the workload cannot be redistributed across multiple SSDs). This metric also assumes that the NAND flash memory always deploys a fixed amount of ECC that has a fixed RBER limit (illustrated in Figures 6.34 and 7.13 as the ECC limit). However, these assumptions may change in a real system. For example, in a distributed storage system, certain SSDs may contain more write-hot data than others due to uneven workload distribution across server machines [110], and the flash controller may use a simpler ECC with lower storage overhead. In this section, we discuss the system-level implications and the reliability impact of our mechanisms under different assumptions.

8.1.1 Impact on Tolerable Write Frequency

As we have mentioned in Sections 5.5.3 and 6.5.5, server machines in enterprises and data centers, along with the flash memory devices used in those machines, are replaced after a predefined period, typically several years. Since a flash memory device can tolerate a limited total number of writes before having to be replaced, it can only tolerate a certain number of writes per day on average (i.e., tolerable write frequency). To ensure that each flash memory device does not exceed its P/E cycle lifetime prematurely before being replaced, the system software throttles the write frequency of each flash memory device to only a few Drive Writes Per Day (DWPD) [176]. Instead of increasing the flash device lifetime, we can use our mechanisms to improve the tolerable write frequency. When the P/E cycle lifetime improves after deploying the mechanisms proposed in this dissertation, the tolerable write frequency of each flash memory device increases proportionally, assuming the ECC limit remains the same. Thus, the increase in tolerable write frequency is the same as the increase in lifetime: WARM improves tolerable write frequency (or DWPD) by 3.24×; two applications of our online flash channel model improve DWPD by 48.9% and 69.9%; combining LaVAR, LI-RAID, and ReMAR improves 3D NAND DWPD by 85.0% (Section 6.5.5); and HeatWatch improves DWPD by 3.85×.

8.1.2 Impact on ECC Cost

As we have shown in Figure 1.1b and in Section 6.5.5, sustaining a high ECC limit incurs a high storage overhead for the ECC bits (i.e., ECC redundancy). Using BCH code [14] as an example, assuming that the codeword length is fixed (i.e., 8 kB), the ECC limit is linearly correlated with ECC redundancy [72]. Instead of increasing the lifetime, we can use our mechanisms to reduce ECC redundancy, since we need to tolerate fewer raw bit errors by the end of the flash lifetime. To calculate the reduction in the required ECC cost, we use the same method and assumptions as we use in Section 6.5.5. First, we obtain the P/E cycle lifetime of the baseline SSD using state-of-the-art error correction mechanisms that can tolerate up to a 3 · 10⁻³ raw bit error rate. This requires a 12.8% ECC redundancy using a BCH code [72]. Second, we obtain the worst raw bit error rate of our mechanism at the end of the same P/E cycle lifetime of the baseline SSD. Third, we calculate the ECC redundancy required by our mechanism to achieve the same data reliability in terms of the error correction failure rate (P_ECFR in Equation 2.1). The required ECC redundancy is obtained by varying the number of bits within a codeword correctable by the ECC (t in Equation 2.1) until P_ECFR reaches the 10⁻¹⁵ required by the JEDEC standard [121]. Our online flash channel model reduces ECC redundancy by 37.5%; combining LaVAR, LI-RAID, and ReMAR reduces ECC redundancy by 78.9% (Section 6.5.5); and HeatWatch reduces ECC redundancy by 78.2%. ECC redundancy savings when using LDPC code should be similar, but are much more difficult to evaluate because LDPC codes do not have a fixed ECC limit [32, 33].
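For illustration, the following is a minimal sketch of this kind of calculation, assuming the standard binomial codeword-failure formula (our reading of the general form behind Equation 2.1, which treats raw bit errors as independent) and counting only payload bits in the codeword; the RBER values are placeholders, not the results reported above.

```python
from scipy.stats import binom

def p_ecfr(rber, codeword_bits, t):
    # Probability that a codeword contains more than t raw bit errors.
    return binom.sf(t, codeword_bits, rber)

def required_t(rber, codeword_bits, target=1e-15):
    # Smallest error correction capability t that pushes the failure rate below target.
    t = 0
    while p_ecfr(rber, codeword_bits, t) > target:
        t += 1
    return t

n = 8 * 1024 * 8   # 8 kB codeword (payload bits only, for simplicity)
print(required_t(3e-3, n))   # baseline end-of-lifetime RBER (placeholder)
print(required_t(1e-3, n))   # improved end-of-lifetime RBER (placeholder)
```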

8.1.3 Impact on Performance and Flash Management Policies

We have already shown that the performance overhead of each proposed mechanism is small. The amortized cost of training and predicting using an online model is negligible (i.e., <1%) compared to the flash read latency. At the same time, our mechanisms reduce the raw bit error rate at any P/E cycle count, which leads to reductions in ECC decoding latency and in read-retry counts, and thus reduces the overall flash read latency, compared to a baseline SSD with the same P/E cycle count. The exact reduction in flash read latency depends on several design choices for the ECC code: (1) a shorter codeword length or fewer ECC bits simplify the decoder logic, thus reducing the ECC decoding latency; however, this reduces the error correction strength of the ECC, leading to more read-retries, as the weaker ECC is more likely to fail; (2) a stronger ECC code may increase ECC decoding latency, but reduces read-retry counts. A similar tradeoff exists when designing a soft-decoding LDPC code to balance the ECC latency and the number of soft-decoding sensing levels.

WARM affects existing flash management policies by dividing flash blocks into write-hot and write-cold block pools, which increases the average response time by 1.3% across all workloads. All of our online model-based techniques, including online flash channel modeling, LaVAR, ReMAR, and HeatWatch, do not change the data layout or mapping, and thus do not increase FTL or GC overhead. LI-RAID only changes the data layout within a flash block, and only reduces the storage capacity by less than 0.4%, and thus has negligible overhead for the FTL and GC.

8.2 Lessons Learned

This dissertation provides several new analyses and techniques to improve flash reliability efficiently. While performing these analyses and developing these new techniques, we learned two important lessons that apply to future research in this direction. In this section, we summarize these two lessons.

8.2.1 Combining Large-Scale and Small-Scale Characterization Studies

This dissertation focuses on small-scale studies that perform extensive characterization of only a few NAND flash memory devices. While our observations should apply to any device that uses similar manufacturing technologies (Sections 5.2, 6.1, and 7.1.5), and our online modeling techniques can adapt to any variation across chips (Sections 5.3 and 7.3), we have not been able to verify this behavior across a large number of devices. Ideally, we would like to combine an in-depth small-scale study with a lightweight large-scale study that does not require characterizing the full threshold voltage distribution; the two complement each other to provide a stronger result. Our small-scale study provides a deeper understanding of the characteristics of different error types. A large-scale study can show the effectiveness of our proposed mechanisms in real deployment. We are unable to do this now due to the limited number of chips available to us.

8.2.2 Improve Systems Reliability Rather Than Device Reliability Alone

This dissertation focuses on reducing raw bit errors within the flash chips. These errors are usually invisible to application and system designers because a vast majority of them are already contained within the SSD using strong but expensive ECCs. While containing all raw bit errors within the SSD provides strong device reliability and reduces the complexity of system design, it leads to a suboptimal design that uses less cost-efficient techniques to improve reliability. For example, the ECC currently designed for the least-reliable flash chip may consume more storage overhead than needed on an average flash chip; and a very low uncorrectable error rate may not be necessary in a distributed storage system because the data is already replicated in other servers to tolerate more catastrophic server failures. An ideal design should combine device-level techniques that mitigate raw bit errors cost-effectively and provide reasonable device reliability with system-level techniques that utilize many devices to tolerate device failures. This design could lead to significant cost savings while achieving higher overall system reliability.

Chapter 9

Conclusions

In this dissertation, we present a multitude of low-cost architectural techniques to improve NAND flash memory reliability. Following our thesis statement, all of our proposed techniques (1) take advantage of device-level error characteristics or workload characteristics, (2) can be implemented with low overhead in the flash controller as a part of the firmware, and (3) improve NAND flash reliability at low cost.

First, we propose a new technique that exploits workload write-hotness and device data retention characteristics to improve flash lifetime. We observe that pages with different degrees of write-hotness have widely-ranging retention time requirements. WARM aims to eliminate redundant refreshes for write-hot pages with minimal storage and performance overhead. The first key idea of WARM is to effectively partition pages stored in flash into two groups based on the write frequency of the pages. The second key idea of WARM is to apply the most suitable flash management policies to the two different groups of flash pages/blocks. Our evaluations show that WARM eliminates a significant amount of unnecessary refreshes and improves flash lifetime significantly, and that WARM can also be combined with refresh techniques to provide larger benefits in flash lifetime.

Second, we propose a new framework for online flash channel modeling, which learns and exploits an online threshold voltage distribution model in the flash controller to improve flash reliability. We observe from our characterization of real, state-of-the-art planar NAND flash memory chips that (1) the threshold voltage distribution can be approximated using a modified version of the Student's t-distribution, and that (2) the amount by which the distribution shifts as the P/E cycle count increases is governed by the power law. Using our characterization results, we build an accurate and easy-to-compute model of the threshold voltage distribution of modern MLC NAND flash memory. We demonstrate various applications of our model in a flash controller. Our evaluations show that these applications improve flash lifetime significantly.

Third, this dissertation provides the first comprehensive experimental characterization and modeling of 3D NAND device characteristics using real, state-of-the-art MLC 3D NAND flash memory chips, and proposes four mechanisms to exploit these device characteristics in the flash controller for improving flash reliability. We find several differences in the device characteristics between 3D and planar NAND devices. Three of the key differences we find are inherent to the internal architecture of 3D NAND flash memory: layer-to-layer process variation, early retention loss, and retention interference. We develop models and techniques to exploit these 3D NAND flash device characteristics. Our evaluations show that, when combined, our techniques improve flash lifetime significantly over state-of-the-art error mitigation techniques developed for planar NAND flash memory.

Fourth, this dissertation is the first to experimentally characterize and model the self-recovery and temperature effect in 3D NAND flash memory using real 3D NAND devices, and to propose a new technique called HeatWatch, which exploits the awareness of the workload's write-intensity and the SSD temperature in the flash controller to improve flash reliability. Through the characterization, we observe that (1) a longer dwell time slows down flash retention loss speed, and (2) high temperature accelerates flash retention loss but improves program accuracy. We develop URT, a unified model for the combined effects of self-recovery, temperature, retention loss, and wearout. Our evaluations show that URT accurately predicts 3D NAND device characteristics. HeatWatch efficiently tracks the dwell time of the workload and the temperature of the SSD, and uses URT to dynamically adjust the read reference voltage to the workload write-intensity and SSD temperature. Our evaluations on 28 real workload traces show that, by accurately predicting and applying the optimal read reference voltage, HeatWatch improves flash lifetime by 3.85×, over a baseline that uses a fixed read reference voltage.

Finally, this dissertation provides an analysis of the system-level implications of our proposed techniques. We show that, aside from improving flash lifetime, our techniques can also improve the tolerable write frequency and reduce ECC cost when a longer flash lifetime is not needed. In addition, we show that our techniques may reduce read latency and add very little overhead to the existing flash management policies.

Chapter 10

Future Research Directions

This dissertation has shown several example approaches that significantly improve NAND flash reliability at a low cost by making the flash controller device- and workload-aware. We believe that it is promising to continue exploring along this direction, because (1) NAND flash memory continues to grow in popularity as it becomes a cheaper and more accessible technology, and (2) flash reliability issues will grow as flash vendors more aggressively increase the density of NAND flash memory to satisfy the demand. In this chapter, we describe potential research directions to further improve the reliability and efficiency of NAND flash memory in a system.

10.1 Temperature Effects on Read Operations

In Chapter 7, we have shown that SSD temperature significantly affects the retention loss speed and program variation of NAND flash memory. Recent work shows that temperature could also affect read operations significantly (i.e., read temperature variation) [184], and proposes circuit-level techniques to compensate for this type of temperature variation [54, 314]. In reality, read temperatures can change significantly in a very short time, especially on mobile devices [27], leading to a significant increase in raw bit error rate under extreme temperatures.

To the best of our knowledge, there is no characterization data or model available for the read temperature effect in the open literature. Thus, we believe that it is valuable to study this phenomenon and propose flash controller techniques to tolerate the read temperature variation. We believe that the flash controller can learn an accurate online model, like the ones we use in Chapters 5, 6, and 7, to compensate for the read temperature variation better than existing circuit-level techniques. The key challenge in understanding the read temperature effect is to design a rigorous testing procedure that eliminates the potential noise caused by the unknown number of retention errors introduced while the read temperature slowly converges to the target level. It is difficult to accurately quantify the number of retention errors during this time because the temperature, and hence the retention loss speed, changes at a variable rate. Hence, one potential way to rigorously characterize the read temperature effect is to perform the characterization under low temperature, such that the number of newly introduced retention errors is minimal. Another potential way is to constantly monitor the temperature and use integration to compute the retention loss scaled by the varying temperature, like HeatWatch does when precomputing the temperature acceleration factor (Section 7.3), and account for the retention errors in the characterization.

To conclude, we believe it is interesting to (1) design a rigorous testing procedure to characterize read temperature variation, (2) understand the cause of read temperature variation, e.g., it might be caused by the temperature sensitivity of the sense amplifier, (3) develop new models for read temperature variation, and (4) design new flash controller techniques to mitigate and compensate for read temperature variation.

10.2 SSD Errors At Scale

Today's data center servers already use SSDs as a high-performance alternative to hard disk drives to store frequently-accessed persistent data. As the storage density of NAND-flash-based SSDs continues to increase, and the price of SSDs continues to decrease, data center servers are increasingly likely to deploy SSDs instead of hard disk drives as the primary storage medium. This creates both opportunities and challenges for improving SSD reliability. In this section, we discuss these opportunities and challenges, and describe several potential research directions for improving SSD reliability at scale.

10.2.1 3D NAND Errors In the Field

As we have discussed in Section 3.1.7, recent works have analyzed the reliability of hundreds of thousands of SSDs in production data centers [211, 240, 241, 285]. However, none of these works include any analysis of SSDs using 3D NAND, because 3D NAND technology has only recently been introduced and has not accumulated enough device hours in the field to perform long-term studies. As we have discussed in Section 3.4, 3D NAND devices have a different flash cell design, a different flash chip organization, and use a larger manufacturing process technology than planar NAND devices. Consequently, as we have shown in Section 6.2 using a controlled error study, 3D NAND devices have different error characteristics from planar NAND. Hence, we expect that 3D NAND devices in the field will also demonstrate different reliability characteristics. In addition, the density of future 3D NAND devices is increased using different methods than for planar NAND. For planar NAND, manufacturers increased density in each product generation through aggressive process technology shrinking. This, unfortunately, decreases the flash cell size as well as the distance between cells, thus significantly decreasing flash reliability. For 3D NAND, in the foreseeable future, manufacturers can increase storage density by increasing the number of stacked layers instead of using aggressive process technology scaling. Hence, future generations of 3D NAND devices are more likely to have larger layer-to-layer process variation. Thus, a large-scale field study could allow us to observe new challenges in scaling 3D NAND devices, as data centers deploy multiple generations of 3D NAND chips over time.

To conclude, we believe it is interesting to (1) investigate 3D NAND error characteristics in the field, and compare these characteristics to those of planar NAND, (2) compare 3D NAND error characteristics across multiple generations to identify new challenges in scaling flash density, such as layer-to-layer process variation, (3) investigate SSD failures due to components other than the flash chips, such as the controller and the command or data bus, (4) investigate 3D NAND chip-to-chip process variation by comparing error characteristics across many devices, and (5) investigate and compare the effectiveness of various state-of-the-art error mitigation and error recovery techniques in the field, such as WARM (Chapter 4), online flash channel modeling (Chapter 5), HeatWatch (Chapter 7), RFR [27], and RDR [35]. We believe that the insights derived from these investigations can enable and inspire new techniques, especially new system-level techniques [110], to tolerate SSD errors more efficiently.
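The following sketch illustrates, under an assumed telemetry record format, how such a field study could quantify layer-to-layer and chip-to-chip variation by grouping raw bit error rate (RBER) samples and comparing their spread; it is an illustration of the analysis, not a tool from this dissertation.

from collections import defaultdict
from statistics import mean, pstdev

# Each record: (ssd_id, chip_id, layer_index, rber) -- an assumed telemetry schema.
records = [
    ("ssd0", 0, 0, 2.1e-4), ("ssd0", 0, 47, 4.0e-4),
    ("ssd1", 3, 0, 1.9e-4), ("ssd1", 3, 47, 4.4e-4),
    ("ssd2", 1, 0, 2.3e-4), ("ssd2", 1, 47, 3.8e-4),
]

def variation_by(records, key_index):
    """Return {key: (mean RBER, coefficient of variation)} grouped by the chosen field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec[3])
    return {k: (mean(v), pstdev(v) / mean(v) if mean(v) > 0 else 0.0)
            for k, v in groups.items()}

print("Per-layer RBER:", variation_by(records, 2))   # layer-to-layer variation
print("Per-chip RBER:", variation_by(records, 1))    # chip-to-chip variation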

10.2.2 Predicting and Preventing SSD Failures

SSD failures caused by uncorrectable errors on a single SSD are designed to be very infrequent (e.g., an uncorrectable error rate typically less than 10^-15), due to the use of strong ECC within the controller. In a large-scale data center, however, the occurrence of these infrequent uncorrectable errors is amplified due to the use of hundreds of thousands of SSDs at the same time. As a result, uncorrectable errors become a relatively frequent event in a data center, which can result in frequent machine failures and data loss. SSD failures caused by component failures are also infrequent on a single SSD, because each SSD contains only a few components in total [32, 33, 89, 219]. In a data center, however, SSD component failures are frequent because the data center consists of thousands of SSD controllers, millions of flash chips, and millions of data buses [213, 241, 285]. Such component failures can be catastrophic, because they can lead to many uncorrectable errors at the same time, and to immediate loss of large chunks of data. Fortunately, these SSD failures are typically preceded by early warning signs, such as correctable errors, and have strong spatial locality [213, 241, 285]. Thus, we believe it is possible to prevent a majority of these failures by predicting them in advance. We believe that by investigating SSD failures in the field (Section 10.2.1), we can identify predictable patterns of these SSD failures. Based on these patterns, we can (1) develop models of SSD failures, including models of uncorrectable errors and models of SSD component failures, (2) predict SSD failures before they happen, and (3) design mechanisms to prevent a majority of SSD failures from affecting system and data reliability, e.g., by taking SSDs offline before a failure happens, or by duplicating data to more reliable SSDs in advance.
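The sketch below illustrates this idea with a deliberately simple, hypothetical rule: flag an SSD for proactive action when its correctable-error count grows far beyond its baseline and the errors cluster in a small fraction of blocks, the two warning signs reported by the field studies cited above. The thresholds, record fields, and actions are assumptions for illustration, not the mechanisms this dissertation proposes.

from dataclasses import dataclass

@dataclass
class SSDHealth:
    ssd_id: str
    correctable_errors_today: int
    correctable_errors_baseline: float      # long-run daily average
    blocks_with_errors: int
    total_blocks_with_any_activity: int

def at_risk(h: SSDHealth, growth_threshold=10.0, locality_threshold=0.05) -> bool:
    """Predict imminent failure from error growth and spatial locality of errors."""
    growth = h.correctable_errors_today / max(h.correctable_errors_baseline, 1.0)
    locality = h.blocks_with_errors / max(h.total_blocks_with_any_activity, 1)
    # Many more errors than usual, concentrated in a small fraction of blocks.
    return growth >= growth_threshold and locality <= locality_threshold

def proactive_action(h: SSDHealth):
    if at_risk(h):
        # Placeholder for the mitigations discussed above: migrate data to a healthier
        # SSD and take this drive out of the allocation pool before it fails.
        print(f"{h.ssd_id}: schedule data migration and remove from allocation pool")

proactive_action(SSDHealth("ssd-42", 5200, 300.0, 12, 4096))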

10.2.3 Tolerating Reliability Variation Across SSDs

Improving and managing SSD reliability at a larger scale introduces new challenges, as reliability can vary significantly across different SSDs within a data center. There are three major reasons for the variation in SSD reliability. First, the SSDs within a data center might have been deployed at different times. Thus, different batches of SSDs may use different generations of flash technology, which have very different reliabilities. They may also have different deployment dates, leading to different amounts of wearout on each SSD. Second, even for SSDs within the same batch, SSD reliability can vary due to process variation across different flash chips. For example, some SSDs may consist of more reliable flash chips, which have longer endurance and can store data for a longer retention time. Third, even in an ideal world without any process variation, each SSD within a data center may run a different workload throughout its lifetime. For example, write intensity can vary by as much as 5,500x across different workloads [22, 110], leading to drastically different P/E cycle lifetimes for different SSDs. This reliability variation can cause some SSDs to fail sooner than their designed lifetime, requiring more frequent SSD replacement in a data center.

We believe that by tolerating reliability variation across SSDs in a data center, the overall storage reliability can be improved, and the cost of managing SSD reliability can be reduced. There are many potential ways to tolerate this variation. For example, (1) we can apply RAID across different SSDs and tolerate SSD failures using erasure codes [297]. To minimize the RAID failure rate, less reliable SSDs should be evenly distributed across different RAID groups instead of being arbitrarily grouped together. As another example, (2) we can either apply global wear-leveling across all SSDs within a data center to mitigate such variation [110], or move data (e.g., write-hot vs. write-cold data) to its most suitable SSD (e.g., move write-hot data to less reliable SSDs) [194, 199]. We believe it is interesting to compare the benefits of these two solutions for different use cases.
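As a simple illustration of option (1), the sketch below spreads SSDs of different estimated reliability evenly across RAID groups by sorting them on a reliability score and dealing them out round-robin. The scoring and the round-robin heuristic are assumptions of this sketch rather than a mechanism from this dissertation.

def balanced_raid_groups(ssds, num_groups):
    """ssds: list of (ssd_id, reliability_score); higher score = more reliable.

    Returns num_groups lists such that each group receives a mix of high- and
    low-reliability drives instead of clustering the less reliable ones together.
    """
    ranked = sorted(ssds, key=lambda s: s[1], reverse=True)
    groups = [[] for _ in range(num_groups)]
    for i, ssd in enumerate(ranked):
        groups[i % num_groups].append(ssd)
    return groups

ssds = [("a", 0.99), ("b", 0.80), ("c", 0.95), ("d", 0.70), ("e", 0.90), ("f", 0.85)]
for group in balanced_raid_groups(ssds, 2):
    print([ssd_id for ssd_id, _ in group])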

10.3 Enabling Cold Storage in SSDs

SSDs are already popular for storing frequently-accessed data in data centers. The key challenges for the larger-scale deployment of SSDs are (1) higher cost compared to hard disk drives, and (2) limited retention time guarantee. The advent of 3D NAND technology increases the density potential of NAND flash memory, but raises manufacturing costs compared to planar NAND due to the more complex manufacturing process required. Furthermore, as we have shown in Chapter 6, due to early retention loss, 3D NAND also exhibits more retention errors than planar NAND. Thus, we believe it is important to reduce SSD cost by increasing both storage density and SSD retention time. In this section, we discuss the problems and potential directions associated with each potential solution.

10.3.1 Identifying Suitable Data for SSD Cold Storage

Similar to our approach for WARM (Chapter 4), we would like to identify data suitable for cold storage in SSDs and manage it in a more efficient way. For flash-based SSDs, write operations not only consume P/E cycle lifetime but also reduce the dwell time of the SSD, which accelerates retention loss (Section 7.1). Thus, write-cold data is more suitable for SSD cold storage. However, infrequently-accessed data (i.e., read-cold and write-cold data) should be stored in the cheapest possible storage, such as tape. Thus, we believe write-cold, read-hot data can benefit the most from the fast random access performance of SSD cold storage.

We believe it is interesting to investigate the best technique to identify write-cold, read-hot data, such as: (1) identifying data used by read-only applications, which frequently access large amounts of static data that do not change, (2) identifying write-cold data within each SSD using multiple queues like WARM, Bloom filters [186], log-structured merge trees [249], or any other write-cold data identification technique, or (3) using programmer annotations [133]. After we identify such data, the SSD controller or the storage manager can decide to aggregate the write-cold, read-hot data and manage it in a more efficient way.
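The sketch below illustrates option (2) with simple per-extent read/write counters over an observation window; the thresholds and window-based reset are illustrative assumptions, and WARM-style multiple queues or Bloom filters could replace the counters to bound metadata overhead.

from collections import Counter

class HotColdClassifier:
    def __init__(self, write_cold_max=2, read_hot_min=100):
        self.reads = Counter()
        self.writes = Counter()
        self.write_cold_max = write_cold_max   # at most this many writes per window
        self.read_hot_min = read_hot_min       # at least this many reads per window

    def record_read(self, extent):
        self.reads[extent] += 1

    def record_write(self, extent):
        self.writes[extent] += 1

    def cold_storage_candidates(self):
        """Extents suitable for SSD cold storage: rarely written, frequently read."""
        return [e for e in self.reads
                if self.reads[e] >= self.read_hot_min and self.writes[e] <= self.write_cold_max]

    def end_window(self):
        """Reset counters at the end of each observation window."""
        self.reads.clear()
        self.writes.clear()

clf = HotColdClassifier()
for _ in range(150):
    clf.record_read("extent-7")
clf.record_write("extent-7")
print(clf.cold_storage_candidates())   # ['extent-7']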

10.3.2 Increasing SSD Retention Time

To use SSDs for cold storage, the SSDs must provide a long enough retention time. Otherwise, as we have shown in WARM, frequent refreshes consume the majority of the P/E cycle lifetime and can reduce SSD performance. SSD cold storage has several properties that could benefit SSD reliability and extend the retention time. First, write-cold data is seldom updated, and thus consumes fewer P/E cycles. As a result, SSDs used for cold data have a longer lifetime. Second, as we have shown in Section 7.1, thanks to flash memory self-recovery, a longer dwell time slows down flash retention loss. Thus, if an SSD contains only write-cold data, the dwell time of all flash blocks within the SSD will be long, increasing the retention time of the SSD.

We believe it is interesting to investigate techniques to further improve the retention time of SSD cold storage, to make it more appealing than cold data storage on hard disk drives. First, we can investigate techniques that partition the data identified for SSD cold storage (e.g., write-cold, read-hot data) onto a separate pool of SSDs or flash chips. This will increase the dwell time for cold data, and reduce the dwell time for the other data. This can lead to more efficient flash policies for all data, as it reduces the retention loss speed for cold data (Chapter 7), and reduces the required retention time guarantee for the other data (Chapter 4). Second, we can investigate suitable flash management policies for write-cold data. Note that, according to our findings in Section 6.3, read disturb errors are minimal in 3D NAND. Also note that, for write-cold data, the vast majority of writes are refresh operations, and thus its dwell time is approximately equal to its retention time. This could allow much higher lifetimes for SSDs used for cold storage. Third, once write-cold data is stored on separate SSDs, we can even manage SSD cooling by using the storage temperature that maximizes SSD retention time. Since writes are infrequent in cold storage, we can schedule them for when the temperature is high, which reduces SSD programming errors. Fourth, we can apply retention error recovery techniques to further relax the retention time constraints of SSD cold storage. Since the data in cold storage remains static for a long time, the SSD errors in cold storage are dominated by retention errors. Thus, retention error recovery can be very effective at correcting errors in SSD cold storage, even when the raw bit error rate exceeds the ECC correction capability.
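To quantify why extending the retention time (and hence the refresh period) matters for cold storage: when refresh is essentially the only write, the P/E cycles consumed per year equal one year divided by the refresh period. The endurance budget and service lifetime in the back-of-the-envelope sketch below are illustrative assumptions, not measured values.

def refresh_pe_per_year(refresh_period_days: float) -> float:
    """P/E cycles consumed per year when refresh is the only source of writes."""
    return 365.0 / refresh_period_days

def budget_fraction_spent_on_refresh(pe_budget, refresh_period_days, service_years):
    """Fraction of the endurance budget consumed by refresh over the service lifetime."""
    return refresh_pe_per_year(refresh_period_days) * service_years / pe_budget

# Assumed 600 P/E cycle budget and 5-year service lifetime.
for period in (3, 30, 365):
    frac = budget_fraction_spent_on_refresh(600, period, 5)
    print(f"refresh every {period:3d} days over 5 years: "
          f"{100 * frac:6.1f}% of a 600 P/E cycle budget")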

10.3.3 Increasing SSD Capacity

Scaling NAND flash density is hard because it often trades off flash reliability. In SSD cold storage, however, we can give up some of the unused SSD reliability and performance to increase SSD capacity. This lowers the cost per bit of SSD cold storage, which makes it more appealing to use. We believe there are two reasons why we can trade off some aspects of SSD reliability and performance for cold storage. First, since SSD cold storage does not require a high P/E cycle lifetime, we can limit the SSD lifetime to only a few hundred P/E cycles. This allows for much more aggressive scaling to increase the density of SSD cold storage, leading to higher cost-efficiency. Second, since write performance is less important for cold storage, we can use SSD write modes with a long delay. By trading off P/E cycle lifetime and write performance, we believe we can significantly improve SSD capacity, and hence reduce the cost of SSD cold storage, in many ways. For example, we can configure the flash chips to TLC or QLC mode for SSD cold storage, which increases the number of bits stored in each flash cell, in turn reducing the error margin and increasing SSD write latency. To compensate for the potentially increased raw bit error rate of TLC or QLC NAND flash memory, we can develop more aggressive error mitigation techniques tailored to SSD cold storage to tolerate these errors. For example, we can configure SSD cold storage to use a smaller program step size, trading off write performance. This could allow us to use weaker ECC for SSD cold storage, which frees up the SSD capacity that was originally used for ECC redundancy.
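The following sketch illustrates the last point: the ECC code rate (user bits per codeword bit) directly determines how much of the raw flash capacity is left for user data, so a lower raw bit error rate that permits a weaker code frees capacity. The codeword sizes below are assumed example values, not parameters of any particular device.

def usable_capacity_gib(raw_capacity_gib: float, user_bits: int, codeword_bits: int) -> float:
    """User-visible capacity after ECC redundancy, given code rate user_bits/codeword_bits."""
    return raw_capacity_gib * user_bits / codeword_bits

raw_gib = 1024.0
strong_ecc = usable_capacity_gib(raw_gib, user_bits=8192, codeword_bits=9216)  # ~0.889 code rate
weak_ecc = usable_capacity_gib(raw_gib, user_bits=8192, codeword_bits=8704)    # ~0.941 code rate
print(f"strong ECC: {strong_ecc:6.1f} GiB usable, weak ECC: {weak_ecc:6.1f} GiB usable, "
      f"gain: {weak_ecc - strong_ecc:5.1f} GiB")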

Other Works of This Author

During the course of my Ph.D., I have had the opportunity to collaborate with many of my fellow graduate students. These projects not only helped me learn about NAND flash memory, DRAM, and cost-reliability trade-offs, which is useful for this dissertation, but also helped me gain the insights, ideas, and skills needed to conduct good research. In this chapter, I would like to acknowledge these projects and my other works related to this dissertation.

In collaboration with Justin Meza, I have worked on single-level storage systems. We rethought the interface for efficiently managing a heterogeneous memory and storage system that consists of DRAM, NVM, and NAND flash memory. We explored the design of a Persistent Memory Manager that coordinates the management of memory and storage under a single hardware unit in a single address space [210]. Efficient management of NVM and flash reliability is one of the concerns in designing the persistent memory manager.

In collaboration with Vivek Seshadri, I have worked on in-memory processing using DRAM. We proposed RowClone, a new and simple mechanism to perform bulk copy and initialization completely within DRAM using the internal row buffers [291]. This gave me the opportunity to learn about DRAM architecture, which has a lot in common with NAND flash memory.

During my Ph.D., I have also worked on improving the cost-reliability trade-offs of DRAM in data centers. First, we proposed the idea of heterogeneous-reliability memory, a hardware/software cooperative system design that chooses the best memory reliability provisioning for each chunk of data to lower data center cost while achieving high reliability [193]. Second, we proposed Capacity- and Reliability-Adaptive Memory (CREAM), a hardware mechanism that adapts error-correcting DRAM modules to offer multiple levels of error protection, and provides the capacity saved from using weaker protection to applications [197].

In collaboration with Yu Cai, I have worked on many ideas to improve flash reliability aside from my dissertation work. Over the years, we proposed several online and offline techniques to mitigate data retention errors [27], read disturb errors [35], and program errors [34]. We have also worked on a survey and tutorial of the best practices for error characterization, mitigation, and recovery in flash-based SSDs [33]. In collaboration with Aya Fukami, I have worked on improving the reliability of chip-off forensic analysis of NAND flash memory devices [82]. In collaboration with Arash, we have worked on improving the performance and fairness of the I/O request scheduler in the SSD controller [317].

All of these works are closely related to this dissertation. During these collaborations, we all gained valuable expertise on improving flash reliability by learning from each other.

Bibliography

[1] Michael Abraham. NAND Flash Architecture and Specification Trends. In Flash Memory Summit, 2012. 4.3 [2] N. Agrawal, V. Prabhakaran, and T. Wobber. Design Tradeoffs for SSD Performance. In USENIX ATC, 2008. 2.1.1, 2.1.2, 2.1.3, 4.3 [3] Saba Ahmadian, Farhad Taheri, Mehrshad Lotfi, Maryam Karimi, and Hossein Asadi. Investigating Power Outage Effects on Reliability of Solid-State Drives. In DATE, 2018. 2.1.3 [4] A. R. Alameldeen, I. Wagner, Z. Chisthi, W. Wu, C. Wilkerson, and S.-L. Lu. Energy- Efficient Cache Design Using Variable-Strength Error-Correcting Codes. In ISCA, 2011. 3.5.6 [5] A. Anastasopoulos. A Comparison Between the Sum-Product and the Min-Sum Iterative Detection Algorithms Based on Density Evolution. In GLOBECOM, 2001. 3.3.1 [6] Svante Arrhenius. Uber¨ die dissociationswarme¨ und den einfluss der temperatur auf den dissociationsgrad der elektrolyte. Zeitschrift fur¨ physikalische Chemie, 1889. 3.1.6, 3.2.3, 4, 7.1.3, 7.1.3, 7.2.2 [7] A. Athmanathan, M. Stanisavljevic, N. Papandreou, H. Pozidis, and E. Eleftheriou. Multilevel-Cell Phase-Change Memory: A Viable Technology. J. Emerg. Sel. Topics Cir- cuits Syst., Mar. 2016. 3.5.7 [8] S. Baek, S. Cho, and R. Melhem. Refresh Now and Then. IEEE Trans. Computers, Aug. 2014. 3.5.2, 3.5.2 [9] Mahesh Balakrishnan, Asim Kadav, Vijayan Prabhakaran, and Dahlia Malkhi. Differential : Rethinking raid for ssd reliability. ACM TOS, 2010. 6.5.2 [10] Hanmant P Belgal, Nick Righos, Ivan Kalastirsky, Jeff J Peterson, Robert Shiner, and Neal Mielke. A New Reliability Model for Post-Cycling Charge Retention of Flash Memories. In IRPS, 2002. 7.2.2 [11] E. R. Berlekamp. Nonbinary BCH Decoding. In ISIT, 1967. 3.3.1 [12] R. Bez, E. Camerlenghi, A. Modelli, and A. Visconti. Introduction to Flash Memory. Proc. IEEE, Apr. 2003. 2.2, 2.2.4, 2.2.4, 2.2.4 [13] Andrew D. Birrell and Bruce Jay Nelson. Implementing Remote Procedure Calls. TOCS, Feb. 1984. 7.1.1 [14] Raj Chandra Bose and Dwijendra K Ray-Chaudhuri. On A Class of Error Correcting

214 Binary Group Codes. Information and control, 1960. 2.1.3, 3.3, 3.3.1, 3.3.1, 8.1.2 [15] E. Bosman, K. Razavi, H. Bos, and C. Guiffrida. Dedup Est Machina: Memory Dedupli- cation as an Advanced Exploitation Vector. In SP, 2016. 3.5.3 [16] J. E. Brewer and M. Gill. Nonvolatile Memory Technologies With Emphasis on Flash: A Comprehensive Guide to Understanding and Using NVM Devices. Wiley, Hoboken, NJ, USA, 2008. 2.2, 2.2.4, 2.2.4 [17] J. S. Bucy, J. Schindler, S. W. Schlosser, and G. R. Ganger. The DiskSim Simulation En- vironment Version 4.0 Reference Manual. Technical Report CMU-PDL-08-101, Carnegie Mellon Univ. Parallel Data Lab, 2008. 4.3 [18] W. Burleson, O. Mutlu, and M. Tiwari. Who is the Major Threat to Tomorrow’s Security? You, the Hardware Designer. In DAC, 2016. 3.5.3 [19] Y. Cai. NAND Flash Memory: Characterization, Analysis, Modelling, and Mechanisms. PhD thesis, Carnegie Mellon Univ., 2012. 3.1.1, 3.1.3, 3.1.4, 3.2.3, 3.2.3, 3.2.5 [20] Y. Cai, E. F. Haratsch, M. P. McCartney, and K. Mai. FPGA-Based Solid-State Drive Prototyping Platform. In FCCM, 2011. 5.2, 6.1.1 [21] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. Error Patterns in MLC NAND Flash Mem- ory: Measurement, Characterization, and Analysis. In DATE, 2012. 1.1.1, 1.2.1, 1.2.2, 2.1.3, 3.1.1, 3.1.4, 3.1.5, 3.1, 3.2.3, 3.2.3, 4.1, 4.1.1, 4.3, 5.6, 6, 6.1, 7.1.2, 7.1.3 [22] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. Unsal, and K. Mai. Flash Correct and Refresh: Retention Aware Management for Increased Lifetime. In ICCD, 2012. 1.1.1, 1.1.2, 1.2.1, 1.2.2, 1.2.3, 1.1., 1.2., 2.1.3, 2.1.4, 3.1.4, 3.1, 3.2.3, 3.2.3, 4, 4.1, 4.1.1, 4.1, 4.1.2, 4.2.2, 4.3, 4.3, 4.4, 5.5.2, 5.6, 5.7, 6, 6.1, 6.2.2, 6.2.2, 6.3.2, 6.5.3, 7.2.2, 7.3.1, 7.3.1, 7.4, 10.2.3 [23] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai. Threshold Voltage Distribution in NAND Flash Memory: Characterization, Analysis, and Modeling. In DATE, 2013. 1.1.1, 1.1.1, 1.2.1, 1.2.2, 2.2.4, 3.1.1, 3.1, 3.2.4, 3.2.5, 3.3.1, 3.3.1, 5.2, 5.3, 5.3.1, 5.3.1, 5.6, 6, 6.1, 6.2.4, 6.3.1, 6.3.1, 7.1.1, 2, 7.2.1, 7.3.1, 7.3.3, 7.4 [24] Y. Cai, O. Mutlu, E. F. Haratsch, and K. Mai. Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation. In ICCD, 2013. 1.1.1, 1.2.1, 1.2.2, 1.2.3, 2.1.3, 2.2.4, 3.1.3, 3.1.3, 3.1, 3.2.1, 3.2.3, 3.2.5, 3.4.2, 5.5.1, 5.5.2, 5.6, 6, 6.1, 6.2.4, 6.3.1, 6.3.1, 6.5.2, 6.5.4, 6.6, 7.2.1, 7.3.1, 7.3.3, 7.4 [25] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, A. Cristal, O. Unsal, and K. Mai. Error Anal- ysis and Retention-Aware Error Management for NAND Flash Memory. Intel Technology Journal (ITJ), 2013. 1.1.2, 1.2.2, 1.1., 1.2., 2.1.3, 3.1.4, 3.1, 3.2.3, 3.2.3, 4, 4.1, 4.1.1, 4.1, 4.1.2, 4.3, 4.3, 4.4, 5.6, 7.2.2, 7.4 [26] Y. Cai, G. Yalcin, O. Mutlu, E. F. Haratsch, O. Unsal, A. Cristal, and K. Mai. Neighbor Cell Assisted Error Correction in MLC NAND Flash Memories. In SIGMETRICS, 2014. 1.1.1, 1.2.1, 1.2.2, 1.2.3, 3.7., 2.1.3, 3.1.3, 3.1.3, 3.1, 3.2.2, 3.2.2, 3.2.3, 3.2.5, 3.4.2, 6, 6.1, 6.2.4, 6.3.1, 6.3.1, 6.5.2, 6.5.4 [27] Y. Cai, Y. Luo, E. F. Haratsch, K. Mai, and O. Mutlu. Data Retention in MLC NAND

215 Flash Memory: Characterization, Optimization, and Recovery. In HPCA, 2015. 1.1.1, 1.2.1, 1.2.2, 2.1.3, 3.1.4, 3.1.6, 3.1, 3.2.3, 3.2.4, 3.2.5, 3.2.5, 3.2.5, 3.3.4, 3.3.4, 4, 5.1, 5.2, 5.5.1, 5.5.2, 5.5.5, 5.6, 6, 6.1, 2, 6.2.2, 6.2.2, 6.3.2, 6.4.1, 6.5.1, 6.5.3, 6.5.5, 6.5.5, 7.1.1, 2, 7.2.2, 7.2.2, 7.3.1, 7.3.1, 7.3.2, 7.3.3, 7.4, 10.1, 10.2.1, 10.3.3 [28] Y. Cai, Y. Wu, and E. F. Haratsch. Hot-Read Data Aggregation and Code Selection. U.S. Patent Appl. 14/192,110, 2015. 3.2.6, 3.2.7 [29] Y. Cai, Y. Wu, and E. F. Haratsch. System to Control a Width of a Programming Threshold Voltage Distribution Width When Writing Hot-Read Data. U.S. Patent 9,218,885, 2015. 3.2.6, 3.2.7, 3.2.7 [30] Y. Cai, Y. Wu, N. Chen, E. F. Haratsch, and Z. Chen. Systems and Methods for Latency Based Data Recycling in a Solid State Memory System. U.S. Patent 9,424,179, 2016. 3.2.3 [31] Y. Cai, Y. Wu, and E. F. Haratsch. Error Correction Code (ECC) Selection Using Probabil- ity Density Functions of Error Correction Capability in Storage Controllers With Multiple Error Correction Codes. U.S. Patent 9,419,655, 2016. 3.1, 3.2.7, 3.2.7 [32] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu. Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery. arXiv:1711.11427 [cs.AR], 2017. 1.2.2, 1.2.3, 1.4, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 2.10, 2.11, 3.1, 3.2, 3.3, 3.4, 3.1.6, 3.9, 3.10, 3.11, 3.12, 3.13, 3.14, 3.15, 3.16, 3.17, 3.18, 3.22, 3.23, 3.24, 3.25, 7.1.1, 7.3.1, 7.3.1, 7.3.2, 7.3.3, 7.4, 8.1.2, 10.2.2 [33] Y. Cai, S. Ghose, E. F. Haratsch, Y. Luo, and O. Mutlu. Error Characterization, Mitigation, and Recovery in Flash-Memory-Based Solid-State Drives. Proc. IEEE, Sep. 2017. 1.1.1, 1.2.1, 1.2.2, 1.2.3, 1.4, 3.1.6, 6, 6.5.2, 6.5.5, 2, 7.1.1, 7.3.1, 7.3.1, 7.3.2, 7.3.3, 7.4, 8.1.2, 10.2.2, 10.3.3 [34] Y. Cai, S. Ghose, Y. Luo, K. Mai, O. Mutlu, and E. F. Haratsch. Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques. In HPCA, 2017. 1.1.1, 1.2.1, 1.2.2, 2.2.4, 3.1.2, 3.1.2, 3.1.3, 3.1, 3.2.1, 3.5.3, 6, 6.2.4, 10.3.3 [35] Yu Cai, Yixin Luo, Saugata Ghose, and Onur Mutlu. Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery. In DSN, 2015. 1.1.1, 1.2.1, 1.2.2, 1.2.3, 2.1.3, 2.2.3, 3.1.5, 3.1, 3.2.3, 3.2.5, 3.2.5, 3.3.4, 3.3.4, 3.4.2, 5.1, 5.5.1, 6, 6.1, 6.2.4, 6.3.3, 6.3.3, 6.6, 2, 10.2.1, 10.3.3 [36] J. Cha and S. Kang. Data Randomization Scheme for Endurance Enhancement and In- terference Mitigation of Multilevel Flash Memory Devices. ETRI Journal, Feb. 2013. 2.1.3 [37] Karthik Chandrasekar, Sven Goossens, Christian Weis, Martijn Koedam, Benny Akesson, Norbert Wehn, and Kees Goossens. Exploiting Expendable Process-Margins in DRAMs for Run-Time Performance Optimization. In DATE, 2014. 3.5.5 [38] K. K. Chang. Understanding and Improving the Latency of DRAM-Based Memory Sys- tems. PhD thesis, Carnegie Mellon Univ., 2017. 3.5.1, 3.5.2, 3.5.2, 3.5.5, 3.5.5

216 [39] K. K. Chang, Donghyuk Lee, Z. Chishti, A.R. Alameldeen, C. Wilkerson, Yoongu Kim, and O. Mutlu. Improving DRAM Performance by Parallelizing Refreshes With Accesses. In HPCA, 2014. 3.2.3, 3.5.2, 3.5.2 [40] K. K. Chang, P. J. Nair, S. Ghose, D. Lee, M. K. Qureshi, and O. Mutlu. Low-Cost Inter- Linked Subarrays (LISA): Enabling Fast Inter-Subarray Data Movement in DRAM. In HPCA, 2016. 2.1.1 [41] Kevin K. Chang, Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, and Onur Mutlu. Un- derstanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization. In SIGMETRICS, 2016. 3.5.2, 3.5.5, 5, 3.35, 3.5.5 [42] Kevin K. Chang, Abdullah Giray Yaglikci, Aditya Agrawal, Niladrish Chatterjee, Saugata Ghose, Abhijith Kashyap, Hasan Hassan, Donghyuk Lee, Mike O’Connor, and Onur Mutlu. Understanding Reduced-Voltage Operation in Modern DRAM Devices: Exper- imental Characterization, Analysis, and Mechanisms. In SIGMETRICS, 2017. 3.5.5, 3.5.5 [43] L.-P. Chang. On Efficient Wear Leveling for Large-Scale Flash-Memory Storage Systems. In SAC, 2007. 2.1.3, 7.3.1, 7.3.2 [44] L.-P. Chang, T.-W. Kuo, and S.-W. Lo. Real-Time Garbage Collection for Flash-Memory Storage Systems of Real-Time Embedded Systems. ACM Trans. Embedded Comput. Syst., Nov. 2004. 2.1.3, 2.1.3, 2.1.4 [45] Yu-Ming Chang, Yuan-Hao Chang, Jian-Jia Chen, Tei-Wei Kuo, Hsiang-Pang Li, and Hang-Ting Lue. On Trading Wear-Leveling With Heal-Leveling. In DAC, 2014. 3.1.6 [46] Niladrish Chatterjee, Manjunath Shevgoor, Rajeev Balasubramonian, Al Davis, Zhen Fang, Ramesh Illikkal, and Ravi Iyer. Leveraging Heterogeneity in DRAM Main Memo- ries to Accelerate Critical Word Access. In MICRO, 2012. 3.5.7 [47] Chih-Ping Chen, Hang-Ting Lue, Chih-Chang Hsieh, Kuo-Pin Chang, Kuang-Yeu Hsieh, and Chih-Yuan Lu. Study of fast initial charge loss and it’s impact on the programmed states Vt distribution of charge-trapping NAND Flash. In IEDM, 2010. 6.2.2 [48] Chin-Long Chen. High-Speed Decoding of BCH Codes (Corresp.). IEEE Trans. Inf. Theory, Mar. 1981. 3.3.1, 3.3.1, 6.5.5 [49] J. Chen and M. P. C. Fossorier. Near Optimum Universal Belief Propagation Based De- coding of Low-Density Parity Check Codes. IEEE Trans. Comm., Aug. 2002. 3.3.1 [50] Renhai Chen, Yi Wang, and Zili Shao. DHeating: Dispersed Heating Repair for Self- Healing NAND Flash Memory. In CODES+ISSS, 2013. 3.1.6, 7.1.2, 7.1.2, 7.1.4 [51] T.-H. Chen, Y.-Y. Hsiao, Y.-T. Hsing, and C.-W. Wu. An Adaptive-Rate Error Correction Scheme for NAND Flash Memory. In VTS, 2009. 3.1, 3.2.7, 3.2.7 [52] Z. Chen, E. F. Haratsch, S. Sankaranarayanan, and Y. Wu. Estimating Read Reference Voltage Based on Disparity and Derivative Metrics. U.S. Patent 9,417,797, 2016. 3.2.5 [53] R. T. Chien. Cyclic Decoding Procedures for the Bose–Chaudhuri–Hocquenghem Codes. IEEE Trans. Inf. Theory, Oct. 1964. 3.3.1

217 [54] Tae-hee Cho and Yeong-Taek Lee. Multi-level flash memory with temperature compen- sation, 2005. US Patent 6,870,766. 10.1 [55] Bongsik Choi, Sang Hyun Jang, Jinsu Yoon, Juhee Lee, Minsu Jeon, Yongwoo Lee, Jung- min Han, Jieun Lee, Dong Myong Kim, Dae Hwan Kim, Chan Lim, Sungkye Park, and Sung-Jin Choi. Comprehensive Evaluation of Early Retention (Fast Charge Loss Within a Few Seconds) Characteristics in Tube-Type 3-D NAND Flash Memory. In VLSIT, 2016. 3.1.6, 3.4.2, 3.4.3, 6.1, 6.2.2, 6.2.2, 6.2.3, 6.3.1, 7.1.1, 7.2.2, 7.4 [56] H. Choi, W. Liu, and W. Sung. VLSI Implementation of BCH Error Correction for Mul- tilevel Cell NAND Flash Memory. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Jul. 2009. 3.3.1 [57] Wonil Choi, Mohammad Arjomand, Myoungsoo Jung, and Mahmut Kandemir. Exploiting Data Longevity for Enhancing the Lifetime of Flash-based Storage Class Memory. In SIGMETRICS, 2017. 7.4 [58] C. Chou, P. Nair, and M. K. Qureshi. Reducing Refresh Power in Mobile Devices With Morphable ECC. In DSN, 2015. 3.5.6 [59] C.-C. Chou, A. Jaleel, and M. K. Qureshi. CAMEO: A Two-Level Memory Organization with Capacity of Main Memory and Flexibility of Hardware-Managed Cache. In MICRO, 2014. 3.5.7 [60] C.-C. Chou, A. Jaleel, and M. K. Qureshi. BEAR: Techniques for Mitigating Bandwidth Bloat in Gigascale DRAM Caches. In ISCA, 2015. 3.5.7 [61] Siddharth Choudhuri and Tony Givargis. Deterministic Service Guarantees for NAND Flash Using Partial Block Cleaning. In CODES+ISSS, 2008. 2.1.3 [62] L. Chua. Memristor—The Missing Circuit Element. IEEE Trans. Circuit Theory, Sep. 1971. 3.5.7 [63] T.-S. Chung, D.-J. Park, S. Park, D.-H. Lee S.-W. Lee, and H.-J. Song. A Survey of Flash Translation Layer. J. Syst. Archit., May/Jun. 2009. 2.1.3 [64] R. Codandaramane. Securing the SSDs — NVMe Controller Encryption. In Flash Mem- ory Summit, 2016. 2.1.3 [65] E. T. Cohen. Zero-One Balance Management in a Solid-State Disk Controller. U.S. Patent 8,839,073, 2014. 3.2.5 [66] E. T. Cohen, Y. Cai, E. F. Haratsch, and Y. Wu. Method to Dynamically Update LLRs in an SSD Drive and/or Controller. U.S. Patent 9,329,935, 2015. 3.3.1 [67] Christian Monzio Compagnoni, Carmine Miccoli, Riccardo Mottadelli, Silvia Beltrami, Michele Ghidotti, Andrea L Lacaita, Alessandro S Spinelli, and Angelo Visconti. In- vestigation of The Threshold Voltage Instability After Distributed Cycling in Nanoscale NAND Flash Memory Arrays. In IRPS, 2010. 7.1.4 [68] Jim Cooke. The Inconvenient Truths of NAND Flash Memory. Flash Memory Summit, 2007. 3.1.3, 3.1.5 [69] J. Daemen and V. Rijmen. The Design of Rijndael. Springer-Verlag, Berlin, Germany;

218 Heidelberg, Germany; New York, NY, USA, 2002. 2.1.3 [70] Anup Das, Hasan Hassan, and Onur Mutlu. VRL-DRAM: Improving DRAM Perfor- mance via Variable Refresh Latency. In DAC, 2018. 3.5.2 [71] Niv Dayan, Philippe Bonnet, and Stratos Idreos. GeckoFTL: Scalable Flash Translation Techniques For Very Large Flash Devices. In SIGMOD, 2016. 1.2.3 [72] Eric Deal. Trends in NAND Flash Memory Error Correction. Cyclic Design, 2009. 6.5.5, 8.1.2 [73] R. Degraeve, F. Schuler, B. Kaczer, M. Lorenzini, D. Wellekens, P. Hendrickx, M. van Duuren, G. J. M. Dormans, J. Van Houdt, L. Haspeslagh, G. Groeseneken, and G. Tempel. Analytical Percolation Model for Predicting Anomalous Charge Loss in Flash Memories. IEEE Trans. Electron Devices, Sep. 2004. 2.2.4, 3.1.4 [74] T. J. Dell. A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory. Technical report, IBM Microelectron. Division, 1997. 3.5.6 [75] P. Desnoyers. Analytic Modeling of SSD Write Performance. In SYSTOR, 2012. 2.1.4 [76] C. Dirik and B. Jacob. The Performance of PC Solid-State Disks (SSDs) as a Function of Bandwidth, Concurrency, Device Architecture, and System Organization. In ISCA, 2009. 2.1.3 [77] Lara Dolecek. Making Error Correcting Codes Work for Flash Memory. In Flash Memory Summit, 2014. 3.3.1, 3.3.1, 3.3.1, 3.3.2 [78] Guiqiang Dong, Ningde Xie, and Tong Zhang. Enabling NAND Flash Memory Use Soft- Decision Error Correction Codes at Minimal Read Latency Overhead. IEEE Trans. on Circuits and Systems, 2013. 5.5.4, 5.6 [79] Jim Elliott and Jaeheon Jeong. Advancements in SSDs and 3D NAND Reshaping Storage Market. Keynote presentation at Flash Memory Summit, 2017. 3.4.1 [80] M. P. C. Fossorier, M. Mihaljevic,´ and H. Imai. Reduced Complexity Iterative Decoding of Low-Density Parity Check Codes Based on Belief Propagation. IEEE Trans. Comm., May 1999. 3.3.1 [81] R. H. Fowler and L. Nordheim. Electron Emission in Intense Electric Fields. In Proceed- ings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 1928. 2.2.4, 3.4.1 [82] A. Fukami, S. Ghose, Y. Luo, Y. Cai, and O. Mutlu. Improving the Reliability of Chip-Off Forensic Analysis of NAND Flash Memory Devices. Digital Investigation, Mar 2017. 3.1, 3.2.4, 7.1.1, 10.3.3 [83] E. Gal and S. Toledo. Algorithms and Data Structures for Flash Memories. ACM Comput. Surv., Jun. 2005. 2.1.3, 7.3.1, 7.3.2 [84] R. G. Gallager. Low-Density Parity-Check Codes. IRE Trans. Inf. Theory, Jan. 1962. 2.1.3, 3.3, 3.3.1, 3.3.1, 3.3.1, 3.3.2 [85] Robert G Gallager. Low-Density Parity-Check Codes. Information Theory, IRE Transac- tions on, 1962. 2.1.3, 3.3, 3.3.1, 3.3.1, 3.3.1, 3.3.2

219 [86] Geoff Gasior. The SSD Endurance Experiment: They’re All Dead. http://techreport. com/review/27909/the-ssd-endurance-experiment-theyre-all-dead, 2015. 5.5.3 [87] S. Ghose, H. Lee, and J. F. Mart´ınez. Improving Memory Scheduling via Processor-Side Load Criticality Information. In ISCA, 2013. 2.1.2 [88] L. M. Grupp, A. M. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. H. Siegel, and J. K. Wolf. Characterizing Flash Memory: Anomalies, Observations, and Applications. In MICRO, 2009. 3.1.3, 3.1.5, 3.1 [89] Laura M Grupp, John D Davis, and Steven Swanson. The Bleak Future of NAND Flash Memory. In FAST, 2012. 7.3.2, 10.2.2 [90] D. Gruss, C. Maurice, and S. Mangard. Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript. In DIMVA, 2016. 3.5.3 [91] D. Gruss, M. Lipp, M. Schwarz, D. Genkin, J. Juffinger, S. O’Connell, W. Schoechl, and Y. Yarom. Another Flip in the Wall of Rowhammer Defenses. arXiv:1710.00551 [cs.CR], 2017. 3.5.3 [92] K. Gunnam. LDPC Decoding: VLSI Architectures and Implementations. In Flash Mem- ory Summit, 2014. 3.3.1 [93] K. K. Gunnam, G. S. Choi, M. B. Yeary, and M. Atiquzzaman. VLSI Architectures for Layered Decoding for Irregular LDPC Codes of WiMax. In ICC, 2007. 3.3.1 [94] Aayush Gupta, Youngjae Kim, and Bhuvan Urgaonkar. DFTL: A Flash Translation Layer Employing Demand-based Selective Caching of Page-level Address Mappings. In ASP- LOS, 2009. 1.2.3, 2.1.3, 2.1.3 [95] K. Ha, J. Jeong, and J. Kim. A Read-Disturb Management Technique for High-Density NAND Flash Memory. In APSys, 2013. 3.1, 3.2.3, 3.2.6, 3.2.7 [96] K. Ha, J. Jeong, and J. Kim. An Integrated Approach for Managing Read Disturbs in High- Density NAND Flash Memory. IEEE Trans. Computer-Aided Design Integr. Circuits Syst., Jul. 2016. 3.1, 3.2.3, 3.2.5, 3.2.6, 3.2.7 [97] T. Hamamoto, S. Sugiura, and S. Sawada. On the Retention Time Distribution of Dynamic Random Access Memory (DRAM). IEEE Trans. Electron Devices, Jun. 1998. 3.5.2 [98] Longzhe Han, Yeonseung Ryu, and Keunsoo Yim. CATA: A Garbage Collection Scheme for Flash Memory File Systems. In UIC, 2006. 2.1.3 [99] E. F. Haratsch. Media Management for High Density NAND Flash Memories. In Flash Memory Summit, 2016. 3.1, 3.2.7, 3.2.7, 3.3.2, 3.3.3 [100] Erich F. Haratsch. LDPC Code Concepts and Performance on High-Density Flash Mem- ory. In Flash Memory Summit, 2014. 5.5.2, 5.5.5 [101] Erich F. Haratsch. Controller Concepts for 1y/1z nm and 3D NAND Flash. In Flash Memory Summit, 2015. 3.3.3, 3.3.3, 5.1, 5.5.5 [102] H. Hassan, N. Vijaykumar, S. Khan, S. Ghose, K. Chang, G. Pekhimenko, D. Lee, O. Er- gin, and O. Mutlu. SoftMC: A Flexible and Practical Open-Source Infrastructure for

220 Enabling Experimental DRAM Studies. In HPCA, 2017. 2.1.2, 3.5.2, 3.5.2, 3.5.3, 3.5.5, 3.5.5 [103] Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, and Onur Mutlu. ChargeCache: Reducing DRAM Latency by Exploiting Row Access Locality. In HPCA, 2016. 2.1.2 [104] J. Haswell. SSD Architectures to Ensure Security and Performance. In Flash Memory Summit, 2016. 2.1.3 [105] Jun He, Sudarsun Kannan, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. The Unwritten Contract of Solid State Drives. In EuroSys, 2017. 2.1.3 [106] Jason Heidecker. Flash Memory Reliability: Read, Program, and Erase Latency Versus Endurance Cycling. Technical Report 10-19, Jet Propulsion Lab, 2010. 4.3 [107] A. Hocquenghem. Codes Correcteurs d’Erreurs. Chiffres, Sep. 1959. 2.1.3, 3.3, 3.3.1, 3.3.1, 6.5.5 [108] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka. Write Amplification Analysis in Flash-Based Solid State Drives. In SYSTOR, 2009. 2.1.4 [109] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang. Performance Impact and In- terplay of SSD Parallelism Through Advanced Commands, Allocation Strategy and Data Granularity. In ICS, 2011. 3.1.2 [110] Jian Huang, Anirudh Badam, Laura Caulfield, Suman Nath, Sudipta Sengupta, Bikash Sharma, and Moinuddin K Qureshi. Flashblox: Achieving both performance isolation and uniform lifetime for virtualized ssds. In FAST, pages 375–390, 2017. 8.1, 10.2.1, 10.2.3 [111] P. Huang, P. Subedi, X. He, S. He, and K. Zhou. FlexECC: Partially Relaxing ECC of MLC SSD for Better Cache Performance. In USENIX ATC, 2014. 3.2.7 [112] Chun-Hsiung Hung, Meng-Fan Chang, Yih-Shan Yang, Yao-Jen Kuo, Tzu-Neng Lai, Shin-Jang Shen, Jo-Yu Hsu, Shuo-Nan Hung, Hang-Ting Lue, Yen-Hao Shih, et al. Layer- Aware Program-and-Read Schemes for 3D Stackable Vertical-Gate BE-SONOS NAND Flash Against Cross-Layer Process Variations. JSSC, 50(6):1491–1501, 2015. 6.2.1 [113] A. Hwang, I. Stefanovici, and B. Schroeder. Cosmic Rays Dont Strike Twice: Understand- ing the Nature of DRAM Errors and the Implications for System Design. In ASPLOS, 2012. 3.5.4 [114] D. Ielmini, A. L. Lacaita, and D. Mantegazza. Recovery and Drift Dynamics of Resistance and Threshold Voltages in Phase-Change Memories. IEEE Trans. Electron Devices, Apr. 2007. 3.5.7 [115] Jae-Woo Im, Woopyo Jeong, Doo-Hyun Kim, Sangwan Nam, Dong-Kyo Shim, Myung- Hoon Choi, Hyun-Jun Yoon, Dae-Han Kim, Youse Kim, Hyun Wook Park, Dong-Hun Kwak, Sang-Won Park, Seok-Min Yoon, Wook-Ghee Hahn, Jinho Ryu, Sang-Won Shim, Kyung-Tae Kang, Sung-Ho Choi, Jeong-Don Ihm, Young-Sun Min, In-Mo Kim, Doosub Lee, Ji-Ho Cho, Ohsuk Kwon, Ji-Sang Lee, Moosung Kim, Sang-Hyun Joo, Jae-hoon Jang, Sang-Won Hwang, Dae-Seok Byeon, Hyang-Ja Yang, Ki-Tae Park, Kyehyun Kyung,

221 and Jeong-Hyuk Choi. A 128Gb 3b/Cell V-NAND Flash Memory with 1Gb/s I/O Rate. In ISSCC, 2015. 3.4, 6 [116] Intel Corp. Serial ATA Advanced Host Controller Interface (AHCI) 1.3.1, 2012. 2.1.3 [117] Engin Ipek, Onur Mutlu, Jose´ F. Mart´ınez, and Rich Caruana. Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. In ISCA, 2008. 2.1.2 [118] C. Isen and L. John. ESKIMO — Energy Savings Using Semantic Knowledge of Incon- sequential Memory Occupancy for DRAM Subsystem. In MICRO, 2009. 3.5.2, 3.5.2 [119] J. Jang, H.-S. Kim, W. Cho, H. Cho, J. Kim, S. I. Shim, Y. Jang, J.-H. Jeong, B.-K. Son, D. W. Kim, K. Kim, J.-J. Shim, J. S. Lim, K.-H. Kim, S. Y. Yi, J.-Y. Lim, D. Chung, H.-C. Moon, S. Hwang, J.-W. Lee, Y.-H. Son, U.-I. Chung, and W.-S. Lee. Vertical Cell Array Using TCAT (Terabit Cell Array Transistor) Technology for Ultra High Density NAND Flash Memory. In VLSIT, 2009. 3.4.1 [120] JEDEC Solid State Technology Assn. DDR4 SDRAM Standard, 2013. 3.2.3, 3.5.2, 3.5.2, 3.5.2 [121] JEDEC Solid State Technology Assn. Failure Mechanisms and Models for Semiconductor Devices. JEDEC Publication JEP122H, 2016. 2.1.3, 3.3.1, 5.5.2, 8.1.2 [122] Jaeyong Jeong, Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim. Lifetime Improve- ment of NAND Flash-Based Storage Systems Using Dynamic Program and Erase Scaling. In FAST, 2014. 3.1, 3.2.5, 3.2.7, 5.5.1 [123] Woopyo Jeong, Jae-woo Im, Doo-Hyun Kim, Sang-Wan Nam, Dong-Kyo Shim, Myung- Hoon Choi, Hyun-Jun Yoon, Dae-Han Kim, You-Se Kim, Hyun-Wook Park, et al. A 128 Gb 3b/Cell V-NAND Flash Memory with 1 Gb/s I/O Rate. JSSC, Jan. 2016. 7.2.2 [124] JEDEC Standard JESD218. Solid-State Drive (SSD) Requirements and Endurance Test Method. Arlington, VA, JEDEC Solid State Technology Association, 2010. 1.1.1, 2.1.3, 3.1.6, 3.3.3, 3.3.3, 5, 7.1.3 [125] JEDEC Standard JESD22-A117C. Electrically Erasable Programmable ROM (EEPROM) Program/Erase Endurance and Data Retention Stress Test. Arlington, VA, JEDEC Solid State Technology Association, 2011. 7.1.4 [126] JEDEC Standard JESD91A. Method for Developing Acceleration Models for Electronic Component Failure Mechanisms. Arlington, VA, JEDEC Solid State Technology Associa- tion, 2003. 3.1.6, 7.2.2, 7.2.2 [127] L. Jiang, Y. Zhang, and J. Yang. Mitigating Write Disturbance in Super-Dense Phase Change Memories. In DSN, 2014. 3.5.7 [128] Xiaowei Jiang, N. Madan, Li Zhao, M. Upton, R. Iyer, S. Makineni, D. Newell, D. Solihin, and R. Balasubramonian. CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms. In HPCA, 2010. 3.5.7 [129] S. J. Johnson. Introducing Low-Density Parity-Check Codes. http://sigpromu.org/ sarah/SJohnsonLDPCintro.pdf. 3.3.1 [130] Eric Jones, Travis Oliphant, and Pearu Peterson. SciPy: Open Source Scientific Tools for

222 Python. http://www.scipy.org/. 7.2.3 [131] Dongku Kang, Woopyo Jeong, Chulbum Kim, Doo-Hyun Kim, Yong-Sung Cho, Kyung- Tae Kang, Jinho Ryu, Kyung-Min Kang, Sungyeon Lee, Wandong Kim, Hanjun Lee, Jaedoeg Yu, Nayoung Choi, Dong-Su Jang, Jeong-Don Ihm, Doo-Gon Kim, Young-Sun Min, Moosung Kim, Ansoo Park, Jae-Ick Son, In-Mo Kim, Pansuk Kwak, Bong-Kil Jung, Doosub Lee, Hyunggon Kim, Hyang-Ja Yang, Dae-Seok Byeon, Ki-Tae Park, Kye- hyun Kyung, and Jeong-Hyuk Choi. 7.1 256Gb 3b/cell V-NAND Flash Memory With 48 Stacked WL Layers. In ISSCC, 2016. 3.4, 3.4.1, 6 [132] J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee. A Superblock-Based Flash Translation Layer for NAND Flash Memory. In EMSOFT, 2006. 2.1.3 [133] Jeong-Uk Kang, Jeeseok Hyun, Hyunjoo Maeng, and Sangyeun Cho. The Multi-Streamed Solid-State Drive. In HotStorage, 2014. 10.3.1 [134] Uksong Kang, Hak-Soo Yu, Churoo Park, Hongzhong Zheng, John Halbert, Kuljit Bains, SeongJin Jang, and Joosun Choi. Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling. In Memory Forum, 2014. 3.5.2, 3.5.2, 3.5.6 [135] J. Katcher. Postmark: A New File System Benchmark. Technical Report TR3022, Net- work Appliance, 1997. 4.3, 4.2 [136] R. Katsumata, M. Kito, Y. Fukuzumi, M. Kido, H. Tanaka, Y. Komori, M. Ishiduki, J. Mat- sunami, T. Fujiwara, Y. Nagata, L. Zhang, Y. Iwata, R. Kirisawa, H. Aochi, and A. Ni- tayama. Pipe-Shaped BiCS Flash Memory with 16 Stacked Layers and Multi-Level-Cell Operation for Ultra High Density Storage Devices. In VLSIT, 2009. 3.4.1, 3.4.1 [137] S. Khan, D. Lee, Y. Kim, A. Alameldeen, C. Wilkerson, and O. Mutlu. The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study. In SIGMETRICS, 2014. 3.5.1, 3.5.2, 3.5.2, 3.5.2, 3.5.2, 3.5.6 [138] S. Khan, D. Lee, and O. Mutlu. PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM. In DSN, 2016. 3.5.1, 3.5.2, 3.5.2, 3.5.2, 3.5.6 [139] S. Khan, C. Wilkerson, D. Lee, A. R. Alameldeen, and O. Mutlu. A Case for Memory Content-Based Detection and Mitigation of Data-Dependent Failures in DRAM. IEEE Comput. Archit. Lett., 2016. 3.5.1, 3.5.2, 3.5.2, 3.5.2, 3.5.6 [140] S. Khan, C. Wilkerson, Z. Wang, A. R. Alameldeen, D. Lee, and O. Mutlu. Detecting and Mitigating Data-Dependent DRAM Failures by Exploiting Current Memory Content. In MICRO, 2017. 3.5.1, 3.5.2, 3.5.2, 3.5.2, 3.5.6 [141] W.-S. Khwa, M.-F. Chang, J.-Y. Wu, M.-H. Lee, T.-H. Su, K.-H. Yang, T.-F. Chen, T.- Y. Wang, H.-P. Li, M. Brightsky, S. Kim, H.-L. Lung, and C. Lam. A Resistance-Drift Compensation Scheme to Reduce MLC PCM Raw BER by Over 100x for Storage-Class Memory Applications. In ISSCC, 2016. 3.5.7 [142] C. Kim et al. A 21 nm High Performance 64 Gb MLC NAND Flash Memory with 400 MB/s Asynchronous Toggle DDR Interface. JSSC, 2012. 2.1.3 [143] Chulbum Kim, Ji-Ho Cho, Woopyo Jeong, Il-han Park, Hyun-Wook Park, Doo-Hyun Kim, Daewoon Kang, Sunghoon Lee, Ji-Sang Lee, Wontae Kim, et al. A 512Gb 3b/Cell 64-

223 Stacked WL 3D V-NAND Flash Memory. In ISSCC, 2017. 3.4.1 [144] J. Kim, M. Sullivan, and M. Erez. Bamboo ECC: Strong, Safe, and Flexible Codes for Reliable . In HPCA, 2015. 3.5.6 [145] J. Kim, M. Sullivan, S.-L. Gong, and M. Erez. Frugal ECC: Efficient and Versatile Mem- ory Error Protection Through Fine-Grained Compression. In SC, 2015. 3.5.6 [146] J. Kim, M. Patel, H. Hassan, and O. Mutlu. The DRAM Latency PUF: Quickly Evaluating Physical Unclonable Functions by Exploiting the Latency–Reliability Tradeoff in Modern DRAM Devices. In HPCA, 2018. 3.5.5 [147] K. Kim and J. Lee. A New Investigation of Data Retention Time in Truly Nanoscaled DRAMs. IEEE Electron Device Lett., Aug. 2009. 3.5.2 [148] N. Kim and J.-H. Jang. Nonvolatile Memory Device, Method of Operating Nonvolatile Memory Device and Memory System Including Nonvolatile Memory Device. U.S. Patent 8,203,881, 2012. 3.2.3 [149] Y. Kim. Architectural Techniques to Enhance DRAM Scaling. PhD thesis, Carnegie Mel- lon Univ., 2015. 3.5.3 [150] Y. Kim and O. Mutlu. Memory Systems. In Computing Handbook. CRC Press, Boca Raton, FL, USA, 3 edition, 2014. 2.1.1, 2.1.2 [151] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. ATLAS: A Scalable and High- Performance Scheduling Algorithm for Multiple Memory Controllers. In HPCA, 2010. 2.1.2 [152] Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu. A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM. In ISCA, 2012. 2.1.1 [153] Y. Kim, W. Yang, and O. Mutlu. Ramulator: A Fast and Extensible DRAM Simulator. IEEE Comput. Archit. Lett., Jan.–Jun. 2016. 2.1.2 [154] Y. S. Kim, D. J. Lee, C. K. Lee, H. K. Choi, S. S. Kim, J. H. Song, D. H. Song, J.-H. Choi, K.-D. Suh, and C. Chung. New Scaling Limitation of the Floating Gate Cell in NAND Flash Memory. In IRPS, 2010. 3.4.2 [155] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In MICRO, 2010. 2.1.2 [156] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors. In ISCA, 2014. 3.5.3, 3.31, 3.5.3, 3.32, 3.5.3, 3.5.6 [157] Y. Koh. NAND Flash Scaling Beyond 20nm. In IMW, 2009. 3.1 [158] R. Koller and R. Rangaswami. I/O Deduplication: Utilizing Content Similarity to Improve I/O Performance. TOS, 2010. 4.3, 4.2 [159] Y. Komori, M. Kido, M. Kito, R. Katsumata, Y. Fukuzumi, H. Tanaka, Y. Nagata, M. Ishiduki, H. Aochi, and A. Nitayama. Disturbless Flash Memory Due to High Boost

224 Efficiency on BiCS Structure and Optimal Memory Film Stack for Ultra High Density Storage Device. In IEDM, 2008. 3.4.1 [160] Solomon Kullback and Richard A Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 1951. 5.3.1, 5.3.2, 5.3.3, 5.3.4 [161]E.K ult¨ ursay,¨ M. Kandemir, A. Sivasubramaniam, and O. Mutlu. Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative. In ISPASS, 2013. 3.5.7 [162] Mark LaPedus. How to Make 3D NAND. Semiconductor Engineering, 2016. 3.4.2 [163] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting Phase Change Memory as a Scalable DRAM Alternative. In ISCA, 2009. 3.5.7 [164] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger. Phase Change Memory Architecture and the Quest for Scalability. Commun. ACM, Jul. 2010. 3.5.7 [165] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger. Phase- Change Technology and the Future of Main Memory. IEEE Micro, Feb. 2010. 3.5.7 [166] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt. Improving Memory Bank-Level Paral- lelism in the Presence of Prefetching. In MICRO, 2009. 2.1.1 [167] D. Lee. Reducing DRAM Energy at Low Cost by Exploiting Heterogeneity. PhD thesis, Carnegie Mellon Univ., 2016. 3.5.5, 3.5.5 [168] D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu. Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture. In HPCA, 2013. 2.1.1 [169] D. Lee, L. Subramanian, R. Ausavarungnirun, J. Choi, and O. Mutlu. Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM. In PACT, 2015. 2.1.1 [170] D. Lee, S. Ghose, G. Pekhimenko, S. Khan, and O. Mutlu. Simultaneous Multi-Layer Access: Improving 3D-Stacked Memory Bandwidth at Low Cost. ACM TACO, Jan. 2016. 2.1.1 [171] D. Lee, S. Khan, L. Subramanian, S. Ghose, R. Ausavarungnirun, G. Pekhimenko, V. Se- shadri, and O. Mutlu. Design-Induced Latency Variation in Modern DRAM Chips: Char- acterization, Analysis, and Latency Reduction Mechanisms. In SIGMETRICS, 2017. 3.5.5, 3.5.5, 3.5.6 [172] Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu. Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case. In HPCA, 2015. 3.5.2, 3.5.2, 3.5.5, 3.5.5 [173] J.-D. Lee, J.-H. Choi, D. Park, and K. Kim. Degradation of Tunnel Oxide by FN Current Stress and Its Effects on Data Retention Characteristics of 90 nm NAND Flash Memory Cells. In IRPS, 2003. 3.1.4 [174] Jae-Duk Lee, Sung-Hoi Hur, and Jung-Dal Choi. Effects of Floating-Gate Interference on NAND Flash Memory Cell Operation. IEEE Electron Device Letters, 2002. 3.1.3, 3.1, 6.3.1, 6.3.1 [175] Sang-Yun Lee. Limitations of 3D NAND Scaling. EE Times, 2017. 3.4.2

225 [176] Sungjin Lee, Taejin Kim, Kyungho Kim, and Jihong Kim. Lifetime Management of Flash- Based SSDs Using Recovery-Aware Dynamic Throttling. In FAST, 2012. 3.1.6, 3.1.6, 8.1.1 [177] Y. Lee, H. Yoo, I. Yoo, and I.-C. Park. 6.4 Gb/s Multi-Threaded BCH Encoder and De- coder for Multi-Channel SSD Controllers. In ISSCC, 2012. 2.1.3, 3.3, 3.3.1, 3.3.1, 6.5.5 [178] Tom Lenny. The Maturity of NVM ExpressTM. In Flash Memory Summit, 2014. 7.3.2 [179] Kenneth Levenberg. A Method for the Solution of Certain Non-Linear Problems in Least Squares. Quarterly of Applied Mathematics, 1944. 7.2.3 [180] Jiangpeng Li, Kai Zhao, Jun Ma, and Tong Zhang. Realizing Unequal Error Correction for NAND Flash Memory at Minimal Read Latency Overhead. Circuits and Systems II: Express Briefs, IEEE Transactions on, 2014. 5.6 [181] Jiangpeng Li, Kai Zhao, Xuebin Zhang, Jun Ma, Ming Zhao, and Tong Zhang. How Much Can Data Compressibility Help to Improve NAND Flash Memory Lifetime? In FAST, 2015. 2.1.3 [182] Y. Li, C. Hsu, and K. Oowada. Non- and Method With Improved First Pass Programming. U.S. Patent 8,811,091, 2014. 2.2.4 [183] Y. Li, S. Ghose, J. Choi, J. Sun, H. Wang, and O. Mutlu. Utility-Based Hybrid Memory Management. In CLUSTER, 2017. 3.5.7 [184] Yan Li. 3 Bit Per Cell NAND Flash Memory on 19nm Technology. Flash Memory Summit, 2012. 10.1 [185] Chun-Yi Liu, Yu-Ming Chang, and Yuan-Hao Chang. Read Leveling for Flash Storage Systems. In SYSTOR, 2015. 1.2.3 [186] J. Liu, B. Jaiyen, R. Veras, and O. Mutlu. RAIDR: Retention-Aware Intelligent DRAM Refresh. In ISCA, 2012. 3.2.3, 3.5.2, 3.29, 3.5.2, 3.5.2, 10.3.1 [187] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu. An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. In ISCA, 2013. 3.2.3, 3.5.1, 3.5.2, 3.28, 3.5.2, 3.5.2, 3.30, 3.5.2, 3.5.6 [188] R.-S. Liu, C.-L. Yang, and W. Wu. Optimizing NAND Flash-Based SSDs via Retention Relaxation. In FAST, 2012. 4.1, 4.1.1, 4.1.2, 5.1, 5.5.1 [189] R.-S. Liu, C.-L. Yang, C.-H. Li, and G.-Y. Chen. Duracache: A Durable SSD Cache Using MLC NAND Flash. In DAC, 2013. 4.1, 4.1.1, 4.1.2 [190] S. Liu, K. Pattabiraman, T. Moscibroda, and B. Zorn. Flikker: Saving DRAM Refresh- Power Through Critical Data Partitioning. In ASPLOS, 2011. 3.5.6 [191] W. Liu, J. Rho, and W. Sung. Low-Power High-Throughput BCH Error Correction VLSI Design for Multi-Level Cell NAND Flash Memories. In SIPS, 2006. 3.3.1 [192] Y. Luo, S. Ghose, Y. Cai, E. F. Haratsch, and O. Mutlu. Exploiting Self-Recovery and Temperature Awareness to Improve 3D NAND Flash Memory Reliability. Technical Re- port 2018-001, Carnegie Mellon Univ., SAFARI Research Group, 2018. 7.1, 7.2.1, 7.2.3 [193] Yixin Luo, Sriram Govindan, Bikash Sharma, Mark Santaniello, Justin Meza, Apoorv

226 Kansal, Jie Liu, Badriddine Khessib, Kushagra Vaid, and Onur Mutlu. Characterizing Application Memory Error Vulnerability to Optimize Datacenter Cost via Heterogeneous- Reliability Memory. In DSN, 2014. 2.1.3, 3.5.5, 3.5.6, 10.3.3 [194] Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu. WARM: Improving NAND Flash Memory Lifetime with Write-Hotness Aware Retention Management. In MSST, 2015. 1.3.1, 1.4, 2.1.3, 3.1, 3.2.6, 3.2.7, 5.6, 6.2.2, 7.4, 10.2.3 [195] Yixin Luo, Saugata Ghose, Yu Cai, Erich F Haratsch, and Onur Mutlu. Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory. IEEE JSAC, 34(9):2294–2311, 2016. 1.1.1, 1.1.1, 1.2.1, 1.3.2, 1.4, 2.1.3, 3.1.1, 3.1.2, 3.1.2, 3.1, 3.2.5, 3.3.1, 3.4.1, 6, 6.1, 2, 3, 6.2.4, 6.3.1, 6.3.1, 6.4.1, 6.5.1, 6.5.5, 3, 7.1.1, 7.1.2, 6, 7.1.3, 7.2.1, 7.3.1, 7.3.2, 7.3.3, 7.4 [196] Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu. An Accurate and Practical Threshold Voltage Distribution Model for Modern MLC NAND Flash Mem- ory. Technical Report 2016-006, Carnegie Mellon Univ., SAFARI Research Group, 2016. 5.3.4 [197] Yixin Luo, Saugata Ghose, Tianshi Li, Sriram Govindan, Bikash Sharma, Bryan Kelly, Amirali Boroumand, and Onur Mutlu. Using ECC DRAM to Adaptively Increase Memory Capacity. arXiv:1706.08870 [cs.AR], 2017. 3.5.6, 10.3.3 [198] Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu. Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation. In under submission to SIGMETRICS, 2018. 1.3.3, 1.4 [199] Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu. HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness. In HPCA, 2018. 1.3, 1.3.4, 1.4, 10.2.3 [200] S. Luryi, A. Kastalsky, A. C. Gossard, and R. H. Hendel. Charge Injection Transistor Based on Real-Space Hot-Electron Transfer. IEEE Trans. Electron Devices, Jun. 1984. 4 [201] Dongzhe Ma, Jianhua Feng, and Guoliang Li. LazyFTL: A Page-Level Flash Translation Layer Optimized for NAND Flash Memory. In SIGMOD, 2011. 1.2.3 [202] Dongzhe Ma, Jianhua Feng, and Guoliang Li. A survey of address translation technologies for flash memories. CSUR, 2014. 1.2.3 [203] D. J. C. MacKay and R. M. Neal. Near Shannon Limit Performance of Low Density Parity Check Codes. IET Electron. Lett., Mar. 1997. 1.1.2, 2.1.3, 3.3, 3.3.1, 3.3.1, 3.3.2 [204] David JC MacKay and Radford M Neal. Near Shannon Limit Performance of Low Density Parity Check Codes. Electronics letters, 1996. 1.1.2, 2.1.3, 3.3, 3.3.1, 3.3.1, 3.3.2 [205] A. Maislos. A New Era in Embedded Flash Memory. In Flash Memory Summit, 2011. 3.1, 3.2 [206] J. A. Mandelman, R. H. Dennard, G. B. Bronner, J. K. DeBrosse, R. Divakaruni, Y. Li, and C. J. Radens. Challenges and Future Directions for the Scaling of Dynamic Random- Access Memory (DRAM). IBM J. Research Develop., Mar. 2002. 3.5.2, 3.5.6 [207] A. Marelli and R. Micheloni. BCH and LDPC Error Correction Codes for NAND Flash
