A Multi-Threaded FTL for a Parallel IO Flash Card Under Linux

LFTL: A multi-threaded FTL for a Parallel IO Flash Card under Linux

Srimugunthan [1], K. Gopinath [2], Giridhar Appaji Nag Yasa [3]
Indian Institute of Science [1, 2], NetApp [3], Bangalore

arXiv:1302.5502v1 [cs.OS] 22 Feb 2013

Abstract

New PCI-e flash cards and SSDs supporting over 100,000 IOPS are now available, with several use cases in the design of a high performance storage system. By using an array of flash chips, arranged in multiple banks, large capacities are achieved. Such multi-banked architectures allow parallel read, write and erase operations. In a raw PCI-e flash card, such parallelism is directly available to the software layer. In addition, the devices have restrictions; for example, pages within a block can only be written sequentially. The devices also have larger minimum write sizes (> 4KB). Current flash translation layers (FTLs) in Linux are not well suited for such devices due to the high device speeds, architectural restrictions and other factors such as high lock contention. We present an FTL for Linux that takes into account the hardware restrictions and also exploits the parallelism to achieve high speeds. We also consider leveraging the parallelism for garbage collection by scheduling the garbage collection activities on idle banks. We propose and evaluate an adaptive method to vary the amount of garbage collection according to the current I/O load on the device.

1 Introduction

Flash memory technologies include NAND flash, NOR flash, SLC/MLC flash memories, and their hybrids. Because of their cost, speed and power characteristics, flash memory technologies are used at different levels of the memory hierarchy.

Flash memory chips were dominantly used in embedded systems for handheld devices. With recent advances, flash has graduated from being only a low performance consumer technology to also being used in the high performance enterprise area. Presently, flash memory chips are used as extended system memory, as a PCI express card acting as a cache in a disk based storage system, or as an SSD completely substituting disks.

However, flash memory has some peculiarities: a write is possible only on a fully erased block, and a block can be erased and rewritten only a limited number of times. Currently, the granularity of an erase (in terms of blocks) is much bigger than that of a write or read (in terms of pages). As the reuse of blocks depends on the lifetimes of data, different blocks may be rewritten a different number of times; hence, to ensure reliability, special techniques called wearlevelling are used to distribute the writes evenly across all the blocks.

Multiple terabyte-sized SSD block devices made with multiple flash chips are presently available. New PCI-e flash cards and SSDs supporting high IOPS rates (e.g. greater than 100,000) are also available. Due to limitations in scaling the size of flash memory per chip, parallelism is an inherent feature in all of these large flash storage devices.

Flash memory scaling
IC fabrication processes are characterized by a feature size, the minimum size that can be marked reliably during manufacturing. The feature size determines the surface area of a transistor, and hence the transistor count per unit area of silicon. As the feature size decreases, the storage capacity of a flash memory chip increases. But as the process feature size is reduced, problems arise; for example, bits can no longer be stored reliably. Sub-32 nm flash memory scaling has been a challenge in the past. Process innovations and scaling by storing multiple bits per cell have enabled bigger flash sizes per chip. As of this writing, the minimum feature size achieved is 19nm, resulting in an 8GB flash chip storing 2 bits per cell. But storing multiple bits per cell adversely affects the endurance of the flash[1], bringing the maximum erase cycles per block down to the range of 5000 to 10,000. It has been shown that scaling by packing more bits per cell degrades the write performance[10]. Also, the bus interface from a single NAND chip is usually a slow asynchronous interface operating at 40MHz. So, in order to achieve larger capacities and faster speeds, multiple flash memory chips are arrayed together and accessed simultaneously. These kinds of architectures are presently used in SSDs[7].

SSD architecture and interface
An SSD package internally has its own memory (usually battery backed DRAM), a flash controller and some firmware that implements a flash translation layer, which does some basic wearlevelling and exposes a block interface. Hard disks can be readily substituted by SSDs, because the internal architecture of SSDs is hidden behind the block interface. Because of the blackbox nature of SSDs, some changes have been proposed in the ATA protocol interface, such as the new TRIM command, which marks a block as garbage as a hint from an upper layer, such as a filesystem, to the SSD[2].

Raw flash card vs SSDs
Though most flash is sold in the form of SSDs, there are also a few raw flash cards used in some scenarios. Such flash cards pack hundreds of gigabytes in a single PCI-e card with an SSD-like architecture but without any on-board RAM or any firmware for wearlevelling.

The flash card that we studied is made of several channels or interfaces. Several flash chips make a bank and several banks make a channel or interface. It is possible to do writes or reads simultaneously across banks. The flash card has an in-built FPGA that supports a DMA write size of 32KB and a DMA read size of 4KB. Pages within a block can only be written sequentially, one after another.

It is important to mention that we have used a raw flash card and not an SSD for our study.

Flash translation layer
The management of raw flash can be implemented either within a device driver or in other software layers above it. The layer of software between the filesystem and the device driver that does the flash management is called the flash translation layer (FTL). The main functions of the FTL are address mapping, wearlevelling and garbage collection. Working in the context of Linux, we have designed a flash translation layer for a raw flash card which exploits the parallel IO capabilities. In addition to the main functionalities, our FTL also caches data before writing it to the flash.

The following are the main contributions of this work:
• We present a design of a flash translation layer (FTL) under Linux, which can scale to higher speeds. Our FTL also copes with device limitations like larger minimum write sizes by making use of buffering inside the FTL.
• We exploit the parallelism of a flash card with respect to block allocation, garbage collection and initial device scanning. The initial scan is time consuming for larger flash; hence exploiting parallelism here is useful.
• We give an adaptive method for varying the amount of on-going garbage collection according to the current I/O load.

Section 2 presents the background with respect to flash memories, flash filesystems, FTLs and the flash card used. Section 3 describes the design of our FTL. Section 4 presents the results. Section 5 discusses some issues in our FTL and future directions of work, and Section 6 concludes.

2 Background

Flash memory has a limited budget of erase cycles per block. For SLC flash memories it is in the hundreds of thousands, while for MLC flash it is in the tens of thousands.

Wearlevelling is a technique that ensures that all the blocks utilise their erase cycle budget at about the same rate. Associated with wearlevelling is garbage collection. Wearlevelling and garbage collection have three components:
• Out of place updates (referred to as dynamic wearlevelling): A rewrite of a page is written to a different page on the flash. The old page is marked as invalid.
• Reclaiming invalidated pages (referred to as garbage collection): To make the invalid pages writable again, the block with invalid pages has to be erased. If the block has some valid pages, they have to be copied to another block first.
• Moving older unmodified data to other blocks of similar lifetimes (referred to as static wearlevelling).

Because of the out-of-place update nature of the flash media, there is a difference between the logical address that a filesystem uses and the physical address that is used for writing to the flash. This logical address to physical address translation is provided by the flash translation layer.

Note that both garbage collection and static wearlevelling incur some extra writes. The metric write amplification quantifies the overhead due to garbage collection and static wearlevelling. Write amplification is the ratio of the total writes that happened on the flash to the number of user writes. A good wearlevelling algorithm minimises the write amplification as much as possible.

NFTL is one of the Linux FTLs and is available as part of the Linux code base. In Linux, flash devices are treated specially as mtd (memory technology device) devices. Flash specific characteristics, such as the asymmetric sizes and asymmetric access speeds for read, write and erase, the presence of bad blocks, and reading or writing to the spare area of the flash, make them different from a block device. The NFTL operates over an mtd device and exposes a block device, with sector size of 512
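
The interplay of out-of-place updates, garbage collection and write amplification described in Section 2 can be illustrated with a small simulation. The following is only a minimal sketch and not the FTL presented in this paper: the class `ToyFTL`, its page-level mapping table, the single reserved free block and the greedy choice of victim block are all assumptions made for illustration. Write amplification is computed here as total flash writes divided by user writes, the convention under which a smaller value is better.

```python
class ToyFTL:
    """Hypothetical page-mapped FTL: out-of-place updates plus greedy GC."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        # Assumes num_blocks >= 2 so that one erased block can be held in reserve.
        self.pages_per_block = pages_per_block
        # A block is the list of logical page numbers written to it, in order
        # (pages are programmed sequentially); None marks an invalidated page.
        self.blocks = [[] for _ in range(num_blocks)]
        self.l2p = {}                           # logical page -> (block, page)
        self.active = 0                         # block currently being filled
        self.free = list(range(1, num_blocks))  # erased blocks
        self.user_writes = 0
        self.flash_writes = 0
        self.erases = 0

    def _program(self, lpn):
        """Append lpn to the active block and update the mapping table."""
        block = self.blocks[self.active]
        block.append(lpn)
        self.l2p[lpn] = (self.active, len(block) - 1)
        self.flash_writes += 1

    def _valid(self, b):
        return [lpn for lpn in self.blocks[b] if lpn is not None]

    def _collect(self):
        """Garbage collection: copy valid pages out of a victim, then erase it."""
        in_use = [b for b in range(len(self.blocks)) if b not in self.free]
        # Greedy policy (an assumption): reclaim the block with fewest valid pages.
        victim = min(in_use, key=lambda b: len(self._valid(b)))
        survivors = self._valid(victim)
        if len(survivors) == self.pages_per_block:
            raise RuntimeError("no invalid pages left to reclaim")
        self.active = self.free.pop(0)          # reserved block becomes active
        for lpn in survivors:
            self._program(lpn)                  # extra writes caused by GC
        self.blocks[victim] = []                # erase the victim block
        self.free.append(victim)
        self.erases += 1

    def write(self, lpn):
        """User write of one logical page."""
        self.user_writes += 1
        if lpn in self.l2p:                     # out-of-place update:
            b, p = self.l2p[lpn]
            self.blocks[b][p] = None            # mark the old copy invalid
        if len(self.blocks[self.active]) == self.pages_per_block:
            if len(self.free) > 1:
                self.active = self.free.pop(0)
            else:
                self._collect()                 # keep one erased block in reserve
        self._program(lpn)

    def write_amplification(self):
        return self.flash_writes / self.user_writes


if __name__ == "__main__":
    ftl = ToyFTL(num_blocks=4, pages_per_block=4)
    for lpn in list(range(8)) + [0, 4, 0, 4, 0]:
        ftl.write(lpn)
    print(ftl.user_writes, ftl.flash_writes, ftl.erases)
    print(ftl.write_amplification())
```

With four blocks of four pages each, writing logical pages 0 to 7 once and then rewriting pages 0, 4, 0, 4, 0 results in 13 user writes and 14 flash writes: the one extra write is a valid page copied by garbage collection before its block is erased, giving a write amplification of 14/13. Static wearlevelling is not modelled in this sketch.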
