Solid State Disk Object-Based Storage with Trim Commands
Tasha Frankie, Member, IEEE, Gordon Hughes, Fellow, IEEE, and Ken Kreutz-Delgado, Senior Member, IEEE
Abstract—This paper presents a model of NAND flash SSD utilization and write amplification when the ATA/ATAPI SSD Trim command is incorporated into object-based storage under a variety of user workloads, including a uniform random workload with objects of fixed size and a uniform random workload with objects of varying sizes. We first summarize the existing models for write amplification in SSDs for workloads with and without the Trim command, then propose an alteration of the models that utilizes a framework of object-based storage. The utilization of objects and pages in the SSD is derived, with the analytic results compared to simulation. Finally, the effect of objects on write amplification and its computation is discussed along with a potential application to optimization of SSD usage through object storage metadata servers that allocate object classes of distinct object size.
Index Terms—Trim, Solid State Drive, Object-Based Storage, Write Amplification, NAND Flash, Markov processes.
1 INTRODUCTION

IN recent years, NAND flash solid state drives (SSDs) have become ubiquitous, appearing everywhere from small mobile electronics to large data centers. Part of the popularity of flash-based SSDs over traditional hard disk drives (HDDs) is the increased speed and performance that they offer. However, SSDs have their own set of problems, including a slowdown in performance as the device fills with data due to write amplification, and a limited lifetime due to wearout of the floating-gate transistors. Additionally, SSDs currently need to conform to legacy interfaces inherited from HDDs, which makes it more challenging to use the device optimally.

Object-based storage offers multiple advantages for systems using solid state storage devices [1]. When the multiple thousand pages in a flash block are divided into object records of larger size, the resulting smaller number of objects per block translates directly into fewer data write relocations during garbage collection, leading to lower write amplification and faster performance. This offers an opportunity for user application storage metadata, via metadata servers, to actively allow flash storage devices to self-optimize their hardware: to fully utilize the SSD write speed advantage over magnetic HDDs while ameliorating SSD limitations such as block erase and finite write lifetime. Object storage metadata servers could allocate SSDs to object classes [2] of distinct object size and thereby allow the SSD to internally manage its object physical mapping and relocation. This has the potential to improve I/O rate and QoS, user access blocking probability, and SSD flash chip lifetime through the reduction of write amplification.

In this paper, we present a model for computing utilization and write amplification under a modified random workload that includes Write and Trim commands and data records of varying sizes. This model is an extension of models proposed for simpler workloads by previous researchers [3], [4], [5], [6] and builds on our previous models of the Trim command [7], [8], [9]. Good models assist in the development of algorithms for managing the challenges of SSDs; our model adds flexibility to the theory in order to more closely match real-world workload characteristics.

• T. Frankie and K. Kreutz-Delgado are with the Department of Electrical and Computer Engineering, UC San Diego, La Jolla, 92093. E-mail: [email protected], [email protected]
• G. Hughes is with the Center for Magnetic Recording Research, UC San Diego, La Jolla, 92093. E-mail: [email protected]

arXiv:1210.5975v1 [cs.PF] 10 Oct 2012

2 NAND FLASH SSDS

In this section, we provide background on NAND flash SSDs that is necessary to understand the model presented in this paper.

2.1 Layout

The organization of a NAND flash SSD gives it characteristics that pose unique challenges. The smallest storage unit of flash is the page, which typically holds 2K, 4K, or 8K of data [10], [11]. Pages are grouped into blocks, with blocks typically containing 64, 128, or 256 pages, varying by manufacturer [10]. One challenge of flash SSDs is that a page must be erased before it can be programmed again, yet the smallest erase unit is the block, and not all pages in a block are ready to be erased at the same time. This means that write-in-place schemes are impractical, and most SSDs employ a dynamic logical-to-physical mapping called the flash translation layer (FTL) [12], [13]. With this dynamic mapping, when data is overwritten, the old physical location of the data is marked as invalidated in the FTL mapping. In this paper, we assume a log-structured file system, in which data is written to successively available pages, as described in [3].

2.2 Garbage Collection

Garbage collection is the process of selecting and erasing a block to reclaim space for writing. When a block is chosen for erasure, any valid pages on the block first need to be copied elsewhere to avoid data loss. Copying these valid pages leads to write amplification, a phenomenon in which the device performs more writes than are requested of it.

There are many algorithms for selecting a block for garbage collection, most of which are based on one of two schemes originally described in the seminal paper by Rosenblum and Ousterhout [14]. Their first method chooses the block whose erasure creates the most free space; their second method also accounts for the age of the block. In this paper, we utilize greedy garbage collection, in which the block with the fewest valid pages is selected.

2.3 Improving Performance

Reducing write amplification improves performance by increasing the average write speed, since the device performs fewer extra writes. Additionally, by reducing the number of program/erase cycles, SSD lifetime is extended, since flash devices wear out and lose the ability to store data after being written many times.

There are several ways to reduce write amplification. One common way is for manufacturers to provide extra storage space on the SSD beyond what the user is allowed to access, a practice called overprovisioning. Overprovisioning reduces write amplification because the amount of time between garbage collections increases, which tends to decrease the number of valid pages in the block selected for reclamation. We measure the amount of overprovisioning with the spare factor, Sf = (T − u)/T, 0 ≤ Sf ≤ 1, where T is the total number of pages on the SSD and u is the number of pages in the user space.

Another way to reduce write amplification is through use of the Trim command [15], which is a way for the file system to communicate to the SSD that certain data pages no longer need to be retained. The pages associated with the data are invalidated in the FTL mapping, so there are fewer valid pages that need copying during garbage collection.

3 ANALYZING PERFORMANCE

Several researchers have created analytic models to help understand the performance of SSDs. In this section, we summarize the previous work that we build on in this paper.

3.1 Uniform Random Workload

The type of workload used can have a large effect on the performance of a storage device [16]. We view real-world workloads as existing along a continuum from purely sequential workloads to completely random workloads. For SSDs, analysis of a purely sequential workload is uninteresting, because entire blocks of data are invalidated before garbage collection, leading to no write amplification [3]. At the other end of the spectrum, however, completely random workloads are very interesting to analyze statistically, since the random nature of the writes results in most blocks still containing valid pages at the time of garbage collection, leading to potentially high write amplification. Hu, et al. [3], [4], Agarwal and Marrow [5], and Xiang and Kurkoski [6] have all proposed models for the write amplification of the uniform random workload under greedy garbage collection. Their workloads consist of write requests distributed uniformly at random over the user space, with each requested logical block address (LBA) requiring one page to store its data. Each of their models found that the single most important factor in determining the write amplification is the level of overprovisioning provided on the SSD.

3.2 Trim-Modified Uniform Random Workload

Previously, we modified the uniform random workload to incorporate Trim requests, then analyzed the resulting utilization and write amplification [7], [8], [9]. In the Trim-modified uniform random workload, both Write and Trim requests are allowed, and the data associated with each LBA occupies one page. As in the standard uniform random workload, write requests are chosen uniformly at random over all u user LBAs, and Trim requests are chosen uniformly at random over all In-Use LBAs, where an LBA is defined as In-Use if its most recent request was a Write. Trim requests occur with probability q, and write requests occur with probability 1 − q. This workload was modeled as a Markov birth-death chain with the state being the number of In-Use LBAs, and the steady state solution was approximated as Gaussian with mean us and variance us̄, where s = (1 − 2q)/(1 − q) and s̄ = q/(1 − q) depend on the rate of Trim.

3.2.1 Effective Overprovisioning

Since the number of In-Use LBAs is the same as the number of valid pages, the level of effective overprovisioning is straightforward to calculate. The effective overprovisioning is measured in terms of the effective spare factor, Seff = (T − Xn)/T, where Xn is the
After paralleling the we assume that when an object ID is selected for derivation of Xiang and Kurkoski [6], we discovered trimming (or, more appropriately for objects, dele- that all three previously proposed models for write tion), all pages associated with that object ID are amplification could be modified by using the effective invalidated. Likewise, when an object is rewritten, all level of overprovisioning. In simulation, the Trim- pages previously containing data associated with that modified model of Xiang and Kurkoski provided the object ID are invalidated. Note that there are now two best prediction; the Trim-modified models of Hu, et quantities of interest: the number of object IDs that are al., and of Agarwal and Marrow were optimistic in in-use, and the number of pages that are valid. When their predictions of write amplification [8], [9]. each object occupies one page, these two quantities Additionally, we separated the workload into hot are the same. (frequently accessed) and cold (infrequently accessed) Define the following variables: T data in order to closer approximate a real-work work- χt = (χ1,t, χ2,t, . . . , χu,t) is a vector of indicator load. Analysis of this workload led to a more general random variables at time t. workload with N temperatures, and found that the write amplification of this workload type was lowest ( N 1 ω ∈ Ai (object i is In-Use) when the temperatures were kept physically sepa- χi,t = 1Ai (ω) = rated into different blocks [9]. 0 otherwise (object i is not In-Use) where 4 MODEL AUGMENTATION Ai = {ω|the STATE(ω) is such that location i is occupied} In the previous work described in Section 3, one page stored the data associated with one LBA. In this section, we augment the model to allow larger Pu Xt = i=1 χi,t is a random variable representing amounts of data to be stored under a single identifier, the number of objects in-use at time t. 
The mean and which lends itself naturally to a discussion of object- variance of this random variable at steady state were based storage. computed previously by Frankie, et al. [7], [9]. T Zt = (Z1,t,Z2,t,...,Zu,t) is a vector of the size of 4.1 From LBAs to Objects each object at time t. t In object-based storage, the filesystem does not assign Now, compute the number of valid pages at time : Y = ZTχ = Pu χ Z data to particular LBAs. Instead, the filesystem tells t t t i=1 i,t i,t the object-based storage disk to store certain data; the We are concerned with the values at steady state, disk chooses where to store it, and gives the filesys- and drop the time index to indicate the steady state tem a unique object ID with which the data can be solution. retrieved and/or modified [17]. Object-based storage allows for variable-sized objects with identifiers that 4.2 Fixed Sized Objects need not indicate any information about the physical location of the data [18]. Rajimwale, et al. have pro- This workload is an extension of the original work- posed that object-based storage is an optimal interface load whose steady state pdf was derived in [7]. How- for flash-based SSDs [19]. We use object-based storage ever, objects now occupy b pages instead of 1 page. as motivation for extensions to the theory of effective Recall that the number of In-Use objects is the same as overprovisioning and write amplification described in the number of In-Use LBAs previously derived; only Section 3. the number of valid pages will change. We assume The utilization problem we solved previously [7] that at least ub pages are allowed for the user to write can be viewed in object space instead of LBA space. to, so that it is not possible to request more space than Instead of LBAs that are In-Use or not In-Use, there is available. 4
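The fixed-size workload just described can be checked with a small Monte Carlo simulation. The sketch below is illustrative only (the function name, parameter values, and step counts are ours, not from the paper): it drives the birth-death chain of Section 3.2 over u object IDs, counts b valid pages per In-Use object, and compares the time-averaged number of valid pages with b·us, where s = (1 − 2q)/(1 − q).

```python
import random

def mean_valid_pages(u, q, b, steps=300_000, burn_in=30_000, seed=0):
    """Monte Carlo estimate of the steady-state mean number of valid pages
    for the Trim-modified workload with u object IDs of fixed size b pages.

    Each step is a Trim with probability q (uniform over In-Use IDs, a no-op
    if nothing is In-Use) or a Write with probability 1 - q (uniform over all
    u IDs; a rewrite of an In-Use ID leaves the In-Use count unchanged)."""
    rng = random.Random(seed)
    x = 0                          # number of In-Use object IDs
    y_sum = 0.0
    for t in range(steps):
        if rng.random() < q:
            if x > 0:
                x -= 1             # Trim invalidates all b pages of one object
        elif rng.random() < (u - x) / u:
            x += 1                 # Write landed on a not-In-Use object ID
        if t >= burn_in:
            y_sum += b * x         # valid pages at this step: b per In-Use object
    return y_sum / (steps - burn_in)

u, q, b = 500, 0.25, 4
s = (1 - 2 * q) / (1 - q)          # steady-state utilization fraction
est = mean_valid_pages(u, q, b)
print(est, b * u * s)              # simulated vs. predicted mean valid pages
```

With u = 500, q = 0.25, and b = 4, the prediction is b·us = 4 · 500 · (2/3) ≈ 1333 pages, and the simulated time average should fall close to it.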
For this workload, Zt = (b, b, ..., b)^T for all t. This means

Yt = Σ_{i=1}^u χi,t Zi,t = Σ_{i=1}^u χi,t b = b Σ_{i=1}^u χi,t = b Xt

To compute the steady state mean and variance of Y, we can use the known steady state mean and variance of X:

E[Y] = E[bX] = b E[X] = bus

Similarly,

Var(Y) = Var(bX) = b² Var(X) = b² us̄

4.3 Variable Sized Objects

When the object size is redrawn each time an object is written, the Zi values are iid; let mZ = E[Zi] denote the mean object size. Since χi is independent of Zi,

E[Y] = E[Σ_{i=1}^u χi Zi] = Σ_{i=1}^u E[χi] E[Zi] = mZ E[X] = mZ us    (1)

The variance is more complicated because of cross-terms. The χi values are not independent, so it is not simple to compute the variance. To compute the variance of Y, compute

Var(Y) = E[Y²] − E[Y]²

The value of E[Y] was computed in Equation 1. Computing E[Y²] is more complicated. When following the steps below, note that χi² = χi, that χi is independent of Zi and Zj, and that the Zi values are iid:

E[Y²] = E[(Σ_{i=1}^u χi Zi)²] = E[Σ_{i=1}^u χi² Zi² + 2 Σ_{i<j} χi χj Zi Zj]
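The mean mZ·us of Equation 1 can likewise be checked empirically. In the sketch below (again with illustrative names and parameters of our choosing), each Write redraws the object's size iid uniformly from {1, 2, 3, 4}, so the mean object size is mZ = 2.5, and a Trim or rewrite invalidates all pages the object previously held.

```python
import random

def mean_valid_pages_var(u, q, draw_size, steps=200_000, burn_in=20_000, seed=1):
    """Monte Carlo estimate of E[Y] when each Write redraws an object's size
    (in pages) iid from draw_size; a Trim or rewrite invalidates all pages
    previously associated with that object ID."""
    rng = random.Random(seed)
    size = [0] * u                 # size[i] = valid pages of object i (0 = not In-Use)
    valid = 0                      # current number of valid pages, Y_t
    y_sum = 0.0
    for t in range(steps):
        if rng.random() < q:       # Trim request (no-op if nothing is In-Use)
            if valid > 0:
                while True:        # rejection-sample a uniformly random In-Use ID
                    i = rng.randrange(u)
                    if size[i] > 0:
                        break
                valid -= size[i]   # all pages of object i are invalidated
                size[i] = 0
        else:                      # Write request, uniform over all u IDs
            i = rng.randrange(u)
            new = draw_size(rng)
            valid += new - size[i] # a rewrite invalidates the old pages first
            size[i] = new
        if t >= burn_in:
            y_sum += valid
    return y_sum / (steps - burn_in)

u, q = 200, 0.3
s = (1 - 2 * q) / (1 - q)
m_z = 2.5                          # mean of sizes drawn uniformly from {1, 2, 3, 4}
est = mean_valid_pages_var(u, q, lambda r: r.randint(1, 4))
print(est, m_z * u * s)            # simulated vs. predicted m_Z * u * s
```

Note that the rejection loop selects uniformly over In-Use IDs regardless of object size, matching the model's assumption that Trim requests are uniform over In-Use objects rather than weighted by the pages they occupy.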