UNIVERSITY OF CALIFORNIA, SAN DIEGO
Model and Analysis of Trim Commands in Solid State Drives
A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy
in
Electrical Engineering (Intelligent Systems, Robotics, and Control)
by
Tasha Christine Frankie
Committee in charge:
Ken Kreutz-Delgado, Chair
Pamela Cosman
Philip Gill
Gordon Hughes
Gert Lanckriet
Jan Talbot
2012
The Dissertation of Tasha Christine Frankie is approved, and it is acceptable in quality and form for publication on microfilm and electronically:
Chair
University of California, San Diego
2012
TABLE OF CONTENTS
Signature Page ...... iii
Table of Contents ...... iv
List of Figures ...... vi
List of Tables ...... x
Acknowledgements ...... xi
Vita ...... xii
Abstract of the Dissertation ...... xiv
Chapter 1 Introduction ...... 1
Chapter 2 NAND Flash SSD Primer ...... 5
2.1 SSD Layout ...... 5
2.2 Greedy Garbage Collection ...... 6
2.3 Write Amplification ...... 8
2.4 Overprovisioning ...... 10
2.5 Trim Command ...... 11
2.6 Uniform Random Workloads ...... 11
2.7 Object Based Storage ...... 13
Chapter 3 Trim Model ...... 15
3.1 Workload as a Markov Birth-Death Chain ...... 15
3.1.1 Steady State Solution ...... 17
3.2 Interpretation of Steady State Occupation Probabilities ...... 18
3.3 Region of Convergence of Taylor Series Expansion ...... 22
3.4 Higher Order Moments ...... 23
3.5 Effective Overprovisioning ...... 24
3.6 Simulation Results ...... 26
Chapter 4 Application to Write Amplification ...... 35
4.1 Write Amplification Under the Standard Uniform Random Workload ...... 35
4.2 Write Amplification Analysis of Trim-Modified Uniform Random Workload ...... 37
4.3 Alternate Models for Write Amplification Under the Trim-Modified Uniform Random Workload ...... 41
4.4 Simulation Results and Discussion ...... 41
Chapter 5 Workload Variation: Hot and Cold Data ...... 47
5.1 Mixed Hot and Cold Data ...... 48
5.2 Separated Hot and Cold Data ...... 50
5.3 Simulation Results ...... 51
5.4 N Temperatures of Data ...... 54
Chapter 6 Workload Variation: Variable Sized Objects ...... 56
6.1 From LBAs to Objects ...... 57
6.2 Fixed Sized Objects ...... 58
6.3 Variable Sized Objects ...... 59
6.4 Simulation Results ...... 62
6.5 Application to Write Amplification ...... 67
Chapter 7 Conclusion ...... 72
Appendix A Asymptotic Expansion Analysis ...... 77
A.1 Useful Theorems and Facts ...... 77
A.2 Asymptotic Expansion Analysis ...... 80
A.3 An Accurate Gaussian Approximation ...... 81
A.4 A Higher Order PDF Approximation ...... 84
A.4.1 Approximating the Mean ...... 92
A.4.2 Approximating the Variance ...... 93
A.4.3 Approximating the Skew ...... 95
A.4.4 Approximating the Kurtosis ...... 96
A.5 Accuracy of the Truncation PDFs ...... 96
A.6 The K-L Divergence to and from the Normal Distribution ...... 97
A.6.1 Kullback-Leibler Divergence ...... 99
A.6.2 Application to PX ...... 100
References ...... 104
LIST OF FIGURES
Figure 2.1: Schematic of greedy garbage collection. Data is written to the current writing block, which is the first block in the free block pool. When garbage collection is triggered, the block with fewest valid pages is selected from the occupied block queue. After its valid pages are copied to the current writing block, the selected block is erased and appended to the end of the free block pool...... 7
Figure 3.1: Steady state distribution of the number of In-Use LBAs for the Trim-modified uniform random workload under various amounts of Trim, with u = 1000. Note that a higher percentage of Trim requests in the workload results in a lower mean but a higher variance. ...... 19
Figure 3.2: Steady state distribution of the effective spare factor for the Trim-modified uniform random workload with 25% of requests Trim over 1000 user-allowed LBAs, and various amounts of manufacturer specified spare factor. Note the improvement in the effective spare factor over the manufacturer specified spare factor has a larger variance for smaller manufacturer specified spare factor. ...... 25
Figure 3.3: Comparison of analytic and simulation PDF, for 40% Trim. The analytic values closely match the simulation results. ...... 26
Figure 3.4: A histogram of one run of simulation results for u = 1000 and q = 0.4, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black dots, and the Gaussian approximation of the PDF (Equation 3.3) in green. The line for the exact analytic PDF is not visible because the line for the Gaussian approximation of the PDF covers it completely. The Stirling approximation of the PDF (black dots) lines up well with the Gaussian approximation and the (covered) analytic PDF. Even at this low value of u = 1000, the theoretical approximations are effectively identical to the exact analytic PDF. ...... 27
Figure 3.5: A histogram of one run of simulation results for u = 25 and q = 0.3, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black, and the Gaussian approximation of the PDF (Equation 3.3) in green. A slight negative skew of the exact analytic PDF is evident when compared to the Gaussian approximation. Note that the exact analytic PDF is the best fit of the histogram. A series of 64 Monte-Carlo simulations with these values results in an average skew of mean ± one standard deviation = −0.299 ± 0.004, which agrees well with the theoretically computed approximation value of −0.306 ± R(σ³) = −0.306 ± R(0.029). The average simulation kurtosis is 0.064; the approximate theoretical value is 0.070; both values are well within overlapping error bounds. The top graph is a plot of the numerically normalized equations as shown in the text. In the bottom graph, a half-bin shift correction has been made to the Stirling and Gaussian approximation densities as described in the text. ...... 28
Figure 3.6: Theoretical results for utilization under various amounts of Trim. When more Trim requests are in the workload, a lower percentage of LBAs are In-Use at steady state, but the variance is higher than seen in a workload with fewer Trim requests. ...... 30
Figure 3.7: Theoretical results for the effective spare factor under various amounts of Trim, with a manufacturer specified spare factor of 0.11. When more Trim requests are in the workload, a higher effective spare factor is seen, but with a higher variance than seen in a workload with fewer Trim requests. ...... 31
Figure 3.8: Mean and 3σ values of effective spare factor versus manufacturer specified spare factor. This shows that an SSD with zero manufacturer specified spare factor effectively has a spare factor of 33% when the workload contains 25% Trim requests. This level of overprovisioning without the Trim command requires 50% more physical storage than the user is allowed to access. The ±3σ values were calculated using a small Tp = 1000, so they would be visible. Typical devices have Tp >> 1000, resulting in a negligible variance. ...... 32
Figure 3.9: Theoretical results comparing the effective spare factor to the manufacturer specified spare factor. The mean and ±3σ values are shown. For this figure, Tp = 1280 was used so that the 3σ values are visible. Larger gains in the effective spare factor are seen for smaller specified spare factor. In this figure, we see that a workload of 10% Trim requests transforms an SSD with a manufacturer specified spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical overprovisioning alone would require 12.5% more physical space than the user is allowed to access! ...... 33
Figure 4.1: Equations needed to compute the empirical model-based algorithm for A_Hu (adapted from [1, 2]). w is the window size, and allows for the windowed greedy garbage collection variation. Setting w = T_p/n_p − r is needed for the standard greedy garbage collection discussed in this dissertation. For an explanation of the theory behind these formulas, see [1, 2]. ...... 36
Figure 4.2: Write amplification versus probability of Trim under a Trim-modified uniform random workload and greedy garbage collection, when Sf = 0.2. The plots labeled Hu, Agarwal, and Xiang all show the Trim-modified models for write amplification. A^(T)_Agarwal gives such an optimistic prediction that for large q it foresees fewer actual writes than requested writes! A^(T)_Hu gives a more realistic, but still optimistic, prediction for write amplification. A^(T)_Xiang gives the best, and generally quite accurate, prediction for write amplification. ...... 42
Figure 4.3: Simulation and theoretical results for write amplification under various amounts of Trim and Sf = 0.1. The Trim-modified Xiang calculation is a very good match. The Trim-modified calculations for Hu and Agarwal are optimistic about the write amplification, with the Agarwal calculation being so optimistic as to give an impossible write amplification of less than 1 for high amounts of Trim! ...... 43
Figure 5.1: Write amplification for mixed hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. Notice that the theoretical values, based only on the level of effective overprovisioning, are optimistic when cold data accounts for more than 20% of the user space...... 49
Figure 5.2: Write amplification for separated hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. After allocating the minimum number of blocks needed to store all of the user-allowed data for each temperature of data, the remaining blocks were split with the fraction shown on the x-axis. In this case, the optimal split was 50-50; further investigation of the optimization problem is needed to determine whether this is always the case. ...... 52
Figure 5.3: Write amplification for mixed hot and cold data compared to separated hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. For the separated hot and cold data, extra blocks were allocated evenly between the two data types. The theoretical values for separated data are a good prediction of the simulation values. Note that the separated hot and cold data has superior write amplification to the mixed hot and cold data when cold data is more than 20% of the workload. ...... 53
Figure 6.1: Comparison of analytic and simulation PDF for the number of valid pages when objects are a fixed size of 32 pages. For this example, u = 1000 and q = 0.2. The mean is shifted, and the variance is larger than if objects are a fixed size of 1 page. ...... 64
Figure 6.2: Comparison of analytic and simulation PDF for the number of valid pages when the size of objects is drawn from a discrete uniform distribution over [1, 32]. For this example, u = 1000 and q = 0.2. ...... 67
Figure 6.3: Comparison of analytic and simulation PDF for the number of valid pages when the size of objects is drawn from a binomial distribution with parameters p = 0.4 and n = 32. For this example, u = 1000 and q = 0.2. ...... 68
Figure 6.4: Write amplification versus pages per block for constant effective overprovisioning of Seff = 0.29. The modified Xiang model (Equation 6.3) fails when the number of pages per block is low. The Trim-modified Hu model provides an optimistic, but reasonable, prediction. ...... 71
LIST OF TABLES
Table 3.1: List of common variables and their descriptions, as defined in this dissertation. ...... 16
Table 3.2: Mean and standard deviation (σ) for Number of In-Use LBAs computed over 128 Monte-Carlo simulations. Values in this table were computed with u = 1000. ...... 29
Table 4.1: Analytic and simulation write amplification results for various values of probability of Trim, q, and two choices of manufacturer specified spare factor. In all cases, u/np = 1024, and np = 256. Five Monte-Carlo simulations were run for each simulation value in the table, with each simulation having identical results, measured to three significant digits, for write amplification. A plot for the case Sf = 0.2 (equivalently ρ = 0.25) is shown in Figure 4.2...... 44
Table 6.1: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Fixed Sized Objects. Values in this table were computed with u = 1000. The size of the objects was fixed at 32. Simulation values are the average of 64 Monte-Carlo simulations. ...... 63
Table 6.2: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Variable Sized Objects. Values in this table were computed with u = 1000. The size of the objects was drawn from a discrete uniform distribution over [1, 32]. Simulation values are the average of 64 Monte-Carlo simulations. ...... 65
Table 6.3: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Variable Sized Objects. Values in this table were computed with u = 1000. The size of the objects was drawn from a binomial distribution with parameters p = 0.4 and n = 32. Simulation values are the average of 64 Monte-Carlo simulations. ...... 66
Table 6.4: Write amplification for varying number of pages per block, equivalent to fixed sized objects that evenly divide np. The level of effective overprovisioning was kept constant at Seff = 0.29 for each simulation. The amount of Trim was q = 0.1. All models referenced are Trim-modified; the Xiang model was additionally modified to account for np, as shown in Equation 6.3. ...... 70
ACKNOWLEDGEMENTS
Chapters 1, 2, 3, and 7, in part, are a reprint of the material as it appears in “A Mathematical Model of the Trim Command in NAND-Flash SSDs,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper.

Chapters 1, 2, 4, and 7, in part, are a reprint of the material as it appears in “The Effect of Trim Requests on Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapters 1, 2, 3, 4, 5, and 7, in part, have been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapters 1, 2, 6, and 7, in part, have been submitted for publication of the material as it may appear in “Solid State Disk Object-Based Storage with Trim Commands,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.
VITA
2003 B. S. in Electrical and Computer Engineering, California Institute of Technology
2006 M. S. in Electrical Engineering (Intelligent Systems, Robotics, and Control), University of California, San Diego
2012 Ph. D. in Electrical Engineering (Intelligent Systems, Robotics, and Control), University of California, San Diego
PUBLICATIONS
T. Frankie, G. Hughes, K. Kreutz-Delgado, “Solid State Disk Object-Based Storage with Trim Commands”, IEEE Transactions on Computers, 2012 (in submission).
T. Frankie, G. Hughes, K. Kreutz-Delgado, “Analysis of Trim Commands on Over- provisioning and Write Amplification in Solid State Drives”, ACM Transactions on Storage (TOS), 2012 (in submission).
T. Frankie, G. Hughes, K. Kreutz-Delgado, “The Effect of Trim Requests on Write Amplification in Solid State Drives”, Proceedings of the IASTED-CIIT Conference. IASTED, 2012.
T. Frankie, G. Hughes, K. Kreutz-Delgado, “A Mathematical Model of the Trim Command in NAND-Flash SSDs”, Proceedings of the 50th Annual Southeast Re- gional Conference. ACM, 2012.
T. Vanesian, K. Kreutz-Delgado, G. Burgin, D. Fogel, “Algorithmic Tools for Ad- versarial Games: Intent Analysis Using Evolutionary Computation”, Proceedings of the IEEE Symposium on Computational Intelligence in Security and Defense Applications, 2007.
OTHER WORKSHOP AND CONFERENCE PRESENTATIONS
T. Frankie, G. Hughes, and K. Kreutz-Delgado, “A Statistical Model of Overpro- visioning”, Non-Volatile Memories Workshop, UCSD, March 2012.
T. Frankie, G. Hughes, and K. Kreutz-Delgado, “Improved Effective Overprovi- sioning Using SSD Trim Commands”, CMRR Research Review, UCSD, October 2011.
T. Frankie, G. Hughes, and K. Kreutz-Delgado, “SSD Trim Commands Considerably Improve Overprovisioning”, Flash Memory Summit, San Jose, August 2011.
T. Frankie, G. Hughes, and K. Kreutz-Delgado, “Impact of the Trim Command on Write Amplification”, Non-Volatile Memories Workshop, UCSD, March 2011. Also presented at the Jacobs School of Engineering Research Expo, UCSD, April 2011.
ABSTRACT OF THE DISSERTATION
Model and Analysis of Trim Commands in Solid State Drives
by
Tasha Christine Frankie
Doctor of Philosophy in Electrical Engineering (Intelligent Systems, Robotics, and Control)
University of California, San Diego, 2012
Ken Kreutz-Delgado, Chair
NAND flash solid state drives (SSDs) have recently become popular storage alternatives to traditional magnetic hard disk drives (HDDs), due partly to their superior write speed. However, SSDs suffer from a decrease in write speed as they fill with data, due largely to write amplification, a phenomenon in which more writes than requested are performed by the device. Use of the Trim command is known to improve NAND flash SSD performance; the Trim command informs the SSD flash controller which data records are no longer necessary and need not be copied during garbage collection.

In this dissertation, we analytically model the amount of improvement provided by the Trim command for uniform random workloads in terms of effective overprovisioning, a measure of the device utilization. We show the effective spare factor is Gaussian with mean and variance dependent on the percentage of Trim requests in the workload and the manufacturer-provided physical overprovisioning; the variance is also dependent on the total storage capacity of the SSD. We then utilize this information to compute the expected write amplification, and verify our results by simulation. Our theoretical (formula-based) prediction suggests, and our simulations verify, that a considerable write amplification reduction is found as the proportion of Trim requests to Write requests increases. In particular, write amplification drops by almost half over the no-Trim case if just 10% of the requests are Trim instead of Write.

We extend our models of effective overprovisioning and write amplification to allow for variation of the workload. We explore data of varying sizes as well as data with varying frequency of access. These extensions allow the flexibility needed to more closely model real-world workloads.

Models predicting the write amplification for workloads including Trim requests can be used in real-world situations, ranging from helping cloud storage data center operators manage resources to reduce costs without sacrificing performance, to optimizing the performance of apps on mobile devices.
Chapter 1
Introduction
Portable devices such as smartphones and tablets are increasing in popularity, feeding the demand for storage that meets the unique requirements of small devices with a limited battery power supply. Additionally, consumers expect instantaneous responses from their electronic devices, making the speed of the storage device a key element as well. NAND flash based solid state drives (SSDs) meet these storage needs with their small form factor, low power consumption, and fast speed compared to magnetic hard disk drives (HDDs) [3].

NAND flash SSDs are frequently found in consumer applications such as mobile devices. Although the advertised sequential write speed of NAND flash exceeds the speeds of currently available wireless networks, Kim, et al. found that in practice, the speed of the storage on a mobile device has a significant impact on the performance of popular apps, regardless of the speed of the network [4]. It is clear that manufacturers of faster SSDs have a competitive advantage, especially when the cost per unit of capacity is comparable across products.

SSDs are not exclusively used in consumer applications; they have become a ubiquitous storage solution in enterprise applications as well. One example of an enterprise application that utilizes SSDs is cloud storage, which has become an attractive option for storing large amounts of data off-site. When choosing a cloud storage provider, in addition to examining the cost per unit of storage, customers should also take into account performance aspects such as latency [5]. Use of low-latency storage like SSDs helps to reduce the storage bottleneck and
improve performance, making data storage and retrieval speeds more appealing to users [6].

Despite their popularity, SSDs have some drawbacks, notably a performance slowdown due to write amplification when the device is filled with data, and a limited lifetime due to wearout of the floating-gate transistors, which limits the number of program/erase cycles before device failure. Additionally, SSDs currently need to conform to legacy interfaces from HDDs, which makes performance optimization of SSDs more challenging. Even though there is a performance increase when moving from magnetic storage to flash SSDs, it is evident that further improvement of the performance characteristics of SSDs can help to enhance the end-user experience.

One recent change to the legacy HDD interface, the Trim command, helps increase the performance of SSDs. Although the Trim command is known to improve performance, models explaining why there is an improvement and quantifying the amount of the improvement have not yet been proposed in the scientific literature. The Trim analysis models presented in this dissertation are original first investigations necessary to advance the important design decisions concerning SSD architecture.

The models we present in this dissertation apply directly to the Trim-modified uniform random workload, described in Section 2.6, as well as to variations introduced to allow flexibility in that workload, including frequently and infrequently written (“hot and cold”) data, and data of varying sizes. Testing the applicability of the models on real-world workloads is desirable, but I/O traces¹ from SSDs, especially those containing the Trim command, are not easily available [7]. However, our modeled workload and variations are good approximations to the real-world HDD workloads analyzed by Wang, et al. [7], giving us confidence in the applicability of our Trim analysis models to real world systems.
A good model of the SSD behavior in response to the Trim command is important because it allows system designers to assess how changes in the amount of Trim in the workload alter the system’s performance. This helps to identify key points in the overall SSD architecture that can affect performance, and also leads to an improved understanding of the benefits and drawbacks of proposed computer architecture changes. Models offer insights into systems, and pave the way for the conception of new ideas and algorithms, which can lead to improved SSD architecture and future SSD interfaces.

In this dissertation, we model the performance improvement when the Trim command is implemented in SSDs, and offer an explanation for why the improvement is so sizeable for even a small amount of Trim added to the workload. In Chapter 2, we give the necessary background information about NAND flash SSDs for the reader to understand the subsequent theory. We then describe the impact on device utilization, measured as effective overprovisioning, under a Trim-modified uniform random workload in Chapter 3, and the resulting decrease in write amplification in terms of the proportion of Trim requests in the workload in Chapter 4. Finally, in Chapters 5 and 6, we examine variations to the Trim-modified uniform random workload that allow closer modeling of real-world workloads before we conclude in Chapter 7.

¹ I/O traces record, along with other information, logical block level input and output commands from the file system to the storage device. I/O traces are often used as a reproducible way to benchmark the performance of a system.
Chapter Acknowledgements

Chapter 1, in part, is a reprint of the material as it appears in “A Mathematical Model of the Trim Command in NAND-Flash SSDs,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 1, in part, is a reprint of the material as it appears in “The Effect of Trim Requests on Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 1, in part, has been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 1, in part, has been submitted for publication of the material as it may appear in “Solid State Disk Object-Based Storage with Trim Commands,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 2
NAND Flash SSD Primer
In this chapter, we provide background about NAND flash solid state drives (SSDs) that is necessary to understand the models presented in this dissertation.
2.1 SSD Layout
The smallest write unit on a NAND flash chip is a page, which stores a fixed amount of data, often 2 KB or 4 KB. Pages are grouped into physical blocks, which are the smallest erase unit on a flash device¹. The number of pages per block varies by device, but is often 32, 64, 128, or 256 pages [8, 9]. Flash chips contain multiple blocks to reach the desired capacity of the device, and a single SSD may contain multiple flash chips. We denote the total number of blocks on the device by t, and the number of pages per block by np.

Unlike magnetic media, NAND flash SSDs cannot simply overwrite data in place. Writing, or programming, NAND flash can only change bits from 1 to 0; to change a bit back to 1, an erase cycle must occur. Writing can affect a single page, but erasure happens at the block level. Since all pages in a block must be erased at the same time, write-in-place schemes are impractical. For this reason, a flash translation layer (FTL) dynamically maps logical block addresses
¹ Note that physical blocks are different from logical blocks, and which type is meant when the word “block” is used needs to be determined from the context. In this dissertation, we attempt to specify logical or physical block if the context is potentially unclear.
(LBAs) to physical locations. The FTL also performs garbage collection: the process of choosing a used physical block to reclaim, copying its valid pages to an unused block, then erasing the old block so that it is ready for writing again. The extra writes performed when copying valid pages to avoid data loss contribute to a phenomenon called write amplification, which causes a reduction in write speed and consumes more of the finite number of writes in the lifetime of each block.

There are many different algorithms used for FTLs, with varying performance for access speed [10, 11]; a survey and taxonomy of these is found in [12, 10]. These algorithms are often quite complex and beyond current analytic mathematical models. The choice of FTL is beyond the scope of this dissertation, but it is important to note that when overwriting a logical block, good FTLs will place the new data in a different physical location and invalidate the pages where the data was previously stored. In this research, we utilize a log-structured style of data placement [13] and the greedy garbage collection policy, described in Section 2.2, which has been previously analyzed by Hu, et al. [1, 2], Agarwal and Marrow [14], and Xiang and Kurkoski [15] to determine the amount of write amplification in workloads without Trim.
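The invalidate-on-overwrite behavior described above can be sketched with a toy page-level map. This is a hypothetical illustration in Python; the class name, its fields, and the single-block placement are our own simplifications, not an FTL from the literature:

```python
# Toy FTL page map: overwriting an LBA places the new data at a fresh
# physical page and invalidates the old one, rather than writing in place.
class PageMapFTL:
    def __init__(self):
        self.map = {}        # LBA -> (block, page)
        self.valid = set()   # physical pages currently holding valid data
        self.next_page = 0   # stand-in for "next free page on the writing block"

    def write(self, lba):
        old = self.map.get(lba)
        if old is not None:
            self.valid.discard(old)      # invalidate; never overwrite in place
        loc = ("blk0", self.next_page)   # placement policy elided
        self.next_page += 1
        self.map[lba] = loc
        self.valid.add(loc)

    def trim(self, lba):
        old = self.map.pop(lba, None)
        if old is not None:
            self.valid.discard(old)      # page becomes garbage-collectable

ftl = PageMapFTL()
ftl.write(7)
ftl.write(7)                 # overwrite: one valid page, one invalidated page
assert len(ftl.valid) == 1
ftl.trim(7)                  # Trim invalidates without writing anything
assert len(ftl.valid) == 0
```

Note how Trim differs from overwrite: both invalidate the old physical page, but Trim performs no new write, which is exactly why trimmed pages reduce the copying work done later by garbage collection.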
2.2 Greedy Garbage Collection
At some point in the process of writing data, most of the pages will be filled with either valid or invalidated data, and more space is needed for writing fresh data. Garbage collection occurs in order to reclaim additional blocks. In this process, a victim block is chosen for reclamation, its valid pages are copied elsewhere to avoid data loss, then the entire block is erased and the FTL map pointer is changed. Many algorithms exist for choosing the victim block, most of which are based on either selecting the block that creates the most free space or one that also takes into account the age of the data on the block, as originally described in the seminal paper by Rosenblum and Ousterhout [16]. A discussion of proposals for garbage collection can be found in [16, 17, 18].
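Why victim choice matters can be seen from a small steady-state accounting argument (our own illustration, not taken from the cited references): if a reclaimed block holds on average c valid pages out of np, each reclamation frees np − c pages for fresh data while physically programming np pages in total (np − c user writes plus c relocation copies), so the ratio of physical writes to user writes is np / (np − c). Emptier victims mean less copying and a ratio closer to 1:

```python
# Steady-state accounting sketch. Assumption (ours, for illustration):
# every reclaimed block holds c valid pages on average. Per reclamation,
#   user writes     = np - c            (pages freed for fresh data)
#   physical writes = (np - c) + c = np (user writes plus relocation copies)
def write_amplification(np_pages: int, c_valid: float) -> float:
    assert 0 <= c_valid < np_pages, "a victim must have at least one invalid page"
    return np_pages / (np_pages - c_valid)

assert write_amplification(64, 0) == 1.0    # empty victims: no amplification
assert write_amplification(64, 32) == 2.0   # half-valid victims: one extra
                                            # copy per user write
```

The same expression diverges as c approaches np, which is why a device packed with valid data slows down so dramatically: nearly every reclamation copies almost a full block.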
[Figure 2.1 appears here: a block diagram of data placement and greedy garbage collection, after Hu, et al. [1, 2], showing the free block pool, the occupied block queue, and a legend for invalid, valid, and erased (free) pages.]
Figure 2.1: Schematic of greedy garbage collection. Data is written to the current writing block, which is the first block in the free block pool. When garbage collection is triggered, the block with fewest valid pages is selected from the occupied block queue. After its valid pages are copied to the current writing block, the selected block is erased and appended to the end of the free block pool.
A log-structured algorithm is used to map logical locations to physical locations [1, 2]. At any given time, one physical block is designated as the current writing block. Incoming LBA Write and Trim requests are processed in order, and data is written to the next available physical page on the current writing block. This continues until the block is full, at which point a new writing block is chosen from the beginning of the free block pool. When the number of physical blocks in the free block pool reaches a predetermined reserve², r, a used block must be chosen for garbage collection. Under greedy garbage collection, the emptiest block (containing the fewest valid pages) is chosen from the occupied block queue. If there is a tie for the emptiest block, the oldest block is chosen, where the age of a block is determined by when it was most recently filled with data. Once a block is chosen for reclamation by garbage collection, all valid pages in the block are copied to the new writing block, then the reclaimed block is erased and appended to the end of the free block pool. See Algorithm 1 and Figure 2.1 for visual descriptions of this process.
2.3 Write Amplification
When a block is chosen for erasure during garbage collection, any valid pages on the block first need to be copied elsewhere to avoid data loss. Copying these valid pages leads to write amplification, a phenomenon in which the device performs more writes than are requested of it. Write amplification, A, is a way of measuring the efficiency of garbage collection, and is defined as the average ratio of physical page writes to user page write requests [1].

Most FTLs and garbage collection algorithms attempt to reduce write amplification and perform wear leveling in order to keep the number of program/erase cycles uniform over all blocks. Greedy garbage collection has been proven optimal for random workloads in terms of write amplification reduction [2]. Write amplification under a uniform random workload (without Trim commands) in SSDs has been studied by Hu, et al. [1, 2], Agarwal and Marrow [14], and Xiang and
² In practice, this predetermined reserve is very small compared to the total number of blocks, and is often ignored in analysis.
Input: LBA requests of types Write and Trim.
Output: Data placement and garbage collection procedures.
while there are LBA requests do
    if request is Write then
        write new data to first free page in current-block;
        if LBA written was previously In-Use then
            invalidate appropriate page in FTL mapping;
        end
        if size(free pages in current-block) = 0 then
            append current-block to end of occupied-block-queue;
            current-block ← first block in reserve-queue;
            if size(reserve-queue) < r then
                // garbage collection:
                victim-block ← block with fewest valid pages from occupied-block-queue;
                copy valid pages from victim-block to current-block;
                erase victim-block and append to reserve-queue;
            end
        end
    else
        // request was Trim
        invalidate appropriate page in FTL mapping;
    end
end

Algorithm 1: Data Placement and Greedy Garbage Collection
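In Python, Algorithm 1 can be sketched as a small simulation. This is a minimal illustrative sketch, not the simulator used in this dissertation; the class name, device geometry, and reserve size are assumptions chosen so that greedy garbage collection is exercised.

```python
import random
from collections import deque

class SSD:
    """Minimal sketch of Algorithm 1: data placement with greedy garbage
    collection. Geometry (16 blocks of 8 pages, reserve r = 2) is illustrative."""

    def __init__(self, n_blocks=16, pages_per_block=8, reserve=2):
        self.n_p = pages_per_block
        self.r = reserve
        self.free_queue = deque(range(n_blocks))       # erased blocks
        self.occupied = deque()                        # filled blocks, oldest first
        self.pages = {b: [] for b in range(n_blocks)}  # page slots: LBA or None
        self.ftl = {}                                  # LBA -> (block, page)
        self.current = self.free_queue.popleft()       # current writing block
        self.gc_copies = 0                             # valid pages copied by GC

    def _invalidate(self, lba):
        b, p = self.ftl.pop(lba)
        self.pages[b][p] = None

    def _program(self, lba):
        """Write one page to the current block; may trigger garbage collection."""
        self.pages[self.current].append(lba)
        self.ftl[lba] = (self.current, len(self.pages[self.current]) - 1)
        if len(self.pages[self.current]) == self.n_p:  # current block is full
            self.occupied.append(self.current)
            self.current = self.free_queue.popleft()
            if len(self.free_queue) < self.r:
                self._garbage_collect()

    def _garbage_collect(self):
        # Greedy victim: fewest valid pages; deque order breaks ties by age.
        victim = min(self.occupied,
                     key=lambda b: sum(x is not None for x in self.pages[b]))
        self.occupied.remove(victim)
        for lba in [x for x in self.pages[victim] if x is not None]:
            self._invalidate(lba)
            self.gc_copies += 1                        # source of write amplification
            self._program(lba)                         # copy valid page forward
        self.pages[victim] = []                        # erase the whole block
        self.free_queue.append(victim)

    def write(self, lba):
        if lba in self.ftl:                            # LBA was previously In-Use
            self._invalidate(lba)
        self._program(lba)

    def trim(self, lba):                               # declare the LBA inactive
        if lba in self.ftl:
            self._invalidate(lba)

# Drive it with a Trim-modified uniform random workload over u = 64 LBAs.
rng = random.Random(7)
ssd = SSD()
for _ in range(5000):
    lba = rng.randrange(64)
    if rng.random() < 0.25:
        ssd.trim(lba)
    else:
        ssd.write(lba)
print(ssd.gc_copies)   # extra physical writes performed by garbage collection
```

Every page copied in `_garbage_collect` is a physical write that was never requested by the user, which is exactly the quantity the next section measures as write amplification.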
Kurkoski [15]; their theory is discussed in Chapter 4. Reducing write amplification improves performance by increasing the average write speed, since the device performs fewer extra writes. Additionally, by reducing the number of program/erase cycles, SSD lifetime is extended, since flash devices wear out and lose the ability to store data after being overwritten many times. Manufacturers use several techniques to reduce write amplification in an effort to mitigate the negative performance impact. Two of these techniques are physically overprovisioning the device with extra storage and implementing the Trim command [19].
2.4 Overprovisioning
Overprovisioning of an SSD is the placing of more physical storage capacity on the device than the user is allowed to fill [1]. The measure of overprovisioning that we use is the spare factor
Sf = (Tp − u)/Tp

which is calculated using the total number of pages on the device, Tp, and the number of pages the user is allowed to fill, u. The spare factor is a convenient measure when making comparisons because it has a finite range between 0 and 1. Smaller spare factors indicate that the user has access to more of the storage space on the device, which results in a larger write amplification and poorer performance. Larger spare factors correspond to better performance but a smaller user-accessible proportion of the storage. Overprovisioning reduces write amplification by increasing the amount of time between garbage collections of the same block, which in turn increases the probability that more pages in the block are invalid at the time of garbage collection, and thus need not be copied. The mathematical relationship between overprovisioning and write amplification in uniform random workloads without Trim has been studied by Hu, et al. [1, 2], Agarwal and Marrow [14], and Xiang and Kurkoski [15], and is discussed in Chapter 4.
2.5 Trim Command
The Trim command also improves performance by reducing write amplification. Defined in the working draft of the ATA/ATAPI Command Set - 2 (ACS-2) [19] and supported by recent operating systems [19, 20] and most newer SSDs, the Trim command declares logical blocks inactive, and marks the corresponding physical pages as invalid in the FTL mapping, meaning they do not need to be copied before the blocks containing them are erased. The performance increase seen with the Trim command is often measured by first testing the read and write speeds of both sequential and random data without enabling operating system support for Trim, so that no Trim requests are sent to the SSD. Then, Trim support is enabled and the test is repeated. The results of this type of test vary based on the specific, and often proprietary, features of the FTL implemented by the SSD tested, and do not reveal information about the effective level of overprovisioning due to the Trim command.

We show the degree to which an SSD that supports Trim requires less physical overprovisioning to achieve the same write amplification performance observed with a non-Trim SSD.
2.6 Uniform Random Workloads
There exists a spectrum of data storage workloads seen in the real world, from the single user of a personal computer to multiple users of cloud computing, database transactions, sequential serial access, and random page access. The statistics of the workload depend on the application, and can have a large effect on the performance of a storage device [21]. On one end of the workload spectrum is a completely sequential workload; on the other end is a random workload. Most real-world workloads are somewhere between these two ends. Purely sequential workloads do not result in write amplification, since the sequential nature of the requests means that when garbage collection occurs, invalid pages are concentrated in complete blocks, and no extra writes are required to copy valid pages during the reclamation process [1]. Random workloads do cause write amplification, though, because invalid pages are distributed across many blocks when garbage collection occurs. This means pages with valid data need to be copied from a block that is garbage collected, resulting in more actual writes than were requested. The challenging nature of random workloads makes them interesting to study.

Previous analysis of write amplification assumes a standard uniform random workload of single-page Write requests under greedy garbage collection. The input workload consists only of Write requests; it does not contain read requests since they do not affect write amplification or significantly affect I/O rate. Because of the assumption that one LBA Write request occupies one physical page, the number of allowed user LBAs is the same as the maximum number of pages in which the user can have data stored, u. Writes are chosen uniformly and randomly from all allowed user LBAs. When a Write request is made, there are two possibilities: 1) the request is for an LBA that has never been written, or 2) the request is for an LBA that has been written previously.
In both cases, the new data is written to the next empty page in the current writing block, but in the second case, the physical page where the old data was written must also be invalidated in the FTL mapping.

We define the Trim-modified uniform random workload, which introduces Trim requests to the standard uniform random workload. The input workload consists of both Trim and Write requests, with Trim requests occurring with probability q, and Write requests occurring with probability 1 − q. Write requests are still chosen uniformly over all user LBAs, but now the two possibilities for Write requests are slightly altered: 1) the request is for an LBA that is not In-Use, or 2) the request is for an LBA that is In-Use, so the physical page where that LBA data was previously stored must be invalidated. An In-Use LBA is defined as one for which the most recent request was Write; all other LBAs are considered not In-Use. In the Trim-modified uniform random workload, Trim requests are chosen uniformly randomly over all In-Use LBAs.

The standard uniform random workload (without Trim requests) eventually results in all LBAs being In-Use. The Trim-modified uniform random workload reaches a steady state distribution in the number of In-Use LBAs, which we show in Chapter 3 to be well approximated by a Gaussian with mean and variance given respectively by

ueff = us = u(1 − 2q)/(1 − q)

and

σ² = us̄ = uq/(1 − q)

where the substitutions s = (1 − 2q)/(1 − q) and s̄ = q/(1 − q) represent the steady state probability of a Write requesting an In-Use LBA and the steady state probability of a Write requesting a not In-Use LBA, respectively.³ The steady state number of In-Use LBAs is used to compute the mean effective spare factor of a device.
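The steady state mean us and variance us̄ quoted above can be checked with a direct Monte-Carlo simulation of the workload; a minimal sketch (the function name and run lengths are ours, not from the text):

```python
import random

def simulate_in_use(u, q, n_requests, seed=2):
    """Track the number of In-Use LBAs under the Trim-modified uniform
    random workload: Trim with probability q (uniform over In-Use LBAs),
    Write with probability 1 - q (uniform over all u LBAs)."""
    rng = random.Random(seed)
    in_use = set()
    history = []
    for _ in range(n_requests):
        if in_use and rng.random() < q:
            in_use.remove(rng.choice(list(in_use)))   # Trim request
        else:
            in_use.add(rng.randrange(u))              # Write request (no-op if In-Use)
        history.append(len(in_use))
    return history

u, q = 1000, 0.25
s, sbar = (1 - 2 * q) / (1 - q), q / (1 - q)
tail = simulate_in_use(u, q, 200_000)[50_000:]        # discard burn-in
mean = sum(tail) / len(tail)
var = sum((x - mean) ** 2 for x in tail) / len(tail)
print(mean, u * s)     # empirical mean vs. us (about 666.7 for q = 0.25)
print(var, u * sbar)   # empirical variance vs. us̄ (about 333.3)
```

Note that a Write to an already In-Use LBA leaves the set unchanged, matching the "no change in state" transition of the chain analyzed in Chapter 3.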
The effective spare factor is defined similarly to the manufacturer specified spare factor, but uses Xn, the number of valid pages (equivalent to the number of In-Use LBAs) at time n:

Seff = (Tp − Xn)/Tp

The mean effective spare factor

S̄eff = q/(1 − q) + [(1 − 2q)/(1 − q)] Sf = s̄ + s Sf

does not depend on the size of the device, and is calculated in terms of the manufacturer specified spare factor, Sf. The variance of the effective spare factor is dependent on the size of the SSD, and becomes negligible with increasing device size, Tp. Details of the calculations are in Chapter 3.
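As a small worked example of the mean effective spare factor formula (the helper name is ours, not from the text):

```python
def mean_effective_spare_factor(q, s_f):
    """Mean effective spare factor: S̄_eff = s̄ + s*S_f, with
    s = (1 - 2q)/(1 - q) and s̄ = q/(1 - q)."""
    s = (1 - 2 * q) / (1 - q)
    sbar = q / (1 - q)
    return sbar + s * s_f

# With no Trim (q = 0) the effective spare factor is just the manufacturer
# specified one; 25% Trim lifts even a zero-spare device to 1/3.
print(mean_effective_spare_factor(0.0, 0.11))
print(mean_effective_spare_factor(0.25, 0.0))
```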
2.7 Object Based Storage
In object-based storage, the file system does not assign data to particular LBAs. Instead, the file system tells the object-based storage device to store certain data; the device chooses where to store it, and gives the file system a unique object ID with which the data can be retrieved and/or modified [22]. Object-based storage allows for variable-sized objects with identifiers that need not indicate any information about the physical location of the data [23]. Rajimwale, et al.
have proposed that object-based storage is an optimal interface for flash-based SSDs [24]. We use object-based storage as motivation to remove the constraint that one LBA Write occupies one physical page. Under object-based storage, the workload is extended to allow one object Write to occupy one or more physical pages; this constraint change is applied for extensions to the theory of effective overprovisioning and write amplification, and is discussed in Chapter 6.

³ Note that s = 1 − s̄.
Chapter Acknowledgements

Chapter 2, in part, is a reprint of the material as it appears in "A Mathematical Model of the Trim Command in NAND-Flash SSDs," T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper. "The Effect of Trim Requests on Write Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 2, in part, has been submitted for publication of the material as it may appear in "Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper. "Solid State Disk Object-Based Storage with Trim Commands," T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 3
Trim Model
In this chapter, a model for the effect of the Trim command on effective overprovisioning is derived. The Trim-modified uniform random workload is analyzed for Write requests that occupy one physical page. For the convenience of the reader, a list of variables and their descriptions used in the analysis of this and subsequent chapters is found in Table 3.1.
3.1 Workload as a Markov Birth-Death Chain
The Trim-modified uniform random workload can be viewed as a Markov birth-death chain in which the state, Xn, is the number of In-Use LBAs. This means that Trim requests correspond to a reduction in state by 1, and Write requests can result in either an increase of the state by 1 (when the LBA requested is not In-Use) or no change in the state (when the LBA requested is already In-Use). A general Markov birth-death chain has the following transition probabilities:
P (Xn+1 = x − 1|Xn = x) = qx
P (Xn+1 = x|Xn = x) = rx
P(Xn+1 = x + 1|Xn = x) = px

with the values of qx, rx, and px depending on the current state.

Table 3.1: List of common variables and their descriptions, as defined in this dissertation.

    Tp                          Number of pages on SSD
    t = Tp/np                   Number of blocks on SSD
    u                           Number of pages the user is allowed to fill
    np                          Number of pages in a block
    r                           Minimum number of reserved blocks in reserve-queue
    Sf = (Tp − u)/Tp            Manufacturer specified spare factor
    Seff = (Tp − Xn)/Tp         Effective spare factor at time n
    Xn                          Number of In-Use logical block addresses (LBAs) at time n
    q                           Probability of Trim request
    πx                          Unnormalized steady state probabilities
    s = (1 − 2q)/(1 − q)        Steady state probability of a Write being for an In-Use LBA
    s̄ = q/(1 − q)               Steady state probability of a Write being for a not In-Use LBA
    ρ = (Tp − u)/u              Alternate measure of overprovisioning
    ρeff = S̄eff/(1 − S̄eff) = (1 + ρ)/s − 1   Alternate measure of effective overprovisioning
    ueff = us                   Mean number of In-Use LBAs at steady state
    σ² = us̄                     Variance of the number of In-Use LBAs at steady state
    fh, fc                      Fraction of the data that is hot or cold
    ph, pc                      Probability of requesting hot or cold data
    A(T)                        Trim-modified write amplification

The unnormalized steady state occupation probabilities are calculated by [25]:

    πx := P_unnormalized(X_steady = x) = p0 ⋯ p(x−1) / (q1 ⋯ qx)  for x ≥ 1,  and  π0 = 1
3.1.1 Steady State Solution
To compute the steady state occupation probabilities of the Trim-modified random workload, first compute qx, rx, and px. Because the Trim rate is constant, qx = q ∀x. The probability of requesting an In-Use LBA is as follows. Let W be the event that a write request occurs. Let H be the event that an In-Use LBA is chosen (or “hit”), and Hc be the complement of H, that an In-Use LBA is not chosen (equivalent to a not In-Use LBA is chosen).
rx = P(H, W | Xn = x)
   = P(H | W, Xn = x) P(W | Xn = x)
   = (x/u)(1 − q)

Similarly, the probability of requesting a not In-Use LBA is

px = P(Hᶜ, W | Xn = x)
   = P(Hᶜ | W, Xn = x) P(W | Xn = x)
   = ((u − x)/u)(1 − q)

Plugging these values for qx, rx, and px into the unnormalized steady state distribution gives

πx = p0 ⋯ p(x−1) / (q1 ⋯ qx)
   = [((u − 0)/u)(1 − q)] ⋯ [((u − (x − 1))/u)(1 − q)] / qˣ
   = ((1 − q)/q)ˣ · u! / (uˣ (u − x)!)
Note that when x = 0, the above equation has the value 1, so we no longer need to separate πx into cases based on the value of x. We can simply state that

πx = ((1 − q)/q)ˣ · u! / (uˣ (u − x)!),  ∀x ∈ {0, ..., u}    (3.1)

Although Equation 3.1 is an accurate representation of the unnormalized steady state occupation probabilities, it is not easy to understand the distribution without graphing it (see Figure 3.1).
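Equation 3.1 can be evaluated numerically despite the factorials by working in log-space and normalizing at the end; a sketch using `math.lgamma` (the function name and parameter choices are ours):

```python
import math

def stationary_pmf(u, q):
    """Normalized steady state distribution from Equation 3.1:
    pi_x proportional to ((1-q)/q)^x * u! / (u^x * (u-x)!),
    evaluated in log-space to avoid overflow."""
    logs = [x * math.log((1 - q) / q)
            + math.lgamma(u + 1) - x * math.log(u) - math.lgamma(u - x + 1)
            for x in range(u + 1)]
    m = max(logs)                        # stabilize before exponentiating
    w = [math.exp(v - m) for v in logs]
    z = sum(w)
    return [wi / z for wi in w]

pmf = stationary_pmf(1000, 0.25)
mean = sum(x * p for x, p in enumerate(pmf))
print(round(mean, 1))   # close to us = 666.7 for u = 1000, q = 0.25
```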
3.2 Interpretation of Steady State Occupation Probabilities
Normalization of Equation 3.1 is difficult, so we continue to work with the unnormalized equation as we manipulate it to a more familiar form. Start by defining s = (1 − 2q)/(1 − q) and s̄ = 1 − s = q/(1 − q).¹ Now, rewrite Equation 3.1 as

πx = s̄⁻ˣ · u! / (uˣ (u − x)!)

This is still a challenging equation to understand, but manipulation in log-space helps:
log(πx) = −x log(s̄) + log(u!) − x log(u) − log((u − x)!)
Because this is an unnormalized probability, terms that do not involve x can be added or subtracted, since they only affect the normalization factor. We use ∼ to denote an equivalence up to the normalization factor.
log(πx) ∼ −log((u − x)!) − x log(us̄)
Now, apply Stirling’s approximation: log (n!) ≈ n log (n) − n, using ≈ to denote an asymptotic approximation. It is well known that Stirling’s approximation for
¹ The choice for s is not immediately obvious; however, subsequent analysis and simulation of the workload show that the mean of the asymptotic distribution is µ = us and the variance is σ² = us̄. s and s̄ turn out to be the steady state probabilities of a Write request being for an In-Use LBA or for a not In-Use LBA, respectively.
factorials converges very quickly, and for relatively small values of n can practically be considered to provide a "ground truth" equivalent for the value of the factorial function.² The application of Stirling's approximation to this problem is reasonable. For SSDs of several gigabytes, u is on the order of hundreds of thousands, so the Stirling approximation of (u − x)! is effectively ground truth except for a few x very close in value to u.

Figure 3.1: Steady state distribution of the number of In-Use LBAs for the Trim-modified uniform random workload under various amounts of Trim, with u = 1000. Note that a higher percentage of Trim requests in the workload results in a lower mean but a higher variance.
log(πx) ≈ −(u − x) log(u − x) + (u − x) − x log(us̄)
Add another term for convenience and rearrange
log(πx) ∼ −(u − x) log(u − x) + (u − x) − x log(us̄) + u log(us̄)
        = −(u − x) log(u − x) + (u − x) + (u − x) log(us̄)
        = −(u − x) log((u − x)/(us̄)) + (u − x)    (3.2)

We call Equation 3.2 the Stirling PDF. Appendix A contains a rigorous proof that the Stirling PDF converges to the Gaussian distribution in the limit as u → ∞.³ At this point, a Taylor series expansion is a natural way to handle the log term, but a change of variables must be made before applying the standard log expansion

log(1 − α) = −Σ_{n=1}^{∞} αⁿ/n    for −1 ≤ α < 1

Let

α = (x − us)/(us̄)

Then, after some algebra and defining σ² = us̄,

(u − x) log((u − x)/(us̄)) = σ²(1 − α) log(1 − α)

Plug back in to Equation 3.2, apply the Taylor series approximation,⁴ then drop terms that do not depend on x:

log(πx) ≈ −(u − x) log((u − x)/(us̄)) + (u − x)
        = −σ²(1 − α) log(1 − α) + σ²(1 − α)
        = σ²(1 − α) Σ_{n=1}^{∞} αⁿ/n + σ²(1 − α)
        = −σ²(−α + Σ_{n=1}^{∞} αⁿ⁺¹/(n(n + 1))) + σ²(1 − α)
        ≈ −σ²(−α + α²/2) + σ²(1 − α)
        ∼ −σ²α²/2
        = −(x − us)²/(2us̄)    (3.3)

Exponentiate to remove the log and obtain the Taylor series first-order approximation of πx,

πx = s̄⁻ˣ · u!/(uˣ (u − x)!) ≈ e^(−(x − us)²/(2us̄))

an unnormalized Gaussian with mean us and variance us̄. The Gaussian nature of the result is gratifying but unanticipated, as the central limit theorem is not obviously applicable. The approach presented in this section is quite powerful, and makes it possible to perform a higher-order Taylor series expansion resulting in a non-Gaussian PDF that applies in the non-far asymptotic limit. A summary of the analysis of the higher-order terms is in Section 3.4; a full analysis is in Appendix A.

² Stirling's Formula, one of several possible Stirling approximations to the factorial function, provides a highly accurate approximation to n! for values of n as small as 10 to 100. For this reason, for even moderately sized values of n, the Stirling Formula is used to derive a variety of useful probability distributions in fields such as probability theory [26], equilibrium statistical mechanics [27], and information theory [28]. Because of the remarkable accuracy of the Stirling Approximation, which very rapidly increases with the size of n, the resulting distributions can be taken to be "ground truth" distributions in many practical domains. Derivations of the Stirling Approximation, and discussions of its accuracy and convergence, can be found in pages 52-54 of [26], Appendix A.6 of [27], and pages 522-524 of [29].
³ We omit the proof from the main text in order to present an unimpeded description of the research, as the particular details are lengthy.
⁴ See Section 3.3 for proof that α lies in the region of convergence.
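As a numerical sanity check of the first-order result, the exact distribution of Equation 3.1 (normalized numerically in log-space) can be compared against the Gaussian of Equation 3.3. The sketch below uses the same u and q as Figure 3.4 and reports the total variation distance between the two distributions:

```python
import math

u, q = 1000, 0.4
s, sbar = (1 - 2 * q) / (1 - q), q / (1 - q)

# Exact pmf (Equation 3.1), evaluated in log-space and normalized.
logs = [x * math.log((1 - q) / q)
        + math.lgamma(u + 1) - x * math.log(u) - math.lgamma(u - x + 1)
        for x in range(u + 1)]
m = max(logs)
exact = [math.exp(v - m) for v in logs]
z = sum(exact)
exact = [e / z for e in exact]

# Gaussian approximation (Equation 3.3): mean us, variance u*sbar.
mu, var = u * s, u * sbar
gauss = [math.exp(-((x - mu) ** 2) / (2 * var)) for x in range(u + 1)]
z = sum(gauss)
gauss = [g / z for g in gauss]

# Total variation distance between the two normalized distributions.
tv = 0.5 * sum(abs(e - g) for e, g in zip(exact, gauss))
print(tv)   # small even at this low u, matching Figure 3.4
```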
3.3 Region of Convergence of Taylor Series Expansion
The Taylor series expansion for log(1 − α) converges when −1 ≤ α < 1. This is equivalent to

u(2s − 1) ≤ x < u    (3.4)
Since x can only take values representing the state of the Markov chain, it is limited to the range [0, u]. The RHS of Equation 3.4 is shown to actually be x ≤ u. Examination of the application of the Taylor series expansion shows
σ²(1 − α) log(1 − α) = −σ²(1 − α) Σ_{n=1}^{∞} αⁿ/n
                     = σ²(−α + Σ_{n=1}^{∞} αⁿ⁺¹/(n(n + 1)))    (3.5)
At α = 1, which corresponds to x = u, the LHS of Equation 3.5 is σ² · 0 · log 0 = 0. The RHS of Equation 3.5 evaluates to 0 as well, since when α = 1 the RHS sum becomes a standard telescoping series. A closer examination of the LHS of Equation 3.4 is needed. Interestingly, u(2s − 1) has a range from −u to u, depending on the value of s (and ultimately on q, whose meaningful range is 0 ≤ q < 1/2). If u(2s − 1) ≤ 0, then the region of convergence covers all x that are of interest (that is, 0 ≤ x ≤ u). This happens when s ≤ 1/2 (equivalently q ≥ 1/3), so all 1/3 ≤ q < 1/2 result in convergence for all allowed values of x. q < 1/3 has a region of convergence that is smaller than the range of x, but we show it is large enough to contain all x that are likely to occur at steady state. Compute the value of q needed for convergence when x is γ standard deviations below the mean, thus taking the value us − γ√(us̄). Set

u(2s − 1) ≤ us − γ√(us̄)

to compute
q ≥ γ²/(u + γ²)    (3.6)

For any value of γ, as u → ∞ the bound tends to q ≥ 0. This means that the Taylor series converges for all values of q, so long as a sufficiently large u is chosen. Solving Equation 3.6 for u determines how large "sufficiently large" is when the value of γ is fixed. This results in
u ≥ γ²/s̄

The three-sigma rule states that in normal distributions, 99.7% of values lie within 3 standard deviations of the mean. Since values outside ±3σ are unlikely to be seen, it is sufficient to choose a u that causes the region of convergence to cover this range. This means that a small q = 0.01 only requires u ≥ 891, and a slightly larger q = 0.1 needs only u ≥ 81 for the analysis of this dissertation to be accurate.
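The bound u ≥ γ²/s̄ is simple to evaluate; a sketch reproducing the values quoted above (the function name is ours):

```python
def min_u_for_convergence(q, gamma=3.0):
    """Smallest u for which the Taylor region of convergence covers the
    mean plus or minus gamma standard deviations: u >= gamma^2 / sbar,
    where sbar = q / (1 - q)."""
    sbar = q / (1 - q)
    return gamma ** 2 / sbar

print(min_u_for_convergence(0.01))   # about 891
print(min_u_for_convergence(0.1))    # about 81
```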
3.4 Higher Order Moments
The analysis of Section 3.2 allows the computation of higher-order moments such as skewness and kurtosis by truncating the Taylor series expansion at later terms. Appendix A heuristically argues that all of the approximation PDFs obtained via truncation of the Taylor series for the Stirling PDF should become highly accurate in the limit u → ∞. The skewness is calculated as⁵

Skew_X = −1/σ + R(1/σ³)    (3.7)

and the kurtosis is calculated as

Kurtosis_X = 3/(4σ²) + R(1/σ⁴)    (3.8)

For the details of the derivation of the skewness and kurtosis, see Appendix A.
⁵ R(1/σ³) is defined to mean that the remaining terms are at least as small as O(1/σ³), in terms of the "big O" terminology.
The slight negative skew is toward a higher level of effective overprovisioning. This is more evident for small u with a narrow variance (low but non-zero q). Figure 3.5 shows simulation, exact analytic values, and approximate Gaussian values for a very small u = 25. With this small value of u, the slight negative skew is evident. It is unlikely that such a low value of u would ever be encountered in a real device; the higher-order analysis is included here for completeness. The more detailed asymptotic analysis given in Appendix A shows that even in the limit of large u there is an irreducible, residual absolute error in the location of the mean of 0.5, which, consistent with the theory of asymptotic analysis, is asymptotically relatively negligible: lim_{u→∞} 0.5/u = 0. We correct for this known residual error in the approximation models via a shift, which results in the fits shown in Figures 3.4 and 3.5.
3.5 Effective Overprovisioning
The effective spare factor is computed using the results of the computation of the distribution of In-Use LBAs. The definition of the spare factor given in the introduction needs to be modified slightly to compute the effective spare factor
Seff = (Tp − Xn)/Tp

where Tp is the total number of pages in the SSD, and Xn is the number of In-Use LBAs, which is equivalent to the number of valid pages, at time n. The mean effective spare factor is

S̄eff = (Tp − us)/Tp = s̄ + s Sf
If the manufacturer specified spare factor is known, the impact of the Trim command on the mean effective spare factor does not depend on the total number of pages. However, the variance of the effective spare factor depends on the total number of pages in addition to the manufacturer specified spare factor:

Var(Seff) = (1/Tp²) Var(Xn)
Figure 3.2: Steady state distribution of the effective spare factor for the Trim- modified uniform random workload with 25% of requests Trim over 1000 user- allowed LBAs, and various amounts of manufacturer specified spare factor. Note the improvement in the effective spare factor over the manufacturer specified spare factor has a larger variance for smaller manufacturer specified spare factor.
         = us̄/Tp²
         = s̄(1 − Sf)/Tp

As the size of the device grows large, the variance of the effective spare factor tends toward zero, indicating that for large enough SSDs, the penalty for ignoring the variance of the effective spare factor is negligible. Figure 3.2 shows a plot of the effective spare factor for various manufacturer specified spare factors.
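The variance formula Var(Seff) = s̄(1 − Sf)/Tp can be evaluated directly to see the 1/Tp decay (the helper name is ours):

```python
def var_effective_spare_factor(q, s_f, t_p):
    """Var(S_eff) = sbar * (1 - S_f) / T_p, which vanishes as the device grows."""
    sbar = q / (1 - q)
    return sbar * (1 - s_f) / t_p

# Same workload, increasingly large devices: variance shrinks as 1/T_p.
for t_p in (1_000, 100_000, 10_000_000):
    print(t_p, var_effective_spare_factor(0.25, 0.0, t_p))
```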
Figure 3.3: Comparison of analytic and simulation PDF, for 40% Trim. The analytic values closely match the simulation results.
3.6 Simulation Results
The analytic results were checked with Monte-Carlo simulations for various values of u and q. Results shown in this section are for u = 1000, with the simulation means and variances computed over 9 million requests, then averaged over 128 Monte-Carlo simulations. Results for larger values of u are comparable. Figure 3.3 shows one example of how closely the analysis aligns with the simulation. Table 3.2 shows the analytic and simulation values. Figure 3.4 demonstrates the accuracy of the Stirling and Gaussian approximations for a very low u = 1000 and q = 0.4. The low value of u and high value of q were necessary in order to keep the graph from looking like a delta function, which is what a zoomed-out PDF of larger, more practically realistic, values of u
Figure 3.4: A histogram of one run of simulation results for u = 1000 and q = 0.4, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black dots, and the Gaussian approximation of the PDF (Equation 3.3) in green. The line for the exact analytic PDF is not even visible because the line for the Gaussian approximation of the PDF covers it completely. The Stirling approximation of the PDF (black dots) lines up well with the Gaussian approximation and the (covered) analytic PDF. Even at this low value of u = 1000, the theoretical approximations are effectively identical to the exact analytic PDF.
Figure 3.5: A histogram of one run of simulation results for u = 25 and q = 0.3, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black, and the Gaussian approximation of the PDF (Equation 3.3) in green. A slight negative skew of the exact analytic PDF is evident when compared to the Gaussian approximation. Note that the exact analytic PDF is the best fit of the histogram. A series of 64 Monte-Carlo simulations with these values results in an average skew of mean ± one standard deviation = −0.299 ± 0.004, which agrees well with the theoretically computed approximation value of −0.306 ± R(1/σ³) = −0.306 ± R(0.029). The average simulation kurtosis is 0.064; the approximate theoretical value is 0.070; both values are well within overlapping error bounds. The top graph is a plot of the numerically normalized equations as shown in the text. In the bottom graph, a half-bin shift correction has been made to the Stirling and Gaussian approximation densities as described in the text.
Table 3.2: Mean and standard deviation (σ) for the number of In-Use LBAs computed over 128 Monte-Carlo simulations. Values in this table were computed with u = 1000.

             Analytic Values       Simulation Values
    % Trim   Mean      σ           Mean      σ
    10       888.89    10.54       888.87    10.55
    20       750.00    15.81       749.97    15.80
    30       571.43    20.70       571.42    20.72
    40       333.33    25.82       333.32    25.80

appears to be. Even at such a low value of u, the simulation results, exact analytic PDF (Equation 3.1), Stirling approximation of the PDF (Equation 3.2), and the Gaussian approximation of the PDF (Equation 3.3) are indistinguishable. In order to see the slight skew predicted by higher order terms, we are forced to use an unrealistically low value of u = 25, as seen in Figure 3.5. However, even in this unrealistically small example, both the Stirling and Gaussian approximations are close matches to the exact PDF. Figure 3.6 illustrates the utilization of the user space as the percentage of In-Use LBAs at steady state for several rates of Trim. Higher rates of Trim correspond to lower utilization, but with a wider variance. The corresponding graph of the PDF of the steady state effective spare factor is found in Figure 3.7, with the example utilizing a manufacturer specified spare factor of 0.11. Here, higher rates of Trim correspond to a higher effective spare factor, but with a wider variance.

The practical application of this theory shows the potential for impressive savings. A workload with 25% Trim requests can transform an SSD with zero manufacturer specified spare factor into one with a mean effective spare factor of 0.33. To achieve this level of overprovisioning without the Trim command would require having 50% more physical pages than user LBAs. This means that for a known workload, 50% of the cost of the NAND flash inside the SSD could be saved, without sacrificing performance! Figure 3.8 shows a graph of the mean effective spare factor versus the manufacturer specified spare factor, demonstrating the
Figure 3.6: Theoretical results for utilization under various amounts of Trim. When more Trim requests are in the workload, a lower percentage of LBAs are In-Use at steady state, but the variance is higher than seen in a workload with fewer Trim requests.
Figure 3.7: Theoretical results for the effective spare factor under various amounts of Trim, with a manufacturer specified spare factor of 0.11. When more Trim requests are in the workload, a higher effective spare factor is seen, but with a higher variance than seen in a workload with fewer Trim requests.
Figure 3.8: Mean and 3σ values of effective spare factor versus manufacturer specified spare factor. This shows that an SSD with zero manufacturer specified spare factor effectively has a spare factor of 33% when the workload contains 25% Trim requests. This level of overprovisioning without the Trim command requires 50% more physical storage than the user is allowed to access. The ±3σ values were calculated using a small Tp = 1000 so that they would be visible; typical devices have Tp >> 1000, resulting in a negligible variance.
[Figure: effective spare factor (Seff) versus manufacturer specified spare factor (Sf); mean and mean ± 3σ curves for 10%, 25%, and 40% Trim workloads.]
Figure 3.9: Theoretical results comparing the effective spare factor to the manufacturer specified spare factor. The mean and ±3σ values are shown; for this figure, Tp = 1280 was used so that the 3σ values are visible. Larger gains in the effective spare factor are seen for smaller specified spare factors. In this figure, we see that a workload of 10% Trim requests transforms an SSD with a manufacturer specified spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical overprovisioning alone would require 12.5% more physical space than the user is allowed to access!

improvement in spare factor provided by the Trim command. Figure 3.9 shows similar results with a slightly larger SSD; in this experiment, a workload of 10% Trim requests transforms an SSD with a manufacturer specified spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical overprovisioning alone would require 12.5% more physical space than the user is allowed to access!

Application of this theory should not ignore the variance for small values of Tp. By the three-sigma rule, the effective spare factor that will be seen 99.7% of the time lies within S̄eff ± 3σ. For the specific case of 25% Trim requests, no manufacturer specified spare factor, and a small physical capacity of Tp = 1000 pages, this means 0.28 ≤ Seff ≤ 0.39, which is the equivalent of having 39% to 63% more physical pages than user LBAs with physical overprovisioning alone. In practical applications, typically Tp >> 1000, resulting in a much narrower range of Seff. In fact, limTp→∞ Var(Seff) = 0, indicating that for sufficiently large devices, S̄eff can be used without paying heed to the variance.
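The figures quoted in this passage follow from two closed-form quantities: the steady-state mean number of In-Use LBAs, u(1−2q)/(1−q), and its variance, uq/(1−q) (the value implied by the σ column of Table 3.2). A small sketch, with hypothetical function names, reproduces the 25% Trim example above:

```python
from math import sqrt

def effective_spare(Tp, S_f, q):
    """Mean effective spare factor and its 3-sigma range, assuming the
    steady-state mean u*(1-2q)/(1-q) and variance u*q/(1-q) for the
    number of In-Use LBAs, with u = Tp*(1 - S_f) user LBAs."""
    u = Tp * (1.0 - S_f)
    mean_in_use = u * (1 - 2 * q) / (1 - q)
    sigma_in_use = sqrt(u * q / (1 - q))
    s_eff = (Tp - mean_in_use) / Tp      # mean effective spare factor
    spread = 3 * sigma_in_use / Tp       # 3-sigma deviation of S_eff
    return s_eff, s_eff - spread, s_eff + spread

def extra_physical(S_f):
    """Extra physical pages (as a fraction of user LBAs) needed to reach
    spare factor S_f by physical overprovisioning alone: S_f/(1-S_f)."""
    return S_f / (1.0 - S_f)

mean, lo, hi = effective_spare(Tp=1000, S_f=0.0, q=0.25)
# mean ~ 0.33, lo ~ 0.28, hi ~ 0.39; extra_physical maps these to
# roughly 50%, 39%, and 63% extra physical pages, as in the text
```

The same helper with q = 0.10 and S_f = 0 gives a mean effective spare factor of about 0.11, the Figure 3.9 example.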
Chapter Acknowledgements

Chapter 3, in part, is a reprint of the material as it appears in “A Mathematical Model of the Trim Command in NAND-Flash SSDs,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference, ACM, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, has been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 4
Application to Write Amplification
Several researchers have analyzed the write amplification that results under the standard uniform random workload and greedy garbage collection, but they did not address the effect of Trim requests on write amplification. In this chapter, we summarize the previously published derivations for write amplification under the standard uniform random workload, and then extend the models to work with the Trim-modified uniform random workload. We compare analytic and simulation results, demonstrating the range of performance for each model.
4.1 Write Amplification Under the Standard Uniform Random Workload
Hu et al. [1, 2] perform a probabilistic analysis of a variation of greedy garbage collection, called windowed greedy garbage collection, which differs from greedy garbage collection by selecting the block for garbage collection from among the w oldest blocks in the occupied block queue. When w equals the size of the occupied block queue, Tp/np − r, windowed greedy garbage collection reduces to greedy garbage collection. Hu et al. derive the write amplification based on the probability that the block selected for
Computation of Write Amplification Factor
Compute p = (p1, ··· , pj, ··· , pw)
p1 = e^(−1.9 Tp/(u−1))                                   (4.1)

pj = min(1, p_{j−1}^{1.1} / (1 − 1/u)^{np})              (4.2)
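The victim-selection step of windowed greedy garbage collection described above can be sketched as follows (names are illustrative, not taken from Hu et al.): among the w oldest blocks in the occupied block queue, choose the block with the fewest valid pages; taking w equal to the full queue length recovers plain greedy garbage collection.

```python
def select_victim_windowed(queue, valid_counts, w):
    """Windowed greedy victim selection (sketch).

    queue        -- block ids in the occupied block queue, oldest first
    valid_counts -- valid_counts[b]: number of still-valid pages in block b
    w            -- window size; w = len(queue) gives plain greedy GC
    """
    window = queue[:w]  # only the w oldest blocks are candidates
    return min(window, key=lambda b: valid_counts[b])

# Example: with a window of the 3 oldest blocks, the block holding the
# fewest valid pages in that window is chosen for garbage collection.
queue = [3, 1, 2, 0]                    # block 3 is oldest
valid = {0: 1, 1: 5, 2: 2, 3: 4}
victim = select_victim_windowed(queue, valid, w=3)   # block 2 here
```

With w = 4, the youngest block (block 0, with a single valid page) enters the window and would be selected instead, illustrating how a wider window lets the collector find cheaper victims.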