UNIVERSITY OF CALIFORNIA, SAN DIEGO

Model and Analysis of Trim Commands in Solid State Drives

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy

in

Electrical Engineering (Intelligent Systems, Robotics, and Control)

by

Tasha Christine Frankie

Committee in charge:

Ken Kreutz-Delgado, Chair
Pamela Cosman
Philip Gill
Gordon Hughes
Gert Lanckriet
Jan Talbot

2012

The Dissertation of Tasha Christine Frankie is approved, and it is acceptable in quality and form for publication on microfilm and electronically:

Chair

University of California, San Diego

2012

TABLE OF CONTENTS

Signature Page ...... iii

Table of Contents ...... iv

List of Figures ...... vi

List of Tables ...... x

Acknowledgements ...... xi

Vita ...... xii

Abstract of the Dissertation ...... xiv

Chapter 1 Introduction ...... 1

Chapter 2  NAND Flash SSD Primer ...... 5
    2.1  SSD Layout ...... 5
    2.2  Greedy Garbage Collection ...... 6
    2.3  Write Amplification ...... 8
    2.4  Overprovisioning ...... 10
    2.5  Trim Command ...... 11
    2.6  Uniform Random Workloads ...... 11
    2.7  Object Based Storage ...... 13

Chapter 3  Trim Model ...... 15
    3.1  Workload as a Markov Birth-Death Chain ...... 15
        3.1.1  Steady State Solution ...... 17
    3.2  Interpretation of Steady State Occupation Probabilities ...... 18
    3.3  Region of Convergence of Taylor Series Expansion ...... 22
    3.4  Higher Order Moments ...... 23
    3.5  Effective Overprovisioning ...... 24
    3.6  Simulation Results ...... 26

Chapter 4  Application to Write Amplification ...... 35
    4.1  Write Amplification Under the Standard Uniform Random Workload ...... 35
    4.2  Write Amplification Analysis of Trim-Modified Uniform Random Workload ...... 37
    4.3  Alternate Models for Write Amplification Under the Trim-Modified Uniform Random Workload ...... 41
    4.4  Simulation Results and Discussion ...... 41

Chapter 5  Workload Variation: Hot and Cold Data ...... 47
    5.1  Mixed Hot and Cold Data ...... 48
    5.2  Separated Hot and Cold Data ...... 50
    5.3  Simulation Results ...... 51
    5.4  N Temperatures of Data ...... 54

Chapter 6  Workload Variation: Variable Sized Objects ...... 56
    6.1  From LBAs to Objects ...... 57
    6.2  Fixed Sized Objects ...... 58
    6.3  Variable Sized Objects ...... 59
    6.4  Simulation Results ...... 62
    6.5  Application to Write Amplification ...... 67

Chapter 7 Conclusion ...... 72

Appendix A  Asymptotic Expansion Analysis ...... 77
    A.1  Useful Theorems and Facts ...... 77
    A.2  Asymptotic Expansion Analysis ...... 80
    A.3  An Accurate Gaussian Approximation ...... 81
    A.4  A Higher Order PDF Approximation ...... 84
        A.4.1  Approximating the Mean ...... 92
        A.4.2  Approximating the Variance ...... 93
        A.4.3  Approximating the Skew ...... 95
        A.4.4  Approximating the Kurtosis ...... 96
    A.5  Accuracy of the Truncation PDFs ...... 96
    A.6  The K-L Divergence to and from the Normal Distribution ...... 97
        A.6.1  Kullback-Leibler Divergence ...... 99
        A.6.2  Application to P_X ...... 100

References ...... 104

LIST OF FIGURES

Figure 2.1: Schematic of greedy garbage collection. Data is written to the current writing block, which is the first block in the free block pool. When garbage collection is triggered, the block with fewest valid pages is selected from the occupied block queue. After its valid pages are copied to the current writing block, the selected block is erased and appended to the end of the free block pool...... 7

Figure 3.1: Steady state distribution of the number of In-Use LBAs for the Trim-modified uniform random workload under various amounts of Trim, with u = 1000. Note that a higher percentage of Trim requests in the workload results in a lower mean but a higher variance. ...... 19

Figure 3.2: Steady state distribution of the effective spare factor for the Trim-modified uniform random workload with 25% of requests Trim over 1000 user-allowed LBAs, and various amounts of manufacturer specified spare factor. Note the improvement in the effective spare factor over the manufacturer specified spare factor has a larger variance for smaller manufacturer specified spare factor. ...... 25

Figure 3.3: Comparison of analytic and simulation PDF, for 40% Trim. The analytic values closely match the simulation results. ...... 26

Figure 3.4: A histogram of one run of simulation results for u = 1000 and q = 0.4, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black dots, and the Gaussian approximation of the PDF (Equation 3.3) in green. The line for the exact analytic PDF is not even visible because the line for the Gaussian approximation of the PDF covers it completely. The Stirling approximation of the PDF (black dots) lines up well with the Gaussian approximation and the (covered) analytic PDF. Even at this low value of u = 1000, the theoretical approximations are effectively identical to the exact analytic PDF. ...... 27

Figure 3.5: A histogram of one run of simulation results for u = 25 and q = 0.3, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black, and the Gaussian approximation of the PDF (Equation 3.3) in green. A slight negative skew of the exact analytic PDF is evident when compared to the Gaussian approximation. Note that the exact analytic PDF is the best fit of the histogram. A series of 64 Monte-Carlo simulations with these values result in an average skew of mean ± one standard deviation = −0.299 ± 0.004, which agrees well with the theoretically computed approximation value of −0.306 ± R(1/σ³) = −0.306 ± R(0.029). The average simulation kurtosis is 0.064; the approximate theoretical is 0.070, both values are well within overlapping error bounds. The top graph is a plot of the numerically normalized equations as shown in the text. In the bottom graph, a half-bin shift correction has been made to the Stirling and Gaussian approximation densities as described in the text. ...... 28

Figure 3.6: Theoretical results for utilization under various amounts of Trim. When more Trim requests are in the workload, a lower percentage of LBAs are In-Use at steady state, but the variance is higher than seen in a workload with fewer Trim requests. ...... 30

Figure 3.7: Theoretical results for the effective spare factor under various amounts of Trim, with a manufacturer specified spare factor of 0.11. When more Trim requests are in the workload, a higher effective spare factor is seen, but with a higher variance than seen in a workload with fewer Trim requests. ...... 31

Figure 3.8: Mean and 3σ values of effective spare factor versus manufacturer specified spare factor. This shows that an SSD with zero manufacturer specified spare factor effectively has a spare factor of 33% when the workload contains 25% Trim requests. This level of overprovisioning without the Trim command requires 50% more physical storage than the user is allowed to access. The ±3σ values were calculated using a small Tp = 1000, so they would be visible. Typical devices have a Tp >> 1000, resulting in a negligible variance. ...... 32

Figure 3.9: Theoretical results comparing the effective spare factor to the manufacturer specified spare factor. The mean and 3σ standard deviations are shown. For this figure, Tp = 1280 was used so that the 3σ values are visible. Larger gains in the effective spare factor are seen for smaller specified spare factor. In this figure, we see that a workload of 10% Trim requests transforms an SSD with a manufacturer specified spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical overprovisioning alone would require 12.5% more physical space than the user is allowed to access! ...... 33

Figure 4.1: Equations needed to compute the empirical model-based algorithm for A_Hu (adapted from [1, 2]). w is the window size, and allows for the windowed greedy garbage collection variation. Setting w = Tp/np − r is needed for the standard greedy garbage collection discussed in this dissertation. For an explanation of the theory behind these formulas, see [1, 2]. ...... 36

Figure 4.2: Write amplification versus probability of Trim under a Trim-modified uniform random workload and greedy garbage collection, when Sf = 0.2. The plots labeled Hu, Agarwal, and Xiang all show the Trim-modified models for write amplification. A^(T)_Agarwal gives such an optimistic prediction that for large q it foresees fewer actual writes than requested writes! A^(T)_Hu gives a more realistic, but still optimistic prediction for write amplification. A^(T)_Xiang gives the best, and generally quite accurate, prediction for write amplification. ...... 42

Figure 4.3: Simulation and theoretical results for write amplification under various amounts of Trim and Sf = 0.1. The Trim-modified Xiang calculation is a very good match. The Trim-modified calculations for Hu and Agarwal are optimistic about the write amplification, with the Agarwal calculation being so optimistic as to give an impossible write amplification of less than 1 for high amounts of Trim! ...... 43

Figure 5.1: Write amplification for mixed hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. Notice that the theoretical values, based only on the level of effective overprovisioning, are optimistic when cold data accounts for more than 20% of the user space...... 49

Figure 5.2: Write amplification for separated hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. After allocating the minimum number of blocks needed to store all of the user-allowed data for each temperature of data, the remaining blocks were split with the fraction shown on the x-axis. In this case, the optimal split was 50-50; further investigation of the optimization problem is needed to determine whether this is always the case. ...... 52

Figure 5.3: Write amplification for mixed hot and cold data compared to separated hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. For the separated hot and cold data, extra blocks were allocated evenly between the two data types. The theoretical values for separated data are a good prediction of the simulation values. Note that the separated hot and cold data has superior write amplification to the mixed hot and cold data when cold data is more than 20% of the workload. ...... 53

Figure 6.1: Comparison of analytic and simulation PDF for the number of valid pages when objects are a fixed size of 32 pages. For this example, u = 1000 and q = 0.2. The mean is shifted, and the variance is larger than if objects are a fixed size of 1 page. ...... 64

Figure 6.2: Comparison of analytic and simulation PDF for the number of valid pages when the size of objects is drawn from a discrete uniform distribution over [1, 32]. For this example, u = 1000 and q = 0.2. ...... 67

Figure 6.3: Comparison of analytic and simulation PDF for the number of valid pages when the size of objects is drawn from a binomial distribution with parameters p = 0.4 and n = 32. For this example, u = 1000 and q = 0.2. ...... 68

Figure 6.4: Write amplification versus pages per block for constant effective overprovisioning of Seff = 0.29. The modified Xiang model (Equation 6.3) fails when the number of pages per block is low. The Trim-modified Hu model provides an optimistic, but reasonable, prediction. ...... 71

LIST OF TABLES

Table 3.1: List of common variables and their descriptions, as defined in this dissertation. ...... 16

Table 3.2: Mean and standard deviation (σ) for Number of In-Use LBAs computed over 128 Monte-Carlo simulations. Values in this table were computed with u = 1000. ...... 29

Table 4.1: Analytic and simulation write amplification results for various values of probability of Trim, q, and two choices of manufacturer specified spare factor. In all cases, u/np = 1024, and np = 256. Five Monte-Carlo simulations were run for each simulation value in the table, with each simulation having identical results, measured to three significant digits, for write amplification. A plot for the case Sf = 0.2 (equivalently ρ = 0.25) is shown in Figure 4.2...... 44

Table 6.1: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Fixed Sized Objects. Values in this table were computed with u = 1000. The size of the objects was fixed at 32. Simulation values are the average of 64 Monte-Carlo simulations. ...... 63

Table 6.2: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Variable Sized Objects. Values in this table were computed with u = 1000. The size of the objects was drawn from a discrete uniform distribution over [1, 32]. Simulation values are the average of 64 Monte-Carlo simulations. ...... 65

Table 6.3: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Variable Sized Objects. Values in this table were computed with u = 1000. The size of the objects was drawn from a binomial distribution with parameters p = 0.4 and n = 32. Simulation values are the average of 64 Monte-Carlo simulations. ...... 66

Table 6.4: Write amplification for varying number of pages per block, equivalent to fixed sized objects that evenly divide np. The level of effective overprovisioning was kept constant at Seff = 0.29 for each simulation. The amount of Trim was q = 0.1. All models referenced are Trim-modified; the Xiang model was additionally modified to account for np, as shown in Equation 6.3. ...... 70

ACKNOWLEDGEMENTS

Chapters 1, 2, 3, and 7, in part, are a reprint of the material as it appears in “A Mathematical Model of the Trim Command in NAND-Flash SSDs,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper.

Chapters 1, 2, 4, and 7, in part, are a reprint of the material as it appears in “The Effect of Trim Requests on Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapters 1, 2, 3, 4, 5, and 7, in part, have been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapters 1, 2, 6, and 7, in part, have been submitted for publication of the material as it may appear in “Solid State Disk Object-Based Storage with Trim Commands,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

VITA

2003 B. S. in Electrical and Computer Engineering, California Institute of Technology

2006 M. S. in Electrical Engineering (Intelligent Systems, Robotics, and Control), University of California, San Diego

2012 Ph. D. in Electrical Engineering (Intelligent Systems, Robotics, and Control), University of California, San Diego

PUBLICATIONS

T. Frankie, G. Hughes, K. Kreutz-Delgado, “Solid State Disk Object-Based Storage with Trim Commands”, IEEE Transactions on Computers, 2012 (in submission).

T. Frankie, G. Hughes, K. Kreutz-Delgado, “Analysis of Trim Commands on Over- provisioning and Write Amplification in Solid State Drives”, ACM Transactions on Storage (TOS), 2012 (in submission).

T. Frankie, G. Hughes, K. Kreutz-Delgado, “The Effect of Trim Requests on Write Amplification in Solid State Drives”, Proceedings of the IASTED-CIIT Conference. IASTED, 2012.

T. Frankie, G. Hughes, K. Kreutz-Delgado, “A Mathematical Model of the Trim Command in NAND-Flash SSDs”, Proceedings of the 50th Annual Southeast Re- gional Conference. ACM, 2012.

T. Vanesian, K. Kreutz-Delgado, G. Burgin, D. Fogel, “Algorithmic Tools for Ad- versarial Games: Intent Analysis Using Evolutionary Computation”, Proceedings of the IEEE Symposium on Computational Intelligence in Security and Defense Applications, 2007.

OTHER WORKSHOP AND CONFERENCE PRESENTATIONS

T. Frankie, G. Hughes, and K. Kreutz-Delgado, “A Statistical Model of Overpro- visioning”, Non-Volatile Memories Workshop, UCSD, March 2012.

T. Frankie, G. Hughes, and K. Kreutz-Delgado, “Improved Effective Overprovi- sioning Using SSD Trim Commands”, CMRR Research Review, UCSD, October 2011.

T. Frankie, G. Hughes, and K. Kreutz-Delgado, “SSD Trim Commands Considerably Improve Overprovisioning”, Summit, San Jose, August 2011.

T. Frankie, G. Hughes, and K. Kreutz-Delgado, “Impact of the Trim Command on Write Amplification”, Non-Volatile Memories Workshop, UCSD, March 2011. Also presented at the Jacobs School of Engineering Research Expo, UCSD, April 2011.

ABSTRACT OF THE DISSERTATION

Model and Analysis of Trim Commands in Solid State Drives

by

Tasha Christine Frankie

Doctor of Philosophy in Electrical Engineering (Intelligent Systems, Robotics, and Control)

University of California, San Diego, 2012

Ken Kreutz-Delgado, Chair

NAND flash solid state drives (SSDs) have recently become popular storage alternatives to traditional magnetic hard disk drives (HDDs), due partly to their superior performance in write speed. However, SSDs suffer from a decrease in write speed as they fill with data, due largely to write amplification, a phenomenon in which more writes than requested are performed by the device. Use of the Trim command is known to result in improvement in NAND flash SSD performance; the Trim command informs the SSD flash controller which data records are no longer necessary, and need not be copied during garbage collection.

In this dissertation, we analytically model the amount of improvement provided by the Trim command for uniform random workloads in terms of effective

overprovisioning, a measure of the device utilization. We show the effective spare factor is Gaussian with mean and variance dependent on the percentage of Trim requests in the workload and the manufacturer-provided physical overprovisioning, and the variance is also dependent on the total storage capacity of the SSD. We then utilize this information to compute the expected write amplification, and to verify our results by simulation. Our theoretical (formula-based) prediction suggests, and our simulations verify, that a considerable write amplification reduction is found as the proportion of Trim requests to Write requests increases. In particular, write amplification drops by almost half over the no-Trim case if just 10% of the requests are Trim instead of Write.

We extend our models of effective overprovisioning and write amplification to allow for variation of the workload. We explore data of varying sizes as well as data with varying frequency of access. These extensions allow the flexibility needed to more closely model real-world workloads.

Models predicting the write amplification for workloads including Trim requests can be used in real-world situations ranging from helping cloud storage data center operators manage resources to reduce costs without sacrificing performance to optimizing the performance of apps on mobile devices.

Chapter 1

Introduction

Portable devices such as smartphones and tablets are increasing in popularity, feeding the demand for storage that meets the unique requirements of small devices with a limited battery power supply. Additionally, consumers expect instantaneous responses from their electronic devices, making the speed of the storage device a key element as well. NAND flash based solid state drives (SSDs) meet these storage needs with their small form factor, low power consumption, and fast speed compared to magnetic hard disk drives (HDDs) [3]. NAND flash SSDs are thus frequently found in consumer mobile devices. Although the advertised sequential write speed of NAND flash exceeds the speeds of currently available wireless networks, Kim, et al. found that in practice, the speed of the storage on a mobile device has a significant impact on the performance of popular apps, regardless of the speed of the network [4]. It is clear that manufacturers of faster SSDs have a competitive advantage, especially when the cost per unit of capacity is comparable across products.

SSDs are not exclusively used in consumer applications; they have become a ubiquitous storage solution in enterprise applications as well. One example of an enterprise application that utilizes SSDs is cloud storage, which has become an attractive option for storing large amounts of data off-site. When choosing a cloud storage provider, in addition to examining the cost per unit of storage, customers should also take into account performance aspects such as latency [5]. Use of low-latency storage like SSDs helps to reduce the storage bottleneck and

improve performance, making data storage and retrieval speeds more appealing to users [6].

Despite their popularity, SSDs have some drawbacks, notably a performance slowdown due to write amplification when the device is filled with data, and a limited lifetime due to wearout of the floating-gate transistors, which limits the number of program/erase cycles before device failure. Additionally, SSDs currently need to conform with legacy interfaces from HDDs, which makes performance optimization of SSDs more challenging. Even though there is a performance increase when moving from magnetic storage to flash SSDs, it is evident that further improvement of the performance characteristics of SSDs can help to enhance the end-user experience.

One recent change to the legacy HDD interface, the Trim command, helps increase the performance of SSDs. Although the Trim command is known to improve performance, models explaining why there is an improvement and quantifying the amount of the improvement have not yet been proposed in the scientific literature. The Trim analysis models presented in this dissertation are original first investigations necessary to advance the important design decisions concerning SSD architecture.

The models we present in this dissertation apply directly to the Trim-modified uniform random workload, described in Section 2.6, as well as to variations introduced to allow flexibility in that workload, including frequently and infrequently written (“hot and cold”) data, and data of varying sizes. Testing the applicability of the models on real-world workloads is desirable, but I/O traces¹ from SSDs, especially those containing the Trim command, are not easily available [7]. However, our modeled workload and variations are good approximations to the real-world HDD workloads analyzed by Wang, et al. [7], giving us confidence in the applicability of our Trim analysis models to real world systems.

A good model of the SSD behavior in response to the Trim command is important because it allows system designers to assess how changes in the amount of Trim in the workload alter the system’s performance. This helps to identify key points in the overall SSD architecture that can affect performance, and also leads to an improved understanding of the benefits and drawbacks to proposed computer architecture changes. Models offer insights into systems, and pave the way for the conception of new ideas and algorithms, which can lead to improved SSD architecture and future SSD interfaces.

In this dissertation, we model the performance improvement when the Trim command is implemented in SSDs, and offer an explanation for why the improvement is so sizeable for even a small amount of Trim added to the workload. In Chapter 2, we give the necessary background information about NAND flash SSDs for the reader to understand the subsequent theory. We then describe the impact on device utilization, measured as effective overprovisioning, under a Trim-modified uniform random workload in Chapter 3, and the resulting decrease in write amplification in terms of the proportion of Trim requests in the workload in Chapter 4. Finally, in Chapters 5 and 6, we examine variations to the Trim-modified uniform random workload that allow closer modeling of real-world workloads before we conclude in Chapter 7.

¹I/O traces record, along with other information, logical block level input and output commands from the file system to the storage device. I/O traces are often used as a reproducible way to benchmark the performance of a system.

Chapter Acknowledgements

Chapter 1, in part, is a reprint of the material as it appears in “A Mathematical Model of the Trim Command in NAND-Flash SSDs,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper. “The Effect of Trim Requests on Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 1, in part, has been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper. “Solid State Disk Object-Based Storage with Trim Commands,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 2

NAND Flash SSD Primer

In this chapter, we provide background about NAND flash solid state drives (SSDs) that is necessary to understand the models presented in this dissertation.

2.1 SSD Layout

The smallest write unit on a NAND flash chip is a page, which stores a fixed amount of data, often 2 KB or 4 KB. Pages are grouped into physical blocks, which are the smallest erase unit on a flash device¹. The number of pages per block varies by device, but is often 32, 64, 128, or 256 pages [8, 9]. Flash chips contain multiple blocks to reach the desired capacity of the device, and a single SSD may contain multiple flash chips. We denote the total number of blocks on the device by t, and the number of pages per block by np.

Unlike magnetic media, data on NAND flash SSDs cannot simply be overwritten. Writing, or programming, of NAND flash can only change bits from 1 to 0 – to change a bit back to 1 again, an erase cycle must occur. Writing can affect a single page, but erasure happens at the block level. Since all pages in a block must be erased at the same time, write-in-place schemes are impractical. For this reason, a flash translation layer (FTL) dynamically maps logical block addresses (LBAs) to physical locations. The FTL also performs garbage collection. Garbage collection is the process of choosing a used physical block for reclaiming, copying its valid pages to an unused block, then erasing the old block so that it is ready for writing again. The extra writes performed when copying valid pages in order to avoid data loss contribute to a phenomenon called write amplification, and cause a reduction in write speed and the consumption of more of the finite number of writes in the lifetime of each block.

There are many different algorithms used for FTLs, with varying performance for access speed [10, 11]; a survey and taxonomy of these is found in [12, 10]. These algorithms are often quite complex and beyond current analytic mathematical models. The choice of FTL is beyond the scope of this dissertation, but it is important to note that when overwriting a logical block, good FTLs will place the new data in a different physical location, and invalidate the pages where the data was previously stored. In this research, we utilize a log-structured style of data placement [13] and the greedy garbage collection policy, described in Section 2.2, which has been previously analyzed by Hu, et al. [1, 2], Agarwal and Marrow [14], and Xiang and Kurkoski [15] to determine the amount of write amplification in workloads without Trim.

¹Note that physical blocks are different from logical blocks, and the type referred to when the word “block” is used needs to be determined from the context. In this dissertation, we attempt to specify logical or physical block if the context is potentially unclear.

2.2 Greedy Garbage Collection

At some point in the process of writing data, most of the pages will be filled with either valid or invalidated data, and more space is needed for writing fresh data. Garbage collection occurs in order to reclaim additional blocks. In this process, a victim block is chosen for reclamation, its valid pages are copied elsewhere to avoid data loss, then the entire block is erased and the FTL map pointer is changed. Many algorithms exist for choosing the victim block, most of which are based on either selecting the block that creates the most free space or one that also takes into account the age of the data on the block, as originally described in the seminal paper by Rosenblum and Ousterhout [16]. A discussion of proposals for garbage collection can be found in [16, 17, 18].

[Diagram for Figure 2.1: greedy garbage collection, after Hu, et al. [1, 2]; see the caption below.]

Figure 2.1: Schematic of greedy garbage collection. Data is written to the current writing block, which is the first block in the free block pool. When garbage collection is triggered, the block with fewest valid pages is selected from the occupied block queue. After its valid pages are copied to the current writing block, the selected block is erased and appended to the end of the free block pool.

A log-structured algorithm is used to map logical locations to physical locations [1, 2]. At any given time, one physical block is designated as the current writing block. Incoming LBA Write and Trim requests are processed in order, and data is written to the next available physical page on the current writing block. This continues until the block is full, at which point a new writing block is chosen from the beginning of the free block pool. When the number of physical blocks in the free block pool reaches a predetermined reserve², r, a used block must be chosen for garbage collection. Under greedy garbage collection, the emptiest block (containing the fewest valid pages) is chosen from the occupied block queue. If there is a tie for the emptiest block, the oldest block is chosen, where the age of a block is determined by when it was most recently filled with data. Once a block is chosen for reclamation by garbage collection, all valid pages in the block are copied to the new writing block, then the reclaimed block is erased and appended to the end of the free block pool. See Algorithm 1 and Figure 2.1 for visual descriptions of this process.

2.3 Write Amplification

When a block is chosen for erasure during garbage collection, any valid pages on the block first need to be copied elsewhere to avoid data loss. Copying these valid pages leads to write amplification, a phenomenon in which the device performs more writes than are requested of it. Write amplification, A, is a way of measuring the efficiency of garbage collection, and is defined as the average ratio of physical page writes to user page write requests [1]. Most FTLs and garbage collection algorithms attempt to reduce write amplification and perform wear leveling in order to keep the number of program/erase cycles uniform over all blocks. Greedy garbage collection has been proven optimal for random workloads in terms of write amplification reduction [2]. Write amplification under a uniform random workload (without Trim commands) in SSDs has been studied by Hu, et al. [1, 2], Agarwal and Marrow [14], and Xiang and

²In practice, this predetermined reserve is very small compared to the total number of blocks, and is often ignored in analysis.

Input: LBA requests of types Write and Trim.
Output: Data placement and garbage collection procedures.

while there are LBA requests do
    if request is Write then
        write new data to first free page in current-block;
        if LBA written was previously In-Use then
            invalidate appropriate page in FTL mapping;
        end
        if size(free pages in current-block) = 0 then
            append current-block to end of occupied-block-queue;
            current-block ← first block in reserve-queue;
            if size(reserve-queue) < r then
                // garbage collection:
                victim-block ← block with fewest valid pages from occupied-block-queue;
                copy valid pages from victim-block to current-block;
                erase victim-block and append to reserve-queue;
            end
        end
    else
        // request was Trim
        invalidate appropriate page in FTL mapping;
    end
end

Algorithm 1: Data Placement and Greedy Garbage Collection

Kurkoski [15]; their theory is discussed in Chapter 4. Reducing write amplification improves performance by increasing the average write speed, since fewer extra writes are performed by the device. Additionally, by reducing the number of program/erase cycles, SSD lifetime is extended, since flash devices wear out and lose the ability to store data after being overwritten many times. Manufacturers use several techniques to reduce write amplification in an effort to mitigate the negative performance impact. Two of these techniques are physically overprovisioning the device with extra storage and implementing the Trim command [19].
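To make the relationship between garbage collection and write amplification concrete, the following short Python sketch implements the data placement and greedy garbage collection procedure of Algorithm 1 for a uniform random workload and reports the measured write amplification. It is an illustration only, not the simulator used for the results in this dissertation; the device parameters (u, np, t, r) and request count are arbitrary example values.

```python
# Illustrative sketch of Algorithm 1 (log-structured placement + greedy garbage
# collection); parameters are example values, not those used in the dissertation.
import random

def write_amplification(u=2048, n_p=64, t=40, r=2, n_requests=500_000, q=0.0, seed=1):
    """Return (physical page writes) / (requested page writes) for a uniform
    random workload with Trim probability q (q = 0 gives the no-Trim workload)."""
    rng = random.Random(seed)
    free_blocks = list(range(t))       # free block pool (FIFO)
    occupied = []                      # occupied block queue, oldest first
    valid = [0] * t                    # valid pages per block
    loc = {}                           # In-Use LBA -> (block, slot)
    owner = {}                         # (block, slot) -> LBA, for GC copying
    current, slot = free_blocks.pop(0), 0
    user_writes = physical_writes = 0

    def place(lba):
        nonlocal current, slot, physical_writes
        physical_writes += 1
        loc[lba] = (current, slot)
        owner[(current, slot)] = lba
        valid[current] += 1
        slot += 1
        if slot == n_p:                          # current writing block is full
            occupied.append(current)
            current, slot = free_blocks.pop(0), 0
            if len(free_blocks) < r:             # greedy garbage collection
                victim = min(occupied, key=lambda b: valid[b])
                occupied.remove(victim)
                for s in range(n_p):
                    moved = owner.pop((victim, s), None)
                    if moved is not None:        # still-valid page: copy it
                        place(moved)
                valid[victim] = 0
                free_blocks.append(victim)       # erased block rejoins the pool

    def invalidate(lba):
        old = loc.pop(lba, None)
        if old is not None:
            valid[old[0]] -= 1
            owner.pop(old, None)

    for _ in range(n_requests):
        if rng.random() < q and loc:             # Trim a uniformly chosen In-Use LBA
            invalidate(rng.choice(list(loc)))
        else:                                    # Write a uniformly chosen LBA
            lba = rng.randrange(u)
            invalidate(lba)                      # old copy (if any) becomes invalid
            user_writes += 1
            place(lba)
    return physical_writes / user_writes

print(write_amplification())   # example device with spare factor 0.2 and no Trim
```

The sketch includes the initial fill of an empty device in the measured ratio, which slightly lowers the reported value relative to a steady-state measurement.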

2.4 Overprovisioning

Overprovisioning of an SSD is the placing of more physical storage capacity on the device than the user is allowed to fill [1]. The measure of overprovisioning that we use is the spare factor

Sf = (Tp − u)/Tp

which is calculated using the total number of pages on the device, Tp, and the number of pages the user is allowed to fill, u. The spare factor is a convenient measure when making comparisons because it has a finite range between 0 and 1. Smaller spare factors indicate that the user has access to more of the storage space on the device, which results in a larger write amplification and poorer performance. Larger spare factors correspond to better performance but a smaller user-allowed proportion of the storage. Overprovisioning reduces write amplification by increasing the amount of time between garbage collections of the same block, which in turn increases the probability that more pages in the block are invalid at the time of garbage collection, and thus need not be copied. The mathematical relationship between overprovisioning and write amplification in uniform random workloads without Trim has been studied by Hu, et al. [1, 2], Agarwal and Marrow [14], and Xiang and Kurkoski [15], and is discussed in Chapter 4.

2.5 Trim Command

The Trim command also improves performance by reducing write amplification. Defined in the working draft of the ATA/ATAPI Command Set - 2 (ACS-2) [19] and supported by recent operating systems [19, 20] and most newer SSDs, the Trim command declares logical blocks inactive, and marks the corresponding physical pages as invalid in the FTL mapping, meaning they do not need to be copied before the blocks containing them are erased.

The performance increase seen with the Trim command is often measured by first testing the read and write speeds of both sequential and random data without enabling support for Trim, so that no Trim requests are sent to the SSD. Then, Trim support is enabled and the test is repeated. The results of this type of test vary based on the specific, and often proprietary, features of the FTL implemented by the SSD tested, and do not reveal information about the effective level of overprovisioning due to the Trim command.

We show the degree to which an SSD that supports Trim requires less physical overprovisioning to achieve the same write amplification performance observed with a non-Trim SSD.

2.6 Uniform Random Workloads

There exists a spectrum of data storage workloads seen in the real world, from the single user of a personal computer to multiple users of cloud computing, database transactions, sequential serial access, and random page access. The statistics of the workload depend on the application, and can have a large effect on the performance of a storage device [21]. On one end of the workload spectrum is a completely sequential workload; on the other end is a random workload. Most real-world workloads are somewhere between these two ends. Purely sequential workloads do not result in write amplification, since the sequential nature of the requests means that when garbage collection occurs, invalid pages are concentrated in complete blocks, and no extra writes are required to copy valid pages during the reclamation process [1]. Random workloads do cause write amplification though, because invalid pages are distributed across many blocks when garbage collection occurs. This means pages with valid data need to be copied from a block that is garbage collected, resulting in more actual writes than were requested. The challenging nature of random workloads makes them interesting to study.

Previous analysis of write amplification assumes a standard uniform random workload of single-page Write requests under greedy garbage collection. The input workload consists only of Write requests; it does not contain read requests since they do not affect write amplification or significantly affect I/O rate. Because of the assumption that one LBA Write request occupies one physical page, the number of allowed user LBAs is the same as the maximum number of pages in which the user can have data stored, u. Writes are chosen uniformly and randomly from all allowed user LBAs. When a Write request is made, there are two possibilities: 1) the request is for an LBA that has never been written, or 2) the request is for an LBA that has been written previously. In both cases, the new data is written to the next empty page in the current writing block, but in the second case, the physical page where the old data was written must also be invalidated in the FTL mapping.

We define the Trim-modified uniform random workload, which introduces Trim requests to the standard uniform random workload. The input workload consists of both Trim and Write requests, with Trim requests occurring with probability q, and Write requests occurring with probability 1 − q. Write requests are still chosen uniformly over all user LBAs, but now the two possibilities for Write requests are slightly altered: 1) the request is for an LBA that is not In-Use, or 2) the request is for an LBA that is In-Use, so the physical page where that LBA data was previously stored must be invalidated. An In-Use LBA is defined as one for which the most recent request was Write; all other LBAs are considered not In-Use. In the Trim-modified uniform random workload, Trim requests are chosen uniformly randomly over all In-Use LBAs. The standard uniform random workload (without Trim requests) eventually results in all LBAs being In-Use.
The Trim-modified uniform random workload reaches a steady state distribution in the number of In-Use LBAs, which we show in Chapter 3 to be well approximated by a Gaussian with mean and variance given respectively by

u · (1 − 2q)/(1 − q) = us    and    u · q/(1 − q) = us̄

where the substitutions s = (1 − 2q)/(1 − q) and s̄ = q/(1 − q) represent the steady-state probability of a Write requesting an In-Use LBA and the steady state probability of a Write requesting a not In-Use LBA, respectively.³

The steady state number of In-Use LBAs is used to compute the mean effective spare factor of a device. The effective spare factor is defined similarly to the manufacturer specified spare factor, but uses Xn, the number of valid pages (equivalent to the number of In-Use LBAs) at time n:

Seff = (Tp − Xn)/Tp

The mean effective spare factor

S̄eff = q/(1 − q) + ((1 − 2q)/(1 − q)) Sf = s̄ + s Sf

does not depend on the size of the device, and is calculated in terms of the manufacturer specified spare factor, Sf. The variance of the effective spare factor is dependent on the size of the SSD, and becomes negligible with increasing device size, Tp. Details of the calculations are in Chapter 3.
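As an illustrative check (not part of the original analysis), the following short Python sketch simulates the Trim-modified uniform random workload directly and compares the observed number of In-Use LBAs against the steady-state mean us and variance us̄ quoted above. The values of u, q, and the request count are arbitrary example choices.

```python
# Sketch: simulate the Trim-modified uniform random workload and compare the
# observed In-Use LBA count to the steady-state mean u*s and variance u*s_bar.
# u, q, and n_requests are example values only.
import random

u, q, n_requests = 1000, 0.25, 1_000_000
s, s_bar = (1 - 2 * q) / (1 - q), q / (1 - q)

rng = random.Random(0)
in_use = set()
samples = []
for i in range(n_requests):
    if rng.random() < q and in_use:                  # Trim a random In-Use LBA
        in_use.discard(rng.choice(tuple(in_use)))
    else:                                            # Write a random user LBA
        in_use.add(rng.randrange(u))
    if i > n_requests // 2:                          # crude warm-up cut
        samples.append(len(in_use))

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(f"mean {mean:.1f}  vs  u*s     = {u * s:.1f}")
print(f"var  {var:.1f}  vs  u*s_bar = {u * s_bar:.1f}")
```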

2.7 Object Based Storage

In object-based storage, the file system does not assign data to particular LBAs. Instead, the file system tells the object-based storage device to store certain data; the device chooses where to store it, and gives the file system a unique object ID with which the data can be retrieved and/or modified [22]. Object-based storage allows for variable-sized objects with identifiers that need not indicate any information about the physical location of the data [23]. Rajimwale, et al. have proposed that object-based storage is an optimal interface for flash-based SSDs [24].

We use object-based storage as motivation to remove the constraint that one LBA Write occupies one physical page. Under object-based storage, the workload is extended to allow one object Write to occupy one or more physical pages; this constraint change is applied for extensions to the theory of effective overprovisioning and write amplification, and is discussed in Chapter 6.

³Note that s = 1 − s̄.

Chapter Acknowledgements

Chapter 2, in part, is a reprint of the material as it appears in “A Mathematical Model of the Trim Command in NAND-Flash SSDs,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper. “The Effect of Trim Requests on Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 2, in part, has been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper. “Solid State Disk Object-Based Storage with Trim Commands,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 3

Trim Model

In this chapter, a model for the effect of the Trim command on effective overprovisioning is derived. The Trim-modified uniform random workload is analyzed for Write requests that occupy one physical page. For the convenience of the reader, a list of variables and their descriptions used in the analysis of this and subsequent chapters is found in Table 3.1.

3.1 Workload as a Markov Birth-Death Chain

The Trim-modified uniform random workload can be viewed as a Markov birth-death chain in which the state, Xn, is the number of In-Use LBAs. This means that Trim requests correspond to a reduction in state by 1, and Write requests can result in either an increase of the state by 1 (when the LBA requested is not In-Use) or no change in the state (when the LBA requested is already In-Use). A general Markov birth-death chain has the following transition probabilities:

P(Xn+1 = x − 1 | Xn = x) = qx
P(Xn+1 = x     | Xn = x) = rx
P(Xn+1 = x + 1 | Xn = x) = px

with the values of qx, rx, and px depending on the current state. The unnormalized steady state occupation probabilities are calculated by [25]:

πx := Punnormalized(Xsteady = x) = (p0 p1 ··· px−1)/(q1 q2 ··· qx)   for x ≥ 1,   and   πx = 1   for x = 0

Table 3.1: List of common variables and their descriptions, as defined in this dissertation.

  Variable                                   Description
  Tp                                         Number of pages on SSD
  t = Tp/np                                  Number of blocks on SSD
  u                                          Number of pages the user is allowed to fill
  np                                         Number of pages in a block
  r                                          Minimum number of reserved blocks in reserve-queue
  Sf = (Tp − u)/Tp                           Manufacturer specified spare factor
  Seff = (Tp − Xn)/Tp                        Effective spare factor at time n
  Xn                                         Number of In-Use logical block addresses (LBAs) at time n
  q                                          Probability of Trim request
  πx                                         Unnormalized steady state probabilities
  s = (1 − 2q)/(1 − q)                       Steady state probability of a Write being for an In-Use LBA
  s̄ = q/(1 − q)                              Steady state probability of a Write being for a not In-Use LBA
  ρ = (Tp − u)/u                             Alternate measure of overprovisioning
  ρeff = S̄eff/(1 − S̄eff) = (1 + ρ)/s − 1     Alternate measure of effective overprovisioning
  ueff = us                                  Mean number of In-Use LBAs at steady state
  σ² = us̄                                    Variance of the number of In-Use LBAs at steady state
  fh, fc                                     Fraction of the data that is hot or cold
  ph, pc                                     Probability of requesting hot or cold data
  A(T)                                       Trim-modified write amplification

3.1.1 Steady State Solution

To compute the steady state occupation probabilities of the Trim-modified uniform random workload, first compute qx, rx, and px. Because the Trim rate is constant, qx = q for all x. The probability of requesting an In-Use LBA is as follows. Let W be the event that a write request occurs. Let H be the event that an In-Use LBA is chosen (or “hit”), and H^c be the complement of H, the event that an In-Use LBA is not chosen (equivalently, that a not In-Use LBA is chosen).

rx = P(H, W | Xn = x)
   = P(H | W, Xn = x) P(W | Xn = x)
   = (x/u)(1 − q)

Similarly, the probability of requesting a not In-Use LBA is

px = P(H^c, W | Xn = x)
   = P(H^c | W, Xn = x) P(W | Xn = x)
   = ((u − x)/u)(1 − q)

Plugging these values for qx, rx, and px into the unnormalized steady state distribution gives

πx = (p0 ··· px−1)/(q1 ··· qx)
   = [((u − 0)/u)(1 − q)] ··· [((u − (x − 1))/u)(1 − q)] / q^x
   = ((1 − q)/q)^x · u!/(u^x (u − x)!)

Note that when x = 0, the above equation has the value 1; we no longer need to worry about separating πx into the cases based on the value of x. We can simply state that

πx = ((1 − q)/q)^x · u!/(u^x (u − x)!),   ∀x ∈ {0, ..., u}        (3.1)

Although Equation 3.1 is an accurate representation of the unnormalized steady state occupation probabilities, it is not easy to understand the distribution without graphing it (see Figure 3.1).
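Although normalizing Equation 3.1 in closed form is difficult, it is straightforward to do so numerically. The short Python sketch below (an illustration with example values of u and q, not part of the original derivation) evaluates Equation 3.1 in log-space to avoid overflow of the factorial terms, normalizes the result, and confirms that the mean and variance are close to the asymptotic values us and us̄ discussed in Section 3.2.

```python
# Sketch: numerically normalize Equation 3.1 (computed in log-space) and check
# its mean and variance against the asymptotic values u*s and u*s_bar.
# u and q are example values only.
import math

u, q = 1000, 0.4
s, s_bar = (1 - 2 * q) / (1 - q), q / (1 - q)

log_pi = [x * math.log((1 - q) / q) + math.lgamma(u + 1)
          - x * math.log(u) - math.lgamma(u - x + 1) for x in range(u + 1)]
m = max(log_pi)
w = [math.exp(v - m) for v in log_pi]     # rescaled, still unnormalized weights
z = sum(w)
p = [wi / z for wi in w]                  # normalized steady-state PDF

mean = sum(x * px for x, px in enumerate(p))
var = sum((x - mean) ** 2 * px for x, px in enumerate(p))
print(mean, u * s)       # mean is close to u*s (about 333 for these values)
print(var, u * s_bar)    # variance is close to u*s_bar (about 667 for these values)
```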

3.2 Interpretation of Steady State Occupation Probabilities

Normalization of Equation 3.1 is difficult, so we continue to work with the unnormalized equation as we manipulate it to a more familiar form. Start by defining s = (1 − 2q)/(1 − q) and s̄ = 1 − s = q/(1 − q), noting that s = 1 − s̄.¹ Now, rewrite Equation 3.1 as

πx = s̄^(−x) · u!/(u^x (u − x)!)

This is still a challenging equation to understand, but manipulation in log-space helps:

log (πx) = −x log (s̄) + log (u!) − x log (u) − log ((u − x)!)

Because this is an unnormalized probability, terms that do not involve x can be added or subtracted, since they only affect the normalization factor. We use ∼ to denote an equivalence up to the normalization factor.

log (πx) ∼ − log ((u − x)!) − x log (us̄)

Now, apply Stirling’s approximation: log (n!) ≈ n log (n) − n, using ≈ to denote an asymptotic approximation. It is well known that Stirling’s approximation for

¹The choice for s is not immediately obvious; however, subsequent analysis and simulation of the workload shows that the mean of the asymptotic distribution is µ = us and the variance is σ² = us̄. s and s̄ turn out to be the steady state probabilities of a Write request being for an In-Use LBA or for a not In-Use LBA, respectively.

[Plot for Figure 3.1: PDF of the uniform random workload at steady state for 10%, 25%, and 40% Trims; x-axis: State (# In-Use LBAs); y-axis: Probability.]

Figure 3.1: Steady state distribution of the number of In-Use LBAs for the Trim-modified uniform random workload under various amounts of Trim, with u = 1000. Note that a higher percentage of Trim requests in the workload results in a lower mean but a higher variance.

factorials converges very quickly, and for relatively small values of n can practically be considered to provide a “ground truth” equivalent for the value of the factorial function². The application of Stirling’s approximation to this problem is reasonable. For SSDs of several gigabytes, u is on the order of hundreds of thousands, so the Stirling approximation of (u − x)! is effectively ground truth except for a few x very close in value to u.

log (πx) ≈ −(u − x) log (u − x) + (u − x) − x log (us̄)

Add another term for convenience and rearrange

log (πx) ∼ −(u − x) log (u − x) + (u − x) − x log (us̄) + u log (us̄)
         = −(u − x) log (u − x) + (u − x) + (u − x) log (us̄)
         = −(u − x) log ((u − x)/(us̄)) + (u − x)                       (3.2)

We call Equation 3.2 the Stirling PDF. Appendix A contains a rigorous proof that the Stirling PDF converges to the Gaussian distribution in the limit as u → ∞.³ At this point, a Taylor series expansion is a natural way to handle the log term, but a change of variables must be made before applying the standard log expansion

log(1 − α) = −Σ_{n=1}^∞ αⁿ/n        for −1 ≤ α < 1

Let

α = (x − us)/(us̄)

Then after some algebra and defining σ² = us̄,

(u − x) log ((u − x)/(us̄)) = σ²(1 − α) log(1 − α)

Plug back in to Equation 3.2, apply the Taylor series approximation⁴, then drop terms that do not depend on x.

log (πx) ≈ −(u − x) log ((u − x)/(us̄)) + (u − x)
         = −σ²(1 − α) log(1 − α) + σ²(1 − α)
         = σ²(1 − α) Σ_{n=1}^∞ αⁿ/n + σ²(1 − α)
         = −σ²(−α + Σ_{n=1}^∞ α^(n+1)/(n(n + 1))) + σ²(1 − α)
         ≈ −σ²(−α + α²/2) + σ²(1 − α)
         ∼ −σ²α²/2
         = −(x − us)²/(2us̄)                                            (3.3)

Exponentiate to remove the log to obtain the Taylor series first-order approximation of πx,

πx = s̄^(−x) · u!/(u^x (u − x)!) ≈ e^(−(x − us)²/(2us̄))

an unnormalized Gaussian random variable with mean us and variance us̄. The Gaussian nature of the result is gratifying but unanticipated, as the central limit theorem is not obviously applicable. The approach presented in this section is quite powerful, and makes it possible to perform a higher-order Taylor series expansion resulting in a non-Gaussian PDF that applies in the non-far asymptotic limit. A summary of the analysis of the higher-order terms is in Section 3.4; a full analysis is in Appendix A.

²Stirling’s Formula, one of several possible Stirling approximations to the factorial function, provides a highly accurate approximation to n! for values of n as small as 10-to-100. For this reason, for even moderately sized values of n, the Stirling Formula is used to derive a variety of useful probability distributions in fields such as probability theory [26], equilibrium statistical mechanics [27], and information theory [28]. Because of the remarkable accuracy of the Stirling Approximation, which very rapidly increases with the size of n, the resulting distributions can be taken to be “ground truth” distributions in many practical domains. Derivations of the Stirling Approximation, and discussions of its accuracy and convergence, can be found in pages 52-54 of [26], Appendix A.6 of [27], and pages 522-524 of [29].

³We omit the proof from the main text in order to present an unimpeded description of the research, as the particular details are lengthy.

⁴See Section 3.3 for proof that α lies in the region of convergence.
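The following short numerical sketch (illustrative only; u and q are example values) compares the numerically normalized exact PDF of Equation 3.1, the Stirling PDF of Equation 3.2, and the Gaussian approximation of Equation 3.3, in the spirit of the comparison shown later in Figure 3.4. The range is restricted to x < u so that the log(u − x) term in Equation 3.2 stays finite; the probability mass omitted near x = u is negligible for these parameter values.

```python
# Sketch: compare the exact PDF (Eq. 3.1), the Stirling PDF (Eq. 3.2), and the
# Gaussian approximation (Eq. 3.3) after numerical normalization.
import math

u, q = 1000, 0.4
s, s_bar = (1 - 2 * q) / (1 - q), q / (1 - q)

def normalize(log_w):
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    z = sum(w)
    return [x / z for x in w]

xs = range(u)   # x = u excluded: Eq. 3.2 contains log(u - x)
exact = normalize([x * math.log((1 - q) / q) + math.lgamma(u + 1)
                   - x * math.log(u) - math.lgamma(u - x + 1) for x in xs])
stirling = normalize([-(u - x) * math.log((u - x) / (u * s_bar)) + (u - x) for x in xs])
gauss = normalize([-(x - u * s) ** 2 / (2 * u * s_bar) for x in xs])

print(max(abs(a - b) for a, b in zip(exact, gauss)))      # both differences are
print(max(abs(a - b) for a, b in zip(exact, stirling)))   # small for u this large
```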

3.3 Region of Convergence of Taylor Series Expansion

The Taylor series expansion for log(1 − α) converges when −1 ≤ α < 1. This is equivalent to

u(2s − 1) ≤ x < u        (3.4)

Since x can only take values representing the state of the Markov chain, it is limited to the range [0, u]. The RHS of Equation 3.4 is shown to actually be x ≤ u. Examination of the application of the Taylor series expansion shows

σ²(1 − α) log(1 − α) = −σ²(1 − α) Σ_{n=1}^∞ αⁿ/n
                     = σ²(−α + Σ_{n=1}^∞ α^(n+1)/(n(n + 1)))        (3.5)

At α = 1, which corresponds to x = u, the LHS of Equation 3.5 is σ² · 0 · log 0 = 0. The RHS of Equation 3.5 evaluates to 0 as well, since when α = 1, the RHS sum becomes a standard telescoping series.

A closer examination of the LHS of Equation 3.4 is needed. Interestingly, u(2s − 1) has a range from −u to u, depending on the value of s (and ultimately on q, whose meaningful range is 0 ≤ q < 1/2). If u(2s − 1) ≤ 0, then the region of convergence covers all x that are of interest (that is, 0 ≤ x ≤ u). This happens when s ≤ 1/2 (equivalently q ≥ 1/3) so all 1/3 ≤ q ≤ 1/2 result in convergence for all allowed values of x. q < 1/3 has a region of convergence that is smaller than the range of x, but we show it is large enough to contain all x that are likely to occur at steady state.

Compute the value of q needed for convergence when x is γ standard deviations below the mean, thus taking the value us − γ√(us̄). Set

u(2s − 1) ≤ us − γ√(us̄)

to compute

q ≥ γ²/(u + γ²)        (3.6)

For any value of γ, as u → ∞, q ≥ 0. This means that the Taylor series converges for all values of q, so long as a sufficiently large u is chosen. Solving Equation 3.6 for u results in a determination of the size of “sufficiently large” for u when the value of γ is fixed. This results in

u ≥ γ²/s̄

The three-sigma rule states that in normal distributions, 99.7% of values lie within 3 standard deviations from the mean. Since values outside ±3σ are unlikely to be seen, it is sufficient to choose a u that causes the region of convergence to cover this range. This means that a small q = 0.01 only requires that u = 891, and a slightly larger q = 0.1 needs only u = 81 for the analysis of this dissertation to be accurate.
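The two values of u quoted above follow directly from the bound u ≥ γ²/s̄ with γ = 3; a two-line check (illustrative only) is:

```python
# Check of the bound u >= gamma^2 / s_bar for gamma = 3.
gamma = 3
for q in (0.01, 0.1):
    s_bar = q / (1 - q)
    print(q, gamma ** 2 / s_bar)   # prints 891.0 for q = 0.01 and 81.0 for q = 0.1
```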

3.4 Higher Order Moments

The analysis of Section 3.2 allows the computation of higher-order moments such as skewness and kurtosis by truncating the Taylor series expansion at later terms. Appendix A heuristically argues that all of the approximation PDFs obtained via truncation of the Taylor series for the Stirling PDF should become highly accurate in the limit u → ∞. The skewness is calculated as⁵

SkewX = −1/σ + R(1/σ³)        (3.7)

and the kurtosis is calculated as

KurtosisX = 3/(4σ²) + R(1/σ⁴)        (3.8)

For the details of the derivation of the skewness and kurtosis, see Appendix A.

⁵R(1/σ³) is defined to mean that the remaining terms are at least as small as O(1/σ³), in terms of the “big O” terminology.

The slight negative skew is toward a higher level of effective overprovisioning. This is more evident for small u with a narrow variance (low but non-zero q). Figure 3.5 shows simulation, exact analytic values, and approximate Gaussian values for very small u = 25. With this small value of u, the slight negative skew is evident. It is unlikely that such a low value of u would ever be encountered in a real device; the higher-order analysis is included here for completeness. The more detailed asymptotic analysis given in Appendix A shows that even in the limit of large u there is an irreducible, residual absolute error in the location of the mean of 0.5 which, consistent with the theory of asymptotic analysis, is asymptotically relatively negligible (lim_{u→∞} 0.5/u = 0). We correct for this known residual error in the approximation models via a shift, which results in the shown fits in Figures 3.4 and 3.5.
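The leading terms of Equations 3.7 and 3.8 can be checked against the values quoted in the caption of Figure 3.5 (u = 25, q = 0.3); a small illustrative computation is:

```python
# Leading-order skewness and kurtosis terms for the Figure 3.5 parameters.
u, q = 25, 0.3
sigma = (u * q / (1 - q)) ** 0.5     # sigma^2 = u * s_bar
print(-1 / sigma)                    # about -0.306, the skew quoted in Figure 3.5
print(3 / (4 * sigma ** 2))          # about 0.070, the kurtosis quoted in Figure 3.5
```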

3.5 Effective Overprovisioning

The effective spare factor is computed using the results of the computation of the distribution of In-Use LBAs. The definition of the spare factor given in the introduction needs to be modified slightly to compute the effective spare factor

Seff = (Tp − Xn)/Tp

where Tp is the total number of pages in the SSD, and Xn is the number of In-Use LBAs, which is equivalent to the number of valid pages, at time n. The mean effective spare factor is

S̄eff = (Tp − us)/Tp = s̄ + s Sf

If the manufacturer specified spare factor is known, the impact of the Trim command on the mean effective spare factor does not depend on the total number of pages. However, the variance of the effective spare factor is dependent on the total number of pages in addition to the manufacturer specified spare factor:

Var(Seff) = (1/Tp²) Var(Xn)

[Plot for Figure 3.2: PDF of the effective spare factor at steady state with 25% Trims, for specified spare factors of 0.00, 0.17, 0.33, and 0.50; x-axis: Effective Spare Factor; y-axis: Probability.]

Figure 3.2: Steady state distribution of the effective spare factor for the Trim- modified uniform random workload with 25% of requests Trim over 1000 user- allowed LBAs, and various amounts of manufacturer specified spare factor. Note the improvement in the effective spare factor over the manufacturer specified spare factor has a larger variance for smaller manufacturer specified spare factor.

As the size of the device grows large, the variance of the effective spare factor tends toward zero, indicating that for large enough SSDs, the penalty for ignoring the variance of the effective spare factor is negligible. Figure 3.2 shows a plot of the effective spare factor for various manufacturer specified spare factors.
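A small numerical sketch of these formulas follows; the q-to-s mapping and the relation u = T_p(1 − S_f) are assumptions consistent with this section, and the function name is illustrative. The printed values correspond to the 25%-Trim example discussed later in this chapter.

# Minimal sketch: mean and standard deviation of the effective spare factor from
# the steady-state In-Use LBA statistics. Assumes s = (1 - 2q)/(1 - q),
# s_bar = 1 - s, and u = Tp*(1 - Sf).
import math

def effective_spare_factor(Sf: float, q: float, Tp: int):
    s = (1.0 - 2.0 * q) / (1.0 - q)
    s_bar = 1.0 - s
    mean = s_bar + s * Sf                 # mean S_eff = s_bar + s*Sf
    var = s_bar * (1.0 - Sf) / Tp         # Var(S_eff) = s_bar*(1 - Sf)/Tp
    return mean, math.sqrt(var)

mean, sigma = effective_spare_factor(Sf=0.0, q=0.25, Tp=1000)
print(mean, mean - 3 * sigma, mean + 3 * sigma)   # ~0.33, ~0.28, ~0.39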


Figure 3.3: Comparison of analytic and simulation PDF, for 40% Trim. The analytic values closely match the simulation results.

3.6 Simulation Results

The analytic results were checked with Monte-Carlo simulations for various values of u and q. Results shown in this section are for u = 1000, with the simulation means and variances computed over 9 million requests, then averaged over 128 Monte-Carlo simulations. Results for larger values of u are comparable. Figure 3.3 shows one example of how closely the analysis aligns with the simulation. Table 3.2 shows the analytic and simulation values. Figure 3.4 demonstrates the accuracy of the Stirling and Gaussian approximations for a very low u = 1000 and q = 0.4. The low value of u and high value of q were necessary in order to keep the graph from looking like a delta function, which is what a zoomed-out PDF of larger, more practically realistic, values of u


Figure 3.4: A histogram of one run of simulation results for u = 1000 and q = 0.4, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black dots, and the Gaussian approximation of the PDF (Equation 3.3) in green. The line for the exact analytic PDF is not even visible because the line for the Gaussian approximation of the PDF covers it completely. The Stirling approximation of the PDF (black dots) lines up well with the Gaussian approximation and the (covered) analytic PDF. Even at this low value of u = 1000, the theoretical approximations are effectively identical to the exact analytic PDF. 28


Figure 3.5: A histogram of one run of simulation results for u = 25 and q = 0.3, along with plots of the exact analytic PDF (Equation 3.1) in red, the Stirling approximation of the PDF (Equation 3.2) in black, and the Gaussian approximation of the PDF (Equation 3.3) in green. A slight negative skew of the exact analytic PDF is evident when compared to the Gaussian approximation. Note that the exact analytic PDF is the best fit of the histogram. A series of 64 Monte-Carlo simulations with these values result in an average skew of mean ± one standard deviation = −0.299 ± 0.004, which agrees well with the theoretically computed approximation value of −0.306 ± R(1/σ³) = −0.306 ± R(0.029). The average simulation kurtosis is 0.064; the approximate theoretical is 0.070; both values are well within overlapping error bounds. The top graph is a plot of the numerically normalized equations as shown in the text. In the bottom graph, a half-bin shift correction has been made to the Stirling and Gaussian approximation densities as described in the text.

Table 3.2: Mean and standard deviation (σ) for Number of In-Use LBAs computed over 128 Monte-Carlo simulations. Values in this table were computed with u = 1000.

          Analytic Values        Simulation Values
% Trim    Mean      σ            Mean      σ
10        888.89    10.54        888.87    10.55
20        750.00    15.81        749.97    15.80
30        571.43    20.70        571.42    20.72
40        333.33    25.82        333.32    25.80

appears to be. Even at such a low value of u, the simulation results, exact analytic PDF (Equation 3.1), Stirling approximation of the PDF (Equation 3.2), and the Gaussian approximation of the PDF (Equation 3.3) are indistinguishable. In order to see the slight skew predicted by higher order terms, we are forced to use an unrealistically low value of u = 25, as seen in Figure 3.5. However, even in this unrealistically small example, both the Stirling and Gaussian approximations are close matches to the exact PDF. Figure 3.6 illustrates the utilization of the user space as the percentage of In-Use LBAs at steady state for several rates of Trim. Higher rates of Trim correspond to lower utilization, but with a wider variance. The corresponding graph of the PDF of the steady state effective spare factor is found in Figure 3.7, with the example utilizing a manufacturer specified spare factor of 0.11. Here, higher rates of Trim correspond to a higher effective spare factor, but with a wider variance. The practical application of this theory shows the potential for impressive savings. A workload with 25% Trim requests can transform an SSD with zero manufacturer specified spare factor into one with a mean effective spare factor of 0.33. To achieve this level of overprovisioning without the Trim command would require having 50% more physical pages than user LBAs. This means that for a known workload, 50% of the cost of the NAND flash inside the SSD could be saved, without sacrificing performance! Figure 3.8 shows a graph of the mean effective spare factor versus the manufacturer specified spare factor, demonstrating the improvement in spare factor provided by the Trim command.
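The analytic columns of Table 3.2 follow directly from the mean us and standard deviation √(us̄); the short sketch below reproduces them. The q-to-s mapping is assumed as above, and variable names are illustrative.

# Minimal sketch: the analytic columns of Table 3.2, i.e. mean = u*s and
# sigma = sqrt(u*s_bar), with s = (1 - 2q)/(1 - q) assumed.
import math

u = 1000
for q in (0.1, 0.2, 0.3, 0.4):
    s = (1.0 - 2.0 * q) / (1.0 - q)
    mean = u * s
    sigma = math.sqrt(u * (1.0 - s))
    print(f"{int(q * 100):2d}% Trim: mean = {mean:7.2f}, sigma = {sigma:5.2f}")
# 10% -> 888.89, 10.54   20% -> 750.00, 15.81
# 30% -> 571.43, 20.70   40% -> 333.33, 25.82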


Figure 3.6: Theoretical results for utilization under various amounts of Trim. When more Trim requests are in the workload, a lower percentage of LBAs are In-Use at steady state, but the variance is higher than seen in a workload with fewer Trim requests. 31


Figure 3.7: Theoretical results for the effective spare factor under various amounts of Trim, with a manufacturer specified spare factor of 0.11. When more Trim requests are in the workload, a higher effective spare factor is seen, but with a higher variance than seen in a workload with fewer Trim requests. 32


Figure 3.8: Mean and 3σ values of effective spare factor versus manufacturer specified spare factor. This shows that an SSD with zero manufacturer specified spare factor effectively has a spare factor of 33% when the workload contains 25% Trim requests. This level of overprovisioning without the Trim command requires 50% more physical storage than the user is allowed to access. The ±3σ values were calculated using a small Tp = 1000, so they would be visible. Typical devices have a Tp >> 1000, resulting in a negligible variance. 33


Figure 3.9: Theoretical results comparing the effective spare factor to the manufacturer specified spare factor. The mean and 3σ standard deviations are shown. For this figure, Tp = 1280 was used so that the 3σ values are visible. Larger gains in the effective spare factor are seen for smaller specified spare factor. In this figure, we see that a workload of 10% Trim requests transforms an SSD with a manufacturer specified spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical overprovisioning alone would require 12.5% more physical space than the user is allowed to access!

Figure 3.9 shows similar results with a slightly larger SSD; in this experiment, a workload of 10% Trim requests transforms an SSD with a manufacturer specified spare factor of zero into one with an effective spare factor of 0.11. To obtain this spare factor through physical overprovisioning alone would require 12.5% more physical space than the user is allowed to access! Application of this theory should not ignore the variance for small values of Tp. The three-sigma rule means that the effective spare factor that will be seen 99.7% of the time lies within S̄_eff ± 3σ. For the specific case of 25% Trim requests, no manufacturer specified spare factor, and a small physical capacity of 1000 pages, this means 0.28 ≤ S_eff ≤ 0.39, which is the equivalent of having 39% to 63% more physical pages than user LBAs with physical overprovisioning alone. Typically, in practical applications Tp >> 1000, resulting in a much narrower range of S_eff. In fact, lim_{Tp→∞} Var(S_eff) = 0, indicating that for sufficiently large devices, S̄_eff can be used without paying heed to the variance.

Chapter Acknowledgements

Chapter 3, in part, is a reprint of the material as it appears in "A Mathematical Model of the Trim Command in NAND-Flash SSDs," T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference. ACM, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, has been submitted for publication of the material as it may appear in "Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 4

Application to Write Amplification

Several researchers have analyzed the write amplification that results under the standard uniform random workload and greedy garbage collection, but they did not address the effect of Trim requests on write amplification. In this chapter, we summarize the previously published derivations for write amplification under the standard uniform random workload, and then extend the models to work with the Trim-modified uniform random workload. We compare analytic and simulation results, demonstrating the range of performance for each model.

4.1 Write Amplification Under the Standard Uniform Random Workload

In Hu, et al. [1, 2], the authors perform a probabilistic analysis of a vari- ation of greedy garbage collection, the windowed greedy garbage collection. The windowed greedy garbage collection varies from greedy garbage collection by se- lecting a block for garbage collection from the w oldest blocks in the occupied block queue. When w is the size of the occupied block queue, Tp/np − r, win- dowed greedy garbage collection reduces to greedy garbage collection. Hu, et al. derive the write amplification based on the probability that the block selected for


Computation of Write Amplification Factor

Compute p = (p_1, ..., p_j, ..., p_w):

p_1 = e^{-1.9\left(\frac{T_p}{u} - 1\right)}    (4.1)

p_j = \min\!\left(1,\; \frac{1.1\, p_{j-1}}{\left(1 - \frac{1}{u}\right)^{n_p}}\right)    (4.2)

Compute v = (v_{j,k}), a matrix of size w × n_p, for j = 1, ..., w and k = 0, ..., n_p − 1:

v_{j,k} = 1 - \sum_{m=0}^{k} \binom{n_p}{m} p_j^m (1 - p_j)^{n_p - m}

Compute V = (V_1, ..., V_k, ..., V_{n_p}):

V_k = \prod_{j=0}^{w-1} v_{j,k}

Compute p^* = (p^*_0, ..., p^*_k, ..., p^*_{n_p}):

p^*_0 = 1 - V_0, \qquad p^*_k = V_{k-1} - V_k \;\text{ for } k = 1, \ldots, n_p - 1, \qquad p^*_{n_p} = 1 - V_{n_p - 1}

Compute the write amplification:

A_{Hu} = \frac{n_p}{n_p - \sum_{k=0}^{n_p} k\, p^*_k}    (4.3)

Figure 4.1: Equations needed to compute the empirical model-based algorithm for A_Hu (adapted from [1, 2]). w is the window size, and allows for the windowed greedy garbage collection variation. Setting w = T_p/n_p − r is needed for the standard greedy garbage collection discussed in this dissertation. For an explanation of the theory behind these formulas, see [1, 2].

garbage collection has 0, 1, ..., n_p valid pages. Their analysis results in a series of formulas used to empirically compute write amplification based on SSD and FTL parameters. Figure 4.1 lists these formulas in the order needed for computation. An intuitive explanation for each formula is beyond the scope of this chapter; more details can be found in [1, 2].

Agarwal and Marrow [14] derive a closed-form expression for write amplification by examining the number of valid pages in an arbitrary block. Because the workload is uniformly random, the number of valid pages in an arbitrary block is binomially distributed. However, their simulation indicated the empirical number of valid pages to be approximately uniformly distributed. By equating the expected values of each distribution, the authors computed the write amplification as

A_{Agarwal} = \frac{1}{2}\,\frac{1 + \rho}{\rho}    (4.4)

where ρ = (T_p − u)/u is a different measure of overprovisioning than the spare factor.¹

Xiang and Kurkoski [15] improve upon Agarwal's model by examining the number of invalid pages freed by garbage collection, which is represented by a binomially distributed random variable X. They assume that at steady state, X has a constant expected value, then equate this to the expected value of the binomial distribution in order to compute the write amplification, A. Finally, they look at the asymptotic value of A as the user space becomes large, and compute

A_{Xiang} = \frac{-1 - \rho}{-1 - \rho - W\!\left((-1 - \rho)\, e^{(-1 - \rho)}\right)}    (4.5)

where W(·) denotes the Lambert W function [30, 15].

¹Note that ρ = S_f/(1 − S_f), and that 0 ≤ ρ < ∞.
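For reference, Equations 4.4 and 4.5 are easy to evaluate numerically; the sketch below uses scipy.special.lambertw for W(·). Function names are illustrative, and the printed values correspond to the no-Trim rows of Table 4.1 for S_f = 0.2.

# Minimal sketch of the two closed-form baseline models quoted above:
# Equation 4.4 (Agarwal and Marrow) and Equation 4.5 (Xiang and Kurkoski).
import numpy as np
from scipy.special import lambertw

def A_agarwal(rho: float) -> float:
    return 0.5 * (1.0 + rho) / rho                              # Equation 4.4

def A_xiang(rho: float) -> float:
    x = -(1.0 + rho)
    return float(np.real(x / (x - lambertw(x * np.exp(x)))))    # Equation 4.5

print(A_agarwal(0.25), A_xiang(0.25))   # roughly 2.50 and 2.69 (q = 0 rows of Table 4.1, Sf = 0.2)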

4.2 Write Amplification Analysis of Trim-Modi- fied Uniform Random Workload

In this section, we analyze the write amplification of the Trim-modified uniform random workload under greedy garbage collection, paralleling the write

amplification analysis of the standard uniform random workload by Xiang and Kurkoski [15].

Let Y be a random variable representing the number of invalid pages reclaimed by garbage collection, analogous to Xiang and Kurkoski's X. After a large number of both Write and Trim requests, Y reaches a steady state with constant expected value E[Y] = y. An alternate way to consider the statistics of Y is to examine a given block chosen for garbage collection. Assume that all pages in a block selected for garbage collection have the same probability p_ie of being invalid, due to the uniform selection of LBAs for Write and Trim requests. This assumption is justified by Xiang and Kurkoski [15] when only Write requests are involved; the argument is not changed when the workload is modified to include Trim requests. The result of this assumption is that Y is a binomial random variable with n_p trials and probability p_ie of success. Hence, E[Y] = n_p p_ie. To reconcile the two ways of viewing Y, we note that the expected values must be equal: n_p p_ie = y.

Now, derive p_ie in a similar fashion to Xiang and Kurkoski [15], by considering a valid page α, corresponding to one In-Use LBA. We start by computing p_valid, the probability that this page α remains valid after k requests, in which Trims occur with probability q. To compute this, first consider the probability that a single request makes α invalid, remembering that a request can be either a Trim or a Write.

P(\text{request makes } \alpha \text{ invalid}) = P(\text{request is for } \alpha \mid \text{Trim})P(\text{Trim}) + P(\text{request is for } \alpha \mid \text{Write})P(\text{Write})
= q\,\frac{1}{us} + (1 - q)\,\frac{1}{u}
= \frac{1}{s(2 - s)u}
= \frac{1}{s(1 + \bar{s})u}

Thus,

p_{valid} = \left(1 - \frac{1}{s(1 + \bar{s})u}\right)^{k}

and the probability that page α becomes invalid after k user requests is

p_{invalid} = 1 - \left(1 - \frac{1}{s(1 + \bar{s})u}\right)^{k}

Because we assumed that all pages in a block selected for garbage collection have the same probability p_ie of being invalid, p_ie = p_invalid when k is the number of requests that occurred between garbage collections of the same block. Since each garbage collection frees an expected number of y pages, that means that y writes must occur before the next garbage collection (likely of a different block) is triggered. y Writes corresponds to y(1 + s̄) total requests, including both Writes and Trims. Next, we need to determine how many garbage collections occur during an entire block cycle, the time between garbage collections of the same block. Xiang and Kurkoski use simulation to verify that t = T_p/n_p blocks must be garbage collected during an entire block cycle. We agree that t blocks is the correct number, but argue it without simulation. Consider a single block β. The valid pages in β are equally likely to be invalidated as the valid pages in any other block. If more than t garbage collections occur between garbage collections of β, then that would mean that some other block has pages being invalidated more frequently than those of β, which would imply that another block's valid pages are more likely to be invalidated, but we know this is not true. If fewer than t garbage collections occur in the block cycle of β, that would mean that the pages in β are being invalidated more frequently than the pages of other blocks; but again, we know that this is not true. Therefore, the only value left for the number of garbage collections in a block cycle is t, which also indicates that greedy garbage collection tends to reduce to choosing the least recently used block. Combining the information about the number of blocks that are garbage collected in an entire block cycle with the number of requests made between consecutive garbage collections, we calculate the number of requests between garbage collections of the same block is ty(1 + s̄), making

p_{ie} = 1 - \left(1 - \frac{1}{s(1 + \bar{s})u}\right)^{ty(1 + \bar{s})}

Now, equate the two expected values:

n_p\left[1 - \left(1 - \frac{1}{s(1 + \bar{s})u}\right)^{ty(1 + \bar{s})}\right] = y

This equation is solved using the Lambert W function [30, 15].

y = n_p + \frac{-W\!\left(n_p t(1 + \bar{s})\left(1 - \frac{1}{s(1 + \bar{s})u}\right)^{n_p t(1 + \bar{s})} \ln\!\left(1 - \frac{1}{s(1 + \bar{s})u}\right)\right)}{t(1 + \bar{s})\,\ln\!\left(1 - \frac{1}{s(1 + \bar{s})u}\right)}

The write amplification is computed as n_p/y. We are more interested in the asymptotic approximation of y as u (the user space) becomes large:

\hat{y} = \lim_{u \to \infty} y

To compute this, substitute t = u(1 + ρ)/n_p and v = s(1 + s̄):

\hat{y} = n_p + \lim_{u \to \infty} \frac{-W\!\left(vu\,\frac{1 + \rho}{s}\left(1 - \frac{1}{vu}\right)^{vu\frac{1 + \rho}{s}} \ln\!\left(1 - \frac{1}{vu}\right)\right)}{\frac{vu}{n_p}\,\frac{1 + \rho}{s}\,\ln\!\left(1 - \frac{1}{vu}\right)}
= n_p + \frac{-W\!\left(\frac{-(1 + \rho)}{s}\, e^{\frac{-(1 + \rho)}{s}}\right)}{\frac{-(1 + \rho)}{n_p s}}

Now, computing A^{(T)}_{Xiang}² = n_p/\hat{y} results in

A^{(T)}_{Xiang} = \frac{\frac{-(1 + \rho)}{s}}{\frac{-(1 + \rho)}{s} - W\!\left(\frac{-(1 + \rho)}{s}\, e^{\frac{-(1 + \rho)}{s}}\right)}    (4.6)

Thus, the write amplification for Trim-modified uniform random workloads under greedy garbage collection is dependent on the amount of overprovisioning, as measured by ρ, and s, a function of the proportion of Trim requests in the workload. In fact, Equation 4.6 is the same as Equation 4.5, but with

\rho_{eff} = \frac{\bar{S}_{eff}}{1 - \bar{S}_{eff}} = \frac{1 + \rho}{s} - 1    (4.7)

in place of ρ.

²We denote the calculation of write amplification with Trim requests in the workload as A^{(T)}.
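Equations 4.6 and 4.7 reduce to evaluating the baseline Xiang model at ρ_eff; the following sketch does exactly that. Function names are illustrative, and the q-to-s mapping is assumed as before.

# Minimal sketch of Equations 4.6 and 4.7: the Trim-modified Xiang model is the
# baseline model of Equation 4.5 evaluated at rho_eff = (1 + rho)/s - 1.
import numpy as np
from scipy.special import lambertw

def A_xiang(rho: float) -> float:
    x = -(1.0 + rho)
    return float(np.real(x / (x - lambertw(x * np.exp(x)))))

def A_xiang_trim(rho: float, q: float) -> float:
    s = (1.0 - 2.0 * q) / (1.0 - q)      # assumed steady-state In-Use fraction
    rho_eff = (1.0 + rho) / s - 1.0      # Equation 4.7
    return A_xiang(rho_eff)              # Equation 4.6

print(A_xiang_trim(0.25, 0.1))   # ~1.94, matching the Sf = 0.2, q = 0.1 row of Table 4.1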

4.3 Alternate Models for Write Amplification Under the Trim-Modified Uniform Random Workload

Hu, et al. [2] find that, compared to the utilization of the device, the exact parameters of the SSDs have negligible effect on the write amplification. Both Agarwal and Marrow [14] and Xiang and Kurkoski al. [15] note that their expres- sions for write amplification are functions of only the overprovisioning factor. Since three different derivation methods all determine a way to compute write amplifi- cation that is either mostly or completely dependent on the amount of overpro- visioning, it makes sense to adjust these formulas for the Trim-modified uniform random workload by replacing the overprovisioning factor with the effective over- provisioning factor, calculated in Equation 4.7.

To compute A^{(T)}_{Hu}, the Trim-modified write amplification for the Hu, et al. [1, 2] model, modify the calculation of Equation 4.3 by changing Equations 4.1 and 4.2 to

p_1 = e^{-1.9\left(\frac{T_p}{us} - 1\right)}    (4.8)

p_j = \min\!\left(1,\; \frac{1.1\, p_{j-1}}{\left(1 - \frac{1}{us}\right)^{n_p}}\right)    (4.9)

To compute A^{(T)}_{Agarwal} and A^{(T)}_{Xiang}, the Trim-modified write amplifications for the Agarwal and Marrow [14] and Xiang and Kurkoski [15] models, replace ρ in Equations 4.4 and 4.5 with ρ_eff (Equation 4.7). This results in the Trim-based write amplifications,

A^{(T)}_{Agarwal} = \frac{1}{2}\left(\frac{1 + \rho}{1 + \rho - s}\right)

and Equation 4.6 for A^{(T)}_{Xiang}, derived in Section 4.2.

4.4 Simulation Results and Discussion

Analytic results for write amplification under the Trim-modified random workload are verified by simulation. The custom simulation program was written 42


Figure 4.2: Write amplification versus probability of Trim under a Trim-modified uniform random workload and greedy garbage collection, when S_f = 0.2. The plots labeled Hu, Agarwal, and Xiang all show the Trim-modified models for write amplification. A^{(T)}_{Agarwal} gives such an optimistic prediction that for large q it foresees fewer actual writes than requested writes! A^{(T)}_{Hu} gives a more realistic, but still optimistic prediction for write amplification. A^{(T)}_{Xiang} gives the best, and generally quite accurate, prediction for write amplification.


Figure 4.3: Simulation and theoretical results for write amplification under vari- ous amounts of Trim and Sf = 0.1. The Trim-modified Xiang calculation is a very good match. The Trim-modified calculations for Hu and Agarwal are optimistic about the write amplification, with the Agarwal calculation being so optimistic as to give an impossible write amplification of less than 1 for high amounts of Trim! 44

Table 4.1: Analytic and simulation write amplification results for various values of probability of Trim, q, and two choices of manufacturer specified spare factor. In all cases, u/n_p = 1024, and n_p = 256. Five Monte-Carlo simulations were run for each simulation value in the table, with each simulation having identical results, measured to three significant digits, for write amplification. A plot for the case S_f = 0.2 (equivalently ρ = 0.25) is shown in Figure 4.2.

ρ              q     ρ_eff   S_eff   A^(T)_Hu   A^(T)_Agarwal   A^(T)_Xiang   A_Simulation
0.1            0.0   0.100   0.091   5.75       5.50            5.68          5.58
(S_f = 0.09)   0.1   0.238   0.192   2.74       2.60            2.79          2.77
               0.2   0.467   0.319   1.69       1.57            1.78          1.78
               0.3   0.926   0.481   1.20       1.04            1.29          1.29
               0.4   2.302   0.697   1.00       0.72            1.04          1.05
0.25           0.0   0.250   0.200   2.63       2.50            2.69          2.68
(S_f = 0.2)    0.1   0.406   0.289   1.85       1.73            1.94          1.93
               0.2   0.667   0.400   1.38       1.25            1.48          1.48
               0.3   1.188   0.543   1.11       0.92            1.19          1.19
               0.4   2.750   0.733   1.00       0.68            1.03          1.03

in MATLAB to implement greedy garbage collection, as shown in Figure 2.1.

Simulation results shown in this section are obtained using u/np = 1024, np = 256, various values of q, and several values of Sf (equivalently, several values of ρ). Each simulation started with an empty SSD, and consisted of two phases. The first phase was a workload dependent preconditioning phase [31], in which 5 million requests (in the proportion of Trim and Write of the desired workload) were given to the SSD. This allowed an average of at least 9 writes (in the worst case) to each physical page, bringing the device into a steady state before measuring the write amplification. In the second phase, another 5 million requests were issued to the SSD, with the write amplification measured by keeping track of the number of requested writes and the number of actual writes. Each simulation was repeated five times; identical results, measured to three significant digits, were found in simulations with the same parameters. Simulation results are plotted along with analytic results in Figures 4.2

and 4.3; more results are in Table 4.1. A^{(T)}_{Agarwal} gives such an optimistic prediction that for large q it foresees fewer actual writes than requested writes, giving a write amplification value of less than 1! A^{(T)}_{Hu} gives a more realistic, but still optimistic prediction for write amplification. A^{(T)}_{Xiang} gives the best, and generally quite accurate, prediction for write amplification. When garbage collection occurs, the benefit of a single Trim is twofold. 1) The first benefit is that one fewer page needs to be copied during garbage collection, saving one write. 2) The second benefit is that one more page is available for writing before garbage collection must occur again. This double benefit is what causes the considerable reduction in write amplification as the proportion of Trim requests to Write requests is increased. Each increase of 0.1 in the probability of Trim in the workload results in A decreasing approximately half the distance to the minimum A = 1. Examination of the results shows that a Trim-enabled SSD requires less physical overprovisioning than a non-Trim SSD in order to obtain the same write amplification. For example, to achieve a write amplification of 1.48 without Trim, a spare factor of 0.4 (ρ = 0.6667) is required. This same write amplification is reached with a spare factor of 0.2 (ρ = 0.25) and a Trim probability of 0.2. This means that with a workload of 20% Trims, the user can have an extra 33% capacity without any performance degradation, and without any extra material cost! Models such as the ones presented in this chapter that can reliably determine the point at which there is no tradeoff between cost savings and performance have the potential for great utility for both manufacturers and consumers as well as for future design considerations.
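The capacity trade-off in the example above can be checked directly; the sketch below (illustrative names, q-to-s mapping assumed) evaluates the Trim-modified Xiang model for the two configurations and the resulting capacity ratio.

# Minimal sketch of the trade-off quoted above: Sf = 0.4 with no Trim and
# Sf = 0.2 with 20% Trim give the same predicted write amplification, while the
# latter exposes (1 - 0.2)/(1 - 0.4) - 1 = 33% more user capacity for the same
# physical capacity. Uses the Trim-modified Xiang model (Equation 4.6).
import numpy as np
from scipy.special import lambertw

def A_xiang(rho: float) -> float:
    x = -(1.0 + rho)
    return float(np.real(x / (x - lambertw(x * np.exp(x)))))

def A_trim(Sf: float, q: float) -> float:
    s = (1.0 - 2.0 * q) / (1.0 - q)
    rho = Sf / (1.0 - Sf)
    return A_xiang((1.0 + rho) / s - 1.0)

print(A_trim(0.4, 0.0), A_trim(0.2, 0.2))   # both ~1.48
print((1 - 0.2) / (1 - 0.4) - 1)            # ~0.33 extra user capacity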

Chapter Acknowledgements Chapter 4, in part, is a reprint of the material as it appears in “The Ef- fect of Trim Requests on Write Amplification in Solid State Drives,” T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference. IASTED, 2012. The dissertation author was the primary investigator and author of this paper. Chapter 4, in part, has been submitted for publication of the material as it may appear in “Analysis of Trim Commands on Overprovisioning and Write 46

Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 5

Workload Variation: Hot and Cold Data

The Trim-modified uniform random workload analyzed above is unlikely to be found exactly in real-world applications. Most workloads will have some data that is accessed frequently (called hot data), and some data that is accessed infrequently (called cold data). Rosenblum and Ousterhout discussed these types of data in their paper [16], suggesting as a representative mix that hot data be requested 90% of the time while occupying only 10% of the user space, and that cold data be requested 10% of the time while occupying 90% of the user space. Hu, et al. [1, 2] showed through simulation that a standard uniform random workload consisting of hot and cold data under the log-structured writing and greedy garbage collection described above had high write amplification, but when the hot data and the cold data were kept in separate blocks, the write amplification improved. In their simulation, the cold data was read-only. We generalize the workload to allow cold data to be written and Trimmed infrequently compared to the hot data. In this chapter, we perform analysis of a Trim-modified uniform random workload with hot and cold data in which the hot and cold data are kept separate, and use simulation to examine the resulting write amplification when hot and cold data are mixed together in blocks. We assume that the temperature of the data is perfectly known; in reality, this information would need to be learned by the device; learning the temperature of the data is beyond the scope of this

chapter.

5.1 Mixed Hot and Cold Data

In the mixed hot and cold data scenario, no special treatment is given to the data based on its temperature; hot and cold data are mixed within blocks. When all LBAs are not accessed uniformly, however, as with hot and cold data, the theory already described for computing write amplification does not hold. We show this for the mixed case of hot and cold data by using the amount of effective overprovisioning to compute the write amplification and comparing it with the simulation values. To compute the amount of effective overprovisioning, first compute the expected number of In-Use LBAs for hot and cold data, then add these to get ueff. Then, the effective overprovisioning is ρeff = (Tp − ueff)/ueff and it can be plugged directly into the ρ of Equation 4.5 to compute the (incorrect) write amplification. As seen in Figure 5.1, the theoretical results are not too far off when cold data is up to 20% of the user space. For larger fractions of cold data in the user space, however, the theoretical results are too optimistic to be of value, and the actually observed write amplification grows large. This is undesirable, since cold data is likely to occupy a larger fraction of the user space than hot data. An analytic solution for the write amplification of mixed hot and cold data is not straightforward, and does not easily parallel the derivation of Xiang and Kurkoski [15]. It is known that mixing hot and cold data leads to higher write amplification than keeping the data separate [1, 2], so a mathematical analysis of the scenario is not likely to provide utility beyond mathematical interest. Instead, we provide simulation results in Fig. 5.1 for comparison with the separated data scenario. 49
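For concreteness, the following sketch reproduces this (deliberately naive) mixed-data prediction using the parameters of Figure 5.1; the hot and cold In-Use fractions come from the assumed mapping s(q) = (1 − 2q)/(1 − q), and all names are illustrative.

# Minimal sketch of the (incorrect, as discussed above) mixed-data prediction:
# compute u_eff from the expected In-Use LBAs of each temperature, then feed the
# resulting effective overprovisioning into the Xiang model of Equation 4.5.
# Parameters follow Figure 5.1 (Sf = 0.2, q_hot = 0.2, q_cold = 0.1).
import numpy as np
from scipy.special import lambertw

def A_xiang(rho: float) -> float:
    x = -(1.0 + rho)
    return float(np.real(x / (x - lambertw(x * np.exp(x)))))

def s_of(q: float) -> float:
    return (1.0 - 2.0 * q) / (1.0 - q)

def mixed_prediction(u: float, Sf: float, f_cold: float, q_hot: float, q_cold: float) -> float:
    Tp = u / (1.0 - Sf)
    u_eff = u * ((1.0 - f_cold) * s_of(q_hot) + f_cold * s_of(q_cold))
    return A_xiang((Tp - u_eff) / u_eff)

for f_cold in (0.1, 0.4, 0.8):
    print(f_cold, round(mixed_prediction(1.0, 0.2, f_cold, 0.2, 0.1), 2))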


Figure 5.1: Write amplification for mixed hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. Notice that the theoretical values, based only on the level of effective overprovisioning, are optimistic when cold data accounts for more than 20% of the user space. 50

5.2 Separated Hot and Cold Data

Based on the results from Hu, et al. [1, 2], we expect to see an improvement in the write amplification over the mixed data case. In the separated data scenario, a set of physical blocks are assigned to the cold data, and a separate set of physical blocks are assigned to the hot data. Each set of blocks is kept logically separated¹, and data placement and garbage collection for each temperature operates independently, in the same way described in Algorithm 1 of Section 2.2. Assume that h blocks are assigned to the hot data, and c blocks are assigned to the cold data, where h + c = T_p/n_p. Also assume that the f_h fraction of the data is hot, and that the f_c fraction of the data is cold (note that f_h + f_c = 1), so that hot data has uf_h LBAs and cold data has uf_c LBAs. We also allow the hot and cold data to have their own rates of Trim, q_h and q_c, respectively². Finally, the frequency at which requests arrive for hot and cold data is denoted as p_h and p_c, respectively, where p_h + p_c = 1³. Although it may take longer for cold data to reach steady state than it takes for hot data, we can compute the expected number of In-Use LBAs for hot and for cold data at steady state. We expect to see uf_h s_h hot In-Use LBAs, and uf_c s_c cold In-Use LBAs, with the number of valid pages the same as the number of In-Use LBAs for each temperature. From this, we can compute the level of effective overprovisioning for each temperature, and then the write amplification.

A^{(T,\,hot)}_{Xiang} = \frac{\frac{-(1 + \rho_{hot})}{s_h}}{\frac{-(1 + \rho_{hot})}{s_h} - W\!\left(\frac{-(1 + \rho_{hot})}{s_h}\, e^{\frac{-(1 + \rho_{hot})}{s_h}}\right)}    (5.1)

¹The complete logical segregation of hot and cold blocks is useful for mathematical analysis. However, in an actual device, wear leveling is an issue that needs to be addressed. For this reason, physical blocks will occasionally need to be swapped between the hot and cold queues to spread the wear evenly. If this is timed to coincide with near-simultaneous garbage collection of hot and cold blocks, the amount of write amplification contributed by this wear leveling is expected to be minimal.

²Allowing each temperature of data to have its own rate of Trim makes sense, as there is no reason to believe that hot and cold data will be Trimmed at the same rate.

³We expect requests for hot data to arrive more frequently than requests for cold data, hence the choice of names for each. However, the analysis of this section does not require this to be the case.

where ρ_hot = (h n_p − u f_h)/(u f_h). A similar equation can be computed for A^{(T,\,cold)}_{Xiang}. Finally, the two values are combined in a weighted fashion to obtain the overall write amplification. Although it may seem the obvious choice, the weighting to use is not the ratio of hot and cold requests to the total number of requests; it is necessary to compute a ratio of hot and cold writes to the total number of writes

\alpha_h = \frac{p_h(1 - q_h)}{p_h(1 - q_h) + p_c(1 - q_c)}    (5.2)

\alpha_c = \frac{p_c(1 - q_c)}{p_h(1 - q_h) + p_c(1 - q_c)}    (5.3)

A_{separated} = \alpha_h A^{(T,\,hot)}_{Xiang} + \alpha_c A^{(T,\,cold)}_{Xiang}    (5.4)

This analysis points to an optimization problem in how to allocate the overprovisioned physical blocks on the device after the minimum number of blocks needed for each temperature is assigned. For the parameters chosen in graphing the results (see Fig. 5.2), an equal allocation of the overprovisioned physical blocks is optimal. However, it would require further exploration to determine whether this is true for all choices of parameters or if it is dependent on the parameter values.
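A sketch of Equations 5.1 through 5.4 is given below. It works in pages rather than whole blocks, which is an assumption of the sketch; the Trim rates, spare factor, and request split mirror Figures 5.2 and 5.3, while the 10% hot-data fraction in the usage example is the Rosenblum-style mix and is likewise assumed. All names are illustrative.

# Minimal sketch of Equations 5.1-5.4 for separated hot and cold data, working in
# pages rather than whole blocks (a simplifying assumption of this sketch).
import numpy as np
from scipy.special import lambertw

def s_of(q: float) -> float:
    return (1.0 - 2.0 * q) / (1.0 - q)

def A_temp(rho: float, s: float) -> float:
    x = -(1.0 + rho) / s                      # Equation 5.1 for one temperature
    return float(np.real(x / (x - lambertw(x * np.exp(x)))))

def A_separated(u, Sf, f_hot, p_hot, q_hot, q_cold, cold_extra_share):
    f_cold, p_cold = 1.0 - f_hot, 1.0 - p_hot
    extra = u / (1.0 - Sf) - u                # physical pages beyond the user space
    rho_hot = (1.0 - cold_extra_share) * extra / (u * f_hot)
    rho_cold = cold_extra_share * extra / (u * f_cold)
    w_hot = p_hot * (1.0 - q_hot)             # Equations 5.2 and 5.3, unnormalized
    w_cold = p_cold * (1.0 - q_cold)
    A_hot = A_temp(rho_hot, s_of(q_hot))
    A_cold = A_temp(rho_cold, s_of(q_cold))
    return (w_hot * A_hot + w_cold * A_cold) / (w_hot + w_cold)   # Equation 5.4

# Example: 10% of the data hot, extra pages split evenly between temperatures.
print(A_separated(u=100000, Sf=0.2, f_hot=0.1, p_hot=0.9,
                  q_hot=0.2, q_cold=0.1, cold_extra_share=0.5))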

5.3 Simulation Results

An examination of Fig. 5.3 shows that physically separating hot and cold data reduces the write amplification over allowing the data to be mixed within blocks, unless cold data occupies only a very small fraction of the user space. It is generally expected that infrequently accessed (cold) data, however, will be a large fraction of the data stored, so accurate determination of hot versus cold data logical blocks in order to keep them on physically separate blocks is very important. The theoretical values for write amplification of separated hot and cold data are very close to simulation values. 52


Figure 5.2: Write amplification for separated hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. After allocating the minimum number of blocks needed to store all of the user-allowed data for each temperature of data, the remaining blocks were split with the fraction shown on the x-axis. In this case, the optimal split was 50-50; further investigation of the optimization problem is needed to determine whether this is always the case. 53


Figure 5.3: Write amplification for mixed hot and cold data compared to sepa- rated hot and cold data. Physical spare factor is 0.2. Trim requests for cold data occur with probability 0.1; Trim requests for hot data occur with probability 0.2. Cold data is requested 10% of the time, with hot data requested the remaining 90% of the time. For the separated hot and cold data, extra blocks were allocated evenly between the two data types. The theoretical values for separated data are a good prediction of the simulation values. Note that the separated hot and cold data has superior write amplification to the mixed hot and cold data when cold data is more than 20% of the workload. 54

5.4 N Temperatures of Data

The idea of hot and cold data can be extended to N temperatures of data. This can be used to model a more complicated workload consisting of N temperatures of data, each with its own profile for probability of Trim. Additionally, the idea of N temperatures of data can also be thought of as many users who access data at different rates and with different amounts of Trim in their workloads, which could be helpful in analyzing cloud storage workloads. If each temperature of data is kept separate, the analysis of separated hot and cold data is readily extended to the case of N temperatures. Assume there are N temperatures, with each temperature j ∈ [1, N] having access to u_j LBAs. Then, the total number of user LBAs is u = u_1 + u_2 + ··· + u_N = \sum_{j=1}^{N} u_j. Also assume that each temperature has its own probability of Trim q_j

(and associated s_j) that is constant over time. Then at steady state, the number of In-Use LBAs from temperature j is X_j with a mean of u_j s_j and a variance of u_j s̄_j. Define the random variable Y to be the total number of In-Use LBAs. Then Y = \sum_{j=1}^{N} X_j. Because the X_j are independent Gaussian random variables, their sum, Y, is also Gaussian, with mean \sum_{j=1}^{N} u_j s_j and variance \sum_{j=1}^{N} u_j s̄_j. This information can be used to compute the overall effective overprovisioning. As demonstrated in the hot and cold data section, however, the overall effective overprovisioning is not useful in computing the write amplification of a non-uniform workload. The write amplification for N temperatures of data can be computed if the data is kept separated, as described in Section 5.2⁴. First, assume each temperature j has k_j space available for writing (so that \sum_{j=1}^{N} k_j = T_p) and frequency of writing p_j (so that \sum_{j=1}^{N} p_j = 1). Then compute the value of ρ_j = (k_j − u_j)/u_j for each temperature j, and use this to compute the write amplification for each temperature

⁴Wear leveling would still need to be taken into consideration in a real device.

A^{(T,\,j)}_{Xiang} = \frac{\frac{-(1 + \rho_j)}{s_j}}{\frac{-(1 + \rho_j)}{s_j} - W\!\left(\frac{-(1 + \rho_j)}{s_j}\, e^{\frac{-(1 + \rho_j)}{s_j}}\right)}

At first glance, it may appear that the overall write amplification is just a weighted sum of the individual write amplifications, where the weight is p_j. If there are no Trim requests, this would be true. However, the Trim requests in the workload need to be accounted for. To properly weight the individual temperature write amplifications, we need a measure of the ratio of writes requested by each user. Compute the weights

\alpha_j = \frac{p_j(1 - q_j)}{\sum_{j=1}^{N} p_j(1 - q_j)}

Then, the overall write amplification is

A_{separated,\,N} = \sum_{j=1}^{N} \alpha_j A^{(T,\,j)}_{Xiang}

The computation of write amplification in workloads with varying frequency of access is important in modeling real-world workloads, many of which have data that is written once and never altered. The models discussed in this chapter maintain the ability to determine the effect on performance in the storage device while allowing flexibility in the user workload.

Chapter Acknowledgement

Chapter 5, in part, has been submitted for publication of the material as it may appear in "Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 6

Workload Variation: Variable Sized Objects

In the previously described models, one page stored the data associated with one LBA. In this section, we augment the model to allow larger amounts of data to be stored under a single identifier, which lends itself naturally to object-based storage, as described in Section 2.7. Object based storage offers multiple advantages for systems using solid state storage devices [32]. When the multiple thousand pages in a flash block are divided into object records of larger size, the resultant fewer objects per block directly translates into fewer data write relocations during garbage collection, leading to a lower write amplification and faster performance. This offers an opportunity for user application storage metadata, via metadata servers, to actively allow flash storage devices to self-optimize their hardware, to fully utilize SSD write speed advantages over magnetic HDD while ameliorating SSD limitations like block erase and finite write lifetime. Object storage metadata servers could allocate SSDs to object classes [33] of distinct object size and thereby allow SSD to internally manage their object physical mapping and relocation. This has the potential to improve I/O rate and QoS, user access blocking probability, and SSD flash chip lifetime through the reduction of write amplification.

56 57

6.1 From LBAs to Objects

The utilization problem previously solved in Chapter 3 can be viewed in object space instead of LBA space. Instead of LBAs that are In-Use or not In-Use, there are object IDs that are In-Use or not In-Use. Previously, there were u LBAs allowed; now we allow object IDs in the range [1, u], for a maximum of u objects. The object-view of the Trim-modified uniform random workload is the same as the LBA view of the problem when each object occupies 1 page. To remove the restriction that one object occupies only one page, we start by examining the workload in which each object occupies b pages, where b is constant and identical for all objects in the workload. Then, we examine the workload in which b varies each time an object is written. For all workloads, we assume that when an object ID is selected for Trimming (or, more appropriately for objects, deletion), all pages associated with that object ID are invalidated. Likewise, when an object is rewritten, all pages previously containing data associated with that object ID are invalidated. Note that there are now two quantities of interest: the number of object IDs that are In-Use, and the number of pages that are valid. When each object occupies one page, these two quantities are the same.

Define the following variables:

χ_t = (χ_{1,t}, χ_{2,t}, ..., χ_{u,t})^T is a vector of indicator random variables at time t,

\chi_{i,t} = 1_{A_i}(\omega) = \begin{cases} 1 & \omega \in A_i \text{ (object } i \text{ is In-Use)} \\ 0 & \text{otherwise (object } i \text{ is not In-Use)} \end{cases}

where

A_i = \{\omega \mid \text{the STATE}(\omega) \text{ is such that location } i \text{ is occupied}\}

X_t = \sum_{i=1}^{u} \chi_{i,t} is a random variable representing the number of objects In-Use at time t. The mean and variance of this random variable at steady state were computed in Chapter 3.

Z_t = (Z_{1,t}, Z_{2,t}, ..., Z_{u,t})^T is a vector of the size of each object at time t.

Now, compute the number of valid pages at time t: Y_t = Z_t^T χ_t = \sum_{i=1}^{u} \chi_{i,t} Z_{i,t}

We are concerned with the values at steady state, and drop the time index to indicate the steady state solution.

6.2 Fixed Sized Objects

This workload is an extension of the original workload whose steady state PDF was derived in Chapter 3. However, objects now occupy b pages instead of 1 page. Recall that the number of In-Use objects is the same as the number of In-Use LBAs previously derived; only the number of valid pages changes. We assume that at least ub pages are allowed for the user to write to, so that it is not possible to request more space than is available.

For this workload, Z_t = (b, b, ..., b)^T for all t. This means

Y_{t,\,fixed} = \sum_{i=1}^{u} \chi_{i,t} Z_{i,t} = \sum_{i=1}^{u} \chi_{i,t}\, b = b \sum_{i=1}^{u} \chi_{i,t} = b X_t

To compute the steady state mean and variance of Y_fixed, we use the known steady state mean and variance of X:

E[Y_{fixed}] = E[bX] = b\,E[X] = bus

Similarly,

\mathrm{Var}[Y_{fixed}] = \mathrm{Var}[bX] = b^2\,\mathrm{Var}[X] = b^2 u\bar{s}
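Numerically, these fixed-size formulas reproduce the analytic columns of Table 6.1; the sketch below assumes the q-to-s mapping used earlier, and the function name is illustrative.

# Minimal sketch: valid-page statistics for fixed-size objects, using
# E[X] = u*s and Var(X) = u*s_bar from Chapter 3.
import math

def fixed_object_pages(u: int, q: float, b: int):
    s = (1.0 - 2.0 * q) / (1.0 - q)
    s_bar = 1.0 - s
    return b * u * s, b * math.sqrt(u * s_bar)   # mean and sigma of valid pages

print(fixed_object_pages(1000, 0.1, 32))   # ~(28444.4, 337.3), as in Table 6.1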

6.3 Variable Sized Objects

In this workload, the size of each object varies independently over time. Each time an object ID is selected to be written, the size of the object is chosen randomly, according to the distribution from which Z_{i,t} is drawn, regardless of any size that object ID may have had in the past. Z_{i,t} is drawn from a distribution with a minimum value B_1 ≥ 1, a maximum value B_2 ≥ B_1, a mean value m_Z, and a standard deviation σ_Z. Again, we assume there is sufficient space for writing all requests; this means there are at least uB_2 pages allowed for the user. Because Z_{i,t} is not constant, Y_{t,\,variable} = \sum_{i=1}^{u} \chi_{i,t} Z_{i,t} cannot be simplified, as it was in the fixed-sized object case.

The mean is still relatively straightforward to calculate, since χi and Zi are independent, and the Zis are iid:

E[Y_{variable}] = E\!\left[\sum_{i=1}^{u} \chi_i Z_i\right]
= \sum_{i=1}^{u} E[\chi_i Z_i]
= \sum_{i=1}^{u} E[\chi_i]\,E[Z_i]
= m_Z \sum_{i=1}^{u} E[\chi_i]
= m_Z\, E\!\left[\sum_{i=1}^{u} \chi_i\right]
= m_Z\, E[X]
= m_Z us    (6.1)

The variance is more complicated because of cross-terms. The χi values are not independent, so it is not simple to compute the variance1.

1 If the χi values were independent, X would have a binomial distribution, but because we know from previous analysis that the variance is us¯ and not uss¯, this is obviously not the case. 60

To compute the variance of Yvariable, compute

\mathrm{Var}(Y_{variable}) = E[Y_{variable}^2] - E[Y_{variable}]^2

The value of E[Y_{variable}] was computed in Equation 6.1. Computing E[Y_{variable}^2] is more complicated. When following the steps below, note that \chi_i^2 = \chi_i, that \chi_i is independent of Z_i and Z_j, and that the Z_i values are iid.

E[Y_{variable}^2] = E\!\left[\left(\sum_{i=1}^{u} \chi_i Z_i\right)^{\!2}\right]
= E\!\left[\sum_{i=1}^{u} \chi_i Z_i^2 + 2\sum_{i<j} \chi_i \chi_j Z_i Z_j\right]

Most of these expected values are straightforward to calculate. However, care must be taken when computing E[χiχj], because the χi values are NOT independent.

E[χiχj] = P (Ai ∩ Aj)

= P (Ai|Aj)P (Aj)

Obviously, when i = j, P(A_i|A_j) = 1. Calculating this value, however, when i ≠ j is important. Since there is nothing special about any particular i or j, assume that P(A_i|A_j) is the same for all i ≠ j, and thus is a constant with respect to i and j.

It is tempting to claim that P (Ai|Aj) = (us − 1)/(u − 1), since us is the expected number of In-Use objects at steady state. This is an invalid statement, however, because it implicitly assumes a zero variance. In fact, using this value in the calculation of V ar(Y ) results in a statement that V ar(Y ) = 0, which is exactly what is expected because of the implicit zero variance assumption. 61

Instead of trying to compute P(A_i|A_j) directly, it is possible to compute E[χ_iχ_j] by using the result of the one-object-occupies-one-page model. Working backwards to calculate this probability, and noting that there are (u − 1)u/2 terms in the sum over i < j:

\mathrm{Var}(Y_{single}) = \mathrm{Var}\!\left(\sum_{i=1}^{u} \chi_i\right) = \sum_{i=1}^{u} \mathrm{Var}(\chi_i) + 2\sum_{i<j} \mathrm{Cov}(\chi_i, \chi_j)

With Var(Y_single) = us̄, Var(χ_i) = ss̄, and Cov(χ_i, χ_j) = C constant for i ≠ j, this becomes us̄ = u ss̄ + (u − 1)u C.

Solving for C yields \bar{s}^2/(u - 1), which is the value of the covariance². Continuing with the calculations,

\mathrm{Cov}(\chi_i, \chi_j) = C = E[\chi_i\chi_j] - P(A_i)P(A_j) = E[\chi_i\chi_j] - s^2

so that

E[\chi_i\chi_j] = C + s^2 = \frac{\bar{s}^2}{u - 1} + s^2 = \frac{s\left(us - 1 + \frac{\bar{s}}{s}\right)}{u - 1}    (6.2)

Continuing with the computation of E[Y_{variable}^2],

E[Y_{variable}^2] = (\sigma_Z^2 + m_Z^2)us + 2 m_Z^2 \sum_{i<j} \frac{s\left(us - 1 + \frac{\bar{s}}{s}\right)}{u - 1}
= (\sigma_Z^2 + m_Z^2)us + 2 m_Z^2\,\frac{(u - 1)u}{2}\,\frac{s\left(us - 1 + \frac{\bar{s}}{s}\right)}{u - 1}
= \sigma_Z^2 us + m_Z^2 us + m_Z^2 us\left(us - 1 + \frac{\bar{s}}{s}\right)
= \sigma_Z^2 us + (m_Z us)^2 + m_Z^2 u\bar{s}

Now it is possible to compute

\mathrm{Var}(Y_{variable}) = E[Y_{variable}^2] - E[Y_{variable}]^2
= \sigma_Z^2 us + (m_Z us)^2 + m_Z^2 u\bar{s} - (m_Z us)^2
= \sigma_Z^2 us + m_Z^2 u\bar{s}

²Interestingly, as u gets large, the covariance goes to 0, indicating that the χ_i values are effectively independent for large u. However, the rate at which the covariance goes to 0 is counteracted by the rate at which the number of terms in the sum over i < j increases, resulting in a variance that is NOT binomially distributed.
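The mean from Equation 6.1 and the variance just derived are easy to evaluate; the sketch below does so for the discrete uniform [1, 32] object-size distribution of Table 6.2. The moments of that distribution and the q-to-s mapping are stated assumptions, and names are illustrative.

# Minimal sketch: E[Y] = m_Z*u*s (Equation 6.1) and
# Var(Y) = sigma_Z^2*u*s + m_Z^2*u*s_bar for variable-size objects.
import math

def variable_object_pages(u, q, m_z, var_z):
    s = (1.0 - 2.0 * q) / (1.0 - q)
    s_bar = 1.0 - s
    mean = m_z * u * s
    var = var_z * u * s + m_z ** 2 * u * s_bar
    return mean, math.sqrt(var)

# Object sizes ~ discrete uniform on [1, 32]: mean 16.5, variance (32^2 - 1)/12.
print(variable_object_pages(1000, 0.1, 16.5, (32 ** 2 - 1) / 12.0))
# ~(14666.7, 325.6), matching the q = 0.1 row of Table 6.2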

6.4 Simulation Results

Monte-Carlo simulation was used to check the analytic results, with various values of u, q, and distributions of Z. In this section, results are shown for u = 1000. Simulations were run for 1 million requests to ensure steady state was reached, then the simulation means and variances were averaged over the next 9 million requests. The results in each table are the average of 64 Monte-Carlo simulations. Although results are only shown for u = 1000, results for larger values of u are comparable. Table 6.1 shows the analytic and simulation values of the mean and standard deviation of the number of In-Use objects as well as the mean and standard deviation of the number of valid pages at steady state when the objects are all a fixed size of 32 pages. When rounded to the nearest integer, the mean number of objects is predicted exactly by analysis; the mean number of valid pages is predicted to within 3 pages. The slight error in the prediction of the number of valid pages is likely due to the extremely large variances seen when q ≥ 0.3. An overlay of the simulation PDF and the analytic PDF for the number of valid pages is shown in Fig. 6.1 for q = 0.2. Table 6.2 shows the analytic and simulation values of the mean and standard deviation of the number of In-Use objects as well as the mean and standard deviation of the number of valid pages at steady state when the size of the objects

Table 6.1: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Fixed Sized Objects. Values in this table were computed with u = 1000. The size of the objects was fixed at 32. Simulation values are the average of 64 Monte-Carlo simulations.

                Objects                                Pages
Trim      Mean                σ                 Mean                     σ
Amount    Analytic  Sim       Analytic  Sim     Analytic   Sim           Analytic  Sim
0.05      947.37    947.37    7.25      7.24    30315.79   30315.80      232.15    231.80
0.1       888.89    888.89    10.54     10.54   28444.44   28444.37      337.31    337.37
0.2       750.00    750.01    15.81     15.82   24000.00   24000.37      505.96    506.18
0.3       571.43    571.47    20.70     20.73   18285.71   18287.10      662.46    663.29
0.4       333.33    333.43    25.82     25.83   10666.67   10669.70      826.24    826.46
0.45      181.82    181.89    28.60     28.62   5818.18    5820.40       915.32    915.93


Figure 6.1: Comparison of analytic and simulation PDF for the number of valid pages when objects are a fixed size of 32 pages. For this example, u = 1000 and q = 0.2. The mean is shifted, and the variance is larger than if objects are a fixed size of 1 page.

is drawn from a discrete uniform distribution over [1, 32]. Analytic and simulation values match exactly when rounded to the nearest integer. An overlay of the simulation PDF and the analytic PDF for the number of valid pages is shown in Fig. 6.2 for q = 0.2. Table 6.3 shows the analytic and simulation values of the mean and standard deviation of the number of In-Use objects as well as the mean and standard deviation of the number of valid pages at steady state when the size of the objects is drawn from a binomial distribution with parameters p = 0.4 and n = 32. Analytic and simulation values match exactly when rounded to the nearest integer. An overlay of the simulation PDF and the analytic PDF for the number of valid pages is shown in Fig. 6.3 for q = 0.2.

Table 6.2: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Variable Sized Objects. Values in this table were computed with u = 1000. The size of the objects was drawn from a discrete uniform distribution over [1, 32]. Simulation values are the average of 64 Monte-Carlo simulations.

                Objects                                Pages
Trim      Mean                σ                 Mean                     σ
Amount    Analytic  Sim       Analytic  Sim     Analytic   Sim           Analytic  Sim
0.05      947.37    947.36    7.25      7.27    15631.58   15631.33      308.37    307.92
0.1       888.89    888.92    10.54     10.53   14666.67   14667.03      325.62    325.63
0.2       750.00    750.00    15.81     15.80   12375.00   12374.67      363.32    363.42
0.3       571.43    571.41    20.70     20.70   9428.57    9429.00       406.69    406.49
0.4       333.33    333.35    25.82     25.79   5500.00    5500.13       458.17    457.71
0.45      181.82    181.81    28.60     28.58   3000.00    2999.85       488.11    487.65

Table 6.3: Mean and σ of Number of In-Use Objects and Number of Valid Pages for Variable Sized Objects. Values in this table were computed with u = 1000. The size of the objects was drawn from a binomial distribution with parameters p = 0.4 and n = 32. Simulation values are the average of 64 Monte-Carlo simulations.

                Objects                                Pages
Trim      Mean                σ                 Mean                     σ
Amount    Analytic  Sim       Analytic  Sim     Analytic   Sim           Analytic  Sim
0.05      947.37    947.36    7.25      7.27    12126.32   12126.39      126.09    126.06
0.1       888.89    888.88    10.54     10.53   11377.78   11377.73      158.21    158.10
0.2       750.00    750.01    15.81     15.80   9600.00    9600.25       216.15    216.00
0.3       571.43    571.37    20.70     20.68   7314.29    7313.46       273.14    272.77
0.4       333.33    333.32    25.82     25.79   4266.67    4266.57       334.35    334.03
0.45      181.82    181.84    28.60     28.63   2327.27    2327.60       368.03    368.36


Figure 6.2: Comparison of analytic and simulation PDF for the number of valid pages when the size of objects is drawn from a discrete uniform distribution over [1, 32]. For this example, u = 1000 and q = 0.2.

6.5 Application to Write Amplification

In this section, we examine write amplification when one object requires a fixed number of pages to store data. We assume that the number of pages required to store the data associated with the object evenly divides the number of pages per block (i.e. mod(np, b) = 0), so that an entire object is stored within a single block. As in previous work, data is stored in the fashion of a log-structured file system with greedy garbage collection. We argue that fixed sized objects with sizes that evenly divide the number 68


Figure 6.3: Comparison of analytic and simulation PDF for the number of valid pages when the size of objects is drawn from a binomial distribution with parameters p = 0.4 and n = 32. For this example, u = 1000 and q = 0.2.

of pages per block is equivalent to reducing the number of pages in a block. For instance, if two objects can fit in one block, this is equivalent to having a block with two (large size) pages, and objects that require one of these (large) pages to store the associated data. With this equivalence between reduced number of pages per block and increased object size, the formulas for write amplification with Trim that we previously developed should apply. However, the Trim-modified Agarwal and the Trim-modified Xiang models do not take n_p into account! The derivation of the Trim-modified Agarwal model causes n_p to drop out naturally; it is not a good candidate for modification. The Trim-modified Xiang model is a better candidate, as it asymptotically removes n_p in the derivation. By eliminating this

asymptotic approximation, the write amplification is computed as np/y, where

y = n_p + \frac{-W\!\left(T_p(1 + \bar{s})\left(1 - \frac{1}{s(1 + \bar{s})u}\right)^{T_p(1 + \bar{s})} \ln\!\left(1 - \frac{1}{s(1 + \bar{s})u}\right)\right)}{\frac{T_p}{n_p}(1 + \bar{s})\,\ln\!\left(1 - \frac{1}{s(1 + \bar{s})u}\right)}    (6.3)

and W(·) is the Lambert W function [30, 15]. The Trim-modified Hu model does not need to be changed in order to take n_p into account, as it already does this. The model can be found in Figure 4.1 with modifications given in Equations 4.8 and 4.9. Results comparing the simulated write amplification and the computed write amplification are in Table 6.4 and Figure 6.4. In this simulation, there were 1280 blocks, with a manufacturer specified spare factor of 0.2, and a Trim factor of 0.1. As the number of pages per block changed, the effective spare factor was held constant at S_eff = 0.29 to enable comparisons. As expected, the write amplification was reduced as the number of pages per block decreased, with the minimum possible write amplification of 1 achieved with one page in a block. The modification of the Xiang model shown in Equation 6.3 was a very poor approximation as the number of pages in a block decreased; this is likely due to the assumption that the number of invalid pages freed in a block by garbage collection is binomial; with a small number of pages per block, this approximation becomes too coarse. However, the Trim-modified Hu model provides a reasonable, if optimistic, prediction of write amplification. Note that the Trim-modified Hu model does not make sense to compute if there is one page per block, so the entry in the table is left blank.
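Equation 6.3 can be evaluated directly with a Lambert W routine; the sketch below (illustrative names, q-to-s mapping assumed) reproduces the large-n_p behaviour reported in Table 6.4.

# Minimal sketch of Equation 6.3: the Xiang-style model with the asymptotic
# removal of n_p undone, so pages per block (equivalently object size) enters
# the prediction. Uses scipy's Lambert W.
import numpy as np
from scipy.special import lambertw

def A_modified_xiang(u, Tp, n_p, q):
    s = (1.0 - 2.0 * q) / (1.0 - q)
    s_bar = 1.0 - s
    a = 1.0 / (s * (1.0 + s_bar) * u)          # per-request invalidation probability
    ln1a = np.log1p(-a)                        # ln(1 - a)
    arg = Tp * (1.0 + s_bar) * (1.0 - a) ** (Tp * (1.0 + s_bar)) * ln1a
    y = n_p + float(np.real(-lambertw(arg))) / ((Tp / n_p) * (1.0 + s_bar) * ln1a)
    return n_p / y

# 1280 blocks, Sf = 0.2, q = 0.1, 256 pages per block (last row of Table 6.4):
n_p = 256
Tp = 1280 * n_p
print(A_modified_xiang(u=int(Tp * 0.8), Tp=Tp, n_p=n_p, q=0.1))   # ~1.94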

When objects do not fit neatly within blocks, as is the case for mod(np, b) ≠ 0, for variable sized objects, or for objects larger than one block, the computation of write amplification is not as straightforward. First, design decisions must be made: should an object be split over multiple blocks, or should empty, unwritten pages be left in a block, perhaps to be filled in later by a smaller object? The design decision will affect the analysis of the write amplification; this is an area for future research.

The models presented in this chapter provide utility in determining the performance of workloads with varying sizes of data to be written, not only for the standard file systems common today, but also for forward-looking object-based storage systems.

Table 6.4: Write amplification for varying number of pages per block, equivalent to fixed sized objects that evenly divide np. The level of effective overprovisioning was kept constant at Seff = 0.29 for each simulation. The amount of Trim was q = 0.1. All models referenced are Trim-modified; the Xiang model was additionally modified to account for np, as shown in Equation 6.3.

  Pages/Block, np   Simulation   Modified Xiang Model   Hu Model
        1              1.000            1.936               -
        2              1.191            1.937             1.000
        4              1.432            1.938             1.065
        8              1.631            1.938             1.274
       16              1.771            1.938             1.628
       32              1.853            1.938             1.732
       64              1.896            1.938             1.793
      128              1.918            1.938             1.828
      256              1.929            1.938             1.847


Figure 6.4: Write amplification versus pages per block for constant effective overprovisioning of Seff = 0.29. The modified Xiang model (Equation 6.3) fails when the number of pages per block is low. The Trim-modified Hu model provides an optimistic, but reasonable, prediction.

Understanding the effect of data size on performance is important for optimizing the use of currently available SSDs as well as for making design decisions for future generations of SSDs.

Chapter Acknowledgement

Chapter 6, in part, has been submitted for publication of the material as it may appear in "Solid State Disk Object-Based Storage with Trim Commands," T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 7

Conclusion

In this dissertation, we have developed an analytic model for the Trim command under a Trim-modified uniform random workload, along with variations to that workload including varying the size of the data records and the incorporation of hot and cold data. These changes to the workload allow the flexibility needed to model realistic workloads that are neither strictly sequential nor completely random.

We have shown that a small amount of LBA Trim requests results in a large gain in effective overprovisioning, which can be used to reduce the amount of physical overprovisioning to save costs. The steady state number of In-Use LBAs in the Trim-modified uniform random workload was shown to be well approximated as a Gaussian random variable, but the analysis allows for the inclusion of higher-order terms in order to compute higher order non-Gaussian corrections to moments such as skewness and kurtosis. These modifications provide a foundation for handling more complicated workloads, and the required analysis is surprisingly non-trivial.

Additionally, we have modified several methods for computing predictions of write amplification under standard uniform random workloads in log-structured flash-based Trim-enabled SSDs utilizing a greedy garbage collection policy. These modifications permit the computation of write amplification under a Trim-modified uniform random workload. We have compared analytic results to simulation results and shown that the Trim-modified Xiang model, A^(T)_Xiang, produces the best approximations when the data written occupies one page. Furthermore, we have shown that a modest amount of LBA Trim requests can significantly and predictably reduce SSD write amplification for a uniform random workload model with Trim requests. Compared to a non-Trim SSD, the amount of physical overprovisioning necessary can be reduced in a Trim-enabled SSD without affecting the write amplification; the size of the reduction is dependent on the proportion of Trim requests in the workload.

The Trim-modified uniform random workload was further extended to include various temperatures of data, allowing for closer approximation of real-world workloads. The write amplification under this workload with up to N data temperatures was analytically computed under the condition that each temperature is stored in a physically segregated fashion. For the case of N = 2, hot and cold data, the importance of separating different data temperatures in order to reduce write amplification was demonstrated. Additionally, an optimization problem of how to allocate overprovisioned blocks to different data temperatures was recognized, with the simulation solution for one set of parameters shown. Further work is needed to determine the optimal solution for all parameters.

The Trim-modified uniform random workload model was extended to encompass the idea of object-based storage and allow objects to utilize more than one page for data storage. A utilization model was derived and shown to be accurate via simulation. Additionally, models for write amplification were modified and compared to simulated values, demonstrating the failure point of one model and the need for a more accurate model in another, with A^(T)_Hu producing the best approximations when the data written occupies more than one page, especially as the size of the data approaches the amount of data that can be held by one block.

We initially used the standard assumption that one LBA requires one physical page in order to store the data [1, 2, 14, 15], which is equivalent to saying that all data records stored are the same size. In a real system, however, data records are likely to be of varying sizes, perhaps requiring even more than one physical block in order to be stored, which is why we incorporated the idea of object-based storage into the model. We also initially assumed that data was accessed uniformly randomly; the extension to hot and cold data, and N temperatures of data, relaxed this assumption to more closely match real systems in which some data is accessed frequently and other data is essentially read-only. Although extensions have been made to the model for objects and for hot and cold data, these extensions were made separately and have not yet been combined. Additionally, we assumed a constant rate of Trim that was not allowed to vary over time; relaxing this assumption would allow for a more realistic model of various workloads. It would be pragmatic to compare real workloads to the developed model in order to further identify where more flexibility is necessitated, but real-world SSD I/O traces are not yet easily available [7], making this an area for future research.

Models such as the ones presented in this dissertation are useful tools for understanding the problems that an FTL design needs to address, and can lead to improved future designs which show increased write performance and lifetime. These models provide an efficient means to optimize physical overprovisioning, which saves money on materials. This also has implications in the application and utilization of SSDs. For example, in a data center, knowledge of the amount of Trim requests in various user workloads can help operators to appropriately allocate resources to reduce storage latency due to write amplification. In a mobile device, a model of the influence of Trim on the performance of the storage can be exploited to maximize the user experience by improving the response time of popular applications.

Additionally, this model demonstrates the advantage of using object-based storage to improve processing speed by reducing write amplification because related data is stored on the same block. It also makes evident the advantages of using Trim properly, which helps to move the SSD interface forward. Until it is economically feasible in the marketplace to move to a new interface, SSDs will continue to be constrained by the legacy HDD interface with the slight modifications that are currently used, like Trim. Understanding how these modifications affect performance will lead to better designs for future interfaces.

We expect future research to focus on analyzing the benefits of the Trim command under different data placement and garbage collection policies. Additionally, we expect that commands similar to Trim will be proposed for RAID drives and other NAND flash interfaces, and anticipate that our models will be helpful in determining the potential impact such commands would have on write performance and lifetime. We foresee new file system proposals based on an improved understanding of the Trim command due to our model.
By using this model in conjunction with object-based storage, we envision metadata servers that separate object classes of distinct sizes to unique flash SSDs in order to assist in self-optimization of hardware, which would improve overall data access performance and improve the lifetime of the SSD.

Other researchers are already proposing radical new designs for flash devices. For example, Wang et al. explore the idea of a NAND flash device that has pages and blocks of varying sizes, dubbed heterogeneous flash [7]. Our analysis and simulation of the write amplification for workloads of varying sizes supports the utility of heterogeneous flash, as we have demonstrated the degree to which fewer pages per block reduces write amplification. Due to the flexibility of the model presented in this dissertation, we expect that manufacturers will use our model to optimize the design of NAND flash at the chip level as they create custom NAND flash SSDs with the specific sizes of pages and blocks optimized for distinct workloads, as demanded by customers looking to improve performance.

NAND flash SSDs are already playing an important role in storage, offering improvements in power consumption, small form factor, and speed. A better understanding of how to improve speed by reducing write amplification will make SSDs even more useful in both consumer and enterprise applications. The model presented in this dissertation provides a comprehensive analysis of the relationship between the use of the Trim command and the performance of an SSD, and paves the way for the conception of new ideas and algorithms that will lead to improved future SSD architectures and SSD interfaces.

Chapter Acknowledgements

Chapter 7, in part, is a reprint of the material as it appears in "A Mathematical Model of the Trim Command in NAND-Flash SSDs," T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the 50th Annual Southeast Regional Conference, ACM, 2012. The dissertation author was the primary investigator and author of this paper. "The Effect of Trim Requests on Write Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in Proceedings of the IASTED-CIIT Conference, IASTED, 2012. The dissertation author was the primary investigator and author of this paper.

Chapter 7, in part, has been submitted for publication of the material as it may appear in "Analysis of Trim Commands on Overprovisioning and Write Amplification in Solid State Drives," T. Frankie, G. Hughes, K. Kreutz-Delgado, in ACM Transactions on Storage (TOS), 2012. The dissertation author was the primary investigator and author of this paper. "Solid State Disk Object-Based Storage with Trim Commands," T. Frankie, G. Hughes, K. Kreutz-Delgado, in IEEE Transactions on Computers, 2012. The dissertation author was the primary investigator and author of this paper.

Appendix A

Asymptotic Expansion Analysis

In this Appendix, we include a more detailed higher-order asymptotic ex- pansion analysis that was omitted from the main text in order to present an un- obstructed report of our research. We start by reviewing some mathematical the- orems and facts that are useful in the subsequent derivations, then complete the full asymptotic expansion analysis. We use the series expansion of Chapter 3 to provide a family of truncated approximations to the Stirling PDF in the region of convergence. The third order truncation is used to approximate the mean, variance, skewness, and kurtosis. We then rigorously prove that in the region of convergence of this series, the Stirling PDF converges to a Gaussian distribution in the limit as u → ∞, and heuristically argue that as a consequence of the con- vergence to a known Gaussian PDF, all of the approximation PDFs obtained via a truncation of the Taylor series of the Stirling PDF should become highly accurate in the limit as u → ∞.

A.1 Useful Theorems and Facts

Here we state some well-known and useful theorems and facts that were used in the derivations above. Recall that every Riemann integrable function is Lebesgue integrable.


Theorem A.1.1. A necessary and sufficient condition for a function f(x) to be Riemann integrable on a finite interval [α, β] ⊂ R is that the function be bounded and piecewise continuous on that interval.

The Riemann integral is not defined on the infinite interval (−∞, ∞) = R, in which case one defines the improper Riemann integral as the limit

\int_{-\infty}^{\infty} f(x)\,dx = \lim_{0 < \alpha \to \infty} \int_{-\alpha}^{\alpha} f(x)\,dx ,    (A.1)

when the limit exists. Not every improper Riemann integral on R is a Lebesgue integral, but when it is, the use of the limit of a sequence of Riemann integrals as in Equation A.1 is a procedure commonly used to compute the Lebesgue integral.

Theorem A.1.2. A sufficient condition for the improper Riemann integral Equation A.1 both to exist and to be a Lebesgue integral is that f(x) be absolutely integrable on the interval [−α, α],

\int_{-\alpha}^{\alpha} |f(x)|\,dx < \infty ,

and that this integral be bounded as α → ∞.

A merit of the Lebesgue measure-theoretic formulation of integration theory is that discrete sums, \sum_i f(x_i), are placed on an equal footing with integrals, \int f(x)\,d\mu(x), via the use of counting measure. Thus, sufficient conditions which ensure that one can interchange the operations of limit and integral/sum,¹

\lim_{\theta \to \infty} \int_{-\alpha}^{\alpha} f(x; \theta)\,dx = \int_{-\alpha}^{\alpha} \lim_{\theta \to \infty} f(x; \theta)\,dx    (A.2)

or

\lim_{\theta \to \infty} \sum_{i=1}^{\infty} f(x_i; \theta) = \sum_{i=1}^{\infty} \lim_{\theta \to \infty} f(x_i; \theta) ,    (A.3)

are essentially equivalent. In particular, a sufficient condition for the interchange to hold when the integrals on both sides of the equality are Lebesgue integrals is provided by the Lebesgue Dominated Convergence Theorem (DCT):

[Footnote 1: Note that θ is here taken to be continuous and positive. When we say θ → ∞, this is shorthand for the statement that θ_n → ∞ for every possible unbounded and countable subset, {θ_n | θ_{n+1} > θ_n, n = 1, 2, 3, ···}, of the positive reals. In the interest of time and space, we are glossing over this fine point here.]

Theorem A.1.3. (DCT) Let f(x; θ) be Lebesgue integrable for each θ on an interval [−α, α], for α ∈ (0, ∞) = R+ (i.e., including the case α → ∞). A sufficient condition for the interchange of the operations of limit and integral as shown in Equation A.2 is that f(x; θ) be majorized by an absolutely Lebesgue integrable function g(x),

|f(x; \theta)| \le g(x)  ∀ θ > 0 and ∀ x ∈ [−α, α] ,   and   \int_{-\alpha}^{\alpha} |g(x)|\,dx < \infty .

The DCT is a powerful theorem provided by the Lebesgue theory of integration.² The equivalent sufficient condition for the interchange of limits and sums as shown in Equation A.3 is

|f(x_i; \theta)| \le g(x_i)  ∀ θ > 0 and ∀ i = 1, 2, ··· ,   and   \sum_{i=1}^{\infty} |g(x_i)| < \infty .

A classic example where the interchange of limit and integral does not hold is found for the Gaussian distribution shown in Equation A.22. We have for every α > 0,³

1 = \lim_{\sigma \to 0} \int_{-\alpha}^{\alpha} \phi_X(x; 0, \sigma^2)\,dx \;\neq\; \int_{-\alpha}^{\alpha} \lim_{\sigma \to 0} \phi_X(x; 0, \sigma^2)\,dx = \text{undefined} .    (A.4)

For the general Gaussian density

\phi_X(x) = \phi_X(x; m, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2} ,

the integral \int_{-\alpha}^{\alpha} \phi_X(x)\,dx exists as both a Riemann and Lebesgue integral. Further, the integral

\int_{-\infty}^{\infty} \phi_X(x)\,dx = \lim_{\alpha \to \infty} \int_{-\alpha}^{\alpha} \phi_X(x)\,dx = 1

exists in both the (improper) Riemann and Lebesgue sense. More generally,

\int_{-\alpha}^{\beta} \phi_X(x)\,dx   and   \Phi_X(\alpha) = \int_{-\infty}^{\alpha} \phi_X(x)\,dx

both exist as (proper and improper, respectively) Riemann and Lebesgue integrals for all α and β, where Φ_X(α) is a strictly monotonically increasing function of α with the properties

\lim_{\alpha \to -\infty} \Phi_X(\alpha) = 0 ,   \Phi_X(m) = \frac{1}{2} ,   and   \lim_{\alpha \to \infty} \Phi_X(\alpha) = 1 .

Finally, note that because of the monotonicity property of Φ_X(α),

\Phi_X(\alpha) < \Phi_X(m) = \frac{1}{2}   for all α < m .

[Footnote 2: Recall that every Riemann integrable function (necessarily defined on a finite interval) is Lebesgue integrable, so that the DCT applies to all Riemann integrals. Not every improper Riemann integral is Lebesgue integrable, however, so the DCT does not apply to all improper Riemann integrals. In practice, one exploits the relationships, and equivalences (when they exist), between the Riemann, Lebesgue, and Stieltjes (and other) integrals, and utilizes the particular integral which is most convenient to use at any moment.]
[Footnote 3: Note that σ → 0 ⟺ θ = σ^{-1} → ∞.]

A.2 Asymptotic Expansion Analysis

Although an exact analytic form for the normalization factor of the Stirling PDF (Equation 3.2, derived in Chapter 3) is not known, Monte Carlo simulations show that it is extremely accurate, as expected from the fact that the theory of asymptotic expansions (and actual practice) shows that the Stirling approximation has very fast convergence in the size of u. In this section we are interested in the development of good approximations to Equation 3.2 that will allow for accurate computations of the mean, variance, skew, and kurtosis of the asymptotic number of In-Use LBAs of the PDF P_X(x), the normalization of π_x. It is important to realize that the expression

\log P_X(x) \sim (u - x)\left(1 - \log\frac{u - x}{u\bar{s}}\right) ,   \bar{s} = 1 - s ,    (A.5)

is only appropriate for values of x that are physically realizable. The correct statement is that the Stirling PDF is given by

\log P_X(x) \sim (u - x)\left(1 - \log\frac{u - x}{u\bar{s}}\right)   for 0 ≤ x ≤ u,   with P_X(x) = 0 for x < 0 and for x > u.    (A.6)

In practice one effectively works in the limit that u → ∞, in which case the probability that one sees a value of x near the boundary points becomes zero, which is why it is common to refer to (A.5) in lieu of (A.6). In the sequel, however, we will need to maintain the distinction at various points of the development. We are interested in values of s for which the feasible region

0 ≤ x ≤ u    (A.7)

is contained within the region of convergence (see Chapter 3 for a discussion of the region of convergence). It is evident that this corresponds to values s ≤ 0.5, which places constraints on the parameters of our SSD workload model. Fortunately, it is argued in Chapter 3 that this condition can always be satisfied in the region of large values of u (e.g., as u → ∞). Henceforth, we assume s ≤ 0.5 so that the feasible region (A.7) is contained in the region of convergence.⁴ Note that our assumption is equivalent to

\frac{s}{\bar{s}} \le 1 ,    (A.8)

a fact which will be used in the sequel.

[Footnote 4: This assumption can be relaxed. It has been made to reduce the notational overhead.]

A.3 An Accurate Gaussian Approximation

Define the quantities m = us and σ = \sqrt{u\bar{s}}. The right-hand side of Equation 3.2 can be expanded in a formal Taylor series about the point m, yielding

\log P_X(x) \sim (u - x)\left(1 - \log\frac{u - x}{u\bar{s}}\right)    (A.9)
            = (u - x) - (u - x)\log\frac{u - x}{u\bar{s}}    (A.10)
            = \sigma^2(1 - \alpha) - \sigma^2(1 - \alpha)\log(1 - \alpha)    (A.11)
            = -\sigma^2 \sum_{n=1}^{\infty} \frac{\alpha^{n+1}}{n(n + 1)}    (A.12)
            \sim -\alpha\sigma^2 + \log\left((1 - \alpha)^{-\sigma^2(1-\alpha)}\right) ,    (A.13)

for −1 ≤ α ≤ 1. This shows that in the feasible region x ∈ [0, u], the Stirling PDF is given by

P_X(x) \sim e^{-\alpha(x)\sigma^2}\,\left(1 - \alpha(x)\right)^{-\sigma^2(1-\alpha(x))} \sim e^{-\sigma^2 \sum_{n=1}^{\infty} \frac{\alpha^{n+1}(x)}{n(n+1)}}    (A.14)

for

\alpha(x) = \frac{x - m}{\sigma^2} ,   m = us ,   \sigma^2 = u\bar{s} ,   \bar{s} = 1 - s   and   |\alpha(x)| \le 1 .    (A.15)

Note that we can write the Stirling PDF expansion

\log P_X(x) \sim -\sigma^2 \sum_{n=1}^{\infty} \frac{\alpha^{n+1}(x)}{n(n + 1)}    (A.16)

as

\log P_X(x) \sim -\sum_{n=1}^{\infty} \frac{1}{n(n + 1)\,\sigma^{n-1}} \left(\frac{x - m}{\sigma}\right)^{n+1}    (A.17)

or

\log P_X(x) \sim -\frac{1}{2}\left(\frac{x - m}{\sigma}\right)^2 - \frac{1}{6\sigma}\left(\frac{x - m}{\sigma}\right)^3 - \frac{1}{12\sigma^2}\left(\frac{x - m}{\sigma}\right)^4 - \cdots    (A.18)

which is convergent for

|\alpha(x)| = \frac{1}{\sigma}\left|\frac{x - m}{\sigma}\right| \le 1 .    (A.19)

Defining

z = z(x) = \frac{x - m}{\sigma} = \sigma\alpha(x)

we can write this formal series simply as

\log P_X(x) \sim -\frac{z^2(x)}{2} - \frac{z^3(x)}{6\sigma} - \frac{z^4(x)}{12\sigma^2} - \cdots .    (A.20)

Although the RHS of Equation A.20 is not globally convergent, convergence will hold for z-values within some region of convergence which is generally dependent on the specific value of σ (e.g., for all values of z such that |z| < 1 if σ > 1). Further, all possible truncations of the formal series shown in Equation A.20 agree in value with the RHS of Equation 3.2 in the limit z → 0 (equivalently, in the limit x → m) for any fixed value of σ. Writing the formal expansion in the equivalent form

P_X(x) \sim e^{-\frac{z^2(x)}{2}}\, e^{-\left(\frac{z^3(x)}{6\sigma} + \frac{z^4(x)}{12\sigma^2}\right) - \cdots}    (A.21)

we see that it is entirely plausible that the "damage" done when estimating P_X(x) of Equation 3.2 via a truncation of the formal series shown in Equation A.20 is limited for z-values outside of the region of convergence due to the exponential attenuation caused by the factor e^{-z^2(x)/2}. This is, in fact, true and can be made rigorous via an application of the theory of asymptotic expansions [34]. In particular, the theory of asymptotic expansions explains the surprising fact that the lowest order truncation can be quite good, and often even optimal. In this case, we obtain the Gaussian approximation,

P_X(x) = \phi_X(x) = \phi_X(x; m, \sigma^2) \triangleq \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2} .    (A.22)

This approximation is extremely accurate in the high u limit. Indeed, the mean m = us and variance σ² = us̄ always predict the empirically estimated mean and variance in Monte Carlo simulations with high accuracy, even for relatively small values of u. In the limit of very high values of u, the empirical estimates of the skew and (relative) kurtosis are both essentially zero, which is consistent with the fact that the resulting empirical PDF is indistinguishable from the Gaussian distribution.

Within the region of convergence, the truncation of the power series to an arbitrary order ℓ ≥ 1 given by

\log P_X^{(\ell)}(x) \sim -\sigma^2 \sum_{n=1}^{\ell} \frac{\alpha^{n+1}(x)}{n(n + 1)} ,    (A.23)

produces a family of approximations to the Stirling PDF whose members are indexed by ℓ and whose accuracy increases with the size of ℓ (i.e., the number of terms kept in the series approximation). Only odd numbered truncations are admissible, as the resulting approximations must conform to the constraint that admissible PDFs are normalizable (to one).⁵ Note from Equation (A.18) that the 1st order truncation yields a Gaussian approximation to the Stirling PDF,

\log P_X^{(1)}(x) = \log \phi_X(x) \sim -\frac{1}{2}\left(\frac{x - m}{\sigma}\right)^2 ,    (A.24)

whereas the 3rd order truncation yields the approximation,

\log P_X^{(3)}(x) \sim -\frac{1}{2}\left(\frac{x - m}{\sigma}\right)^2 - \frac{1}{6\sigma}\left(\frac{x - m}{\sigma}\right)^3 - \frac{1}{12\sigma^2}\left(\frac{x - m}{\sigma}\right)^4 .    (A.25)

Note that while the different truncation PDFs can be rank ordered on the region of convergence in terms of greater or lesser accuracy relative to the Stirling PDF, each one has a misspecification error (value of the truncation probability relative to the value of the Stirling PDF) that goes to zero as α(x) → 0. In particular, for values of x very close to m relative to the size of σ, both the misspecification error of a low-order approximation will be small and the stability condition A.19 will be satisfied. This is an important insight which we will later use to show that the probability of misspecifying the value of the Stirling PDF of x will go to zero as u → ∞.

[Footnote 5: This seems counterintuitive at first as the feasible value of x is bounded from above by u. In an asymptotic analysis, however, we are ultimately interested in the limit as u → ∞.]
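As a quick sanity check of the Gaussian approximation (A.22), the sketch below numerically normalizes the Stirling log-PDF of Equation A.5 over the feasible region and compares it with N(m = us, σ² = us̄). The parameter values u = 1000 and s = 0.4 are illustrative choices (satisfying s ≤ 0.5), not values taken from the simulations reported in the main text.

```python
# Illustrative comparison (assumed parameter values) of the normalized Stirling
# PDF, exp{(u - x)(1 - log[(u - x)/(u*sbar)])}, with the Gaussian approximation
# of Equation A.22, N(m = u*s, sigma^2 = u*sbar).
import numpy as np

u, s = 1000, 0.4
sbar = 1.0 - s
m, var = u * s, u * sbar

x = np.arange(0, u)                       # feasible values 0 <= x < u (x = u carries no mass)
log_p = (u - x) * (1.0 - np.log((u - x) / (u * sbar)))   # unnormalized Stirling log-PDF
p = np.exp(log_p - log_p.max())
p /= p.sum()                              # normalize over the feasible grid

gauss = np.exp(-0.5 * (x - m) ** 2 / var) / np.sqrt(2 * np.pi * var)
gauss /= gauss.sum()                      # normalized on the same grid for a fair comparison

print("max |Stirling - Gaussian|:", np.abs(p - gauss).max())
print("Stirling mean vs m       :", (x * p).sum(), m)
print("Stirling var  vs sigma^2 :", ((x - (x * p).sum()) ** 2 * p).sum(), var)
```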

A.4 A Higher Order PDF Approximation

As mentioned above, the mean and variance predicted by the Gaussian density in Equation A.22 are always observed to be accurate in practice. Further, the empirically computed density for x is indistinguishable from the Gaussian of Equation A.22 for high values of u. For intermediate values of u, however, slight deviations from Gaussianity are observed. These discrepancies are captured in

small, but significant, nonzero values of the empirically computed skew and kurtosis. As mentioned previously, in the limit of high u the empirically computed skew and kurtosis values do both vanish. We want to construct a model of the density that is valid for both the intermediate and high u-value limit. We do this by using all three terms in the expansion shown on the RHS of Equation A.20. All of the terms are used to a) ensure that the resulting function is integrable in order to find the normalizing factor required of a proper density function, and b) ensure that the leading factor in Equation A.21 attenuates the second factor in the limit of x → ±∞ in order to attenuate the error in the PDF tails that would otherwise be caused by the asymptotic approximation. If only the first two terms are used, the resulting PDF will not converge in the limit x → −∞. The behavior of the third term dominates that of the second term, and its presence ensures that we have constructed an integrable PDF that takes the value 0 in the limit as x → ±∞. The proposed PDF of the random variable X is therefore taken to be

P_X(x) = \frac{1}{C}\, e^{-\frac{z^2(x)}{2}}\, e^{-\left(\frac{z^3(x)}{6\sigma} + \frac{z^4(x)}{12\sigma^2}\right)}    (A.26)

where the constant normalization factor C = C(σ) (also known as the partition function) is determined from the condition that the PDF P_X(x) integrates to the value 1, assuming that the density is indeed integrable.⁶ One can readily determine that the second exponential factor in Equation A.26 is bounded from above by its maximum value of e^{9\sigma^2/64}, attained at z = −1.5σ, which guarantees that P_X(x) is integrable for every finite value of the parameter σ² = us̄.

[Footnote 6: Integrability, and hence the existence of the normalizing constant C, is further discussed below and proved in Section A.1.]

In the sequel, an expectation of a function f(X) with respect to the PDF Equation A.26 will be denoted as E_P\{f(X)\}. We also denote the unit normal (Gaussian) PDF by P_Z(z) = \phi_0(z),

\phi_0(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}} ,

and denote the expectation of a function g(Z) with respect to the unit normal density by E_{\phi_0}\{g(Z)\}. Note the relationship between the unit normal φ_0 and the general Gaussian distribution φ_X shown earlier in Equation A.22,

\phi_0(z(x)) = \sigma\, \phi_X(x)   for   z = z(x) = \frac{x - m}{\sigma} = \sigma\alpha(x) .

We also use the well-known fact that for a unit normal random variable [35]

E_{\phi_0}\{Z^p\} = (p - 1)!!    (A.27)

where the double factorial is defined for integers n = 1, 2, 3, ··· by n!! = 0 for n even, and n!! = n(n − 2)(n − 4) ··· (1) for n odd. (For example: 7!! = 7 · 5 · 3 · 1, 4!! = 0, 1!! = 1, etc.)
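The sketch below is a small numerical illustration (not part of the dissertation) of the moment identity (A.27) together with the double-factorial convention above: quadrature estimates of E[Z^p] for a unit normal Z are compared against (p − 1)!!.

```python
# Illustrative check of E[Z^p] = (p-1)!! for a unit normal Z, using the
# double-factorial convention above (n!! = 0 for n even, including n = 0).
import numpy as np
from scipy import integrate

def double_factorial(n):
    """n!! with the convention used in the text: 0 for even n, n(n-2)...(1) for odd n."""
    if n % 2 == 0:
        return 0
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def gaussian_moment(p):
    """E[Z^p] for Z ~ N(0, 1), estimated by numerical quadrature."""
    integrand = lambda z: z ** p * np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

if __name__ == "__main__":
    for p in range(1, 9):
        print(f"p = {p}   E[Z^p] ~ {gaussian_moment(p):8.4f}   (p-1)!! = {double_factorial(p - 1)}")
```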

We can write the PDF in Equation A.26 as

P_X(x) = \frac{\sqrt{2\pi}}{C}\, \phi_0[z(x)]\, e^{-\left(\frac{z^3(x)}{6\sigma} + \frac{z^4(x)}{12\sigma^2}\right)} .    (A.28)

Inspection of Equation A.28 indicates at least two qualitative properties we can expect of P_X(x). First, for values of σ ≫ 1 (the standard deviation of the unit normal density), we can expect that the (exponential) factor on the right will be essentially constant (and close to 1 in value) for values of z ranging over many standard deviations of the unit normal (at which point the unit normal density will have a virtually zero value that will zero out any non-Gaussianity caused by the exponential factor), a fact consistent with the observation that for even moderately large values of σ² = us̄ the empirical behavior of x is observed to be essentially Gaussian. Second, any observed skew (asymmetry) must be entirely due to the z³ term which, if discernible in the computer experiments, should tend to skew the density to the left, as indicated by a negative skewness.

Note that

z = \frac{x - m}{\sigma} \iff x = m + \sigma z \implies dx = \sigma\, dz .    (A.29)

This means (as a simple consequence of the change of variables formula for random variables) that for any composite function of the form (f ∘ z)(x) = f(z(x)) we have

E_{\phi_X}\{f(Z(X))\} = E_{\phi_0}\{f(Z)\} .    (A.30)

From Equations A.16 and A.29 we have that for feasible values x ∈ [0, u] the Stirling PDF is given by

\log P_X(x; \sigma) \sim -\frac{1}{2}\left(\frac{x - m}{\sigma}\right)^2 - \sum_{n=2}^{\infty} \frac{1}{n(n + 1)}\left(\frac{z(x)}{\sigma}\right)^{n+1} .    (A.31)

This yields

P_X(x; \sigma) = \frac{1}{\Delta(\sigma)}\, \phi_X(x; \sigma)\, g(z(x); \sigma) = \frac{1}{\sigma\,\Delta(\sigma)}\, \phi_0(z(x))\, g(z(x); \sigma)    (A.32)

where

g(z(x); \sigma) \triangleq \begin{cases} 0 & x > u \\ e^{-\sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z(x)}{\sigma}\right)^{n+1}} & 0 \le x \le u \\ 0 & x < 0 \end{cases}    (A.33)

Note that Equations A.32 and A.33 are defined for all real values x ∈ R. Equations A.28 and A.29 yield the very useful fact that

E_P\{g(z(X))\} = \frac{\sqrt{2\pi\sigma^2}}{C}\, E_{\phi_0}\left\{ g(Z)\, e^{-\left(\frac{Z^3}{6\sigma} + \frac{Z^4}{12\sigma^2}\right)} \right\} .    (A.34)

In particular, the normalization requirement for the PDF P_X(x) is 1 = E_P\{1\}, which yields for the truncated case

C = \sqrt{2\pi\sigma^2}\, E_{\phi_0}\left\{ e^{-\left(\frac{Z^3}{6\sigma} + \frac{Z^4}{12\sigma^2}\right)} \right\} \triangleq \sqrt{2\pi\sigma^2}\, \Delta(\sigma) ,    (A.35)

allowing us to rewrite Equation A.34 as

E_P\{g(z(X))\} = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ g(Z)\, e^{-\left(\frac{Z^3}{6\sigma} + \frac{Z^4}{12\sigma^2}\right)} \right\}    (A.36)

with

\Delta(\sigma) \triangleq E_{\phi_0}\left\{ e^{-\left(\frac{Z^3}{6\sigma} + \frac{Z^4}{12\sigma^2}\right)} \right\} ,    (A.37)

or in the more generalized case

1 = \int_{-\infty}^{\infty} \frac{1}{\Delta(\sigma)}\, \phi_X(x; \sigma)\, g(z(x); \sigma)\,dx = \int_{-\infty}^{\infty} \frac{1}{\sigma\,\Delta(\sigma)}\, \phi_0(z(x))\, g(z(x); \sigma)\,dx = \int_{-\infty}^{\infty} \frac{1}{\Delta(\sigma)}\, \phi_0(z)\, g(z; \sigma)\,dz    (A.38)

or

\Delta(\sigma) = E_{\phi_0}\{g(Z; \sigma)\} = \int_{-\infty}^{\infty} g(z; \sigma)\, \phi_0(z)\,dz .    (A.39)

Note that g(z; σ)φ0(z) is positive and piecewise continuous. If g(z; σ) is bounded as a function of z,

0 ≤ g(z; σ) ≤ M < ∞, then we have

g(z; \sigma)\,\phi_0(z) \le \frac{1}{\sqrt{2\pi}}\, M(\sigma) < \infty ,

showing that g(z; σ)φ_0(z) is absolutely integrable and therefore that the integral defining Δ(σ) is guaranteed to exist. Even more can be said: if we can show that the bound M < ∞ is independent of σ, then Mφ_0(z) is an absolutely Lebesgue integrable majorizing function for g(z; σ)φ_0(z), the Lebesgue Dominated Convergence Theorem (DCT) applies, and the limit and integration operations commute,

\lim_{\sigma \to \infty} \Delta(\sigma) = \lim_{\sigma \to \infty} \int_{-\infty}^{\infty} g(z; \sigma)\, \phi_0(z)\,dz = \int_{-\infty}^{\infty} \lim_{\sigma \to \infty} g(z; \sigma)\, \phi_0(z)\,dz .    (A.40)

To determine a bound M is straightforward once we recognize that Equation (A.33) is equivalent to

g(z; \sigma) \triangleq \begin{cases} 0 & z > \sigma \\ e^{-\sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z}{\sigma}\right)^{n+1}} & -\frac{s}{\bar{s}}\sigma \le z \le \sigma \\ 0 & z < -\frac{s}{\bar{s}}\sigma \end{cases}    (A.41)

We have for all z ∈ R,

g(z; \sigma) \le \max_z g(z; \sigma)
           = \max_{-\frac{s}{\bar{s}}\sigma \le z \le \sigma} e^{-\sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z}{\sigma}\right)^{n+1}}
           \le \max_{|z| \le \sigma} e^{-\sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z}{\sigma}\right)^{n+1}}
           = e^{-\min_{|z| \le \sigma} \sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z}{\sigma}\right)^{n+1}}
           \le e^{\sum_{n=2}^{\infty} \frac{1}{n(n+1)}} = e^{\frac{1}{2}} = \sqrt{e} = M < \infty ,

using the fact that \sum_{n=1}^{\infty} \frac{1}{n(n+1)} = 1. Thus the normalizing factor Δ(σ) exists. Indeed, we immediately have that for all σ² = us̄,

\Delta(\sigma) = E_{\phi_0}\{g(Z; \sigma)\} \le E_{\phi_0}\{M\} = M = \sqrt{e} \approx 1.649 < \infty .

To determine the limiting value of Δ(σ) from (A.40), note that in the limit σ → ∞, for each fixed value of z (this is legitimate, as the DCT allows us to pass the limit through the integral) we have

g(z; \sigma) \to e^{-\sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z}{\sigma}\right)^{n+1}} \to 1 ,

showing that

\lim_{\sigma \to \infty} \Delta(\sigma) = 1 .

The normalization constant C = C(σ) exists and is finite (i.e., the density Equation A.26 exists) for every σ > 0 if and only if Δ(σ) exists and is finite for every σ > 0. The existence of Δ(σ) has just been shown, as well as the important fact that

\lim_{\sigma \to \infty} \Delta(\sigma) = 1 .    (A.42)

It is evident from Equation A.35 that Δ(σ) gives a measure of the degree to which the normalization factor C deviates from the Gaussian normalization factor \sqrt{2\pi\sigma^2}.

Indeed, as discussed further in this Appendix, it is reasonable to interpret ∆(σ) as providing a measure of the lack of Gaussianity of the density PX (x) because of the important fact shown there that

PX (x) → φX (x) ⇐⇒ ∆ → 1 (A.43)

Finally, note that Equations A.42 and A.43 imply that

σ → ∞ =⇒ PX (x) → φX (x) (A.44)

Using the fact that the series expansion for the exponential function

e^y = \sum_{k=0}^{\infty} \frac{y^k}{k!}

is absolutely convergent for all values of y allows us to usefully expand

e^{-\left(\frac{z^3}{6\sigma} + \frac{z^4}{12\sigma^2}\right)} = e^{-\frac{z^3}{6\sigma}\left(1 + \frac{z}{2\sigma}\right)}

for a variety of purposes, including the calculation of an approximate value of Δ(σ) (and hence also of the normalization factor C via the relationship in Equation A.35). The expansion yields⁷

e^{-\left(\frac{z^3}{6\sigma} + \frac{z^4}{12\sigma^2}\right)} = 1 - \frac{z^3}{6\sigma} - \frac{z^4}{12\sigma^2} + \frac{z^6}{72\sigma^2} + \frac{z^7}{72\sigma^3} + \frac{z^8}{288\sigma^4} - \frac{z^9}{1296\sigma^3} - \frac{z^{10}}{864\sigma^4} + \frac{z^{12}}{31104\sigma^4} + R\!\left(\frac{1}{\sigma^5}\right)    (A.45)

or

e^{-\left(\frac{z^3}{6\sigma} + \frac{z^4}{12\sigma^2}\right)} = 1 - \frac{z^3}{6\sigma} + \frac{z^6 - 6z^4}{72\sigma^2} - \frac{z^9 - 18z^7}{1296\sigma^3} + \frac{z^{12} - 36z^{10} + 108z^8}{31104\sigma^4} + R\!\left(\frac{1}{\sigma^5}\right) ,    (A.46)

where R(1/σ⁵) is defined to mean that the remaining terms are at least as small as O(1/σ⁵), in terms of the "big O" terminology.⁸ Note that in terms of the "little o" terminology, R(1/σ⁵) ∼ o(1/σ⁴). Because of the property in Equation A.27 it is very useful to note that the terms in the expansion having even powers of σ have only even powers of z in the numerator, while the terms having odd powers of σ contain only odd powers of z.

From Equations A.27, A.35, A.37, and A.46 it is quite straightforward to compute an approximate value for the term Δ(σ) defined in Equation A.37. We have

\Delta(\sigma) = E_{\phi_0}\left\{ 1 - \frac{Z^3}{6\sigma} + \frac{Z^6 - 6Z^4}{72\sigma^2} - \frac{Z^9 - 18Z^7}{1296\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) \right\}
              = 1 - 0 + \frac{5!! - 6(3!!)}{72\sigma^2} - 0 + R\!\left(\frac{1}{\sigma^4}\right)

or

\Delta(\sigma) = 1 - \frac{1}{24\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) .    (A.47)

[Footnote 7: This expansion converges for every fixed value of z. However, any finite truncation of the expansion to some specific power of 1/σ generally will yield an arbitrarily large approximation error if z is allowed to grow without bound. Fortunately, because of the nice structural form of P_X(x) – in particular, because of the existence of the term φ_0(z) – we can treat the values of z in the expansion as being "essentially bounded". This is because values of φ_0(z) for large z-values (values that are more than a few unit standard deviations from the unit normal mean value of zero) are "essentially zero" and don't contribute appreciably to the value of the density P_X(x) nor to any integration (i.e., expectation) that uses the density.]
[Footnote 8: For a discussion of the "big O" and "little o" terminologies see [36].]
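The expansion (A.47) is easy to verify numerically. The sketch below (illustrative only; the σ values are arbitrary) evaluates Δ(σ) of Equation A.37 by quadrature and compares it with 1 − 1/(24σ²).

```python
# Illustrative numerical check of Equation A.47: Delta(sigma) from Equation A.37
# versus its second-order expansion 1 - 1/(24 sigma^2).
import numpy as np
from scipy import integrate

def delta(sigma):
    """Delta(sigma) = E_phi0[exp(-(Z^3/(6 sigma) + Z^4/(12 sigma^2)))], by quadrature."""
    integrand = lambda z: np.exp(-(z ** 2 / 2 + z ** 3 / (6 * sigma) + z ** 4 / (12 * sigma ** 2))) / np.sqrt(2 * np.pi)
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

if __name__ == "__main__":
    for sigma in (5.0, 10.0, 20.0, 40.0):
        print(f"sigma = {sigma:5.1f}   Delta = {delta(sigma):.8f}   1 - 1/(24 sigma^2) = {1 - 1/(24 * sigma**2):.8f}")
```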

This results in a normalization constant approximation of⁹

C = \Delta(\sigma)\sqrt{2\pi\sigma^2} = \sqrt{2\pi\sigma^2}\left(1 - \frac{1}{24\sigma^2} + R\!\left(\frac{1}{\sigma^3}\right)\right) .    (A.48)

[Footnote 9: A more careful derivation shows that the next highest order estimate of Δ(σ) is
\Delta(\sigma) = 1 - \frac{1}{24\sigma^2} - \frac{117}{2592\sigma^4} + R\!\left(\frac{1}{\sigma^6}\right) ,
indicating that Δ(σ), and therefore the normalization constant C, appears to converge quickly as a function of σ² = us̄.]

Summarizing the results of the above derivations, we have the proposed

PDF¹⁰

P_X(x) = \frac{1}{\sigma\,\Delta(\sigma)}\, \phi_0[z(x)]\, e^{-\left(\frac{z^3(x)}{6\sigma} + \frac{z^4(x)}{12\sigma^2}\right)} = \frac{1}{\Delta(\sigma)}\, \phi_X(x)\, e^{-\left(\frac{z^3(x)}{6\sigma} + \frac{z^4(x)}{12\sigma^2}\right)}    (A.49)

with an R(1/σ⁴) = R(1/u²) ∼ o(1/σ³) ∼ o(1/u^{3/2}) expansion for Δ(σ) given by Equation A.47. Note that as u → ∞, we have both

\Delta(\sigma) \to 1

and, for each fixed value of x,

e^{-\left(\frac{z^3(x)}{6\sigma} + \frac{z^4(x)}{12\sigma^2}\right)} \to 1 ,

so that, for each fixed x, P_X(x) → φ_X(x) as u → ∞.

[Footnote 10: Unless Δ(σ) has been approximated, this statement is the exact definition of the proposed PDF. Of course, the proposed PDF has been introduced as a candidate accurate approximation to the Stirling approximation-based PDF given by Equation 3.2. Also recall the distinction between φ_0 and φ_X.]

A.4.1 Approximating the Mean

Denote the means of the random variables X and Z by

m_x = E_P\{X\}   and   m_z = E_P\{Z\} ,

and note that

m_x = m + \sigma m_z .

We have from Equations A.36 and A.46 that

m_z = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ Z\, e^{-\left(\frac{Z^3}{6\sigma} + \frac{Z^4}{12\sigma^2}\right)} \right\}
    = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ Z - \frac{Z^4}{6\sigma} + \frac{Z^7 - 6Z^5}{72\sigma^2} - \frac{Z^{10} - 18Z^8}{1296\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) \right\}
    = \frac{1}{\Delta(\sigma)}\left( 0 - \frac{3!!}{6\sigma} + 0 - \frac{9!! - (18)\,7!!}{1296\sigma^3} + 0 + R\!\left(\frac{1}{\sigma^5}\right) \right)

or

m_z = -\frac{1}{2\,\Delta(\sigma)\,\sigma} + R\!\left(\frac{1}{\sigma^3}\right) .    (A.50)

This yields

m_x = m - \frac{1}{2\,\Delta(\sigma)} + R\!\left(\frac{1}{\sigma^2}\right) .    (A.51)

Consideration of the expansion of Equation A.47 shows that

\frac{1}{\Delta(\sigma)} = \frac{1}{1 - \frac{1}{24\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right)} = 1 + \frac{1}{24\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) = 1 + R\!\left(\frac{1}{\sigma^2}\right) .    (A.52)

Thus,

m_z = -\frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^3}\right) ,    (A.53)

so that the rate of convergence of m_z to 0 is of order R(1/σ³) ∼ o(1/σ²) in σ² = us̄. This results in a value for the mean, m_x, of x of

\mathrm{Mean}_x = m - \frac{1}{2} + R\!\left(\frac{1}{\sigma^2}\right) ,    (A.54)

showing that the rate of convergence of m_x to m − 1/2 is of order R(1/σ²) ∼ o(1/σ). Note that in the limit m = us → ∞, the term −1/2 becomes negligible and the relative difference between m_x and m vanishes as u → ∞,

\frac{m_x - m}{m} = -\frac{1}{2m} \to 0 .

A.4.2 Approximating the Variance

Denote the variances of the random variables X and Z by

\sigma_x^2 = \mathrm{Var}_P\{X\}   and   \sigma_z^2 = \mathrm{Var}_P\{Z\} ,

and note that

\sigma_x^2 = \sigma^2 \sigma_z^2 .

Also recall that

\sigma_z^2 = E_P\{Z^2\} - m_z^2 .

We have from Equations A.36 and A.46 that

E_P\{Z^2\} = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ Z^2\, e^{-\left(\frac{Z^3}{6\sigma} + \frac{Z^4}{12\sigma^2}\right)} \right\}
           = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ Z^2 - \frac{Z^5}{6\sigma} + \frac{Z^8 - 6Z^6}{72\sigma^2} - \frac{Z^{11} - 18Z^9}{1296\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) \right\}
           = \frac{1}{\Delta(\sigma)}\left( 1 - 0 + \frac{7!! - (6)\,5!!}{72\sigma^2} - 0 + R\!\left(\frac{1}{\sigma^4}\right) \right)

or

E_P\{Z^2\} = \frac{1}{\Delta(\sigma)}\left( 1 + \frac{5}{24\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) \right) = 1 + \frac{1}{4\,\Delta(\sigma)\,\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) .    (A.55)

Consideration of Equation A.52 shows that in fact

E_P\{Z^2\} = 1 + \frac{1}{4\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) .    (A.56)

Continuing, we have

\sigma_z^2 = E_P\{Z^2\} - m_z^2
           = \left( 1 + \frac{1}{4\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) \right) - \left( -\frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^3}\right) \right)^2
           = 1 + \frac{1}{4\sigma^2} - \frac{1}{4\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right)

or

\sigma_z^2 = 1 + R\!\left(\frac{1}{\sigma^4}\right) ,    (A.57)

so that the rate of convergence of σ_z² to 1 is R(1/σ⁴) ∼ o(1/σ³). Finally, from σ_x² = σ²σ_z² we find that the variance, σ_x², of x is given by

\mathrm{Var}_x = \sigma^2 + R\!\left(\frac{1}{\sigma^2}\right) ,    (A.58)

so that the rate of convergence of σ_x² to σ² is R(1/σ²) ∼ o(1/σ).

A.4.3 Approximating the Skew

The standardized moments of X are defined by

\gamma_p = E_P\left\{ \left(\frac{X - m_x}{\sigma_x}\right)^p \right\} .

In particular, the skew (or skewness) of X is given by the 3rd standardized moment, γ_3. The (relative or excess) kurtosis is given by κ = γ_4 − 3.

Consideration of Equations A.54 and A.58, and the definition of z = z(x), shows that essentially¹¹

\frac{x - m_x}{\sigma_x} = z(x) + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) .

Thus

\gamma_p = E_P\left\{ \left( z(X) + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) \right)^p \right\} .

To compute the skew, note that essentially

\left( z + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) \right)^3 = z^3 + \frac{3z^2}{2\sigma} + \frac{3z}{4\sigma^2} + \frac{1}{8\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) .

Then, using Equations A.36, A.46, A.27, and A.52 we have

\gamma_3 = E_P\left\{ \left( z(X) + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) \right)^3 \right\}
         = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ \left( Z^3 + \frac{3Z^2}{2\sigma} + \frac{3Z}{4\sigma^2} + R\!\left(\frac{1}{\sigma^3}\right) \right)\left( 1 - \frac{Z^3}{6\sigma} + \frac{Z^6 - 6Z^4}{72\sigma^2} + R\!\left(\frac{1}{\sigma^3}\right) \right) \right\}
         = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ \frac{3Z^2}{2\sigma} - \frac{Z^6}{6\sigma} + R\!\left(\frac{1}{\sigma^3}\right) \right\} = \frac{1}{\Delta(\sigma)}\left( \frac{3}{2\sigma} - \frac{5!!}{6\sigma} + R\!\left(\frac{1}{\sigma^3}\right) \right)
         = \left( 1 + R\!\left(\frac{1}{\sigma^2}\right) \right)\left( -\frac{1}{\sigma} + R\!\left(\frac{1}{\sigma^3}\right) \right) .

Thus the skew, γ_3, of x is given by the simple expression

\mathrm{Skew}_x = -\frac{1}{\sigma} + R\!\left(\frac{1}{\sigma^3}\right) .    (A.59)

Our numerical computer experiments match the predicted values of this slightly negative skew with very good accuracy.

[Footnote 11: We can treat x as "essentially bounded", so that x · R(1/σ⁴) ∼ R(1/σ⁴), because the probability of x being several standard deviations from its mean is "essentially zero".]

A.4.4 Approximating the Kurtosis

To find the kurtosis, κ = γ_4 − 3, we need to compute the 4th standardized moment

\gamma_4 = E_P\left\{ \left(\frac{X - m_x}{\sigma_x}\right)^4 \right\} = E_P\left\{ \left( z(X) + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) \right)^4 \right\} .

Note that essentially

\left( z + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) \right)^4 = z^4 + \frac{2z^3}{\sigma} + \frac{3z^2}{2\sigma^2} + \frac{z}{2\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) .

Using Equations A.36, A.46, A.27, and A.52 we have

\gamma_4 = E_P\left\{ \left( z(X) + \frac{1}{2\sigma} + R\!\left(\frac{1}{\sigma^4}\right) \right)^4 \right\}
         = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ \left( Z^4 + \frac{2Z^3}{\sigma} + \frac{3Z^2}{2\sigma^2} + \frac{Z}{2\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) \right) \cdot \left( 1 - \frac{Z^3}{6\sigma} + \frac{Z^6 - 6Z^4}{72\sigma^2} - \frac{Z^9 - 18Z^7}{1296\sigma^3} + R\!\left(\frac{1}{\sigma^4}\right) \right) \right\}
         = \frac{1}{\Delta(\sigma)}\, E_{\phi_0}\left\{ Z^4 + \frac{3Z^2}{2\sigma^2} - \frac{Z^6}{3\sigma^2} + \frac{Z^{10} - 6Z^8}{72\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) \right\}
         = \frac{1}{\Delta(\sigma)}\left( 3 + \frac{3}{2\sigma^2} - \frac{5!!}{3\sigma^2} + \frac{9!! - 6(7!!)}{72\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) \right)
         = \left( 1 - \frac{1}{24\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) \right)\left( 3 + \frac{7}{8\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) \right) = 3 + \frac{3}{4\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) .

Thus the kurtosis, κ = γ_4 − 3, of x is given by the simple expression

\mathrm{Kurtosis}_x = \frac{3}{4\sigma^2} + R\!\left(\frac{1}{\sigma^4}\right) .    (A.60)

For moderately large values of σ, this predicts the existence of a virtually negligible, very slightly positive, kurtosis. Our numerical computer experiments match the values of this prediction within experimental error.
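The moment approximations above can also be checked directly against the proposed third-order PDF. The sketch below (illustrative values of m and σ, chosen only for demonstration) computes the mean, variance, skew, and kurtosis of the density in Equation A.49 by quadrature in z and compares them with the leading-order expressions m − 1/2, σ², and −1/σ; the small positive kurtosis is printed for comparison with the prediction above.

```python
# Illustrative numerical check of (A.54), (A.58), (A.59): moments of the proposed
# third-order PDF of Equation A.49, computed by quadrature in z = (x - m)/sigma.
import numpy as np
from scipy import integrate

def moments(sigma, m):
    # Unnormalized proposed density as a function of z.
    w = lambda z: np.exp(-(z ** 2 / 2 + z ** 3 / (6 * sigma) + z ** 4 / (12 * sigma ** 2)))
    norm, _ = integrate.quad(w, -np.inf, np.inf)
    ez = lambda f: integrate.quad(lambda z: f(z) * w(z), -np.inf, np.inf)[0] / norm
    mz = ez(lambda z: z)
    vz = ez(lambda z: (z - mz) ** 2)
    skew = ez(lambda z: ((z - mz) / np.sqrt(vz)) ** 3)
    kurt = ez(lambda z: ((z - mz) / np.sqrt(vz)) ** 4) - 3.0
    return m + sigma * mz, sigma ** 2 * vz, skew, kurt          # back to x = m + sigma*z

if __name__ == "__main__":
    sigma, m = 25.0, 400.0          # illustrative placeholders for sqrt(u*sbar) and u*s
    mean_x, var_x, skew_x, kurt_x = moments(sigma, m)
    print("mean:", mean_x, " vs m - 1/2 =", m - 0.5)
    print("var :", var_x,  " vs sigma^2 =", sigma ** 2)
    print("skew:", skew_x, " vs -1/sigma =", -1 / sigma)
    print("kurt:", kurt_x, " (small positive value predicted above)")
```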

A.5 Accuracy of the Truncation PDFs

Knowing that asymptotically (i.e., as σ → ∞) the Stirling PDF A.16 behaves like φ_X(x; m, σ²) for m = us and σ² = us̄ suggests a very crude and heuristic "bootstrap" argument that all of the approximate PDFs, P_X^{(ℓ)}(x; σ), arising from truncating the series as defined in A.23 behave well in the limit σ → ∞. In particular, because in the limit σ → ∞ the mean of x becomes m and the variance becomes σ², we can invoke the Chebyshev inequality,

\mathrm{Prob}\left( |z(x)| \le \gamma \right) = \mathrm{Prob}\left( \left|\frac{x - m}{\sigma}\right| \le \gamma \right) \ge 1 - \frac{1}{\gamma^2} ,

in that limit. In particular, note that in the limit of large σ we have

\mathrm{Prob}\left( |\alpha(x)| \le \epsilon \right) = \mathrm{Prob}\left( \left|\frac{z(x)}{\sigma}\right| \le \epsilon \right) = \mathrm{Prob}\left( |z(x)| \le \epsilon\sigma \right) \ge 1 - \frac{1}{\epsilon^2\sigma^2} .

Thus when approximating the Stirling PDF P_X(x; σ) by the truncation approximation P_X^{(ℓ)}(x), we can always make the neglected terms in the sum on the right-hand side of (A.16) as small as possible with high probability by increasing the size of σ² = us̄. For example, if we write A.16 as

\log P_X(x) \sim -\sigma^2\left( \sum_{n=1}^{\ell} \frac{\alpha^{n+1}(x)}{n(n + 1)} + \sum_{n=\ell+1}^{\infty} \frac{\alpha^{n+1}(x)}{n(n + 1)} \right) ,    (A.61)

then if σ is large enough so that with high probability |α(x)| ≤ ε, we have with high probability that (assuming ε < 1)

\left| \sum_{n=\ell+1}^{\infty} \frac{\alpha^{n+1}(x)}{n(n + 1)} \right| \le \sum_{n=\ell+1}^{\infty} \frac{|\alpha^{n+1}(x)|}{n(n + 1)} \le \sum_{n=\ell+1}^{\infty} \frac{\epsilon^{n+1}}{n(n + 1)} \le \epsilon^{\ell+2} \sum_{n=\ell+1}^{\infty} \frac{1}{n(n + 1)} \le \epsilon^{\ell+2} ,

whereas the first sum in Equation A.61 is at least of order O(ε^{ℓ+1}). Thus we heuristically argue that all of the truncation approximations to the Stirling PDF are accurate in the limit of high σ² = us̄. This certainly makes sense, as above we have shown that the crudest nontrivial truncation of ℓ = 1 yields the Gaussian approximation, which is accurate in the limit as σ → ∞.

A.6 The K-L Divergence to and from the Normal Distribution

We will now show that the Stirling PDF (A.24) converges to the Gaussian PDF in the limit that u → ∞. This is a most interesting and nontrivial result, as every term in the expansion (A.16) of the Stirling PDF has the same ("big O") order O(u), so that a straightforward asymptotic analysis is elusive.

To show that the Stirling PDF P_X(x; σ) converges to a Gaussian density in the limit as σ → ∞, we want to demonstrate that

D\left( \phi_X(\cdot; \sigma) \,\|\, P_X(\cdot; \sigma) \right) \to 0   as σ → ∞ ,

where D(· ‖ ·) is the Kullback-Leibler Divergence (KLD) (or Mutual Information). Unfortunately, we can't do this without some additional finessing of the problem statement. This is because P_X(x; σ) takes zero values on the support of the Gaussian PDF φ_X(x; σ).¹²

To ameliorate the problem, we modify g(z; σ) to g_ε(z; σ), where

g_\epsilon(z; \sigma) \triangleq \begin{cases} \epsilon & z > \sigma \\ e^{-\sum_{n=2}^{\infty} \frac{1}{n(n+1)}\left(\frac{z}{\sigma}\right)^{n+1}} & -\frac{s}{\bar{s}}\sigma \le z \le \sigma \\ \epsilon & z < -\frac{s}{\bar{s}}\sigma \end{cases}

for ε > 0. Obviously in the limit that ε → 0 we recover g(z; σ).¹³ Thus, we define the ε-parameterized family of "near-Stirling" PDFs,

P_X^\epsilon(x; \sigma) = \frac{1}{\Delta_\epsilon(\sigma)}\, \phi_X(x; \sigma)\, g_\epsilon(z(x); \sigma) = \frac{1}{\sigma\,\Delta_\epsilon(\sigma)}\, \phi_0(z(x))\, g_\epsilon(z(x); \sigma) ,    (A.62)

where

\Delta_\epsilon(\sigma) = E_{\phi_0}\{g_\epsilon(Z; \sigma)\} = \int_{-\infty}^{\infty} g_\epsilon(z; \sigma)\, \phi_0(z)\,dz .    (A.63)

[Footnote 12: That is, unfortunately φ_X is not absolutely continuous with respect to P_X. This causes a problem because the log of zero is generally undefined. If φ_X(x) were absolutely continuous with respect to P_X(x), then the problem vanishes because 0 log 0 = 0. However, note that because φ_X(x) has infinite support (its value is zero nowhere) it can never be absolutely continuous with respect to a density having only a finite support. In principle the problem could be sidestepped by working instead with the "reverse direction" KLD, D(P_X(x; σ) ‖ φ_X(x; σ)) (because the Stirling PDF is absolutely continuous with respect to the Gaussian density). This path, however, leads to an unpleasant morass of analysis.]
[Footnote 13: Also, for any ε, g_ε(z; σ) − g(z; σ) → 0 as σ → ∞.]

Note that because g_ε(z; σ)φ_0(z) satisfies the conditions of the DCT, we have

\lim_{\sigma \to \infty} \Delta_\epsilon(\sigma) = \lim_{\sigma \to \infty} \int_{-\infty}^{\infty} g_\epsilon(z; \sigma)\, \phi_0(z)\,dz = \int_{-\infty}^{\infty} \lim_{\sigma \to \infty} g_\epsilon(z; \sigma)\, \phi_0(z)\,dz = \int_{-\infty}^{\infty} \lim_{\sigma \to \infty} g(z; \sigma)\, \phi_0(z)\,dz = \lim_{\sigma \to \infty} \Delta(\sigma) ,

for every value of ε > 0. Note that this says that regardless of the size of ε > 0, Δ_ε(σ) becomes independent of ε (and equal to Δ(σ)) in the limit of large σ.

A.6.1 Kullback-Leibler Divergence

Given two densities p(x) and f(x), the divergence of f(x) from p(x) is defined by

D(p \,\|\, f) = \int p(x) \log\frac{p(x)}{f(x)}\,dx = E_p\left\{ \log\frac{p(X)}{f(X)} \right\} .

The density p(x) is the reference density and D(p ‖ f) is a measure of how much f(x) "diverges" from p(x) on the average, using an average computed using the reference density.¹⁴ Note that the definition of the divergence generally is not symmetric,

D(p \,\|\, f) \ne D(f \,\|\, p) ,

so that the "distance" from p(x) to f(x) is generally not equal to the "distance" from f(x) to p(x). Thus the K-L divergence is not a true metric (distance function). However, the divergence is positive definite:

∀ p, f, D(p || f) ≥ 0 and D(p || f) = 0 ⇐⇒ p = f .

Thus convergence of PX (x) to φX (x) can be shown by demonstrating that either of the two conditions

D(φX || PX ) → 0 or D(PX || φX ) → 0

hold when u → ∞.

[Footnote 14: Thus, the reference density p(x) is taken to be the baseline against which another density f(x) can be compared. In many areas of coding and communications theory, if p(x) is the "true density" that ideally should be used to optimize some computation of interest, and f(x) is an available "model density" that is actually used to perform the optimization, D(p ‖ f) can be used to compute or predict the degree of performance loss that occurs.]

Given the forms of the densities φX (x) and PX (x), it is simpler to show the first condition. For moderately large values of u, however, PX is likely to be a better approximation to “truth” than the simpler normal density φX (x), whereas it is the normal density that will generally have greater utility in practice. From this perspective, it is D(PX || φX ) that is of interest as it is a measure of the divergence of the “approximation” density φX (x) from the “truth” density PX (x).

A.6.2 Application to PX

We now show that for every ε > 0,

\lim_{\sigma \to \infty} D\left( \phi_X(\cdot; \sigma) \,\|\, P_X^\epsilon(\cdot; \sigma) \right) = 0 .    (A.64)

With the framework we have established, it is straightforward to do this. We have

0 \le D\left( \phi_X(\cdot; \sigma) \,\|\, P_X^\epsilon(\cdot; \sigma) \right) = E_{\phi_X}\left\{ \log\frac{\phi_X(X)}{P_X^\epsilon(X; \sigma)} \right\}
  = E_{\phi_X}\left\{ \log\frac{\Delta_\epsilon(\sigma)}{g_\epsilon(z(X); \sigma)} \right\} = E_{\phi_0}\left\{ \log\frac{\Delta_\epsilon(\sigma)}{g_\epsilon(Z; \sigma)} \right\}
  = \log \Delta_\epsilon(\sigma) - E_{\phi_0}\{\log g_\epsilon(Z; \sigma)\}

or

0 \le D\left( \phi_X(\cdot; \sigma) \,\|\, P_X^\epsilon(\cdot; \sigma) \right) = \log E_{\phi_0}\{g_\epsilon(Z; \sigma)\} - E_{\phi_0}\{\log g_\epsilon(Z; \sigma)\} .    (A.65)

Note that nonnegativity of the difference of the two terms shown on the far right hand side must be true (which gives us confidence in our derivation) as it is also a consequence of Jensen’s inequality applied to the concave function log(y).
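Equation A.65 is also convenient to evaluate numerically. As a rough illustration (not from the dissertation), the sketch below uses the third-order factor of Section A.4 as a simple stand-in for g(z; σ): for that choice, log E_{φ0}[g] − E_{φ0}[log g] is exactly the divergence of the proposed PDF of Equation A.49 from φ_X, it is nonnegative by Jensen's inequality, and it visibly shrinks as σ grows.

```python
# Illustrative sketch of Equation A.65 with the third-order factor of Section A.4
# standing in for g(z; sigma): D = log E_phi0[g] - E_phi0[log g] >= 0, shrinking in sigma.
import numpy as np
from scipy import integrate

def divergence(sigma):
    phi0 = lambda z: np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    log_g = lambda z: -(z ** 3 / (6 * sigma) + z ** 4 / (12 * sigma ** 2))
    e_g, _ = integrate.quad(lambda z: np.exp(log_g(z)) * phi0(z), -np.inf, np.inf)
    e_log_g, _ = integrate.quad(lambda z: log_g(z) * phi0(z), -np.inf, np.inf)
    return np.log(e_g) - e_log_g

if __name__ == "__main__":
    for sigma in (5.0, 10.0, 20.0, 40.0):
        print(f"sigma = {sigma:5.1f}   log E[g] - E[log g] = {divergence(sigma):.3e}")
```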

We will have established our desired result once we've shown that the far right hand side goes to zero in the limit that σ → ∞. Noting that

\lim_{\sigma \to \infty}\left[ \log E_{\phi_0}\{g_\epsilon(Z; \sigma)\} - E_{\phi_0}\{\log g_\epsilon(Z; \sigma)\} \right]
  = \lim_{\sigma \to \infty} \log E_{\phi_0}\{g_\epsilon(Z; \sigma)\} - \lim_{\sigma \to \infty} E_{\phi_0}\{\log g_\epsilon(Z; \sigma)\} ,

we will establish the result by separately demonstrating that

\lim_{\sigma \to \infty} \log E_{\phi_0}\{g_\epsilon(Z; \sigma)\} = \log \lim_{\sigma \to \infty} E_{\phi_0}\{g_\epsilon(Z; \sigma)\} = 0    (A.66)

(the first equality is a known consequence of the continuity of the logarithm), and

\lim_{\sigma \to \infty} E_{\phi_0}\{\log g_\epsilon(Z; \sigma)\} = 0 .    (A.67)

To establish these assertions, note that both φ_0(z)g_ε(z; σ) and φ_0(z) log g_ε(z; σ) can be shown to satisfy the conditions of the DCT. Therefore

\log \lim_{\sigma \to \infty} E_{\phi_0}\{g_\epsilon(Z; \sigma)\} = \log E_{\phi_0}\left\{ \lim_{\sigma \to \infty} g_\epsilon(Z; \sigma) \right\} = 0

and

\lim_{\sigma \to \infty} E_{\phi_0}\{\log g_\epsilon(Z; \sigma)\} = E_{\phi_0}\left\{ \lim_{\sigma \to \infty} \log g_\epsilon(Z; \sigma) \right\} = 0 ,

as required. Thus we have shown that for every ε > 0

"\phi_X(\cdot; \sigma) = \lim_{\sigma \to \infty} P_X^\epsilon(\cdot; \sigma)"    (A.68)

in the sense of Equation A.64, which means that in the limit of large σ, P_X^ε(·; σ) and φ_X(·; σ) become indistinguishable almost surely for every ε > 0,¹⁵

P_X^\epsilon(x; \sigma) - \phi_X(x; \sigma) \xrightarrow{\;a.s.\;} 0   as σ → ∞ .    (A.69)

To complete the proof, we show that for every ε > 0 and for every x,

\lim_{\sigma \to \infty} \sigma\left( P_X^\epsilon(x; \sigma) - P_X(x; \sigma) \right) = 0 .    (A.70)

We have (recall that z = z(x))

\lim_{\sigma \to \infty} \sigma\left( P_X^\epsilon(x; \sigma) - P_X(x; \sigma) \right) = \lim_{\sigma \to \infty}\left( \sigma P_X^\epsilon(x; \sigma) - \sigma P_X(x; \sigma) \right)
  = \lim_{\sigma \to \infty}\left( \phi_0(z)\,\frac{g_\epsilon(z; \sigma)}{\Delta_\epsilon(\sigma)} - \phi_0(z)\,\frac{g(z; \sigma)}{\Delta(\sigma)} \right)
  = \phi_0(z)\,\lim_{\sigma \to \infty}\left( \frac{g_\epsilon(z; \sigma)}{\Delta_\epsilon(\sigma)} - \frac{g(z; \sigma)}{\Delta(\sigma)} \right)
  = \phi_0(z)\left( \lim_{\sigma \to \infty} \frac{g_\epsilon(z; \sigma)}{\Delta_\epsilon(\sigma)} - \lim_{\sigma \to \infty} \frac{g(z; \sigma)}{\Delta(\sigma)} \right)
  = \phi_0(z)\left( \frac{\lim_{\sigma \to \infty} g_\epsilon(z; \sigma)}{\lim_{\sigma \to \infty} \Delta_\epsilon(\sigma)} - \frac{\lim_{\sigma \to \infty} g(z; \sigma)}{\lim_{\sigma \to \infty} \Delta(\sigma)} \right)
  = \phi_0(z)\left( \frac{\lim_{\sigma \to \infty} g_\epsilon(z; \sigma)}{\Delta(\sigma)} - \frac{\lim_{\sigma \to \infty} g(z; \sigma)}{\Delta(\sigma)} \right)
  = \frac{\phi_0(z)}{\Delta(\sigma)}\left( \lim_{\sigma \to \infty} g_\epsilon(z; \sigma) - \lim_{\sigma \to \infty} g(z; \sigma) \right)
  = \frac{\phi_0(z)}{\Delta(\sigma)}\, \lim_{\sigma \to \infty}\left( g_\epsilon(z; \sigma) - g(z; \sigma) \right)
  = \frac{\phi_0(z)}{\Delta(\sigma)} \cdot 0 = 0 ,

which is true because

g_\epsilon(z; \sigma) - g(z; \sigma) = \begin{cases} \epsilon & z > \sigma \\ 0 & -\frac{s}{\bar{s}}\sigma \le z \le \sigma \\ \epsilon & z < -\frac{s}{\bar{s}}\sigma \end{cases} .

[Footnote 15: Convergence is almost surely (a.s.) with respect to the Gaussian density φ_X(x; σ). Since the Gaussian measure and Lebesgue measure are mutually absolutely continuous, this is equivalent to almost everywhere (a.e.) convergence with respect to the standard Lebesgue measure on the real line. Thus the convergence holds a.e., that is, everywhere except on a set of Lebesgue measure zero.]

Thus we have shown that pointwise in x (i.e., everywhere) we have

P_X^\epsilon(x; \sigma) - P_X(x; \sigma) \to 0   as σ → ∞ ,    (A.71)

for every ε > 0.

The final step in the proof follows from the fact that for large enough σ (and regardless of the value of ε) the PDFs P_X^ε(x; σ) and P_X(x; σ) become indistinguishable (equal in value pointwise in x, and hence equal almost everywhere), and since in the same limit P_X^ε(·; σ) and φ_X(·; σ) are equal almost everywhere, it must be the case that P_X(·; σ) and φ_X(·; σ) are equal almost surely. This follows from the almost sure limits

\lim_{\sigma \to \infty}\left( P_X(x; \sigma) - \phi_X(x; \sigma) \right) = \lim_{\sigma \to \infty}\left[ \left( P_X(x; \sigma) - P_X^\epsilon(x; \sigma) \right) + \left( P_X^\epsilon(x; \sigma) - \phi_X(x; \sigma) \right) \right]
  = \lim_{\sigma \to \infty}\left( P_X(x; \sigma) - P_X^\epsilon(x; \sigma) \right) + \lim_{\sigma \to \infty}\left( P_X^\epsilon(x; \sigma) - \phi_X(x; \sigma) \right)
  = 0 + 0 = 0 ,

which are a consequence of Equations A.69 and A.71. Thus it has been shown that

P_X(x; \sigma) - \phi_X(x; \sigma) \xrightarrow{\;a.e.\;} 0   as σ = \sqrt{u\bar{s}} → ∞ ,    (A.72)

showing that the Stirling PDF converges to a Gaussian distribution almost everywhere on the real line.

References

[1] X. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka, “Write amplification analysis in flash-based solid state drives,” in Proceedings of SYSTEM 2009: The Israeli Experimental Systems Conference, ACM, 2009.

[2] X. Hu and R. Haas, “The fundamental limit of flash random write perfor- mance: Understanding, analysis and performance modelling,” IBM Research Report, vol. RZ 3771, Mar 2010.

[3] A. Jagmohan, M. Franceschini, and L. Lastras, “Write amplification reduc- tion in nand flash through multi-write coding,” in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1–6, IEEE, 2010.

[4] H. Kim, N. Agrawal, and C. Ungureanu, “Revisiting storage for smartphones,” in The 10th USENIX Conference on File and Storage Technologies (FAST), 2012.

[5] E. Walker, W. Brisken, and J. Romney, “To lease or not to lease from storage clouds,” Computer, vol. 43, no. 4, pp. 44–50, 2010.

[6] K. Kant, “Data center evolution: A tutorial on state of the art, issues, and challenges,” Computer Networks, vol. 53, no. 17, pp. 2939–2965, 2009.

[7] D. Wang, A. Sivasubramaniam, and B. Urgaonkar, "A case for heterogeneous flash," Tech. Rep. CSE-11-015, Pennsylvania State University, 2011.

[8] L. Grupp, A. Caulfield, J. Coburn, S. Swanson, E. Yaakobi, P. Siegel, and J. Wolf, “Characterizing flash memory: anomalies, observations, and applica- tions,” in Proceedings of the 42nd Annual IEEE/ACM International Sympo- sium on Microarchitecture, pp. 24–33, ACM, 2009.

[9] D. Roberts, T. Kgil, and T. Mudge, “Integrating NAND flash devices onto servers,” Commun. ACM, vol. 52, no. 4, pp. 98–103, 2009.

[10] T. Chung, D. Park, S. Park, D. Lee, S. Lee, and H. Song, “A survey of flash translation layer,” Journal of Systems Architecture, vol. 55, no. 5-6, pp. 332– 343, 2009.

104 105

[11] A. Gupta, Y. Kim, and B. Urgaonkar, “DFTL: a flash translation layer em- ploying demand-based selective caching of page-level address mappings,” in Proceedings of the 14th international conference on Architectural support for programming languages and operating systems, pp. 229–240, ACM, 2009.

[12] T. Chung, D. Park, S. Park, D. Lee, S. Lee, and H. Song, “System software for flash memory: a survey,” Embedded and Ubiquitous Computing, pp. 394–404, 2006.

[13] J. Ousterhout and F. Douglis, “Beating the I/O bottleneck: a case for log- structured file systems,” ACM SIGOPS Operating Systems Review, vol. 23, no. 1, pp. 11–28, 1989.

[14] R. Agarwal and M. Marrow, “A closed-form expression for Write Amplification in NAND Flash,” in GLOBECOM Workshops (GC Wkshps), 2010 IEEE, pp. 1846–1850, IEEE, 2010.

[15] L. Xiang and B. Kurkoski, “An Improved Analytical Expression for Write Amplification in NAND Flash,” in International Conference on Computing, Networking, and Communications, 2012. Also available at Arxiv preprint arXiv:1110.4245.

[16] M. Rosenblum and J. Ousterhout, “The design and implementation of a log- structured file system,” ACM Transactions on Computer Systems (TOCS), vol. 10, no. 1, pp. 26–52, 1992.

[17] J. Menon and L. Stockmeyer, “An age-threshold algorithm for garbage collec- tion in log-structured arrays and file systems,” High Performance Computing Syst. Applicat., pp. 119–132, 1998.

[18] L. Chang, T. Kuo, and S. Lo, “Real-time garbage collection for flash-memory storage systems of real-time embedded systems,” ACM Transactions on Em- bedded Computing Systems (TECS), vol. 3, no. 4, pp. 837–863, 2004.

[19] Information Technology - ATA/ATAPI Command Set - 2 (ACS- 2), INCITS Working Draft T13/2015-D Rev. 7 ed., June 2011. http://www.t13.org/documents/UploadedDocuments/docs2011/ d2015r7-ATAATAPI Command Set - 2 ACS-2.pdf.

[20] “TRIM.” http://en.wikipedia.org/wiki/TRIM, September 2011. See refer- ences on website for specific operating system TRIM support.

[21] N. Agrawal, V. Prabhakaran, T. Wobber, J. Davis, M. Manasse, and R. Pan- igrahy, “Design tradeoffs for SSD performance,” in USENIX 2008 Annual Technical Conference on Annual Technical Conference, pp. 57–70, USENIX Association, 2008. 106

[22] Information Technology - SCSI Object-Based Storage Device Commands- 3 (OSD-3), INCITS Working Draft T10/2128-D Rev. 0 ed., March 2009. http://www.t10.org/cgi-bin/ac.pl?t=f&f=osd3r00.pdf.

[23] Y. Kang, J. Yang, and E. Miller, “Object-based SCM: An Efficient Inter- face for Storage Class Memories,” in Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th Symposium on, pp. 1–12, IEEE, 2011.

[24] A. Rajimwale, V. Prabhakaran, and J. Davis, “Block management in solid- state devices,” in Proceedings of the 2009 conference on USENIX Annual technical conference, pp. 21–21, USENIX Association, 2009.

[25] P. Hoel, S. Port, and C. Stone, Introduction to probability theory, vol. 12. Houghton Mifflin Boston, 1971.

[26] W. Feller, Introduction to Probability Theory and its Applications. New York: Wiley, 3rd ed., 1968.

[27] F. Reif, Fundamentals of Statistical and Thermal Physics. McGraw-Hill, 1965.

[28] T. Cover and J. Thomas, Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing, Wiley-Interscience, 2006.

[29] R. Courant and D. Hilbert, Methods of Mathematical Physics - Vol. I. New York: Interscience, 1953.

[30] R. Corless, G. Gonnet, D. Hare, D. Jeffrey, and D. Knuth, "On the Lambert W function," Advances in Computational Mathematics, vol. 5, no. 1, pp. 329–359, 1996.

[31] STEC, “Benchmarking Enterprise SSDs.” http://www.stec-inc.com/ downloads/whitepapers/Benchmarking Enterprise SSDs.pdf, June 2009.

[32] Y. Kang, J. Yang, and E. Miller, “Efficient storage management for object- based flash memory,” in 2010 18th Annual IEEE/ACM International Sympo- sium on Modeling, Analysis and Simulation of Computer and Telecommuni- cation Systems, pp. 407–409, IEEE, 2010.

[33] D. Feng and L. Qin, “Adaptive object placement in object-based storage sys- tems with minimal blocking probability,” in Advanced Information Networking and Applications, 2006. AINA 2006. 20th International Conference on, vol. 1, pp. 611–616, IEEE, 2006.

[34] J. Boyd, “The devil’s invention: Asymptotic, superasymptotic and hyper- asymptotic series,” Acta Applicandae Mathematicae, vol. 56, no. 1, pp. 1–98, 1999. 107

[35] “Normal Distribution.” http://en.wikipedia.org/wiki/Normal distribution, May 2012.

[36] “Big O Notation.” http://en.wikipedia.org/wiki/Big O notation, May 2012.