SSDs – Write Amplification, Trim and GC

OCZ Technology Group

Some Key Flash Characteristics

Data in flash is stored as trapped charge. Erasing and programming flash wears out the device. The most commonly used flash can be programmed and erased between 5,000 and 10,000 times.

Flash is organized in blocks, with multiple pages per block. Pages range in size from 2K to 8K, with 128 to 256 pages per block. Flash can only be erased a whole block at a time and can only be programmed a full page at a time, and the pages in a block must be programmed in order, from page 0 to the last page in the block.

Why do we care about this? Because of the operational issues that it causes. We can read any page we want, but as you can see, writing and erasing are much more constrained. Operating system file structures are built around hard disk drives (HDD), which allow any block to be written or overwritten at any time without impacting any of the data around it.

Since flash cannot be overwritten in place, the drive has to keep track of where the data actually is. There can't be a static relationship between the data location and the logical block address (LBA) that the operating system uses to communicate with the storage device. If the operating system wants to overwrite an existing LBA, an SSD will mark the current page associated with the LBA as invalid, write the new data to a different location, and map the LBA to the new location so the data can be found for a read. As a drive is used, a block will have more and more pages that no longer hold valid data. The data stored in these pages is considered stale or out of date.
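The remapping described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular drive's firmware works: the class name `ToyFlash`, its fields, and the tiny block size are all hypothetical and chosen only to make the behavior visible.

```python
# Toy model of the LBA-to-page remapping described above.
# All names here (ToyFlash, PAGES_PER_BLOCK, ...) are illustrative assumptions.

PAGES_PER_BLOCK = 4  # kept tiny for readability; real flash has 128 to 256


class ToyFlash:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.lba_to_page = {}   # LBA -> (block, page) holding the current data
        self.valid = set()      # physical pages holding valid data
        self.stale = set()      # physical pages holding out-of-date data
        self.next_block = 0     # block currently being programmed
        self.next_page = 0      # pages are programmed in order, starting at 0

    def write(self, lba):
        """Write (or overwrite) one LBA and return the physical page used."""
        if self.next_block >= self.num_blocks:
            raise RuntimeError("no free pages left; garbage collection needed")
        old = self.lba_to_page.get(lba)
        if old is not None:
            # Flash cannot be overwritten in place: the old page goes stale.
            self.valid.discard(old)
            self.stale.add(old)
        loc = (self.next_block, self.next_page)
        self.lba_to_page[lba] = loc
        self.valid.add(loc)
        # Advance to the next page, moving to a new block when this one fills.
        self.next_page += 1
        if self.next_page == PAGES_PER_BLOCK:
            self.next_block += 1
            self.next_page = 0
        return loc


flash = ToyFlash(num_blocks=2)
first = flash.write(lba=7)    # lands in block 0, page 0
flash.write(lba=8)            # block 0, page 1
second = flash.write(lba=7)   # overwrite: new data goes to block 0, page 2
print(first, second, sorted(flash.stale))
```

After the overwrite, LBA 7 points at the new page and the original page is left behind as stale data that only a block erase can reclaim.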

Write Amplification

What is write amplification and why should you care about it? Write amplification is a measure of the number of bytes actually written to flash when the host writes a certain number of bytes. For example, if you write a 4K file, on average, the drive may write 40K bytes worth of data. Why does this happen? This comes back to the flash characteristics. At some point, you will need to combine data from several partially used blocks to free up pages for new data to be written.

Write amplification has an impact on the life of a drive. One effective way to measure drive lifetime is to measure how many bytes can be written to the drive over its lifetime. With PECycles defined as the rated number of program / erase cycles for the flash, lifetime writes can be estimated as

$$\text{LifetimeWrites} = \frac{\text{DriveSize} \times \text{PECycles}}{\text{WriteAmplification}}$$

If you have a 30GB drive, a write amplification of 10 and PECycles of 10K, the lifetime writes will be 30TB, or 30 terabytes. A user writing 5GB per day (about one DVD worth of data) would have a lifetime of 6,000 days or about 16 years, well beyond the lifetime of most electronic equipment of this type.

Usage patterns have a large impact on write amplification. For example, 100% random 4K writes, the worst case, will have a write amplification of around 17 for OCZ drives. 100% serial writes from low LBA to high LBA will have a write amplification of 1. Real usage is somewhere in between.

You can also reduce write amplification by partitioning or allocating less than the total size of the drive. Doing so improves the total lifetime writes, but your total storage space is lower. If you only partition 50% of the drive, the total lifetime writes increase by 3x for random 4K writes.

There are other factors that will impact write amplification and those will be explained in the next sections.
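The lifetime estimate for the 30GB drive can be checked with a short calculation. This is a minimal sketch: the function name `lifetime_writes_gb` is made up for illustration, and it assumes decimal units (1TB = 1,000GB) and a 365-day year.

```python
def lifetime_writes_gb(drive_size_gb, pe_cycles, write_amplification):
    """Estimated total host writes over the drive's life, in GB."""
    return drive_size_gb * pe_cycles / write_amplification


# Figures from the example in the text: 30GB drive, 10K cycles, WA of 10.
total_gb = lifetime_writes_gb(30, 10_000, 10)
days = total_gb / 5            # user writing 5GB per day
years = days / 365

print(f"{total_gb / 1000:.0f} TB, {days:.0f} days, {years:.1f} years")
```

The same function shows the effect of usage patterns: plugging in a write amplification of 17 (the random 4K worst case quoted in the text) instead of 10 lowers the estimate to roughly 17.6TB.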

Trim

To understand what trim is and why it is important, you first need to understand a little about how a computer deletes files.

When you delete a file on your computer, the operating system doesn't actually delete it. Instead, the operating system marks the area as free, to be re-used when needed. This means the data associated with this area is no longer considered valid. There is just one problem with this: the drive doesn't know that this has happened. An HDD doesn't really care, since the operating system can just write over the data whenever it wants. However, this creates a problem for the SSD.

When it is time to consolidate blocks to free up space, the SSD must copy all of the data it considers valid to a new block before it can erase the current block. Without trim, the SSD doesn't know a page is invalid unless the LBA associated with it has been rewritten.

Trim is simply the function of the operating system telling the drive that a page is no longer valid. This helps to reduce write amplification because the drive doesn't copy stale pages. There will also be fewer pages to copy, which will speed up the process of freeing up partially valid blocks.

Garbage Collection or GC

Garbage collection is the process of freeing up partially valid blocks to make room for more data. This must be done with all drives at some point after the equivalent of their capacity has been written to the drive. When new data needs to be written, there will be a consolidation process that frees up space. When this begins to occur, it slows down writes, because the drive has to wait for space to clear before data can be written. This is the typical cause of write performance degradation as a drive is used.
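The consolidation step, and the way trim reduces its cost, can be illustrated with a small sketch. Everything here is a simplifying assumption rather than a description of real firmware: the function names are hypothetical, the block holds only eight pages, and the amplification figure assumes every page freed in the block is then filled with new host data.

```python
def pages_to_copy(block_lbas, stale_lbas, trimmed_lbas=frozenset()):
    """LBAs the drive must copy elsewhere before it can erase the block.

    block_lbas:   the LBA stored in each page of the block
    stale_lbas:   LBAs rewritten elsewhere (the drive already knows these)
    trimmed_lbas: LBAs the operating system reported as deleted via trim
    """
    return [lba for lba in block_lbas
            if lba not in stale_lbas and lba not in trimmed_lbas]


def block_write_amplification(pages_per_block, copied):
    """Flash pages written per host page when the freed pages are refilled.

    Simplified model: copied pages plus new host pages fill one block.
    """
    return pages_per_block / (pages_per_block - copied)


block = [10, 11, 12, 13, 14, 15, 16, 17]   # one 8-page block (toy size)
rewritten = {10, 11}                       # overwritten by the host
deleted_files = {12, 13, 14}               # freed by the OS, drive unaware

without_trim = pages_to_copy(block, rewritten)
with_trim = pages_to_copy(block, rewritten, deleted_files)

print(len(without_trim), block_write_amplification(8, len(without_trim)))
print(len(with_trim), block_write_amplification(8, len(with_trim)))
```

In this toy case the drive copies six pages without trim and three with it, so both the copy time and the extra flash writes drop once the drive knows which pages belong to deleted files.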

Idle Time GC

One way to improve performance is to perform garbage collection during idle periods, freeing up space before it is needed. The difference between this and real time GC is that this occurs ahead of time, so the drive does not need to wait for the GC to occur before it can write. For OCZ drives, it only runs until a certain amount of space is available and then it stops.

There has been some concern that idle time garbage collection will increase write amplification. This is due to the possibility of some of the consolidated pages going stale before GC would have occurred at run time. This is a legitimate concern. Measurements have been made to assess the potential impact: idle time garbage collection increases the write amplification by about 1%. This means that if the write amplification is 10, it will go to 10.1 with idle time garbage collection. This is a small price to pay for consistent performance over the drive life.
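To put the 1% figure in perspective, the short calculation below applies it to the lifetime estimate used earlier in this paper. It is an illustration only: the 30GB drive and 10K PECycles are reused from that earlier example, the function name is hypothetical, and carrying the 1% increase through to lifetime writes is an extrapolation, not a measured result.

```python
def lifetime_writes_gb(drive_size_gb, pe_cycles, write_amplification):
    """Estimated total host writes over the drive's life, in GB."""
    return drive_size_gb * pe_cycles / write_amplification


base_wa = 10
idle_gc_wa = base_wa * 1.01    # about 1% higher with idle time GC

base_life = lifetime_writes_gb(30, 10_000, base_wa)
idle_life = lifetime_writes_gb(30, 10_000, idle_gc_wa)

print(f"write amplification: {base_wa} -> {idle_gc_wa:.1f}")
print(f"lifetime writes: {base_life:.0f} GB -> {idle_life:.0f} GB")
```

Under these assumptions the estimated lifetime writes fall from 30,000GB to about 29,700GB, a difference of roughly 1%.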

Putting it all together

Write amplification is an inherent part of how the drive functions. It can be helped by trim and, if this is a concern, by lowering the amount of the drive that is allocated for use (Over Provisioning). If the drive has very heavy writes, this will optimize the life of the drive.

Trim will help with performance by limiting the pages that are copied during garbage collection. However, there will still be a performance penalty when garbage collection is done in real time, just when it is needed. The penalty will not be as high as it is for a drive not using trim.

The best sustained performance is with a drive that utilizes both trim and idle time garbage collection. It is important to remember that garbage collection will always have to occur; the issue is when it occurs. By performing limited garbage collection during idle time and enabling trim, performance will be maintained as the drive is used and users will have a more consistent performance experience.

Test Data

We took a 30GB drive and tested performance on it in a clean state. We tested using ATTO. This showed an average performance of 225MB/s read and 150MB/s write.

We then used IOMeter to thoroughly exercise the drive with a mixture of serial and random writes and tested the performance again at several time intervals.

With firmware that doesn't support idle time GC, the performance dropped to about 154MB/s read and 75MB/s write. The performance doesn't improve from these levels.

With the background GC + trim firmware, the performance dropped to 166MB/s read and 106MB/s write and, with a few hours of time, recovered to 193MB/s read and 127MB/s write. Since the drives were being tested about every hour, this set back the time to restore.

We then performed a similar test with firmware that supports trim and not idle time GC. We gathered initial, clean state performance data. The performance was running at about 150MB/s write and 225MB/s read. Next we added a large amount of data and ran IOMeter to thoroughly randomize the rest of the drive. The performance at this point was about 80MB/s write and 165MB/s read. The added files were then deleted and the trash bin emptied. It is the process of emptying the trash bin that initiates trim. After trim, the performance returned to the original levels.

These test conditions are artificial conditions that push the drives to the limit. Normal usage typically allows sufficient time to keep the performance at an optimal level. Not all trims will be a large block of files or large files. Typical trims are a range of sizes and locations, often not in large blocks, but you can expect to see performance improvements and improved drive life with trim in active use. The best results will come from the use of both trim and idle time GC.
