Novel Methods for Improving Performance and Reliability of Flash-Based Solid State Storage System

A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in the Department of Electrical Engineering and Science of the College of Engineering and Applied Science by

Jiayang Guo
B.S., University of Cincinnati
February 2018

Committee Chair: Yiming Hu, Ph.D.


Abstract

Though SSDs outperform traditional magnetic-based storage devices, there is still potential for further performance improvements. In existing operating systems, the software I/O stack is designed around the working mechanisms of traditional magnetic hard drives. As a result, it has been shown that the existing I/O software layer can cause additional operational overheads for flash-based SSDs [1]. To address this problem, we explore the factors that cause variation in SSD performance. Based on our observations, we propose an SSD-based I/O scheduler called SBIOS that fully exploits internal parallelism to improve performance. It dispatches read requests to different blocks to make full use of SSD internal parallelism. For write requests, it attempts to dispatch them to the same block to minimize the number of block-crossing requests. Moreover, SBIOS introduces the concept of batch processing and separates read and write requests to avoid read-write interference.

In addition, SSDs face reliability challenges due to the physical properties of flash memory. To address the reliability issue of SSDs, we propose a parallelism and garbage collection aware I/O scheduler called PGIS that identifies hot data based on trace characteristics to exploit the channel-level internal parallelism of flash-based storage systems. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot data identification mechanism to reduce the garbage collection overhead. By dispatching hot read data to different channels, the channel-level internal parallelism is fully exploited. By dispatching hot write data to the same physical block, the garbage collection overhead is alleviated. The experimental results show that these methods significantly improve the reliability and performance of the SSD. In this research, the total number of erase operations is used to measure the reliability of the SSD.

Meanwhile, with the rapid development of non-volatile memory technology, and thanks to the high read/write speed, high endurance, and in-place updating capability of PCM, many hybrid storage structures that use PCM and SSD at the same storage level have been proposed to improve the performance of SSDs. However, hybrid storage systems pose a new challenge to cache management algorithms. Existing DRAM-based cache management schemes are only optimized to reduce the miss rate. On a miss, the cache needs to access the PCM or the SSD. However, there are major differences between the access times of the two technologies. As a result, in such a hybrid system, a higher hit rate does not necessarily translate to higher performance. To address this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) that takes the asymmetry of cache miss penalties on PCM and SSD into consideration. Our MPA scheme not only uses access locality to reduce the miss rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty overhead. Our experimental results show that the MPA scheme can improve hybrid storage system performance by up to 30.5% compared with other cache management schemes.


Copyright


Acknowledgements

The Ph.D. study is a long journey that is both interesting and challenging. On this journey, I met new challenges and overcame them every day. Through this process, I came to know my strengths and weaknesses, and used that knowledge to improve myself. Throughout this adventure, many brilliant, supportive people have been with me.

First and foremost, I want to thank my advisor Dr. Yiming Hu. It has been an honor to be his Ph.D. student and to work with him. From my experience, he cares so much about his students that he always comes through for me when I need help. Whenever I felt confused in my research, he could provide insightful advice to improve it in a few words. I cannot enumerate how many things I have learnt from him, since there are too many. To me, the most important include working in a serious and responsible manner, working hard, and paying attention to detail. I believe I will benefit from these qualities for the rest of my life. I really appreciate all his contributions of time, effort, ideas, trust, and funding to make my Ph.D. experience productive and stimulating.

I would like to thank all the professors on my committee for their support and guidance. Dr. Jing Xiang gave me tremendous help, especially in the first year of my Ph.D. life. He cares about me very much and always shares his personal experience on research and life. I want to thank Dr. Wen-Ben Jone for his time, effort, and encouragement; from him I learnt to be more confident in my research and work. I also want to thank Dr. Raj Bhatnagar for his guidance and research rigor, as well as for pointing out my weaknesses so that I could improve. Finally, I want to thank Dr. Carla Purdy for her approval of my thesis and her effort in improving the quality of my work. From her I learnt the attitude of being serious and responsible in research and work.

My gratitude also goes to my friends and colleagues, Wentao Wang, Xining Li, Xiaobang Liu, Suyuan Chen, Minyang Wang, Vadambacheri Manian Karthik and his wife Balasubramanian Sanchayeni, and many others who have helped and supported me in every aspect.

Lastly, I would like to thank my mother for her love, encouragement and funding. I am grateful to my mother because she has sacrificed a lot to raise me, educate me, and fund me in achieving my goal. My wholehearted gratitude is beyond any words. Thank you!


Table of Contents

1. CHAPTER 1 INTRODUCTION ...... 1

1.1 HDDS VS SSDS ...... 1
1.2 PROBLEM DESCRIPTION ...... 6
1.3 CONTRIBUTION ...... 8
1.4 ORGANIZATION OF DISSERTATION ...... 12

2. CHAPTER 2 BACKGROUND ...... 15

2.1 NON-VOLATILE MEMORY ...... 15
2.2 INTERNAL STRUCTURE OF SSD ...... 18
2.3 FLASH TRANSLATION LAYER ...... 21
2.4 HYBRID SSD STRUCTURE ...... 26

3. CHAPTER 3 A SSD-BASED I/O SCHEDULER ...... 28

3.1 ANALYZING THE CHARACTERISTICS OF SSD ...... 28
3.2 BACKGROUND ...... 29
3.2.1 Request size ...... 29
3.2.2 Read-write interference ...... 31
3.2.3 Internal parallelism ...... 33
3.3 SYSTEM DESIGN AND IMPLEMENTATION ...... 35
3.3.1 System overview ...... 35
3.3.2 Dispatching method ...... 36
3.3.3 Algorithm process ...... 38
3.4 EXPERIMENTAL EVALUATION AND ANALYSIS ...... 39
3.4.1 Experiment setup ...... 40
3.4.2 Performance results and analysis ...... 40
3.5 RELATED WORK ...... 42
3.6 SUMMARY ...... 44

4. CHAPTER 4 A PARALLELISM AND GARBAGE COLLECTION AWARE I/O SCHEDULER ...... 45

4.1 INTRODUCTION ...... 45
4.2 MOTIVATION AND BACKGROUND ...... 47
4.2.1 The main bottleneck in the SSDs ...... 47
4.2.2 Internal parallelism ...... 48
4.2.3 Trace access characteristics ...... 50
4.2.4 Motivation ...... 51
4.3 SYSTEM DESIGN AND IMPLEMENTATION ...... 52
4.3.1 System overview ...... 52
4.3.2 Trace access pattern analysis ...... 55
4.3.3 Hot data identification mechanism ...... 56
4.3.4 Dispatching method ...... 58
4.4 EXPERIMENTAL EVALUATION ...... 61
4.4.1 Experiment setup ...... 62
4.4.2 Performance analysis ...... 63
4.4.3 Improved garbage collection efficiency ...... 66
4.4.4 Sensitivity study on channel number ...... 69
4.4.5 Sensitivity study on over provision ratio ...... 70
4.5 RELATED WORK ...... 72
4.6 SUMMARY ...... 74

5. CHAPTER 5 MISS PENALTY AWARE CACHE MANAGEMENT SCHEME .... 75

5.1 THE OPPORTUNITY FOR PCM ...... 75
5.2 BACKGROUND AND MOTIVATION ...... 77
5.2.1 PCM vs SSD ...... 78
5.2.2 Motivation ...... 79
5.3 THE DESIGN DETAIL OF MPA ...... 82
5.3.1 System overview ...... 83
5.3.2 The balance between Miss Penalty and Miss Rate ...... 84
5.3.3 Cache management algorithm ...... 87
5.4 EXPERIMENTAL EVALUATION ...... 88
5.4.1 Experiment setup ...... 88
5.4.2 Performance analysis ...... 90
5.4.3 Sensitivity study on PCM requests-to-SSD requests ratio ...... 94
5.4.4 Sensitivity study on FIFO-to-LRU ratio ...... 95
5.5 SUMMARY ...... 97

6. CHAPTER 6 CONCLUSION AND FUTURE WORK ...... 98

6.1 CONCLUSION ...... 98
6.2 FUTURE WORK ...... 100

7. REFERENCES ...... 102


List of Tables

Table 1.1: Comparison of existing storage technologies [7] ...... 2
Table 3.1: Workload attributes ...... 40
Table 4.1: Workload access pattern power laws ...... 55
Table 4.2: Configuration Parameters of the SSD Simulator ...... 61
Table 4.3: The characteristics of the traces ...... 64
Table 5.1: Comparison among SSD and PCM ...... 79
Table 5.2: Configuration Parameters of the hybrid SSD Simulator ...... 89
Table 5.3: The characteristics of the traces ...... 90


List of Figures

Figure 1.1: An illustration of SSD internal architecture, adapted from [8] ...... 4
Figure 2.1: PCM cell structure ...... 16
Figure 2.2: Flash structure ...... 17
Figure 2.3: Basic block structure of flash memory, adapted from [29] ...... 19
Figure 2.4: The major components of flash memory ...... 20
Figure 2.5: Block-level FTL management scheme, adapted from [37] ...... 22
Figure 2.6: Three cases when log block and data block are collected during garbage collection ...... 24
Figure 3.1: The response time comparison between SSD and HD under different request size (4KB-1MB) ...... 30
Figure 3.2: The response time of random reads under different size with writes in concurrent execution on Intel X25-E ...... 32
Figure 3.3: IOPS for the 4K-1M random read on Intel X25-E ...... 34
Figure 3.4: The overview of SBIOS ...... 36
Figure 3.5: Flow chart of algorithm process ...... 38
Figure 3.6: Benchmark performance comparison under different I/O schedulers ...... 42
Figure 4.1: Channel level internal parallelism in the SSDs ...... 48
Figure 4.2: The hot data ratio of different traces ...... 50
Figure 4.3: System overview of PGIS. Here RFQ represents read frequency queue and WFQ represents write frequency queue ...... 53
Figure 4.4: Distribution of read and write access frequency in Fin2 trace ...... 54
Figure 4.5: The basic data structure of the frequency queue ...... 57
Figure 4.6: Illustration of additional garbage collection overhead due to traditional channel level internal parallelism dispatching method ...... 60
Figure 4.7: Benchmark performance comparison under different I/O schedulers ...... 65
Figure 4.8: Comparison for the overhead of garbage collection under different I/O schedulers ...... 66
Figure 4.9: Standard response time speedup on seven benchmarks with respect to different channel number ...... 68
Figure 4.10: Standard response time speedup on seven benchmarks with respect to different over provision ratio ...... 71
Figure 5.1: The overview of MPA on the I/O path ...... 82
Figure 5.2: The system overview of MPA scheme ...... 85
Figure 5.3: The normalized response times under the realistic trace-driven evaluations ...... 91
Figure 5.4: The overall cache hit rate of the different cache schemes ...... 92
Figure 5.5: The SSD cache hit rate of the different cache schemes ...... 93
Figure 5.6: Average response time speedup of MPA and conventional cache management schemes under different PCM requests-to-SSD requests ratio ...... 94
Figure 5.7: Standard response time speedup of MPA on five benchmarks with respect to different FIFO-to-LRU ratio ...... 96


1. Chapter 1 Introduction

1.1 HDDs vs SSDs

Compared with SSD technology, hard disk drive (HDD) technology is relatively old. Before SSD technology was ready for business applications, HDDs were the main devices in storage systems. However, due to their mechanical characteristics, HDDs have become a bottleneck that degrades the performance of the whole system. It is well known that HDDs are good at dealing with large files that are stored in contiguous blocks, because the moving head can start and finish a read in one continuous motion. Existing kernel I/O schedulers, such as the anticipatory I/O scheduler [2], are designed to exploit this feature. When requests arrive, the anticipatory I/O scheduler pauses for a short time after dispatching a read request, anticipating that the next read request will target a nearby location on the same disk track. Even though the anticipation may sometimes guess incorrectly, as long as it is moderately successful it saves numerous expensive seek operations [3]. However, as time goes on and HDDs start to fill up, large files can become scattered around the disk. In such a case, the seek latency caused by moving the mechanical arms to different disk cylinders increases significantly, and the performance of the whole storage system is degraded.

Besides the fragmentation issue, HDDs also have a reliability issue [4, 5], which is mainly caused by the wear and tear of mechanical components. In order to prevent data loss, RAID techniques [5, 6] have been proposed to improve data reliability by incorporating data redundancy. For SSDs, the reliability issue is totally different. SSDs have no mechanical parts, which means they are more likely to keep data safe if the system is shaken while operating. Moreover, SSDs do not have to expend electricity spinning up a platter, so they save more energy than HDDs. Due to the superior performance and lower power consumption delivered by SSDs, as depicted in Table 1.1, SSDs have been widely adopted in storage systems.

Table 1.1: Comparison of existing storage technologies [7].

                        SRAM      DRAM       HDD           SSD
Read Latency            <10 ns    10-60 ns   8.5 ms        25 us
Write Latency           <10 ns    10-60 ns   9.5 ms        200 us
Energy per bit access   >1 pJ     2 pJ       100-1000 mJ   10 nJ
Endurance               >10^15    >10^15     >10^15        10^4
Non-volatility          No        No         Yes           Yes

Price is the main factor that restricts the widespread use of SSDs. Compared with HDDs, SSDs are more expensive in terms of dollars per gigabyte. According to recent statistics from Newegg Business, HDDs cost about 4 to 5 cents per gigabyte, while SSDs cost about 25 cents per gigabyte. Such a five-fold difference in price can push the whole storage system over budget, especially for multimedia users, who require more capacity, with drives over 1TB common in high-end systems. To address this issue, multi-level cell (MLC) technology has been developed, which largely reduces the cost of SSDs. Therefore, SSDs are becoming an attractive replacement for conventional hard disks. Many manufacturers currently offer an SSD as an optional upgrade for the HDD on high-end laptops. An SSD-equipped laptop will boot in less than a minute, and often in just seconds; such extra speed makes users comfortable.

With the rapid development of SSD technology, the last decade witnessed a sharp decline in the retail price of flash-based SSDs. Combined with their superior performance, SSDs are gradually replacing traditional hard drives in the era of high performance computing. Naturally, the performance and reliability issues of SSDs become increasingly prominent.

As SSDs become increasingly popular, it becomes necessary to examine the differences between flash-based solid state disks and traditional mechanical hard disks. For instance, while the read speeds of flash-based solid state disks are generally much faster than those of traditional mechanical hard disks, there is no comparable improvement for write operations. Since SSDs have no mechanical parts, they have no seek times in the traditional sense, but there are still many operational overheads that need to be taken into account [3]. By examining the internal architectures of SSDs, we can find that parallelism is available at different levels, and operations at each level can be parallelized. Figure 1.1 shows an example of a possible internal architecture for an experimental flash-based disk.

Figure 1.1: An illustration of SSD internal architecture, adapted from [8].

Figure 1.1 illustrates the internal architecture of most flash-based SSDs. A number of flash memory packages are connected to the flash channels. Each flash channel can be accessed independently and simultaneously. Meanwhile, the number of flash channels used to connect the packages and the SSD controller determines the degree of parallelism and hence the raw maximum throughput of the SSD device. Such a highly parallelized structure provides great potential for internal parallelism, which exists at several levels. These levels of parallelism are available in most SSD architectures and lead to greatly improved performance. In general, channel-level internal parallelism is the most commonly exploited form of internal parallelism in SSD research.
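To make the idea of channel-level parallelism concrete, the following minimal C sketch shows one simple way an FTL could stripe consecutive logical pages across channels so that independent requests land on different channels; the channel count and function names are illustrative assumptions, not details of any specific SSD discussed in this dissertation.

/* Minimal sketch (not code from this dissertation): stripe consecutive
 * logical page numbers across channels so that independent requests can
 * be served by different channels in parallel. */
#include <stdio.h>

#define NUM_CHANNELS 8  /* hypothetical channel count */

/* Map a logical page number to a channel in round-robin fashion. */
static unsigned int channel_of(unsigned long lpn)
{
    return (unsigned int)(lpn % NUM_CHANNELS);
}

int main(void)
{
    /* Eight consecutive logical pages land on eight different channels,
     * so eight small reads could proceed concurrently. */
    for (unsigned long lpn = 0; lpn < 8; lpn++)
        printf("LPN %lu -> channel %u\n", lpn, channel_of(lpn));
    return 0;
}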

Many layers in the I/O subsystem have been designed with explicit assumptions

made about the underlying physical storage [3]. Although designs can vary, most

assumptions are based on the most popular storage, which is a traditional magnetic disk

that rotates mechanical parts to access the data. Though SSDs are designed to work with

existing accessing mechanisms, the performance of SSDs might suffer under such

circumstances. In order to take full advantage of the potential of solid state storage

devices, it is desirable to reconsider some of the assumptions that have been made and

redesign the I/O subsystem for the flash-based storage system.

The garbage collection mechanism is tied to erase operations, and the basic unit of an erase operation is a block. Therefore, before a block is erased, the valid pages in that block must be moved to another empty location. Thus, an important consideration for improving erase efficiency is to minimize the operational latency by reducing the number of page migrations, so that I/O requests are not delayed. Selecting the block with the minimum number of valid pages for cleaning is one of the most commonly used cleaning policies. In the best case, if we can find blocks with no valid pages, there is no page migration overhead at all. This is where a hot data identification mechanism comes in. Specifically, separating hot and cold data helps reduce the number of page migrations: blocks holding hot pages are likely to be full of invalid pages in the near future, while blocks holding cold pages are likely to suffer a high page migration cost.
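As a rough illustration of the greedy cleaning policy just described, the sketch below picks the block with the fewest valid pages as the garbage collection victim; the data structure and its fields are assumptions made for this example, not the internals of any particular FTL.

#include <limits.h>

/* Illustrative block descriptor; a real FTL tracks much more state. */
struct flash_block {
    int valid_pages;   /* pages that must be migrated before erasing */
    int erase_count;   /* how many times this block has been erased  */
};

/* Return the index of the victim block, or -1 if there are no blocks.
 * A victim with zero valid pages incurs no migration overhead at all. */
int pick_victim(const struct flash_block *blocks, int nblocks)
{
    int victim = -1;
    int min_valid = INT_MAX;

    for (int i = 0; i < nblocks; i++) {
        if (blocks[i].valid_pages < min_valid) {
            min_valid = blocks[i].valid_pages;
            victim = i;
        }
    }
    return victim;
}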


The overall objective of this research is to develop novel methods for improving the performance and reliability of flash-based solid state storage systems. The topics we study include exploring the factors that influence SSD performance and reliability, and exploiting the potential of SSDs based on their unique structure.

1.2 Problem Description

In the last decade, with the rapid development of technology, emerging applications, including internet-scale online services and big data analysis software, require better storage I/O performance to support their work [9]. In particular, as flash-based SSDs become widely used in storage systems, their distinctive characteristics drive us to explore new possibilities in software technology.

The block I/O subsystem is the fruit of software optimization aimed at making efficient use of the underlying storage device [10, 11]. Previous block I/O subsystems were all designed around the characteristics of HDDs, since storage systems composed of HDDs could meet user requirements in the past. However, in recent years, data services deployed on the web, on servers, in mail systems, and in data centers require high throughput and low latency. To guarantee instant responsiveness, flash-based SSDs have been introduced as the new storage devices. Owing to their semiconductor nature, flash-based SSDs transfer data to and from the operating system without moving a disk head back and forth. The data size is the main factor that influences SSD performance in terms of response time. Previous optimizations in the block I/O subsystem mainly focus on reducing the seek overhead of HDDs; such optimizations may not achieve the peak performance of flash-based SSDs and may even decrease the performance of the whole storage system. To solve this issue, this thesis investigates the factors that influence SSD performance and proposes an SSD-based block I/O scheduler to achieve the peak performance of the SSD.

With the development of MLC technology, the cost of SSDs has declined significantly. The declining price and high performance make SSDs more and more popular. However, design is a tradeoff: while MLC technology reduces the cost of SSDs, it also brings a reliability problem. The lifetime of MLC NAND SSDs is usually about ten times shorter than that of SLC NAND SSDs, although the cost of MLC NAND SSDs is two to four times lower than that of SLC NAND SSDs. Meanwhile, due to the erase-before-write mechanism, in-place updates are not allowed in MLC NAND SSDs. Instead, the incoming data is written to a new clean location, and the old data is marked as invalid. As time goes on, when a block fills up with valid and invalid pages, it is reclaimed by an erase operation. However, each block can sustain only a limited number of erase operations; if this limit is exceeded, the block becomes unreliable. To address this reliability issue of SSDs, we introduce a hot data identification mechanism and propose a garbage collection aware I/O scheduler.

Meanwhile, with the rapid development of non-volatile memory technology, we have more options for improving the performance of SSDs. Phase change memory (PCM for short) is one of the most promising non-volatile memories; instead of using electrical charges to store information, it stores bit values in the physical state of a chalcogenide material (e.g., Ge2Sb2Te5, i.e., GST). Due to the high read/write speed, high endurance, and non-volatility of PCM, many hybrid storage structures have been proposed to improve the performance of SSDs. Some of these hybrid storage structures focus on exploiting PCM and SSD at the same storage level. While most of them try to reduce the write amplification of SSDs and to improve the data reliability of the whole storage system, none of these studies consider improving the performance of the whole storage system based on the read latency disparity between PCM and SSD. To fix this problem, we propose a Miss Penalty Aware cache management scheme (MPA for short) that takes the asymmetry of cache miss penalties on PCM and SSD into consideration.

1.3 Contribution

The contributions of this dissertation are threefold:

(1) As we all know, many system designs, especially for I/O subsystems, are based on the characteristics of HDDs. However, these designs may not be appropriate for SSDs and can sometimes even degrade their performance. In order to fully exploit the potential of SSDs, we need SSD-based I/O schedulers. In this thesis, we explore several factors that influence the performance of SSDs in terms of response time. For instance, according to our experimental results, we find that the SSD is more sensitive to increases in request size than a mechanical hard drive. Meanwhile, if read requests are mixed with concurrent write requests, read-write interference occurs, which degrades the performance of SSDs in terms of response time. These factors must be taken into consideration when designing an SSD-based I/O scheduler. Due to the unique structure of the SSD, internal parallelism should also be exploited to improve performance. Based on the above observations, we develop an SSD-based I/O scheduler called SBIOS, which is embedded into the kernel. When dealing with a set of incoming I/O requests, SBIOS fully exploits the characteristics of the SSD. SBIOS adopts a read-preference policy and a small-size-preference policy. It dispatches read requests to different blocks to make full use of internal parallelism. For write requests, it tries to dispatch them to the same block to alleviate the block-crossing penalty and the garbage collection overhead. To prevent read-write interference, SBIOS introduces type-based queuing: it uses two queues to separate the types of incoming requests, one serving read requests and the other serving write requests, and it deals with one queue in each round. In this way, the probability of read-write interference is reduced. To measure the performance of the SSD, we develop a tool which is an extension of Radiometer [12, 13]. This tool dispatches requests to the physical storage and collects response time statistics. The evaluation results show that, compared with other I/O schedulers in the Linux kernel, SBIOS reduces the average response time significantly. Consequently, the performance of the SSD-based storage system is improved.


(2) As mentioned before, MLC technology is a double-edged sword: it reduces the cost of SSDs but also brings an endurance problem. To address the reliability issue of SSDs, we propose a parallelism and garbage collection aware I/O scheduler called PGIS. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot data identification mechanism to reduce the garbage collection overhead. Using statistical methods, we analyze the characteristics of trace access patterns [14, 15], and we then design a hot data identification mechanism based on this analysis. In PGIS, we classify hot requests by access frequency. To exploit channel-level internal parallelism, we issue hot read requests to different channels. To reduce the garbage collection overhead in terms of P/E cycles, we issue hot write data to the same physical block. As we all know, design is a tradeoff: issuing hot write data to the same physical block accelerates the reclaiming process, but it also makes that physical block wear out faster. To balance this side effect, we introduce a hot/cold swapping wear-leveling algorithm. In our design, hot read data may overlap with hot write data, and parallelizing hot read data with hot write data would cause them to interfere with each other and lead to performance degradation. To reduce the probability of parallelizing hot read data with hot write data, PGIS introduces type-based queuing: it uses two queues to separate the types of incoming requests, one serving read requests and the other serving write requests. Because PGIS uses a read-preference policy, write requests could starve. To solve this problem, PGIS assigns each request a time stamp that defines a deadline before which the request should be dispatched to the FTL module, and it periodically checks the requests stored in the queues to guarantee that no request exceeds the period assigned by its time stamp. In everyday use, the SSD is treated as a black box: manufacturers do not provide an interface for inspecting the internal activity of the SSD. In this thesis, we therefore develop an SSD simulator to measure the performance and reliability of various SSD architectures. Our simulator is an extension of DiskSim 4.0 [16] and models a generalized SSD by implementing flash-specific read, write, and erase operations, channel-level internal parallelism, a wear-leveling algorithm, and a flash translation layer module.

(3) With the widespread deployment of SSDs in data centers, the poor endurance and out-of-place update limitations of SSDs become more and more prominent. Compared with SSDs, PCM has a higher endurance and can be updated in place. Hybrid storage systems that use PCM and SSD at the same storage level are a promising approach because of their better performance and higher endurance compared with SSD-only approaches. However, hybrid storage systems pose a new challenge to cache management algorithms. Existing DRAM-based cache management schemes are only optimized to reduce the miss rate. On a miss, the cache needs to access the PCM or the SSD. However, there are major differences between the access times of the two technologies. As a result, in such a hybrid system, a lower miss rate does not necessarily translate to higher performance, as the simple model below illustrates. To address this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) which takes the asymmetry of cache miss penalties on PCM and SSD into consideration. Our MPA scheme not only uses access locality to reduce the miss rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty overhead. Our experimental results show that the MPA scheme can significantly improve the performance of the hybrid storage system.
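The point can be made concrete with a simple expected-latency model; the hit rates and latencies below are illustrative round numbers chosen for this sketch, not values measured in this dissertation.

\[ T_{avg} = h \cdot T_{hit} + (1-h)\left(f_{PCM} \cdot T_{PCM} + f_{SSD} \cdot T_{SSD}\right) \]

Here h is the cache hit rate and f_PCM and f_SSD are the fractions of misses served by PCM and SSD. Assuming T_hit is negligible, T_PCM = 50 us, and T_SSD = 200 us, a scheme with h = 0.90 whose misses go to the SSD 80% of the time averages 0.10 x (0.2 x 50 + 0.8 x 200) = 17 us per request, while a scheme with h = 0.88 whose misses go to the SSD only 30% of the time averages 0.12 x (0.7 x 50 + 0.3 x 200) = 11.4 us. The second scheme misses more often yet is faster, which is exactly the asymmetry that MPA exploits.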

1.4 Organization of Dissertation

So far, we have compared HDDs with SSDs and given a brief introduction to SSDs. We have also presented the opportunities and challenges of SSDs that motivated our work. In the rest of the dissertation, we detail our system designs and I/O scheduling methods. For each of the proposed methods, we also present background, related work, and experimental results. The rest of this dissertation is organized as follows:

In Chapter 2, we present background on SSDs, I/O schedulers, and non-volatile memories. We also introduce several aspects of SSDs, including internal parallelism, the flash translation layer, and the hybrid SSD structure.

In Chapter 3, we revisit the I/O subsystem in the operating system. We show that existing I/O schedulers may not be appropriate for SSDs and can sometimes even degrade their performance. We explore several factors that impact SSD performance and, based on this exploration, propose an SSD-based I/O scheduler called SBIOS that fully exploits the characteristics of the SSD. For read requests, SBIOS dispatches them to different blocks to make full use of internal parallelism. For write requests, it tries to dispatch them to the same block to alleviate the block-crossing penalty. Based on these I/O scheduling rules, we develop an I/O scheduling module and embed it into the Linux kernel. To measure the performance of the SSD in terms of response time, we also develop a tool which is an extension of Radiometer [12, 13].

In Chapter 4, we mainly address the reliability issue of SSDs. Using statistical methods, we analyze the trace access patterns and find that 15 to 25 percent of the blocks are accessed by nearly 50 percent of the I/O requests. Based on this observation, we introduce a hot data identification mechanism and propose a parallelism and garbage collection aware I/O scheduler called PGIS. In PGIS, we classify hot requests by access frequency. To exploit channel-level internal parallelism, we issue hot read requests to different channels. To reduce the garbage collection overhead in terms of P/E cycles, we issue hot write data to the same physical block. Based on these I/O scheduling rules, we develop an I/O scheduler and embed it into an SSD simulator. As noted earlier, in everyday use the SSD is treated as a black box, so to measure the performance and reliability of the SSD we develop an SSD simulator which is an extension of DiskSim 4.0 [16].
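As a rough illustration of frequency-based hot data identification, the sketch below counts accesses per logical block and marks a block hot once its counter passes a threshold; the counter table size, threshold, and aging rule are placeholder assumptions for this example and are not the parameters used by PGIS.

#define TABLE_SIZE     4096   /* hypothetical number of counters     */
#define HOT_THRESHOLD  4      /* hypothetical accesses before "hot"  */

static unsigned int freq[TABLE_SIZE];   /* zero-initialized counters */

/* Record one access to a logical block and report whether it is now hot. */
int access_and_classify(unsigned long lbn)
{
    unsigned int slot = (unsigned int)(lbn % TABLE_SIZE);

    if (freq[slot] < HOT_THRESHOLD)
        freq[slot]++;
    return freq[slot] >= HOT_THRESHOLD;   /* 1 = hot, 0 = cold */
}

/* Periodic aging so that blocks that stop being accessed cool down again. */
void age_counters(void)
{
    for (unsigned int i = 0; i < TABLE_SIZE; i++)
        freq[i] >>= 1;   /* halve every counter */
}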

In Chapter 5, we investigate the characteristics of PCM and SSD. We show that hybrid storage systems that use PCM and SSD at the same storage level pose a new challenge to cache management algorithms. Because PCM and SSD have different miss penalties, the higher hit rate targeted by traditional cache management algorithms may not bring higher performance. To solve this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) that takes the asymmetry of cache miss penalties on PCM and SSD into consideration. Our MPA scheme not only uses access locality to increase the hit rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty overhead. By combining the miss penalty with the hit rate, our MPA scheme significantly improves the performance of the hybrid storage system. Meanwhile, to evaluate the efficiency of the MPA scheme, we develop a hybrid SSD simulator which is an extension of DiskSim 4.0 [16].

In Chapter 6, we conclude our work in this dissertation.

Some of the research in this dissertation has been published in conferences. The study in Chapter 3 was published at the 15th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) in 2015 [17]. The content of Chapter 4 is based on the paper published at the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS) in 2017 [18].


2. Chapter 2 Background

In SSDs, data are hosted by flash memory, which is a type of Electrically Erasable Programmable Read-Only Memory (EEPROM) [19]. Flash memory can be classified into several types, among which the NOR type and the NAND type are the most important [20]. NAND flash memory was first introduced by Toshiba in the late 1980s, following NOR flash memory by Intel [21]. NOR flash memory provides random access and high reliability, while NAND flash memory provides an affordable price due to its scaling technology. With the development of portable electronics, such as mobile phones, flash memory has become more and more popular. NAND flash memory in particular is widely adopted in mobile phones due to its low cost and high density. In this dissertation, flash memory refers to NAND flash memory.

2.1 Non-Volatile Memory

Semiconductor memories can be simply divided into two categories: volatile memory and non-volatile memory. Volatile memory needs a power connection to hold its data; when the power is off, volatile memory loses its data. In our daily life, the most common form of volatile memory is random access memory. In contrast, non-volatile memory can hold its data without a power connection, so if a power outage occurs, non-volatile memory keeps the data safe. In recent years, the most popular forms of non-volatile memory are phase change memory and flash memory.


Figure 2.1: PCM cell structure.

Phase change memory (PCM for short) is one of the most promising non-volatile memories; instead of using electrical charges to store information, it stores bit values in the physical state of a chalcogenide material (e.g., Ge2Sb2Te5) [22, 23]. Figure 2.1 shows the typical PCM cell structure. In general, a PCM cell consists of a top electrode and a bottom electrode. Between these two electrodes are the resistor (heater) and the chalcogenide glass (GST). We can transform the PCM cell into the amorphous state by quickly heating it to a high temperature (above the melting point) and then quenching the glass. Conversely, holding the GST above its crystallization point but below the melting point for a while transforms the PCM cell into the crystalline state. Since IBM researchers showed in 2016 how to store 3 bits in one PCM cell, PCM has become more and more popular. Due to the high read/write speed, high endurance, and low standby power of PCM [24-26], PCM is being considered for hybrid storage designs that combine PCM with DRAM.

Figure 2.2: Flash memory cell structure.

Flash memory stores data in memory cells. A flash memory cell mainly consists of three parts: the semiconductor material (p-substrate), the floating gate, and the oxide layer. The p-substrate can be manipulated to control the flow of electrons. Electrons trapped in the floating gate can represent one or more bits. The oxide layer isolates the floating gate and prevents electrons from escaping it. The flash memory cell structure is shown in Figure 2.2. NAND flash memory cells can be broadly classified into two types: SLC (Single-Level Cell) and MLC (Multi-Level Cell) [27, 28]. In SLC NAND flash memory, each cell stores only one bit, and SLC NAND flash memory is more reliable than any other NAND flash memory. However, because of SLC's high price, MLC NAND flash memory technology has been developed to serve the needs of users. MLC NAND flash memory stores 2 bits per cell, which requires 4 voltage states to represent 00, 01, 10, and 11. MLC NAND flash memory has 25-30 times lower endurance than SLC NAND flash memory and is not as reliable, with many more issues such as unexpected power loss, read disturb, and data corruption. However, due to its low price, MLC NAND flash memory is widely adopted in our daily applications.

2.2 Internal Structure of SSD

Most SSDs are storage devices based on NAND flash memory technology. In NAND flash memory, the basic structural unit is the flash page; similarly, the basic access unit for a read or write operation is the flash page. However, due to the limitations of flash memory, the basic unit of an erase operation is the flash block. As shown in Figure 2.3, a fixed number of flash pages are gathered to form a flash block. Meanwhile, every page is composed of a data area dedicated to user data and a spare area dedicated to metadata (mapping information, ECC, erase count, validity bit, etc.) [16]. In-place updates are not allowed in MLC NAND flash memory: if a user wants to update data in the same location, an erase operation must occur first. An erase operation not only takes a long time but also brings reliability issues. In this way, the erase-before-write mechanism becomes the main bottleneck of NAND flash memory.

Figure 2.3: Basic block structure of flash memory, adapted from [29].

The major components inside an SSD include DRAM, flash channels, flash controllers, and flash packages. To provide larger capacity, SSDs are usually configured with an array of flash packages.

Figure 2.4: The major components of flash memory.

These flash packages are connected through multiple channels to flash memory controllers. Flash memories provide logical block addresses (LBA) as a logical interface to the host. Since logical blocks can be striped over multiple flash memory packages, data accesses can be conducted independently in parallel [8]. In this way, these packages form package-level internal parallelism. There are several other kinds of internal parallelism besides package-level internal parallelism. As shown in Figure 2.4, each flash package is composed of several flash dies, and these dies form die-level internal parallelism. At the die level, each die can execute commands independently; since flash dies in the same flash package share the serial pins, when multiple commands arrive at these dies, they can interleave the execution of those commands.

2.3 Flash Translation Layer

Due to the erase-before-write mechanism, in-place updating is forbidden in the SSD. However, to users, the SSD appears to provide in-place updating. That is because the flash translation layer, which is maintained by the SSD controller, hides the complexity of flash operations by providing a logical interface to the SSD. Since overwriting a flash page in the same location is not allowed, the flash translation layer (FTL for short) marks the old data as an invalid page and finds a new location in which to write the new data. In this process, a mapping table must be maintained so that the operating system can find the latest data.

Based on the granularity of the mapping unit, flash translation layer schemes can be classified into three types: page-level mapping, block-level mapping, and hybrid mapping. In page-level FTL, each logical page number is translated into a physical page number, which means page-level FTL needs to maintain a large mapping table containing the mapping information of every logical page. Compared with the other two FTL management schemes, page-level FTL has the largest mapping table; however, because it can classify the hotness of data clearly, page-level FTL is flexible and efficient. The first page-level FTL scheme was proposed in 1995 [30] and was adopted as a standard for NOR-type flash memory several years later [31] [32]. Another famous page-level FTL is DFTL [33], which is inspired by virtual memory systems and loads a fragment of the page mapping table into RAM on demand [34] [35] [36].

Figure 2.5: Block-level FTL management scheme, adapted from [37].


The block-level FTL is the simplest flash translation layer management scheme. In block-level FTL, as shown in Figure 2.5, the logical page number (LPN) is divided into two parts: a logical block number and an offset. At the beginning of block-level FTL mapping, each logical block number is mapped to a physical block located in the SSD, and then a search locates the targeted page based on the in-block offset. Since block-level FTL only maintains mapping information for blocks, it reduces the hardware overhead and saves SRAM space. However, design is a double-edged sword: due to the larger granularity of the mapping unit, in block-level FTL a single write update may lead to the relocation of other pages located in the same SSD block. This also means the overhead of moving valid pages during garbage collection is increased. Block-level FTL is suitable for applications that have large access sizes and limited SRAM space. Some researchers have proposed designs based on block-level mapping. For instance, Lee et al. present a two-level mapping scheme which first maps a group of blocks in each DRAM map entry and then uses the out-of-band section of flash to store a finer-grained map within each group of blocks [38] [39]. Another famous block-level FTL is NFTL [39]. NFTL maintains a chain of physical addresses for each logical block in SRAM. For each write operation, NFTL searches the whole chain until it finds the first available page; if NFTL does not find such a page, a fresh block is attached to the end of the chain. Wells et al. propose a static block-level FTL management scheme in their system designed for compressed storage [40]. In this design, variable-size writes are allowed and are rescheduled to be written into a log block.
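A minimal sketch of the block-level translation just described is shown below; the block geometry and the name of the mapping array are assumptions made for illustration, not details of a specific FTL implementation.

#define PAGES_PER_BLOCK 64   /* hypothetical number of pages per block */

/* block_map[logical block number] = physical block number; assumed to be
 * filled in elsewhere by the FTL. */
extern unsigned long block_map[];

/* Translate a logical page number (LPN) into a physical page number (PPN):
 * only the block number is remapped, the in-block offset is preserved. */
unsigned long translate_block_level(unsigned long lpn)
{
    unsigned long lbn    = lpn / PAGES_PER_BLOCK;   /* logical block number */
    unsigned long offset = lpn % PAGES_PER_BLOCK;   /* offset inside block  */
    unsigned long pbn    = block_map[lbn];          /* block-level mapping  */

    return pbn * PAGES_PER_BLOCK + offset;
}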


(a) Full merge (b) Partial merge (c) Switch merge

Figure 2.6: Three cases when log block and data block are collected during garbage collection.

As mentioned above, page-level FTL performs better than the other FTL schemes, but it needs a large SRAM space. Block-level FTL saves SRAM space, but it is less flexible and sometimes leads to severe write amplification. For this reason, hybrid FTLs, which combine the concepts of page-level FTL and block-level FTL, have been proposed. In practice, many systems use log-structured, block-based hybrid FTL schemes [41] [42] [43], which are inspired by log-structured file systems [44]. In these FTL schemes, the flash blocks are divided into two types: data blocks and log blocks. The block-level FTL scheme is applied to data blocks. To save SRAM space, regular pages are mapped into data blocks, while updated pages, which are tracked by the page-level FTL scheme, are temporarily appended to log blocks. Log blocks are a small proportion of the flash blocks; generally, they account for less than 5% of the flash blocks. Due to the limited number of log blocks, most hybrid FTLs need to merge the data in a log block back into a data block to generate new space for log blocks.

Figure 2.6 shows the three kinds of merge operation: full merge, partial merge, and switch merge. The full merge is the most expensive of the three. In a full merge, as shown in Figure 2.6 (a), all the updated pages need to be copied to a newly allocated block, and then the old blocks must be erased. In a partial merge, only the first three pages of the data block have been logged, and the remaining pages of the data block are still valid, as depicted in Figure 2.6 (b); the valid pages have to be copied from the data block to the log block before a switch can be performed, resulting in a higher overhead. In a switch merge, as shown in Figure 2.6 (c), the data block contains only invalid pages, so it simply needs to be reclaimed, while the log block contains all the valid pages; in this case, we simply mark the log block as the new data block. The overhead of a switch merge is the lowest, since only one erase operation occurs. Different hybrid FTL schemes adopt different strategies to merge the data in log blocks into data blocks. For instance, Jesung et al. propose the first hybrid FTL, called BAST [41]. However, BAST does not work well with random overwrite patterns, which may result in a block thrashing problem [42]: since each replacement block can accommodate pages from only one logical block, BAST can easily run out of free replacement blocks and be forced to reclaim replacement blocks that have not been filled [45]. To solve the block thrashing problem, FAST [42] has been proposed, which allows a log block to hold updates from any data block. Park et al. propose SAST [36], which limits a set of log blocks to a set of data blocks. Cho et al. propose KAST [46], which addresses the variability of performance and limits the associativity of its log blocks.
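The sketch below illustrates, under simplified assumptions, how a log-block FTL could decide which of the three merges in Figure 2.6 applies when a log block must be reclaimed; the structure fields are placeholders invented for this example, not code from any of the cited schemes.

/* Simplified state of a (data block, log block) pair at reclaim time. */
struct log_block_state {
    int pages_per_block;       /* capacity of one flash block                */
    int pages_in_log;          /* pages currently written in the log block   */
    int sequential_from_zero;  /* log pages written in order from page 0?    */
};

enum merge_type { SWITCH_MERGE, PARTIAL_MERGE, FULL_MERGE };

enum merge_type choose_merge(const struct log_block_state *s)
{
    if (s->sequential_from_zero && s->pages_in_log == s->pages_per_block)
        return SWITCH_MERGE;   /* log block fully replaces the data block: one erase */
    if (s->sequential_from_zero)
        return PARTIAL_MERGE;  /* copy the remaining valid pages, then switch */
    return FULL_MERGE;         /* out-of-order updates: copy every valid page to a new block */
}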

2.4 Hybrid SSD structure

With the rapid development of NVM technology, exploring the use of emerging NVM technologies at different levels of the memory hierarchy has become popular in computer architecture research. Several novel NVM-based architecture designs have been proposed, such as NVM-based cache designs [47] [48] [49] [50], NVM-based storage architectures [51] [52], and NVM-based memory architectures [53] [22] [54]. NVM has its own advantages, such as in-place updating and high endurance. However, compared with SSD, NVM technology is not yet mature: it is still not feasible for NVM to directly replace SSD as massive storage because of manufacturing limitations and high cost [55]. In this case, the hybrid SSD, which uses PCM and SSD at the same storage level, has been proposed to take advantage of both SSD and PCM.

As we all know, small and frequent random writes are common in flash-based database servers. Under such an access pattern, the "erase-before-write" limitation of the SSD becomes increasingly apparent. Meanwhile, frequent write and erase operations reduce the lifetime of flash-based SSDs. To solve this problem, Sun et al. [56] propose a hybrid SSD architecture to prolong the lifetime of the SSD. The key idea behind this design is to use PCM to store log data. Thanks to the in-place update capability, byte-addressability, non-volatility, and better endurance of PCM, the performance, energy consumption, and lifetime of the SSD are all improved. Kim et al. [57] also propose a similar hybrid SSD structure for their flash-based database management strategy called IPL. Li et al. [58] propose a user-visible hybrid SSD with software-defined fusion methods for PCM and NAND flash; in this design, PCM is used to improve data reliability and reduce the write amplification of SSDs. In 2016, Radian Memory Systems released a host-based hybrid SSD product called "RMS-325" to improve the reliability of SSD storage systems. This clearly shows that the hybrid SSD structure is moving from the laboratory research stage to the engineering application stage. With the growing use of SSDs in data centers, the hybrid SSD structure will attract more and more attention in the coming years.


3. Chapter 3 A SSD-based I/O Scheduler

3.1 Analyzing the characteristics of SSD

Unlike traditional magnetic-based storage devices, the flash-based solid state disk consists of semiconductor chips, so rotational latency plays no role in its random I/O performance. Theoretically, the speed of a flash-based solid state disk is one or two orders of magnitude higher than that of a mechanical disk. In practice, however, the advantages of flash-based solid state disks are not fully exploited, for two reasons. First, flash-based solid state disks have poor write performance, which is caused by the erase-before-write mechanism: in order to overwrite a previous location on the SSD, the block that contains this location must be erased first, and only then can the new data be written to that location. Second, mechanical disks are still the main devices in primary storage systems, and in existing operating systems the software I/O stack is designed around the characteristics of mechanical disks. As a consequence, the potential of flash-based solid state disks is not fully exploited. Several research studies have shown that the existing I/O software layer can lead to additional overheads for flash-based solid state disks [1] [59] [60] [61].

Due to the erase-before-write mechanism, the read speed of the flash-based solid state disk does not match its write speed. Worse, the erase granularity (64-256KB) is much larger than the basic I/O granularity (2-8KB) [62] [63]. This means that the response time of a read request is shorter than that of a write request. Meanwhile, at the application level, read operations are synchronous, so the upper-layer application needs the data returned by a read before it can take its next step, while write operations are asynchronous and do not block the upper-layer application. Therefore, if we want to fully use the characteristics of the flash-based solid state disk in block-layer I/O scheduler design, we need to take the read/write speed discrepancy into account.

3.2 Background

In order to design an SSD-based I/O scheduler, we need to know which factors influence SSD performance. In the following subsections, we present experiments that explore how SSD performance varies under different settings of these factors.

3.2.1 Request size

In this section, we explore the relationship between request size and response time. In our experiment, we use the fio tool [64] to test the average response time of a traditional hard disk (WDC WD1600JS 500GB) and a solid state disk (Intel X25-E 64GB) under different request sizes. To avoid the influence of the rest of the system, we disable the write cache, the memory buffer, and the I/O scheduler in our experiments. Figure 3.1 shows the response time comparison, where SSD represents the Intel X25-E 64GB and HD represents the WDC WD1600JS 500GB. As shown in the experimental results, increasing the request size has little impact on the standard response time of the traditional hard disk.

Figure 3.1: The response time comparison between SSD and HD under different request sizes (4KB-1MB).

Meanwhile, the standard response time of the SSD shows almost no change for request sizes between 4KB and 64KB. As we all know, a rotating drive moves the head and rotates the platter to locate and access the data, so the three contributors to the response time of a rotating drive are the seek time, the rotational latency, and the data transfer time.


When the request size is small, the seek time and rotational latency account for the majority of the response time. As the request size increases, the data transfer time accounts for a larger proportion of the response time. As shown in Figure 3.1, when the request size is larger than 64KB, the request size becomes the main factor affecting the response time of the rotating drive. For requests whose sizes are between 128KB and 1024KB, the standard response time of the SSD increases rapidly as the request size grows, which indicates that the SSD is more sensitive to increases in request size than the mechanical hard drive.

Besides, the response time of the solid state disk is linearly proportional to the request size, which causes the standard response time of the SSD to increase gradually. The reason for this observation lies in the different internal structure of the solid state disk. Unlike a rotating hard drive, the solid state disk performs its fundamental operations (read and write) through circuit signal transmission, so there is no need to take seek time and rotational latency into consideration; the data transfer time is the main part of the response time of the solid state disk. Since the data transfer time is directly related to the request size, there is a linear relationship between the request size and the response time of the solid state disk.
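The trend in Figure 3.1 can be summarized by a simple first-order model; the decomposition below is a conceptual sketch consistent with the discussion above, not a model fitted to the measured data.

\[ T_{SSD}(S) \approx T_{0} + \frac{S}{B}, \qquad T_{HDD}(S) \approx T_{seek} + T_{rot} + \frac{S}{B_{disk}} \]

For the SSD, the fixed overhead T_0 dominates for small requests (4KB-64KB), which is why the curve is nearly flat in that range; once the transfer term S/B dominates, the response time grows linearly with the request size S. For the HDD, the positioning time T_seek + T_rot is so large that the transfer term only becomes noticeable for requests well above 64KB.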

3.2.2 Read-write interference

The write speed of a flash-based SSD is significantly lower than its read speed. In particular, when a reader continuously issues read requests in the presence of a concurrent writer, the reader may suffer an excessive slowdown in read performance. In order to validate this, we use the fio tool [64] to measure the read/write characteristics of the flash device. The flash-based storage device used in these experiments is the Intel X25-E 64 GB. To access the native characteristics of flash, we bypass the memory buffer, the write cache, and the I/O scheduler in our experiment.

Figure 3.2: The response time of random reads under different size with writes in concurrent execution on Intel X25-E.

Our experiment simulates two processes. One process continuously sends read requests to random storage locations, while the other sends random write requests to the flash device at the same time. In our experiment, the request size spans from 4kB to 1MB. Figure 3.2 illustrates the response time in two cases: pure random reads, and reads mixed with concurrent writes. Compared with pure random reads, the response time of random reads is significantly higher when they are interrupted by concurrent write requests, and the latency issue deteriorates as the request size becomes larger. In order to alleviate the latency issue brought by concurrent writes, we introduce the concept of batch processing and adopt a design that separates read and write requests. In this way, the problem of read-write interference can be mitigated.

3.2.3 Internal parallelism

There are several levels of internal parallelism in the flash-based solid state disk. This internal parallelism allows a single device to achieve close to ten thousand I/Os per second for random access, as opposed to nearly two hundred on a traditional hard disk. In order to demonstrate the importance of exploiting the internal parallelism of the flash-based solid state disk, we use the IOPS metric to measure the difference between the hard disk and the solid state disk.

In our experiment, we use the fio tool [64] to measure the IOPS of the traditional hard disk and the solid state disk. We continually issue requests of different sizes (4K-1M) in a random read pattern to the devices. In Figure 3.3, HD represents the WDC WD1600JS 500GB, while SSD represents the Intel X25-E. As shown in Figure 3.3, the IOPS of the HD is only 137, while the IOPS of the solid state disk is over 4000.


Figure 3.3: IOPS for the 4K-1M random read on Intel X25-E.

There is a more than 30-fold IOPS gap between the traditional hard disk and the flash-based solid state disk. Such a big performance gap is caused by their different internal structures. A traditional hard disk has only one moving head, which means it can serve only one request at a time. Under a random access pattern, the traditional hard disk wastes a lot of time rotating the platters and seeking the data, which is why it reaches only 137 IOPS for random reads. In the solid state disk, the case is totally different: the solid state disk is composed of multiple channels, dies, packages, and planes, and each level of internal parallelism can serve multiple requests at the same time. The random read access pattern is the most efficient pattern for triggering the internal parallelism of a flash-based solid state disk; this explains the more than 30-fold gap. Exploiting internal parallelism to enhance I/O scheduler performance is therefore very important. In our SBIOS, we dispatch read requests to different blocks to trigger the internal parallelism of the solid state disk.

3.3 System design and implementation

In this section, we discuss the system design and implementation of our SBIOS

scheduler.

3.3.1 System overview

The I/O scheduling module is located between the block layer and the block device driver layer. It processes requests in a specific order. Figure 3.4 shows the system overview of SBIOS and its location in the whole I/O subsystem. From the perspective of the upper level, requested data entering the I/O scheduling module is inserted into a queue, and the I/O scheduling module then re-sorts the requests in the queue according to a certain sorting policy. From the perspective of the lower layer, a request leaving the I/O scheduling module is like a dequeue operation. The I/O scheduling policy in the scheduling module decides the next request to be served. Figure 3.4 also illustrates the sorting policy that we use in SBIOS. SBIOS uses type-based queuing: we divide the requests into two types, read and write. We dispatch read requests to different blocks to make full use of the read internal parallelism. For write requests, we try to dispatch them to the same block to avoid the block cross penalty. In this way, SBIOS improves the performance of the solid state disk significantly.

(Figure 3.4 layers, top to bottom: Application, File System, Block layer with the I/O Scheduler, and the SSD containing Block 1 through Block N.)

Figure 3.4: The overview of SBIOS.

3.3.2 Dispatching method

Our goal in designing SBIOS is to exploit the full potential of the solid state disk, so the rich internal parallelism of the solid state disk is taken into consideration. Since SSDs have superior read performance, we sort the incoming requests based on type, and a read preference policy is employed in our scheduler. Four levels of internal parallelism are described in [65]. In our design, we try to fully use the block level read internal parallelism, so the scheduler dispatches the read requests to different logical blocks to trigger the internal read parallelism. To fully exploit the characteristics of the SSD, we must not only take advantage of its strengths but also avoid its drawbacks. Random write performance is always the bottleneck of the SSD, especially for random write operations which cross blocks [3]. To avoid this drawback, we try to dispatch the write requests to the same block. In this way, we can avoid the large response latency caused by crossing blocks.

During initialization, SBIOS uses a function to calculate the total capacity of the targeted solid state disk, and then it finds the starting sector of the solid state disk. In the Linux kernel, the basic storage unit is the sector. Our design rationale is to make the logical block size match the physical block size of the solid state disk, and we use a function Calculate Block() to achieve this. Given the starting sector of the solid state disk, Calculate Block() determines the block number of an incoming request. Suppose the starting sector of the solid state disk is K and the beginning sector of the incoming request is G. The logical block number computed by Calculate Block() equals (G - K) / SECTOR_PER_BLOCK. Here the SECTOR_PER_BLOCK variable gives the number of sectors contained in a block; it is calculated at the initialization phase.
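To make the mapping concrete, the following is a minimal sketch of the block-number calculation described above; the function and constant names (calculate_block, SECTOR_PER_BLOCK, the 512-byte sector size) are illustrative assumptions, since the actual kernel-module code is not shown here.

```python
# Illustrative sketch of the Calculate Block() logic described above.
# Assumed values: 512-byte sectors and a 256 KB erase block (Intel X25-E),
# so SECTOR_PER_BLOCK = 256 KB / 512 B = 512 sectors per block.

SECTOR_SIZE = 512                       # bytes per sector in the Linux block layer
PHYSICAL_BLOCK_SIZE = 256 * 1024        # vendor-specified erase block size (bytes)
SECTOR_PER_BLOCK = PHYSICAL_BLOCK_SIZE // SECTOR_SIZE   # computed at initialization

def calculate_block(request_start_sector: int, disk_start_sector: int) -> int:
    """Return the logical block number of a request.

    request_start_sector -- beginning sector of the incoming request (G)
    disk_start_sector    -- starting sector of the solid state disk (K)
    """
    return (request_start_sector - disk_start_sector) // SECTOR_PER_BLOCK

# Example: two requests whose starting sectors fall into the same 256 KB block.
print(calculate_block(1000, 0))   # -> 1
print(calculate_block(1023, 0))   # -> 1
```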

One important factor that impacts the performance of SBIOS is the physical block size of the solid state disk. In practice, the block size is specified by the vendor, and sometimes the exact physical block size is not available. However, for a given SSD, we can design micro-tests to determine the physical block size [66].
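As one possible illustration of such a micro-test (a rough sketch, not the exact methodology of [66]), the snippet below times small synchronous writes at increasing strides and looks for the stride at which the average latency jumps, which can hint at the erase-block boundary. The device path, the use of O_SYNC instead of O_DIRECT, and the candidate stride list are all assumptions; on-device caching may blur the signal in practice.

```python
# Rough, illustrative micro-test for estimating the physical block size.
import os, time

DEVICE = "/dev/sdX"          # hypothetical raw device; requires root access
IO_SIZE = 4096               # 4 KB writes
CANDIDATE_STRIDES = [64 * 1024, 128 * 1024, 256 * 1024, 512 * 1024]

def avg_write_latency(stride: int, count: int = 64) -> float:
    fd = os.open(DEVICE, os.O_WRONLY | os.O_SYNC)
    buf = b"\xAA" * IO_SIZE
    start = time.perf_counter()
    for i in range(count):
        os.pwrite(fd, buf, i * stride)   # one small write per candidate block
    os.close(fd)
    return (time.perf_counter() - start) / count

for stride in CANDIDATE_STRIDES:
    print(f"stride {stride // 1024:>4} KB: {avg_write_latency(stride) * 1e6:.1f} us")
```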

Figure 3.5: Flow chart of algorithm process.

3.3.3 Algorithm process


In SBIOS, all incoming requests are placed in a red-black tree according to their logical block addresses. They wait in the red-black tree until SBIOS chooses them to dispatch to the lower layer. Figure 3.5 shows the algorithmic flow for selecting a request to dispatch. In each round of request selection, the scheduler module checks whether any write request is starved. This step is not optional, because we use a read-preference policy in SBIOS; if we did not set a starvation threshold for write requests, our scheduler would suffer from a write starvation problem. After checking for write starvation, the scheduler determines the type of the next request. We also ensure that each request is assigned a timestamp when it enters the I/O scheduler. If the timestamp is out of date, that request is chosen as the next request to be served. If not, the scheduler employs a different selection method according to the request type. If the next served request type is set to read, the scheduler finds two requests from different blocks in the red-black tree, compares their sizes, and dispatches the smaller one to the lower layer. If the next served request type is set to write, the scheduler finds two write requests located in the same block in the red-black tree and dispatches the smaller one. Finally, it continues dispatching the requests pending in the red-black tree until no request remains in the tree.
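The following is a minimal sketch of the selection loop described above, under simplifying assumptions: an in-memory list keyed by logical block number stands in for the red-black tree, and the starvation check is reduced to a deadline comparison. Names such as Request, pick_next and DEADLINE are illustrative, not the actual kernel-module API.

```python
# Simplified sketch of the SBIOS request-selection policy described above.
import time
from dataclasses import dataclass, field

DEADLINE = 0.5        # assumed starvation threshold (seconds)

@dataclass
class Request:
    rw: str           # "read" or "write"
    block: int        # logical block number from calculate_block()
    size: int         # request size in sectors
    arrival: float = field(default_factory=time.monotonic)

def pick_next(pending: list[Request], prefer: str) -> Request | None:
    """Pick the next request following the read-preference, small-size-first policy."""
    if not pending:
        return None
    # 1. Starvation / deadline check: an expired request is served immediately.
    expired = [r for r in pending if time.monotonic() - r.arrival > DEADLINE]
    if expired:
        return min(expired, key=lambda r: r.arrival)
    cands = [r for r in pending if r.rw == prefer] or pending
    if prefer == "read":
        # 2a. Reads: take two candidates from *different* blocks, dispatch the smaller.
        first = cands[0]
        other = next((r for r in cands[1:] if r.block != first.block), first)
        return min(first, other, key=lambda r: r.size)
    # 2b. Writes: take two candidates from the *same* block, dispatch the smaller.
    for i, r in enumerate(cands):
        mate = next((s for s in cands[i + 1:] if s.block == r.block), None)
        if mate:
            return min(r, mate, key=lambda r: r.size)
    return cands[0]
```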

3.4 Experimental evaluation and analysis

In this section, we set up the experimental platform to analyze the performance of SBIOS. In our experiments, we run different kinds of traces against the chosen I/O schedulers to demonstrate that SBIOS improves the response time significantly and makes full use of the characteristics of the SSD.

3.4.1 Experiment setup

SBIOS is implemented as a kernel module in Ubuntu 14.04 with kernel 3.13.0. Our machine uses an Intel Core i3 3.00 GHz processor and 4 GB of memory. For the solid state disk, we use an Intel X25-E Extreme SATA Solid-State Drive 64 GB (Intel X25-E for short), whose erase block size is 256 KB. The hard disk we use is a WDC WD1600JS 500 GB. In order to test the efficiency of the SBIOS scheduler, we choose five different benchmarks, including two online transaction processing workloads (Fin1, Fin2) and three search engine workloads (Web1, Web2, Web3).

Table 3.1: Workload attributes.

Workload     Request size (bytes)    Read (%)    Write (%)
Fin1 [15]    512-17116160            21.6        78.4
Fin2 [15]    512-262656              82.4        17.6
Web1 [15]    512-1137664             99.9        0.01
Web2 [15]    8192-32768              99.9        0.01
Web3 [15]    512-23674880            99.9        0.01

3.4.2 Performance results and analysis

In this section, we run different benchmarks with different I/O schedulers

(including CFQ, Deadline, Noop and SBIOS). In our experiments, we did not compare SBIOS to the AS scheduler, because the AS scheduler has been removed from kernel 3.13.0. To validate the efficiency of SBIOS, we choose five benchmarks with different characteristics and compare their system performance.

Table 3.1 describes these five benchmarks in detail. Fin1 and Fin2 are read mixed with write workloads. In this kind of benchmark, the read-write interference problem may appear, especially for Fin1, in which read requests only account for 21.6%. That means read requests have a high probability of being blocked by write requests. Web1, Web2 and Web3 are read-intensive workloads. In this case, we need to consider the starvation problem of write requests.

Figure 3.6 shows the performance results. In order to illustrate the performance clearly, we use standard (normalized) response time on the Y axis. In the experiment, we set the response time of the Noop scheduler as the baseline (initialized to 1 in Figure 3.6) to compare the efficiency of the other I/O schedulers. The response times of the Noop scheduler for Fin1, Fin2, Web1, Web2 and Web3 are 1.19618 ms, 1.4491 ms, 0.588226 ms, 0.922445 ms and 1.0727 ms, respectively.

From Figure 3.6, we can draw two conclusions. First, apart from our SBIOS, the Noop scheduler outperforms the other schedulers under these five workloads. This validates that Noop is the most suitable of the standard schedulers for the SSD. Second, SBIOS performs best among these schedulers under all five workloads. For the Fin1, Fin2, Web1, Web2 and Web3 traces, the SBIOS scheduler reduces the response time relative to the best and worst of the other three schedulers by 15%-18%, 14%-18%, 11%-23%, 14%-18% and 10%-17%, respectively. In conclusion, SBIOS reduces the response time significantly by taking SSD characteristics into consideration.


Figure 3.6: Benchmark performance comparison under different I/O schedulers.

3.5 Related work

Since the I/O schedulers in existing operating systems were designed for HDDs, the popularization of flash-based SSDs has brought much more attention to I/O schedulers for SSDs. There is a large body of work on I/O schedulers for magnetic hard disks, but only a few studies have focused on SSDs. They can be classified into two categories. The first category mainly focuses on the fairness of SSD resource usage. For example, Stan Park et al. proposed the FIOS [62] and FlashFQ [67] algorithms, which take the fairness of SSD resource usage into account. FIOS [62] designed an I/O time slice management mechanism which combines read preference with fairness-oriented I/O anticipation. FlashFQ [67] discussed the drawbacks of the existing time-slice I/O schedulers and proposed a new mechanism which fully uses flash I/O parallelism without losing fairness.

The other category tries to exploit and maximize the advanced characteristics of SSDs in the upper layer, such as the parallelism among flash chips. For example, Wang et al. proposed ParDispather [68], which partitions the logical space to issue the user I/O requests to SSDs in parallel. Marcus Dunn et al. [3] proposed a new I/O scheduler that tries to avoid the penalty created when new blocks are written to SSDs. Jaeho Kim et al. [69] proposed IRBW-FIFO and IRBW-FIFO-RP, which arrange write requests into logical-block-size bundles to improve the write performance. Our scheduler not only makes full use of read internal parallelism [65], but also tries to avoid the block cross penalty [3]. Besides the I/O scheduler studies, there is also research revealing the details of the flash internal organization and parallel data distribution. For example, Agrawal et al. [16] described the internal organization of flash and several parallel data distribution policies inside SSDs. Feng Chen et al. [66] conducted experiments to reveal hidden details of flash memory implementations, such as unexpected performance degradation caused by data fragmentation. Yang Hu et al. [65] divided the parallelism of flash memory into four levels and discussed the priority and advantages of these four levels of internal parallelism.

Based on the above observations, the SBIOS scheduler tries to exploit the internal parallelism at the I/O scheduler level to boost the throughput of user applications for SSD-based storage systems.

3.6 Summary

In this chapter, we propose a new I/O scheduler called SBIOS which makes full use of the characteristics of the solid state disk. SBIOS exploits the rich read internal parallelism provided by the SSD by dispatching read requests to different blocks. For write requests, SBIOS dispatches them to the same block to avoid the block cross penalty. Furthermore, we validate that the SSD is sensitive to request size, so SBIOS uses a small-size-preference design. The experimental results show that SBIOS reduces the response time significantly. In this way, the performance of SSD-based storage systems is improved.


4. Chapter 4 A Parallelism and Garbage Collection aware I/O Scheduler

In recent years, with enormous growth in both capacity and popularity, the flash-based solid state disk has drastically impacted computing. However, the price of solid state disks is still far higher than that of traditional magnetic hard disks. According to statistics from 2015, the $/GB of a consumer-grade solid state disk (with a SATA interface) is nine times higher than the $/GB of a SATA hard disk. Price is the main bottleneck preventing high-performance solid state disks from being widely deployed. This bottleneck drives aggressive density increases: manufacturers reduce the price by increasing the density of the SSD. These higher densities have predominantly been enabled by two driving factors: (1) aggressive feature size reductions and (2) the use of multi-level cell (MLC) technology. Unfortunately, both of these factors come at the significant cost of reduced SSD endurance [70] [71] [59] [72]. With the continuing development of SSD technology, the endurance problem is becoming more and more prominent. In this study, we propose a parallelism and garbage collection aware I/O scheduler to address the reliability issue of SSDs.

4.1 Introduction

Compared to traditional magnetic hard disks, the most widely recognized advantage of SSDs is their high access performance. Unlike traditional magnetic hard disks, SSDs are usually constructed with a number of channels, each connecting to a number of NAND flash chips, in order to improve performance [73] [13] [74]. This design provides rich internal parallelism which can be fully exploited to improve performance. Recently, many approaches have been proposed to exploit the internal parallelism in SSDs. Chen et al. [75] proposed a buffer cache management approach for SSDs to solve the read conflict problem by exploiting the read parallelism of SSDs. Gao et al. [76] proposed an I/O scheduling method for SSDs to solve the access conflict problem by using a parallel issuing queue method in the SSDs. Guo et al. [17] proposed a novel SSD-based I/O scheduler that triggers the internal parallelism by dispatching read requests to different blocks. However, these works only target exploiting the internal parallelism to improve SSD performance; they do not consider the additional garbage collection overhead incurred while exploiting the internal parallelism.

In SSDs, the most commonly used internal parallelism is channel level parallelism. By using independent bus controllers, the flash translation layer can fully utilize the channel resources to serve incoming requests in parallel. However, in practice this kind of internal parallelism may increase the overhead of garbage collection, because hot write requests are distributed to different blocks in parallel by the channel level parallelism. Here, we classify hot requests by frequency: a request is considered hot if its access frequency is no less than two. For instance, if we modify the same place in a word document several times, this will generate many hot write requests. These hot write requests target the same logical page, but the FTL distributes them to different blocks to utilize the channel parallelism. This will increase the invalid page ratio per physical block significantly. In order to achieve a


tradeoff between internal parallelism and increased garbage collection overhead, we propose a GC-aware I/O scheduler called PGIS. In our design, according to workload characteristics, we introduce an LFU algorithm to classify the incoming requests as hot or cold. We dispatch hot read requests to different channels to utilize the channel resources in the SSD, and we distribute hot write requests to the same block to reduce the overhead of garbage collection. Through the combination of these methods, PGIS improves the performance and endurance of SSDs for a variety of workloads.

4.2 Motivation and Background

4.2.1 The main bottleneck in the SSDs

Write operations are considered the main bottleneck in SSDs, not only because they sometimes block read operations, but also because they trigger erase operations. Erase operations act at the block level, while read and write operations act at the page level. This distinguishing feature is caused by the structure of the SSD. SSDs do not allow in-place updates; instead, incoming new data are written to new clean space (i.e., out-of-place updates), and the old data are marked as invalid for reclamation during garbage collection. To hide this behavior, the Flash Translation Layer (FTL) has been proposed and deployed in flash memory to emulate in-place updates like block devices [77]. This emulated in-place update is not a real in-place update; it merely hides the characteristics of the out-of-place update from the users. Over time, these updates inevitably cause the coexistence of numerous invalid and valid pages in one physical block. When the percentage of free blocks in the SSD reaches a threshold, garbage collection (GC for short) is triggered to reclaim the space occupied by the invalid pages. The main method for reclaiming the space is the erase operation. However, the erase operation (1,500 us) is the most expensive operation in the SSD compared to the read operation (25 us) and the write operation (200 us) [16]. In order to reduce the number of erase operations, in our work we issue hot write requests to the same physical block to improve reclaiming efficiency.
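To make the cost argument concrete, the following is a minimal sketch (not the actual FTL implementation) of the cost of reclaiming one block during garbage collection, using the latency figures quoted above (25 us read, 200 us write, 1,500 us erase); the page counts and helper names are illustrative assumptions.

```python
# Illustrative model of the block-reclaim cost during garbage collection,
# using the latencies quoted above.  All structures are simplifying assumptions.
READ_US, WRITE_US, ERASE_US = 25, 200, 1500
PAGES_PER_BLOCK = 64

def reclaim_cost_us(valid_pages: int) -> int:
    """Cost of reclaiming one block: copy out the valid pages, then erase."""
    return valid_pages * (READ_US + WRITE_US) + ERASE_US

# A block filled only with invalidated (repeatedly overwritten, hot) pages is
# almost free to reclaim, while a block mixing hot and cold data is expensive.
print(reclaim_cost_us(0))    # all pages invalid -> 1500 us (erase only)
print(reclaim_cost_us(48))   # 48 valid pages    -> 48*225 + 1500 = 12300 us
```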

Figure 4.1: Channel level internal parallelism in the SSDs.

4.2.2 Internal parallelism


In order to improve the I/O performance of SSDs, the internal parallelism is taken into consideration. The internal parallelism originates from the internal structure of the SSD. In recent research, Hu et al. [65] classified the internal parallelism of SSDs into four levels: channel-level, die-level, chip-level and plane-level. The most common one is channel-level internal parallelism. As shown in Figure 4.1, when channel-level internal parallelism is applied, the two incoming requests in the queue are simultaneously issued to two different channels. With this kind of channel-level internal parallelism, read I/O performance is no doubt improved, but for write requests the efficiency of garbage collection can be decreased.

In Figure 4.1, the two incoming requests have been issued to two different chips located in two different channels. These two chips contain many physical blocks. Suppose these two incoming requests are hot write requests with the same logical block number. When channel-level internal parallelism is used, the FTL issues these two hot write requests to two different chips. This means incoming write requests with the same logical block address are issued to two different physical blocks located in two different chips. When the incoming requests contain many hot write requests, this dispatching method will increase the invalid page ratio in the physical blocks and decrease the garbage collection efficiency. In our design, we identify hot write requests and dispatch them to the same physical block in the SSD to improve the efficiency of garbage collection.


Figure 4.2: The hot data ratio of different traces.

4.2.3 Trace access characteristics

In order to reduce the garbage collection overhead, we should understand the trace access characteristics, especially the hot data ratio of the trace. Here, we use the access frequency to identify hot blocks. By applying statistical methods to analyze the trace access pattern, we validate that the access patterns of all seven selected traces fit a power law distribution well. According to the statistical results, we find a good cutoff to obtain the exact threshold for identifying hot blocks. Hot data is defined as the I/O requests that visit hot blocks; similarly, the other data in the trace is defined as cold data. Figure 4.2 shows the hot data ratio for all seven of our traces. We observe that access frequency can be quite heterogeneous across different pages: while some pages are accessed often (hot data), other pages are infrequently accessed (cold data). From Figure 4.2, we observe that for all of our traces, the hot data occupies a large fraction of the total data; the read hot ratio represents the hot data percentage of the total read data, and the write hot ratio represents the hot data percentage of the total write data. For read-intensive traces, such as Web1, Web2 and Web3, the write request ratio only accounts for 0.01%, so we omit the hot write data statistics. For write-intensive traces, such as Prn_0 and Fin1, we observe that the read hot ratios are lower than the average, ranging from 25% to 29.5%. The reason for this is that the read data only accounts for a small fraction of the total data. But whether the data is write-intensive or read-intensive, a large fraction of the total data is hot. So in our design, we identify the hot data to improve the performance of SSDs and reduce the overhead of garbage collection.

4.2.4 Motivation

In order to fully exploit the potential of SSDs, the internal structure of SSDs is always taken into consideration. But design is a tradeoff: when you gain something, you may lose something else. There is no doubt that introducing internal parallelism can obviously improve the performance of SSDs. However, because of the structural constraints of SSDs, improving performance sometimes increases the overhead of garbage collection, since the flash translation layer blindly directs hot write requests to different physical blocks. When we analyze the access characteristics of the real traces, we find that the hot data occupies a large fraction of the total data. In our design, we therefore maintain two LFU queues at the host level to identify the hot data: hot read requests are dispatched to different channels to fully exploit the channel resources, while hot write requests are dispatched to the same physical block in the SSD. In this way, we achieve a balance between improving the performance of SSDs and reducing the overhead of garbage collection.

4.3 System Design and Implementation

4.3.1 System overview

The main goal of the parallelism and garbage collection aware I/O scheduler (PGIS for short) is to reduce the garbage collection overhead while exploiting the internal parallelism. Figure 4.3 shows the SSD-based storage system with PGIS implemented in the host interface logic, where the Request Type Identifier, the Data Dispatcher and the Hot Data Identifier are the three most important components of PGIS, and multiple flash chips are connected to multiple channels at the storage level.

Figure 4.3: System overview of PGIS. Here RFQ represents the read frequency queue and WFQ represents the write frequency queue.

In Figure 4.3, the top level is the application level, which generates the incoming requests. The incoming requests are received by the file system, and the block device driver interacts with the SSD through the host interface logic. These incoming requests enter the I/O scheduling module, which is located in the host interface logic. The I/O scheduling module consists of three important components: the Request Type Identifier, the Data Dispatcher and the Hot Data Identifier. The Request Type Identifier directs the read requests and the write requests into different queues. The Hot Data Identifier maintains two frequency queues in which hot requests are marked with a hot flag.

When the hot data have been identified by the Hot Data Identifier, the Data Dispatcher dispatches all the requests to the flash translation layer. The FTL manages the channel resources through the flash interface logic proposed by Jung et al. [78]. The existing I/O schedulers implemented at the kernel level are not aware of the internal details of SSDs, so the additional overhead of garbage collection may degrade performance when the internal parallelism is applied. To solve this problem, we first propose a host-interface GC-aware I/O scheduler, which is based on an understanding of the data allocation scheme of the FTL and the logical address numbers of the incoming I/O requests. Then, by introducing the hot data identification mechanism, PGIS reduces the overhead of garbage collection by intelligently scheduling incoming I/O requests. PGIS issues hot read data to different channels to exploit the channel resources, and dispatches hot write data to the same physical block to accelerate the block reclaiming process.

Figure 4.4: Distribution of read and write access frequency in the Fin2 trace: (a) Fin2 read, (b) Fin2 write.


Table 4.1: Workload access pattern power laws.

Workload    k       r2      Type
Fin1        1.37    0.99    read
Fin1        1.42    0.97    write
Fin2        1.57    0.99    read
Fin2        1.50    0.99    write
Prn_0       1.30    0.99    read
Prn_0       1.58    0.90    write
Prxy_0      1.58    0.90    read
Prxy_0      1.64    0.97    write
Web1        1.35    0.99    read
Web2        1.46    0.98    read
Web3        1.35    0.98    read

4.3.2 Trace access pattern analysis

By applying statistical methods to analyze the trace access patterns, we make an interesting observation: the access patterns of all our traces appear to match a power law distribution. In order to validate this observation, we collect the number of blocks and the access frequency of these blocks from all our traces, and then plot these data on a logarithmic scale to calculate k and r2. In our experiment, we fit the data to a straight line in log-log space. Here k represents the slope of the line, while r2 measures how well the data fit the line. Figure 4.4 shows the frequency distribution results; we only show the statistics for Fin2 due to space limitations. The exact k and r2 values for all our traces are given in Table 4.1. The r2 values range from 0.90 to 0.99, while the k values range from 1.35 to 1.64, which is homogeneous and in line with the k values reported for software power laws in other works [79, 80]. Thus, it validates that the access patterns of all our traces are good fits for a power law distribution. According to our statistical results, 15 to 25 percent of the blocks are accessed by nearly 50 percent of the I/O requests, while nearly 80 percent of the blocks are accessed by the rest of the I/O requests, so when we design the threshold to identify hot blocks, we believe that 20 percent is an ideal threshold value. In our experiment, we choose the top 20 percent of entries listed in the LFU queue as hot blocks.
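As an illustration of this fitting procedure, the sketch below performs the log-log linear fit on a synthetic frequency histogram; the variable names and the synthetic data are assumptions, not the actual trace-analysis scripts used in this study.

```python
# Illustrative log-log power-law fit (not the actual analysis scripts).
# frequencies[i] = number of blocks accessed exactly (i + 1) times.
import numpy as np

frequencies = np.array([5000, 1800, 900, 520, 330, 230, 170, 130])  # synthetic example
access_count = np.arange(1, len(frequencies) + 1)

# Fit log(block count) = -k * log(access count) + c, i.e. a power law.
x, y = np.log10(access_count), np.log10(frequencies)
slope, intercept = np.polyfit(x, y, 1)

# r2: how well the fitted line explains the data.
residuals = y - (slope * x + intercept)
r2 = 1 - residuals.var() / y.var()

print(f"k = {-slope:.2f}, r2 = {r2:.2f}")
```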

4.3.3 Hot data identification mechanism

In order to identify the hot data in the host interface logic, we maintain two frequency queues, one recording the access frequency of write blocks and the other recording the access frequency of read blocks. According to the trace access pattern analysis above, we treat the top 20% of blocks maintained in the frequency queue as hot blocks. Once the hot blocks have been identified, incoming requests which visit a hot block are marked with a hot flag. The data dispatcher then applies different dispatching methods based on the hot flag value. In our experiment, we find that the access frequency used to identify hot blocks depends on the trace access pattern. For instance, for read-intensive traces such as Web1, Web2 and Web3, a read block that is accessed twice will be classified as a hot block, but the same access frequency in write-intensive traces, such as Fin1 and Prxy_0, would be classified as cold.

Figure 4.5: The basic data structure of the frequency queue.

A real trace contains a large number of I/O requests, so to improve the efficiency of the least frequently used (LFU) algorithm, we implement an O(1) LFU queue based on the approach proposed by Ketan et al. [81]. Figure 4.5 shows the basic data structure of our frequency queue. The frequency queue consists of three basic data structures. The first is a hash table used to store the key values so that, given a key, we can retrieve its node in O(1) time. The second is a doubly linked list for each access frequency; the maximum frequency is decided by the queue length, which determines which frequencies are classified as hot. In Figure 4.5, each frequency node maintains a doubly linked list to keep track of the keys belonging to that particular frequency. The third data structure is a doubly linked list linking these frequency lists together. When a key is accessed again, its node can easily be promoted to the next frequency list in O(1) time. In PGIS, the key value stored in the hash table is the logical block number. When a new I/O request comes from the block device driver, PGIS looks up its logical block number in the hash map. If the logical block number is found, PGIS goes to the corresponding key node and promotes it to the next frequency node; if the logical block number is not found, PGIS adds a new entry to the hash map and places the new key node under the frequency-1 node.
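The following is a compact sketch of such an O(1)-style frequency queue, using Python dictionaries in place of the explicit doubly linked lists; the class and method names (FreqQueue, access, hot_keys) are illustrative assumptions rather than the actual PGIS code.

```python
# Minimal O(1)-style LFU frequency queue sketch (illustrative, not the PGIS code).
# A dict of ordered mappings stands in for the doubly linked frequency lists.
from collections import OrderedDict, defaultdict

class FreqQueue:
    def __init__(self):
        self.freq_of = {}                          # key (logical block) -> frequency
        self.keys_at = defaultdict(OrderedDict)    # frequency -> keys with that frequency

    def access(self, block: int) -> int:
        """Record one access to a logical block and return its new frequency."""
        freq = self.freq_of.get(block, 0)
        if freq:
            del self.keys_at[freq][block]          # unlink from current frequency list
            if not self.keys_at[freq]:
                del self.keys_at[freq]
        self.freq_of[block] = freq + 1
        self.keys_at[freq + 1][block] = None       # link under the next frequency node
        return freq + 1

    def hot_keys(self, fraction: float = 0.2) -> set:
        """Return the top `fraction` of blocks by access frequency (the hot blocks)."""
        ranked = sorted(self.freq_of, key=self.freq_of.get, reverse=True)
        return set(ranked[: max(1, int(len(ranked) * fraction))])

q = FreqQueue()
for blk in [7, 7, 7, 3, 3, 9, 1, 1, 1, 1]:
    q.access(blk)
print(q.hot_keys(0.5))   # top half by frequency: {1, 7}
```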

4.3.4 Dispatching method

In order to reduce the additional garbage collection overhead while applying channel level internal parallelism, PGIS introduces the hot data identification mechanism and intelligently manages the hot write data which may increase the overhead of garbage collection. Compared to the host-level I/O scheduler which exploits the internal parallelism [17], PGIS benefits from knowing the internal behavior of the SSD. Figure 4.6 shows an example scenario in which the traditional channel level dispatching method leads to additional garbage collection overhead, and how PGIS avoids this situation.

As shown in Figure 4.6 (a), there are six incoming requests residing in the data waiting queue. A traditional channel level internal parallelism dispatching method dispatches the incoming requests to different channels to fully exploit the channel resources. However, it is unaware of the incoming hot write data, so it blindly dispatches request 5 and request 6 to different channels. This leads to additional garbage collection overhead, because hot write data with the same logical block number is dispatched to different physical blocks. To avoid this situation, PGIS identifies the hot data in the host interface logic, and when dispatching the incoming requests residing in the data waiting queue, it uses the channel level internal parallelism selectively. In Figure 4.6 (b), we can see that PGIS only uses the channel level internal parallelism for hot read data, such as request 3 and request 4. For hot write data, such as request 5 and request 6, instead of dispatching them to different channels, PGIS dispatches them to the same physical block. In this way, PGIS not only fully exploits the channel level internal parallelism, but also avoids the additional overhead of garbage collection.

In our design, hot read data may overlap with hot write data. There is no doubt that parallelizing hot read data and hot write data will cause interference and lead to performance degradation. To reduce the probability of parallelizing hot read data and hot write data, PGIS introduces type-based queuing.


Figure 4.6: Illustration of additional garbage collection overhead caused by the traditional channel level internal parallelism dispatching method.


Because we use read preference in PGIS, a starvation problem can arise for write requests. To solve this problem, PGIS assigns each request a timestamp which defines a time period before which the request should be dispatched to the FTL module. PGIS periodically checks the requests stored in the queues to guarantee that no request exceeds the time period assigned by its timestamp. With such a design, we believe the probability of parallelizing hot read data and hot write data can be kept within an acceptable range, and the experimental results support this idea. We discuss this topic in detail in the following sections.
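To summarize the dispatching policy in code form, here is a minimal sketch under simplifying assumptions: hotness flags come from the frequency queues above, channels are chosen round-robin for hot reads, and hot writes are steered to a per-logical-block target. Names such as dispatch and Target are illustrative, not the simulator's API.

```python
# Simplified sketch of the PGIS dispatching policy (illustrative only).
from dataclasses import dataclass
from itertools import cycle

NUM_CHANNELS = 8
_read_channels = cycle(range(NUM_CHANNELS))     # round-robin over channels for hot reads
_write_block_of = {}                            # logical block -> chosen physical target

@dataclass
class Target:
    channel: int
    same_block: bool   # True if the request is pinned to one physical block

def dispatch(rw: str, logical_block: int, hot: bool) -> Target:
    if hot and rw == "read":
        # Hot reads: spread across channels to exploit channel-level parallelism.
        return Target(channel=next(_read_channels), same_block=False)
    if hot and rw == "write":
        # Hot writes: keep every update to this logical block on one physical block,
        # so invalidated pages cluster together and GC reclaims the block cheaply.
        ch = _write_block_of.setdefault(logical_block, logical_block % NUM_CHANNELS)
        return Target(channel=ch, same_block=True)
    # Cold requests fall back to a simple static mapping.
    return Target(channel=logical_block % NUM_CHANNELS, same_block=False)

print(dispatch("read", 42, hot=True))    # e.g. Target(channel=0, same_block=False)
print(dispatch("write", 42, hot=True))   # Target(channel=2, same_block=True)
```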

4.4 Experimental Evaluation

In this section, we set up the experimental platform to evaluate PGIS. The experiments are divided into two parts. The first part runs different kinds of traces on the chosen I/O schedulers to demonstrate that PGIS improves the response time significantly and makes full use of the channel resources in the SSD. The second part compares the overhead of garbage collection under PGIS with state-of-the-art I/O schedulers to validate that PGIS reduces the overhead of garbage collection significantly while implementing the channel level internal parallelism.

Table 4.2: Configuration Parameters of the SSD Simulator.

Parameter Value

Channel 8

Die 8


Plane per Die 8

Block per Plane 2048

Page per Block 64

Page Size 8KB

Reserve Block Percentage 20%

Read Latency 25us

Write Latency 200us

Erase Latency 1500us

4.4.1 Experiment setup

In this work, our SSD simulator is a modified version of DiskSim with the SSD extension [82]. DiskSim with the SSD extension does not support channel level internal parallelism, so we implemented channel level internal parallelism in our simulator. The proposed approach is implemented in the host interface logic to schedule I/O requests. We implemented a state-of-the-art I/O scheduler (based on the Noop scheduling algorithm) and a flash-based I/O scheduler [17] to perform a quantitative comparison. In this study, we model a 64 GB SSD, which is configured with 8 channels, each equipped with 8 dies. The exact configuration of our simulator is described in Table 4.2, which shows that the 64 GB SSD contains 131072 blocks. Each flash block consists of 64 pages with a page size of 8 KB. A page mapping scheme is implemented as the default FTL mapping scheme, together with a greedy garbage collection scheme and a wear leveling scheme. The reserve block percentage is set to 20% of the SSD. In order to test the efficiency of PGIS, we choose 7 different traces, including two online transaction processing traces (Fin1, Fin2) [15], two MSR Cambridge traces from servers (Prxy_0, Prn_0) [14] and three search engine traces (Web1, Web2, Web3) [15].

4.4.2 Performance analysis

In this section, we run different benchmarks with different I/O schedulers

(including SBIOS, PGIS and Noop). In our experiment, we use the Noop scheduler as a

baseline to measure the efficiency of other schedulers. SBIOS [17] is a flash-based I/O

scheduler which improves the performance by exploiting the internal parallelism. We

introduce SBIOS to measure the performance variation of PGIS. To validate the

efficiency of PGIS, we choose seven benchmarks with different characteristics and

compare their performance. Fin1 and Fin2 are collected from OLTP applications running

at a large financial institution. Web benchmarks (Web1, Web2 and Web3) are collected

from a machine running a web search engine. Prn_0 and Prxy_0 are collected from MSR Cambridge servers. Table 4.3 lists the main characteristics of these seven benchmarks. We simply divide them into two types: read intensive and write intensive. Fin1, Prn_0 and Prxy_0 are write-intensive traces in which read requests only account for a small fraction. Fin2, Web1, Web2 and Web3 are read-intensive traces containing few write requests; in Web1, Web2 and Web3 in particular, the write request ratio only accounts for 0.01%.


Table 4.3: The characteristics of the traces.

Trace Read(%) IOPS Avg req size(KB)

Fin1 21.6 121.4 8.5

Fin2 82.4 97.8 3.6

Prn_0 10.8 68.1 22.2

Prxy_0 2.9 188.5 6.8

Web1 99.9 322.1 16.2

Web2 99.9 315.2 14.5

Web3 99.9 297.8 15.6

Figure 4.7 shows the performance results. In order to illustrate the performance clearly, we use standard (normalized) response time on the Y axis. In our experiment, we set the response time of the Noop scheduler as the baseline (initialized to 1 in Figure 4.7) to compare the efficiency of the other schedulers. As shown in Figure 4.7, compared with the baseline, PGIS improves performance by 19.8%, 28.4%, 35.3%, 33.3%, 32.1%, 21.3% and 18.5% for the Fin1, Fin2, Web1, Web2, Web3, Prn_0 and Prxy_0 traces, respectively, with an average of 26.9%.


Figure 4.7: Benchmark performance comparison under different I/O schedulers.

However, when compared with SBIOS, PGIS improves performance by only 3.7%, 12.6%, 18.9%, 15.8%, 15%, 6% and 3.7% for the Fin1, Fin2, Web1, Web2, Web3, Prn_0 and Prxy_0 traces. The reason PGIS does not beat SBIOS by a large margin is that SBIOS already benefits from exploiting internal parallelism. From Figure 4.7, we can draw two conclusions. The first is that the SSD-based I/O schedulers, including SBIOS and PGIS, outperform the Noop scheduler, which validates that exploiting the internal parallelism can improve SSD performance. The second is that read-intensive traces benefit greatly from hot data identification and channel level internal parallelism. For read-intensive traces, including Fin2, Web1, Web2 and Web3, the hot read data accounts for a large fraction of the total read data (the hot read ratio ranges from 27.7% to 36%), so when we dispatch the hot data to different channels, the channel level internal parallelism is fully exploited by such a high read I/O intensity. Write-intensive traces, including Fin1, Prn_0 and Prxy_0, do not benefit as much from channel level internal parallelism. The reason is that the read data only accounts for a small fraction of the total data, even though the hot read data accounts for a large fraction of the total read data.

Figure 4.8: Comparison of the overhead of garbage collection under different I/O schedulers.

4.4.3 Improved garbage collection efficiency


In this section, we look into the overhead of garbage collection under different schedulers using the seven traces. In order to clearly illustrate the overhead of garbage collection, we use the number of erase operations as the metric. To clearly reflect the garbage collection status, we simulate a 64 GB SSD whose over-provisioned space is set to 20% to accelerate the block reclaiming process.

Figure 4.8 shows the comparison results for the overhead of garbage collection. In order to illustrate the overhead of garbage collection clearly, we use the standard (normalized) erase operation count on the Y axis. The erase operation count is commonly used as an important metric to measure the garbage collection state in an SSD. In our experiment, we set the erase operation count of the Noop scheduler as the baseline (initialized to 1 in Figure 4.8) to compare the garbage collection overhead of the other schedulers. As shown in Figure 4.8, for the traces that mix reads with writes, compared with the baseline, PGIS reduces the overhead of garbage collection by 16.2%, 10.1%, 22.3% and 26.8% for the Fin1, Fin2, Prn_0 and Prxy_0 traces, respectively, with an average of 18.9%. When compared with SBIOS, PGIS reduces the garbage collection overhead by 19.3%, 11.2%, 27.6% and 30.9% for the Fin1, Fin2, Prn_0 and Prxy_0 traces, respectively, with an average of 22.3%. Our experimental results show that an internal parallelism dispatching method that is not aware of the hot data increases the overhead of garbage collection.

From Figure 4.8, we can draw three conclusions. The first is that dispatching hot write data to the same block can accelerate the block reclaiming process, but dispatching hot read data to different channels makes no difference to the garbage collection process; that is why PGIS does not gain any overhead improvement for the Web1, Web2 and Web3 traces. The second is that a hot-data-unaware dispatching method increases the overhead of garbage collection; as shown in Figure 4.8, in terms of garbage collection overhead the Noop scheduler outperforms SBIOS. The third is that a trace with a higher write ratio gains more benefit from the hot data identification scheme. In Figure 4.8, the biggest improvement in garbage collection overhead is gained by Prxy_0, which contains 97.1% write requests.

Figure 4.9: Standard response time speedup on seven benchmarks with respect to different channel number.


4.4.4 Sensitivity study on channel number

Channel level internal parallelism is the most common internal parallelism in the SSD, and either increasing or decreasing the channel number will influence SSD performance. From the previous sections, we can see that channel level parallelism is the source of the performance gain brought by PGIS. In this section, we conduct a sensitivity study on the number of channels to illustrate how variations in the channel number influence SSD performance.

Figure 4.9 compares the response time of the seven benchmarks under different channel numbers. In Figure 4.9, the Y axis represents the standard response time speedup; we use the response time of an SSD configured with two channels under these seven benchmarks (initialized to 1 in Figure 4.9) as a baseline to measure the efficiency variation of SSDs configured with different channel numbers. From Figure 4.9, we can draw three conclusions. The first is that increasing the channel number improves SSD performance. Take Fin2 for example: when the channel number increases from 2 to 8, the SSD performance improves by more than 20%. The second is that there is an optimal channel number for benchmarks that mix reads with writes. For such benchmarks, including Fin1, Fin2, Prn_0 and Prxy_0, there is an obvious decreasing trend when the channel number increases from 8 to 12. We believe this decreasing trend is caused by parallelizing hot read data and hot write data, since increasing the channel number also increases the probability of overlapping hot read data and hot write data. However, overlapping hot read data and hot write data does not influence the SSD performance much; the average performance decrease for the read-mixed-with-write benchmarks is less than 2.5%. A recent study by Chen et al. [83] does a good job of explaining this phenomenon: according to their study, only overlapping random hot read data with hot write data leads to a large performance decrease. By introducing the type-based dispatching method, PGIS significantly reduces the probability of overlapping random hot read data and hot write data. The third conclusion is that there is a limit to improving SSD performance by increasing the channel number. For read-intensive benchmarks, such as Web1, Web2 and Web3, when the channel number increases from 8 to 12, there is no further performance gain. According to our sensitivity study, we choose 8 as the default channel number in our experiments to clearly reflect the variation trend of SSD performance.

4.4.5 Sensitivity study on over provision ratio

The over provision ratio is an important factor which can significantly influence the overhead of garbage collection. On one hand, a higher over provision ratio means that the SSD reserves more blocks for incoming write requests. These additional reserved blocks decrease the occurrence rate of garbage collection, so the overhead of garbage collection is alleviated. On the other hand, a higher over provision ratio reduces the SSD user space, and decreased user space results in performance degradation. In order to find an optimal point between SSD performance and over provision ratio, we investigate the relationship between response time and over provision ratio in our experiment.

Figure 4.10: Standard response time speedup on seven benchmarks with respect to different over provision ratio.

Figure 4.10 compares the response time of these benchmarks under different over provision ratios. In Figure 4.10, the Y axis represents the standard response time speedup; we use the response time at an over provision ratio of 5% under these seven benchmarks (initialized to 1 in Figure 4.10) as a baseline to measure the efficiency of the SSD under other over provision ratios. As shown in Figure 4.10, for read-intensive traces, such as Web1, Web2 and Web3, increasing the over provision ratio has no influence on SSD performance. This validates that the over provision ratio only influences write requests. However, for traces with a significant write component, such as Fin1, Fin2, Prn_0 and Prxy_0, when the over provision ratio increases from 5% to 25% there is an obvious performance improvement trend; the average performance improvement is 7.5%, except for Fin2. The reason Fin2 does not gain more improvement from an increased over provision ratio is that the write ratio of Fin2 only accounts for 17.6%. When the over provision ratio exceeds 25%, the performance improvement trend stagnates. According to our sensitivity study, we choose 20% as the default over provision ratio in our experiments to clearly reflect the variation trend of the garbage collection overhead.

4.5 Related work

Flash-based solid state disks have rich internal parallelism, and much recent research focuses on exploiting this parallelism to improve SSD performance. Hu et al. [65] divided the parallelism of SSDs into four levels and found that these four levels have different priorities with respect to access latency and system throughput. Chen et al. [8] first evaluated and showed that the internal parallelism of SSDs plays an important role in improving performance. They stated that when internal parallelism is exploited, the performance of write operations is independent of their access pattern (random/sequential) and can even exceed that of read operations. Chen et al. [75] proposed a buffer cache management approach for SSDs to solve the read conflict problem by exploiting the read parallelism of SSDs. Gao et al. [76] proposed an I/O scheduling method for SSDs to solve the access conflict problem by using a parallel issuing queue method in the SSDs. Guo et al. [17] proposed a novel SSD-based I/O scheduler that triggers the internal parallelism by dispatching read requests to different blocks. Wang et al. proposed ParDispather [68], which partitions the logical space to issue the user I/O requests to SSDs in parallel. However, none of the above works takes the additional garbage collection overhead into consideration. In our work, we propose a GC-aware I/O scheduling method that still exploits internal parallelism.

The endurance of the solid state disk is an important metric for SSDs, and one important factor which influences endurance is garbage collection efficiency. Hot data identification is an effective method to improve garbage collection efficiency, so another field related to our research is hot data identification. Recently, many ideas based on hot data identification have been proposed to improve garbage collection efficiency. Chang et al. [84] used a two-level LRU list structure to identify incoming hot write requests. Park et al. [77] proposed a multiple bloom-filter scheme to identify the hot data in flash memory. These works improve garbage collection efficiency by identifying incoming hot write requests, but none of them applies the hot data identification scheme to read requests. In our work, we apply the hot data identification scheme to read requests as well, to fully utilize the rich channel resources in the SSD. Other works related to our research concern power law distributions and I/O schedulers for SSDs. Louridas et al. [80] prove that power laws appear in software at the class and function level. Wang et al. [79] show that tags produced in tracing follow power laws. Stan Park et al. proposed the FIOS [62] and FlashFQ [67] algorithms, which take the fairness of SSD resource usage into account. Marcus Dunn et al. [3] proposed a new I/O scheduler that tries to avoid the penalty created when new blocks are written to SSDs. Jaeho Kim et al. [69] proposed IRBW-FIFO and IRBW-FIFO-RP, which arrange write requests into logical-block-size bundles to improve the write performance.

4.6 Summary

In this chapter, we proposed a parallelism and garbage collection aware I/O scheduler named PGIS, which identifies hot data based on trace characteristics to exploit the channel level internal parallelism of flash-based storage systems. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot data identification mechanism to reduce the overhead of garbage collection. By dispatching hot read data to different channels, the channel level internal parallelism is fully exploited. By dispatching hot write data to the same physical block, the overhead of garbage collection is alleviated. The experimental results show that, compared with existing I/O schedulers, PGIS reduces the response time and the overhead of garbage collection significantly.


5. Chapter 5 Miss Penalty Aware Cache Management Scheme

One of the best known physical limitations of the SSD is the "erase-before-write" mechanism. Direct in-place updates are not allowed in SSDs. Instead, a time-consuming erase operation must be performed before overwriting. To make matters worse, the erase operation cannot be performed selectively on a particular page but only on a block of the SSD called the "erase unit". Since the size of an erase unit (typically 128 KB or 256 KB) is much larger than that of a page (typically 4 KB or 8 KB), even a small update to a single page requires all pages in the same erase unit to be erased and written again [56]. Frequent erase operations lead to an endurance problem, since each SSD cell can sustain only a limited number of erase operations before becoming unreliable. In order to mitigate this issue, the Flash Translation Layer (FTL for short) has been proposed and deployed in SSD firmware to emulate in-place updates like block devices. Although the FTL can reduce the number of erase operations, the write operation is still the main bottleneck in the SSD.

5.1 The Opportunity for PCM

With the development of non-volatile memory technology, we have more options for improving SSD performance [85] [86] [87]. Phase change memory (PCM for short) is one of the most promising non-volatile memories; instead of using electrical charges to store information, it stores bit values in the physical state of a chalcogenide material (e.g., Ge2Sb2Te5, i.e., GST). In general, a PCM cell consists of a top electrode and a bottom electrode. Between these two electrodes are the resistor (heater) and the chalcogenide glass (GST). We can transform the PCM cell into the amorphous state by quickly heating it above the melting point and then quenching the glass. In contrast, holding the GST above its crystallization point but below the melting point for a while transforms the PCM cell into the crystalline state. In 2016, IBM researchers demonstrated how to store 3 bits in one PCM cell, and PCM is becoming more and more popular.

Due to the high read/write speed, high endurance and non-volatility of PCM, many hybrid storage structures have been proposed to improve the performance of SSDs. Sun et al. [56] propose a hybrid storage architecture to improve SSD performance. The key idea behind this design is to use PCM to store log data. Thanks to the in-place update, byte-addressable and non-volatile properties and better endurance of PCM, the performance, energy consumption and lifetime of the SSD are all improved. Tarihi et al. [88] propose a hybrid storage architecture for Solid-State Drives (SSDs) which exploits PCM as a write cache while mitigating SSD drawbacks such as long write latency, high write energy and finite endurance. Li et al. [58] propose a user-visible hybrid storage system with software-defined fusion methods for PCM and NAND flash. In this design, PCM is used to improve data reliability and to reduce the write amplification of SSDs. In 2016, Radian Memory Systems released a host-based hybrid SSD product called "RMS-325" to improve the reliability of SSD storage systems. This clearly shows that hybrid storage technology is moving from the laboratory research stage to the engineering application stage. With the growing use of SSDs in data centers, hybrid storage technology will attract more and more attention in the forthcoming years.

There are many studies of hybrid storage technology that focus on exploiting PCM and SSD at the same storage level. While most of them try to reduce the write amplification of SSDs and to improve the data reliability of the whole storage system, none of them considers improving the performance of the whole storage system based on the read latency disparity between PCM and SSD. In general, the read latency of PCM is at least 500 times lower than the read latency of SSD. Such a big read latency gap between PCM and SSD leads to different miss penalty overheads for the average memory access time of the DRAM-based main memory. In modern computer systems, a large part of the main memory is used as a page cache to hide disk access latency. The effectiveness of page cache management algorithms is critical to the performance of the whole storage system. In this study, we propose a Miss Penalty Aware cache management algorithm (MPA for short) to improve the performance of the hybrid storage system which uses PCM and SSD at the same storage level. MPA not only fully exploits the abundant system resources on the host side, but also gives SSD requests maintained in the page cache higher priority to improve the performance of the whole storage system.

5.2 Background and Motivation


5.2.1 PCM vs SSD

In recent years, flash-based Solid State Drives (SSDs) have enabled a revolution in mobile computing and are widely used in data centers and in high-performance computing. Compared to traditional hard disks, SSDs offer substantial performance improvements, but cost limits their adoption in cost-sensitive applications. To address this problem, SSD manufacturers increase SSD density by scaling the silicon feature size (shrinking the size of a transistor) [89], so more bits are stored in an SSD cell. High density, however, is a double-edged sword. While enormously reducing the cost of SSDs and increasing their adoption, high density also brings a swift drop in SSD endurance, measured as the number of program/erase (P/E) cycles an SSD cell can sustain before it wears out [90] [91]. Although the 5x-nm (i.e., 50- to 59-nm) generation of MLC NAND flash had an endurance of 10k P/E cycles, modern 2x-nm MLC and TLC NAND flash can endure only 3k and 1k P/E cycles, respectively [92] [93] [94].

Compared to NAND flash SSD, PCM has many advantages, such as in-place updates, byte addressability and high endurance [95] [24]. However, its most prominent advantage is its fast read access speed. As shown in Table 5.1, the read latency of PCM is at least 500 times lower than the read latency of SSD. Such a big performance gap brings many opportunities to improve the performance of a hybrid storage system which uses PCM and SSD at the same storage level. Our Miss Penalty Aware cache management algorithm is also designed to exploit this big performance gap between PCM and SSD.


Table 5.1: Comparison among SSD and PCM.

Attribute              SSD         PCM
Non-volatility         YES         YES
Byte Addressability    NO          YES
Write Latency          500 us      1 us
Read Latency           25-50 us    50-100 ns
Write Energy           0.1-1 nJ    <1 nJ
Read Energy            1 nJ        1 nJ
Endurance              <10^4       10^6-10^8

With the progress of PCM technology, PCM has become a good candidate for replacing NAND flash SSD thanks to its high read/write speed and high endurance [22] [96]. However, due to its high cost and manufacturing limitations, it is still not feasible to replace the whole NAND flash SSD with PCM. Consequently, many hybrid storage structures which exploit PCM and SSD at the same storage level have been proposed to improve the performance of the whole storage system. In our work, we propose a Miss Penalty Aware cache management algorithm to improve the performance of the hybrid storage system which uses PCM and SSD at the same storage level.

5.2.2 Motivation

As we all know, different storage systems show different characteristics. In a modern computer system, the DRAM-based main memory is used as a page cache to

bridge the performance gap between the main memory and the storage system. In this case, the page cache management algorithm has a significant effect on I/O performance, since it hides the long latency of the underlying storage system. In general, the effectiveness of a page cache management algorithm is judged mainly by the Average Memory Access Time (short for AMAT). AMAT consists of Hit Time, Miss Rate and Miss Penalty, as represented in equation (1):

AMAT = Hit Time + (Miss Rate * Miss Penalty)    (1)

According to equation (1), the AMAT can be improved by optimizing each of the above three factors. Since the main memory is DRAM, whose access time is constant, the Hit Time is constant too. A large part of the existing optimizations for page cache management algorithms try to reduce the Miss Rate as much as possible by exploiting access locality, thereby reducing the number of user I/O requests actually passed to the underlying storage system. In doing so, they assume that the Miss Penalty is also constant, because the direct access time of the underlying storage device is almost the same for every miss. This assumption is valid for most storage systems, such as HDDs and flash-based SSDs. However, when a hybrid SSD that uses PCM and SSD at the same storage level is applied, the story becomes different. The open channel hybrid SSD exploits PCM and SSD at the same storage level to reduce the write amplification of the SSD.

Miss Penalty = TD + DA    (2)

In such a hybrid storage structure, the access time for PCM is at least 500 times lower than the access time for SSD. Based on this observation, we take the factor of Miss

Penalty into consideration. Equation (2) shows the factors that influence the Miss Penalty overhead. Here, TD represents the time to load the missed data into DRAM, while DA represents the DRAM access time. According to equation (2), the Miss Penalty is mainly determined by the data loading time TD, since the DRAM access time is constant. Because TD depends on the access time of the storage device, the Miss Penalty overhead for PCM is quite different from the Miss Penalty overhead for SSD in a hybrid SSD. Since the read latency of PCM is at least 500 times lower than the read latency of SSD, the Miss Penalty for PCM is much less than the Miss Penalty for SSD. In our research, we assign a higher priority to SSD requests located in the page cache to avoid the high Miss Penalty overhead.
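To make the effect of the asymmetric Miss Penalty concrete, the short back-of-the-envelope calculation below plugs the device read latencies from Table 5.2 into equations (1) and (2). The hit rates and the PCM/SSD miss split used here are hypothetical numbers chosen only for illustration; they are not measured results.

# Illustrative AMAT calculation for a PCM/SSD hybrid, based on equations (1) and (2).
# Latencies follow Table 5.2; the hit rates and miss splits below are hypothetical.

DRAM_ACCESS_NS = 100          # DA: DRAM access time (assumed value)
PCM_READ_NS = 100             # TD when the missed page resides in PCM
SSD_READ_NS = 25_000          # TD when the missed page resides in SSD (25 us)

def amat(hit_rate, ssd_miss_fraction):
    """AMAT = Hit Time + Miss Rate * Miss Penalty, with a per-device Miss Penalty."""
    miss_rate = 1.0 - hit_rate
    # Miss Penalty = TD + DA, averaged over where the missed pages actually reside.
    miss_penalty = (ssd_miss_fraction * (SSD_READ_NS + DRAM_ACCESS_NS)
                    + (1.0 - ssd_miss_fraction) * (PCM_READ_NS + DRAM_ACCESS_NS))
    return DRAM_ACCESS_NS + miss_rate * miss_penalty

# A locality-only policy: higher hit rate, but half of the misses go to the SSD.
print(amat(hit_rate=0.90, ssd_miss_fraction=0.50))   # ~1365 ns
# A miss-penalty-aware policy: slightly lower hit rate, but SSD misses are rarer.
print(amat(hit_rate=0.88, ssd_miss_fraction=0.10))   # ~423 ns

Even with a lower hit rate, the second policy achieves a much lower AMAT, which is exactly the trade-off MPA exploits.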

In a hybrid SSD that uses PCM and SSD at the same storage level, metadata and random or small write data (hot data) can be directed to the PCM, taking advantage of its in-place update and byte-addressable properties. Meanwhile, this data dispatching method also reduces the write amplification of the NAND flash SSD, which would otherwise increase the number of write and erase operations, degrade performance and shorten the lifespan of the NAND flash SSD. Since the SSD is used to store cold data in the hybrid SSD, on one hand, giving high priority to SSD requests located in the cache decreases the probability of paying the high Miss Penalty overhead for SSD requests. On the other hand, such a high priority for SSD requests located in the cache may increase the Miss Rate for PCM requests. In order to find the balance between Miss Rate and Miss Penalty, MPA includes a mechanism to filter the real cold data located in the page cache. By filtering the real cold

data located in the page cache, the Miss Rate for PCM requests decreases significantly.

By considering not only the Miss Rate, but also the Miss Penalty, the AMAT with MPA is significantly improved.

Figure 5.1: The overview of MPA on the I/O path.

5.3 The Design Detail of MPA


5.3.1 System overview

The main goal of the Miss Penalty Aware Cache Management algorithm is to exploit the read latency difference between PCM and SSD while maintaining the hit rate. Existing DRAM-based cache management algorithms such as LRU and LFU are designed to reduce the Miss Rate by exploiting temporal locality. Our MPA scheme is designed for the hybrid storage system which uses PCM and SSD at the same storage level. In this case, MPA not only takes the hit rate into consideration, but also exploits the read latency difference between PCM and SSD to reduce the Miss Penalty overhead. Figure 5.1 shows the hybrid storage system overview with MPA implemented in DRAM, where PCM is used at the same storage level as the SSD.

In Figure 5.1, the host level consists of the application level, the Virtual File System (short for VFS) and DRAM. When the user interacts with the operating system, the application level generates requests. These incoming requests are received by the VFS, which then dispatches them to the page cache located in DRAM. In the page cache, the page cache management scheme checks whether the incoming requests hit in the cache. If so, they are serviced by the page cache immediately. Otherwise they are issued to the hybrid storage system below. Here, the hybrid storage system communicates with the DRAM through the open channel interface.

As shown in Figure 5.1, the device level is a hybrid SSD which consists of a memory controller, PCM and SSD. In such a storage structure, the most prominent feature is the read latency difference between PCM and SSD, which in turn leads to a Miss Penalty difference between PCM and SSD. However, the existing page cache management schemes only exploit temporal locality to reduce the miss rate, so they cannot fully exploit the potential of the hybrid SSD. Different from traditional page cache management schemes, which rely on the Miss Rate as the only means to improve the AMAT, MPA fully exploits this feature of the hybrid SSD by treating the Miss Penalty as the most important means to improve the AMAT.

5.3.2 The balance between Miss Penalty and Miss Rate

In a hybrid SSD which uses PCM and SSD at the same storage level, the PCM is typically used to store metadata and random or small write data (hot data) because of its in-place update and high endurance properties. However, due to the limited capacity of PCM, most data (cold data) is stored in the SSD, and therefore most SSD requests are rarely accessed. There is no doubt that assigning high priority to SSD requests located in the page cache will decrease the probability of paying the high Miss Penalty overhead for SSD requests, but decreasing the Miss Penalty overhead for SSD requests will increase the Miss Rate for PCM requests. In order to achieve the optimal performance of the hybrid SSD, we need to find a balance between Miss Penalty and Miss Rate. Figure 5.2 shows how we improve the hit rate for PCM requests by filtering the real cold data in the page cache.

As shown in Figure 5.2, the Miss Penalty Aware cache management scheme contains two parts: a FIFO list for filtering cold data and an LRU list for capturing workload locality. When a page miss occurs, the missed page is loaded from storage and first added to the FIFO list; if a cache hit on that page occurs before it leaves the FIFO list, the page is promoted to the LRU list, which is used to store active data.

Figure 5.2: The system overview of MPA scheme.


Through this promotion mechanism, we filter the cold data in the FIFO list. Figure 5.2 shows an example of how the MPA scheme decreases the Miss Penalty overhead. Under a traditional LRU scheme, request D4 at the end of the LRU list should be evicted from the page cache. However, because request D4 belongs to the SSD, which has the expensive Miss Penalty overhead, the MPA scheme gives it high priority to stay in the page cache and instead evicts the PCM request D2 to free page cache space. By keeping SSD requests longer in the cache, the MPA scheme reduces the Miss Penalty overhead significantly. In this design, how memory is allocated between the LRU list and the FIFO list influences the performance of MPA; we discuss this issue in a following section.


5.3.3 Cache management algorithm

Existing DRAM-based cache management schemes such as LFU and LRU are mainly designed to improve the cache hit rate by exploiting request access locality, while assuming that the Miss Penalty is a constant value. However, when a hybrid SSD which exploits PCM and SSD at the same storage level is introduced, this assumption becomes invalid. In a hybrid SSD, the Miss Penalty overhead for SSD requests is more expensive than the Miss Penalty overhead for PCM requests. Therefore, when considering the read latency difference between PCM and SSD, the performance of the cache cannot be evaluated by a traditional metric such as the hit rate alone. To solve this issue, MPA introduces the Miss Penalty as an important metric to measure the performance of the cache management scheme. The detail of MPA is described in Algorithm 1.

When receiving an incoming request, MPA checks whether the request hits in the cache. There are two cases for a cache hit. The first case is a hit in the FIFO list; here the promotion mechanism takes effect and promotes the hit page to the LRU list. The second case is a hit in the LRU list; here MPA moves the hit page to the head of the LRU list to exploit the temporal locality of the workload.

If a cache miss occurs, there are also two cases. The first case is that the cache is not full. In this case, MPA loads the data from the hybrid storage and adds the missed page to the tail of the FIFO list. The second case is that the cache is full. In this case, a page needs to be evicted to free page cache space, so the problem becomes more complex. To handle it, the MPA scheme selects the victim page based on the physical storage location of the data being accessed: if the accessed data comes from the PCM, the victim is selected from the LRU list; otherwise, the victim is selected from the FIFO list.

Since the Miss Penalty overhead for loading data stored in the SSD is more expensive than that for loading data stored in the PCM, the MPA scheme gives SSD requests high priority to stay longer in the page cache. When a cache miss occurs and the cache is full, the MPA scheme keeps SSD requests in the LRU list as long as possible. As shown in Figure 5.2, the MPA scheme skips the two SSD requests D1 and D4 and instead chooses the PCM request D2 to evict to free page cache space.
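To make the promotion and eviction rules above concrete, the following is a minimal Python sketch of an MPA-style cache. It is an illustration of the behaviour described in this section and in Figure 5.2, not the thesis's Algorithm 1 verbatim: the class and method names (MPACache, access, the 'pcm'/'ssd' labels) are chosen here for readability, and the selection of the victim list is simplified to fixed per-list budgets plus a PCM-first LRU eviction.

from collections import OrderedDict

class MPACache:
    """Minimal sketch of a Miss Penalty Aware (MPA) page cache.

    Pages first enter a FIFO filter list; a hit while they are still there
    promotes them to an LRU list of active pages.  When the LRU list must
    evict, PCM-backed pages are preferred victims, because a miss that has
    to be served from the SSD carries a much larger Miss Penalty.
    """

    def __init__(self, capacity, fifo_ratio=0.6):
        self.fifo_cap = max(1, int(capacity * fifo_ratio))  # cold-data filter
        self.lru_cap = max(1, capacity - self.fifo_cap)     # active data
        self.fifo = OrderedDict()   # page id -> device ('pcm' or 'ssd')
        self.lru = OrderedDict()

    def access(self, page, device):
        """Access one page; `device` says where the page lives on a miss."""
        if page in self.lru:                     # hit in the LRU list
            self.lru.move_to_end(page)           # refresh recency
            return True
        if page in self.fifo:                    # hit in the FIFO list
            dev = self.fifo.pop(page)
            self._promote(page, dev)             # second access: promote
            return True
        if len(self.fifo) >= self.fifo_cap:      # miss: make room in FIFO
            self.fifo.popitem(last=False)        # plain FIFO eviction
        self.fifo[page] = device                 # load from hybrid storage
        return False

    def _promote(self, page, device):
        if len(self.lru) >= self.lru_cap:
            self._evict_lru_prefer_pcm()
        self.lru[page] = device

    def _evict_lru_prefer_pcm(self):
        # Walk from the least-recently-used end and skip SSD pages (Figure 5.2).
        victim = next((p for p, d in self.lru.items() if d == 'pcm'), None)
        if victim is None:                       # only SSD pages left
            self.lru.popitem(last=False)         # fall back to plain LRU
        else:
            del self.lru[victim]

In the Figure 5.2 example, this eviction helper would skip the SSD pages D1 and D4 and evict the PCM page D2 instead of the nominal LRU victim.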

5.4 Experimental Evaluation

In this section, we first describe the experiment setup, and then evaluate the performance of the MPA scheme by comparing it with existing DRAM-based cache management schemes.

5.4.1 Experiment setup

To evaluate the efficiency of our proposed MPA scheme, we first define the hybrid SSD. In our experiment, we assume that the hybrid SSD exploits PCM and SSD at the same storage level. In reality, due to price and technology limitations, the capacity of PCM is generally expected to be much smaller than that of SSD, so we allocate more capacity to the SSD. In our experiment, we opt for an 8:1 SSD-to-PCM capacity ratio.


Table 5.2: Configuration Parameters of the hybrid SSD Simulator.

Parameter           Value
SSD size            32GB
SSD read latency    25us
SSD write latency   200us
SSD erase latency   1500us
SSD page size       4KB
PCM size            4GB
PCM read latency    100ns
PCM write latency   1us
PCM unit size       512byte
Page cache size     4096 (pages)

The hybrid SSD is simulated by a trace-driven simulator based on DiskSim with the SSD extension [82]. We implement a DRAM-based page cache module and a PCM model on top of the simulated SSD. The DRAM-based page cache module simulates a 4096-page cache and provides several page cache management schemes, including MPA, LFU and LRU. The PCM model simulates a 4GB PCM with in-place updating and byte-addressable properties. We set the basic data access unit for PCM to 512B, because the minimal request size in our traces is 512B. The values of the hybrid SSD specific parameters used in our simulator are shown in Table 5.2.
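For readers who want to reproduce the general flavour of this trace-driven evaluation without the simulator, the sketch below replays a trace against the hypothetical MPACache class shown earlier and reports the hit rate. The (lba, device) tuple format is an assumption made here for illustration; it is not the SNIA/UMass on-disk trace format, and this loop is not the actual DiskSim harness.

# Hypothetical replay loop: feed a trace to the MPACache sketch and report the hit rate.
def replay(trace, capacity=4096, fifo_ratio=0.6):
    cache = MPACache(capacity, fifo_ratio)
    hits = 0
    for lba, device in trace:        # device is 'pcm' or 'ssd' after hot/cold dispatching
        hits += cache.access(lba, device)
    return hits / len(trace)

# Small synthetic trace used only as a smoke test (not one of the evaluated workloads).
synthetic = [(i % 512, 'pcm' if i % 10 < 4 else 'ssd') for i in range(100_000)]
print(replay(synthetic))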


In the evaluation, we use two online transaction processing traces [15] and three MSR Cambridge server traces [14] to study the performance impact of the different page cache management schemes. These five enterprise traces all mix reads and writes, and they can be divided into two categories: read-intensive and write-intensive. Fin2 is a read-intensive trace in which read accesses account for 82.4% of the requests, while the other traces are all dominated by write accesses. Table 5.3 shows the characteristics of these traces.

Table 5.3: The characteristics of the traces.

Trace Read(%) IOPS Avg req size(KB)

Fin1 21.6 121.4 8.5

Fin2 82.4 97.8 3.6

Prxy_0 2.9 188.5 6.8

Rsrch_0 9.3 120.1 9.08

Web2 20.1 130.2 9.28

5.4.2 Performance analysis

In order to better analyze the performance impact of our MPA scheme under the real application traces, we first set the PCM requests-to-SSD requests ratio. In the hybrid SSD which uses PCM and SSD at the same storage level, thanks to the very low read latency and high endurance of PCM, a higher PCM requests-to-SSD requests ratio will undoubtedly lead to more performance improvement. However, due to the limitations of price and technology, the capacity of PCM in a real hybrid SSD product is small compared to the capacity of SSD. Meanwhile, considering that only a limited number of trace items can be dispatched to the PCM, we choose a PCM requests-to-SSD requests ratio of 40:60 to explore the efficiency of our MPA scheme. In the following section, we also explore the performance impact of our MPA scheme under different PCM requests-to-SSD requests ratios.

Figure 5.3: Normalized response times under the realistic trace-driven evaluations.


Figure 5.3 shows the normalized response times of the different schemes driven by the five traces when the buffer size is 4096 pages. In our experiment, we use the response time of the LRU scheme as the baseline against which the other schemes are measured. LFU is a cache management scheme which exploits access locality, and we can see that the LFU scheme achieves a similar response time to the LRU scheme. Our MPA scheme reduces the response times by 12.5% to 30.5% compared with the other schemes, especially for Prxy_0 and Rsrch_0, for which the improvements are 30.5% and 25.8% respectively. The reason for such a big improvement in response time is clear: for Prxy_0 and Rsrch_0, the write ratio is high and the request size is large, which means a cache miss brings a high miss penalty overhead. Our MPA scheme achieves the big improvement in response time by avoiding this high miss penalty overhead.

Figure 5.4: The overall cache hit rate of the different cache schemes.


Figure 5.5: The SSD cache hit rate of the different cache schemes.

In order to investigate the relationship between miss penalty and hit rate, we examine the overall cache hit rate and the SSD cache hit rate collected in the simulation study, as shown in Figure 5.4 and Figure 5.5. In Figure 5.4, we can see that the MPA scheme has the lowest overall cache hit rate of the three schemes. The reason is obvious: the other two schemes try exclusively to exploit access locality to improve the cache hit rate, while our MPA scheme focuses on reducing the miss penalty overhead. However, the overall cache hit rate of our MPA scheme is not reduced significantly, because our management scheme finds a suitable balance between miss penalty and hit rate. This also proves that the hit rate is not the only metric for improving


the average response time. As shown in Figure 5.5, compared with the other two schemes, the MPA scheme has the highest SSD cache hit rate under all five real traces, which means the total miss penalty overhead of the MPA scheme is the lowest among the three schemes. The big performance improvement achieved by our MPA scheme also implies that the miss penalty is an influential factor affecting the overall performance.

Figure 5.6: Average response time speedup of MPA and conventional cache management schemes under different PCM requests-to-SSD requests ratios.

5.4.3 Sensitivity study on PCM requests-to-SSD requests ratio

The PCM requests-to-SSD requests ratio is a very important factor influencing cache scheme performance in the hybrid SSD. Figure 5.6 shows the standard average response time speedup under different PCM requests-to-SSD requests ratios. Here, for the different cache schemes, we use the PCM requests-to-SSD requests ratio of 20:80 as the baseline to measure how the average response time speedup varies with the ratio.

In Figure 5.6, we can see that a high PCM requests-to-SSD requests ratio leads to a big performance improvement, due to the high read/write speed and in-place updating properties of PCM. Our MPA scheme benefits a lot when the PCM requests-to-SSD requests ratio varies from 20:80 to 40:60. The reason for such a big improvement is obvious. When the PCM request ratio is low, giving high priority to SSD requests in the page cache causes a lot of cold data to stay in the page cache. In this case, the decreasing hit rate for PCM requests degrades the performance of the cache management scheme. In conclusion, when there are enough PCM requests, our MPA scheme benefits from avoiding the high miss penalty overhead.

5.4.4 Sensitivity study on FIFO-to-LRU ratio

How memory is divided between the LRU list and the FIFO list influences the performance of the MPA scheme. In the experiment, we use a parameter called the "FIFO-to-LRU ratio" to describe the memory allocation between the two lists. Figure 5.7 shows the standard response time speedup under different FIFO-to-LRU ratios. Here, for the different traces, we use the FIFO-to-LRU ratio of 10:90 as the baseline to measure how the response time speedup varies with the ratio. According to Figure 5.7, when the FIFO-to-LRU ratio varies from 10:90 to 50:50, the performance of the MPA scheme improves by more than 13%. The reason for this improvement is that allocating a small memory space to the FIFO list increases the cache miss ratio, because hot pages need enough time in the FIFO list before being promoted to the LRU list. When the FIFO-to-LRU ratio varies from 70:30 to 90:10, the performance of the MPA scheme enters a downtrend. We believe that this downtrend is caused by frequently evicting pages stored in the SSD: when the memory space of the LRU list is small, the miss rate of SSD requests increases, and the miss penalty of those SSD requests degrades the performance of MPA. In our experiment, we use a FIFO-to-LRU ratio of 60:40 as the default.
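As a brief usage note for the earlier MPACache sketch (a hypothetical class, not the simulator's actual interface), the default configuration used in these experiments corresponds to:

# 4096-page cache with the default 60:40 FIFO-to-LRU split (hypothetical sketch API).
cache = MPACache(capacity=4096, fifo_ratio=0.6)
hit = cache.access(page=42, device='ssd')   # False on the first access, True once cached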

Figure 5.7: Standard response time speedup of MPA on five benchmarks with respect to different FIFO-to-LRU ratios.


5.5 Summary

With the rapid development of PCM technology, and thanks to the high read/write speed and high endurance of PCM, the hybrid SSD which uses PCM and SSD at the same storage level has been proposed to improve the performance of SSDs. However, due to the big read latency gap between PCM and SSD, a traditional operating system cannot fully exploit the potential of the hybrid SSD. In this study, we propose a Miss Penalty Aware DRAM-based cache management scheme for the hybrid SSD. Our MPA scheme not only exploits access locality to improve the hit rate, but also assigns higher priority to SSD requests located in the page cache to avoid the high miss penalty overhead. Our experiment results clearly show that our MPA scheme can improve the performance of the hybrid SSD significantly compared with other cache management schemes.


6. Conclusion and Future Work

6.1 Conclusion

Nowadays, flash-based solid state disks are widely adopted in data centers, which makes SSDs attract more and more attention. However, due to the limitations of the SSD structure, there are some issues which act as bottlenecks for the whole system. The first one is the performance issue. Compared with traditional hard disk drives, SSDs perform better. However, existing I/O subsystems are mainly designed for traditional hard disk drives and cannot fully exploit the potential of the SSD. The second one is the reliability issue. Due to the erase-before-write mechanism, in-place updating is forbidden in the SSD. Specifically, before data in a block can be rewritten, the block has to be erased first. Erase operations are triggered by write operations, especially small writes. Frequent erase operations jeopardize the lifetime of an SSD, because each SSD cell can only sustain a limited number of program/erase (P/E) cycles reliably.

In this thesis, targeting the performance issue of SSDs, we revisit the I/O subsystem in the operating system. We reveal that existing I/O schedulers may not be appropriate for SSDs and sometimes even degrade their performance. We also explore some factors which impact the performance of the SSD. Based on our exploration, we propose an SSD-based I/O scheduler called SBIOS. SBIOS fully exploits the characteristics of the SSD. For read requests, SBIOS dispatches them to different blocks to make full use of internal parallelism. For write requests, it tries to dispatch them to the same block to

alleviate the block-crossing penalty. The experiment results show that SBIOS improves the performance of the SSD significantly compared with other I/O schedulers.

Meanwhile, to address the reliability issue of the SSD, we introduce a statistical method to analyze the trace access patterns. We reveal that 15 to 25 percent of the blocks are accessed by nearly 50 percent of the I/O requests. Based on this observation, we introduce a hot data identification mechanism and propose a parallelism and garbage collection aware I/O scheduler called PGIS. In PGIS, we classify hot requests by access frequency. To exploit the channel-level internal parallelism, we issue hot read requests to different channels. To reduce the overhead of garbage collection in terms of P/E cycles, we issue hot write data to the same physical block. The experiment results show that PGIS significantly prolongs the life of the SSD compared with other I/O schedulers.

With the rapid development of non-volatile memory technology, and thanks to the in-place updates, byte addressability, and high endurance of PCM, the hybrid SSD which uses PCM and SSD at the same storage level has been proposed to improve the performance and reliability of SSDs. However, the hybrid SSD poses a new challenge to cache management algorithms. Due to the different miss penalties of PCM and SSD, the higher hit rate targeted by traditional cache management algorithms may not bring higher performance. To solve this issue, we propose a Miss Penalty Aware cache management scheme (short for MPA) which takes the asymmetry of the cache miss penalty on PCM and SSD into consideration. Our MPA scheme not only uses access locality to increase the hit rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty overhead. By combining the miss penalty with the hit rate, our MPA scheme significantly improves the performance of the hybrid storage system.

6.2 Future work

Due to the internal architecture of the SSD, different levels of parallelism are available. Hu et al. [65] classify the internal parallelism into four levels: channel-level, chip-level, die-level, and plane-level. In this thesis, we only exploit the channel-level internal parallelism to improve the performance of the SSD. In future work, we will explore the possibility of exploiting the other levels of internal parallelism to improve the performance of the SSD.

Meanwhile, to prolong the life of the SSD, we introduce a hot data identification mechanism. There is no doubt that reclaiming a block which is full of invalid pages can reduce the overhead of garbage collection. However, due to the limitations of the SSD structure, a suitable wear-leveling algorithm also needs to be applied. In future work, we will explore the interplay between PGIS and different wear-leveling algorithms.

For the open channel hybrid SSD, we are exploring several directions for future work. First, in this thesis, we only implement our MPA scheme in a hybrid SSD simulator. The newly released Linux kernel 4.4 supports open channel SSDs. In future work, we will build a hardware platform to incorporate our MPA scheme, implement our MPA scheme in the Linux kernel, and evaluate its efficiency with benchmarks such as SPEC. Second, we will extend our MPA scheme to a hybrid SSD which uses the PCM as a write cache to improve the performance of the whole storage system.


7. References

[1] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson, "Moneta:

A high-performance storage array architecture for next-generation, non-volatile

memories," in Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International

Symposium on, 2010, pp. 385-395: IEEE.

[2] S. Iyer and P. Druschel, "Anticipatory scheduling: A disk scheduling framework to

overcome deceptive idleness in synchronous I/O," in ACM SIGOPS Operating Systems

Review, 2001, vol. 35, no. 5, pp. 117-130: ACM.

[3] M. P. Dunn and A. N. Reddy, "A new I/O scheduler for solid state devices," Texas A &

M University, 2010.

[4] B. Schroeder and G. A. Gibson, "Disk failures in the real world: What does an mttf of 1,

000, 000 hours mean to you?," in FAST, 2007, vol. 7, no. 1, pp. 1-16.

[5] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Journal-guided

Resynchronization for Software RAID," in FAST, 2005.

[6] D. A. Patterson, G. Gibson, and R. H. Katz, A case for redundant arrays of inexpensive

disks (RAID) (no. 3). ACM, 1988.

[7] T. Perez and C. A. De Rose, "Non-volatile memory: Emerging technologies and their

impacts on memory systems," Porto Alegre, 2010.


[8] F. Chen, R. Lee, and X. Zhang, "Essential roles of exploiting internal parallelism of

flash memory based solid state drives in high-speed data processing," in High

Performance Computer Architecture (HPCA), 2011 IEEE 17th International

Symposium on, 2011, pp. 266-277: IEEE.

[9] Y. J. Yu et al., "Optimizing the block I/O subsystem for fast storage devices," ACM

Transactions on Computer Systems (TOCS), vol. 32, no. 2, p. 6, 2014.

[10] D. M. Jacobson and J. Wilkes, Disk scheduling algorithms based on rotational position.

Hewlett-Packard Laboratories Palo Alto, CA, 1991.

[11] R. Geist and S. Daniel, "A continuum of disk scheduling algorithms," ACM

Transactions on Computer Systems (TOCS), vol. 5, no. 1, pp. 77-92, 1987.

[12] S. Wu, B. Mao, X. Chen, and H. Jiang, "LDM: Log Disk Mirroring with Improved

Performance and Reliability for SSD-Based Disk Arrays," ACM Transactions on

Storage (TOS), vol. 12, no. 4, p. 22, 2016.

[13] B. Mao et al., "HPDA: A hybrid parity-based disk array for enhanced performance and

reliability," ACM Transactions on Storage (TOS), vol. 8, no. 1, p. 4, 2012.

[14] M. Traces. (2008). Available: http://iotta.snia.org/tracetypes/3


[15] U. o. Massachusetts. Storage: Umass trace repository. Available:

http://tinyurl.com/k6golon

[16] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S. Manasse, and R. Panigrahy,

"Design Tradeoffs for SSD Performance," in USENIX Annual Technical Conference,

2008, vol. 8, pp. 57-70.

[17] J. Guo, Y. Hu, and B. Mao, "Enhancing I/O Scheduler Performance by Exploiting

Internal Parallelism of SSDs," in International Conference on Algorithms and

Architectures for Parallel Processing, 2015, pp. 118-130: Springer.

[18] J. Guo, Y. Hu, B. Mao, and S. Wu, "Parallelism and Garbage Collection aware I/O

Scheduler with Improved SSD Performance," in IEEE International Parallel and

Distributed Processing Symposium (IPDPS), 2017, pp. 1184-1193: IEEE.

[19] M. Wang, "Improving Performance And Reliability Of Flash Memory Based Solid State

Storage Systems," University of Cincinnati, 2016.

[20] A. Tal, "Two flash technologies compared: NOR vs NAND," White Paper of M-

SYstems, 2002.

[21] A. Inoue and D. Wong, "NAND flash applications design guide," Toshiba America

Electronic Components Inc, 2004.


[22] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main

memory system using phase-change memory technology," ACM SIGARCH Computer

Architecture News, vol. 37, no. 3, pp. 24-33, 2009.

[23] P. Chi, W.-C. Lee, and Y. Xie, "Making B+-tree efficient in PCM-based main memory,"

in Proceedings of the 2014 international symposium on Low power electronics and

design, 2014, pp. 69-74: ACM.

[24] H. Zhang, J. Fan, and J. Shu, "An OS-level Data Distribution Method in DRAM-PCM

Hybrid Memory," in Conference, 2016, pp. 1-14: Springer.

[25] S. Cho and H. Lee, "Flip-N-Write: A simple deterministic technique to improve PRAM

write performance, energy and endurance," in Microarchitecture, 2009. MICRO-42.

42nd Annual IEEE/ACM International Symposium on, 2009, pp. 347-357: IEEE.

[26] D. Liu, T. Wang, Y. Wang, Z. Qin, and Z. Shao, "A block-level flash memory

management scheme for reducing write activities in PCM-based embedded systems," in

Proceedings of the Conference on Design, Automation and Test in Europe, 2012, pp.

1447-1450: EDA Consortium.

[27] A. R. Olson and D. J. Langlois, "Solid state drives data reliability and lifetime," Imation

White Paper, 2008.


[28] N. Flash, "An Introduction to NAND Flash and How to Design It In to Your Next

Product."

[29] S. Zertal, "Exploiting the Fine Grain SSD Internal Parallelism for OLTP and Scientific

Workloads," in High Performance Computing and Communications, , 2014, pp. 990-

997: IEEE.

[30] A. Ban, "Flash file system," ed: Google Patents, 1995.

[31] C. Intel, "Understanding the flash translation layer (FTL) specification," ed, 1998.

[32] E. Gal and S. Toledo, "Algorithms and data structures for flash memories," ACM

Computing Surveys (CSUR), vol. 37, no. 2, pp. 138-163, 2005.

[33] A. Gupta, Y. Kim, and B. Urgaonkar, DFTL: a flash translation layer employing

demand-based selective caching of page-level address mappings (no. 3). ACM, 2009.

[34] M.-L. Chiang and R.-C. Chang, "Cleaning policies in mobile computers using flash

memory," Journal of Systems and Software, vol. 48, no. 3, pp. 213-231, 1999.

[35] P.-L. Wu, Y.-H. Chang, and T.-W. Kuo, "A file-system-aware FTL design for flash-

memory storage systems," in Proceedings of the Conference on Design, Automation and

Test in Europe, 2009, pp. 393-398: European Design and Automation Association.


[36] C. Park, W. Cheon, J. Kang, K. Roh, W. Cho, and J.-S. Kim, "A reconfigurable FTL

(flash translation layer) architecture for NAND flash-based applications," ACM

Transactions on Embedded Computing Systems (TECS), vol. 7, no. 4, p. 38, 2008.

[37] L. M. Caulfield, Symbiotic Solid State Drives: Management of Modern NAND Flash

Memory. University of California, San Diego, 2013.

[38] D. Jung, J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee, "Superblock FTL: A superblock-based

flash translation layer with a hybrid address translation scheme," ACM Transactions on

Embedded Computing Systems (TECS), vol. 9, no. 4, p. 40, 2010.

[39] J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee, "A superblock-based flash translation layer for

NAND flash memory," in Proceedings of the 6th ACM & IEEE International

conference on Embedded software, 2006, pp. 161-170: ACM.

[40] S. Wells, R. N. Hasbun, and K. Robinson, "Sector-based storage device emulator having

variable-sized sector," ed: Google Patents, 1998.

[41] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, "A space-efficient flash

translation layer for CompactFlash systems," IEEE Transactions on Consumer

Electronics, vol. 48, no. 2, pp. 366-375, 2002.


[42] S.-W. Lee, D.-J. Park, T.-S. Chung, D.-H. Lee, S. Park, and H.-J. Song, "A log buffer-

based flash translation layer using fully-associative sector translation," ACM

Transactions on Embedded Computing Systems (TECS), vol. 6, no. 3, p. 18, 2007.

[43] S. Lee, D. Shin, Y.-J. Kim, and J. Kim, "LAST: locality-aware sector translation for

NAND flash memory-based storage systems," ACM SIGOPS Operating Systems Review,

vol. 42, no. 6, pp. 36-42, 2008.

[44] M. Rosenblum and J. K. Ousterhout, "The design and implementation of a log-

structured file system," ACM Transactions on Computer Systems (TOCS), vol. 10, no. 1,

pp. 26-52, 1992.

[45] D. Ma, J. Feng, and G. Li, "LazyFTL: a page-level flash translation layer optimized for

NAND flash memory," in Proceedings of the 2011 ACM SIGMOD International

Conference on Management of data, 2011, pp. 1-12: ACM.

[46] H. Cho, D. Shin, and Y. I. Eom, "KAST: K-associative sector translation for NAND

flash memory in real-time systems," in Proceedings of the Conference on Design,

Automation and Test in Europe, 2009, pp. 507-512: European Design and Automation

Association.

[47] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, "Circuit and microarchitecture

evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory


replacement," in Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE,

2008, pp. 554-559: IEEE.

[48] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked

MRAM L2 cache for CMPs," in High Performance Computer Architecture, 2009.

HPCA 2009. IEEE 15th International Symposium on, 2009, pp. 239-249: IEEE.

[49] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Hybrid cache

architecture with disparate memory technologies," in ACM SIGARCH computer

architecture news, 2009, vol. 37, no. 3, pp. 34-45: ACM.

[50] W. Zhang and T. Li, "Exploring phase change memory and 3D die-stacking for

power/thermal friendly, fast and durable memory architectures," in Parallel

Architectures and Compilation Techniques, 2009. PACT'09. 18th International

Conference on, 2009, pp. 101-112: IEEE.

[51] J. K. Kim, H. G. Lee, S. Choi, and K. I. Bahng, "A PRAM and NAND flash hybrid

architecture for high-performance embedded storage subsystems," in Proceedings of the

8th ACM international conference on Embedded software, 2008, pp. 31-40: ACM.

[52] Y. Park, S.-H. Lim, C. Lee, and K. H. Park, "PFFS: a scalable flash memory file system

for the hybrid architecture of phase-change RAM and NAND flash," in Proceedings of

the 2008 ACM symposium on Applied computing, 2008, pp. 1498-1503: ACM.


[53] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a

scalable dram alternative," in ACM SIGARCH Computer Architecture News, 2009, vol.

37, no. 3, pp. 2-13: ACM.

[54] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory

using phase change memory technology," in ACM SIGARCH computer architecture

news, 2009, vol. 37, no. 3, pp. 14-23: ACM.

[55] C. Lam, "Cell design considerations for phase change memory as a universal memory,"

in VLSI Technology, Systems and Applications, 2008. VLSI-TSA 2008. International

Symposium on, 2008, pp. 132-133: IEEE.

[56] G. Sun, Y. Joo, Y. Chen, Y. Chen, and Y. Xie, "A hybrid solid-state storage architecture

for the performance, energy consumption, and lifetime improvement," in Emerging

Memory Technologies: Springer, 2014, pp. 51-77.

[57] K. Kim, S.-W. Lee, B. Moon, C. Park, and J.-Y. Hwang, "IPL-P: In-page logging with

PCRAM," Proceedings of the VLDB Endowment, vol. 4, no. 12, pp. 1363-1366, 2011.

[58] Z. Li et al., "A user-visible solid-state storage system with software-defined fusion

methods for PCM and NAND flash," Journal of Systems Architecture, vol. 71, pp. 44-

61, 2016.


[59] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson,

"Providing safe, user space access to fast, solid state disks," ACM SIGARCH Computer

Architecture News, vol. 40, no. 1, pp. 387-400, 2012.

[60] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka, "Write amplification analysis

in flash-based solid state drives," in Proceedings of SYSTOR 2009: The Israeli

Experimental Systems Conference, 2009, p. 10: ACM.

[61] J. Kim, S. Seo, D. Jung, J.-S. Kim, and J. Huh, "Parameter-aware I/O management for

solid state disks (SSDs)," IEEE Transactions on Computers, vol. 61, no. 5, pp. 636-649,

2012.

[62] S. Park and K. Shen, "FIOS: a fair, efficient flash I/O scheduler," in FAST, 2012, p. 13.

[63] S. A. Chamazcoti, S. G. Miremadi, and H. Asadi, "On endurance of erasure codes in

SSD-based storage systems," in Computer Architecture and Digital Systems (CADS),

2013 17th CSI International Symposium on, 2013, pp. 67-72: IEEE.

[64] J. Axboe. (2015). fio-2.26(software package).

[65] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang, "Performance impact and

interplay of SSD parallelism through advanced commands, allocation strategy and data

granularity," in Proceedings of the international conference on Supercomputing, 2011,

pp. 96-107: ACM.


[66] F. Chen, D. A. Koufaty, and X. Zhang, "Understanding intrinsic characteristics and

system implications of flash memory based solid state drives," in ACM SIGMETRICS

Performance Evaluation Review, 2009, vol. 37, no. 1, pp. 181-192: ACM.

[67] K. Shen and S. Park, "FlashFQ: A Fair Queueing I/O Scheduler for Flash-Based SSDs,"

in USENIX Annual Technical Conference, 2013, pp. 67-78.

[68] H. Wang, P. Huang, S. He, K. Zhou, C. Li, and X. He, "A novel I/O scheduler for SSD

with improved performance and lifetime," in Mass Storage Systems and Technologies

(MSST), 2013 IEEE 29th Symposium on, 2013, pp. 1-5: IEEE.

[69] J. Kim, Y. Oh, E. Kim, J. Choi, D. Lee, and S. H. Noh, "Disk schedulers for solid state

drivers," in Proceedings of the seventh ACM international conference on Embedded

software, 2009, pp. 295-304: ACM.

[70] Y. Luo, Y. Cai, S. Ghose, J. Choi, and O. Mutlu, "WARM: Improving NAND flash

memory lifetime with write-hotness aware retention management," in Mass Storage

Systems and Technologies (MSST), 2015 31st Symposium on, 2015, pp. 1-14: IEEE.

[71] M. Bjørling, J. Axboe, D. Nellans, and P. Bonnet, "Linux block IO: introducing multi-

queue SSD access on multi-core systems," in Proceedings of the 6th international

systems and storage conference, 2013, p. 22: ACM.


[72] S. Wu, Y. Lin, B. Mao, and H. Jiang, "GCaR: Garbage collection aware cache

management with improved performance for flash-based SSDs," in Proceedings of the

2016 International Conference on Supercomputing, 2016, p. 28: ACM.

[73] B. Mao and S. Wu, "Exploiting request characteristics and internal parallelism to

improve SSD performance," in Computer Design (ICCD), 2015 33rd IEEE

International Conference on, 2015, pp. 447-450: IEEE.

[74] B. Mao, H. Jiang, S. Wu, Y. Yang, and Z. Xi, "Elastic data compression with improved

performance and space efficiency for flash-based storage systems," in Parallel and

Distributed Processing Symposium (IPDPS), 2017 IEEE International, 2017, pp. 1109-

1118: IEEE.

[75] Z. Chen, N. Xiao, and F. Liu, "SAC: Rethinking the cache replacement policy for SSD-

based storage systems," in Proceedings of the 5th Annual International Systems and

Storage Conference, 2012, p. 13: ACM.

[76] C. Gao, L. Shi, M. Zhao, C. J. Xue, K. Wu, and E. H.-M. Sha, "Exploiting parallelism in

I/O scheduling for access conflict minimization in flash-based solid state drives," in

Mass Storage Systems and Technologies (MSST), 2014 30th Symposium on, 2014, pp. 1-

11: IEEE.


[77] D. Park and D. H. Du, "Hot data identification for flash-based storage systems using

multiple bloom filters," in Mass Storage Systems and Technologies (MSST), 2011 IEEE

27th Symposium on, 2011, pp. 1-11: IEEE.

[78] M. Jung, W. Choi, S. Srikantaiah, J. Yoo, and M. T. Kandemir, "HIOS: A host interface

I/O scheduler for solid state disks," ACM SIGARCH Computer Architecture News, vol.

42, no. 3, pp. 289-300, 2014.

[79] W. Wang, N. Niu, H. Liu, and Y. Wu, "Tagging in assisted tracing," in Proceedings of

the 8th International Symposium on Software and Systems Traceability, 2015, pp. 8-14:

IEEE Press.

[80] P. Louridas, D. Spinellis, and V. Vlachos, "Power laws in software," ACM Transactions

on Software Engineering and Methodology (TOSEM), vol. 18, no. 1, p. 2, 2008.

[81] K. Shah, A. Mitra, and D. Matani, "An O (1) algorithm for implementing the LFU cache

eviction scheme," Technical report, 2010.

[82] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S. Manasse, and R. Panigrahy,

"Design tradeoffs for SSD performance," in USENIX Annual Technical Conference,

2008, vol. 57.

[83] F. Chen, B. Hou, and R. Lee, "Internal parallelism of flash memory-based solid-state

drives," ACM Transactions on Storage (TOS), vol. 12, no. 3, p. 13, 2016.


[84] L.-P. Chang and T.-W. Kuo, "An adaptive striping architecture for flash memory

storage systems of embedded systems," in Real-Time and Embedded Technology and

Applications Symposium, 2002. Proceedings. Eighth IEEE, 2002, pp. 187-196: IEEE.

[85] J. E. Denny, S. Lee, and J. S. Vetter, "NVL-C: Static analysis techniques for efficient,

correct programming of non-volatile main memory systems," in Proceedings of the 25th

ACM International Symposium on High-Performance Parallel and Distributed

Computing, 2016, pp. 125-136: ACM.

[86] J. Kim, S. Lee, and J. S. Vetter, "PapyrusKV: a high-performance parallel key-value

store for distributed NVM architectures," in Proceedings of the International

Conference for High Performance Computing, Networking, Storage and Analysis, 2017,

p. 57: ACM.

[87] Q. Liu and C. Jung, "Lightweight hardware support for transparent consistency-aware

checkpointing in intermittent energy-harvesting systems," in Non-Volatile Memory

Systems and Applications Symposium (NVMSA), 2016 5th, 2016, pp. 1-6: IEEE.

[88] M. Tarihi, H. Asadi, A. Haghdoost, M. Arjomand, and H. Sarbazi-Azad, "A hybrid non-

volatile cache design for solid-state drives using comprehensive I/O characterization,"

IEEE Transactions on Computers, vol. 65, no. 6, pp. 1678-1691, 2016.

[89] M. A. Roger, Y. Xu, and M. Zhao, "BigCache for big-data systems," in 2014 IEEE

International Conference on Big Data (Big Data),, 2014, pp. 189-194: IEEE.


[90] W. Li, G. Jean-Baptise, J. Riveros, G. Narasimhan, T. Zhang, and M. Zhao,

"CacheDedup: In-line Deduplication for Flash Caching," in FAST, 2016, pp. 301-314.

[91] R. Koller, L. Marmol, R. Rangaswami, S. Sundararaman, N. Talagala, and M. Zhao,

"Write policies for host-side flash caches," in FAST, 2013, pp. 45-58.

[92] X. Jimenez, D. Novo, and P. Ienne, "Wear unleveling: improving NAND flash lifetime

by balancing page endurance," in FAST, 2014, vol. 14, pp. 47-59.

[93] N. Elyasi, M. Arjomand, A. Sivasubramaniam, M. T. Kandemir, C. R. Das, and M. Jung,

"Exploiting intra-request slack to improve ssd performance," in Proceedings of the

Twenty-Second International Conference on Architectural Support for Programming

Languages and Operating Systems, 2017, pp. 375-388: ACM.

[94] Y. Lu, J. Shu, J. Guo, and P. Zhu, "Supporting system consistency with differential

transactions in flash-based SSDs," IEEE Transactions on Computers, vol. 65, no. 2, pp.

627-639, 2016.

[95] H. Wang, J. Zhang, S. Shridhar, G. Park, M. Jung, and N. S. Kim, "DUANG: Fast and

lightweight page migration in asymmetric memory systems," in High Performance

Computer Architecture (HPCA), 2016 IEEE International Symposium on, 2016, pp.

481-493: IEEE.


[96] S. Mittal, J. S. Vetter, and D. Li, "A survey of architectural approaches for managing

embedded DRAM and non-volatile on-chip caches," IEEE Transactions on Parallel and

Distributed Systems, vol. 26, no. 6, pp. 1524-1537, 2015.
