Novel Methods for Improving Performance and Reliability of Flash-Based Solid State Storage System
A dissertation submitted to the Graduate School of the University of Cincinnati in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in the Department of Electrical Engineering and Computer Science of the College of Engineering and Applied Science by
Jiayang Guo, B.S.
University of Cincinnati
February 2018
Committee Chair: Yiming Hu, Ph.D.
Abstract
Though SSDs outperform traditional magnetic-based storage devices, there is still
potential for further performance improvements. In existing operating systems, the
software I/O stack is designed considering the working mechanisms of the traditional
magnetic-based hard drives. Therefore, it has been shown that the existing I/O software
layer can cause additional operational overheads for flash-based SSDs [1]. To address this problem, we explore the factors that cause variation in SSD performance. Based on our observations, we propose an SSD-based I/O scheduler called SBIOS that fully exploits internal parallelism to improve performance. It dispatches read requests to different blocks to make full use of the SSD's internal parallelism. For write requests, it attempts to dispatch them to the same block to minimize the number of block-crossing requests. Moreover, SBIOS introduces the concept of batch processing and separates read and write requests to avoid read-write interference.
In addition, SSDs face reliability challenges due to the physical properties of flash memory. To address these challenges, we propose a parallelism and garbage collection aware I/O scheduler called PGIS, which identifies hot data based on trace characteristics and exploits the channel-level internal parallelism of flash-based storage systems. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot data identification mechanism to reduce garbage collection overhead. By dispatching hot read data to different channels, the channel-level internal parallelism is fully exploited.
By dispatching hot write data to the same physical block, the garbage collection overhead is alleviated. The experimental results show that these methods significantly improve the reliability and performance of the SSD. In this research, the total number of erase operations is used to measure SSD reliability.
Meanwhile, with the rapid development of non-volatile memory technology, many hybrid storage structures that place PCM and SSD at the same storage level have been proposed to improve SSD performance, owing to PCM's high read/write speed, high endurance, and support for in-place updates. However, hybrid storage systems pose a new challenge to cache management algorithms. Existing DRAM-based cache management schemes are optimized only to reduce the miss rate. On a miss, the cache needs to access either the PCM or the SSD, and there are major differences between the access times of the two technologies. As a result, in such a hybrid system, a higher hit rate does not necessarily translate to higher performance. To address this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) that takes the asymmetry of cache miss penalties on PCM and SSD into consideration. Our MPA scheme not only uses access locality to reduce the miss rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty. Our experimental results show that MPA can improve hybrid storage system performance by up to 30.5% compared with other cache management schemes.
Copyright
Acknowledgements
The Ph.D. study is a long journey, both interesting and challenging. On this journey, I met challenges and overcame them every day. Through this process, I came to know my strengths and weaknesses, and used that knowledge to make myself better. Throughout this adventure, many brilliant, supportive people have been with me.
First and foremost, I want to thank my advisor, Dr. Yiming Hu. It has been an honor to be his Ph.D. student and to work with him. In my experience, he cares so much about his students that he always comes through for me when I need help. Whenever I felt confused in my research, he could provide insightful advice in just a few words. I cannot enumerate how many things I have learned from him; there are too many. To me, the most important include working in a serious and responsible manner, working hard, and paying attention to detail. I believe I will benefit from these habits for the rest of my life. I truly appreciate all his contributions of time, effort, ideas, trust, and funding that made my Ph.D. experience productive and stimulating.
I would like to thank all the professors on my committee for their support and guidance. Dr. Jing Xiang gave me tremendous help, especially in the first year of my Ph.D. He cares about me deeply and always shares his personal experience of research and life. I want to thank Dr. Wen-Ben Jone for his time, effort, and encouragement; from him I learned to be more confident in my research and work. I also want to
thank Dr. Raj Bhatnagar for his guidance and rigor in research, as well as for pointing out weaknesses for me to improve. Finally, I want to thank Dr. Carla Purdy for her approval of and effort on my thesis in improving the quality of my work. From her I learned to be serious about and responsible for research and work.
My gratitude also goes to my friends and colleagues: Wentao Wang, Xining Li, Xiaobang Liu, Suyuan Chen, Minyang Wang, Vadambacheri Manian Karthik and his wife Balasubramanian Sanchayeni, and many others who have helped and supported me in every aspect.
Lastly, I would like to thank my mother for her love, encouragement, and support. She has sacrificed a lot to raise me, educate me, and fund me in achieving my goals. My wholehearted gratitude is beyond words. Thank you!
Table of Contents
1. CHAPTER 1 INTRODUCTION ...... 1
1.1 HDDS VS SSDS ...... 1
1.2 PROBLEM DESCRIPTION ...... 6
1.3 CONTRIBUTION ...... 8
1.4 ORGANIZATION OF DISSERTATION ...... 12
2. CHAPTER 2 BACKGROUND ...... 15
2.1 NON-VOLATILE MEMORY ...... 15
2.2 INTERNAL STRUCTURE OF SSD ...... 18
2.3 FLASH TRANSLATION LAYER ...... 21
2.4 HYBRID SSD STRUCTURE ...... 26
3. CHAPTER 3 A SSD-BASED I/O SCHEDULER ...... 28
3.1 ANALYZING THE CHARACTERISTICS OF SSD ...... 28
3.2 BACKGROUND ...... 29
3.2.1 Request size ...... 29
3.2.2 Read-write interference ...... 31
3.2.3 Internal parallelism ...... 33
3.3 SYSTEM DESIGN AND IMPLEMENTATION ...... 35
3.3.1 System overview ...... 35
3.3.2 Dispatching method ...... 36
3.3.3 Algorithm process ...... 38
3.4 EXPERIMENTAL EVALUATION AND ANALYSIS ...... 39
3.4.1 Experiment setup ...... 40
3.4.2 Performance results and analysis ...... 40
3.5 RELATED WORK ...... 42
3.6 SUMMARY ...... 44
4. CHAPTER 4 A PARALLELISM AND GARBAGE COLLECTION AWARE I/O SCHEDULER ...... 45
4.1 INTRODUCTION ...... 45
4.2 MOTIVATION AND BACKGROUND ...... 47
4.2.1 The main bottleneck in the SSDs ...... 47
4.2.2 Internal parallelism ...... 48
4.2.3 Trace access characteristics ...... 50
4.2.4 Motivation ...... 51
4.3 SYSTEM DESIGN AND IMPLEMENTATION ...... 52
4.3.1 System overview ...... 52
4.3.2 Trace access pattern analysis ...... 55
4.3.3 Hot data identification mechanism ...... 56
4.3.4 Dispatching method ...... 58
4.4 EXPERIMENTAL EVALUATION ...... 61
4.4.1 Experiment setup ...... 62
4.4.2 Performance analysis ...... 63
4.4.3 Improved garbage collection efficiency ...... 66
4.4.4 Sensitivity study on channel number ...... 69
4.4.5 Sensitivity study on over provision ratio ...... 70
4.5 RELATED WORK ...... 72
4.6 SUMMARY ...... 74
5. CHAPTER 5 MISS PENALTY AWARE CACHE MANAGEMENT SCHEME .... 75
5.1 THE OPPORTUNITY FOR PCM ...... 75
5.2 BACKGROUND AND MOTIVATION ...... 77
5.2.1 PCM vs SSD ...... 78
5.2.2 Motivation ...... 79
5.3 THE DESIGN DETAIL OF MPA ...... 82
5.3.1 System overview ...... 83
5.3.2 The balance between Miss Penalty and Miss Rate ...... 84
5.3.3 Cache management algorithm ...... 87
5.4 EXPERIMENTAL EVALUATION ...... 88
5.4.1 Experiment setup ...... 88
5.4.2 Performance analysis ...... 90
5.4.3 Sensitivity study on PCM requests-to-SSD requests ratio ...... 94
5.4.4 Sensitivity study on FIFO-to-LRU ratio ...... 95
5.5 SUMMARY ...... 97
6. CHAPTER 6 CONCLUSION AND FUTURE WORK ...... 98
6.1 CONCLUSION ...... 98
6.2 FUTURE WORK ...... 100
7. REFERENCES ...... 102
List of Tables
Table 1.1: Comparison of existing storage technologies [7]. ...... 2
Table 3.1: Workload attributes. ...... 40
Table 4.1: Workload access pattern power laws. ...... 55
Table 4.2: Configuration Parameters of the SSD Simulator. ...... 61
Table 4.3: The characteristics of the traces. ...... 64
Table 5.1: Comparison among SSD and PCM. ...... 79
Table 5.2: Configuration Parameters of the hybrid SSD Simulator. ...... 89
Table 5.3: The characteristics of the traces. ...... 90
List of Figures
Figure 1.1: An illustration of SSD internal architecture, adapted from [8]. ...... 4
Figure 2.1: PCM cell structure. ...... 16
Figure 2.2: Flash memory cell structure. ...... 17
Figure 2.3: Basic block structure of flash memory, adapted from [29]. ...... 19
Figure 2.4: The major components of flash memory. ...... 20
Figure 2.5: Block-level FTL management scheme, adapted from [37]. ...... 22
Figure 2.6: Three cases when log block and data block are collected during garbage collection. ...... 24
Figure 3.1: The response time comparison between SSD and HD under different request size (4KB-1MB). ...... 30
Figure 3.2: The response time of random reads under different size with writes in concurrent execution on Intel X25-E. ...... 32
Figure 3.3: IOPS for the 4K-1M random read on Intel X25-E. ...... 34
Figure 3.4: The overview of SBIOS. ...... 36
Figure 3.5: Flow chart of algorithm process. ...... 38
Figure 3.6: Benchmark performance comparison under different I/O schedulers. ...... 42
Figure 4.1: Channel level internal parallelism in the SSDs. ...... 48
Figure 4.2: The hot data ratio of different traces. ...... 50
Figure 4.3: System overview of PGIS. Here RFQ represents read frequency queue and WFQ represents write frequency queue. ...... 53
Figure 4.4: Distribution of read and write access frequency in Fin2 trace. ...... 54
Figure 4.5: The basic data structure of the frequency queue. ...... 57
Figure 4.6: Illustration of additional garbage collection overhead due to traditional channel level internal parallelism dispatching method. ...... 60
Figure 4.7: Benchmark performance comparison under different I/O schedulers. ...... 65
Figure 4.8: Comparison for the overhead of garbage collection under different I/O schedulers. ...... 66
Figure 4.9: Standard response time speedup on seven benchmarks with respect to different channel number. ...... 68
Figure 4.10: Standard response time speedup on seven benchmarks with respect to different over provision ratio. ...... 71
Figure 5.1: The overview of MPA on the I/O path. ...... 82
Figure 5.2: The system overview of MPA scheme. ...... 85
Figure 5.3: The normalized response times under the realistic trace-driven evaluations. ...... 91
Figure 5.4: The overall cache hit rate of the different cache schemes. ...... 92
Figure 5.5: The SSD cache hit rate of the different cache schemes. ...... 93
Figure 5.6: Average response time speedup of MPA and conventional cache management schemes under different PCM requests-to-SSD requests ratio. ...... 94
Figure 5.7: Standard response time speedup of MPA on five benchmarks with respect to different FIFO-to-LRU ratio. ...... 96
1. Chapter 1 Introduction
1.1 HDDs vs SSDs
Compared with SSD technology, hard disk drive (HDD) technology is relatively old. Before SSDs were ready for business applications, HDDs were the main devices in storage systems. However, due to their mechanical characteristics, HDDs have become a bottleneck that degrades the performance of the whole system. It is well known that HDDs are good at dealing with large files stored in contiguous blocks: the moving head can complete its read in one continuous motion. Existing kernel I/O schedulers, such as the anticipatory I/O scheduler [2], are designed to exploit this feature. When scheduling incoming requests, the anticipatory I/O scheduler pauses for a short time after dispatching a read request, anticipating that the next read request will fall close by on the same disk track. Even though the anticipation sometimes guesses incorrectly, if it is even moderately successful it saves numerous expensive seek operations [3]. As time goes on, however, when HDDs start to fill up, large files can become scattered around the disk. In that case, the seek latency caused by moving the mechanical arm across disk cylinders increases significantly, and the performance of the whole storage system degrades.
Besides the fragmentation issue, HDDs also have reliability issues [4, 5], caused mainly by the wear and tear of mechanical components. In order to prevent data loss, RAID techniques [5, 6] have been proposed to improve
data reliability by incorporating data redundancy. For SSDs, the reliability issue is totally different. SSDs have no mechanical parts, which means they are more likely to keep data safe if the system is shaken while operating. Moreover, SSDs do not have to expend electricity spinning up a platter, so they consume less energy than HDDs. Due to the superior performance and lower power consumption delivered by SSDs, as depicted in Table 1.1, SSDs have been widely adopted in storage systems.
Table 1.1: Comparison of existing storage technologies [7].
                        SRAM      DRAM      HDD          SSD
Read Latency            <10 ns    10-60 ns  8.5 ms       25 us
Write Latency           <10 ns    10-60 ns  9.5 ms       200 us
Energy per bit access   >1 pJ     2 pJ      100-1000 mJ  10 nJ
Endurance               >10^15    >10^15    >10^15       10^4
Non-volatility          No        No        Yes          Yes
Price is the main factor restricting the widespread use of SSDs. Compared with HDDs, SSDs are more expensive in terms of dollars per gigabyte. According to recent statistics from Newegg Business, HDDs cost 4 to 5 cents per gigabyte, while SSDs cost around 25 cents per gigabyte. Such a five-fold price difference can push the whole storage system over budget, especially for multimedia users, who require ever more capacity, with drives over 1 TB common in high-end systems. To reduce this cost, multi-level cell (MLC) technology has been developed, which largely reduces the cost of SSDs. SSDs are therefore becoming an attractive replacement for conventional hard disks, and many manufacturers currently offer an SSD as an optional upgrade for the HDD on high-end laptops. An SSD-equipped laptop will boot in less than a minute, often in just seconds, an extra speed that users appreciate.
With the rapid development of SSD technology, the last decade has witnessed a sharp decline in the retail price of flash-based SSDs. Combined with their superior performance, SSDs are gradually replacing traditional hard drives in the era of high performance computing. Naturally, the performance and reliability issues of SSDs have become increasingly prominent.
As SSDs become increasingly popular, it becomes necessary to examine the differences between flash-based solid state disks and traditional mechanical hard drives. For instance, while the read speeds of flash-based solid state disks are generally much faster than those of traditional mechanical hard disks, the improvement for write operations is far less significant. Since SSDs have no mechanical parts, they have no seek times in the traditional sense, but there are still many operational overheads that need to be taken into account [3]. Examining the internal architecture of SSDs shows that parallelism is available at different levels, and operations at each level can be parallelized.
Figure 1.1 shows an example of a possible internal architecture for an experimental flash-based disk.
Figure 1.1: An illustration of SSD internal architecture, adapted from [8].
Figure 1.1 illustrates the internal architecture typical of flash-based SSDs. A number of flash memory packages are connected to flash channels, each of which can be accessed independently and simultaneously. The number of flash channels connecting the packages to the SSD controller determines the degree of parallelism, and hence the raw maximum throughput of the SSD device. Such a highly parallelized structure provides great potential for internal parallelism, which acts at several levels. These different levels of parallelization are available in most SSD
architectures, leading to greatly improved performance. In general, channel-level internal parallelism is the most commonly exploited form of internal parallelism in SSD research.
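To make the idea of channel-level parallelism concrete, the following is a minimal sketch of striping independent requests across flash channels so they can be serviced concurrently. The channel count and request format are illustrative assumptions, not a real SSD interface.

```python
# Hypothetical sketch: assigning independent requests to flash channels
# in round-robin order, so consecutive requests land on different
# channels and can proceed in parallel.

def dispatch_round_robin(requests, num_channels):
    """Return a dict mapping channel id -> list of requests."""
    channels = {c: [] for c in range(num_channels)}
    for i, req in enumerate(requests):
        channels[i % num_channels].append(req)
    return channels

queues = dispatch_round_robin(["r0", "r1", "r2", "r3", "r4"], num_channels=4)
# r0 and r4 share channel 0; r1-r3 occupy channels 1-3 and can run in parallel.
```

A real controller would also account for channel busy status and request dependencies; round-robin is only the simplest way to spread load.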
Many layers in the I/O subsystem have been designed with explicit assumptions about the underlying physical storage [3]. Although designs vary, most assumptions are based on the most popular storage device, the traditional magnetic disk, which rotates mechanical parts to access data. Though SSDs are designed to work with the existing access mechanisms, their performance may suffer under such circumstances. In order to take full advantage of the potential of solid state storage devices, it is desirable to reconsider some of these assumptions and redesign the I/O subsystem for flash-based storage systems.
The garbage collection mechanism is built on erase operations, and the basic unit of an erase operation is a block. Therefore, before a block is erased, its valid pages must be moved to another empty location. An important consideration for improving erase efficiency is thus to minimize operational latency by reducing the number of page migrations, so as to avoid delaying I/O requests. Selecting the block with the minimum number of valid pages for cleaning is one of the most commonly used cleaning policies; at best, if we can find blocks with no valid pages, there is no page migration overhead at all. This is where a hot data identification mechanism comes in. Specifically, separating hot and cold data helps reduce the number of page migrations: blocks holding hot pages are likely to be full of invalid pages in the near future, while blocks holding cold pages are likely to suffer a high page migration cost.
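The greedy cleaning policy described above can be sketched in a few lines. The `Block` class below is a simplified stand-in for illustration, not the dissertation's simulator code.

```python
# Illustrative sketch of the greedy garbage-collection policy: pick the
# block with the fewest valid pages as the erase victim, since its valid
# pages are the ones that must be migrated before the erase.

class Block:
    def __init__(self, block_id, valid_pages):
        self.block_id = block_id
        self.valid_pages = valid_pages  # pages that must be copied out

def select_victim(blocks):
    """Greedy policy: the block with the minimum number of valid pages
    costs the fewest page migrations to reclaim."""
    return min(blocks, key=lambda b: b.valid_pages)

blocks = [Block(0, 12), Block(1, 0), Block(2, 7)]
victim = select_victim(blocks)
# Block 1 has no valid pages: erasing it incurs zero migration overhead.
```

If hot data has been clustered into the same blocks, such zero-migration victims appear far more often, which is exactly the motivation for hot/cold separation.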
The overall objective of this research is to develop novel methods for improving the performance and reliability of flash-based solid state storage systems. The topics we study include exploring the factors that influence SSD performance and reliability, and mining the potential of SSDs based on their unique structure.
1.2 Problem Description
In the last decade, with the rapid development of technology, emerging applications, including Internet-scale online services and big data analysis software, have required better storage I/O performance to support their work [9]. Especially now that flash-based SSDs are widely used in storage systems, their different characteristics drive us to explore new possibilities in software technology.

The block I/O subsystem is the fruit of software optimization for making efficient use of the underlying storage device [10, 11]. The previous block I/O subsystem was designed entirely around the characteristics of HDDs, since storage systems composed of HDDs could meet user requirements in the past. In recent years, however, data services deployed on the web, on servers, in mail systems, and in data centers require high throughput and low latency. To guarantee instant responsiveness, flash-based SSDs have been introduced as the new storage device. Thanks to their semiconductor nature, flash-based SSDs transfer data to and from the operating system without moving a disk head back and forth; the data size is the main factor that influences SSD performance in terms of response time. The previous optimizations in the block I/O subsystem mainly focus on reducing the
seek overhead of HDDs. Such optimizations may not achieve the peak performance of flash-based SSDs and may even decrease the performance of the whole storage system. To solve this issue, this thesis investigates the factors influencing SSD performance and proposes an SSD-based block I/O scheduler to achieve the peak performance of the SSD.
With the development of MLC technology, the cost of SSDs has declined significantly. The declining price and high performance have made SSDs more and more popular. However, every design is a tradeoff: while MLC technology reduces the cost of SSDs, it also brings a reliability problem. The lifetime of an MLC NAND SSD is usually about ten times shorter than that of an SLC NAND SSD, although its cost is two to four times lower. Meanwhile, due to the erase-before-write mechanism, in-place updates are not allowed in MLC NAND SSDs. Instead, incoming data is written to a new clean location, and the old data is marked as invalid. As time goes on, when a block fills with valid and invalid pages, it is reclaimed by an erase operation. However, each block can endure only a limited number of erase operations; if that limit is exceeded, the block becomes unreliable. To address this reliability issue, we introduce a hot data identification mechanism and propose a garbage collection aware I/O scheduler.
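The out-of-place update behavior described above can be sketched with a toy page-level mapping table. The `ToyFTL` class and its fields are assumptions made for illustration only, not an actual flash translation layer.

```python
# Minimal sketch of out-of-place updates: an overwrite goes to a fresh
# page and the old page is only marked invalid, to be reclaimed later by
# garbage collection.

class ToyFTL:
    def __init__(self):
        self.mapping = {}        # logical page -> physical page
        self.invalid = set()     # physical pages awaiting erase
        self.next_free = 0       # next clean physical page

    def write(self, logical_page):
        if logical_page in self.mapping:
            # Erase-before-write forbids in-place update: invalidate old copy.
            self.invalid.add(self.mapping[logical_page])
        self.mapping[logical_page] = self.next_free
        self.next_free += 1

ftl = ToyFTL()
ftl.write(5)   # first write of logical page 5 lands on physical page 0
ftl.write(5)   # update: physical page 0 is invalidated, data goes to page 1
```

As updates accumulate, the invalid set grows and blocks must eventually be erased, which is the source of the endurance concern discussed above.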
Meanwhile, with the rapid development of non-volatile memory technology, we have more options for improving the performance of SSDs. Phase change memory (PCM for short) is one of the most promising non-volatile memories; instead of using electrical charge to store information, it stores bit values in the physical state of a
chalcogenide material (e.g., Ge2Sb2Te5, i.e., GST). Due to the high read/write speed, high endurance, and non-volatility of PCM, many hybrid storage structures have been proposed to improve the performance of SSDs. Some of these structures exploit PCM and SSD at the same storage level. While most of them try to reduce the write amplification of SSDs and improve the data reliability of the whole storage system, none of these studies consider improving overall performance based on the read latency disparity between PCM and SSD. To address this problem, we propose a Miss Penalty Aware cache management scheme (MPA for short) that takes the asymmetry of cache miss penalties on PCM and SSD into consideration.
1.3 Contribution
The contributions of this dissertation are threefold:
(1) As is well known, many system designs, especially I/O subsystems, are based on the characteristics of HDDs. These designs may not be appropriate for SSDs and sometimes even degrade their performance. In order to fully mine the potential of SSDs, we need SSD-based I/O schedulers. In this thesis, we explore several factors that influence SSD performance in terms of response time. For instance, according to our experimental results, SSDs are more sensitive than mechanical hard drives to increases in request size. Meanwhile, if read requests mix with write requests running concurrently, read-write interference occurs and degrades SSD response time. An SSD-based I/O scheduler must take these factors into consideration and, due to the unique structure of SSDs, internal parallelism as well. Based on the above observations, we developed an SSD-based I/O scheduler called SBIOS, which is embedded into the Linux kernel. When dealing with a set of incoming I/O requests, SBIOS fully exploits the characteristics of SSDs. SBIOS adopts a read-preference policy and a small-size-preference policy. It dispatches read requests to different blocks to make full use of internal parallelism; for write requests, it tries to dispatch them to the same block to alleviate the block-crossing penalty and garbage collection overhead. To prevent read-write interference, SBIOS introduces type-based queuing, using two queues to separate the types of incoming requests: one queue serves read requests and the other serves write requests, and in each round SBIOS deals with one queue. In this way, the probability of read-write interference is reduced. To measure SSD performance, we developed a tool that is an extension of Radiometer [12, 13]. This tool dispatches requests to the physical storage and collects statistics on response time. The evaluation results show that, compared with the other I/O schedulers in the Linux kernel, SBIOS reduces the average response time significantly. Consequently, the performance of the SSD-based storage system is improved.
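The type-based queuing idea behind SBIOS can be sketched as follows: reads and writes wait in separate queues and are drained in alternating batches, so the two request types are not interleaved on the device. The batch size and request format here are illustrative assumptions, not SBIOS internals.

```python
# Hedged sketch of type-based queuing with read preference: separate
# read and write queues, drained one batch at a time, reads first.

from collections import deque

class TypeBasedScheduler:
    def __init__(self, batch_size=4):
        self.read_q = deque()
        self.write_q = deque()
        self.batch_size = batch_size

    def add(self, op, payload):
        (self.read_q if op == "read" else self.write_q).append(payload)

    def next_batch(self):
        """Read preference: drain a batch of reads first, then writes."""
        source = self.read_q if self.read_q else self.write_q
        batch = []
        while source and len(batch) < self.batch_size:
            batch.append(source.popleft())
        return batch

sched = TypeBasedScheduler(batch_size=2)
for i in range(3):
    sched.add("write", f"w{i}")
sched.add("read", "r0")
# The queued read is served before the earlier writes.
```

A production scheduler would also bound how long writes can wait, since strict read preference can starve writes; that concern is addressed in the PGIS design later.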
(2) As mentioned before, MLC technology is a double-edged sword: it reduces the cost of SSDs but brings an endurance problem. To solve the reliability issue of SSDs, we propose a parallelism and garbage collection aware I/O scheduler called PGIS. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot data identification mechanism to reduce garbage collection overhead. Using statistical methods, we analyze the characteristics of trace access patterns [14, 15], and design a hot data identification mechanism based on this analysis. In PGIS, we classify hot requests by frequency. To exploit channel-level internal parallelism, we issue hot read requests to different channels; to reduce the garbage collection overhead in terms of P/E cycles, we issue hot write data to the same physical block. Again, design is a tradeoff: issuing hot write data to the same physical block accelerates the reclaiming process, but it also wears that block out faster. To balance this side effect, we introduce a hot/cold swapping wear-leveling algorithm. In our design, hot read data may overlap with hot write data, and parallelizing the two would interfere and degrade performance. To reduce the probability of parallelizing hot read data and hot write data, PGIS introduces type-based queuing, using two queues to separate incoming requests: one serves read requests and the other serves write requests. Because PGIS is read-preferring, write requests could starve. To solve this problem, PGIS sets a time stamp defining a deadline by which each request must be dispatched to the FTL module, and it periodically checks the queued requests to guarantee that none exceeds the period assigned by its time stamp. In everyday use, an SSD is treated as a black box: manufacturers do not provide an interface for observing its internal activity. In this thesis, we therefore develop an SSD simulator to measure the performance and reliability of various SSD architectures. Our simulator is an extension of DiskSim 4.0 [16] and models a generalized SSD by implementing flash-specific read, write, and erase operations, channel-level internal parallelism, a wear-leveling algorithm, and a flash translation layer module.
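The frequency-based hot-data classification described for PGIS can be sketched as a simple access counter with a hotness threshold. The threshold value and counter structure are assumptions for illustration, not PGIS internals.

```python
# Illustrative sketch of frequency-based hot-data identification: count
# accesses per logical block and label a block hot once its count
# crosses a threshold.

from collections import Counter

class HotDataIdentifier:
    def __init__(self, threshold=3):
        self.freq = Counter()
        self.threshold = threshold

    def record(self, block):
        self.freq[block] += 1

    def is_hot(self, block):
        return self.freq[block] >= self.threshold

hdi = HotDataIdentifier(threshold=3)
for b in [7, 7, 7, 9]:
    hdi.record(b)
# Block 7 (3 accesses) is classified hot; block 9 (1 access) is cold.
```

Once labeled, hot reads would be spread over channels for parallelism while hot writes would be clustered into the same block so it fills with invalid pages quickly and is cheap to reclaim.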
(3) With the wide deployment of SSDs in data centers, the poor endurance and out-of-place update limitations of SSDs become more and more prominent. Compared with SSDs, PCM has higher endurance and can be updated in place. Hybrid storage systems that use PCM and SSD at the same storage level are a promising approach because of their better performance and higher endurance than SSD-only approaches. However, hybrid storage systems pose a new challenge to cache management algorithms. Existing DRAM-based cache management schemes are optimized only to reduce the miss rate. On a miss, the cache needs to access either the PCM or the SSD, and there are major differences between the access times of the two technologies. As a result, in such a hybrid system, a lower miss rate does not necessarily translate to higher performance. To address this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) that takes the asymmetry of cache miss penalties on PCM and SSD into consideration. Our MPA scheme not only uses access locality to reduce the miss rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty. Our experimental results show that MPA can significantly improve the performance of the hybrid storage system.
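One way to picture the miss-penalty-aware idea is an eviction rule that prefers to evict PCM-backed pages, since a future miss on PCM is far cheaper to service than one on the SSD. The class below is a toy sketch under that assumption, not the MPA algorithm itself.

```python
# Hedged sketch of penalty-aware eviction: on a full cache, evict the
# least-recently-used PCM-backed page first, falling back to plain LRU
# when only SSD-backed pages remain.

from collections import OrderedDict

class MissPenaltyAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page id -> backing device ("pcm"/"ssd")

    def access(self, page, device):
        if page in self.pages:
            self.pages.move_to_end(page)     # LRU hit: refresh recency
            return
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page] = device

    def _evict(self):
        # Prefer the oldest PCM-backed page; SSD pages keep their slots
        # because re-fetching them on a miss is much more expensive.
        for page, device in self.pages.items():
            if device == "pcm":
                del self.pages[page]
                return
        self.pages.popitem(last=False)       # fall back to plain LRU

cache = MissPenaltyAwareCache(capacity=2)
cache.access("a", "ssd")
cache.access("b", "pcm")
cache.access("c", "ssd")   # full: evicts "b" (PCM-backed), keeps SSD pages
```

The sketch trades some hit rate (it may evict a recently used PCM page) for a lower average miss penalty, which is exactly the balance the dissertation's MPA scheme aims to strike.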
1.4 Organization of Dissertation
So far we have compared HDDs with SSDs and given a brief introduction to SSDs. We have also presented the opportunities and challenges of SSDs that motivated our work. In the rest of the dissertation, we detail our system designs and I/O scheduling methods. For each proposed method, we also present background, related work, and experimental results. The rest of this dissertation is organized as follows:
In Chapter 2, we present background on SSDs, I/O schedulers, and non-volatile storage. We also introduce several aspects of SSDs, including internal parallelism, the flash translation layer, and hybrid SSD structures.
In Chapter 3, we revisit the I/O subsystem in the operating system. We show that existing I/O schedulers may not be appropriate for SSDs and sometimes even degrade their performance, and we explore several factors that impact SSD performance. Based on this exploration, we propose an SSD-based I/O scheduler called SBIOS, which fully exploits the characteristics of SSDs. For read requests, SBIOS dispatches them to different blocks to make full use of internal parallelism; for write requests, it tries to dispatch them to the same block to alleviate the block-crossing penalty. Based on these I/O scheduling rules, we develop an I/O scheduling module and embed it into the Linux kernel. To measure SSD performance in terms of response time, we also develop a tool that is an extension of Radiometer [12, 13].
In Chapter 4, we mainly address the reliability issue of SSDs. Using statistical methods, we analyze trace access patterns and show that 15 to 25 percent of the blocks are accessed by nearly 50 percent of the I/O requests. Based on this observation, we introduce a hot data identification mechanism and propose a parallelism and garbage collection aware I/O scheduler called PGIS. In PGIS, we classify hot requests by frequency. To exploit channel-level internal parallelism, we issue hot read requests to different channels; to reduce garbage collection overhead in terms of P/E cycles, we issue hot write data to the same physical block. Based on these I/O scheduling rules, we develop an I/O scheduler and embed it into an SSD simulator. Since in everyday use the SSD is treated as a black box, to measure the performance and reliability of the SSD we develop an SSD simulator that is an extension of DiskSim 4.0 [16].
In Chapter 5, we investigate the characteristics of PCM and SSD. We reveal that hybrid storage systems which use PCM and SSD at the same storage level pose a new challenge to cache management algorithms. Due to the different miss penalties of PCM and SSD, the higher hit rate targeted by traditional cache management algorithms may not bring higher performance. To solve this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) which takes the asymmetry of cache miss penalties between PCM and SSD into consideration. Our MPA scheme not only uses access locality to increase the hit rate, but also assigns higher priorities to SSD requests located in the page cache to avoid the high miss penalty overhead. By combining miss penalty with hit rate, our MPA scheme significantly improves the performance of the hybrid storage system. Meanwhile, to test the efficiency of our MPA scheme, we develop a hybrid SSD simulator which is an extension of DiskSim 4.0 [16].
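The miss-penalty-aware idea can be illustrated with a toy eviction policy: among the least-recently-used candidates, evict the page whose re-fetch cost is lower, so that costly SSD-backed pages stay cached longer. This is a hypothetical sketch, not the MPA implementation; the penalty values, the two-candidate eviction window, and all names are illustrative assumptions.

```python
from collections import OrderedDict

PENALTY = {"pcm": 1, "ssd": 5}   # relative miss penalties (assumed values)

class MPACache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # key -> backing device, in LRU order

    def access(self, key, device):
        if key in self.pages:
            self.pages.move_to_end(key)   # locality: refresh LRU position
            return True                   # hit
        if len(self.pages) >= self.capacity:
            self._evict()
        self.pages[key] = device
        return False                      # miss

    def _evict(self):
        # Among the two least-recently-used candidates, evict the one with
        # the lowest miss penalty, i.e. prefer evicting PCM-backed pages.
        candidates = list(self.pages.items())[:2]
        victim = min(candidates, key=lambda kv: PENALTY[kv[1]])[0]
        del self.pages[victim]
```

A pure-LRU cache would evict the oldest page regardless of device; here the combination of recency and miss penalty decides the victim.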
In Chapter 6, we conclude this dissertation.
Some of the research in this dissertation has been published at conferences. The study in Chapter 3 was published at the 15th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) in 2015 [17]. The content in Chapter 4 is based on a paper published at the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS) in 2017 [18].
Chapter 2: Background
In SSDs, data are hosted by flash memory, which is a type of Electrically Erasable Programmable Read-Only Memory (EEPROM for short) [19]. Flash memory can be classified into several types, among which the NOR type and NAND type are the most important [20]. NAND-type flash memory was first introduced by Toshiba in the late 1980s, following NOR-type flash memory by Intel [21]. NOR-type flash memory provides random access and high reliability, while NAND-type flash memory provides an affordable price due to its scaling technology. With the rapid development of portable electronics such as mobile phones, flash memory has become more and more popular. NAND-type flash memory in particular is widely adopted in mobile phones due to its low cost and high density. In this dissertation, flash memory refers to NAND-type flash memory.
2.1 Non-Volatile Memory
Semiconductor memories can be divided into two categories: volatile memory and non-volatile memory. Volatile memory needs a power connection to hold data; when the power is off, volatile memory loses its data. In our daily life, the most common form of volatile memory is random access memory. Compared to volatile memory, non-volatile memory can hold data without a power connection, so if a power outage occurs, non-volatile memory can ensure the safety of the data. In recent years, the most popular forms of non-volatile memory are phase change memory and flash memory.
Figure 2.1: PCM cell structure.
Phase change memory (PCM for short) is one of the most promising non-volatile memories. Instead of using electrical charges to store information, it stores bit values in the physical state of a chalcogenide material (e.g., Ge2Sb2Te5) [22, 23]. Figure 2.1 shows the typical PCM cell structure. In general, a PCM cell consists of a top electrode and a bottom electrode. Between these two electrodes are the resistor (heater) and the chalcogenide glass (GST). We can transform the PCM cell into the amorphous state by quickly heating it above the melting point and then quenching the glass. Conversely, holding the GST above its crystallization point but below the melting point for a while transforms the PCM cell into the crystalline state. In 2016, IBM researchers demonstrated how to store 3 bits in one PCM cell, and PCM is becoming increasingly popular. Due to its high read/write speed, high endurance and low standby power [24-26], PCM is considered for hybrid memory designs that combine PCM with DRAM.
Figure 2.2: Flash memory cell structure.
Flash memory stores data in memory cells. A flash memory cell mainly consists of three parts: semiconductor material (p-substrate), a floating gate and an oxide layer. The p-substrate can be manipulated to control the flow of electrons. Electrons trapped in the floating gate can represent one or more bits. The oxide layer isolates the floating gate and prevents electrons from escaping. The flash memory cell structure is shown in Figure 2.2. NAND flash memory cells can be classified into two types: SLC (Single-Level Cell) and MLC (Multi-Level Cell) [27, 28]. In SLC NAND flash memory, each cell stores only one bit, and SLC NAND flash memory is more reliable than any other NAND flash memory. However, due to its high price, MLC NAND flash memory technology has been developed to serve the needs of users. MLC NAND flash memory stores 2 bits per cell, requiring 4 voltage states to represent 00, 01, 10 and 11. MLC NAND flash memory has 25-30 times lower endurance than SLC NAND flash memory and is less reliable, with many more issues such as unexpected power loss, read disturb and data corruption. However, due to its low price, MLC NAND flash memory is widely adopted in daily applications.
2.2 Internal Structure of SSD
Most SSDs are storage devices based on NAND flash memory technology. In NAND flash memory, the basic structural unit is the flash page, which is likewise the basic access unit for read and write operations. However, due to the limitations of flash memory, the basic unit for the erase operation is the flash block. As shown in Figure 2.3, a fixed number of flash pages are gathered to form a flash block. Meanwhile, every page is composed of a data area dedicated to user data and a spare area dedicated to metadata (mapping information, ECC, erase count, validity bit, etc.) [16]. In-place updates are not allowed in MLC NAND flash memory: if a user wants to update data in the same location, an erase operation must occur first. The erase operation not only takes a long time, but also raises reliability issues. In this way, the erase-before-write mechanism becomes the main bottleneck of NAND flash memory.
Figure 2.3: Basic block structure of flash memory, adapted from [29].
The major components inside an SSD include DRAM, flash channels, flash controllers, and flash packages. To provide larger capacity, SSDs are usually configured with an array of flash packages, as shown in Figure 2.4.
Figure 2.4: The major components of an SSD.
These flash packages are connected through multiple channels to flash memory controllers. Flash memories provide logical block addresses (LBA) as a logical interface to the host. Since logical blocks can be striped over multiple flash memory packages, data accesses can be conducted independently in parallel [8]. In this way, these packages form package-level internal parallelism. Besides package-level internal parallelism, there are several other levels. As shown in Figure 2.4, a flash package is composed of several flash dies, which form die-level internal parallelism: each die can execute commands independently, and since flash dies in the same flash package share the serial pins, the execution of multiple commands arriving at these dies can be interleaved.
2.3 Flash Translation Layer
Due to the erase-before-write mechanism, in-place updating is forbidden in SSDs. However, to users, an SSD seems to provide the function of in-place updating. That is because the flash translation layer, maintained by the SSD controller, hides the complexity of flash operations by providing a logical interface to the SSD. Since overwriting a flash page in the same location is not allowed, the flash translation layer (FTL for short) marks the old data as an invalid page and finds a new location to write the new data. In this process, a mapping table is maintained so that the operating system can find the latest data.
Based on the granularity of the mapping unit, flash translation layer schemes can be classified into three types: page-level mapping, block-level mapping and hybrid mapping. In page-level FTL, each logical page number is translated into a physical page number. This means a page-level FTL needs to maintain a large mapping table which contains the mapping information of each logical page. Compared to the other two FTL management schemes, page-level FTL has the largest mapping table; however, because it can track the hotness of data at page granularity, page-level FTL is flexible and efficient. The first page-level FTL scheme was proposed in 1995 [30] and was adopted as a standard for NOR-type flash memory several years later [31] [32]. Another famous page-level FTL is DFTL [33], which is inspired by virtual memory systems and loads a fragment of the page mapping table into RAM on demand [34] [35] [36].
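The on-demand idea behind DFTL-style page-level mapping can be sketched as follows: the full page map lives on flash, and only a small, recently used fragment is cached in RAM, loaded on demand. All sizes and names here are illustrative, not taken from the DFTL paper.

```python
from collections import OrderedDict

class PageLevelFTL:
    """Illustrative page-level FTL with a cached mapping fragment.
    `full_map` stands in for the complete on-flash mapping table;
    `cached` is the small in-RAM fragment, managed with LRU eviction."""
    def __init__(self, cache_slots):
        self.full_map = {}            # logical page number -> physical page number
        self.cached = OrderedDict()   # in-RAM fragment of the map
        self.cache_slots = cache_slots
        self.misses = 0               # cached-map misses cost extra flash reads

    def translate(self, lpn):
        if lpn not in self.cached:
            self.misses += 1          # fetch the entry from the on-flash map
            if len(self.cached) >= self.cache_slots:
                self.cached.popitem(last=False)   # evict least-recently used
            self.cached[lpn] = self.full_map.get(lpn)
        self.cached.move_to_end(lpn)
        return self.cached[lpn]
```

The trade-off is visible in the miss counter: a small cache keeps RAM usage low at the cost of extra flash reads for mapping entries.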
Figure 2.5: Block-level FTL management scheme, adapted from [37].
The block-level FTL is the simplest flash translation layer management scheme. In block-level FTL, as shown in Figure 2.5, the logical page number (LPN) is divided into two parts: a logical block number and an offset. Each logical block number is mapped to a physical block located in the SSD, and then a search algorithm is applied to find the targeted page based on the in-block offset. Since block-level FTL only maintains the mapping information of blocks, it reduces the hardware overhead and saves SRAM space. However, this design is a double-edged sword: due to the larger granularity of the mapping unit, a single write update in block-level FTL may lead to the relocation of other pages located in the same SSD block. This also means the overhead of moving valid pages during garbage collection is increased. The block-level FTL is suitable for applications with large access sizes and limited SRAM space. Several schemes build on block-level mapping. For instance, Lee et al. present a two-level mapping scheme which first maps a group of blocks in each DRAM map entry and then uses the out-of-band section of flash to store a finer-grained map within each group of blocks [38] [39]. Another famous block-level FTL is NFTL [39]. NFTL maintains a chain of physical addresses for each logical block in SRAM. For each write operation, NFTL searches the whole chain until it finds the first available page; if no such page is found, a fresh block is appended to the end of the chain. Wells et al. propose a static block-level FTL management scheme in their system designed for compressed storage [40]. In this design, variable-size writes are allowed and rescheduled to be written in a log block.
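The block-level translation described above can be written out in a few lines: the LPN is split into a logical block number and an in-block offset, and only the block number is remapped. The `PAGES_PER_BLOCK` value is an illustrative choice, not a fixed property of any particular device.

```python
PAGES_PER_BLOCK = 64   # illustrative; real block sizes vary by device

def translate_block_level(lpn, block_map):
    """Block-level FTL translation: split the LPN into (logical block,
    offset), remap only the block number, and keep the offset unchanged."""
    logical_block, offset = divmod(lpn, PAGES_PER_BLOCK)
    physical_block = block_map[logical_block]
    return physical_block * PAGES_PER_BLOCK + offset
```

The mapping table holds one entry per block rather than one per page, which is exactly where the SRAM savings come from.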
Figure 2.6: Three merge cases when a log block and data block are collected during garbage collection: (a) full merge, (b) partial merge, (c) switch merge.
As mentioned above, the page-level FTL performs better than the other FTL schemes, but it needs a large SRAM space. The block-level FTL saves SRAM space, but it is less flexible and sometimes leads to severe write amplification. For this reason, the hybrid FTL, combining the concepts of page-level and block-level FTL, has been proposed. In practice, many systems use log-structured block-based hybrid FTL schemes [41] [42] [43], which are inspired by log-structured file systems [44]. In these FTL schemes, the flash blocks are divided into two types: data blocks and log blocks. The block-level FTL scheme is applied to data blocks. To save SRAM space, regular pages are mapped into data blocks, while updated pages, which are tracked by the page-level FTL scheme, are temporarily appended to log blocks. Log blocks are a small proportion of the flash blocks, generally accounting for less than 5% of them. Due to the limited number of log blocks, most hybrid FTLs need to merge data from log blocks into data blocks to generate new space for log blocks. Figure 2.6 shows three kinds of merge operation: full merge, partial merge and switch merge. The full merge is the most expensive of the three. In the case of a full merge, as shown in Figure 2.6 (a), all the updated pages need to be copied to a newly allocated block, and then the old blocks must be erased. In the case of a partial merge, only the first three pages of the data block are logged, and the remaining pages of the data block are still valid, as depicted in Figure 2.6 (b); the valid pages have to be copied from the data block to the log block before a switch can be performed, resulting in a higher overhead than a switch merge. In a switch merge, as shown in Figure 2.6 (c), the data block contains only invalid pages, so it simply needs to be reclaimed, while the log block contains all the valid pages. In this case, we simply mark the log block as the new data block. The overhead of a switch merge is the lowest, since only one erase operation occurs. Different hybrid FTL schemes adopt different strategies to merge the data from log blocks into data blocks. For instance, Jesung et al. propose the first hybrid FTL, called BAST [41]. However, BAST does not work well with random overwrite patterns, which may result in a block thrashing problem [42]: since each replacement block can accommodate pages from only one logical block, BAST can easily run out of free replacement blocks and be forced to reclaim replacement blocks that have not been filled [45]. To solve the block thrashing problem, FAST [42] was proposed, which allows a log block to hold updates from any data block. Park et al. propose SAST [36], which limits a set of log blocks to a set of data blocks. Cho et al. propose KAST [46], which addresses performance variability by limiting the associativity of its log blocks.
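The three merge cases of Figure 2.6 can be distinguished by inspecting the contents of the log block. The sketch below models a block as a list of valid logical page numbers (`None` = invalid/empty); this representation and the classification rules are an illustrative simplification of the cases described above, not an actual FTL implementation.

```python
def merge_type(data_block, log_block):
    """Classify which merge a hybrid FTL can use for a (data, log) pair."""
    n = len(data_block)
    # Switch merge: the log block holds every page in order -> just relabel
    # the log block as the new data block; one erase reclaims the data block.
    if log_block == list(range(n)):
        return "switch"
    # Partial merge: the log block holds an in-order prefix of the pages;
    # the remaining valid pages are copied from the data block, then switch.
    filled = [p for p in log_block if p is not None]
    if filled == list(range(len(filled))) and len(filled) < n:
        return "partial"
    # Full merge: updated pages are scattered/out of order; all valid pages
    # must be copied to a newly allocated block and both old blocks erased.
    return "full"
```

The cost ordering follows directly: a switch merge needs one erase and no copies, a partial merge adds the prefix copies, and a full merge copies every valid page.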
2.4 Hybrid SSD structure
With the rapid development of NVM technology, exploring the use of emerging NVM technologies at different levels of the memory hierarchy has become popular in computer architecture research. Some novel NVM-based architecture designs have been proposed, such as NVM-based cache designs [47] [48] [49] [50], NVM-based storage architectures [51] [52], and NVM-based memory architectures [53] [22] [54]. NVM has its own advantages, such as in-place updating and high endurance. However, compared to SSDs, NVM technology is not yet mature; it is still not feasible for NVM to directly replace SSDs as mass storage because of manufacturing limitations and high cost [55]. In this case, the hybrid SSD, which uses PCM and SSD at the same storage level, has been proposed to take advantage of both SSD and PCM.
Small and frequent random writes are common in flash-based database servers. Under such an access pattern, the erase-before-write limitation of the SSD becomes increasingly apparent. Meanwhile, frequent write and erase operations reduce the lifetime of a flash-based SSD. To solve this problem, Sun et al. [56] propose a hybrid SSD architecture to prolong the lifetime of the SSD. The key idea behind this design is to use PCM to store log data. Due to the in-place update capability, byte-addressability, non-volatility and better endurance of PCM, the performance, energy consumption and lifetime of the SSD are all improved. Kim et al. [57] propose a similar hybrid SSD structure for their flash-based database management strategy called IPL. Li et al. [58] propose a user-visible hybrid SSD with software-defined fusion methods for PCM and NAND flash; in this design, PCM is used to improve data reliability and reduce the write amplification of SSDs. In 2016, Radian Memory Systems released a host-based hybrid SSD product called "RMS-325" to improve the reliability of SSD storage systems. This clearly shows that the hybrid SSD structure is on its way from the laboratory research stage to the engineering application stage. With the growth of SSD use in data centers, the hybrid SSD structure will attract more and more attention in the forthcoming years.
Chapter 3: A SSD-based I/O Scheduler
3.1 Analyzing the characteristics of SSD
Unlike traditional magnetic storage devices, the flash-based solid state disk consists of semiconductor chips, which removes rotational latency from random I/O performance. Theoretically, the speed of a flash-based solid state disk is one or two orders of magnitude faster than a mechanical disk. In practice, however, the advantages of flash-based solid state disks are not fully exploited, for two reasons. First, flash-based solid state disks have poor write performance, caused by the erase-before-write mechanism: in order to overwrite a previous location on an SSD, the block which contains this location must be erased first before the new data can be written. Second, mechanical disks are still the main storage devices in primary storage systems, and in existing operating systems the software I/O stack is designed for the characteristics of mechanical disks. As a consequence, the potential of flash-based solid state disks is not fully exploited. Several research studies have shown that the existing I/O software layer can lead to additional overheads for flash-based solid state disks [1] [59] [60] [61].
Due to the erase-before-write mechanism, the read speed of a flash-based solid state disk is much higher than its write speed. Worse, the erase granularity (64-256KB) is much larger than the basic I/O granularity (2-8KB) [62] [63]. This means that the response time of a read request is shorter than that of a write request. Meanwhile, in the upper-layer application, read operations are synchronous: the application needs the response data of a read operation to initiate the next step. Write operations, by contrast, are asynchronous and do not block the upper-layer application. Therefore, if we want to fully exploit the characteristics of flash-based solid state disks in block-layer I/O scheduler design, we need to take the read/write speed discrepancy into account.
3.2 Background
In order to design a SSD-based I/O scheduler, we need to know the factors that influence the performance of SSDs. In the following subsections, we conduct experiments to explore the performance variation of the SSD under different settings of these factors.
3.2.1 Request size
In this section, we explore the relationship between request size and response time. In our experiment, we use the fio tool [64] to test the average response time of a traditional hard disk (WDC WD1600JS 500GB) and a solid state disk (Intel X25-E 64GB) under different request sizes. To avoid system-level influences, we disable the write cache, memory buffer and I/O scheduler in our experiments. Figure 3.1 shows the response time comparison, where SSD represents the Intel X25-E 64GB and HD represents the WDC WD1600JS 500GB. As the experimental results show, an increase in request size has little impact on the standard response time of the traditional hard disk.
Figure 3.1: The response time comparison between SSD and HD under different request sizes (4KB-1MB).
Meanwhile, the standard response time of the SSD shows almost no change for request sizes between 4KB and 64KB. A rotating drive moves the head and rotates the platter to locate and access the data, so the three contributors to its response time are the seek time, the rotational latency and the data transfer time.
When the request size is small, the seek time and rotational latency account for the majority of the response time. As the request size increases, the data transfer time accounts for a larger proportion of the response time. As shown in Figure 3.1, when the request size is larger than 64KB, the request size becomes the main factor determining the response time of the rotating drive. For requests whose size is between 128KB and 1024KB, the standard response time of the SSD increases rapidly as the request size increases, indicating that the SSD is more sensitive to the increase in request size than the mechanical hard drive.
Moreover, the response time of the solid state disk is linearly proportional to the request size, which causes the standard response time of the SSD to increase gradually. The reason for this observation lies in the different internal structure of the solid state disk. Unlike a rotating hard drive, the solid state disk performs its fundamental operations (read and write) by circuit signal transmission, so it is not necessary to take seek time and rotational latency into consideration. The data transfer time is the main part of the response time of the solid state disk, and data transfer time is directly related to the request size. For this reason, there is a linear relationship between the request size and the response time of the solid state disk.
3.2.2 Read-write interference
The write speed of a flash-based SSD is significantly lower than its read speed. In particular, when a reader continuously issues read requests in the presence of a concurrent writer, the reader may suffer an excessive slowdown in read performance. In order to validate this, we use the fio tool [64] to measure the read/write characteristics of the flash device. The flash-based storage device used in these experiments is the Intel X25-E 64 GB. To access the native characteristics of flash, we disable the memory buffer, write cache and I/O scheduler in our experiment.
Figure 3.2: The response time of random reads under different size with writes in concurrent execution on Intel X25-E.
Our experiment simulates two processes: one continuously sends read requests to random storage locations, while the other sends random write requests to the flash device at the same time. In our experiment, the request size spans from 4kB to 1MB. Figure 3.2 illustrates the response time in two cases: pure random reads, and reads mixed with concurrent writes. Compared with pure random reads, the response time of reads mixed with concurrent writes is significantly higher, and the latency issue deteriorates as the request size becomes larger. In order to alleviate the latency issue caused by concurrent writes, we introduce the concept of batch processing and separate read and write requests. In this way, the problem of read-write interference can be mitigated.
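The batch-processing idea can be sketched as follows: reads and writes go to separate queues, and the scheduler drains them in alternating batches rather than interleaving individual requests. This is a toy illustration of the concept; the batch size and request representation are arbitrary choices for the sketch.

```python
from collections import deque

def batch_dispatch(requests, batch_size=4):
    """Separate reads ('R') and writes ('W') into two queues, then dispatch
    them in alternating batches so writes do not interleave with reads."""
    reads = deque(r for r in requests if r[0] == "R")
    writes = deque(r for r in requests if r[0] == "W")
    order = []
    while reads or writes:
        for _ in range(min(batch_size, len(reads))):
            order.append(reads.popleft())    # drain a batch of reads first
        for _ in range(min(batch_size, len(writes))):
            order.append(writes.popleft())   # then a batch of writes
    return order
```

Within each read batch no write can interrupt a read, which is exactly the interference the measurements above expose.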
3.2.3 Internal parallelism
There are several levels of internal parallelism in the flash-based solid state disk. These internal parallelisms enable a single device to achieve close to ten thousand I/Os per second for random access, as opposed to roughly two hundred on a traditional hard disk. In order to validate the importance of exploiting the internal parallelism in the flash-based solid state disk, we use the IOPS metric to measure the difference between the hard disk and the solid state disk.
In our experiment, we use the fio tool [64] to measure the IOPS of the traditional hard disk and the solid state disk. We continually issue requests of different sizes (4K-1M) in a random read pattern to the devices. In Figure 3.3, HD represents the WDC WD1600JS 500GB, while SSD represents the Intel X25-E. As shown in Figure 3.3, the IOPS of the HD is only 137, while the IOPS of the solid state disk is over 4000.
Figure 3.3: IOPS for 4K-1M random reads on the Intel X25-E.
There is an over 30-fold IOPS gap between the traditional hard disk and the flash-based solid state disk. Such a big performance gap is caused by their different internal structures. A traditional hard disk has only one moving head, which means only one request can be served at a time. In a random access pattern, the traditional hard disk wastes a lot of time rotating the platters and seeking the data; that is why it achieves only 137 IOPS in the random read pattern. The solid state disk is a totally different case: it is composed of multiple channels, dies, packages and planes, and each level of internal parallelism can serve multiple requests at the same time. The random read access pattern is particularly efficient at triggering the internal parallelism of a flash-based solid state disk, which explains the over 30-fold gap. Exploiting internal parallelism to enhance I/O scheduler performance is therefore very important. In our SBIOS, we dispatch read requests to different blocks to trigger the internal parallelism of the solid state disk.
3.3 System design and implementation
In this section, we discuss the system design and implementation of our SBIOS
scheduler.
3.3.1 System overview
The I/O scheduling module is located between the block layer and the block device layer, and processes requests in a specific order. Figure 3.4 shows the system overview of SBIOS and its location in the whole I/O subsystem. From the perspective of the upper layer, a request entering the I/O scheduling module is inserted into a queue, and the I/O scheduling module then reorders the requests in the queue according to a certain policy. From the perspective of the lower layer, a request leaving the I/O scheduling module is like a dequeue operation; the I/O scheduling policy in the scheduling module decides the next request to be served. Figure 3.4 also shows the details of the sorting policy that we use in SBIOS. SBIOS uses type-based queuing: we divide the requests into two types, read and write. We dispatch read requests to different blocks to make full use of the read internal parallelism. For write requests, we try to dispatch them to the same block to avoid the block-crossing penalty. In this way, SBIOS improves the performance of the solid state disk significantly.
Figure 3.4: The overview of SBIOS.
3.3.2 Dispatching method
Our goal in designing SBIOS is to exploit the full potential of the solid state disk, so the rich internal parallelism of the solid state disk must be taken into consideration. Since SSDs have superior read performance, we sort incoming requests based on type and employ a read-preference policy in our scheduler. Four levels of internal parallelism are described in [65]. In our design, we try to fully use the block-level read internal parallelism, so the scheduler dispatches read requests to different logical blocks to trigger internal read parallelism. To fully exploit the characteristics of the SSD, we must not only take advantage of its strengths but also avoid its drawbacks. Random write performance is always the bottleneck of the SSD, especially for random write operations that cross blocks [3]. To avoid this drawback, we try to dispatch write requests to the same block, avoiding the large response latency caused by crossing blocks.
During initialization, SBIOS uses a function to calculate the total capacity of the targeted solid state disk and then finds its starting sector. In the Linux kernel, the basic storage unit is the sector. Our design rationale is to make the logical block size match the physical block size of the solid state disk, and we use a function CalculateBlock() to achieve this. Given the starting sector of the solid state disk, CalculateBlock() computes the block number of an incoming request. Suppose the starting sector of the solid state disk is K and the beginning sector of the incoming request is G; the logical block number maintained in CalculateBlock() equals (G-K)/SECTORS_PER_BLOCK. Here the SECTORS_PER_BLOCK variable gives the number of sectors contained in a block, and it is calculated in the initialization phase.
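The computation above can be written out as a short sketch. The symbols mirror K, G and SECTORS_PER_BLOCK from the text; the 256 KB erase block and 512-byte sector figures are assumptions matching the Intel X25-E configuration used later in this chapter.

```python
SECTOR_SIZE = 512                              # bytes per sector
BLOCK_SIZE = 256 * 1024                        # Intel X25-E erase block size
SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE  # computed at initialization

def calculate_block(start_sector_of_disk, request_start_sector):
    """Logical block number of a request: (G - K) / SECTORS_PER_BLOCK,
    where K is the disk's starting sector and G the request's first sector."""
    return (request_start_sector - start_sector_of_disk) // SECTORS_PER_BLOCK
```

Two requests that map to the same block number can then be grouped (for writes) or deliberately spread across blocks (for reads) by the scheduler.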
One important factor that impacts the performance of SBIOS is the physical block size of the solid state disk. In practice, the block size is specified by the vendor, and sometimes the exact physical block size is not available. However, for a given SSD, we can design micro-tests to determine the physical block size [66].
Figure 3.5: Flow chart of algorithm process.
3.3.3 Algorithm process
In SBIOS, all incoming requests are placed in a red-black tree according to their logical block address. They wait in the red-black tree until SBIOS chooses them for dispatch to the lower layer. Figure 3.5 shows the algorithmic diagram for selecting a request to dispatch. In each round of request selection, the scheduler module first checks whether write requests are starved. This step is not optional: because we use a read-preference policy in SBIOS, without a starvation threshold for write requests our scheduler would suffer from a write starvation problem. After checking for write starvation, the scheduler determines the type of the next request. We also ensure that each request is assigned a timestamp when it enters the I/O scheduler; if the timestamp is out of date, that request is chosen as the next request to be served. Otherwise, the scheduler employs a different selection method according to the request type. If the next served request is set to read, the scheduler finds two requests from different blocks in the red-black tree, compares their sizes, and dispatches the smaller one to the lower layer. If the next served request is set to write, the scheduler finds two write requests located in the same block in the red-black tree and dispatches the smaller one. Finally, it continues dispatching requests until none remain pending in the tree.
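The selection flow of Figure 3.5 can be condensed into the following sketch. This is an illustrative rendering, not the kernel module itself: a sorted Python list stands in for the red-black tree, and the request field names are assumptions.

```python
def select_next(pending, now, write_starved, next_type):
    """Pick the next request to dispatch from the pending set.
    Each request is a dict: {"type": "R"/"W", "block", "size", "deadline"}."""
    # Step 1: avoid write starvation under the read-preference policy.
    if write_starved:
        writes = [r for r in pending if r["type"] == "W"]
        if writes:
            return min(writes, key=lambda r: r["deadline"])
    # Step 2: a request whose timestamp is out of date is served immediately.
    expired = [r for r in pending if r["deadline"] <= now]
    if expired:
        return min(expired, key=lambda r: r["deadline"])
    # Step 3: type-specific selection.
    same_type = [r for r in pending if r["type"] == next_type]
    if next_type == "R":
        # two candidate reads from *different* blocks; dispatch the smaller
        seen, candidates = set(), []
        for r in sorted(same_type, key=lambda r: r["block"]):
            if r["block"] not in seen:
                seen.add(r["block"])
                candidates.append(r)
            if len(candidates) == 2:
                break
    else:
        # two candidate writes located in the *same* block
        by_block = {}
        for r in same_type:
            by_block.setdefault(r["block"], []).append(r)
        candidates = max(by_block.values(), key=len)[:2] if by_block else []
    pool = candidates or same_type or pending
    return min(pool, key=lambda r: r["size"])
```

Reads are spread over blocks to trigger parallelism, while writes are packed into one block to avoid the block-crossing penalty, matching the dispatching rules of Section 3.3.2.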
3.4 Experimental evaluation and analysis
In this section, we set up the experimental platform to analyze the performance of the SBIOS. In our experiments, we run different kinds of traces to the chosen I/O
39
scheduler to demonstrate that the SBIOS improves the response time significantly and
make full use of characteristics of SSD.
3.4.1 Experiment setup
SBIOS is implemented as a kernel module in Ubuntu 14.04 with kernel 3.13.0. Our machine uses an Intel Core i3 3.00 GHz processor and 4 GB of memory. For the solid state disk, we use the Intel X25-E Extreme SATA Solid-State Drive 64 GB (Intel X25-E for short), whose erase block size is 256 KB. The hard disk we use is the WDC WD1600JS 500GB. In order to test the efficiency of the SBIOS scheduler, we choose five different benchmarks, including two online transaction processing workloads (Fin1, Fin2) and three search engine workloads (Web1, Web2, Web3).
Table 3.1: Workload attributes.
Workload    Request size (bytes)   Read (%)   Write (%)
Fin1 [15]   512-17116160           21.6       78.4
Fin2 [15]   512-262656             82.4       17.6
Web1 [15]   512-1137664            99.9       0.01
Web2 [15]   8192-32768             99.9       0.01
Web3 [15]   512-23674880           99.9       0.01
3.4.2 Performance results and analysis
In this section, we run different benchmarks with different I/O schedulers (CFQ, Deadline, Noop and SBIOS). In our experiments, we did not compare SBIOS with AS, because the AS scheduler has been removed from kernel 3.13.0. To validate the efficiency of SBIOS, we choose five benchmarks with different characteristics and compare their system performance.
Table 3.1 describes these five benchmarks in detail. Fin1 and Fin2 are mixed read/write workloads. In this kind of benchmark a read-write interference problem may appear, especially for Fin1, where read requests account for only 21.6%; this means read requests are likely to be blocked by write requests. Web1, Web2 and Web3 are read-intensive workloads, for which we need to consider the starvation problem of write requests.
Figure 3.6 shows the performance results. To illustrate the performance clearly, we use the standard response time on the Y axis. In the experiment, we set the response time of the Noop scheduler as the baseline (normalized to 1 in Figure 3.6) to compare the efficiency of the other I/O schedulers. The response times of the Noop scheduler under Fin1, Fin2, Web1, Web2 and Web3 are 1.19618 ms, 1.4491 ms, 0.588226 ms, 0.922445 ms and 1.0727 ms, respectively. From Figure 3.6 we can draw two conclusions. First, aside from SBIOS, the Noop scheduler outperforms the other schedulers under these five workloads, which confirms that Noop is the most suitable of the standard schedulers for SSDs. Second, SBIOS performs best among these schedulers under all five workloads. For the Fin1, Fin2, Web1, Web2 and Web3 traces, SBIOS reduces the response time relative to the best and worst of the other three schedulers by 15%-18%, 14%-18%, 11%-23%, 14%-18% and 10%-17%, respectively. In conclusion, SBIOS reduces the response time significantly by taking SSD characteristics into consideration.
Figure 3.6: Benchmark performance comparison under different I/O schedulers.
3.5 Related work
Since the I/O schedulers in existing operating systems were designed for HDDs, the popularization of flash-based SSDs has drawn much more attention to I/O schedulers for SSDs. There is a large body of work on I/O schedulers for magnetic hard disks, but only a few studies have focused on SSDs. They can be classified into two categories. The first category focuses mainly on the fairness of SSD resource usage. For example, Stan Park et al. proposed the FIOS [62] and FlashFQ [67] algorithms, which take the fairness of SSD resource usage into account. FIOS [62] designed an I/O time-slice management mechanism which combines read preference with fairness-oriented I/O anticipation. FlashFQ [67] discussed the drawbacks of existing time-slice I/O schedulers and proposed a new mechanism which fully uses flash I/O parallelism without losing fairness.
The other category tries to exploit and maximize the advanced characteristics of SSDs in the upper layer, such as the parallelism among flash chips. For example, Wang et al. proposed ParDispatcher [68], which partitions the logical space to issue user I/O requests to SSDs in parallel. Marcus Dunn et al. [3] proposed a new I/O scheduler that tries to avoid the penalty incurred when writing a new block to SSDs. Jaeho Kim et al. [69] proposed IRBW-FIFO and IRBW-FIFO-RP, which arrange write requests into logical-block-sized bundles to improve write performance. Our scheduler not only makes full use of read internal parallelism [65], but also tries to avoid the block cross penalty [3]. Besides the I/O scheduler studies, there is also research that has revealed the details of the flash internal organization and parallel data distribution. For example, Agrawal et al. [16] described the internal organization of flash and some parallel data distribution policies inside SSDs. Feng Chen et al. [66] conducted experiments to reveal hidden details of flash memory implementation, such as unexpected performance degradation caused by data fragmentation. Yang Hu et al. [65] divided the parallelism of flash memory into four levels and discussed the priority and benefits of these four levels of internal parallelism. Based on the above observations, the SBIOS scheduler tries to exploit internal parallelism at the I/O scheduler level to boost the throughput of user applications on SSD-based storage systems.
3.6 Summary
In this chapter, we propose a new I/O scheduler called SBIOS which makes full use of the characteristics of solid state disks. SBIOS exploits the rich read internal parallelism provided by SSDs by dispatching read requests to different blocks. For write requests, SBIOS dispatches them to the same block to avoid the block cross penalty. Furthermore, we validate that SSDs are sensitive to request size, so SBIOS uses a small-size-preference design. The experimental results show that SBIOS reduces the response time significantly. In this way, the performance of SSD-based storage systems is improved.
4. Chapter 4 A Parallelism and Garbage Collection aware I/O Scheduler
In recent years, driven by enormous growth in both capacity and popularity, the flash-based solid state disk has drastically impacted computing. However, the prices of solid state disks remain far higher than those of traditional magnetic hard disks. According to 2015 statistics, the $/GB of a consumer-grade solid state disk (with a SATA interface) is nine times that of a SATA hard disk. This price gap is the main bottleneck preventing high-performance solid state disks from being widely deployed, and it drives aggressive density increases: manufacturers reduce the price by increasing the density of the SSD. These higher densities have predominantly been enabled by two factors: (1) aggressive feature size reductions and (2) the use of multi-level cell (MLC) technology. Unfortunately, both of these factors come at the significant cost of reduced SSD endurance [70] [71] [59] [72]. With the development of SSD technology, the endurance problem is becoming more and more prominent. In this study, we propose a parallelism and garbage collection aware I/O scheduler to address the reliability issue of SSDs.
4.1 Introduction
Compared to traditional magnetic hard disks, the most widely recognized advantage of SSDs is their high access performance. Unlike traditional magnetic hard disks, SSDs are usually constructed with a number of channels, each connecting to a number of NAND flash chips, in order to improve performance [73] [13] [74]. This design provides rich internal parallelism which can be fully exploited to improve performance. Recently, many approaches have been proposed to exploit the internal parallelism of SSDs. Chen et al. [75] proposed a buffer cache management approach for SSDs that solves the read conflict problem by exploiting the read parallelism of SSDs. Gao et al. [76] proposed an I/O scheduling method for SSDs that solves the access conflict problem by using a parallel issuing queue inside the SSD. Guo et al. [17] proposed a novel SSD-based I/O scheduler that triggers internal parallelism by dispatching read requests to different blocks. However, these works target only exploiting internal parallelism to improve SSD performance; they do not consider the additional garbage collection overhead incurred while exploiting that parallelism.
The most commonly used internal parallelism in SSDs is channel-level parallelism. By using independent SSD bus controllers, the flash translation layer can fully utilize the channel resources to serve incoming requests in parallel. In practice, however, this kind of internal parallelism may increase the garbage collection overhead, because hot write requests are distributed to different blocks in parallel by the channel-level parallelism. Here, we classify requests as hot or cold by access frequency: a request is hot if its access frequency is no less than two.
For instance, if we modify the same place in a word document several times, this generates many hot write requests. These hot write requests target the same logical page, but the FTL distributes them to different blocks to utilize the channel parallelism, which significantly increases the invalid page ratio per physical block. To achieve a tradeoff between internal parallelism and the increased garbage collection overhead, we propose a GC-aware I/O scheduler called PGIS. In our design, we introduce an LFU algorithm that classifies all incoming requests as hot or cold according to workload characteristics. Hot read requests are dispatched to different channels to utilize the channel resources in the SSD, while hot write requests are directed to the same block to reduce the garbage collection overhead. By combining these methods, PGIS improves the performance and endurance of SSDs for a variety of workloads.
4.2 Motivation and Background
4.2.1 The main bottleneck in the SSDs
Write operations are considered the main bottleneck in SSDs, not only because they sometimes block read operations, but also because they trigger erase operations. Erase operations act at the block level, while read and write operations act at the page level. This distinguishing feature stems from the structure of SSDs: they do not allow in-place updates. Instead, incoming new data are written to clean space (i.e., out-of-place updates), and the old data are marked as invalid for reclamation during garbage collection. To hide this restriction, the Flash Translation Layer (FTL) has been proposed and deployed in flash memory to emulate in-place updates like a block device [77]. This emulated in-place update is not a real in-place update; it merely hides the out-of-place update behavior from users. Over time, these out-of-place updates inevitably cause numerous invalid and valid pages to coexist in one physical block. When the percentage of free blocks in the SSD reaches a threshold, garbage collection (GC) is triggered to reclaim the space occupied by invalid pages, primarily through erase operations. However, the erase operation (1,500us) is the most expensive operation in an SSD, compared to the read operation (25us) and the write operation (200us) [16]. To reduce the number of erase operations, our work issues hot write requests to the same physical block to improve the reclaiming efficiency.
Figure 4.1: Channel level internal parallelism in the SSDs.
4.2.2 Internal parallelism
To improve the I/O performance of SSDs, internal parallelism must be taken into consideration. Internal parallelism originates from the internal structure of the SSD. In recent research, Hu et al. [65] classified the internal parallelism of SSDs into four levels: channel-level, die-level, chip-level and plane-level. The most common is channel-level internal parallelism. As shown in Figure 4.1, when channel-level internal parallelism is applied, two incoming requests in the queue are simultaneously issued to two different channels. With this kind of parallelism, read I/O performance clearly improves, but for write requests the efficiency of garbage collection can decrease.
In Figure 4.1, the two incoming requests have been issued to two different chips located on two different channels. These chips contain many physical blocks. Suppose the two incoming requests are hot write requests with the same logical block number. Using channel-level internal parallelism, the FTL issues these two hot write requests to two different chips, which means that incoming write requests with the same logical block address are issued to two different physical blocks on two different chips. When the incoming requests contain many hot write requests, this dispatching method significantly increases the invalid page ratio per physical block and decreases garbage collection efficiency. In our design, we identify hot write requests and dispatch them to the same physical block in the SSD to improve the efficiency of garbage collection.
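The effect described above can be seen in a toy model (our own illustration, not the simulator used later in this chapter): replay repeated updates to one logical page and count how many physical blocks end up holding a stale copy under each policy.

```python
def replay(updates, scatter, n_channels=8):
    """Replay `updates` writes to the same logical page and return the set of
    physical blocks left holding at least one invalid (stale) page.
    scatter=True  -> round-robin over channels, a different block each time
    scatter=False -> always append to the same block (the PGIS policy)"""
    dirty = set()
    prev = None
    for i in range(updates):
        blk = (i % n_channels) if scatter else 0
        if prev is not None:
            dirty.add(prev)          # the previous copy is now invalid
        prev = blk
    return dirty

print(len(replay(16, scatter=True)))    # stale copies spread over 8 blocks
print(len(replay(16, scatter=False)))   # stale copies confined to 1 block
```

With scattering, reclaiming the stale pages of this one logical page eventually touches every channel's blocks; with co-location, one erase reclaims them all.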
Figure 4.2: The hot data ratio of different traces.
4.2.3 Trace access characteristics
To reduce the garbage collection overhead, we need to know the trace access characteristics, especially the hot-data ratio of each trace. Here, we use access frequency to identify hot blocks. By applying statistical methods to the trace access patterns, we validate that the access patterns of all seven selected traces are good fits to a power-law distribution. From the statistical results, we find a natural cutoff that yields an exact threshold for identifying hot blocks. Hot data are defined as the I/O requests that visit hot blocks; all other data in the trace are defined as cold data. Figure 4.2 shows the hot ratio for all seven of our traces. We observe that access frequency can be quite heterogeneous across pages: some pages are accessed often (hot data), while other pages are accessed infrequently (cold data). Figure 4.2 also shows that for all of our traces, hot data occupy a large fraction of the total data; the read hot ratio is the percentage of hot data in the total read data, and the write hot ratio is the percentage of hot data in the total write data. For read-intensive traces such as Web1, Web2 and Web3, the write request ratio accounts for only 0.01%, so we omit the hot write data statistics. For write-intensive traces such as Prn_0 and Fin1, we observe that the read hot ratios are lower than average, ranging from 25% to 29.5%, because the read data account for only a small fraction of the total data. But regardless of whether a trace is write-intensive or read-intensive, a large fraction of its total data is hot data. Therefore, in our design we identify hot data to improve the performance of SSDs and reduce the garbage collection overhead.
4.2.4 Motivation
To fully exploit the potential of SSDs, their internal structure must be taken into consideration. But every design is a tradeoff: when you gain something, you lose something. Introducing internal parallelism can obviously improve SSD performance. However, given the structure of SSDs, improving performance may also increase the garbage collection overhead, because the flash translation layer blindly directs hot write requests to different physical blocks. Analyzing the access characteristics of real traces, we find that hot data occupy a large fraction of the total data. Therefore, in our design we maintain two LFU queues at the host level to identify hot data. Hot read requests are dispatched to different channels to fully exploit the channel resources; hot write requests are dispatched to the same physical block in the SSD. In this way, we achieve a balance between improving SSD performance and reducing the garbage collection overhead.
4.3 System Design and Implementation
4.3.1 System overview
The main goal of the parallelism and garbage collection aware I/O scheduler (PGIS for short) is to reduce the garbage collection overhead while exploiting internal parallelism. Figure 4.3 shows the SSD-based storage system with PGIS implemented in the host interface logic, where the Request Type Identifier, Data Dispatcher and Hot Data Identifier are the three most important components of PGIS, and multiple flash chips are connected to multiple channels at the storage level.
In Figure 4.3, the top level is the application level, which generates the incoming requests. These requests are received by the file system and passed to the block device driver, which interacts with the SSD through the host interface logic. The incoming requests then enter the I/O scheduling module located in the host interface logic. The I/O scheduling module consists of three important components: the Request Type Identifier, the Data
Dispatcher and the Hot Data Identifier. The Request Type Identifier directs the read requests and the write requests into different queues. The Hot Data Identifier maintains two frequency queues in which hot requests are marked with a hot flag.
Figure 4.3: System overview of PGIS. Here RFQ represents the read frequency queue and WFQ represents the write frequency queue.
When the hot data have been identified by
the Hot Data Identifier, the data dispatcher dispatches all the requests to the flash translation layer. The FTL manages the channel resources through the flash interface logic proposed by Jung et al. [78]. For the existing I/O schedulers implemented in the
kernel level, they are not aware of the internal details of SSDs, so the additional garbage collection overhead incurred while applying internal parallelism may degrade performance. To solve this problem, we first propose a GC-aware I/O scheduler in the host interface logic, based on an understanding of the data allocation scheme of the FTL and the logical addresses of incoming I/O requests. Then, by introducing the hot data identification mechanism, PGIS reduces the garbage collection overhead by intelligently scheduling incoming I/O requests: hot read data are issued to different channels to exploit the channel resources, while hot write data are dispatched to the same physical block to accelerate the block reclaiming process.
(a) Fin2 read (b) Fin2 write
Figure 4.4: Distribution of read and write access frequency in Fin2 trace.
Table 4.1: Workload access pattern power laws.
Workload K R2 Type
Fin1 1.37 0.99 read
Fin1 1.42 0.97 write
Fin2 1.57 0.99 read
Fin2 1.50 0.99 write
Prn_0 1.30 0.99 read
Prn_0 1.58 0.90 write
Prxy_0 1.58 0.90 read
Prxy_0 1.64 0.97 write
Web1 1.35 0.99 read
Web2 1.46 0.98 read
Web3 1.35 0.98 read
4.3.2 Trace access pattern analysis
By applying statistical methods to the trace access patterns, we made an interesting observation: the access patterns of all our traces seem to match a power-law distribution. To validate this observation, we collect the number of blocks and the access frequency of each block for all our traces, and then plot these data on a logarithmic scale and fit a line to them to calculate k and r2. Here k is the magnitude of the slope of the fitted line, while r2 measures how well the fitted line explains the data. Figure 4.4 shows the frequency distribution results; due to space limitations, we only show the results for Fin2. The exact k and r2 values for all our traces are listed in Table 4.1. The r2 values range from 0.90 to 0.99, while the k values range from 1.30 to 1.64, which is homogeneous and in line with the k values reported for software power laws in other works [79, 80]. This validates that the access patterns of all our traces are good fits to a power-law distribution. According to our statistical results, 15 to 25 percent of the blocks are accessed by nearly 50 percent of the I/O requests, while the remaining blocks are accessed by the rest of the I/O requests, so when we design the threshold for identifying hot blocks, we take 20 percent as the threshold. In our experiment, we choose the top 20 percent of the entries in the LFU queue as hot blocks.
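The fitting procedure can be sketched as a least-squares fit in log-log space (our own illustration: the function names and the synthetic trace are assumptions, not taken from the dissertation). For an exact power law the fit recovers k precisely and r2 approaches 1.

```python
import math

def loglog_fit(freqs):
    """Fit log(frequency) = -k*log(rank) + c by least squares.
    `freqs` are access counts sorted in descending order (rank 1 = hottest).
    Returns (k, r2), with k the magnitude of the slope and r2 the fraction
    of variance in log-frequency explained by the line."""
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    r2 = sxy * sxy / (sxx * syy)
    return -slope, r2

# A synthetic Zipf-like frequency list with exponent 1.5 for illustration:
freqs = [1000 / rank ** 1.5 for rank in range(1, 200)]
k, r2 = loglog_fit(freqs)
print(round(k, 3), round(r2, 3))
```

On real traces the points deviate from the line, which is exactly what the r2 column of Table 4.1 quantifies.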
4.3.3 Hot data identification mechanism
To identify hot data in the host interface logic, we maintain two frequency queues, one recording the access frequency of each write block and the other recording the access frequency of each read block. Following the trace access pattern analysis above, we set the top 20% of the blocks maintained in a frequency queue as the threshold for identifying hot blocks. Once the hot blocks have been identified, incoming requests that visit a hot block are given a hot flag, and the data dispatcher applies different dispatching methods based on the hot flag value. In our experiments, we find that the access frequency used to identify hot blocks depends on the trace access pattern. For instance, for read-intensive traces such as Web1, Web2 and Web3, a read block accessed twice is classified as a hot block, but the same access frequency for a read block in a write-intensive trace such as Fin1 or Prxy_0 would classify the block as cold.
Figure 4.5: The basic data structure of the frequency queue.
A real trace contains a large number of I/O requests, so to keep the least frequently used (LFU) algorithm efficient, we implement an O(1) LFU queue based on the algorithm proposed by Ketan et al. [81]. Figure 4.5 shows the basic data structure of our frequency queue, which consists of three basic data structures. The first is a hash table that stores the key values, so that a given key can be retrieved in O(1) time. The second is a doubly linked list for each access frequency; the maximum frequency is determined by the queue length, which decides which frequencies are classified as hot. As shown in Figure 4.5, each frequency node maintains a doubly linked list to keep track of the keys belonging to that particular frequency. The third data structure is a doubly linked list linking these frequency lists. When a key is accessed again, its node can be promoted to the next frequency list in O(1) time. In PGIS, the key stored in the hash table is the logical block number. When a new I/O request arrives from the block device driver, PGIS matches its logical block number against the hash map. If the logical block number is found, PGIS goes to that key node and promotes it to the next frequency node; if not, PGIS adds a new entry to the hash map and places the new key node under the frequency-1 node.
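A compact sketch of this structure follows (our own illustration: ordered Python dicts stand in for the hash table and the doubly linked frequency lists of [81], and the method names are ours). `access` runs in O(1); only the hot-set extraction sorts.

```python
from collections import defaultdict

class FreqQueue:
    """O(1) frequency tracking in the spirit of the LFU scheme of [81]."""
    def __init__(self):
        self.freq_of = {}                  # logical block number -> frequency
        self.keys_at = defaultdict(dict)   # frequency -> ordered set of keys

    def access(self, lbn):
        """Promote `lbn` from its current frequency list to the next one."""
        f = self.freq_of.get(lbn, 0)
        if f:
            del self.keys_at[f][lbn]       # unlink from current frequency list
            if not self.keys_at[f]:
                del self.keys_at[f]        # drop the now-empty frequency node
        self.freq_of[lbn] = f + 1
        self.keys_at[f + 1][lbn] = None    # append to the next frequency list

    def hot_blocks(self, fraction=0.2):
        """Top `fraction` of tracked blocks by access frequency
        (the 20% threshold from the trace analysis above)."""
        ranked = sorted(self.freq_of, key=self.freq_of.get, reverse=True)
        n = max(1, int(len(ranked) * fraction))
        return set(ranked[:n])
```

A request whose logical block number falls in `hot_blocks()` would receive the hot flag before reaching the data dispatcher.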
4.3.4 Dispatching method
To reduce the additional garbage collection overhead while applying channel-level internal parallelism, PGIS introduces the hot data identification mechanism and intelligently manages the hot write data that would otherwise increase the garbage collection overhead. Compared to host-level I/O schedulers that simply exploit internal parallelism [17], PGIS benefits from knowing the internal behavior of the SSD. Figure 4.6 shows an example scenario in which the traditional channel-level dispatching method leads to additional garbage collection overhead, and how PGIS avoids this situation. As shown in Figure 4.6(a), six incoming requests reside in the data waiting queue. The traditional channel-level internal parallelism dispatching method dispatches the incoming requests to different channels to fully exploit the channel resources. However, it is unaware of incoming hot write data and therefore blindly dispatches request 5 and request 6 to different channels. This leads to additional garbage collection overhead, because hot write data with the same logical block number are dispatched to different physical blocks. To avoid this situation, PGIS identifies the hot data in the host interface logic and, when dispatching the incoming requests in the data waiting queue, uses the channel-level internal parallelism selectively. In Figure 4.6(b), we can see that PGIS uses channel-level internal parallelism only for hot read data, such as request 3 and request 4. For hot write data, such as request 5 and request 6, instead of dispatching them to different channels, PGIS dispatches them to the same physical block. In this way, PGIS not only fully exploits channel-level internal parallelism, but also avoids the additional garbage collection overhead. In our design, hot read data may overlap with hot write data, and parallelizing hot read data with hot write data would cause interference and performance degradation. To reduce the probability of parallelizing hot read data with hot write data, PGIS introduces type-based queuing.
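The dispatching rule can be sketched as follows (a simplified illustration under our own assumptions: requests are (logical block number, is_write, is_hot) tuples, cold traffic falls back to plain round-robin, and none of the names come from the dissertation):

```python
def dispatch(requests, n_channels=8):
    """Return a (lbn, channel) plan following the PGIS rule: hot reads are
    striped across channels; every update of one hot logical block is steered
    to a single home channel/block; cold requests go round-robin."""
    plan = []
    rr = 0                                  # round-robin cursor
    hot_write_home = {}                     # hot lbn -> channel chosen once
    for lbn, is_write, is_hot in requests:
        if is_hot and not is_write:
            ch = rr % n_channels            # exploit channel-level parallelism
            rr += 1
        elif is_hot and is_write:
            # co-locate updates so their invalid pages pile up in one block
            ch = hot_write_home.setdefault(lbn, hash(lbn) % n_channels)
        else:
            ch = rr % n_channels
            rr += 1
        plan.append((lbn, ch))
    return plan
```

For the Figure 4.6 scenario, the two hot reads land on different channels while the two hot writes to the same logical block share one destination.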
Figure 4.6: Illustration of additional garbage collection overhead due to traditional
channel level internal parallelism dispatching method.
Because PGIS uses read preference, write requests could suffer starvation. To solve this problem, PGIS assigns each request a time stamp that defines the period within which the request must be dispatched to the FTL module, and periodically checks the requests stored in the queues to guarantee that no request exceeds the period assigned by its time stamp. With this design, we believe the probability of parallelizing hot read data with hot write data is kept within an acceptable range, and the experimental results support this. We discuss this topic in detail in the following section.
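A minimal sketch of the periodic check (our own illustration; the 250 ms period is a hypothetical value, since the dissertation does not fix one):

```python
def overdue(queue, now, period=0.25):
    """Return the requests whose time stamp has aged past the allowed period
    and must therefore be dispatched to the FTL regardless of read preference.
    `queue` holds (logical block number, stamp) pairs; `period` is seconds."""
    return [(lbn, stamp) for lbn, stamp in queue if now - stamp > period]

queue = [(1, 0.0), (2, 0.9)]
late = overdue(queue, now=1.0)   # only the request stamped at 0.0 is overdue
```

Running this scan on each scheduling round bounds how long a write can wait behind preferred reads.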
4.4 Experimental Evaluation
In this section, we set up an experimental platform to evaluate PGIS. The
experiments are divided into two parts. The first part is to run different kinds of
traces on the chosen I/O scheduler to demonstrate that PGIS improves the response time
significantly and makes full use of the channel resource in the SSDs. The second part is
to compare the overhead of garbage collection under PGIS with the state-of-the-art I/O
schedulers to validate that PGIS reduces the overhead of garbage collection significantly
while implementing the channel level internal parallelism.
Table 4.2: Configuration Parameters of the SSD Simulator.
Parameter Value
Channel 8
Die 8
Plane per Die 8
Block per Plane 2048
Page per Block 64
Page Size 8KB
Reserve Block Percentage 20%
Read Latency 25us
Write Latency 200us
Erase Latency 1500us
4.4.1 Experiment setup
Our SSD simulator is a modified version of DiskSim with the SSD extension [82]. DiskSim with the SSD extension does not support channel-level internal parallelism, so we implemented channel-level internal parallelism in our simulator. The proposed approach is implemented in the host interface logic to schedule I/O requests. We implemented a state-of-the-art I/O scheduler (based on the Noop scheduling algorithm) and a flash-based I/O scheduler [17] to perform a quantitative comparison. In this study, we model a 64 GB SSD configured with 8 channels, each equipped with 8 dies. The exact configuration of our simulator is described in Table 4.2, which shows that the 64 GB SSD contains 131072 blocks; each flash block consists of 64 pages with a page size of 8 KB. A page mapping scheme is implemented as the default FTL mapping scheme, together with a greedy garbage collection scheme and a wear leveling scheme. The reserved block percentage is set to 20% of the SSD. To test the efficiency of PGIS, we choose 7 different traces, including two online transaction processing traces (Fin1, Fin2) [15], two MSR Cambridge traces from servers (Prxy_0, Prn_0) [14] and three search engine traces (Web1, Web2, Web3) [15].
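The stated capacity can be checked by arithmetic (our own check: the 131072-block figure is consistent with Table 4.2 only if "Die 8" is read as the total number of dies, one per channel; with 8 dies per channel the block count would be eight times larger):

```python
# Geometry from Table 4.2, reading "Die 8" as the total number of dies.
dies, planes_per_die, blocks_per_plane = 8, 8, 2048
pages_per_block, page_kb = 64, 8

blocks = dies * planes_per_die * blocks_per_plane
capacity_gb = blocks * pages_per_block * page_kb // (1024 * 1024)
print(blocks, capacity_gb)   # 131072 blocks, 64 GB
```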
4.4.2 Performance analysis
In this section, we run different benchmarks with different I/O schedulers
(including SBIOS, PGIS and Noop). In our experiment, we use the Noop scheduler as a
baseline to measure the efficiency of other schedulers. SBIOS [17] is a flash-based I/O
scheduler which improves the performance by exploiting the internal parallelism. We
introduce SBIOS to measure the performance variation of PGIS. To validate the
efficiency of PGIS, we choose seven benchmarks with different characteristics and
compare their performance. Fin1 and Fin2 are collected from OLTP applications running
at a large financial institution. Web benchmarks (Web1, Web2 and Web3) are collected
from a machine running a web search engine. Prn_0 and Prxy_0 are collected from MSR
Cambridge servers. Table 4.3 summarizes the main characteristics of these seven benchmarks. We divide them into two types: read-intensive and write-intensive. Fin1, Prn_0 and Prxy_0 are write-intensive traces in which read requests account for only a small fraction. Fin2, Web1, Web2 and Web3 are read-intensive traces that contain few write requests; in particular, for Web1, Web2 and Web3 the write request ratio is only 0.01%.
Table 4.3: The characteristics of the traces.
Trace Read(%) IOPS Avg req size(KB)
Fin1 21.6 121.4 8.5
Fin2 82.4 97.8 3.6
Prn_0 10.8 68.1 22.2
Prxy_0 2.9 188.5 6.8
Web1 99.9 322.1 16.2
Web2 99.9 315.2 14.5
Web3 99.9 297.8 15.6
Figure 4.7 shows the performance results. To illustrate the performance clearly, we use the standard response time on the Y axis. In our experiment, we set the response time of the Noop scheduler as the baseline (normalized to 1 in Figure 4.7) to compare the efficiency of the other schedulers. As shown in Figure 4.7, compared with the baseline,
PGIS improves performance by 19.8%, 28.4%, 35.3%, 33.3%, 32.1%, 21.3%, 18.5% for the Fin1, Fin2, Web1, Web2, Web3, Prn_0, and Prxy_0 traces, respectively, with an average of 26.9%.
Figure 4.7: Benchmark performance comparison under different I/O schedulers.
However, when comparing with SBIOS, PGIS only improves performance by 3.7%,
12.6%, 18.9%, 15.8%, 15%, 6%, and 3.7% for the Fin1, Fin2, Web1, Web2, Web3,
Prn_0, and Prxy_0 traces. The reason PGIS does not outperform SBIOS by a large margin is that SBIOS already benefits from exploiting internal parallelism. From Figure 4.7 we can draw two conclusions. The first is that the SSD-based I/O schedulers, SBIOS and PGIS, outperform the Noop scheduler, which validates that exploiting internal parallelism can improve SSD performance. The second is that read-intensive traces benefit greatly from hot data identification and channel-level internal parallelism. For the read-intensive traces Fin2, Web1, Web2 and Web3, the hot read data account for a large fraction of the total read data, with hot read ratios ranging from 27.7% to 36%; when we dispatch the hot data to different channels, the channel-level internal parallelism is fully exploited under such high read intensity. The write-intensive traces Fin1, Prn_0 and Prxy_0 do not benefit as much from channel-level internal parallelism, because their read data account for only a small fraction of the total data, even though the hot read data account for a large fraction of the read data.
Figure 4.8: Comparison for the overhead of garbage collection under different I/O schedulers.
4.4.3 Improved garbage collection efficiency
In this section, we examine the garbage collection overhead under the different schedulers across the seven traces. To clearly illustrate the garbage collection overhead, we use the number of erase operations as the metric. To clearly reflect the garbage collection status, we simulated a 64 GB SSD whose overprovisioned space is set to 20% to accelerate the block reclaiming process.
Figure 4.8 shows the comparison results for the garbage collection overhead. To illustrate the overhead clearly, we use the standard erase operation count on the Y axis; the number of erase operations is a common metric for the garbage collection state of an SSD. In our experiment, we set the erase operation count of the Noop scheduler as the baseline (normalized to 1 in Figure 4.8) to compare the garbage collection overhead of the other schedulers. As shown in Figure 4.8, for traces that mix reads with writes, compared with the baseline PGIS reduces the garbage collection overhead by 16.2%, 10.1%, 22.3% and 26.8% for the Fin1, Fin2, Prn_0 and Prxy_0 traces, respectively, with an average of 18.9%. When compared with SBIOS, PGIS reduces the garbage collection overhead by 19.3%, 11.2%, 27.6% and 30.9% for the Fin1, Fin2, Prn_0 and Prxy_0 traces, respectively, with an average of 22.3%. Our experimental results show that an internal-parallelism dispatching method that is not aware of hot data will increase the overhead of garbage
collection. From Figure 4.8, we can draw three conclusions. First, dispatching hot write data to the same block accelerates the block reclaiming process, but dispatching hot read data to different channels makes no difference to the garbage collection process; this is why PGIS gains no overhead improvement for the Web1, Web2, and Web3 traces. Second, a hot-data-unaware dispatching method increases the overhead of garbage collection: as Figure 4.8 shows, the Noop scheduler outperforms SBIOS in garbage collection overhead. Third, traces with a higher write ratio benefit more from the hot data identification scheme; in Figure 4.8, the biggest improvement in garbage collection overhead is gained by Prxy_0, which contains 97.1% write requests.
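The normalization used in Figure 4.8 can be expressed in a few lines of Python. The erase counts below are hypothetical, chosen only to show how a 16.2% reduction maps to a normalized value of 0.838; they are not the measured results.

```python
def normalized_gc_overhead(erase_counts, baseline="Noop"):
    """Normalize per-scheduler erase counts to the baseline scheduler (baseline = 1.0)."""
    base = erase_counts[baseline]
    return {sched: count / base for sched, count in erase_counts.items()}

# Hypothetical erase counts for one trace (illustration only).
counts = {"Noop": 1000, "SBIOS": 1040, "PGIS": 838}
norm = normalized_gc_overhead(counts)
# norm["PGIS"] is 0.838, i.e., a 16.2% reduction in GC overhead vs. Noop.
```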
Figure 4.9: Normalized response time speedup on seven benchmarks with respect to different channel numbers.
4.4.4 Sensitivity study on channel number
Channel-level internal parallelism is the most common form of internal parallelism in an SSD, and either increasing or decreasing the channel number influences SSD performance. As the preceding section shows, channel-level parallelism is the source of the performance gain brought by PGIS. In this section, we conduct a sensitivity study on the number of channels to illustrate how variations in channel number influence SSD performance.
Figure 4.9 compares the response time of the seven benchmarks under different channel numbers. The Y axis represents the normalized response time speedup; we use the response time of an SSD configured with two channels (initialized to 1 in Figure 4.9) as the baseline against which configurations with other channel numbers are measured.
From Figure 4.9 we can draw three conclusions. First, increasing the channel number improves SSD performance. Take Fin2 for example: when the channel number increases from 2 to 8, SSD performance improves by more than 20%. Second, there is an optimal channel number for benchmarks that mix reads with writes. For these benchmarks, including Fin1, Fin2, Prn_0, and Prxy_0, there is an obvious decreasing trend when the channel number increases from 8 to 12. We believe this decrease is caused by parallelizing hot read data and hot write data: increasing the channel number also increases the probability of overlapping the two. However, this overlap does not influence SSD performance much; the average performance decrease for read-write-mixed benchmarks is less than 2.5%. A recent study by Chen et al. [83] explains this phenomenon well: according to their work, only overlapping random hot read data with hot write data leads to a large performance decrease, and by introducing the type dispatching method, PGIS significantly reduces the probability of such overlap. Third, there is a limit to how far increasing the channel number can improve SSD performance. For read-intensive benchmarks such as Web1, Web2, and Web3, increasing the channel number from 8 to 12 brings no performance gain.
Based on this sensitivity study, we choose 8 as the default channel number in our experiments, since it clearly reflects the variation trend of SSD performance.
4.4.5 Sensitivity study on over provision ratio
The over provision ratio is an important factor influencing the overhead of garbage collection. On one hand, a larger over provision ratio means the SSD reserves more blocks for incoming write requests; these additional reserved blocks reduce the frequency of garbage collection and thus alleviate its overhead. On the other hand, a larger over provision ratio reduces the SSD's user space, which can degrade performance. To find an optimal point between SSD performance and over provision ratio, we investigate the relationship between response time and over provision ratio in our experiments.
Figure 4.10: Normalized response time speedup on seven benchmarks with respect to different over provision ratios.
Figure 4.10 compares the response time of the benchmarks under different over provision ratios. The Y axis represents the normalized response time speedup; we use the response time at a 5% over provision ratio (initialized to 1 in Figure 4.10) as the baseline to measure SSD efficiency under other over provision ratios. As shown in Figure 4.10, for read-intensive traces such as Web1, Web2, and Web3, increasing the over provision ratio has no influence on SSD performance, which validates that the over provision ratio only affects write requests. For the other traces (Fin1, Fin2, Prn_0, and Prxy_0), there is an obvious performance improvement as the over provision ratio increases from 5% to 25%; the average improvement is 7.5%, with Fin2 as the exception. Fin2 gains less from an increased over provision ratio because its write ratio is only 17.6%. Once the over provision ratio exceeds 25%, the improvement stagnates. Based on this sensitivity study, we choose 20% as the default over provision ratio in our experiments, since it clearly reflects the variation trend of garbage collection overhead.
4.5 Related work
Flash-based solid state disks have rich internal parallelism, and much recent research focuses on exploiting it to improve SSD performance. Hu et al. [65] divided SSD parallelism into four levels and found that these levels have different impacts on access latency and system throughput. Chen et al. [8] were the first to evaluate and show that the internal parallelism of SSDs plays an important role in improving performance; they observed that with internal parallelism exploited, the performance of write operations becomes independent of their access patterns (random/sequential) and can even exceed that of read operations. Chen et al. [75] proposed a buffer cache management approach for SSDs that solves the read conflict problem by exploiting read parallelism. Gao et al. [76] proposed an I/O scheduling method for SSDs that solves the access conflict problem using a parallel issuing queue. Guo et al. [17] proposed a novel SSD-based I/O scheduler that triggers internal parallelism by dispatching read requests to different blocks. Wang et al. proposed ParDispatcher [68], which partitions the logical space to issue user I/O requests to SSDs in parallel. However, none of the above works takes the additional garbage collection overhead into consideration. In our work, we propose a GC-aware I/O scheduling method that exploits internal parallelism.
The endurance of a solid state disk is another important metric. One important factor influencing endurance is garbage collection efficiency, and hot data identification is an effective method for improving it, so the other field related to our research is hot data identification. Recently, many schemes based on hot data identification have been proposed to improve garbage collection efficiency. Chang et al. [84] used a two-level LRU list structure to identify incoming hot write requests. Park et al. [77] proposed a multiple bloom-filter scheme to identify hot data in flash memory. These works improve garbage collection efficiency by identifying incoming hot write requests, but none of them applies a hot data identification scheme to read requests. In our work, we introduce hot data identification for read requests to fully utilize the rich channel resources of SSDs. Other work related to our research concerns power law distributions and I/O schedulers for SSDs. Louridas et al. [80] prove that power laws appear in software at the class and function level. Wang et al. [79] show that tags produced in tracing follow power laws. Stan Park et al. proposed the FIOS [62] and FlashFQ [67] algorithms, which take fairness of SSD resource usage into account. Marcus Dunn et al. [3] proposed a new I/O scheduler that tries to avoid the penalty incurred when writing new blocks to SSDs. Jaeho Kim et al. [69] proposed IRBW-FIFO and IRBW-FIFO-RP, which arrange write requests into logical-block-size bundles to improve write performance.
4.6 Summary
In this study, we proposed a parallelism and garbage collection aware I/O scheduler named PGIS, which identifies hot data based on trace characteristics to exploit the channel-level internal parallelism of flash-based storage systems. PGIS not only fully exploits the abundant channel resources in the SSD, but also introduces a hot data identification mechanism to reduce the overhead of garbage collection. By dispatching hot read data to different channels, the channel-level internal parallelism is fully exploited; by dispatching hot write data to the same physical block, the overhead of garbage collection is alleviated. The experimental results show that, compared with existing I/O schedulers, PGIS significantly reduces response time and the overhead of garbage collection.
Chapter 5: Miss Penalty Aware Cache Management Scheme
One of the most well-known physical limitations of the SSD is the "erase-before-write" mechanism. Direct in-place updates are not allowed in SSDs; instead, a time-consuming erase operation must be performed before overwriting. Worse still, the erase operation cannot be performed selectively on a particular page but only on a whole block of the SSD called the "erase unit". Since the size of an erase unit (typically 128 KB or 256 KB) is much larger than that of a page (typically 4 KB or 8 KB), even a small update to a single page requires all pages in the same erase unit to be erased and written again [56]. Frequent erase operations lead to endurance problems, since each SSD cell can withstand only a limited number of erase operations before becoming unreliable. To mitigate this issue, the Flash Translation Layer (FTL) has been proposed and deployed in SSD firmware to emulate in-place updates like a block device. Although the FTL reduces the number of erase operations, write operations remain the main bottleneck in the SSD.
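As a rough illustration of how an FTL emulates in-place updates, the following sketch redirects each overwrite to a fresh physical page and marks the old copy as garbage to be reclaimed later. It is a minimal hypothetical page-mapping model, not the firmware design discussed here; real FTLs also implement garbage collection and wear leveling.

```python
class SimplePageFTL:
    """Minimal page-mapping FTL sketch: out-of-place updates avoid
    erasing a whole erase unit for every small overwrite."""

    def __init__(self, num_pages):
        self.mapping = {}          # logical page number -> physical page number
        self.free = list(range(num_pages))
        self.invalid = set()       # stale physical pages awaiting garbage collection

    def write(self, lpn):
        # Redirect the update to a fresh physical page instead of
        # erasing the block that holds the old copy.
        old = self.mapping.get(lpn)
        if old is not None:
            self.invalid.add(old)  # the old copy becomes garbage
        ppn = self.free.pop(0)
        self.mapping[lpn] = ppn
        return ppn

ftl = SimplePageFTL(num_pages=8)
first = ftl.write(0)
second = ftl.write(0)   # "in-place" update emulated out of place
```

After the second write, the first physical page is invalid and will be reclaimed only when garbage collection erases its block, which is exactly the overhead the schedulers in Chapter 4 try to reduce.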
5.1 The Opportunity for PCM
With the development of non-volatile memory technology, we have more options for improving SSD performance [85] [86] [87]. Phase change memory (PCM) is one of the most promising non-volatile memories; instead of using electrical charges, it stores bit values in the physical state of a chalcogenide material (e.g., Ge2Sb2Te5, i.e., GST). In general, a PCM cell consists of a top electrode and a bottom electrode, with a resistor (heater) and the chalcogenide glass (GST) between them. The cell can be switched to the amorphous state by quickly heating it above the melting point and then quenching the glass; conversely, holding the GST above its crystallization point but below the melting point for a while switches the cell to the crystalline state. In 2016, IBM researchers demonstrated how to store 3 bits in one PCM cell, and PCM is becoming more and more popular.
Due to PCM's high read/write speed, high endurance, and non-volatility, many hybrid storage structures have been proposed to improve SSD performance. Sun et al. [56] propose a hybrid storage architecture whose key idea is to use PCM to store log data; thanks to PCM's in-place updates, byte-addressability, non-volatility, and better endurance, the performance, energy consumption, and lifetime of the SSD are all improved. Tarihi et al. [88] propose a hybrid architecture for solid-state drives (SSDs) that exploits PCM as a write cache to mitigate the SSD's drawbacks of long write latency, high write energy, and finite endurance. Li et al. [58] propose a user-visible hybrid storage system with software-defined fusion methods for PCM and NAND flash, in which PCM is used to improve data reliability and reduce the write amplification of SSDs. In 2016, Radian Memory Systems released a host-based hybrid SSD product called "RMS-325" to improve the reliability of SSD storage systems. This clearly shows that hybrid storage technology is moving from the laboratory research stage to the engineering application stage. With the growing use of SSDs in data centers, hybrid storage technology will attract more and more attention in the coming years.
Many studies on hybrid storage technology focus on exploiting PCM and SSD at the same storage level. While most of them try to reduce the write amplification of SSDs and to improve the data reliability of the whole storage system, none considers improving overall performance based on the read latency disparity between PCM and SSD. In general, the read latency of PCM is at least 500 times lower than that of SSD. Such a large read latency gap between PCM and SSD leads to different miss penalty overheads for the average memory access time in DRAM-based main memory. In modern computer systems, a large part of main memory is used as a page cache to hide storage access latency, and the effectiveness of page cache management algorithms is critical to the performance of the whole storage system. In this study, we propose a Miss Penalty Aware cache management algorithm (MPA) to improve the performance of a hybrid storage system that uses PCM and SSD at the same storage level. MPA not only fully exploits the abundant system resources on the host side, but also gives SSD requests maintained in the page cache high priority to improve the performance of the whole storage system.
5.2 Background and Motivation
5.2.1 PCM vs SSD
In recent years, flash-based Solid State Drives (SSDs) have enabled a revolution in mobile computing and are widely used in data centers and in high-performance computing. Compared to traditional hard disks, SSDs offer substantial performance improvements, but cost limits their adoption in cost-sensitive applications. To address this problem, SSD manufacturers increase SSD density by scaling down the silicon feature size (shrinking the size of a transistor) [89], so that more bits are stored in each SSD cell. High density, however, is a double-edged sword. While it dramatically reduces the cost of SSDs and increases their adoption, it also brings a swift drop in SSD endurance, measured as the number of program/erase (P/E) cycles a cell can sustain before it wears out [90] [91]. Although the 5x-nm (i.e., 50- to 59-nm) generation of MLC NAND flash had an endurance of 10k P/E cycles, modern 2x-nm MLC and TLC NAND flash can endure only 3k and 1k P/E cycles, respectively [92] [93] [94].
Compared to NAND flash, PCM has many advantages, such as in-place updates, byte-addressability, and high endurance [95] [24]. Its most prominent advantage, however, is its fast read access speed. As shown in Table 5.1, the read latency of PCM is at least 500 times lower than that of SSD. Such a large performance gap brings many opportunities to improve the performance of a hybrid storage system that uses PCM and SSD at the same storage level. Our Miss Penalty Aware cache management algorithm is designed precisely to exploit this gap.
Table 5.1: Comparison between SSD and PCM.
Attribute SSD PCM
Non-volatility YES YES
Byte Addressability NO YES
Write Latency 500us 1us
Read Latency 25-50us 50-100ns
Write Energy 0.1-1nJ <1nJ
Read Energy 1nJ 1nJ
Endurance <10^4 10^6-10^8
As PCM technology progresses, PCM has become a good candidate to replace NAND flash, with the advantages of high read/write speed and high endurance [22] [96]. However, due to its high cost and manufacturing limitations, it is still not feasible to replace NAND flash SSDs entirely with PCM. Consequently, many hybrid storage structures that exploit PCM and SSD at the same storage level have been proposed to improve the performance of the whole storage system. In our work, we propose a Miss Penalty Aware cache management algorithm to improve the performance of such a hybrid storage system.
5.2.2 Motivation
Different storage systems exhibit different characteristics. In modern computer systems, DRAM-based main memory is used as a page cache to bridge the performance gap between main memory and the storage system. The page cache management algorithm therefore has a significant effect on I/O performance, because it hides the long latency of the underlying storage. The effectiveness of a page cache management algorithm is generally judged by the Average Memory Access Time (AMAT), which is determined by the Hit Time, the Miss Rate, and the Miss Penalty, as shown in equation (1):
AMAT = Hit Time + (Miss Rate × Miss Penalty)    (1)
According to equation (1), AMAT can be improved by optimizing each of these three factors. Since main memory is DRAM, whose access time is consistent, the Hit Time is constant. Most existing optimizations for page cache management algorithms try to reduce the Miss Rate as much as possible, exploiting access locality to cut the number of user I/O requests actually passed to the underlying storage system. In doing so, they assume that the Miss Penalty is also constant, because the direct access time of the underlying storage is almost uniform. This assumption is valid for most storage systems, such as HDDs and flash-based SSDs. However, when a hybrid SSD that uses PCM and SSD at the same storage level is applied, the story becomes different. The open-channel hybrid SSD exploits PCM and SSD at the same storage level to reduce the write amplification of the SSD.
Miss Penalty = TD + DA    (2)
In such a hybrid storage structure, the access time of PCM is at least 500 times lower than that of SSD. Based on this observation, we take the factor of Miss
Penalty into consideration. Equation (2) shows the factors that determine the Miss Penalty overhead: TD represents the time to load data into DRAM, while DA represents the DRAM access time. According to equation (2), the Miss Penalty is mainly determined by TD, since the DRAM access time is consistent. Because TD depends on the storage access time, the Miss Penalty overhead for PCM is quite different from that for SSD in a hybrid SSD: since the read latency of PCM is at least 500 times lower than that of SSD, the Miss Penalty for PCM is much smaller than the Miss Penalty for SSD. In our research, we therefore assign higher priority to SSD requests located in the page cache to avoid the high Miss Penalty overhead.
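Using equations (1) and (2) together with the device read latencies from Table 5.1, the asymmetry in miss penalty can be illustrated numerically. The DRAM hit time and access time below are assumed values for illustration only; the miss rate is likewise hypothetical.

```python
# AMAT = Hit Time + Miss Rate * Miss Penalty          (equation 1)
# Miss Penalty = TD (device read) + DA (DRAM access)  (equation 2)
HIT_TIME_NS = 100        # assumed DRAM hit time
DRAM_ACCESS_NS = 100     # DA, assumed
SSD_READ_NS = 25_000     # 25 us (Table 5.1, lower bound)
PCM_READ_NS = 50         # 50 ns (Table 5.1, lower bound)

def amat_ns(miss_rate, device_read_ns):
    """Average memory access time in ns for misses served by one device."""
    miss_penalty = device_read_ns + DRAM_ACCESS_NS
    return HIT_TIME_NS + miss_rate * miss_penalty

# With the same 10% miss rate, a miss served by the SSD dominates AMAT,
# while a miss served by PCM barely matters.
amat_ssd = amat_ns(0.10, SSD_READ_NS)   # 100 + 0.1 * 25100 = 2610 ns
amat_pcm = amat_ns(0.10, PCM_READ_NS)   # 100 + 0.1 * 150   = 115 ns
```

This gap is why, in our scheme, a cache slot is worth far more to an SSD-resident page than to a PCM-resident page.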
In a hybrid SSD that uses PCM and SSD at the same storage level, metadata and random or small write data (hot data) can be written to PCM, exploiting its in-place update and byte-addressability properties. This dispatching method also reduces the write amplification of the NAND flash SSD, which would otherwise increase the number of write and erase operations, degrade performance, and shorten the lifespan of the flash. Since the SSD stores cold data in the hybrid SSD, giving high priority to SSD requests located in the cache decreases the probability of paying the high Miss Penalty for SSD requests; on the other hand, such high priority may increase the Miss Rate for PCM requests. To find the balance between Miss Rate and Miss Penalty, MPA includes a mechanism that filters the truly cold data in the page cache. By filtering this cold data, the Miss Rate for PCM requests decreases significantly. By considering not only the Miss Rate but also the Miss Penalty, the AMAT under MPA is significantly improved.
Figure 5.1: The overview of MPA on the I/O path.
5.3 The Design Detail of MPA
5.3.1 System overview
The main goal of the Miss Penalty Aware cache management algorithm is to exploit the read latency difference between PCM and SSD while maintaining the hit rate. Existing DRAM-based cache management algorithms such as LRU and LFU are designed to reduce the Miss Rate by exploiting temporal locality. Our MPA scheme is designed for a hybrid storage system that uses PCM and SSD at the same storage level; it not only takes the hit rate into consideration but also exploits the read latency difference between PCM and SSD to reduce the Miss Penalty overhead. Figure 5.1 shows the hybrid storage system overview with MPA implemented in DRAM, where PCM is used as same-level storage alongside the SSD.
In Figure 5.1, the host level consists of the application level, the virtual file system (VFS), and DRAM. When the user interacts with the operating system, the application level generates requests. These incoming requests are received by the VFS, which dispatches them to the page cache located in DRAM. In the page cache, the page cache management scheme checks whether each incoming request hits in the cache. If so, it is serviced by the page cache immediately; otherwise it is issued to the hybrid storage system below. Here, the hybrid storage system communicates with DRAM through an open-channel interface.
As shown in Figure 5.1, the device level is a hybrid SSD consisting of a memory controller, PCM, and SSD. The most prominent feature of this storage structure is the read latency difference between PCM and SSD, which in turn leads to a Miss Penalty difference between the two. Existing page management schemes, however, only exploit temporal locality to reduce the miss rate and cannot fully tap the potential of the hybrid SSD. Unlike traditional page management schemes that use the Miss Rate as the only lever to improve AMAT, MPA fully exploits this feature of the hybrid SSD by treating the Miss Penalty as the most important lever for improving AMAT.
5.3.2 The balance between Miss Penalty and Miss Rate
In a hybrid SSD that uses PCM and SSD at the same storage level, PCM is typically used to store metadata and random or small write data (hot data), thanks to its in-place update and high endurance properties; due to PCM's limited capacity, most data (cold data) is stored in the SSD, so most SSD-resident data is rarely accessed. Assigning high priority to SSD requests located in the page cache undoubtedly decreases the probability of paying a high Miss Penalty for SSD requests, but it also increases the Miss Rate for PCM requests. To achieve the optimal performance of the hybrid SSD, we need to find a balance between Miss Penalty and Miss Rate. Figure 5.2 shows how we improve the hit rate for PCM requests by filtering the truly cold data in the page cache.
As shown in Figure 5.2, the Miss Penalty Aware cache management scheme contains two parts: a FIFO list that filters cold data and an LRU list that captures workload locality. When a page miss occurs, the missed page is loaded from storage and first added to the FIFO list; if a cache hit occurs before the page leaves the FIFO list, it is promoted to the LRU list, which stores active data.
Figure 5.2: The system overview of MPA scheme.
Through this promotion mechanism, cold data is filtered out in the FIFO list. Figure 5.2 shows an example of how the MPA scheme decreases the Miss Penalty overhead. Under a traditional LRU scheme, request D4 at the end of the LRU list would be evicted from the page cache. However, because D4 belongs to the SSD, which carries the expensive Miss Penalty, the MPA scheme gives it high priority to stay in the page cache and instead evicts PCM request D2 to free cache space. By keeping SSD requests longer in the cache, the MPA scheme reduces the Miss Penalty overhead significantly. In this design, how memory is allocated between the LRU list and the FIFO list influences the performance of MPA; we discuss this issue in a following section.
5.3.3 Cache management algorithm
Existing DRAM-based cache management schemes such as LFU and LRU are mainly designed to improve the cache hit rate by exploiting request access locality, under the assumption that the Miss Penalty is a constant. However, when a hybrid SSD that exploits PCM and SSD at the same storage level is introduced, this assumption becomes invalid: the Miss Penalty for SSD requests is more expensive than that for PCM requests. Therefore, when the read latency difference between PCM and SSD is considered, cache performance can no longer be evaluated by the hit rate alone. To address this, MPA introduces the Miss Penalty as an additional metric for measuring the performance of the cache management scheme. The details of MPA are described in Algorithm 1.
When receiving an incoming request, MPA checks whether the request hits in the cache. There are two cases for a cache hit. The first is a hit in the FIFO list; in this case, the promotion mechanism promotes the hit page to the LRU list. The second is a hit in the LRU list; in this case, MPA moves the hit page to the head of the LRU list to exploit the temporal locality of the workload. There are likewise two cases for a cache miss. The first is that the cache is not full; MPA simply loads the data from the hybrid storage and adds it to the tail of the FIFO list. The second is that the cache is full; a page must then be evicted to free cache space, which makes the problem more complex. To handle this case,
the MPA scheme selects the replacement victim based on the physical storage location of the accessed data: if the accessed data comes from PCM, the victim is selected from the LRU list; otherwise, the victim is selected from the FIFO list.
Since the Miss Penalty for loading data stored in the SSD is more expensive than that for data stored in PCM, the MPA scheme gives SSD requests high priority to stay longer in the page cache. When a cache miss occurs and the cache is full, MPA keeps SSD requests in the LRU list as long as possible. As shown in Figure 5.2, MPA skips two SSD requests, D1 and D4, and instead chooses PCM request D2 for eviction to free the page cache space.
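The two-list structure and the SSD-priority eviction rule above can be sketched in Python as follows. This is a simplified model under our own assumptions (in particular, how a victim is chosen when the preferred list holds no PCM pages), not the dissertation's Algorithm 1.

```python
from collections import OrderedDict

class MPACache:
    """Sketch of a Miss Penalty Aware cache: a FIFO list filters cold
    pages, an LRU list holds active pages, and on eviction SSD-resident
    pages are skipped in favor of cheaper-to-reload PCM pages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = OrderedDict()   # page -> device ("PCM" or "SSD")
        self.lru = OrderedDict()

    def _evict(self, source_device):
        # Miss caused by PCM data -> evict from LRU; otherwise from FIFO.
        victim_list = self.lru if source_device == "PCM" else self.fifo
        if not victim_list:
            victim_list = self.fifo or self.lru
        # Prefer evicting a PCM page: its miss penalty is far smaller.
        for page, dev in victim_list.items():
            if dev == "PCM":
                victim_list.pop(page)
                return
        victim_list.popitem(last=False)  # all SSD pages: evict the oldest

    def access(self, page, device):
        if page in self.fifo:            # hit in FIFO: promote to LRU
            self.lru[page] = self.fifo.pop(page)
            return "hit"
        if page in self.lru:             # hit in LRU: move to MRU end
            self.lru.move_to_end(page)
            return "hit"
        if len(self.fifo) + len(self.lru) >= self.capacity:
            self._evict(device)
        self.fifo[page] = device         # load the missed page into FIFO
        return "miss"
```

For example, with a capacity of three pages, a miss on a fourth page causes the cache to evict a PCM-resident page from the appropriate list while SSD-resident pages are retained.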
5.4 Experimental Evaluation
In this section, we first describe the experimental setup and then evaluate the performance of the MPA scheme by comparing it to existing DRAM-based cache management schemes.
5.4.1 Experiment setup
To evaluate the efficiency of our proposed MPA scheme, we first define the hybrid SSD. In our experiment, we assume the hybrid SSD exploits PCM and SSD at the same storage level. In reality, due to price and technology limitations, the capacity of PCM is generally expected to be smaller than that of the SSD, so we allocate more capacity to the SSD; in our experiment, we opt for an 8:1 SSD-to-PCM ratio.
Table 5.2: Configuration Parameters of the hybrid SSD Simulator.
Parameter            Value
SSD size             32GB
SSD read latency     25us
SSD write latency    200us
SSD erase latency    1500us
SSD page size        4KB
PCM size             4GB
PCM read latency     100ns
PCM write latency    1us
PCM unit size        512byte
Page cache size      4096 (pages)
The hybrid SSD is simulated by a trace-driven simulator based on DiskSim with the SSD extension [82]. We implement a DRAM-based page cache module and a PCM model on top of the simulated SSD. The page cache module simulates a 4096-page cache and provides several page cache management schemes, including MPA, LFU, and LRU. The PCM model simulates a 4GB PCM with in-place updating and byte-addressability. We set the basic data access unit of PCM to 512B, because the minimal request size in our traces is 512B. The hybrid-SSD-specific parameter values used in our simulator are shown in Table 5.2.
In the evaluation, we use two online transaction processing traces [15] and three MSR Cambridge server traces [14] to study the performance impact of the different page cache management schemes. These five enterprise traces all mix reads with writes and can be divided into two categories: read-intensive and write-intensive. Fin2 is a read-intensive trace in which read accesses account for 82.4%, while the other traces are all dominated by write accesses. Table 5.3 shows the characteristics of these traces.
Table 5.3: The characteristics of the traces.
Trace Read(%) IOPS Avg req size(KB)
Fin1 21.6 121.4 8.5
Fin2 82.4 97.8 3.6
Prxy_0 2.9 188.5 6.8
Rsrch_0 9.3 120.1 9.08
Web2 20.1 130.2 9.28
5.4.2 Performance analysis
To better analyze the performance impact of our MPA scheme on real application traces, we first set the PCM requests-to-SSD requests ratio. In a hybrid SSD that uses PCM and SSD at the same storage level, the excellent read latency and high endurance of PCM mean that a higher PCM requests-to-SSD requests ratio leads to greater performance improvement. However, due to price and technology limitations, the capacity of PCM in a real hybrid SSD product is small compared to that of the SSD, and only a limited share of trace items can be dispatched to PCM. We therefore choose a PCM requests-to-SSD requests ratio of 40:60 to explore the efficiency of our MPA scheme. In a later section, we also explore the performance impact of MPA under different PCM requests-to-SSD requests ratios.
Figure 5.3: The Normalized response times under the realistic trace-driven evaluations.
Figure 5.3 shows the normalized response times of the different schemes driven by the five traces when the buffer size is 4096 pages. In our experiment, we set the response time of the LRU scheme as the baseline for measuring the efficiency of the other schemes. LFU, a cache management scheme that exploits access locality, shows response times similar to LRU. Our MPA scheme reduces response times by 12.5% to 30.5% compared with the other schemes, most notably for Prxy_0 and Rsrch_0, where the improvements are 30.5% and 25.8%, respectively. The reason for such a large improvement is clear: for Prxy_0 and Rsrch_0, the write ratio is high and the request size is large, so a cache miss carries a high miss penalty overhead. MPA achieves its large response time improvement by avoiding that high miss penalty.
Figure 5.4: The overall cache hit rate of the different cache schemes.
Figure 5.5: The SSD cache hit rate of the different cache schemes.
In order to investigate the relationship between miss penalty and hit rate, we examine the overall cache hit rate and the SSD cache hit rate collected in a simulation study, as shown in Figure 5.4 and Figure 5.5. In Figure 5.4, we can see that the MPA scheme has the lowest overall cache hit rate of the three schemes. The reason is obvious: the other two schemes try exclusively to exploit access locality to improve the cache hit rate, while our MPA scheme focuses on reducing the miss penalty overhead. However, the overall cache hit rate of our MPA scheme is not reduced significantly, because our management scheme finds a suitable balance between miss penalty and hit rate. This also shows that the hit rate is not the only metric that matters for improving the average response time. As shown in Figure 5.5, compared with the other two schemes, the MPA scheme has the highest SSD cache hit rate under all five real traces, which means the total miss penalty overhead of the MPA scheme is the lowest among the three schemes. The large performance improvement achieved by our MPA scheme also implies that miss penalty is an influential factor affecting the overall performance.
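This trade-off can be made concrete with a back-of-the-envelope model. The latencies and hit rates below are illustrative assumptions, not measured values; the point is only that a scheme with a slightly lower overall hit rate but a higher hit rate for expensive SSD-backed requests can still achieve a lower average response time.

```python
# Illustrative latencies in microseconds (assumed, not measured):
HIT = 0.1        # DRAM page cache hit
PCM_MISS = 5.0   # miss served from PCM
SSD_MISS = 100.0 # miss served from flash: the dominant penalty

def avg_response(pcm_hit, ssd_hit, pcm_share=0.3):
    """Expected response time for a mix of PCM- and SSD-backed requests."""
    pcm_cost = pcm_hit * HIT + (1 - pcm_hit) * PCM_MISS
    ssd_cost = ssd_hit * HIT + (1 - ssd_hit) * SSD_MISS
    return pcm_share * pcm_cost + (1 - pcm_share) * ssd_cost

def overall_hit(pcm_hit, ssd_hit, pcm_share=0.3):
    """Overall cache hit rate, weighted by the request mix."""
    return pcm_share * pcm_hit + (1 - pcm_share) * ssd_hit

# Locality-only scheme: best overall hit rate, but a lower SSD hit rate.
locality = {"pcm_hit": 0.90, "ssd_hit": 0.60}
# Penalty-aware scheme: slightly lower overall hit rate, higher SSD hit rate.
penalty_aware = {"pcm_hit": 0.70, "ssd_hit": 0.68}

assert overall_hit(**penalty_aware) < overall_hit(**locality)
assert avg_response(**penalty_aware) < avg_response(**locality)
```

Because `SSD_MISS` dwarfs the other latencies, the weighted response time is governed almost entirely by the SSD hit rate, which is exactly the behavior observed in Figures 5.4 and 5.5.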
Figure 5.6: Average response time speedup of MPA and conventional cache management schemes under different PCM requests-to-SSD requests ratio.
5.4.3 Sensitivity study on PCM requests-to-SSD requests ratio
The PCM requests-to-SSD requests ratio is an important factor influencing cache scheme performance in the hybrid SSD. Figure 5.6 shows the standard average response time speedup under different PCM requests-to-SSD requests ratios. Here, for the different cache schemes, we use the PCM requests-to-SSD requests ratio of 20:80 as the baseline to measure how the average response time speedup varies under different ratio values.
In Figure 5.6, we can see that a high PCM requests-to-SSD requests ratio leads to a large performance improvement, due to the high read/write speed and in-place updating properties of PCM. Our MPA scheme benefits substantially when the PCM requests-to-SSD requests ratio varies from 20:80 to 40:80. The reason for such a large improvement is obvious. When the PCM request ratio is low, giving high priority to SSD requests in the page cache causes much cold data to stay in the page cache. In this case, the decreasing hit rate for PCM requests degrades the performance of the cache management scheme. In conclusion, when there are enough PCM requests, our MPA scheme can benefit from avoiding the high miss penalty overhead.
5.4.4 Sensitivity study on FIFO-to-LRU ratio
How memory is divided between the LRU list and the FIFO list influences the performance of the MPA scheme. In the experiment, we use a parameter called the "FIFO-to-LRU ratio" to describe the memory allocation between the LRU list and the FIFO list. Figure 5.7 shows the standard response time speedup under different FIFO-to-LRU ratios. Here, for the different traces, we use the FIFO-to-LRU ratio of 10:90 as the baseline to measure how the response time speedup varies under different ratio values. According to Figure 5.7, when the FIFO-to-LRU ratio varies from 10:90 to 50:50, the performance of the MPA scheme improves by more than 13%. The reason for this improvement is that allocating a small memory space to the FIFO list increases the cache miss ratio, because hot pages need enough time in the FIFO list before being promoted to the LRU list. When the FIFO-to-LRU ratio varies from 70:30 to 90:10, the performance of the MPA scheme enters a downtrend. We believe that this downtrend is caused by frequently evicting pages stored in the SSD. When the memory space of the LRU list is small, the miss rate of SSD requests increases, and the miss penalty of SSD requests degrades the performance of MPA. In our experiment, we use a FIFO-to-LRU ratio of 60:40 as the default.
Figure 5.7: Standard response time speedup of MPA on five benchmarks with respect to different FIFO-to-LRU ratios.
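As a rough sketch of the two-list design discussed above (the class name, API, and list sizes here are ours for illustration, not the dissertation's implementation), the probation-and-promotion rule can be written as:

```python
from collections import OrderedDict

class TwoListCache:
    """Sketch of a FIFO/LRU split cache (hypothetical API).

    New pages enter a FIFO probation list; a page re-referenced while in
    the FIFO list is promoted to the LRU list. Eviction removes the oldest
    FIFO page first, so one-touch cold pages never pollute the LRU list.
    """
    def __init__(self, fifo_size, lru_size):
        self.fifo = OrderedDict()   # insertion order = FIFO order
        self.lru = OrderedDict()    # most recently used at the end
        self.fifo_size, self.lru_size = fifo_size, lru_size

    def access(self, page):
        """Return True on a cache hit, False on a miss."""
        if page in self.lru:                  # LRU hit: move to MRU end
            self.lru.move_to_end(page)
            return True
        if page in self.fifo:                 # FIFO hit: promote to LRU
            del self.fifo[page]
            self.lru[page] = None
            if len(self.lru) > self.lru_size:
                self.lru.popitem(last=False)  # evict least recently used
            return True
        self.fifo[page] = None                # miss: admit into FIFO
        if len(self.fifo) > self.fifo_size:
            self.fifo.popitem(last=False)     # evict oldest FIFO page
        return False
```

With a 60:40 split, `fifo_size` would get 60% of the cache pages and `lru_size` the remaining 40%, matching the default ratio used in the experiment.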
5.5 Summary
With the rapid development of PCM technology, thanks to the high read/write speed and remarkable endurance of PCM, the hybrid SSD, which uses PCM and SSD at the same storage level, has been proposed to improve the performance of the SSD. However, due to the large read latency gap between PCM and SSD, a traditional operating system cannot fully exploit the potential of the hybrid SSD. In this study, we propose a Miss Penalty Aware DRAM-based cache management scheme for the hybrid SSD. Our MPA scheme not only exploits access locality to improve the hit rate, but also assigns higher priority to SSD requests located in the page cache to avoid the high miss penalty overhead. Our experiment results clearly show that our MPA scheme can improve the performance of the hybrid SSD significantly compared with other cache management schemes.
6. Chapter 6 Conclusion and Future Work
6.1 Conclusion
Nowadays, the flash-based solid state disk is widely adopted in datacenters, and SSDs are attracting more and more attention. However, due to the limitations of the SSD structure, some issues act as bottlenecks of the whole system. The first is the performance issue. Compared with the traditional hard disk drive, the SSD performs better. However, existing I/O subsystems are mainly designed for traditional hard disk drives and cannot exploit the full potential of the SSD. The second is the reliability issue. Due to the erase-before-write mechanism, in-place updating is forbidden in the SSD. Specifically, before a page can be overwritten, an erase operation has to be performed on its block, so erase operations are triggered by write operations, especially small writes. Frequent erase operations jeopardize the lifetime of an SSD, because each SSD cell can only reliably carry out a limited number of program/erase (P/E) cycles.
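The cost of the erase-before-write constraint can be illustrated with a toy model (a deliberate simplification with a hypothetical block layout, not a real FTL):

```python
PAGES_PER_BLOCK = 4   # real NAND blocks hold 64-256 pages; 4 keeps the toy small

def update_page(block, page_idx, new_data):
    """Updating one page of a full flash block in this toy model: copy out the
    still-valid pages, erase the whole block (consuming one P/E cycle), then
    program the survivors plus the new data back. Returns the new block state
    and the number of P/E cycles spent."""
    survivors = [p for i, p in enumerate(block)
                 if i != page_idx and p is not None]
    erased = [None] * PAGES_PER_BLOCK          # erase wipes every page
    for i, data in enumerate(survivors + [new_data]):
        erased[i] = data                       # program pages back one by one
    return erased, 1
```

So in this model a single small write to a full block costs `PAGES_PER_BLOCK - 1` page copies plus a whole-block erase, which is why frequent small writes accelerate wear.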
In this thesis, targeting the performance issue of the SSD, we revisit the I/O subsystem in the operating system. We reveal that existing I/O schedulers may not be appropriate for SSDs and sometimes even degrade their performance. We also explore some factors which impact the performance of the SSD. Based on our exploration, we propose an SSD-based I/O scheduler called SBIOS. SBIOS fully exploits the characteristics of the SSD. For read requests, SBIOS dispatches them to different blocks to make full use of internal parallelism. For write requests, it tries to dispatch them to the same block to alleviate the block-crossing penalty. The experiment results show that SBIOS improves the performance of the SSD significantly compared with other I/O schedulers.
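The dispatch policy summarized above can be sketched as follows. This is an illustrative model of the idea, not the actual SBIOS implementation; the request format and block size are assumptions:

```python
from collections import deque

BLOCK_SIZE = 64   # assumed pages per logical block

def block_of(lba):
    return lba // BLOCK_SIZE

def schedule(requests):
    """Sketch of the dispatch policy: reads touching different blocks are
    interleaved so they can proceed in parallel, writes to the same block
    are batched back-to-back, and the two phases are separated to avoid
    read-write interference. `requests` is a list of (op, lba) tuples."""
    reads, writes = [], []
    for op, lba in requests:
        (reads if op == "R" else writes).append(lba)

    # Reads: round-robin across distinct blocks to exploit parallelism.
    per_block = {}
    for lba in reads:
        per_block.setdefault(block_of(lba), deque()).append(lba)
    read_order = []
    while any(per_block.values()):
        for q in per_block.values():
            if q:
                read_order.append(q.popleft())

    # Writes: grouped so requests to one block dispatch consecutively.
    write_order = sorted(writes, key=block_of)

    # Batch processing: all reads first, then all writes.
    return [("R", l) for l in read_order] + [("W", l) for l in write_order]
```

For example, two reads in block 0 and one in block 1 are interleaved across the blocks, while writes are reordered so that same-block writes land next to each other.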
Meanwhile, to address the reliability issue of the SSD, we introduce a statistical method to analyze trace access patterns. We reveal that 15 to 25 percent of the blocks are accessed by nearly 50 percent of the I/O requests. Based on this observation, we introduce a hot data identification mechanism and propose a parallelism and garbage collection aware I/O scheduler called PGIS. In PGIS, we classify hot requests by access frequency. To exploit channel-level internal parallelism, we issue hot read requests to different channels. To reduce the overhead of garbage collection in terms of P/E cycles, we issue hot write data to the same physical block. The experiment results show that PGIS significantly prolongs the life of the SSD compared with other I/O schedulers.
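A minimal sketch of frequency-based hot data identification and channel dispatch in the spirit of PGIS (the threshold, channel count, and function names are our assumptions, not the actual implementation):

```python
from collections import Counter

NUM_CHANNELS = 8          # assumed channel count

class HotDataClassifier:
    """Classifies a logical block address as hot once its access count
    reaches a threshold (a simplified frequency-based mechanism)."""
    def __init__(self, threshold=4):
        self.freq = Counter()
        self.threshold = threshold

    def record(self, lba):
        self.freq[lba] += 1

    def is_hot(self, lba):
        return self.freq[lba] >= self.threshold

def dispatch(request, classifier, hot_write_block):
    """Hot reads are spread across channels to exploit channel-level
    parallelism; hot writes are steered to one shared block so the block
    fills with data of similar lifetime, cutting the copy-back work of
    garbage collection when it is later reclaimed."""
    op, lba = request
    classifier.record(lba)
    if not classifier.is_hot(lba):
        return ("cold", lba % NUM_CHANNELS)            # default static mapping
    if op == "R":
        return ("hot-read", hash(lba) % NUM_CHANNELS)  # stripe across channels
    return ("hot-write", hot_write_block)              # co-locate hot writes
```

Co-locating hot writes means the shared block's pages are invalidated at roughly the same rate, so when it is reclaimed few valid pages remain to be copied.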
With the rapid development of non-volatile memory technology, thanks to the in-place updates, byte addressability, and high endurance of PCM, the hybrid SSD, which uses PCM and SSD at the same storage level, has been proposed to improve the performance and reliability of the SSD. However, the hybrid SSD poses a new challenge to cache management algorithms. Because the miss penalties of PCM and SSD differ, the higher hit rate targeted by traditional cache management algorithms may not bring higher performance. To solve this issue, we propose a Miss Penalty Aware cache management scheme (MPA for short) which takes the asymmetry of the cache miss penalty between PCM and SSD into consideration. Our MPA scheme not only uses access locality to increase the hit rate, but also assigns higher priority to SSD requests located in the page cache to avoid the high miss penalty overhead. By combining miss penalty with hit rate, our MPA scheme significantly improves the performance of the hybrid storage system.
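The asymmetric-penalty idea can be reduced to a toy eviction choice; the candidate tuple format and latency numbers below are assumptions for illustration only:

```python
def pick_victim(candidates):
    """Sketch of a miss-penalty-aware eviction choice: among the candidate
    pages, prefer evicting a PCM-backed page, since re-fetching it on a
    later miss is far cheaper than re-fetching an SSD-backed page. Ties
    are broken by idle time, evicting the coldest page first. Each
    candidate is a hypothetical (page_id, backing, idle_time) tuple."""
    PENALTY = {"PCM": 5.0, "SSD": 100.0}   # assumed miss latencies (us)
    # Lower score = better victim: cheap to re-fetch, then longest idle.
    return min(candidates, key=lambda c: (PENALTY[c[1]], -c[2]))[0]
```

Even a long-idle SSD-backed page survives here as long as any PCM-backed candidate exists, which is exactly the priority inversion relative to pure LRU that trades a little hit rate for a much smaller total miss penalty.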
6.2 Future work
Due to the internal architecture of the SSD, different levels of parallelism are available. Hu et al. [65] classify the internal parallelism into four levels: channel-level, chip-level, die-level, and plane-level. In this thesis, we only exploit the channel-level internal parallelism to improve the performance of the SSD. In future work, we will explore the possibility of exploiting the other levels of internal parallelism to further improve the performance of the SSD.
Meanwhile, to prolong the life of the SSD, we introduce a hot data identification mechanism. There is no doubt that reclaiming a block that is full of invalid pages can reduce the overhead of garbage collection. However, due to the limitations of the SSD structure, a suitable wear-leveling algorithm needs to be applied. In future work, we will explore the interplay between PGIS and different wear-leveling algorithms.
For the open-channel hybrid SSD, we are exploring several directions for future work. First, in this thesis, we only implement our MPA scheme in a hybrid SSD simulator. The newly released Linux kernel 4.4 can support open-channel SSDs. In future work, we will build a hardware platform incorporating our MPA scheme, implement the scheme in the Linux kernel, and evaluate its efficiency with benchmarks such as SPEC. Second, we will extend our MPA scheme to the hybrid SSD which uses PCM as a write cache to improve the performance of the whole storage system.
7. References
[1] A. M. Caulfield, A. De, J. Coburn, T. I. Mollow, R. K. Gupta, and S. Swanson, "Moneta:
A high-performance storage array architecture for next-generation, non-volatile
memories," in Microarchitecture (MICRO), 2010 43rd Annual IEEE/ACM International
Symposium on, 2010, pp. 385-395: IEEE.
[2] S. Iyer and P. Druschel, "Anticipatory scheduling: A disk scheduling framework to
overcome deceptive idleness in synchronous I/O," in ACM SIGOPS Operating Systems
Review, 2001, vol. 35, no. 5, pp. 117-130: ACM.
[3] M. P. Dunn and A. N. Reddy, "A new I/O scheduler for solid state devices," Texas A &
M University, 2010.
[4] B. Schroeder and G. A. Gibson, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?," in FAST, 2007, vol. 7, no. 1, pp. 1-16.
[5] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau, "Journal-guided
Resynchronization for Software RAID," in FAST, 2005.
[6] D. A. Patterson, G. Gibson, and R. H. Katz, A case for redundant arrays of inexpensive
disks (RAID) (no. 3). ACM, 1988.
[7] T. Perez and C. A. De Rose, "Non-volatile memory: Emerging technologies and their
impacts on memory systems," Porto Alegre, 2010.
[8] F. Chen, R. Lee, and X. Zhang, "Essential roles of exploiting internal parallelism of
flash memory based solid state drives in high-speed data processing," in High
Performance Computer Architecture (HPCA), 2011 IEEE 17th International
Symposium on, 2011, pp. 266-277: IEEE.
[9] Y. J. Yu et al., "Optimizing the block I/O subsystem for fast storage devices," ACM
Transactions on Computer Systems (TOCS), vol. 32, no. 2, p. 6, 2014.
[10] D. M. Jacobson and J. Wilkes, Disk scheduling algorithms based on rotational position.
Hewlett-Packard Laboratories Palo Alto, CA, 1991.
[11] R. Geist and S. Daniel, "A continuum of disk scheduling algorithms," ACM
Transactions on Computer Systems (TOCS), vol. 5, no. 1, pp. 77-92, 1987.
[12] S. Wu, B. Mao, X. Chen, and H. Jiang, "LDM: Log Disk Mirroring with Improved
Performance and Reliability for SSD-Based Disk Arrays," ACM Transactions on
Storage (TOS), vol. 12, no. 4, p. 22, 2016.
[13] B. Mao et al., "HPDA: A hybrid parity-based disk array for enhanced performance and
reliability," ACM Transactions on Storage (TOS), vol. 8, no. 1, p. 4, 2012.
[14] M. Traces. (2008). Available: http://iotta.snia.org/tracetypes/3
[15] University of Massachusetts. Storage: UMass trace repository. Available:
http://tinyurl.com/k6golon
[16] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S. Manasse, and R. Panigrahy,
"Design Tradeoffs for SSD Performance," in USENIX Annual Technical Conference,
2008, vol. 8, pp. 57-70.
[17] J. Guo, Y. Hu, and B. Mao, "Enhancing I/O Scheduler Performance by Exploiting
Internal Parallelism of SSDs," in International Conference on Algorithms and
Architectures for Parallel Processing, 2015, pp. 118-130: Springer.
[18] J. Guo, Y. Hu, B. Mao, and S. Wu, "Parallelism and Garbage Collection aware I/O
Scheduler with Improved SSD Performance," in IEEE International Parallel and
Distributed Processing Symposium (IPDPS), 2017, pp. 1184-1193: IEEE.
[19] M. Wang, "Improving Performance And Reliability Of Flash Memory Based Solid State
Storage Systems," University of Cincinnati, 2016.
[20] A. Tal, "Two flash technologies compared: NOR vs NAND," White Paper of M-
SYstems, 2002.
[21] A. Inoue and D. Wong, "NAND flash applications design guide," Toshiba America
Electronic Components Inc, 2004.
[22] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, "Scalable high performance main
memory system using phase-change memory technology," ACM SIGARCH Computer
Architecture News, vol. 37, no. 3, pp. 24-33, 2009.
[23] P. Chi, W.-C. Lee, and Y. Xie, "Making B+-tree efficient in PCM-based main memory,"
in Proceedings of the 2014 international symposium on Low power electronics and
design, 2014, pp. 69-74: ACM.
[24] H. Zhang, J. Fan, and J. Shu, "An OS-level Data Distribution Method in DRAM-PCM
Hybrid Memory," in Conference, 2016, pp. 1-14: Springer.
[25] S. Cho and H. Lee, "Flip-N-Write: A simple deterministic technique to improve PRAM
write performance, energy and endurance," in Microarchitecture, 2009. MICRO-42.
42nd Annual IEEE/ACM International Symposium on, 2009, pp. 347-357: IEEE.
[26] D. Liu, T. Wang, Y. Wang, Z. Qin, and Z. Shao, "A block-level flash memory
management scheme for reducing write activities in PCM-based embedded systems," in
Proceedings of the Conference on Design, Automation and Test in Europe, 2012, pp.
1447-1450: EDA Consortium.
[27] A. R. Olson and D. J. Langlois, "Solid state drives data reliability and lifetime," Imation
White Paper, 2008.
[28] N. Flash, "An Introduction to NAND Flash and How to Design It In to Your Next
Product."
[29] S. Zertal, "Exploiting the Fine Grain SSD Internal Parallelism for OLTP and Scientific Workloads," in High Performance Computing and Communications, 2014, pp. 990-997: IEEE.
[30] A. Ban, "Flash file system," ed: Google Patents, 1995.
[31] C. Intel, "Understanding the flash translation layer (FTL) specification," ed, 1998.
[32] E. Gal and S. Toledo, "Algorithms and data structures for flash memories," ACM
Computing Surveys (CSUR), vol. 37, no. 2, pp. 138-163, 2005.
[33] A. Gupta, Y. Kim, and B. Urgaonkar, DFTL: a flash translation layer employing
demand-based selective caching of page-level address mappings (no. 3). ACM, 2009.
[34] M.-L. Chiang and R.-C. Chang, "Cleaning policies in mobile computers using flash
memory," Journal of Systems and Software, vol. 48, no. 3, pp. 213-231, 1999.
[35] P.-L. Wu, Y.-H. Chang, and T.-W. Kuo, "A file-system-aware FTL design for flash-
memory storage systems," in Proceedings of the Conference on Design, Automation and
Test in Europe, 2009, pp. 393-398: European Design and Automation Association.
[36] C. Park, W. Cheon, J. Kang, K. Roh, W. Cho, and J.-S. Kim, "A reconfigurable FTL
(flash translation layer) architecture for NAND flash-based applications," ACM
Transactions on Embedded Computing Systems (TECS), vol. 7, no. 4, p. 38, 2008.
[37] L. M. Caulfield, Symbiotic Solid State Drives: Management of Modern NAND Flash
Memory. University of California, San Diego, 2013.
[38] D. Jung, J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee, "Superblock FTL: A superblock-based
flash translation layer with a hybrid address translation scheme," ACM Transactions on
Embedded Computing Systems (TECS), vol. 9, no. 4, p. 40, 2010.
[39] J.-U. Kang, H. Jo, J.-S. Kim, and J. Lee, "A superblock-based flash translation layer for
NAND flash memory," in Proceedings of the 6th ACM & IEEE International
conference on Embedded software, 2006, pp. 161-170: ACM.
[40] S. Wells, R. N. Hasbun, and K. Robinson, "Sector-based storage device emulator having
variable-sized sector," ed: Google Patents, 1998.
[41] J. Kim, J. M. Kim, S. H. Noh, S. L. Min, and Y. Cho, "A space-efficient flash
translation layer for CompactFlash systems," IEEE Transactions on Consumer
Electronics, vol. 48, no. 2, pp. 366-375, 2002.
[42] S.-W. Lee, D.-J. Park, T.-S. Chung, D.-H. Lee, S. Park, and H.-J. Song, "A log buffer-
based flash translation layer using fully-associative sector translation," ACM
Transactions on Embedded Computing Systems (TECS), vol. 6, no. 3, p. 18, 2007.
[43] S. Lee, D. Shin, Y.-J. Kim, and J. Kim, "LAST: locality-aware sector translation for
NAND flash memory-based storage systems," ACM SIGOPS Operating Systems Review,
vol. 42, no. 6, pp. 36-42, 2008.
[44] M. Rosenblum and J. K. Ousterhout, "The design and implementation of a log-
structured file system," ACM Transactions on Computer Systems (TOCS), vol. 10, no. 1,
pp. 26-52, 1992.
[45] D. Ma, J. Feng, and G. Li, "LazyFTL: a page-level flash translation layer optimized for
NAND flash memory," in Proceedings of the 2011 ACM SIGMOD International
Conference on Management of data, 2011, pp. 1-12: ACM.
[46] H. Cho, D. Shin, and Y. I. Eom, "KAST: K-associative sector translation for NAND
flash memory in real-time systems," in Proceedings of the Conference on Design,
Automation and Test in Europe, 2009, pp. 507-512: European Design and Automation
Association.
[47] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, "Circuit and microarchitecture
evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory
replacement," in Design Automation Conference, 2008. DAC 2008. 45th ACM/IEEE,
2008, pp. 554-559: IEEE.
[48] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked
MRAM L2 cache for CMPs," in High Performance Computer Architecture, 2009.
HPCA 2009. IEEE 15th International Symposium on, 2009, pp. 239-249: IEEE.
[49] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Hybrid cache
architecture with disparate memory technologies," in ACM SIGARCH computer
architecture news, 2009, vol. 37, no. 3, pp. 34-45: ACM.
[50] W. Zhang and T. Li, "Exploring phase change memory and 3D die-stacking for
power/thermal friendly, fast and durable memory architectures," in Parallel
Architectures and Compilation Techniques, 2009. PACT'09. 18th International
Conference on, 2009, pp. 101-112: IEEE.
[51] J. K. Kim, H. G. Lee, S. Choi, and K. I. Bahng, "A PRAM and NAND flash hybrid
architecture for high-performance embedded storage subsystems," in Proceedings of the
8th ACM international conference on Embedded software, 2008, pp. 31-40: ACM.
[52] Y. Park, S.-H. Lim, C. Lee, and K. H. Park, "PFFS: a scalable flash memory file system
for the hybrid architecture of phase-change RAM and NAND flash," in Proceedings of
the 2008 ACM symposium on Applied computing, 2008, pp. 1498-1503: ACM.
[53] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a
scalable dram alternative," in ACM SIGARCH Computer Architecture News, 2009, vol.
37, no. 3, pp. 2-13: ACM.
[54] P. Zhou, B. Zhao, J. Yang, and Y. Zhang, "A durable and energy efficient main memory
using phase change memory technology," in ACM SIGARCH computer architecture
news, 2009, vol. 37, no. 3, pp. 14-23: ACM.
[55] C. Lam, "Cell design considerations for phase change memory as a universal memory,"
in VLSI Technology, Systems and Applications, 2008. VLSI-TSA 2008. International
Symposium on, 2008, pp. 132-133: IEEE.
[56] G. Sun, Y. Joo, Y. Chen, Y. Chen, and Y. Xie, "A hybrid solid-state storage architecture
for the performance, energy consumption, and lifetime improvement," in Emerging
Memory Technologies: Springer, 2014, pp. 51-77.
[57] K. Kim, S.-W. Lee, B. Moon, C. Park, and J.-Y. Hwang, "IPL-P: In-page logging with
PCRAM," Proceedings of the VLDB Endowment, vol. 4, no. 12, pp. 1363-1366, 2011.
[58] Z. Li et al., "A user-visible solid-state storage system with software-defined fusion
methods for PCM and NAND flash," Journal of Systems Architecture, vol. 71, pp. 44-
61, 2016.
[59] A. M. Caulfield, T. I. Mollov, L. A. Eisner, A. De, J. Coburn, and S. Swanson,
"Providing safe, user space access to fast, solid state disks," ACM SIGARCH Computer
Architecture News, vol. 40, no. 1, pp. 387-400, 2012.
[60] X.-Y. Hu, E. Eleftheriou, R. Haas, I. Iliadis, and R. Pletka, "Write amplification analysis
in flash-based solid state drives," in Proceedings of SYSTOR 2009: The Israeli
Experimental Systems Conference, 2009, p. 10: ACM.
[61] J. Kim, S. Seo, D. Jung, J.-S. Kim, and J. Huh, "Parameter-aware I/O management for
solid state disks (SSDs)," IEEE Transactions on Computers, vol. 61, no. 5, pp. 636-649,
2012.
[62] S. Park and K. Shen, "FIOS: a fair, efficient flash I/O scheduler," in FAST, 2012, p. 13.
[63] S. A. Chamazcoti, S. G. Miremadi, and H. Asadi, "On endurance of erasure codes in
SSD-based storage systems," in Computer Architecture and Digital Systems (CADS),
2013 17th CSI International Symposium on, 2013, pp. 67-72: IEEE.
[64] J. Axboe. (2015). fio-2.26(software package).
[65] Y. Hu, H. Jiang, D. Feng, L. Tian, H. Luo, and S. Zhang, "Performance impact and
interplay of SSD parallelism through advanced commands, allocation strategy and data
granularity," in Proceedings of the international conference on Supercomputing, 2011,
pp. 96-107: ACM.
[66] F. Chen, D. A. Koufaty, and X. Zhang, "Understanding intrinsic characteristics and
system implications of flash memory based solid state drives," in ACM SIGMETRICS
Performance Evaluation Review, 2009, vol. 37, no. 1, pp. 181-192: ACM.
[67] K. Shen and S. Park, "FlashFQ: A Fair Queueing I/O Scheduler for Flash-Based SSDs,"
in USENIX Annual Technical Conference, 2013, pp. 67-78.
[68] H. Wang, P. Huang, S. He, K. Zhou, C. Li, and X. He, "A novel I/O scheduler for SSD
with improved performance and lifetime," in Mass Storage Systems and Technologies
(MSST), 2013 IEEE 29th Symposium on, 2013, pp. 1-5: IEEE.
[69] J. Kim, Y. Oh, E. Kim, J. Choi, D. Lee, and S. H. Noh, "Disk schedulers for solid state
drivers," in Proceedings of the seventh ACM international conference on Embedded
software, 2009, pp. 295-304: ACM.
[70] Y. Luo, Y. Cai, S. Ghose, J. Choi, and O. Mutlu, "WARM: Improving NAND flash
memory lifetime with write-hotness aware retention management," in Mass Storage
Systems and Technologies (MSST), 2015 31st Symposium on, 2015, pp. 1-14: IEEE.
[71] M. Bjørling, J. Axboe, D. Nellans, and P. Bonnet, "Linux block IO: introducing multi-
queue SSD access on multi-core systems," in Proceedings of the 6th international
systems and storage conference, 2013, p. 22: ACM.
[72] S. Wu, Y. Lin, B. Mao, and H. Jiang, "GCaR: Garbage collection aware cache
management with improved performance for flash-based SSDs," in Proceedings of the
2016 International Conference on Supercomputing, 2016, p. 28: ACM.
[73] B. Mao and S. Wu, "Exploiting request characteristics and internal parallelism to
improve SSD performance," in Computer Design (ICCD), 2015 33rd IEEE
International Conference on, 2015, pp. 447-450: IEEE.
[74] B. Mao, H. Jiang, S. Wu, Y. Yang, and Z. Xi, "Elastic data compression with improved
performance and space efficiency for flash-based storage systems," in Parallel and
Distributed Processing Symposium (IPDPS), 2017 IEEE International, 2017, pp. 1109-
1118: IEEE.
[75] Z. Chen, N. Xiao, and F. Liu, "SAC: Rethinking the cache replacement policy for SSD-
based storage systems," in Proceedings of the 5th Annual International Systems and
Storage Conference, 2012, p. 13: ACM.
[76] C. Gao, L. Shi, M. Zhao, C. J. Xue, K. Wu, and E. H.-M. Sha, "Exploiting parallelism in
I/O scheduling for access conflict minimization in flash-based solid state drives," in
Mass Storage Systems and Technologies (MSST), 2014 30th Symposium on, 2014, pp. 1-
11: IEEE.
[77] D. Park and D. H. Du, "Hot data identification for flash-based storage systems using
multiple bloom filters," in Mass Storage Systems and Technologies (MSST), 2011 IEEE
27th Symposium on, 2011, pp. 1-11: IEEE.
[78] M. Jung, W. Choi, S. Srikantaiah, J. Yoo, and M. T. Kandemir, "HIOS: A host interface
I/O scheduler for solid state disks," ACM SIGARCH Computer Architecture News, vol.
42, no. 3, pp. 289-300, 2014.
[79] W. Wang, N. Niu, H. Liu, and Y. Wu, "Tagging in assisted tracing," in Proceedings of
the 8th International Symposium on Software and Systems Traceability, 2015, pp. 8-14:
IEEE Press.
[80] P. Louridas, D. Spinellis, and V. Vlachos, "Power laws in software," ACM Transactions
on Software Engineering and Methodology (TOSEM), vol. 18, no. 1, p. 2, 2008.
[81] K. Shah, A. Mitra, and D. Matani, "An O (1) algorithm for implementing the LFU cache
eviction scheme," Technical report, 2010.
[82] N. Agrawal, V. Prabhakaran, T. Wobber, J. D. Davis, M. S. Manasse, and R. Panigrahy,
"Design tradeoffs for SSD performance," in USENIX Annual Technical Conference,
2008, vol. 57.
[83] F. Chen, B. Hou, and R. Lee, "Internal parallelism of flash memory-based solid-state
drives," ACM Transactions on Storage (TOS), vol. 12, no. 3, p. 13, 2016.
[84] L.-P. Chang and T.-W. Kuo, "An adaptive striping architecture for flash memory
storage systems of embedded systems," in Real-Time and Embedded Technology and
Applications Symposium, 2002. Proceedings. Eighth IEEE, 2002, pp. 187-196: IEEE.
[85] J. E. Denny, S. Lee, and J. S. Vetter, "NVL-C: Static analysis techniques for efficient,
correct programming of non-volatile main memory systems," in Proceedings of the 25th
ACM International Symposium on High-Performance Parallel and Distributed
Computing, 2016, pp. 125-136: ACM.
[86] J. Kim, S. Lee, and J. S. Vetter, "PapyrusKV: a high-performance parallel key-value
store for distributed NVM architectures," in Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis, 2017,
p. 57: ACM.
[87] Q. Liu and C. Jung, "Lightweight hardware support for transparent consistency-aware
checkpointing in intermittent energy-harvesting systems," in Non-Volatile Memory
Systems and Applications Symposium (NVMSA), 2016 5th, 2016, pp. 1-6: IEEE.
[88] M. Tarihi, H. Asadi, A. Haghdoost, M. Arjomand, and H. Sarbazi-Azad, "A hybrid non-
volatile cache design for solid-state drives using comprehensive I/O characterization,"
IEEE Transactions on Computers, vol. 65, no. 6, pp. 1678-1691, 2016.
[89] M. A. Roger, Y. Xu, and M. Zhao, "BigCache for big-data systems," in 2014 IEEE
International Conference on Big Data (Big Data),, 2014, pp. 189-194: IEEE.
[90] W. Li, G. Jean-Baptise, J. Riveros, G. Narasimhan, T. Zhang, and M. Zhao,
"CacheDedup: In-line Deduplication for Flash Caching," in FAST, 2016, pp. 301-314.
[91] R. Koller, L. Marmol, R. Rangaswami, S. Sundararaman, N. Talagala, and M. Zhao,
"Write policies for host-side flash caches," in FAST, 2013, pp. 45-58.
[92] X. Jimenez, D. Novo, and P. Ienne, "Wear unleveling: improving NAND flash lifetime
by balancing page endurance," in FAST, 2014, vol. 14, pp. 47-59.
[93] N. Elyasi, M. Arjomand, A. Sivasubramaniam, M. T. Kandemir, C. R. Das, and M. Jung,
"Exploiting intra-request slack to improve ssd performance," in Proceedings of the
Twenty-Second International Conference on Architectural Support for Programming
Languages and Operating Systems, 2017, pp. 375-388: ACM.
[94] Y. Lu, J. Shu, J. Guo, and P. Zhu, "Supporting system consistency with differential
transactions in flash-based SSDs," IEEE Transactions on Computers, vol. 65, no. 2, pp.
627-639, 2016.
[95] H. Wang, J. Zhang, S. Shridhar, G. Park, M. Jung, and N. S. Kim, "DUANG: Fast and
lightweight page migration in asymmetric memory systems," in High Performance
Computer Architecture (HPCA), 2016 IEEE International Symposium on, 2016, pp.
481-493: IEEE.
[96] S. Mittal, J. S. Vetter, and D. Li, "A survey of architectural approaches for managing
embedded DRAM and non-volatile on-chip caches," IEEE Transactions on Parallel and
Distributed Systems, vol. 26, no. 6, pp. 1524-1537, 2015.