Implications of NVM Based Storage on Memory Subsystem Management

Hyokyung Bahn 1,* and Kyungwoon Cho 2

1 Department of Engineering, Ewha University, Seoul 03760, Korea
2 Embedded Research Center, Ewha University, Seoul 03760, Korea; [email protected]
* Correspondence: [email protected]; Tel.: +82-2-3277-4247

Received: 22 November 2019; Accepted: 30 January 2020; Published: 3 February 2020

Featured Application: The authors anticipate that memory and storage configurations explored in this article will be helpful in the design of system software for future computer systems with ever-growing memory demands and the limited density of DRAM.

Abstract: Recently, non-volatile memory (NVM) has advanced as a fast storage medium, and legacy memory subsystems optimized for DRAM (dynamic random access memory) and HDD (hard disk drive) hierarchies need to be revisited. In this article, we explore memory subsystems that use NVM as an underlying storage device and discuss the challenges and implications of such systems. As storage performance becomes close to DRAM performance, existing memory configurations and I/O (input/output) mechanisms should be reassessed. This article explores the performance of systems with NVM based storage emulated by a RAMDisk under various configurations. Through our measurement study, we make the following findings. (1) We can decrease the main memory size without performance penalties when NVM storage is adopted instead of HDD. (2) For buffer caching to be effective, judicious management techniques like admission control are necessary. (3) Prefetching is not effective in NVM storage. (4) The effect of synchronous I/O and direct I/O in NVM storage is less significant than that in HDD storage. (5) Performance degradation due to the contention of multi-threads is less severe in NVM based storage than in HDD. Based on these observations, we discuss a new PC configuration consisting of small memory and fast storage in comparison with a traditional PC consisting of large memory and slow storage. We show that this new memory-storage configuration can be an alternative solution for ever-growing memory demands and the limited density of DRAM memory. We anticipate that our results will provide directions in system software development in the presence of ever-faster storage devices.

Keywords: NVM; storage performance; buffer caching; prefetching; PCM; STT-MRAM

1. Introduction

Due to the wide speed gap between DRAM (dynamic random access memory) and HDDs (hard disk drives), a primary design goal in traditional computer systems has been the minimization of storage accesses [1]. The access time of HDD is in the range of tens of milliseconds, which is 5–6 orders of magnitude slower than the DRAM access time. However, with the recent advances in fast storage technologies such as NAND flash memory and NVM (non-volatile memory), this extremely wide speed gap has been reduced [2–4]. The typical access time of NAND flash memory is less than 50 microseconds, and thus the speed gap between storage and memory is reduced to three orders of magnitude. This trend has been accelerated by the appearance of NVM, of which the access time is about 1–100 times that of DRAM [5,6].


Recently published patents describe detailed micro-architectures to support NVM in various kinds of memory and storage subsystems, implying that the era of NVM is imminent [7,8]. Due to its good characteristics such as small access latency, low power consumption, and long endurance cycles, NVM is expected to be used as secondary storage in addition to NAND flash memory and HDD [9–12]. NVM has also been considered as a candidate for the main memory medium, as it is byte-addressable and has substantial density benefits [13]. However, its access time is rather slower than that of DRAM, and thus it is being considered as fast storage or far memory used along with DRAM. Though NVM is considered for both memory and storage, the focus of this study is on storage.

As storage becomes sufficiently fast by adopting NVM, legacy software layers optimized for HDD need to be revisited. Specifically, storage performance becomes close to that of DRAM, and thus memory management mechanisms need to be reconsidered. As flash-based SSD (solid state drive) was originally designed for substituting HDD, it is perceived as a fast HDD device from the host's viewpoint. Thus, additional software layers such as the FTL (flash translation layer) are installed inside the flash storage so that it can be seen as an HDD-like device. For this reason, major changes in the operating system are not necessary for flash-based SSD. In contrast, as the access latency of storage becomes close to that of DRAM by adopting NVM, we need to revisit the fundamental issues of the operating system's memory management subsystem.

Early flash memory products suffered from the freezing phenomenon, in which the performance of storage I/Os (input/outputs) deteriorates seriously when garbage collection starts [14]. Moreover, there are performance fluctuations in flash storage as the internal state of the device changes over time. Such problems have been improved significantly by adopting an internal buffer and/or cache in flash-based SSD products, and some recent SSDs consist of NVM for accelerating the storage performance even more.

In this article, we analyze the performance implication of NVM based storage and discuss issues in the design of memory subsystems with NVM storage. In particular, we consider two types of NVM storage media, PCM (phase-change memory) and STT-MRAM (spin transfer torque magnetic random access memory), and look into the effectiveness of various I/O mechanisms. In deploying NVM to replace current storage in the traditional storage hierarchy, there are fundamental questions that may be raised with respect to the I/O mechanisms supported by current operating systems. We list some of the questions below and present our answers to these questions obtained through empirical evaluation.

Q1. What will be the potential benefit in the memory-storage hierarchy of a desktop PC if we adopt fast NVM storage?
Q2. Is the buffer cache still necessary for NVM based storage?
Q3. Is prefetching still effective for NVM based storage?
Q4. Is the performance effect of I/O modes (i.e., synchronous I/O, buffered I/O, and direct I/O) similar to HDD storage cases?
Q5. What is the impact of concurrent storage accesses on the performance of NVM based storage?

This article answers the aforementioned questions through a wide range of empirical studies on systems with HDD and NVM storage emulated by a RAMDisk, which provides NVM performance by stalling DRAM access for the proper duration. To do so, we implement an NVM emulator whose latency can be configured appropriately for the target NVM storage media. The use of a RAMDisk with no timing delay provides an upper-bound performance estimate of NVM storage, which can be interpreted as the optimistic performance of STT-MRAM. We also set an appropriate time delay for another type of NVM, PCM. Through this measurement study, we make the following findings.

• We can decrease the main memory size without performance penalties when NVM storage is adopted instead of HDD.
• Buffer caching is still effective in fast NVM storage, but some judicious management techniques like admission control are necessary.
• Prefetching is not effective in NVM storage.
• The effect of synchronous I/O and direct I/O in NVM storage is less significant than that in HDD storage.
• Performance degradation due to the contention of multi-threads is less severe in NVM based storage than in HDD. This implies that NVM can constitute a contention-tolerable storage system for applications with highly concurrent storage accesses.

Based on these observations, we discuss a new memory and storage configuration that can address the issue of rapidly growing memory demands in emerging applications by adopting NVM storage. Specifically, we analyze the cost of a desktop PC consisting of small memory and fast storage based on Intel's Optane SSD, which is the representative NVM product on the market at this time, in comparison with a traditional PC consisting of large memory and slow storage. We show that such a new memory-storage configuration can be an alternative solution for ever-growing memory demands and the limited density of DRAM memory.

The remainder of this article is organized as follows. Section 2 describes the background of this research, focusing on the hardware technologies of PCM and STT-MRAM. Section 3 explains the experimental setup of the measurement studies and the performance comparison results on systems with HDD and NVM storage. In Section 4, we summarize the related work. Finally, Section 5 concludes this article.

2. PCM and STT-MRAM Technologies

PCM (phase-change memory) stores data by using a material called GST (germanium-antimony-tellurium), which has two different states, amorphous and crystalline. The two states can be controlled by setting different temperatures and heating times. As the two states provide different resistance when passing an electric current, we can distinguish the state by sensing the resistance of the cell. Whereas reading the resistance value is fast, modifying the state of a cell takes a long time, and thus writing data in PCM is slower than reading. The endurance cycle, i.e., the maximum number of write operations allowed for a cell, is in the range of 10^6–10^8 in PCM, which is shorter than that of DRAM, and thus it is difficult to use PCM as a main memory medium. Thus, studies on PCM as the main memory adopt additional DRAM in order to reduce the number of write operations on PCM [6,13]. When we use PCM as storage rather than memory, however, such an endurance problem is no longer a significant issue. Note that the endurance cycle of NAND flash memory is in the range of 10^4–10^5, which is even shorter than that of PCM.

Although we consider PCM as storage rather than memory, we need to compare its characteristics with DRAM, as the two layers are tightly related. For example, adopting fast storage can reduce the memory size of the system without performance degradation due to the narrowed speed gap between the two layers. We will discuss this issue further in Section 3.

With respect to the density of media, PCM is anticipated to provide more than DRAM and NAND flash memory. It is reported that the density of DRAM is difficult to scale below 20 nm, and NAND flash memory has also reached its density limit. Note that the cell of NAND flash memory will not be cost effective if it is scaled below 20 nm. Accordingly, instead of scaling down the chip size, 3D stacking technologies have been attempted alternatively in order to scale the capacity. 3D-DDR3 DRAM and V-NAND flash memory have been produced by using this technology [15,16]. Unlike the DRAM and NAND flash memory cases, it is anticipated that PCM will be stable even at the 5 nm node [17]. Samsung and Micron have already announced 8 Gb 20 nm PCM and 1 Gb 45 nm PCM, respectively. Considering the overall situation, PCM is expected to find its place in the storage market soon.

MLC (multi-level cell) is another technology that can enhance the density of PCM. Even though the prototype of PCM is based on SLC (single-level cell) technologies, which can represent only the two states of crystalline and amorphous, studies on PCM have shown that additional intermediate states can be represented by making use of MLC technologies [18]. That is, MLC can represent more than two states in a cell by selecting multiple levels of electrical charge. By virtue of MLC technologies, the density of PCM can be an order of magnitude higher than that of other NVM media such as MRAM (magnetic RAM) and FRAM (ferroelectric RAM), whose structures make it difficult to adopt MLC. For this reason, major semiconductor vendors, like Intel and Samsung, have researched PCM and are now ready to commercialize PCM products.

STT-MRAM (spin transfer torque magnetic random access memory) is another notable NVM technology. STT-MRAM makes use of the magnetic characteristics of a material, of which the magnetic orientations can be set and detected by using electrical signals.
In particular, STT-MRAM utilizes the MTJ (magnetic tunneling junction) device in order to store data [19]. MTJ is composed of two ferromagnetic layers and a tunnel barrier layer, and data in MTJ can be represented by the resistance value, which depends on the relative directions of the two ferromagnetic layers [20]. If the two layers' magnetic fields are aligned in the same direction, the MTJ resistance is low, which represents the logical value of zero. On the other hand, if the two layers are aligned in opposite directions, the MTJ resistance is high, which represents the logical value of one. To read data from an STT-MRAM cell, we should apply a small voltage between the sense and bit lines and detect the current flow. To write data to an STT-MRAM cell, we should push a large current through the MTJ in order to modify the magnetic orientation. According to the current's direction, the two ferromagnetic layers' magnetic fields are aligned in the same or the opposite direction. The level of current necessary for storing data in MTJ is larger than that required for reading from it.

Table 1 lists the characteristics of STT-MRAM and PCM in comparison with DRAM and NAND flash memory [9,17,21]. As listed in the table, the performance and endurance characteristics of STT-MRAM are very similar to those of DRAM, but STT-MRAM consumes less power than DRAM. Similar to NAND flash, PCM has asymmetric read/write latencies. However, PCM does not have erase operations, and its access latency and endurance limit are all better than those of NAND flash.

Table 1. Characteristics of non-volatile memory (NVM) technologies in comparison with dynamic random access memory (DRAM) and NAND flash memory.

                              DRAM       STT-MRAM      PCM             NAND Flash
Maturity                      Product    Prototype     Product         Product
Read latency                  10 ns      10 ns         20–50 ns        25 us
Write latency                 10 ns      10 ns         80–500 ns       200 us
Erase latency                 N/A        N/A           N/A             200 ms
Energy per bit access (r/w)   2 pJ       0.02 pJ       20 pJ/100 pJ    10 nJ
Static power                  Yes        No            No              No
Endurance (writes/bit)        10^16      10^16         10^6–10^8       10^5
Cell size                     6–8 F^2    >6 F^2        5–10 F^2        4–5 F^2
MLC                           N/A        4 bits/cell   4 bits/cell     4 bits/cell

3. Performance Implication of NVM Based Storage

We perform measurement experiments to investigate the effect of NVM based storage. We set the configuration of the CPU, memory, and storage to an Intel Core i5-3570, 8 GB DDR3, and a 1 TB SATA 3 Gb/s HDD, respectively. We install the 64-bit Linux kernel 4.0.1 and the Ext4 file system. We consider two types of NVM storage, PCM and STT-MRAM. For now, as it is difficult to perform an experiment with commercial PCM and STT-MRAM products, we make use of a RAMDisk consisting of DRAM and set certain time delays to emulate NVM performance. To do so, we implement an NVM emulator by making use of the RAMDisk device driver and PCMsim, which is publicly available [22]. As the optimistic performance of STT-MRAM is expected to be similar to that of DRAM, we set no time delays for emulating STT-MRAM, which we call RAMDisk throughout this article. For the PCM case, we set the latency of read and write operations to 4.4× and 10× that of DRAM, respectively, in our NVM device driver, which is the default setting of PCMsim [22].

In our experiments, we also install the Ext4 file system on the RAMDisk. For benchmarking, we use IOzone, which is the representative microbenchmark that measures the performance of storage systems by making a sequence of I/O operations [23]. In our experiment, sequential read, random read, sequential write, and random write IOzone scenarios with a total footprint of 2 GB were performed. We also use the Filebench benchmark for some additional experiments.
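To make the delay-based emulation concrete, the sketch below illustrates the general idea in user space: the time taken by the actual DRAM copy is measured and then stretched by the configured read/write multipliers (4.4× and 10×, the PCMsim defaults reported above). This is an illustrative model of our own, not the PCMsim kernel driver itself, and the constants and function names are assumptions introduced for the example.

```c
/* pcm_delay_sketch.c -- illustrative only; not the PCMsim kernel driver.
 * Models PCM service time as a multiple of the DRAM (RAMDisk) copy time:
 * read 4.4x and write 10x, the default multipliers used in our setup. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define PCM_READ_SCALE  4.4   /* assumed: PCM read latency  = 4.4 x DRAM */
#define PCM_WRITE_SCALE 10.0  /* assumed: PCM write latency = 10  x DRAM */

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Copy a block as a RAMDisk would, then busy-wait so that the total
 * service time matches the emulated PCM latency for this operation. */
static void pcm_access(void *dst, const void *src, size_t len, int is_write)
{
    double scale = is_write ? PCM_WRITE_SCALE : PCM_READ_SCALE;
    double start = now_ns();

    memcpy(dst, src, len);            /* the actual DRAM work */

    double dram_ns = now_ns() - start;
    double target  = dram_ns * scale; /* emulated PCM service time */
    while (now_ns() - start < target)
        ;                             /* spin for the extra delay */
}

int main(void)
{
    static char a[1 << 20], b[1 << 20];
    double t0 = now_ns();
    pcm_access(b, a, sizeof(a), 1);   /* emulated 1 MB PCM write */
    printf("emulated write took %.0f ns\n", now_ns() - t0);
    return 0;
}
```

In the real emulator the same scaling is applied inside the block device driver's request path rather than around a user-space memcpy, but the relationship between the DRAM copy time and the injected delay is the same.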

3.1. Effect of the Memory Size

In this section, we measure the performance of NVM when the memory size is varied. Figure 1 shows the benchmarking performance of HDD, RAMDisk, and PCMdisk as a function of the memory size when executing IOzone. As shown in Figure 1a,b, not only RAMDisk but also HDD exhibits surprisingly high performance when the memory size is 1 GB or more. This is because all requested data already reside in the buffer cache when the memory size is large enough. However, when the memory size is small, there is a wide performance gap between HDD and NVM in read operations. Specifically, the throughput of the RAMDisk is over 5 GB/s for all cases, but that of HDD degrades significantly as the memory size becomes 512 MB or less. The performance of the PCMdisk sits in between those of HDD and RAMDisk.

Figure 1. Performance comparison of a hard disk drive (HDD), RAMDisk, and phase-change memory (PCM) disk varying the main memory size: (a) sequential read; (b) random read; (c) sequential write; (d) random write.

This provides significant implications for utilizing NVM in future memory and storage systems. Specifically, the memory size is difficult to extend any further when considering the ever-increasing footprint of big data applications as well as the scaling limitation of DRAM. To relieve this problem, NVM may provide an alternative solution. That is, a future computer system can be composed of limited memory capacities but with NVM storage devices, so as not to degrade performance. Such "small memory and fast storage" hierarchies can be efficient for memory-intensive applications such as big data and/or scientific applications.

Now, let us see the results for the write operations. As shown in Figure 1c,d, the average performance improvements of the PCMdisk and RAMDisk are 1204% and 697%, respectively, compared to the HDD performance. The reason is that a write system call triggers flush operations to the storage even though the memory space is sufficient. Specifically, file systems perform flush or journaling operations, which periodically write the modified data to storage so as not to lose the modifications against power failure situations. When flush is triggered, the modifications are directly written to the original position in the storage, whereas journaling writes the modifications to the storage's journal area first and then reflects them to the original position later. The default period of journaling/flush is set to 5 seconds in the Ext4 file system, and thus write operations cannot be buffered although sufficient memory space is provided.

3.2. Effectiveness of Buffer Cache

In this section, we investigate the performance effect of the buffer cache when NVM storage is used. The buffer cache stores requested file data in a certain part of the main memory, thereby servicing subsequent requests directly without accessing slow storage. The primary goal of buffer cache management is to minimize the number of storage accesses. However, as the storage becomes sufficiently fast by adopting NVM, one may wonder if the traditional buffer cache will still be necessary. To investigate this, we first set the I/O mode to synchronous I/O and direct I/O, and measure the performance of the NVM storage.

In the synchronous I/O mode, a write system call is directly reflected to storage devices. We can set the synchronous I/O mode by opening a file with the O_SYNC flag. In the direct I/O mode, file I/O bypasses the buffer cache and transfers file data between storage and user space directly. The direct I/O mode can be set by opening a file with the O_DIRECT flag. Figures 2–4 compare the performance of the default I/O mode, which is denoted as "baseline", the synchronous I/O mode, and the direct I/O mode for HDD, RAMDisk, and PCMdisk, respectively. Note that the default I/O mode is used when user applications do not specify specific I/O modes, and it performs I/O with asynchronous write, buffered I/O, and prefetching turned on.
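As a concrete illustration of the three I/O modes compared in Figures 2–4, the sketch below opens the same file in buffered (baseline), synchronous, and direct modes on Linux. It is a minimal example of the open(2) flags themselves, not the IOzone code used in the measurements; note that O_DIRECT additionally requires the user buffer and transfer size to be suitably aligned, which is handled here with posix_memalign().

```c
/* io_modes.c -- minimal sketch of the baseline, O_SYNC, and O_DIRECT modes. */
#define _GNU_SOURCE           /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096            /* O_DIRECT needs block-aligned buffers/sizes */

int main(void)
{
    /* Baseline: buffered, asynchronous writes through the page cache. */
    int fd_buf = open("testfile", O_WRONLY | O_CREAT, 0644);

    /* Synchronous I/O: each write returns only after reaching the device. */
    int fd_sync = open("testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);

    /* Direct I/O: bypasses the buffer cache entirely. */
    int fd_direct = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    if (fd_buf < 0 || fd_sync < 0 || fd_direct < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return 1;
    memset(buf, 'a', BLOCK);

    write(fd_buf, buf, BLOCK);     /* absorbed by the page cache */
    write(fd_sync, buf, BLOCK);    /* blocks until the device acknowledges */
    write(fd_direct, buf, BLOCK);  /* transferred directly from the aligned buffer */

    free(buf);
    close(fd_buf);
    close(fd_sync);
    close(fd_direct);
    return 0;
}
```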

Figure 2. HDD performance under different I/O (input/output) modes: (a) sequential write; (b) random write; (c) sequential read; (d) random read.

Figure 3. RAMDisk performance under different I/O modes: (a) sequential write; (b) random write; (c) sequential read; (d) random read.

Figure 4. PCMdisk performance under different I/O modes: (a) sequential write; (b) random write; (c) sequential read; (d) random read.

Let us first see the effect of the synchronous I/O mode. As shown in Figures 2a,b, 3a,b and 4a,b, the write throughput of synchronous I/O is degraded significantly for all cases including HDD, RAMDisk, and PCMdisk. When comparing NVM and HDD, the performance degradation in NVM is relatively minor compared to that in HDD. This is because writing to HDD is even slower than writing to NVM. Specifically, the performance degradation of HDD is the largest with small random writes when adopting synchronous I/O.

Now, let us see the effect of direct I/O. As the storage becomes sufficiently fast by adopting NVM, one may wonder if direct I/O will perform better than conventional buffered I/O. To answer this question, we investigate the performance of direct I/O that does not use the buffer cache.

In the case of HDD, the performance is degraded significantly for all kinds of operations when direct I/O is used. This is consistent with the expectation, as HDD is excessively slower than DRAM. Now, let us see the direct I/O performance of NVM. When the storage is the RAMDisk, the effect of direct I/O contrasts among different operations. Specifically, the performance of the RAMDisk is degraded for read operations, but it is improved for write operations under direct I/O. Note that the physical access times of DRAM and the RAMDisk are identical, but using buffered I/O still improves the read performance because of the software overhead in I/O operations. That is, accessing data from the buffer cache is faster than accessing the same data from I/O devices, although the buffer cache and the I/O device have identical performance characteristics. Unlike the read case, the write performance of the RAMDisk is improved with direct I/O. This implies that write performance is degraded even though we use the buffer cache. In fact, writing data to the buffer cache incurs additional overhead, as it should be written to both memory and storage. Note that the additional writing to memory is not a burden when storage is excessively slow like HDD, but that is not the case for the RAMDisk.

However, in the case of the PCMdisk, the buffer cache is also effective in write requests, as shown in Figure 4. That is, the baseline (with buffer cache) performs 40% better than direct I/O in write operations with the PCMdisk. Based on this study, we can conclude that the buffer cache is still effective in fast NVM storage. However, in the case of the RAMDisk, it degrades the storage performance for write operations. Thus, to adopt the buffer cache more efficiently, read requests may be cached, but write requests may be set to bypass the buffer cache layer as the storage access time approaches the memory access time.
To investigate the efficiency of the buffer cache in more realistic situations, we capture system call traces while executing a couple of representative I/O benchmarks, and perform replay experiments as the storage device changes. Specifically, file request traces were collected while executing two popular Filebench workloads: proxy and web server. Then, we evaluate the total storage access time with and without the buffer cache for the given workloads. In this experiment, the cache replacement policy is set to LRU (Least Recently Used), which is most commonly used in buffer cache systems. Note that LRU removes the cached item whose access time is the oldest among all items in the buffer cache when free cache space is needed.

Figure 5 shows the normalized storage access time when the underlying storage is HDD and the buffer cache size is varied from 5% to 50% of the maximum cache usage of the workload. A cache size of 100% implies the configuration in which all storage accesses within the workload can be cached simultaneously, which is equivalent to infinite cache capacity. This is an unrealistic condition, and in practical terms, cache sizes of less than 50% represent most real system situations. As shown in the figure, the effect of the buffer cache is very large in HDD storage. In particular, the buffer cache improves the storage access time by 70% on average. This significant improvement happens because HDD is excessively slower than the buffer cache, and thus reducing the number of storage accesses by caching is very important.

Figure 5. Performance effects of the buffer cache when storage is the HDD: (a) Filebench-proxy; (b) Filebench-web server.

Figure 6 shows the results with the RAMDisk as the storage medium. As shown in the figure, the effectiveness of the buffer cache is reduced significantly as the storage changes from HDD to RAMDisk. Specifically, the buffer cache improves the storage access time by only 3% on average when the RAMDisk is adopted. This result clearly shows that as the storage access latency becomes close to that of the buffer cache, the gain from the buffer cache decreases largely in realistic situations. However, we perform some additional experiments and show that the buffer cache is still efficient by managing the cache space appropriately.
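The collapse of the caching benefit from roughly 70% (HDD) to 3% (RAMDisk) follows from a simple service-time model: with a hit ratio h, the access time with a cache is h·t_cache + (1 − h)·t_storage, which we normalize by t_storage (the no-cache case). The small program below evaluates this model with illustrative latency values of our own choosing; they are assumptions used to show the trend, not measurements from this study.

```c
/* cache_gain_model.c -- why caching gains shrink as storage nears DRAM speed.
 * Normalized access time = (h*t_cache + (1-h)*t_storage) / t_storage,
 * where h is the buffer cache hit ratio. Latency values below are
 * illustrative assumptions, not measurements. */
#include <stdio.h>

static double normalized_time(double hit_ratio, double t_cache_us, double t_storage_us)
{
    double with_cache = hit_ratio * t_cache_us + (1.0 - hit_ratio) * t_storage_us;
    return with_cache / t_storage_us;   /* 1.0 means "no benefit from caching" */
}

int main(void)
{
    double h = 0.7;                     /* assumed hit ratio for the workload */

    /* HDD: ~10 ms per storage access vs. ~1 us to service a hit from the cache. */
    printf("HDD:     normalized time = %.3f\n", normalized_time(h, 1.0, 10000.0));

    /* RAMDisk: storage service time assumed only slightly above the cache hit time. */
    printf("RAMDisk: normalized time = %.3f\n", normalized_time(h, 1.0, 1.05));

    return 0;
}
```

When t_storage is orders of magnitude larger than t_cache, the normalized time approaches 1 − h, so every avoided storage access matters; when t_storage approaches t_cache, the normalized time approaches 1 regardless of the hit ratio, which is the behavior observed for the RAMDisk.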

Figure 6. Performance effects of the buffer cache when storage is the RAMDisk: (a) Filebench-proxy; (b) Filebench-web server.

Figure 7 compares the storage access time of two configurations that adopt the same RAMDisk but differ in the management policy of the buffer cache. Specifically, the LRU-cache inserts all requested file data in the cache and replaces the oldest data when free space is needed. Note that this is the most common configuration of the buffer cache, which is also shown in Figure 4. The other configuration, AC-cache, is an admission-controlled buffer cache that allows the admission of file data into the buffer cache only after its second access happens. The rationale of this process is to filter out single-access data, thereby allowing only data that is accessed multiple times to be cached. Although this is a very simple approach, the results in Figure 7 show that AC-cache improves the storage access time by 20–40%. This implies that the buffer cache is still effective under fast storage like STT-MRAM, but more judicious management will be necessary to obtain the maximum performance.

Figure 7. Performance comparison of the buffer cache with Least Recently Used (LRU) and admission-controlled LRU when storage is the RAMDisk: (a) Filebench-proxy; (b) Filebench-web server.
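A minimal sketch of the admission-control idea described above is shown below: a block enters the LRU cache only on its second access, while a small "seen once" filter remembers recently requested block IDs. All names and sizes here are illustrative choices of our own, not the implementation evaluated in Figure 7.

```c
/* ac_cache_sketch.c -- LRU buffer cache with second-access admission control.
 * Blocks are cached only on their second access; a small "seen" table filters
 * out single-access (streaming) blocks. Illustrative sketch only. */
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 4      /* tiny sizes keep the sketch readable */
#define SEEN_SLOTS  8

static long cache[CACHE_SLOTS];  /* cache[0] is the most recently used entry */
static long seen[SEEN_SLOTS];    /* FIFO of block IDs seen exactly once */
static int  cache_n, seen_head;

static int cache_lookup(long block)          /* returns 1 on hit, moves block to MRU */
{
    for (int i = 0; i < cache_n; i++) {
        if (cache[i] == block) {
            memmove(&cache[1], &cache[0], i * sizeof(long));
            cache[0] = block;
            return 1;
        }
    }
    return 0;
}

static void cache_insert(long block)         /* evicts the LRU entry when full */
{
    if (cache_n < CACHE_SLOTS)
        cache_n++;
    memmove(&cache[1], &cache[0], (cache_n - 1) * sizeof(long));
    cache[0] = block;
}

static int seen_before(long block)           /* check and record in the filter */
{
    for (int i = 0; i < SEEN_SLOTS; i++)
        if (seen[i] == block)
            return 1;
    seen[seen_head] = block;
    seen_head = (seen_head + 1) % SEEN_SLOTS;
    return 0;
}

/* Access one block: hits are served from the cache; misses are admitted
 * to the cache only if this is at least the second time the block is seen. */
static void access_block(long block)
{
    if (cache_lookup(block)) {
        printf("block %ld: hit\n", block);
    } else {
        printf("block %ld: miss\n", block);
        if (seen_before(block))
            cache_insert(block);
    }
}

int main(void)
{
    long trace[] = {1, 2, 3, 1, 1, 4, 5, 6, 7, 2, 2};
    for (unsigned i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        access_block(trace[i]);
    return 0;
}
```

In the trace above, the single-access blocks never occupy cache space, while blocks 1 and 2 are admitted on their second request and hit afterwards, which is exactly the filtering effect that makes AC-cache profitable when the storage itself is nearly as fast as the cache.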

3.3. Effects of Prefetching

Prefetching is a widely used technique to improve storage system performance. Specifically, prefetching reads not only the currently requested page but also a certain number of adjacent pages in storage, and stores them in the memory. Note that Linux's prefetching reads up to 32 adjacent pages while handling read I/Os. Prefetching is effective in slow storage systems like HDD, as the access latency of HDD is not sensitive to the amount of data to be transferred, but storage accesses can be decreased if prefetched data are actually used. Note that file accesses usually show sequential patterns, so prefetching is effective in HDD storage. In this section, we analyze the effectiveness of prefetching when NVM storage is adopted. As the storage access time becomes close to the memory access time, the effectiveness of prefetching obviously decreases. Furthermore, prefetching may waste the effective memory space if the prefetched data are not actually used.

To review the effectiveness of prefetching in NVM storage, we execute the IOzone benchmark under HDD and NVM storage with/without prefetching and measure the throughput. Figures 8–10 depict the measured throughput of HDD, RAMDisk, and PCMdisk, respectively, with prefetching (denoted as "ra_on") and without prefetching (denoted as "ra_off"). As shown in Figure 8, prefetching is effective in read performance when the HDD storage is adopted. However, it does not affect the performance of RAMDisk and PCMdisk, as shown in Figures 9 and 10. This implies that prefetching is no longer effective in RAMDisk or PCMdisk.
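The "ra_on"/"ra_off" settings correspond to enabling and disabling the kernel readahead for the test device; the exact mechanism used in our setup is not reproduced here. For reference, one way an application can express a similar intent for a single file is the posix_fadvise() hint shown in the sketch below, which asks the kernel to stop (POSIX_FADV_RANDOM) or resume aggressive (POSIX_FADV_SEQUENTIAL) readahead. This is an illustrative alternative, not the method used for the measurements.

```c
/* readahead_hint.c -- hint the kernel about prefetching for one file.
 * Sketch of a per-file way to disable/enable readahead from user space. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* "ra_off": tell the kernel not to prefetch adjacent pages for this file. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

    /* ... perform the read workload here ... */

    /* "ra_on": restore aggressive sequential readahead. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    close(fd);
    return 0;
}
```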

Figure 8. HDD performance when the prefetching option is turned on/off: (a) sequential write; (b) random write; (c) sequential read; (d) random read.

Figure 9. RAMDisk performance when the prefetching option is turned on/off: (a) sequential write; (b) random write; (c) sequential read; (d) random read.

Figure 10. PCMdisk performance when the prefetching option is turned on/off: (a) sequential write; (b) random write; (c) sequential read; (d) random read.

3.4. Effects of Concurrent Accesses

In modern computing systems, various applications access storage concurrently, increasing the randomness of access patterns and aggravating contention for storage. For example, popular social media web servers service a large number of user requests simultaneously, possibly incurring severe contention due to large random accesses generated from concurrent requests. In virtualized systems with multiple guest machines, I/O requests from each guest machine eventually get mixed up, generating a series of random storage accesses [24]. Mobile systems like smartphones also face such situations, as the mixture of metadata logging by Ext4 and data journaling by SQLite generates random accesses [25]. Database and distributed systems also have similar problems [26].

Thus, supporting concurrent accesses to storage must be an important function of modern storage systems. To explore the effectiveness of NVM under such concurrent accesses, we measure the throughput of HDD, RAMDisk, and PCMdisk as the number of IOzone threads increases. Figure 11 shows the throughput of each storage as the number of threads changes from 1 to 8. As shown in the figure, NVM performs well even though the number of concurrent threads increases exponentially. Specifically, performance improves until the number of threads becomes 8 for most cases. Note that a large number of threads increases access randomness and storage contention. However, we observe that this has only a minor effect on the performance of NVM storage. In contrast, the throughput of HDD degrades as the number of concurrent threads increases.
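To make the concurrency experiment concrete, the sketch below spawns N threads that each read a private region of a file and reports the aggregate throughput, in the spirit of running IOzone with an increasing thread count. It is a simplified stand-in written for illustration, not the benchmark configuration used for Figure 11; the thread count, chunk size, and region size are arbitrary assumptions.

```c
/* concurrent_read.c -- aggregate read throughput with N concurrent threads.
 * Simplified stand-in for multi-threaded benchmarking (compile with -pthread). */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS   4
#define CHUNK      (1 << 20)          /* 1 MB per read() */
#define PER_THREAD (256L * CHUNK)     /* 256 MB region per thread */

static const char *path;

static void *reader(void *arg)
{
    long id = (long)arg;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return NULL; }

    char *buf = malloc(CHUNK);
    off_t off = id * PER_THREAD;      /* each thread scans its own region */
    for (long done = 0; done < PER_THREAD; done += CHUNK)
        if (pread(fd, buf, CHUNK, off + done) <= 0)
            break;

    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
    path = argv[1];

    struct timespec t0, t1;
    pthread_t tid[NTHREADS];
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, reader, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mb  = (double)NTHREADS * PER_THREAD / (1 << 20);
    printf("%d threads: %.1f MB/s aggregate\n", NTHREADS, mb / sec);
    return 0;
}
```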

[Figure 11 panels: (a) HDD, (b) RAMDisk, (c) PCMdisk; y-axis: throughput (KB/s); x-axis: number of threads (1, 2, 4, 8); series: seq_w, seq_r, rand_w, rand_r.]

Figure 11. Performance comparison of HDD, RAMDisk, and PCMdisk varying the number of threads.

In summary, NVM can constitute a contention-tolerable storage system for applications with highly concurrent storage accesses in future computer systems.

3.5. Implications of Alternative PC Configurations

Based on the aforementioned experimental results, this section presents an alternative PC configuration that adopts NVM storage and discusses the effectiveness of such a configuration with respect to cost, lifespan, and power consumption. Due to the recent advances in multi-core and many-core processor technologies, new types of applications such as interactive rendering and machine learning are now also performed on desktop PCs. This is in line with the emerging era of the Fourth Industrial Revolution as well as recent trends like personalized YouTube broadcasting, where creating high-quality content or performing computing-intensive jobs is no longer the domain of the expert. Thus, the performance of desktop systems should be improved in accordance with such situations.

While the processor's computing power has improved significantly, it is difficult to increase the memory capacity of desktop PCs due to the scaling limitation of DRAM. Thus, instead of increasing the memory capacity, we argue that NVM storage can provide an alternative solution to supplement the limited DRAM size in emerging applications. That is, a future computer system can be configured with a limited memory capacity but equipped with NVM storage devices so that performance does not degrade.

Figure 12 shows the architecture of an alternative PC consisting of small DRAM and fast NVM storage in comparison with a traditional PC consisting of large DRAM and slow HDD. Table 2 lists the detailed specifications and price estimates of a "large DRAM and slow HDD PC" (hereinafter referred to as the "HDD PC") and a "small DRAM and fast NVM PC" (hereinafter the "NVM PC"). Considering emerging desktop workloads such as interactive rendering, big data processing, and PC virtualization, we use a large memory of 64 GB in the HDD PC. In contrast, we use a memory size of 16 GB in the NVM PC based on the observations in Section 3.1. For NVM storage, we use Intel's Optane SSD, which is the representative NVM product on the market at this time [27].

As shown in Table 2, the total price of the NVM PC is similar to that of the HDD PC. This is because not only the memory itself but also the mainboard becomes expensive when the memory capacity is very large. A more serious problem is that the memory size is difficult to extend any further due to the scaling limitation of DRAM. When the market for NVM storage fully opens, it can be anticipated that the price of NVM will become more competitive. Even though an auxiliary HDD is added to the NVM PC, it will not significantly affect the total price.

[Figure 12: block diagrams of (a) a traditional PC configuration (CPU with L1/L2 caches, memory interface, large DRAM, I/O interface, slow HDD storage) and (b) an alternative PC configuration (CPU with L1/L2 caches, memory interface, small DRAM, I/O interface, fast NVM storage).]

Figure 12. An alternative PC consisting of small DRAM and fast NVM storage in comparison with the traditional PC consisting of large DRAM and slow HDD.

Now, let us discuss the lifespan issue of NVM. The technical specifications of the Optane 900p indicate that its endurance rating is 8.76 PBW (petabytes written), which implies that we can write 8.76 PB to this device before we need to replace it. As its warranty period is specified to be 5 years, the average amount of daily writing allowed is 8.76 PB/(5 years × 365 days/year) = 4.9152 TB per day. As the size of the Optane 900p product is 280 GB or 480 GB, this implies that writing more than 10 times the full storage capacity every day is allowed for 5 years. When we compare this with flash-based SSDs, we do not need to be concerned about the endurance problem in NVM storage devices.
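As a sanity check on this arithmetic, the daily write budget follows directly from the endurance rating and the warranty period. The short sketch below reproduces the 4.9152 TB/day figure (which, we note, assumes a 1 PB = 1024 TB conversion) and the more-than-10-full-device-writes-per-day margin for the 280 GB and 480 GB models.

```python
# Sketch: daily write budget implied by the Optane 900p endurance rating.
# Figures are taken from the text above; the PB->TB factor of 1024 is our
# assumption, chosen because it reproduces the quoted 4.9152 TB/day value.
endurance_pb = 8.76                  # endurance rating in PBW
warranty_days = 5 * 365              # 5-year warranty
daily_budget_tb = endurance_pb * 1024 / warranty_days
print(f"Daily write budget: {daily_budget_tb:.4f} TB/day")   # -> 4.9152

for capacity_gb in (280, 480):       # available Optane 900p capacities
    full_writes_per_day = daily_budget_tb * 1024 / capacity_gb
    print(f"{capacity_gb} GB model: ~{full_writes_per_day:.1f} full-device writes/day")
```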

Table 2. Specifications of a “large DRAM and slow HDD PC” and a “small DRAM and fast NVM PC”.

Component | 16 GB DRAM + NVM PC | 64 GB DRAM + HDD PC
CPU | $317 (1) | $317 (1)
Mainboard | $56 (2) | $114 (3)
DRAM | $70 (4) | $280 (5)
Storage | $370 (6) | $50 (7)
Misc (CPU cooler, case, power) | $156 | $156
Total price | $969 | $917

Notes: (1) Intel Core i7-7700 3.6 GHz 4-core processor; (2) ASRock H110M-HDS R3.0 Micro ATX LGA1151 motherboard; (3) ASRock B250 Pro4 ATX LGA1151 motherboard; (4) Corsair Vengeance LPX 16 GB; (5) Corsair Vengeance LPX 16 GB × 4; (6) Intel Optane 900P (280 GB); (7) Compute HDD 2 TB.
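The totals in Table 2 can be verified by simply re-adding the per-component prices listed in the table, as in the sketch below.

```python
# Sketch: re-adding the component prices listed in Table 2 (USD).
nvm_pc = {"CPU": 317, "Mainboard": 56, "DRAM": 70, "Storage": 370, "Misc": 156}
hdd_pc = {"CPU": 317, "Mainboard": 114, "DRAM": 280, "Storage": 50, "Misc": 156}

print("16 GB DRAM + NVM PC:", sum(nvm_pc.values()))   # -> 969
print("64 GB DRAM + HDD PC:", sum(hdd_pc.values()))   # -> 917
```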

Now, let us consider the power issue. We analyze the power consumption of the memory and storage layers separately and then sum them. The memory power consumption P_DRAM is calculated as

P_DRAM = P_DRAM_static + P_DRAM_active    (1)

where

P_DRAM_static = Unit_static_power (W/GB) × Memory_size (GB), and
P_DRAM_active = Read_energy_DRAM (J) × Read_freq_DRAM (/s) + Write_energy_DRAM (J) × Write_freq_DRAM (/s).

Unit_static_power is the static power per unit capacity, including both leakage power and refresh power, and Read_energy_DRAM and Write_energy_DRAM refer to the energy required for read and write operations of the unit size, respectively. Read_freq_DRAM and Write_freq_DRAM are the frequencies of read and write operations per unit time, respectively. The NVM power consumption P_NVM is calculated as

P_NVM = P_NVM_idle + P_NVM_active    (2)

where

P_NVM_idle = Idle_rate_NVM × Idle_power_NVM (W), and
P_NVM_active = Read_energy_NVM (J) × Read_freq_NVM (/s) + Write_energy_NVM (J) × Write_freq_NVM (/s).

Idle_rate_NVM is the percentage of time that no read or write occurs on the NVM and Idle_power_NVM is the power consumed by the NVM during idle time. Read_energy_NVM and Write_energy_NVM refer to the energy required for read and write operations of the unit size, respectively. Read_freq_NVM and Write_freq_NVM are the frequencies of read and write operations per unit time, respectively. The HDD power consumption P_HDD is calculated as

P_HDD = P_HDD_idle + P_HDD_active + P_HDD_standby    (3)

where

P_HDD_idle = Idle_rate_HDD × Idle_power_HDD (W),
P_HDD_active = Active_rate_HDD × Active_power_HDD (W), and
P_HDD_standby = Transition_freq_HDD (/s) × (Spin_down_energy_HDD (J) + Spin_up_energy_HDD (J)).

Idle_rate_HDD is the percentage of time that no read or write occurs on the HDD and Idle_power_HDD is the power consumed by the HDD during idle states. Active_rate_HDD and Active_power_HDD refer to the percentage of time that read or write occurs on the HDD and the power consumed by the HDD during active states, respectively. Transition_freq_HDD is the frequency of state transitions between idle and standby states per unit time. Spin_down_energy_HDD and Spin_up_energy_HDD refer to the energy required for spin-down and spin-up operations, respectively.

The static power is consumed consistently in DRAM regardless of any operations, as DRAM memory cells store data in small capacitors that lose their charge over time and must be refreshed. As the size of DRAM increases, this static power keeps increasing to sustain the refresh cycles that retain its data. However, this is not required in NVM because of its non-volatile characteristics. Active power consumption, on the other hand, refers to the power dissipated when data is being read and written.
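To make the model concrete, the sketch below implements Equations (1)–(3) directly; the function names and all numeric parameter values are illustrative assumptions rather than values measured in our experiments.

```python
# Sketch of the power model in Equations (1)-(3).
# All numeric parameters below are illustrative placeholders, not measurements.

def p_dram(unit_static_power_w_per_gb, memory_size_gb,
           read_energy_j, read_freq_per_s, write_energy_j, write_freq_per_s):
    """Equation (1): static (leakage + refresh) power plus active read/write power."""
    p_static = unit_static_power_w_per_gb * memory_size_gb
    p_active = read_energy_j * read_freq_per_s + write_energy_j * write_freq_per_s
    return p_static + p_active

def p_nvm(idle_rate, idle_power_w,
          read_energy_j, read_freq_per_s, write_energy_j, write_freq_per_s):
    """Equation (2): idle power weighted by idle time plus active read/write power."""
    return idle_rate * idle_power_w + \
           read_energy_j * read_freq_per_s + write_energy_j * write_freq_per_s

def p_hdd(idle_rate, idle_power_w, active_rate, active_power_w,
          transition_freq_per_s, spin_down_energy_j, spin_up_energy_j):
    """Equation (3): idle, active, and standby-transition (spin-down/up) power."""
    return idle_rate * idle_power_w + active_rate * active_power_w + \
           transition_freq_per_s * (spin_down_energy_j + spin_up_energy_j)

# Example: compare a 64 GB DRAM + HDD configuration with a 16 GB DRAM + NVM one
# under the same (hypothetical) I/O rate.
hdd_pc = p_dram(0.05, 64, 2e-6, 1e4, 2e-6, 1e4) + \
         p_hdd(0.7, 4.0, 0.3, 6.0, 0.01, 5.0, 9.0)
nvm_pc = p_dram(0.05, 16, 2e-6, 1e4, 2e-6, 1e4) + \
         p_nvm(0.7, 0.05, 5e-6, 1e4, 1e-5, 1e4)
print(f"HDD PC: {hdd_pc:.2f} W, NVM PC: {nvm_pc:.2f} W")
```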
Figure 13 shows the normalized power consumption of the NVM PC in comparison with the HDD PC as the I/O rate is varied. As shown in the figure, the NVM PC, which uses a small DRAM capacity along with NVM, consumes less power than the HDD PC. This is because the static power of DRAM is proportional to its capacity, and thus the reduced DRAM size saves static power. Specifically, the power savings of the NVM PC become large as the I/O rate decreases. When the I/O rate decreases, the static power consumption required for DRAM refresh operations accounts for a large portion of the total power consumption compared to the active power consumption required for actual read/write operations. The average power consumption of the memory and storage layers of the NVM PC is 42.9% of that of the HDD PC in our experiments. This result indicates that the NVM PC will be effective in reducing the power consumption of future computer systems. In summary, we showed that the new memory-storage configuration consisting of small memory and fast storage could be an alternative solution for ever-growing memory demands and the limited scalability of DRAM memory.

Figure 13. Comparison of power consumption.

4. Related Work

As NVM can be adopted in both memory and storage subsystems, a spectrum of studies have been conducted to explore the performance of systems making use of NVM as the main memory [6,13,20,21,28,29] and/or storage [30–35]. Mogul et al. propose a novel memory management technique for systems making use of DRAM and NVM hybrid memory. Their technique allocates pages to DRAM or NVM by utilizing the page's attributes, such that read-only pages are placed on PCM whereas writable pages are placed on DRAM, in order to decrease the write traffic to PCM [28]. Qureshi et al. present a PCM based main memory system and adopt DRAM as PCM's write buffer to increase the lifespan of PCM as well as to accelerate PCM's write performance [6]. Lee et al. propose a PCM based main memory system and try to enhance PCM's write performance by judiciously managing the last level cache. Specifically, they present partial writing and buffer reorganizing schemes that track the changes of data in the last level cache and flush only the changed part to PCM memory [21,29]. Lee et al. present a new page replacement algorithm for memory systems composed of PCM and DRAM [13]. Their algorithm performs online estimation of page attributes and places write-intensive pages on DRAM and read-intensive pages on PCM. Kultursay et al. propose an STT-MRAM based main memory system for reducing the power consumption of the main memory [20]. Their system considers the entire replacement of DRAM with STT-MRAM and exhibits that the performance of STT-MRAM based main memory is competitive with that of DRAM, with a 60% energy-saving effect in the memory.

Now, let us discuss the studies focusing on NVM as a storage device. In the early days, the size of NVM was very small, and thus NVM served as only a certain part of the storage system. NEB [32], MRAMFS [33], and PRAMFS [34] are examples of such cases. They aim to maintain hot data or important data such as metadata on NVM [34]. As the size of NVM was small, they also focused on the efficiency of the NVM storage space. For example, Edel et al. [33] try to save NVM space by making use of compression, whereas Baek et al. [32] perform space saving by extent-based data management.
As the density of NVM increases, studies on NVM storage suggest file system designs that maintain the entire file system in NVM. Baek et al. present a system layer that acts as both the memory and the storage by making use of NVM for managing both memory and file objects [35]. Condit et al. present BPFS, a new file system for NVM that exploits the byte-addressable characteristics of NVM [30]. Specifically, BPFS overwrites data if the write size is small enough for atomic operations, thereby reducing the writing overhead of NVM without compromising reliability. Similarly, Wu et al. propose a byte-addressable file system for NVM, which does not need to pass through the buffer cache layer, by allocating memory address space to the location of the NVM file system [31]. Coburn et al. propose persistent in-memory data structures residing in NVM, which can be created by users [36]. Volos et al. propose a new NVM interface that allows the creation and management of NVM data without the risk of inconsistency from system crashes [37]. Specifically, it enables users to define persistent data with the given primitives, which are automatically managed by transactions, and thus users do not need to be concerned about inconsistency resulting from system failures.

Recently, studies on NVM have focused on the efficiency of I/O paths and software stacks that were originally designed for slow storage like HDD rather than NVM storage. Yang et al. argue that synchronous I/O performs better than asynchronous I/O if storage becomes very fast [38]. Caulfield et al. quantify the overhead of each software stack in I/O operations and propose efficient interfaces for fast NVM storage systems [39]. Our study is different from these studies in that we evaluate NVM storage performance, in particular for PCM and STT-MRAM based storage systems, for a wide range of operating system settings and modes, whereas they focus on a detailed examination of a particular aspect of the storage access path.

5. Conclusions

In this article, we revisited memory management issues, specifically focusing on storage I/O mechanisms when the underlying storage is NVM. As storage performance becomes sufficiently fast by adopting NVM, we found that a reassessment of the existing I/O mechanisms is needed. Specifically, we performed a broad range of measurement studies to observe the implications of using NVM based storage on existing I/O mechanisms. Based on our study, we made five observations. First, the performance gain of NVM based storage is limited due to the existence of the buffer cache, but it becomes large as we reduce the memory size. Second, buffer caching is still effective in fast NVM storage even though its effectiveness is limited. Third, prefetching is not effective when the underlying storage is NVM. Fourth, synchronous I/O and direct I/O do not affect the performance of storage significantly under NVM storage. Fifth, performance degradation due to the contention of multi-threads is less severe in NVM based storage than in HDD.

Based on these observations, we discussed how to address rapidly growing memory demands in emerging applications by adopting NVM storage. Specifically, we analyzed the cost of a system consisting of a small memory and fast NVM storage in comparison with the traditional PC consisting of a large memory and slow storage. We showed that the memory-storage configuration consisting of a small memory and fast storage could be an alternative solution for ever-growing memory demands and the limited scalability of DRAM memory.

In the future, we will study other issues related to the memory and storage layers, such as security and reliability, when NVM is adopted as the storage of emerging computer systems. We anticipate that our findings will be helpful in designing future computer systems with fast NVM based storage.

Author Contributions: K.C. implemented the architecture and algorithm and performed the experiments. H.B. designed the work and provided expertise. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2019R1A2C1009275) and also by the ICT R&D program of MSIP/IITP (2019-0-00074, developing system software technologies for emerging new memory that adaptively learn workload characteristics).

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Ng, S. Advances in disk technology: Performance issues. Computer 1998, 31, 75–81. [CrossRef]
2. Evans, C. Flash vs 3D Xpoint vs storage-class memory: Which ones go where? Comput. Wkly. 2018, 3, 23–26.
3. Stanisavljevic, M.; Pozidis, H.; Athmanathan, A.; Papandreou, N.; Mittelholzer, T.; Eleftheriou, E. Demonstration of Reliable Triple-Level-Cell (TLC) Phase-Change Memory. In Proceedings of the 8th IEEE International Memory Workshop, Paris, France, 15–18 May 2016; pp. 1–4.
4. Zhang, W.; Li, T. Exploring phase change memory and 3D die-stacking for power/thermal friendly, fast and durable memory architectures. In Proceedings of the 18th IEEE International Conference on Parallel Architectures and Compilation Techniques (PACT), Raleigh, NC, USA, 12–16 September 2009; pp. 101–112.
5. Lee, E.; Jang, J.; Kim, T.; Bahn, H. On-demand Snapshot: An Efficient Versioning File System for Phase-Change Memory. IEEE Trans. Knowl. Data Eng. 2012, 25, 2841–2853. [CrossRef]
6. Qureshi, M.K.; Srinivasan, V.; Rivers, J.A. Scalable high performance main memory system using phase-change memory technology. In Proceedings of the 36th International Symposium on Computer Architecture (ISCA), Austin, TX, USA, 20–24 June 2009; pp. 24–33.
7. Nale, B.; Ramanujan, R.; Swaminathan, M.; Thomas, T. Memory Channel that Supports near Memory and Far Memory Access; PCT/US2011/054421; Intel Corporation: Santa Clara, CA, USA, 2013.
8. Ramanujan, R.K.; Agarwal, R.; Hinton, G.J. Apparatus and Method for Implementing a Multi-level Memory Hierarchy Having Different Operating Modes; US 20130268728 A1; Intel Corporation: Santa Clara, CA, USA, 2013.
9. Eilert, S.; Leinwander, M.; Crisenza, G. Phase Change Memory: A new memory technology to enable new memory usage models. In Proceedings of the 1st IEEE International Memory Workshop (IMW), Monterey, CA, USA, 10–14 May 2009; pp. 1–2.
10. Lee, E.; Yoo, S.; Bahn, H. Design and implementation of a journaling file system for phase-change memory. IEEE Trans. Comput. 2015, 64, 1349–1360. [CrossRef]
11. Lee, E.; Kim, J.; Bahn, H.; Lee, S.; Noh, S. Reducing Write Amplification of Flash Storage through Cooperative Data Management with NVM. ACM Trans. Storage 2017, 13, 1–13. [CrossRef]
12. Lee, E.; Bahn, H.; Yoo, S.; Noh, S.H. Empirical study of NVM storage: An operating system's perspective and implications. In Proceedings of the IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Paris, France, 9–11 September 2014; pp. 405–410.
13. Lee, S.; Bahn, H.; Noh, S.H. CLOCK-DWF: A write-history-aware page replacement algorithm for hybrid PCM and DRAM memory architectures. IEEE Trans. Comput. 2014, 63, 2187–2200. [CrossRef]
14. Kwon, O.; Koh, K.; Lee, J.; Bahn, H. FeGC: An Efficient Garbage Collection Scheme for Flash Memory Based Storage Systems. J. Syst. Softw. 2011, 84, 1507–1523. [CrossRef]
15. Weis, N.; Wehn, L.; Igor, I.; Benini, L. Design space exploration for 3d-stacked drams. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, Grenoble, France, 14–18 March 2011; pp. 1–6.
16. Elliot, J.; Jung, E.S. Ushering in the 3D Memory Era with V-NAND. In Proceedings of the Flash Memory Summit, Santa Clara, CA, USA, 12–15 August 2013.

17. Wright, C.D.; Aziz, M.M.; Armand, M.; Senkader, S.; Yu, W. Can We Reach Tbit/sq.in. Storage Densities with Phase-Change Media? In Proceedings of the European Phase Change and Ovonics Symposium (EPCOS), Grenoble, France, 29–31 May 2006.
18. Bedeschi, F.; Fackenthal, R.; Resta, C.; Donz, E.M.; Jagasivamani, M.; Buda, E.; Pellizzer, F.; Chow, D.W.; Cabrini, A.; Calvi, G.M.A.; et al. A multi-level-cell bipolar-selected phase-change memory. In Proceedings of the 2008 IEEE International Solid-State Circuits Conference-Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 3–7 February 2008.
19. Zhang, Y.; Wen, W.; Chen, Y. STT-RAM Cell Design Considering MTJ Asymmetric Switching. SPIN 2012, 2, 1240007. [CrossRef]
20. Kultursay, E.; Kandemir, M.; Sivasubramaniam, A.; Mutlu, O. Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, USA, 21–23 April 2013; pp. 256–267.
21. Lee, B.C.; Ipek, E.; Mutlu, O.; Burger, D. Architecting phase change memory as a scalable DRAM alternative. In Proceedings of the 36th ACM/IEEE International Symposium on Computer Architecture (ISCA), Austin, TX, USA, 20–24 June 2009.
22. PCMSim. Available online: http://code.google.com/p/pcmsim (accessed on 18 January 2020).
23. Norcutt, W. IOzone Filesystem Benchmark. Available online: http://www.iozone.org/ (accessed on 1 October 2019).
24. Tarasov, V.; Hildebrand, D.; Kuenning, G.; Zadok, E. Virtual Machine Workloads: The Case for New NAS Benchmarks. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST), 12–15 February 2013; pp. 307–320.
25. Lee, K.; Won, Y. Smart layers and dumb result: IO characterization of an Android-based smartphone. In Proceedings of the International Conference on Embedded Software (EMSOFT), Tampere, Finland, 7–12 October 2012; pp. 23–32.
26. Stonebraker, M.; Madden, S.; Abadi, D. The End of an Architectural Era (It's Time for a Complete Rewrite). In Proceedings of the 33rd Very Large Data Bases Conference (VLDB), Vienna, Austria, 23–27 September 2007.
27. Intel® SSD Client Family. Available online: https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/consumer-ssds.html (accessed on 18 January 2020).
28. Mogul, J.C.; Argollo, E.; Shah, M.; Faraboschi, P. Operating system support for NVM+DRAM hybrid main memory. In Proceedings of the 12th USENIX Workshop on Hot Topics in Operating Systems (HotOS), Monte Verita, Switzerland, 18–20 May 2009; pp. 4–14.
29. Lee, B.C.; Ipek, E.; Mutlu, O.; Burger, D. Phase change memory architecture and the quest for scalability. Commun. ACM 2010, 53, 99–106. [CrossRef]
30. Condit, J.; Nightingale, E.B.; Frost, C.; Ipek, E.; Lee, B.; Burger, D.; Coetzee, D. Better I/O through byte-addressable, persistent memory. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), Big Sky, MT, USA, 11–14 October 2009; pp. 133–146.
31. Wu, X.; Reddy, A.L.N. SCMFS: A File System for Storage Class Memory. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, USA, 12–18 November 2011.
32. Baek, S.; Hyun, C.; Choi, J.; Lee, D.; Noh, S.H. Design and analysis of a space conscious nonvolatile-RAM file system. In Proceedings of the IEEE Region 10 Conference (TENCON), Hong Kong, China, 14–17 November 2006.
33. Edel, N.K.; Tuteja, D.; Miller, E.L.; Brandt, S.A. MRAMFS: A compressing file system for non-volatile RAM. In Proceedings of the 12th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Volendam, The Netherlands, 4–8 October 2004; pp. 569–603.
34. PRAMFS. Available online: http://pramfs.sourceforge.net (accessed on 1 October 2019).
35. Baek, S.; Sun, K.; Choi, J.; Kim, E.; Lee, D.; Noh, S.H. Taking advantage of storage class memory technology through system software support. In Proceedings of the Workshop on Interaction between Operating Systems and Computer Architecture (WIOSCA), Beijing, China, June 2009.
36. Coburn, J.; Caufield, A.; Akel, A.; Grupp, L.; Gupta, R.; Swanson, S. NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Newport Beach, CA, USA, 5–11 March 2011; pp. 105–118.

37. Volos, H.; Tack, A.J.; Swift, M.M. Mnemosyne: Lightweight persistent memory. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Newport Beach, CA, USA, 5–11 March 2011; pp. 91–104.
38. Yang, J.; Minturn, D.B.; Hady, F. When poll is better than interrupt. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), San Jose, CA, USA, 14–17 February 2012.
39. Caulfield, A.M.; De, A.; Coburn, J.; Mollov, T.I.; Gupta, R.K.; Swanson, S. Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture (MICRO), Washington, DC, USA, 4–8 December 2010; pp. 385–395.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).