
Article

UHNVM: A Universal Heterogeneous Cache Design with Non-Volatile Memory

Xiaochang Li * and Zhengjun Zhai

School of Computer Science and Engineering, Northwestern Polytechnical University, 127 West Youyi Road, Xi’an 710072, China; [email protected] * Correspondence: [email protected]

Abstract: In recent decades, non-volatile memory (NVM) has been anticipated to scale up main memory capacity, improve application performance, and narrow the speed gap between main memory and storage devices, while supporting persistent storage to cope with power outages. However, to fit NVM, existing DRAM-based applications have to be rewritten by developers; the developer must have a good understanding of the targeted application's code so as to manually identify and place the data suited to NVM. In order to intelligently facilitate NVM deployment for existing legacy applications, we propose a universal heterogeneous cache hierarchy which is able to automatically select and store the appropriate application data in non-volatile memory (UHNVM), without compulsory code understanding. In this article, a program context (PC) technique is proposed in user space to help UHNVM classify data. Compared to conventional hot or cold file categories, the PC technique categorizes application data in a fine-grained manner, enabling us to store them either in NVM or in SSDs efficiently for better performance. Our experimental results using a real Optane dual-inline-memory-module (DIMM) card show that our new heterogeneous architecture reduces elapsed times by about 11% compared to the conventional kernel memory configuration without NVM.

Keywords: DRAM; Intel Optane; NVM; UHNVM; program context

1. Introduction

Solid state drives (SSDs) are often used as a cache to expand the memory capacity for data-intensive applications [1–5]. SSDs provide an order of magnitude larger capacity than DRAM and offer a higher throughput than HDDs. For this reason, placing SSDs between DRAM and HDDs is a reasonable choice to boost application speeds economically. However, the long latency of SSDs compared to DRAM inevitably causes a performance decrease, especially for applications with a large memory footprint. As a method to overcome this limitation, NVM is actively being considered to enlarge the memory capacity of computing systems [6–9]. NVM not only has a higher density than DRAM but also provides an access latency comparable to DRAM; NVM thus reduces the performance penalty of memory scalability compared to SSDs. Recently, Intel released a commercial NVM product, the Intel Optane DIMM, and NVM technology is drawing increasing attention from large-scale internet service providers.

However, there is a significant limitation to using NVM in real systems. To take advantage of NVM, existing software must be revised to identify the data that would be suitable for NVM and to maintain them in NVM using proprietary libraries. As an example, Intel offers a set of libraries to maintain data within NVM, such as the persistent memory development kit (PMDK) [10], which includes implementations of various data structures and logging functionalities tailored for NVM. This lack of compatibility requires laborious and time-consuming development, significantly hindering the adoption of NVM in practical systems [7,8].


Motivated by this observation, we present a transparent cache design, which resides between the user applications and the kernel layer and transparently maintains data over the heterogeneous memory. This property eliminates the need for existing software to be modified to exploit NVM, making the use of NVM much easier. The universal heterogeneous cache hierarchy design for NVM (UHNVM) interposes on the system calls that user-level applications use to access file data. We then check whether the requested data reside in NVM, which serves as a file cache of the disks. If the data are found in NVM, they are directly served to the user program; otherwise, the data are retrieved from storage.

To increase the cache hit ratio of UHNVM, we further optimize the caching policy by taking into consideration the physical characteristics of the NVM. Our empirical analysis reveals that NVM (i.e., Intel Optane DIMM) is more sensitive to workload changes than DRAM. Specifically, NVM performance decreases by 80% for random workloads compared to sequential workloads, while DRAM only has a 20% performance gap between the two [4]. This implies that NVM is more suitable for caching data blocks that are likely to be accessed in a sequential pattern rather than a random one. Based on this observation, UHNVM internally predicts the access patterns of data blocks by the program context (PC) of the request, which refers to the call stack that finally invokes the file access operations [11,12]. Because the access patterns of data blocks are highly dependent on their invocation contexts, the PC is known to be a good indicator for predicting the access patterns of data blocks [11,12]. Using this information, UHNVM classifies the access patterns of data blocks into four types: looping, cluster-hot, sequential, and random. UHNVM favorably caches the first two categories because they have high locality and thus are likely to increase the cache hit ratio. For the latter two, UHNVM prefers data blocks with sequential patterns to those with a random pattern, so as to take full advantage of the NVM.

The rest of this article is organized as follows. We explain the background of the Intel Optane DIMM in Section 2. Section 3 describes the motivation and main ideas of UHNVM. Section 4 introduces the design of UHNVM. The experimental results are shown in Section 5. Finally, we conclude in Section 6 with a summary.

2. Background

Before the appearance of commercial NVM, researchers emulated NVM by implementing software emulators over DRAM [13–16]. DRAM has to be managed by software to simulate NVM, so this process is somewhat distorted. Others used hardware emulators, including Intel's persistent memory emulator platform that mimics the latency and bandwidth of NVM [17,18], plain DRAM designs [19–22], and battery-backed DRAM designs [23]. Those hardware emulators are rather complicated ways to reproduce NVM features. Despite the numerous efforts, the previous studies conducted without real NVM devices often fail to produce meaningful results, because most of their experiments are based on emulation mechanisms that do not accurately model the actual behavior and performance of product NVMs [4]. The release of the real NVM product, Intel Optane DIMM, opens a possibility to overcome this limitation, enabling studies under more realistic environments.

Intel Optane DIMM is connected to the memory bus like DRAM, but it provides the persistence of data like storage. However, compared to other persistent storage, Intel Optane DIMM has an order of magnitude lower latency. Intel Optane DIMM can be configured in two operating modes, memory mode and app direct mode, so that it can simultaneously satisfy various demands. Figure 1 briefly shows the data access paths in memory mode (Figure 1a) and app direct mode (Figure 1b). The memory mode provides an abstraction of extended main memory, using DRAM as a cache of the NVM. This mode does not expose the presence of the NVM to user applications. This means that this mode requires no modification of current applications, but the user applications cannot make use of the NVM as they intend. In contrast, app direct mode supports in-memory persistence, which opens up a new approach to software architecture designs. This mode allows applications and file systems to access the Optane DIMM by using CPU instructions. Normally, existing designs for app direct mode create a small Optane cache pool for fast storage. Intel has provided a set of libraries to support and operate Optane.

Figure 1. Data access paths of Intel Optane.

Intel Optane DIMM allows more software layers to be bypassed, since the application program can read from or write to an Optane DIMM without using either a driver or the file system (see Figure 1b). There are four access paths for the Optane DIMM. Two are standard file accesses, through a new NVM-aware file system [24] and a traditional file system, respectively. The NVM-aware file system is designed to take advantage of the new features of the Optane DIMM, and its performance is superior to that of the standard file system. Application programs can also communicate directly with the Optane DIMM as if it were standard host DRAM. All of these are handled by the direct access channel, usually abbreviated to "DAX". For the three access modes in Figure 1b, load and store access, raw device access, and standard file access, the software overheads increase gradually, and the load and store access used in our design achieves the minimum overhead.
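As an illustration of the load/store path described above, the following is a minimal sketch, assuming the Optane DIMM is exposed through a DAX-capable file system mounted at a hypothetical path /mnt/pmem; it uses plain POSIX mmap/msync rather than any code from the paper.

```c
/* Minimal sketch of the load/store (DAX) access path.
 * Assumes a DAX-capable file system mounted at /mnt/pmem (hypothetical path). */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    int fd = open("/mnt/pmem/example", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, len) != 0)
        return 1;

    /* Map the persistent region; subsequent accesses are plain CPU loads/stores. */
    char *pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED)
        return 1;

    memcpy(pmem, "hello", 6);   /* store path: no read()/write() system calls */
    char first = pmem[0];       /* load path: a direct CPU read               */
    (void)first;

    msync(pmem, len, MS_SYNC);  /* conservatively force durability            */
    munmap(pmem, len);
    close(fd);
    return 0;
}
```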

3. Motivation

This section investigates the performance benefits obtained by adopting the Optane DIMM as a storage cache, in terms of the system architecture (Section 3.1) and the performance characteristics of the Optane DIMM with respect to the workload patterns indicated by the PC technique (Sections 3.3 and 3.4), which can be exploited when maintaining a cache within it.

3.1. Memory Architecture with Optane DIMM

Using a fast storage medium (such as an SSD) as a cache is a common method to scale up the memory capacity of computing systems. With the advent of a high-density memory device such as the Optane DIMM, many computing systems consider using such a scalable memory device as a storage cache, as an alternative or supplement to fast storage. Figure 2 compares the data access flow when a traditional SSD is used as a cache of the storage (Figure 2a) with that when an Optane DIMM is deployed (Figure 2b).

Figure 2. Comparison of data flows: (a) conventional data flow; (b) UHNVM data flow.

Figure 2a shows the hierarchical memory system where the SSD serves as a cache between DRAM and the HDD. In this architecture, the host experiences a performance drop when a page fault occurs: the application is blocked until the in-kernel page fault handler loads the requested pages from the SSD (or the HDD in the worst case) into DRAM. For applications with a large memory footprint, the number of page faults increases sharply, leading to serious performance degradation. When the data accesses are random and small, the overhead of the page fault is further aggravated because the SSD performs I/Os at a block-size granularity (i.e., 4 KB) [3,25–27]. Figure 2b shows the parallel memory architecture where the Optane DIMM is deployed as a cache of the storage. As opposed to the SSD-based cache architecture, the Optane DIMM requires no data load operation because it allows applications to access it directly. This advantage not only avoids the software overhead of going through the multiple in-kernel stacks but also eliminates the unnecessary data movement between DRAM and disks.

3.2. Related Work on NVM

In this section, we present an overview of previous studies related to persistent memory programming. Prior NVM studies have spanned almost the entire system stack. Some designs built intricate NVM-related data structures for special actions in applications, such as logging and transaction processing [28–30]. Some tried to customize NVM file systems by adding support for NVM to existing file systems [20,31,32]. For the NVM-related structures, application programmers must be careful during the application design process, which is inconvenient and difficult to deploy. As for the NVM file systems, it is unclear how effectively legacy file systems can evolve to accommodate NVM and how much effort is required to adapt legacy applications to exploit NVM [33]. Compared to the above studies, UHNVM is able to balance performance and complexity thanks to its convenient interfaces in user space and the PC technique.

Figure 3. Latency comparisons between Optane DIMM and DRAM.

3.3. Performance Analysis of Optane DIMM

To investigate the performance characteristics of the Optane DIMM, we measure the read and write latency on a 256 GB Optane DIMM, generating one million read and write operations as individual load/store instructions (8-byte units) in different patterns. The Optane DIMM is deployed in app direct mode. Figure 3 shows the performance of the Optane DIMM compared to that of DRAM. In the figure, we can observe that the Optane DIMM has a higher read and write latency than DRAM due to its inherent physical characteristics. One interesting finding is that the Optane DIMM has a sequential-random gap of almost 2X for reading and 4X for writing, while DRAM has almost the same performance across workload patterns. This result indicates that the Optane DIMM is more sensitive to workload changes than DRAM and is more appropriate for servicing sequential workloads than random ones. This feature increases the possibility of further optimizing the Optane DIMM management mechanism.
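For concreteness, the following is a minimal sketch of this kind of microbenchmark: a fixed number of individual 8-byte loads issued in sequential and in random order, timed with clock_gettime. The buffer allocation, index generation, and timing details are our assumptions, not the authors' harness; on a real Optane DIMM the buffer would come from a persistent-memory mapping rather than calloc.

```c
/* Sketch of a sequential-vs-random load latency microbenchmark (not the
 * authors' harness).  Each access is an individual 8-byte load. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1UL << 20)   /* roughly one million accesses */

static double avg_ns_per_load(const uint64_t *buf, const size_t *idx)
{
    struct timespec t0, t1;
    volatile uint64_t sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        sink += buf[idx[i]];            /* one 8-byte load per iteration */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
}

int main(void)
{
    uint64_t *buf = calloc(N, sizeof(*buf));   /* stand-in for a pmem mapping */
    size_t *seq = malloc(N * sizeof(*seq));
    size_t *rnd = malloc(N * sizeof(*rnd));
    if (!buf || !seq || !rnd)
        return 1;
    for (size_t i = 0; i < N; i++) {
        seq[i] = i;                            /* sequential access order */
        rnd[i] = (size_t)rand() % N;           /* random access order     */
    }
    printf("sequential: %.1f ns/load\n", avg_ns_per_load(buf, seq));
    printf("random:     %.1f ns/load\n", avg_ns_per_load(buf, rnd));
    free(buf); free(seq); free(rnd);
    return 0;
}
```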

3.4. Efficient Data Patterns Indicated by Program Contexts (PCs)

Due to the importance of data patterns, classifying data patterns is a key approach to optimizing the Optane DIMM. As far as we know, most existing studies tag data patterns manually, relying on the programmers' understanding of the code [24,33]. In contrast to traditional designs that need manual code modifications, the PC technique in UHNVM is able to separate data automatically, according to the rules described in Section 4.2, without compulsory code understanding. UHNVM is a lightweight and efficient design. To further illustrate the efficiency of PCs, Figure 4 shows the space-time graph of data references from two applications, cscope and fio-random, executing concurrently. Observing Figure 4, it is hard to separate these data by considering only the coordinate axes, time and block number, while the PC technique is able to indicate and identify them. For example, the data encircled by the top dotted black circle are represented by PC1 (looping pattern), and the data in the bottom dotted red circle are represented by PC2 (random pattern).

Figure 4. Different data patterns indicated by program contexts (x-axis: virtual time ×10,000; y-axis: logical block number).

4. Design of UHNVM

An overview of the UHNVM architecture is shown in Figure 5a; it comprises two main modules, the 'Data classification module' (Section 4.2) and the 'Data distribution module' (Section 4.3). The Optane DIMM in this design is configured in app direct mode with Intel's 'libpmemblk' library. In the following subsections, we give detailed explanations of the UHNVM design. Firstly, we compare the architectures in Figure 5a,b to illustrate their differences in detail in Section 4.1. Secondly, the process by which the program context technique produces a finer granularity is described in Section 4.2. Thirdly, in order to explain the 'Data distribution module', a read I/O request example is used to exhibit the basic data structures in Section 4.3.1 and a data flow showing where a page can be mapped (either to the host DRAM or Optane) in Section 4.3.2.

Fourthly, as not all interfaces are suitable for the Optane DIMM, we show how to design a blacklist to exclude those inappropriate interfaces in Section 4.4.

Figure 5. Two kinds of cache architectures: (a) cache architecture of UHNVM (unified 'U_' interfaces over the data classification and data distribution modules, dispatching to 'O_' Optane interfaces or 'S_' standard interfaces); (b) conventional cache architecture (standard interfaces and Load/Store instructions over the traditional file system and page cache).

4.1. Architecture Comparison

Figure 5 shows the detailed architectures of UHNVM and a traditional design for NVM. NVM is the general concept representing non-volatile memory, while UHNVM denotes our new architecture built on an NVM device. In Figure 5a, our unified interfaces begin with the 'U_' prefix, an abbreviation of 'universal', while the 'S_' prefix (abbreviation of 'standard') represents standard interfaces from the common GNU libraries. The interfaces with the 'O_' prefix (abbreviation of 'Optane') stand for the basic Intel 'libpmemblk' library, used only for Optane DIMM operations. UHNVM wraps the interfaces with both the 'S_' and 'O_' prefixes and provides new universal interfaces with the 'U_' prefix to developers. Through the 'libpmemblk' library, the Optane DIMM in UHNVM is configured and treated as one storage pool; the size of each block unit in this pool is 4 KB. In Figure 5a, the universal interfaces with the 'U_' prefix are able to automatically split the data stream into Optane or host DRAM by judging the pattern results from the 'Data classification module'. The red dotted line represents the bridge between the Optane DIMM and the host DRAM for migrating data, which is described in Section 4.3. The red dotted box in Figure 5a is named the 'added universal layer' (AU layer), which is created as a static library after being compiled. With the AU layer, developers can use batch scripts to replace standard interfaces with the universal interfaces of UHNVM, because our universal interfaces use the same function parameters as their traditional counterparts.

After adding the AU layer to an application, programmers no longer need to manually modify its source code. Under the traditional architecture in Figure 5b, data classification for the Optane DIMM can only be performed during the code design process of each application. The 'Load' and 'Store' instructions in Figure 5b are the basic commands for developers to manipulate the Optane DIMM. For example, a 'Load' instruction reads a complete XPLine into the XPBuffer [4], and conversely, a 'Store' instruction flushes and fences data to persistence. In this way, it is inconvenient and inefficient to rewrite all legacy applications for the Optane DIMM.
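To make the 'O_' layer concrete, the following is a minimal sketch of how a 4 KB-block Optane pool can be created and accessed through Intel's libpmemblk; the pool path and pool size below are placeholders of our own, not values taken from the paper.

```c
/* Minimal libpmemblk sketch: treat the Optane DIMM as a pool of 4 KB blocks.
 * The pool path and size are placeholders. */
#include <libpmemblk.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define POOL_SIZE  ((size_t)2 << 30)   /* 2 GB pool (illustrative) */

int main(void)
{
    const char *path = "/mnt/pmem/uhnvm.pool";
    PMEMblkpool *pbp = pmemblk_create(path, BLOCK_SIZE, POOL_SIZE, 0666);
    if (pbp == NULL)
        pbp = pmemblk_open(path, BLOCK_SIZE);   /* pool may already exist */
    if (pbp == NULL)
        return 1;

    char buf[BLOCK_SIZE];
    memset(buf, 0xAB, sizeof(buf));
    pmemblk_write(pbp, buf, 42);   /* analogous to an O_write of block 42 */
    pmemblk_read(pbp, buf, 42);    /* analogous to an O_read of block 42  */

    printf("pool holds %zu blocks of %d bytes\n", pmemblk_nblock(pbp), BLOCK_SIZE);
    pmemblk_close(pbp);
    return 0;
}
```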

4.2. Data Classification Module

The key theoretical hypothesis behind our data classification module is the strong correlation between program contexts (PCs) and data patterns [34,35]. In order to illustrate this correlation and the calculation of PCs, Figure 6 plots an example of the function routines of two applications running at the same time. All function routines sit above our AU layer, which was first mentioned as the red dotted box of Figure 5a.

Figure 6. Function routines of two applications.

In Figure 6, we can see that a single terminal function (funct4 or functx) can invoke a read or write function. However, knowing that a request comes from a single function is not enough to explore the detailed data flow in an application. For example, there are at least two functions, funct2 and funct3, invoking funct4. They make up two different function routines, 'funct1->funct2->funct4' and 'funct1->funct3->funct4'. During the running of an application, if the first routine, represented by funct2, is accessed frequently, while the second routine, represented by funct3, is accessed only once, then their access frequencies differ. Since a single funct4 cannot distinguish the two kinds of data access frequencies, some policies use all functions in each function routine, instead of a single function, to distinguish routines.

In applications, every function has its corresponding function address. After adding all function addresses in a routine, a program context (PC) indicator value is created to represent this routine. In our example, there are four routines in application1 and one routine in application2. All I/O operations are then indicated by different PCs. However, adding all function addresses might lead to new problems; for example, calculating an indicator in recursive functions results in unbearable overhead. Selecting a limited number of function candidates in each routine is therefore necessary. Our empirical function candidate number ranges from five to eight; five function candidates are enough to identify all routines in the experiments of this article.

Based on the theory above, the data detection process comprises two steps: first creating a unique PC indicator for every routine, and then determining the pattern of each routine by statistical analysis.

After these two steps, each routine pattern is connected with its unique PC indicator. The first step sums the limited set of function addresses. The second step, pattern prediction for a PC, is somewhat more complicated. In the 'Data classification module', UHNVM statistically analyzes all pages in every request. There are four pattern classification criteria in UHNVM. First, a group of consecutive, repeatedly accessed pages indicated by the same PC value is tagged as a looping pattern. Second, the sequential pattern means consecutive pages appear only once under one PC. Third, a group of repeatedly accessed pages under the same PC, but not sequential, is detected as a cluster-hot pattern. Finally, apart from the above three patterns, the remaining data are all considered as random pattern.
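As a rough illustration of the first step, the sketch below computes a PC indicator in user space by summing a bounded number of return addresses obtained from glibc's backtrace(); the five-frame limit follows the text above, while the wrapper name and the commented-out classification call are hypothetical.

```c
/* Sketch: derive a program-context (PC) indicator at an I/O call site by
 * summing the return addresses of a bounded number of stack frames. */
#include <execinfo.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define PC_FRAMES 5   /* five to eight frames are enough per the text */

static uintptr_t pc_indicator(void)
{
    void *frames[PC_FRAMES + 2];
    int n = backtrace(frames, PC_FRAMES + 2);
    uintptr_t pc = 0;

    /* Skip frame 0 (this helper) and sum up to PC_FRAMES caller addresses. */
    for (int i = 1; i < n && i <= PC_FRAMES; i++)
        pc += (uintptr_t)frames[i];
    return pc;
}

/* Hypothetical AU-layer wrapper: tag a read with its PC before handing the
 * request to the data classification module. */
ssize_t u_read_tagged(int fd, void *buf, size_t count)
{
    uintptr_t pc = pc_indicator();
    printf("read of %zu bytes issued from PC %#lx\n", count, (unsigned long)pc);
    /* classify_access(pc, fd, count);  -- per-PC pattern statistics         */
    /* return read(fd, buf, count);     -- the real wrapper forwards the I/O */
    (void)fd; (void)buf;
    return 0;
}
```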

4.3. An Example of the Data Distribution Module

In order to describe the 'Data distribution module' in detail, Figure 7 gives an example of a concrete read request covering nine pages of file2. As mentioned before, the pages of a file might be divided into several patterns; for example, the Redis AOF log file only merges and rewrites the multi-operation commands for the same data, not all log commands. In order to simulate this complicated situation, in this example we place its first four pages (page1 to page4) in a looping pattern, enclosed by a dotted line, which is suitable for Optane DIMM storage. The remaining pages are in a random pattern and should be stored in DRAM space.

Figure 7. Handling a read I/O request in UHNVM.

4.3.1. Basic Data Structures Illustration

In Figure 7, the two-level hash map is designed to show the mapping relationship between file pages and blocks in Optane. According to the hash map, UHNVM knows which pages are stored in which blocks of the Optane DIMM. The file node in the file list contains several members, such as file_inode and file_path. The page node in the page list has five variables, page_number (4 bytes), pmemblk_num (4 bytes), start_offset (12 bits), end_pos (12 bits), and dirty_flag (1 bit), which together occupy about 11 bytes.

The definitions of these five variables can be inferred from their names. The file_inode is used to search for and locate files. As we treat the Optane DIMM as a cache with 4 KB units, the members start_offset and end_pos only need 12 bits to cover the 4 KB space. With pmemblk_num, the 'Data distribution module' can locate the block position in the Optane cache. The dirty_flag decides whether a page is discarded or flushed back to the SSD/HDD when new pages arrive and the Optane space is full. After referencing the two-level hash map in step 1 of Figure 7, if a page hits in the hash map, the target data can be obtained from the Optane DIMM through step 2. The remaining pages can then be read from the host DRAM through step 3. Based on the block status in Optane, we build an LRU list to organize all empty blocks in the Optane DIMM. UHNVM can then tell which blocks are already occupied (O) and which blocks are empty (E) to accommodate incoming data.
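The page-node layout described above maps naturally onto C bitfields. The sketch below follows the field names and sizes stated in the text; the exact packing and the file_node members beyond those mentioned are our interpretation.

```c
/* Sketch of the per-page and per-file metadata in the two-level hash map.
 * Field names follow Section 4.3.1; packing details are our interpretation. */
#include <stdint.h>

struct page_node {
    uint32_t page_number;        /* 4 bytes: page index within the file        */
    uint32_t pmemblk_num;        /* 4 bytes: block index in the Optane pool    */
    uint32_t start_offset : 12;  /* 12 bits: start of valid data within 4 KB   */
    uint32_t end_pos      : 12;  /* 12 bits: end of valid data within 4 KB     */
    uint32_t dirty_flag   : 1;   /* 1 bit: flush to SSD/HDD on eviction if set */
};

struct file_node {
    uint64_t          file_inode;      /* used to search for and locate the file */
    char              file_path[256];  /* illustrative fixed-size path buffer    */
    struct page_node *pages;           /* second-level list/hash of cached pages */
};
```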

4.3.2. Cache Data Migration Process

In the request example of Figure 7, besides the direct accesses, there are also two special pages in file2: the fourth page and the seventh page. According to our assumption, the pages enclosed by the dotted line represent looping-pattern data. Therefore, page4 of file2 should be stored in the Optane DIMM; however, it is found in DRAM. Page7 of file2 should stay in DRAM because of its random pattern; however, according to our two-level hash mapping table, it currently resides in Optane. For page4, UHNVM should first read it from the host DRAM and then move it into the Optane DIMM; after the data migration, UHNVM updates the Optane hash map information. Conversely, for page7, UHNVM should first read it from the Optane DIMM and then expel it to the SSD/HDD. This eviction process needs to consider the dirty mark of the seventh page: if it is a clean page, it is simply discarded from Optane and the hash map is updated; otherwise, it must be flushed back to the SSD/HDD.
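A minimal sketch of these two migrations is given below, assuming the Optane pool is managed through libpmemblk as in Section 4.1; the helper names (promote, demote) and the commented hash-map calls are illustrative, not UHNVM's actual API.

```c
/* Sketch of the two migrations described above (illustrative helpers).
 * promote(): a looping-pattern page found in DRAM is copied into Optane.
 * demote():  a random-pattern page found in Optane is evicted; dirty pages
 *            are flushed back to SSD/HDD first, clean pages are dropped. */
#include <stdbool.h>
#include <unistd.h>
#include <libpmemblk.h>

#define BLOCK_SIZE 4096

void promote(PMEMblkpool *pbp, long long blk, const void *dram_page)
{
    pmemblk_write(pbp, dram_page, blk);  /* copy the DRAM page into an Optane block */
    /* hash_map_insert(file, page_no, blk);  -- record the new mapping */
}

void demote(PMEMblkpool *pbp, long long blk, int disk_fd, off_t disk_off, bool dirty)
{
    if (dirty) {
        char buf[BLOCK_SIZE];
        pmemblk_read(pbp, buf, blk);                  /* fetch the cached copy */
        pwrite(disk_fd, buf, BLOCK_SIZE, disk_off);   /* flush back to SSD/HDD */
    }
    /* hash_map_remove(blk); empty_list_push(blk);  -- the block becomes empty */
}
```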

4.3.3. Small Accesses (<4 KB) and File Feature Updates in UHNVM

Byte-addressability is available in the Optane DIMM; that is to say, UHNVM can access Optane at both byte and block granularity. However, because of Optane's pattern-dependent behavior and physical limitations, we prefer to avoid trivial small writes. A block size (4 KB) is the UHNVM read unit. A smaller read request (<4 KB) in Optane reads the whole block and then picks out only the requested part. A smaller write (<4 KB) fetches the whole block and modifies the targeted data via a read-modify-write operation. The last concern is the modification of file feature parameters, such as changing the file size and file pointer position on a write request, and the file pointer position on a read request. These file parameters should be updated in real time whenever a data operation happens in the Optane DIMM. UHNVM updates file parameters by calling the lseek and fallocate functions.
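The read-modify-write path for a sub-4 KB write can be sketched as follows, again assuming a libpmemblk-managed 4 KB block pool; the function name and error handling are illustrative.

```c
/* Sketch of the read-modify-write path for a small (<4 KB) write into an
 * Optane block, as described above. */
#include <string.h>
#include <libpmemblk.h>

#define BLOCK_SIZE 4096

int small_write(PMEMblkpool *pbp, long long blk,
                const void *data, size_t off, size_t len)
{
    char buf[BLOCK_SIZE];

    if (off + len > BLOCK_SIZE)
        return -1;                         /* request must fit in one block     */
    if (pmemblk_read(pbp, buf, blk) != 0)  /* 1. read the whole 4 KB block      */
        return -1;
    memcpy(buf + off, data, len);          /* 2. modify only the targeted bytes */
    return pmemblk_write(pbp, buf, blk);   /* 3. write the whole block back     */
}
```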

4.4. Function Blacklist of UHNVM

There are also some cases that are not suitable for the Optane DIMM, for which we build a blacklist of interfaces in UHNVM. For instance, the mmap interface is a direct address-mapping operation rather than a real data movement, so UHNVM should treat it as usual in host DRAM. When an application encounters the mmap interface, according to the blacklist, UHNVM prevents data related to the mmap interface from entering the Optane DIMM. If some of these data already reside in the Optane DIMM, UHNVM needs to move them from Optane to the SSD/HDD. That is to say, the data from the mmap interface can only be stored in the SSD/HDD as usual, regardless of their data patterns. We also allow developers who adopt the AU layer to add special file names to this blacklist if they need to bypass the Optane DIMM for some specialized files. This blacklist is saved as a configuration file when the host powers off. Note that all the other Optane metadata, the two-level hash map and the empty-block list, are also saved as configuration files. During host start-up (power-up), all these configuration files are read from the Optane DIMM to configure and set up Optane in UHNVM.

In conclusion, UHNVM is composed of two main components: the 'Data Classification module' to identify data patterns and the 'Data Distribution module' to distribute data to the Optane DIMM or DRAM through our modified interfaces. Besides the two main modules, there are also some key data structures to manage the data flow, such as the 'Two-level Hash Map', and a 'Function Blacklist' to bypass the Optane DIMM for specific data.

5. Evaluation

5.1. Experimental Setup

We have implemented UHNVM in the C programming language as a library, including the commonly used interfaces. Our computer is equipped with a 6-core Intel i5-8600 CPU running at 3.6 GHz and 32 GB of DRAM. The Intel Optane DIMM is configured in app direct mode with a capacity of 128 GB. In this design, we use two workloads to test the interfaces: one is the dd utility experiment and the other is the fio experiment. They are designed and analyzed in three aspects: (1) the importance of pattern recognition ability; (2) the influence of different data granularities; (3) the impact of different proportions of Optane DIMM and host DRAM. Through these experiments, we show that UHNVM is able to improve the performance of data-intensive applications by balancing the Optane DIMM and host DRAM according to our refined data patterns.

5.2. dd Utility Experiment

In order to quantify the benefit of our data pattern recognition ability to storage performance, we conduct the dd test. The dd application is a command-line utility for Unix and Unix-like operating systems whose main purpose is to convert and copy files. In this test, the ratio of looping data to random data is set to 1:10. The Optane DIMM and host DRAM sizes can be configured arbitrarily; here, the size ratio is set to 1:1, with 2 GB of Optane DIMM and 2 GB of host DRAM. First of all, we record the completion time in milliseconds when we run the dd utility with the same instructions. This setting creates the same data size, about 20 GB, for the three designs, 'Random NVM', 'No NVM', and 'Intelligent NVM'. Because the data size is the same, we can evaluate their performance by comparing completion times. As the 'No NVM' design uses the traditional configuration, we normalize the other two designs to 'No NVM'. This experiment is designed to prove the necessity of data pattern identification.

The importance of pattern recognition ability. The first test randomly stores data in the Optane DIMM or host DRAM while the dd application runs. In the second test, the data are stored only in the host DRAM; since no Optane DIMM is used, we configure the host DRAM to be 4 GB. In the third test, we evaluate the dd application with UHNVM, which can distinguish the looping data and store it in the Optane DIMM. The performance comparisons of these three tests, called 'Random NVM', 'No NVM', and 'Intelligent NVM', respectively, are shown in Figure 8. The results are normalized to 'No NVM'. The third test with UHNVM outperforms the second test without UHNVM, because UHNVM can distinguish and keep repeated data in the Optane DIMM without frequent transfers between host DRAM and the storage device. The performance of the first test is the worst. Compared to the second test, the time cost of the third test is reduced by up to 11%. This comparison proves the importance of pattern recognition ability, which promotes the intelligent use of the Optane DIMM.

Figure 8. Performance comparisons of different cache configurations.

Since UHNVM is composed of two parts, the Optane DIMM (NVM) and the traditional DRAM cache, we define the number of cache-hit blocks in the Optane DIMM as Nho and the number of cache-hit blocks in DRAM as Nhd. To calculate the entire cache hit ratio Hr, Nho and Nhd are summed and divided by the total number of blocks Nt, i.e., Hr = (Nho + Nhd)/Nt. From Figure 9, we can observe that the intelligent NVM outperforms the other two designs. The reason is that 'Intelligent NVM' has a better pattern-identification ability, which helps it make more rational decisions on cache pre-fetching and cache eviction.

Figure 9. Hit ratio comparisons of different cache configurations.

In summary, UHNVM can distinguish memory accesses with high data reuse and keep them in Optane without worrying about power failure. By contrast, if Optane accesses are dominated by low data reuse, such as in the random pattern, the Optane DIMM will fail to improve storage efficiency. This experiment has proved the importance of pattern recognition.

5.3. fio Experiment

In addition to the dd workload, we also evaluate the performance of UHNVM under the fio application to explore optimal configuration parameters for the Optane DIMM. fio is a representative tool that can spawn a number of processes to perform particular I/O operations specified by the user; it is flexible enough to allow us to run it with different interfaces, different data patterns, and different data granularities. According to the different fio engines, we choose two pairs of interfaces replaced by UHNVM, Write/Read and Pwrite/Pread. There are two related configurations of the Optane DIMM that need to be discussed, the data granularity and the ratio of Optane DIMM to host DRAM.

The influence of different data granularities. To understand how different data granularities affect Optane DIMM performance, we vary the granularity setting from 1 KB to 8 KB while the Optane DIMM and host DRAM sizes are fixed. Our test workbenches consist of a looping-simulated thread (repeated sequential thread) and a random thread. In Figure 10, the two pairs of interfaces are compared; the results are all normalized to the 4 KB unit size. We find that the larger the access unit, the faster the performance. For the Write/Pwrite interfaces, UHNVM run with an 8 KB unit increases the speed by 4.1× and 4×, respectively, compared to the speed at 1 KB. For the Read/Pread interfaces, UHNVM run with an 8 KB unit increases the speed by 7.3× and 7.1×, respectively, compared with the speed at 1 KB. From the Write/Pwrite results in Figure 10b,d, we can also see that the write performance does not increase linearly like the Read/Pread results shown in Figure 10a,c. The reason is that the Optane DIMM in our design is managed in 4 KB units (one block size). An Optane DIMM write of less than 4 KB consists of reading a block from the Optane DIMM into a buffer, modifying the block data in the buffer, and writing the modified block back into a new Optane DIMM block, which we summarize as a read-modify-write operation. When the write granularity is increased, there is no obvious speed improvement because the read-modify-write action takes up most of an entire write operation. However, the Optane DIMM read itself is simple and accounts for only a small proportion of an entire read operation. Therefore, increasing the read granularity can significantly enhance the read speed. This test proves that the data granularity for the Optane DIMM is a key factor in NVM deployment.


Figure 10. Performance under different unit sizes: (a) Read interface; (b) Write interface; (c) Pread interface; (d) Pwrite interface.

The impact of the different proportions of Optane DIMM and host DRAM. In order to explain the influence of the proportion between Optane DIMM and host DRAM, we dynamically adjust the proportion of Optane DIMM to host DRAM, ranging from 0.1 to 0.5. The testing workbench is collected by running a simulated looping thread and two random threads. In this workbench, the looping part takes up about 30% and the remaining random data account for about 70%.

Figure 11 shows the performance changes under the different configurations on the x-axis. As shown in Figure 11, only when the Optane DIMM to host DRAM proportion is configured as 0.3, matching the looping-data proportion (0.3) in the workload, does the experiment perform best. That is why all the results are normalized to the result at the 0.3 position of the x-axis. Our data distribution strategy only allocates Optane DIMM to designated patterns, such as the looping and sequential patterns. In our design, even when the Optane DIMM has free space, it is still not used for random data, especially for random writes. Therefore, for UHNVM, the size configuration between the Optane DIMM and host DRAM had better be set in advance through empirical evaluation when pursuing the best performance.

Figure 11. Time costs under different proportions of Optane DIMM and host DRAM: (a) Pread interface; (b) Pwrite interface.

For the Pread interface, when not given enough Optane DIMM space, the looping data will be transferred between the storage device and the Optane DIMM, which results in greater overheads. For example, in Figure 11a, when we test at the 0.1 position of the x-axis, looping data have to be read from the storage device and then written into the Optane DIMM frequently. At the 0.4 position of the x-axis, there is not enough host DRAM space, so data have to be transferred between the storage device and host DRAM frequently, because the Optane DIMM only serves the looping pattern even when it has idle space, which leads to worse performance. The performance of the Pwrite interface shown in Figure 11b follows the same reasoning as the Pread interface. In Figure 12b, the hit ratio increases gradually with the increasing Optane DIMM proportion, because UHNVM prefers to store in the Optane DIMM those repeated blocks that will be accessed in the future. This strategy increases the cache hit ratio. In addition, by comparing Figure 12a,b, we can draw another conclusion: the cache hit ratio is not the only index that benefits I/O performance.

Figure 12. Cache hit ratios under different proportions of Optane DIMM and host DRAM: (a) Pread interface; (b) Pwrite interface.

Cache hit ratio comparisons among existing designs. In order to show the efficiency of UHNVM, three existing designs, LRU, CFLRU, and I/O-Cache, are compared with UHNVM. We run the dd utility to create a complex workload composed of looping, sequential, and random patterns, and set the proportion of Optane DIMM to DRAM as 1:1. In Figure 13, UHNVM performs better and more stably than the other designs. Note that I/O-Cache is more appropriate for a sequential-dominant workload, not a workload with complicated patterns. However, UHNVM is able to identify data patterns in advance, even with a small cache size. Furthermore, as mentioned earlier, UHNVM has updated the traditional linear memory architecture into a parallel architecture, which avoids an obvious cache overhead, namely frequent data migration among NVM, DRAM, and SSD/HDD. The existing LRU, CFLRU, and I/O-Cache strategies belong to the traditional linear architecture.

Figure 13. Hit ratios under different strategies (x-axis: cache size in GB).

Finally, UHNVM has proved its effectiveness in storage systems and Optane DIMM deployment. In order to further maximize the benefit of UHNVM, it is better for developers to dynamically adjust the two configuration parameters discussed above according to the workload's patterns.

6. Conclusions

In this paper, we propose a universal cache management technique that is able to intelligently classify and place data of different patterns into NVM or host DRAM without requiring an understanding of application codes. This design is flexible and automatic enough to help users deploy NVM devices. In order to better classify data patterns, we move the program context technique from the conventional kernel space to user space. The results show that UHNVM not only relieves the host buffer burden economically, but also improves the performance of storage systems.

Author Contributions: Conceptualization, X.L. and Z.Z.; methodology, X.L. and Z.Z.; software, X.L.; validation, X.L.; investigation, X.L. and Z.Z.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

References 1. Badam, A.; Pai, V.S. SSDAlloc: Hybrid RAM/SSD Memory Allocation Made Easy. In Proceedings of the Symposium on Networked Systems Design and Implementation, Boston, MA, USA, 30 March–1 April 2011. 2. Cheong, W.; Yoon, C.; Woo, S.; Han, K.; Jeong, J. A flash memory controller for 15 µs ultra-low-latency SSD using high-speed 3D NAND flash with 3 µs read time. In Proceedings of the IEEE International Solid—State Circuits Conference, San Francisco, CA, USA, 5–9 February 2018. 3. Huang, J.; Badam, A.; Qureshi, M.K.; Schwan, K. Unified address translation for memory-mapped ssds with flashmap. In Proceedings of the Annual International Symposium on Computer Architecture, Portland, OR, USA, 13–17 June 2015; pp. 580–591. 4. Yang, J.; Kim, J.; Hoseinzadeh, M.; Izraelevitz, J.; Swanson, S. An empirical guide to the behavior and use of scalable persistent memory. In Proceedings of the Conference on File and Storage Technologies, Santa Clara, CA, USA, 24–27 February 2020; pp. 169–182. 5. Grupp, L.M.; Davis, J.D.; Swanson, S. The bleak future of NAND flash memory. In Proceedings of the Conference on File and Storage Technologies, San Jose, CA, USA, 15–17 February 2012; Volume 7, pp. 10–12. 6. Marathe, V.J.; Seltzer, M.; Byan, S.; Harris, T. Persistent memcached: Bringing legacy code to byte-addressable persistent memory. In Proceedings of the Workshop on Hot Topics in Storage and File Systems, Santa Clara, CA, USA, 10–11 July 2017. 7. Oukid, I.; Lehner, W.; Kissinger, T.; Willhalm, T.; Bumbulis, P. Instant Recovery for Main Memory Databases. In Proceedings of the Annual Conference on Innovative Data Systems Research, Asilomar, CA, USA, 13–16 January 2015. 8. Pelley, S.; Wenisch, T.F.; Gold, B.T.; Bridge, B. Storage management in the NVRAM era. In Proceedings of the 40th International Conference on Very Large Data Bases, Hangzhou, China, 1–5 September 2014. 9. Viglas, S.D. Write-limited sorts and joins for persistent memory. Proc. Very Large Data Bases Endow. 2014, 7, 413–424. [CrossRef] 10. Intel. Persistent Memory Development Kit (PMDK). Available online: https://pmem.io/pmdk/(accessed on 1 May 2021). 11. Gniady, C.; Butt, A.R.; Hu, Y.C. Program-counter-based pattern classification in buffer caching. In Proceedings of the Symposium on Operating Systems Design and Implementation, San Francisco, CA, USA, 6–8 December 2004; Volume 4, p. 27. 12. Zhou, F.; von Behren, R.; Brewer, E.A. Program Context Specific Buffer Caching with AMP. 2005. Available online: http: //zhoufeng.net/eng/papers/amp-tr.pdf (accessed on 1 May 2021). 13. Kwon, Y.; Fingler, H.; Hunt, T.; Peter, S.; Witchel, E.; Anderson, T. Strata: A cross media file system. In Proceedings of the Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; pp. 460–477. 14. Moraru, I.; Andersen, D.G.; Kaminsky, M.; Tolia, N.; Ranganathan, P.; Binkert, N. Consistent, durable, and safe memory management for byte-addressable non volatile main memory. In Proceedings of the First ACM Special Interest Group on Operating Systems Conference on Timely Results in Operating Systems, Farmington, PA, USA, 3–6 November 2013; pp. 1–17. 15. Seo, J.; Kim, W.H.; Baek, W.; Nam, B.; Noh, S.H. Failure-atomic slotted paging for persistent memory. ACM Spec. Interest Group Comput. Archit. News 2017, 45, 91–104. [CrossRef] 16. Volos, H.; Magalhaes, G.; Cherkasova, L.; Li, J. Quartz: A lightweight performance emulator for persistent memory software. In Proceedings of the Annual Middleware Conference, Vancouver, BC, Canada, 7–11 December 2015; pp. 
37–49. 17. Dong, M.; Yu, Q.; Zhou, X.; Hong, Y.; Chen, H.; Zang, B. Rethinking benchmarking for nvm-based file systems. In Proceedings of the ACM Special Interest Group on Operating Systems Asia-Pacific Workshop on Systems, Hong Kong, 4–5 August 2016; pp. 1–7. 18. Duan, Z.; Liu, H.; Liao, X.; Hai, J. HME: A lightweight emulator for hybrid memory. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, Dresden, Germany, 19–23 March 2018. 19. Boehm, H.J.; Chakrabarti, D.R.; Bhandari, K. Atlas: Leveraging Locks for Non-volatile Memory Consistency. ACM Sigplan Not. Mon. Publ. Spec. Interest Group Program. Lang. 2014, 19, 433–452.

20. Lu, Y.; Shu, J.; Chen, Y.; Li, T. Octopus: An rdma-enabled distributed persistent memory file system. In Proceedings of the Annual Technical Conference, Santa Clara, CA, USA, 12–14 July 2017; pp. 773–785. 21. Nawab, F.; Izraelevitz, J.; Kelly, T.; Morrey III, C.B.; Chakrabarti, D.R.; Scott, M.L. Dalí: A periodically persistent hash map. In Proceedings of the International Symposium on Distributed Computing, Washington, DC, USA, 26–30 June 2017; Volume 91. 22. Yang, J.; Izraelevitz, J.; Swanson, S. Orion: A distributed file system for non-volatile main memory and RDMA-capable networks. In Proceedings of the Conference on File and Storage Technologies, Boston, MA, USA, 25–28 February 2019; pp. 221–234. 23. Dong, M.; Chen, H. Soft updates made simple and fast on non-volatile memory. In Proceedings of the Annual Technical Conference, Santa Clara, CA, USA, 12–14 July 2017; pp. 719–731. 24. Zhu, G.; Han, J.; Lee, S.; Son, Y. An Empirical Evaluation of NVM-aware File Systems on Intel Optane DC Persistent Memory Modules. In Proceedings of the International Conference on Information Networking, Jeju Island, Korea, 13–16 January 2021; pp. 559–564. 25. Abulila, A.; Mailthody, V.S.; Qureshi, Z.; Huang, J.; Kim, N.S.; Xiong, J.; mei Hwu, W. FlatFlash: Exploiting the Byte-Accessibility of SSDs within a Unified Memory-Storage Hierarchy. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Providence, RI, USA, 13–17 April 2019. 26. Bae, D.H.; Jo, I.; Choi, Y.A.; Hwang, J.Y.; Cho, S.; Lee, D.G.; Jeong, J. 2B-SSD: The Case for Dual, Byte- and Block-Addressable Solid-State Drives. In Proceedings of the the ACM/IEEE International Symposium on Computer Architecture, Los Angeles, CA, USA, 1–6 June 2018. 27. Wu, K.; Ober, F.; Hamlin, S.; Li, D. Early Evaluation of Intel Optane Non-Volatile Memory with HPC I/O Workloads. arXiv 2017, arXiv:1708.02199. 28. Arulraj, J. BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory. Proc. Very Large Data Bases Endow. 2018, 11, 553–565. [CrossRef] 29. Chatzistergiou, A.; Cintra, M.; Viglas, S.D. REWIND: Recovery Write-Ahead System for In-Memory Non-Volatile Data-Structures. Proc. VLDB Endow. 2015, 8, 497–508. [CrossRef] 30. Chen, S.; Qin, J. Persistent B + -Trees in Non-Volatile Main Memory. In the Proceedings of the VLDB Endowment, Kohala Coast, HI, USA, 31 August–4 September 2015; pp. 786–797 31. Ipek, E.; Condit, J.; Lee, B.; Nightingale, E.B.; Burger, D.; Frost, C.; Coetzee, D. Better I/O Through Byte-Addressable, Persistent Memory. In the Proceedings of the ACM SIGOPS 22nd symposium on Operating Systems Principles, Big Sky, MT, USA, 11–14 October 2009; pp. 133–146. 32. Volos, H.; Nalli, S.; Panneerselvam, S.; Varadarajan, V.; Swift, M.M. Aerie: Flexible File-System Interfaces to Storage-Class Memory. In Proceedings of the Ninth European Conference on Computer Systems, Amsterdam, The Netherlands, 14–16 April 2014. 33. Xu, J.; Kim, J.; Memaripour, A.; Swanson, S. Finding and fixing performance pathologies in persistent memory software stacks. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, Atlanta, GA, USA, 2–6 April 2016; pp. 427–493. 34. Kim, J.M.; Choi, J.; Kim, J.; Noh, S.H.; Min, S.L.; Cho, Y.; Kim, C.S. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. 
In Proceedings of the Conference on Symposium on Design & Implementation, San Diego, CA , USA, 22–25 October 2000. 35. Kim, T.; Hong, D.; Hahn, S.S.; Chun, M.; Lee, S.; Hwang, J.; Lee, J.; Kim, J. Fully automatic stream management for multi- streamed SSDs using program contexts. In Proceedings of the Conference on File and Storage Technologies, Boston, MA, USA, 25–28 February 2019; pp. 295–308.