Towards an Unwritten Contract of Intel Optane SSD Kan Wu, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau University of Wisconsin-Madison

Towards an Unwritten Contract of Intel Optane SSD Kan Wu, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau University of Wisconsin-Madison Abstract Rule Impact Metric Cause Access with Low Request Scale 11x Latency SSD Controller/Interconnect Random Access is OK 7x (vs. Flash) Latency & Throughput 3D XPoint & Controller New non-volatile memory technologies offer unprece- Avoid Crowded Accesses 4.6x Latency & Throughput SSD Controller/Interconnect dented performance levels for persistent storage. However, to Control Overall Load 5x (vs. Flash) Latency 3D XPoint Memory Avoid Tiny Accesses 5x Throughput SSD Controller/Interconnect exploit their full potential, a deeper performance characteri- Issue 4KB Aligned Requests 1.2x Latency SSD Controller/Interconnect Forget Garbage Collection 15x (vs. Flash) Sustained Throughput 3D XPoint & Controller zation of such devices is required. In this paper, we analyze Table 1: Performance Impact of Rule Violations a NVM-based block device – the Intel Optane SSD – and formalize an “unwritten contract” of the Optane SSD. We rules that are critical for Optane SSD users. First, to obtain show that violating this contract can result in 11x worse read low latency, users should issue small requests ( > 4 KB) latency and limited throughput (only 20% of peak bandwidth) and keep a small number of outstanding IOs (Access with regardless of parallelism. We present that this contract is rel- Low Request Scale rule). Second, different from HDD/Flash evant to features of 3D XPoint memory and Intel Optane SSDs, Optane SSD clients should not consider sequential SSD’s controller/interconnect design. Finally, we discuss the workloads special (Random Access is OK rule). Third, to implications of the contract. avoid contention among requests, clients should not issue parallel accesses to a single chunk (4KB) (Avoid Crowded 1 Introduction Accesses rule). Fourth, to achieve optimal latency, the user New NVM technologies provide unprecedented performance needs to control the overall load of both reads and writes levels for persistent storage. Such devices offer significantly (Control Overall Load rule). Fifth, to exploit the bandwidth lower latency than Flash-based SSD and can be a cost- of Optane SSD, clients should never issue requests less than effective alternative to DRAM. One excellent example is 4KB (Avoid Tiny Accesses rule). Sixth, to get the best la- Intel’s 3D XPoint memory [1,25] from Intel and Micron, avail- tency, requests issued to Optane SSD should align to eight able on the open market under the brand name Optane [11]. sectors (Issue 4KB Aligned Requests rule). Finally, when It is available in various form factors, including Optane mem- serving sustained workloads, there is no cost of garbage col- ory [9] (a caching layer between DRAM and block device), lection in Optane SSD (Forget Garbage Collection rule). Optane SSD [10] (a block device), and Optane DC Persis- Overall, these rules are relevant to features of 3D XPoint tent Memory [8]. Among these devices, the Optane SSD is memory and Optane SSD’s controller/interconnect design. currently the most cost-effective and widely available option. The unwritten contract provides numerous implications for Optane SSD offers numerous opportunities for applications. systems and applications. According to the Access with Low For example, Intel’s Memory Direct Technology (IMDT) [7] Request Scale rule and the Control Overall Load rule, the enables the use of Optane SSD as a DRAM alternative. Use design of heterogeneous storage systems including Optane cases of IMDT/Optane SSD include Memcached [4], Re- SSD and Flash SSD must be carefully considered. Many dis [5], and Spark [6]. Evaluations of those scenarios demon- rules (e.g., Random Access is OK) present opportunities and strate the potential role of Optane SSD as a cost-effective new challenges to external data structure design for Optane alternative of DRAM. Optane SSDs are also deployed to sup- SSD. Finally, the Random Access is OK rule may enable new port key workloads in Facebook [22, 23], both as a caching applications on Optane SSD that don’t work well on existing layer between DRAM and Flash SSD for RocksDB and for technologies. crucial workloads such as machine learning. According to The rest of this paper is structured as follows. We describe Eisenman et al. [23], Facebook stores embeddings of trained the unwritten contract and explain the insights for each rule neural networks on Optane SSDs. in Section2. We provide the implications of the unwritten Using new technology effectively requires understanding contract in Section3, conclude in Section4 and point to its performance and reliability characteristics. For traditional potential research directions in Section5. devices, such as hard-drives and Flash-based SSDs, these are well known [16,17,20,21,26,30,33]. However, for new 2 The Unwritten Contract of Optane SSD devices like the Optane SSD, much remains unclear, and is In this section, we define the rules of the unwrit- thus the focus of our work. ten contract. We offer experiments (available at In terms of immediate performance, we summarize six https://github.com/sherlockwu/OptaneBench) to infer Controller optane better optane better 1.7 1.0 -0.1 -0.2 -0.2 -0.3 -0.3 -0.2 -0.1 -0.2 -0.2 -0.2 -0.2 -0.2 256 5.0 256 5.0 3D XPoint 3D XPoint 3D XPoint 1.7 1.8 0.6 -0.3 -0.3 -0.3 -0.4 -0.2 -0.1 -0.3 -0.2 -0.2 -0.2 -0.2 Memory Die Memory Die Memory Die 128 128 2.2 2.3 0.9 0.1 -0.3 -0.4 -0.3 2.5 0.0 -0.2 -0.3 -0.2 -0.2 -0.2 -0.2 2.5 64 64 3.2 3.6 1.4 0.4 -0.1 -0.2 -0.2 0.1 -0.1 -0.2 -0.2 -0.2 -0.2 -0.2 3D XPoint 3D XPoint 3D XPoint 32 32 0.0 0.0 … 3.9 4.3 2.4 1.1 0.4 0.2 0.3 0.3 0.0 -0.1 -0.1 -0.2 -0.1 -0.2 Memory Die Memory Die Memory Die 16 16 8 − 8 − Request Size (KB) 5.2 4.9 4.7 2.0 0.7 0.2 0.0 2.5 Request Size (KB) 0.5 0.3 -0.0 -0.5 -0.4 -0.4 -0.4 2.5 … … … 4 6.2 5.7 5.9 4.3 1.9 0.7 0.2 4 0.7 0.3 0.2 0.2 -0.0 -0.1 -0.1 −5.0 −5.0 3D XPoint 3D XPoint 3D XPoint 1 7.4 6.7 6.9 4.5 1.9 0.7 0.2 1 0.2 0.1 0.0 0.2 0.1 0.1 0.1 Memory Die Memory Die Memory Die 1 2 4 8 16 32 64 flash better 1 2 4 8 16 32 64 flash better Queue Depth Queue Depth Channel #7 Channel #2 Channel #1 (a) Read (b) Write Figure 2: RAID-like Architecture Figure 1: Avg Latency of random workloads, Optane vs. Flash 1000 3000 the rules. We summarize the impact and cause of each rule 2500 QD=16 800 in Table1. We begin with six rules related to immediate QD=8 QD=16 2000 performance like latency and throughput. We then present 600 1500 rules for sustainable performance. QD=8 400 QD=4 1000 Throughput (MB/s) QD=4 Throughput(MB/s) 200 QD=2 500 2.1 Access with Low Request Scale QD=2 QD=1 QD=1 0 0 The Optane SSD is based on 3D XPoint memory which is 0 64 128 192 256 0 64 128 192 256 said to provide up to 1,000 times lower latency than NAND Stride (chunk=8sectors) Stride (chunk=8sectors) Flash [2]. Does 3D XPoint memory always lead to better (a) Flash SSD (b) Optane 905P SSD performance for Optane SSD compared to Flash SSD? We Figure 3: Detecting Interleaving Degree answer this question with our first rule: to obtain low latency, again better in the opposite cases; however, the difference for Optane SSD users should issue small requests and maintain writes is not as high as for reads (Optane is up to 1.7x faster a small number of outstanding IOs. This rule is needed not than Flash). This result is due to Flash SSDs log-structured only to extract low latency but also enough to exploit the full layout and buffering optimizations. Flash SSD outperforms bandwidth of the Optane SSD. Optane SSD when request size> 4KB and QD> 2, which We uncovered this rule when quantitatively comparing includes most of our tested workloads. Intel Optane SSD 905P (960GB) with a “high-end” Flash SSD: Samsung 970 Pro (1TB) [12]. From our experiments, The device’s internal parallelism dictates its behavior when we found that Optane SSD does not show improvement for serving workloads with high request scale. We are thus mo- workloads with many outstanding requests. tivated to uncover the internals of the Optane SSD. Like Flash SSD, Optane SSD utilizes a RAID-like organization We compare Optane and Flash SSDs with random read- of memory dies (Figure2). Through a fine-grained experi- only and write-only workloads. Each workload has two vari- ment [18, 19] we examine a critical parameter for internal ables: request size and queue depth (QD) (or number of in- parallelism: the interleaving degree, or number of channels. flight I/Os). Figure1 compares the two devices in terms of We maintain a read stream with stride S from the devices, average access latency. The temperature T in each rectangle where S is the distance between two consecutive chunks (a shows the scaled difference between the latencies of the two chunk is 4KB or 8 sectors as shown in Section 2.6).

Towards an Unwritten Contract of Intel Optane SSD Kan Wu, Andrea Arpaci-Dusseau and Remzi Arpaci-Dusseau University of Wisconsin-Madison

Analyzing the Performance of Intel Optane DC Persistent Memory in Storage Over App Direct Mode

Initial Experience with 3D Xpoint Main Memory

After the 3D Xpoint Take-Off, Emerging NVM Keeps Growing1

Transforming Business with Data

Arxiv:1908.03583V1 [Cs.DC] 9 Aug 2019 1 Introduction

Intel® Optane™ DC Persistent Memory – Telecom Use Case Workloads

From Flash to 3D Xpoint: Performance Bottlenecks and Potentials in Rocksdb with Storage Evolution

Introducing 3D Xpoint™ Memory

Nvme Technology – Powering the Connected Universe

3D X Point Technology

Analyzing the Performance of Intel Optane DC Persistent Memory in App Direct Mode in Lenovo Thinksystem Servers

Privacy Protection Cache Policy on Hybrid Main Memory