Scalable Parallel Flash Firmware for Many-core Architectures

Jie Zhang1, Miryeong Kwon1, Michael Swift2, Myoungsoo Jung1 Architecture and Memory Systems Laboratory, Korea Advanced Institute of Science and Technology (KAIST)1, University of Wisconsin at Madison2 http://camelab.org

Abstract

NVMe is designed to unshackle flash from a traditional storage bus by allowing hosts to employ many threads to achieve higher bandwidth. While NVMe enables users to fully exploit all levels of parallelism offered by modern SSDs, current firmware designs are not scalable and have difficulty in handling a large number of I/O requests in parallel due to their limited computation power and many hardware contentions.

We propose DeepFlash, a novel manycore-based storage platform that can process more than a million I/O requests in a second (1MIOPS) while hiding the long latencies imposed by its internal flash media. Inspired by a parallel data analysis system, we design the firmware based on a many-to-many threading model that can be scaled horizontally. The proposed DeepFlash can extract the maximum performance of the underlying flash memory complex by concurrently executing multiple firmware components across many cores within the device. To show its extreme parallel scalability, we implement DeepFlash on a many-core prototype processor that employs dozens of lightweight cores, analyze new challenges from parallel I/O processing and address the challenges by applying concurrency-aware optimizations. Our comprehensive evaluation reveals that DeepFlash can serve around 4.5 GB/s, while minimizing the CPU demand on microbenchmarks and real server workloads.

1 Introduction

Solid State Disks (SSDs) are extensively used as caches, databases, and boot drives in diverse computing domains [37, 42, 47, 60, 74]. The organizations of modern SSDs and the flash packages therein have undergone significant technology shifts [11, 32, 39, 56, 72]. In the meantime, new storage interfaces have been proposed to reduce the overheads of the host storage stack, thereby improving the storage-level bandwidth. Specifically, NVM Express (NVMe) is designed to unshackle flash from a traditional storage interface and enable users to take full advantage of all levels of SSD internal parallelism [13, 14, 54, 71]. For example, it provides streamlined commands and up to 64K deep queues, each with up to 64K entries. There is massive parallelism in the backend, where requests are sent to tens or hundreds of flash packages. This enables assigning queues to different applications; multiple deep NVMe queues allow the host to employ many threads, thereby maximizing storage utilization.

An SSD should handle many concurrent requests with its massive internal parallelism [12, 31, 33, 34, 61]. However, it is difficult for a single storage device to manage the tremendous number of I/O requests arriving in parallel over many NVMe queues. Since highly parallel I/O services require simultaneously performing many SSD-internal tasks, such as address translation, multi-queue processing, and flash scheduling, the SSD needs multiple cores and a parallel implementation for higher throughput. In addition, as the tasks inside the SSD increase, the SSD must address several scalability challenges brought by garbage collection, memory/storage contention and data consistency management when processing I/O requests in parallel. These new challenges can introduce high computation loads, making it hard to satisfy the performance demands of diverse data-centric systems. Thus, high-performance SSDs require not only a powerful CPU and controller but also efficient flash firmware.

We propose DeepFlash, a manycore-based NVMe SSD platform that can process more than one million I/O requests within a second (1MIOPS) while minimizing the requirements on internal resources. To this end, we design a new flash firmware model, which can extract the maximum performance of hundreds of flash packages by concurrently executing firmware components atop a manycore processor. The layered flash firmware in many SSD technologies handles the internal datapath from PCIe to the physical flash interfaces as a single heavy task [66, 76]. In contrast, DeepFlash employs a many-to-many threading model, which multiplexes any number of threads onto any number of cores in firmware. Specifically, we analyze the key functions of the layered flash firmware and decompose them into multiple modules, each of which is scaled independently to run across many cores. Based on this analysis, this work classifies the modules into a queue-gather stage, a trans-apply stage, and a flash-scatter stage, inspired by a parallel data analysis system [67]. Multiple threads on the queue-gather stage handle NVMe queues, while each thread on the flash-scatter stage handles many flash devices on a channel bus. The address translation between logical block addresses and physical page numbers is simultaneously performed by many threads at the trans-apply stage. As each stage can have a different number of threads, contention between the threads for shared hardware resources and structures, such as the mapping table, metadata and memory management structures, can arise.

Integrating many cores into the scalable flash firmware design also introduces data consistency, coherence and hazard issues. We analyze the new challenges arising from concurrency, and address them by applying concurrency-aware optimization techniques to each stage, such as parallel queue processing, cache bypassing and background work for time-consuming SSD-internal tasks.

We evaluate a real system with our hardware platform that implements DeepFlash and internally emulates low-level flash media in a timing-accurate manner. Our evaluation results show that DeepFlash successfully provides more than 1MIOPS with a dozen simple low-power cores for all reads and writes with sequential and random access patterns. In addition, DeepFlash reaches 4.5 GB/s (above 1MIOPS), on average, under the execution of diverse real server workloads. The main contributions of this work are summarized as below:

• Many-to-many threading firmware. We identify scalability and parallelism opportunities for high-performance flash firmware. Our many-to-many threading model allows future manycore-based SSDs to dynamically shift their computing power based on different workload demands without any hardware modification. DeepFlash splits all functions of the existing layered firmware architecture into three stages, each with one or more thread groups. Different thread groups can communicate with each other over an on-chip interconnection network within the target SSD.

• Parallel NVMe queue management. While employing many NVMe queues allows the SSD to handle many I/O requests through PCIe communication, it is hard to coordinate simultaneous queue accesses from many cores. DeepFlash dynamically allocates the cores to process NVMe queues rather than statically assigning one core per queue. Thus, a single queue is serviced by multiple cores, and a single core can service multiple queues, which can deliver full bandwidth for both balanced and unbalanced NVMe I/O workloads. We show that this parallel NVMe queue processing exceeds the performance of the static core-per-queue allocation by 6x, on average, when only a few queues are in use. DeepFlash also balances core utilization over computing resources.

• Efficient I/O processing. We increase the parallel scalability of the many-to-many threading model by employing non-blocking communication mechanisms. We also apply simple but effective lock and address randomization methods, which can distribute incoming I/O requests across multiple address translators and flash packages. The proposed method minimizes the number of hardware cores needed to achieve 1MIOPS.

Putting it all together, DeepFlash improves bandwidth by 3.4× while significantly reducing CPU requirements, compared to conventional firmware. Our DeepFlash requires only a dozen lightweight in-order cores to deliver 1MIOPS.

2 Background

2.1 High Performance NVMe SSDs

Figure 1: Overall architecture of an NVMe SSD.

Baseline. Figure 1 shows an overview of a high-performance SSD architecture that Marvell recently published [43]. The host connects to the underlying SSD through four Gen 3.0 PCIe lanes (4 GB/s) and a PCIe controller. The SSD architecture employs three embedded processors, each employing two cores [27], which are connected to an internal DRAM controller via a processor interconnect. The SSD employs several special-purpose processing elements, including a low-density parity-check (LDPC) sequencer, a data transfer (DMA) engine, and scratch-pad memory for metadata management. All these multi-core processors, controllers, and components are connected to a flash complex that connects to eight channels, each connecting to eight packages, via a flash physical layer (PHY). We select this multicore architecture description as our reference and extend it, since it is the only documented NVMe storage architecture that employs multiple cores at this juncture, but other commercially available SSDs also employ a similar multi-core firmware controller [38, 50, 59].

Future architecture. The performance offered by these devices is by far below 1MIOPS. For higher bandwidth, a future device can extend its storage and processor complexes with more flash packages and cores, respectively, which are highlighted by red in the figure. The bandwidth of each flash package is in practice tens of MB/s, and thus it requires employing more flash packages/channels, thereby increasing I/O parallelism. This flash-side extension raises several architectural issues. First, the firmware will make frequent SSD-internal memory accesses that stress the processor complex. Even though the PCIe core, channel and other memory control logic may be implemented, metadata information increases for the extension, and its access frequency gets higher to achieve 1MIOPS. In addition, DRAM accesses for I/O buffering can be a critical bottleneck in hiding flash's long latency. Simply making cores faster may not be sufficient because the processors will suffer from frequent stalls due to less locality and contention at memory. This, in turn, makes each core's bandwidth lower, which should be addressed with higher parallelism on the computation parts. We will explain the current architecture and show why it is non-scalable in Section 3.
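As a rough illustration of why the flash complex must grow so much, consider a back-of-the-envelope sizing sketch. The per-package bandwidth value below is our own assumption chosen to be "tens of MB/s" as stated above, not a number reported in this paper:

```latex
% Illustrative sizing only; ~40 MB/s per package is an assumed value.
\[
  1\,\mathrm{MIOPS} \times 4\,\mathrm{KB} = 4\,\mathrm{GB/s},
  \qquad
  \frac{4\,\mathrm{GB/s}}{40\,\mathrm{MB/s\ per\ package}} \approx 100\ \text{packages}.
\]
```

Sustaining that many concurrently active packages is what drives both the metadata growth and the internal memory traffic discussed above.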



Figure 2: Datapath from PCIe to flash and overview of flash firmware. (a) NVMe SSD datapath. (b) Flash firmware.

Datapath from PCIe to flash. To understand the source of the scalability problems, one needs to be aware of the internal datapath of NVMe SSDs and the details of its management. Figure 2a illustrates the internal datapath between PCIe and NV-DDR [7, 53], which is managed by the NVMe [16] and ONFi [69] protocols, respectively. NVMe employs multiple device-side doorbell registers, which are designed to minimize handshaking overheads. Thus, to issue an I/O request, applications submit an NVMe command to a submission queue (SQ) (①) and notify the SSD of the request arrival by writing to the doorbell corresponding to the queue (②). After fetching a host request from the queue (③), the flash firmware, known as the flash translation layer (FTL), parses the I/O operation, metadata, and data location of the target command (④). The FTL then translates the host's logical block address (LBA) to a physical page address (PPA) (⑤). In the meantime, the FTL also orchestrates data transfers. Once the address translation is completed, the FTL moves the data based on the I/O timing constraints defined by ONFi (⑥). A completion queue (CQ) is always paired with an SQ in the NVMe protocol, and the FTL writes a result to the CQ and updates the tail doorbell corresponding to the host request. The FTL notifies the host of the queue completion (⑦) by generating a message-signaled interrupt (MSI) (⑧). The host can then finish the I/O process (⑨) and acknowledge the MSI by writing the head doorbell associated with the original request (⑩).

Memory spaces. While the FTL manages the logical block space and the physical flash space, it also handles the SSD's internal memory space and accesses to the host system memory space (cf. Figure 2b). SSDs manage internal memory for caching incoming I/O requests and the corresponding data. Similarly, the FTL uses the internal memory for metadata and NVMe queue management (e.g., SQs/CQs). In addition, the FTL requires access to the host system memory space to transfer the actual data contents over PCIe. Unfortunately, a layered firmware design accesses memory without any constraint or protection mechanism, which can make the data inconsistent and incoherent under simultaneous accesses. However, computing resources with more parallelism must increase to achieve more than 1MIOPS, and many I/O requests need to be processed simultaneously. Thus, all shared memory spaces of a manycore SSD platform require appropriate concurrency control and resource protection, similar to virtual memory.

2.2 Software Support

Flash firmware. Figure 2b shows the processes of the FTL, which performs steps ③∼⑧. The FTL manages NVMe queues/requests and responds to the host requests by processing the corresponding doorbell. The FTL then performs address translations and manages memory transactions for the flash media. While prior studies [34, 48, 49] distinguish host command controls and flash transaction management as the host interface layer (HIL) and the flash interface layer (FIL), respectively, in practice both modules are implemented as a layered firmware calling through functions of event-based codes with a single thread [57, 65, 70]. The performance of the layered firmware is not on the critical path, as flash latency is several orders of magnitude longer than the processing latency of one I/O command. However, SSDs require a large number of flash packages and queues to handle more than a thousand requests per msec. When increasing the number of underlying flash packages, the FTL requires powerful computation not only to spread I/O requests across flash packages but also to process I/O commands in parallel. We observe that compute latency keeps increasing due to the non-scalable firmware and takes 93.6% of the total I/O processing time in the worst case.

3 Challenges to Exceeding 1MIOPS

To understand the main challenges in scaling SSD firmware, we extend the baseline SSD architecture in a highly scalable environment: many-integrated cores (MIC) [18]. We select this processor platform because its architecture uses a simple in-order and low-frequency core model, but provides a high core count to study parallelism and scalability. The platform internally emulates low-level flash modules with hardware-validated software¹, so that the flash complex can be extended by adding more emulated channels and flash resources: the number of flash packages (quad-die package, QDP) varies from 2 to 512. Note that MIC is a prototyping platform used for exploring the limits of scalability, rather than as a suggestion for an actual SSD controller.

¹This emulation framework is validated by comparing with a Samsung Z-SSD prototype [4], multi-stream 983 DCT (Proof of Concept), 850 Pro [15] and Intel NVMe 750 [25]. The software will be publicly available.
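Before looking at how this baseline scales, it helps to picture the per-command control work from Section 2.1 in code. The following is a minimal, illustrative C sketch of the device-side fetch-and-parse loop (steps ③–④); the structure layouts and names are our own simplifications, not NVMe data structures or DeepFlash code:

```c
#include <stdint.h>
#include <stdio.h>

/* Simplified NVMe-style submission queue entry (illustrative only). */
struct sq_entry { uint8_t opcode; uint64_t lba; uint32_t nblks; };
struct queue    { struct sq_entry entries[64]; uint16_t size, head; };

/* Device-visible doorbell; the host writes the new tail after step ②. */
static volatile uint16_t sq_tail_doorbell;

/* Stand-in for steps ⑤-⑥ (address translation and flash scheduling). */
static void dispatch(const struct sq_entry *c)
{
    printf("op=%u lba=%llu blks=%u\n",
           c->opcode, (unsigned long long)c->lba, c->nblks);
}

/* Steps ③-④: fetch and parse commands that arrived since the last doorbell. */
static void ftl_fetch_and_parse(struct queue *sq)
{
    while (sq->head != sq_tail_doorbell) {
        dispatch(&sq->entries[sq->head]);
        sq->head = (uint16_t)((sq->head + 1) % sq->size);
    }
}

int main(void)
{
    struct queue sq = { .size = 64, .head = 0 };
    sq.entries[0] = (struct sq_entry){ .opcode = 0x01, .lba = 4096, .nblks = 8 };
    sq_tail_doorbell = 1;          /* host rings the doorbell (step ②) */
    ftl_fetch_and_parse(&sq);      /* device fetches and parses (③-④)  */
    return 0;
}
```

On its own, this loop is cheap compared to a flash access; the scaling problem appears only when hundreds of packages and queues must be driven concurrently, which is what the rest of this section measures.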
Flash scaling. The bandwidth of a low-level flash package is several orders of magnitude lower than the PCIe bandwidth, which can serve I/O requests in parallel over multiple channels managed by NVMe. Thus, SSD vendors integrate many flash packages over multiple channels to improve the bandwidth. Figure 3a shows the relationship of bandwidth and execution latency breakdown with various numbers of flash packages. In this evaluation, we emulate a layered firmware instance by creating an SSD firmware instance in a single MIC core, in which two threads are initialized to process the tasks of HIL and FTL, respectively. We also assign 16 MIC cores to manage the flash interface subsystems (one core per flash channel). We configured the emulation platform by testing 4KB sequential writes to evaluate the performance of the SSD subsystems. For the analysis, we decompose the total latency into NVMe management, I/O fetch, I/O parse, I/O cache, address translation, NVMe data transfers (DMA), and flash operations (including flash scheduling). One can observe from the figure that the SSD IOPS saturates at 170K over 16 channels, with 64 connected flash packages. Specifically, the flash operations are the main contributor to the total execution time in cases where the SSD employs only tens of flash packages (73% of the total latency). However, as the number of flash packages increases (more than 32), NVMe management and address translation become the performance bottleneck and account for 41% and 29% of the total time, respectively, while flash only consumes 12% of the total cycles.

There are two reasons why flash firmware turns into the performance bottleneck with many underlying flash devices. First, NVMe queues can supply many I/O requests to the SSD, but a single-core SSD controller is insufficient to take advantage of the SSD's internal parallelism; it is faster to fetch all the requests than to parallelize the I/O accesses across many flash chips. Second, performing address translation on only one core cannot parallelize I/O accesses across many flash chips. These new challenges make it difficult to fully leverage the parallelism with the conventional layered firmware model.

Figure 3: Performance with varying flash packages and cores. (a) Flash scaling. (b) Core scaling.

Core scaling. To take flash firmware off the critical path in scalable I/O processing, one can increase computing power with the execution of many firmware instances. This approach can allocate a core per NVMe SQ/CQ and initiate one layered firmware instance in each core. However, we observe that this naive approach cannot successfully address the burdens brought by flash firmware. To be precise, we evaluate the IOPS of layered firmware operations on a core with a varying number of cores, ranging from 1 to 32. Figure 3b compares the performance of the aforementioned naive manycore approach (Naive), built on a single-core SSD, with a system that expects perfect parallel scalability (Expected). Expected's performance is calculated by multiplying the IOPS of Naive by the number of cores. One can observe from this figure that Naive can only achieve 813K IOPS even with 32 cores, which exhibits 82.6% lower performance compared to Expected. This is because contention and consistency management for the memory spaces of the internal DRAM introduce significant synchronization overheads (cf. Section 5.3). In addition, the FTL must serialize the I/O requests to avoid hazards while processing many queues in parallel. Since all these issues are not considered by the layered firmware model, it should be re-designed by considering firmware scaling.

4 Many-to-Many Threading Firmware

Conventional FTL designs are unable to fully convert the computing power brought by a manycore processor to storage performance, as they put all the tasks of the FTL into a single large software block. In this section, we analyze the functions of traditional FTLs and decompose them into seven different function groups: NVMe queue handling (NVMQ), data cache (CACHE), address translation (TRANS), index lock (ILOCK), logging utility (LOG), background garbage collection (BGC), and flash command and transaction scheduling (FCMD). We then reconstruct the key function groups from the ground up, keeping concurrency in mind, and deploy our reworked firmware modules across multiple cores in a scalable manner. The goal of our new firmware is to fully parallelize multiple NVMe processing datapaths in a highly scalable manner while minimizing the usage of SSD internal resources; DeepFlash's many-to-many threading requires only 12 in-order cores to achieve 1M or more IOPS.

4.1 Overview

Figure 4: Many-to-many threading firmware model.

Figure 4 shows our firmware model. The firmware is a set of threads in a request-processing network that is mapped to a set of processors. Each module can have a parallel operation (i.e., thread), and a task can be scaled by instantiating it into multiple threads, referred to as stages. Based on the different data processing flows and tasks, we group the stages into queue-gather, trans-apply, and flash-scatter modules.
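As a concrete illustration of the threading model, the sketch below wires a configurable number of worker threads into the three stages over small message queues. It is our own minimal pthread-based approximation under stated assumptions: the stage names follow the paper, but the ring-queue type, thread counts (NG/NT/NS) and the toy "translation" are invented for illustration only.

```c
/* Minimal sketch: queue-gather -> trans-apply -> flash-scatter pipeline. */
#include <pthread.h>
#include <stdio.h>

#define QCAP 64
typedef struct { long buf[QCAP]; int head, tail, cnt;
                 pthread_mutex_t m; pthread_cond_t cv; } ring_t;

static void ring_init(ring_t *q) { q->head = q->tail = q->cnt = 0;
    pthread_mutex_init(&q->m, NULL); pthread_cond_init(&q->cv, NULL); }
static void ring_push(ring_t *q, long v) {
    pthread_mutex_lock(&q->m);
    while (q->cnt == QCAP) pthread_cond_wait(&q->cv, &q->m);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QCAP; q->cnt++;
    pthread_cond_broadcast(&q->cv); pthread_mutex_unlock(&q->m);
}
static long ring_pop(ring_t *q) {
    pthread_mutex_lock(&q->m);
    while (q->cnt == 0) pthread_cond_wait(&q->cv, &q->m);
    long v = q->buf[q->head]; q->head = (q->head + 1) % QCAP; q->cnt--;
    pthread_cond_broadcast(&q->cv); pthread_mutex_unlock(&q->m);
    return v;
}

static ring_t gather_to_trans, trans_to_scatter;
#define NREQ 32            /* requests injected per queue-gather thread */
#define STOP (-1L)         /* poison pill used to drain a stage         */

static void *queue_gather(void *arg) {      /* parses NVMe requests (LBAs) */
    long base = (long)(size_t)arg * NREQ;
    for (long i = 0; i < NREQ; i++) ring_push(&gather_to_trans, base + i);
    return NULL;
}
static void *trans_apply(void *arg) {       /* toy LBA -> PPA translation */
    (void)arg;
    for (;;) {
        long lba = ring_pop(&gather_to_trans);
        if (lba == STOP) break;
        ring_push(&trans_to_scatter, lba ^ 0x5a5a);
    }
    return NULL;
}
static void *flash_scatter(void *arg) {     /* would drive one channel's flash */
    (void)arg;
    for (;;) { if (ring_pop(&trans_to_scatter) == STOP) break; }
    return NULL;
}

int main(void) {
    enum { NG = 2, NT = 3, NS = 4 };        /* threads per stage (tunable) */
    pthread_t g[NG], t[NT], s[NS];
    ring_init(&gather_to_trans); ring_init(&trans_to_scatter);
    for (size_t i = 0; i < NG; i++) pthread_create(&g[i], NULL, queue_gather, (void *)i);
    for (size_t i = 0; i < NT; i++) pthread_create(&t[i], NULL, trans_apply, NULL);
    for (size_t i = 0; i < NS; i++) pthread_create(&s[i], NULL, flash_scatter, NULL);
    for (size_t i = 0; i < NG; i++) pthread_join(g[i], NULL);
    for (size_t i = 0; i < NT; i++) ring_push(&gather_to_trans, STOP);
    for (size_t i = 0; i < NT; i++) pthread_join(t[i], NULL);
    for (size_t i = 0; i < NS; i++) ring_push(&trans_to_scatter, STOP);
    for (size_t i = 0; i < NS; i++) pthread_join(s[i], NULL);
    puts("pipeline drained");
    return 0;
}
```

In the real firmware the inter-stage links would be non-blocking queues over the on-chip interconnect rather than mutex-protected rings; the point of the sketch is only the topology, namely that each stage scales independently by changing its thread count.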


Figure 5: Firmware architecture.

The queue-gather stage mainly parses NVMe requests and collects them into the SSD-internal DRAM, whereas the trans-apply stage mainly buffers the data and translates addresses. The flash-scatter stage spreads the requests across the many underlying flash packages and manages background SSD-internal tasks in parallel. This new firmware enables scalable and flexible computing, and highly parallel I/O execution.

All threads are maximally independent, and I/O requests are always processed from left to right in the thread network, which reduces the hardware contentions and consistency problems imposed by managing the various memory spaces. For example, two independent I/O requests are processed by two different network paths (highlighted in Figure 4 by red and blue lines, respectively). Consequently, DeepFlash can simultaneously service as many incoming I/O requests as the network paths it can create. In contrast to the other threads, background threads are asynchronous with the incoming I/O requests or host-side services. Therefore, they create their own network paths (dashed lines), which perform SSD-internal tasks in the background. Since each stage can process a different part of an I/O request, DeepFlash can process multiple requests in a pipelined manner. Our firmware model can also be simply extended by adding more threads based on the performance demands of the target system.

Figure 5 illustrates how our many-to-many threading model can be applied to and operate in the many-core based SSD architecture of DeepFlash. While the procedure of I/O services is managed by many threads in the different data processing paths, the threads can be allocated to any core in the network, in a parallel and scalable manner.

4.2 Queue-gather Stage

Figure 6: Challenges of NVMQ allocation (SQ:NVMQ). (a) Data contention (1:N). (b) Unbalanced task. (c) I/O hazard.

NVMe queue management. For high performance, NVMe supports up to 64K queues, each with up to 64K entries. As shown in Figure 6a, once a host initiates an NVMe command to an SQ and writes the corresponding doorbell, the firmware fetches the command from the SQ and decodes a non-contiguous set of host physical memory pages by referring to a kernel list structure [2], called a physical region page (PRP) [23]. Since the length of the data in a request can vary, its data can be delivered by multiple data frames, each of which is usually 4KB. While all command information can be retrieved from the device-level registers and the SQ, the contents of such data frames exist across non-contiguous host-side DRAM (for a single I/O request). The firmware parses the PRP and begins DMA for the multiple data frames per request. Once all the I/O services associated with those data frames complete, the firmware notifies the host of the completion through the target CQ. We refer to all the tasks related to this NVMe command and queue management as NVMQ.

A challenge in employing many cores for parallel queue processing is that multiple NVMQ cores may simultaneously fetch the same set of NVMe commands from a single queue. This in turn accesses the host memory by referring to the same set of PRPs, which makes the behavior of parallel queue accesses undefined and non-deterministic (Figure 6a). To address this challenge, one can make each core handle only one set of SQ/CQ, so that there is no contention caused by simultaneous queue processing or PRP accesses (Figure 6b). In this "static" queue allocation, each NVMQ core fetches a request from a different queue, based on the doorbell's queue index, and brings the corresponding data from the host system memory to SSD internal memory. However, this static approach requires that the host balance requests across queues to maximize the resource utilization of NVMQ threads. In addition, it is difficult to scale to a large number of queues. DeepFlash addresses these challenges by introducing dynamic I/O serialization, which allows multiple NVMQ threads to access each SQ/CQ in parallel while avoiding consistency violations.
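One simple way to let several NVMQ threads drain a single SQ without fetching the same entry twice is to claim slots with an atomic compare-and-swap on a shared head index. The following C sketch is only our illustration of the idea behind dynamic I/O serialization, not the authors' implementation; all names and types are invented for the example:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SQ_SIZE 64

struct shared_sq {
    uint32_t entries[SQ_SIZE];  /* stand-in for NVMe command slots        */
    atomic_uint head;           /* next entry to claim (device side)      */
    atomic_uint tail;           /* advanced when the host rings the doorbell */
};

/* Each NVMQ thread calls this; returns -1 when the queue is empty.
 * The compare-and-swap guarantees every posted entry is claimed exactly
 * once, even when many threads poll the same SQ concurrently.           */
static int claim_next(struct shared_sq *sq, uint32_t *out)
{
    unsigned h = atomic_load(&sq->head);
    while (h < atomic_load(&sq->tail)) {
        if (atomic_compare_exchange_weak(&sq->head, &h, h + 1)) {
            *out = sq->entries[h % SQ_SIZE];
            return 0;
        }                        /* h was refreshed by the failed CAS; retry */
    }
    return -1;
}

int main(void)
{
    struct shared_sq sq = { .head = 0, .tail = 0 };
    sq.entries[0] = 0xabcd;      /* host posts one command...             */
    atomic_store(&sq.tail, 1);   /* ...and rings the doorbell             */

    uint32_t cmd;
    if (claim_next(&sq, &cmd) == 0)
        printf("claimed command 0x%x\n", cmd);
    return 0;
}
```

The claimed slot index also records the submission order of each command, which matters for the ordering hazards discussed next.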
Details of NVMQ will be explained in Section 5.1.

I/O mutual exclusion. Even though the NVMe specification does not regulate the processing order of NVMe commands in the range from where the head pointer indicates to the entry that the tail pointer refers to [3], users may expect that the SSD processes the requests in the order in which they were submitted. However, in our DeepFlash, many threads can simultaneously process I/O requests in any order of accesses. This can make the order of I/O processing different from the order that the NVMe queues (and users) expected, which may in turn introduce an I/O hazard or a consistency issue. For example, Figure 6c shows a potential problem brought by parallel I/O processing. In this figure, there are two different I/O requests from the same NVMe SQ, request-1 (a write) and request-2 (a read), which create two different paths but target the same PPA. Since these two requests are processed by different NVMQ threads, request-2 can be served from the target slightly earlier than request-1. Request-1 then will be applied to the target only after request-2 has already been served, so the read returns stale data in an order the user did not expect.

As shown in Figure 7a, each CACHE thread has its own mapping table to record the memory locations of the buffered requests. CACHE threads are configured with a traditional direct-map cache to reduce the burden of table lookup or cache replacement. In this design, as each CACHE thread has a different memory region to manage, NVMQ simply calculates the target CACHE thread for a given request.

5.2 Index Lock Optimization

When multiple NVMQ threads contend to acquire or release the same lock due to the same target address range, it can raise two technical issues: i) lock contention and ii) low resource utilization of NVMQ. As shown in Figure 10a, an ILOCK thread sees all incoming lock requests (per page by LBA) through its message queue. This queue sorts the messages based on SQ indices, and each message maintains a thread request structure that includes an SQ index, NVMQ ID, LBA, and lock request information (e.g., acquire and release). Since the order of a queue's lock requests is non-deterministic, in a case of contention on acquisition, it must perform the I/O services by respecting the order of requests in the corresponding SQ. Thus, the ILOCK thread infers the SQ order by referring to the SQ index in the message queue if the target LBA of the lock request has a conflict. It then checks the red-black (RB) tree whose LBA-indexed node contains the lock number and the owner ID that already acquired the corresponding address. If there is no node in the lock RB tree, the ILOCK thread allocates a node with the request's NVMQ ID. When ILOCK receives a release request, it directly removes the target node without an SQ inference process. If the target address is already held by another NVMQ thread, the lock requester can be stalled until the corresponding I/O service is completed. Since low-level flash latency takes hundreds of microseconds to a few milliseconds, the stalled NVMQ can hurt overall performance. In our design, ILOCK returns the owner ID for all lock acquisition requests rather than simply returning an acquisition result (e.g., false or fail). The NVMQ thread receives the ID of the owning NVMQ thread, and can forward the request there to be processed rather than communicating with ILOCK again. Alternatively, the NVMQ thread can perform other tasks, such as issuing the I/O service to TRANS or CACHE. The insight behind this forwarding is that if another NVMQ thread owns the corresponding lock of a request, forwarding the request to the owner stops further communication with ILOCK.

5.3 Non-blocking Cache

To get CACHE off the critical path, we add a direct path between NVMQ and TRANS threads and make NVMQ threads access CACHE threads "only if" there is data in CACHE. We allocate direct-map table(s) in a shared memory space to accommodate the cache metadata so that NVMQ threads can look up the cache metadata on their own and send I/O requests only if there is a hit. However, this simple approach may introduce inconsistency between the cache metadata of the direct-map table and the target data of the CACHE. When a write evicts a dirty page from the burst buffer, the metadata of the evicted page is removed from the direct-map table immediately. However, the target data of the evicted page may still stay in the burst buffer, due to the long latency of a flash write. Therefore, when a dirty page is in the progress of eviction, read requests that target the same page may access stale data from the flash. To coordinate the direct-map table and CACHE correctly, we add an "evicted LPN" field in each map table entry that records the page number being evicted (cf. Figure 10b). In the example of the figure, we assume the burst buffer is a direct-mapped cache with 3 entries. The request (Req ①) evicts the dirty page at LPN 0x00. Thus, NVMQ records the LPN of Req ① in the cached LPN field of the direct-map table and moves the address of the evicted page to its evicted LPN field. Later, as the LPN of Req ② (the read at 0x00) matches the evicted LPN field, Req ② is served by CACHE instead of accessing the flash. If CACHE is busy evicting the dirty page at LPN 0x00, Req ③ (the write at 0x06) would have to be stalled. To address this, we make Req ③ directly bypass CACHE. Once the eviction successfully completes, CACHE clears the evicted LPN field (④).

To make this non-blocking cache more efficient, we add a simple randomizing function to retrieve the target TRANS index for NVMQ and CACHE threads, which can evenly distribute their requests in a static manner. This function performs an XOR operation per bit for all the bit groups and generates the target TRANS index, which takes less than 20 ns. The randomization allows the queue-gather stage to issue requests to TRANS while addressing load imbalance.
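The randomizing function described above can be as small as an XOR fold over fixed-width bit groups of the page number. The C sketch below is our own guess at such a function; the 4-bit group width and the TRANS thread count are illustrative assumptions, not values from the paper:

```c
#include <stdint.h>
#include <stdio.h>

/* XOR-fold an LPN into a TRANS thread index: every 4-bit group of the
 * page number is XORed together, then reduced modulo the number of
 * TRANS threads. Group width and thread count are illustrative.        */
static unsigned trans_index(uint64_t lpn, unsigned num_trans_threads)
{
    unsigned fold = 0;
    for (int shift = 0; shift < 64; shift += 4)
        fold ^= (unsigned)((lpn >> shift) & 0xF);
    return fold % num_trans_threads;
}

int main(void)
{
    /* Nearby LPNs scatter to different translators, spreading the load. */
    for (uint64_t lpn = 0; lpn < 8; lpn++)
        printf("LPN %llu -> TRANS %u\n",
               (unsigned long long)lpn, trans_index(lpn, 6));
    return 0;
}
```

Because the fold is stateless, every NVMQ or CACHE thread computes the same target for a given LPN without any coordination, which is consistent with the "static" distribution described above.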
6 Evaluation

Implementation platform. We set up an accurate SSD emulation platform by respecting the real NVMe protocol, the timing constraints of the flash backbone and the functionality of a flexible firmware. Specifically, we emulate a manycore-based SSD firmware by using a MIC 5120D accelerator that employs 60 lightweight in-order cores (4 hardware threads per core) [28]. The MIC cores operate at 1GHz and are implemented by applying low-power techniques such as a short in-order pipeline. We emulate the flash backbone by modelling various flash latencies, different levels of parallelism (i.e., channel/way/flash) and the request conflicts for flash resources. Our flash backbone consists of 16 channels, each connecting 16 QDP flash packages [69]; we observed that the performance of both read and write operations on the backbone itself is not the bottleneck to achieving more than 1 MIOPS. The NVMe interface on the accelerator is also fully emulated by wrapping Intel's symmetric communications interface (SCIF) with an NVMe emulation driver and controller that we implemented. The host employs a Xeon 16-core processor and 256 GB DRAM, running Linux kernel 2.6.32 [62]. It should be noted that this work uses MIC to explore the scalability limits of the design; the resulting software can run with fewer cores if they are more powerful, but the design can then be about what is most economic and power efficient, rather than about whether the firmware can be scalable.
Configurations. DeepFlash is the emulated SSD platform including all the proposed designs of this paper. Compared to DeepFlash, BaseDeepFlash does not apply the optimization techniques (described in Section 5). We evaluate the performance of a real Intel customer-grade SSD (750SSD) [25] and a high-performance NVMe SSD (4600SSD) [26] for a better comparison. We also emulate another SSD platform (ManyLayered), which is an approach to scale up the layered firmware on many cores. Specifically, ManyLayered statically splits the SSD hardware resources into multiple subsets, each containing the resources of one flash channel and running a layered firmware independently. For each layered firmware instance, ManyLayered assigns a pair of threads: one is used for managing flash transactions, and another is assigned to run HIL and FTL. All these emulation platforms use "12 cores" by default. Lastly, we also test different flash technologies such as SLC, MLC and TLC, whose latency characteristics are extracted from [44], [45] and [46], respectively. By default, the MLC flash array in pristine state is used for our evaluations. The details of the SSD platform are in Table 1.

Workloads. In addition to microbenchmarks (reads and writes with sequential and random patterns), we test diverse server workloads, collected from Microsoft Production Server (MPS) [35], FIU SRCMap [63], Enterprise, and FIU IODedup [40]. Each workload exhibits various request sizes, ranging from 4KB to tens of KB, which are listed in Table 1. Since all the workload traces are collected from narrow-queue SATA hard disks, replaying the traces with the original timestamps cannot fully utilize the deep NVMe queues, which in turn conceals the real performance of the SSD [29]. To this end, our trace replaying approach allocates 16 worker threads in the host to keep issuing I/O requests, so that the NVMe queues are not depleted by the SSD platforms.

6.1 Performance Analysis

Microbenchmarks. Figure 11 compares the throughput of the five SSD platforms with I/O sizes varying from 4KB to 32KB. Overall, ManyLayered outperforms 750SSD and 4600SSD by 1.5× and 45%, on average, respectively. This is because ManyLayered can partially take the benefits of manycore computing and parallelize I/O processing across multiple queues and channels over the static resource partitioning. BaseDeepFlash exhibits poor performance in cases where the request size is smaller than 24KB with random patterns. This is because the threads in NVMQ/ILOCK keep tight inter-thread communications to appropriately control the consistency over locks. However, for large requests (32KB), BaseDeepFlash exhibits good performance close to ManyLayered, as multiple pages in large requests can be merged to acquire one range lock, which reduces the communication (compared to smaller request sizes), and thus it achieves higher bandwidth.

We observe that ManyLayered and BaseDeepFlash have a significant performance degradation in random reads and random writes (cf. Figures 11b and 11d). DeepFlash, in contrast, provides more than 1MIOPS in all types of I/O requests; 4.8 GB/s and 4.5 GB/s bandwidth for reads and writes, respectively. While those many-core approaches suffer from many core/flash-level conflicts (ManyLayered) and lock/sync issues (BaseDeepFlash) on the imbalanced random workloads, DeepFlash scrambles the LBA space and evenly distributes all the random I/O requests to different TRANS threads with a low overhead. In addition, it applies cache bypass and lock forwarding techniques to mitigate the long stalls imposed by lock inquiry and inter-thread communication. This can enable more threads to serve I/O requests in parallel.

As shown in Figure 12, DeepFlash can mostly activate 6.3 cores that run 25 threads to process I/O services in parallel, which is better than BaseDeepFlash by 127% and 63% for reads and writes, respectively. Note that, for the random writes, the bandwidth of DeepFlash is sustainable (4.2 GB/s) by activating only 4.5 cores (18 threads). This is because, although many cores contend to acquire ILOCK, which makes more cores stay idle, the burst buffer successfully overcomes the long write latency of the flash.

Figure 12e shows the active core decomposition of DeepFlash.
As shown in the figure, reads require 23% more Flash activities compared to writes, as the writes are accommodated in internal DRAM. In addition, NVMQ requires 1.5% more compute resources to process write requests than to process read requests, which can offer slightly worse bandwidth on writes compared to that of reads. Note that background activities such as garbage collection and logging are not invoked during this evaluation, as we configured the emulation platform as a pristine SSD.

Table 1: H/W configurations and important characteristics of the workloads that we tested.

Host: CPU/mem: Xeon 16-core processor / 256GB DDR4.
Storage platform/firmware: Controller: Xeon Phi, 12 cores by default; FTL/buffer: hybrid, n:m=1:8, 1 GB/512 MB; Flash array: 16 channels / 16 pkgs per channel / 1k blocks per die; 512GB (SLC), 1TB (MLC), 1.5TB (TLC).
Flash timing: SLC R: 25us, W: 300us, E: 2ms, Max: 1.4 MIOPS; MLC R: 53us, W: 0.9ms, E: 2.3ms, Max: 1.3 MIOPS; TLC R: 78us, W: 2.6ms, E: 2.3ms, Max: 1.1 MIOPS.

Workload sets (Microsoft Production Server; FIU IODedup):
  Workloads:        24HR   24HRS   BS     CFS    DADS   DAP    DDR    cheetah  homes  webonline
  Read ratio:       0.06   0.13    0.11   0.82   0.87   0.57   0.9    0.99     0      0
  Avg length (KB):  7.5    12.1    26.3   8.6    27.6   63.4   12.2   4        4      4
  Randomness:       0.3    0.4937  0.87   0.94   0.99   0.38   0.313  0.12     0.14   0.14

Workload sets (FIU SRCMap; Enterprise):
  Workloads:        ikki   online  topgun webmail casa   webresearch  webusers  madmax  Exchange
  Read ratio:       0      0       0      0       0      0            0         0.002   0.24
  Avg length (KB):  4      4       4      4       4      4            4         4.005   9.2
  Randomness:       0.39   0.17    0.14   0.21    0.65   0.11         0.14      0.08    0.84

Figure 11: Performance comparison. (a) Sequential reads. (b) Random reads. (c) Sequential writes. (d) Random writes.

Server workload traces. Figure 13 illustrates the throughput of the server workloads. As shown in the figure, BaseDeepFlash exhibits 1.6, 2.7, 1.1, and 2.9 GB/s, on average, for the MPS, SRCMap, Enterprise and IODedup workload sets, respectively, and DeepFlash improves on BaseDeepFlash by 260%, 64%, 299% and 35%, respectively. BaseDeepFlash exhibits a performance degradation compared to ManyLayered with MPS. This is because MPS generates multiple lock contentions, due to more small-size random accesses than the other workloads (cf. Table 1). Interestingly, while DeepFlash outperforms the other SSD platforms in most workloads, its performance is not as good under the DDR workload (slightly better than BaseDeepFlash). This is because FCMD utilization is lower than 56% due to the address patterns of DDR. However, since all NVMQ threads parse and fetch incoming requests in parallel, even for such workloads, DeepFlash provides 3 GB/s, which is 42% better than ManyLayered.

Figure 13: Overall throughput analysis.

6.2 CPU, Energy and Power Analyses

CPU usage and different flashes. Figures 14a and 14b show a sensitivity analysis for bandwidth and power/energy, respectively. In this evaluation, we collect and present the performance results of all four microbenchmarks by employing a varying number of cores (2∼19) and different flash technologies (SLC/MLC/TLC). The overall SSD bandwidth starts to saturate from 12 cores (48 hardware threads) in most cases. Since TLC flash exhibits longer latency than SLC/MLC flash, a TLC-based SSD requires more cores to reduce the firmware latency such that it can reach 1MIOPS. When we increase the number of threads further, the performance gains start to diminish due to the overhead of exchanging many messages among thread groups. Finally, when 19 cores are employed, SLC, MLC, and TLC achieve the maximum bandwidths that all the underlying flashes aggregately expose, which are 5.3, 4.8, and 4.8 GB/s, respectively.

Power/Energy. Figure 14b shows the energy breakdown of each SSD stage and the total core power. The power and energy are estimated based on an instruction-level energy/power model of Xeon Phi [55]. As shown in Figure 14b, DeepFlash with 12 cores consumes 29 W, which can satisfy the power delivery capability of PCIe [51]. Note that while this power con-

Figure 12: Dynamics of active cores for parallel I/O processing. (a) Sequential reads. (b) Random reads. (c) Sequential writes. (d) Random writes. (e) Core decomposition.

Figure 14: Resource requirement analysis. (a) Bandwidth. (b) Breakdown. (c) Cores.

Figure 16: ILOCK and CACHE optimizations. (a) ILOCK impact. (b) CACHE IOPS.



IOPS/thread(K) (a) LOG/BGC. (b) BGC overhead. (a) NVMQ performance. (b) IOPS per NVMQ thread. Figure 17: Background task optimizations. Figure 15: Performance on different queue allocations. CPU core [10]. Furthermore, the per-thread IOPS of Dynamic sumption is higher than existing SSDs (20 ∼ 30W [30,58,73]), (with 16 queues) is better than Static by 6.9% (cf. Figure power-efficient manycores [68] can be used to reduce the 15b). This is because Dynamic can fully utilize all NVMQ power of our prototype. When we break down energy con- threads when the loads of different queues are unbalanced; sumed by each stage, FCMD, TRANS and NVMQ consume the NVMQ performance variation of Dynamic (between min 42%, 21%, and 26% of total energy, respectively, as the num- and max) is only 12%, whereas that of Static is 48%. ber of threads increases. This is because while CACHE, LOG, ILOCK. Figure 16a compares the different locking systems. ILOCK, and BGC require more computing power, most cores Page-lock is a page-granular lock, while ILOCK-base is should be assigned to handle a large flash complex, many ILOCK that has no ownership forwarding. ILOCK-forwd queues and frequent address translation for better scalability. is the one that DeepFlash employs. While ILOCK-base Different CPUs. Figure 14c compares the minimum number and ILOCK-forwd use a same granular locking (256KB), of cores that DeepFlash requires to achieve 1MIOPS for both ILOCK-1MB employs 1MB for its lock range but has no reads and writes. We evaluate different CPU technologies: i) forwarding. Page-lock can activate NVMQ threads more OoO-1.2G, ii) OoO-2.4G and iii) IO-1G. While IO-1G uses than ILOCK-1MB by 82% (Figure 16a). However, due to the the default in-order pipeline 1GHz core that our emulation overheads imposed by frequent lock node operations and platform employs, OoO-1.2G and OoO-2.4G employ Intel RB tree management, the average lock inquiry latency of Xeon CPU, an out-of-order execution processor [24] with 1.2 Page-lock is as high as 10 us, which is 11× longer than that and 2.4GHz CPU frequency, respectively. One can observe of ILOCK-forwd. In contrast, ILOCK-forwd can activate the from the figure that a dozen of cores that DeepFlash uses similar number of NVMQ threads as Page-lock, and exhibits can be reduced to five high-frequency cores (cf. OoO-2.4G). 0.93 us average lock inquiry latency. However, due to the complicated core logic (e.g., reorder CACHE. Figure 16b illustrates CACHE performance with buffer), OoO-1.2G and OoO-2.4G consume 93% and 110% multiple threads varying from 0 to 4. “2-Bypass" employs the more power than IO-1G to achieve the same level of IOPS. bypass technique (with only 2 threads). Overall, the read per- formance (even with no-cache) is close to 1MIOPS, thanks 6.3 Performance Analysis of Optimization to massive parallelism in back-end stages. However, write performance with no-cache is only around 0.65 MIOPS, on In this analysis, we examine different design choices of the average. By enabling a single CACHE thread to buffer data components in DeepFlash and evaluate their performance in SSD internal DRAM rather than underlying flash media, impact on our proposed SSD platform. The following experi- write bandwidth increases by 62%, compared to the system of ments use the configuration of DeepFlash by default. no-cache. But single CACHE thread reduces read bandwidth NVMQ. 
Figures 15a and 15b compare NVMQ’s IOPS and by 25%, on average, due to communication overheads (be- per-thread IOPS, delivered by a non-optimized queue allo- tween CACHE and NVMQ) for each I/O service. Even with cation (i.e., Static) and our DIOS (i.e., Dynamic), respec- more CACHE threads, performance gains diminish due to tively. Dynamic achieves the bandwidth goal, irrespective of communication overhead. In contrast, DeepFlash’s 2-Bypass the number of NVMe queues that the host manages, whereas can be ideal as it requires fewer threads to achieve 1MIOPS. Static requires more than 16 NVMe queues to achieve Background activities. Figure 17a shows how DeepFlash 1MIOPS (cf. Figure 15a). This implies that the host also re- coordinates NVMQ, LOG and BGC threads to avoid con- quires more cores since the NVMe allocates a queue per host tentions on flash resources and maximize SSD performance. As shown in the figure, when NVMQ actively parses and ware/hardware with an in-depth analysis and offers 1MIOPS fetches data (between 0.04 and 0.2 s), LOG stops draining for all microbenchmarks (read, write, sequential and random) the data from internal DRAM to flash, since TRANS needs with varying I/O sizes. In addition, our solution is orthogonal to access their meta information as a response of NVMQ’s to (and still necessary for) host-side optimizations. queue processing. Similarly, BGC also suspends the block re- Emulation. There is unfortunately no open hardware plat- claiming since data migration (associated to the reclaim) may form, employing multiple cores and flash packages. For ex- cause flash-level contentions, thereby interfering NVMQ’s ample, OpenSSD has two cores [59], and Dell/EMC’s Open- activities. As DeepFlash can minimize the impact from LOG channel SSD (only opens to a small and verified community) and BGC, the I/O access bandwidth stays above 4 GB/s. Once also employs 4∼8 NXP cores on a few flash [17]. Although NVMQ is in idle, LOG and BGC reactivate their work. this is an emulation study, we respected all real NVMe/ONFi STEADY-STATE performance. Figure 17b shows the protocols and timing constraints for SSD and flash, and the impact of on-demand garbage collection (FGC) and jour- functionality and performance of flexible firmware are demon- nalling (FLOG) on the performance of DeepFlash. The re- strated by a real lightweight many-core system. sults are compared to the ideal performance of DeepFlash Scale-out vs. scale-up options. A set of prior work proposes (Pristine), which has no GC and LOG activities. Compared to architect the SSD as the RAID0-like scale-out option. For to Pristine, the performance of FGC degrades by 5.4%, example, Amfeltec introduces an M.2-to-PCIe carrier card, while FLOG+FGC decreases the throughput by 8.8%, on av- which can include four M.2 NVMe SSDs as the RAID0- erage. The reason why there is negligible performance loss like scale-up solution [5]. However, this solution only offers is that on-demand GC only blocks single TRANS thread 340K IOPS due to the limited computing power. Recently, that manages the reclaimed flash block, while the remaining CircuitBlvd overcomes such limitation by putting eight car- TRANS threads keep serving the I/O requests. In the mean- rier cards into a storage box [6]. Unfortunately, this scale-out time, LOG works in parallel with TRANS, but consumes the option also requires two extra E5-2690v2 CPUs (3.6GHz 20 usage of FCMD to dump data. cores) with seven PCIe switches, which consumes more than 450W. 
In addition, these scale-out solutions suffer from serv- 7 Related Work and Discussion ing small-sized requests with a random access-pattern (less than 2GB/sec) owing to frequent interrupt handling and I/O OS optimizations. To achieve higher IOPS, host-level opti- request coordination mechanisms. In contrast, DeepFlash, as mization on multicore systems [8,36,75] have been studied. an SSD scale-up solution, can achieve promising performance Bjorling et al. changes Linux block layer in OS and achieves of random accesses by eliminating the overhead imposed by 1MIOPS on the high NUMA-factor processor systems [8]. such RAID0 design. In addition, compared to the scale-out op- Zheng et al. redesigns buffer cache on file systems and rein- tions, DeepFlash employs fewer CPU cores to execute only vent overhead and lock-contention in a 32-core NUMA ma- SSD firmware, which in turn reduces the power consumption. chine to achieve 1MIOPS [75]. All these systems exploit heavy manycore processors on the host and buffer data atop SSDs to achieve higher bandwidth. 8 Conclusion Industry trend. To the best of our knowledge, while there are no manycore SSD studies in literature, industry already In this work, we designed scalable flash firmware inspired by begun to explore manycore based SSDs. Even though they parallel data analysis systems, which can extract the max- do not publish the actual device in publicly available market, imum performance of the underlying flash memory com- there are several devices that partially target to 1MIOPS. For plex by concurrently executing multiple firmware compo- example, FADU is reported to offer around 1MIOPS (only for nents within a single device. Our emulation prototype on a sequential reads with prefetching) and 539K IOPS (for writes) manycore-integrated accelerator reveals that it simultaneously [20]; Samsung PM1725 offers 1MIOPS (for reads) and 120K processes beyond 1MIOPS, while successfully hiding long IOPS (for writes). Unfortunately, there are no information latency imposed by internal flash media. regarding all industry SSD prototypes and devices in terms of hardware and software architectures. We believe that future architecture requires brand-new flash firmware for scalable 9 Acknowledgement I/O processing to reach 1MIOPS. Host-side FTL. LightNVM [9], including CNEX solution [1], The authors thank Keith Smith for shepherding their aims to achieve high performance (∼1MIOPS) by moving paper. This research is mainly supported by NRF FTL to the host and optimizing user-level and host-side soft- 2016R1C182015312, MemRay grant (G01190170) and ware stack. But their performance are achieved by evalu- KAIST start-up package (G01190015). J. Zhang and M. ating only specific operations (like reads or sequential ac- Kwon equally contribute to the work. Myoungsoo Jung is cesses). In contrast, DeepFlash reconstructs device-level soft- the corresponding author. References [13] Wonil Choi, Myoungsoo Jung, Mahmut Kandemir, and Chita Das. Parallelizing garbage collection with i/o to [1] CNEX Labs. https://www.cnexlabs.com. improve flash resource utilization. In Proceedings of the 27th International Symposium on High-Performance [2] Microsoft SGL Description. https://docs. Parallel and Distributed Computing, pages 243–254, microsoft.com/en-us/windows-hardware/ 2018. drivers/kernel/using-scatter-gather-dma. [14] Wonil Choi, Jie Zhang, Shuwen Gao, Jaesoo Lee, My- http://nvmexpress.org/ [3] Nvm express. oungsoo Jung, and Mahmut Kandemir. 
An in-depth wp-content/uploads/NVM-Express-1_ study of next generation interface for emerging non- 3a-20171024_ratified.pdf . volatile memories. In Non-Volatile Memory Systems [4] Ultra-low Latency with Samsung Z-NAND SSD. http: and Applications Symposium (NVMSA), 2016 5th, pages //www.samsung.com/us/labs/pdfs/collateral/ 1–6. IEEE, 2016. Samsung_Z-NAND_Technology_Brief_v5.pdf, [15] cnet. Samsung 850 Pro SSD review. 2017. https://www.cnet.com/products/ samsung-ssd-850-pro/ [5] Squid carrier board family , 2015. gen 3 carrier board for 4 m.2 pcie ssd [16] Danny Cobb and Amber Huffman. NVM Express and modules. https://amfeltec.com/ the PCI Express SSD revolution. In Intel Developer pci-express-gen-3-carrier-board-for-m-2-ssd/, Forum. Santa Clara, CA, USA: Intel, 2012. 2018. [17] Jae Do. SoftFlash: Programmable storage in [6] Cinabro platform v1. https://www.circuitblvd. future data centers. https://www.snia. com/post/cinabro-platform-v1, 2019. org/sites/default/files/SDC/2017/ presentations/Storage_Architecture/ [7] Jasmin Ajanovic. PCI express 3.0 overview. In Proceed- Do_Jae_Young_SoftFlash_Programmable_ ings of Hot Chip: A Symposium on High Performance Storage_in_Future_Data_Centers.pdf, Chips, 2009. 2017. [8] Matias Bjørling, Jens Axboe, David Nellans, and [18] Alejandro Duran and Michael Klemm. The Intel R Philippe Bonnet. Linux block IO: introducing multi- many integrated core architecture. In High Performance Proceed- queue SSD access on multi-core systems. In Computing and Simulation (HPCS), 2012 International ings of the 6th international systems and storage confer- Conference on, pages 365–366. IEEE, 2012. ence, page 22. ACM, 2013. [19] FreeBSD. Freebsd manual pages: flock. [9] Matias Bjørling, Javier González, and Philippe Bonnet. https://www.freebsd.org/cgi/man.cgi? LightNVM: The Linux Open-Channel SSD Subsystem. query=flock&sektion=2, 2011. In FAST, pages 359–374, 2017. [20] Anthony Garreffa. Fadu unveils world’s fastest SSD, [10] Keith Busch. Linux NVMe driver. https: capable of 5gb/sec. http://tiny.cc/eyzdcz, //www.flashmemorysummit.com/English/ 2016. Collaterals/Proceedings/2013/ 20130812_PreConfD_Busch.pdf, 2013. [21] Arthur Griffith. GCC: the complete reference. McGraw- Hill, Inc., 2002. [11] Adrian M Caulfield, Joel Coburn, Todor Mollov, Arup De, Ameen Akel, Jiahua He, Arun Jagatheesan, Rajesh K [22] Laura M Grupp, John D Davis, and Steven Swanson. Gupta, Allan Snavely, and Steven Swanson. Understand- The bleak future of NAND flash memory. In Proceed- ing the impact of emerging non-volatile memories on ings of the 10th USENIX conference on File and Storage high-performance, io-intensive computing. In High Per- Technologies, pages 2–2. USENIX Association, 2012. formance Computing, Networking, Storage and Analysis [23] Amber Huffman. NVM Express, revision 1.0 c. Intel (SC), 2010 International Conference for, pages 1–11. Corporation, 2012. IEEE, 2010. [24] Intel. Intel Xeon Processor E5 2620 v3. http:// [12] Adrian M Caulfield, Laura M Grupp, and Steven Swan- tiny.cc/a1zdcz, 2014. son. Gordon: using flash memory to build fast, power- efficient clusters for data-intensive applications. ACM [25] Intel. Intel SSD 750 series. http://tiny.cc/ Sigplan Notices, 44(3):217–228, 2009. qyzdcz, 2015. [26] Intel. Intel SSD DC P4600 Series. http://tiny. [37] Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu. cc/dzzdcz, 2018. Revisiting storage for smartphones. ACM Transactions on Storage (TOS), 8(4):14, 2012. [27] Xabier Iturbe, Balaji Venu, Emre Ozer, and Shidhartha Das. 
[28] James Jeffers and James Reinders. Intel Xeon Phi coprocessor high-performance programming. Newnes, 2013.
[29] Jaeyong Jeong, Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim. Lifetime improvement of NAND flash-based storage systems using dynamic program and erase scaling. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), pages 61–74, 2014.
[30] Myoungsoo Jung. Exploring design challenges in getting solid state drives closer to cpu. IEEE Transactions on Computers, 65(4):1103–1115, 2016.
[31] Myoungsoo Jung, Wonil Choi, Shekhar Srikantaiah, Joonhyuk Yoo, and Mahmut T Kandemir. Hios: A host interface i/o scheduler for solid state disks. ACM SIGARCH Computer Architecture News, 42(3):289–300, 2014.
[32] Myoungsoo Jung and Mahmut Kandemir. Revisiting widely held SSD expectations and rethinking system-level implications. In ACM SIGMETRICS Performance Evaluation Review, volume 41, pages 203–216. ACM, 2013.
[33] Myoungsoo Jung and Mahmut T Kandemir. Sprinkler: Maximizing resource utilization in many-chip solid state disks. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 524–535. IEEE, 2014.
[34] Myoungsoo Jung, Ellis H Wilson III, and Mahmut Kandemir. Physically addressed queueing (PAQ): improving parallelism in solid state disks. In ACM SIGARCH Computer Architecture News, volume 40, pages 404–415. IEEE Computer Society, 2012.
[35] Swaroop Kavalanekar, Bruce Worthington, Qi Zhang, and Vishal Sharda. Characterization of storage workload traces from production windows servers. In IISWC, 2008.
[36] Byungseok Kim, Jaeho Kim, and Sam H Noh. Managing array of ssds when the storage device is no longer the performance bottleneck. In 9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17), 2017.
[37] Hyojun Kim, Nitin Agrawal, and Cristian Ungureanu. Revisiting storage for smartphones. ACM Transactions on Storage (TOS), 8(4):14, 2012.
[38] Nathan Kirsch. E12 high-performance SSD controller. http://tiny.cc/91zdcz, 2018.
[39] Sungjoon Koh, Junhyeok Jang, Changrim Lee, Miryeong Kwon, Jie Zhang, and Myoungsoo Jung. Faster than flash: An in-depth study of system challenges for emerging ultra-low latency ssds. arXiv preprint arXiv:1912.06998, 2019.
[40] Ricardo Koller et al. I/O deduplication: Utilizing content similarity to improve I/O performance. TOS, 2010.
[41] Linux. Mandatory file locking for the linux operating system. https://www.kernel.org/doc/Documentation/filesystems/mandatory-locking.txt, 2007.
[42] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. Wisckey: Separating keys from values in SSD-conscious storage. ACM Transactions on Storage (TOS), 13(1):5, 2017.
[43] Marvell. Marvell 88ss1093 flash memory controller. https://www.marvell.com/storage/assets/Marvell-88SS1093-0307-2017.pdf, 2017.
[44] Micron. Mt29f2g08aabwp/mt29f2g16aabwp NAND flash datasheet. 2004.
[45] Micron. Mt29f256g08cjaaa/mt29f256g08cjaab NAND flash datasheet. 2008.
[46] Micron. Mt29f1ht08emcbbj4-37:b/mt29f1ht08emhbbj4-3r:b NAND flash datasheet. 2016.
[47] Yongseok Oh, Eunjae Lee, Choulseung Hyun, Jongmoo Choi, Donghee Lee, and Sam H Noh. Enabling cost-effective flash based caching with an array of commodity ssds. In Proceedings of the 16th Annual Middleware Conference, pages 63–74. ACM, 2015.
[48] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: software-defined flash for web-scale internet storage systems. ACM SIGPLAN Notices, 49(4):471–484, 2014.
[49] Seon-yeong Park, Euiseong Seo, Ji-Yong Shin, Seungryoul Maeng, and Joonwon Lee. Exploiting internal parallelism of flash-based SSDs. IEEE Computer Architecture Letters, 9(1):9–12, 2010.
[50] Chris Ramseyer. Seagate SandForce SF3500 client SSD controller detailed. http://tiny.cc/f2zdcz, 2015.
[51] Tim Schiesser. Correction: PCIe 4.0 won't support up to 300 watts of slot power. http://tiny.cc/52zdcz, 2017.
[52] Windows SDK. Lockfileex function. https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-lockfileex, 2018.
[53] Hynix Semiconductor et al. Open NAND flash interface specification. Technical Report ONFI, 2006.
[54] Narges Shahidi, Mahmut T Kandemir, Mohammad Arjomand, Chita R Das, Myoungsoo Jung, and Anand Sivasubramaniam. Exploring the potentials of parallel garbage collection in ssds for enterprise storage systems. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 561–572. IEEE, 2016.
[55] Yakun Sophia Shao and David Brooks. Energy characterization and instruction-level energy model of Intel's Xeon Phi processor. In International Symposium on Low Power Electronics and Design (ISLPED), pages 389–394. IEEE, 2013.
[56] Mustafa M Shihab, Jie Zhang, Myoungsoo Jung, and Mahmut Kandemir. Revenand: A fast-drift-aware resilient 3d nand flash design. ACM Transactions on Architecture and Code Optimization (TACO), 15(2):1–26, 2018.
[57] Ji-Yong Shin, Zeng-Lin Xia, Ning-Yi Xu, Rui Gao, Xiong-Fei Cai, Seungryoul Maeng, and Feng-Hsiung Hsu. FTL design exploration in reconfigurable high-performance SSD for server applications. In Proceedings of the 23rd International Conference on Supercomputing, pages 338–349. ACM, 2009.
[58] S Shin and D Shin. Power analysis for flash memory SSD. Workshop for Operating System Support for Non-Volatile RAM (NVRAMOS 2010 Spring), Jeju, Korea, April 2010.
[59] Yong Ho Song, Sanghyuk Jung, Sang-Won Lee, and Jin-Soo Kim. Cosmos openSSD: A PCIe-based open source SSD platform. Proc. Flash Memory Summit, 2014.
[60] Wei Tan, Liana Fong, and Yanbin Liu. Effectiveness assessment of solid-state drive used in big data services. In Web Services (ICWS), 2014 IEEE International Conference on, pages 393–400. IEEE, 2014.
[61] Arash Tavakkol, Juan Gómez-Luna, Mohammad Sadrosadati, Saugata Ghose, and Onur Mutlu. MQSim: A framework for enabling realistic studies of modern multi-queue SSD devices. In 16th USENIX Conference on File and Storage Technologies (FAST 18), pages 49–66, 2018.
[62] Linus Torvalds. Linux kernel repo. https://github.com/torvalds/linux, 2017.
[63] Akshat Verma, Ricardo Koller, Luis Useche, and Raju Rangaswami. SRCMap: Energy proportional storage using dynamic consolidation. In FAST, volume 10, pages 267–280, 2010.
[64] Shunzhuo Wang, Fei Wu, Zhonghai Lu, You Zhou, Qin Xiong, Meng Zhang, and Changsheng Xie. Lifetime adaptive ecc in nand flash page management. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pages 1253–1556. IEEE, 2017.
[65] Qingsong Wei, Bozhao Gong, Suraj Pathak, Bharadwaj Veeravalli, LingFang Zeng, and Kanzo Okada. WAFTL: A workload adaptive flash translation layer with data partition. In Mass Storage Systems and Technologies (MSST), 2011 IEEE 27th Symposium on, pages 1–12. IEEE, 2011.
[66] Zev Weiss, Sriram Subramanian, Swaminathan Sundararaman, Nisha Talagala, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. ANViL: Advanced virtualization for modern non-volatile memory devices. In FAST, pages 111–118, 2015.
[67] Matt Welsh, David Culler, and Eric Brewer. SEDA: an architecture for well-conditioned, scalable internet services. In ACM SIGOPS Operating Systems Review, volume 35, pages 230–243. ACM, 2001.
[68] Norbert Werner, Guillermo Payá-Vayá, and Holger Blume. Case study: Using the xtensa lx4 configurable processor for hearing aid applications. In Proceedings of ICT.OPEN, 2013.
[69] ONFI Workgroup. Open NAND flash interface specification revision 3.0. ONFI Workgroup, Published Mar 15:288, 2011.
[70] Guanying Wu and Xubin He. Delta-FTL: improving SSD lifetime via exploiting content locality. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 253–266. ACM, 2012.
[71] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference, page 6. ACM, 2015.
[72] Jie Zhang, Gieseo Park, Mustafa M Shihab, David Donofrio, John Shalf, and Myoungsoo Jung. OpenNVM: An open-sourced fpga-based nvm controller for low level memory characterization. In 2015 33rd IEEE International Conference on Computer Design (ICCD), pages 666–673. IEEE, 2015.
[73] Jie Zhang, Mustafa Shihab, and Myoungsoo Jung. Power, energy, and thermal considerations in SSD-based I/O acceleration. In HotStorage, 2014.
[74] Yiying Zhang, Gokul Soundararajan, Mark W Storer, Lakshmi N Bairavasundaram, Sethuraman Subbiah, Andrea C Arpaci-Dusseau, and Remzi H Arpaci-Dusseau. Warming up storage-level caches with bonfire. In FAST, pages 59–72, 2013.
[75] Da Zheng, Randal Burns, and Alexander S Szalay. Toward millions of file system IOPS on low-cost, commodity hardware. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, page 69. ACM, 2013.
[76] You Zhou, Fei Wu, Ping Huang, Xubin He, Changsheng Xie, and Jian Zhou. An efficient page-level FTL to optimize address translation in flash memory. In Proceedings of the Tenth European Conference on Computer Systems, page 12. ACM, 2015.