contributed articles

DOI:10.1145/3286588

Programmable Solid-State Storage in Future Cloud Datacenters

Programmable software-defined solid-state drives can move computing functions closer to storage.

BY JAEYOUNG DO, SUDIPTA SENGUPTA, AND STEVEN SWANSON

THERE IS A major disconnect today in cloud datacenters concerning the speed of innovation between application/operating system (OS) and storage infrastructures. Application/OS software is patched with new/improved functionality every few weeks at "cloud speed," while storage devices are off-limits for such sustained innovation during their hardware life cycle of three to five years in datacenters. Since the software inside the storage device is written by storage vendors as proprietary firmware that is not open for general application developers to modify, developers are stuck with a device whose functionality and capabilities are frozen in time, even though many of them are modifiable in software. A period of five years is almost an eternity in the cloud computing industry, where new features, platforms, and application program interfaces (APIs) evolve every couple of months and application-demanded requirements on the storage system grow quickly over time. This notable lag in the adaptability and velocity of movement of the storage infrastructure may ultimately affect the ability to innovate throughout the cloud world.

In this article, we advocate creating a software-defined storage substrate of solid-state drives (SSDs) that are as programmable, agile, and flexible as the applications/OS accessing them from servers in cloud datacenters. A fully programmable storage substrate promises opportunities to better bridge the gap between application/OS needs and storage capabilities/limitations, while allowing application developers to innovate in-house at cloud speed.

The move toward software-defined control for IO devices and co-processors has played out before in the datacenter. Both GPUs and network interface cards (NICs) started as black-box devices that provide acceleration for CPU-intensive operations (such as graphics and packet processing). Internally, they implemented acceleration features with a combination of specialized hardware and proprietary firmware. As customers demanded greater flexibility, vendors slowly exposed programmability to the rest of the system, unleashing the vast processing power available from GPUs and a new level of agility in how systems can manage networks for enhanced functionality like more granular traffic management, security, and deep-network telemetry.

key insights
- A fully programmable storage substrate in cloud datacenters opens up new opportunities to innovate the storage infrastructure at cloud speed.
- In-storage programming is becoming increasingly easier with powerful processing capabilities and highly flexible development environments.
- New value propositions with the programmable storage substrate can be realized, such as customizing the storage interface, moving compute close to data, and performing secure computations.

54 COMMUNICATIONS OF THE ACM | JUNE 2019 | VOL. 62 | NO. 6

Storage is at the cusp of a similar transformation. Modern SSDs rely on sophisticated processing engines running complex firmware, and vendors already provide customized firmware builds for cloud operators. Exposing this programmability through easily accessible interfaces will let storage systems in cloud datacenters adapt to rapidly changing requirements on the fly.

Storage Trends
The amount of data being generated daily is growing exponentially, placing more and more processing demand on datacenters. According to a 2017 marketing-trend report from IBM,a 90% of the data in the world as of 2016 had been created in the two preceding years. Such large-scale datasets—which generally range from tens of terabytes to multiple petabytes—present challenges of extreme scale while demanding very fast and efficient data processing: a storage infrastructure that delivers high performance in terms of both throughput and latency is necessary. This trend has resulted in growing interest in the aggressive use of SSDs, which, compared with traditional spinning hard disk drives (HDDs), provide orders-of-magnitude lower latency and higher throughput. In addition to these performance benefits, the advent of new technologies (such as 3D NAND, enabling much denser chips, and quad-level-cell, or QLC, for bulk storage) allows SSDs to continue to scale significantly in capacity and to yield a huge reduction in price.

There are two key components in SSDs,4 as shown in Figure 1: an SSD controller and flash storage media. The controller, most commonly implemented as a system-on-a-chip (SoC), is designed to manage the underlying storage media. For example, SSDs built using NAND flash memory have unique characteristics in that data can be written only to an empty memory location—no in-place updates are allowed—and memory can endure only a limited number of writes before it can no longer be read. Therefore, the controller must perform background management tasks such as garbage collection, which reclaims flash blocks containing invalid data to create available space, and wear leveling, which evenly distributes writes across the flash blocks with the purpose of extending the SSD's life. These tasks are, in general, implemented by proprietary firmware running on one or more embedded processor cores in the controller.

a https://ibm.co/2XNvHPk
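The two background tasks just described can be sketched in a few lines of C++. This is a toy model for illustration only (the block size, the wear-leveling policy, and all names here are invented, not taken from any real firmware); real controllers interleave these tasks with host traffic under far more complex policies.

```cpp
#include <vector>

// Toy model of two controller background tasks: garbage collection
// (reclaiming blocks that hold invalid pages) and wear leveling
// (steering writes toward the least-worn blocks).
struct Block {
    int valid = 0;       // live pages
    int invalid = 0;     // stale pages awaiting erase
    int erase_count = 0; // wear accumulated so far
};

constexpr int kPagesPerBlock = 4; // illustrative, not a real geometry

// Wear leveling: among blocks with free pages, pick the least-worn one.
int pick_write_block(const std::vector<Block>& blocks) {
    int best = -1;
    for (int i = 0; i < (int)blocks.size(); ++i) {
        const Block& b = blocks[i];
        if (b.valid + b.invalid >= kPagesPerBlock) continue; // block is full
        if (best < 0 || b.erase_count < blocks[best].erase_count) best = i;
    }
    return best; // -1 means no writable block: GC must run first
}

// Garbage collection: erase the block with the most invalid pages and
// return how many valid pages would have to be copied elsewhere first
// (the source of write amplification).
int garbage_collect(std::vector<Block>& blocks) {
    int victim = 0;
    for (int i = 1; i < (int)blocks.size(); ++i)
        if (blocks[i].invalid > blocks[victim].invalid) victim = i;
    int copied = blocks[victim].valid;
    blocks[victim] = Block{0, 0, blocks[victim].erase_count + 1};
    return copied;
}
```

The point of the sketch is only the division of labor: one policy chooses where writes land, the other recycles space, and every valid page a victim block still holds is an extra device-internal write.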

Figure 1. Internal architecture of a modern flash SSD. (The drawing shows a host interface; a controller SoC containing an embedded processor, SRAM, a DRAM controller, and per-channel flash controllers; external DRAM; and flash storage media attached over multiple flash channels.)

In enterprise SSDs, SRAM is often used for executing the SSD firmware, and both user data and internal SSD metadata are cached in external DRAM.

Figure 2. Example conventional storage server architecture with multiple NVMe SSDs. (a) 16 lanes of PCIe = ~16 GB/s; (b) 64 SSDs x ~2 GB/s = ~128 GB/s, a throughput gap of 8x; (c) 64 SSDs x ~16 GB/s = ~1 TB/s, a throughput gap of 66x; (d) 32 channels x ~500 MB/s = ~16 GB/s.

Interestingly, SSDs generally have a far larger aggregate internal bandwidth than the bandwidth supported by host I/O interfaces (such as SAS and PCIe). Figure 2 outlines an example of a conventional storage system that leverages a plurality of NVM Express (NVMe)b SSDs; 64 of them are connected to 16 PCIe switches that are mounted to a host machine via 16 lanes of PCIe Gen3. While this storage architecture provides a commodity solution for a high-capacity storage server at low cost (compared to building a specialized server to directly attach all SSDs to the motherboard of the host), the maximum throughput is limited to the 16-lane PCIe interface speed (see Figure 2a), which is approximately 16GB/sec, regardless of the number of SSDs accessed in parallel. There is thus an 8x throughput gap between the host interface and the total aggregated SSD bandwidth, which could be up to roughly ~2GB/sec per SSDc x 64 SSDs = ~128GB/sec (see Figure 2b). More interestingly, this gap grows further if the internal SSD performance is considered. A modern enterprise-level SSD usually consists of 16 or 32 flash channels, as outlined in Figure 2. Since each flash channel can keep up with ~500MB/sec, internally each SSD can deliver up to ~500MB/sec per channel x 32 channels = ~16GB/sec (see Figure 2d), and the total aggregated in-SSD performance would be ~16GB/sec per SSD x 64 SSDs = ~1TB/sec (see Figure 2c), a 66x gap. Making SSDs programmable would thus allow systems to fully leverage this abundant bandwidth.

In-Storage Programming
Modern SSDs combine processing (the embedded processor) and storage components (SRAM, DRAM, and flash) to carry out the routine functions required for managing the SSD. These computing resources present interesting opportunities to run general user-defined programs. In 2013, Do et al.6,17 explored such opportunities for the first time in the context of running selected database operations inside a Samsung SAS flash SSD. They wrote simple selection and aggregation operators that were compiled into the SSD firmware and extended the execution framework of Microsoft SQL Server 2012 to develop a working prototype in which simple selection and aggregation queries could be run end-to-end.

That work demonstrated several-fold improvements in performance and energy efficiency by offloading database operations onto the SSD, and it highlighted a number of challenges that would need to be overcome to broadly adopt programmable SSDs.

First, the computing capabilities available inside the SSD are limited by design. The low-performance embedded processor inside the SSD, without L1/L2 caches, and the high latency to the in-SSD DRAM require extra-careful programming to run user code in the SSD without producing a performance bottleneck.

Moreover, the embedded software-development process is complex, which makes programming and debugging very challenging. To maximize performance, Do et al. had to carefully plan the layout of data structures used by the code running inside the SSD to avoid spilling out of the SRAM. Likewise, to debug programs running inside the SSD, they had to use a hardware-debugging tool far more primitive than the regular debugging tools (such as Microsoft Visual Studio) available to general application developers.

b A device interface for accessing non-volatile memory attached via a PCI Express (PCIe) bus.
c Practical sequential-read bandwidth of a commodity PCIe SSD.

Worse, the device-side processing code—selection and aggregation—had to be compiled into the SSD firmware in the prototype, meaning application developers would need to worry about not only the target application itself but also complex internal structures and algorithms in the SSD firmware.

On top of this, the consequences of an error can be quite severe, potentially resulting in corrupted data or an unusable drive. Workaday application programmers are unlikely to accept the additional complexity, and cloud providers are unlikely to let untrusted code run in such a fragile environment.

Application developers need a flexible and general programming model that allows easily running user code written in a high-level programming language (such as C/C++) inside an SSD. The programming model must also support the concurrent execution of multiple in-SSD applications while ensuring that malicious applications do not adversely affect the overall SSD operation or violate protection guarantees provided by the operating and file systems.

In 2014, Seshadri et al.20 proposed Willow, an SSD that made programmability a central feature of the SSD interface, allowing ordinary developers to safely augment and extend the SSD semantics with application-specific functions without compromising file system protections. In their model, host and in-SSD applications communicate via PCIe using a simple, generic—not storage-centric—remote procedure call (RPC) mechanism.

In 2016, Gu et al.7 explored a flow-based programming model in which an in-SSD application can be constructed from tasks and data pipes connecting the tasks. These programming models provide great flexibility in terms of programmability but are still far from "general purpose." There is a risk that existing large applications might still need significant redesigns to exploit each model's capabilities, requiring much time and effort.

Fortunately, winds of change can disrupt the industry and help application developers explore SSD programming in a better way, as illustrated in Figure 3. The processing capabilities available inside the SSD are increasingly powerful, with abundant compute and bandwidth resources. Emerging SSDs include software-programmable controllers with multi-core processors, built-in hardware accelerators to offload compute-intensive tasks from the processors, multiple GBs of DRAM, and tens of independent channels to the underlying storage media, allowing several GB/s of internal data throughput. Even more interesting and useful, programming SSDs is becoming easier, with the trend away from proprietary architectures and software runtimes and toward commodity operating systems (such as Linux) running on top of general-purpose processors (such as ARM and RISC-V). This trend enables general application developers to fully leverage existing tools, libraries, and expertise, allowing them to focus on their own core competencies rather than spending many hours getting used to the low-level, embedded development process. It also allows application developers to easily port large applications already running on host operating systems to the device with minimal code changes.

All in all, the programmability evolution in SSDs presents a unique opportunity to embrace SSDs as a first-class programmable platform in cloud datacenters, enabling software-hardware innovation inside the SSD. Moreover, going beyond the packaged SSD, because the two major components inside the SSD are each manufactured by multiple vendors,d it is conceivable that SSDs could be custom designed and provided in partnership with component vendorse (just like how today's datacenter servers are built and deployed), and that some of the designs could even be contributed back to the community (via forums like the Open Compute Project, https://www.opencompute.org). For example, the industry is already moving in this direction with the introduction of Open-Channel SSD technology,2,8,f which moves much of the SSD firmware functionality out of the black box and into the operating system or userspace, giving applications better control over the device. In an open source project called Denalig in 2018, Microsoft proposed a scheme that splits the monolithic components of an SSD into two different modules: one standardized part dealing with storage media, and a software interface to handle application-specific tasks (such as garbage collection and wear leveling). In this way, SSD suppliers can build simpler products for datacenters and deliver them to market more quickly, while per-application tuning remains possible for datacenter operators.

d Several vendors manufacture each type of component in flash SSDs. For example, flash controllers are manufactured by Marvell, PMC (acquired by Microsemi), SandForce (acquired by Seagate), and Indilinx (acquired by OCZ), and flash memory by vendors including Samsung and Micron.
e Many large-scale datacenter operators (such as Google19 and Baidu16) build their own SSDs that are fully optimized for their own application requirements.
f The Linux Open-Channel SSD subsystem was introduced in Linux kernel version 4.4.
g https://bit.ly/2GCuIum

Figure 3. Disruptive trends in the flash storage industry toward abundant resources and increased ease of programmability inside the SSD. (The chart contrasts today's SSD, with frugal in-SSD resources, an embedded CPU, and a proprietary firmware OS, against the programmable SSD, with abundant in-SSD resources (CPU cores, DRAM, flash capacity, and channels), a general-purpose CPU, and a server-like OS such as Linux.)

The component-based ecosystem also opens up entirely new opportunities for integrating powerful heterogeneous programming elements (such as field-programmable gate arrays, or FPGAs,13,21 and GPUsh) with storage media, and for combining flash with other emerging non-volatile memories (such as 3D XPoint, ReRAM, STT-RAM, and PCM) that provide persistent storage at DRAM-like latencies to deliver high performance gains. This approach would present the greatest flexibility to take advantage of advances in the underlying storage device to optimize performance for multiple cloud applications. In the near future, software-hardware innovation inside the SSD can proceed much like the PC, networking hardware, and GPU ecosystems have in the past. This is an opportunity to rethink datacenter architecture with efficient use of heterogeneous, energy-efficient hardware, which is the way forward for higher performance at lower power.

Value Propositions
Here, we summarize three value propositions that demonstrate future directions in programmable storage (see Figure 4):

Figure 4. Programmable SSD value propositions. (a) An agile, flexible storage interface leveraging programmability within the SSD; (b) moving compute inside the SSD to leverage low latency, high bandwidth, and access to data; (c) a trusted domain for secure computation (cleartext is not allowed to egress the SSD boundary).

Agile, flexible storage interface (see Figure 4a). Full programmability will allow the storage interface and feature set to evolve at cloud speed, without having to persuade standardization bodies to bless them or device manufacturers to implement them in the next-generation hardware roadmap, both usually involving years of delay. A richer, customizable storage interface will allow application developers to stay focused on their application, without having to work around storage constraints, quirks, or peculiarities, thus improving developer productivity.

As an example of the need for such an interface, consider how stream writes are handled in the SSD today. Because the SSD cannot differentiate between incoming data from multiple streams, it may pack data from different streams onto the same flash erase block, the smallest unit that can be erased from flash at once. When a portion of the stream data is deleted, it leaves blocks with holes of invalid data. To reclaim these blocks, the garbage-collection activity inside the SSD must copy around the valid data, slowing the device and increasing write amplification, thus reducing device lifetime.

If application developers had control over the software inside the SSD, they could handle streams much more efficiently. For instance, incoming writes could be tagged with stream IDs, and the device could use this information to fill a block with data from the same stream. When data from that stream is deleted, the entire data block could be reclaimed without copying around data. Such stream awareness has been shown to double device lifetime while significantly increasing read performance.14 At Microsoft, the need to support multiple streams in the SSD was identified in 2014, but NVMe incorporated the feature only in late 2017.i

h https://bit.ly/2L8LfM4
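The stream-aware placement just described can be sketched as follows. The block size and stream IDs are illustrative; the point is only that segregating streams lets a deleted stream's blocks be erased outright, with zero valid-page copying.

```cpp
#include <map>
#include <vector>

// Toy model of stream-aware write placement: writes tagged with a
// stream ID are packed into erase blocks that hold only that stream.
constexpr int kPagesPerEraseBlock = 4; // illustrative geometry

struct StreamAwareDevice {
    // stream id -> its erase blocks, each block a list of page tags
    std::map<int, std::vector<std::vector<int>>> blocks;

    void write(int stream_id, int page_tag) {
        auto& bs = blocks[stream_id];
        if (bs.empty() || (int)bs.back().size() == kPagesPerEraseBlock)
            bs.emplace_back(); // open a fresh erase block for this stream
        bs.back().push_back(page_tag);
    }

    // Deleting a stream reclaims whole blocks: no valid pages from other
    // streams share them, so the GC copy cost is zero.
    int delete_stream(int stream_id) {
        blocks.erase(stream_id);
        return 0; // pages that had to be copied
    }
};
```

Contrast this with the stream-oblivious case in the text, where a deleted stream leaves "holes" in blocks shared with live data, forcing garbage collection to relocate the survivors.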

Moreover, large-scale deployment in Microsoft datacenters might take at least another year and be very expensive, since new SSDs must be purchased to essentially get a new version of the firmware. Waiting five years for a change to a system software component is completely out of step with how quickly computer systems are evolving today. A programmable storage platform would reduce this delay to months and allow rapid iteration and refinement of the feature, not to mention the ability to "tweak" the implementation to match specific use cases.

Moving compute close to data (see Figure 4b). The need to analyze and glean intelligence from big data imposes a shift from the traditional compute-centric model to a data-centric model. In many big data scenarios, application performance and responsiveness (demanded by interactive usage) are dominated not by the execution of arithmetic and logic instructions but by the requirement to handle huge volumes of data and the cost of moving this data to the location(s) where compute is performed. When this is the case, moving the compute closer to the data can reap huge benefits in terms of increased throughput, lower latency, and reduced energy usage. Big data analytics running inside an SSD can access the stored data with tens of GB/sec of bandwidth (rivaling DRAM bandwidth) and with latency comparable to accessing raw non-volatile memory. In addition, large energy savings can be achieved because processors inside the SSD are more energy efficient than the host-server CPU (such as a Xeon), and data does not need to be hauled over large distances from storage all the way up to the host via the network, which costs more energy than processing it.

Processors inside the SSD are clearly not as powerful as host processors, but together with in-storage hardware offload engines, a broad range of data processing tasks can be performed competitively inside the SSD. As an example, consider how data analytic queries are generally processed: When an analytic query is given, the compressed data required to answer the query is first loaded to the host, uncompressed, and then processed using host resources. This fundamental data-analytics primitive can instead be executed inside the SSD by accessing data with high internal bandwidth and offloading decompression to a dedicated engine. Subsequent stages of the query-processing pipeline (such as filtering out unnecessary data and performing aggregation) can also run inside the SSD, greatly reducing network traffic and saving host CPU/memory resources for other important jobs. Further, performance and bandwidth can be scaled together by adding more SSDs to the system if the application requires higher data rates.

Secure computation in the cloud (see Figure 4c). Recent security breaches involving personal, private information (financial and otherwise) have exposed the vulnerability of data infrastructures to hackers and attackers. Also, a new type of malicious software called "encryption ransomware" attacks machines by stealthily encrypting data files and demanding a ransom to restore access to these files.10 Security is often among the topmost concerns enterprise chief information officers have when they move to the cloud, as cloud providers are unwilling to take on full liability for the impact of such breaches. Development of a secure cloud is not just a feature requirement but an absolute foundational capability necessary for the future of the cloud computing model and its business success as an industry.

To realize the vision of a trusted cloud, data must be encrypted while stored at rest, which, however, limits the kinds of computation that can be performed on the encrypted data without decryption. To facilitate arbitrary (legitimate) computation on stored data, the data needs to be decrypted before computing on it. This requires decrypted cleartext data to be present (at least temporarily) in various portions of the datacenter infrastructure that are vulnerable to security attacks. Application developers need a way to facilitate secure computation in the cloud by fencing in well-defined, narrow, trusted domains that preserve the ability to perform arbitrary computation on the data.

Figure 5. A prototype programmable SSD developed for research purposes. (a) Device with a storage board with an embedded storage controller and DIMM slots for flash or other forms of NVM; (b) device with a storage board into which M.2 SSDs can be plugged.

i Note the multi-stream technology for SCSI/SAS was standardized in T10 on May 20, 2015.
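The trusted-domain idea can be illustrated with a toy model in which the key never leaves the device and only derived results cross its boundary. XOR stands in for a real cipher purely for brevity (an actual device would use its hardware cryptographic engine and secure key storage); nothing here is a real security design, and all names are invented.

```cpp
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Toy trusted domain: data rests encrypted, the key lives only inside
// the device, and only a derived result (never raw cleartext) egresses.
class SecureSsdModel {
    uint8_t key_;   // never leaves the device boundary
    Bytes at_rest_; // ciphertext as stored on flash
public:
    explicit SecureSsdModel(uint8_t key) : key_(key) {}

    void store(const Bytes& cleartext) {
        at_rest_.clear();
        for (uint8_t b : cleartext)
            at_rest_.push_back(b ^ key_); // XOR as a stand-in cipher
    }

    // What the host can read back directly: ciphertext only.
    const Bytes& raw() const { return at_rest_; }

    // In-device computation: decrypt transiently, return only the sum.
    unsigned sum() const {
        unsigned s = 0;
        for (uint8_t b : at_rest_) s += (b ^ key_);
        return s;
    }
};
```

The shape of the interface is the point: `store` and `raw` traffic in ciphertext, while `sum` is the narrow, device-side computation that is allowed to see cleartext.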

SSDs, with their powerful compute capabilities, can form a trusted domain for doing secure computation on encrypted data, leveraging their internal hardware cryptographic engines and secure boot mechanisms for this purpose. Cryptographic keys can be stored inside the SSD, allowing arbitrary compute to be carried out on the stored data (after decryption if needed) while enforcing that data cannot leave the device in cleartext form. This allows a new, flexible, easily programmable, near-data realization of trusted hardware in the cloud. Compared to currently proposed solutions like Intel enclaves,j which are protected, isolated areas of execution in the host server's memory, this solution protects orders of magnitude more data.

Programmable SSDs
While the concept of in-storage processing on SSDs was proposed more than six years ago,6 experimenting with SSD programming has been limited by the availability of real hardware on which a prototype can be built to demonstrate what is possible. The recent emergence of prototyping boards available for both research and commercial purposes has opened new opportunities for application developers to take ideas from conception to action.

Figure 5 shows one such prototype device, called the Dragon Fire Card (DFC),k,3,5 designed and manufactured by Dell EMC and NXP for research. The card is powerful and flexible, with enterprise-level capabilities and resources. It comprises a main board and a storage board. The main board contains an ARMv8 processor, 16GB of RAM, and various on-chip hardware accelerators (such as 20Gbps compression/decompression, 20Gbps SEC-crypto, and 10Gbps RegEx engines). It also provides NVMe connectivity via four PCIe Gen3 lanes, as well as 4x10Gbps Ethernet supporting the remote direct memory access (RDMA) over Converged Ethernet (RoCE) protocol. It supports two different storage boards that connect via 2x4 PCIe Gen3 lanes: one type of board (see Figure 5a) includes an embedded storage controller and four memory slots where flash or other forms of NVM can be installed; the second (see Figure 5b) is an adapter that hosts two M.2 SSDs.

The ARM SoC inside the board runs a full-fledged Ubuntu Linux, so programming the board is very similar to programming any other Linux device. For instance, software can leverage Linux container technology (such as Docker) to provide isolated environments inside the board. To create applications running on the board, a software development kit (SDK) is provided, containing GNU tools to build applications for ARM as well as user/kernel-mode libraries for using the on-chip hardware accelerators, allowing a high level of programmability. The DFC can also serve as a block device, just like regular SSDs; for this purpose, the device is shipped with a flash translation layer (FTL) that runs on the main board.

The SSD industry is also moving toward bringing compute to SSDs so data can be processed without leaving the place where it is originally stored. For instance, in 2017 NGD Systemsl announced an SSD called Catalina2,1 capable of running applications directly on the device. Catalina2 uses TLC 3D NAND flash (up to 24TB), connected to an onboard ARM SoC that runs embedded Linux along with modules for error-correcting code (ECC) and the FTL. On the host server, a tunnel agent (with C/C++ libraries) talks to the device through the NVMe protocol. As another example, ScaleFluxm uses a Xilinx FPGA (combined with terabytes of TLC 3D NAND flash) to process data for data-intensive applications. The host server runs a software module that provides API access to the device while being responsible for the FTL and flash-management functionality.

Academia and industry are working to establish a compelling value proposition by demonstrating application scenarios for each of the three pillars outlined in Figure 4. Among them, we are initially focused on exploring the benefits and challenges of moving compute closer to storage (see Figure 4b) in the context of big data analytics: examining large amounts of data to uncover hidden patterns and insights.

Big data analytics within a programmable SSD. To demonstrate our approach, we implemented a C++ reader that runs on a DFC card (see Figure 5) for Apache Optimized Row Columnar (ORC) files. The ORC file format is designed for fast processing and high storage efficiency in big data analytic workloads and has been widely adopted in the open source community and industry. The reader running inside the SSD reads large chunks of ORC streams, decompresses them, and then evaluates query predicates to find only the necessary values. Due to the server-like development environment (Ubuntu and a general-purpose ARM processor), we easily ported a reference implementation of the ORC readern to the ARM SoC environment with only a few lines of code changes and incorporated library APIs into the reader, enabling it to read data from flash and offload the decompression work to the ARM SoC hardware accelerator.

j https://software.intel.com/en-us/sgx
k https://github.com/DFC-OpenSource
l http://www.ngdsystems.com
m http://www.scaleflux.com
n https://github.com/apache/orc

60 COMMUNICATIONS OF THE ACM | JUNE 2019 | VOL. 62 | NO. 6

contributed articles

ed library APIs into the reader, enabling reading data from flash and offloading the decompression work to the ARM SoC hardware accelerator.

Figure 6 shows preliminary bandwidth results of scanning a ZLIB-compressed, single-column integer dataset (one billion rows) through the C++ ORC reader running on a host x86 server vs. inside the DFC card, respectively.^o As shown in the figure, we achieved approximately 5x faster scan performance inside the device compared to running on the host server. Given that this is single-device performance, we should be able to achieve much better improvements by increasing the number of programmable SSDs used in parallel.

In addition to scanning, filtering, and aggregating large volumes of data at high-throughput rates, offloading part of the computation directly to the storage has been explored as well. In 2016, Jo et al.12 built a prototype that performs very early filtering of data through a combination of an ARM core and a hardware pattern-matching engine available inside a programmable SSD equipped with the flow-based programming model described by Gu et al.7 When a query is given, the query planner determines whether early filtering is beneficial for the query and chooses a candidate table as the target if the estimated filtering ratio is sufficiently high. Early filtering is then performed against the target table inside the device, and only filtered data is fetched to the host for residual computation. This early filtering inside the device turns out to be highly effective for analytic queries; when running all 22 TPC-H queries on a MariaDB server with the programmable device prototyped on a commodity NVMe SSD, Jo et al.12 achieved a 3.6x speedup compared to a system with the same SSD without the programmability.

Alternatively, an FPGA-based prototype design for near-data processing inside a storage node for database engines was studied by István et al.11 in 2017. In this prototype, each storage node, accessed over the network through a simple key-value store interface, provided fault tolerance through replication and application-specific processing (such as predicate evaluation, substring matching, and decompression) at line rate.

Datacenter Realization
Each application running in cloud datacenters has its own unique requirements, making it difficult to design server nodes with the proper balance of compute, memory, and storage. To cope with such complexity, an approach of physically decoupling resources was proposed by Han et al.9 in 2013 to allow replacing, upgrading, or adding individual resources instead of the entire node. With the availability of fast interconnect technologies (such as InfiniBand, RDMA, and RoCE), it is already common in today's large-scale cloud datacenters to disaggregate storage from compute, significantly reducing the total cost of ownership and improving the efficiency of storage utilization. However, storage disaggregation is a challenge15 as storage-media access latencies are heading toward the single-digit-microsecond level,^p compared to a disk's millisecond latency, which is much larger than the fast network overhead. It is likely that, in the next few years, the network latency will become a bottleneck as new, emerging non-volatile memories with extremely low latencies become available.

This challenge of storage disaggregation can be overcome by using programmable storage, enabling a fully programmable storage substrate that is decoupled from the host substrate, as outlined in Figure 7. This view of storage as a programmable substrate allows application developers not only to leverage very low storage-medium access latency by running programs inside the storage device but also to access any remote storage device without involving the remote host server where the device is physically attached (see Figure 7) by using NVMe over Fabrics (NVMe-oF)^q with RDMA.

^o Note, to effectively compare data-processing capability in each case (Intel Xeon in x86 vs. ARM + decompression accelerator in the device), only a single core for each processor was used.
^p For example, the access latency of 3D XPoint can be 5~10 µsec, while an NVMe SSD and a disk take ~50–100 µsec and 10 msec, respectively.8
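The disaggregation tension described above can be made concrete with a small back-of-the-envelope calculation. The sketch below is ours, not from the article; the device latencies follow the figures cited in the footnote, while the network round-trip number is an assumed ballpark for a fast RDMA-capable fabric.

```python
# Illustrative latency figures in microseconds. Device numbers follow the
# footnote above (3D XPoint ~5-10 us, NVMe SSD ~50-100 us, disk ~10 ms);
# the fabric round trip is an assumed value for an RDMA-class network.
MEDIA_LATENCY_US = {"disk": 10_000, "nvme_ssd": 75, "3d_xpoint": 7.5}
NETWORK_RTT_US = 20  # assumed round-trip overhead of the datacenter fabric

def network_share(medium: str) -> float:
    """Fraction of a remote access spent on the network, not the medium."""
    total = MEDIA_LATENCY_US[medium] + NETWORK_RTT_US
    return NETWORK_RTT_US / total

for m in MEDIA_LATENCY_US:
    print(f"{m:>9}: network is {network_share(m):5.1%} of remote access latency")
```

Under these assumptions the network is a rounding error for disk (about 0.2% of the access), already noticeable for NVMe flash (about 21%), and dominant for a 3D XPoint-class medium (over 70%), which is exactly why emerging low-latency memories turn the network into the bottleneck.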

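The early-filtering planner described for Jo et al.'s prototype can be paraphrased in a few lines. This is a minimal sketch of the decision logic, not their implementation; the class, the threshold value, and the returned labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    rows: int
    est_filter_ratio: float  # estimated fraction of rows the predicate discards

# Hypothetical threshold: push the filter into the device only if it removes
# enough data to pay for programming the in-device pattern matcher.
PUSHDOWN_THRESHOLD = 0.9

def plan_scan(stats: TableStats) -> str:
    """Decide where the predicate runs, mimicking the early-filtering planner."""
    if stats.est_filter_ratio >= PUSHDOWN_THRESHOLD:
        return "device"  # filter inside the SSD; host does residual work only
    return "host"        # filtering ratio too low; stream everything to the host

print(plan_scan(TableStats("lineitem", 6_000_000, 0.98)))  # -> device
print(plan_scan(TableStats("nation", 25, 0.10)))           # -> host
```

The point of the threshold is that in-device filtering only wins when most rows never need to cross the device boundary; a predicate that keeps nearly everything is better evaluated on the host.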

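To give a feel for the kind of interface István et al.'s storage nodes expose, here is a toy key-value store with node-side predicate evaluation. The class and method names are ours, and a real node would execute the filtering in FPGA hardware at line rate rather than in software.

```python
from typing import Callable, Dict, Iterator, Tuple

class StorageNode:
    """Toy stand-in for a near-data key-value storage node (hypothetical API)."""

    def __init__(self) -> None:
        self._store: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: bytes) -> bytes:
        return self._store[key]

    def scan(self, predicate: Callable[[bytes], bool]) -> Iterator[Tuple[bytes, bytes]]:
        # In a Caribou-like design this filtering runs inside the node,
        # so only matching values ever cross the network.
        return ((k, v) for k, v in self._store.items() if predicate(v))

node = StorageNode()
node.put(b"row1", b"status=shipped")
node.put(b"row2", b"status=pending")
matches = list(node.scan(lambda v: b"shipped" in v))  # substring-matching offload
print(matches)  # -> [(b'row1', b'status=shipped')]
```

The design point is that the client ships a predicate rather than pulling all values, which is what lets the node apply substring matching and decompression before data reaches the network.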
Figure 7. Enabling a programmable storage substrate decoupled from the host substrate. Traffic flows directly between programmable storage devices (with a network interface) without involving a remote host.

With the programmable storage substrate, we can think of going beyond the single-device block interface. For example, a micro server inside storage can expose a richer interface like a distributed key-value store or distributed streams. Or the storage infrastructure can be managed as a fabric, not as individual devices. The programmable storage substrate can also provide high-level datacenter capabilities (such as backup, data snapshot, replication, de-duplication, and tiering), which are typically supported in a datacenter server environment where compute and storage are separated. This means the programmable storage substrate can be viewed as a hyper-converged infrastructure where storage, networking, and compute are tightly coupled for low-latency, high-throughput access, while still providing availability.

Conclusion
In this article, we have presented our vision of a fully programmable storage substrate in cloud datacenters, allowing application developers to innovate the storage infrastructure at cloud speed, like the software application/OS infrastructure. The programmability evolution in SSDs provides opportunities for embracing them as a first-class programmable platform in cloud datacenters, enabling software-hardware innovation that could bridge the gap between application/OS needs and storage capabilities/limitations. We hope to shed light on the future of software-defined storage and help chart a direction for designing, building, deploying, and leveraging a software-defined storage architecture for cloud datacenters.

Acknowledgments
This work was supported in part by National Science Foundation Award 1629395.

^q A technology specification designed for non-volatile memories to transfer data between a host and a target system/device over a network. Approximately 90% of the NVMe-oF protocol is the same as the NVMe protocol.

References
1. Alves, V. In-situ processing. Flash Memory Summit (Santa Clara, CA, Aug. 8–10), 2017.
2. Bjørling, M., González, J., and Bonnet, P. LightNVM: The Linux open-channel SSD subsystem. In Proceedings of the 15th USENIX Conference on File and Storage Technologies (Santa Clara, CA, Feb. 27–Mar. 2). USENIX Association, Berkeley, CA, 2017, 359–374.
3. Bonnet, P. What's up with the storage hierarchy? In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (Chaminade, CA, Jan. 8–11), 2017.
4. Cornwell, M. Anatomy of a solid-state drive. Commun. ACM 55, 12 (Dec. 2012), 59–63.
5. Do, J. SoftFlash: Programmable storage in future data centers. In Proceedings of the 20th SNIA Storage Developer Conference (Santa Clara, CA, Sep. 11–14), 2017.
6. Do, J., Kee, Y.-S., Patel, J.M., Park, C., Park, K., and DeWitt, D.J. Query processing on smart SSDs: Opportunities and challenges. In Proceedings of the ACM SIGMOD International Conference on Management of Data (New York, NY, Jun. 22–27). ACM Press, New York, 2013, 1221–1230.
7. Gu, B., Yoon, A.S., Bae, D.-H., Jo, I., Lee, J., Yoon, J., Kang, J.-U., Kwon, M., Yoon, C., Cho, S., et al. Biscuit: A framework for near data processing of big data workloads. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (Seoul, S. Korea, Jun. 18–22). IEEE, 2016, 153–165.
8. Hady, F. Wicked fast storage and beyond. In Proceedings of the 7th Non-Volatile Memory Workshop (San Diego, CA, Mar. 6–8). Keynote, 2016.
9. Han, S., Egi, N., Panda, A., Ratnasamy, S., Shi, G., and Shenker, S. Network support for resource disaggregation in next-generation datacenters. In Proceedings of the 12th ACM Workshop on Hot Topics in Networks (College Park, MD, Nov. 21–22). ACM Press, New York, 2013, 10.
10. Huang, J., Xu, J., Xing, X., Liu, P., and Qureshi, M.K. FlashGuard: Leveraging intrinsic flash properties to defend against encryption ransomware. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (Dallas, TX, Oct. 30–Nov. 3). ACM Press, New York, 2017, 2231–2244.
11. István, Z., Sidler, D., and Alonso, G. Caribou: Intelligent distributed storage. In Proceedings of the VLDB Endowment 10, 11 (Aug. 2017), 1202–1213.
12. Jo, I., Bae, D.-H., Yoon, A.S., Kang, J.-U., Cho, S., Lee, D.D., and Jeong, J. YourSQL: A high-performance database system leveraging in-storage computing. In Proceedings of the VLDB Endowment 9, 12 (Aug. 2016), 924–935.
13. Jun, S.-W., Liu, M., Lee, S., Hicks, J., Ankcorn, J., King, M., Xu, S., et al. BlueDBM: An appliance for big data analytics. In Proceedings of the ACM/IEEE 42nd Annual International Symposium on Computer Architecture (Portland, OR, Jun. 13–17). IEEE, 2015, 1–13.
14. Kang, J.-U., Hyun, J., Maeng, H., and Cho, S. The multi-streamed solid-state drive. In Proceedings of the 6th USENIX Workshop on Hot Topics in Storage and File Systems (Philadelphia, PA, Jun. 17–18). USENIX Association, Berkeley, CA, 2014.
15. Klimovic, A., Kozyrakis, C., Thereska, E., John, B., and Kumar, S. Flash storage disaggregation. In Proceedings of the 11th European Conference on Computer Systems (London, U.K., Apr. 18–21). ACM Press, New York, 2016, 29.
16. Ouyang, J., Lin, S., Jiang, S., Hou, Z., Wang, Y., and Wang, Y. SDF: Software-defined flash for web-scale Internet storage systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (Salt Lake City, UT, Mar. 1–5). ACM Press, New York, 2014, 471–484.
17. Park, K., Kee, Y.-S., Patel, J.M., Do, J., Park, C., and DeWitt, D.J. Query processing on smart SSDs. IEEE Data Engineering Bulletin 37, 2 (Jun. 2014), 19–26.
18. Picoli, I.L., Pasco, C.V., Jónsson, B.Þ., Bouganim, L., and Bonnet, P. uFLIP-OC: Understanding flash I/O patterns on open-channel solid state drives. In Proceedings of the 8th Asia-Pacific Workshop on Systems (Mumbai, India, Sep. 2–3). ACM Press, New York, 2017, 20.
19. Schroeder, B., Lagisetty, R., and Merchant, A. Flash reliability in production: The expected and the unexpected. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (Santa Clara, CA, Feb. 22–25). USENIX Association, Berkeley, CA, 2016, 67–80.
20. Seshadri, S., Gahagan, M., Bhaskaran, S., Bunker, T., De, A., Jin, Y., Liu, Y., and Swanson, S. Willow: A user-programmable SSD. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (Broomfield, CO, Oct. 6–8). USENIX Association, Berkeley, CA, 2014, 67–80.
21. Woods, L., István, Z., and Alonso, G. Ibex: An intelligent storage engine with support for advanced SQL offloading. In Proceedings of the VLDB Endowment 7, 11 (Jul. 2014), 963–974.

Jaeyoung Do ([email protected]) is a researcher at Microsoft Research, Redmond, WA, USA. He is leading a project, SoftFlash, which aims to use programmable SSDs in cloud datacenters.

Sudipta Sengupta ([email protected]) is leading new initiatives in artificial intelligence/deep learning at Amazon AWS, Seattle, WA, USA; the research reported in this article was done while he was at Microsoft Research, Redmond, WA, USA.

Steven Swanson ([email protected]) is a professor in the Department of Computer Science and Engineering at the University of California, San Diego, USA.

Copyright held by authors/owners. Publication rights licensed to ACM. $15.00.
