Towards an Arrow-native Storage System

Jayjeet Chakraborty, UC Santa Cruz, Santa Cruz, California, USA, [email protected]
Ivo Jimenez, UC Santa Cruz, Santa Cruz, California, USA, [email protected]
Sebastiaan Alvarez Rodriguez, Leiden University, Leiden, Netherlands, [email protected]
Alexandru Uta, Leiden University, Leiden, Netherlands, [email protected]
Jeff LeFevre, UC Santa Cruz, Santa Cruz, California, USA, [email protected]
Carlos Maltzahn, UC Santa Cruz, Santa Cruz, California, USA, [email protected]

ABSTRACT

With the ever-increasing dataset sizes, several file formats like Parquet, ORC, and Avro have been developed to store data efficiently and to save network and interconnect bandwidth at the price of additional CPU utilization. However, with the advent of networks supporting 25-100 Gb/s and storage devices delivering 1,000,000 reqs/sec, the CPU has become the bottleneck, trying to keep up feeding data in and out of these fast devices. The result is that data access libraries executed on single clients are often CPU-bound and cannot utilize the scale-out benefits of distributed storage systems. One attractive solution to this problem is to offload data-reducing processing and filtering tasks to the storage layer. However, modifying legacy storage systems to support compute offloading is often tedious and requires extensive understanding of the internals. Previous approaches re-implemented functionality of data processing frameworks and access libraries for a particular storage system, a duplication of effort that might have to be repeated for different storage systems.

In this paper, we introduce a new design paradigm that allows extending programmable object storage systems to embed existing, widely used data processing frameworks and access libraries into the storage layer with minimal modifications. In this approach, data processing frameworks and access libraries can evolve independently from storage systems while leveraging the scale-out and availability properties of distributed storage systems. We present one example implementation of our design paradigm using Ceph, Apache Arrow, and Parquet. We provide a brief performance evaluation of our implementation and discuss key results.

Figure 1: High-Level Design and Architecture.

ACM Reference Format:
Jayjeet Chakraborty, Ivo Jimenez, Sebastiaan Alvarez Rodriguez, Alexandru Uta, Jeff LeFevre, and Carlos Maltzahn. 2021. Towards an Arrow-native Storage System. In Proceedings of The 13th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage'21). ACM, New York, NY, USA, 7 pages. https://doi.org/10.475/1234.5678

1 INTRODUCTION

Over the last decade, a variety of distributed data processing frameworks like Spark [39] and Hadoop [33] have come into existence. These frameworks were built to efficiently query vast quantities of semi-structured data and get insights quickly. Unlike standard relational database management systems (RDBMS) such as MySQL [2], which are optimized to manage both data storage and processing, these systems were designed to read data from a wide variety of data sources, including those in the cloud like S3 [4]. These systems depend on different file formats like Parquet [1], Avro [5], and ORC [6] for efficiently storing and accessing data. Since storage devices have been the primary bottleneck for data processing systems for a long time, the main focus of these file formats has been to store data efficiently in a binary format and reduce the amount of disk I/O required to fetch the data. However, with recent advancements in storage devices such as NVMe [37] drives and network devices such as InfiniBand [35], the bottleneck has shifted from the storage devices to the client machine's CPUs, rendering the notion of "a fast CPU and slow disk" invalid, as shown by Trivedi et al. [30]. The serialized and compressed nature of these file formats makes reading them CPU-bound in systems with high-speed network and storage devices, resulting in severely reduced scalability.

An attractive solution to this problem is to offload computation to the storage layer to achieve scalability, faster queries, and less network traffic. Several popular distributed data processing systems have explored this approach, e.g., IBM Netezza [27], Oracle Exadata [24], Redshift [16], and PolarDB [11]. Most of these systems are built following a clean-slate approach and use specialized and costly hardware, such as Smart SSDs [14] and FPGAs [34], for table scanning. Building systems like these requires in-depth understanding and expertise in building database systems. Also, modifying existing systems like MariaDB [25], as in the case of YourSQL [19], requires modifying code that has been hardened over the years, which may result in performance, security, and reliability issues. A possible way to mitigate these issues is to have programmable storage systems with low-level extension mechanisms that allow implementing application-specific data manipulation and access in their I/O path. Customizing storage systems via plugins results in minimal implementation overhead and increases the maintainability of the software.

Programmable object-storage systems like Ceph [31], Swift [22], and DAOS [20] often provide a POSIX filesystem interface for reading and writing files, which is mostly built on top of direct object access libraries like "librados" in Ceph and "libdaos" in DAOS. Being programmable, these systems provide plugin-based extension mechanisms that allow direct access and manipulation of objects within the storage layer. We leverage these features of programmable storage systems and develop a new design paradigm that allows the embedding of widely-used data access libraries inside the storage layer. As shown in Figure 1, the extensions on the client and storage layers allow an application to execute access library operations either on the client or via the direct object access layer, in the storage server.

We implement one instantiation of our design paradigm using RADOS [32] as the object-storage backend, CephFS [9] as the POSIX layer, Apache Arrow [7, 29] as the data access library, and Parquet as the file format. We evaluate the performance of our implementation by varying parameters such as cluster size and parallelism and measure metrics like query latency and CPU utilization. The evaluations show that our implementation scales almost linearly by offloading CPU usage for common data processing tasks to the storage layer, freeing the client for other processing tasks.

In summary, our primary contributions are as follows:

• A design paradigm that allows extending programmable storage systems with the ability to offload CPU-bound tasks like dataset scanning to the storage layer using plugin-based extension mechanisms and widely-used data access libraries, while keeping the implementation overhead minimal.
• A brief analysis of the performance gained by offloading Parquet dataset scanning to the storage nodes. We demonstrate that offloading dataset scanning operations to the storage layer results in faster queries and near-linear scalability.

2 DESIGN AND IMPLEMENTATION

In this section, we discuss our design paradigm and the motivation behind it, and provide an in-depth discussion of the internals of our implementation. Additionally, we discuss two alternate file layout choices for efficiently storing and querying Parquet files in Ceph.

2.1 Design Paradigm

One of the most important aspects of our design is that it allows building in-storage data processing systems with minimal implementation effort. Our design allows extending the client and storage layers with widely-used data access libraries while requiring minimal modifications. We achieve this by (1) creating a file system shim in the object storage layer so that access libraries embedded in the storage layer can continue to operate on files, (2) mapping client requests-to-be-offloaded directly to objects using file system striping metadata, and (3) mapping files to logically self-contained fragments by using standard file system striping. As shown in Figure 2, we developed one instantiation of our design paradigm using Ceph as the storage system, Apache Arrow as the data access library, and Parquet as the file format. We expose our implementation via the Arrow Dataset API by creating a new file format called RADOS Parquet that inherits from the Parquet file format in Arrow. Next, we discuss the internals of our implementation in detail.
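To make mechanism (2) concrete, the following is a minimal client-side Python sketch of how a CephFS file name could be mapped to the RADOS object names that back it: CephFS exposes striping information through virtual extended attributes and names data objects by inode number and stripe index. The xattr name and the object-naming convention used here are our reading of CephFS behavior and should be treated as assumptions, as is the simplifying premise that each object holds exactly one stripe unit (a stripe count of 1); the actual translation in our implementation is performed by the C++ DirectObjectAccess API described later.

```python
import os

def cephfs_file_to_object_ids(path):
    """Map a CephFS file to the RADOS object names that store its data
    (illustrative sketch; assumes a simple layout where object i holds
    bytes [i * object_size, (i + 1) * object_size))."""
    st = os.stat(path)
    inode = st.st_ino
    # CephFS exposes striping metadata as virtual xattrs on the file.
    object_size = int(os.getxattr(path, "ceph.file.layout.object_size"))
    num_objects = max(1, -(-st.st_size // object_size))  # ceiling division
    # Data objects are conventionally named "<inode in hex>.<index as 8 hex digits>".
    return [f"{inode:x}.{i:08x}" for i in range(num_objects)]

# Each returned object ID is a self-contained fragment on which an
# offloaded scan could then be invoked.
print(cephfs_file_to_object_ids("/mnt/cephfs/nyc-taxi/part-0.parquet"))  # hypothetical path
```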




Figure 2: Architecture of the implementation that extends CephFS to allow invoking methods for processing Parquet files inside Ceph OSDs using Arrow libraries.

2.2 An Example Implementation

Extending Ceph Object Store. Inside the storage layer, using the Ceph Object Class SDK [12], we create a "scan_op" Object Class method that reads objects containing Parquet binary data from RADOS, scans them using Apache Arrow, and returns the results in the form of an Arrow table. Reading Parquet files requires a random access interface to seek around in the file and read data at particular offsets. Since the Object Class SDK only provides primitives for doing I/O at a particular offset within an object, we utilize these primitives to create a RandomAccessObject interface that provides a file-like view of an object. Arrow provides a FileFragment API that wraps a file and allows scanning it. It takes predicates and projections as input and applies them to Parquet files to return filtered and projected Arrow tables. Since the RandomAccessObject interface allows interacting with objects as files, it plugs into the FileFragment API seamlessly.

Extending Ceph Filesystem. The Ceph filesystem provides a POSIX interface and stripes files over objects stored in the RADOS layer. To execute the "scan_op" Object Class method on a Parquet file, CephFS first needs to map the file to object IDs. CephFS provides access to metadata information on how files in CephFS are mapped to objects in RADOS. We leverage this information to derive object IDs from filenames. We implement a DirectObjectAccess API that facilitates this translation and allows calling Object Class methods like "scan_op" on files. With this API, we obtain the ability to interact with RADOS-stored objects and manipulate them directly in application-specific ways, while also having a filesystem view over the objects.
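To make the RandomAccessObject idea concrete, the following is a minimal client-side Python sketch that wraps a single RADOS object in a file-like, seekable interface that PyArrow can read from. It assumes the python-rados bindings, and the pool name and object ID are placeholders; the real interface is implemented in C++ and runs inside the OSD via the Object Class SDK, so this sketch only mirrors the concept.

```python
import io
import rados
import pyarrow.parquet as pq

class RandomAccessObject(io.RawIOBase):
    """File-like, seekable view over a single RADOS object (illustrative sketch)."""

    def __init__(self, ioctx, oid):
        self.ioctx = ioctx
        self.oid = oid
        self.size, _ = ioctx.stat(oid)  # (size, mtime)
        self.pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:  # io.SEEK_END
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        data = self.ioctx.read(self.oid, n, self.pos)
        self.pos += len(data)
        return data

# Usage sketch: scan a Parquet-bearing object directly, bypassing the POSIX path.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("cephfs_data")               # assumed data pool name
obj = RandomAccessObject(ioctx, "10000000001.00000000")  # assumed object name
table = pq.ParquetFile(obj).read()  # Arrow table, analogous to what "scan_op" returns
```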
Extending Arrow Dataset API. Arrow provides a FileFormat API [8] that plugs into the Dataset API [7] and allows scanning datasets of different formats in a unified manner. Since Parquet is the de-facto file format of choice in data processing systems, we use it as a baseline for our implementation. We extend the ParquetFileFormat API in Arrow to create a RadosParquetFileFormat API that allows offloading Parquet file scanning to the Ceph Object Storage Devices (OSDs). This allows client applications using the Dataset API to offload Parquet file scan operations to the Ceph storage layer by simply changing the file format argument in the Dataset API.
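As a usage illustration, the snippet below sketches how a Dataset API client switches from client-side Parquet scanning to offloaded scanning by swapping the format object. The RadosParquetFileFormat constructor and the mount path are assumptions shown for illustration, not a verbatim API; the plain-Parquet part uses the standard PyArrow Dataset API.

```python
import pyarrow.dataset as ds

# Baseline: plain Parquet, scanned on the client.
dataset = ds.dataset("/mnt/cephfs/nyc-taxi", format=ds.ParquetFileFormat())

# Offloaded: hypothetical RADOS Parquet format that pushes the scan into the OSDs.
# rados_format = RadosParquetFileFormat("/etc/ceph/ceph.conf")  # assumed binding
# dataset = ds.dataset("/mnt/cephfs/nyc-taxi", format=rados_format)

# The rest of the client code is identical in both cases: projections and
# predicates flow through the Dataset API and, in the offloaded case, are
# evaluated inside the storage layer.
table = dataset.to_table(
    columns=["passenger_count", "fare_amount"],
    filter=ds.field("passenger_count") > 2,
)
```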

2.3 File Layout Designs

Parquet, being very efficient in storing and accessing data, has become the de-facto file format for popular data processing systems like Spark and Hadoop. Since Parquet files are often multiple gigabytes in size, a standard way to store Parquet files is to store them in blocks as in HDFS [10, 26], where a typical block size is 128 MB [28]. While writing Parquet files to HDFS, each row group is stored in a single block to prevent reading across multiple blocks when accessing a single row group. We aim to follow a similar file layout for storing Parquet files in Ceph so that every row group is self-contained within an object. In this section, we explore two different approaches for storing Parquet files in Ceph in an HDFS-like manner. We call them the Striped and Split file designs.

Striped File Design. In this design, we take a Parquet file and rewrite it by padding its row groups to make them equal-sized. Then, on writing the file to CephFS, the RADOS Striper automatically stripes the file into fixed-sized objects. We configure the stripe unit size as a multiple of the row group size, such that a row group is never split across stripe units. Crucially, this allows complete rows and their self-contained semantics to reside within objects. The client maintains a map from each row group to the object ID that contains it. During query execution, for every file in CephFS, the filename is first translated into the associated object IDs that constitute the file using the DirectObjectAccess API, and the last object, which is expected to contain the Parquet file footer, is read. The metadata in a Parquet file footer contains row group statistics that allow the Dataset API to find out the row groups that need to be scanned in the Parquet file based on the query predicates. This capability of Parquet is called "predicate pushdown". Once the row groups that need to be scanned are calculated, the row group IDs are converted to object IDs using the map that the client maintains, and scan operations are launched in parallel on each of these objects. The resulting Arrow tables from all the scanned objects are materialized into the final result table on the client. The striped file design is shown in Figure 3.

Figure 3: The Striped File Design: Parquet files are rewritten by padding each row group to create files with equal-sized and object-aligned row groups.

Split File Design. In this design, a Parquet file with R row groups is taken and split into R Parquet files, each containing the data from a single row group. To not lose the optimizations due to the predicate pushdown capability in Parquet, the footer metadata and the schema of the parent Parquet file are serialized and written to a separate file ending with a ".index" extension. So, for every Parquet file containing R row groups, we write R + 1 small Parquet files, each containing a single row group. During the dataset discovery phase, we discover only those files that end with a ".index" extension, translate the filenames to the underlying object IDs, and read the schema information using a RADOS Object Class method. At the start of the query execution phase, the index files are scanned to get the footer metadata. Then, the IDs of the row groups that qualify for scanning are calculated based on the row group statistics present in the footer metadata. Finally, the row group IDs are converted to the corresponding filenames, and the underlying objects of these files are then scanned in parallel via the DirectObjectAccess API. Figure 4 shows the split file design.

Figure 4: The Split File Design: Parquet files are split into multiple smaller Parquet files and are accompanied by index files for preserving Parquet optimizations.
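The following is a minimal PyArrow sketch of the split file design under an assumed input path. It writes one single-row-group Parquet file per row group plus a schema-only ".index" file; the actual implementation additionally serializes the parent file's footer metadata (row group statistics) into the index file so that predicate pushdown can skip row groups, which this sketch omits.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def split_parquet_file(path):
    """Split a Parquet file with R row groups into R single-row-group files
    plus an '.index' file carrying the parent schema (illustrative sketch)."""
    pf = pq.ParquetFile(path)

    # One small Parquet file per row group, e.g. abc.parquet.0, abc.parquet.1, ...
    for i in range(pf.num_row_groups):
        rg_table = pf.read_row_group(i)
        pq.write_table(rg_table, f"{path}.{i}")

    # Schema-only index file; the real design also embeds the serialized
    # footer metadata of the parent file.
    empty = pa.Table.from_batches([], schema=pf.schema_arrow)
    pq.write_table(empty, f"{path}.index")

split_parquet_file("/mnt/cephfs/nyc-taxi/abc.parquet")  # hypothetical path
```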
3 EVALUATION

We perform experiments to compare the query duration and CPU utilization of accessing a dataset, filtering either at the client or at the storage server. Our experiments were performed on CloudLab, the NSF-funded bare-metal-as-a-service infrastructure [15]. For our experiments, we exclusively used machines with an 8-core Intel Xeon D-1548 2.0 GHz processor (with hyperthreading enabled), 64 GB DRAM, a 256 GB NVMe drive, and a 10 GbE network interface. These bare-metal nodes are codenamed 'm510' in CloudLab. We ran our experiments on Ceph clusters with 4, 8, and 16 storage nodes with a single client. Each storage node had a single Ceph OSD running on top of the NVMe drive. The OSDs were configured to use 8 threads to prevent any contention due to hyperthreading in the storage nodes. A CephFS was created on a 3-way replicated pool and was mounted in user mode using the ceph-fuse utility.

Our workload comprised a dataset with 1.2 billion rows and 17 columns consisting of data from the NYC yellow taxi dataset [3]. The in-memory size of the dataset was found to be 154.8 GB. In this work, we experiment with 64 MB files, as we found that Parquet was more efficient at scanning a few large files than a large number of small files. As described in Section 2.3, since a row group is supposed to be self-contained within a single object and the unit of parallelism in the Arrow Dataset API when using Parquet is a single row group, we used Parquet files having a single row group backed by a single RADOS object in all our experiments. We use the split file layout design, except for the index file, with both Parquet and RADOS Parquet. For the experiments, we used the Python version of the Arrow Dataset API and utilized Python's ThreadPoolExecutor for launching scans in parallel, using an asynchronous I/O model. We measured latency, scalability, and CPU usage for both Parquet and RADOS Parquet.
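A minimal sketch of this client-side scan driver is shown below; the dataset path, thread count, and query are assumptions for illustration, and error handling is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("/mnt/cephfs/nyc-taxi", format=ds.ParquetFileFormat())
predicate = ds.field("passenger_count") > 2
columns = ["passenger_count", "fare_amount"]

def scan_fragment(fragment):
    # With RADOS Parquet, the filter and projection would be evaluated
    # inside the OSDs; with plain Parquet they run here on the client.
    return fragment.to_table(columns=columns, filter=predicate)

# One fragment per file (and hence per RADOS object in our layout); scan them
# in parallel with a thread pool, then materialize the final result table.
with ThreadPoolExecutor(max_workers=16) as pool:
    tables = list(pool.map(scan_fragment, dataset.get_fragments()))

result = pa.concat_tables(tables)
```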
Figure 5: Query latency improvement on scaling out from 4 to 8 and 16 storage nodes.

We ran queries to select 100%, 10%, and 1% of the rows from our dataset. In the 100% selectivity case, all the rows are returned without applying any filters on the dataset. The queue depth at each storage node was maintained at 4 across all the experiments. As shown in Figure 5, RADOS Parquet is faster than Parquet in the 10% and 1% scenarios. Since RADOS Parquet transfers data in the much larger Arrow format as compared to the serialized binary format in the case of Parquet, the 100% selectivity case bottlenecks on the 10 GbE network on scaling out, resulting in no performance improvement. Except for the 100% selectivity case, on scaling out from 4 to 16 OSDs, RADOS Parquet keeps getting faster than Parquet due to its ability to offload and distribute the computation across all the storage nodes, whereas Parquet remains CPU-bottlenecked on the client.

Figure 6: Total CPU utilization by Parquet and RADOS Parquet on the client and storage nodes for 60 seconds (measured at 15s intervals) during a query execution with 100% selectivity. Here C represents the client node while S[1-8] represent the storage nodes.

Figure 6 shows the total CPU utilization for 60 seconds by the client and the storage layer during a 100% selectivity query execution with 8 storage nodes and 16 client threads. We observe that even with no filtering involved, Parquet exhausts the client's CPU. This implies that the client would be unable to do any other processing work in this situation. On the other hand, we observe that with RADOS Parquet, the client's CPU is almost idle and all the CPU usage is at the storage layer. Hence, with the client's CPU almost free, more asynchronous threads can be launched to improve parallelism, or the client can do other processing tasks.

4 RELATED WORK

Several distributed data processing systems have embraced the idea of query offloading to the storage layer for performance improvement. The paper on the Albis [30] file format by Trivedi et al. showed that with high-performance storage and networks, CPUs have become the new bottleneck. Since the CPU bottleneck on the client hampers scalability, offloading CPU to the storage layer has become more important. Recently, S3 introduced S3 Select [17], which allows files in either Parquet, CSV, or JSON format to be scanned inside S3 for improved query performance. Many other systems which already support reading from S3, e.g., Spark [40], Redshift [16], Snowflake [13], and PushdownDB [38], have started taking advantage of S3 Select. But being an IaaS offering [21], the performance of S3 Select cannot be tuned, nor can it be customized to read file formats other than the ones it supports. Systems like IBM Netezza [27], PolarDB [11], and Ibex [36] depend on sophisticated and costly hardware like FPGAs and Smart SSDs to perform table scanning inside the storage layer. These systems employ hardware-software co-design techniques to serve their specific use cases. Most of these systems generally follow a clean-slate approach and are built from the ground up specifically tailored for query offloading.

In our approach, we take a programmable storage system, Ceph, and extend its filesystem and object storage layers to allow offloading queries by leveraging the extension mechanisms it provides. Storage systems like OpenStack Swift [22] and DAOS [20] also provide extension mechanisms via Storlets [23] and DAOS middleware [18], respectively. We embed Apache Arrow libraries inside the storage layer to build the data access logic. Our approach signifies that storage systems should provide extensive plugin mechanisms so that they can be easily extended to support ad-hoc functionality without the need to modify legacy code or require a complete rebuild.

5 CONCLUSION

In this paper, we present a new design paradigm that allows extending the POSIX interface and the object storage layer in programmable object-storage systems with plugins to allow offloading compute-heavy data processing tasks to the storage layer. We also discuss how different data access libraries and processing frameworks can be embedded in the plugins to build a universal data processing engine that supports different file formats. We present an instantiation of this design implemented using Ceph for the storage system and Apache Arrow for the data processing layer. Currently, our implementation supports reading Parquet files only, but support for other file formats can be easily added since we use Arrow as our data access library. We also discuss two alternative file layout designs for storing Parquet files in Ceph in a fashion similar to HDFS that allow efficient querying. We expose our implementation via a RadosParquetFileFormat API, which is an extension of the Arrow FileFormat API. We also discuss some performance evaluations of our implementation and demonstrate that offloading compute-heavy query execution to the storage layer helps improve query performance by making queries faster and more scalable.
REFERENCES
[1] Apache Parquet. [n.d.]. https://parquet.apache.org/.
[2] MySQL. [n.d.]. https://www.mysql.com/.
[3] NYC Yellow Taxi Trip Record Data. [n.d.]. https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.
[4] Amazon. [n.d.]. Amazon S3. https://aws.amazon.com/s3/. Accessed: 2020-11-16.
[5] Apache Avro. [n.d.]. https://avro.apache.org/.
[6] Apache. [n.d.]. Apache ORC - High-Performance Columnar Storage for Hadoop. https://orc.apache.org/.
[7] Arrow Dataset API. [n.d.]. https://arrow.apache.org/docs/python/dataset.html.
[8] Arrow FileFormat API. [n.d.]. https://arrow.apache.org/docs/python/generated/pyarrow.dataset.FileFormat.html#pyarrow.dataset.FileFormat.
[9] Goncalo Borges, Sean Crosby, and Lucien Boland. 2017. CephFS: a new generation storage platform for Australian high energy physics. In Journal of Physics: Conference Series, Vol. 898. IOP Publishing, 062015.
[10] Dhruba Borthakur. 2007. The Hadoop distributed file system: Architecture and design. Hadoop Project Website 11, 2007 (2007), 21.
[11] Wei Cao, Yang Liu, Zhushi Cheng, Ning Zheng, Wei Li, Wenjie Wu, Linqiang Ouyang, Peng Wang, Yijing Wang, Ray Kuan, et al. 2020. POLARDB meets computational storage: Efficiently support analytical workloads in cloud-native relational database. In 18th USENIX Conference on File and Storage Technologies (FAST 20). 29–41.
[12] SDK for Ceph Object Classes. [n.d.]. https://docs.ceph.com/en/latest/rados/api/objclass-sdk/#sdk-for-ceph-object-classes.
[13] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The Snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226.
[14] Jaeyoung Do, Yang-Suk Kee, Jignesh M Patel, Chanik Park, Kwanghyun Park, and David J DeWitt. 2013. Query processing on smart SSDs: Opportunities and challenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. 1221–1230.
[15] Dmitry Duplyakin, Robert Ricci, Aleksander Maricq, Gary Wong, Jonathon Duerig, Eric Eide, Leigh Stoller, Mike Hibler, David Johnson, Kirk Webb, Aditya Akella, Kuangching Wang, Glenn Ricart, Larry Landweber, Chip Elliott, Michael Zink, Emmanuel Cecchet, Snigdhaswin Kar, and Prabodh Mishra. 2019. The Design and Operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC). 1–14. https://www.flux.utah.edu/paper/duplyakin-atc19
[16] Anurag Gupta, Deepak Agarwal, Derek Tan, Jakub Kulesza, Rahul Pathak, Stefano Stefani, and Vidhya Srinivasan. 2015. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1917–1923.
[17] Randall Hunt. 2017. S3 Select and Glacier Select – Retrieving Subsets of Objects. https://aws.amazon.com/blogs/aws/s3-glacier-select/.
[18] Intel. [n.d.]. DAOS Client APIs, Tools and I/O Middleware. https://github.com/daos-stack/daos/blob/master/src/README.md#client-apis-tools-and-io-middleware.
[19] Insoon Jo, Duck-Ho Bae, Andre S Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel DG Lee, and Jaeheon Jeong. 2016. YourSQL: a high-performance database system leveraging in-storage computing. Proceedings of the VLDB Endowment 9, 12 (2016), 924–935.
[20] Zhen Liang, Johann Lombardi, Mohamad Chaarawi, and Michael Hennecke. 2020. DAOS: A Scale-Out High Performance Storage Stack for Storage Class Memory. In Asian Conference on Supercomputing Frontiers. Springer, 40–54.
[21] Rafael Moreno-Vozmediano, Rubén S. Montero, and Ignacio M. Llorente. 2012. IaaS Cloud Architecture: From Virtualized Datacenters to Federated Cloud Infrastructures. Computer 45, 12 (2012), 65–72. https://doi.org/10.1109/MC.2012.76
[22] OpenStack. [n.d.]. OpenStack Swift Storlet Engine Overview. https://docs.openstack.org/storlets/latest/storlet_engine_overview.html.
[23] OpenStack. [n.d.]. Storlet's Documentation. https://docs.openstack.org/storlets/latest/.
[24] Oracle. [n.d.]. Oracle Exadata. https://www.oracle.com/engineered-systems/exadata/.
[25] Federico Razzoli. 2014. Mastering MariaDB. Packt Publishing Ltd.
[26] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–10.
[27] Malcolm Singh and Ben Leonhardi. 2011. Introduction to the IBM Netezza warehouse appliance. In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research. 385–386.
[28] HDFS Block Size. [n.d.]. http://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html.
[29] Arrow Development Team. 2018. Apache Arrow. https://arrow.apache.org.
[30] Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler. 2018. Albis: High-performance file format for big data systems. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 615–630.
[31] Sage A Weil, Scott A Brandt, Ethan L Miller, Darrell DE Long, and Carlos Maltzahn. 2006. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation. 307–320.
[32] Sage A Weil, Andrew W Leung, Scott A Brandt, and Carlos Maltzahn. 2007. RADOS: a scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd International Workshop on Petascale Data Storage: held in conjunction with Supercomputing '07. 35–44.
[33] Tom White. 2012. Hadoop: The Definitive Guide. O'Reilly Media, Inc.
[34] Wikipedia. [n.d.]. Field-programmable gate array. https://en.wikipedia.org/wiki/Field-programmable_gate_array.
[35] Wikipedia contributors. 2020. InfiniBand — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=InfiniBand&oldid=978376323. [Online; accessed 25-October-2020].

[36] Louis Woods, Zsolt István, and Gustavo Alonso. 2014. Ibex: An intelligent storage engine with support for advanced SQL offloading. Proceedings of the VLDB Endowment 7, 11 (2014), 963–974.
[37] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. 2015. Performance analysis of NVMe SSDs and their implication on real world databases. In Proceedings of the 8th ACM International Systems and Storage Conference. 1–11.
[38] Xiangyao Yu, Matt Youill, Matthew Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, and Michael Stonebraker. 2020. PushdownDB: Accelerating a DBMS using S3 computation. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1802–1805.
[39] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.
[40] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.