Zero-Cost, Arrow-Enabled Data Interface for Apache Spark

Sebastiaan Alvarez Rodriguez (Leiden University), Jayjeet Chakraborty (UC Santa Cruz), Aaron Chu (UC Santa Cruz), Ivo Jimenez (UC Santa Cruz), Jeff LeFevre (UC Santa Cruz), Carlos Maltzahn (UC Santa Cruz), Alexandru Uta (Leiden University)

arXiv:2106.13020v1 [cs.DC] 24 Jun 2021

ABSTRACT
Distributed data processing ecosystems are widespread and their components are highly specialized, such that efficient interoperability is urgent. Recently, Apache Arrow was chosen by the community to serve as a format mediator, providing efficient in-memory data representation. Arrow enables efficient data movement between data processing and storage engines, significantly improving interoperability and overall performance. In this work, we design a new zero-cost data interoperability layer between Apache Spark and Arrow-based data sources through the Arrow Dataset API. Our novel data interface helps separate the computation (Spark) and data (Arrow) layers. This enables practitioners to seamlessly use Spark to access data from all Arrow Dataset API-enabled data sources and frameworks. To benefit our community, we open-source our work and show that consuming data through Apache Arrow is zero-cost: our novel data interface is either on-par or more performant than native Spark.

1 INTRODUCTION
Distributed data processing frameworks, like Apache Spark [32], Hadoop [4], and Snowflake [11], have become pervasive, being used in most domains of science and technology. The distributed data processing ecosystems are extensive and touch many application domains such as stream and event processing [6, 15, 16, 22], distributed machine learning [20], or graph processing [31]. With data volumes increasing constantly, these applications are in urgent need of efficient interoperation through a common data layer format. In the absence of a common data interface, we identify two major problems: (1) data processing systems need to convert data, which is a very expensive operation; (2) data processing systems require new adapters or readers for each new data type to support and for each new system to integrate with.

A common example where these two issues occur is the de-facto standard data processing engine, Apache Spark. In Spark, the common data representation passed between operators is row-based [5]. Connecting Spark to other systems such as MongoDB [8], Azure SQL [7], Snowflake [11], or data sources such as Parquet [30] or ORC [3], entails building connectors and converting data. Although Spark was initially designed as a computation engine, this data adapter ecosystem was necessary to enable new types of workloads. However, we believe that using a universal interoperability layer instead enables better and more efficient data processing, as well as more data-format-related optimizations.

The Arrow data format is available for many languages and is already adopted by many projects, including pySpark [32], Dask [21], Matlab [19], pandas [29], and Tensorflow [1]. Moreover, it is already used to exchange data between computation devices, such as CPUs and GPUs [10]. However, the Apache Arrow Dataset API [23], not to be confused with the main Arrow [24] library, emerged as a platform-independent data consumption API, which enables data processing frameworks to exchange columnar data efficiently and without unnecessary conversions. The Arrow Dataset API supports reading many kinds of datasources, both file formats and (remote) cloud storage.
Exploring the benefits of the Arrow Dataset API for building storage connectors is currently an understudied topic.

In this paper, we therefore leverage the power of the Apache Arrow Dataset API and separate the computation offered by Spark from the data (ingestion) layers, which are more efficiently handled by Arrow. We design a novel connector between Spark and the Apache Arrow Dataset API, to which Spark can offload its I/O.

Using the Arrow Dataset API, we enable Spark access to Arrow-enabled data formats and sources. The increasing adoption of Arrow will make many more data types and sources available in the future, without adding any additional integration effort for our connector and, by extension, for Spark.

In this work, we lay the foundation of integrating Spark with all Arrow-enabled datasources and show that the performance achieved by our connector is promising, exceeding in many situations the performance achieved by Spark-implemented data connectors. We experiment with several design points, such as batch sizes, compression, data types (e.g., Parquet or CSV), and the scaling behavior of our connector. Our analysis shows that our proposed solution scales well, both with increasing data sizes and Spark cluster sizes, and we provide advice for practitioners on how to tune Arrow batch sizes and which compression algorithms to choose. Finally, practitioners can integrate our connector in existing programs without modification, since it is implemented as a drop-in, zero-cost replacement for Spark reading mechanisms. The contributions of this work are the following:
(1) The design and implementation of a novel, zero-cost data interoperability layer between Apache Spark and the Apache Arrow Dataset API. Our connector separates the computation (Spark) from the data (ingestion) layer (Arrow) and enables Spark interoperability with all Arrow-enabled formats and data sources. We open-source [2] our implementation for the benefit of our community.
(2) The performance evaluation of the data interoperability layer. We show that Arrow-enabled Spark performs on-par with or better than native Spark and provide advice on how to tune Arrow parameters.

2 DESIGN AND IMPLEMENTATION
Our framework, called Arrow-Spark, provides an efficient interface between Apache Spark and all Arrow-enabled data sources and formats. Spark is in charge of execution, and Arrow provides the data, using its in-memory columnar formats. In Figure 1, we give a more detailed overview of how we access data through Arrow. Spark is a JVM-based distributed data processing system, whereas the Arrow Dataset API [23] is only implemented in C++ and Python, but not Java. To enable communication, we created a bridge implementation.

Figure 1: Arrow-Spark design overview, integrating Arrow (right) and Spark (left) using the Java Native Interface (JNI).

(1) Reads. Data transmission is initiated by a read request coming from Spark. Read requests arrive at the Arrow FileFormat datasource.
(2) JVM Dataset. The Arrow FileFormat constructs a Dataset interface to read data through JNI, using the Arrow Dataset C++ API. The JVM Dataset interface forwards all calls through the JNI wrapper to C++. The Arrow Dataset API interface is constructed on the C++ side, and a reference UUID is passed to the JVM interface counterpart. Through this JVM interface, the Arrow FileFormat initiates data scanning (reading) using a scanner.
(3) Arrow C++ Dataset API. On the C++ side, a native Dataset instance is created. On creation, it picks a FileFormat, depending on the type of data to be read.
(4) Data transmission. The C++ Arrow Dataset API reads/writes the data in batches, using the given FileFormat.
(5) Arrow IPC. Each data batch is placed in memory as an Arrow IPC message, which is a columnar data format. The address to access each message is forwarded to the JVM and stored in a Scantask reference. Notice that here we make only one additional data copy.
(6) Conversion. Each Arrow IPC message is converted to an array of Spark-readable column vectors. Because Spark operators exchange row-wise data, we convert the column vectors to a row-wise representation by wrapping them in a ColumnarBatch, which wraps the columns and allows row-wise data access on them. This batch is returned to Spark and incurs a data copy, which cannot be avoided because Spark operators only work on row-based data.
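To make this walkthrough concrete, the sketch below shows, in simplified Scala, how a columnar partition reader can hand Arrow record batches to Spark by wrapping the Arrow vectors in a ColumnarBatch (step 6). It is a minimal illustration of the design, not our actual implementation: NativeArrowScanner and its methods are hypothetical stand-ins for the JNI wrapper, while ArrowColumnVector, ColumnarBatch and PartitionReader are Spark's public columnar interfaces.

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnVector, ColumnarBatch}

// Hypothetical handle to the native (C++) Arrow Dataset scanner behind the JNI
// wrapper of Figure 1; nextBatch() returns null once the scan is exhausted.
trait NativeArrowScanner extends AutoCloseable {
  def nextBatch(): VectorSchemaRoot
}

// Step 6 of Figure 1: expose each Arrow record batch to Spark as a ColumnarBatch,
// which lets Spark operators access the columnar data row by row.
class ArrowPartitionReader(scanner: NativeArrowScanner)
    extends PartitionReader[ColumnarBatch] {

  private var current: VectorSchemaRoot = _

  override def next(): Boolean = {
    current = scanner.nextBatch()   // steps 4-5: the native side produces an IPC batch
    current != null
  }

  override def get(): ColumnarBatch = {
    val vectors = current.getFieldVectors   // Arrow vectors backing the batch
    val columns = Array.tabulate[ColumnVector](vectors.size()) { i =>
      new ArrowColumnVector(vectors.get(i)) // wrap, rather than copy, each vector
    }
    val batch = new ColumnarBatch(columns)
    batch.setNumRows(current.getRowCount)
    batch
  }

  override def close(): Unit = scanner.close()
}

The remaining unavoidable cost is the row-wise access Spark operators perform on the returned batch, as described in step 6 above.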

Arrow-Spark JNI Bridge. Apache Spark (core) is implemented in Scala, and there does not exist an Arrow Dataset implementation written in any JVM language. The Arrow Dataset API [23] is not to be confused with the main Arrow library, for which a Java stub implementation exists [25]. Practitioners using Spark with Arrow are currently bound to a very small set of features. To use the pyarrow dataset implementation (the Python wrapper around the C++ Arrow Dataset API) with PySpark (the Python wrapper over Spark), one needs to implement these steps explicitly in the PySpark program, unlike our approach, which is transparent to the user. Moreover, the Python bridge between Spark (JVM) and Arrow (C++) adds a highly inefficient link in applications, and a large functionality limitation: PySpark requires converting the pyarrow dataset tables to pandas data, a PySpark-readable format. This conversion cancels Spark's lazy reading and requires materializing the entire dataset into memory. We experimented with a PySpark+pyarrow setup and found it was consistently 30 to 50 times slower than our connector, with a growing performance difference when increasing dataset sizes. Additionally, due to eager materialization, dataset size is limited by RAM capacity.

Our work in this paper has a much wider scope, aiming to make core Spark read data using core Arrow, providing universal data access to core Arrow from core Spark and all its wrapper implementations at once. Implementing our connector in core Spark (JVM) circumvents all aforementioned inefficiencies and shortcomings. We can therefore use our connector with all programming languages that Spark supports.

The core Arrow Dataset implementation is in native C++. To access it, we use a Java Native Interface (JNI) implementation [33], based on the Intel Optimized Analytics Platform project [9]. Even though we could have chosen other languages for which there exists an Arrow Dataset implementation, we decided to use C++ because of its native-execution performance. Moreover, the Arrow Dataset API core is programmed in C++; using any other language with Arrow bindings would add additional overhead. Even though the bridge between the JVM and native code brings a slight time penalty, we were able to minimize it by limiting data copies (see Figure 1 for the data copies that our interface implements). Additionally, C and C++ are the best-supported languages for interfacing with the JVM, through JNI.
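A rough sketch of what the JVM side of such a JNI bridge can look like is given below. The object, method and library names are illustrative, not the exact API of [33] or of our connector; the important point is that the C++ Dataset and Scanner objects are visible to the JVM only as opaque handles, and batches are exchanged as memory addresses of Arrow IPC messages.

// Illustrative JNI surface between core Spark (JVM) and the Arrow Dataset API (C++).
// Every native object lives on the C++ side and is referenced from the JVM only
// through an opaque handle.
object ArrowDatasetJni {
  System.loadLibrary("arrow_dataset_jni") // hypothetical name of the native library

  // Create a native Dataset for the given URI and file format ("parquet", "csv", ...),
  // returning a handle to the C++ object (steps 2-3 in Figure 1).
  @native def createDataset(uri: String, format: String): Long

  // Create a native Scanner over the dataset, restricted to the projected columns
  // and producing batches of batchSize rows (steps 3-4).
  @native def createScanner(dataset: Long, columns: Array[String], batchSize: Long): Long

  // Advance the scanner and return the memory address of the next Arrow IPC
  // message, or 0 when the scan is finished (step 5).
  @native def nextBatch(scanner: Long): Long

  // Release the native objects once reading is done.
  @native def closeScanner(scanner: Long): Unit
  @native def closeDataset(dataset: Long): Unit
}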
3 ARROW-SPARK PERFORMANCE
We evaluate the performance of Arrow-Spark by comparing it with standard Spark reading the parquet and CSV data formats. We performed 5 separate experiments (E1-E5) to reveal various performance aspects of Arrow-Spark, as described in Table 1. Each experiment draws appropriate conclusions and provides advice for practitioners.

Table 1: Description of the experiments in this work.
Experiment | Dataset Size (GB) | Batch Size (KB) | Query | Row Size (Bytes)
E1 | 1,144 | 32-32,768 | Scan: select * | 4 × 8
E2 | 71-4,500 | 256 | Scan: select * | 4 × 8
E3 | 71-1,144 | 256 | Scan: select * | 4 × 8
E4 | 71-1,144 | 256 | Scan: select * | 4 × 8
E5 | 715 | 25,600 | Projection: select c1,c2,... | 100 × 8

3.1 Experiment Reproducibility
We used the specifications and parameters we give here as default values for every experiment, unless stated otherwise.

Experiment. We ran every experiment 31 times for every configuration. We discard the first execution of every experiment, because the JVM and memory caches are still 'cold' during this run, producing an outlier. Between runs of the same experiment we did not close the Spark session, so as to keep the JVM and caches warm. We adhered to the guidelines provided by Uta et al. [28] to ensure our performance results are reproducible and significant. Due to the stability of our cluster, the differences between the 1st and 99th percentiles for each experiment were below 10% (with one exception, which we comment upon in Section 3.2).

Hardware and Deployment. We ran all experiments on a cluster of 9 machines (8 Spark executor nodes and one driver), each equipped with 64 GB RAM and a dual 8-core Intel E5-2630v3 CPU running at 2.4 GHz. Servers are interconnected with FDR InfiniBand (≈ 56 Gb/s). In the experiments, we always report the number of nodes in terms of executor nodes. Spark is allowed to use up to 43 GB of the available 64 GB RAM in each node. Note that approximately 17 GB of RAM is already in use by input data, as explained below. When reading parquet, we normally read uncompressed files unless stated otherwise.

Dataset. By default, we process 38.4 × 10^9 rows (≈ 1,144 GB) of synthetic data. The data we experiment with has rows of 4 up to 100 columns, which are 64-bit integers, split into files of 17 GB. During initial testing, we found that our experiments were influenced by local disk speed. With local disk reading, the bottleneck lies in disk speed, and we cannot uncover any framework inefficiencies. To mitigate this issue, we decided to deploy data as close as possible to the CPUs. We chose to deploy on RAMDisk, a memory-backed filesystem. Our machines have only 64 GB RAM, while we want to experiment with datasets of much larger sizes. To solve this new problem, we created X hardlinks for every file in a dataset of, e.g., 17 GB. Spark reads the hardlinks as if they were regular files, simulating a dataset size of 17 · (X + 1) GB. This way, we were able to experiment with virtually infinitely large datasets, regardless of the RAM size constraints.

Performance Measurement. We are interested in I/O performance. To measure it, we created queries (see Table 1) which only read in all data and count the number of rows as a way of triggering execution. This way, we can accurately measure the read performance. By default, we read this data into a Spark Dataframe (DF) to test with, reading 256 KB per batch. We replicated all experiments using Spark SQL, Datasets and directly RDDs, and the conclusions we drew were similar to what we present in this paper. However, due to space constraints we exclude these results.
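As an illustration, the full-scan measurement used in E1-E4 boils down to a query of the following shape. The counting itself uses the standard DataFrame API; the "arrow" format name and the "batchSize" option are hypothetical names for our connector's entry point, and the path is specific to our RAMDisk deployment.

import org.apache.spark.sql.SparkSession

object ScanBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("arrow-spark-scan").getOrCreate()

    // Read the parquet dataset through the connector; "arrow" and "batchSize" are
    // illustrative names. 8192 rows of four 64-bit integers = 256 KB batches.
    val df = spark.read
      .format("arrow")
      .option("batchSize", "8192")
      .load("/ramdisk/dataset")

    // Counting all rows forces Spark to scan the entire dataset, so the measured
    // time is dominated by read performance.
    val start = System.nanoTime()
    val rows = df.count()
    val seconds = (System.nanoTime() - start) / 1e9
    println(s"scanned $rows rows in $seconds s")

    spark.stop()
  }
}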

3.2 E1: Batch Size Tuning
Figure 2: Batch size matters: Execution time (boxplots, left y-axis) and relative speedup (curve, right y-axis) with varying batch sizes.

To request Arrow to read data, we place (columnar) parquet data in memory buffers of limited size, i.e., batch sizes. We measured the execution time under varying batch sizes. The results are plotted in Figure 2. The performance of default Spark remains largely unchanged when changing the number of rows a buffer may hold, for the smaller batch sizes. Only when setting this number to a very large value does Spark suffer a more significant overhead. After investigating this further, we found that the overhead arises because Spark uses up all available memory with larger batch sizes, causing many garbage collection calls and memory swapping. Arrow is much more memory efficient, due to its use of streaming principles to transmit data.

For Arrow-Spark, choosing a batch size is much more important for performance. Low batch sizes degrade performance significantly (see the first two boxplots, for batch sizes of 2^5 and 2^6 KB). The root cause for this behaviour is the way Arrow-Spark works. For every batch, the Arrow Dataset API has to read data and transform it to IPC format. Further, the JVM side reads this data and converts it to a Spark-understandable format. With smaller batches, there are more conversions and memory copies on both sides. Batches of 8192 rows offer the best performance with both systems, due to hardware features. Each row in our sample data consists of four 64-bit integers. With 8192 rows, we get batches of exactly 256 KB. Our CPUs have a 256 KB L2 cache per core. (With different CPUs, the best batch size could be different.) By choosing a batch size of 8192 rows, we can fit exactly one batch inside the L2 cache. We verified the correctness of this hypothesis by experimenting with other data shapes. Due to space constraints, we skip these results in this paper.

Conclusion-1: Practitioners should always overestimate rather than underestimate batch sizes for Arrow-Spark. Underestimations cause strong performance degradation.
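The rule of thumb above can be made explicit: with our row layout, the batch size that exactly fills the per-core L2 cache follows from a one-line calculation. The sketch below uses the cache size and row width of our setup; other CPUs or schemas give a different sweet spot.

object BatchSizeTuning {
  def main(args: Array[String]): Unit = {
    val l2CacheBytes = 256 * 1024 // per-core L2 cache of our Intel E5-2630v3 CPUs
    val bytesPerRow  = 4 * 8      // four 64-bit integer columns per row
    val rowsPerBatch = l2CacheBytes / bytesPerRow
    // Prints 8192: one 256 KB Arrow batch fits exactly in the L2 cache.
    println(s"rows per batch: $rowsPerBatch")
  }
}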

3.3 E2: Data Scalability
Figure 3: Arrow-Spark performance is good: Execution time (bars) and relative speedup (curve) on increasing parquet data sets.

An important property of a distributed data processing system is its data scalability behavior, which evaluates system performance with increasing dataset sizes. We measured data scalability by reading parquet datasets with a wide range of different sizes. The results of this experiment are depicted in Figure 3, which plots the execution time with increasing dataset sizes (left vertical axis), as well as the relative speedup (or slowdown) between the two systems (right vertical axis). The execution time approximately doubles as the dataset size doubles. Both Spark and Arrow-Spark scale well. Despite being slower than Spark on the smaller datasets we tested, Arrow-Spark gradually becomes relatively faster than Spark for larger datasets. We depict this effect in the line plot of Figure 3, which shows the relative speedup of our framework compared to default Spark. This effect occurs because Arrow-Spark has several sources of constant overhead, of which the JNI bridge is the largest. The reading itself is slightly faster than with Spark, causing the difference between the measured systems to grow with the increase in dataset size.

We found similar patterns when scaling the cluster size between 4 and 32 nodes. Due to space constraints, we were unable to place these experiments in this paper.

Conclusion-2: Arrow-Spark scales well with dataset sizes. Its advantage over Spark is highest with 1-2 TB datasets. Practitioners can leverage Arrow-Spark at zero additional cost with larger dataset sizes.
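The growing relative speedup can be summarized with a simple cost model (our own sketch of the argument, with c, a, and b as assumed constants rather than measured values):

T_{\mathrm{Spark}}(s) \approx b\,s, \qquad T_{\mathrm{Arrow\text{-}Spark}}(s) \approx c + a\,s, \qquad a < b,

\mathrm{speedup}(s) = \frac{T_{\mathrm{Spark}}(s)}{T_{\mathrm{Arrow\text{-}Spark}}(s)} = \frac{b\,s}{c + a\,s} \;\longrightarrow\; \frac{b}{a} > 1 \quad \text{as } s \to \infty.

Here c captures the constant JNI bridge overhead and a < b the slightly faster native reading: for small dataset sizes s the constant term dominates, while for large s the per-gigabyte advantage wins, matching the trend in Figure 3.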

3.4 E3: Row-wise Formats
Figure 4: Arrow-Spark is faster than Spark with CSV files: Execution time (bars) and relative speedup (curve) on increasing CSV data sets.

Arrow-Spark supports multiple file formats through Arrow functionality. We performed an experiment comparing Arrow-Spark and default Spark on CSV data. The results are displayed in Figure 4. Our implementation is significantly faster than Spark, starting at a speedup factor of approximately 4.5. Larger datasets do not influence this speedup, which remains constant over all dataset sizes we tested.
This is due to Spark's inefficient CSV reader, the Java-based Univocity CSV parser [27]. By contrast, our connector works with the highly efficient C++ Arrow Dataset API implementation. In cases where parsing is involved, native code is commonly much more performant than other solutions. The cost of processing more data is higher for Spark, and much lower for Arrow-Spark, due to the differences in parsers. This explains why using larger datasets results in a relatively bigger runtime increase for Spark.

Conclusion-3: Overall, Arrow-Spark brings significantly higher performance for ingesting text-based file formats than default Spark, and the performance difference is constant with respect to dataset size. Practitioners should always choose Arrow-Spark for such file formats.
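For reference, E3 compares the two read paths sketched below (in a spark-shell session, where the spark variable is predefined). The first uses Spark's built-in, Univocity-based CSV data source; the second goes through our connector, with "arrow" and "fileFormat" again being illustrative option names rather than the exact API.

// Baseline: Spark's built-in CSV data source (JVM-side Univocity parsing).
val sparkCsv = spark.read
  .schema("c0 LONG, c1 LONG, c2 LONG, c3 LONG")
  .csv("/ramdisk/dataset-csv")

// Arrow-Spark: the same files, parsed by Arrow's native C++ CSV reader behind the
// JNI bridge ("arrow" and "fileFormat" are illustrative names).
val arrowCsv = spark.read
  .format("arrow")
  .option("fileFormat", "csv")
  .load("/ramdisk/dataset-csv")

// Same result, very different parsing cost.
println(s"spark: ${sparkCsv.count()} rows, arrow-spark: ${arrowCsv.count()} rows")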

3.5 E4: Parquet Compression
Figure 5: Arrow-Spark leverages compression: Execution time for Arrow-Spark with different compression techniques (uncompressed, gzip, snappy) for parquet data.

Arrow-Spark can leverage parquet compression. We compare the performance of Arrow-Spark under several compression options. The results can be found in Figure 5. Note that using compressed parquet files in practical situations produces vastly different results from the results we obtained in our experimental setup using RAMDisk. Usually, the bottleneck for reading data is I/O. Using compressed data trades I/O for increased CPU load. In our setup, we deploy data on RAMDisk, and we have no I/O bottleneck. Reading compressed data only increases CPU load in this case, decreasing performance. We compare reading uncompressed parquet with snappy and gzip. When reading from memory, without much I/O overhead, reading compressed data is slower due to the added CPU overhead of decompressing it.

Conclusion-4: Arrow-Spark is able to leverage various compression algorithms for parquet files. When leveraging slower I/O media, these can help practitioners by trading I/O volume for increased CPU load.
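The compressed variants used in E4 can be produced with Spark's standard parquet writer option for the compression codec; reading them back needs no extra configuration, since parquet records the codec in the file metadata. A sketch, assuming a spark-shell session and an existing DataFrame df holding the synthetic data:

// Write the same data with the three codecs compared in Figure 5.
for (codec <- Seq("uncompressed", "snappy", "gzip")) {
  df.write
    .option("compression", codec)            // standard Spark parquet writer option
    .parquet(s"/ramdisk/dataset-parquet-$codec")
}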

3.6 E5: Column Projection
Figure 6: Arrow-Spark pushes down projections: Execution time (boxplots) and relative speedup (curve) when reading 9.6 billion rows in batches of 32,768 rows, with varied projection selectivity.

One benefit of many columnar data formats is the ability to project columns, i.e., to select a subset of the total number of columns for reading at little to no cost. Arrow-Spark pushes down projection operations from Spark to Arrow. The benefit of doing this is that Arrow becomes in charge of providing subsets of data. Practitioners need not change code beyond using the Arrow-Spark interface we designed, because projections are directly controlled by the Spark query optimizer. The results can be found in Figure 6. When selecting very few columns, Spark is relatively quicker because it pushes down projections to its parquet reader, while Arrow-Spark pushes them down to Arrow over the JNI. When selecting more columns, Arrow-Spark becomes relatively faster due to its advantage on larger data sizes. Overall, this behavior is consistent with the findings from E2. Although Spark is seemingly faster at higher selectivities, this is relative: on real-world datasets, even 1% selectivity could still mean reading over 1 TB of data, at which point Arrow-Spark's performance is superior.

Conclusion-5: Arrow-Spark is able to leverage projections for parquet files, effectively pushing down projection queries to the data layer, minimizing data movement.
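The projection queries of E5 are plain column selections: Spark's optimizer prunes the schema, and our connector forwards the pruned column list to Arrow so that only the requested columns are materialized. A sketch, assuming a spark-shell session and the illustrative connector name used earlier:

import org.apache.spark.sql.functions.col

// 100-column dataset (c0 ... c99), read through the connector.
val df = spark.read.format("arrow").load("/ramdisk/dataset-wide")

// Select 10 of the 100 columns (10% selectivity in Figure 6): the pruned schema is
// pushed down over JNI, so Arrow only reads and transmits these columns.
val projected = df.select((0 until 10).map(i => col(s"c$i")): _*)
println(projected.count())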
4 RELATED WORK
Several connector extensions allow users to read new file formats or datasource backends, such as HDF5 or netCDF [18], as well as extensions on Hadoop to allow efficient reading of highly structured array-based data. Albis [26] proposes a new hybrid columnar format to be used with modern hardware. Frameworks like MongoSpark [17] and Databricks Spark-Redshift [12] connect a specific backend to Spark. There are also extensions that improve existing reading implementations: the Intel Optimized Analytics Package (OAP) [9] and the JVM Arrow Dataset API [33]. Our connector has a different approach to memory allocation and backward compatibility, and we support more file types at the moment. Databricks [13] also provides several connectors separating execution from data layers, such as JDBC and ODBC connectors [14]. The difference between such projects and our work is that we provide a separation layer between execution and data ingestion with high interoperability for any (Arrow-enabled) datasource. There is no official support for using core (Scala) Spark together with core Arrow (C++); our system provides such support.

5 CONCLUSION
We investigated decoupling the computation and the data (ingestion) layers. This is needed to enable interoperability and reduce the amount of data conversion. We built a prototype that connects Arrow-enabled data sources to Apache Spark, effectively decoupling Spark's computation from Arrow's data ingestion. Through our experimentation, we concluded that using connectors like ours is zero-cost: Arrow-Spark not only retains overall performance, but in some cases even significantly improves it. Next to improved performance, we gain the ability to access any datasource that implements support for Arrow, allowing us to connect to many different data processing frameworks, data storage systems, and file formats.

REFERENCES
[1] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 265–283.
[2] Anonymous. 2021. Arrow-Spark. https://github.com/Sebastiaan-Alvarez-Rodriguez/arrow-spark-publication.
[3] Apache. [n.d.]. Apache ORC - High-Performance Columnar Storage for Hadoop. https://orc.apache.org/. Accessed: 2020-11-06.
[4] Apache Software Foundation. [n.d.]. Hadoop. https://hadoop.apache.org
[5] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. 2015. Spark SQL: Relational data processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1383–1394.
[6] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36, 4 (2015).
[7] David Chappell et al. 2009. Introducing Windows Azure. Microsoft, Inc, Tech. Rep (2009).
[8] Kristina Chodorow. 2013. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. O'Reilly Media, Inc.
[9] Intel Corporation. [n.d.]. Optimized Analytics Package for Spark Platform (OAP). https://github.com/Intel-bigdata/OAP. Accessed: 2021-01-15.
[10] cuDF community. [n.d.]. cuDF - GPU DataFrames. https://github.com/rapidsai/cudf. Accessed: 2020-11-16.
[11] Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, et al. 2016. The Snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226.
[12] Databricks. [n.d.]. Spark-Redshift. https://github.com/databricks/spark-redshift. Accessed: 2021-02-01.
[13] Databricks. [n.d.]. Databricks. https://databricks.com/. Accessed: 2021-02-01.
[14] Databricks. [n.d.]. Databricks SQL databases documentation. https://docs.databricks.com/data/data-sources/sql-databases.html. Accessed: 2021-02-01.
[15] Martin Hirzel, Henrique Andrade, Bugra Gedik, Gabriela Jacques-Silva, Rohit Khandekar, Vibhore Kumar, Mark Mendell, Howard Nasgaard, Scott Schneider, Robert Soulé, et al. 2013. IBM Streams Processing Language: Analyzing big data in motion. IBM Journal of Research and Development 57, 3/4 (2013), 7–1.
[16] Muhammad Hussain Iqbal and Tariq Rahim Soomro. 2015. Big data analysis: Apache Spark perspective. International Journal of Computer Trends and Technology 19, 1 (2015), 9–14.
[17] Ross Lawley. [n.d.]. Apache Spark Connector for MongoDB. https://www.slideshare.net/mongodb/how-to-connect-spark-to-your-own-datasource, https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html. Accessed: 2020-11-06.
[18] Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, and A Gittens. 2016. H5Spark: Bridging the I/O gap between Spark and scientific data formats on HPC systems. Cray User Group (2016).
[19] MATLAB. 2010. Version 7.10.0 (R2010a). The MathWorks Inc., Natick, Massachusetts.
[20] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. 2018. Ray: A distributed framework for emerging AI applications. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (Carlsbad, CA, USA) (OSDI'18). USENIX Association, USA, 561–577.
[21] Matthew Rocklin. 2015. Dask: Parallel computation with blocked algorithms and task scheduling. In Proceedings of the 14th Python in Science Conference, Vol. 126. Citeseer.
[22] R. Shree, T. Choudhury, S. C. Gupta, and P. Kumar. 2017. KAFKA: The modern platform for data management and analysis in big data domain. In 2017 2nd International Conference on Telecommunication and Networks (TEL-NET). 1–5. https://doi.org/10.1109/TEL-NET.2017.8343593
[23] Arrow Development Team. [n.d.]. https://arrow.apache.org/docs/python/dataset.html#reading-from-cloud-storage. Accessed: 2020-11-08.
[24] Arrow Development Team. 2018. Apache Arrow. https://arrow.apache.org.
[25] Arrow Development Team. 2018. Apache Arrow Java implementation. https://arrow.apache.org/docs/java/.
[26] Animesh Trivedi, Patrick Stuedi, Jonas Pfefferle, Adrian Schuepbach, and Bernard Metzler. 2018. Albis: High-performance file format for big data systems. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). 615–630.
[27] The univocity team. [n.d.]. univocity-parsers. https://github.com/uniVocity/univocity-parsers. Accessed: 2021-02-01.
[28] Alexandru Uta, Alexandru Custura, Dmitry Duplyakin, Ivo Jimenez, Jan Rellermeyer, Carlos Maltzahn, Robert Ricci, and Alexandru Iosup. 2020. Is big data performance reproducible in modern cloud networks?. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). 513–527.
[29] Guido Van Rossum et al. 1991. Python.
[30] Deepak Vohra. 2016. Apache Parquet. In Practical Hadoop Ecosystem. Springer, 325–335.
[31] Reynold S Xin, Joseph E Gonzalez, Michael J Franklin, and Ion Stoica. 2013. GraphX: A resilient distributed graph system on Spark. In First International Workshop on Graph Data Management Experiences and Systems. 1–6.
[32] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, Ion Stoica, et al. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10 (2010), 95.
[33] Hongze Zhang. [n.d.]. ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++. https://github.com/zhztheplayer/arrow-1/tree/ARROW-7808. Accessed: 2020-09-14.