Mainlining Databases: Supporting Fast Transactional Workloads on Universal Columnar Data File Formats

Tianyu Li (litianyu@mit.edu), MIT
Matthew Butrovich (mbutrovi@cs.cmu.edu), Carnegie Mellon University
Amadou Ngom (angom@cs.cmu.edu), Carnegie Mellon University
Wan Shen Lim (wanshenl@cs.cmu.edu), Carnegie Mellon University
Wes McKinney (wes@ursalabs.org), Ursa Labs
Andrew Pavlo (pavlo@cs.cmu.edu), Carnegie Mellon University

arXiv:2004.14471v1 [cs.DB] 29 Apr 2020

ABSTRACT

The proliferation of modern data processing tools has given rise to open-source columnar data formats. The advantage of these formats is that they help organizations avoid repeatedly converting data to a new format for each application. These formats, however, are read-only, and organizations must use a heavy-weight transformation process to load data from on-line transactional processing (OLTP) systems. We aim to reduce or even eliminate this process by developing a storage architecture for in-memory database management systems (DBMSs) that is aware of the eventual usage of its data and emits columnar storage blocks in a universal open-source format. We introduce relaxations to common analytical data formats to efficiently update records and rely on a lightweight transformation process to convert blocks to a read-optimized layout when they are cold. We also describe how to access data from third-party analytical tools with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into the DB-X DBMS. Our experiments show that our approach achieves comparable performance with dedicated OLTP DBMSs while enabling orders-of-magnitude faster data exports to external data science and machine learning tools than existing methods.

1 INTRODUCTION

Data analysis pipelines allow organizations to extrapolate insights from data residing in their OLTP systems. The tools in these pipelines often use open-source binary formats, such as Apache Parquet [9], Apache ORC [8], and Apache Arrow [3]. Such formats allow disparate systems to exchange data through a common interface without converting between proprietary formats. But these formats target write-once, read-many workloads and are not amenable to OLTP systems. This means that a data scientist must transform OLTP data with a heavy-weight process, which is computationally expensive and inhibits analytical operations.

Although a DBMS can perform some analytical duties, modern data science workloads often involve specialized frameworks, such as TensorFlow, PyTorch, and Pandas. Organizations are also heavily invested in personnel, tooling, and infrastructure for the current data science eco-system of Python tools. We contend that the need for DBMSs to efficiently export large amounts of data to external tools will persist. To enable analysis of data as soon as it arrives in a database, and to deliver performance gains across the entire data analysis pipeline, we should look to improve a DBMS's interoperability with external tools.

If an OLTP DBMS directly stores data in a format used by downstream applications, the export cost is just the cost of network transmission. The challenge is that most open-source formats are optimized for read/append operations, not in-place updates. Meanwhile, divergence from the target format in the OLTP DBMS translates into more transformation overhead when exporting data, which can be equally detrimental to performance. A viable design must strike a balance between these two conflicting considerations.

To address this challenge, we present a multi-versioned DBMS that operates on a relaxation of an open-source columnar format to support efficient OLTP modifications. The relaxed format can then be transformed into the canonical format with a light-weight in-memory process as data cools. We implemented our storage and concurrency control architecture in DB-X [10] and evaluated its performance. We target Apache Arrow, although our approach is also applicable to other columnar formats. Our results show that we achieve good performance on OLTP workloads operating on the relaxed Arrow format. We also implemented an Arrow export layer for our system and show that it facilitates orders-of-magnitude faster exports to external tools.

The remainder of this paper is organized as follows: we first discuss the motivation for this work in Section 2. We then present our storage architecture and concurrency control in Section 3, followed by our transformation algorithm in Section 4. In Section 5, we discuss how to export data to external tools. We present our experimental evaluation in Section 6 and discuss related work in Section 7.
2 BACKGROUND

We now discuss the challenges of running analysis with external tools on data in OLTP DBMSs. We begin by describing how data transformation and movement are bottlenecks in data processing. We then present a popular open-source format (Apache Arrow) and discuss its strengths and weaknesses.

[Figure 1: Data Transformation Costs – Time taken to load a TPC-H table into Pandas with different approaches (In-Memory: 8.38 s; CSV: 284.46 s; Python ODBC: 1380.3 s), broken down into PostgreSQL query processing, ODBC export, CSV export, and CSV load time.]

2.1 Data Movement and Transformation

A data processing pipeline consists of a front-end OLTP layer and multiple analytical layers. OLTP engines employ the n-ary storage model (i.e., row-store) to support efficient single-tuple operations, while the analytical layers use the decomposition storage model (i.e., column-store) to speed up large scans [22, 29, 38, 43]. Because of the conflicting optimization strategies for these two use cases, organizations often implement the pipeline by combining specialized systems. The most salient issue with this bifurcated approach is data transformation and movement between layers. This problem is made worse with the emergence of machine learning workloads that load the entire data set instead of a small query result set. For example, a data scientist will (1) execute SQL queries to export data from PostgreSQL, (2) load it into a Jupyter notebook on a local machine and prepare it with Pandas, and (3) train models on the cleaned data with TensorFlow. Each step in such a pipeline transforms data into a format native to the target framework: a disk-optimized row-store for PostgreSQL, DataFrames for Pandas, and tensors for TensorFlow. The slowest of all transformations is from the DBMS to Pandas because it retrieves data over the DBMS's network protocol and then rewrites it into the desired columnar format. This process is not optimal for high-bandwidth data movement [46]. Many organizations instead employ costly extract-transform-load (ETL) pipelines that run only nightly, introducing delays to analytics.

To better understand this issue, we measured the time it takes to extract data from PostgreSQL (v10.6) and load it into a Pandas program. We use the LINEITEM table from TPC-H with scale factor 10 (60M tuples, 8 GB as a CSV file, 11 GB as a PostgreSQL table). We compare three approaches for loading the table into the Python program: (1) SQL over a Python ODBC connection, (2) using PostgreSQL's COPY command to export a CSV file to disk and then loading it into Pandas, and (3) loading data directly from a buffer already in the Python runtime's memory. The last method represents the theoretical best-case scenario and provides an upper bound on data export speed. We pre-load the entire table into PostgreSQL's buffer pool using the pg_prewarm extension. To simplify our setup, we run the Python program on the same machine as the DBMS. We use a machine with 128 GB of memory, of which we reserve 15 GB for PostgreSQL's shared buffers. We provide a full description of our operating environment for this experiment in Section 6.

The results in Figure 1 show that ODBC and CSV are orders of magnitude slower than what is possible. This difference is because of the overhead of transforming to a different format, as well as excessive serialization in the PostgreSQL wire protocol. Query processing itself takes 0.004% of the total export time. The rest of the time is spent in the serialization layer and in transforming the data. Optimizing this export process would significantly speed up analytics pipelines.
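For concreteness, the following is a minimal sketch of the three load paths compared in Figure 1 (ODBC, CSV, and an in-memory buffer). It is not the benchmark harness used in the paper; the DSN name, file path, and placeholder table contents are illustrative.

import time

import pandas as pd
import pyarrow as pa
import pyodbc


def timed(label, load):
    # Run a load function and report its wall-clock time.
    start = time.perf_counter()
    df = load()
    print(f"{label}: {time.perf_counter() - start:.2f} s ({len(df)} rows)")
    return df


# (1) SQL over a Python ODBC connection (the DSN is hypothetical).
conn = pyodbc.connect("DSN=postgres_tpch")
timed("ODBC", lambda: pd.read_sql("SELECT * FROM lineitem", conn))

# (2) CSV exported beforehand with PostgreSQL's COPY command, e.g.
#       COPY lineitem TO '/tmp/lineitem.csv' WITH (FORMAT csv);
#     and then loaded into Pandas from disk.
timed("CSV", lambda: pd.read_csv("/tmp/lineitem.csv", header=None))

# (3) Best case: the data already resides in the Python process as an
#     Arrow table, so "loading" is only a conversion to a DataFrame.
#     (Placeholder rows; the experiment uses the full LINEITEM table.)
arrow_table = pa.table({"l_orderkey": [1, 2, 3], "l_quantity": [17.0, 36.0, 8.0]})
timed("In-memory", lambda: arrow_table.to_pandas())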
2.2 Column-Stores and Apache Arrow

The inefficiency of loading data through a SQL interface requires us to rethink the data export process and avoid costly data transformations. The lack of interoperability between row-stores and analytical columnar formats is a major source of inefficiency. As discussed previously, OLTP DBMSs are row-stores because conventional wisdom holds that column-stores are inferior for OLTP workloads. Recent work, however, has shown that column-stores can also support high-performance transactional processing [45, 48]. We propose implementing a high-performance OLTP DBMS directly on top of a data format used by analytics tools. To do so, we select a representative format (Apache Arrow) and analyze its strengths and weaknesses for OLTP workloads.

[Figure 2: SQL Table to Arrow – An example of using Arrow's API to describe a SQL table's schema in Python.]

Apache Arrow is a cross-language development platform for in-memory data [3]. In the early 2010s, developers from Apache Drill, Apache Impala, Apache Kudu, Pandas, and others independently explored universal in-memory columnar data formats. These groups joined together in 2015 to develop a shared format based on their overlapping requirements.

[Figure: A logical table with columns ID (int) and Name (string), with rows (101, "JOE"), (102, null), (103, …), and its Arrow columnar representation.]
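The body of Figure 2 does not survive text extraction; the following is a sketch of what such a schema declaration looks like with pyarrow, using the ID/Name table above. Column names, types, and row values are illustrative.

import pyarrow as pa

# Describe the SQL table's schema with Arrow's Python API (in the spirit
# of Figure 2); the columns mirror the logical table shown above.
schema = pa.schema([
    pa.field("id", pa.int32(), nullable=False),
    pa.field("name", pa.string()),
])
print(schema)

# An Arrow table holding the logical table's rows under that schema;
# the missing name for ID 102 becomes a null in the Name column.
table = pa.table(
    {"id": [101, 102, 103], "name": ["JOE", None, "MARK"]},
    schema=schema,
)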