Yoursql: a High-Performance Database System Leveraging In-Storage Computing
Total Page:16
File Type:pdf, Size:1020Kb
YourSQL: A High-Performance Database System Leveraging In-Storage Computing Insoon Jo, Duck-Ho Bae, Andre S. Yoon, Jeong-Uk Kang, Sangyeun Cho, Daniel DG Lee, Jaeheon Jeong Memory Business, Samsung Electronics Co. ABSTRACT fect solution, either. Leaving the expense aside, the amount This paper presents YourSQL, a database system that ac- of data transferred from storage devices remains unchanged celerates data-intensive queries with the help of additional because the data must be transferred in any case to filter in-storage computing capabilities. YourSQL realizes very servers or FPGAs before being filtered. early filtering of data by offloading data scanning of a query Contrary to the prior work, we argue that early filtering to user-programmable solid-state drives. We implement our may well take place at the earliest point possible|within system on a recent branch of MariaDB (a variant of MySQL). a storage device. Besides a fundamental computer science In order to quantify the performance gains of YourSQL, we principle|when operating on large datasets, do not move evaluate SQL queries with varying complexities. Our re- data from disk unless absolutely necessary|, modern solid- sult shows that YourSQL reduces the execution time of the state drives (SSDs) are not a dumb storage any longer [7, whole TPC-H queries by 3.6×, compared to a vanilla sys- 9, 11, 13, 16, 17, 20]. Therefore, we have explored a novel tem. Moreover, the average speed-up of the five TPC-H database system architecture where SSDs offer compute ca- queries with the largest performance gains reaches over 15×. pabilities to realize faster query responses. Some prior work Thanks to this significant reduction of execution time, we aims to quantify the benefits of in-storage query process- observe sizable energy savings. Our study demonstrates that ing. Do et al. [11] built a \smart SSD" prototype, where the YourSQL approach, combining the power of early filter- SSDs are in charge of the whole query processing. Even ing with end-to-end datapath optimization, can accelerate though this work lays the basis for in-storage query pro- large-scale analytic queries with lower energy consumption. cessing, there remain large areas for further research due to its limitations. First, it focuses on proving the concept of SSD-based query processing but pays little attention to 1. INTRODUCTION realizing a realistic database system architecture. Not only Delivering end results quickly for data-intensive queries is would join queries be unsupported, but internal represen- key to successful data warehousing, business intelligence, tation and layout of data must be converted to achieve and analytics applications. An intuitive (and efficient) way reasonable speed-up. Moreover, the hardware targeted by to speed them up is to reduce the volume of data being trans- this work (i.e., SATA/SAS SSDs) is outdated and the cor- ferred across storage network to a host system. This can be responding results may not hold for future systems. Indeed, achieved by either filtering out extraneous data or transfer- its performance advantages mainly result from the higher ring intermediate and/or final computation results [18]. internal bandwidth inside an SSD compared to the exter- The former approach is called early filtering, a typical ex- nal SSD bandwidth limited by a typical host interface (like ample of data-intensive, non-complex data processing. Due SATA/SAS), but the assumption of such a large bandwidth to its simplicity and effectiveness, it has seen many forms gap (i.e., a gap of 2.8× or larger) does not hold for a state- of implementations [3, 12, 19, 22]. However, existing imple- of-the-art SSD. We feel a strong need for a practical system mentations do not fully utilize modern solid-state storage design that would work on a state-of-the-art SSD and be device technology and are limited in cost-effectiveness and able to accelerate complex queries as well. performance. First, the software filters [3] presuppose filter This paper presents YourSQL, a database system archi- metadata like index, but the overheads of metadata man- tecture that leverages in-storage computing (ISC) for query agement can be prohibitively costly. Thus, high-end filter offloading to SSDs. We built a prototype of YourSQL that servers [19] or FPGAs [12, 22] are employed between disk en- supports ISC-enabled early filtering. Our prototype works closures and database servers. However, they are not a per- on a commodity NVMe SSD that offers user-programmability and provides built-in support for advanced hardware IPs (e.g., pattern matcher) as well [13]. Not only would extra- neous data reduction take place within SSDs, but YourSQL would request relevant pages only. Moreover, it leverages the This work is licensed under the Creative Commons Attribution- hardware pattern matcher in the target SSD, which makes NonCommercial-NoDerivatives 4.0 International License. To view a copy our system sustain high, on-the-fly table scan throughput. of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For Our findings and contributions are summarized as follows: any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 12 • We seamlessly integrated query offloading to SSDs into Copyright 2016 VLDB Endowment 2150-8097/16/08. MySQL. No prior work reports a full port of a ma- 924 jor DB engine. We modified its query planner to in- clude new algorithm that automatically identifies cer- Query 1: A simple selection query. tain query operations to offload to the SSD. Portions SELECT p partkey, p mfgr FROM part of its storage engine are rewritten so that those oper- WHERE p size = 15 AND p type LIKE '%BRASS'; ations are offloaded to the SSD at run time. • Contrary to the prior work, YourSQL runs all 22 TPC- Query 2: TPC-H Q2. H queries. With YourSQL, the average speed-up for SELECT s acctbal, s name, n name, p partkey, p mfgr, the five TPC-H queries with the largest performance s address, s phone, s comment FROM part, supplier, partsupp, nation, region gains is over 15×. Also, the execution time of the WHERE p partkey = ps partkey whole TPC-H queries is reduced by 3.6× compared to AND s suppkey = ps suppkey \vanilla" system. With the significant reduction of the AND p size = 15 AND p type LIKE '%BRASS' AND s nationkey = n nationkey execution time, it achieves substantially lower energy AND n regionkey = r regionkey consumption as well (e.g., about 24× less energy con- AND r name = 'EUROPE' sumption in the case of TPC-H Query 2). AND ps supplycost = (SELECT MIN(ps supplycost) FROM partsupp, supplier, nation, region WHERE p partkey = ps partkey • We present detailed state-of-the-art techniques for ac- AND s suppkey = ps suppkey celerating large-scale data processing involving pattern AND s nationkey = n nationkey matching (e.g., grep, html/xml parsing, table scan) AND n regionkey = r regionkey AND r name = 'EUROPE') with the help of SSD. ORDER BY s acctbal DESC, n name, s name, p partkey LIMIT 100; In the remainder of this paper, we introduce the motiva- tions behind YourSQL's early filtering approach and its over- requests in both cases. ICP is an early filtering feature of all architecture in Section 2. The implementation details of MySQL that can be enabled when the filter columns of a YourSQL are described in Section 3. Section 4 provides the given query are preindexed. In order to enable ICP for the methodologies of our evaluation and the results. We discuss queries in this example, we first preprocessed the dataset by related works in Section 5 and conclude in Section 6. creating a multi-column index on p size and p type. Query 1 is a simplified version of Query 2 and refers to the 2. YOURSQL part table only. Its filter predicates are very limiting with the high filtering ratio of 0.067, that is, only 6.7% of pages in this table are needed to answer this query 2. Compared to a 2.1 Motivational Example: Early Filtering single table query, a join query can potentially benefit more The early filtering approach is usually considered one of from the early filtering approach. For the effective early the most effective ways to achieve the acceleration of data- filtering of a join query, its join order should be optimized intensive queries. The level of potential improvement varies first, that is, the early filtering target should be placed first depending on selectivity, which represents a fraction of rele- in the join order. With the (early) filtered rows placed first vant rows from a row set (e.g., a table, a view or the result in the join order, intermediate row sets to be processed could of a join) in a given data set. The value ranges from zero to significantly be reduced at the earliest stage of join. one, and zero is the highest selectivity value that means no Table 1 shows how the query plan and the resulting read rows will be selected from a row set. amounts change in the presence of ICP for Query 2. We One may think that higher selectivity guarantees higher show top query plans only since subquery plans are iden- speed-up with early filtering. However, it is not always the tical in both cases. In this table, Access method and Key case because the selectivity, which is calculated at row-level, represent the access method to each join table (e.g., full is not always quantitatively correlated with the amount of scan, index scan) and the key (index) column that MySQL page I/O that can be reduced by early filtering.