Column-Stores Vs. Row-Stores: How Different Are They Really?

Daniel J. Abadi, Yale University, New Haven, CT, USA, [email protected]
Samuel R. Madden, MIT, Cambridge, MA, USA, [email protected]
Nabil Hachem, AvantGarde Consulting, LLC, Shrewsbury, MA, USA, [email protected]

ABSTRACT

There has been a significant amount of excitement and recent work on column-oriented database systems (“column-stores”). These database systems have been shown to perform more than an order of magnitude better than traditional row-oriented database systems (“row-stores”) on analytical workloads such as those found in data warehouses, decision support, and business intelligence applications. The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only queries since they only have to read from disk (or from memory) those attributes accessed by a query.

This simplistic view leads to the assumption that one can obtain the performance benefits of a column-store using a row-store: either by vertically partitioning the schema, or by indexing every column so that columns can be accessed independently. In this paper, we demonstrate that this assumption is false. We compare the performance of a commercial row-store under a variety of different configurations with a column-store and show that the row-store performance is significantly slower on a recently proposed data warehouse benchmark. We then analyze the performance difference and show that there are some important differences between the two systems at the query executor level (in addition to the obvious differences at the storage layer level). Using the column-store, we then tease apart these differences, demonstrating the impact on performance of a variety of column-oriented query execution techniques, including vectorized query processing, compression, and a new join algorithm we introduce in this paper. We conclude that while it is not impossible for a row-store to achieve some of the performance advantages of a column-store, changes must be made to both the storage layer and the query executor to fully obtain the benefits of a column-oriented approach.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems—Query processing, Relational databases

General Terms

Experimentation, Performance, Measurement

Keywords

C-Store, column-store, column-oriented DBMS, invisible join, compression, tuple reconstruction, tuple materialization.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'08, June 9–12, 2008, Vancouver, BC, Canada.
Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00.

1. INTRODUCTION

Recent years have seen the introduction of a number of column-oriented database systems, including MonetDB [9, 10] and C-Store [22]. The authors of these systems claim that their approach offers order-of-magnitude gains on certain workloads, particularly on read-intensive analytical processing workloads, such as those encountered in data warehouses.

Indeed, papers describing column-oriented database systems usually include performance results showing such gains against traditional, row-oriented databases (either commercial or open source). These evaluations, however, typically benchmark against row-oriented systems that use a “conventional” physical design consisting of a collection of row-oriented tables with a more-or-less one-to-one mapping to the tables in the logical schema. Though such results clearly demonstrate the potential of a column-oriented approach, they leave open a key question: Are these performance gains due to something fundamental about the way column-oriented DBMSs are internally architected, or would such gains also be possible in a conventional system that used a more column-oriented physical design?

Often, designers of column-based systems claim there is a fundamental difference between a from-scratch column-store and a row-store using column-oriented physical design without actually exploring alternate physical designs for the row-store system. Hence, one goal of this paper is to answer this question in a systematic way. One of the authors of this paper is a professional DBA specializing in a popular commercial row-oriented database. He has carefully implemented a number of different physical database designs for a recently proposed data warehousing benchmark, the Star Schema Benchmark (SSBM) [18, 19], exploring designs that are as “column-oriented” as possible (in addition to more traditional designs), including:

• Vertically partitioning the tables in the system into a collection of two-column tables consisting of (table key, attribute) pairs, so that only the necessary columns need to be read to answer a query.

• Using index-only plans; by creating a collection of indices that cover all of the columns used in a query, it is possible for the database system to answer a query without ever going to the underlying (row-oriented) tables.

• Using a collection of materialized views such that there is a view with exactly the columns needed to answer every query in the benchmark. Though this approach uses a lot of space, it is the ‘best case’ for a row-store, and provides a useful point of comparison to a column-store implementation.

We compare the performance of these various techniques to the baseline performance of the open-source C-Store database [22] on the SSBM, showing that, despite the ability of the above methods to emulate the physical structure of a column-store inside a row-store, their query processing performance is quite poor. Hence, one contribution of this work is showing that there is in fact something fundamental about the design of column-store systems that makes them better suited to data-warehousing workloads. This is important because it puts to rest a common claim that it would be easy for existing row-oriented vendors to adopt a column-oriented physical database design. We emphasize that our goal is not to find the fastest performing implementation of SSBM in our row-oriented database, but to evaluate the performance of specific, “columnar” physical implementations, which leads us to a second question: Which of the many column-database specific optimizations proposed in the literature are most responsible for the significant performance advantage of column-stores over row-stores on warehouse workloads?

Prior research has suggested that important optimizations specific to column-oriented DBMSs include:

• Late materialization (when combined with the block iteration optimization below, this technique is also known as vectorized query processing [9, 25]), where columns read off disk are joined together into rows as late as possible in a query plan [5].

• Block iteration [25], where multiple values from a column are passed as a block from one operator to the next, rather than using Volcano-style per-tuple iterators [11]. If the values are fixed-width, they are iterated through as an array.

• Column-specific compression techniques, such as run-length encoding, with direct operation on compressed data when using late-materialization plans [4].

We also propose a new optimization, called invisible joins, which substantially improves join performance in late-materialization column stores, especially on the types of schemas found in data warehouses.

1. We show that trying to emulate a column-store in a row-store does not yield good performance results, and that a variety of techniques typically seen as “good” for warehouse performance (index-only plans, bitmap indices, etc.) do little to improve the situation.

2. We propose a new technique for improving join performance in column stores called invisible joins. We demonstrate experimentally that, in many cases, the execution of a join using this technique can perform as well as or better than selecting and extracting data from a single denormalized table where the join has already been materialized. We thus conclude that denormalization, an important but expensive (in space requirements) and complicated (in deciding in advance what tables to denormalize) performance enhancing technique used in row-stores (especially data warehouses) is not necessary in column-stores (or can be used with greatly reduced cost and complexity).

3. We break down the sources of column-database performance on warehouse workloads, exploring the contribution of late-materialization, compression, block iteration, and invisible joins on overall system performance. Our results validate previous claims of column-store performance on a new data warehousing benchmark (the SSBM), and demonstrate that simple column-oriented operation – without compression and late materialization – does not dramatically outperform well-optimized row-store designs.

The rest of this paper is organized as follows: we begin by describing prior work on column-oriented databases, including surveying past performance comparisons and describing some of the architectural innovations that have been proposed for column-oriented DBMSs (Section 2); then, we review the SSBM (Section 3). We then describe the physical database design techniques used in our row-oriented system (Section 4), and the physical layout and query execution techniques used by the C-Store system (Section 5). We then present performance comparisons between the two systems, first contrasting our row-oriented designs to the baseline C-Store performance and then decomposing the performance of C-Store to measure which of the techniques it employs for efficient query execution are most effective on the SSBM (Section 6).

2. BACKGROUND AND PRIOR WORK

In this section, we briefly present related efforts to characterize column-store performance relative to traditional row-stores.

Although the idea of vertically partitioning database tables to improve performance has been around a long time [1, 7, 16], the
