Column-Stores Vs. Row-Stores: How Different Are They Really?

Total Page:16

File Type:pdf, Size:1020Kb

Column-Stores Vs. Row-Stores: How Different Are They Really? Column-Stores vs. Row-Stores: How Different Are They Really? Daniel J. Abadi Samuel R. Madden Nabil Hachem Yale University MIT AvantGarde Consulting, LLC New Haven, CT, USA Cambridge, MA, USA Shrewsbury, MA, USA [email protected] [email protected] [email protected] ABSTRACT General Terms There has been a significant amount of excitement and recent work Experimentation, Performance, Measurement on column-oriented database systems (“column-stores”). These database systems have been shown to perform more than an or- Keywords der of magnitude better than traditional row-oriented database sys- tems (“row-stores”) on analytical workloads such as those found in C-Store, column-store, column-oriented DBMS, invisible join, com- data warehouses, decision support, and business intelligence appli- pression, tuple reconstruction, tuple materialization. cations. The elevator pitch behind this performance difference is straightforward: column-stores are more I/O efficient for read-only 1. INTRODUCTION queries since they only have to read from disk (or from memory) Recent years have seen the introduction of a number of column- those attributes accessed by a query. oriented database systems, including MonetDB [9, 10] and C-Store [22]. This simplistic view leads to the assumption that one can ob- The authors of these systems claim that their approach offers order- tain the performance benefits of a column-store using a row-store: of-magnitude gains on certain workloads, particularly on read-intensive either by vertically partitioning the schema, or by indexing every analytical processing workloads, such as those encountered in data column so that columns can be accessed independently. In this pa- warehouses. per, we demonstrate that this assumption is false. We compare the Indeed, papers describing column-oriented database systems usu- performance of a commercial row-store under a variety of differ- ally include performance results showing such gains against tradi- ent configurations with a column-store and show that the row-store tional, row-oriented databases (either commercial or open source). performance is significantly slower on a recently proposed data These evaluations, however, typically benchmark against row-orient- warehouse benchmark. We then analyze the performance differ- ed systems that use a “conventional” physical design consisting of ence and show that there are some important differences between a collection of row-oriented tables with a more-or-less one-to-one the two systems at the query executor level (in addition to the obvi- mapping to the tables in the logical schema. Though such results ous differences at the storage layer level). Using the column-store, clearly demonstrate the potential of a column-oriented approach, we then tease apart these differences, demonstrating the impact on they leave open a key question: Are these performance gains due performance of a variety of column-oriented query execution tech- to something fundamental about the way column-oriented DBMSs niques, including vectorized query processing, compression, and a are internally architected, or would such gains also be possible in new join algorithm we introduce in this paper. We conclude that a conventional system that used a more column-oriented physical while it is not impossible for a row-store to achieve some of the design? performance advantages of a column-store, changes must be made Often, designers of column-based systems claim there is a funda- to both the storage layer and the query executor to fully obtain the mental difference between a from-scratch column-store and a row- benefits of a column-oriented approach. store using column-oriented physical design without actually ex- ploring alternate physical designs for the row-store system. Hence, one goal of this paper is to answer this question in a systematic way. One of the authors of this paper is a professional DBA spe- Categories and Subject Descriptors cializing in a popular commercial row-oriented database. He has H.2.4 [Database Management]: Systems—Query processing, Re- carefully implemented a number of different physical database de- lational databases signs for a recently proposed data warehousing benchmark, the Star Schema Benchmark (SSBM) [18, 19], exploring designs that are as “column-oriented” as possible (in addition to more traditional de- signs), including: Permission to make digital or hard copies of all or part of this work for Vertically partitioning the tables in the system into a collec- personal or classroom use is granted without fee provided that copies are tion of two-column tables consisting of (table key, attribute) not made or distributed for profit or commercial advantage and that copies pairs, so that only the necessary columns need to be read to bear this notice and the full citation on the first page. To copy otherwise, to answer a query. republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'08, June 9–12, 2008, Vancouver, BC, Canada. Using index-only plans; by creating a collection of indices Copyright 2008 ACM 978-1-60558-102-6/08/06 ...$5.00. that cover all of the columns used in a query, it is possible 967 for the database system to answer a query without ever going 1. We show that trying to emulate a column-store in a row-store to the underlying (row-oriented) tables. does not yield good performance results, and that a variety of techniques typically seen as ”good” for warehouse perfor- Using a collection of materialized views such that there is a mance (index-only plans, bitmap indices, etc.) do little to view with exactly the columns needed to answer every query improve the situation. in the benchmark. Though this approach uses a lot of space, it is the `best case’ for a row-store, and provides a useful 2. We propose a new technique for improving join performance point of comparison to a column-store implementation. in column stores called invisible joins. We demonstrate ex- perimentally that, in many cases, the execution of a join us- We compare the performance of these various techniques to the ing this technique can perform as well as or better than se- baseline performance of the open-source C-Store database [22] on lecting and extracting data from a single denormalized ta- the SSBM, showing that, despite the ability of the above methods ble where the join has already been materialized. We thus to emulate the physical structure of a column-store inside a row- conclude that denormalization, an important but expensive store, their query processing performance is quite poor. Hence, one (in space requirements) and complicated (in deciding in ad- contribution of this work is showing that there is in fact something vance what tables to denormalize) performance enhancing fundamental about the design of column-store systems that makes technique used in row-stores (especially data warehouses) is them better suited to data-warehousing workloads. This is impor- not necessary in column-stores (or can be used with greatly tant because it puts to rest a common claim that it would be easy reduced cost and complexity). for existing row-oriented vendors to adopt a column-oriented phys- 3. We break-down the sources of column-database performance ical database design. We emphasize that our goal is not to find the on warehouse workloads, exploring the contribution of late- fastest performing implementation of SSBM in our row-oriented materialization, compression, block iteration, and invisible database, but to evaluate the performance of specific, “columnar” joins on overall system performance. Our results validate physical implementations, which leads us to a second question: previous claims of column-store performance on a new data Which of the many column-database specic optimizations pro- warehousing benchmark (the SSBM), and demonstrate that posed in the literature are most responsible for the signicant per- simple column-oriented operation – without compression and formance advantage of column-stores over row-stores on warehouse late materialization – does not dramatically outperform well- workloads? optimized row-store designs. Prior research has suggested that important optimizations spe- cific to column-oriented DBMSs include: The rest of this paper is organized as follows: we begin by de- scribing prior work on column-oriented databases, including sur- Late materialization (when combined with the block iteration veying past performance comparisons and describing some of the optimization below, this technique is also known as vector- architectural innovations that have been proposed for column-oriented ized query processing [9, 25]), where columns read off disk DBMSs (Section 2); then, we review the SSBM (Section 3). We are joined together into rows as late as possible in a query then describe the physical database design techniques used in our plan [5]. row-oriented system (Section 4), and the physical layout and query execution techniques used by the C-Store system (Section 5). We Block iteration [25], where multiple values from a column then present performance comparisons between the two systems, are passed as a block from one operator to the next, rather first contrasting our row-oriented designs to the baseline C-Store than using Volcano-style per-tuple iterators [11]. If the val- performance and then decomposing the performance of C-Store to ues are fixed-width, they are iterated through as an array. measure which of the techniques it employs for efficient query ex- ecution are most effective on the SSBM (Section 6). Column-specific compression techniques, such as run-length encoding, with direct operation on compressed data when us- ing late-materialization plans [4]. 2. BACKGROUND AND PRIOR WORK In this section, we briefly present related efforts to characterize We also propose a new optimization, called invisible joins, column-store performance relative to traditional row-stores. which substantially improves join performance in late-mat- Although the idea of vertically partitioning database tables to erialization column stores, especially on the types of schemas improve performance has been around a long time [1, 7, 16], the found in data warehouses.
Recommended publications
  • Cubes Documentation Release 1.0.1
    Cubes Documentation Release 1.0.1 Stefan Urbanek April 07, 2015 Contents 1 Getting Started 3 1.1 Introduction.............................................3 1.2 Installation..............................................5 1.3 Tutorial................................................6 1.4 Credits................................................9 2 Data Modeling 11 2.1 Logical Model and Metadata..................................... 11 2.2 Schemas and Models......................................... 25 2.3 Localization............................................. 38 3 Aggregation, Slicing and Dicing 41 3.1 Slicing and Dicing.......................................... 41 3.2 Data Formatters........................................... 45 4 Analytical Workspace 47 4.1 Analytical Workspace........................................ 47 4.2 Authorization and Authentication.................................. 49 4.3 Configuration............................................. 50 5 Slicer Server and Tool 57 5.1 OLAP Server............................................. 57 5.2 Server Deployment.......................................... 70 5.3 slicer - Command Line Tool..................................... 71 6 Backends 77 6.1 SQL Backend............................................. 77 6.2 MongoDB Backend......................................... 89 6.3 Google Analytics Backend...................................... 90 6.4 Mixpanel Backend.......................................... 92 6.5 Slicer Server............................................. 94 7 Recipes 97 7.1 Recipes...............................................
    [Show full text]
  • Data Analysis Expressions (DAX) in Powerpivot for Excel 2010
    Data Analysis Expressions (DAX) In PowerPivot for Excel 2010 A. Table of Contents B. Executive Summary ............................................................................................................................... 3 C. Background ........................................................................................................................................... 4 1. PowerPivot ...............................................................................................................................................4 2. PowerPivot for Excel ................................................................................................................................5 3. Samples – Contoso Database ...................................................................................................................8 D. Data Analysis Expressions (DAX) – The Basics ...................................................................................... 9 1. DAX Goals .................................................................................................................................................9 2. DAX Calculations - Calculated Columns and Measures ...........................................................................9 3. DAX Syntax ............................................................................................................................................ 13 4. DAX uses PowerPivot data types .........................................................................................................
    [Show full text]
  • (BI) Using MS Excel Powerpivot
    2018 ASCUE Proceedings Developing an Introductory Class in Business Intelligence (BI) Using MS Excel Powerpivot Dr. Sam Hijazi Trevor Curtis Texas Lutheran University 1000 West Court Street Seguin, Texas 78130 [email protected] Abstract Asking questions about your data is a constant application of all business organizations. To facilitate decision making and improve business performance, a business intelligence application must be an in- tegral part of everyday management practices. Microsoft Excel added PowerPivot and PowerPivot offi- cially to facilitate this process with minimum cost, knowing that many business people are already fa- miliar with MS Excel. This paper will design an introductory class to business intelligence (BI) using Excel PowerPivot. If an educator decides to adopt this paper for teaching an introductory BI class, students should have previ- ous familiarity with Excel’s functions and formulas. This paper will focus on four significant phases all students need to complete in a three-credit class. First, students must understand the process of achiev- ing small database normalization and how to bring these tables to Excel or develop them directly within Excel PowerPivot. This paper will walk the reader through these steps to complete the task of creating the normalization, along with the linking and bringing the tables and their relationships to excel. Sec- ond, an introduction to Data Analysis Expression (DAX) will be discussed. Introduction It is not that difficult to realize the increase in the amount of data we have generated in the recent memory of our existence as a human race. To realize that more than 90% of the world’s data has been amassed in the past two years alone (Vidas M.) is to realize the need to manage such volume.
    [Show full text]
  • Rdbmss Why Use an RDBMS
    RDBMSs • Relational Database Management Systems • A way of saving and accessing data on persistent (disk) storage. 51 - RDBMS CSC309 1 Why Use an RDBMS • Data Safety – data is immune to program crashes • Concurrent Access – atomic updates via transactions • Fault Tolerance – replicated dbs for instant failover on machine/disk crashes • Data Integrity – aids to keep data meaningful •Scalability – can handle small/large quantities of data in a uniform manner •Reporting – easy to write SQL programs to generate arbitrary reports 51 - RDBMS CSC309 2 1 Relational Model • First published by E.F. Codd in 1970 • A relational database consists of a collection of tables • A table consists of rows and columns • each row represents a record • each column represents an attribute of the records contained in the table 51 - RDBMS CSC309 3 RDBMS Technology • Client/Server Databases – Oracle, Sybase, MySQL, SQLServer • Personal Databases – Access • Embedded Databases –Pointbase 51 - RDBMS CSC309 4 2 Client/Server Databases client client client processes tcp/ip connections Server disk i/o server process 51 - RDBMS CSC309 5 Inside the Client Process client API application code tcp/ip db library connection to server 51 - RDBMS CSC309 6 3 Pointbase client API application code Pointbase lib. local file system 51 - RDBMS CSC309 7 Microsoft Access Access app Microsoft JET SQL DLL local file system 51 - RDBMS CSC309 8 4 APIs to RDBMSs • All are very similar • A collection of routines designed to – produce and send to the db engine an SQL statement • an original
    [Show full text]
  • Business Intelligence and Column-Oriented Databases
    Central____________________________________________________________________________________________________ European Conference on Information and Intelligent Systems Page 12 of 344 Business Intelligence and Column-Oriented Databases Kornelije Rabuzin Nikola Modrušan Faculty of Organization and Informatics NTH Mobile, University of Zagreb Međimurska 28, 42000 Varaždin, Croatia Pavlinska 2, 42000 Varaždin, Croatia [email protected] [email protected] Abstract. In recent years, NoSQL databases are popular document-oriented database systems is becoming more and more popular. We distinguish MongoDB. several different types of such databases and column- oriented databases are very important in this context, for sure. The purpose of this paper is to see how column-oriented databases can be used for data warehousing purposes and what the benefits of such an approach are. HBase as a data management Figure 1. JSON object [15] system is used to store the data warehouse in a column-oriented format. Furthermore, we discuss Graph databases, on the other hand, rely on some how star schema can be modelled in HBase. segment of the graph theory. They are good to Moreover, we test the performances that such a represent nodes (entities) and relationships among solution can provide and we compare them to them. This is especially suitable to analyze social relational database management system Microsoft networks and some other scenarios. SQL Server. Key value databases are important as well for a certain key you store (assign) a certain value. Keywords. Business Intelligence, Data Warehouse, Document-oriented databases can be treated as key Column-Oriented Database, Big Data, NoSQL value as long as you know the document id. Here we skip the details as it would take too much time to discuss different systems [21].
    [Show full text]
  • Microsoft Power BI DAX
    Course Outline Microsoft Power BI DAX Duration: 1 day This course will cover the ability to use Data Analysis Expressions (DAX) language to perform powerful data manipulations within Power BI Desktop. On our Microsoft Power BI course, you will learn how easy is to import data into the Power BI Desktop and create powerful and dynamic visuals that bring your data to life. But what if you need to create your own calculated columns and measures; to have complete control over the calculations you need to perform on your data? DAX formulas provide this capability and many other important capabilities as well. Learning how to create effective DAX formulas will help you get the most out of your data. When you get the information you need, you can begin to solve real business problems that affect your bottom line. This is Business Intelligence, and DAX will help you get there. To get the most out of this course You should be a competent Microsoft Excel user. You don’t need any experience of using DAX but we recommend that you attend our 2-day Microsoft Power BI course prior to taking this course. What you will learn:- Importing Your Data and Creating the Data Model CALCULATE Function Overview of importing data into the Power BI Desktop Exploring the importance of the CALCULATE function. and creating the Data Model. Using complex filters within CALCULATE using FILTER. Using ALLSELECTED Function. Using DAX Syntax used by DAX. Time Intelligence Functions Understanding DAX Data Types. Why Time Intelligence Functions? Creating a Date Table. Creating Calculated Columns Finding Month to Date, Year To Date, Previous Month How to use DAX expressions in Calculated Columns.
    [Show full text]
  • Ms Sql Server Alter Table Modify Column
    Ms Sql Server Alter Table Modify Column Grinningly unlimited, Wit cross-examine inaptitude and posts aesces. Unfeigning Jule erode good. Is Jody cozy when Gordan unbarricade obsequiously? Table alter column, tables and modifies a modified column to add a column even less space. The entity_type can be Object, given or XML Schema Collection. You can use the ALTER statement to create a primary key. Altering a delay from Null to Not Null in SQL Server Chartio. Opening consent management ebook and. Modifies a table definition by altering, adding, or dropping columns and constraints. RESTRICT returns a warning about existing foreign key references and does not recall the. In ms sql? ALTER to ALTER COLUMN failed because part or more. See a table alter table using page free cloud data tables with simple but block users are modifying an. SQL Server 2016 introduces an interesting T-SQL enhancement to improve. Search in all products. Use kitchen table select add another key with cascade delete for debate than if column. Columns can be altered in place using alter column statement. SQL and the resulting required changes to make via the Mapper. DROP TABLE Employees; This query will remove the whole table Employees from the database. Specifies the retention and policy for lock table. The default is OFF. It can be an integer, character string, monetary, date and time, and so on. The keyword COLUMN is required. The table is moved to the new location. If there an any violation between the constraint and the total action, your action is aborted. Log in ms sql server alter table to allow null in other sql server, table statement that can drop is.
    [Show full text]
  • CDW: a Conceptual Overview 2017
    CDW: A Conceptual Overview 2017 by Margaret Gonsoulin, PhD March 29, 2017 Thanks to: • Richard Pham, BISL/CDW for his mentorship • Heidi Scheuter and Hira Khan for organizing this session 3 Poll #1: Your CDW experience • How would you describe your level of experience with CDW data? ▫ 1- Not worked with it at all ▫ 2 ▫ 3 ▫ 4 ▫ 5- Very experienced with CDW data Agenda for Today • Get to the bottom of all of those acronyms! • Learn to think in “relational data” terms • Become familiar with the components of CDW ▫ Production and Raw Domains ▫ Fact and Dimension tables/views • Understand how to create an analytic dataset ▫ Primary and Foreign Keys ▫ Joining tables/views Agenda for Today • Get to the bottom of all of those acronyms! • Learn to think in “relational data” terms • Become familiar with the components of CDW ▫ Production and Raw Domains ▫ Fact and Dimension tables/views • Creating an analytic dataset ▫ Primary and Foreign Keys ▫ Joining tables/views “C”DW, “R”DW & “V”DW • Users will see documentation referring to xDW. • The “x” is a variable waiting to be filled in with either: ▫ “V” for VISN, ▫ “R” for region or ▫ “C” for corporate (meaning “national VHA”) • Each organizational level of the VA has its own data warehouse focusing on its own population. • This talk focuses on CDW only. 7 Flow of data into the warehouse VistA = Veterans Health Information Systems and Technology Architecture C“DW” • The “DW” in CDW stands for “Data Warehouse.” • Data Warehouse = a data delivery system intended to give users the information they need to support their business decisions.
    [Show full text]
  • Building an Effective Data Warehousing for Financial Sector
    Automatic Control and Information Sciences, 2017, Vol. 3, No. 1, 16-25 Available online at http://pubs.sciepub.com/acis/3/1/4 ©Science and Education Publishing DOI:10.12691/acis-3-1-4 Building an Effective Data Warehousing for Financial Sector José Ferreira1, Fernando Almeida2, José Monteiro1,* 1Higher Polytechnic Institute of Gaya, V.N.Gaia, Portugal 2Faculty of Engineering of Oporto University, INESC TEC, Porto, Portugal *Corresponding author: [email protected] Abstract This article presents the implementation process of a Data Warehouse and a multidimensional analysis of business data for a holding company in the financial sector. The goal is to create a business intelligence system that, in a simple, quick but also versatile way, allows the access to updated, aggregated, real and/or projected information, regarding bank account balances. The established system extracts and processes the operational database information which supports cash management information by using Integration Services and Analysis Services tools from Microsoft SQL Server. The end-user interface is a pivot table, properly arranged to explore the information available by the produced cube. The results have shown that the adoption of online analytical processing cubes offers better performance and provides a more automated and robust process to analyze current and provisional aggregated financial data balances compared to the current process based on static reports built from transactional databases. Keywords: data warehouse, OLAP cube, data analysis, information system, business intelligence, pivot tables Cite This Article: José Ferreira, Fernando Almeida, and José Monteiro, “Building an Effective Data Warehousing for Financial Sector.” Automatic Control and Information Sciences, vol.
    [Show full text]
  • Query Execution in Column-Oriented Database Systems by Daniel J
    Query Execution in Column-Oriented Database Systems by Daniel J. Abadi [email protected] M.Phil. Computer Speech, Text, and Internet Technology, Cambridge University, Cambridge, England (2003) & B.S. Computer Science and Neuroscience, Brandeis University, Waltham, MA, USA (2002) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY February 2008 c Massachusetts Institute of Technology 2008. All rights reserved. Author......................................................................................... Department of Electrical Engineering and Computer Science February 1st 2008 Certifiedby..................................................................................... Samuel Madden Associate Professor of Computer Science and Electrical Engineering Thesis Supervisor Acceptedby.................................................................................... Terry P. Orlando Chairman, Department Committee on Graduate Students 2 Query Execution in Column-Oriented Database Systems by Daniel J. Abadi [email protected] Submitted to the Department of Electrical Engineering and Computer Science on February 1st 2008, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering Abstract There are two obvious ways to map a two-dimension relational database table onto a one-dimensional storage in- terface:
    [Show full text]
  • Integrating Compression and Execution in Column-Oriented Database Systems
    Integrating Compression and Execution in Column-Oriented Database Systems Daniel J. Abadi Samuel R. Madden Miguel C. Ferreira MIT MIT MIT [email protected] [email protected] [email protected] ABSTRACT commercial arena [21, 1, 19], we believe the time is right to Column-oriented database system architectures invite a re- systematically revisit the topic of compression in the context evaluation of how and when data in databases is compressed. of these systems, particularly given that one of the oft-cited Storing data in a column-oriented fashion greatly increases advantages of column-stores is their compressibility. the similarity of adjacent records on disk and thus opportuni- Storing data in columns presents a number of opportuni- ties for compression. The ability to compress many adjacent ties for improved performance from compression algorithms tuples at once lowers the per-tuple cost of compression, both when compared to row-oriented architectures. In a column- in terms of CPU and space overheads. oriented database, compression schemes that encode multi- In this paper, we discuss how we extended C-Store (a ple values at once are natural. In a row-oriented database, column-oriented DBMS) with a compression sub-system. We such schemes do not work as well because an attribute is show how compression schemes not traditionally used in row- stored as a part of an entire tuple, so combining the same oriented DBMSs can be applied to column-oriented systems. attribute from different tuples together into one value would We then evaluate a set of compression schemes and show that require some way to \mix" tuples.
    [Show full text]
  • Quick-Start Guide
    Quick-Start Guide Table of Contents Reference Manual..................................................................................................................................................... 1 GUI Features Overview............................................................................................................................................ 2 Buttons.................................................................................................................................................................. 2 Tables..................................................................................................................................................................... 2 Data Model................................................................................................................................................................. 4 SQL Statements................................................................................................................................................... 5 Changed Data....................................................................................................................................................... 5 Sessions................................................................................................................................................................. 5 Transactions.......................................................................................................................................................... 6 Walktrough
    [Show full text]