MonetDBLite: An Embedded Analytical Database

Mark Raasveldt (CWI, Amsterdam, Netherlands) [email protected]
Hannes Mühleisen (CWI, Amsterdam, Netherlands) [email protected]

ABSTRACT
While traditional RDBMSes offer a lot of advantages, they require significant effort to set up and to use. Because of these challenges, many data scientists and analysts have switched to using alternative data management solutions. These alternatives, however, lack features that are standard for RDBMSes, e.g. out-of-core query execution. In this paper, we introduce the embedded analytical database MonetDBLite. MonetDBLite is designed to be both highly efficient and easy to use in conjunction with standard analytical tools. It can be installed using standard package managers, and requires no configuration or server management. It is designed for OLAP scenarios, and offers near-instantaneous data transfer between the database and analytical tools, all the while maintaining the transactional guarantees and ACID properties of a standard relational system. These properties make MonetDBLite highly suitable as a storage engine for data used in analytics, machine learning and classification tasks.

CCS CONCEPTS
• Information systems → DBMS engine architectures; Main memory engines;

KEYWORDS
Embedded Databases, Analytics, OLAP

ACM Reference Format:
Mark Raasveldt and Hannes Mühleisen. 2018. MonetDBLite: An Embedded Analytical Database. In Proceedings of CIKM 2018 International Conference on Information and Knowledge Management (CIKM'18). ACM, New York, NY, USA, 10 pages. https://doi.org/10.475/123_4

1 INTRODUCTION
Modern machine learning libraries and analytical tools have moved away from using traditional relational databases for managing data. Instead, data is managed by storing it either as structured text (such as CSV or JSON files), or as binary files such as Parquet or HDF5 [19, 20]. Managing data in flat files requires significant manual effort to maintain and is difficult to reason about because of the lack of a rigid schema. Furthermore, data managed in such a way is prone to data corruption because of the lack of transactional guarantees and atomic write actions.

Relational database systems have been designed specifically to solve these problems. However, data scientists still prefer to use flat files. This is primarily because the current methods of using a relational database in combination with these analytical tools are lacking. The standard approach of using a database with an analytical tool is to run the database system as a separate process (the "database server") and to connect to it over a socket using a client interface (typically ODBC or JDBC). However, not only is this approach very slow [15], it is also cumbersome, as the database system must be installed, managed and tuned as a separate process. For many use cases, the added effort of using a relational database outweighs the benefits of using one.

Instead of managing the database as a separate process, the database can run embedded inside the analytical tool. This method has the advantage that the database server no longer needs to be managed, and the database can be installed from within the standard package manager of the tool. In addition, because the database and the analytical tool run inside the same process on the same machine, data can be transferred between them at a much lower cost.

SQLite [2] is the most popular embedded database. It has bindings for all major languages, and it can be embedded without any licensing issues because its source code is in the public domain. However, it is first and foremost designed for transactional workloads on small datasets. While it can be used in conjunction with popular analytical tools, it does not perform well when used for analytical purposes.

In this paper, we describe MonetDBLite, an Open-Source embedded database based on the popular columnar database MonetDB [11]. MonetDBLite is an in-process analytical database that can be run directly from within popular analytical tools without any external dependencies. It can be installed through the default package managers of popular analytical tools, and has bindings for C/C++, R, Python and Java. Because of its in-process nature, data can be transferred between the database and these analytical tools at zero cost. The source code for MonetDBLite is freely available¹ and is in active use by thousands of analysts around the world.

¹ https://github.com/hannesmuehleisen/MonetDBLite
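A minimal sketch of this in-process usage from Python follows. It assumes the monetdblite package from the repository above; the function names (init, sql, insert, shutdown) are taken from its README and may differ between releases, so treat the calls as illustrative rather than authoritative.

    # Minimal sketch of embedded, in-process use: no server to install or manage.
    # Assumes the monetdblite Python package; exact API may vary by release.
    import monetdblite

    monetdblite.init('/tmp/mdbl_demo')  # the database lives in a local directory

    monetdblite.sql('CREATE TABLE sensors (id INTEGER, reading DOUBLE)')
    monetdblite.insert('sensors', {'id': [1, 2, 3],
                                   'reading': [0.5, 1.7, 2.2]})

    # Results come back as columns in process memory, ready for analysis,
    # without crossing a socket.
    result = monetdblite.sql('SELECT AVG(reading) AS avg_reading FROM sensors')
    print(result['avg_reading'])

    monetdblite.shutdown()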
Contributions. The main contributions of this paper are:
• We describe the internal design of MonetDBLite, and how it interfaces with standard analytical tools.
• We discuss the technical challenges we have faced in converting a popular Open-Source database into an in-process embeddable database.
• We benchmark MonetDBLite against alternative database systems when used in conjunction with analytical tools, and show that it outperforms these alternatives significantly. This benchmark is completely reproducible with publicly available sources.

Figure 1: Different ways of connecting analytical tools with a database management system. (a) Socket connection. (b) In-database processing. (c) Embedded database.

Outline. This paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe the design and implementation of the MonetDBLite system. We compare the performance of MonetDBLite against other database systems and statistical libraries in Section 4. Finally, we draw our conclusions in Section 5.

2 RELATED WORK
There are three main methods of combining relational databases with analytical tools. These methods are visualized in Figure 1. The standard approach is to connect the analytical tool to a database through a database connection (DBC). The client can then issue queries to the database over this client connection and retrieve any data stored in the database. This approach is database agnostic, and allows the user to stay within their familiar scripting language environment (REPL or IDE). However, exporting large amounts of data from the database to the client is very inefficient [15], and can be a significant bottleneck in analysis pipelines.

An alternative solution is to use in-database processing methods. By executing the analysis pipelines inside the database, the overhead of data export can be entirely avoided. In addition, in-database processing can provide additional benefits in the form of automatic parallelization or distributed execution of these functions [14] within the database engine.

While this approach removes the data transfer overhead between the scripting language and the database, it still requires the user to run and manage a separate database server. These user-defined functions also introduce new issues. Firstly, they force users to rewrite code so that it fits within the query workflow, as in the UDF sketched below. Secondly, because these UDFs run within the database process, they are difficult to create and debug [10]. Users cannot use the IDEs/REPLs/debugging tools that they are familiar with (such as RStudio or PyCharm) while writing the user-defined functions. Additionally, user-defined functions introduce safety issues, as arbitrary code can now run within the database kernel.
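To make the required rewriting concrete, the sketch below registers a Python UDF inside MonetDB. The CREATE FUNCTION ... LANGUAGE PYTHON form follows MonetDB's embedded-Python interface [14]; the pymonetdb connection parameters and table names are illustrative assumptions, and the UDF body executes inside the server process, outside the user's usual debugger.

    # Sketch: in-database processing via a Python UDF in MonetDB.
    # Connection parameters and table names are placeholders.
    import pymonetdb

    conn = pymonetdb.connect(username='monetdb', password='monetdb',
                             hostname='localhost', database='demo')
    cur = conn.cursor()

    # The analysis logic must be rewritten to live inside the query workflow;
    # it runs in the database server, not in the analyst's own process.
    cur.execute("""
        CREATE FUNCTION normalize(v DOUBLE) RETURNS DOUBLE
        LANGUAGE PYTHON {
            # v arrives as a NumPy array (vectorized execution)
            return (v - v.min()) / (v.max() - v.min())
        };
    """)
    cur.execute("SELECT normalize(reading) FROM sensors;")
    print(cur.fetchall())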
Instead of running the database system as a separate server, it can be embedded directly inside the analytical tool. This removes any environment dependencies, making the analytical script portable and capable of running anywhere without additional user effort. Using a standard relational database requires the user to have the database server running in the background or on a separate machine, whereas the embedded database can be completely controlled from within the script itself.

Embedded databases are extremely popular, mainly because of the omnipresent SQLite [2]. This embedded database is shipped with all major operating systems and browsers, and is included with numerous other applications. SQLite has become so ubiquitous because it provides the valuable advantages of a relational database management system for a low development cost. It has bindings for all major languages, and it can be embedded without any licensing issues because its source code is in the public domain. According to the author of SQLite, there are over one trillion SQLite databases in active use.

However, SQLite is first and foremost designed for transactional workloads. It is a row-store database that uses a volcano-style processing model for query execution. While popular analytical tools such as Python and R do have SQLite bindings, it does not perform well when used for analytical purposes. Even using SQLite exclusively as a storage engine typically does not work out well in these scenarios, as it is not optimized for bulk retrieval of data. In addition, analyses often use only a limited subset of the columns of a table, and SQLite's row-wise storage layout forces it to always scan entire tables. This can lead to poor performance when dealing with wide data.

There are also alternatives to using a relational database system at all. Libraries such as dplyr [21], data.table [7] and Pandas [13] allow the user to execute common database operations directly from within analytical scripting languages. They implement operations such as joins and aggregations that operate efficiently and directly on the native structures used in these languages, as in the short example below. These libraries solve the problem of users wanting to execute database operations; however, they do not solve the problem of automatically managing persistent data for the user.
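For illustration, this is the kind of join-plus-aggregation such libraries run directly on native in-memory structures (Pandas here; the tables are made up):

    # Sketch: a join and aggregation on native in-memory structures with Pandas.
    import pandas as pd

    orders = pd.DataFrame({'customer_id': [1, 1, 2, 3],
                           'amount': [10.0, 25.0, 5.0, 12.5]})
    customers = pd.DataFrame({'customer_id': [1, 2, 3],
                              'region': ['EU', 'US', 'EU']})

    # Roughly: SELECT region, SUM(amount) FROM orders JOIN customers
    #          USING (customer_id) GROUP BY region;
    result = (orders.merge(customers, on='customer_id')
                    .groupby('region', as_index=False)['amount'].sum())
    print(result)

Persistence, however, remains the user's problem: these frames live only in process memory unless explicitly written back out to files.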