
MonetDBLite: An Embedded Analytical Database

Mark Raasveldt
CWI
Amsterdam, Netherlands
[email protected]

Hannes Mühleisen
CWI
Amsterdam, Netherlands
[email protected]

ABSTRACT
While traditional RDBMSes offer a lot of advantages, they require significant effort to set up and to use. Because of these challenges, many data scientists and analysts have switched to using alternative solutions. These alternatives, however, lack features that are standard for RDBMSes, e.g. out-of-core query execution. In this paper, we introduce the embedded analytical database MonetDBLite. MonetDBLite is designed to be both highly efficient and easy to use in conjunction with standard analytical tools. It can be installed using standard package managers, and requires no configuration or server management. It is designed for OLAP scenarios, and offers near-instantaneous data transfer between the database and analytical tools, all the while maintaining the transactional guarantees and ACID properties of a standard relational system. These properties make MonetDBLite highly suitable as a storage engine for data used in analytics, machine learning and classification tasks.

CCS CONCEPTS
• Information systems → DBMS engine architectures; Main memory engines;

KEYWORDS
Embedded databases, Analytics, OLAP

ACM Reference Format:
Mark Raasveldt and Hannes Mühleisen. 2018. MonetDBLite: An Embedded Analytical Database. In Proceedings of CIKM 2018 International Conference on Information and Knowledge Management (CIKM'18). ACM, New York, NY, USA, 10 pages. https://doi.org/10.475/123_4

1 INTRODUCTION
Modern machine learning libraries and analytical tools have moved away from using traditional relational databases for managing data. Instead, data is managed by storing it either as structured text (such as CSV or JSON files), or as binary files such as Parquet or HDF5 [19, 20]. Managing data in flat files requires significant manual effort to maintain and is difficult to reason about because of the lack of a rigid schema. Furthermore, data managed in such a way is prone to corruption because of the lack of transactional guarantees and atomic write actions.

Relational database management systems have been designed specifically to solve these problems. However, scientists still prefer to use flat files. This is primarily because the current methods of using a relational database in combination with these analytical tools are lacking. The standard approach of using a database with an analytical tool is to run the database system as a separate process (the "database server") and to connect with it over a socket using a client interface (typically ODBC or JDBC). However, not only is this approach very slow [15], it is very cumbersome as the database system must be installed, managed and tuned as a separate process. For many use cases, the added effort of using a relational database outweighs the benefits of using one.

Instead of managing the database as a separate process, the database can run embedded inside the analytical tool. This method has the advantage that the database server no longer needs to be managed, and the database can be installed from within the standard package manager of the tool. In addition, because the database and the analytical tool run inside the same process on the same machine, data can be transferred between them for a much lower cost.

SQLite [2] is the most popular embedded database. It has bindings for all major languages, and it can be embedded without any licensing issues because its source code is in the public domain. However, it is first and foremost designed for transactional workloads on small datasets. While it can be used in conjunction with popular analytical tools, it does not perform well when used for analytical purposes.

In this paper, we describe MonetDBLite, an Open-Source embedded database based on the popular columnar database MonetDB [11]. MonetDBLite is an in-process analytical database that can be run directly from within popular analytical tools without any external dependencies. It can be installed through the default package managers of popular analytical tools, and has bindings for C/C++, R, Python and Java. Because of its in-process nature, data can be transferred between the database and these analytical tools at zero cost. The source code for MonetDBLite is freely available¹ and is in active use by thousands of analysts around the world.

Contributions. The main contributions of this paper are:
• We describe the internal design of MonetDBLite, and how it interfaces with standard analytical tools.
• We discuss the technical challenges we have faced in converting a popular Open-Source database into an in-process embeddable database.
• We benchmark MonetDBLite against other alternative database systems when used in conjunction with analytical tools, and show that it outperforms alternatives significantly. This benchmark is completely reproducible with publicly available sources.

¹https://github.com/hannesmuehleisen/MonetDBLite

(a) Socket connection. (b) In-database processing. (c) Embedded database.

Figure 1: Different ways of connecting analytical tools with a database management system.

Outline. This paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe the design and implementation of the MonetDBLite system. We compare the performance of MonetDBLite against other database systems and statistical libraries in Section 4. Finally, we draw our conclusions in Section 5.

2 RELATED WORK
There are three main methods of combining relational databases with analytical tools. These methods are visualized in Figure 1. The standard approach is to connect the analytical tool to a database through a database connection (DBC). The client can then issue queries to the database over this client connection and retrieve any data stored in the database. This approach is database agnostic, and allows the user to stay within their familiar scripting language environment (REPL or IDE). However, exporting large amounts of data from the database to the client is very inefficient [15], and can be a significant bottleneck in analysis pipelines.

An alternative solution is to use in-database processing methods. By executing the analysis pipelines inside the database, the overhead of data export can be entirely avoided. In addition, in-database processing can provide additional benefits in the form of automatic parallelization or distributed execution of these functions [14] within the database engine.

While this approach removes the data transfer overhead between the scripting language and the database, it still requires the user to run and manage a separate database server. These user-defined functions also introduce new issues. Firstly, they force users to rewrite their code so it fits within the query workflow. Secondly, because these UDFs run within the database process, they are difficult to create and debug [10]. Users cannot use the IDEs/REPLs/debugging tools that they are familiar with (such as RStudio or PyCharm) while writing the user-defined functions. Additionally, user-defined functions introduce safety issues as arbitrary code can now run within the database kernel.

Instead of running the database system as a separate server, it can be embedded into the scripting language. The database can be accessed using the same database-agnostic interface as a standard database client connection, but because the database resides in the same address space as the scripting language, data can be transferred between the two systems with significantly less overhead.

Another advantage of using an embedded database inside analytical scripts is that the embedded database does not introduce any environment dependencies, making the analytical script portable and capable of running anywhere without additional user effort. Using a standard relational database requires the user to have the database server running in the background or on a separate machine, whereas the embedded database can be completely controlled from within the script itself.

Embedded databases are extremely popular, mainly because of the omnipresent SQLite [2]. This embedded database is shipped with all major operating systems and browsers, and is included with numerous other applications. SQLite has become so ubiquitous because it provides the valuable advantages of a relational database management system for a low development cost. It has bindings for all major languages, and it can be embedded without any licensing issues because its source code is in the public domain. According to the author of SQLite, there are over one trillion SQLite databases in active use.

However, SQLite is first and foremost designed for transactional workloads. It is a row-store database that uses a volcano-style processing model for query execution. While popular analytical tools such as Python and R do have SQLite bindings, it does not perform well when used for analytical purposes. Even exclusively using SQLite as a storage engine typically does not work out well in these scenarios as it is not optimized for bulk retrieval of data. In addition, often only a limited subset of the columns of a table are used in analyses, and its row-wise storage layout forces it to always scan entire tables. This can lead to poor performance when dealing with wide data.

There are also alternatives to using a relational database system at all. Libraries such as dplyr [21], data.table [7] and Pandas [13] allow the user to execute common database operations directly from within analytical scripting languages. They implement operations such as joins and aggregations that efficiently operate directly on the native structures used in these languages. These solve the problem of users wanting to execute database operations, however, they do not solve the problem of automatically managing persistent data for the user. In addition, they do not facilitate efficient out-of-memory execution.

3 DESIGN & IMPLEMENTATION
In this section we will discuss the general design and implementation of MonetDBLite, and the design choices we have made while implementing it.

3.1 Internal Design
MonetDBLite is based on the popular Open-Source columnar database MonetDB, and as such it shares most of its internal design. The core design of MonetDB is described in Idreos et al. [11]. However, since this publication a number of core features have been added to MonetDB. In this section, we give a brief summary of the internal design of MonetDB and describe the features that have been added to MonetDB since.

Data Storage. MonetDBLite stores relational tables in a columnar fashion. Every column is stored either in-memory or on-disk as a tightly packed array. Row numbers for each value are never explicitly stored. Instead, they are implicitly derived from their position in the tightly packed array. Missing values are stored as "special" values within the domain of the type, i.e. a missing value in an INTEGER column is stored internally as the value −2³¹.

Columns that store variable-length fields, such as CLOBs or BLOBs, are stored using a variable-sized heap. The actual values are inserted into the heap. The main column is a tightly packed array of offsets into that heap. These heaps also perform duplicate elimination if the amount of distinct values is below a threshold; if two fields share the same value it will only appear once in the heap. The offset array will then point to the same heap entry for the rows that share the same value.

Memory Management. MonetDB does not use a traditional buffer pool to manage which data is kept in memory and which data is kept on disk. Instead, it relies on the operating system to take care of this by using memory-mapped files to store columns persistently on disk. The operating system then loads pages into memory as they are used and evicts pages from memory when they are no longer being actively used. This model allows it to keep hot columns loaded in memory, while columns that are not frequently touched are off-loaded to disk.

Concurrency Control. MonetDB uses an optimistic concurrency control model. Individual transactions operate on a snapshot of the database. When attempting to commit a transaction, it will either commit successfully or abort when potential write conflicts are detected.

Query Plan Execution. SQL is first parsed into a relational algebra tree and then translated into an intermediate language called MAL (Monet Assembly Language). MAL instructions process the data in a column-at-a-time model. Each MAL operator processes the full column before moving on to the next operator. The intermediate values generated by the operators are kept around in-memory if not too large, and passed on to the next operator in the pipeline.

Optimizations happen at three levels. High level optimizations, such as filter push down, are performed on the relational tree. Afterwards, the MAL code is generated and further optimizations are performed, such as common sub-expression elimination. Finally, during execution tactical decisions are made about how specific operations should be executed, such as which join implementation to use.

Parallel Execution. Initially, a sequential execution plan is generated. Parallelization is then added in the second optimization phase. The individual MAL operators are marked as either "blocking" or "parallelizable". The optimizers will alter the plan by splitting up the columns of the largest table into separate chunks, then executing the "parallelizable" operators once on each of the chunks, and finally merging the results of these operators together into a single column before executing the "blocking" operators. This is visualized in Figure 2 for the query SELECT MEDIAN(SQRT(i * 2)) FROM tbl.

Figure 2: Parallel execution in MonetDB.

The amount of chunks that are generated is decided by a set of heuristics based on base table size, the amount of cores and the amount of available memory. The database will attempt to generate chunks that fit inside main memory to avoid swapping, and will attempt to maximize CPU utilization. In addition, the optimizer will not split up small columns, as the added overhead of parallel execution will not pay off in this case.

Automatic Indexing. In addition to allowing the user to manually build indices through the CREATE INDEX command, MonetDB will automatically create indices during query execution. Imprints [17] are a bitmap index that is used to assist in efficiently answering point and range queries. The bitmap index holds, for each cache line, a bitmap that contains information about the range of values in that cache line. Imprints are automatically generated for persistent columns when a range query is issued on a specific column. They are then persisted on disk and used for subsequent queries on that column. Imprints are destroyed when a column is modified.

Hash tables are also automatically created for persistent columns when they are used in groupings or as join keys in equi-joins. These are also persisted on disk. Hash tables are destroyed on updates or deletions to the column. Unlike imprints, however, they are updated on appends to the tables.

Order Index. In addition to imprints and hash tables, MonetDB supports creation of a sorted index that is not created automatically. It must be created using the CREATE ORDER INDEX statement. Internally, the order index is an array of row numbers in the sort order specified by the user. The order index is used to speed up point and range queries, as well as equi-joins and range-joins. Point and range queries are answered by using a binary search on the order index. For joins, the order index is used for a merge join.
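To make the order index lookup described above concrete, the following small sketch answers a range query through binary search over an array of row numbers sorted by column value. This is an illustration of the idea only, not MonetDBLite's actual code; all names and types here are ours.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: answer a range query "low <= value <= high" using an order index,
     * i.e. an array of row numbers sorted by the column value. */
    static int64_t lower_bound(const int *col, const int64_t *oidx, int64_t n, int key) {
        int64_t lo = 0, hi = n; /* first position whose value is >= key */
        while (lo < hi) {
            int64_t mid = lo + (hi - lo) / 2;
            if (col[oidx[mid]] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    int main(void) {
        int col[] = {42, 7, 19, 7, 88, 3};    /* column values, stored in insertion order */
        int64_t oidx[] = {5, 1, 3, 2, 0, 4};  /* row numbers in sorted value order */
        int64_t n = 6, pos = lower_bound(col, oidx, n, 7);
        for (; pos < n && col[oidx[pos]] <= 20; pos++) /* range query: 7 <= value <= 20 */
            printf("row %lld has value %d\n", (long long)oidx[pos], col[oidx[pos]]);
        return 0;
    }

The same sorted sequence of row numbers is what makes the merge join mentioned above possible without sorting the column at join time.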

3.2 Embedding Interface
MonetDBLite is a database that is embedded into analytical tools directly, rather than running as a standard client-server database. As MonetDBLite runs within a process, clients have to create and initialize the database themselves rather than connecting to an existing database server through a socket connection. For this purpose, MonetDBLite needs a set of language bindings so the database can be initialized and queries can be issued to the database.

MonetDBLite has language bindings for the C/C++, R, Python and Java programming languages. However, all of these are wrappers for the C/C++ language bindings. The main challenge in creating these wrappers is converting the data to and from the native types of each of these languages. The optimization challenges of this type conversion are discussed in Section 3.3. In this section, we will discuss only the C/C++ API.

The database can be initialized using the monetdb_startup function. This function takes as optional parameter a reference to a directory in which it can persistently store any data. If no directory is provided, MonetDBLite will be launched in an in-memory only mode, in which case no persistent data is saved to disk.

If the database is launched in persistent mode, a new database will be created in the specified directory if none exists yet. Otherwise, the existing database will be loaded and potentially upgraded if it was created by an older version of MonetDBLite. If the database is launched in-memory, a new temporary database will be created that will be kept entirely in-memory. Any data added to the database will be kept in-memory as well. After an in-memory database is shut down, all stored data will be discarded. The regular MonetDB does not have this feature.

After a database has been started, connections to the database can be created using the monetdb_connect function. In the regular MonetDB server, these connections represent socket connections to a client process. In MonetDBLite, however, these connections are dummy clients that only hold a query context and can be used to query the database. Multiple connections can be created for a single database instance. These connections can be used for inter-query parallelism by issuing multiple queries to the database in parallel, and they provide transaction isolation between them.

Using these connections, the embedded process can issue standard SQL queries to the database using the monetdb_query function. This function takes as input a client context and a query to be issued, and returns the results of the query to the client in a columnar format in a monetdb_result object. The monetdb_result object is semi-opaque, exposing only a limited amount of header information, as shown in Listing 1.

    struct monetdb_result {
        size_t nrows;
        size_t ncols;
        char type;
        size_t id;
    };

Listing 1: MonetDBLite Result Object

The individual columns of the result can be fetched using the monetdb_result_fetch function, which takes as input a pointer to the monetdb_result object and a column number. There are two versions of this function: a low level version, and a high level version. In the low level version, the underlying structures used by the database are directly returned without any conversions being performed. This function requires internal knowledge of the database internals, and is intended for use primarily by the language-specific wrappers for extra performance. In the high level version, the database structures are converted into a set of simple structures that can be used without knowledge of the internals of MonetDB(Lite). The returned structures depend on the type of the column. An example for the int type is given in Listing 2.

    struct monetdb_column {
        monetdb_type type;
        int* data;
        size_t count;
        int null_value;
        double scale;
        int (*is_null)(int value);
    };

Listing 2: MonetDBLite Integer Column

In addition to issuing SQL queries, the embedded process can efficiently bulk append large amounts of data to the database using the monetdb_append function. This function takes the schema and the name of a table to append to, and a reference to the data to append to the columns of the table. This function allows for efficient bulk insertions, as there is significant overhead involved in parsing individual INSERT INTO statements, which becomes a bottleneck when the user wants to insert a large amount of data.
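As a brief illustration of this embedding interface, the sketch below starts an in-memory database, runs a query and reads the first result column through the high-level fetch path. It is a hypothetical usage sketch: the header name, the exact function signatures, the error-handling convention and the cleanup calls (monetdb_disconnect, monetdb_shutdown) are assumptions on our part and may differ from the actual MonetDBLite sources.

    #include <stddef.h>
    #include <stdio.h>
    #include "monetdblite.h"   /* hypothetical header name */

    int main(void) {
        /* NULL directory: start MonetDBLite in its in-memory only mode
         * (assumed to return an error string on failure, NULL on success). */
        char *err = monetdb_startup(NULL);
        if (err) { fprintf(stderr, "startup failed: %s\n", err); return 1; }

        void *conn = monetdb_connect();          /* dummy client holding a query context */
        struct monetdb_result *result = NULL;
        err = monetdb_query(conn, "SELECT 42 AS i", &result);
        if (!err && result && result->ncols > 0) {
            /* high level fetch: column 0 converted into the simple struct of Listing 2 */
            struct monetdb_column *col = monetdb_result_fetch(result, 0);
            for (size_t row = 0; row < col->count; row++) {
                if (!col->is_null(col->data[row]))
                    printf("%d\n", col->data[row]);
            }
        }
        monetdb_disconnect(conn);                /* hypothetical cleanup calls */
        monetdb_shutdown();
        return 0;
    }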

3.3 Native Language Interface
For any of the languages other than C/C++, data has to be converted between the database's native format and the target language's native format. When a SQL query is issued, the result has to be mapped back into the target environment. Likewise, if the user wants to move data from the target environment to the database, it has to be converted.

Database connectors in the target environment face a similar but more difficult problem, as they also have to deal with communicating with the remote database server. We could adapt these database connectors to work with MonetDBLite. However, in an analytical context this approach is problematic. As these are row-focused interfaces [15], the results of queries must be fetched one-by-one. This leads to a large amount of overhead when fetching a large result set, especially in interpreted scripting languages such as R or Python. Columnar bulk access to result sets is therefore needed, where all values belonging to one column can be fetched into a set of arrays, one per column, in one or few calls to the database interface.

However, not all arrays are created equal. While it is possible to subclass the native array representation in most programming environments, efficiency concerns and expectations by third-party software might make a fully native data representation necessary. For example, in R, most third-party packages will contain some portion of compiled code written in C/C++, which relies on arrays being stored in the native bit representation if they are to compute anything meaningful with them. Similarly, in the NumPy environment, third-party packages can get a pointer to the native C representation of any array. Hence, for the objects that we return from the database to be usable by these packages, we must create objects that exactly match the native array format of the target environment.

Zero-Copy. Every target environment has a particular array representation in memory. However, due to hardware support, contiguous C-style arrays are ubiquitous for numerical values. For example, both R and NumPy use this representation to store arrays of numerical data. This allows for a unique optimization opportunity: instead of converting the data into a freshly allocated memory area, we can choose to share a pointer to the existing data with the target system. The memory layout needs to be compatible between data management and target environment, e.g. both using contiguous C-style arrays containing four-byte signed integers. If this pointer sharing is possible, the only cost comes from initializing structures in the target environment (e.g. the SEXP header in R). However, this cost does not depend on the size of the data set.

Great care needs to be taken to prevent modification of the data being shared. The target environment may run any program imaginable, including code from contributed packages, that may try to modify the shared memory areas. The shared pointer might be part of the persistent data of the database, hence modifying the data directly could lead to corruption of the data stored in the database. Because of this, no direct modification of this data is allowed. What is desirable here are copy-on-write semantics for the shared data. If code from the target environment attempts to write into the shared data area, the data should be copied within the target environment and only the copy modified. To ensure these semantics are enforced, the Unix mprotect kernel function can be used to disallow writes to the data by the target code. When the target environment attempts to modify data we have shared with it, we create a copy and modify the copy instead. This allows for efficient read-only access without the risk of data corruption.

Header Forgery. A challenge of providing a zero-copy interface to the data stored in the database is that certain libraries expect metadata to be stored as a header physically in front of the data. This is accomplished in the library by performing a single memory allocation that allocates the size of the header plus the size of the data. This is problematic in our scenario. As the source data comes directly from an external database system, it does not have space allocated in front of it for these headers.

This problem could be solved by making the database always allocate extra bytes in front of any data that could be passed to the analytical tool. As we have full control of the database system, this is feasible. However, it would require a significant amount of code modification and would result in wasted space in scenarios where the data is not passed to the analytical tool.

Instead, we solve this problem using header forgery. This process is shown in Figure 3. To provide a zero-copy interface to a region of n memory pages, we allocate a region of size n + 1 memory pages using the mmap [8] function. We then place the header information at the end of the first page. We then use the mmap function together with the MAP_FIXED flag to directly link the remaining n memory pages to the original data. This linking happens in the memory page table, and does not create a copy of these pages. This method predates, but could be considered an application of, the memory rewiring technique presented in [16].

Figure 3: Header forgery for zero-copy data transfer.

Lazy Conversion. While the zero-copy approach is ideal, as it does not require us to touch the to-be-converted data, it cannot be used in all cases. When the internal representation of the database is not bit-compatible with that of the target environment, data conversion has to be performed. As all data has to be converted, the conversion will take a linear amount of time w.r.t. the size of the result set. However, it is not known whether the target environment will ever actually do anything with the converted data. It is not uncommon for a user to perform a query such as SELECT * FROM table and only access a small amount of columns from the result.

This issue can be resolved by performing lazy conversion of the result set. Instead of eagerly converting the entire result set, we create a set of "dummy" arrays that start out with a correctly initialized header. However, the data is filled with uninitialized memory. This is shown in Figure 4. We then use the mprotect [9] function to protect the uninitialized memory from being read or written to directly using the PROT_NONE flag. When the user attempts to access the protected memory area, the system throws a segmentation fault, which we then catch using a signal handler. Using a pointer to the original data that is stored alongside the header, we then perform a conversion of the actual data and unset the mprotect flag, allowing the user to use the now-converted data transparently.

Figure 4: Lazy data conversion.
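A minimal sketch of the header forgery scheme described above is shown below. It assumes the column data already lives in a page-aligned, memory-mapped file region; the file descriptor, sizes and the stand-in header type are illustrative only and do not correspond to the actual MonetDBLite or R/NumPy structures.

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <stddef.h>

    /* Illustrative stand-in for a target environment's array header (e.g. R's SEXP header). */
    struct fake_header { size_t length; int type; };

    /* Forge a header in front of n_pages pages of column data stored in the
     * memory-mapped file fd at page-aligned file_offset, without copying the data. */
    void *forge_header(int fd, off_t file_offset, size_t n_pages, size_t n_values) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        /* 1. allocate n + 1 fresh pages */
        char *region = mmap(NULL, (n_pages + 1) * page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) return NULL;
        /* 2. place the header at the end of the first page */
        struct fake_header *hdr =
            (struct fake_header *)(region + page - sizeof(struct fake_header));
        hdr->length = n_values;
        hdr->type = 13; /* e.g. "integer array" in the target environment */
        /* 3. rewire the remaining n pages onto the original data via MAP_FIXED:
         *    only the page table changes, the data itself is not copied */
        if (mmap(region + page, n_pages * page, PROT_READ,
                 MAP_SHARED | MAP_FIXED, fd, file_offset) == MAP_FAILED) {
            munmap(region, (n_pages + 1) * page);
            return NULL;
        }
        return region + page; /* data pointer; the forged header sits directly in front of it */
    }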

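The lazy conversion mechanism, and similarly the copy-on-write protection used for zero-copy sharing, relies on mprotect combined with a SIGSEGV handler. The sketch below shows that pattern in isolation for a single column; the bookkeeping that maps a faulting address back to its source data is reduced to one global variable and is not the actual MonetDBLite implementation.

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* One lazily converted column: a PROT_NONE "dummy" array plus a pointer to the source data. */
    static struct { int *dummy; const int64_t *source; size_t count; size_t bytes; } lazy_col;

    static void segv_handler(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        char *addr = (char *)info->si_addr;
        if (addr >= (char *)lazy_col.dummy && addr < (char *)lazy_col.dummy + lazy_col.bytes) {
            /* fault inside the dummy array: make it accessible, convert, and let the
             * faulting instruction be retried transparently */
            mprotect(lazy_col.dummy, lazy_col.bytes, PROT_READ | PROT_WRITE);
            for (size_t i = 0; i < lazy_col.count; i++)
                lazy_col.dummy[i] = (int)lazy_col.source[i]; /* e.g. 64-bit to 32-bit conversion */
        } else {
            signal(SIGSEGV, SIG_DFL); /* not ours: fall back to the default crash behaviour */
        }
    }

    void install_lazy_column(const int64_t *source, size_t count) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        lazy_col.bytes = ((count * sizeof(int) + page - 1) / page) * page;
        lazy_col.dummy = mmap(NULL, lazy_col.bytes, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        lazy_col.source = source;
        lazy_col.count = count;
        struct sigaction sa = {0};
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }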
3.4 Technical Challenges
In this section, we will discuss the additional technical challenges that we encountered while converting a standard relational database management system to an in-process embedded database system.

Internal Global State. MonetDB was originally designed to run as a single stand-alone process. One of the consequences of this design is that internal global state (global variables) is used often in the source code. The database uses global state to keep track of e.g. the data stored inside the database, the write-ahead logger and numerous database settings.

This global state leads to a limitation: it is not possible to run MonetDBLite twice in the same process. As the global state holds all the information necessary for the database to function, including paths to database files, and this information is continuously accessed while the database is running, only one database server can be running in the same process. To make it possible to run several database servers within the same process would require a very comprehensive code rewrite, as the global database state would have to be passed around to almost every function.

Garbage Collection. Another issue caused by this global state is garbage collection. As the database server no longer runs as a stand-alone program, the global variables can no longer be reset by restarting the server. In addition, all the allocated memory has to be freed in the process. Allocated regions can no longer be neglected with the knowledge that they will be freed when the process is terminated. Instead, to properly support an "in-process shutdown" of the database server, everything has to be cleaned up manually and all global variables have to be reset to their initial state.

External Global State. Another consequence of the database server being designed to run as a stand-alone process is that it modifies a lot of external global state, such as signal handlers, locale settings and input/output streams. For each of these, it was necessary to modify the database source code to not modify the global state. Otherwise, loading the database package would result in it overriding signal handlers, leading to e.g. breaking the scripting languages' input console.

Calls to the exit function were especially problematic. In the stand-alone version of MonetDB the database server shuts down when a fatal error is detected (such as running with insufficient permissions or attempting to open a corrupt database). This happens mostly during start-up. This is expected behavior in a stand-alone database server, but becomes problematic when running embedded inside a different program. Attempting to access a corrupt database using the embedded database would result in the entire program crashing, rather than a simple error being thrown. Even worse, since the database would simply exit in these scenarios, no alternative path exists to only report the error. To avoid a large code rewrite, we used longjmp whenever the exit function was called, which would jump out of the exit and move to a piece of code where the error could be reported.
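This exit interception follows the classic setjmp/longjmp pattern. The sketch below illustrates the pattern under assumed names; it is not the actual MonetDBLite start-up code.

    #include <setjmp.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative only: a jump buffer set up at the embedding API boundary. */
    static jmp_buf startup_jump;
    static char startup_error[256];

    /* Stand-in for an internal fatal-error path that used to call exit(). */
    static void fatal_error(const char *msg) {
        strncpy(startup_error, msg, sizeof(startup_error) - 1);
        longjmp(startup_jump, 1); /* instead of exit(1): unwind back to the API entry point */
    }

    /* Hypothetical embedded entry point: returns an error string instead of killing the process. */
    const char *embedded_startup(const char *dbdir) {
        if (setjmp(startup_jump) != 0)
            return startup_error;     /* we arrive here when fatal_error() longjmps */
        if (dbdir == NULL)
            fatal_error("database directory is corrupt or locked");
        /* ... normal start-up work would happen here ... */
        return NULL;                  /* NULL means success */
    }

    int main(void) {
        const char *err = embedded_startup(NULL);
        printf("startup: %s\n", err ? err : "ok");
        return 0;
    }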

Error Handling. Another aspect of the database that we needed to rethink was error handling. In the regular database server, errors are reported by writing them to the output stream so they can be handled by the client program. However, in the embedded version the errors must be reported as a return value from the SQL query function. We had to rewrite large portions of the error reporting code to accommodate this.

Dependencies. To make MonetDBLite as simple to install as possible, one of our design goals was to remove all external dependencies. Regular MonetDB has a large number of required dependencies, among which are pcre, openssl, libxml and pkg-config, along with a large number of optional dependencies. For MonetDBLite, we stripped all of these dependencies by removing large chunks of optional code and rewriting code that relied on any of the required dependencies. For example, we made our own implementation of the LIKE operator (that previously used regular expressions from the PCRE library). As a result of our efforts, MonetDBLite has no external dependencies and can be installed without having to install any other libraries.

4 EVALUATION
In this section, we perform an evaluation of the performance of MonetDBLite and compare it against both (1) other relational database management systems, and (2) several popular RDBMS alternatives used in statistical tools.

4.1 Setup
All experiments in this section were run on a desktop-class computer with an Intel i7-2600K CPU clocked at 3.40GHz and 16 GB of main memory, running Fedora 26 Linux with Kernel version 4.14. We used GCC version 7.3.1 to compile the systems. Reported timings are the median of ten hot runs. The initial cold run is always ignored. A timeout of 5 minutes is used for the queries.

Systems. The following systems were used to compare against in our benchmarks. All systems were configured to only use one of the eight available hardware threads for fairness (as not all systems support intraquery parallelism). Furthermore, unless indicated otherwise, we have attempted to configure the systems to take full advantage of available memory. The complete configuration settings and scripts to reproduce the results reported below can be found in the benchmark repository².
• SQLite [2] (Version 3.20.1) is an embedded SQL database designed for transactional workloads.
• MonetDB [11] (Version 11.29.3) is an analytical column-store database.
• PostgreSQL [18] (Version 9.6.1) is a row-store database designed for transactional workloads.
• MariaDB [23] (Version 10.2.14) is a row-store database designed for transactional workloads. It is based on the popular MySQL database.

Libraries. In addition to the above-mentioned database management systems, we test the following analytical libraries that emulate database functionality. We only use these libraries in the query execution benchmarks.
• data.table [7] (Version 1.11.0) is an R library for performing common database operations.
• dplyr [21] (Version 0.7.4) is an R library for performing common database operations.
• Pandas [13] (Version 0.22.0) is a Python library for performing common database operations.
• Julia [3] (Version 0.6.2) is a JIT compiled analytical language that has support for performing standard database operators through the DataFrames.jl library.

Datasets. We perform benchmarks using the following data sets.
• TPC-H Benchmark [1]. This synthetic dataset is designed to be similar to real-world data warehouse fact tables. In our benchmarks, we use the scale factors 1 and 10. The scale factor indicates approximately the size of the dataset in GB.
• American Community Survey (ACS) [5]. This dataset contains millions of census survey responses. It consists of 274 columns.

²https://github.com/Mytherin/MonetDBLiteBenchmarks

Figure 5: Writing the lineitem table from R to the database (wall clock time in seconds).
Figure 6: Loading the lineitem table into R from the database (wall clock time in seconds).

4.2 TPC-H Benchmark
As we focus on the integration of analytical tools with an analytical database, there are three different scenarios that we want to optimize for and that we will benchmark.
(1) Data Ingestion. The rate at which data can be imported into the database from the analytical tool. We call this the data ingestion or data import rate. This scenario occurs when users want to take data that is the result of computations in the analytical tool and store it persistently in the database.
(2) Data Export. The rate at which data can be imported into the analytical tool from the database. This data export rate is important when the user wants to perform analytics on data that is stored persistently within the RDBMS.
(3) Query Execution. The performance of the database engine when performing analytical queries. This scenario occurs when the user wants to perform operations and aggregations on large amounts of data using the databases' storage engine. Note that for query execution, it is also possible to simply move the data from the database into the analytical tool and do the processing there using the previously mentioned libraries. For that reason, we also compare the performance of the RDBMS with the afore-mentioned libraries.

Data Ingestion
For the data ingestion benchmark, we only consider the lineitem table. This is the biggest table in TPC-H. It has 16 columns, primarily of types DECIMAL, DATE and VARCHAR. There are no NULL values. For this experiment, we read the entire lineitem table into R and then use the dbWriteTable function of the R DBI [22] API to write the table into the database. After this function has been completed, the table will be persistently present within the database storage engine and all the data will have been loaded into the database. We only consider the database systems for this experiment.

The results of this experiment can be seen in Figure 5. We can see that MonetDBLite has the fastest data ingestion. However, we note that SQLite is not very far behind MonetDBLite. For both systems, the primary bottleneck is writing the data to disk. MonetDBLite gains performance by storing the data in a more compact columnar format, rather than the B-tree structure that SQLite uses to store data internally.

All the other systems perform extremely poorly on this benchmark. This is because the data is written to the database over a socket connection, which requires a large amount of network communication. However, the main problem is that these database systems do not have specialized protocol code for copying large amounts of data from the client to the server. Instead, the data is inserted into the database using a series of INSERT INTO statements, which introduces a large amount of overhead, leading to orders of magnitude worse performance than the embedded database systems.

Data Export
For the data export benchmark, we again only consider the lineitem table of the TPC-H benchmark. For this experiment, we read the entire lineitem table from the database into R using the dbReadTable function of the R DBI. This effectively performs a SELECT * FROM lineitem query on the database and stores the result of this query inside an R data frame.

The results of this experiment can be seen in Figure 6. We can see that MonetDBLite has by far the fastest data export rate. Because it runs within the analytical process itself, and because it makes use of zero-copy data transfer of numeric columns, the data can be transferred between the database system and R for almost no cost. By contrast, the databases that are connected through a socket connection take a significantly longer time to transfer the result set to the client. Despite running in-process as well, SQLite also takes a very long time to transfer data from the database to the analytical tool. This is because the conversion of data from a row-major to column-major format takes a significant amount of time.

Query Execution
For the query execution benchmark, we run the first ten queries of the TPC-H benchmark inside each of the systems.

TPC-H SF 1
System        Q1      Q2     Q3      Q4     Q5     Q6     Q7     Q8      Q9      Q10     Total
MonetDBLite   0.74    0.03   0.29    0.06   0.07   0.18   0.08   0.08    0.09    0.20    1.83
MonetDB       0.87    0.02   0.09    0.08   0.10   0.05   0.08   0.11    0.16    0.07    1.63
SQLite        8.41    0.04   1.83    0.44   1.00   1.17   6.52   T       19.05   1.35    T+39.80
PostgreSQL    8.93    0.25   0.71    2.08   0.46   1.06   0.62   0.60    2.31    1.40    18.42
MariaDB       19.65   1.96   4.87    0.97   4.16   2.02   2.13   6.71    18.12   15.67   76.25
data.table    0.45    0.12   0.28    0.20   0.46   0.13   0.27   0.24    0.88    0.20    3.23
dplyr         0.70    0.13   0.34    0.25   0.60   0.17   0.31   0.41    1.17    0.28    4.37
Pandas        0.85    0.19   0.49    0.41   0.93   0.12   0.44   0.56    1.82    0.34    6.15
Julia         0.99    0.10   0.73    0.25   0.53   0.07   0.30   0.67    1.05    0.57    5.26

TPC-H SF 10
System        Q1      Q2     Q3      Q4     Q5     Q6     Q7     Q8      Q9      Q10     Total
MonetDBLite   16.55   0.14   1.92    0.50   0.64   0.44   0.68   0.75    0.95    0.99    23.56
MonetDB       9.63    0.07   1.15    0.87   1.16   0.38   1.00   1.12    1.66    0.68    17.69
SQLite        97.61   0.37   23.17   4.44   12.65  11.69  T      T       T       14.72   T+164.65
PostgreSQL    88.77   2.71   63.87   22.87  4.92   11.41  7.68   6.73    74.42   63.54   346.91
MariaDB       169.58  20.76  124.59  13.34  78.88  33.42  88.72  139.68  218.65  234.95  1122.57
data.table    E       E      E       E      E      E      E      E       E       E       E
dplyr         31.48   1.20   5.13    3.79   8.13   1.83   4.35   4.47    16.29   3.77    80.44
Pandas        E       E      E       E      E      E      E      E       E       E       E
Julia         24.61   5.00   7.32    2.78   9.51   0.66   7.32   13.42   18.90   6.14    95.66

Table 1: Performance results (in seconds) for TPC-H SF1 and SF10. T indicates a query that hit the five-minute timeout; E indicates an out-of-memory error.

For each of the libraries, we have created an equivalent script for each of the queries using each of the libraries.

Library Implementations. Note that, since the libraries naively execute user code without performing any high-level strategic optimizations, there is a lot of room for modifying their performance, as the equivalent functionality could be implemented in many naive and inefficient ways. In the worst case, we could perform cross products and filters instead of performing standard joins. Likewise, we could choose poor join orders or not perform filter or projection push down, and force materialization of many unused tuples.

To attempt to maximize the performance of these libraries, we manually perform the high-level optimizations performed by an RDBMS, such as projection pushdown, filter pushdown, constant folding and join order optimization. We have created these implementations by using the query plans that are executed by VectorWise [4], a state-of-the-art analytical database system that is the front runner on the official TPC-H benchmark for single node machines. All the scripts that we have created for each of the libraries can be found in our software repository. We have also reached out to the developers of each library and received feedback on optimization.

However, having the user apply all these optimizations is not realistic. This scenario assumes the user has perfect knowledge of how to order joins and assumes the user does not do any inefficient steps such as including unused columns. The benchmark results provided for these libraries should therefore be seen as a best-case performance scenario. The benchmark results for these libraries would be significantly worse if we did not manually perform many of the automatic optimizations performed by a database system.

TPC-H SF1
The total time required to complete all the measured TPC-H queries for the different systems is shown in Table 1. We can see that both MonetDB and MonetDBLite show the best performance on the benchmark. They also show very similar performance. This is because the TPC-H benchmark revolves around computing aggregates, and does not involve transferring a large amount of data over the socket connection. As such, the bottleneck is almost entirely the computation performed in the database server. As MonetDB and MonetDBLite use the same internal query execution engine, they have identical performance.

After MonetDB, we can see the various libraries we have tested performing similarly, with only a factor two difference between the best and the worst performing library. The fastest library, data.table, is heavily optimized for performing efficient relational operations. However, even with the optimizations we have performed on the user code, it still cannot reach the performance of an actual analytical database system. This is because the procedural nature of these libraries heavily limits the actual optimizations that can be performed compared to the optimizations that a database can perform on queries issued in the declarative language of SQL. For example, they do not perform late materialization.

The traditional database systems perform significantly worse than the libraries, however. As the TPC-H benchmark is designed to operate on large chunks of a subset of the columns of a table, the row-store layout and tuple-at-a-time processing methods of the traditional database systems perform extremely poorly on this benchmark. We can see that the traditional database systems perform many orders of magnitude worse than the analytical database systems and the libraries we have used.


Figure 7: Loading the ACS data into the database (wall clock time in seconds).
Figure 8: Performing the ACS statistical analysis (wall clock time in seconds).

Individual Query Performance. The performance of each of the systems on individual queries can be seen in the table as well. The libraries perform extremely well on TPC-H Query 1 and Query 6. On Query 1, data.table even manages to beat our analytical database system. The libraries perform well on these queries because the queries only involve performing filters and aggregations on a single table without any joins.

The libraries perform worse on queries involving multiple joins. The join operations in these libraries do not take advantage of meta-data and indices to speed up the joins between the different tables. As such, they perform significantly worse than the analytical database even when using an optimal join order.

The traditional database systems perform poorly on queries that involve a lot of tuples being pushed through the pipeline to the final aggregations. Because of their tuple-at-a-time volcano processing model they invoke a lot of overhead for each tuple that passes through the pipeline. This results in poor performance when many tuples have to be processed at a time.

TPC-H SF10
The results for the TPC-H SF10 benchmark are shown in Table 1. We note that at this scale factor, the entire dataset still fits in memory. However, each of the scripting libraries runs into either out-of-memory errors or heavily penalized performance from swapping on these queries. This is because these libraries require not only the entire dataset to fit in memory, but also require any intermediates created while processing to fit in memory. When the intermediates exceed the available memory of the machine, the program crashes with an out-of-memory exception. The database solutions do not suffer from this problem, as they offload unused data to disk using either the buffer pool or by letting the operating system handle it using memory-mapped files.

While the traditional database systems do not run into crashes due to running out of memory, their performance does degrade by more than an order of magnitude. Because of the row-store layout of these systems, they have to scan and use the entire dataset rather than only the hot columns. As a result, they run into performance penalties as the entire dataset plus the constructed indices do not fit in memory anymore and have to be swapped to disk. The column-store databases do not suffer from this problem because only the actually used columns have to be touched to answer the queries, and these are small enough to be kept in memory.

4.3 ACS Benchmark
For the American Community Survey benchmark, we run the ACS survey analysis script as provided by Anthony Damico [6]. The script in this benchmark wrangles data of the American Community Census, a large scale census performed in the United States that gathers data about roughly 1% of the US population every year. The script consists of two phases. In the first phase, the required data is gathered and downloaded from the official data repositories. In the second phase, the downloaded data is then processed and stored persistently in a database server. The persistently stored data can then be analyzed, and various aggregations and statistics can be gathered from the data using the survey package [12].

The survey package allows the user to hook their own database driver into the script, and will perform a significant amount of processing inside the database. For operations where SQL is insufficient, the data is transferred from the database to R and then processed inside R using various statistical libraries.

The official documentation of the ACS script describes a large amount of statistics that can be gathered from the data. For this benchmark, we benchmark both the required loading time into the database (but exclude the time spent on downloading the data) and a number of statistical operations that are described in the official documentation. We limit ourselves to a subset of the data: we only look at the data from five states of the year 2016. This is ≈ 2.5 GB in data.

Data Loading
The benchmark results for loading the data into the database are shown in Figure 7. MonetDBLite performs the best on this benchmark, but not by as large a factor as seen in the TPC-H benchmark. This is because the survey package performs a lot of preprocessing in R that happens regardless of which database is used. As a result, the performance difference between the different databases is not as overwhelming, but it is still very visible.

Statistics
The benchmark results for running the various statistical functions using the different database connectors are shown in Figure 8. We can see that the difference between the performance of the different database engines is not very large. This is because most of the actual processing happens inside R rather than inside the database. The observed difference in performance is mainly because of the difference in the cost of exporting data from the database. However, since the amount of exported data is not very large compared to the amount of processing that occurs in this scenario, there is less than a factor two difference between the systems.

5 CONCLUSION
In this paper, we have presented the embedded analytical database system MonetDBLite. MonetDBLite performs orders of magnitude better than traditional relational database systems when executing analytical workloads, and provides an order of magnitude faster interface between the database and the analytical tool.

In addition to being significantly faster, MonetDBLite is also easier to set up and use because it does not require an external server and does not have any dependencies. It can be installed through standard package and library managers of popular analytical tools. All of these factors combined make MonetDBLite highly suitable as a persistent data store for analytical tasks.

5.1 Future Directions
There are still several open issues that result from the nature of how MonetDBLite was created. Because the database that it is based on, MonetDB, operates as a stand-alone server, several limitations are present in the code that introduce problems when it is used as an embedded database.

MonetDB traditionally only allows a single database process to read the same database. There is no fine-grained locking between several database processes. Instead, a global lock is used on the entire database. If the user attempts to start a database server with a database that is currently occupied by another server, an error will be thrown ("database locked") and the process will exit. This makes sense in the stand-alone server scenario, as running multiple database servers on the same database does not make much sense. However, it is a problem in the embedded database scenario because multiple processes might want to access the same database independently.

Another limitation is that the MonetDB server can only run on a single database at a time because of the large amount of global variables present in the codebase of MonetDB. This is no problem in the stand-alone server case, because another server can be started in a different database directory. However, for the embedded case, this is a painful limitation because only a single database can be opened in the same process.

Acknowledgments
We thank the MonetDB contributors at the CWI Database Architectures group for their continued help. We thank Anthony Damico for his help on reporting and resolving bugs and his help on integrating MonetDBLite into his survey analysis scripts. We also thank Pedro Ferreira for his work integrating MonetDBLite into the JVM. This work was funded by the Netherlands Organisation for Scientific Research (NWO), projects "Process mining for multi-objective online control" (Raasveldt) and "Capturing the Laws of Data Nature" (Mühleisen).

REFERENCES
[1] 2013. TPC Benchmark H (Decision Support) Standard Specification. Technical Report. Transaction Processing Performance Council. http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
[2] Grant Allen and Mike Owens. 2010. The Definitive Guide to SQLite (2nd ed.). Apress, Berkeley, CA, USA.
[3] Jeff Bezanson, Stefan Karpinski, Viral B. Shah, and Alan Edelman. 2012. Julia: A Fast Dynamic Language for Technical Computing. CoRR abs/1209.5145 (2012). arXiv:1209.5145 http://arxiv.org/abs/1209.5145
[4] Peter Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR.
[5] United States Census Bureau. 2014. American Community Survey. Technical Report.
[6] Anthony Damico. 2018. American Community Survey (ACS). Technical Report.
[7] Matt Dowle. 2017. Package 'data.table'. (2017). https://cran.r-project.org/web/packages/data.table/data.table.pdf
[8] Free Software Foundation. 2018. The GNU C Library - Memory-mapped I/O. Technical Report. Free Software Foundation.
[9] Free Software Foundation. 2018. The GNU C Library - Memory Protection. Technical Report. Free Software Foundation.
[10] Pedro Holanda, Mark Raasveldt, and Martin Kersten. 2017. Don't Hold My UDFs Hostage - Exporting UDFs For Debugging Purposes. In Proceedings of the Simpósio Brasileiro de Banco de Dados, SBBD 2017, Uberlândia, Brazil.
[11] Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, Sjoerd Mullender, and Martin Kersten. 2012. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Eng. Bull. (2012).
[12] Thomas Lumley. 2018. Package 'survey'. (2018). https://cran.r-project.org/web/packages/survey/survey.pdf
[13] Wes McKinney. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Stéfan van der Walt and Jarrod Millman (Eds.). 51–56.
[14] Mark Raasveldt and Hannes Mühleisen. 2016. Vectorized UDFs in Column-Stores. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management, SSDBM 2016, Budapest, Hungary, July 18-20, 2016. 16:1–16:12.
[15] Mark Raasveldt and Hannes Mühleisen. 2017. Don't Hold My Data Hostage: A Case for Client Protocol Redesign. Proc. VLDB Endow. 10, 10 (June 2017), 1022–1033. https://doi.org/10.14778/3115404.3115408
[16] Felix Martin Schuhknecht, Jens Dittrich, and Ankur Sharma. 2016. RUMA Has It: Rewired User-space Memory Access is Possible! Proc. VLDB Endow. 9, 10 (June 2016), 768–779. https://doi.org/10.14778/2977797.2977803
[17] Lefteris Sidirourgos and Martin Kersten. 2013. Column Imprints: A Secondary Index Structure. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD '13). ACM, New York, NY, USA, 893–904. https://doi.org/10.1145/2463676.2465306
[18] Michael Stonebraker and Greg Kemnitz. 1991. The POSTGRES Next Generation Database Management System. Commun. ACM 34, 10 (Oct. 1991), 78–92. https://doi.org/10.1145/125223.125262
[19] The HDF Group. 2000-2010. Hierarchical data format version 5. (2000-2010). http://www.hdfgroup.org/HDF5
[20] Deepak Vohra. 2016. Apache Parquet. Apress, Berkeley, CA, 325–335. https://doi.org/10.1007/978-1-4842-2199-0_8
[21] Hadley Wickham. 2017. Package 'dplyr': A Grammar of Data Manipulation. (2017). https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
[22] Hadley Wickham. 2018. Package 'DBI'. (2018). https://cran.r-project.org/web/packages/DBI/DBI.pdf
[23] Michael Widenius and David Axmark. 2002. MySQL Reference Manual (1st ed.). O'Reilly & Associates, Inc., Sebastopol, CA, USA.