Monetdb/Datacell: Online Analytics in a Streaming Column-Store

MonetDB/DataCell: Online Analytics in a Streaming Column-Store Erietta Liarou, Stratos Idreos, Stefan Manegold, Martin Kersten CWI, Amsterdam, The Netherlands [email protected] [email protected] [email protected] [email protected] ABSTRACT tion, web log analysis requires fast analysis of big streaming In DataCell, we design streaming functionalities in a mod- data for decision support. ern relational database kernel which targets big data analyt- A new processing paradigm is born [16, 10, 12] where in- ics. This includes exploitation of both its storage/execution coming streams of data need to quickly be analyzed and engine and its optimizer infrastructure. We investigate the possibly combined with existing data to discover trends and opportunities and challenges that arise with such a direction patterns. Subsequently, the new data may also enter the and we show that it carries significant advantages for mod- data warehouse and be stored as normal for further analy- ern applications in need for online analytics such as web sis if necessary. This new paradigm requires scalable query logs, network monitoring and scientific data management. processing that can combine continuous querying for fast The major challenge then becomes the efficient support for reaction to incoming data with traditional querying for ac- specialized stream features, e.g., multi-query processing and cess to the existing data. However, neither pure database incremental window-based processing as well as exploiting technology nor pure stream technology are designed for this standard DBMS functionalities in a streaming environment purpose. Database systems do not qualify for continuous such as indexing. query processing, while data stream systems are not built This demo presents DataCell, an extension of the Mon- to scale for big data analysis. etDB open-source column-store for online analytics. The DataCell Motivation. MonetDB/DataCell aims to pro- demo gives users the opportunity to experience the features vide scalable online analytics. We begin from a state of of DataCell such as processing both stream and persistent the art column-store design for big data analytics, Mon- data and performing window based processing. The demo etDB, and extend it with online functionality. The goal is provides a visual interface to monitor the critical system to fully exploit the generic storage and execution engine of components, e.g., how query plans transform from typical the DBMS as well as its complete optimizer stack. Stream DBMS query plans to online query plans, how data flows processing then becomes primarily a query scheduling task, through the query plans as the streams evolve, how Da- i.e., make the proper queries see the proper portion of stream taCell maintains intermediate results in columnar form to data at the proper time. A positive side-effect is that our avoid repeated evaluation of the same stream portions, etc. architecture supports SQL’03, which allows stream applica- The demo also provides the ability to interactively set the tions to exploit sophisticated query semantics. Numerous test scenarios and various DataCell knobs. research and technical questions immediately arise. The most prominent issues are the ability to provide specialized stream functionality and hindrances to guarantee real-time 1. INTRODUCTION constraints for event handling. Contributions and Demo. Paper [16] illustrates the Numerous applications nowadays require online analytics basic DataCell architecture and sets the research path and over high rate streaming data. For example, emerging ap- critical milestones. In addition, DataCell ships together plications over mobile data can exploit the big mobile data with the MonetDB open-source system. Here, we present streams for advertising and traffic control. In addition, the a demo based on MonetDB/DataCell. The demo showcases recent and continuously expanding massive cloud infrastruc- several of the key aspects of the MonetDB/DataCell design tures require continuous monitoring to remain in good state such as the ability to do both stream processing and normal and prevent fraud attacks. Similarly, scientific databases queries and the ability to provide window based processing. create data at massive rates daily or even hourly. In addi- Through a graphical user interface the user of the demo can observe the query execution inside DataCell, e.g., how the Permission to make digital or hard copies of all or part of this work for various columnar structures are populated, how intermedi- personal or classroom use is granted without fee provided that copies are ate results are kept around to avoid reevaluation for sliding not made or distributed for profit or commercial advantage and that copies window queries and how the shape of a normal query plan bear this notice and the full citation on the first page. To copy otherwise, to changes to a continuous query plan through the optimizer. republish, to post on servers or to redistribute to lists, requires prior specific In addition, the user can interact with the system, posing permission and/or a fee. Articles from this volume were invited to present continuous or one-time queries or choosing some pre-defined their results at The 38th International Conference on Very Large Data Bases, August 27th - 31st 2012, Istanbul, Turkey. scenarios as well as varying both input parameters and Da- Proceedings of the VLDB Endowment, Vol. 5, No. 12 taCell knobs. Copyright 2012 VLDB Endowment 2150-8097/12/08... $ 10.00. 1910 Paper Organization. The rest of the paper is organized Receptors Parser/Compiler Emitters as follows. Section 2 briefly discusses related work while Optimizer Section 3 gives an overview of the main design of DataCell. Then, Section 4 presents the demonstration scenarios as well Rewriter as the ways the user can interact with the system and how the demo visualizes query execution and system status. Fi- Scheduler/Factories nally, Section 5 concludes the paper. 2. RELATED WORK Kernel DataCell fundamentally differs from existing stream ef- Baskets Tables forts [3, 4, 5, 8, 9, 11, 13, 21, 6, 2], etc. by building on top of the storage and execution engine of a DBMS kernel. It opens a very interesting path towards exploiting and merg- Figure 1: The DataCell architecture. ing technologies from both worlds. as a collection of Binary Association Tables (BATs), one for Compared to even earlier efforts on active databases, e.g., each attribute. Advanced column-stores process one column [20], DataCell adds support for specialized stream function- at a time, using late tuple reconstruction, discussed in, e.g., alities, such as incremental processing. Incremental process- [1, 15]. Intermediates are also in column format. This allows ing in a DBMS has been studied in the context of updat- the engine to exploit CPU- and cache-optimized vector-like ing materialized views, e.g., [7, 14], but there the scenario is operator implementations throughout the whole query eval- very different given that it targets read-mostly environments uation, using an efficient bulk processing model instead of whereas an online scenario is by definition a write-only one. the typical tuple-at-a-time volcano approach. This way, a Truviso Continuous Analytics system [12], a commercial select operator for example, operates on a single column, fil- product of Truviso, is another recent example that follows tering the qualifying values and producing an intermediate the same approach as DataCell. They extend the open that holds their tuple IDs. This intermediate can then be source PostgreSQL database [19] to enable continuous anal- used to retrieve the necessary values from a different column ysis of streaming data, tackling the problem of low latency for further actions, e.g., aggregations, further filtering, etc. query evaluation over massive data volumes. TruCQ inte- The key point is that in DataCell these intermediates can be grates streaming and traditional relational query process- exploited for flexible incremental processing strategies, i.e., ing in such a way that ends-up to a stream-relational da- we can selectively keep around the proper intermediates at tabase architecture. It is able to run SQL queries continu- the proper places of a plan for efficient future reuse. ously and incrementally over data while they are still com- DataCell. DataCell [16] is positioned between the SQL ing and before they are stored in active database tables (if compiler/optimizer and the DBMS kernel. The SQL com- they need to be stored). TruCQ’s query processing signif- piler is extended with a few orthogonal language constructs icantly outperforms traditional store-first-query-later data- to recognize and process continuous queries. The query plan base technologies as the query evaluation has already been as generated by the SQL optimizer is rewritten to a contin- initiated when the first tuples arrive. It allows evaluation uous query plan and handed over to the DataCell scheduler. of one-time queries, continuous queries, as well as combina- In turn, the scheduler handles the execution of the plan. tions of both types. Figure 1 shows a DataCell instance. It contains receptors Another recent work, coming from the HP Labs [10], also and emitters, i.e., a set of separate processes per stream and confirms the strong research attraction for this trend. It per client, respectively, to listen for new data and to deliver defines an extended SQL query model that unifies queries results. They form the edges of the architecture and the over both static relations and dynamic streaming data, by bridges to the outside world, e.g., to sensor drivers. developing techniques to generalize the query engine. It also Baskets/Columns. The key idea is that when an event extends the PostgreSQL database kernel [19], building an stream enters the system via a receptor, stream tuples are engine that can process persistent and streaming data in immediately stored in a lightweight table, called basket. By a uniform design. First, they convert the stream into a collecting event tuples into baskets, DataCell can evaluate sequence of chunks and then continuously call the query the continuous queries over the baskets as if they were nor- over each sequential chunk.

Load more