Scalable Push-Based Real-Time Queries on Top of Pull-Based Databases (Extended)
Total Page:16
File Type:pdf, Size:1020Kb
InvaliDB: Scalable Push-Based Real-Time Queries on Top of Pull-Based Databases (Extended) Wolfram Wingerath Felix Gessert Norbert Ritter Baqend GmbH Baqend GmbH University of Hamburg Stresemannstr. 23 Stresemannstr. 23 Vogt-Kölln-Strae 30 22769 Hamburg, Germany 22769 Hamburg, Germany 22527 Hamburg, Germany [email protected] [email protected] [email protected] hamburg.de ABSTRACT pull-based and therefore ill-equipped to push new informa- Traditional databases are optimized for pull-based queries, tion to clients [67]. Systems for managing data streams, on i.e. they make information available in direct response to the other hand, are natively push-oriented and thus facil- client requests. While this access pattern is adequate for itate reactive behavior [35]. However, they do not retain mostly static domains, it requires inefficient and slow work- data indefinitely and are therefore not able to answer his- arounds (e.g. periodic polling) when clients need to stay torical queries. The separation between these two system up-to-date. Acknowledging reactive and interactive work- classes gives rise to both high complexity and high main- loads, modern real-time databases such as Firebase, Me- tenance costs for applications that require persistence and teor, and RethinkDB proactively deliver result updates to real-time change notifications at the same time [77]. their clients through push-based real-time queries. However, In this paper, we present the system design InvaliDB for current implementations are only of limited practical rele- bridging the gap between pull-based database and push- vance, since they are incompatible with existing technology based data stream management: InvaliDB is a real-time stacks, fail under heavy load, or do not support complex database built on top of a pull-based (NoSQL) database, queries to begin with. To address these issues, we propose supports the query language of the underlying datastore the system design InvaliDB which combines linear read and for both pull- and push-based queries, and scales linearly write scalability for real-time queries with superior query with reads and writes through a scheme for two-dimensional expressiveness and legacy compatibility. We compare In- workload distribution. valiDB against competing system designs to emphasize the benefits of our approach. To validate our claims of linear 1.1 Real-Time Databases: Open Challenges scalability, we further present an experimental evaluation of Numerous system designs have been proposed to provide the InvaliDB prototype that has been serving customers at collection-based semantics for pull-based and push-based the Database-as-a-Service company Baqend since July 2017. queries alike: Subsumed under the term \real-time data- bases"1 [69] [81], systems like Firebase, Meteor, and Re- PVLDB Reference Format: thinkDB provide push-based real-time queries that do not Wolfram Wingerath, Felix Gessert, Norbert Ritter. InvaliDB: only deliver a single result upon request, but also a contin- Scalable Push-Based Real-Time Queries on Top of Pull-Based uous stream of informational updates thereafter. Like tra- Databases (Extended). PVLDB, 13(12): 3032-3045, 2020. DOI: https://doi.org/10.14778/3415478.3415532 ditional database systems, real-time databases store consis- tent snapshots of domain knowledge. But like stream man- agement systems, they let clients subscribe to long-running 1. INTRODUCTION queries that push incremental updates. In concept, real-time databases thus extend traditional Many of today's web applications notify users of status database systems as they follow the same semantics, but updates and other events in realtime. But even though provide an additional mode of access. In practice, though, more and more usage scenarios revolve around the inter- there is no established scheme how to actually build a real- action between users, building applications that detect and time database on top of a traditional database system. Ex- publish changes with low latency remains notoriously hard isting real-time databases have been built from scratch and even with state-of-the-art data management systems. While consequently do not inherit the rich feature set and stabil- traditional database systems (or short \databases") excel at ity that some pull-based systems have gained over decades complex queries over historical data [37], they are inherently of development. To date, every push-based real-time query mechanism fails in at least one of the following challenges: This work is licensed under the Creative Commons Attribution- NonCommercial-NoDerivatives 4.0 International License. To view a copy C1 Scalability. Serving real-time queries is a resource- of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For intensive process which requires continuous monitor- any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights 1 licensed to the VLDB Endowment. In the past, the term \real-time databases" has been used Proceedings of the VLDB Endowment, Vol. 13, No. 12 to reference specialized pull-based databases that produce ISSN 2150-8097. an output within strict timing constraints [60] [14] [27]; we DOI: https://doi.org/10.14778/3415478.3415532 do not share this notion of real-time databases. 3032 ing of all write operations that might possibly affect decoupling failure domains and enabling independent scal- query results. To sustain more demanding workloads ing for both. As our third contribution in Sections 6 and than a single machine could handle, real-time databa- 7, we present an experimental evaluation of our InvaliDB ses typically partition the set of queries across server prototype that has been used in production as part of the nodes. As each node is only responsible for a sub- Database-as-a-Service (DBaaS) offering at Baqend [9] since set of all queries in this scheme, most systems can July 2017. Our experiments confirm that InvaliDB's perfor- scale with the number of concurrent queries. However, mance scales linearly with the number of matching nodes, we are not aware of any real-time database that sup- both in terms of sustainable write throughput and the num- ports partitioning the write stream as well. Thus, re- ber of concurrent real-time queries. The results further indi- sponsibility for individual queries is not shared among cate that our InvaliDB implementation exhibits consistently nodes and overall system throughput remains bottle- low latency even under high per-node load, irrespective of necked by single-machine capacity: Queries become the number of nodes employed for query matching. We dis- intractable as soon as one of the nodes is not able to cuss our findings in Section 8 and conclude in Section 9. keep up with processing the entire write stream. C2 Expressiveness. The majority of real-time query 3. RELATED WORK: PUSH-BASED APIs are limited in comparison to their ad hoc counter- parts. Even relatively simple queries are often unsup- ACCESS IN DATA MANAGEMENT ported or have severe restrictions; for example, some Given the trend towards reactivity in modern applica- implementations do not offer filter composition with tions, data management systems of all classes have come AND/OR, do not allow ordering by multiple attributes, to provide push-based data access in one form or another. or support limit, but no offset clauses. The lack of such In Table 1, we delineate real-time databases from traditional basic functionality on the database side necessitates in- databases on the one side and systems for stream manage- efficient workarounds in the application code, even for ment and processing on the other. moderately sophisticated data access patterns. Table 1: An overview over data access in data management. C3 Legacy Support. Today's real-time databases do not Database Real-Time Data Stream Stream follow standards regarding data model or query lan- Management Databases Management Processing guage, implement custom protocols for pull-based and Primitive persistent collections ephemeral streams push-based data access alike, and exhibit interfaces one-time + Processing one-time continuous that are incompatible among different vendors. While continuous random + the complete lack of support for legacy interfaces may Access random sequential (single-pass) be acceptable in the development of new applications, sequential structured, Data structured it complicates the adoption of push-based queries for unstructured existing software. Traditional (SQL) database management systems are We argue that neither of these limitations is inherent to optimized for expressive random access queries over persis- the challenge of providing push-based real-time queries over tent data collections. They provide a wealth of features database collections. To prove our point, we present the de- for applications based on request-response interaction, but sign and implementation of a system for real-time queries maintaining query results on a per-user basis is not what that is built on top of the pull-based NoSQL database Mon- they have been designed for. Few SQL systems offer ac- goDB, supports sorted filter queries over single collections tive features beyond triggers [28] or event-condition-action for pull- and push-based execution alike, and scales linearly (ECA) rules [65] and existing functionality for query result with reads and writes. maintenance is almost exclusively employed for optimizing pull-based query performance (e.g. materialized views [23] [36] or change notification mechanisms [56] [51] [59]). 2. CONTRIBUTIONS & OUTLINE To warrant low-latency updates in quickly evolving do- Our contributions in this paper are threefold. First in Sec- mains, systems for data stream management [35] [73] tion 3, we present a discussion of push-based query mech- break with the idea of maintaining a persistent data repos- anisms in modern data management to expound why real- itory. Instead of random access queries on static collec- time databases deserve distinction in a separate system class tions, they perform sequential, long-running queries over next to pull-based database and push-based data stream data streams.