Couchbase Analytics: Noetl for Scalable Nosql Data Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Couchbase Analytics: NoETL for Scalable NoSQL Data Analysis Murtadha Al Hubail1 Ali Alsuliman1 Michael Blow1 Michael Carey12 Dmitry Lychagin1 Ian Maxon12 Till Westmann1 1Couchbase, Inc. 2University of California, Irvine [email protected] ABSTRACT Couchbase Server is a highly scalable document-oriented database management system. With a shared-nothing architecture, it exposes a fast key-value store with a managed cache for sub-millisecond Full-Text data operations, indexing for fast queries, and a powerful query Eventing Query Data Analytics Indexing engine for executing declarative SQL-like queries. Its Query Ser- Search vice debuted several years ago and supports high volumes of low- latency queries and updates for JSON documents. Its recently intro- Figure 1: Couchbase Server Overview duced Analytics Service complements the Query Service. Couch- base Analytics, the focus of this paper, supports complex analytical queries (e.g., ad hoc joins and aggregations) over large collections however, since the early days of mainframes and their attached ter- of JSON documents. This paper describes the Analytics Service minals. from the outside in, including its user model, its SQL++ based Today’s mission-critical applications demand support for mil- query language, and its MPP-based storage and query processing lions of interactions with end-users via the Web and mobile devices. architecture. It also briefly touches on the relationship of Couch- In contrast, traditional database systems were built for thousands of base Analytics to Apache AsterixDB, the open source Big Data users. Designed for strict consistency and data control, they tend to management system at the core of Couchbase Analytics. lack agility, flexibility, and scalability. To handle a variety of use cases, organizations today often end up deploying multiple types PVLDB Reference Format: of databases, resulting in a “database sprawl” that brings with it in- Murtadha Al Hubail, Ali Alsuliman, Michael Blow, Michael Carey, Dmitry efficiencies, slow times to market, poor customer experiences, and Lychagin, Ian Maxon, and Till Westmann. Couchbase Analytics: for analytics, slow times to insight as well. This is where Couch- NoETL for Scalable NoSQL Data Analysis. PVLDB, 12(12): 2275-2286, 2019. base Server comes in – it aims to reduce the degree of database DOI: https://doi.org/10.14778/3352063.3352143 sprawl and to reduce the mismatch between the applications’ view of data and its persisted view, thereby enabling the players – from application developers to DBAs and now data analysts as well – to 1. INTRODUCTION work with their data in its natural form. It adds the much-needed It has been fifty years (yes, a full half-century!) since Ted Codd agility, flexibility, and scalability while also retaining the benefits changed the face of data management with the introduction of the of a logical view of data and SQL-like declarative queries. relational data model [12, 13]. His simple tabular view of data, re- lated by values instead of by pointers, made it possible to design 2. COUCHBASE SERVER declarative query languages to allow business application develop- ers and business analysts to interact with their databases logically Couchbase as a company was created through a merger of two rather than physically – specifying what, not how, they want. The startups, Membase and CouchOne, in 2011. This brought two ensuing decades of research and industrial development brought technologies together: Membase Server, a distributed key-value numerous innovations, including SQL, indexing, query optimiza- database based on memcached, and CouchDB, a single-node doc- tion, parallel query processing, data warehouses, and many other ument database supporting JSON. The goal of the merged com- features that are now taken for granted in today’s multibillion-dollar pany was to combine and build on the strengths of these two com- database industry. The world of data and applications has changed, plementary technologies and teams. Couchbase Server today is a highly scalable document-oriented database management system [5]. With a shared-nothing architecture, Couchbase Server exposes This work is licensed under the Creative Commons Attribution- a fast key-value store with a managed cache for sub-millisecond NonCommercial-NoDerivatives 4.0 International License. To view a copy data operations, secondary indexing for fast queries, and a high- of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For performance query engine (actually two complementary engines) any use beyond those covered by this license, obtain permission by emailing for executing declarative SQL-like N1QL∗ queries. [email protected]. Copyright is held by the owner/author(s). Publication rights Figure 1 provides a high-level overview of Couchbase Server. licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 12, No. 12 (Not included in the figure is Couchbase Mobile, which extends ISSN 2150-8097. ∗ DOI: https://doi.org/10.14778/3352063.3352143 Short for Non-1NF Query Language. 2275 Couchbase data platform to the edge, securely managing and sync- 3. COUCHBASE ANALYTICS SERVICE ing data from any cloud to all edge devices.) Architecturally, the The Couchbase Analytics Service complements the Query Ser- system is organized as a set of services that are deployed and man- vice by providing support for complex and potentially expensive aged as a whole on a Couchbase Server cluster. Physically, a clus- ad-hoc analytical queries (e.g., large joins and aggregations) over ter consists of a group of interchangeable nodes that operate in a collections of JSON documents.y Figure 2 provides a high-level peer-to-peer topology and the services running on each node can illustration of the role that the Analytics Service plays in Couch- be managed as required. Nodes can be added or removed through a base Server. The Data Service and the Query Service provide user- rebalance process that redistributes the data evenly across all nodes. facing applications with low-latency key-value and/or query-based The addition or removal of nodes can also be used to increase or access to their data. The design point for these services is a large decrease the CPU, memory, disk, or network capacity of a cluster. number of users making relatively small, inexpensive requests. In This process is done online and requires no system downtime. The contrast, the Analytics Service is designed to support a much smaller ability to dynamically scale the cluster capacity and individually number of users posing much larger, more expensive N1QL queries map services to sets of nodes is sometimes referred to as Multi- against a real-time shadowed copy of the same JSON data. Dimensional Scaling (MDS). The technical objectives of Figure 2’s approach are similar to An important aspect of the Couchbase Server architecture is how those of Gartner’s HTAP concept for hybrid workloads [15]: Data data mutations are communicated across services. Mutation coor- doesn’t have to be copied from operational databases to data ware- dination is done via an internal Couchbase Server protocol, called houses before analysis can commence, operational data is readily the Database Change Protocol (DCP), that keeps components in available for analytics as soon as it is created, and analyses always sync by notifying them of all mutations to documents managed by utilize fresh application data. This minimizes the time to insight the Data Service. The Data Service in Figure 1 is the producer of for application and data analysts by allowing them to immediately such mutation notifications, with the other services participating as pose questions against their operational data in terms of its natu- DCP listeners. ral data model. This is in contrast to traditional architectures [11] The Couchbase Data Service provides the foundation for doc- that, even today, involve warehouse schema design and mainte- ument management. It is based on a memory-first, asynchronous nance, periodically-executed ETL (extract-transform-load) scripts architecture that provides the core capabilities of caching, data per- to copy and reformat data into its “analysis home” before being sistence, and inter-node replication. The document data model for queried, and the understanding of a second/different data model in Couchbase Server is JSON [17], a flexible, self-describing data for- order to perform analyses on the warehouse’s rendition of the ap- mat capable of representing rich structures and relationships. Doc- plication data. Like HTAP, this reduces the time to insight from uments are stored in containers called buckets. A bucket is a logical days or hours to seconds. collection of related documents in Couchbase, similar to a database The next two sections of the paper will drill down, respectively, or a schema in a relational database. It is a unique key space and on the Analytics Service’s user model and internal technology. Be- documents can be accessed using a (user-provided) document ID fore proceeding, however, it is worth noting several differences much as one would use a primary key for lookups in an RDBMS. between Couchbase Server’s NoSQL model and architecture and Unlike a traditional RDBMS, the “schema” for documents in “traditional HTAP” in the relational world. One relates to scale: Couchbase Server is a logical construct defined based on the appli- Like all of the services in Couchbase Server, the Analytics Service cation code and captured in the structure of the stored documents. is designed to scale out horizontally on a shared-nothing cluster. Since there is no explicitly defined schema to be maintained, de- Also like other Couchbase Server services, it can be scaled inde- velopers can add new objects and properties at any time simply by pendently, as Figure 2 shows. The Analytics Service maintains a deploying new application code that stores new JSON data without real-time shadow copy of the operational data that the enterprise having to also make and deploy corresponding changes to a static wants to have available for analysis; the copy is because Analytics schema.