Schema-Agnostic Indexing with Azure Documentdb

Schema-Agnostic Indexing with Azure DocumentDB Dharma Shukla, Shireesh Thota, Karthik Raman, Sudipta Sengupta, Justin Levandoski, Madhan Gajendran, Ankur Shah, Sergii Ziuzin, David Lomet Krishnan Sundaram, Miguel Gonzalez Guajardo, Anna Wawrzyniak, Samer Boshra, Renato Ferreira, Mohamed Nassar, Michael Koltachev, Ji Huang Microsoft Corporation Microsoft Research ABSTRACT Section 6 describes the related commercial systems, and Section 7 Azure DocumentDB is Microsoft’s multi-tenant distributed concludes the paper. database service for managing JSON documents at Internet scale. 1.1 Overview of the Capabilities DocumentDB is now generally available to Azure developers. In DocumentDB is based on the JSON data model [2] and JavaScript this paper, we describe the DocumentDB indexing subsystem. language [3] directly within its database engine. We believe this is DocumentDB indexing enables automatic indexing of documents crucial for eliminating the “impedance mismatch” between the without requiring a schema or secondary indices. Uniquely, application programming languages/type-systems and the database DocumentDB provides real-time consistent queries in the face of schema [4]. Specifically, this approach enables the following very high rates of document updates. As a multi-tenant service, DocumentDB capabilities: DocumentDB is designed to operate within extremely frugal resource budgets while providing predictable performance and The query language supports rich relational and hierarchical robust resource isolation to its tenants. This paper describes the queries. It is rooted in JavaScript’s type system, expression DocumentDB capabilities, including document representation, evaluation and function invocation model. Currently the query query language, document indexing approach, core index support, language is exposed to developers as a SQL dialect and and early production experiences. language integrated JavaScript query (see [5]), but other frontends are possible. 1. INTRODUCTION The database engine is optimized to serve consistent queries Azure DocumentDB [1] is Microsoft’s multi-tenant distributed in the face of sustained high volume document writes. By database service for managing JSON [2] documents at Internet default, the database engine automatically indexes all scale. Several large Microsoft applications, including Office, documents without requiring schema or secondary indexes Skype, Active Directory, Xbox, and MSN, have been using from developers. DocumentDB, some since early 2012. DocumentDB was recently Transactional execution of application logic provided via released for general availability to Azure developers. stored procedures and triggers, authored entirely in JavaScript In this paper, we describe DocumentDB’s indexing subsystem. and executed directly inside DocumentDB’s database engine. The indexing subsystem needs to support (1) automatic indexing of We exploit the native support for JSON values common to documents without requiring a schema or secondary indices, (2) both the JavaScript language runtime and the database engine DocumentDB’s query language, (3) real-time, consistent queries in in a number of ways - e.g. by allowing the stored procedure to the face of sustained high document ingestion rates, and (4) multi- execute under an implicit database transaction, we allow the tenancy under extremely frugal resource budgets while (5) still JavaScript throw keyword to model a transaction abort. The providing predictable performance guarantees and remaining cost details of transactions are outside the scope of this paper and effective. will be discussed in future papers. As a geo-distributed database system, DocumentDB offers The paper is organized as follows: The rest of this section provides well-defined and tunable consistency levels for developers to a short overview of DocumentDB’s capabilities and architecture as choose from (strong, bounded-staleness, session and eventual well as, the design goals for indexing. Section 2 discusses schema- [6]) and corresponding performance guarantees [1, 7]. agnostic indexing. Section 3 describes the logical nature of As a fully-managed, multi-tenant cloud database service, all DocumentDB’s JSON derived index terms. Section 4 deals with the machine and resource management is abstracted from users. indexing method and discusses index maintenance, replication, We offer tenants the ability to elastically scale both the recovery and considerations for effective resource governance. In throughput and SSD-backed document storage, and take full Section 5, we substantiate design choices we have made with key responsibility of resource management, cost effectively. metrics and insights harvested from our production clusters. This work is licensed under the Creative Commons Attribution-NonCommercial- NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st – September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 12 Copyright 2015 VLDB Endowment 2150-8097/15/08. 1668 without requiring developers to specify schema or secondary indices. Configurable storage/performance tradeoffs: Although documents are automatically indexed by default, developers should be able to make fine grained tradeoffs between the storage overhead of index, query consistency and write/query performance using a custom indexing policy. The index transformation resulting from a change in the indexing policy must be done online for availability and in-place for storage efficiency. Figure 1. DocumentDB system topology and components. Efficient, rich hierarchical and relational queries: The index 1.2 Resource Model should efficiently support the richness of DocumentDB’s A tenant of DocumentDB starts by provisioning a database account query APIs (currently, SQL and JavaScript [5]), including (using an Azure subscription). A database account manages one or support for hierarchical and relational projections and more DocumentDB databases. A DocumentDB database in-turn composition with JavaScript UDFs. manages a set of entities: users, permissions and collections. A DocumentDB collection is a schema-agnostic container of arbitrary Consistent queries in face of sustained volume of document user generated documents. In addition to documents, a writes: For high write throughput workloads requiring DocumentDB collection also manages stored procedures, triggers, consistent queries, the index needs to be updated efficiently user defined functions (UDFs) and attachments. Entities under the and synchronously with the document writes. The crucial tenant’s database account – databases, users, collections, requirement here is that the queries must be served with the documents etc. are referred to as resources. Each resource is consistency level configured by the developer without uniquely identified by a stable and logical URI and is represented violating performance guarantees offered to developers. as a JSON document. Developers can interact with resources via HTTP (and over a stateless TCP protocol) using the standard HTTP Multi-tenancy: Multi-tenancy requires careful resource verbs for CRUD (create, read update, delete), queries and stored governance. Thus, index updates must be performed within procedures. Tenants can elastically scale a resource of a given type the strict budget of system resources (CPU, memory, storage by simply creating new resources which get placed across resource and IOPS) allocated per replica. For predictable placement partitions. Each resource partition provides a single system image and load balancing of replicas on a given machine, the worst- for the resource(s) it manages, allowing clients to interact with the case on-disk storage overhead of the index should be bounded resources within the partition using their stable, logical URIs. A and predictable. resource partition is made highly available by a replica set. Individually and collectively, each of the above goals pose 1.3 System Topology significant technical challenges and require careful tradeoffs. We asked ourselves two crucial questions while considering the above The DocumentDB service is deployed worldwide across multiple goals: (1) what should be the logical and physical representations Azure regions [8]. We deploy and manage DocumentDB service on of the index? (2) what is the most efficient technique to build and clusters of machines each with dedicated local SSDs. Upon maintain the index within a frugal budget of system resources in a deployment, the DocumentDB service manifests itself as an multi-tenant environment? The rest of this paper discusses how we overlay network of machines, referred to as a federation (Figure 1) answered these questions when building the DocumentDB which spans one or more clusters. Each machine hosts replicas indexing subsystem. But first, we define what is meant by schema- corresponding to various resource partitions within a fixed set of agnostic indexing. processes. Replicas corresponding to the resource partitions are placed and load balanced across machines in the federation. Each 2. SCHEMA AGNOSTIC INDEXING replica hosts an instance of the DocumentDB’s database engine, In this section, we explore the key insight to make the which manages

Load more