Spanner: Becoming a SQL System
David F. Bacon, Nathan Bales, Nico Bruno, Brian F. Cooper, Adam Dickinson, Andrew Fikes, Campbell Fraser, Andrey Gubarev, Milind Joshi, Eugene Kogan, Alexander Lloyd, Sergey Melnik, Rajesh Rao, David Shue, Christopher Taylor, Marcel van der Holst, Dale Woodford
Google, Inc.

ABSTRACT
Spanner is a globally-distributed data management system that backs hundreds of mission-critical services at Google. Spanner is built on ideas from both the systems and database communities. The first Spanner paper, published at OSDI'12, focused on the systems aspects such as scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution. This paper highlights the database DNA of Spanner. We describe distributed query execution in the presence of resharding, query restarts upon transient failures, range extraction that drives query routing and index seeks, and the improved blockwise-columnar storage format. We touch upon migrating Spanner to the common SQL dialect shared with other systems at Google.

1. INTRODUCTION
Google's Spanner [5] started out as a key-value store offering multi-row transactions, external consistency, and transparent failover across datacenters. Over the past 7 years it has evolved into a relational database system. In that time we have added a strongly-typed schema system and a SQL query processor, among other features. Initially, some of these database features were "bolted on": the first version of our query system used high-level APIs almost like an external application, and its design did not leverage many of the unique features of the Spanner storage architecture. However, as we have developed the system, the desire to make it behave more like a traditional database has forced the system to evolve. In particular,

• The architecture of the distributed storage stack has driven fundamental changes in our query compilation and execution, and

• The demands of the query processor have driven fundamental changes in the way we store and manage data.

These changes have allowed us to preserve the massive scalability of Spanner, while offering customers a powerful platform for database applications. We have previously described the distributed architecture and the data and concurrency model of Spanner [5]. In this paper, we focus on the "database system" aspects of Spanner, in particular how query execution has evolved and forced the rest of Spanner to evolve. Most of these changes have occurred since [5] was written, and in many ways today's Spanner is very different from what was described there.

A prime motivation for this evolution towards a more "database-like" system was the experience of Google developers trying to build on previous "key-value" storage systems. The prototypical example of such a key-value system is Bigtable [4], which continues to see massive usage at Google for a variety of applications. However, developers of many OLTP applications found it difficult to build these applications without a strong schema system, cross-row transactions, consistent replication, and a powerful query language. The initial response to these difficulties was to build transaction processing systems on top of Bigtable; an example is Megastore [2]. While these systems provided some of the benefits of a database system, they lacked many traditional database features that application developers often rely on. A key example is a robust query language, meaning that developers had to write complex code to process and aggregate the data in their applications. As a result, we decided to turn Spanner into a full-featured SQL system, with query execution tightly integrated with the other architectural features of Spanner (such as strong consistency and global replication). Spanner's SQL interface borrows ideas from the F1 [9] system, which was built to manage Google's AdWords data and included a federated query processor that could access Spanner and other data sources.

Today, Spanner is widely used as an OLTP database management system for structured data at Google, and is publicly available in beta as Cloud Spanner (http://cloud.google.com/spanner) on the Google Cloud Platform (GCP). Currently, over 5,000 databases run in our production instances, and are used by teams across many parts of Google and its parent company Alphabet. This data is the "source of truth" for a variety of mission-critical Google databases, including AdWords. One of our large users is the Google Play platform, which executes SQL queries to manage customer purchases and accounts. Spanner serves tens of millions of QPS across all of its databases, managing hundreds of petabytes of data. Replicas of the data are served from datacenters around the world to provide low latency to scattered clients. Despite this wide replication, the system provides transactional consistency and strongly consistent replicas, as well as high availability. The database features of Spanner, operating at this massive scale, make it an attractive platform for new development as well as migration of applications from existing data stores, especially for "big" customers with lots of data and large workloads. Even "small" customers benefit from the robust database features, strong consistency, and reliability of the system. "Small" customers often become "big".

The Spanner query processor implements a dialect of SQL, called Standard SQL, that is shared by several query subsystems within Google (e.g., the Dremel/BigQuery OLAP systems). Standard SQL is based on standard ANSI SQL, fully using standard features such as ARRAY and row type (called STRUCT) to support nested data as a first-class citizen (see Section 6 for more details).
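As a rough illustration only (the schema, table, and field names below are hypothetical and not taken from the paper), the following Python sketch shows the kind of result shape that an ARRAY of STRUCT values makes possible: each row carries its child records as a first-class nested value rather than as flattened join output that the client must reassemble.

    # Illustrative sketch (hypothetical schema): a row whose "items" column
    # behaves like ARRAY<STRUCT<product STRING, quantity INT64>>.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class LineItem:
        """Plays the role of a STRUCT value."""
        product: str
        quantity: int

    @dataclass
    class CustomerRow:
        """One result row; 'items' plays the role of an ARRAY<STRUCT<...>> column."""
        customer_id: int
        name: str
        items: List[LineItem]

    row = CustomerRow(
        customer_id=42,
        name="example customer",
        items=[LineItem("widget", 3), LineItem("gadget", 1)],
    )

    # The nested column can be consumed directly, with no client-side
    # re-grouping of flattened join rows.
    total_quantity = sum(item.quantity for item in row.items)
    print(total_quantity)  # prints 4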
Like the lower storage and transactional layers, Spanner's query processor is built to serve a mix of transactional and analytical workloads, and to support both low-latency and long-running queries. Given the distributed nature of data storage in Spanner, the query processor is itself distributed and uses standard optimization techniques such as shipping code close to data, parallel processing of parts of a single request on multiple machines, and partition pruning. While these techniques are not novel, there are certain aspects that are unique to Spanner and are the focus of this paper. In particular:

• We examine techniques we have developed for distributed query execution, including compilation and execution of joins, and consuming query results from parallel workers (Section 3).

• We describe range extraction, which is how we decide which Spanner servers should process a query, and how to minimize the row ranges scanned and locked by each server (Section 4).

• We discuss how we handle server-side failures with query restarts (Section 5). Unlike most distributed query processors, Spanner fully hides transient failures during query execution. This choice results in simpler applications by allowing clients to omit sometimes complex and error-prone retry loops. Moreover, transparent restarts enable rolling upgrades of the server software without disrupting latency-sensitive workloads.

2. BACKGROUND
Rows of an interleaved child table whose keys have the prefix 'K' are stored physically next to the parent row whose key is 'K'. Shard boundaries are specified as ranges of key prefixes and preserve row co-location of interleaved child tables.
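To make the co-location property concrete, here is a minimal sketch (hypothetical Customers/Orders tables; this is not Spanner's storage code) of how interleaving child rows under a parent key prefix keeps them physically adjacent in a key-ordered store, and how a shard boundary expressed as a key-prefix range keeps a parent and its children on the same shard.

    # Illustrative sketch (hypothetical tables): keys are tuples, sorted
    # lexicographically, so child rows sharing a parent-key prefix sort
    # immediately after their parent row.
    from typing import Dict, Tuple

    # Key layout: ("Customers", customer_id) for parent rows,
    #             ("Customers", customer_id, "Orders", order_id) for interleaved children.
    rows: Dict[Tuple, str] = {
        ("Customers", 1): "customer #1",
        ("Customers", 1, "Orders", 100): "order 100 of customer #1",
        ("Customers", 1, "Orders", 101): "order 101 of customer #1",
        ("Customers", 2): "customer #2",
        ("Customers", 2, "Orders", 200): "order 200 of customer #2",
    }

    # In key order, each parent row is immediately followed by its interleaved children.
    for key in sorted(rows):
        print(key, "->", rows[key])

    # A shard boundary specified as a range of key prefixes, here
    # [("Customers", 1), ("Customers", 2)), keeps customer #1 and all of its
    # interleaved orders on the same shard.
    shard_lo, shard_hi = ("Customers", 1), ("Customers", 2)
    shard = [k for k in sorted(rows) if shard_lo <= k < shard_hi]
    print(shard)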
Spanner's transactions use a replicated write-ahead redo log, and the Paxos [6] consensus algorithm is used to get replicas to agree on the contents of each log entry. In particular, each shard of the database is assigned to exactly one Paxos group (replicated state machine). A given group may be assigned multiple shards. All transactions that involve data in a particular group write to a logical Paxos write-ahead log, which means each log entry is committed by successfully replicating it to a quorum of replicas. Our implementation uses a form of Multi-Paxos, in which a single long-lived leader is elected and can commit multiple log entries in parallel, to achieve high throughput. Because Paxos can make progress as long as a majority of replicas are up, we achieve high availability despite server, network, and data center failures [3].

We only replicate log records: each group replica receives log records via Paxos and applies them independently to its copy of the group's data. Non-leader replicas may be temporarily behind the leader. Replicas apply Paxos log entries in order, so that they have always applied a consistent prefix of the updates to data in that group.

Our concurrency control uses a combination of pessimistic locking and timestamps. For blind write and read-modify-write transactions, strict two-phase locking ensures serializability within a given Paxos group, while two-phase commits (where different Paxos groups are participants) ensure serializability across the database. Each committed transaction is assigned a timestamp, and at any timestamp T there is a single correct database snapshot (reflecting exactly the set of transactions whose commit timestamp is ≤ T). Reads can be done in lock-free snapshot transactions, and all data returned from all reads in the same snapshot transaction comes from a consistent snapshot of the database at a specified timestamp. Stale reads choose a timestamp in the past to increase the chance that the nearby replica is caught up enough to serve the read.
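The timestamp-based visibility rule can be summarized in a few lines. The sketch below is illustrative only (it is not Spanner's implementation, and the keys, values, and timestamps are made up): it keeps multiple versions per key and answers a snapshot read at timestamp T by returning, for each key, the latest version whose commit timestamp is ≤ T. A stale read is simply a snapshot read at a deliberately older timestamp.

    # Illustrative multi-version store: each key maps to a list of
    # (commit_timestamp, value) pairs, ordered by commit timestamp.
    from bisect import bisect_right
    from typing import Dict, List, Optional, Tuple

    versions: Dict[str, List[Tuple[int, str]]] = {
        "k1": [(10, "a"), (25, "b")],
        "k2": [(20, "x")],
    }

    def snapshot_read(key: str, read_ts: int) -> Optional[str]:
        """Return the latest value whose commit timestamp is <= read_ts."""
        history = versions.get(key, [])
        # Count how many versions have commit_ts <= read_ts.
        idx = bisect_right([ts for ts, _ in history], read_ts)
        return history[idx - 1][1] if idx > 0 else None

    # All reads at the same timestamp observe one consistent snapshot:
    print(snapshot_read("k1", 20), snapshot_read("k2", 20))  # a x
    # A later snapshot also sees the transaction committed at timestamp 25:
    print(snapshot_read("k1", 30), snapshot_read("k2", 30))  # b x
    # A "stale" read just picks a timestamp in the past:
    print(snapshot_read("k1", 15), snapshot_read("k2", 15))  # a None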