Tell: an Elastic Database System for Mixed Workloads

DISS. ETH NO. 24147 Tell: An Elastic Database System for Mixed Workloads A thesis submitted to attain the degree of DOCTOR OF SCIENCES of ETH ZURICH (Dr. sc. ETH Zurich) presented by MARKUS PILMAN Master of Science ETH in Computer Science, ETH Zurich born on 18. September 1983 citizen of Amriswil, Thurgau accepted on the recommendation of Prof. Dr. Donald Kossmann (ETH Zürich), examiner Dr. Philip A. Bernstein (Microsoft Research), co-examiner Prof. Dr. Peter Boncz (Vrije Universiteit Amsterdam), co-examiner Prof. Dr. Timothy Roscoe (ETH Zürich), co-examiner 2017 Abstract It is an exciting time to do database research. Two movements dominated the field for the last few years: Big Data and NoSQL. Both movements aroseoutof necessity, as cloud computing imposes new requirements on database systems. Cloud computing makes scalability and elasticity more important than ever. A user does not want to pay for computing and storage resources she does not use, but she expects to be able to get these resources as soon as they are needed. Traditional database management systems, however, are not able to meet these requirements. Early NoSQL systems provided elasticity and scalability by massively simplifying the provided consistency guarantees and the underlying data model. Most notably key value stores can scale to thousands of machines and allow resizing their cluster at runtime. However, their simplicity is also their greatest weakness: The lack of transactions makes it difficult to reason about concurrency, and the simple data model makes them difficult to use. Key value stores push most of their complexity into the application. As a result, more recent solutions try not only to add transactions, but they also implement complex operations in a layer above the underlying NoSQL storage. This layering is often referred to as SQL over NoSQL. Big Data, on the other hand, is about the analytical processing of massive amounts of data in the cloud. The Hadoop ecosystem and, more recently, Spark are the most prominent systems that play in this field. These systems allow for massive parallelization of complex analytical queries and are elastic and scalable. They achieve this by implementing a shared data architecture which decouples computing resources from storage resources. However, these Big Data platforms still have a problem: bringing the data from the online NoSQL (or SQL) database into Hadoop is a complex issue. Traditionally, this is solved like traditional data warehousing which is a heavy weight solution. 4 A system like Spark also can not simply use a key value store for its underlying storage, because current key value stores perform poorly when they have to deliver high volumes of data. This thesis introduces Tell, a distributed shared-data database management system that fills the gap between NoSQL and Big Data. Tell implements theSQL over NoSQL design principle: it performs transaction processing on top of a high- performance key-value store. At the same time, its key value store is heavily optimized for scan queries, allowing data processing engines to fetch their data directly from the online database. Zusammenfassung Big Data und NoSQL waren und sind Bewegungen die die Datenbankwelt in den letzten Jahren stark geprägt haben. Beide Bewegungen entstanden aus der Notwendigkeit neuen Anforderungen, die vor allem Cloud Computing an moderne Datenbanksysteme stellt, zu genügen. Die wichtigste Eigenschaft die ein System innerhalb der Cloud haben muss ist Elastizität. Ein Kunde will nicht für Ressourcen bezahlen die er nicht nutzt. Der Anbieter muss jedoch in der Lage sein, zusätzliche Ressourcen zur Verfügung zu stellen sobald sich die Anforderungen der Kunden ändern. Elastizität und Skalierbarkeit wurden von NoSQL Systemen zu beginn erreicht indem die Konsistenzgarantien und das unterstützte Datenmodel stark vereinfacht wurden. Dank diesen Vereinfachungen können, zum Beispiel, Key-Value Stores auf tausende von Maschinen skalieren und zur Laufzeit Machinen aus dem Cluster entfernen oder zum Cluster hinzufügen. Leider ist jedoch diese Vereinfachung sowohl eine Stärke als auch eine Schwäche: Die fehlenden Transaktionen und das stark vereinfachte Datenmodel erschweren die Benutzung dieser Systeme, da die Komplexität von der Datenbank in die App- likation verlagert wird. Aus diesen Gründen versuchen neuere Systeme Transak- tionen und komplexe Operationen eine Schicht oberhalb eines NoSQL storages zu implementieren. Dieser Ansatz wird häufig als SQL over NoSQL bezeichnet. Big Data ist ein anderer Trend der letzten Jahre. Wir produzieren immer mehr Daten und wollen mit analytischen Abfragen Informationen aus diesen Daten gewinnen. Hadoop, und seit ein paar wenigen Jahren Spark, sind die bekan- ntesten Produkte die aus diesem Trend hervor gingen. Diese Systeme erlauben mit massiver Parallelisierung komplexe Anfragen über grosse Datenmengen zu beantworten. Beide Systeme skalieren und sind elastisch. Dies wird erreicht indem die Recheneinheiten vom Storage getrennt werden. 6 Leider haben diese Systeme einen grossen Nachteil: die Daten die verarbeitet werden sollen befinden sich meistens in einem Datenbanksystem, entweder einer NoSQL oder einer traditionellen SQL Datenbank. Die Daten müssen dann kopiert werden damit sie von Spark oder Hadoop gelesen werden können. Leider können Spark und Hadoop die Daten nicht direkt aus dem Online-System lesen, da Key- Value Stores nicht optimiert sind um grosse Datenmengen zu lesen. Diese Dissertation stellt Tell vor, eine verteilte shared-data Datenbanklösung die die Kluft zwischen NoSQL und Big Data füllt. Tell arbeitet nach dem SQL over NoSQL Prinzip: es implementiert Transaktionen eine Schicht oberhalb eines schnellen Key-Value Stores. Der Key-Value Store ist zusätzlich in der Lage grosse Datenmengen direkt an analytische Systeme zu liefern. Dies erlaubt sowohl Echtzeit-Transaktionsverarbeitung und analytische Anfragen auf den selben Daten. Acknowledgments Dicebat Bernardus Carnotensis nos esse quasi nanos gigantum umeris insidentes, ut possimus plura eis et remotiora videre, non utique proprii visus acumine, aut eminentia corporis, sed quia in altum subvehimur et extollimur magnitudine gigantea (John of Salisbury: Metalogicon 3,4,46-50) Working on this thesis was a tremendous learning experience and this work would not have been possible without those who worked in the field before me and with me. I want to express my deepest gratitude towards Donald, who advised me not only during my doctorate but virtually throughout all my years at ETH. Donald granted me a lot of freedom, but he was always there when he was needed. I also want to thank Mothy for his support. Discussions with him were always insightful. Mothy’s knowledge within and outside of his field is impressive. I also want to thank him for his life-advice to my wife and me. Throughout my doctoral studies, I had the chance to meet and work together with brilliant people. I want to thank all the people in the Systems Group for interesting discussions and new insights. While working on a particular problem, it was always helpful to have people who presented me the problem from a different angle. At this point, a special thank goes to my office mates, Simon, Lucas, and Jonas. Sharing the workplace with them was always pleasant. During my studies, I did an internship at Microsoft Research under the supervi- sion of Phil Bernstein. I want to thank Phil who was an awesome adviser. Even though I sometimes felt dumb while talking to him, he was always patient and never condescending. During my short time at Microsoft, I also worked a lot with Sudipto Das, from whom I also learned a lot and who was a very pleasant person to work with. Microsoft Research is filled with brilliant people, and I am glad Ihad the chance to talk to many of them. Tell was quite a big project and I would not have been able to do it alone. A first thanks goes to Simon Loesing, who worked closely together with me on Tell for nearly three years. I had the rare luck to supervise only excellent master students during my doctorate. Thomas Etter and Kevin Bocksrocker did an enormous amount of the implementation work - during their master thesis and while working for the Systems Group as Software Engineers. Thanks goes to Daniel Bucher who did some important work on the Bd-Tree and Florian Köhl who did excellent work in the field of temporal query processing (which was an important side-project during my doctoral studies but is not part of the work presented in this thesis). I also want to thank Lucas and Renato who worked on Tell and TellStore. Lucas’ work on AIM (Analytics in Motion) influenced my work a lot, and he helped during the implementation and the experimentation. Last but not least, my biggest gratitude goes to my wife, Sonja. Her support was essential for my work as well as for my private life. Contents 1 Introduction . 15 1.1 Background and Motivation 15 1.2 Problem statement 17 1.3 Contributions 17 1.4 Overview of the thesis 18 2 The Shared Data Architecture . 19 2.1 Introduction 19 2.1.1 Finding the Right Architecture for the Right Job . 19 2.1.2 Requirements . 19 2.1.3 Contributions . 20 2.2 Shared Nothing vs Shared Disk 20 2.2.1 Shared Nothing . 20 2.2.2 Shared Disk . 22 2.2.3 Copying data . 23 2.3 Shared Data 24 2.4 Potential Bottlenecks in a Shared Data System 26 2.4.1 Network bottleneck . 27 2.4.2 CPU . 28 2.4.3 Memory Controller . 29 I TellStore - A Storage for mixed Workloads 3 Designing a Storage for Mixed Workloads . 33 3.1 Introduction 33 3.2 Requirements 33 3.3 Why is this difficult? 37 4 TellStore . 39 4.1 Features 40 4.1.1 In-Memory . 41 4.1.2 Shared Scans . 41 4.1.3 Predictability . 41 4.1.4 Lock- And Latch-Freeness . 41 4.1.5 Consistency . 42 4.1.6 Durability .

Tell: an Elastic Database System for Mixed Workloads

An Opinionated Guide to Technology Frontiers

Table of Contents

Learning Key-Value Store Design

Mergers in the Digital Economy

CURRICULUM FRAMEWORK and SYLLABUS for FIVE YEAR INTEGRATED M.Sc (DATA SCIENCE) DEGREE PROGRAMME in CHOICE BASED CREDIT SYSTEM

Mac Os X Database Application

Confidentiality and Perfomance for Cloud Databases

Mergers in the Digital Economy

Making Transactional Key-Value Stores Verifiably

Mergers in the Digital Economy

COSC349—Cloud Computing Architecture David Eyers Learning Objectives

Paper Title (Use Style: Paper Title)