Scalable Querying of Nested Data


Jaclyn Smith†, Michael Benedikt†, Milos Nikolic‡, Amir Shaikhha†
†University of Oxford, UK   ‡University of Edinburgh, UK
[email protected]   [email protected]

arXiv:2011.06381v1 [cs.DB] 12 Nov 2020

ABSTRACT

While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.

In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.

PVLDB Reference Format:
Jaclyn Smith, Michael Benedikt, Milos Nikolic, and Amir Shaikhha. Scalable Querying of Nested Data. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. xxx
ISSN 2150-8097.

1. INTRODUCTION

Large-scale, distributed data processing platforms such as Spark [59], Flink [9], and Hadoop [21] have become indispensable tools for modern data analysis. Their wide adoption stems from powerful functional-style APIs that allow programmers to express complex analytical tasks while abstracting distributed resources and data parallelism. These systems use an underlying data model that allows for data to be described as a collection of tuples whose values may themselves be collections. This nested data representation arises naturally in many domains, such as web, biomedicine, and business intelligence. The widespread use of nested data has also contributed to the rise in NoSQL databases [4, 5, 42], where the nested data model is central. Thus, it comes as no surprise that nested data accounts for most large-scale, structured data processing at major web companies [40].

Despite natively supporting nested data, existing systems fail to harness the full potential of processing nested collections at scale. One implementation difficulty is the discrepancy between the tuple-at-a-time processing of local programs and the bulk processing used in distributed settings. Though the APIs are similar, local programs that use Scala collections often require non-trivial and error-prone translations to distributed Spark programs. Beyond programming difficulties, nested data processing requires scaling in the presence of large or skewed inner collections. We next elaborate on these difficulties by means of an example.

Example 1. Consider a variant of the TPC-H database containing a flat relation Part with information about parts and a nested relation COP with information about customers, their orders, and the parts purchased in the orders. The relation COP stores customer names, order dates, part IDs, and purchased quantities per order. The type of COP is:

  Bag(⟨ cname : string, corders : Bag(⟨ odate : date,
    oparts : Bag(⟨ pid : int, qty : real ⟩) ⟩) ⟩)

Consider now a high-level collection program that returns, for each customer and for each of their orders, the total amount spent per part name. This requires joining COP with Part on pid to get the name and price of each part and then summing up the costs per part name. We can express this using nested relational calculus [11, 56] as:

  for cop in COP union
    {⟨ cname := cop.cname,
       corders :=
         for co in cop.corders union
           {⟨ odate := co.odate,
              oparts := sumBy^total_pname(
                for op in co.oparts union
                  for p in Part union
                    if op.pid == p.pid then
                      {⟨ pname := p.pname, total := op.qty * p.price ⟩}) ⟩} ⟩}

The program navigates to oparts in COP, joins each bag with Part, computes the amount spent per part, and aggregates using sumBy, which sums up the total amount spent for each distinct part name. The output type is:

  Bag(⟨ cname : string, corders : Bag(⟨ odate : date,
    oparts : Bag(⟨ pname : string, total : real ⟩) ⟩) ⟩)

A key motivation for our work is analytics pipelines, which typically consist of a sequence of transformations with the output of one transformation contributing to the input of the next, as above. Although the final output consumed by an end user or application may be flat, the intermediate nested outputs may still be important in themselves: either because the intermediate outputs are used in multiple follow-up transformations, or because the pipeline is expanded and modified as the data is explored. Our interest is thus in programs consisting of sequences of queries, where both inputs and outputs may be nested.

We next discuss the challenges associated with processing such examples using distributed frameworks.

Challenge 1: Programming mismatch. The translation of collection programs from local to distributed settings is not always straightforward. Distributed processing systems natively distribute collections only at the granularity of top-level tuples and provide no direct support for nested loops over different distributed collections. To address these limitations, programmers resort to techniques that are either prohibitively expensive or error-prone. In our example, a common error is to try to iterate over Part, which is a collection distributed across worker nodes, from within the map function over COP that navigates to oparts, which are collections local to one worker node. Since Part is distributed across worker nodes, it cannot be referenced inside another distributed collection. Faced with this impossibility, programmers could either replicate Part to each worker node, which can be too expensive, or rewrite the program to use an explicit join operator, which requires flattening COP to bring the pid values needed for the join to the top level. A natural way to flatten COP is by pairing each cname with every tuple from corders and subsequently with every tuple from oparts; however, the resulting (cname, odate, pid, qty) tuples encode incomplete information (e.g., they omit customers with no orders), which can eventually cause an incorrect result. Manually rewriting such a computation while preserving its correctness is non-trivial.

Challenge 2: Input and output data representation. Default top-level partitioning of nested data can lead to poor scalability, particularly with few top-level tuples and/or large inner collections. The number of top-level tuples limits the level of parallelism by requiring all data on lower levels to persist on the same node, which is detrimental regardless of the size of these inner collections. Conversely, large inner collections can overload worker nodes and increase data transfer costs, even with larger numbers of top-level tuples. Exploration of succinctness and efficiency for different representations of nested output has not, to our knowledge, been the topic of significant study.

Challenge 3: Data skew. Last but not least, data skew can significantly degrade the performance of large-scale data processing, even when dealing with flat data only. Skew can lead to load imbalance where some workers perform significantly more work than others, prolonging run times on platforms with synchronous execution such as Spark, Flink, and Hadoop. The presence of nested data only exacerbates the problem of load imbalance: inner collections may have skewed cardinalities (e.g., very few customers can have very many orders), and the standard top-level partitioning places each inner collection entirely on a single worker node. Moreover, programmers must ensure that inner collections do not grow large enough to cause disk spill or crash worker nodes due to insufficient memory. □

Prior work has addressed some of these challenges. High-level scripting languages such as Pig Latin [43] and Scope [13] ease the programming of distributed data processing systems. Apache Hive [6], F1 [48, 46], Dremel [40], and BigQuery [47] provide SQL-like languages with extensions for querying nested structured data at scale. Emma [3] and DIQL [27] support for-comprehensions and SQL-like syntax, respectively, via DSLs deeply embedded in Scala, and target several distributed processing frameworks. But alleviating the performance problems caused by skew and by manual flattening of nested collections remains an open problem.

Our approach. To address these challenges and achieve scalable processing of nested data, we advocate an approach that relies on three key aspects:

Query compilation. When designing programs over nested data, programmers often first conceptualize the desired transformations using high-level languages such as nested relational calculus [11, 56], monad comprehension calculus [26], or the collection constructs of functional programming languages (e.g., the Scala collections API). These languages lend themselves naturally to centralized execution, thus making them the weapon of choice for rapid prototyping and testing over small datasets. To relieve programmers from manually rewriting programs for scalable execution (Challenge 1), we propose a compilation framework that automatically restructures programs for distributed settings. In this process, the framework applies optimizations that are often overlooked by programmers, such as column pruning, selection pushdown, and pushing nesting and aggregation operators past joins.
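The transformation of Example 1 can be sketched in plain Python over in-memory nested structures (not the paper's Scala/Spark setting). The relation and field names follow the paper; the sample data values and the helper name sum_by_pname are invented for illustration:

```python
# A sketch of Example 1: for each customer and order, total spent per part name.
# COP and Part follow the paper's schemas; the sample rows are invented.
from collections import defaultdict

part = [
    {"pid": 1, "pname": "wheel", "price": 10.0},
    {"pid": 2, "pname": "gear",  "price": 4.0},
]
cop = [
    {"cname": "Alice", "corders": [
        {"odate": "2019-01-03", "oparts": [
            {"pid": 1, "qty": 2.0}, {"pid": 2, "qty": 5.0}, {"pid": 1, "qty": 1.0},
        ]},
    ]},
    {"cname": "Bob", "corders": []},  # a customer with no orders is preserved
]

# Index Part by pid, standing in for the join on op.pid == p.pid.
price_name = {p["pid"]: (p["pname"], p["price"]) for p in part}

def sum_by_pname(oparts):
    # Plays the role of sumBy: sum op.qty * p.price per distinct part name.
    totals = defaultdict(float)
    for op in oparts:
        if op["pid"] in price_name:
            pname, price = price_name[op["pid"]]
            totals[pname] += op["qty"] * price
    return [{"pname": n, "total": t} for n, t in totals.items()]

result = [
    {"cname": c["cname"],
     "corders": [{"odate": o["odate"], "oparts": sum_by_pname(o["oparts"])}
                 for o in c["corders"]]}
    for c in cop
]
```

Note that the output keeps the input's nesting shape: Bob appears with an empty corders bag, exactly the information a naive flattening would lose.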
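The information loss from flattening described under Challenge 1 is easy to reproduce. A minimal Python sketch (sample data invented; the nested comprehension stands in for the flatten step a programmer would write before an explicit join):

```python
# Challenge 1, illustrated: flattening COP into (cname, odate, pid, qty)
# tuples silently drops customers whose inner collections are empty.
cop = [
    {"cname": "Alice", "corders": [
        {"odate": "2019-01-03", "oparts": [{"pid": 1, "qty": 2.0}]},
    ]},
    {"cname": "Bob", "corders": []},  # no orders
]

flat = [
    (c["cname"], o["odate"], op["pid"], op["qty"])
    for c in cop
    for o in c["corders"]   # yields nothing for Bob
    for op in o["oparts"]
]

customers_after_flattening = {t[0] for t in flat}
# Bob has vanished: any result re-nested from `flat` omits him,
# which is the "incomplete information" pitfall in the text.
```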
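The load-imbalance effect of Challenge 3 can be mimicked without a cluster. In this sketch the round-robin partitioner, the worker count, and the data are all invented; the point is only that top-level partitioning keeps each inner bag whole on one worker, so one heavy customer skews the load:

```python
# Challenge 3, illustrated: top-level partitioning of a nested collection
# with one skewed inner bag. Partitioner and sizes are invented.
cop = [{"cname": f"c{i}", "corders": [{"odate": "2019-01-01", "oparts": []}]}
       for i in range(4)]
cop.append({"cname": "heavy",
            "corders": [{"odate": "2019-01-01", "oparts": []}] * 10_000})

workers = 5
load = [0] * workers
for i, c in enumerate(cop):                 # round-robin over top-level tuples
    load[i % workers] += len(c["corders"])  # the inner bag stays on one worker

# Four workers hold 1 order each; one worker holds 10,000.
```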
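The "shredded queries" mentioned in the abstract operate over a shredded data representation, in which each nesting level becomes a flat collection linked to its parent by labels. The following is only a schematic Python sketch of that idea; the label scheme (tuple indices) is invented here and is not the paper's actual encoding:

```python
# A rough sketch of shredding: one flat collection per nesting level,
# with synthetic labels linking parent tuples to their inner bags.
cop = [
    {"cname": "Alice", "corders": [
        {"odate": "2019-01-03", "oparts": [{"pid": 1, "qty": 2.0}]},
    ]},
    {"cname": "Bob", "corders": []},
]

cop_top, cop_orders, cop_parts = [], [], []
for ci, c in enumerate(cop):
    cop_top.append({"cname": c["cname"], "corders_label": ci})
    for oi, o in enumerate(c["corders"]):
        label = (ci, oi)
        cop_orders.append({"corders_label": ci, "odate": o["odate"],
                           "oparts_label": label})
        for op in o["oparts"]:
            cop_parts.append({"oparts_label": label, **op})

# Each flat collection can now be partitioned independently, so a large
# inner bag is no longer pinned to its top-level tuple's worker, and
# Bob survives as a top-level tuple with no matching cop_orders rows.
```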