Automatically Detecting and Fixing Concurrency Bugs in Go Software Systems

Ziheng Liu, Pennsylvania State University, USA
Shuofei Zhu, Pennsylvania State University, USA
Boqin Qin, Beijing Univ. of Posts and Telecom., China
Hao Chen, University of California, Davis, USA
Linhai Song, Pennsylvania State University, USA

ABSTRACT

Go is a statically typed programming language designed for efficient and reliable concurrent programming. For this purpose, Go provides lightweight goroutines and recommends passing messages using channels as a less error-prone means of thread communication. Go has become increasingly popular in recent years and has been adopted to build many important infrastructure software systems. However, a recent empirical study shows that concurrency bugs, especially those due to misuse of channels, exist widely in Go. These bugs severely hurt the reliability of Go concurrent systems.

To fight Go concurrency bugs caused by misuse of channels, this paper proposes a static concurrency bug detection system, GCatch, and an automated concurrency bug fixing system, GFix. After disentangling an input Go program, GCatch models the complex channel operations in Go using a novel constraint system and applies a constraint solver to identify blocking bugs. GFix automatically patches blocking bugs detected by GCatch using Go’s channel-related language features. We apply GCatch and GFix to 21 popular Go applications, including Docker, Kubernetes, and gRPC. In total, GCatch finds 149 previously unknown blocking bugs due to misuse of channels and GFix successfully fixes 124 of them. We have reported all detected bugs and generated patches to developers. So far, developers have fixed 125 blocking misuse-of-channel bugs based on our reporting. Among them, 87 bugs are fixed by applying GFix’s patches directly.

CCS CONCEPTS

• Software and its engineering → Software testing and debugging; Software reliability.

KEYWORDS

Go; Concurrency Bugs; Bug Detection; Bug Fixing; Static Analysis

ACM Reference Format:
Ziheng Liu, Shuofei Zhu, Boqin Qin, Hao Chen, and Linhai Song. 2021. Automatically Detecting and Fixing Concurrency Bugs in Go Software Systems. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3445814.3446756

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASPLOS ’21, April 19–23, 2021, Virtual, USA
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8317-2/21/04...$15.00
https://doi.org/10.1145/3445814.3446756

1 INTRODUCTION

Go is a statically typed programming language designed by Google in 2009 [13]. In recent years, Go has gained increasing popularity in building software in production environments. These Go programs range from libraries [5] and command-line tools [1, 6] to systems software, including container systems [12, 22], databases [3, 8], and blockchain systems [16].

One major design goal of Go is to provide an efficient and safe way for developers to write concurrent programs [14]. To achieve this purpose, Go provides easily created lightweight threads (called goroutines); it also advocates the use of channels to explicitly pass messages across goroutines, on the assumption that message-passing concurrency is less error-prone than shared-memory concurrency supported by traditional programming languages [24, 31, 69]. In addition, Go provides several unique primitives and libraries for concurrent programming.

Unfortunately, Go programs still contain many concurrency bugs [77], the type of bugs that are most difficult to debug [30, 53] and severely hurt the reliability of multi-threaded software systems [15, 78]. Moreover, a recent empirical study [87] shows that message passing is just as error-prone as shared memory, and that misuse of channels is even more likely to cause blocking bugs (e.g., deadlock) than misuse of mutexes.

A previously unknown concurrency bug in Docker is shown in Figure 1. Function Exec() creates a child goroutine at line 5 to duplicate the content of a.Reader. After the duplication, the child goroutine sends err to the parent goroutine through channel outDone to notify the parent about completion and any possible error (line 7). Since outDone is an unbuffered channel (line 3), the child blocks at line 7 until the parent receives from outDone. Meanwhile, the parent blocks at the select at line 9 until it either receives err from the child (line 10) or receives a message from ctx.Done() (line 13), indicating the entire task can be halted. If the message from ctx.Done() arrives earlier, or if the two messages arrive concurrently and Go’s runtime non-deterministically chooses the second case to execute, the parent will return from function Exec(). No other goroutine can pull messages from outDone, leaving the child goroutine permanently blocked at line 7.

     1    func Exec(ctx context.Context, ...) (ExecResult, error) {
     2      var oBuf, eBuf bytes.Buffer
     3  -   outDone := make(chan error)
     4  +   outDone := make(chan error, 1)
     5      go func() {
     6        _, err := StdCopy(&oBuf, &eBuf, a.Reader)
     7        outDone <- err // block
     8      }()
     9      select {
    10      case err := <-outDone:
    11        if err != nil {
    12          return ExecResult{}, err }
    13      case <-ctx.Done():
    14        return ExecResult{}, ctx.Err()
    15      }
    16      return ExecResult{oBuf: &oBuf, eBuf: &eBuf}, nil
    17    }

Figure 1: A previously unknown Docker bug and its patch.

This bug demonstrates the complexity of Go’s concurrency features. Programmers need a good understanding of when a channel operation blocks and how select waits for multiple channel operations. Otherwise, it is easy for them to make similar mistakes when programming in Go. As Go is being widely adopted, it is increasingly urgent to fight concurrency bugs in Go, especially those caused by misuse of channels, because Go advocates using channels for thread communication and many developers choose Go because of its good support for channels [72, 86].

However, existing techniques cannot effectively detect channel-related concurrency bugs in large Go software systems. First, concurrency bug detection techniques designed for classic programming languages [49, 58, 67, 68, 75, 82] mainly focus on analyzing shared-memory accesses or shared-memory primitives. Thus, they cannot detect bugs caused by misuse of channels in Go. Second, since message-passing operations in MPI are different from those in Go, techniques designed for detecting deadlocks in MPI programs [38, 41, 42, 83, 88] cannot be applied to Go programs. Third, the three concurrency bug detectors released by the Go team [9, 11, 19] cover only limited buggy code patterns and cannot identify the majority of Go concurrency bugs in the real world [87]. Fourth, although recent techniques can use model checking to identify blocking bugs in Go [39, 59, 60, 70, 80], those techniques analyze each input program and all its synchronization primitives as a whole. Due to the exponential complexity of model checking, those techniques can handle only small programs with a few [...]

[Figure 2: The workflow of GCatch and GFix. Labels in the diagram: Go Source Code; GCatch (BMOC Detector, Traditional Detectors); All Detected Bugs; BMOC Bugs; GFix (Dispatcher, Patcher-1, Patcher-2, Patcher-3); Patched Code.]

[...] relationship between synchronization primitives of an input program, and leverages that relationship to disentangle the primitives into small groups. GCatch inspects each group only in a small program scope. To detect BMOC (blocking misuse-of-channel) bugs, GCatch enumerates execution paths for all goroutines executed in a small program scope, uses constraints to precisely describe how synchronization operations proceed and block, and invokes Z3 [33] to search for possible executions that would lead some synchronization operations to block forever (thereby detecting blocking bugs).

The key challenge of building GCatch lies in modeling channel operations using constraints. Since a channel’s behavior depends on its states (e.g., the number of elements in the channel, closed or not), channel operations are much more complex to model than the synchronization operations (e.g., locking/unlocking) already covered in existing constraint systems [45, 48, 57, 90]. We model channel operations by associating each channel with state variables and defining how to update state variables when a channel operation proceeds. Our constraint system is the first to consider states of synchronization primitives. It is very different from existing constraint systems and models used in the previous model checking techniques [39, 59, 60, 70, 80].

A BMOC bug will continue to hurt the system’s reliability until it is fixed. Thus, we further design an automated concurrency bug fixing system (GFix) to patch BMOC bugs detected by GCatch (Figure 2). GFix first conducts static analysis to categorize input BMOC bugs into three groups. It then automatically increases channel buffer sizes or uses keyword “defer” or “select” to change blocking channel operations to be non-blocking and fix bugs in each group. One challenge of automated bug fixing is generating readable patches that will be more readily accepted by developers. GFix synthesizes patches using Go’s channel-related language features, which are powerful and already frequently used by programmers. Thus, GFix’s patches only change a few lines of code and align with programmers’ usual practice. They are easily validated and accepted by developers.

The bug and its patch shown in Figure 1 confirm the effectiveness of GCatch and GFix.
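
The blocking behavior described for Figure 1 can be reproduced in a small standalone program. The following is a minimal sketch, not the Docker code: the function names exec and childFinishes, the work channel standing in for StdCopy, and the 200 ms observation window are all assumptions made for illustration. The context is cancelled before the call so that the parent always takes the ctx.Done() case, which makes the outcome deterministic.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// exec mimics the shape of Exec() in Figure 1 (hypothetical stand-in).
// bufSize = 0 reproduces the bug; bufSize = 1 corresponds to the patch.
// The returned channel is closed once the child's send has completed.
func exec(ctx context.Context, bufSize int, work <-chan struct{}) (<-chan struct{}, error) {
	outDone := make(chan error, bufSize)
	sent := make(chan struct{})
	go func() {
		<-work         // stand-in for StdCopy: wait until the work is released
		outDone <- nil // blocks forever if unbuffered and the parent has returned
		close(sent)
	}()
	select {
	case err := <-outDone:
		return sent, err
	case <-ctx.Done():
		return sent, ctx.Err()
	}
}

// childFinishes reports whether the child goroutine gets past its send
// after the parent has already returned through the ctx.Done() case.
func childFinishes(bufSize int) bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // only ctx.Done() is ready, so the parent returns immediately
	work := make(chan struct{})
	sent, _ := exec(ctx, bufSize, work)
	close(work) // the child reaches its send only after the parent is gone
	select {
	case <-sent:
		return true
	case <-time.After(200 * time.Millisecond):
		return false // the child is still blocked on the send
	}
}

func main() {
	fmt.Println("unbuffered: child finished =", childFinishes(0))
	fmt.Println("buffered(1): child finished =", childFinishes(1))
}
```

With the unbuffered channel the child is expected to stay blocked and the first call reports false. With a buffer of one element the send completes without a receiver and the second call reports true, which is the effect of the one-line patch in Figure 1.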
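
Besides enlarging a buffer, the text mentions "defer" and "select" as language features for turning a blocking channel operation into one that cannot block forever. The sketch below shows one common Go idiom that combines the two. It is an assumption about what such a patch could look like, not a patch template taken from GFix, and the names worker, startAndAbandon, workerExits and the stop channel are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// worker tries to send one result. The select lets it give up on the send
// once stop is closed, so it cannot block forever on an abandoned channel.
func worker(results chan<- int, stop <-chan struct{}, exited chan<- struct{}) {
	defer close(exited)
	select {
	case results <- 42:
	case <-stop:
	}
}

// startAndAbandon starts a worker and returns without ever receiving from
// results. The deferred close(stop) runs on every return path and releases
// the worker. The returned channel is closed when the worker exits.
func startAndAbandon() <-chan struct{} {
	results := make(chan int) // unbuffered and never received from
	stop := make(chan struct{})
	exited := make(chan struct{})
	defer close(stop)
	go worker(results, stop, exited)
	return exited
}

// workerExits reports whether the worker terminates within one second.
func workerExits() bool {
	select {
	case <-startAndAbandon():
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("worker exited:", workerExits())
}
```

Without the stop case in the select, the worker would remain blocked on its send after startAndAbandon returns, which is the same leak pattern as in Figure 1.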