Automatically Detecting and Fixing Concurrency Bugs in Go Software Systems

Automatically Detecting and Fixing Concurrency Bugs in Go Software Systems Ziheng Liu Shuofei Zhu Boqin Qin Pennsylvania State University Pennsylvania State University Beijing Univ. of Posts and Telecom. USA USA China Hao Chen Linhai Song University of California, Davis Pennsylvania State University USA USA ABSTRACT Systems. In Proceedings of the 26th ACM International Conference on Archi- Go is a statically typed programming language designed for efficient tectural Support for Programming Languages and Operating Systems (ASPLOS and reliable concurrent programming. For this purpose, Go provides ’21), April 19–23, 2021, Virtual, USA. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3445814.3446756 lightweight goroutines and recommends passing messages using channels as a less error-prone means of thread communication. Go has become increasingly popular in recent years and has been adopted to build many important infrastructure software systems. 1 INTRODUCTION However, a recent empirical study shows that concurrency bugs, Go is a statically typed programming language designed by Google especially those due to misuse of channels, exist widely in Go. These in 2009 [13]. In recent years, Go has gained increasing popularity in bugs severely hurt the reliability of Go concurrent systems. building software in production environments. These Go programs To fight Go concurrency bugs caused by misuse of channels, range from libraries [5] and command-line tools [1, 6] to systems this paper proposes a static concurrency bug detection system, software, including container systems [12, 22], databases [3, 8], and GCatch, and an automated concurrency bug fixing system, GFix. blockchain systems [16]. After disentangling an input Go program, GCatch models the com- One major design goal of Go is to provide an efficient and safe plex channel operations in Go using a novel constraint system and way for developers to write concurrent programs [14]. To achieve applies a constraint solver to identify blocking bugs. GFix auto- this purpose, Go provides easily created lightweight threads (called matically patches blocking bugs detected by GCatch using Go’s goroutines); it also advocates the use of channels to explicitly channel-related language features. We apply GCatch and GFix to 21 pass messages across goroutines, on the assumption that message- popular Go applications, including Docker, Kubernetes, and gRPC. passing concurrency is less error-prone than shared-memory con- In total, GCatch finds 149 previously unknown blocking bugs due currency supported by traditional programming languages [24, 31, to misuse of channels and GFix successfully fixes 124 of them. We 69]. In addition, Go also provides several unique primitives and have reported all detected bugs and generated patches to develop- libraries for concurrent programming. ers. So far, developers have fixed 125 blocking misuse-of-channel Unfortunately, Go programs still contain many concurrency bugs based on our reporting. Among them, 87 bugs are fixed by bugs [77], the type of bugs that are most difficult to debug [30, 53] applying GFix’s patches directly. and severely hurt the reliability of multi-threaded software sys- CCS CONCEPTS tems [15, 78]. Moreover, a recent empirical study [87] shows that message passing is just as error-prone as shared memory, and that • Software and its engineering ! Software testing and de- misuse of channels is even more likely to cause blocking bugs (e.g., bugging; Software reliability. deadlock) than misuse of mutexes. A previously unknown concurrency bug in Docker is shown KEYWORDS in Figure1. Function Exec() creates a child goroutine at line 5 Go; Concurrency Bugs; Bug Detection; Bug Fixing; Static Analysis to duplicate the content of a.Reader. After the duplication, the ACM Reference Format: child goroutine sends err to the parent goroutine through channel Ziheng Liu, Shuofei Zhu, Boqin Qin, Hao Chen, and Linhai Song. 2021. outDone to notify the parent about completion and any possible er- Automatically Detecting and Fixing Concurrency Bugs in Go Software ror (line 7). Since outDone is an unbuffered channel (line 3), the child blocks at line 7 until the parent receives from outDone. Meanwhile, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed the parent blocks at the select at line 9 until it either receives err for profit or commercial advantage and that copies bear this notice and the full citation from the child (line 10) or receives a message from ctx.Done() (line on the first page. Copyrights for components of this work owned by others than ACM 13), indicating the entire task can be halted. If the message from must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a ctx.Done() arrives earlier, or if the two messages arrive concur- fee. Request permissions from [email protected]. rently and Go’s runtime non-deterministically chooses the second ASPLOS ’21, April 19–23, 2021, Virtual, USA case to execute, the parent will return from function Exec(). No © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8317-2/21/04...$15.00 other goroutine can pull messages from outDone, leaving the child https://doi.org/10.1145/3445814.3446756 goroutine permanently blocked at line 7. 616 ASPLOS ’21, April 19–23, 2021, Virtual, USA Ziheng Liu, Shuofei Zhu, Boqin Qin, Hao Chen, and Linhai Song GCatch GFix 1 func Exec(ctx context.Context, ...) (ExResult, error){ BMOC Patcher-1 Bugs 2 var oBuf, eBuf bytes.Buffer Go Source BMOC Patched Patcher-2 Code Detector Code 3 - outDone := make(chan error) Dispatcher Patcher-3 4 + outDone := make(chan error, 1) Traditional 5 go func() { Detectors All Detected Bugs 6 _, err= StdCopy(&oBuf,&eBuf,a.Reader) 7 outDone <- err// block Figure 2: The workflow of GCatch and GFix. 8 }() relationship between synchronization primitives of an input pro- 9 select { gram, and leverages that relationship to disentangle the primitives 10 case err := <-outDone: 11 if err != nil { into small groups. GCatch inspects each group only in a small pro- 12 return ExecResult{}, err} gram scope. To detect BMOC bugs, GCatch enumerates execution 13 case <-ctx.Done(): paths for all goroutines executed in a small program scope, uses 14 return ExecResult{}, ctx.Err() constraints to precisely describe how synchronization operations 15 } 16 return ExResult{oBuf:&oBuf, eBuf:&eBuf}, nil proceed and block, and invokes Z3 [33] to search for possible exe- 17 } cutions that would lead some synchronization operations to block forever (thereby detecting blocking bugs). Figure 1: A previously unknown Docker bug and its patch. The key challenge of building GCatch lies in modeling channel operations using constraints. Since a channel’s behavior depends This bug demonstrates the complexity of Go’s concurrency fea- on its states (e.g., the number of elements in the channel, closed tures. Programmers have to have a good understanding of when or not), channel operations are much more complex to model than a channel operation blocks and how select waits for multiple the synchronization operations (e.g., locking/unlocking) already channel operations. Otherwise, it is easy for them to make similar covered in existing constraint systems [45, 48, 57, 90]. We model mistakes when programming Go. Since Go is being widely adopted, channel operations by associating each channel with state variables it is increasingly urgent to fight concurrency bugs in Go, especially and defining how to update state variables when a channel opera- those caused by misuse of channels, since Go advocates using chan- tion proceeds. Our constraint system is the first to consider states nels for thread communication and many developers choose Go of synchronization primitives. It is very different from existing con- because of its good support for channels [72, 86]. straint systems and models used in the previous model checking However, existing techniques cannot effectively detect channel- techniques [39, 59, 60, 70, 80]. related concurrency bugs in large Go software systems. First, con- A BMOC bug will continue to hurt the system’s reliability until currency bug detection techniques designed for classic program- it is fixed. Thus, we further design an automated concurrency bug ming languages [49, 58, 67, 68, 75, 82] mainly focus on analyz- fixing system (GFix) to patch BMOC bugs detected by GCatch (Fig- ing shared-memory accesses or shared-memory primitives. Thus, ure2). GFix first conducts static analysis to categorize input BMOC they cannot detect bugs caused by misuse of channels in Go. Sec- bugs into three groups. It then automatically increases channel ond, since message-passing operations in MPI are different from buffer sizes or uses keyword “defer” or “select” to change block- those in Go, techniques designed for detecting deadlocks in MPI ing channel operations to be non-blocking and fix bugs in each programs [38, 41, 42, 83, 88] cannot be applied to Go programs. group. One challenge of automated bug fixing is generating read- Third, the three concurrency bug detectors released by the Go able patches that will be more readily accepted by developers. GFix team [9, 11, 19] cover only limited buggy code patterns and cannot synthesizes patches using Go’s channel-related language features, identify the majority of Go concurrency bugs in the real world [87]. which are powerful and already frequently used by programmers. Fourth, although recent techniques can use model checking to Thus, GFix’s patches only change a few lines of code and align identify blocking bugs in Go [39, 59, 60, 70, 80], those techniques with programmers’ usual practice. They are easily validated and analyze each input program and all its synchronization primitives accepted by developers. as a whole. Due to the exponential complexity of model check- The bug and its patch shown in Figure1 confirm the effectiveness ing, those techniques can handle only small programs with a few of GCatch and GFix.

Automatically Detecting and Fixing Concurrency Bugs in Go Software Systems

Towards a Library for Deterministic Failure Testing of Distributed Systems

ACID Compliant Distributed Key-Value Store

Go in Tidb Yao Wei | Pingcap About Me

Grpc - a Solution for Rpcs by Google Distributed Systems Seminar at Charles University in Prague, Nov 2016 Jan Tattermusch - Grpc Software Engineer

Database Software Market: Billy Fitzsimmons +1 312 364 5112

Implementing and Testing Cockroachdb on Kubernetes

Architecting Cloud-Native NET Apps for Azure (2020).Pdf

Using Rust to Build a Distributed Transactional Key-Value Database Liutang | [email protected] About Me

How Financial Service Companies Can Successfully Migrate Critical

Towards a Library for Deterministic Failure Testing of Distributed Systems

Degree Project: Bachelor of Science in Computing Science

Scalable but Wasteful: Current State of Replication in the Cloud