Is Parallel Programming Hard, And, If So, What Can You Do About It?
Edited by: Paul E. McKenney
Linux Technology Center
IBM Beaverton
[email protected]

August 8, 2012

Legal Statement

This work represents the views of the authors and does not necessarily represent the view of their employers.

IBM, zSeries, and PowerPC are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. Linux is a registered trademark of Linus Torvalds. i386 is a trademark of Intel Corporation or its subsidiaries in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of such companies.

The non-source-code text and images in this document are provided under the terms of the Creative Commons Attribution-Share Alike 3.0 United States license (http://creativecommons.org/licenses/by-sa/3.0/us/). In brief, you may use the contents of this document for any purpose, personal, commercial, or otherwise, so long as attribution to the authors is maintained. Likewise, the document may be modified, and derivative works and translations made available, so long as such modifications and derivations are offered to the public on equal terms as the non-source-code text and images in the original document.

Source code is covered by various versions of the GPL (http://www.gnu.org/licenses/gpl-2.0.html). Some of this code is GPLv2-only, as it derives from the Linux kernel, while other code is GPLv2-or-later. See the CodeSamples directory in the git archive (git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git) for the exact licenses, which are included in comment headers in each file. If you are unsure of the license for a given code fragment, you should assume GPLv2-only.

Combined work © 2005-2011 by Paul E. McKenney.

Contents

1 Introduction  1
  1.1 Historic Parallel Programming Difficulties  1
  1.2 Parallel Programming Goals  3
    1.2.1 Performance  3
    1.2.2 Productivity  4
    1.2.3 Generality  6
  1.3 Alternatives to Parallel Programming  8
    1.3.1 Multiple Instances of a Sequential Application  8
    1.3.2 Make Use of Existing Parallel Software  8
    1.3.3 Performance Optimization  9
  1.4 What Makes Parallel Programming Hard?  9
    1.4.1 Work Partitioning  10
    1.4.2 Parallel Access Control  11
    1.4.3 Resource Partitioning and Replication  11
    1.4.4 Interacting With Hardware  12
    1.4.5 Composite Capabilities  12
    1.4.6 How Do Languages and Environments Assist With These Tasks?  13
  1.5 Guide to This Book  13
    1.5.1 Quick Quizzes  13
    1.5.2 Sample Source Code  13

2 Hardware and its Habits  15
  2.1 Overview  15
    2.1.1 Pipelined CPUs  15
    2.1.2 Memory References  16
    2.1.3 Atomic Operations  17
    2.1.4 Memory Barriers  18
    2.1.5 Cache Misses  18
    2.1.6 I/O Operations  19
  2.2 Overheads  19
    2.2.1 Hardware System Architecture  20
    2.2.2 Costs of Operations  21
  2.3 Hardware Free Lunch?  23
    2.3.1 3D Integration  24
    2.3.2 Novel Materials and Processes  25
    2.3.3 Special-Purpose Accelerators  26
    2.3.4 Existing Parallel Software  26
  2.4 Software Design Implications  27

3 Tools of the Trade  29
  3.1 Scripting Languages  29
  3.2 POSIX Multiprocessing  30
    3.2.1 POSIX Process Creation and Destruction  30
    3.2.2 POSIX Thread Creation and Destruction  32
    3.2.3 POSIX Locking  33
    3.2.4 POSIX Reader-Writer Locking  36
  3.3 Atomic Operations  39
  3.4 Linux-Kernel Equivalents to POSIX Operations  40
  3.5 The Right Tool for the Job: How to Choose?  40

4 Counting  43
  4.1 Why Isn't Concurrent Counting Trivial?  44
  4.2 Statistical Counters  46
    4.2.1 Design  46
    4.2.2 Array-Based Implementation  46
    4.2.3 Eventually Consistent Implementation  47
    4.2.4 Per-Thread-Variable-Based Implementation  49
    4.2.5 Discussion  51
  4.3 Approximate Limit Counters  51
    4.3.1 Design  52
    4.3.2 Simple Limit Counter Implementation  52
    4.3.3 Simple Limit Counter Discussion  57
    4.3.4 Approximate Limit Counter Implementation  57
    4.3.5 Approximate Limit Counter Discussion  58
  4.4 Exact Limit Counters  58
    4.4.1 Atomic Limit Counter Implementation  58
    4.4.2 Atomic Limit Counter Discussion  64
    4.4.3 Signal-Theft Limit Counter Design  64
    4.4.4 Signal-Theft Limit Counter Implementation  65
    4.4.5 Signal-Theft Limit Counter Discussion  70
  4.5 Applying Specialized Parallel Counters  71
  4.6 Parallel Counting Discussion  72

5 Partitioning and Synchronization Design  75
  5.1 Partitioning Exercises  75
    5.1.1 Dining Philosophers Problem  75
    5.1.2 Double-Ended Queue  79
    5.1.3 Partitioning Example Discussion  87
  5.2 Design Criteria  87
  5.3 Synchronization Granularity  89
    5.3.1 Sequential Program  89
    5.3.2 Code Locking  90
    5.3.3 Data Locking  91
    5.3.4 Data Ownership  95
    5.3.5 Locking Granularity and Performance  96
  5.4 Parallel Fastpath  98
    5.4.1 Reader/Writer Locking  99
    5.4.2 Hierarchical Locking  100
    5.4.3 Resource Allocator Caches  100
  5.5 Beyond Partitioning  107

6 Locking  109
  6.1 Staying Alive  110
    6.1.1 Deadlock  110
    6.1.2 Livelock and Starvation  118
    6.1.3 Unfairness  119
    6.1.4 Inefficiency  120
  6.2 Types of Locks  121
    6.2.1 Exclusive Locks  121
    6.2.2 Reader-Writer Locks  121
    6.2.3 Beyond Reader-Writer Locks  122
  6.3 Locking Implementation Issues  123
    6.3.1 Sample Exclusive-Locking Implementation Based on Atomic Exchange  124
    6.3.2 Other Exclusive-Locking Implementations  124
  6.4 Lock-Based Existence Guarantees  126
  6.5 Locking: Hero or Villain?  128
    6.5.1 Locking For Applications: Hero!  128
    6.5.2 Locking For Parallel Libraries: Just Another Tool  128
    6.5.3 Locking For Parallelizing Sequential Libraries: Villain!  132
    6.5.4 Partitioning Prohibited  132
    6.5.5 Deadlock-Prone Callbacks  133
    6.5.6 Object-Oriented Spaghetti Code  133
  6.6 Summary  134

7 Data Ownership  135
  7.1 Multiple Processes  135
  7.2 Partial Data Ownership and pthreads  136
  7.3 Function Shipping  136
  7.4 Designated Thread  136
  7.5 Privatization  137
  7.6 Other Uses of Data Ownership  137

8 Deferred Processing  139
  8.1 Reference Counting  139
    8.1.1 Implementation of Reference-Counting Categories  140
    8.1.2 Linux Primitives Supporting Reference Counting  145
    8.1.3 Counter Optimizations  146
  8.2 Sequence Locks  146
  8.3 Read-Copy Update (RCU)  149
    8.3.1 Introduction to RCU  149
    8.3.2 RCU Fundamentals  153
    8.3.3 RCU Usage  163
    8.3.4 RCU Linux-Kernel API  176
    8.3.5 "Toy" RCU Implementations  182
    8.3.6 RCU Exercises  201

9 Applying RCU  203
  9.1 RCU and Per-Thread-Variable-Based Statistical Counters  203
    9.1.1 Design  203
    9.1.2 Implementation  204
    9.1.3 Discussion  204
  9.2 RCU and Counters for Removable I/O Devices  206

10 Validation  207
  10.1 Where Do Bugs Come From?  207
  10.2 Required Mindset