Practical Structures for Parallel Operating Systems
Practical Structures for Parallel Operating Systems

Jan Edler

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science, New York University, May 1995.

Approved: Allan Gottlieb

© 1995 Jan Edler. Permission granted to copy in whole or in part, provided the title page and this copyright notice are retained.

Dedication

For my wife, Janet Simmons, who never gave up hope, and for our children, Cory, Riley, and Cally, whose interests I've tried to keep first.

Acknowledgments

I have had three employers during the course of this work: Bell Labs for the first 14 months, NEC Research Institute for the last 10 months, and New York University for more than 11 years in between. The bulk of this work was supported by the following grants and contracts: Department of Energy contract DE-AC02-76ER03077 and grant DE-FG02-88ER25052, International Business Machines joint study agreement N00039-84-R-0605(Q), and National Science Foundation grants DCR-8413359, DCR-8320085, and MIP-9303014. Views and opinions expressed in this dissertation are solely my own.

The work described here has been done as part of a large ongoing project, spanning more than a decade, and this dissertation couldn't be what it is without the involvement of many people. The following project participants have had particularly notable influence on this work: Allan Gottlieb and Mal Kalos led the project together from before my involvement in 1982 until 1989; Allan has led it alone since then. In addition, Allan has been my advisor and guide. Jim Lipkis was a very close collaborator until leaving NYU in 1989. Edith Schonberg, Jim, and I worked as a team on the early design and implementation of Symunix-2. Richard Kenner's contribution was widespread, but his talent for penetrating complex hardware and software problems was especially valuable.
David Wood and Eric Freudenthal ported Symunix-1 to the Ultra-3 prototype, and David also did lots of low-level work for Symunix-2. Eric Freudenthal contributed at many levels, but his algorithm design work deserves special attention. Other notable contributors at NYU included Wayne Berke, Albert Cahana, Susan Dickey, Isaac Dimitrovsky, Lori Grob, and Pat Teller. In addition to being valued collaborators on a professional level, all of these people have been good friends.

Another significant force affecting this work came from IBM's RP3 project [162]; especially influential were Greg Pfister, Alan Norton, and Ray Bryant, with whom we had many valuable debates about highly parallel operating systems.

The enthusiastic support of many prototype users was a crucial factor in attaining acceptable levels of performance, reliability, and robustness; notable among these were Anne Greenbaum from NYU and Gordon Lyons from the National Bureau of Standards. Mal Kalos and Jim Demmel led undergraduate courses using the Ultracomputer prototype to teach parallel programming.

Perhaps underlying everything else technically was the availability of UNIX source code. Countless programmers at Bell Labs, the University of California at Berkeley, and many other places deserve acknowledgment for that.

My wife, Janet Simmons, has always provided the most crucial support and encouragement. My parents, Barbara and Karl Edler, helped with encouragement and proofreading.

The Courant Institute library provided a pleasant atmosphere for literature research and countless hours of writing.

This dissertation was prepared using a variety of UNIX systems1 and a Toshiba T1000SE laptop computer; the vi text editor [123, 153, 129, 152] was used in all environments. Some of the figures were produced using the interactive drawing program xfig [181]. Formatting was performed with troff.
(I used both AT&T's Documenter's Workbench [13] and the groff package [46], although groff still lacks an implementation of grap [23]. The primary macro package was -me [3].) Other vital software used includes ghostscript [54] and ghostview [195]. Cross referencing and indexing were done with packages of my own design, txr and kwit.

1 IBM RT PC, IBM RS/6000, SGI Indy, and NEC 486-based PC running Linux [202].

Abstract

Large shared memory MIMD computers, with hundreds or thousands of processors, pose special problems for general purpose operating system design. In particular:

• Serial bottlenecks that are insignificant for smaller machines can seriously limit scalability.
• The needs of parallel programming environments vary greatly, requiring a flexible model for run-time support.
• Frequent synchronization within parallel applications can lead to high overhead and bad scheduling decisions.

Because of these difficulties, the algorithms, data structures, and abstractions of conventional operating systems are not well suited to highly parallel machines. We describe the Symunix operating system for the NYU Ultracomputer, a machine with hardware support for Fetch&Φ operations and combining of memory references. Avoidance of serial bottlenecks, through careful design of interfaces and use of highly parallel algorithms and data structures, is the primary goal of the system. Support for flexible parallel programming models relies on user-mode code to implement common abstractions such as threads and shared address spaces. Close interaction between the kernel and user-mode run-time layer reduces the cost of synchronization and avoids many scheduling anomalies.

Symunix first became operational in 1985 and has been extensively used for research and instruction on two generations of Ultracomputer prototypes at NYU. We present data gathered from actual multiprocessor execution to support our claim of practicality for future large systems.
Table of Contents

Dedication iii
Acknowledgments v
Abstract vii
Chapter 1: Introduction 1
 1.1 Organization of the Dissertation 1
 1.2 Development History 2
 1.3 Development Goals 5
 1.4 Design Principles 6
 1.5 Design Overview 7
 1.6 Style and Notation 11
Chapter 2: Survey of Previous Work 15
 2.1 Terminology 15
 2.2 Historical Development 16
  2.2.1 The Role of UNIX in Multiprocessor Research 17
  2.2.2 The Rise of the Microkernel 18
  2.2.3 Threads 20
  2.2.4 Synchronization 20
  2.2.5 Bottleneck-Free Algorithms 23
  2.2.6 Scheduling 24
 2.3 Overview of Selected Systems 26
  2.3.1 Burroughs B5000/B6000 Series 26
  2.3.2 IBM 370/OS/VS2 26
  2.3.3 CMU C.mmp/HYDRA 28
  2.3.4 CMU Cm*/StarOS/Medusa 29
  2.3.5 MUNIX 30
  2.3.6 Purdue Dual VAX/UNIX 31
  2.3.7 AT&T 3B20A/UNIX 31
  2.3.8 Sequent Balance 21000/DYNIX 32
  2.3.9 Mach 33
  2.3.10 IBM RP3 33
  2.3.11 DEC Firefly/Topaz 34
  2.3.12 Psyche 35
 2.4 Chapter Summary 36
Chapter 3: Some Important Primitives 39
 3.1 Fetch&Φ Functions 40
 3.2 Test-Decrement-Retest 42
 3.3 Histograms 43
 3.4 Interrupt Masking 44
 3.5 Busy-Waiting Synchronization 48
  3.5.1 Delay Loops 51
  3.5.2 Counting Semaphores 54
  3.5.3 Binary Semaphores 58
  3.5.4 Readers/Writers Locks 59
  3.5.5 Readers/Readers Locks 64
  3.5.6 Group Locks 68
 3.6 Interprocessor Interrupts 74
 3.7 List Structures 77
  3.7.1 Ordinary Lists 78
  3.7.2 LRU Lists 83
  3.7.3 Visit Lists 91
  3.7.4 Broadcast Trees 108
  3.7.5 Sets 121
  3.7.6 Search Structures 126
 3.8 Future Work 141
 3.9 Chapter Summary 144
Chapter 4: Processes 145
 4.1 Process Model Evolution 145
 4.2 Basic Process Control 148
 4.3 Asynchronous Activities 150
  4.3.1 Asynchronous System Calls 151
  4.3.2 Asynchronous Page Faults 156
 4.4 Implementation of Processes and Activities 156
  4.4.1 Process Structure 157
  4.4.2 Activity Structure 160
  4.4.3 Activity State Transitions 163
  4.4.4 Process and Activity Locking 163
  4.4.5 Primitive Context-Switching 165
 4.5 Context-Switching Synchronization 166
  4.5.1 Non-List-Based Synchronization 168
  4.5.2 List-Based Synchronization: Readers/Writers Locks 170
  4.5.3 Premature Unblocking 184
 4.6 Scheduling 186
  4.6.1 Preemption 186
  4.6.2 Interdependent Scheduling 188
  4.6.3 Scheduling Groups 189
  4.6.4 Temporary Non-Preemption 190
  4.6.5 The SIGPREEMPT Signal 191
  4.6.6 Scheduler Design and Policy Alternatives 194
  4.6.7 The Ready List 196
  4.6.8 The resched Function 197
  4.6.9 The mkready Functions 198
 4.7 Process Creation and Destruction 199
 4.8 Signals 200
  4.8.1 Signal Masking 200
  4.8.2 Sending Signals to Process Groups 202
 4.9 Future Work 205
 4.10 Chapter Summary 208
Chapter 5: Memory Management 209
 5.1 Evolution of OS Support for Shared Memory 209
 5.2 Kernel Memory Model 211
  5.2.1 The mapin, mapout, and mapctl System Calls 214
  5.2.2 The mapinfo System Call 216
  5.2.3 Stack Growth 216
  5.2.4 The fork, spawn, exec, and exit System Calls 218
  5.2.5 Image Directories 219
  5.2.6 Core Dumps and Unlink on Last Close 220
  5.2.7 File System Optimizations 221
 5.3 Kernel Memory Management Implementation 224
  5.3.1 Address Space Management 225
  5.3.2 Working Set Management 230
  5.3.3 TLB Management 235
  5.3.4 Page Frame Management 237
 5.4 Future Work 248
 5.5 Chapter Summary 250
Chapter 6: Input/Output 251
 6.1 Buffer Management 251
  6.1.1 The bio Module 252
  6.1.2 The bcache Module 256
 6.2 File Systems 257
  6.2.1 Mountable File Systems 258
  6.2.2 File System Resources 258
  6.2.3 The file Structure 259
  6.2.4 Inode Access 260
  6.2.5 Path Name Resolution 262
  6.2.6 Block Look-Up 263
 6.3 Device Drivers 264
 6.4 Future Work 266
 6.5 Chapter Summary 266
Chapter 7: User-Mode Facilities 269
 7.1 Basic Programming Environment 270
 7.2 Busy-Waiting Synchronization 271
  7.2.1 Interrupts and Signals 271
  7.2.2 Preemption 272
  7.2.3 Page Faults 272
  7.2.4 The Problem of Barriers 274
 7.3 Context-Switching Synchronization 275
  7.3.1 Standard