Coordinated Checkpoint/Restart Process Fault Tolerance for Mpi Applications on Hpc Systems

Total Page:16

File Type:pdf, Size:1020Kb

Coordinated Checkpoint/Restart Process Fault Tolerance for Mpi Applications on Hpc Systems COORDINATED CHECKPOINT/RESTART PROCESS FAULT TOLERANCE FOR MPI APPLICATIONS ON HPC SYSTEMS Joshua Hursey Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy in the Department of Computer Science Indiana University July, 2010 Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Doctoral Committee Andrew Lumsdaine, Ph.D. Randall Bramley, Ph.D. Arun Chauhan, Ph.D. Beth Plale, Ph.D. Frank Mueller, Ph.D. June 30, 2010 ii Copyright 2010 Joshua Hursey All rights reserved iii To Dad and Mom for their unwavering support over these many years. iv Acknowledgements A special thank you to my advisor, Andrew Lumsdaine, for providing me with a unique working environment full of opportunities and challenges that have molded me into the researcher that I am today. My research perspective has been strongly influenced by his dedication to open, repeatable scientific exploration. His ability to coalesce and analyze a vast range of scientific domains to highlight new research avenues continues to amaze me. I would like to express my gratitude to my research committee for their support during the development of the research contained in this dissertation. Whether in a formal class- room setting or informal conversation their expertise has been an invaluable asset. This dissertation was shaped by their insightful questions and suggestions. I thank my colleagues in the Computer Science department and Open Systems Labo- ratory at Indiana University. A special thanks to Jeffrey M. Squyres, Brian Barrett, Torsten Hoefler, DongInn Kim, Timothy I. Mattox, Joseph A. Cottam, Andrew Friedley, Abhishek Kulkarni, and Timothy Prins for their collaborative help over the years. I would also like to thank Laura Hopkins for helping me polish this dissertation. I would like to thank the Open MPI, Berkeley Lab Checkpoint/Restart (BLCR), and MPI Testing Tool (MTT) developer communities for their patience and support without which this dissertation would not have been possible. In particular, I would like to thank Ralph Castain, Paul H. Hargrove, Jeffrey M. Squyres, Richard Graham, and Ethan Mallove for the many hours they have spent with me discussing various design decisions. I have learned so much from each of you, for which I am truly grateful. Many thanks to the staff of the Computer Science department, Computer Systems Group, and Pervasive Technology Institute. Lucy Battersby, Debbie Canada, Becky Cur- tis, Sherry Kay, Rebecca Lowe, Ann Oxby, and Jacqueline Whaley have been great assets v in helping me throughout my graduate career. Rob Henderson, Shing-Shong (Bruce) Shei, and T.J. Jones are some of the best system administrators I have ever had the pleasure of working with. Thank you for your patience and support as I experimented and stressed Odin and Sif beyond their comfort zones. I owe a debt of gratitude to Charles Peck for sparking my interest in both teaching and research. His energy and passion for education, research, and life has left a lasting impression on me. Thank you for helping me find my path and showing me how to have fun with what I do. Graduate school has been a taxing endeavor full of both joy and pain. To all of my friends at Indiana University and around the world, thank you for your support during this process. In particular, I would like to thank Craig Shue and Andrew Kalafut for helping me celebrate the good times and get through the rough times. To my dear wife, Samantha Foley, you have been and continue to be a steadfast source of love, support, encouragement and kindness. Thank you for keeping me grounded and reminding me to not take things too seriously. I love you very much, and appreciate you every day. Finally, I would like to thank my family not only for their unwavering love and support, but for instilling in me a work ethic and a set of values without which I could not have come this far. I am forever grateful of the hard work and sacrifice of my parents and grandparents which has provided me the opportunity to achieve something uniquely significant in our family’s history. This work was made possible by the generous support of the following grants and in- stitutions: the Lilly Endowment; Lawrence Berkeley National Laboratory; Los Alamos Na- tional Laboratory; Oak Ridge National Laboratory; National Science Foundation grants NSF ANI-0330620, CDA-0116050, and EIA-0202048; and U.S. Department of Energy grant DE- FC02-06ER25750ˆA003. vi Abstract Scientists use advanced computing techniques to assist in answering the complex questions at the forefront of discovery. The High Performance Computing (HPC) scientific applica- tions created by these scientists are running longer and scaling to larger systems. These applications must be able to tolerate the inevitable failure of a subset of processes (process failures) that occur as a result of pushing the reliability boundaries of HPC systems. HPC system reliability is emerging as a problem in future exascale systems where the time to failure is measured in minutes or hours instead of days or months. Resilient applications (i.e., applications that can continue to run despite process failures) depend on resilient communication and runtime environments to sustain the application across process fail- ures. Unfortunately, these environments are uncommon and not typically present on HPC systems. In order to preserve performance, scalability, and scientific accuracy, a resilient application may choose the invasiveness of the recovery solution, from completely trans- parent to completely application-directed. Therefore, resilient communication and runtime environments must provide customizable fault recovery mechanisms. Resilient applications often use rollback recovery techniques for fault tolerance: partic- ularly popular are checkpoint/restart (C/R) techniques. HPC applications commonly use the Message Passing Interface (MPI) standard for communication. This thesis identifies a complete set of capabilities that compose to form a coordinated C/R infrastructure for MPI applications running on HPC systems. These capabilities, when integrated into an MPI implementation, provide applications with transparent, yet optionally application config- urable, fault tolerance. By adding these capabilities to Open MPI we demonstrate support for C/R process fault tolerance, automatic recovery, proactive process migration, and par- allel debugging. We also discuss how this infrastructure is being used to support further research into fault tolerance. vii Contents List of Tables xi List of Figures xiii List of Acronyms xvii Chapter 1. Introduction 1 Chapter 2. Background and Related Work 6 1. Distributed Fault Detection 8 2. Fault Prediction 10 3. Fault Recovery: Algorithm Based Fault Tolerance 11 4. Fault Recovery: Checkpoint/Restart 12 5. Fault Recovery: Message Logging 28 6. Fault Recovery: Replication 30 7. Debugging 31 8. Process Migration 34 9. File Systems 35 10. Compiler Based Techniques 39 Chapter 3. Checkpoint/Restart Infrastructure 41 1. Open MPI Architecture 43 2. External Tools and Interfaces 46 viii 3. Checkpoint/Restart Service 51 4. Checkpoint/Restart Coordination Protocol 53 5. Interlayer Notification Callback 58 6. Stable Storage 61 7. File Management 62 8. Snapshot Coordination 63 9. Performance Results 67 10. Conclusion 82 Chapter 4. Process Migration and Automatic Recovery 84 1. Error Management and Recovery Policy 85 2. Error Management and Recovery Policy Implementation 86 3. Runtime Stabilization ErrMgr Component 88 4. Automatic Recovery ErrMgr Component 89 5. Process Migration ErrMgr Component 90 6. Performance Results 92 7. Conclusion 98 Chapter 5. Application Interaction 99 1. Application Programming Interface 100 2. Checkpoint/Restart-Enabled Debugging 104 3. Conclusion 118 Chapter 6. Conclusions 119 1. Future Work 121 Bibliography 124 Appendix A. Checkpoint/Restart Application Programming Interface 148 1. Checkpoint/Restart Interface 148 2. Quiescence Interface 151 3. Process Migration Interface 154 ix 4. Interlayer Notification Callback Callbacks 155 Appendix B. Command Line Tools 158 1. ompi-checkpoint 158 2. ompi-restart 160 3. ompi-migrate 161 Appendix C. self Checkpoint/Restart Service (CRS) 162 1. Interface 163 2. Compiling 164 3. Running 164 4. Example Application 164 Appendix D. Nonblocking Process Creation and Management Operations 168 1. Process Manager Interface 169 2. Starting Processes and Establishing Communication 169 3. Establishing Communication 172 x List of Tables 3.1 Interlayer Notification Callback (INC) states used in the ft event function. 59 3.2 NetPIPE 1 byte latency and bandwidth illustrating CRCP framework failure-free overhead. 70 3.3 Checkpoint overhead analysis for NAS Parallel Benchmarks (NAS) Parallel Benchmark LU Class C with 32 processes using the central Stable Storage (SStore) component. Global snapshot is 1GB or 32MB per process. 75 3.4 Checkpoint overhead analysis for NAS Parallel Benchmark EP Class D with 32 processes using the central SStore component. Global snapshot is 102MB or 3.2MB per process. 75 3.5 Checkpoint overhead analysis for NAS Parallel Benchmark BT Class C with 36 processes using the central SStore component. Global snapshot is 4.2GB or 120MB per process. 75 3.6 Checkpoint overhead analysis for NAS Parallel Benchmark SP Class C with 36 processes using the central SStore component. Global snapshot is 1.9GB or 54MB per process. 76 3.7 Checkpoint overhead analysis for
Recommended publications
  • RCU Usage in the Linux Kernel: One Decade Later
    RCU Usage In the Linux Kernel: One Decade Later Paul E. McKenney Silas Boyd-Wickizer Jonathan Walpole Linux Technology Center MIT CSAIL Computer Science Department IBM Beaverton Portland State University Abstract unique among the commonly used kernels. Understand- ing RCU is now a prerequisite for understanding the Linux Read-copy update (RCU) is a scalable high-performance implementation and its performance. synchronization mechanism implemented in the Linux The success of RCU is, in part, due to its high perfor- kernel. RCU’s novel properties include support for con- mance in the presence of concurrent readers and updaters. current reading and writing, and highly optimized inter- The RCU API facilitates this with two relatively simple CPU synchronization. Since RCU’s introduction into the primitives: readers access data structures within RCU Linux kernel over a decade ago its usage has continued to read-side critical sections, while updaters use RCU syn- expand. Today, most kernel subsystems use RCU. This chronization to wait for all pre-existing RCU read-side paper discusses the requirements that drove the devel- critical sections to complete. When combined, these prim- opment of RCU, the design and API of the Linux RCU itives allow threads to concurrently read data structures, implementation, and how kernel developers apply RCU. even while other threads are updating them. This paper describes the performance requirements that 1 Introduction led to the development of RCU, gives an overview of the RCU API and implementation, and examines how ker- The first Linux kernel to include multiprocessor support nel developers have used RCU to optimize kernel perfor- is not quite 20 years old.
    [Show full text]
  • LMAX Disruptor
    Disruptor: High performance alternative to bounded queues for exchanging data between concurrent threads Martin Thompson Dave Farley Michael Barker Patricia Gee Andrew Stewart May-2011 http://code.google.com/p/disruptor/ 1 Abstract LMAX was established to create a very high performance financial exchange. As part of our work to accomplish this goal we have evaluated several approaches to the design of such a system, but as we began to measure these we ran into some fundamental limits with conventional approaches. Many applications depend on queues to exchange data between processing stages. Our performance testing showed that the latency costs, when using queues in this way, were in the same order of magnitude as the cost of IO operations to disk (RAID or SSD based disk system) – dramatically slow. If there are multiple queues in an end-to-end operation, this will add hundreds of microseconds to the overall latency. There is clearly room for optimisation. Further investigation and a focus on the computer science made us realise that the conflation of concerns inherent in conventional approaches, (e.g. queues and processing nodes) leads to contention in multi-threaded implementations, suggesting that there may be a better approach. Thinking about how modern CPUs work, something we like to call “mechanical sympathy”, using good design practices with a strong focus on teasing apart the concerns, we came up with a data structure and a pattern of use that we have called the Disruptor. Testing has shown that the mean latency using the Disruptor for a three-stage pipeline is 3 orders of magnitude lower than an equivalent queue-based approach.
    [Show full text]
  • A Thread Synchronization Model for the PREEMPT RT Linux Kernel
    A Thread Synchronization Model for the PREEMPT RT Linux Kernel Daniel B. de Oliveiraa,b,c, R^omulo S. de Oliveirab, Tommaso Cucinottac aRHEL Platform/Real-time Team, Red Hat, Inc., Pisa, Italy. bDepartment of Systems Automation, UFSC, Florian´opolis, Brazil. cRETIS Lab, Scuola Superiore Sant'Anna, Pisa, Italy. Abstract This article proposes an automata-based model for describing and validating sequences of kernel events in Linux PREEMPT RT and how they influence the timeline of threads' execu- tion, comprising preemption control, interrupt handling and control, scheduling and locking. This article also presents an extension of the Linux tracing framework that enables the trac- ing of kernel events to verify the consistency of the kernel execution compared to the event sequences that are legal according to the formal model. This enables cross-checking of a kernel behavior against the formalized one, and in case of inconsistency, it pinpoints possible areas of improvement of the kernel, useful for regression testing. Indeed, we describe in details three problems in the kernel revealed by using the proposed technique, along with a short summary on how we reported and proposed fixes to the Linux kernel community. As an example of the usage of the model, the analysis of the events involved in the activation of the highest priority thread is presented, describing the delays occurred in this operation in the same granularity used by kernel developers. This illustrates how it is possible to take advantage of the model for analyzing the preemption model of Linux. Keywords: Real-time computing, Operating systems, Linux kernel, Automata, Software verification, Synchronization.
    [Show full text]
  • Review Der Linux Kernel Sourcen Von 4.9 Auf 4.10
    Review der Linux Kernel Sourcen von 4.9 auf 4.10 Reviewed by: Tested by: stecan stecan Period of Review: Period of Test: From: Thursday, 11 January 2018 07:26:18 o'clock +01: From: Thursday, 11 January 2018 07:26:18 o'clock +01: To: Thursday, 11 January 2018 07:44:27 o'clock +01: To: Thursday, 11 January 2018 07:44:27 o'clock +01: Report automatically generated with: LxrDifferenceTable, V0.9.2.548 Provided by: Certified by: Approved by: Account: stecan Name / Department: Date: Friday, 4 May 2018 13:43:07 o'clock CEST Signature: Review_4.10_0_to_1000.pdf Page 1 of 793 May 04, 2018 Review der Linux Kernel Sourcen von 4.9 auf 4.10 Line Link NR. Descriptions 1 .mailmap#0140 Repo: 9ebf73b275f0 Stephen Tue Jan 10 16:57:57 2017 -0800 Description: mailmap: add codeaurora.org names for nameless email commits ----------- Some codeaurora.org emails have crept in but the names don't exist for them. Add the names for the emails so git can match everyone up. Link: http://lkml.kernel.org/r/[email protected] 2 .mailmap#0154 3 .mailmap#0160 4 CREDITS#2481 Repo: 0c59d28121b9 Arnaldo Mon Feb 13 14:15:44 2017 -0300 Description: MAINTAINERS: Remove old e-mail address ----------- The ghostprotocols.net domain is not working, remove it from CREDITS and MAINTAINERS, and change the status to "Odd fixes", and since I haven't been maintaining those, remove my address from there. CREDITS: Remove outdated address information ----------- This address hasn't been accurate for several years now.
    [Show full text]
  • Linux Foundation Collaboration Summit 2010
    Linux Foundation Collaboration Summit 2010 LTTng, State of the Union Presentation at: http://www.efficios.com/lfcs2010 E-mail: [email protected] Mathieu Desnoyers April 15th, 2010 1 > Presenter ● Mathieu Desnoyers ● EfficiOS Inc. ● http://www.efficios.com ● Author/Maintainer of ● LTTng, LTTV, Userspace RCU ● Ph.D. in computer engineering ● Low-Impact Operating System Tracing Mathieu Desnoyers April 15th, 2010 2 > Plan ● Current state of LTTng ● State of kernel tracing in Linux ● User requirements ● Vertical vs Horizontal integration ● LTTng roadmap for 2010 ● Conclusion Mathieu Desnoyers April 15th, 2010 3 > Current status of LTTng ● LTTng dual-licensing: GPLv2/LGPLv2.1 ● UST user-space tracer – Userspace RCU (LGPLv2.1) ● Eclipse Linux Tools Project LTTng Integration ● User-space static tracepoint integration with gdb ● LTTng kernel tracer – maintainance-mode in 2009 (finished my Ph.D.) – active development restarting in 2010 Mathieu Desnoyers April 15th, 2010 4 > LTTng dual-licensing GPLv2/LGPLv2.1 ● LGPLv2.1 license is required to share code with user-space tracer library. ● License chosen to allow tracing of non-GPL applications. ● Headers are licensed under BSD: – Demonstrates that these headers can be included in non-GPL code. ● Applies to: – LTTng, Tracepoints, Kernel Markers, Immediate Values Mathieu Desnoyers April 15th, 2010 5 > User-space Tracing (UST) (1) ● LTTng port to user-space ● Re-uses Tracepoints and LTTng ring buffer ● Uses Userspace RCU for control synchronization ● Shared memory map with consumer daemon ● Per-process per-cpu ring buffers Mathieu Desnoyers April 15th, 2010 6 > User-space Tracing (UST) (2) ● The road ahead – Userspace trace clock for more architectures ● Some require Linux kernel vDSO support for trace clock – Utrace ● Provide information about thread creation, exec(), etc..
    [Show full text]
  • Read-Copy Update (RCU)
    COMP 790: OS Implementation Read-Copy Update (RCU) Don Porter COMP 790: OS Implementation Logical Diagram Binary Memory Threads Formats Allocators User Today’s Lecture System Calls Kernel RCU File System Networking Sync Memory Device CPU Management Drivers Scheduler Hardware Interrupts Disk Net Consistency COMP 790: OS Implementation RCU in a nutshell • Think about data structures that are mostly read, occasionally written – Like the Linux dcache • RW locks allow concurrent reads – Still require an atomic decrement of a lock counter – Atomic ops are expensive • Idea: Only require locks for writers; carefully update data structure so readers see consistent views of data COMP 790: OS Implementation Motivation (from Paul McKenney’s Thesis) 35 "ideal" "global" 30 "globalrw" 25 20 Performance of RW 15 lock only marginally 10 better than mutex 5 lock Hash Table Searches per Microsecond 0 1 2 3 4 # CPUs COMP 790: OS Implementation Principle (1/2) • Locks have an acquire and release cost – Substantial, since atomic ops are expensive • For short critical regions, this cost dominates performance COMP 790: OS Implementation Principle (2/2) • Reader/writer locks may allow critical regions to execute in parallel • But they still serialize the increment and decrement of the read count with atomic instructions – Atomic instructions performance decreases as more CPUs try to do them at the same time • The read lock itself becomes a scalability bottleneck, even if the data it protects is read 99% of the time COMP 790: OS Implementation Lock-free data structures
    [Show full text]
  • Connect User Guide Armv8-A Memory Systems
    ARMv8-A Memory Systems ConnectARMv8- UserA Memory Guide VersionSystems 0.1 Version 1.0 Copyright © 2016 ARM Limited or its affiliates. All rights reserved. Page 1 of 17 ARM 100941_0100_en ARMv8-A Memory Systems Revision Information The following revisions have been made to this User Guide. Date Issue Confidentiality Change 28 February 2017 0100 Non-Confidential First release Proprietary Notice Words and logos marked with ® or ™ are registered trademarks or trademarks of ARM® in the EU and other countries, except as otherwise stated below in this proprietary notice. Other brands and names mentioned herein may be the trademarks of their respective owners. Neither the whole nor any part of the information contained in, or the product described in, this document may be adapted or reproduced in any material form except with the prior written permission of the copyright holder. The product described in this document is subject to continuous developments and improvements. All particulars of the product and its use contained in this document are given by ARM in good faith. However, all warranties implied or expressed, including but not limited to implied warranties of merchantability, or fitness for purpose, are excluded. This document is intended only to assist the reader in the use of the product. ARM shall not be liable for any loss or damage arising from the use of any information in this document, or any error or omission in such information, or any incorrect use of the product. Where the term ARM is used it means “ARM or any of its subsidiaries as appropriate”. Confidentiality Status This document is Confidential.
    [Show full text]
  • Using Visualization to Understand the Behavior of Computer Systems
    USING VISUALIZATION TO UNDERSTAND THE BEHAVIOR OF COMPUTER SYSTEMS A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY Robert P. Bosch Jr. August 2001 c Copyright by Robert P. Bosch Jr. 2001 All Rights Reserved ii I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Dr. Mendel Rosenblum (Principal Advisor) I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Dr. Pat Hanrahan I certify that I have read this dissertation and that in my opinion it is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy. Dr. Mark Horowitz Approved for the University Committee on Graduate Studies: iii Abstract As computer systems continue to grow rapidly in both complexity and scale, developers need tools to help them understand the behavior and performance of these systems. While information visu- alization is a promising technique, most existing computer systems visualizations have focused on very specific problems and data sources, limiting their applicability. This dissertation introduces Rivet, a general-purpose environment for the development of com- puter systems visualizations. Rivet can be used for both real-time and post-mortem analyses of data from a wide variety of sources. The modular architecture of Rivet enables sophisticated visualiza- tions to be assembled using simple building blocks representing the data, the visual representations, and the mappings between them.
    [Show full text]
  • Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications
    Evolutionary Tree-Structured Storage: Concepts, Interfaces, and Applications Dissertation submitted for the degree of Doctor of Natural Sciences Presented by Marc Yves Maria Kramis at the Faculty of Sciences Department of Computer and Information Science Date of the oral examination: 22.04.2014 First supervisor: Prof. Dr. Marcel Waldvogel Second supervisor: Prof. Dr. Marc Scholl i Abstract Life is subdued to constant evolution. So is our data, be it in research, business or personal information management. From a natural, evolutionary perspective, our data evolves through a sequence of fine-granular modifications resulting in myriads of states, each describing our data at a given point in time. From a technical, anti- evolutionary perspective, mainly driven by technological and financial limitations, we treat the modifications as transient commands and only store the latest state of our data. It is surprising that the current approach is to ignore the natural evolution and to willfully forget about the sequence of modifications and therefore the past state. Sticking to this approach causes all kinds of confusion, complexity, and performance issues. Confusion, because we still somehow want to retrieve past state but are not sure how. Complexity, because we must repeatedly work around our own obsolete approaches. Performance issues, because confusion times complexity hurts. It is not surprising, however, that intelligence agencies notoriously try to collect, store, and analyze what the broad public willfully forgets. Significantly faster and cheaper random-access storage is the key driver for a paradigm shift towards remembering the sequence of modifications. We claim that (1) faster storage allows to efficiently and cleverly handle finer-granular modifi- cations and (2) that mandatory versioning elegantly exposes past state, radically simplifies the applications, and effectively lays a solid foundation for backing up, distributing and scaling of our data.
    [Show full text]
  • Interrupt Handling
    CCChahahaptptpteeerrr 555 KKKeeerrrnnneeelll SSSyyynnnccchhhrrrooonnniiizzzaaatttionionion Hsung-Pin Chang Department of Computer Science National Chung Hsing University Outline • Kernel Control Paths • When Synchronization Is Not Necessary • Synchronization Primitives • Synchronizing Accesses to Kernel Data Structure • Examples of Race Condition Prevention Kernel Control Paths • Kernel control path – A sequence of instructions executed by the kernel to handle interrupts of different kinds – Each kernel request is handled by a different kernel control path Kernel Control Paths (Cont.) • Kernel requests may be issued in several possible ways – A process executing in User Mode causes an exception-for instance, by executing at int0x80 instruction – An external devices sends a signal to a Programmable Interrupt Controller Kernel Control Paths (Cont.) – A process executing in Kernel Mode causes a Page Fault exception – A process running in a MP system and executing in Kernel Mode raises an interprocessorinterrupt Kernel Control Paths (Cont.) • Kernel control path is quite similar to the process, except – Does not have a descriptor – Not scheduled through scheduler, but rather by inserting sequence of instructions into the kernel code Kernel Control Paths (Cont.) • In some cases, the CPU interleaves kernel control paths when one of the following event occurs – A process switch occurs, i.e., when the schedule() function is invoked – An interrupt occurs while the CPU is running a kernel control path with interrupt enabled – A deferrable function
    [Show full text]
  • Memory Barriers: a Hardware View for Software Hackers
    Memory Barriers: a Hardware View for Software Hackers Paul E. McKenney Linux Technology Center IBM Beaverton [email protected] April 5, 2009 So what possessed CPU designers to cause them ing ten instructions per nanosecond, but will require to inflict memory barriers on poor unsuspecting SMP many tens of nanoseconds to fetch a data item from software designers? main memory. This disparity in speed — more than In short, because reordering memory references al- two orders of magnitude — has resulted in the multi- lows much better performance, and so memory barri- megabyte caches found on modern CPUs. These ers are needed to force ordering in things like synchro- caches are associated with the CPUs as shown in Fig- nization primitives whose correct operation depends ure 1, and can typically be accessed in a few cycles.1 on ordered memory references. Getting a more detailed answer to this question requires a good understanding of how CPU caches CPU 0 CPU 1 work, and especially what is required to make caches really work well. The following sections: 1. present the structure of a cache, Cache Cache 2. describe how cache-coherency protocols ensure Interconnect that CPUs agree on the value of each location in memory, and, finally, 3. outline how store buffers and invalidate queues Memory help caches and cache-coherency protocols achieve high performance. We will see that memory barriers are a necessary evil that is required to enable good performance and scal- Figure 1: Modern Computer System Cache Structure ability, an evil that stems from the fact that CPUs are orders of magnitude faster than are both the in- Data flows among the CPUs’ caches and memory terconnects between them and the memory they are in fixed-length blocks called “cache lines”, which are attempting to access.
    [Show full text]
  • The Linux Operating System
    T HE L INUX O PERATING S YSTEM William Stallings Copyright 2008 This document is an extract from Operating Systems: Internals and Design Principles, Sixth Edition William Stallings Prentice Hall 2008 ISBN-10: 0-13-600632-9 ISBN-13: 978-0-13-600632-9 http://williamstallings.com/OS/OS6e.html M02_STAL6329_06_SE_C02.QXD 2/22/08 7:02 PM Page 94 94 CHAPTER 2 / OPERATING SYSTEM OVERVIEW of the System V kernel and produced a clean, if complex, implementation. New fea- tures in the release include real-time processing support, process scheduling classes, dynamically allocated data structures, virtual memory management, virtual file sys- tem, and a preemptive kernel. SVR4 draws on the efforts of both commercial and academic designers and was developed to provide a uniform platform for commercial UNIX deployment. It has succeeded in this objective and is perhaps the most important UNIX variant. It incorporates most of the important features ever developed on any UNIX system and does so in an integrated, commercially viable fashion. SVR4 runs on processors ranging from 32-bit microprocessors up to supercomputers. BSD The Berkeley Software Distribution (BSD) series of UNIX releases have played a key role in the development of OS design theory. 4.xBSD is widely used in academic installations and has served as the basis of a number of commercial UNIX products. It is probably safe to say that BSD is responsible for much of the popularity of UNIX and that most enhancements to UNIX first appeared in BSD versions. 4.4BSD was the final version of BSD to be released by Berkeley, with the de- sign and implementation organization subsequently dissolved.
    [Show full text]