SemperOS: A Distributed Capability System


Matthias Hille, Technische Universität Dresden; Nils Asmussen, Technische Universität Dresden and Barkhausen Institut; Pramod Bhatotia, University of Edinburgh; Hermann Härtig, Technische Universität Dresden and Barkhausen Institut

https://www.usenix.org/conference/atc19/presentation/hille

This paper is included in the Proceedings of the 2019 USENIX Annual Technical Conference, July 10–12, 2019, Renton, WA, USA. ISBN 978-1-939133-03-8. Open access to the Proceedings of the 2019 USENIX Annual Technical Conference is sponsored by USENIX.

Abstract

Capabilities provide an efficient and secure mechanism for fine-grained resource management and protection. However, as modern hardware architectures continue to evolve with large numbers of non-coherent and heterogeneous cores, we focus on the following research question: can capability systems scale to modern hardware architectures?

In this work, we present a scalable capability system to drive future systems with many non-coherent heterogeneous cores. More specifically, we have designed a distributed capability system based on a HW/SW co-designed capability system. We analyzed the pitfalls of distributed capability operations running concurrently and built the protocols in accordance with these insights. We have incorporated these distributed capability management protocols in a new microkernel-based OS called SemperOS. Our OS operates the system by means of multiple microkernels, which employ distributed capabilities to provide an efficient and secure mechanism for fine-grained access to system resources. In the evaluation we investigated the scalability of our algorithms and ran applications (Nginx, LevelDB, SQLite, PostMark, etc.) that are heavily dependent on the OS services of SemperOS. The results indicate that there is no inherent scalability limitation for capability systems. Our evaluation shows that we achieve a parallel efficiency of 70% to 78% when examining a system with 576 cores executing 512 application instances while using 11% of the system's cores for OS services.

1 Introduction

Capabilities are unforgeable tokens of authority granting rights to resources in the system. They can be selectively delegated between constrained programs to implement the principle of least authority [48]. Due to their ability to provide fine-grained resource management and protection, capabilities appear to be a particularly good fit for future hardware architectures, which envision byte-granular memory access to large memories (NVRAM) from a large number of cores (e.g. The Machine [31], Enzian [18]). Capability-based systems have therefore received renewed attention recently to provide an efficient and secure mechanism for resource management in modern hardware architectures [5, 24, 30, 36, 44, 64, 67].

Today the main improvements in compute capacity are achieved by either adding more cores or integrating accelerators into the system. However, the increasing core counts exacerbate the hardware complexity required for global cache coherence. While on-chip cache coherence is likely to remain a feature of future hardware architectures [45], we see characteristics of distributed systems added to the hardware by giving up on global cache coherence across a whole machine [6, 28]. Additionally, various kinds of accelerators are added, such as the Xeon Phi processor, the Matrix-2000 accelerator, GPUs, FPGAs, or ASICs, which are used in numerous application fields [4, 21, 32, 34, 43, 60]. These components also contribute to the number of resources an OS has to manage.

In this work, we focus on capability-based systems and how their ability to implement fine-grained access control combines with large systems. In particular, we consider three types of capability systems: L4 [30], CHERI [67] (considered for The Machine [31]), and M3 [5] (see Section 2.1). For all these capability types it is not clear whether they will scale to modern hardware architectures, since the scalability of capability systems has never been studied before. Moreover, existing capability schemes cannot be turned into distributed schemes easily, since they either rely on centralized knowledge or cache-coherent architectures, or are missing important features like revocation.
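To make the capability model described above concrete, the sketch below shows a kernel-maintained (partitioned) capability table in which user code only ever holds table indices, and delegation hands out at most a subset of the delegator's rights. This is a generic illustration of unforgeable tokens and least-authority delegation, not SemperOS's actual data structures; the type names and the rights encoding are assumptions made for this sketch.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Rights are a small bitmask; a delegated capability may only carry a subset.
enum Rights : uint32_t { READ = 1, WRITE = 2, GRANT = 4 };

struct Capability {
    uint64_t object;   // kernel object this capability refers to
    uint32_t rights;   // access rights held by the owner
    int      parent;   // slot it was derived from (-1 = root), recorded for later revocation
};

// Per-process capability table: user code only ever holds indices ("selectors"),
// so capabilities stay unforgeable; the table itself lives in kernel memory.
class CapTable {
    std::vector<Capability> slots_;
public:
    int install(uint64_t object, uint32_t rights, int parent = -1) {
        slots_.push_back({object, rights, parent});
        return static_cast<int>(slots_.size()) - 1;
    }
    // Selective delegation: hand out at most the rights we hold ourselves.
    std::optional<int> delegate(int sel, uint32_t requested, CapTable& receiver) {
        if (sel < 0 || sel >= static_cast<int>(slots_.size())) return std::nullopt;
        const Capability& c = slots_[sel];
        if (!(c.rights & GRANT)) return std::nullopt;   // holder is not allowed to share
        uint32_t granted = c.rights & requested;        // principle of least authority
        return receiver.install(c.object, granted, sel);
    }
};
```

Because the table lives in kernel memory, a program cannot manufacture new rights; it can only receive entries that some existing holder explicitly delegated to it.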
Independent of the choice of capability system, scaling these systems calls for two basic mechanisms to be fast. First, it implies a way of concurrently updating access rights to enable fast decentralized resource sharing. This means fast obtaining or delegating of capabilities, which acquires or hands out access rights to the resources behind the capabilities. The other performance-critical mechanism is the revocation of capabilities. Revoking access rights should be possible within a reasonable amount of time and with minimal overhead. The scalability of this operation is tightly coupled to the enforcement mechanism; for example, when using L4 capabilities the TLB shootdown can be a scalability bottleneck.

We base our system on a hardware/software co-designed capability system (M3). More specifically, we propose a scalable distributed capability mechanism by building a multikernel OS based on the M3 capability system. We present a detailed analysis of possible complications in distributed capability systems caused by concurrent updates. Based on this investigation we describe the algorithms which we implemented in our prototype OS—the SemperOS multikernel.

Our OS divides the system into groups, each of which is managed by an independent kernel. These independently managed groups resemble islands with locally managed resources and capabilities. Kernels communicate via messages to enable interaction across groups. To quickly find objects across the whole system—a crucial prerequisite for our capability scheme—we introduce an efficient addressing scheme.
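The excerpt does not spell out this addressing scheme. One plausible minimal sketch, offered purely as an assumption for illustration rather than as the paper's actual design, is to encode the owning kernel (group) in the upper bits of every global object identifier, so that any kernel can route a capability operation to the responsible kernel without consulting a directory. The bit split and helper names below are invented for this sketch.

```cpp
#include <cstdint>

// Hypothetical 64-bit global object ID: upper bits name the owning kernel
// (island/group), lower bits name the object within that kernel.
constexpr unsigned kKernelBits = 16;
constexpr unsigned kLocalBits  = 64 - kKernelBits;

constexpr uint64_t make_global_id(uint16_t kernel, uint64_t local) {
    return (static_cast<uint64_t>(kernel) << kLocalBits) |
           (local & ((1ULL << kLocalBits) - 1));
}
constexpr uint16_t owning_kernel(uint64_t gid) { return static_cast<uint16_t>(gid >> kLocalBits); }
constexpr uint64_t local_id(uint64_t gid)      { return gid & ((1ULL << kLocalBits) - 1); }

// A delegation or revocation request touching a remote object is simply
// forwarded to the kernel encoded in the ID; no global lookup table is needed.
bool is_local(uint64_t gid, uint16_t my_kernel) { return owning_kernel(gid) == my_kernel; }
```

With such an encoding, a kernel receiving a request can immediately tell whether it owns the object or must forward the message to the kernel named in the identifier.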
Our system design aims at future hardware, which might connect hundreds of processing elements to form a powerful rack-scale system [28]. To be able to experiment with such systems, we use the gem5 simulator [14]. Our evaluation focuses on the performance of the kernel, where we showcase the scalability of our algorithms by using microbenchmarks as well as OS-intensive real-world applications. We describe trade-offs in resource distribution between applications and the OS to determine a suitable configuration for a specific application. We found that our OS can operate a system which hosts 512 parallel running application instances with a parallel efficiency of 70% to 78% while dedicating 11% of the system to the OS and its services.

To summarize, our contributions are as follows.

• We propose a HW/SW co-designed distributed capability system to drive future systems. Our capability system extends M3 capabilities to support a large number of non-cache-coherent heterogeneous cores.

• We implemented a new microkernel-based OS called SemperOS that operates the system by employing multiple microkernels and incorporates distributed capability management protocols.

• We evaluated the distributed capability management protocols by implementing the HW/SW co-design for SemperOS in the gem5 simulator [14] to run real applications.

…L4 microkernels [30, 36, 40, 62], (2) instruction set architecture (ISA) capabilities, as implemented by the CAP computer [49] and recently revived by CHERI [67], and (3) sparse capabilities, which are deployed in the password-capability system of Monash University [2] and in Amoeba [63].

Capabilities can be shared to exchange access rights. ISA capabilities and sparse capabilities can be shared without involving the kernel, since their validity is either ensured by hardware or checked by the kernel using a cryptographic one-way function at the moment they are used. In contrast, sharing of partitioned capabilities is observed by the kernel.

To analyze capability systems regarding their scalability, we inspect their enforcement mechanism, their scope, and their limitation in Table 1. The three categories of capability systems in Table 1 represent a relevant subset of capability systems for the scope of this work: (1) L4 capabilities, which are partitioned capabilities employed in L4 µ-kernels [30, 36, 40, 62], (2) M3 capabilities, which are a special form of partitioned capabilities involving a different enforcement mechanism explained in the following, and lastly (3) CHERI capabilities, which are ISA capabilities implemented by the CHERI processor [64, 67].

Table 1: Classification of capability types.

                L4                 M3             CHERI
  Scope         Coherence domain   Machine        Address space
  Enforcement   MMU / kernel       DTU / kernel   CHERI co-processor
  Limitation    Coherence domain   Core count     No revocation

L4 capabilities utilize the memory management unit (MMU) of the processor to restrict a program's memory access. Access to other resources, like communication channels or process control, is enforced by the kernel. Since L4 is built for cache-coherent machines, both the scope of a capability and the current limitation is a coherence domain.

In contrast, M3 introduces a hardware component called the data transfer unit (DTU), which provides message passing between processing elements and a form of remote direct memory access. Consequently, memory access and communication channels are enforced by the DTU, and access to other system resources by the kernel. The DTU is the only possibility for…
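The introduction names revocation as the second performance-critical capability operation, and Table 1 lists missing revocation as CHERI's limitation. The sketch below illustrates, under assumed names and the identifier layout from the previous sketch, how revocation of partitioned capabilities is commonly realized: each capability records the capabilities derived from it, and revoking one walks this derivation tree, sending a message to the owning kernel of every remote descendant. It is a generic illustration, not the concrete protocol SemperOS implements.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct CapNode {
    uint64_t global_id;               // object identifier as sketched above
    std::vector<CapNode*> children;   // capabilities delegated from this one
    bool revoked = false;
};

// Stand-in for a cross-kernel message; a real multikernel would enqueue this
// on the message channel to the remote kernel.
void send_revoke_request(uint16_t remote_kernel, uint64_t id) {
    std::printf("revoke request -> kernel %u, object %llu\n",
                static_cast<unsigned>(remote_kernel),
                static_cast<unsigned long long>(id));
}

// Revoke a capability and, transitively, everything derived from it.
// Local children are handled directly; remote children cost one message each.
void revoke(CapNode* cap, uint16_t my_kernel) {
    cap->revoked = true;
    for (CapNode* child : cap->children) {
        uint16_t owner = static_cast<uint16_t>(child->global_id >> 48);  // assumes 16 kernel bits
        if (owner == my_kernel)
            revoke(child, my_kernel);
        else
            send_revoke_request(owner, child->global_id);
    }
}
```

The cost of a revocation therefore grows with the size of the derivation tree and, in a multikernel, with the number of distinct kernels holding derived capabilities, which is exactly why the scalability of this operation needs to be studied.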