Distributed Synchronization Primitives for Legacy and Novel Parallel Applications
Rodrigo Cadore Cataldo


To cite this version: Rodrigo Cadore Cataldo. Subutai: Distributed synchronization primitives for legacy and novel parallel applications. Distributed, Parallel, and Cluster Computing [cs.DC]. Université de Bretagne Sud; Pontifícia Universidade Católica do Rio Grande do Sul, 2019. English. NNT: 2019LORIS541. tel-02865408.

HAL Id: tel-02865408, https://tel.archives-ouvertes.fr/tel-02865408. Submitted on 11 Jun 2020. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Doctoral thesis of the Université Bretagne Sud (COMUE Université Bretagne Loire), École Doctorale N° 601, Mathématiques et Sciences et Technologies de l'Information et de la Communication, speciality Électronique. By Rodrigo Cadore CATALDO: "SUBUTAI: Distributed synchronization primitives for legacy and novel parallel applications". Thesis presented and defended in Porto Alegre on 16 December 2019. Research unit: Lab-STICC. Thesis N° 541.

Rapporteurs and jury:
- Eduardo Augusto BEZERRA, Professor, Universidade Federal de Santa Catarina (UFSC)
- Marcio KREUTZ, Professor, Universidade Federal do Rio Grande do Norte (UFRN)
- Avelino Francisco ZORZO, Professor, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS)
- Kevin MARTIN, Maître de conférences, Université Bretagne Sud
- Jean-Philippe DIGUET, DR, CNRS (thesis advisor)
- César Augusto MISSIO MARCON, Professor, Pontifícia Universidade Católica do Rio Grande do Sul (PUCRS) (thesis co-advisor / jury president)

List of Figures

Figure 1.1 – The speedup of a parallel application is severely limited by how much of the application can be parallelized [Wik19].
Figure 1.2 – The essential approaches to parallelizing code. Thick lines depict locks and the flash symbol denotes device interrupts [Kå05].
Figure 1.3 – (a) RCU usage over the years [MBWW17]; (b) RCU and locking usage over the years [MW08].
Figure 1.4 – Snippet of the locking scheme for the TTY subsystem [BC19a][BC19b].
Figure 1.5 – Execution of reader-writer lock and RCU. The green color represents up-to-date data [McK19a].
Figure 1.6 – Cost of each production step of multiple generations of transistor technology [Spe19].
Figure 1.7 – Subutai components highlighted in red in the computing stack. Subutai only requires changes in the (i) on-chip NI, (ii) OS NI driver and (iii) PThreads implementation. Additionally, a new scheduling policy (in blue) is explored in this work as an optional optimization.
Figure 1.8 – Synchronization acceleration with the Subutai solution for different scenarios.
Figure 2.1 – Atomic increment scalability on a Nehalem Intel processor [McK19a].
Figure 2.2 – Throughput of different atomic operations on a single memory position [DGT13].
Figure 2.3 – Reordering example from the Intel Manual [Int17] [Pre19b].
Figure 2.4 – The net effect of using lock-based procedures: implicit memory barriers (represented as blue fences) (based on [Pre19a]).
Figure 2.5 – Scalability of the reader-writer lock mechanism. A parameter simulates the critical-section length in instructions, ranging from a thousand (1K on the graph) to 100 million (100M on the graph) instructions [McK19a].
Figure 3.1 – The fork-join model of OpenMP [Bar19a].
Figure 3.2 – RCU deletion example [McK19a].
Figure 3.3 – Milliseconds to copy a 10 MB file in two threads on a 2 GHz Pentium 4 Xeon (lower is better) [Boe07].
Figure 3.4 – Delays to complete the barrier release process for a 16-thread application on a 16-core system (a) without and (b) with optimization [FPMR18].
Figure 3.5 – Normalized execution time of eight applications from the STAMP benchmark from 1 up to 8 threads; AVG represents the average execution time of all applications [YHLR13].
Figure 3.6 – Comparison of different synchronization schemes for PhysicsSolver [YHLR13].
Figure 3.7 – Simulation time comparison [PKMS17].
Figure 3.8 – Ideas behind the CASPAR architecture [GMT16].
Figure 3.9 – Experimental results for the CASPAR architecture design [GMT16].
Figure 3.10 – Three barrier synchronization topologies [AFA+12].
Figure 3.11 – Gather phase for (a) CBarrier, (b) GBarrier, and (c) TBarrier [AFA+12].
Figure 3.12 – (a) Optimal frequencies and (b) area overhead running at 600 MHz for all analyzed topologies [AFA+12].
Figure 3.13 – The barrier cost of (a) SW and (b) HW implementations in cycles [AFA+12].
Figure 3.14 – Barrier overhead for varied-sized parallel workloads [AFA+12].
Figure 3.15 – Target architecture with Notifying Memories [MRSD16].
Figure 3.16 – Classification of packets transported after 10 decoded frames of ice [MRSD16].
Figure 4.1 – The Subutai solution; Subutai components are highlighted in red.
Figure 4.2 – Schematic of the target architecture for Subutai; the target architecture comprises 64 cores.
Figure 4.3 – Schematic representation of Subutai-HW and the NI. The hardware elements required by the Subutai-HW implementation are highlighted in orange.
Figure 4.4 – Subutai-HW control structure.
Figure 4.5 – Subutai-HW queue structure.
Figure 4.6 – Subutai's packet format.
Figure 4.7 – Futex kernel implementation [Har19].
Figure 4.8 – 2-bit saturating counter for branch prediction [Dia19].
Figure 4.9 – Communication flow of Subutai.
Figure 5.1 – The Subutai extensions are highlighted in blue.
Figure 5.2 – Core scheduler procedure switch_to steps for two Linux kernel versions [Mai19].
Figure 5.3 – Overall time spent in the critical sections for all the work-related mutexes of Bodytrack on a RR scheduler.
Figure 5.4 – Overall time spent in mutex queues for all the work-related mutexes of Bodytrack on a RR scheduler.
Figure 5.5 – Total execution time spent in critical sections for all the work-related mutexes of Bodytrack on a RR scheduler.
Figure 5.6 – Overall time spent in critical sections for all the work-related mutexes of Bodytrack on a RR and CSA-enabled scheduler.
Figure 6.1 – Experimental setup for Subutai evaluation.
Figure 6.2 – Maximum speedup measured for a 48-core system for each benchmark, region, and input set combination. Full and ROI inputs represent the entire application and parallel portion, respectively [SR15].
Figure 6.3 – Bodytrack's output.
Figure 6.4 – Bodytrack's core workflow.
Figure 6.5 – The execution time in seconds (10^9 ns) of our application set on SW-only and Subutai solutions for 16 threads. Due to the order of magnitude, Subutai-HW and NoC latencies are not visible.
Figure 6.6 – The execution time in seconds (10^9 ns) of our application set on SW-only and Subutai solutions for 32 threads. Due to the order of magnitude, Subutai-HW and NoC latencies are not visible.
Figure 6.7 – The execution time in seconds (10^9 ns) of our application set on SW-only and Subutai solutions for 64 cores. Due to the order of magnitude, Subutai-HW and NoC latencies are not visible.
Figure 6.8 – Relative time for every thread to reach the barrier localS; (a) is a plot for all barrier calls during the application execution, while (b) is a snippet of 50 barrier calls for better visualization.
Figure 6.9 – Relative time for every thread to reach all barrier calls of workDoneBarrier.
Figure 6.10 – Comparison of latency for the procedure pthread_barrier_wait for SW-only and Subutai. The results are for T7 for a subset of barrier calls (400 up to 450).
Figure 6.11 – Comparison of latency for the procedure pthread_barrier_wait for SW-only and Subutai. The results are for T7 for all barrier calls of poolReadyBarrier.
Figure 6.12 – Time spent, in 10^6 ns, on the mutex queue of Bodytrack and x264. (a) and (b) are plots of time spent (Y-axis) by the number of threads on the queue (X-axis) for x264 and Bodytrack, respectively; (c) and (d) are the total time spent for the same application set.
Figure 6.13 – Execution in seconds (10^9 ns) for multiple application sets (lower is better) (Exec = Execution).
Figure 6.14 – Execution in seconds (10^9 ns) for multiple application sets (lower is better) (Exec = Execution).
Figure 6.15 – Latency reduction in nanoseconds of the total execution time for three PARSEC applications.
Figure 7.1 – Polynomial regression and recorded data for the barrier latencies for a worker thread of Bodytrack.
Figure 7.2 – Barrier latencies from Bodytrack for (a) recorded data, (b) one-gaussian and (c) double-gaussian models.
Figure 7.3 – A zoom on some of the first 10 barriers of Bodytrack from Figure 7.2.

List of Tables

Table 1.1 – Breakdown of RCU usage by kernel subsystems [MBWW17].
Table 2.1 – Overview of some memory barriers [Int17] [Lea19] [SPA93].
Table 2.2 – Memory barriers used by some PThread implementations for mutexes.
Recommended publications
  • The Design and Implementation of Multiprocessor Support for an Industrial Operating System Kernel
Blekinge Institute of Technology Research Report 2005:06. Simon Kågström, Håkan Grahn, and Lars Lundberg, Department of Systems and Software Engineering, School of Engineering, Blekinge Institute of Technology, P.O. Box 520, SE-372 25 Ronneby, Sweden. {ska, hgr, llu}@bth.se

Abstract: The ongoing transition from uniprocessor to multiprocessor computers requires support from the operating system kernel. Although many general-purpose multiprocessor operating systems exist, there is a large number of specialized operating systems which require porting in order to work on multiprocessors. In this paper we describe the multiprocessor port of a cluster operating system kernel from a producer of industrial systems. Our initial implementation uses a giant locking scheme that serializes kernel execution. We also employed a method in which CPU-local variables are placed in a special section mapped to per-CPU physical memory pages. The giant lock and CPU-local section allowed us to implement an initial working version with only minor changes to the original code, although the giant lock and kernel-bound applications limit the performance of our multiprocessor port. Finally, we also discuss experiences from the implementation.

1 Introduction: A current trend in the computer industry is the transition from uniprocessors to various kinds of multiprocessors, also for desktop and embedded systems. Apart from traditional SMP systems, many manufacturers are now presenting chip multiprocessors or simultaneous multithreaded CPUs [9, 15, 16] which allow more efficient use of chip area.
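The abstract above names two concrete techniques: a giant lock that serializes kernel execution, and CPU-local variables kept in per-CPU storage. A minimal C sketch of both ideas, with hypothetical names (the paper maps a dedicated section to per-CPU physical pages; a simple array indexed by CPU id stands in for that here):

/* Illustrative sketch only: a giant kernel lock plus CPU-local data,
 * approximating the techniques described above. Names are hypothetical. */
#include <pthread.h>

#define MAX_CPUS 8

static pthread_mutex_t giant_lock = PTHREAD_MUTEX_INITIALIZER;

/* The paper maps a special section to per-CPU physical pages; here we
 * simply index an array by CPU id to get the same "one copy per CPU" effect. */
struct cpu_local {
    int current_pid;        /* example CPU-local state */
    unsigned long syscalls; /* per-CPU counter, no locking needed */
} cpu_local[MAX_CPUS];

void kernel_entry(int cpu_id)
{
    pthread_mutex_lock(&giant_lock);   /* serialize all kernel execution */
    cpu_local[cpu_id].syscalls++;      /* CPU-local: safe without extra locks */
}

void kernel_exit(void)
{
    pthread_mutex_unlock(&giant_lock);
}

The appeal of this combination, as the abstract notes, is that almost none of the original uniprocessor code has to change: only the entry/exit points and the placement of per-CPU state.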
  • Filesystem Performance on FreeBSD
Filesystem Performance on FreeBSD. Kris Kennaway, [email protected]. BSDCan 2006, Ottawa, May 12.

Introduction
● Filesystem performance has many aspects
● No single metric for quantifying it
● I will focus on aspects that are relevant for my workloads (concurrent package building)
● The main relevant filesystem workloads seem to be
  – Concurrent tarball extraction
  – Recursive filesystem traversals
● Aim: determine relative performance of FreeBSD 4.x, 5.x and 6.x on these workloads
  – Overall performance and SMP scaling
  – Evaluate results of multi-year kernel locking strategy as it relates to these workloads

Outline
● SMP architectural differences between 4/5/6.x
● Test methodology
● Hardware used
● Parallel tarball extraction test
  – Disk array and memory disk
● Scaling beyond 4 CPUs
● Recursive filesystem traversal test
● Conclusions & future work

SMP Architectural Overview
● FreeBSD 4.x: rudimentary SMP support
  – Giant kernel lock restricts kernel access to one process at a time
  – SPL model; interrupts may still be processed in parallel
● FreeBSD 5.x: aim towards greater scalability
  – Giant-locked to begin with; then finer-grained locking pushdown
  – FreeBSD 5.3: VM Giant-free
  – FreeBSD 5.4: network stack Giant-free (mostly)
  – Many other subsystems/drivers also locked
  – Interrupts as kernel threads; compete for common locks (if any) with everything else
● FreeBSD 6.x:
  – Consolidation; further pushdown; payoff!
  – VFS subsystem, UFS filesystem Giant-free

FreeBSD versions
● FreeBSD 4.11-STABLE (11/2005)
  – Needed for amr driver fixes after 4.11-RELEASE
● FreeBSD 5.4-STABLE (11/05)
  – No patches needed
● FreeBSD 6.0-STABLE (11/05), patches:
  – Locking reworked in amr driver by Scott Long for better performance
  – All relevant changes merged into FreeBSD 6.1
  – A kernel panic was encountered at very high I/O loads; also fixed in 6.1

Test aims and Methodology
● Want to measure
  – overall performance difference between FreeBSD branches under varying (concurrent process I/O) loads
  – scaling to multiple CPUs
● Avoid saturating hardware resources (e.g.
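The slides identify concurrent tarball extraction as the primary workload. A hypothetical harness for that kind of measurement, assuming a tarball named ports.tar and using fork, tar and wall-clock timing (not Kennaway's actual scripts):

/* Hypothetical harness for the "concurrent tarball extraction" workload:
 * fork N workers, each extracting the same tarball into its own directory,
 * and report the elapsed wall-clock time. Paths and N are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const int nproc = 4;                 /* number of concurrent extractions */
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nproc; i++) {
        if (fork() == 0) {
            char cmd[256];
            snprintf(cmd, sizeof cmd,
                     "mkdir -p out%d && tar -xf ports.tar -C out%d", i, i);
            _exit(system(cmd) == 0 ? 0 : 1);
        }
    }
    while (wait(NULL) > 0)
        ;                                /* reap all workers */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("elapsed: %.2f s\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
    return 0;
}

Repeating such a run at different process counts is one way to expose the SMP scaling differences the talk compares across FreeBSD branches.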
  • USENIX October Proof 3
Rik Farrow, musings. Rik Farrow provides UNIX and Internet security consulting and training. He is the author of UNIX System Security and System Administrator's Guide to System V, and editor of the SAGE Short Topics in System Administration series. [email protected]

For the past couple of months, I have become absorbed in operating systems. My quest began with some consulting regarding security features of the current Linux kernel, including SELinux, then plunged even deeper during the HotOS workshop. I do not pretend to be a kernel hacker, but I am very interested in what goes on with the design and implementation of operating system software.

Like some others, I had wondered what had happened to FreeBSD 4's stellar performance when FreeBSD 5 appeared. Instead of getting faster, FreeBSD was slower. If I had bothered digging deeper, I would have learned that this was the result of far-reaching changes in the FreeBSD kernel. Michael W. Lucas explains these changes in his article, which in turn is based on a talk given by Robert Watson about modifying the kernel to support SMP (symmetric multiprocessing). And perhaps by the time you read this column, FreeBSD 6 will have appeared, ready to utilize the new multiprocessor cores that are popping up. Lucas explains just why the transition from a single-threaded to a multi-threaded kernel takes so long and is so hard to do right. I first understood the importance of the Big Giant Lock when I was reviewing an early multiprocessing server that used SPARC processors.
  • How FreeBSD Boots: a Soft-Core MIPS Perspective, by Brooks Davis, Robert Norton, Jonathan Woodruff, and Robert N. M. Watson
How FreeBSD Boots: a soft-core MIPS perspective. Brooks Davis, Robert Norton, Jonathan Woodruff, Robert N. M. Watson.

Abstract: We have implemented an FPGA soft-core, multithreaded, 64-bit MIPS R4000-style CPU called BERI to support research on the hardware/software interface. We have ported FreeBSD to this platform, including support for multithreaded and soon multicore CPUs. This paper describes the process by which a BERI system boots, from CPU startup through the boot loaders, hand-off to the kernel, and enabling secondary CPU threads. Historically, the process of booting FreeBSD has been documented from a user perspective or at a fairly high level. This paper aims to improve the documentation of the low-level boot process for developers aiming to port FreeBSD to new targets.

1 Introduction: From its modest origins as a fork of 386BSD targeting Intel i386-class CPUs, FreeBSD has been ported to a range of architectures including DEC Alpha, AMD x86_64 (aka amd64), ARM, Intel IA64, MIPS, PC98, PowerPC, and Sparc64. While the x86 and Alpha are fairly homogeneous targets with mechanics for detecting and adapting to specific board and peripheral configurations, embedded-systems platforms like ARM, MIPS, and PowerPC are much more

[Figure 1: BERIpad with application launcher]

The rest of this paper narrates the boot process with a special focus on the places where customization was required for BERI. We begin by describing the BERI platform (Section 2), and then document in detail the kernel architecture-specific boot process for FreeBSD on BERI: the boot loader (Section 3) and the kernel boot process (Section 4).
  • Real-Time Audio Servers on BSD Unix Derivatives
Juha Erkkilä, Real-Time Audio Servers on BSD Unix Derivatives. Master's Thesis in Information Technology, June 17, 2005. University of Jyväskylä, Department of Mathematical Information Technology, Jyväskylä. Contact: [email protected]. Finnish title: Reaaliaikaiset äänipalvelinsovellukset BSD Unix -johdannaisjärjestelmissä. Page count: 146.

Abstract: This paper covers real-time and interprocess communication features of 4.4BSD Unix derived operating systems, and especially their applicability for real-time audio servers. The research ground of bringing real-time properties to traditional Unix operating systems (such as 4.4BSD) is covered. Included are some design ideas used in BSD variants, such as using multithreaded kernels, and schedulers that can provide real-time guarantees to processes. Factors affecting the design of real-time audio servers are considered, especially the suitability of various interprocess communication facilities as mechanisms to pass audio data between applications. To test these mechanisms on a real operating system, an audio server and a client utilizing these techniques are written and tested on an OpenBSD operating system. The performance of the audio server and OpenBSD is analyzed, with attempts to identify some bottlenecks of real-time operation in the OpenBSD system.
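One of the real-time properties the thesis discusses is a scheduler that can give latency guarantees to processes such as an audio thread. A minimal sketch, assuming POSIX SCHED_FIFO is available and the process has the privilege to request it (the priority value and function names are illustrative, not from the thesis):

/* Minimal sketch (not from the thesis): give an audio thread a real-time
 * scheduling class so the kernel can bound its wakeup latency. Requires
 * sufficient privileges; error handling is reduced to a message. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *audio_loop(void *arg)
{
    (void)arg;
    /* ... read from clients, mix, write to the audio device ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = 50 }; /* illustrative priority */

    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &sp);

    if (pthread_create(&tid, &attr, audio_loop, NULL) != 0)
        fprintf(stderr, "no real-time scheduling available\n");
    else
        pthread_join(tid, NULL);
    return 0;
}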
  • Evaluating Locking Schemes in the FreeBSD Kernel
Evaluating Locking Schemes in the FreeBSD Kernel. James Francis Toy IV; advisor: David Hemmendinger. 2010-03-21.

Abstract: This project is concerned with locking strategies in the FreeBSD Unix kernel. Locks prevent race conditions in a multi-programming environment. At present the FreeBSD kernel employs a GIANT_LOCK to prevent race conditions by locking access to the entire kernel. This method may inhibit concurrency; however, it is a simple and easy way to ensure correctness. The goal of this project is to see if fine-grained locking (FGL), an alternative to the GIANT_LOCK, promotes concurrency and throughput by locking only critical shared resources. FGL was implemented for two kernel subsystems, and we measured the time that a thread was waiting to acquire a lock; FGL showed zero wait time and the GIANT_LOCK showed non-zero wait time.

1 An Introduction to Locking: Locking and synchronization were developed to prevent processes from interfering with one another when they have access to shared resources. This is important because in a multi-threaded environment these situations often arise, since it is more efficient to execute programs in parallel. This contention for shared data is more commonly described as a "race condition", where multiple processes are racing to a shared resource. These race conditions can be resolved by mutual exclusion (ensuring only one process is using a shared resource at a time), which is the behavior provided by the locking mechanisms currently employed by the FreeBSD kernel. The purpose of this project is to analyze the current locking scheme in the FreeBSD kernel, attempt to implement a more efficient locking method, and compare the initial method versus the proposed method.
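The metric this project measures is the time a thread spends waiting to acquire a lock. One illustrative way to collect it is to wrap the lock call with two timestamps and accumulate the difference; the names below are hypothetical, not the project's instrumentation:

/* Illustrative measurement of lock wait time: timestamp before and after
 * the (possibly blocking) acquisition, then accumulate the difference. */
#include <pthread.h>
#include <time.h>

static pthread_mutex_t resource_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long total_wait_ns;     /* protected by resource_lock */

static unsigned long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long long)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

void locked_update(int *shared, int delta)
{
    unsigned long long start = now_ns();

    pthread_mutex_lock(&resource_lock);      /* may block under contention */
    total_wait_ns += now_ns() - start;       /* safe: the lock is held here */

    *shared += delta;                        /* critical section */

    pthread_mutex_unlock(&resource_lock);
}

With a single giant lock, every unrelated kernel path contends for the same resource_lock and the accumulated wait time grows; with fine-grained locks, threads that touch different subsystems never wait on each other, which is the zero-wait result reported above.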
  • DragonFly Overview, by Matthew Dillon, DragonFly BSD Project, 23 October 2003
Matthew Dillon, DragonFly BSD Project, 23 October 2003.

DragonFly Overview
● FreeBSD 4.x and 5.x directions
● Differentiating DragonFly, the basis for a project fork
  – A different, more maintainable user threading API (syscall messaging)
  – A different, more maintainable approach to MP design
  – CPU isolation by design using IPI messaging rather than by accident with mutexes
  – Light Weight Kernel Threading with fewer hacks
● Project goals
  – Maintaining stability, producing production-capable releases
  – A more consistent and more easily maintained message-based framework
  – UP, MP, SSI scalability
  – Userland VFS development
  – Machine-verified package management
● This presentation
  – Threading and messaging
  – Our approach to the Big Giant Lock problem. Why not mutexes?
  – Our approach to achieving a Single System Image (SSI)

DragonFly Threading and Messaging Model

[Diagram: Light Weight Kernel Threading and user processes. The LWKT subsystem runs fixed-priority LWKT scheduler threads on CPU #1 and CPU #2, while the userland process scheduler picks the current process per CPU from the process run queue.]

IPI Messaging
● Abstraction promotes CPU isolation
● Asynchronous IPI messaging avoids mutex ops
● Simple critical sections for local access
● Many IPI ops can be passive / contrast with RCU

[Diagram: the scheduler on CPU #1 sends an asynchronous "schedule thread2" IPI message to CPU #2; memory allocation is likewise handled by the allocator on the owning CPU.]
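The slides contrast mutexes with CPU isolation through asynchronous IPI messages: work on an object is sent to the CPU that owns it instead of locking the object from every CPU. DragonFly uses lock-free IPI queues; the portable sketch below uses a small locked mailbox purely to show the structure, and all names are hypothetical:

/* Sketch of the CPU-ownership idea: instead of locking a shared object,
 * a thread sends a message to the CPU that owns it, and the owner applies
 * the change locally. The mailbox is a stand-in for a lock-free IPI queue;
 * there is no overflow handling in this sketch. */
#include <pthread.h>

#define NCPU  4
#define QLEN 64

struct msg { void (*fn)(void *arg); void *arg; };

struct mailbox {
    pthread_mutex_t lock;        /* stand-in for a lock-free IPI queue */
    struct msg q[QLEN];
    int head, tail;
} mailbox[NCPU];

void mailbox_init(void)
{
    for (int i = 0; i < NCPU; i++)
        pthread_mutex_init(&mailbox[i].lock, NULL);
}

/* Called from any CPU: queue work for the owning CPU instead of taking
 * a lock on the object itself. */
void send_ipi_message(int target_cpu, void (*fn)(void *), void *arg)
{
    struct mailbox *mb = &mailbox[target_cpu];
    pthread_mutex_lock(&mb->lock);
    mb->q[mb->tail % QLEN] = (struct msg){ fn, arg };
    mb->tail++;
    pthread_mutex_unlock(&mb->lock);
}

/* Called only by the owning CPU's scheduler loop: the object state it
 * touches is CPU-local, so no per-object locks are needed. */
void drain_messages(int cpu)
{
    struct mailbox *mb = &mailbox[cpu];
    pthread_mutex_lock(&mb->lock);
    while (mb->head != mb->tail) {
        struct msg m = mb->q[mb->head % QLEN];
        mb->head++;
        m.fn(m.arg);             /* runs on the owner; touches only local state */
    }
    pthread_mutex_unlock(&mb->lock);
}

The design choice the slides advertise is that the cross-CPU messages can be asynchronous and often passive: the sender does not spin waiting for the owner, unlike a contended mutex.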
  • Scalability of Microkernel-Based Systems
Scalability of Microkernel-Based Systems. Dissertation by Volkmar Uhlig (from Dresden), approved by the Faculty of Informatics of the Universität Fridericiana zu Karlsruhe (TH) for the degree of Doktor der Ingenieurwissenschaften. Oral examination: 30 May 2005. First reviewer: Prof. Dr. rer. nat. Gerhard Goos, Universität Fridericiana zu Karlsruhe (TH); second reviewer: Prof. Dr. sc. tech. (ETH) Gernot Heiser, University of New South Wales, Sydney, Australia. Karlsruhe, 15 June 2005.

Abstract: Microkernel-based systems divide the operating system functionality into individual and isolated components. The system components are subject to application-class protection and isolation. This structuring method has a number of benefits, such as fault isolation between system components, safe extensibility, co-existence of different policies, and isolation between mutually distrusting components. However, such strict isolation limits the information flow between subsystems, including information that is essential for performance and scalability in multiprocessor systems. Semantically richer kernel abstractions scale at the cost of generality and minimality, two desired properties of a microkernel. I propose an architecture that allows for dynamic adjustment of scalability-relevant parameters in a general, flexible, and safe manner. I introduce isolation boundaries for microkernel resources and the system processors. The boundaries are controlled at user level. Operating system components and applications can transform their semantic information into three basic parameters relevant for scalability: the involved processors (depending on their relation and interconnect), degree of concurrency, and groups of resources. I developed a set of mechanisms that allow a kernel to: 1. efficiently track processors on a per-resource basis with support for a very large number of processors; 2.
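The first mechanism listed, tracking processors on a per-resource basis, can be pictured as each resource carrying a mask of the CPUs that have used it, so cross-processor operations touch only those CPUs. A flat-bitmask sketch under that assumption (the dissertation targets far larger machines and richer structures; all names are illustrative):

/* Sketch of per-resource processor tracking: each kernel resource remembers
 * which CPUs have touched it, so cross-processor work (e.g. a shootdown)
 * can be limited to those CPUs. Flat 64-bit mask for brevity. */
#include <stdint.h>

#define NCPU 64

struct resource {
    uint64_t cpu_mask;              /* bit i set => CPU i uses this resource */
    /* ... the resource's actual state ... */
};

static inline void resource_track(struct resource *r, int cpu)
{
    __atomic_fetch_or(&r->cpu_mask, UINT64_C(1) << cpu, __ATOMIC_RELAXED);
}

/* Invoke fn only on processors that actually used the resource. */
static inline void resource_for_each_cpu(const struct resource *r,
                                         void (*fn)(int cpu))
{
    uint64_t mask = __atomic_load_n(&r->cpu_mask, __ATOMIC_ACQUIRE);
    for (int cpu = 0; cpu < NCPU; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            fn(cpu);
}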
  • An Overview of Security in the FreeBSD Kernel, by Dr. Marshall Kirk McKusick
AsiaBSDCon 2014 Proceedings, March 13-16, 2014, Tokyo, Japan. Copyright © 2014 BSD Research. All rights reserved. Unauthorized republication is prohibited. Published in Japan, March 2014.

Index:
- P1A: Bold, fast optimizing linker for BSD (Luba Tang)
- P1B: Visualizing Unix: Graphing bhyve, ZFS and PF with Graphite (Michael Dexter)
- P2A: LLVM in the FreeBSD Toolchain (David Chisnall)
- P2B: NPF - progress and perspective (Mindaugas Rasiukevicius)
- K1: OpenZFS: a Community of Open Source ZFS Developers (Matthew Ahrens)
- K2: Bambi Meets Godzilla: They Elope (Eric Allman)
- P3A: Snapshots, Replication, and Boot-Environments: How new ZFS utilities are changing FreeBSD & PC-BSD (Kris Moore)
- P3B: Netmap as a core networking technology (Luigi Rizzo, Giuseppe Lettieri, and Michio Honda)
- P4A: ZFS for the Masses: Management Tools Provided by the PC-BSD and FreeNAS Projects (Dru Lavigne)
- P4B: OpenBGPD turns 10 years - Design, Implementation, Lessons learned (Henning Brauer)
- P5A: Introduction to FreeNAS development (John Hixson)
- P5B: VXLAN and Cloud-based networking with OpenBSD (Reyk Floeter)
- P6A: Nested Paging in bhyve (Neel Natu and Peter Grehan)
- P6B: Developing CPE Routers based on NetBSD: Fifteen Years of SEIL (Masanobu SAITOH and Hiroki SUENAGA)
- P7A: Deploying FreeBSD systems with Foreman and mfsBSD (Martin Matuška)
- P7B: Implementation and Modification for CPE Routers: Filter Rule Optimization, IPsec Interface and Ethernet Switch (Masanobu SAITOH and Hiroki SUENAGA)
- K3: Modifying the FreeBSD kernel Netflix streaming servers (Scott Long)
- K4: An Overview of Security in the FreeBSD Kernel (Dr. Marshall Kirk McKusick)
- P8A: Transparent Superpages for FreeBSD on ARM (Zbigniew Bodek)
- P8B: Carve your NetBSD (Pierre Pronchery and Guillaume Lasmayous)
- P9A: How FreeBSD Boots: a soft-core MIPS perspective (Brooks Davis, Robert Norton, Jonathan Woodruff, and Robert N. M. Watson)
  • Enhancing the Performance of OpenLDAP Directory Server with Multiple Caching
Enhancing the Performance of OpenLDAP Directory Server with Multiple Caching. Jong Hyuk Choi and Hubertus Franke, IBM Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, {jongchoi,frankeh}@us.ibm.com; Kurt Zeilenga, The OpenLDAP Foundation, [email protected]. Keywords: Directory, LDAP, Caching, Linux, Profiling.

Abstract: A directory is a specialized data store optimized for efficient information retrieval, with a standard information model, naming scheme, and access protocol for interoperability over the network. It stores critical data such as user, resource, and policy information in the enterprise computing environment. This paper presents a performance-driven design of a transactional backend of the OpenLDAP open-source directory server which provides improved reliability with high performance. Based on a detailed system-level profiling of OpenLDAP on the Linux OS, we identify major performance bottlenecks. After investigating the characteristics of each bottleneck, we propose a set of caching techniques in order to eliminate them: directory entry cache, search candidate cache, IDL (ID List) cache, IDL stack slab cache, and BER (Basic Encoding Rule) transfer cache.

This paper presents our efforts to enhance OpenLDAP [2], an open-source directory software suite. More specifically, this paper focuses on the performance and reliability of slapd, the standalone LDAP server of OpenLDAP. We participated in the development of back-bdb, a transactional backend of OpenLDAP, which directly utilizes the underlying Berkeley DB without the general backend API layer used in the existing OpenLDAP backend, back-ldbm. Back-bdb provides a higher level of reliability and concurrency. The transaction support of back-bdb makes the directory recoverable from temporary failures and disasters, while page-level locking enables concurrent access.
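Of the caches listed, the directory entry cache is the simplest to picture: recently used entries are kept in memory keyed by entry ID so repeated searches avoid the backing store. A generic direct-mapped sketch of that idea, not OpenLDAP's actual data structure:

/* Rough sketch of a directory entry cache: recently used entries kept in
 * memory, keyed by entry ID, so repeated lookups skip the backing store.
 * Direct-mapped for brevity; a real server would use LRU and locking. */
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 1024

struct entry_cache_slot {
    unsigned long id;        /* entry ID (0 = empty slot) */
    char *dn;                /* cached distinguished name */
    char *attrs;             /* cached attribute blob, format unspecified */
};

static struct entry_cache_slot cache[CACHE_SLOTS];

const char *entry_cache_lookup(unsigned long id)
{
    struct entry_cache_slot *s = &cache[id % CACHE_SLOTS];
    return (s->id == id) ? s->attrs : NULL;     /* NULL => fetch from DB */
}

void entry_cache_insert(unsigned long id, const char *dn, const char *attrs)
{
    struct entry_cache_slot *s = &cache[id % CACHE_SLOTS];
    free(s->dn);
    free(s->attrs);                             /* evict previous occupant */
    s->id = id;
    s->dn = strdup(dn);
    s->attrs = strdup(attrs);
}

The other caches in the paper (candidate, IDL, slab, BER) follow the same pattern at different layers of the search path: keep the expensive-to-recompute intermediate result close to where it is consumed.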
  • A Brief History of the BSD Fast File System, by Marshall Kirk McKusick
Marshall Kirk McKusick: a brief history of the BSD Fast File System. Dr. Marshall Kirk McKusick writes books and articles, teaches classes on UNIX- and BSD-related subjects, and provides expert-witness testimony on software patent, trade secret, and copyright issues, particularly those related to operating systems and file systems. While at the University of California at Berkeley, he implemented the 4.2BSD Fast File System and was the Research Computer Scientist at the Berkeley Computer Systems Research Group (CSRG), overseeing the development and release of 4.3BSD and 4.4BSD. [email protected]

I first started working on the UNIX file system with Bill Joy in the late 1970s. I wrote the Fast File System, now called UFS, in the early 1980s. In this article, I have written a survey of the work that I and others have done to improve the BSD file systems. Much of this research has been incorporated into other file systems.

1979: Early Filesystem Work. The first work on the UNIX file system at Berkeley attempted to improve both the reliability and the throughput of the file system. The developers improved reliability by staging modifications to critical filesystem information so that the modifications could be either completed or repaired cleanly by a program after a crash [14]. Doubling the block size of the file system improved the performance of the 4.0BSD file system by a factor of more than 2 when compared with the 3BSD file system.
  • SMP Implementation for OpenBSD/sgi, by Takuya ASADA
SMP Implementation for OpenBSD/sgi. Takuya ASADA, [email protected].

Abstract: We started to implement SMP support for the OpenBSD/sgi port and selected the SGI Octane as the first target machine. We are now at the point where we have almost finished, except for the implementation of some non-critical features. SMP support is now operational on SGI Octane systems and the code has been merged into the OpenBSD tree.

1 Introduction: A year ago, I was working to add SMP and 64-bit support to a BSD-based embedded operating system. The target device was based on the MIPS64 architecture. At that time only FreeBSD had started to support SMP on MIPS; however, only some of their code had been merged into the public repository. As a result, I tried to implement SMP support in a step-by-step manner, referring to the SMP implementations on other architectures. [...] now have SMP support on the i386, amd64, mvme88k, sparc64 and macppc platforms. Recently, sgi was added to the list of officially supported architectures but without SMP support, despite the SGI hardware range including multiprocessor servers and workstations. We started to implement SMP support for OpenBSD/sgi and selected an SGI Octane as the first target machine.

2.2 Current status: OpenBSD SMP now works correctly; however, performance is still limited. The stage of our implementation is similar to the early versions of FreeBSD and NetBSD. A significant amount of work still needs to be done in order to improve performance and scalability.

2.2.1 Giant lock
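The giant-lock stage named at the end of the excerpt is typically a single reentrant lock taken on kernel entry, so that a CPU already inside the kernel (for example, taking an interrupt during a syscall) does not deadlock against itself. A generic user-space sketch of such a recursive big lock, not OpenBSD's __mp_lock implementation:

/* Sketch of a recursive "giant lock": one lock serializes the kernel, but
 * re-acquisition by the current holder only bumps a nesting count. */
#include <pthread.h>

struct biglock {
    pthread_mutex_t mtx;
    pthread_t owner;            /* valid only while depth > 0 */
    int depth;                  /* recursion count for the owning thread/CPU */
};

static struct biglock giant = { .mtx = PTHREAD_MUTEX_INITIALIZER };

void biglock_enter(void)
{
    if (giant.depth > 0 && pthread_equal(giant.owner, pthread_self())) {
        giant.depth++;          /* already ours: just nest */
        return;
    }
    pthread_mutex_lock(&giant.mtx);
    giant.owner = pthread_self();
    giant.depth = 1;
}

void biglock_leave(void)
{
    if (--giant.depth == 0)
        pthread_mutex_unlock(&giant.mtx);
}

As with the FreeBSD and NetBSD histories cited above, this stage trades scalability for a small, correct first port; finer-grained locking comes later.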