
On the Design of Reliable and Scalable Networked Systems

Ph.D. Thesis

Tomas Hruby
VU University Amsterdam, 2014

This work was funded by the European Research Council under ERC Advanced Grant 227874.

Copyright © 2014 by Tomas Hruby.

ISBN 978-90-5383-072-7

Printed by Wöhrmann Print Service.

VRIJE UNIVERSITEIT

ON THE DESIGN OF RELIABLE AND SCALABLE NETWORKED SYSTEMS

ACADEMISCH PROEFSCHRIFT

ter verkrijging van de graad Doctor aan de Vrije Universiteit Amsterdam, op gezag van de rector magnificus prof.dr. F.A. van der Duyn Schouten, in het openbaar te verdedigen ten overstaan van de promotiecommissie van de Faculteit der Exacte Wetenschappen op donderdag 10 april 2014 om 13.45 uur in de aula van de universiteit, De Boelelaan 1105

door

TOMAS HRUBY

geboren te Stod, Tsjechische Republiek

promotors: prof.dr. Andrew S. Tanenbaum
           prof.dr. Herbert Bos

Acknowledgements

Tomas Hruby Amsterdam, The Netherlands, December 2013

Contents

Acknowledgements v

Contents vii

List of Figures xi

List of Tables xiii

Publications xv

1 Introduction 1 1.1 Overview of Operating System Architectures ...... 4 1.1.1 Monolithic Kernels ...... 4 1.1.2 Microkernels ...... 5 1.2 New Multicore Hardware ...... 7 1.3 Network Stack ...... 9 1.4 NEWTOS ...... 11 1.5 Contributions ...... 12 1.6 Organization of the Dissertation ...... 12

2 Keep Net Working On a Dependable and Fast Networking Stack 15 2.1 Introduction ...... 16 2.2 Reliability, performance and multicore ...... 17 2.3 Rethinking the Internals ...... 19 2.3.1 IPC: What’s the kernel got to do with it? ...... 19 2.3.2 Asynchrony for Performance and Reliability ...... 20


2.4 Fast-Path Channels ...... 21 2.4.1 Trustworthy Shared Memory Channels ...... 22 2.4.2 Monitoring Queues ...... 23 2.4.3 Channel Management ...... 24 2.4.4 Channels and Restarting Servers ...... 24 2.5 Dependable Networking Stack ...... 25 2.5.1 The Internals ...... 27 2.5.2 Combining Kernel IPC and Channels IPC ...... 28 2.5.3 Zero Copy ...... 29 2.5.4 Crash Recovery ...... 30 2.6 Evaluation ...... 32 2.6.1 TCP Performance ...... 32 2.6.2 Fault Injection and Crash Recovery ...... 33 2.6.3 High Bitrate Crashes ...... 35 2.7 Related work ...... 36 2.8 Conclusion and Future Work ...... 37

3 When Slower is Faster On Heterogeneous Multicores for Reliable Systems 39 3.1 Introduction ...... 40 3.2 Big cores, little cores and combinations ...... 42 3.2.1 BOCs and SICs ...... 42 3.2.2 Configurations of BOCs and SICs ...... 43 3.3 NEWTOS ...... 44 3.3.1 The NEWTOS network stack ...... 45 3.3.2 Dynamic reconfiguration ...... 46 3.3.3 Non-overlapping ISA ...... 47 3.4 Evaluation on a high-performance stack ...... 48 3.4.1 Test configurations ...... 49 3.4.2 Methodology of measurements ...... 50 3.4.3 Frequency scaling 2267–1600 MHz ...... 50 3.4.4 Throttling below 1600 MHz ...... 51 3.4.5 Hyper-threading ...... 53 3.4.6 Why is slower faster? ...... 55 3.4.7 Stack on a single core ...... 57 3.5 Related work ...... 57 3.6 Conclusions ...... 59

4 Scheduling of Multiserver System Components on Over-provisioned Multicore Systems 61 4.1 Introduction ...... 62 4.2 NEWTOS ...... 63 4.3 Network Traffic Indicators ...... 66

4.4 Profiles ...... 67 4.5 Profile Based Scheduling ...... 70 4.6 Conclusions and Future Work ...... 71

5 On Sockets and System Calls Minimizing Context Switches for the Socket API 73 5.1 Introduction ...... 74 5.2 Motivation and Related Work ...... 76 5.3 Sockets without System Calls ...... 79 5.4 Socket Buffers ...... 81 5.4.1 Exposing the Buffers ...... 81 5.4.2 Socket Buffers in Practice ...... 82 5.5 Calls and Controls ...... 83 5.5.1 Notifying the System ...... 83 5.5.2 Notifying the Application ...... 84 5.5.3 Socket Management Calls ...... 85 5.5.4 Select in ...... 86 5.5.5 Lazy Closing and Fast Accepts ...... 86 5.6 Fast Sockets in NEWTOS...... 87 5.6.1 Polling within the Network Stack ...... 88 5.6.2 Reducing TX Copy Overhead ...... 89 5.7 Implications for Reliability ...... 90 5.8 Evaluation ...... 90 5.8.1 ...... 91 5.8.2 memcached ...... 94 5.8.3 Scalability ...... 95 5.9 Conclusions ...... 97

6 NEAT Designing Scalable and Reliable Networked Systems 99 6.1 Introduction ...... 100 6.2 Background and Related Work ...... 102 6.2.1 Scalability of Implementation ...... 102 6.2.2 Reliability of Implementation ...... 103 6.2.3 Scalability and Reliability of Design ...... 104 6.3 A Scalable and Reliable Network Stack ...... 105 6.3.1 Overview ...... 106 6.3.2 Sockets ...... 108 6.3.3 TCP Connections ...... 110 6.3.4 Scaling Up and Down ...... 111 6.3.5 Scaling NIC Drivers ...... 112 6.3.6 Reliability ...... 113 6.3.7 Security ...... 113 x CONTENTS

6.3.8 Multi-component Network Stack ...... 114 6.4 Architectural Support ...... 114 6.5 Evaluation ...... 116 6.5.1 Scalability on a 12-core AMD ...... 116 6.5.2 Scalability on a 8-core Xeon ...... 118 6.5.3 Impact of Different Configurations ...... 121 6.5.4 Reliability ...... 122 6.6 Conclusion ...... 123

7 Summary and Conclusions 125

References 131

Summary 145

Samenvatting 147

List of Figures

2.1 Conceptual overview of NEWTOS. Each box represents a core. Appli- cation (APP) cores are timeshared...... 18 2.2 Decomposition and isolation in multiserver systems ...... 26 2.3 Asynchrony in the network stack ...... 28 2.4 IP crash ...... 34 2.5 Packet filter crash ...... 35

3.1 Design of the NEWTOS network stack ...... 44 3.2 Test configurations - large squares represent the quad-core chips, small squares individual cores and ‘&’ separates processes running on differ- ent hyper-treads...... 49 3.3 Throughput compared to the combined performance of the resources in use — the best combinations ...... 52 3.4 Configuration #1 – CPU utilization of each core throttled to % of 1600 MHz. 53 3.5 Configuration #1 – CPU utilization normalized to cores at 1600 MHz unthrottled...... 53 3.6 Configuration #2 (HT) – CPU utilization of each core throttled to % of 1600 MHz...... 54 3.7 Configuration #2 (HT) – CPU utilization of each core normalized to 1600 MHz...... 54 3.8 Comparison of configuration #1 (no HT) and #2 (HT). Lines represent bitrate, bars represent CPU utilization normalized to 3 cores at 1600 MHz 55 3.9 Sleep overhead in different situations. The thick line represents transi- tion between execution in user space and kernel in time. Arrows mark job arrivals, loops denote execution of the main loop and spirals denote polling...... 56


4.1 NEWTOS network stack and inputs of the scheduler ...... 64 4.2 NEWTOS vs performance. Continuously requesting 10 times a 10 B file using various numbers of simultaneous persistent connections. 65 4.3 Results for tests requesting continuously a file of a size between 10 bytes and 10 Megabytes using 8 simultaneous connections. The dashed line in (a) and (g) connects points where all three components run with the same frequency. The 10 MB case shows bitrate instead of request rate. Legend is only in (f)...... 68

5.1 Net stack configurations: (a) Monolithic systems (Linux, BSD, Windows, etc.) (b) FlexSC / IsoStack (c) Multiserver system (MINIX 3, QNX)...... 77 5.2 Exposed socket buffers – no need to cross the protection boundary (dashed) to access the socket buffers...... 81 5.3 A socket buffer – ring queue and data buffer. White and shaded present free and used / allocated space...... 82 5.4 Architecture of the network stack ...... 87 5.5 1 request for a 20 B file per connection ...... 92 5.6 10 requests-responses per connection, 20 B file ...... 93 5.7 1 request and an 11 kB response per connection ...... 94 5.8 10 requests for an 11 kB file ...... 95 5.9 1 request for a 512 kB file per connection ...... 97 5.10 Memcached - test duration for different size of stored values. Test uses 128 clients. Smaller is better...... 97

6.1 NEAT: a 4-replica example...... 106 6.2 A dedicated path for packets of each connection...... 107 6.3 Interrupts and system load distribution of a highly loaded memcached server—a hypothetical example...... 109 6.4 A replica of the multi-component network stack ...... 113 6.5 12-core AMD - The best-performing setups using all 12 cores . . . . . 117 6.6 AMD - Scaling lighttpd and the network stack ...... 117 6.7 NEAT scales the required number of cores ...... 118 6.8 Colocated multi-component configuration of NEAT...... 119 6.9 Xeon - Scaling the multi-component network stack ...... 119 6.10 Xeon - The best-performing configuration of the single-component stack, fully exploiting hyper-threading (HT)...... 120 6.11 Xeon - Scaling the single-component network stack ...... 120 6.12 12-core AMD - Comparing performance of different network stack con- figurations stressed by the same workload...... 121 6.13 Expected fraction of state preserved after a failure vs. max throughput across different network stack setups...... 122 List of Tables

2.1 Complexity of recovering a component ...... 27 2.2 Peak performance of outgoing TCP in various setups ...... 32 2.3 Distributions of crashes in the stack ...... 33 2.4 Consequences of a crash ...... 33

3.1 Comparison of Core i7, Knights Ferry and Xeon Phi ...... 42 3.2 Given the known transistor counts shown in (a) and a 22nm production process, we can roughly project the options for different configurations that merge Core i7 and Xeon Phi cores (b) ...... 43 3.3 Performance loss versus potentially saved power ...... 51 3.4 Bitrate vs. CPU clock speed on a single core ...... 57

4.1 Network Traffic Indicators ...... 66

5.1 Cycles to complete a nonblocking recvfrom() ...... 79 5.2 Pipelined 10× 11 kB test – percentage of calls handled in user space, fraction of selects that block, ratio of selects to all calls, accepts and reads that would block ...... 96

6.1 lighttpd I-cache miss rate on Linux and on a dedicated core in NEAT, with or without using system calls...... 110 6.2 10G driver - CPU usage breakdown...... 122


Publications

This dissertation includes a number of research papers, as appeared in the following conference proceedings (or as recently submitted for review):

Tomas Hruby, Dirk Vogt, Herbert Bos and Andrew S. Tanenbaum. Keep Net Working - On a Dependable and Fast Networking Stack. In Proceedings of Dependable Systems and Networks (DSN 2012), June 2012, Boston, MA, USA

Tomas Hruby, Herbert Bos and Andrew S. Tanenbaum. When Slower is Faster: On Heterogeneous Multicores for Reliable Systems. In Proceedings of the USENIX Annual Technical Conference, June 2013, San Jose, CA, USA

Tomas Hruby, Herbert Bos and Andrew S. Tanenbaum. Scheduling of Multiserver System Components on Over-provisioned Multicore Systems. Presented at the Workshop on Systems for Future Multicore Architectures (SFMA 2014), April 2014, Amsterdam, The Netherlands

Tomas Hruby, Teodor Crivat, Herbert Bos and Andrew S. Tanenbaum. On Sockets and System Calls: Minimizing Context Switches for the Socket API. In Proceedings of 2014 Conference on Timely Results in Operating Systems (TRIOS 14), October, 2014, Broomfield, CO, USA

Tomas Hruby, Herbert Bos and Andrew S. Tanenbaum. NEaT: Designing Scalable and Reliable Networked Systems. Currently under submission


1 Introduction

Since the inception of operating systems, the lives of application programmers are much easier. Operating systems provide a simpler and machine-independent interface to the applications than the raw hardware and make the computer much easier to manage. An operating system provides an interface to the hardware for the applications it hosts, but it not only makes it convenient to share the machine, it also isolates individual applications from each other and provides many services. It multiplexes the resources among the applications, which would otherwise require exclusive access, and the applications can rely on the operating system carrying out these tasks reliably and with a good quality of service. Ordinary users often equate the operating system with a set of applications and the user interface presented by the applications running on top of the operating system. Also the application programming interface (API) of the operating systems is very often similar. However, the internal implementation of the systems leads to very significant differences between individual breeds of operating systems.

For the past several decades, the internal implementation of operating systems has evolved in response to the progress in the development of the hardware, changes in the requirements of the application programmers and maintainers of the computers, and the growing penetration of computers into every activity of human daily life. Although the operating systems evolved to keep up with the changing demands, the key concepts and certain implementation decisions adopted at each system's inception did not change dramatically for a long time and allow us to classify the operating systems based on the design decisions made long ago. These decisions were always a compromise between generality, simplicity and performance, and often subject to the capabilities of the hardware.

The most prevalent implementation choice became an operating system based on a monolithic kernel, which runs all operating system services in the privileged processor mode. It is adopted by virtually all general purpose commodity operating systems like Linux, Windows or the many variants of BSD, since at the time these systems were born, an operating system was a relatively small piece of software, working from time to time on behalf of the applications.

The single core chips and limited memory mainly asked for a resource multiplexor. However, the evolution and changing demands forced the systems to absorb the wider role of a provider of many crucial services, which made the original systems' kernels grow rapidly. The monolithic approach allowed for a solution which worked and performed well, while the advances in implementation scaled to fulfill the demands of the new computerized and interconnected era. The commodity operating systems try to provide the best to all or at least the majority of applications and use cases. We use them daily on our desktops, phones, tablets or in data centers around the globe. Although there are alternatives developed for specific use cases, they are not very popular (except for embedded systems) since the commodity systems are well known, simple to use, host a huge number of applications and have massive support from hardware vendors. However, most importantly, they have performance which is not matched by other systems' architectures, and performance is the primary criterion for many customers.

At the same time, other operating systems were designed with specific interesting properties in mind, for instance reliability, security or real time applications. We focus on multiserver operating systems, which run all system services as unprivileged isolated processes on top of a microkernel. These systems have been designed with modularity, reliability and dependability in mind rather than focusing on performance. Since performance is praised by many more than reliability, this kind of system was always criticized [90] and never became a widely adopted option outside specialized areas such as embedded systems. Nevertheless, the concept of microkernels persisted and fast microkernels [71; 57; 85; 29; 31] are used in many applications; in fact, they are enjoying a renaissance because of the popularity of multicores and virtualization.

However, the performance of a microkernel is only part of the performance problem of the entire multiserver system. The mostly synchronous communication, constant switching of processes, sharing of a single processing unit and the number of processes involved in serving each application request are just some of the more important issues. The performance of the available implementations of multiserver systems was ridiculed so badly that multiserver systems are generally considered to have lost the battle. Monolithic systems rule, except in some specialized areas.

It is true that at the time multiserver systems were first created, the available hardware did not permit a well performing implementation. Therefore these systems remained niche, primarily used only in extremely demanding and constrained scenarios like controlling spacecraft, in mission critical applications like aircraft, missiles or power plants. Only in such cases do their superior reliability, formal verifiability, possibility of live updates without any downtime, modularity and other features make them more valuable than high performing commodity systems, which lack adequate design abstractions to address these concerns and, instead, resort to scalability of implementation. This approach typically results in hard-to-maintain synchronization code.
At the same time, extensive community review and testing [90] and complex error handling code are necessary.

Although building reliable systems is a daunting task, reliability must not be restricted to critical applications only. Even devices of daily use like personal computers, smart phones or tablets should really work at all times. Even though their failures rarely lead to disasters and casualties, they cause annoying user experiences like missing an important phone call or spending an overseas flight without a working entertainment system. Unexpected crashes and reboots interrupt the usage of the device and may cause a loss of data and results of recent work. For instance, a crash during a long computation may not give a chance to save the intermediate results, forcing the user to start the entire computation from scratch after a reboot. In contrast to data centers, which also use commodity operating systems to deliver reliable services, personal devices cannot easily take advantage of replication and redundancy, so that a fault in an instance of the operating system causes failure of the entire device. On the other hand, data centers have to overprovision not only for hardware failures, but for software failures as well. This costs money, which reliable systems can help to save.

Operating systems that are reliable by design can become commodity only if they overcome their performance issues and begin to scale and perform well. In this thesis we show that it is possible, by proper design, to make a system reliable, well performing and scalable.

In this thesis we show how to modify the design to harness multicore processors, which we believe change the ratio of the performance–reliability trade-off which has been in favor of the monolithic systems for many years. Multicore processors enable us to significantly reduce the overheads of the multiserver design so that we can dramatically boost the performance while not compromising on reliability. In fact, we can even improve it. We use the rich expertise of the MINIX group at the Vrije Universiteit in Amsterdam, acquired during many years of development of multiserver systems, namely of MINIX 3, for high reliability. We take the reliability as granted by the previous work and we enhance the design for increased and competitive scalability and performance. Christoph Lameter's presentation at the Ottawa Linux Symposium [90] in 2007 gave "...an overview of why Linux scales and shows these hurdles microkernels would have to overcome in order to do the same." We took the extra steps to overcome the hurdles and, in combination with the new multicore processors, we show that multiserver systems can (i) intelligently partition the state of the services to minimize the impact of failures and (ii) scale comparably to monolithic systems. Focusing on the networking subsystem, we demonstrate that it is possible to achieve both goals without exposing the implementation to common pitfalls such as synchronization errors, poor data locality, and false sharing.

1.1 Overview of Operating System Architectures

To understand the advantages and limitations of different operating system architectures, we first present a brief overview of the two main competing architectures of operating systems, namely monolithic kernels and microkernels.

1.1.1 Monolithic Kernels

A monolithic operating system is the most straightforward way to build a privileged layer between the applications and the hardware. The application can request a service from the operating system almost in the same way as making a procedure call to a library. In fact, even today the execution often continues uninterrupted in the same thread of execution, using the same stack to store data, and returns back to the application unless it is not possible to satisfy the request immediately. In such a case, the kernel suspends execution on behalf of the application and switches execution to another ready application. In a nonpreemptive kernel running on a single processor, there was little need for synchronization besides disabling interrupts in certain sensitive cases. The kernel was simple. Even in the case of an interrupt, the processing used to continue in the context of the interrupted application. However, as it caused high jitter, nowadays kernels postpone interrupt processing until a more convenient moment, which leads to specialized kernel threads of execution.

With the increasing availability of multiprocessors and low latency requirements, developers kept adding threads to achieve parallelism and preemption within the kernel. Many of the threads now run without direct relation to a specific process. For instance, they execute tasks periodically or asynchronously to the applications or device events. This significantly increased the complexity of the kernel. On one hand, various threads of execution could run in parallel (virtually or actually); on the other hand, developers had to deal with safeguarding accesses to unprotected shared data, which created the problem of retrofitting the required synchronization into the existing code bases of the quickly growing operating systems. Simple solutions—for example the big kernel lock of Linux—worked, but did not scale. However, replacing it with finer grained and scalable solutions proved to be an extremely complex and error prone task. Although the advances in implementation of the synchronization, annotation of read-mostly or CPU-local data, etc. allowed monolithic systems to scale, the shared-everything model requires the developers to be extremely cautious and demands deep knowledge of subtle details of the systems to produce deadlock free code.

Monolithic systems contain drivers: code plugged into the kernel to control peripheral devices. The plethora of device types and different vendors requires specific code for many of the devices. These drivers are usually supplied directly by the manufacturers or enthusiasts who just need their specific model to work. Although the core developers of the open source systems try hard to refactor, unify and reuse lots of code, the developers of the drivers are not bound by anything but the ever-changing internal APIs. Therefore everyone is free to introduce their own new threads to carry out tasks so far not anticipated by the unified code, as well as new locks to protect data structures not previously shared. In fact, the drivers can do virtually anything. Even though the drivers are often less safe than the rest of the system's kernel [41], they are not isolated from the rest of the kernel whatsoever and any of their failures easily propagates to the rest of the system, causing a fatal crash.

One of the most widely used monolithic kernels is Linux.
In contrast to proprietary systems, which are distributed without their source code, development of Linux is open to the world with a vast community of pure enthusiasts as well as employees of large companies. Due to its open source nature, everybody can study the system, and its use is hugely popular due to its royalty-free license. Therefore we can find Linux in practically every domain of computing, ranging from home routers, smart phones and PCs to servers, data centers and super computers. Although there are other open source operating systems such as FreeBSD, none match Linux in popularity and deployment. This makes Linux the de facto standard system for comparison. At the same time, evolution and implementation of Linux and other open source monolithic systems is driven by conflicting needs of different users. What suits one need not suit others, and therefore the implementation is necessarily a compromise. While the amount of human effort, the rich set of features and, of course, the money invested makes it hard to compare apples with apples in the case of research in operating systems, we nevertheless use Linux to show that performance of our research system is in the same league as the state-of-the-art commodity systems.

1.1.2 Microkernels

In contrast to monolithic kernels, a microkernel contains only the essential mechanisms that isolate processes, switch between their execution, manage low-level resources and grant access to peripherals, while user-space processes implement all policies. Unlike in a monolithic system, applications do not request the system service by switching to the privileged mode where execution continues in the system code. Instead, the applications must ask other user-space programs to provide the service. All processes communicate with others by passing messages, which the kernel dispatches, while it suspends the sending process until a reply with results is ready.

Using a microkernel, the systems can have several forms. For instance, L4Linux uses a single system server, a paravirtualized Linux kernel, for implementing all of the operating system services. All applications, even though running natively on top of the L4 microkernel, pass their system calls to the system server, much like if they were running on regular Linux. While the single system process has many of the disadvantages of the original Linux, the key difference is that the system server can use native L4 drivers, processes isolated from the system server, to access devices. Therefore a fault in any driver, possibly provided by an untrusted third party, does not immediately cause a fault in the main system server as well.
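To make the message-passing model above concrete, the following sketch shows the shape of a synchronous request from an application stub to a user-space file server. It is an illustration only, not the MINIX 3 or L4 interface: the message layout, the ipc_call primitive and the FS_SERVER endpoint are hypothetical names, and the kernel side is stubbed out so the example is self-contained.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-size IPC message, in the style of many microkernels. */
struct ipc_msg {
    uint32_t type;     /* requested operation, e.g. FS_READ          */
    uint32_t fd;       /* file descriptor known to the file server   */
    uint64_t offset;   /* file offset to read from                   */
    uint32_t len;      /* number of bytes requested                  */
    int32_t  result;   /* filled in by the server on reply           */
};

enum { FS_SERVER = 2 };   /* endpoint of the file server (made up)   */
enum { FS_READ   = 1 };

/* In a real multiserver system this would trap to the microkernel, which
 * delivers the message to the server, blocks the caller and copies the
 * reply back. Stubbed here so the sketch compiles and runs on its own.  */
static int ipc_call(int endpoint, struct ipc_msg *msg)
{
    (void)endpoint;
    msg->result = (int32_t)msg->len;   /* pretend the server read everything */
    return 0;
}

/* What a read() library stub might look like: the "system call" is
 * really a synchronous message exchange with a user-space server.       */
static long fs_read(int fd, uint64_t offset, uint32_t len)
{
    struct ipc_msg m;
    memset(&m, 0, sizeof(m));
    m.type   = FS_READ;
    m.fd     = fd;
    m.offset = offset;
    m.len    = len;

    if (ipc_call(FS_SERVER, &m) != 0)
        return -1;
    return m.result;
}

int main(void)
{
    printf("fs_read returned %ld\n", fs_read(3, 0, 4096));
    return 0;
}
```

The point of the sketch is the control flow: the caller blocks on the reply, so every request serializes the application with the server, which is exactly the overhead the rest of this thesis works to remove.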
Although the design is clear, reliable, modular and versatile—Linus Tor- valds himself stated in the famous Torvalds–Tanenbaum debate that ... the design is superior from a theoretical and aesthetical point of view—it contains performance issues that were considered inherent. The fact that handling of each application request or external event can involve multiple processes, each running for a brief moment only, causes high overheads for managing execution of the processes. Not only does the microkernel need to process many messages, schedule the processes and keep saving and restoring their contexts, but the hardware itself is underutilized by the frequently switching streams of instructions which flush the CPU internal structures as each process needs different data in caches, newly populated TLB and reestimated branch predictors. For these reasons, the single core processors of the past were a far better match for monolithic systems when it came to performance. Although the switching between applications and the privileged kernel has similar hardware issues, the switching is orders of magnitude less frequent and hence the performance hit is smaller. Nevertheless, recent studies show [139; 69] that even in the monolithic systems the penalty is significant. 1.2. NEW MULTICORE HARDWARE 7

In fact, in the past, to avoid the cost of switching between the monolithic kernel and user space, some web servers [95] were placed directly in the Linux kernel to avoid the cost of mode switching. These days, developers of D-Bus [5] are ponder- Chapter 1 ing moving the entire interprocess communication mechanism into the kernel (kd- bus [8]). Similarly, to minimize the “inherent” overheads, developers of microkernel based systems take shortcuts and diverge from the clean model to a hybrid design by placing more than the necessary minimum into the kernel (for example the XNU kernel of Mac OS X), while only retaining the option to run the device drivers in user space, as they are often the biggest source of bugs, possibly compromising the rest of the operating system. Not only was the hardware initially more suitable for building monolithic sys- tems than other system architectures, the hardware evolution was also partly driven by the most widely used type of systems and software in general. Specifically, the hardware vendors implemented features to improve the performance of the mono- lithic systems. For instance, the SYSCALL (AMD) and SYSENTER (Intel) instructions make the transfer between execution in user space and in the kernel faster. However, the hardware did not improve much the support for messaging. On the other hand, we must highlight the good properties of the microkernel based operating systems, namely their reliability and dependability. Jonathan Shapiro stated in 2006 in a rebuttal [136] that Ultimately, there are two com- pelling reasons to consider microkernels in high-robustness or high-security envi- ronments: (i) There are several examples of microkernel-based systems that have succeeded in these applications because of the system structuring that microkernel- based designs demand. (ii) There are zero examples of high-robustness or high- security monolithic systems. Multiserver systems do not only contain faults in iso- lated components. It is also possible to update different parts of the system while it is running without any downtime, allowing live updates before a known bug causes a fault or before someone exploits a programming error. Although it is possible to patch a running Linux kernel [28], these patches are largely limited to critical secu- rity issues and the scope of the fixes is narrow as it is complicated to bring the share- everything running kernel into a quiescent state. In contrast, due to the isolation of the multiserver system components, it is possible to “pause” a system server, transfer its state to its new incarnation, unplug the old one and carry on with the new version. It is our goal to make these properties part of commodity operating systems too.

1.2 New Multicore Hardware

Hardware has changed dramatically in the past decade. Instead of a single execution unit with ever increasing clock speed, the frequencies of the modern chips have leveled off. Instead, they have multiple cores on a single die and each core can have multiple execution threads. For instance the SPARC T5 (announced in 2012) features 16 cores. Each core has 8 threads, out of which 2 can execute simultaneously.

The recently unveiled SPARCs (as of the time of writing, in the fall of 2014) double the number of cores. This makes for a total of an unprecedented 256 hardware threads per chip! Likewise, the newly announced Intel Xeon Haswell-based chips have up to 18 cores with 2-way hyperthreading, and the Xeon Phi coprocessor offers up to 61 cores, while the next generation Knights Landing is supposed to have up to 72 cores with 4 threads each, available in a coprocessor as well as a socket version. This is an enormous increase in parallelism that is hard to tame outside the world of supercomputing and data centers.

The question we investigate in this thesis is how much the new multicore processors with their available parallelism change the game in the arena of operating systems in favor of the multiserver systems. In contrast to the old, single core processors, having more parallel execution units on the same chip lets the system services run side by side without the need for switching between them. We investigate the idea that, having a vast number of hardware threads, each performance sensitive system process of a multiserver system can possibly use its own core or thread. This way, the processes can execute anytime they need. The scheduler does not need to guess which of the processes is the best to run next and which one to preempt to run another one.

A multiserver system resembles a true distributed system. However, the advantage of running the system within a single machine on a single chip is reliable and trustworthy communication. The interprocess messages cannot get lost or spoofed in the Internet. On the other hand, a multiserver system cannot take full advantage of the new hardware out of the box. The cost of cross-core kernel-based interprocess communication is higher than on a single core chip and thus it still remains a significant bottleneck. On a single core chip, the processes can exploit the causality of synchronous messaging. In case the receiver is able to receive, the microkernel can use the current location of the message, for example the machine registers, pass it to the receiver in-place and schedule it to run immediately, as done by the family of L4 microkernels. In contrast, synchronous cross-core messaging requires cross-core notification, since the microkernel running on the "receiving" core might be sleeping or executing an application rather than actively monitoring whether it should receive a message from another random core. The available notification mechanism is interprocessor interrupts (IPIs), which the sending core needs to set up in an interrupt controller, while the receiving core must handle it as a generic interruption. Therefore the receiver is not only returning from the privileged mode to user space, but the core must wake up first and handle the exception. This is costly.

In this work we show that it is possible to radically reduce the cost by excluding the kernel from the fast path and moving the communication entirely into user space. However, we can do that efficiently only if the communicating processes run in different hardware contexts. This allows the system to stream messages throughout the system. We demonstrate that with some help from the hardware, we can make the user-space communication efficient, and we also stress that there is more the designers of the future hardware can do to help us to further improve the communication.
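The user-space, cross-core channels argued for above boil down to shared-memory queues that the communicating servers poll or are woken up for. The fragment below is a minimal sketch of such a single-producer, single-consumer ring buffer. It illustrates the general technique only; the structure layout and function names are invented for the example and are not the actual NEWTOS channel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 256            /* power of two so index masking works */

/* One cache line per index avoids false sharing between the two cores. */
struct channel {
    _Alignas(64) _Atomic uint32_t head;       /* written by the producer */
    _Alignas(64) _Atomic uint32_t tail;       /* written by the consumer */
    _Alignas(64) uint64_t slots[RING_SLOTS];  /* e.g. pointers to packet buffers */
};

/* Producer side: runs on the sender's core, no locks, no kernel call. */
static bool channel_send(struct channel *ch, uint64_t msg)
{
    uint32_t head = atomic_load_explicit(&ch->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&ch->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)            /* queue full, caller may retry */
        return false;

    ch->slots[head & (RING_SLOTS - 1)] = msg;
    /* Publish the slot before moving the head so the consumer sees it. */
    atomic_store_explicit(&ch->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: runs on the receiver's core, typically in its main loop. */
static bool channel_recv(struct channel *ch, uint64_t *msg)
{
    uint32_t tail = atomic_load_explicit(&ch->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&ch->head, memory_order_acquire);

    if (head == tail)                         /* nothing pending */
        return false;

    *msg = ch->slots[tail & (RING_SLOTS - 1)];
    atomic_store_explicit(&ch->tail, tail + 1, memory_order_release);
    return true;
}

int main(void)
{
    static struct channel ch;                 /* would live in shared memory */
    uint64_t out = 0;
    channel_send(&ch, 42);
    return (channel_recv(&ch, &out) && out == 42) ? 0 : 1;
}
```

A real channel additionally needs a way for the receiver to block when all of its queues are empty, which is where the kernel- or hardware-assisted notification discussed above comes back in.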


1.3 Network Stack

To demonstrate the viability of our idea we focus on the network stack, a subsystem that connects the operating system and the applications it hosts to the outside world. The network stack is a performance-critical component in modern operating systems. At the same time, it has a long history of reliability issues, and is widely recognized as a complex subsystem. Therefore enhancing reliability of networking without compromising its performance is important.

Development of monolithic systems concentrates primarily on improving performance to keep up with the speed of the interconnection technologies. Handling gigabit rates was challenging years ago, 10 gigabit speeds are the norm nowadays, while the industry is moving to 40 and 100 gigabit links. Multiserver systems are able to deliver the reliability, but unfortunately the price is very high and handling gigabit traffic at wirespeed typically remains a wishful goal.

The key limitation of multiserver systems in achieving high bitrates is synchronous communication between the network stack and the drivers, and data copying from the network stack to the drivers and vice versa. While the CPU is busy, the network interface does not receive enough packets to keep the link fully utilized since per-packet overheads in software create large interpacket gaps. Even though microkernels often offer some sort of asynchronous communication by means of notifications (a mechanism similar to hardware interrupts), the sending process must interrupt its stream of execution and ask the microkernel to deliver the notification, after which the receiving process likely gets scheduled. This effectively synchronizes the processing. In a way, the synchronous communication is not much different from the procedure calls within a monolithic system, but the overhead of such calls is orders of magnitude cheaper. We show that the multiserver system can counter this problem by using asynchronous user-space communication between components on different cores, which allows pipelining of processing without excessive gaps and bursts. The user-space communication depends on polling or, preferably, hardware support for low overhead cross-core notification which does not disrupt the execution streams.

The monolithic system architecture not only has less internal overhead, it also "naturally" scales to multiple cores by making the code multithreaded. The one single-threaded component per function paradigm in a multiserver system does not enable such a straightforward approach. Although in the past, having one single-threaded process to handle all network traffic did not seem prohibitive, it is no competition to the multithreaded network stacks of commodity operating systems running on multicore servers in data centers. A single-threaded component simply cannot use the power of multiple cores. Indeed, it is possible to make the single networking component multithreaded as well; however, such a component becomes complex, deadlock and error prone. At the same time, the scalability issues of monolithic systems [38], due to contention on locks and problematic data sharing, make multithreading not so "natural" as it may seem at first glance. In contrast, we show that it is possible to scale the network stack of a multiserver system by running several fully isolated replicas of the network stack (a sketch of how connections can be steered to such replicas appears at the end of this section).
Although this solution requires some assistance from the network cards, contemporary hardware has almost all the necessary features and the remaining ones may well emerge in the near future, as other groups also express their interest in them. They advocate for the operating system to become a control plane [121; 33], while the hardware takes the responsibility for multiplexing and isolation.

The way we see the novel reliable and scalable operating system design fundamentally differs in how the CPU cycles are used. In a monolithic system, each CPU core is used for general purpose, hosting both the applications and the operating system's kernel. Depending on the application and its requests, the CPU executes the required part of the kernel. This means that the core's cycles are split between the system and the applications and all of the cycles can be used meaningfully. Similarly, applications share the cores with the operating system servers. Although some functions may execute on a different core than where the application is running, another application or a system service can use that core in the meantime. It looks obvious that in a system which dedicates some of the cores to some of its components (as we propose), it can happen that many cycles cannot be used as no other process can run on the dedicated core even when it is not fully utilized. Moreover, the cores dedicated to the system cannot be used by the applications. However, on second glance, the reality is different. We will show that the usage of the CPU cycles is just distributed differently. Specifically, in a traditional system, applications can use only a fraction of a core's cycles. For instance, in network intensive workloads, the network processing in the kernel can use more than 70% of each core while the applications can use only the remainder, effectively using less than 30% of all the available cycles [82] in the machine. The application processing as well as the processing in the kernel is distributed across many cores. In contrast, in the case of dedicated cores, the processing in the system and the application processing are consolidated on the system cores and the application cores respectively. If the application can use 30% of all the cores, it has the same number of cycles at its disposal. We show that when the load is high, the sum of cycles available for the applications is the same as in the case of a more traditional system and the processing is smoother, even though the distribution of the usage looks a little counterintuitive.

Indeed, a problem may arise in the case of mixed workloads when the cores dedicated to the network stack are not fully utilized. The answer to this problem is two-fold. Cores with multiple hardware threads allow us to coalesce different components on different threads of the same cores. While a traditional multiserver system requires the microkernel and the scheduler to switch between the processes on a core, multithreaded cores interleave the execution streams with much finer granularity without additional involvement of software, achieving the illusion of processes running simultaneously but with much smaller overhead. The hardware scheduler switches the streams when they have data available, which makes the use of the hardware more efficient as cycles are not wasted when the execution units stall waiting for the fetches from main memory. This setup may result in some interference in the cache.
However, if this is limiting, the load is too high and we have to stop coalesc- ing. An orthogonal option is to use heterogeneous processors that can offer many simpler cores than general purpose processors. The system can use the higher num- ber of cores for finer grained resource allocation and for better power management as the “smaller” simpler cores tend to be more power efficient. We show that the performance of such cores is sufficient for running even highly loaded system tasks.
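As a concrete illustration of the replication idea discussed above, the snippet below shows one simple way a dispatcher (or a NIC flow-steering filter programmed with an equivalent rule) could map TCP connections onto independent stack replicas by hashing the connection 4-tuple. It is only a sketch of the general technique under stated assumptions: the hash, the replica count and the function names are made up for the example and do not describe the exact NEWTOS or NEAT mechanism.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NR_REPLICAS 4     /* one independent network stack instance per core */

/* A TCP/IPv4 connection is identified by its 4-tuple. */
struct flow {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Any decent hash works; FNV-1a keeps the example short. */
static uint32_t fnv1a(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;
    while (n--) { h ^= *p++; h *= 16777619u; }
    return h;
}

/* All packets of one connection must hit the same replica, so the replica
 * is chosen purely from the 4-tuple: the replicas share no state and need
 * no locks between them.                                                   */
static unsigned pick_replica(const struct flow *f)
{
    return fnv1a((const uint8_t *)f, sizeof(*f)) % NR_REPLICAS;
}

int main(void)
{
    struct flow f = { 0x0a000001u, 0x0a000002u, 49152, 80 };
    printf("connection handled by stack replica %u\n", pick_replica(&f));
    return 0;
}
```

Commodity NICs can apply this kind of per-flow rule in hardware, for example through receive-side scaling or flow-steering filters, which is the assistance from the network cards referred to above.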

1.4 NEWTOS

To prove the concept and to evaluate the ideas we implemented NEWTOS, a system based on the MINIX 3 multiserver system. We call the system NEWTOS because, like newts, little amphibians that have the unique ability to re-generate limbs and organs in case of injury (see http://en.wikipedia.org/w/index.php?title=Newt&oldid=632017386), NEWTOS can survive crashes and recover parts of the system.

We extended MINIX 3 by adding support for multicore chips and we have reimplemented the user-space network stack. Instead of the original single process, we separated processing of individual networking protocols into isolated processes. This increases pipelining and separates functions that are easier to recover from failures from the ones with a large state which is hard to preserve. We introduced asynchronous user-space communication channels between these components that bypass the microkernel. We also abandoned the mostly synchronous communication between the network stack and the network device drivers and replaced it with the same user-space communication. In addition, we extended the communication all the way to the applications by redesigning the network sockets. Specifically, we expose the socket buffers to the applications. Thus, applications can check whether the buffer is full or empty and access the data directly. This allows for uninterrupted execution of the applications by avoiding almost all of the legacy trap-based system calls which were so far considered the main bottleneck. To make the communication channels work efficiently, the kernel provides the system processes with a new way of blocking. However, on some architectures, the user-space processes can even do it on their own without any assistance from the microkernel.

To make the network stack scalable, NEWTOS can run multiple copies of the entire stack. This fact is completely transparent to the applications while it allows spreading the load among multiple cores. The number of the network stack instances is flexible and adjustable to the current workload.
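The exposed socket buffers just described can be pictured as a receive ring shared between the network stack and the application: the application first checks the ring in user space and only falls back to a blocking call when it is empty. The snippet below is a simplified sketch of that fast path with invented structure and function names; it is not the actual NEWTOS socket ABI, and the blocking slow path is stubbed so the example runs on its own.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Receive side of an exposed socket buffer, mapped into the application.
 * The network stack advances 'produced'; the application advances 'consumed'. */
struct sock_rx_ring {
    volatile uint32_t produced;
    volatile uint32_t consumed;
    uint32_t size;                /* bytes in the data area (power of two) */
    uint8_t  data[256];
};

/* Stand-in for the real slow path: a blocking call into the system, used
 * only when the ring is empty. Stubbed here to keep the sketch self-contained. */
static ssize_t sock_recv_blocking(int sockfd, void *buf, size_t len)
{
    (void)sockfd; (void)buf; (void)len;
    return 0;                     /* pretend nothing arrived */
}

/* recv() fast path: when data is already in the ring, the request is served
 * entirely in user space and no system call is made at all.                  */
static ssize_t sock_recv(int sockfd, struct sock_rx_ring *rx, void *buf, size_t len)
{
    uint32_t avail = rx->produced - rx->consumed;

    if (avail == 0)
        return sock_recv_blocking(sockfd, buf, len);

    if (len > avail)
        len = avail;
    for (size_t i = 0; i < len; i++)            /* copy out, wrapping around   */
        ((uint8_t *)buf)[i] = rx->data[(rx->consumed + i) & (rx->size - 1)];
    rx->consumed += (uint32_t)len;              /* hand the space back to the stack */
    return (ssize_t)len;
}

int main(void)
{
    static struct sock_rx_ring rx = { .size = 256 };
    char out[8];

    memcpy((void *)rx.data, "hello", 5);        /* pretend the stack delivered 5 bytes */
    rx.produced = 5;

    ssize_t n = sock_recv(3, &rx, out, sizeof(out));
    printf("fast-path recv returned %zd bytes\n", n);
    return 0;
}
```

Chapter 5 works out how this idea is combined with notifications so that blocking and socket management still work with the unmodified BSD sockets API.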

1.5 Contributions

Our contributions are severalfold:

1. We show that the often touted performance issues of multiserver systems are not inherent. Specifically, we show that with modern multicore processors many of the old performance bottlenecks of multiserver systems can be removed to make them competitive in performance (but better in reliability) in comparison with commodity monolithic operating systems.

2. We implemented a novel network stack for a multiserver system which improved the network throughput from a little more than 100 Mbps to 10 Gbps. This throughput is only limited by our testing hardware and not by the structure of the network stack.

3. We also introduced novel communication between the applications and the network stack, which can avoid almost all system calls an application that uses the network must execute in other operating systems. Under high load we can reduce the number of system calls by more than 99%. In other words, almost all the communication occurs in user space and bypasses the microkernel.

4. We show that isolation and partitioning have better scalability and reliability properties than the shared-all model of monolithic systems. We can run independent isolated instances of the network stack to scale the throughput of the system as the load demands, while the instances do not share any data structures; therefore there is no contention in any synchronization.

5. We show that multiserver systems can adapt to the future heterogeneous multicore architectures as we can run their parts on the most appropriate types of cores, and the best core is not necessarily the fastest core in the system. On the contrary, we show that slow cores can deliver unexpected performance.

6. We demonstrate that operating systems can have both reliability and performance without trading one for the other. The more performance we gain, the more we can afford to further decompose system servers and their state, which makes recovery from faults simpler. In addition, running multiple instances of the same system servers means that any fault does not bring down the entire service they provide.

1.6 Organization of the Dissertation

The work has been published and submitted to refereed conferences and workshops and the remainder of this dissertation, which includes this work, is organized as follows:

Chapter 2 presents a new research system, NEWTOS, which we use for prototyping our ideas. It is based on MINIX 3; however, it diverges strongly in the way it uses the available cores and how the system servers communicate.

NEWTOS shows that it is advantageous to pin the performance sensitive servers and drivers to their own cores, which allows them (i) to execute without interleaving with other processes, (ii) to communicate asynchronously without involvement of the microkernel and (iii) to pipeline their work, hence increasing parallelism, and, in turn, (iv) to improve reliability by allowing the system to split the services into even more fine grained and isolated components. NEWTOS demonstrates that multiserver systems and multicore processors are a great match. To support our claims we cover in detail the implementation and evaluation of its network stack, since it is the subsystem most easily stressed by an intensive workload.

Chapter 2 appeared in Proceedings of The 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

Chapter 3 discusses how NEWTOS can achieve the best performance on heterogeneous processors, which allow each server to run on the most suitable type of core. In addition, it shows that the fastest and "the most powerful" cores are not necessarily the most suitable ones and that slow (or wimpy) cores can deliver superior performance for the system processes. The heterogeneous multicores have the potential to offer more cores on a die of the same size than their homogeneous counterparts in the future; however, we use commodity chips in combination with frequency scaling to slow down some of the cores to emulate wimpy cores of heterogeneous platforms, and we show the potential for energy savings and improved power/performance ratio.

Chapter 3 appeared in Proceedings of the 2013 USENIX Annual Technical Conference (ATC '13).

Chapter 4 focuses on efficient resource usage of multicore, heterogeneous and over-provisioned processors. Most of the prior work focused only on scheduling applications in such environments. In contrast, we show the operating system must get the same attention. The multiserver systems can embrace such processors more easily than other systems; however, it is not possible without changing the way the operating system itself perceives its system processes, since they are different than applications. The scheduler must not only passively observe metrics provided by the hardware and guess what is the best for each of the system servers. They must not be opaque processes anymore. The scheduler must know each server's role and requirements. At the same time, the servers must supply the scheduler with performance indicators, and the scheduler itself must take an active part in evaluating the data and reconfiguring the system depending on the current load, available resources and power budget. We propose "profile based scheduling" as the simplest method to match current readings of the indicators and good predetermined configurations.

Chapter 4 appeared at The 4th Workshop on Systems for Future Multicore Architectures (SFMA 2014).

Chapter 5 explores the possibilities and benefits of removing a significant overhead of operating systems—the system calls—by running parts of the operating system on different cores than the applications. Prior work shows that system calls disturb execution of applications by interleaving them with execution of the operating system, which invalidates internal structures of the processor. NEWTOS as well as modifications of some commodity systems allow parallel execution of applications and the operating system on disjoint cores. NEWTOS demonstrates that it is possible to extend the fast user-space interprocess communication and signaling, which the operating system uses among its servers, to the applications. We focus on the BSD sockets API system calls and we show that exposed socket buffers based on producer–consumer queues can avoid the vast majority of the system calls, as the applications can resolve them immediately on their own and only use true system calls when they need to block. In contrast to other approaches to minimizing the sockets system calls, we entirely exclude the kernel from the fast path, while we preserve backwards API compatibility.

Chapter 5 appeared in Proceedings of the 2014 Conference on Timely Results in Operating Systems (TRIOS '14).

Chapter 6 discusses the limitations of commodity systems for the increasingly high scalability and reliability demands of modern applications. We show that their architectures lack adequate design abstractions to address these concerns and instead resort to scalability and reliability of implementation. We demonstrate that a simple and principled OS design driven by two key abstractions—isolation and partitioning—is effective in building scalable and reliable systems while solely relying on scalability and reliability of design. To substantiate our claims, we apply our ideas to the NEWTOS network stack and present a clean slate implementation called NEAT, and discuss the architectural support necessary to deploy our solution in real-world settings. We show that our design can (i) intelligently partition the network stack state to minimize the impact of failures and (ii) scale comparably to Linux, but without exposing the implementation to common pitfalls such as synchronization errors, poor data locality, and false sharing.

Chapter 6 is under submission.

Chapter 7 concludes the dissertation by summarizing the results and discussing opportunities for future research directions.

2 Keep Net Working

On a Dependable and Fast Networking Stack

Abstract

For many years, multiserver operating systems (operating systems implemented as a collection of userspace processes, or servers, running on top of a microkernel) have been demonstrating, by their design, high dependability and reliability. However, the design has performance implications which were not easy to overcome. Until now the context switching and the kernel involvement in the message passing were the performance bottleneck for such systems to get broader acceptance beyond niche domains of software development, where fitting the software to the hardware is difficult. In contrast, the parallelism of the new multicore hardware is a great match for the multiserver systems. We can run the individual servers on different cores. This allows for further decomposition of the existing servers and thus opens more room for improving dependability and live-updatability. We discuss the implications for the design of multiserver systems in general and we cover in detail the implementation and evaluation of a more dependable networking stack. We split the single stack into multiple servers which run on dedicated cores and communicate without kernel involvement. We think that the problems that have dogged multiserver operating systems since their inception should be reconsidered: it is possible to make multiserver systems fast on multicores.


2.1 Introduction

Reliability has historically been at odds with speed—as witnessed by several decades of criticism against multiserver operating systems (“great for reliability, but too slow for practical use”). In this paper, we show that new multicore hardware and a new OS design may change this. Reliability is crucial in many application domains, such as hospitals, emergency switchboards, mission critical software, traffic signalling, and industrial control sys- tems. Where crashes in user PCs or consumer electronics typically mean inconve- nience (losing work, say, or the inability to play your favorite game), the conse- quences of industrial control systems falling over go beyond the loss of documents, or the high-score on Angry Birds. Reliability in such systems is taken very seriously. By radically redesigning the OS, we obtain both the fault isolation properties of multiserver systems, and competitive performance. We present new principles of designing multiserver systems and demonstrate their practicality in a new network stack to show that our design is able to handle very high request rates. The network stack is particularly demanding, because it is highly complex, per- formance critical, and host to several catastrophic bugs, both in the past [58] and the present [21]. Mission-critical systems like industrial control systems often cannot be taken offline to patch a bug in the software stack—such as the recent vulnerability in the Windows TCP/IP stack that generated a storm of publicity [21]. When uptime is critical, we need to be able to patch even core components like the network stack on the fly. Likewise, when part of the stack crashes, we should strive toward recovery with minimal disturbance—ideally without losing connections or data. In this paper, we focus on the network stack because it is complex and perfor- mance critical, but we believe that the changes we propose apply to other parts of the operating system as well. Also, while we redesign the OS internals, we do not change the look and feel of traditional operating systems at all. Instead, we adhere to the tried and tested POSIX interface.

Contributions In this paper, we present a reliable and well-performing multiserver system NEWTOS² where the entire networking stack is split up and spread across cores to yield high performance, fault isolation and live updatability of most of the stack's layers. We have modified MINIX 3 [12] and our work has been inspired by a variety of prior art, such as Barrelfish [31], fos [151], FlexSC [139], FBufs [55], IsoStack [135], Sawmill Linux [62] and QNX [131]. However, our design takes up an extreme point in the design space, and splits up even subsystems (like the network stack) that run as monolithic blobs on all these systems, into multiple components. The OS components in our design run on dedicated cores and communicate through asynchronous high-speed channels, typically without kernel involvement. By dedicating cores and removing the kernel from the fast path, we ensure caches are warm and eliminate context switching overhead. Fast, asynchronous communication decouples processes on separate cores, allowing them to run at maximum speed.

Moreover, we achieve this performance in spite of an extreme multiserver architecture. By chopping up the networking stack into many more components than in any other system we know, for better fault isolation, we introduce even more interprocess communication (IPC) between the OS components. As IPC overhead is already the single most important performance bottleneck on multiserver systems [97], adding even more components would lead to unacceptable slowdowns in existing OS designs. We show that a careful redesign of the communication infrastructure allows us to run at high speeds despite the increase in communication.

Breaking up functionality in isolated components directly improves reliability. Making components smaller allows us to better contain the effects of failures. Moreover, components can often be restarted transparently, so that a bug in IP, say, will not affect TCP. Our system recovers seamlessly from crashes and hangs in drivers, network filters, and most protocol handlers. Since the restarted component can easily be a newer or patched version of the original code, the same mechanism allows us to update on the fly many core OS components (like IP, UDP, drivers, packet filters, etc.).

The OS architecture and the current trend towards manycore hardware together allow, for the first time, an architecture that has the reliability advantages of multiserver systems and a performance approximating that of monolithic systems [38], even though there are many optimizations left to exploit. The price we pay is mainly measured in the loss of cores now dedicated to the OS. However, in this paper we assume that cores are no longer a scarce resource, as high-end machines already have dozens of them today and will likely have even more in the future.

²A newt is a salamander that, when injured, has the unique ability to regenerate its limbs, eyes, spinal cord, intestines, jaws and even its heart.

Outline In Section 2.2, we discuss the relation between reliability, performance, multiservers and multicores. Next, in Section 2.3, we explain how a redesign of the OS greatly reduces its performance problems without giving up reliability. We present details of our framework in Section 2.4 and demonstrate the practicality of the design on the networking stack in Section 2.5. The design is evaluated in Section 2.6. We compare our design to related work in Section 2.7 and conclude in Section 2.8.

2.2 Reliability, performance and multicore

Since it is unlikely that software will ever be free of bugs completely [70], it is crucial that reliable systems be able to cope with them. Often it is enough to restart a component and the bug disappears. For reliability, new multicore processors are double-edged swords. On the one hand, increasing concurrency leads to new and complex bugs. On the other hand, we show in this paper that the abundance of cores and a carefully designed communication infrastructure allows us to run OS components on dedicated cores—providing isolation and fault-tolerance without the performance problems that plagued similar systems in the past.

Figure 2.1: Conceptual overview of NEWTOS. Each box represents a core: the OS servers (MM, PM, VFS, EXT2, IP/ICMP, TCP, UDP, PF, NetDrv) run on dedicated cores, while the application (APP) cores are timeshared.

Current hardware trends suggest that the number of cores will continue to rise [2; 22; 79; 104; 133; 134] and that the cores will specialize [112; 139; 144], for example for running system services, single threaded or multithreaded applications. As a result, our view on processor cores is changing, much like our view on memory has changed. There used to be a time when a programmer would know and cherish every byte in memory. Nowadays, main memory is usually no longer scarce and programmers are not shy in wasting it if doing so improves overall efficiency—there is plenty of memory. In the same way, there will soon be plenty of cores. Some vendors already sacrifice cores for better energy efficiency [2; 22]. The key assumption in this paper is that it is acceptable to utilize extra cores to improve dependability and performance.

Unfortunately, increasing concurrency makes software more complex and, as a result, more brittle [81]. The OS is no exception [101; 119; 132]. Concurrency bugs lead to hangs, assertion failures, or crashes and they are also particularly painful, as they take considerably longer to find and fix than other types of bugs [81]. Thus, we observe (a) an increase in concurrency (forced by multicore hardware trends), (b) an increase in concurrency bugs (often due to complexity and rare race conditions), and (c) systems that crash or hang when any of the OS components crashes or hangs. While it is hard to prevent (a) and (b), we can build a more reliable OS that is capable of recovering from crashing or hanging components, whether they be caused by concurrency bugs or not.

Our design improves OS reliability both by structural measures that prevent certain problems from occurring in the first place, and by fault recovery procedures that allow the OS to detect and recover from problems. Structural measures include fault isolation by running OS components as unprivileged user processes, avoiding multithreading in components, and asynchronous IPC. For fault recovery, we provide a monitor that checks whether OS components are still responsive and restarts them if they are not.

The research question addressed in this paper is whether we can provide such additional reliability without jeopardizing performance. In existing systems, the answer would be: "No". After all, the performance of multiserver systems degrades quickly with the increase in IPC incurred by splitting up the OS in small components. This is true even for microkernels like L4 that have optimized IPC greatly [97], and is the main reason for the poor performance of multiserver systems like MINIX 3 [12]. However fast we make the mechanism, kernel-based IPC always hurts performance: every trap to the kernel pollutes the caches and TLB with kernel stuff, flushes register windows, and messes up the branch predictors. In the next section, we discuss how we can reduce this cost in a new reliable OS design for manycore processors.

2.3 Rethinking the Internals

As a first step, and prior to describing our design, we identify the tenets that underlie the system. Specifically, throughout this work we adhere to the following principles:

1. Avoid kernel involvement on the fast path. Every trap in the kernel pollutes caches and branch predictors and should be avoided when performance counts.

2. Do not share cores for OS components. There is no shortage of cores and it is fine to dedicate cores to OS components, to keep the caches, TLBs and branch prediction warm, and avoid context switching overhead.

3. Split OS functions in isolated components. Multiple, isolated components are good for fault tolerance: when a component crashes, we can often restart it. Moreover, it allows us to update the OS on the fly.

4. Minimize synchronous intra-OS communication. Synchronous IPC introduces unnecessary waits. Asynchronous communication avoids such bottlenecks. In addition, asynchrony improves reliability, by preventing clients from blocking on faulty servers [74].

We now motivate the most interesting principles in detail.

2.3.1 IPC: What's the kernel got to do with it?

All functions in a multiserver system run in isolated servers. A crash of one server does not take down the entire system, but the isolation also means that there is no global view of the system and servers rely on IPC to obtain information and services from other components. A multiserver system under heavy load easily generates hundreds of thousands of messages per second. Considering such IPC rates, both the direct and indirect cost of trapping to the kernel and context switching are high. To meet the required message rate, we remove the kernel from high-frequency IPC entirely and replace it with trusted communication channels which allow fast asynchronous communication. Apart from initially setting up the channels, the kernel is not involved in IPC at all (Section 2.4).

As shown in Figure 2.1, every OS component in NEWTOS can run on a separate core for the best performance while the remaining cores are for user applications. The OS components themselves are single-threaded, asynchronous user processes that communicate without kernel involvement. This means no time sharing, no context switching, and no competing for the processor resources. Caching efficiency improves both because the memory footprint of each component is smaller than that of a monolithic kernel, and because we avoid many expensive flushes. By dedicating cores to OS components, we further reduce the role of the kernel because no scheduling is required and dedicated cores handle interrupts. This leaves only a thin kernel layer on each system core.

Removing the kernel from fast-path IPC also removes the additional inefficiency of cross-core IPC that is, paradoxically, only noticeable because there is no longer a context switch. Single core systems partially hide IPC overhead behind the context switch. If a server needs another server to process a request, that process must be run first. Therefore, the trap to the kernel to send a message is the same as needed for the context switch, so some of the overhead is "hidden". On multicores, context switching no longer hides the cost of IPC and the latency of the IPC increases because of the intercore communication. The kernel copies the message and if one of the communicating processes is blocked receiving, it must wake it up. Doing so typically requires an interprocessor interrupt which adds to the total cost and latency of the IPC.

If enough cores are available, we can exclude the kernel from the IPC. Our measurements show that doing so reduces the overhead of cross-core communication dramatically.

2.3.2 Asynchrony for Performance and Reliability

A monolithic system handles user space requests on the same core as where the application runs. Many cores may execute the same parts of the kernel and access the same data simultaneously, which leads to lock contention to prevent races and data corruption. We do not require CPU concurrency per server, and event-driven servers are fast and arguably less complex than threads (synchronization, preemption, switching, etc.) and help avoid concurrency bugs. For us, single threaded servers are a good design choice.

However, synchronous communication between the servers (blocking until receiving a reply), as used in most multiserver systems, may well make the entire system single threaded in practice. Thus, dedicating a separate core to each server reduces the communication cost but does not scale further. Ideally, we would like the cores to process tasks in parallel if there is work to do. To do so, the servers must work as independently of each other as possible to increase intra-OS parallelism. Only asynchronous servers can process other pending requests while they wait for responses from others.

An argument against asynchrony is that it is difficult to determine whether a process is just slow or whether it is dead. However, a multiserver system, unlike a distributed system, runs on a single machine and can take advantage of fast and reliable communication provided by the interconnect. Together with the microkernel, it makes detection of such anomalies much simpler.

Most microkernels provide synchronous IPC because it is easy to implement and requires no buffering of messages. In practice, support for asynchronous communication is either inefficient (e.g., MINIX 3) or minimal. Specifically, the large number of user-to-kernel mode switches results in significant slowdowns here also. In contrast, the communication channels in our design increase asynchrony by making nonblocking calls extremely cheap.

While asynchrony is thus needed for scalability on multicores, it is equally important for dependability. The system should never get blocked forever due to an unresponsive or dead server or driver. In our design, a misbehaving server cannot block the system even if it hogs its core. Better still, our asynchronous communication lets servers avoid IPC deadlocks [137]. Since servers decide on their own from which channel and when to receive messages (in contrast to the usual receive from anyone IPC call), they can easily handle a misbehaving server that would otherwise cause a denial-of-service situation.

2.4 Fast-Path Channels

The main change in our design is that instead of the traditional IPC mechanisms provided by the microkernel, we rely on asynchronous channels for all fast-path communication. This section presents details of our channel implementation using shared memory on cache-coherent hardware. Shared memory is currently the most efficient communication option for general-purpose architectures. However, there is nothing fundamental about this choice and it is easy to change the underlying mech- anism. For instance, it is not unlikely that future processor designs will not be fully cache coherent, perhaps providing support for the sort of message passing instead as provided by the Intel SCC [79]. Moreover, besides convenient abstractions, our channels are generic building blocks that can be used throughout the OS. By wrap- ping the abstraction in a library, any component can set up channels to any other component. Our channel architecture has three basic parts: (1) queues to pass requests from one component to another, (2) pools to share large data, and (3) a database containing the requests a component has injected in the channels and which we are waiting for to complete or fail. We also provide an interface to manage the channels. The architecture draws on FBufs [55] and Streamline [51], but is different from either in how it manages requests.

Queues Each queue represents a unidirectional communication channel between one sender and one consumer. We must use two queues to set up communication in both directions. Each filled slot on a queue is a marshalled request (not unlike a remote procedure call) which tells the receiver what to do next. Although we are not bound by the universal size of messages the kernel allows and we can use different slot sizes on different queues, all slots on one queue have the same size. Thus we cannot pass arbitrarily sized data through these channels. We use a cache friendly queue implementation [64; 51], that is, the head and tail pointers are in different cache lines to prevent them from bouncing between cores. Since the queues are single-producer, single-consumer they do not require any locking and adding and removing requests is very fast. For instance, on the test machine used in the evaluation section, the cost of trapping to the kernel on a single core using the SYSCALL instruction in a void Linux system call takes about 150 cycles if the caches are hot. The same call with cold caches takes almost 3000 cycles. In contrast, on our channels it requires as little as 30 cycles to asynchronously enqueue a message in a queue between 2 processes on different cores while the receiver keeps consuming the messages. The cost includes the stall cycles to fetch the updated pointer to the local cache.
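To make the mechanism concrete, the following is a minimal sketch of such a single-producer, single-consumer queue in C, assuming a cache-coherent x86-64 machine with 64-byte cache lines and GCC/Clang atomic builtins. The layout and function names are illustrative only and do not correspond to the actual NEWTOS channel code.

    /* Minimal SPSC queue sketch (illustrative, not the NEWTOS implementation).
     * Head and tail live in separate cache lines so they do not bounce
     * between the producer and consumer cores. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define CACHELINE   64
    #define SLOT_SIZE   64          /* all slots on one queue have the same size */
    #define QUEUE_SLOTS 1024        /* power of two keeps the index math cheap   */

    struct channel_queue {
        _Alignas(CACHELINE) volatile uint32_t head;   /* written by the producer */
        _Alignas(CACHELINE) volatile uint32_t tail;   /* written by the consumer */
        _Alignas(CACHELINE) uint8_t slots[QUEUE_SLOTS][SLOT_SIZE];
    };

    /* Non-blocking enqueue: returns false when the queue is full so the caller
     * can apply its own policy (e.g., drop a packet) instead of blocking. */
    static bool chan_enqueue(struct channel_queue *q, const void *req, size_t len)
    {
        uint32_t head = q->head;
        if (head - q->tail == QUEUE_SLOTS || len > SLOT_SIZE)
            return false;
        memcpy(q->slots[head % QUEUE_SLOTS], req, len);
        __atomic_store_n(&q->head, head + 1, __ATOMIC_RELEASE);  /* publish slot */
        return true;
    }

    static bool chan_dequeue(struct channel_queue *q, void *req, size_t len)
    {
        uint32_t tail = q->tail;
        if (__atomic_load_n(&q->head, __ATOMIC_ACQUIRE) == tail)
            return false;                                         /* empty */
        memcpy(req, q->slots[tail % QUEUE_SLOTS], len);
        __atomic_store_n(&q->tail, tail + 1, __ATOMIC_RELEASE);
        return true;
    }

Keeping the producer-written and consumer-written indices in separate cache lines is what keeps the enqueue cost in the tens of cycles: the only cross-core traffic on the fast path is the slot contents and the published index.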

Pools We use shared memory pools to pass large chunks of data and we use rich pointers to describe in what pool and where in the pool to find them. Unlike the queues which are shared exclusively by the two communicating processes, many processes can access the same pool. This way we can pass large chunks from the original producer to the consumers further down the line without the need to copy. Being able to pass long chains of pointers and zero-copy are mechanisms crucial for good performance. All our pools are exported read only to protect the original data.
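A rich pointer can be as simple as a (pool, offset, length) triple. The sketch below illustrates the idea; the type, field and function names are assumptions made for illustration, not the NEWTOS definitions.

    /* Illustrative "rich pointer" into a shared memory pool. Any server that
     * has the pool mapped can translate it to a local address without copying. */
    #include <stddef.h>
    #include <stdint.h>

    struct rich_ptr {
        uint32_t pool_id;   /* which shared pool the data lives in       */
        uint32_t offset;    /* byte offset of the chunk within that pool */
        uint32_t length;    /* size of the chunk                         */
    };

    /* Per-process table of pools this server has attached (read-only). */
    struct pool_map {
        uint32_t pool_id;
        uint8_t *base;      /* local mapping of the pool                 */
        size_t   size;
    };

    extern struct pool_map attached_pools[];
    extern unsigned        nr_attached_pools;

    static const void *rich_ptr_to_local(const struct rich_ptr *rp)
    {
        for (unsigned i = 0; i < nr_attached_pools; i++) {
            struct pool_map *p = &attached_pools[i];
            if (p->pool_id == rp->pool_id &&
                rp->offset + (size_t)rp->length <= p->size)
                return p->base + rp->offset;
        }
        return NULL;        /* unknown pool or out-of-range chunk: reject */
    }

Because the translation validates the pool id and the bounds, a corrupted rich pointer received from a buggy neighbor results in a rejected request rather than a stray memory access.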

Database of requests As our servers are single-threaded and asynchronous, we must remember what requests we submitted on which channels and what data were associated with each request. After receiving a reply, we must match it to the corre- sponding request. For this purpose, the architecture provides a lightweight request database that generates a unique request identifier for every request. Our channel architecture also provides an interface to publish the existence of the channels, to export them to a process, and to attach to them. We discuss this in more detail in Section 2.4.3.
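As an illustration, such a request database can be little more than a fixed-size table keyed by a monotonically increasing identifier. The sketch below uses hypothetical names and omits the abort actions discussed in Section 2.4.4.

    /* Minimal request-database sketch (illustrative API, not the real one).
     * Each submitted request gets a unique id; replies carry the id back so
     * the single-threaded server can match them to the original request. */
    #include <stdint.h>

    #define MAX_PENDING 256

    struct pending_req {
        uint64_t id;        /* unique request identifier, 0 = free slot  */
        int      channel;   /* channel the request was submitted on      */
        void    *data;      /* payload associated with the request       */
    };

    static struct pending_req pending[MAX_PENDING];
    static uint64_t next_id = 1;

    static uint64_t reqdb_add(int channel, void *data)
    {
        for (unsigned i = 0; i < MAX_PENDING; i++) {
            if (pending[i].id == 0) {
                pending[i] = (struct pending_req){ next_id++, channel, data };
                return pending[i].id;
            }
        }
        return 0;           /* database full; the caller must handle it  */
    }

    static struct pending_req *reqdb_match(uint64_t id)
    {
        for (unsigned i = 0; i < MAX_PENDING; i++)
            if (pending[i].id == id)
                return &pending[i];
        return NULL;
    }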

2.4.1 Trustworthy Shared Memory Channels

Shared memory has been used for efficient IPC for a long time [55] and in many scenarios [53; 35; 51; 100]. The question we address here is whether we can use it as a trusted communication channel without harming dependability. Kernel-level IPC guarantees that a destination process is reliably informed about the source process. Our channels offer the same guarantees. As servers must use the trusted (and slower) kernel IPC to set up the channels (requesting permission to export or attach to them), the kernel ensures that processes cannot change the mappings to access and corrupt other processes' address spaces. Since a process cannot make part of its address space available to another process all by itself, setting up the shared memory channel involves a third process, known as the virtual memory manager. Each server implicitly trusts the virtual memory manager. Once a shared memory region between two processes is set up, the source is known.

Likewise, we argue that communication through shared memory is as reliable as communication through the kernel. In case the source is malicious or buggy, it can harm the receiving process by changing data in the shared location when the receiver was already cleared to use them. The receiving process must check whether a request makes sense (e.g., contains a known operation code) and ignore invalid ones. If the sender tampers with the payload data, the effect is the same as if it produced wrong data to begin with. Although incorrect data may be sent to the network or written to disk, it does not compromise the network or disk driver. In addition, we use write protection to prevent the consumer from changing the original data. While the consumer can, at any time, pass corrupted data to the next stage of a stack, if a request fails or we need to repeat the request (e.g., after a component crash, as discussed in Section 2.5), we can always use the original data.

We must never block when we want to add a request and the queue is full, as this may lead to deadlocks. Each server may take its own unique action in such a situation. For instance, dropping a packet within the network stack is acceptable, while the storage stack should remember the request until the congestion is resolved.

2.4.2 Monitoring Queues

If a core is busy, it is no problem to check the queues for new requests. However, once a core is not fully loaded, constant checking keeps consuming energy, even though there is no work to do. Therefore, we put idle cores to sleep. But the process must wake up immediately when there is more work to do. The sender can use kernel IPC to notify the receiver that a new request is available, but that is precisely what we want to avoid. To break the circle, we use the MONITOR and MWAIT pair of instructions, recently added to the Intel x86 instruction set, to monitor writes to a memory location while the core is idle. In addition to the shared memory channels, each server exports the location it will monitor at idle time, so the producers know where to write to.

Unfortunately, these instructions are available only in privileged mode—so we must use the kernel to sleep. Although we only need the kernel's assistance when a server has no work to do and we want to halt the core, the overhead of restoring the user context when a new request arrives adds to the latency of the MWAIT. This fact encourages more aggressive polling to avoid halting the core if the gap between requests is short. Part of the latency is absorbed by the queues we use to implement the communication channels. If the MWAIT were optionally allowed in unprivileged mode, we would get perfect energy-consumption-aware polling with extremely low wake-up latency. In our opinion, the kernel should be able to allow this instruction in an unprivileged mode (as it can disable it in the privileged one) when it knows that the core is dedicated to a process and thus this process cannot prevent other processes from running when it halts its core. Moreover, a core cannot be disabled entirely, as an interrupt, for example from another core, can always wake it up. Although such instructions are fairly unique to x86, they prove so useful that we expect other architectures to adopt variants of them in the future.
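The following sketch shows the idea behind such a kernel-side idle path using the SSE3 MONITOR/MWAIT intrinsics; it is illustrative only and glosses over saving and restoring the user context of the dedicated server.

    /* Illustrative kernel-side idle loop for a dedicated server core.
     * The server exports one "doorbell" word; producers write to it after
     * enqueuing, and the kernel arms MONITOR on it before halting the core. */
    #include <pmmintrin.h>
    #include <stdint.h>

    void idle_until_write(volatile uint32_t *doorbell, uint32_t last_seen)
    {
        for (;;) {
            _mm_monitor((const void *)doorbell, 0, 0);  /* arm address monitor  */
            if (*doorbell != last_seen)                 /* re-check: a write may */
                break;                                  /* have raced MONITOR    */
            _mm_mwait(0, 0);    /* halt until the monitored line is written or
                                   an interrupt (e.g., an IPI) wakes the core   */
            if (*doorbell != last_seen)
                break;
        }
    }

The re-check between arming the monitor and executing MWAIT closes the window in which a producer's write could otherwise be missed.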

2.4.3 Channel Management As there is no global manager in our system, the servers must set up the channels themselves. After all, we do not want our recovery mechanisms to depend on another server which itself may crash. When a server starts, it announces its presence by a publish-subscribe mechanism. Another server subscribed to the published event can then export its channels to the newly started one. Exporting a channel provides the recipient with credentials to attach to it. In our case, it can use the credentials to re- quest the memory manager to map it into its address space. A server can also detach from a channel. This is only used when the other side of the channel disappears. We never take a channel away from an active server since it would crash after accessing the unmapped memory. Pools are basically channels without the additional queue structuring and the limit of how many processes can attach to it, therefore we use the same management for both. Because we use the pools to pass large chunks of data without copying, not only the processes that communicate immediately with each other must be able to attach pools. Each channel is identified by its creator and a unique id. The creator publishes the id as a key-value pair with a meaningful string to which a server can subscribe. After obtaining the identification pair, the server can request an export of the pool from its creator, which the creator can grant or deny.

2.4.4 Channels and Restarting Servers

When a server crashes and restarts, it has to reattach the channels which were previously exported to it. Since the channels exported by a crashed server are no longer valid, their users need to detach from them and request new exports. The identification of the channels does not change. We cannot hide the fact that a server crashed from the ones it talked to since there may have been many requests pending within the system.

Reestablishing the channels to a server which recovered from a crash is not enough. Servers that kept running cannot be sure about the status of the requests they issued and must take additional actions. We use the request database to store each request and what to do with it in such a situation. We call this an abort action (although a server can also decide to reissue the request). When a server detects a crash of its neighbor, it tells the database to abort all requests to this server. While the database removes the requests, it executes the associated abort actions. Abort actions are highly application specific. For instance, servers in a storage stack are likely to clean up and propagate the abort further until an error is returned to the user-space processes. On the other hand, servers in a networking stack may decide to retransmit a packet or drop it, which we discuss in the following Section 2.5.

The channels allow a component to be transparently replaced by a newer version on the fly as long as the interface to the rest of the system stays unchanged. Since a new incarnation of a server in our system inherits the old version's address space, the channels remain established.
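To illustrate, an abort action can be stored with each pending request as a callback that the database invokes when a neighbor crashes. The sketch below extends the hypothetical request database shown in Section 2.4 and is not the actual NEWTOS code.

    /* Sketch of aborting pending requests after a neighbor crashes.
     * Assumes the pending_req and MAX_PENDING definitions from the earlier
     * request-database sketch. */
    typedef void (*abort_action_t)(struct pending_req *req);

    struct pending_req_ext {
        struct pending_req req;
        abort_action_t     on_abort;   /* chosen by the server per request */
    };

    static struct pending_req_ext pending_ext[MAX_PENDING];

    static void reqdb_abort_channel(int crashed_channel)
    {
        for (unsigned i = 0; i < MAX_PENDING; i++) {
            struct pending_req_ext *e = &pending_ext[i];
            if (e->req.id != 0 && e->req.channel == crashed_channel) {
                e->on_abort(&e->req);   /* e.g., IP resubmits the packet,
                                           a storage server returns an error */
                e->req.id = 0;          /* drop the request from the database */
            }
        }
    }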

2.5 Dependable Networking Stack

The network stack is a particularly critical part of current OSs, where often extreme performance is as important as high reliability since downtime may have a high cost. In addition, the network stack is very complex and frequently contains critical bugs, as witnessed recently by the vulnerability in the Windows networking stack [21]. Thus, we selected the networking stack as the most interesting subsystem in the OS to evaluate our design.

In contrast to monolithic OSs that are very good in performance but do not address reliability at all, we opted for an extreme design. In case of a fatal error in the stack, the rest of the system keeps working. As we shall see, the system can often fix the problem automatically and seamlessly. In situations when it cannot, the user can take an action like saving data to disk and rebooting, which is still more than a user can do when the whole system halts.

Our stack goes well beyond what is currently found in other multiserver systems. For instance, Herder et al. [75] showed in the original MINIX 3 how to repair faulty userspace network drivers at runtime by restarting them. However, network drivers are near-stateless components and the network protocols know how to recover from packet loss. Any fault in IP, say, would crash the entire stack, and because the network stack itself is stateful, it was possible to restart it, but not to recover the state. We decompose the network stack into even more, smaller (and simpler) separate processes, which increases isolation, but also the amount of IPC.

Figure 2.2 shows how we split up the stack into multiple components. The dashed box represents what is usually a single server in a multiserver system and the boxes inside are the servers in NEWTOS. We draw a line between the IP layer and the transport protocols. Our IP also contains ICMP and ARP. For security reasons, the networking stack usually contains a packet filter which we can also isolate into a standalone process. Again, such an extreme decomposition is practical only if we do not significantly compromise performance.

Figure 2.2: Decomposition and isolation in multiserver systems. The network stack (TCP, UDP, IP/ICMP, PF and the network drivers) is split into separate servers next to the other system servers (PM, MM, VFS, SYSCALL, Ext2, SATA driver) and the application processes, all running on top of the microkernel.

Each of the components has at least some state, and the size of this state and the frequency at which it changes determines how easily we can recover from a failure (Table 2.1). After drivers, the simplest component to restart is IP. It has very limited (static) state, basically the routing information, which we can save in any kind of permanent storage and restore after a crash. ARP and ICMP are stateless. To recover UDP, however, we need to know the configuration of the sockets, a 4-tuple of source and destination address and ports. Fortunately, this state does not change very often. The packet filter has two kinds of state. The more static portion is its configuration by the user, which is as simple to recover as the IP state. However, there is also dynamic state. For instance, when a firewall blocks incoming traffic, it must not stop data on established outgoing TCP connections after a restart. In NEWTOS, the filter can recover this dynamic state, for instance, by querying the TCP and UDP servers.

The biggest challenge is recovering TCP. Besides the 4-tuple part of the state, it has a frequently changing part for congestion avoidance and reliable transport. In fact, all unacknowledged data are part of this state. Although preserving such state for recovery is difficult, research in this area shows how to design such system components [47]. In our design, we isolate the parts that are difficult to recover (TCP) from those we know how to restart, thus improving overall dependability. The ability to recover most of the network stack (even if we cannot recover all) is much better than being able to recover none of it and vastly better than a bug bringing the entire system to a grinding halt. Note that not being able to recover the state of TCP means only that existing connections break. Users can immediately establish new ones. NEWTOS survives attacks similar to the famous ping of death [58] without crashing the entire system. Also, it does not become disconnected when the packet filter crashes, neither does it become vulnerable to attacks after it restarts, since its configuration is preserved.

In addition, it is possible to update each component independently without stopping the whole system as long as the interface to the rest of the system remains unchanged. In fact, shutting down a component gracefully makes restarting much simpler as it can save its state and announce the restart to other parts of the stack in advance.

Component      Ability to restart
Drivers        No state, simple restart
IP             Small static state, easy to restore
UDP            Small state per socket, low frequency of change, easy to store safely
Packet filter  Static configuration, easy to restore; information about existing connections is recoverable
TCP            Large, frequently changing state for each connection, difficult to recover; easy to recover listening sockets

Table 2.1: Complexity of recovering a component

We are confident that all servers of our network stack can converge to a consistent state for an update since they satisfy the conditions presented by Giuffrida et al. in [46]. In November 2011, Microsoft announced a critical vulnerability [21] in the UDP part of the Windows networking stack. The vulnerability allows an intruder to hijack the whole system. In this respect, NEWTOS is much more resilient. First, hijacking an unprivileged component does not automatically open doors to the rest of the system. Second, we are able to replace the buggy UDP component without rebooting. Given the fact that most Internet traffic is carried by the TCP protocol, this traffic remains completely unaffected by the replacement, which is especially important for server installations. Incidentally, restartability of core components proved very valuable during development of the system since each reboot takes some time and resets the development environment.

2.5.1 The Internals

Nowadays, multigigabit networks present a challenge for many software systems, therefore we want to demonstrate that a multiserver system handles multigigabit rates. We replaced the original MINIX 3 stack by LwIP [56] because LwIP is easier to split and modify. Although LwIP is primarily designed for size rather than high performance (it targets mostly embedded devices), it is a clean and portable implementation of the TCP/IP protocol suite. We use the NetBSD packet filter (PF) and we heavily modified the driver for the family of Intel PRO/1000 gigabit network adapters. To separate the IP part, we only had to change the place where LwIP does the routing for outgoing packets. Although virtually all gigabit network adapters provide checksum offloading and TCP segmentation offloading (TSO: the NIC breaks one oversized TCP segment into small ones), LwIP does not support them out of the box. We changed the LwIP internals to support these optimizations. Although this improves the performance of LwIP dramatically, the TCP code requires a complete overhaul if we want it to be as efficient as, say, the Linux network stack. Even so, we will show that the performance of our design is competitive. We did not port the network stack from Linux or any BSD flavor because these depend on the monolithic environment (memory management, etc.) and changing the stack to our needs would likely severely compromise its performance.

Figure 2.3: Asynchrony in the network stack. Application processes (Proc) and the drivers (DRV) communicate with the SYSCALL (SC), TCP, UDP and IP servers via synchronous IPC, while TCP, UDP, IP and PF communicate among themselves through the asynchronous channels.

Figure 2.3 shows the placement of PF within the stack. Placing PF in a T junction makes it easier to support both post- and pre-routing rules, and to restart PF on a crash (see Section 2.5.4). In addition, in this design IP remains the only component that communicates with drivers. Although this setup puts more performance pressure on the IP server since it must hand off each packet to another component three times, IP is not the performance bottleneck of the stack, even with the extra work.

2.5.2 Combining Kernel IPC and Channels IPC

In our current implementation, the servers which interface with user space and drivers need to combine channel IPC with kernel IPC, as the kernel converts interrupts to messages to the drivers. Similarly, the system calls from user applications are also kernel IPC messages. Therefore we combine the kernel call which monitors memory writes with a non-blocking receive, that is, before we really block or after we wake up, we check if there is a pending message. Whenever there is one, we deliver it when we return from the kernel call. Of course, we do not block at all if we find a message. Because kernel IPC from other cores is accompanied by an interprocessor interrupt (IPI) when the destination core is idle, the IPI breaks the memory write monitoring even if no write to the monitored location occurred.

Note that unlike in a monolithic design where system calls are kernel calls, system calls in a multiserver system are implemented as messages to servers. To detach the synchronous POSIX system calls from the asynchronous internals of NEWTOS, the applications' requests are dispatched by a SYSCALL server. It is the only server which frequently uses the kernel IPC. Phrased differently, it pays the trapping toll for the rest of the system. Nonetheless, the work done by the SYSCALL server is minimal: it merely peeks into the messages and passes them to the servers through the channels. The server has no internal state, and restarting it in the case of a failure is trivial. We return errors to the system calls and ignore old replies from the servers. Figure 2.3 shows the connections of the SYSCALL (SC) server to the rest of the network stack. We use these connections only for control messages. The actual data bypass the SYSCALL server, as opening a socket also exports a shared memory buffer to the application where the servers expect the data.

Our C library implements the synchronous calls as messages to the SYSCALL server, which blocks the user process on receive until it gets a reply. Although this is a convenient way to implement POSIX system calls, some applications may prefer other arrangements. Extending the channels from inside the system to the user space allows applications to bypass the overhead of the synchronous calls by opening channels directly to the servers.
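As an illustration of this split between control and data, a C library stub might pack a socket call into a small message for the SYSCALL server while the payload travels through the exported socket buffer. The message layout and the kipc_sendrec() primitive below are assumptions, not the real NEWTOS ABI.

    /* Hypothetical client-side stub: a POSIX-like call becomes a message. */
    #include <stdint.h>
    #include <sys/types.h>

    enum sc_op { SC_SEND = 1, SC_RECV, SC_SELECT, SC_CLOSE };

    struct sc_msg {
        uint32_t op;        /* which socket operation                    */
        int32_t  fd;        /* socket descriptor                         */
        uint64_t buf_off;   /* offset into the per-socket shared buffer  */
        uint32_t len;
        int32_t  result;    /* filled in by the reply                    */
    };

    /* Assumed kernel IPC primitive: send the message and block on receive
     * until the SYSCALL server replies into the same buffer. */
    extern int kipc_sendrec(int endpoint, struct sc_msg *msg);
    extern int SYSCALL_EP;

    ssize_t sc_send(int fd, uint64_t buf_off, uint32_t len)
    {
        struct sc_msg m = { .op = SC_SEND, .fd = fd,
                            .buf_off = buf_off, .len = len };
        if (kipc_sendrec(SYSCALL_EP, &m) != 0)
            return -1;
        return m.result;    /* the data itself travels via the shared buffer */
    }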

2.5.3 Zero Copy

By using channels, shared pools and rich pointers, we can pass data through the system without copying them from component to component as is traditionally done in multiserver systems. Any server that knows the pool described in the pointer can translate the rich pointer into a local one to access the data. Because modern network interface cards (NICs) assemble packets from chunks scattered in memory, monolithic systems pass packets from one networking layer to another as a chain of these chunks. Every protocol prepends its own header. The payload is similarly scattered, especially when the packets are large (for example, when the network allows jumbo frames or TSO is enabled). In NEWTOS, we pass such a chain as an array allocated in a shared pool filled with rich pointers.

We emphasize that zero copy makes crash recovery much more complicated. Unlike a monolithic system where we can free the data as soon as we stop using them, in our case, the component that allocated the data in a pool must free them. This means that we must report back when it is safe to free the data—almost doubling the amount of communication. Worse, after a server recovers from a crash, the other servers must find out what data are still in use and which should be freed. To the best of our knowledge, ours is the first system capable of restarting components in a multiserver system stack with zero copy communication throughout.

To further improve reliability, we make the data in the pools immutable (like in FBufs [55]). Phrased differently, we export all pools read-only. Therefore each component which needs to change data must create a new copy. For instance, this is done by IP when it places a partial checksum in the TCP and UDP headers of outgoing packets. As the headers are tiny, we combine them with the IP headers in one chunk.
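A packet chain in this scheme can be represented as a small array of rich pointers allocated in a shared pool, with each protocol prepending its header chunk at the front. The sketch below reuses the hypothetical rich_ptr type from Section 2.4 and is only illustrative.

    /* Sketch of a zero-copy packet chain: an array of rich pointers.
     * Assumes struct rich_ptr from the earlier sketch. */
    #define MAX_CHUNKS 8

    struct pkt_chain {
        uint32_t        nchunks;
        struct rich_ptr chunk[MAX_CHUNKS];  /* chunk[0] = headers, rest = payload */
    };

    /* Prepend a header chunk; returns -1 when the chain is full. */
    static int pkt_prepend(struct pkt_chain *p, struct rich_ptr hdr)
    {
        if (p->nchunks == MAX_CHUNKS)
            return -1;
        for (uint32_t i = p->nchunks; i > 0; i--)
            p->chunk[i] = p->chunk[i - 1];  /* shift existing chunks back */
        p->chunk[0] = hdr;
        p->nchunks++;
        return 0;
    }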

2.5.4 Crash Recovery

Before we can recover from a crash, we must detect it. In NEWTOS, as in MINIX 3, all system servers are children of the same reincarnation server which receives a signal when a server crashes, or resets it when it stops responding to periodic heartbeats. More details on crash detection in MINIX 3 are presented in [73]. A transparent restart is not possible unless we can preserve the server's state and we therefore run a storage process dedicated to storing interesting state of other components as key and value pairs. We start each server either in fresh start or in restart mode so the process knows whether it should try to recover its state or not. It can request the original state from the storage component. If the storage process itself crashes and comes up, every other server has to store its state again. Recovering from a crash of other components is very different. When a system component crashes and restarts, it must tell everyone it wants to talk to that it is ready to set up communication channels and to map whatever pools it needs. At that point, its neighbors must take action to discover the status of requests which have not completed yet. All the state a component needs to restart should be in the storage server.
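Conceptually, the storage process offers a small key-value interface. The sketch below shows how a server such as UDP might use it; the function names are assumptions, not the actual NEWTOS interface.

    /* Hypothetical interface to the storage process that keeps recoverable
     * state as key-value pairs. */
    #include <stdint.h>
    #include <stdio.h>

    extern int state_store(const char *key, const void *val, size_t len);
    extern int state_restore(const char *key, void *val, size_t len);

    struct udp_socket_state {
        uint32_t local_addr, remote_addr;
        uint16_t local_port, remote_port;
    };

    /* Called by the UDP server whenever a socket is (re)configured. */
    static void udp_save_socket(int sock_id, const struct udp_socket_state *st)
    {
        char key[32];
        snprintf(key, sizeof(key), "udp.sock.%d", sock_id);
        state_store(key, st, sizeof(*st));
    }

Because the saved state changes rarely (a socket's 4-tuple, the IP configuration, the packet filter rules), the overhead of mirroring it into the storage server stays low.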

Drivers State of the art self-healing OSs, like MINIX 3, previously demonstrated restarting of simple network drivers [75], but MINIX 3 feeds only a single packet to a driver at a time. In contrast, we asynchronously feed as much data as possible to be able to saturate multigigabit links and use more complex features of the hardware. In addition, our drivers do not copy the packets to local buffers. As a result, the IP server must wait for an acknowledgment from the driver that a packet was transmitted before it is allowed to free the data. IP knows which packets were not yet accepted by the driver for processing from the state of the queue. It is likely that all packets except the last one were successfully transmitted, but not the last one (as the driver perhaps crashed while processing it). Although network protocols are designed to deal with lost packets, we do not want to drop more than necessary. In case of doubt, we prefer to send a few duplicates which the receiver can decide to drop. Therefore IP resubmits the packets which it thinks were not yet transmitted. A faulty driver may make the device operate incorrectly or stop working altogether. This can also be a result of differences between specification and implementation of the hardware. It is difficult to detect such situations. When we stop receiving packets, it can either be because nobody is sending anything, or because the device stopped receiving. As a remedy, we can detect that a driver is not consuming packets for a while or that we do not receive replies to echo packets and then restart the driver proactively. However, these techniques are outside the scope of this paper and unless a driver crashes, we cannot currently recover from such situations.

IP To recover the IP server, it needs to store its configuration, the IP addresses of each device and routing information like the default gateway, etc. This information changes rarely on the network edge. Because IP allocates a pool which the drivers use to receive packets, the drivers must make sure that they switch these pools safely, so the devices do not DMA to freed memory. It turned out that we must reset the network cards since the Intel gigabit adapters do not have a knob to invalidate their shadow copies of the RX and TX descriptors. Therefore a crash of IP means a de facto restart of the network drivers too. We believe that restart-aware hardware would allow less disruptive recovery. Similarly, TCP and UDP may have packets still allocated in the old receive pool and they must keep a reference to it until all the packets are delivered or discarded. On the other hand, neither can free the transmitted packets until they know that no other component holds a reference to the memory. Our policy in both cases is that we resubmit the requests to IP. We generate new identifiers so that we can ignore replies to the original requests and only free the data once we get replies to the new ones. This also means that we may transmit some duplicates. However, in the case of TCP, it is much more important that we quickly retransmit (possibly) lost packets to avoid triggering error detection and congestion avoidance. This helps to quickly recover the original performance after recovering the functionality.

UDP The UDP server saves in the storage server which sockets are currently open, to what local address and port they are bound, and to which remote pair they are connected (preset for send). It is easy to recreate the sockets after the crash. However, UDP does not store whether a process was blocked on a socket operation and if so, which one (and doing so would result in significant overhead). On the other hand, the SYSCALL server remembers the last unfinished operation on each socket and can issue it again. This is fine for select and recv variants as they do not trigger any network traffic. In contrast, send variants will result in packets sent. As mentioned previously, we tend to prefer sending extra data. Of course, we can also return an error to the application instead, for example, zero bytes were written.

TCP Much like UDP, TCP also saves in the storage server the sockets that are open. In addition, TCP saves in what state each connection is (listening, connecting, established, etc.) so the packet filter can restore connection tracking after its crash. TCP can only restore listening sockets since they do not have any frequently changing state, and returns an error to any operation the SYSCALL server resubmits except listen.

Packet filter To restore the optional packet filter we need to recover the configuration (much like restoring the IP configuration) and the open connections (much like restoring TCP or UDP sockets), and it stores this information in the storage server. Since IP must get a reply for each request before it can pass a packet further down the stack, it can safely resubmit all unfinished requests without packet loss or generating duplicate traffic.

   Setup                                                      Peak bitrate
1  MINIX 3, 1 CPU only, kernel IPC and copies                 120 Mbps
2  NEWTOS, split stack, dedicated cores                       3.2 Gbps
3  NEWTOS, split stack, dedicated cores + SYSCALL             3.6 Gbps
4  NEWTOS, 1-server stack, dedicated core + SYSCALL           3.9 Gbps
5  NEWTOS, 1-server stack, dedicated core + SYSCALL + TSO     5+ Gbps
6  NEWTOS, split stack, dedicated cores + SYSCALL + TSO       5+ Gbps
7  Linux, 10GbE interface                                     8.4 Gbps

Table 2.2: Peak performance of outgoing TCP in various setups

2.6 Evaluation

We evaluate our multiserver design and present the benefits of the principles we are advocating for. To demonstrate the competitiveness of our design, we evaluate it on a 12-core AMD Opteron 6168 processor (1.9 GHz) with 4 GB RAM and 5 Intel PRO/1000 PCI Express gigabit network adapters. We are limited by the number of PCIe slots in our test machine. We use the standard 1500 byte MTU in all configurations.

2.6.1 TCP Performance

Table 2.2 shows the peak performance of our TCP implementation in various stages of our development along with the original MINIX 3 and Linux performance. The table is ordered from the least performing at the top to the best performing at the bottom. The first line shows that a fully synchronous stack of MINIX 3 cannot efficiently use our gigabit hardware; on the other hand, line 4 shows that a single server stack which adopts our asynchronous channels can saturate 4 of our network interfaces and more with additional optimizations (line 5). Line 3 presents the advantage of using the SYSCALL server, in contrast to line 2, to decouple synchronous calls from asynchronous internals. Comparing lines 3 and 4, we can see the effect of communication latency between the extra servers in the split stack. Using TSO we remove a great amount of the communication and we are able to saturate all 5 network cards while allowing parts of the stack to crash or be live-updated. It is important to mention that Linux also cannot saturate all the devices without using TSO, which demonstrates that not only the architecture of the stack but also its ability to offload work to network cards and the reduction of its internal request rate (TCP window scaling option, jumbo frames, etc.) play the key role in delivering the peak performance. To put the performance in perspective, line 7 shows the maximum we obtained on Linux on the same machines with standard offloading and scaling features enabled using a 10GbE adapter, which neither MINIX 3 nor NEWTOS supports. We carried out our experiments with one driver per network interface, however, to evaluate scalability of the design we also used one driver for all interfaces, which is similar to having one multi-gigabit interface.

           Total  TCP  UDP  IP  PF  Driver
Crashes    100    25   10   24  25  16

Table 2.3: Distribution of crashes in the stack

Fully transparent crashes      70
Reachable from outside         90 (+ 6 manually fixed)
Crash broke TCP connections    30
Transparent to UDP             95
Reboot necessary                3

Table 2.4: Consequences of the crashes

Since the work done by the drivers is extremely small (filling descriptors and updating tail pointers of the rings on the device, polling the device), coalescing the drivers into one still does not lead to an overload. On the contrary, the busy driver reduces some latency since it is often awake and ready to respond. We believe that on a heavily threaded core like that of the Oracle Sparc T4, we would be able to run all the drivers on a single core, using the threads as the containers in which the drivers can block, without sacrificing more cores and still delivering the same performance and isolation of the drivers.

2.6.2 Fault Injection and Crash Recovery

To assess the fault tolerance of our networking stack we have injected faults into the individual components. We used the same fault injection tool as that used by the authors of the Rio file cache [111], Nooks [147] and MINIX 3 [75] to evaluate their projects. We summarize the distribution of the faults in Table 2.3 and the effects the crashes have in Table 2.4. During each test run we injected 100 faults into a randomly selected component. When the component did not crash within a minute we rebooted the machine and continued with another run. We collected 100 runs that exhibited a crash and we observed the damage to the system. While injecting the faults we stressed the components with a TCP connection and periodic DNS queries. The tool injects faults randomly so the faults are unpredictable. Since some of the code does not execute during normal operation and the fraction of active code differs, some components are more likely to crash than the others.

The most serious damage happens when the TCP server crashes. In these cases all established connections disappear. On the other hand, since we recover the sockets which listen for incoming connections, we are able to immediately open new ones to our system.

Figure 2.4: IP crash. The plot shows the bitrate (Mbps) of a single TCP connection over time (s).

We used OpenSSH as our test server. After each crash we tested whether the active ssh connections kept working, whether we were able to establish new ones and whether the name resolver was able to contact a remote DNS server without reopening the UDP socket. We were able to recover from the vast majority of the faults, mostly transparently. After the 25 TCP crashes, we were able to directly reconnect to the SSH server in 19 of those cases. In 3 of the cases we had to manually restart the TCP component to be able to reconnect to the machine. In two other cases a faulty IP or a not fully responsive driver was the reason why it was impossible to connect to the machine. Manually restarting the driver or IP, respectively, solved the problem. In three cases we had to reboot the system due to hangs in the synchronous part of the system which merges sockets and file descriptors for select and has not been modified yet to use the asynchronous channels we propose. This suggests that the reliability of other parts of the system would also greatly benefit from our design changes. In two cases, faults injected into a driver caused a significant slowdown but no crash. It is very likely that the faults misconfigured the network cards since the problem disappeared after we manually restarted the driver, which reset the device.

In contrast to solid production-quality systems like Linux or Windows, NEWTOS is a prototype and we do not have an automated testing infrastructure; we thus had to run the fault injection tests manually. Therefore we were not able to make a statistically significant number of runs. However, the results correlate with our expectations.

Figure 2.5: Packet filter crash. The plot shows the bitrate (Mbps) over time (s).

2.6.3 High Bitrate Crashes

A random crash during a high frequency of operations can cause serious damage to the network traffic due to a loss of many packets. Figure 2.4 presents a bitrate sample of a connection between iperf on NEWTOS and on Linux. We used tcpdump to capture the trace and Wireshark to analyze it. Using a single connection allows us to safely capture all packets to see all lost segments and retransmissions. The trace shows a gap when we injected a fault in the IP server 4 s after the start of the connection. We did not observe any lost segments and only one retransmission from the sender (due to a missing ACK and a timeout) which had already been seen by the receiver. The connection quickly recovered its original bitrate. As we already mentioned before, due to the hardware limitations, we have to reset the network card when IP crashes. This causes the gap, as it takes time for the link to come up again, and likewise for the driver. Therefore, all the traces we inspected after a driver crash look very similar.

A similar sample trace in Figure 2.5 shows that a packet filter (PF) crash is barely noticeable. Due to our design, we never lose packets because IP must see a reply from PF, otherwise it knows that the packet was not processed. The trace shows two crashes and immediate recovery to the original maximal bitrate while recovering a set of 1024 rules.

2.7 Related work

Our work is based on previous research in operating systems and it blends ideas from other projects with our own into a new cocktail. Although the idea of microkernel-based multiserver systems is old, historically they could not match the performance of the monolithic ones because they were not able to use the scarce resources efficiently. The current multicore hardware helps to revive the multiserver system idea. In a similar vein, virtual machines (invented in the 1960s) have recently seen a renaissance due to new hardware.

Monolithic systems, in their own way, are increasingly adopting some of the multiserver design principles. Some drivers and components like file systems can already run mostly in user space with kernel support for privileged execution. In addition, kernel threads are similar to the independent servers. The scheduler is free to schedule these threads, both in time and space, as it pleases. The kernel threads have independent execution contexts in the privileged mode and share the same address space to make data sharing simple, although they require locks for synchronization. Parts of the networking stack run synchronously when executing system calls and partly asynchronously in kernel threads, which may execute on different cores than the application which uses them, depending on the number and usage of the cores. Coarse grained locking has significant overhead; on the other hand, fine grained locking is difficult to implement correctly.

Interestingly, to use the hardware more efficiently, the kernel threads are becoming even more distinct from the core kernel code; they run on dedicated cores so as not to collide with the execution of user applications and with each other. An example is FlexSC's [139; 140] modification of Linux. It splits the available cores into ones dedicated to run the kernel and ones to run the applications. In such a setup, the multithreaded applications can pass requests to the kernel asynchronously and exception free, which reduces contention on some, still very scarce, resources of each core.

Yet another step towards a true multiserver system is the IsoStack [135], a modification of AIX. Instances of the whole networking stack run isolated on dedicated cores. This shows that monolithic systems get a performance boost by dedicating cores to a particularly heavy task with which the rest of the system communicates via shared memory queues. Thus it is certainly a good choice for multiserver systems, which achieve the same merely by pinning a component to a core without any fundamental design changes. The primary motivation for these changes is performance and they do not go as far as NEWTOS, where we split the network stack into multiple servers. In contrast, our primary motivation is dependability and reliability, while the techniques presented in this paper allow us to also achieve competitive performance.

We are not the first to partition the OS in smaller components. Variants less extreme than multiserver systems isolate a smaller set of OS components in user-level processes—typically the drivers [61; 94]. Barrelfish [31] is a so-called multikernel, a microkernel designed for scalability and diversity, which can serve as a solid platform for a multiserver system. We are planning to port our network stack on top of it. Hypervisors are essentially microkernels which host multiple isolated systems. Colp et al. [43] show how to adopt the multiserver design for enhanced security of Xen's Dom0. Periodic microreboots and isolation of components reduce its attack surface.

It is worth mentioning that all the commercial systems that target safety and security critical embedded systems are microkernel/multiserver real-time operating systems like QNX, Integrity or PikeOS. However, all of them are closed-source proprietary platforms, therefore we do not compare to them. Unlike NEWTOS, they target very constrained embedded environments, whereas we show that the same basic design is applicable to areas where commodity systems like Windows or Linux dominate.

2.8 Conclusion and Future Work

In this paper we present our view on future dependable operating systems. Our design excludes the kernel from IPC and promotes asynchronous user space communication channels. We argue that multiserver systems must distribute the operating system itself to many cores to eliminate its overheads. Only such asynchronous multiserver systems, in which each component can run whenever it needs to, will perform well while preserving their unique properties of fault resilience and live-updatability.

We describe the challenges of designing the system and present an implementation of a networking stack built on these principles. The amount of communication and data our stack handles as a result of high networking load suggests that our design is applicable to other parts of the system.

We admit that we lose many resources by dedicating big cores of current mainstream processors to system components, and this must be addressed in our future work. We need to investigate how to efficiently adapt the system to its current workload, for instance by coalescing lightly utilized components on a single core and dedicating cores to heavily used ones. Equally importantly, we are interested in how we can change future chips to best match our needs. For example, can some of the big cores be replaced by many simpler ones to run the system?

Acknowledgments

This work has been supported by the European Research Council Advanced Grant 227874. We would like to thank Arun Thomas for his priceless comments on early versions of this paper.

Chapter 3

When Slower is Faster
On Heterogeneous Multicores for Reliable Systems

Abstract

Breaking up the OS in many small components is attractive from the dependability point of view. If one of the components crashes or needs an update, we can replace it on the fly without taking the system down and without sacrificing performance. The question is how to achieve this efficiently, without wasting resources unnecessarily. In this paper, we show that heterogeneous multicore architectures allow us to do so by executing each of the OS components on the most suitable core. Thus, we can run OS components that require high single-thread performance on high-performance cores, while components that are less performance critical run on wimpy cores. Moreover, as current trends suggest that there will be no shortage of cores, we can, in essence, give each component its own dedicated core when performance is critical and consolidate multiple functions on a single core (saving expensive resources) when performance is less of an issue. Using frequency scaling of x86 cores to emulate these different cores, we evaluate our design on the most demanding subsystem of our operating system—the network stack. We show that we can deliver better throughput with slower and, likely, less power hungry cores. For instance, we support network processing at speeds close to 10 Gbps (the maximum speed of our NIC), even if we scale all the cores of the network stack down to an average of just 60% of the maximum core speed. Moreover, using cores at as little as 200 MHz, we still achieve 1.8 Gbps, which may be enough for many applications.


3.1 Introduction

More and more hardware vendors are developing heterogeneous multicore architectures. Well known examples include the so-called big.LITTLE [2] ARM, the NVIDIA Tegra-3 [22], its recently announced successor Tegra 4, and the x86-compatible Xeon Phi [19]. The big.LITTLE ARM combines two big Cortex-A15 cores with two little Cortex-A7 cores on the same die, and Samsung recently announced a 4 + 4 version [23]. The Tegra-3 is a Cortex-A9-based quad-core CPU that includes a fifth "companion" Cortex-A9 that is slower (capped at 500 MHz) and less power hungry. For sheer number of cores, the 50+ core x86-compatible Intel Xeon Phi processor is especially impressive. It serves as an extension of many little cores to accompany the host's big cores and lives on a separate PCIe card.

In all three cases, the different cores share a large subset of the instruction set architecture (ISA), so that the same code can easily run on any of the cores in the system. The main difference between the cores is their microarchitecture, which is designed for different optimal operation points. This means that the LITTLE slower, simpler, in-order cores (designed for power efficiency at low frequencies) cannot deliver performance equal to the big ones, which are out-of-order and operate at higher frequencies. The same is true for the Tegra and Xeon Phi. For instance, the host x86 processors feature out-of-order cores with a deep pipeline while the Xeon Phi cores are much simpler, in-order Pentium cores with shallow pipelines to allow for efficient 4-way hyper-threading. In addition, they feature new vector instructions to support scientific workloads.

The research community has advocated such heterogeneity for many years [88] to make processing more efficient, in terms of both performance and power. However, the focus was primarily on applications, leaving the operating system by the wayside. This is remarkable, because the operating system performs a significant amount of work on behalf of the applications [122; 110]. Moreover, the changes to the system remain mostly limited to making execution on different cores possible and to finding the best schedule. Exceptions include the proposal by Strong et al. [144] to migrate long-running system calls to system cores—cores more suitable for running OS workloads. FlexSC [139], meanwhile, aims to remove the overhead of switching between applications and the system by running each on different cores. As a side effect, the system can run on core(s) that differ from those that host applications.

The NEWTOS operating system described in this paper is a UNIX-like multiserver system that offers these major benefits:

High reliability: For instance, our operating system supports dynamic updates without any downtime and survives crashes of core OS components. We described these aspects in [76; 46], and [66], and will not discuss them further in this paper.

High performance: Building on a design described in [76], we show that we support network processing at 10 Gbps on COTS hardware (this paper).

The unique features of multiserver systems, composed of independent user space servers, typically come at the cost of performance. With many processes involved, the communication and the context switching needed to share the processor constitute a significant overhead. NEWTOS [76], a high-performance derivative of MINIX 3, shows that it is possible to mitigate this overhead by dedicating cores to system servers, which communicate through asynchronous channels. With their own cores, the system servers can run whenever needed from a warm cache, without having to compete with other processes or wait for the kernel to schedule them. Moreover, the system's asynchronous communication channels allow the system servers to work independently and thus increase the parallelism within the system and streamline the processing. As a result, we were able to support TCP-based network processing at up to 5 Gbps [76]. Since then, we have further improved our design. We will show in Section 3.4 that we now support TCP at close to 10 Gbps—the maximum speed of our network card.

The cores of common platforms are designed for generic usage and overprovisioned [108] for running OS code. Dedicating cores results in a very coarse-grained resource assignment, which leads to inefficient use of the available hardware. Looking at current trends, we anticipate more designs in the big.LITTLE fashion, which will have plenty of smaller, slower, in-order cores with a higher number of threads, accompanied by big, fast cores that can efficiently use the instruction-level parallelism of application code. However, the big cores will become a minority.

In this paper, we explore how such architectures can help to balance performance and resource consumption. Specifically, we show that we can run the OS components on multiple slower cores, while still achieving high performance. Alternatively, the system can consolidate components on a few cores (saving power and resources) and still achieve reasonable performance. Our contributions are:

1. We explore the hardware design space by emulating future platforms on current hardware, using frequency scaling to find out how fast the processors should be and what type of cores would suit systems best.

2. We show that our system can deliver the same or better performance with smaller, simpler and slower cores—without compromising reliability. Our case study shows it is suitable for high-speed networking.

3. The system has the potential to dynamically reconfigure itself to use the most appropriate resources and free resources it does not need for a particular workload.

In the rest of the paper we discuss our motivations and the background in Section 3.2. We present details of the NEWTOS design in Section 3.3. We explore the design space and evaluate various setups of our system in Section 3.4, and we put it in perspective of related work in Section 3.5. Finally, we conclude in Section 3.6.

3.2 Big cores, little cores and combinations

Heterogeneous processor architectures are rapidly becoming popular. In this section, we focus on Intel products and sketch some of the properties of the architectures and analyze some trends in this field.

3.2.1 BOCs and SICs

                              Core i7 Bloomfield (45nm)   Knights Ferry (45nm)   Xeon Phi (22nm)
Die size                      263 mm2                     est. 700 mm2           not released
Cores / threads               4 / 8                       32 / 128               50+ / 200+
Die area per core / thread    65.7 / 32.9 mm2             21 / 5.5 mm2           not released
Max clock speed               3.33 GHz                    1.2 GHz                1 GHz
LL cache size                 8 MB                        8 MB                   25+ MB
Microarchitecture             out-of-order                in-order               in-order

Table 3.1: Comparison of Core i7, Knights Ferry and Xeon Phi

We start our discussion with a comparison of fast cores and slow cores. Specifically, the first two columns of Table 3.1 compare the Intel Core i7 "Bloomfield" with the Knights Ferry processor. The Core i7 is a prime example of a big out-of-order core (BOC) with a design that is geared for high single-threaded throughput and produced with 45 nm technology. In contrast, the cores on the Knights Ferry (45 nm) are much simpler in-order cores (SICs) that provide only a fraction of the i7's performance. Given the estimated die size of the Knights Ferry, the table shows the space reduction of the simple cores compared to the big i7 cores. Compared to the i7, the Knights Ferry die hosts 8× the number of cores and 16× the number of threads. It is worth noting that the difference in die size per core is 3× (and 6× per thread). While the cache size per core is obviously smaller, threaded cores can compensate for this [115]. Finally, the difference in the peak clock speed is equally remarkable.

The last column of Table 3.1 shows data for the successor of Knights Ferry, a recently released product called Xeon Phi. Its core count is even larger, but its die size is not public. Intel markets it as a "50+ core beast" and released up to 62 cores on a single die. With each core hosting 4 threads of execution, this amounts to 200+ threads on a single chip.

                     Transistor count   Die size
4-core i7 + GPU      1.4 bil.           160 mm2
62-core Xeon Phi     5 bil.             not released

(a) Transistor count

# i7 cores   # i7 threads   # Phi cores   # Phi threads
    4              8             44            176
    8             16             27            108
   12             24             10             40

(b) Core i7 and Xeon Phi configurations

Table 3.2: Given the known transistor counts shown in (a) and a 22 nm production process, we can roughly project the options for different configurations that merge Core i7 and Xeon Phi cores (b).
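As a rough check, the projection in Table 3.2b can be reproduced from the transistor counts in (a); this is our own back-of-the-envelope reconstruction and ignores uncore and GPU differences:

\[ t_{\mathrm{i7}} \approx \tfrac{1.4}{4} = 0.35\ \text{billion per i7 core}, \qquad t_{\mathrm{Phi}} \approx \tfrac{5}{62} \approx 0.08\ \text{billion per Phi core}, \]
\[ n_{\mathrm{Phi}} \approx \frac{5 - 0.35\, n_{\mathrm{i7}}}{0.08}, \qquad \text{giving } n_{\mathrm{Phi}} \approx 45,\ 27\ \text{and}\ 10 \text{ for } n_{\mathrm{i7}} = 4,\ 8,\ 12, \]

which matches the 44/27/10 Phi-core configurations in the table up to rounding.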

Interestingly, the Xeon Phi is still a cache-coherent design, unlike one of its research predecessors—the single chip cloud (SCC).

3.2.2 Configurations of BOCs and SICs

It is likely that future designs will see interesting new combinations of BOCs and SICs. For instance, rather than keeping it as a separate co-processor, Intel may well merge its Xeon Phi with other Intel cores on a single die [91], in the same way that GPUs and general purpose cores have merged on a single die. What sort of processor should we expect? Clearly, there are many options. In this section, we explore possible combinations in an approximate manner.

Table 3.2 shows different combinations of Core i7 "Sandy Bridge" and Xeon Phi cores, taking into account the number of transistors for a quad-core Core i7 in 22 nm technology as well as the number of transistors of a 62-core Xeon Phi produced with the same technology. The Core i7 die may also contain an integrated GPU. Given these transistor counts, Table 3.2b shows different configurations that would fit on a die with mixed cores. The simple division also accounts for each core's cache share, as caches take up a big portion of the die size. Each line represents a configuration with the 5 billion transistor budget of the Xeon Phi die where some of the 62 cores are replaced by i7 cores.

Finally, for the sake of completeness, Table 3.2b also shows the resulting number of threads. The number of threads matters, because we will see that for OS functionality it is often not necessary to dedicate a full core to each component. Instead, a simple container for a process' context is good enough, as long as we can let the hardware do the context switching (and not the software) and we can suspend and resume efficiently with instructions like MWAIT and MONITOR.

[Figure 3.1 (diagram): the PF, TCP, UDP, IP, SC (syscall) and DRV (driver) processes of the network stack, connected by synchronous kernel IPC and asynchronous channels.]

Figure 3.1: Design of the NEWTOS network stack

A hardware thread is well suited to serve as such a container. It has a set of replicated registers and, depending on the architecture, the hardware either switches threads automatically when the active one stalls or tries to schedule a different thread every cycle. As consolidating multiple components on a single core saves resources, more threads are attractive. Moreover, many platforms today still do not offer enough cores for us to be able to dedicate one to each component—although the number of cores per die is growing steadily. By using threads instead, we can implement our design even on today's platforms. For instance, NEWTOS has about 30 system processes in its default installation, of which about 10 are important for performance. These include the process manager, the memory manager, the storage stack, the network stack, the file systems, and the disk and network drivers. The calculation in Table 3.2b shows that even with twelve i7 cores, there would be enough threads to dedicate one to each of our system's processes. Note that, based on previous research [108; 96], the Xeon Phi cores are most likely a better match for the system processes than the i7 cores. The platform would therefore still offer plenty of big cores for applications, while the small multithreaded cores would optimize the resources for the system.

An alternative to chips preconfigured with a fixed number of BOCs and SICs are architectures that can merge smaller cores into bigger ones [80]. Another example is MorphCore [84], an architecture which can switch a core between an i7-like BOC and a heavily threaded Phi-like SIC at run time.
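To make the container idea more concrete, the sketch below shows how a core can be parked on a channel doorbell with MONITOR/MWAIT so that a plain memory write from another core wakes it up without an interprocessor interrupt. This is an illustrative sketch only, assuming an x86 target and GCC-style inline assembly; since MWAIT is privileged on Intel parts (see Section 3.3.1), in NEWTOS such a loop would sit behind a kernel call rather than in user space, and the doorbell layout is our own, not the actual NEWTOS interface.

#include <stdint.h>

static inline void cpu_monitor(const volatile void *addr)
{
    /* eax/rax = linear address to arm, ecx = extensions, edx = hints */
    asm volatile("monitor" :: "a"(addr), "c"(0), "d"(0));
}

static inline void cpu_mwait(void)
{
    /* eax = hints (0 = C1), ecx = extensions */
    asm volatile("mwait" :: "a"(0), "c"(0));
}

/* Park the calling core until *doorbell changes from 'old'. */
void park_until_signalled(volatile uint32_t *doorbell, uint32_t old)
{
    while (*doorbell == old) {
        cpu_monitor(doorbell);   /* arm the monitor on the doorbell's cache line */
        if (*doorbell != old)    /* re-check: a write may have raced in between  */
            break;
        cpu_mwait();             /* sleep until the monitored line is written    */
    }
}

/* The sender needs nothing more than an ordinary store after enqueuing,
 * e.g.  q->doorbell++;  which wakes the parked core. */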

3.3 NEWTOS

The crux of this paper is the following: while evaluating the performance of the stack, we realized that it actually delivered higher throughput when we scaled the frequency of some cores down. In addition, we found that the performance of fairly slow cores is good enough for many use cases. We present an OS that explicitly exploits these properties.

Specifically, NEWTOS is a high-performance clone of MINIX 3 that provides the same reliability with much reduced overhead. For instance, we completely redesigned the network stack based on new communication principles that allow different components to execute independent requests simultaneously. This distinguishes NEWTOS from other multiserver operating systems and increases the network throughput from hundreds of megabits per second to gigabits.

3.3.1 The NEWTOS network stack

The heart of the NEWTOS network stack is LwIP [56], a simple and portable network stack for embedded systems used by many research projects. Note that this stack is not designed for high performance but rather for a small memory footprint. As a result, its performance is not directly comparable to the highly tuned stacks of commodity systems. Nevertheless, we support network processing at 10 Gbps even though we use slow cores.

One of the main design goals of NEWTOS is reliability. Thus, we allow even core components of the operating system to be replaced on the fly, without taking the system down (and often with no noticeable disruption at all) [67]. For instance, we can replace our implementation of IP or the network driver while keeping all existing network connections. Similarly, if one of its components crashes, the OS recovers automatically and often transparently [76]. To make this possible, we split the stack into several components (TCP, UDP, IP, drivers and packet filters) to reduce the chance that an error in the stack leads to a crash of the entire stack. Likewise, we isolated functions that are easy to restart from those which are not, due to their large dynamic state. Besides IP, TCP, and UDP, the network stack supports an optional BSD packet filter (PF). The syscall server is the component that provides a POSIX interface to user processes. Figure 3.1 shows the individual parts of the stack.

All components in Figure 3.1 other than the syscall server are fully asynchronous, while the syscall server translates synchronous system calls from user processes to asynchronous messages within the stack. The syscall server is the only process of the stack which frequently uses the traditional rendezvous-based communication provided by the kernel. All other components communicate using point-to-point channels, which are shared user space memory queues accompanied by fast signaling. This mechanism is located almost purely in user space to take the kernel out of the loop (removing all overhead due to context switches and pollution of TLBs, caches, and branch predictors).

We take advantage of the x86-specific MWAIT instruction to suspend execution of cores. Thus, we need not send high-overhead interprocessor interrupts, but can wake up a waiting core by a mere memory write. Unfortunately, MWAIT is a privileged instruction in Intel chips¹. If it were not, there would be no need for the kernel in the normal mode of operation. We see this as a hardware deficiency.

Our most efficient communication model runs each component on its own dedicated core, so scheduling is not needed and the component can run at any time out of a warm cache. However, we also allow components to share a core with other processes. In such a case, the scheduler informs the components and they transparently fall back to notifications, a standard mechanism provided by microkernels, which delivers special void messages in a similar way to how devices send interrupts to the processor.

It is useful to emphasize that the key performance problems that plagued multiserver systems in the past have been the high overheads due to context switching and scheduling. While the research community heavily optimized interprocess communication on microkernels like L4 [97] to achieve much better performance, neither of these bottlenecks could ever be eliminated on a uniprocessor. Dedicating a core to each component fixes both.

¹MWAIT is optionally unprivileged in AMD chips starting with family 10h, but we use Intel due to hyper-threading and better scaling.
Further details of the design of the network stack and the fast communication are discussed in [76].
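As an illustration of what such a point-to-point channel can look like, the following sketch shows a single-producer/single-consumer ring in shared memory, with no locks and no kernel involvement on the fast path. The layout and names are our own simplification and do not reflect the actual NEWTOS channel format; a doorbell write (or MWAIT wakeup, as described above) would normally follow a successful send.

#include <stdatomic.h>
#include <stdint.h>

#define SLOTS 256                      /* power of two, one writer, one reader */

struct message { uint64_t type; uint64_t payload[7]; };

struct channel {
    _Atomic uint32_t head;             /* advanced only by the consumer */
    _Atomic uint32_t tail;             /* advanced only by the producer */
    struct message   slot[SLOTS];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int channel_send(struct channel *c, const struct message *m)
{
    uint32_t tail = atomic_load_explicit(&c->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&c->head, memory_order_acquire);
    if (tail - head == SLOTS)
        return -1;                     /* receiver is not keeping up */
    c->slot[tail % SLOTS] = *m;
    atomic_store_explicit(&c->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns 0 on success, -1 if there is nothing to do. */
int channel_recv(struct channel *c, struct message *m)
{
    uint32_t head = atomic_load_explicit(&c->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&c->tail, memory_order_acquire);
    if (head == tail)
        return -1;                     /* empty: keep polling or block */
    *m = c->slot[head % SLOTS];
    atomic_store_explicit(&c->head, head + 1, memory_order_release);
    return 0;
}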

3.3.2 Dynamic reconfiguration

In contrast to monolithic systems, NEWTOS resembles a distributed system. Such systems can embrace diversity and accommodate a changing environment. This is also true for NEWTOS. Each system component can run on a dedicated core or share it with other processes, and the core can be either big or little. Although we dedicate cores for peak performance, we can consolidate processes on a single core or even a thread if they are not used heavily. For instance, most of today's traffic uses the TCP protocol, and dedicating a core to the UDP component is probably overkill. On the other hand, when UDP is used heavily (e.g., for video streaming), NEWTOS can migrate UDP to its own core. Similarly, in many deployments the network stack is not used at all times, or at least not at its peak throughput. Thus, we can pin all its components to a single dedicated core, or even have it share a core with other processes most of the time. When the workload changes and the system detects an overload of some cores, it can redistribute itself to find the best configuration.

We argue that in many cases SICs are best for the operating system, while BOCs are better suited for applications. However, this is not always true and depends on the situation. For instance, for a web server-like workload, running the server processes on threaded SICs probably delivers higher throughput than using a small number of threads on BOCs. Even if it allows the stack to run at high speeds, dedicating the BOCs to the network stack is probably not a good idea and the resources can be used more efficiently, unless we run, for example, a complicated intrusion detection algorithm in the packet filter. Likewise, it seems unwise to sacrifice the BOCs to the storage stack. Storage needs to do many unpredictable lookups with little instruction-level parallelism while briskly delivering data to the applications and writing them back to a disk. It is likely that we can do so on slower cores, saving the BOCs for the applications. Phrased differently, the components of the system should get the resources they need and no more.

Besides good performance, power consumption is also important. Here too, we should provision a system for its peak performance, while using no more resources than needed during quieter times. The system on a heterogeneous platform can find its sweet spot using only a handful of cores. On platforms with fine-grained power gating, the system can turn off the unused cores and thus save power. Likewise, picking the right type of cores is crucial to balance the performance-per-Watt ratio. As we show in the evaluation in Section 3.4, slower cores frequently result in only a small drop in performance, whereas the potential for power savings is significant.

We do not consider scheduling in this paper. As there is a lot of work on scheduling in such environments (discussed in Section 3.5), we are solely interested in the performance and efficiency of different configurations; designing the scheduler for NEWTOS is future work.
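To illustrate the kind of policy such self-redistribution implies, the sketch below shows a periodic check that gives a component its own core when it becomes busy and consolidates it again when it goes quiet. The thresholds, structure and helper functions are hypothetical; NEWTOS does not implement this particular policy and, as noted above, designing the scheduler is future work.

#include <stdbool.h>

struct component {
    const char *name;
    double      utilization;        /* fraction of its current core, 0..1 */
    bool        has_dedicated_core;
};

/* Placement helpers assumed to be provided by the scheduler. */
void migrate_to_free_core(struct component *c);
void share_core_with_others(struct component *c);

static const double DEDICATE_AT    = 0.60;   /* assumed thresholds */
static const double CONSOLIDATE_AT = 0.10;

void rebalance(struct component *comp, int ncomp)
{
    for (int i = 0; i < ncomp; i++) {
        struct component *c = &comp[i];
        if (!c->has_dedicated_core && c->utilization > DEDICATE_AT) {
            /* e.g. UDP during heavy video streaming */
            migrate_to_free_core(c);
            c->has_dedicated_core = true;
        } else if (c->has_dedicated_core && c->utilization < CONSOLIDATE_AT) {
            /* park it with other light components; it falls back to
             * kernel notifications as described in Section 3.3.1 */
            share_core_with_others(c);
            c->has_dedicated_core = false;
        }
    }
}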

3.3.3 Non-overlapping ISA

At this moment, we limit ourselves to heterogeneous architectures with an overlapping ISA [88]. In this section, we argue that by virtue of its design our system has the potential to embrace architectures with different ISAs too. We do not currently have a machine with non-overlapping ISAs on the same processor to evaluate our solution, but we briefly sketch how we can use existing features to support such platforms.

Specifically, we can use NEWTOS' live update functionality to change the version of a component to run on a different architecture. We originally developed live update to allow us to fix buggy components with new, patched versions without the need of shutting the system down. Doing so greatly reduces maintenance of the system, disruption of its operation, and the time between diagnosing a bug and applying the fix. However, we can also replace a component with the same component compiled for a different ISA.

The update is fairly straightforward since both versions are based on the same code. Mere recompilation with different compiler settings produces the desired version. Moreover, the transition from one ISA type to another is simple because it is done only when the state is stable, and the memory layout of data structures on both architectures is likely the same. Finally, we initiate the transition only at the top of the component's main loop, so that we can mostly forget about different layouts of the stack. In case of a discrepancy between the memory layouts of the two ISA versions, we provide an automatically generated transition function [66]. In practice, changing a component to a new architecture is simpler than updating a component to a new version. In contrast to a proper update, both versions for the different ISAs use identical data structures, which may differ in offsets and alignment, but not in the items of the structures. The system may provide a version for different ISAs when installed, or use just-in-time compilation to generate one when needed. If migration between ISAs is frequent, the system can cache a version for each ISA to speed up the migration.

We can use the same mechanism as an optimization for overlapping ISAs too. Some cores may have a feature which allows the system code to run faster. For instance, file systems can take advantage of checksum instructions to verify data read from a disk, or advanced instructions to encrypt the data. In such cases, the system component does not rely on the extra instructions for its correct operation, but can benefit from them if available.
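The "benefit from extra instructions if available" pattern from the last paragraph can be as simple as probing the current core once and picking a function pointer. The sketch below does this for a CRC-32C checksum, using the SSE4.2 CRC32 instruction when present and plain C otherwise; it assumes an x86 target and GCC/Clang builtins, and the routine names are ours rather than NEWTOS interfaces.

#include <stdint.h>
#include <stddef.h>
#include <cpuid.h>

/* Portable fallback: bitwise CRC-32C (Castagnoli polynomial, reflected). */
static uint32_t crc32c_soft(const uint8_t *p, size_t n, uint32_t crc)
{
    while (n--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return crc;
}

/* Hardware version: only selected when the core advertises SSE4.2. */
__attribute__((target("sse4.2")))
static uint32_t crc32c_hw(const uint8_t *p, size_t n, uint32_t crc)
{
    while (n--)
        crc = __builtin_ia32_crc32qi(crc, *p++);
    return crc;
}

typedef uint32_t (*crc_fn)(const uint8_t *, size_t, uint32_t);

crc_fn pick_crc(void)
{
    unsigned a, b, c, d;
    if (__get_cpuid(1, &a, &b, &c, &d) && (c & bit_SSE42))
        return crc32c_hw;              /* this core has the instruction */
    return crc32c_soft;                /* correct everywhere, just slower */
}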

3.4 Evaluation on a high-performance stack

We now evaluate the network stack of NEWTOS on a dual-socket quad-core Intel Xeon E5520 with hyper-threading. The peak clock speed of the chips is 2267 MHz and it is possible to scale it down to 1600 MHz in steps of 133 MHz. According to ACPI, the power consumption of each chip at its maximal frequency may be as much as 80 W, and at the lowest frequency 34 W. Unfortunately, it is not possible to scale the frequencies of the cores independently; all cores of each chip run at the same speed. However, modern Intel processors like the E5520 can still scale each core independently using thermal throttling² to allow further scaling in steps of 12.5% of the clock speed. Thermal throttling means that, with the chip set to 1600 MHz, it is possible to scale down to 200 MHz in steps of the same size. Although the core still runs at the base frequency (1600 MHz), some cycles are "thrown away" and the execution slows down proportionally. We can do this for each core individually, but both threads on the same core are throttled equally. Thus, the Xeon E5520 allows us to explore both threading and high/low frequency trade-offs. While we cannot compare in-order versus out-of-order microarchitectures, we believe that a 200 MHz core is slow enough to match the performance of wimpy cores.

To remove bandwidth limitations, and to show that a multiserver system can scale to the multigigabit range, we implemented a driver for the Intel i82599 10G network chip. The driver is fairly simplistic but has standard offloading features for the outgoing traffic. We connect our machine to a Linux 3.7.1 system running on a 12-core AMD Opteron 6168 at 1.9 GHz. Our test case is the same as in [76], which we used to stress the system when demonstrating its reliability.

²It is usually possible to scale AMD chips to lower speeds than Intel ones (e.g., from 1.9 GHz to 800 MHz). However, the Intel-specific mechanism of thermal throttling allows us to emulate much lower speeds.
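For reference, per-core throttling of this kind can be driven through the IA32_CLOCK_MODULATION MSR that Intel documents for on-demand clock modulation: bit 4 enables modulation and bits 3:1 select the duty cycle in 12.5% steps. The sketch below is illustrative only and is not the NEWTOS frequency-scaling driver; wrmsr() stands for whatever privileged MSR-write primitive the kernel provides.

#include <stdint.h>

#define IA32_CLOCK_MODULATION  0x19A
#define CLOCK_MOD_ENABLE       (1u << 4)

void wrmsr(uint32_t msr, uint64_t value);    /* assumed kernel primitive */

/* eighths = 1..7 keeps 12.5%..87.5% of the cycles; 8 or more disables
 * throttling and lets the core run at its full base frequency. */
void throttle_this_core(unsigned eighths)
{
    if (eighths >= 8) {
        wrmsr(IA32_CLOCK_MODULATION, 0);                 /* unthrottled */
        return;
    }
    wrmsr(IA32_CLOCK_MODULATION,
          CLOCK_MOD_ENABLE | ((uint64_t)(eighths & 0x7) << 1));
}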

[Figure 3.2 (diagram): in configuration #1, the OS & SYSCALL pair, IP, TCP and the IXGBE driver each occupy one core of the first quad-core chip and the test clients run on the second chip; in configuration #2, OS & SYSCALL and IP & IXGBE share cores via hyper-threading, TCP keeps a dedicated core and the clients also use the remaining core of the first chip.]

(a) Configuration #1 - dedicated cores

(b) Configuration #2 - hyper-threading

Figure 3.2: Test configurations - large squares represent the quad-core chips, small squares individual cores and '&' separates processes running on different hyper-threads.

We run an iperf server on the Linux machine and connect to it from NEWTOS. Iperf is a standard tool for measuring and tuning network performance. The clients send data as fast as possible, trying to saturate the network hardware, the buses, the memory, or the CPU. We verified that the Linux machine is able to receive at 10G by connecting from Linux running on the same machine that we use to run NEWTOS. We use multiple streams to get the best performance. LwIP does not support TCP window scaling and is therefore not able to keep enough data in flight to saturate the 10G link with a single stream.

3.4.1 Test configurations

We experimentally evaluated several configurations of the network stack to determine its most demanding components. Not surprisingly, TCP ranked highly. Based on these experiments, in performance-critical scenarios the OS must choose between the two basic setups shown in Figure 3.2. We will evaluate them across a range of clock settings.

In both cases we place all processes of the core system (OS) on the first CPU and the network stack components involved in processing TCP traffic on the remaining 3 cores of the same chip. These components are TCP, IP and the 10G ethernet driver (IXGBE). Communication between components on different chips is slower as they do not share a cache. The syscall server shares its core with the rest of the system (OS), but runs in a different thread (denoted by the '&' symbol). It uses kernel communication extensively, but uses the CPU lightly. Nevertheless, it needs its own thread to use the fast signaling when translating synchronous messages from the clients to the TCP component and back. Otherwise, TCP would need to use notifications for correct operation when replying to the syscall server, resulting in a serious performance hit.

In both configurations, TCP has its own dedicated core and the spare hyper-thread is idle. The two configurations differ in only one thing. The first configuration (Figure 3.2a) also dedicates a full core to IP and to the driver, while the respective other threads are idle. The second configuration (Figure 3.2b) places both IP and the driver on the same core, but runs them in different threads. The rationale behind this choice is that TCP is the most demanding component, while IP and the driver have similar CPU utilization, as we demonstrate in the remainder of this section. We denote the second configuration with HT for its employment of hyper-threading.

The test clients run on the remaining cores and threads. In other words, in configuration #1 (Sections 3.4.3 and 3.4.4), they all run on the other quad-core chip, while in configuration #2 (Section 3.4.5), they also run on one of the cores of the first chip. The scheduler distributes them equally.

3.4.2 Methodology of measurements

Throughout our experiments, we measure two basic values: (1) the CPU utilization of each component, and (2) the bitrate. We use time stamp counters to time the events. Intel guarantees that they are synchronized across all cores and tick at a constant rate regardless of frequency scaling. Each component has its own log of events which we process after a test run finishes. To measure the CPU utilization, we mark an event right before the kernel call which suspends the core and right after the call returns. In fact, this measures the time actively spent in each component, so the actual CPU utilization is higher. Measuring the time this way is closer to using a single long-latency instruction instead of a kernel call.
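A minimal sketch of this accounting is shown below: the component time-stamps the moment just before the blocking kernel call and just after it returns, attributing the interval to idle time and everything since the previous wake-up to active time. The rdtsc() helper reads the time stamp counter mentioned above; the statistics structure and the name of the blocking call are stand-ins, not the actual NEWTOS interfaces.

#include <stdint.h>

void kernel_block_on_mwait(void);    /* assumed blocking kernel call */

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

struct stats { uint64_t busy_tsc, idle_tsc, last; };

/* Called from the component's main loop when its queues are empty. */
void block_and_account(struct stats *s)
{
    uint64_t before = rdtsc();
    s->busy_tsc += before - s->last;   /* time actively spent since last wake-up */

    kernel_block_on_mwait();

    uint64_t after = rdtsc();
    s->idle_tsc += after - before;     /* everything in between counts as idle */
    s->last = after;
}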

3.4.3 Frequency scaling 2267–1600 MHz

The first experiment explores how configuration #1 behaves when we change the frequency of the chip. We present the measurements in Table 3.3. The first line represents the baseline: all the cores run at the peak clock speed and the chip draws maximum power.

MHz     drop    Mbps    drop    TDP (W)   drop
2267    –       8641    –       80        –
1867    18%     8152    6%      48        40%
1600    30%     7840    9%      34        57%

Table 3.3: Performance loss versus potentially saved power

As expected, we see that the bitrate drops when the clock speed goes down. As the drop is fairly small, we show only one intermediate value. The last line stands for the lowest frequency and power consumption.

Observe that scaling the cores down to the lowest frequency can save up to 57% of power, but the drop in throughput is not nearly as significant, a mere 9%. There are many cases in which 7.8 Gbps is more than enough while the opportunity to save 46 Watts is important. Also note that at maximum power the throughput is 8.6 Gbps. Later in this section we show that it is possible to throttle the cores even more and deliver better throughput than at the peak clock speed.

The CPU utilization measurements show that running the stack at high frequencies is probably suboptimal. The TCP component uses its core at approximately 70% while IP and the driver use their cores below 40%. The components spend much of their time polling the communication channels. If there is no work to do, they poll for a little while longer and eventually call the kernel to block them on MWAIT.

3.4.4 Throttling below 1600 MHz

To demonstrate how our system behaves on much slower cores, we use thermal throttling to artificially slow the cores down. We start our measurements by scaling all 3 cores to the minimum. Since TCP is the component which uses its core the most, we scale it up by one step for each new measurement and we try to match it with the best setting for the other 2 cores. Our experience is that if we increase the speed of the TCP core and the bitrate does not improve proportionally, we must speed up the other cores by one step too. Adding more does not help and may even lead to throughput degradation. We present our results for the best configurations in Figure 3.3, which compares the bitrate (the thin line) and the performance of the cores we need (the bars). In this case, 100% is the combined performance of all 3 cores running unthrottled at 1600 MHz. The thick line connects the tops of the bars to highlight how the throughput scales with the added resources.

The important observations in Figure 3.3 are the following:

[Figure 3.3 (plot): TCP throughput in Gbps (thin line) and the combined core performance normalized to all cores at 1600 MHz (bars), for TCP-core throttled clock speeds from 200 to 1600 MHz.]

Figure 3.3: Throughput compared to the combined performance of the resources in use — the best combinations

• Scaling the 3 cores to 12.5% of their total performance (200 MHz) delivers 1.8 Gbps, which is enough for many applications like video streaming, web browsing or online gaming.

• The stack achieves approximately the same or higher throughput (7.9 Gbps) at 50% of resource utilization (the 1200 MHz bar) than when all cores run unthrottled (7.8 Gbps, as we reported in Table 3.3).

• Using the TCP core clocked at 1600 MHz and the other two at 600 MHz amounts to just 60% of the performance of all of them running at 1600 MHz and only 40% of all of them running at 2267 MHz. This "low-power" configuration exceeds the performance of all cores at 1600 MHz as well as at 2267 MHz.

We emphasize that the results are average bitrates of each test run. The average throughput at 60% of the combined resources (the rightmost result in Figure 3.3) reaches up to 9.1 Gbps with peaks approaching 10 Gbps. Slower sometimes really is faster!

Figure 3.4 presents the CPU utilization of each core. The utilization is with respect to each core's throttling and each set of bars stands for one configuration. The sets form three clusters determined by the speed of the slower cores. The three sets in the first cluster show how the utilization of the IP and driver cores increases as the higher frequency permits TCP to process more data. In the third set the utilization of the slower cores approaches 100% and exceeds the utilization of the TCP core. Therefore we must match speeding up TCP by speeding up the others too, if the current throughput is not enough. The same pattern repeats in each of the clusters.

Although the relative utilization of the TCP core drops slightly, Figure 3.5 clearly shows that the CPU time obtained by TCP directly determines the final throughput. Figure 3.5 presents the same data as Figure 3.4, but normalized to a core running unthrottled.

[Figure 3.4 (plot): per-core CPU utilization of TCP, IP and IXGBE, relative to each core's own throttled speed, together with the achieved TCP throughput, for the TCP | IP | IXGBE throttle settings from 200 | 200 | 200 MHz up to 1600 | 600 | 600 MHz.]

Figure 3.4: Configuration #1 – CPU utilization of each core throttled to % of 1600 MHz.

[Figure 3.5 (plot): the same data as Figure 3.4, with per-core CPU utilization normalized to an unthrottled core at 1600 MHz.]

Figure 3.5: Configuration #1 – CPU utilization normalized to cores at 1600 MHz unthrottled.

The important observation is that the utilization of the other two cores does not grow equally fast. The main reason is that, unlike TCP, IP and the driver do not touch the TCP payload. TCP must copy all the data from the client applications to the address space of the stack. It is a lengthy operation which thrashes caches and makes the core stall while the time is reported as used. The copy overhead can be significant, between 60 and 70%. Interestingly, the faster the core, the higher the overhead. The explanation is that the difference between CPU speed and memory speed grows, leading to more stalls. Without the copy overhead, the CPU utilization would be comparable to the other two components. For completeness, we mention that some of this overhead can be reduced by letting the network devices transfer data directly from and to the user space buffers.

We did not measure throttling at clock speeds higher than 1600 MHz, because the results show that increasing the speed does not yield significant benefits, and because we want to make the point that lower clock speeds are good enough for even the most demanding OS components. Although we can only guess how much energy our emulated low-power cores would use, Intel reports, for example, that the thermal design power of its Atom N2600 processor at 1.6 GHz is less than or equal to 3.5 W.

3.4.5 Hyper-threading

The same set of experiments for configuration #2 evaluates the effect of threaded cores. Threads are not equal to full cores as they share the same pipeline.

[Figure 3.6 (plot): per-core CPU utilization of TCP, IP and IXGBE and the TCP throughput for configuration #2, relative to each core's throttled speed, for TCP | IP & IXGBE throttle settings from 200 | 200 MHz up to 1600 | 1000 MHz.]

Figure 3.6: Configuration #2 (HT) – CPU utilization of each core throttled to % of 1600 MHz.

[Figure 3.7 (plot): the same data as Figure 3.6, with the shared IP + IXGBE core shown as a single bar and all utilization values normalized to an unthrottled core at 1600 MHz.]

Figure 3.7: Configuration #2 (HT) – CPU utilization of each core normalized to 1600 MHz.

Their advantage is that they allow the core to use cycles which would otherwise be wasted when the pipeline stalls due to slow memory. If the code running on one of the threads has a high cache hit rate and good branch prediction, executing additional threads has diminishing returns. However, we do not expect system code to behave optimally. Messages created on different cores are not in the remote cache until they are read for the first time, and the CPU can hardly guess which execution path the code takes in the next loop. Therefore, system code should benefit from threading.

Based on the previous measurements of configuration #1 and the fact that a thread is not a full-blown core, we expected that the core which hosts IP and the driver should run at least at double the speed of a core hosting either IP or the driver alone to deliver the same throughput. Figure 3.6 shows that the actual clock speed we require is sometimes equal to, but mostly less than, what we expected. Figure 3.7 shows the normalized values. The crosshatched bar represents the utilization of IP and the driver running on the same core. Since each component also accounts for the time when its thread is not active, the utilization of a single core could exceed 100%. Therefore, the bar represents the mean value of both threads.

The main reason why we could run at lower clock speeds than we originally expected is that running more threads on the same core uses the cycles of the core more efficiently and reduces the amount of sleep time. Since the execution of both processes is interleaved, there is a higher probability that while a process's thread is inactive, the other processes of the stack create some new jobs. Thus, when the thread activates again, instead of finding the work queues empty, the process can carry on.

[Figure 3.8 (plot): TCP throughput (lines) and normalized CPU usage (bars) of the HT and no-HT configurations, for TCP-core throttled clock speeds from 200 to 1600 MHz.]

Figure 3.8: Comparison of configuration #1 (no HT) and #2 (HT). Lines represent bitrate, bars represent CPU utilization normalized to 3 cores at 1600 MHz

The benefits of sharing a core between IP and the driver are easiest to observe when comparing the experiments with the slowest cores. Although using two cores at 200 MHz provides just 66% of the resources of 3 dedicated cores at the same speed, the throughput is 77%, or 1.4 Gbps.

Figure 3.8 compares the performance of configurations #1 and #2. As long as the variance in the bitrate is low, using the threaded core outperforms configuration #1 with an extra core. The bars present the combined CPU utilization of both configurations normalized to all 3 cores running unthrottled at 1600 MHz. In all cases the normalized utilization is lower for configuration #2, while the performance is higher when the TCP core runs at or below 1000 MHz. As the transmission of data gets more bursty, the ability to use more cycles per time unit on the dedicated cores to get the work done quicker outweighs the benefits of threading. Higher bitrate then leads to higher CPU utilization.

3.4.6 Why is slower faster?

When a job queue is empty, we can keep on polling, which is fine only when jobs arrive frequently; otherwise the cores use energy while not doing any useful work. The common (and probably right) thing to do is to put the cores to sleep when there is no work. However, idly waiting for new jobs to arrive becomes costly when we sleep at the wrong moments: each badly timed sleep cuts into the time we could have used for processing, and the accumulated overhead is what produces the "slower is faster" effect. As we cannot predict the future, we may put a core to sleep just before a new job arrives (Figure 3.9a).

[Figure 3.9 (diagrams): timelines of the transitions between user space and the kernel around a sleep, for three cases.]

(a) Job arrives between deciding to sleep and halting the core

(b) Job arrives when execution is suspended on MWAIT

(c) Driver - an IRQ arrives when execution is suspended

Figure 3.9: Sleep overhead in different situations. The thick line represents the transition between execution in user space and in the kernel over time. Arrows mark job arrivals, loops denote execution of the main loop and spirals denote polling.

As we still use a kernel call for sleeping, the call introduces some latency, even though the execution does not block and returns to user space immediately. The way back through the kernel is not free. It is even more expensive when a job arrives just after suspending the core (Figure 3.9b), as MWAIT has a relatively long latency, on the order of microseconds. Nevertheless, blocking on MWAIT is much faster than using traditional kernel IPC, especially since such IPC would slow down the sending core too.

The worst case is when an interrupt wakes up a driver core. The interrupt handling routine adds to the MWAIT latency. In addition, the interrupt line must stay disabled until the driver masks the interrupt in the device. At that point, it has to reenable the interrupt line, which incurs another trip to the kernel (Figure 3.9c). Thus, drivers are the most sensitive to frequent sleeping. A solution may be to run the driver in the privileged mode of a virtual machine.

Polling harder eliminates some of the "slower is faster" effect. However, designing a polling algorithm which adapts to unpredictable conditions is complicated. The ideal solution is to use an efficient sleep instruction in user space. On the other hand, sleeping will always have some latency. The take-away message is the following: to avoid the expensive idle time, the scheduler should carefully pick the cores and hyper-threads on which it places the components and scale them so that they are always highly used—with little opportunity to sleep.
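The poll-then-sleep dilemma described in this section boils down to a main loop like the one sketched below: poll for a bounded budget before paying for the kernel call, so that jobs arriving "just too late" are still caught cheaply. The budget value and helper names are illustrative assumptions, not the actual NEWTOS code.

#include <stdbool.h>

#define POLL_BUDGET 4096            /* assumed: spins before giving up */

bool channel_has_work(void);        /* assumed: checks all input queues */
void process_one_job(void);         /* assumed: component-specific work */
void kernel_block_on_mwait(void);   /* assumed: parks the core until a
                                       doorbell write or interrupt       */

void component_main_loop(void)
{
    for (;;) {
        if (channel_has_work()) {
            process_one_job();
            continue;
        }
        /* Nothing queued: poll a little longer before sleeping, since a
         * sleep taken just before a job arrives costs a kernel round trip
         * plus the MWAIT wake-up latency (Figure 3.9a/b). */
        bool found = false;
        for (int i = 0; i < POLL_BUDGET && !found; i++)
            found = channel_has_work();
        if (!found)
            kernel_block_on_mwait();
    }
}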

Clock speed (MHz)   Bitrate (Gbps)
2267                4.3
1600                3.4
 200                0.4

Table 3.4: Bitrate vs. CPU clock speed on a single core


3.4.7 Stack on a single core

In case of a shortage of cores due to high demand from applications, or when cores are turned off to save power, the entire network stack of NEWTOS can keep operating on a single core. Table 3.4 presents measurements of the stack's performance as a function of the core speed. The stack has a throughput of up to 4.3 Gbps on a big fast core and 400 Mbps on a 200 MHz wimpy core. The throughput of the slow core is good enough for many common activities, but the fast core cannot scale further. More importantly, a network stack running on a single core has a much higher latency. If a process has work to do, it hogs the core until it exhausts its time quantum while others are on hold. Then the scheduler is free to pick any runnable process of the stack, which increases non-determinism in the execution. Running the stack on dedicated cores removes these deficiencies, and the throughput of a single fast core is similar to the configuration with a TCP core at 600 MHz and IP and driver cores at 200 MHz.

3.5 Related work

Kumar et al. proposed single-ISA heterogeneous multicores for power reduction [88] and to improve the performance of multithreaded workloads [89]. They demonstrated that applications need a good mix of single-threaded performance and high throughput. Due to the diversity in application code, heterogeneous platforms outperform homogeneous ones with the same die size. This makes heterogeneity promising for the future. Although we do not focus primarily on applications but on the operating system code, the similarity is that we also exploit the fact that each system component has its own requirements for optimal execution. Moreover, the system components are user space processes like applications. However, the system code differs from applications, thus its optimal requirements are different too.

Operating systems are key to leveraging heterogeneous platforms, as only the scheduler can decide where each application runs; therefore the schedulers got the most attention. Kumar et al. [89] proposed a whole range of sampling heuristics that permute threads on cores to find the best assignment. As the execution of applications changes during different phases, Becchi et al. [32] proposed a dynamic algorithm which constantly measures the IPC ratio of threads and tries to run on the big cores those threads that would benefit the most. Permuting the threads and sampling them on all types of cores is an overhead. Therefore Koufaty et al. [87] designed a scheduler which monitors the execution of each thread on its current core only. It uses existing low-overhead performance monitoring counters to collect performance-related data which the system can use to estimate what type of core is the most suitable for the given thread. This algorithm relies on a model which translates the performance statistics into the bias of each thread towards a certain type of core.

Most heterogeneous scheduling algorithms use the speed-up factor, the ratio between how fast an application runs on a small and on a big core. Saez et al. [127] suggest a more comprehensive utility factor of how effectively the whole mix of running applications uses the machine.

Instead of using available performance counters to feed data into models which predict the performance of the threads on different types of cores, hardware monitoring and prediction engines [142] and performance impact estimators [149] were proposed as hardware extensions. The hardware estimates the possible speed-up on its own and the scheduler can use this direct feedback to decide which applications would benefit from running on the big cores and which can run on the small ones.

In contrast to this work, we do not focus on the performance of applications; instead our main focus is on the system. First of all, our system can use all the different heuristics or hardware estimations to schedule applications on the cores which are not dedicated to the system. Second, our system is a collection of user space processes and the scheduler can use the same or similar techniques to find their optimal placement. On the other hand, the execution of the system differs in several aspects. Each component is responsible for a small subset of all the system tasks, therefore the components show little variance during their execution as the requests they serve are similar. The system code follows the same patterns, which differ from application code.
The scheduler's goal is not to let the system finish as quickly as possible, but to deliver optimal service to the changing mix of applications and workloads using the available resources. In contrast to the applications, which are opaque to the system, the system designer knows more about the system components, and the components themselves can help the scheduler by providing various hints. For instance, a component can detect and signal when the recipient of its messages cannot keep up and thus may benefit from a faster core. Similarly, applications can give hints to the system: for instance, when the estimated time of downloading a file is in minutes, the application can tell the system that this is not a sudden spike in the load and reconfiguring is worthwhile.

Mogul et al. proposed operating system friendly cores in [108], primarily to save power. They argue that many features which the operating systems do not use draw a lot of power while not contributing to the performance of the operating system. They propose that the system should run on the optimized cores and the execution should transfer from the application cores to the system cores when necessary. The migration is a bottleneck which they address in [144]. Migrating the execution means that the cache locality is poor. In contrast to their experiments with Linux, we have a system which is more suitable for heterogeneous platforms. NEWTOS moves execution only by sending a message to another core and benefits from the cache locality of the code and data of the component running on the core. Of course, the locality of the user data passed between the cores is poor; however, in many cases the components do not need to touch the data until the DMA of a device transfers them. Cache locality of the messages is also poor, but this data should not be cached after the message is sent in the first place. Unfortunately, current hardware does not allow us such fine-grained control over caches and data transfers. Strong et al. [144] also use networking for evaluation. They model the power usage of the hypothetical cores, while we use frequency scaling to approximate the performance of such hardware.

The Netmap [123] and OpenOnload [13] projects demonstrated high-bandwidth networking in user space. In contrast to NEWTOS, both need a driver in a monolithic kernel. Although most of the faults crash only the application, there is still a chance that a bug in the driver can bring the whole system to a halt. Netmap shows that a 900 MHz core is good enough to transfer 10 Gbps of small packets between the device and the user space application; however, netmap only deals with routing and does not offer generic networking support to applications. On the other hand, OpenOnload transparently intercepts application requests and uses a library with custom-made hardware to transfer data directly between applications and devices. We endorse this approach as it would remove the copying overhead between the applications and our stack.

3.6 Conclusions

We have demonstrated that a processor's fast cores may not be ideal for system workloads and that less can be more in some situations. We presented a network stack evaluation of a reliable and dependable system. The results support our claim that it is possible for such a system to perform well using much more constrained resources than are usually available. We use current hardware to approximate future processors and we show the potential benefits. Unlike many other researchers, we do not focus on the applications. The operating system plays a key role in the execution of applications and we should give it equal attention. However, performance should not be the only criterion; the system is also responsible for security, reliable execution and easy maintenance. The NEWTOS design recovers from crashes and allows administrators to update its components while it is running. Although our case study covers only one part of a generic operating system, we are confident that the findings apply to other parts and to other systems as well.

Acknowledgments

This work has been supported by the ERC Advanced Grant 227874 and the EU FP7 SysSec project. We would like to thank Valentin Priescu for implementing the frequency scaling driver. Likewise, we would like to thank Dirk Vogt for implementing the IXGBE driver for MINIX 3.

Chapter 4

Scheduling of Multiserver System Components on Over-provisioned Multicore Systems

Abstract

Until recently, microkernel-based multiserver systems could not match the performance of monolithic designs due to their architectural choices, which target reliability rather than high performance. With the advent of heterogeneous and over-provisioned multicore architectures, it is possible to run individual components of the system on multiple dedicated cores. Thus, multiserver systems can overcome their performance issues, avoiding expensive context switching and streamlining their operations, without compromising reliability. However, as execution resources are becoming abundant, it is important to select and tune them carefully and to use the resources efficiently, depending on the current workload, for the best performance and energy efficiency. Most of the prior work focused solely on the scheduling and placement of applications. In multiserver systems, the operating system itself needs to be scheduled, in time and space, as the workload and demand change. Therefore the system servers must no longer be opaque processes to the scheduler, and the scheduler must understand the system's workload to make good decisions.


4.1 Introduction

Multiserver systems composed of user space processes running on top of a microkernel and performing operating system functions can easily embrace heterogeneous multicore architectures. Thanks to their explicit communication and lack of data sharing, running on top of a Barrelfish-like [31] kernel, they can even bridge different coherence domains, and they can do so more easily than monolithic systems like, for example, the K2 [98] variant of Linux. This allows them to run their components on the best available cores and to reconfigure themselves to deliver the best performance. Unfortunately, the execution resources are limited, most of all by power constraints.

Essentially all prior work focused on scheduling applications on such architectures, usually running on top of a monolithic system like Linux, Windows or a BSD. In contrast, a multiserver system does not schedule only applications, but also itself—as the servers are user space processes that need to share the CPU core(s) with applications and with each other. Systems like MINIX 3 [12] or QNX Neutrino [20] make little distinction between scheduling applications and OS servers. For instance, in the case of MINIX 3, the servers only have a higher priority and a shorter time quantum. However, as completing a single application request often involves multiple servers, finding the right order in which to interleave their execution to deliver optimal performance is difficult.

Unfortunately, the complexity of the problem is exacerbated by the trend toward ever more components. Multiserver systems keep splitting their servers into finer-grained components in order to isolate runtime failures and allow for simpler crash recovery and live updates [67]. In our previous work on NEWTOS [76], an experimental high-performance fork of MINIX 3, we tackled the performance issues of multiserver systems by running the performance-critical servers of the network stack as individual components on dedicated cores. Similarly, the Loris [27] storage stack replaced the single file system server with many processes for enhanced reliability and modularity.

Dedicating cores to servers avoids scheduling on such cores. Each server has its own core and it can run any time it needs to. On the other hand, it limits the number of cores left for the applications. As some parts of the system are more heavily used than others and not all parts of the system experience the same load in all situations, blindly dedicating cores to any system server is not optimal. In our earlier work [77], we advocated using emerging heterogeneous multicores as they could offer a higher number of cores on a die of the same area with a smaller energy budget, and reduce the amount of dark silicon as demonstrated in [126]. Meanwhile, the cores have different capabilities and performance characteristics, and running the servers on the slower cores has the potential to deliver good (or even better) performance [?] than using the big fast cores for the OS components. Previous work [141; 153; 154] demonstrated that applications often deliver the best performance (or performance-to-energy ratio) when they do not use the fastest processors in the system. The same holds for the system servers—they may have a special role, but in the end, they are just user space processes. Even so, this leaves open the question of how to schedule which components under specific workloads.
Although the code of monolithic systems likely has completely different performance requirements than the applications [108], the system and the applications share the cores. In contrast, multiserver systems can pick the right choice independently for both the applications and the system servers. For example, they can easily spread components across multiple cores of different types. However, doing so complicates scheduling. Prior work on schedulers in monolithic systems looks for the best distribution of applications among the CPUs. The system runs on all of the cores and uses the voltage and frequency settings selected for the applications. In contrast, we need to place the OS servers in the same way as the applications, since they are also user space processes. However, unlike the mostly independent applications, the servers cooperate heavily, so the scheduler must consider them together and find a combination of the cores and their settings that is optimal for groups of servers.

The kernel threads of monolithic systems provide multiple runtime contexts for the kernel. Although they are privileged, they are subject to scheduling similar to ordinary processes. They are mostly used for special purposes like periodic background tasks, interrupt processing or for decoupling the execution of applications and the kernel, rather than for implementing entire subsystems. Mostly, the system spawns one instance of such a thread per core, so there is little need to select the right core for them. On the other hand, there are threads which can benefit from running on a specific type of core. In addition, in a system like the FlexSC [139] variant of Linux, which runs the kernel on different cores than the applications, selecting the right type and number of cores for the kernel is as important as the placement of components of the multiserver systems and faces similar challenges.

In this paper we discuss our experience with finding optimal configurations of the NEWTOS [76] operating system and discuss what information the servers must provide to allow designing new schedulers specifically for multiserver systems running on multicore processors.

4.2 NEWTOS

NEWTOS is a high-performance fork of MINIX 3 that marries reliability and performance. In particular, we use its network stack to prototype fast inter-process communication in extreme situations. We present the architecture of the network stack in Figure 4.1. NEWTOS runs performance-critical components on dedicated cores. It allows them to communicate asynchronously in user space without the overhead of using the microkernel. In fact, the kernel rarely runs on the dedicated cores, as discussed in detail in [76]. The fact that the applications run on different cores than the system makes it possible to extend the user space communication used within the network stack to the applications.

Figure 4.1: NEWTOS network stack and inputs of the scheduler. (The depicted components are PF, UDP, TCP, IP and the network drivers, together with the SYSCALL server, the scheduler and the microkernel.)

This allows us to avoid many system calls and memory copying. We expose the socket buffers to the applications and we let the applications poll the buffers, blocking only if there is no work or a transmission buffer is full. Since the cost of system calls (especially non-blocking and select-like calls) drops dramatically, it makes the performance of NEWTOS competitive with the state-of-the-art network stack of Linux. For example, Figure 4.2 shows one sample result of our evaluation of the lighttpd web server, both for NEWTOS and Linux. The test continuously requests 10-byte files using persistent connections. Each connection is used for 10 requests and we carried out measurements for various numbers of simultaneous connections. Although NEWTOS cannot compete with Linux where latency matters, since many components are involved in processing the packets, it approaches its performance. However, where throughput is more important than latency, NEWTOS can outperform Linux. When one of the servers fails, it is possible to repair it while it stays online.

NEWTOS inherited the user space scheduler of MINIX 3, which is inspired by L4-based systems. The scheduler is a process which is consulted by the in-kernel priority round-robin scheduler every time a process runs out of its quantum. The kernel sends a message to the scheduler updating it on basic statistics like the process's CPU usage, how often it was blocked, the cache hit rate, etc., while it removes the process from the run queue. Once the scheduler decides on the process's new scheduling parameters like quantum size, priority and the CPU where it runs during its next period, it tells the kernel by means of a kernel call, a mechanism equivalent to a system call in monolithic systems, that the process can run again. In the meantime, the kernel lets other ready processes run. Besides immediate decisions, the scheduler continuously monitors the system. It gathers information from the kernel and other sources as depicted in Figure 4.1 and changes the scheduling parameters proactively when an adjustment is needed. For instance, it redistributes processes when a core is under- or overloaded. In the case of a dedicated core, the kernel lets the only process run and only periodically updates the scheduler on the runtime statistics.

As full cores are expensive in die space and energy, we also exploit hardware multithreading.

Figure 4.2: NEWTOS vs Linux performance. Continuously requesting 10 times a 10 B file using various numbers of simultaneous persistent connections.

For instance, the initial implementation of Intel's Hyperthreading [103] used only 5% of extra die area, which is a relatively cheap way of over-provisioning. The threads serve as containers for the process's state, allowing it to use instructions like MWAIT to monitor memory writes, which we require for our fast communication. A server thus halts its thread when it knows that there is no work and the hardware scheduler efficiently interleaves the execution of all runnable processes on the core, together effectively avoiding the overhead of the system's scheduler.

All processes are opaque to the kernel; however, unlike in other similar systems, the user space scheduler of NEWTOS knows that some processes are special—that a certain process is a system server, part of the network stack or a driver and, if so, whether it drives a network card or a disk controller. It also knows that a process is part of the network or storage stack. The scheduler uses this knowledge to place each component of the network stack on its own core for peak performance, but when the network traffic is low and the stack is mostly idle, it co-locates all of its processes on a single core to save resources. Depending on the architecture, it also scales the frequency of such a core down (e.g., on AMD Opterons). We know from experience that drivers often do not need powerful cores and can coexist on different hardware threads of scaled-down (Intel) cores, thus providing some thermal headroom when the processor uses technology similar to Intel Turbo Boost [16].

In [? ], we evaluated our stack and showed that the system has the potential to deliver the same performance when running on less powerful cores, and with careful tuning may deliver even higher performance than on fast cores running at full throttle.

TCP   RX / TX bytes and segments; new outbound connections; accepted and closed connections
UDP   RX / TX bytes; new sockets; closed sockets
IP    RX / TX sum of bytes including protocol headers; ARP and ICMP protocols

Table 4.1: Network Traffic Indicators

However, the scheduler requires much richer information about the state of the system to make its decisions.

4.3 Network Traffic Indicators

Current schedulers derive the nature of the applications, for example whether they are interactive or long-running jobs, based on their CPU usage and other runtime factors. Similarly, to schedule the components of the network stack optimally, the scheduler needs to know the nature of the current traffic and it needs to guess whether the current workload is request-response oriented, how many connections open and close, or whether it is a long-running bulk transfer. Different types of load stress the system differently and the scheduler needs to react accordingly. Although the programmer may be able to provide some additional performance-related information, for instance in the case of server applications, the actual runtime requirements depend highly on external factors.

The scheduler of a multiserver system is isolated and its only way of gathering information is explicit communication with other servers. While the kernel provides basic profiling information like CPU load or cache misses, the rest must come directly from the system servers—in the case of the network stack, in the form of traffic class indicators. Collecting the information as well as sending it to the scheduler must be cheap so it does not impact performance. We opt for a set of indicators similar to the CPU performance counters. Based on their readings, the user space scheduler can make an educated guess about the workload. Each of the components periodically exports the values in a message to the scheduler. The indicators exported by individual components are summarized in Table 4.1.

The number of received (RX) and transmitted (TX) bytes for UDP and TCP includes only the size of the payload, while IP reports a sum of all bytes including transport and link layer headers and protocols handled by IP only. Although the number of TX/RX bytes is a key indicator, it is by no means the only metric which determines the load of the stack. In the case of TCP, it is extremely important to know whether the bytes were sent in many small segments or in large ones, as the per-byte overhead is inversely proportional to the segment size. Since the stack hardly touches the outgoing data, the preparation of the headers, the computation of their checksums (assuming that the network interface can checksum the data), the routing of individual packets, the splitting of TCP data into segments which fit in the allowed window, and the assignment of data to individual sockets is what matters. In addition, small segments may indicate interactive or request-response traffic which is latency sensitive. Large segments are a sign of bulk transfers and elephant flows, which put much less stress on the system; especially when TCP offloading is used, the per-byte overhead drops and much slower, simpler and less power-hungry cores can do the job. In contrast, request-response workloads take advantage of fast cores, as they reduce the latency when the cache hit rate is high. TCP receives and sends many segments which do not contain any payload and serve only as acknowledgments for the payload transfers. This overhead needs to be accounted for, as it is significant when sending or receiving only a few bytes while the bitrate is on the order of a few tens or hundreds of megabits per second. Looking at the bitrate only may suggest that the load is low if we consider multigigabit network speeds. Large transfers of many gigabits per second usually do not stress the cores much.
On the other hand, the stack copies received data to the right socket buffers, which puts a high load on memory. Therefore, placement of the stack close to the application (in terms of memory communication) is critical.

Another indicator is the number of connections the system sets up and the frequency of these operations. Setting up TCP connections requires an exchange of packets which do not carry any data. However, their processing requires resources which may be significant, especially in the case of short-lived connections. Handling such connections at a high rate (for example in the test in Figure 4.2) overloads the application as well as the stack due to many management system calls which do not result in sending or receiving any payload. Hence the connection rate, rather than the bitrate, is the indicator which tells the scheduler what the root cause of the problem is in such a situation. Similarly, it is important to know whether the connections are mostly initiated from this system (it is a client) or whether it is a server.

It is the task of the user space scheduler to gather the data and generate richer statistical information, for example, the mean overhead per transmitted or received byte, the segments needed for setting up connections, or the mean lifetime of connections. The frequency of updating the scheduler determines how quickly and how precisely the system is able to respond; however, a high rate of updates introduces communication overhead which must not outweigh the benefits.
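To make the exchange concrete, the following is a minimal sketch of the kind of indicator message a stack component might export to the scheduler; the structure, field names and message layout are hypothetical and merely mirror the counters listed in Table 4.1.

#include <stdint.h>

/* Hypothetical traffic-indicator message, mirroring Table 4.1.
 * The field names and the message layout are illustrative only. */
struct tcp_indicators {
    uint64_t rx_bytes, tx_bytes;          /* payload only */
    uint64_t rx_segments, tx_segments;
    uint64_t new_outbound_connections;
    uint64_t accepted_connections;
    uint64_t closed_connections;
};

struct udp_indicators {
    uint64_t rx_bytes, tx_bytes;          /* payload only */
    uint64_t new_sockets, closed_sockets;
};

struct ip_indicators {
    uint64_t rx_bytes, tx_bytes;          /* including protocol headers */
    uint64_t arp_packets, icmp_packets;
};

/* A component fills in its part and sends the message periodically;
 * the scheduler accumulates the readings and derives trends from them. */
struct traffic_indicator_msg {
    uint32_t component_id;                /* which server sent the update */
    uint32_t interval_ms;                 /* length of the reporting period */
    union {
        struct tcp_indicators tcp;
        struct udp_indicators udp;
        struct ip_indicators  ip;
    } u;
};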

4.4 Web Server Profiles

We have conducted several measurements of the lighttpd web server running on NEWTOS, focusing on stressing the network stack with various workloads to demonstrate that selecting the right configuration is not a trivial task.

[Figure 4.3 panels: (a)-(d) scale the frequency of the 10G Ethernet driver core, with one line per TCP-IP configuration; (e)-(h) scale the frequency of the TCP and IP cores, with one line per driver configuration. The requested file sizes are 10 B, 100 kB, 1 MB and 10 MB; the y-axis shows kilo-requests/s, except for the 10 MB case, which shows bitrate in Gbps.]

Figure 4.3: Results for tests continuously requesting a file of a size between 10 bytes and 10 megabytes using 8 simultaneous connections. The dashed line in (a) and (g) connects points where all three components run at the same frequency. The 10 MB case shows bitrate instead of request rate. The legend is only in (f).

Web servers are a common case which demands high performance and spans a large range of workloads. We use a dual-socket quad-core Xeon E5520 machine with an Intel 10G Ethernet card. One of the chips hosts the system while the other one is left for the applications. We use a single lighttpd process. We set the speed of the system chip to 1600 MHz and we let the other chip run at the peak 2267 MHz. We use thermal throttling of the individual cores to emulate high- and low-performing heterogeneous cores for the system, with frequencies ranging from 1600 MHz down to 200 MHz. Our goal is to explore the range in which various cores may become a bottleneck. For simplicity, we excluded the packet filter (and UDP is obviously not used). This leaves TCP, IP and a 10G Ethernet driver.

Figures 4.3a-d present a subset of measurements in which we scale the IP and TCP cores together (individual lines—legend in Figure 4.3f) and the driver core separately (the X axis), as our experiments suggest that the speed of the driver influences the performance the most. TCP and IP have similar CPU usage. The second set of figures, 4.3e-h, presents the same data; however, we now fix the frequency of the driver's core instead. We carried out all the measurements requesting a file of various sizes using 8 simultaneous connections issuing 10 requests per connection. This is enough to overload the stack cores when they run at low clock speed.

Some of the workload patterns show the scaling we would expect. For instance, requesting a 100-kB file scales almost linearly with the performance of both the IP and TCP cores. On the other hand, the speed of the driver's core has no impact. However, when requesting a tiny 10-B file, the speed of the driver's core matters. The lowest speed of 200 MHz delivers the same performance as if the core were running at 1600 MHz. Comparing the results to the same test when all the cores run at the peak clock frequency of 2267 MHz (Figure 4.2, data point 8) shows that the request rate may even drop again when we use even faster cores. The potential energy savings make the choice obvious; however, we would also need to have a good understanding of the energy consumption profiles of individual core types and their dynamic voltage and frequency scaling (DVFS) states. Part of this information is provided by firmware interfaces like ACPI. However, it is important to have accurate online energy measurements provided by the cores themselves (for instance, the new Intel Sandy Bridge cores) rather than letting the scheduler estimate the energy consumption indirectly, for instance by using the existing performance counters [44].

Counterintuitively, when the response size grows into the order of megabytes, running TCP and IP on fast cores may have a negative impact on the performance. This depends on the frequency of the driver's core, that is, whether it is below or above 400 MHz. The difference is the amount of sleep time of the underutilized fast cores and how much latency waking them up adds.
In this case, saving energy on a fast core significantly impacts performance, and a more utilized slower core can deliver the best of both measures. An important observation is that even in situations which scale monotonically (the 10-B and 100-kB cases), increasing the speed of the cores beyond a certain point does not deliver any increase in request or bit rate. In such a situation, the application core is usually (close to being) overloaded. It is pointless to scale the stack's cores up or move the stack to faster cores if the application is the bottleneck. In contrast, the scheduler needs to find such cores and settings for the system that free up resources for the applications, for instance, by moving the network stack to a different chip to let the application use Turbo Boost on Intel processors, or by powering on faster cores.

4.5 Profile Based Scheduling

Since the profiling shows that the system’s behavior is not linear, finding an optimal solution is not simple. Many schedulers use a trivial algorithm which would scale all the cores of the network stack simultaneously as long as the bitrate increases. Doing so would find the optimal setting for the performance in the case of 10-byte files. However, it would be far from the energy optimum (dashed line in Figure 4.3a). In addition, the same algorithm would find neither the performance nor the energy optimum in the case of 1-megabyte (dashed line in Figure 4.3g) or 10-megabyte files. Moreover, many schedulers scale up or down only when a certain CPU usage threshold is reached. It is tricky to pick such thresholds which would suite any deployment and workload. A solution which would find the optimum dynamically is further complicated by the fact that after selecting a different configuration the performance is worse or better either due to the new configuration or because the load of the system has changed due to external events. Unlike in the case of local storage (for example), in the case of networking, the remote machines as well as the network itself react to the changed timings. Therefore we need the richer indicators from which we can conclude that, say, the mix of requested files did not change, but only the service rate is higher or lower. Since the workload likely fluctuates, we use the recent history of the readings of the traffic indicators to observe the trend. If the scheduler’s decision abruptly and significantly worsens the trend, it is likely a wrong decision and need to be reconsidered. The solution we propose at this moment is a Profile Based Scheduling which assumes that we can estimate the expected workloads, benchmark them and create their profiles. Based on the current indicators’ readings, the scheduler decides which of the profiles is the closest match and applies the best settings for the given profile. For example, in the case of a web server, the workload can temporarily change either in the number of connections or requests per second or in the mix of requests. Therefore, to approximate the optimal behavior requires many profiles along sev- eral dimensions. The profiles cannot capture all possible scenarios and we select a configuration based on an approximation of the observed load and the profiled ones. Each of the new load and configuration pairs can serve as a data point in an on- line profile which makes approximation finer and further increases precision of our 4.6. CONCLUSIONS AND FUTURE WORK 71 future selections. A large number of profiles is not always required. A node in high-performance computing may just switch between handful of its roles. For example, when the node receives data for its computation, when it sends out the results and when it computes. In the last case, it can power off the networking cores as it needs the energy elsewhere. Alternatively, if the cores employ hardware threading, we can co-locate the network stack on the same cores as other parts of the system which are mutually exclusive in different periods of the operation of such a node. Since we would not need to power off a set of cores to power on another set, this decreases the latency of switching between different modes of operation. Although the load of the network stack’s cores can hint whether the node uses the network or not, the extra indicators can detect whether it sends or receives data and allow the scheduler to act accordingly. 
For example, sending data is typically less demanding, so we can pick a profile which scales the network stack's cores down, while in the case of receiving, the profile suggests using faster cores. The reconfiguration has some latency. In the case of the web server scenario, we expect to benefit from the new configuration for some time in the future. However, in the case of a node which reconfigures periodically, the phases must be long enough that the latency of reconfiguration does not negate the benefits.

The need for the profiles should not prohibit deployment of multiserver systems, as it is already good practice to carefully evaluate and tune current commodity systems and applications before production deployment. Generating the required profiles is easily scriptable and can be part of the initial evaluation process. The added value of our approach is that the hardware and the system are not only tuned for the expected peak load, but the system can adapt as the workload changes and pick the most appropriate resources. Heterogeneous and over-provisioned architectures are newly emerging products with different characteristics and we can apply profiling prior to deployment equally to any of them.
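As an illustration of the matching step, the sketch below picks the benchmarked profile whose feature vector lies closest to the currently observed workload; the feature set, the normalization and all names are hypothetical simplifications, not the actual NEWTOS scheduler code.

#include <math.h>
#include <stddef.h>

#define NFEATURES 3   /* e.g., bytes per request, connections/s, segments per request */

struct profile {
    const char *name;
    double features[NFEATURES];   /* measured during offline benchmarking */
    unsigned tcp_ip_freq_mhz;     /* best settings found for this profile */
    unsigned driver_freq_mhz;
};

/* Return the profile whose feature vector is closest to the observed load. */
static const struct profile *
closest_profile(const struct profile *profiles, size_t n,
                const double observed[NFEATURES])
{
    const struct profile *best = NULL;
    double best_dist = INFINITY;

    for (size_t i = 0; i < n; i++) {
        double dist = 0.0;
        for (size_t f = 0; f < NFEATURES; f++) {
            /* normalize each feature by the profile's own value to keep the
             * dimensions comparable; real code would use proper scaling */
            double d = (observed[f] - profiles[i].features[f]) /
                       (profiles[i].features[f] + 1.0);
            dist += d * d;
        }
        if (dist < best_dist) {
            best_dist = dist;
            best = &profiles[i];
        }
    }
    return best;   /* the scheduler then applies best->*_freq_mhz settings */
}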

4.6 Conclusions and Future Work

We presented an operating system which is designed for high reliability and dependability. Running on multicore processors allows the system to reduce the overheads of its design and even surpass the performance of state-of-the-art systems. To further improve its performance, it can use more, and the most appropriate, resources and free them when they are not needed. This system is able to embrace a heterogeneous over-provisioned machine and has the potential to reconfigure itself based on changes in the workload. We use current commodity hardware to emulate a heterogeneous environment and we show that trivial solutions cannot reach the optimal performance to energy ratio. We show the potential of the system and we present a basic method which allows the system to adapt to a changing workload in various expected scenarios.

For now, evaluation of more diverse architectures, finding similar indicators for other parts of the system, and a truly comprehensive algorithm that adapts the system to any situation remain work in progress.

Acknowledgments

This work has been supported by the ERC Advanced Grant 227874 and EU FP7 SysSec project. We would like to thank Valentin Priescu for implementing the frequency scaling driver. Likewise, we would like to thank Dirk Vogt for implementing the IXGBE driver for MINIX 3 and Teodor Crivat for his work on extending the asynchronous user space communication to the applications.

Chapter 5

On Sockets and System Calls: Minimizing Context Switches for the Socket API

Abstract

Traditionally, applications use sockets to access the network. The socket API is well understood and simple to use. However, its simplicity has also limited its efficiency in existing implementations. Specifically, the socket API requires the application to execute many system calls like select, accept, read and write. Each of these calls crosses the protection boundary between user space and the operating system, which is expensive. Moreover, the system calls themselves were not designed for high concurrency and have become bottlenecks in modern systems where processing of simultaneous tasks is key to performance. We show that we can retain the original socket API without the current limitations. Specifically, our sockets almost completely avoid system calls on the "fast path". We show that our design eliminates up to 99% of the system calls under high load. Perhaps more tellingly, we used our sockets to boost NEWTOS, a microkernel-based multiserver system, so that the performance of its network I/O approaches, and sometimes surpasses, the performance of the highly-optimized Linux network stack.


5.1 Introduction

The BSD socket API is the de facto standard for accessing the network. Today, all popular general-purpose systems provide a set of system calls for implementing sockets. However, sockets were invented over 30 years ago and they were not designed with high performance and concurrency in mind. Even on systems like Linux and FreeBSD, which have optimized their network stacks to the extreme, the overall network performance is crippled by the slow BSD socket API on top of it.

The core problem of the socket API is that every operation requires a system call. Besides the direct cost of the trap into the operating system and back, each system call gums up the caches, the CPU pipelines, the branch predictors, and the TLBs. For some calls (like bind or listen), this is not a problem, because they occur only once per server socket. Calls like read, write, accept, close, and select, on the other hand, occur at very high rates in busy servers—severely limiting the performance of the overall network stack.

A clean way to remove the bottleneck is simply to redesign the API. Many projects improve the network processing speeds by introducing custom APIs [18; 50; 140]. MegaPipe [69] also deliberately takes a clean-slate approach, because the generality of the existing API limits the extent to which it can be optimized for performance. In addition, it offers a fallback to a slower but backward compatible implementation for legacy software. The obvious drawback of all these approaches is that the new APIs are not compatible with the widely adopted BSD sockets and thus require software rewrites to make use of them. In addition, custom APIs typically look different on different operating systems and are frequently tailored to specific application domains.

In this paper, we investigate to what extent we can speed up traditional sockets. We do so by removing the worst bottlenecks, system calls, from the socket's 'data path': the time between the socket / accept and close system calls—during which the socket is actively used. The potential savings are huge. For instance, "system" calls that we resolve in user space are 3.5 times faster than equivalent calls in Linux. In NEWTOS, removing the system calls improves the performance of lighttpd and memcached by 2× to 10×. Instead of system calls, the socket API relies on a user space library to implement performance-critical socket functions. We do keep the system calls for less frequent operations like bind and listen, as they do not influence performance much.

Note that removing the system calls from the socket implementation is difficult. For instance, besides send and recv, we also need to move complex calls like select and accept out of the operating system. We are not aware of any other system that can do this. As a result, our design helps scalability and speed, compatibility with legacy code, and portability between platforms. To the best of our knowledge, our socket design supports the networking requirements of every existing UNIX application. Moreover, without changing a single line of code, we speed up the network performance of applications like lighttpd and memcached to a level that is similar to, and sometimes better than, that of Linux—even though we run them on a slower microkernel-based multiserver operating system. For instance, we support up to 45,000 requests/s for small (20B–11kB) files on a single thread of lighttpd (where the application and the protocol stack run on separate hardware threads).
The key to our solution is that we expose the socket buffers directly to the user applications. Doing so has several advantages. For instance, the user process places the data directly where the OS expects them, so there is no need for expensive copying across address spaces. Moreover, applications can check directly whether a send or recv would block (due to lack of space or lack of data, respectively). Again, the interesting point of our new design is that applications can do all this while retaining the familiar socket API and without any system calls.

Our solution is generic and applies to both monolithic and multiserver OS designs with one important condition: it should be possible to run the network stack on cores or hardware threads that are different from those used by the applications. Fortunately, this is increasingly the case. Because running the OS and the applications on different cores is good for concurrency, such configurations are now possible on some monolithic systems like Linux [139] and AIX [135], and multiserver systems like NEWTOS [76].

Monolithic systems. Many monolithic systems uniformly occupy all the cores, with the execution of system code interleaving the execution of the applications. However, FlexSC [139] demonstrated that it is advantageous for systems like Linux to separate the execution of applications and the operating system between different cores, and to implement system calls without exceptions or traps. Rather than trap, the application writes the system call information on a page that it shares with the kernel. The kernel asynchronously picks up these requests, executes the requested system call and ships the results back. This decouples the execution of the applications and the system calls.

Multiserver systems. There is even more to gain for microkernel-based multiserver systems. In the multicore era, such systems are becoming more popular and new microkernels like Barrelfish [31] emerge. Multiserver systems follow highly modular designs to reduce complexity and facilitate crash recovery [76] and "hot swapping" [30; 67]. Long thought to be unbearably slow, modern multiserver systems benefit from the availability of multicore hardware to overcome some of their historical performance issues [77; 76]. Multiserver systems consist of multiple unprivileged server (or driver) processes which run on top of a microkernel. A single system call on a multiserver system may lead to many messages between the different servers involved in handling the call, and hence significant overhead. On the other hand, it is easy to spread multiserver systems across multiple cores so that performance-critical tasks run on dedicated cores, independent of the applications. Since spatial separation of the OS and the applications is exactly what our socket implementation requires, and system calls are particularly expensive, multiserver systems are a good target for sockets without system calls.

Contributions The contributions of our work are:

1. We show a new design of BSD network sockets in which we implement most network operations without any system calls.

2. We show how this design allows applications to run undisturbed while the network stack works on their requests asynchronously and concurrently.

3. We evaluate the performance of the novel design in the context of reliable multiserver systems (NEWTOS [76]), using the lighttpd web server and the memcached distributed memory object caching system to show that the new implementation can also bring competitive performance to systems long thought to be unbearably slow.

The rest of the paper is organized as follows. First, in Section 5.2 we discuss recent efforts to enhance network I/O, highlighting the weak points and describing how we address them. We present the details of our design in Sections 5.3, 5.4 and 5.5, its implementation in Section 5.6 and implications for reliability in Section 5.7. We evaluate the design in Section 5.8 and we conclude in Section 5.9.

5.2 Motivation and Related Work

Although the socket API is well understood and broadly adopted, it has several performance issues. Primarily, the API was not designed with high concurrency in mind. Reading from and writing to a socket may block—suspending the entire application. To work around this limitation, applications spawn multiple processes (one for each connection), or use multiple threads [116]. Both of these approaches require switching of threads of execution, with significant performance overhead, and mutual synchronization, which also affects performance and makes the software complex and error prone.

Nonblocking variants of socket operations allow handling of multiple sockets within a single thread of execution. However, probing whether a system call would succeed requires potentially many calls into the system. For instance, a nonblocking read is likely to return EAGAIN frequently and consume resources unnecessarily. To avoid needless system entries, system calls like select, poll, and their better optimized variants like epoll or kqueue [92] let the application ask the system whether and when it is possible to carry out a set of operations successfully. Although select greatly reduces the number of system calls, the application needs to use it first to query the system about which of the potential writes and reads the system will accept. This still leaves many system calls which cross the protection boundary between the applications and the system.

Besides using select and friends, developers may improve efficiency by means of asynchronous I/O. Such operations initially communicate to the operating system the send or receive requests to execute, but do not wait for the operation to complete. Instead, the application continues processing and collects the results later.


Figure 5.1: Net stack configurations: (a) Monolithic systems (Linux, BSD, Windows, etc.) (b) FlexSC / IsoStack (c) Multiserver system (MINIX 3, QNX).

The POSIX asynchronous I/O calls provide such send, recv, read and write operations, but even in this case, the execution of the application is disrupted by the system calls to initiate the requests and to query their status.

Monolithic systems like Linux implement system calls by means of exceptions or traps which transfer the execution from the application to the kernel of the operating system (Figure 5.1a). The problem of this mode switch and its effect on the execution of the application has been studied by Soares et al. in FlexSC [139]. They demonstrated that if the system runs on a different core, it can keep its caches and other CPU structures warm and process the applications' requests more efficiently in terms of the instructions-per-cycle ratio. Multithreaded applications like Bind, Apache and MySQL can issue a request and, instead of switching to the kernel, they can keep working while the kernel processes the requests. Although FlexSC is a generic way to avoid exceptions when issuing system calls, single-threaded high-performance servers can take little advantage of it without modification of their code using the libflexsc library [140]. Event-driven servers based on the libevent library [10], like nginx [25] and memcached [24], need only modest changes; however, modification of other servers like lighttpd [11] would require more work. In addition, the kernel still needs to map in the user memory to copy between the application's address space and its own.

IsoStack [135] does not offer a generic solution to the system call problem and focuses solely on the AIX network stack. In particular, it reduces contention on data structures and the pollution of CPU structures that results from the interleaved execution of the network stack and applications by running the stack on separate core(s) (Figure 5.1b). Since the stack has its own core, it can poll the applications for data. To pass commands to the IsoStack, applications need to use the kernel—to access per-core notification queues that are shared by all applications on the core. Although each socket has its own command and status queues, there is only one notification queue per core, so the number of queues to poll is limited.

MegaPipe [69] features per-core queues for commands and completions. Unlike IsoStack, the applications do not share those queues. Specifically, each application opens its own private queue for each core on which it runs. Since the applications using MegaPipe do not run side-by-side with the network stack but on top of the kernel, the stack cannot poll those queues, but MegaPipe reduces the overhead of system calls by batching them. Making one kernel call for multiple system calls amortizes the overhead, and the batching is hidden by a higher-level API. However, the API differs significantly from POSIX. The authors explicitly opted for a clean-slate approach, because the generality of the existing socket API "limits the extent to which it can be optimized in general".

Windows introduced a similar mechanism in version 8, the so-called registered I/O, also known as "RIO" [18]. The internals of the communication between user space and the kernel are much more exposed in Windows. Programmers need to create a number of queues tailored to the problem they are solving, open the sockets independently and associate them with the queues. One significant difference is that RIO explicitly addresses the optimization of data transfer. Specifically, the user needs to preregister the buffers that the application will use so the system can lock the memory to speed up the copying.

Microkernel-based multiserver systems implement system calls by means of message passing. Typically, the application first switches to the kernel, which then delivers the message to the process responsible for handling the call and, eventually, sends back a reply. This makes the system call much more costly than in a monolithic system and more disruptive. It involves not only the application and the kernel, but also one or more additional processes (servers), as shown in Figure 5.1c. For instance, the first column in Table 5.1 shows the cost in cycles of a recvfrom call that immediately returns (no data available) in NEWTOS if the kernel is involved in every IPC message and the system call traverses three processes on different cores—a typical solution that is, unfortunately, approximately 170 times slower than the equivalent call in Linux. One of the problems of such an implementation is that the network stack asks the kernel to copy the data between the application process and itself. Compared to monolithic systems, remapping user memory into the address space of the stack is much more expensive, as the memory manager is typically another independent process. To speed up the communication, we have proposed to convert the messaging into an exception-less mechanism when the system uses multiple cores [76].
Kernel msgs    User space msgs    No system calls    Linux
79800          19950              137                478

Table 5.1: Cycles to complete a nonblocking recvfrom()

Specifically, the user process performs a single system call, but the servers themselves communicate over fast shared-memory channels in user space. As shown in the second column of Table 5.1, doing so speeds up the recvfrom operation by a factor of 4. However, if we avoid the system call altogether and perform the entire operation in (the network library of) the user process, we are able to reduce the cost of the recvfrom to no more than 137 cycles—150 times better than the fast channel-based multiserver system, and even 3.5 times faster than Linux.

Rizzo et al. use their experience with netmap [123] to reduce overheads for networking in virtual machines [125], in which case it is important to reduce the number of VM exits, data allocation and copying. While they work on the network device layer, the VM exits present a similar overhead as the system calls for sockets. Marinos et al. [102] argue for user space network stacks that are tightly coupled with applications and specialized for their workload patterns. The applications do not need to use system calls to talk to the stack and do not need to copy data, as the software layers are linked together. The stack uses netmap for lightweight access to the network cards. Similarly, mTCP [82] argues for a user space network stack, since decoupling the application and the heavy processing in the Linux kernel (70-80% of CPU time) prohibits network processing optimizations in the application. These setups gain performance at the cost of generality, portability and interoperability (the network stacks are isolated within their applications) and usually monopolize network interfaces.

Solarflare's OpenOnload [13] project allows applications to run their own network stack in user space and bypass the system's kernel. While it provides a low-latency POSIX interface, it works only with Solarflare NICs as it needs rich virtualization and filtering hardware features.

Our design draws inspiration from all the projects discussed above. As none of them solves the problem of marrying reliable systems with high performance using the existing socket API, we opt for a new implementation that eliminates the system calls during the active phase of a socket's lifetime (between socket creation and close).

5.3 Sockets without System Calls

The socket API consists of a handful of functions. Besides the socket management routines like socket, close, connect, set/getsockopt or fcntl, applications can only read/receive from and write/send to the sockets, although there are different calls to do so. A socket can act as a stream of data or operate in a packet-oriented fashion. Either way, the application always either sends/writes a chunk of data (possibly with attached metadata), or receives/reads a chunk of data. We limit our discussion to general reads and writes in the remainder of this paper, but stress that the arguments apply to the other calls too. To help the reader, we will explicitly mention what parts of NEWTOS are similar to existing systems. The main point of our work is of course entirely novel: high-speed BSD sockets that work with existing applications and without system calls in the active phase, designed for reliable multiserver systems.

The focus of our design is to avoid any system calls for writing and reading data when the application is under high load. We achieve this by exposing the socket buffers to the application. When reading and writing sockets using legacy system calls, applications cannot check whether or not the call will succeed before they make it. To remedy this, we allow the applications to peek into the buffer without involving the operating system at all. Thus, when the application sees that the operating system has enough space in the socket buffer, it can place the data in the buffer right away. Similarly, when it sees that the buffer for incoming data is not empty, it fetches the data immediately. In summary, our key design points are:

1. The socket-related calls are handled by the C library, which either makes a corresponding system call (for slow-path operations like socket, or if the load is low), or implements the call directly in user space.

2. The API offers standard BSD sockets, fully compatible with existing applications.

3. We use a pair of queues per socket and the queues also carry data. Each queue is the socket buffer itself.

4. The application polls the sockets when it is busy and wakes up the stack by a memory write to make sure the stack attends to its requests very quickly.

5. The network stack polls on a per-process basis (rather than each socket individually).

6. We avoid exception-based system calls without any batching, and the application only issues a blocking call when there is no activity.

Although we discuss the design of the buffers and the calls separately in Sections 5.4 and 5.5, they are closely related. For instance, we will see that exposing the buffers to the application allows a reduction in the number of system calls, as applications can check directly whether or not a read or write operation would fail.

Figure 5.2: Exposed socket buffers – no need to cross the protection boundary (dashed) to access the socket buffers.

5.4 Socket Buffers

A socket buffer is a piece of memory the system uses to hold the data before it can safely transmit it and, in the case of connection-oriented protocols, until it is certain that the data safely reached the other end. The receive buffer stores the data before the application is ready to use it. Our socket buffers are different from traditional ones in that they are premapped and directly accessible by the application, as presented in Figure 5.2. We exposed the socket buffers to user space much like netmap [123] exposes buffers of network interfaces, and it is similar to how FBufs [55], Streamline [50] or IO-Lite [117] map data buffers throughout the system for fast crossing of isolation domains.

5.4.1 Exposing the Buffers

The system allocates the memory for the socket and maps it into the address space of the application right after socket creation. The advantage is that the system does not need to create the mappings when it transfers the data between its address space and the application. This is similar to Windows RIO. However, unlike RIO, the programmers do not need to register the buffers themselves. Instead, the buffers are ready at the time the socket opens. This keeps the use of sockets simple.

The socket buffers are split in two parts—one for incoming and one for outgoing data. Each of these parts is further split into an unstructured memory area for data and an area that forms a ring queue of data allocation descriptors, as presented in Figure 5.3. Both the application and the system use the queues to inform the other side about the data they have placed in the socket. This is similar to IsoStack's [135] per-socket data queues. However, we do not require notification queues.

Mapping the buffers entirely in the address space of the application makes the data path (e.g., reads and writes) cheaper and the control path (e.g., the socket call) more expensive, as the mapping takes place when creating the socket. Thus, it is a potential performance bottleneck for servers which open and close many connections. In RIO, the mappings are initiated by the application after it opens the socket, and only then does the application associate them with the socket.

Figure 5.3: A socket buffer – ring queue (head and tail pointers) and data buffer (data_head and data_tail). White and shaded areas represent free and used/allocated space.

In contrast, we amortize the cost of mapping the buffers by reusing the same mappings (as we describe in Section 5.5.5), which is trivially possible because the operating system creates them.

With traditional "hidden-buffer" sockets, the application lacks feedback on how quickly the network stack can drain the buffers when sending. This contributes to the so-called Buffer Bloat [63] problem. In general, it means that applications blindly send as much data as they are allowed, hoping that the network will somehow deliver it. As a result, too much data accumulates in network stacks, staying too long in large buffers before eventually being dropped and retransmitted. Linux partially fixes the problem by employing an algorithm called Byte Queue Limits (BQL) [4], which limits the number of bytes the system can put in the buffers of the network interfaces. Instead of blindly pushing as much data as it can, it checks how much data the interface sent out during a period of time and sets this as a limit for itself. Unfortunately, the process is completely hidden from the application. For instance, a video streaming server could decrease the video quality to reduce the bitrate if it knew that the connection cannot transmit fast enough. The exposed socket buffers' head and tail pointers provide enough information for a BQL-like algorithm in the library to limit the number of bytes the applications write into the socket (e.g., by failing the write call). POSIX provides calls to query such information; however, our sockets can do it cheaply, without the overhead of system calls.
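As an illustration, a BQL-like check in the library could be as simple as the sketch below; the structure, field names and limit policy are hypothetical, and a real implementation would adapt the limit to what the stack drained in the previous period.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative view of an exposed transmit buffer: the stack advances
 * data_head as it drains data, the application advances data_tail. */
struct tx_view {
    volatile uint32_t data_head;   /* consumed by the network stack */
    volatile uint32_t data_tail;   /* advanced by the application   */
    uint32_t size;                 /* size of the data area (power of two) */
};

/* Bytes written by the application but not yet drained by the stack. */
static uint32_t bytes_in_flight(const struct tx_view *v)
{
    return (v->data_tail - v->data_head) & (v->size - 1);
}

/* Allow a write only if it keeps the backlog under 'limit' bytes. */
static bool write_allowed(const struct tx_view *v, uint32_t len, uint32_t limit)
{
    return bytes_in_flight(v) + len <= limit;
}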

5.4.2 Socket Buffers in Practice

The ring buffer part of the socket has 64 slots in our prototype, and the application can change the size of the buffer as well as the number of slots of each of the ring buffers (using setsockopt), as the needs of each application can differ. For instance, senders with many small writes or receivers with mostly small requests may prefer higher slot counts with a smaller area for data.

Using single-producer, single-consumer ring buffers (or queues) is convenient as they are lock-free [65] and detecting whether they are empty or full is achieved inexpensively by examining the head and tail pointers. Likewise, it is simple to indicate activity by incrementing the head or tail pointer. Due to the pipe-like nature of the socket, we can also allocate the data sequentially and deallocate it in the same order, so there is no fragmentation and testing for full or empty is equally cheap as for the descriptor ring. Since the ring queues are in shared memory and the operating system cannot trust the application, it uses memory protection for the ring structure so that the receiver cannot change the values that the system later uses for deallocating the data. Only the head pointer is writable by the consumer—as it must modify it to indicate progress. We place the head pointer in a separate memory page to enforce the protection.

Both the application and the network stack must free the buffers once the consumer has processed the data. The C library, which implements POSIX, tries to reclaim buffers after a write to a socket completed successfully or when an allocation fails, keeping the socket free for new writes and avoiding blocking and system calls. Of course, if the socket buffer is full and the operation is blocking, the application must block, as we describe in Section 5.5.4.

The POSIX system call API does not guarantee correctness of execution when multiple threads or processes use the same socket. However, it provides a certain degree of consistency and mutual exclusion, as the operating system internally safeguards its data structures. For example, many simultaneous calls to accept result in only one thread acquiring a socket with the new connection. Although programmers avoid such practices, above all for performance reasons (e.g., Facebook uses different UDP ports for different memcached threads [114]), our library uses a per-socket spinlock to provide the same protection. In a well-structured program, there is no contention on this lock. However, when a process forks or sends a socket descriptor through a pipe, we cannot guarantee that different processes do not cause each other harm. Therefore, the system catches such situations and transparently reverts to slow but fully compliant sockets. For this reason, the performance of legacy servers like apache or opensshd does not improve—nor does it worsen. In contrast, the vast majority of client software like wget, ftp or web browsers, and modern event-driven servers like lighttpd, nginx and memcached, do not need any change to take full advantage of the new sockets.
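The following is a minimal sketch of the single-producer, single-consumer descriptor ring described at the beginning of this subsection; the layout and names are illustrative rather than the actual NEWTOS structures, and production code would additionally need proper memory barriers and the separate, write-protected page for the consumer-owned head index.

#include <stdint.h>
#include <stdbool.h>

#define RING_SLOTS 64    /* default slot count mentioned in the text */

/* One data allocation descriptor: where the chunk lives in the data area. */
struct slot {
    uint32_t data_off;
    uint32_t data_len;
};

/* Only the producer writes 'tail', only the consumer writes 'head',
 * so no locks are needed; empty and full follow from the two indices. */
struct ring {
    volatile uint32_t head;      /* next slot the consumer will read  */
    volatile uint32_t tail;      /* next slot the producer will write */
    struct slot slots[RING_SLOTS];
};

static bool ring_empty(const struct ring *r) { return r->head == r->tail; }
static bool ring_full(const struct ring *r)  { return (r->tail + 1) % RING_SLOTS == r->head; }

/* Producer side: returns false instead of blocking when the ring is full. */
static bool ring_put(struct ring *r, uint32_t off, uint32_t len)
{
    if (ring_full(r))
        return false;
    r->slots[r->tail].data_off = off;
    r->slots[r->tail].data_len = len;
    /* the index update is what the other side polls or sleeps on */
    r->tail = (r->tail + 1) % RING_SLOTS;
    return true;
}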

5.5 Calls and Controls

In this section we describe how the network stack and the applications communicate with each other, and how we implement the system calls and signaling without involving the kernel.

5.5.1 Notifying the System

When an application writes to a socket, it needs to tell the system to handle the request. When the application runs "on top of" an operating system (i.e., the system's kernel runs only upon requests from user processes, or when an external event occurs), the only practical means for an application to tell the system about its request is a system call. Although it is possible to have a system thread which periodically polls the sockets, much like in VirtuOS [113], we think this is impractical. It would not only make the running of the system thread dependent on the scheduler, but it would also require descheduling of the user process to switch to the polling thread, which would impact the performance of all the threads sharing the core. It also requires a good estimate of how long and how frequently to poll, which is highly workload dependent.

In contrast, we exploit the fact that the network stack runs on a different core than the applications. The stack can poll or periodically check the sockets without disrupting the execution of applications. On modern architectures, it is possible to notify a core about some activity by a mere write to a piece of shared memory. For instance, x86 architectures offer the MWAIT instruction, which halts a core until it detects a write to a monitored memory region, or until an interrupt wakes it up. This allows for energy-efficient polling.

Using memory to notify across cores has several additional important advantages. First, it is cheap. Although the write modifies a cache line, which may invalidate it in the cache of other cores, resulting in some stalls in the execution, this is a much smaller overhead than switching to the system, which would suffer the same caching problem, but on a much larger scale. Second, the sender of the notification can continue immediately. More importantly, the notification does not interrupt the work of the code on the receiver side. Therefore, when the network stack is busy, it can continue its work in parallel to the applications and check the socket eventually. Interrupting its execution would only make things worse without passing any additional useful information to the network stack.
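The sketch below illustrates the shape of this mechanism on x86: the application side merely stores to the shared index, while the other side arms MONITOR/MWAIT on that memory and sleeps until it is written. This is a simplified illustration only; MONITOR and MWAIT are normally privileged instructions, so the real system issues them from a suitably privileged context, and the names used here are hypothetical.

#include <stdint.h>

/* Application side: publishing new work is just a store to shared memory. */
static inline void notify_stack(volatile uint32_t *tail, uint32_t new_tail)
{
    *tail = new_tail;           /* the waiting core observes this write */
}

/* Waiting side: sleep until the monitored cache line is written
 * (GCC-style inline assembly; simplified, no interrupt handling shown). */
static inline void wait_for_work(volatile uint32_t *tail, uint32_t seen_tail)
{
    while (*tail == seen_tail) {
        /* MONITOR: EAX = address to watch, ECX/EDX = extensions/hints */
        __asm__ volatile("monitor" :: "a"(tail), "c"(0), "d"(0) : "memory");
        if (*tail != seen_tail)         /* re-check to avoid a lost wakeup */
            break;
        /* MWAIT: EAX = hints, ECX = extensions */
        __asm__ volatile("mwait" :: "a"(0), "c"(0) : "memory");
    }
}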

5.5.2 Notifying the Application

NEWTOS has no need to keep notifying the applications. Applications are structured to keep processing as long as they can find a socket to read from, which our sockets allow without direct interaction with the system. Under high load, there is usually a socket with data available. When the load is low, making a system call to block the application is generally fine and it is the only way to avoid wasting resources and energy by polling. This way, applications explicitly tell the system to wake them up when new work arrives. In the rare case that the application domain demands extremely low latency and cares less about resource usage and energy consumption, the sockets can switch to active polling instead.

Traditionally, applications use select-like system calls to suspend their execution, as they cannot efficiently decide whether an operation would block. We still offer select for legacy applications, but as we will see in Section 5.5.4, our implementation requires no trap to the operating system if the load is high.

5.5.3 Socket Management Calls

Except for extremely short connections, reads and writes make up the majority of the socket-related calls during a socket's lifetime. Nevertheless, busy servers also accept and close connections at a very high rate. Since our sockets lack a command queue from the application to the operating system, it appears, at first glance, that we cannot avoid a high rate of system calls to create and manage the sockets. In reality, we will show that we can handle most of them in user space. Specifically, while we still implement the socket call as a true system call, we largely avoid other calls, like accept and close.

Fortunately, the socket call itself is not used frequently. For instance, server applications which use TCP as the primary communication protocol use the socket call only to create the listening server socket, setting up all data-carrying sockets by means of accept. In client applications the number of connections is limited to begin with, and the socket call is typically not used at high rates. Phrased differently, a slightly higher overhead of creating the connection's socket is acceptable.

In general, we can remove the system calls only from those API functions that the application either uses to collect information from the system, or that it can fire and forget. As all other calls may fail, the application needs to know their result before continuing. We implement the management calls in the following way; a sketch of the accept and close paths follows at the end of this subsection:

accept: As mentioned earlier, the operating system premaps the buffers for a pending connection in the listening socket's backlog—before the system announces its existence to the application. As a result, the accept itself can be handled without a system call. The application reads the mapping information from the listening socket and writes back the acknowledgment that it accepted the connection.

close: Closing a connection always succeeds, unless the file descriptor is invalid. After checking the validity of the descriptor, the library injects a close data descriptor, returns success to the application and forgets about the call. The system carries it out once it has drained all data from the socket, asynchronously with the execution of the application. It also unmaps the socket from the application's address space. In case SO_LINGER is set, we check whether the buffer is nonempty, in which case we must block and a system call is acceptable.

listen, bind: These calls are infrequent, as they are needed only once for a server socket. For performance, it makes no difference whether they make system calls.

connect: Connecting to a remote machine is slow and the overhead of the system call is acceptable. We implement connect on a nonblocking socket by writing a "connect" descriptor into the socket buffer and we collect the result later, when the system replies by placing an error code in the socket buffer.

In the remaining subsections, we discuss the control calls that we execute in user space (select, accept, and close) in more detail.
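To make this concrete, here is a minimal, hypothetical sketch of a library-level accept and close along these lines. The socket structure, descriptor layouts, and ring helpers are illustrative assumptions, not NEWTOS' actual interfaces, and error reporting is simplified (a real wrapper would set errno rather than return negative codes).

    #include <errno.h>

    struct ring;                                   /* shared descriptor queue     */
    int  ring_pop (struct ring *r, void *d, unsigned len);   /* 0 if queue empty */
    void ring_push(struct ring *r, const void *d, unsigned len);

    struct sock { struct ring *rx, *tx; };         /* library-side socket view    */

    struct new_conn_desc { int sock_id; unsigned buf_offset, buf_size; };
    struct ctl_desc      { int type; int sock_id; };
    enum { DESC_ACCEPTED = 1, DESC_CLOSE = 2 };

    int bind_fd_to_mapping(int sock_id, unsigned off, unsigned size);

    /* accept: the stack has already premapped the buffers of the pending
     * connection and announced it through the listening socket's queue. */
    int lib_accept(struct sock *listener)
    {
        struct new_conn_desc d;

        if (!ring_pop(listener->rx, &d, sizeof(d)))
            return -EAGAIN;                 /* nothing pending, no system call   */

        int fd = bind_fd_to_mapping(d.sock_id, d.buf_offset, d.buf_size);

        /* Acknowledge asynchronously; the stack picks the descriptor up later. */
        struct ctl_desc ack = { DESC_ACCEPTED, d.sock_id };
        ring_push(listener->tx, &ack, sizeof(ack));
        return fd;
    }

    /* close: fire and forget; the stack drains the socket and unmaps it later. */
    int lib_close(struct sock *s, int sock_id)
    {
        struct ctl_desc c = { DESC_CLOSE, sock_id };
        ring_push(s->tx, &c, sizeof(c));
        return 0;                           /* success; close cannot fail here   */
    }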

5.5.4 Select in User Space

Our sockets enable the application to use cheap nonblocking operations to poll the buffers and check whether it is possible to read or write. However, many legacy applications perform this check by issuing select or a similar call. In this case, our library select routine sweeps through all the sockets indicated by the supplied file descriptor sets and returns immediately if it discovers that any of the sockets would accept the desired operation. Only if there is no such socket does it issue a call to the system, passing along the original descriptor sets. A blocking call internally checks whether it can complete or whether it needs to block using select, however, without the overhead of copying an entire set of file descriptors.

It is possible to optimize the polling further. For instance, the application can keep polling the sockets longer before deciding to make the system call. The polling algorithm may also adapt the polling time based on recent history; however, this is beyond the scope of this paper. We implemented one simple optimization. Since the head and tail pointers of the ring buffers reside in memory shared between different cores, reading them often may result in cache line bouncing, as another core may be updating them. We do not always need to read the pointers. After we carry out the last operation, we make a note of whether the same operation would succeed again, i.e., whether there is still space or data in the buffer. As we complete the operation, the values of the pointers are in local variables assigned to registers or the local cache. Taking the note preserves the information in non-shared memory. In addition, we can condense the information into a bitmap so we can access it in select for all the sockets together, thus not polluting the local cache unnecessarily.

Although it is not part of POSIX and hence not portable, we also implemented a version of epoll as it is more efficient than pure select. The library injects the file descriptors to monitor by means of writes to the socket used for epoll, and polls this socket by means of reads and a select-like mechanism for the pending events. In contrast to Linux, and similar to our writes, applications can issue many epoll_ctl calls without the system call overhead. Likewise, epoll_wait often returns the events right away in user space, much like our select. The benefit of epoll is not only that it is O(1); more importantly, it allows the library to poll fewer sockets.
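A minimal sketch of such a user-space select is shown below. The helpers sock_can_read, sock_can_write and sys_select are assumed, and exception sets are not checked; this illustrates the fallback logic only, not the actual library code.

    #include <sys/select.h>
    #include <sys/time.h>

    int sock_can_read (int fd);    /* consult the per-socket ready note/bitmap */
    int sock_can_write(int fd);
    int sys_select(int nfds, fd_set *rd, fd_set *wr, fd_set *ex,
                   struct timeval *tmo);     /* the real (blocking) system call */

    int lib_select(int nfds, fd_set *rd, fd_set *wr, fd_set *ex,
                   struct timeval *tmo)
    {
        fd_set out_rd, out_wr;
        FD_ZERO(&out_rd);
        FD_ZERO(&out_wr);
        int ready = 0;

        for (int fd = 0; fd < nfds; fd++) {
            if (rd && FD_ISSET(fd, rd) && sock_can_read(fd))  { FD_SET(fd, &out_rd); ready++; }
            if (wr && FD_ISSET(fd, wr) && sock_can_write(fd)) { FD_SET(fd, &out_wr); ready++; }
        }

        if (ready) {                        /* answered entirely in user space  */
            if (rd) *rd = out_rd;
            if (wr) *wr = out_wr;
            if (ex) FD_ZERO(ex);            /* exceptions not checked in sketch */
            return ready;
        }

        /* Nothing ready: fall back to the system with the original sets. */
        return sys_select(nfds, rd, wr, ex, tmo);
    }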

5.5.5 Lazy Closing and Fast Accepts

The obvious bottleneck of our approach is that we must set up the mappings of the buffers before we can use the sockets. This is a time consuming operation, especially in a multiserver system, because it requires a third component to do the mappings on behalf of the network stack; this is typical for multiserver systems. We amortize some of this cost by not closing the socket completely, but keeping some of the mappings around for reuse when the application opens new ones.

Figure 5.4: Architecture of the network stack (App, SYSCALL, TCP, IP, and NetDrv components running on top of the microkernel).

This is especially useful in the case of servers, as they handle many simultaneous connections over time. Once a connection closes, the server accepts a new one shortly thereafter.

We let the system decide when to finalize closing a socket. This allows the network stack to unmap some of the buffers in case of memory pressure. For instance, when an application creates and closes a thousand connections and keeps the sockets around while it does not use them, the stack can easily reclaim them. Note that once the application tells the system that it has closed a socket and thus will not use it any more, the system is free to unmap its memory at any time and the application must not make any assumptions about the algorithm. If the application touches the socket's memory without opening or accepting a new one, the application may crash.

The default algorithm we use keeps the sockets in a "mapped" state as long as at least half of the sockets are in use. Once the number of active sockets is smaller than the number of closed ones, we unmap all the closed ones. The idea is that the number of connections a server application uses at a time oscillates around an average value with occasional peaks. Thus, it is good to have some preopened sockets available. However, when the number of connections drops significantly, we want to free many of them and start exploring for the new average. Of course, when the application wants more connections, the stack creates them again at a higher cost and latency.

For applications, the accept and close calls are very cheap operations. On accept, the library reads a description of the new socket, but the mappings are already present. Similarly, on close, the application writes a close descriptor into the socket, while the unmapping occurs asynchronously in the background.
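The reclaim policy sketched below captures this heuristic; the counters and list helpers are assumed bookkeeping, not the real implementation.

    struct sock;
    void  mark_closed(struct sock *s);   /* keep the mappings for later reuse    */
    void  unmap_socket(struct sock *s);  /* give the memory back, drop from list */
    struct sock *first_closed(void);     /* head of the closed-but-mapped list   */

    static unsigned active_socks;        /* open and in use                      */
    static unsigned closed_socks;        /* closed, mappings kept around         */

    void lazy_close(struct sock *s)
    {
        mark_closed(s);
        active_socks--;
        closed_socks++;

        /* Keep closed sockets mapped while at least half of all sockets are
         * active; once the closed ones outnumber the active ones, reclaim them
         * all and start exploring for the new average working set. */
        if (closed_socks > active_socks) {
            struct sock *c;
            while ((c = first_closed()) != NULL)
                unmap_socket(c);         /* also removes c from the closed list  */
            closed_socks = 0;
        }
    }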

5.6 Fast Sockets in NEWTOS

We implemented the new sockets in a multiserver system; such systems are often found in deployments where reliability is the primary concern, for example QNX [131]. Fixing the performance limitations of multiserver systems is (a) challenging, as the solution must not compromise the reliability, and (b) urgently needed to allow their wider acceptance. Instead of trading performance for reliability, we use more memory (which is nowadays plentiful) and more cores (which are becoming equally abundant, e.g., Intel announced a new 72-core Knights Landing chip [9]). The reliability of these systems typically rests on a modular design that facilitates properties like fault isolation, crash recovery, and live updatability.

In an extremely modular design, NEWTOS' network stack itself is broken into several single-threaded servers (TCP, UDP, IP, packet filter) and device drivers. Figure 5.4 presents a subset of the components. The syscall server dispatches the system calls. The solid arrows represent communication without system calls and kernel involvement, while the dashed one stands for traditional kernel-IPC-based system calls. NEWTOS demonstrated [76] that pinning individual servers to dedicated cores removes the biggest overhead of multiserver systems—context switching—and further increases the performance as various servers execute in parallel.

Our implementation introduces asynchrony and parallelism between the system and the applications. We let applications which have internal concurrency submit requests to the system asynchronously. Unlike IsoStack [135] or FlexSC [140], which require the introduction of additional (kernel) threads into the system, multiserver systems are already partitioned into isolated processes and easily satisfy our requirement that the network stack must run on its own core(s) or hardware thread(s). One of the benefits of our design is that the OS components run independently and hardly ever use the kernel.

5.6.1 Polling within the Network Stack

The TCP and UDP processes of the network stack poll the sockets when the applications are busy, but are idle when there is no work. However, the applications do not know when the stack is idle. Therefore they keep signaling the stack by writes to a special memory area to indicate that there is some activity in user space. The stack first needs to check where the activity comes from. The MWAIT instruction can monitor only a single memory location at a time, usually of cache line size. Because all simultaneously running applications must have write access to this memory, passing specific values would compromise their isolation. It is also impractical for the stack to always sweep through all the existing sockets, as that would mean fetching at least the tail pointer of each socket (a whole cache line) into its local cache. If there were too many idle sockets, this would result in thrashing the cache while not helping the progress. For this reason, both MegaPipe [69] and IsoStack [135] ruled out per-socket command queues.

Our implementation maps one additional shared memory region into the application's address space once it creates the first network socket. This region is private to the application and has an integer-sized part at its beginning where the application indicates that it has been active since we last checked. The rest of the memory is a bitmap with one bit for each socket that belongs to the application. The stack polls the activity field of each application and only if it is positive does it inspect the bitmap. The bitmap grows and shrinks with the number of sockets the application uses concurrently. The network stack never polls any socket or application indefinitely, to avoid starvation.

It is possible that the application is setting bits in the bitmap at the same moment as the stack clears them to mark the already processed sockets. To avoid possible races, we require the application to use an atomic instruction to set the bits in the bitmap. The stack processes the bitmap in chunks which it reads and clears at the same time using an atomic "swap" instruction. According to the findings in [49], x86 atomic and regular cross-core operations are similarly expensive, hence the atomicity requirement does not constitute a major bottleneck. In case the application does not comply, the application itself may suffer, as the stack may miss some of its updates. However, doing so will not affect the correctness of the execution of the stack, or of other applications.
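The following sketch illustrates one way the activity region might be scanned and updated; the structure layout and the process_socket handler are assumptions rather than NEWTOS' actual code.

    #include <stdatomic.h>
    #include <stdint.h>

    void process_socket(unsigned idx);       /* assumed per-socket handler      */

    struct activity {
        _Atomic uint32_t active;     /* app touched any socket since last scan  */
        _Atomic uint64_t bitmap[];   /* one bit per socket of this application  */
    };

    /* Stack side: scan one application's activity region. */
    void poll_application(struct activity *act, unsigned nwords)
    {
        if (!atomic_exchange_explicit(&act->active, 0, memory_order_acquire))
            return;                  /* application was idle, skip the bitmap   */

        for (unsigned w = 0; w < nwords; w++) {
            /* Read and clear one chunk atomically ("swap"); concurrent atomic
             * bit-sets by the application are not lost. */
            uint64_t bits = atomic_exchange_explicit(&act->bitmap[w], 0,
                                                     memory_order_acquire);
            while (bits) {
                unsigned bit = __builtin_ctzll(bits);   /* lowest set bit       */
                bits &= bits - 1;
                process_socket(w * 64 + bit);
            }
        }
    }

    /* Application side: mark socket idx as active with an atomic OR. */
    void mark_socket_active(struct activity *act, unsigned idx)
    {
        atomic_fetch_or_explicit(&act->bitmap[idx / 64],
                                 1ULL << (idx % 64), memory_order_release);
        atomic_store_explicit(&act->active, 1, memory_order_release);
    }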

5.6.2 Reducing TX Copy Overhead

When an application sends data (writes to a socket), the system must copy the data into its own buffers so it can let the application continue. Otherwise the application would need to wait until the data were transmitted from the supplied buffer, as it is free to reuse the buffer once it returns from the write. In a monolithic system like Linux, the kernel must do the copy. In multiserver systems, it is typically done by the kernel too, since only the kernel can access all memory and transfer data between protection domains. Asking the kernel to copy the data unnecessarily disrupts the execution of other applications and servers due to contention on the kernel, as microkernels usually have only one kernel lock.

Therefore we use shared memory to transfer the data between the address space of the application and the network stack. Similar to IsoStack [135], by exposing the DMA-ready socket buffers to the application, the application can directly place the data where the network stack needs it, so the stack does not have to touch the payload at all. The last and only copy within the system is the DMA by the network interface. On the other hand, since the POSIX API prescribes copy semantics, this means that the application must do the copy instead of the system. Doing so is cheaper than constantly remapping the buffers in a different address space. In addition, since the applications do not share their cores with the system, the copy overhead is distributed among them, relieving the cores which host the servers of the network stack. This is extremely important because, for the sake of simplicity and hence reliability, the TCP and UDP servers in NEWTOS are single threaded and do not scale beyond a single core each.

Unfortunately, it is not possible to remove copying within the TCP and UDP servers when receiving data, as the hardware would need more complex filtering logic (e.g., Solarflare [13]) to be able to place the data in the buffers of the right socket. However, this copy is within the same address space, not between protection domains.
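A hypothetical library-level write along these lines could look as follows; txbuf_reserve, the descriptor layout and the ring helper are assumed, and the sketch ignores writes larger than the free space in the buffer.

    #include <string.h>
    #include <sys/types.h>
    #include <errno.h>

    struct ring;
    void ring_push(struct ring *r, const void *d, unsigned len);

    struct data_desc { unsigned offset, size; };

    struct sock {
        char        *tx_data;    /* shared, DMA-ready data area                */
        struct ring *tx;         /* descriptor queue towards the stack         */
    };

    /* Reserve len bytes in the shared data area; returns 0 if it is full. */
    int txbuf_reserve(struct sock *s, size_t len, size_t *off);

    ssize_t lib_write(struct sock *s, const void *buf, size_t len)
    {
        size_t off;

        if (!txbuf_reserve(s, len, &off))
            return -EAGAIN;                  /* behaves like a nonblocking write */

        memcpy(s->tx_data + off, buf, len);  /* the only copy done in software   */

        /* Tell the stack where the payload lives; the application is free to
         * reuse buf immediately, preserving POSIX copy semantics. */
        struct data_desc d = { (unsigned)off, (unsigned)len };
        ring_push(s->tx, &d, sizeof(d));
        return (ssize_t)len;
    }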

5.7 Implications for Reliability

Multiserver systems are touted for their superior reliability and are capable of surviving crashes of their servers. Clearly, our fast socket implementation should not lower the bar and trade reliability back for performance. Shared memory in particular is a source of potential reliability problems, especially if one thread or process may corrupt the data structures of another one. We now explain why our sockets are safe from such problems.

First, note that the new socket implementation uses shared memory in a very restricted way—to exchange data. The only data structure we share is a ring queue of descriptors. As this data structure is writable only by its producer, while the consumer cannot modify it, it is not possible for consumers to harm the producers. Of course, our most important requirement is that applications cannot damage the network stack and compromise the system. Since the queue has a well-defined format and only contains information about sizes and offsets within the data part of the mapped socket buffer, the consumer is always able to verify whether this information points to data within the bounds of the data area. If not, it can be ignored or reported. For instance, the network stack can ask the memory manager to generate a fault for the application. Even if the application is buggy and generates overlapping data chunks in the data area, or points to uninitialized or stale data, it does not compromise the network stack. Although the POSIX copy semantics, as implemented by commodity systems, prohibit applications from changing data that the system is still working with, the fact that our sockets allow it is no different from applications generating garbage data in the first place.

The only serious threat to the network stack is that the application does not increment the head pointer of a queue properly or does not stop producing when the queue is full. Again, this may only result in the transmission of garbage data or in not transmitting some data. This behavior is not the result of a compromised system, but of an error within the application. Similarly, an application can keep poking the stack and waking it up without submitting any writes to its sockets. This is indeed a waste of resources; however, it does not differ from an application keeping the system busy by continuously making a wrong system call.
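As an illustration of this check, the sketch below validates a descriptor against the bounds of the mapped data area before the stack uses it; the structures are hypothetical, but the overflow-safe bounds test is the essential part.

    #include <stdint.h>
    #include <stdbool.h>

    struct data_desc { uint32_t offset; uint32_t size; };

    struct shared_sock {
        uint8_t *data;          /* data area mapped into the application */
        uint32_t data_size;     /* size of that area                     */
    };

    /* Returns true and yields a pointer to the payload only if the
     * descriptor lies fully within the data area; otherwise the entry
     * can be ignored or reported. */
    bool desc_valid(const struct shared_sock *s, const struct data_desc *d,
                    const uint8_t **payload)
    {
        if (d->size == 0 || d->size > s->data_size ||
            d->offset > s->data_size - d->size)
            return false;

        *payload = s->data + d->offset;
        return true;
    }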

5.8 Evaluation

We evaluated our design in our system using a 12-core AMD Opteron 6168 processor (1.9 GHz) with a 10G Intel i82599 card, and we compared it to Linux 3.7.1 running on the same machine. The benchmark clients ran on a dual-socket quad-core Intel Xeon 2.26 GHz (E5520) running Linux 3.6.6, connected with the same network card.

For a fair evaluation of the system call overhead, we present a performance comparison of our network stack with sockets that use our new design and with sockets that use system calls. The baseline implementation uses exactly the same code, but the stack does not poll and the applications use system calls to signal writes and to check whether reading is possible. Since we use the exposed buffers to transfer data between the stack and the application in both cases, the baseline already benefits from the kernel not copying between protection domains. This already makes it significantly faster than any previous implementation of sockets in a multiserver system as reported, for example for MINIX 3 or NEWTOS, in [76].

Our network stack is based on the LwIP [56] library, which is highly portable but not optimized for speed. Performance-wise, it has severe limitations. For instance, all active TCP control blocks are kept in a linked list, which is far from optimal when the number of connections is high. As this is neither a fundamental limitation nor the focus of our paper, we limited the evaluation to 128 concurrent connections. The socket implementation is independent of LwIP and we can use any other library which implements the protocols.

To put our results in perspective, we also compared against Linux, one of the best performing production systems. We stress that it is not an apples-to-apples comparison due to the different designs, complexity and scalability of the systems. Although our sockets are designed for general use, their main advantage is that they avoid system calls when the applications experience high load—something that applies mostly to servers. We therefore conducted the evaluation using the popular lighttpd [11] web server and memcached [24], a widespread distributed memory object caching system. For a fair comparison with Linux, we configured lighttpd to use write and both applications to use select, in Linux as well as in NEWTOS. Using epoll did not make any measurable difference for our benchmarks; in addition, it shares sendfile's lack of portability. We always ran a single server process/thread, as the scalability of our stack and of the one in Linux are fundamentally different.

5.8.1 lighttpd

To generate the web traffic, we used httperf [7], which we modified to allow a fixed number of simultaneous connections. We patched lighttpd to cache files in memory to avoid interference with the storage stack, as we focus solely on the performance of the sockets.

In the first test, we used httperf to repeatedly request a trivial "Hello World" page of 20 bytes. In this test, the web server accepts a connection, reads the request, writes the HTTP header crafted by the HTTP engine and writes the tiny content, which always fits in the socket buffer. That means all the writes avoid a system call. In addition, lighttpd immediately reads from the connection again, as HTTP 1.1 connections are persistent by default. The additional read is issued speculatively as a nonblocking operation and is always resolved in the user library: either data are available or not. Similarly, lighttpd makes speculative accept calls, as it always tries to accept many connections at once.

Figure 5.5: 1 request for a 20 B file per connection. Throughput in kilo-requests/s as a function of the number of parallel connections (1–128), for the new sockets (no syscalls), the baseline with syscalls, and Linux.

As long as the number of simultaneous connections is low, every other call may simply fail; this we resolve in the library as well.

Figure 5.5 shows that the new sockets easily outperform sockets that use system calls. In the case of a single connection, the request rate is similar, as the new sockets make many system calls due to the low load and the latency of the connections hides the overheads of the old sockets. However, two connections are sufficient to demonstrate the advantages, and the performance further scales to a level similar to Linux. We cannot compete with Linux when the number of connections is low due to higher latency within our stack (as several processes handle each packet). On the other hand, lightly loaded servers and small numbers of connections are typically not a performance problem.

In the second test, we kept each connection open for 10 request–response round trips (as servers tend to limit the number of requests per connection). Doing so removes the high latency of setting up TCP connections and increases the load on the server. Although throughput increases both for Linux and NEWTOS, Figure 5.6 shows that in such a case the low overhead of lighttpd's speculation using the new sockets pays off, and the performance quickly diverges and even significantly surpasses Linux.

According to [17], the average size of the 1000 most popular web sites (in 2012) was 1114 kB, made up of 100 objects, giving an average object size of approximately 11 kB. In [102], 22 kB is used as the size for static images in the evaluation.

Figure 5.6: 10 request–responses per connection, 20 B file. Throughput in kilo-requests/s as a function of the number of parallel connections (1–128), for the new sockets (no syscalls), the baseline with syscalls, and Linux.

We use the 11 kB size in another test. Although the write size is significantly larger, the request rate (Figure 5.7) is similar to serving the tiny files.

In the next test we use persistent connections for 10 requests for the 11 kB objects, sent one by one or pipelined, that is, sent over a single connection without waiting. Pipelining the requests (lines with × points in Figure 5.8) helps Linux to avoid some reads when the load is low, as several requests arrive back-to-back and the application reads them at once. It helps our new sockets even more when the load is high, as we can avoid not only reads but writes as well, since several replies fit in the socket.

To demonstrate how many system calls we save, we present a breakdown of the pipelined test in Table 5.2. The first column shows the percentage of all the socket calls that lighttpd makes which are satisfied within the user space library. Even for the lowest load, we save over 90% of all system calls. 50% of accept calls and, more importantly, 69.48% of reads that would enter the system only to return an EAGAIN error, fail already in user space. The remaining calls return success without entering the system at all. Since all other calls are nonblocking, only select can actually enter the system. Although 99% of them do so, that is less than 9% of all socket calls. As the load increases, the chance of accepting a connection is higher and the speculation pays off. On the other hand, many reads would block, so we save many trips to the system and back. Interestingly, the number of select calls dramatically drops below 1% of all calls. Effectively, lighttpd replaces select by the nonblocking calls and takes full advantage of our new socket design, which saves over 99% of all calls to the system.

Figure 5.7: 1 request and an 11 kB response per connection. Throughput in kilo-requests/s as a function of the number of parallel connections (1–128), for the new sockets (no syscalls), the baseline with syscalls, and Linux.

Although the fraction of select calls that block increases when the number of connections is higher than 32, the total ratio to other calls remains close to 0.1%. Since other calls do not block more frequently, this only indicates that they can successfully consume all available work. Although infrequently, the user process does need to block from time to time.

In the last test we request one 512 kB file for each connection. This workload is much less latency sensitive, as transmitting the larger amount of data takes more time. In contrast to the smaller replies, the data throughput of the application and the stack is what matters. Although Linux is faster when latency matters, the results in Figure 5.9 show that minimizing the time lighttpd spends calling the system and the overhead of processing the calls allows us to handle more than 4× as many requests as when using system calls with the same stack, and 40% more requests than Linux when the load is high. Not surprisingly, the number of requests is an order of magnitude smaller than in the other tests; however, the bitrate surpasses 8 Gbps.

5.8.2 memcached

We use memslap, provided by libmemcached, to stress the server. The test preloads different key–value pairs of the same size to a memcached server, and we measure the time needed for the clients to simultaneously query 10,000 pairs each, without any cache misses. Due to limited space we only show a set of results for 128 client threads (Figure 5.10), requesting values of 1 kB, 5 kB, 10 kB, 50 kB and 100 kB, respectively.

Figure 5.8: 10 requests for an 11 kB file. Throughput in kilo-requests/s as a function of the number of parallel connections (1–128), with and without pipelining ("piped"), for the new sockets (no syscalls), the baseline with syscalls, and Linux.

The results clearly show the advantage of not using system calls in NEWTOS, and that the performance in this test is comparable to running memcached in Linux. In general, the benefits for memcached are similar to those for the lighttpd web server.

5.8.3 Scalability

At first glance, scalability may be an issue with our network stack, as an individual core used by the network stack may get overloaded (refer to [77] for the CPU usage of the network stack). Observe that the new sockets significantly reduce the CPU load of the TCP (and UDP) core, since the copying of the data is now done by the user space library. Hence the CPU usage of TCP is similar to that of IP and the driver: roughly 56% (TCP), 45% (IP) and 64% (driver) for the peak load of the test presented in Figure 5.6. Even though the utilization is around 50%, the stack can handle up to 4 instances of the same test before it gets overloaded, peaking at 180k requests per second. Although it is counterintuitive that the cores can handle such an increase in the load, the reason is that a significant portion of the CPU usage is spent on polling and suspending when the load is low, while the CPU is used more efficiently when the load is high and the components do processing most of the time, as we discussed in [77] as well. Especially drivers tend to have excessively high CPU usage when the load is low, due to polling devices across the PCI bus.

Note that the problem of overloading a core is not directly related to the implementation of the sockets, as it is possible to use them with any stack that runs on dedicated cores, for instance in a system like IsoStack or FlexSC.

# conns   user space   select     sel / all   accept    read
   1       91.20 %      99.46 %    8.84 %     50.00 %   69.48 %
   2       92.79 %      81.97 %    8.79 %     48.74 %   68.36 %
   4       94.97 %      62.83 %    7.99 %     47.07 %   69.35 %
   8       98.16 %      33.97 %    5.39 %     44.34 %   72.04 %
  16       99.93 %       4.20 %    1.59 %     32.35 %   80.80 %
  32       99.99 %       2.50 %    0.27 %      8.46 %   84.27 %
  64       99.98 %       9.09 %    0.12 %      3.63 %   84.43 %
 128       99.97 %      17.83 %    0.14 %      4.11 %   84.34 %

Table 5.2: Pipelined 10× 11 kB test – percentage of calls handled in user space, fraction of selects that block, ratio of selects to all calls, accepts and reads that would block

Nor is the network stack of NEWTOS set in stone. For instance, NEWTOS can also use a network stack which combines all its parts into a single system server, similar to the network stack of MINIX 3. Doing so removes communication overheads, and a multithreaded implementation of such a network stack can use multiple cores and scale in a similar way as the Linux kernel or threaded applications. Our choice to split the network stack into multiple isolated components is primarily motivated by containing errors in simpler, smaller and single-threaded processes.

Most important, however, is that even such a network stack scales quite well, as NEWTOS can run multiple instances of the network stack. The new sockets make this transparent to the applications, since once a socket is mapped, the application does not need to know which stack produces or consumes the data. Some architectural support from the network interfaces is needed to make sure that the right instance of the network stack handles the packets. We have built such a multistack variant of our system and are currently evaluating it. However, the design and implementation are out of the scope of this paper and we present results that only hint at the possibilities. Specifically, with the number of cores (12) available on our test machine, we ran 2 instances of the network stack and an additional fifth instance of lighttpd. This setup handles 49k more requests per second, for a total of 229k, while the network stack is not yet overloaded. We can also use 3 instances of the single-component network stack to achieve a total of 302k requests per second using 6 instances of lighttpd. The rest of the system, the driver and the syscall server use the remaining 3 cores.

Figure 5.9: 1 request for a 512 kB file per connection. Throughput in requests/s as a function of the number of parallel connections (1–128), for the new sockets (no syscalls), the baseline with syscalls, and Linux.

Figure 5.10: memcached – test run time in seconds for different sizes of stored values (1 kB–100 kB), for the new sockets (no syscalls), the baseline with syscalls, and Linux. The test uses 128 clients; smaller is better.

5.9 Conclusions

We presented a novel implementation of the BSD socket API that removes most system calls and showed that it increases network performance significantly. Specifically, our sockets provide competitive performance even on multiserver systems, which may have been praised by some for their modularity and reliability, but have also been derided for their lack of speed. We obtain this performance at the cost of higher resource usage, as we run the network stack on dedicated core(s) and preallocate more memory, arguing that this is justifiable given the abundance of memory and the growing number of cores. We evaluated the design using lighttpd and memcached, which can take full advantage of our socket design without any modification, and showed that, for the first time, the network performance of a reliable multiserver OS is comparable to a highly optimized production network stack like that of Linux.

6 NEAT
Designing Scalable and Reliable Networked Systems

Abstract

Operating systems have rapidly evolved from simple resource multiplexers into massive software products that provide a wide range of services, many of which are increasingly crucial for the high scalability and reliability demands of modern applications. Unfortunately, commodity OS architectures lack adequate design abstractions to address these concerns and instead typically resort to scalability and reliability of implementation. This approach typically results in complex and hard-to-maintain synchronization and error recovery code paths to deal with the scalability challenges imposed by modern multicore architectures and the underlying cache coherence protocols.

In this paper, we demonstrate that a simple and principled OS design driven by two key abstractions—isolation and partitioning—is effective in building scalable and reliable systems while relying solely on scalability and reliability of design. To substantiate our claims, we apply our ideas to the network stack—a key operating system service—present our "clean slate" design and implementation of a network stack called NEAT, and discuss the architectural support necessary to deploy our solution in real-world settings. We show that our principled design can (i) intelligently partition the network stack state to minimize the impact of failures and (ii) scale comparably to Linux, but without exposing the implementation to common pitfalls such as poor data locality, false sharing, and synchronization errors.


6.1 Introduction

Building scalable and reliable OS services is a daunting task. Even when opportunities for scalability [42] and reliability [93] exist at the interface level, producing a scalable and reliable implementation is notoriously challenging [31; 37; 66]. This problem is exacerbated by the dominant "build-and-fix" model adopted in the development of commodity operating systems, which generally attempts to retrofit scalability and reliability into legacy implementations. This strategy—scalability and reliability of implementation—misses important opportunities to address key scalability and reliability problems at design time and has trouble scaling with the fast-paced evolution and complexity of modern hardware and software [31].

This paper presents a new coordinated approach to scalability and reliability of design of OS services based on two key abstractions: isolation and partitioning. We show that both abstractions (i) allow scalability and reliability to coexist and symbiotically improve with the number of available cores and (ii) greatly simplify and improve the longevity of the final implementation. To support our claims, we present NEAT, a scalable and reliable componentized network stack implemented using a "clean slate" strategy on top of a microkernel-based architecture. NEAT embraces the proposed abstractions to isolate individual threads of execution in separate components and to partition the system state across multiple independent replicas. Besides the resulting scalability and reliability benefits, NEAT offers support for the standard BSD socket API while preserving application-level sharing and cooperation capabilities. Thanks to our design, NEAT (i) scales comparably to the Linux network stack, (ii) survives run-time failures with minimal impact, and (iii) minimizes the code explicitly dealing with scalability and reliability concerns—code that is harder to maintain in the face of constant hardware and software changes.

Note that we do not claim that our design (or implementation) is the only one possible for a scalable or reliable network stack, let alone for generic operating system services. Our goal is to demonstrate that scalability and reliability are nonconflicting requirements and can both be addressed at design time using well-defined abstractions. Experience shows that the alternative—scalability and reliability of implementation—is increasingly unsustainable, threatening the growing scalability and reliability demands of modern applications. The Linux kernel is a case in point. With its code base rapidly growing from the original few thousand lines of code to millions of lines of code and dozens of architectures today [86], maintaining high scalability and reliability standards has become increasingly prohibitive despite huge community efforts. For instance, replacing the big kernel lock with fine-grained locking and lockless data structures took more than 8 years [3]. In addition, the new synchronization mechanisms have since been updated several times, after repeatedly exhausting their scalability opportunities on contemporary hardware [39]. Further, as the size and complexity of the kernel increase, so does the rate of faults introduced into the code, with an average fault lifespan of around 1 year even for high-impact faults that can bring the system to a halt [118]. To mitigate this reliability impact, extensive community review and testing [90]—which has trouble scaling even for mission-critical code, as the recent Heartbleed outbreak demonstrated [6]—and complex error handling code—which is hard to test and may lead to new bugs [128]—are necessary.

While NEAT is an incomplete prototype with an explicit focus on networking, and it may thus not be fair to compare it to a full-featured operating system such as Linux, its design based on isolation and partitioning does structurally solve many of the recurring implementation issues described above (§6.2). Finally, while researchers have considered abstractions such as isolation and partitioning before—in different forms—for either scalability [31; 37; 151; 60; 26] or reliability [66; 52; 78; 148; 93; 76], our design rigorously combines these abstractions to address both the scalability and reliability of operating system services and demonstrates their effectiveness in building a BSD-compliant network stack.

Contributions To summarize, our contributions are:

• We present a new coordinated approach to the scalability and reliability of operating system services based on two key design abstractions—isolation and partitioning—and analyze the opportunities they offer to solve recurring and challenging implementation problems in existing systems.

• We apply our design abstractions to the network stack and present the principled design and implementation of NEAT, a BSD-compliant network stack that rigorously isolates and partitions its individual components to support scalability and reliability of design. We show that our proof-of-concept prototype—which runs N network stack replicas on top of a microkernel-based architecture—only marginally relies on the implementation to obtain the desired scalability and reliability properties. Further, the full decoupling between replicas allows NEAT to assign each user request to a random replica, ensuring load balancing and, as a by-product, replica assignment unpredictability, resulting in improved security against emerging memory error attacks [34; 138].

• We discuss the architectural support necessary to deploy our solution in real-world settings, suggesting logical and straightforward extensions to commodity hardware.

• We evaluate our NEAT prototype using the popular lighttpd web server. Our results demonstrate that NEAT can handle up to 13% more requests than Linux.

Outline The remainder of the paper is laid out as follows. §6.2 provides background information, highlighting problems in existing OS designs and comparing our principled approach to scalability and reliability with prior work. §6.3 presents the design and implementation of NEAT. §6.5 presents experimental results, assessing the scalability and reliability properties of our NEAT prototype. Finally, §6.6 concludes the paper.

6.2 Background and Related Work

Shaping the design of scalable and reliable systems using isolation and partitioning makes intuitive sense: (i) they enforce data locality and conflict-free parallelism for scalability purposes; (ii) they enforce fault containment and conflict-free failure recovery for reliability purposes. The vast majority of commodity operating systems—whose original design dates back to the single-core computing era—however, opt for a monolithic architecture, where multiple threads typically coexist in a single address space. This design naturally induces a "shared everything" model, with no isolation and no partitioning possible and, as a result, no scalability and reliability of design. In the next subsections, we develop this intuition further and highlight the fundamental limitations of this model and of scalability and reliability of implementation in general.

6.2.1 Scalability of Implementation

Synchronization To scale to the increasing number of available cores, monolithic kernels have naturally grown to become massively threaded in their modern implementations. Although multithreading is a widely accepted strategy to achieve parallelism—and potentially scalability—the lack of isolation between threads comes at the cost of subordinating the scalability of the system to the implementation of complex synchronization mechanisms that grant safe access to shared data structures. Implementing provably correct synchronization primitives is alone challenging, due to the complexity of modern hardware and compilers [54]. Implementing such primitives in a scalable way is even more challenging and also heavily dependent on the particular hardware [48].

Even widely deployed fine-grained locking primitives such as Linux's ticket locks have recently been found to be plagued with scalability problems, with researchers demonstrating that more scalable implementations such as MCS locks [107] are necessary to avoid dramatic performance drops on many-core architectures [39]. Other generally more scalable alternatives to locking include lockless data structures or lightweight synchronization mechanisms such as read-copy-update (RCU) [105]. While increasingly popular in the Linux kernel, RCU indirectly demonstrates the difficulties of implementing scalable and general-purpose synchronization primitives: its characteristics suit less than 8% of the entire kernel (9,000 uses) [14] and only provide scalable read-side semantics; write-side latency grows with the number of processors.

Sharing Monolithic kernels structured around a single address space allow all their threads to implicitly share arbitrary data structures. While commonly perceived as a convenient programming abstraction, the lack of explicit state partitioning comes at the cost of subordinating the scalability of the system to the ability of the implementation to limit sharing and preserve optimal data locality. In modern cache-coherent multicore architectures, this is particularly crucial to prevent shared data from frequently (and unnecessarily) traveling between the caches of individual cores and thus hindering scalability. This is, for example, a well-known scalability bottleneck in many common synchronization primitive implementations [39].

To address this problem, monolithic kernels strive to maintain data structures local to the core where they are most frequently used. Unfortunately, implementation-driven strategies are still insufficient to completely eliminate this problem, with data structures still following processes that migrate between cores—due to load balancing—and threads still sharing common data structures within the same process. A particularly insidious threat is false sharing, where different data structures that are not logically shared happen to reside on the same cache line, unnecessarily causing frequent—but silent—cache line bouncing [99]. gcc's controversial __read_mostly attribute exemplifies the difficulties of addressing this problem entirely at the implementation level and in a scalable way: its adoption in the Linux kernel required 1554 annotations (v3.15.8) to isolate and group all the "read-mostly" variables together—structurally preventing conflicts with more frequently modified cache lines—but only to raise legitimate concerns that many remaining "write-often" variables may then maximize cache line sharing—ultimately degrading write-side scalability [15].

6.2.2 Reliability of Implementation

Fault containment The lack of isolation in monolithic kernels complicates the implementation of effective fault containment mechanisms: a single fault can arbitrarily propagate throughout the kernel, corrupt arbitrary data structures, and lead the entire operating system to fail. To mitigate this problem, commodity operating systems such as Linux adopt a pragmatic approach to reliability, relying on dedicated error-handling logic or killing the offending process when a fault is detected (kernel oops [152]). Unfortunately, the former approach may also result in the introduction of a large amount of complex but trusted recovery code—thus creating a vicious circle [66]—while the latter approach provides weak reliability guarantees, with the inability to detect (or recover from) global error propagation—recently estimated to occur in more than 25% of the cases [152]. To improve fault containment, researchers have devised a number of techniques to retrofit isolation guarantees into existing kernel extensions and, in particular, device drivers. Some approaches rely on hardware-based isolation [148; 61; 36], others on language- [155] or compiler-based strategies [83; 40], yet others on virtualization techniques [145]. While such approaches are generally effective in containing faults in untrusted components, they are typically limited to relatively small kernel subsystems (i.e., device drivers)—recovery techniques for larger subsystems do exist, but at the cost of more complex recovery code [146]—and may fail to guarantee full isolation when the driver interacts with the rest of the kernel using nonstandard interfaces [83].

Failure recovery Fault containment alone is insufficient to implement effective failure recovery strategies. When a fault is detected, recovery actions are generally necessary to ensure that the system is in a globally consistent state. The lack of explicit state partitioning in monolithic kernels, however, induces "hidden" cross-thread dependencies that significantly complicate this process. For this reason, commodity operating systems generally have to resort to a best-effort failure recovery model [152]. To mitigate this problem, researchers have proposed manual state reconstruction [52]—which, however, introduces pervasive and hard-to-maintain recovery code—or software transactional memory-like schemes to selectively roll back all the threads that yield conflicting state changes with the faulting thread [93]—which, however, greatly limit the performance and scalability of the system [66]. Techniques that retrofit partitioning into monolithic kernels using virtualized domains have also been attempted recently [113], but at the cost of greatly limiting application-level sharing and cooperation.

6.2.3 Scalability and Reliability of Design

Scalability While scalability of implementation is still a realistic—but already cumbersome—option today [38], many researchers have recognized the need for abstractions supporting scalability of design before. Tornado [60] and [26] first argued for "partitioning, distributing, and replicating" data across independent objects to improve locality on shared-memory multiprocessors, presenting microbenchmarks to support their claims. In a similar direction, Corey [37] proposes granting applications the ability to limit sharing of OS data structures and improve scalability. In all these systems, however, multithreading and sharing are still the base case [31], greatly limiting the opportunities offered by true isolation and partitioning. Nevertheless, similar to NEAT, Corey [37] has successfully demonstrated partitioning the network stack state across multiple per-core replicas. Unlike NEAT, however, its library-OS-based design advocates strict partitioning at the user level as well, greatly limiting application-level sharing and cooperation capabilities. Barrelfish [31], in turn, embraces a more radical design, with a "shared nothing" model based on isolation and replication. Unlike NEAT, however, such principles are mainly targeted at low-level kernel components—rather than enforced for OS services—and explicit cross-replica sharing (i.e., message passing) is still the norm to ensure global synchronization and replica consistency. Unlike all these systems, NEAT demonstrates that an even more radical design based on strict isolation and partitioning—no implicit or explicit sharing—is realistic and effective, with important scalability but also reliability benefits. On the other hand, while we speculate that our design can be effective for several OS services, we limit our attention to networking and do not necessarily draw any absolute conclusions. More recently, fos [151] has also assessed the potential utility of replicating OS components across cores, but without considering its reliability benefits or evaluating scalability in a real implementation. Finally, recent work advocates reconsidering scalability as a property of the interfaces rather than of the implementation, proposing a commutativity rule to assess the scalability of high-level operations [42]. Our work is similar in that we advocate reconsidering the drive for scalability of implementation, but also complementary in that we investigate scalability at the design rather than at the interface level.

Scalable network stacks like MegaPipe [69] and mTCP [82] focus solely on scalability of performance without any emphasis on reliability. They take a less radical approach to scalability, with limited isolation and partitioning, as their main goal is to circumvent the limitations of monolithic system implementations.

Reliability A vast body of research has been devoted to abstractions supporting reliability of design. Isolation, in particular, is a well-established design principle in reliable OS architectures. Microkernel-based operating systems such as QNX [131], MINIX 3 [12], Sawmill [62], and [78], in particular, are structured around a number of hardware- or software-isolated processes communicating via message passing to support fault containment by design. NEAT adopts a similar organization for its internal structure, but, unlike these systems, generally provides much stronger isolation guarantees—no multithreading—and relies on isolation to also support scalability of design, not only reliability. In addition, unlike NEAT, these systems do not generally partition or replicate state across components, with important scalability but also reliability drawbacks. For example, prior work demonstrated [47] that most of these systems fail to recover from any failure in their stateful components. Techniques devised to address this problem rely on state replication [47] or checkpointing [66]. Thanks to its state partitioning strategy, in contrast, NEAT can gracefully recover from failures by simply restarting the faulty replica, with no impact on the other replicas and thus minimal network state loss. Our recovery strategy is inspired by replication-based fault tolerance, a common design pattern in reliable distributed systems [68]. Finally, NewtOS [76] relies on a microkernel-based design to implement a fast and reliable network stack. Similar to NEAT, its design takes advantage of modern multicore hardware to improve performance.

The lack of state partitioning, however, significantly limits its scalability and reliability guarantees, shortcomings that NEAT fully addresses as part of its design.

6.3 A Scalable and Reliable Network Stack

We present NEAT, a network stack designed from the ground up for scalability and reliability. As a proof of concept for our design, we opted for a network stack because networking is a crucial part of every operating system in today's interconnected world. In addition, the network stack is one of the most reliability- and scalability-sensitive subsystems and thus an ideal candidate to demonstrate the effectiveness of our design. To our knowledge, NEAT is the first system of its kind that relies on a principled design for both scalability and reliability at the same time.

Figure 6.1: NEAT: a 4-replica example (App, SYSCALL, NetDrv, and four NEAT replicas).

To enforce isolation of system components and applications, NEAT relies on a microkernel-based architecture with all the core OS components running as event-driven and hardware-isolated processes. Thanks to such isolation guarantees, each component of the network stack can be provided with fault containment capabilities and also assigned its own core. While somewhat beneficial for reliability, this design is still highly unsatisfactory for scalability.

The key scalability problem of every system executing as a set of isolated processes is that any of its components may easily get overloaded, even if it has been assigned an entire CPU core for itself. To address this problem, an option would be to run multithreaded components on multiple cores, but this strategy would impose the same scalability limitations evidenced in monolithic kernels. For this reason, NEAT opts for a radically different design, that is, partitioning the network state across a number of isolated processes replicated from the original components of the network stack, which the scheduler can scale up and down.

NEAT's design eliminates implicit synchronization and sharing both within the individual processes (no threading) and across generic OS processes (communication only possible via message passing). The latter guarantees that each process always modifies only its own data structures, except the messaging queues. At the same time, NEAT's design also eliminates explicit synchronization and sharing across process replicas (no communication possible), allowing the network stack to scale and minimizing the impact of failures—a failing replica does not prevent other replicas from continuing to run undisturbed, causing minimal service disruption and state loss.

6.3.1 Overview

Figure 6.1 shows the basics of our architecture, which uses several (in this case 4) replicas of the network stack. Each replica communicates with each NIC driver and, by default, each application communicates with all the replicas of the network stack as if there were only one, although it is possible to configure NEAT differently.

Figure 6.2: A dedicated path for packets of each connection (App, SYSCALL, NetDrv, and two NEAT replicas).

This is transparent to the application programmer, since blocking system calls are routed through a system call server (SYSCALL). The implementation of our sockets (§6.3.2), however, allows the application to largely bypass the SYSCALL server and communicate directly with the right replica of the network stack. This complexity is hidden by the user space POSIX library. The individual system processes are assigned dedicated cores, allowing fast communication across user space OS components without intervention of the microkernel, an idea also used in prior work [76]. Only the applications use real microkernel-based message passing, for blocking system calls to the SYSCALL server, while the user space library handles the majority of the socket API within the context of the application itself (dashed lines in Figure 6.1).

With no data sharing nor synchronization across replicas, NEAT allows each network socket to live only in a single instance of the network stack. This is especially important for TCP, as each TCP connection requires the stack to maintain a fairly large amount of state, and NEAT must ensure that every packet of each connection uses the same path (Figure 6.2). The applications, the SYSCALL server, and the network devices are responsible for selecting the network stack replica to handle each socket. fos proposes using a fleet coordinator for similar replica selection problems [151]. NEAT, in contrast, allows no process with a special role, since the intention is to avoid explicit communication between the processes of the network stack. Our solution is to delegate part of the data plane functionality to the hardware, similar, in spirit, to Arrakis [121]. Contemporary network devices already have the ability to match incoming packets against a set of rules, split the traffic, and steer the packets to the right replica of the stack using multiple internal NIC queues. The NIC driver can thus dispatch the packets to the network stack replica based on the receive queue of the NIC. Although these modern features are primarily motivated by the design of monolithic systems and virtualization, and hence were not originally conceived to comply with our requirements, NEAT successfully relies on them to implement replica-aware connection management. Without loss of generality, in the following sections we assume that the NIC can effectively track connections—similarly to NAT support in routers—and we discuss the architectural support required to deploy NEAT in realistic settings, along with current limitations, in §6.4.
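The invariant can be pictured as a consistent mapping from a connection's 4-tuple to a replica, which is what the NIC's flow steering effectively provides in hardware. The sketch below is only an illustration of that mapping (a simple FNV-1a hash modulo the number of replicas); it is not NEAT's dispatch code.

    #include <stdint.h>

    struct flow { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };

    /* 32-bit FNV-1a hash over the connection identifier. */
    static uint32_t fnv1a(const void *p, unsigned len)
    {
        const uint8_t *b = p;
        uint32_t h = 2166136261u;
        while (len--) { h ^= *b++; h *= 16777619u; }
        return h;
    }

    unsigned replica_for(const struct flow *f, unsigned nreplicas)
    {
        /* Same flow, same hash, same replica; different flows spread
         * roughly evenly across the replicas. */
        return fnv1a(f, sizeof(*f)) % nreplicas;
    }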

6.3.2 Sockets

While adhering to the POSIX API to retain backward compatibility with legacy software, NEAT refrains from using a traditional implementation of network sockets for scalability reasons. In a microkernel-based system, legacy communication provided by the kernel incurs high costs, switches between kernel and user space execution, and involves multiple processes on different cores. In addition, the SYSCALL server can become a scalability bottleneck. For these reasons, NEAT opts for fast user space channels, which Barrelfish [31] uses between kernels and NewtOS [76] uses between system processes. Unlike these systems, NEAT uses such channels more pervasively, including for fast application-to-OS communication. While applications occasionally communicate with the SYSCALL server, NEAT resolves the vast majority of the system calls within the application itself, exposing the socket buffers to the application level similar to message queues. For optimal performance, NEAT also locates the network stack and the other system processes on different cores or hardware threads (hyper-threads) than those in use by the applications.

There are many extensions to monolithic systems which avoid system calls and set up message queues between applications and the OS kernel. FlexSC [139] seeks to avoid mode switching costs in the general case, while IsoStack [135], Megapipe [69], and netmap [123] specifically target networking to avoid contention on shared data structures within the kernel or the devices. NEAT draws inspiration from such techniques to allow applications to peek into socket buffers and determine whether an application-issued operation would block or not. If it would not block, the operation can be completed directly at the application level (in our library) without issuing a system call. In other words, applications go through the SYSCALL server only when no event is pending and blocking is needed. This ultimately translates to a lower number of interactions with the SYSCALL server as the application becomes more loaded. Since applications like nginx, memcached, lighttpd, and many others opt for an event-driven model which extensively relies on nonblocking operations and socket monitoring using select, poll, epoll, kqueue and similar mechanisms, we have observed this approach to typically reduce the number of system calls for heavily loaded applications—and thus with a constant stream of events to process—by more than 99%.

Allowing the network stack to expose socket buffers to the applications might appear as a violation of our isolation—and no sharing—guarantees. Nevertheless, we only share unidirectional (and lockless) single-producer and single-consumer queues. Once data are placed on the queue, they are never accessed again by the producer, thus there is no competition and no contention for the shared data structures. In addition, once a socket is open and the system establishes the shared-memory channel, an application automatically communicates with the right network stack replica without exact knowledge of which replica is on the other end.

Using the shared-memory channels without knowing which and how many replicas a heavily loaded application communicates with allows our implementation

(a) Monolithic system. (b) NEAT.

Figure 6.3: Interrupts and system load distribution of a highly loaded memcached server—a hypothetical example.

to be completely agnostic to the number of stack replicas, and thus effectively scale to a large number of network stacks (and cores). The only time when the applica- tion uses the exact information on which replica handles each socket (exported to the library level) is when the load is low and the application needs to block using a select-like mechanism. In such a case, the application indeed needs to split the

set of requests and request notifications from each network stack replica available. This is, however, not a scalability concern, given that lightly loaded applications can spare cycles for management activities.

Avoiding system calls when the load is high allows applications to run undisturbed on their own cores, without being interrupted by blocking operations or having to interleave their execution with the system on the same cores. Figure 6.3a shows a hypothetical example of a highly loaded memcached server, a popular in-memory key-value store, running on a monolithic system. Each core hosts a memcached worker thread processing a fraction of the network traffic. Our experience—also confirmed by other researchers [82]—indicates that it is realistic to expect approximately 70-80% of the execution to happen in the kernel, mostly in steering packets to the right cores, processing network protocols, and passing data through the sockets. On the other hand, the load of NEAT (or FlexSC and IsoStack) is not balanced, as different cores have different roles. Figure 6.3b shows an example of a

Linux                          8.5%
NEAT - forcing system calls    4.5%
NEAT - w/o system calls        1.5%

Table 6.1: lighttpd I-cache miss rate on Linux and on a dedicated core in NEAT, with or without using system calls.

similarly loaded memcached server hosted by NEAT. Since the applications do not share their cores with the system, they can take full advantage of private CPU structures such as caches, TLBs, and branch predictors. Table 6.1 depicts the differences in terms of instruction cache miss rate. The table compares lighttpd—a popular web server—running on Linux and on two configurations of NEAT: (i) the default configuration, which avoids system calls when possible; (ii) a syscall-only configuration, which always uses system calls. As expected, the results demonstrate that the default configuration of NEAT experiences a significant reduction in instruction cache miss rate.

In contrast to the monolithic setup, NEAT uses only a single core to handle the network card interrupts, interact with the device, and access the driver's data structures. While this core is likely not 100% used, it cannot be used to host another process. The figure shows that the other two cores handle the networking load and, in combination with the application cores, deliver throughput similar to that of the cores hosted by Linux. The remaining core (OS) hosts the rest of the system and cannot be meaningfully used for networking or running the application unless the core implements multiple hyper-threads. Nevertheless, it is used in our current prototype to run all the remaining services of the operating system—which are not the focus of this paper.
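The socket buffers discussed above are what make this syscall-less fast path possible. The following self-contained sketch illustrates the principle with a lockless single-producer/single-consumer ring: the application can test readiness and consume data entirely in user space, falling back to the SYSCALL server only when it must block. The structure and function names are illustrative choices of ours, not the actual NEAT library API.

    /* Sketch of testing socket readiness without a system call: the receive
     * buffer is a lockless single-producer (network stack) / single-consumer
     * (application) ring in shared memory.  Names are illustrative only. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define RING_SLOTS 8

    struct sock_ring {
        _Atomic unsigned head;        /* advanced by the producer (network stack) */
        _Atomic unsigned tail;        /* advanced by the consumer (application)   */
        char slot[RING_SLOTS][64];
    };

    /* "Would a recv() block?" -- answered entirely in the application. */
    static bool sock_readable(struct sock_ring *r)
    {
        return atomic_load(&r->head) != atomic_load(&r->tail);
    }

    /* Non-blocking receive; only when this returns false does the library
     * fall back to a real (blocking) call to the SYSCALL server. */
    static bool sock_recv(struct sock_ring *r, char *buf, size_t len)
    {
        unsigned tail = atomic_load(&r->tail);
        if (tail == atomic_load(&r->head))
            return false;                           /* empty: would block */
        strncpy(buf, r->slot[tail % RING_SLOTS], len - 1);
        buf[len - 1] = '\0';
        atomic_store(&r->tail, tail + 1);           /* hand the slot back */
        return true;
    }

    int main(void)
    {
        static struct sock_ring r;
        strcpy(r.slot[0], "hello");
        atomic_store(&r.head, 1);                   /* the stack "produced" one message */

        char buf[64];
        if (sock_readable(&r) && sock_recv(&r, buf, sizeof(buf)))
            printf("received without a system call: %s\n", buf);
        return 0;
    }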

6.3.3 TCP Connections

In this work, we primarily focus on TCP since it accounts for most of the Internet traffic and, in contrast to connectionless protocols, raises more interesting state management challenges in maintaining per-connection state. With TCP, each endpoint plays a different role, either initiating connections (client) or accepting connections (server) from remote clients. Similarly, when establishing a client connection, the application and the system select the network stack replica to handle the connection, while, when accepting a connection, the NIC is the first entity to process and observe each packet. Once the decision on the replica responsible for handling the connection is made, both the NIC and the applications must honor the choice.

This process is transparent for the application programmer, since the library automatically selects the stack for outbound connections. In the case of inbound connections, in turn, the network stack requests the library to provide the mappings for the socket buffers. The NIC is responsible for steering all the packets of the same connection to the same replica, using dedicated connection tracking support.
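As a simple illustration of the outbound case, the library can derive the replica from a stable hash of the connection's 4-tuple, so that every packet of a connection maps to the same replica. The concrete hash below is a placeholder of our own; the real selection only has to agree with whatever classification the NIC applies to the traffic.

    /* Illustration of pinning an outbound connection to one replica by
     * hashing the 4-tuple; the mixing function and all names are
     * placeholders, not the actual NEAT library code. */
    #include <stdint.h>
    #include <stdio.h>

    #define NR_REPLICAS 4

    static unsigned pick_replica(uint32_t src_ip, uint16_t src_port,
                                 uint32_t dst_ip, uint16_t dst_port)
    {
        /* simple, stable mixing of the 4-tuple; any deterministic hash works */
        uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16) ^ dst_port;
        h ^= h >> 16;
        h *= 0x45d9f3b;
        h ^= h >> 16;
        return h % NR_REPLICAS;
    }

    int main(void)
    {
        unsigned r = pick_replica(0x0a000001, 40000, 0x0a000002, 80);
        printf("connection handled by replica %u\n", r);
        return 0;
    }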

To be able to accept a connection through different network stack replicas, listening TCP sockets are the only type of socket replicated across all the possible stacks. Binding a listening socket to a single replica would otherwise force NEAT to assign all the incoming connections to the same replica, resulting in load imbalance (and reduced unpredictability, i.e., weaker security). For example, a machine hosting a web server listening only on port 80 would handle all the connections in a single network stack replica, leaving all the other stacks completely unused. Note that listening sockets are replicated across all the network stacks only at listen()-time, given that there is no general way of knowing whether a socket will be a listening socket at creation time. Each "subsocket" then remains fully isolated in a single replica, with the user library hiding the underlying replication from the application itself. Applications can simply use the original file descriptor obtained at socket creation time, allowing the library to perform the necessary socket-subsocket mapping operations behind the scenes. The accept implementation, for instance, checks for any subsocket with an available incoming connection and simply "accepts" the connection from it.

Since connections are naturally spread across several different isolated replicas, NEAT can easily allow for accepting connections in parallel and in a conflict-free way. In a system with a single instance of the listening socket, in contrast, there is contention when multiple threads attempt to access the socket simultaneously. Recent work has sought to directly address this particular problem on Linux [69; 120]. Unlike these systems, NEAT eliminates the need for synchronizing access to the subsockets residing in different network stack replicas altogether, allowing different application threads to accept from a single subsocket while "stealing" connections from others for load balancing purposes.
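The sketch below illustrates the subsocket idea: one private accept queue per replica, with the library's accept simply scanning them for a pending connection and falling back to a blocking call only when all are empty. Types and helpers are illustrative placeholders, not the actual library implementation.

    /* Sketch of accept() over a replicated listening socket: the library
     * keeps one "subsocket" per replica and accepts from whichever has a
     * pending connection.  Everything here is illustrative. */
    #include <stdio.h>

    #define NR_REPLICAS 3

    struct subsocket {
        int replica;
        int pending;          /* connections queued by this replica */
    };

    /* stand-in for taking one connection off a replica's private queue */
    static int subsocket_accept(struct subsocket *s)
    {
        s->pending--;
        return s->replica * 100 + s->pending;   /* fake connection id */
    }

    static int listening_accept(struct subsocket subs[], int n)
    {
        for (int i = 0; i < n; i++)
            if (subs[i].pending > 0)
                return subsocket_accept(&subs[i]);
        return -1;    /* nothing pending anywhere: block via the SYSCALL server */
    }

    int main(void)
    {
        struct subsocket subs[NR_REPLICAS] = { { 0, 0 }, { 1, 2 }, { 2, 1 } };
        printf("accepted connection %d\n", listening_accept(subs, NR_REPLICAS));
        return 0;
    }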

6.3.4 Scaling Up and Down

To obtain peak performance, NEAT uses dedicated cores for some of its components. Replicating such components across different replicas (and thus cores) faces the natural challenge of exhausting the limited number of available cores. To address this problem, NEAT can scale the number of replicas up and down depending on the current load and the performance required by the applications. For example, when certain parts of the system are not needed, it is possible to place multiple components on a single core, e.g., idle NIC drivers.

The system boots with a predefined number of replicas depending on the expected load and available cores. When the load exceeds the expected maximum and scaling up is necessary, NEAT automatically creates a new replica. The new replica announces itself to the NIC drivers, which, in turn, request the network card to use a new queue. Since the connection tracker has rules for the existing connections, their distribution remains intact as long as each connection exists. The NIC assigns each new connection to a particular replica with the same probability across all the replicas. While this may initially lead to imbalance in load distribution (depending on existing connections), we expect the system to quickly rebalance itself as soon as existing connections terminate and new connections arrive.

When the load drops again, NEAT can also scale down and terminate some network stack replicas. Since the connections are distributed across all the replicas fairly uniformly, simply shutting down a single replica would result in abruptly terminating all the TCP connections handled by the replica—similar to the effect caused by an unexpected failure. Migrating connections from a replica in termination state to another replica in nontermination state, however, would degrade performance due to the cost of copying TCP-related data structures and data not yet consumed by the application or still waiting for delivery acknowledgment. The system would also need to reset the connection tracker in the NIC. This strategy is overly complicated and also violates NEAT's isolation principles, potentially introducing subtle scalability and reliability issues. For these reasons, NEAT adopts a radically different strategy, which (i) marks the necessary replicas as in termination state; (ii) instructs the NIC to distribute new connections only to replicas in nontermination state but continue to serve packets on existing connections across all the replicas; (iii) garbage collects replicas in termination state as soon as their connection count drops to zero. This approach implements an effective lazy closing strategy without breaking any of the existing connections (a sketch of this procedure follows at the end of this section). The only trade-off is a slower scaling down phase, which, however, only results in short-lived resource overcommitment periods and can actually better handle load fluctuations—i.e., quickly scaling up again whenever necessary.

In general, creating and terminating network stack replicas incurs latency and management overhead. In addition, the number of replicas the applications can indirectly use is limited by the ratio of cores dedicated to the system compared to those dedicated to the applications. Many modern CPUs, however, support several hardware threads on each core. This allows NEAT to colocate some of the replicas on the same core while preserving fast user-space communication, given that each replica maintains its own hardware context.
As a matter of fact, as shown in §6.5, this strategy allows NEAT to use available cores more efficiently, given that a single process can hardly use up all the core's cycles due to the latency of main memory [115]. The conclusion is that scaling down the number of replicas is actually not always necessary and, when hyper-threading is available, it is better to scale down by scheduling decisions without terminating the replicas.
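The lazy scale-down procedure described earlier in this section can be summarized by the following sketch: a terminating replica stops receiving new connections, keeps serving its existing ones, and is reclaimed only when its connection count drops to zero. All structures and helpers are illustrative placeholders, not the actual NEAT implementation.

    /* Sketch of lazy scale-down: replicas marked as terminating no longer
     * receive new connections but keep serving existing ones, and are
     * reclaimed once their connection count reaches zero. */
    #include <stdbool.h>
    #include <stdio.h>

    struct replica {
        int  id;
        int  connections;     /* currently open TCP connections */
        bool terminating;
    };

    /* stand-ins for reconfiguring the NIC and reclaiming the process */
    static void nic_stop_new_connections(struct replica *r)
    { printf("NIC: no new connections to replica %d\n", r->id); }
    static void destroy_replica(struct replica *r)
    { printf("replica %d reclaimed\n", r->id); }

    static void scale_down(struct replica *r)
    {
        r->terminating = true;
        nic_stop_new_connections(r);
    }

    /* called periodically, e.g. by the scheduler */
    static void reap_terminating(struct replica *rs, int n)
    {
        for (int i = 0; i < n; i++)
            if (rs[i].terminating && rs[i].connections == 0)
                destroy_replica(&rs[i]);
    }

    int main(void)
    {
        struct replica rs[2] = { { 1, 0, false }, { 2, 3, false } };
        scale_down(&rs[0]);       /* idle replica: reclaimed right away       */
        scale_down(&rs[1]);       /* busy replica: reclaimed only later, once */
        reap_terminating(rs, 2);  /* its existing connections have drained    */
        return 0;
    }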

6.3.5 Scaling NIC Drivers

The only performance-sensitive component that NEAT does not actively scale up is the network card driver, since none of our tests demonstrated the driver to be a potential performance bottleneck and other researchers [124] have reported 10G line-rate processing on a single core. From the reliability point of view, it is possible to recover NIC drivers seamlessly [73].

Figure 6.4: A replica of the multi-component network stack (App, TCP, SYSCALL, IP, and NetDrv on top of the microkernel).

6.3.6 Reliability

Isolation and replication of the network stack are key to the reliability of NEAT. The individual replicas are isolated and do not interact with each other, preventing a failure in one of the replicas from having any direct or indirect reliability impact on all the other replicas. In its current prototype, NEAT opts for a completely stateless recovery strategy: when a replica crashes, a new replica is created and all the TCP-related state associated with the failed replica is lost. While supporting more complex stateful recovery policies is possible, this simple approach fully embraces our state partitioning strategy and ensures minimal state loss, with all the other replicas continuing to serve existing and new connections with no global service disruption. During the (short) recovery phase, the driver does not pass any packets to the recovering replica until it announces itself again. This strategy eliminates the need to reconfigure the device.
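The driver-side behavior during recovery can be pictured as follows; whether withheld packets are dropped (and later retransmitted by TCP) or briefly buffered is an implementation choice, and the names used are illustrative only.

    /* Driver-side view of stateless recovery: while a replica is restarting,
     * the driver simply stops forwarding packets to it until the new
     * instance announces itself again.  All names are illustrative. */
    #include <stdbool.h>
    #include <stdio.h>

    struct replica_port {
        int  id;
        bool announced;   /* has the (possibly restarted) replica announced itself? */
    };

    static void forward(struct replica_port *p, int pkt)
    {
        if (!p->announced) {
            printf("replica %d recovering, packet %d withheld\n", p->id, pkt);
            return;
        }
        printf("packet %d -> replica %d\n", pkt, p->id);
    }

    int main(void)
    {
        struct replica_port p = { 2, false };
        forward(&p, 7);        /* withheld: replica still restarting   */
        p.announced = true;    /* restarted replica announced itself   */
        forward(&p, 8);        /* normal delivery resumes              */
        return 0;
    }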

6.3.7 Security

Replication has been previously proposed for security purposes, using synchronized multivariant execution to detect arbitrary memory error attacks [45; 129; 130]. Unlike these approaches—which incur high run-time overhead—NEAT pursues the less ambitious goal of enforcing address space re-randomization [138] across user connections. NEAT naturally enforces this defensive security mechanism as part of its design at no extra cost. This is simply done by binding each connection to a random replica, while creating each replica independently and with ASLR [1] enabled. The latter strategy yields completely different memory layouts across (semantically equivalent) replicas, resulting in consecutive user connections being handled by processes with unpredictably different memory layouts. Albeit not our primary focus here, we find it remarkable that our design can support such an address space re-randomization strategy in a natural and inexpensive way, effectively countering recent memory error attacks that rely on a stable memory layout across user connections [34; 138].

6.3.8 Multi-component Network Stack

For increased reliability, NEAT can be configured at compile time to vertically split each network stack replica into multiple isolated components, resulting in a finer-grain multi-component network stack, much like NewtOS [76] or HelenOS [106]—which pioneered splitting the network stack along the protocol processing boundaries. Figure 6.4 presents a simplified version of our multi-component network stack, showing only the IP and TCP components which NEAT uses for TCP processing, but additional UDP and packet filter components are separated out as well in our current prototype.

Although the multi-component network stack requires more cores and internal communication, it also yields improved reliability since it fully isolates faults in smaller network components. Excluding TCP, the other components are essentially stateless (or pseudostateless) and thus are more easily amenable to application-transparent recovery even with the simple stateless recovery strategy supported by our current prototype. Even when adopting more heavyweight stateful recovery strategies such as the one described in [66], our state splitting strategy effectively reduces the state surface to recover and reduces the likelihood of state corruption.

6.4 Architectural Support

NEAT requires certain hardware features for an efficient implementation. For example, the current implementation of our fast user-space communication channels relies on the MWAIT x86 instruction, which allows NEAT to halt a core and enables a process running on another core to wake up the halted core using a simple memory write operation (a sketch of this mechanism appears at the end of this section). This strategy eliminates the need for expensive inter-processor interrupts and kernel-assisted process halting. Note that NEAT automatically switches to slower communication channels as needed, in particular when the load is low and colocating processes on shared cores—thus freeing up the dedicated ones—becomes an appealing option. To share the cores more efficiently, NEAT also exploits hardware multithreading, which results in increased parallelism and the ability to leverage MWAIT-based communication channels. We evaluate the benefits of hardware multithreading in §6.5.

In addition, efficiently scaling the network stack replicas requires dedicated NIC support to split the packet flow. Commonly available NICs feature many pairs of queues for transmitting and receiving. NEAT uses one pair of queues to direct packets to each network stack replica. The controller can place the packets on different queues based on different criteria, using protocol fields to uniquely identify each flow. In particular, thanks to the widespread adoption of virtualization, modern network cards can already classify and steer packets to different endpoints (i.e., virtual machines) based on a hash of a 5-element tuple including destination and source addresses, ports, and protocol number, or also use precise filters over the same fields.

For example, Intel 10G cards can hold up to 8 thousand filters. Software is, however, responsible for configuring the filters, which makes issuing frequent updates impractical with hundreds of thousands of connections per second. The NIC programming interface requires many read-write transactions across the PCI Express bus to configure the filters, which can negatively impact network processing [59]. For instance, the Intel ixgbe driver on Linux can optionally use TCP packet sampling, which sets up the NIC's filters to track locality of connections [38]—hence offloading the work done by receive flow steering (RFS [72]). Due to performance concerns, the driver only samples every 20th packet.

Since the NIC already inspects packets in its local memory, we believe a much more practical solution—which, however, requires extensions not yet available in contemporary commodity hardware—is adding extra logic into the NIC and creating "tracking" filters based on the packets the NIC handles. Such filters would simply instruct the NIC to ensure all the corresponding packets of each flow follow the same route.

On contemporary hardware, it is theoretically possible to implement the missing connection tracking features in the NIC driver itself (mimicking a smart NIC). We, however, felt this was a step in the wrong direction, when compared to the much more realistic option of offloading such support to the hardware, similar to TCP segmentation offload (TSO), large receive offload (LRO), and other features which eventually made their way into modern hardware. For this reason—but also not to introduce artificial overhead moving forward—we opted for a different solution in our experiments, compensating for the missing hardware features by limiting our evaluation to server applications, while still relying on contemporary NICs' hash functions to randomly distribute the inbound connections.

Another feature already present in modern NICs is packet duplication, commonly used for packet broadcasting purposes across virtual machines. The NICs we experiment with in our current prototype, however, do not have this feature available without virtualization. In particular, NEAT would benefit from a NIC delivering a

copy of each ARP reply to each network stack replica, since there is no knowledge of which replica sent the original query. Moreover, each of the replicas needs a similar set of ARP translations. Maintaining a shared ARP table is not an option due to our isolation requirements. In our experiments, we compensated for the missing hardware by trusting that packets in our testbed have matching Ethernet and IP addresses, which we use to fill the ARP tables.

Summarizing, similar to Peter et al. [121], our design strongly advocates for the devices taking over part of the system data plane while the operating system acts only as a control plane managing its settings, for instance the TCP TIME_WAIT timeout. We believe that, just as modern NICs have already evolved to match the requirements of monolithic systems, scalability and reliability of design can effectively initiate similar developments.
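To illustrate the MWAIT-based channels mentioned at the beginning of this section, the snippet below shows the basic wake-up pattern using the standard SSE3 intrinsics (_mm_monitor and _mm_mwait, compiled with -msse3): the receiver arms a monitor on a shared notification word and halts, and a plain store by the sender wakes it up. Since MWAIT may only be executed in kernel mode on Intel CPUs (which is why NEAT issues it through a kernel call), the snippet conveys the idea rather than user-level code that runs unmodified on every processor.

    /* Conceptual MONITOR/MWAIT-based cross-core wake-up.  The receiver halts
     * its core until the monitored cache line is written; the sender's wake-up
     * is an ordinary store, with no inter-processor interrupt involved. */
    #include <pmmintrin.h>

    static volatile unsigned notify;    /* shared notification word */

    /* receiver side: sleep until 'notify' changes */
    static void wait_for_message(void)
    {
        while (notify == 0) {
            _mm_monitor((const void *)&notify, 0, 0);  /* arm monitor on the line     */
            if (notify == 0)                           /* re-check to avoid a lost wake-up */
                _mm_mwait(0, 0);                       /* halt until the line is written   */
        }
    }

    /* sender side (running on another core): a plain store is the wake-up */
    static void send_message(void)
    {
        notify = 1;
    }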

6.5 Evaluation

We evaluated NEAT using two different multicore machines, subject to the hardware availability in our lab: (i) a 12-core AMD Opteron 6168 (1.9 GHz) and (ii) a dual-socket quad-core Intel Xeon 2.26 GHz (E5520). The AMD machine supports more physical cores, but the Xeon machine features 2 hardware threads per core (Hyper-Threading). For our experiments, we used a 10G Intel i82599 network card. For our scalability evaluation, we selected lighttpd, a popular web server. Lighttpd serves static files, cached in memory to avoid interference with other operating system components. We evaluated NEAT in both its single- and multi-component configurations. In the figures (for example in Figure 6.6), we denote such configurations as NEAT Nx and Multi Nx (respectively), where N refers to the number of replicas used. We also refer to a particular replica R as NEAT R and TCP (or IP) R in the two configurations considered (for example in Figure 6.5).

Workload

To evaluate the scalability of our solution we relied on the httperf benchmarking utility to open persistent connections and repeatedly request (100 times) a small 20-byte file over each connection. This setup does not overload the 10G NIC, allowing us to scale up the number of servers freely. A lighttpd instance serving 8 or more simultaneous connections can drive the application cores close to 100% utilization. At the same time, repeatedly requesting small files stresses the network stack as it must handle many requests from the network as well as the applications. In contrast to workloads sending large TCP segments or receiving back-to-back requests, hardware offloading is of no help here.

6.5.1 Scalability on a 12-core AMD

We first present results on the 12-core AMD. The number of available cores allows us to compare both single- and multi-component configurations. Figure 6.5 illustrates both configurations in their best-performing setups and Figure 6.6 presents scalability results for the different configurations of NEAT. In our experiments, we dedicated one of the cores to the operating system (OS) and one of the cores to the SYSCALL server, which generally needs its own core for low-latency messaging when the load is relatively low; its role was also crucial to ramp up the load for testing purposes. As the load grows, the SYSCALL core becomes increasingly idle, since the applications can bypass it with our mostly syscall-less socket design. This leaves our testbed with 10 cores available for the network stack and lighttpd. As Figure 6.6 shows, a configuration consisting of a single replica of the multi-component stack can scale fairly linearly up to 4 lighttpd instances. At that point, the stack becomes overloaded, but 3 cores remain completely unused. Adding one more replica can

(a) Multi-component stacks: OS, SYSCALL, NetDrv, TCP 1, IP 1, TCP 2, IP 2, and Web 1-5, one per core. (b) Single-component stacks: OS, SYSCALL, NetDrv, NEaT 1-3, and Web 1-6, one per core.

Figure 6.5: 12-core AMD - The best-performing setups using all 12 cores.

Figure 6.6: AMD - Scaling lighttpd and the network stack (throughput in kilo-requests/s vs. number of lighttpd instances, for NEaT 2x, NEaT 3x, Multi 1x, and Multi 2x).

use up 2 of the 3 remaining cores, allowing the throughput to scale further, up to 5 lighttpd instances. Although the CPU usage suggests that NEAT could effectively scale further, no more cores are available to scale up our multi-component stack with more replicas. While we do not have manycore machines at our disposal, we believe NEAT would in fact scale to those machines with no restrictions. The single-component version of NEAT, on the other hand, can scale further. In particular, the NEAT 2x configuration performs comparably to its multi-component counterpart

(a) All components on dedicated cores. (b) Colocated components using hyper-threading.

Figure 6.7: NEAT scales the required number of cores.

when serving up to 5 lighttpd instances. With an additional replica (NEAT 3x), NEAT perfectly scales up to 6 lighttpd instances. Similar to Multi 2x, NEAT is not overloaded yet. We have conducted similar experiments for lighttpd running on Linux on the same hardware. Using all 12 cores, lighttpd can handle 320 kilo-requests per second (krps). We configured Linux to assign one of the 12 NIC queues to each core and we enabled RFS. While with 3 single-component replicas NEAT reached only 302 krps, had the AMD processor supported multiple hardware threads per core, we could also have used the OS and SYSCALL cores and likely outperformed Linux. We substantiate this claim in the next section.

6.5.2 Scalability on an 8-core Xeon

Current trends suggest that the number of cores will keep growing (for example, Intel announced a new 72-core version of Knights Landing [9]) and, above all, as other researchers have suggested [84; 112; 144; 150], cores will become more heterogeneous and specialized. Nevertheless, our experience demonstrates that NEAT can replace cores by hardware threads—which are much cheaper—allowing NEAT's processes to use the available cores more efficiently. The general intuition is that hyper-threading allows NEAT to efficiently colocate relatively idle processes—which need their own context—like the SYSCALL server. Although a hardware thread is not the same as a fully-fledged core, the threaded setup can handle more load and provide redundancy. To confirm this intuition, we present our scalability results on a Xeon with hyper-threading.

Figure 6.7 depicts an example of using hyper-threading to reduce the number of

Figure 6.8: Colocated multi-component configuration of NEAT.

Figure 6.9: Xeon - Scaling the multi-component network stack (throughput in kilo-requests/s vs. number of lighttpd instances, for Multi 1x, Multi 2x, and Multi 2x HT).

cores that 2 NEAT replicas normally require in core-only configurations (a). For example, NEAT can colocate the NIC driver with the SYSCALL server, since as the driver's thread becomes more loaded, the SYSCALL's thread gets less loaded, ensuring little interference between the two processes. NEAT can also successfully place multiple NIC drivers on the same core. Similarly, the multi-component configuration of NEAT can use only 3 cores instead of the 6 cores used in the original configuration (Figure 6.8). We use HT to denote the NEAT configurations that use hyper-threading.

In contrast to the AMD machine, deploying the TCP and IP components of the network stack leaves only 4 cores available for the testing application. NEAT can now, however, take advantage of hyper-threads to run 8 web server instances. Results in Figure 6.9 show that, similar to the AMD case, the throughput of the network stack peaks when using 4 lighttpd instances. To scale further, we ran a second replica (Multi 2x), which leaves only 2 cores to run lighttpd. Using all 4 threads of those cores yielded similar performance to the 3-application-instance scenario in the previous test. This is expected, since the network stack is not the bottleneck and the 33% reduction in cores (2 instead of 3) for the

Figure 6.10: Xeon - The best-performing configuration of the single-component stack, fully exploiting hyper-threading (HT): OS, SYSCALL, NetDrv, NEaT 1-4, and Web 1-9 share the available hardware threads.

Figure 6.11: Xeon - Scaling the single-component network stack (throughput in kilo-requests/s vs. number of lighttpd instances, for NEaT 1x, NEaT 2x, and NEaT 4x, with and without HT).

application is within the bounds of the benefits of hyper-threading. Further scaling up—denoted by points 6 and 8 in Figure 6.9—uses threads on the cores occupied by the network stack itself, first using both TCP cores (6 lighttpd instances) and then both IP cores as well (8 lighttpd instances). In the latter configuration, Figure 6.9 shows the throughput peaking at 322 krps.

Finally, we colocated two replicas on different threads of the same cores (Multi 2x HT, Figure 6.8)—enforcing this policy for both TCP and IP replicas. As expected, 4 lighttpd instances yielded similar performance to the 1-replica configuration, since the stack is not the bottleneck and, while using the same number of cores as in the 1-replica configuration, it can process up to 8 lighttpd instances (4 cores, both threads).

When evaluating single-component configurations, we used up to 4 replicas, as

Figure 6.12: 12-core AMD - Comparing performance of different network stack configurations (NEaT 1x-3x, Multi 1x, Multi 2x) stressed by the same workload; throughput in kilo-requests/s for test configurations 8, 16, 32, 64, 2xsrv32, and 4xsrv64.

Figure 6.10 shows. The labels (NEATx and WebX) illustrate the order in which we scaled up NEAT and lighttpd. We present the results in Figure 6.11. Note that once NEAT becomes overloaded, it can spawn another replica and keep processing the growing load. As the figure shows, NEAT 4x can sustain a load of 372 krps, which is 13.4% more than the 328 krps maximum we measured for Linux, running lighttpd on each of 16 threads, using all cores at 100%.

6.5.3 Impact of Different Configurations

Finally, we evaluated the impact of the different configurations of NEAT and compared their overhead in detail. We used a modified test issuing only a single request per connection, which significantly increases the load on the stack itself. We evaluated 5 different configurations of the network stack using the 12-core AMD and

different workloads. We deployed (i) 1 lighttpd instance processing 8 to 64 simultaneous connections, (ii) 2 instances processing 32 simultaneous connections, and (iii) 4 instances processing 64 simultaneous connections. Figure 6.12 reports our results, demonstrating that when the load is relatively low, using only a single replica of the multi-component stack to handle 8 connections is better than using two such replicas. The same holds for the single-component stack. This is primarily due to the fact that lightly loaded components have higher communication overhead and often sleep, which introduces latency that is more evident in the multi-component stack. A component may spend as much as half of its execution time polling, while the polling time decreases as the load rises. Table 6.2 presents samples collected using statistical profiling of the NIC driver under a range of loads while serving 3 replicas. When the driver is often idle, it spends a significant portion of the active time suspending and resuming in the kernel (unfortunately MWAIT is a privileged instruction on Intel) and polling the 3 stacks and NIC queues.

CPU load   Active in kernel   Polling   Web krps
97%        0.1%               7.4%      242
88%        5.4%               19.7%     90
60%        14.2%              27.9%     45
6%         33.3%              51.8%     3

Table 6.2: 10G driver - CPU usage breakdown.

Figure 6.13: Expected fraction of state preserved after a failure vs. max throughput (kilo-requests/second) across the different network stack setups (NEaT 1x-4x and Multi 1x-2x, annotated with the number of cores and threads each uses).

As the load grows, the fraction of the "wasted" time shrinks and, when the core's usage is close to 100%, the driver almost never enters the kernel and uses more than 90% of its time processing network packets. Other components show similar behavior. Note that the CPU usage grows sharply, but levels off when the load is high since the driver trades the "wasted" time for packet processing. While the CPU load is 60% when lighttpd handles 45 krps, it is only 88% when the load doubles, and it is still not 100% when the number of requests increases almost 5-fold. When the load grows, the network stack scales and the advantage of having multiple replicas is more obvious.
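The behavior summarized in Table 6.2 boils down to a simple polling loop: keep polling while packets arrive and only pay the suspend/resume cost after a stretch of empty polls. The toy program below mimics that trade-off; the threshold, the fake work trace and the helper names are illustrative only, not the actual driver code.

    /* Toy version of the polling/suspend trade-off measured in Table 6.2:
     * the driver keeps polling its queues while work arrives and only asks
     * the kernel to suspend the core after several empty polling rounds. */
    #include <stdio.h>

    #define IDLE_LIMIT 3     /* empty polls tolerated before suspending */

    static int pending[] = { 2, 1, 0, 0, 0, 0, 3, 0, 0, 0 };  /* fake work trace */
    static int step;

    static int poll_queues(void)        /* packets found in this polling round */
    {
        return step < 10 ? pending[step++] : 0;
    }

    static void suspend_core(void)      /* stands in for the MWAIT kernel call */
    {
        printf("step %d: nothing to do, suspending core\n", step);
    }

    int main(void)
    {
        int idle = 0;
        while (step < 10) {
            int work = poll_queues();
            if (work > 0) {
                printf("step %d: processed %d packets\n", step, work);
                idle = 0;                /* busy: keep polling */
            } else if (++idle >= IDLE_LIMIT) {
                suspend_core();          /* idle long enough: give up the core */
                idle = 0;
            }
        }
        return 0;
    }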

6.5.4 Reliability

We also evaluated the reliability guarantees provided by NEAT across the different configurations. Figure 6.13 shows the expected fraction of the preserved state after a failure against the maximum throughput across all the different configurations we evaluated on the Xeon. We used the code size of each component to estimate the probability that a single component fails when a failure occurs within the network stack—assuming uniform failure probability throughout the code—and the resulting expected fraction of state preserved after a failure. For this purpose, we assumed the simple (and stateless) recovery strategy supported by our current prototype, which results in only the TCP state being irrecoverable after a TCP failure. As the figure shows, both performance and reliability increase with the number of available cores across all our configurations, confirming that our design allows scalability and reliability to effectively coexist with no compromises. In addition, as the figure confirms, given any fixed number of available cores (and HTs), NEAT's single- and multi-component configurations yield different performance-reliability tradeoffs, opening up interesting research opportunities on fine-grain isolation and allocation policies.
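The estimate can be made concrete with a simple model (our own illustration of the reasoning above, not necessarily the exact computation behind Figure 6.13): let p_i be the probability that component i is the one that fails, approximated by its share of the code size, and s_i the fraction of the total connection state it holds (zero for stateless components). Then

    \mathbb{E}[\text{state preserved}] \;=\; \sum_i p_i \,(1 - s_i) \;=\; 1 - \sum_i p_i\, s_i ,
    \qquad p_i = \frac{\text{codesize}_i}{\sum_j \text{codesize}_j} .

For N identical single-component replicas this gives 1 - 1/N, whereas in the multi-component configuration only the TCP components hold state, so the expected preserved fraction rises to 1 - p_TCP/N with p_TCP < 1; in both cases the expected preserved fraction grows with the number of replicas.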

6.6 Conclusion

Scalability and reliability are commonly perceived as largely independent—and possibly conflicting—requirements concerning the implementation of an operating system. This paper challenged that common belief, demonstrating that two key design abstractions, isolation and partitioning, can effectively address scalability and reliability at the same time, and both with constant returns with respect to the number of available processors on modern multicore architectures. Our abstractions also demonstrated that not only are scalability and reliability of design possible, but they represent a realistic and far superior alternative to scalability and reliability of implementation, which is plagued with problems and increasingly unable to "scale" to the complexity of modern hardware and software systems. We applied our principled design to NEAT, a scalable and reliable network stack that isolates and partitions its state across multiple and truly independent replicas. Thanks to our abstractions, NEAT can outperform Linux both in terms of scalability—up to 13% higher throughput with 4 replicas—and reliability—withstanding failures and reducing service disruption as the number of replicas increases—while retaining full compatibility with the BSD socket API and supporting full sharing and cooperation at the application level.

7 Summary and Conclusions

Hardware has changed dramatically in the past decade and made void most of the assumptions that were true when the current operating systems were conceived. At the same time, the demands on commodity operating systems have increased as computers have penetrated into our daily lives. Computers should be reliable because they should not require an experienced operator anymore, and the vast number of computers that large organizations manage should not demand frequent attention from well paid staff to deal with failures. Although the reliability of commodity operating systems has increased over the years (for example, the infamous "blue screen of death" of Windows is not as much of a common sight anymore), developers paid much more attention to improving performance, even though crashing machines can have severe overall performance and financial implications. In fact, reliability has only been specifically addressed in niche domains, where the consequences of any crash are fatal and the need to pay the high price for extreme reliability is indisputable.

In this thesis, we explored whether it is possible to take advantage of current hardware trends and modify reliable multiserver operating systems in such a way that they can become an option for building commodity operating systems, and to achieve it without making compromises in performance and reliability on either side. Such a system can improve the experience of average computer users as well as reduce the runtime of long running complex scientific computations. We demonstrated that it is possible to reduce the cost of overprovisioning in data centers, or at least for the network stack!

We achieve our goal by exploiting the fact that current multicore processors can execute different parts of a multiserver operating system simultaneously without the need of switching between them, thus removing the biggest performance overhead of multiserver systems without changing their general structure. However, this approach is not a straightforward solution that works out of the box and our work discusses the logical and practical details of a reliable, dependable, well performing


and resource-aware operating system, which we call NEWTOS.

First, we demonstrated that the communication between servers and drivers of the system is prohibitively expensive and that blindly using multicore processors to run the servers on dedicated cores does not make the system perform better. On the contrary, the cost of cross-core communication provided by a microkernel increases. To overcome the problem, we introduced pure user-space communication between pairs of cooperating processes. The microkernel is excluded from passing messages and only serves as a provider of reliable communication, allowing the processes to set up the communication channels in a trustworthy way. In addition to avoiding the microkernel overhead of message passing, the new user-space communication channels are asynchronous, so the communicating processes do not need to wait for a rendezvous and can proceed with processing of independent requests without blocking.

To demonstrate the benefits of the new communication mechanism and the distribution of system processes across the available cores, we reimplemented the network stack, a critical part of modern operating systems for performance and reliability. Even though we could have used another subsystem, we opted for the network stack because currently well established multigigabit links can generate load that is beyond what a multiserver system was ever thought capable of handling. In contrast, a subsystem like the storage stack is limited by the speed of the storage devices and only emerging flash-based technologies can sustain similar data transfer rates. In addition, due to the growing popularity of network attached storage (NAS), the network stack is becoming a vital part of the storage stack itself. We have measured that slow and, above all, synchronous messaging limits the speed of networking to only low hundreds of megabits per second. This is unbearably slow even for home users of NAS, for example for processing photos, videos and backups. The novel network stack of NEWTOS has dramatically raised the bar and it shows that a multiserver system, much like other commodity operating systems, is able to process data at multigigabit rates.

Moreover, while we have increased performance, we simultaneously increased the reliability of the entire operating system. Using the same principle of breaking up a monolithic system into a collection of isolated servers, we decomposed the monolithic network stack into several components along the boundaries of the individual protocol layers. This simplifies crash recovery since some components have little or no state, making them easy to restore. At the same time, we can use specific methods with varying performance overheads to protect components with large and frequently changing state. The key benefit of this decomposition is that a crash in an easily restorable part of the network stack does not directly affect the more complicated components. We tested the design by injecting faults while processing a gigabit TCP stream. Crashing a packet filter has negligible effect on the stream performance as it recovers instantly. However, recovering from an IP crash takes slightly more than a second. Although the IP component recovers almost instantly, the network interfaces, which are not manufactured with software restartability in mind, need a reset to flush the memory descriptors provided by IP. Unfortunately, resetting the hardware takes a lot of time.
Running some system processes on dedicated cores naturally limits the number of cores available to other system processes and, above all, to the applications. This contrasts with the common understanding of how operating systems work, based on the prevalent experience of most users with monolithic systems. Although our assumption is that the number of cores of future processors will be less limiting, we investigate the benefits of running a multiserver system on cores that feature more hardware threads; compared to adding cores, adding threads is easier and cheaper, and the system processes often need only a hardware container for their context. At the same time, we also emulate heterogeneous architectures to demonstrate that smaller and simpler cores, which we call wimpy cores, are sufficient for running system processes even when the system is relatively highly loaded. In fact, we discovered that it is often beneficial to run some parts of the network stack on slower cores to achieve higher throughput. This counterintuitive result is especially true for the network device drivers.

New products of major chip manufacturers as well as recent research publications indicate that we may see more heterogeneous architectures in the future. We show that a multiserver system is well suited for such architectures as it is easy to run different servers of the operating system on the most appropriate cores. However, this requires the scheduler to carefully assess the optimal placement of the system processes (with respect to the workload, available resources and energy budget). Since even the scheduler is an isolated process, which can run any time it needs to, it can gather rich statistics from different parts of the system and run nontrivial evaluation algorithms to make good scheduling decisions.

Another limitation of running parts of the network stack on dedicated cores is the fact that a single component cannot handle more work than a single core can process. Even though our network stack employs multiple cores, and thus can handle more than a single-component network stack could have, the scalability is limited by the most demanding component. At the same time, we want to avoid multithreading. To scale the network stack, we devised a method of running multiple copies of the network stack. Each copy or replica works on its own as the copies are all strictly isolated and do not communicate with each other. This avoids complicated synchronization of shared data structures, which would require elaborate (and hence error-prone) synchronization schemes and bouncing of data between the caches of individual cores. It is exactly this issue that complicates scalability of monolithic systems, which are inherently multithreaded.

Running multiple independent replicas of the network stack allows the multiserver system to scale and naturally increases its reliability, as a failure in one of the replicas does not render the entire network stack unusable. This is especially important for TCP processing since the TCP components have large, complicated and frequently changing state. A crash in one of the TCP replicas means that only a portion of the state is lost or requires recovery. At the same time, applications can use the other replicas to serve their needs. The downside of running multiple replicas is that the system must manage the number of available replicas and match it with the desired load, reliability and available resources.
This can be configured manually, similar to how administrators manually tune the Linux network stack for high performance of a given workload on given types of machines. However, our goal is to make the system self-adjusting. To be able to efficiently multiplex the data streams among the running replicas of the network stack, we need support from the network cards. With the proliferation of virtualization came the need to spread the work among many cores in commodity operating systems, and manufacturers enhanced the network cards with many clever features that enable them to take on the isolating role of the operating system. We use these features and we argue that some tweaking is necessary for these features to fully support building reliable systems.

Last but not least, we demonstrated that the user-space messaging we use within the network stack to communicate between its parts is further extendable, as we use it to improve communication between the operating system and the applications. Applications can use this extended mechanism to request operating system services without expensive system calls that disrupt their execution. We use the model of the user-space communication channels to build network sockets that expose information to the application that was previously provided only by the operating system upon request. Using the new sockets, the application can itself decide which socket is ready, that is, whether a socket has data available or can accept new data for sending. Testing the state of the socket has negligible overhead in comparison to legacy nonblocking system calls, as we can avoid more than 99% of the system calls when the system is highly loaded. Therefore the application can speculate more aggressively as the penalty for a failure is low. We have shown that legacy event-driven servers can take advantage of these sockets without any code modifications and achieve better performance running in NEWTOS than in Linux.

We successfully demonstrated that it is possible to build systems that perform "on par" with commodity operating systems, while offering a huge competitive advantage in their reliability and dependability. This combination makes them unique and should make them widely acceptable in many areas of computing.

Future Work

Although NEWTOS shows the direction to fast and reliable computing, plenty of work remains to be done in the future.

Beyond the network stack

Although our proof of concept shows that our principles can improve the network stack of a multiserver system, wider acceptance and deployment require us to apply the same principles to other parts of the operating system as well. For example, the storage stack in multiserver systems is already composed of several components (the virtual file system, the actual file systems and storage device drivers), while further decomposition for the

sake of modularity [27] and reliability is desirable. As flash-based storage is gaining momentum, it will be able to stress the operating system in the same way as high speed networks do. In fact, PCIe flash cards can do that already.

Self-adjusting system

As we spread the multiserver system across many cores, it is necessary to develop smart algorithms for intelligent and efficient management of available resources. Our current work only provides the required mechanisms to implement such schedulers and monitors that can adjust the whole operating system to the changing runtime conditions. We must take the user's assistance out of the loop and automate the process of evaluating the system online. Such schedulers must learn from the current usage patterns and performance statistics to devise schemes for the best placement of system components on the available CPU cores depending on the mix of applications, workloads, energy budget constraints and user preferences. Due to the similarity of multiserver systems to distributed systems, we can base our future research on the methods applied in distributed computing, although at a much finer level. We would like to apply machine learning and data mining techniques to analyze the rich set of values that we can collect from various system servers.

Hardware support

We pointed out many times in our work that we can marry performance and reliability only thanks to the advances in microprocessor architectures. Nevertheless, we also stress that the hardware can do significantly more to help software carry out its tasks more efficiently. In the first place, having more support for message passing is a pressing matter. For instance, we have used the MWAIT instruction to implement efficient and power-aware cross-core communication. However, this instruction is specific to the x86 architecture. To the best of our knowledge, other architectures lack similar mechanisms. At the same time, it is not possible to use this instruction in user space on Intel chips, while it is possible on some AMD chips. To make our system easily portable across these two platforms, we resorted to using a kernel call to execute the instruction. Removing this overhead will likely lead to a reduction in the latency of our system under low load.

In addition, we would like to explore a finer-grained control over the privileges of user-space processes. It is possible to set many features on virtual machines. However, in the case of privileged versus unprivileged code, it is

either all or nothing. The system should be able to grant some of its servers access to certain CPU features that ordinary applications do not need, but which do not conflict with isolation in constrained conditions. For instance, a system server on a dedicated core should not be restricted from pausing it.

Efficient polling

Even if we suspend a core which has no work to do in user space, waking up the core again takes some time. Although we cannot know whether the next request arrives almost immediately or whether there will be an extended period of idle time, the system components could adapt their polling intervals

based on the recent stream of requests. The server could keep polling a little longer if it estimates the interarrival gap to be shorter than the time needed to suspend and wake up the core, or when suspending would not save energy. This of course differs for each platform.

Portability

Due to the MINIX 3 legacy, NEWTOS runs only in 32-bit mode of x86 processors. Although we can use the different features Intel and AMD provide, for example the number of threads per core, there are other interesting architectures with different types of cores, numbers of threads, types of pipelines and unusual instructions. Therefore portability across various platforms is extremely important for further research.

Shared memory and cache coherency

Our current implementation of the fast user-space message passing relies on shared memory, and using shared memory stresses the cache coherence protocols as data are being produced on different cores than where they are consumed. We believe that a finer-grained control of caching could result in (i) faster communication and (ii) less interference with other data in the cache. In fact, we only use cache coherent memory out of convenience as our current evaluation platform offers it. In an extreme case, we may lack cache coherency altogether. The explicit message-based communication is well suited for such environments as we know exactly when we want to make our memory modifications globally visible and the remote side does not need to see partial changes at all. At the same time, the producer does not need to cache the messages since they are not read again.

Application to other operating systems

The most important part of our future work is to demonstrate that our design principles are a viable option at the production level. Since the engineering effort of writing a production-grade operating system from scratch is enormous, we see the more likely direction in applying parts of our work to the existing commodity operating systems. For example, we could run our network stack side by side with the Linux kernel, only substituting this particular functionality. Although the entire system would not become reliable at once, enhancing the reliability of a key subsystem is already a big leap forward, even though any fault in other parts of the kernel would still bring the entire system to a halt.

References

[1] ASLR: Leopard versus Vista. http://blog.laconicsecurity.com/2008/01/aslr-leopard-versus-vista.html.
[2] ARM - big.LITTLE Processing. http://www.arm.com/products/processors/technologies/biglittleprocessing.php.
[3] Killing the Big Kernel Lock. http://lwn.net/Articles/380174/.
[4] Network transmit queue limits. http://lwn.net/Articles/454390/.
[5] D-Bus. http://dbus.freedesktop.org.
[6] The Heartbleed Bug. http://heartbleed.com/.
[7] httperf. http://www.hpl.hp.com/research/linux/httperf/.
[8] The unveiling of kdbus. http://lwn.net/Articles/580194/.
[9] Intel's "knights landing" xeon phi coprocessor detailed. http://www.anandtech.com/show/8217/intels-knights-landing-coprocessor-detailed.
[10] libevent. http://libevent.org.
[11] Lighttpd Web Server. http://www.lighttpd.net/.
[12] MINIX 3. http://www.minix3.org.
[13] OpenOnload. http://www.openonload.org.
[14] RCU Linux Usage. http://www.rdrop.com/users/paulmck/RCU/linuxusage.html.
[15] RFC: remove __read_mostly. http://lwn.net/Articles/262557.

131 132 CHAPTER 7. SUMMARY AND CONCLUSIONS

[16] Intel Turbo Boost Technology in Intel Core Microarchitecture (Ne- halem) Based Processors. http://files.shareholder.com/ downloads/INTC/0x0x348508/C9259E98-BE06-42C8-A433- E28F64CB8EF2/TurboBoostWhitePaper.pdf. [17] Average Web Page Size Triples Since 2008, 2012. http://www. websiteoptimization.com/speed/tweak/average-web-page/. [18] What’s New for Windows Sockets. http://msdn.microsoft.com/en- us/library/windows/desktop/ms740642(v=vs.85).aspx. [19] The Intel Xeon Phi Coprocessor. http://www.intel.com/content/www/ us/en/processors/xeon/xeon-phi-detail.html. [20] QNX Neutrino RTOS System Architecture. http://support7.qnx.com/ download/download/14695/sys_arch.pdf. [21] Vulnerability in TCP/IP Could Allow Remote Code Execution. http:// technet.microsoft.com/en-us/security/bulletin/ms11-083. [22] Variable SMP - A Multi-Core CPU Architecture for Low Power and High Performance. http://www.nvidia.com/content/PDF/tegra_white_ papers/tegra-whitepaper-0911b.pdf. [23] Samsung to outline 8-core big.LITTLE ARM processor in Febru- ary. http://www.engadget.com/2012/11/20/samsung-to-outline- 8-core-big-little-arm-processor-in-february/. [24] Distributed Caching with Memcached. http://www.linuxjournal.com/ article/7451. [25] Nginx: the High-Performance Web Server and Reverse Proxy. http: //www.linuxjournal.com/magazine/nginx-high-performance- web-server-and-reverse-proxy. [26] Jonathan Appavoo, Dilma Da Silva, Orran Krieger, Marc Auslander, Michal Ostrowski, Bryan Rosenburg, Amos Waterland, Robert W. Wisniewski, Jimi Xenidis, Michael Stumm, and Livio Soares. Experience Distributing Objects in an SMMP OS. ACM Trans. Comput. Syst., 2007. [27] Raja Appuswamy, David C. van Moolenbroek, and Andrew S. Tanenbaum. Loris - A Dependable, Modular File-Based Storage Stack. Proceedings of the Pacific Rim International Symposium on Dependable Computing, 2010. [28] Jeff Arnold and M. Frans Kaashoek. : Automatic Rebootless Kernel Updates. In Proceedings of the 4th ACM European Conference on Computer Systems, EuroSys ’09, 2009. [29] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of 133

Virtualization. In Proceedings of the Nineteenth ACM Symposium on Operat- ing Systems Principles, SOSP ’03, 2003. [30] Andrew Baumann, , Dilma Da Silva, Orran Krieger, Robert W. Wisniewski, and Jeremy Kerr. Providing Dynamic Update in an Operating System. In Proceedings of the USENIX Annual Technical Conference, 2005. [31] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Re- becca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schüpbach, and Akhilesh Singhania. The Multikernel: A New OS Architecture for Scalable Multicore Systems. In Proceedings of the Symposium on Operating Systems Principles, 2009. [32] Michela Becchi and Patrick Crowley. Dynamic Thread Assignment on Me- terogeneous Multiprocessor Architectures. In Proceedings of the 3rd Confer- ence on Computing Frontiers, CF ’06, 2006. [33] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A Protected Dataplane Operating Sys- tem for High Throughput and Low Latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014. [34] Andrea Bittau, Adam Belay, Ali Mashtizadeh, David Mazieres, and Dan Boneh. Hacking Blind. In In proceedings of the IEEE Security and Privacy conference, Oakland, 2014. [35] Herbert Bos, Willem de Bruijn, Mihai Cristea, Trung Nguyen, and Georgios Portokalidis. FFPF: Fairly Fast Packet Filters. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation, 2004. [36] Silas Boyd-Wickizer and Nickolai Zeldovich. Tolerating Malicious Device Drivers in Linux. In Proceedings of the USENIX Annual Technical Confer- ence, 2010. [37] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang. Corey: An Operating System for Many Cores. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id= 1855741.1855745. [38] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Confer- ence on Operating Systems Design and Implementation, OSDI’10, Berkeley, CA, USA, 2010. USENIX Association. 134 CHAPTER 7. SUMMARY AND CONCLUSIONS

[39] Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zel- dovich. Non-scalable locks are dangerous. In Proceedings of the Linux Sym- posium, Ottawa, Canada, July 2012. [40] Miguel Castro, Manuel Costa, Jean-Philippe Martin, Marcus Peinado, Perik- lis Akritidis, Austin Donnelly, Paul Barham, and Richard Black. Fast Byte- granularity Software Fault Isolation. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles, 2009. [41] Andy Chou, Junfeng Yang, Benjamin Chelf, Seth Hallem, and Dawson En- gler. An Empirical Study of Operating Systems Errors. In Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, SOSP ’01, 2001. [42] Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich, Robert T. Mor- ris, and Eddie Kohler. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, 2013. [43] Patrick Colp, Mihir Nanavati, Jun Zhu, William Aiello, George Coker, Tim Deegan, Peter Loscocco, and Andrew Warfield. Breaking up is Hard to Do: Security and Functionality in a Commodity Hypervisor. In Proceedings of the Symposium on Operating Systems Principles, 2011. [44] Gilberto Contreras and Margaret Martonosi. Power Prediction for Intel XS- cale Processors Using Performance Monitoring Unit Events. In Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005. [45] Benjamin Cox, David Evans, Adrian Filipi, Jonathan Rowanhill, Wei Hu, Jack Davidson, John Knight, Anh Nguyen-Tuong, and Jason Hiser. N-variant systems: A secretless framework for security through diversity. In Proceed- ings of the 15th USENIX Security Symposium, 2006. [46] Lorenzo Cavallaro Cristiano Giuffrida and Andrew S. Tanenbaum. We Crashed, Now What? In Proceedings of the 6th International Workshop on Hot Topics in System Dependability, 2010. [47] Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, and Roy H. Campbell. CuriOS: Improving Reliability Through Operating System Structure. In Pro- ceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, 2008. [48] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Pro- ceedings of the Twenty-Fourth ACM Symposium on Operating Systems Prin- ciples, 2013. 135

[49] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask. In Proceedings of the Symposium on Operating Systems Principles, 2013. [50] Willem de Bruijn, Herbert Bos, and Henri Bal. Application-Tailored I/O with Streamline. ACM Transactions on Computer Systems, 29(2), May 2011. [51] Willem de Bruijn, Herbert Bos, and Henri Bal. Application-Tailored I/O with Streamline. ACM Transacations on Computer Systems, 29, May 2011. [52] Alex Depoutovitch and Michael Stumm. Otherworld: Giving Applications a Chance to Survive OS Kernel Crashes. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, 2010. [53] Luca Deri. Improving Passive Packet Capture: Beyond Device Polling. In International System Administration and Network Engineering Conference, 2004. [54] Mathieu Desnoyers, Paul E McKenney, and Michel R Dagenais. Multi- core Systems Modeling for Formal Verification of Parallel Algorithms. ACM SIGOPS Operating Systems Review, 47(2), 2013. [55] Peter Druschel and Larry L. Peterson. Fbufs: A High-bandwidth Cross- domain Transfer Facility. In Proceedings of the Symposium on Operating Systems Principles, 1993. [56] Adam Dunkels. Full TCP/IP for 8-bit architectures. In International Confer- ence on Mobile Systems, Applications, and Services, 2003. [57] Kevin Elphinstone and Gernot Heiser. From L3 to seL4 What Have We Learnt in 20 Years of L4 Microkernels? In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, 2013. [58] Jon Erickson. Hacking: The Art of Exploitation. No Starch Press, 2003. ISBN 1-59327-007-0. [59] Mario Flajslik and Mendel Rosenblum. Network Interface Design for Low Latency Request-Response Protocols. In Presented as part of the 2013 USENIX Annual Technical Conference (USENIX ATC 13), 2013. [60] Ben Gamsa, Orran Krieger, Jonathan Appavoo, and Michael Stumm. Tor- nado: Maximizing Locality and Concurrency in a Shared Memory Multipro- cessor Operating System. In Proceedings of the Third Symposium on Operat- ing Systems Design and Implementation, OSDI ’99, 1999. [61] Vinod Ganapathy, Matthew J. Renzelmann, Arini Balakrishnan, Michael M. Swift, and Somesh Jha. The Design and Implementation of Microdrivers. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008. 136 CHAPTER 7. SUMMARY AND CONCLUSIONS

[62] Alain Gefflaut, Trent Jaeger, Yoonho Park, , Kevin J. Elphin- stone, Volkmar Uhlig, Jonathon E. Tidswell, Luke Deller, and Lars Reuther. The SawMill Multiserver Approach. In Proceedings of the 9th Workshop on ACM SIGOPS European Workshop: Beyond the PC: New Challenges for the Operating System, 2000. [63] Jim Gettys and Kathleen Nichols. Bufferbloat: Dark Buffers in the Internet. Queue, 9(11), November 2011. ISSN 1542-7730. [64] John Giacomoni, Tipp Moseley, and Manish Vachharajani. FastForward for Efficient Pipeline Parallelism: A Cache-optimized Concurrent Lock-free Queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’08, 2008. [65] John Giacomoni, Tipp Moseley, and Manish Vachharajani. FastForward for Efficient Pipeline Parallelism: A Cache-optimized Concurrent Lock-free Queue. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008. [66] Cristiano Giuffrida and Andrew S. Tanenbaum. Safe and Automated State Transfer for Secure and Reliable Live Update. In Proceedings of the Fourth International Workshop on Hot Topics in Software Upgrades, 2012. [67] Cristiano Giuffrida, Anton Kuijsten, and Andrew S. Tanenbaum. Safe and Automatic Live Update for Operating Systems. In Proceedings of the Inter- national Conference on Architectural Support for Programming Languages and Operating Systems, 2013. [68] R Guerraoui and A Schiper. Software-based Replication for fault Tolerance. Computer, 30(4), 1997. [69] Sangjin Han, Scott Marshall, Byung-Gon Chun, and Sylvia Ratnasamy. MegaPipe: A New Programming Interface for Scalable Network I/O. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation, 2012. [70] Les Hatton. Reexamining the Fault Density-Component Size Connection. IEEE Software, 14, March 1997. [71] Gernot Heiser. Do Microkernels Suck? Talk at Linux.conf.au 2008, Melbourne http://www.ssrg.nicta.com.au/publications/papers/ Heiser_08lca.slides.pdf. [72] Tom Herbert. rfs: receive flow steering. http://lwn.net/Articles/ 381955/. [73] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. Failure Resilience for Device Drivers. In Proceedings of the 137

International Conference on Dependable Systems and Networks, 2007. [74] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. Countering IPC Threats in Multiserver Operating Systems (A Fundamental Requirement for Dependability). In Proceedings of the Pacific Rim International Symposium on Dependable Computing, 2008. [75] Jorrit N. Herder, Herbert Bos, Ben Gras, Philip Homburg, and Andrew S. Tanenbaum. Fault Isolation for Device Drivers. In Proceedings of the Inter- national Conference on Dependable Systems and Networks, 2009. [76] Tomas Hruby, Dirk Vogt, Herbert Bos, and Andrew S. Tanenbaum. Keep Net Working - On a Dependable and Fast Networking Stack. In Proceedings of the Dependable Systems and Networks (DSN 2012), Boston, MA, June 2012. [77] Tomas Hruby, Herbert Bos, and Andrew S. Tanenbaum. When Slower is Faster: On Heterogeneous Multicores for Reliable Systems. In Proceedings of the USENIX ATC, San Jose, CA, 2013. [78] Galen C. Hunt, James R. Larus, David Tarditi, and Ted Wobber. Broad New OS Research: Challenges and Opportunities. In Proceedings of Tenth Workshop on Hot Topics in Operating Systems (HotOs). USENIX, June 2005. URL http://research.microsoft.com/apps/pubs/default. aspx?id=54666. [79] Intel. Single-Chip Cloud Computer. http://techresearch.intel.com/ ProjectDetails.aspx?Id=1. [80] Engin Ipek, Meyrem Kirman, Nevin Kirman, and Jose F. Martinez. Core Fu- sion: Accommodating Software Diversity in Chip Multiprocessors. In Pro- ceedings of the 34th annual international symposium on Computer architec- ture, 2007. [81] Nicholas Jalbert, Cristiano Pereira, Gilles Pokam, and Koushik Sen. RAD- Bench: A Concurrency Bug Benchmark Suite. In Proceedings of the USENIX Workshop on Hot Topics in Parallelism, May 2011. [82] EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. mTCP: a Highly Scal- able User-level TCP Stack for Multicore Systems. In 11th USENIX Sympo- sium on Networked Systems Design and Implementation (NSDI 14), 2014. [83] Asim Kadav, Matthew J. Renzelmann, and Michael M. Swift. Fine-grained Fault Tolerance Using Device Checkpoints. In Proceedings of the Eigh- teenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2013. [84] Khubaib, M. Aater Suleman, Milad Hashemi, Chris Wilkerson, and Yale N. 138 CHAPTER 7. SUMMARY AND CONCLUSIONS

Patt. MorphCore: An Energy-Efficient Microarchitecture for High Perfor- mance ILP and High Throughput TLP. In Proceedings of the 2012 45th An- nual IEEE/ACM International Symposium on Microarchitecture. [85] Gerwin Klein, Kevin Elphinstone, Gernot Heiser, June Andronick, David Cock, Philip Derrin, Dhammika Elkaduwe, Kai Engelhardt, Rafal Kolanski, Michael Norrish, Thomas Sewell, Harvey Tuch, and Simon Winwood. seL4: Formal Verification of an OS Kernel. In Proceedings of the Symposium on Operating Systems Principles, SOSP ’09, 2009. [86] Oded Koren. A Study of the Linux Kernel Evolution. SIGOPS Operating Systems Review, 2006. [87] David Koufaty, Dheeraj Reddy, and Scott Hahn. Bias Scheduling in Het- erogeneous Multi-Core Architectures. In Proceedings of the 5th European Conference on Computer systems, EuroSys ’10, 2010. [88] Rakesh Kumar, Keith I. Farkas, Norman P. Jouppi, Parthasarathy Ran- ganathan, and Dean M. Tullsen. Single-ISA Heterogeneous Multi-Core Ar- chitectures: The Potential for Processor Power Reduction. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, 2003. [89] Rakesh Kumar, Dean M. Tullsen, Parthasarathy Ranganathan, Norman P. Jouppi, and Keith I. Farkas. Single-ISA Heterogeneous Multi-Core Archi- tectures for Multithreaded Workload Performance. In Proceedings of the 31st annual international symposium on Computer architecture, ISCA ’04, 2004. [90] Christoph Lameter. Extreme High Performance Computing or Why Micro- kernels Suck. In Proceedings of the Otawa Linux Symposium, 2007. [91] Lawrence Latif. IDF: Intel is looking at ARM’s Big Little architec- ture. http://www.theinquirer.net/inquirer/news/2205764/idf- intel-is-looking-at-arms-big-little-architecture. [92] Jonathan Lemon. Kqueue - A Generic and Scalable Event Notification Facil- ity. In Proceedings of the FREENIX Track: 2001 USENIX Annual Technical Conference, 2001. [93] Andrew Lenharth, Vikram S. Adve, and Samuel T. King. Recovery Domains: An Organizing Principle for Recoverable Operating Systems. In Proceedings of the 14th International Conference on Architectural Support for Program- ming Languages and Operating Systems, ASPLOS XIV, 2009. [94] Ben Leslie, Peter Chubb, Nicholas Fitzroy-dale, Stefan Götz, Charles Gray, Luke Macpherson, Daniel Potts, Yueting Shen, Kevin Elphinstone, and Ger- not Heiser. User-level Device Drivers: Achieved Performance. Computer Science and Technology, 20, 2005. 139

[95] Chuck Lever, Marius Eriksen, Chuck Lever, Sun netscape Alliance, Sun netscape Alliance, Marius Aamodt Eriksen, Marius Aamodt Eriksen, Linux. Com, Linux. Com, Stephen Molloy, and Stephen P. Molloy. An analysis of the web server. Technical report, 2000. [96] Tao Li and Lizy Kurian John. Operating System Power Minimization Through Run-time Processor Resource Adaptation. Microprocessors and Mi- crosystems, 30(4), 2006. [97] Jochen Liedtke, Kevin Elphinstone, Sebastian Schönberg, Hermann Härtig, Gernot Heiser, Nayeem Islam, and Trent Jaeger. Achieved IPC Performance (Still The Foundation For Extensibility). [98] Felix Xiaozhu Lin, Zhen Wang, and Lin Zhong. K2: A Mobile Operating System for Heterogeneous Coherence Domains. In Proceedings of the Inter- national Conference on Architectural Support for Programming Languages and Operating Systems, 2014. [99] Tongping Liu and Emery D. Berger. SHERIFF: Precise Detection and Auto- matic Mitigation of False Sharing. In Proceedings of the 2011 ACM Interna- tional Conference on Object Oriented Programming Systems Languages and Applications, 2011. [100] Jork Löser, Hermann Härtig, and Lars Reuther. A Streaming Interface for Real-Time Interprocess Communication. In Proceedings of the Workshop on Hot Topics in Operating Systems, 2001. [101] Shan Lu, Soyeon Park, Chongfeng Hu, Xiao Ma, Weihang Jiang, Zhenmin Li, Raluca A. Popa, and Yuanyuan Zhou. MUVI: automatically inferring multi- variable access correlations and detecting related semantic and concurrency bugs. SIGOPS Operating Systems Review, 41, October 2007. [102] Ilias Marinos, Robert N. M. Watson, and Mark Handley. Network Stack Spe- cialization for Performance. In Proceedings of the Workshop on Hot Topics in Networks, 2013. [103] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Ko- ufaty, Alan J. Miller, and Michael Upton. Hyper-Threading Technology Ar- chitecture and Microarchitecture. Intel Technology Journal, February 2002. [104] Timothy Mattson. Intel: 1,000-core Processor Possible. http: //www.pcworld.com/article/211238/intel_1000core_processor_ possible.html. [105] Paul E. Mckenney and John D. Slingwine. Read-Copy Update: Using Exe- cution History to Solve Concurrency Problems. In Parallel and Distributed Computing and Systems, 1998. 140 CHAPTER 7. SUMMARY AND CONCLUSIONS

[106] Lukas Mejdrech. Networking and TCP/IP stack for HelenOS system. http: //www.helenos.org. [107] John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Transactions on Computer Systems, 9(1), February 1991. ISSN 0734-2071. [108] Jeffrey C. Mogul, Jayaram Mudigonda, Nathan Binkert, Parthasarathy Ran- ganathan, and Vanish Talwar. Using Asymmetric Single-ISA CMPs to Save Energy on Operating Systems. IEEE Micro, 28(3), May 2008. ISSN 0272- 1732. [109] Derek Gordon Murray, Grzegorz Milos, and Steven Hand. Improving Xen Security Through Disaggregation. In Proceedings of the Fourth ACM SIG- PLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’08, 2008. [110] David Nellans, Rajeev Balasubramonian, and Erik Brunv. A Case for In- creased Operating System Support in Chip Multiprocessors. In In Proceed- ings of the 2nd IBM Watson P=ac 2, 2005. [111] Wee Teck Ng and Peter M. Chen. The Systematic Improvement of Fault Tolerance in the Rio File Cache. In Proceedings of the Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, 1999. [112] Edmund B. Nightingale, Orion Hodson, Ross McIlroy, Chris Hawblitzel, and Galen Hunt. Helios: Heterogeneous Multiprocessing with Satellite Kernels. In Proceedings of the Symposium on Operating Systems Principles, 2009. [113] Ruslan Nikolaev and Godmar Back. VirtuOS: An Operating System with Kernel Virtualization. In Proceedings of the Symposium on Operating Systems Principles, 2013. [114] Rajesh Nishtala, Hans Fugal, Steven Grimm, Marc Kwiatkowski, Herman Lee, Harry C. Li, Ryan McElroy, Mike Paleczny, Daniel Peek, Paul Saab, David Stafford, Tony Tung, and Venkateshwaran Venkataramani. Scaling Memcache at Facebook. In USENIX Symposium on Networked Systems De- sign and Implementation (NSDI 13), 2013. [115] Kunle Olukotun, Lance Hammond, and James Laudon. Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency. 2007. ISBN 9781598291223. [116] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. Flash: An Efficient and Portable Web Server. In Proceedings of the USENIX Annual Technical Conference, 1999. [117] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. IO-Lite: A Unified I/O 141

Buffering and Caching System. ACM Transactions on Computer Systems, 18 (1), 2000. [118] Nicolas Palix, Gaël Thomas, Suman Saha, Christophe Calvès, Julia Lawall, and Gilles Muller. Faults in Linux: Ten Years Later. In Proceedings of the Six- teenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, 2011. [119] Markus Peloquin, Lena Olson, and Andrew Coonce. Simultaneity safari: A study of concurrency bugs in device drivers. University of Wiscon- sin–Madison Report, pages.cs.wisc.edu/~markus/736/concurrency. pdf. [120] Aleksey Pesterev, Jacob Strauss, Nickolai Zeldovich, and Robert T. Morris. Improving Network Connection Locality on Multicore Systems. In Proceed- ings of the European Conference on Computer Systems, EuroSys ’12, 2012. [121] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The Oper- ating System is the Control Plane. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014. [122] Joshua A. Redstone, Susan J. Eggers, and Henry M. Levy. An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, New York, NY,USA, 2000. [123] Luigi Rizzo. Netmap: A Novel Framework for Fast Packet I/O. In Pro- ceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC’12, 2012. [124] Luigi Rizzo and Giuseppe Lettieri. VALE, a Switched Ethernet for Virtual Machines. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT ’12, 2012. [125] Luigi Rizzo, Giuseppe Lettieri, and Vincenzo Maffione. Speeding Up Packet I/O in Virtual Machines. In Proceedings of the Symposium on Architectures for Networking and Communications Systems, 2013. [126] Sourav Roy, Xiaomin Lu, Edmund Gieske, Peng Yang, and Jim Holt. Asym- metric Scaling on Network Packet Processors in the Dark Silicon Era. In Pro- ceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, 2013. [127] Juan Carlos Saez, Alexandra Fedorova, David Koufaty, and Manuel Prieto. Leveraging Core Specialization via OS Scheduling to Improve Performance on Asymmetric Multicore Systems. ACM Transactions on Computer Systems, 30, April 2012. 142 CHAPTER 7. SUMMARY AND CONCLUSIONS

[128] Suman Saha, Jean-Pierre Lozi, Gaël Thomas, Julia L Lawall, and Gilles Muller. Hector: Detecting Resource-Release Omission Faults in Error- Handling Code for Systems Software. In Proceedings of the 2013 43rd An- nual IEEE/IFIP International Conference on Dependable Systems and Net- works (DSN), 2013. [129] Babak Salamat, Andreas Gal, Todd Jackson, Karthikeyan Manivannan, Gre- gor Wagner, and Michael Franz. Multi-variant Program Execution: Using Multi-core Systems to Defuse Buffer-Overflow Vulnerabilities. In Proceed- ings of the International Conference on Complex, Intelligent and Software Intensive Systems, 2008. [130] Babak Salamat, Todd Jackson, Andreas Gal, and Michael Franz. Orchestra: Intrusion detection using parallel execution and monitoring of program vari- ants in user-space. In Proceedings of the ACM Fourth European Conference on Computer Systems, 2009. [131] David C. Sastry and Mettin Demirci. The QNX Operating System. Computer, 28, November 1995. [132] Mario Scondo. Concurrency and Race Conditions in Kernel Space (Linux 2.6). LinuxSupport.com (extract from "Linux Device Drivers"), December 2009. [133] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A Many-core x86 Architecture for Visual Computing. ACM Trans- actions on Graphics, 27, August 2008. [134] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hethering- ton, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. UltraSPARC T2: A Highly-treaded, Power-efficient, SPARC SOC. In Proceedings of the Asian Solid-State Circuits Conference, 2007. [135] Leah Shalev, Julian Satran, Eran Borovik, and Muli Ben-Yehuda. IsoStack: Highly Efficient Network Processing on Dedicated Cores. In Proceedings of the USENIX Annual Technical Conference, 2010. [136] Jonathan Shapiro. Debunking Linus’s Latest. http://www.coyotos.org/ docs/misc/linus-rebuttal.html. [137] Jonathan S. Shapiro. Vulnerabilities in Synchronous IPC Designs. In Pro- ceedings of the 2003 IEEE Symposium on Security and Privacy, 2003. [138] Kevin Z. Snow, Fabian Monrose, Lucas Davi, Alexandra Dmitrienko, Christo- pher Liebchen, and Ahmad-Reza Sadeghi. Just-In-Time Code Reuse: On the Effectiveness of Fine-Grained Address Space Layout Randomization. In Pro- 143

ceedings of the IEEE Symposium on Security and Privacy, 2013. [139] Livio Soares and Michael Stumm. FlexSC: Flexible System Call Scheduling with Exception-less System Calls. In Proceedings of the 9th USENIX Con- ference on Operating systems design and implementation, 2010. [140] Livio Soares and Michael Stumm. Exception-less System Calls for Event- Driven Servers. In Proceedings of the USENIX Annual Technical Conference, 2011. [141] V. Spiliopoulos, S. Kaxiras, and G. Keramidas. Green Governors: A Frame- work for Continuously Adaptive DVFS. In Proceedings of the International Green Computing Conference and Workshops, 2011. [142] Sadagopan Srinivasan, Li Zhao, Ramesh Illikkal, and Ravishankar Iyer. Effi- cient Interaction Between OS and Architecture in Heterogeneous Platforms. SIGOPS Operating Systems Review, 45(1), February 2011. [143] Udo Steinberg and Bernhard Kauer. NOVA: A Microhypervisor-based Secure Virtualization Architecture. In Proceedings of the 5th European Conference on Computer Systems, EuroSys ’10, 2010. [144] Richard Strong, Jayaram Mudigonda, Jeffrey C. Mogul, Nathan Binkert, and Dean Tullsen. Fast Switching of Threads Between Cores. SIGOPS Operating Systems Review, 43, April 2009. [145] Yifeng Sun and Tzi-cker Chiueh. SIDE: Isolated and Efficient Execution of Unmodified Device drivers. In Proceedings of the 2013 43rd Annual IEEE/I- FIP International Conference on Dependable Systems and Networks (DSN), 2013. [146] Swaminathan Sundararaman, Sriram Subramanian, Abhishek Rajimwale, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Michael M. Swift. Membrane: Operating System Support for Restartable File Systems. In Proceedings of the 8th USENIX Conference on File and Storage Technolo- gies, 2010. [147] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Re- liability of Commodity Operating Systems. In Proceedings of the Symposium on Operating Systems Principles, 2003. [148] Michael M. Swift, Muthukaruppan Annamalai, Brian N. Bershad, and Henry M. Levy. Recovering Device Drivers. ACM Transactions on Com- puter Systems, 24(4), November 2006. ISSN 0734-2071. [149] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling heterogeneous multi-cores through Performance Im- pact Estimation (PIE). In Proceedings of the 39th Annual International Sym- 144 CHAPTER 7. SUMMARY AND CONCLUSIONS

posium on Computer Architecture, ISCA ’12, 2012. [150] Kenzo Van Craeynest, Aamer Jaleel, Lieven Eeckhout, Paolo Narvaez, and Joel Emer. Scheduling Heterogeneous Multi-cores Through Performance Im- pact Estimation (PIE). In Proceedings of the 39th Annual International Sym- posium on Computer Architecture, ISCA ’12, 2012. [151] David Wentzlaff, Charles Gruenwald, III, Nathan Beckmann, Kevin Modzelewski, Adam Belay, Lamia Youseff, Jason Miller, and Anant Agar- wal. An Operating System for Multicore and Clouds: Mechanisms and Im- plementation. In Proceedings of the Symposium on , 2010. [152] Takeshi Yoshimura, Hiroshi Yamada, and Kenji Kono. Is Useful or Not? In Proceedings of the Eighth USENIX Conference on Hot Topics in System Dependability, 2012. [153] Xiao Zhang, Kai Shen, Sandhya Dwarkadas, and Rongrong Zhong. An Eval- uation of Per-chip Nonuniform Frequency Scaling on Multicores. In Proceed- ings of the USENIX Annual Technical Conference, 2010. [154] Xinghui Zhao and Nadeem Jamali. Fine-Grained Per-Core Frequency Scheduling for Power Efficient-Multicore Execution. In Proceedings of the International Green Computing Conference and Workshops, 2011. [155] Feng Zhou, Jeremy Condit, Zachary Anderson, Ilya Bagrak, Rob Ennals, Matthew Harren, George Necula, and Eric Brewer. SafeDrive: Safe and Re- coverable Extensions Using Language-based Techniques. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, 2006. Summary

long summary


Samenvatting

long summary dutch
