DEGREE PROJECT IN COMPUTER SCIENCE, SECOND LEVEL
LAUSANNE, SWITZERLAND 2015

Enhancing Quality of Service Metrics for High Fan-In Node.js Applications by Optimising the Network Stack

LEVERAGING IX: THE DATAPLANE OPERATING SYSTEM

FREDRIK PETER LILKAER

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)

Enhancing Quality of Service Metrics for High Fan-in Node.js Applications by Optimising the Network Stack - Leveraging IX: The Dataplane Operating System

FREDRIK PETER LILKAER

DD221X, Master's Thesis in Computer Science (30 ECTS credits)
Degree Programme in Computer Science and Engineering, 300 credits
Master Programme in Computer Science, 120 credits
Royal Institute of Technology, year 2015
Supervisor at EPFL was Edouard Bugnion
Supervisor at CSC was Carl-Henrik Ek
Examiner was Johan Håstad
Presented: 2015-10-01

Royal Institute of Technology School of Computer Science and Communication

KTH CSC SE-100 44 Stockholm, Sweden URL: www.kth.se/csc

Abstract

This thesis investigates the feasibility of porting Node.js, a JavaScript web application framework and server, to IX, a dataplane operating system specifically developed to meet the needs of high-performance, microsecond-computing applications in a datacentre setting. We show that the port requires extensions to the IX kernel to support polling of Unix Domain Sockets alongside network flows, which we implement. We develop a distributed load generator to benchmark the framework. The results show that running Node.js on IX improves throughput by up to 20.6%, latency by up to 5.23×, and tail latency by up to 5.68× compared to a Linux baseline. We show how server-side request-level reordering affects the latency distribution, predominantly in cases where the server is load saturated. Finally, due to various limitations of IX¹, we are unable at this time to recommend running Node.js on IX in a production environment, despite improved metrics in all test cases. However, the limitations are not fundamental, and could be resolved in future work.

Referat (Swedish abstract): Improving Quality of Service for Highly Loaded Node.js Web Applications through a More Efficient Operating System

This degree project investigates the possibility of using IX, a specialised dataplane operating system intended for high-performance datacentre applications, to run Node.js, a web application framework for JavaScript applications. Porting Node.js to IX requires extending IX with functionality for concurrent polling of Unix Domain Sockets and network flows, which we show and implement. Furthermore, a distributed load generator is developed to evaluate the application framework on IX against a baseline consisting of an unmodified Linux distribution. The results show that throughput improves by up to 20.6%, latency by up to 5.23× and tail latency by up to 5.68×. We then investigate whether the latency variance increases due to server-side request reordering, which appears to be the case under high server load, although other factors appear to have a larger impact at low load. Finally, even though all metrics improved at all observed measurement points, widespread adoption of IX for running Node.js applications cannot yet be recommended, chiefly due to problems with horizontal scaling and with acting as a frontend server in a classic tiered-datacentre architecture.

¹ Mainly lack of outgoing TCP connections and multi-process execution, respectively preventing Node.js from acting as a frontend in a multi-tiered architecture and from scaling horizontally within a single node.

Acknowledgments

Writing a thesis can be a long, and at times straining, task. I would therefore like to thank the people that helped me complete mine. First, I would like to thank the Data Center Systems laboratory at École Polytechnique Fédérale de Lausanne, EPFL, which allowed me to work with them for the duration of my thesis. In particular, I would like to thank my supervisor Edouard Bugnion, who offered invaluable advice every time I was stuck in my work. I would also like to thank Mia Primorac and George Prekas, whom I had the pleasure of working alongside, and who also withstood all my questions on IX. I would like to thank my supervisor at KTH, Carl-Henrik Ek, for offering good academic guidance and writing advice. Finally, I would like to thank all my friends in Lausanne for support and motivation during the semester. An extra thanks goes out to those of you that helped me proofread.

Contents

Glossary

1 Introduction
1.1 Problem Statement
1.2 Contribution

2 Background
2.1 Operating Systems
2.2 The IX Dataplane Operating System
2.2.1 Requirements and Motivations
2.2.2 What is a Dataplane Operating System?
2.2.3 Results
2.3 Web Servers
2.3.1 Apache, the Traditional Forking Web Server
2.3.2 Nginx - the Event Driven Web Server
2.3.3 Node.js
2.4 Queueing Theory

3 Software Foundation
3.1 The IX Dataplane Operating System
3.1.1 Architectural Overview
3.1.2 Dune Process Virtualisation
3.1.3 Execution Model
3.1.4 IX System Call API
3.1.5 IX Event Conditions
3.1.6 libix Userspace API
3.1.7 Limitations
3.2 Node.js
3.2.1 V8 Javascript Engine
3.2.2 libuv

4 Design
4.1 Design Overview
4.2 Limitations
4.3 Modifications of IX
4.3.1 Motivation for IX Kernel Extensions
4.3.2 Kernel Extension
4.3.3 libix
4.4 Modifications of Node.js
4.4.1 Modifications of libuv
4.4.2 Modifications of the V8 Javascript Engine

5 Evaluation
5.1 Results
5.1.1 Test Methodology
5.1.2 Performance Metrics
5.1.3 A Note on Poisson Distributed Arrival Rates
5.1.4 Load Scaling
5.1.5 Connection Scalability
5.2 Result Tracing
5.2.1 Throughput Increase
5.2.2 Reordering & Tail Latency

6 Discussion
6.1 Related Work
6.2 Lessons Learned
6.3 Future Work
6.4 Conclusion

Bibliography

A Resources
A.1 libuv - ix
A.2 Node.js

B dialog - high concurrency, rate controlled, Poisson distributed load generator
B.1 Purpose
B.2 Implementation
B.3 Evaluation
B.4 Resources

Glossary

API Application Programming Interface.

ASLR Address Space Layout Randomisation.

FIFO First-In, First-Out.

HTTP HyperText Transfer Protocol.

IPC Inter-Process Communication.

libOS library Operating System.

LIFO Last-In, First-Out.

NIC Network Interface Controller.

OS Operating System.

RPC Remote Procedure Call.

RSS Receive Side Scaling.

SIRO Service in Random Order.

SLA Service Level Agreement.

TCP Transmission Control Protocol.

TLB Translation Lookaside Buffer.

UDP User Datagram Protocol.

UDS Unix Domain Socket.

Chapter 1

Introduction

Almost everyone has probably heard of Moore's law in one form or another: that computers double in processing power approximately every 18 months¹. Consequently we should, by now, be free of performance problems, since our computers ought to be super fast given an exponential growth in processing power. And they are. The problem is just that we are constantly telling our computers to solve bigger and/or harder problems. Around the year 2004, it stopped being efficient to scale CPU performance vertically, that is, by increasing the clock frequency. As a result, we are now constructing software to make use of multi-core processors, and we are engineering large, complex, distributed systems to deal with the gigantic datasets that we like to call “big data”.

We find that it is important to bound the end-to-end latency, particularly in such systems. End-to-end latency is a key performance indicator and has a direct correlation with user experience and thus, for a commercial system, with both customer conversion and customer retention, in particular in a realtime/online system. In such distributed systems, computation is divided between multiple entities, which may be spread across a plethora of machines within a single datacentre, or across several. Therefore, one way to minimise the end-to-end latency and to control its distribution is to attempt to bound the latency of every participating component. The motivation is that latency, and variance in latency, is induced in every step of communication along the execution path.

Furthermore, in current computer cluster deployments, energy accounts for a significant portion of operational expenses. Consequently, if we can engineer systems that perform the required tasks more efficiently, they can run with fewer hardware resources and thus consume less energy. It is therefore still desirable to improve the efficiency of our systems, even when we have extremely powerful computational resources at our disposal.

In this work we explore a method to improve the performance of web servers based on the Node.js application framework, which may, or may not, be used in such a distributed setting as described in the first paragraph. The performance

¹ The number of transistors on a die doubles approximately every 18 months.

metrics/Quality of Service metrics we study are mainly latency and its distribution, as motivated in the second paragraph, and throughput. Throughput is the number of transactions per time unit, and correlates with the energy efficiency requirements described in the third paragraph.

The IX [1] dataplane operating system, a specialised operating system for enhanced network performance, is the result of a research collaboration between Stanford University and École Polytechnique Fédérale de Lausanne. It is designed to bridge the four-way trade-off between low latency, high throughput, strong protection and resource efficiency. Low latency and high throughput encourage the construction of scalable, maintainable and fault-tolerant micro-service oriented architectures. Improved resource efficiency in conjunction with strong protection reduces both capital and operational expenses, as it permits workload consolidation, and energy proportionality directly affects the operational expenses [2].

IX uses hardware virtualisation to provide strong protection between applications while retaining performance. Performance is further enhanced by techniques such as adaptive batching, run-to-completion, strict FIFO ordering and a native, zero-copy Application Programming Interface (API). The results show greatly enhanced throughput, as well as latency and tail latency reductions, compared to the standard Linux networking stack. Low latency and tail latency considerations are predominantly important in a setting where a frontend or mid-tier layer fans out requests to a large number of servers in a backend layer. As such, the performance of IX has primarily been assessed for microsecond-computing applications, such as the memcached key-value store [3], where throughput is increased by a factor of 3.6× and tail latency reduced by 2×. Since the publication of [1], the IX team has found that Linux performs poorly regarding fairness and Quality of Service, and IX also seems to handle connection scalability better. Therefore IX may also be more suitable than Linux for a high fan-in situation, such as the one faced by a web server.

Node.js [4] is a contemporary JavaScript web application framework that rose to popularity in recent years by providing a non-blocking, scalable I/O mechanism with a low learning curve. By leveraging non-blocking I/O, Node provides a single-threaded execution model based on an event loop [5]. By not dedicating a thread per connection, the system saves resources, which enables it to scale to a high number of concurrent clients. Furthermore, it became popular by unifying the server and client codebases under a single development language. The motivation is that this increases developer productivity and eases hiring, by letting companies combine backend and frontend teams into single units [6].

1.1 Problem Statement

The IX dataplane operating system improves upon Linux in throughput and latency by up to an order of magnitude [1], apart from providing better connection scalability.

Such optimisations could benefit a broad range of network-bound applications. Node.js is designed to improve connection scalability for I/O-bound applications, and could potentially benefit from an underlying operating system specifically engineered for that purpose. However, IX assumes a specialised processing model in order to implement its optimisations and is directly targeted at microsecond-computing applications. Thus, can a more general network-bound application, such as a web application framework, in particular Node.js, be effectively ported to IX to benefit from its advantages? If so, what are the benefits and limitations, and how is performance affected?

1.2 Contribution

We show that Node.js can, thanks to its event-driven design, be ported to IX. The results show that Node on IX brings performance enhancements in terms of throughput and latency, but most notably in terms of 99th-percentile latencies and of throughput under varying 99th-percentile Service Level Agreements (SLAs), versus a standard Linux installation. We investigate and partially account for the sources of the improvements. Namely, the throughput increase can primarily be traced to the improved efficiency of batched system calls. We show an increased rate of request reorderings on Linux, which could contribute to its increased 99th-percentile latency, due to a change in effective queueing discipline. However, we are unable to verify that this is the primary contributor to the increased tail latency on Linux.

While most functionality of Node.js can be supported directly on IX, there are a few shortcomings of IX that prevent us from supporting the full feature set. Namely, support for concurrent event notification on Unix Domain Sockets and network flows, outbound network connections, and multiprocess/multiple address space applications, such as Node running the cluster module, require modifications to the IX kernel. In this work we extend the IX kernel with an epoll-like interface to support concurrent polling of Unix Domain Sockets and network flows, but multiple address spaces and outbound connections are left for future work.

Additionally, the engineering objective has been to construct a port with the smallest possible changeset to the codebases of Node.js [7], libuv [8], V8 [9] and IX [1]. We accomplished the work with 946 changed or added lines in libuv, 1 in V8 and 422 in IX, of which 132 in libix and 290 in the IX kernel.

Finally, we developed Dialog², a closed-loop load generator for request/response type server applications such as web services. Dialog combines high connection concurrency with rate-controlled Poisson process load. The purpose is to enable load measurements at high connection counts, in order to measure the connection scalability of Node.js on IX compared to the Linux baseline.

² See appendix B.


Chapter 2

Background

This chapter provides the background required to understand Node.js and IX. More specifically, it explains what they do rather than how; the details and software architecture are left to chapter 3. We start by looking at the background of Operating Systems (OSs) in section 2.1, followed by an introduction to the IX dataplane operating system in section 2.2. We follow up with a brief history of web servers in section 2.3, and conclude with a succinct queueing theory primer in section 2.4, as web servers essentially are queueing systems. As we will see, queueing theory has an impact on the performance metrics we study in chapter 5.

2.1 Operating Systems

The main purpose of an OS is to abstract the details of the underlying hardware and to multiplex the access to various resources between applications. It provides the application programmer with a clean interface that abstracts away the peculiarities of the underlying hardware [10]. In general, Operating Systems consist of a kernel that provides the core functionality, such as multiplexing of CPU and memory resources. On top of the kernel, each operating system typically comes with a set of user space libraries that enable applications to request services from the OS. Such libraries may implement OS functionality in user space, or may perform system calls that transfer control to kernel space. Most mainstream operating systems provide applications with facilities for process scheduling, Inter-Process Communication (IPC), memory management, a file system and I/O, such as a networking stack. Furthermore, they often include high-level libraries designed to support the development of user space applications; such libraries may include sound players or, especially, GUI windowing toolkits that permit application developers to create applications with a unified look and feel. The literature [10] classifies operating systems into three main types: the monolithic kernel, the microkernel and the exokernel. Of the three, the monolithic kernel is by far the most commonly used for commodity operating systems.

Windows, Linux, and Unix systems such as the BSD variants, Mac OSX and Solaris are all built on a monolithic architecture. Monolithic means that the kernel is a single large program with no internal information hiding; all procedures can basically call all available procedures [10]. Not having to do any context switches while performing cross-module tasks in the kernel improves performance and is, apart from the simplified engineering task compared to other designs, a reason that many commodity operating systems have chosen this design. As a bug in kernel code can bring down the entire system, the idea that as much functionality as possible should be put outside the kernel naturally comes to mind. This idea gives birth to the microkernel, which improves system reliability by separating system functionality into different modules, isolated as different user space processes. Most notably, device drivers run outside the kernel, so that a bug in e.g. a video driver can only crash the driver itself, and not the entire kernel. In a monolithic kernel there is no protection between modules, so a bug in one module, or a rogue module, can easily corrupt the data of any other module and thus bring the entire kernel down. Note that microkernels have historically received criticism for being inefficient, due to cross-module calls causing context switches. Finally, the Exokernel [11] makes an end-to-end argument that operating systems provide inefficient abstractions, and that applications know better which abstractions they need. This led the authors to a minimalistic kernel which exports the concept of secure bindings: secure allocations of hardware resources that allow efficient multiplexing of resources across applications. The secure bindings use physical names to remove a layer of indirection, and the exokernel furthermore exposes allocation and revocation to allow deeper optimisations of “client applications”. The exokernel architecture allows for the construction of library Operating Systems (libOSs), operating systems that run in user space, linked with the application. The concept allows a different libOS to be used for each application, tailored for its specific needs, exporting just the abstractions the application needs, in the most efficient manner. Additionally, since libOSs are untrusted, they allow faster innovation of operating system software, as bugs are not nearly as fatal as in a monolithic kernel; they can only bring down the application and not the entire system.

2.2 The IX Dataplane Operating System

IX is a specialised operating system designed for the aggressive networking requirements posed mainly by datacentre applications. It runs as a virtualised process with protected access to hardware inside an environment called Dune¹ [12]. Dune provides a (Linux) process abstraction with access to privileged hardware instructions through virtualisation hardware. Since IX is a Dune-extended Linux process, it does not need to implement everything an operating system needs to provide a process, such as a file system, device drivers or process multiplexing. IX implements a specialised processing model and its own optimised datapath for network I/O.

¹ Dune is further described in section 3.1.2.

System calls not directly supported by IX can thus be supported by simply passing them through to the underlying Linux kernel. Therefore, IX can be seen as a library operating system specifically designed for datacentre application needs, eschewing inefficient Linux abstractions whilst keeping the acceptable ones. In the remainder of this section, we look at the motivations behind IX (section 2.2.1), what a dataplane operating system is (section 2.2.2) and the results that it achieves (section 2.2.3).

2.2.1 Requirements and Motivations

The purpose of IX is to deal with the increasingly specific demands that large-scale datacentre applications put on infrastructure and underlying software layers. Specifically, microsecond tail latencies are required to allow the construction of distributed applications with predictable latencies composed of a large number of participating nodes [13]. Dean and Barroso showed that the tail latencies of individual components are amplified by scale: if one request out of 100 is slow on a single server, and a request requires answers from 100 servers in parallel, then 63% of such distributed queries will, in fact, be slow. Therefore it is imperative for large-scale datacenter applications to control the latency distribution and limit the 99th-percentile latency of their components. Modern datacenter applications also require high packet rates to be able to sustain throughput, since packet sizes often are small [14, 1]. Furthermore, the practice of co-locating applications induces the need to isolate applications from each other for security reasons, and also demands resource efficiency, so that server resources can be shared and reallocated amongst co-located applications [15, 16] with varying resource demands.

Commodity operating systems² were designed during an era with notably different hardware characteristics than what is readily available in datacentres today. Processors used to sport a single processing core, multiplexing applications through timesharing, and in networking, packet inter-arrival times used to be much longer than interrupt and system call latencies [1]. With 10Gb Ethernet, packet inter-arrival times are reaching nanoseconds, as the inter-arrival time of minimum-sized packets at 10GbE is 67ns³. 67ns is well below the time scales of interrupts and system calls, which therefore suddenly become significant sources of latency and diminished throughput in high-performance datacentre applications. Furthermore, as a single cache miss served by DRAM may occupy 100ns, the even lower inter-arrival times also encourage data-oriented design for such applications. It is possible to argue that, with the advent of multicore processing, some applications no longer need the type of resource scheduling provided by legacy operating systems.
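The 63% figure follows directly from assuming independent servers: if each server answers fast with probability 0.99, a query fanned out to 100 servers is fast only if all of them are, so

\[
P(\text{slow}) \;=\; 1 - 0.99^{100} \;\approx\; 1 - 0.366 \;\approx\; 63\%.
\]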

² Readily available and in-production operating systems such as Linux, Windows Server, FreeBSD and the other various Unix flavours.
³ (64 + 512 + 96 bits) / (10 × 10⁹ bits/sec) = 67.2ns: preamble, start-of-frame delimiter, minimum-sized Layer 2 Ethernet frame and interpacket gap added together and divided by the bit rate.

We are therefore free to revisit operating system design in order to improve both throughput and latency for datacenter applications, by no longer trading them for fine-grained resource scheduling.

User space networking stacks could remove some of the overhead that kernel crossings impose on system calls, but they do not necessarily solve the trade-off between low latency and high throughput [1]. Moreover, they do not offer protection between the application and the networking stack, which could lead to corruption of the network stack due to application-level bugs. More critically, such corruption could enable a malicious user to exploit the network stack in ways normally reserved for users with root access to the system, such as transmitting raw packets or enabling promiscuous mode [17]. Belay, Prekas, Klimovic, et al. [1] argue that the improvements gained by removing kernel crossings are marginal compared to amortizing their cost over multiple system calls by batching, as proposed by Soares and Stumm [18].

2.2.2 What is a Dataplane Operating System?

IX is a Dataplane Operating System, which implicitly tells us that it distinguishes between the dataplane and the control plane. Along with other contemporary operating systems such as Arrakis [19], it borrows the nomenclature from the networking community, where the separation between dataplane and control plane is widespread. In networking, switches typically operate in two planes: the dataplane and the control plane. The dataplane is responsible for packet forwarding along the forwarding path, typically implemented in hardware, performing fast lookups in the forwarding tables. The control plane, on the other hand, is responsible for configuring the dataplane(s); in the case of a switch, for setting up the forwarding table by means of a control plane routing protocol such as BGP [20]. Likewise, IX separates the areas of responsibility, improving efficiency by removing the control plane from the data path. The control plane performs coarse-grained resource allocation, such as allocation of dedicated CPU cores and network queues. The dataplane(s) are responsible for everything on the datapath, from packet processing to application logic. In IX, the Linux kernel acts as control plane through the Dune kernel module. By eliminating the Linux kernel from the datapath and replacing it with a specialised, optimised datapath, IX can improve upon the throughput achieved by Linux by up to an order of magnitude [1].

2.2.3 Results

IX improves the throughput of sustained connections by up to 1.9× over mTCP and 8.8× over Linux for 64-byte packets [1, p. 58]. For memcached [3], throughput under the SLA 500µs @ 99th percentile is increased by 3.6×, whilst the unloaded tail latency is reduced by 2×. For further descriptions, evaluation, and results of IX, please refer to [1].


2.3 Web Servers

A web server is a piece of software, running on a machine connected to a network, capable of serving resources over the HTTP [21] protocol [22]. In some literature, the term may refer to the physical hardware server running such software, or to the combination of such a dedicated hardware server and the web server software. In this work, “web server” refers to the web server software. Web servers are by tradition divided into static and dynamic servers, where the classification indicates the type of content the server may serve. Static web servers merely serve static content, such as files stored on disk. Dynamic web servers either perform some processing or generate the full content on a per-request basis. Such servers may run arbitrarily complex server programs, but typically run an application program that performs the application's business logic, stores data in an underlying database and responds to requests with customised, dynamic web pages.

2.3.1 Apache, the Traditional Forking Web Server

Apache HTTP server [23], the web server whose name the Apache foundation has become synonymous with, started in 1995, has been by far the most deployed web server in the past, and still holds a majority share of web server deployments as of July 2015 [24]. Apache used a fork-and-execute model for its version 1 deployments, spawning a number of processes, each handling a single request at a time. Most modern-day deployments use the Apache MPM worker module [25], which is a hybrid multi-process/multi-threaded concurrency module. The server spawns multiple processes, each running multiple threads. Each thread serves a single request at a time, but is held ready in a thread pool whilst idle.

2.3.2 Nginx - the Event Driven Web Server

Nginx [26], launched in 2004, is an asynchronous, event-driven web server specifically engineered to have a small resource footprint and to solve the C10K [27] scalability problem. The C10K problem is Kegel's challenge that, with the hardware of that time⁴, web servers should be able to handle 10 thousand concurrent connections. Event-driven programming is a programming paradigm where the execution of a program is driven by reactions to events. Such events include user input, network data or sensory input. Most commonly the model is implemented by a main loop polling different event sources. Upon receiving an event from a source, the event loop calls the preregistered callback function for the triggered event. For a web server, the event-driven architecture means that requests are split up into smaller chunks of work and that I/O operations are performed by asynchronous system calls.

⁴ 1 GHz CPU, 2GB RAM and 1GbE [27].

The processing model lets a single thread of execution handle more than one connection; therefore fewer resources are dedicated per connection and the system can scale to a higher number of concurrent connections [28].
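To make the paradigm concrete, the following minimal sketch (our illustration, not Nginx code; the placeholder socket descriptor and the callbacks are hypothetical) shows the essence of such a main loop: multiplex event sources with poll(2) and dispatch to preregistered callbacks.

#include <poll.h>
#include <unistd.h>

typedef void (*event_cb)(int fd);

static void on_stdin(int fd)
{
    char buf[256];
    read(fd, buf, sizeof(buf));   /* consume and handle user input */
}

static void on_socket(int fd)
{
    (void)fd;                     /* handle network data; omitted */
}

int main(void)
{
    int sock_fd = -1;  /* placeholder: a real server registers its sockets here */

    /* each event source is registered up front together with its callback */
    struct pollfd fds[2] = {
        { .fd = 0,       .events = POLLIN },  /* stdin */
        { .fd = sock_fd, .events = POLLIN },  /* negative fds are ignored by poll */
    };
    event_cb cbs[2] = { on_stdin, on_socket };

    for (;;) {                         /* the event loop */
        poll(fds, 2, -1);              /* block until some source is ready */
        for (int i = 0; i < 2; i++)
            if (fds[i].revents & POLLIN)
                cbs[i](fds[i].fd);     /* dispatch to the registered callback */
    }
}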

2.3.3 Node.js

Node.js takes the event-driven web server concept of Nginx and combines it with the V8 Javascript Engine [9] to create an event-driven application server for applications written in Javascript. Node.js leverages the V8 engine to provide a platform for fast execution of JavaScript. JavaScript was designed for writing callback-driven programs with asynchronous execution, as employed in web frontend UI applications, and developers are already used to this style; it is therefore a suitable language for event-driven web applications. Node.js was created by Ryan Dahl in 2009 to ease the implementation of real-time web applications. The combination of adequate websocket support and high connection scalability allows a Node.js application to hold a high number of concurrent connections with web clients open simultaneously, which facilitates the creation of real-time web applications. Note that Node.js, like other event-driven web server architectures, solves the I/O scalability problem [5], not the computation scalability problem. If a workload is CPU-bound, performance might decrease by running it on an event-driven architecture: fast, short-running requests might be queued up behind a long-running CPU-intensive task, whereas on a threaded architecture, the long-running task would be preempted and the fast tasks would complete before it. Node.js could still be used as a part of such compute-intensive applications, but since the event loop must not be blocked, it would write the request data to a computation backend through some form of IPC. Among the users of Node.js we find renowned companies such as Paypal and LinkedIn. LinkedIn reduced the number of servers from 30 down to 3, while still having headroom to handle ten times their current amount of traffic [6]. Moreover, they claim to have improved “speed” by a factor of 20 by moving away from their previous Rails-based solution to Node.js [6]. However, care must be taken with this claim, as LinkedIn, for political reasons, used a proxying architecture that blocked the entire process while performing a cross-datacenter request for each and every request [29]. The moral of the story is thus that if the application is spending a lot of time waiting for I/O, then efficiency can be improved by employing asynchronous, non-blocking I/O.

2.4 Queueing Theory

Queueing theory is a branch of statistical mathematics that models the dynamics of queues in service systems. Briefly, customers, or clients, arrive at a service point with an arrival rate λ, may be forced to wait in line (or might leave the system), eventually get serviced and then leave the system.


The Kendall notation is generally used to describe a queueing system, as follows (in its most basic form):

A/B/c

where A is the arrival process, B the service time distribution and c the number of service stations. An arrival process is always a point process, a process in which the arrivals are points, isolated in time. The arrival process is typically assigned to one of four categories:

M indicates a Markovian, or memoryless arrival process. For queueing systems this implies the utilisation of a Poisson process.

D indicates a deterministic arrival rate.

GI abbreviates “General Independent”: a general process with the requirement that inter-arrival times are independent and identically distributed.

G designates a general process, any arbitrary point process.

A Poisson process is usually assumed for the arrival process, as it is in many applications a reasonable model of reality whilst still offering a simple mathematical model. Furthermore, as we often study the queue over short time horizons, the process is often assumed to be homogeneous, that is, having a constant expected rate. The service time is generally described as belonging to one of the following three classes:

M indicates a Markovian, or memoryless service distribution, which leads us to exponentially distributed service times.

G designates a general distribution.

D indicates a deterministic service time distribution, which means that the service time is constant.

The number of service stations affects the performance of the service if the service stations share a queue. If we have four service stations, each with its own queue, and a total arrival rate λ, then for a Markovian arrival process we will in fact observe four M/G/1 systems, each with arrival rate λᵢ = λ/4, instead of one M/G/4 system. Finally, Node is single-threaded, and even in the case of multiple service stations through the cluster module, client affinity in real-time websocket systems will render a (number of processes) × (M/M/1) system anyhow. Therefore we will not delve any deeper into the topic of multiple service stations in this thesis.
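For reference, the standard M/M/1 formulas make the effect of such splitting concrete. With arrival rate λ, service rate µ and utilisation ρ = λ/µ < 1, the mean sojourn time T and the mean number of requests L in the system are

\[
T = \frac{1}{\mu - \lambda}, \qquad L = \lambda T = \frac{\rho}{1 - \rho} \quad \text{(Little's law)},
\]

so each of the four split queues above behaves as an M/M/1 system with arrival rate λ/4 and mean sojourn time 1/(µ − λ/4).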

The queueing discipline describes the rules for how the next client to be serviced is chosen from the queue when a service station is ready to service a new request. The most common discipline to assume is First-In, First-Out (FIFO), the mathematically simplest model: a request comes in and is placed at the back of the queue, and requests to be serviced are always taken from the front of the queue. A queueing system may also utilise disciplines such as priority queueing or random order. Li, Sharma, Ports, et al. demonstrate that the FIFO discipline is optimal from a tail latency perspective [30]. The motivation is simple: it minimises queueing time variation, and queueing time variation increases tail latency as longer queueing times become more likely. For each given queue length at the time of arrival, a request has an expected queueing time equal to the expected service time times the length of the queue. If the service time is deterministic, the queueing time is even certain, given the queue length. For any other queueing discipline, including Last-In, First-Out (LIFO), priority queueing and Service in Random Order (SIRO), the queueing time variance increases. Even if the service time is deterministic, the queueing time is not bounded for these disciplines. Under LIFO, the queueing time is determined by the arrival process even for a given queue length at arrival: if an additional request arrives while the request being served at our arrival is still in service, the new request will be processed before ours, increasing the variance in queueing time. Under random order, a request may end up staying in the queue for as long as there is at least one other request in the queue; such behaviour also increases variance.
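A classical way to state this argument (a formulation we adapt from standard M/G/1 theory; see e.g. Kingman's work on queue disciplines) is that all non-preemptive, work-conserving disciplines that ignore service times share the same mean waiting time but differ in variance:

\[
E[W_{\mathrm{FIFO}}] = E[W_{\mathrm{SIRO}}] = E[W_{\mathrm{LIFO}}],
\qquad
\mathrm{Var}(W_{\mathrm{FIFO}}) \le \mathrm{Var}(W_{\mathrm{SIRO}}) \le \mathrm{Var}(W_{\mathrm{LIFO}}).
\]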

Chapter 3

Software Foundation

This chapter explains the inner workings of IX, including Dune, and of Node.js, including V8 and libuv. It extends chapter 2 by providing detail on how the systems achieve their functionality.

3.1 The IX Dataplane Operating System

This section is organised as follows: section 3.1.1 explains the overall software architecture of IX, and section 3.1.2 explains how Dune process virtualisation works, which eases the understanding of the IX dataplane (section 3.1.3), since the IX dataplanes are Dune threads¹.

3.1.1 Architectural Overview

IX is divided into a control plane, responsible for resource control and allocation, and dataplanes, responsible for network I/O and application logic, as illustrated in fig. 3.1a. The control plane initialises network interfaces and provides an interface for dataplanes to request allocations of cores, network queues and memory. It consists of the full Linux kernel and a user-level program that implements resource allocation policies. The Linux kernel runs in VMX root, ring 0², leveraging Dune (section 3.1.2) to provide the capability of exercising control over the dataplane without interfering in its normal mode of operation. As in the Exokernel [11], the dataplane(s) can be seen as application-specific operating system(s): each runs in VMX non-root, ring 0 and provides a single address space specific to each application running on IX. There are two fundamental thread types for applications running on IX: elastic threads and background threads. Elastic threads interact with the IX dataplane and commute between dataplane operation and user application, whereas background threads do not interact with the IX dataplane. Both thread types can issue arbitrary POSIX system calls; however, elastic threads are assumed not to perform any long-running actions, as that may result in dropped packets.

¹ Dune describes itself as enabling processes to enter Dune mode, while in fact it allows threads to enter Dune mode [12].
² Used for hypervisors in virtualised systems.



(a) Protection and separation of control and dataplane. (b) Interleaving of protocol processing and application execution.

Figure 3.1: The IX dataplane operating system. Reprinted from A. Belay, G. Prekas, A. Klimovic, et al., “IX: a protected dataplane operating system for high throughput and low latency”, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 49–65.

3.1.2 Dune Process Virtualisation

Dune uses VT-x virtualisation hardware to expose a process, rather than a machine abstraction, with access to privileged hardware features in a safe manner [12]. By privileged hardware features we mean functionality previously available only to kernel-level code, such as control over page tables, TLBs, ring protection (CPU privilege modes) and access to NIC hardware queues. Dune works by extending the Linux kernel with the Dune kernel module, which puts the kernel into VMX root and provides the facility for a process to enter Dune mode, transferring it from VMX root, ring 3 to VMX non-root, ring 0 and allowing it hardware access through the underlying virtualisation support. Dune includes the implementation of a sandbox. The Dune sandbox application leverages the privilege modes exposed by Dune to constrain untrusted 64-bit Linux applications in ring 3, whilst the trusted sandbox module itself runs in ring 0. The Dune system allows us to view the Linux kernel as a form of optional exokernel. Since it exposes privileged hardware features, we are free to implement our own abstractions directly on top of the hardware interface if we are not content with the Linux abstractions. By providing access through virtualisation hardware, Dune can multiplex access to the hardware features in a safe way, similar to the secure bindings of the Exokernel [11]. Along with the sandbox module, Dune simplifies the

creation and usage of libOSs. It allows us to write libOSs that override and replace inefficient abstractions of the current platform, while retaining the ability to make downcalls to the underlying host where its abstractions are deemed suitable. At the same time, we are able to run completely unmodified applications using the standard abstraction set concurrently on the machine, which may significantly ease the adoption of an Exokernel-inspired application architecture in existing infrastructure. Finally, the privilege modes that Dune exposes to processes allow the construction of libOSs that are protected against application-level bugs by hardware protection, a feature that is not provided in the original Exokernel design.

3.1.3 Execution Model

We present the IX dataplane execution model by first introducing the execution model inside an IX elastic thread, in the following subsection, Intra-dataplane. In the subsequent subsection, Inter-dataplane, we look at the bigger picture: how performance is enhanced by a synchronisation-free execution model.

Intra-dataplane

IX assumes an event-driven application, where events can only be generated from the network interface. The application can, from elastic threads, synchronously poll the kernel for events. Upon network activity (1, fig. 3.1b) the kernel will process the packets (2) and notify the application by writing event conditions into an array that is mapped read-only into userspace, and then return from the system call (3). At this point, the userspace library libix processes the returned event conditions, calling the associated callbacks to notify the application of the events that occurred. The application may respond by issuing further system calls, whose arguments are written into the batched system call vector, to be issued when the application next polls the kernel for new events (4-6). IX employs a run-to-completion model with (bounded) adaptive batching. Run-to-completion means that tasks are run until they finish, which reduces latency incurred by scheduling and improves throughput and latency through data cache locality, since consecutive processing stages often access the same data [1]. Batching reduces the overhead of system call transitions and also improves instruction cache locality, since the same instruction sequences are reused for multiple packets; both lead to a higher packet rate. Furthermore, batching is adaptive, so that it is only used upon congestion, minimising the effect on latency in non-congested cases. Upon congestion, the efficiencies of batching can improve latency by reducing head-of-line blocking [1]. Bounding the batch size bounds the latency imposed by batching and effectively avoids exceeding the capacity of the data cache.

Inter-dataplane

Due to the coarse-grained allocation policy of IX, dataplanes are allocated entire CPU cores and NIC queues. Receive Side Scaling (RSS) is used to hash incoming flows to a consistent NIC queue.


sys_bpoll(struct bsys_desc __user *d, unsigned int nr): performs I/O processing and issues a batch of system calls.
sys_bcall(struct bsys_desc __user *d, unsigned int nr): issues a batch of system calls.
sys_baddr(void): gets the address of the batched syscall array.
sys_mmap(void *addr, int nr, int size, int perm): maps pages of memory into userspace.
sys_unmap(void *addr, int nr, int size): unmaps pages of memory from userspace.
sys_spawnmode(bool spawn_cores): sets the spawn mode, i.e. whether clone spawns elastic or background threads.
sys_nrcpus(void): returns the number of active CPUs.

(a) Exception-driven system calls

bsys_tcp_connect: opens a connection.
bsys_tcp_accept(hid_t handle, unsigned long cookie): accepts a connection.
bsys_tcp_reject(hid_t handle): rejects a connection.
bsys_tcp_send(hid_t handle, void *addr, size_t len): transmits an array of data.
bsys_tcp_sendv(hid_t handle, struct sg_entry __user *ents, unsigned int nrents): transmits a scatter-gather array of data.
bsys_tcp_recv_done(hid_t handle, size_t len): advances the receive window and frees memory buffers.
bsys_tcp_close(hid_t handle): closes or rejects a connection.

(b) Batched system calls

Table 3.1: IX system calls

This design, in conjunction with the omission of a POSIX-style socket API (a shared flow namespace between threads of execution), removes the need for synchronisation between dataplanes. Such a design scales well horizontally for servers with an increasing number of CPU cores. The dataplanes still share the memory namespace, which can be used to exchange messages and perform application-level synchronisation. IX itself exemplifies this possibility by implementing an in-kernel Remote Procedure Call (RPC) mechanism to synchronise execution of functionality on a foreign elastic thread.

3.1.4 IX System Call API

The IX system call API is divided into two sets: standard, exception-driven system calls (table 3.1a) and batched system calls (table 3.1b). The standard system calls function like ordinary exception-driven system calls. They provide IX-specific functionality, such as sys_bpoll and sys_baddr, or IX-required overloads of standard Linux system calls, such as mmap. These system calls are mainly meant to be used in the startup phase of the application, not in its hot path. For performance-critical paths the application should use the batching API, which amortizes kernel transition costs over multiple system calls. Therefore, network communication system calls are only available as batched system calls. The exception-driven system calls return results as normal. The batched system calls write their respective results into the system call vector sent to the kernel, and require the application to examine the results after the batch has been processed. If the application uses the event-based API, such processing is provided; otherwise the application is required to implement it.


3.1.5 IX Event Conditions

Apart from issuing a set of batched system calls to the kernel, the sys_bpoll call also polls the kernel for events. Such events include, but are not limited to, incoming packets, sent buffers, and accepted and dropped connections. The full list, along with explanations of the event conditions, can be seen in table 3.2.

3.1.6 libix Userspace API

The userspace library libix offers two APIs for applications to utilise IX: a plain API that simply mirrors the system call enumeration, and an event-based API that may be easier for an application programmer to use.

Plain API

The plain API mirrors the system calls described in table 3.1. The userspace library gives an application the ability to invoke the IX system calls by implementing the userspace side of the system call mechanisms.

Event API

The event-based API is modeled after the libevent [31] API, as Memcached [3] uses libevent and the API was developed when porting Memcached to IX. It builds on the plain API, which merely exports the system calls as functions, and augments it with a copying API, a flow abstraction with individual registration of event handlers, and system call return handling. Libix introduces the ixev_ctx struct that abstracts a network flow. It is a bidirectional flow handle that enables reading and writing. Furthermore, it allows the user to bind an event handler on a per-flow basis rather than through a global multiplexing handler. The ixev_wait function polls the IX kernel using sys_bpoll, handles return values from any issued system calls, and handles generated events by calling user-registered callback functions. ixev_recv and ixev_send provide I/O with copy semantics, which accelerates the implementation of some applications by eliminating the need to reference-count I/O buffers. Some software assumes that the buffer received by the read call needs to be deallocated, which poses an incompatibility with zero-copy APIs.

connected (cookie, outcome): a locally initiated connection was successfully established.
knock (handle, src IP, src port): a remotely initiated connection is requested.
recv (cookie, mbuf ptr, mbuf len): a message buffer was received.
sent (cookie, bytes sent, window size): a number of bytes was sent and/or the window size was changed.
dead (cookie, reason): a connection died; it was concluded or expired.

Table 3.2: IX Event Conditions


ssize_t ixev_recv(struct ixev_ctx *ctx, void *addr, size_t len); reads data with copying.
void *ixev_recv_zc(struct ixev_ctx *ctx, size_t len); reads an exact amount of data without copying.
ssize_t ixev_send(struct ixev_ctx *ctx, void *addr, size_t len); sends data using copying.
ssize_t ixev_send_zc(struct ixev_ctx *ctx, void *addr, size_t len); sends data using zero-copy.
void ixev_add_sent_cb(struct ixev_ctx *ctx, struct ixev_ref *ref); registers a callback for when all current sends complete.
void ixev_close(struct ixev_ctx *ctx); closes a context.
void ixev_dial(struct ixev_ctx *ctx, struct ip_tuple *id); opens a connection.
void ixev_ctx_init(struct ixev_ctx *ctx); prepares a context for use.
void ixev_wait(void); waits for new events.
void ixev_set_handler(struct ixev_ctx *ctx, unsigned int mask, ixev_handler_t handler); sets the event handler and which events trigger it.
int ixev_init_thread(void); thread-local initializer.
int ixev_init(struct ixev_conn_ops *ops); global initializer.

Table 3.3: libix event API functions

ixev_recv_zc and ixev_send_zc export the zero-copy API. They provide higher performance than their copying counterparts, but may prove harder to integrate in an application. ixev_add_sent_cb provides a facility to register a callback for when all outstanding transfers have been sent. The functionality is useful for attaching a callback that deallocates memory after a zero-copy transfer has completed.
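To illustrate how these pieces fit together, below is a minimal echo-server sketch against the API in table 3.3. It is a sketch only: the handler signature, the IXEVIN event mask and the accept/release members of struct ixev_conn_ops are assumptions based on our reading of the libix sources and should be checked against the actual headers.

#include <stdlib.h>
#include <ixev.h>   /* assumed libix header name */

/* echo back whatever arrives on the flow */
static void echo_handler(struct ixev_ctx *ctx, unsigned int reason)
{
    char buf[4096];
    ssize_t n = ixev_recv(ctx, buf, sizeof(buf));  /* copying receive */
    if (n > 0)
        ixev_send(ctx, buf, (size_t) n);           /* copying send */
}

/* called on a 'knock' event condition: allocate a per-flow context */
static struct ixev_ctx *echo_accept(struct ip_tuple *id)
{
    struct ixev_ctx *ctx = malloc(sizeof(*ctx));
    ixev_ctx_init(ctx);
    ixev_set_handler(ctx, IXEVIN, &echo_handler);  /* trigger on inbound data */
    return ctx;
}

/* called on a 'dead' event condition: the flow is gone */
static void echo_release(struct ixev_ctx *ctx)
{
    free(ctx);
}

int main(void)
{
    struct ixev_conn_ops ops = { .accept = echo_accept, .release = echo_release };
    ixev_init(&ops);
    ixev_init_thread();
    for (;;)
        ixev_wait();  /* sys_bpoll: flush batched syscalls, dispatch event conditions */
}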

3.1.7 Limitations

Currently we can observe a range of limitations of the IX platform. IX does not currently support outgoing TCP connections, nor any form of UDP communication, due to how the allocation of cores and NIC queues is performed for flows. When IX is launched, it occupies the entire NIC, which prevents multiple IX instances from running in parallel. This behaviour also prevents the machine running IX from performing any DNS lookups unless an extra NIC is present: since IX occupies the entire NIC, Linux cannot perform such lookups, and since IX does not support outgoing connections, IX cannot issue them either. Furthermore, the listening port of IX is hardcoded in the kernel source code. Finally, IX currently supports only the Intel x520 and 82599ES NICs.

3.2 Node.js

Node.js is an event-driven JavaScript application server. It consists mainly of libuv [8] (see section 3.2.2 and fig. 3.2b) for core functionality such as the event loop, I/O and timers, and Google's V8 JavaScript engine (section 3.2.1), which supplies swift JavaScript execution. Node, as seen in fig. 3.2a, consists of a set of core JavaScript libraries and a small C kernel that glues libuv together with V8, so that the libuv functionality can be used from JavaScript via V8. The remaining parts of Node.js are, as mentioned, JavaScript libraries implementing functionality such as HyperText Transfer Protocol (HTTP) parsing, to export the Node.js API [32]. Since such libraries do not directly issue system calls, but do so through V8 and libuv, they are OS independent and are therefore omitted from the scope of this thesis.


(a) Overview of Node. (b) libuv architecture. Reprinted from http://docs.libuv.org/en/v1.x/design.html.

Figure 3.2: Node.js Application Structure.

3.2.1 V8 Javascript Engine

V8 [9] is a high-performance JavaScript engine developed by Google, primarily for its Chrome web browser. It has been open-sourced as a separate project and can be run either standalone or embedded in any C++ project.

3.2.2 libuv

Libuv [8] is a multi-platform support library that abstracts common operating system functionality, such as network I/O, file system operations and multi-threading, over the supported OSs, including but not limited to Linux, Windows and Mac OSX. The library mainly focuses on asynchronous I/O and aims to provide a platform for building highly scalable event-driven applications. This subsection mainly covers the workings of the core event loop (section 3.2.2) and the stream API (section 3.2.2), as those require modification to port libuv to IX. Due to IX's nature as an OS built on top of Linux, the file system, thread pool and other miscellaneous functionality can be left unmodified.

Event Loop

The event loop is the core of the libuv library. Libuv provides the abstraction of an event loop, which means that an application can use many event loops, but each libuv data structure or handle must belong to one and only one event loop. The libuv operations are reentrant but not thread safe, meaning that operations can be performed concurrently on objects residing in different event loops, but cross-thread/event-loop operations must not be performed without careful synchronisation. Naturally, there can only be as many event loops running concurrently as there are threads running concurrently. The event loop can be run either for a single iteration, or for as long as events can still be generated, by calling the int uv_run(uv_loop_t* loop, uv_run_mode mode); function (a minimal usage sketch follows the list below). Each event loop iteration performs the following actions, in the following order [33]:


1. Update the loop time; libuv caches the time once per iteration to minimise the number of time-related system calls.

2. Activation check. The loop will only iterate if it is “alive”. A loop is alive if it has active and referenced handles, active requests or closing handles.

3. Runs timers that are scheduled to run before the loop time established in (1).

4. Pending callbacks are called, for example if an I/O callback for some reason has been deferred to the next loop iteration.

5. “Idle handle callbacks” are run. Idle handles are handles whose callbacks are run on every loop iteration.

6. “Prepare handle callbacks” are run.

7. Calculate the loop timeout: 0 if the loop was run as UV_RUN_NOWAIT, if there are idle handles, if there are no active handles, etc.; for the full list, please see [33]. Otherwise the timeout assumes the value of the next timer timeout, or infinity if there is no active timer.

8. BLOCKS FOR I/O, up to the timeout calculated in step 7. The I/O polling uses different polling mechanisms depending on the platform: e.g. epoll is used on Linux, kqueue on OpenBSD and Mac OSX, and IOCP on Windows.

9. “Check handle callbacks” are run.

10. “Close callbacks” are called for handles that were closed with uv_close().

11. If the loop was run as UV_RUN_ONCE, forward progress is guaranteed by the library; thus, if no I/O callback fired, the library will re-check for due timers.

12. If the loop was invoked with UV_RUN_DEFAULT run mode, then go to (1), otherwise return.
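As a minimal usage sketch against the libuv 1.x API, the following program arms a one-shot timer and drives the loop until no live handles remain (step 2 then terminates the loop):

#include <stdio.h>
#include <uv.h>

static void on_timer(uv_timer_t *handle)
{
    printf("timer fired\n");                 /* step 3: due timers run */
    uv_close((uv_handle_t *) handle, NULL);  /* no live handles remain: loop dies */
}

int main(void)
{
    uv_loop_t *loop = uv_default_loop();
    uv_timer_t timer;

    uv_timer_init(loop, &timer);
    uv_timer_start(&timer, on_timer, 100 /* ms */, 0 /* no repeat */);

    uv_run(loop, UV_RUN_DEFAULT);  /* iterates steps 1-12 until the loop is no longer alive */
    uv_loop_close(loop);
    return 0;
}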

Network and UDS sockets

TCP network flows and Unix Domain Sockets are exposed as stream abstractions following an asynchronous API. Common to these stream types is that they are implemented using asynchronous system calls and polled using the (platform-dependent) scalable event notification mechanism used in step 8 in section 3.2.2. Since this is the main API that must be reimplemented using the IX API rather than the Linux/POSIX API, it is presented in detail. The stream API includes the following data types:

• uv_stream_t, a stream handle. “Subtypes” follow:

– uv_tcp_t: A TCP handle that is used to represent TCP streams and servers.


– uv_pipe_t: A UDS handle on Unix systems and a named pipe handle on Windows.
– uv_tty_t: A handle for a stream to a console.

• uv_connect_t: A connect request.

• uv_shutdown_t: A shutdown request.

• uv_write_t: A write request.

• Callback function types:

– void (*uv_write_cb)(uv_write_t* req, int status): A write request callback. Status is negative for failed requests, 0 for successful requests.

– void (*uv_connect_cb)(uv_connect_t* req, int status): A connect request callback, called when a connection started by uv_connect has completed. Status is negative for failed requests, 0 for successful requests.

– void (*uv_shutdown_cb)(uv_shutdown_t* req, int status): A shutdown request callback. Status is negative for failed requests, 0 for successful requests.

– void (*uv_connection_cb)(uv_stream_t* server, int status): A connection callback. Called when a stream server has an incoming connection.

Libuv streams, subtypes of uv_stream_t, support the following operations:

• int uv_shutdown(uv_shutdown_t* req, uv_stream_t* handle, uv_shutdown_cb cb);

Shuts down the write side of a duplex stream. Waits for potential pending requests to complete, and when the shutdown has finished, the callback is called.

• int uv_listen(uv_stream_t* stream, int backlog, uv_connection_cb cb);

Starts listening for incoming connections on the server specified by stream. The callback is called upon connections.

• int uv_accept(uv_stream_t* server, uv_stream_t* client);

Accepts an incoming connection and creates a new bidirectional TCP flow (handle). It should be called after receiving a uv_connection_cb callback; called once per callback it is guaranteed to succeed, a guarantee that does not hold if it is called more than once per uv_connection_cb.


• int uv_read_start(uv_stream_t* stream, uv_alloc_cb alloc_cb, uv_read_cb read_cb);

Start reading on a stream. The alloc_cb will be called to allocate read buffers, and the read_cb will be called when data is available. The read callback will be called repeatedly until there is no more data available, or int uv_read_stop(uv_stream_t*) has been called.

• int uv_read_stop(uv_stream_t* stream);

Stop reading from the stream.

• int uv_write(uv_write_t* req, uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs, uv_write_cb cb);

Write supplied buffers in order on the stream. The write callback will be called upon write completion.

• int uv_write2(uv_write_t* req, uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs, uv_stream_t* send_handle, uv_write_cb cb);

Extended write functionality to send handles over a pipe.

• int uv_try_write(uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs);

Same as int uv_write but does not queue requests if they are unable to complete immediately.

• int uv_is_readable(const uv_stream_t* handle);

Return a non-zero number if the stream is readable and zero if it is not.

• int uv_is_writable(const uv_stream_t* handle);

Return a non-zero number if the stream is writeable and zero if it is not.

• int uv_stream_set_blocking(uv_stream_t* handle, int blocking);

Set or unset blocking operation for the stream: stream operations complete synchronously instead of asynchronously, while the asynchronous interface itself remains unchanged.
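The following minimal sketch (ours, standard libuv 1.x API) shows the read path of the stream API in use:

#include <stdlib.h>
#include <uv.h>

/* Allocation callback: provide a buffer for the next read. */
static void on_alloc(uv_handle_t* handle, size_t suggested_size, uv_buf_t* buf) {
    buf->base = malloc(suggested_size);
    buf->len = suggested_size;
}

/* Read callback: called repeatedly while data is available. */
static void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf) {
    if (nread < 0)                /* UV_EOF or an error */
        uv_read_stop(stream);
    else {
        /* consume nread bytes starting at buf->base */
    }
    free(buf->base);
}

/* Given a connected stream, start the asynchronous read loop. */
static void start_reading(uv_stream_t* stream) {
    uv_read_start(stream, on_alloc, on_read);
}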

TCP flows additionally support the following operations (a short server sketch follows the list):

• int uv_tcp_init(uv_loop_t* loop, uv_tcp_t* handle);

Initialise the TCP handle data structure. Does not create a connection.


• int uv_tcp_open(uv_tcp_t* handle, uv_os_sock_t sock);

Open a file descriptor or socket as a libuv TCP handle.

• int uv_tcp_nodelay(uv_tcp_t* handle, int enable);

Enable / disable Nagle’s algorithm.

• int uv_tcp_keepalive(uv_tcp_t* handle, int enable, unsigned int delay);

On/off TCP keep-alive.

• int uv_tcp_simultaneous_accepts(uv_tcp_t* handle, int enable);

Enable / disable simultaneous asynchronous accept requests that are queued by the operating system when listening for new TCP connections.

• int uv_tcp_bind(uv_tcp_t* handle, const struct sockaddr* addr, unsigned int flags);

Bind the handle to an IP-tuple3.

• int uv_tcp_getsockname(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the current address to which the handle is bound.

• int uv_tcp_getpeername(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the address of the peer bound to the handle.

• int uv_tcp_connect(uv_connect_t*, uv_tcp_t*, const struct sockaddr*, uv_connect_cb);

Establish an outgoing TCP connection. The uv_connect_cb will be called upon completion or error.
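The following minimal sketch (ours, standard libuv 1.x API) ties the stream and TCP operations together into a small listening server:

#include <stdlib.h>
#include <uv.h>

static uv_tcp_t server;

static void on_close(uv_handle_t* handle) {
    free(handle);
}

/* Connection callback: fires once per incoming connection; uv_accept is
   guaranteed to succeed when called once from here. */
static void on_connection(uv_stream_t* srv, int status) {
    if (status < 0)
        return;                               /* listen error */
    uv_tcp_t* client = malloc(sizeof *client);
    uv_tcp_init(srv->loop, client);
    if (uv_accept(srv, (uv_stream_t*)client) == 0) {
        /* start reading here, as in the previous sketch */
    } else {
        uv_close((uv_handle_t*)client, on_close);
    }
}

int setup_server(uv_loop_t* loop) {
    struct sockaddr_in addr;
    uv_ip4_addr("0.0.0.0", 8000, &addr);      /* arbitrary for the example */
    uv_tcp_init(loop, &server);
    uv_tcp_bind(&server, (const struct sockaddr*)&addr, 0);
    return uv_listen((uv_stream_t*)&server, 128 /* backlog */, on_connection);
}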

“Pipe” flows additionally support the following operations:

• int uv_pipe_init(uv_loop_t* loop, uv_pipe_t* handle, int ipc);

Initialise the pipe data structure.

• int uv_pipe_open(uv_pipe_t* handle, uv_file file);

3 IP address and port number.


Open a FD or existing handle as a libuv pipe.

• int uv_pipe_bind(uv_pipe_t* handle, const char* name);

Bind the pipe to a file path.

• void uv_pipe_connect(uv_connect_t* req, uv_pipe_t* handle, const char* name, uv_connect_cb cb);

Make an outgoing connection to the specified Unix Domain Socket (Unix) or named pipe (Windows).

• int uv_pipe_getsockname(const uv_pipe_t* handle, char* buffer, size_t* size);

Get the name of the Unix Domain Socket or named pipe to which the handle is bound.

• int uv_pipe_getpeername(const uv_pipe_t* handle, char* buffer, size_t* size);

Get the name of the peer, i.e. the remote end, connected to the handle.

• void uv_pipe_pending_instances(uv_pipe_t* handle, int count);

Set the pipe queue size when the handle is used as a pipe server (the maximum number of pending connections).

• int uv_pipe_pending_count(uv_pipe_t* handle);

• uv_handle_type uv_pipe_pending_type(uv_pipe_t* handle);

Used together to receive stream handles over an IPC pipe: uv_pipe_pending_count returns the number of pending handles, and uv_pipe_pending_type the type of the first pending handle.

File System operations follow the asynchronous API set by libuv for other stream types, but can also be run synchronously if no callback function is supplied. However, even when file system operations are run asynchronously, they are executed using synchronous system calls in a separate worker thread, using libuv's threadpool. The reason is that not all scalable I/O mechanisms (e.g. epoll) support file system file descriptors.

Thread Pool

Libuv implements a thread pool that facilitates asynchronous execution of inherently synchronous work such as file system calls, DNS lookups or user-supplied tasks. The threadpool uses UDSs as its synchronisation mechanism with the main thread, to cause the blocking poll to return. The synchronisation is needed to allow the main thread to process callback functions timely and in a thread safe manner. A usage sketch follows.
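A minimal usage sketch (ours, standard libuv 1.x API) of the threadpool follows; the wakeup of the blocking poll described above happens inside libuv and is not visible to the caller:

#include <stdio.h>
#include <uv.h>

/* Runs on a threadpool worker thread; may block freely, e.g. on a
   synchronous file system call or a DNS lookup. */
static void work_cb(uv_work_t* req) {
    (void)req;
}

/* Runs back on the event loop thread once the worker has signalled the
   loop and the blocking poll has returned. */
static void after_work_cb(uv_work_t* req, int status) {
    (void)req;
    printf("background work done (status=%d)\n", status);
}

int queue_background_work(uv_loop_t* loop) {
    static uv_work_t req;
    return uv_queue_work(loop, &req, work_cb, after_work_cb);
}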

Chapter 4

Design

The design chapter covers the overall design of the port, as well as the modifications to each software module in detail. The modifications of Node.js, covered in section 4.4, are almost exclusively made to libuv, thanks to the good software modularisation of the Node.js project. Regarding libuv, the major work is adapting it to use the libix API in lieu of POSIX system calls. Section 4.3 covers the modifications of IX, in particular the support for an epoll-like API to allow polling of Unix Domain Sockets.

4.1 Design Overview

Node.js is adapted to use IX's system calls for networking through modifications of libuv, Node's core event loop library. We implement our version of libuv on top of the IX userspace library libix in order to minimise the changeset of the codebases and simplify the implementation. An epoll-like interface is introduced in the IX kernel, exposed through a new system call in IX, and finally to the application through a user space function in libix. Naturally, the libuv library leverages this new interface to let applications register interest in events on UDSs, as well as to use libuv's threadpool.

4.2 Limitations

The libuv version designed in this thesis supports Node.js but makes no claim to universally support all applications relying on the libuv API. Most limitations holding for the IX branch of libuv stem from various limitations in IX1 that prevent us from implementing the libuv API with the very same guarantees as standard libuv. Such limitations include, but are not limited to: only one libuv event loop can be used per IX elastic thread, event loops can only be run in IX elastic threads, multiple processes as required by the Node cluster module [34] cannot run concurrently, and

1 See section 3.1.7


finally, listening “sockets”, or handles, cannot be bound to port numbers other than port 8000. Neither do we support DNS queries on machines without at least one Network Interface Controller (NIC) left to Linux, since IX does not provide a DNS API. On machines with more than a single NIC, one NIC can be left attached to Linux, and DNS functionality remains available through system call passthrough.

4.3 Modifications of IX

This section describes the implementation of UDS polling support in IX. Section 4.3.1 motivates the need to introduce changes to the IX kernel. Section 4.3.2 describes the architectural design of the kernel level functionality and the newly introduced system call. Section 4.3.3 is a brief passage describing the user level API interfacing the introduced system call.

4.3.1 Motivation for IX Kernel Extensions

Node, and by extension libuv, supports a rich API of stream I/O operations on network flows, UDSs and files in the file system. IX provides a synchronous polling method that does not support a timeout. This is by design: the run-to-completion paradigm allows IX to assume that there is no more work to be done at the application level when a loop iteration has passed. Yielding control to the dataplane through the bpoll call has the semantics that nothing more can happen in userspace; all following events will be network triggered. This does not hold for a general Node.js application. For example, a web request might trigger a database call to a MySQL server running on localhost, where communication is achieved over a Unix Domain Socket. A user might request a file to be read through (synchronous) system calls in a background thread from libuv's threadpool; when the background task has completed an event must be raised, and its origin is not the IX dataplane. When the IX bpoll has returned and the elastic thread has processed all incoming packets, the application has to take the polling decision: should it call bpoll or should it not? If it does call bpoll, it risks getting stuck indefinitely in the IX bpoll if no new packets arrive. That would effectively prevent responses from being delivered to clients waiting for asynchronous work performed in the threadpool, or for data read from a UDS, e.g. one connected to another application on the same machine. If it does not call bpoll, or postpones it until there is no more queued work or waiting clients, it risks missing incoming packets in the dataplane, which further reduces the throughput of the system. Therefore it is impossible to implement a fully functional port of Node.js to IX without modifying or extending the IX kernel. However, note that if the IX kernel is extended with support for UDSs, both the problem of concurrently polling UDS sockets and network flows and the problem of synchronisation with other event sources can be resolved. The first problem is trivially solved by combining the polling API of the two pollable types. The possibility of synchronisation with arbitrary event sources, such as the finish of


some task run asynchronously in the thread pool, can be implemented by writing on a UDS registered with IX.

Figure 4.1: The IX dataplane kernel including the UDS worker thread addition.

4.3.2 Kernel Extension

The IX kernel is extended with functionality to concurrently poll for events on both UDSs and network flows. IX does not discern between different polling sets for network flows the way the Linux epoll functionality does; it reports every flow with a change on the queues tied to the dataplane in question as an event condition2. Therefore, in the name of API coherence, with the introduction of notification support for events on Unix Domain Sockets we do not introduce multiple polling sets, but provide a singleton polling set per IX dataplane. We introduce only one new system call, sys_uds_ctl, which takes over the role of epoll_ctl for the elastic thread's global polling set:

int sys_uds_ctl(int fd, int op, struct epoll_event* event);
int epoll_ctl(int epfd, int op, int fd, struct epoll_event* event);

Events on the subscribed Unix Domain Sockets are returned as a new event condition upon calling the IX specific system call sys_bpoll:

int sys_bpoll(struct bsys_desc* d, unsigned int nr);

The extension, as seen in fig. 4.1, is implemented by the addition of a kernel level worker thread.

2 A change is defined as follows: more data has been received on a flow since the last event condition. IX polling is thus neither edge- nor level-triggered.

The worker thread and the singleton polling instance are lazily spawned on the first call to int sys_uds_ctl(...);. Every elastic thread that has a nonempty UDS polling set will thus maintain a 1-to-1 mapping to a UDS worker thread, sharing an epoll instance, with the UDS worker thread keeping an identifier to its parent elastic thread. The UDS worker thread continuously polls the shared epoll instance through int epoll_pwait(...); and, upon encountering a non-empty return set, it updates the shared PollingSetState. Upon modification, and if there is no outstanding notification to its corresponding elastic thread, it notifies the elastic thread, through IX's RPC mechanism for posting work to a specific elastic thread, that it has changed the PollingSetState. In the IX polling system call int sys_bpoll(...); the system synchronises against the incoming work queue of the elastic thread in question once per iteration of the main loop. Thus, if the worker thread posts a notification of a change to the PollingSetState, the elastic thread will synchronise by running the callback. Subsequently, in the next polling loop iteration, the elastic thread will generate an event condition for each file descriptor marked in the PollingSetState and then reset the PollingSetState. After polling the NIC queue one additional time, it will let the polling system call return, since a non-empty set of event conditions has been generated. A sketch of the worker loop follows.
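The following sketch illustrates the worker loop under the stated design. Only epoll_pwait is a real API here; the layout of PollingSetState and the helpers mark_fd and notify_elastic_thread are illustrative stand-ins for the actual IX internals:

#include <stdbool.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Illustrative stand-ins, not the actual IX types. */
struct polling_set_state { bool notification_outstanding; };
struct elastic_thread;

extern void mark_fd(struct polling_set_state* s, int fd);
extern void notify_elastic_thread(struct elastic_thread* t,
                                  struct polling_set_state* s);

void uds_worker_loop(int epfd, struct polling_set_state* state,
                     struct elastic_thread* parent) {
    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_pwait(epfd, events, MAX_EVENTS, -1, NULL);
        for (int i = 0; i < n; i++)
            mark_fd(state, events[i].data.fd);    /* update shared state */
        if (n > 0 && !state->notification_outstanding)
            /* post a callback to the parent elastic thread via IX's RPC
               mechanism; on its next sys_bpoll iteration the elastic
               thread turns the marked fds into event conditions */
            notify_elastic_thread(parent, state);
    }
}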

4.3.3 libix

Libix provides a user level function that enables the application to invoke the int sys_uds_ctl(...); system call of IX. Libix is also modified to accept a function pointer for a callback function for the added UDS_ACTIVITY event condition. A hypothetical usage sketch follows.
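In the sketch below, the wrapper and registration names ix_uds_ctl and ixev_set_uds_activity_cb are placeholders for the actual libix identifiers, which the text does not name:

#include <sys/epoll.h>

/* Placeholders for the actual libix identifiers. */
extern int ix_uds_ctl(int fd, int op, struct epoll_event* event);
extern void ixev_set_uds_activity_cb(void (*cb)(int fd));

static void on_uds_activity(int fd) {
    /* dispatch to whatever watcher the application registered for fd */
    (void)fd;
}

int watch_uds(int fd) {
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    ixev_set_uds_activity_cb(on_uds_activity);
    return ix_uds_ctl(fd, EPOLL_CTL_ADD, &ev);  /* enters sys_uds_ctl */
}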

4.4 Modifications of Node.js

Since Node.js consists of a small core kernel that binds together the V8 JavaScript engine with the libuv event library, implementing as much functionality as possible in JavaScript, our porting efforts are largely concentrated on the libuv event library. Libuv implements most of the core functionality Node provides, whereas the core Node kernel exposes bindings for said functionality to JavaScript code, through the V8 engine.

4.4.1 Modifications of libuv

Libuv, as seen in fig. 4.2, is modified to use the libix event API instead of the various POSIX APIs (epoll, kqueue or event ports) or the Windows IOCP API. Networking is performed via the API described in section 3.1.6. Furthermore, the introduced epoll-like interface for Unix Domain Sockets is used both to provide support for UDS and to synchronise with libuv's internal threadpool.


Figure 4.2: libuv implemented on top of the libix event polling API.

Networking

The libuv port to IX uses the event API of libix to as large an extent as possible, as it provides an API fairly close to libuv's and performs reference counting of write buffers. We implement a subset of the libuv API; following the description in section 3.2.2, we describe the implementation of each function and motivate the departures from fulfilling the libuv API, by functionality or by contract. The implementation follows. First, we supplement the uv_tcp_t struct, libuv's user facing TCP stream handle, with a pointer to a libix ixev_ctx struct. Likewise, the ixev_ctx's field user_data is chosen to contain a pointer to a uv_tcp_t if properly set up and paired. We also extend the libuv TCP struct with a linked list of pointers to libix contexts to facilitate the usage of a TCP handle as a server. The list serves as a buffer of connections accepted by the libuv layer from IX, awaiting acceptance from the user application. The changes for the networking subsystem include the functions regarding libuv streams and libuv TCP streams. UV stream handle API:

• int uv_shutdown(uv_shutdown_t* req, uv_stream_t* handle, uv_shutdown_cb cb);

Shuts down the write side of a duplex stream. Waits for potential pending requests to complete, and when the shutdown has finished, the callback is called. The call to uv__io_start is disabled; otherwise the function is left unmodified. It initialises the shutdown request and schedules its execution in the loop.

• int uv_listen(uv_stream_t* stream, int backlog, uv_connection_cb cb);


Starts listening for incoming connections on the server specified by stream. The callback is called upon connections. This is a setter function for the connection callback. Note that to effectively listen for incoming connections, the server must be bound to an address; for TCP flows, this is done via uv_tcp_bind.

• int uv_accept(uv_stream_t* server, uv_stream_t* client);

Accepts incoming connections and creates new bidirectional TCP flows (handles). Should be used after receiving the uv_connection_cb callback; used once per callback it is guaranteed to succeed, a guarantee that does not hold if it is used more than once per uv_connection_cb. The acceptance of new connections is a three step procedure (the first step is sketched after this list). First, a library-internal callback3 is registered with libix to fire upon a USYS_TCP_KNOCK event condition. Said callback will, if a listening server is registered with uv_tcp_bind, create a new ixev_ctx, enqueue it with the listening server, and enqueue the listening server for I/O. It then returns a pointer to the allocated ixev_ctx, which causes libix to issue a bsys_tcp_accept system call. Secondly, an interim callback handling the readiness of the listening handle is called, which handles the control redirection to the user-supplied connection callback. The third phase is to call uv_accept from the connection callback. For TCP flows, the function simply dequeues an ixev_ctx from the supplied uv_tcp_t* server handle and connects it with the client handle.

• int uv_read_start(uv_stream_t* stream, uv_alloc_cb alloc_cb, uv_read_cb read_cb);

Start reading on a stream. The alloc_cb will be called to allocate read buffers, and the read_cb will be called when data is available. The read callback will be called repeatedly until there is no more data available, or int uv_read_stop(uv_stream_t*) has been called. The UV_READING flag is set for the stream and the read callback handler is registered. From standard libuv we disable the activation of listening on a file descriptor in the case of TCP streams.

• int uv_read_stop(uv_stream_t* stream);

Stop reading from the stream. The UV_READING flag is cleared, along with the read callback function, preventing the user from being notified of available data.

• int uv_write(uv_write_t*, uv_stream_t*, const uv_buf_t[], unsigned int, uv_write_cb);

3 struct ixev_ctx* ixuv__accept(struct ip_tuple* id)


Write supplied buffers in order on the stream. The write callback will be called upon write completion. Writing to a uv_tcp_t uses the libix zero copy API, under the assumption that the user may not modify or free a submitted buffer before the corresponding write completion callback has been called. With the submission of the write buffer (by reference) we submit an ixev_ref_t containing a pointer to the user's uv_write_t write request. When the IX kernel triggers a sent event condition with a transmission count that exceeds the position of our buffer, our internal callback is called. In that callback the write request is looked up and added to the write complete queue. Eventually this leads to the user-supplied callback being called as a notification of the write being completed, allowing the user to free related buffers or otherwise proceed with its processing. A sketch of the completion path follows after this list.

• int uv_write2(uv_write_t* req, uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs, uv_stream_t* send_handle, uv_write_cb cb);

Extended write functionality to send handles over a pipe. NOT APPLICABLE. Unlike the Unix socket layer, IX does not provide the possibility of sending flows between processes.

• int uv_try_write(uv_stream_t* handle, const uv_buf_t bufs[], unsigned int nbufs);

Same as int uv_write but does not queue requests if they are unable to complete immediately. NOT IMPLEMENTED. Always returns 0.

• int uv_is_readable(const uv_stream_t* handle);

Return a non-zero number if the stream is readable and zero if it is not. No modification, reads flag from handle field.

• int uv_is_writable(const uv_stream_t* handle);

Return a non-zero number if the stream is writeable and zero if it is not. No modification, reads flag from handle field.

• int uv_stream_set_blocking(uv_stream_t* handle, int blocking);

Set or unset blocking operation for the stream: stream operations complete synchronously instead of asynchronously, while the asynchronous interface itself remains unchanged. NOT APPLICABLE. IX does not support blocking operation.
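As referenced under uv_accept above, a sketch of the first step of acceptance, the library-internal knock callback, follows. The ixuv__accept signature is taken from footnote 3; the ixev_ctx stand-in and the helper names (lookup_listening_handle, enqueue_pending_ctx, schedule_connection_cb) are illustrative, not the actual implementation:

#include <stdlib.h>
#include <uv.h>

struct ip_tuple;                       /* libix type: (ip, port) pair */
struct ixev_ctx { void* user_data; };  /* stand-in; the real struct is libix's */

extern void ixev_ctx_init(struct ixev_ctx* ctx);  /* assumed initialiser */
extern uv_tcp_t* lookup_listening_handle(struct ip_tuple* id);
extern void enqueue_pending_ctx(uv_tcp_t* server, struct ixev_ctx* ctx);
extern void schedule_connection_cb(uv_tcp_t* server);

static struct ixev_ctx* ixuv__accept(struct ip_tuple* id) {
    uv_tcp_t* server = lookup_listening_handle(id);  /* set up by uv_tcp_bind */
    if (server == NULL)
        return NULL;                   /* no listener: reject the knock */

    struct ixev_ctx* ctx = malloc(sizeof *ctx);
    ixev_ctx_init(ctx);
    ctx->user_data = NULL;             /* paired later, in uv_accept */
    enqueue_pending_ctx(server, ctx);  /* buffered until uv_accept() */
    schedule_connection_cb(server);    /* later runs the uv_connection_cb */
    return ctx;                        /* non-NULL: libix issues bsys_tcp_accept */
}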
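And as referenced under uv_write above, a sketch of the write-completion path follows. The ixev_ref layout and the callback wiring are modelled on the description only; the real libix type and registration function may differ:

#include <uv.h>

struct ixev_ref {                      /* stand-in for the libix type */
    void (*cb)(struct ixev_ref* ref);  /* fires once the kernel's sent count
                                          passes this buffer's position */
};

struct write_ctx {
    struct ixev_ref ref;               /* must be first: we cast back to it */
    uv_write_t*     req;               /* the user's write request */
};

extern void queue_write_complete(uv_write_t* req);  /* runs uv_write_cb later */

/* Internal sent-event callback: translates the libix notification into
   libuv's write-completion machinery; the user may free its buffers only
   after the uv_write_cb has run. */
static void ixuv__on_sent(struct ixev_ref* ref) {
    struct write_ctx* wc = (struct write_ctx*)ref;  /* ref is the first member */
    queue_write_complete(wc->req);
}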


UV TCP handle API:

• int uv_tcp_init(uv_loop_t* loop, uv_tcp_t* handle);

Initialise the TCP handle data structure. Does not create a connection. The I/O handling is replaced with a handler that only performs callbacks for completed writes, since all other functionality of the I/O handler is provided by libix.

• int uv_tcp_open(uv_tcp_t* handle, uv_os_sock_t sock);

Open a file descriptor or socket as a libuv TCP handle. NOT APPLICABLE. IX does not have a socket layer that supports using file descriptors as TCP streams.

• int uv_tcp_nodelay(uv_tcp_t* handle, int enable);

Enable / disable Nagle’s algorithm. NOT APPLICABLE. IX does not implement Nagle’s algorithm, so libuv-ix does not offer a way to control it.

• int uv_tcp_keepalive(uv_tcp_t* handle, int enable, unsigned int delay);

Enable / disable TCP keep-alive. NOT APPLICABLE. IX does not implement TCP keep-alive, so libuv-ix does not offer a way to control it.

• int uv_tcp_simultaneous_accepts(uv_tcp_t* handle, int enable);

Enable / disable simultaneous asynchronous accept requests that are queued by the operating system when listening for new TCP connections. NOT IMPLEMENTED.

• int uv_tcp_bind(uv_tcp_t* handle, const struct sockaddr* addr, unsigned int flags);

Bind the handle to an IP-tuple4. Binding listening “sockets” is done by registering a mapping from an (ip, port) tuple to a listening handle. With libix we register a library-internal intermediary callback5. Upon connection events, the intermediary callback will look up the listening handle for the connection tuple and direct the connection event to the concerned handle.

4 IP address and port number.
5


Since IX currently only supports listening on port 8000, the mechanism is implemented by storing a single pointer in the BSS segment, but it can easily be extended to support multiple listening handles by implementing the interface with a hash map, as all accesses to the singleton mapping are performed through the map interface.

• int uv_tcp_getsockname(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the current address to which the handle is bound. NOT IMPLEMENTED. Consider implementing a suitable replacement and/or a reverse lookup for TCP listening servers.

• int uv_tcp_getpeername(const uv_tcp_t* handle, struct sockaddr* name, int* namelen);

Get the address of the peer bound to the handle. NOT IMPLEMENTED. Consider extending libix to save remote information upon accepting a connection.

• int uv_tcp_connect(uv_connect_t*, uv_tcp_t*, const struct sockaddr*, uv_connect_cb);

Establish an outgoing TCP connection. The uv_connect_cb will be called upon completion or error. NOT APPLICABLE/NOT IMPLEMENTED. IX does not currently support outgoing connections. When outgoing connections are supported, this function needs to call void ixev_dial(struct ixev_ctx* ctx, struct ip_tuple* id);.

Unix Domain Sockets

Libuv uses a single “backend_fd” file descriptor for a singleton polling set per event loop, and Node.js uses a single libuv event loop per Node.js process. Implementing UDS support over the IX abstraction of a single polling set thus poses no unnecessary restrictions. All libuv calls to int epoll_ctl(...); are proxied through int uv__epoll_ctl(...);. Thus, all we need to do is to replace the system call to int epoll_ctl(...); in int uv__epoll_ctl(...); with a call to int sys_uds_ctl(...);, discarding the int epfd parameter (a sketch follows below). Finally, we need to register a callback with libix for handling USYS_UDS_ACTIVITY event conditions. Naturally, the code within that callback function is the same code that normally follows the call to int epoll_wait(...); in standard libuv: it looks up the user-registered callback function for the specified file descriptor and yields control to the user. Thus, no changes to the implementation of the stream or pipe abstractions in libuv are needed.
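A minimal sketch of the proxy change (the uv__epoll_ctl signature is libuv's; the direct call to sys_uds_ctl stands in for its libix entry point):

#include <sys/epoll.h>

extern int sys_uds_ctl(int fd, int op, struct epoll_event* event);

int uv__epoll_ctl(int epfd, int op, int fd, struct epoll_event* event) {
    (void)epfd;   /* discarded: IX keeps one polling set per dataplane */
    return sys_uds_ctl(fd, op, event);
}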


Timers

Timers continue to work since IX can pass the time system calls down to the Linux kernel, and all timer structures in libuv are time agnostic. However, with the IX processing model there is a risk that the polling blocks for a long time if no request reaches the server for an extended period. That will prevent libuv from executing user provided timer callback functions until the IX polling returns. Note that this behaviour does not break the API: timers are only guaranteed to run some time after they have expired, and there is no upper bound on the delay. Therefore we do not add any extra implementation to handle timers. For users interested in having timers execute within an upper bound from the timer expiration point, we suggest implementing a worker thread that periodically wakes up to write on a UDS, triggering an event condition and forcing IX to return from polling, which enables timer handling in the elastic thread. A sketch of this workaround follows.
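A minimal sketch of the suggested workaround, using only standard POSIX primitives; the period and the registration step are placeholders:

#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static int wakeup_fds[2];   /* [0] is registered with IX's UDS polling set */

static void* ticker(void* arg) {
    (void)arg;
    for (;;) {
        usleep(1000);                        /* 1 ms bound on timer delay */
        (void)write(wakeup_fds[1], "x", 1);  /* forces the poll to return */
    }
    return NULL;
}

int start_timer_wakeup(void) {
    pthread_t tid;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, wakeup_fds) != 0)
        return -1;
    /* register wakeup_fds[0] with the UDS polling set here */
    return pthread_create(&tid, NULL, ticker, NULL);
}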

4.4.2 Modifications of the V8 JavaScript Engine

The V8 JavaScript engine employs Address Space Layout Randomisation (ASLR) [35] as a measure against buffer overflow attacks. The technique randomises the memory placement of data areas of the process, such as the stack, heap and libraries, in order to prevent a perpetrator performing a buffer overflow attack from reliably jumping between memory locations. The protection mechanism is implemented as a random hint for memory placement passed to the mmap system call. Currently IX does not support the hints that V8 generates. Therefore ASLR is disabled by changing the function generating the randomised placement hint6 to always return 0.

6 void* OS::GetRandomMmapAddr(), in v8/src/base/platform/platform-.cc

Chapter 5

Evaluation

The evaluation chapter is organised as follows: section 5.1 explains the main performance evaluation along with its methodology. In section 5.2 we account for the mechanisms in IX that improve the various performance metrics, namely throughput in section 5.2.1 and latency distribution in section 5.2.2.

5.1 Results

The main result section starts by introducing the test methodology in section 5.1.1. In section 5.1.4 we look at how the latency of the system depends on the arrival rate of requests. Connection scalability is explored in section 5.1.5, where we look at how the latency varies with an increased number of sustained connections under a sub-saturation load.

5.1.1 Test Methodology

We compared Node-on-IX vs Node-on-Linux for a version of Node1 modified to disable ASLR in V8, dynamically built against libuv. Both test systems used Ubuntu 14.10 with Linux kernel version 3.16.0-41. For the Linux tests we used version 1.5.0 of libuv2, whereas IX uses the libuv version described in chapter 4. Since a Node application typically performs I/O-intensive rather than CPU-intensive work, we chose to benchmark Node with a Hello World type of application issuing an HTTP response body of 17 bytes. For the benchmarks, an in-house scalable distributed load generator was developed. The application allows a rate controlled, Poisson distributed load to be generated from a set of load generating machines, synchronised by a master node responsible for latency measurements and data reconciliation. Each physical machine is able to simulate a high number of virtual clients by a …-based

1 git commit: 9010dd26529cea60b7ee55ddae12688f81a09fcb
2 git commit: db0624a465493931c790445c22227660b88c5a8e

parallelisation model. For further description of the load generator, please see appendix B. Each test case used 4 slave load generating physical machines, and one master that distributed load and measured latency over non-saturated probe connections. The server and each of the load generating machines used a dual socket motherboard with 2× 2.60 GHz 8-core Intel Xeon E5-2650 processors, for a total of 16 cores and 32 hyperthreads, with 64 GB memory. The machines were configured with Intel x520 10GbE NICs and connected by 10 Gb Ethernet over a Quanta/Cumulus 48x10GbE switch with a Broadcom Trident+ ASIC. The latency measurement machine held 4 concurrent connections open at an average request rate of 1000 requests per second. The remaining virtual clients and load were evenly distributed among the participating load generators. Experiments were run for 5 minutes per data point.

5.1.2 Performance Metrics

Throughput

Throughput is the number of transactions that can be completed in a given time frame. In the case of Node, we look at the number of HTTP responses that can be completed per second. We are interested both in the total throughput, that is, the maximum throughput the system can sustain, and in the throughput given some SLA. An SLA is often given as a 99th percentile latency. Throughput given an SLA thus means the maximal throughput attainable while not exceeding the given SLA.

Latency

Latency is the time it takes for the server to serve a given request. The latency of a request includes the network round trip time, the time spent waiting while earlier requests are serviced (queueing time), and finally the service time of the request in question. We measure and care about tail latency, that is, the far end of the latency distribution, not only because the world is becoming increasingly real time oriented, with interactive applications inducing the need to guarantee timely service in almost all cases. Every hiccup in a smooth user experience decreases user retention and thus profit [36]. We also care about tail latency because the 99th percentile case is far more common than intuition tells us. The tail latency becomes more common, and thus more important, when we introduce dependencies between requests, such that e.g. the slowest request in a set determines the “response of the set”. As Dean and Barroso [13] demonstrated, if a frontend service fans out requests to an underlying layer such as a key-value store, and requires all responses before it can proceed and aggregate a response to the client, the end user will experience the 99th percentile often. Assume that the number of fan-out requests is 100: the chance to undercut the 99th percentile for a single request is naturally 0.99, whereas the chance to do the same for all 100 requests follows to be 0.99^100 ≈ 0.366.

36 5.1. RESULTS

Thus 63.4% of all requests will observe the 99th percentile latency. Note that the math is identical for all scenarios where the response time is given by the slowest response of a set of requests. For a frontend web server this occurs in at least two scenarios: if the web server serves all types of web objects3 needed to display a page, or if, in a standard tiered datacentre architecture, the underlying layer already provides a very tight bound on tail latencies. A (web) client needs to load a set of web objects in order to display a web page. As of 2012, the average number of objects per web page was 100 [37]. Thus, by employing the same math as in the previous paragraph, we see that 63.4% of all requests will observe the 99th percentile latency. If the tail latencies of the underlying layers have been severely tamed, a vast simplification gives that the responses come back to the web server “more or less at the same time”, which again ties the end-to-end latency to the web server's processing of the slowest of those requests. Note that if the tail latency of the underlying service is not well controlled, the latency distribution of the end response will simply be the latency distribution of a single transaction of the web server plus the distribution of the underlying layer.
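In formula form, the fan-out computation above reads:

\[
  P(\text{all } n \text{ sub-requests beat the 99th percentile}) = 0.99^{n},
  \qquad
  P(\text{at least one exceeds it}) = 1 - 0.99^{n},
\]
so for \(n = 100\): \(1 - 0.99^{100} \approx 1 - 0.366 = 0.634\).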

5.1.3 A Note on Poisson Distributed Arrival Rates

Note that for a uniform arrival rate below saturation there would be no queue build-up, and thus no variance in latency; all requests would observe the latency of an unloaded system. A Poisson distributed arrival rate more accurately models reality, with requests from different clients being independent, which inherently induces non-uniformity in the arrival rate. Such behaviour causes momentary queue build-ups when the momentary arrival rate is greater than the service rate. These build-ups induce latency variance, which makes it interesting to study the latency distributions of the systems even for sub-saturation loads. A sketch of how such arrivals are generated follows.
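For reference, such an arrival process is easy to generate: Poisson arrivals have exponentially distributed inter-arrival times, so an open-loop generator can sample the delay to the next request by inverse transform sampling. A minimal sketch (ours, not the actual load generator implementation):

#include <math.h>
#include <stdlib.h>

/* Inter-arrival time for a Poisson process with rate lambda (requests
   per second), by inverse transform sampling of the exponential
   distribution; u is uniform on (0, 1]. */
double next_interarrival_s(double lambda) {
    double u = (rand() + 1.0) / ((double)RAND_MAX + 1.0);
    return -log(u) / lambda;
}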

5.1.4 Load Scaling

We test how the systems respond to load scaling by fixing the number of connections to powers of 2 from 1 to 16384 and varying the load for each setting. The results shown are limited to 64 and 512 connections respectively; other concurrency levels yield similar looking graphs. In fig. 5.1 we see those results in an xy-plot: the x-axis shows the achieved throughput, and the y-axis the observed latencies at the specific load level for each of the systems. For both systems we study both the average, or expected, latency and the 99th percentile of the latency distribution. When a queueing system hits saturation in terms of throughput, queueing theory predicts that the latency approaches infinity at an exponential rate4. Therefore, we are interested in studying the latency response even in sub-saturation cases.

3 Such as images, stylesheets and scripts.
4 For an open-loop load. With “enough” concurrency and a low SLO the phenomenon will appear to take place even for the closed-loop loads we are studying.


Figure 5.1: Load scaling. (a) Load scaling under 64 clients; (b) load scaling under 512 clients. Each plot shows latency (µs) against throughput (10³ requests/sec) for IX-ZC and Linux, average and 99th percentile.

In fig. 5.1a we observe an increase in throughput of 16.75%, and we can see that the 99th percentile tail latency is reduced by 5.24× at 7000 requests per second. Note that given an SLA of 2 ms 99th percentile latency, the effective attainable throughput rises from 4000 requests per second on Linux to 11000 on IX, a 2.75× increase in throughput under the given SLA. In fig. 5.1b we see how both the average latencies and the 99th percentile latencies are much lower on IX for all load levels. Note how the 99th percentile of the IX line stays below even the average latency of the system running on Linux. Furthermore, for this specific concurrency level, we observe a 20.62% increase in throughput, a 5.23× reduction in average latency (at 7000 req/s) and a 5.68× reduction in 99th percentile tail latency (at 6000 req/s).

5.1.5 Connection Scalability

Figure 5.2a shows an almost unloaded system running at approximately 20% of its maximum throughput. Notice how the 99th percentile latency of Linux spikes at 16384 connections, while the latency of IX merely doubles, appearing relatively constant. At a throughput of 5000 requests per second (fig. 5.2b) we start to see the disparity between the two systems at a much lower connection concurrency. At 1024 connections the IX system exhibits a 4.92× reduction in 99th percentile tail latency. Note that given an SLA of 2 ms at the 99th percentile, for this particular throughput the web server can handle 32 concurrent connections running on Linux, and 8192 running on IX.

5.2 Result Tracing

In this section we try to account for the causes of the performance differences between the systems seen in section 5.1.4 and section 5.1.5. In particular, we look

at the increased throughput of IX in section 5.2.1 and the lowered 99th percentile latency in section 5.2.2.

Figure 5.2: Connection scalability. (a) Connection scalability for 2000 requests/s; (b) connection scalability for 5000 requests/s. Each plot shows latency (µs) against the number of concurrent connections (1–16384) for IX-ZC and Linux, average and 99th percentile.

5.2.1 Throughput Increase

To determine the effect of batched system calls on the throughput of the server running on IX, we experimentally set the IX event condition batching size to 1. This has the implication that every packet delivery triggers a kernel crossing, and every buffer that has been sent also triggers a kernel crossing, just as if no batching had been performed at the system call layer. We run at a moderate load and a concurrency level of 512 concurrently connected virtual clients. Figure 5.3 shows the system running on the unmodified IX kernel in black, Linux in red, and finally IX with disabled batching in blue. The figure clearly shows how the non-batched IX system, in both the average (fig. 5.3a) and the 99th percentile (fig. 5.3b) case, reaches saturation at the same load level as Linux, as opposed to the elevated saturation point of unmodified IX. Note that the non-batched version still provides lower latency than Linux for sub-saturation loads.

5.2.2 Reordering & Tail Latency

Since the queueing discipline affects the latency distribution5, we investigate whether there are more request reorderings on Linux than on IX and whether this causes the elevated 99th percentile latency. The motivation is that if a server reorders requests, it will, in fact, change the effective queueing discipline. We define an ordering violation to be a pair of requests A and B such that A reached the server's NIC before B, but B was processed before A. Let the total ordering violation count be the sum of ordering violations.

5 See section 2.4


Figure 5.3: Throughput plot for Linux and IX, with and without batching. (a) Average latencies; (b) 99th percentile latencies. Each plot shows latency (ms) against throughput (10³ requests/sec) for IX, IX-NB (no batching) and Linux, 512 clients.

Furthermore, let the total number of reorderings be the number of requests that have been the victim of an ordering violation. We mark each request's processing time with a sequence number in the server application program6. We approximate the arrival order by a client side timestamp taken once the request has been copied into the kernel space send buffer. All requests on the load generator are saved with these metadata for offline processing. To find the number of violations we process the requests in order of server side processing, incrementally adding them to a set sorted in issue order. If a request is added at the end of that set, no request processed before it was issued after it, i.e. it was not violated. If it is added anywhere else than at the end of the set, we have identified a set of requests that were issued after it but processed before it. Since we traverse the requests in processing order, there can be no other request that was processed before, and therefore we have found the violation subset for which the request in question is request A in the definition given above. Since violations are symmetric, we need only find the violation subsets for all requests such that, for a given global choice of whether to look for requests of type A or B, the request in question is the chosen end of the violation. Summing the sizes of all such subsets yields the total number of violations. (An equivalent inversion-counting formulation is sketched after the table.) Table 5.1 shows the ordering violation count for three different workloads, 2000, 5000 and 7000 requests per second, with 8, 1024, 4096 and 16384 connected virtual clients. For the reordering test we let the master node continue to measure the latency, but the reordering request set is the request set issued by a single load generator. The Clients column gives the number of concurrently connected virtual clients and TP the achieved total throughput of the system7. AVG gives the average latency and the first 99th column the 99th percentile latency, both measured in microseconds. Requests and Violations give the total number of requests and the number of violations, respectively, in the load generator request set. The second 99th column gives the 99th percentile of violations per request in the request set. The last column gives the ratio between the absolute difference in 99th percentile latency between IX and Linux and the absolute difference in the 99th percentile of violations per request.

6 Since Node is single threaded we have no race conditions on the sequence number.
7 Not a maximum test, but aimed at a target load level.


SYSTEM  Clients  TP      AVG [µs]  99th [µs]  Requests  Violations  99th (vio.)  99 lat / 99 vio.
Linux   8        1999.5  287       633        300190    3117        1            353
IX      8        1998.8  172       280        300401    1174        0
Linux   8        5005    286       869        1201395   80945       1            412
IX      8        4995.7  212       457        1199043   7789        0
Linux   8        6994.3  335       1144       1799121   290947      2            511
IX      8        7001.6  264       633        1800112   18493       1
Linux   1024     2002    426       1363       300105    5834        1            853
IX      1024     2000.6  197       510        300115    2042        0
Linux   1024     4989.4  807       3696       1197414   193804      2            3015
IX      1024     4993.6  241       681        1197962   13411       1
Linux   1024     7000.2  963       7069       1799767   3076127     27           215
IX      1024     7003.2  412       1472       1801614   34357       1
Linux   4096     2003.3  421       1234       299918    4923        1            613
IX      4096     1997.2  214       621        298860    2273        0
Linux   4096     4987.6  771       3149       1197092   140086      1            NaN
IX      4096     5003.3  276       945        1201014   13451       1
Linux   4096     6996.7  954       5646       1799855   1112979     15           224
IX      4096     7003.8  547       2506       1800287   667747      1
Linux   16384    1995.6  431       1157       299217    4732        1            593
IX      16384    1995.8  213       564        298966    2568        0
Linux   16384    4998.9  844       4130       1199935   95874       1            NaN
IX      16384    4998.6  347       1923       1198846   14345       1
Linux   16384    6895.3  1105      7595       1800366   1011314     13           261
IX      16384    7006.6  614       4459       1802588   437480      1

Table 5.1: Ordering violations.

Observe that for a load rate of 7000 requests/s, the ratio is close to the unloaded average latency on Linux. Therefore, the increased request level reordering may well be a contributing factor to the increased tail latency on Linux, relative to IX, for loads close to saturation. For sub-saturation loads, the high ratio numbers suggest that more factors are at work. Since it is known that IX processes packets in strict FIFO order, and Node.js handles events in the order they were generated, we know that the reorderings measured for IX are client generated. Since the endpoint of measurement is the server's processing sequence number, we know that these reorderings happen during the client's send phase. This observation highlights the problem of using a multi-threaded client to test reordering; for best results a highly optimised IX client should have been used. However, that was not possible within the scope of the project since IX currently does not support outgoing connections.
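For reference, the counting procedure described above is equivalent to counting inversions in the sequence of client issue timestamps ordered by server processing sequence number, which a merge sort does in O(n log n). A minimal sketch of this equivalent formulation (ours, not the actual offline-processing code):

#include <stdlib.h>
#include <string.h>

static long long merge_count(double* a, double* tmp, size_t lo, size_t hi) {
    if (hi - lo < 2) return 0;
    size_t mid = lo + (hi - lo) / 2;
    long long inv = merge_count(a, tmp, lo, mid) + merge_count(a, tmp, mid, hi);
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) {
        if (a[i] <= a[j]) tmp[k++] = a[i++];
        else { inv += mid - i; tmp[k++] = a[j++]; } /* a[j] precedes mid-i items */
    }
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
    return inv;
}

/* issue[] holds client-side issue timestamps, ordered by server-side
   processing sequence number; the return value is the total number of
   ordering violations. */
long long total_violations(double* issue, size_t n) {
    double* tmp = malloc(n * sizeof *tmp);
    long long inv = merge_count(issue, tmp, 0, n);
    free(tmp);
    return inv;
}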


Chapter 6

Discussion

We set out to determine whether or not Node.js could be effectively ported to IX, and furthermore to chart the performance benefits and disadvantages of using the IX operating system to run Node.js. The results clearly show that it is not only possible, but that we can improve upon the performance of Node in all metrics tested by using IX instead of Linux. Naturally, the size of the benefit varies across the tested metrics. Unloaded latency is improved by roughly a factor 2×, and by up to 5.23× for some sub-saturation loads1, and tail latency is also significantly improved, by up to 5.68×. For cases where we care about fast responses and a good distribution of such latencies, it makes sense to run Node.js on IX. Such cases include, but are not limited to: a single web server handling all the requested web objects (i.e. not having a Content Delivery Network)2, but also a standard web server in a classical tiered datacentre hierarchy, if the underlying key-value store is assumed to already keep a very tight bound on tail latency. However, one can question whether Node.js is the ideal framework for an application with tight requirements on low latency, as a considerable latency cost is induced by the execution of JavaScript, and the transition cost between executing JavaScript and C++ in V8 is high. Note the rather modest throughput increase of roughly 20%: a significant number, but not a game changing one. If the increase had come with no drawbacks, just plug and play into a new OS for a 20% performance increase, the switch might have been a no-brainer. Many of the drawbacks can be ignored depending on the use case; e.g. the inability to look up DNS without a supplementary NIC, which could easily and inexpensively be installed, or the lack of UDP, which applications that only use TCP can accept. Likewise, IX's current inability to initiate remote network connections might be acceptable for a single web server setup. But it does pose a problem in a multi-tiered datacentre architecture, as it prevents the front end web server from initiating connections to the nodes of services in the underlying layers, such as key-value store replicas.

1 See e.g. 512 clients @ 8000 RPS, fig. 5.1b.
2 See the argument in section 5.1.2.


The major hindrance to viewing the 20% throughput improvement as a reason to immediately start running Node on IX is, however, the lack of support for horizontal scaling. The idiomatic way of horizontally scaling a Node.js application is to use the Node.js cluster module, which runs multiple Node processes in parallel: a set of worker processes that perform the application business logic, and a single master process responsible for accepting connections and distributing them over the available workers. Theoretically the cluster module scales throughput linearly with the number of CPU cores, since it utilises a process per core. Our practical results do not suggest otherwise: with 16 cores and 32 hyperthreads we should be seeing anywhere from a 16× up to a 32× increase in throughput, and we observed a throughput improvement of up to 24× compared to a single process system by employing the cluster module. That is a significant performance gap compared to the 20% optimisation achieved by IX. The two optimisations should be orthogonal, but the 20% improvement from running on IX only becomes relevant once IX supports a horizontally scalable execution model. There is work in progress regarding the usage of SR-IOV3 to, among other purposes, support concurrent execution of multiple IX instances, i.e. a multi-process execution model. However, that is not sufficient to horizontally scale Node.js on IX, as IX processes do not share a single network flow namespace; IX flows thus cannot be shared among multiple IX processes the way Node.js currently scales horizontally on Linux systems. Furthermore, IX dataplanes are tightly coupled to their respective NIC queues, and so would IX processes be to their respective Virtual Functions. Therefore, the scheme of having a master process accept connections and then distribute them over worker processes does not work on IX. Note that SR-IOV assigns a different MAC address to each Virtual Function. Thus, using SR-IOV to multiplex packets over multiple processes requires a demultiplexing function, such as a load balancer, to distribute the connections over the “mini nodes” present for each core. Depending on the characteristics of the connections and the mechanism of the load balancer, the latency increase might be acceptable for web servers such as Node.js. For other services, such as microsecond-computing applications, we might not prefer to scale horizontally if it requires an additional system passthrough. In summary, horizontal scaling with a process per CPU core, each running a worker event loop, like the Node.js cluster module, is not currently possible for Node.js on IX. In particular, the Node.js cluster module paradigm does not work, and we need another solution, which may or may not prove hard to engineer. This problem, apart from the software maturity, usability and support of IX, remains the main gripe that prevents imminent adoption of IX for running network bound Node.js applications.

3 Single-Root I/O Virtualization, an Intel technology to multiplex a NIC as multiple virtual NICs.


6.1 Related Work

The Exokernel [11] started the debate concerning radical operating system design to counter the inefficiencies of the general-purpose abstractions provided by operating systems of classical design. It created a new design paradigm for operating systems that aims to provide abstractions as thin as possible, ideally just exposing the hardware interface. The designers do realise that the kernel has to control resources in order to isolate applications. They design secure bindings, which provide secure allocation of hardware resources. The secure bindings are implemented differently for different resources, but what they provide is a decoupling of allocation/management from usage/access control. IX [1] and Arrakis [19] are in a way both modern incarnations of the exokernel idea. They both leverage virtualisation hardware to export secure access to the underlying hardware interface. Where IX uses VT-x to export the interface of a process having access to privilege modes and NIC hardware rings, Arrakis utilises SR-IOV technology, which provides virtual functions appearing as NICs on the PCI bus. Thereby, Arrakis only exports the hardware interface of NICs and, assuming a technology similar to SR-IOV for storage, of storage controllers. Cheetah is a sample web server application built to showcase Greg Ganger's extensible I/O library XIO for the second exokernel, Xok. By exploiting the extensibility of Xok, the team managed to build a web server that improved throughput by 8× versus the best result they were able to achieve on OpenBSD.

6.2 Lessons Learned

We have seen that a specialised library operating system built for Memcached can be used for a more general networked event-driven application framework, Node.js. It is general enough to support the operations required by Node after our extensions regarding UDS. The abstractions exposed by IX, mainly flows replacing the socket layer of a Unix system, correspond well with the abstractions required by Node.js in the single-threaded case, and therefore Node.js can benefit from the optimisations allowed by this new abstraction set. The exception is that the flow abstraction and IX's memory model make it difficult to scale Node.js horizontally, as done in the cluster module, even if multiple concurrently running IX processes were supported. The fact that Node.js is less efficient than Memcached limits the performance improvements possible by running it on IX. Since Node.js spends a larger fraction of its execution time in user space, only smaller improvements can be achieved by running an optimised kernel and system call layer. It is important to analyse an application completely to design an efficient libOS for it. Even though the IX execution model is general enough to support the main features of Node, some of its current limitations prevent it from competing in throughput with a multicore Linux system. Thus, it is important to analyse

the full execution model and its application to design an efficient library operating system tailored for a specific application.

6.3 Future Work

Limitations

The enumerated limitations of IX, including UDP support, outgoing connections, a wider range of supported NICs and multiple processes running in parallel, need to be addressed for IX to reach a wider public, and are required to enable adoption in production.

Horizontal Scaling

Find a way to scale Node horizontally on IX. Even if the development using SR-IOV technology makes it possible to run multiple instances of IX in parallel on different cores, the cluster module of Node.js will not provide horizontal scaling. The cluster module works by having a main process accept new connections and subsequently distribute them over the worker processes by sending the file descriptors over IPC. To enable horizontal scaling, either flows need to support migration between processes, multiple processes need to be able to listen on the same incoming port (on the same network interface), or a new scheme for horizontal scaling, perhaps based on elastic threads instead of processes, would need to be devised. Note that the thread-safety of the JavaScript runtime comes into play if the last route is pursued. A final remark is that for event-driven architectures, a single crash will disrupt many in-flight requests, and this risk is increased with a multi-threaded model compared to a multi-process model.

Nginx on IX

We have seen how Node can benefit in throughput from running on IX. However, Node.js is relatively slow, which limits the potential throughput gains. Nginx is an event-driven web server written in C for performance, and it does not inherently execute dynamic languages. It would be interesting to see what kind of performance improvements could be realised by running Nginx on IX. The tail-latency argument for static resources is more valid for Nginx, as it is commonly used for CDN deployments. Nginx uses system calls like sendfile aggressively on Linux to improve performance by eliminating kernel crossings. Does IX still outperform Linux for serving static files, even without sendfile semantics? Could IX include its own abstractions for files and UDS to enable a sendfile-like interface?

Improved Tools for Generating Load and Measuring Server Side Reorderings

We have seen how Dialog could be used to generate a Poisson distributed, rate controlled load from a large set of virtual clients, easily implemented by leveraging

…. But the server side reordering tests do show a significant number of reorderings, even on the IX system. We know that IX operates with a strict FIFO discipline, which implies that the reorderings occur client side; not very surprising given the extreme number of parallel threads of execution in the design. Thus, how would a load generator and probing suite be designed to maximise client side control of, or at least knowledge about, the wire-time of each packet? Application knowledge of the time a packet was put on the wire could be used both to minimise client side reordering mistakes, to better probe the server side reorderings, and to generate a more accurate on-the-wire request rate distribution.

libOS for Node/VM Dynamic Languages

IX is basically a library operating system developed to increase the performance of microsecond-computing type applications in datacentre settings. The web server Cheetah demonstrates what an optimised libOS can do for the performance of a web server. How would a libOS designed to run a virtual machine for a dynamically interpreted language, or a JavaScript runtime, maybe V8 in particular, be designed?

6.4 Conclusion

By extending IX with functionality for concurrent polling of network flows and Unix Domain Sockets, we effectively permit a larger set of applications to run on IX. We show that Node.js can now run on IX, through the implementation of a minimal port of libuv to IX's API. Furthermore, we show that Node.js on IX significantly outperforms the Linux baseline, especially regarding latency and the latency distribution. However, due to the semantics of IX flows we are unable to scale horizontally within a single node, which effectively restricts the attainable throughput of a single node by more than an order of magnitude. Nevertheless, we believe that the restrictions can be lifted, in order to show performance improvements that matter in a real world setting. Furthermore, we believe that the project has reinforced the exokernel thesis that general purpose abstractions hurt performance, and that a library operating system with improved performance can prove useful even to third-party applications it was not originally designed for.


Bibliography

[1] A. Belay, G. Prekas, A. Klimovic, S. Grossman, C. Kozyrakis, and E. Bugnion, “IX: a protected dataplane operating system for high throughput and low latency”, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 49–65 (cit. on pp. 2, 3, 7, 8, 14, 15, 45).

[2] G. Prekas, M. Primorac, A. Belay, C. Kozyrakis, and E. Bugnion, “Energy proportionality and workload consolidation for latency-critical applications”, in Proceedings of the Sixth ACM Symposium on Cloud Computing, ser. SoCC ’15, Kohala Coast, Hawaii: ACM, 2015, pp. 342–355, isbn: 978-1-4503-3651-2. doi: 10.1145/2806777.2806848. [Online]. Available: http://doi.acm.org/10.1145/2806777.2806848 (cit. on p. 2).

[3] Memcached – a distributed memory object caching system, http://memcached.org, 2015 (cit. on pp. 2, 8, 17).

[4] Node.js, https://nodejs.org/, 2015 (cit. on p. 2).

[5] T. Capan. (2013). Why the hell would I use Node.js? A case-by-case tutorial, [Online]. Available: http://www.toptal.com/nodejs/why-the-hell-would-i-use-node-js (visited on 07/02/2015) (cit. on pp. 2, 10).

[6] R. Paul. (2012). A behind-the-scenes look at LinkedIn's mobile engineering, [Online]. Available: http://arstechnica.com/information-technology/2012/10/a-behind-the-scenes-look-at-linkedins-mobile-engineering/2/ (visited on 07/14/2015) (cit. on pp. 2, 10).

[7] (Jun. 2015). Nodejs, [Online]. Available: http://nodejs.org (cit. on p. 3).

[8] Libuv, https://github.com/libuv/libuv (cit. on pp. 3, 18, 19).

[9] V8 JavaScript engine, https://code.google.com/p/v8/, 2015 (cit. on pp. 3, 10, 19).

[10] A. S. Tanenbaum, Modern Operating Systems, 3rd. Upper Saddle River, NJ, USA: Prentice Hall Press, 2007, isbn: 9780136006633 (cit. on pp. 5, 6).

[11] D. R. Engler, M. F. Kaashoek, and J. O’Toole, “Exokernel: An Operating System Architecture for Application-Level Resource Management.”, in SOSP95, 1995, pp. 251–266 (cit. on pp. 6, 13, 14, 45).


[12] A. Belay, A. Bittau, A. Mashtizadeh, D. Terei, D. Mazières, and C. Kozyrakis, “Dune: safe user-level access to privileged CPU features”, in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), USENIX, 2012, pp. 335–348 (cit. on pp. 6, 13, 14).

[13] J. Dean and L. A. Barroso, “The tail at scale”, Commun. ACM, vol. 56, no. 2, pp. 74–80, Feb. 2013, issn: 0001-0782. doi: 10.1145/2408776.2408794. [Online]. Available: http://doi.acm.org/10.1145/2408776.2408794 (cit. on pp. 7, 36).

[14] B. Atikoglu, Y. Xu, E. Frachtenberg, S. Jiang, and M. Paleczny, “Workload analysis of a large-scale key-value store”, in Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS ’12, London, England, UK: ACM, 2012, pp. 53–64, isbn: 978-1-4503-1097-0. doi: 10.1145/2254756.2254766. [Online]. Available: http://doi.acm.org/10.1145/2254756.2254766 (cit. on p. 7).

[15] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica, “Mesos: a platform for fine-grained resource sharing in the data center”, in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI’11, Boston, MA: USENIX Association, 2011, pp. 295–308. [Online]. Available: http://dl.acm.org/citation.cfm?id=1972457.1972488 (cit. on p. 7).

[16] C. Delimitrou and C. Kozyrakis, “Quasar: Resource-Efficient and QoS-Aware Cluster Management.”, in ASPLOS14, 2014, pp. 127–144 (cit. on p. 7).

[17] S. Dhar, “Sniffers, basics and detection”, [Online]. Available: http://www.just.edu.jo/~tawalbeh/nyit/incs745/presentations/Sniffers.pdf (visited on 07/16/2015) (cit. on p. 8).

[18] L. Soares and M. Stumm, “FlexSC: flexible system call scheduling with exception-less system calls”, in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’10, Vancouver, BC, Canada: USENIX Association, 2010, pp. 1–8. [Online]. Available: http://dl.acm.org/citation.cfm?id=1924943.1924946 (cit. on p. 8).

[19] S. Peter, J. Li, I. Zhang, D. R. K. Ports, D. Woos, A. Krishnamurthy, T. Anderson, and T. Roscoe, “Arrakis: the operating system is the control plane”, in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), Broomfield, CO: USENIX Association, Oct. 2014, pp. 1–16, isbn: 978-1-931971-16-4. [Online]. Available: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/peter (cit. on pp. 8, 45).

[20] Y. Rekhter, T. Li, and S. Hares, “RFC 4271: A Border Gateway Protocol 4 (BGP-4)”, IETF, Tech. Rep., 2006. [Online]. Available: www.ietf.org/rfc/rfc4271.txt (cit. on p. 8).


[21] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, “Hypertext transfer protocol – HTTP/1.1”, United States, RFC 2616, 1999. [Online]. Available: http://tools.ietf.org/html/rfc2616 (cit. on p. 9).

[22] J. Patonnier, F. Culloca, A. Pfeiffer, and S. (sboroba). (2015). What is a web server, [Online]. Available: https://developer.mozilla.org/en-US/Learn/What_is_a_web_server (visited on 07/17/2015) (cit. on p. 9).

[23] The Apache HTTP server project, http://httpd.apache.org/ (cit. on p. 9).

[24] (2015). June 2015 web server survey, [Online]. Available: http://news.netcraft.com/archives/2015/06/25/june-2015-web-server-survey.html (visited on 07/17/2015) (cit. on p. 9).

[25] Apache MPM worker, http://httpd.apache.org/docs/2.4/mod/worker.html (cit. on p. 9).

[26] Nginx, http://nginx.org/ (cit. on p. 9).

[27] D. Kegel, The C10K Problem, http://www.kegel.com/c10k.html, 1999 (cit. on p. 9).

[28] The architecture of open source applications (volume 2): nginx, http://www.aosabook.org/en/nginx.html (cit. on p. 10).

[29] I. Lan. (2012). Clearing up some things about LinkedIn mobile's move from Rails to Node.js, [Online]. Available: http://ikaisays.com/2012/10/04/clearing-up-some-things-about-linkedin-mobiles-move-from-rails-to-node-js/ (visited on 08/19/2015) (cit. on p. 10).

[30] J. Li, N. K. Sharma, D. R. K. Ports, and S. D. Gribble, “Tales of the tail: hardware, OS, and application-level sources of tail latency”, in Proceedings of the ACM Symposium on Cloud Computing, ser. SOCC ’14, Seattle, WA, USA: ACM, 2014, 9:1–9:14, isbn: 978-1-4503-3252-1. doi: 10.1145/2670979.2670988. [Online]. Available: http://doi.acm.org/10.1145/2670979.2670988 (cit. on p. 12).

[31] N. Provos and N. Mathewson, libevent: an event notification library, http://libevent.org, 2003 (cit. on p. 17).

[32] Node.js, https://nodejs.org/api/, 2015 (cit. on p. 18).

[33] Libuv design, http://docs.libuv.org/en/v1.x/design.html (cit. on pp. 19, 20).

[34] (Aug. 2015). Nodejs docs: cluster, [Online]. Available: https://nodejs.org/api/cluster.html (visited on 08/11/2015) (cit. on p. 25).

[35] (2005). Address space layout randomization (ASLR), [Online]. Available: https://developer.cisco.com/media/onepk_security_guide/GUID-527CB4BF-B5AC-41A3-92B1-883C09B8730D.html (visited on 07/17/2015) (cit. on p. 34).


[36] (2011). How loading time affects your bottom line, [Online]. Available: https://blog.kissmetrics.com/loading-time/ (visited on 08/19/2015) (cit. on p. 36).

[37] (2012). Average number of web page objects breaks 100, [Online]. Available: http://www.websiteoptimization.com/speed/tweak/average-number-web-objects/ (visited on 08/03/2015) (cit. on p. 37).

[38] Wrk - a HTTP benchmarking tool, https://github.com/wg/wrk (cit. on p. 57).

Appendix A

Resources

This appendix references the locations where the work can be found.

A.1 libuv - IX

The libuv branch with IX support can be found at https://github.com/Lilk/libuv.git.

A.2 Node.js

The Node.js version, with ASLR disabled in its V8 dependency and built against a dynamically loaded libuv, can be found at https://github.com/Lilk/node-ix.


Appendix B

Dialog - High Concurrency Rate Controlled Poisson Distributed Load Generator

B.1 Purpose

Dialog is a tool that helps assess the performance of servers running request-response protocols such as HTTP. It combines the ability to generate a rate controlled (average) load according to a Poisson process^1 with high concurrency (up to thousands of virtual clients per physical client machine). Furthermore, it allows a distributed mode of operation, where the expected load is farmed out over a set of worker machines while the latency measurements are carried out by a single selected machine, to minimise client-side latency in the measurements. One objective of the tool is to measure connection scalability, and the system is therefore implemented as a closed loop, keeping the number of connected (virtual) clients constant for a given parameterised experiment.

B.2 Implementation

Dialog is based on a coroutine execution model, implemented in Go^2. The core module spawns a goroutine for each virtual client, handling exactly one connection per goroutine. The connection control routine randomises a waiting time between consecutive requests according to an exponential distribution, achieving a Poisson process with the expected rate. Scheduling of goroutines between cores and upon I/O is performed by the Go runtime. Furthermore, each virtual client keeps a moving average of its own scheduling overhead in order to self-tune its request rate.

Since the load generating problem is embarrassingly parallel, for distributed execution the master divides the target load and number of virtual clients equally over the participating slave machines, keeping for itself only 4 virtual clients and a share of the throughput small enough to minimise client-side latency effects on the measurements. In all cases the master synchronises the measurements with successful connections by all participating slaves, and only starts measuring once all connections are established.

The protocol implementation of Dialog is dependency-injected, which makes it easy to use Dialog as a framework, changing the protocol depending on the service under test. By default Dialog is bundled with two implementations of the HTTP protocol. The first uses the Go network stack "net/http", providing compatibility with a large number of websites. Dialog is also bundled with the SimpleChunkedReader, which provides a barebones implementation that reads only chunked encoding, significantly improving the client-side latency of measurements over the standard library HTTP implementation.

^1 The software could be modified to support any type of distribution where the time between two requests can be expressed as a distribution given a rate parameter λ.
^2 http://golang.org/
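To make the execution model concrete, the following minimal Go sketch shows a goroutine-per-virtual-client loop with exponentially distributed inter-request waits and a dependency-injected protocol. This is not Dialog's actual source: Protocol, rawGet, and runClient are illustrative names, the toy protocol does not drain full responses, and Dialog's self-tuning against scheduling overhead is omitted.

```go
package main

import (
	"bufio"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// Protocol is an illustrative stand-in for a dependency-injected
// protocol layer: one request-response exchange per call.
type Protocol interface {
	Exchange(conn net.Conn) error
}

// rawGet is a toy protocol: it sends a minimal HTTP GET and reads only
// the status line. A real reader must drain the whole response
// (cf. SimpleChunkedReader).
type rawGet struct{}

func (rawGet) Exchange(conn net.Conn) error {
	if _, err := fmt.Fprint(conn, "GET / HTTP/1.1\r\nHost: x\r\n\r\n"); err != nil {
		return err
	}
	_, err := bufio.NewReader(conn).ReadString('\n')
	return err
}

// runClient drives one virtual client over one persistent connection.
// Exponentially distributed waits between requests yield a Poisson
// arrival process with expected rate lambda (requests per second).
func runClient(addr string, lambda float64, p Protocol, results chan<- time.Duration) {
	conn, err := net.Dial("tcp", addr)
	if err != nil {
		return
	}
	defer conn.Close()
	for {
		// rand.ExpFloat64 returns an Exp(1) variate; dividing by lambda
		// rescales it to mean 1/lambda seconds between requests.
		time.Sleep(time.Duration(rand.ExpFloat64() / lambda * float64(time.Second)))
		start := time.Now()
		if err := p.Exchange(conn); err != nil {
			return
		}
		results <- time.Since(start)
	}
}

func main() {
	const nClients, lambda = 256, 40.0 // illustrative parameters
	results := make(chan time.Duration, 4096)
	for i := 0; i < nClients; i++ {
		go runClient("127.0.0.1:8080", lambda, rawGet{}, results)
	}
	for lat := range results {
		fmt.Println(lat.Microseconds(), "µs") // aggregate into a histogram in practice
	}
}
```

Because every virtual client holds one persistent connection and issues its next request only after the previous response arrives, the sketch is closed-loop in the same sense as Dialog, keeping the number of connected clients constant.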

B.3 Evaluation

All tests in this section were carried out using hardware and a test setup identical to those in chapter 5, unless otherwise described.

First, table B.1 shows how separating latency probing and load generation onto different physical machines affects latency measurements. The server used in table B.1 is a single-worker Node.js server running a Hello World application. Comparing the two rows: the first row, depicting the scenario with no auxiliary load generating machines, exhibits a high client-side induced latency on the probing machine (which has to generate all load) that dominates the latency cost. The second row shows average latency reduced by 13.3× and the 99th percentile by 7×, demonstrating the importance of measuring latency from an unloaded physical machine.

Configuration    # Virtual Clients   TP (req/s)   AVG lat. (µs)   99th pp. lat. (µs)
1 probe + 0 LG   256                 9863.7       25968           28434
1 probe + 4 LG   256                 10047.3      1946            4060

Table B.1: Dialog: separation between latency measurement (probe machine) and load generation (LG machines). Node.js server.

Secondly, in table B.2, we show how our minimal HTTP reader, SimpleChunkedReader, outperforms the standard library's more complete implementation. As seen in the first two rows, the difference is visible in non-loaded cases, where the server responds fast and client-side overhead therefore shows up in the measurements. Comparing rows 3 and 4 shows the server at a point of saturation, where server-side latency dominates; consequently no meaningful difference can be observed in this case.

HTTP reader           # Virtual Clients   TP (req/s)   AVG lat. (µs)   99th pp. lat. (µs)
golang net/http       16                  5001.8       657             2255
SimpleChunkedReader   16                  5002.8       423             1196
golang net/http       256                 10195.1      10139           26948
SimpleChunkedReader   256                 10271.9      9743            26837

Table B.2: Dialog: golang net/http stack vs SimpleChunkedReader. Node.js server.

To evaluate Dialog, we compare it to wrk [38], the nginx project's load testing tool. We compare the minimal achievable latency of both tools against a common server, as well as the achievable throughput. In order to show that Dialog is not the bottleneck when benchmarking Node.js applications, we test both Dialog and wrk against a higher performing, horizontally scaled web server written in Go using goroutines and its standard net/http library. The results can be seen in table B.3. Note that wrk determines maximal throughput, whereas Dialog tries to maintain a predefined global throughput, which makes the comparison harder to reason about.

Client          # Virtual Clients   TP (req/s)   AVG lat. (µs)   99th pp. lat. (µs)
Dialog: 1 + 0   1                   8627         116             149
Dialog: 1 + 4   1                   8500         118             151
wrk             1                   10080        101             399
Dialog: 1 + 0   8                   46770        170             666
Dialog: 1 + 4   8                   39815        172             603
wrk             8                   67928        132             561
Dialog: 1 + 0   128                 162868       798             2525
Dialog: 1 + 4   128                 201831       602             2353
wrk             128                 185787       765             2520
Dialog: 1 + 0   512                 186739       2641            5911
Dialog: 1 + 4   512                 246706       1542            5255
wrk             512                 238770       2200            5610
Dialog: 1 + 0   4096                214921       18144           45203
Dialog: 1 + 4   4096                277377       6202            34472
wrk             4096                289575       14700           36220
Dialog: 1 + 0   16384               195478       76243           800383
Dialog: 1 + 4   16384               230700       11125           259100
wrk             16384               232865       112139          872650

Table B.3: Dialog vs wrk. golang server.

Comparing minimal latency, we observe that Dialog exhibits approximately 15% higher minimum average latency in the case of a single connection. The throughput achieved by wrk is also higher; both observations suggest client-side inefficiencies in Dialog compared to wrk. Dialog shines at higher connection counts with distributed load: at 16384 concurrent connections, for an almost identical throughput, we measure an average latency an order of magnitude lower, again reinforcing the need to measure latency on a machine separate from load generation. For the non-distributed version we observe roughly 20% higher latency or lower throughput. We have earlier motivated the need for Dialog: to measure latency and latency distribution for workloads with high connection counts, running at throughput levels below saturation, under realistic arrival processes. Wrk does not provide this functionality. Through this comparison, we have shown that the two load generators perform comparably, with a 20% advantage for wrk in the single-node case. Moreover, both provide ample performance to not act as a bottleneck for the tests performed in chapter 5.

B.4 Resources

Dialog can be found at https://github.com/Lilk/dialog.
