
ink - An HTTP Benchmarking Tool

Andrew J. Phelps

Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science & Application

Godmar V. Back, Chair
Ali R. Butt
Denis Gracanin

May 11, 2020
Blacksburg, Virginia

Keywords: Networking, Benchmarking, HTTP, Distributed Systems

Copyright 2020, Andrew J. Phelps

ink - An HTTP Benchmarking Tool

Andrew J. Phelps

(ABSTRACT)

The Hypertext Transfer Protocol (HTTP) is one of the foundations of the modern Internet. Because HTTP servers may be subject to unexpected periods of high load, developers use HTTP benchmarking utilities to simulate the load generated by users. However, many of these tools do not report performance details at a per-client level, which deprives developers of crucial insights into a server’s performance capabilities.

In this work, we present ink, an HTTP benchmarking tool that enables developers to better understand server performance. ink provides developers with a way of visualizing the level of service that each individual client receives. It does this by recording a trace of events for each individual simulated client. We also present a GUI that enables users to explore and visualize the data that is generated by an HTTP benchmark. Lastly, we present a method for running HTTP benchmarks that uses a set of distributed machines to scale up the achievable load on the benchmarked server.

We evaluate ink by performing a series of case studies to show that ink is both performant and useful. We validate ink’s load generation abilities within the context of a single machine and when using a set of distributed machines. ink is shown to be capable of simulating hundreds of thousands of HTTP clients and presenting per-client results through the ink GUI. We also perform a set of HTTP benchmarks where ink is able to highlight performance issues and differences between server implementations. We compare servers like NGINX and Apache and highlight their differences using ink.

ink - An HTTP Benchmarking Tool

Andrew J. Phelps

(GENERAL AUDIENCE ABSTRACT)

The World Wide Web (WWW) uses the Hypertext Transfer Protocol (HTTP) to send web content such as HTML pages or video to users. The servers providing this content are called HTTP servers. Sometimes, the performance of these HTTP servers is compromised because a large number of users request documents at the same time. To prepare for this, server maintainers test how many simultaneous users a server can handle by using benchmarking utilities. These benchmarking utilities work by simulating a set of clients. Currently, these tools focus only on the number of requests that a server can process per second. Unfortunately, this coarse-grained metric can hide important information, such as the level of service that individual clients received.

In this work, we present ink, an HTTP benchmarking utility we developed that focuses on reporting information for each simulated client. Reporting data in this way allows the developer to see how well each client was served during the benchmark. We achieve this by constructing data visualizations that include a set of client timelines. Each of these timelines represents the service that one client received.

We evaluated ink through a series of case studies. These focus on the performance of the utility and the usefulness of the visualizations produced by ink. Additionally, we deployed ink in Virginia Tech’s Computer Systems course. The students were able to use the tool and took a survey pertaining to their experience with the tool.

Acknowledgments

First, I would like to thank my advisor, Dr. Back, for helping me perform this research. He has been of immense help over the last year, and he has been a resource of knowledge for me. He has spent a great amount of his own time assisting me, and his guidance has helped me complete this research and this document. I would also like to thank the rest of my committee, Dr. Butt and Dr. Gracanin, for the insights that they provided on this work. They both made suggestions that improved the quality of this research. Specifically, I’d like to thank Dr. Gracanin for taking the time to go through this entire thesis with me.

Contents

List of Figures vii

List of Tables ix

1 Introduction 1

1.1 Testing for Performance ...... 1

1.2 Proposed Solution ...... 3

1.3 Contributions ...... 6

1.4 Roadmap ...... 7

2 Background Information 8

2.1 Transmission Control Protocol ...... 8

2.2 Hypertext Transfer Protocol ...... 13

2.3 HTTP Server Concurrency Models ...... 17

2.4 Connection Management ...... 22

2.5 HTTP Benchmarking ...... 26

3 Design and Implementation 32

3.1 Load Generation ...... 34

3.2 Distributed Benchmarking Manager ...... 38

3.3 ink API ...... 41

3.4 ink GUI ...... 44

4 Evaluation 58

4.1 Goals and Methodology ...... 58

4.2 Assessing ink’s Load Generation Ability ...... 60

4.3 Evaluating Server Performance with ink ...... 69

4.4 Survey Results ...... 83

5 Related Work 86

5.1 Assessing Server Quality and Performance ...... 86

5.2 Visualization Techniques ...... 90

6 Future Work 92

6.1 Load Generation Improvements ...... 92

6.2 GUI Improvements ...... 92

7 Conclusion 94

Bibliography 96

List of Figures

2.1 Example of HTTP clients experiencing different levels of service, leading to misleading data points ...... 30

3.1 High level overview of ink’s architecture ...... 33

3.2 Brewer’s red-to-green divergent color scheme [1] ...... 46

3.3 100 client moving averages rendered using colored line segments ...... 49

3.4 1,000 concurrent clients rendered as 200 timelines ...... 51

3.5 1,000 concurrent clients rendered as 50 timelines ...... 52

3.6 Image of the dashboard of alternative data visualizations. Visualizations are based on the same dataset as Figure 3.3 ...... 54

3.7 2,000 clients grouped by physical machine origin ...... 57

4.1 Benchmarking with 43,520 simulated clients and 17 physical clients ..... 62

4.2 Benchmarking with 87,040 simulated clients and 17 physical clients ..... 63

4.3 Benchmarking with 174,080 simulated clients and 17 physical clients .... 64

4.4 Benchmarking with 348,160 simulated clients and 17 physical clients .... 65

4.5 Requests-per-second generated by NGINX under varying levels of load ... 66

4.6 Average request latency observed by clients benchmarking NGINX ...... 67

4.7 Histogram of latencies from benchmark with 348,160 connections and 17 physical clients ...... 68

4.8 File descriptors allocated by HTTP server during 60 second benchmark ... 70

4.9 An ink report that indicates that only half of clients were served. Observed TCP connections are marked by black dots...... 71

4.10 Memory usage of HTTP server during 60 second benchmark ...... 72

4.11 Benchmark with 15,000 clients released in waves of 5,000 clients every 30 seconds ...... 74

4.12 CPU usage of pre-fork Apache HTTP server with 1,000 concurrent connections 77

4.13 Network usage of pre-fork Apache HTTP server with 1,000 concurrent connections ...... 78

4.14 CPU usage of NGINX HTTP server with 1,000 concurrent connections ... 79

4.15 Network usage of NGINX HTTP server with 1,000 concurrent connections . 80

4.16 Client timelines of pre-fork Apache HTTP server with 1,000 concurrent connections ...... 81

4.17 Client timelines of NGINX HTTP server with 1,000 concurrent connections 82

List of Tables

2.1 Comparison of load generated on NGINX by popular HTTP benchmarking tools ...... 28

4.1 Comparison of RPS generated by original wrk and modified wrk ...... 61

Chapter 1

Introduction

Society is becoming more and more reliant on web-based services. Websites, mobile applications, and headless systems are all connected to the Internet. A large number of these services are built upon the Hypertext Transfer Protocol (HTTP), which is the most popular application layer protocol used on the Internet today. HTTP is far from its simple origins of transferring markup files; now, complex, critical systems use HTTP to transfer important data. For this reason, it is important that the performance of these services can be tested adequately. If critical services, such as government health care services [2], run on top of HTTP, developers need to understand the impact that a sudden surge of users can have on their application. Crisis situations can lead to unforeseen spikes in the load being put on web servers [3]. Even applications that were developed with a smaller target audience in mind can be subject to heavy load if a “Slashdot Effect” [4] occurs, in which an unprepared HTTP server is accessed by thousands of users in an instant.

1.1 Testing for Performance

To help prepare for situations that result in an unexpected influx of traffic, developers have created specialized benchmarking tools that can simulate a workload with a large number of concurrent clients. These benchmarks assess a server’s performance and report a set of statistics indicative of how well the server performed during the test. This information can help predict how well the server might perform in a real-world situation. Ideally, testing like this will help guide developers to write and configure performance-conscious and scalable servers.

The question of scalability of Internet services was highlighted with the introduction of the C10K problem by Dan Kegel in 1999 [5]. Kegel discusses a set of methods for enabling a single server to handle 10,000 concurrent connections, a large number at the time. However, as hardware and software have advanced during the following decades, the technical community has shifted towards the C10M problem, the handling of 10,000,000 concurrent connections [6]. As there is a need for servers to support more concurrent clients, there has been work to introduce optimized server software architectures [7, 8].

When benchmarking a server, one should be able to gain a good understanding of its overall performance. The vast majority of benchmarking tools evaluate performance by simulating a set of clients that send a large number of HTTP requests to a server and recording the time that each corresponding response takes. Following this, the tool reports the collected information to the user in summary form. This information is generally composed of the average response latency, the average throughput, and a summary of errors encountered.

1.1.1 Shortcomings of Current Tools

Currently available benchmarking tools such as wrk [9] and JMeter [10] have multiple shortcomings. They may fail to characterize poor server performance because data aggregation can hide some important details. For example, consider a server that is only able to handle a limited number of persistent connections simultaneously due to operating system imposed restrictions. If a benchmark attempted to create more connections than the server could handle, it should report errors for connections that are not handled. Some benchmarking tools, however, ignore the failed connections altogether. Other tools report an aggregate number of requests that timed out, but do not make clear that some clients did not receive any service at all. It is therefore impossible to fully understand the performance of this server using these tools. Since errors and timeouts are not reported from the perspective of individual clients, there is no way to assess this server’s lack of fairness. In addition, a server that exhibits this behavior may still appear to perform well with respect to the average latency and throughput reported by these tools. A benchmark that indicates high throughput, low latency, thousands of successful responses, and a relatively low number of errors would thus appear better than it actually is. Depending on the method of recording errors, the number of errors reported could be very small. A server that only handles connections from a small percentage of clients can achieve extremely low latency and hide the fact that most of the clients sending requests to the server are never responded to.

Aside from the above-mentioned operating system imposed limits, incorrect server configurations and defenses against denial-of-service (DoS) attacks may also affect a server’s performance. Traditional tools are not able to report information that can accurately describe a server’s performance for each of these scenarios.

1.2 Proposed Solution

We introduce ink, an HTTP benchmarking tool that aims to solve these problems by recording and analyzing traces of individual clients. ink provides a graphical representation of a server’s performance that is easy to understand, and it provides more details about the behavior of a server than traditional methods.

In addition to maintaining the macro-level benchmarking capabilities of current benchmarks, ink enables a user to understand the performance that each individual client is receiving and compare performance between individual clients in the benchmark. ink has two main foci:

• Record fine grained results with an emphasis on the service that individual simulated clients received.

• Present the recorded data by using easy to understand graphical representations.

1.2.1 Enabling Per-Client Data Visualizations

It is important for an HTTP benchmark to provide results that allow a user to fully understand the performance of a server, with little ambiguity. A graphical user interface (GUI) is needed to display and allow the interactive exploration of the data that is collected. In order to support operations on large data sets, a back end server supports the offloading of compute-intensive calculations. ink’s GUI displays a set of client timelines. The moving average of each simulated client’s request latencies is recorded over the course of a benchmark. Each value in the moving average is mapped to a color. With this mapping, we can create a timeline that represents the service that one individual client received. ink uses a collection of these timelines to summarize a server’s performance. These timelines are the most important component of ink.
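To make the timeline construction concrete, the following C sketch shows one way a window of per-client request latencies could be averaged and mapped to a discrete color bucket. It is illustrative only and is not ink’s implementation; the function name and threshold values are hypothetical.

#include <stddef.h>

/* Illustrative only -- not ink's code. Map a window of request latencies
 * (in milliseconds) to a color bucket: 0 = best (green) ... 4 = worst (red). */
static int latency_window_to_bucket(const double *latencies_ms, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += latencies_ms[i];
    double avg = (n > 0) ? sum / (double)n : 0.0;

    /* Hypothetical thresholds; a real tool would derive these from the data. */
    if (avg < 10.0)   return 0;
    if (avg < 50.0)   return 1;
    if (avg < 200.0)  return 2;
    if (avg < 1000.0) return 3;
    return 4;
}

A timeline for one client would then be the sequence of bucket values produced as its moving-average window advances over the course of the benchmark.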

In addition to viewing the per-client performance timelines, ink allows a user to interactively explore the data by selecting a range of time to examine further. A user can see the histogram of latencies, average throughput, and various other data points for any period of time during the test. This feature has many use cases, but an example of this would be when the user wishes to compare data points from the beginning of the benchmark to those after the server has reached a steady-state.

1.2.2 Understanding Server Performance

ink’s client timeline visualization enables users to make inferences about a server’s behavior. This information can be used to optimize server performance; however, it also has use cases in areas beyond benchmarking. ink also aims to make debugging a server easier. A utility that can graphically show what HTTP clients are experiencing could be used in an academic setting, allowing students to understand what exactly is happening as a server experiences load. A motivation for this work is the creation of a tool that can be used by students to help them understand their own HTTP server implementations.

ink also provides the ability to monitor performance metrics related to the server environment. In its current configuration, ink monitors the CPU usage, memory usage, allocated file descriptors, and network usage of the target machine. Providing these sets of data to the user can help when attempting to diagnose server performance issues.

1.2.3 Maintaining Benchmark Performance

ink is intended for benchmarking state-of-the-art HTTP servers. To enable this, it must be able to supply sufficiently high load. Therefore, after assessing a number of options, wrk was chosen as ink’s load generator. There are two main reasons for wrk’s performance capabilities. First, wrk uses an event-driven concurrency model that efficiently uses machine resources. wrk also uses the highly optimized HTTP parser that is used by both Node.js and NGINX.

ink also enables the use of a set of distributed machines for benchmarking. This was done because modern HTTP servers are able to handle more concurrent clients than one machine is capable of simulating.

1.3 Contributions

This research sets out to improve the accuracy and reliability of HTTP benchmarking tools and to report individual client performance and experienced errors. It makes the following contributions:

• New approach to HTTP benchmarking practices

We argue that a new approach should be followed when designing an HTTP benchmark. Tools need to be more conscious of the service delivered to each individual client, rather than focusing only on aggregate performance measures.

• Implementation of tool

Following the proposed methods of HTTP benchmarking, ink was developed. The tool provides a comprehensive report of an HTTP server’s performance from a client’s perspective. Data is presented using visualizations that allow users to understand a server’s performance easily.

• Evaluation of tool

We evaluate the ink tool to show that it is robust and useful. To do this, we first show that ink is able to generate the load needed to overload a high performance HTTP server. We also show that ink’s visualization techniques can be applied to large amounts of data.

• Case studies

In addition to showing the performance of ink, we perform a number of case studies that exemplify ink’s abilities. These case studies highlight how ink is able to report low-level details about server performance.

• Tool deployment

ink was deployed for use in Virginia Tech’s Computer Systems course during the Spring 2020 semester. This enabled students to benchmark their personal servers and see the performance impacts of design decisions, such as event-based versus thread-based concurrency models.

• Survey of user base

Following the use of the tool, the students in Virginia Tech’s course were given the opportunity to participate in a survey pertaining to their experience with ink.

1.4 Roadmap

In Chapter 2, we will cover the background information needed to understand this work.

Chapter 3 will cover the design and implementation of ink. We will evaluate ink in Chapter 4. We will also report details of a survey that was taken of the user base in this chapter. In Chapter 5, we will discuss research related to this work, focusing on other HTTP benchmarking tools and methods. Chapter 6 will give a brief overview of possible future work. Finally, Chapter 7 will bring a conclusion to this work.

Chapter 2

Background Information

2.1 Transmission Control Protocol

The Transmission Control Protocol (TCP) is a popular transport layer protocol that many application layer protocols are built upon, including HTTP. In fact, TCP makes up a large majority of all Internet traffic as of 2004 [11]. TCP was first introduced as a protocol to be used by the United States military, where the designers set out to solve the problem of reliably sending data over an unreliable underlying network [12]. This reliability is why TCP is often chosen as the transport layer protocol for higher level applications. An underlying protocol that guarantees delivery greatly simplifies the process of designing an application level protocol. In addition to reliable data transfer, TCP provides congestion and flow control. These prevent the network and the communicating peers from being overloaded with data.

2.1.1 TCP/IP Addressing

TCP is a transport layer protocol that delivers data to a specific process that is running on a specific host. To route data to the host, TCP uses Internet Protocol (IP) addresses [13]. Each public entity connected to the Internet must have at least one unique IP address. An IPv4 address is a 4-byte value, and an IPv6 address is a 16-byte value [14]. This address is used by a series of network routers to deliver data from the source to the destination. Prior to the adoption of IPv6, there were far more users on the Internet than available IP addresses. For this reason, most end-users are connected to the Internet via a router that performs Network Address Translation (NAT). A router that acts as a NAT will have a public IP address that is shared by all devices connected to that router [15].

Once packets have been delivered to the correct host, the data must then be routed to the correct process running on that machine. TCP uses port numbers to deliver the data to the specific process on the host that the sender intended it for. All processes on one machine use the same set of available ports.

Because TCP uses an unsigned 16-bit integer to represent the port number, a machine is limited to 65,536 possible port numbers, which becomes a bottleneck when trying to create a large number of connections originating in one machine for benchmarking purposes, as further discussed in Section 3.2.

2.1.2 Establishing Connections

A TCP connection requires the knowledge of the source port, the destination port, the source IP address, and the destination IP address. This tuple is what defines a TCP connection. A connection is established by using a three-way handshake. After a server opens a listening port, the client can initiate the connection process by sending a first SYN segment. The server responds with a SYN-ACK segment. Lastly, the client responds with a final ACK. SYN and ACK are shortened forms of the terms “synchronize” and “acknowledge,” respectively. A segment is distinguished as a SYN or an ACK by checking if specific bits in the segment have been set to 1. Once this connection has been established, both client and server are free to send arbitrary data to each other. As part of the three-way handshake, client and server exchange initial sequence numbers, which makes it unlikely that delayed segments from old connections are mistaken for new connection attempts [16]. The handshake also helps prevent connection hijacking.

2.1.3 Guaranteeing Delivery and Ordering

In order to provide reliable data transfer, TCP requires that all data that is sent be responded to with an acknowledgment. If the sender does not receive an ACK within a specified amount of time, then the segment will be resent to the destination.

The TCP protocol carefully calculates the amount of time to wait before retransmitting data [17]. The protocol uses the round-trip time (RTT) in this calculation to limit the amount of data that is unnecessarily retransmitted while the ACK is still in transit. Initially, the retransmission timeout (RTO) is set to 1 second. Then, the protocol uses RTT estimation to adjust the RTO. For slower networks, the RTO will be increased to minimize unnecessary retransmissions. Following a retransmission, the RTO is doubled.
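For reference, the standard retransmission timer (RFC 6298) maintains a smoothed RTT estimate and an RTT variance and derives the RTO from them. The formulas below restate that standard and are not taken from this thesis:

\begin{align*}
\mathit{RTTVAR} &\leftarrow (1-\beta)\cdot\mathit{RTTVAR} + \beta\cdot\lvert \mathit{SRTT} - R' \rvert\\
\mathit{SRTT}   &\leftarrow (1-\alpha)\cdot\mathit{SRTT} + \alpha\cdot R'\\
\mathit{RTO}    &\leftarrow \mathit{SRTT} + \max(G,\ K\cdot\mathit{RTTVAR})
\end{align*}

where $R'$ is the latest RTT measurement, $G$ is the clock granularity, and the recommended constants are $\alpha = 1/8$, $\beta = 1/4$, and $K = 4$.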

In order to maintain delivery order and identify lost segments, TCP embeds a sequence number in each segment. A sequence number is needed to do this because TCP is a sliding window protocol. Upon first connection, both client and server choose a random number to be their initial sequence numbers (ISN). With each byte that is transferred over the connection, the sequence number is increased. The receiver of the data uses this sequence number to ACK all data that was received up to that number. TCP also buffers data that is received to account for networks that may deliver segments out of order. Sequence numbers allow for the correction of out-of-order delivery of IP packets.

2.1.4 Congestion and Flow Control

TCP congestion control works to reduce the load on the network that is transporting the data [18]. This is done by assessing the segment loss rate of the network that is being used. After a connection is established, a small initial window size is chosen. This window size determines the number of full-sized segments that can go unacknowledged before requiring an ACK. As segments are successfully delivered, the window size is increased exponentially. Once the window size reaches a threshold, the window size increases linearly. Upon segment loss, the window size is reset to its initial value and the maximum threshold is decreased. Packet loss implies that the network between the sender and receiver is congested; resetting the window size and reducing the maximum threshold are both attempts to prevent this from happening again.

TCP flow control alleviates pressure on the receiver [18]. The receiver publishes the amount of data that it is willing to receive. The sender will send no more than this amount of data and must wait for an acknowledgment before sending any more data. The amount of data that the receiver advertises as being able to handle may be less than the maximum window size allowed by the congestion control mechanism.

2.1.5 TCP SYN Cookies

The Linux implementation of TCP uses a backlog of pending connections. These connections have only completed the first part of the three-way handshake. The size of this backlog is configurable by the user, and its value directly affects the behavior of the TCP implementation. As clients connect to the TCP server, the connections are placed into the backlog. As the server accepts these clients, they are removed from the backlog.

TCP SYN flood attacks are a specific type of Denial of Service (DoS) attack that is directed at the resources required to maintain the TCP backlog [19]. In this attack, attackers flood the server with connection attempts; however, the connections come from forged addresses. Since the attacker uses a forged IP address, the server’s SYN-ACK responses never reach the attacker. Since a TCP server needs to maintain state for all partially established connections in the backlog, an attacker can cause the server to run out of resources. Additionally, the server will continue to resend the SYN-ACK, since the client never responds with the final ACK. Commonly, a server will fill the in-kernel queue that contains partially established connections. Once this queue is full, no other clients can connect to the server.

To thwart this attack, TCP SYN cookies were developed [20]. This defense is used when a client attempts to connect to a TCP server and the server’s backlog is full. TCP SYN cookies allow the server to not maintain any state for a connection until the connection has been fully established. This is achieved by carefully crafting the ISN of the first SYN-ACK sent from the server. The Linux implementation of SYN cookies calculates the ISN by using a random secret value. When the client responds with the third part of the handshake, the server can mathematically verify that the ISN is valid. Resources are no longer allocated upon establishing a connection, but only once the client responds to the server with the final ACK in the TCP handshake.

TCP SYN cookies may mislead clients into believing that a server accepted their connection when in fact it did not. Benchmarking utilities should report failed connections; however, in the current version of the Linux kernel, TCP SYN cookies are sent back to the client when the TCP backlog is full. Thus, the server cannot rely on the TCP backlog to deny clients when the server is under load. Therefore, a benchmark must infer that a connection has failed because of inactivity.

2.2 Hypertext Transfer Protocol

HTTP was first developed in the 1990s as an application level protocol that allows for the transfer of arbitrary data between two entities over the Internet [21]. The protocol is simple and stateless, making it ideal for situations where servers must provide data to a large number of clients. Since the introduction of HTTP, the protocol has changed and grown significantly, without losing its basic request/response characteristics.

2.2.1 Request/Response Pattern

The HTTP protocol follows a request/response pattern. This was introduced in the first version of HTTP, and has been maintained in all iterations [21]. Clients issue a request to the HTTP server for some resource and the server will respond with that resource. An example request and response can be seen in Listing 2.1 and Listing 2.2.

Listing 2.1: Example HTTP Request

GET /example.html HTTP/1.0
Host: www.example.com

Listing 2.2: Example HTTP Response

HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 27

Example HTML Page

HTTP requests must specify a method. The methods defined by HTTP/1.0 are GET, HEAD, and POST. HTTP/1.1 defined more methods, some of which are PUT, OPTIONS, and DELETE. The HEAD and OPTIONS methods allow the client to obtain information about the requested resource. Methods like GET, POST, PUT, and DELETE are used by the client to interact with the resources that the server manages.

HTTP requests and responses both include a message header. The header includes fields that relay metadata information between the server and client, such as the Content-Type and Content-Length fields in Listing 2.2. This metadata can influence server and client behavior. The If-Modified-Since field is an example of this. If a client sets this field to a specific time, and the server sees that the requested data has not been changed since that time, the server can choose not to respond with the requested data. In this case, the client can use a locally cached copy of the data.

HTTP includes a status code in each response to indicate if the request was successful or unsuccessful. In Listing 2.2, status code 200 is used to indicate that the request was successful. Status codes are placed into 5 major groups, indicated by their first digit. Each group’s meaning is listed here, as defined by the HTTP/1.0 specification:

• 1XX - Informational

• 2XX - Successful

• 3XX - Redirection

• 4XX - Client Error

• 5XX - Server Error

2.2.2 HTTP Connections

Most versions of HTTP were designed to be built on top of a reliable transport protocol. This means that HTTP is not tied to a specific protocol; however, the protocol must guarantee that all data is delivered. This is because HTTP does not provide any retransmission mechanism. For the reasons that are described throughout Section 2.1, TCP is the protocol that is most commonly used for HTTP connections. Each version of HTTP differs in how these underlying connections are managed.

2.2.3 HTTP/1.0

HTTP/1.0 was the first major version of the HTTP protocol [21]. It was predated by what is now called HTTP/0.9, which only supported file requests without the inclusion of metadata or error codes. HTTP/1.0 expanded upon this by defining header fields, request methods, and status codes.

The HTTP/1.0 protocol establishes a TCP connection for each request and closes that connection following the completion of the response. This incurs the overhead of the TCP three-way handshake for each request and can be very inefficient for clients that request multiple resources in quick succession.

2.2.4 HTTP/1.1

The HTTP/1.1 specification added a very important performance enhancing feature to the standard: persistent connections [22]. In past versions of HTTP, the connection between client and server was closed after each request was served. In modern web browsing, HTTP requests rarely come alone, due to the way that HTML embeds objects. As of writing, the median number of requests per web page is 74 [23]! HTTP/1.1 uses persistent connections so that request patterns like this can be handled more efficiently. This has been shown to lead to significant performance improvements over HTTP/1.0 [24]. After HTTP/1.1 was introduced, persistent connections were enabled in HTTP/1.0 through the use of the Connection: keep-alive header field.

These persistent connections improved performance for individual clients; however, this change required that servers be prepared to handle a high number of concurrent clients. HTTP servers that support HTTP/1.1 maintain a persistent TCP connection for each client, which requires some method to manage the associated resources. For example, the Apache Web Server, a popular HTTP server, defaults to closing an idle persistent connection after 5 seconds to free up system resources [25].

HTTP/1.1 introduced request pipelining, which allows a client to send multiple requests to the server over the same TCP connection without waiting for a response from the server. The server must respond to these requests in the order that they were sent. However, many modern browsers have either disabled or completely discontinued support for the feature [26, 27]. The theoretical performance gains of pipelining were not being met, and issues with head-of-line blocking led to slow response times. Most web browsers employ the tactic of opening multiple TCP connections with an HTTP server to send simultaneous HTTP requests. For example, Google Chrome uses up to 6 simultaneous connections [28]. The original HTTP/1.1 specification recommended that clients should open at most 2 TCP connections [22]. In later revisions, this suggestion has been replaced by a more lenient, and vague, suggestion: “A client ought to limit the number of simultaneous open connections that it maintains to a given server” [29].

2.2.5 HTTP/2

The next iteration of HTTP is HTTP/2. This version of the protocol was published in 2015, and brings along with it a set of features that enable it to be an even more performant protocol than the previous versions [30]. One of the headline features of HTTP/2 is request multiplexing; this improvement allows clients to successively send multiple requests to the server over one TCP connection without waiting for a response. The server can respond to these requests in any order. The major benefit of this is that the client will not block on one request with a high latency. The protocol allows for this arbitrary order by assigning a unique “stream identifier” to request/response pairs. Since each identifier is unique, streams do not collide. Data sent in a stream is further broken down into frames. A frame is a compressed binary representation of either the message headers or the message payload.

Multiplexing is a feature that addresses the issues with pipelining in HTTP/1.1. In addition to enabling better performance for the user, servers are subject to less load because clients no longer need to simulate multiplexing by establishing multiple TCP connections.

HTTP/2 introduces other features as well. Firstly, HTTP/2 no longer uses a plain-text data format for the header, as the three prior revisions did. A binary format is used that allows for more efficient transfer of data. Header fields can now be compressed. Lastly, servers are able to push messages to the client without the client issuing a request. This feature is built on top of multiplexing.

2.3 HTTP Server Concurrency Models

Different concurrency models are a highly debated topic [31, 32, 33]. Early versions of the conversation were centered around operating system design [34]. Since HTTP servers are naturally concurrent applications, different HTTP server implementations showcase the differences between concurrency models. Thus, web servers are now often the marquee example of a concurrent application.

When a client accesses an HTTP server, some of the operations that the server must perform may not complete immediately. Examples of such operations include waiting for the client to send the request after connecting, accessing the requested data, and responding to the client. For a server to handle multiple concurrent clients, the server must retain the ability to keep multiple of these operations in progress at the same time. That is where the debate between concurrency models arises, as there are multiple ways to handle this problem, each of which has its own advantages and disadvantages. Two of the most widely used models of concurrency are the event-based model and the thread-based model.

2.3.1 The Problem of Maintaining State

Since an HTTP server must be able to handle many simultaneous clients, and serving a client may require that the server perform an operation that may block, the server must employ some strategy for performing other operations while some operations are blocked. Threads solve this problem by maintaining state for the developer. A thread’s stack is maintained in the same way that a single threaded program’s stack would be maintained. Thus, by supplying a thread to perform an operation, a client’s state is maintained across a blocking operation on the thread’s stack. In a thread-based design, this thread will not perform any other operations until the blocking function has returned.

Traditionally, an event-driven design will only use one thread. Instead of indiscriminately performing blocking operations, this thread will only perform an operation if it will not block. If the attempted operation would block, an event-driven system will perform other operations while waiting for the blocking operation to become ready. A system that implements this behavior can efficiently use system resources. However, it often comes at the price of increased program complexity. In order to connect the state of an application to specific event notifications, the programmer will have to manually manage state. For example, if a socket has become readable, the application may need to know with which client the socket is associated. This is often done by constructing continuations. A continuation is a data structure that groups together information that will be needed to complete the processing of an event. In the example of a socket becoming readable, this data could be identifying information about the client, along with the function that will read from the socket. This style of managing state is called “stack-ripping” [35]. It is aptly named because variables that would traditionally be on the stack are copied onto the heap so that they may persist between function calls.
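As an illustration of the continuation pattern described above, consider the following C sketch. It is not taken from any particular server; the structure name, fields, and buffer size are hypothetical. State that would otherwise live on a thread’s stack is kept on the heap so that it survives between event notifications.

#include <stddef.h>
#include <unistd.h>

/* Hypothetical continuation: groups the state needed to resume handling
 * one client when its socket becomes readable again. */
struct continuation {
    int    client_fd;                            /* socket associated with this client */
    char   buf[8192];                            /* partially received HTTP request    */
    size_t bytes_read;                           /* how much of buf is filled so far   */
    void (*on_readable)(struct continuation *c); /* handler the event loop will call   */
};

/* Called by the event loop when client_fd becomes readable. */
static void read_more(struct continuation *c)
{
    ssize_t n = read(c->client_fd, c->buf + c->bytes_read,
                     sizeof(c->buf) - c->bytes_read);
    if (n > 0)
        c->bytes_read += (size_t)n;  /* parsing resumes once a full request arrives */
}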

2.3.2 Thread-Based Concurrency

Most thread-based HTTP server implementations follow the same pattern: a thread of execution is supplied to handle each connection. Note that this thread of execution is not necessarily a thread supplied by the operating system. Some languages provide user-level threads that are abstractions on top of operating system threads, and other HTTP servers spawn a process per HTTP connection [36]. This thread is executed until completion and without interruption. It should be noted here that threads can be preempted by the operating system, but the server should be ignorant of that fact.

Manual state management is difficult. Using a thread-based model can reduce the cognitive load of this on the developer [31]. This is because the operating system or language runtime manages the state of a thread for the programmer.

One of the main criticisms of the thread-based concurrency model is its apparent performance limitations. Proponents of the model argue that simple changes to threading implementations can help thread-based models achieve the performance of event-based systems. Even further, there are claims that thread-based and event-based models have the same performance limits; thus, threads must be the obvious choice because “threads provide a more natural abstraction” [31].

Compiler and language features are also suggested as a solution to the other main negative characteristic of threads: synchronization can be difficult and is prone to error. VonBehren et al. refer to nesC, a language that attempts to eliminate data races. A more recent example of a language that enforces thread safety is Rust [37]. Rust eliminates data races through the use of very sophisticated compile-time checking. However, this system places a greater burden on the developer due to its strictness. In 2018, the designers of Rust released a survey of their user base. Over 20% of Rust developers, ranging from new to very experienced users, felt as if they were unproductive with the language [38]. These survey results could imply that it is difficult to write programs that are guaranteed to be correct with threads.

2.3.3 Event-Based Concurrency

In contrast to the model of having a thread of execution for each request, the event-based paradigm makes use of only one thread. This thread runs in what is traditionally called an event-loop, and events are processed as they occur. In the context of an HTTP server, examples of events could be that a connection has been established or that a new HTTP request is ready to be processed. An event handler is assigned to each type of event and is executed to process each event.

Interestingly, proponents of event-based concurrency argue that events are easier to program with than threads [32]. In addition to this, they argue that the abstractions provided by events are more natural. Both of these statements directly conflict with the claims that proponents of a thread-based design make. Threads require explicit synchronization, which is difficult to achieve while still maintaining performance; due to the single-threaded nature of events, there is no need for synchronization. Other proposed benefits of events are easier debugging, portability, and simplicity.

Ousterhout acknowledges that threads should be used in some cases, but only where they are absolutely necessary. These include use cases like scientific calculations and high-performance servers.

2.3.4 Theoretical Equivalence

Some authors suggest that these two models are theoretically equivalent [34]. Firstly, it is established that there is a mapping between the functionality of the two systems. Lauer and Needham state that each characteristic of the two models can be mapped to a characteristic in the other model. Lauer and Needham go on to postulate that the theoretical performance of the two systems can be identical; however, this would require extensive modification of the underlying OS that the applications were running on.

2.3.5 Modern Examples of Concurrency in Practice

As of March 2020, the Apache Web Server and NGINX are the two most popular HTTP servers available [39]. However, the concurrency models of the two servers differ drastically. By default, Apache is configured to handle each request in its own process. This is an example of a thread-based concurrency model. All requests are handled independently, and the state of each connection is managed by that process. Apache relies on the operating system’s scheduler to switch between the currently running processes. In contrast to this, NGINX implements a fully event-driven concurrency model. NGINX spawns a limited number of worker threads that each handle connections concurrently. The state of each connection is kept in an in-memory data structure. As events fire, an NGINX worker thread will handle them. Through one iteration of its event loop, an NGINX worker thread may read from and write to multiple sockets. In the performance tests that are done in Section 4.3.3, NGINX outperformed Apache in all metrics.

Node.js is a JavaScript runtime, and Go is a language that leverages its runtime to provide thread-like abstractions on top of an event-driven scheduler [40, 41]. Both of these technologies use events to allow for efficient use of machine resources. Node.js uses a callback-based API, which allows for all operations to be non-blocking. When a developer wants to perform an operation that might block, they will register a function that will be called when that operation completes. Node.js uses an event library to monitor file descriptors and call the registered function when the operation is complete. In contrast to Node.js callbacks, the Go runtime maintains state for Go applications. User-level threads, called “goroutines,” are multiplexed on top of threads supplied by the operating system. Whenever a goroutine would block in a system call, the Go scheduler will instead resume the execution of a goroutine that is ready, if any.

2.4 Linux Connection Management

In the Linux operating system, an API is provided that allows a user to interface with the Linux implementation of the TCP protocol. This API exposes a set of functions which enables a user to create TCP sockets and perform operations on the sockets such as reading and writing. Here, the connection between this API and the TCP state machine will briefly be covered. 2.4. Linux Connection Management 23

2.4.1 Server API Interactions

Listed here is a subset of the functions that the Linux kernel provides for TCP communications: socket(), bind(), listen(), accept(), write(), read(), close() [42]. To provide the basic functionality of a TCP server, a program must make use of all of these functions.

All sockets are represented to a user as file descriptors. To the user, these file descriptors are simple integers; the complex data structures that are used to manage a socket are abstracted away from the user. The operating system itself can impose limits on the number of file descriptors that a process is allowed to open. This limit affects the number of sockets, pipes, and files that the process may open. If that limit has been reached, then calls that would allocate a new file descriptor will report an error.

The call to socket() provides the first layer of abstraction that the operating system introduces. This function returns a file descriptor that represents a TCP socket. It should be noted that this function does not directly play a role in the TCP protocol.

A program that intends to act as a server will have to call bind() and listen(). Both of these functions have an effect on the socket that is created by the call to socket(). By calling bind(), the server can choose which IP address and port to bind the socket to. The call to listen() opens the port to incoming connections. This function accepts a parameter that allows a user to set the size of the TCP backlog. According to the POSIX.1-2017 standard, this sets the size of the queue that holds pending connections [42], as previously discussed in Section 2.1.5. At this point, clients may now attempt to connect to the server. The kernel will accept SYN packets that have the address of the bound socket as the destination. Prior to this, the kernel would have ignored packets that were addressed to that port since it was not yet open. To tie this into the TCP protocol, clients are now able to complete the three-way handshake with the server.

For a server to send data to or receive data from a client, the server must call accept(). This function blocks until a client has fully completed the TCP three-way handshake. An interesting behavior of the API that is not immediately obvious is that the call to accept() is not required to fully establish a TCP connection. This is because the kernel handles the entire handshake. This adds a layer of difficulty when declaring a connection as failed in the context of an HTTP server benchmark. A client may perceive a connection as established; however, that connection may still be in the server’s accept queue. accept() must be called for all clients that connect to the server.

Once accept() has been called, both the client and the server are free to make calls to write() and read(). Both of these functions interact with underlying data buffers that contain data waiting to be sent, in the case of write(), and data waiting to be read, in the case of read(). All TCP acknowledgments are handled by the kernel and are transparent to the user. If there is no space in the write buffer, then a call to write() will block. This implies that all of the data that is currently in the buffer has not yet been acknowledged. A call to read() will block if there is no data to be read in the read buffer.
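To make this sequence of calls concrete, here is a minimal sketch of a blocking TCP server that simply echoes whatever it receives. Error handling is omitted, and the port number, backlog, and buffer size are arbitrary illustrative choices rather than values taken from this thesis.

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* socket(): obtain a TCP socket as a file descriptor. */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

    /* bind(): choose the local address and port (here, port 8080 on all interfaces). */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));

    /* listen(): open the port; 128 is the requested backlog size. */
    listen(listen_fd, 128);

    for (;;) {
        /* accept(): blocks until a client completes the three-way handshake. */
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0)
            continue;

        /* read() and write(): exchange data over the established connection. */
        char buf[4096];
        ssize_t n = read(client_fd, buf, sizeof(buf));
        if (n > 0)
            write(client_fd, buf, (size_t)n);

        close(client_fd);
    }
}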

2.4.2 Client API Interactions

TCP clients use the socket(), connect(), read(), write(), and close() calls. As with server sockets, the call to socket() returns a file descriptor that will be used for subsequent socket operations. When a client wishes to connect to the server, connect() must be called. This function allows the user to specify the destination server’s address and port number. This function initiates the three-way handshake, and it will block until the handshake has been completed. As noted previously, the client has no way to know if the server has actually called accept(). Once connect() has been called, a client can make use of read() and write(). These work in the same fashion as previously mentioned.
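A corresponding client-side sketch follows, again with error handling omitted; the loopback address, port, and request string are purely illustrative.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* socket(): obtain a file descriptor for the client side of the connection. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* connect(): specify the server's address and port; this call blocks until
     * the three-way handshake completes (or fails). */
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);
    connect(fd, (struct sockaddr *)&srv, sizeof(srv));

    /* write() and read(): exchange data; here, a minimal HTTP/1.0 request. */
    const char *req = "GET / HTTP/1.0\r\n\r\n";
    write(fd, req, strlen(req));

    char buf[4096];
    read(fd, buf, sizeof(buf));

    close(fd);
    return 0;
}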

2.4.3 Non-Blocking API

Many of the functions provided by the socket API can be used in “non-blocking mode” [42]. When a socket is configured to behave this way, functions that are applied to that socket are guaranteed not to block. In this context, blocking means that a function will not return until the action has been performed. For example, a call to read() will block until there is data to be read from the file descriptor. In non-blocking mode, the function will instead return an error and set errno to EWOULDBLOCK [42]. This indicates to the caller of the function that this operation would have blocked, and the operation must be attempted again. A notable exception in the usage of the non-blocking API is when connect() is called on a non-blocking socket. Instead of setting errno to indicate that the operation should be attempted again, errno is set to a value (EINPROGRESS) that indicates that the connection is in progress. This means that a call to connect() will not block for the entirety of the three-way TCP handshake. Instead, a user will check if the socket is writable. Once the socket becomes writable, the client knows that the handshake has completed.
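The following sketch shows the non-blocking connect pattern just described. The helper function is hypothetical and omits most error handling; the caller would subsequently wait for the returned descriptor to become writable.

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: begin a non-blocking connection attempt and return
 * the socket's file descriptor, or -1 on immediate failure. */
int start_connect(const char *ip, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Put the socket into non-blocking mode. */
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(port);
    inet_pton(AF_INET, ip, &srv.sin_addr);

    /* connect() returns immediately; EINPROGRESS means the three-way handshake
     * is still underway and the socket should be watched for writability. */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0 && errno != EINPROGRESS) {
        close(fd);   /* immediate failure, e.g., network unreachable */
        return -1;
    }
    return fd;
}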

The non-blocking functionality of the API requires the addition of some event notification system. Common examples of these are select(), poll(), and epoll(). These functions provide the ability to watch a set of file descriptors and return to the user the subset of file descriptors that are ready to have an operation performed on them. Thus, a program can efficiently operate on a set of file descriptors by only choosing to interact with ones that are guaranteed not to block. This translates very naturally to interacting with a set of TCP clients. This method of handling multiple connections allows one thread to manage connections with many clients. As an added benefit, real-world time is not spent waiting on I/O; instead, other events can be handled. This maximizes the usage of the CPU.

As various alternative notification systems have been developed, researchers have assessed the performance of each system call [43, 44]. The most modern of the three major contenders, epoll, outperforms the alternatives in situations that induce high concurrency [43]. Due to this, epoll has become the standard event notification call when writing event-driven software for the Linux operating system.
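As a minimal illustration of this style, the sketch below uses epoll to multiplex a listening socket and its accepted clients in a single thread. It echoes received data as a placeholder for request processing; it is not code from ink, wrk, or NGINX, and error handling is omitted.

#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Set up a listening socket as in the earlier server sketch. */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 128);

    /* Create the epoll instance and register the listening socket. */
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        /* Block until at least one registered file descriptor is ready. */
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* New connection: accept it and watch it for readability. */
                int client = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                /* An existing client is readable: serve it without blocking others. */
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, (size_t)r);  /* echo back as a placeholder response */
                }
            }
        }
    }
}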

2.5 HTTP Benchmarking

HTTP benchmarking has been of interest since early on in the life of HTTP [45]. An HTTP server’s performance has many implications. Research has shown that increasing the latency of an application results in lower user engagement and productivity [46]. According to Miller, humans engage better with an application if the latency of interacting with the application is less than 100 ms. A latency of 1 second results in expectations being met, and a latency of greater than 10 seconds results in the loss of the user’s attention. In a talk given by Mayer, a former spokeswoman at Google, a 20% drop in Google searches was reported after a slight increase in page loading times was introduced [47]. The increase in response time was the result of more search results being generated per page. This information implies that web server performance is critical, and it is important for a developer to understand the implications that performance has on the usability of an application.

2.5.1 Data Metrics

The two points of data that HTTP benchmarks commonly present are throughput and average latency. Throughout the process of performing this research and testing modern HTTP benchmarking tools, this has been found to hold true for many popular tools [9, 48, 49]. The throughput of a server is defined as the number of requests that are served per second [50]. The average latency of the server is defined as the average amount of time that it takes for a request to be met with a response. The statistics that are presented to the user have not evolved over time; modern tools like wrk report essentially the same set as earlier tools. In fact, implementing the same feature set as past tools is often mentioned in the documentation of various HTTP benchmarking tools.

An issue that many HTTP benchmarking tools do not address is the problem of coordinated omission [51]. This problem occurs because clients that experience high latency during a given benchmarking period send fewer requests, record fewer responses, and thus contribute disproportionately less to the final, averaged score that is computed. Some tools address this by weighting the latencies recorded during the benchmark. Higher latency data points will be weighted more heavily in the final report.
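A hypothetical worked example (the numbers are chosen for illustration, not measured): suppose one simulated client receives 1 ms responses and therefore records roughly 1,000 samples in a second, while a second, stalled client receives a single 1,000 ms response in the same period. The naive average over all recorded samples is then

\[ \frac{1000 \times 1\,\text{ms} + 1 \times 1000\,\text{ms}}{1001} \approx 2\,\text{ms}, \]

even though half of the simulated clients spent the entire second waiting.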

2.5.2 Generating Load

Many HTTP benchmarking utilities use the same strategy for load generation: simulate a user-defined number of clients, each of which repeatedly performs HTTP requests. Various strategies for generating this load exist. The authors of httperf listed generating a constant load as one of their goals [50]. They wanted to generate constant load so that a server can be pushed to its limits. If constant load is not applied, then the benchmarking client will alleviate the pressure applied as the server begins to fail. This goal has not been adopted by more recent tools. Most notably, wrk fails to implement this feature. wrk is the most popular HTTP benchmarking tool on GitHub; however, it fails to sustain the load that the user intended to generate. Clients simulated by wrk that never receive a response from a request will indefinitely stall.

Table 2.1: Comparison of load generated on NGINX by popular HTTP benchmarking tools. The requested payload was 4KB. *Note that wrk and hey are both multi-threaded applications.

Benchmark   Version             RPS (1000 Connections)   RPS (5000 Connections)
wrk*        v4.1.0 gcc 5.5.0    277041                    277501
hey*        v0.1.3 go 1.13      114052                    97384
ab          v2.4.43 gcc 5.5.0   22112                     18549
httperf     v0.9.1 gcc 5.5.0    11365                     11102

When performance testing, it is important to ensure that the tool used for testing is not the bottleneck of the test. HTTP benchmarking tools address this in a few different ways. wrk is able to generate extremely high load due to its concurrency model and highly optimized HTTP parser. httperf addresses performance limitations by suggesting that an extension of the tool could be made to use multiple machines. See Table 2.1 for a comparison of throughput generated and recorded by popular benchmarking tools. These results are from initial experiments that we ran. For hardware and NGINX configuration details, see Section 4.1. wrk was able to perform significantly better in these tests with regard to throughput. wrk is able to maintain performance while increasing the number of concurrent clients, whereas the other tools degrade in performance.

In Section 2.3, we discussed different concurrency models. Those concepts generally spark discussion about their application in systems like high-performance HTTP servers. However, these concepts can be applied to all systems that need to perform operations concurrently. HTTP load generators can apply the same concepts as HTTP servers. Most of the tested tools use an event-driven model. wrk supplies a number of worker threads that process events as they arrive. Each thread maintains its own event loop and set of clients. This parallelizes the work while avoiding the waste of resources that comes from blocking on requests. hey, which is written in Go, does not use an event-driven system [49]. However, the concurrency model that the Go language encourages makes use of a high number of user-level threads. The Go runtime makes scheduler decisions based on event notifications; however, based on our experiments, this runtime still incurs significant overhead that is not present when directly adopting an event-based approach as in wrk.

Performance is not the only important aspect of a benchmarking tool. Qualities like flexibility and reported data metrics are also important, but they are less quantifiable than performance.

2.5.3 Error Reporting

While the statistical representation of benchmarking data is generally similar between tools, errors are reported in different ways. Some tools group errors by the specific socket operation that failed: read, write, and connection errors [9, 50]. Other benchmarking tools rely on a higher-level HTTP library to report errors [49]. Some tools fail to provide any distinction between error types and report them as a lump sum [48].

A low number of these errors might imply to the user that a server is performing well. However, timeouts, a very common class of error, can be more indicative of a server's performance than the raw error count. Request timeouts are prevalent when an HTTP server is under high load.

Unfortunately, many of the standard tools handle this class of error poorly. Currently, wrk does not report any timeouts. Not only does wrk fail to generate the advertised concurrent workload, as established in Section 2.5.2; it also does not report information about the stalled requests to the user. Users are not presented with accurate information and are misled to believe that the server is able to handle a high number of simultaneous connections.

Figure 2.1: Example of HTTP clients experiencing different levels of service, leading to misleading data points

In reality, some clients may not have been served at all! As a comparison to other popular tools, Apache Bench does report timeouts, but upon encountering a timeout, the test stops and no other statistics are reported. hey and httperf both report timeouts; however, the information is not sufficient for a user to fully understand server performance.

2.5.4 Data Representations

Reporting errors and performance issues in a way that is meaningful to the user and representative of server performance is a problem that spans the current set of tools available to developers today. Since HTTP benchmarks simulate clients, it should follow that a user is able to see how individual clients were served. This is impossible with the current set of tools.

See Figure 2.1 for an example of behavior that would lead to misleading performance results. Each block in this graphic represents a request. In this example, the average latency would be 2. However, two of the three clients never recorded a latency under 2. A more detailed data visualization is needed to see this kind of behavior.

Most existing tools also do not support visualizations at all, leaving it to the user to visualize the benchmark results in a meaningful way, which can make it more difficult to understand server performance.

Chapter 3

Design and Implementation

This chapter provides information about the design and implementation of the ink tool. We discuss the implementation decisions we made and the difficulties we encountered while building the tool. See Figure 3.1 for an overview of the tool's architecture. Here is the set of goals that ink was developed to achieve:

• Transparently represent performance statistics without hiding information

• Place an emphasis on representing individual simulated client experience

• Highlight the performance implications of client timeouts

• Support the use of a distributed set of machines to generate high load

Each of the major components listed here is given a devoted section of this chapter:

• Load Generator - simulates clients that will supply load on the target HTTP server

• Distributed Manager - manages and coordinates load generators on multiple machines

• Application Programming Interface (API) Server - processes benchmarking data reports and supplies data to the ink GUI

• Graphical User Interface (GUI) - browser-based application that generates data visualizations

Figure 3.1: High level overview of ink's architecture

3.1 Load Generation

ink generates a workload that models the real-world scenario of a server being met with a large number of concurrent clients. ink's workload is generated by creating a predefined number of simulated HTTP clients. The user may choose how many clients to use. Each of these clients will repeatedly send requests to the target HTTP server. The endpoint (URL) that is targeted is chosen by the user.

3.1.1 Choosing a Load Generator

As discussed in Section 2.5.4, being able to generate a high load is necessary for an HTTP benchmarking tool. Preliminary research established that wrk is able to generate higher load than other HTTP benchmarking tools. Initially, we did some experimental work with hey [49], a benchmarking tool implemented in Go. In testing, hey was able to generate higher load than all of the other tools tested, with the exception of wrk. We attempted to modify hey to use a more efficient HTTP parser. This resulted in some performance gains, but we were unable to achieve the performance that wrk does. In the interest of creating a more robust tool, hey was dropped as the load generator, and wrk was modified to fill the needs of ink.

The chosen load generator, wrk, is able to generate high load due to its event-driven concurrency model. All socket operations are non-blocking so that other tasks may be done while waiting for I/O to become available. wrk uses the Redis event library, which is designed to function with various event notification mechanisms, like epoll, select, and kqueue [52, 53, 54]. This maximizes the number of operating systems that wrk can run on.

3.1.2 Collected Data Points

Each event that is triggered during the course of the benchmark signals to the load generator that some data points need to be recorded. Since ink puts an emphasis on reporting data at a per-client level, the load generator must record data that retains the information about the client that produced the data. To help keep track of per-client data, wrk has been modified to record separate benchmarking data for each simulated client.

Each simulated client maintains an array with statistics about each request. After a simulated client successfully receives a response, a data point is appended to this array. This data point includes the time that the request was initiated, the latency of the request, and the status code of the response. If a connection must be established to complete this request, then the time that is taken to complete the connection is also included in the latency calculation. Otherwise, the latency is the amount of time between sending the request and receiving the response. This is a minor change in the policy that the original version of wrk employs. Previously, latency did not include the overhead of establishing a connection. This change was made so that we could later present this information to the user.

Each simulated client in ink also records data points for all established TCP connections, socket errors, and timeouts. TCP connections and errors are recorded as they occur; each client data structure maintains an array of the times at which they occurred. All timestamps recorded by ink are in microseconds, but the GUI allows a user to choose different units if desired.
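As a rough illustration of this per-client bookkeeping, the following Python sketch shows one possible in-memory layout (field names are hypothetical; ink's modified wrk implements the equivalent structures in C):

from dataclasses import dataclass, field
from typing import List

@dataclass
class RequestRecord:
    start_us: int       # time the request was initiated, in microseconds
    latency_us: int     # includes connection setup when a new connection was required
    status: int         # HTTP status code of the response

@dataclass
class ClientRecord:
    requests: List[RequestRecord] = field(default_factory=list)
    connections_us: List[int] = field(default_factory=list)  # times of established TCP connections
    errors_us: List[int] = field(default_factory=list)       # times of socket errors
    timeouts_us: List[int] = field(default_factory=list)     # times of timeouts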

3.1.3 Handling Timeouts and Generating Constant Load

wrk was modified to improve the reporting of timeouts. The original implementation's reporting of timeouts was not representative of the events that were actually occurring during the benchmark. Requests that took longer than the timeout threshold were reported as timeouts only if the request eventually received a response. If the client never received a response from the server, this failure was never reported to the user.

To address this issue, our modified version of wrk adds another event to wrk’s event-loop. This event repeatedly fires after a user-defined amount of time. Since each thread that is spawned by wrk is used to manage a subset of the total number of clients in the benchmark, each thread is able to safely cancel events for clients that it is responsible for. When this event is fired, the associated handler cancels any operations that have been stalled for more than the user-specified timeout period. The event handler then attempts to reestablish the TCP connection with the server for that client. The timeout handler also records a data point that indicates that a timeout has occurred for that client.
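The logic of this periodic sweep can be sketched as follows (Python pseudocode of the idea only; the real implementation lives inside wrk's C event loop, and the names here are hypothetical):

def on_timeout_tick(thread, now_us, timeout_us):
    # Fired periodically by the per-thread event loop; each thread only
    # touches the clients it owns, so no cross-thread locking is needed.
    for client in thread.clients:
        if client.in_flight and now_us - client.request_start_us > timeout_us:
            client.record_timeout(now_us)   # data point: a timeout occurred
            client.close_connection()       # cancel the stalled operation
            client.reconnect()              # attempt to reestablish the TCP connection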

Since all socket operations are restricted by a timeout, the modified wrk uses the same model that httperf uses to generate constant load. This means that ink’s version of wrk will continue to supply load even when the server is unable to handle the offered load.

3.1.4 Post-Processing Phase

After the data collection phase of the benchmark is done, the benchmarking client performs the post-processing of the data. These operations are done to the data sets that each simulated client has recorded.

For each simulated client, we compute a moving average over the set of latencies that the client recorded. However, instead of using a fixed number of latencies for each data point in the moving average, we use a user-specified amount of time to determine how many data points to include in each frame of the moving average. Note that the frame size of the moving average cannot be changed after the benchmark has been run. Additionally, the user may set by how much time each frame in the moving average overlaps with the previous frame. The application and associated visualization for this moving average calculation will be discussed in Section 3.4.4.
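A minimal Python sketch of this frame-based moving average is shown below, assuming each client's latencies are stored as (timestamp, latency) pairs sorted by timestamp (parameter names are hypothetical):

from bisect import bisect_left, bisect_right

def moving_average(samples, frame_us, step_us, start_us, end_us):
    # samples: sorted list of (timestamp_us, latency_us) pairs for one client.
    # Returns one (frame_start, frame_end, average) tuple per frame; frames
    # overlap whenever step_us is smaller than frame_us.
    times = [t for t, _ in samples]
    frames = []
    t = start_us
    while t < end_us:
        lo = bisect_left(times, t)
        hi = bisect_right(times, t + frame_us)
        window = [lat for _, lat in samples[lo:hi]]
        avg = sum(window) / len(window) if window else None  # None: no responses in this frame
        frames.append((t, t + frame_us, avg))
        t += step_us
    return frames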

Once these calculations have been done, the load generator creates a report that is composed of JSON and binary components. JSON is used for the more structured data; specifically, the client timelines and metadata about the benchmark itself. This metadata includes the hostname of the physical machine and the benchmarking parameters. The collected array of response latencies is converted into a raw binary format. This combination of binary data and JSON is used to reduce the overhead of generating and parsing large JSON documents. An early version of ink used JSON to record all data points; this proved to be inefficient and incurred a large memory overhead.

All data that the load generator produces is archived together, and the resulting file is either written to disk or transmitted back to the distributed manager. If the user is not using the distributed manager, then they will now submit the produced file to the ink API.

Here is a summary of the data that is reported for each simulated client:

• A moving average of successful request latencies

• Timestamps corresponding to each timeout and TCP socket error

This information will be referred to as a client timeline. It is representative of the service that a single client received.

In addition to the per-client data, we also report these data points:

• The timestamps of every request/response pair

• The HTTP status code of each response

3.2 Distributed Benchmarking Manager

To enable the use of multiple physical clients to benchmark one HTTP server, we designed a method to coordinate the benchmark across distributed machines and merge together the resulting reports into one report that can be parsed by the ink API and GUI.

3.2.1 Distributed Machine Access

A user must have ssh access to each machine that will be used as a load generator. This ssh connection will be used to transmit data to and from the remote machines. A separate TCP connection is also established from each remote machine to the distributed manager; however, if a user has ssh access to the machine, then ssh tunneling can be used to enable this feature. If a user does not have ssh access, then manual synchronization of the remote load generators will be required. The user will also have to manually collect all generated data reports.

3.2.2 Coordinated Test Procedure

We use a simple synchronization method to ensure that the benchmarking process starts at the same time on all distributed machines. First, an ssh connection is established with each physical machine. The distributed manager issues a command to each machine that starts the load generator. Instead of starting the benchmarking process immediately, the load generator establishes a TCP connection with the distributed manager. Once this TCP connection has been established, a signal is sent to each benchmarking client. This signal indicates to the load generator that the benchmarking process should begin. With this signaling process, the greatest possible gap between the starting points of two physical machines is equal to RTT × the number of physical clients. On the cluster of machines that was used to test ink, the average RTT between machines is around 0.220 ms. Alternatively, the system clock could be used to synchronize the test, but that would require that each system clock be synchronized prior to testing. Using our protocol removes that prerequisite.
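The start-signal barrier can be illustrated with a short Python sketch (this is not ink's actual code; the port number and signal byte are placeholders):

import socket

def wait_for_load_generators(expected, port=9000):
    # Distributed manager side: wait until every load generator has connected,
    # then send each one a start signal in turn.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("", port))
    server.listen(expected)
    conns = [server.accept()[0] for _ in range(expected)]
    for conn in conns:               # worst-case start skew ~ RTT * number of generators
        conn.sendall(b"GO")
    return conns

def wait_for_start(manager_host, port=9000):
    # Load generator side: connect, then block until the start signal arrives.
    sock = socket.create_connection((manager_host, port))
    sock.recv(2)                     # the benchmark begins once the signal is received
    return sock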

After each machine finishes the benchmarking process, the data report is transferred back to the distributed manager over the previously established TCP connection, and a small summary of the test is sent back to the ssh client. This summary includes a few details about the benchmark and any messages about unexpected errors that may have occurred during the benchmark. Unexpected errors would imply that the benchmarking client failed.

The data received by the ssh client is then displayed to the user.

3.2.3 Configuration Options

With each run of the distributed manager, a configuration file needs to be supplied that, at minimum, contains these items: the hostname or IP address of each physical client, the installation path of the load generator on each physical client, and the IP address or hostname of the target machine. Command line arguments may be set for the load generator, allowing for custom workloads.

Each individual node in the benchmark can be configured to start after a delay. This allows users to stagger the level of concurrency that is applied to the server. Thus, a single report can be used to compare how a server performs under different levels of load.

ink has the ability to run software that records information on the machine that is running the target HTTP server. This feature is pre-configured to record CPU usage, network usage, memory usage, and allocated file descriptors. These data points are sampled over the duration of the benchmark. The user must have ssh access to the target machine to use this feature. Additionally, the target machine must have a scripting language interpreter installed. In the pre-configured setup, Python is used. In an effort to keep configuration on the target machine to a minimum, the script is not stored on the target machine. Instead, the interpreter is run in an ssh session, and the contents of the script are streamed to the interpreter by the distributed manager, as sketched below. This allows the user to reuse scripts between benchmarks without having to install any additional software on the target machine. All recorded data points are relayed back to the distributed manager over a TCP connection.
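The streaming approach can be illustrated with a short Python sketch (assuming key-based ssh access and a python3 interpreter on the target; the host and script path are placeholders, and in ink the recorded samples are sent back over a separate TCP connection rather than read from standard output):

import subprocess

def run_remote_monitor(host, script_path="monitor.py"):
    # Pipe a local Python script to a remote interpreter over ssh so that
    # nothing needs to be installed on the target machine; "python3 -" reads
    # the program from standard input.
    with open(script_path, "rb") as script:
        return subprocess.Popen(
            ["ssh", host, "python3", "-"],
            stdin=script,
            stdout=subprocess.PIPE,   # for this sketch, samples are read from stdout
            stderr=subprocess.PIPE,   # error messages are relayed back to the user
        )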

Error messages are sent over the ssh connection and displayed to the user. This feature was designed so a user can freely modify the server monitoring script to record data for their specific use case.

3.2.4 Data Conversion and Merging

The distributed manager is responsible for merging all of the collected data reports from each physical client into one report that can then be viewed by the user using the front-end of the tool. This process is fairly simple; however, it is important to note that the merging time may not be negligible due to the amount of data generated by long-running benchmarks.

The distributed manager converts the JSON and binary data that the load generator produces into the Protocol Buffers format [55]. Protocol Buffers are a data serialization format developed by Google. To use Protocol Buffers, a schema must be defined for each data type. Because the schema is known ahead of time, Protocol Buffers have been shown to decrease file size and parsing time when compared to JSON [56]. JSON was initially used in ink, but Protocol Buffers were adopted to reduce large file sizes and long parsing times. Protocol Buffers are not used by the load generator directly because the official Protocol Buffer tool chain does not support C, and wrk is written in C. There are unofficial implementations, but the projects that we examined were difficult to work with.

Since Protocol Buffers require that a data format be defined prior to encoding and decoding data, it is difficult to handle data of an unknown format. The majority of the data types used throughout ink are well defined, with the exception of the data that is recorded by the server monitoring script. This feature was designed in a way that encourages modification of the script to fit a user’s specific needs. For this reason, the data recorded by this script remains in a JSON format. It is embedded in the Protocol Buffer as an array of bytes, later to be decoded by the API server. Since this approach was taken, arbitrary data can be recorded by the server monitoring script and then served to the user through the API.

Once the data has been merged and converted, it will be transmitted to the ink API. The API will respond with a link to where users can view the results of their benchmark.

3.3 ink API

The ink API server provides access to the data that was recorded during an HTTP benchmark. The original motivation for using a server to distribute the data was that heavy calculations could be off-loaded from the user's machine to a different machine. However, this design decision also resulted in the added benefit of allowing users to access their benchmarking data from multiple devices. When paired with ink's browser-based GUI, users are able to distribute HTTP benchmarking results by simply sharing a link.

3.3.1 Submission

Data reports are submitted to the API with a POST request. When a report is submitted to the API, it is saved to disk and also put into a simple cache. Elements are removed from the cache after a predetermined amount of time, which is set when starting the server. We chose to use an in-memory cache because the reports that the ink benchmarking process produces can be very large. On a cache miss, data is reloaded into memory by reading from persistent storage and added back into the cache.

Since the submitted dataset can be very large, the submission process can be time consuming. We parallelize operations whenever possible.

3.3.2 API Endpoints

This subsection provides a listing of the API endpoints used in ink, along with a description of the data returned by each endpoint. <id> represents a unique ID that each report is assigned upon submission. The max and min query parameters allow a user to limit the queried data to data that was recorded during a specific period of time during the benchmark. An example query is sketched after the list.

• POST /submit - Used to submit a data report generated by the load generator or the distributed manager.

• GET /report/<id>/timelines?count=<count> - Responds with client timelines and supplementary information about the timelines needed to render the data. count is used to specify the number of client timelines to return.

• GET /report/<id>/averageDist?min=<min>&max=<max> - Responds with a distribution of the moving average data points

• GET /report/<id>/latencyDist?min=<min>&max=<max> - Responds with the 0th, 25th, 50th, 75th, and 100th percentiles over the number of responses that each simulated client received from the target server.

• GET /report/<id>/statistics?min=<min>&max=<max> - Responds with the average, minimum, and maximum response latencies. Also responds with the number of errors and timeouts recorded.

• GET /report/<id>/histogram?min=<min>&max=<max> - Responds with the numerical representation of a response latency histogram.
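As an illustration of how these endpoints might be queried, the following Python sketch uses the requests library (the host, port, report ID, and file name are placeholders; timestamps are given in microseconds, matching the units recorded by ink):

import requests

BASE = "http://localhost:8080"   # placeholder address of the ink API server
REPORT_ID = "abc123"             # placeholder ID returned after submission

# Submit a report produced by the load generator or the distributed manager.
with open("report.bin", "rb") as f:
    requests.post(f"{BASE}/submit", data=f)

# Fetch summary statistics restricted to the 10 s to 20 s window of the benchmark.
stats = requests.get(
    f"{BASE}/report/{REPORT_ID}/statistics",
    params={"min": 10_000_000, "max": 20_000_000},
).json()
print(stats)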

3.3.3 Off-Loading Heavy Calculations

Once the server has received the report, it decodes the submission from the binary Protocol Buffers into an in-memory data structure. The server pre-calculates as much information as possible and generates data structures that will enable faster subsequent requests.

The API allows the user to limit queries to data that was recorded during any arbitrary period during the benchmark. The API is able to quickly filter out the appropriate data by using a strategy called coordinate compression. Coordinate compression is a technique that maps arbitrary values to discrete values. In this case, an arbitrary time stamp provided by the user is mapped to the closest point in time that an event was recorded by the benchmarking client. Completed responses, errors reported, and established connections are all events that this technique is applied to. This technique is used for all range-based queries that the API supports. The mapping from an arbitrary data point to a discrete point is performed in O(log n) time. Since lengthy HTTP benchmarks are necessary to gain knowledge about a server’s performance, it is important to reduce the complexity of these operations whenever possible.
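The mapping step amounts to a binary search over the sorted event timestamps, as in this minimal Python sketch:

from bisect import bisect_left

def snap_to_event(event_times_us, query_us):
    # Map an arbitrary timestamp to the closest recorded event time.
    # event_times_us must be sorted; the search runs in O(log n).
    i = bisect_left(event_times_us, query_us)
    if i == 0:
        return event_times_us[0]
    if i == len(event_times_us):
        return event_times_us[-1]
    before, after = event_times_us[i - 1], event_times_us[i]
    return before if query_us - before <= after - query_us else after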

The histogram of latencies is one of the data presentations that is partially pre-calculated at the time that the data set is entered into the cache. The histogram bin that each element maps to is stored and used for each subsequent request. This reduces the time complexity of the query from O(m × n) to O(n), where m is the number of bins in the histogram and n is the number of requests over which the histogram is computed.

We use segment trees to speed up the calculations of minimum, maximum, and average response latency. A segment tree is a data structure that performs range-based queries in O(log n) time. This is done by constructing a tree in memory that stores the result of a range query in each node. Each leaf node in the tree represents a range query over a single element in the data set, and each parent node stores the range query over its two child nodes. The resulting tree uses twice as much memory as the original set of data. However, the latency that the user experiences with the application is significantly decreased. We could have sped up the computation of the latency histograms as well, but decided against it due to the memory overhead of maintaining m segment trees, one for each bin.
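A minimal segment tree supporting these range queries might look like the following Python sketch (an iterative formulation shown for illustration; ink's server-side implementation may differ in detail):

class SegmentTree:
    # Iterative segment tree supporting O(log n) range queries with any
    # associative operation (e.g. min, max, or addition for averages).
    def __init__(self, data, op, identity):
        self.n = len(data)
        self.op = op
        self.identity = identity
        self.tree = [identity] * (2 * self.n)
        self.tree[self.n:] = list(data)            # leaves hold the raw values
        for i in range(self.n - 1, 0, -1):         # internal nodes combine children
            self.tree[i] = op(self.tree[2 * i], self.tree[2 * i + 1])

    def query(self, lo, hi):
        # Combine the elements in the half-open index range [lo, hi).
        res = self.identity
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                res = self.op(res, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = self.op(res, self.tree[hi])
            lo //= 2
            hi //= 2
        return res

latencies = [12, 7, 30, 5, 18, 22]
min_tree = SegmentTree(latencies, min, float("inf"))
sum_tree = SegmentTree(latencies, lambda a, b: a + b, 0)
print(min_tree.query(1, 4))          # minimum latency over indices 1..3 -> 5
print(sum_tree.query(1, 4) / 3)      # average latency over the same range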

3.4 ink GUI

The ink GUI provides users with a way of interfacing with the ink API. The GUI renders data visualizations that help users understand the performance of the benchmarked HTTP server, at a per-client level. The most prominent feature of the GUI makes use of the client timelines that were recorded by the load generator. These timelines help ink clearly represent the benchmarking data.

3.4.1 Browser Based Interface

The front-end of ink runs in the user’s web browser for multiple reasons. Firstly, the cross- platform nature of web browsers is very compelling when creating an application that targets all systems. Additionally, web technologies are very mature, and there is an extensive set 3.4. ink GUI 45 of preexisting software that can be used to help with the development of a project. It also limits the amount of software that must be installed on a user’s machine before they can view a report that ink produces.

3.4.2 SVG Based Visualizations ink’s GUI supports multiple data visualizations. To help create these graphics, D3.js was used. D3.js (Data-Driven Documents) is a data visualization library that provides convenient methods for creating complex Scalable Vector Graphics (SVG) images [57]. D3.js helps solve the problem of dynamically interacting with SVG images. D3.js does this by binding data elements to objects in the SVG. Operations that can be done on the data set can also be done on the SVG components. Additionally, the library provides a convenient set of tools that simplify data presentation. For example, generating linear, logarithmic, and exponential scales based off of a data set is directly supported by D3.js.

D3.js is also capable of rendering graphics using the HTML canvas element; however, this element does not provide the necessary flexibility that is needed by the GUI. To properly display the data, elements in the visualization need to be dynamic and mutable. HTML canvas elements do not allow a user to interact with the rendered elements nearly as well as SVG images do. SVG images do meet the needs of the visualizations; however, the features of SVG come at the price of performance because rendering a large number of SVG elements in a browser can be time consuming. Specifically, this became an issue when rendering a large number of client timelines. Strategies to deal with this problem are discussed in Section 3.4.4.

Since we chose to use SVG images to render the visualizations, we also gained the ability to export and save the visualizations that are rendered by the web page. All information that is used to render the SVG is contained within the SVG element in the HTML document. In fact, the GUI offers a feature that allows a user to download any SVG image rendered on the page. Once downloaded, a user can scale the image to the desired size. As a case in point, all snapshots of the ink GUI shown in this document were exported in this way.

Figure 3.2: Brewer's red-to-green divergent color scheme [1]

3.4.3 Latencies Mapped to Colors

Throughout the front-end of the tool, a color scale is used to represent the latency of requests. This enables the user to quickly see the performance of a server without having to process a large set of data. The color scale comes from Brewer, a cartographer who specializes in color scales that are often used in maps [1]. These scales are regarded as the state of the art for presenting data as color. Brewer provided a set of color scales called “divergent color schemes.” These are recommended to be used in situations where two distinct groups of data need to be visually different. In ink, one of these color schemes is used to represent latencies, with the two ends of the scale mapping to great performance and poor performance. ink uses the Brewer color scheme that trends from red to green. This color scale can be seen in Figure 3.2. ink allows for the user to choose how latencies are mapped to the color scale. By default, ink performs this mapping by assigning the fastest response to one end of the color scale and the slowest response to the other end of the scale. Values are then logarithmically mapped from their value to a color on this scale.

In order to construct a mapping that is more meaningful for human users, ink also supplies a mapping that follows the response delay thresholds that were proposed by Miller [46]. Latencies that are between 0 ms and 100 ms are mapped to the green end of the color scale. Latencies that fall between 100 ms and 1 s are mapped to the yellow section of the scale. Latencies over 1 s are mapped to the red end of the scale. This scale can be applied as a standard scale to perform comparisons.

Lastly, a user is able to choose arbitrary values for the domain of the color scale. This is useful for comparing server performance when the Miller-inspired scale is too broad. This is often the case when a server performs well enough that the average latency is well below the median of the Miller-inspired scale.
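The two mappings described above can be expressed as small functions, as in this Python sketch (the Miller thresholds follow the text; the interpolation of the default mapping is simplified):

import math

def miller_band(latency_ms):
    # Map a latency to a band of the red-to-green divergent scale using the
    # 0-100 ms, 100 ms-1 s, and >1 s thresholds described above.
    if latency_ms <= 100:
        return "green"
    if latency_ms <= 1000:
        return "yellow"
    return "red"

def log_scale_position(latency_ms, fastest_ms, slowest_ms):
    # Default mapping: logarithmically place a latency between the fastest
    # observed response (0.0, green end) and the slowest (1.0, red end).
    lo, hi = math.log(fastest_ms), math.log(slowest_ms)
    return (math.log(latency_ms) - lo) / (hi - lo)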

3.4.4 Client Timeline Visualization

The client timeline visualization is the most prominent component of the GUI. This visual- ization conveys most of the information about the performance of the server; thus, it is the focal point of the application.

A single client timeline is constructed from the client's moving average of latencies that was recorded by the load generator. The x-axis ranges from the starting time to the ending time of the benchmark. Each data point in a client's moving average represents that client's average request latency for a small period of time during the benchmark. We will call this period of time a frame. For each frame, we draw a line segment on the x-axis that extends from the starting time of that frame to the ending time of that frame. We then map the average latency experienced in each frame to a color using Brewer's scale, and the line segment is given that color. The line that results from drawing all frames is a visualization of the amount of service that a single client received over the course of the benchmark.

The client timelines can optionally show additional information about a client’s experience. Markers can be placed on each timeline that indicate when TCP connections were established or errors occurred. This extra information can provide insight into the behavior of a server. For example, if a server is using persistent connections, it will be easy to see that connections are not being reestablished after every request. ink represents timeouts as line segments that are a dark gray color. This color is not in the normal scale, so these portions of the timeline stand out. A main goal of ink is to bring notice to timeouts and the impact that they have on a client’s individual experience. This method of visualization makes it easy to see how long a client went without receiving any service.

We construct the full visualization of client timelines by stacking individual timelines on the y-axis of a graph; the x-axis of the graph is still used to represent time. This allows users to make comparisons between clients. A user can see performance fluctuations that affect all clients in the test, and the user can see when clients are not provided with equal service. To help with the user’s understanding of the data, the timelines are sorted on the y-axis by the number of requests that each client was able to successfully complete over the course of the benchmark. Clients that performed similarly are grouped together, and the user is able to compare the client that received the best service with the client that received the worst service.

See Figure 3.3 for an example of a set of client timelines. This visualization is the result of benchmarking an HTTP server with 100 concurrent clients. This graph shows 100 horizontal client timelines stacked on top of each other. The client who received the most responses from the server is at the top of the graph. The client who received the fewest responses is at the bottom of the graph.

Figure 3.3: 100 client moving averages rendered using colored line segments

3.4.5 Sampling Client Timelines

The number of simulated clients used in a benchmark may exceed the number of pixels that a user has available to visualize the data. In addition, rendering too many client timelines can cause performance issues. Therefore, we had to devise a strategy to sample the visualized data. We already aggregate the data for each client into moving averages. However, this only solves the problem of presenting too much data on the x-axis of the visualization. We also need to limit the number of individual client timelines that are rendered. To solve this problem, we first needed to find a data point with which to sort the client timelines. We found that sorting the clients by the number of responses received over the duration of the test resulted in a pattern in the data visualization. This pattern is retained when taking an evenly distributed subset of all client timelines.
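The strategy of sorting by responses received and then keeping an evenly spaced subset can be sketched in a few lines of Python (attribute names are hypothetical):

def sample_timelines(clients, max_drawn):
    # clients: per-client records, each with a response_count attribute.
    # Sorting first preserves the overall pattern when an evenly spaced
    # subset of the timelines is drawn.
    ordered = sorted(clients, key=lambda c: c.response_count, reverse=True)
    if len(ordered) <= max_drawn:
        return ordered
    step = len(ordered) / max_drawn
    return [ordered[int(i * step)] for i in range(max_drawn)]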

A comparison of two reports with different sample sizes can be seen in Figure 3.4 and Figure 3.5. Both of these visualizations represent 1,000 concurrent clients repeatedly requesting an object from an HTTP server. Initially, roughly 650 clients received worse service than the remaining 350 clients. This was due to the server being unable to accept all of these clients simultaneously. As the server accepted these clients, the performance became more similar across all clients. In both data sets, this pattern can be seen. This is why we have chosen to filter client timelines in this way. A more detailed examination of various reports will be given in later sections; this comparison is provided only to show that drawing fewer client timelines does not greatly reduce the amount of information that can be gathered from the report.

Figure 3.4: 1,000 concurrent clients rendered as 200 timelines

Figure 3.5: 1,000 concurrent clients rendered as 50 timelines

In addition to the filtering techniques used to address performance difficulties, the GUI also allows the user to filter out timelines based on specific criteria. Users may filter the timelines based on metrics like the number of responses received by the client or whether the client experienced a timeout. Users may also choose to exclude timelines that originated from a specific physical machine. These options enable a user to more easily examine the experience of specific clients.

3.4.6 Alternative Data Representations ink’s GUI also supplies a set of alternative data visualizations and statistics to accompany the client timelines. These help the user understand the data points that construct the client timelines. All of these visualizations are driven by data retrieved from various API endpoints. See Figure 3.6 for an image of the user interface. Both the client timeline visualization and the plots shown in Figure 3.6 are rendered in the browser for the user to see. ink constructs a histogram of latencies that were recorded during the benchmark. This histogram includes only latencies from requests that were successful. Timeouts are omitted here; due to the effects of coordinated omission, the timeouts would be negligible in the histogram. The resulting histogram of data can have a very wide range. For this reason, users are able to pan and zoom throughout the data.

A box-and-whisker plot is generated for the distribution of responses that each client received over the course of the benchmark. This can be used to indicate how fairly a server served clients. A tight distribution would imply that all clients got close to the same level of service. This chart also can be used to show the distribution of responses for individual physical clients in the test.

A distribution of all of the values that were calculated by taking the moving average of client latencies is also included. This chart shows what ratio of time was spent experiencing varying levels of service. This chart is a good way to convert the client timelines into a more discrete set of data points. This chart also provides a good way to show what ratio of time clients spent experiencing a timeout. Timeouts are indicated by the column marked with an ∞ symbol.

Figure 3.6: Image of the dashboard of alternative data visualizations. Visualizations are based on the same dataset as Figure 3.3.

The GUI also plots server performance data points that may have been recorded by the server monitoring script. Users may cycle between the CPU usage, memory usage, network usage, and allocated file descriptors. Any additional data points that the user modified the script to record must be accessed directly through the ink API.

Users are able to select a region of the client timeline visualization to limit data points to events that occurred only during that time period. Users can make this selection by using their mouse to highlight a region of the client timelines. Making this selection results in sending multiple requests to the API. These calls to the API supply a range of time to limit the provided data to. All other components of the GUI are redrawn to show only the filtered data.

3.4.7 Grouping Physical Client Data

In addition to sorting the client timelines by the number of responses received, the GUI can also present the client timelines in groups distinguished by physical machine origin. These groups are stacked on the y-axis of the report. Each group of timelines remains sorted by the number of responses received from the target HTTP server. We can then apply the same sampling strategy that was used to reduce the entire data set of client timelines. This reordering can be done efficiently by using a stable sorting algorithm, which retains the relative ordering of elements after sorting them by a new data point.
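Because a stable sort preserves the existing order of equal keys, the grouping can reuse the response-count ordering directly, as in this short Python sketch (attribute names are hypothetical):

def group_by_node(clients):
    # Each record is assumed to carry 'node' (physical machine of origin) and
    # 'response_count' attributes. Sorting by responses first and then by node
    # keeps each node's clients ordered by responses, since sorted() is stable.
    by_responses = sorted(clients, key=lambda c: c.response_count, reverse=True)
    return sorted(by_responses, key=lambda c: c.node)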

See Figure 3.7 for a report that groups clients by machine origin. Both physical clients produced a set of simulated clients that exhibit the same performance pattern. Roughly one half of the simulated clients received consistent service throughout the test, while the other half of the clients had difficulty establishing a connection with the server.

Figure 3.7: 2,000 clients grouped by physical machine origin

Chapter 4

Evaluation

This chapter evaluates the ink utility in terms of usability, usefulness, robustness, and performance through a number of use case studies. We also present the results of a survey of students who used the tool.

4.1 Goals and Methodology

The goal of the evaluation is to show that ink

• does not impose significant overhead compared to wrk

• is capable of generating high load on an HTTP server

• is able to highlight performance aspects of HTTP servers at a per-client level

To perform this evaluation, we use a series of benchmark case studies that highlight ink’s features and abilities. All of the benchmarks provided in this section are run on a cluster of Linux machines with the following specifications:

• Operating System: CentOS 7

• Architecture: x86_64


• CPU: 2x Intel Xeon CPU E5-2470 v2 @ 2.40GHz

• Network Device: 10Gb/s, Full Duplex

• Memory: 96GB @ 1600MT/s

Each test was done by deploying an HTTP server on one node in the cluster and starting a distributed manager on another. The distributed manager handles initializing the load generators on a subset of the remaining nodes. The node that acted as the distributed manager was not used as a load generator. The ink API ran on a separate machine that is not a part of the cluster. Following a benchmark, the distributed manager submits the data report to the API.

Unless stated otherwise, we used NGINX version 1.17.10 as the target HTTP server for the benchmarks in this chapter. As described in Section 2.3.5, NGINX is the most popular HTTP server on the web today. NGINX was chosen for its performance. See Listing 4.1 for the relevant sections of the NGINX configuration file that was used.

Listing 4.1: NGINX configuration file used in evaluation

worker_processes 40;
worker_rlimit_nofile 65536;

events {
    worker_connections 65536;
    use epoll;
    multi_accept on;
}

http {
    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    sendfile on;
    tcp_nodelay on;
    keepalive_requests 100000;
}

The main purpose of enabling these features is to optimize the performance of the server for benchmarking. This configuration uses 40 parallel worker processes that each execute one event loop. The configuration also enables file caching, compression, and persistent connections. NGINX is configured to use epoll. These settings are typical for a deployed NGINX server, with the exception of the keepalive_requests 100000 directive. This directive allows clients to send a large number of requests before NGINX closes the connection.

4.2 Assessing ink’s Load Generation Ability

In this section we provide the results of two case studies that showcase ink’s performance capabilities.

4.2.1 Overhead relative to wrk

Since ink uses a modified version of wrk to generate load, we will compare the throughput that our modified version of wrk can generate and the throughput generated by the original wrk tool.

Here are the relevant benchmarking details:

• 10 second test

• 1 physical client

• 10, 000 simulated clients

• Range of response payload sizes

Table 4.1: Comparison of RPS generated by original wrk and modified wrk

Benchmark              Compiler    RPS (4KB)   RPS (2KB)   RPS (1KB)   RPS (0.5KB)
Original wrk v4.1.0    gcc 5.5.0   276740      511199      923791      950175
Modified wrk for ink   gcc 5.5.0   276023      508938      902413      946391

This comparison will discuss only throughput, as that is the best metric to use when strictly comparing the performance of two load generators. See Table 4.1 for a comparison of the throughput. The unmodified wrk is able to generate a higher load than our version of wrk. This is because ink records event data points whereas the original wrk does not. Generally, the difference is small, with the overhead in these tests ranging from 0.26% to 2.31%.

4.2.2 Inducing Load with Clustered Network of Machines

To evaluate ink’s ability to impose a wide spectrum of loads on state-of-the-art HTTP servers, we used ink to simulate up to 348, 160 clients attempting to request data from NGINX. We used the following benchmarking parameters:

• 120 second test

• 17 physical clients

• Varying levels of concurrency, ranging from 85 to 348,160 clients

• Response payload size of 0.5KB

• Client timeline windows are 2 seconds in length and overlap by 1 second

• Timeout threshold of 5 seconds

Figure 4.1: Benchmarking with 43,520 simulated clients and 17 physical clients

Figure 4.2: Benchmarking with 87,040 simulated clients and 17 physical clients

Figure 4.3: Benchmarking with 174,080 simulated clients and 17 physical clients

Figure 4.4: Benchmarking with 348,160 simulated clients and 17 physical clients

Figure 4.5: Requests-per-second generated by NGINX under varying levels of load

See Figures 4.5 and 4.6 for plots showing how the overall throughput and average latency of NGINX vary under different levels of concurrency. In Figures 4.1, 4.2, 4.3, and 4.4, we show some of the client timeline visualizations that were generated by ink. These reports show how ink's visualization conveys performance degradation. The color scale used here is the scale that was derived from human response times. Clients in the tests with 43,520 and 87,040 clients received mostly good service. Both benchmarks showed that the server had difficulty establishing the initial connections with all clients. However, once the clients established connections, the service was consistent. This is also true for the test with 174,080 clients. However, the service that these clients received was significantly worse, with most clients having moving average values of nearly 250 ms. The benchmark with 348,160 clients was much worse, with a very large number of requests resulting in a timeout.

These results indicate that ink was able to produce a high enough load to cause performance degradation on the server. ink reports the reduced throughput of the server. Additionally, ink shows how this reduced throughput manifests in higher latencies. This results in clients being able to complete fewer requests. When handling 348,160 clients, the best-served client was able to complete 1,300 requests. The worst-served client was unable to complete any requests.

Figure 4.6: Average request latency observed by clients benchmarking NGINX

The results of the final benchmark also show the effects of coordinated omission. Figure 4.7 shows the histogram of latencies that were recorded during the test with 348,160 clients. The largest bin in the histogram, with over 28 million values, represents requests that were responded to within 0 ms to 7.5 ms. From the client timelines, we know that most clients experienced service that was much worse. From a client's perspective, this would appear as if a small number of requests per window of time were responded to very quickly, while the remainder of the requests during that window took much longer. In this test, exactly that happened. While some requests had low latency, most clients experienced periods where requests took, on average, 200 ms to 400 ms.

Figure 4.7: Histogram of latencies from benchmark with 348,160 connections and 17 physical clients

ink helps us infer that the bottleneck of this benchmark is the CPU. For the entirety of each test, the machine that the server was running on was using 90% to 100% of the total available CPU time. The network card usage peaked at 5.5 Gbps. The machine's physical memory usage increased only slightly.

4.3 Evaluating Server Performance with ink

In this section we report on three case studies that showcase ink’s ability to help diagnose server issues and highlight interesting performance aspects.

4.3.1 Artificially Limited HTTP Server

When devising tests to evaluate ink, one of the goals was to highlight ink's ability to help debug server implementations by providing detailed information about individual client experience. ink can help diagnose performance issues stemming from the use of a deficient concurrency model. For instance, a server using a one-thread-per-connection model and a fixed-size thread pool will only be able to serve a fixed number of clients. This scenario was one that students in Virginia Tech's Computer Systems course encountered when they improperly used a thread pool to develop an HTTP server. In this test, the server was configured to use a thread pool that contains 50 threads. The server was benchmarked under these conditions:

• 60 second test

• 100 simulated clients, divided across 20 threads on one machine

• 4 KB response payload

• Client timeline windows are 1 second in length and overlap by 0.5 seconds

• Timeout threshold of 5 seconds

Figure 4.8: File descriptors allocated by HTTP server during 60 second benchmark

See Figure 4.9 for the full client timeline visualization. From the report, it is clear to see that 50 clients did not receive any service whatsoever. However, it is also clear that the remaining 50 clients received excellent service; most clients saw an average latency under 1 ms.

Figure 4.9 also includes markers to indicate when TCP handshakes were observed by the client. As shown by the visualization, all clients were able to establish connections with the HTTP server. Additionally, all subsequent connection attempts following a timeout were successful. As reported in Section 2.1.5, this information does not necessarily imply that the server accepted the connection. However, by using the monitoring abilities of ink, we also see that file descriptors were allocated for all incoming connections. See Figures 4.8 and 4.10 for plots showing the allocated file descriptors and memory usage of the HTTP server over the course of the benchmark. By the end of the test, roughly 600 file descriptors were allocated by the server. The data reported by the server monitoring script also indicates that the server's memory usage steadily increases during the benchmark. This can be attributed to the growing queue of thread pool tasks.

Figure 4.9: An ink report that indicates that only half of the clients were served. Observed TCP connections are marked by black dots.

Figure 4.10: Memory usage of HTTP server during 60 second benchmark

This serves as an example of ink’s ability to be an HTTP benchmarking tool, as well as a tool to help debug and diagnose misbehaving server. For comparison purposes, see Listing 4.2 for the resulting output from wrk when performing a benchmark of the same server. This output lacks any indication that 50% of clients did not receive any service. 4.3. Evaluating Server Performance with ink 73

Listing 4.2: Output from wrk when benchmarking limited HTTP server

Running 1m test @ http://gum.rlogin:13628/file.txt
  20 threads and 100 connections
  Thread Stats   Avg       Stdev     Max      +/- Stdev
    Latency    273.02us  173.78us   16.20ms    89.08%
    Req/Sec     16.31k     4.17k    19.13k     90.83%
  10727331 requests in 1.00m, 40.95GB read
Requests/sec: 178500.79
Transfer/sec:     697.78MB

4.3.2 Benchmarking With Varying Levels of Concurrency

ink can vary the number of concurrent clients over the span of one test by delaying the starting point of physical clients. This can help a user understand the performance of their server under different conditions without having to run multiple tests. Here are the details for this benchmark:

• 180 second test

• 3 physical clients

• 15,000 simulated clients, 5,000 clients per physical machine

• The second and third physical clients are delayed by 30 and 60 seconds, respectively

• Response payload size of 4 KB

• Client timeline windows are 1 second in length and overlap by 0.5 seconds

• Timeout threshold of 5 seconds

See Figure 4.11 for this report. This report showcases the efficacy of the latency-to-color mapping. All clients experienced worse performance during the period of time when all physical clients were benchmarking the server. Performance was significantly better during periods with only 5,000 simulated clients.

Figure 4.11: Benchmark with 15,000 clients released in waves of 5,000 clients every 30 seconds

Throughout these tests, there is a clear pattern: the servers provide the worst overall service during the phase when many clients are attempting to establish connections. Often, clients that successfully connected to the server received excellent service while others were left completely unserved. In this test, we are able to see that 280 clients experienced timeouts while establishing connections during the first wave of the benchmark. In the second wave of the benchmark, 461 clients timed out while connecting to the server. 629 clients timed out while connecting during the last wave of the benchmark. This would imply that the server has difficulty accepting new clients when other connections are already established.

These data points were gathered by using the range-based selection tool that the ink GUI provides.

4.3.3 Comparison of Server Implementations

ink can be useful in comparing the performance of different HTTP servers that use different concurrency models. As an example, we compare the performance of the NGINX HTTP server and the Apache HTTP server. NGINX, as we have discussed, uses an event-driven concurrency model. We used Apache version 2.4.43. Apache is configured to use a process-based concurrency model, which is referred to as a pre-fork model. Each connection is served by an independent process that Apache spawns. Note that Apache may be configured to use other concurrency models, including an event-driven model; the pre-fork model was chosen to provide a comparison of the performance between the two patterns.

The servers were benchmarked under these conditions:

• 90 second test

• 2 physical clients

• 1,000 simulated clients, 500 clients per physical machine

• 4 KB response payload size

• Client timeline windows are 1 second in length and overlap by 0.5 seconds

• Timeout threshold of 5 seconds

See Listing 4.3 for the Apache configuration parameters.

Listing 4.3: Apache configuration parameters

StartServers         50
MinSpareServers      50
MaxSpareServers      1000
MaxRequestWorkers    10000
ServerLimit          10000

Apache’s configuration limits the rate at which the server can accept clients. This explains the pattern that can be seen in Figure 4.16. 70 seconds pass before all clients are accepted by Apache. NGINX does not limit itself in this way; all clients are accepted immediately in Figure 4.17. The Apache pre-fork model limits resource use by limiting the number of new processes that are spawned per second. For this test, this effect could be mitigated by setting the MinSpareServers parameter higher, which would cause Apache to maintain more idle processes.

Figure 4.17 shows NGINX’s client timelines and Figure 4.16 shows Apache’s client timelines. ink presents the data recorded during the benchmark in a way that highlights the differences in load adaptation policies between NGINX and Apache. This report could help a developer 4.3. Evaluating Server Performance with ink 77

100%

80%

60%

40% CPU Usage

20%

0% 0 10 20 30 40 50 60 70 80 90 Time (seconds)

Figure 4.12: CPU usage of pre-fork Apache HTTP server with 1, 000 concurrent connections 78 Chapter 4. Evaluation

9.6Gb/s

8.0Gb/s

6.4Gb/s

4.8Gb/s

ewr Usage Network 3.2Gb/s

1.6Gb/s

0.0Gb/s 0 10 20 30 40 50 60 70 80 90 Time (seconds)

Figure 4.13: Network usage of pre-fork Apache HTTP server with 1, 000 concurrent connec- tions 4.3. Evaluating Server Performance with ink 79

100%

80%

60%

40% CPU Usage

20%

0% 0 10 20 30 40 50 60 70 80 90 Time (seconds)

Figure 4.14: CPU usage of NGINX HTTP server with 1, 000 concurrent connections 80 Chapter 4. Evaluation

9.6Gb/s

8.0Gb/s

6.4Gb/s

4.8Gb/s

ewr Usage Network 3.2Gb/s

1.6Gb/s

0.0Gb/s 0 10 20 30 40 50 60 70 80 90 Time (seconds)

Figure 4.15: Network usage of NGINX HTTP server with 1, 000 concurrent connections 4.3. Evaluating Server Performance with ink 81 0.300ms 0.500ms 0.700ms 0.900ms 2.000ms 4.000ms 6.000ms 8.000ms 10.00ms 30.00ms 50.00ms 70.00ms 90.00ms 200.0ms 400.0ms 600.0ms 800.0ms 1000ms 3000ms 5000ms

Responses 69k 28k 19k 11k 6.5k 5.3k 4.3k 3.4k 2.8k 2.1k 1.7k 90 80 concurrent connections 70 000 , 1 60 50 Time (seconds) 40 30 20 10 Figure 4.16: Client timelines of pre-fork Apache HTTP server with 0 0

900 800 700 600 500 400 300 200 100

1000 ocretClients Concurrent 82 Chapter 4. Evaluation 0.300ms 0.500ms 0.700ms 0.900ms 2.000ms 4.000ms 6.000ms 8.000ms 10.00ms 30.00ms 50.00ms 70.00ms 90.00ms 200.0ms 400.0ms 600.0ms 800.0ms 1000ms 3000ms 5000ms

Responses 30k 28k 26k 26k 26k 25k 24k 23k 23k 23k 22k 90 80 70 concurrent connections 000 , 1 60 50 Time (seconds) 40 30 20 10 Figure 4.17: Client timelines of NGINX HTTP server with 0 0

900 800 700 600 500 400 300 200 100 1000 Clients Concurrent 4.4. Survey Results 83 decide how to configure their server. If the system maintainer expects there to be an influx of traffic to their server, then Apache would need to be configured to use more resources, or NGINX could be used as their HTTP server. Both NGINX and Apache were able to provide consistent service when all connections had been established. However, NGINX was able to achieve an average response latency of 3.6 ms. Apache was only able to produce an average response latency of 12.2 ms. Apache client timelines shift from green to yellow as more clients are accepted, illustrating a degrade in individual performance.

NGINX was limited by the bandwidth of the network device. Apache was CPU limited. See Figure 4.14 and Figure 4.15 for NGINX's CPU and network usage. See Figure 4.12 and Figure 4.13 for Apache's CPU and network usage. Apache's high CPU usage might be explained by the overhead of context switching between 1,000 processes. NGINX does not have to do this, so it is able to saturate the network device before reaching the CPU's performance threshold. These insights were provided by ink's server monitoring script.
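For reference, the following is a minimal sketch of the kind of host-level sampling that a monitoring script can perform on Linux; it is not ink's monitoring script. It derives CPU utilization from /proc/stat and network throughput from /proc/net/dev once per second, and the interface name is a placeholder.

# A minimal sketch of host-level monitoring on Linux (not ink's actual
# monitoring script): sample CPU utilization from /proc/stat and network
# throughput from /proc/net/dev once per second. The interface name below
# is an assumption; adjust it for the machine being monitored.
import time

INTERFACE = "eth0"  # hypothetical network device name

def read_cpu_times():
    """Return (idle, total) jiffies aggregated over all CPUs."""
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]          # idle + iowait
    return idle, sum(fields)

def read_net_bytes(interface):
    """Return total bytes received plus transmitted on one interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                cols = line.split(":", 1)[1].split()
                return int(cols[0]) + int(cols[8])  # rx_bytes + tx_bytes
    raise ValueError(f"interface {interface!r} not found")

def monitor(interval: float = 1.0) -> None:
    prev_idle, prev_total = read_cpu_times()
    prev_bytes = read_net_bytes(INTERFACE)
    while True:
        time.sleep(interval)
        idle, total = read_cpu_times()
        total_bytes = read_net_bytes(INTERFACE)
        cpu_pct = 100.0 * (1.0 - (idle - prev_idle) / (total - prev_total))
        gbps = (total_bytes - prev_bytes) * 8 / interval / 1e9
        print(f"cpu={cpu_pct:5.1f}%  network={gbps:5.2f} Gb/s")
        prev_idle, prev_total, prev_bytes = idle, total, total_bytes

if __name__ == "__main__":
    monitor()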

4.4 Survey Results

We deployed ink in Virginia Tech’s Computer Systems course during the Spring 2020 semester. ink was deployed so that students could use the tool to benchmark their own HTTP server implementations. Additionally, we surveyed the students in the course on their experience with the tool. We used Qualtrics to administer the survey. Permission was granted for this research by the Virginia Tech Institutional Review Board under IRB protocol 19-1080.

The survey was offered to students who were enrolled in Virginia Tech's Computer Systems course during the Spring 2020 semester. The survey was open from May 5, 2020 to May 12, 2020. There were 9 respondents to the survey. The first question of the survey asked the student if they had used the tool. Out of the 9 respondents, 1 reported that they did not use the tool. Therefore, their responses to the remaining questions were omitted here. Here is a listing of the survey questions, along with the results of each question:

1. Did you find the tool to be useful?

• Extremely useful - 4

• Slightly useful - 4

• Neither useful nor useless - 0

• Slightly useless - 0

• Extremely useless - 0

2. Did the tool help you understand the performance of your server?

• The tool helped significantly - 5

• The tool helped slightly - 3

• The tool did not help at all - 0

3. Which part of the tool did you find most useful?

• The per-client latency color chart - 6

• The range-based histogram - 1

• The range-based details panel - 0

• The HTTP response code breakdown - 1

4. Were you able to improve performance of your server because you used this tool?

• The tool helped significantly - 0

• The tool helped slightly - 8

• The tool did not help at all - 0

5. Do you believe this tool reported any inaccuracies?

• Yes - 8

• No - 0

The questions for this survey were written early in the research process. Additionally, it should be noted that this survey was administered during the Spring 2020 semester; the disruption caused by COVID-19 may be one reason for the low number of respondents.

These results are generally positive, and they indicate that the majority of survey respondents found the tool to be useful. ink aims to help users understand the performance of HTTP servers, and these responses suggest that this goal was met. 6 out of 8 respondents reported that the client timeline visualization was the most useful part of the tool, which indicates that the data presentation techniques used by the client timeline visualization are effective.

Chapter 5

Related Work

This chapter briefly overviews the research work most closely related to ink. This ranges from other research that employs HTTP benchmarking to software that was designed in response to the need to support highly concurrent workloads.

5.1 Assessing Server Quality and Performance

Mosberger et al. created httperf, one of the first HTTP benchmarking tools [50]. httperf uses the same model of benchmarking that has been discussed throughout this thesis, in which one physical client simulates a set of clients that all request resources from an HTTP server. httperf places an emphasis on generating a constant load on the HTTP server. Mosberger et al. addressed the issue of the benchmarking client reducing the load applied on the server as the performance of the server degraded. Like httperf, ink attempts to create a constant load through the use of timeouts. Unlike ink, httperf does not report performance statistics for each individual client.

Behal et al. used HTTP benchmarking tools to assess how servers are able to handle legitimate clients while experiencing a Distributed DoS (DDoS) attack [58]. Since real world phenomena, like the Slashdot effect, can appear similar to DDoS attacks, the researchers devised and tested ways to discriminate between the two. This was done by examining the entropy of the requests coming into the server. The authors postulated that when a server is being attacked, the data pertaining to individual clients has a lower entropy than legitimate client data. Specifically, the IP addresses of clients in an attack are more similar to each other than those of clients that are part of a large surge of real users. Similar to ink, the authors recorded benchmarking information that was local to each client, but for a different purpose. Instead of applying the information in a way that can indicate performance, it was used to create a model for real world clients for the purpose of detecting denial-of-service attacks.
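As a rough illustration of this entropy-based discrimination, the sketch below computes the Shannon entropy of the source-address distribution within a window of requests. The sample addresses are fabricated, and the actual metrics used by Behal et al. are more involved than this.

# Illustrative sketch (not the metric from [58]): compute the Shannon entropy
# of source IP addresses observed in a window of requests. A very low entropy
# means requests are concentrated on a few similar sources, which Behal et al.
# associate with attack traffic rather than a legitimate flash crowd.
from collections import Counter
from math import log2

def source_entropy(ips) -> float:
    """Shannon entropy (in bits) of the empirical source-IP distribution."""
    counts = Counter(ips)
    total = len(ips)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Hypothetical windows of observed client addresses.
flash_crowd = [f"203.0.113.{i % 200}" for i in range(1000)]   # many distinct users
ddos_window = [f"198.51.100.{i % 4}" for i in range(1000)]    # few repeated sources

print(f"flash crowd entropy: {source_entropy(flash_crowd):.2f} bits")
print(f"attack-like entropy: {source_entropy(ddos_window):.2f} bits")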

Not all web traffic is static HTML content; therefore, there is a need for load generators that can test how servers handle responding with different types of content. Summers et al. created an HTTP benchmarking tool that benchmarks servers that provide video content [59]. This tool is based on a modified version of httperf. The modified benchmarking tool aims to create realistic workloads. Many of their methods were derived from the analysis of YouTube video traffic [60]. The authors were able to account for the presence of many client variables, such as video length, video quality, user retention rate, and performance expectations of server hardware. Generally, HTTP is used to stream video in chunks. Summers et al. emulated this behavior by manually chunking the data on the server prior to benchmarking. This enabled the use of httperf without significant modifications. This work is closely related to ink as they both are forms of HTTP benchmarking, and they both drew inspiration from httperf. Their research also supports the use of multiple physical httperf clients. However, Summers et al. primarily focused on realistic video workloads. While ink does present the user with many options, workloads that attempt to simulate human behavior, such as video retention rate, are not currently supported.

Ramamurthy et al. focused on finding the performance bottlenecks of web servers in the presence of what they called “flash crowds” [61]. This is done by generating workloads that are designed to exhaust specific resources. Workloads are broken down into categories. These categories are small, large, and dynamic object requests. The authors claim that if the average response time of requests monotonically increases with the number of concurrent clients, then the resource that the applied workload was designed to exhaust might be a limiting factor of the server. The authors were testing for disk, network, and CPU utilization. For example, a test was designed to cause cache misses, perform calculations on the retrieved data, and then respond to the client with the result of these calculations. As the number of clients used in each test increased, the CPU usage increased at a rate similar to the rate at which the average response latency increased. The authors concluded that this would result in the server being limited by the CPU with enough concurrent clients. In many ways, this work is similar to ink. The authors propose a non-intrusive benchmarking tool that can help target performance bottlenecks. Both solutions provide a way to monitor perceived client service and monitor server performance points. However, the solution that Ramamurthy et al. propose requires that the user have control and knowledge of the benchmarked server, whereas ink does not. Additionally, ink aims to generate a significantly higher load than the benchmarking model that Ramamurthy et al. proposed. ink also places a significantly higher focus on the experience of individual clients.

Shams et al. provided a model for generating realistic workloads for stateful applications [62]. The authors recognized that real-world HTTP workloads are not random; a realistic workload is often the trace of an individual user’s experience with the web application. One request may depend on the results of a previous request. Shams et al. modeled this by constructing finite state machines (FSMs) that represent how a user might interact with an application. The authors presented an example of an FSM that represents how a user might interact with an e-commerce website. Using these FSMs to generate workloads can lead to more realistic benchmarking conditions than arbitrarily picking some endpoints to test. To generate the workloads, users must manually construct the FSM, from which request traces can then be generated.

The authors used httperf to generate the load from the traces. Both ink and this work relate to HTTP benchmarking; however, the research completed by Shams et al. focused on the generation of realistic individual workloads. Shams et al. provided results of using this model with only one concurrent client. A large focus of ink is its ability to generate load with a high number of concurrent clients. Adapting this approach for ink could yield client timelines that are representative of realistic resource access patterns.
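The sketch below illustrates the FSM-based workload idea in simplified form: a toy state machine for a hypothetical e-commerce site is walked randomly to produce one client's request trace. The states, URLs, and transition probabilities are invented for this example and are not taken from [62].

# Toy illustration of FSM-based workload generation in the spirit of Shams et
# al. [62]; the states, URLs, and transition probabilities below are invented
# for this example. Each state maps to a request, and a random walk over the
# transition table yields one client's request trace.
import random

# state -> (URL to request, list of (next_state, probability))
FSM = {
    "home":     ("/",            [("browse", 0.7), ("exit", 0.3)]),
    "browse":   ("/products",    [("item", 0.6), ("home", 0.2), ("exit", 0.2)]),
    "item":     ("/products/42", [("cart", 0.4), ("browse", 0.4), ("exit", 0.2)]),
    "cart":     ("/cart",        [("checkout", 0.5), ("browse", 0.3), ("exit", 0.2)]),
    "checkout": ("/checkout",    [("exit", 1.0)]),
}

def generate_trace(start="home", max_requests=20):
    """Random-walk the FSM from `start` and return the sequence of URLs."""
    trace, state = [], start
    for _ in range(max_requests):
        url, transitions = FSM[state]
        trace.append(url)
        states, probs = zip(*transitions)
        state = random.choices(states, weights=probs)[0]
        if state == "exit":
            break
    return trace

if __name__ == "__main__":
    for client in range(3):
        print(f"client {client}: {generate_trace()}")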

Geist is a web traffic load generator designed by Kant et al. [63]. Geist generates load using a method that no other generators we have discussed employ. Instead of simulating a set of users, Geist generates a specific web traffic scenario in advance of the test. Geist does not use persistent connections that send multiple requests back to back, instead sending requests at specific times in separate connections that are defined in the scenario created prior to running the test. In this way, Geist can produce a constant aggregate load. Like ink, Geist also makes use of multiple physical machines to generate load. ink and Geist also both attempt to generate a constant load, albeit in different ways. Geist allows a user to define an exact workload prior to benchmarking, including the time at which each request will be sent; whereas ink uses persistent connections and relies on timeouts.
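The following sketch shows the scenario-based approach in simplified form: a request schedule is computed ahead of time as (send time, URL) pairs at a fixed aggregate rate and then replayed. The rate, duration, and URL are placeholders, and Geist's actual scenario model is considerably richer.

# Simplified illustration of Geist-style scenario generation [63]: build the
# entire request schedule before the test as (send_time, url) pairs at a fixed
# aggregate rate, then replay it. Rate, duration, and URL are placeholders;
# requests are "sent" here as prints rather than real HTTP connections.
import time

def build_schedule(rate_per_sec, duration_sec, url):
    """Pre-compute evenly spaced send times for a constant aggregate load."""
    interval = 1.0 / rate_per_sec
    n = int(rate_per_sec * duration_sec)
    return [(i * interval, url) for i in range(n)]

def replay(schedule):
    """Replay the scenario, waiting until each request's scheduled time."""
    start = time.monotonic()
    for send_time, url in schedule:
        delay = send_time - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        # A real generator would open a new connection and send the request.
        print(f"{time.monotonic() - start:7.3f}s  GET {url}")

if __name__ == "__main__":
    replay(build_schedule(rate_per_sec=5, duration_sec=2, url="/index.html"))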

SURGE is a web traffic generator that was developed by Barford and Crovella [64]. SURGE was designed to generate realistic web traffic that is representative of real human clients. This means that simulated clients take into account variables such as how long a user views some content or how long the browser takes to render content. These variables are built into a SURGE client. SURGE, like Geist, generates the request data before the benchmark is run. SURGE workloads are generated in a way that accounts for how HTTP servers act as a file system. They also account for various file sizes and popularity. SURGE emulates the behavior of a browser by requesting embedded objects in an HTML page. The authors were able to show that SURGE was able to induce a higher CPU load on a server than other comparable benchmarking utilities. This was due to how the workload was crafted. ink and SURGE both aim to generate a high load on an HTTP server. ink does this by generating a high number of concurrent clients; SURGE does this by carefully constructing the requests that will be sent.

In an effort to mitigate the effects of coordinated omission, there have been other extensions of wrk. One project named wrk2 uses High Dynamic Range (HDR) Histograms [65]. HDR histograms are designed “for recording histograms of value measurements in latency and performance sensitive applications” [66]. HDR histograms use a constant amount of space, regardless of the number of data points that the report represents. Using these HDR histograms, wrk2 supplies a constant load on the server in the same manner that wrk does. However, unlike wrk, the user is required to specify a limit on the number of requests per second that the load generator produces. This is because wrk2 uses the rate at which it is limited to adjust the reported data. Latencies that are higher than the expected value are weighted proportionately in the final histogram. ink addresses the effects of coordinated omission instead by generating client timelines.
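The sketch below illustrates the accounting difference in a toy model, which is not wrk2's implementation. Requests are issued on a fixed schedule against a simulated server that stalls once; measuring latency from each request's intended send time, rather than its actual send time, charges the stall to every request it delays.

# Toy model of coordinated omission (not wrk2's implementation): a client is
# supposed to issue one request every 10 ms against a server that normally
# responds in 2 ms but stalls once for 500 ms. Measuring latency from the
# *intended* send time charges the stall to every delayed request.
INTERVAL_MS = 10.0
SERVICE_MS = 2.0
NUM_REQUESTS = 200
STALL_AT, STALL_MS = 50, 500.0   # request index where the server stalls

def simulate():
    naive, corrected = [], []
    clock = 0.0                              # time when the client is free
    for i in range(NUM_REQUESTS):
        intended = i * INTERVAL_MS           # scheduled send time
        actual = max(intended, clock)        # client may still be blocked
        service = SERVICE_MS + (STALL_MS if i == STALL_AT else 0.0)
        finish = actual + service
        naive.append(finish - actual)        # measured from actual send
        corrected.append(finish - intended)  # measured from intended send
        clock = finish                       # next request waits for this one
    return naive, corrected

naive, corrected = simulate()
for name, lat in [("naive", naive), ("corrected", corrected)]:
    lat_sorted = sorted(lat)
    p99 = lat_sorted[int(0.99 * len(lat_sorted)) - 1]
    print(f"{name:9s}  mean={sum(lat)/len(lat):7.2f} ms  p99={p99:7.2f} ms")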

5.2 Visualization Techniques

Liu et al. created imMens, a system for viewing and querying large amounts of data in the browser using WebGL [67]. imMens makes use of various data visualization and data processing techniques. Both ink and imMens place an emphasis on the interactivity and usefulness of their respective data visualizations. Many strategies that the two works use are similar. Both systems precompute large amounts of data for future access and exploit parallelism wherever possible when querying data structures. Both works also use color-mapping techniques to represent large amounts of data. Unlike ink, imMens was solely designed as a showcase for various data visualization and processing techniques.

ScalaR is a generic system for filtering large data sets to be used with visualization applications [68]. As was experienced when developing ink, rendering large data sets can be difficult for multiple reasons. Screen resolution and rendering complexity can both be limiting factors for visualizing large sets of data. ScalaR is designed to help alleviate this problem. The authors suggest that data visualization designers should not be burdened with the management of massive data sets; instead, they should be able to query a database that performs all of the necessary data reduction. ScalaR provides this feature. Users of ScalaR are notified if a query will be too large to render; if that is the case, then ScalaR will suggest a reduction technique to the user. These reduction techniques are aggregation, sampling, and filtering. Aggregation segments data into chunks and returns averages of each chunk. This is similar to how client timelines in ink are partitioned into windows and assigned a color based on average latency. Sampling reduces the number of elements in a data set by taking a subset of the full data set. This is the technique that ink uses to lower the number of client timelines rendered. Lastly, filtering requires that the returned data points all pass a set of user-defined tests. ink provides users with the ability to perform this operation on client timelines. ScalaR and ink perform many of the same operations on data sets in an effort to limit the data rendered by visualizations. However, ScalaR is a generic system that can be used with any front-end application. ink could make use of a system like this to reduce the complexity of the application.

Chapter 6

Future Work

While ink provides a large array of features, there are many ways that the tool can be improved upon. Some of those are discussed here.

6.1 Load Generation Improvements

HTTP/2 Support Currently, ink uses wrk as the load generator. Unfortunately, wrk does not support HTTP/2. Extending wrk to support HTTP/2 would be a welcome addition to the tool, as many servers have transitioned to HTTP/2.

Supporting More Flexible Workloads Although ink allows a user to configure custom benchmarks with varying levels of load and concurrency, the user could be given even more control. As discussed throughout Section 5.1, more complex workloads allow a benchmark to simulate different real-world phenomena. ink’s modular design allows it to support multiple different workload generators; however, these would need to be developed.
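As a sketch of what a pluggable workload generator could look like, the following defines a hypothetical interface with a constant-rate generator and a wave-based generator. The interface, class names, and parameters are invented for illustration and are not part of ink.

# Hypothetical sketch of a pluggable workload-generator interface; none of
# these names exist in ink. Each generator yields (send_time_sec, url) pairs,
# which a load generator could then replay against the target server.
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class WorkloadGenerator(ABC):
    @abstractmethod
    def requests(self) -> Iterator[Tuple[float, str]]:
        """Yield (send time in seconds, URL) pairs for one benchmark run."""

class ConstantRate(WorkloadGenerator):
    """Evenly spaced requests at a fixed aggregate rate."""
    def __init__(self, rate_per_sec: float, duration_sec: float, url: str):
        self.rate, self.duration, self.url = rate_per_sec, duration_sec, url

    def requests(self) -> Iterator[Tuple[float, str]]:
        n = int(self.rate * self.duration)
        for i in range(n):
            yield i / self.rate, self.url

class Waves(WorkloadGenerator):
    """Bursts of requests released at the start of each wave."""
    def __init__(self, per_wave: int, waves: int, interval_sec: float, url: str):
        self.per_wave, self.waves = per_wave, waves
        self.interval, self.url = interval_sec, url

    def requests(self) -> Iterator[Tuple[float, str]]:
        for w in range(self.waves):
            for _ in range(self.per_wave):
                yield w * self.interval, self.url

if __name__ == "__main__":
    for t, url in Waves(per_wave=3, waves=2, interval_sec=30, url="/").requests():
        print(f"t={t:5.1f}s  GET {url}")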

6.2 GUI Improvements

WebGL Graphics ink uses a browser-based GUI that is rendered using SVG elements. When rendering a large number of SVG elements, the client can experience performance issues because each element must be added to the document object model (DOM) representation. When developing ink, this was a problem that was solved by limiting the number of elements drawn on the page. Conveniently, this worked for our visualization since the data can be sampled down in a way that does not greatly reduce the amount of information that is relayed to the user.
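As an illustration of this kind of reduction, the sketch below samples a bounded number of client timelines and collapses each one into windowed average latencies, roughly the shape of data a GUI could then map to colors. The window length, sample limit, and fabricated data are placeholders rather than ink's actual parameters.

# Illustrative sketch of downsampling before rendering (parameters and data
# are placeholders, not ink's actual values): pick a bounded number of client
# timelines, then collapse each timeline's (timestamp, latency) events into
# fixed-length windows of average latency that a GUI could map to colors.
import random

MAX_TIMELINES = 200     # upper bound on timelines actually rendered
WINDOW_SEC = 1.0        # length of each timeline window

def sample_timelines(timelines, limit=MAX_TIMELINES):
    """Keep at most `limit` timelines, chosen uniformly at random."""
    if len(timelines) <= limit:
        return timelines
    return random.sample(timelines, limit)

def window_averages(events, duration_sec, window_sec=WINDOW_SEC):
    """Average latency (ms) per window; None for windows with no responses."""
    n_windows = int(duration_sec / window_sec)
    buckets = [[] for _ in range(n_windows)]
    for t, latency_ms in events:
        idx = min(int(t / window_sec), n_windows - 1)
        buckets[idx].append(latency_ms)
    return [sum(b) / len(b) if b else None for b in buckets]

if __name__ == "__main__":
    # Fabricated example: 1,000 clients, 10-second benchmark.
    timelines = [
        [(random.uniform(0, 10), random.expovariate(1 / 5.0)) for _ in range(50)]
        for _ in range(1000)
    ]
    for tl in sample_timelines(timelines)[:3]:
        print([round(a, 1) if a is not None else None
               for a in window_averages(tl, duration_sec=10)])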

Converting ink to use the Web Graphics Library (WebGL) might alleviate these performance issues and could allow for more flexible data visualizations. WebGL is a JavaScript API that enables the programmer to create 2D and 3D content within the browser. The WebGL API is arguably more difficult to work with than D3. A hybrid design that uses both D3 and WebGL might be the most effective. This would retain the convenient data manipulation methods that D3 provides while gaining the performance benefits of WebGL.

Chapter 7

Conclusion

Many available HTTP benchmarking tools do not provide enough information for users to fully understand the performance of an HTTP server. They do not report information about the individual client experience, providing only aggregate data points instead.

To address this problem, we designed and developed ink, an HTTP benchmarking tool that allows users to better understand the performance of an HTTP server. ink focuses on providing users with information about the individual client experience, while still allowing users to understand the performance of their server overall. ink does this by presenting users with a set of data visualizations. Most notably, ink generates a set of client timelines so that a user may know how each client was served over the course of the benchmark. These client timelines help users understand server performance and make informed design and configuration decisions. ink also provides users with a way to generate high load with a distributed set of machines. Modern HTTP servers are able to handle a very high number of concurrent connections; ink allows users to test how a server may perform with hundreds of thousands of concurrent clients. Additionally, we show that ink is capable of recording the data points needed for the client timeline visualization while introducing only minimal overhead.

We have shown a series of case studies that highlight use cases for ink. These range from helping debug an improperly written server to providing details about the configuration and performance of two popular HTTP server implementations. These all show that ink's client timelines can help users understand server performance and behavior.

We deployed ink in Virginia Tech’s Computer Systems course for student use. The students were able to use ink to benchmark their personal HTTP server implementations. Afterwards, we performed a survey of the students in the class. Although the number of responses to the survey was low, the majority of respondents indicated that ink was useful and that it helped them understand the performance of their server.

Bibliography

[1] C. Brewer, “ColorBrewer.” http://www.ColorBrewer.org, 2001. [Online; accessed 2020-03-10].

[2] A. Goldstein, “HHS failed to heed many warnings that HealthCare.gov was in trouble,” The Washington Post, February 2016. [Online; accessed 2020-02-11].

[3] T. Hsu and T. Siegel Bernard, “Coronavirus layoff surge overwhelms unemployment offices,” April 2020. [Online; accessed 2020-02-24].

[4] Wikipedia contributors, “Slashdot effect — Wikipedia, the free encyclopedia.” https://en.wikipedia.org/wiki/Slashdot_effect, 2019. [Online; accessed 2020-02-11].

[5] D. Kegel, “The C10K problem.” http://www.kegel.com/c10k.html, February 2019. [Online; accessed 2020-02-11].

[6] R. Graham, “C10M.” http://c10m.robertgraham.com/p/manifesto.html. [Online; accessed 2020-05-02].

[7] M. Welsh, D. Culler, and E. Brewer, “SEDA: An architecture for well-conditioned, scalable Internet services,” in Proceedings of the Eighteenth ACM Symposium on Operating Systems Principles, SOSP ’01, (Banff, Alberta, Canada), pp. 230–243, 2001.

[8] R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer, “Capriccio: Scalable threads for Internet services,” in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, (Bolton Landing, NY, USA), pp. 268–281, 2003.


[9] W. Glozer and Github contributors, “wrk.” https://github.com/wg/wrk. [Online; accessed 2020-02-11].

[10] Apache Software Foundation, “Apache JMeter.” https://jmeter.apache.org. [Online; accessed 2020-02-11].

[11] M. Fomenkov, K. Keys, D. Moore, and K. Claffy, “Longitudinal study of Internet traffic in 1998–2003,” in Proceedings of the 2004 Winter International Symposium on Information and Communication Technologies, WISICT ’04, (Cancun, Mexico), pp. 1–6, Trinity College Dublin, 2004.

[12] ISI, “Transmission Control Protocol,” RFC 793, RFC Editor, September 1981.

[13] T. Socolofsky and C. Kale, “A TCP/IP Tutorial,” RFC 1180, RFC Editor, January 1991.

[14] S. Deering and R. Hinden, “Internet Protocol, Version 6 (IPv6) Specification,” RFC 2460, RFC Editor, December 1998.

[15] K. Egevang and P. Francis, “The IP Network Address Translator (NAT),” RFC 1631, RFC Editor, May 1994.

[16] C. A. Sunshine and Y. K. Dalal, Connection Management in Transport Protocols, pp. 245–264. USA: Artech House, Inc., 1988.

[17] V. Paxson, M. Allman, J. Chu, and M. Sargent, “Computing TCP’s Retransmission Timer,” RFC 6298, RFC Editor, June 2011.

[18] M. Allman, V. Paxson, and W. Stevens, “TCP Congestion Control,” RFC 2581, RFC Editor, April 1999.

[19] A. Hussain, J. Heidemann, and C. Papadopoulos, “A framework for classifying denial of service attacks,” in Proceedings of the 2003 ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM ’03, (Karlsruhe, Germany), pp. 99–110, 2003.

[20] W. Eddy, “TCP SYN Flooding Attacks and Common Mitigations,” RFC 4987, RFC Editor, August 2007.

[21] T. Berners-Lee, R. Fielding, and H. Frystyk, “Hypertext Transfer Protocol – HTTP/1.0,” RFC 1945, RFC Editor, May 1996.

[22] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, “Hypertext Transfer Protocol – HTTP/1.1,” RFC 2616, RFC Editor, June 1999.

[23] Internet Archive, “Report: State of the web.” https://httparchive.org/reports/state-of-the-web, 2020. [Online; accessed 2020-03-18].

[24] H. F. Nielsen, J. Gettys, A. Baird-Smith, E. Prud’hommeaux, H. W. Lie, and C. Lilley, “Network performance effects of HTTP/1.1, CSS1, and PNG,” in Proceedings of the 1997 ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM ’97, (Cannes, French Riviera, France), pp. 155–166, 1997.

[25] Apache Software Foundation, “Apache core features.” http://httpd.apache.org/docs/2.4/mod/core.html. [Online; accessed 2020-03-03].

[26] The Chromium Projects, “HTTP pipelining.” https://www.chromium.org/developers/design-documents/network-stack/http-pipelining. [Online; accessed 2020-03-23].

[27] MozillaZine, “Network.http.pipelining.” http://kb.mozillazine.org/Network.http.pipelining, February 2012. [Online; accessed 2020-03-23].

[28] The Chromium Projects, “Network stack.” https://www.chromium.org/developers/design-documents/network-stack. [Online; accessed 2020-03-23].

[29] R. Fielding and J. Reschke, “Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing,” RFC 7230, RFC Editor, June 2014.

[30] M. Belshe, R. Peon, and M. Thomson, “Hypertext Transfer Protocol Version 2 (HTTP/2),” RFC 7540, RFC Editor, May 2015.

[31] R. von Behren, J. Condit, and E. Brewer, “Why events are a bad idea (for high-concurrency servers),” in Proceedings of the 9th Conference on Hot Topics in Operating Systems, HOTOS ’03, (Lihue, Hawaii), USENIX Association, 2003.

[32] J. Ousterhout, “Why threads are a bad idea (for most purposes).” USENIX Technical Conference, 1995.

[33] E. A. Lee, “The problem with threads,” Computer, vol. 39, pp. 33–42, May 2006.

[34] H. C. Lauer and R. M. Needham, “On the duality of operating system structures,” SIGOPS Oper. Syst. Rev., vol. 13, pp. 3–19, Apr 1979.

[35] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur, “Cooperative task management without manual stack management,” in Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference, ATC ’02, (Monterey, California), pp. 289–302, USENIX Association, June 2002.

[36] B. Erb, “Concurrent programming for scalable web architectures,” diploma thesis, Institute of Distributed Systems, Ulm University, April 2012.

[37] N. D. Matsakis and F. S. Klock, “The Rust Language,” in Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Technology, HILT ’14, (Portland, Oregon, USA), pp. 103–104, October 2014.

[38] The Rust Survey Team, “Rust survey 2018 results.” https://blog.rust-lang.org/2018/11/27/Rust-survey-2018.html, November 2018. [Online; accessed 2020-04-16].

[39] Netcraft, “April 2020 web server survey.” https://news.netcraft.com/archives/category/web-server-survey/, April 2020. [Online; accessed 2020-04-16].

[40] Joyent, “About Node.js.” https://nodejs.org/en/about/. [Online; accessed 2020-04-16].

[41] Google, “Go.” https://golang.org. [Online; accessed 2020-04-16].

[42] The Open Group, “The open group base specifications, issue 7,” 2018.

[43] L. Gammo, T. Brecht, A. Shukla, and D. Pariag, “Comparing and evaluating epoll, select, and poll event mechanisms,” in Proceedings of the 6th Annual Ottawa Linux Symposium, 2004.

[44] A. Chandra and D. Mosberger, “Scalability of Linux event-dispatch mechanisms,” in Proceedings of the General Track: 2001 USENIX Annual Technical Conference, (Boston, Massachusetts), USENIX Association, June 2001.

[45] G. Banga and P. Druschel, “Measuring the capacity of a web server,” in Proceedings of the USENIX Symposium on Internet Technologies and Systems, USITS ’97, (Monterey, California), pp. 61–71, December 1997.

[46] R. B. Miller, “Response time in man-computer conversational transactions,” in Proceedings of the ACM Fall Joint Computer Conference, AFIPS ’68, (San Francisco, California), pp. 267–277, 1968.

[47] M. Mayer, “Speed research,” Nov 2006. Presentation at Web 2.0 Summit.

[48] Apache Software Foundation, “Apache Bench.” https://httpd.apache.org/docs/2.4/programs/ab.html, 2020. [Online; accessed 2020-03-03].

[49] J. Dogan and Github contributors, “hey.” https://github.com/rakyll/hey, 2020. [Online; accessed 2020-03-03].

[50] D. Mosberger and T. Jin, “httperf–a tool for measuring web server performance,” SIGMETRICS Perform. Eval. Rev., vol. 26, pp. 31–37, December 1998.

[51] G. Tene, “How NOT to measure latency.” https://www.infoq.com/presentations/latency-response-time/. [Online; accessed 2020-04-20].

[52] Redis Labs, “Redis.” https://redis.io. [Online; accessed 2020-05-21].

[53] kqueue(2) FreeBSD Manual Pages, 12.1 ed., July 2018.

[54] epoll(7) Linux Programmer’s Manual, 5.05 ed., March 2019.

[55] Google, “Protocol buffers.” https://developers.google.com/protocol-buffers. [Online; accessed 2020-05-22].

[56] A. Sumaray and S. K. Makki, “A comparison of data serialization formats for optimal efficiency on a mobile platform,” in Proceedings of the 6th ACM International Conference on Ubiquitous Information Management and Communication, ICUIMC ’12, (Kuala Lumpur, Malaysia), 2012.

[57] M. Bostock, V. Ogievetsky, and J. Heer, “D3: Data-driven documents,” IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2301–2309, 2011.

[58] S. Behal and K. Kumar, “Detection of DDoS attacks and flash events using information theory metrics–an empirical investigation,” Computer Communications, vol. 103, pp. 18–28, 2017.

[59] J. Summers, T. Brecht, D. Eager, and B. Wong, “Methodologies for generating HTTP streaming video workloads to evaluate web server performance,” in Proceedings of the 5th Annual ACM International Systems and Storage Conference, SYSTOR ’12, (Haifa, Israel), 2012.

[60] A. Abhari and M. Soraya, “Workload generation for YouTube,” Multimedia Tools Appl., vol. 46, pp. 91–118, January 2010.

[61] P. Ramamurthy, V. Sekar, A. Akella, B. Krishnamurthy, and A. Shaikh, “Using mini-flash crowds to infer resource constraints in remote web servers,” in Proceedings of the 2007 ACM SIGCOMM Workshop on Internet Network Management, INM ’07, (Kyoto, Japan), pp. 250–255, 2007.

[62] M. Shams, D. Krishnamurthy, and B. Far, “A model-based approach for testing the performance of web applications,” in Proceedings of the 3rd ACM International Workshop on Software Quality Assurance, SOQUA ’06, (Portland, Oregon), pp. 54–61, 2006.

[63] K. Kant, V. Tewari, and R. Iyer, “Geist: A web traffic generation tool,” in Evaluation: Modelling Techniques and Tools (T. Field, P. G. Harrison, J. Bradley, and U. Harder, eds.), pp. 227–232, 2002.

[64] P. Barford and M. Crovella, “Generating representative web workloads for network and server performance evaluation,” in Proceedings of the 1998 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, (Madison, Wisconsin, USA), pp. 151–160, 1998.

[65] G. Tene and Github contributors, “wrk2.” https://github.com/giltene/wrk2, 2020. [Online; accessed 2020-04-20].

[66] G. Tene, “HdrHistogram: A high dynamic range histogram.” http://hdrhistogram.org. [Online; accessed 2020-04-20].

[67] Z. Liu, B. Jiang, and J. Heer, “imMens: Real-time visual querying of Big Data,” in Proceedings of the 15th Eurographics Conference on Visualization, EuroVis ’13, (Leipzig, Germany), pp. 421–430, 2013.

[68] L. Battle, M. Stonebraker, and R. Chang, “Dynamic reduction of query result sets for interactive visualization,” in 2013 IEEE International Conference on Big Data, pp. 1–8, 2013.