AN ABSTRACT OF THE THESIS OF

Pranjal Mittal for the degree of Master of Science in Computer Science presented on September 1, 2016.

Title: Benchmarking Containers and Virtual Machines

Abstract approved: ______Carlos Jensen Rakesh Bobba

In data centers, running multiple isolated workloads while getting the most performance out of the available hardware is key. For many years Virtual Machines (VMs) have been an enabler, but native containers, which offer isolation similar to virtual machines while avoiding the overhead of emulating hardware resources, have become an increasingly attractive proposition for data centers.

This thesis evaluates the CPU and Memory I/O performance of containers and VMs (specifically Docker and KVM) against bare metal, for nodes/clusters with a single virtual entity deployed as well as with multiple virtual entities deployed. Measuring the performance of each entity in a cluster consisting of several entities is an onerous and error-prone task due to the lack of benchmark synchronization, necessitating a solution for automated benchmarking. This leads to the second contribution of this thesis: a new open source tool prototype (Playbench) that enables repeatable benchmarking and allows evaluating the CPU and Memory I/O performance of each virtual entity (container or VM) in a cluster with multiple virtual entities deployed.

©Copyright by Pranjal Mittal September 1, 2016 All Rights Reserved

Benchmarking Linux Containers and Virtual Machines

by Pranjal Mittal

A THESIS

submitted to

Oregon State University

in partial fulfillment of the requirements for the degree of

Master of Science

Presented September 1, 2016 Commencement June 2017

Master of Science thesis of Pranjal Mittal presented on September 1, 2016

APPROVED:

______Co-Major Professor, representing Computer Science

______Co-Major Professor, representing Computer Science

______Director of the School of Electrical Engineering and Computer Science

______Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes the release of my thesis to any reader upon request.

______Pranjal Mittal, Author

ACKNOWLEDGEMENTS

I would like to express my gratitude to my advisors Dr. Carlos Jensen and Dr. Rakesh Bobba for their continuous time, support, and feedback throughout this work, and to Gorantla Sai Krishna, Dept. of Computer Science and Engineering, IIT (BHU), Varanasi, for his contributions in running the experiments and his input on interpreting the final results. I would also like to thank Intel Corporation for supporting this work in part under the Intel-OSU graduate fellowship program 2015-2016, and the OSU Open Source Lab for help with setting up an Intel Xeon server at the OSL data center. I am also thankful to D. Kevin McGrath for accepting the invitation and taking the time to be a part of my graduate committee.

TABLE OF CONTENTS

Page

1 Introduction ...... 1
1.1 Performance Characteristics: CPU and Memory I/O ...... 2
1.2 Research Questions ...... 3
2 Related Work ...... 5
3 Automation of Benchmarking ...... 8
3.1 Ansible and Ansible Playbooks ...... 10
3.2 PlayBench: Container-VM Benchmark Automation Tool ...... 11
4 Methodology ...... 14
4.1 Starting/Stopping Virtual Machines or Containers ...... 15
4.2 Networking ...... 16
4.3 Steps for Benchmarking Virtual Machines/Containers ...... 17
4.3.1 Single Entity: Massively Parallel Workload ...... 18
4.3.2 Multiple Entities: CPU and Memory I/O Performance ...... 18
4.4 Other Methodology Notes ...... 19
5 Results ...... 22
5.1 Single Virtual Entity Deployed: Massively Parallel Workload ...... 22
5.2 Multiple Virtual Entities Deployed: CPU Performance ...... 25
5.3 Multiple Virtual Entities Deployed: Memory I/O ...... 32
6 Limitations and Future Work ...... 40
7 Conclusion ...... 42
8 Bibliography ...... 43
Appendix ...... 47

LIST OF FIGURES

Figure Page

Figure 3.1 Inventory of multiple nodes, containers, VMs ...... 13

Figure 4.1 Intel Xeon Server (2 NUMA Nodes) ...... 14

Figure 4.2 Virtual Units deployed on Host each running an equivalent workload 16

Figure 4.3 hosts file with container IDs and ansible_connection set to docker ... 17

Figure 4.4 Master playbook file for containers: site_containers.yml ...... 17

Figure 5.1 CPU Performance vs number of available threads ...... 23

Figure 5.2 % difference in performance relative to host (containers vs VMs) ..... 24

Figure 5.3 Single-core CPU Performance ...... 26

Figure 5.4 Percentage difference relative to host (single-core performance) ..... 27

Figure 5.5 Multi-core CPU Performance ...... 28

Figure 5.6 Percentage difference relative to host (multi-core CPU performance) ...... 29

Figure 5.7 Single-core Integer Performance ...... 30

Figure 5.8 Multi-core Integer Performance ...... 30

Figure 5.9 Single-core Floating Performance ...... 31

Figure 5.10 Multi-core Floating Performance ...... 31

Figure 5.11 Single-core Memory Performance ...... 33


Figure 5.12 Difference relative to host (single-core memory performance) ...... 34

Figure 5.13 Multi-core Memory Performance ...... 34

Figure 5.14 Percentage difference (multi-core memory performance) ...... 35

Figure 5.15 Memory: Copy (single-core performance) ...... 36

Figure 5.16 Memory: Copy (multi-core performance) ...... 36

Figure 5.17 Memory: Scale (single-core performance) ...... 37

Figure 5.18 Memory: Scale (multi-core performance) ...... 37

Figure 5.19 Memory: Add (single-core performance) ...... 38

Figure 5.20 Memory: Add (multi-core performance) ...... 38

Figure 5.21 Memory: Triad (single-core performance) ...... 39

Figure 5.22 Memory: Triad (multi-core performance) ...... 39


Chapter 1: Introduction

Performance unpredictability has been listed among the top obstacles to the growth of cloud computing [1]. With the adoption of container frameworks in cloud infrastructure and development environments, there has been increased interest among developers and system administrators in understanding the runtime performance and security impact of running Linux containers [2] [3].

Until recently, the state of the art for deploying and managing multiple isolated workloads on a single machine was to use virtual machines. KVM (Kernel-based Virtual Machine) [4] is a popular example of a hypervisor used to provision virtual machines; it is a Linux kernel module that allows the kernel to function as a Type 1 hypervisor [5]. Container-based virtualization instead makes use of the relatively new Linux kernel features called Namespaces and Cgroups [6] to establish end-user isolation, rather than running emulated hardware for each isolated entity and running an OS on top. This allows containers to offer isolation similar to virtual machines without having to spend resources on emulating hardware and running multiple operating systems, which, at least in theory, makes them more efficient. Containers are based on two key technologies: Cgroups and Namespaces. Cgroups allow for the creation of an isolated group of processes with restricted resources, whereas Namespaces limit the visibility of processes to a group (namespace) in such a way that processes may only access resources within their namespace. Such a group of processes forming an isolated workspace is collectively called a container [7]. Containers, therefore, offer many of the technical features people look for in VMs, while offering distinct advantages such as simplified deployment [6].

Given the promise of increased performance and simplified deployment and management, many data center users are contemplating switching from VM-based deployment to native containers [3], [8]. The question for them is whether the switch is worthwhile, which means predicting performance gains. Recent attempts have been made to benchmark containers against VMs [3], [9], [10], but most of these have looked at a small set of performance metrics and workloads, usually with a single or only a few deployed containers or VMs. In our work, the goal was to benchmark the CPU and Memory I/O performance of VMs and containers for a single as well as a variable number of deployed virtual entities, to get a better understanding of performance as well as scalability. Furthermore, given the rapid evolution of these infrastructures and the manual, time-consuming benchmarking methods, we wanted to develop a tool that allows orchestrating benchmark runs on infrastructures with multiple virtual entities, in order to measure not just system-level performance but also the performance of each virtual entity in the system. We use the phrases virtual entity and virtual unit often in this text; as the names suggest, they are collective nouns referring to either containers or virtual machines.

1.1 Performance Characteristics: CPU and Memory I/O

Our work focuses on CPU and Memory I/O performance, two of the key performance measures that characterize a cloud workload [11], [12], the others being Network and File I/O. Each application workload is a combination of different performance characteristics; some are CPU intensive whereas others are I/O intensive. Table 1.1 shows some example use cases where CPU and Memory I/O performance, respectively, are of significant importance.


Performance Characteristic    Example Use Cases

CPU Performance               Machine Learning and Artificial Intelligence, Cryptography (encryption/decryption), Image Processing, Gaming Engines

Memory I/O                    In-memory queuing systems, in-memory databases (e.g., Redis)

Table 1.1 CPU, Memory I/O performance characteristics: Example Use-Cases

1.2 Research Questions

● RQ1: Are containers able to effectively use multiple cores under parallel CPU workloads?

● RQ2: What are the single-core and multi-core CPU and Memory I/O overheads of containers, and how do they compare to those of virtual machines?

● RQ3: How effectively do containers scale in comparison to virtual machines as the number of deployed entities increases?

Chapter 2 discusses related work in the field of performance benchmarking of containers and virtual machines. Chapter 3 describes the importance of automating benchmarking, how it can be achieved, and Playbench, our benchmark automation tool prototype. Chapter 4 describes the hardware and software used for the benchmarking experiments and each step in our benchmarking process. Chapter 5 discusses the results observed after benchmarking and possible reasons that explain the behavior. Chapter 6 discusses the limitations of this thesis and future opportunities in benchmarking containers and virtual machines. Chapter 7 ends with a conclusion and a summary of results.


Chapter 2: Related Work

An empirical study by Felter et al. [3] provided a quantitative performance comparison of containers (Docker) and virtual machines (KVM) across various performance characteristics, such as CPU, Memory, Network, and File I/O performance. KVM was used as the hypervisor and Docker for running and managing containers. The authors concluded that both KVM and Docker introduce negligible overhead for CPU and memory performance barring extreme cases. However, the authors acknowledged that their study is limited to single VMs or containers that consumed the entire server, whereas in the data center it is more common to run multiple smaller virtual units on the same server [13]; this is a weakness that we aimed to address. Further, we noted that the study by Felter et al. performs its analysis only for a narrow set of workloads. For example, it does not evaluate the behavior and performance of containers and virtual machines under parallel workloads, a shortcoming addressed in our experiments.

Beserra et al. [9] performed a performance comparison of containers (LXC) and virtual machines (KVM). They concluded that LXC was more suitable for High-Performance Computing than KVM. In recent years Docker, due to its ease of use and features, has overtaken LXC to become a standard choice for deploying containers [14]; thus we focus on Docker in our study.


Xavier et al. [10] primarily investigated the performance of LXC, OpenVZ, and VServer based containers against virtual machines (Xen). Their study showed that LXC has higher performance than Xen in nearly all performance aspects. Many of the benchmarks used in their study, like Linpack and STREAM [42], are also adopted in the study by Felter et al. [3]. The main limitation of this study is that it does not evaluate the performance of Docker containers, which have since become the most widely used container technology.

Matthews et al., in their 2007 study [15], discuss the performance of virtual machines relative to native performance. Our study adopts a similar approach for measuring performance: the percentage overheads in our plots for containers and VMs are measured relative to the host. The authors also design a performance isolation benchmark that performs various stress tests for CPU, Memory, and Disk. One of the main limitations of that study is that it is not focused on the performance evaluation of containers. However, the authors did suggest that as virtualization systems become more and more common, the importance of benchmarks that compare virtualization environments would increase.

Shivam et al. [16] discuss policies for automating benchmarking. They highlighted that benchmarking often requires multiple server configurations or workloads and provided a workload automation framework prototype. Their prototype is capable of spawning multiple virtual machines and varying the virtual hardware resources per entity; however, details on how to run a benchmark on multiple virtual entities and obtain performance metrics per virtual entity are not provided. The framework prototype also does not account for the benchmarking of containers, which have become more popular in recent times. The fact that a benchmarking automation tool works for VMs does not imply the same would work for containers, which is why we wanted to develop a generic tool that would work for both containers and virtual machines.

Google Perfkit [17] is an attempt at developing a consolidated open source benchmark automation tool on top of existing benchmarks. Overall, the tool allows comparing a set of cloud offerings. However, it lacks the ability to measure the performance of every virtual entity and is more geared towards measuring overall/system-level performance. Measuring the average performance of each virtual entity (container or VM) deployed in a cluster was required to answer our research questions, which made it necessary for us to write our own tool [18] to automate the benchmarking.

Another related study by Padala et al. [19] was focused on performance comparison of virtual machines (Xen) and containers. The study not only evaluates application performance, resource consumption, and scalability but also evaluates low-level system metrics like cache misses. However, the study uses older hypervisor versions and outdated container technology (OpenVZ).


Chapter 3: Automation of Benchmarking

One of the key ideas in benchmarking, in the context of computing, is that the benchmark (benchmarking tool) either generates a test workload or implicitly acts as a workload on whichever physical or virtual entity it runs on. Since a benchmark is meaningless without a test workload, the word benchmark implies the existence of a corresponding workload. In this text, we therefore sometimes use the terms benchmark and workload interchangeably.

In a data center, each node often hosts multiple virtual entities. Hence, an ideal benchmarking solution would not only allow benchmarking a single machine or virtual entity but also allow:

1. Measuring Host/Cluster Level Performance.
2. Measuring performance of each virtual entity deployed on the node/cluster.

As the number of virtual entities deployed on a host node or cluster increases, installing and running a benchmarking workload on each entity becomes an onerous task. A system administrator running the benchmark should ensure that each virtual entity in the cluster runs an approximately equal workload. Ideally, the same benchmark induces an equal workload on each virtual entity at any given time instant when the workload on each deployed virtual entity starts at the same time. This synchronization of the benchmarks, if attempted manually, would cause a significant delay between the start of the workload on one entity versus another and may lead to erroneous results. Thus, we require tools that use parallelism to allow spawning benchmarks on multiple hosts or virtual entities at the same time.

Also, different types of virtual entities have different connection adapters. For example, in the case of VMs the SSH daemon may be used to establish a connection to the VM shell, whereas containers (like Docker containers) typically lack an SSH daemon and instead provide their own connection adapter. A further challenge is collecting the performance metrics that reside in each virtual unit, because logging into each VM or container to parse and collect benchmarking results would be wearisome and time-consuming.

Google Perfkit [17] allows measuring system-level performance, i.e., the performance of a cloud offering as a whole or of a host node, but our attempt to study and use Perfkit did not reveal sufficient tooling to measure the performance of each deployed virtual entity.

As is the case when benchmarking data centers, in most of the experiments that we planned to conduct it was important to measure not only the performance of the host running the virtual entities but also the performance of each VM or container. Even with available benchmarks like Geekbench [20], which can act as a workload and benchmark any single node or virtual entity running a standard operating system, repeating the benchmarking process and collecting final results from every virtual entity deployed on a node is an arduous and error-prone process, as discussed above, which prompted us to develop a tool that would do the following:

1. Support repeatable, automated benchmarking on clusters deploying multiple virtual entities.

2. Allow evaluating the performance of each virtual entity as well as host performance.

3. Synchronize the start of the underlying benchmark on multiple virtual entities deployed on the same host/cluster in parallel.

4. Work for multiple virtualization technologies (i.e. both containers and VMs), with the ability to dynamically use a different connection adapter depending on the type of virtual entity in context.

5. Parse and aggregate performance results from each virtual entity into one place so that the user of the benchmark does not have to connect to each unit to collect final performance values.

3.1 Ansible and Ansible Playbooks

Ansible [21] is an open-source IT automation and DevOps framework which allows configuration management, application deployment, cloud provisioning, ad-hoc task execution, and multi-node orchestration.

Ansible has key features like Modules, Tasks, Roles, and Playbooks. Modules allow controlling system resources such as services, packages, and files, or managing the execution of system commands. A task is an instance of using an Ansible module to perform an action, like calling a shell command with arguments or downloading a file. Roles are collections of tasks (along with other static files or templates) that are to be applied to a node. Finally, a playbook is a collection of roles.

Tools developed via Ansible can be wrapped up as an Ansible Playbook which can then be applied to multiple nodes including virtual entities, like containers and VMs. Such a collection of nodes (physical or virtual) to which a role can be applied is called an Inventory [22].
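To make these building blocks concrete, the following is a minimal, hypothetical role task file (our illustration, not taken from Playbench); the package name and command are placeholders:

    # roles/example/tasks/main.yml -- hypothetical illustration, not part of Playbench
    - name: Install the stress utility          # a task using the package module
      package:
        name: stress
        state: present

    - name: Run a short CPU stress test         # a task wrapping a shell command
      shell: stress --cpu 4 --timeout 30

Applying a role made of such tasks to every node in an inventory is what lets the same steps be repeated identically on bare metal, VMs, and containers.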

3.2 PlayBench: Container-VM Benchmark Automation Tool

Due to the variety of features Ansible offers, it was a good choice as the underlying automation framework for building a benchmark automation tool that allows performance evaluation of multiple virtual entities as well as system-level performance evaluation. Playbench [18] is an open-sourced benchmark automation tool prototype that we developed. One of the key ideas behind Playbench is to model each well-known benchmark as a reusable Ansible role that can be applied to any host or virtual entity, or to an inventory of multiple nodes (physical or virtual). A role in the context of Playbench can also be thought of as a wrapper around a benchmark that is responsible for downloading and installing the benchmark, running it, parsing its output to obtain the desired metrics, and finally collecting all results in one place suitable for analysis. In the text ahead we often use the word "role", which, unless stated otherwise, refers to an Ansible role.

For the prototype version of Playbench, we wrote a Role for the Geekbench [20] benchmark, which allows benchmarking CPU performance and Memory I/O performance (indirectly, via the STREAM benchmark). The Geekbench Role encapsulates a set of tasks and scripts that allow downloading and installing the benchmark and its dependencies, running it, and parsing the benchmark output to obtain final performance tuples or performance vectors (each performance vector has multiple components, such as single-core CPU performance, multi-core CPU performance, single-core integer performance, etc.).

Overall, Playbench can be thought of as a collection of benchmarks, or custom Ansible roles, that can be applied to any inventory of choice. An inventory can be a collection of IP addresses of hosts or virtual machines, or of container IDs in the case of Docker containers. Using Playbench gave us the ability to ensure that the exact same benchmarking procedure is followed on both containers and VMs, which was important for a fair comparison in the experiments discussed in subsequent chapters. Ansible's pluggable connections allow us to use the SSH daemon for VMs and the Docker adapter for connecting to containers, thus allowing the same role to be reused for both virtual machines and containers. Figure 3.1 shows an example of an Ansible inventory containing 3 groups of nodes: baremetal, VMs, and containers.

[baremetal]
localhost

[containers]
5fad57eda134
7835582520d7
79db37740fa6
56bbb5fff4b9
4670f0f06ffb
666f51c7675d
49774c127bb2
5670f0f06ffb

[vms]
192.168.152.96
192.168.152.10
192.168.152.195
192.168.152.76
192.168.152.159
192.168.152.46
192.168.152.148
192.168.152.173
192.168.152.89
192.168.152.28

Figure 3.1 Inventory of multiple nodes, containers, VMs

Also, we ensure that the benchmarks are spawned in parallel on each virtual entity deployed on the cluster by setting the fork level [23] to a sufficiently large number, greater than the number of nodes in the inventory. Further, to ensure that the start of each task (especially the task that initiates the benchmark/workload run across the nodes) is synchronized, a linear execution strategy [24] is adopted. With the linear strategy, each task in the benchmark role completes on all nodes before the next task starts.
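As a sketch of what this looks like in practice (our illustration; the actual configuration lives in the Playbench repository [18]), the fork level is set in ansible.cfg ([defaults] forks = 50) or with the -f flag of ansible-playbook, while the execution strategy can be pinned at the play level:

    # Hypothetical play header; the fork level itself is not a playbook key
    - name: Run the benchmark role on all virtual entities in parallel
      hosts: containers
      strategy: linear   # each task finishes on every host before the next task starts
      roles:
        - geekbench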


Chapter 4: Methodology

All experiments were performed on a server with two Intel Xeon E5-2680 CPUs @ 2.70 GHz and 16 GiB of memory; each CPU has 8 cores with 2-way hyper-threading per core. The L3 cache size is 20 MB. More details about this particular machine architecture can be found on the Xeon Product Family specification pages [25]. The rough CPU and memory architecture of the node is shown in Figure 4.1.

[Diagram: two Intel Xeon E5-2680 processors @ 2.7 GHz (8 cores each), each with its own local DDR3 memory, connected by 2 QPI links; a chipset with other ports (USB, etc.) is attached via DMI.]

Figure 4.1 Intel Xeon Server (2 NUMA Nodes)

The Linux kernel version used is 3.10.0-327.22.2.el7.x86_64. KVM (QEMU 2.0.0) is used as the hypervisor, with libvirt 1.2.17 [26]. The host runs the CentOS 7 operating system, which is commonly used to manage data center nodes. Docker v1.10.3 is used to manage containers.

4.1 Starting/Stopping Virtual Machines or Containers

Wrapper scripts around the virsh [26] command are used to create, start, stop, and destroy virtual machines. These scripts are bundled in our Playbench tool repository [18], and Ubuntu 14.04 LTS is used as the deployed VM image. For containers, the docker [6] command is used to create, start, and stop containers, and the Ubuntu 14.04 image is also used as the base image for spawning Docker containers. While creating virtual machines it is ensured that vCPUs are not pinned to any core, so that each VM can use all cores available on the host. Docker containers do not impose any resource restrictions unless these are explicitly configured [27] and are able to make use of all CPU cores. Each virtual entity (VE) deployed in the benchmarking experiments runs an equivalent CPU and memory workload; Figure 4.2 provides an illustration. In any given experimental run, all simultaneously deployed virtual entities are either containers or VMs, never a mix of both.


[Diagram: a single Xeon node hosting a grid of virtual entities (VEs), each running the same workload.]

Figure 4.2 Virtual Units deployed on Host each running an equivalent workload
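As an illustration of the kind of commands these wrapper scripts issue (a hedged sketch using standard docker and virsh invocations; the image tag and domain name are placeholders, and the actual scripts are those bundled in the Playbench repository [18]):

    # Hypothetical Ansible tasks showing how one virtual entity of each kind could be started
    - name: Start an Ubuntu 14.04 container with no CPU or memory limits
      shell: docker run -d ubuntu:14.04 sleep infinity

    - name: Start a previously defined KVM virtual machine via libvirt
      shell: virsh start ubuntu1404-vm01   # the libvirt domain name is a placeholder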

4.2 Networking

Even though benchmarking Network I/O is not the focus of this study, networking is essential to allow the virtual entities (containers and VMs) to connect to the external network/internet to fetch packages, including benchmark dependencies and the benchmark packages themselves. Bridged networking is used to allow both virtual machines and containers to connect to the external network/internet via the host. Explicitly creating separate virtual networks in the case of containers is not required, because when Docker is installed on the host node, 3 default networks, including one bridged network, are automatically created on the host. In the case of containers, a connection is established via the Docker connection adapter, and container IDs can be used as references to containers in the Ansible inventory file. ansible_connection can be set to docker in the inventory file, as shown in Figure 4.3, or in the Ansible master playbook file. An example master playbook file is shown in Figure 4.4; this is the master playbook file for containers, i.e. site_containers.yml. Similarly, we would have small site_vms.yml and site_baremetal.yml files which act as the master playbooks for VMs and bare metal respectively. These master playbook files, which can also be found in the Playbench repository [18], act as the entry point for the automated benchmarking process and define which hosts are affected and which benchmarking Roles are applied.

[containers]
ae27a97a14bb ansible_connection=docker
be98a55a16df ansible_connection=docker

Figure 4.3 hosts file with container IDs and ansible_connection set to docker

- name: Starting playbook and applying roles
  hosts: containers
  remote_user: root
  connection: docker
  vars:
    num_hosts: "{{ groups['containers'] | length }}"

  roles:
    - common
    - geekbench

Figure 4.4 Master playbook file for containers: site_containers.yml

4.3 Steps for Benchmarking Virtual Machines/Containers

This section provides step-by-step instructions to repeat the benchmarking process for performance evaluation in both single-entity and multi-entity deployments. As discussed in the Introduction chapter, a single-entity deployment refers to a single container or VM deployed on a server consuming the entire system resources, whereas a multi-entity deployment refers to a server with multiple smaller virtual entities deployed, each of which shares the system resources with the others.

4.3.1 Single Entity: Massively Parallel Workload

1. Start the host server (abbreviated as host).
2. Create and run a VM on the host.
3. Install and run a massively parallel benchmark program, such as the Numeric Integration benchmark [28], varying the number of spawned threads in the benchmark program from 1 to 40 (a common form of such a program is sketched below).
4. Repeat Steps 1-3 for a container instead of a VM.
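For context, a typical numeric integration micro-benchmark of this kind (we do not claim this is the exact program used in [28]; it merely illustrates why the parallel fraction is high) approximates \(\pi\) with a midpoint-rule sum whose terms can be split evenly across threads:

\[ \pi = \int_{0}^{1} \frac{4}{1+x^{2}}\,dx \;\approx\; \frac{1}{n} \sum_{i=1}^{n} \frac{4}{1+\left((i-0.5)/n\right)^{2}} \]

Each spawned thread computes the partial sum over a disjoint range of i, so nearly all of the work parallelizes.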

The Playbench tool is not required in the case of a single-entity deployment, as managing a single virtual entity is easier than managing several. However, using it would ensure consistency in the tasks performed on VMs and containers, respectively.

4.3.2 Multiple Entities: CPU and Memory I/O Performance

1. Start N VMs on the host using scripts bundled in the Playbench tool [18].
2. Download and install the benchmark on each node with the help of Playbench.
3. Use Playbench to simultaneously spawn the benchmark in context (Geekbench/STREAM) on each of the N virtual entities deployed on the host.
4. After the benchmark completes, Playbench helps parse, aggregate, and log the relevant results from all virtual entities in one place.
5. Repeat Steps 1-4, varying N from 1 to 12, where N is the number of virtual entities simultaneously deployed on the host. This results in N performance score vectors, one per virtual entity. We find the average performance score vector for each N by dividing the sum of these vectors by N (see the formula below).
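In our notation (this simply restates the aggregation in step 5, not an additional procedure), if \(P_i\) denotes the performance score vector reported for the i-th virtual entity, the value plotted for a given N is:

\[ \bar{P}(N) = \frac{1}{N} \sum_{i=1}^{N} P_i \]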

Repeat the same steps for Containers instead of VMs. Instead of using libvirt/virsh, the docker command is used and Ubuntu 14.04 image is deployed.

4.4 Other Methodology Notes

Geekbench [20] is used as the underlying benchmark for experiments involving multiple deployed entities, orchestrated by Playbench. Geekbench is a cross-platform processor benchmark that evaluates single-core and multi-core performance for integer and floating point workloads. Previous studies [29], [30] have used Geekbench for the performance evaluation of virtual machines. In our experiments, we use Geekbench 3.3 (32-bit) as the underlying benchmark. Geekbench is run on the host as well as on the containers and virtual machines deployed on the host. While the face value of the scores output by Geekbench is not important in itself, the relative differences are what matter: a higher score indicates higher performance, and double the score indicates double the performance, as per the Geekbench documentation [20].

Our tool, Playbench [18], as discussed in Chapter 3, is used for orchestrating the Geekbench benchmark runs across multiple containers and virtual machines.

Playbench helps in installing dependencies, fetching and installing the benchmark, synchronizing the start of the benchmark, and finally parsing and aggregating the relevant benchmark results from each virtual entity. This process, if done manually, would be tedious and error-prone. Synchronizing the start of the benchmarking workload on all simultaneously deployed virtual entities is essential for fair experiments, because the study assumes each virtual entity runs a nearly equal workload.

Even though we are able to synchronize the start of the benchmark, it is not possible to ensure that the benchmarks end at the same time, especially for a higher number of virtual entities, because we do not have control over the OS- or hypervisor-level scheduling of processes. We observed that as the number of virtual entities deployed on the host increased beyond 12 units, the end times of the Geekbench workload on the deployed virtual entities did not align. These differences in the end time of the workload were insignificant compared to the overall run time of the benchmark (>20 minutes) but large enough to be noticeable (1-10 seconds). This inconsistency in completion time is likely caused by the unpredictable scheduling of the workload processes and is exacerbated by the high consumption of system resources under the CPU-intensive workloads induced by the benchmark: a group of processes may not get the same number of CPU cycles as another group in a given time interval, especially when there is high demand for CPU cycles. These differences in end times can lead to noise in the final performance results, because some virtual entities could still be running the benchmark while the workloads of other virtual entities have already exited. This is why we only increased the number of deployed virtual entities up to 12. The observed inconsistency in benchmark end time for VMs was slightly higher than that for containers, which we suspect is due to the greater scheduling overhead required in the case of VMs: a VM involves hypervisor-level scheduling [31] along with OS-level scheduling, whereas a container only requires OS-level process scheduling, as there is no hypervisor.

We used outlier removal methods [32], followed by averaging of results over 5 runs for each experiment to get final performance scores. These performance results are discussed in the next chapter.


Chapter 5: Results

Section 5.1 discusses results for a massively parallel workload running on a single container or virtual machine deployed on a host, without imposing any resource limits via vCPU pinning [31] or, in the case of containers, CPU share allocation [27]. Sections 5.2 and 5.3 discuss CPU and Memory I/O benchmarking results, respectively, for deployments consisting of multiple virtual machines (or containers) on the same host.

Along with the performance charts for some of the results, we also provide a percentage difference chart that helps better understand the performance of the virtual entity relative to the host. Since, in theory, containers and virtual machines are expected to perform no better than the host they are deployed on, the percentage difference values are expected to be negative.
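Concretely, using our own notation (the thesis defines this measure only in words), if \(S_{\mathrm{VE}}\) is the average score of a virtual entity and \(S_{\mathrm{host}}\) is the corresponding bare-metal score, the quantity plotted in the percentage difference charts is:

\[ \%\,\text{difference} = \frac{S_{\mathrm{VE}} - S_{\mathrm{host}}}{S_{\mathrm{host}}} \times 100 \]

A negative value therefore means the virtual entity is slower than the host, and its magnitude is the relative overhead.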

5.1 Single Virtual Entity Deployed: Massively Parallel Workload

In this experiment, our goal was to understand how effectively containers and VMs without vCPU pinning utilize multiple cores under an embarrassingly parallel workload [33]. A benchmark program with a high parallel fraction [33], i.e. Numeric Integration [34], [35], is used for three cases: the host, a single virtual machine deployed on the host, and finally a single Docker container deployed on the same host. Only one virtual entity is run at a time: when the container is being benchmarked, the VM is shut off, and when the VM is being benchmarked, the container is shut off, to ensure that no additional performance overhead is caused by idle virtual entities that are not being actively benchmarked.

Figure 5.1 shows the performance patterns for the three cases as the number of available threads (i.e. the number of threads spawned) in the numeric integration benchmark program is increased. The Y-axis indicates performance, with double the value indicating double the performance.

[Chart: performance (constant / runtime) on the Y-axis versus number of threads (0-50) on the X-axis, for host, container, and vm.]

Figure 5.1 CPU Performance vs number of available threads

From the figure we can observe that performance peaks around 32 threads. This is because the host effectively has 32 logical processors (2 physical processors, each with 8 cores and 2-way hyper-threading per core). A massively parallel benchmark program can effectively utilize multiple cores. It is interesting to observe that both KVM-based virtual machines without vCPU pinning and Docker containers follow a similar performance pattern, which means that they are able to utilize multiple cores effectively under parallel workloads. This answers one of our research questions, which was aimed at understanding the behavior of containers and VMs under a massively parallel workload. Further, we can see that the performance of the host, container, and virtual machine is nearly similar, with the container being only slightly better than the virtual machine in multi-core performance. Figure 5.2 shows the percentage difference in performance of containers and VMs relative to the host. The average performance decrease relative to the host is 1.7% in the case of containers and 3.6% in the case of VMs. The slightly higher average drop in the case of VMs is likely because VMs have a greater system overhead than containers [3].

[Chart: % difference relative to host on the Y-axis versus number of threads (1-40) on the X-axis, for container and vm.]

Figure 5.2 % difference in performance relative to host (containers vs VMs)


In this section, we only analyzed performance with a single deployed container and a single deployed virtual machine. Also, the benchmark used is a massively parallel program that characterizes only a very particular workload with a high parallel fraction [33]; in reality, workloads have different values of parallel fraction. In the next section, we use a more established underlying benchmark, Geekbench, which generates numerous CPU workloads and provides various performance scores, and we evaluate performance as the number of deployed virtual entities increases. Playbench [18], our benchmark automation tool, is used to orchestrate and run the Geekbench benchmark on multiple virtual entities.

5.2 Multiple Virtual Entities Deployed: CPU Performance

The results below discuss various CPU performance characteristics: overall single-core performance, overall multi-core performance, integer performance (single- and multi-core), and floating point performance (single- and multi-core). The performance scores are obtained via the underlying benchmark, Geekbench [20].

As seen in Figure 5.3, the overall single-core performance for both containers and virtual machines consistently drops as the number of deployed entities increases, and the drop per entity is approximately linear. At first, we did not expect a drop in single-core performance until the number of deployed entities increased beyond the number of available cores, because each virtual entity should effectively have gotten its own core [36]. However, a drop is observed. Cache contention [37] is a possible reason for this behavior. The study on cache contention by Xu et al. [38] provides insight into the cache contention problem and its impact on performance; it is focused on cache contention in multi-core systems and provides a plausible explanation for our results. Note that even though we are analyzing single-core CPU performance here, this refers to single-core performance from the virtual entity's point of view. From the cache's point of view, it is still a multi-core access pattern, because all deployed virtual entities are accessing the same CPU cache simultaneously.

Figure 5.4 shows the percentage difference relative to the native host single-core performance for containers and VMs. Overall, we observe that containers have slightly better single-core CPU performance than virtual machines.

[Chart: overall single-core score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.3 Single-core CPU Performance

[Chart: % difference relative to host vs. number of deployed virtual entities (1-12) for Container and VM.]

Figure 5.4 Percentage difference relative to host (single-core performance)

Multi-core CPU performance in both virtual machines and containers shows a steep, exponential-like drop, as shown in Figure 5.5. Intuitively, we would expect to observe a low but non-zero multi-core performance as the number of deployed entities tends to a very large number. This also explains, to an extent, why the multi-core CPU performance curve is a decreasing function with the X-axis (the number of virtual entities) as its asymptote.

Containers seem to have better multi-core performance than VMs. The performance gap between containers and VMs is still small in this case, but it is relatively more prominent than the single-core CPU performance differences. This is because the scheduling of multi-core workloads requires more overhead in the case of virtual machines than containers: for VMs, some scheduling is done at the hypervisor level [31], whereas it is resolved entirely at the host OS/kernel level in the case of containers, as there is no hypervisor. In terms of percentage difference relative to the host, single-core and multi-core performance are similar.

[Chart: overall multi-core score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.5 Multi-core CPU Performance


[Chart: % difference relative to host vs. number of deployed virtual entities (1-12) for Container and VM.]

Figure 5.6 Percentage difference relative to host (multi-core CPU performance)

We also observe that the difference between the average VM and container performance does not increase or decrease as the number of deployed virtual entities is changed; the difference remains nearly the same (and <10%) even as the number of deployed entities increases. These observations suggest that containers do not necessarily scale better than virtual machines in CPU performance, even though they do perform a little better. We believe this is because all the workloads introduced by Geekbench (AES, Twofish, SHA1, SHA2) are CPU intensive, making the idle-VM or idle-container overhead less significant in comparison to the workload.

Figures 5.7 - 5.10 show single- and multi-core CPU performance for both integer and floating point operations. Similar performance patterns are observed for both floating point and integer performance, with containers performing either equally well or slightly better than virtual machines.

[Chart: single-core integer score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.7 Single-core Integer Performance

[Chart: multi-core integer score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.8 Multi-core Integer Performance

[Chart: single-core floating point score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.9 Single-core Floating Performance

[Chart: multi-core floating point score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.10 Multi-core Floating Performance

5.3 Multiple Virtual Entities Deployed: Memory I/O

The Xeon machine that we run the benchmarks on uses a heterogeneous memory architecture, also called a non-uniform memory access (NUMA) architecture [39], as discussed in the Methodology chapter. NUMA is commonly used in modern processor layouts. There are 2 physical processors in our Xeon server, and each processor has its own memory unit. A processor can also access memory attached to the other processor indirectly, via a QuickPath Interconnect (QPI) link [40]. Throughput when accessing the memory unit in a given processor's locality is greater than the throughput when the processor accesses memory in the other processor's locality [39].

In Figures 5.11 - 5.14, single-core and multi-core Memory I/O performance are plotted against the number of deployed virtual entities. We observe an approximately linear drop in Memory I/O performance as the number of deployed virtual entities increases, and the performance pattern for multi-core memory access is also an approximately linear (decreasing) function. Comparing containers to VMs, we observe that containers have only a slight edge over VMs for Memory I/O access. In terms of Memory I/O scalability, containers seem to scale in a way similar to VMs; as the figures show, the performance gap between containers and VMs does not increase or decrease as the number of deployed virtual entities increases. These observations suggest that memory access patterns in containers and VMs are not very different, though slightly more efficient in the case of containers. The memory architecture plays an important role in determining Memory I/O performance. The Memory I/O results do not stabilize well, which we believe is due to variations in the workloads induced by the benchmark itself and to the unpredictable memory I/O throughput of the NUMA architecture. The exact memory access patterns are even more intricate under NUMA because each processor is not necessarily accessing memory in its own locality, causing unpredictability in performance [41].

[Chart: single-core memory score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.11 Single-core Memory Performance


[Chart: % difference relative to host vs. number of deployed virtual entities (1-12) for Container and VM.]

Figure 5.12 Difference relative to host (single-core memory performance)

[Chart: multi-core memory score vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.13 Multi-core Memory Performance

[Chart: % difference relative to host vs. number of deployed virtual entities (1-12) for Container and VM.]

Figure 5.14 Percentage difference (multi-core memory performance)

Figures 5.15 - 5.22 show the memory throughput values in GB/sec obtained via the STREAM benchmark [42] as we scale the number of deployed virtual entities. Performance is evaluated for the copy, scale, add, and triad operations respectively. Similar performance patterns are observed for containers and virtual machines in each of the memory operations, and we do not find any surprising differences between containers and VMs across the different types of memory operations. In all cases, containers seem to have either equal or better average performance than VMs, as we observed for the overall Memory I/O scores. As per the STREAM benchmark specification [42], the overall single-core and multi-core Memory I/O scores are composite scores which include the Copy, Scale, Add, and Triad throughput measurements in GB/sec. The higher variance in performance is likely due to non-deterministic memory access across the NUMA nodes.
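For reference, the four STREAM kernels measure sustained memory bandwidth for the following array operations (standard STREAM definitions [42]; q is a scalar constant):

\[ \text{Copy: } c_i = a_i, \qquad \text{Scale: } b_i = q\,c_i, \qquad \text{Add: } c_i = a_i + b_i, \qquad \text{Triad: } a_i = b_i + q\,c_i \]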


[Chart: copy single-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.15 Memory: Copy (single-core performance)

[Chart: copy multi-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.16 Memory: Copy (multi-core performance)


[Chart: scale single-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.17 Memory: Scale (single-core performance)

[Chart: scale multi-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.18 Memory: Scale (multi-core performance)


[Chart: add single-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.19 Memory: Add (single-core performance)

[Chart: add multi-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.20 Memory: Add (multi-core performance)


[Chart: triad single-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.21 Memory: Triad (single-core performance)

[Chart: triad multi-core throughput vs. number of deployed virtual entities (1-12) for Container, VM, and Host (baseline).]

Figure 5.22 Memory: Triad (multi-core performance)


Chapter 6: Limitations and Future Work

The study uses Geekbench as the underlying CPU performance benchmark, orchestrated via Playbench. Geekbench helps induce different types of workloads; however, most of these workloads are CPU intensive. CPU performance could vary when the underlying benchmark generates lighter workloads. We believe there is an opportunity for benchmarking under lighter workloads as well as workloads that are moderate in CPU consumption. We were able to deploy up to 12 virtual entities on the host, each running the same benchmarking workload, after which a saturation in the consumption of CPU resources is observed. It may be possible to go beyond 12 deployed virtual entities with lighter workloads.

The thesis primarily focuses on CPU and Memory I/O performance of containers. There is scope for evaluating other performance characteristics like Network and File I/O performance with multiple virtual entities deployed.

The experiments performed in our study run isolated workloads that are either CPU intensive or Memory I/O intensive, but not mixed workloads. There is an opportunity to evaluate performance under mixed workloads that combine multiple performance characteristics, for example a workload that is both CPU and Memory I/O intensive at the same time. Also, none of our experiments imposed any resource restrictions, such as vCPU pinning in the case of VMs or restricting CPU shares in the case of containers. Each VM or container had access to the entire CPU and memory resources of the host and was free to compete with other virtual units for resources. There is scope for analyzing performance differences between containers and VMs where each VM or container is allocated fixed CPU and memory resources.


Chapter 7: Conclusion

We highlighted the importance of benchmark automation in running performance benchmarks on nodes containing multiple virtual machines or containers. We developed a benchmark automation tool called Playbench [18] and explained it in this thesis. We described how it would eliminate the overhead involved in repeatedly running a benchmark on several virtual entities and also reduce the chances of error in the benchmarking process. Playbench helped us in being consistent while evaluating the performance of several containers and VMs during our benchmarking experiments. It also helped us start the benchmarks on multiple entities in parallel and aggregate the final results from each virtual entity to a single log file after benchmark completion.

From the results, we observed that containers performed better than VMs in both CPU and Memory I/O performance, even with multiple deployed entities each running the same benchmarking workload. However, both containers and VMs scaled similarly in terms of CPU and Memory I/O performance as the number of deployed virtual entities was increased. Multi-core CPU performance differences between containers and VMs were larger than single-core CPU performance differences, though the differences were nearly the same in terms of percentage difference relative to the host. We also observed that multi-core CPU performance per virtual entity drops rapidly compared to single-core CPU performance, which suggests that multi-core CPU performance may hit bottlenecks more quickly as the number of virtual entities deployed on a host increases. High variance was observed in the Memory I/O performance data, which was attributed to the non-uniform memory access (NUMA) architecture.

Bibliography

[1] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "A View of Cloud Computing," Commun. ACM, vol. 53, no. 4, pp. 50–58, Apr. 2010.
[2] "Stackoverflow: What is the runtime performance cost of docker." [Online]. Available: http://stackoverflow.com/questions/21889053/what-is-the-runtime-performance-cost-of-a-docker-container. [Accessed: 04-Jan-2016].
[3] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, "An updated performance comparison of virtual machines and Linux containers," presented at the 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 171–172.
[4] I. Habib, "Virtualization with KVM," Linux J., vol. 2008, no. 166, p. 8, 2008.
[5] G. J. Popek and R. P. Goldberg, "Formal requirements for virtualizable third generation architectures," Commun. ACM, vol. 17, no. 7, pp. 412–421, 1974.
[6] D. Merkel, "Docker: lightweight Linux containers for consistent development and deployment," Linux J., vol. 2014, no. 239, p. 2, 2014.
[7] "Understand Docker architecture." [Online]. Available: https://docs.docker.com/engine/introduction/understanding-docker/. [Accessed: 23-Jan-2016].
[8] S. Deshpande, "Let's define 'container-native,'" 27-Apr-2016. [Online]. Available: https://techcrunch.com/2016/04/27/lets-define-container-native/. [Accessed: 09-Aug-2016].
[9] D. Beserra, E. D. Moreno, P. Takako Endo, J. Barreto, D. Sadok, and S. Fernandes, "Performance Analysis of LXC for HPC Environments," presented at the Complex, Intelligent, and Software Intensive Systems (CISIS), 2015 Ninth International Conference on, 2015, pp. 358–363.
[10] M. G. Xavier, M. V. Neves, F. D. Rossi, T. Ferreto, T. Lange, and C. A. F. De Rose, "Performance evaluation of container-based virtualization for high performance computing environments," presented at the Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on, 2013, pp. 233–240.
[11] M. F. Arlitt and C. L. Williamson, "Internet web servers: Workload characterization and performance implications," IEEE/ACM Trans. Netw., vol. 5, no. 5, pp. 631–645, 1997.
[12] R. Cheveresan, M. Ramsay, C. Feucht, and I. Sharapov, "Characteristics of workloads used in high performance and technical computing," presented at the Proceedings of the 21st annual international conference on Supercomputing, 2007, pp. 73–82.
[13] B. Rochwerger, D. Breitgand, E. Levy, A. Galis, K. Nagin, I. M. Llorente, R. Montero, Y. Wolfsthal, E. Elmroth, J. Cáceres, M. Ben-Yehuda, W. Emmerich, and F. Galán, "The Reservoir Model and Architecture for Open Federated Cloud Computing," IBM J. Res. Dev., vol. 53, no. 4, pp. 535–545, Jul. 2009.
[14] "State of Containers 2015: Docker Adoption Survey Summary." [Online]. Available: http://www.stackengine.com/wp-content/uploads/2015/02/Docker-Adoption-Survey-Summary.pdf. [Accessed: 23-Jan-2016].
[15] J. N. Matthews, W. Hu, M. Hapuarachchi, T. Deshane, D. Dimatos, G. Hamilton, M. McCabe, and J. Owens, "Quantifying the performance isolation properties of virtualization systems," presented at the Proceedings of the 2007 workshop on Experimental computer science, 2007, p. 6.
[16] P. Shivam, V. Marupadi, J. S. Chase, T. Subramaniam, and S. Babu, "Cutting Corners: Workbench Automation for Server Benchmarking," presented at the USENIX Annual Technical Conference, 2008, pp. 241–254.
[17] GoogleCloudPlatform, "GoogleCloudPlatform/PerfKitBenchmarker." [Online]. Available: https://github.com/GoogleCloudPlatform/PerfKitBenchmarker. [Accessed: 20-Jul-2016].
[18] pramttl, "Playbench: Benchmark Automation Tool." [Online]. Available: https://github.com/pramttl/playbench. [Accessed: 21-Jul-2016].
[19] P. Padala, X. Zhu, Z. Wang, S. Singhal, K. G. Shin, and others, "Performance evaluation of virtualization technologies for server consolidation," HP Labs Tech. Rep., 2007.
[20] "Interpreting Geekbench 3 Scores / Geekbench / Knowledge Base - Primate Labs Support." [Online]. Available: http://support.primatelabs.com/kb/geekbench/interpreting-geekbench-3-scores. [Accessed: 17-Dec-2015].
[21] ansible, "Ansible: IT Automation Platform." [Online]. Available: https://github.com/ansible/ansible. [Accessed: 21-Jul-2016].
[22] "Inventory — Ansible Documentation." [Online]. Available: http://docs.ansible.com/ansible/intro_inventory.html. [Accessed: 21-Jul-2016].
[23] "Configuration file — Ansible Documentation." [Online]. Available: http://docs.ansible.com/ansible/intro_configuration.html. [Accessed: 21-Jul-2016].
[24] "Strategies — Ansible Documentation." [Online]. Available: http://docs.ansible.com/ansible/playbooks_strategies.html. [Accessed: 21-Jul-2016].
[25] "Intel® Xeon® Processor E5-2680 (20M Cache, 2.70 GHz, 8.00 GT/s Intel® QPI) Specifications." [Online]. Available: http://ark.intel.com/products/64583/Intel-Xeon-Processor-E5-2680-20M-Cache-2_70-GHz-8_00-GTs-Intel-QPI. [Accessed: 02-Aug-2016].
[26] J. Clift, "libvirt 0.8.7 Virsh Command Reference."
[27] "Marek Goldmann | Resource management in Docker." [Online]. Available: https://goldmann.pl/blog/2014/09/11/resource-management-in-docker/. [Accessed: 25-Jul-2016].
[28] "Numeric Integration: Massively Parallel Program." [Online]. Available: https://gist.github.com/pramttl/d860d4f33e91ae1f5a6a. [Accessed: 05-Jan-2016].
[29] R. Morabito, J. Kjällman, and M. Komu, "Hypervisors vs. Lightweight Virtualization: a Performance Comparison."
[30] P. Prakash and B. R. Mohan, "Evaluating Performance of Virtual Machines on Hypervisor (Type-2)," Dept. of Information Technology, National Institute of Technology Karnataka, Surathkal, India, 2013.
[31] X. Song, J. Shi, H. Chen, and B. Zang, "Schedule Processes, Not VCPUs," in APSys '13, New York, NY, USA, 2013, pp. 1:1–1:7.
[32] M. Umasuthan and A. M. Wallace, "Outlier removal and discontinuity preserving smoothing of range data," IEE Proc. - Vis. Image Signal Process., vol. 143, no. 3, pp. 191–200, Jun. 1996.
[33] M. D. Hill and M. R. Marty, "Amdahl's law in the multicore era," Computer, no. 7, pp. 33–38, 2008.
[34] pramttl, "pramttl/benchmarking-docker-kvm." [Online]. Available: https://github.com/pramttl/benchmarking-docker-kvm/tree/master/cpu/massively_parallel_benchmark. [Accessed: 23-Jul-2016].
[35] "Data Parallel Examples: Numerical Integration : TechWeb : Boston University." [Online]. Available: http://www.bu.edu/tech/support/research/training-consulting/online-tutorials/matlab-pct/integration-example/. [Accessed: 23-Jul-2016].
[36] C. Xu, Y. Bai, and C. Luo, "Performance Evaluation of Parallel Programming in Virtual Machine Environment," in Network and Parallel Computing, 2009. NPC '09. Sixth IFIP International Conference on, 2009, pp. 140–147.
[37] D. Chandra, F. Guo, S. Kim, and Y. Solihin, "Predicting inter-thread cache contention on a chip multi-processor architecture," presented at the 11th International Symposium on High-Performance Computer Architecture, 2005, pp. 340–351.
[38] C. Xu, X. Chen, R. P. Dick, and Z. M. Mao, "Cache contention and application performance prediction for multi-core systems," presented at the Performance Analysis of Systems Software (ISPASS), 2010 IEEE International Symposium on, 2010, pp. 76–86.
[39] N. Manchanda and K. Anand, "Non-uniform memory access (NUMA)," New York University, 2010.
[40] B. Mutnury, F. Paglia, J. Mobley, G. K. Singh, and R. Bellomio, "QuickPath Interconnect (QPI) design and analysis in high speed servers," in 19th Topical Meeting on Electrical Performance of Electronic Packaging and Systems, 2010, pp. 265–268.
[41] J. Rao, K. Wang, X. Zhou, and C. Z. Xu, "Optimizing virtual machine scheduling in NUMA multicore systems," in High Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International Symposium on, 2013, pp. 306–317.
[42] J. McCalpin, "STREAM benchmark," 1995. [Online]. Available: http://www.cs.virginia.edu/stream/ref.html.

47

APPENDIX

Appendix A: Playbench Details

As discussed in Chapter 3, Playbench consists of scripts and multiple Ansible Roles, with each Role representing a benchmark. A benchmark (role) directory can contain tasks, files, or templates. The figure below provides a top-level overview of Playbench; tasks are grouped together for brevity:

[Figure: Top-level overview of Playbench. Playbench consists of Ansible Roles plus other supporting scripts; each benchmark role groups its tasks into package/benchmark installation tasks, a benchmark run task, and a task that parses the final benchmark output.]

The basic directory structure of the Playbench source code is given below:

playbench
├── ansible.cfg
├── group_vars
│   └── all
├── other_scripts
│   ├── Dockerfile
│   ├── README.md
│   └── vmscript.py
├── roles
│   ├── common
│   │   └── tasks
│   │       └── main.yml
│   ├── fio
│   │   ├── files
│   │   │   ├── benchmarkconfig.fio
│   │   │   └── fioparser.sh
│   │   └── tasks
│   │       └── main.yml
│   ├── geekbench
│   │   ├── files
│   │   │   └── geekbench_results_parser.py
│   │   ├── tasks
│   │   │   └── main.yml
│   │   └── templates
│   │       └── parse_stdout_to_url.j2
│   └── pts
│       └── tasks
│           └── main.yml
├── sample_hosts
├── site_baremetal.yml
├── site_containers.yml
└── site_vms.yml

Some of the key files for the benchmarks currently supported by Playbench are reproduced below, including the main tasks files and parser scripts. Each tasks file lists the steps that are run, in sequence, on every VM or container to which the respective benchmark role is applied. Supporting files and templates used for parsing and configuration are also included.
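For reference, the benchmark roles are applied to hosts through the top-level site playbooks shown in the directory listing above (site_baremetal.yml, site_containers.yml, and site_vms.yml). The snippet below is only an illustrative sketch of such a playbook, not the contents of any actual file in the repository; the host group name "containers" and the exact role list are assumptions:

---
# Illustrative sketch only: the group name and role list are assumptions,
# not the actual contents of site_containers.yml.
- hosts: containers
  remote_user: root
  roles:
    - common      # shared setup tasks
    - geekbench   # CPU and memory benchmark role
    - fio         # I/O benchmark role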

Geekbench Role
roles/geekbench/tasks/main.yml

- name: dpkg --add-architecture i386
  shell: dpkg --add-architecture i386
  tags:
    - ubuntu_only

- name: Install geekbench dependencies
  package: name="{{ item }}" state=latest update_cache=yes
  with_items:
    - python-setuptools
    - python-pip
    - libc6:i386
    - libstdc++6:i386
  tags:
    - ubuntu_only

- name: Download the geekbench tarball
  get_url: url="http://il0.ca/downloads/{{ geekbench }}.tar.gz" dest="{{ benchmarking_dir }}/" mode=0440

- name: Unarchive the tarball
  shell: "tar zxf {{ geekbench }}.tar.gz"
  args:
    chdir: "{{ benchmarking_dir }}/"

- name: Chmod geekbench executable
  file: path="{{ benchmarking_dir }}/dist/{{ geekbench }}/geekbench" state=file mode="a+x"

- name: Copy the python parser (geekbench url extractor) upstream
  template: src=parse_stdout_to_url.j2 dest="{{ benchmarking_dir }}/{{ geekbench_stdout_parser }}" backup=yes

- name: Start the geekbench benchmark. This may take some time..
  shell: ./geekbench
  args:
    chdir: "{{ benchmarking_dir }}/dist/{{ geekbench }}/"
  register: geekbench_output

# Creates a new variable geekbench_stdout from geekbench_output.
# This step looks redundant, but without it the default geekbench_stdout
# (loaded from a saved test file in the group variables) would be used.
# That default is useful for debugging, since geekbench takes 10+ minutes
# to run. This step and the previous one can be disabled by setting
# when: false.
- set_fact:
    geekbench_stdout: "{{ geekbench_output.stdout }}"

- name: Copy geekbench output to geekbench stdout file
  copy: content="{{ geekbench_stdout }}" dest="{{ geekbench_stdout_fname }}"

- name: Install beautifulsoup4
  pip: name=beautifulsoup4

- name: Run the python script to get results url from geekbench stdout
  shell: python {{ geekbench_stdout_parser }}
  args:
    chdir: "{{ benchmarking_dir }}/"
  register: results_url

- name: Copy the python results_url scraper/parser script to remote server
  copy: src=geekbench_results_parser.py dest="{{ benchmarking_dir }}/geekbench_results_parser.py"

- name: Parse the scores from the geekbench results_url
  shell: python geekbench_results_parser.py {{ results_url.stdout }}
  args:
    chdir: "{{ benchmarking_dir }}/"
  register: geekbench_scores

- debug: msg="{{ num_hosts }}\t{{ geekbench_scores.stdout }}"
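As the comment above notes, the two steps that actually run Geekbench can be skipped while debugging the parsing logic, in which case geekbench_stdout falls back to the saved output file referenced in the group variables. A minimal sketch of one way to make this switchable, assuming a boolean variable named run_geekbench (the variable name is an assumption, not part of the actual role):

# Hypothetical variant of the two steps above; run_geekbench is an
# illustrative variable, not one defined in the Playbench repository.
- name: Start the geekbench benchmark. This may take some time..
  shell: ./geekbench
  args:
    chdir: "{{ benchmarking_dir }}/dist/{{ geekbench }}/"
  register: geekbench_output
  when: run_geekbench | default(true)

- set_fact:
    geekbench_stdout: "{{ geekbench_output.stdout }}"
  when: run_geekbench | default(true)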

roles/geekbench/files/geekbench_results_parser.py

""" usage : python2 geekbenchparser.py http://browser.primatelabs.com/geekbench3/4534575 Returns comma separated list of scores (performance vector for single entitiy) """ import sys,urllib2 from bs4 import BeautifulSoup url = sys.argv[1] content = urllib2.urlopen(url).read() soup = BeautifulSoup(content)

52

overall_table = soup.findAll("table", { "class" : "table geekbench3-show summary" }) scores = [] scores.append(str(overall_table[0].findAll('td')[1].contents[0])) scores.append(str(overall_table[0].findAll('td')[2].contents[0])) section_table = soup.findAll("table", { "class" : "table table- striped geekbench2-show section-performance" }) scores.append(str(section_table[0].findAll('th')[1].contents[0])) scores.append(str(section_table[0].findAll('th')[4].contents[0])) scores.append(str(section_table[1].findAll('th')[1].contents[0])) scores.append(str(section_table[1].findAll('th')[4].contents[0])) scores.append(str(section_table[2].findAll('th')[1].contents[0])) scores.append(str(section_table[2].findAll('th')[4].contents[0]))

# Memory Details in GB/sec (Example geekbench score values showing output/format) """ Stream Copy single-core 1150 4.59 GB/sec multi-core 2145 8.56 GB/sec Stream Scale single-core 1804 7.20 GB/sec multi-core 2639 10.5 GB/sec Stream Add single-core 1900 8.59 GB/sec multi-core 2586 11.7 GB/sec Stream Triad single-core 1902 8.36 GB/sec multi-core 2682 11.8 GB/sec """ for i in range(1, 23, 3): scores.append(section_table[2].findAll('td')[i].contents[5].text. split(' ')[0])

#print 'Single-Core Score, Multi-Core Score, Integer Perf Single- core, Integer Perf Multi-core, Float Perf Single-core, Float Perf Multi-core, Memory Single Core, Memory Multi Core' print "\t".join(scores)


The variables within {{ }} are Ansible group variables that are common to all roles/benchmarks. They are defined in the group_vars/all file at the root of the Playbench source tree:

geekbench: "Geekbench-3.4.1-Linux"
geekbench_stdout: "{{ lookup('file', 'test_geekbench_op.txt') }}"
benchmarking_dir: "/root/benchmarking"
vms_dir: "{{ benchmarking_dir }}/vms"
geekbench_stdout_fname: "{{ benchmarking_dir }}/geekbench_output.txt"
geekbench_stdout_parser: "parse_stdout_to_url.py"

# libvirt related variables
# https://libvirt.org/formatdomain.html
qemu_url: 'qemu:///system'
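Since these are ordinary Ansible group variables, they can also be overridden for a particular inventory group without modifying any role, by adding a file named after that group under group_vars. The fragment below is hypothetical (the group name vms and the value shown are illustrative only):

# Hypothetical group_vars/vms file; the path shown is illustrative.
benchmarking_dir: "/root/benchmarking_vms"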

Fio Role
roles/fio/tasks/main.yml

- name: Install fio
  package:
    name: '{{ item }}'
    state: 'latest'
  with_items: ['fio']

- name: Copy the fio benchmark file to remote server
  copy: src=benchmarkconfig.fio dest="{{ benchmarking_dir }}/benchmarkconfig.fio"

- name: Start the fio benchmark. This may take some time..
  shell: fio benchmarkconfig.fio > fioresults
  args:
    chdir: "{{ benchmarking_dir }}/"

- name: Copy the parser to remote server
  copy: src=fioparser.sh dest="{{ benchmarking_dir }}/fioparser.sh"

- name: Run the fioparser on results
  shell: bash fioparser.sh fioresults
  args:
    chdir: "{{ benchmarking_dir }}/"
  register: fio_results

- debug: msg="{{ fio_results.stdout }}"

roles/fio/files/fioparser.sh

#!/usr/bin/env bash
# For each fio run-status group in the results file ($1), take the line after
# the "group N" header and print the value of its fourth '='-separated field
# (up to the next comma), which is a bandwidth figure.
cat $1 | grep 'group 0' -A1 | grep -v 'group 0' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'
cat $1 | grep 'group 1' -A1 | grep -v 'group 1' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'
cat $1 | grep 'group 2' -A1 | grep -v 'group 2' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'
cat $1 | grep 'group 3' -A1 | grep -v 'group 3' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'
cat $1 | grep 'group 4' -A1 | grep -v 'group 4' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'
cat $1 | grep 'group 5' -A1 | grep -v 'group 5' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'
# For group 5, also take the two lines after the header, skip the READ line,
# and extract the same field from the remaining (WRITE) line.
cat $1 | grep 'group 5' -A2 | grep -v 'group 5' | grep -v 'READ' | awk -F '=' '{print $4}' | awk -F ',' '{print $1}'