DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS STOCKHOLM, SWEDEN 2016

Increasing the Throughput of a Node.js Application Running on the Heroku Cloud App Platform

NIKLAS ANDERSSON

ALEKSANDR CHERNOV

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

Abstract

The purpose of this thesis was to investigate whether utilization of the Node.js Cluster module within a web application in an environment with limited resources (the Heroku Cloud App Platform) could lead to an increase in throughput of the application and, in the case of an increase, how substantial it was.

This was done by load testing an example application both when utilizing the module and without utilizing it. In both scenarios, the traffic sent to the application varied from 10 requests/second to 100 requests/second. For the tests conducted on the application utilizing the module, the number of worker processes used within the application varied between 1 and 16.

Furthermore, the tests were first conducted in a local environment in order to establish any increase in throughput in a stable environment; if there were notable differences in the throughput of the application, the same tests were then conducted on the Heroku Cloud App Platform. Each test was also aimed at testing one of two different types of tasks performed by the application: I/O or CPU bound.

From the test results, it could be derived that utilization of the Cluster module did not lead to any increase in throughput when the application was doing I/O bound tasks in either of the environments. However, when doing CPU bound tasks, it led to a ≥20% increase when the traffic sent to the application in the local environment was 10 requests/second or higher. The same increase could be seen in the Heroku environment when the traffic sent to the application was 50 requests/second or higher.

The conclusion was thus that utilization of the module would be useful for the company at which this thesis took place whenever an application deployed on Heroku was exposed to higher traffic.

Keywords

Throughput, Node.js, Heroku, Performance, Increasing

Abstract (in Swedish)

The purpose of this degree project was to investigate whether utilization of the Node.js Cluster module in a web application in an environment with limited resources (the Heroku cloud app platform) could lead to an increase in the application's throughput and, if an increase occurred, how large it then was.

This was done by load testing an example application with and without the module. In both scenarios, the traffic sent to the application varied between 10 and 100 requests/second. For the tests performed on the application utilizing the module, the number of worker processes varied between 1 and 16.

Furthermore, the tests were first performed in the local environment with the aim of establishing any throughput increase in a stable environment; if there were any notable differences in the application's throughput, the same tests would also be performed on the Heroku cloud app platform. Each test also aimed to test one of two different types of tasks performed by the application: I/O or CPU bound.

From the test results it could be established that the Cluster module did not lead to any increases in throughput when the application performed I/O bound tasks in either environment. When the application performed CPU bound tasks, however, it led to an increase of ≥20% when the traffic was 10 requests/second or higher. In the Heroku environment, the same increase could only be seen once the traffic reached 50 requests/second or higher.

The conclusion was therefore that use of the module would be useful for the company at which the work was carried out if an application deployed on Heroku was exposed to what was considered higher traffic.

Keywords

Throughput, Node.js, Heroku, Performance, Increasing

Table of Contents

Abstract (in English)
Abstract (in Swedish)
Table of Contents
1 Introduction
  1.1 Background
    1.1.1 Increasing Throughput
    1.1.2 Node.js
    1.1.3 The Heroku Cloud App Platform
    1.1.4 Web Applications
  1.2 Problem
  1.3 Research Questions
  1.4 Purpose
  1.5 Delimitations
  1.6 Disposition
2 Theoretical Background
  2.1 The Company Platform
  2.2 Heroku Dyno
  2.3 I/O vs. CPU bound
  2.4 The Inner Workings of Node.js
  2.5 Increasing Throughput in Node.js Using the Cluster Module
  2.6 Related Work
3 Research Process
  3.1 Research Methodology
  3.2 Process Overview
    3.2.1 Problem Definition
    3.2.2 Data Collection
    3.2.3 Design & Implementation
    3.2.4 Defining the Testing Environments
    3.2.5 Creating Test Plan
    3.2.6 Results and Analysis
    3.2.7 Evaluation
  3.3 Hypotheses
4 Analysis: How to Increase Throughput
  4.1 Our Approach
    4.1.1 Different Implementations of the Cluster Module
    4.1.2 Clustering Method Chosen When Creating the Application Template
  4.2 The Application Template
    4.2.1 CPU Usage
    4.2.2 Workload
    4.2.3 Memory Usage
  4.3 Test Application
5 Analysis: Benchmarking the Test Application
  5.1 Testing Environment
    5.1.1 Local Environment
    5.1.2 Heroku Environment
    5.1.3 The Test Application's Memory Usage
  5.2 Testing Tools
    5.2.1 Apache JMeter
    5.2.2 Heroku Metrics
  5.3 Creating the Test Plan
  5.4 Local Tests
    5.4.1 I/O Bound
    5.4.2 CPU Bound
  5.5 Heroku Tests
    5.5.1 Throughput Rates
    5.5.2 Memory Usage
    5.5.3 Median Response Times
    5.5.4 Analysis of Heroku Test Results
6 Discussion
  6.1 Our Methodology and Consequences of the Study
  6.2 Discussion and Conclusions
    6.2.1 Recommendations Concerning the Application Template
  6.3 Ethics
  6.4 Sustainability
  6.5 Future Work
References
Appendix 1 - Heroku Dyno CPU Information
Appendix 2 - The Test Application
Appendix 3 - The Application Template
Appendix 4 - The Local Server CPU Specifications
Appendix 5 - Results from I/O Bound Tests in Local Environment
Appendix 6 - Results from CPU Bound Tests in Local Environment
Appendix 7 - Results from CPU Bound Tests on Heroku

1 Introduction

Today, virtually every company with a presence on the Internet collects data concerning their customers in some form [1]. With a large collection of customer profiles it is possible to collect information concerning the customer's geographical area, what products the customer has viewed, what devices the customer is using, etc. With this data, customer communication can be improved, marketing can be optimized (through a more well-targeted informational flow), and all customer information can be stored in one single virtual space.

Data can come from different sources: web analytics tools, login processes, e-mail, etc. Data may also need to be collected from different physical nodes; it might be located in different data warehouses, and can even be administered by different third-party companies.

For a large company, the collected data may grow very large and there might be a lot of daily transactions. It is therefore important that these transactions are consistent, that data is preserved, and that the application can handle as much traffic as possible. One way of making sure that the application is adapted to do this is by ensuring that it can handle as many requests per time unit as possible. This leads to the application being able to serve more clients, thus lowering the risk of a client not receiving the requested data.

1.1 Background

Innometrics, the company that the project took place at, is active within the area just described above. Their product helps other companies personalize their marketing strategies by collecting data from a customer’s different data warehouses, and creating a customer profile out of this data.

They were in need of increasing the throughput of Node.js applications used for intra-system communication between their system and other systems. These applications were installed on an external cloud platform (the Heroku Cloud App Platform or Amazon Web Services), and thus restricted by each platform's individual specifications.

1.1.1 Increasing Throughput

Throughput is a measurement used for describing the number of requests per time unit handled by any given web service or application. One of the ways of increasing throughput is to make the application more concurrent, that is, to make it process more requests simultaneously [2].

This can be achieved by adding extra hardware resources or by maximizing utilization of the available resources.

1.1.2 Node.js

Node.js is a runtime environment based on the programming language JavaScript – a programming language most well known as the scripting language for web pages. A runtime environment deals with a variety of issues such as the layout and allocation of storage locations for the objects specified in the source code, the mechanisms used by the target program to access variables and for passing variables, etc. [3].

Node.js ships with a collection of modules, which basically encapsulate related code, as in Java or any other programming language with a set of standard libraries. Also, new modules can be installed, managed and published through the Node Package Manager to provide further functionality. A more detailed specification of Node.js is given in chapter 2.

1.1.3 The Heroku Cloud App Platform

In order to describe what cloud computing is, Eric Griffith states in his article [4]: “In the simplest terms, cloud computing means storing and accessing data and programs over the Internet instead of your computer’s hard drive”.

Heroku belongs to a type of cloud computing known as Platform as a Service (PaaS) [5]. This type of service removes the need for organizations to manage the underlying infrastructure (usually hardware and operating systems) and allows users to focus on the deployment and management of their applications [6].

The Heroku platform allows users to install and execute applications isolated from one another. It provides functionality such as a database management system and application monitoring. The platform's execution environment also enables the user to write applications in several different languages and runtimes, such as Node.js, Ruby, Java and PHP.

1.1.4 Web Applications

An application is a stored set of instructions that directs a computer to do some specific task [7].

Web applications are distributed client-server applications in which a web browser provides the user interface [8]. The client browser and the server side exchange protocol messages represented as HTTP requests and responses. In the case of cloud computing, web applications no longer exist on the server; instead, they reside on a cloud platform.

1.2 Problem

During periods of high traffic towards a web application, it is essential that the system can handle the increased demand for service. Exposing an inefficient web application to high traffic can cause individual requests not to receive their corresponding responses. It can also lead to response times – the total time it takes from when a user makes a request until they receive a response – being longer than desired.

In order to fulfill the need of service to as many clients as possible, it is important that the web application can provide a large throughput.

1.3 Research Questions

The main questions of this thesis narrow down to:

- How can the throughput of a Node.js application, running on the Heroku Platform, be increased by taking advantage of the available system resources?
- In case of an increased throughput, how substantial will it be?

1.4 Purpose

The application’s performance is limited by the cloud platform where the application is installed. The purpose of this thesis is to show how to increase throughput of the company’s applications running on the Heroku Platform.

The intention is to develop a generic application template in Node.js that can be used when creating new applications within the company's Node.js application platform. Applications that utilize this template should be able to be installed on the Heroku Cloud App Platform. The template should increase the throughput of each individual application, and thus increase the performance of the system as a whole. Although the application was primarily aimed at the Heroku platform, there should be a possibility to migrate it to other existing cloud platforms. Therefore, the solution should be as general as possible.

Furthermore, we are to implement functionality that takes full advantage of the available system resources in order to increase the number of requests handled by the application per time unit. This is to be done without adding any additional hardware resources.

Best practices in increasing the throughput of a Node.js application deployed on the Heroku Cloud App Platform, without adding hardware resources, will be investigated and evaluated. Hopefully, this will lead to an increase in the throughput of each individual application on Innometrics' Node.js application platform.

1.5 Delimitations

This thesis will focus solely on increasing the throughput by taking advantage of the available system resources. Also, it is only concerned with the increase of throughput in an application running on the Heroku Platform – not on an arbitrary cloud platform. Furthermore, we were limited to using only the free account level of Heroku (specifications of machines on this level are given in chapter 2).

1.6 Disposition

The thesis is outlined as follows. Firstly, a theoretical background is presented, giving a brief insight into the specific technologies needed to understand the approach to the problem and the thesis results. Node.js, the Heroku environment, and increasing throughput in Node.js specifically are discussed here in more detail.

After that, the research process is treated. The chapter starts by describing our information gathering process, and continues with a review of existing literature, a description of our research methodology, and the requirements specification.

The next chapter describes how the template for the applications is created. This chapter is then followed by a chapter devoted to the tests. Here, the testing environment is described and the results from the tests are evaluated.

Lastly, in the Discussion chapter, we reflect on our methods and results, future work, and the topics of ethics and sustainability within the area.

2 Theoretical Background

This chapter will give a deepened insight into the more theoretical parts of the problem area that are essential to understanding the problem and its solution. It will describe the Innometrics system, the Heroku dyno, how Node.js works in more detail, how to increase throughput within the runtime environment using the Cluster module, and related work done within the area.

2.1 The Company Platform

As mentioned in section 1.1, the customer's (the company buying Innometrics' product) data warehouse or their system for tracking and managing existing or potential customers (Customer Relationship Management system, or CRM system) [9] is connected to the company platform.

With the data retrieved from the customer’s data warehouse or CRM system, Innometrics initially puts together a profile for each of the customer’s clients (the visitors to the customer’s website), which is then stored in Innometrics’ own data warehouse. The Innometrics system will then continually add data to this profile containing information on any website interaction that the client in question has made towards the customer’s website. The website interaction to listen for is specified by the customer through the Innometrics system.

All client interaction that has been specified to be listened for is logged in an event stream in the form of data objects known as events. An event is, in turn, a collection of data containing information on an action that has been taken by a client on the customer website. For example, as a client clicks a banner or a link, an event could be generated containing information on which banner or link was clicked, the time when the click was made, etc.

In order to enable Innometrics to retrieve resources from third-party sources, Node.js applications (deployed on a cloud platform) are used. Each application has been set to listen for one or several events. In case any of these events are triggered by a visitor on the customer website (e.g. by the client clicking on a link), the Innometrics system sends a request containing the client's profile (with the event added to it) to the application.

An example of this type of communication is shown in figure 2.1.1. As the client visits a website, an event is generated by the Innometrics system containing information on the IP of the client. A request containing the client profile is then sent by the Innometrics system to the application. The application extracts the IP address contained in the event data of the profile just received with the request, and sends it on to an IP-lookup service to retrieve further information on the IP address in question. Then, as the response is received from the lookup service, the application saves this data to Innometrics' own data warehouse.

Figure 2.1.1: A flow chart describing an example case of communication between different actors as an event is triggered on a customer website.

2.2 Heroku Dyno

Each application on the Heroku platform is running on a dyno. Each dyno is a lightweight Linux container that runs a single command provided by the user. A dyno can run any command available in its environment like restart, stop, scale, etc.

According to Heroku's official documentation [10], containerization is a virtualization technology that allows multiple isolated operating system containers to be run on a shared host. All dynos are isolated from one another for security purposes.

Dynos on the free account level are limited to 512 MB of RAM [11]. Concerning the CPU specifications, Heroku has (for unknown reasons) decided not to reveal these to the user, but by accessing the application's shell environment it was clear that the dyno lay on a machine that had access to one physical unit consisting of 4 cores with 8 hardware threads each (see Appendix 1). However, it seemed [10] that the dyno has varied access to this depending on the number of other dynos currently active on the shared host.

A hardware thread is one out of two execution threads per core that execute simultaneously in order to hide latencies when retrieving data from memory caches on the CPU, and is something that is implemented by Intel Hyper-Threading Technology [12].

2.3 I/O vs. CPU bound

Tasks performed by an application or a system can be I/O or CPU bound.

An I/O (shorthand for Input/Output) bound task performs operations associated with I/O communication. Examples of I/O communication are HTTP requests, database operations, and disk reads and writes [13].

CPU bound tasks are mainly performed by the CPU. In this case the CPU spends its time mostly on computing. Examples of these types of tasks are calculating a hash, searching for an item, and performing mathematical calculations.

Figure 2.3.1: A CPU (a) vs. I/O bound (b) application

An application can also be either CPU or I/O bound. In the case of a CPU bound application, a majority of the tasks done within the application are CPU bound. In the case of an I/O bound application it is the other way around – a majority of the tasks are I/O bound. Both types of applications are depicted in figure 2.3.1. Here it can be seen how the CPU bound application (application ‘a’) spends more time doing calculations and less time handling I/O. It can also be seen, in application ‘b’, how an I/O bound application spends its time doing the opposite – more time waiting for I/O, and less time doing calculations [14].

2.4 The Inner Workings of Node.js

One of the main strengths of Node.js is its method for treating I/O calls. This is largely because I/O calls are handled by background threads, while the main thread of the application, known as the event loop, can treat and process any other requests sent to the application. In figure 2.4.1, there is a detailed overview of the inner workings of Node.js.

Figure 2.4.1: A Node.js instance with its event loop and thread pool

The Node.js runtime runs on a single core [15] and contains an event queue which stores a list of events, each consisting of a name describing the event and a callback function [16] (a function to be run after the initial function has finished its execution). An example of an event is when an HTTP request is sent to the server. This request is placed in the event queue. The event loop starts by picking up an event containing an I/O call that is to be executed from the queue, and then delegates the job to the operating system via an internal thread pool [17]. The thread that receives the job then executes the function associated with the event without blocking the event loop, while the event loop continues treating the next event in the queue.

After the thread in the internal thread pool has finished its execution, the callback function is again placed in the event queue. The callback function is later on retrieved from the queue and processed by the event loop. If another event occurs, a new event is placed in the event queue, and the procedure is repeated. This way the event loop can handle all incoming requests asynchronously in a non­blocking way.

However, Node.js is not as good at treating CPU intensive tasks [18]. When Node.js performs a CPU intensive task, all other requests are held up, due to the event loop running on a single thread and the CPU being occupied with working on this thread. One of the strategies to handle this problem is to use the Cluster module [13].

2.5 Increasing Throughput in Node.js Using the Cluster Module

In order to improve Node.js' ability to treat CPU intensive tasks, worker processes can be forked. That is, the main process of the application is duplicated into new processes referred to as worker processes [19]. The main process is then referred to as the master process. This functionality is provided by the Cluster module, which is a part of the standard library in Node.js [15].

When forking new processes, all new connections are first received by the master process and then handed over to an available worker. Which worker gets the connection is decided through a round-robin approach – which essentially means that the next available worker gets it [20].

Best practice is to bind each worker to its own logical CPU core. This increases the application's ability to process each request through utilization of more of the CPU's capacity, thus increasing its effectiveness and throughput [21].

This essentially means that each Node.js instance (figure 2.4.1) is replicated into its own server instance, where each instance – known as a worker process – listens to the same socket. Here, the master process works as a load balancer by receiving all incoming connections and distributing them among the worker processes [15]. The resulting architecture of the application when implementing the Cluster module is depicted in figure 2.5.1.

Figure 2.5.1: The desired application architecture for this thesis, with each worker representing the Node.js instance depicted in figure 2.4.1

2.6 Related Work

The Node.js platform is still rather new and evolving rapidly. Because of that, it is not easy to find articles that are still up-to-date. Some of the articles are reviewed in this section.

The article “Optimizing Node.js Application Concurrency”, provided on Heroku's official website, explains how to regulate the number of worker processes [22]. It is also recommended to create worker processes and bind each of them to its own logical CPU core, thus making the application take full advantage of the available system resources. One interesting thing that they mention is that each app has unique memory, CPU and I/O requirements, so there is no single solution that fits every app. However, they do not provide any benchmark results.

Rowan Manning describes in his blog how to implement the Cluster module [23]. He also states that creating multiple processes for a Node.js application can dramatically improve the amount of load the application can handle. He provides some simple benchmarking in order to illustrate the improvement. The app is installed on a local machine without involving the Heroku platform, and the benchmarked function performs CPU bound tasks.

Neil Kandalgaonkar argues that “Node.js can be a great choice for computation heavy services” [24]. He clarifies that it can be suitable for some occasional CPU-bound tasks – not too many, nor too heavy, however. The Heroku platform is mentioned in the article as well, but due to the fact that the application tested in the article was too big (~200 MB), it was not possible to perform thorough tests on that platform. He names the Cluster module as one of the possible solutions.

3 Research Process

This chapter will describe our research process. It will provide a description of the methodology used in solving the problem, give an overview of the overall process, and lay down the hypotheses for this thesis.

3.1 Research Methodology

Since the thesis consisted of two separate research questions, two different research strategies were used.

In order to answer the first research question – how to increase the throughput of the application – practices on how to create a server in Node.js were investigated. The solution was determined through a combination of quantitative and qualitative methods, where a form of applied research based on existing theories and research [25] was used to create a test application, which could then be evaluated by answering the second research question. If the results from answering the second question showed a substantial increase (≳20%), the answer to the first question would be considered positive. If not, the first question would need to be re-evaluated based on another existing theory.

When answering the second research question – how substantial the increase in throughput would be – two different methods were followed. Experimental research was conducted, forming a foundation for this thesis by comparing different test results with one changing variable per test. In our case, these variables were 1) the load sent to the application during the test, 2) the number of workers used by the application in the test, and 3) the environment the test was conducted in (local or Heroku).

The hypotheses for the outcome of the comparison could also be predefined, and thereby a method of the analytical kind was also used [25]. Thus, the methodology used for answering the second research question was a combination of two research methods: experimental and analytical.

3.2 Process Overview

The methods listed in this section are described in order to give an understanding of how this thesis was structured to achieve its goal and answer the research questions defined in chapter 1. The overall research process is illustrated in figure 3.2.1, and is described in detail below.

Figure 3.2.1: The research process

3.2.1 Problem Definition

This was the phase where the problem was defined out of the requirement specification received from the company.

3.2.2 Data Collection

Data collection concerns two different types of data: primary data and secondary data.

Primary data is most generally described as data collected from the information source, and is most often retrieved through interviews, observations and discussions with members of the company [26].

Secondary data, in turn, is typically gathered by persons not involved in the current research. The sources of this kind of data can be technical and statistical records, newspaper articles, etc. [26].

The primary data that the qualitative part of this thesis relies on mainly consists of a task overview given by Innometrics’ supervisor of this thesis, and of informal interviews given by the employees of the company.

The overview given by the company consisted of recommendations on what modules to use for the thesis – partly modules used daily by the company when designing applications for the platform, and partly modules that could contribute to this thesis. Recommendations on what tools might be used when performing the tests were also given.

The informal interviews given by employees consisted of recommendations on how to set up the remote environment, and information on the average traffic that the Innometrics system is exposed to.

This primary data was then complemented by document studies in the form of company documentation on the platform, technical reports, and articles on the subject. Such materials can give a better and deeper understanding of the subject.

The primary data that the quantitative part of this thesis relies on mainly consists of test results obtained from tests conducted in order to answer the second research question on how substantial the increase in throughput was (in case there was an increase).

3.2.3 Design & Implementation

In this phase, a test application is to be designed and implemented based on a known method for increasing throughput in Node.js. The initial design of the test application is the result of the primary data obtained through qualitative methods just described, and it defines the architecture and functionality of the application.

3.2.4 Defining the Testing Environments

In this phase, the specifications for the machines, of both the local and the Heroku testing environments, were laid down.

3.2.5 Creating Test Plan

During this phase, focus lay on creating a test plan that included tests for both the local and the Heroku testing environments. The test plan was to be designed to test the application's throughput in the case of both I/O and CPU bound tasks, different rates of traffic sent to the application, and different numbers of worker processes for each traffic rate and type of task.

We had been informed of the structure of the requests being sent to the application, and by reusing that structure we only needed to adapt the request body to contain data relevant for the test application. The body data relevant for this thesis was simply a string, used to determine which function to call (I/O or CPU bound).

3.2.6 Results and Analysis

This phase consisted of two iterations: one for the local tests and one for the Heroku tests. In both iterations, the test application was benchmarked, and the results of the benchmark were then analyzed. The results were presented in the form of tables and graphs. Increases in throughput were expressed as a percentage increase between tests.

The analysis consisted of a type of formative evaluation, where the focus lay on examining and changing processes as they occur. The last iteration was evaluated: if it had provided positive results, the process would continue to a final evaluation of the solution; if it had provided negative results, a new iteration would be initiated.

3.2.7 Evaluation

The evaluation of the solution was to have a summative approach, providing an overall description of the application’s performance increases. It was to be described whether the objectives of the thesis had been fulfilled, as well as the future direction of the product. Here, a secondary analysis was also to be given, reexamining existing data to address new questions or methods not previously employed.

3.3 Hypotheses

Our hypotheses were that the results would show performance increases for the test application in areas where Node.js is usually considered weak. In other words: when benchmarking one and the same Node.js application with and without our application template, a performance increase, in the form of a higher throughput, should be apparent when doing CPU heavy tasks, such as calculating a hash or doing other arithmetic calculations. When doing I/O bound tasks, however, the result should be a status quo.

4 Analysis: How to Increase Throughput

This chapter will provide the answer to the first research question of this thesis: how can the throughput of a Node.js application, running on the Heroku platform, be increased by taking advantage of the available system resources? The answer was obtained by using the qualitative methods described in section 3.2.2. The chapter will also provide a description of the test application used in this thesis.

The application was to consist partly of the throughput increasing template that was to be the product of this thesis, and partly of functionality for testing two different aspects of the test application – its capabilities of fulfilling I/O and CPU bound tasks – in its current environment.

4.1 Our Approach

When analyzing the data retrieved during the data collection phase, we found that there are only a few ways to increase the throughput of a Node.js application.

The main method for increasing throughput in Node.js is by creating multiple processes for the application, thus utilizing more of the available system resources. This is known as clustering the application, and is mainly implemented by the Cluster module described in section 2.5.

4.1.1 Different Implementations of the Cluster Module

There are several alternatives for implementing worker processes in Node.js [27]. One of these is to simply use the standard Cluster module, which comes as a standard library in Node.js and provides the most basic mechanisms for implementing worker processes. More about the implementation of the Cluster module for this thesis can be found in section 4.2.

There is also the alternative of implementing the Throng module [28], which is used by Heroku in their own example of how to cluster. This module is also implemented on top of the Cluster module. It is advertised as “a simple worker manager for clustered Node.js apps”, obscuring large parts of the master/worker logic when clustering the application in order to make it easier for the developer. Instead, the developer mainly has to focus on setting the number of workers, configuring the master process, etc.

Another alternative is PM2, a program which is also implemented on top of the Cluster module. It is similar to the Throng module in that it obscures large parts of the master/worker logic from the developer, but does so to an even larger extent. It also provides the application with some additional functionality, such as real time process management [29] (e.g. adding workers), basic system monitoring, log aggregation, etc. [30]

Lastly, there is the alternative of implementing the StrongLoop Cluster Management Tool [27], which is also based on the Cluster module and basically provides the same functionality as PM2, with some smaller differences (such as profiling).

4.1.2 Clustering Method Chosen When Creating the Application Template

When it came to this thesis, it was found that the standard Cluster module was the most appropriate way to implement clustering in the application.

Looking at the alternatives, they either tended to hide larger parts of the cluster related code from the developer (Throng, PM2, and StrongLoop), or to offer functionality not relevant for this thesis – which might have led to larger memory usage (higher memory allocation) for the processes. They are all built on top of the Cluster module, and it also seemed easier to adapt the standard Cluster module to different cloud platforms compared to the other alternatives [31].

While the other alternatives, with their additional functionality, might be useful in a live case scenario, they were not appropriate for this study, where it was desired to evaluate the effects of clustering at the most basic level.

4.2 The Application Template

When creating the template, we relied on the official Node.js documentation, and the description of the Cluster module in particular, on how to create and cluster a web application. This led to a template realizing the server model described in section 2.5 (and depicted in figure 2.5.1).

When developing the template, it was important to keep the master process as light as possible, by keeping its allocated memory at a minimum and not including any server related code, or any other code not relevant to its task of managing the workers. The reason for this was to optimize memory usage on the cloud platform: since it was the workers that handled the requests, it was important that they had the maximum amount of memory available.

In order to change the number of worker processes dynamically for each application instance, an environment variable that could be set via the command line was used. On Heroku, this variable held the number of workers appropriate for the number of dynos used for the application.
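A minimal sketch of this configuration mechanism is shown below. The variable name WEB_CONCURRENCY follows Heroku’s convention and is an assumption here – the thesis does not name the variable it used:

```javascript
// Read the worker count from an environment variable set on the
// command line, e.g.: WEB_CONCURRENCY=4 node server.js
// (the variable name WEB_CONCURRENCY is an assumption).
function workerCount(env) {
  const parsed = parseInt(env.WEB_CONCURRENCY, 10);
  // Fall back to a single worker when the variable is unset or invalid.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : 1;
}

const numWorkers = workerCount(process.env);
```

On Heroku, the value of this variable would then be chosen to match the dynos used for the application, as described above.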

According to the official Node.js Cluster documentation, the default strategy when creating worker processes in an application is to use the worker processes as request handlers (receiving and treating requests), and to use the master process for creating workers and handing sockets to them through interprocess communication (IPC) – a mechanism for sharing data among multiple processes [4].

Figure 4.2.1: The master related code of the template

This works in the way that, as the application instance is started, the master creates a number of workers equal to the value of the environment variable mentioned earlier (see figure 4.2.1, rows 25–27). Rows 29–31 show how a new worker is generated by the master in case a worker dies (its process somehow shuts down).

Figure 4.2.2: The worker related code of the template

In figure 4.2.2, the worker related code of the template is shown. The worker starts by instantiating the Express framework (row 34) – a Node.js framework for creating web applications, providing the process with the necessary server functionality. Rows 39–41 show how each worker process listens to the same port.

The code for handling each request sent to the application is shown on rows 44–51. As a request arrives, it is treated and a response is generated in the callback of this method.

The complete template of this thesis can be found in Appendix 3.

4.2.1 CPU Usage

As mentioned earlier, a Node.js application has a single-threaded event loop, utilizing only one of the available CPU cores. To increase throughput using only the available system resources, the application must specify how many worker processes are to be created.

As mentioned in section 2.5, best practice when determining how many worker processes should be created for a particular application is to base the number on the number of cores available to the system. That way, each process is bound to a single logical core. The desired CPU usage can be seen in figure 4.2.1.1.

Figure 4.2.1.1: Regular vs. desired Node.js CPU usage

Using the Cluster module, this can easily be implemented on a physical machine, where the exact specifications of the machine are known. When it comes to a cloud platform, however, little information is revealed about container specifications. A single Heroku dyno shares access to system resources with other dynos, and the performance of a single dyno can vary depending on the total load on the underlying machine. Therefore, according to Heroku’s article “Optimizing Node.js Application Concurrency” [21], clustering more than one worker on a standard single dyno may hurt, rather than help, performance. This was one of the things to be considered when performing the tests.

4.2.2 Workload

Analyzing the information received from observations and recommendations, it was clear that the application should be able to handle different amounts of simultaneous requests. The customers that use the application are of different kinds – both large and small companies. The application should therefore take those differences into consideration, i.e. it should be able to handle both larger and smaller numbers of clients. Finding the right balance of workers is thus very important.

4.2.3 Memory Usage

Applications can differ in memory usage. Some applications, in need of larger memory allocation (≳200 Mb for a single application), might suffer from implementing worker processes on a single Heroku dyno (due to exceeding the memory limit). Exceeding the memory limit could lead to the application not performing desirably, with requests timing out (not receiving responses). Therefore, when clustering an application, the memory usage of each process has to be kept in mind – the application’s overall memory usage must not exceed the dyno’s memory limit.

4.3 Test Application

A template was created, and by that the first research question was partly answered. The next step was to verify whether the template would give the desired increase in throughput on the Heroku platform. In order to do that, a test application was to be created and the second research question answered.

The application had to provide means for testing its capabilities of doing different tasks. From discussions with people at the company, it was discovered that the system sometimes calculates hashes when creating new profiles. Therefore, the test application needed to provide the ability to run two different types of tasks: CPU and I/O bound. The CPU bound function calculated a hash, while the I/O bound function simulated an I/O call by doing a timeout of 300 ms, during which the application simply waited without blocking the event loop. The function to run was parsed from the HTTP request that the application received. In each of the two functions, an appropriate response was generated. The same test application (see Appendix 2) was used in both the local and the Heroku tests.

5 Analysis: Benchmarking the Test Application

This section will focus on describing the testing environments, the test plan, and the results obtained from the tests, in a local environment and on Heroku. It will provide an analysis of the results, in order to answer the second research question of this thesis: how substantial can the increase in throughput of the application be when clustering functionality has been added?

5.1 Testing Environment

Analyzing the information gathered during the data collection phase when attempting to answer the first research question, we came to the conclusion that both a local and a Heroku testing environment needed to be defined.

Heroku themselves inform [21] that an application might suffer from being clustered when running on a free account. The tests were thus first conducted on the test application locally, with the goal of acquiring the expected results in a stable environment. Tests were then conducted on the same application, but installed on Heroku, with the expectation of obtaining similar results.

In both testing environments we used six different versions of our application – one without clustering functionality, and five with clustering functionality, each with a given number of workers available to the application instance (1, 2, 4, 8 or 16). The version without clustering functionality was needed in order to confirm that the added functionality would not affect the performance of the application.

For both environments the same machine was used as client. The specifications of the client machine were:
● Macbook Air (13-inch, Mid 2013)
● CPU: 1.7 GHz Intel Core i7
● Memory: 8 GB 1600 MHz DDR3
● OS: Mac OS X El Capitan, Version 10.11.4
● 100/10 Mbps Ethernet Connection

5.1.1 Local Environment

The local testing environment consisted of two machines: one client (with the specifications given above), and one server with the following specifications:
● MacBook Pro (13-inch, Mid 2012)
● CPU: 2.5 GHz Intel Core i5
● Memory: 4 GB 1600 MHz DDR3
● OS: Mac OS X El Capitan, Version 10.11.5
● 100/10 Mbps Ethernet Connection

Through a terminal command in Mac OS X, the specifications for the Intel Core i5 CPU could be retrieved (see Appendix 4). Here, it could be seen that the CPU had 2 cores and 4 hardware threads. This later became a determinant when deciding on the most appropriate number of worker processes to use when running a local server.

5.1.2 Heroku Environment

Summarizing the specifications given for the Heroku dyno in section 2.2:
● CPU: varied share, depending on how many other dynos are currently active on the shared host
● Memory: 512 Mb

Due to having significantly less memory available in the Heroku environment compared to the local one, and due to the fact that a dyno had a varied share of the CPU, the results needed to be established locally first. We reasoned that if the expected results from the hypotheses (i.e. a throughput increase only for CPU bound tasks) could be obtained in a local environment, it would be worth testing on Heroku as well. If not, the expected results would definitely not be obtained on the lower performing machines of our Heroku environment.

5.1.3 The Test Application’s Memory Usage

By monitoring the application’s memory usage locally through the Mac OS X terminal command “top”, we could see that it used 20 Mb without any requests being sent to it. When requesting the application to run CPU bound tasks, its memory usage could climb up to 85 Mb, but averaged around 65 Mb. When requesting the application to run I/O bound tasks, its memory usage could climb up to around 80 Mb, but averaged around 60 Mb.

Since we, on Heroku, had a memory quota of 512 Mb, we now had an equation for calculating the appropriate number of workers for the application. With 512 Mb of total memory available, and a maximum memory usage of around 85 Mb per worker when performing CPU bound tasks (the most memory demanding task), the most appropriate number of workers would be around 512 Mb / 85 Mb ≈ 6 workers. Considering that the master process would also need some memory allocated, the appropriate number of workers would most likely be slightly below 6.

Among our different versions of the application, we could thereby predict that the one with 4 workers would produce the best results, giving an increased throughput while not exceeding the memory limit of the dyno (and still leaving a margin to it). The application utilizing 4 workers would have a memory quota of 512 Mb / 4 workers = 128 Mb available for each worker (minus the master process’s memory usage). This meant that when the application was exposed to high traffic ordering it to perform CPU bound tasks, each worker would still have a memory quota of 128 − 85 = 43 Mb available, which should be considered a good margin, without leaving a significant amount of unused memory on the dyno.
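The worker-count calculation above can be written out as a small helper (our own sketch of the arithmetic, not code from the template):

```javascript
// Maximum number of workers that fit in the dyno's memory, given the
// peak memory usage of a single worker under CPU bound load.
function maxWorkers(dynoMemoryMb, perWorkerPeakMb) {
  return Math.floor(dynoMemoryMb / perWorkerPeakMb);
}

// 512 Mb dyno, ~85 Mb peak per worker: at most 6 workers fit.
// Accounting for the master process's own memory pushes the practical
// choice slightly below that, which is why 4 workers was predicted
// to perform best.
console.log(maxWorkers(512, 85));
```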

Summarizing: it is important that the memory usage of the application’s processes does not exceed the available memory of the Heroku dyno, and that it ideally lies with a good margin below this value – but not too large a margin, because then a large amount of memory would go unused. The problem, concerning the Heroku environment, had thus become memory related as well (not only CPU related).

5.2 Testing Tools

This section will describe the testing tools used when running tests locally and on Heroku.

5.2.1 Apache JMeter

JMeter is a Java application designed to load test functional behavior and measure performance. It provides means for simulating a heavy load on a server, group of servers, or network, in order to test its strength or analyze its performance under different load types. It can load and performance test many different server/protocol types: HTTP/HTTPS, FTP, TCP, etc.

Figure 5.2.1.1: Example properties of a thread group

With each test plan, the user creates a thread group, specifying a thread number, a ramp-up period, and a loop count. The thread number specifies how many threads are to be started at the beginning of each ramp-up period (specified in seconds), and the loop count specifies how many times this procedure should be repeated. Figure 5.2.1.1 shows an example of the properties that can be set for a thread group. Here, 10 threads are initiated each second, and this is looped 320 times.

Figure 5.2.1.2: An example of the properties of an HTTP Request Sampler

Within each thread group, in turn, several elements can be included. In our case, for example, it was relevant to include an HTTP Request Sampler – an object that contains information on an HTTP request that is to be sent with each thread in the thread group. Figure 5.2.1.2 shows an example of properties set for an HTTP Request Sampler that sends requests to port number 8887 on IP 192.168.1.104. The body data can also be set; this is, however, something that we could not show due to the risk of infringing company policy.

There is also the possibility of generating aggregated reports. This type of report forms the basis for presenting the results of the tests performed in the local environment.

5.2.2 Heroku Metrics

When running the tests on Heroku, we used JMeter for sending the requests, but not for measuring the application’s performance. This was due to JMeter having a different measurement of throughput, based on the number of samples divided by the total time of the test. This meant that the time for the request being sent to, and received by, the server, and the time for the response being sent to, and received by, the client, were included in the measurement as well. This was an acceptable measurement in the local environment, since the distance between client and host was small. With the application deployed on an external host, however, we had to take into consideration that there might be a significant distance between the client and the host. Therefore, it was decided that the just mentioned transport times should not be part of the application’s performance evaluation.
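JMeter’s throughput figure, as described above, is simply the sample count over the total test time (sketched here with illustrative numbers, not values from our tests):

```javascript
// JMeter throughput: number of samples divided by the total test
// time. Because the clock spans the whole round trip, network
// transport time between client and server is folded into the figure.
function jmeterThroughput(samples, totalTimeSeconds) {
  return samples / totalTimeSeconds;
}

// e.g. 15000 samples completed in 750 s would report 20 requests/s.
```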

In order to measure the application’s performance accurately, it was important to take the measurements as close to the application as possible. This could be done by relying on the metrics tool that Heroku has made available for developers. The tool consists of a collection of graphs including the same units of measurement for the application as those retrieved from the JMeter reports used in the previous section – namely throughput, average and median response times, and error rates.

5.3 Creating the Test Plan

The testing procedures of the application followed a pattern where, for each type of task (I/O or CPU bound), the number of requests sent to the application was gradually increased – in order to evaluate how well the application performed different tasks at different traffic rates.

The load rate for each test varied between 10 and 100 requests per second. Rates of 10–25 requests per second were to simulate low traffic, 25–50 medium traffic, and 50–100 high traffic. For all tests, 15000 samples were sent to the application.

Figure 5.3.1: Example of six thread groups each containing one HTTP Request Sampler

In order to test each application version sequentially, there was one thread group (see figure 5.3.1) for each number of workers available to each application. Within each thread group, we had specified an HTTP Request Sampler (described in section 5.2.1) sending requests to that application’s specific endpoint. As mentioned earlier, regardless of whether the application was running locally or remotely on Heroku, we set up the same samplers in each of the thread groups – only changing the names of the thread groups and the host URL for the samplers.

5.4 Local Tests

This section describes the evaluation of the application’s capabilities in performing I/O and CPU bound tasks in the local environment. Focus lay on differences in throughput between tests, but average and median response times will also be noted and analyzed.

5.4.1 I/O Bound

Starting with the I/O bound tests and sending in 10 requests per second (see figure 5.4.1.1), it was noted that the results were similar between the thread groups – no matter the number of worker processes used. Both the average and the median response times are close to the same for each of the thread groups. The throughput (number of requests handled by the application per second) is also similar between the thread groups.

Label                            # Samples   Average (ms)   Median (ms)   Error %   Throughput (rps)
10/s With Cluster, 2 workers     15000       317            308           0.00%     30.8
10/s With Cluster, 4 workers     15000       308            309           0.00%     31.7
10/s With Cluster, 16 workers    15000       307            308           0.00%     31.8
10/s Without Cluster             15000       307            307           0.00%     31.9
10/s With Cluster, 8 workers     15000       307            308           0.00%     31.9
10/s With Cluster, 1 workers     15000       307            307           0.00%     31.9

Label                            # Samples   Average (ms)   Median (ms)   Error %   Throughput (rps)
100/s With Cluster, 16 workers   15000       306            305           0.00%     309.3
100/s With Cluster, 4 workers    15000       305            305           0.00%     310.3
100/s Without Cluster            15000       305            305           0.00%     310.8
100/s With Cluster, 8 workers    15000       305            305           0.00%     310.9
100/s With Cluster, 2 workers    15000       306            305           0.00%     311.5
100/s With Cluster, 1 workers    15000       306            305           0.00%     312.0

Figure 5.4.1.1: Results from the local I/O bound tests at 10 and 100 requests per second (sorted by throughput)

Looking at the results from the test where 10 requests were sent per second, there is barely any difference between the version without clustering and the ones utilizing it. The results from the other test (100 requests per second) showed the same. The difference in throughput between the thread group without clustering and the highest performing thread group with clustering was ~0.4%, which is not a significant difference (and does not pass the bar of 20%).

To summarize, when the application was performing I/O bound tasks in a local environment, the test results did not show a significant difference in throughput, and none of the results passed the bar of a 20% increase in throughput. In accordance with the hypotheses (described in section 3.3) and the research process depicted for this thesis (section 3.2.6) – only moving on to Heroku with tests that showed an increase in the local environment – no increase in throughput could be seen when the application performed I/O bound tasks.

All of the results obtained from testing the application’s capabilities of performing I/O bound tasks in the local environment can be found in Appendix 5.

5.4.2 CPU Bound

As can be seen in figure 5.4.2.1, when simulating low traffic (10 requests per second), the results obtained showed a significant difference between the thread groups. Comparing the thread group without clustering to the highest performing one with clustering (4 workers) showed a difference of ~24.4%. A decrease of 25% in average and median response times between the two thread groups could also be noted.

Label                            # Samples   Average (ms)   Median (ms)   Error %   Throughput (rps)
10/s With Cluster, 1 workers     15000       6              6             0.00%     787.5
10/s Without Cluster             15000       4              4             0.00%     827.6
10/s With Cluster, 16 workers    15000       4              3             0.00%     892.9
10/s With Cluster, 8 workers     15000       3              3             0.00%     972.3
10/s With Cluster, 2 workers     15000       3              3             0.00%     994.6
10/s With Cluster, 4 workers     15000       3              3             0.00%     1029.9

Label                            # Samples   Average (ms)   Median (ms)   Error %   Throughput (rps)
50/s With Cluster, 1 workers     15000       51             53            0.00%     815.5
50/s Without Cluster             15000       37             39            0.00%     954.5
50/s With Cluster, 2 workers     15000       25             27            0.00%     1211.9
50/s With Cluster, 16 workers    15000       15             15            0.00%     1269.7
50/s With Cluster, 4 workers     15000       16             15            0.00%     1320.3
50/s With Cluster, 8 workers     15000       13             13            0.00%     1355.9

Figure 5.4.2.1: The results of the CPU bound tests at 10 and 50 requests per second (sorted by throughput)

Moving on to the test where requests were sent to the application at 50 requests per second, the results showed an increase in throughput of ~42.1% when comparing the application without clustering to the highest performing clustered application (8 workers). In this case, the decrease in average response time between the two thread groups was ~64.9%, and the decrease in median response time was ~61.5%.

Label                            # Samples   Average (ms)   Median (ms)   Error %   Throughput (rps)
100/s With Cluster, 1 workers    15000       107            110           0.00%     793.6
100/s Without Cluster            15000       82             86            0.00%     926.8
100/s With Cluster, 16 workers   15000       19             18            0.00%     1151.8
100/s With Cluster, 2 workers    15000       63             67            0.00%     1208.1
100/s With Cluster, 4 workers    15000       49             52            0.00%     1278.3
100/s With Cluster, 8 workers    15000       39             44            0.00%     1326.5

Figure 5.4.2.2: The results of the CPU bound tests at 100 requests per second (sorted by throughput)

Looking at the test where requests were sent in at 100 requests per second (see figure 5.4.2.2), a difference of ~43.1% in increased throughput could be noted between the non-clustered and the best performing clustered application (8 workers). In this case, a decrease of ~52.4% in average and ~48.8% in median response times could also be seen. Lastly, it was noted that the application with 1 worker performed ~4.8–14.4% lower than the application not implementing clustering.

In conclusion, the results from testing the application’s capabilities of performing CPU bound tasks locally showed increases in throughput higher than the bar of 20%. Because of this, CPU bound tests were to be conducted in the Heroku environment as well. When utilizing only 1 worker, however, the throughput was lowered by ~4.8–14.4% compared to not utilizing clustering at all. The full results of the CPU bound tests can be seen in Appendix 6.
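The percentage comparisons in this section can be reproduced with a small helper (a sketch of the calculation; the thesis reports the rounded figures directly):

```javascript
// Percentage change in throughput between a baseline (non-clustered)
// run and a clustered run, rounded to one decimal place.
function pctChange(baselineRps, clusteredRps) {
  return Math.round(((clusteredRps - baselineRps) / baselineRps) * 1000) / 10;
}

// From figure 5.4.2.1: at 50 rps, 954.5 rps without clustering vs
// 1355.9 rps with 8 workers reproduces the ~42.1% increase, and at
// 10 rps, 827.6 vs 1029.9 rps (4 workers) reproduces ~24.4%.
```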

5.5 Heroku Tests

This chapter presents the evaluation results of the same application used in the local tests, but deployed on the Heroku platform.

Worth noting is that it was decided not to test the I/O bound function on Heroku, as the results from analyzing the local tests indicated that the Cluster module did not contribute to any throughput increases for this type of task.

Here, the test results are based on the output of the Heroku metrics. Each bar in the diagrams represents the performance of the application during a given minute in time. The vertical line apparent in most of the diagrams (e.g. figure 5.5.1.1) marks a specific minute chosen by us for analysis; this minute corresponds to the highest throughput value obtained during each test. All test results in this section are from the application performing CPU bound tasks on Heroku.

5.5.1 Throughput Rates

The results obtained from simulating low traffic for an application deployed on the Heroku platform did not show any significant difference. In figure 5.5.1.1, it can be seen that the throughput is about the same (~10000 requests/min) for the application without clustering and for the highest performing application with clustering (4 workers).

Figure 5.5.1.1: A comparison in throughput between tests at 10 rps without clustering (upper) and with 4 workers (lower)

Continuing with the results from the tests running at 50 requests per second (see figure 5.5.1.2), a difference of ~20.6% in throughput could be noted when comparing the application without clustering to the best performing application with clustering (4 workers), thus passing the bar of 20% set out as the prerequisite for being considered a positive result.

Figure 5.5.1.2: A comparison in throughput between tests at 50 rps without clustering (upper) and with 4 workers (lower)

When looking at the results obtained from the tests running at 100 requests per second (see figure 5.5.1.3), a significant difference in throughput could be seen. Here, there is a difference of ~54.9%, which also passes the bar of 20%.


Figure 5.5.1.3: A comparison in throughput between tests at 100 rps without clustering (upper) and with 4 workers (lower)

Lastly, looking at the results from each test (10–100 rps), the differences in lowered throughput for the application utilizing 1 worker compared to the one without clustering persisted. Comparing the two, performance was lowered by ~1.5–5.3% when sending 10–100 requests/second to the application (see Appendix 7).

5.5.2 Memory Usage

Looking at the memory usage when comparing the application without clustering and the one with the best performance (4 workers) during high traffic, it can be seen (in figure 5.5.2.1) that a significant amount of the memory available to the non-clustered application is unused, namely 407 Mb. The reason for looking at high traffic in particular is that it is the worst case scenario for the application (when the largest amount of memory is allocated).


Figure 5.5.2.1: Memory footprint of the application without clustering (upper) and with 4 workers (lower) running at 100 rps

Looking at the memory usage of the second application (4 workers), it can be seen (also in figure 5.5.2.1) that more of the available memory is being used by the application, leaving 221 Mb unused. The vertical line represents the same time as in figure 5.5.1.3.

Additionally, in this case (100 rps) the application exceeds the memory limit when having 8 or 16 workers, leading to requests queueing up (see Appendix 7, figures 23 and 24). These queued requests further add to the memory quota of the application, which can eventually lead to the application crashing (at 1 Gb memory usage [10]), thus losing data.

5.5.3 Median Response Times

When looking at the tests, the median response times varied between being roughly the same during low traffic (10 rps) and having a decreased time of ~94% during high traffic (100 rps), when comparing the application without clustering to the one having 4 workers (see figure 5.5.3.1). The vertical line in the figure marks the same time as in figures 5.5.1.3 and 5.5.2.1.

Figure 5.5.3.1: Response times for the application without clustering (upper) and with 4 workers (lower) when sending 100 rps

5.5.4 Analysis of Heroku Test Results

When performing CPU bound tasks in the Heroku environment, an increase in throughput exceeding the set 20% threshold could be seen when the traffic sent to the application was of medium to high rate (50-100 requests per second).
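The CPU bound behaviour behind these numbers can be illustrated with a minimal sketch. This is an assumed stand-in, not the actual test application (which is listed in Appendix 2): a synchronous computation such as the naive Fibonacci below occupies the single event loop for its full duration, so a non-clustered application serves only one such request at a time.

```javascript
// Assumed stand-in for a CPU bound task; the real test application
// (Appendix 2) may compute something different.
function fib(n) {
  // Deliberately naive recursion: it runs synchronously and therefore
  // blocks the Node.js event loop until it returns.
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}
```

With the Cluster module, each worker runs its own event loop, so several such computations can proceed on separate cores at once, which is what produces the throughput gains measured above.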

It could be seen how the memory usage with 8 workers at 100 requests/second exceeded the limit of the dyno. Thus, requests started queueing up, which could eventually have led to the application crashing. At the same traffic, the application without clustering left a large amount of the dyno's available memory unused, thus not utilizing the full capacity of the dyno. The memory usage of the applications at high traffic therefore spoke for using 4 workers within the application.

Additionally, in the Heroku tests, as in the local tests, there was some decrease in throughput when comparing the non-clustered application with the one utilizing 1 worker. This means that implementing the Cluster module and instantiating only one worker process would most likely lead to a small decrease in throughput rather than an increase. This is most likely because the master process both has to create the worker process at start-up and performs redundant IPC when there is only one worker.

In conclusion, throughput increases could be seen in the Heroku environment as well. The answer to the second research question of this thesis, on how substantial the increase in throughput could be, is thus: when sending medium to high traffic (50-100 rps) to an application performing CPU bound tasks, the increase in throughput varies between ~20.6% (at 50 rps) and ~54.9% (at 100 rps). When doing I/O bound tasks, on the other hand, there are no throughput increases. The obtained answer confirmed the hypotheses presented in section 3.3.

6 Discussion

This chapter presents our discussion and conclusions on this study, reflecting our interpretations of the results and the problem areas related to the thesis. Finally, the future direction of the template is discussed.

6.1 Our Methodology and Consequences of the Study

The purpose of the thesis was to create a template for Node.js applications deployed on the Heroku cloud platform. The problem definition described in section 1.2 led to two research questions (see section 1.3), thus dividing the research process into two parts.

When answering the first question, the applied research method was chosen. This method belongs to both qualitative and quantitative research. The choice was determined through early investigation, where it was found that there were few techniques that could be applied in order to design and implement the template. This was because we were bound to a particular situation and a very specific implementation environment, which led to there being only one solution to the problem.

The problem with this research method was that it relied heavily on the outcome of the second question. Negative results would have required further investigation and made the research process iterative, where a new technique for increasing throughput would have had to be evaluated.

The data collection phase of the first question included collection of primary and secondary data. The primary data consisted mainly of informal interviews and our observations at Innometrics. This data was then complemented by document studies related to our research field. However, we had trouble finding academic literature on the subject, probably partly due to the narrowness of the problem area, and partly due to the fact that the technologies discussed are rather new. That is why we were careful when analyzing the quality of the available resources; otherwise, potential problems could have arisen during the later phases of the research process.

The implementation part of the study focused on the development of a Node.js template for Innometrics. This is the part of the research process that relied heavily on the result of the second research question.

Before answering the second question, the right assumptions needed to be made; in our case, to focus on CPU bound functions, which is why the tests gave the desired result. The only problem was that the Heroku environment, where the tests were conducted, was very unpredictable in terms of available system resources.

In conclusion, the research process made it possible to identify the right problem areas and choose the right methods. As there was no previous research in this field and the problem was of a practical character, our thesis could serve as a good foundation for the future development of the applications at Innometrics, as well as for developers experimenting with Node.js and the Heroku platform.

6.2 Discussion and Conclusions

The only solution found when trying to answer the first research question, how to increase throughput for the applications within Innometrics’ system, was to utilize the Cluster module, which is part of the standard library of Node.js (there were, however, alternatives for how to implement the module).

When faced with the problem of testing the application on Heroku, it was found that the problem was no longer only about utilizing more of the CPU’s capacity, but about optimizing memory usage as well. This meant we could not focus solely on finding the number of workers giving the highest throughput, but also had to consider the memory usage of the application and make sure that the application, with its processes, stayed well below the memory limit of the dyno it was installed on. Otherwise, the limit might be exceeded, which could eventually lead to the application crashing, thus losing data.

Analyzing the obtained test results showed an increase of ~20-55% when utilizing the Cluster module in a test application performing CPU bound tasks during medium to high traffic (50-100 requests per second). However, through tests performed in a local, higher performing environment, we came to the conclusion that, when doing I/O bound tasks, a clustered application would not show any increase in throughput in the Heroku environment, regardless of the traffic sent to it. This was in accordance with the hypotheses for this thesis (presented in section 3.3).

It was also found that the highest increases in throughput when doing CPU bound tasks in the Heroku environment were obtained when using 4 workers in the application. This number of workers also stayed well within the memory limit of the dyno (thus minimizing the risk of exceeding it during sudden increases in traffic).
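On Heroku, the worker count is usually better kept configurable than hard-coded; Heroku's own Node.js concurrency guidance [22] uses a WEB_CONCURRENCY environment variable for this purpose. A sketch of that pattern, defaulting to the 4 workers that tested best here:

```javascript
// Read the worker count from the environment, falling back to the
// 4 workers that gave the best results in these tests.
function workerCount(env) {
  const n = parseInt(env.WEB_CONCURRENCY, 10);
  return Number.isInteger(n) && n > 0 ? n : 4;
}

// Example use: const workers = workerCount(process.env);
```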

Regarding the test results, it was also noted that utilizing only one worker led to a throughput decrease of ~1.5-5.3% (depending on the traffic sent to the application). This means that an application should only utilize the application template if it needs, and has the memory available, to instantiate more than one worker.

The obtained results from answering the second research question (on how substantial the increase in throughput could be) confirmed that the answer to the first research question, on how to increase throughput in a Node.js application, was to utilize the Cluster module.

6.2.1 Recommendations Concerning the Application Template

Our recommendation when utilizing this template in an application installed on a Heroku free account is to implement it mainly in applications exposed to traffic of around 50 requests per second and higher. It is also recommended to check the memory usage of the application before implementing the template, to make sure that it is able to run more than one worker, and thereby obtain an increase in throughput, without exceeding the memory limit of the dyno.
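The memory check recommended above can be made concrete with a small back-of-the-envelope helper. The function and all figures below are illustrative assumptions, not measurements from this thesis: it estimates how many workers fit under a dyno's memory limit given a measured per-worker footprint and a safety margin.

```javascript
// Hypothetical sizing helper: how many workers fit in the dyno?
// limitMb:     dyno memory quota in MB (e.g. an assumed 512 on a free dyno)
// perWorkerMb: measured RSS of one worker in MB
// headroomMb:  safety margin for the master process and traffic spikes
function maxWorkers(limitMb, perWorkerMb, headroomMb) {
  return Math.max(1, Math.floor((limitMb - headroomMb) / perWorkerMb));
}
```

For example, with an assumed 512 MB quota, roughly 70 MB per worker and 100 MB headroom, this yields 5 workers; if a single worker alone exceeds the budget, it still returns 1, signalling that clustering brings no benefit.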

6.3 Ethics

A problem encountered in the ethical parts of the project was that we needed to avoid revealing company secrets about Innometrics, its customers, or the customers’ clients (the Innometrics profiles). Therefore, the report is kept in as general terms as possible.

Also, by increasing the throughput of an application, more data can be collected per minute. This means that consumer behaviour can be tracked in more detail, which will lead to better communication between customer and company. At the same time, for the individual person, this project might not be beneficial: it contributes further to the monitoring of persons, which is a controversial subject today. A more detailed customer profile might make a person feel pursued. This is, however, not up to us, but to the company using the template (in this case Innometrics).

6.4 Sustainability

When speaking about sustainability, there are four general areas to discuss: environmental, human, economic, and social sustainability [32].

Concerning environmental sustainability, by increasing the throughput of the applications, more clients can be handled per minute. This can lead to faster wear on the machines running the system, since the machine in question will now handle more clients simultaneously, thus doing more work per minute. It can also lead to higher energy consumption, due to utilizing more of the machine’s capacity. Both of these aspects further damage the environment.

It can, however, also lead to less wear on the machine by completing demanding work sessions faster than before, thereby giving the machine time for recovery (e.g. lowering machine temperatures) before strenuous periods.

Lastly, concerning economic sustainability, by increasing the throughput of the application, it will not crash as often. This will lead to an increase in profits for the companies in need of the service, as well as for the company providing it. Also, by only using the free account on Heroku, Innometrics saves expenses.

6.5 Future Work

Despite the fact that the application template can already be implemented, there are still some areas that could be further investigated.

When it comes to the implementation of the Cluster module, there are different possibilities, and it could be worth evaluating some of them (PM2, StrongLoop). They will probably not give much better performance, if any at all, but they can offer more convenient ways of managing the worker processes, as well as other features that facilitate future maintenance of the application. For example, there would be no need to take the application down when making changes.

In order to make the application template as generic as possible, other cloud platforms should also be considered; similar tests could identify the most suitable platform.

In the case of Heroku, there are other types of accounts providing more system resources. Detailed comparison tests could give a clearer picture of the differences in performance between account types.

It could also be interesting to do further research on finding the exact appropriate number of workers for an application. In this case, 4 workers had, as mentioned, a good margin to the memory limit, but this might still not have been the most optimized memory usage of the application. Testing 5-6 workers for this particular application might thus have shown even better results.

References

[1] “IAB Internet Advertising Revenue Report”, 2012. [Online], p. 16. Available: http://www.iab.net/media/file/IAB_Internet_Advertising_Revenue_Report_FY_2012_rev.pdf, [Accessed: 1 March 2016]
[2] B. Cantrill, J. Bonwick, “Real-world Concurrency”, ACM Queue, Volume 6, Issue 5, September 2008. [Online]. Available: http://queue.acm.org/detail.cfm?id=1454462, [Accessed: 25 May 2016]
[3] A.V. Aho, M.S. Lam, R. Sethi, J.D. Ullman, Compilers: Principles, Techniques, and Tools, p. 247. 2006, Pearson
[4] S.D. Burd, Systems Architecture, Fifth Edition, p. 45. 2004, Course Technology - a division of Thomson Learning
[5] E. Griffith, “What Is Cloud Computing?”, PCMag UK, 3 May 2016. [Online]. Available: http://uk.pcmag.com/networking-communications-software-products/16824/feature/what-is-cloud-computing, [Accessed: 24 May 2016]
[6] Heroku, “The Heroku Platform as a Service & Data Services”. [Online]. Available: https://www.heroku.com/platform, [Accessed: 24 May 2016]
[7] B. Butler, “PaaS Primer: What is platform as a service and why does it matter?”, Network World, 11 February 2013. [Online]. Available: http://www.networkworld.com/article/2163430/cloud-computing/paas-primer--what-is-platform-as-a-service-and-why-does-it-matter-., [Accessed: 25 May 2016]
[8] L. R. Rewatkar, U. A. Lanjewar, “Implementation of Cloud Computing on Web Application”, International Journal of Computer Applications, Volume 2, No. 8, June 2010. [Online], pp. 28-31. Available: http://www.ijcaonline.org/volume2/number8/pxc387964.pdf, [Accessed: 24 May 2016]
[9] S. Lynn, “What Is CRM?”, PCMag UK, 18 August 2011. [Online]. Available: http://uk.pcmag.com/software/9038/feature/what-is-crm, [Accessed: 24 May 2016]
[10] Heroku Dev Center, “Dynos and the Dyno Manager”. [Online]. Available: https://devcenter.heroku.com/articles/dynos, [Accessed: 24 May 2016]
[11] Heroku Dev Center, “Dyno Types”. [Online]. Available: https://devcenter.heroku.com/articles/dyno-types, [Accessed: 24 May 2016]
[12] S. Saini, H. Jin, R. Hood, D. Barker, P. Mehrotra, R. Biswas, “The Impact of Hyper-Threading on Processor Resource Utilization in Production Applications”, NASA Advanced Supercomputing Division, 2011. [Online]. Available: https://www.nas.nasa.gov/assets/pdf/papers/saini_s_impact_hyper_threading_2011.pdf, [Accessed: 25 May 2016]
[13] E. Swenson-Healey, “The JavaScript Event Loop: Explained”, Carbon Five, 27 October 2013. [Online]. Available: http://blog.carbonfive.com/2013/10/27/the-javascript-event-loop-explained/, [Accessed: 25 May 2016]
[14] A.S. Tanenbaum, Modern Operating Systems, Third Edition, p. 145. 2009, Pearson
[15] Cluster, Node.js v6.2.1 Documentation. [Online]. Available: https://nodejs.org/api/cluster.html
[16] A. Burgess, “Using Node's Event Module”, Envato Tuts+, 3 December 2013. [Online]. Available: http://code.tutsplus.com/tutorials/using-nodes-event-module--net-35941, [Accessed: 26 May 2016]
[17] D. Khan, “How to Track Down CPU Issues in Node.js”, about:performance - Application Performance, Scalability and Architecture, 14 January 2016. [Online]. Available: http://apmblog.dynatrace.com/2016/01/14/how-to-track-down-cpu-issues-in-node-js/, [Accessed: 25 May 2016]
[18] M. Ridwan, “The Top 10 Most Common Mistakes That Node.js Developers Make”, Toptal. [Online]. Available: https://www.toptal.com/nodejs/top-10-common-nodejs-developer-mistakes, [Accessed: 25 May 2016]
[19] Linux Documentation, fork(2). [Online]. Available: http://linux.die.net/man/2/fork, [Accessed: 24 May 2016]
[20] B. Noordhuis, “What's New in Node.js v0.12: Cluster Round-Robin Load Balancing”, StrongLoop, 19 November 2013. [Online]. Available: https://strongloop.com/strongblog/whats-new-in-node-js-v0-12-cluster-round-robin-load-balancing/, [Accessed: 25 May 2016]
[21] A. Gorbatchev, “How-to Cluster Node.js in Production with Strong Cluster Control”, StrongLoop, 22 April 2015. [Online]. Available: https://strongloop.com/strongblog/production-node-js-strong-cluster-control/, [Accessed: 25 May 2016]
[22] “Optimizing Node.js Application Concurrency”, Heroku Dev Center. [Online]. Available: https://devcenter.heroku.com/articles/node-concurrency, [Updated: 24 September 2015]
[23] R. Manning, “Node.js Cluster and Express”, 10 January 2013. [Online]. Available: http://rowanmanning.com/posts/node-cluster-and-express/, [Accessed: 28 May 2016]
[24] N. Kandalgaonkar, “Why you should use Node.js for CPU-bound tasks”, 30 April 2013. [Online]. Available: http://neilk.net/blog/2013/04/30/why-you-should-use-nodejs-for-CPU-bound-tasks/, [Accessed: 25 May 2016]
[25] A. Håkansson, “Portal of Research Methods and Methodologies for Research Projects and Degree Projects”, 2013. [Online]. Available: https://www.kth.se/social/files/55563be1f276547328cea897/Research%20Methods%20-%20Methodologies(1).pdf
[26] “Qualitative and Quantitative Research Techniques for Humanitarian Needs Assessment. An Introductory Brief”, ACAPS, May 2012. [Online]. Available: http://www.acaps.org/sites/acaps/files/resources/files/qualitative_and_quantitative_research_techniques_for_humanitarian_needs_assessment-an_introductory_brief_may_2012.pdf
[27] S. Kar, “Node.js Performance Tip of the Week: Scaling with Proxies and Clusters”, StrongLoop, 22 April 2015. [Online]. Available: https://strongloop.com/strongblog/node-js-performance-scaling-proxies-clusters/, [Accessed: 16 May 2016]
[28] Throng. [Online]. Available: https://www.npmjs.com/package/throng
[29] J. Shkurti, “Node.js clustering made easy with PM2”, Keymetrics, 26 March 2015. [Online]. Available: https://keymetrics.io/2015/03/26/pm2-clustering-made-easy/, [Accessed: 5 June 2016]
[30] PM2. [Online]. Available: http://pm2.keymetrics.io/, [Accessed: 5 June 2016]
[31] “Using PM2 in Cloud Providers”, PM2 Documentation. [Online]. Available: http://pm2.keymetrics.io/docs/usage/use-pm2-with-cloud-providers/, [Accessed: 5 June 2016]
[32] R. Goodland, “Sustainability: Human, Social, Economic and Environmental”, Baltic University Programme - a regional university network. [Online]. Available: http://www.balticuniv.uu.se/index.php/component/docman/doc_download/435-sustainability-human-social-economic-and-environmental, [Accessed: 5 June 2016]

Appendix 1 - Heroku Dyno CPU Information

Results from running the “cat /proc/cpuinfo”­command in the shell environment of each test application: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 1

51 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management:

52 processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 0 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 13

53 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 5 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 1 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 6 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4

54 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 2 cpu cores : 4 apicid : 5 initial apicid : 5 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management: processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 62 model name : Intel(R) Xeon(R) CPU E5­2670 v2 @ 2.50GHz stepping : 4 microcode : 0x428 cpu MHz : 2500.090 cache size : 25600 KB physical id : 0 siblings : 8 core id : 3 cpu cores : 4 apicid : 7 initial apicid : 7 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm xsaveopt fsgsbase smep erms

55 bogomips : 5000.18 clflush size : 64 cache_alignment : 64 address sizes : 46 bits physical, 48 bits virtual power management:

Appendix 2 - The Test Application

Figure 1: First half of the test application


Figure 2: Second half of the test application

Appendix 3 - The Application Template

Figure 1: Part one of the application template

Figure 2: Part two of the application template

59 Appendix 4 ­ The Local Server CPU Specifications machdep.cpu.max_basic: 13 machdep.cpu.max_ext: 2147483656 machdep.cpu.vendor: GenuineIntel machdep.cpu.brand_string: Intel(R) Core(TM) i5­3210M CPU @ 2.50GHz machdep.cpu.family: 6 machdep.cpu.model: 58 machdep.cpu.extmodel: 3 machdep.cpu.extfamily: 0 machdep.cpu.stepping: 9 machdep.cpu.feature_bits: 9203919201183202303 machdep.cpu.leaf7_feature_bits: 641 machdep.cpu.extfeature_bits: 4967106816 machdep.cpu.signature: 198313 machdep.cpu.brand: 0 machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC POPCNT AES PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS machdep.cpu.extfeatures: SYSCALL XD EM64T LAHF RDTSCP TSCI machdep.cpu.logical_per_package: 16 machdep.cpu.cores_per_package: 8 machdep.cpu.microcode_version: 21 machdep.cpu.processor_flag: 4 machdep.cpu.mwait.linesize_min: 64 machdep.cpu.mwait.linesize_max: 64 machdep.cpu.mwait.extensions: 3 machdep.cpu.mwait.sub_Cstates: 135456 machdep.cpu.thermal.sensor: 1 machdep.cpu.thermal.dynamic_acceleration: 1 machdep.cpu.thermal.invariant_APIC_timer: 1 machdep.cpu.thermal.thresholds: 2 machdep.cpu.thermal.ACNT_MCNT: 1 machdep.cpu.thermal.core_power_limits: 1 machdep.cpu.thermal.fine_grain_clock_mod: 1 machdep.cpu.thermal.package_thermal_intr: 1 machdep.cpu.thermal.hardware_feedback: 0 machdep.cpu.thermal.energy_policy: 0 machdep.cpu.xsave.extended_state: 7 832 832 0 machdep.cpu.xsave.extended_state1: 1 0 0 0 machdep.cpu.arch_perf.version: 3 machdep.cpu.arch_perf.number: 4 machdep.cpu.arch_perf.width: 48 machdep.cpu.arch_perf.events_number: 7

60 machdep.cpu.arch_perf.events: 0 machdep.cpu.arch_perf.fixed_number: 3 machdep.cpu.arch_perf.fixed_width: 48 machdep.cpu.cache.linesize: 64 machdep.cpu.cache.L2_associativity: 8 machdep.cpu.cache.size: 256 machdep.cpu.tlb.inst.small: 64 machdep.cpu.tlb.inst.large: 8 machdep.cpu.tlb.data.small: 64 machdep.cpu.tlb.data.large: 32 machdep.cpu.tlb.shared: 512 machdep.cpu.address_bits.physical: 36 machdep.cpu.address_bits.virtual: 48 machdep.cpu.core_count: 2 machdep.cpu.thread_count: 4 machdep.cpu.tsc_ccc.numerator: 0 machdep.cpu.tsc_ccc.denominator: 0

Appendix 5 - Results From I/O Bound Tests in Local Environment

Average and Median measured in ms, and Throughput in requests per second.

Label                            Samples  Average  Median  Error %  Throughput
10/s With Cluster, 2 workers     15000    317      308     0,00%    30,8
10/s With Cluster, 4 workers     15000    308      309     0,00%    31,7
10/s With Cluster, 16 workers    15000    307      308     0,00%    31,8
10/s Without Cluster             15000    307      307     0,00%    31,9
10/s With Cluster, 8 workers     15000    307      308     0,00%    31,9
10/s With Cluster, 1 workers     15000    307      307     0,00%    31,9

Label                            Samples  Average  Median  Error %  Throughput
25/s With Cluster, 2 workers     15000    305      305     0,00%    79,1
25/s With Cluster, 16 workers    15000    306      306     0,00%    79,2
25/s With Cluster, 4 workers     15000    306      307     0,00%    79,3
25/s With Cluster, 8 workers     15000    306      306     0,00%    79,4
25/s With Cluster, 1 workers     15000    305      305     0,00%    79,4
25/s Without Cluster             15000    305      305     0,00%    79,6

Label                            Samples  Average  Median  Error %  Throughput
50/s With Cluster, 4 workers     15000    305      305     0,00%    157
50/s With Cluster, 2 workers     15000    305      305     0,00%    157,4
50/s With Cluster, 16 workers    15000    305      305     0,00%    157,7
50/s With Cluster, 8 workers     15000    305      305     0,00%    157,7
50/s Without Cluster             15000    305      305     0,00%    158,2
50/s With Cluster, 1 workers     15000    305      305     0,00%    158,4

Label                            Samples  Average  Median  Error %  Throughput
75/s With Cluster, 4 workers     15000    305      305     0,00%    234,5
75/s With Cluster, 8 workers     15000    305      305     0,00%    234,6
75/s Without Cluster             15000    306      305     0,00%    234,6
75/s With Cluster, 2 workers     15000    305      305     0,00%    234,8
75/s With Cluster, 16 workers    15000    306      306     0,00%    234,9
75/s With Cluster, 1 workers     15000    306      305     0,00%    235,6

Label                            Samples  Average  Median  Error %  Throughput
100/s With Cluster, 16 workers   15000    306      305     0,00%    309,3
100/s With Cluster, 4 workers    15000    305      305     0,00%    310,3
100/s Without Cluster            15000    305      305     0,00%    310,8
100/s With Cluster, 8 workers    15000    305      305     0,00%    310,9
100/s With Cluster, 2 workers    15000    306      305     0,00%    311,5
100/s With Cluster, 1 workers    15000    306      305     0,00%    312

Figure 1: Results from I/O bound tests in local environment

Appendix 6 - Results From CPU Bound Tests in Local Environment

Average and Median measured in ms, and Throughput in requests per second.

Label                            # Samples  Average  Median  Error %  Throughput
10/s With Cluster, 1 workers     15000      6        6       0,00%    787,5
10/s Without Cluster             15000      4        4       0,00%    827,6
10/s With Cluster, 16 workers    15000      4        3       0,00%    892,9
10/s With Cluster, 8 workers     15000      3        3       0,00%    972,3
10/s With Cluster, 2 workers     15000      3        3       0,00%    994,6
10/s With Cluster, 4 workers     15000      3        3       0,00%    1029,9

Label                            # Samples  Average  Median  Error %  Throughput
25/s With Cluster, 1 workers     15000      21       23      0,00%    814,8
25/s With Cluster, 2 workers     15000      10       10      0,00%    1158,6
25/s With Cluster, 4 workers     15000      6        5       0,00%    1247,6
25/s With Cluster, 8 workers     15000      5        5       0,00%    1277,8
25/s With Cluster, 16 workers    15000      7        4       0,00%    1188,4
25/s Without Cluster             15000      14       15      0,00%    933,4

Label                            # Samples  Average  Median  Error %  Throughput
50/s With Cluster, 1 workers     15000      51       53      0,00%    815,5
50/s Without Cluster             15000      37       39      0,00%    954,5
50/s With Cluster, 2 workers     15000      25       27      0,00%    1211,9
50/s With Cluster, 16 workers    15000      15       15      0,00%    1269,7
50/s With Cluster, 4 workers     15000      16       15      0,00%    1320,3
50/s With Cluster, 8 workers     15000      13       13      0,00%    1355,9

Label                            # Samples  Average  Median  Error %  Throughput
75/s With Cluster, 1 workers     15000      77       80      0,00%    821,3
75/s With Cluster, 2 workers     15000      47       49      0,00%    1197,1
75/s With Cluster, 4 workers     15000      34       33      0,00%    1348
75/s With Cluster, 8 workers     15000      29       32      0,00%    1345,1
75/s With Cluster, 16 workers    15000      40       27      0,00%    1101,4
75/s Without Cluster             15000      60       63      0,00%    949,8

Label                            # Samples  Average  Median  Error %  Throughput
100/s With Cluster, 1 workers    15000      107      110     0,00%    793,6
100/s Without Cluster            15000      82       86      0,00%    926,8
100/s With Cluster, 16 workers   15000      19       18      0,00%    1151,8
100/s With Cluster, 2 workers    15000      63       67      0,00%    1208,1
100/s With Cluster, 4 workers    15000      49       52      0,00%    1278,3
100/s With Cluster, 8 workers    15000      39       44      0,00%    1326,5

Figure 1: Results from CPU bound tests in local environment

Appendix 7 - Results From CPU Bound Tests on Heroku

To see the results of the test corresponding to each figure description, look at the vertical line (the one next to the timestamp).

10 rps:

Figure 1: 10 rps, without clustering (vertical line missing, see timestamp for time of test)

Figure 2: 10 rps, 1 worker


Figure 3: 10 rps, 2 workers

Figure 4: 10 rps, 4 workers


Figure 5: 10 rps, 8 workers

Figure 6: 10 rps, 16 workers

25 rps:

Figure 7: 25 rps, without clustering

Figure 8: 25 rps, 1 worker

Figure 9: 25 rps, 2 workers


Figure 10: 25 rps, 4 workers

Figure 11: 25 rps, 8 workers


Figure 12: 25 rps, 16 workers

50 rps:

Figure 13: 50 rps, without clustering


Figure 14: 50 rps, 1 worker

Figure 15: 50 rps, 2 workers


Figure 16: 50 rps, 4 workers

Figure 17: 50 rps, 8 workers


Figure 18: 50 rps, 16 workers

100 rps:

Figure 19: 100 rps, without clustering


Figure 20: 100 rps, 1 worker

Figure 21: 100 rps, 2 workers

Figure 22: 100 rps, 4 workers


Figure 23: 100 rps, 8 workers

Figure 24: 100 rps, 16 workers

TRITA-ICT-EX-2016:69

www.kth.se