DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

A comparison of compiler strategies for serverless functions written in Kotlin

KIM BJÖRK

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

A comparison of compiler strategies for serverless functions written in Kotlin
En jämförelse av kompilatorstrategier för serverless-funktioner skrivna i Kotlin

Kim Björk - [email protected]
KTH Royal Institute of Technology, Stockholm, Sweden

Supervisor: Cyrille Artho - [email protected]
Examiner: Pontus Johnson - [email protected]

January 2020

Abstract

Hosting options for software have become more flexible over time, from requiring on-premises hardware to now being able to tailor a hosting solution in a public cloud. One of the latest hosting options is the serverless architecture, which entails running software only when it is invoked. Public cloud providers such as Amazon, Google and IBM provide serverless solutions, yet none of them offers official support for the popular language Kotlin. This may be one of the reasons why the performance of Kotlin in a serverless environment is, to our knowledge, relatively undocumented. This thesis investigates the performance of serverless functions written in Kotlin when run with different compiler strategies, with the purpose of contributing knowledge on this subject. One Just-In-Time compiler, the Hotspot Virtual Machine (JVM), is set against an Ahead-Of-Time compiler, GraalVM. A benchmark suite was constructed and two serverless functions were created for each benchmark: one run with the JVM and one run as a native image created by GraalVM. The benchmark tests are divided into two categories. The first consists of cold starts, which occur the first time a serverless function is invoked, or when it has not been invoked for a longer period of time, and which require certain start-up actions. The second category is warm starts, runs where the function has recently been invoked and the cold-start start-up actions are not needed. The results showed faster total runtimes and lower memory requirements for the GraalVM-enabled functions during cold starts. During warm starts the GraalVM-enabled functions still required less memory, but the JVM functions showed large improvements over time, making their total runtimes more similar to those of their GraalVM-enabled counterparts.

Sammanfattning

Möjligheterna att hysa (engelska: host) mjukvara har blivit fler och mer modifierbara, från att behöva äga all hårdvara själv till att man nu kan skräddarsy en flexibel lösning i molnet. Serverless är en av de senaste lösningarna. Olika leverantörer av publika molntjänster såsom Amazon, Google och IBM tillhandahåller serverless-lösningar. Dock har ingen av dessa leverantörer ett officiellt stöd för det populära programmeringsspråket Kotlin. Detta kan vara en av anledningarna till att språkets prestanda i en serverless-miljö är, såvitt vi vet, relativt okänd. Denna rapport har som syfte att bidra med kunskap inom just detta område. Två olika kompilatorstrategier kommer att jämföras, en JIT-kompilator (Just-In-Time) och en AOT-kompilator (Ahead-Of-Time). Den JIT-kompilator som används är Hotspot (JVM). Den AOT-kompilator som används är GraalVM. För detta arbete har en benchmarksvit skapats och för varje test i denna svit skapades två serverless-funktioner: en som körs med JVM och en som körs som en färdig binär skapad av GraalVM. Testerna har delats upp i två kategorier. En där alla tester genomgått kallstarter, något som sker då det är första gången funktionen anropas eller då det har gått en längre tid sedan funktionen anropades senast. Den andra kategorin är då testet inte behöver gå igenom en kallstart, då funktionen har blivit anropad nyligen. Körningen kan då undvika att genomgå vissa steg som krävs vid en kallstart. Resultatet visade att körtiden var snabbare och minnesanvändningen mindre för de GraalVM-kompilerade funktionerna i de tester som genomgick kallstarter. I den andra kategorin, då testerna inte behövde genomgå en kallstart, krävde GraalVM-funktionerna fortfarande mindre minne, men JVM-funktionerna visade en stor förbättring när det kom till exekveringstid. De totala körtiderna för de två olika kompilatorstrategierna var då mer lika.

Contents

1 Introduction
  1.1 Problem and Research Question
  1.2 Contributions and Scope
  1.3 Ethics and Sustainability
  1.4 Outline

2 Background
  2.1 Serverless
    2.1.1 The Attributes of Serverless
    2.1.2 Use Cases for Serverless Functions
  2.2 Kotlin
  2.3 Types of Compilers
    2.3.1 Ahead-of-Time Compiler (AOT)
    2.3.2 Just-In-Time Compiler (JIT)
  2.4 The JVM Compiler
  2.5 The GraalVM Compilation Infrastructure
  2.6 Performing Benchmark Tests
  2.7 Related Work
    2.7.1 Solutions Similar to Serverless
    2.7.2 GraalVM at Twitter
    2.7.3 Benchmark Environment and the Cloud
  2.8 Summary

3 Method
  3.1 Metrics
    3.1.1 Latency
    3.1.2 Response time
    3.1.3 Memory consumption
    3.1.4 Execution time
  3.2 Benchmarks
    3.2.1 Real benchmarks
    3.2.2 Complementary benchmarks
  3.3 Environment and Setup
  3.4 Sampling Strategy and Calculations
  3.5 Summary

4 Result
  4.1 Static metrics
  4.2 Latency
  4.3 Application Runtime
  4.4 Response Time
  4.5 Memory Consumption

5 Discussion
  5.1 Latency
    5.1.1 Cold start
    5.1.2 Warm start
  5.2 Application Runtime
    5.2.1 Cold start
    5.2.2 Warm start
  5.3 Response Time
    5.3.1 Cold start
    5.3.2 Warm start
  5.4 Memory Consumption
    5.4.1 Cold start
    5.4.2 Warm start
  5.5 Threats to validity

6 Conclusion
  6.1 Performance
    6.1.1 Latency
    6.1.2 Application Runtime
    6.1.3 Response Time
    6.1.4 Memory Consumption
  6.2 Future work

Chapter 1

Introduction

Companies are constantly looking to digitize and conceive new use cases they want to explore every day. This is preferably done in an agile and modular way. The key factors making this possible are a reasonable cost, fast realization time and flexibility. Hosting is an area that has followed this trend. As a response to this need for a more agile way of working, companies have moved from bare-metal on-premises hosting to cloud hosting. In a cloud adoption survey by IDG done in 2018, 73 % of companies stated that they had already adopted cloud technology and 17 % said they intended to do so within a year [1]. Another survey predicts that 83 % of enterprise workloads will be in the cloud [2].

By using cloud computing, companies can allocate just the amount of computation power they need to host their solutions. A cloud solution can also easily be scaled up or down when the need changes. Cloud computing also makes it possible for small-scale solutions to be hosted with great flexibility while remaining economically defensible.

The next step in this development toward more agile and modular hosting options could be claimed to be the serverless architecture. A serverless architecture lets customers run code without having to buy, rent or provision servers or virtual machines. In fact, a serverless architecture also relieves a client of everything that is connected to servers and more traditional hosting, such as maintenance, monitoring and everything infrastructure related. All that clients need to concern themselves with is the actual code. These attributes enable a more fine-grained billing method, where clients get charged solely for the resources used. These resources are the time it takes, as well as the memory needed, to execute the serverless function. The vendors providing the serverless solutions, such as Amazon (AWS Lambda [3]) and Google (Google Cloud Functions [4]), also provide automatic scaling, enabling steady high availability. These are presumably among the top reasons why serverless is rapidly increasing in usage. According to Serverless, the usage of serverless functions among their respondents has almost doubled, from 45 % in 2017 to 82 % in 2018 [5]. Notable is also that 53.2 % stated that serverless technology

is critical for their job.

During the growth of the serverless architecture, cloud providers have added support for more languages. AWS, for example, has gone from only supporting Node.js to now also supporting Python, Ruby, Java, Go and C# [6]. But one language that still lacks official support from any cloud provider offering a serverless solution is Kotlin. Kotlin is a statically typed programming language developed by JetBrains and was first released in February 2016. Kotlin is mostly run on the JVM but can also be compiled into JavaScript or native code (utilizing LLVM) [7]. Despite being a newer language it has already gained a lot of traction by being adopted by large companies and is currently used in production by Pinterest [8] and Uber [9], among others. Kotlin is also, as of 7 May 2019, Google's preferred language for Android app development [10] and has been among the top "most loved" languages according to the Stack Overflow developer survey reports of the last years [11, 12]. One of the reasons for its popularity is Kotlin's interoperability with Java, meaning it is possible to continue working on an already existing Java project using Kotlin. Other popular attributes are the readability of Kotlin as well as its null safety, facilitated by the language's ability to distinguish between non-null types and nullable types.

Seeing as Kotlin is such a widely used and favored language, it would be of interest to developers and companies to continue utilizing their knowledge of the language in more parts of their work, such as in a serverless context.

The rest of this chapter explains the problem that brought about the subject of this thesis and specifies the research questions that are to be answered. Moreover, this chapter also includes the intended contribution as well as a section covering ethics and sustainability connected to this thesis. Concluding this chapter is a section describing the outline of this report.

1.1 Problem and Research Question

Kotlin is a popular language that is increasing in usage; however, it is not yet officially supported in a serverless solution provided by any public cloud provider. Since the serverless architecture is also being utilized more, companies might be looking to apply their knowledge of a language that they already know. It could also be that a company already has an application written in Kotlin that they would like to convert into a serverless function.

Since Kotlin is able to run on the JVM, it is possible to package a Kotlin application as a jar file and run it as a serverless function. However, is that the optimal option? An application written in Kotlin could likewise be converted into a native image and run as a serverless function. Since the payment plans of serverless solutions are based on resource usage, where every millisecond is counted and billed for, there is a possible cost saving to be had from optimizing the serverless function's execution.

The aim of this thesis is to find out how Kotlin performs in a serverless

environment and what the best way is to run a serverless function written in Kotlin. From this statement two research questions can be extracted:

• What is the difference, if any, between running a serverless function written in Kotlin with a Just-in-Time compiler and running the same function as a binary?

• How do cold starts affect the performance of a serverless function written in Kotlin? Does it matter if the function is run with a JIT compiler or as a binary?

1.2 Contributions and Scope

Kotlin is not officially supported by any public cloud provider that offers serverless solutions. To the best of our knowledge, there exists scant knowledge of how Kotlin performs in a serverless environment. This thesis aims to contribute more knowledge on this subject. The expanded knowledge could serve as a foundation should a company be looking into utilizing Kotlin for writing serverless functions. The work done for this thesis provides information about the performance of serverless functions written in Kotlin in general, but also an understanding of what the better way is to run a serverless function written in Kotlin.

Only one public cloud provider will be tested. The reason is that only one public cloud provider, Amazon, offers the possibility of custom runtimes. The runtimes that will be compared in this thesis are the JVM and GraalVM. The JVM will represent JIT compilers and GraalVM will represent AOT compilers.

1.3 Ethics and Sustainability

From a sustainability standpoint, the cloud and the serverless architecture are both environmentally defensible. To begin with, users of the cloud do not have to buy their own hardware. This means that they also do not have to estimate how much computation power they need, and therefore the risk of buying more than what they actually need is eliminated. Since computing power is shared in the cloud, the usage of the cloud's resources can be optimized: the same hardware used by one client one day can be used by another client another day. This entails power savings and less impact on the environment. Due to serverless being more lightweight than other, more traditional, hosting options it is also more attainable. Clients can host their applications at a lower price, which means more have the opportunity to host applications.

There is a given ethics perspective to this thesis, as with any investigative report. It is of great importance that the work being performed is unbiased. One way to create conviction of this is to only use open source code and

tools available to the general public to ensure repeatability. Results will also be reported in their raw form to ensure readers have the opportunity to perform their own calculations or verify the ones presented in this thesis.

1.4 Outline

Chapter 2 contains necessary background information, such as explanations regarding the different compilers and an in-depth clarification of what a serverless architecture is and what it entails. This chapter also covers related work. Chapter 3 incorporates a description of the methodology used to perform the work done for this thesis. It includes how and what benchmarks were chosen. It also explains what metrics were used and why they were chosen. The results of the work are presented in Chapter 4 and a discussion regarding the results can be found in Chapter 5. Finally, Chapter 6 contains the conclusions drawn from the results and the discussion; it also contains a section on possible future work.

Chapter 2

Background

This chapter contains background information about the main subjects of this thesis: serverless, Kotlin, compilers and benchmarks. It also incorporates a section that presents and discusses related work. A summary of the chapter's key points concludes the chapter.

2.1 Serverless

Serverless is a concept that was first commercialized by Amazon's service AWS Lambda in 2014 [13]; the company was the first public cloud provider to offer serverless computing in the way it is known today. Since then serverless has gained a great deal of traction. Google [14], IBM [15] and Microsoft [16] now also provide their own serverless services.

Serverless refers to a programming model and an architecture aimed at executing a modest amount of code in a cloud environment where the users do not have any control over the hardware or software that runs the code. Despite the name, there are still servers executing the code; however, the servers have been abstracted away from the developers, to the point where the developers do not need to concern themselves with operational tasks associated with the server, e.g., maintenance, scalability and monitoring.

The provided function is executed in response to a trigger. A trigger is an event that can arise from various sources, e.g., a database change, a sensor reading, an API call or a scheduled job. After the trigger has been received, a container is instantiated and the code provided by the developer is then executed inside that container.

2.1.1 The Attributes of Serverless

A serverless architecture does not imply the same infrastructure-related concerns and dilemmas, such as capacity, scaling and setup, as more traditional architectures do. This enables developers to achieve a much shorter time to

market, a very important factor in the software development industry, where changes happen at a rapid pace and where market windows can open and close fast. The ability to launch code quickly also enables prototypes to be created and tested at a lower cost and therefore at a lower risk. Furthermore, this benefit implies that there can be a larger focus on the product itself, giving developers the opportunity to concentrate on application design and new features instead of spending time on the infrastructure.

Cloud providers that offer a serverless solution charge only for what the function utilizes, in terms of execution time and memory. The owner of the function is therefore only billed when the function is invoked. This entails, given the right use case, that the infrastructural cost can be reduced compared to a more traditional hosting option. Since there is no need to maintain a hosting solution, it can also lead to developers being able to take over the entire deployment chain, rendering the operations role more obsolete and, by extension, enabling an additional cost saving.

A serverless solution can bring many benefits to a project; however, it is not an appropriate solution for all projects. A function is executed only when triggered by an event; nothing is running when the function is not needed. The result of this is that when a function is invoked for the first time, or after a long time of no invocations, the cloud provider needs to go through more steps in order to start executing the function. An invocation of this type is called a cold start. During a cold start on AWS Lambda the additional phases that need to be executed, before the invoked function starts executing, are: downloading the function, starting a new container and bootstrapping the runtime [17]. The outcome of this is that during a cold start the execution time, and by extension the response time, will be noticeably longer. A longer execution time also entails a greater cost.

To prevent cold starts, and spare end users a long response time, one option is to trigger the function periodically to keep it "warm". Amazon provides one such solution, CloudWatch [18], where it is possible to schedule triggers with a certain interval. There are also third-party tools serving as warmers [19, 20]. Some tools also analyze the usage of a serverless function and claim to predict when a trigger is needed, making the function warm upon real invocations [21]. Keeping a function warm may be an option for functions that are triggered fairly often or with some predictable consistency. Otherwise there is a possibility that the warm-up triggers deplete the cost savings that a serverless solution otherwise would provide. In that case a more traditional hosting solution might be a better option, if response time is a decisive factor.

Another approach to reducing the impact of a cold start is to reduce the response time during a cold start. In this case there are two parameters that are configurable. The first parameter is the code. One way to optimize the code is to carefully choose the programming language, as different languages have varying start-up times [22]. Furthermore, Amazon recommends a few design approaches that could help optimize the performance of a function. Amazon suggests avoiding large monolithic functions and instead dividing the code into smaller, more specialized functions. Only loading necessary dependencies rather than entire libraries

is also good practice. Amazon also recommends using language optimization tools such as Browserify and Minify [17]. If an AWS Lambda function is reading from another service, Amazon emphasizes the importance of only fetching what is actually needed. That way both runtime and memory usage can be reduced. The second configurable parameter is the runtime, which will be the focus of this thesis.

Resource limitations are, like cold starts, a restraint on serverless solutions. Cloud providers limit the resources a function can allocate. However, AWS Lambda has continuously increased these limits. In November 2017 Amazon doubled the memory capacity that a Lambda function can allocate, from 1.5 GB to 3 GB [23]. In October 2018 they tripled the time limit from 5 to 15 minutes per execution [24]. There is a possibility this trend will continue in the future and facilitate additional use cases for serverless functions.

In a serverless architecture a third party, the public cloud provider, has taken over a great deal of the responsibility related to hosting compared to a more traditional architecture. This entails that a great deal of trust has to be placed in the provider, especially since a serverless solution implies a vendor lock-in, where a migration can be problematic and require multiple adjustments, due to the code not only being tied to specific hardware but also to a specific data center. Further trust also has to be put in the cloud provider on account of security. In a public cloud, where many users' arbitrary functions are running at the same time, security has to be a high priority in order to prevent interception of remote procedure calls and to ensure container security.

To fully take advantage of the benefits a serverless solution can bring, as well as to avoid consequential implications from its various drawbacks, it can be concluded that not just any use case can be applied in a favorable way.

2.1.2 Use Cases for Serverless Functions

For many applications, from a functionality perspective, a serverless architecture and more traditional architectures could be used interchangeably. Other factors, such as the solution's need for control over the infrastructure, cost, and the application's expected workload, are determining factors when considering a serverless architecture.

From a cost perspective, serverless performs well when invocations occur in bursts. A burst implies many invocations happening close to each other in time, and therefore only the first execution will have to go through a more expensive cold start. The other calls in the burst will thereafter use the same container and will therefore execute faster. When the burst has ended, a serverless architecture will let the infrastructure scale down to zero, during which time there is no charge.

Computation-heavy applications could, under the right circumstances, also be a good fit, since the cost of other infrastructure solutions grows in proportion to the computing power needed. However, something to keep in mind is that if a public cloud provider is used, limitations on computing exist, such as memory and

time limits. This could mean that, from a performance perspective, a computation-heavy application might not be an appropriate use case for a serverless architecture.

From a developer perspective, serverless would be a good option in cases where the drawback of lacking control over the infrastructure is outweighed by the fact that there is no need to maintain the infrastructure or worry about scaling.

Based on the characteristics and limitations of a serverless architecture, such as the basis for the cost and the resource limitations, the general use cases for a serverless solution have a few common characteristics: lightweight, scalable and single-purposed.

IoT and mobile device backend

When it comes to IoT and backend solutions for mobile devices, a serverless approach could be advantageous. It could offload burdens from a device with limited resources, such as computing power and battery time. Internet connection is also a limited resource on IoT and mobile devices. By using a serverless solution as an API aggregator, the required connection time could be reduced due to a reduced number of API calls.

There could also be a benefit from a developer perspective, since mobile applications are developed mostly by front-end skilled people, some of whom may therefore lack the experience and knowledge of developing back-end components. Creating a serverless back end simplifies both its creation and setup, as well as eliminates the need for maintenance. All this enables mobile apps and IoT devices that are fast and consistent in their performance, independent of unpredictable peak usage. iRobot, the developers of the internet-connected Roomba vacuums, is one of the companies that are using a serverless architecture as an IoT backend [25].

Event triggered computing

Event-driven applications are ideal for a serverless architecture. AWS Lambda offers many ways for a user to trigger its functions. One of them is events that happen in Amazon's storage solution S3. One company that is taking advantage of this is Netflix. Before a video can be streamed by end users, Netflix needs to encode and sort the video files. The process begins with a publisher uploading their video files to Netflix's S3 storage. That triggers Lambda functions that handle splitting up the files and process them all in parallel. Thereafter Lambda aggregates, validates and tags the video files before the files are ultimately published [26].

Another company utilizing the same type of solution is Auger Labs, which focuses on custom-branded apps for artists. The intent of Auger Labs' founder and CEO has been to remain NoOps, where no configuration or managing of back-end infrastructure is needed. Among other use cases, Auger Labs uses its serverless architecture of choice, Google's Cloud Functions, in

combination with Google's Firebase Storage. When an image is uploaded to their storage, a function is triggered to create thumbnails in order to enhance mobile app responsiveness. They also use Cloud Functions to send notifications via Slack for monitoring [27].

Scaling solutions

Since scaling is handled automatically, developers do not have to worry about how the infrastructure is going to perform in case an expected, or unexpected, burst of requests occurs. The service provider will make sure to start enough containers to support all the heavy traffic being generated. Hosting an application with a low everyday usage, where heavy spikes occur very rarely, could otherwise lead to a high hosting cost, where the client pays for computing power that is unused most of the time in order to maintain high availability even during the spikes.

One such use case is presented by Amazon, regarding Alameda County in California. Their problem was a huge spike in usage during elections. Their previous solution included on-premises servers that did not measure up. By moving to the cloud and utilizing AWS Lambda, the application could easily scale at a satisfactory rate. Alameda County could avoid buying more expensive hardware that would not be used the rest of the year, while still being able to serve all their users during the peak [28].

2.2 Kotlin

Kotlin is a statically typed programming language developed by JetBrains and was first released in February 2016. Kotlin is most commonly run on the JVM but can also be compiled into JavaScript or native code (utilizing LLVM) [7]. Despite being a newer language it has already gained a lot of traction. Large companies such as Pinterest [8] and Uber [9] are currently using Kotlin in production. Kotlin is also, as of May 2019, Google's preferred language for Android app development [10] and has been among the top "most loved" languages according to Stack Overflow developer survey reports in recent years [11, 12].

The reasons behind Kotlin's success may be many. One contributor could be its interoperability with Java: it is possible to continue working on an already existing Java project using Kotlin. Other praised features are Kotlin's readability as well as its null safety, facilitated by the language's ability to distinguish between non-null types and nullable types.
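As a brief illustration of the null-safety feature mentioned above, the following minimal Kotlin sketch (written for this text, not taken from any cited source) shows how the compiler separates non-null and nullable types:

```kotlin
fun main() {
    val nonNull: String = "Kotlin"   // non-null type: can never hold null
    val nullable: String? = null     // nullable type: marked with '?'

    // Members of a non-null type can always be accessed safely.
    println(nonNull.length)

    // On a nullable type the compiler forces an explicit choice:
    println(nullable?.length)        // safe call: evaluates to null instead of throwing
    println(nullable?.length ?: 0)   // Elvis operator: fall back to a default value

    // println(nullable.length)      // would not compile: possible null dereference
}
```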

2.3 Types of Compilers

A compiler is a program that translates code written in a higher-level language to a lower-level language in order to make the code readable and executable by a computer. A compiler's type is defined by when this translation is made. An

Ahead-of-Time compiler performs the conversion before the code is run, while a Just-in-Time compiler translates the high-level code at runtime.

2.3.1 Ahead-of-Time Compiler (AOT)

An Ahead-of-Time compiler does precisely what the name suggests: it compiles code ahead of time, i.e., before runtime. When an application is compiled with an AOT compiler, no more optimizations are done after the compilation phase. There are both benefits and drawbacks to an AOT compiler. One benefit is that the runtime overhead is smaller, since there are no optimizations during runtime. It is therefore also possible that an AOT-compiled application is less demanding when it comes to computer resources such as RAM. The drawback is that the compiler knows nothing about the workload of the application or how it will be used. There is therefore a risk that the compiler spends time optimizing, for example, methods that are rarely used.

2.3.2 Just-In-Time Compiler (JIT)

A Just-in-Time compiler offers a dynamic compilation process, meaning blocks of code are translated into native code during runtime rather than prior to execution as with an AOT compiler [29]. A JIT compiler optimizes code during runtime using profiling, meaning that the program is analysed to determine which optimizations would be profitable to carry out. A JIT compiler will therefore perform well-informed optimizations and will not waste time compiling parts of an application that will not lead to an increase in performance. Examples of metrics that a JIT profiler is based on are method invocation counts and loop detection [30]. A high method invocation count means that the method is a good candidate for compilation into native code to speed up execution. Loops can be optimized in many ways; one favorable way is to unroll a loop. Unrolling a loop increases the amount of work performed by each iteration: steps that would be performed in subsequent iterations are merged into earlier iterations. The drawback of these specialized optimizations is that execution time during the first runs will be longer. Performance will, however, improve over time as more parts of the code get translated into native code and the compiler gets more execution history to base its optimizations on.
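To make the loop-unrolling transformation concrete, the sketch below shows a plain Kotlin summation loop next to a manually unrolled variant of the kind a JIT compiler might produce internally. It is only an illustration of the idea, not code generated by, or taken from, the Hotspot JVM.

```kotlin
// Plain loop: one addition and one loop-condition check per element.
fun sum(values: IntArray): Long {
    var total = 0L
    for (i in values.indices) total += values[i]
    return total
}

// Manually unrolled by a factor of four: fewer condition checks and more work
// per iteration, roughly what a JIT compiler does automatically for hot loops.
fun sumUnrolled(values: IntArray): Long {
    var total = 0L
    var i = 0
    val limit = values.size - 3
    while (i < limit) {
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
        i += 4
    }
    while (i < values.size) {   // handle the remaining 0-3 elements
        total += values[i]
        i++
    }
    return total
}
```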

2.4 The JVM Compiler

Even though all CPUs are very similar, e.g., they have the same functionalities such as performing calculations and controlling memory access, programs that are designed for one CPU cannot be executed on another. The developers of the Java programming language wanted a solution to this problem. They decided to design an abstraction of a CPU, a virtual computer that could run all programs written for it on any system. The result was the Just-in-Time compiler: the Java

Virtual Machine (JVM). This idea was the basis for Java's slogan: write once, run anywhere.

Another benefit facilitated by the JVM's abstraction of a CPU is the abstract view the JVM has of memory. Since the JVM treats memory as a collection of objects, it has more control over which programs are allowed to access which parts of memory. That way the JVM can prevent harmful programs from accessing sensitive memory. The JVM also includes an algorithm called verification, which contains rules every program has to follow and aims to detect malicious code and prevent it from running [31]. This algorithm is one of the three cornerstones of the JVM, stated in the Java Virtual Machine Specification [32]:

• An algorithm for identifying programs that cannot compromise the integrity of the JVM. This algorithm is called verification.

• A set of instructions and a definition of the meanings of those instructions. These instructions are called bytecodes.

• A binary format called the class file format (CFF), which is used to convey bytecodes and related class infrastructure in a platform-independent manner.

The JVM was developed primarily for the Java programming language, but it can execute any language that can be compiled into bytecode. The JVM, in fact, knows nothing of the Java programming language, only of the binary format CFF that is the result of compiled Java code. Some of the more well-known languages that can be executed by the JVM, aside from Java, are Kotlin, Scala and Groovy [33]. These languages, and all others that can be executed on the JVM, also get the JVM's benefits, such as its debugging features and its garbage collection, which prevents memory leaks.

The most used JVM is the Java Hotspot Performance Engine, which is maintained and distributed by Oracle and is included in their JDK and JRE. The Hotspot JVM continuously analyses the program for code that is executed repeatedly, so-called hot spots, and aims to optimize these blocks, aspiring to facilitate high-performance execution.

The Hotspot JVM has two different flavors, the Client and the Server VM. The two modes run different compilers that are individually tuned to benefit the different use cases and characteristics of a server and a client application. Compilation inlining policy and heap defaults are examples of these differences. Since the characteristics of a server include a long run time, the Server VM aims to optimize running speed. This comes at the cost of a slower start-up time and a larger runtime memory footprint. On the opposite side, the Client VM does not attempt some of the more complex optimizations that the Server VM performs. This enables a faster start-up time and is not as memory demanding [34].

2.5 The GraalVM Compilation Infrastructure

GraalVM is a compilation infrastructure that started out as a research project at Oracle and was released as a production-ready beta in May 2019 [35]. GraalVM contains the Graal compiler, a dynamic just-in-time (JIT) compiler that utilizes novel code analysis and optimizations. The compiler transforms bytecode into machine code. GraalVM is then dependent on a JVM to install the machine code in. The JVM that is used also needs to support the JVM Compiler Interface in order for the Graal compiler to interact with the JVM. One that does this is the Java Hotspot VM that is included in the GraalVM Enterprise Edition. Before the Graal compiler translates the bytecode into machine code, it is converted into an intermediate representation, Graal IR [36], in which optimizations are made.

One goal of GraalVM is to enable performance advantages for JVM-based languages, such as minimizing the memory footprint through its ability to avoid costly object allocations. This is done by a new type of escape analysis that, instead of using an all-or-nothing approach, uses Partial Escape Analysis [37]. A more traditional escape analysis would check for all objects that are accessible outside their allocating method or thread and move these objects to the heap in order to make them accessible in other contexts. Partial Escape Analysis, however, is a flow-sensitive escape analysis that takes into account whether the object only escapes rarely, for example in one single unlikely branch. Partial Escape Analysis can therefore facilitate optimizations in cases where a traditional escape analysis cannot, enabling memory savings. In an evaluation done in a collaboration between Oracle Labs and Johannes Kepler University, a memory allocation reduction of up to 58.5 % and a performance increase of 33 % were observed [37]. Notably, a performance decrease of 2.1 % was also seen on one particular benchmark, indicating, not surprisingly, that Partial Escape Analysis is not the best solution in every case. Overall, however, all other benchmarks showed an increase in performance and a decrease in memory allocation.

Another goal of GraalVM is to reduce the start-up time of JVM-based applications, through a GraalVM feature that creates native images by performing a full ahead-of-time (AOT) compilation. The result is a native binary that contains the whole program and is ready for immediate execution. By this, Graal states that the program will not only have a faster start-up time, but also a lower runtime memory overhead when compared to a Java VM [38].

With the help of the language implementation framework Truffle, GraalVM is not only able to execute JVM-based languages: JavaScript, Python and Ruby can also be run with the GraalVM compilation infrastructure [39]. LLVM-based languages such as C and C++ can also be executed by GraalVM thanks to Sulong [40]. Since the GraalVM ecosystem is language-agnostic, developers can create cross-language implementations where they have the ability to choose languages based on what is suitable for each component.
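As an illustration of the kind of code Partial Escape Analysis targets, consider the hypothetical Kotlin sketch below (not taken from the cited evaluation). The allocated object escapes only on a rare branch, so a partial escape analysis can avoid the heap allocation on the common path and only materialize the object when that branch is actually taken:

```kotlin
data class Point(val x: Int, val y: Int)

val escaped = mutableListOf<Point>()   // objects stored here escape the method

fun process(x: Int, y: Int): Int {
    val p = Point(x, y)                // allocation that usually does not escape
    if (x < 0) {                       // rare branch
        escaped.add(p)                 // p escapes only here
        return -1
    }
    // Common path: p is only used locally, so it can be scalar-replaced,
    // i.e. reduced to its fields without a heap allocation.
    return p.x + p.y
}
```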

2.6 Performing Benchmark Tests

In the field of benchmarks, much research has been done and several open source benchmark suites have been constructed. There are multiple suites targeting the Java Virtual Machine, e.g., SPECjvm2008 [41], DaCapo [42] and Renaissance [43]. DaCapo was developed to expand on the SPECjvm2008 suite by targeting more modern functions [44], and the Renaissance suite focuses on benchmarks using parallel programming and concurrency primitives [45].

Looking at the thought processes behind building these suites, certain common requirements can be identified. Only open source benchmarks and libraries have been selected. One of the benefits of this is that it enables inspection of the code and the workload. Diversity is also a common attribute these benchmark suites strive for, a good feature in principle but one that is harder to put into practice. The Renaissance suite's interpretation of, and approach to achieving, diversity is to include different concurrency-related features of the JVM. Object orientation is also mentioned as an important factor in the Renaissance suite, since that will exercise the JVM parts that are responsible for efficient execution of code patterns commonly associated with object-oriented features, e.g., frequent object allocation and virtual dispatch. The developers of the DaCapo suite strived to achieve diversity by maximizing coverage of application domains and application behavior.

Another type of benchmark suite is The Computer Language Benchmarks Game [46]. The aim of the suite is to provide a number of algorithms written in different languages. Kotlin, however, is not one of them. The suite is used, for example, in an evaluation of various JVM languages made by Li et al. [47]. From this suite the authors categorized the benchmarks depending on whether the program mostly manipulated integers, floating-point numbers, pointers or strings. The Computer Language Benchmarks Game has also been used by Schwermer [48]. In his paper a subset of benchmarks was chosen, one for each type of manipulation focus, i.e., integers, floating-point numbers, pointers and strings. The chosen benchmarks were translated to Kotlin to be compared with the Java implementations provided by The Computer Language Benchmarks Game. The Kotlin-translated suite will serve as a complementary part of the benchmark suite used in this thesis.

When creating a benchmark suite, there would preferably exist a tool like the one described by Dyer et al. [49], which is under construction. A tool is described where it would be possible to search for open source benchmarks given certain requirements and where researchers could contribute their own benchmarks, the vision being faster and more transparent research.

Traditionally, performance tests are run in a dedicated environment where as much as possible is done to minimize external impact on the results. Factors such as hardware configurations are kept static, all background services are turned off and the machine should be single-tenant. None of this can be found in a serverless solution hosted in a public cloud. Configurations are unknown and made by the cloud provider, and the machines hosting the functions are exclusively multi-tenant. This entails an unpredictable environment where there

15 there always will be uncertainties. The benefit, however, to perform tests in the public cloud, is that it is easy to set up and at a low cost, where a more traditional approach would mean a higher cost and an environment that requires a high amount of effort to maintain. A study by Laaber et al. [50] investigates the effect of running micro bench- marks in the cloud. The focus of their study consisted of measuring to what extent slowdowns are detected in a public cloud environment, where the tests were run on server instances hosted by different public cloud providers. One of the problems the authors address is that the instances might be up- graded by the provider between test executions, which can result in inexplicable differences in the results. However, if tests are done during a short period of time, to avoid such changes by the provider, the results will only represent a specific snapshot of the public cloud. It can then be argued that tests run over a longer period, e.g, a year, would result in a better representation. However, this large amount of time is, in many cases, an unobtainable asset. The authors also mention the difference between private and public cloud testing and emphasizes that the two can not be compared. This due to the possibility of noisy neighbours in a public cloud but also due to hardware het- erogeneity [51], where different hardware configurations are used for instances with the same type. Furthermore, the authors acknowledge that even though it is possible to make reasonable model assumptions about the underlying software and hard- ware in a public cloud, based on literature and information published by the providers, when experiments are done in the public cloud the cloud provider should always be considered a black-box that cannot be controlled. The paper concludes that slowdowns of below 10 % can be reliably detected 77–83 % of the time and the authors therefore considers micro benchmark ex- periments possible in instances hosted in a public cloud. They also concluded that there was no big differences between instance types for the same provider. According to Alexandrov et al. [52] there are four key factors to building a good benchmark suite and running a the benchmarks in the cloud is (1) meaningful metrics, (2) workload design, (3) workload implementation and (4) creating trust. When considering meaningful metrics, the example of runtime is given as a natural and undebatable metric. Furthermore, cost is discussed as an interesting factor but mostly relevant in research that is meant as support for business decisions. Although the cloud can be seen as infinitely scalable, it is only an illusion and therefore throughput can be seen as a valuable metric. The workload has to be designed with the metrics in mind. Where the application should be modeled as a real world scenario with a plausible workload. One of the important factors mentioned, when it comes to workload im- plementation, is the workload generation. The recommendation is that this is done by pseudo-random number generators to ensure repeatability. A pseudo- random number generator also has the benefit of being much more accessible than it would be to gather the same amount of real data. Creating trust is considered especially important when it comes to running

benchmark tests in the public cloud, the reason being the public cloud's black-box property. As a client of a public cloud one can never be certain about the underlying software or hardware. To create trust, the authors recommend executing the previously mentioned aspects well, along with choosing a representative benchmark scenario.
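As a small illustration of the repeatable workload generation recommended above, the following Kotlin sketch uses a fixed seed so that every run sees exactly the same generated input; the function and values are assumptions made for this example, not part of the thesis' benchmark suite:

```kotlin
import kotlin.random.Random

// A fixed seed makes the generated workload identical across runs and machines,
// which is what makes the benchmark repeatable.
fun generateWorkload(size: Int, seed: Long = 42L): IntArray {
    val rng = Random(seed)
    return IntArray(size) { rng.nextInt(0, 1_000_000) }
}

fun main() {
    val input = generateWorkload(size = 10_000)
    // Any benchmark kernel can now consume 'input'; re-running the program
    // reproduces exactly the same data set.
    println("checksum = ${input.sum()}")
}
```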

2.7 Related Work

In this section, previous work related to the work done for this thesis is discussed. It starts with a discussion of solutions that have attributes similar to the serverless architecture, followed by a section about how GraalVM is used at Twitter. Concluding the related work is a section covering how benchmark environments affect the results and what is thought of running benchmarks in the cloud.

2.7.1 Solutions Similar to Serverless

The idea of starting a process only once it is called upon is not unique to the serverless architecture. Super-servers, or service dispatchers, are based on the same principle. A super-server is a type of daemon whose job is to start other services when needed. Examples of super-servers are launchd, systemd and inetd.

inetd is an internet service daemon for Unix systems that was first introduced in 4.3BSD, 1986 [53]. The inetd super-server listens on certain predefined ports and when a connection is made on one of them, inetd starts the corresponding service that will handle the request. These ports support the protocols TCP and UDP, and examples of services that inetd can call are FTP and telnet. For services that do not expect high loads, this solution is a favorable option, since such services do not have to run continuously, resulting in a reduced system load. Another benefit is that the services connected to inetd do not have to provide any network code, since inetd links the socket directly to the service's standard input, standard output and standard error. To create an inetd service, developers only need to provide the code and specify where the file containing the code will be located and which port should trigger the service.

In similarity to the serverless architecture, not needing to care about servers is also a principle of agent-based application mobility, where an application is wrapped by a mobile agent that has full control over the application. The mobility of the agent lets the application migrate from one host to another, where the application can resume its execution [54]. Instead of abstracting the server away from the developer, as in the serverless solution, this approach lets the developer implement services and avoid servers altogether. Although this approach can bring many benefits, such as reduced network load and latency due to local execution, agent-based application mobility also has its drawbacks. One of the drawbacks is the high complexity of developing the application. The application needs to be delicately designed in order to be

device-independent and have the ability to be migrated between devices [55]. The solution many apply to this problem is to use an underlying infrastructure or middleware [56, 57].

2.7.2 GraalVM at Twitter

Despite GraalVM only having a beta release, Twitter is already using it in production. Their purpose in adopting GraalVM is to save money from the decrease in computing power needed. Another motivation was that the Hotspot Server VM is old and complex, while GraalVM is easier to understand [58]. By switching to GraalVM, the VM team at Twitter saw a decrease of 11 % in CPU time used by their tweet service, compared to running the Hotspot Server VM.

Twitter also discovered that they could decrease CPU time further by tuning some of GraalVM's parameters. One of these parameters was TrivialInliningSize: graphs with fewer nodes than the number represented by this parameter are always inlined. With their machine-learning-based tuner, Autotuner, which automatically adjusts these parameters, CPU time dropped another 6 % [59]. To take into consideration is that the Hotspot JVM is tuned for the Java language and Twitter mainly uses Scala in its services. The same code base written in Java might not have produced the same dramatic improvements.

2.7.3 Benchmark Environment and the Cloud

When analysing the results of this thesis, it is important to take impacting error sources into consideration. One such error source is the hardware the functions will be running on. Since there will be no indication of what CPU is used for any execution, nothing can be said about its impact on performance. In a runtime comparison made by Hitoshi Oi, three different processors were used [60]. All three were made by Intel and based on the NetBurst microarchitecture, but had different clock speeds and cache hierarchies. Despite being from the same manufacturer and based on the same architecture, varied performance could still be seen in almost all use cases. In AWS, no guarantee is given that any feature of the different processors used will be the same. This fact and the study made by Hitoshi Oi give an indication of the possible impact this factor can have on the results.

This is further emphasised in a conference talk where John Chapin shares his investigation into AWS Lambda performance [61]. Among other topics, he speaks about the difference in performance in relation to how much memory the user specifies as the maximum. Since AWS Lambda allocates CPU in proportion to the maximum memory usage specified, it would be logical that a lower amount of allocated memory would always lead to inferior performance. However, in his experiments Chapin found that this is not always the case. In some instances he got almost the same performance independent of the available memory allocation. He draws the conclusion that this is connected to the randomness of the container distribution. Some containers may be placed on less busy servers

and can therefore deliver better performance. This emphasises the importance of rigorous performance testing, where the testing is well distributed over time to get the best possible representation of the overall performance of the given function in the public cloud. A comparison of public cloud providers by Hyungro Lee et al. can give an indication of how AWS will perform when testing its throughput [62].

Martin Maas et al. suggest that runtimes used in the serverless context should be rethought [63]. This is based on the fact that most runtimes today are not optimized for modern cloud-related use cases. They envision a generic managed runtime framework that supports different languages, front ends and back ends, for various CPU instruction sets, FPGAs, GPUs and other accelerators. Graal/Truffle is mentioned as a good example of a framework that can provide high performance and maintainability through its ability to execute several different languages.

2.8 Summary

Serverless refers to a programming model and an architecture aimed at executing a modest amount of code in a cloud environment where the users do not have any control over the hardware or software that runs the code. The servers are abstracted away from the developer and the only thing the developer needs to be concerned about is the code. Every task related to maintaining servers is taken care of by the cloud provider. Therefore, solutions that require scaling, for example, are a good fit for the serverless architecture.

The provided code only runs when the serverless function is invoked, meaning that nothing connected to the function is running when it is not invoked. This also entails that the first time the function is invoked, and every time it has not been invoked for a while, start-up actions, such as starting a container, need to be performed. An execution containing these start-up actions is said to have gone through a cold start; otherwise it is a so-called warm start.

There are two types of compilers compared in this thesis: a Just-in-Time (JIT) compiler and an Ahead-Of-Time (AOT) compiler. An AOT compiler compiles code before it is run and creates an executable file. The AOT compiler used in this thesis is GraalVM, which started out as a research project at Oracle and was released as a production-ready beta in May 2019. A JIT compiler compiles the code during runtime. The JIT compiler used in this thesis is the Hotspot JVM, which is maintained and distributed by Oracle.

When running benchmarks, dedicated and isolated environments are usually used to minimize external impact on the results. The public cloud is eminently unlike such an environment. One reason is that the hardware and its configurations are hidden from the user. The fact that the public cloud is shared also opens the possibility of a neighbour having an effect on the performance of one's function. These factors have to be taken into account when analysing the results.

Chapter 3

Method

A benchmark suite was created for this thesis. For every benchmark, two corresponding serverless functions were implemented in Amazon Web Services' serverless solution Lambda: one that runs with the Hotspot JVM provided by Amazon and one that runs as a native image created with the tool GraalVM CE. These functions were then invoked through AWS's command line interface. The commands were run locally to simulate a more realistic scenario where network latency can impact the result. All programs return a JSON containing information about the execution.

We grouped the tests into two categories: one that contains the executions that went through a cold start and one that contains the executions that reuse already started containers, warm starts. The arithmetic mean of the different metrics was calculated along with a two-sided confidence interval to be able to analyse the results fairly.

This chapter describes in more detail how the work for this thesis was carried out and the motivations behind the choices made. The last section of this chapter summarizes the chapter's key points.
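As a sketch of the kind of calculation referred to above, the Kotlin snippet below computes the arithmetic mean and a two-sided confidence interval for a list of measurements. The use of the normal-approximation value 1.96 for a 95 % interval is an assumption made for illustration; the thesis' own calculations may construct the interval differently.

```kotlin
import kotlin.math.sqrt

data class Interval(val mean: Double, val lower: Double, val upper: Double)

// Mean and two-sided 95 % confidence interval using the normal approximation.
fun confidenceInterval(samples: List<Double>, z: Double = 1.96): Interval {
    require(samples.size > 1) { "need at least two samples" }
    val mean = samples.average()
    val variance = samples.sumOf { (it - mean) * (it - mean) } / (samples.size - 1)
    val stdError = sqrt(variance / samples.size)
    return Interval(mean, mean - z * stdError, mean + z * stdError)
}

fun main() {
    // Hypothetical response times in milliseconds.
    val responseTimes = listOf(412.0, 398.0, 455.0, 430.0, 402.0)
    val ci = confidenceInterval(responseTimes)
    println("mean = ${ci.mean} ms, 95 % CI = [${ci.lower}, ${ci.upper}] ms")
}
```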

3.1 Metrics

The metrics focused on in this thesis are mainly dynamic metrics [64], meaning metrics that are to a higher degree based on the execution of code rather than the code itself [65]. This is due to the fact that the interest of this thesis lies in the performance of code given different runtimes. What applications are used and what techniques were used to develop them, factors that are connected to static metrics, are secondary. Some static metrics will, however, be collected. The static metrics used in this thesis are chosen with the purpose of giving the reader an indication of the overall size of the different benchmarks. Four different static metrics will be documented. Two of them are the sizes of the JVM and the GraalVM functions, collected from the Amazon Console. The other

two are the number of lines of code and the number of Kotlin files.

The dynamic metrics chosen for this thesis are based on what would be of interest to a developer who is considering using Kotlin in a serverless context. We hypothesised that the factors a developer would be most interested in are a comparison of performance as well as cost.

When the performance of software is measured, one of the most interesting elements to attain is knowledge about how many resources are being used. Since cost, in this case, is exclusively based on resources used, there is no need to add specific cost-related metrics. The second element of interest is what is causing these resource allocations. An example of a factor affecting the performance of a program is garbage collection. In this thesis the public cloud is used, which can be seen as a black box, since users cannot be certain what environment their code is executed in. This entails that there is a large number of factors that can affect the performance of the functions, such as the hardware configuration. A choice has therefore been made to only focus on measuring the resources that are being used and not measure factors that are believed to cause these performance changes.

The resources that will be measured are latency, application runtime, response time and memory consumption. Figure 3.1 illustrates the dynamic metrics that are measured in time.

Figure 3.1: An illustration of the metrics measured in time

3.1.1 Latency

Latency is measured by subtracting the start time, recorded by the locally executed script invoking the Lambda function, from the start time recorded by the function, which is returned in the response JSON. Latency can be important in cases where data becomes stale fast and therefore needs to be processed quickly. One example of this is a navigation system that gets location data from a car and needs to update its direction accordingly.

3.1.2 Response time

Response time is measured by subtracting the start time recorded by the invocation script from the end time recorded when the response is returned from Amazon. Response time is a meaningful metric in multiple use cases. One example is user interfaces. One study from 1968 [66] and a complementary study from 1991 [67] summarize three types of limits for human and computer interactions. For a user to experience that a system reacts instantaneously, the requested result should be delivered within 0.1 seconds. To ensure a user's continuous, uninterrupted thought process, the response time should not exceed 1.0 second. If the response time surpasses a limit of 10 seconds, users will want to switch to another task during the execution. Even though these studies were written several decades ago, there is no indication that users have raised their tolerance. With faster internet speeds and more powerful computers, the opposite is presumably closer to the truth.

3.1.3 Memory consumption

Memory consumption is another essential factor. As always in software development, developers and operators are looking to optimize execution. One simple reason is that the more memory an application uses, the more expensive it is to run. If a developer is running an application on an on-premises system the effect might not be as palpable, until the need to buy more RAM arises. In a serverless context, however, optimizing memory usage can easily lead to a visible cost reduction. The memory consumption of a function execution is recorded by AWS Lambda and will be retrieved from its logs.

3.1.4 Execution time

The response time might be the most interesting time metric in this work. However, it is also of interest to see how much of the total time consists of actual application execution time and how that time changes under different circumstances. Execution time is also unaffected by external factors, such as the internet connection, and is only a result of the characteristics of AWS Lambda. This makes it a good measurement of the performance of AWS Lambda.
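To make the relationship between the three time metrics concrete, the following is a minimal sketch of how they can be derived from the recorded timestamps. The variable names are illustrative assumptions and not taken from the thesis code; all timestamps are assumed to be milliseconds since the UNIX epoch.

```kotlin
// A small illustrative helper, not the thesis' actual measurement code.
data class TimeMetrics(val latency: Long, val executionTime: Long, val responseTime: Long)

fun deriveTimeMetrics(
    invocationStart: Long,   // recorded locally by the invoking script
    functionStart: Long,     // recorded inside the Lambda function (startTime in the response JSON)
    functionEnd: Long,       // recorded inside the Lambda function when the work is done
    responseReceived: Long   // recorded locally when the response arrives
): TimeMetrics = TimeMetrics(
    latency = functionStart - invocationStart,
    executionTime = functionEnd - functionStart,
    responseTime = responseReceived - invocationStart
)
```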

3.2 Benchmarks

Every benchmark has two different Lambda functions: one that is run with the JIT compiler (the Hotspot JVM) and one that is run as a native image created with GraalVM.

All benchmarks are open source and have a separate repository on [68]. Each benchmark has a main class that contains a main-function and a function called handler. The handler-function is used as the entry point for the serverless functions using the JVM and the main-function is used for the serverless functions compiled with GraalVM.
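As a rough illustration of this dual entry point structure, a sketch is shown below. The class, package and response names are hypothetical, and the handler signature is assumed to follow the RequestHandler interface from aws-lambda-java-core rather than the exact form used in the benchmarks.

```kotlin
import com.amazonaws.services.lambda.runtime.Context
import com.amazonaws.services.lambda.runtime.RequestHandler

class Benchmark : RequestHandler<Map<String, String>, String> {
    // Entry point used by the JVM-based Lambda function (java8 runtime).
    override fun handleRequest(input: Map<String, String>, context: Context): String = runBenchmark()
}

// Entry point used by the native-image version created with GraalVM (custom runtime).
fun main() {
    println(runBenchmark())
}

// Placeholder for the actual benchmark workload and the JSON response creation.
private fun runBenchmark(): String = """{"coldStart": false}"""
```

With this layout the same workload function is shared by both variants, and only the invocation path differs between the JVM and the GraalVM function.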

3.2.1 Real benchmarks

A real test is preferable when performing benchmarks. These tests are real in the sense that they are real repositories acquired from GitHub. They were not originally intended as serverless applications and a discussion could be had about whether any of them would fit in a serverless context. Nevertheless, they still represent real workloads of real applications and will therefore presumably be a better indicator of the performance of Graal and the Hotspot JVM, respectively, in a serverless environment than artificial applications. Each of these real benchmarks contains tests written using JUnit. To simulate workload, some or all of the tests in the repository are invoked when running the benchmark. After invoking an AWS Lambda function, a response is sent back in the form of a JSON containing these fields:

• coldStart : Boolean - Indicates if the run has gone through a cold start or not.

• startTime : Long - The time when the function's code starts to execute, represented in milliseconds since the UNIX epoch (January 1, 1970 00:00:00 UTC).

• runTime : Long - Runtime of the application in milliseconds.

• wasSuccess : Boolean - Indicates if the tests were a success; used for debugging purposes.

• failures : List - Contains the reasons why tests failed, if there were any; otherwise the list is empty. Used only for debugging purposes.

To determine if a run was a cold start or not, a search is made for a specific file in the /tmp folder (where Amazon lets users write files with a combined size of up to 512 MB). If the file is not there, it is created. Since the file is removed when the container is, the application will know whether the container is new (the file does not exist), meaning a cold start, or whether it has been used before (the file exists), meaning a warm start.
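A minimal Kotlin sketch of the response fields and the cold-start check is shown below. The marker file name is an assumption of this sketch; the actual file name used by the benchmarks is not stated here.

```kotlin
import java.io.File

// The response fields returned by the real benchmarks, as listed above.
data class BenchmarkResponse(
    val coldStart: Boolean,
    val startTime: Long,
    val runTime: Long,
    val wasSuccess: Boolean,
    val failures: List<String>
)

// Detects a cold start by probing /tmp, which survives as long as the container does.
// The marker file name below is hypothetical.
fun detectColdStart(marker: File = File("/tmp/warm-container-marker")): Boolean {
    if (marker.exists()) return false   // container reused: warm start
    marker.createNewFile()              // first run in this container: cold start
    return true
}
```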

Kakomu

Kakomu is a repository that contains a Go simulator [69]. The repository enables a user to play a game of Go against a bot, but it can also simulate a game between two bots. The 18 tests used for this thesis focus on the game model, ensuring that a game is evaluated correctly.

State machine

The state machine benchmark is taken from a repository containing a Kotlin DSL for finite state machines developed by Tinder [70]. This benchmark contains 13 tests.

Math functionalities

This repository provides discrete math functionalities as extension functions [71]. Some examples of its capabilities are permutations and combinations of sets, the factorial function and iterable multiplication. The benchmark implementation based on this repository runs 55 individual tests that ensure all computations are done correctly; most are mathematical equations and set operations.

3.2.2 Complementary benchmarks

Finding suitable real benchmarks proved to be challenging, therefore the benchmark suite is supplemented with additional artificial benchmarks. One of them is a simple "Hello world" example; its only purpose is to return the basic information the other benchmarks do: the start time of the function and whether it went through a cold start or not. The other complementary benchmarks are algorithms from the benchmark suite The Computer Language Benchmarks Game [46], implemented in Kotlin for the purpose of the thesis "Performance Evaluation of Kotlin and Java on Android Runtime" [48]. These benchmarks were all categorized by Li et al. [47] as mainly manipulating different data types. These categorizations can be seen in Table 3.1.

Benchmark            Data type
Fasta                Pointer
N-body               Floating-point
Fannkuch-Redux       Integer
Reverse-Complement   String

Table 3.1: Mainly manipulated data types

The benchmarks that originate from The Computer Language Benchmarks Game also return a JSON, but since no JUnit tests are run, the fields wasSuccess and failures are omitted; otherwise the fields are the same as for the real tests, i.e., coldStart, startTime and runTime.

Fasta

The Fasta benchmark is categorised as a pointer-intensive algorithm that also produces a large amount of output. Running the algorithm results in three generated DNA sequences, whose length is decided by an input parameter represented as an integer. The length used in this thesis is 5 × 10^6. The generated output is written to a file and consists of three parts: the first part of the DNA sequence is a repetition of a predefined sequence and the last two are generated in a pseudo-random way using a seeded random generator. After the file has been generated it is removed so that it does not affect the following tests, since some are run in sequence.

Reverse-Complement

The Reverse-Complement benchmark takes its input from a file containing the output from a run of the Fasta application, which in turn had an input of 10^6. The aim is for the Reverse-Complement program to calculate the complementary DNA strands for the three DNA sequences that the input file contains. This is calculated with the help of a predefined translation table. Because the input file that is processed consists of strings, this benchmark is categorized as mainly handling strings. Another attribute of this benchmark to keep in mind is that it is also both input and output heavy.

N-body

The N-body benchmark simulates the movements of planets and mostly manipulates floating-point values. It requires an integer as input that represents the number of simulation steps to be taken. The input used for this benchmark in this thesis is 10^6.

Fannkuch-Redux

The Fannkuch-Redux benchmark permutes a set of numbers S = {1, ..., n}, where n is the input value, in this case 10. In a permutation P of the set S, the first k elements of P are reversed, where k is the first element in P. This is repeated until the first element of the permuted list is a 1, and it is done for all n-length permutations P of S. Since all the elements in the list are integers, this benchmark is classified as an application that mostly handles integers.
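To make the permutation-and-flip procedure concrete, the following is a simplified Kotlin reimplementation of the idea, using Heap's algorithm to enumerate permutations. It is an illustration only and not the benchmark code used in the thesis.

```kotlin
// Counts the pancake flips for one permutation: reverse the first k elements,
// where k is the current first element, until a 1 reaches the front.
fun countFlips(perm: IntArray): Int {
    val p = perm.copyOf()
    var flips = 0
    while (p[0] != 1) {
        var i = 0
        var j = p[0] - 1
        while (i < j) {                 // reverse the first p[0] elements in place
            val t = p[i]; p[i] = p[j]; p[j] = t
            i++; j--
        }
        flips++
    }
    return flips
}

// Maximum flip count over all permutations of {1, ..., n}, enumerated with Heap's algorithm.
fun fannkuch(n: Int): Int {
    var maxFlips = 0
    val p = IntArray(n) { it + 1 }
    fun permute(k: Int) {
        if (k == 1) { maxFlips = maxOf(maxFlips, countFlips(p)); return }
        for (i in 0 until k - 1) {
            permute(k - 1)
            val j = if (k % 2 == 0) i else 0
            val tmp = p[j]; p[j] = p[k - 1]; p[k - 1] = tmp
        }
        permute(k - 1)
    }
    permute(n)
    return maxFlips
}
```

With n = 10, as used in the thesis, the sketch iterates over all 10! = 3,628,800 permutations.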

3.3 Environment and Setup

For this thesis Amazon Web Services is chosen as the public cloud provider, on account of Amazon being the only provider that offers customers the possibility to provide a custom runtime. AWS's serverless solution is called Lambda. A user of Lambda can create and manipulate Lambda functions by using a CLI provided by Amazon, which we used for both creation and invocation in this thesis. A Lambda function that should run on the JVM requires a so-called uber JAR, a JAR file that not only contains the program, but also includes its dependencies.

That way the JAR file only requires a JVM to run. The JAR files used in this thesis are created with the help of a Gradle plug-in called Shadow [72] and the open JDK version 1.8.0 222. The Lambda functions that execute these JAR files use the runtime Amazon calls java8, which is based on the JDK java-1.8.0- . When using the java8 runtime, Amazon utilizes their own , Amazon , on the containers executing that function. Using GraalVM CE, a native image is created from the JAR generated by Gradle. The latest release of GraalVM CE is 19.3, but it contains a known bug where it is unable to create native images [73]. Therefore, the previous version, 19.2.1, is used. In this thesis the community edition is used due to its availability. To create a Lambda function with a custom runtime, a bootstrap file is needed in addition to the executable file. This bootstrap file needs to invoke the executable as well as report its result. The bootstrap file and the executable are then compressed into a zip archive and pushed to Lambda to create a function. All the Lambda functions that were created run with a maximum memory size of 256 MB and a timeout of 100 seconds, meaning that a program cannot use more than 256 MB of memory, otherwise the invocation fails, and it will be interrupted if it runs for more than 100 seconds.
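As an illustration of the packaging step, the following build.gradle.kts sketch shows how an uber JAR could be produced with the Shadow plugin. The plugin and dependency versions, as well as the main class name, are assumptions for this sketch and not the thesis' actual build configuration.

```kotlin
// build.gradle.kts - a minimal sketch, assuming the Shadow plugin's Kotlin DSL accessors.
plugins {
    kotlin("jvm") version "1.3.61"                          // Kotlin version is an assumption
    id("com.github.johnrengelman.shadow") version "5.2.0"   // Shadow builds the uber JAR
}

repositories { mavenCentral() }

dependencies {
    implementation(kotlin("stdlib"))
    implementation("com.amazonaws:aws-lambda-java-core:1.2.0")  // handler interface for the JVM function
}

tasks.shadowJar {
    // Hypothetical main class; needed so the JAR (and the native image) knows its entry point.
    manifest { attributes(mapOf("Main-Class" to "benchmark.MainKt")) }
}

// The JAR produced by `./gradlew shadowJar` is either uploaded as-is for the java8 runtime,
// or turned into a native image for the custom runtime, e.g. with:
//   native-image -jar build/libs/benchmark-all.jar
```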

3.4 Sampling Strategy and Calculations

Since the benchmarks are executed in a public cloud where the results can be affected by factors such as noisy neighbours, it is reasonable to be mindful of the selection of execution times in order to achieve a representative result. Two different aspects of time were taken into consideration: day versus night and weekday versus weekend. Although the region chosen for hosting the AWS Lambda functions was us-east-1, there is no guarantee that all users of that region have a timezone used in the eastern parts of the United States. These tests, for example, were made from the Central European Time zone (GMT+1). It was therefore concluded that, since no distinction can be made between day and night for the users of the same region, an interval was chosen instead. The tests done in sequence were performed 8 hours apart: 12 PM, 8 PM and 4 AM (CET). The benchmarks ran 6 times over the span of three weekdays, from Tuesday 10/12 12:00 PM to Thursday 12/12 4:00 AM. In order to cover the weekend as well, three tests were run, with an 8 hour interval, from Saturday 14/12 12:00 PM to Sunday 15/12 4:00 AM. Since these tests were not meant to go through a cold start, they could be done in sequence. When deciding how many invocations each sequence should contain, previous work was consulted. When a JVM is used for running benchmarks, a warm-up sequence is commonly defined and used in order to ensure that the JVM has achieved the so-called steady state when samples are acquired [74] [75] [76]. The ideal would be to get both warm-up instances and instances where the JVM has reached a steady state, in order to get a fair representation. The number of invocations required for each benchmark to reach steady state could be examined, but this is out of scope for this thesis. Therefore a report by Lengauer et al. was used in order to determine a reasonable sample size. In that report, three different benchmark suites were used and the number of warm-up instances was base-lined at 20, due to the built-in mechanism in the DaCapo suite that requires a maximum of 20 warm-ups to automatically detect a steady state [75]. The suite used for this thesis is undoubtedly different in many ways, but this still gives an indication of how many invocations are required before a steady state is reached. We hypothesize that a steady state is reached after 20 invocations, but we also want some samples capturing the steady state. The number was therefore doubled and it was reasoned that 40 invocations would presumably suffice. The first invocation, however, will inevitably include a cold start and is excluded, leaving 39 usable executions per sequence. To get measurements of executions including cold starts, invocations have to be made with a large enough gap. After some trials, 20 minutes was found to be an adequate gap. Benchmarks were executed with a 20 minute interval between Tuesday 10/12 16:40 and Wednesday 11/12 09:00 as well as between Sunday 15/12 10:40 and Monday 16/12 10:20. Once the results have been gathered, the raw data has to be compressed in some way in order to make it presentable and comprehensible. For this the arithmetic mean is chosen as a first step. To be able to argue for the accuracy of the result, the confidence interval is also calculated. The confidence level used in this work is 95 %, meaning that the level of confidence one can have that the actual value is within the given interval is 95 %. This confidence level was chosen on account of it being one of the most commonly used [77], which contributes to high credibility.
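As a small sketch of this aggregation step, the function below computes the arithmetic mean and a two-sided 95 % confidence interval half-width using a normal approximation (z = 1.96). Whether the thesis used the normal or the t-distribution is not stated, so the approximation is an assumption of this sketch, and the sample values in the example are purely illustrative.

```kotlin
import kotlin.math.sqrt

// Arithmetic mean and the half-width of a two-sided 95 % confidence interval,
// using the normal approximation (z = 1.96); an assumption for this sketch.
fun meanAndConfidenceInterval(samples: List<Double>): Pair<Double, Double> {
    require(samples.size > 1) { "need at least two samples" }
    val mean = samples.average()
    val variance = samples.fold(0.0) { acc, x -> acc + (x - mean) * (x - mean) } / (samples.size - 1)
    val halfWidth = 1.96 * sqrt(variance / samples.size)
    return mean to halfWidth
}

fun main() {
    // Illustrative values only, not measurements from the thesis.
    val (mean, ci) = meanAndConfidenceInterval(listOf(629.0, 633.0, 640.0, 627.0, 631.0))
    println("$mean ± $ci ms")
}
```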

3.5 Summary

For this thesis we created a benchmark suite. The goal of the benchmarks is to simulate a real workload. The suite consists of three benchmarks that are real applications, one simple hello-world benchmark and four smaller complementary benchmarks. Each of the complementary benchmarks focuses on manipulating a different data type. Each benchmark has two AWS Lambda functions, one that runs on the JVM and one that is run as a native image created with GraalVM. The times when these benchmarks are run are selected with the intent to create a fair representation of the cloud's performance. The metrics that are collected from each run are latency, execution time, response time and memory consumption.

Chapter 4

Result

Below follows the result of the performed benchmarks. Note that not all benchmark executions were used. The first execution of each running sequence of 40 invocations was removed, since these sequences are supposed to represent warm starts and the first invocation will always be a cold start. A few executions from the intended cold start category were also removed, since some overlapped the sequential batch invocations, resulting in warm starts. The raw data that was aggregated in this chapter can be found at the general-purpose open-access repository Zenodo [78]. The chapter begins with a section describing the static metrics. The purpose of these metrics is, as previously mentioned, to give the reader an overall view of the benchmarks. For each of the dynamic metrics there is a separate section. Every section contains a table where the average value of each set can be viewed with an accompanying two-sided confidence interval as well as the maximum and the minimum value. Each benchmark has four categories, where all combinations of warm/cold start and JVM/GraalVM function are represented.

4.1 Static metrics

In Table 4.1 below the gathered static metrics can be viewed. Lines of code (LOC) and the number of Kotlin files were gathered from the IDE IntelliJ. The sizes of the functions presented here were gathered from the Amazon console and are the sizes of the zip files that were uploaded to Amazon. We can see that the real benchmarks are, unsurprisingly, larger than most of the complementary benchmarks. The exception is reverse-comp; the explanation is the large input file that comes with that benchmark. The uncompressed version of the input file is 10 MB. Since the hello-world benchmark only contains code for creating the response, we can see that the amount of code required for this is 52 lines divided over two files. It can therefore be concluded that 52 lines and 2 Kotlin files of each benchmark are dedicated to creating the response, and that the rest is the actual application.

Benchmark       LOC    Kotlin files   JVM function size   GraalVM function size
discrete-math   911    26             5.9 MB              6.5 MB
go-simulator    2672   40             9.5 MB              6.7 MB
state-machine   1405   6              13.5 MB             6.5 MB
reverse-comp    140    3              8.6 MB              3.7 MB
nbody           229    3              5.7 MB              0.8 MB
fannkuch        133    3              5.6 MB              0.8 MB
fasta           205    3              5.7 MB              0.8 MB
hello-world     52     2              5.7 MB              0.8 MB

Table 4.1: Static metrics for each benchmark in the suite

4.2 Latency

From Table 4.2 we can see that most of the confidence intervals are narrow, below 33 milliseconds, giving an indication that the results are trustworthy. There is, however, an anomaly: the latency results from both the cold and warm start categories of the GraalVM function of the reverse-comp benchmark. The two-sided confidence interval is ± 375.3 ms for the cold start and ± 114 ms for the warm start. The uncertainty of the result can also be seen in the min and max results, where the span between them is large compared to the other benchmarks: 771-37,780 ms and 605-21,152 ms.

Benchmark      Category  Compiler  Latency (ms)   Max     Min
hello-world    Cold      JVM       1035 ± 12.3    1428    726
               Cold      GraalVM   967 ± 32.8     4036    807
               Warm      JVM       629 ± 5.9      1631    593
               Warm      GraalVM   640 ± 2.8      952     603
discrete-math  Cold      JVM       1010 ± 10.5    1437    676
               Cold      GraalVM   1325 ± 24.1    2450    658
               Warm      JVM       635 ± 2.7      848     592
               Warm      GraalVM   643 ± 2.3      827     605
go-simulator   Cold      JVM       1021 ± 9.0     1280    927
               Cold      GraalVM   1186 ± 14.3    1951    997
               Warm      JVM       633 ± 7.3      1851    594
               Warm      GraalVM   641 ± 2.1      721     602
state-machine  Cold      JVM       1037 ± 13.6    1703    921
               Cold      GraalVM   1296 ± 15.7    1981    1062
               Warm      JVM       626 ± 3.1      1065    594
               Warm      GraalVM   642 ± 3.0      1067    600
reverse-comp   Cold      JVM       1091 ± 18.1    1922    732
               Cold      GraalVM   1220 ± 375.3   37780   771
               Warm      JVM       632 ± 2.3      728     595
               Warm      GraalVM   707 ± 114.0    21152   605
nbody          Cold      JVM       1038 ± 11.4    1606    920
               Cold      GraalVM   945 ± 8.5      1203    786
               Warm      JVM       629 ± 1.9      693     595
               Warm      GraalVM   641 ± 2.0      708     605
fannkuch       Cold      JVM       1033 ± 14.2    1951    717
               Cold      GraalVM   955 ± 12.4     1931    791
               Warm      JVM       634 ± 2.2      707     596
               Warm      GraalVM   652 ± 3.2      982     606
fasta          Cold      JVM       1017 ± 9.1     1415    832
               Cold      GraalVM   957 ± 12.9     1948    802
               Warm      JVM       631 ± 2.1      720     594
               Warm      GraalVM   649 ± 2.2      727     603

Table 4.2: Latency result from benchmark execution in milliseconds

Looking closer at the raw data, illustrated in Figures 4.1 and 4.2, it can be seen that the second largest values for the reverse-comp benchmark, when running the GraalVM implementation, are numerically far from the largest values: 1263 ms for the cold start and 1665 ms for the warm start. If the largest values were removed from the resulting sets of the GraalVM function, the average during cold start would be 1029 ms and during warm start 649 ms. That is a numerical decrease of 191 ms and 58 ms respectively, and a percentage reduction of 15.7 % and 8.2 %.

Figure 4.1: Latency of the GraalVM benchmark reverse-comp during warm starts

Figure 4.2: Latency of the GraalVM benchmark reverse-comp during cold-start

The abnormally large values can have many causes, and the real one is impossible to determine. One possible reason is noisy neighbours. However, it is interesting that the GraalVM function got one such anomaly in each of the warm and cold categories, while the other tests were spared similar deviations. The arithmetic mean of a set containing such an anomaly becomes less significant, and no great weight can be put on this result.

4.3 Application Runtime

For this metric the hello-world benchmark is excluded, because it would result in zero milliseconds every time: the benchmark does not include any more code than the creation of the response, so the end timestamp would be retrieved right after the start timestamp. For most benchmarks in Table 4.3 the confidence interval is relatively narrow. For all real benchmarks, for example, the largest interval is ± 22.2 ms and belongs to the JVM function of the discrete-math benchmark during warm starts. The overall largest numerical confidence interval is for the JVM function of the fannkuch benchmark during cold start, ± 226.5 ms. However, since the average value, 25,597 ms, is so large, this does not have a great impact. If the correct value were at the very edge of the interval, it would mean a percentage difference of barely 0.9 %.

Benchmark      Category  Compiler  Application runtime (ms)  Max     Min
discrete-math  Cold      JVM       1010 ± 21.2               1437    676
               Cold      GraalVM   223 ± 3.2                 339     98
               Warm      JVM       315 ± 22.2                848     592
               Warm      GraalVM   95 ± 1.1                  137     77
go-simulator   Cold      JVM       1340 ± 10.4               1798    1121
               Cold      GraalVM   134 ± 1.2                 158     120
               Warm      JVM       46 ± 5.1                  289     2
               Warm      GraalVM   16 ± 0.6                  38      1
state-machine  Cold      JVM       1305 ± 9.5                1493    1119
               Cold      GraalVM   21 ± 0.5                  38      3
               Warm      JVM       13 ± 2.1                  177     1
               Warm      GraalVM   2 ± 0.5                   17      0
reverse-comp   Cold      JVM       11249 ± 74.9              12660   9999
               Cold      GraalVM   11618 ± 29.3              12362   10783
               Warm      JVM       3779 ± 66.2               6708    3028
               Warm      GraalVM   10591 ± 14.2              11203   10304
nbody          Cold      JVM       4055 ± 14.9               4357    3701
               Cold      GraalVM   835 ± 3.9                 939     743
               Warm      JVM       748 ± 2.6                 829     691
               Warm      GraalVM   840 ± 2.7                 941     737
fannkuch       Cold      JVM       25597 ± 226.5             29701   20878
               Cold      GraalVM   46774 ± 67.5              50663   45339
               Warm      JVM       19846 ± 62.3              23667   18117
               Warm      GraalVM   46714 ± 33.9              49620   45359
fasta          Cold      JVM       32551 ± 114.9             34580   29802
               Cold      GraalVM   31455 ± 53.4              32883   30099
               Warm      JVM       23309 ± 39.1              24780   22134
               Warm      GraalVM   30990 ± 50.5              32806   29681

Table 4.3: Runtime result from benchmark execution in milliseconds

4.4 Response Time

The confidence intervals that can be seen in Table 4.4 are relatively limited and do not indicate any untrustworthy results. It can, however, be noted that the GraalVM function of the reverse-comp benchmark during cold start has the largest interval, ± 373.1 ms. Since we saw the same pattern in the latency result, this comes as no surprise. However, since latency is only one part of the total response time and there are other time intervals that make up the whole, the confidence interval does not have as great an impact as it has for the latency result: the latency is 1220 ± 375.3 ms, compared to the total response time average and confidence interval of 13060 ± 373.1 ms.

Benchmark      Category  Compiler  Response time (ms)  Max     Min
hello-world    Cold      JVM       12042 ± 41.6        12993   10548
               Cold      GraalVM   1176 ± 32.8         4223    1012
               Warm      JVM       725 ± 6.1           1716    678
               Warm      GraalVM   836 ± 3.5           1141    789
discrete-math  Cold      JVM       14709 ± 68.0        16015   13038
               Cold      GraalVM   2047 ± 28.2         3136    991
               Warm      JVM       1181 ± 32.4         2807    847
               Warm      GraalVM   957 ± 3.2           1161    903
go-simulator   Cold      JVM       13480 ± 61.4        14393   11834
               Cold      GraalVM   1796 ± 14.9         2559    1620
               Warm      JVM       806 ± 10.2          1998    691
               Warm      GraalVM   863 ± 3.0           967     807
state-machine  Cold      JVM       13596 ± 51.9        14870   11787
               Cold      GraalVM   1808 ± 16.5         2502    1592
               Warm      JVM       757 ± 4.9           1240    688
               Warm      GraalVM   855 ± 70.1          1287    792
reverse-comp   Cold      JVM       19665 ± 91.5        22086   17884
               Cold      GraalVM   13060 ± 373.1       49280   11984
               Warm      JVM       4863 ± 70.1         8291    4141
               Warm      GraalVM   11516 ± 114.1       31802   11168
nbody          Cold      JVM       12713 ± 46.9        13523   11353
               Cold      GraalVM   1993 ± 10.5         2352    1814
               Warm      JVM       1484 ± 3.8          1610    1388
               Warm      GraalVM   1687 ± 3.9          1822    1586
fannkuch       Cold      JVM       33922 ± 244.7       38196   28375
               Cold      GraalVM   47952 ± 68.8        51807   46468
               Warm      JVM       20582 ± 62.3        24411   18891
               Warm      GraalVM   47592 ± 33.9        50430   46234
fasta          Cold      JVM       40696 ± 132.1       42868   37529
               Cold      GraalVM   32630 ± 53.5        34074   31262
               Warm      JVM       24054 ± 39.3        25501   22911
               Warm      GraalVM   31861 ± 50.7        33698   30514

Table 4.4: Response time result from benchmark execution in milliseconds

4.5 Memory Consumption

From Table 4.5 we can see that the confidence intervals are quite narrow for all benchmarks; most intervals have the value ± 0.1 MB. The largest interval is ± 2.3 MB and belongs to the cold start version of the GraalVM function running the reverse-comp benchmark. From this we can conclude that the documented average values can be trusted as representative values of the average memory consumption.

Benchmark      Category  Compiler  Memory (MB)  Max   Min
hello-world    Cold      JVM       113 ± 1.1    115   111
               Cold      GraalVM   50 ± 0.1     50    49
               Warm      JVM       113 ± 0.1    114   112
               Warm      GraalVM   51 ± 0.1     52    49
discrete-math  Cold      JVM       113 ± 0.1    115   112
               Cold      GraalVM   76 ± 0.1     79    76
               Warm      JVM       126 ± 0.6    141   114
               Warm      GraalVM   77 ± 0.1     79    76
go-simulator   Cold      JVM       114 ± 0.1    117   113
               Cold      GraalVM   65 ± 0       66    64
               Warm      JVM       115 ± 0.1    117   112
               Warm      GraalVM   67 ± 0.1     68    65
state-machine  Cold      JVM       115 ± 0.1    116   113
               Cold      GraalVM   62 ± 0       63    62
               Warm      JVM       116 ± 0.1    117   114
               Warm      GraalVM   64 ± 0.1     65    62
reverse-comp   Cold      JVM       175 ± 0.1    178   172
               Cold      GraalVM   183 ± 2.3    206   154
               Warm      JVM       214 ± 0.1    216   212
               Warm      GraalVM   207 ± 0.5    209   177
nbody          Cold      JVM       109 ± 0.1    110   107
               Cold      GraalVM   50 ± 0.1     50    49
               Warm      JVM       109 ± 0.1    110   108
               Warm      GraalVM   51 ± 0.1     52    50
fannkuch       Cold      JVM       111 ± 0.1    113   108
               Cold      GraalVM   76 ± 0       77    76
               Warm      JVM       111 ± 0.1    113   110
               Warm      GraalVM   78 ± 0.1     79    76
fasta          Cold      JVM       136 ± 0.1    137   135
               Cold      GraalVM   98 ± 0       99    97
               Warm      JVM       158 ± 0.1    159   156
               Warm      GraalVM   99 ± 0.1     100   98

Table 4.5: Memory usage result from benchmark execution in megabytes

Chapter 5

Discussion

This chapter contains discussions regarding the results seen in the previous chapter. Each section covers one type of metric and contains one subsection for cold starts and one for warm starts. The chapter concludes with a section about internal and external threats to the validity of the results.

5.1 Latency

Latency is the time between the invocation and the moment the invoked function's code starts to execute. JVM functions perform better during cold starts for the real benchmarks, whereas GraalVM functions are in general faster for the artificial benchmarks. During warm starts, however, the results are very similar for every benchmark and for both types of functions.

5.1.1 Cold start

From Figure 5.1 we can see that the JVM functions have a lower average latency for all real benchmarks (discrete-math, go-simulator and state-machine) as well as for reverse-comp, although the confidence interval for reverse-comp is wide and overlaps the GraalVM function's interval. For the other benchmarks the GraalVM functions have a lower average latency. The largest difference can be seen for the discrete-math benchmark, where the JVM function has a 31.2 % lower latency than the GraalVM function.

Figure 5.1: Average latency during cold start (bar chart of average latency in ms per benchmark, JVM vs. GraalVM)

5.1.2 Warm start

In Figure 5.2 we can see that during warm starts, for every benchmark, real and artificial, the function running on the JVM consistently has a lower average latency than the function that uses GraalVM. However, the difference is small, and some of the confidence intervals overlap, such as for the go-simulator and reverse-comp benchmarks. The biggest difference is for reverse-comp, where the average latency of the function running on the JVM is 11.9 % lower than that of the GraalVM function. However, as the confidence intervals of these two versions overlap, not much weight can be put on this observation. The second largest difference is for the fasta benchmark, where the JVM function is 2.9 % faster.

Figure 5.2: Average latency during warm start (bar chart of average latency in ms per benchmark, JVM vs. GraalVM)

5.2 Application Runtime

As previously mentioned, the hello-world benchmark is missing from Table 4.3, which describes the application runtime result, due to redundancy. From the table we can see a clear difference between JVM functions and GraalVM functions. There is not one type of function that performs better for all benchmarks; however, among the real tests the GraalVM functions are indisputably faster, for both warm and cold starts.

5.2.1 Cold start

In Figure 5.3 we can see a dramatic difference where the GraalVM functions execute much faster than their JVM counterparts. The largest difference is in the state-machine benchmark, where the GraalVM function executes 62 times faster and only takes an average of 21 ms, whereas the JVM function requires an average of 1305 ms.

Figure 5.3: Average application runtime of real benchmarks during cold start (bar chart, application runtime in ms, JVM vs. GraalVM)

If we look at Figure 5.4, which outlines the average application runtime of the artificial benchmarks, the differences are not as dramatic and not as one-sided as for the real benchmarks. The execution times of reverse-comp and fasta are quite similar. What stands out most is that in the fannkuch benchmark the JVM function was almost twice as fast as the GraalVM function.

Figure 5.4: Average application runtime of artificial benchmarks during cold start (bar chart, application runtime in ms, JVM vs. GraalVM)

The results imply that during a cold start the application runtime is shorter for a GraalVM function than for a JVM function. This is probably because the code has been compiled beforehand and all the optimizations have already been done by GraalVM, whereas the JVM has to compile during runtime, resulting in a longer execution time. There might, however, be a risk with compiling beforehand: since nothing about the application usage is known, the compiler might prioritize incorrectly in its optimisation. This may be the reason why the JVM function is almost twice as fast as the GraalVM function in the fannkuch benchmark.

5.2.2 Warm start

In Figure 5.5 we can observe the average runtime results for the warm start category. The GraalVM functions still have faster execution times than the JVM functions for the real benchmarks. However, we can note how small the difference now is for the go-simulator and the state-machine benchmark compared to the result from the cold start category.

Figure 5.5: Average application runtime of real benchmarks during warm start (bar chart, application runtime in ms, JVM vs. GraalVM)

For the go-simulator benchmark, during cold start, the difference between the two types of functions is 1204 ms, where the GraalVM function is 10 times faster. The same benchmark during warm start differs only by 30 ms, where the GraalVM function now is only 3 times as fast. The execution time of the JVM function has been reduced by 1294 ms, which equals a speed-up of 281 %. The same pattern can be seen for the state-machine benchmark. During cold start the difference is 1284 ms, where the GraalVM function requires 21 ms and the JVM function 1305 ms to execute. When the JVM function does not have to go through a cold start, however, the gap is reduced to only 11 ms, where the JVM function has an average execution time of 13 ms and the GraalVM function 2 ms. Similar patterns can be seen for the artificial benchmarks in Figure 5.6, where the average execution time of the JVM functions has been reduced for all benchmarks.

Figure 5.6: Average application runtime of artificial benchmarks during warm start (bar chart, application runtime in ms, JVM vs. GraalVM)

For the nbody and fasta benchmarks, where the GraalVM function has a lower average execution time during cold starts, the JVM function is now faster. For the reverse-comp benchmark, where the JVM function is slightly faster during cold starts, the gap increases during warm starts. The same goes for the fannkuch benchmark. We can establish that the JVM functions perform much better during warm starts than cold starts. The difference is illustrated in Figure 5.7. As a comparison, we can also view the difference in execution time of the GraalVM functions during warm and cold start in Figure 5.8. Small improvements can be noted for every benchmark, except for the nbody benchmark where there is a slight deterioration, but they are nowhere near the improvements of the JVM functions.

Figure 5.7: Comparison of average runtime of JVM functions during warm and cold start (bar chart, application runtime in ms per benchmark)

Figure 5.8: Comparison of average runtime of GraalVM functions during warm and cold start (bar chart, application runtime in ms per benchmark)

The reason the JVM functions perform much better during warm starts is the JVM's JIT property. The JVM can, during runtime, change its optimizations based on the way the code is used, and since the code in this case is always used in the same way, optimizations are made easier. The JIT property also means that the more times the code is run, the more opportunities the JVM has to make these optimizations. The GraalVM functions, on the other hand, are already compiled and the optimizations that have been made are fixed. This can be seen more clearly in Figure 5.9. It illustrates every warm start execution time of the go-simulator benchmark in the order they were acquired. We can see a distinct pattern that is repeated nine times, the same number of times as the sequential warm start tests were run. The start of each new sequential test is marked with a grey vertical line. We can see that at these marked points there is a peak. We can also see that the execution time decreases, with some irregular spikes, until the end of the sequence and peaks again when the next sequence starts.

Figure 5.9: Execution time of the JVM function of the go-simulator benchmark during warm starts (line plot, application runtime in ms, ordered by order of collection)

As a comparison, we can observe the same graph for the corresponding GraalVM function in Figure 5.10. A pattern over the whole course of time can be seen, where 18 ms is a recurring value, but no repeating pattern with regard to the plotted intervals can be observed.

Figure 5.10: Execution time of the GraalVM function of the go-simulator benchmark during warm starts (line plot, application runtime in ms, ordered by order of collection)

5.3 Response Time

The total response time is the time from when the request is sent until a response is received, which means that it is a product of the latency, the application runtime, and the creation and delivery of the response. In general we can see that during cold starts the GraalVM functions perform better than the JVM functions. For the average total runtime during warm starts, a decrease can be seen for the JVM functions compared to the values from the cold start category; the JVM and GraalVM functions then have more similar values.

5.3.1 Cold start

For the real benchmarks it is unmistakable that the GraalVM functions have a much faster average total runtime. Every GraalVM function is at least 7 times faster than its JVM counterpart. An illustration of this can be seen in Figure 5.11.

Figure 5.11: Average total runtime of real benchmarks during cold start (bar chart, response time in ms, JVM vs. GraalVM)

In Figure 5.12 a comparison of the average total runtime of the artificial benchmarks during cold starts can be seen. The GraalVM functions have the lowest average total runtime for all benchmarks except the fannkuch benchmark.

Figure 5.12: Average total runtime of artificial benchmarks during cold start (bar chart, response time in ms, JVM vs. GraalVM)

5.3.2 Warm start

In Figures 5.13 and 5.14, both the GraalVM and the JVM functions of all benchmarks can be seen to have a reduced average total runtime during warm starts compared to their cold start counterparts. The largest reductions in average total runtime belong to the JVM functions. The benchmark with the largest numerical difference is the fasta benchmark, where the JVM function during cold start has an average total runtime of 40,696 ms and during warm starts 24,054 ms, a difference of 16,642 ms. The benchmark with the largest relative difference is the state-machine benchmark, where the warm start is 17 times faster than the cold start, from 13,596 ms to 757 ms.

Figure 5.13: Average total runtime of real benchmarks during warm start (bar chart, response time in ms, JVM vs. GraalVM)

The average total runtimes of the two types of functions during warm starts have equalized for almost all benchmarks. For the discrete-math benchmark the GraalVM function is still faster than the JVM function. However, for the go-simulator and the state-machine benchmark the JVM function performs slightly better. If we look at the artificial benchmarks, in Figure 5.14, we can see that, although the difference is smaller in some cases, the JVM functions are generally faster than their GraalVM counterparts.

Figure 5.14: Average total runtime of artificial benchmarks during warm start (bar chart, response time in ms, JVM vs. GraalVM)

That the average total runtimes of the JVM functions are significantly lower during warm starts than during cold starts is not unanticipated, considering the results discussed in the previous sections, since latency and application runtime are parts of the total runtime. If one or both of them are reduced, then the total runtime should be reduced as well. In Section 5.1 we saw that latency decreased for all functions during warm starts compared to cold starts. Then, in Section 5.2, we saw the same pattern for application runtime, where the JVM functions showed the greatest improvements.

5.4 Memory Consumption

Memory is measured by Amazon, and the maximum memory used during the function's run is stated in the log reports for each function invocation. In general, the GraalVM functions use significantly less memory than the JVM functions. This is because the JVM needs to compile the code during runtime and the compilation requires memory, while the GraalVM functions are already compiled. It can also be stated that the memory consumption remains relatively unchanged when comparing the same functions during cold and warm starts.

5.4.1 Cold start

In Figure 5.15 the average memory usage of all functions during cold start is illustrated. It is clear that the GraalVM functions consistently use less memory than the JVM functions for all benchmarks except the reverse-comp benchmark, which also uses the most memory of all the benchmarks. Since what is characteristic of reverse-comp is its input handling, this might be the reason for the large memory usage: reading a large input file might result in an uncharacteristically large memory usage for GraalVM-enabled functions.

Figure 5.15: Average memory consumption during cold start (bar chart, memory consumption in MB per benchmark, JVM vs. GraalVM)

5.4.2 Warm start

Figure 5.16 illustrates the average memory consumption of all benchmarks during warm starts. A similarity to Figure 5.15 can be seen. One difference is that there are now no exceptions: every GraalVM function requires less memory than its JVM counterpart. Most values are the same or very similar, with only 1 or 2 MB difference. One benchmark that stands out, on the other hand, is the reverse-comp benchmark: for the JVM function the average memory consumption has increased by 39 MB and for the GraalVM function by 24 MB. Another benchmark that stands out is the discrete-math benchmark. The JVM function still requires the same amount of memory; the GraalVM function, however, shows a decrease of 25 MB.

Figure 5.16: Average memory consumption during warm start (bar chart, memory consumption in MB per benchmark, JVM vs. GraalVM)

5.5 Threats to validity

For the work done in this thesis, a benchmark suite of real as well as complementary benchmarks was created. The goal when selecting them was to simulate real workloads. It is possible that this is not the case, that they do not reflect real workloads. It is also possible that the benchmark suite represents a real workload, but only a certain type of workload. Both of these threats would make it unsuitable to use the result as a basis for the general case. Since the collection of the metrics was a mix between collecting data from the script written for this work and log data from the AWS Console, it may be possible that errors were made when merging these different metrics. There is also a possibility that there were issues when collecting the metrics, both by AWS and by the script designed to automatically record the results. This work relies heavily on AWS, where the goal is to present as representative a result as possible. It is possible that the benchmarks were run in an environment that is abnormal for AWS. This would entail that the result would only be valid for that abnormal environment. Since the information about where and how the functions are hosted in AWS is highly limited, this can cause all sorts of unknown issues that could threaten the validity of this work. There may also be other unknown issues with this work that do not involve AWS.

Chapter 6

Conclusion

This chapter concludes this thesis by gathering all results and summarizing them. It also contains a section about how this work could be elaborated and improved.

6.1 Performance

The results from the work done in this thesis indicate that different recommendations on how to run a function written in Kotlin should be given depending on how the function is intended to be used. The result indicates a sizable performance difference depending on whether the function undergoes a cold start or not. Latency and memory consumption are, however, not as affected as the two other parameters; application runtime and total runtime are more dependent on whether the function has been invoked recently or not. If a function written in Kotlin is expected to go through many cold starts, creating a native image with GraalVM is a good option to lower the total runtime. A GraalVM function also requires less memory. The combination entails a cheaper hosting solution than a JVM function. On the other hand, if a function is expected to be invoked often and therefore not undergo many cold starts, the difference in total runtime between the two compilation approaches is smaller. Memory consumption is, however, still in favor of the GraalVM functions. The overall advantages are therefore not as obvious in this case. A more detailed conclusion for each metric can be found in the subsections below.

6.1.1 Latency

When latency is regarded, the type of function, JVM-dependent or GraalVM-created, does not seem to matter much. The JVM functions showed better results for the real benchmarks as well as for the reverse-comp benchmark during cold start. The result for the GraalVM function of reverse-comp during cold starts, however, cannot be trusted since its confidence interval is so wide. For the other artificial benchmarks the GraalVM functions proved to be faster. If we put more weight on the real benchmarks, as we should, it could be said that the JVM functions perform better than their GraalVM counterparts. However, since the difference is not very clear, it cannot be said for certain that a JVM function would have a lower latency in the general case during cold starts. During warm starts latency is almost equal for all benchmarks and functions. The JVM functions can be seen to have a slightly lower latency in general, but the difference is so small that the latency during warm starts can be considered equal for JVM and GraalVM functions.

6.1.2 Application Runtime

From the results we saw that there are big differences between the JVM and the GraalVM functions. During cold starts of the real benchmarks it is clear that the GraalVM functions perform a great deal better. When it comes to the artificial benchmarks, however, the results are more even; the fannkuch benchmark was executed much faster by the JVM function than by the GraalVM function. However, if we, as for latency, put more weight on the real benchmarks, it can be stated that GraalVM functions perform much better during cold starts than JVM functions. This is because optimizations and compilation have already been done by GraalVM when the native image was created. This has to be done during runtime for the JVM functions, entailing a longer execution time the first times they run. For application runtime during warm starts, all functions had improved their execution time. The JVM functions, however, had improved drastically. For the real benchmarks the GraalVM functions were still faster, but the JVM functions performed better than their GraalVM counterparts for the artificial benchmarks. The drastic improvement of the JVM functions can be traced back to the JVM's ability to keep changing optimizations during runtime: more executions lead to more data on which the JVM can base its optimizations. For the GraalVM functions the code is already compiled, therefore no great improvement in application runtime can be seen. If a function written in Kotlin is expected to undergo many cold starts, i.e., to be invoked rarely, it would be a good option to precompile the function with the help of GraalVM to lower the application runtime. If a Kotlin function is expected to have a lot of sequential batch invocations, i.e., many warm starts consecutively, the difference in execution time is lower and the choice between a GraalVM-precompiled function and a JVM function becomes less clear.

6.1.3 Response Time

We have seen that for cold starts the GraalVM functions perform better for all benchmarks, with the exception of the fannkuch benchmark. The GraalVM functions for the real benchmarks all run at least 7 times faster. During warm starts the JVM functions' improvements in application runtime can be seen, resulting in a much lower total runtime for those functions. The difference between the GraalVM functions and the JVM functions is small for all real benchmarks, and for the artificial benchmarks the JVM functions perform better than or equal to their GraalVM counterparts. If a serverless function written in Kotlin is expected to undergo many cold starts, GraalVM is a good option to lower the total runtime. If many warm starts are expected, the choice between a JVM function and a GraalVM function matters less.

6.1.4 Memory Consumption

For memory consumption the results are clear. Both for warm and cold starts, GraalVM functions require a much lower amount of memory, in some cases only half as much. The results are hardly surprising, since the JVM compiles its function during runtime and requires memory to perform the compilation.

6.2 Future work

There are many components of this work that can be improved and/or expanded. More work could be done to develop a more substantial benchmark suite, where only real benchmarks are used. A closer look could also be taken at the different optimization techniques of the two compilation options. Benchmarks could then be designed or picked in order to target the compilation strategies' strengths and weaknesses. The impact of the different approaches could then be analysed and more specific recommendations could be made as to when to use one or the other compilation option. Another improvement would be to test more cloud providers. As of now only AWS supports custom runtimes, but it is possible that more public cloud providers will follow. A comparison could at least be done for the JVM functions only. To get a more accurate view of the cloud provider, benchmarks could be run over a longer period of time. The result from this work only represents the performance of a snapshot of the AWS cloud. There is also the possibility of exploring the different regions that AWS has. Since different hardware as well as different configurations could be used in different data centers, it might be the case that some regions perform better than others. Furthermore, it would be interesting to perform the same tests for similar functions written in other JVM-based languages, such as Scala and Java, to see if they lead to the same outcome as the tests done for Kotlin in this thesis.

Another angle from which to expand the work done in this thesis could be to explore the patterns seen in Figure 5.9: to look into how a JVM function performs over a longer time, to see when the function reaches a steady state and when a JVM function becomes faster than an already compiled native image produced by GraalVM.

Bibliography

[1] IDG. “2018 Cloud Computing Survey”. In: (Aug. 14, 2018). [2] LogicMonitor. “Cloud Vision 2020: The Future of the Cloud - A survey of in- dustry Influencers”. In: (Dec. 2017). [3] Amazon. AWS Lambda. url: https://aws.amazon.com/lambda/ (visited on 09/06/2019). [4] Google. Google Cloud Function. url: https://cloud.google.com/functions/ (visited on 09/06/2019). [5] Passwater, Andrea. 2018 Serverless Community Survey: huge growth in server- less usage. 2018. url: https : / / serverless . com / blog / 2018 - serverless - community-survey-huge-growth-usage/ (visited on 09/06/2019). [6] Amazon. AWS Lambda Releases. url: https://docs.aws.amazon.com/lambda/ latest/dg/lambda-releases.html (visited on 09/06/2019). [7] Kotlin. FAQ. url: https : / / kotlinlang . org / docs / reference / faq . html (visited on 10/08/2019). [8] Kotlin, Talking. Kotlin at Pinterest with Christina Lee. May 15, 2017. url: http://talkingkotlin.com/kotlin- at- pinterest- with- christina- lee/ (visited on 10/08/2019). [9] Kotlin, Talking. Kotlin at Uber. Apr. 30, 2019. url: http://talkingkotlin. com/kotlin-at-uber/ (visited on 10/08/2019). [10] Lardinois, Frederic. Kotlin is now Google’s Preferred language for Android app development. May 30, 2019. url: https : / / techcrunch . com / 2019 / 05 / 07 / kotlin-is-now-googles-preferred-language-for-android-app-development/ (visited on 10/08/2019). [11] Overflow, Stack. Developer Survey Results - 2018. 2018. url: https://insights. stackoverflow.com/survey/2018/#most-loved-dreaded-and-wanted (visited on 10/08/2019). [12] Overflow, Stack. Developer Survey Results - 2019. 2019. url: https://insights. stackoverflow.com/survey/2019/#most-loved-dreaded-and-wanted (visited on 10/08/2019). [13] Miller, Ron. Amazon Launches Lambda, An Event-Driven Compute Service. Nov. 13, 2014. url: https://techcrunch.com/2014/11/13/amazon-launches- lambda-an-event-driven-compute-service/ (visited on 09/30/2019).

59 [14] Novet, Jordan. Google has quietly launched its answer to AWS Lambda. Feb. 9, 2016. url: https://venturebeat.com/2016/02/09/google- has- quietly- launched-its-answer-to-aws-lambda/ (visited on 09/30/2019). [15] IBM. IBM Unveils Fast, Open Alternative to Event-Driven Programming. Feb. 23, 2016. url: https://www-03.ibm.com/press/us/en/pressrelease/49158.wss (visited on 09/30/2019). [16] Miller, Ron. Microsoft answers AWS Lambda’s event-triggered serverless apps with Azure Functions. Mar. 31, 2016. url: https://techcrunch.com/2016/03/ 31/microsoft-answers-aws-lambdas-event-triggered-serverless-apps- with-azure-functions/ (visited on 09/30/2019). [17] Services, Amazon Web. AWS re:Invent 2017: Become a Serverless Black Belt: Optimizing Your Serverless Appli (SRV401). Dec. 1, 2017. url: https://youtu. be/oQFORsso2go (visited on 09/30/2019). [18] Amazon. Tutorial: Schedule AWS Lambda Functions Using CloudWatch Events. url: https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ RunLambdaSchedule.html (visited on 09/30/2019). [19] Jeremydaly. Lambda Warmer. url: https://github.com/jeremydaly/lambda- warmer (visited on 09/30/2019). [20] FidelLimited. Serverless WarmUp Plugin. url: https://github.com/FidelLimited/ serverless-plugin-warmup (visited on 09/30/2019). [21] Dashbird. X-Lambda. url: https://github.com/dashbird/xlambda/ (visited on 09/30/2019). [22] Shilkov, Mikhail. Cold Starts in AWS Lamdba. Sept. 26, 2019. url: https : //mikhail.io/serverless/coldstarts/aws/ (visited on 10/10/2019). [23] Amazon. AWS Lambda Doubles Maximum Memory Capacity for Lambda Func- tions. Nov. 17, 2017. url: https : / / aws . amazon . com / about - aws / whats - new/2017/11/aws-lambda-doubles-maximum-memory-capacity-for-lambda- functions/ (visited on 10/01/2019). [24] Amazon. AWS Lambda enables functions that can run up to 15 minutes. Oct. 10, 2018. url: https://aws.amazon.com/about-aws/whats-new/2018/10/aws- lambda-supports-functions-that-can-run-up-to-15-minutes/ (visited on 09/30/2019). [25] Amazon. iRobot Ready to Unlock the Next Generation of Smart Homes Using the AWS Cloud. url: https://aws.amazon.com/solutions/case- studies/ irobot/ (visited on 10/03/2019). [26] Amazon. Netflix AWS Lambda Case Study. url: https://aws.amazon.com/ solutions/case-studies/netflix-and-aws-lambda/ (visited on 10/02/2019). [27] Google. NCloud Functions for Firebase Auger Labs. url: https://firebase. google.com/docs/functions/case-studies/augerlabs.pdf?hl=zh-CN (vis- ited on 10/02/2019). [28] Amazon. Alameda County Serves Election Maps at High Speed, Low Cost Using AWS. url: https://aws.amazon.com/solutions/case- studies/alameda- county/ (visited on 10/03/2019).

60 [29] Perez, Guillermo A. et al. “A Hybrid Just-in-time Compiler for Android: Com- paring JIT Types and the Result of Cooperation”. In: Proceedings of the 2012 International Conference on Compilers, Architectures and Synthesis for Em- bedded Systems. CASES ’12. Tampere, Finland: ACM, 2012, pp. 41–50. isbn: 978-1-4503-1424-4. doi: 10.1145/2380403.2380418. url: http://doi.acm. org/10.1145/2380403.2380418. [30] Croce, Louis. Lecture notes in Just in Time Compilation. 2014. (Visited on 12/17/2019). [31] Engels, Joshua. Programming for the JavaTM Virtual Machine. Addison Wesley, 1999. isbn: 0-201-30972-6. [32] Lindholm, Tim et al. The Java Virtual Machine Specification, Java SE 8 Edition. 1st. Addison-Wesley Professional, 2014. isbn: 013390590X, 9780133905908. [33] Urma, Raoul-Gabriel. Alternative Languages for the JVM. July 2014. url: https: / / www . oracle . com / technetwork / articles / java / architect - languages - 2266279.html (visited on 10/04/2019). [34] Oracle. The Java HotSpot Performance Engine Architecture. url: https : / / www . oracle . com / technetwork / java / whitepaper - 135217 . html (visited on 10/09/2019). [35] Lynn, Scott. For Building Programs That Run Faster Anywhere: Oracle GraalVM Enterprise Edition. May 8, 2019. url: https://blogs.oracle.com/graalvm/ announcement (visited on 10/07/2019). [36] Duboscq, Gilles et al. Graal IR: An Extensible Declarative Intermediate Repre- sentation. Feb. 2013. [37] Stadler, Lukas & W¨urthinger,Thomas, and M¨ossenb¨ock, Hanspeter. “Partial Escape Analysis and Scalar Replacement for Java”. In: Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization. CGO ’14. Orlando, FL, USA: ACM, 2014, 165:165–165:174. isbn: 978-1-4503- 2670-4. doi: 10.1145/2581122.2544157. url: http://doi.acm.org/10.1145/ 2581122.2544157. [38] GraalVM Native Image. url: https://www.graalvm.org/docs/reference- manual/aot-compilation/ (visited on 10/04/2019). [39] W¨urthinger,Thomas et al. “One VM to Rule Them All”. In: Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. Onward! 2013. Indianapolis, Indiana, USA: ACM, 2013, pp. 187–204. isbn: 978-1-4503-2472-4. doi: 10.1145/2509578. 2509581. url: http://doi.acm.org/10.1145/2509578.2509581. [40] Rigger, Manuel et al. “Sulong, and Thanks for All the Fish”. In: Conference Companion of the 2Nd International Conference on Art, Science, and Engineer- ing of Programming. Programming'18 Companion. Nice, France: ACM, 2018, pp. 58–60. isbn: 978-1-4503-5513-1. doi: 10.1145/3191697.3191726. url: http://doi.acm.org.focus.lib.kth.se/10.1145/3191697.3191726. [41] Corporation, Standard Performance Evaluation. SPECjcm2008. Sept. 26, 2019. url: https://www.spec.org/jvm2008/ (visited on 11/12/2019). [42] DaCapo. DaCapo Benchmarking Suite. url: http://dacapobench.org/ (visited on 11/12/2019).

[43] Renaissance Suite. url: https://renaissance.dev/ (visited on 11/12/2019).
[44] Blackburn, S. M. et al. “The DaCapo Benchmarks: Java Benchmarking Development and Analysis”. In: OOPSLA ’06: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications. Portland, OR, USA: ACM Press, Oct. 2006, pp. 169–190. doi: 10.1145/1167473.1167488.
[45] Prokopec, Aleksandar et al. “Renaissance: Benchmarking Suite for Parallel Applications on the JVM”. In: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI 2019. Phoenix, AZ, USA: ACM, 2019, pp. 31–47. isbn: 978-1-4503-6712-7. doi: 10.1145/3314221.3314637. url: http://doi.acm.org/10.1145/3314221.3314637.
[46] The Computer Language Benchmarks Game. url: https://benchmarksgame-team.pages.debian.net/benchmarksgame/ (visited on 11/12/2019).
[47] Li, Wing Hang & White, David, and Singer, Jeremy. “JVM-hosted Languages: They Talk the Talk, but Do They Walk the Walk?” In: Proceedings of the 2013 International Conference on Principles and Practices of Programming on the Java Platform: Virtual Machines, Languages, and Tools. PPPJ ’13. Stuttgart, Germany: ACM, 2013, pp. 101–112. isbn: 978-1-4503-2111-2. doi: 10.1145/2500828.2500838.
[48] Schwermer, Patrik. “Performance Evaluation of Kotlin and Java on Android Runtime”. MA thesis. KTH, School of Electrical Engineering and Computer Science (EECS), 2018.
[49] Sherman, Elena and Dyer, Robert. “Software Engineering Collaboratories (SEClabs) and Collaboratories As a Service (CaaS)”. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE 2018. Lake Buena Vista, FL, USA: ACM, 2018, pp. 760–764. isbn: 978-1-4503-5573-5. doi: 10.1145/3236024.3264839. url: http://doi.acm.org.focus.lib.kth.se/10.1145/3236024.3264839.
[50] Laaber, Christoph & Scheuner, Joel, and Leitner, Philipp. “Software Microbenchmarking in the Cloud. How Bad is It Really?” In: Empirical Softw. Engg. 24.4 (Aug. 2019), pp. 2469–2508. issn: 1382-3256. doi: 10.1007/s10664-019-09681-1. url: https://doi.org/10.1007/s10664-019-09681-1.
[51] Ou, Zhonghong et al. “Exploiting Hardware Heterogeneity within the Same Instance Type of Amazon EC2”. In: 4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 12). USENIX Association, 2012. url: https://www.usenix.org/conference/hotcloud12/exploiting-hardware-heterogeneity-within-same-instance-type-amazon-ec2.
[52] Folkerts, Enno et al. “Benchmarking in the Cloud: What It Should, Can, and Cannot Be”. In: Selected Topics in Performance Evaluation and Benchmarking. Ed. by Nambiar, Raghunath and Poess, Meikel. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 173–188. isbn: 978-3-642-36727-4.
[53] FreeBSD. BSD System Manager’s Manual - INETD(8). Jan. 12, 2008. url: https://www.freebsd.org/cgi/man.cgi?query=inetd&sektion=8 (visited on 10/22/2019).
[54] Belghiat, Aissam et al. “Mobile Agent-Based Software Systems Modeling Approaches: A Comparative Study”. In: CIT 24 (2016), pp. 149–163.

[55] Yu, Ping et al. “Mobile Agent Enabled Application Mobility for Pervasive Computing”. In: Proceedings of the Third International Conference on Ubiquitous Intelligence and Computing. UIC’06. Wuhan, China: Springer-Verlag, 2006, pp. 648–657. isbn: 3-540-38091-4, 978-3-540-38091-7. doi: 10.1007/11833529_66. url: http://dx.doi.org/10.1007/11833529_66.
[56] Qin, Weijun & Suo, Yue, and Shi, Yuanchun. “CAMPS: A Middleware for Providing Context-Aware Services for Smart Space”. In: Advances in Grid and Pervasive Computing. Ed. by Chung, Yeh-Ching and Moreira, José E. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 644–653. isbn: 978-3-540-33810-9.
[57] Zhou, Y. et al. “A Middleware Support for Agent-Based Application Mobility in Pervasive Environments”. In: 27th International Conference on Distributed Computing Systems Workshops (ICDCSW’07). June 2007, pp. 9–9. doi: 10.1109/ICDCSW.2007.12.
[58] Thalinger, Chris. GeeCON Prague 2017: Chris Thalinger - Twitter’s quest for a wholly Graal runtime. Jan. 24, 2018. url: https://www.youtube.com/watch?v=pR5NDkIZBOA (visited on 10/11/2019).
[59] Thalinger, Chris. Performance tuning Twitter services with Graal and Machine Learning - Chris Thalinger. July 31, 2019. url: https://www.youtube.com/watch?v=3fSKcLM5nGw (visited on 10/11/2019).
[60] Oi, H. “A Comparative Study of JVM Implementations with SPECjvm2008”. In: 2010 Second International Conference on Computer Engineering and Applications. Vol. 1. Mar. 2010, pp. 351–357. doi: 10.1109/ICCEA.2010.75.
[61] Chapin, John. Fearless JVM Lambdas - John Chapin. Mar. 30, 2017. url: https://www.youtube.com/watch?v=GINI0T8FPD4 (visited on 10/17/2019).
[62] Lee, H. & Satyam, K., and Fox, G. “Evaluation of Production Serverless Computing Environments”. In: 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). July 2018, pp. 442–450. doi: 10.1109/CLOUD.2018.00062.
[63] Maas, Martin & Asanović, Krste, and Kubiatowicz, John. “Return of the Runtimes: Rethinking the Language Runtime System for the Cloud 3.0 Era”. In: Proceedings of the 16th Workshop on Hot Topics in Operating Systems. HotOS ’17. Whistler, BC, Canada: ACM, 2017, pp. 138–143. isbn: 978-1-4503-5068-6. doi: 10.1145/3102980.3103003. url: http://doi.acm.org/10.1145/3102980.3103003.
[64] Singh, Gurdev & Singh, Dilbag, and Singh, Vikram. “A Study of Software Metrics”. In: International Journal of Computational Engineering and Management 11 (Jan. 2011), pp. 22–27. issn: 2230-7893.
[65] Sharma, Manik and Singh, Gurdev. “Analysis of Static and Dynamic Metrics for Productivity and Time Complexity”. In: International Journal of Computer Applications 30.1 (Sept. 2011), pp. 7–13.
[66] Miller, Robert B. “Response Time in Man-computer Conversational Transactions”. In: Proceedings of the December 9-11, 1968, Fall Joint Computer Conference, Part I. AFIPS ’68 (Fall, part I). San Francisco, California: ACM, 1968, pp. 267–277. isbn: 978-1-4503-7899-4. doi: 10.1145/1476589.1476628. url: http://doi.acm.org/10.1145/1476589.1476628.

[67] Card, Stuart K. & Robertson, George G., and Mackinlay, Jock D. “The Information Visualizer, an Information Workspace”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI ’91. New Orleans, Louisiana, USA: ACM, 1991, pp. 181–186. isbn: 0-89791-383-3. doi: 10.1145/108844.108874. url: http://doi.acm.org/10.1145/108844.108874.
[68] Björk, Kim. GitHub Repositories. url: https://github.com/KimBjork?tab=repositories (visited on 01/11/2020).
[69] uberto. Kakomu. url: https://github.com/uberto/kakomu (visited on 12/12/2019).
[70] Tinder. State Machine. url: https://github.com/Tinder/StateMachine (visited on 12/12/2019).
[71] MarcinMoskala. Kotlin Discrete Math Toolkit. url: https://github.com/MarcinMoskala/KotlinDiscreteMathToolkit (visited on 12/12/2019).
[72] johnrengelman. Gradle Shadow. url: https://github.com/johnrengelman/shadow (visited on 12/12/2019).
[73] [native-image] cannot build an image with the 19.3.0 version of GraalVM CE for java8 and java11. url: https://github.com/oracle/graal/issues/1863 (visited on 12/12/2019).
[74] Georges, Andy & Buytaert, Dries, and Eeckhout, Lieven. “Statistically Rigorous Java Performance Evaluation”. In: Proceedings of the 22nd Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications. OOPSLA ’07. Montreal, Quebec, Canada: Association for Computing Machinery, 2007, pp. 57–76. isbn: 9781595937865. doi: 10.1145/1297027.1297033. url: https://doi.org/10.1145/1297027.1297033.
[75] Lengauer, Philipp et al. “A Comprehensive Java Benchmark Study on Memory and Garbage Collection Behavior of DaCapo, DaCapo Scala, and SPECjvm2008”. In: Proceedings of the 8th ACM/SPEC International Conference on Performance Engineering. ICPE ’17. L’Aquila, Italy: Association for Computing Machinery, 2017, pp. 3–14. isbn: 9781450344043. doi: 10.1145/3030207.3030211. url: https://doi.org/10.1145/3030207.3030211.
[76] Blackburn, Stephen M. et al. “The DaCapo Benchmarks: Java Benchmarking Development and Analysis”. In: Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications. OOPSLA ’06. Portland, Oregon, USA: Association for Computing Machinery, 2006, pp. 169–190. isbn: 1595933484. doi: 10.1145/1167473.1167488. url: https://doi-org.focus.lib.kth.se/10.1145/1167473.1167488.
[77] Zar, Jerrold H. Biostatistical Analysis (4th Edition). USA: Prentice-Hall, Inc., 2007, pp. 43–45. isbn: 978-0130815422.
[78] Björk, Kim. A comparison of compiler strategies for serverless functions written in Kotlin - dataset. url: https://zenodo.org/record/3648281#.XjwYe2hKiUl.

TRITA-EECS-EX-2020:83

www.kth.se