DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020

Lookaside Load Balancing in a Service Mesh Environment

ERIK JOHANSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: 2020-10-04
Supervisor: Cyrille Artho
Examiner: Johan Håstad
School of Electrical Engineering and Computer Science
Host company: Spotify AB
Swedish title: Extern Lastbalansering i en Service Mesh Miljö


Abstract

As more online services are migrated from monolithic systems into decoupled distributed micro services, the need for efficient internal load balancing solutions increases. Today, there exist two main approaches for load balancing internal traffic between micro services. One approach uses either a central or sidecar proxy to load balance queries over all available server endpoints. The other approach lets clients themselves decide which of all available endpoints to send queries to.

This study investigates a new approach called lookaside load balancing. This approach consists of a load balancer that uses the control plane to gather a list of service endpoints and their current load. The load balancer can then dynamically provide clients with a subset of suitable endpoints that they connect to directly. The endpoint distribution is controlled by a lookaside load balancing algorithm. This study presents such an algorithm that works by changing the endpoint assignment in order to keep current load between an upper and lower bound.

In order to compare each of these three load balancing approaches, a test environment is constructed in Kubernetes and modeled to be similar to a real service mesh. With this test environment, we perform four experiments. The first experiment aims at finding suitable settings for the lookaside load balancing algorithm as well as a baseline load configuration for clients and servers. The second experiment evaluates the underlying network infrastructure to test for possible bias in latency measurements. The final two experiments evaluate each load balancing approach in both high and low load scenarios.

Results show that lookaside load balancing can achieve performance similar to client-side load balancing in terms of latency and load distribution, but with a smaller CPU and memory footprint. When load is high and uneven, or when compute resource usage should be minimized, the centralized proxy approach is better. With regards to traffic flow control and failure resilience, we can show that lookaside load balancing is better than client-side load balancing. We draw the conclusion that lookaside load balancing can be an alternative to client-side load balancing as well as proxy load balancing for some scenarios.

Keywords— load balancing, lookaside load balancing, external load balancing, service mesh, kubernetes, envoy, grpc

Sammanfattning

Då fler online-tjänster flyttas från monolitsystem till uppdelade distribuerade mikrotjänster, ökar behovet av intern lastbalansering. Idag existerar det två huvudsakliga tillvägagångssätt för intern lastbalansering mellan interna mikrotjänster. Ett sätt använder sig antingen utav en central- eller sido-proxy för att lastbalansera trafik över alla tillgängliga serverinstanser. Det andra sättet låter klienter själva välja vilken utav alla serverinstanser att skicka trafik till.

Denna studie undersöker ett nytt tillvägagångssätt kallat extern lastbalansering. Detta tillvägagångssätt består av en lastbalanserare som använder kontrollplanet för att hämta en lista av alla serverinstanser och deras aktuella last. Lastbalanseraren kan då dynamiskt tillsätta en delmängd av alla serverinstanser till klienter och låta dom skapa direktkopplingar. Tillsättningen av serverinstanser kontrolleras av en extern lastbalanseringsalgoritm. Denna studie presenterar en sådan algoritm som fungerar genom att ändra på tillsättningen av serverinstanser för att kunna hålla lasten mellan en övre och lägre gräns.

För att kunna jämföra dessa tre tillvägagångssätt för lastbalansering konstrueras och modelleras en testmiljö i Kubernetes till att vara lik ett riktigt service mesh. Med denna testmiljö utför vi fyra experiment. Det första experimentet har som syfte att hitta passande inställningar till den externa lastbalanseringsalgoritmen, samt att hitta en baskonfiguration för last hos klienter och servrar. Det andra experimentet evaluerar den underliggande nätverksinfrastrukturen för att testa efter potentiell partiskhet i latensmätningar. De sista två experimenten evaluerar varje tillvägagångssätt av lastbalansering i både scenarier med hög och låg belastning.

Resultaten visar att extern lastbalansering kan uppnå liknande prestanda som klientlastbalansering avseende latens och lastdistribution, men med lägre CPU- och minnesanvändning. När belastningen är hög och ojämn, eller när beräkningsresurserna borde minimeras, är den centraliserade proxy-metoden bättre. Med hänsyn till kontroll över trafikflöde och resistans till systemfel kan vi visa att extern lastbalansering är bättre än klientlastbalansering. Vi drar slutsatsen att extern lastbalansering kan vara ett alternativ till klientlastbalansering samt proxylastbalansering i vissa fall.

Nyckelord— lastbalansering, extern lastbalansering, service mesh, kubernetes, envoy, grpc

Acknowledgement

First of all, I would like to thank my Spotify mentor and supervisor Richard Tolman, for supporting me during my time at the company. His help with feedback, proof-reading and encouragement was crucial for my progress and I will be forever grateful. I would also like to thank my manager Matthias Grüter, for helping me brainstorm the ideas which finally led to the proposal for this thesis project. Finally, I want to send thanks to every member of my team Fabric, for pushing me all the way to the finish line and reminding me to focus on what is important rather than what is most fun in the moment. I am very much looking forward to continuing to work with all of you.

Contents

1 Introduction 1
  1.1 Background ...... 1
  1.2 Problem definition ...... 3
    1.2.1 Objectives ...... 3
    1.2.2 Delimitation ...... 4
  1.3 Ethics and sustainability ...... 4
  1.4 Outline ...... 5

2 Background 6
  2.1 Terminology ...... 6
  2.2 Cloud computing today ...... 7
  2.3 Evolution of service mesh infrastructure ...... 7
    2.3.1 Stateless micro services ...... 8
    2.3.2 Containerization ...... 9
    2.3.3 Container orchestration ...... 10
    2.3.4 Service mesh ...... 10
  2.4 The load balancing problem ...... 11
  2.5 Related work ...... 12
    2.5.1 Load balancing algorithms ...... 12
    2.5.2 Proxy load balancing solutions ...... 15
    2.5.3 Using a proxy load balancer as a sidecar ...... 16
    2.5.4 Future of load balancing control plane ...... 17
    2.5.5 Low-level application load balancing ...... 18
  2.6 Summary ...... 18

3 Methodology 19
  3.1 Evaluation using cloud based testing ...... 19
  3.2 Design of test environment ...... 20
    3.2.1 Selection of test protocol ...... 20
    3.2.2 Description of service mesh infrastructure ...... 21
    3.2.3 Metrics monitoring ...... 23
    3.2.4 Selection of proxy load balancer ...... 26
    3.2.5 Load generator client service ...... 27
    3.2.6 Dynamic load server service ...... 27
    3.2.7 Test environment overview ...... 28
  3.3 Implementation of lookaside load balancer ...... 30
    3.3.1 Desired properties of algorithm ...... 30
    3.3.2 Control plane protocol ...... 34
    3.3.3 Algorithm ...... 36
  3.4 Summary ...... 42

4 Experiments 43
  4.1 Discovery of baseline configuration ...... 43
  4.2 Evaluation of network latency ...... 44
  4.3 Testing scenarios ...... 44
    4.3.1 Stable load ...... 45
    4.3.2 Increased load ...... 45
  4.4 Summary ...... 47

5 Results 48
  5.1 Baseline configuration ...... 48
    5.1.1 Selecting the configuration ...... 49
    5.1.2 Selecting the algorithm parameters ...... 50
  5.2 Network latency ...... 50
  5.3 Testing scenarios ...... 51
    5.3.1 Stable load ...... 51
    5.3.2 Increased load ...... 55

6 Discussion 57
  6.1 Threats to validity ...... 57
    6.1.1 Resource usage ...... 57
    6.1.2 Limited test scenarios ...... 58
  6.2 Importance of load balancing strategy ...... 58
    6.2.1 Load balancing resource usage ...... 58
    6.2.2 Latency differences ...... 59
    6.2.3 Throughput ...... 60
    6.2.4 Server load distribution ...... 60
    6.2.5 Traffic flow control ...... 61
    6.2.6 Failure resilience ...... 62

  6.3 Scaling properties ...... 62
    6.3.1 Proxy load balancing ...... 63
    6.3.2 Lookaside load balancing ...... 63

7 Conclusions 64
  7.1 Viability of lookaside load balancing ...... 64
  7.2 Future work ...... 65

Bibliography 66

Acronyms 72

A Protocol definitions 74
  A.1 Load Balancer ...... 74
  A.2 Load Reporter ...... 76

Chapter 1

Introduction

This chapter introduces the general topic of this study and presents three questions that aim at evaluating lookaside load balancing in a service mesh environment. We also explain ethical aspects of this field that should be considered important.

1.1 Background

Today, many online services have strong requirements for availability and performance. To fulfill these requirements, a now common approach is to subdivide an online service into many distributed services that each are able to provide some feature or function of the end service. This way, the end service achieves a higher resilience towards failures and is able to dynamically scale only the components that account for the highest amount of processing. By changing to a network of many small services that interact with each other, we introduce many new challenges. One of these challenges is how to efficiently route and distribute traffic internally between different services. This is known as the concept of internal load balancing. The internal aspect of this concept follows from the fact that all clients will be other developer-controlled systems, rather than any potentially malicious external user.

Currently there exist two main approaches to load balancing traffic to a distributed service [1]. Firstly, there is proxy-based load balancing, where clients send all traffic to a proxy server that then looks at the packet and decides which backend endpoint of the distributed service to forward the request to. The proxy usually looks at either the network or the application layer of the packet in order to decide where to forward it. While this approach allows for a wide set of features and control over traffic flow, it can in some cases come at the cost of increased latency.

The second approach is so called client-side load balancing. In this approach the clients have knowledge of the backend endpoints for some service using service discovery mechanisms. With knowledge of the backend endpoints, the clients can then connect to them directly and load balance between them using some simple load balancing algorithm such as round-robin. While this approach should give clients lower latencies due to fewer network hops being required, there are potential drawbacks with regards to both the load balancing feature set and traffic flow control [1].

Lookaside load balancing [1] may be seen as a sort of hybrid approach between client and proxy load balancing. In this approach there exists a centralized server that clients can contact to get a shorter list of endpoints that serve some service. This way, connections are made directly from clients to servers, while at the same time most of the load balancing logic is moved to a server that then may support additional advanced features that control traffic flow (see Figure 1.1).

Figure 1.1: Three different approaches to load balancing client traffic (panels: Proxy, Client, Lookaside)

While there currently exists research into different load balancing algorithms and approaches [2], there is as of early 2020 no research into lookaside load balancing, or into how new load balancing algorithms on a lookaside load balancer, in combination with simpler load balancing algorithms performed on the clients, can improve load balancing as well as latency. This project will investigate the lookaside load balancing approach and compare it to existing client and proxy-based approaches.

1.2 Problem definition

In order to properly investigate and evaluate the lookaside load balancing approach, we must formulate an explicit question. This study thereby formulates the following research question:

Can lookaside load balancing act as a valid alternative to proxy/client based load balancing in an internal network of micro services?

While answering this question by itself should be sufficient for evaluating lookaside load balancing, the problem may be split up into three different parts, which will make it easier for the reader to understand the evaluation and the conclusions drawn:

1. With regards to latency and throughput (total answered requests per second), can lookaside load balancing achieve results comparable to existing client-side or proxy-based load balancing solutions in the following situations?

(a) Requests are processed in near-constant time.
(b) Requests are processed in non-constant time.
(c) A subset of endpoints does not respond and failover must occur.

2. Can lookaside load balancing be utilized to achieve traffic flow control comparable to the traffic flow control possible with proxy-based load balancing?

3. With regards to memory and CPU usage, can lookaside load balancing be achieved without a negative performance impact on the client’s side?

1.2.1 Objectives

In order to answer these questions, there are a number of objectives that this study will focus on.

The first objective will be to design and implement a methodology for evaluating the different load balancing approaches with regards to service mesh environments. The main goal for this methodology will be that any conclusions drawn should be applicable in a real service mesh environment. Another important goal is that the metrics required to answer the proposed questions are available and that the method for retrieving these metrics is reproducible.

The second objective will be to design and implement a lookaside load balancer and algorithm such that the lookaside load balancing approach may be compared with the client and proxy-based approaches. This entails looking at existing load balancing algorithms as well as defining desired properties of any lookaside load balancing algorithm. The main goal of the implemented algorithm will be that it should be able to demonstrate capabilities of the lookaside load balancing approach for some realistic scenario.

1.2.2 Delimitation

This project focuses only on comparing the lookaside load balancing approach with the client and proxy-based load balancing approaches. This means that we can define a delimitation of what this project will not focus on.

• This study will not directly focus on the performance of different algorithms and will instead only use well-known algorithms for the client and proxy-based approaches.

• Since the focus is on each load balancing approach, the service mesh environment used for testing does not need to be modeled as a real service mesh environment, as long as it may be motivated why components of such an environment are not relevant to the conclusions drawn.

• The algorithm implemented for lookaside load balancing does not need to be suitable for real environments and may be optimized only for comparison purposes. However, it should be motivated why the implemented algorithm is suitable for comparison with the remaining load balancing approaches and algorithms.

1.3 Ethics and sustainability

As large scale computation in the cloud becomes cheaper, it also becomes increasingly feasible to meet high application load requirements by over-provisioning the computational capacity of the application [3]. While such solutions are simple in nature and require minimal developer effort, there are important ethical and sustainability implications that should be taken into account.

The main implication of solving system load requirements using over-provisioning is the waste of energy. In order to scale a system, we either add computational machines or allocate additional resources to existing machines. Both of these approaches introduce computational overhead, and by performing avoidable allocation of resources we artificially contribute to the need for physically scaling the cloud platform. Electrical energy usage of cloud data centers worldwide was estimated to make up more than 1.4% of global usage in 2014 [4]. By avoiding over-provisioning in large cloud systems we could therefore achieve a measurable impact on energy consumption, which in turn closely relates to greenhouse gas emissions and environmental impact. Thereby, we realize that efficient load balancing solutions could help improve resource utilization, which would directly reduce our carbon footprint and environmental impact.

1.4 Outline

Chapter 2 gives an introduction to cloud computing, the concept of micro services and service mesh environments. Related work within the load balancing field is also presented and explained. Chapter 3 describes and motivates the methodology used to evaluate each load balancing approach as well as the methodology used to create the lookaside load balancer. Chapter 4 presents four experiments that aim at answering our three questions. The first experiment finds suitable parameters for our load balancing algorithm and a baseline configuration for clients. The second experiment investigates any potential bias in network latency. The final two experiments test high and low load scenarios in order to help us evaluate each load balancing approach. Chapter 5 presents the results gathered during each of these four experiments as well as tests for statistical significance. Chapter 6 discusses how our findings relate to our questions and presents any threats to the findings' validity. Chapter 7 presents a concluding summary of the findings as well as answers to the posed questions of this study.

Chapter 2

Background

This chapter provides background on cloud computing and how service mesh environments have evolved, and presents related work and state of the art load balancing solutions and algorithms such as round robin and sidecar proxy load balancing.

2.1 Terminology

This section contains a list of some of the key terms and definitions that are used. Readers are expected to have a basic understanding of software development and of concepts within algorithms and complexity.

Term — Definition
cloud — A global and managed platform for on-demand compute and storage resources
(micro) service — An often stateless, ad-hoc application that serves some decoupled part of a larger system or application
service mesh — A network of services that interact with each other through the means of a dedicated infrastructure layer
proxy — A server that acts as an intermediary and requests service resources on clients' behalf
endpoint — An IP address and port pair that describes a network location that serves some distributed service
control plane — A system capable of dynamically configuring services or the flow of data between them
data plane — The flow of application data across the network fabric


2.2 Cloud computing today

During the 1980s we saw a large shift from central computing systems to personal computers [5]. This paved the way for the rise of applications designed to run locally on personal computers. Later, during the 1990s, when the internet experienced a surge in growth, applications started to utilize this connectivity to allow collaboration and communication using a client-server model. Servers consisted of application code that ran on so called "on-premises" hardware systems connected to the internet.

During the late 00s, companies such as Amazon had gained well-founded experience with running and maintaining large datacenters and were thereby able to offer on-demand virtual compute machines with high availability and reliability to many customers. The platform for running such on-demand resources quickly became known as the cloud [5]. High maintenance costs and scalability limits for on-premises systems prompted more application owners to move their systems to cloud based infrastructure, which at the time was growing in popularity [5].

In the last decade, we have been seeing what might be the next step in this development, where applications not only run on cloud infrastructure but are also modeled as cloud native [6]. Cloud native applications are applications which are designed to take advantage of cloud based offerings that solve many common issues within software development, such as scalable persistent storage or global client facing load balancing [6]. This study was performed within a cloud native environment, and one relevant aspect is how cloud based network infrastructure affects latency when load balancing using different approaches.

2.3 Evolution of service mesh infrastructure

In combination with the transition from on-premise systems to cloud native systems, there have also been major developments in how services are developed and deployed [7]. Before the introduction of the microservice model, online services were commonly deployed as monolithic systems running either on a Virtual Machine (VM) or on bare metal hardware [7]. Scaling can be achieved in such monolithic systems by replicating machines. This scaling process can however not handle higher degrees of dynamic traffic because of slow initialization and deployment [7]. This slowness follows from the time it takes to install and configure large systems on new machines. A final aspect that prompted the move from monoliths is the complexity and difficulties in the development process [7]. With monolithic systems, the cost and risk of development is high because developers are required to learn and work with large components of the system that are tightly coupled [7]. A property that also follows from this is that introduced bugs could bring down the entire system.

Today, this monolithic type of infrastructure is gradually being replaced by a number of modern software development concepts such as the micro service model, containerization and service mesh infrastructure [8]. This section aims at giving a basic understanding of these concepts as well as showing how the need for load balancing has arisen within this type of modern software development.

2.3.1 Stateless micro services

As mentioned in Section 2.3, tightly coupled components of a monolithic system constitute the main cause of developer complexity and high risk for system wide failures. The thought behind the micro service model is to decouple these components into separate systems or services of their own [6]. By providing an interface for interacting with each micro service, the total system is still able to achieve equal functionality. This interface is usually some application-layer protocol such as Hyper Text Transfer Protocol (HTTP) that in turn uses network infrastructure for communication access. While the network communication between micro services introduces some response time overhead in the full system compared to the previous monolith approach, the benefits gained by the micro service model outweigh this drawback [6]. With the micro service model an important benefit is the ability to scale components of the system that account for high usage [6]. From this ability there also follows a need for efficient internal load balancing for components of the system that scale. There are however aspects of micro services that require attention when considering an internal load balancing solution.

A stateful micro service indicates that some or all requests to that service are dependent on some local state in the service (for example user session information) [6]. This means that if multiple requests from a client are sent to different instances of a distributed micro service, there exists a possibility that requests fail due to missing required state. This aspect poses many difficulties later when designing load balancing for service mesh infrastructure, so a common solution today is to only allow stateless services, where there either is no need for local state or state has been moved to a centralized storage location [6]. In practice this means that client requests are independent of any particular instance of the micro service and the load balancing algorithm can thereby choose to send requests to any available instance. This study has thereby also opted to only include load balancing for stateless micro services, as this is the common industry practice today.

2.3.2 Containerization

While the concept of stateless micro services allows for higher redundancy and scalability, we also must define how this concept may be implemented, as well as explain why multiple layers of abstraction are important for service mesh infrastructure. As previously mentioned in Section 2.3, the monolith approach includes installing and running the system application directly on either a bare metal or virtual machine. With only one application running on a machine, this application may utilize all the machine resources without the concern of affecting any other applications or other parts of the system [9]. Another property of running only one application per machine is the avoidance of interference such as network port collisions or conflicts in configuration of common application dependencies [10]. When monolithic systems are migrated to smaller micro services, we still wish to maintain this application non-interference property and assure that micro services can run independently without affecting or restricting each other [9].

There are two main approaches to solve this issue with micro services. Either we may choose to install and deploy micro services directly on machines as in monolithic systems, or we choose to create mechanisms that allow multiple applications to run on the same system without the possibility of interference. The first approach causes an overhead, as the ratio between the cost to run the machine operating system and the cost to run the application increases the smaller the application is [11]. This means that there exists a computational waste that increases the more micro services are run on their own machines. Because of this, the second approach is preferable from a cost and environmental standpoint. To allow running multiple micro services on the same machine we introduce the concept of containerization.

Containerization today works by utilizing existing operating system mechanisms that limit resource usage and system overview for certain processes [11]. Applications and all their dependencies (including any OS dependencies) are packaged up in a bundle of binaries called an image. This image may then be installed in an isolated part of the file system for some machine, and when the application is launched it is given its own isolated view of this machine and its resources [11]. This method allows applications to run on an abstract virtual operating system (called a container) while at the same time allowing the binaries to run natively on the underlying machine.

By creating this abstraction, we are able to run many different applications on the same machine without concern of interference. For service mesh infrastructure this means that any components, such as load balancing components, may be deployed directly on existing machines just as any micro service would. This removes the need for specific load balancing machines and allows this study to easily deploy and test different load balancing solutions in a service mesh environment.

2.3.3 Container orchestration

While the possibility of running applications in containers allows deployment on any machine, there still exists the issue of how to efficiently deploy and maintain a large number of distributed applications. This issue may be solved using a concept known as container orchestration [11]. For the purpose of this study, we do not consider it relevant to give an in-depth explanation of different container orchestration solutions and how they work. However, since service mesh infrastructure and the test environment constructed in this study are dependent on container orchestration, we choose to give a brief explanation. Container orchestration is a concept that allows deploying and scaling a distributed application without requiring knowledge of underlying hardware resources [11]. Kubernetes is an example of container orchestration, and the use of Kubernetes will be described in further detail in Section 3.2.2.

2.3.4 Service mesh

Using containers and container orchestration, we have the possibility of deploying and maintaining a large number of distributed micro services capable of communicating with each other [12]. While communication between micro services is possible using any network based protocol, there are a number of challenges that arise with regards to this communication. As the micro services grow in numbers or scale in size, there is a rise in complexity as to how these micro services discover each other and as to how traffic is distributed between instances of some micro service deployment. To solve this issue we introduce the concept of service mesh infrastructure.

A service mesh describes the network of micro services and the interactions between them. Service mesh infrastructure includes a number of components that solve issues such as discoverability, load balancing, monitoring or failure recovery [12]. For more advanced use cases, service mesh infrastructure may even provide components that provide micro services with solutions for authentication, authorization, rate limiting, canary deployments or A/B testing [12]. The purpose of a service mesh is to provide a platform where micro services only require application-specific logic [12]. This means that the cost and overhead of development may be reduced while at the same time increasing safety and reliability. This study will focus on the load balancing aspect of service mesh infrastructure. Today there exist production-ready solutions for service mesh infrastructure. Two such solutions are the Linkerd and Istio projects [12]. Each of these projects provides load balancing solutions that utilize the concept of sidecar proxies for adding load balancing support to any micro service that uses either HTTP or HTTP2 for communication. The sidecar approach will be explained in Section 2.5.3 and an explanation as to why this project has opted to not use any service mesh solution may be found in Section 3.2.2.

2.4 The load balancing problem

As more large scale monolithic systems move towards service mesh infrastructure, the need for efficient load balancing between micro services increases. But in order to understand the complexity of achieving efficient load balancing, we first must define and explain the general load balancing problem. This problem is also known as makespan minimization and can be defined as follows [13][14]:

Problem:

Input: $m$ identical machines and $n$ jobs, where the $i$-th job has processing time $t_i$.
Goal: Schedule the jobs such that
  • jobs run contiguously on a machine,
  • a machine processes only one job at a time,
  • the makespan, i.e. the maximum load on any machine, is minimized.

Definition: Let $A(i)$ be the set of jobs assigned to machine $i$. The load on machine $i$ is $T_i = \sum_{j \in A(i)} t_j$, and the makespan of the assignment $A$ is $T = \max_i T_i$.

This problem may then be defined as the following decision problem:

Decision problem: Given $n$, $m$, $t_1, t_2, \ldots, t_n$ and a target $T$, is there a schedule with makespan at most $T$?

This decision problem can be reduced from Subset Sum [14], and since verification of a given solution can trivially be done in polynomial time $O(n)$, we can draw the conclusion that the general load balancing problem is NP-complete. Assuming that $P \neq NP$, this means that there exists no polynomial time algorithm that can solve optimal load balancing. An important note on the definition of this general load balancing problem is that the processing time is assumed to be known for each job. If we were to map this general load balancing problem to application based load balancing, jobs would correspond to single client requests and job processing time would correspond to the CPU time required by the server to process that request. Since the CPU processing time required to process any given client request is difficult to calculate or predict, we may draw the conclusion that application-layer load balancing involves additional complexity compared to general load balancing. This additional complexity means that the design of a lookaside load balancing algorithm can be considered non-trivial.
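To make these definitions concrete, the following minimal sketch (an illustration written for this text, not code from the study) computes the per-machine loads $T_i$ and the makespan $T$ for a small hypothetical assignment of jobs to machines:

```go
package main

import "fmt"

// makespan computes the per-machine loads T_i and the makespan T = max_i T_i
// for a given assignment of job processing times to machines.
func makespan(assignment [][]float64) (loads []float64, T float64) {
	loads = make([]float64, len(assignment))
	for i, jobs := range assignment {
		for _, t := range jobs {
			loads[i] += t // T_i is the sum of t_j over jobs j assigned to machine i
		}
		if loads[i] > T {
			T = loads[i]
		}
	}
	return loads, T
}

func main() {
	// Toy instance: m = 2 machines, n = 4 jobs with processing times 3, 1, 2, 2.
	assignment := [][]float64{{3, 1}, {2, 2}}
	loads, T := makespan(assignment)
	fmt.Println(loads, T) // [4 4] 4 — both machines carry load 4, so the makespan is 4
}
```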

2.5 Related work

While this study does not intend to focus on any particular load balancing algorithm or solution, there is still value in surveying state of the art research within these fields. This section aims at giving the reader an understanding of some common load balancing algorithms and solutions that are used within modern load balancing.

2.5.1 Load balancing algorithms

When looking at load balancing algorithms, we first must define how the environment affects which load balancing algorithms may be used. There are two types of environments that affect which load balancing algorithm may be used.

Firstly, there is the static environment type, where we can predict client behavior to some degree, and where there exists a fixed number of homogeneous resources. This means that the load balancing algorithms will have knowledge of each available server in advance and that their total capacity is constrained from changing during runtime. A property that follows from a static environment is that client traffic is also not allowed to exceed some value and is expected to stay near constant.

The second type of environment is the dynamic environment, where client behavior might not be predictable and where there is a non-fixed number of heterogeneous resources. In these environments the load balancing algorithms must be capable of handling increases and decreases in client traffic, or a change in the size or number of servers that should be load balanced across [15]. Dynamic environments are most suitable for describing a service mesh environment because of the built in concept of scalability. When designing algorithms for lookaside load balancing we must therefore start by looking at the requirements and constraints of load balancing in dynamic environments.

This section describes some of the most used and researched load balancing algorithms today. By doing this we may gain some insight into the challenges that algorithms face and what approaches exist to solve some of these challenges.

Round-robin

One of the most-used load balancing algorithms today is the round-robin algorithm [15]. The algorithm works by queueing each endpoint and then sending requests to each endpoint in order while moving used endpoints to the back of the queue [15]. This process may also be described as cyclically iterating over the list of endpoints and sending one request to each endpoint at a time. The resulting effect of this algorithm is an equal distribution of requests sent to each endpoint. This algorithm does thereby not take into account the time required to process each request, or how failures or slow requests can cause a build up in the number of outstanding requests for some endpoint. Generally, round-robin is most suitable for static environments; however, because of its speed and simplicity, it may also be suitable for scenarios where requests are processed in a limited or near constant time [15]. This algorithm will be widely used within this study as it is the default load balancing algorithm used in the gRPC framework (see Section 3.2.1 for information on gRPC with regards to this study).
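As an illustration of the selection logic described above (a minimal sketch, not code from the study; the endpoint addresses are hypothetical), a round-robin picker can be implemented as a cyclic index over the endpoint list:

```go
package main

import "fmt"

// rrPicker cycles through a fixed list of endpoints, returning one per call.
type rrPicker struct {
	endpoints []string
	next      int
}

// pick returns the next endpoint in cyclic order.
func (p *rrPicker) pick() string {
	e := p.endpoints[p.next]
	p.next = (p.next + 1) % len(p.endpoints)
	return e
}

func main() {
	p := &rrPicker{endpoints: []string{"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"}}
	for i := 0; i < 6; i++ {
		fmt.Println(p.pick()) // each endpoint is printed twice, in order
	}
}
```

Note that the picker keeps no per-endpoint state at all, which is exactly why it cannot react to slow or failing endpoints.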

Weighted Least Request

Another widely used load balancing algorithm is the weighted least request algorithm. The thought behind this algorithm is to minimize the number of outstanding requests for any endpoint such that the peak load for every endpoint is reduced. The algorithm works differently depending on whether each endpoint is considered to have the same weight or not. For dynamic service mesh environments, each endpoint of some micro service should be considered equal in capacity and thereby the weights should also be equal. In this case the algorithm can achieve O(1) performance by randomly sampling two endpoints and selecting the endpoint with the fewest outstanding requests. This is called the power of two choices and has been shown to perform nearly as well as an O(N) full scan for the endpoint with the least number of outstanding requests [16]. This algorithm is mainly suitable for centralized proxy load balancing since, for example, a large number of independent clients using this algorithm might select the same endpoint based on local knowledge, thus causing a peak of traffic to that endpoint.
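The following is a minimal sketch of the power-of-two-choices variant described above (an illustration with hypothetical endpoints and counts, not the study's or any proxy's implementation): two distinct endpoints are sampled at random and the one with fewer outstanding requests is chosen.

```go
package main

import (
	"fmt"
	"math/rand"
)

// endpoint tracks the number of outstanding (in-flight) requests.
type endpoint struct {
	addr        string
	outstanding int
}

// p2cPick samples two distinct endpoints uniformly at random and returns the
// one with fewer outstanding requests ("power of two choices").
func p2cPick(eps []endpoint) *endpoint {
	a := rand.Intn(len(eps))
	b := rand.Intn(len(eps) - 1)
	if b >= a {
		b++ // shift to guarantee the two samples are distinct
	}
	if eps[a].outstanding <= eps[b].outstanding {
		return &eps[a]
	}
	return &eps[b]
}

func main() {
	eps := []endpoint{{"10.0.0.1:50051", 4}, {"10.0.0.2:50051", 1}, {"10.0.0.3:50051", 7}}
	chosen := p2cPick(eps)
	chosen.outstanding++ // the caller would decrement this again when the request completes
	fmt.Println(chosen.addr)
}
```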

Random

A simple random-based load balancing algorithm that distributes traffic uniformly at random may be beneficial for some use cases. As with the round-robin algorithm, it is well suited for static environments and has a low computational impact, since it consists only of selecting random hosts from a uniform distribution. For the scenario of failing endpoints, a random approach could perform better than round-robin as it avoids bias towards endpoints in the queue that come after the failing endpoint. Also, as with round-robin, this algorithm may be used in some dynamic environments where request processing time is limited on the server or other mechanisms protect the endpoints from overload.

Exponentially Weighted Moving Average

As previously mentioned, existing service mesh infrastructure already provides some load balancing solutions. The Linkerd service mesh utilizes an algorithm known as Exponentially Weighted Moving Average (EWMA) [17]. This algorithm works by inferring endpoint load from response time. It does this by keeping an exponentially rolling average of the response time to each endpoint and prioritizing endpoints with lower averages when sending requests. While response time might be affected by other aspects such as underlying network congestion, this algorithm has shown some promise for reducing both client latencies and server load [18].
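A minimal sketch of the EWMA idea follows (a generic illustration, not Linkerd's implementation); a picker would prefer the endpoint with the lowest current score. The smoothing factor and sample values are arbitrary choices for the example.

```go
package main

import "fmt"

// ewma keeps an exponentially weighted moving average of observed response
// times; a lower value indicates a less loaded (or faster) endpoint.
type ewma struct {
	alpha float64 // smoothing factor in (0, 1]; higher values weight recent samples more
	value float64
	init  bool
}

// observe folds a new response-time sample (in milliseconds) into the average.
func (e *ewma) observe(sampleMs float64) {
	if !e.init {
		e.value, e.init = sampleMs, true
		return
	}
	e.value = e.alpha*sampleMs + (1-e.alpha)*e.value
}

func main() {
	e := &ewma{alpha: 0.3}
	for _, rtt := range []float64{10, 12, 50, 11, 9} { // one slow outlier among fast responses
		e.observe(rtt)
	}
	fmt.Printf("score: %.1f ms\n", e.value) // the outlier decays instead of dominating the score
}
```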

Algorithms for stateful services

While stateful service load balancing is not directly covered in this study, it may be of interest for the reader to learn how this may be achieved, since it affects how we view the possibilities of different load balancing approaches. In order to link service clients to a certain state on some service endpoint, many algorithms utilize efficient hashing functions. By hashing certain fields of client requests and defining how these hashes map onto service endpoints, client connections to endpoints may be tracked and made persistent even through a load balancer. There has been much research within this type of load balancing algorithm, and one example is the Maglev algorithm, which has been used extensively for many services since 2008 [19].
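The general idea can be illustrated with a simple hash-based picker (a sketch for illustration only; it is deliberately much simpler than Maglev and, unlike consistent hashing, remaps many keys when the endpoint list changes):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// pickByKey hashes a request field (e.g. a session or user ID) and maps the
// hash onto the endpoint list, so the same key keeps hitting the same endpoint
// as long as the list is unchanged. Real schemes such as Maglev use consistent
// hashing so that endpoint changes only remap a small fraction of keys.
func pickByKey(key string, endpoints []string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return endpoints[h.Sum32()%uint32(len(endpoints))]
}

func main() {
	eps := []string{"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"}
	fmt.Println(pickByKey("session-42", eps)) // deterministic for a given key and endpoint list
	fmt.Println(pickByKey("session-43", eps))
}
```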

Algorithms for advanced use cases

Another topic within the field of load balancing algorithms, which this study chooses to only mention briefly, is the topic of advanced load balancing algorithms. Research has provided a number of promising complex algorithms such as the nature based Honeybee Foraging algorithm [20]. While many such complex algorithms have shown promising results in simulation scenarios, research into load balancing challenges has discussed possible implications of production use [21]. The complex algorithms often have higher computational cost and information requirements that introduce delay and a drop in load balancing efficiency [21]. Thereby, it is proposed that load balancing algorithms should be designed in the simplest possible forms [21].

2.5.2 Proxy load balancing solutions

In the field of application-layer proxy load balancing, there exist two categories of production ready solutions. First, there are self-managed solutions and, secondly, there are fully managed solutions offered by cloud providers. This section will go over some of the widely used solutions in each of these categories and give the reader a basic understanding of what advantages and drawbacks different solutions have.

Self-managed solutions

Self-managed solutions are mainly open source projects that developers may configure and deploy directly in their production environment. As we have previously mentioned, deployment of such a solution may be easily achieved using the feature set provided by a service mesh. Thereby, the main aspect of self-managed solutions is the feature set they provide in contrast to the developer management required to maintain such solutions.

Examples of self-managed load balancing solutions that have seen wide production use include HAProxy [22], Nginx [23] and, more recently, Envoy, which is a project that is part of the Cloud Native Computing Foundation [24]. All these projects provide functionality to load balance using different load balancing algorithms as well as allow for traffic policies or security features that could improve service availability and security [12]. The main cost of these features is the complexity of configuring such solutions, and while future managed control plane infrastructure could solve some of this complexity, it is something that should be considered when comparing load balancing solutions.

Fully managed solutions

With a managed load balancing solution, we give up some of the transparency, feature control and monitoring capabilities that self-managed open-source solutions provide in order to achieve larger scale load balancing at a potentially lower cost to develop and maintain. This study did not consider any fully managed solutions because of their lack of implementation transparency. However, we acknowledge that there exist a number of products such as Amazon Elastic Load Balancing (ELB) [25], the Google Front End (GFE) or Internal Load Balancer (ILB) [26], as well as the Azure Load Balancer (ALB). These products provide a feature set comparable to self-managed solutions with regards to load balancing and traffic flow, and could in theory also be utilized in service mesh environments if we regard the subset of offered products that optimize for internal networks rather than external ones.

2.5.3 Using a proxy load balancer as a sidecar

With the introduction of service mesh infrastructures and the importance of load balancing in such settings, a new concept of proxy load balancing has been proposed and adopted by existing service mesh solutions such as Istio or Linkerd [27]. This is the concept of sidecar proxy load balancing, where a self-managed proxy load balancer is deployed alongside each instance of a micro service.

In the context of this report we will choose to consider this approach as proxy load balancing; however, there are some fundamental differences between traditional proxy load balancing and sidecar proxy load balancing. By utilizing containerization and the deployment capabilities given by container orchestration, we are able to attach a resource thin proxy to the local network interface of each deployed micro service instance. In order for a client service to connect to a server endpoint, it does thereby not need to discover or connect to a centralized proxy in the network. Instead, the client connects to a known port on its local network interface and sends requests there to be load balanced to server endpoints known by the proxy. This form of proxy load balancing has the main benefit of reduced Round Trip Time (RTT) latency, since a network hop on the local network interface is negligible compared to the network hop to any centralized proxy (see Figure 2.1).

Sidecar proxy-based load balancing is a part of the proxy load balancing approach which has not been widely researched in terms of CPU, memory and latency impact. Because of this, load balancing via a sidecar proxy is an approach that can also be considered within the scope of this study when comparing lookaside load balancing with state of the art proxy and client based solutions.

Figure 2.1: Difference between non-sidecar and sidecar proxy approaches (panels: Normal Proxy, Sidecar Proxy)

2.5.4 Future of load balancing control plane

While the concept of lookaside load balancing has not yet been considered by research today, there has still been development within control plane protocols which allow lookaside load balancing to be achieved. These protocols allow sending and receiving information about load and failures from both clients and servers, as well as dynamically changing a client's view of the server endpoints. These features are provided by the grpclb control plane protocol [28], which this study will utilize to implement and test lookaside load balancing (see Section 3.3.2).

A more recent control plane protocol that provides even more advanced load reporting and traffic flow control is the xDS protocol developed and available in the Envoy proxy load balancer [29]. While this protocol is mainly developed for use in Envoy, it is generic in nature such that it can be used to achieve lookaside load balancing without the need of any proxy. Currently, the xDS protocol is in the process of replacing the grpclb protocol, and the recently published Traffic Director product by Google provides a control plane for xDS [30]. Traffic Director can as of today function as a lookaside load balancer. However, its current implementation seems to primarily focus only on load balancing based on regional locality and capacity [30].

2.5.5 Low-level application load balancing

The main drawback of application-layer load balancing is the added computational cost compared to network layer load balancing. This added computational cost consists partly of the additional packet parsing required and partly of the fact that the parsing takes place in the user space of the operating system. Network layer load balancing is however usually implemented either in network hardware or in the lower levels of the kernel's network stack. The P4 programming language aims at allowing application-layer parsing and forwarding to take place directly on networking hardware such as routers or switches [31]. It is a domain-specific language containing constructs that optimize for network packet parsing and forwarding. This could in theory allow networking hardware to become protocol-independent [31]. The impact of this could be that application-layer protocols and load balancing algorithms see more widespread use in the future because of the significant performance increase that would follow from such a solution [31].

2.6 Summary

A service mesh describes a network of micro services and the interactions between them. Service mesh infrastructure consists of different components that solve, for example, discoverability, load balancing, monitoring or failure recovery. The load balancing problem is NP-hard. State of the art service mesh load balancing consists of algorithms such as round robin or weighted least request. These algorithms are implemented in proxies such as Envoy. The proxy is deployed either centrally or alongside the application as a sidecar in order to reduce latency.

Chapter 3

Methodology

This chapter explains the methodology used for constructing a metrics-gathering test environment in Kubernetes, as well as a lookaside load balancer and algorithm based on load reporting and load boundaries.

3.1 Evaluation using cloud based testing

There exist two main performance evaluation platforms for testing load balancing algorithms [32]. One approach is to evaluate performance using some simulation toolkit such as CloudSim, CloudSched or FlexCloud [32]. The other approach is to evaluate performance using experimentation on a real cloud platform such as Amazon Web Services (AWS), Google Cloud Platform (GCP) or Azure.

While simulation of cloud environments allows testing hypothetical scenarios with, for example, thousands of virtual machines, there are a few aspects that make simulations less appropriate for investigating the viability of Lookaside Load Balancing (LLB). A simulation toolkit works well in scenarios where the aim is to evaluate some load balancing algorithm mainly with regards to throughput metrics. That is, testing how well different algorithms distribute traffic in settings of different load, capacity and failures. This work does not intend to investigate algorithmic performance but rather load balancing performance when utilizing algorithms in different components of a micro service mesh. This means that metrics such as latency Round Trip Time (RTT), CPU and memory for client, server and load balancer are of interest. A simulation toolkit cannot properly model the complex dynamics of cloud data centers that affect these metrics, and therefore we rather look towards the second performance evaluation platform, which consists of real cloud platforms such as GCP, AWS or Azure [32].

Evaluation of load balancing performance using a cloud provider includes creating tools and frameworks that model real world scenarios and allow gathering the metrics previously mentioned. This chapter will discuss how a test environment was constructed in Google Cloud Platform with the intent of evaluating the performance of Client-side Load Balancing (CLB), Proxy Load Balancing (PLB) and Lookaside Load Balancing (LLB).

3.2 Design of test environment

In this study we have opted to use a testing environment that consists of a simple service mesh set up on a Kubernetes cluster in Google Cloud Platform (GCP). The choice of GCP is motivated mainly by the higher Service Level Agreement (SLA) and lower cost of compute [33]. However, with regards to compute and network performance of different cloud providers, we work under the assumption that results produced in this report should in theory be reproducible in all major cloud providers with close to similar performance as in GCP [34].

3.2.1 Selection of test protocol

Since this work is delimited to application-layer load balancing, there exists a need to decide which application protocol should be used. The most common application-layer protocol used today is the Hyper Text Transfer Protocol (HTTP). While HTTP has well-established support for service to service communication using Representational State Transfer (REST) and JavaScript Object Notation (JSON) payloads, there are multiple reasons why this protocol is ill-suited for modern micro service communication [35]. The main reason for not opting for the HTTP approach is its poor performance due to plain text information transfer [35]. While its successor HTTP2 improves most of the performance issues of HTTP [36], we still need an efficient communication framework that allows micro services to communicate efficiently without the need for complex client libraries for different programming languages [35].

One communication framework that solves these issues is the gRPC Remote Procedure Call framework (gRPC) [37] [35]. The gRPC framework, developed and widely used by Google, utilizes protocol buffers that allow describing application calls and messages in a language-independent format [37].

Protocol buffers allow for the generation of client libraries capable of serializing and de-serializing application messages into a binary wire format that allows efficient transfer [37]. The gRPC framework has many built-in features. However, for this study we are mainly interested in the built-in load balancing features [1]. The framework supports client-side load balancing as well as lookaside load balancing using its own control plane protocol called grpclb [1]. Another feature that makes gRPC suitable for load balancing testing is that it uses HTTP2 as transport for the binary protobuf payload. This means that proxy load balancers that support HTTP2 are able to load balance gRPC traffic as well.
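For illustration, the following is a hedged sketch (not configuration taken from the test environment; the target name is hypothetical) of how a gRPC Go client can opt into the built-in round_robin policy through the service config when dialing:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// The dns resolver returns all endpoint addresses behind the name, and the
	// round_robin policy then load balances across them on the client side.
	conn, err := grpc.Dial(
		"dns:///echo-server.test.svc.cluster.local:50051", // hypothetical target
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	// Generated gRPC client stubs would be created from conn here.
}
```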

3.2.2 Description of service mesh infrastructure

A service mesh infrastructure consists of many complex components and concepts such as service discovery, load balancing, fault tolerance, traffic monitoring, circuit breaking, authentication and access control [8]. While there exist production-ready service mesh infrastructure solutions such as the Istio Service Mesh [27], this paper aims at investigating LLB as a suitable replacement for or extension to such solutions. Because of this decision, we have opted to use only a bare Kubernetes cluster as the foundation for our service mesh test environment. Kubernetes is a container-orchestration system that hides the complexity of managing a large number of virtual machines and allows for large-scale deployment, management and scaling of distributed micro services [38]. While this report does not intend to explain Kubernetes in major technical detail, it is important to understand some of the concepts that govern the deployment and networking of micro services and why those concepts could introduce bias in load balancing testing. We will explain these concepts briefly and then present a solution for reducing the bias that arises from the use of Kubernetes.

Basic understanding of Kubernetes concepts

The Kubernetes cluster itself consists of a master node Virtual Machine (VM) together with a number of worker nodes (VMs) that each have at least a container runtime, a kubelet service and a kube-proxy service [38]. The container runtime allows running multiple micro service instances originating from any number of distributed services on the same VM [38]. The kubelet service acts as a control plane link to the centralized master node, and finally there is the kube-proxy service that acts as an interface to the network interface of the VM [38]. The latter is important since every running instance of a micro service on each VM must get its own Linux network namespace and IP address that may be used to access that micro service from within the entire cluster [39]. The concept of a running instance of some distributed service is called a Kubernetes Pod. Each pod runs on a VM that in turn might have other pods from the same or a different service running on it. The Kubernetes master node finally has a pod scheduler service that distributes pods to different nodes with regards to available compute resources as well as high redundancy for the distributed services in cases of node failures.

Eliminating bias using the pod scheduling mechanism

As mentioned, each Kubernetes pod will receive its own IP address on the cluster network. This means that pods may communicate using any IP-based protocol. But, as also mentioned, it is possible that the VM of the underlying node also hosts pods from other services. A fact that follows from this property is that pods running on the same VM will share the same networking interface, even if they have different IP addresses internally. When pods on different VMs communicate, this means that IP packets are routed out through the host VM's network interface and through at least one network router in the cloud network topology [39]. In these scenarios the cloud network will introduce latency to client requests that we in turn may measure during testing [39]. However, if pods on the same VM communicate, the Kubernetes network configuration will allow the IP packets to be routed locally on the machine without the need for any external cloud network routers. In these scenarios any measured latency for client requests will be several orders of magnitude lower [39].

While this might be seen as a desired side-effect in production environments, it introduces a large degree of bias when evaluating latency of requests between services running in the same cluster. To remove this bias for load balancer testing, it is possible to split all the cluster nodes into groups or pools. After separating the nodes we are then able to manipulate the rules used by the pod scheduler so that clients and servers will always run on separate VMs. By using this method of node pool separation we are then able to argue for a consistent number of cloud network hops and consistent request latency when performing tests. An example of this issue may be seen in Figure 3.1.

[Figure omitted: two panels showing client and server pods placed on VMs, without node pool separation (clients and servers mixed on the same VMs) and with node pool separation (clients and servers on separate VMs).]

Figure 3.1: Importance of pod scheduling for LB testing
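One way to enforce this separation, sketched below under the assumption that the standard GKE node-pool label is used (the pool and image names are placeholders), is to set a node selector on the pod specification using the Kubernetes Go API types; the actual test environment may configure this through deployment manifests instead.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// clientPodSpec pins a load generator pod to a dedicated node pool so that
// client and server pods never share a VM. The node label key is the
// standard GKE node-pool label; the pool and image names are placeholders.
func clientPodSpec(pool string) corev1.PodSpec {
	return corev1.PodSpec{
		NodeSelector: map[string]string{
			"cloud.google.com/gke-nodepool": pool,
		},
		Containers: []corev1.Container{
			{Name: "load-generator", Image: "example/load-generator:latest"},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", clientPodSpec("client-pool"))
}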

3.2.3 Metrics monitoring

In order to answer the questions posed by this study, we need to define a set of metrics that should be gathered and analyzed during the cloud-based testing. This section describes which metrics were considered to be of interest as well as the capture and analysis methodology for those metrics.

Description of selected metrics

This study has opted for a fixed set of metrics that may be used to evaluate each of the load balancing approaches. By capturing CPU and memory usage for clients, servers and load balancers we aim at measuring the performance impact of each load balancing approach as well as allowing estimation of the compute cost. By capturing client-side Round Trip Time (RTT), we are able to measure the impact of additional network hops and load balancer processing. Finally, we also chose to monitor the throughput of data measured in Queries Per Second (QPS). The QPS metric will be valuable for determining the impact server endpoint failures have on client queries as well as giving insight into the theoretical recovery time for such scenarios. In order to capture and store these selected metrics we have chosen to combine existing monitoring solutions with ad-hoc modifications required to eliminate bias caused by approximation in those existing solutions.

Metrics storage in Google Stackdriver

By default, many metrics within GCP infrastructure are reported and stored in the Google Stackdriver monitoring platform [40]. This platform has support for storing, viewing and retrieving both GKE metrics and any custom metrics reported [40]. By utilizing this existing infrastructure, we were able to reduce the complexity and overhead of the remaining metrics infrastructure required for this study (as motivated and described within this section).

Usage of OpenCensus

Application metrics are not only a relevant aspect of this study but also important in production service mesh environments [41]. Implementing automatic retrieval of such metrics is difficult but essential [41]. One metrics and tracing framework that solves this issue is the OpenCensus framework [42]. While similar frameworks also exist, this study has opted for OpenCensus due to its integration with gRPC [41] and Google Stackdriver [40]. The framework will by default capture QPS, request RTT as well as response status [41]. The framework also has support for capturing and reporting any arbitrary custom metric, which was useful since, for reasons described below, we were unable to use the default latency capture included in OpenCensus.

Retrieving and analyzing latency data

There are important differences between gathering and analyzing different types of metrics. Metrics such as CPU or memory are gauge-type metrics, which means that each data point represents an instantaneous measurement in time. Metrics such as completed requests or error rates are either delta or cumulative metrics. In a delta metric, each data point represents the change in a value over a time interval, while in a cumulative metric each data point is a value accumulated over time. A common property of gauge, delta and cumulative metrics is their minimal footprint on the monitored resource. Gauge metrics work by non-intrusive, usually constant-time measurements of the resource, while delta and cumulative metrics work by incrementally increasing or decreasing a predefined set of counter values based on resource actions or data. Delta and cumulative metrics may be implemented using application interceptors [43] that capture measurements or properties of completed requests and store this information in counters that are regularly reported to the monitoring infrastructure.

One important issue with this approach occurs when measurements or properties of requests cannot be subdivided into a predefined set of counters. One example of such a measurement is latency. While the interceptor is able to capture the latency of each request as, for example, a 64-bit float value, this value can only be saved by either immediately reporting it to the monitoring system or temporarily storing each value in memory that grows linearly with the number of requests sent. Both options could be regarded as performance-intrusive as they directly affect both the CPU and memory metrics. OpenCensus and other metrics frameworks solve this issue by storing such metrics in a histogram data structure. An example of this process is shown in Figure 3.2, where a small latency histogram is used to capture latency data using constant memory and processing.

[Figure omitted: a client interceptor measures a request latency of 1.5 ms and increments the matching bucket (0.8-1.7 ms) of a latency histogram with intervals 0.0-0.1, 0.1-0.3, 0.3-0.8, 0.8-1.7, 1.7-3.0 and 3.0-∞ ms.]

Figure 3.2: Example of how to gather latency metrics with minimal footprint.
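A histogram such as the one in Figure 3.2 can be sketched as follows; the bucket boundaries mirror the figure and are illustrative only.

package main

import (
	"fmt"
	"sync/atomic"
)

// latencyHistogram records latencies into a fixed set of buckets, so memory
// and processing stay constant regardless of how many requests are measured.
type latencyHistogram struct {
	bounds []float64 // upper bounds in ms; the last bucket is open-ended
	counts []uint64
}

func newLatencyHistogram() *latencyHistogram {
	b := []float64{0.1, 0.3, 0.8, 1.7, 3.0}
	return &latencyHistogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Record increments the bucket matching the measured latency in milliseconds.
func (h *latencyHistogram) Record(latencyMs float64) {
	i := 0
	for i < len(h.bounds) && latencyMs >= h.bounds[i] {
		i++
	}
	atomic.AddUint64(&h.counts[i], 1)
}

func main() {
	h := newLatencyHistogram()
	h.Record(1.5) // falls into the 0.8-1.7 ms bucket, as in Figure 3.2
	fmt.Println(h.counts)
}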

The main drawback of this common approach is the approximation that takes place when latency is limited to a number of intervals. In a production environment this methodology will allow developers to notice increased latency in different percentiles. However, for the purpose of statistical analysis, the loss of information might affect result outcomes. Therefore, this study has chosen to divide the gathering of latency and other performance metrics into separate test executions. For latency testing, each latency value is stored temporarily in memory before being dumped to the metrics infrastructure using OpenCensus.

To analyze metric results of different load balancing approaches we perform a pairwise Wilcoxon signed-rank test [44]. This is a statistical test that may be used to compare two related samples and determine whether they originate from a common distribution [44]. If the samples can be determined to not originate from a common distribution, we may draw the conclusion that there exists a statistically significant difference in performance when comparing two load balancing approaches [44].
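For reference, the large-sample normal approximation commonly used for this test can be written as follows (this is a standard textbook form, not necessarily the exact variant used by a particular statistics package):

\[
W^{+} = \sum_{i:\,d_i > 0} R_i,
\qquad
z = \frac{W^{+} - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}}
\]

where $d_i$ are the paired differences (zeros discarded), $R_i$ is the rank of $|d_i|$ with ties given average ranks, and $n$ is the number of non-zero pairs.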

3.2.4 Selection of proxy load balancer

In order to evaluate the viability of lookaside load balancing in comparison with client-side load balancing and proxy load balancing, we need to decide upon a proxy load balancing implementation that may be used as a baseline. Testing more than one PLB is technically feasible. However, with regard to the questions and scope of this study, it should be sufficient to use only one PLB, given that there is motivation for why that PLB can be considered representative with regard to performance.

Motivation of decision to use Envoy

While there exist a number of high-performance open-source implementations of proxy load balancers (see Section 2.5.2), this study has opted to use Envoy proxy as the load balancer for baseline comparison. The decision is mainly motivated by the demonstrated performance of Envoy compared with other popular options [45]. Another motivating factor was that Envoy is a part of the Cloud Native Computing Foundation (CNCF), backed by major industry-leading companies such as Google [46]. The Google internal application-layer load balancer is also built using Envoy proxy [26], so we may work under the assumption that Envoy will have widespread use in large-scale production systems.

Configuration of Envoy

The configuration used consists of a small number of components. Firstly, there is a single HTTP/2 listener that listens for incoming connections on port 8000. This listener in turn has a route configuration that contains only one virtual host which will match any incoming host header (wildcard domain). This virtual host will route any requests targeting the path of the gRPC service to a statically configured cluster. The service cluster is finally configured to discover service endpoints using strict DNS that targets the A records of the service created by

the Kubernetes DNS system. The remaining configuration of the Envoy proxy consists of default values as described in the version 1.13 documentation [47]. We note that while the configuration could affect the performance results, it contains only the bare minimum components required for the experiments to be performed.

3.2.5 Load generator client service

In this study, we have chosen to implement a client and a server service capable of simulating near-real production loads. The decision to implement these custom micro services follows from the need to dynamically simulate a wide range of loads as well as to gather raw latency data, which in a real production setting would be considered infeasible. The client service is denoted the load generator service and it was implemented in the Go programming language [48] using gRPC version 1.26. The implementation allows setting the number of concurrent threads, each of which opens a gRPC connection to the server via some load balancing approach. The load generator will send queries approximately 1 KB in size containing a randomized payload to ensure that no caching systems interfere. The load generator also has options for dynamically setting the QPS that should be sent across all active threads. Note that if the number of threads is too low, the target QPS might be impossible for the load generator instance to achieve due to congestion caused by slow requests. Finally, the load generator has a built-in metrics interceptor that captures request response status and RTT and reports this to Stackdriver using OpenCensus as previously described. For gathering CPU and memory metrics the interceptor can be disabled.
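A minimal sketch of such a load generation loop (not the actual implementation) is shown below; sendQuery is a hypothetical placeholder for the gRPC call over each worker's connection, and the parameter values in main are examples only.

package main

import (
	"math/rand"
	"sync"
	"time"
)

// run spreads the target QPS evenly over a number of worker goroutines, each
// of which would hold its own gRPC connection in the real load generator.
func run(workers, targetQPS, payloadBytes int, stop <-chan struct{}) {
	perWorker := targetQPS / workers
	if perWorker < 1 {
		perWorker = 1
	}
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ticker := time.NewTicker(time.Second / time.Duration(perWorker))
			defer ticker.Stop()
			for {
				select {
				case <-stop:
					return
				case <-ticker.C:
					payload := make([]byte, payloadBytes)
					rand.Read(payload) // randomized ~1 KB payload to defeat any caching
					sendQuery(payload)
				}
			}
		}()
	}
	wg.Wait()
}

// sendQuery is a placeholder for the unary gRPC call made over this worker's
// load-balanced connection.
func sendQuery(payload []byte) {}

func main() {
	stop := make(chan struct{})
	time.AfterFunc(10*time.Second, func() { close(stop) })
	run(10, 300, 1024, stop) // e.g. 300 QPS spread over ten workers
}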

3.2.6 Dynamic load server service

The requirements of the server service consist of the capability to dynamically adjust request processing time as well as the failure rate. By implementing these features, we may simulate realistic power law response times and total or selective server endpoint failure. Similarly to the load generator, this micro service is implemented in the Go programming language. The service exposes a simple HelloWorld method that returns a randomized payload to the client. In order to simulate load that optionally follows a power law distribution, the server will perform a loop of multiplication operations for a duration in milliseconds given by rand.ExpFloat64()/λ · m, which models a scaled exponential distribution with rate parameter λ and multiplier m.

Type | Platform | Cores | Frequency (Turbo) | Memory | Network
n1-standard-1 | Intel Xeon Skylake | 1 | 2.0 GHz (3.5 GHz) | 3.75 GB | 2 Gbit
n1-standard-4 | Intel Xeon Skylake | 4 | 2.0 GHz (3.5 GHz) | 15 GB | 10 Gbit
n2-highcpu-32 | Intel Xeon Cascade | 32 | 2.8 GHz (3.9 GHz) | 32 GB | 36 Gbit

Table 3.1: Used VM machine types

The server also uses the OpenCensus framework to provide metrics regarding completed queries and server-side latency. Finally, the server exposes the gRPC LoadReporter service [28] that the lookaside load balancer control plane will use to make load balancing decisions.
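A minimal sketch of this delay computation, with placeholder parameter values; the real server busy-loops on multiplications for the computed duration rather than sleeping:

package main

import (
	"math/rand"
	"time"
)

// simulatedDelay draws an exponentially distributed duration with rate lambda,
// scaled by the multiplier m, interpreted in milliseconds.
func simulatedDelay(lambda, m float64) time.Duration {
	ms := rand.ExpFloat64() / lambda * m
	return time.Duration(ms * float64(time.Millisecond))
}

func main() {
	time.Sleep(simulatedDelay(1.0, 5.0)) // mean delay of roughly 5 ms
}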

3.2.7 Test environment overview

In order to perform the intended experiments of this study, the solutions and components described need to be combined into a usable test environment. The first step of this process is the deployment of a Kubernetes cluster capable of housing each of the service mesh components that are required for this study. The GKE cluster is created using the default configuration, which includes a default node pool containing three n1-standard-1 VM instances (see Table 3.1). This default node pool is used for Kubernetes internal systems and other components which we do not want to interfere with the load balancing testing. By default, a metrics monitoring system is installed in the cluster that monitors CPU and memory usage for each pod and sends this data to Stackdriver for storage. The cluster also includes configurable DNS infrastructure that may be used to discover service endpoints. A central control pod is also installed in the default node pool, which acts as an access point to the cluster as well as allowing dynamic configuration of clients and servers as previously described. To eliminate the pod scheduling bias, we create three separate node pools for clients, load balancers and servers. The server and client node pools use 20 VMs of the n1-standard-4 type and the load balancing node pool uses only a single n2-highcpu-32 VM instance. The number of client and server node pool instances was selected such that the combined theoretical network capacity does not introduce any bottleneck while also keeping the total cloud cost within the bounds of a maximum target. For the load balancing pool, either the proxy load balancer or the lookaside load balancer is deployed depending on the test scenario.

This way both load balancing approaches are tested using the same available machine performance. A full overview of the test environment is shown in Figure 3.3.

[Figure omitted: the GKE cluster (gke-cluster, europe-west1-b) with a default-pool of n1-standard-1 nodes hosting dnsmasq, the metrics controller and the control plane; a client-pool of 20 n1-standard-4 nodes (10 Gbit) running 120 load generator pods; an lb-pool of one n2-highcpu-32 node (36 Gbit) running the Envoy and gRPCLB pods; and a server-pool of 20 n1-standard-4 nodes running 50 server pods. Metrics are exported via OpenCensus to Stackdriver.]

Figure 3.3: Test environment architectural overview.

3.3 Implementation of lookaside load balancer

As the concept of Lookaside Load Balancing (LLB) is relatively new, there exists no production-ready solution that provides a full implementation. Therefore, this study implements its own LLB solution and algorithm such that we may evaluate the LLB approach against Proxy Load Balancing (PLB) and Client-side Load Balancing (CLB). This section explains the goals of designing an LLB solution and presents the algorithm used within this study.

3.3.1 Desired properties of algorithm

While this study does not intend to discover or propose an efficient or production-ready lookaside load balancing algorithm, we have chosen to define and explain each of the properties that should be important for such an algorithm. Previous studies have discussed some of the challenges and properties of load balancing algorithms [20] and this section will reiterate some of these properties as well as introduce new LLB-specific properties and how they could be achieved.

Evenly distribute load

The first property of any load balancing algorithm should be that load is distributed evenly. We may utilize our definition of makespan and state that the load distribution property has a higher degree of fulfillment when the makespan is reduced with regard to CPU processing time. However, since the makespan is calculated based on some finite number of requests n, we must also state that the makespan should be reduced for all incrementally increasing n as n → ∞.

Locality

As we have previously mentioned, the rise of cloud computing and service mesh infrastructure has introduced new capabilities of having distributed services deployed globally in multiple regions and zones. While each micro service instance still functions identically regardless of its location, the load balancing algorithm must take this location into account for high-throughput scenarios and low-latency requirements. This locality property could also allow the load balancer to increase application redundancy if the algorithm is capable of detecting regional failures and

in such scenarios relaxing latency requirements and routing requests to other regions in order to reduce failures.

Latency

Even if we consider regional or zonal locality when creating a load balancing algorithm, there could still be benefits to treating request latency as a property. The request latency is not only dependent on endpoint load but also on the underlying network infrastructure that exists in the cloud environment. Congestion in the network could thereby affect the Round Trip Time (RTT) latency even within the same computational zone. Another factor that can impact latency for different endpoints in the same zone is the abstraction of VMs that exists in service mesh environments (see Section 3.2.2). The pod scheduling bias in Figure 3.1 is an example that demonstrates how two endpoints of the same service might have significantly different RTT latency. An ideal LLB algorithm could in theory take advantage of this service mesh property in order to reduce RTT latency.

Preventing failure propagation

Another important property of load balancing is the reduction of failure propagation. In an ideal scenario, the failure of single endpoints should not result in upstream service failures or alerts if enough healthy endpoints remain to handle all traffic. Due to complexity or developer mistakes, this might, however, not always be the case. Therefore, a desirable property of load balancing algorithms should be to reduce the number of clients connected to each server endpoint. In proxy load balancing, this property may be achieved by detecting failures and quickly removing endpoints such that only a small number of clients are affected. In client-side load balancing, the clients are aware of every server endpoint and in most cases load balance across all endpoints using one of the previously listed algorithms. The clients and servers in CLB can be modeled as a complete bipartite graph, which means that failures in a single endpoint risk spreading to every client. With lookaside load balancing, this property may be achieved by giving clients only a subset of server endpoints such that the number of clients per endpoint is reduced.

Redundancy for failure scenarios

Due to the ephemeral nature of micro services in a service mesh environment, the endpoints of some service could regularly be removed and eventually replaced with new ones. While the previous property of preventing failure propagation is achieved in lookaside load balancing by reducing the number of endpoints each client has, this has the unwanted effect of reducing the fallback options for clients. If a client has only been given a single endpoint and that endpoint were to fail or be removed, the client is unable to send additional requests until it has been given a new endpoint and connected to it. This unwanted delay between sending requests will introduce added latency for upstream services, which in turn could cause errors or alerts. By defining some acceptable degree of endpoint failure, the LLB algorithm can assign enough endpoints to each client such that the chosen degree of endpoint failure may occur without noticeably affecting client latency due to the delay of adding new endpoints. In order to achieve this, the non-failing endpoints on each client must have enough available capacity to handle all traffic until new endpoints have been created and reassigned.

[Figure omitted: two panels comparing client-side round-robin (left, complete bipartite connections) with a redundant least-connection lookaside assignment (right, a subset of connections). Clients c1 and c2 send 10 QPS each, c3 sends 30 QPS, and each server s1-s5 ends up at 10/20 QPS utilization in both cases.]

Figure 3.4: Example of equal load distribution using LLB and a 1-degree endpoint failure redundancy

Figure 3.4 demonstrates an example of how LLB can construct an endpoint assignment such that this property is fulfilled for a 1-degree failure acceptance. The left side of this figure demonstrates the CLB setting where clients have persistent connections to all endpoints as a complete bipartite graph. We see that each server endpoint s1, s2, s3, s4, s5 has an equal theoretical maximum capacity of 20 QPS, clients c1 and c2 send 10 QPS each and finally client c3 sends 30 QPS. Assuming near-constant request processing time on the server, a client-side round-robin approach results in a load distribution of 10 QPS per server endpoint. We may denote the server utilization on each endpoint as being 10/20 = 0.5 or 50%. When some endpoint fails without being removed, 20% of queries will fail on each client. If clients are equipped with outlier detection [49], which removes failing endpoints after x consecutive failures, the total number of failed requests will be limited to x × n for n clients. We also see that in the case of an endpoint being removed, the resulting load on each remaining endpoint will be 10 + 10/(5 − 1) = 12.5 QPS with a utilization of 12.5/20 = 0.625 (63%). If we look at the right side of the figure we see a hypothetical load assignment made by some LLB algorithm. We first note that an equal server utilization of 50% can be achieved with only a subset of the edges from before. While edges c1 → s2 and c2 → s1 might seem redundant for achieving this utilization, they are required for the failure scenario we have already discussed. If any single endpoint were to fail, there would be either 0%, 33% or 50% failing requests for different clients, as well as a total failure rate of 20%. Even if the total failure rate is the same for both approaches, only a subset of clients n_s ≤ n would be affected. If clients support outlier detection [49], this means that a smaller number x × n_s ≤ x × n of total requests would fail. In the more serious scenario where clients, due to developer mistakes, are unable to handle some type of endpoint failure, this LLB property could prevent full failure in all clients.

Reducing new connections

A final property that this study proposes for any future LLB algorithm is the reduction of runtime changes to persistent client-server connections. As our previous properties aim at reducing the number of endpoints given to clients, we must also explain issues that could follow from this practice and introduce a new property that could help solve those issues.

When a client only has a subset of all endpoints, it also only has a subset of the entire capacity. This means that while the total capacity in the system might have enough redundancy for increased traffic, the capacity of a single client's endpoints might not be enough to handle increases in that client's traffic. The only way to both minimize a client's endpoints while still allowing for increases in client traffic is to dynamically update the client endpoints. When dynamically updating client endpoints, we will either add, remove or replace them. We have previously explained that adding new endpoints causes an unwanted delay, and this holds true also for removing and replacing endpoints. When removing a client endpoint, any outstanding queries to that endpoint must first be drained, thus introducing some delay. Because of this cost of replacing a client's connection to an endpoint, we introduce the property of reducing replacement of connections. A possible way to achieve this property is to avoid replacing endpoints given to a client unless failures are occurring, as well as adjusting the number of endpoints given to new clients such that we reduce the need to add or remove them.

3.3.2 Control plane protocol

In Section 3.2.1 we explained and motivated the use of the gRPC protocol for the experimentation in this study. One of the main motivating factors was that client libraries for this protocol already have built-in functionality that allows achieving and testing LLB. This functionality is included in the grpclb protocol, which we describe in further detail in this section such that the reader gains an understanding of what can and cannot be achieved by our LLB. The protocol is split up into two parts, the LoadBalancer service and the LoadReporter service (see Figure 3.5).

Load balancing

The first part is an interface implemented on the LLB that allows clients to connect and announce their intention to access some distributed micro service. To make this study fully reproducible, we have included the full protobuf definition of the load balancing protocol in Appendix A.1, which at the time of writing is also available on the GitHub page for gRPC. The load balancing protocol consists of only one bi-directional streaming method called BalanceLoad. New clients connect to the LLB using this

[Figure omitted: the LLB exposes the LoadBalancer service, whose BalanceLoad method is called by clients over the control plane, and consumes the LoadReporter service, whose ReportLoad method is exposed by servers; client-to-server traffic remains on the data plane.]

Figure 3.5: The grpclb exposed services and methods

gRPC method and start the stream of messages with an initial request. This initial request contains only the name of the service that the client wishes to access. If the LLB is able to provide endpoints for the requested service, it responds with a report interval which specifies how frequently the client should report statistics to the LLB. These statistics include the number of queries sent or completed by the client as well as any information regarding endpoint failures. In a scenario where server load reports cannot be fetched by the LLB, this information could be used to infer server endpoint load and health by aggregating each client's statistics. Since the connection between client and LLB is bi-directional, the LLB can at any point in time send the client a new ServerList containing the set of server endpoints to use. If the client already has an existing ServerList, the one sent by the LLB takes precedence and the client must perform any necessary addition or removal of existing endpoint connections to achieve the new desired state.
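The sketch below outlines what the LLB side of this method could look like. It assumes Go stubs generated from the protobuf definition in Appendix A.1 (imported here as pb); the llb receiver and the register, initialResponse and serverListResponse helpers are hypothetical and only illustrate the message flow.

// BalanceLoad handles one client of the grpclb LoadBalancer service.
func (b *llb) BalanceLoad(stream pb.LoadBalancer_BalanceLoadServer) error {
	// 1. The first message names the service the client wants to reach.
	req, err := stream.Recv()
	if err != nil {
		return err
	}
	service := req.GetInitialRequest().GetName()

	// 2. Reply with the interval at which the client should report statistics.
	if err := stream.Send(initialResponse(30 * time.Second)); err != nil {
		return err
	}

	// 3. Send an initial ServerList and push a new one whenever the algorithm
	//    reassigns endpoints for this client.
	client := b.register(service, stream)
	for servers := range client.updates { // channel of endpoint subsets
		if err := stream.Send(serverListResponse(servers)); err != nil {
			return err
		}
	}
	return nil
}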

Load reporting

The second part is an interface implemented on the micro service that allows the LLB to connect and access load information on each endpoint. Since the endpoint load can to some degree be inferred from the client statistics, this component of the grpclb protocol should in theory not be mandatory. However, by querying endpoints directly using the LoadReporter service, the LLB can retrieve non-inferred endpoint load and utilization. The full protocol supports many different metrics and other information (see Appendix A.2) that could improve the decision making of any advanced LLB algorithm. Because this study is delimited by not attempting any

advanced algorithm, we have chosen to utilize only the bare minimum load information provided. Just like the load balancing protocol, the load reporter protocol exposes a single bi-directional streaming method, called ReportLoad. The initial message sent by the LLB to the server endpoint contains the interval at which the endpoint should send load reports. In the response to this initial message, the endpoint can optionally send information such as its version number. After these initial messages have been sent, the server endpoint will start to regularly send load reports to the LLB as specified. Each load report consists of a feedback section as well as a load section. The feedback section includes the current server utilization, the QPS currently handled by the endpoint, as well as the number of request failures that have occurred. The load section contains a list of metrics for each client connected to that endpoint. These metrics include the number of queries sent as well as the number of errors and the total server latency for each connected client. The feedback and load sections allow the LLB to not only identify increased server load, but also identify which client or clients have contributed the most to that increased load. This way, the LLB should be able to reduce load by giving those clients additional endpoints to use. This will also be the basis for the algorithm we have constructed in order to test Lookaside Load Balancing (LLB).

3.3.3 Algorithm

This section explains the proof-of-concept algorithm implemented and used to study lookaside load balancing. The algorithm is composed of three main components that are executed in parallel by the LLB. One component implements the LoadBalancer service, one component monitors service discovery for new endpoints (using the Kubernetes API) and finally one component connects to the LoadReporter service on each endpoint. The last component is responsible for monitoring increases in load and trying to reduce that load by updating client endpoints. We have also chosen to base the load balancing algorithm on an approach using upper and lower bounds, which we describe further in this section.

Data structures

Each of the LLB components provides some local state required by the main load distribution function. This also means that this local state must be

protected by a mutex locking mechanism that protects the LLB from race conditions during runtime. With this protection in place, we may now describe the data structures used in the implementation such that the reader may more easily understand the constructed algorithm. Listing 3.1 shows a shortened version of the Golang structures used in the LLB. Note that the grpclb protocol provides more data than is used in these fields. However, we have chosen to only store the data required by the algorithm used in this study.

Listing 3.1: Golang data structures used

// Load holds the load figures reported for a single client on an endpoint.
type Load struct {
	Outstanding  uint64
	CurrentQPS   uint64
	TotalLatency uint64
}

// Endpoint represents one server instance and the clients assigned to it.
type Endpoint struct {
	IP                 net.IP
	Port               uint16
	TotalCapacity      uint64
	CurrentQPS         uint64
	CurrentErrorQPS    uint64
	CurrentUtilization float64
	Clients            []*Client
	ClientsLoad        map[*Client]Load
}

// Client represents a connected client and the endpoints it has been given.
type Client struct {
	IP         net.IP
	CurrentQPS int64
	Endpoints  []*Endpoint
}
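Building on Listing 3.1, a minimal sketch of how this shared state can be guarded with a standard sync.Mutex (the type and field names below are illustrative):

// state is the shared view of clients and endpoints that all three LLB
// components read and modify; the mutex prevents races between them.
type state struct {
	mu        sync.Mutex
	endpoints map[string]*Endpoint // keyed by "ip:port"
	clients   map[string]*Client   // keyed by client address
}

// withLock runs f while holding the mutex.
func (s *state) withLock(f func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	f()
}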

Using upper and lower bounds to achieve load balancing

The basis for this LLB algorithm is four parameters that define how load is scheduled and redistributed. The first parameter defines a lower load boundary (in either QPS or utilization) which is used as an assignment bound. This assignment bound limits how we distribute load to different endpoints, and the algorithm makes a greedy best effort to not exceed the bound on any endpoint when assigning endpoints to clients. The second parameter defines an upper bound called a load bound. This load bound acts as a breakpoint for when load redistribution must occur in

order to not overload some endpoint. By defining this boundary we prevent the need for rapidly changing client endpoints when load increases and decreases. This means that we do not aim for precisely equal load on each endpoint but rather aim at keeping the load on all endpoints between the upper and lower bound. Since an increase in the number of clients or in load might make the assignment or load bound impossible to satisfy, we also define a parameter for bound growth. For this study, this parameter is simply a constant value that makes the boundaries jump up or down if required. However, for systems where load behavior is known to be non-linear, a function that describes this value could be a suitable improvement. The fourth and final parameter describes the number of endpoints we should assign to clients. This parameter, defined as e, is a min-max pair that limits the number of endpoints a client can get. To achieve some degree of redundancy in the system, the parameter should be in the range 2 ≤ e_min ≤ e_max ≤ n for n ≥ 2 server endpoints. Also, e_max should be chosen based on the chosen assignment bound and the maximum expected QPS that any client will send. For example, if we have n = 20 endpoints and an initial assignment boundary of 100 QPS per endpoint, we may select e_max to be 6 if we expect that no client will send more than 600 QPS. In Figure 3.6 we see an example of how load may be assigned below the assignment boundary and allowed to grow toward the load boundary without assigning new endpoints (left), as well as an example where additional load forces us to raise the boundaries when new clients connect (right).

[Figure omitted: stacked bar charts of per-endpoint QPS for servers S1-S5, before (left) and after (right) client c4 connects with 20 QPS, illustrating the boundaries being raised.]

Figure 3.6: Bounds have to be raised when client c4 connects (20 QPS)
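The four parameters described above can be represented as in the following sketch (the names are illustrative); fits checks whether an endpoint can absorb additional QPS without exceeding the assignment bound, and grow raises both bounds by the configured step.

// Bounds holds the four parameters of the lookaside algorithm.
type Bounds struct {
	Assign uint64 // assignment bound: target upper limit when assigning endpoints
	Load   uint64 // load bound: breakpoint that triggers redistribution
	Growth uint64 // constant step used when the bounds must be raised
	EMin   int    // minimum number of endpoints per client (redundancy)
	EMax   int    // maximum number of endpoints per client
}

// fits reports whether endpoint e can take q additional QPS and stay at or
// below the assignment bound.
func (b *Bounds) fits(e *Endpoint, q uint64) bool {
	return e.CurrentQPS+q <= b.Assign
}

// grow raises both boundaries by the configured constant.
func (b *Bounds) grow() {
	b.Assign += b.Growth
	b.Load += b.Growth
}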

Client connection procedure

When a client wishes to access some service, it will first locate the LLB responsible for that service using DNS. Once the LLB has been located, the client calls the BalanceLoad method with initial data as previously described. The LLB sets the client's report interval to 30 seconds (which is not important since we do not rely on client reports). At this point, the algorithm has to assume the amount of QPS this client will start to send once it has connected to endpoints. This is important since we do not want to under- or over-provision the number of endpoints given to a client. For this study, we manually set this value to the QPS we know new clients will send. However, for a real production use case this value might be approximated using long-term client statistics. This algorithm is inspired by the previously explained Weighted Least Request algorithm, so once we have an approximation of the QPS, we sort all endpoints by least current utilization and start adding low-utilization endpoints to the new client. While adding these new endpoints, we continuously estimate the impact the assumed QPS will have on endpoint utilization. This is done by first splitting the expected QPS into e_min equal parts (assuming that clients will use round-robin) and testing whether the first e_min endpoints can be assigned that QPS without exceeding their assignment boundaries. If not, we increase e up to e_max and try again to split the load onto the e_max lowest-loaded endpoints. If this is unsuccessful, the assignment boundary and load boundary are increased, and the process is repeated until we are able to assign endpoints to the new client. Also, when the client has been successfully added to some number of endpoints, we preemptively add the assumed QPS to those endpoints. This avoids the case where many clients connecting at the same time would be given the same endpoints because those endpoints' load reports have not yet had time to replace the assumed QPS with the actual sent QPS.
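A sketch of this greedy assignment, reusing the Bounds and Endpoint types above; the llb receiver and sortByUtilization are hypothetical helpers, and clients are assumed to round-robin their traffic over the endpoints they are given.

// assign picks the smallest subset of low-utilization endpoints that can take
// the client's expected QPS without exceeding the assignment bound, raising
// the bounds when no subset fits.
func (b *llb) assign(c *Client, expectedQPS uint64) {
	for {
		sortByUtilization(b.endpoints) // least utilized first
		for n := b.bounds.EMin; n <= b.bounds.EMax && n <= len(b.endpoints); n++ {
			share := expectedQPS / uint64(n) // round-robin split across n endpoints
			if !subsetFits(b.endpoints[:n], share, b.bounds) {
				continue
			}
			for _, e := range b.endpoints[:n] {
				e.CurrentQPS += share // pre-charge the assumed load
				e.Clients = append(e.Clients, c)
				c.Endpoints = append(c.Endpoints, e)
			}
			return
		}
		b.bounds.grow() // no subset fits: raise both bounds and retry
	}
}

// subsetFits reports whether every endpoint in es can take the given share.
func subsetFits(es []*Endpoint, share uint64, bo *Bounds) bool {
	for _, e := range es {
		if !bo.fits(e, share) {
			return false
		}
	}
	return true
}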

Endpoint monitoring

The endpoint monitoring component is responsible for monitoring changes to the total set of server endpoints. This set of endpoints may change in two ways: either some endpoint is removed from the set or some endpoint is added. When a new endpoint is added, it is inserted into the set of active endpoints. If this is the only action we take, the new endpoint will not receive any new traffic unless either some load boundary is exceeded or some

new client connects. Since we want to avoid this scenario, we sort all clients by throughput, highest first, and greedily try to add the new endpoint to such a client if its number of endpoints is less than e_max. If this does not succeed, the new endpoint will not receive any traffic until new load is distributed to it. When an endpoint is removed from the set, more complex actions must be taken in order to ensure that the load distribution of the system is not compromised. This is done by looking at all clients that were connected to the removed endpoint and estimating the total impact on their other endpoints based on their most recently known QPS. If the load boundary is exceeded on one or more endpoints based on this estimation, we start with the endpoint with the highest load and sort that endpoint's clients based on how much QPS they are sending. Then, for that sorted list of clients, we greedily try to add low-load endpoints to each client in order to bring the load from above the load boundary down toward the assignment boundary. If this cannot be achieved, we are forced to increase the bounds until we are successful. This process is repeated for every endpoint above the load boundary until all endpoints are below the current load boundary.

Redistribution based on load reports

The final component of the LLB is responsible for monitoring all server endpoints for load. The component spawns one thread for every active endpoint and connects to the LoadReporter service on that endpoint. For this study, we have chosen to instruct endpoints to send load reports to the LLB at a 10-second interval. This interval could affect performance on endpoints, since each report might contain a large amount of data. If the interval is too long, endpoints might become overloaded before the next load report is sent. We have selected 10 seconds since we expect that load increases will not overload any endpoint within this time frame. Each time a load report arrives, the stored data structures are updated with the most recent data in the report. After this update, the algorithm checks that the load boundary has not been exceeded on any endpoint. This check takes into account not only the endpoint QPS but also the number of outstanding queries not yet completed on that endpoint. This means that if an endpoint is slow to respond to its queries, the load boundary will be exceeded more rapidly. When a load boundary has been exceeded on some endpoint, we use the same process as when endpoints are removed in order to reduce the high load by adding more endpoints to clients or increasing the boundaries.
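Using the structures from Listing 3.1 and the Bounds sketch above, the per-report check can be sketched as:

// overloaded reports whether an endpoint has exceeded the load bound, counting
// outstanding queries so that slow endpoints trip the boundary earlier.
func overloaded(e *Endpoint, b *Bounds) bool {
	load := e.CurrentQPS
	for _, l := range e.ClientsLoad {
		load += l.Outstanding
	}
	return load > b.Load
}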

Known limitations

As we have mentioned, this algorithm is not intended for production use. The intention of this algorithm is to demonstrate a feasible LLB behavior based on the desired properties we have defined. Regardless of this, we choose to explain some of the known pitfalls of this algorithm such that it may be improved upon. These pitfalls are avoided through the design of the test environment and experiments, so that the focus of the results may be on the viability of LLB rather than the viability of the algorithm itself. Below follows a list of identified pitfalls of this algorithm that should be addressed in any future development of this project.

• The algorithm does not currently support lowering boundaries. This could be achieved by defining a third bound which acts as a breakpoint for scaling down. The experiments in this study are therefore limited to only increasing traffic and monitoring performance.

• Clients are assumed to use round-robin, which creates major limitations for the algorithm. If gRPC had support for weighted round-robin, this would allow the algorithm to more easily split the load onto different endpoints.

• Guessing the future QPS of a connecting client could in some cases be an infeasible approach. If the guess is far off, there will be an impact on performance until load reports correct the estimate with the actual QPS the client is sending.

• The order in which clients connect could have a large impact on the algorithm's performance. Since this is a greedy approach to load balancing, there are many edge cases where a specific ordering of clients can significantly affect the system.

• There is no redundancy on the LB itself. If the LLB were to go down, there is no defined fallback system. In theory, we could make the LLB a distributed system by using centralized storage for client and endpoint data.

3.4 Summary

The Google Kubernetes Engine (GKE), along with Google Stackdriver and OpenCensus, allows for gathering CPU, memory, QPS and latency metrics with minimized bias. Key properties of any lookaside load balancing algorithm include load distribution, locality, latency, prevention of failure propagation, redundancy, as well as reduction of new connections. We present a naive algorithm that attempts to address some of these properties. The presented algorithm works by keeping the load on each endpoint between an upper and lower bound.

Chapter 4

Experiments

This chapter presents four experiments that are used to help answer the questions of this study. The first experiment aims at finding suitable settings for the load balancing algorithm as well as a baseline load configuration. The second experiment evaluates the underlying network infrastructure. The final two experiments evaluate each load balancing approach in both high and low load scenarios.

4.1 Discovery of baseline configuration

When experimenting with different load balancing approaches, there are many factors that could affect performance and the test outcome. Since the amount of experimentation required grows rapidly as more factors are taken into account, this study has chosen to limit itself to factors that relate to the load balancing approach, rather than factors that relate to, for example, available compute resources. It is widely known that application performance deteriorates as CPU or memory usage approaches 100% of the available capacity. This is also why many load balancing algorithms take into account the CPU utilization on each server when distributing traffic. However, we must also take into account the CPU usage of the load balancing machines themselves. If not, degraded load balancer performance could introduce bias where results may not be attributable to the load balancing approach, but rather to the amount of resources available. To avoid this, we propose to start with an experiment designed to find a suitable throughput and configuration for the clients and load balancer. The purpose of this experiment is thereby to find a configuration of the test environment where we are not limited by compute resources and where we


are able to argue that results are attributable mainly to the load balancing approach. The experiment is performed by configuring servers to respond without delay and gradually increasing throughput from clients until some or all of the system entities (client, server, load balancer) are saturated in terms of compute resources. This includes testing different numbers of parallel connections and throughput by clients while making sure that neither clients nor the load balancer exceed roughly 25-50% CPU or memory usage (based on findings for minimizing application latency [50]). When we find a configuration that does not overload any entity, we denote it the baseline configuration. This baseline configuration may then be used for further experimentation.

4.2 Evaluation of network latency

Because both the sidecar proxy and proxy load balancing approaches are dependent on at least one network hop, we propose running a latency baseline experiment between each node and node pool in the Kubernetes cluster. By learning about the underlying network performance, it should be easier to draw conclusions on the impact application-level request processing has on client Round Trip Time (RTT) latency. The experiment uses an open-source implementation of the Two-Way Active Measurement Protocol (TWAMP) by Nokia [51] [52]. TWAMP is a protocol utilized by Internet Service Providers (ISPs) to measure network latency and calculate an accurate latency variance value known as jitter [52]. The TWAMP responder is deployed in the LB and server node pools such that we may run the TWAMP sender manually to verify that there is close to equal latency regardless of which node pool a request is routed to.

4.3 Testing scenarios

In order to answer the questions posed by this study, we chose to split the remaining experiments into two parts. The first part focuses on benchmarking each load balancing approach within the discovered baseline configuration, and the second part focuses on adding load to that baseline configuration and evaluating load distribution.

4.3.1 Stable load

The first testing scenario to some degree excludes the effect of load distribution and instead focuses on the client and load balancer metrics that should be impacted directly by the load balancing approach. These metrics include the CPU, memory and latency impact on the client as well as the CPU and memory usage of either the proxy load balancer or the lookaside load balancer. The purpose of this scenario is partly to test the impact on clients of each load balancing approach and partly to investigate the total computational cost of load balancing. By doing this, we aim at discovering whether lookaside load balancing may be achieved with a minimal footprint, as well as discovering any latency or throughput differences. The experiment uses the baseline configuration produced by the experiment in Section 4.1. This means that server endpoints should respond within a low and near-constant time. Using the round-robin algorithm on all LB strategies should thereby produce a fair load distribution such that latency bias caused by server load is reduced. We may then measure client RTT latency reliably in a 60-minute run for each LB approach, as well as measure CPU and memory impact in a separate 60-minute run.

4.3.2 Increased load

The handling of increased load is an important aspect of any load balancing algorithm or approach. Evaluating this aspect for different load balancing strategies is, however, a complex task. This section will explain some of this complexity and describe the experiment used in this study to evaluate load distribution performance.

Client load

Simulation of realistic client traffic is a challenging problem in itself [53]. Not only is it difficult to model client behavior, but any model would be dependent on the application or service itself. For example, a service that processes batch jobs might receive very predictable traffic with known time intervals, while a service directly linked to end-user behavior might receive periodic increases and decreases in traffic with an underlying risk of sudden peaks. Within the context of service mesh environments, we have already discussed how scaling of distributed services plays an important role in handling large changes in client traffic. The aspect of scaling services to handle additional load is, however, not considered to be within the scope of this study. Therefore, we have

opted to design an experiment that increases traffic within the current bounds of the test environment that we discover using the experiment in Section 4.1. Based on the baseline configuration, we chose to linearly increase client throughput within the discovered limits of both client and load balancer. This is done by identifying a Queries Per Second (QPS) constant which we may use to increase the throughput of a randomly selected client. The constant is selected such that load may be increased at a 15-second interval for one hour without exceeding 80-90% CPU on the client or load balancer. The reasoning behind this is that we do not want client or load balancer performance impacting results, and this CPU boundary was based on work in [50]. The 15-second interval has been deliberately chosen to be close, but not equal, to the 10-second load report interval of the server endpoints. While testing a shorter load increase interval is also possible, we would then also have to lower the QPS constant in order to not reach total server capacity too quickly. This could have the adverse effect of helping with the load distribution, since adding small increments of load on many different endpoints is less likely to trigger a load redistribution than adding larger load increments on single endpoints every 15 seconds.
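A minimal sketch of this ramp is given below; setQPS is a hypothetical placeholder for the control-plane call that reconfigures a load generator, and the values in main are examples only.

package main

import (
	"math/rand"
	"time"
)

// ramp raises the throughput of one randomly chosen client by stepQPS every
// 15 seconds until the experiment duration has passed.
func ramp(clients []string, baselineQPS, stepQPS int, duration time.Duration) {
	extra := make(map[string]int)
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	deadline := time.After(duration)
	for {
		select {
		case <-deadline:
			return
		case <-ticker.C:
			c := clients[rand.Intn(len(clients))]
			extra[c] += stepQPS
			setQPS(c, baselineQPS+extra[c])
		}
	}
}

// setQPS is a placeholder for the call that updates a client's target QPS.
func setQPS(client string, qps int) {}

func main() {
	ramp([]string{"client-1", "client-2"}, 300, 50, time.Hour)
}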

Server load

Another important aspect of better modeling realistic load scenarios is to also introduce some load on the server side. For example, in a real production setting, we may make the assumption that many services would do some sort of database lookup for each request. Depending on caching and other database mechanics, this lookup might introduce a high variance in server response time, thus allowing the request queue to build up. In order to simulate server load and slow response times for certain requests, we use the server functionality built in and described in Section 3.2.6. This allows us to dynamically set a power law distribution of response times on each server endpoint. However, since we do not want this practice to introduce a bias towards load on certain endpoints, we must make sure that all endpoints are impacted equally. We achieve this by iterating over all endpoints and setting equal load parameters, as well as only ending the experiment in between load increase iterations. We make a load increase iteration every 15 seconds, just as for the throughput increase, and make sure to end the experiment just before another iteration takes place.

Load balancing algorithms

While load balancing algorithms are not the focus of this study, the selection of algorithm should affect the result of this experiment. Due to current limitations in gRPC and the constructed LLB algorithm, the client-side and lookaside approaches are required to use the round-robin algorithm. However, since we desire to benchmark the best-case load distribution capabilities of each load balancing approach, we have allowed the Envoy proxy and Envoy sidecar proxy to instead use the Weighted Least Request algorithm described in Section 2.5.1.

Metrics

A final important aspect of this load scenario experiment is how to measure load distribution with the metrics at our disposal. This study has chosen to use the CPU metric to evaluate load distribution on each server endpoint. The idea is similar to the concept of makespan described in Section 2.4, where we defined the general load balancing problem. During the entire experiment we monitor peak CPU usage on each server endpoint and save this value. When the experiment has been completed, we look at the average and variance of this peak load to gain insight into how well each algorithm was able to distribute load in order to minimize this makespan.

4.4 Summary

In order to configure the lookaside load balancer and reduce bias towards some load balancing approach, we conduct an experiment that tests different throughputs and numbers of connections per client. The configuration we discover using this experiment will be referred to as the baseline configuration. The second experiment uses the TWAMP protocol to evaluate the cloud network latency. The purpose is to gain understanding of how the underlying network infrastructure might affect latency when testing different load balancing approaches. To test each load balancing approach, we finally conduct two experiments. The first experiment is the stable load scenario, where requests are processed in a near-constant time and we focus on the resource consumption of clients and load balancer(s). The second experiment is the increased load scenario, where requests are processed in non-constant time and we focus on the load distribution capabilities of each approach.

Chapter 5

Results

This chapter presents the results gathered in each experiment. The results of the first experiment allow us to configure the lookaside load balancer and clients. The results of the second experiment give us insight into the underlying network latency. Finally, the results of the two load scenarios allow us to evaluate each load balancing approach.

5.1 Baseline configuration

The first experiment was performed by measuring peak CPU usage of clients, servers and the load balancer for different numbers of parallel gRPC connections and different total QPS sent by each client. The experiment was executed three times, with the entire testing environment being recreated in between runs. Since this experiment focused on finding a configuration which does not overload any system entity, we chose to only extract the peak resource usage across the three executions and each entity replica. This was done by first running the three 60-minute experiment executions. Then, Google Stackdriver was queried for the container/cpu/request_utilization gauge metric that is sampled every minute for each Kubernetes pod. This amounted to a total of 3 × 120 × 60 = 21600 client samples, 3 × 50 × 60 = 9000 server samples and 3 × 1 × 60 = 180 proxy samples per tested configuration. The metric shows what percentage of the CPU available to each Kubernetes pod is used. Also, since Kubernetes pods can burst above their requested resources, values may exceed 100% usage. To find possible configurations for each experiment, ten connections per pod and round-robin client-side load balancing were initially tested, with throughput increased in 100 QPS steps until requests started failing with the


UNAVAILABLE gRPC response code at 700 QPS. Additionally, the same throughput interval was tested with one, twenty and fifty connections per client, as well as with both client-side load balancing and proxy load balancing. This amounted to a total of 4 × 2 × 7 = 56 unique configurations tested. In Figure 5.1 we can view the maximum measured sample for each system entity and each QPS value. Note that values were calculated based on the maximum load across both the client-side and proxy load balancing configurations. This was done since the goal of this experiment was to find a baseline suitable for testing both load balancing approaches.

[Figure omitted: three panels (Proxy, Server, Client) plotting peak CPU usage (%) against QPS per client, with one line each for 1, 10, 20 and 50 connections per client.]

Figure 5.1: Peak CPU usage for different load scenarios

During the experiment runs, the container/memory/request_utilization gauge metric was also measured, but since no tested configuration caused a memory usage peak larger than 50%, this metric was ignored when selecting the baseline configuration.

5.1.1 Selecting the configuration

The configuration of 300 QPS per client over ten connections resulted in a CPU usage of roughly 25% on all system entities (see the plotted lines in Figure 5.1), so this configuration was selected as the baseline configuration for the remaining experiments. The reasoning behind this was partly the 25-50% target utilization mentioned in Section 4.1 and partly the need to leave room for

additional throughput in the increased load scenario. The reason for not selecting one connection per client was also to better model a realistic micro service, which most likely would communicate with more than one other service (and thereby also have more than one connection).

5.1.2 Selecting the algorithm parameters

Based on the results of the baseline testing we selected the lookaside load balancing parameters. With c = 120 clients sending 300 QPS each, the total client throughput becomes c × 300 = 36000 QPS. For s = 50 servers, we select the parameters assign_bound = 100 QPS, load_bound = 200 QPS and growth = 100 QPS, since this should cause bound growth in the algorithm to occur multiple times at an even s × 100 = 5000 QPS interval. An additional motivation behind the parameters was the fact that Figure 5.1 showed a measurable difference in CPU usage for the 100 QPS interval. For the endpoint parameters we chose e_min = 2 and e_max = 5. This means that clients should be able to handle the failure of at least one endpoint, while failure of any given endpoint should only affect 10% of clients (since e_max = s × 0.10).

5.2 Network latency

As described in Section 4.2, a TWAMP responder was deployed in the LB node pool and the server node pool. A TWAMP sender was then first run from the client node pool, targeting the deployed responders. Finally, a TWAMP sender was run from the LB node pool to the server node pool.

From | To | Latency | Jitter
Client | Server | 0.26 ms | 0.11 ms
Client | Load Balancer | 0.25 ms | 0.12 ms
Load Balancer | Server | 0.25 ms | 0.11 ms

Table 5.1: Table showing TWAMP results

In Table 5.1 we see that the baseline average RTT latency appears consistent around 0.25 ms with a jitter variance of about 0.11 ms. There is no apparent change in latency between different node pools. The experiment was repeated three times with environment recreation in between. These repetitions achieved similar results, differing by at most 0.02 ms in both latency and jitter.

5.3 Testing scenarios

With the baseline configuration from Section 5.1, we can conduct both the stable and increased load experiments described in Section 4.3.

5.3.1 Stable load

As described in Section 4.3.1, the stable load experiment was divided into two separate runs of 60 minutes: one run where CPU and memory were captured, and one run where only latency was captured.

After the first run, Google Stackdriver was queried for the container/memory/used_bytes and container/cpu/core_usage_time gauge metrics, which are sampled each minute on each Kubernetes pod. All samples were then aggregated to a mean value and the standard deviation (σ) was computed based on the same set of samples. The reasoning behind using absolute value metrics instead of the utilization metrics in Section 5.1 was that absolute value metrics allow comparison of resource usage across system entities with a different amount of resources available.

For the latency run, the client saved each RTT measurement with a 0.1 ms accuracy in a 32-bit integer value (e.g. 0.1 ms was stored as the integer 1; a small sketch of this encoding is shown below). This was done in order to reduce the amount of data and allow for easier statistical analysis. For the near-constant throughput of 300 QPS this results in ≈520 MB of measurements per client.

The results for these tests are split up into two parts. The first part shows the client CPU, memory and RTT latency impact, as well as a statistical analysis of the significance of the latency data. The second part shows the CPU and memory usage of the proxy and lookaside load balancing approaches.
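As a minimal illustration of the storage format described above, the sketch below converts a measured RTT into the 0.1 ms-resolution 32-bit integer representation. It assumes the client measures RTT as a Go time.Duration; the helper name is hypothetical.

package main

import (
	"fmt"
	"time"
)

// toTenthsOfMs stores an RTT sample with 0.1 ms accuracy in an int32,
// e.g. 0.1 ms becomes 1 and 2.5 ms becomes 25.
func toTenthsOfMs(rtt time.Duration) int32 {
	return int32(rtt / (100 * time.Microsecond))
}

func main() {
	fmt.Println(toTenthsOfMs(100 * time.Microsecond))  // 1
	fmt.Println(toTenthsOfMs(2500 * time.Microsecond)) // 25
}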

Client CPU usage

The first result of interest in the stable load scenario is the impact on CPU and memory by each load balancing approach. In Table 5.2 we see the average CPU usage in ms/s. This unit describes the accumulated core usage time in milliseconds, aggregated over each second. For an n-core system, this unit cannot exceed n × 1000 ms/s. This means that, for example, a value of 2000 ms/s on a two-core system would equal 100% CPU usage.

There exists no apparent difference in CPU usage between Client-side Load Balancing (CLB) and Lookaside Load Balancing (LLB). These approaches also seem to roughly double the client CPU usage compared to proxy-based load balancing, while the sidecar approach shows higher usage than both the client-side and lookaside approaches.
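For reference, the conversion between this unit and a CPU percentage is a one-liner; the sketch below is only illustrative, and the two-core figure is an assumption used to mirror the example above.

package main

import "fmt"

// cpuPercent converts an accumulated core usage time in ms/s into a CPU
// usage percentage for a machine with the given number of cores.
func cpuPercent(msPerSecond float64, cores int) float64 {
	return msPerSecond / (float64(cores) * 1000) * 100
}

func main() {
	fmt.Println(cpuPercent(2000, 2))  // 100% on a two-core system
	fmt.Println(cpuPercent(177.3, 2)) // the client-side average from Table 5.2, if the node had two cores
}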

Load balancing approach   Average CPU usage     σ
Client-side               177.3 ms/s            8.2 ms/s
Lookaside                 172.7 ms/s            6.1 ms/s
Sidecar proxy             221.1 (136.6*) ms/s   17.2 (12.6*) ms/s
Proxy                     85.2 ms/s             3.2 ms/s
* Only measuring proxy container in pod

Table 5.2: Table showing client CPU usage time in ms/s

In order to substantiate these findings, the experiment was repeated an additional two times with complete environment recreation in between. Using the average CPU usage in each run, x̄1, x̄2 and x̄3, we calculated the standard error of the mean (σx̄) for each load balancing approach by taking the standard deviation of the three means. The relatively low standard error displayed in Table 5.3 indicates that the experiment could be reproduced with similar results.
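Written out, the estimate described above is the sample standard deviation of the k = 3 run means:

\[
\sigma_{\bar{x}} = \sqrt{\frac{1}{k-1}\sum_{i=1}^{k}\left(\bar{x}_i - m\right)^2},
\qquad m = \frac{1}{k}\sum_{i=1}^{k}\bar{x}_i .
\]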

Load balancing approach   σx̄
Client-side               5.7 ms/s
Lookaside                 9.6 ms/s
Sidecar proxy             12.8 ms/s
Proxy                     5.0 ms/s

Table 5.3: Table showing standard error of the mean for client CPU usage time

Client memory usage

In Table 5.4 we display the measured memory consumption for each load balancing approach. We see that all memory consumption is close to constant due to the low variance. The proxy-based approach uses the least memory, while the lookaside and sidecar approaches use roughly 20 MB more on average. An important note on this result is that the baseline configuration used here utilizes ten separate gRPC channels that each contain their own list of every server endpoint. This means that the number of endpoints managed internally by gRPC is scaled by a factor of ten.

Load balancing approach   Average memory usage   σ
Lookaside                 29.7 MB                0.7 MB
Client-side               184.1 MB               2.8 MB
Sidecar proxy             33.1 (21.8*) MB        2.4 (2.1*) MB
Proxy                     11.2 MB                0.3 MB
* Only measuring proxy container in pod

Table 5.4: Table showing client memory usage in MB

Since this experiment was rerun an additional two times, we could also quantify σx̄ for each load balancing approach using the same method as previously described. The outcome shown in Table 5.5 also indicates small differences when repeating the experiment.

Load balancing approach   σx̄
Client-side               1.2 MB
Lookaside                 3.6 MB
Sidecar proxy             2.8 MB
Proxy                     0.9 MB

Table 5.5: Table showing standard error of the mean for client memory usage

Client latency

As previously noted, one of the more important aspects of this study is the difference in latency between using a proxy and not using a proxy to load balance queries. The latency test of this scenario produced the results plotted in Figure 5.2.

Because of the noted importance of latency, we have previously stated that a Wilcoxon signed-rank test would be used to investigate whether there is a statistically significant difference in latency between the different approaches. Using equal-size sets of latency measurements from the 60 minute run of each load balancing approach, the statistical test was conducted pair-wise against LLB.

First, the standard score (z-score) was computed as the Wilcoxon signed-rank test describes. This score describes the number of standard deviations (σ) that a data point is above or below the population mean. This score could then in turn be used to estimate the ρ-value, which denotes the probability of achieving results at least as extreme as the measured results, assuming that some null hypothesis holds.

[Figure: distribution of Round Trip Time latency (ms) for the Lookaside, Client-Side, Sidecar proxy and Proxy approaches.]

Figure 5.2: The RTT latency distribution plotted using the 95th, 75th, 50th, 25th and 5th percentiles along with outliers

The null hypothesis tested in this case is the hypothesis that both sets of latency measurements originate from the same distribution, i.e. that there is no difference. We select the commonly used significance level of ρ < 0.05 in order to try and reject the null hypothesis (and thereby show that there is a difference).
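For reference, a common way to obtain the z-score and the corresponding ρ-value from the Wilcoxon signed-rank statistic W is the large-sample normal approximation below (for n non-zero paired differences and a two-sided test); the exact implementation used in this study may apply additional corrections, for example for ties:

\[
z = \frac{W - \mu_W}{\sigma_W},
\qquad \mu_W = \frac{n(n+1)}{4},
\qquad \sigma_W = \sqrt{\frac{n(n+1)(2n+1)}{24}},
\qquad \rho = 2\bigl(1 - \Phi(|z|)\bigr).
\]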

Set 1       Set 2           z-value   ρ
Lookaside   Client-Side     −1.7255   8.4 ∗ 10^−2
Lookaside   Sidecar proxy   −5.5724   2.6 ∗ 10^−8
Lookaside   Proxy           −5.7767   7.6 ∗ 10^−9

Table 5.6: Results of pairwise Wilcoxon signed-rank test for different load balancing approaches

As seen in Table 5.6, we are clearly able to reject the null hypothesis when comparing lookaside with the sidecar and proxy approaches, but not when comparing lookaside with client-side. This means that there exists a statistically significant difference between lookaside load balancing and proxy-based load balancing, with lookaside having significantly lower latency.

Two repeats of the Wilcoxon signed-rank test, using latency measurement sets from two other identical experiment runs, could likewise only show a statistically significant difference when comparing lookaside with the two proxy-based approaches.

Load balancer resource usage

For the lookaside and proxy load balancing approaches, the average CPU (Table 5.7) and memory (Table 5.8) consumption of the load balancers themselves was also measured and compared with two additional runs using the same method as previously described.

Load balancer   Average CPU usage   σ            σx̄
Lookaside       63.8 ms/s           12.5 ms/s    6.8 ms/s
Proxy           7627.2 ms/s         160.6 ms/s   318.3 ms/s

Table 5.7: Table showing load balancer CPU usage time in ms/s

Load balancer   Average memory usage   σ        σx̄
Lookaside       138.1 MB               3.5 MB   9.1 MB
Proxy           165.2 MB               4.6 MB   8.4 MB

Table 5.8: Table showing load balancer memory usage in MB

5.3.2 Increased load

Finally, the second experiment scenario described in Section 4.3.2 is configured and run. In order to allow load increases for a one-hour duration, we select the load increase parameter to be 200 QPS. Since the goal of this experiment is to investigate the load distribution capabilities of lookaside load balancing, we chose to only calculate the CPU makespan for each load balancing approach. The makespan was calculated by querying Google Stackdriver for the container/cpu/core_usage_time gauge metric for each pod (similarly to before) and then computing the maximum measurement for each pod during the test.
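A minimal sketch of that aggregation step is shown below; it assumes the per-pod samples have already been fetched from the monitoring API, and all names and sample values are illustrative.

package main

import "fmt"

// peakPerPod returns the maximum core_usage_time sample (in ms/s) observed
// for each pod during the test, i.e. the per-endpoint values plotted in
// Figure 5.3.
func peakPerPod(samples map[string][]float64) map[string]float64 {
	peaks := make(map[string]float64)
	for pod, values := range samples {
		for _, v := range values {
			if v > peaks[pod] {
				peaks[pod] = v
			}
		}
	}
	return peaks
}

func main() {
	samples := map[string][]float64{
		"server-0": {150.2, 181.4, 176.9},
		"server-1": {160.0, 158.3, 171.5},
	}
	fmt.Println(peakPerPod(samples)) // map[server-0:181.4 server-1:171.5]
}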

Load distribution

In Figure 5.3 we see the result of this testing scenario.

[Figure: peak endpoint CPU usage (ms/s) for each of the fifty server endpoints, with one panel per approach (Lookaside, Client-Side, Sidecar proxy, Proxy).]

Figure 5.3: Makespan load distribution of load balancing approaches.

We see that the peak CPU usage for the proxy-based approach was on average 158 ms/s with a variance of 15 ms/s. The sidecar approach measured higher, with a 168 ms/s average and a variance of 27 ms/s. Finally, the lookaside and client-side approaches performed similarly, with averages of 183 ms/s and 179 ms/s respectively. For those approaches the variance was also higher, at 55 ms/s and 67 ms/s respectively. In Table 5.9 we also see that there is a statistically significant difference between the lookaside and proxy-based approaches. Between lookaside and client-side load balancing we cannot prove any statistically significant difference, since ρ > 0.05.

Set 1       Set 2           z-value   ρ
Lookaside   Client-Side     −1.4432   1.5 ∗ 10^−1
Lookaside   Sidecar proxy   −5.5072   3.6 ∗ 10^−8
Lookaside   Proxy           −6.1443   8.0 ∗ 10^−10

Table 5.9: Results of pairwise Wilcoxon signed-rank test using the fifty endpoint measurements for each set

Once again, when repeating the statistical test for this scenario twice with new data, the same conclusion could still be drawn.

Chapter 6

Discussion

This chapter discusses two main topics that allow us to draw conclusions regarding the three questions this study poses. First, we discuss how external factors can affect the results and how the limited experimentation can affect the conclusions. Second, we go through the results of our load experiments and argue their importance and relation to our three questions.

6.1 Threats to validity

Before discussing all the results gathered, it is important to bring up and explain aspects which might threaten the validity of any conclusions drawn. By doing this, we are also able to argue the importance of some findings while also being able to highlight other findings, even if their significance could be considered lower. This section will thereby discuss the threats that we have identified for this study.

6.1.1 Resource usage

When measuring metrics relating to CPU and memory, it is important to note that there are factors other than the load balancing approach which may have affected the outcome of the results. While this study highlights the importance of reducing result bias and of constructing experiments which in theory should only depend on the LB approach, we are still bound by the fact that other underlying mechanisms may affect CPU and memory usage. For example, the programming language, libraries and versions used could have an impact on performance. These factors should however, in theory, only increase or decrease results by some factor or multiplier, since we use the same underlying platform for all experiments. This means that the proportional difference in resource usage may be attributable to something other than the load balancing approach, while we should still be allowed to argue that a difference exists due to the LB approach alone.

6.1.2 Limited test scenarios

Due to the number of experiments and LLB algorithm parameters, the time to run all tests rapidly grows larger the more aspects we wish to take into account. As we have previously noted, this study is limited in the amount of testing performed for this reason. This introduces a possible result bias, since the experiments we have chosen might have some underlying properties that work well with certain load balancing approaches. In the context of this study, the highest risk for this is in the load increase testing scenario. However, the purpose of that testing scenario is not to prove or disprove the load distribution capabilities of LLB, but rather to provide an example of how LLB could compare to other LB approaches in a semi-realistic scenario. That experiment is also able to highlight the importance of the LB algorithm, which this study does not focus on.

6.2 Importance of load balancing strategy

When looking at the four different load balancing approaches covered by this study, we may categorize the findings based on the different questions we set out to answer about Lookaside Load Balancing (LLB).

6.2.1 Load balancing resource usage

One of the questions this study set out to answer was whether or not lookaside load balancing could be achieved with a minimal footprint on the client. Looking at the results gathered in Section 5.3.1, we see that while LLB clearly performed worse than proxy-based load balancing, it still managed to achieve better results than the client-side and sidecar approaches in terms of CPU and memory. The difference in CPU usage between lookaside and client-side was low, and due to the similar variance we may only state that lookaside seems to perform at least as well as client-side. This could mean that gRPC only adds a near-constant load when using round-robin that is not dependent on the number of endpoint connections.

If we instead look at client memory usage, there are a number of points to be made. It seems like the number of endpoints given to a client has a noticeable impact on memory usage. While we do not argue the scale of this difference, for reasons we have already discussed, we may still argue that lookaside load balancing is able to achieve lower memory usage on the client. For the sidecar approach, memory usage was on a similar level to the LLB approach, even though the sidecar contained a list of all endpoints rather than a subset of them. This means that Envoy is more efficient at storing endpoint connections and state than our client. A possible explanation for this could be that Envoy is written in a lower-level programming language (C/C++) than our Golang client.

Finally, if we look at the CPU (Table 5.7) and memory (Table 5.8) usage of the load balancers, we can approximately calculate the total computational cost of load balancing using our different approaches. Just looking at the CPU for each load balancer, we see that the proxy consumed more than ten times the CPU compared to our implemented LLB. However, if we, based on the client CPU findings, assume that at least 70 ms/s is added to each client when using lookaside or client-side, we get a total added client usage of at least 70 × 120 = 8400 ms/s (120 clients). This would indicate that for the deployment used in this study, LLB in total adds at least 10% CPU usage compared with the centralized Envoy proxy approach. This is of course dependent on the deployment size and many other factors. However, it is a realistic example of how we could have achieved a lower CPU and environmental impact by going with the proxy approach rather than any of the lookaside, client-side and sidecar proxy approaches.
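As a rough worked example of the comparison above (treating 70 ms/s as the stated lower bound on the per-client overhead of lookaside or client-side load balancing relative to the proxy approach):

\[
\underbrace{70\ \text{ms/s} \times 120}_{\text{added client usage}}
+ \underbrace{63.8\ \text{ms/s}}_{\text{LLB balancer}}
\approx 8464\ \text{ms/s}
\quad \text{versus} \quad
\underbrace{7627.2\ \text{ms/s}}_{\text{Envoy proxy}},
\]

which is roughly 11% higher, consistent with the figure of at least 10% given above.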

6.2.2 Latency differences

In Section 5.3.1 we presented the latency findings of this study. There, we were also able to show that there is a statistically significant difference between lookaside and both proxy-based approaches. If we look at the 95th percentiles of each load balancing approach, we see that lookaside achieves approximately 0.56 ms faster response times than the sidecar approach (a 39% improvement), and 2.64 ms faster response times than the proxy approach (a 75% improvement). This means that the majority of clients will have noticeably lower Round Trip Time (RTT) latencies when using lookaside load balancing or client-side load balancing.

However, while the percentage improvement might seem large, the actual gain in a service mesh environment is difficult to calculate. The reason for this follows from how micro services in a service mesh usually are decoupled components of a larger system or application that is used by external users. Because of this, the query between two individual micro services might be seen as a single branch in a function call graph. Then, depending on how deep or shallow the function call graph is, the end user RTT latency will differ significantly.

For a shallow call graph, there might only be a total difference of a few milliseconds, which in turn might not be a perceivable difference to human users. As the call graph becomes deeper, this difference would however increase to a point where there eventually is a perceivable difference. Regardless of where the boundary for perceivable difference is drawn, there is no disputing the fact that significantly improving latency between single micro services would allow for a deeper call graph and more complex micro service interactions.

6.2.3 Throughput

Another important question about LLB that this study is concerned with is client throughput with regards to Queries Per Second (QPS). An interesting discovery made in Section 5.1 was the impact on throughput caused by client-side load balancing when the number of parallel connections was above ten for every client. In this case, the total request throughput went down, since the server endpoints were unable to handle the higher total number of connections (even if the QPS remained the same). However, by using the Envoy proxy in between clients and servers, we were able to use at least 50 parallel connections per client without dropping any queries.

This discovery highlights the importance of being able to multiplex queries over the same connection, and this is something Envoy seems to achieve efficiently. However, this does not become an issue until there is a very large number of connections compared to the number of server endpoints. In service mesh environments, we have some degree of control over client connections, and by utilizing this in LLB to reduce the number of connections, we should be able to provide better circumstances for achieving higher throughput.

6.2.4 Server load distribution

While we are able to measure and compare the load distribution of the different LB approaches, we still need to discuss the difficulty and importance of load distribution. This discussion is important since we wish to understand the impact different load balancing algorithms could have on a future LLB solution.

If we start with the proxy load balancing approach, it seems like having a local knowledge base of server load or outstanding requests to each endpoint could allow load balancing algorithms to function more efficiently. Even when the same weighted least request algorithm and Envoy configuration were used by the client as a sidecar proxy, the loss of central information and coordination was enough to deteriorate the algorithm performance to some degree.

With lookaside load balancing, we are able to have a similar central knowledge base as with proxy load balancers. However, contrary to proxy load balancers, the LLB central knowledge base cannot contain any real-time data, due to the multiple-second delays between client and server load reports. The added control plane delay when updating client endpoints also prevents us from using any known LB algorithm on the LLB. The result of this is that we are forced to find new algorithms that possibly focus more on predicting client behavior based on a central knowledge base, or alternatively overestimate client throughput and give up some of the load distribution capabilities in order to achieve the other benefits gained by lookaside load balancing.

In this study, we provided a greedy heuristic algorithm and an example scenario where that algorithm seems to at least perform similarly to an existing client-side solution in terms of load distribution. Because of the NP-complete complexity of the load balancing problem, we cannot prove any load balancing approach or algorithm to be optimal. Because of this, we can also not prove or disprove the existence of an LLB algorithm that performs equally well as or better than existing algorithms.

6.2.5 Traffic flow control

While not explicitly tested by any experiment, we must still discuss the differences in traffic flow control features that exist for the different load balancing approaches. With traffic flow control, we mainly refer to the ability to route traffic either globally or to different versions of some micro service. With, for example, the Google Front End (GFE) proxy load balancer, traffic is allowed to overflow to other regions in case of failures or high load scenarios. This type of behavior is not possible when using the regular client-side load balancing approach. For lookaside load balancing, we have in Section 3.3.1 already described this as a desired property of any LLB algorithm. And while the algorithm implemented in this study does not provide this property, we still argue that it could be added to our implementation by splitting the set of all endpoints based on location and only allowing cross-region endpoints to be assigned to clients in certain cases.

6.2.6 Failure resilience

A final question this study aims to answer is how LLB compares to the remaining load balancing approaches in terms of failure resilience. For this question, no experiment was constructed or performed within this study. The reason behind this is the fact that gRPC currently does not support any outlier detection or circuit breaking, as discussed in Section 3.3.1. Within that section, we discussed how failure resilience in LLB may be achieved by fulfilling two proposed properties regarding reduction of failure propagation as well as increase of redundancy.

By reducing the number of endpoints given to clients, as well as utilizing outlier detection, an LLB is able to increase the failure resilience compared to the client-side load balancing approach. This follows from the fact that a failure in some endpoint will propagate to a smaller number of clients, which in turn results in a smaller number of total request failures (due to outlier detection).

To minimize the impact of failing server endpoints, a central load balancing proxy must be used, since that is the only way we can detect and remove failing endpoints without the delay otherwise introduced by health-checking systems or load reports. However, when using a central proxy, we also want to highlight the fact that failure in the proxy itself would cause an immediate system failure, as no clients would be able to reach any server. Using LLB instead, downtime in the load balancer would not cause this complete failure, since existing clients would keep their most recently known endpoints. We thereby make a trade-off where we allow more queries to fail while at the same time reducing the risk of complete system failure.

6.3 Scaling properties

While not covered by the experiments of this study, there still exists value in discussing the scaling properties of proxy load balancing and lookaside load balancing. The scaling ability of a load balancing system is important, since a large increase in the number of clients or servers might cause CPU or memory exhaustion in the load balancing server. While increasing the resources available to the load balancing server might be possible in some scenarios, there is a limit where more load balancing servers are required in order to handle the large number of clients or servers.

6.3.1 Proxy load balancing

For proxy load balancing, it is difficult to scale the number of load balancing servers without losing the property of centralized decision-making, which this study shows to be a beneficial property in terms of load distribution. For example, the Weighted Least Request algorithm explained in Section 2.5.1 requires information on outstanding requests to each endpoint in order to function. This means that in order for multiple proxy servers to function equivalently to a single proxy server, load information must be stored in a central and shared location. In this case, we would trivially see an increase in client request latency, since the load balancing algorithm would have to read and write data to an external storage system on a per-request basis.

Another complex aspect of proxy load balancing at massive scale is the aspect of distributing traffic between the load balancing servers themselves. While outside the scope of this study, this might be solved using hardware or network layer load balancing, as briefly mentioned in Section 2.5.5.

6.3.2 Lookaside load balancing

In lookaside load balancing, the load balancer does not sit in the request path between clients and servers. This removes the requirement of low-latency load balancing decision-making that exists for proxy load balancers. In this case, a central storage system can be utilized by the load balancing algorithm. By moving all algorithm state out of the load balancing server, the lookaside load balancer can be modeled as a stateless micro service as described in Section 2.3.1. So by taking advantage of the existing delay in the control plane and moving algorithm state to, for example, a managed and scalable cloud-native solution, the lookaside load balancer can be scaled to the same order of magnitude as any other distributed system in a service mesh environment.

One challenging aspect, however, as for proxy load balancers, is how clients are assigned to the load balancing servers. One potential solution to this challenge is already built into the grpclb protocol that is used in this study and listed in Appendix A.1. In the initial load balancing response, the load balancer can delegate or redirect clients to another load balancer, as sketched below. This allows the lookaside load balancers to balance client traffic between themselves without the need for special hardware or network layer balancing. As a result, the lookaside load balancer implemented in this study could be scaled to handle both a large number of clients and a large number of servers.
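As an illustration of that mechanism, the sketch below builds the first grpclb response a balancer could send on the BalanceLoad stream in order to redirect a client to another lookaside load balancer. The Go identifiers follow standard protoc-gen-go naming for the proto listed in Appendix A.1 (whose declared go_package is google.golang.org/grpc/balancer/grpclb/grpc_lb_v1); the delegate address is hypothetical, and newer upstream versions of the protocol may have changed or removed this field.

package main

import (
	"fmt"

	lbpb "google.golang.org/grpc/balancer/grpclb/grpc_lb_v1"
)

// delegateResponse builds the initial LoadBalanceResponse on the BalanceLoad
// stream, telling the client to open a separate connection to another
// lookaside load balancer and call BalanceLoad there instead.
func delegateResponse(target string) *lbpb.LoadBalanceResponse {
	return &lbpb.LoadBalanceResponse{
		LoadBalanceResponseType: &lbpb.LoadBalanceResponse_InitialResponse{
			InitialResponse: &lbpb.InitialLoadBalanceResponse{
				LoadBalancerDelegate: target, // e.g. "llb-2.internal:8080" (hypothetical address)
			},
		},
	}
}

func main() {
	fmt.Println(delegateResponse("llb-2.internal:8080"))
}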

Chapter 7

Conclusions

In a service mesh environment, it is important that we are able to load balance between instances of some micro service as well as reduce the impact of failures. While a proxy-based load balancing approach can achieve both advanced load balancing and high failure resilience, it adds latency to each request. The client-side load balancing approach avoids this added latency by letting clients themselves implement and use some load balancing algorithm. This client-side approach is however limited in advanced load balancing capabilities and failure resilience. Lookaside load balancing is a new hybrid approach where the advanced features of proxy load balancers are combined with clients that can be dynamically controlled at runtime by an external load balancer. Thereby, we wish to answer the question of whether or not lookaside load balancing can act as a valid alternative to the client-side and proxy-based approaches.

7.1 Viability of lookaside load balancing

In this study, we constructed a test environment capable of evaluating Lookaside Load Balancing (LLB), Client-side Load Balancing (CLB), Proxy Load Balancing (PLB) as well as Sidecar Proxy Load Balancing (SPLB). We were also able to describe desired properties of a lookaside load balancer and implement a greedy algorithm that satisfies some of these properties.

With the help of baseline testing and experiments aimed at simulating scenarios of both high and low load, we showed that LLB provides similar latency and throughput to CLB. For maximizing throughput, the PLB and SPLB approaches appear to show better performance. When failures occur, LLB is able to improve on the failure resilience of CLB, and possibly also of PLB if we consider load balancer failures as well. In terms of traffic flow control, LLB is able to achieve features similar to PLB. Finally, in terms of CPU and memory usage, LLB seems to have a smaller footprint on clients than CLB, but not than PLB. If we consider the total CPU and memory consumption of each load balancing approach, the central PLB approach was able to achieve the lowest resource consumption in the system.

To answer the main question of this study, we draw the conclusion that lookaside load balancing may act as a valid alternative to at least client-side load balancing in an internal network of micro services. For situations where load distribution is more important than latency, or situations where total resource usage should be minimized, lookaside load balancing using the algorithm proposed by this study is not able to match the resource-efficiency of proxy load balancing.

7.2 Future work

The main limiting factor of lookaside load balancing is currently the lack of algorithms that can achieve efficient load distribution. This study proposes a number of properties that any future algorithm should try to fulfill. We thereby suggest that further research should be done both with regards to control-plane metrics and with regards to new lookaside load balancing algorithms that better utilize the available metrics and achieve improved load distribution through client endpoint assignment.

Acronyms

ALB Azure Load Balancer. 16

AWS Amazon Web Services. 19, 20

CLB client-side load balancing. 26, 31, 33, 64, 65

ELB Elastic Load Balancing. 16

GCP Google Cloud Platform. 19, 20, 24

GFE Google Front End. 16, 61

GKE Google Kubernetes Engine. 24, 28, 42

gRPC gRPC Remote Procedure Call framework. 13, 20, 24, 26–28, 34, 35, 41, 47–49, 52, 58, 62

HTTP Hyper Text Transfer Protocol. 8, 11, 20

JSON JavaScript Object Notation. 20

LB Load Balancer. 23, 41, 44, 45, 50, 57, 58, 60, 61

LLB lookaside load balancing. 12, 21, 26, 30–37, 39–41, 45, 47, 50, 53, 58–65

PLB proxy load balancing. 26, 31, 62–65

QPS Queries Per Second. 23, 24, 27, 33, 36–42, 46, 48, 49, 51, 60

REST Representational State Transfer. 20

RTT Round Trip Time. 17, 19, 23, 24, 27, 31, 44, 45, 50, 51, 54, 59, 60


SLA Service Level Agreement. 20

VM Virtual Machine. 7, 21, 22, 28, 31

Appendix A

Protocol definitions

A.1 Load Balancer

// Copyright 2015 The gRPC Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// This file defines the GRPCLB LoadBalancing protocol.
//
// The canonical version of this proto can be found at
// https://github.com/grpc/grpc-proto/blob/master/grpc/lb/v1/load_balancer.proto
syntax = "proto3";

package grpc.lb.v1;

import "google/protobuf/duration.proto";
import "google/protobuf/timestamp.proto";

option go_package = "google.golang.org/grpc/balancer/grpclb/grpc_lb_v1";
option java_multiple_files = true;
option java_outer_classname = "LoadBalancerProto";
option java_package = "io.grpc.grpclb";

service LoadBalancer {
  // Bidirectional rpc to get a list of servers.
  rpc BalanceLoad(stream LoadBalanceRequest) returns (stream LoadBalanceResponse);
}

message LoadBalanceRequest {
  oneof load_balance_request_type {
    // This message should be sent on the first request to the load balancer.
    InitialLoadBalanceRequest initial_request = 1;

    // The client stats should be periodically reported to the load balancer
    // based on the duration defined in the InitialLoadBalanceResponse.
    ClientStats client_stats = 2;
  }
}

message InitialLoadBalanceRequest {
  // The name of the load balanced service (e.g., service.googleapis.com). Its
  // length should be less than 256 bytes.
  // The name might include a port number. How to handle the port number is up
  // to the balancer.
  string name = 1;
}

// Contains the number of calls finished for a particular load balance token.
message ClientStatsPerToken {
  // See Server.load_balance_token.
  string load_balance_token = 1;

  // The total number of RPCs that finished associated with the token.
  int64 num_calls = 2;
}

// Contains client level statistics that are useful to load balancing. Each
// count except the timestamp should be reset to zero after reporting the stats.
message ClientStats {
  // The timestamp of generating the report.
  google.protobuf.Timestamp timestamp = 1;

  // The total number of RPCs that started.
  int64 num_calls_started = 2;

  // The total number of RPCs that finished.
  int64 num_calls_finished = 3;

  // The total number of RPCs that failed to reach a server except dropped RPCs.
  int64 num_calls_finished_with_client_failed_to_send = 6;

  // The total number of RPCs that finished and are known to have been received
  // by a server.
  int64 num_calls_finished_known_received = 7;

  // The list of dropped calls.
  repeated ClientStatsPerToken calls_finished_with_drop = 8;

  reserved 4, 5;
}

message LoadBalanceResponse {
  oneof load_balance_response_type {
    // This message should be sent on the first response to the client.
    InitialLoadBalanceResponse initial_response = 1;

    // Contains the list of servers selected by the load balancer. The client
    // should send requests to these servers in the specified order.
    ServerList server_list = 2;

    // If this field is set, then the client should eagerly enter fallback
    // mode (even if there are existing, healthy connections to backends).
    // See go/grpclb-explicit-fallback for more details.
    FallbackResponse fallback_response = 3;
  }
}

message InitialLoadBalanceResponse {
  // This is an application layer redirect that indicates the client should use
  // the specified server for load balancing. When this field is non-empty in
  // the response, the client should open a separate connection to the
  // load_balancer_delegate and call the BalanceLoad method. Its length should
  // be less than 64 bytes.
  string load_balancer_delegate = 1;

  // This interval defines how often the client should send the client stats
  // to the load balancer. Stats should only be reported when the duration is
  // positive.
  google.protobuf.Duration client_stats_report_interval = 2;
}

message ServerList {
  // Contains a list of servers selected by the load balancer. The list will
  // be updated when server resolutions change or as needed to balance load
  // across more servers. The client should consume the server list in order
  // unless instructed otherwise via the client_config.
  repeated Server servers = 1;

  // Was google.protobuf.Duration expiration_interval.
  reserved 3;
}

// Contains server information. When the drop field is not true, use the other
// fields.
message Server {
  // A resolved address for the server, serialized in network-byte-order. It may
  // either be an IPv4 or IPv6 address.
  bytes ip_address = 1;

  // A resolved port number for the server.
  int32 port = 2;

  // An opaque but printable token for load reporting. The client must include
  // the token of the picked server into the initial metadata when it starts a
  // call to that server. The token is used by the server to verify the request
  // and to allow the server to report load to the gRPC LB system. The token is
  // also used in client stats for reporting dropped calls.
  //
  // Its length can be variable but must be less than 50 bytes.
  string load_balance_token = 3;

  // Indicates whether this particular request should be dropped by the client.
  // If the request is dropped, there will be a corresponding entry in
  // ClientStats.calls_finished_with_drop.
  bool drop = 4;

  reserved 5;
}

message FallbackResponse {}

A.2 Load Reporter

// Copyright 2018 gRPC authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

syntax = "proto3";

package grpc.lb.v1;

import "google/protobuf/duration.proto";

// The LoadReporter service.
service LoadReporter {
  // Report load from server to lb.
  rpc ReportLoad(stream LoadReportRequest)
      returns (stream LoadReportResponse) {
  };
}

message LoadReportRequest {
  // This message should be sent on the first request to the gRPC server.
  InitialLoadReportRequest initial_request = 1;
}

message InitialLoadReportRequest {
  // The hostname this load reporter client is requesting load for.
  string load_balanced_hostname = 1;

  // Additional information to disambiguate orphaned load: load that should have
  // gone to this load reporter client, but was not able to be sent since the
  // load reporter client has disconnected. load_key is sent in orphaned load
  // reports; see Load.load_key.
  bytes load_key = 2;

  // This interval defines how often the server should send load reports to
  // the load balancer.
  google.protobuf.Duration load_report_interval = 3;
}

message LoadReportResponse {
  // This message should be sent on the first response to the load balancer.
  InitialLoadReportResponse initial_response = 1;

  // Reports server-wide statistics for load balancing.
  // This should be reported with every response.
  LoadBalancingFeedback load_balancing_feedback = 2;

  // A load report for each <tag, user_id> tuple. This could be considered to be
  // a multimap indexed by <tag, user_id>. It is not strictly necessary to
  // aggregate all entries into one entry per <tag, user_id> tuple, although it
  // is preferred to do so.
  repeated Load load = 3;
}

message InitialLoadReportResponse {
  // Initial response returns the Load balancer ID. This must be plain text
  // (printable ASCII).
  string load_balancer_id = 1;

  enum ImplementationIdentifier {
    IMPL_UNSPECIFIED = 0;
    CPP = 1;   // Standard Google C++ implementation.
    JAVA = 2;  // Standard Google Java implementation.
    GO = 3;    // Standard Google Go implementation.
  }
  // Optional identifier of this implementation of the load reporting server.
  ImplementationIdentifier implementation_id = 2;

  // Optional server_version should be a value that is modified (and
  // monotonically increased) when changes are made to the server
  // implementation.
  int64 server_version = 3;
}

message LoadBalancingFeedback {
  // Reports the current utilization of the server (typical range [0.0 - 1.0]).
  float server_utilization = 1;

  // The total rate of calls handled by this server (including errors).
  float calls_per_second = 2;

  // The total rate of error responses sent by this server.
  float errors_per_second = 3;
}

message Load {
  // The (plain text) tag used by the calls covered by this load report. The
  // tag is that part of the load balancer token after removing the load
  // balancer id. Empty is equivalent to non-existent tag.
  string load_balance_tag = 1;

  // The user identity authenticated by the calls covered by this load
  // report. Empty is equivalent to no known user_id.
  string user_id = 3;

  // IP address of the client that sent these requests, serialized in
  // network-byte-order. It may either be an IPv4 or IPv6 address.
  bytes client_ip_address = 15;

  // The number of calls started (since the last report) with the given tag and
  // user_id.
  int64 num_calls_started = 4;

  // Indicates whether this load report is an in-progress load report in which
  // num_calls_in_progress is the only valid entry. If in_progress_report is not
  // set, num_calls_in_progress will be ignored. If in_progress_report is set,
  // fields other than num_calls_in_progress and orphaned_load will be ignored.
  // TODO(juanlishen): A Load is either an in_progress_report or not. We should
  // make this explicit in hierarchy. From the log, I see in_progress_report_
  // has a random num_calls_in_progress_ when not set, which might lead to bug
  // when the balancer process the load report.
  oneof in_progress_report {
    // The number of calls in progress (instantaneously) per load balancer id.
    int64 num_calls_in_progress = 5;
  }

  // The following values are counts or totals of call statistics that finished
  // with the given tag and user_id.
  int64 num_calls_finished_without_error = 6;  // Calls with status OK.
  int64 num_calls_finished_with_error = 7;     // Calls with status non-OK.
  // Calls that finished with a status that maps to HTTP 5XX (see
  // googleapis/google/rpc/code.proto). Note that this is a subset of
  // num_calls_finished_with_error.
  int64 num_calls_finished_with_server_error = 16;

  // Totals are from calls that with _and_ without error.
  int64 total_bytes_sent = 8;
  int64 total_bytes_received = 9;
  google.protobuf.Duration total_latency = 10;

  // Optional metrics reported for the call(s). Requires that metric_name is
  // unique.
  repeated CallMetricData metric_data = 11;

  // The following two fields are used for reporting orphaned load: load that
  // could not be reported to the originating balancer either since the balancer
  // is no longer connected or because the frontend sent an invalid token. These
  // fields must not be set with normal (unorphaned) load reports.
  oneof orphaned_load {
    // Load_key is the load_key from the initial_request from the originating
    // balancer.
    bytes load_key = 12 [deprecated = true];

    // If true then this load report is for calls that had an invalid token; the
    // user is probably abusing the gRPC protocol.
    // TODO(yankaiz): Rename load_key_unknown.
    bool load_key_unknown = 13;

    // load_key and balancer_id are included in order to identify orphaned load
    // from different origins.
    OrphanedLoadIdentifier orphaned_load_identifier = 14;
  }

  reserved 2;
}

message CallMetricData {
  // Name of the metric; may be empty.
  string metric_name = 1;

  // Number of calls that finished and included this metric.
  int64 num_calls_finished_with_metric = 2;

  // Sum of metric values across all calls that finished with this metric.
  double total_metric_value = 3;
}

message OrphanedLoadIdentifier {
  // The load_key from the initial_request from the originating balancer.
  bytes load_key = 1;

  // The unique ID generated by LoadReporter to identify balancers. Here it
  // distinguishes orphaned load with a same load_key.
  string load_balancer_id = 2;
}

TRITA-EECS-EX-2020-776

www.kth.se