Lookaside Load Balancing in a Service Mesh Environment

DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2020 Lookaside Load Balancing in a Service Mesh Environment ERIK JOHANSSON KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE Lookaside Load Balancing in a Service Mesh Environment ERIK JOHANSSON Master in Computer Science Date: 2020-10-04 Supervisor: Cyrille Artho Examiner: Johan Håstad School of Electrical Engineering and Computer Science Host company: Spotify AB Swedish title: Extern Lastbalansering i en Service Mesh Miljö iii Abstract As more online services are migrated from monolithic systems into decoupled distributed micro services, the need for efficient internal load balancing solutions increases. Today, there exists two main approaches for load balancing internal traffic between micro services. One approach uses either a central or sidecar proxy to load balance queries over all available server endpoints. The other approach lets client themselves decide which of all available endpoints to send queries to. This study investigates a new approach called lookaside load balancing. This approach consists of a load balancer that uses the control plane to gather a list of service endpoints and their current load. The load balancer can then dynamically provide clients with a subset of suitable endpoints they connect to directly. The endpoint distribution is controlled by a lookaside load balancing algorithm. This study presents such an algorithm that works by changing the endpoint assignment in order to keep current load between an upper and lower bound. In order to compare each of these three load balancing approaches, a test environment in Kubernetes is constructed and modeled to be similar to a real service mesh. With this test environment, we perform four experiments. The first experiment aims at finding suitable settings for the lookaside load balancing algorithm as well as a baseline load configuration for clients and servers. The second experiments evaluates the underlying network infrastructure to test for possible bias in latency measurements. The final two experiments evaluate each load balancing approach in both high and low load scenarios. Results show that lookaside load balancing can achieve similar performance as client- side load balancing in terms of latency and load distribution, but with a smaller CPU and memory footprint. When load is high and uneven, or when compute resource usage should be minimized, the centralized proxy approach is better. With regards to traffic flow control and failure resilience, we can show that lookaside load balancing is better than client-side load balancing. We draw the conclusion that lookaside load balancing can be an alternative approach to client-side load balancing as well as proxy load balancing for some scenarios. Keywords— load balancing, lookaside load balancing, external load balancing, service mesh, kubernetes, envoy, grpc iv Sammanfattning Då fler online tjänster flyttas från monolitsystem till uppdelade distribuerade mikrotjänster, ökas behovet av intern lastbalansering. Idag existerar det två hu- vudsakliga tillvägagångssätt för intern lastbalansering mellan interna mikro- tjänster. Ett sätt använder sig antingen utav en central- eller sido-proxy for att lastbalansera trafik över alla tillgängliga serverinstanser. Det andra sättet låter klienter själva välja vilken utav alla serverinstanser att skicka trafik till. Denna studie undersöker ett nytt tillvägagångssätt kallat extern lastbalansering. Detta tillvägagångssätt består av en lastbalanserare som använder kon- trollplanet för att hämta en lista av alla serverinstanser och deras aktuella last. Lastbalanseraren kan då dynamiskt tillsätta en delmängd av alla serverinstanser till klienter och låta dom skapa direktkopplingar. Tillsättningen av serverinstanser kontrolleras av en extern lastbalanseringsalgoritm. Denna studie pre- senterar en sådan algoritm som fungerar genom att ändra på tillsättningen av serverinstanser för att kunna hålla lasten mellan en övre och lägre gräns. För att kunna jämföra dessa tre tillvägagångssätt för lastbalansering kon- strueras och modelleras en testmiljö i Kubernetes till att vara lik ett riktigt service mesh. Med denna testmiljö utför vi fyra experiment. Det första experimentet har som syfte att hitta passande inställningar till den externa lastba- lanseringsalgoritmen, samt att hitta en baskonfiguration för last hos klienter or servrar. Det andra experimentet evaluerar den underliggande nätverksinfra- strukturen för att testa efter potentiell partiskhet i latensmätningar. De sista två experimenten evaluerar varje tillvägagångssätt av lastbalansering i både sce- narier med hög och låg belastning. Resultaten visar att extern lastbalansering kan uppnå liknande prestanda som klientlastbalansering avseende latens och lastdistribution, men med lägre CPU- och minnesanvändning. När belastning- en är hög och ojämn, eller när beräkningsresurserna borde minimeras, är den centraliserade proxy-metoden bättre. Med hänsyn till kontroll över trafikflö- de och resistans till systemfel kan vi visa att extern lastbalansering är bättre än klientlastbalansering. Vi drar slutsatsen att extern lastbalansering kan vara ett alternativ till klientlastbalansering samt proxylastbalansering i vissa fall. Nyckelord— lastbalansering, extern lastbalansering, service mesh, kubernetes, envoy, grpc v Acknowledgement First of all, I would like to thank my Spotify mentor and supervisor Richard Tolman, for supporting me during my time at the company. His help with feedback, proof-reading and encouragement was crucial for my progress and I will be forever grateful. I would also like to thank my manager Matthias Grüter, for helping me brainstorm the ideas which finally lead to the proposal for this thesis project. Finally, I want to send thanks to every member of my team Fabric, for pushing me all the way to the finish line and reminding me to focus on what is important rather than what is most fun in the moment. I am very much looking forward to continue working with all of you. Contents 1 Introduction 1 1.1 Background . .1 1.2 Problem definition . .3 1.2.1 Objectives . .3 1.2.2 Delimitation . .4 1.3 Ethics and sustainability . .4 1.4 Outline . .5 2 Background 6 2.1 Terminology . .6 2.2 Cloud computing today . .7 2.3 Evolution of service mesh infrastructure . .7 2.3.1 Stateless micro services . .8 2.3.2 Containerization . .9 2.3.3 Container orchestration . 10 2.3.4 Service mesh . 10 2.4 The load balancing problem . 11 2.5 Related work . 12 2.5.1 Load balancing algorithms . 12 2.5.2 Proxy load balancing solutions . 15 2.5.3 Using a proxy load balancer as a sidecar . 16 2.5.4 Future of load balancing control plane . 17 2.5.5 Low-level application load balancing . 18 2.6 Summary . 18 3 Methodology 19 3.1 Evaluation using cloud based testing . 19 3.2 Design of test environment . 20 3.2.1 Selection of test protocol . 20 vi CONTENTS vii 3.2.2 Description of service mesh infrastructure . 21 3.2.3 Metrics monitoring . 23 3.2.4 Selection of proxy load balancer . 26 3.2.5 Load generator client service . 27 3.2.6 Dynamic load server service . 27 3.2.7 Test environment overview . 28 3.3 Implementation of lookaside load balancer . 30 3.3.1 Desired properties of algorithm . 30 3.3.2 Control plane protocol . 34 3.3.3 Algorithm . 36 3.4 Summary . 42 4 Experiments 43 4.1 Discovery of baseline configuration . 43 4.2 Evaluation of network latency . 44 4.3 Testing scenarios . 44 4.3.1 Stable load . 45 4.3.2 Increased load . 45 4.4 Summary . 47 5 Results 48 5.1 Baseline configuration . 48 5.1.1 Selecting the configuration . 49 5.1.2 Selecting the algorithm parameters . 50 5.2 Network latency . 50 5.3 Testing scenarios . 51 5.3.1 Stable load . 51 5.3.2 Increased load . 55 6 Discussion 57 6.1 Threats to validity . 57 6.1.1 Resource usage . 57 6.1.2 Limited test scenarios . 58 6.2 Importance of load balancing strategy . 58 6.2.1 Load balancing resource usage . 58 6.2.2 Latency differences . 59 6.2.3 Throughput . 60 6.2.4 Server load distribution . 60 6.2.5 Traffic flow control . 61 6.2.6 Failure resilience . 62 viii CONTENTS 6.3 Scaling properties . 62 6.3.1 Proxy load balancing . 63 6.3.2 Lookaside load balancing . 63 7 Conclusions 64 7.1 Viability of lookaside load balancing . 64 7.2 Future work . 65 Bibliography 66 Acronyms 72 A Protocol definitions 74 A.1 Load Balancer . 74 A.2 Load Reporter . 76 Chapter 1 Introduction This chapter introduces the general topic of this study and presents three ques- tions that aims at evaluating lookaside load balancing in a service mesh environment. We also explain ethical aspects of this field that should be considered important. 1.1 Background Today, many online services have strong requirements for availability and performance. To fulfill these requirements, a now common approach is to sub- divide an online service into many distributed services that each are able to provide some feature or function of the end service. This way, the end service achieves a higher resilience towards failures and is able to dynamically scale only the components that account for the highest amount of processing. By changing to a network of many small services that interact with each other, we introduce many new challenges. One of these challenges is how to efficiently route and distribute traffic internally between different services. This is known as the concept of internal load balancing. The internal aspect of this concept follows from the fact that all clients will be other developer controlled systems, rather than any potentially malicious external user. Currently there exist two main approaches to load balancing traffic to a distributed service [1]. Firstly, there exists proxy-based load balancing where clients send all traffic to a proxy server that then looks at the packet and decides which backend endpoint of the distributed service to forward the request to.

Load more