arXiv:2106.08833v1 [cs.NI] 16 Jun 2021

Dynamic Recompilation of Software Network Services with Morpheus

Sebastiano Miano1, Alireza Sanaee1, Fulvio Risso2, Gábor Rétvári3, Gianni Antichi1
1Queen Mary University of London, UK
2Politecnico di Torino, IT
3MTA-BME Information Systems Research Group & Ericsson Research, HU

Abstract

State-of-the-art approaches to design, develop and optimize software packet-processing programs are based on static compilation: the compiler's input is a description of the forwarding plane semantics and the output is a binary that can accommodate any control plane configuration or input traffic.

In this paper, we demonstrate that tracking control plane actions and packet-level traffic dynamics at run time opens up new opportunities for code specialization. We present Morpheus, a system working alongside static compilers that continuously optimizes the targeted networking code. We introduce a number of new techniques, from static code analysis to adaptive code instrumentation, and we implement a toolbox of domain specific optimizations that are not restricted to a specific data plane framework or programming language. We apply Morpheus to several eBPF and DPDK programs including Katran, Facebook's production-grade load balancer. We compare Morpheus against state-of-the-art optimization frameworks and show that it can bring up to 2x throughput improvement, while halving the 99th percentile latency.

1 Introduction

Software Data Planes, packet processing programs implemented on commodity servers, are widely adopted in real deployments [9, 33, 40, 50, 68, 82, 88, 89]. This is because they do not require dedicated hardware, guarantee unlimited scale-out/scale-up, and are easier to debug than closed-source hardware [40]. Software data planes depend on a compiler toolchain (e.g., GCC [73] or LLVM [53]) to generate machine code, which can be potentially optimized offline through static transformations, e.g., inlining, loop unrolling, branch elimination, or vectorization [6, 54]. Static optimizations, however, are independent of the actual input the code will process in operation, as this is unknown until then [11, 24]. Consequently, the resulting generic code might contain logic for protocols and features that will never be triggered in a deployment, might be forced to perform costly memory loads to access values that are only known at run time, and take difficult-to-predict branches conditioned on variable data.

Dynamic compilation, in contrast, enables program optimization based on invariant data computed at run time and produces code that is specialized to the input the program is processing [7, 24, 34]. The idea is to continuously collect run-time data about program execution and then re-compile it to improve performance. This is a well-known practice adopted by generic programming languages (e.g., Java [24], JavaScript [34], and C/C++ [7]) and often produces orders of magnitude more efficient code in the context of, e.g., data-caching services [67], data mining [21] and databases [51, 90].

To our surprise, we found that state-of-the-art dynamic optimization tools for generic software, including Google's AutoFDO [21] and Facebook's Bolt [67], are largely ineffective for network code (§2). We demonstrate that the performance of data plane programs critically depends on (i) network configuration, (ii) match-action table content and (iii) traffic patterns, and we argue that standard optimization tools [7, 24, 34] are ill-suited to exploit these domain-specific attributes (§2). Although several tools are available specifically for the networking domain (Table 2), most perform offline optimizations using recorded execution traces, requiring operators to tediously collect representative samples of match-action tables and predict traffic patterns from production deployments. To be practical, instead, a dynamic compiler for networking code should not depend on any offline profile, but rather work in a fully unsupervised mode where all tracing data needed for code specialization is collected online. In addition, existing tools are commonly tied to specific hardware, data plane framework, or programming language, limiting their applicability in specific scenarios.

| Name | Domain specific | Unsupervised adaptation to control plane actions | Unsupervised adaptation to data plane traffic | Data plane agnostic | Description |
|---|---|---|---|---|---|
| Bolt [67] | ✗ | - | - | ✓ | Offline profile-guided optimizer for generic software code. |
| AutoFDO [21] | ✗ | - | - | ✓ | Offline profile-guided optimizer for generic software code. |
| eSwitch [62] | ✓ | ✓ | ✗ | ✗ | Policy-driven optimizer for DPDK-based OpenFlow software switches. |
| P5 [5] | ✓ | ✗ | ✗ | ✗ | Policy-driven optimizer for P4/RMT packet-processing pipelines. |
| P2GO [86] | ✓ | ✗ | ✗ | ✗ | Offline profile-guided optimizer for P4/RMT packet-processing pipelines. |
| PacketMill [30] | ✓ | ✗ | ✗ | ✗ | Packet metadata management optimizer for DPDK-based software data planes. |
| NFReducer [69] | ✓ | ✗ | ✗ | ✓ | Policy-driven optimizer for network function virtualization. |
| Morpheus | ✓ | ✓ | ✓ | ✓ | Run-time compiler and optimizer framework for arbitrary networking code. |

Table 1: A comparison of some popular dynamic optimization frameworks and Morpheus.

We present Morpheus, a system to optimize network code at run time using domain-specific dynamic optimization techniques. Morpheus operates in a fully unsupervised mode, and it does not require any a priori knowledge about control plane configuration or data plane traffic patterns. We discuss the main design challenges (§3), such as automatically tracking highly variable input (e.g., inbound traffic) that may change tens, or even hundreds of millions of times per second. We show that the required profiling and tracing facilities, if implemented carelessly, can easily nullify the performance benefit of code specialization.

We introduce several novel techniques; we leverage static code analysis to build an understanding of the program offline and introduce a low-overhead adaptive instrumentation mechanism to minimize the amount of data collected online. Then, we invoke several dynamic optimization passes (e.g., dead code elimination, data-structure specialization, just-in-time compilation, and branch injection) to specialize the code against control plane actions and data plane traffic patterns. Finally, by injecting guards at specific points in the pipeline, we protect the consistency of the specialized code against changes to input that is considered invariant (§4).

Our implementation, Morpheus, exploits the LLVM JIT compiler toolchain to apply the above ideas at the LLVM Intermediate Representation (IR) level in a generic fashion and separates data plane specific code to several backend plugins to minimize the effort in porting Morpheus to a new architecture (§5). The code currently contains an eBPF and a DPDK/C plugin. We apply Morpheus to a number of packet processing programs, including the production-grade L4 load balancer Katran from Facebook (§6). Our results show that Morpheus can improve the performance of the unoptimized (statically compiled) eBPF application up to 94%, while reducing packet-processing latency by up to 123% at the 99th percentile. Applying Morpheus to a DPDK program, we increase performance by up to 469%. Finally, we measured Morpheus against state-of-the-art network code optimization frameworks such as ESwitch [62] and PacketMill [30]: we show that Morpheus boosts the throughput by up to 80% and 294%, respectively, compared to existing work.

Contributions. In this paper, we:
• demonstrate that tracking packet-level dynamics opens up new opportunities for network code

2 The Need for Dynamic Optimization

To understand the performance implications of dynamic optimization on software data planes, we present a series of preliminary benchmarks using real network code. We consider two applications: the DPDK sample firewall l3fwd-acl [27], which performs basic L2/L3/L4 packet processing followed by a lookup into an access control list (ACL) containing a configurable number of wildcard 5-tuple rules, and Katran [40], Facebook's open-source L4 eBPF/XDP load balancer. We connected two servers back-to-back by a 40GbE link, one server running the DPDK Pktgen traffic generator producing a stream of 64-byte packets at line rate [26], and the other running the application under test pinned to a single CPU core (see §6 for the details of the configuration).

Generic tools fail to optimize network code. In general, dynamic compilation opens up vast opportunities to improve performance. The question is, to what extent standard dynamic compilers can exploit these, when applied to network software? Fig. 1 presents the benchmark results for the DPDK firewall application at various levels of optimization using standard compiler tools. In particular, the baseline performance was measured with all optimizations disabled (level -O0), consistently reaching 8.7 Mpps rate in our test. Enabling aggressive GCC static optimizations (level -O3, [73]) yields 1.5× performance improvement (12.9 Mpps). This is not surprising: it is well-documented that typical C-level static program optimizations greatly benefit networking code [6, 54]. On top of static optimization, profile-guided optimization tools (PGO), like Google's AutoFDO [21] or Facebook's Bolt [67], promise to dynamically specialize the code for a specific input by recompiling it based on execution profiles recorded offline. However, our benchmarks indicate that for networking code this promise is not fulfilled; Bolt and AutoFDO could not bring sensible improvements (from 0.15%
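
To make the idea of guard-protected specialization from the introduction more concrete, the following is a minimal Python sketch, not Morpheus code. It assumes a toy match-action table whose contents are treated as invariant between control plane updates. All names (`specialize`, `GuardedTable`, `recompilations`) are hypothetical, and recompilation is done synchronously on a guard miss purely for simplicity; the text above does not state how or when Morpheus recompiles.

```python
# Illustrative sketch only: every name here is hypothetical, not a Morpheus API.

def specialize(table, default):
    """Return a lookup function specialized to the current table contents."""
    if not table:
        # Empty table: no key can match, so the lookup is eliminated
        # entirely (analogous to dead code elimination).
        return lambda key: default
    if len(table) == 1:
        # Single entry: replace the hash lookup with one inlined comparison
        # (analogous to data-structure specialization).
        (only_key, only_action), = table.items()
        return lambda key: only_action if key == only_key else default
    # General case: look up in a private snapshot of the table.
    snapshot = dict(table)
    return lambda key: snapshot.get(key, default)


class GuardedTable:
    """Toy match-action table with a version guard on its specialized path."""

    def __init__(self, default="drop"):
        self.table = {}
        self.default = default
        self.version = 0          # bumped on every control plane update
        self.recompilations = 0   # counts how often the fast path was rebuilt
        self._recompile()

    def _recompile(self):
        self.fast_path = specialize(self.table, self.default)
        self.compiled_version = self.version
        self.recompilations += 1

    def update(self, key, action):
        # Control plane action: breaks the invariant the fast path assumed.
        self.table[key] = action
        self.version += 1

    def lookup(self, key):
        # Guard: use the specialized code only while its assumption holds.
        if self.compiled_version != self.version:
            result = self.table.get(key, self.default)  # generic fallback
            self._recompile()                            # assumption: synchronous
            return result
        return self.fast_path(key)
```

In this sketch the guard is a single integer comparison per lookup, which keeps the cost of protecting the specialized path small relative to the lookup it replaces. A table update never touches the specialized function directly; it only invalidates the guard, so a stale fast path is never executed.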

US 2002/0066086 A1 Linden (43) Pub

A Dynamically Recompiling ARM Emulator POWERED

A Brief History of Just-In-Time Compilation

Program Dynamic Analysis Overview

Transparent Dynamic Optimization: the Design and Implementation of Dynamo

Infrastructures and Compilation Strategies for the Performance of Computing Systems Erven Rohou

Costing JIT Traces

A Dynamic Optimization Framework for a Java Just-In-Time Compiler

Virtual Machines Uses for Virtual Machines

Improving Whole Program Code Locality in Managed Runtimes

Adaptive Optimization in the Jalapeño

Continuous Program Optimization Via Advanced Dynamic Compilation Techniques