SWSL: SoftWare Synthesis for Network Lookup

Sung Jin Kim†  Lorenzo De Carli†  Karthikeyan Sankaralingam†  Cristian Estan‡
†University of Wisconsin-Madison  ‡Broadcom Corporation
{sung,lorenzo,karu}@cs.wisc.edu  [email protected]

ABSTRACT

Data structure lookups are among the most expensive operations on routers' critical path in terms of latency and power. Therefore, efficient lookup engines are crucial. Several approaches have been proposed, based on either custom ASICs, general-purpose processors, or specialized engines. ASICs enable high performance but have long design cycles and scarce flexibility, while general-purpose processors present the opposite trade-off. Specialized programmable engines achieve some of the benefits of both approaches, but are still hard to program and limited either in terms of flexibility or performance.

In this paper we investigate a different design point. Our solution, SWSL (SoftWare Synthesis for network Lookup), generates hardware logic directly from lookup applications written in C++. Therefore, it retains a simple programming model yet leads to significant performance and power gains. Moreover, the compiled application can be deployed on either FPGA or ASIC, enabling a further trade-off between flexibility and performance. While most high-level synthesis compilers focus on loop acceleration, SWSL generates entire lookup chains, performing aggressive pipelining to achieve high throughput.

Initial results are promising: compared with a previously proposed solution, SWSL gives 2-4x lower latency and 3-4x reduced chip area with reasonable power consumption.

Categories and Subject Descriptors

B.4.1 [Data Communication Devices]: Processors; C.1 [Computer Systems Organization]: Processor Architectures

Keywords

Network processing, Lookups, High-level synthesis, Dynamically-specialized datapath

1. INTRODUCTION

As the demands of Internet services increase, throughput requirements for high-performance routers and switches become more and more stringent. At their core, these devices consist of multiple line cards, each including host processors, memory modules and lookup engines. Data structure lookups are among the most expensive operations on the packets' critical path in terms of latency and power; therefore, developing high-performance lookup engines is crucial to enable router performance to scale.

Three main research approaches have been investigated in the computer architecture and VLSI design communities. The first consists of SRAM-based algorithmic solutions, deployed as software on general-purpose processors [12] or even GPUs [20, 31]. Their strength, stemming from the architectural flexibility of general-purpose processors, is the ease with which algorithms can be developed, debugged and tuned. However, the lack of specialization of general-purpose CPUs negatively impacts performance. Moreover, they exhibit high power consumption, which is the cost of generality [5]. The second approach is hardwired logic design (e.g. [18, 23, 26, 35, 6, 11, 34, 37]). Implementing lookup algorithms with fixed-function hardware maximizes performance and minimizes power consumption. This, however, comes at the price of a design cycle that is long and error-prone, and of no flexibility. A third option consists of specialized designs implementing programmable hardware accelerators. Such accelerators provide hardware implementations of functionality typically used by lookup algorithms, arranged in a configurable architecture, to achieve the best of both worlds. Examples include [4, 21, 25, 27]. However, as providing both complete programmability and performance is an impractical goal, these specialized approaches still have to prioritize one or the other. Moreover, recent industry trends [3, 33] show that the main cost factors in hardware development are design/verification, followed by developing the companion software. Specialized engines still incur design/verification costs, although amortized. The related software is tied to the architecture; transitioning to a different approach requires rewriting it, wasting a considerable investment.

In this paper, we propose a new approach to designing lookup engines called SoftWare Synthesis for network Lookups (SWSL – pronounced Swizzle). SWSL consists of a lookup programming API and a specialized compiler middle layer that generates efficient lookup hardware logic from software. This approach improves the hardware/software development cycle in various ways. First, SWSL is architecture-neutral: lookup implementations are valid C++ that can be compiled for a conventional CPU, or fed to the SWSL compiler and deployed as FPGA or ASIC. The former approach retains flexibility, while the latter prioritizes performance. Thus SWSL eases code reuse. Moreover, software and hardware designs are consolidated: design and verification can be done efficiently in a high-level programming language, reducing the need for a separate hardware verification cycle. Power and area usage are limited, as the specialized logic implements exactly the functionality needed by the application.

The SWSL programming model (§ 3) is dataflow-based and naturally exploits pipelining opportunities present in lookup algorithms [10]. While most high-level synthesis compilers focus on latency improvement through loop acceleration, SWSL leverages the simple, acyclic nature of lookup programs to generate hardware logic for the entire lookup algorithm, performing aggressive pipelining to achieve high throughput. One of the main challenges in synthesizing hardware from software is that the latter is inherently sequential, offering limited opportunities for concurrent execution. SWSL employs optimizations to uncover concurrency at the basic-block level (§ 4), and increases parallelism through the use of variable lines. Each line maintains an independent copy of the execution state of the program, allowing multiple executions to proceed independently in parallel.

We compare SWSL with two previously proposed lookup accelerators, PLUG and LEAP, with promising initial results. SWSL sustains the same throughput as these designs, with reasonable power consumption. Moreover, in comparison with PLUG, SWSL gives 2-4x lower latency and 3-4x reduced chip area.

This paper is organized as follows. Section 2 provides background and presents related work both in the fields of lookup accelerators and high-level synthesis. Section 3 presents an overview of SWSL, and Section 4 discusses the functioning of SWSL in detail. Section 5 presents quantitative measurements and evaluation based on five network lookup algorithms, and Section 6 concludes the paper.

2. BACKGROUND AND MOTIVATION

2.1 Network Lookups

The fundamental task of switches and routers is to determine next-hop information (e.g. an outgoing port number) given some packet data, such as a layer-2 or -3 address, or the connection 5-tuple. Making a forwarding decision usually requires searching large data structures of various kinds, depending on the specific algorithm. For example, layer-3 routers use the IP destination address to search IP routing tables; OpenFlow [30] and its predecessor Ethane [9] look up per-flow rules in tables indexed by layer-3/4 headers. The operation must be performed at line speed and for each incoming packet. As lookups are among the most expensive operations on the packets' critical path in terms of latency and power, there is a large body of research, both in academia and industry, on implementing them efficiently.

Software approaches are problematic because network lookups are known to suffer from poor locality: multi-Gigabit routers process packets from tens of thousands of unrelated flows, limiting the effectiveness of caching. Algorithmic lookup approaches rely on specialized data structures, with the aim of minimizing the number of memory references per lookup and the forwarding table size. Examples include [6, 11] for IP lookups and [34, 37] for packet classification. However, because general-purpose CPUs are not optimized for lookup tasks, they cannot sustain the throughput requirements of large routers.

Ternary content-addressable memories (TCAMs) are a popular hardware-based approach. TCAMs can concurrently compare a search key (including wildcards) with all the entries of a table, implementing hardware bit-level parallelism. Their high throughput comes at a cost: TCAMs are expensive, have low storage density and high power consumption. A different approach consists in implementing algorithmic approaches as fixed-function hardware (e.g. [18, 23, 26, 35]). Such hardware is based on inexpensive RAM and can achieve performance comparable to TCAMs with better power and area usage. However, deploying such designs as ASICs requires a long and expensive design and verification cycle, and is hardly flexible.

Significant research effort has focused on flexible lookup engines, with the goal of achieving throughput comparable to ASICs while retaining some programmability. Several proposals use GPUs as packet processing engines [20, 31]. Neither GPUs nor their programming models are optimized for the task; therefore, these approaches have seen little adoption outside academia. Other lookup engines leverage the observation that – despite their heterogeneity – hardware lookup implementations do have common aspects [10]. Proposals such as [4, 21, 25, 27] provide hardware implementations of functions typically used by lookup algorithms, arranged in a configurable architecture. Similar to GPUs, these engines use specialized programming models; lookup implementations are hardware-specific and cannot be used on different platforms.

We argue that the specialized engine approach does not avoid two significant pitfalls of ASICs: hardware verification costs and excessive software specialization. According to [3, 33], the main cost factors in hardware development are design/verification (~40%) followed by developing the companion software (~30%). Designing lookup engines still requires complex and costly hardware/software codesign; lookup algorithms developed for a platform are highly specialized and cannot be ported to different architectures.

In this paper, we propose to tackle these limitations by generating lookup hardware directly from high-level software via high-level synthesis (HLS). No hardware/software codesign is required; verification can be done fully in software using standard debugging tools. Therefore, hardware testing can be minimized to sanity-checking the final design. Moreover, the approach is architecture-neutral and reusable, as our API does not assume any special hardware capability. The hardware generated by HLS can be deployed both on FPGAs and ASICs, enabling a trade-off between flexibility and performance. For example, an ASIC deployment – sacrificing flexibility – may be acceptable if the target application is well-standardized, e.g. layer-2 or -3 forwarding.

In the next section we discuss the advantages of software- and hardware-based solutions, motivating the need for HLS techniques that can integrate the benefits of both.

2.2 High-Level Synthesis

Implementing functionality in software differs significantly from deploying the same functionality as an FPGA or ASIC. In the former approach, algorithms are expressed in a high-level programming language which is compiled for the target platform. In the latter approach, the developer specifies an implementation of the functionality using a hardware description language (HDL) such as Verilog or VHDL. Such a description is then synthesized to hardware. The two approaches present different trade-offs in several important aspects.

Performance: An HDL implementation can achieve higher parallelism compared to a high-level programming language. Without multi-threaded programming, a microprocessor can only execute sequentially. However, the logic in the HDL contains only the datapath and simple control logic elements, and these elements can easily be arranged to run in parallel for high throughput. Moreover, in hardware much functionality can be assembled to run in a limited number of clock cycles. For these reasons, an HDL design can be more than one hundred times faster than software running on a microprocessor [32].

Efficiency: There are several reasons why the HDL approach can achieve superior efficiency in system design. Because hardware can perform the same computations in fewer cycles, FPGAs and ASICs can often run at a lower frequency than a microprocessor. Therefore, they tend to consume at most tens of watts, while microprocessors consume more. Moreover, an HDL design incurs less architectural overhead, since – contrarily to microprocessors – it does not include potentially unnecessary components [2]. According to [32], hardwired logic can be more than one hundred times smaller than an equivalent software implementation.

Design Time: Directly designing using an HDL is notoriously complicated and painful. The designer must take into account both the high-level algorithm and low-level hardware considerations, resulting in a complex, error-prone implementation. According to Kathail, design and verification effort varies depending on the implementation approach [24]. Implementations must go through several steps, with the first being a reference software-only implementation and the final one the HDL logic. The HDL implementation alone consumes on average several engineering years. Conversely, software-based approaches allow rapid prototyping and optimization.

Debugging and Verification: High-level programming languages enable rapid prototyping and debugging. Hardwired design with an HDL does not provide easy debugging methods, so engineers rely on the output signals from simulators like ModelSim. This is one of the primary reasons for which software is chosen over hardwired logic, even though the latter gives superior performance and efficiency.

High-Level Synthesis (HLS) is a set of techniques for generating HDL from algorithms expressed in a high-level programming language (typically C or C++). The goal is to achieve the performance and efficiency of the hardware approach and the ease of design/verification of software. Table 1 compares HLS with both software and hardware approaches under the aspects discussed previously. Table 2 summarizes proposed HLS approaches, including SWSL. Previously proposed approaches focus on loop acceleration to reduce latency. Their design is based on finite-state machines; throughput is limited because a new input cannot be processed until the previous one has finished and the state machine is ready. In general, these approaches are not a good fit for lookup algorithms, which tend to avoid loops by design and focus on throughput more than latency. Besides the approaches in Table 2, Bluespec and SystemC provide HLS capability via specialized syntax on top of high-level languages [1, 16]. Such specialization limits code reuse.

Approach           Performance  Efficiency  Design time  Debug/verif.
Software           Low          Low         Good         Good
Customized Design  High         High        Poor         Poor
HLS                High         High        Good         Good

Table 1: Comparison of design methodologies

              Fine Grained   Coarse Grained         Hybrid           SWSL
Config. cell  Logic Gates    ALU, Register, Memory  Reconfigurable   Logic Gates
                                                    Array
Target Apps   Loops          Loops                  Loops            Lookup request
Profiler      Yes            Yes                    Yes              No
Output        HDLs           Config. file           Config. file     HDLs
Compilers     ROCCC [17],    PipeRench [14],        GARP [7],        SWSL
              LegUp [8]      RaPiD [13]             NEOSII [28]

Table 2: High-Level Synthesis Approaches

These considerations motivate the need for a new HLS compiler targeting lookup engine generation. Our proposed approach, SWSL, leverages the simple and loop-free structure of lookup algorithms to derive optimized, throughput-oriented hardware implementations. In addition, SWSL requires no profiler because it generates hardware logic for the whole lookup pipeline. In other words, SWSL can be considered a chip generator of lookup engines [33].

3. SWSL

SWSL generates hardware logic descriptions from C++ programs using HLS techniques. It consists of two components: a hardware-agnostic, dataflow-based programming model specialized for lookup algorithms, and a compiler middle layer that converts programs to Verilog.

SWSL is based on the observation that lookup algorithms can be decomposed into simple algorithmic steps, each accessing its own private state. The SWSL programming model enables the developer to specify the lookup algorithm as such a pipeline of steps; the SWSL compiler generates Verilog for each step, and connects the steps to implement the full algorithm. The goal of the compiler is to exploit pipelining to handle one request per clock cycle, achieving throughput equal to the system clock. At the same time, to reduce latency, SWSL increases parallelism by executing operations that are not data-dependent in parallel.

3.1 Programming Model

Previous work [10] observed that data structure lookups used by network applications tend to consist of simple steps, each accessing some private state. The SWSL programming model reflects this structure, allowing the programmer to express an application as a Data Flow Graph (DFG). This approach is inspired by the PLUG programming model described in [10]. Each node in a DFG represents an algorithmic step, and includes some simple computation and the associated data. This model is a particularly good fit for SWSL as the application workload is well-partitioned in simple steps.

Figure 1 shows an overview of how SWSL generates lookup hardware from applications expressed in the DFG model. Figure 1a depicts the DFG of a simple application. The first stage computes a hash that is then used to perform lookups in two secondary stages. If both stages return a result, one takes priority over the other. This simplified graph can for example model a router/firewall, where a high-priority table lists specific flows that must be dropped while a low-priority table holds more general forwarding rules. SWSL takes the functions executed at each stage and generates equivalent blocks of lookup hardware logic (Figure 1b). Finally, each generated logic block is associated with an SRAM module to hold the forwarding table, and all blocks are combined in a single lookup chain (Figure 1c). Blocks in the lookup chain are independent and communicate via on-chip messages.

This approach has two significant advantages. In comparison with a software-only approach, SWSL replaces Von Neumann-style cores with specialized logic, increasing efficiency and performance. Moreover, DFG-based applications express algorithms in a form that is convenient for hardware generation, yet architecture-independent. Indeed, we were able to use SWSL on applications written for the PLUG accelerator, which has a similar model, with no modification (Section 5). We also note that applications written for SWSL are in standard C++, and they can be compiled and run fully in software for prototyping/debugging purposes.

From the point of view of the programmer, SWSL poses a series of constraints with the goal of keeping the algorithm hardware-friendly. First, to enable pipelining, SWSL requires programs to be loop-free (this condition could easily be relaxed to require all loops to be bounded, enabling static unrolling). To minimize the impact of this constraint, the SWSL programming model provides API primitives to perform common loop-based operations such as bit-counting. Then, SWSL requires accesses to the main lookup data structure to be performed with dedicated API calls, thus being clearly separated from accesses to temporary variables. This enables SWSL to discriminate temporary variables and store them in fast registers, avoiding related memory references at runtime. To prevent ambiguous memory references to such variables, SWSL does not allow pointer arithmetic. Finally, dynamic memory allocation is not available, as this concept cannot be mapped to a static hardware implementation.

In general, we found that such constraints do not limit the expressiveness of the programming model significantly, and lookup algorithms can be naturally implemented with SWSL.

[Figure 1b (software-to-Verilog translation): each step's C++ Exec() method – Step1::Exec(), Step2::Exec(), Step3::Exec() – is translated by SWSL into a corresponding Verilog module (Step1_Execute, Step2_Execute, Step3_Execute) with a message input and clocked always @(posedge clk) logic. Figure 1a shows the lookup data flow graph (lookup request, priority, result); Figure 1c shows the generated lookup engine, with each module paired with its own SRAM.]

Figure 1: Lookup Engine Generation using SWSL

3.2 SWSL Compiler

Algorithmic steps (represented as nodes in the DFG of Figure 1a) are individually converted to hardware logic via the SWSL compiler. This component is implemented as a pass within the LLVM compiler toolkit. LLVM provides the basic compilation infrastructure to parse C++ source code, translate it to intermediate representation (IR) and build the program control flow graph (CFG). In the CFG, the program is decomposed into basic blocks – straight-line sequences of code with a single point of entry and one or more exits (e.g. a branch or a switch construct). Arcs represent the control flow, i.e. the possible paths the program can follow when executing. The first transformation performed by SWSL is to restructure the control flow to merge certain basic blocks that can be executed in parallel (specifically, basic blocks that compute the multiple conditions evaluated by a branch instruction).

In general, the structure of the dataflow graph, where each algorithmic step is represented as an independent node (Figure 1a), offers some opportunity for pipelining. However, the gain in throughput is limited, as the graph coarsely subdivides the algorithm into a limited number of steps, each of which can take tens of cycles to execute. To improve throughput, SWSL further internally pipelines each step. In general, the structure of the computation may not naturally lend itself to pipelining. SWSL circumvents this problem by replicating the temporary state associated with the computation. SWSL instantiates multiple buffers called variable lines, each capable of holding the inputs, intermediate results, and outputs used/generated by each step. By using different variable lines, multiple computations can proceed independently. This enables SWSL to pipeline the logic at the basic-block level.

After generating variable lines, SWSL creates the control logic that determines which execution path is followed at runtime (such logic will decide, for example, which step must be activated after a branch condition). Then it generates the actual hardware datapath implementing the functionality described by the software. In general, the resulting hardware will have multiple execution paths; the actual path followed during each execution will depend on the results of branch conditions. As SWSL organizes the logic in a pipeline, it must ensure that the number of steps is the same regardless of which path is followed (e.g. the number of steps must not change depending on which side of a branch is taken). SWSL performs a scheduling step that inserts additional delay buffers to "pad" all paths to the same length.

Figure 2 summarizes the overall SWSL compilation process. After being parsed by LLVM, each step in the original algorithm (a node in the DFG) goes through SWSL, which generates hardware logic. At the end, an SRAM block is added to the module to store the subset of the forwarding table associated with the computation. At the synthesis and place/routing step, the generated top-level design is synthesized using a general synthesis tool (our implementation uses the Synopsys Design Compiler).

[Figure 2 shows the toolchain: Lookup Algorithm Source Code (C++) -> LLVM Front-End -> IR -> SWSL Back-End (Combining Basic Blocks, Generating Variable Lines, Generating Control Logic, Constructing Data Path, Scheduling, SRAM Module, Generating Application Module) -> Lookup Module -> Synthesis, Place/Routing -> Synthesis Result.]

Figure 2: Compilation Process of SWSL

4. IMPLEMENTATION OF SWSL

In this section we discuss the specifics of the SWSL compiler passes outlined in § 3.2.

4.1 Combining Conditional Basic Blocks

Conceptually, a microprocessor executes instructions in sequence; accordingly, the CFG presents a sequential view of the code, specifying the ordering between instructions. However, hardware logic is not limited to sequential execution and can perform operations in parallel as long as they are not data-dependent. Based on a representative set of lookup algorithms, a significant source of potential parallelism lies in branch instructions with multiple conjunctions (e.g. if (A and B and C) then ...). In the lookups we considered (§ 5) such conditions are usually very simple, but result in code with multiple nested branches (e.g. if (A) then if (B) then if (C) then ...). While evaluating each condition separately is efficient in software, it is unnecessary in hardware. SWSL detects occurrences of nested branches and merges the corresponding basic blocks, computing the branch condition in a single pass. To collect such basic blocks, SWSL uses the concept of the Merged Basic Block (MBB). Informally, an MBB is a set of branch conditions with no data dependencies between them and the same successors.

Algorithm 1 summarizes how basic blocks are combined, and Figures 3a-e present an example taken from an Ethernet forwarding algorithm. The code itself is relatively simple, but its control flow graph has some basic blocks which can be combined as a group (if (msg_vec_in ...)). The original source code is in Figure 3a; the initial CFG generated by LLVM for the branch condition is in Figure 3b. Gray basic blocks in Figures 3b-e represent those having conditional expressions (the LLVM "branch" instruction).

At the beginning, each basic block in the CFG is encapsulated in its own MBB and the set MBB_SET includes all of them. At each iteration, the algorithm considers all the MBBs in the set, looking for merging opportunities. Mergeable MBBs have the following properties:

1. A target MBB must have two successors.

2. A target MBB must have only one predecessor.

3. The set of successors of the target MBB's predecessor must intersect the set of successors of the target MBB.

Algorithm 1 Combining Conditional Blocks
  MBB_SET <- set of all MBBs in a target code block
  OLD_MBB_SET <- NULL
  while MBB_SET != OLD_MBB_SET do
    OLD_MBB_SET <- MBB_SET
    while MBB_SET != EMPTY do
      T <- pick an MBB from MBB_SET
      eliminate T from MBB_SET
      numOfSucc <- number of T's successors
      numOfPred <- number of T's predecessors
      if numOfPred == 1 && numOfSucc == 2 then
        PRED <- T's predecessor
        A_SET <- all successors of PRED
        B_SET <- all successors of T
        if (A_SET intersect B_SET) != EMPTY then
          combine T with PRED
          update successors of PRED with T's successors
    MBB_SET <- all combined MBBs

Conditions 1-2 guarantee that the target is a proper branch node. Condition 3 verifies that both branches have a common successor. For example, in Figure 3b, MBB0 and MBB1 share BB3 as a common successor, and can therefore be merged (Figure 3c). The algorithm cycles through the set of MBBs (MBB_SET) until the set is final (i.e. no changes are made during an iteration), as in Figure 3d. Figure 3e equalizes execution paths to ensure correct pipelining, as explained in § 4.2.

We note that programs may in general present other opportunities for parallel execution besides nested branches; however, we find that our simple approach works well in practice and significantly reduces the number of steps in the generated hardware logic.

4.2 Generating Variable Lines for Pipelining

Each step in a lookup algorithm uses variables to represent inputs, outputs and temporary intermediate results. If the algorithm is executed as software on a conventional Von Neumann architecture, these variables are located in memory and/or registers, so they can be retrieved during execution.

A straightforward hardware implementation of this approach would generate a buffer for each variable in the code. However, this approach would prevent pipelining, for the following reason. A lookup triggered by an input message could access a variable while a previous, ongoing lookup tries to access the same variable at a different point in the program. As the accesses refer to different lookup executions, this situation would cause race conditions and unwanted results. To circumvent this problem, SWSL stores variables in multiple variable lines. Each variable line is reserved for a given program execution to avoid race conditions. An index counter is provided to each pipeline stage to enable it to access the correct variable line for the current execution. The index counter is incremented whenever a new lookup starts, and its value is cascaded to each pipeline stage. By cascading the counter values, each stage reads/writes from/to independent variable lines. The number of instantiated variable lines is the same as the length (in number of stages) of the critical path of the generated hardware pipeline. In this way the hardware never runs out of variable lines, enabling one new lookup per clock cycle. This approach also makes flow control unnecessary, as variable lines also hold the input message(s) for each given execution. Therefore, every time a message arrives there is a variable line available to buffer it.

Figure 3f shows the organization of variable lines for the source code in Figure 3a. Once the code is translated to LLVM IR, variables can be readily identified, as they are allocated using the LLVM instruction "alloca". SWSL collects the variables and generates variable lines as a set of buffers, each large enough to store all variables.

As outlined above, an index counter must be propagated from each pipeline stage to the next to ensure that the correct variable line is accessed. Since SWSL pipelines the logic at the basic-block level, during execution each basic block has the responsibility to set the variable line index of its successor(s) to the correct value. In general, it is possible for a basic block to have multiple predecessors executing concurrently and trying to set conflicting counter values. For example, in Figure 3d, MBB0 and BB4 are both predecessors of BB3. Suppose two consecutive lookups are submitted to the hardware. While BB4 is being executed as a result of the first, MBB0 will already be processing the second. As a result, at the end of their execution both will try to set the index counter of BB3 to conflicting values. To avoid this situation, SWSL equalizes the CFG by inserting dummy basic blocks, thus ensuring that all execution paths have the same number of stages. In Figure 3e, MBB0 is prevented from directly setting BB3's counter by the instantiation of a dummy basic block. This guarantees that at each cycle only one predecessor will set BB3's counter.

Algorithm 2 shows how SWSL instantiates counters for each basic block, and links each counter with the successor that must receive its value.
The source code for the Ethernet forwarding step (Figure 3a) is:

    void ethfwtable_lookup::Execute() {
        UINT code_addr;
        VECTOR msg_vec_in, msg_vec_out;
        HEADER msg_vec_in_hdr, msg_vec_out_hdr;
        VECTOR<4> mem_vec;

        ReceiveMessage(msg_vec_in_hdr, msg_vec_in, (UINT)0);
        code_addr = msg_vec_in[0];
        LoadWord<4>(mem_vec, code_addr);
        if( (msg_vec_in[1] == mem_vec[0]) &&
            (msg_vec_in[2] == mem_vec[1]) &&
            (msg_vec_in[3] == mem_vec[2]) )
        {
            msg_vec_out[0] = GetLocalConstant(0);
            msg_vec_out[1] = mem_vec[3];
            SetMessageCodeblock(msg_vec_out_hdr, 0);
            SetMessageDestination(msg_vec_out_hdr, DESTINATION_IMPLICIT);
            SendMessage(msg_vec_out_hdr, msg_vec_out, 0);
        }
    }

In the code block IR produced by the front-end compiler, the local variables become stack allocations (Figure 3f):

    %code_addr = alloca i16, align 2
    %msg_vec_in = alloca [4 x i16], align 2
    %msg_vec_out = alloca [4 x i16], align 2
    %msg_vec_in_hdr = alloca i16, align 2
    %msg_vec_out_hdr = alloca i16, align 2

[Figure 3 diagram panels: a) source code for the Ethernet forwarding step (listing above); b) control flow graph; c) 1st iteration of MBB merging; d) 2nd iteration; e) MBB control flow graph, with Dummy BB0 inserted so that all paths have the same length; f) variable lines used by Ethernet forwarding (the CurrCnt of each basic block points to a variable line; the critical path is highlighted); g) cascading variable line index counters (MSG_CNT, ROOT_CNT, MBB_0_CNT, EXIT_CNT, OUTPUT_CNT); h) MBB control signals (CurrEn/NextEn for each block, e.g., MBB0: CurrEn = {mbb_0_en}, NextEn = {bb_1_en, bb_2_en}).]

Figure 3: Ethernet Lookup and its Control Flow Graph
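The exact-match step in Execute() above can be mimicked in self-contained C++, with the PLUG messaging and memory calls (ReceiveMessage, LoadWord<4>, SendMessage) replaced by ordinary containers; `Row` and `ethfw_lookup` are names invented for this sketch:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

using Word = uint16_t;
using Row  = std::array<Word, 4>;  // {key0, key1, key2, next_hop}

// Mimics the semantics of Figure 3a: word 0 of the input message is the
// table address; words 1-3 are compared against the stored key. On a match,
// a two-word output message {local constant, next hop} is produced; on a
// mismatch no message is sent (modeled here as an empty optional).
std::optional<std::array<Word, 2>>
ethfw_lookup(const std::vector<Row>& mem, const std::array<Word, 4>& msg_in) {
    const Row& r = mem[msg_in[0]];  // stands in for LoadWord<4> at code_addr
    if (msg_in[1] == r[0] && msg_in[2] == r[1] && msg_in[3] == r[2]) {
        return std::array<Word, 2>{0 /* GetLocalConstant(0) */, r[3]};
    }
    return std::nullopt;
}
```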

The algorithm is simple and works regardless of the order in which basic blocks are considered. The algorithm initially instantiates input and output counters (i.e. counters received from the previous step and forwarded to the next step). It then defines, for each basic block in the CFG, the counter associated with its execution (CurrCnt) and the counter that must be set next (NextCnt). For each pair of basic blocks BB1 and BB2, the algorithm ensures that if BB1 precedes BB2, then NextCnt(BB1) = CurrCnt(BB2). Figure 3g shows an example of correct assignment of counters for the CFG of Figure 3e.

Algorithm 2: Cascading Variable Line Index Counter

    1. MBB_SET ← Set of MBBs in CFG except Root and Exit MBBs
    2. MC ← 'MSG_CNT' for Input Message Data Counter
    3. RC ← 'ROOT_CNT' for Root MBB Counter
    4. EC ← 'EXIT_CNT' for Exit MBB Counter
    5. OC ← 'OUTPUT_CNT' for Send Message Counter
    6. For Root MBB R:
           R → CurrCnt = RC
           R → NextCnt = new 'BB_(i)_CNT'; i++
    7. For Exit MBB E:
           E → CurrCnt = EC
           E → NextCnt = OC
    8. B ← Pick a successor of the Root MBB
       // Pred(B): B's predecessor MBB; Succ(B): B's successor MBB
       repeat
           Erase B from MBB_SET
           if Pred(B) → NextCnt ≠ NULL then
               B → CurrCnt = Pred(B) → NextCnt
           else
               B → CurrCnt = new 'BB_(i)_CNT'; i++
           if Succ(B) → CurrCnt ≠ NULL then
               B → NextCnt = Succ(B) → CurrCnt
           else
               B → NextCnt = new 'BB_(i)_CNT'; i++
           B ← Pick the next MBB from MBB_SET
       until MBB_SET = ∅

4.3 Generating Control Logic

The main role of control logic in SWSL is to let the generated Verilog code decide which MBBs are executed. Therefore, each basic block must, at the end of its execution, generate an enable signal for the basic block(s) that follow. For each basic block, SWSL defines two signals: current-enable (CurrEn), which causes the current basic block to run, and next-enable (NextEn), which triggers execution of the basic block's successor(s). Similarly to variable line counters, if BB1 precedes BB2, then NextEn(BB1) = CurrEn(BB2). The algorithm that instantiates enable signals is conceptually equivalent to the algorithm that instantiates variable line counters (Algorithm 2), with a few caveats. If a basic block has multiple successors as a result of a branch condition, at run-time it will only generate the enable signal for the taken successor. If a basic block has multiple predecessors, its CurrEn signal is generated by OR'ing the NextEn signals of all its predecessors. This ensures that the basic block is activated regardless of the path through which it is reached.

Figure 3h shows enable signal instantiation for the CFG of Figure 3e. As can be seen, MBB0 can generate two NextEn signals: one activates BB4 and one activates Dummy BB0. During each execution only one successor is activated, depending on the result of the branch instruction in MBB0. Also note that BB3 can be activated either by BB4 or by Dummy BB0. Its enable signal CurrEn is defined by OR'ing the enable signals of its predecessors.

4.4 Data path in MBB

Once the previous steps have been performed, SWSL generates a Verilog behavioral model implementing the functionality of the software provided as input. To do so, for each basic block SWSL constructs an equivalent dataflow graph describing how values flow from instruction to instruction. The dataflow graph is then converted to a hardware datapath.

Care must be taken to handle data dependencies. SWSL must identify all definitions and uses of any given variable, even when the variable is not referenced directly but passed through a memory reference. To do so, SWSL identifies all occurrences of the LLVM "store" instruction, and saves the related memory address references. It then searches for "load" instructions and determines, for each "load", on which "store" it depends. To see why this simple approach works, recall that the SWSL programming model does not allow pointer arithmetic nor dynamic memory allocation. Moreover, the lookup data structure is never accessed directly, but only through dedicated function calls. Therefore, all "load" and "store" instructions in the LLVM IR are guaranteed to univocally refer to stack-allocated temporary variables.

Once all references to a variable have been determined, SWSL can propagate values from producer to consumer instructions. Values referenced across basic blocks (i.e. across pipeline stages) are propagated by storing them in variable lines. This approach, however, would introduce excessive latency for values produced and consumed within the same basic block. In this case, SWSL defines a "wire" connection in the datapath, so that the value is directly forwarded from one instruction to the other.

Correctness of the overall datapath is enforced by cascading indices and enable signals as described in § 4.2 and § 4.3 respectively.

4.5 Scheduling

The designs generated by SWSL have two kinds of shared resources: memory blocks and on-chip links between algorithmic steps (Figure 1c). These resources are not overprovisioned. Therefore, SWSL must statically ensure that no conflict arises at run-time due to multiple stages in the lookup pipeline trying to access the same resource in the same cycle. Such a situation can happen when two distinct executions of the lookup algorithm follow different paths. By default there is no guarantee that all paths will access a shared resource after a constant number of cycles, potentially leading to inconsistent timing between executions.

Even when conflicts do not arise, inconsistency can be caused by the overall shape of the lookup DFG: for example, in Figures 1a and 1c there is no guarantee that step 2 and step 3 will present their results to the priority selector in the same cycle, as they may take different times to execute.

Both problems are similar to the static scheduling issues which arise in the PLUG programming model [10], and are tackled by SWSL in the same way. To prevent conflicting accesses to shared resources, SWSL measures the timing of such accesses across all possible execution paths, introducing delay chains when needed to equalize timing. To ensure that parallel algorithmic steps (such as steps 2 and 3 in Figure 1c) execute in constant time, SWSL adds dummy delay basic blocks to shorter steps.

5. EVALUATION

In this section, we present a quantitative evaluation of the feasibility and effectiveness of SWSL as a lookup hardware generator. First, we discuss whether SWSL can support realistic lookup algorithms. We consider 5 representative applications developed for PLUG (a previously proposed lookup accelerator), covering various network protocols and operations. Results are encouraging: SWSL was able to generate functionally complete lookup pipelines for each application in the set. We then compare the hardware pipelines generated by SWSL with specialized lookup accelerators (PLUG and LEAP) implementing the same functionality. In all cases, SWSL designs achieve the same throughput, comparable latency and power, and smaller area.

5.1 Methodology

To implement SWSL we used the API we previously developed for PLUG [25], a specialized lookup accelerator. Its API exports a DFG-based programming model, and provides infrastructure for simulating and verifying lookup algorithms in software. We then modified the PLUG toolchain to use the SWSL compiler as a back-end. The SWSL compiler itself was implemented as a series of passes for the LLVM compiler toolkit.

As target lookup applications, we chose a suite of three standard lookup algorithms (Ethernet forwarding, IPv4, and IPv6) and one research protocol, Ethane, an academic precursor of OpenFlow. To evaluate the flexibility of SWSL we also included DFA-based regexp matching (a widespread primitive used in intrusion detection systems). We note here that the DFA application implements only a lookup in a compressed transition table, not the complete DFA. As SWSL shares the PLUG programming API, we use versions of these applications previously implemented for PLUG. To generate SWSL designs, the applications were fed to the SWSL compiler, implementing the passes listed in Figure 2. For further details of each application the reader is referred to [25, 10]. We reuse these applications from the PLUG source code; the code blocks, along with the DFG configuration of each application, are directly injected into the SWSL compilation process shown in Figure 2. Details of the DFGs for each application can be found in [25].

As terms of comparison we use PLUG and another specialized lookup accelerator, LEAP [21]. The choice of comparing with PLUG and LEAP is motivated by the fact that they represent opposite points within the software/hardware design tradeoff. PLUG adopts a software approach, with its computation engines being conventional Von Neumann cores. LEAP instead uses arrays of fixed-function hardware blocks. PLUG offers software-like programmability; LEAP has a constrained programming model but achieves near-ASIC performance.

To evaluate the performance of SWSL designs, we used the Synopsys Design Compiler with a 55nm design library and a 1 GHz clock frequency. To evaluate the SRAM memory attached to each logical page, we leveraged the CACTI modeling tool [36], with the SRAM organized as four 64 KB memory banks for proper comparison to PLUG and LEAP (which use the same memory design).

5.2 Analysis

Effectiveness/Ease of programming: To evaluate the SWSL compiler, we selected a set of 5 applications originally written for the PLUG lookup accelerator. We emphasize that basing the SWSL programming model on the PLUG API is purely a matter of convenience; the SWSL compiler could be adapted to a different API with minimal tweaks (as long as the programming model enforces the constraints listed in § 3.1).

SWSL was able to generate functionally correct lookup hardware for all the applications in our set; none of the applications required changes. According to [25], developing/verifying the PLUG hardware took 6 person-months and developing the applications took 18 person-months. Developing the PLUG-specific compiler took another 6 person-months. In this context the SWSL approach would have made the hardware development cycle largely unnecessary, potentially reducing design time by up to 20%.

Throughput Analysis: Both PLUG and LEAP can achieve a throughput of 1 lookup per cycle with a 1 GHz clock frequency. We verified that SWSL is capable of achieving the same throughput.

Latency Analysis: Table 3 presents the latency of each SWSL-generated application (in ns) in the "Total Latency" column. The latency is further decomposed into the component due to the length of the (hardware or software) critical path, and the latency introduced by the scheduler for synchronization purposes (§ 4.5). The synchronization overhead is relatively high, contributing up to 1/3 of the overall latency. However, synchronization is crucial, as it guarantees conflict-free execution, enabling pipelining.

Table 3: SWSL Application Latency (ns)

    Application          Computation     Synchronization   Total Latency
                         Critical Path
    Ethernet forwarding       8                 3                11
    IPv4                     93                35               128
    IPv6                    175                63               238
    Ethane                   29                 2                31
    DFA                      22                 7                29

Table 4: Latency Comparison (ns)

    Application          SWSL   PLUG   LEAP
    Ethernet forwarding    11     55      6
    IPv4                  128    264     24
    IPv6                  238    524     42
    Ethane                 31    100      6
    DFA                    29     59      6

Table 4 further compares the latency of SWSL-generated applications with the same applications deployed on PLUG and LEAP. SWSL achieves a significant latency reduction in comparison with PLUG, for three reasons.

First, SWSL exploits more parallelism than PLUG. SWSL generates merged basic blocks, computing multiple branch conditions in a single 1-cycle pass. As branch conditions tend to be on the computation critical path, parallelizing their computation directly decreases the overall application latency. Conversely, PLUG is based on the conventional Von Neumann architecture, and executes algorithms in software. In this context branch conditions have to be evaluated sequentially.

Second, the PLUG API offers specialized high-level operations, each of which is translated to several atomic instructions. For example, the PLUG SendMsg call is used to forward intermediate results between algorithmic steps. At compile time, a single SendMsg is converted to multiple instructions that copy values to special-purpose network registers, construct the message header, and send the message on the on-chip network, requiring 5 cycles. Instead, SWSL can implement the operation directly as efficient hardware and execute all data movements in a single cycle.

Finally, there is no communication overhead in SWSL. PLUG cores are organized in a matrix; the on-chip network allows arbitrary communication patterns between cores. This requires all communication to be mediated by on-chip routers. PLUG employs point-to-point links and XY routing, making communication latency dependent on the distance between cores. As SWSL specializes the hardware for a single algorithm, communication does not need to be flexible. Stages are connected directly via wires (Figure 1c), making communication latency negligible.

SWSL still has higher latency than LEAP. LEAP can achieve minimal latency because of its optimized programming model. A LEAP computation engine is an array of specialized hardware units connected via a crossbar, resulting in near-ASIC performance. However, such performance comes at a price in terms of ease of development and generality. Programming LEAP involves routing bits and configuration values between units to implement the desired algorithm. LEAP exports a specialized Python API that alleviates the complexity of this programming model; however, applications developed for LEAP are highly specific and cannot be easily ported to other architectures. For example, a LEAP programmer must explicitly ensure that the format and bit width of inputs and intermediate results match the specifications of each functional unit. Conversely, SWSL uses a conventional programming model which allows developers to leverage their programming expertise and express applications in a more intuitive form. Moreover, SWSL code is generic and platform-independent, facilitating code reuse.

Power: For PLUG and LEAP, we estimate power by multiplying the power of a single tile by the number of active tiles configured for a target lookup application. While IPv6 is configured as 8 × 8, the other lookup applications are configured as 4 × 4 [25]. From the tile configuration, Ethernet forwarding, IPv4, IPv6, Ethane and DFA have 8, 8, 37, 8, and 12 active tiles, respectively. The power consumption of a single tile of PLUG and LEAP is 63 mW and 49 mW, respectively [21]. We adopt this methodology to be consistent with the results reported in the LEAP work, since we compare to both LEAP and PLUG.¹ For SWSL, we collect power numbers from the Synopsys power compiler, run on the RTL code generated by SWSL with the default activity factor.

¹ In the PLUG papers, per-application power was reported by considering only the tiles that are active based on individual lookup patterns and the code blocks activated. In contrast, here we report PLUG power as the power consumed by a tile multiplied by the number of activated tiles.

Table 5: Power Estimation (W)

    Application          SWSL    PLUG    LEAP
    Ethernet forwarding  0.582   0.504   0.392
    IPv4                 1.753   0.504   0.392
    IPv6                 5.473   2.331   1.813
    Ethane               1.200   0.504   0.392
    DFA                  0.373   0.756   0.588

Table 5 shows the total power for each lookup engine approach. Results are mixed, with SWSL consuming slightly more than PLUG and LEAP in most cases, with the exception of the DFA application. This is explained by the different complexities of each application. DFA consists of a simple algorithm that leads SWSL to instantiate a small amount of logic, leading to low power consumption. IPv4, IPv6 and Ethane are more complex and more deeply pipelined. In this case the generality of PLUG and LEAP leads to smaller power consumption, as the same functionality can be re-used multiple times. Instead, SWSL instantiates dedicated logic for each pipeline stage, resulting in greater power consumption for complex applications. However, it is still a surprising and counter-intuitive result that an application-specific implementation consumes more power than a general-purpose engine. The reason is that our Verilog code generator backend is not as mature as the code generation for engines like PLUG and LEAP, which can re-use decades of research into instruction-level code generation. Also, as outlined in Section 5.3, there is one known source of significant inefficiency in our compiler: we are currently not using known hyperblock and predication [29] techniques to handle control flow. As a result we have excessively long control-flow paths, which introduce an unnecessarily large number of variable lines, each of which consumes significant power. We believe the underlying ideas in SWSL will provide power efficiency once these further engineering challenges are solved.

Area: We collect area numbers for each network application from the synthesis results. Table 6 presents the area of each network application. From Table 6, PLUG and LEAP have relatively larger area than SWSL.

Table 6: Area Estimation (mm²)

    Application            SWSL     PLUG      LEAP
    Ethernet forwarding   16.897    51.2      33.6
      Compute              0.937    18.432     1.68
      Memory              15.96     32.768    32.92
    IPv4                  19.419    51.2      33.6
      Compute              3.459    18.432     1.68
      Memory              15.96     32.768    32.92
    IPv6                  56.062   204.8     134.4
      Compute             10.177    73.728     6.72
      Memory              45.885   131.072   127.68
    Ethane                18.384    51.2      33.6
      Compute              2.424    18.432     1.68
      Memory              15.96     32.768    31.92
    DFA                   16.354    51.2      33.6
      Compute              0.394    18.432     1.68
      Memory              15.96     32.768    31.92

As PLUG and LEAP have a tiled configuration, with tiles arranged in 4×4 or 8×8 squares, they require significant area. However, the area taken by the computational engines is small; most of each tile's area is used by on-chip memory (PLUG: 64%, LEAP: 95%). The computation-to-memory ratio is also small for SWSL; we found that the IPv6 computation area for SWSL is approximately the same as 3 PLUG tiles. An important difference is that SWSL constructs the chip design depending on the structure and computation of the input application. Instead, PLUG and LEAP designs have a fixed structure, as the same design must support multiple applications (the only degree of freedom is the size, in terms of number of tiles, of the design).

SWSL has a high power-to-area ratio, which means that the power consumption of a SWSL-generated design can be high even though its area is small. This can be explained as follows. The central idea of SWSL is to translate software functionality to hardware logic. Each block generated by SWSL implements exactly the action performed by the corresponding software; at run-time, all blocks will be busy performing their respective functions. Instead, PLUG and LEAP provide an array of cores (or, in the case of LEAP, specialized computational engines) to which functionality is assigned; the number of cores is overprovisioned to support computing-intensive applications. Therefore, in general not all cores will be active at the same time, leading to less power consumption per unit of area.

5.3 Beyond SWSL

In Section 5.2 we analyzed SWSL in terms of latency, power, and area. It yields reasonable results, but its efficiency is below what we had expected. In particular, latency presents room for improvement. The primary cause is that conditional expressions such as nested if/else blocks lead to deep CFGs, which in turn cause latency to run high. We expect that such deep CFGs can be simplified, with a corresponding latency decrease, by employing predication, i.e. conditional execution of multiple branch sides in parallel. In particular, predication in SWSL will eliminate conditional MBBs in the CFG [22]. Figure 4 presents an example CFG and its relaxation via predication. Due to the large number of if/else statements in Figure 4a, the program has a deep CFG, shown in Figure 4b. The use of predication eliminates the control dependencies between "compare" instructions and the basic blocks following the corresponding "branch" instructions, allowing merging of such basic blocks. In the example, a "compare" instruction generates two predicates, and the code sequence that follows is executed based on the result of the comparison, as shown in Figure 4c. To hint at the advantage of this technique we measured the latency of IPv4 after manually introducing predication. The result shows a latency of only 107 ns, which is 18% less than the current SWSL design. Thus, it is possible to eliminate MBBs having conditional expressions, leading to a significant decrease in latency. In addition, predication has a positive influence on power and area: logic components are also eliminated when the critical path of the CFG is shortened by conditional MBB elimination.

Moreover, SWSL can be adapted to design acceleration engines for general program workloads. At present, SWSL is designed for lookup engines, but its underlying techniques make it possible to generate acceleration engines for specific code segments in conventional programs. Thus, further enhancements could enable SWSL to generate hardware logic replacing other acceleration engines such as DySER and BERET [15, 19].

6. CONCLUSION

In this paper, we propose SoftWare Synthesis for network Lookup (SWSL) to generate hardware logic from a high-level programming language. Using a DFG-based application programming model, SWSL enables developers to design fully hardware lookup engines in a short time. At the same time, leveraging CFG analysis and compile-time optimization, SWSL improves performance and achieves reasonable hardware efficiency.

Synthesis results show that SWSL achieves high throughput, which is an essential property for network lookup. Latency is better than PLUG, which follows a Von Neumann architecture, while it does not win against LEAP due to the specialized computation design of the latter. Moreover, power estimates show that SWSL's power consumption is comparable to, although slightly worse than, PLUG and LEAP. Finally, SWSL significantly reduces the area requirement, since it implements only the functionality needed by the original lookup algorithm (while PLUG and LEAP must overprovision resources to support arbitrary applications).

Further analysis suggests that performance and efficiency can be improved by enhancing the back-end compiler, for example by introducing predication. Deploying these changes is mostly a matter of further engineering the SWSL compiler. Overall, SWSL shows the effectiveness of software synthesis in simplifying the design of complex hardware accelerators, while maintaining a short design time.

7. ACKNOWLEDGMENTS

We thank the anonymous reviewers. Support for this research was provided by the National Science Foundation under the following grants: CNS-0917213 and CNS-1228782. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or other institutions.
a) Source code:

    …
    if(a > b && b != 0) {
        if(a > c && c != 0) {
            d = 3;
        }
        else {
            d = 2;
        }
    }
    else {
        if(a > e && e != 0) {
            d = 1;
        }
        else {
            d = 0;
        }
    }

b) Control flow graph: the nested if/else statements produce a deep CFG (MBB0 branching through MBB1-MBB7 down to MBB8); diagram omitted.

c) MBB0 with predication:

    MBB0:
         cmp gt p1, p2 = a, b
    (p1) cmp ne p3, p2 = b, 0
    (p3) cmp gt p4, p5 = a, c
    (p4) cmp ne p6, p5 = c, 0
    (p5) store d, 2
    (p6) store d, 3
    (p2) cmp gt p7, p8 = a, e
    (p7) cmp ne p9, p8 = e, 0
    (p8) store d, 0
    (p9) store d, 1

Figure 4: Eliminating MBBs by Predication
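The predicated form of Figure 4c can be mimicked in software: each compare produces a pair of predicates, and every store to d executes under a predicate rather than under a branch, so all four stores live in a single block. A minimal C++ sketch (predicates modeled as booleans; `predicated_d` is an invented name, not SWSL output):

```cpp
#include <cassert>

// Predicated evaluation of the nested if/else of Figure 4a.
// p1..p9 mirror the predicates of Figure 4c; each guarded "store"
// assigns d only when its predicate is true.
int predicated_d(int a, int b, int c, int e) {
    int d = 0;
    bool p1 = a > b;                 //      cmp gt p1, p2 = a, b
    bool p2 = !p1;
    bool p3 = p1 && (b != 0);        // (p1) cmp ne p3, p2 = b, 0
    p2 = p2 || (p1 && b == 0);       //      p2 collects the whole else side
    bool p4 = p3 && (a > c);         // (p3) cmp gt p4, p5 = a, c
    bool p5 = p3 && !(a > c);
    bool p6 = p4 && (c != 0);        // (p4) cmp ne p6, p5 = c, 0
    p5 = p5 || (p4 && c == 0);
    bool p7 = p2 && (a > e);         // (p2) cmp gt p7, p8 = a, e
    bool p8 = p2 && !(a > e);
    bool p9 = p7 && (e != 0);        // (p7) cmp ne p9, p8 = e, 0
    p8 = p8 || (p7 && e == 0);
    if (p5) d = 2;                   // (p5) store d, 2
    if (p6) d = 3;                   // (p6) store d, 3
    if (p8) d = 0;                   // (p8) store d, 0
    if (p9) d = 1;                   // (p9) store d, 1
    return d;
}
```

Exactly one of p5, p6, p8, p9 is true on any input, so the guarded stores never conflict; in hardware the control dependencies between the compares and the stores disappear, which is what allows the conditional MBBs to be merged.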

8. REFERENCES

[1] Bluespec. http://www.bluespec.com/algorithmic-synthesis.htm.
[2] Comparing and contrasting FPGA and microprocessor system design and development. https://www.xilinx.com/support/documentation/white_papers/wp213.pdf.
[3] M. Horowitz. Why design must change (slides). http://www.synopsys.com/Community/UniversityProgram/CapsuleModule/Why-Design-Must-Change.ppt.
[4] F. Baboescu, D. Tullsen, G. Rosu, and S. Singh. A tree based router search engine architecture with single port memories. In ISCA '05.
[5] S. Borkar and A. A. Chien. The future of microprocessors. Commun. ACM, 54(5):67–77, 2011.
[6] A. Broder and M. Mitzenmacher. Using multiple hash functions to improve IP lookups. In INFOCOM '03.
[7] T. Callahan, J. Hauser, and J. Wawrzynek. The Garp architecture and C compiler. IEEE Trans. Comput., 33:62–69, April 2000.
[8] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. D. Brown, and T. Czajkowski. LegUp: An open source high-level synthesis tool for FPGA-based processor/accelerator systems. In FPGA '11.
[9] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker. Ethane: taking control of the enterprise. In SIGCOMM '07.
[10] L. De Carli, Y. Pan, A. Kumar, C. Estan, and K. Sankaralingam. PLUG: Flexible lookup modules for rapid deployment of new protocols in high-speed routers. In SIGCOMM '09.
[11] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink. Small forwarding tables for fast routing lookups. In SIGCOMM '97.
[12] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy. RouteBricks: exploiting parallelism to scale software routers. In SOSP '09.
[13] C. Ebeling, D. C. Cronquist, and P. Franklin. RaPiD: reconfigurable pipelined datapath. In FPL '96.
[14] S. C. Goldstein, H. Schmit, M. Moe, M. Budiu, S. Cadambi, R. R. Taylor, and R. Laufer. PipeRench: A coprocessor for streaming multimedia acceleration. In ISCA '99.
[15] V. Govindaraju, C.-H. Ho, and K. Sankaralingam. Dynamically specialized datapaths for energy efficient computing. In HPCA '11.
[16] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic, 2002.
[17] Z. Guo, B. Buyukkurt, W. Najjar, and K. Vissers. Optimized generation of data-path from C codes. In DATE '05.
[18] P. Gupta, S. Lin, and N. McKeown. Routing lookups in hardware at memory access speeds. In INFOCOM '98.
[19] S. Gupta, S. Feng, A. Ansari, S. Mahlke, and D. August. Bundled execution of recurring traces for energy-efficient general purpose processing. In MICRO 44, 2011.
[20] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: a GPU-accelerated software router. In SIGCOMM '10.
[21] E. N. Harris, S. L. Wasmundt, L. De Carli, K. Sankaralingam, and C. Estan. LEAP: Latency-, energy- and area-optimized lookup pipeline. In ANCS '12.
[22] J. Huck, D. Morris, J. Ross, A. Knies, H. Mulder, and R. Zahir. Introducing the IA-64 architecture. IEEE Micro, 20(5):12–23, 2000.
[23] W. Jiang, Q. Wang, and V. Prasanna. Beyond TCAMs: an SRAM-based parallel multi-pipeline architecture for terabit IP lookup. In INFOCOM '08.
[24] V. Kathail. Creating power-efficient application engines for SoC design. Synfora Inc., SoC Central, 2005.
[25] A. Kumar, L. De Carli, S. J. Kim, M. de Kruijf, K. Sankaralingam, C. Estan, and S. Jha. Design and implementation of the PLUG architecture for programmable and efficient network lookups. In PACT '10.
[26] S. Kumar, M. Becchi, P. Crowley, and J. Turner. CAMP: fast and efficient IP lookup architecture. In ANCS '06.
[27] S. Kumar and B. Lynch. Smart memory for high performance network packet forwarding. In HotChips, Aug. 2010.
[28] D. Lau, O. Pritchard, and P. Molson. Automated generation of hardware accelerators with direct memory access from ANSI/ISO standard C functions. In FCCM '06.
[29] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock. In MICRO 25, 1992.
[30] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, Mar. 2008.
[31] S. Mu, X. Zhang, N. Zhang, J. Lu, Y. Deng, and S. Zhang. IP routing processing with graphic processors. In DATE '10.
[32] C. Rowen and S. Leibson. Flexible architectures for engineering successful SOCs. In DAC '04.
[33] O. Shacham, O. Azizi, M. Wachs, W. Qadeer, Z. Asgar, K. Kelley, J. Stevenson, S. Richardson, M. Horowitz, B. Lee, A. Solomatnikov, and A. Firoozshahian. Rethinking digital design: Why design must change. IEEE Micro, 30(6):9–24, 2010.
[34] S. Singh, F. Baboescu, G. Varghese, and J. Wang. Packet classification using multidimensional cutting. In SIGCOMM '03.
[35] D. Taylor, J. Turner, J. Lockwood, T. Sproull, and D. Parlour. Scalable IP lookup for Internet routers. IEEE Journal on Selected Areas in Communications, 21(4):522–534, May 2003.
[36] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, HP Labs.
[37] B. Vamanan, G. Voskuilen, and T. N. Vijaykumar. EffiCuts: optimizing packet classification for memory and throughput. In SIGCOMM '10.