Direct Universal Access: Making Data Center Resources Available to FPGA

Ran Shu1, Peng Cheng1, Guo Chen1,2, Zhiyuan Guo1,3, Lei Qu1, Yongqiang Xiong1, Derek Chiou4, and Thomas Moscibroda4
1Microsoft Research, 2Hunan University, 3Beihang University, 4Microsoft Azure

Abstract

FPGAs have been deployed at massive scale in data centers. Using currently available communication architectures, however, it is difficult for FPGAs to access and utilize the various heterogeneous resources available in data centers (DRAM, CPU, GPU, ...). In this paper, we present Direct Universal Access (DUA), a communication architecture that provides uniform access for FPGAs to these data center resources.

Without being limited by machine boundaries, DUA provides global names and a common interface for communicating across various resources, with the underlying network automatically routing traffic and managing resource multiplexing. Our benchmarks show that DUA provides simple and fair-share resource access with small logic area overhead (<10%) and negligible latency (<0.2µs). We also build two practical multi-FPGA applications—deep crossing and regular expression matching—on top of DUA to demonstrate its usability and efficiency.

1 Introduction

Large-scale FPGA deployments in data centers [1–9] have changed the way FPGA-based distributed systems are designed. Instead of a small number of FPGAs and limited resources (e.g., only the DRAM on each FPGA board), modern FPGA applications can use heterogeneous computation/memory resources, such as CPUs, GPUs, host/onboard DRAM, and SSDs, across large-scale data centers. The scale and diversity of resources enable the building of novel FPGA-based applications, such as cloud-scale web search ranking [10,11], storage systems [4,6], and deep learning platforms [12].

Building large-scale and diverse FPGA applications requires communication capabilities between any pair of FPGAs and other components in the data center. However, with today's technology such FPGA communication is highly impractical and cumbersome, posing severe challenges to designers and application developers. There are three main problems barring FPGA applications from conveniently and efficiently using data center resources (see Fig. 1(a)):

First, different resources at different locations (local/remote) are connected in different ways (e.g., PCIe, network), requiring different communication stacks. This greatly increases programming complexity. For example, an FPGA application may use a custom communication stack [2] to access a local (same server) FPGA, a networking stack [11] to access a remote (different server) FPGA, GPU/FPGA Direct [13] to access a GPU, DMA to access system DRAM, DDR IP to access local DRAM, and so on. Each of these communication stacks has a different interface (different I/O ports, functional timings, etc.), making it hard to understand, program, optimize, and debug.

Second, most resources (e.g., host DRAM, SSD) in a data center are organized in a server-centric manner. Each resource uses a dedicated name space that can only be accessed from within a host (e.g., a PCIe address). The lack of global names for resources is inefficient for FPGAs accessing remote resources, since they first need to communicate with the remote host, which then has to perform the access on behalf of the requesting FPGA. If an FPGA wants to write to a remote SSD, for example, it first has to transfer the data to a daemon process running on its local CPU, which passes the data to the remote CPU, which finally writes the data to the targeted SSD. To make matters worse, developers manually write dedicated logic for each type of FPGA-to-resource communication.

Third, although FPGAs have been deployed at data center scale, current FPGA communication does not deal well with resource multiplexing. Though various resources are accessed through the same physical interface (e.g., DMA and GPU/FPGA Direct both go through PCIe), we are not aware of any general resource multiplexing scheme. FPGA developers need to manually handle each specific case, which is tightly coupled with the application logic. Moreover, the problems become more severe when multiple FPGA applications use the same resource (e.g., applications on two FPGAs accessing the same SSD).

Instead, FPGA developers would like an FPGA communication architecture like the one shown in Fig. 1(b).

Figure 1: Comparison between (a) the current FPGA communication architecture and (b) an ideal communication architecture that enables FPGAs to easily access data center resources.

This architecture has the following desirable properties: 1) a common communication interface regardless of what the communication endpoints are and where they reside; 2) a global, unified naming scheme that can address and access all resources regardless of their location; and 3) an underlying network service that provides routing and resource multiplexing. With such a communication architecture, FPGA applications could easily access a diverse set of resources across the data center, using the common programming interface with each resource's global name. Indeed, such a communication architecture is what developers and architects expect and is used in other distributed systems; for example, such a design has proven successful in IP networks.

In this paper, we propose Direct Universal Access (DUA) to bring this desirable communication architecture to the FPGA world. Doing so is challenging in multiple ways, especially considering that very little, if anything, can be changed when we seek real-world deployment in existing data centers: it is impractical to require all manufacturers to support a new unified communication architecture. To circumvent this challenge, DUA abstracts an overlay network on top of the existing communication stacks and physical interconnections, thereby providing a unified FPGA communication method for accessing all resources. Moreover, performance and area are often crucial in FPGA-based applications, so DUA was designed to minimize performance and area overheads. Inspired by ideas in software-defined networking [14,15], we architect DUA into a data plane that is included in every FPGA and a hybrid control plane including both CPU and FPGA control agents. Needless to say, designing and implementing this novel communication architecture also brings numerous specific technical challenges, design choices, and implementation problems. We discuss these challenges alongside our solutions in Sections 4, 5, 6, and 7.

In summary, we make the following contributions in this paper. We introduce DUA, a communication architecture with unified naming and a common interface that enables large-scale FPGA applications in a cloud environment. We implement and deploy DUA on a twenty-FPGA testbed. Our testbed benchmarks show that DUA introduces negligible latency (<0.2µs) and small area (<10%) overhead on each FPGA. Moreover, we demonstrate that using DUA it is easy to build highly efficient, production-quality FPGA applications: we implement two real-world multi-FPGA applications, Deep Crossing and regular expression matching, and show the vastly superior performance of these applications on top of DUA. In addition, we design and implement a new communication stack that supports high-performance communication between local FPGAs through PCIe, which is integrated into the DUA data plane as a highly efficient underlying stack.

2 Background

2.1 FPGA Deployments in Data Centers

The left side of Fig. 1(a) shows an overview of current FPGA data center deployments. FPGA boards connect to their host server motherboard through commodity communication interfaces such as PCIe. Each hosting server can contain one [11] or more FPGA boards [2]. Each FPGA board is typically equipped with gigabytes of onboard DRAM [2,9,11]. Recent deployments [11] directly connect each FPGA to the data center networking fabric, enabling it to send and receive packets without involving its host server. To meet the high data rate of the physical interfaces, FPGAs use Hard IPs (HIPs) to handle the physical-layer protocol of each interface. Above these HIPs, FPGAs provide communication stacks that abstract access to different resources. Communication stacks may share the same HIP; e.g., the DMA [16] and GPU [13] stacks both need to use the PCIe fabric. Although a server may contain multiple boards connected through PCIe [2], there are no PCIe-based stacks that support efficient and direct communication between FPGAs.

Data center FPGAs often contain a shell [10,11] with modules that are common to all applications. For example, shells typically include communication stacks for accessing various resources (e.g., PCIe, DMA, Ethernet MAC). In this way, developers only need to write their FPGA application logic and connect it to these communication interfaces. The FPGA shell is similar to an operating system in the software world.

FPGAs in data centers are widely used to accelerate different applications. Applications like deep neural networks [17,18] and bioinformatics [19] place high demands on communication between FPGAs. FPGAs for web search ranking [10,11] rapidly exchange data with the host and other FPGAs to generate the ranking as quickly as possible. Key-value store acceleration [20] requires FPGAs to access a remote FPGA's onboard DRAM or even a remote server's host memory. Big data analytics [21] not only requires rapid coordination between computation nodes, but also needs to fetch data directly from databases [4,5]. The demand for high throughput and extremely low latency requires FPGAs to access heterogeneous resources directly, which challenges the design of the FPGA communication architecture in data centers.
2.2 Existing Problems

The current FPGA communication architecture poses multiple severe problems:

Complex FPGA Application Interface: FPGA-based systems are hard to develop and deploy. One of the major reasons is that communication interfaces are hard to implement. Interfacing requires significant programming expertise and effort from application developers. To make things worse, existing stack interfaces are highly implementation specific, with substantial incompatibilities and differences between vendors. This makes building the communication system of an application alone a major undertaking (e.g., KV-Direct [20]).

To convey a concrete sense of the programming difficulties involved, consider a simple FPGA application that uses different communication interfaces to access four different resources: host DRAM, host CPU, onboard DRAM, and a remote FPGA. Table 1 shows the lines of Verilog code application developers must write to connect each stack's interface.

Table 1: FPGA programming effort to connect different communication stacks.
  Resource       Communication stack   LoC
  Host DRAM      DMA                   294
  Host CPU       FPGA host stack       205
  Onboard DRAM   DDR                   517
  Remote FPGA    LTL                   1356

Poor Resource Accessibility: Although data centers provide many computation/memory resources that could potentially be used by FPGA applications, most of these resources (even the homogeneous ones) are only named in a server-local name space and work with their own software device driver/stack. There is no unified naming scheme for accessing remote resources. Without unified naming, most PCIe-based resources (e.g., DRAM, SSD, GPU) can only be accessed within a server's PCIe domain, making them difficult for remote FPGAs to use. Even with the latest technologies such as RDMA, which FPGAs can use to access specific remote resources, a software driver/stack is still needed for remote communication, impacting performance.

Fixed Network Routing: In current communication architectures, FPGA applications can only access resources through limited and fixed paths. In [2], for example, FPGAs communicate with other local FPGAs through the dedicated PCIe fabric and cannot access remote resources through networking. In [11,22], FPGAs are directly connected to Ethernet through top-of-rack (ToR) switches, i.e., a pair of FPGAs can only communicate through Ethernet even when they are in the same PCIe domain. Neither example can make full use of all available bandwidth.

Such fixed communication architectures also limit the system's scalability. For example, deploying large FPGA applications via a network-based communication architecture increases the port density of ToR switches and is a challenge to data center networking, even if most FPGAs are used for compute-intensive tasks and need only a little networking bandwidth.

Poor Resource Multiplexing: To support accessing data center resources as a pool, resource multiplexing is one of the key considerations of an FPGA communication architecture. Current architectures do not handle stack multiplexing well. For example, if two applications both access local host DRAM through DMA, they need to collaboratively write a DMA multiplexer and demultiplexer; in our experience, even a simple multiplexer/demultiplexer requires 354 lines of HDL code. Moreover, there is currently no general physical interface multiplexing scheme, so it is hard for current FPGA applications to simultaneously access local host DRAM and a local SSD without modifying the underlying shell, since these two resources are both connected through the PCIe bus.

The Elastic Router proposed in [11] tries to solve the multiplexing problem in an FPGA environment. Currently, however, it only addresses multiplexing between multiple applications that use a common networking stack, without handling multiplexing between other stacks or between physical interfaces. Later we will see that DUA extends this with a general resource multiplexing scheme.

Inefficient Communication Stack: Existing communication architectures implement resource access for FPGAs in an indirect and inefficient way. Typically, FPGA applications use DMA to access local resources, which results in significant latency and low bandwidth. We note that while [23] provides a direct FPGA-to-FPGA communication mechanism through PCIe, it is inefficient: the receiver FPGA acts as the host DMA driver and first issues a DMA request; the sender FPGA treats the request as a normal DMA request and sends the data; after sending the data, the two sides need to synchronize their queue pointers. Since a data transmission crosses the PCIe fabric three times, it wastes bandwidth and has higher than necessary latency.

Furthermore, an FPGA can only access remote host DRAM by relaying data between the two sides' CPUs, reducing performance and consuming cores. We measured the performance of doing so in our testbed (see §8). Specifically, we ran two daemon processes on the local and remote servers' CPUs that relayed data between the local FPGA and the remote DRAM through a TCP socket. The results show that for writing 256B of data to the remote DRAM, the average end-to-end latency is ∼51.4µs, and the tail latency is in the milliseconds. Using remote DMA instead of TCP may improve the performance, but in our measurement the average latency is still ∼20µs due to the CPU involvement (e.g., initiating requests, interrupts). We note that in such an application it is possible to leverage the remote FPGA as a data relay for accessing remote host DRAM, through direct communication between FPGAs (e.g., using LTL [11]).

3 Desired Communication Architecture

To overcome the problems outlined in the previous section, we design DUA around a familiar and concrete model: global names and a common communication interface for FPGAs and resources regardless of their location, where the underlying network automatically routes communication traffic according to the global name and manages resource multiplexing while making full use of existing stacks.

This communication architecture supports pooling data center resources for FPGAs to access. Specifically, FPGA applications can access any resource in the data center using a global name and a common programming interface. The network provides a globally unified name for the various kinds of resources. When it gets a message, the network either routes the access to the targeted resource if it is available, or notifies the application if the resource is not available, automatically and without the application being involved. Also, there is no need for applications to implement multiplexing between communication stacks or physical interconnections: DUA utilizes the underlying communication stacks when appropriate, and the network automatically manages sharing and contention according to the desired policy (e.g., fair sharing or priority scheduling).

Networked systems communicate in exactly this way. In computer networking, programmers use IP addresses with TCP/UDP ports as global names to identify communication endpoints, and access them using the unified BSD socket interface. The network (both the networking stack and the fabric) automatically routes traffic along paths calculated by routing protocols. The network also deals with resource multiplexing through techniques such as congestion control and flow control.

Of course, the communication architecture must also carefully consider security mechanisms, so that universal access from FPGAs to resources within the data center does not damage or crash other hardware/software systems. And since FPGA-based applications often have high performance requirements, the performance and resource overhead of the unified communication method must be kept low.

4 DUA Overview

Overall, DUA abstracts an overlay network for FPGAs on top of the existing data center network fabric. DUA provides a new communication architecture with the desired properties mentioned above, turning data center resources into a shared pool for FPGAs.

In detail, we provide a communication architecture with such an overlay network, including the common communication interface and the naming scheme suitable for various applications to access different resources, along with the routing, multiplexing, and resource management schemes provided by the network.

Fig. 2 shows the system architecture of DUA. Specifically, DUA consists of a low-cost hardware data plane residing in every FPGA's shell, and a hybrid control plane including both CPU and FPGA control agents.

Figure 2: DUA architecture.

The DUA control plane is the brain of the overlay network. It manages all resources, assigning addresses and calculating the routing paths to them, and manages the multiplexing of resources and the network. DUA supports both connection-based and connectionless communication; connection setup and close are processed and managed in the DUA control plane.

The DUA data plane is the actual executor of FPGA communication traffic. It sits between the FPGA applications and the physical interfaces. The data plane can efficiently reuse existing communication stacks, as well as support new stacks, providing the same communication interface for various applications to access different resources.
5 DUA Communication Interface

We first describe the unified communication interface provided by DUA for accessing various resources.

5.1 Resource Address Format

DUA provides each resource a unified address that is globally unique in the data center. Devising a totally new address format is not a wise option, since it would require both a complicated system for managing data-center-scale addresses and changes to the existing network fabric to use the new format. Instead, DUA leverages the current naming schemes of the various resources and combines them into a new hierarchical name.

Specifically, DUA assigns each device a unique name (UID) that extends the IP address into the intra-server network. A UID consists of two fields, serverID:deviceID. serverID is a globally unique ID for each server; in an Ethernet-based data center network, we leverage the server IP address as the serverID, which is already uniquely assigned by the network fabric. deviceID is a unique ID for each resource within the server (e.g., FPGA onboard DRAM, GPU, NVMe, etc.), which is newly assigned by DUA. In our current implementation, the UID is 48 bits in total: a 32-bit serverID (the length of an IPv4 address in current data centers) and a 16-bit deviceID.

Within each device, DUA leverages the existing addressing scheme of each resource. For example, the address can be a memory address if the targeted resource is host/onboard memory, or a port number if the targeted resource is a remote FPGA application. Fig. 3 provides some examples of different resources' addresses in DUA. In §6.1 and §6.2 we describe how this UID format makes it easy to manage addresses and perform routing.

Figure 3: Example resource addresses on server 192.168.0.2.
  UID (serverID:deviceID)   Address/port          Resource description
  192.168.0.2:1             0x00000001CFFFF000    1st block of host DRAM
  192.168.0.2:1             0x000000019FFFF000    2nd block of host DRAM
  192.168.0.2:2             0x80000000            1st block of FPGA onboard DRAM
  192.168.0.2:3             8000                  1st application on FPGA
  192.168.0.2:3             8001                  2nd application on FPGA
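To make the 48-bit format concrete, the following minimal C sketch (ours, not from the paper) packs a serverID taken from an IPv4 address together with a 16-bit deviceID; the exact bit layout is an illustrative assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* 48-bit UID = 32-bit serverID (IPv4 address) | 16-bit deviceID,
 * stored in the low 48 bits of a uint64_t (layout assumed for illustration). */
typedef uint64_t dua_uid_t;

static dua_uid_t dua_make_uid(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                              uint16_t device_id) {
    uint32_t server_id = ((uint32_t)a << 24) | ((uint32_t)b << 16) |
                         ((uint32_t)c << 8)  | (uint32_t)d;
    return ((uint64_t)server_id << 16) | device_id;
}

static uint32_t dua_uid_server(dua_uid_t uid) { return (uint32_t)(uid >> 16); }
static uint16_t dua_uid_device(dua_uid_t uid) { return (uint16_t)(uid & 0xFFFF); }

int main(void) {
    /* Example from Fig. 3: device 2 (FPGA onboard DRAM) on server 192.168.0.2. */
    dua_uid_t uid = dua_make_uid(192, 168, 0, 2, 2);
    printf("serverID=0x%08x deviceID=%u\n",
           (unsigned)dua_uid_server(uid), (unsigned)dua_uid_device(uid));
    return 0;
}
```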
5.2 API

DUA supports both connection-based and connectionless communication. Connectionless communication is less efficient than connection-based communication, because connections facilitate management (e.g., access control) and network resource multiplexing (e.g., multiplexing underlying stack tunnels for connections that share a common routing path). More details on routing and connection management are discussed in §6.2 and §6.3.

5.2.1 Semantic Primitives

For each FPGA communication connection, applications issue communication primitives similar to BSD sockets. There are two types of primitives:

1. Connection setup/close primitives: These include CONNECT/LISTEN/ACCEPT and CLOSE, which are used by applications to set up and close a connection, respectively.

2. Data transmission primitives: These include SEND/RECV and WRITE/READ. SEND/RECV send/receive data to/from computation resources such as other FPGA applications and CPU processes, and work as a FIFO between the two sides. Additionally, DUA supports message-based communication by adding a PUSH flag in each DUA message header. WRITE/READ are one-sided primitives for writing/reading data to/from memory/storage resources, and are therefore different from WRITE/READ in BSD sockets.

5.2.2 I/O Interface

We implement DUA in both Verilog and OpenCL. OpenCL is a high-level programming language for FPGAs (and GPUs). DUA implemented in OpenCL can provide a socket-like interface; however, it costs much more FPGA logic and degrades performance. Thus we use only the Verilog implementation and add a wrapper on top of it to support OpenCL. See Appendices A and B for sample usage of the DUA interface.

Fig. 4 shows the physical I/O interface of DUA, which is full-duplex. The request interface is for applications to issue primitives (§5.2); the response interface is for applications to get primitive responses (completion information or response data).

Figure 4: I/O interface for DUA (request signals: data_out[255:0], first_out, last_out, valid_out, ready_in; response signals: data_in[255:0], first_in, last_in, valid_in, ready_out).

The DUA I/O interface is the same in both directions. For each DUA message, the message header and payload use the same data wires for transmission. Note that the data signal is only 256 bits wide. Although a wider interface could increase the amount of data transmitted per hardware cycle, it would increase the switch fabric logic complexity and thus decrease the achievable clock frequency of the corresponding modules. Consequently, a DUA message may be transmitted over several cycles. The first one or two cycles (depending on the message type) of the data bus carry the message header, followed by the payload. The first/last bits indicate the first and last cycle of a message transmission. For each cycle with data to be sent or received, the valid signal is set; note that valid can only be raised when the receiver side has set the ready signal.
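As a usage illustration only, here is a hypothetical host-style C sketch of the semantic primitives of §5.2.1; the function names and signatures (dua_connect, dua_write, dua_close) are our own stand-ins for the Verilog/OpenCL interface, not a published DUA API.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint64_t dua_uid_t;   /* 48-bit UID from Section 5.1 */
typedef int      dua_conn_t;

/* Stub implementations standing in for the real DUA interface. */
static dua_conn_t dua_connect(dua_uid_t src, dua_uid_t dst) {
    printf("CONNECT %llx -> %llx\n",
           (unsigned long long)src, (unsigned long long)dst);
    return 1;                                   /* pretend connection id */
}
static int dua_write(dua_conn_t c, uint64_t addr, const void *buf, size_t len) {
    printf("WRITE conn=%d addr=0x%llx len=%zu\n",
           c, (unsigned long long)addr, len);
    (void)buf;
    return 0;
}
static void dua_close(dua_conn_t c) { printf("CLOSE conn=%d\n", c); }

int main(void) {
    char data[256];
    memset(data, 0xAB, sizeof(data));

    dua_uid_t self       = 0xC0A800020003ULL;   /* 192.168.0.2, device 3 (FPGA app) */
    dua_uid_t remote_mem = 0xC0A80B050001ULL;   /* 192.168.11.5, device 1 (host DRAM) */

    /* One connection, then a one-sided WRITE into remote host DRAM,
     * addressed only by UID + in-device memory address. */
    dua_conn_t c = dua_connect(self, remote_mem);
    dua_write(c, 0x00000001CFFFF000ULL, data, sizeof(data));
    dua_close(c);
    return 0;
}
```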
Next, in §6 and §7, we introduce the design and implementation of the control and data planes.

6 DUA Control Plane

The DUA control plane manages all resources, routing, and connections. Since FPGAs are not well suited to implementing complicated control logic, we put the main control logic in the CPU control agent (CPU CA). We also implement a control agent in the FPGA (FPGA CA), which monitors the local resources on its board and delivers control plane commands to the data plane.

6.1 Resource Management

Logically, the entire DUA control plane maintains the information of all available resources in the data center and assigns each resource a UID. Thanks to the hierarchical address format (§5.1), each DUA control agent only needs to handle its local resources.

Specifically, each FPGA CA monitors all available resources on its FPGA board (e.g., onboard DRAM, FPGA applications). The FPGA CA does not assign addresses for those resources. Instead, each FPGA CA uploads its local resource information to the CPU CA in its host server, and the CPU CA assigns the UIDs all together. The host CPU CA gathers the resource information from all FPGA CAs, as well as from all other resources in the server, such as host memory and GPUs.

It is straightforward to assign UIDs to local resources. As mentioned, the first UID field, serverID, is the server IP. The CPU CA then assigns a unique deviceID to each resource within the server. The CPU CA maintains the mapping from deviceID to the different local resources, and updates the mapping whenever a resource changes, e.g., on plug-in/plug-out or failure. DUA does not manage the address space within each resource, instead letting each device control it.

Currently, DUA does not provide a naming service. Applications directly use UIDs to identify resources without a name resolution service; the design of a naming service is future work.

6.2 Routing Management

To offer routing capabilities, the DUA control plane calculates the routing paths, and the data plane then forwards the traffic accordingly. Designing and implementing data-center-scale routing is challenging [24]. Benefiting from the hierarchical UID format, we leverage existing data center network routing capabilities: each DUA control agent only needs to maintain interconnection information and calculate routing paths within its own server.

Specifically, each CPU CA independently maintains an interconnection table for all local resources, as shown in Fig. 5. The interconnection table records the neighborhood information between FPGAs, or between an FPGA and other resources. The first column records a source FPGA, and the second column records the local/remote resources that can be directly accessed from this FPGA, along with the underlying communication stack used to reach them.

Figure 5: Example interconnection table on server 192.168.0.2.
  Src Resource (UID)        Dst Resource (UID) / Stack
  FPGA 1 (192.168.0.2:1)    FPGA 2 (192.168.0.2:2) / FPGA Connect
                            Host DRAM (192.168.0.2:3) / DMA
                            Onboard DRAM (192.168.0.2:4) / DDR
  FPGA 2 (192.168.0.2:2)    FPGA 1 (192.168.0.2:1) / FPGA Connect
                            Host DRAM (192.168.0.2:3) / DMA
                            Resources on other servers (*:*) / LTL

The interconnection table is updated as follows. Besides the resource information, each FPGA CA uploads information about its communication stacks and physical interfaces to the CPU CA in its own server. Based on the uploaded information, the CPU CA determines the interconnections between different FPGAs and updates the table. If an FPGA reports that it has connectivity through the data center networking fabric, the CPU CA inserts an entry for this FPGA covering each legal destination to resources on other servers (the last row of Fig. 5).

With the interconnection table, it is easy to calculate a routing path to a targeted resource. Specifically, if an FPGA wants to communicate with some resource, DUA first checks the serverID and deviceID fields in the destination resource UID to see whether this resource has a direct connection from this FPGA. If yes, DUA uses the stack recorded in the interconnection table to access the resource. If not, DUA looks up the interconnection table to find a routing path through other FPGAs.

For example, in Fig. 5, if FPGA 1 (UID 192.168.0.2:1) wants to communicate with a remote application on FPGA 3 located on another server (say, UID 192.168.11.5:3), the calculated routing path is from FPGA 1 to FPGA 2 via FPGA Connect, and then to FPGA 3 via LTL.
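The lookup logic can be summarized by the following C model (our simplification): it mirrors the Fig. 5 table and the two-step check described above, with a wildcard row standing in for "resources on other servers / LTL".

```c
#include <stdint.h>
#include <stdio.h>

typedef uint64_t uid48;                   /* serverID(32b):deviceID(16b) */
#define SERVER(u)  ((uint32_t)((u) >> 16))

struct ic_entry {                         /* one row of the interconnection table */
    uid48       src_fpga;
    uid48       dst;                      /* 0 = wildcard: any resource on another server */
    const char *stack;                    /* FPGA Connect, DMA, DDR, LTL, ... */
};

/* Stack for a direct hop from src to dst, or NULL if none exists. */
static const char *direct(const struct ic_entry *t, int n, uid48 src, uid48 dst) {
    for (int i = 0; i < n; i++) {
        if (t[i].src_fpga != src) continue;
        if (t[i].dst == dst) return t[i].stack;
        if (t[i].dst == 0 && SERVER(dst) != SERVER(src)) return t[i].stack;
    }
    return NULL;
}

int main(void) {
    /* Fig. 5, server 192.168.0.2: FPGA1=:1, FPGA2=:2, host DRAM=:3, onboard DRAM=:4. */
    uid48 fpga1 = 0xC0A800020001ULL, fpga2 = 0xC0A800020002ULL;
    struct ic_entry tbl[] = {
        { fpga1, fpga2,             "FPGA Connect" },
        { fpga1, 0xC0A800020003ULL, "DMA"          },
        { fpga1, 0xC0A800020004ULL, "DDR"          },
        { fpga2, fpga1,             "FPGA Connect" },
        { fpga2, 0xC0A800020003ULL, "DMA"          },
        { fpga2, 0,                 "LTL"          },  /* resources on other servers */
    };
    int n = sizeof(tbl) / sizeof(tbl[0]);

    uid48 remote_app = 0xC0A80B050003ULL;              /* app on FPGA 3, other server */
    const char *s = direct(tbl, n, fpga1, remote_app);
    if (s) {
        printf("direct via %s\n", s);
    } else {
        /* Two-hop fallback: route through a local FPGA that can reach the destination. */
        for (int i = 0; i < n; i++)
            if (tbl[i].src_fpga == fpga1 && direct(tbl, n, tbl[i].dst, remote_app))
                printf("via %s to intermediate FPGA, then %s\n",
                       tbl[i].stack, direct(tbl, n, tbl[i].dst, remote_app));
    }
    return 0;
}
```

Running this on the Fig. 5 example reproduces the FPGA 1 → FPGA 2 (FPGA Connect) → FPGA 3 (LTL) path described in the text.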
Ap- wants to communicate with a remote application on FPGA plications directly use UID to identify resources without a 3 located on another server (say, UID 192.168.11.5:3), the name resolution service. The design of a naming service is calculated routing path is from FPGA 1 to FPGA 2 via FPGA future work. Connect, and then to FPGA 3 via LTL. 6.2 Routing Management 6.3 Connection Management To offer routing capabilities, the DUA control plane calcu- In DUA, every FPGA communication is abstracted as a lates the routing paths. The data plane the forwards the traffic connection. A connection is uniquely identified by a pair. The DUA control plane is in charge of plane) fully transparent to applications. managing all connections. At the connection setup phase, to ensure security, DUA App App App first checks the access control policy to see if the source DUA FPGA application is allowed to access the destination re- Connector Connector Connector source. If so, the CPU CA will check the dst UID with the overlay interconnection table to calculate the routing path (§6.2) and FPGA CA Switch Fabric then delivers the forwarding table to the FPGA data planes along the routing path, so the data plane will forward the ap- Connector Connector Connector Connector plication traffic to the right stack. Depending on the type of Connect LTL DDR Host DMA routing path, CPU CA will deliver different actions to the Translator Translator Translator Translator data plane and underlying stacks. Specifically: FPGA LTL DDR Host DMA 1) If the destination resource is directly connected, CPU Connect CA simply delivers the corresponding forwarding table to Figure 6: DUA overlay components. Src UID (6B) Dst UID (6B) the data plane.2) If the destination resource is not directly Sequence (4B) IP (4B) devID (2B) IP (4B) devID (2B) Flag Length Type connected, but still within the same server, CPU CA calls Param (12B) the stacks in the local FPGAs along the routing path to setup (1B) (2B) (1B) a stack connection. For example in Fig.5, if FPGA 2 initi- Figure 7: DUA message header format. ates a connection to access the onboard DRAM of FPGA 1, 7.1 DUA Overlay CPU CA first sets up an FPGA Connect connection between FPGA 2 and FPGA 1. 3) If the destination resource is on a To efficiently transfer data between multiple different ap- different server, CPU CA first calls the remote CPU CA to plications and stacks, we use an “Internet-router-like” archi- collaboratively setup a connection tunnel between the two re- tecture to implement the DUA overlay module. Specifically, mote FPGAs (e.g., LTL connection). If necessary, CPU CA there are three components inside the overlay, connector, also sets up stack tunnels between each sides’ local FPGAs. switch fabric and stack translator, as shown in Fig.6. If the above procedures all succeed, the DUA connection 7.1.1 Connector is established and the application is notified. Also, the active Connectors reside between application/stack and switch DUA connection is maintained in the control plane. Note fabric, playing a role similar to line cards in Internet routers. that some underlying stacks do not support a large number Specifically, connector performs the following tasks: of concurrent connections (e.g., LTL currently only supports 1) Translating data (from application or stack) from/to 64). 
7 DUA Data Plane

As shown in Fig. 2, the DUA data plane resides between the FPGA applications and the physical interfaces. It consists of three components: overlay, stacks, and underlay. The DUA overlay acts as a router, transferring data between different applications and communication stacks. Below the overlay, DUA leverages all existing (or future) stacks to efficiently access target resources. The DUA underlay connects the stacks and the physical interfaces, providing efficient multiplexing of the physical interfaces among different stacks.

7.1 DUA Overlay

To efficiently transfer data between multiple different applications and stacks, we use an "Internet-router-like" architecture to implement the DUA overlay module. Specifically, there are three components inside the overlay—connector, switch fabric, and stack translator—as shown in Fig. 6.

Figure 6: DUA overlay components.

7.1.1 Connector

Connectors reside between the applications/stacks and the switch fabric, playing a role similar to line cards in Internet routers. Specifically, a connector performs the following tasks:

1) Translating data (from an application or stack) between the I/O interface and DUA messages: The I/O interface described in §5.2 is actually implemented in the connectors, which receive data from both applications and stacks. A DUA message is the data transmission unit inside the overlay (like an IP packet); its header format is shown in Fig. 7, and a packed C view of this header is sketched after this list. The connector encapsulates data into DUA messages in cut-through mode, with the corresponding header fields filled in. When a connector receives a DUA message from the switch fabric, it translates it back into I/O interface signals. One thing to note is that for connection setup/close primitives passed from the I/O interface, the connector encapsulates a special message and passes it to the FPGA control agent, notifying the control plane to set up or close the connection.

Figure 7: DUA message header format: Src UID (6B: IP 4B + devID 2B), Dst UID (6B: IP 4B + devID 2B), Sequence (4B), Flag (1B), Length (2B), Type (1B), Param (12B).

2) Maintaining and looking up the forwarding table: The forwarding table stores the mapping from destination UID to switch output port. After message encapsulation, the connector looks up the forwarding table to determine the switch output port that forwards the message to the destination connector through the switch fabric. The forwarding table is computed by the control plane and delivered to the DUA connector (§6.2). To eliminate contention between connectors during forwarding table lookups, each connector maintains only its own forwarding table and performs lookups independently. Note that only entries for active connections and permitted connectionless message routes are delivered to the data plane, so this table is not large. In our current implementation, the table can store 32 forwarding entries and its area cost is very low (see §8.1).

3) Access control: The connector is also responsible for security checks whenever data comes in. Specifically, for connection-based communication, after a connection has passed the security check and has been successfully set up by the control plane (see §6.3), the control plane adds this connection to the routing control table in the connectors along the path. For connectionless communication, the control plane sets the routing table according to policies that determine which paths are allowed for these messages. Only data from these legal connections or paths is transmitted by DUA connectors.

4) Transport control: DUA adopts an end-to-end transport control design, so this feature is enabled only for connectors that attach to applications. We do not reimplement TCP or RDMA on the FPGA; instead, DUA leverages LTL [11] as the transport protocol.
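The fields shown in Fig. 7 total 32 bytes, i.e., exactly one 256-bit cycle on the I/O interface of §5.2.2 (the text notes some message types use a second header cycle). A packed C view is sketched below; the exact wire ordering and padding are assumptions for illustration.

```c
#include <stdint.h>
#include <assert.h>

/* DUA message header (cf. Fig. 7): 6+6+4+1+2+1+12 = 32 bytes,
 * i.e. one 256-bit data-bus cycle. Wire ordering assumed. */
#pragma pack(push, 1)
struct dua_msg_hdr {
    uint8_t  src_uid[6];   /* serverID/IP (4B) + deviceID (2B) */
    uint8_t  dst_uid[6];   /* serverID/IP (4B) + deviceID (2B) */
    uint32_t sequence;
    uint8_t  flag;         /* e.g. PUSH for message-based SEND (Section 5.2.1) */
    uint16_t length;       /* payload length */
    uint8_t  type;         /* e.g. SEND, WRITE, READ */
    uint8_t  param[12];    /* primitive-specific, e.g. target memory address */
};
#pragma pack(pop)

int main(void) {
    /* The header must fit exactly in one 256-bit flit. */
    static_assert(sizeof(struct dua_msg_hdr) == 32, "header is one 256-bit cycle");
    return 0;
}
```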
If con- setup by the control plane (see §6.3), the control plane adds trol plane decides to multiplex stack tunnels, stack translator this connection to the routing control table in the connec- encapsulates multiple DUA connections’ data into the same tors along the path. For connectionless communications, the stack connection. On the other end, when receiving data control plane sets the routing table according to polices that from stacks, translator translates it back into DUA interface determine which path is allowed for these messages. Only and passes it to the connector for further routing. data from these legal connections or paths are transmitted by Taking stack translator for memory stacks DDR/DMA as DUA connectors. example, it converts DUA operations to memory operations. 4) Transport Control: DUA adopts an end-to-end transport For instance, when the stack translator receives data with control design. Thus this feature is only enabled for connec- Type READ from DUA connector (i.e., DMA/DDR read ini- tors that attach to applications. We do not reimplementing tiated by applications), it calls the DMA/DDR stack to issue TCP or RDMA on FPGA, instead, DUA leverages LTL [11] a read request, with the memory address set accordingly to as the transport protocol. the address in DUA message header. Also, the DUA mes- 7.1.2 Switch Fabric sage header is stored for sending the READ response back The switch fabric performs the same role as its counterpart to the application through DUA. After it gets the response, in Internet routers, switching messages from the incoming stack translator calls DUA interface to send data back. connecter to the destination connector. Specifically, DUA Similar to the forwarding table, the translation table also overlay adopts a crossbar switch fabric. To minimize mem- only stores entries for active connections. Currently we im- ory overhead, we do not buffer any application or stack data plement a table with size for 32 entries. in the switch fabric. Instead, we make our switch fabric 7.2.2 FPGA Connect Stack lossless, and utilize the application data buffer or stack data FPGA Connect Stack enables direct communication be- buffer to store data that is going to be transmitted. Once the tween multiple applications on different FPGAs through output is blocked, the switch fabric will back pressure to the PCIe within a single server. Here we introduce the design input connector, and unset the Ready signal of I/O interface. and implementation details of FPGA Connect Stack. In our current implementation, the underlying stacks Challenge: There are three major challenges. 1) Dif- (LTL, FPGA Connect, DMA, FPGA Host and DDR) are all ferentiating different inter-FPGA connections: One naive reliable, as such, the lossless fabric ensures the DUA data solution is to use a large number of physical memory ad- transmission primitives (§5.2) are also reliable. Note that in dresses to receive packets of each connection. That not real hardware implementations, although we do not buffer only needs a large amount of memory address space but data, in order to achieve full pipelining, we need to cache also introduces address management overhead. 2) Head- 256b data (one cycle of data from the I/O interface) at each of-line (HOL) blocking: PCIe is a lossless fabric and back input port. And to remove head-of-line blocking among dif- pressure is adopted. 
Due to application processing limita- ferent ports, we implement a Virtual Output Queue (VOQ) as tions and PCIe bandwidth sharing, the available rates of each in elastic router [11] at each input port of the switch fabric. connection can be different. Without an explicit flow con- 7.2 Communication Stacks trol, the slower connection will saturate the buffer on both In our current implementation, we integrate four existing sender HIPs and receiver which will delay other transmis- communication stacks (LTL, DMA, FPGA-Host and DDR) sions. 3) Bufferbloat: If packets are sent to PCIe HIP in a into the DUA data plane using stack translators. Note that best-effort manner, the buffer inside HIP will quickly fill up DUA leverages LTL as its end-to-end transport protocol. which causes further delays. LTL here only provides reliable communication in data cen- Design: FPGA Connect provides SEND and RECV oper- ter network, its end-to-end congestion control protocol is dis- ations to users. In order to differentiate different connections abled. In addition, we design and implement a new stack and minimize physical memory address waste, FPGA Con- called FPGA Connect that provides high-performance intra- nect adds a packet header in PCIe packet’s payload contain- FPGA communication through PCIe for improving the com- ing both sender’s and receiver’s port, and uses one identity munication efficiency mentioned in §2.2. single-packet-size 1 physical memory address as receive ad- 7.2.1 Stack Translators dress for each board. Stack translators translate between DUA interface (§5.2) 1The packet size mentioned in this paper is the PCIe TLP layer payload and the actual interface of underlying stacks. After control size. 8 1.4 the packet size is 240B (the peak throughput is limited by

6 1.3 TLP implementation issues in our FPGA shell). Conse- quently, in order to reduce the header overhead and achieving 4 1.2 higher throughput, we choose 240B as our maximum packet
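The flow-control loop can be modeled in a few lines of C (our sketch); the receiver-side token division below is a simple equal split standing in for the algorithm of [25].

```c
#include <stdint.h>
#include <stdio.h>

/* Receiver: split the free receive/HIP buffer equally among active
 * connections (a simple stand-in for the policy of [25]). */
static uint32_t grant_tokens(uint32_t free_buffer_pkts, uint32_t active_conns) {
    return active_conns ? free_buffer_pkts / active_conns : 0;
}

/* Sender state: once per RTT it sends a token request (a data packet with a
 * special bit) and uses the ACKed grant as the window for the next RTT. */
struct fc_sender {
    uint32_t window;     /* packets allowed this RTT */
    uint32_t backlog;    /* packets waiting to be sent */
};

static void on_rtt(struct fc_sender *s, uint32_t free_buffer_pkts, uint32_t active_conns) {
    if (s->backlog == 0) return;                                   /* nothing to send */
    uint32_t grant = grant_tokens(free_buffer_pkts, active_conns); /* from the ACK */
    s->window = grant;
    uint32_t send = s->backlog < s->window ? s->backlog : s->window;
    s->backlog -= send;
    printf("sent %u pkts, window=%u, backlog=%u\n", send, s->window, s->backlog);
}

int main(void) {
    struct fc_sender s = { 0, 100 };
    for (int rtt = 0; rtt < 4; rtt++)
        on_rtt(&s, /*free_buffer_pkts=*/64, /*active_conns=*/2);
    return 0;
}
```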

Implementation: In our implementation, FPGA Connect has a 16-bit header including an 8-bit destination connection ID, a 2-bit type, and 6 reserved bits. Packets with type 01 are token requests and those with type 11 are ACKs; other packets are normal data packets. We leverage CPU software for connection setup, release, and failover.

Evaluation: The typical PCIe network topology is a tree structure. The root node, called the root complex, resides in the CPU, and devices (e.g., FPGA, GPU, and NVMe) are directly connected to it. As the number of PCIe lanes provided by the root complex is limited, the number of devices connected to the root complex and the peer-to-peer bandwidth among devices are limited. Thus, some data center servers use a PCIe switch to support more devices and improve peer-to-peer performance, providing a high density of heterogeneous computation capacity. We therefore test the performance of FPGA Connect on two platforms: one connecting through a PCIe switch and the other through the root complex. For our testbed, the maximum packet payload size is 256B. Although the FPGA can send 256B packets, the root complex forces segmentation of packets to align to 64B units (the PCIe switch does not segment packets). We measured the performance of FPGA Connect with different packet sizes using the testbed described in §8.1; Fig. 8 shows the results. FPGA Connect achieves 6.66 GB/s peak throughput when the packet size is 240B (the peak throughput is limited by TLP implementation issues in our FPGA shell). Consequently, in order to reduce header overhead and achieve higher throughput, we choose 240B as our maximum packet size. As for latency, FPGA Connect provides latency as low as 1.1∼1.33µs. In our testbed, the PCIe switch offers better throughput but slightly higher latency than the root complex.

Figure 8: Performance of the FPGA Connect stack using different TLP packet sizes: (a) goodput, (b) latency.

7.3 DUA Underlay

The DUA underlay resides between the stacks and the physical interfaces. It manages the hard IPs and resource sharing among stacks, protects DUA stacks against outside attacks, and prevents failed stacks from sending packets outside the FPGA. All these features are governed by policies configured by the control plane. Each physical interface has a separate underlay module. The upstream and downstream interfaces are identical, providing seamless integration with stacks; therefore, existing stacks need no modification when attached to the DUA underlay.

The DUA underlay achieves these goals by setting up a virtual transaction layer, which provides multiplexing and security protection without prescribing a stack interface abstraction. The virtual transaction layer works by checking, modifying, and, if necessary, dropping traffic generated by or routed to the stacks, to prevent driving a physical interface (or even the whole network) into an error condition.

When data flows into the DUA underlay from the stacks, all packets pass through a filter which validates that they are well-formed according to the rules configured by the control plane. If stack traffic violates any security rule or physical interface restriction, the packet is dropped and the violation is reported to the FPGA CA. When data flows from the virtual transaction layer to the physical interfaces, the DUA underlay works as a multiplexer, taking responsibility for managing multiple connections to support multiple users. The DUA underlay schedules data onto the physical interface using policies such as fair sharing, weighted sharing, or strict priority; in our implementation we use fair sharing. To avoid wasting bandwidth, we implement a shallow input FIFO for each stack in the underlay, and the scheduler schedules data only from non-empty FIFOs.

When receiving data from a physical interface, the DUA underlay works as a demultiplexer: it demultiplexes the incoming data to the corresponding stack through the virtual transaction layer according to the data header.
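The fair-sharing scheduler over the per-stack input FIFOs can be modeled as a simple round robin over non-empty queues, as in this sketch (ours):

```c
#include <stdio.h>

#define NUM_STACKS 4   /* e.g. FPGA Connect, LTL, DMA, DDR */

/* Occupancy of each stack's shallow input FIFO (in packets). */
static int fifo_len[NUM_STACKS] = { 3, 0, 2, 1 };

/* Round-robin pointer; skipping empty FIFOs makes the scheduler
 * work-conserving while sharing the interface fairly. */
static int rr = 0;

static int schedule_one(void) {
    for (int i = 0; i < NUM_STACKS; i++) {
        int s = (rr + i) % NUM_STACKS;
        if (fifo_len[s] > 0) {
            fifo_len[s]--;
            rr = (s + 1) % NUM_STACKS;   /* next time, start after the served stack */
            return s;                    /* this stack gets the physical interface */
        }
    }
    return -1;                           /* all FIFOs empty */
}

int main(void) {
    int s;
    while ((s = schedule_one()) >= 0)
        printf("grant PCIe cycle to stack %d\n", s);
    return 0;
}
```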
8 Evaluation

Testbed Setup: As shown in Fig. 9, we build a testbed consisting of two Supermicro SYS-4028GR-TR2 servers, 20 FPGAs, and one Arista 7060X switch. Every 5 FPGAs are inserted under the same PCIe switch, and only one FPGA under each PCIe switch is connected to the Arista switch. All FPGAs in the testbed are the same as in [11]: an Altera Stratix V D5 with 172.6K ALMs of programmable logic, one 4 GB DDR3-1600 DRAM channel, two PCIe Gen3 x8 HIPs, and two independent 40 Gb Ethernet QSFP+ ports. Note that in all the following experiments we enable only one HIP and one QSFP+ in each FPGA shell. In addition, because the SuperMicro servers do not provide two PCIe slots directly connected to the root complex, we use a Dell R720 server to test the performance under the root complex (experiments in §7.2.2); all other experiments run on the SuperMicro servers. The servers' OS is Windows Server 2012 R2, and each server has a 40Gbps NIC connected to the switch.

Figure 9: DUA experiment testbed.

8.1 System Micro-Benchmarks

We first show that DUA consumes only a small FPGA area. Then we show that the DUA switch fabric and routing table achieve high throughput and low latency. Finally, we show that DUA incurs little latency overhead and handles the multiplexing of communication stacks and applications well.

8.1.1 FPGA Area Cost

Fig. 10 shows the FPGA resource consumption for implementing DUA. Here we only list the logic resource overhead (in ALMs), since DUA does not buffer data and its BRAM cost is negligible. When connecting four stacks and no application, the total ALMs consumed by the DUA overlay (including a 4-port switch, 4 connectors, and 4 stack translators) is only 9.29%. When increasing the number of switch ports to 8 (4 ports for applications), the overlay still costs only 19.86% of the logic area of our mid-range FPGA. The underlay consumes only 0.25% of logic resources when connecting 4 stacks and 3 physical interfaces. Compared to the logic resources consumed by the existing underlying communication stacks and physical interfaces (in total >17%), the overhead incurred by DUA is moderate and acceptable. With more advanced FPGAs, such logic area cost becomes negligible (e.g., the latest Stratix 10 5500 FPGA has 10x the logic resources [26]).

Figure 10: FPGA area cost of different components in the DUA data plane.
  Component                                                ALMs     (%)
  DUA overlay   Switch fabric, 2 ports                     1272     0.74%
                Switch fabric, 4 ports                     3227     1.88%
                Switch fabric, 8 ports                     9366     5.45%
                Connector                                  3011     1.75%
                Stack translator: FPGA Connect             138.4    0.08%
                Stack translator: LTL                      255.4    0.15%
                Stack translator: DMA                      115.7    0.07%
                Stack translator: DDR                      190.3    0.11%
  DUA underlay  Stacks: FPGA Connect, LTL, DMA, DDR;
                PHY interfaces: PCIe, DDR, QSFP            431.7    0.25%
8.1.2 Switch Fabric Performance

We conduct an experiment to evaluate the performance of the switch fabric in the DUA overlay. In this test, every switch port is attached to an application that acts as a traffic generator and result checker. All applications send messages to a single port to construct a congested traffic scenario, with message lengths varied from 32B to 4KB. The switch fabric works at 300 MHz, so the ideal throughput is 9.6 GBps. We measure the latency and output throughput during a test lasting 2 hours. Table 2 shows the results for different numbers of switch ports: throughput reaches the theoretical maximum and latency is low.

Table 2: Throughput and latency of the switch fabric.
  Number of ports   Latency (ns) min/avg/max   Throughput (GBps)
  4                 53 / 326 / 1086            9.6
  8                 56 / 649 / 1903            9.6

8.1.3 Routing Table Performance

We implement a parallel matching engine for each table entry. Each entry comparison takes one cycle, and the matching result is calculated by a priority selector in another cycle; this implementation therefore has a constant two-cycle latency. On the other hand, the routing table is fully pipelined, so it can accept a message and look up its output port every cycle, and the message-per-second throughput equals the clock frequency. Table 3 shows the area cost and maximum frequency of the routing table for different numbers of entries. Our implementation with the typical 32 entries consumes 0.83% of ALMs per port, i.e., 3.3%–6.6% for a typical 4–8 port implementation. The maximum frequency is high enough to serve the shortest DUA messages at a rate of over 10GBps per port. As the number of entries increases to 64 and 128, the area cost increases linearly and the frequency does not decrease much.

Table 3: Area cost and max frequency of the routing table.
  Number of entries   ALMs (per port)   Fmax (MHz)
  32                  1435 (0.83%)      490.20
  64                  2810 (1.63%)      423.91
  128                 5571 (3.23%)      378.93
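Functionally, the two-cycle lookup of §8.1.3 corresponds to the following C model (ours): all entries are compared in parallel (cycle 1) and a priority selector picks the lowest matching index (cycle 2).

```c
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 32

struct fwd_entry { uint64_t dst_uid; int out_port; int valid; };

static struct fwd_entry table[ENTRIES];

/* Cycle 1: one comparator per entry; cycle 2: priority-select the lowest hit. */
static int lookup(uint64_t dst_uid) {
    int hit[ENTRIES];
    for (int i = 0; i < ENTRIES; i++)                    /* parallel comparators */
        hit[i] = table[i].valid && table[i].dst_uid == dst_uid;
    for (int i = 0; i < ENTRIES; i++)                    /* priority selector */
        if (hit[i]) return table[i].out_port;
    return -1;                                           /* no route: drop/report */
}

int main(void) {
    table[5] = (struct fwd_entry){ 0xC0A80B050003ULL, 2, 1 };
    printf("output port = %d\n", lookup(0xC0A80B050003ULL));
    return 0;
}
```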

8.1.4 Latency Overhead

We use FPGA 1 to send data through DUA to FPGA 1* in Fig. 9. Specifically, DUA first transmits data from FPGA 1 to FPGA 4 through FPGA Connect, then to FPGA 4* (the 4th FPGA on Server 2) through LTL, and then to FPGA 1* through FPGA Connect. We measure the end-to-end communication latency, including DUA and all the traversed stacks, as well as the break-down latency of each stack.

Fig. 11(a) shows the average latency of each part. Under various packet sizes, DUA only adds less than 0.2µs of latency, a negligible overhead compared with the other stacks' latency in the end-to-end communication. Note that DUA is fully pipelined and can achieve line rate, incurring no throughput overhead.

8.1.5 Handling Multiplexing

We build three applications (App 1, 2, and 3) on the same FPGA, all using DUA, to test whether DUA handles multiplexing well. Specifically, App 1 starts at second 0 and keeps writing data to host DRAM. At second 1, App 2 also starts writing data to host DRAM. At second 2, App 3 starts sending data to another local FPGA through FPGA Connect. In this scenario, all three applications multiplex the same physical PCIe interface: App 1 and App 2 multiplex the same DMA stack, and the DMA stack and the FPGA Connect stack multiplex the same PCIe interface. DUA adopts a fair scheduling policy for all DUA connections and changes to weighted-share scheduling (share ratio 1:1:2) starting from second 3.

Fig. 11(b) shows the results. During the experiment, the total throughput of all applications always reaches the maximum throughput of the PCIe physical interface (§7.2.2). When new applications join, DUA successfully balances the throughput between them: App 1 and App 2 each achieve ∼3.2GBps between seconds 1 and 2, and all three applications get ∼2.2GBps between seconds 2 and 3. Also, when we change the scheduling policy to a 1:1:2 weighted share at second 3, the three applications quickly reach their respective expected throughputs.

Figure 11: DUA only adds a little latency overhead and handles multiplexing well: (a) latency overhead, (b) handling multiplexing.

8.2 Applications Built With DUA

In this section, we present two applications built on DUA, demonstrating that DUA can ease the process of building high-performance multi-FPGA applications.

8.2.1 Deep Crossing

Deep Crossing, a deep neural network, was proposed in [27] for handling web-scale applications and data sizes. The model trained from the learning process is deployed in Bing to provide web services. The critical metric for the Deep Crossing model is latency, since it impacts service response time.

Fig. 12 shows the structure of the Deep Crossing model in our experiments. There are three sparse matrix-vector (SPMV) multiplications and four dense matrix-vector (DMV) multiplications. Each input to the whole model is a 49,292-dimensional vector, as in [27]. The sparse part is a memory-intensive task while the dense part is computation-intensive, and the vectors between the dense parts are small. Therefore, we offload the dense parts to the FPGA to reduce latency.

Figure 12: The Deep Crossing model in our experiment.

We implement all dense parts inside the FPGA using OpenCL. In our implementation, each matrix multiplication has an adjustable parameter called the parallel degree (Parall), which determines the number of concurrent multiplications performed per cycle. The larger Parall is, the fewer cycles are needed to complete the matrix multiplication; meanwhile, the larger Parall is, the more FPGA logic resources are consumed. As shown in Table 4, if we implement all four DMVs on a single FPGA board, we can only offload the model with Parall = 32 because of FPGA resource limitations.

Table 4: ALM cost (%) of the dense parts.
              Base   D1     D2     D3     D4     All
  Parall=32   26.0   11.0   9.4    11.0   9.8    67.2
  Parall=64   26.0   20.8   19.1   20.8   19.6   106.3

To achieve better latency, we use DUA to build a two-FPGA Deep Crossing accelerator. Specifically, we implement all the DMVs in the model with Parall = 64, and place the first two DMVs on one FPGA and the other two DMVs on another FPGA. The two FPGAs are physically connected through both the intra-server PCIe network and Ethernet, with the underlying FPGA Connect and LTL stacks enabled. We use the DUA interface to connect the DMV logic on the two FPGA boards.

It only takes 26 extra lines of OpenCL code to call the DUA interface and connect the two FPGA boards. Moreover, changing the communication method only requires changing the routing table, without any change to the OpenCL code, thus eliminating hours of hardware recompilation time. Table 5 shows the latency results of the FPGA offloading (counting only the FPGA-related latency). The results show that the two-FPGA version built with DUA reduces latency by ∼42% (through FPGA Connect, FC for short in the table) or ∼37% (through LTL) compared to the single-FPGA version.

Table 5: Latency (µs) of the dense parts in FPGA.
                D1     D2     D3      D4      Comm.          E2E
  Single FPGA   7.27   6.58   13.30   12.96   N/A            40.12
  Two FPGAs     4.17   3.48   7.22    7.09    1.419 (FC)     23.38
                                              3.354 (LTL)    25.32
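To illustrate what the parallel degree trades off, here is a simplified C version of one dense matrix-vector multiplication (our behavioral sketch with made-up dimensions); the real kernels are OpenCL, with Parall realized as replicated multipliers/loop unrolling in hardware.

```c
#include <stdio.h>

#define ROWS   128
#define COLS   640
#define PARALL 64          /* multiplications performed "per cycle" */

static float M[ROWS][COLS];
static float x[COLS];
static float y[ROWS];

/* Behavioral model: with PARALL multipliers, each row takes roughly
 * COLS/PARALL cycles, so doubling PARALL halves the cycle count but
 * roughly doubles the multiplier (logic) cost, as reflected in Table 4. */
static long dmv(void) {
    long cycles = 0;
    for (int r = 0; r < ROWS; r++) {
        float acc = 0.0f;
        for (int c = 0; c < COLS; c += PARALL) {        /* one "cycle" per chunk */
            for (int k = 0; k < PARALL && c + k < COLS; k++)
                acc += M[r][c + k] * x[c + k];
            cycles++;
        }
        y[r] = acc;
    }
    return cycles;
}

int main(void) {
    printf("approx cycles: %ld\n", dmv());
    return 0;
}
```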

8.2.2 Fast Multi-Regular-Expression Matching

A network intrusion detection system (IDS) [28–31] lies between the network and the host, intercepting and scanning traffic to identify patterns of malicious behaviour. Typically, these patterns are expressed as regular expressions. Each pattern is denoted as a rule, and all patterns in the different stages together are called the rule set.

We have built a fast multi-regular-expression matching prototype consisting of three FPGAs connected over DUA. The boards are physically connected through PCIe within the same server. Each regular expression is translated into an independent NFA, and the NFA is translated into matching circuit logic using the methods in [32]. We use DUA with the underlying FPGA Connect stack to transfer data between the three FPGAs; connecting to the DUA interface costs fewer than 30 lines of OpenCL code on each FPGA. We implement the whole core rule set [33] of ModSecurity in our system. The rule set contains 219 different regular expression rules in total. We randomly divide these 219 rules into three pattern stages of 73 rules each and implement each stage in a single FPGA. (The number of rules in each FPGA does not affect the matching speed, since all rules are matched in parallel.) For each stage in each FPGA, we implement 32 parallel matching engines, with each engine matching one 8-bit character per cycle. An input string goes to the next FPGA only after it finishes matching in the current stage.

For comparison, we also let the three FPGAs exchange data through the CPU, without the direct communication provided by DUA FPGA Connect. We also compare against a baseline that uses a single 2.3GHz Xeon CPU core and the widely used PCRE [34] regular expression engine.

We generate input strings of different lengths to evaluate the throughput and latency of the whole regular expression matching system. String lengths vary from 64 bytes to 16KB, with contents randomly generated. In our experiment, we use a single CPU core to DMA the input strings into the matching system instead of feeding them from the network; we read the matching results back using DMA and measure performance on the same CPU core.

[Figure 13: Performance of the multi-regular-expression matching system. (a) Throughput (GB/s) and (b) latency (µs) versus input string length (64 bytes to 16KB), comparing "through DUA", "through CPU", and pure CPU.]

Fig. 13 shows the results. Enabled by direct communication through DUA, our regular expression matching system (denoted "through DUA") achieves about three times higher throughput and lower latency than FPGAs exchanging data through the CPU (denoted "through CPU"). Benefiting from direct communication through pure hardware, our system almost reaches the maximum possible throughput of the input-string DMA once the string length exceeds 8KB. In contrast, when exchanging data through the CPU, the CPU needs to frequently read matching results from former-stage FPGAs and send strings to next-stage FPGAs, which becomes the performance bottleneck. We can also see that, for such a complex rule set, a pure CPU achieves only very low performance. Note that our throughput is slightly lower than the maximum FPGA Connect speed (§7.2.2) for two reasons: 1) the software libraries for DMA incur overhead compared with the pure physical interface; and 2) although the data paths through PCIe differ for DMA input/output to the CPU and for exchanging data between FPGAs, they all use the same PCIe HIP to issue operations, which incurs some contention.
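To show what one matching engine does per cycle, the following minimal C sketch steps a table-driven NFA by one 8-bit input character, keeping the set of active states in a bitmask. This is a software model under our own assumptions: the hardware engines built with the methods in [32] implement the transition logic directly as circuits, and the table layout, sizes, and names (nfa_t, nfa_step) here are hypothetical.

    #include <stdint.h>

    /* Illustrative software model of one matching engine: a small NFA whose
     * active-state set is a bitmask. next[s][ch] gives the set of states
     * reachable from state s on input byte ch. In hardware this transition
     * logic is synthesized into circuits, so all states (and all rules in a
     * stage) advance in parallel on every character. */
    #define NSTATES 16                       /* toy bound for the sketch */

    typedef struct {
        uint16_t next[NSTATES][256];         /* per-state, per-byte successors */
        uint16_t start;                      /* bitmask of start states        */
        uint16_t accept;                     /* bitmask of accepting states    */
    } nfa_t;

    /* Consume one 8-bit character: the per-cycle work of a matching engine. */
    static inline uint16_t nfa_step(const nfa_t *n, uint16_t active, uint8_t ch) {
        uint16_t out = 0;
        for (int s = 0; s < NSTATES; s++)
            if (active & (1u << s))
                out |= n->next[s][ch];
        return out | n->start;               /* restart matching at every offset */
    }

    /* Returns nonzero if the pattern matches anywhere in the input string. */
    int nfa_match(const nfa_t *n, const uint8_t *buf, int len) {
        uint16_t active = n->start;
        for (int i = 0; i < len; i++) {
            active = nfa_step(n, active, buf[i]);
            if (active & n->accept)
                return 1;
        }
        return 0;
    }

In hardware, the loop over states disappears: each state transition is its own small circuit, which is roughly why adding more rules to a stage does not slow down matching.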
9 Related Work and Conclusion

While prior work has aimed to provide abstractions and simplified FPGA communications, to the best of our knowledge DUA is the first unified communication architecture that lets FPGAs access all available data center resources, regardless of their location and type. The Catapult shell [10, 11], Amazon F1 [2], and their derivatives provide an abstract interface between FPGA logic, software, and physical interfaces, but they remain far from the unified communication architecture provided by DUA. The Altera/Intel Avalon bus [35] and the AXI bus [36] used by Xilinx provide a unified interface for FPGA applications, but they are designed solely for on-chip buses, not for the scale of a data center network. TMD-MPI [37] provides a general MPI-like communication model for multi-FPGA systems, but it only targets communication between FPGAs rather than general resource access, and it is implemented in software and requires the CPU. The recent LTL [11] work targets communication between FPGAs in the data center through Ethernet. DUA can leverage and support all of these works as communication stacks to improve connectivity.

Providing a communication architecture with unified naming and a common interface has proven widely successful in IP networks. In this paper, DUA takes a first step toward bringing this communication architecture into the FPGA world. Our experiments show that DUA has negligible impact on performance and area, and greatly eases the programming of distributed FPGA applications that access data center resources. Although this work targets FPGAs, there is no reason why it cannot be applied to other devices as well.

References

[1] Microsoft Goes All in for FPGAs to Build Out AI Cloud. https://www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/.
[2] Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/.
[3] Intel, Facebook Accelerate Datacenters With FPGAs. https://www.enterprisetech.com/2016/03/23/intel-facebook-accelerate-datacenters-fpgas/.
[4] Data Engine for NoSQL - IBM Power Systems Edition White Paper. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=POW03130USEN.
[5] Baidu Takes FPGA Approach to Accelerating SQL at Scale. https://www.nextplatform.com/2016/08/24/baidu-takes-fpga-approach-accelerating-big-sql/.
[6] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: Software-defined flash for web-scale internet storage systems. ACM SIGPLAN Notices, 49(4):471–484, 2014.
[7] Jian Ouyang, Shiding Lin, Wei Qi, Yong Wang, Bo Yu, and Song Jiang. SDA: Software-defined accelerator for large-scale DNN systems. In Hot Chips 26 Symposium (HCS), 2014 IEEE, pages 1–23. IEEE, 2014.
[8] Alibaba, Intel introduce FPGA to the Cloud. https://luxeelectronicscomblog.wordpress.com/2017/03/13/alibaba-intel-introduce-fpga-to-the-cloud/.
[9] Tencent FPGA Cloud Computing. https://www.qcloud.com/product/fpga.
[10] Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24. IEEE, 2014.
[11] Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, et al. A Cloud-Scale Acceleration Architecture. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pages 1–13. IEEE, 2016.
[12] Microsoft demonstrates the world's 'first AI supercomputer,' using programmable hardware in the cloud. https://www.geekwire.com/2016/microsoft-touts-first-ai-supercomputer-using-programmable-hardware-cloud/.
[13] Ray Bittner, Erik Ruf, and Alessandro Forin. Direct GPU/FPGA communication via PCI Express. Cluster Computing, 17(2):339–348, 2014.
[14] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, and Jonathan Turner. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.
[15] Nick McKeown. Software-defined networking. INFOCOM keynote talk, 17(2):30–32, 2009.
[16] Jian Gong, Tao Wang, Jiahua Chen, Haoyang Wu, Fan Ye, Songwu Lu, and Jason Cong. An efficient and flexible host-FPGA PCIe communication library. In Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pages 1–6. IEEE, 2014.
[17] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro, 38(2):8–20, 2018.
[18] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, et al. A configurable cloud-scale DNN processor for real-time AI. In Proceedings of the 45th Annual International Symposium on Computer Architecture, pages 1–14. IEEE Press, 2018.
[19] Michael Johannes Jaspers. Acceleration of read alignment with coherent attached FPGA coprocessors. 2015.
[20] Bojie Li, Zhenyuan Ruan, Wencong Xiao, Yuanwei Lu, Yongqiang Xiong, Andrew Putnam, Enhong Chen, and Lintao Zhang. KV-Direct: High-Performance In-Memory Key-Value Store with Programmable NIC. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 2017.
[21] Katayoun Neshatpour, Maria Malik, Mohammad Ali Ghodrat, and Houman Homayoun. Accelerating big data analytics using FPGAs. In Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, pages 164–164. IEEE, 2015.
[22] Jagath Weerasinghe, Francois Abel, Christoph Hagleitner, and Andreas Herkersdorf. Enabling FPGAs in hyperscale data centers. In Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), pages 1078–1086. IEEE, 2015.
[23] Malte Vesper, Dirk Koch, Kizheppatt Vipin, and Suhaib A. Fahmy. JetStream: an open-source high-performance PCI Express 3 streaming library for FPGA-to-host and FPGA-to-FPGA communication. In Field Programmable Logic and Applications (FPL), 2016 26th International Conference on, pages 1–9. IEEE, 2016.
[24] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy M. Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, et al. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. SIGCOMM, 45(4):183–197, 2015.
[25] Jiao Zhang, Fengyuan Ren, Ran Shu, and Peng Cheng. TFC: token flow control in data center networks. In Proceedings of the Eleventh European Conference on Computer Systems, page 23. ACM, 2016.
[26] Stratix 10 - Overview. https://www.altera.com/products/fpga/stratix-series/stratix-10/overview.html.
[27] Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 255–262. ACM, 2016.
[28] Snort - Official Site. https://www.snort.org/.
[29] The Bro Network Security Monitor. https://www.bro.org/.
[30] ModSecurity: Open Source Web Application Firewall. https://www.modsecurity.org/.
[31] Application Layer Packet Classifier for Linux. http://l7-filter.sourceforge.net/.
[32] Reetinder P. S. Sidhu and Viktor K. Prasanna. Fast regular expression matching using FPGAs. pages 227–238, 2001.
[33] OWASP ModSecurity Core Rule Set (CRS). https://modsecurity.org/crs/.
[34] PCRE - Perl Compatible Regular Expressions. http://www.pcre.org/.
[35] Avalon Interface Specifications. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/manual/mnl_avalon_spec.pdf.
[36] AXI Reference Guide. https://www.xilinx.com/support/documentation/ip_documentation/axi_ref_guide/v13_4/ug761_axi_reference_guide.pdf.
[37] Manuel Saldana and Paul Chow. TMD-MPI: An MPI implementation for multiple processors across multiple FPGAs. In Field Programmable Logic and Applications, 2006. FPL'06. International Conference on, pages 1–6. IEEE, 2006.
Appendices

A Verilog sample code to use the DUA interface

Figure 15 shows example Verilog code for an application that uses the DUA interface to initiate a connection and send data. It first generates a CONNECT primitive and then waits for the response. If the connection is successfully established, it records the source UID and port number and then enters the data-sending state. If the connection setup fails, it generates another CONNECT primitive. Note that the source address and port are not available before the connection is set up, so these fields are reserved when issuing a CONNECT command; the response to the CONNECT command contains the corresponding fields.

    assign rx_header = rx_data_in;

    assign connect_header.type     = CONNECT;
    assign connect_header.dst_uid  = dst_uid;
    assign connect_header.dst_port = dst_port;

    assign send_header.dst_UID  = dst_UID;
    assign send_header.src_UID  = src_UID;
    assign send_header.type     = SEND;
    assign send_header.length   = 48;
    assign send_header.src_port = src_port;
    assign send_header.dst_port = dst_port;

    always @(posedge clk) begin
      tx_valid_out <= 1'b0;
      tx_first_out <= 1'b0;
      tx_last_out  <= 1'b0;
      case (state)
        SETUP_CONNECTION: begin
          if (tx_ready_in) begin
            tx_data_out  <= connect_header;
            tx_valid_out <= 1'b1;
            state        <= WAITING_RESPONSE;
          end
        end
        WAITING_RESPONSE: begin
          if (rx_valid_in && rx_data_in.type == CONNECT) begin
            if (rx_data_in.status == SUCCESS) begin
              src_UID <= rx_data_in.src_UID;
              state   <= SENDING_HEADER;
            end
            else begin
              state <= SETUP_CONNECTION;
            end
          end
        end
        SENDING_HEADER: begin
          if (tx_ready_in) begin
            tx_data_out  <= send_header;
            tx_valid_out <= 1'b1;
            tx_first_out <= 1'b1;
            state        <= SENDING_DATA_0;
          end
        end
        SENDING_DATA_0: begin
          if (tx_ready_in) begin
            tx_valid_out <= 1'b1;
            tx_data_out  <= data_0;
            state        <= SENDING_DATA_1;
          end
        end
        SENDING_DATA_1: begin
          if (tx_ready_in) begin
            tx_valid_out <= 1'b1;
            tx_data_out  <= {128'h0, data_1}; // pad upper 128 bits
            tx_last_out  <= 1'b1;
            state        <= CLOSE_CONNECTION;
          end
        end
        CLOSE_CONNECTION: begin
          // close connection logic
        end
      endcase
    end

Figure 15: DUA API usage example

B OpenCL sample code to use DUA API

Fig. 14 shows an OpenCL sample code piece that uses the DUA API. DUA_Msg is a union storing the DUA message to be sent in the current clock cycle (see Fig. 7). dua_tx is a reserved channel that automatically connects to the request interface of our DUA I/O interface (Fig. 4). The simple_write() function writes 32B of data to address DST_ADDR of the resource whose UID is DST_UID through the DUA interface.

    void simple_write() {
      DUA_Msg msg;
      bool is_header = true;
      while (1) {
        if (is_header) {
          msg.header.length   = 32;
          msg.header.type     = WRITE;
          msg.header.src_uid  = SRC_UID;
          msg.header.dst_uid  = DST_UID;
          msg.header.dst_addr = DST_ADDR;
          is_header = false;
        } else {
          msg.raw = data;
        }
        write_channel_altera(dua_tx, msg.raw);
      }
    }

Figure 14: OpenCL sample code to use DUA API
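For readers without access to Fig. 7, the following C sketch shows one plausible layout of the fields that the code in Figures 14 and 15 populates (type, length, src_uid/dst_uid, ports, dst_addr, plus a raw view). The actual DUA message format is defined in Fig. 7 of the paper; the field widths, ordering, and total size below are assumptions for illustration only.

    #include <stdint.h>

    /* Assumed illustration of a DUA message, reconstructed from the fields
     * used in Figures 14 and 15. The real format (widths, ordering, size) is
     * defined in Fig. 7 and may differ; in the OpenCL code the raw view is a
     * single wide word written to the dua_tx channel. */
    enum dua_msg_type { CONNECT, SEND, WRITE /* , ... */ };

    struct dua_header {
        uint8_t  type;        /* CONNECT / SEND / WRITE / ...            */
        uint16_t length;      /* payload length in bytes                 */
        uint32_t src_uid;     /* global UID of the source resource       */
        uint32_t dst_uid;     /* global UID of the destination resource  */
        uint16_t src_port;
        uint16_t dst_port;
        uint64_t dst_addr;    /* destination address inside the resource */
    };

    typedef union {
        struct dua_header header;   /* interpreted as a header flit */
        uint8_t raw[32];            /* or viewed as raw payload data */
    } DUA_Msg;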
