P-Socket: Optimizing a Communication Library for a PCIe-Based Intra-Rack Interconnect

Liuhang Zhang, Rui Hou
Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China

Sally A. McKee
Chalmers University of Technology, Gothenburg, Sweden

Jianbo Dong, Lixin Zhang
Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China

ABSTRACT
Data centers require efficient, low-cost, flexible interconnects to manage the rapidly growing internal traffic generated by an increasingly diverse set of applications. To meet these requirements, data center networks are increasingly employing alternatives such as RapidIO, Freedom, and PCIe, which require fewer physical devices and/or have simpler protocols than more traditional interconnects. These networks offer raw high-performance communication capabilities, but simply using them for conventional TCP/IP-based communication fails to realize the potential performance of the physical network. Here we analyze causes for this performance loss for the TCP/IP protocol over one such fabric, PCIe, and we explore a hardware/software solution that mitigates overheads and exploits PCIe's advanced features. The result is P-Socket, an efficient library that enables legacy socket applications to run without modification. Our experiments show that P-Socket achieves an end-to-end latency of 1.2µs and effective bandwidth of up to 2.87GB/s (out of a theoretical peak of 3.05GB/s).

CCS Concepts
•Software and its engineering → Communications management; •Networks → Network performance evaluation; Network performance analysis; Data center networks; •Hardware → Networking hardware;

Keywords
data-center servers, rack interconnects, sockets, PCIe

1. INTRODUCTION
Data centers must handle huge volumes of workloads generated by millions of independent users. Now dominated by cloud computing and big data applications, these workloads have caused network data flows to change such that the ratio of internal to external traffic has gone from 5:95 to 75:25 [27]. Data center applications are also becoming more diversified in their requirements. Meeting the needs of these applications at rapidly growing scales requires efficient sharing of data-center resources, which, in turn, requires a high-efficiency, low-cost, flexible interconnect.

Early data centers often employed standard High Performance Computing (HPC) networking solutions like InfiniBand [12] (heretofore abbreviated as IB), 10-Gigabit Ethernet [16] (10 GigE), Myrinet [4], and Quadrics [22]. Ethernet has often been used for its ease of deployment and backward compatibility, but linking racks of volume servers together with 10 GigE switches and routers is wasteful in terms of power and cost: replicated components allow each server to enjoy individual operation, management, and connectivity, yet the servers are never used in isolation. Myrinet is a high-speed local area networking fabric which has much lower protocol overhead than Ethernet. Once popular for supercomputers, its use has decreased in recent years (it was used in 141 of the TOP500 machines in 2005 but only one in 2013) [33]. InfiniBand stands out among traditional HPC solutions: its popularity has been growing for both TOP500 supercomputers and enterprise data centers.
Many newer data centers are turning to innovative interconnects that better match the communication requirements of modern data center workloads. For instance, Freescale, IDT, Mobiveil, and Prodrive promote the use of ARM servers connected by RapidIO [20], the embedded fabric currently used in most base stations. AMD SeaMicro markets ultra-low-power, small-footprint data centers connected by their Freedom™ Supercomputer Fabric, a 3D torus that includes both path redundancy and diversity. SeaMicro's proprietary technology can be used to build servers from any processor with any instruction set and any communication protocol [26]. Even PCI Express (PCIe), once viewed as inappropriate for general-purpose fabrics, is increasingly used within small-scale, tightly coupled data-center racks once connected by more traditional HPC technologies [6]. For instance, Hou et al. [15] demonstrate a prototype data center server in which nodes share memory resources, GPGPUs, and network bandwidth via PCIe, which allows remote resources to be used much the same as local resources.

In all high-speed interconnects, hardware and software overheads prevent applications from realizing a fabric's raw peak performance. For instance, Balaji et al. [1] show that the Socket Direct Protocol (SDP) and IP over InfiniBand (IPoIB) both take more than five times the raw link latency and realize less than 2/3 and 1/5 of the raw link bandwidth, respectively. Likewise, Feng et al. [11] show that implementing traditional sockets over 10 GigE incurs 16% latency over that of the raw link and realizes less than 3/4 of the raw link bandwidth. These results demonstrate that there is still room for improvement. Here we use PCIe to build a prototype system (similar to that of Hou et al. [15]) on which to investigate the reasons behind these performance gaps. We select PCIe for its reliability and efficiency over short distances.

Experiments on our PCIe prototype system verify that:

• using store instructions to transmit small data packets reduces latency by about 16% over DMA;

• using store instructions realizes a peak bandwidth of 2.48GB/s, while using load instructions delivers only 26.67MB/s;

• burst DMA mode performs better than block DMA mode for small packets, even though burst DMA's latency is 1.76 times longer;

• bypassing the TCP/IP protocol stack can increase bandwidth by a factor of eight and lower latency by about 30% for small messages; and

• bypassing the kernel reduces small-packet latency from 18µs to 1.2µs by eliminating unnecessary context switching and buffer copying.

Based on this analysis, we propose P-Socket, a communications library designed to exploit these performance artifacts. Specifically, P-Socket bypasses the kernel, uses store instructions to transmit small packets and to implement flow control, and uses burst DMA instead of block DMA to transmit large packets. Note that the optimizations we study are not new, and they have been exploited to good effect elsewhere; rather, it is their synthesis and PCIe-specific adaptation that is novel to P-Socket. The main contribution of this paper is the detailed performance evaluation of our implementation within a real system.

The performance benefits of supporting multiple DMA modes and remote memory access via load/store instructions argue for their inclusion in future interconnects. Thus even though P-Socket's optimizations are closely related to the features of the PCIe fabric, our work provides general insights into the design of new intra-rack interconnects and accompanying software libraries.

We designed and produced the hardware backplane board and developed the corresponding software stack (the Virtual Network Interface Controller driver and standard socket-compatible library) for our proof-of-concept prototype. We quantify the software overheads of the communication library over PCIe and demonstrate P-Socket's efficacy by implementing the kernel portion within Linux 2.6.39.4. Furthermore, we analyze limitations of our hardware prototype system to inform the design of future hardware interconnect protocols. Here we demonstrate P-Socket on a prototype data center server, but extending it to support MPI would make it equally appropriate for HPC systems with PCIe intra-rack interconnects.
2. PCIE SYSTEM ARCHITECTURE
Many scalable systems are built from sets of nodes coupled tightly with low-latency, high-bandwidth local interconnects. These sets of tightly coupled nodes, or super nodes, are themselves connected by more traditional networks such as Ethernet. PCIe fabrics are good candidates for super-node interconnects based on several considerations. First, the PCIe interface already enjoys widespread use, which means that deploying it requires no architectural changes or additional protocol translation cards. Second, the PCIe fabric has good scalability for intra-rack interconnects, which usually include fewer than 100 nodes: PCIe cables (e.g., copper wire or optical fiber) work well at such short distances. Third, PCIe allows servers within a rack to directly share resources via memory load/store instructions.

Figure 1 shows a typical organization. In this tightly coupled group, compute nodes connect to a Non-Transparent Bridge (NTB) that connects to the Transparent Bridge (TB) on the other side. NTBs can separate different address spaces and translate transactions from one address space to another. TBs are used to forward transactions within a given address space. This fabric is sufficiently scalable that many TBs can be connected together to expand the network, and new compute nodes need only one NTB to join. When compared with Ethernet or IB (whose adapters are commonly plugged into PCIe slots), PCIe fabrics eliminate additional protocol conversions (e.g., from PCIe to IB or from PCIe to Ethernet). This advantage gives PCIe a shorter communication channel and thus lower latency.

3. RELATED WORK
The evolution of interconnection technology has caused the communication performance bottleneck to move from the physical layer to the software layer. One way to reduce the bottleneck is to move some software functionality to the hardware. For example, Illinois Fast Messages (FM) use Myrinet capabilities to offload protocol processing to the programmable NIC [21]. TCP Offload Engines also free host CPU cycles by moving TCP/IP stack processing to the network controller [14]. Similarly, Ethernet Message Passing (EMP) offloads protocol processing to take better advantage of the bandwidth of Gigabit Ethernet [30].

On the software side, the past two decades have witnessed the development of user-level software protocols that increase performance. For instance, lightweight and reliable libraries such as Active Messages [34] and U-Net [35] implement new protocols and programming interfaces to obtain high performance, but they sacrifice compatibility with legacy applications. U-Net supports zero-copy transfer via bypassing the kernel to avoid having to copy data from kernel space to user space.
VMMC-2 also supports zero-copy transfers over Myrinet but uses a custom API [10]. For protection, VMMC-2 requires that communications happen only after the receiver has given the sender permission to transfer data into its address space. If the receiver has not exported its receive buffer before data arrives, the protocol deposits the data into a default buffer (to be copied to the receive buffer later). Since data are simply dropped when this buffer becomes full, VMMC-2 implements sender-side buffering and retransmission for reliability. Although PM [32] uses modified ACK/NACK flow control to deliver messages in order, it requires data retransmission to ensure reliability.

Figure 1: PCIe fabric architecture. (TB: PCIe Transparent Bridge (Switch); NTB: PCIe Non-Transparent Bridge.)

In contrast, FM 2.x avoids unnecessary network traffic by only allowing the sender to begin transmission when the receiver has a free buffer [19]. In an effort to standardize communication architectures, the Virtual Interface Architecture (VIA) builds on the components of Active Messages, U-Net, FM, and VMMC to provide an OS-independent infrastructure for high-performance, user-level networking [5].

The reliability and adaptability of TCP have sustained its dominance with respect to deployment, in spite of its performance overheads. TCP requires extensive computing power [2], and it relies on kernel intervention to process messages, which triggers multiple copies and context switches in the critical message-passing path. User-level socket implementations address TCP inefficiencies by replacing the software protocol with zero-copy, kernel-bypass protocols. Examples include Fast Sockets [29], Sockets-over-EMP [3], Socket Direct Protocol (SDP) [1], and SuperSockets™ [9].

Fast Sockets implements a traditional sockets interface on top of Active Messages for Myrinet networks. As TCP/IP and Active Messages have different "on-the-wire" packet formats, it is difficult to make a full implementation of TCP/IP on top of Active Messages. To achieve performance close to the raw network capabilities and maintain good compatibility, Fast Sockets uses a low-overhead protocol for local-area communication and falls back to normal TCP/IP for wide-area communication. P-Socket adopts this strategy, too, but uses a different local-area protocol.

EMP [3] is a completely NIC-based protocol with OS bypass and zero-copy. Sockets-over-EMP exploits the performance of Gigabit Ethernet and offers compatibility with legacy socket-based applications. Like P-Socket, Sockets-over-EMP requires that the kernel translate virtual addresses and pin data buffers in memory to ensure correctness.

SDP [1] implements a socket interface using InfiniBand operations to allow traditional socket-based applications to run over IB fabrics without modification/recompilation. It supports Buffer Copy, which uses pre-registered SDP buffers as intermediaries when transferring data from sender to receiver. Furthermore, it can support zero-copy via "Sink Avail" or "Source Avail" messages, which is not feasible in P-Socket.

SuperSockets™ (introduced by Dolphin Corp. in 2001 and implemented on several of their high-performance interconnects) is most similar to the design we present here. A commercial product, SuperSockets focuses on reliability, availability, and compatibility. It implements automatic failover to a redundant adapter, supports TCP and UDP, requires no OS patches and no application modifications, and remains 100% compliant with the Linux Socket library. However, in our work, we focus on details of the hardware's advanced features and explore a synergistic approach that combines both software and hardware optimizations and develops policies to choose how best to exploit each.

Like P-Socket, Intel's RSOCKETS [13] supports traditional sockets applications over RDMA devices with no API changes. RSOCKETS strives to deliver maximal performance over InfiniBand, whereas we focus on PCIe, which is less expensive. P-Socket latency is lower than that of RSOCKETS, but RSOCKETS' effective bandwidth is higher. A detailed comparison of the two approaches is part of future work.

4. REDUCING SOFTWARE OVERHEAD
Consider two applications communicating via a standard socket interface. In user mode, both send and receive calls invoke the GNU C Library (glibc), which passes requests to the kernel via system calls. At the sender side, data is copied from user space to the send queue in kernel space, processed by the TCP/IP protocol stack, and sent to the receiver side over an Ethernet connection. At the receiver side, the receive queue processes the data from the TCP/IP protocol stack and then notifies the blocked user application.

Solutions like IP over IB (IPoIB) [7] and IP over PCIe (IPoPCIe) [18, 28] allow applications to use normal IP sockets on top of high-speed interconnect fabrics (at the cost of requiring the CPU to participate in processing the protocol stack). The IPoPCIe implementation we choose as our baseline uses a custom Virtual Network Interface Controller (VNIC) driver that offers the same interfaces used by the TCP/IP protocol, except that the VNIC controls a PCIe device instead of an Ethernet card.

Given the inefficiencies outlined above, TCP/IP presents two obvious opportunities for software optimization. Computational intensity can be reduced by streamlining or avoiding the protocol stack, and the high cost of system calls can be avoided by performing more work in user mode.
4.1 TCP/IP Stack Bypass
Figure 2(a) depicts an optimization that allows communication to bypass the TCP/IP protocol stack. P-Socket is divided into a kernel part and a user part. The former removes the TCP/IP protocol stack from the message-passing path. The latter consists of a user-level library wrapper. The kernel part is implemented as a dynamic module that initializes the NTB and DMA engine when loaded. When unloaded, it releases these resources.

Figure 2: Software optimization. (a) TCP/IP bypass; (b) kernel bypass.

The wrapper is a dynamically linked user-level library that works seamlessly with glibc. Via the preload mechanism, LD_PRELOAD [25], the wrapper can be invoked before other shared libraries (including glibc). When the application calls a standard socket function, the wrapper intercepts the call and decides whether to handle the call itself, pass it to glibc, or both. The wrapper sends requests to the kernel part via the ioctl function. Note that applications compiled with statically linked libraries must be recompiled.

Socket-based applications commonly work in client/server mode. The server calls the accept() function to wait for a connection, and the client calls the connect() function to post a connection request. The wrapper intercepts both calls and builds a PCIe connection. Once this connection is built, the kernel part creates a send queue on one side and a receive queue on the other. The wrapper intercepts subsequent sends and receives and performs them over the PCIe link. When the connection is terminated, the kernel part frees both queues.
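To make the interception mechanism concrete, the following user-space C sketch shows how such a wrapper can interpose connect() and send() via LD_PRELOAD and dlsym(RTLD_NEXT), handling PCIe-eligible connections itself and otherwise falling back to glibc. This is our own illustration rather than code from the P-Socket sources; psock_try_connect(), psock_owns_fd(), and psock_send() are hypothetical stand-ins for the P-Socket user library.

/* pswrap.c: build with  gcc -shared -fPIC -o libpswrap.so pswrap.c -ldl
 * and run an unmodified socket application as
 *   LD_PRELOAD=./libpswrap.so ./app                                        */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical P-Socket user-library entry points (not the real API). */
extern int     psock_try_connect(int fd, const struct sockaddr *sa, socklen_t len);
extern int     psock_owns_fd(int fd);
extern ssize_t psock_send(int fd, const void *buf, size_t len, int flags);

typedef int     (*connect_fn)(int, const struct sockaddr *, socklen_t);
typedef ssize_t (*send_fn)(int, const void *, size_t, int);

int connect(int fd, const struct sockaddr *sa, socklen_t len)
{
    static connect_fn real_connect;
    if (!real_connect)                       /* resolve the real glibc symbol once */
        real_connect = (connect_fn)dlsym(RTLD_NEXT, "connect");

    if (psock_try_connect(fd, sa, len) == 0) /* peer reachable over the PCIe fabric */
        return 0;
    return real_connect(fd, sa, len);        /* otherwise fall back to TCP/IP */
}

ssize_t send(int fd, const void *buf, size_t len, int flags)
{
    static send_fn real_send;
    if (!real_send)
        real_send = (send_fn)dlsym(RTLD_NEXT, "send");

    if (psock_owns_fd(fd))                   /* connection was built over PCIe */
        return psock_send(fd, buf, len, flags);
    return real_send(fd, buf, len, flags);   /* ordinary Ethernet socket */
}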
4.2 Kernel Bypass
Even though bypassing the TCP/IP protocol stack improves performance, at least two copies and two context switches are still required for each data transfer. To avoid this overhead, P-Socket allows user-mode transfers, as shown in Figure 2(b). The main data structures, including the send and receive queues, are implemented in user space instead of kernel space, which highlights two issues.

1. When the P-Socket data structures are maintained inside the kernel, a P-Socket can be shared/transferred among processes running on the same OS through an integer (P-Socket ID). To keep the same flexibility with kernel bypass, the P-Socket data structures should be allocated in shared memory.

2. Since the DMA engine only recognizes physical addresses, P-Socket must ensure that the kernel pins data in memory for the duration of the transfer.

In order to keep data coherent, the receive buffer for each connection must be uncacheable. This buffer is allocated through ioremap() when the P-Socket driver is invoked, and it is exposed to user space through mmap(). We allocate a DMA-coherent buffer via dma_alloc_coherent(). Due to the translation window limits of the NTBs (discussed in Section 7), we allocate 8MB of contiguous physical space to hold all receive buffers. The current version supports 16 independent channels. Each channel's receive buffer is divided into segments for flags, control packets, small data packets, and large data packets. Each data packet consists of a sequence number, a source P-Socket ID, a target P-Socket ID, the packet length, and the data payload.
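The buffer and packet layout just described might be declared as in the following C sketch. The header fields come from the text above; the field widths, the per-segment sizes, and all identifiers are illustrative assumptions rather than the actual P-Socket definitions (the paper only fixes the 8MB total, the 16 channels, and the header fields).

#include <stdint.h>

/* One data packet as described in Section 4.2: a sequence number, source
 * and target P-Socket IDs, the payload length, and the payload itself.
 * Field widths are assumptions made for illustration. */
struct psock_pkt_hdr {
    uint32_t seq;      /* sequence number         */
    uint16_t src_id;   /* source P-Socket ID      */
    uint16_t dst_id;   /* target P-Socket ID      */
    uint32_t len;      /* payload length in bytes */
    /* payload bytes follow immediately after the header */
};

/* Per-channel receive buffer carved out of the single 8MB DMA-coherent
 * region (16 channels, so 512KB per channel here). The split between the
 * flag, control, small-packet, and large-packet segments is assumed. */
#define PSOCK_CHANNELS      16
#define PSOCK_CHANNEL_BYTES (8u * 1024 * 1024 / PSOCK_CHANNELS)

struct psock_channel {
    volatile uint8_t flags[64];           /* full/empty tags (see Section 5.2) */
    uint8_t ctrl[4 * 1024];               /* control packets                   */
    uint8_t small[60 * 1024];             /* small data packets (stores)       */
    uint8_t large[PSOCK_CHANNEL_BYTES - 64 - 4 * 1024 - 60 * 1024]; /* DMA     */
};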
5. PCIE HARDWARE FEATURES
We investigate the advanced features of the PCIe fabric to leverage them in the design of P-Socket.

5.1 Small Packet Optimization
PCIe allows memory devices attached to a PCIe interface to be accessed via normal load/store instructions. We use this feature to allow a node to access another node's memory as if it were local (PCIe-attached) memory. Remote memory can thus be mapped to a host's local address space so that packets need not be moved between nodes by DMA transfers. Instead, the receiver can use load instructions to pull the packet from the sender memory, or the sender can use store instructions to push the packet into the receiver memory. Using store instructions is particularly expedient for small packets because it eliminates the initialization and interrupt handling overhead of DMA requests. However, when transmitting large packets, the DMA engine incurs less overhead than the CPU, and it allows more outstanding requests. As control messages are usually small and latency-sensitive, we use store instructions to transfer them.

In our prototype system, a load/store instruction carries up to 64B (i.e., a cache line), and the maximum payload of one PCIe transmit unit is 2048B. Defining the boundary between using load/store instructions and using DMA requests is a question worth considering: setting it too low limits the advantages of store instructions, while setting it too high leads to high CPU utilization and poor performance. We choose a boundary of 256B based on the experimental results shown in Figure 4 in Section 6.1.
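The resulting send-path policy amounts to a simple size check, sketched below. The 256B cutoff is the boundary chosen above; psock_store_copy() and psock_dma_submit() are hypothetical names standing in for the store-based and DMA-based transmission paths.

#include <stddef.h>
#include <stdint.h>

#define PSOCK_STORE_THRESHOLD 256  /* bytes; boundary chosen from Figure 4 */

/* Hypothetical back-ends: push the bytes into the mapped remote buffer with
 * ordinary store instructions, or hand the transfer to the NTB DMA engine. */
void psock_store_copy(volatile void *remote_dst, const void *src, size_t len);
int  psock_dma_submit(uint64_t remote_phys, const void *src, size_t len);

int psock_transmit(volatile void *remote_dst, uint64_t remote_phys,
                   const void *src, size_t len)
{
    if (len <= PSOCK_STORE_THRESHOLD) {
        /* Small, latency-sensitive packets: avoid DMA setup and interrupts. */
        psock_store_copy(remote_dst, src, len);
        return 0;
    }
    /* Large packets: let the DMA engine move the data and free the CPU. */
    return psock_dma_submit(remote_phys, src, len);
}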
5.2 Flow-Control Optimization
Turning data communications into PCIe load/store accesses or DMA requests requires new flow-control strategies. To prevent undetectable overflows, P-Socket uses full/empty flags to track the states of the send/receive buffers for two flow control mechanisms, write-side flow control (WFC) and read-side flow control (RFC). These two mechanisms differ with respect to where the tags reside. RFC locates tags in the receiver memory, whereas WFC locates them in the sender. In RFC mode, the sender reads the tags in the remote memory and transmits data when the receive buffer is empty. After the receiver consumes the data, it resets the appropriate tags in local memory. In WFC mode, the sender accesses local tags before transmission, and the receiver clears the remote tags after consuming the data.

The flow-control strategy invoked depends on whether the application issues a PCIe write or a PCIe read. It turns out that PCIe write performs better than the corresponding PCIe read because write requests use posted transactions, but read requests do not. In a posted transaction, the sender releases the associated resources immediately after the transaction is sent. In a non-posted transaction, the sender does not release the occupied resources until it receives an acknowledgement from the final destination. Furthermore, since the sender in WFC need only access local memory for the tags before transmission, it has lower latency. Therefore, P-Socket adopts WFC as its default.
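The WFC handshake can be sketched as follows: the sender checks a full/empty tag held in its own memory, pushes the packet into the mapped remote receive slot with store instructions (posted PCIe writes), and the receiver later clears that tag across the NTB once it has consumed the packet. This is a minimal model under our own naming; the tag encoding and the arrival-notification mechanism (e.g., a valid word or doorbell) are assumptions.

#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY 0u
#define SLOT_FULL  1u

/* Sender side (WFC): the full/empty tag for each receive slot lives in the
 * sender's local memory, so the check before transmission never crosses the
 * PCIe link. remote_slot is the receive buffer mapped through the NTB. */
int wfc_send(volatile uint32_t *local_tag, volatile uint8_t *remote_slot,
             const void *pkt, size_t len)
{
    if (*local_tag != SLOT_EMPTY)      /* receiver has not drained the slot */
        return -1;
    *local_tag = SLOT_FULL;

    const uint8_t *src = pkt;
    for (size_t i = 0; i < len; i++)   /* posted PCIe writes into remote memory */
        remote_slot[i] = src[i];
    return 0;
}

/* Receiver side: after consuming the packet, clear the sender's tag by
 * storing across the NTB into the sender's memory (sender_tag_mapped is the
 * same tag word, seen through the receiver's translation window). */
void wfc_complete(volatile uint32_t *sender_tag_mapped)
{
    *sender_tag_mapped = SLOT_EMPTY;
}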
5.3 DMA Optimization
In order to increase bandwidth, P-Socket uses a DMA engine inside the NTB to transfer data between local and remote memory. PCIe systems support two DMA modes: block DMA and burst DMA. Both modes require an initialization stage to set up one or more DMA descriptors (one per DMA transfer task), and both rely on interrupts to signal completion. In block mode, the driver directly posts a descriptor to the DMA engine, which can only process one transmission task at a time (i.e., only one descriptor is initialized for each transfer). In burst DMA mode, the driver first prepares one or more descriptors in memory, and then the DMA engine prefetches descriptors and launches transmission tasks in batches (i.e., multiple DMA descriptors can be initialized and processed simultaneously). Additionally, block DMA mode raises an interrupt signal at the end of each DMA task, while burst DMA mode requires only one interrupt per batch of transfers.

Burst DMA takes longer to trigger an operation, as it has to store the DMA descriptor in memory first and then invoke the DMA controller to process it. In contrast, block DMA configures the descriptor directly in the DMA controller and starts the procedure. Nevertheless, one transmission is usually composed of several DMA operations due to non-contiguous physical addresses, and multiple threads may invoke bursts of DMA transfers within a short time. Under these conditions, burst DMA mode incurs less overhead and achieves better bandwidth than block DMA mode.

To take advantage of burst DMA, the send queue is split into an active and an inactive queue. Only descriptors stored in the active queue are processed. Descriptors that arrive while the DMA engine is busy are stored in the inactive queue. When the active queue empties, the queue roles reverse. Pipelining transactions this way allows consecutive batches to be processed continuously without interruption.
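A sketch of this double-queue arrangement is shown below, with the DMA-engine interaction reduced to a single hypothetical dma_engine_kick() call and locking omitted for brevity; names and queue depth are our own assumptions, not the P-Socket implementation.

#include <stddef.h>

#define QDEPTH 64

struct dma_desc  { unsigned long src, dst; size_t len; };
struct desc_queue { struct dma_desc d[QDEPTH]; size_t n; };

static struct desc_queue qa, qb;
static struct desc_queue *active = &qa, *inactive = &qb;
static int engine_busy;

/* Hypothetical: hand a batch of descriptors to the burst-DMA engine; the
 * engine raises one interrupt when the whole batch has completed. */
void dma_engine_kick(struct dma_desc *batch, size_t n);

void submit_desc(struct dma_desc d)
{
    if (!engine_busy) {
        /* Engine idle: start a new batch immediately from the active queue. */
        active->d[active->n++] = d;
        engine_busy = 1;
        dma_engine_kick(active->d, active->n);
    } else if (inactive->n < QDEPTH) {
        /* Engine busy: park the descriptor in the inactive queue. */
        inactive->d[inactive->n++] = d;
    }
}

/* Called from the batch-completion interrupt handler. */
void dma_batch_done(void)
{
    struct desc_queue *t;
    active->n = 0;                                 /* completed batch drained  */
    t = active; active = inactive; inactive = t;   /* queue roles reverse      */
    if (active->n)
        dma_engine_kick(active->d, active->n);     /* next batch, back to back */
    else
        engine_busy = 0;
}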

6. EVALUATION
Figure 3(a) presents our five-node hardware prototype, and Figure 3(b) shows backplane detail. The backplane has one PCIe switch chip with Transparent Bridge (TB) functionality (PLX PEX 8648 [23]) and four switch chips with Non-Transparent Bridge (NTB) functionality (PLX PEX 8619 [24]). To make the TB easily configurable, we add a node to the shared address space; this becomes the root node in our prototype system. It participates as a compute node in addition to managing TB configuration. Data nodes communicate through the backplane via a PCIe adapter card. The TB forwards transactions between the root node (node 0) and the leaf nodes (nodes 1-4). The NTBs all connect to the PCIe switch chip, and each attaches to a compute node. They electrically isolate the PCIe buses and shield the attached nodes by masquerading as endpoints to discovery software and translating the addresses of transactions that cross the bridge (mapping transactions in one PCIe hierarchy to corresponding transactions in another).

Figure 3: A PCIe switch based system. (a) Prototype system; (b) backplane.

Since our study focuses on sustainable communication performance (and not scalability), we use just two of the five data nodes in our evaluation environment. These nodes are configured as shown in Table 1. In order to study the influence of different configurations on performance, we combine hardware and software optimizations to create the seven different P-Socket implementations listed in Table 2.

Table 1: Data node configurations
  CPU:          Intel Core i7 processor, 3.4GHz, 4 cores, 8 threads, 8MB LLC
  Memory:       8GB DDR3
  Disk:         SATA 7200RPM 1TB
  Interconnect: PCIe Gen 2, x8 links with 5GB/s per link
  OS:           Red Hat Enterprise Linux 6.1

Table 2: Different implementation versions
  V1: RFC + BLK
  V2: SB + RFC + BLK
  V3: SB + SPO + RFC + BLK
  V4: SB + WFC + BLK
  V5: SB + RFC + BDM
  V6: SB + SPO + WFC + BDM
  V7: SB + KB + SPO + WFC + BDM
  SB (TCP/IP Stack Bypass), KB (Kernel Bypass), SPO (Small Packet Optimization), RFC (Read-side Flow Control), WFC (Write-side Flow Control), BLK (Block DMA Mode), BDM (Burst DMA Mode)

6.1 Raw Bandwidth
To first gain a basic understanding of our PCIe system's bandwidth performance, we transfer a total of 1GB of data between two nodes via different methods: block DMA, burst DMA, store instructions, and load instructions. Data are divided into blocks that vary from 64B to 1MB. Bandwidth results are shown in Figure 4. Note that these raw bandwidth tests do not involve P-Socket: they demonstrate the basic bandwidth performance of our PCIe system. We observe three characteristics.

• Burst DMA has higher bandwidth than block DMA when transferring small data packets, since it allows better pipelining with less interrupted data movement. As data block size increases, this discrepancy gradually disappears. The peak bandwidth is 2.98GB/s, which is close to the 3.05GB/s theoretical limit [24].

• Using store instructions to transmit small packets gives slightly better performance than DMA. In particular, it performs the best when packets are 256B or smaller.

• Store instructions work better than load instructions: loads perform (surprisingly) poorly, regardless of data block size. The peak bandwidth is about 26.67MB/s.

Figure 4: Results of raw bandwidth tests (bandwidth in GB/s vs. data size from 64B to 1MB for burst DMA, block DMA, stores, and loads).

To understand the low load performance, we conduct two more tests: in the first, one node uses loads to access a contiguous region of a second node's address space, and in the second, the first node does the same via DMA reads. We used a Tek PCIe Logic analyzer [31] to collect Transaction Layer Packet (TLP) traces of the PCIe transaction headers and payloads. For a 10µs randomly sampled observation window during the stable state, we measured 280 TLPs generated by DMA reads, but only seven TLPs generated by loads, as shown in Figure 5. This suggests that TLP read requests are issued serially, which severely impacts performance. We suspect the PCIe Root Complex inside the processor chip to be the culprit, as it allows only one outstanding PCIe load request outside the processor.

Figure 5: Number of TLPs traced with our Tek PCIe logic analyzer. (a) Transaction Layer Packets from loads; (b) Transaction Layer Packets from DMAs.

6.2 P-Socket Bandwidth
We use Netperf [17] to measure the TCP bandwidth over P-Socket. Netperf creates two user-level processes, a server and a client. They run on different data nodes and communicate via the standard socket interface. We vary message size from 1B to 4MB and test the bandwidth of P-Socket with our different implementations. Figure 6 shows results. We make five observations:

(1) The peak bandwidth of the baseline implementation V1 (RFC+BLK) is about 200MB/s, and that of V2 (SB+RFC+BLK) is about 1.7GB/s. This shows that TCP/IP stack bypass can improve bandwidth by almost a factor of eight.

(2) There is almost no difference in the bandwidth results for implementations V2-V4. The small packet optimization in V3 (SB+SPO+RFC+BLK) has no impact on peak bandwidth, since only large packets are bandwidth-hungry. Even when switching to write-side flow control, V4 (SB+WFC+BLK) remains limited by the serial transmission processing of block DMAs. However, in burst DMA mode, the performance gap between WFC and RFC is obvious. V6 (SB+SPO+WFC+BDM) reaches peak bandwidth earlier than V5 (SB+RFC+BDM), which means WFC performs better than RFC in transmitting small packets via burst DMAs.

(3) The peak bandwidths of V5 and V6 are nearly 1.75 times that of V2 (SB+RFC+BLK). Specifically, V6 (SB+SPO+WFC+BDM) achieves the best peak bandwidth, i.e., 2.98GB/s, which matches our raw bandwidth test results. This indicates that burst DMA increases effective bandwidth.

(4) V7 (SB+KB+SPO+WFC+BDM) gets a peak bandwidth of 2.87GB/s, which is a bit lower than that of V5 and V6. Since V7 moves transmission tasks from kernel space to user space, it pays the overhead for pinning data in physical memory when transferring data blocks with the DMA engine. In our experiments, we send data packets in different memory locations on subsequent transmissions. P-Socket therefore pins the data pages before each transmission, even though this may not be necessary. The OS can allocate the same set of buffers for different transmissions, and then we would need to pin the receive/send buffers only once. Future work will optimize V7 to avoid unnecessary data pinning.

(5) V7 exhibits higher bandwidth in transferring 256B messages versus 512B messages. The former are transmitted by store instructions, and the latter are transmitted by DMAs. We do not see this phenomenon in V3 or V6, even though they also use SPO, since those versions avoid overheads (like pinning data) when transferring large messages by DMA.

We conclude that combining stack bypassing, write-side flow control, and burst DMA effectively reduces the gap between the peak and the end-user's achieved bandwidths.

Figure 6: Bandwidth for varying message sizes.

6.3 Raw Latency
To measure uncontested raw latencies of the basic PCIe operations we conduct ping-pong tests that transmit 4B data chunks between nodes. We write non-pipelined code to measure the load request latency. The raw latencies for stores, loads, block DMAs, and burst DMAs are 1.125µs, 2.452µs, 3.735µs, and 6.572µs, respectively. Block DMAs enjoy lower latency compared to burst DMAs due to their shorter initialization overhead. Store instructions perform much better than load instructions and either block or burst DMAs. Our best measured latency using stores is 1.125µs, which is close to the theoretical minimum value of 1µs.
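For reference, raw numbers of this kind can be gathered with a ping-pong harness of the following shape. This is our own user-space timing skeleton, not the paper's test code; pingpong_once() is a placeholder for one 4B round trip over the transport under test (stores, loads, or a DMA request plus completion wait).

#include <stdio.h>
#include <time.h>

#define ITERS 100000

/* Placeholder: one 4B round trip over the mechanism being measured. */
void pingpong_once(void);

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        pingpong_once();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* One-way latency is half of the averaged round-trip time. */
    printf("one-way latency: %.3f us\n", ns / ITERS / 2.0 / 1000.0);
    return 0;
}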

1.0 Bandwidth (GB/s) 0.0 2 4 6 8 Bandwidth (GB/s) 0.5 Threads 0.0

1 2 4 8 16 32 64 1K 2K 4K 8K 128 256 512 16K 32K 64K128K256K512K1M 2M 4M (a) BW when varying numbers of threads Message Size (Bytes) 3.0 P-Socket 2.5 10GigE Figure 6: Bandwidth for varying message sizes 2.0 1.5 1.0 V1 (RFC+BLK) 50 V2 (SB+RFC+BLK) 0.5 V3 (SB+SPO+RFC+BLK) V4 (SB+WFC+BLK) Bandwidth (GB/s) 0.0 V5 (SB+RFC+BDM) 1 2 4 8 16 32 64 1K 2K 4K 8K 40 V6 (SB+SPO+WFC+BDM) 128 256 512 16K 32K 64K128K256K512K 1M V7 (SB+KB+SPO+WFC+BDM) Message Size (Bytes) 30 (b) BW when varying numbers of messages 20 (threads=4)

Latency (usec) 10 Figure 8: Multiple stream tests 0

1 2 4 8 16 32 64 128 256 512 1K 2K 4K Message Size (Bytes) these roles reverse when it comes to bandwidth. P-Socket adopts V7: it gets slightly lower bandwidth than V5 and Figure 7: Latency for varying message sizes (see Table 2 for V6, but the differences are small, whereas the latency gaps implementation descriptions) are huge. V7 incurs context switch overhead when pinning buffers in memory. If we instead pinned a single buffer and copied the payload into it before each transmission, V7’s shorter initialization overhead. Store instructions perform bandwidth would likely rival that of V6. much better than load instructions and either block or burst DMAs. Our best measured latency using stores is 1.125µs, 6.5 Multiple Stream Tests which is close to the theoretical minimum value of 1µs. To test multi-stream performance, we create an equal num- ber of threads on each node. Each on the first node is 6.4 P-Socket Latency connected to one thread on the second. Figure 8a shows the We again use Netperf to measure P-Socket’s latency, re- aggregate bandwidth of all connections with varying num- sults for which are presented in Figure 7. Our baseline sys- bers of threads for a message size of 1KB. We measure the tem V1 (RFC+BLK) exhibits the highest latency among all bandwidth of all connections over one minute. We also com- versions. In this implementation, messages are copied to the pare the same benchmark on the nodes connected by a 10 kernel where they undergo complicated processing by the Gigabit Network adapter (Intel R 82599EB [8]), which has TCP/IP protocol stack. Implementation versions V2-V6 re- the same interface as our hardware prototype system, i.e., duce latency by bypassing the protocol stack. In particular, PCIe Gen 2. When running only one or two threads, P- V6 (SB+SPO+WFC+BDM) sees almost 48% reduction in la- Socket and 10 GigE perform similarly. However, when run- tency compared to the baseline. This demonstrates the high ning more than three threads P-Socket delivers about twice overhead of the TCP/IP protocol stack and the efficiency of the bandwidth (2.15GB/s) of 10 GigE. using store instructions for small packets. Figure 8b shows bandwidth for different message sizes In comparing V2-V6, we find that SPO and WFC de- when running four threads. P-Socket has lower effective crease latency by about 3.1µs and 0.6µs, respectively. When bandwidth than 10 GigE for messages less than 256B. Recall using both, latency decreases by about 5µs. Nevertheless, that control messages are usually small and latency sensitive. there is still much room for improvement. Note that V6 To achieve lower latency, P-Socket transfers these small mes- (SB+SPO+WFC+BDM) handles transmission in the kernel, sages via store instructions instead of DMAs. These store although it removes the TCP/IP stack from the message- instructions are uncacheable because their destinations are passing path. To decrease the number of copies associated in IO space. Using stores may sacrifice bandwidth perfor- with a message transfer and to remove the kernel from the mance, as it prevents P-Socket from writing data to cache critical message passing path, V7 (SB+KB+SPO+WFC+BDM) or to local memory (where it might be consumed more effi- allows direct user-mode transfers. In addition, this version ciently). 
Huge messages are always bandwidth sensitive, and uses store instructions and WFC to transfer small packets, transferring them via store instructions incurs large CPU resulting in an end-to-end latency of 1.2µs (which is close to overheads. Messages over 256B are thus sent by DMA, the 1.125µs result from the raw latency test). which improves bandwidth and reduces CPU utilization. With respect to latency, software optimizations play a For very large packets, P-Socket reaches a peak bandwidth more important role than hardware optimizations. However, of 2.90GB/s, which is about three times that of 10 GigE. load/store NTB Inside

Figure 9: Remote access and address translation.

7. LIMITATIONS
Figure 9 shows the mechanism for remote access and address translation. Load/store instructions are issued by the core and forwarded by the Root Complex and PCIe Switch. The OS initializes forwarding tables within these components at boot time, and the kernel can change them at run time. Typically, each NTB is allocated one contiguous physical address space, as there are only a few forwarding table entries. Address translation happens in the Non-Transparent Bridge (NTB) that separates two different address spaces. The incoming access address must be located in the translation window in order for it to be mapped to the address space on the other side of the NTB. In our hardware prototype system, there are only four 32-bit translation windows or two 64-bit translation windows for each NTB. This means that user applications must share these resources in a multiprocessing environment. While one transfer is in progress, transfers for other applications are blocked. Large transfers may thus hurt other applications' performance.
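In software terms, the forwarding that Figure 9 depicts reduces to a base-and-window remapping. The following minimal C model uses our own notation rather than the NTB's register interface; it only illustrates the window check and offset translation described above.

#include <stdint.h>

/* One NTB translation window: incoming addresses in
 * [in_base, in_base + size) are forwarded as out_base + offset into the
 * address space on the far side of the bridge. Each NTB in our prototype
 * has only a handful of such windows. */
struct ntb_window {
    uint64_t in_base;   /* window base in the local address space */
    uint64_t out_base;  /* translated base in the remote space    */
    uint64_t size;      /* window length                          */
};

/* Returns the translated address, or 0 if the access misses the window
 * (in which case the NTB will not forward it). */
static uint64_t ntb_translate(const struct ntb_window *w, uint64_t addr_in)
{
    if (addr_in < w->in_base || addr_in - w->in_base >= w->size)
        return 0;
    return w->out_base + (addr_in - w->in_base);
}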
We could avoid this by segmenting large messages into smaller ones, but this is difficult to synchronize and manage from user mode. Note that the address translation registers in the NTB must be changed whenever an application's source or destination buffer addresses change, but allowing applications to change address mappings in user mode without hardware protection poses a security risk. Given these problems, P-Socket does not use zero-copy in user mode. Note that these problems do not occur in InfiniBand with its Queue Pair (QP) mechanism, which can support up to 2^24 isolated QP connections with protection.

Kernel bypassing is another technique used to improve communication library performance. The main challenges in implementing kernel bypassing are how to maximize the use of limited hardware resources and how to use bypass safely in user mode. For instance, our hardware prototype system has limited DMA resources that must be shared among applications. How to share these effectively in user mode remains a challenge: it would require moving the control process from kernel mode to user mode. Having DMA support for each user process would save time and effort for user-level library developers. We expect that next-generation PCIe fabrics will offer better virtualization mechanisms.

8. CONCLUSIONS
We have described P-Socket, a high-efficiency communication library for PCIe-switch based servers. We implemented a prototype server for which we designed and produced a custom backplane board, and we developed the corresponding software stack, including the VNIC driver and socket-compatible library. We then systematically studied the bandwidth and latency characteristics of a baseline system running IP over PCIe (IPoPCIe), leveraging our findings to optimize a communications library that:

• balances kernel and user responsibilities,

• streamlines software processing of the TCP/IP protocol stack, and

• exploits PCIe hardware features.

Our library enables traditional socket applications to run without modification. Furthermore, P-Socket delivers most of the physical network's potential performance to the end user, achieving a minimum end-to-end application latency of 1.2µs and effective bandwidths of up to 2870MB/s.

We are continuing to improve P-Socket, e.g., by developing support for UDP and improving zero-copy. Based on the results we present here, we encourage the community to push for further enhancements of the PCIe interconnect technology. In the meantime, we are continuing to investigate and develop a new hardware interconnect protocol that has a thinner layer than PCIe and that will offer better virtualization and protection mechanisms.

9. ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation of China under grant numbers 61402439, 61402438, and 61522212. The Chinese Academy of Sciences President's International Fellowship Initiative grant number 2015VTB053 supported S.A. McKee's participation.

10. REFERENCES
[1] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda. Sockets Direct Protocol over InfiniBand in clusters: Is it beneficial? In Proc. IEEE International Symposium on Performance Analysis of Systems and Software, pages 28–35, Mar. 2004.
[2] P. Balaji, H. V. Shah, and D. K. Panda. Sockets vs RDMA interface over 10-Gigabit networks: An in-depth analysis of the memory traffic bottleneck. In Proc. Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies (RAIT), Sept. 2004.
[3] P. Balaji, P. Shivam, P. Wyckoff, and D. Panda. High performance user level sockets over Gigabit Ethernet. In Proc. IEEE International Conference on Cluster Computing, pages 179–186, Sept. 2002.
[4] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seizovic, and W.-K. Su. Myrinet: A Gigabit-per-Second local area network. IEEE Micro, 15(1):29–36, Dec. 1995.
[5] P. Buonadonna, A. Geweke, and D. Culler. An implementation and analysis of the Virtual Interface Architecture. In Proc. ACM/IEEE Conference on Supercomputing, pages 1–15, Nov. 1998.
[6] L. Chisvin. PCIe ready for datacenter role. http://www.eetimes.com/AUTHOR.ASP?SECTION_ID=36&DOC_ID=1319539, Sept. 2013.
[7] J. Chu and V. Kashyap. Transmission of IP over InfiniBand (IPoIB). http://www.hjp.at/doc/rfc/rfc4391.html, Apr. 2006.
[8] Intel Corp. Intel 82599 10 Gigabit Ethernet Controller: Product brief. http://www.intel.com/content/www/us/en/ethernet-controllers/82599-10-gbe-controller-brief.html, Aug. 2009.
[9] Dolphin Corp. SuperSockets for Linux: Overview. http://www.dolphinics.com/download/WHITEPAPERS/Dolphin_Express_IX_SuperSockets_for_Linux.pdf, Aug. 2013.
[10] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Proc. IEEE Hot Interconnects V, Aug. 1997.
[11] W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and D. K. Panda. Performance characterization of a 10-Gigabit Ethernet TOE. In Proc. High Performance Interconnects, pages 58–63, Aug. 2005.
[12] P. Grun. Introduction to InfiniBand™ for end users. https://cw.infinibandta.org/document/dl/7268, Apr. 2010.
[13] S. Hefty. RSOCKETS: RDMA for dummies. In Proc. Open Fabrics Developer Workshop, Apr. 2013.
[14] Y. Hoskote, B. A. Bloechel, G. E. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. G. Naendra, G. Ruhl, J. W. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Wu, and N. Borkar. A TCP offload accelerator for 10 Gb/s Ethernet in 90-nm CMOS. IEEE Journal of Solid-State Circuits, 38(11):1866–1875, Feb. 2003.
[15] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong, H. Wang, X. Gu, and S. Zhang. Cost effective data center servers. In Proc. IEEE International Symposium on High Performance Computer Architecture, pages 179–187, Feb. 2013.
[16] J. Hurwitz and W. Feng. End-to-end performance of 10-Gigabit Ethernet on commodity systems. IEEE Micro, 24(1):10–12, Jan.-Feb. 2004.
[17] R. Jones. Care and feeding of Netperf 2.6.X. http://www.netperf.org/svn/netperf2/tags/netperf-2.6.0/doc/netperf.html, 2012.
[18] V. Krishnan. Towards an integrated IO and clustering solution using PCI Express. In Proc. IEEE International Conference on Cluster Computing, pages 259–266, Sept. 2007.
[19] M. Lauria, S. Pakin, and A. A. Chien. Efficient layering for high speed communication: Fast Messages 2.x. In Proc. IEEE High Performance Parallel and Distributed Computing, pages 10–20, July 1998.
[20] R. Merritt. RapidIO nudges ARM into servers. http://www.eetimes.com/document.asp?doc_id=1318957, July 2013.
[21] S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proc. ACM/IEEE Conference on High Performance Networking and Computing (Supercomputing), page 55, Dec. 1995.
[22] F. Petrini, W. Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics network: High-performance clustering technology. IEEE Micro, 22(1):46–57, Nov. 2002.
[23] PLX Technology, Inc. ExpressLane PEX 8648-AA, AB, and BB 48-lane/12-port PCI Express Gen 2 switch data book. PEX8648-SIL-PB-1.0, http://www.plxtech.com/products/expresslane/pex8648, Apr. 2009.
[24] PLX Technology, Inc. ExpressLane PEX 8619-BA 16-lane, 16-port PCI Express Gen 2 switch with DMA data book. PEX8619-SIL-PB-1.4, http://www.plxtech.com/products/expresslane/pex8619, Apr. 2010.
[25] K. Pulo. Fun with LD_PRELOAD. https://nf.nci.org.au/training/talks/lca2009.pdf, Jan. 2009.
[26] A. Rao. AMD | SeaMicro technology overview. http://www.seamicro.com/sites/default/files/SM_TO01_64_v2.7.pdf, Oct. 2012.
[27] R. Recio. The coming decade of data center networking discontinuities. In Proc. IEEE International Conference on Computing, Networking and Communications (keynote), Feb. 2012.
[28] J. Regula. Integrating rack level connectivity into a PCI Express switch. In Proc. Hot Chips: A Symposium on High Performance Chips, pages 259–266, Aug. 2013.
[29] S. H. Rodrigues, T. E. Anderson, and D. E. Culler. High-performance local area communication with Fast Sockets. In Proc. USENIX Technical Conference, pages 257–274, Jan. 1997.
[30] P. Shivam, P. Wyckoff, and D. K. Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing. In Proc. IEEE International Conference on Supercomputing, page 49, Nov. 2001.
[31] Tektronix. Tektronix PCI Express logic protocol analyzer. http://www.tek.com/datasheet/tla7sa00-series, 2013.
[32] H. Tezuka, A. Hori, and Y. Ishikawa. PM: A high-performance communication library for multi-user parallel environments. Technical Report TR-96-015, Tsukuba Research Center, 1996.
[33] TOP500 Supercomputer Site. Interconnect Family/Myrinet. http://www.top500.org/statistics/details/connfam/2, 2013.
[34] T. von Eicken, V. Avula, A. Basu, and V. Buch. Low-latency communication over ATM networks using Active Messages. IEEE Micro, 15(1):46–53, Dec. 1995.
[35] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A user-level network interface for parallel and distributed computing. In Proc. ACM Symposium on Operating Systems Principles, pages 40–53, Dec. 1995.