P-Socket: Optimizing a Communication Library for a PCIe-Based Intra-Rack Interconnect

Liuhang Zhang, Rui Hou
Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China

Sally A. McKee
Chalmers University of Technology, Gothenburg, Sweden

Jianbo Dong, Lixin Zhang
Institute of Computing Technology, University of Chinese Academy of Sciences, Beijing, China

ABSTRACT
Data centers require efficient, low-cost, flexible interconnects to manage the rapidly growing internal traffic generated by an increasingly diverse set of applications. To meet these requirements, data center networks are increasingly employing alternatives such as RapidIO, Freedom, and PCIe, which require fewer physical devices and/or have simpler protocols than more traditional interconnects. These networks offer raw high-performance communication capabilities, but simply using them for conventional TCP/IP-based communication fails to realize the potential performance of the physical network. Here we analyze causes for this performance loss for the TCP/IP protocol over one such fabric, PCIe, and we explore a hardware/software solution that mitigates overheads and exploits PCIe's advanced features. The result is P-Socket, an efficient library that enables legacy socket applications to run without modification. Our experiments show that P-Socket achieves an end-to-end latency of 1.2µs and effective bandwidth of up to 2.87GB/s (out of a theoretical peak of 3.05GB/s).

CCS Concepts
•Software and its engineering → Communications management; •Networks → Network performance evaluation; Network performance analysis; Data center networks; •Hardware → Networking hardware;

Keywords
data-center servers, rack interconnects, sockets, PCIe

1. INTRODUCTION
Data centers must handle huge volumes of workloads generated by millions of independent users. Now dominated by cloud computing and big data applications, these workloads have caused network data flows to change such that the ratio of internal to external traffic has gone from 5:95 to 75:25 [27]. Data center applications are also becoming more diversified in their requirements. Meeting the needs of these applications at rapidly growing scales requires efficient sharing of data-center resources, which, in turn, requires a high-efficiency, low-cost, flexible interconnect.

Early data centers often employed standard High Performance Computing (HPC) networking solutions like InfiniBand [12] (heretofore abbreviated as IB), 10-Gigabit Ethernet [16] (10 GigE), Myrinet [4], and Quadrics [22]. Ethernet has often been used for its ease of deployment and backward compatibility, but linking racks of volume servers together with 10 GigE switches and routers is wasteful in terms of power and cost: replicated components allow each server to enjoy individual operation, management, and connectivity, yet the servers are never used in isolation. Myrinet is a high-speed local area networking fabric which has much lower protocol overhead than Ethernet. Once popular for supercomputers, its use has decreased in recent years (it was used in 141 of the TOP500 machines in 2005 but only one in 2013) [33]. InfiniBand stands out among traditional HPC solutions: its popularity has been growing for both TOP500 supercomputers and enterprise data centers.
Many newer data centers are turning to innovative interconnects that better match the communication requirements of modern data center workloads. For instance, Freescale, IDT, Mobiveil, and Prodrive promote the use of ARM servers connected by RapidIO [20], the embedded fabric currently used in most base stations. AMD SeaMicro markets ultra-low-power, small-footprint data centers connected by their Freedom™ Supercomputer Fabric, a 3D torus that includes both path redundancy and diversity. SeaMicro's proprietary technology can be used to build servers from any processor with any instruction set and any communication protocol [26]. Even PCI Express (PCIe), once viewed as inappropriate for general-purpose fabrics, is increasingly used within small-scale, tightly coupled data-center racks once connected by more traditional HPC technologies [6]. For instance, Hou et al. [15] demonstrate a prototype data center server in which nodes share memory resources, GPGPUs, and network bandwidth via PCIe, which allows remote resources to be used much the same as local resources.

In all high-speed interconnects, hardware and software overheads prevent applications from realizing a fabric's raw peak performance. For instance, Balaji et al. [1] show that the Socket Direct Protocol (SDP) and IP over InfiniBand (IPoIB) both take more than five times the raw link latency and realize less than 2/3 and 1/5 of the raw link bandwidth, respectively. Likewise, Feng et al. [11] show that implementing traditional sockets over 10 GigE incurs 16% latency over that of the raw link and realizes less than 3/4 of the raw link bandwidth. These results demonstrate that there is still room for improvement. Here we use PCIe to build a prototype system (similar to that of Hou et al. [15]) on which to investigate the reasons behind these performance gaps. We select PCIe for its reliability and efficiency over short distances.

Experiments on our PCIe prototype system verify that:

• using store instructions to transmit small data packets reduces latency by about 16% over DMA;

• using store instructions realizes a peak bandwidth of 2.48GB/s, while using load instructions delivers only 26.67MB/s;

• burst DMA mode performs better than block DMA mode for small packets, even though burst DMA's latency is 1.76 times longer;

• bypassing the TCP/IP protocol stack can increase bandwidth by a factor of eight and lower latency by about 30% for small messages; and

• bypassing the kernel reduces small-packet latency from 18µs to 1.2µs by eliminating unnecessary context switching and buffer copying.

Based on this analysis, we propose P-Socket, a communications library designed to exploit these performance artifacts. Specifically, P-Socket bypasses the kernel, uses store instructions to transmit small packets and to implement flow control, and uses burst DMA instead of block DMA to transmit large packets. Note that the optimizations we study are not new, and they have been exploited to good effect elsewhere; rather, it is their synthesis and PCIe-specific adaptation that is novel to P-Socket. The main contribution of this paper is the detailed performance evaluation of our implementation within a real system.

The performance benefits of supporting multiple DMA modes and remote memory access via load/store instructions argue for their inclusion in future interconnects. Thus even though P-Socket's optimizations are closely related to the features of the PCIe fabric, our work provides general insights into the design of new intra-rack interconnects and accompanying software libraries.

We designed and produced the hardware backplane board and developed the corresponding software stack (the Virtual Network Interface Controller driver and standard socket-compatible library) for our proof-of-concept prototype. We quantify the software overheads of the communication library over PCIe and demonstrate P-Socket's efficacy by implementing the kernel portion within Linux 2.6.39.4. Furthermore, we analyze limitations of our hardware prototype system to inform the design of future hardware interconnect protocols. Here we demonstrate P-Socket on a prototype data center server, but extending it to support MPI would make it equally appropriate for HPC systems with PCIe intra-rack interconnects.
2. PCIE SYSTEM ARCHITECTURE
Many scalable systems are built from sets of nodes coupled tightly with low-latency, high-bandwidth local interconnects. These sets of tightly coupled nodes, or super nodes, are themselves connected by more traditional networks such as Ethernet. PCIe fabrics are good candidates for super-node interconnects based on several considerations. First, the PCIe interface already enjoys widespread use, which means that deploying it requires no architectural changes or additional protocol translation cards. Second, the PCIe fabric has good scalability for intra-rack interconnects, which usually include fewer than 100 nodes: PCIe cables (e.g., copper wire or optical fiber) work well at such short distances. Third, PCIe allows servers within a rack to directly share resources via memory load/store instructions.

Figure 1 shows a typical organization. In this tightly coupled group, compute nodes connect to a Non-Transparent Bridge (NTB) that connects to the Transparent Bridge (TB) on the other side. NTBs can separate different address spaces and translate transactions from one address space to another. TBs are used to forward transactions within a given address space. This fabric is sufficiently scalable that many TBs can be connected together to expand the network, and new compute nodes need only one NTB to join. When compared with Ethernet or IB (whose adapters are commonly plugged into PCIe slots), PCIe fabrics eliminate additional protocol conversions (e.g., from PCIe to IB or from PCIe to Ethernet). This advantage gives PCIe a shorter communication channel and thus lower latency.

3. RELATED WORK
The evolution of interconnection technology has caused the communication performance bottleneck to move from the physical layer to the software layer. One way to reduce the bottleneck is to move some software functionality to the hardware. For example, Illinois Fast Messages (FM) use Myrinet capabilities to offload protocol processing to the programmable NIC [21]. TCP Offload Engines also free host CPU cycles by moving TCP/IP stack processing to the network controller [14]. Similarly, Ethernet Message Passing (EMP) offloads protocol processing to take better advantage of the bandwidth of Gigabit Ethernet [30].

On the software side, the past two decades have witnessed the development of user-level software protocols that increase performance. For instance, lightweight and reliable libraries such as Active Messages [34] and U-Net [35] implement new protocols and programming interfaces to obtain high performance, but they sacrifice compatibility with legacy applications. U-Net supports zero-copy transfer via bypassing the kernel to avoid having to copy data from kernel space to user space.
VMMC-2 also supports zero-copy transfers over Myrinet but uses a custom API [10]. For protection, VMMC-2 requires that communications happen only after the receiver has given the sender permission to transfer data into its address space. If the receiver has not exported its receive buffer before data arrives, the protocol deposits the data into a default buffer (to be copied to the receive buffer later). Since data are simply dropped when this buffer becomes full, VMMC-2 implements sender-side buffering and retransmission for reliability. Although PM [32] uses modified ACK/NACK flow control to deliver messages in order, it requires data retransmission to ensure reliability.

Figure 1: PCIe fabric architecture. (TB: PCIe Transparent Bridge (Switch); NTB: PCIe Non-Transparent Bridge.)

In contrast, FM 2.x avoids unnecessary network traffic by only allowing the sender to begin transmission when the receiver has a free buffer [19]. In an effort to standardize communication architectures, the Virtual Interface Architecture (VIA) builds on the components of Active Messages, U-Net, FM, and VMMC to provide an OS-independent infrastructure for high-performance, user-level networking [5].

The reliability and adaptability of TCP have sustained its dominance with respect to deployment, in spite of its performance overheads. TCP requires extensive computing power [2], and it relies on kernel intervention to process messages, which triggers multiple copies and context switches in the critical message-passing path. User-level socket implementations address TCP inefficiencies by replacing the software protocol with zero-copy, kernel-bypass protocols. Examples include Fast Sockets [29], Sockets-over-EMP [3], Socket Direct Protocol (SDP) [1], and SuperSockets™ [9].

Fast Sockets implements a traditional sockets interface on top of Active Messages for Myrinet networks. As TCP/IP and Active Messages have different "on-the-wire" packet formats, it is difficult to make a full implementation of TCP/IP on top of Active Messages. To achieve performance close to the raw network capabilities and maintain good compatibility, Fast Sockets uses a low-overhead protocol for local-area communication and falls back to normal TCP/IP for wide-area communication. P-Socket adopts this strategy, too, but uses a different local-area protocol.

EMP [3] is a completely NIC-based protocol with OS bypass and zero-copy. Sockets-over-EMP exploits the performance of Gigabit Ethernet and offers compatibility with legacy socket-based applications. Like P-Socket, Sockets-over-EMP requires that the kernel translate virtual addresses and pin data buffers in memory to ensure correctness.

SDP [1] implements a socket interface using InfiniBand operations to allow traditional socket-based applications to run over IB fabrics without modification/recompilation. It supports Buffer Copy, which uses pre-registered SDP buffers as intermediaries when transferring data from sender to receiver. Furthermore, it can support zero-copy via "Sink Avail" or "Source Avail" messages, which is not feasible in P-Socket.

SuperSockets™ (introduced by Dolphin Corp. in 2001 and implemented on several of their high-performance interconnects) is most similar to the design we present here. A commercial product, SuperSockets focuses on reliability, availability, and compatibility. It implements automatic failover to a redundant adapter, supports TCP and UDP, requires no OS patches and no application modifications, and remains 100% compliant with the Linux Socket library. However, in our work, we focus on details of the hardware's advanced features and explore a synergistic approach that combines both software and hardware optimizations and develops policies to choose how best to exploit each.

Like P-Socket, Intel's RSOCKETS [13] supports traditional sockets applications over RDMA devices with no API changes. RSOCKETS strives to deliver maximal performance over InfiniBand, whereas we focus on PCIe, which is less expensive. P-Socket latency is lower than that of RSOCKETS, but RSOCKETS' effective bandwidth is higher. A detailed comparison of the two approaches is part of future work.

4. REDUCING SOFTWARE OVERHEAD
Consider two applications communicating via a standard socket interface. In user mode, both send and receive calls invoke the GNU C Library (glibc), which passes requests to the kernel via system calls. At the sender side, data is copied from user space to the send queue in kernel space, processed by the TCP/IP protocol stack, and sent to the receiver side over an Ethernet connection. At the receiver side, the receive queue processes the data from the TCP/IP protocol stack and then notifies the blocked user application.

Solutions like IP over IB (IPoIB) [7] and IP over PCIe (IPoPCIe) [18, 28] allow applications to use normal IP sockets on top of high-speed interconnect fabrics (at the cost of requiring the CPU to participate in processing the protocol stack). The IPoPCIe implementation we choose as our baseline uses a custom Virtual Network Interface Controller (VNIC) driver that offers the same interfaces used by the TCP/IP protocol, except that the VNIC controls a PCIe device instead of an Ethernet card.

Given the inefficiencies outlined above, TCP/IP presents two obvious opportunities for software optimization. Computational intensity can be reduced by streamlining or avoiding the protocol stack, and the high cost of system calls can be avoided by performing more work in user mode.
4.1 TCP/IP Stack Bypass
Figure 2(a) depicts an optimization that allows communication to bypass the TCP/IP protocol stack. P-Socket is divided into a kernel part and a user part. The former removes the TCP/IP protocol stack from the message-passing path. The latter consists of a user-level library wrapper. The kernel part is implemented as a dynamic module that initializes the NTB and DMA engine when loaded. When unloaded, it releases these resources.

Figure 2: Software optimization. (a) TCP/IP bypass; (b) kernel bypass.

The wrapper is a dynamically linked user-level library that works seamlessly with glibc. Via the preload mechanism, LD_PRELOAD [25], the wrapper can be invoked before other shared libraries (including glibc). When the application calls a standard socket function, the wrapper intercepts the call and decides whether to handle the call itself, pass it to glibc, or both. The wrapper sends requests to the kernel part via the ioctl function. Note that applications compiled with statically linked libraries must be recompiled.

Socket-based applications commonly work in client/server mode. The server calls the accept() function to wait for a connection, and the client calls the connect() function to post a connection request. The wrapper intercepts both calls and builds a PCIe connection. Once this connection is built, the kernel part creates a send queue on one side and a receive queue on the other. The wrapper intercepts subsequent sends and receives and performs them over the PCIe link. When the connection is terminated, the kernel part frees both queues.
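To make the interception mechanism concrete, the following user-space C sketch shows how such a wrapper can interpose connect() and send() via LD_PRELOAD and dlsym(RTLD_NEXT), handling PCIe-eligible connections itself and otherwise falling back to glibc. This is our own illustration rather than code from the P-Socket sources; psock_try_connect(), psock_owns_fd(), and psock_send() are hypothetical stand-ins for the P-Socket user library.

/* pswrap.c: build with  gcc -shared -fPIC -o libpswrap.so pswrap.c -ldl
 * and run an unmodified socket application as
 *   LD_PRELOAD=./libpswrap.so ./app                                        */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical P-Socket user-library entry points (not the real API). */
extern int     psock_try_connect(int fd, const struct sockaddr *sa, socklen_t len);
extern int     psock_owns_fd(int fd);
extern ssize_t psock_send(int fd, const void *buf, size_t len, int flags);

typedef int     (*connect_fn)(int, const struct sockaddr *, socklen_t);
typedef ssize_t (*send_fn)(int, const void *, size_t, int);

int connect(int fd, const struct sockaddr *sa, socklen_t len)
{
    static connect_fn real_connect;
    if (!real_connect)                       /* resolve the real glibc symbol once */
        real_connect = (connect_fn)dlsym(RTLD_NEXT, "connect");

    if (psock_try_connect(fd, sa, len) == 0) /* peer reachable over the PCIe fabric */
        return 0;
    return real_connect(fd, sa, len);        /* otherwise fall back to TCP/IP */
}

ssize_t send(int fd, const void *buf, size_t len, int flags)
{
    static send_fn real_send;
    if (!real_send)
        real_send = (send_fn)dlsym(RTLD_NEXT, "send");

    if (psock_owns_fd(fd))                   /* connection was built over PCIe */
        return psock_send(fd, buf, len, flags);
    return real_send(fd, buf, len, flags);   /* ordinary Ethernet socket */
}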
4.2 Kernel Bypass
Even though bypassing the TCP/IP protocol stack improves performance, at least two copies and two context switches are still required for each data transfer. To avoid this overhead, P-Socket allows user-mode transfers, as shown in Figure 2(b). The main data structures, including the send and receive queues, are implemented in user space instead of kernel space, which highlights two issues.

1. When the P-Socket data structures are maintained inside the kernel, a P-Socket can be shared/transferred among processes running on the same OS through an integer (P-Socket ID). To keep the same flexibility with kernel bypass, the P-Socket data structures should be allocated in shared memory.

2. Since the DMA engine only recognizes physical addresses, P-Socket must ensure that the kernel pins data in memory for the duration of the transfer.

In order to keep data coherent, the receive buffer for each connection must be uncacheable. This buffer is allocated through ioremap() when the P-Socket driver is invoked, and it is exposed to user space through mmap(). We allocate a DMA-coherent buffer via dma_alloc_coherent(). Due to the translation window limits of the NTBs (discussed in Section 7), we allocate 8MB of contiguous physical space to hold all receive buffers. The current version supports 16 independent channels. Each channel's receive buffer is divided into segments for flags, control packets, small data packets, and large data packets. Each data packet consists of a sequence number, a source P-Socket ID, a target P-Socket ID, the packet length, and the data payload.
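The buffer and packet layout just described might be declared as in the following C sketch. The header fields come from the text above; the field widths, the per-segment sizes, and all identifiers are illustrative assumptions rather than the actual P-Socket definitions (the paper only fixes the 8MB total, the 16 channels, and the header fields).

#include <stdint.h>

/* One data packet as described in Section 4.2: a sequence number, source
 * and target P-Socket IDs, the payload length, and the payload itself.
 * Field widths are assumptions made for illustration. */
struct psock_pkt_hdr {
    uint32_t seq;      /* sequence number         */
    uint16_t src_id;   /* source P-Socket ID      */
    uint16_t dst_id;   /* target P-Socket ID      */
    uint32_t len;      /* payload length in bytes */
    /* payload bytes follow immediately after the header */
};

/* Per-channel receive buffer carved out of the single 8MB DMA-coherent
 * region (16 channels, so 512KB per channel here). The split between the
 * flag, control, small-packet, and large-packet segments is assumed. */
#define PSOCK_CHANNELS      16
#define PSOCK_CHANNEL_BYTES (8u * 1024 * 1024 / PSOCK_CHANNELS)

struct psock_channel {
    volatile uint8_t flags[64];           /* full/empty tags (see Section 5.2) */
    uint8_t ctrl[4 * 1024];               /* control packets                   */
    uint8_t small[60 * 1024];             /* small data packets (stores)       */
    uint8_t large[PSOCK_CHANNEL_BYTES - 64 - 4 * 1024 - 60 * 1024]; /* DMA     */
};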
5. PCIE HARDWARE FEATURES
We investigate the advanced features of the PCIe fabric to leverage them in the design of P-Socket.

5.1 Small Packet Optimization
PCIe allows memory devices attached to a PCIe interface to be accessed via normal load/store instructions. We use this feature to allow a node to access another node's memory as if it were local (PCIe-attached) memory. Remote memory can thus be mapped to a host's local address space so that packets need not be moved between nodes by DMA transfers. Instead, the receiver can use load instructions to pull the packet from the sender memory, or the sender can use store instructions to push the packet into the receiver memory. Using store instructions is particularly expedient for small packets because it eliminates the initialization and interrupt handling overhead of DMA requests. However, when transmitting large packets, the DMA engine incurs less overhead than the CPU, and it allows more outstanding requests. As control messages are usually small and latency-sensitive, we use store instructions to transfer them.

In our prototype system, a load/store instruction carries up to 64B (i.e., a cache line), and the maximum payload of one PCIe transmit unit is 2048B. Defining the boundary between using load/store instructions and using DMA requests is a question worth considering: setting it too low limits the advantages of store instructions, while setting it too high leads to high CPU utilization and poor performance. We choose a boundary of 256B based on the experimental results shown in Figure 4 in Section 6.1.
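The resulting send-path policy amounts to a simple size check, sketched below. The 256B cutoff is the boundary chosen above; psock_store_copy() and psock_dma_submit() are hypothetical names standing in for the store-based and DMA-based transmission paths.

#include <stddef.h>
#include <stdint.h>

#define PSOCK_STORE_THRESHOLD 256  /* bytes; boundary chosen from Figure 4 */

/* Hypothetical back-ends: push the bytes into the mapped remote buffer with
 * ordinary store instructions, or hand the transfer to the NTB DMA engine. */
void psock_store_copy(volatile void *remote_dst, const void *src, size_t len);
int  psock_dma_submit(uint64_t remote_phys, const void *src, size_t len);

int psock_transmit(volatile void *remote_dst, uint64_t remote_phys,
                   const void *src, size_t len)
{
    if (len <= PSOCK_STORE_THRESHOLD) {
        /* Small, latency-sensitive packets: avoid DMA setup and interrupts. */
        psock_store_copy(remote_dst, src, len);
        return 0;
    }
    /* Large packets: let the DMA engine move the data and free the CPU. */
    return psock_dma_submit(remote_phys, src, len);
}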
5.2 Flow-Control Optimization
Turning data communications into PCIe load/store accesses or DMA requests requires new flow-control strategies. To prevent undetectable overflows, P-Socket uses full/empty flags to track the states of the send/receive buffers for two flow control mechanisms, write-side flow control (WFC) and read-side flow control (RFC). These two mechanisms differ with respect to where the tags reside. RFC locates tags in the receiver memory, whereas WFC locates them in the sender. In RFC mode, the sender reads the tags in the remote memory and transmits data when the receive buffer is empty. After the receiver consumes the data, it resets the appropriate tags in local memory. In WFC mode, the sender accesses local tags before transmission, and the receiver clears the remote tags after consuming the data.

The flow-control strategy invoked depends on whether the application issues a PCIe write or a PCIe read. It turns out that PCIe write performs better than the corresponding PCIe read because write requests use posted transactions, but read requests do not. In a posted transaction, the sender releases the associated resources immediately after the transaction is sent. In a non-posted transaction, the sender does not release the occupied resources until it receives an acknowledgement from the final destination. Furthermore, since the sender in WFC need only access local memory for the tags before transmission, it has lower latency. Therefore, P-Socket adopts WFC as its default.
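The WFC handshake can be sketched as follows: the sender checks a full/empty tag held in its own memory, pushes the packet into the mapped remote receive slot with store instructions (posted PCIe writes), and the receiver later clears that tag across the NTB once it has consumed the packet. This is a minimal model under our own naming; the tag encoding and the arrival-notification mechanism (e.g., a valid word or doorbell) are assumptions.

#include <stddef.h>
#include <stdint.h>

#define SLOT_EMPTY 0u
#define SLOT_FULL  1u

/* Sender side (WFC): the full/empty tag for each receive slot lives in the
 * sender's local memory, so the check before transmission never crosses the
 * PCIe link. remote_slot is the receive buffer mapped through the NTB. */
int wfc_send(volatile uint32_t *local_tag, volatile uint8_t *remote_slot,
             const void *pkt, size_t len)
{
    if (*local_tag != SLOT_EMPTY)      /* receiver has not drained the slot */
        return -1;
    *local_tag = SLOT_FULL;

    const uint8_t *src = pkt;
    for (size_t i = 0; i < len; i++)   /* posted PCIe writes into remote memory */
        remote_slot[i] = src[i];
    return 0;
}

/* Receiver side: after consuming the packet, clear the sender's tag by
 * storing across the NTB into the sender's memory (sender_tag_mapped is the
 * same tag word, seen through the receiver's translation window). */
void wfc_complete(volatile uint32_t *sender_tag_mapped)
{
    *sender_tag_mapped = SLOT_EMPTY;
}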
5.3 DMA Optimization
In order to increase bandwidth, P-Socket uses a DMA engine inside the NTB to transfer data between local and remote memory. PCIe systems support two DMA modes: block DMA and burst DMA. Both modes require an initialization stage to set up one or more DMA descriptors (one per DMA transfer task), and both rely on interrupts to signal completion. In block mode, the driver directly posts a descriptor to the DMA engine, which can only process one transmission task at a time (i.e., only one descriptor is initialized for each transfer). In burst DMA mode, the driver first prepares one or more descriptors in memory, and then the DMA engine prefetches descriptors and launches transmission tasks in batches (i.e., multiple DMA descriptors can be initialized and processed simultaneously). Additionally, block DMA mode raises an interrupt signal at the end of each DMA task, while burst DMA mode requires only one interrupt per batch of transfers.

Burst DMA takes longer to trigger an operation, as it has to store the DMA descriptor in memory first and then invoke the DMA controller to process it. In contrast, block DMA configures the descriptor directly in the DMA controller and starts the procedure. Nevertheless, one transmission is usually composed of several DMA operations due to non-contiguous physical addresses, and multiple threads may invoke bursts of DMA transfers within a short time. Under these conditions, burst DMA mode incurs less overhead and achieves better bandwidth than block DMA mode.

To take advantage of burst DMA, the send queue is split into an active and an inactive queue. Only descriptors stored in the active queue are processed. Descriptors that arrive while the DMA engine is busy are stored in the inactive queue. When the active queue empties, the queue roles reverse. Pipelining transactions this way allows consecutive batches to be processed continuously without interruption.
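A sketch of this double-queue arrangement is shown below, with the DMA-engine interaction reduced to a single hypothetical dma_engine_kick() call and locking omitted for brevity; names and queue depth are our own assumptions, not the P-Socket implementation.

#include <stddef.h>

#define QDEPTH 64

struct dma_desc  { unsigned long src, dst; size_t len; };
struct desc_queue { struct dma_desc d[QDEPTH]; size_t n; };

static struct desc_queue qa, qb;
static struct desc_queue *active = &qa, *inactive = &qb;
static int engine_busy;

/* Hypothetical: hand a batch of descriptors to the burst-DMA engine; the
 * engine raises one interrupt when the whole batch has completed. */
void dma_engine_kick(struct dma_desc *batch, size_t n);

void submit_desc(struct dma_desc d)
{
    if (!engine_busy) {
        /* Engine idle: start a new batch immediately from the active queue. */
        active->d[active->n++] = d;
        engine_busy = 1;
        dma_engine_kick(active->d, active->n);
    } else if (inactive->n < QDEPTH) {
        /* Engine busy: park the descriptor in the inactive queue. */
        inactive->d[inactive->n++] = d;
    }
}

/* Called from the batch-completion interrupt handler. */
void dma_batch_done(void)
{
    struct desc_queue *t;
    active->n = 0;                                 /* completed batch drained  */
    t = active; active = inactive; inactive = t;   /* queue roles reverse      */
    if (active->n)
        dma_engine_kick(active->d, active->n);     /* next batch, back to back */
    else
        engine_busy = 0;
}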

6. EVALUATION
Figure 3(a) presents our five-node hardware prototype, and Figure 3(b) shows backplane detail. The backplane has one PCIe switch chip with Transparent Bridge (TB) functionality (PLX PEX 8648 [23]) and four switch chips with Non-Transparent Bridge (NTB) functionality (PLX PEX 8619 [24]). To make the TB easily configurable, we add a node to the shared address space; this becomes the root node in our prototype system. It participates as a compute node in addition to managing TB configuration. Data nodes communicate through the backplane via a PCIe adapter card. The TB forwards transactions between the root node (node 0) and the leaf nodes (nodes 1-4). The NTBs all connect to the PCIe switch chip, and each attaches to a compute node. They electrically isolate the PCIe buses and shield the attached nodes by masquerading as endpoints to discovery software and translating the addresses of transactions that cross the bridge (mapping transactions in one PCIe hierarchy to corresponding transactions in another).

Figure 3: A PCIe switch based system. (a) Prototype system; (b) backplane.

Since our study focuses on sustainable communication performance (and not scalability), we use just two of the five data nodes in our evaluation environment. These nodes are configured as shown in Table 1. In order to study the influence of different configurations on performance, we combine hardware and software optimizations to create the seven different P-Socket implementations listed in Table 2.

Table 1: Data node configurations
  CPU:          Intel Core i7 processor, 3.4GHz, 4 cores, 8 threads, 8MB LLC
  Memory:       8GB DDR3
  Disk:         SATA 7200RPM 1TB
  Interconnect: PCIe Gen 2, x8 links with 5GB/s per link
  OS:           Red Hat Enterprise Linux 6.1

Table 2: Different implementation versions
  V1: RFC + BLK
  V2: SB + RFC + BLK
  V3: SB + SPO + RFC + BLK
  V4: SB + WFC + BLK
  V5: SB + RFC + BDM
  V6: SB + SPO + WFC + BDM
  V7: SB + KB + SPO + WFC + BDM
  SB (TCP/IP Stack Bypass), KB (Kernel Bypass), SPO (Small Packet Optimization), RFC (Read-side Flow Control), WFC (Write-side Flow Control), BLK (Block DMA Mode), BDM (Burst DMA Mode)

6.1 Raw Bandwidth
To first gain a basic understanding of our PCIe system's bandwidth performance, we transfer a total of 1GB of data between two nodes via different methods: block DMA, burst DMA, store instructions, and load instructions. Data are divided into blocks that vary from 64B to 1MB. Bandwidth results are shown in Figure 4. Note that these raw bandwidth tests do not involve P-Socket: they demonstrate the basic bandwidth performance of our PCIe system. We observe three characteristics.

• Burst DMA has higher bandwidth than block DMA when transferring small data packets, since it allows better pipelining with less interrupted data movement. As data block size increases, this discrepancy gradually disappears. The peak bandwidth is 2.98GB/s, which is close to the 3.05GB/s theoretical limit [24].

• Using store instructions to transmit small packets gives slightly better performance than DMA. In particular, it performs the best when packets are 256B or smaller.

• Store instructions work better than load instructions: loads perform (surprisingly) poorly, regardless of data block size. The peak bandwidth is about 26.67MB/s.

Figure 4: Results of raw bandwidth tests (bandwidth in GB/s vs. data size from 64B to 1MB for burst DMA, block DMA, stores, and loads).

To understand the low load performance, we conduct two more tests: in the first, one node uses loads to access a contiguous region of a second node's address space, and in the second, the first node does the same via DMA reads. We used a Tek PCIe Logic analyzer [31] to collect Transaction Layer Packet (TLP) traces of the PCIe transaction headers and payloads. For a 10µs randomly sampled observation window during the stable state, we measured 280 TLPs generated by DMA reads, but only seven TLPs generated by loads, as shown in Figure 5. This suggests that TLP read requests are issued serially, which severely impacts performance. We suspect the PCIe Root Complex inside the processor chip to be the culprit, as it allows only one outstanding PCIe load request outside the processor.

Figure 5: Number of TLPs traced with our Tek PCIe logic analyzer. (a) Transaction Layer Packets from loads; (b) Transaction Layer Packets from DMAs.

6.2 P-Socket Bandwidth
We use Netperf [17] to measure the TCP bandwidth over P-Socket. Netperf creates two user-level processes, a server and a client. They run on different data nodes and communicate via the standard socket interface. We vary message size from 1B to 4MB and test the bandwidth of P-Socket with our different implementations. Figure 6 shows results. We make five observations:

(1) The peak bandwidth of the baseline implementation V1 (RFC+BLK) is about 200MB/s, and that of V2 (SB+RFC+BLK) is about 1.7GB/s. This shows that TCP/IP stack bypass can improve bandwidth by almost a factor of eight.

(2) There is almost no difference in the bandwidth results for implementations V2-V4. The small packet optimization in V3 (SB+SPO+RFC+BLK) has no impact on peak bandwidth, since only large packets are bandwidth-hungry. Even when switching to write-side flow control, V4 (SB+WFC+BLK) remains limited by the serial transmission processing of block DMAs. However, in burst DMA mode, the performance gap between WFC and RFC is obvious. V6 (SB+SPO+WFC+BDM) reaches peak bandwidth earlier than V5 (SB+RFC+BDM), which means WFC performs better than RFC in transmitting small packets via burst DMAs.

(3) The peak bandwidths of V5 and V6 are nearly 1.75 times that of V2 (SB+RFC+BLK). Specifically, V6 (SB+SPO+WFC+BDM) achieves the best peak bandwidth, i.e., 2.98GB/s, which matches our raw bandwidth test results. This indicates that burst DMA increases effective bandwidth.

(4) V7 (SB+KB+SPO+WFC+BDM) gets a peak bandwidth of 2.87GB/s, which is a bit lower than that of V5 and V6. Since V7 moves transmission tasks from kernel space to user space, it pays the overhead for pinning data in physical memory when transferring data blocks with the DMA engine. In our experiments, we send data packets in different memory locations on subsequent transmissions. P-Socket therefore pins the data pages before each transmission, even though this may not be necessary. The OS can allocate the same set of buffers for different transmissions, and then we would need to pin the receive/send buffers only once. Future work will optimize V7 to avoid unnecessary data pinning.

(5) V7 exhibits higher bandwidth in transferring 256B messages versus 512B messages. The former are transmitted by store instructions, and the latter are transmitted by DMAs. We do not see this phenomenon in V3 or V6, even though they also use SPO, since those versions avoid overheads (like pinning data) when transferring large messages by DMA.

We conclude that combining stack bypassing, write-side flow control, and burst DMA effectively reduces the gap between the peak and the end-user's achieved bandwidths.

Figure 6: Bandwidth for varying message sizes.

6.3 Raw Latency
To measure uncontested raw latencies of the basic PCIe operations we conduct ping-pong tests that transmit 4B data chunks between nodes. We write non-pipelined code to measure the load request latency. The raw latencies for stores, loads, block DMAs, and burst DMAs are 1.125µs, 2.452µs, 3.735µs, and 6.572µs, respectively. Block DMAs enjoy lower latency compared to burst DMAs due to their shorter initialization overhead. Store instructions perform much better than load instructions and either block or burst DMAs. Our best measured latency using stores is 1.125µs, which is close to the theoretical minimum value of 1µs.
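For reference, raw numbers of this kind can be gathered with a ping-pong harness of the following shape. This is our own user-space timing skeleton, not the paper's test code; pingpong_once() is a placeholder for one 4B round trip over the transport under test (stores, loads, or a DMA request plus completion wait).

#include <stdio.h>
#include <time.h>

#define ITERS 100000

/* Placeholder: one 4B round trip over the mechanism being measured. */
void pingpong_once(void);

int main(void)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        pingpong_once();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* One-way latency is half of the averaged round-trip time. */
    printf("one-way latency: %.3f us\n", ns / ITERS / 2.0 / 1000.0);
    return 0;
}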

1.0 Bandwidth (GB/s) 0.0 2 4 6 8 Bandwidth (GB/s) 0.5 Threads 0.0

1 2 4 8 16 32 64 1K 2K 4K 8K 128 256 512 16K 32K 64K128K256K512K1M 2M 4M (a) BW when varying numbers of threads Message Size (Bytes) 3.0 P-Socket 2.5 10GigE Figure 6: Bandwidth for varying message sizes 2.0 1.5 1.0 V1 (RFC+BLK) 50 V2 (SB+RFC+BLK) 0.5 V3 (SB+SPO+RFC+BLK) V4 (SB+WFC+BLK) Bandwidth (GB/s) 0.0 V5 (SB+RFC+BDM) 1 2 4 8 16 32 64 1K 2K 4K 8K 40 V6 (SB+SPO+WFC+BDM) 128 256 512 16K 32K 64K128K256K512K 1M V7 (SB+KB+SPO+WFC+BDM) Message Size (Bytes) 30 (b) BW when varying numbers of messages 20 (threads=4)

Latency (usec) 10 Figure 8: Multiple stream tests 0

1 2 4 8 16 32 64 128 256 512 1K 2K 4K Message Size (Bytes) these roles reverse when it comes to bandwidth. P-Socket adopts V7: it gets slightly lower bandwidth than V5 and Figure 7: Latency for varying message sizes (see Table 2 for V6, but the differences are small, whereas the latency gaps implementation descriptions) are huge. V7 incurs context switch overhead when pinning buffers in memory. If we instead pinned a single buffer and copied the payload into it before each transmission, V7’s shorter initialization overhead. Store instructions perform bandwidth would likely rival that of V6. much better than load instructions and either block or burst DMAs. Our best measured latency using stores is 1.125µs, 6.5 Multiple Stream Tests which is close to the theoretical minimum value of 1µs. To test multi-stream performance, we create an equal num- ber of threads on each node. Each on the first node is 6.4 P-Socket Latency connected to one thread on the second. Figure 8a shows the We again use Netperf to measure P-Socket’s latency, re- aggregate bandwidth of all connections with varying num- sults for which are presented in Figure 7. Our baseline sys- bers of threads for a message size of 1KB. We measure the tem V1 (RFC+BLK) exhibits the highest latency among all bandwidth of all connections over one minute. We also com- versions. In this implementation, messages are copied to the pare the same benchmark on the nodes connected by a 10 kernel where they undergo complicated processing by the Gigabit Network adapter (Intel R 82599EB [8]), which has TCP/IP protocol stack. Implementation versions V2-V6 re- the same interface as our hardware prototype system, i.e., duce latency by bypassing the protocol stack. In particular, PCIe Gen 2. When running only one or two threads, P- V6 (SB+SPO+WFC+BDM) sees almost 48% reduction in la- Socket and 10 GigE perform similarly. However, when run- tency compared to the baseline. This demonstrates the high ning more than three threads P-Socket delivers about twice overhead of the TCP/IP protocol stack and the efficiency of the bandwidth (2.15GB/s) of 10 GigE. using store instructions for small packets. Figure 8b shows bandwidth for different message sizes In comparing V2-V6, we find that SPO and WFC de- when running four threads. P-Socket has lower effective crease latency by about 3.1µs and 0.6µs, respectively. When bandwidth than 10 GigE for messages less than 256B. Recall using both, latency decreases by about 5µs. Nevertheless, that control messages are usually small and latency sensitive. there is still much room for improvement. Note that V6 To achieve lower latency, P-Socket transfers these small mes- (SB+SPO+WFC+BDM) handles transmission in the kernel, sages via store instructions instead of DMAs. These store although it removes the TCP/IP stack from the message- instructions are uncacheable because their destinations are passing path. To decrease the number of copies associated in IO space. Using stores may sacrifice bandwidth perfor- with a message transfer and to remove the kernel from the mance, as it prevents P-Socket from writing data to cache critical message passing path, V7 (SB+KB+SPO+WFC+BDM) or to local memory (where it might be consumed more effi- allows direct user-mode transfers. In addition, this version ciently). 
Huge messages are always bandwidth sensitive, and uses store instructions and WFC to transfer small packets, transferring them via store instructions incurs large CPU resulting in an end-to-end latency of 1.2µs (which is close to overheads. Messages over 256B are thus sent by DMA, the 1.125µs result from the raw latency test). which improves bandwidth and reduces CPU utilization. With respect to latency, software optimizations play a For very large packets, P-Socket reaches a peak bandwidth more important role than hardware optimizations. However, of 2.90GB/s, which is about three times that of 10 GigE. load/store NTB Inside

Figure 9: Remote access and address translation.

7. LIMITATIONS
Figure 9 shows the mechanism for remote access and address translation. Load/store instructions are issued by the core and forwarded by the Root Complex and PCIe Switch. The OS initializes forwarding tables within these components at boot time, and the kernel can change them at run time. Typically, each NTB is allocated one contiguous physical address space, as there are only a few forwarding table entries. Address translation happens in the Non-Transparent Bridge (NTB) that separates two different address spaces. The incoming access address must be located in the translation window in order for it to be mapped to the address space on the other side of the NTB. In our hardware prototype system, there are only four 32-bit translation windows or two 64-bit translation windows for each NTB. This means that user applications must share these resources in a multiprocessing environment. While one transfer is in progress, transfers for other applications are blocked. Large transfers may thus hurt other applications' performance.
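In software terms, the forwarding that Figure 9 depicts reduces to a base-and-window remapping. The following minimal C model uses our own notation rather than the NTB's register interface; it only illustrates the window check and offset translation described above.

#include <stdint.h>

/* One NTB translation window: incoming addresses in
 * [in_base, in_base + size) are forwarded as out_base + offset into the
 * address space on the far side of the bridge. Each NTB in our prototype
 * has only a handful of such windows. */
struct ntb_window {
    uint64_t in_base;   /* window base in the local address space */
    uint64_t out_base;  /* translated base in the remote space    */
    uint64_t size;      /* window length                          */
};

/* Returns the translated address, or 0 if the access misses the window
 * (in which case the NTB will not forward it). */
static uint64_t ntb_translate(const struct ntb_window *w, uint64_t addr_in)
{
    if (addr_in < w->in_base || addr_in - w->in_base >= w->size)
        return 0;
    return w->out_base + (addr_in - w->in_base);
}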
We could avoid this by segmenting large messages into smaller ones, but this is difficult to synchronize and manage from user mode. Note that the address translation registers in the NTB must be changed whenever an application's source or destination buffer addresses change, but allowing applications to change address mappings in user mode without hardware protection poses a security risk. Given these problems, P-Socket does not use zero-copy in user mode. Note that these problems do not occur in InfiniBand with its Queue Pair (QP) mechanism, which can support up to 2^24 isolated QP connections with protection.

Kernel bypassing is another technique used to improve communication library performance. The main challenges in implementing kernel bypassing are how to maximize the use of limited hardware resources and how to use bypass safely in user mode. For instance, our hardware prototype system has limited DMA resources that must be shared among applications. How to share these effectively in user mode remains a challenge: it would require moving the control process from kernel mode to user mode. Having DMA support for each user process would save time and effort for user-level library developers. We expect that next-generation PCIe fabrics will offer better virtualization mechanisms.

8. CONCLUSIONS
We have described P-Socket, a high-efficiency communication library for PCIe-switch based servers. We implemented a prototype server for which we designed and produced a custom backplane board, and we developed the corresponding software stack, including the VNIC driver and socket-compatible library. We then systematically studied the bandwidth and latency characteristics of a baseline system running IP over PCIe (IPoPCIe), leveraging our findings to optimize a communications library that:

• balances kernel and user responsibilities,

• streamlines software processing of the TCP/IP protocol stack, and

• exploits PCIe hardware features.

Our library enables traditional socket applications to run without modification. Furthermore, P-Socket delivers most of the physical network's potential performance to the end user, achieving a minimum end-to-end application latency of 1.2µs and effective bandwidths of up to 2870MB/s.

We are continuing to improve P-Socket, e.g., by developing support for UDP and improving zero-copy. Based on the results we present here, we encourage the community to push for further enhancements of the PCIe interconnect technology. In the meantime, we are continuing to investigate and develop a new hardware interconnect protocol that has a thinner layer than PCIe and that will offer better virtualization and protection mechanisms.

9. ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation of China under grant numbers 61402439, 61402438, and 61522212. The Chinese Academy of Sciences President's International Fellowship Initiative grant number 2015VTB053 supported S.A. McKee's participation.

10. REFERENCES
[1] P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu, and D. K. Panda. Sockets Direct Protocol over InfiniBand in clusters: Is it beneficial? In Proc. IEEE International Symposium on Performance Analysis of Systems and Software, pages 28–35, Mar. 2004.
[2] P. Balaji, H. V. Shah, and D. K. Panda. Sockets vs RDMA interface over 10-Gigabit networks: An in-depth analysis of the memory traffic bottleneck. In Proc. Workshop on Remote Direct Memory Access (RDMA): Applications, Implementations, and Technologies (RAIT), Sept. 2004.
[3] P. Balaji, P. Shivam, P. Wyckoff, and D. Panda. High performance user level sockets over Gigabit Ethernet. In Proc. IEEE International Conference on Cluster Computing, pages 179–186, Sept. 2002.
[4] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seizovic, and W.-K. Su. Myrinet: A Gigabit-per-Second local area network. IEEE Micro, 15(1):29–36, Dec. 1995.
[5] P. Buonadonna, A. Geweke, and D. Culler. An implementation and analysis of the Virtual Interface Architecture. In Proc. ACM/IEEE Conference on Supercomputing, pages 1–15, Nov. 1998.
[6] L. Chisvin. PCIe ready for datacenter role. http://www.eetimes.com/AUTHOR.ASP?SECTION_ID=36&DOC_ID=1319539, Sept. 2013.
[7] J. Chu and V. Kashyap. Transmission of IP over InfiniBand (IPoIB). http://www.hjp.at/doc/rfc/rfc4391.html, Apr. 2006.
[8] Intel Corp. Intel 82599 10 Gigabit Ethernet Controller: Product brief. http://www.intel.com/content/www/us/en/ethernet-controllers/82599-10-gbe-controller-brief.html, Aug. 2009.
[9] Dolphin Corp. SuperSockets for Linux: Overview. http://www.dolphinics.com/download/WHITEPAPERS/Dolphin_Express_IX_SuperSockets_for_Linux.pdf, Aug. 2013.
[10] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient support for reliable, connection-oriented communication. In Proc. IEEE Hot Interconnects V, Aug. 1997.
[11] W. Feng, P. Balaji, C. Baron, L. N. Bhuyan, and D. K. Panda. Performance characterization of a 10-Gigabit Ethernet TOE. In Proc. High Performance Interconnects, pages 58–63, Aug. 2005.
[12] P. Grun. Introduction to InfiniBand™ for end users. https://cw.infinibandta.org/document/dl/7268, Apr. 2010.
[13] S. Hefty. RSOCKETS: RDMA for dummies. In Proc. Open Fabrics Developer Workshop, Apr. 2013.
[14] Y. Hoskote, B. A. Bloechel, G. E. Dermer, V. Erraguntla, D. Finan, J. Howard, D. Klowden, S. G. Naendra, G. Ruhl, J. W. Tschanz, S. Vangal, V. Veeramachaneni, H. Wilson, J. Wu, and N. Borkar. A TCP offload accelerator for 10 Gb/s Ethernet in 90-nm CMOS. IEEE Journal of Solid-State Circuits, 38(11):1866–1875, Feb. 2003.
[15] R. Hou, T. Jiang, L. Zhang, P. Qi, J. Dong, H. Wang, X. Gu, and S. Zhang. Cost effective data center servers. In Proc. IEEE International Symposium on High Performance Computer Architecture, pages 179–187, Feb. 2013.
[16] J. Hurwitz and W. Feng. End-to-end performance of 10-Gigabit Ethernet on commodity systems. IEEE Micro, 24(1):10–12, Jan.-Feb. 2004.
[17] R. Jones. Care and feeding of Netperf 2.6.X. http://www.netperf.org/svn/netperf2/tags/netperf-2.6.0/doc/netperf.html, 2012.
[18] V. Krishnan. Towards an integrated IO and clustering solution using PCI Express. In Proc. IEEE International Conference on Cluster Computing, pages 259–266, Sept. 2007.
[19] M. Lauria, S. Pakin, and A. A. Chien. Efficient layering for high speed communication: Fast Messages 2.x. In Proc. IEEE High Performance Parallel and Distributed Computing, pages 10–20, July 1998.
[20] R. Merritt. RapidIO nudges ARM into servers. http://www.eetimes.com/document.asp?doc_id=1318957, July 2013.
[21] S. Pakin, M. Lauria, and A. Chien. High performance messaging on workstations: Illinois Fast Messages (FM) for Myrinet. In Proc. ACM/IEEE Conference on High Performance Networking and Computing (Supercomputing), page 55, Dec. 1995.
[22] F. Petrini, W. Feng, A. Hoisie, S. Coll, and E. Frachtenberg. The Quadrics network: High-performance clustering technology. IEEE Micro, 22(1):46–57, Nov. 2002.
[23] PLX Technology, Inc. ExpressLane PEX 8648-AA, AB, and BB 48-lane/12-port PCI Express Gen 2 switch data book. PEX8648-SIL-PB-1.0, http://www.plxtech.com/products/expresslane/pex8648, Apr. 2009.
[24] PLX Technology, Inc. ExpressLane PEX 8619-BA 16-lane, 16-port PCI Express Gen 2 switch with DMA data book. PEX8619-SIL-PB-1.4, http://www.plxtech.com/products/expresslane/pex8619, Apr. 2010.
[25] K. Pulo. Fun with LD_PRELOAD. https://nf.nci.org.au/training/talks/lca2009.pdf, Jan. 2009.
[26] A. Rao. AMD | SeaMicro technology overview. http://www.seamicro.com/sites/default/files/SM_TO01_64_v2.7.pdf, Oct. 2012.
[27] R. Recio. The coming decade of data center networking discontinuities. In Proc. IEEE International Conference on Computing, Networking and Communications (keynote), Feb. 2012.
[28] J. Regula. Integrating rack level connectivity into a PCI Express switch. In Proc. Hot Chips: A Symposium on High Performance Chips, pages 259–266, Aug. 2013.
[29] S. H. Rodrigues, T. E. Anderson, and D. E. Culler. High-performance local area communication with Fast Sockets. In Proc. USENIX Technical Conference, pages 257–274, Jan. 1997.
[30] P. Shivam, P. Wyckoff, and D. K. Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing. In Proc. IEEE International Conference on Supercomputing, page 49, Nov. 2001.
[31] Tektronix. Tektronix PCI Express logic protocol analyzer. http://www.tek.com/datasheet/tla7sa00-series, 2013.
[32] H. Tezuka, A. Hori, and Y. Ishikawa. PM: A high-performance communication library for multi-user parallel environments. Technical Report TR-96-015, Tsukuba Research Center, 1996.
[33] TOP500 Supercomputer Site. Interconnect Family/Myrinet. http://www.top500.org/statistics/details/connfam/2, 2013.
[34] T. von Eicken, V. Avula, A. Basu, and V. Buch. Low-latency communication over ATM networks using Active Messages. IEEE Micro, 15(1):46–53, Dec. 1995.
[35] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A user-level network interface for parallel and distributed computing. In Proc. ACM Symposium on Operating Systems Principles, pages 40–53, Dec. 1995.