Low Latency Stream Processing: Twitter Heron with Infiniband and Omni-Path

Supun Kamburugamuve (1), Karthik Ramasamy (2), Martin Swany (1) and Geoffrey Fox (1)
(1) School of Informatics and Computing, Indiana University Bloomington
(2) Twitter Inc.
Submission Type: Research

Abstract

Worldwide data production is increasing both in volume and velocity, and with this acceleration, data needs to be processed in streaming settings as opposed to the traditional store and process model. Distributed streaming frameworks are designed to process such data in real time with reasonable time constraints. Twitter Heron is a production-ready, large scale distributed stream processing framework developed at Twitter. In order to scale streaming applications to large numbers of nodes, the network is of utmost importance. High performance computing (HPC) clusters feature interconnects that can perform at higher levels than traditional Ethernet. In this work the authors present their findings on integrating the Twitter Heron distributed stream processing system with two high performance interconnects: Infiniband and Intel Omni-Path.

1 Introduction

With ever increasing data production by users and machines alike, the amount of data that needs to be processed has increased dramatically. This must be achieved both in real time and as batches to satisfy different use cases. Additionally, with the adoption of devices into Internet of Things setups, the amount of real time data is exploding and must be processed with reasonable time constraints. In real time distributed stream analytics, the large data streams are partitioned and processed in distributed sets of machines to keep up with the high volume data rates. By definition of large-scale streaming data processing, networks are a crucial component in transmitting messages between the processing units for achieving efficient data processing.

High performance computing (HPC) clusters are designed to perform large computations with advanced processors, memory, IO systems and high performance interconnects. For efficient parallel computations at very large scale, the network between the nodes is vital in order to scale applications to thousands of nodes. These high performance interconnects feature microsecond latencies and large bandwidths. Thanks to recent advancements in hardware, some of these high performance networks have become cheaper to set up than their Ethernet counterparts. With multi-core and many-core systems having large numbers of CPUs in a single node, the demand for high performance networking is increasing exponentially as well.

There are many distributed streaming frameworks available today for processing large amounts of streaming data in real time. Such systems are largely designed and optimized for commodity hardware and clouds. Apache Storm [1] was one of the most popular early systems developed for processing streaming data. Twitter Heron [2] is a next-generation version of Storm with an improved architecture for large-scale data processing. It features a hybrid design with some of the performance-critical parts written in C++ and others written in Java. This architecture allows high performance enhancements to be integrated natively rather than going through wrappers such as JNI. This work takes advantage of these architectural enhancements to integrate high performance interconnects natively into Heron to accelerate its communications.

Advanced hardware features are seldom used in Big Data computing frameworks, mostly because they are accessible only to low level programming languages and most Big Data systems are written on the Java platform. Infiniband [3] is an open standard protocol for high performance interconnects that is widely used in today's high performance clusters. Omni-Path [4] is a proprietary interconnect developed by Intel and is available with the latest Knights Landing architecture-based many-core processors.

In this work the authors explore how to leverage these hardware features in the context of the distributed streaming framework Twitter Heron. In particular, the authors show how they leveraged Infiniband and Intel Omni-Path communications to improve the latency of the system. The main contribution of this work is to showcase the importance of using high performance interconnects for distributed stream processing. There are many differences in the hardware available for communications, with different bandwidths, latencies and processing models. Even Ethernet has hardware comparable to some of the high performance interconnects; it is not our goal to show that one particular technology is superior to others, as different environments may have alternate sets of these technologies, and with these implementations Heron can take advantage of such environments.

The remainder of the paper is organized as follows. Section 2 presents background information on Infiniband and Omni-Path. Section 3 describes the Heron architecture in detail and Section 4 the implementation details. Next, the experiments conducted are described in Section 5 and the results are presented and discussed in Section 6. Section 7 presents related work. The paper concludes with a look at future work.

2 Background

2.1 Infiniband

Infiniband is one of the most widely used high performance fabrics. It provides a variety of capabilities including message channel semantics, remote memory access and remote atomic memory operations, supporting both connection-oriented and connectionless endpoints. Infiniband is programmed using the Verbs API, which is available on all major platforms. The current hardware is capable of achieving up to 100Gbps speeds with microsecond latencies. Infiniband does not require OS kernel intervention to transfer packets from user space to the hardware. Unlike in TCP, its protocol aspects are handled by the hardware. These features mean less CPU time is required by the network compared to TCP for transferring the same amount of data. Because the OS kernel is bypassed by the communications, the memory used for transferring data has to be registered with the hardware.
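Because the kernel is bypassed, every buffer the network card reads or writes must first be registered through the Verbs API. The following C++ sketch illustrates only this registration step; it is a minimal illustration under simplified assumptions (first device selected, arbitrary buffer size, no error recovery) and is not the code used in Heron.

```cpp
// Minimal sketch: registering a buffer with an Infiniband HCA via the Verbs API.
// Illustrates the registration requirement described above; device selection
// and error handling are deliberately simplified.
#include <infiniband/verbs.h>
#include <cstdio>
#include <vector>

int main() {
  int num_devices = 0;
  ibv_device **devices = ibv_get_device_list(&num_devices);
  if (devices == nullptr || num_devices == 0) {
    fprintf(stderr, "no Infiniband devices found\n");
    return 1;
  }

  // Open the first device and allocate a protection domain.
  ibv_context *ctx = ibv_open_device(devices[0]);
  ibv_pd *pd = ibv_alloc_pd(ctx);

  // The buffer that will carry message data must be pinned and registered
  // so the HCA can DMA directly to/from it without kernel involvement.
  std::vector<char> buffer(64 * 1024);
  ibv_mr *mr = ibv_reg_mr(pd, buffer.data(), buffer.size(),
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_READ |
                          IBV_ACCESS_REMOTE_WRITE);
  if (mr == nullptr) {
    fprintf(stderr, "memory registration failed\n");
    return 1;
  }
  // The lkey is used in local work requests; the rkey is shared with a
  // remote peer for one-sided RDMA operations.
  printf("registered %zu bytes, lkey=%u rkey=%u\n",
         buffer.size(), mr->lkey, mr->rkey);

  ibv_dereg_mr(mr);
  ibv_dealloc_pd(pd);
  ibv_close_device(ctx);
  ibv_free_device_list(devices);
  return 0;
}
```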
2.2 Intel Omni-Path

Omni-Path is a high performance fabric developed by Intel. The Omni-Path fabric is relatively new compared to Infiniband and there are fundamental differences between the two. Omni-Path does not offload protocol handling to the network hardware, and it does not have connection-oriented channels as Infiniband does. Unlike Infiniband, the Omni-Path network chip can be built into the latest Intel Knights Landing processors. Omni-Path supports tagged messaging with a 96-bit tag in each message. A tag can carry any type of data, and this information can be used at the application level to distinguish between different messages. Omni-Path is designed and optimized for small, high frequency messaging.

2.3 Channel & Memory Semantics

High performance interconnects generally support two modes of operation, called channel and remote memory access. With channel semantics, queues are used for message communication. With memory semantics, a process can read from or write directly to the memory of a remote machine.

In channel mode, two queue pairs for transmission and receive operations are used. To transfer a message, a descriptor is posted to the transfer queue, which includes the address of the memory buffer to transfer. For receiving a message, a descriptor needs to be submitted along with a pre-allocated receive buffer. The user program queries the completion queue associated with a transmission or a receiving queue to determine the success or failure of a work request. Once a message arrives, the hardware puts the message into the posted receive buffer and the user program can detect this event through the completion queue. Note that this mode requires the receiving buffers to be pre-posted before the transmission can happen successfully.

The remote direct memory access mode is generally called RDMA. With RDMA, two processes preparing to communicate register memory and share the details with each other. Read and write operations are used instead of send and receive operations. These are one-sided and do not need any software intervention from the other side. If a process wishes to write to remote memory, it can post a write operation with the local addresses of the data. The completion of the write operation can be detected using the associated completion queue. The receiving side is not notified about the write operation and has to use out-of-band mechanisms to discover the write. The same is true for remote reads as well.
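The channel mode and the one-sided write described above can be sketched with the Verbs API as follows: a receive descriptor is pre-posted, a send descriptor is posted by the peer, and both sides poll their completion queues. This is a hedged, minimal illustration that assumes an already-connected queue pair `qp`, completion queue `cq` and registered memory region `mr` over `buf` (their setup is omitted); it is not the Heron implementation, and the remote address and rkey in the last function are assumed to have been exchanged out of band.

```cpp
// Sketch of channel semantics and a one-sided RDMA write with the Verbs API.
// Assumes an already-connected queue pair, completion queue and registered
// memory region (setup omitted); illustrative only.
#include <infiniband/verbs.h>
#include <cstddef>
#include <cstdint>

// Pre-post a receive buffer so an incoming message has somewhere to land.
bool post_receive(ibv_qp *qp, ibv_mr *mr, char *buf, size_t len) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(buf);
  sge.length = static_cast<uint32_t>(len);
  sge.lkey = mr->lkey;

  ibv_recv_wr wr{}, *bad = nullptr;
  wr.wr_id = 1;            // identifier returned in the completion entry
  wr.sg_list = &sge;
  wr.num_sge = 1;
  return ibv_post_recv(qp, &wr, &bad) == 0;
}

// Post a send descriptor pointing at a registered buffer.
bool post_send(ibv_qp *qp, ibv_mr *mr, char *buf, size_t len) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(buf);
  sge.length = static_cast<uint32_t>(len);
  sge.lkey = mr->lkey;

  ibv_send_wr wr{}, *bad = nullptr;
  wr.wr_id = 2;
  wr.opcode = IBV_WR_SEND;
  wr.send_flags = IBV_SEND_SIGNALED;  // generate a completion entry
  wr.sg_list = &sge;
  wr.num_sge = 1;
  return ibv_post_send(qp, &wr, &bad) == 0;
}

// Poll the completion queue to learn about finished work requests.
int poll_completions(ibv_cq *cq) {
  ibv_wc wc[16];
  return ibv_poll_cq(cq, 16, wc);  // returns the number of completions drained
}

// One-sided write: the remote virtual address and rkey must have been
// exchanged out of band; the target is not notified of the write.
bool post_rdma_write(ibv_qp *qp, ibv_mr *mr, char *buf, size_t len,
                     uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(buf);
  sge.length = static_cast<uint32_t>(len);
  sge.lkey = mr->lkey;

  ibv_send_wr wr{}, *bad = nullptr;
  wr.wr_id = 3;
  wr.opcode = IBV_WR_RDMA_WRITE;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;
  return ibv_post_send(qp, &wr, &bad) == 0;
}
```

The pre-posting requirement noted above is what forces messaging layers built on channel semantics to manage pools of registered receive buffers rather than allocating them on demand.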
2.4 Openfabrics API

Openfabrics (https://www.openfabrics.org/) provides a library called libfabric that hides the details of common high performance fabric APIs behind a uniform API. Because of the advantage of such an API, we chose to use libfabric as our programming library for implementing the high performance communications for Heron. Libfabric is a thin wrapper API and it supports different providers including Verbs, the Aries interconnect from Cray through GNI, Intel Omni-Path, and Sockets. The libfabric API closely resembles the structure of the Verbs API. For example, for channel semantics it provides send/receive operations with completion queues to get notifications about the operations; a short sketch of this usage appears at the end of this section.

2.5 TCP vs High Performance Interconnects

TCP is one of the most successful protocols in the brief history of computers. It provides a simple yet powerful API for transferring data reliably across the Internet using unreliable links and protocols underneath. One of the biggest advantages of TCP is its wide adoption and simplicity of use. Virtually every computer has access to a TCP network.

3 Twitter Heron

Heron is the successor to Apache Storm, with many enhancements to the underlying engine architecture. It retains the same Storm API, allowing applications written in Storm to be deployed with no or minimal code changes.

3.1 Heron Data Model

A stream is an unbounded sequence of high level objects named events or messages.
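Tying the libfabric API of Section 2.4 back to the channel semantics of Section 2.3, the sketch below expresses the same pre-post/send/poll pattern through libfabric. It is a simplified illustration under several assumptions (a reliable message endpoint, the API version requested, connection management and memory descriptors omitted) and is not the Heron implementation.

```cpp
// Minimal libfabric sketch of the channel-semantics pattern from Section 2.3,
// using the uniform API described in Section 2.4. Provider discovery,
// connection management and error handling are heavily simplified.
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <cstdio>

int main() {
  // Describe what we need: reliable, connection-oriented messaging.
  fi_info *hints = fi_allocinfo();
  hints->ep_attr->type = FI_EP_MSG;
  hints->caps = FI_MSG;

  fi_info *info = nullptr;
  if (fi_getinfo(FI_VERSION(1, 5), nullptr, nullptr, 0, hints, &info) != 0) {
    fprintf(stderr, "no matching libfabric provider found\n");
    return 1;
  }
  printf("selected provider: %s\n", info->fabric_attr->prov_name);

  // Open the fabric, domain, endpoint and a completion queue.
  fid_fabric *fabric = nullptr;
  fid_domain *domain = nullptr;
  fid_ep *ep = nullptr;
  fid_cq *cq = nullptr;
  fi_fabric(info->fabric_attr, &fabric, nullptr);
  fi_domain(fabric, info, &domain, nullptr);
  fi_endpoint(domain, info, &ep, nullptr);

  fi_cq_attr cq_attr{};
  cq_attr.format = FI_CQ_FORMAT_CONTEXT;
  fi_cq_open(domain, &cq_attr, &cq, nullptr);

  // Bind the completion queue for both transmit and receive completions.
  fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV);
  fi_enable(ep);

  // With a connected endpoint (connection setup omitted), the channel
  // semantics mirror Verbs: pre-post a receive, post a send, poll the CQ.
  char recv_buf[4096], send_buf[4096];
  fi_recv(ep, recv_buf, sizeof(recv_buf), nullptr, FI_ADDR_UNSPEC, nullptr);
  fi_send(ep, send_buf, sizeof(send_buf), nullptr, FI_ADDR_UNSPEC, nullptr);

  fi_cq_entry entries[8];
  ssize_t n = fi_cq_read(cq, entries, 8);  // completions drained, or -FI_EAGAIN
  printf("completions drained: %zd\n", n);

  fi_close(&ep->fid);
  fi_close(&cq->fid);
  fi_close(&domain->fid);
  fi_close(&fabric->fid);
  fi_freeinfo(info);
  fi_freeinfo(hints);
  return 0;
}
```

Because the provider is selected at fi_getinfo time, the same pattern can run over Verbs, Omni-Path or plain sockets, which is the portability argument made in Section 2.4.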