Low Latency Stream Processing: Apache Heron with Infiniband & Intel Omni-Path

Supun Kamburugamuve
School of Informatics, Computing, and Engineering
Indiana University, Bloomington, IN 47408
[email protected]

Karthik Ramasamy
Streaml.io
Palo Alto, CA
[email protected]

Martin Swany
School of Informatics, Computing, and Engineering
Indiana University, Bloomington, IN 47408
[email protected]

Geoffrey Fox
School of Informatics, Computing, and Engineering
Indiana University, Bloomington, IN 47408
[email protected]

ABSTRACT
Worldwide data production is increasing both in volume and velocity, and with this acceleration, data needs to be processed in streaming settings as opposed to the traditional store and process model. Distributed streaming frameworks are designed to process such data in real time with reasonable time constraints. Apache Heron is a production-ready large-scale distributed stream processing framework. The network is of utmost importance to scale streaming applications to large numbers of nodes with a reasonable latency. High performance computing (HPC) clusters feature interconnects that can perform at higher levels than traditional Ethernet. In this paper the authors present their findings on integrating the Apache Heron distributed stream processing system with two high performance interconnects, Infiniband and Intel Omni-Path, and show that they can be utilized to improve the performance of distributed streaming applications.

KEYWORDS
Streaming data, Infiniband, Omni-Path, Apache Heron

1 INTRODUCTION
With ever increasing data production by users and machines alike, the amount of data that needs to be processed has increased dramatically. This must be achieved both in real time and as batches to satisfy different use cases. Additionally, with the adoption of devices into Internet of Things setups, the amount of real time data is exploding and must be processed with reasonable time constraints. In distributed stream analytics, large data streams are partitioned and processed on distributed sets of machines to keep up with the high volume data rates. By the definition of large-scale streaming data processing, networks are a crucial component in transmitting messages between the processing units for achieving efficient data processing.

There are many hardware environments in which big data systems can be deployed, including high performance computing (HPC) clusters. HPC clusters are designed to perform large computations with advanced processors, memory, IO systems and high performance interconnects. High performance interconnects in HPC clusters feature microsecond latencies and large bandwidths. Thanks to recent advancements in hardware, some of these high performance networks have become cheaper to set up than their Ethernet counterparts. With multi-core and many-core systems having large numbers of CPUs in a single node, the demand for high performance networking is increasing as well.

Advanced hardware features such as high performance interconnects are not fully utilized in big data computing frameworks, mostly because they are accessible to low level programming languages while most big data systems are written on the Java platform. In recent years we have seen efforts to bring high performance interconnects into big data frameworks such as Spark [24] and Hadoop [22]. Big data frameworks such as Spark and Hadoop focus on large batch data processing, and hence their communication requirements are different compared to streaming systems, which are more latency sensitive.

There are many distributed streaming frameworks available today for processing large amounts of streaming data in real time. Such systems are largely designed and optimized for commodity hardware and clouds. Apache Storm [29] was one of the popular early systems developed for processing streaming data. Apache Heron [19] (http://incubator.apache.org/projects/heron.html) is similar to Storm with a new architecture for streaming data processing. It features a hybrid design with some of the performance-critical parts written in C++ and others written in Java.
This architecture allows the integration of high performance enhancements directly rather than going through native wrappers such as the Java Native Interface (JNI). When these systems are deployed on clusters that include high performance interconnects, they need to use a TCP interface to the high performance interconnect, which does not perform as well as a native implementation. To utilize these hardware features, we have integrated the Infiniband and Intel Omni-Path interconnects into Apache Heron to accelerate its communications. Infiniband [4] is an open standard protocol for high performance interconnects that is widely used in today's high performance clusters. Omni-Path [6] is a proprietary interconnect developed by Intel and is available with the latest Knights Landing architecture-based (KNL) [27] many-core processors. With this implementation, we have observed significantly lower latencies and improved throughput in Heron. The main contribution of this work is to showcase the benefits of using high performance interconnects for distributed stream processing. There are many differences in the hardware available for communications, with different bandwidths, latencies and processing models. Ethernet has hardware comparable to some of the high performance interconnects; it is not our goal to show that one particular technology is superior to others, as different environments may have alternate sets of these technologies.

The remainder of the paper is organized as follows. Section 2 presents background information on Infiniband and Omni-Path. Section 3 describes the Heron architecture in detail and Section 4 the implementation details. Next, the experiments conducted are described in Section 5 and the results are presented and discussed in Section 6. Section 7 presents related work. The paper concludes with a look at future work.
2 BACKGROUND

2.1 Infiniband
Infiniband is one of the most widely used high performance fabrics. It provides a variety of capabilities including message channel semantics, remote memory access and remote atomic memory operations, supporting both connection-oriented and connectionless endpoints. Infiniband is programmed using the Verbs API, which is available on all major platforms. The current hardware is capable of achieving up to 100Gbps speeds with microsecond latencies. Infiniband does not require OS kernel intervention to transfer packets from user space to the hardware. Unlike in TCP, its protocol aspects are handled by the hardware. These features mean less CPU time spent on the network compared to TCP for transferring the same amount of data. Because the OS kernel is bypassed by the communications, the memory used for transferring data has to be registered with the hardware.

2.2 Intel Omni-Path
Omni-Path is a high performance fabric developed by Intel. The Omni-Path fabric is relatively new compared to Infiniband and there are fundamental differences between the two. Omni-Path does not offload protocol handling to the network hardware and it does not have connection-oriented channels as in Infiniband. Unlike in Infiniband, the Omni-Path network chip can be built into the latest Intel Knights Landing (KNL) processors. Omni-Path supports tagged messaging with a 96-bit tag in each message. A tag can carry any type of data, and this information can be used at the application level to distinguish between different messages. Omni-Path is designed and optimized for small high frequency messaging.

2.3 Channel & Memory Semantics
High performance interconnects generally support two modes of operation called channel and memory semantics [13]. With channel semantics, queues are used for communication. With memory semantics, a process can read from or write directly to the memory of a remote machine. In channel mode, two queue pairs for transmission and receive operations are used. To transfer a message, a descriptor is posted to the transfer queue, which includes the address of the memory buffer to transfer. For receiving a message, a descriptor needs to be submitted along with a pre-allocated receive buffer. The user program queries the completion queue associated with a transmission or a receiving queue to determine the success or failure of a work request. Once a message arrives, the hardware puts the message into the posted receive buffer and the user program can detect this event through the completion queue. Note that this mode requires the receive buffers to be pre-posted before the transmission can happen successfully.

With memory semantics, Remote Direct Memory Access (RDMA) operations are used. Two processes preparing to communicate register memory and share the details with each other. Read and write operations are used instead of send and receive operations. These are one-sided and do not need any software intervention from the other side. If a process wishes to write to remote memory, it can post a write operation with the local addresses of the data. The completion of the write operation can be detected using the associated completion queue. The receiving side is not notified about the write operation and has to use out-of-band mechanisms to figure out the write. The same is true for remote reads as well. RDMA is more suitable for large message transfers while channel mode is suitable for small messages. In general RDMA has a 1–2µs latency advantage over channel semantics for Infiniband, and this is not significant for our work.
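To make the channel semantics above concrete, the sketch below shows the usual pattern with the Infiniband Verbs API from Section 2.1: receive descriptors pointing at pre-registered buffers are posted before the peer transmits, a send descriptor is posted to transfer data, and completions of both are discovered by polling the completion queue. It is a minimal illustration rather than code from Heron; queue pair setup, memory registration and error handling are assumed to happen elsewhere.

```cpp
// Channel-semantics sketch with the Verbs API: pre-post receives, post a
// send, and poll the completion queue. Queue pair and memory registration
// (ibv_reg_mr) are assumed to have been set up already.
#include <infiniband/verbs.h>
#include <cstdint>

void post_receive(ibv_qp *qp, ibv_mr *mr, char *buf, size_t len) {
  ibv_sge sge = {};
  sge.addr   = reinterpret_cast<uint64_t>(buf);
  sge.length = static_cast<uint32_t>(len);
  sge.lkey   = mr->lkey;

  ibv_recv_wr wr = {}, *bad = nullptr;
  wr.wr_id   = reinterpret_cast<uint64_t>(buf);  // identifies the buffer on completion
  wr.sg_list = &sge;
  wr.num_sge = 1;
  ibv_post_recv(qp, &wr, &bad);                  // descriptor for a pre-posted buffer
}

void post_send(ibv_qp *qp, ibv_mr *mr, char *buf, size_t len) {
  ibv_sge sge = {};
  sge.addr   = reinterpret_cast<uint64_t>(buf);
  sge.length = static_cast<uint32_t>(len);
  sge.lkey   = mr->lkey;

  ibv_send_wr wr = {}, *bad = nullptr;
  wr.wr_id      = reinterpret_cast<uint64_t>(buf);
  wr.sg_list    = &sge;
  wr.num_sge    = 1;
  wr.opcode     = IBV_WR_SEND;
  wr.send_flags = IBV_SEND_SIGNALED;             // request a completion entry
  ibv_post_send(qp, &wr, &bad);
}

int poll_completions(ibv_cq *cq) {
  ibv_wc wc[8];
  int n = ibv_poll_cq(cq, 8, wc);                // non-blocking poll
  // wc[i].wr_id names the buffer whose send or receive completed; a receive
  // buffer would be processed and re-posted here.
  return n;
}
```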
2.4 Openfabrics API
Openfabrics (https://www.openfabrics.org/) provides a library called libfabric [12] that hides the details of common high performance fabric APIs behind a uniform API. Because of the advantage of such an API, we chose to use libfabric as our programming library for implementing the high performance communications for Heron. Libfabric is a thin wrapper API and it supports different providers including Verbs, the Aries interconnect from Cray through GNI, Intel Omni-Path, and Sockets.

2.5 TCP & High Performance Interconnects
TCP is one of the most successful protocols ever developed. It provides a simple yet powerful API for transferring data reliably across the Internet using unreliable links and protocols underneath. One of the biggest advantages of TCP is its wide adoption and simplicity of use. Virtually every computer has access to a TCP-capable adapter and the API is well supported across different platforms. TCP provides a streaming API for messaging where the fabric does not maintain message boundaries. The messages are written as a stream of bytes to TCP, and the application has to define mechanisms such as placing markers in between messages to indicate the boundaries. On the other hand, Infiniband and Omni-Path both support message boundaries.

High performance interconnects have drivers that make them available through the TCP protocol stack. The biggest advantage of such an implementation is that an existing application written using the TCP stack can use the high performance interconnect without any modifications to the code. It is worth noting that the native use of the interconnect through its API always yields better performance than using it through the TCP/IP stack. A typical TCP application allocates memory in user space and the TCP stack needs to copy data between user space and kernel space. Also, each TCP invocation involves a system call, which causes a context switch of the application. Recent advancements like Netmap [26] and DPDK [11] remove these costs and increase the maximum number of packets that can be processed per second, and also make it possible to copy packets directly from user space to the hardware. High performance fabrics usually do not go through the OS kernel for network operations and the hardware is capable of copying data directly to user space buffers. Infiniband offloads the protocol processing aspects to hardware while Omni-Path still involves the CPU for protocol processing.
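As an illustration of the marker-based framing that TCP applications need (and that message-oriented fabrics make unnecessary), the generic length-prefix scheme below restores message boundaries on top of a byte stream. It is a sketch of the general technique, not Heron's wire format.

```cpp
// Length-prefix framing over a TCP byte stream: the sender writes a 4-byte
// length before each message so the receiver can recover the boundaries
// that the stream itself does not preserve. Partial writes are ignored for
// brevity.
#include <arpa/inet.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

bool write_framed(int fd, const std::vector<char> &msg) {
  uint32_t len = htonl(static_cast<uint32_t>(msg.size()));
  if (write(fd, &len, sizeof(len)) != sizeof(len)) return false;
  return write(fd, msg.data(), msg.size()) == static_cast<ssize_t>(msg.size());
}

bool read_exact(int fd, char *buf, size_t n) {
  size_t done = 0;
  while (done < n) {                       // a single read() may return less
    ssize_t r = read(fd, buf + done, n - done);
    if (r <= 0) return false;
    done += static_cast<size_t>(r);
  }
  return true;
}

bool read_framed(int fd, std::vector<char> &msg) {
  uint32_t len = 0;
  if (!read_exact(fd, reinterpret_cast<char *>(&len), sizeof(len))) return false;
  msg.resize(ntohl(len));
  return read_exact(fd, msg.data(), msg.size());
}
```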

3 APACHE HERON
Heron is a distributed stream processing framework developed at Twitter and now available as an open source project in the Apache Incubator (https://github.com/twitter/heron). Heron is similar to Apache Storm in its API, with many differences in the underlying engine architecture. It retains the same Storm API, allowing applications written in Storm to be deployed with no or minimal code changes. Heron is designed from the ground up to be extensible, and important parts of the core engine are written in C++ rather than Java, the default language of choice for Big Data frameworks. The rationale for the use of C++ is to leverage the advanced features offered by the OS and hardware. Heron instances and schedulers are written in Java while the stream manager and topology master are written in C++.

3.1 Heron Data Model
A stream is an unbounded sequence of high level objects named events or messages. The streaming computation in Heron is referred to as a Topology. A topology is a graph of nodes and edges. The nodes represent the processing units executing the user defined code and the edges between the nodes indicate how the data (or stream) flows between them. There are two types of nodes: spouts and bolts. Spouts are the sources of streams. For example, a Kafka [18] spout can read from a Kafka queue and emit it as a stream. Bolts consume messages from their input stream(s), apply their processing logic and emit new messages on their outgoing streams.

Heron has the concept of a user defined graph and an execution graph. The user defined graph specifies how the processing units are connected together in terms of message distributions. The execution graph, on the other hand, is the layout of this graph on actual nodes, with network connections and computing resources allocated to the topology to execute. Nodes in the execution graph can have multiple parallel instances (or tasks) running to scale the computations. The user defined graph and the execution graph are referred to as the logical plan and physical plan respectively.

Figure 1: High level architecture of Heron. Each outer box shows a resource container allocated by a resource scheduler like Mesos or Slurm. The arrows show the communication links between different components.

3.2 Heron Architecture
The components of the Heron architecture are shown in Fig. 1. Each Heron topology is a standalone long-running job that never terminates, due to the unbounded nature of streams. Each topology is self-contained and executes in a distributed sandbox environment in isolation, without any interference from other topologies. A Heron topology consists of multiple containers allocated by the scheduler. These can be Linux containers, physical nodes or sandboxes created by the scheduler. The first container, referred to as the master, always runs the Topology Master that manages the topology. Each of the subsequent containers has the following processes: a set of processes executing the spout/bolt tasks of the topology called Heron instances, a process called a stream manager that manages the data routing and the connections to the outside containers, and a metrics manager to collect information about the instances running in that container.

Each Heron instance executes a single task of the topology. The instances are connected to the stream manager running inside the container through TCP loop-back connections. It is worth noting that Heron instances always connect to other instances through the stream manager and do not communicate with each other directly, even if they are in the same container. The stream manager acts as a bridge between Heron instances. It forwards messages to the correct instances by consulting the routing tables it maintains. A message between two instances in different containers goes through two stream managers. Containers can have many Heron instances running in them and they all communicate through the stream manager. Because of this design, it is important to have highly efficient data transfers at the stream manager to support the communication requirements of the instances.
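The routing role of the stream manager can be pictured with a small schematic: a message is delivered either to a local instance in the same container or to the stream manager of the container hosting the destination task, which is why a cross-container message traverses exactly two stream managers. The types and names below are hypothetical and only illustrate the idea; they are not Heron's actual classes.

```cpp
// Schematic of stream-manager routing: forward to a local instance when the
// destination task lives in this container, otherwise to the peer stream
// manager of the container that hosts it.
#include <unordered_map>

struct Connection { /* socket or fabric channel to a peer process */ };
struct Message    { int dest_task_id; /* payload omitted */ };

class StreamManagerRouter {
 public:
  void Route(const Message &msg) {
    auto local = local_instances_.find(msg.dest_task_id);
    if (local != local_instances_.end()) {
      Send(local->second, msg);                       // instance in this container
    } else {
      // Remote task: hand the message to the stream manager of its container;
      // that stream manager delivers it to the destination instance.
      Send(remote_stream_managers_[container_of_[msg.dest_task_id]], msg);
    }
  }

 private:
  void Send(Connection &, const Message &) { /* serialize and transmit */ }
  std::unordered_map<int, Connection> local_instances_;        // task id -> local link
  std::unordered_map<int, Connection> remote_stream_managers_; // container id -> link
  std::unordered_map<int, int> container_of_;                  // task id -> container id
};
```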
3.2.1 Acknowledgements. Heron uses an acknowledgement mechanism to provide at-least-once message processing semantics, which ensures a message is always processed in the presence of process/machine failures. In order to achieve at-least-once processing, the stream manager tracks the tuple tree generated while a message progresses through the topology. When a bolt emits a message, it anchors the new message to the parent message and this information is sent to the originating stream manager (in the same container) as a separate message. When every new message finishes its processing, a separate message is again sent to the originating stream manager. Upon receiving such control messages for every emit in the message tree, the stream manager marks the message as fully processed.

3.2.2 Processing pipeline. Heron has a concept called max messages pending with spouts. When a spout emits messages to a topology, this number dictates the amount of in-flight messages that are not yet fully processed. The spout is called to emit messages only when the current in-flight message count is less than the max spout pending messages.

3.3 Heron Stream Manager
The stream manager is responsible for routing messages between instances inside a container and across containers. It employs a single thread that uses event-driven programming with the non-blocking socket API. A stream manager receives messages from instances running in the same container and from other stream managers. These messages are Google Protocol Buffers [30] serialized messages packed into binary form and transmitted through the wire. If a stream manager receives a message from a spout, it keeps track of the details of the message until all the acknowledgements are received from the message tree. The stream manager features an in-memory, store-and-forward architecture for forwarding messages and can batch multiple messages into a single message for efficient transfers. Because messages are temporarily stored, there is a draining function that drains the store at a user defined rate.
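The tuple-tree bookkeeping performed by the stream manager (Sections 3.2.1 and 3.3) can be summarized as a counter per root tuple: anchor control messages add an outstanding descendant, completion control messages retire one, and the tuple is fully processed when the count returns to zero. The sketch below is a schematic of that idea with invented names, not Heron's implementation.

```cpp
// Schematic at-least-once tracking at the originating stream manager: each
// anchor adds one outstanding tuple for the root, each completion retires
// one; when the count hits zero the whole tree is processed and the spout
// can be informed.
#include <cstdint>
#include <unordered_map>

class AckTracker {
 public:
  void OnSpoutEmit(uint64_t root_id) { pending_[root_id] = 1; }   // the root itself
  void OnAnchor(uint64_t root_id)    { pending_[root_id] += 1; }  // a new child tuple
  bool OnComplete(uint64_t root_id) {                             // a tuple finished
    if (--pending_[root_id] == 0) {
      pending_.erase(root_id);
      return true;   // message tree fully processed
    }
    return false;
  }

 private:
  std::unordered_map<uint64_t, uint64_t> pending_;  // root tuple id -> outstanding tuples
};
```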
4 IMPLEMENTATION
Even though high performance interconnects are widely used by HPC applications and frameworks, they are not often utilized in big data systems. Furthermore, experiences in using these interconnects in big data systems are lacking in the public domain. In this implementation, Infiniband and Omni-Path interface with Heron through its stream manager, as shown in Fig. 2. Infiniband or Omni-Path message channels are created between the stream managers in the topology. These then carry the data messages going between the stream managers. The control messages that are sent between a stream manager and the topology master still use TCP connections. They are not frequent and do not affect the performance of the data flow. The TCP loop-back connections from the instances to the stream manager are not altered in this implementation, as a loop-back connection is much more efficient than the network. Both the Infiniband and Omni-Path implementations use channel semantics for communication. A separate thread is used for polling the completion queues associated with the channels. A credit-based flow control mechanism is used for each channel along with a configurable buffer pool.

Figure 2: Heron high performance interconnects are between the stream managers.

4.1 Bootstrapping
Infiniband and Omni-Path require information about the communicating parties to be sent out-of-band through other mechanisms like TCP. Infiniband uses the RDMA (Remote Direct Memory Access) Connection Manager to transfer the required information and establish the connections. The RDMA Connection Manager provides a socket-like API for connection management, which is exposed to the user in a similar fashion through the Libfabric API. The connection manager also uses the IP over Infiniband network adapter to discover and transfer the bootstrap information. Omni-Path has a built-in TCP server for discovering the endpoints. Because Omni-Path does not involve connection management, only the destination address is needed for communications. This information can be sent using an out-of-band TCP connection.

4.2 Buffer management
Each side of the communication uses buffer pools with equal size buffers to communicate. Two such pools are used for sending and receiving data for each channel. For receive operations, all the buffers are posted to the fabric at the beginning. For transmitting messages, the buffers are filled with messages and posted to the fabric for transmission. After the transmission is complete, the buffer is added back to the pool. The message receive and transmission completions are discovered using the completion queues. Individual buffer sizes are kept relatively large to accommodate the largest messages expected. If the buffer size is not enough for a single message, the message is divided into pieces and put into multiple buffers. Every network message carries the length of the total message and this information can be used to assemble the pieces.

The stream manager de-serializes the protocol buffer message in order to determine the routing for the message and to handle the acknowledgements. The TCP implementation first copies the incoming data into a buffer and then uses this buffer to build the protocol buffer structures. This implementation can directly use the buffer allocated for receiving to build the protocol message.
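A minimal sketch of the send-side buffer handling just described is given below, assuming a fixed pool of equally sized buffers and a 4-byte total-length header used to reassemble a message split across several buffers. The class and field names are ours for illustration; memory registration, posting to the fabric and completion handling are omitted.

```cpp
// Sketch of the send-side buffer pool: equally sized buffers are taken from
// a free list, a large message is split across several of them, and every
// fragment carries the total message length so the receiver can reassemble.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <deque>
#include <vector>

struct FabricBuffer {
  char   data[1 << 20];   // fixed-size buffer (1 MB in the paper's experiments)
  size_t used = 0;
};

class SendBufferPool {
 public:
  explicit SendBufferPool(size_t n) {
    for (size_t i = 0; i < n; ++i) free_.push_back(new FabricBuffer());
  }

  // Split a serialized message into as many buffers as needed; the credit
  // scheme of Section 4.3 is what keeps the free list from running dry.
  std::vector<FabricBuffer *> Pack(const char *msg, uint32_t total) {
    std::vector<FabricBuffer *> out;
    const uint32_t capacity =
        static_cast<uint32_t>(sizeof(FabricBuffer::data) - sizeof(total));
    for (uint32_t off = 0; off < total && !free_.empty();) {
      FabricBuffer *b = free_.front(); free_.pop_front();
      std::memcpy(b->data, &total, sizeof(total));          // total length header
      uint32_t chunk = std::min(total - off, capacity);
      std::memcpy(b->data + sizeof(total), msg + off, chunk);
      b->used = sizeof(total) + chunk;
      off += chunk;
      out.push_back(b);               // each buffer is then posted to the fabric
    }
    return out;
  }

  // Once the transmission completion is seen on the completion queue, the
  // buffer goes back to the pool.
  void Release(FabricBuffer *b) { b->used = 0; free_.push_back(b); }

 private:
  std::deque<FabricBuffer *> free_;
};
```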

4.3 Flow control at the communication level
Neither Infiniband nor Omni-Path implements flow control between the communicating parties, and it is up to the application developer to implement such higher level functions [33]. This implementation uses a standard credit-based approach for flow control. The credit available for the sender to communicate is equal to the number of buffers posted into the fabric by the receiver. Credit information is passed to the other side as part of data transmissions, or by using separate messages in case there are no data transmissions to carry it. Each data message carries the current credit of the communicating party as a 4-byte integer value. The credit messages do not take into account the credit available, in order to avoid deadlocks; otherwise there may be situations where there is no credit available to send credit messages.
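The credit scheme can be sketched as two counters per channel: the sender consumes one credit per data message and only transmits while credit remains, while the receiver accumulates credit to grant as it re-posts buffers and piggybacks it on outgoing traffic (or sends it in a standalone credit message). The class below is an illustration of this bookkeeping under assumed names, not the Heron stream manager's code.

```cpp
// Credit-based flow control for one channel: credit equals the number of
// receive buffers the peer has posted; grants travel as a 4-byte integer on
// data messages, or in a credit-only message when there is no data traffic.
#include <cstdint>

class ChannelFlowControl {
 public:
  explicit ChannelFlowControl(int posted_buffers)
      : send_credit_(posted_buffers), credit_to_grant_(0) {}

  // --- sender side ---
  bool CanSend() const { return send_credit_ > 0; }
  void OnDataSent()    { --send_credit_; }            // consumes one peer buffer
  void OnCreditReceived(int32_t granted) { send_credit_ += granted; }

  // --- receiver side ---
  void OnBufferReposted() { ++credit_to_grant_; }      // buffer handed back to the fabric
  // Piggybacked on the next outgoing data message, or carried by a
  // standalone credit message; credit-only messages are sent regardless of
  // available credit, which is what avoids deadlock.
  int32_t TakeCreditToAdvertise() {
    int32_t c = credit_to_grant_;
    credit_to_grant_ = 0;
    return c;
  }

 private:
  int32_t send_credit_;      // messages we may still send to the peer
  int32_t credit_to_grant_;  // buffers re-posted but not yet advertised
};
```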
4.4 Interconnects
The Infiniband implementation uses connection-oriented endpoints with channel semantics to transfer the messages. The messages are transferred reliably by the fabric and the message ordering is guaranteed. The completions are also in the order of the work request submissions.

Intel Omni-Path does not support the connection-oriented message transfers employed in the Infiniband implementation. The application instead uses reliable datagram message transfer with tag-based messaging. Communication channels between stream managers are overlaid on a single receive queue and a single send queue. Messages coming from different stream managers are distinguished based on the tag information they carry. The tag used in the implementation is a 64-bit integer which carries the source stream manager ID and the destination stream manager ID. Even though all the stream managers connecting to a single stream manager share a single queue, they carry their own flow control by assigning a fixed amount of buffers to each channel. Unlike in Infiniband, the work request completions are not in any particular order of their submission to the work queue. Because of this, the application keeps track of the submitted buffers and their completion order explicitly.

IP over Fabric, or IP over Infiniband (IPoIB), is a mechanism that allows a regular TCP application to access the underlying high performance interconnect through the TCP socket API. For using IPoIB, the Heron stream manager TCP sockets are bound to the IPoIB network interface explicitly, without changing the existing TCP processing logic.
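Returning to the Omni-Path path described above, the 64-bit tag carrying the source and destination stream manager IDs can be encoded with simple bit operations. The paper only states that both IDs are packed into the tag, so the 32/32 split below is an assumed layout used purely for illustration.

```cpp
// Illustrative packing of the 64-bit message tag used for tag-based
// messaging on Omni-Path: upper half for the source stream manager id,
// lower half for the destination id, so a shared receive queue can tell the
// per-channel traffic apart. The exact bit layout here is an assumption.
#include <cstdint>

inline uint64_t make_tag(uint32_t src_stmgr_id, uint32_t dst_stmgr_id) {
  return (static_cast<uint64_t>(src_stmgr_id) << 32) | dst_stmgr_id;
}

inline uint32_t tag_source(uint64_t tag)      { return static_cast<uint32_t>(tag >> 32); }
inline uint32_t tag_destination(uint64_t tag) { return static_cast<uint32_t>(tag & 0xFFFFFFFFu); }
```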

Figure 3: Topology A: Deep topology with a spout and multiple bolts arranged in a chain. The spout and bolts run multiple parallel instances.

Figure 4: Topology B: Shallow topology with a spout and bolt connected in a shuffle grouping. The spout and bolts run multiple parallel instances.

5 EXPERIMENTS
An Intel Haswell HPC cluster was used for the Infiniband experiments. The CPUs are Intel Xeon E5-2670 running at 2.30GHz. Each node has 24 cores (2 sockets x 12 cores each) with 128GB of main memory, a 56Gbps Infiniband interconnect and a 1Gbps dedicated Ethernet connection to the other nodes. An Intel Knights Landing (KNL) cluster was used for the Omni-Path tests. Each node in the KNL cluster has 72 cores (Intel Xeon Phi CPU 7250F, 1.40GHz) and is connected to a 100Gbps Omni-Path fabric and a 1Gbps Ethernet connection. There are many variations of Ethernet, Infiniband and Omni-Path performing at different message rates and latencies. We conducted the experiments on the best resources available to us.

We conducted several micro-benchmarks to measure the latency and throughput of the system. In these experiments the primary focus was given to communications and no computation was conducted in the bolts. The tasks in each experiment were configured with 4GB of memory. A single Heron stream manager was run on each node. We measured the IP over Fabric (IPoIB) latency to showcase the latency possible in case no direct implementation of Infiniband or Omni-Path is available.

Figure 5: Yahoo Streaming benchmark with 7 stages. The join bolts and sink bolts communicate with a Redis server.

5.1 Experiment Topologies
To measure the behavior of the system, the two topologies shown in Fig. 3 and Fig. 4 are used. Topology A in Fig. 3 is a deep topology with multiple bolts arranged in a chain. The parallelism of the topology determines the number of parallel tasks for the bolts and spouts in the topology. Each adjacent component pair is connected by a shuffle grouping. Topology B in Fig. 4 is a two-component topology with a spout and a bolt. Spouts and bolts are arranged in a shuffle grouping so that the spouts load balance the messages among the bolts.

In both topologies the spouts generated messages at the highest sustainable speed with acknowledgements. The acknowledgements acted as a flow control mechanism for the topology. The latency is measured as the time it takes for a tuple to go through the topology and its corresponding acknowledgement to reach the spout. Since tuples generate more tuples when they go through the topology, it takes multiple acknowledgements to complete a tuple. In Topology A, it needs control tuples equal to the number of bolts in the chain to complete a single tuple. For example, if the length of the topology is 8, it takes 7 control tuples to complete the original tuple. Tests were run with 10 in-flight messages through the topology.

5.2 Yahoo streaming benchmark
For evaluating the behavior with a more practical streaming application, we used a benchmark developed at Yahoo! (https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at) to test streaming frameworks. We modified the original streaming benchmark to support Heron and added additional features to support our case. The modified benchmark is available open source on Github (https://github.com/iotcloud/streaming-benchmarks.git). It focuses on an advertisement application where ad events are processed using a streaming topology, as shown in Fig. 5. We changed the original benchmark to use a self-message-generating spout instead of reading messages through Kafka [18]. This was done to remove any bottlenecks and variations imposed by Kafka.
The benchmark employs a multiple stage topology with different processing units. The data is generated as a JSON object and sent through the processing units, which perform de-serialization, filter, projection and join operations on each tuple. At the joining stage, it uses a Redis [8] database to query data about the tuples. Additionally, the last bolt saves information to the Redis database. At the filter step about 66% of the tuples are dropped. We used an acking-enabled topology and measured the latency as the time it takes for a tuple to go through the topology and its ack to return back to the spout. For our tests we used 16 nodes with 8 parallelism at each step, totaling 56 tasks. Each spout was sending 100,000 messages per second, which gave 800,000 messages per second in total. Note that this topology accesses the Redis database system for 33% of the messages it receives.

6 RESULTS & DISCUSSION
Fig. 6 shows the latency of Topology A with varying message sizes and parallelism for IB, TCP and IPoIB. Infiniband performed the best while Ethernet showed the highest latency. Fig. 7 shows the latency of Topology B with varying message sizes and parallelism as in Fig. 6. For small messages, we see both IPoIB and Ethernet performing at a similar level while Infiniband was performing better. When increasing the parallelism, the latency increased as expected, and Infiniband showed a smaller increase than IPoIB and Ethernet. As expected, Topology B showed smaller improvements compared to Topology A. For most practical applications, the minimum latency possible will fall within the results observed for these two topologies, as they represent the minimum possible topology and a deep topology for most practical applications.

Fig. 8 shows the latency results of the Omni-Path implementation with Topology A in the KNL cluster for large and small messages. IPoFabric is the IP driver for Omni-Path. The KNL machine has less powerful CPU cores and hence the latency was higher compared to Infiniband in these tests. Fig. 10 shows the latency distribution of Topology A with 100 and 10 in-flight messages with 128k message size for TCP and IB. The graphs show that the network latencies follow a normal distribution with high latencies at the 99th percentile. Fig. 11 shows the latency distribution seen by the Yahoo stream benchmark with the Infiniband fabric. For all three networking modes, we have seen high spikes at the 99th percentile. This was primarily due to Java garbage collection at the tasks, which is unavoidable in JVM-based streaming applications. The store-and-forward functionality of the stream manager also contributes to the distribution of latencies, as a single message can be delayed up to 1ms randomly at the stream manager.
Fig. 9 presents the message rates observed with Topology B. The experiment was conducted with 32 parallel bolt instances and varying numbers of spout instances and message sizes. The graph shows that Infiniband had the best throughput, while IPoIB achieved second-best and Ethernet came in last. Fig. 9 c) and d) show the throughput for 128K and 128 byte messages with varying numbers of spouts. When the number of parallel spouts increases, IPoIB and Ethernet maxed out at 16 parallel spouts for 128k messages, while Infiniband kept on increasing. Fig. 12 shows the total time required to serialize the protocol buffers and the time it took to complete the messages with the TCP and IB implementations. It is evident that the IB implementation overhead is much lower compared to TCP.

The results showed good overall results for Infiniband and Omni-Path compared to the TCP and IPoIB communication modes. The throughput of the system is bounded by the CPU usage of the stream managers. For TCP connections, the CPU is used for message processing as well as the network protocol. Infiniband, on the other hand, uses the CPU only for message processing at the stream manager, yielding better performance.

The results showed a much higher difference between TCP and Infiniband for large message sizes. For small messages the bandwidth utilization is much lower than for large messages. This is primarily due to the fact that the CPU is needed to process every message. For larger messages, because of the low number of messages transferred per second, the CPU usage for that aspect is low. For smaller messages, because of the large number of messages per second, the CPU usage is much higher. Because of this, for small messages the stream managers saturate the CPU without saturating the communication channel. For practical applications that require large throughput, Heron can bundle small messages into a large message in order to avoid some of the processing and transfer overheads. This makes it essentially a large message for the stream manager, and large-message results can be observed with elevated latencies for individual messages. We used 10 buffers with 1 megabyte allocated to each for sending and receiving messages. Protocol buffers are optimized for small messages and it is unlikely to get messages larger than 1 megabyte for streaming applications targeting Heron.

The KNL system used for testing Omni-Path has a large number of cores running at lower CPU frequencies compared to the Haswell cluster. In order to fully utilize such a system, multiple threads need to be used. For our implementation we did not explore such features specific to KNL and tried to first optimize for the Omni-Path interconnect. The results show that Omni-Path performed considerably better than the other two options. In this work the authors did not try to pick between Omni-Path or Infiniband as the better interconnect, as they were tested in two completely different systems under varying circumstances. The objective of the work is to show the potential benefits of using interconnects to accelerate stream processing.

Figure 6: Latency of Topology A with 1 spout and 7 bolt instances arranged in a chain with varying parallelism and message sizes. a) and b) are with 2 parallelism and c) and d) are with 128k and 128 byte messages. The results are on the Haswell cluster with IB.

Figure 7: Latency of Topology B with 32 parallel bolt instances and varying numbers of spouts and message sizes. a) and b) are with 16 spouts and c) and d) are with 128k and 128 byte messages. The results are on the Haswell cluster with IB.

Figure 8: Latency of Topology A with 1 spout and 7 bolt instances arranged in a chain with varying parallelism and message sizes. a) and b) are with 2 parallelism and c) and d) are with 128k and 128 byte messages. The results are on the KNL cluster.

7 RELATED WORK
In large part, HPC and Big Data systems have evolved independently over the years. Despite this, there are common requirements that raise similar issues in both types of systems. Some of these issues are solved in HPC and some in Big Data frameworks. As such, there have been efforts to converge both HPC and Big Data technologies to create better systems that can work efficiently in different environments. SPIDAL [9] and HiBD are two such efforts to enhance Big Data frameworks with ideas and tools from HPC. This work is part of an ongoing effort by the authors to improve stream engine performance using HPC techniques. Previously we showed [15] how to leverage shared memory and collective communication algorithms to increase the performance of Apache Storm. Also, the authors have looked at the various available network protocols for Big Data applications [28].

There are many distributed stream engines available today, including Spark Streaming [32], Apache Flink [7], Apache Apex [1] and Google Cloud Dataflow [3]. All these systems follow the data flow model with comparable features to each other. StreamBench [21] is a benchmark developed to evaluate the performance of these systems in detail. These systems are primarily optimized for commodity hardware and clouds. There has been much research done around Apache Storm and Apache Heron to improve their capabilities. [10] described architectural and design improvements to Heron that improved its performance much further.

In recent years there have been multiple efforts to integrate high performance interconnects with big data frameworks. Infiniband has been integrated into Spark [24], where the focus is on the Spark batch processing aspects rather than the streaming aspects. Spark is not considered a native stream processing engine and only implements streaming as an extension to its batch engine, making its latency inadequate for low latency applications. Recently, Infiniband has been integrated into Hadoop [22], along with HDFS [13] as well. Hadoop, HDFS and Spark all use the Java runtime for their implementations, hence RDMA was integrated using JNI wrappers to C/C++ code that invokes the underlying RDMA implementations. Recent work has added RDMA support for the Tensorflow [2] machine learning framework.

High performance interconnects have been widely used by the HPC community, and most MPI (Message Passing Interface) implementations have support for a large number of the interconnects that are available today. Some early work that describes RDMA for MPI in detail can be found in [20]. There has even been some work to build Hadoop-like systems using existing MPI capabilities [23]. Photon [16] is a higher level RDMA library that can be used as a replacement for libfabric. RoCE [5] and iWARP [25] are protocols designed to use RDMA over Ethernet to increase the maximum packets per second processed at a node while decreasing the latency. Heron does not take the network into account when scheduling tasks onto available nodes, and it would be interesting to consider network latencies as described in [14, 17] for task scheduling.

Figure 10: Percent of messages completed within a given la- tency for the Topology A with in-ƒight messages at 100 and 10 with 128K messages and 8 parallelism

Figure 12: ‡e total time to €nish the messages vs the total time to serialize the messages using protobuf and Kryo. Top. Figure 11: Percent of messages completed within a given la- B with 8 parallelism for bolts and spouts used. Times are tency for the Yahoo! streaming benchmark with In€niband for 20000 large messages and 200000 small messages. ‡e network. ‡e experiment was conducted in Haswell cluster experiment is conducted on Haswell cluster. with 8 nodes and each stage of topology have 8 parallel tasks.

8 CONCLUSIONS & FUTURE WORK
Unlike other Big Data systems, which are purely JVM-based, Heron has a hybrid architecture where it uses both low level and high level languages appropriately. This architecture allows the addition of high performance enhancements such as Infiniband natively, rather than going through additional layers of JNI as done in high performance Spark and Hadoop. With the Infiniband and Omni-Path integration in Heron, we have seen good performance gains both in latency and throughput. With the Infiniband integration we have seen the communication overhead moving closer to the object serialization cost. The architecture and the implementations can be improved further to reduce the latency and increase the throughput of the system. Even though the authors use Apache Heron in this paper, it is possible to implement high performance interconnects for other stream processing engines such as Apache Flink [7], Apache Spark [31] and Apache Apex [1] using JNI wrappers.

Past research has shown that the remote memory access operations of Infiniband are more efficient than using channel semantics for transferring large messages. A hybrid approach can be adopted to transfer messages using both channel semantics and memory semantics. It is evident that the CPU is a bottleneck at the stream managers for achieving better performance. The protocol buffer processing along with message serialization are the dominant CPU consumers. A more efficient binary protocol that does not require protocol buffer processing at the stream manager can avoid these overheads. Instance to stream manager communication can be improved with a shared memory approach to avoid the TCP stack. With such an approach, the Infiniband implementation can be improved to directly use the shared memory for the buffers without relying on data copying.
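One way to realize the hybrid approach suggested above is a simple size threshold: small messages continue to use the channel-semantics send path, while large messages are written directly into pre-registered remote memory with a one-sided RDMA write. The libfabric sketch below only outlines that idea; the threshold, the RemoteRegion structure and the out-of-band exchange of its contents are assumptions, and this is future work rather than part of the implementation evaluated in the paper.

```cpp
// Outline of a hybrid transfer path: channel semantics (fi_send) for small
// messages, one-sided RDMA write (fi_write) for large ones. The remote
// address and key are assumed to have been exchanged during bootstrap.
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_rma.h>

static const size_t RDMA_THRESHOLD = 64 * 1024;  // assumed cut-over point

struct RemoteRegion {       // advertised by the peer during bootstrap
  uint64_t addr;
  uint64_t key;
};

ssize_t hybrid_send(fid_ep *ep, void *local_desc, const char *buf, size_t len,
                    fi_addr_t peer, const RemoteRegion &remote) {
  if (len < RDMA_THRESHOLD) {
    // Small message: two-sided send, matched by a pre-posted receive buffer.
    return fi_send(ep, buf, len, local_desc, peer, /*context=*/nullptr);
  }
  // Large message: write directly into the peer's registered memory; the
  // peer must learn about the arrival through a separate notification.
  return fi_write(ep, buf, len, local_desc, peer, remote.addr, remote.key,
                  /*context=*/nullptr);
}
```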
Because of the single-process, single-threaded approach used by Heron processes, many-core systems such as Knights Landing cannot get optimum performance out of Heron. Having a hybrid architecture where multiple threads are used for both communication and computation, utilizing the hardware threads of the many-core systems, can increase the performance of Heron in such environments. Since we are using a fabric abstraction to program Infiniband and Omni-Path with Libfabric, the same code can be used with other potential high performance interconnects, though it has to be evaluated in such environments to identify possible changes. We would like to continue this work to include more high level streaming operations such as streaming joins and reductions on top of high performance interconnects.

ACKNOWLEDGMENT
This work was partially supported by NSF CIF21 DIBBS 1443054 and NSF RaPyDLI 1415459. We thank Intel for their support of the Haswell system, and extend our gratitude to the FutureSystems team for their support with the infrastructure. We would like to thank the Heron team for their support of this work.
REFERENCES
[1] 2017. Apache Apex: Enterprise-grade unified stream and batch processing engine. (2017). https://apex.apache.org/
[2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[3] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J Fernandez-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and others. 2015. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment 8, 12 (2015), 1792–1803.
[4] InfiniBand Trade Association and others. 2000. InfiniBand Architecture Specification: Release 1.0. InfiniBand Trade Association.
[5] Motti Beck and Michael Kagan. 2011. Performance Evaluation of the RDMA over Ethernet (RoCE) Standard in Enterprise Data Centers Infrastructure. In Proceedings of the 3rd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC-CaVES '11). International Teletraffic Congress, 9–15. http://dl.acm.org/citation.cfm?id=2043535.2043537
[6] Mark S Birrittella, Mark Debbage, Ram Huggahalli, James Kunz, Tom Lovett, Todd Rimmer, Keith D Underwood, and Robert C Zak. 2015. Intel Omni-Path Architecture: Enabling Scalable, High Performance Fabrics. In High-Performance Interconnects (HOTI), 2015 IEEE 23rd Annual Symposium on. IEEE, 1–9.
[7] Paris Carbone, Stephan Ewen, Seif Haridi, Asterios Katsifodimos, Volker Markl, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. Data Engineering (2015), 28.
[8] Josiah L Carlson. 2013. Redis in Action. Manning Publications Co.
[9] Geoffrey Fox, Judy Qiu, Shantenu Jha, Saliya Ekanayake, and Supun Kamburugamuve. 2015. Big data, simulations and hpc convergence. In Workshop on Big Data Benchmarks. Springer, 3–17.
[10] Maosong Fu, Ashvin Agrawal, Avrilia Floratou, Bill Graham, Andrew Jorgensen, Mark Li, Neng Lu, Karthik Ramasamy, Sriram Rao, and Cong Wang. 2017. Twitter Heron: Towards Extensible Streaming Engines. 2017 IEEE International Conference on Data Engineering (Apr 2017). http://icde2017.sdsc.edu/industry-track
[11] S. Gallenmüller, P. Emmerich, F. Wohlfart, D. Raumer, and G. Carle. 2015. Comparison of frameworks for high-performance packet IO. In 2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS). 29–38. DOI: http://dx.doi.org/10.1109/ANCS.2015.7110118
[12] P. Grun, S. Hefty, S. Sur, D. Goodell, R. D. Russell, H. Pritchard, and J. M. Squyres. 2015. A Brief Introduction to the OpenFabrics Interfaces - A New Network API for Maximizing High Performance Application Efficiency. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. 34–39. DOI: http://dx.doi.org/10.1109/HOTI.2015.19
[13] Nusrat S Islam, MW Rahman, Jithin Jose, Raghunath Rajachandrasekar, Hao Wang, Hari Subramoni, Chet Murthy, and Dhabaleswar K Panda. 2012. High performance RDMA-based design of HDFS over InfiniBand. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 35.
[14] Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, and Matthew Caesar. 2015. Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 407–420. DOI: http://dx.doi.org/10.1145/2785956.2787488
[15] Supun Kamburugamuve, Saliya Ekanayake, Milinda Pathirage, and Geoffrey Fox. 2016. Towards High Performance Processing of Streaming Data in Large Data Centers. In HPBDC 2016 IEEE International Workshop on High-Performance Big Data Computing in conjunction with The 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois USA.
[16] Ezra Kissel and Martin Swany. 2016. Photon: Remote memory access middleware for high-performance runtime systems. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 1736–1743.
[17] Adrian Klein, Fuyuki Ishikawa, and Shinichi Honiden. 2012. Towards Network-aware Service Composition in the Cloud. In Proceedings of the 21st International Conference on World Wide Web (WWW '12). ACM, New York, NY, USA, 959–968. DOI: http://dx.doi.org/10.1145/2187836.2187965
[18] Jay Kreps, Neha Narkhede, Jun Rao, and others. 2011. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB. 1–7.
[19] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream processing at scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 239–250.
[20] Jiuxing Liu, Jiesheng Wu, Sushmitha P Kini, Pete Wyckoff, and Dhabaleswar K Panda. 2003. High performance RDMA-based MPI implementation over InfiniBand. In Proceedings of the 17th annual international conference on Supercomputing. ACM, 295–304.
[21] Ruirui Lu, Gang Wu, Bin Xie, and Jingtong Hu. 2014. Stream bench: Towards benchmarking modern distributed stream computing frameworks. In Utility and Cloud Computing (UCC), 2014 IEEE/ACM 7th International Conference on. IEEE, 69–78.
[22] Xiaoyi Lu, Nusrat S Islam, Md Wasi-Ur-Rahman, Jithin Jose, Hari Subramoni, Hao Wang, and Dhabaleswar K Panda. 2013. High-performance design of Hadoop RPC with RDMA over InfiniBand. In 2013 42nd International Conference on Parallel Processing. IEEE, 641–650.
[23] Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, and Zhiwei Xu. 2014. DataMPI: extending MPI to hadoop-like big data computing. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International. IEEE, 829–838.
[24] Xiaoyi Lu, Md Wasi Ur Rahman, Nusrat Islam, Dipti Shankar, and Dhabaleswar K Panda. 2014. Accelerating spark with RDMA for big data processing: Early experiences. In High-performance interconnects (HOTI), 2014 IEEE 22nd annual symposium on. IEEE, 9–16.
[25] M. J. Rashti and A. Afsahi. 2007. 10-Gigabit iWARP Ethernet: Comparative Performance Analysis with InfiniBand and Myrinet-10G. In 2007 IEEE International Parallel and Distributed Processing Symposium. 1–8. DOI: http://dx.doi.org/10.1109/IPDPS.2007.370480
[26] Luigi Rizzo. 2012. Netmap: a novel framework for fast packet I/O. In 21st USENIX Security Symposium (USENIX Security 12). 101–112.
[27] A. Sodani. 2015. Knights Landing (KNL): 2nd Generation Intel Xeon Phi processor. In 2015 IEEE Hot Chips 27 Symposium (HCS). 1–24. DOI: http://dx.doi.org/10.1109/HOTCHIPS.2015.7477467
[28] Brian Tierney, Ezra Kissel, Martin Swany, and Eric Pouyoul. 2012. Efficient data transfer protocols for big data. In E-Science (e-Science), 2012 IEEE 8th International Conference on. IEEE, 1–9.
[29] Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, and others. 2014. Storm@ twitter. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data. ACM, 147–156.
[30] Kenton Varda. 2008. Protocol buffers: Google's data interchange format. Google Open Source Blog, Available at least as early as Jul (2008).
[31] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2–2.
[32] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. 2012. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Presented as part of the.
[33] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion Control for Large-Scale RDMA Deployments. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 523–536. DOI: http://dx.doi.org/10.1145/2785956.2787484