
Hermes: An Integrated CPU/GPU for IP Routing

Yuhao Zhu, Department of Electrical and Computer Engineering, The University of Texas at Austin, [email protected]
Yangdong Deng, Institute of Microelectronics, Tsinghua University, [email protected]
Yubei Chen, Institute of Microelectronics, Tsinghua University, [email protected]

ABSTRACT
With the constantly increasing traffic and fast changing network protocols, future routers have to simultaneously satisfy the requirements for throughput, QoS, flexibility, and scalability. In this work, we propose a novel integrated CPU/GPU microarchitecture, Hermes, for QoS-aware high-speed packet processing. We also develop a new scheduling mechanism, which significantly improves all QoS metrics.

Categories and Subject Descriptors
C.2.6 [Computer-Communication Networks]: Internetworking – Routers; C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors) – Single-instruction-stream, multiple-data-stream processors (SIMD); C.1.3 [Processor Architectures]: Other Architecture Styles – Heterogeneous (hybrid) systems

General Terms
Performance

Keywords
Software Router, QoS, CPU/GPU Integration

1. Introduction
As the backbone of the Internet, IP routers provide the physical and logic connections among multiple computer networks. When a packet arrives, a router determines which outbound network the packet should be forwarded to according to the current routing table and the packet's destination IP address. On modern routers, typical packet processing involves a series of operations on packet headers and bodies [1] performed by a CPU or a network processor. A router has to deliver a high forwarding throughput under stringent quality-of-service (QoS) requirements. In this work, we focus on improving packet processing performance by developing a novel integrated CPU/GPU microarchitecture as well as a corresponding QoS-aware scheduling mechanism.

Today's IP routers face a unique set of challenges. First of all, Internet traffic is still increasing exponentially, especially with the introduction of on-line video and P2P technologies (e.g., [2]). In fact, current core routers are required to deliver a throughput of 40Gbps-90Tbps [3]. Another trend is that new network services and protocols are appearing rapidly, although the Internet is still based on IPv4 and Ethernet technologies that were both developed in the 1960s [4]. As a result, the original protocols have to be extended and updated to adapt to today's network applications. Good programmability is crucial to meet such requirements.

Current router solutions can be classified into application-specific integrated circuit (ASIC) based, network processor based, and general-purpose processor (software) based [5]. However, none of these solutions can simultaneously meet the requirements for both high throughput and programmability.

Recently, graphics processing units (GPUs) have emerged as a new platform offering strong computing power [6]. GPUs also have a stable mass market and thus strong support for software development (e.g., [7] and [8]). Accordingly, it is appealing to leverage the mature GPU microarchitecture for network routing processing. Two recent works have already proved the performance potential of GPUs for accelerating packet processing [9][10].

Nevertheless, current GPU architectures still have serious limitations for routing processing. First, GPU computing requires packets to be copied from the CPU's main memory to the GPU's video memory. The extra memory copy introduces a performance overhead. Second, batch-based GPU processing cannot guarantee processing QoS for an individual packet, although such QoS requirements are critical for routers.

In this work, we develop novel solutions to augment an existing GPU microarchitecture for high-speed packet processing with QoS assurance. Our basic design philosophy is to make the best out of mature hardware and software solutions with minimal modifications. The contributions of this paper are as follows.
• We propose Hermes, an integrated CPU/GPU, shared-memory microarchitecture that is enhanced with an adaptive warp issuing mechanism for IP packet processing. To the best of our knowledge, this is the first work on developing a GPU-based packet processing platform that simultaneously optimizes all metrics of QoS.
• A complete set of router applications was implemented on Hermes, and we conducted extensive QoS evaluations on the Hermes microarchitecture. When compared with a GPU-accelerated software router [9], Hermes delivers a 5X enhancement in throughput, an 81.2% reduction in average packet delay, as well as a 72.9% reduction in delay variance.

The rest of the paper is organized as follows. Section 2 reviews the background of this work. Section 3 details the hardware and software designs of the Hermes microarchitecture. A thorough performance evaluation is presented in Section 4. Section 5 concludes the paper and outlines important future research directions.


2. Background and Motivation
The responsibility of a router is to deliver packets from ingress ports to egress ports in a timely manner. The fundamental tasks of IP-layer packet processing involve checking the IP header, packet classification, routing table lookup, decrementing the TTL value, and packet fragmentation. The critical path of a complete IP router task chain is as follows: Checking IP Header (CheckIPHeader) → Packet Classification (Classifier) → Routing Table Lookup (RTL) → Decrementing TTL (DecTTL) → IP Fragmentation (Fragmentation) [4]. In addition, to meet the ever-growing requirements for intrusion detection, deep packet inspection (DPI) [11] is increasingly deployed before routing processing or even as an integral part of modern IP routers. We use all the above tasks as benchmark applications in our experimental evaluations.

2.1 Current IP Router Solutions
Modern IP router solutions can be classified into three main categories: hardware routers, software routers, and programmable network processors (NPUs) [5].
Hardware routers depend on customized hardware, i.e., ASICs, to deliver the highest performance with the least power/area overhead. Nonetheless, hardware routers suffer from long design turnaround times, high non-recurring engineering (NRE) cost, and poor scalability and programmability. Such hurdles have gradually pushed ASIC-based solutions out of mainstream routers.
In contrast, software routers implement all packet processing applications as programs running on commodity computers (e.g., [12]). They are extremely flexible because any change in network configurations and protocols can be realized through re-programming. Furthermore, general-purpose processors target a much larger market and are thus backed by more mature operating systems and development tools. However, it is extremely challenging for pure software routers to deliver the processing power required by high-performance networks. Typically, software routers deliver a throughput of only 1-3Gbps, which is considerably lower than the required throughput of 40Gbps-90Tbps for core networking equipment [3]. Therefore, such routers can only be used in relatively small networks.
In the middle of the solution spectrum are network processing unit (NPU) based IP routers. NPUs are dedicated packet processing engines that integrate a number of identical processing elements designed for packet manipulation [13]. A clear downside of NPUs is that so far no effective programming models have been constructed, due to the limited size of the market and customer base [14]. The small volume of NPUs also leads to prohibitive per-chip cost. In addition, network processors often need customized logic modules for critical-path processing [3]. Such modules cannot be reprogrammed for new network protocols. These hurdles have already forced some top NPU vendors to close their network processor product lines [15] and resort to multi-core based router solutions (e.g., [16]).

2.2 GPU Architecture and Programming Model
Originally designed for graphics acceleration, GPUs have recently emerged as a high performance general-purpose computing platform [6]. A typical GPU is organized as an array of multiprocessors, or shader cores. Each shader core deploys multiple streaming processors (SPs) as well as a small amount of software-controlled shared memory. A GPU program deploys a large number of threads organized into blocks, with each block assigned to a unique shader core. A shader core then decomposes a block of threads into 32-thread warps. A warp is the basic unit of job scheduling on GPUs. Each warp always follows the same instruction schedule, with each thread handling a different data set. In other words, a warp of threads executes instructions in a single-instruction, multiple-data (SIMD) fashion. GPUs are backed by a video memory, or global memory in NVIDIA's CUDA terminology. The global memory offers a high memory bandwidth, but also incurs a long latency.
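As a concrete illustration of this execution model (our own sketch, not code from the paper), a conventional CUDA kernel for a simple per-packet task such as TTL decrement assigns one thread to each packet; the 32 threads of a warp then execute the same instructions on 32 different packets in SIMD fashion. The header layout and kernel name below are illustrative assumptions.

    // Illustrative sketch: one CUDA thread per packet under the conventional
    // block/thread organization described above.
    struct IPv4Header {                 // simplified, hypothetical header layout
        unsigned char  version_ihl, tos;
        unsigned short total_len, id, frag_off;
        unsigned char  ttl, protocol;
        unsigned short checksum;
        unsigned int   src_addr, dst_addr;
    };

    __global__ void decrement_ttl(IPv4Header *pkts, int num_pkts)
    {
        // Conventional CUDA indexing: each thread picks one packet.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= num_pkts) return;
        // All threads of a warp run this same code on different packets (SIMD).
        if (pkts[i].ttl > 0)
            pkts[i].ttl--;              // checksum update omitted in this sketch
    }

    // Host-side launch, e.g.: decrement_ttl<<<(n + 255) / 256, 256>>>(d_pkts, n);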
2.3 Pros and Cons of GPU-Based Packet Processing
Internet services are realized through a hierarchical organization of packet processing. Generally, the processing of one packet is independent of the others. This fundamental observation suggests that the GPU is potentially a good packet processing engine, because its many SP cores can handle multiple packets in parallel. Two recent works reported in [9] and [10] have already proved the potential of GPUs for packet processing. It is shown that a GPU-based software router can outperform a CPU baseline router by a factor of up to 30X.
However, two main problems still need to be resolved before GPU-accelerated software routers become practical. First of all, the communication mechanism between CPU and GPU seriously degrades system throughput. In fact, the packets arriving at the router are first copied to CPU main memory and then to GPU global memory through a PCI Express (PCIe) link with a peak bandwidth of 8GB/s [17]. The extra memory copy introduces performance and power overhead. The situation is exemplified by a signature matching application reported in [9]: the pure processing throughput of the GPU can be over 30 times higher than that of the CPU, but the speed-up degrades to 5X once the data transfer overhead is considered.
Secondly, the GPU's batch processing model introduces a "throughput vs. delay" dilemma. The design philosophy of the GPU is to employ its parallel architecture to maximize overall throughput. Therefore, to best leverage the computing power of the GPU, it is beneficial to accumulate data until there is sufficient parallelism. For packet processing, this means that the CPU needs to buffer enough packets before they can be transferred to the GPU for parallel processing. Such a mechanism can worsen the latency of certain packets. Suppose batch_transfer_granularity bytes of data are transferred between CPU and GPU each time, at a line-card rate of line_card_rate bytes per second. Then, in the extreme case, the earliest-arrived packet in a buffer has to wait for a time of batch_transfer_granularity / line_card_rate before it can be served. In addition, the GPU programming model typically organizes threads into blocks for batch processing. This organization further exacerbates the worst-case delay, because a block finishes execution only when all of its threads have completed.
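To put concrete (and purely illustrative) numbers on this dilemma, consider a hypothetical batch of 512 KB accumulated at a 10 Gbps (1.25 GB/s) line-card rate; these figures are ours, not measurements from [9] or [10]. The earliest-arrived packet then waits roughly

    batch_transfer_granularity / line_card_rate = 524,288 B / (1.25 × 10^9 B/s) ≈ 0.42 ms

before its batch is even handed to the GPU, on top of any transfer and computation time, and doubling the batch size doubles this queuing delay.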
3. Hermes System
In this work, we develop Hermes, a heterogeneous microarchitecture with the CPU and GPU integrated on a single chip. The memory copy overhead can thus be removed by sharing a common memory system between the CPU and GPU. On top of such a microarchitecture, we propose an adaptive warp issuing mechanism to reduce worst-case packet latency. In this section, we first discuss the underlying microarchitecture, and then detail the adaptive warp issuing mechanism as well as its corresponding software extensions.

3.1 CPU/GPU Integration with Shared Memory
With the rapidly growing integration capacity made available by the advancement of semiconductor technology, it is now feasible to integrate closely-coupled CPU and GPU cores on a single chip (e.g., [18]). The CPU and GPU access a common memory structure including both caches and DRAMs. Since Hermes is designed for high-performance network processing, we choose GDDR5 [19] DRAM chips for better bandwidth. A diagram of Hermes is shown in Figure 1.
The overall execution flow remains the same as in a classical heterogeneous CPU/GPU system (e.g., [9][10]). The CPU is responsible for creating and initializing data structures according to the packet processing applications. Upon the arrival of network packets, the packet data are stored into the shared memory and then fetched by shader cores for processing. Finally, the contents of the processed packets are updated in the shared memory, where they can be either further processed by the CPU or directly forwarded to their destinations.

Figure 1. Hermes microarchitecture

The shared memory system serves as a large packet buffer and thus avoids the classical buffer sizing problem (e.g., [20]). In a typical network environment, the rule of "Bandwidth-Delay Product" (BDP) [21] mandates a 1.25GB buffer size [22], which is impractical in traditional router designs. On the other hand, routers using smaller buffers suffer from a high packet loss rate [23]. The shared memory space in Hermes, however, is naturally large enough to hold a sufficient number of incoming packets (even in case of bursts) to guarantee packet availability.
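As a rough sanity check on that figure (our own arithmetic; the exact parameters behind [22] may differ), the BDP rule sizes the buffer as line rate × round-trip time. At 40 Gbps with a commonly assumed 250 ms RTT,

    40 × 10^9 b/s × 0.25 s = 10^10 bits = 1.25 GB.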

The programming model of Hermes is compatible with NVIDIA CUDA. Besides taking advantage of existing development tools, such a programming model also avoids the memory coherency problem that is typical of shared memory architectures: in CUDA, the CPU and GPU access data independently, and in the case of packet processing, the CPU and GPU operations on an individual packet are mutually exclusive. Nevertheless, it must be noted that out-of-order commitment of processed packets could introduce coherency problems, which will be further discussed in the next section.

3.2 Adaptive Warp Issuing Mechanism
The warp issuing mechanism of Hermes is responsible for assigning parallel tasks onto shader cores for further intra-core scheduling. On current GPUs, all thread warps are kept in a warp pool before being issued. In order to maximize overall throughput, warps are issued to shader cores following a best-effort strategy, which means the number of warps that can be issued in one round is constrained only by the number of available warps and by hardware resources such as the per-core register file and shared memory size. For packet processing applications, however, waiting until enough warps have accumulated can be unaffordable under certain circumstances. Therefore, we propose an adaptive warp issuing mechanism that adapts to the arrival pattern of network packets and maintains a good balance between overall throughput and worst-case per-packet delay.

3.2.1 Mechanism
Packets are received by network interface cards (NICs) and then copied to the shared memory via DMA transfers. The CPU is thus able to keep track of the number of arrived packets. Accordingly, the CPU is responsible for notifying the GPU to fetch packets for processing. As illustrated in Figure 1, a simple task FIFO suffices to support such an interaction.
When the CPU decides it is appropriate to report the availability of packets, it creates a new FIFO entry whose value is the number of packets ready for further processing, assuming the task FIFO is not full. Meanwhile, the GPU constantly monitors the FIFO and decides how many packets to fetch. Of course, the minimum granularity, i.e., number of packets, of one round of fetching by the GPU should be at least equal to the number of threads in one warp; otherwise, the GPU hardware is not fully utilized.
One essential question is how frequently the CPU should update the task FIFO. This is directly related to the transfer pattern from the NIC to the shared memory, and again a tradeoff has to be made. On the one hand, transferring a packet from the NIC to the shared memory involves book-keeping overhead, such as reading and updating the related buffer descriptors, and the corresponding extra bus transactions may be unaffordable [24]. In addition, updating the task FIFO too frequently complicates GPU fetching due to the restriction on the finest fetching granularity mentioned before. On the other hand, too large an interval between two consecutive updates increases the average packet delay. Moreover, setting a lower bound on the transfer granularity results, in the worst case, in a timeout problem, which would postpone the processing of some packets. If such a timeout happens, the NIC logic and corresponding device drivers have to be equipped with a timeout recovery mechanism. Considering these contradicting concerns, we set the minimum data transfer granularity to the size of a warp, i.e., 32 packets. In addition, if not enough packets arrive within a given interval, the packets that have arrived are still fetched and processed by the GPU; a similar approach is taken by [24]. The interval is chosen to be two warp arrival times, although ideally it should be derived from the packet arrival rate, i.e., an adaptive estimation. Upon finishing one transaction of data transfer (from NIC to system memory), the CPU notifies the GPU by updating the FIFO. Such a configuration generally guarantees that there are enough packets to be processed in one warp.
Newly created warps are put in a pool for issuing. A round-robin issuing strategy is employed to evenly distribute the workload among the shader cores. Clearly, it is necessary to keep track of the availability of hardware resources, which ultimately restricts the maximum number of concurrently active warps. Upon reaching the limit, warp issuing pauses until new processing slots become available.
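The interaction can be summarized with a small software model (our own illustration; the names, types, and polling loop are assumptions rather than the paper's hardware design): the CPU enqueues the count of newly arrived packets after each NIC DMA transfer or timeout, and a GPU-side dispatcher drains the FIFO and converts the announced packets into warp-sized units of work.

    // Illustrative software model of the task FIFO interaction in Section 3.2.1.
    #include <cstdint>

    const int FIFO_ENTRIES = 1024;    // 1K entries, as sized in Section 3.2.3
    const int WARP_SIZE    = 32;      // minimum fetch granularity: one warp of packets

    struct TaskFifo {
        uint32_t entries[FIFO_ENTRIES];   // each entry: number of packets made ready by the CPU
        uint32_t head = 0, tail = 0;
        bool full()  const { return tail - head == FIFO_ENTRIES; }
        bool empty() const { return tail == head; }
    };

    // CPU side: called after a DMA transfer from the NIC completes, or when the
    // timeout (about two warp arrival times) expires with fewer than 32 packets ready.
    void cpu_notify(TaskFifo &fifo, uint32_t packets_ready)
    {
        if (!fifo.full())
            fifo.entries[fifo.tail++ % FIFO_ENTRIES] = packets_ready;
        // else: overflow handling omitted in this sketch
    }

    // GPU side (modeled in software): accumulate announced packets and convert them
    // into warps, which are then issued round-robin across shader cores up to the
    // per-core limit on concurrently active warps.
    uint32_t gpu_fetch_warps(TaskFifo &fifo, uint32_t &pending_packets)
    {
        while (!fifo.empty())
            pending_packets += fifo.entries[fifo.head++ % FIFO_ENTRIES];
        uint32_t warps = pending_packets / WARP_SIZE;   // whole warps ready to issue
        pending_packets -= warps * WARP_SIZE;           // leftover waits for the next round
        return warps;
    }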
3.2.2 Guaranteed In-Order Warp Commit
With the fine-grained multithreading execution model, thread warps running on one shader core may finish in an arbitrary order, not to mention warps running on different shader cores. As a result, sequentially arrived packets could be processed in any order. Typically, TCP does not enforce the order of packet processing and commitment, since its header includes extra fields to enable retransmission and reassembly, but UDP does require in-order processing [25]. Therefore, it can be mandatory to keep the packet commitment order the same as the arrival order to guarantee compatibility among different protocols. In NPUs, a complicated task scheduler is responsible for this purpose [26], but our solution relies only on a simple Delay Commit Queue (DCQ).

Figure 2. Implementation of the Delay Commit Queue (DCQ) with a Lookup Table (LUT)

The key idea is to allow out-of-order warp execution but enforce in-order commitment. This resembles the Reorder Buffer (ROB) in a processor with hardware speculation, although our approach requires much less hardware. As illustrated in Figure 2, the DCQ holds the IDs of those warps that have finished but not yet committed. Every time a warp is about to be issued onto a shader core, and the DCQ is not full, a new entry is allocated. The mapping between a warp ID and its DCQ entry ID is recorded in a lookup table (LUT). Upon finishing, the corresponding DCQ entry is updated by indexing the LUT with the finished warp's ID. A warp can be committed only when all earlier-arrived warps have finished. Once a warp commits, its DCQ entry is reclaimed.
Interestingly, the DCQ also prevents hazards on packet data. The CPU should be aware of the warp status to perform further operations; such information is reflected by the head pointer of the DCQ. Without the DCQ, the write operation by the GPU and the read operation by the CPU may lead to a data hazard.
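The following software model (our own sketch, not the authors' hardware) captures the DCQ discipline: entries are allocated in warp arrival order at issue time, marked complete out of order as warps finish, and drained strictly from the head, so that the commit order seen by the CPU equals the arrival order.

    // Illustrative model of the Delay Commit Queue (DCQ) and per-core LUT of Figure 2.
    #include <cstdint>

    const int DCQ_ENTRIES        = 1024;   // 1K entries (Section 3.2.3)
    const int MAX_WARPS_PER_CORE = 32;     // upper bound on concurrent warps per core

    struct DCQ {
        bool     finished[DCQ_ENTRIES] = {};
        uint32_t head = 0, tail = 0;       // commit pointer / allocation pointer

        bool full() const { return tail - head == DCQ_ENTRIES; }

        // Issue time: allocate the next entry in arrival order; the returned index is
        // stored in the issuing core's warp-to-DCQ lookup table (LUT).
        uint32_t allocate() { return tail++ % DCQ_ENTRIES; }

        // Completion time: the shader core looks up the finished warp's DCQ index in
        // its LUT and marks the entry, possibly out of order.
        void mark_finished(uint32_t dcq_index) { finished[dcq_index] = true; }

        // Commit: drain from the head only, so a warp commits only after every
        // earlier-arrived warp has finished; packets thus commit in arrival order.
        uint32_t commit_ready()
        {
            uint32_t committed = 0;
            while (head != tail && finished[head % DCQ_ENTRIES]) {
                finished[head % DCQ_ENTRIES] = false;   // reclaim the entry
                head++;
                committed++;
            }
            return committed;   // the CPU may now safely read these packets
        }
    };

    // Per shader core: LUT mapping a hardware warp slot to its DCQ entry.
    struct WarpLUT {
        uint32_t dcq_index[MAX_WARPS_PER_CORE];
    };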
3.2.3 Hardware Implementation and Cost Estimation
Since Hermes is implemented by augmenting an existing GPU microarchitecture, it is necessary to evaluate the extra hardware cost. As mentioned earlier, we need three new components: the task FIFO, the Delay Commit Queue, and the DCQ-Warp LUTs storing the mapping of warp index to DCQ entry.
Both the task FIFO and the Delay Commit Queue need to be assigned a finite size, but there is no theoretical upper bound that always avoids overflow. Based on extensive experiments, we find that a size of 1K entries for both the task FIFO and the DCQ suffices for typical packet traces. Since the task FIFO stores the number of newly arrived tasks, its entries can be set to the size of one integer, i.e., 32 bits. The DCQ records the warp index, and thus its entry size can be set according to the total number of maximally allowed concurrent warps (MCWs) over all shader cores. In our work, the number of MCWs in a shader core is no larger than 32. Assuming 8 shader cores on the GPU, the DCQ's entry size can be set to 8 bits. For the DCQ-Warp LUT in a shader core, the number of entries equals the number of MCWs. Therefore, one LUT has 32 entries, with each entry having a warp index portion of 5 bits and a DCQ index portion of 10 bits (to identify a unique entry in the DCQ). To ease alignment, we use 16 bits for each entry. Altogether, for a GPU with 8 shader cores, we need 5.5KB of extra storage (4KB for the task FIFO, 1KB for the DCQ, and 0.5KB for the eight LUTs), which should be implemented in SRAM.
We use CACTI 5.1 [27] to estimate the area cost. Assuming a 45nm process, CACTI's SRAM model reports that the task FIFO and the DCQ cost 0.053mm2 and 0.013mm2 respectively, while the 8 DCQ-Warp LUTs take 0.006mm2 in total. Compared to the total area of one GPU chip, the hardware overhead is next to negligible.

3.3 API Modifications
To support the shared memory system and adaptive warp issuing, Hermes requires a few minor modifications to the host-side API. A new built-in variable is also required for GPU kernels. Currently implemented as a library, all modifications are based on the CUDA programming model [6] and thus can be integrated into the CUDA native language and runtime system.
With the CPU and GPU sharing the same memory, explicit memory copies are no longer necessary, so the memory copy APIs are not needed. Instead, we add two new memory management APIs, RMalloc(void **, size_t) and RFree(void *), for allocating and freeing memory.
In addition, we abandon the concept of the Cooperative Thread Array (CTA), or thread block, and directly organize threads into warps. In effect, the CTA size in Hermes is always equal to the warp size, i.e., 32. The reason for removing this level of the thread hierarchy is two-fold. First, we want to reduce the granularity of GPU batch processing to achieve a better average packet delay. Second, in packet processing applications, different packets (threads) naturally have no need to share data, so there is no reason to organize threads into a larger CTA. An implication of this decision is that we can no longer compute a unique index for every thread (packet) as in common CUDA practice, i.e., unique_id = blockIdx * blockDim + threadIdx. Instead, we define a new built-in variable, packetIdx, which is the only piece of information needed to program kernel code.
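RMalloc, RFree, and packetIdx are the only identifiers below that come from the paper; the kernel body, the packet layout (the IPv4Header sketch from Section 2.2), and the way a kernel is bound to incoming warps are our own hypothetical illustration of what a Hermes kernel might look like once blocks are replaced by bare warps.

    // Hypothetical sketch of a Hermes kernel and host code using the modified API.
    __global__ void dec_ttl_hermes(IPv4Header *pkts)
    {
        // packetIdx: new built-in giving this thread's packet index; no
        // blockIdx * blockDim + threadIdx arithmetic, since blocks no longer exist.
        IPv4Header *p = &pkts[packetIdx];
        if (p->ttl > 0)
            p->ttl--;                   // checksum update omitted in this sketch
    }

    void host_setup(size_t max_pkts)
    {
        IPv4Header *pkts;
        RMalloc((void **)&pkts, max_pkts * sizeof(IPv4Header));   // shared CPU/GPU buffer
        // The NIC DMAs packets directly into this buffer; no cudaMemcpy is needed.
        // Warps over the buffer are created by the adaptive issuing mechanism of
        // Section 3.2 rather than by an explicit <<<grid, block>>> launch.
        // ... packet processing runs here ...
        RFree(pkts);
    }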

4. Experimental Evaluation
In this work, we use GPGPU-Sim [28], a cycle-accurate GPU microarchitecture simulator that supports CUDA programs, to evaluate our modifications. The GPU microarchitecture configuration used in this work is presented in Table 1.

Table 1. Architectural Parameters
  # Shader cores: 8
  SIMD width: 32
  Warp size: 32
  Shader core frequency: 1000 MHz
  # Registers per shader core: 16768
  Shared memory size per shader core: 16 KByte
  Maximally allowed concurrent warps per core: user defined

It is worth noting that, in the current implementation of GPGPU-Sim, the host-side CUDA code runs on a normal CPU, while the kernel code is parsed and executed on the simulator. In other words, GPGPU-Sim can only evaluate the performance of GPU computations. To avoid the complexity and performance overhead of integrating a CPU simulator with GPGPU-Sim, we evaluate the performance advantage of Hermes merely in terms of the overhead of PCIe transfers, which clearly dominates in a GPU-accelerated software router like the one proposed in [9].
To evaluate Hermes, we implemented a complete CUDA-enabled software router covering all the tasks described in Section 2. For DPI, a Bloom filter based algorithm [29] that is amenable to GPU implementation is employed; we take the string rule sets from Snort [30] and replay network traffic with Tcpreplay [31]. Packet traces for routing table lookup are retrieved from RIS [32]. Packet classification implements a basic linear search algorithm and uses ClassBench [33] as the benchmark. The other three applications (CheckIPHeader, DecTTL, and Fragmentation) are adapted from RouterBench [34] and tested under WIDE traffic traces [35].
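To give a flavor of how such screening maps onto one thread per packet (a simplified sketch of our own, not the authors' implementation of [29]), each thread slides a fixed-length window over its packet's payload and tests two hashes against a bit vector precomputed from the Snort rules; only packets that hit in the filter proceed to exact string matching. The window length, hash functions, filter size, and 1500-byte slot layout are arbitrary choices made for this sketch.

    // Simplified illustration of Bloom-filter-based signature screening on the GPU.
    #define BLOOM_BITS (1 << 20)
    #define WINDOW_LEN 8

    __device__ unsigned int hash_a(const unsigned char *s) {
        unsigned int h = 2166136261u;                          // FNV-1a style
        for (int i = 0; i < WINDOW_LEN; i++) h = (h ^ s[i]) * 16777619u;
        return h % BLOOM_BITS;
    }
    __device__ unsigned int hash_b(const unsigned char *s) {
        unsigned int h = 5381u;                                // djb2 style
        for (int i = 0; i < WINDOW_LEN; i++) h = h * 33u + s[i];
        return h % BLOOM_BITS;
    }
    __device__ int test_bit(const unsigned int *bv, unsigned int b) {
        return (bv[b >> 5] >> (b & 31)) & 1u;
    }

    __global__ void dpi_screen(const unsigned char *payloads, const int *lengths,
                               const unsigned int *bloom, int *suspicious, int num_pkts)
    {
        int pkt = blockIdx.x * blockDim.x + threadIdx.x;       // or packetIdx on Hermes
        if (pkt >= num_pkts) return;

        const unsigned char *p = payloads + (size_t)pkt * 1500;  // one fixed-size slot per packet
        int hit = 0;
        for (int off = 0; off + WINDOW_LEN <= lengths[pkt]; off++) {
            if (test_bit(bloom, hash_a(p + off)) && test_bit(bloom, hash_b(p + off))) {
                hit = 1;                                       // possible signature: needs exact match
                break;
            }
        }
        suspicious[pkt] = hit;
    }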

[Figure 3 plots: each panel compares the line-card rate, the CPU/GPU baseline, and Hermes-8/16/32 (panel (f): Hermes with 8, 17, and 28 cores) for DPI, Classifier, RTL, and DecTTL; y-axes are Throughput (Gbps), Delay (1K cycles), Delay Variance (1K cycles), and Delay Ratio.]
Figure 3. QoS metrics: (a) throughput under burst traffic, (b) delay under burst traffic, (c) delay variance under burst traffic, (d) delay ratio, (e) throughput under sparse traffic, and (f) throughput scaling with the number of shader cores.

Since Hermes targets IP routing applications, QoS is of key importance when evaluating system performance. QoS is generally measured in terms of four major metrics: throughput, delay, delay variance, and availability [36]. Because we always perform loss-free tests, we omit the availability metric and only report results for the other three.
Throughput is defined as the total number of bits that can be processed and transferred during a given time period. Delay for a given packet is the time interval between the moment it enters the router and the time its processing completes; it consists of both queuing delay and service delay. When the packet arrival rate (line-card rate) exceeds the processing throughput of the system, succeeding packets have to wait until shader cores become available; this waiting time is the queuing delay. The service delay is the time for a packet to receive complete processing by a shader core (note that the time spent in the DCQ is included). The delay disparity of different packets, measured by the interquartile range, is designated as the delay variance. In our experiments, we regard one warp as the minimum granularity of packet processing and omit the negligible variance among different packets in one warp.
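For concreteness (our own post-processing sketch, not the authors' measurement harness), the three reported metrics can be derived from a per-packet delay trace plus the total number of bits processed over the measurement window:

    // Illustrative host-side computation of the three reported QoS metrics.
    #include <algorithm>
    #include <vector>

    struct QoSMetrics { double throughput_gbps, avg_delay, delay_iqr; };

    QoSMetrics summarize(std::vector<double> delays,   // per-packet delay samples (cycles)
                         double total_bits, double seconds)
    {
        QoSMetrics m;
        m.throughput_gbps = total_bits / seconds / 1e9;       // bits processed per second

        double sum = 0;
        for (double d : delays) sum += d;
        m.avg_delay = sum / delays.size();                     // average packet delay

        // Delay variance is reported as the interquartile range (IQR) of per-packet delays.
        std::sort(delays.begin(), delays.end());
        size_t n = delays.size();
        m.delay_iqr = delays[3 * n / 4] - delays[n / 4];
        return m;
    }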
According to our profiling, DPI, packet classification, and routing table lookup together consume nearly 90% of the total processing time, while the other three applications (CheckIPHeader, DecTTL, and Fragmentation) are much less demanding of processing power. In addition, the latter three applications have almost identical behaviors. Therefore, we use DecTTL as their representative.
Figure 3 shows the three QoS metrics for the four benchmark applications. The number of maximally allowed concurrent warps (MCW) as well as the line-card rate are tuned to obtain different QoS outcomes. We also present the influence of the delay commit queue and of the number of shader cores. It is worth noting that, due to the limited number of available registers, #MCW cannot be set to 32 for the packet classification application.
A burst traffic pattern, which requires packets to be buffered before being serviced, is used in Figures 3(a) to 3(d). A sparse traffic pattern is applied in 3(e). Both burst and sparse traffic are used in 3(f). Each application has its own line-card rate provided by its traffic trace, as shown by the leftmost column of the four column sets in 3(a), 3(e), and 3(f).
Figure 3(a) compares a traditional CPU/GPU system with different configurations of Hermes. The overall processing time of the CPU/GPU system consists of three components: packet-waiting overhead (the waiting time before enough packets are available for processing), PCIe transfer time, and GPU computation time. Hermes removes the PCIe transfer overhead and amortizes the packet-waiting time over computation. Therefore, the average throughput of Hermes still outperforms the CPU/GPU system by a factor of 5 in the best case, even though the adaptive issuing mechanism deviates from the throughput-oriented design philosophy of the GPU.
Hermes also delivers network packets with a much smaller delay than a traditional CPU/GPU system, as shown in 3(b). On the CPU/GPU system, packet delay is composed of waiting delay on the CPU side as well as processing delay on the GPU side. For RTL and DecTTL, due to their relatively simple processing on the GPU, the waiting overhead on the CPU side contributes a non-negligible part of the total delay. Therefore, the delay improvement is more significant for these two applications, since Hermes can overlap the CPU-side waiting overhead with GPU processing.

Comparing different configurations of Hermes, Hermes-8 (i.e., #MCW equal to 8, and so on) always performs the worst, while Classifier and RTL do not present a significant difference for the other three configurations. On average, the best case of Hermes reduces packet delay by 81.2%.
As shown in 3(c), Hermes also outperforms the CPU/GPU system in delay variance, by 72.9% on average. Interestingly, the delay variance displays a trend similar to the delay itself. This indicates that the tendency of packet processing is consistent over all delay values.
Although using the DCQ does not affect the overall throughput, Figure 3(d) shows its impact on packet delay, normalized to the corresponding cases without the DCQ. The DCQ always results in longer packet delay, especially for DPI and DecTTL. This is because, in these two applications, packets taking divergent branches consume much more time than those following convergent branches. The longer processing time forces later-arrived packets to be buffered in the DCQ, deteriorating the average delay.
We also perform a sparse traffic test in which the packet arrival rate is lower than the computing rate of the Hermes shader cores. As illustrated in Figure 3(e), packets can now be issued without being queued. Therefore, they can be finished almost at the arrival rate, penalized only by the transfer overhead. Even in this case, the CPU/GPU system [9] is still unable to deliver the packets at their arrival rate.
Finally, 3(f) demonstrates the scalability of the Hermes system. As the number of shader cores increases from 8 to 17 to 28, by changing the mesh configuration in GPGPU-Sim from 4x4 to 5x5 to 6x6, the overall performance scales accordingly. Note that in DecTTL the scaling factor does not completely follow that of the shader cores.
This is because the packet arrival rate is too sparse for 28-core Hermes, so the shader cores are not fully utilized, as evidenced by the fact that in DecTTL the throughput of Hermes-28 is approximately equal to the line-card rate.

5. Conclusion and Future Work
In this work, we proposed an integrated CPU/GPU microarchitecture with a QoS-aware GPU scheduling mechanism for accelerating IP routing. A complete set of router applications was implemented on this architecture. Experimental results proved that the new microarchitecture can meet stringent delay requirements while at the same time maintaining a high processing throughput. Through minimal augmentation of the current GPU microarchitecture, this work opens a new path toward building high quality packet processing engines for future software routers. In the future, we will first explore the possibility of a better communication mechanism between the NIC and Hermes, since it now turns out to be another performance bottleneck. In addition, we are exploring a hardware and software framework to exert both task- and data-level QoS control on GPU-like architectures.

6. References
[1] F. Baker. Requirements for IP Version 4 Routers. Internet RFC 1812, June 1995.
[2] E. Schumacher-Rasmussen. Cisco Predicts Video Will Make Up 91% of All Internet Traffic by 2014. http://www.streamingmediaeurope.net/2010/06/02/cisco-predicts-video-will-make-up-91-of-all-internet-traffic-by-2014, 2010.
[3] W. Eatherton. The Push of Network Processing to the Top of the Pyramid. Keynote speech at ANCS, 2005.
[4] L. De Carli, et al. PLUG: Flexible Lookup Modules for Rapid Deployment of New Protocols in High-speed Routers. In Proc. of SIGCOMM, 2009.
[5] H. J. Chao and B. Liu. High Performance Switches and Routers. Wiley-Interscience, 2007.
[6] D. Blythe. Rise of the Graphics Processor. Proceedings of the IEEE, vol. 96, no. 5, pp. 761-778, 2008.
[7] NVIDIA. CUDA Programming Guide 2.3. 2009.
[8] J. Hensley. AMD CTM Overview. In International Conference on Computer Graphics and Interactive Techniques, 2007.
[9] S. Mu, et al. IP Routing Processing with Graphic Processors. In Proc. of DATE, 2010.
[10] S. Han, et al. PacketShader: a GPU-Accelerated Software Router. In Proc. of SIGCOMM, 2010.
[11] G. Varghese. Network Algorithmics. Elsevier/Morgan Kaufmann, 2005.
[12] E. Kohler, et al. The Click Modular Router. ACM Transactions on Computer Systems, vol. 18, no. 3, 2000.
[13] M. Peyravian and J. Calvignac. Fundamental Architectural Considerations for Network Processors. International Journal of Computer and Telecommunications Networking, 41(5), April 2003.
[14] C. Kulkarni, et al. Programming Challenges in Network Processor Deployment. In Proc. of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pp. 178-187, 2003.
[15] R. Merritt. Intel Shifts Network Chip to Startup. EE Times. http://www.eetimes.com/news/latest/showArticle.jhtml?articleID=202804472, 2007.
[16] Intel Whitepaper. Packet Processing with Intel Multi-Core Processors. 2008.
[17] PCI-SIG. PCI Express Base 2.0 Specification. http://www.pcisig.com/specifications/pciexpress/, 2007.
[18] AMD. The AMD Fusion Family of APUs. http://sites.amd.com/us/fusion/APU/Pages/fusion.aspx.
[19] Wikipedia. GDDR5. http://en.wikipedia.org/wiki/GDDR5.
[20] D. Wischik and N. McKeown. Buffer Sizes for Core Routers. ACM SIGCOMM Computer Communication Review, July 2005.
[21] C. Villamizar and C. Song. High Performance TCP in ANSNet. ACM SIGCOMM Computer Communication Review, 24(5):45-60, 1994.
[22] A. Vishwanath, et al. Perspectives on Router Buffer Sizing: Recent Results and Open Problems. ACM SIGCOMM Computer Communication Review, April 2009.
[23] A. Dhamdhere and C. Dovrolis. Open Issues in Router Buffer Sizing. ACM SIGCOMM Computer Communication Review, January 2006.
[24] N. Egi, et al. Understanding the Packet Processing Capability of Multi-Core Servers. Intel Technical Report.
[25] J. Postel. User Datagram Protocol. Internet RFC 768, August 1980.
[26] T. Wolf and M. A. Franklin. Locality-Aware Predictive Scheduling of Network Processors. In Proc. of ISPASS, 2001.
[27] S. Thoziyoor, et al. CACTI 5.1. Technical Report HPL-2008-20, Hewlett-Packard Laboratories, Palo Alto, April 2008.
[28] A. Bakhoda, et al. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Proc. of ISPASS, 2009.
[29] B. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, vol. 13, pp. 422-426, July 1970.
[30] The Snort Project. Snort Users Manual 2.8.0. http://www.snort.org/docs/snort/manual/2.8.0/snort manual.pdf.
[31] Tcpreplay. http://tcpreplay.synfin.net/trac/.
[32] Routing Information Service (RIS). http://www.ripe.net/projects/ris/rawdata.html.
[33] ClassBench: A Packet Classification Benchmark. http://www.arl.wustl.edu/classbench/index.htm.
[34] Y. Luo, et al. Shared Memory Multiprocessor Architectures for Software IP Routers. IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 12, December 2003.
[35] MAWI Working Group Traffic Archive. http://mawi.wide.ad.jp/mawi/.
[36] P. M., IP , Helsinki University of Technology, Laboratory of Telecommunications Technology, 1999.
