Packet Processing on the GPU

Matthew Mukerjee, David Naylor, Bruno Vavala
Carnegie Mellon University
[email protected]  [email protected]  [email protected]

ABSTRACT
Packet processing in routers is traditionally implemented in hardware; specialized ASICs are employed to forward packets at line rates of up to 100 Gbps. Recently, however, improving hardware and an interest in complex packet processing have prompted the networking community to explore software routers; though they are more flexible and easier to program, achieving forwarding rates comparable to hardware routers is challenging.
Given the highly parallel nature of packet processing, a GPU-based architecture is an inviting option (and one which the community has begun to explore). In this project, we explore the pitfalls and payoffs of implementing a GPU-based software router.

1. INTRODUCTION
There are two approaches to implementing packet processing functionality in routers: bake the processing logic into hardware, or write it in software. The advantage of the former is speed; today's fastest routers perform forwarding lookups in dedicated ASICs and achieve line rates of up to 100 Gbps (the current standard line rate for a core router is 40 Gbps) [2]. On the other hand, the appeal of software routers is programmability; software routers can more easily support complicated packet processing functions and can be reprogrammed with updated functionality.
Traditionally, router manufacturers have opted for hardware designs since software routers could not come close to achieving the same forwarding rates. Recently, however, software routers have begun gaining attention in the networking community for two primary reasons:
1. Commodity hardware has improved significantly, making it possible for software routers to achieve line rates. For example, PacketShader forwards packets at 40 Gbps [2].
2. Researchers are introducing increasingly more complex packet processing functionality. A good example of this trend is Software Defined Networking [5], wherein each packet is classified as belonging to one or more flows and is forwarded accordingly; moreover, the flow rules governing forwarding behavior can be updated on the order of minutes [5] or even seconds [3].
Part of this new-found attention for software routers has been an exploration of various hardware architectures that might be best suited for supporting software-based packet processing. Since packet processing is naturally an SIMD application, a GPU-based router is a promising candidate. The primary job of a router is to decide, based on a packet's destination address, through which output port the packet should be sent. Since modern GPUs contain hundreds of cores [8], they could potentially perform this lookup on hundreds of packets in parallel. The same goes for more complex packet processing functions, like flow rule matching in a software defined network (SDN).
This project is motivated by two goals: first, we seek to implement our own GPU-based software router and compare our experience to that described by the PacketShader team (see §2). Given time (and budget!) constraints, it is not feasible to expect our system to surpass PacketShader's performance. Our second goal is to explore scheduling the execution of multiple packet processing functions (e.g., longest prefix match and firewall rule matching).
The rest of this paper is organized as follows: first we review related work that uses the GPU for network processing in §2. Then we explain the design of our system in §3. §4 describes our evaluation setup and presents the results of multiple iterations of our router. Finally, we discuss our system and future work in §5 and conclude in §6.

2. RELATED WORK
We are by no means the first to explore the use of a GPU for packet processing. We briefly summarize here the contributions of some notable projects.

PacketShader [2] implements IPv4/IPv6 forwarding, IPsec, and OpenFlow [5] flow matching on the GPU using NVIDIA's CUDA architecture. They were the first to demonstrate the feasibility of multi-10Gbps software routers.
A second key contribution from PacketShader is a highly optimized packet I/O engine. By allocating memory in the NIC driver for batches of packets at a time rather than individually and by removing unneeded meta-data fields (parts of the Linux networking stack needed by endhosts are never used in a software router), they achieve much higher forwarding speeds, even before incorporating the GPU. Due to time constraints, we do not make equivalent optimizations in our NIC driver, and so we cannot directly compare our results with PacketShader's.

Gnort [11] ports the Snort intrusion detection system (IDS) to the GPU (also using CUDA). Gnort uses the same basic CPU/GPU workflow introduced by PacketShader (see §3); its primary contribution is implementing fast string pattern-matching on the GPU.

Hermes [12] builds on PacketShader by implementing a CPU/GPU software router that dynamically adjusts batch size to simultaneously optimize multiple QoS metrics (e.g., bandwidth and latency).

3. BASIC SYSTEM DESIGN
We use the same system design presented by PacketShader and Gnort. Pictured in Figure 1, packets pass through our system as follows: (1) as packets arrive at the NIC, they are copied to main memory via DMA; (2) software running on the CPU copies these new packets to a buffer until a sufficient number have arrived (see below), and (3) the batch is transferred to memory on the GPU; (4) the GPU processes the packets in parallel and fills a buffer of results on the GPU, which (5) the CPU copies back to main memory when processing has finished; (6) using these results, the CPU instructs the NIC(s) where to forward each packet in the batch; (7) finally, the NIC(s) fetch the packets from main memory with another DMA and forward them.

Figure 1: Basic System Design. (1) Incoming packets (DMA); (2) fill buffer of packets; (3) copy packets to GPU global memory; (4) process on the GPU's SMs; (5) copy results back; (6) forward packets; (7) outgoing packets (DMA).

3.1 GPU Programming Issues
Pipelining. As with almost any CUDA program, ours employs pipelining for data transfers between main memory and GPU memory. While the GPU is busy processing a batch of packets, we can utilize the CPU to copy the results from the previous batch of packets from the GPU back to main memory and copy the next batch of packets to the GPU (Figure 2). In this way, we never let CPU cycles go idle while the GPU is working.

Figure 2: Pipelined Execution. (The CPU's host-to-device and device-to-host memory copies for one batch overlap with the GPU's processing of another.)

Batching. Although the GPU can process hundreds of packets in parallel (unlike the CPU), there is, of course, a cost: non-trivial overhead is incurred in transferring packets to the GPU and copying results back from it. To amortize this cost, we process packets in batches. As packets arrive at our router, we buffer them until we have a batch of some fixed size BATCH_SIZE, at which point we copy the whole batch to the GPU for processing.
Though batching improves throughput, it can have an adverse effect on latency. The first packet of a batch arrives at the router and is buffered until the remaining BATCH_SIZE - 1 packets arrive rather than being processed right away. To ensure that no packet suffers unbounded latency (i.e., in case there is a lull in traffic and the batch doesn't fill), we introduce another parameter: MAX_WAIT. If a batch doesn't fill after MAX_WAIT milliseconds, the partial batch is transferred to the GPU for processing. We evaluate this tradeoff in §4.

Mapped Memory. Newer GPUs (such as ours) have the ability to directly access host memory that has been pinned and mapped, eliminating steps (3) and (5). Though data clearly still needs to be copied to the GPU and back, using mapped memory rather than explicitly performing the entire copy at once allows CUDA to copy individual lines "intelligently" as needed, overlapping copies with execution. We evaluate the impact of using mapped memory in §4 as well.
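To make the batching and mapped-memory scheme above concrete, the host-side loop might look roughly like the following CUDA sketch. BATCH_SIZE, MAX_WAIT_MS, HDR_BYTES, gather_packet(), forward_packet(), and the trivial process_batch kernel are illustrative stand-ins, not our framework's actual interface.

// Sketch only: batching with BATCH_SIZE/MAX_WAIT into pinned, mapped
// ("zero-copy") buffers. All names and sizes are illustrative assumptions.
#include <cuda_runtime.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BATCH_SIZE 8192
#define MAX_WAIT_MS 10
#define HDR_BYTES 64

__global__ void process_batch(const uint8_t *hdrs, int *out_port, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out_port[i] = 0;            // real lookup would go here
}

// Stand-ins for the framework: produce a dummy header / forward a packet.
static int gather_packet(uint8_t *hdr) { memset(hdr, 0, HDR_BYTES); return 1; }
static void forward_packet(int idx, int port) { (void)idx; (void)port; }

static double ms_now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

int main(void) {
    uint8_t *hdrs; int *ports;             // host pointers (pinned + mapped)
    uint8_t *d_hdrs; int *d_ports;         // device aliases of the same memory
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&hdrs,  BATCH_SIZE * HDR_BYTES,   cudaHostAllocMapped);
    cudaHostAlloc((void **)&ports, BATCH_SIZE * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_hdrs,  hdrs,  0);
    cudaHostGetDevicePointer((void **)&d_ports, ports, 0);

    for (;;) {
        int n = 0;
        double start = ms_now();
        // Fill a batch, but never hold packets longer than MAX_WAIT_MS.
        while (n < BATCH_SIZE && ms_now() - start < MAX_WAIT_MS)
            if (gather_packet(hdrs + n * HDR_BYTES)) n++;
        if (n == 0) continue;

        // The kernel reads headers and writes results through mapped memory;
        // CUDA overlaps those per-line transfers with execution.
        process_batch<<<(n + 255) / 256, 256>>>(d_hdrs, d_ports, n);
        cudaDeviceSynchronize();
        for (int i = 0; i < n; i++) forward_packet(i, ports[i]);
    }
}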

3.2 Packet Processing
A router performs one or more packet processing functions on each packet it forwards. This includes deciding where to forward the packet (via a longest prefix match lookup on the destination address or matching against a set of flow rules) and might also include deciding whether or not to drop the packet (by comparing the packet header against a set of firewall rules or comparing the header and payload to a set of intrusion detection system rules). In our system we implement two packet processing functions: longest prefix match and firewall rule matching. We pick longest prefix match because it is the core algorithm used by routers to determine which port to forward a packet on. We additionally choose firewall rule matching as it is a very common feature on commodity routers that is simple to implement (only requiring packet header inspection).

3.2.1 Longest Prefix Match
Longest Prefix Match is an algorithm for route lookup. It performs a lookup on a data structure whose entries are prefix-port pairs. The prefixes identify routes and have the form IP address/mask (e.g., 192.168.1.0/24). The ports represent the link to reach the next hop, possibly the destination. Once the router receives a packet to forward, it selects the entry in the data structure whose prefix is the longest (i.e., has the largest mask) among those that match the destination address in the packet. Then, the associated output port is returned.
We adapted the algorithm from [9] for our work. The algorithm uses a binary trie, a tree-like data structure that keeps the prefixes ordered for fast lookup. In particular, the position of each internal node in the trie defines a prefix which is constant in the subtree rooted at that node. The depth of a prefix is the length of its mask (i.e., between 0 and 32, in our case), which is thus a bound on the lookup time.
The lookup procedure is extremely simple. A depth-first search is run on the trie using the destination address of the packet being processed, so each node lying on the path describing the associated prefix is visited. At each step, the output port is updated if an entry is stored at that node (i.e., the longest match so far). The last update is returned, or none if there is no matching prefix.
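A minimal CPU-side sketch of this lookup is shown below, assuming a simple pointer-based node layout; the field names are illustrative and differ from the serialized array representation we actually ship to the GPU (see §4.1.4).

/* Sketch of the longest-prefix-match lookup on a binary trie. */
#include <stdint.h>

struct trie_node {
    struct trie_node *child[2];  /* child[0]: next bit 0, child[1]: next bit 1 */
    int port;                    /* output port if a prefix ends here, else -1 */
};

/* Walk at most 32 bits of the destination address, remembering the last
 * node that stored an entry; that entry is the longest matching prefix. */
int lpm_lookup(const struct trie_node *root, uint32_t dst_addr) {
    int best_port = -1;
    const struct trie_node *n = root;
    for (int depth = 0; n != NULL && depth < 32; depth++) {
        if (n->port >= 0)
            best_port = n->port;
        int bit = (dst_addr >> (31 - depth)) & 1;  /* most-significant bit first */
        n = n->child[bit];
    }
    if (n != NULL && n->port >= 0)   /* a full /32 match */
        best_port = n->port;
    return best_port;                /* -1 means no matching prefix */
}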
3.2.2 Firewall Rule Matching
A firewall rule is defined by a set of five values (the "5-tuple"): source and destination IP address, source and destination port number, and protocol. The protocol identifies who should process the packet after IP (e.g., TCP, UDP, or ICMP). Since most traffic is either TCP or UDP [10], the source and destination port numbers are also part of the 5-tuple even though they are not part of the network-layer header. If the protocol is ICMP, these fields can instead be used to encode the ICMP message type and code.
Each rule is also associated with an action (usually ACCEPT or REJECT). If a packet matches a rule (that is, the values in the packet's 5-tuple match those in the rule's 5-tuple), the corresponding action is performed. Rather than specifying a particular value, a rule might define a range of matching values for one of the fields of its 5-tuple. A rule might also use a wildcard (*) for one of the values if that field should not be considered when deciding if a packet matches. For example, a corporate firewall might include the following rule: (*, [webserver-IP], *, 80, TCP):ALLOW. This rule allows incoming traffic from any IP address or port to access the company's webserver using TCP on port 80.
Our rule matching algorithm is incredibly simple. Rules are stored in order of priority and a packet is checked against each one after the next in a linear search. Though this sounds like a naïve approach, [7] claims that it is the approach taken by open source firewalls like pf and iptables, and likely other commercial firewalls as well. We discuss how we generate the rules used in our tests in §4.1.4.
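A sketch of this linear match is shown below, assuming exact-value rules with 0 standing in for the wildcard (ranges are omitted for brevity); the struct layouts are illustrative only. The same per-packet loop is what a single GPU thread would execute.

/* Sketch of the priority-ordered linear 5-tuple match described above. */
#include <stdint.h>

enum action { ACTION_ACCEPT, ACTION_REJECT };

struct rule {
    uint32_t src_ip, dst_ip;      /* 0 acts as the wildcard '*' */
    uint16_t src_port, dst_port;  /* 0 acts as the wildcard '*' */
    uint8_t  proto;               /* 0 acts as the wildcard '*' */
    enum action act;
};

struct pkt5 {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Rules are stored in priority order; the first match wins. */
enum action match_packet(const struct rule *rules, int n_rules,
                         const struct pkt5 *p, enum action dflt) {
    for (int i = 0; i < n_rules; i++) {
        const struct rule *r = &rules[i];
        if ((r->src_ip   == 0 || r->src_ip   == p->src_ip)   &&
            (r->dst_ip   == 0 || r->dst_ip   == p->dst_ip)   &&
            (r->src_port == 0 || r->src_port == p->src_port) &&
            (r->dst_port == 0 || r->dst_port == p->dst_port) &&
            (r->proto    == 0 || r->proto    == p->proto))
            return r->act;
    }
    return dflt;                  /* no rule matched */
}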

4. EVALUATION

4.1 Experimental Setup

4.1.1 Hardware
We ran our tests on a mid-2012 MacBook Pro with a 2.6 GHz Core i7 processor, 8GB RAM, and an NVIDIA GeForce GT 650M graphics card with 1GB memory.

4.1.2 Router Framework
We implemented a software router framework capable of running in one of two modes: CPU-only or CPU/GPU. We use the router in CPU-only mode to gather a baseline against which we can compare the performance of CPU/GPU mode. In either mode, the framework handles gathering batches of packets, passing them to a "pluggable" packet processing function, and forwarding them based on the results of processing.
We implemented three processing functions: one that does longest prefix match, one that does firewall rule matching, and one that does both. For each we implemented a CPU version and a GPU version (the GPU version is a CUDA kernel function).
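To illustrate what "pluggable" means here, the hook might look roughly like the following; the typedef and structure are assumptions for exposition, not the framework's actual code.

/* Sketch: the framework hands each batch to whichever function is plugged in. */
#include <stdint.h>

/* Process n packet headers; write one output port (or drop verdict) per packet. */
typedef void (*process_fn)(const uint8_t *hdrs, int *results, int n);

struct router_config {
    process_fn process;   /* e.g. lpm_cpu, lpm_gpu, fw_cpu, lpm_fw_gpu, ... */
    int batch_size;       /* BATCH_SIZE */
    int max_wait_ms;      /* MAX_WAIT */
};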

4.1.3 Packet Generation
We use the Click Modular Router [4] to generate packets for our software router framework to process. We modify the standard packet generation functions in Click to output UDP packets with random source and destination addresses as well as random source and destination ports. We have Click generate these randomly addressed packets as fast as it can and have it send the packets up to our software router framework via a standard Berkeley socket between the two locally hosted applications.
Note that this "full-on" kind of workload is essential to test the maximum throughput of our router framework, but does not necessarily model a particular kind of real-world workload. However, in the case of our system, we wish to measure the feasibility of processing packets in software at the same speed as specially-designed hardware ASICs. Thus our evaluation focuses on comparisons of maximum throughput.

4.1.4 Packet Processing
Longest Prefix Match. First of all, since CUDA provides limited support for dynamic data structures such as a trie, we adapted the algorithm to run on the GPU. In particular, in the initial setup of the trie, we serialize the data structure into an array, so that it can be easily transferred to the GPU's memory.
In order to work with a realistic dataset, we initialized the FIB with prefixes belonging to actual Autonomous Systems. The list was retrieved from CAIDA [1], which periodically stores a simplified version of its RouteViews Collectors' BGP tables. Overall, the size of the trie turns out to be a few tens of megabytes. Once built, the serialized trie is transferred to the GPU to be used during lookup.
Firewall Rule Matching. To evaluate the performance of our router's firewall rule matching function, we generate a set of random firewall rules. The 5-tuple values for our random rules are chosen according to the distributions in [7]. For example, we use the probabilities in Table 1, taken from [7], to pick a new random rule's protocol. Similar statistics were used to pick source/destination addresses and ports.

Table 1: Protocol distribution in firewall rules
Protocol    Prevalence in rules
TCP         75%
UDP         14%
ICMP         4%
*            6%
Other        1%
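The serialization step mentioned above under Longest Prefix Match might look roughly like this: pointer children become array indices so the whole trie can be copied to the device in one transfer. The breadth-first layout and field names are illustrative assumptions.

/* Sketch: flatten the pointer-based trie into a GPU-friendly array. */
#include <stdint.h>
#include <stdlib.h>

struct trie_node {                 /* pointer-based trie (cf. Section 3.2.1) */
    struct trie_node *child[2];
    int port;                      /* -1 if no prefix ends at this node */
};

struct flat_node {                 /* contiguous representation for the GPU */
    int32_t child[2];              /* index of child in the array, or -1 */
    int32_t port;
};

/* Breadth-first copy into a caller-provided buffer of at most max nodes.
 * Returns the number of nodes written; out[0] is the root. */
int serialize_trie(const struct trie_node *root, struct flat_node *out, int max) {
    if (root == NULL || max <= 0) return 0;
    const struct trie_node **queue = malloc((size_t)max * sizeof(*queue));
    int head = 0, tail = 0;
    queue[tail++] = root;
    while (head < tail) {
        const struct trie_node *n = queue[head];
        out[head].port = n->port;
        for (int b = 0; b < 2; b++) {
            if (n->child[b] != NULL && tail < max) {
                out[head].child[b] = tail;
                queue[tail++] = n->child[b];
            } else {
                out[head].child[b] = -1;   /* missing child (or buffer full) */
            }
        }
        head++;
    }
    free(queue);
    return head;                   /* the array can now be copied to the device */
}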
4.2 Evaluation Metrics
We use three different metrics to evaluate our router's performance. The results in §4.3 present these values averaged over 30 seconds of router operation.
Bandwidth. No router performance evaluation would be complete without considering bandwidth, and so we measure the number of 64-byte packets forwarded by our router per second, from which we calculate bandwidth in gigabits per second. For comparison, the norm for core routers is about 40 Gbps, with high-end routers currently maxing out around 100 Gbps.
Latency. Bandwidth only tells part of the story, however; the delay a single packet experiences at a router is important as well (for some applications, it is more important than bandwidth). Since some optimizations aimed at increasing our router's bandwidth increase latency as well (such as increasing batch sizes), measuring latency is important.
We measure both the minimum and maximum latencies of our router. The maximum latency is the time from the arrival of the first packet of a batch to the time it is forwarded; the minimum latency is the same but for the last packet of a batch.
Processing Time. We also consider a microbenchmark: the time spent in the packet processing function itself (i.e., doing the longest prefix match lookup or matching against firewall rules). This is largely a sanity check; we expect to see the GPU perform much better here, though actual performance (measured by bandwidth and latency) depends on many other factors (like the time spent copying data to/from the GPU).

4.3 Results
Our router went through four iterations, each one introducing optimizations based on what we learned from the results of the last. We therefore present our results in four stages, guiding the reader through our design process.

4.3.1 Iteration One
We began by naïvely implementing the design presented in §3; at this point, we made no optimizations — the goal was building a functioning GPU-based router.

Not surprisingly, it performed underwhelmingly. The GPU version of our router achieved roughly 80% of the bandwidth the CPU version did (Figure 3(a)) and its (max and min) latencies were 1.3X longer (Figure 3(b)). The processing time is not to blame; as expected, the GPU performs the actual packet processing much faster than the CPU (Figure 3(c) — the GPU processing time is so small it hugs the X axis). A quick examination of the time spent performing different functions (Figure 3(d)) explains where the GPU router is losing ground: although it spends less time processing, it has to copy packets to the GPU for processing and copy the results back, tasks the CPU version doesn't need to worry about (Figure 3(e)).

Figure 3: Iteration 1 Results. Panels: (a) Bandwidth vs. Batch Size; (b) Latency vs. Batch Size; (c) Processing Time vs. Batch Size; (d) Breakdown of Relative GPU Time; (e) Breakdown of Relative CPU Time.

4.3.2 Iteration Two
Since both of our packet processing functions operate only on the data carried by the packet header, we can reduce the time spent copying data to the GPU by copying only the packet headers. Unfortunately, the results do not contain unnecessary information and cannot easily be condensed (this is not completely true — it is probably possible to compress the results, though we do not explore this here).
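A sketch of the header-only staging used in this iteration follows; HDR_BYTES, the per-packet slot size, and the buffer layout are assumptions rather than our exact code.

/* Sketch: stage only the first HDR_BYTES of each packet before the copy. */
#include <string.h>
#include <stdint.h>
#include <cuda_runtime.h>

#define HDR_BYTES 64
#define PKT_SLOT 2048            /* bytes reserved per packet in the rx buffer */

void copy_headers_to_gpu(const uint8_t *pkts, int n,
                         uint8_t *staging, uint8_t *d_hdrs) {
    for (int i = 0; i < n; i++)  /* drop payloads; only headers go to the GPU */
        memcpy(staging + (size_t)i * HDR_BYTES,
               pkts + (size_t)i * PKT_SLOT, HDR_BYTES);
    cudaMemcpy(d_hdrs, staging, (size_t)n * HDR_BYTES, cudaMemcpyHostToDevice);
}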

Figure 4 shows the performance of the second iteration of our router. Reducing the number of bytes copied to the GPU has closed the gap between the CPU-only and the CPU/GPU routers in terms of bandwidth and latency, but the CPU/GPU router still doesn't perform any better. Even though we have all but eliminated copy time to the GPU, Figure 4(c) suggests that we should try to cut down copy time from the GPU as well.

Figure 4: Iteration 2 Results. Panels: (a) Bandwidth vs. Batch Size; (b) Latency vs. Batch Size; (c) Breakdown of Relative GPU Time.

4.3.3 Iteration Three
Unfortunately, there is no unnecessary data being copied back from the GPU as there was being copied to it; we must find another way to reduce the copy-from-GPU overhead. Instead, we modify our router's workflow to take advantage of mapped memory (§3.1).
This gives the CPU/GPU router a tiny edge over the CPU-only router (Figure 5), but the gains are small (as Amdahl's Law would suggest — the copy-from-device time we eliminated was a small portion of the total time in Figure 4(c)).

Figure 5: Iteration 3 Results. Panels: (a) Bandwidth vs. Batch Size; (b) Latency vs. Batch Size; (c) Breakdown of Relative GPU Time.

Figure 6: Iteration 3 Breakdown. Panels: (a) Breakdown of Absolute GPU Time; (b) Breakdown of Absolute CPU Time (both split into Gather Packets, Process, Send Packets, Copy to Device, and Copy from Device).

4.3.4 Iteration Four
Having eliminated the overhead of copying data to and from the GPU, the third iteration of our CPU/GPU router only displays meager gains over the CPU-only version. We turn to a breakdown by function of the absolute runtime of each (Figure 6) to understand why.

The CPU-only router clearly spends more time in the processing function; however, both spend nearly all their time performing packet I/O (that is, receiving packets from Click and forwarding them after processing), so the difference in processing times has little effect. This suggests that our Click-based packet generator is the bottleneck. (Indeed, PacketShader spent a great amount of time developing highly optimized drivers for their NIC [2].)

To test this hypothesis, we implemented a version of our packet generator that pre-generates a buffer of random packets and returns this buffer immediately when queried for a new batch; similarly, when results are returned so that packets may be forwarded, the call returns immediately rather than waiting for the whole batch to be sent.

Sure enough, the CPU/GPU router now operates at 3X the bandwidth of the CPU-only version and with 1/5th the latency (Figure 7). Of course, the bandwidth achieved by both is completely unrealistic (instantaneous packet I/O is impossible), but these results indicate that with optimized packet I/O drivers like PacketShader's, our CPU/GPU router would indeed outperform our CPU-only one.

Figure 7: Iteration 4 Results. Panels: (a) Bandwidth vs. Batch Size; (b) Latency vs. Batch Size.

5. DISCUSSION AND FUTURE WORK
Multithreaded CPU Processing. The comparison of our CPU/GPU router to the CPU-only baseline is not completely fair. Our CPU-only router operates in a single thread, yielding misleadingly low performance. Any CPU-based software router in the real world would certainly spread packet processing across multiple cores, and we would be surprised if any core software router were run on a machine with fewer than 16 cores.
A simple extension of our project would be to parallelize our CPU-only packet processing functions with OpenMP to provide more realistic baseline measurements.
We have implemented a simple multi-threaded version. Our preliminary results show that OpenMP can dramatically improve performance. By tuning the number of threads and the batch size, we could increase the bandwidth by up to one order of magnitude with respect to the single-threaded version. However, the packet processing on the GPU still outperforms that on the CPU, being around 20% faster. More investigation is required to understand the impact of these parameters on performance.
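A minimal sketch of such an OpenMP-parallelized processing loop is shown below; the helper functions and the 64-byte header stride are assumptions, not our framework's API.

/* Sketch: packets in a batch are independent, so the loop splits across cores. */
#include <stdint.h>
#include <omp.h>

struct flat_node;                                            /* see Section 4.1.4 */
extern uint32_t extract_dst_addr(const uint8_t *hdr);        /* hypothetical */
extern int lpm_lookup_flat(const struct flat_node *fib, uint32_t dst); /* hypothetical */

void process_batch_cpu(const uint8_t *hdrs, int *out_port, int n_pkts,
                       const struct flat_node *fib) {
    /* One parallel region per batch; each thread handles a slice of packets. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n_pkts; i++) {
        uint32_t dst = extract_dst_addr(hdrs + i * 64);
        out_port[i] = lpm_lookup_flat(fib, dst);
    }
}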

Harder Processing Functions. Even in the final iteration of our router, the CPU/GPU version only achieves slightly more than three times the bandwidth of the CPU-only version. Though this is by no means an improvement to scoff at, the speedup strikes us as being a tad low. We suspect the cause is that our packet processing functions are not taxing enough; the harder the processing function, the more benefit we should see from the massively parallel GPU. This suggests that GPU-based software routers might be best suited for complex packet processing like IDS filtering (which requires pattern matching against packet payloads) and IPsec processing (which requires expensive cryptographic operations).
One of our goals was to explore the best way to schedule multiple packet processing functions on the GPU. As a simple example, to schedule both LPM and firewall lookup, you could imagine parallelizing per-packet or parallelizing per-function (Figure 8). We began testing these schemes, but, as our processing functions turned out to be "easy," there was little difference. We think this is still an interesting question to explore with more taxing processing functions.

Figure 8: Scheduling Multiple Processing Functions. (a) Per-packet; (b) Per-function, per-packet.
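The two options in Figure 8 might be organized as kernels roughly as follows; the stub lookups are placeholders for the real LPM and firewall functions.

// Sketch of the two scheduling options; lookups are trivial stand-in stubs.
#include <stdint.h>

static __device__ int lpm_lookup_gpu(const uint8_t *hdr) { return hdr[0] & 0x3; }
static __device__ int fw_match_gpu(const uint8_t *hdr)   { return hdr[1] & 0x1; }

// (a) Per-packet: each thread runs every processing function on its packet.
__global__ void per_packet(const uint8_t *hdrs, int *port, int *verdict, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    port[i]    = lpm_lookup_gpu(hdrs + i * 64);
    verdict[i] = fw_match_gpu(hdrs + i * 64);
}

// (b) Per-function, per-packet: each function is its own kernel; the two
// launches can go to different streams and overlap over the same batch.
__global__ void lpm_only(const uint8_t *hdrs, int *port, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) port[i] = lpm_lookup_gpu(hdrs + i * 64);
}
__global__ void fw_only(const uint8_t *hdrs, int *verdict, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) verdict[i] = fw_match_gpu(hdrs + i * 64);
}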

Faster Packet I/O. By far the largest issue we noted is that generating and gathering packets from Click contributes the majority of the latency, becoming the dominant part of both CPU and GPU overall latency. As seen in Figure 7, removing the overhead contributed by generating and gathering packets, our GPU implementation performs much better than our CPU implementation. This is the same conclusion that the authors of PacketShader [2] came to; thus, in their system, they focused on reimplementing the driver for their NIC to alleviate these issues. In our framework we could similarly emulate newer NIC technologies (like RDMA) to allow zero-copy access to DRAM that is memory mapped in the GPU. We could emulate this by having Click copy packets directly to application DRAM rather than first sending the packets to the application via a socket.

Streams. The typical use of a GPU involves the following operations: a set of tasks and input data is first collected, transferred to the GPU, and processed, and eventually the output is copied back to the host. These operations are performed cyclically, often on independent data, as in the case of our packet processing applications. Since the CUDA device and the host are separate processing units, it is critical for performance to maximize the concurrency of such operations.
CUDA streams provide an interesting mechanism to optimize GPU-based applications. Basically, CUDA allows operations (like kernel executions and asynchronous data transfers to and from the device) to be assigned to streams. Stream executions are then overlapped in such a way that operations in different streams do not interfere with each other (i.e., a kernel execution in stream1 and a data transfer in stream2 can be performed in parallel). The high-level idea is quite similar to that of instruction pipelining.
We have implemented a version of our router using CUDA streams. Our preliminary results show that streams indeed increase parallelism and throughput. Operation overlapping is quite visible in the Visual Profiler. The number of processed packets increases substantially (around 30-40%). However, more investigation is required on this subject, particularly on the packet generator and on the new issues related to resource management highlighted by the NVIDIA Visual Profiler.
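A sketch of this stream-based organization follows; the kernel body, buffer sizes, and NSTREAMS are illustrative assumptions. The key point is that copies and kernels issued to different streams can overlap once the host buffers are pinned.

// Sketch: one stream per in-flight batch so copy-in, kernel, and copy-out of
// different batches overlap. All names and sizes are illustrative.
#include <cuda_runtime.h>
#include <stdint.h>

#define NSTREAMS 4
#define PKTS_PER_BATCH 8192
#define HDR_BYTES 64

__global__ void process_batch(const uint8_t *hdrs, int *ports, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) ports[i] = hdrs[i * HDR_BYTES] & 0x3;   // stub lookup
}

// h_hdrs/h_ports must be pinned (cudaHostAlloc) for the async copies to
// actually overlap with kernel execution.
void run_streams(uint8_t *h_hdrs[NSTREAMS], int *h_ports[NSTREAMS]) {
    cudaStream_t s[NSTREAMS];
    uint8_t *d_hdrs[NSTREAMS];
    int *d_ports[NSTREAMS];
    for (int i = 0; i < NSTREAMS; i++) {
        cudaStreamCreate(&s[i]);
        cudaMalloc((void **)&d_hdrs[i], PKTS_PER_BATCH * HDR_BYTES);
        cudaMalloc((void **)&d_ports[i], PKTS_PER_BATCH * sizeof(int));
    }
    for (int i = 0; i < NSTREAMS; i++) {
        cudaMemcpyAsync(d_hdrs[i], h_hdrs[i], PKTS_PER_BATCH * HDR_BYTES,
                        cudaMemcpyHostToDevice, s[i]);
        process_batch<<<(PKTS_PER_BATCH + 255) / 256, 256, 0, s[i]>>>(
            d_hdrs[i], d_ports[i], PKTS_PER_BATCH);
        cudaMemcpyAsync(h_ports[i], d_ports[i], PKTS_PER_BATCH * sizeof(int),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();   // all streams drained; results are on the host
}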

Integrated Graphics Processors. One tempting idea to explore is the use of integrated graphics processors rather than dedicated discrete GPUs. Modern processors (Intel Core i-series, etc.) include a traditional multi-core GPU directly on the processor itself. In essence, this shifts the position of the GPU from being on the PCI Express bus to being co-located with the CPU on the QuickPath Interconnect (QPI). As the QPI can potentially provide more bus bandwidth to memory, an integrated graphics processor could obtain even higher maximum throughput, since memory constraints are the biggest source of potential slowdown after packet I/O at the NIC.

6. CONCLUSION
Software routers are no doubt poised to start playing a larger and larger role in real systems. The popularity of software defined networks like OpenFlow alone is a big enough driver to ensure this, though software routers might find other niches as well. To our knowledge, GPU-based software routers are currently limited to the research lab, and we are unsure whether this is likely to change. Our experience (and PacketShader's success) suggests that they might be a viable option, though the effort required to optimize them may in the end outweigh the benefits.

7. REFERENCES
[1] CAIDA. Route Views, http://data.caida.org/datasets/routing/routeviews-prefix2as/2012/01/, 2012.
[2] S. Han, K. Jang, K. Park, and S. Moon. PacketShader: a GPU-accelerated software router. In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, pages 195–206, New York, NY, USA, 2010. ACM.
[3] J. H. Jafarian, E. Al-Shaer, and Q. Duan. OpenFlow random host mutation: transparent moving target defense using software defined networking. In Proceedings of the First Workshop on Hot Topics in Software Defined Networks, HotSDN '12, pages 127–132, New York, NY, USA, 2012. ACM.
[4] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. Kaashoek. The Click modular router. ACM Transactions on Computer Systems (TOCS), 18(3):263–297, 2000.
[5] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, Mar. 2008.
[6] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: enabling innovation in campus networks. SIGCOMM Comput. Commun. Rev., 38(2):69–74, Mar. 2008.
[7] D. Rovniagin and A. Wool. The geometric efficient matching algorithm for firewalls. IEEE Trans. Dependable Secur. Comput., 8(1):147–159, Jan. 2011.
[8] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '08, pages 73–82, New York, NY, USA, 2008. ACM.
[9] S. D. Strowes. LPM Trie, http://svn.sdstrowes.co.uk/pub/longest-prefix/, 2011.
[10] K. Thompson, G. Miller, and R. Wilder. Wide-area Internet traffic patterns and characteristics. IEEE Network, 11(6):10–23, 1997.
[11] G. Vasiliadis, S. Antonatos, M. Polychronakis, E. P. Markatos, and S. Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, RAID '08, pages 116–134, Berlin, Heidelberg, 2008. Springer-Verlag.
[12] Y. Zhu, Y. Deng, and Y. Chen. Hermes: an integrated CPU/GPU microarchitecture for IP routing. In Proceedings of the 48th Design Automation Conference, DAC '11, pages 1044–1049, New York, NY, USA, 2011. ACM.

APPENDIX
A. DISTRIBUTION OF CREDIT

Group Member    Portion of Credit
Matt            33.3%
David           33.4%
Bruno           33.3%
