1 JumboGen: Dynamic Generation For Network Performance Scalability David Salyers, Yingxin Jiang, Aaron Striegel, Christian Poellabauer Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN. 46530 USA Email: {dsalyers, yjiang3, striegel, cpoellab}@nd.edu

Abstract—Network transmission speeds are increasing at a significant rate. Unfortunately, the actual performance of the network does not scale with the increases in line speed. This is due to the fact that the majority of packets are less than or equal to 64 bytes and the fact that packet size has not scaled with the increases in network line speeds. This places a greater and greater load on routers in the network, as routing decisions have to be made for an ever increasing number of packets, which can cause increased packet delay and loss. In order to help alleviate this problem and make the transfer of bulk data more efficient, networks can support an extra large MTU size. Unfortunately, when a packet traverses a number of different networks with different MTU sizes, the smallest of the MTU sizes will determine the size of the packet that arrives at the destination, which will limit the effectiveness of large MTU packets. In this paper we present our solution, JumboGen, an approach that allows for a higher utilization of larger packet sizes on a domain-wise basis, instead of on an end-to-end basis. Smaller packets are encapsulated into a larger packet for transmission across the domain. At the egress point, the original packets are removed from the jumbo-sized packet and forwarded toward their final destination such that no nodes besides the ingress and egress are aware any conversion took place. Through simulations, we show that the dynamic creation of jumbo packets for the domain does not adversely affect the network in terms of link utilization or fairness while decreasing the number of packets processed by core routers. The decrease in the number of packets seen by the core routers will lessen their load and improve their ability to scale to greater and greater network speeds.

I. INTRODUCTION

Network (LAN and WAN) speeds are continuously increasing and are such that computer networks can now transfer billions of bits, and beyond, of information every second. However, in these high speed networks, the number of packets becomes a major bottleneck for the performance of the routers due to the overhead associated with each packet. There are two primary reasons for this bottleneck. First, as network speeds have increased, the common MTU size has not. Second, since the majority of packets transferred over the network are 64 bytes or less in length [1], a larger MTU would not help reduce the number of those types of packets. These two facts have caused the number of packets the routers in the network need to process to increase substantially.

For example, consider a 10 Mb/s network that can handle N packets every second. If the network is upgraded to 10 Gb/s, the routers would be expected to handle 1000N packets. Since there is a fixed amount of overhead processing per packet, it is a challenge for CPUs to scale with increasing network speeds [2]. In order to overcome this issue, many works on high performance networks, as in [3] and [4], have advocated the use of a larger MTU length or jumbo frames. These works show that the performance of TCP and the network in general can significantly improve with the use of jumbo-sized packets.

Unfortunately, the benefits of jumbo frames are limited for several reasons. First, the majority of packets transferred are 64 bytes or less. In this case, a larger MTU would not increase the efficiency of the network since the smaller MTU is already significantly greater than the packet size. The second factor that limits the use of jumbo frames is the lack of end-to-end support. If at any point during the packet transmission the jumbo frame passes through a node that does not support jumbo frames, the packet will be fragmented into many smaller packets via IP fragmentation. The destination will then receive these small fragments instead of the original jumbo frame.

Given that network speeds will be increasing for the foreseeable future, a reduction in the volume of packet processing is critical. However, the legacy nature of the current Internet dictates that attempts to impose end-to-end solutions for larger packet sizes will be plagued by fragmentation, thus removing their benefit. In this paper, we propose the JumboGen approach, which allows for the incremental deployment and efficient utilization of jumbo-size packets.

JumboGen dynamically aggregates smaller packets at the ingress to a domain for domain-wise transmission as a single large JumboGen packet. At the egress point, the original packets are de-queued, shaped, and transmitted toward their final destination. The end result of JumboGen is a significant reduction in the number of packets, hence significantly improving the scalability of the core of the network.

However, before JumboGen can be implemented, numerous issues need to be addressed. These issues include how to determine which packets can be aggregated together, RED queue effects on JumboGen packets, release of encapsulated packets, and partial deployment issues. The remainder of the paper is organized as follows: Section II presents an overview of the JumboGen approach and presents solutions to the important issues of JumboGen packet assembly, RED interactions, scheduling, and deployment. Next, Section III discusses the details of the simulation studies performed with JumboGen. Then, background and related works are discussed in Section IV.

Fig. 1. JumboGen operation overview

Finally, Section V presents conclusions for the paper.

II. JUMBOGEN DETAILS

Fig. 1 shows an overview of the JumboGen approach. The basic flow of events is described below:
1) Packets arrive at an ingress node to the domain.
2) At the ingress node, the packets are sorted into queues based on egress node, while a Jumbo Frame Aggregation Timer (JFAT) is started for the queue.
3) Packets that are being sent through the same egress point are combined into the same JumboGen packet, subject to the MTU.
4) Once the JFAT for the queue has expired, the JumboGen packet is released towards the egress point.
5) The JumboGen packet is routed through the core of the network. Routing is provided by using the standard routing mechanism of the network¹.
6) The JumboGen packet arrives at the egress node and the original packets are separated out.
7) Finally, the original packets are forwarded onto their final destination.

There are two main benefits to this approach. The first is that JumboGen lowers the number of packets that the core routers are responsible for processing, thus allowing the network to better scale as line speeds increase. The second beneficial aspect to the use of JumboGen is that data is more efficiently transferred over the network. The increase in efficiency is due to the reduced number of physical layer headers used (due to a lower number of packets) to transfer the same amount of data over the network.

While this process seems relatively simple, there are a number of non-trivial issues that need to be addressed. These issues include:
• Aggregation with Shared Egress Points: Since IP based routing provides only the next hop, the egress point may be difficult to discern. Thus, JumboGen includes mechanisms to ensure that only packets sharing the same egress point will be combined into the same JumboGen packet.
• Packet Encapsulation: The structure of the JumboGen packet is such that the method is computationally simplistic with little overhead. Furthermore, the packet structure must be easily modifiable to allow for pruning of aggregated packets to support RED-like behavior on a fine granularity.
• Structure of Packet Queues: JumboGen incorporates an efficient and fast method for determining packet placement at the ingress.
• RED Interactions: Since JumboGen packets consist of many smaller encapsulated packets, the loss of a JumboGen packet has more significant implications for network dynamics. Thus, the JumboGen approach includes considerations for RED and the interactions between JumboGen and non-JumboGen packets.

A. Aggregation with Shared Egress Points

If packets are randomly aggregated without consideration for their egress points, it is likely that when the JumboGen packet arrives at an egress node, many of the encapsulated packets will be routed back across the domain. This can add a significant amount of delay and is unacceptable from a network routing perspective. Hence, in order to aggregate packets together at the ingress node, the egress node for the packet should be known.

Unfortunately, in standard IP intra-domain routing (i.e., OSPF, IS-IS) only the next hop is known at any given link on the packet's path to the destination. However, it is possible to discern which packets will have the same egress point, even if the specific egress point cannot be determined.

If the network supports MPLS [5], aggregation of packets can be performed by matching MPLS labels. Since the same MPLS label indicates that packets are in the same Forward Equivalence Class (FEC) and will be routed along the same path, it follows that the packets will exit the domain at the same egress point.

¹Information from the header of an encapsulated packet (MPLS or IP) is simply copied to the actual JumboGen packet.
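To make the label-matching rule concrete, the following sketch (illustrative only, not the authors' implementation; the Packet type, the 8 KB jumbo MTU, and the queue layout are assumptions) bins arriving packets by MPLS label so that only packets in the same FEC, and therefore with the same egress point, ever share an aggregation queue:

from __future__ import annotations
from collections import defaultdict
from dataclasses import dataclass

JUMBO_MTU = 8192  # assumed jumbo frame MTU (8 KB, matching the value used in Section III)

@dataclass
class Packet:
    mpls_label: int   # identical labels => same FEC => same egress point
    length: int

# One aggregation queue per MPLS label.
queues: dict[int, list[Packet]] = defaultdict(list)

def enqueue(pkt: Packet) -> list[Packet] | None:
    """Queue a packet by label; return a completed batch when adding the
    packet would push the queued bytes past the jumbo MTU (the caller would
    then build and release the JumboGen packet toward the shared egress)."""
    q = queues[pkt.mpls_label]
    if sum(p.length for p in q) + pkt.length > JUMBO_MTU:
        queues[pkt.mpls_label] = [pkt]
        return q
    q.append(pkt)
    return None

In a deployment without MPLS, the aggregation key would instead be the 24-bit filter address introduced below.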

Fig. 2. JumboGen Packet Structure

Fig. 3. Queue Structure for JumboGen
If the network does not support MPLS, the first 24 bits of the destination IP address, referred to as the filter address, are used. By using the filter address, it can be reasonably guaranteed that packets bound for destinations with the same filter address (i.e., the same subnet, e.g., 129.74.150.x) will exit the domain at the same egress point. While this is a conservative approach that may miss some aggregation opportunities, it is more important to ensure the correct routing of packets. Furthermore, methods are presented for queue management which allow filter addresses sharing the same egress point to gradually be combined such that they share the same aggregation queue.

B. Encapsulation of Packets

A number of issues need to be addressed when aggregating the packets into the larger JumboGen packet. The first issue is how to address the JumboGen packet such that it arrives at the proper egress point. As already mentioned, the egress point is not known a priori. Thus, the destination IP address (or MPLS label) from the first encapsulated packet is used to address the packet. The protocol field in the IP header is set to indicate that the packet is a JumboGen packet. Finally, the source IP address is set to the address of the ingress router, thus allowing the egress router to know which ingress router created the JumboGen packet.

The next issue that needs to be addressed is how to store the encapsulated packets in the JumboGen packet such that the original packets can easily be reconstructed. The operation for reconstructing the original packets needs to be as simple as possible in order to prevent the egress node from becoming a bottleneck in the network. Thus a straightforward approach, as shown in Fig. 2, is used. As can be seen in the figure, the first part of the packet is the normal IP header. After the IP header, a special JumboGen header is included. The JumboGen header is a variable length header that contains the information necessary to efficiently handle the packet and extract the encapsulated information. The first field, QTime, stores the time the packets were aggregated over (JFAT, unless released earlier). The second field, avgLen, stores the average length of the packets encapsulated in the JumboGen packet². The third field, nPkt, indicates the number of packets stored in the JumboGen packet. The next nPkt fields contain the lengths of the various encapsulated packets. Finally, the original packets are stored in the same order that their lengths are stored. This allows for packets to be removed from the JumboGen packet as efficiently as possible.

The overhead introduced by a JumboGen packet is 20 bytes for the additional IP header and (nPkt * 2) + 8 bytes for the JumboGen header. The overhead is easily offset by the reduction in physical layer headers, and the true cost (or benefit) of JumboGen in terms of bandwidth is given by:

cost = 28 + (nPkt * 2) - (phySize * (nPkt - 1))

For example, if the network is an Ethernet network, only three packets would need to be aggregated together in order to overcome the additional overhead introduced by the JumboGen packet header.

This layout of the JumboGen packet also allows for the possibility of dropping a partial packet without significant overhead. If a router which is aware of JumboGen packets decides to drop part of a JumboGen packet (i.e., for RED), it first looks at the number of packets stored in the aggregate packet. Once the number of packets to be dropped is decided, the packets will be removed from the end of the JumboGen packet. The length of the JumboGen packet is shortened by the lengths of the packets that are to be removed, and their lengths in the JumboGen packet header are set to zero. The nPkt and avgLen fields are not modified, due to the need for correct parsing at the egress router and the need for simplicity in modifying the packets in flight. Removing the lengths that are zeroed out is not a desirable option because multiple memory copies would have to occur before the packet could be forwarded.

²This information is used for RED queue modification.
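To illustrate the header layout and the break-even arithmetic above, here is a small sketch (not the authors' code; the 32-bit QTime and 16-bit avgLen/nPkt/length field widths, and the 18-byte Ethernet phySize, are assumptions chosen to match the stated (nPkt * 2) + 8 byte header size):

import struct

def build_jumbogen_payload(qtime_ms: int, packets: list[bytes]) -> bytes:
    """Serialize the variable-length JumboGen header (QTime, avgLen, nPkt,
    then nPkt per-packet lengths) followed by the encapsulated packets."""
    n = len(packets)
    avg_len = sum(len(p) for p in packets) // n
    header = struct.pack("!IHH", qtime_ms, avg_len, n)           # 8 fixed bytes
    header += struct.pack(f"!{n}H", *(len(p) for p in packets))  # nPkt * 2 bytes
    return header + b"".join(packets)

def cost(n_pkt: int, phy_size: int = 18) -> int:
    """Bandwidth cost of aggregation: the 20-byte outer IP header plus the
    (nPkt * 2) + 8 byte JumboGen header, minus the physical layer headers
    saved by sending one frame instead of nPkt frames."""
    return 28 + (n_pkt * 2) - (phy_size * (n_pkt - 1))

# With Ethernet-sized physical headers, aggregating three packets already
# turns the cost negative, i.e. a net bandwidth saving.
print([cost(n) for n in (2, 3, 4)])   # [14, -2, -18]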

C. Queuing of Packets

In the naïve approach to aggregating packets, no concern is given to packets of the same source/destination pair; thus, packets in the same flow can be aggregated into the same JumboGen packet. However, multiple queues are still used for aggregating packets that will leave the domain from different egress points. The overall queue structure is shown in Fig. 3. As can be seen in the figure, the filter address/MPLS label is used for locating the correct queue in the filter address/MPLS label tree. If an appropriate entry for the destination address is found, then that entry is used to identify the proper queue; otherwise a new entry is created. If a queue is already attached to the entry, the packet is placed at the end of the queue. If there is no space to hold the packet (it would exceed the jumbo frame MTU size), the JumboGen packet is immediately dispatched and a new queue is placed for the entry. If no queue is available to be attached to the filter address entry, packets matching the filter address entry are forwarded toward their destination without being queued.

If multiple packets from the same flow are encapsulated into the same JumboGen packet, issues regarding fairness could arise. Problems with fairness may arise due to the fact that if a single JumboGen packet is lost, all of the encapsulated packets are lost (and possibly multiple packets from the same flow). In contrast, an intelligent approach using multiple queues is presented which removes this issue by preventing packets from the same flow from being aggregated into the same JumboGen packet.

The queue structure for the intelligent approach is similar to the structure shown in Fig. 3. In the multiple queue approach, each filter address/MPLS label entry can be associated with multiple queues, with each queue having a 256 bit contains_list associated with it. The contains_list uses a hashing function to ensure that multiple packets from the same source/destination pair do not appear in the same JumboGen packet.

The algorithm for determining the queue to aggregate the packet into is shown in Algorithm 1. The process for selecting the appropriate queue for aggregation is of O(log(e) + k) complexity, where e is the number of egress points in the tree and k is the maximum number of queues allowed to be used by one egress node.

Algorithm 1 Algorithm for determining which queue to place a packet
Require: a = last 4 bits of source address
Require: b = last 4 bits of destination address
Require: n = first 24 bits of destination address
1: node = node matching n in destination tree
2: if node = NULL then
3:   Insert new node in destination tree
4: else if node.queueParent != NULL then
5:   node = node.queueParent
6: end if
7: QueueNotFound ⇐ true
8: x ⇐ b * 16 + a; where x represents a bit position
9: c ⇐ 0
10: while QueueNotFound and c < node.numQueues do
11:   if node.queues[c].contains_list[x] == 0 then
12:     node.queues[c].contains_list[x] ⇐ 1
13:     QueueNotFound ⇐ false
14:   else
15:     c ⇐ c + 1
16:   end if
17: end while
18: if QueueNotFound then
19:   if node.numQueues >= maxQueues then
20:     Pop node.queues[0] and release the JumboGen packet to the network
21:   end if
22:   Add queue to node.queues
23:   node.numQueues ⇐ node.numQueues + 1
24:   node.queues[c].contains_list[x] ⇐ 1
25: end if
26: Insert packet into node.queues[c]

The space requirements for the contains_list array will add additional overhead, with the exact amount dependent on the number of queues that the ingress router supports. The overhead storage requirement can be determined by multiplying 32 bytes (256 bits) by the number of queues that are supported. For example, for a 10 Gb/s network (line speed), a minimum packet size of 28 bytes (minPktSize), a JFAT value of 1 ms, and the assumption that only two packets will be aggregated into the same queue (nPktQ), the overhead is given by:

overhead = (line speed / (8 * 1000 * minPktSize * nPktQ)) * JFAT * 32

This gives an overhead cost of 750 KB, which is slightly more than half of the 1.28 MB of storage needed for the packets themselves. In reality, the number of queues should be substantially less than the above calculation would dictate.

While not shown in Algorithm 1, it is possible to limit the filter address/MPLS label tree to a specified size. This can be efficiently achieved by using a Least Recently Used (LRU) list of the nodes in the tree. When a new node needs to be added and no free node exists, the LRU node is removed from the tree and the new node is inserted. While this adds an additional O(log e) search to the above algorithm, it does not change the overall complexity from O(log(e) + k).

While the packets could be arranged in any number of different ways inside the queues, JumboGen uses a straightforward approach for the naïve and intelligent implementations: packets are simply added to the end of the JumboGen queue. Other approaches, such as arranging the encapsulated packets by size (smallest to largest), may provide some benefit. Unfortunately, such approaches require an increase in complexity at the ingress router and may introduce out-of-order delivery for the encapsulated packets.

As previously mentioned, the JumboGen ingress node will aggregate different filter addresses³ together as it learns that different filter addresses will exit the domain at the same egress point. In Fig. 3, the presence of a parent pointer for each node is shown. The parent pointer allows for queues to be aggregated together once it is discovered that packets in these queues exit at the same egress point. Parent pointers are restricted in that a node is either a parent or a child, but not both (i.e., one parent, many children). For example, filter addresses 129.154.195 and 129.154.160 are combined in Fig. 3.

When JumboGen packets arrive at the egress node, the node looks at the filter address (taken from the destination IP address) of the JumboGen packet. As different filter addresses are seen, the egress node sends notification messages to the ingress node (identified by the source IP address in the JumboGen packet) to combine the queues for the different filter addresses. When BGP or other routing updates occur, the parent pointers for the filter address tree entries will be invalidated. The invalidation occurs in order to prevent incorrect aggregation of packets, as they may no longer exit at the same egress node.

³MPLS labels are used for routing and, in addition, can be used to determine the egress point. Although MPLS can aggregate different FECs together that have the same destination and quality of service requirements, the JumboGen approach will not combine any MPLS labels.
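The queue selection of Algorithm 1 can be sketched compactly as follows (my own rendering under assumed types: the 256-bit contains_list is modeled as a set of bit positions, and maxQueues is an illustrative cap):

from collections import defaultdict

MAX_QUEUES = 4  # illustrative maxQueues value; the paper leaves this configurable

class FilterNode:
    """One entry of the filter-address/MPLS-label tree (parent links omitted)."""
    def __init__(self):
        self.queues = []   # each queue: {"contains": set of bit positions, "packets": []}

nodes = defaultdict(FilterNode)    # keyed by the 24-bit filter address / MPLS label

def place_packet(filter_addr: int, src_ip: int, dst_ip: int, pkt: bytes) -> None:
    """Find the first queue whose contains_list does not already hold bit
    x = (last 4 bits of dst) * 16 + (last 4 bits of src), so that two packets
    of the same source/destination pair never share a JumboGen packet."""
    node = nodes[filter_addr]
    x = (dst_ip & 0xF) * 16 + (src_ip & 0xF)
    for q in node.queues:
        if x not in q["contains"]:
            q["contains"].add(x)
            q["packets"].append(pkt)
            return
    if len(node.queues) >= MAX_QUEUES:
        full = node.queues.pop(0)   # oldest queue is released as a JumboGen packet
        # ... hand full["packets"] to the aggregation/dispatch logic here ...
    node.queues.append({"contains": {x}, "packets": [pkt]})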
D. Packet Scheduling

Packet scheduling is another area that can be affected by the dynamic creation of jumbo-sized packets, especially when JumboGen is only partially deployed in the network. If the core routers' packet scheduling algorithms are not modified, then one JumboGen packet counts the same as a single non-JumboGen packet. Unless the size of the packet is taken into account, the JumboGen packets could receive more than their fair share of transmission opportunities (due to many packets being transmitted with each JumboGen packet). Furthermore, the size disparity between small and large packets is significantly increased. However, if the network is completely JumboGen enabled (the majority of packets will be JumboGen packets) or packet sizes are taken into account, no special modifications need to be applied.

E. RED Queue Modifications

RED queues are an important technology that aims to improve the utilization of the network and remove the synchronization that tends to occur with TCP flows when the network becomes congested. In [6], two different methods that RED queues can use to determine the queue utilization are presented. The first method is to simply use the number of packets, while the second is to use the number of bytes consumed. The second method has more overhead; however, it allows for smaller packets to be favored over larger packets. This effectively gives priority (less chance of being dropped) to smaller packets (e.g., TCP acknowledgments).

No RED modification:
If RED is not modified in any way, JumboGen packets will be treated the same as any other packet. This behavior is not advantageous, as a JumboGen packet has the same percent chance of being dropped as does any other packet. However, any time a JumboGen packet is dropped, all encapsulated packets are lost. Because multiple packets are lost, this can result in poor TCP performance, as multiple packets from the same flow can be dropped, thus resulting in a greater than desired reduction of traffic.

In the case where the queue utilization is determined by bytes and not the number of packets, many ACK packets that have been aggregated together still have a greater chance of being dropped due to the large size of the JumboGen packet. This is not desirable because if ACKs are lost, the original data packet will have to be retransmitted along with the ACK for the retransmitted packet.

JumboGen-Aware RED Queue:
When RED uses the number of packets to determine queue utilization, it is desirable to lower the chance that a JumboGen packet will be dropped. This can be accomplished by dividing the drop probability by the number of encapsulated packets in the JumboGen packet. This will give preference to JumboGen packets. However, if the network traffic consists mostly of JumboGen packets, then this could effectively reduce the number of drops, since the chance that any packet is dropped is substantially lowered. Thus, the following formula is used:

dropPercent = (1 - α) * (Redw / N) + α * Redw

In the above formula, α is the percentage of packets that are JumboGen packets, Redw is the probability that an individual packet will be dropped, and N is the number of packets encapsulated in the JumboGen packet. As the percentage of JumboGen packets in the network increases, the chance for any JumboGen packet to be dropped approaches Redw (normal RED operation).

When RED uses the number of bytes to determine the queue utilization, smaller packets will have a smaller chance of being dropped. As mentioned in [6], the action of favoring small packets is desirable due to the lessened chance that TCP acknowledgments will be dropped. In JumboGen, if multiple TCP acknowledgments are combined into the same packet, this will increase their chance of being dropped, which is not desirable. Thus, for JumboGen packets, we use the average size avgLen of the encapsulated packets in order to determine if a packet should be dropped. By using the average size of the encapsulated packets, the chance that a JumboGen packet consisting of mostly TCP ACKs will be dropped is less than that of a JumboGen packet consisting of larger data packets. The average size of the encapsulated packets is read from the JumboGen header, not calculated at the router. Additionally, the maximum size of a packet in the RED algorithm (see [6]) is not the jumbo frame MTU, but the network MTU size.
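The packet-count variant of the JumboGen-aware drop decision follows directly from the formula above; a minimal sketch (function and variable names are mine):

def jumbogen_drop_prob(red_w: float, n_encapsulated: int, alpha: float) -> float:
    """dropPercent = (1 - alpha) * Redw / N + alpha * Redw, where alpha is the
    fraction of traffic that is JumboGen packets and N is the number of
    encapsulated packets.  As alpha approaches 1 this reduces to plain RED."""
    return (1 - alpha) * red_w / n_encapsulated + alpha * red_w

# A JumboGen packet holding 8 packets in mostly non-JumboGen traffic is far
# less likely to be dropped than under unmodified RED ...
assert jumbogen_drop_prob(0.08, 8, alpha=0.1) < 0.08
# ... while in fully JumboGen traffic the probability returns to Redw.
assert abs(jumbogen_drop_prob(0.08, 8, alpha=1.0) - 0.08) < 1e-12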

Partial Drop:
The main weakness of RED queues on JumboGen is that a dropped JumboGen packet means the loss of all of the encapsulated packets. Thus, an additional modification of the RED queue is made. In this modification, instead of dropping the whole JumboGen packet, a portion of the packet is dropped. The actual RED calculation performed in order to determine if the packet should be partially dropped is unmodified from the previous RED method. To decide how many packets to drop, two new variables, minPktDrop and maxPktDrop, are introduced. These two variables set the range of the percentage of packets to drop, which is determined by:

n ⇐ ((avgQSize - minth) / (maxth - minth) * (maxPktDrop - minPktDrop) + minPktDrop) * numPkts

The specific packets to be dropped are taken from the end of the JumboGen packet. While the packets to be dropped could be selected from anywhere inside the JumboGen packet, any location other than the end would result in a substantial increase in the complexity of the partial drop. To drop an encapsulated packet from the end of the JumboGen packet, the length of the encapsulated packet (read from the JumboGen packet header) is subtracted from the length of the JumboGen packet, and the length field for the encapsulated packet in the JumboGen packet is zeroed out. The total number of packets is not changed, due to the need for this value to skip over all of the lengths in the JumboGen header to find the start of the first encapsulated packet.

The main drawback to the above method for partially dropping packets is that it may introduce a minor drop-tail like behavior, which RED aims to eliminate. However, if packets were able to be removed from anywhere in the JumboGen packet, the complexity of the partial drop would substantially increase. The increase in complexity comes from performing a RED calculation on each encapsulated packet and from the memory move operations needed to close the gaps in the JumboGen packet. Furthermore, in Section III we show that only dropping packets from the end of the JumboGen packet does not result in poor performance.
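A sketch of the partial-drop bookkeeping described above (my code, not the authors'; the minPktDrop/maxPktDrop defaults are illustrative, not values from the paper): the number of victims is interpolated from the average RED queue size, and the victims are trimmed from the tail by zeroing their length fields.

def packets_to_drop(avg_q_size: float, min_th: float, max_th: float,
                    num_pkts: int, min_pkt_drop: float = 0.1,
                    max_pkt_drop: float = 0.5) -> int:
    """n = ((avgQSize - minth) / (maxth - minth) * (maxPktDrop - minPktDrop)
            + minPktDrop) * numPkts, truncated to whole packets."""
    frac = (avg_q_size - min_th) / (max_th - min_th)
    frac = frac * (max_pkt_drop - min_pkt_drop) + min_pkt_drop
    return int(frac * num_pkts)

def partial_drop(lengths: list[int], n_drop: int) -> list[int]:
    """Zero the length fields of the last n_drop encapsulated packets; nPkt
    and avgLen are left untouched so the egress can still walk the header."""
    n_drop = min(n_drop, len(lengths))
    trimmed = lengths[:]
    for i in range(len(trimmed) - n_drop, len(trimmed)):
        trimmed[i] = 0
    return trimmed

# Example: a queue halfway between minth and maxth drops 30% of a
# 10-packet JumboGen packet (3 packets), taken from the tail.
print(packets_to_drop(50, 25, 75, 10))               # 3
print(partial_drop([64, 64, 1500, 1500, 1500], 2))   # [64, 64, 1500, 0, 0]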

F. Egress Shaping

When the JumboGen packet reaches its destination, the packets need to be de-aggregated and released to the next node on their path to the destination. If all the packets are released as soon as they are removed from the JumboGen packet, this can lead to dropped packets at the client due to the receive buffer overflowing, related to the ACK Compression problem [7]. In JumboGen, this issue is removed by storing the queuing time that the packets were aggregated over and then uniformly distributing the release of the packets over that amount of time.

An additional issue to deal with at the egress point is how to handle the situation where multiple JumboGen packets are at the egress node ready to be de-aggregated. The rule that needs to be followed in this situation is that encapsulated packets from JumboGen packets that arrive before other JumboGen packets from the same ingress node (as determined by the IP address) must be completely transmitted to the destination. The release of encapsulated packets contained in JumboGen packets from different ingress nodes can be interleaved without any further issues. This way the JumboGen approach does not increase the likelihood of packets arriving out-of-order at the destination.

G. Deployment

The final issue that needs to be addressed is the deployment of the JumboGen egress and ingress points. JumboGen has the capability to work in a partial deployment scenario, although at a reduced performance gain. First, the whole domain must support jumbo packets before JumboGen can be utilized. After the requirement for jumbo frame support is met, JumboGen aggregators/de-aggregators and JumboGen aware routers can be deployed on an incremental basis.

If JumboGen is only partially deployed, discovery needs to be performed in order to determine if the egress node supports JumboGen packets. The discovery and queuing process works as discussed in Algorithm 2. The query packet referred to in Algorithm 2 is a specially formatted ICMP message. If the egress point of the domain is JumboGen capable, the query packet will be intercepted and a reply indicating JumboGen capability will be sent back to the ingress point. If the egress point is not JumboGen capable, then the message will arrive at the destination and be discarded.

Algorithm 2 Algorithm to determine if a packet should be queued
1: addr ⇐ filter address
2: Search queue tree for addr
3: if addr entry is found then
4:   Check JumboGen capability
5:   if JumboGen capable then
6:     Queue packet
7:   end if
8: else
9:   Send query packet to destination address to determine JumboGen capability
10:   Forward packet (packets will not be queued until reply to query packet is received)
11: end if
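A sketch of the discovery step of Algorithm 2 (hypothetical names and callbacks; the real probe is a specially formatted ICMP message): capability per filter address is learned once, and packets are forwarded unqueued until a positive reply arrives.

capability = {}   # filter address -> True/False, learned from query replies
pending = set()   # filter addresses with an outstanding capability query

def handle_packet(filter_addr, pkt, queue_packet, forward_packet, send_query):
    """Queue only when the egress for this filter address is known to be
    JumboGen capable; otherwise forward normally and (once) send a
    capability query toward the destination."""
    if filter_addr in capability:
        (queue_packet if capability[filter_addr] else forward_packet)(pkt)
        return
    if filter_addr not in pending:
        pending.add(filter_addr)
        send_query(filter_addr)   # intercepted by a capable egress; otherwise discarded
    forward_packet(pkt)           # no queuing until the reply is received

def handle_query_reply(filter_addr, capable):
    """Record the egress capability reported back to the ingress."""
    pending.discard(filter_addr)
    capability[filter_addr] = capable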

Fig. 4. JumboGen simulation setup

Fig. 5. Effect of different RED queue modifications on a) Number of packets seen by core links b) Packet size distribution

III. SIMULATION STUDIES

The simulations were conducted using the ns-2 simulator [8] and the GenMCast extension module [9]. The goal of these simulation studies was to examine the impact of the JumboGen approach on scalability and quality of service (TCP fairness, link utilization, and goodput).

For our simulation studies the network topology shown in Fig. 4 was used, which consisted of a small network with five external nodes connected to each edge node and a link bandwidth of 250 Mb/s. A full deployment of JumboGen was assumed in the core network. The remaining information concerning the simulation setup is presented below:
• Traffic was generated between end nodes in order to ensure a high link utilization in the network. Typically, busy core links had hundreds of traffic flows active at a given time.
• The JumboGen frame MTU is 8 KB.
• The network flows consisted of short and long term flows. For TCP flows, the average size of the flows was 100 KB for short term traffic and 5 MB for long term traffic.
• Network traffic was supplemented with long term UDP traffic with an average lifetime of 20 s.
• All short term flows were TCP, and long term flows consisted of 20% CBR, 20% VBR, and 60% TCP.

With the basic setup as listed above, three groups of simulations were performed. The first group evaluates the effectiveness of the different RED queue modifications. The second group of simulations shows the difference in the performance of the naïve and intelligent queue management schemes. Finally, the third group of simulations explores the effect increasing the MTU size has on JumboGen.

The performance of JumboGen was evaluated according to the following metrics:
• Number of Packets: The number of packets processed by a core link in the network.
• Packet Size Distribution
• Link Utilization
• Total Goodput: Total goodput of one monitored TCP flow.
• Service Fairness: Multiple TCP flows were monitored over the life of the simulation in order to determine if they received fair service. Fairness was determined by the use of Jain's Fairness Index [10].

A. RED Queue Modification Performance - Naïve Queuing

In Fig. 5 and Tables I, II, and III, the performance for the various RED queues is shown. The different RED implementations are:
• No JumboGen - The performance of the network when JumboGen is not used. Queue utilization is determined by the number of bytes. Queue utilization by the number of packets typically performs slightly worse and is not shown.
• Q1 - Full Drop - A JumboGen configuration using egress shaping where the queue utilization is determined by the number of encapsulated packets. A packet drop will result in all encapsulated packets being lost.
• Q1 - Partial Drop - Same as the previous, except partial packet drop is allowed.
• Q2 - Full Drop - A JumboGen configuration using egress shaping where the queue utilization is determined by the number of bytes consumed. A packet drop will result in all encapsulated packets being lost.
• Q2 - Partial Drop - Same as the previous, except partial packet drop is allowed.

In Fig. 5(a), the number of packets transmitted over one of the core links is shown. As can be seen in the figure, the number of packets processed by the core links is roughly an 8x reduction over the non-JumboGen case (38,000 packets per second versus 4,500 packets per second). Fig. 5(b) shows the packet size distribution for the same core link. It can be seen in the figure that without JumboGen, the majority of packets are either 1500 bytes (for the various TCP transfers) or 64 bytes (TCP ACKs), and the remainder are various UDP packets. For JumboGen, the majority of packets are located near the MTU (8 KB).
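The Service Fairness metric above uses Jain's Fairness Index [10], which for per-flow throughputs x_1, ..., x_n is (sum of x_i)^2 / (n * sum of x_i^2); a one-function sketch:

def jain_fairness(throughputs: list[float]) -> float:
    """Jain's Fairness Index: 1.0 for perfectly equal service, tending
    toward 1/n when a single flow dominates."""
    n = len(throughputs)
    total = sum(throughputs)
    return (total * total) / (n * sum(x * x for x in throughputs))

print(jain_fairness([10, 10, 10, 10]))   # 1.0
print(jain_fairness([40, 1, 1, 1]))      # ~0.29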

Fig. 6. Effect of different JumboGen implementations on a) Number of packets seen by core links b) Packet size distribution

TABLE I
LINK UTILIZATION OF JUMBOGEN AND DIFFERENT RED QUEUE MODIFICATIONS
                    Avg. Link Util. (%)   Std. Dev.
No JumboGen         90.62                 1.925
Q1 - Full Drop      85.40                 2.291
Q1 - Partial Drop   93.49                 1.904
Q2 - Full Drop      93.74                 1.6793
Q2 - Partial Drop   93.22                 2.145

TABLE II
TOTAL GOODPUT OF JUMBOGEN AND DIFFERENT RED QUEUE MODIFICATIONS
                    Total Goodput (MB)
No JumboGen         53.205
Q1 - Full Drop      47.921
Q1 - Partial Drop   69.733
Q2 - Full Drop      76.212
Q2 - Partial Drop   72.546

TABLE III
TCP SERVICE FAIRNESS OF JUMBOGEN AND DIFFERENT RED QUEUE MODIFICATIONS
                    Fairness Ratio
No JumboGen         0.95
Q1 - Full Drop      0.95
Q1 - Partial Drop   0.89
Q2 - Full Drop      0.93
Q2 - Partial Drop   0.90

In Table I, the link utilization for the different RED queues is shown. Q1 with full drop is the only implementation with a lower link utilization than the no JumboGen case. However, once Q1 is allowed to perform a partial drop, the link utilization approaches that of the different Q2 implementations. Q2 performs roughly similarly in terms of average link utilization; however, when partial drops are allowed, the link utilization is slightly lower in addition to being slightly less stable.

In Table II, the total goodput for the different approaches is shown. Q1 with full drop is the worst performer, delivering the lowest amount of data to the end client. However, Q1 with partial drop and both Q2 implementations deliver a substantially higher amount of data to the end client.

Finally, the overall fairness of the different implementations is shown in Table III. Q1 with full packet drop has the best fairness of the different JumboGen implementations. However, this fairness comes at a substantial loss in terms of performance. Q1 with partial drop and both Q2 implementations perform slightly worse than the non-JumboGen case. Q2 with full drop achieves a fairness rating closest to that of the non-JumboGen approach.

In these simulations, while Q1's performance is significantly better with partial drops, Q2's performance is slightly worse. This is likely due to the fact that in Q2, JumboGen packets that are composed of mostly TCP ACKs will have a lower chance of being dropped than JumboGen packets that are composed of larger data packets. When partial drop is combined with Q2, the chance that at least some packets from many different JumboGen packets will be discarded is increased. This increase, in turn, raises the likelihood that a portion of a JumboGen packet consisting of mostly ACKs will be discarded, whereas when full JumboGen packets are dropped, they are more likely to consist of larger data packets.

These results have shown that JumboGen is capable of providing a performance benefit to the network without adversely affecting link utilization, total goodput, or fairness. Overall, Q1 with full drop is the only implementation that performs worse than the standard case without JumboGen, with the other implementations showing substantial improvements. Overall, Q2 with full packet drops shows a slight performance advantage over the other RED queue modifications.

Fig. 7. Effect of different MTU sizes on a) Number of packets seen by core links b) Packet size distribution

B. Intelligent vs. Naïve Queuing

In this section, the performance of the single queue (naïve approach), multiple queue (intelligent approach), and no JumboGen approaches is evaluated. For all simulations the RED queue utilization is determined by the number of packets (the number of encapsulated packets for JumboGen) in the queue. The single queue configuration performs egress shaping, while the multiple queue configuration does not; this is due to the complexity involved in ensuring efficient in-order delivery when egress shaping is used with multiple queues. Additionally, both queue implementations allow for partial packet drops.

TABLE IV
LINK UTILIZATION OF JUMBOGEN AND DIFFERENT QUEUE IMPLEMENTATIONS
                 Avg. Link Util. (%)   Std. Dev.
No JumboGen      87.84                 0.878
Single Queue     93.49                 1.904
Multiple Queue   94.47                 2.14

TABLE V
TOTAL GOODPUT OF JUMBOGEN AND DIFFERENT QUEUE IMPLEMENTATIONS
                 Total Goodput (MB)
No JumboGen      50.492
Single Queue     69.731
Multiple Queue   68.276

TABLE VI
TCP SERVICE FAIRNESS OF JUMBOGEN AND DIFFERENT QUEUE IMPLEMENTATIONS
                 Fairness Ratio
No JumboGen      0.87
Single Queue     0.89
Multiple Queue   0.95

Fig. 6(a) shows the number of packets processed by the core links. As can be seen in the figure, the multiple queue approach reduces the number of packets seen by the core links by nearly 3.5x. However, it does not achieve the reduction of packets of the single queue approach, which is to be expected. Fig. 6(b) shows the packet size distribution of the single and multiple queue approaches. As before, when JumboGen is not used the majority of packets are 64 or 1500 bytes. In the single queue approach, the majority of packets are from 6000 to 8000 bytes. However, the multiple queue approach has a more even distribution of packet sizes, with substantial concentrations only around 3000 and 4500 bytes.

Table IV shows the link utilization of the single and multiple queues. As can be seen in the table, the link utilization of both the single and multiple queue approaches is higher than when JumboGen is not used. Table V shows the total goodput achieved by the various JumboGen implementations. As can be seen in the table, the multiple queue approach performs similarly to the single queue approach, with only a slight reduction in the amount of goodput delivered. Finally, in Table VI it can be seen that the fairness of the multiple queue JumboGen is better than any of the other JumboGen implementations.

The multiple queue JumboGen approach can perform as well as the single queue approach while providing better fairness to the different TCP flows. However, the performance of the multiple queue approach depends on how many of the underlying flows are from the same source/destination pair. If the network traffic is mostly composed of flows with the same source/destination pair, then the performance of the multiple queue JumboGen approach will suffer. On the other hand, if the network traffic consists of flows with many different source/destination pairs, then the performance of the multiple queue approach will be similar to that of the single queue approach, while providing better fairness.

C. Varying MTU Setting

The final set of results presented in this paper is the performance of JumboGen as the MTU size and JFAT settings are adjusted.

For all simulations, the RED queues calculated utilization by the total number of packets in the queue (including encapsulated packets), egress shaping was used, and partial packet drops were allowed. The following scenarios were simulated:
• 8 KB MTU - 8 KB MTU with a 1 ms JFAT
• 16 KB MTU - 16 KB MTU with a 2 ms JFAT
• 24 KB MTU - 24 KB MTU with a 3 ms JFAT
• 32 KB MTU - 32 KB MTU with a 4 ms JFAT

In Fig. 7(a), the number of packets processed by the core link node is shown. As is expected, as the MTU is increased the number of packets seen by the core node decreases. However, diminishing returns exist as the MTU is increased, with only a marginal decrease in the number of packets after a 16 KB MTU. In Fig. 7(b), the packet size distribution is shown. As can be seen in the figure, as MTU and JFAT increase, a shift occurs in where the plurality of packet sizes appears. However, it should be noted that the plurality constitutes a decreasing percentage of the overall number of packets.

TABLE VII
LINK UTILIZATION OF JUMBOGEN AND DIFFERENT MTU VALUES
             Avg. Link Util. (%)   Std. Dev.
8 KB MTU     93.49                 1.904
16 KB MTU    92.60                 2.179
24 KB MTU    92.67                 2.397
32 KB MTU    92.88                 2.439

TABLE VIII
TOTAL GOODPUT OF JUMBOGEN AND DIFFERENT MTU VALUES
             Total Goodput (MB)
8 KB MTU     69.733
16 KB MTU    67.950
24 KB MTU    69.377
32 KB MTU    73.167

In Table VII, the link utilization for the various MTU settings is shown. As can be seen in the table, the average link utilization remains nearly constant. However, as the MTU increases, the standard deviation of the link utilization shows a slight increase. Table VIII shows that total goodput generally increases slightly as the MTU setting is increased, although a slight drop does occur at 16 KB. Fairness for the various MTU settings was roughly similar and showed no discernible pattern.

Overall, it is shown that by increasing the MTU of the jumbo frame, better performance is achieved. However, the gains diminish after 16 KB. This will likely change as networks scale to even greater speeds with even more data being transferred; however, the same situation of diminishing returns will eventually occur at a higher MTU value.

IV. BACKGROUND AND RELATED WORK

The JumboGen approach relates to two broad research areas, the first being the idea of queuing packets (packet aggregation) in order to gain some benefit to the network. The second area is the performance of TCP in high speed network environments and the special issues that arise.

There are a number of technologies that use queuing in order to improve the efficiency of the network. One such technology is Gathercast [11]. In [11], the authors describe a method to increase the efficiency of the network by combining many packets to the same destination into a single packet. Their method is most beneficial for servers that use TCP as their transport protocol and ICP (Internet Cache Protocol) queries. In Gathercast, gather nodes are added to a reverse multicast tree (with the different sources being the leaves and the destination being the root node) which combine small packets destined for the root node into a single packet. Metadata is stored in the packet in order to allow for efficient reconstruction of the original packets. By combining many packets into a single packet, the load on the server is reduced, as there are fewer interrupts to handle in order to receive the ACKs or other small packets.

JumboGen is fundamentally different from Gathercast in a number of ways. The first major difference is one of scope. Gathercast is concerned with combining packets that are arriving at the exact same destination. This forces the destination node and at least some of the routers in the network to be Gathercast capable. This allows Gathercast to provide a reduced number of interrupts to the destination, thus providing better performance, while increasing the complexity of Gathercast routers. On the other hand, JumboGen focuses on lessening the load on the core routers by slightly increasing the complexity of the edge routers. Another major difference is that in Gathercast, the transformation tunnels need to be specifically configured for certain applications, while JumboGen will attempt to combine any packet.

Finally, Gathercast does not look into RED queues and their effect on aggregated packets. This is especially important as the main use for Gathercast is to aggregate TCP ACKs into a single packet. Any loss of a single Gathercast packet can cause many TCP streams to be throttled back unnecessarily. Additionally, the loss of these TCP ACKs will cause substantial wasted bandwidth, as each data packet corresponding to a lost ACK will have to be retransmitted. JumboGen modifies the RED queue algorithm such that JumboGen packets that are constructed of many small packets are favored (less likely to be dropped).

Another related technology that uses queuing in order to improve network efficiency is our own work, stealth multicast [12]. In stealth multicast, the idea of dynamically detecting redundancy in the network in order to form domain-wise multicast groups is presented. Packets are queued in a Virtual Group Detection Manager (VGDM) and are checked to see if they are redundant and should be queued. This work is concerned with two things: 1) providing motivation for the deployment of multicast and 2) reducing the short term redundancy in the network. While this technology will decrease the load on core routers when a large percentage of short-term redundant data exists, it is not a main goal.

In wireless networks, there are many studies, as in [13], [14], that perform packet aggregation in order to lessen the load placed on intermediate nodes in a wireless network and to reduce the overhead involved with wireless communication. In [13], the authors introduce the idea of data funneling for sensor networks. The main idea is that wireless nodes are coordinated in a way that the data from the nodes furthest from the destination is combined with data from other nodes as it is forwarded to the final destination in the network. This allows for better network utilization, due to the decrease in the number of network headers, and a longer life for battery limited nodes. As mentioned, this work requires that the network configuration be known a priori along with the destination to which the packets will be aggregated.

Another study [14] looks into the performance of general packet aggregation in a wireless network. Here packets are aggregated on a link-wise basis. This means that at each node packets are de-aggregated and re-aggregated according to the next destination link. In busy networks this can potentially lead to lower end-to-end delays in a congested network, but larger end-to-end delays in a lightly congested network. Since this method only deals with aggregation on a link by link basis, each router still has to perform the same number of routing lookups as if the packets were not aggregated at all. In a high speed network this will limit the performance at congested nodes.

Finally, as most traffic transferred over the Internet is TCP-based, it is important to be aware of the implications high-speed networks and packet aggregation have on TCP performance. TCP performance in high speed networks has been studied at great length [15], [16], [17]. Additionally, other works show that high speed networks can benefit from large MTU sizes [2], [18].

The authors in [2] present methods for reducing the overhead associated with TCP/IP. In this paper, the authors look at the effect that various message sizes (packet sizes ranging from 1 to 8192 bytes) have on TCP. The authors note that data-touching operations (checksums) dominate the processing time of large packets and nondata-touching operations, such as data moves, dominate the overhead of small packets (less than 1000 bytes). This shows that if one wants to transfer large amounts of data, it is better to use larger packets to diminish the cost of the nondata-touching operations. Unfortunately, it is well known that packets tend to be smaller, and the authors note that in their WAN trace, data-touching overheads only consume about 17% for TCP and 13% for UDP of the total overhead cost. The rest is consumed by nondata-touching operations. This makes the case that it is better to process a single larger packet than many smaller ones.

In [18], the authors look into performance issues related to using a large MTU (Maximum Transmission Unit) size to transfer TCP/IP packets over an ATM network. In the paper, the authors found certain conditions that caused the TCP algorithm to deadlock when a large segment size is used. These conditions were: 1) a large segment size, 2) the use of Nagle's algorithm [19], 3) asymmetry of size in the socket buffers, and 4) the use of delayed acknowledgments. If any of these conditions does not occur, then a deadlock does not occur. JumboGen will not increase the chance of this happening, due to the fact that the maximum segment size for the TCP packets will not be modified. JumboGen does not have any requirements concerning the other three conditions that could potentially lead to deadlock.

Another issue with TCP and aggregation of packets was identified in [7]. In this paper the authors note the phenomenon of ACK Compression. ACK Compression occurs when, due to network conditions, TCP ACKs from the same source to the same destination have their inter-packet arrival times compressed; thus the destination receives a burst of packets. This causes the destination to send a burst of TCP packets in response, which could overload the receiver, whereas if the inter-arrival time of the ACKs were not changed, this burst would not occur. JumboGen performs egress shaping in order to prevent the situation described above.

V. CONCLUSIONS

In this paper, we have proposed a dynamic packet aggregation scheme that allows for the utilization of higher speed networks while lessening the load placed on routers as the number of packets that traverse the network increases. This allows for better utilization of the transfer medium and a more reliable transport. Simulations were used to demonstrate that JumboGen can be used without adversely affecting the performance of the network in terms of link utilization and fairness to traffic flows. JumboGen also provides a benefit to the total goodput of the network and reduces the number of packets processed by the core links.

Two different JumboGen configurations were demonstrated (single and multiple queue) along with different RED queue modifications. We showed that a single queue implementation of JumboGen performed significantly better than when the JumboGen approach was not used. Overall, when the number of packets determines the queue utilization, it is better to use an approach that allows for partial dropping of packets. When the number of bytes consumed is used to determine queue utilization, partial packet drops are not preferred, as it is better to drop a single JumboGen packet consisting of mainly large data packets rather than parts of many JumboGen packets, some consisting of TCP ACKs.

Finally, we have shown the performance of multiple queue JumboGen. The multiple queue approach is shown to be able to provide performance that is similar to the single queue approach while having excellent fairness. However, the performance of the multiple queue approach can vary depending on the heterogeneity of the underlying network flows. If the flows are mostly from different source/destination pairs, then the performance of the multiple queue approach is good; otherwise this will not be the case. The various simulations suggest that even though the fairness of the single queue approach is not quite equal to that of the multiple queue or non-JumboGen approach, it may still be a better choice, due to the fact that it will work well regardless of the underlying traffic patterns.

REFERENCES

[1] R. Gass, "Sprint corp. packet size distribution," http://ipmon.sprint.com, 2004.
[2] J. Kay and J. Pasquale, "Profiling and reducing processing overheads in TCP/IP," IEEE/ACM Transactions on Networking, vol. 4, no. 6, pp. 817–828, Dec. 1996.

[3] W.-c. Feng, J. G. Hurwitz, H. Newman, S. Ravot, R. L. Cottrell, O. Martin, F. Coccetti, C. Jin, X. D. Wei, and S. Low, "Optimizing 10-gigabit ethernet for networks of workstations, clusters, and grids: A case study," in SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing. Washington, DC, USA: IEEE Computer Society, 2003, p. 50.
[4] S. Ravot, Y. Xia, D. Nae, X. Su, H. Newman, J. Bunn, and O. Martin, "A practical approach to TCP high speed WAN data transfers," in Proceedings of the Broadnet 2004 Conference, San Jose, CA, Oct. 2004.
[5] E. Rosen, A. Viswanathan, and R. Callon, "Multiprotocol label switching architecture," IETF RFC 3031, Jan. 2001.
[6] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993.
[7] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, M. Stemm, and R. H. Katz, "TCP behavior of a busy internet server: Analysis and improvements," in Proceedings of INFOCOM 1998, 1998, pp. 252–262.
[8] "UCB/LBNL/VINT Network Simulator - ns (version 2)," Nov. 2005, available at http://www.isi.edu/nsnam/ns/.
[9] "GenMCast: A generic multicast extension for ns-2," Nov. 2005, available at www.cse.nd.edu/~striegel/GenMCast.
[10] R. Jain, M. Chiu, and W. Hawe, "A quantitative measure of fairness and discrimination for resource allocation in shared systems," DEC Technical Report DEC-TR-301, 1984.
[11] B. Badrinath and P. Sudame, "Gathercast: The design and implementation of a programmable aggregation mechanism for the Internet," in Proceedings of the IEEE International Conference on Computer Communications and Networks (ICCCN), Las Vegas, NV, USA, Oct. 2000, pp. 206–213.
[12] D. Salyers and A. Striegel, "A novel approach for transparent bandwidth conservation," in Proceedings of IFIP Networking 2005, Waterloo, ON, Canada, May 2005, pp. 1219–1230.
[13] D. Petrovic, R. C. Shah, K. Ramchandran, and J. Rabaey, "Data funneling: Routing with aggregation and compression for wireless sensor networks," in Proceedings of the IEEE International Workshop on Sensor Network Protocols and Applications, Anchorage, AK, May 2003, pp. 156–162.
[14] A. Jain, M. Gruteser, M. Neufeld, and D. Grunwald, "Benefits of packet aggregation in ad-hoc wireless network," University of Colorado Technical Report CU-CS-960-03-1, Aug. 2003. [Online]. Available: http://www.cs.colorado.edu/department/publications/reports/docs/CU-CS-960-03.pdf
[15] S. Floyd, "HighSpeed TCP for large congestion windows," IETF RFC 3649, Dec. 2003.
[16] T. Kelly, "Scalable TCP: Improving performance in highspeed wide area networks," ACM SIGCOMM Computer Communication Review, vol. 33, no. 2, pp. 83–91, 2003.
[17] C. Jin, D. Wei, and S. Low, "FAST TCP: Motivation, architecture, algorithms, performance," in Proceedings of IEEE INFOCOM 2004, Mar. 2004.
[18] K. Moldeklev and P. Gunningberg, "How a large ATM MTU causes deadlocks in TCP data transfers," IEEE/ACM Transactions on Networking, vol. 3, no. 4, pp. 409–422, 1995.
[19] J. Nagle, "Congestion control in IP/TCP internetworks," IETF RFC 896, Jan. 1984.