
PAPER  Special Issue on Multimedia Communication in Heterogeneous Network Environment

The Effects of Software Traffic Shaping for Transport Protocols in Bandwidth Guaranteed Services

Kei YAMASHITA†, Member, Shusuke UTSUMI††, Hiroyuki TANAKA†, Nonmembers, Kenjiro CHO*, and Atsushi SHIONOZAKI*, Nonmembers

SUMMARY  In this paper, we show the effectiveness of software shaping through an evaluation of our extensions to the Internet transport protocols, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). These extensions are aimed at the efficient realization of bulk data transfer and continuous media communication. The extensions are to be used with resource reservation, a possible and promising approach to resolving transport issues that the current TCP/IP networks cannot support. Although it seems straightforward to utilize dedicated bandwidth set up via resource reservation, filling up the reserved pipe is not so trivial. Performance analysis shows that, by applying the traffic shaping extensions, not only is the reserved pipe easily filled up, but the timely data delivery required by continuous media communication is also provided. Our experiments with a real system also show that the overheads introduced by the new extensions are small enough to permit their practical use. The extensions are implemented in the UNIX system kernel.

key words: traffic shaping, TCP, UDP, long fat pipe, QoS

Manuscript received December 15, 1997.
Manuscript revised March 9, 1998.
† The authors are with NTT Optical Network Systems Laboratories, Yokosuka-shi, 239-0847 Japan.
†† The author is with Sony IT Laboratory, Tokyo, 141-0022 Japan.
* The authors are with Sony Computer Science Laboratory, Tokyo, 141-0022 Japan.

1 Introduction

TCP has incorporated many improvements and extensions [2, 3, 9] since its inception. TCP can work in unpredictable internetworking environments because it incorporates mechanisms such as dynamic congestion control by detecting packet losses. However, for the long fat pipe, where the product of RTT (Round-Trip Time) and bandwidth is large, TCP has yet to be improved to fully utilize the pipe bandwidth. One problem is an inherent difficulty in optimizing adaptive congestion control in a network with a large RTT.

In adaptive congestion control, a data sending terminal is either notified by the network, or detects for itself by observing ack messages, whether the network is congested or not. The sender then adapts its sending rate, for example by shrinking or widening its window size, as in the TCP method. It takes at least one RTT to the congested node for the sender's reaction to take effect on that node; in the TCP method, it takes as much as one RTT to the receiver. This delay can be very harmful in a WAN (Wide Area Network), because the delay, ranging from 10 msec to hundreds of msec, can be larger than the time scale within which the congestion state changes. The sending rate and the queue length of the congested node would oscillate, and the congestion would never subside.

One method to overcome this problem is to adopt a congestion prevention approach, in which a fixed bandwidth for a flow is reserved by a fast resource reservation scheme. The overhead of setting up the reservation can be ignored for bulk data transfer, since the transfer time is long compared to the setup time. However, TCP as it is cannot fill up the reserved pipe, as will be described in section 2. Thus a traffic shaping mechanism, which can send packets at a specified interval, is required for TCP.

For continuous media applications that do not require reliable transport, UDP can be used as a transport protocol, albeit with a risk of flooding the link. Thus again, some kind of shaping mechanism is necessary. The IETF (Internet Engineering Task Force) has proposed RTP (Real-Time Transport Protocol) [14], which provides application-oriented real-time features in conjunction with UDP. RTP provides features for multiplexing flows and synchronization, but it does not support flow control mechanisms, and traffic shaping is left up to the application software. This is one way of dealing with the problem, but traffic shaping in the transport layer can provide finer grain shaping and can sustain more flows than application layer shaping.

In this paper, we propose to incorporate software traffic shaping into TCP and UDP as their extensions, and we report their implementation in the UNIX kernel and its performance evaluation using a real system.

This paper is organized as follows.
Section 2 describes TCP's behavior in the long fat pipe and explains the requirement for traffic shaping. Section 3 describes the protocol designs and their implementation, and section 4 shows the results of the performance evaluations. Section 5 refers to related works, and section 6 concludes the paper.

2 TCP in a reserved long fat pipe

Figure 1 illustrates the measured behavior of TCP in a long fat pipe. The Reno version of TCP was used. Figure 1(a) shows the data throughput and Figure 1(b) shows the packets dropped by an ingress router between a sending and a receiving node. Figure 2 illustrates our experimental network. A sender, an ingress router, a hardware delayer and a receiver were all connected by 155 Mbps ATM (Asynchronous Transfer Mode) links. We adjusted the rate at which the ingress router sent packets downstream to about 40 Mbps, while the sender can send data at full speed on its ATM link. RTT was set to 20 msec by the hardware delayer. The buffers on both the sending and receiving sides were set to 160 Kbytes. The packet buffer on the router was 32 Kbytes, 4 times the MTU (Maximum Transmission Unit) size. This configuration simulates the reservation of 40 Mbps bandwidth in the WAN section, while the full link rate is available in the LAN (Local Area Network) section. This kind of bandwidth gap between LAN and WAN will remain for the foreseeable future; LAN link speed will exceed Gbps, while WAN link speed is constrained by cost.

Figure 1  TCP behavior in the long-fat pipe (RTT 20 ms): (a) TCP throughput (Mbps) versus time (sec); (b) TCP packet drop (Mbps) versus time (sec)

Figure 2  Experimental network: a sender and a receiver (PC-AT, Pentium 200 MHz) connected through an ingress router and a hardware delayer (RTT 20 ms) over 155 Mbps ATM interfaces; the router's packet buffer is constrained to 32 Kbytes (4 MTU)

In Figure 1(a), TCP's "slow start" [17] due to a series of packet losses can be observed immediately after the data sending begins. In the slow start phase, which is performed at the very beginning of data sending and is triggered by the Retransmission Time Out (RTO), TCP slashes its window size down to its minimum (one MTU), then increases its window size gradually each time an ack packet arrives. This behavior is one of TCP's congestion control mechanisms. Then, a relatively steady peak at 35 Mbps continues while TCP repeatedly executes its "fast retransmit and fast recovery" algorithm [17] as a reaction to single, isolated packet losses. The fast retransmit and fast recovery algorithm, however, sometimes fails: when two packets are lost within a short time span, the window size can be too small to produce three consecutive duplicate acks for the second lost packet. In this case, fast retransmit does not occur for that packet. The loss of a retransmitted packet itself also makes the algorithm fail. The failure of the algorithm causes a slow start, as shown in Figure 1 at 85 sec and later on. Obviously, TCP does not fully utilize the bandwidth of the 40 Mbps link.

This throughput degradation has two causes. One is that packet bursts exceeding the network capacity are sent during slow start, during fast recovery, or by ack compression [11, 13]. To fully utilize the bandwidth in a long fat pipe, TCP has to have a large window at the sender, but this leads to large packet bursts. The other reason is that a sender TCP continues to increase its window size even after the optimum window size is reached.

Thus we can see that a traffic shaping mechanism, which sends packets at a specified interval, is required in a high speed network. We claim that software shaping is essential even when hardware shaping is provided. In ATM networks, a NIC (Network Interface Card) or a switch in the data path usually supports hardware shaping. Such shapers can precisely control the rate at which each cell is sent out onto the fiber (on an OC-3 link, one cell every 2.7 usec).

However, in an internetworking environment one cannot assume that a traffic shaping mechanism will be provided in hardware all the time. Even if a backbone network incorporates ATM, it is highly likely that end nodes are connected to other data links such as Ethernet. In such cases, even if it is possible to set up a dedicated VC (Virtual Connection) in the ATM backbone to support hardware shaping, current TCP/IP may not be able to fill the reserved pipe, as is clear from Figure 1. A similar phenomenon can also be observed within the sending node itself. If TCP sends packets at a rate that exceeds the reserved bandwidth, the NIC and device driver buffers are flooded and packets are lost.

3 Design and Implementation

3.1 Transport layer

The Internet integrated services proposed by the IETF, such as Guaranteed Service [15] and Controlled Load Service [19], specify that packets should be sent out according to TSpec parameters following the token bucket algorithm [12]. We modified TCP and UDP in the FreeBSD kernel (Version 2.2) to support the token bucket algorithm. Our extensions maintain compatibility with the normal TCP and UDP implementations.

Figure 3  Token bucket algorithm: tokens accumulate at the bucket rate up to the bucket depth and are consumed by sending data packets

The token bucket algorithm behaves as follows (Figure 3). It has two parameters, the bucket rate and the bucket depth. The bucket rate determines the rate at which the quantity of tokens in the token bucket can increase. The quantity of tokens can never exceed the bucket depth. In the initial state, the token bucket is full of tokens. A packet can be sent only when the quantity of tokens in the bucket is larger than the packet size. If not enough tokens are available, the packet must wait for the token quantity to increase. In the IETF definition, the token quantity has a continuous value, not the discrete value one might expect. This algorithm shapes packet traffic such that the long-term average throughput equals the bucket rate and the packet burst size is constrained to under the bucket depth.

With our extensions enabled, TCP and UDP send out packets according to the token bucket algorithm. For the TCP extension, code that checks whether the current token quantity is enough to permit packet release is inserted into the original TCP code just before the function that hands the packet to the IP layer. For the UDP extension, we recreated almost the entire protocol code, since we had to provide the UDP layer with a packet buffer. In our experimental implementation, the TCP/UDP code is called at a regular interval by a timeout function, and the code checks whether any flow is waiting for its tokens to increase sufficiently. Though more sophisticated methods of realizing the timeout function required for the token bucket algorithm could be envisaged, this is a subject for future research.

We are targeting our system at applications that send data in the range of a few Mbps to 100 Mbps, so if we assume that the hardware shapers of the NIC or ingress routers are equipped with buffers of several tens of Kbytes, a software shaping granularity of 10 msec is not sufficient (a 40 Mbps flow requires approximately 50 Kbytes per interval). The FreeBSD kernel timer used for the shaping timer in our implementation defaults to 10 msec, so we changed the kernel clock to support 1 msec traffic shaping (a 40 Mbps flow now requires only approximately 5 Kbytes). Because modern CPUs have internal clock speeds of a few hundred MHz, this 1 msec timer does not seriously degrade system performance.

Since we can assume that a fixed bandwidth is reserved for a particular TCP flow, congestion control through window size adjustment is not necessary. In other words, the congestion window can be ignored and only the advertised window need be considered. Thus a congestion control flag is introduced so that congestion control can be turned on or off, i.e., the congestion window is either used or ignored. The TSpec parameters for the token bucket algorithm and the toggling of TCP congestion control can all be set through setsockopt() calls.
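
The shaping check described above can be sketched in ordinary C. The following is only an illustrative user-level restatement of the token bucket test, not the authors' kernel code: the names (struct tb_state, tb_update, tb_may_send) are ours, and the kernel hooks (the check inserted ahead of the hand-off to IP, and the periodic shaping timeout that wakes waiting flows) are indicated only in comments.

    /*
     * Sketch of the token bucket check of Sect. 3.1 (illustrative names,
     * not the authors' kernel code).  Tokens are counted in bytes and,
     * following the IETF definition, kept as a continuous quantity.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    struct tb_state {
        double   rate;     /* bucket rate, bytes per second (from the TSpec) */
        double   depth;    /* bucket depth, bytes (limits the burst size)    */
        double   tokens;   /* current token quantity; starts at 'depth'      */
        uint64_t last_us;  /* time of the previous refill, microseconds      */
    };

    /* Refill tokens at the bucket rate, never exceeding the bucket depth. */
    static void tb_update(struct tb_state *tb, uint64_t now_us)
    {
        tb->tokens += (double)(now_us - tb->last_us) * 1e-6 * tb->rate;
        if (tb->tokens > tb->depth)
            tb->tokens = tb->depth;
        tb->last_us = now_us;
    }

    /*
     * Called just before a packet would be handed to the IP layer (the
     * position of the inserted check in the TCP extension).  If it
     * returns false, the flow waits; in the kernel implementation the
     * periodic 1 msec shaping timeout re-runs the check for waiting flows.
     */
    static bool tb_may_send(struct tb_state *tb, uint64_t now_us, size_t pktlen)
    {
        tb_update(tb, now_us);
        if (tb->tokens >= (double)pktlen) {
            tb->tokens -= (double)pktlen;
            return true;
        }
        return false;
    }

With a 1 msec shaping tick, a 40 Mbps flow accumulates roughly 5 Kbytes of tokens per tick, which matches the buffer estimate given above; on FreeBSD the finer tick presumably corresponds to building the kernel with a higher clock frequency (e.g., options HZ=1000), although the paper does not spell out the exact mechanism used.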

3.2 Resource Reservation Protocols

Our extensions of TCP and UDP can be used with resource reservation protocols such as RSVP [20] and ST2+ [1]. However, we are also developing a new resource reservation protocol called ASP (AMInet Setup Protocol) [16], which is an IP layer protocol but can directly set up an ATM VC at the same time as the resource reservation on the IP layer proceeds. ASP is a fast and flexible resource reservation protocol, since it does not use the ATM UNI signaling protocol, which is usually described as heavy and slow.

The AMInet Project [16], under which this research is performed, aims, as a first step, to provide the advanced integrated services that must be supported in the near future, when optical fiber or xDSL technologies fatten the subscriber line to each home. In AMInet, the backbone network is built on an ATM network. AMInet makes use of the ATM network only as a link layer network, and does not use the network layer functions of ATM, such as routing, addressing, signaling and so on. The network layer functions are performed by the IP protocols; thus we are developing ASP to integrate ATM with the IP protocols.

ASP has been implemented as a daemon process (aspd) on FreeBSD. Figure 4 shows an implementation image. An application program asks aspd to reserve the resources it wants to use. The application program then sets the parameters of the traffic shaper for the transport protocol to fit the resources obtained by aspd. If a terminal or a router has an ATM interface, aspd maps the IP flow into an ATM VC, which is also set up by aspd. If a hardware traffic shaper on the ATM interface is available, it is configured by aspd.
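
The application-side sequence described in sections 3.1 and 3.2 — ask aspd for a reservation, then fit the transport shaper to the granted resources — might look roughly as follows. Everything specific here is hypothetical: the paper does not document aspd's request interface or the socket option names, so ask_aspd_reserve(), TCP_TB_RATE, TCP_TB_DEPTH, and TCP_NO_CONGCTL are placeholders standing in for whatever the real implementation provides.

    /*
     * Hypothetical sketch of the application-side setup (Sect. 3.1/3.2).
     * ask_aspd_reserve() stands in for the (undocumented) IPC to aspd;
     * the TCP_TB_* / TCP_NO_CONGCTL option names and values are invented
     * here -- the paper states only that the TSpec parameters and the
     * congestion control flag are set through setsockopt().
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define TCP_TB_RATE     0x100   /* bucket rate, bytes/sec   (placeholder) */
    #define TCP_TB_DEPTH    0x101   /* bucket depth, bytes      (placeholder) */
    #define TCP_NO_CONGCTL  0x102   /* ignore congestion window (placeholder) */

    /* Placeholder for the reservation request to the aspd daemon; returns
     * the granted rate in bytes/sec, or -1 if the reservation fails. */
    extern long ask_aspd_reserve(const char *dst, long rate_bytes_per_sec);

    int setup_reserved_flow(int sock, const char *dst, long want_bps)
    {
        long rate  = ask_aspd_reserve(dst, want_bps / 8);
        int  depth = 2 * 8192;          /* e.g., two MTUs, as in Sect. 4.1 */
        int  on    = 1;

        if (rate < 0)
            return -1;

        /* Fit the shaper to the resources obtained by aspd. */
        if (setsockopt(sock, IPPROTO_TCP, TCP_TB_RATE, &rate, sizeof(rate)) < 0 ||
            setsockopt(sock, IPPROTO_TCP, TCP_TB_DEPTH, &depth, sizeof(depth)) < 0)
            return -1;

        /* With a reserved pipe, window-based congestion control is not
         * needed, so the extension allows it to be switched off. */
        return setsockopt(sock, IPPROTO_TCP, TCP_NO_CONGCTL, &on, sizeof(on));
    }

If an ATM interface with a hardware shaper is present, the corresponding VC setup and shaper configuration are handled by aspd itself, so the application only deals with the reservation request and the socket options.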

Figure 4  System overview: the application and aspd run in user space; in the kernel, TCP and UDP each have a software shaper above IP, which runs over Ethernet or ATM interfaces, with a hardware shaper available on the ATM interface

4 Performance Evaluation

Various experiments were performed to evaluate our implementation on PC-AT compatibles equipped with Intel Pentium processors (200 MHz). Our extensions are denoted TCP Bulk Data Transfer Extensions, or simply TCP BDTE, and UDP Continuous Media Transfer Extensions, or UDP CMTE. In all experiments, we performed memory-to-memory transfer, in which a sender application repeatedly sends dummy data from memory that is then received by a receiver application and thrown away.

4.1 Throughput: Comparison with TCP

Figure 5 shows the throughput of TCP BDTE in the same experiment we performed for TCP in Figure 1. For this experiment the bucket rate and the bucket depth were set to 40 Mbps and 2 MTU (about 16 Kbytes), respectively. The bucket depth is so small that, even if the window size is large, buffer overflow does not occur. This can be seen clearly in Figure 5: no packet is dropped when using TCP BDTE.

Figure 5  TCP BDTE throughput (RTT 20 ms): throughput (Mbps) versus time (sec)

Next, we compared the throughput of TCP and TCP BDTE by varying the maximum window size at the sending node (Figure 6). The bucket rate and the bucket depth for TCP BDTE are the same as those in the Figure 5 case. We also performed experiments with a slightly modified TCP whose RTO timer has a resolution ten times finer (50 msec) than the default value (500 msec), since this resolution may have a considerable effect on TCP throughput.

Figure 6  TCP throughput comparison with a network bottleneck (RTT 20 ms): throughput (Mbps) versus maximum window size (Kbytes) for TCP BDTE, TCP (RTO 50 ms), and TCP (RTO 500 ms)

For the TCP cases, the largest throughput (35 Mbps) is obtained when the maximum window size is set to 144 Kbytes. The reserved bandwidth (40 Mbps), however, is not fully utilized even in this case. When the maximum window size is set over 160 Kbytes, excessive retransmissions decrease TCP throughput, whatever the RTO value.

Since the product of RTT (20 msec) and the expected throughput (40 Mbps) is 100 Kbytes, one might think that a window size of 100 Kbytes would yield the optimum throughput. That size, however, achieves only 30 Mbps, as seen from Figure 6. This is because the RTT value of 20 msec includes only fixed delays, and does not include queueing delays or the TCP processing delay at the receiver. The bottleneck router has a 32 Kbyte packet buffer, so some packets are delayed by 32 Kbytes / 40 Mbps = 6.4 msec. If the RTT is assumed to be 26.4 msec, the required window size is 132 Kbytes, which is much larger than 100 Kbytes. Thus, to decide the optimum window size for conventional TCP, we would have to take into account all the queueing delays, which change frequently in a real network, as well as the processing delay on the host. In other words, the optimum window size for normal TCP is difficult to determine before transmitting data.
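
As a worked restatement of the numbers above (our notation; 1 Kbyte is taken as 1000 bytes, which is what the 100 Kbyte and 132 Kbyte figures in the text imply):

    \begin{align*}
    W_{\mathrm{BDP}} &= 40\,\mathrm{Mbps} \times 20\,\mathrm{msec}
                      = 800\,\mathrm{kbits} = 100\,\mathrm{Kbytes},\\
    d_{q}            &= 32\,\mathrm{Kbytes} / 40\,\mathrm{Mbps}
                      = 256\,\mathrm{kbits} / 40\,\mathrm{Mbps} = 6.4\,\mathrm{msec},\\
    W_{\mathrm{req}} &= 40\,\mathrm{Mbps} \times (20 + 6.4)\,\mathrm{msec}
                      \approx 132\,\mathrm{Kbytes}.
    \end{align*}

The queueing term d_q depends on how full the bottleneck buffer happens to be, which is exactly why the optimum window is hard to predict in advance.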

For TCP to perform better, a larger packet buffer is needed at the bottleneck router. If a sufficient packet buffer is not available, not only does this decrease TCP's throughput, but the router queue will also fill up and may negatively influence the data flows of other connections, causing such problems as packet drops and increases in delay and delay jitter.

With TCP BDTE, we can effectively shape the traffic to fill up the pipe to 40 Mbps. Furthermore, it is possible to increase throughput with a minimal packet buffer at the bottleneck router, decreasing the possibility of affecting other connections. In addition, the ideal throughput can be obtained simply by setting the maximum window size as large as possible. It should also be pointed out that the throughput of TCP BDTE does not degrade as RTT increases, as long as enough buffer space is available at both the sender and the receiver. In TCP BDTE, the sender simply sends packets at the specified rate with a fixed sending window size (= the advertised window); the pacing of packets is kept independent of RTT.

Another potential bottleneck might be found in the sending node itself. In the FreeBSD TCP implementation, data packets are passed from TCP/IP to the NIC using the normal TCP flow control mechanisms, so this presents the same problem as if there were a bottleneck in the network. To demonstrate this point, two PCs equipped with an ATM NIC were connected to each other only through the hardware delayer. RTT was set to 20 msec. The hardware shaper on the ATM NIC was turned off.

Figure 7  TCP throughput comparison with the sending-node NIC as the bottleneck (RTT 20 ms): throughput (Mbps) versus maximum window size (Kbytes) for BDTE (1 ms clock), BDTE (10 ms clock, split results), and TCP

In Figure 7, we can see that when TCP's window size is increased, the buffers in the NIC and device driver (a total of 96 Kbytes) of the sending node overflow, resulting in a decrease in throughput. TCP's throughput drastically decreases when the window is above 192 Kbytes. For TCP BDTE, when the shaping timer period is 1 msec, throughput continues to increase with the window size. The bucket rate and the bucket depth are set to 80 Mbps and 2 MTU, respectively. However, when the shaping timer is set to 10 msec, throughput above 192 Kbytes becomes unstable and alternates between approximately 70 Mbps and 30 Mbps. This happens because the 10 msec case requires a larger token bucket depth (13 MTU, about 100 Kbytes), increasing the possibility of buffer overruns.

It can be concluded from these results that traffic shaping at the transport layer is effective and can defeat problems TCP sometimes faces with buffer bottlenecks in the network or even in the sending node. Furthermore, the timer granularity used for shaping traffic influences throughput considerably, which justifies our decision to implement a 1 msec shaping timer.

4.2 Throughput: Comparison with UDP

Figure 8 compares the throughputs of UDP CMTE and normal UDP. The kernel timer was set to 1 msec in all cases. For UDP, shaping is implemented in the application; for UDP CMTE, accurate shaping is executed in the transport layer. The bucket depth is set to the product of the bucket rate and 1 msec. Figure 8 shows that for UDP data transfer, even when there is no other active process ("no load" in Figure 8), it cannot perform over 50 Mbps, indicating the limit of application-level transport. On the other hand, UDP CMTE properly shapes traffic up to over 100 Mbps.

Figure 8  UDP throughput comparison: actual throughput (Mbps) versus specified throughput (Mbps) for UDP and UDP CMTE under 0, 1, and 4 background load processes

Figure 8 also shows results for cases with several background processes running. As a background process, we use a memory copy benchmark (the bcopy benchmark) from "lmbench" [10], a benchmark suite. The benchmark is set up to repeatedly copy 8 Mbyte blocks, which do not fit into the CPU's second level cache (512 Kbytes). When background processes exist ("n load" in Figure 8, where n represents the number of background processes), UDP throughput degrades drastically, but the throughput of UDP CMTE shows no decrease. This is because for UDP CMTE the traffic shaping is executed inside the system kernel, which gives it a higher priority than application layer shaping.
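
For reference, the application-level shaping that normal UDP is limited to in this comparison is typically a sleep-paced send loop of the following kind. This sketch is ours, not the authors' benchmark program; it assumes a connected UDP socket and shows why user-space shaping is bound to the sleep/timer granularity (about 10 msec with the default kernel clock), sending a burst of roughly 50 Kbytes per period at 40 Mbps.

    /*
     * Illustrative user-space pacing loop (not the authors' test code).
     * Each period it sends enough packets to sustain the target rate and
     * then sleeps; accuracy is limited by the usleep()/kernel-tick
     * granularity, unlike the in-kernel shaping of UDP CMTE.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define PKT_SIZE   8192      /* about one MTU on the ATM link */
    #define PERIOD_US  10000     /* 10 msec pacing period         */

    void paced_send(int sock, double rate_bps, const void *buf)
    {
        /* bytes that must leave in each period to sustain rate_bps */
        double budget = rate_bps / 8.0 * (PERIOD_US / 1e6);

        for (;;) {
            double sent = 0;
            while (sent < budget) {
                if (send(sock, buf, PKT_SIZE, 0) < 0)
                    return;                  /* stop on error */
                sent += PKT_SIZE;
            }
            usleep(PERIOD_US);               /* burst, then idle */
        }
    }

Besides the coarse granularity, such a loop runs at ordinary process priority, so a busy background process delays the sends themselves, which is consistent with the sharp degradation of normal UDP under load in Figure 8.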

4.3 Shaping Accuracy

Performance results illustrating the accuracy of our traffic shaping extensions are shown in Table 1 and Table 2. These measurements were taken with an ATM traffic analyzer used to probe the AAL5 frames sent from a PC into an optical fiber. The hardware shaper on the ATM NIC was turned off.

Table 1 shows the ratio of the actual rate to the specified rate. Table 2 shows the ratio of packets that violated the specified token bucket parameters to all packets sent. These measurements were taken with the shaper timer period at 1 msec for TCP BDTE. The token bucket depth was one MTU (about 8 Kbytes) for per-flow rates up to 16 Mbps, two MTU for the 32 and 64 Mbps cases, and three MTU for the 128 Mbps case. For example, in the 16 Mbps case, an MTU-sized packet is to be sent every 4 msec.

The first column of each table represents the number of flows established at the same time in a sender, and the top row represents the data sending rate assigned to each flow. When there were multiple flows at the same time, we picked one flow to measure.

Table 1  Rate ratio: actual rate / specified rate

               Specified rate per VC (Mbps)
  # of VCs     1      2      4      8      16     32     64     128
  1          1.009  1.008  1.006  1.011  1.006  0.994  0.968  0.922
  2          1.009  1.008  1.006  1.007  1.005  0.989  0.954  N.A.
  4          1.007  1.005  1.003  0.995  0.978  0.769  N.A.   N.A.
  8          0.992  0.976  0.945  0.889  0.734  N.A.   N.A.   N.A.

Table 2  Violated packet ratio: # of packets that violated the parameters / # of all packets

               Specified rate per VC (Mbps)
  # of VCs     1      2      4      8      16     32     64     128
  1          0.008  0.008  0.008  0.008  0.008  0.008  0.009  0.000
  2          0.008  0.008  0.008  0.008  0.007  0.008  0.013  N.A.
  4          0.004  0.002  0.007  0.013  0.005  0.006  N.A.   N.A.
  8          0.003  0.003  0.006  0.005  0.004  N.A.   N.A.   N.A.

First, we can see that the actual rate observed is within 1 to 5 percent of the specified rate as long as the total specified rate (the per-flow rate multiplied by the number of flows) is not too large. Second, less than 1 percent of the packets violated the specified parameters most of the time. Third, the figures in the tables are not affected much by the number of flows established at the same time. These results show that even our simple implementation can provide accurate traffic shaping. The same results were obtained for UDP CMTE.

4.4 Overheads of software shaping

Software traffic shaping incurs additional system load. Our simple implementation, which calls the TCP/UDP code every 1 msec whether or not a TCP/UDP flow is waiting for the timeout, is likely to add new overheads because frequent context switches are executed.

We evaluated the system overhead from two points of view. First, we measured the CPU load when a data sending process runs alone. Second, we measured the performance of a background process that runs at the same time as the data sending process. The former experiment measures the overhead of the traffic shaping itself, and the latter measures the overhead that impacts the whole system performance.

Figure 9 compares the CPU load of the sending node when a data sending process runs alone. Two PCs were connected by ATM links by way of a router (also a PC). RTT was 0.5 msec. The throughput of normal TCP was adjusted by the router, which implements CBQ [5][8], a queueing mechanism that realizes both bandwidth restriction and bandwidth sharing at the same time. The packet buffer of the router was set large enough to prevent packet drops in this experiment. The router puts no restriction on the traffic of TCP BDTE. CPU load was measured while a process ran for one minute, using "iostat", a FreeBSD shell command. The kernel clock was set to 1 msec for TCP BDTE and UDP CMTE, and to 10 msec for TCP and UDP.

Figure 9  CPU load comparison: CPU load (%) versus throughput (Mbps) for TCP, TCP BDTE, UDP, and UDP CMTE

As seen from Figure 9, no distinct difference is found below 50 Mbps between TCP and TCP BDTE, though TCP BDTE gives a 10% load increase compared with TCP at speeds over 50 Mbps. TCP does not provide throughput higher than 95 Mbps in this network configuration, because of packet dropping inside the sending terminal. UDP CMTE shows a relatively large difference from UDP in the low throughput range. For example, at 16 Mbps throughput, the CPU load for UDP CMTE is 14% whereas the load for UDP is 6%. However, UDP provides neither millisecond-order accurate shaping nor the specified throughput over 50 Mbps. The absolute values of UDP CMTE load are comparable to TCP's values in the range under 50 Mbps; for example, the TCP load is 12% at 16 Mbps throughput while the load for UDP CMTE is 14%. From the above results, we can say that, even with our simple implementation, the overhead of the traffic shaping itself is small and should pose no practical problem.

Next, we ran various benchmark programs from "lmbench", a suite of system performance benchmarks, as background processes. Figures 10 and 11 show the decrease in throughput of the memory copy benchmark (the bcopy benchmark), the same one used in section 4.2, while sending data onto the network. Figure 10 compares TCP and TCP BDTE, and Figure 11 compares UDP and UDP CMTE. We obtained similar results for other benchmarks of CPU-intensive tasks, system call latency, process creation time, and so on.

Figure 10  TCP and TCP BDTE: effect on system load; copy throughput (MBps) versus network throughput (Mbps) for TCP (10 ms timer) and TCP BDTE (1 ms timer)

Figure 11  UDP and UDP CMTE: effect on system load; copy throughput (MBps) versus network throughput (Mbps) for UDP (10 ms timer), UDP (1 ms timer), and UDP CMTE (1 ms timer)

For the TCP comparison, it can be observed that both TCP and TCP BDTE reach a performance limit around 50 Mbps to 64 Mbps. TCP BDTE reaches its throughput limit a little faster than TCP. The decrease in memory copy throughput incurred by TCP BDTE varies from 5% to 20%. The overhead of the traffic shaping seems to have more impact on a coexisting process than might be expected from the results of the previous experiment (Figure 9).

For UDP, we could not measure throughput above 64 Mbps because the PC could not provide enough processing power. Up to 48 Mbps, traffic shaping performed by UDP CMTE decreases the throughput of the benchmark program by about 7.5% on average. Over 48 Mbps, since UDP cannot meet its target rate, the slope of the line steepens and the benchmark program throughput degrades likewise. However, UDP CMTE is able to shape traffic consistently up to 128 Mbps. In Figure 11, the measurement for UDP with a 1 msec timer is also shown. It shows little difference from the 10 msec timer case. This means that merely changing the kernel clock resolution incurs almost no overhead by itself.

TCP and TCP BDTE throughput are more susceptible than UDP CMTE to the coexisting process. This seems to come from the higher loads of TCP and TCP BDTE, as shown in Figure 9. Since TCP has more complexity than UDP, such as ack processing, this result is not surprising.

To summarize the results of the experiments, the overhead incurred by the traffic shaping is about 10% on average, and 20% at maximum. We consider this to be at a practically allowable level. Changing the kernel timer to 1 msec resolution incurs almost no overhead by itself.

5 Related works

To address the transport issues concerning high speed networks, there have been a number of proposals stating that rate-based traffic control should be implemented in the transport protocol [4, 6, 18]. A dynamic and adaptive rate-control approach, however, is difficult to optimize in a wide area network. Those rate-control protocols could avert this problem if a scheme to reserve bandwidth and a mechanism to guarantee the reserved bandwidth were available. The early protocols, however, have not been widely used. We should try to adopt such a rate-based approach now, since resource reservation protocols such as RSVP [20] seem to be becoming widely used, and rate-guaranteeing mechanisms will become feasible once queueing systems such as CBQ [5][8] or link layer networks such as ATM become more widespread.

Others [7] also discuss TCP performance in a rate-guaranteed network, but their results are obtained only through simulation. Our experiment with a real system not only confirms their simulation results in part, but also investigates various issues of protocol implementation such as system load, system performance, and so on.

6 Conclusion

We implemented software shaping extensions to TCP and UDP based on the token bucket algorithm to effectively fill up a reserved pipe and to realize timely data delivery. We have shown through performance evaluation that our shaping extensions are effective for large bulk data transfer. We also showed that this software shaping mechanism is accurate to the extent that only 1% of all sent packets violate the specified token bucket parameters; the cost is only a 10% hit on system performance, a tolerable level.

Some subjects remain for further study. For example, admission control in the sending node, i.e., how to decide the possible or allowable throughput that a data sending process can get from the sending node (not from the network), may be necessary for efficient use of the system and network resources.

Acknowledgment

We would like to thank Dr. Kouichi Sano and Dr. Sadayasu Ono of NTT Optical Network Sys. Lab., and Dr. Fumio Teraoka and Dr. Mario Tokoro of Sony CSL for their helpful advice and encouragement.

References

[1] L. Berger and L. Delgrossi, "Internet STream protocol version 2 (ST2) - protocol specification - version ST2+," IETF RFC 1819, 1995.
[2] V. Jacobson, R. Braden, and D. Borman, "TCP extensions for high performance," IETF RFC 1323, 1992.
[3] L. S. Brakmo and S. W. O'Malley, "TCP Vegas: New techniques for congestion detection and avoidance," Proc. SIGCOMM'94, pp. 24-35, 1994.
[4] D. R. Cheriton, "VMTP: A transport protocol for the next generation of communication systems," Proc. SIGCOMM'86, pp. 406-415, 1986.
[5] K. Cho, "A framework for alternate queueing: Towards traffic management by PC-UNIX based routers," to appear in Proc. USENIX Technical Conf., June 1998. Programs are available from http://www.csl.sony.co.jp/person/kjc/programs.html.
[6] D. D. Clark, M. L. Lambert, and L. Zhang, "NETBLT: A bulk data transfer protocol," IETF RFC 998, 1987.
[7] W. Feng, D. Kandlur, D. Saha, and K. Shin, "Understanding TCP dynamics in an integrated services internet," Proc. NOSSDAV'97, 1997.
[8] S. Floyd and V. Jacobson, "Link-sharing and resource management models for packet networks," IEEE/ACM Trans. Networking, vol.3, no.4, pp.365-386, 1995.
[9] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow, "TCP selective acknowledgment options," IETF RFC 2018, 1996.
[10] L. McVoy and C. Staelin, "lmbench: Portable tools for performance analysis," Proc. USENIX Winter Conf., Jan. 1996. Programs are available from http://reality.sgi.com/employees/lm_engr/.
[11] P. P. Mishra, "Effect of policing on TCP traffic over ATM networks," Proc. ICC'96, June 1996.
[12] C. Partridge, Gigabit Networking, Addison-Wesley, 1994.
[13] V. Paxson, "Measurements and analysis of end-to-end internet dynamics," Ph.D. Thesis, University of California, Berkeley, April 1997.
[14] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," IETF RFC 1889, 1996.

[15] S. Shenker, C. Partridge, and R. Guerin, "Specification of guaranteed quality of service," IETF RFC 2212, 1997.
[16] A. Shionozaki, K. Yamashita, S. Utsumi, and K. Cho, "Integrating Resource Reservation with Rate-Based Transport Protocol in AMInet," Proc. Worldwide Computing and Its Applications - WWCA'98, also in Lecture Notes in Computer Science 1368, March 1998.
[17] W. R. Stevens, TCP/IP Illustrated, Volume 1, Addison-Wesley, 1994.
[18] T. W. Strayer, B. J. Dempsey, and A. C. Weaver, XTP - The Xpress Transfer Protocol, Addison-Wesley, 1992.
[19] J. Wroclawski, "Specification of the controlled-load network element service," IETF RFC 2211, 1997.
[20] L. Zhang, S. Deering, D. Estrin, S. Shenker, and D. Zappala, "RSVP: A new resource ReSerVation Protocol," IEEE Network, pp. 8-18, Sept. 1993.

Kei Yamashita received B.E. and M.E. degrees in mathematical engineering from Tokyo University in 1990 and 1992, respectively. In 1992, he joined NTT (Nippon Telegraph and Telephone Corporation) Laboratories. He has been engaged in research on traffic control techniques in the Internet and ATM networks. He is a member of IEICE and IEEE.

Shusuke Utsumi received a B.E. degree in electrical engineering and an M.S. in computer science from Keio University in 1994 and 1996, respectively. Since 1996 he has been at Sony Corporation, working on traffic control techniques in the Internet.

Hiroyuki Tanaka received an M.E. in information processing from the Nara Institute of Science and Technology in 1995. In 1995, he joined NTT (Nippon Telegraph and Telephone Corporation) Laboratories. His current interest is in multicast communication in the Internet and ATM networks. He is a member of IPSJ.

Kenjiro Cho is an associate researcher at Sony Computer Science Laboratory Inc., working on traffic management issues. He received an M.Eng. in Computer Science from Cornell University in 1993.

Atsushi Shionozaki received the B.E. degree in electrical engineering, and the M.S. and Ph.D. degrees in computer science, all from Keio University in 1990, 1992, and 1995, respectively. Since 1995 he has been at Sony Computer Science Laboratory Inc. as an associate researcher. His research interests include resource reservation, QoS control, and QoS routing.