
A (In)Cast of Thousands: Scaling Datacenter TCP to Kiloservers and Gigabits

Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David G. Andersen, Gregory R. Ganger, Garth A. Gibson

Carnegie Mellon University

CMU-PDL-09-101

February 2009

Parallel Data Laboratory Carnegie Mellon University Pittsburgh, PA 15213-3890

Abstract

This paper presents a practical solution to the problem of high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets—the Incast problem. In these networks, receivers often experience a drastic reduction in throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), incast can reduce throughput by up to 90%. Our solution for incast uses high-resolution timers in TCP to allow for microsecond-granularity timeouts. We show that this technique is effective in avoiding incast using simulation and real-world experiments. Last, we show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.

Acknowledgements: We would like to thank Brian Mueller at Panasas Inc. for helping us to conduct experiments on their systems. We also would like to thank our partners in the Petascale Data Storage Institute, including Andrew Shewmaker, HB Chen, Parks Fields, Gary Grider, Ben McClelland, and James Nunez at Los Alamos National Lab for help with obtaining packet header traces. We thank the members and companies of the PDL Consortium (including APC, Cisco, Data Domain, EMC, Facebook, Google, Hewlett-Packard, Hitachi, IBM, Intel, LSI, NetApp, Oracle, Seagate, Sun Microsystems, Symantec, and VMware) for their interest, insights, feedback, and support. Finally, we’d like to thank Michael Stroucken for his help managing the PDL cluster. This material is based on research sponsored in part by the National Science Foundation, via grants #CNS-0546551, #CNS-0326453 and #CCF-0621499, by the Army Research Office under agreement number DAAD19–02–1–0389, by the Department of Energy under Award Number #DE-FC02-06ER25767, and by DARPA under grant #HR00110710025.

Keywords: Cluster-based storage systems, TCP, performance measurement and analysis

1 Introduction

In its 35 year history, TCP has been repeatedly challenged to adapt to new environments and technology. Researchers have proved adroit in doing so, enabling TCP to function well in gigabit networks [27], long/fat networks [18, 10], satellite and wireless environments [22, 7], among others. In this paper, we examine and improve TCP’s performance in an area that, surprisingly, proves challenging to TCP: very low delay, high throughput, datacenter networks of dozens to hundreds of machines.

The problem we study is termed incast: a drastic reduction in throughput when multiple senders communicate with a single receiver in these networks. The highly bursty, very fast data overfills typically small Ethernet switch buffers, causing intense packet loss that leads to TCP timeouts. These timeouts last hundreds of milliseconds on a network whose round-trip time (RTT) is measured in the 10s or 100s of microseconds. Protocols that have some form of synchronization requirement—filesystem reads, parallel data-intensive queries—block waiting for the timed-out connections to finish. These timeouts and the resulting delay can reduce throughput by 90% (Figure 1, 200ms RTOmin) or more [25, 28].

[Figure 1: TCP Incast: A throughput collapse can occur with increased numbers of concurrent senders in a synchronized request. Reducing RTO to microsecond granularity alleviates Incast. Goodput (Mbps) vs. number of servers for 200µs RTOmin, 1ms RTOmin, jiffy-based, and default 200ms RTOmin; fixed block = 1MB, buffer = 64KB (est.), switch = S50.]

The motivation for solving this problem is the increasing interest in using Ethernet and TCP for inter-processor communication and bulk storage transfer applications in the fastest, largest data centers, instead of Fibrechannel or Infiniband. Provided that TCP adequately supports high bandwidth, low latency, synchronized and parallel applications, there is a strong desire to “wire-once” and reuse the mature, well-understood transport protocols that are so familiar in lower bandwidth networks.

In this paper, we present and evaluate a set of system extensions to enable microsecond-granularity timeouts for the TCP retransmission timeout (RTO). The challenges in doing so are threefold: First, we show that the solution is practical by modifying the Linux TCP implementation to use high-resolution kernel timers. Second, we show that these modifications are effective, enabling a network testbed experiencing incast to maintain maximum throughput for up to 47 concurrent senders, the testbed’s maximum size (Figure 1, 200µs RTOmin). Microsecond granularity timeouts are necessary—simply reducing RTOmin to 1ms without also improving the timing granularity may not prevent incast. In simulation, our changes to TCP prevent incast for up to 2048 concurrent senders on 10 gigabit Ethernet. Lastly, we show that the solution is safe, examining the effects of this aggressively reduced RTO in the wide-area Internet, showing that its benefits to incast recovery have no drawbacks on performance for bulk flows.

2 Background

Cost pressures increasingly drive datacenters to adopt commodity components, and often low-cost implementations of such. An increasing number of clusters are being built with off-the-shelf rack-mount servers interconnected by Ethernet switches. While the adage “you get what you pay for” still holds true, entry-level gigabit Ethernet switches today operate at full data rates, switching upwards of 50 million packets per second—at a cost of about $10 per port. Commodity 10Gbps Ethernet is now cost-competitive with specialized interconnects such as Infiniband and FibreChannel, and also benefits from wide brand recognition to boot. To reduce cost, however, switches often sacrifice expensive, power-hungry SRAM packet buffers, the effect of which we explore throughout this work.

The desire for commodity parts extends to transport protocols. TCP provides a kitchen sink of protocol features, giving reliability and retransmission, congestion and flow control, and delivering packets in-order to the receiver. While not all applications need all of these features [20, 31] or benefit from more rich transport abstractions [15], TCP is mature and well-understood by developers, leaving it the transport protocol of choice even in many high-performance environments.

Without link-level flow control, TCP is solely responsible for coping with and avoiding packet loss in the (often small) Ethernet switch egress buffers. Unfortunately, the workload we examine has three features that challenge (and nearly cripple) TCP’s performance: a highly parallel, synchronized request workload; buffers much smaller than the bandwidth-delay product of the network; and low latency that results in TCP having windows of only a few packets.

2.1 The Incast Problem

Synchronized request workloads are becoming increasingly common in today’s commodity clusters. Examples include parallel reads/writes in cluster filesystems such as Lustre [8], Panasas [34], or NFSv4.1 [33]; search queries sent to dozens of nodes, with results returned to be sorted[1]; or parallel databases that harness multiple back-end nodes to process parts of queries.

[1] In fact, engineers at Facebook recently rewrote the middle-tier caching software they use—memcached [13]—to use UDP so that they could “implement application-level flow control for ... gets of hundreds of keys in parallel” [12].

In a clustered file system, for example, a client application requests a data block striped across several storage servers, issuing the next data block request only when all servers have responded with their portion. This synchronized request workload can result in packets overfilling the buffers on the client’s port on the switch, resulting in many losses. Under severe packet loss, TCP can experience a timeout that lasts a minimum of 200ms, determined by the TCP minimum retransmission timeout (RTOmin).

While operating systems use a default value today that may suffice for the wide-area, datacenters and SANs have round-trip-times that are orders of magnitude below the RTOmin defaults:

    Scenario     RTT      |  OS       TCP RTOmin
    WAN          100ms    |  Linux    200ms
    Datacenter   <1ms     |  BSD      200ms
    SAN          <0.1ms   |  Solaris  400ms

When a server involved in a synchronized request experiences a timeout, other servers can finish sending their responses, but the client must wait a minimum of 200ms before receiving the remaining parts of the response, during which the client’s link may be completely idle. The resulting throughput seen by the application may be as low as 1-10% of the client’s bandwidth capacity, and the per-request latency will be higher than 200ms.

This phenomenon was first termed “Incast” and described by Nagle et al. [25] in the context of parallel filesystems. Nagle et al. coped with Incast in the parallel filesystem with application-specific mechanisms. Specifically, Panasas [25] limits the number of servers simultaneously sending to one client to about 10 by judicious choice of the file striping policies. It also reduces the default size of its per-flow TCP receive buffers (capping the advertised window size) on the client to avoid incast on switches with small buffers. For switches with large buffers, Panasas provides a mount option to increase the client’s receive buffer size. In contrast, this work provides a TCP-level solution to incast for switches with small buffers and many more than 10 simultaneous senders.

Without application-specific techniques, the general problem of incast remains: Figure 1 shows the throughput of our test synchronized-read application (Section 4) as we increase the number of nodes it reads from, using an unmodified Linux TCP (200ms RTOmin line). This application performs synchronized reads of 1MB blocks of data; that is, each of N servers responds to a block read request with 1MB/N bytes at the same time. Even using a high-performance switch (with its default settings), throughput drops drastically as the number of servers increases, achieving a shockingly poor 3% of the network capacity—about 30Mbps—when it tries to stripe the blocks across all 47 servers.

Prior work characterizing TCP Incast ended on a somewhat down note, finding that existing TCP improvements—NewReno, SACK [22], RED [14], ECN [30], Limited Transmit [3], and modifications to Slow Start—sometimes increased throughput, but did not substantially change the incast-induced throughput collapse [28]. This work found three partial solutions: First, as shown in Figure 2 (from [28]), larger switch buffers can delay the onset of Incast (doubling the buffer size doubled the number of servers that could be contacted). But increased switch buffering comes at a substantial dollar cost—switches with 1MB packet buffering per port may cost as much as $500,000. Second, Ethernet flow control was effective when the machines were on a single switch, but was dangerous across inter-switch trunks because of head-of-line blocking. Finally, reducing TCP’s minimum RTO, in simulation, appeared to allow nodes to maintain high throughput with several times as many nodes—but it was left unexplored whether and to what degree these benefits could be achieved in practice, meeting the three criteria we presented above: practicality, effectiveness, and safety. In this paper, we answer these questions in depth to derive an effective solution for practical, high-fan-in datacenter Ethernet communication.

[Figure 2: Doubling the switch egress buffer size doubles the number of concurrent senders needed to see incast. Average goodput (Mbps) vs. number of servers (SRU = 256KB) for switch buffer sizes from 32KB to 1024KB; from [28].]

3 Challenges to Fine-Grained TCP Timeouts

Successfully using an aggressive TCP retransmit timer requires first addressing the issue of safety and generality: is an aggressive timeout appropriate for use in the wide-area, or should it be limited to the datacenter? Does it risk increased congestion or decreased throughput because of spurious (incorrect) timeouts? Second, it requires addressing implementability: TCP implementations typically use a coarse-grained timer that provides timeout support with very low overhead. Providing tighter TCP timeouts requires not only reducing or eliminating RTOmin, but also supporting fine-grained RTT measurements and kernel timers. Finally, if one makes TCP timeouts more fine-grained, how low must one go to achieve high throughput with the smallest additional overhead? And to how many nodes does this solution scale?

3.1 Jacobson RTO Estimation

The standard RTO estimator [17] tracks a smoothed estimate of the round-trip time, and sets the timeout to this RTT estimate plus four times the linear deviation—roughly speaking, a value that lies outside four standard deviations from the mean:

    RTO = SRTT + (4 × RTTVAR)    (1)

Two factors set lower bounds on the value that the RTO can achieve: an explicit configuration parameter, RTOmin, and the implicit effects of the granularity with which RTTs are measured and with which the kernel sets and checks timers. As noted earlier, common values for RTOmin are 200ms, and most implementations track RTTs and timers at a granularity of 1ms or larger.

Because RTT estimates are difficult to collect during loss and timeouts, a second safety mechanism controls timeout behavior: exponential backoff. After each timeout, the RTO value is doubled, helping to ensure that a single RTO set too low cannot cause a long-lasting chain of retransmissions.

3.2 Is it safe to disregard RTOmin?

There are two possible complications of permitting much smaller RTO values: spurious (incorrect) timeouts when the network RTT suddenly jumps, and breaking the relationship between the delayed acknowledgement timer and the RTO values.

Spurious retransmissions: The most prominent study of TCP retransmission showed that a high (by the standards of datacenter RTTs) RTOmin

helped avoid spurious retransmission in wide-area TCP transfers [4], regardless of how good an estimator one used based on historical RTT information. Intuition for why this is the case comes from prior [24, 11] and subsequent [35] studies of Internet delay changes. While most of the time, end-to-end delay can be modeled as random samples from some distribution (and therefore, can be predicted by an RTO estimator), the delay consistently observes both occasional, unpredictable delay spikes, as well as shifts in the distribution from which the delay is drawn. Such changes can be due to the sudden introduction of cross-traffic, routing changes, or failures. As a result, wide-area “packet delays [are] not mathematically [or] operationally steady” [35], which confirms the Allman and Paxson observation that RTO estimation involves a fundamental tradeoff between rapid retransmission and spurious retransmissions.

Fortunately, TCP timeouts and spurious timeouts were and remain rare events, as we explore further in Section 8. In the ten years since the Allman and Paxson study, TCP variants that more effectively recover from loss have been increasingly adopted [23]. By 2005, for example, nearly 65% of Web servers supported SACK [22], which was introduced only in 1996. In just three years from 2001–2004, the number of TCP Tahoe-based servers dropped drastically in favor of NewReno-style servers.

Moreover, algorithms to undo the effects of spurious timeouts have been both proposed [4, 21, 32] and, in the case of F-RTO, adopted in the latest Linux implementations. The default F-RTO settings conservatively halve the congestion window when a spurious timeout is detected but remain in congestion avoidance mode, thus avoiding the slow-start phase. Given these improvements, we believe disregarding RTOmin is safer today than 10 years ago, and Section 8 will show measurements reinforcing this.

Delayed Acknowledgements: The TCP delayed ACK mechanism attempts to reduce the amount of ACK traffic by having a receiver acknowledge only every other packet [9]. If a single packet is received with none following, the receiver will wait up to the delayed ACK timeout threshold before sending an ACK. The delayed ACK threshold must be shorter than the lowest RTO values, or a sender might time out waiting for an ACK that is merely delayed by the receiver. Modern systems set the delayed ACK timeout to 40ms, with RTOmin set to 200ms.

Consequently, a host modified to reduce the RTOmin below 40ms would periodically experience an unnecessary timeout when communicating with unmodified hosts, specifically when the RTT is below 40ms (e.g., in the datacenter and for short flows on the wide-area). As we show in the following section, delayed ACKs themselves impair performance in incast environments: we suggest disabling them entirely. This solution requires client participation, however, and so is not general. In the following section, we discuss three practical solutions for interoperability with unmodified, delayed-ACK-enabled clients.

As a consequence, there are few a-priori reasons to believe that eliminating RTOmin—basing timeouts solely upon the Jacobson estimator and exponential backoff—would greatly harm wide-area performance and datacenter environments with clients using delayed ACK. We evaluate these questions experimentally in Sections 4 and 8.

4 Evaluating Throughput with Fine-Grained RTO

How low must the RTO be allowed to go to retain high throughput under incast-producing conditions, and to how many servers does this solution scale? We explore this question using real-world measurements and ns-2 simulations [26], finding that to be maximally effective, the timers must operate on a granularity close to the RTT of the network—hundreds of microseconds or less.

Test application: Striped requests. The test client issues a request for a block of data that is striped across N servers (the “stripe width”). Each server responds with blocksize/N bytes of data. Only after it receives the full response from every server will the client issue requests for the subsequent data block. This design mimics the request patterns found in several cluster filesystems and several parallel

workloads. Observe that as the number of servers increases, the amount of data requested from each server decreases. We run each experiment for 200 data block transfers to observe steady-state performance, calculating the goodput (application throughput) over the entire duration of the transfer.

We select the block size (1MB) based upon read sizes common in several distributed filesystems, such as GFS [16] and PanFS [34], which observe workloads that read on the order of a few kilobytes to a few megabytes at a time. Prior work suggests that the block size shifts the onset of incast (doubling the block size doubles the number of servers before collapse), but does not substantially change the system’s behavior; different systems have their own “natural” block sizes. The mechanisms we develop improve throughput under incast conditions for any choice of block sizes and buffer sizes.

In Simulation: We simulate one client and multiple servers connected through a single switch where round-trip-times under low load are 100µs. Each node has 1Gbps capacity, and we configure the switch buffers with 32KB of output buffer space per port, a size chosen based on statistics from commodity 1Gbps Ethernet switches. Because ns-2 is an event-based simulation, the timer granularity is infinite, hence we investigate the effect of RTOmin to understand how low the RTO needs to be to avoid Incast throughput collapse. Additionally, we add a small random timer scheduling delay of up to 20µs to more accurately model real-world scheduling variance.[2]

[2] Experience with ns-2 showed that without introducing this delay, we saw simultaneous retransmissions from many servers all within a few microseconds, which is rare in real world settings.

[Figure 3: Reducing the RTOmin in simulation to microseconds from the current default value of 200ms improves goodput. Goodput (Mbps) vs. RTOmin (200µs to 200ms) for stripe widths of 4 to 128 servers; block size = 1MB, buffer = 32KB.]

[Figure 4: Experiments on a real cluster validate the simulation result that reducing the RTOmin to microseconds improves goodput. Goodput (Mbps) vs. RTOmin for 4, 8, and 16 servers; fixed block = 1MB, buffer = 32KB (estimate).]

Figure 3 depicts throughput as a function of the RTOmin for stripe widths between 4 and 128 servers. Throughput using the default 200ms RTOmin drops by nearly an order of magnitude with 8 concurrent senders, and by nearly two orders of magnitude when data is striped across 64 and 128 servers.

Reducing the RTOmin to 1ms is effective for 8-16 concurrent senders, fully utilizing the client’s link, but begins to suffer when data is striped across a larger number of servers: 128 concurrent senders utilize only 50% of the available link bandwidth even with a 1ms RTOmin. For 64 and 128 servers and low RTOmin values, we note that each individual flow does not have enough data to send to saturate the link, but also that performance for 200µs is worse than for 1ms; we address this issue in more detail when scaling to hundreds of concurrent senders in Section 7.

In Real Clusters: We evaluate incast on two clusters; one sixteen-node cluster using an HP Procurve 2848 switch, and one 48-node cluster using a Force10 S50 switch. In these clusters, every node has 1 Gbps links and a client-to-server RTT of approximately 100µs. All nodes run Linux kernel 2.6.28.

We run the same synchronized read workload as in simulation.

For these experiments, we use our modified Linux 2.6.28 kernel that uses microsecond-accurate timers with microsecond-granularity RTT estimation (§6) to be able to accurately set the RTOmin to a desired value. Without these modifications, the default TCP timer granularity in Linux can be reduced only to 1ms. As we show later, when added to the 4x RTTVAR estimate, the 1ms timer granularity effectively raises the minimum RTO over 5ms.

[Figure 5: During an incast experiment on a 16-node cluster, RTTs increase by 4 times the baseline RTT (100µs) on average with spikes as high as 1ms. This produces RTO values in the range of 1-3ms, resulting in an RTOmin of 1ms being as effective as 200µs in practice. Histogram: RTT distribution in SAN; occurrences vs. RTT in microseconds.]

Figure 4 plots the application goodput as a function of the RTOmin for 4, 8, and 16 concurrent senders. For all configurations, goodput drops with increasing RTOmin above 1ms. For 8 and 16 concurrent senders, the default RTOmin of 200ms results in nearly 2 orders of magnitude drop in throughput.

The real world results deviate from the simulation results in a few minor ways. First, the maximum achieved throughput in simulation nears 1Gbps, whereas the maximum achieved in the real world is 900Mbps. Simulation throughput is always higher because simulated nodes are infinitely fast, whereas real-world nodes are subject to myriad influences, including OS scheduling and Ethernet or switch timing differences, resulting in real-world results slightly below that of simulation.

Second, real world results show negligible difference between 8 and 16 servers, while the differences are more pronounced in simulation. We attribute this to variances in the buffering between simulation and the real world. As we show in Section 7, small microsecond differences in retransmission scheduling can lead to improved goodput; these differences exist in the real world but are not modeled as accurately in simulation.

Third, the real world results show identical performance for RTOmin values of 200µs and 1ms, whereas there are slight differences in simulation. We find that the RTT and variance seen in the real world is higher than that seen in simulation. Figure 5 shows the distribution of round-trip-times during an Incast workload. While the baseline RTTs can be between 50-100µs, increased congestion causes RTTs to rise to 400µs on average with spikes as high as 1ms. Hence, the variance of the RTO combined with the higher RTTs mean that the actual retransmission timers set by the kernel are between 1-3ms, where an RTOmin will show no improvement below 1ms. Hence, where we specify a RTOmin of 200µs, we are effectively eliminating the RTOmin, allowing RTO to be as low as calculated.

Despite these differences, the real world results show the need to reduce the RTO to at least 1ms to avoid throughput degradation.

4.1 Interaction with Delayed ACK for Unmodified Clients

During the onset of Incast using five or fewer servers, the delayed ACK mechanism acted as a miniature timeout, resulting in reduced, but not catastrophically low, throughput [28] during certain loss patterns. Delayed ACK can delay the receipt of enough duplicate acks for data-driven loss recovery when the window is small (or reduce the number of duplicate ACKs by one, turning a recoverable loss into a timeout). While this delay is not as high as a full 200ms RTO, 40ms is still large compared to the RTT and results in low

throughput for three to five concurrent senders. Beyond five senders, high packet loss creates 200ms retransmission timeouts which mask the impact of delayed ACK delays.

While servers require modifications to the TCP stack to enable microsecond-resolution retransmission timeouts, the clients issuing the requests do not necessarily need to be modified. But because delayed ACK is implemented at the receiver, a server may normally wait 40ms for an unacked segment that successfully arrived at the receiver. For servers using a reduced RTO in a datacenter environment, the server’s retransmission timer may expire long before the unmodified client’s 40ms delayed ACK timer fires. As a result, the server will timeout and resend the unacked packet, cutting ssthresh in half and rediscovering link capacity using slow-start. Because the client acknowledges the retransmitted segment immediately, the server does not observe a coarse-grained 40ms delay, only an unnecessary timeout.

[Figure 6: Disabling Delayed ACK on client nodes provides optimal goodput. Goodput (Mbps) vs. number of servers for delayed ACK disabled, delayed ACK with a 200µs timer, and a standard client (delayed ACK 40ms); fixed block = 1MB, buffer = 32KB (est.), switch = Procurve.]

Figure 6 shows the performance difference between our modified client with delayed ACK disabled, delayed ACK enabled with a 200µs timer, and a standard kernel with delayed ACK enabled. Beyond 8 servers, our modified client with delayed ACK enabled receives 15-30Mbps lower throughput compared to the client with delayed ACK disabled, whereas the standard client experiences between 100 and 200Mbps lower throughput. When the client delays an ACK, the standard client forces the servers to timeout, yielding much worse performance. In contrast, the 200µs minimum delayed ACK timeout client delays a server by roughly a round-trip-time and does not force the server to timeout, so the performance hit is much smaller.

Delayed ACK can provide benefits where the ACK path is congested [6], but in the datacenter environment, we believe that delayed ACK should be disabled; most high-performance applications favor quick response over an additional ACK-processing overhead and are typically equally provisioned for both directions of traffic. Our evaluations in Section 6 disable delayed ACK on the client for this reason. While these results show that for full performance, delayed ACK should be disabled, we note that unmodified clients still achieve good performance and avoid Incast.

5 Preventing Incast

In this section we briefly outline the three components of our solution to incast, the necessity for and effectiveness of which we evaluate in the following section.

1. Microsecond-granularity timeouts. We first modify the kernel to measure RTTs in microseconds, changing both the infrastructure used for timing and the values placed in the TCP timestamp option. The kernel changes use the high-resolution, hardware-supported timing hardware present in modern operating systems. In Section 9, we briefly discuss alternatives for legacy hardware.

2. Randomized timeout delay. Second, we modify the microsecond-granularity RTO values to include an additional randomized component:

    timeout = RTO + (rand(0.5) × RTO)

In expectation, this increases the duration of the RTO by up to 25%, but more importantly, has the effect of de-synchronizing packet retransmissions. As we show in the next section, this decision is unimportant with smaller numbers of servers, but becomes critical when the number of servers is large enough that a single batch of retransmitted packets themselves cause extensive congestion. The extra half-RTO allows some packets to get through the switch (the RTO is at least one RTT, so half an RTO is

at least the one-way transit time) before being clobbered by the retransmissions.

3. Disabling delayed ACKs. As noted earlier, delayed ACKs can have a poor interaction with microsecond timeouts: the sender may timeout on the last packet in a request, retransmitting it while waiting for a delayed ACK from the receiver. In the datacenter environment, the reduced window size is relatively unimportant—maximum window sizes are small already and the RTT is short. This retransmission does, however, create overhead when requests are small, reducing throughput by roughly 10%.

While the focus of this paper is on understanding the limits of providing high throughput in datacenter networks, it is important to ensure that these solutions can be adopted incrementally. Towards this end, we discuss briefly three mechanisms for interacting with legacy clients:

• Bimodal timeout operation: The kernel could use microsecond RTTs only for extremely low RTT connections, where the retransmission overhead is much lower than the cost of coarse-grained timeouts, but still use coarse timeouts in the wide-area.

• Make aggressive timeouts a socket option: Require users to explicitly enable aggressive timeouts for their application. We believe this is an appropriate option for the first stages of deployment, while clients are being upgraded.

• Disable delayed ACKs, or use an adaptive delayed ACK timer: While the former has numerous advantages, fixing the delay penalty of delayed ACK for the datacenter is relatively straightforward: base the delayed ACK timeout on the smoothed inter-packet arrival rate instead of having a static timeout value. We have not implemented this option, but as Figure 6 shows, a static 200µs timeout value (still larger than the inter-packet arrival) shows that these more precise delayed ACKs restore throughput to about 98% of the no-delayed-ACK throughput.

6 Achieving Microsecond-granularity Timeouts

The TCP clock granularity in most popular operating systems is on the order of milliseconds, as defined by the “jiffy” clock, a global counter updated by the kernel at a frequency “HZ”, where HZ is typically 100, 250, or 1000. Linux, for example, updates the jiffy timer 250 times a second, yielding a TCP clock granularity of 4ms, with a configuration option to update 1000 times per second for a 1ms granularity. More frequent updates, as would be needed to achieve finer granularity timeouts, would impose an increased, system-wide kernel interrupt overhead.

Unfortunately, setting the RTOmin to 1 jiffy (the lowest possible value) does not achieve RTO values of 1ms because of the clock granularity. TCP measures RTTs in 1ms granularity at best, so both the smoothed RTT estimate and RTT variance have a 1 jiffy (1ms) lower bound. Since the standard RTO estimator sums the RTT estimate with 4x the RTT variance, the lowest possible RTO value is 5 jiffies. We experimentally validated this result by setting the clock granularity to 1ms, setting RTOmin to 1ms, and observing that TCP timeouts were a minimum of 5ms.

At a minimum possible RTOmin of 5ms in standard TCP implementations, throughput loss can not be avoided for as few as 8 concurrent senders. While the results of Figures 3 and 4 suggest reducing the RTOmin to 5ms can be both simple (a one-line change) and helpful, next we describe how to achieve microsecond granularity RTO values in the real world.

6.1 Linux high-resolution timers: hrtimers

High resolution timers were introduced in Linux kernel version 2.6.18 and are still actively in development. They form the basis of the posix-timer and itimer user-level timers, nanosleep, and a few other in-kernel operations, including the update of the jiffies value.

The Generic Time of Day (GTOD) framework provides the kernel and other applications with high-resolution timing using the CPU cycle counter on modern processors. The hrtimer implementation interfaces with the High Precision Event

The hrtimer implementation interfaces with the High Precision Event Timer (HPET) hardware, also available on modern systems, to achieve microsecond resolution in the kernel. When a timer is added to the list, the kernel checks whether this is the next-expiring timer, programming the HPET to send an interrupt when the HPET's internal clock advances by the desired amount. For example, the kernel may schedule a timer to expire once every 1ms to update the jiffy counter, and the kernel will be interrupted by the HPET to update the jiffy timer only every 1ms.

Our preliminary evaluations of hrtimer overhead have shown no appreciable cost to implementing TCP timeouts using the hrtimer subsystem. We posit that, at this stage in development, only a few kernel functions use hrtimers, so the red-black tree that holds the list of timers may not contain enough timers to suffer poor performance from repeated insertions and deletions. Also, we argue that for incast, where high-resolution timers for TCP are required, any introduced overhead may be acceptable, as it removes the idle periods that prevent the server from doing useful work to begin with.

6.2 Modifications to the TCP Stack

The Linux TCP implementation requires three changes to support microsecond timeouts using hrtimers: microsecond-resolution time accounting to track RTTs with greater precision, redefinition of TCP constants, and replacement of low-resolution timers with hrtimers.

Microsecond accounting: By default, the jiffy counter is used for tracking time. To provide microsecond-granularity accounting, we use the GTOD framework to access the 64-bit nanosecond-resolution hardware clock wherever the jiffies time is traditionally used. With the TCP timestamp option enabled, RTT estimates are calculated based on the difference between the timestamp option in an earlier packet and the corresponding ACK. We convert the time from nanoseconds to microseconds and store the value in the TCP timestamp option.³ This change can be accomplished entirely on the sender – receivers simply echo back the value in the TCP timestamp option.

Constants, Timers and Socket Structures: All timer constants previously defined with respect to the jiffy timer are converted to exact microsecond values. The TCP implementation must make use of the hrtimer interface: we replace the standard timer objects in the socket structure with the hrtimer structure, ensuring that all subsequent calls to set, reset, or clear these timers use the appropriate hrtimer functions.

6.3 hrtimer Results

Figure 7 presents the achieved goodput as we increase the stripe width N using various RTOmin values on a Procurve 2848 switch. As before, the client issues requests for 1MB data blocks striped over N servers, issuing the next request once the previous data block has been received. Using the default 200ms RTOmin, throughput plummets beyond 8 concurrent senders. For a 1ms jiffy-based RTOmin, throughput begins to drop at 8 servers to about 70% of link capacity and slowly decreases thereafter; as shown previously, the effective RTOmin is 5ms. Last, our TCP hrtimer implementation allowing microsecond RTO values achieves the maximum achievable goodput for 16 concurrent senders.

Figure 7: On a 16-node cluster, our high-resolution 1ms RTOmin eliminates incast; the jiffy-based implementation has a 5ms lower bound on RTO and achieves only 65% throughput. (Goodput vs. number of servers; fixed block = 1MB, buffer = 32KB (est.), Procurve switch.)

³The lower wrap-around time – 2³² microseconds, or 4294 seconds – is still far greater than the maximum IP segment lifetime (120–255 seconds).

We verify these results on a second cluster consisting of 1 client and 47 servers connected to a single 48-port Force10 S50 switch (Figure 8). The microsecond RTO kernel is again able to saturate throughput for 47 servers, the testbed's maximum size. The 1ms RTOmin jiffy-based configuration obtained 70–80% throughput, with an observable drop above 40 concurrent senders.

Figure 8: For a 48-node cluster, providing RTO values in microseconds eliminates incast. (Goodput vs. number of servers; fixed block = 1MB, buffer = 64KB (est.), S50 switch.)

When the Force10 S50 switch is configured to disable multiple QoS queues, the per-port packet buffer allocation is large enough that incast can be avoided for up to 47 servers (Figure 9). This reaffirms the simulation result of Figure 2 – that one way to avoid incast in practice is to use larger per-port buffers in switches on the path. It also emphasizes that relying on switch-based incast solutions involves more than just the switch's total buffer size: switch configuration options designed for other workloads can make its flows more prone to incast. A generalized TCP solution should reduce administrative complexities in the field.

Figure 9: Switches configured with large enough buffer capacity can delay incast. (Goodput vs. number of servers; fixed block = 1MB, buffer = 512KB (est.), S50 switch with QoS disabled.)

Overall, we find that enabling microsecond RTO values in TCP successfully avoids the incast throughput drop in two real-world clusters for as many as 47 concurrent servers, the maximum available to us, and that microsecond resolution is necessary to achieve high performance with some switches or switch configurations.

7 Next-generation Datacenters

TCP Incast poses further problems for the next generation of datacenters consisting of 10Gbps networks and hundreds to thousands of machines. These networks have very low-latency capabilities to keep competitive with alternative high-performance interconnects like Infiniband; though TCP kernel-to-kernel latency may be as high as 40–80µs due to kernel scheduling, port-to-port latency can be as low as 10µs. Because 10Gbps Ethernet provides higher bandwidth, servers will be able to send their portion of a data block faster, requiring that RTO values be shorter to avoid idle link time. For example, we plot the distribution of RTTs seen at a storage node at Los Alamos National Lab (LANL) in Figure 10: 20% of RTTs are below 100µs, showing that today's networks are capable of, and do operate in, very low-latency environments, further motivating the need for microsecond-granularity RTO values.

Figure 10: The distribution of RTTs at a Los Alamos National Lab storage node shows an appreciable number of RTTs in the 10s of microseconds. (Number of occurrences vs. RTT in microseconds; bin size = 10µs.)

Scaling to thousands of concurrent senders: In this section, we analyze the impact of incast and the reduced-RTO solution for 10Gbps Ethernet networks in simulation as we scale the number of concurrent senders into the thousands. We reduce baseline RTTs from 100µs to 20µs and increase link capacity to 10Gbps, keeping the per-port buffer size at 32KB, as we assume smaller buffers for faster switches. We increase the blocksize to 40MB (to ensure each flow can mostly saturate the 10Gbps link), scale up the number of nodes from 32 to 2048, and reduce the RTOmin to 20µs, effectively eliminating the minimum bound.

In Figure 11, we find that without a randomized RTO component, goodput decreases sublinearly (note the log-scale x-axis) as the number of nodes increases, indicating that even with an aggressively low RTO granularity, we still observe reduced goodput. Reduced goodput can arise due to idle link time or due to retransmissions; retransmissions factor into throughput but not goodput. Figure 11 shows that for a small number of flows, throughput is near optimal but goodput is lower, sometimes by up to 15%. For a larger number of flows, however, throughput and goodput are nearly identical – with an aggressively low RTO, there exist periods where the link is idle for a large number of concurrent senders.

Figure 11: In simulation, flows still experience reduced goodput even with microsecond-resolution timeouts (RTOmin = RTT = 20µs) without a randomized RTO component. (Average goodput vs. number of servers; block size = 40MB, buffer = 32KB, RTT = 20µs.)

These idle periods occur specifically when there are many more flows than the switch's buffer capacity can accommodate, due to simultaneous, successive timeouts. Recall that after every timeout, the RTO value is doubled until an ACK is received. This has been historically safe because TCP quickly and conservatively estimates the duration to wait until congestion abates. However, the exponentially increasing delay can overshoot the period during which the link is actually idle, leading to sub-optimal goodput. Because only one flow must overshoot to delay the entire transfer, the probability of overshooting increases with an increased number of flows.

Figure 12 shows the instantaneous link utilization for all flows and the retransmission events for one of the flows that experienced repeated retransmission failures during an incast simulation on a 1Gbps network. This flow timed out and retransmitted a packet at the same time that other timed-out flows also retransmitted. While some of these flows got through and saturated the link for a brief period of time, the flow shown here timed out and doubled its timeout value (up to the maximum factor of 64 × RTO) for each failed retransmission. The link then became available soon after the retransmission event, but the RTO backoff had set the retransmission timer to fire far beyond this time. When this packet eventually got through, the block transfer completed and the next block transfer began, but only after long periods of link idle time that reduced goodput.

Figure 12: Some flows experience repeated retransmission failures due to synchronized retransmission behavior, delaying transmission far beyond when the link is idle. (Instantaneous link utilization over time, with failed and successful retransmissions marked for one flow.)

This analysis shows that decreased goodput and throughput for a large number of flows can be attributed to many flows timing out simultaneously, backing off deterministically, and retransmitting at precisely the same time. By adding some degree of randomness to the RTO, the retransmissions can be desynchronized such that fewer flows experience repeated timeouts.

We examine both the RTOmin and the retransmission synchronization effect in simulation, measuring the goodput for three different settings: a 200µs RTOmin, a 20µs RTOmin, and a 20µs RTOmin with a modified, randomized timeout value set by:

timeout = RTO + (rand(0.5) × RTO)

Figure 13 shows that the 200µs RTOmin scales poorly as the number of concurrent senders increases: at 1024 servers, throughput is still an order of magnitude lower. The 20µs RTOmin shows improved performance, but eventually suffers beyond 1024 servers due to the successive, simultaneous timeouts experienced by a majority of flows. Adding a small random delay performs well regardless of the number of concurrent senders because it explicitly desynchronizes the retransmissions of flows that experience repeated timeouts, and does not heavily penalize flows that experience only a few timeouts.

Figure 13: In simulation, introducing a randomized component to the RTO desynchronizes retransmissions following timeouts, avoiding goodput degradation for a large number of flows. (Average goodput vs. number of servers; block size = 40MB, buffer = 32KB, RTT = 20µs; 200µs RTOmin, 20µs RTOmin, and 20µs RTOmin + RandDelay.)

A caveat of these event-based simulations is that the timing of events is only as accurate as the simulation timestep. At the scale of microseconds, there will exist small timing and scheduling differences in the real world that are not captured in simulation. For example, in the simulation a packet will be retransmitted as soon as the retransmission timer fires, whereas kernel scheduling may delay the actual retransmission by 10µs or more. Even when offloading such duties to ASICs, slight timing differences will exist in real-world switches. Hence, the real-world behavior of incast in 10GE, 20µs RTT networks will likely deviate slightly from these simulations, though the general trend should hold.

8 Implications of Reduced RTOmin on the Wide-area

Aggressively lowering both the RTO and the RTOmin shows practical benefits for datacenters. In this section, we investigate whether reducing the RTOmin value to microseconds and using finer-granularity timers is safe for wide-area transfers. We find that the impact of spurious timeouts on long, bulk data flows is very low – within the margins of error – allowing RTO to go into the microseconds without impairing wide-area performance.

8.1 Evaluation

The major potential effect of a spurious timeout is a loss of performance: a flow that experiences a timeout will reduce its slow-start threshold by half, reduce its window to one, and attempt to rediscover link capacity. Spurious timeouts occur not when the network path drops packets, but rather when it observes a sudden, higher delay, so the effect of a shorter RTO on increased congestion is likely small, because a TCP sender backs off on the amount of data it injects into the network on a timeout. In this section we analyze the performance of TCP flows over the wide-area for bulk data transfers.

Experimental Setup: We deployed two servers that differ only in their implementation of the RTO values and granularity, one using the default Linux 2.6.28 kernel with a 200ms RTOmin and the other using our modified hrtimer-enabled TCP stack with a 200µs RTOmin. We downloaded 12 torrent files consisting of various Linux distributions and began seeding all content from both machines on the same popular swarms for three days. Each server uploaded over 30GB of data and observed around 70,000 flows (with non-zero throughput) over the course of the three days. We ran tcpdump on each machine to collect all uploaded traffic packet headers for later analysis.

The TCP RTO value is determined by the estimated RTT value of each flow. Other factors being equal, TCP throughput tends to decrease with increased RTT. To compare RTO and throughput metrics for the two servers, we first investigate whether they see similar flows with respect to RTT values. Figure 14 shows the per-flow average RTT distribution for both hosts over the three-day measurement period. The RTT distributions are nearly identical, suggesting that each machine saw a similar distribution of both short and long-RTT flows. The per-packet RTT distributions for both hosts are also identical.

Figure 14: A comparison of RTT distributions of flows collected over 3 days on the two configurations shows that both servers saw a similar distribution of both short and long-RTT flows. (Fraction of samples vs. RTT in ms.)

Figure 15 shows the per-flow throughput distributions for both hosts, filtering out those flows with a bandwidth of less than 100bps, which are typically flows sending small control packets. The throughput distributions for both hosts are also nearly identical – the host with RTOmin = 200µs did not perform worse on the whole than the host with RTOmin = 200ms.

Figure 15: The two configurations observed an identical throughput distribution for flows. Only flows with throughput over 100 bits/s were considered. (Fraction of samples vs. throughput in Kbps.)

We split the throughput distributions based on whether the flow's RTT was above or below 200ms. For flows above 200ms, we use the variance in the two distributions as a control parameter: any variance seen above 200ms is a result of measurement noise, because the RTOmin is no longer a factor. Figure 16 shows that the difference between the distributions for flows below 200ms is within this measurement noise.

Figure 16: The throughput distribution for short and long RTT flows shows negligible difference across configurations. (Fraction of samples vs. throughput in Kbps, split at a 200ms RTT.)

Table 1 lists statistics for the number of spurious timeouts and normal timeouts observed by both servers. These statistics were collected using tcptrace [2] patched to detect timeout events [1]. The number of spurious timeouts observed by the two configurations is high, but comparable. We attribute the high number of spurious timeouts to the nature of the TCP clients in our experimental setup. The peers in our BitTorrent swarm observe large variations in their RTTs (2x–3x) because they are transferring large amounts of data, frequently establishing and breaking TCP connections to other peers in the swarm, resulting in variations in the buffer occupancy of the bottleneck link.⁴ Figure 17 shows one such flow picked at random, and the per-packet RTT and estimated RTO value over time. The dip in the estimated RTO value below the 350ms RTT band could result in a spurious timeout if a timeout occurred at that instant and the instantaneous RTT was 350ms.

  RTOmin   Normal timeouts   Spurious timeouts   Spurious fraction
  200ms    137414            47094               25.52%
  200µs    264726            90381               25.45%

Table 1: Statistics for timeout events across flows for the two servers with different RTOmin values. Both servers experience a similar fraction of spurious timeouts.

Figure 17: RTT and RTO estimate over time for a randomly picked flow over a 2000-second interval – the RTT and RTO estimates vary significantly in the wide-area, which may cause spurious timeouts.

This data suggests that reducing the RTOmin to 200µs in practice does not affect the performance of bulk-data TCP flows on the wide-area.

⁴An experiment run by one of the authors discovered that RTTs over a residential DSL line varied from 66ms for one TCP stream to two seconds for four TCP streams.

9 Related Work

TCP Improvements: A number of TCP improvements over the years have improved TCP's ability to respond to loss patterns and to perform better in particular environments, many of which are relevant to the high-performance datacenter environment we study. NewReno and SACK, for instance, reduce the number of loss patterns that will cause timeouts; prior work on the incast problem showed that NewReno, in particular, improved throughput during moderate amounts of incast, though not when the problem became severe [28].

TCP mechanisms such as Limited Transmit [3] were specifically designed to help TCP recover from packet loss when window sizes are small – exactly the problem that occurs during incast. This solution again helps maintain throughput under modest congestion, but during severe incast, the most common loss pattern is the loss of the entire window.

Finally, proposed improvements to TCP such as TCP Vegas [10] and FAST TCP [19] can limit window growth when RTTs begin to increase, often combined with more aggressive window growth algorithms to rapidly fill high bandwidth-delay links. Unlike the self-interfering oscillatory behavior on high-BDP links that this prior work seeks to resolve, incast is triggered by the arrival and rapid ramp-up of numerous competing flows, and the RTT increases drastically (or becomes a full window loss) over a single round-trip. While an RTT-based approach is interesting to study for alternative solutions to incast, adapting existing techniques for this purpose is a matter of considerable future work.

Efficient, fine-grained kernel timers: Where our work depends on hardware support for high-resolution kernel timers, earlier work on "soft timers" shows an implementation path for legacy systems [5]. Soft timers can provide microsecond-resolution timers for networking without introducing the overhead of context switches and interrupts. The hrtimer implementation we make use of draws lessons from soft timers, using a hardware interrupt to trigger all available software interrupts.

Understanding RTOmin: The origin of concern about the safety and generality of reducing RTOmin was presented by Allman and Paxson [4], who used trace-based analysis to show that there existed

no optimal RTO estimator, and to show to what degree the TCP clock granularity and the RTOmin had an impact on spurious retransmissions. Their analysis showed that a low or non-existent RTOmin greatly increased the chance of spurious retransmissions, and that tweaking the RTOmin offered no obvious sweet spot for balancing fast response against spurious timeouts. They also showed the benefit of a fine measurement granularity for responding to good timeouts, because of the ability to respond to minor changes in RTT. Last, they suggested that the impact of bad timeouts could be mitigated by using the TCP timestamp option, an approach that later became known as the Eifel algorithm [21]. F-RTO later showed how to detect spurious timeouts by checking whether the subsequent acknowledgements were for segments not retransmitted [32]; this algorithm is implemented in Linux TCP today.

Psaras and Tsaoussidis revisit the minimum RTO for high-speed, last-mile wireless links, noting that the default RTOmin is responsible for degraded throughput on wireless links and short flows [29]. They suggest a mechanism for dealing with delayed ACKs that attempts to predict when a packet's ACK is delayed – a per-packet RTOmin. We find that while delayed ACKs can affect performance for a low RTOmin, the benefits of a low RTOmin far outweigh the impact of delayed ACKs on performance. Regardless, we provide alternative, backwards-compatible solutions for dealing with delayed ACKs.

10 Conclusion

This paper presented a practical, effective, and safe solution to eliminate TCP throughput degradation in datacenter environments. Using a combination of microsecond-granularity timeouts, randomized retransmission timers, and disabled delayed acknowledgements, the techniques in this paper allowed high-fan-in datacenter communication to scale to 47 nodes in a real cluster evaluation, and potentially to thousands of nodes in simulation. Through a wide-area evaluation, we showed that these modifications remain safe for use in the wide-area, providing a general and effective improvement for TCP-based cluster communication.

References

[1] TCP Spurious Timeout detection patch for tcptrace. http://ccr.sigcomm.org/online/?q=node/220.

[2] tcptrace: A TCP Connection Analysis Tool. http://irg.cs.ohiou.edu/software/tcptrace/.

[3] M. Allman, H. Balakrishnan, and S. Floyd. Enhancing TCP's Loss Recovery Using Limited Transmit. Internet Engineering Task Force, January 2001. RFC 3042.

[4] Mark Allman and Vern Paxson. On Estimating End-to-End Network Path Properties. In Proc. ACM SIGCOMM, Cambridge, MA, September 1999.

[5] Mohit Aron and Peter Druschel. Soft Timers: Efficient Microsecond Software Timer Support for Network Processing. ACM Transactions on Computer Systems, 18(3):197–228, 2000.

[6] H. Balakrishnan, V. N. Padmanabhan, and R. H. Katz. The Effects of Asymmetry on TCP Performance. In Proc. ACM MOBICOM, Budapest, Hungary, September 1997.

[7] H. Balakrishnan, V. N. Padmanabhan, S. Seshan, and R. H. Katz. A Comparison of Mechanisms for Improving TCP Performance over Wireless Links. In Proc. ACM SIGCOMM, Stanford, CA, August 1996.

[8] Peter J. Braam. File Systems for Clusters from a Protocol Perspective. http://www.lustre.org.

[9] R. T. Braden. Requirements for Internet Hosts – Communication Layers. Internet Engineering Task Force, October 1989. RFC 1122.

[10] L. S. Brakmo, S. W. O'Malley, and L. L. Peterson. TCP Vegas: New Techniques for Congestion Detection and Avoidance. In Proc. ACM SIGCOMM, London, England, August 1994.

[11] k. claffy, G. Polyzos, and H-W. Braun. Measurement Considerations for Assessing Unidirectional Latencies. Internetworking: Research and Experience, 3(4):121–132, September 1993.

[12] Scaling memcached at Facebook. http://www.facebook.com/note.php?note_id=39391378919.

[13] Brad Fitzpatrick. LiveJournal's Backend: A History of Scaling. Presentation, http://tinyurl.com/67s363, August 2005.

[14] S. Floyd and V. Jacobson. Random Early Detection Gateways for Congestion Avoidance. IEEE/ACM Transactions on Networking, 1(4), August 1993.

[15] Bryan Ford. Structured Streams: A New Transport Abstraction. In Proc. ACM SIGCOMM, Kyoto, Japan, August 2007.

[16] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proc. 19th ACM Symposium on Operating Systems Principles (SOSP), Lake George, NY, October 2003.

[17] V. Jacobson. Congestion Avoidance and Control. In Proc. ACM SIGCOMM, pages 314–329, Vancouver, British Columbia, Canada, September 1998.

[18] V. Jacobson, R. Braden, and D. Borman. TCP Extensions for High Performance. Internet Engineering Task Force, May 1992. RFC 1323.

[19] Cheng Jin, David X. Wei, and Steven H. Low. FAST TCP: Motivation, Architecture, Algorithms, Performance.

[20] Eddie Kohler, Mark Handley, and Sally Floyd. Designing DCCP: Congestion Control Without Reliability. In Proc. ACM SIGCOMM, Pisa, Italy, August 2006.

[21] R. Ludwig and M. Meyer. The Eifel Detection Algorithm for TCP. Internet Engineering Task Force, April 2003. RFC 3522.

[22] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow. TCP Selective Acknowledgment Options. Internet Engineering Task Force, 1996. RFC 2018.

[23] Alberto Medina, Mark Allman, and Sally Floyd. Measuring the Evolution of Transport Protocols in the Internet. April 2005.

[24] Amarnath Mukherjee. On the Dynamics and Significance of Low Frequency Components of Internet Load. Internetworking: Research and Experience, 5:163–205, December 1994.

[25] David Nagle, Denis Serenyi, and Abbie Matthews. The Panasas ActiveScale Storage Cluster: Delivering Scalable High Bandwidth Storage. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, Washington, DC, USA, 2004.

[26] ns-2 Network Simulator. http://www.isi.edu/nsnam/ns/, 2000.

[27] C. Partridge. Gigabit Networking. Addison-Wesley, Reading, MA, 1994.

[28] Amar Phanishayee, Elie Krevat, Vijay Vasudevan, David G. Andersen, Gregory R. Ganger, Garth A. Gibson, and Srinivasan Seshan. Measurement and Analysis of TCP Throughput Collapse in Cluster-based Storage Systems. In Proc. USENIX Conference on File and Storage Technologies, San Jose, CA, February 2008.

[29] Ioannis Psaras and Vassilis Tsaoussidis. The TCP Minimum RTO Revisited. In IFIP Networking, May 2007.

[30] K. Ramakrishnan and S. Floyd. A Proposal to Add Explicit Congestion Notification (ECN) to IP. Internet Engineering Task Force, January 1999. RFC 2481.

[31] S. Raman, H. Balakrishnan, and M. Srinivasan. An Image Transport Protocol for the Internet. In Proc. International Conference on Network Protocols, Osaka, Japan, November 2000.

[32] P. Sarolahti and M. Kojo. Forward RTO-Recovery (F-RTO): An Algorithm for Detecting Spurious Retransmission Timeouts with TCP and the Stream Control Transmission Protocol (SCTP). Internet Engineering Task Force, August 2005. RFC 4138.

[33] S. Shepler, M. Eisler, and D. Noveck. NFSv4 Minor Version 1 – Draft Standard.

[34] Brent Welch, Marc Unangst, Zainul Abbasi, Garth Gibson, Brian Mueller, Jim Zelenka, and Bin Zhou. Scalable Performance of the Panasas Parallel File System. In Proc. USENIX Conference on File and Storage Technologies, San Jose, CA, February 2008.

[35] Y. Zhang, N. Duffield, V. Paxson, and S. Shenker. On the Constancy of Internet Path Properties. In Proc. ACM SIGCOMM Internet Measurement Workshop, San Francisco, CA, November 2001.
