The Performance of Microsoft TCP Implementations: a Bug and Its Fix

Web Servers Should Turn Off Nagle to Avoid Unnecessary 200 ms Delays[+]

Robert Buff, Arthur Goldberg

Computer Science Department

Courant Institute of Mathematical Sciences

New York University

{buff, artg}@cs.nyu.edu

www.cs.nyu.edu/cs/faculty/artg

1  Abstract

We show that the silly window syndrome (SWS) avoidance algorithms in standard implementations of TCP significantly slow the Web protocols HTTPS and HTTP in certain circumstances. Delays of several hundred milliseconds may occur on fast Intranet transactions that would otherwise complete in a few tens of milliseconds.

We illustrate this performance bug with TCP packet traces from test programs and production Web systems. This bug is easily and reliably avoided by disabling Nagle’s algorithm at the sender on every connection.

2  Introduction

Current TCP implementations deliver high bandwidth when transmitting large segments. Less attention has focused on the response time of TCP transactions that exchange smaller segments. However, this response time is important because widespread applications, like the Web, employ such transactions.

2.1  TCP Review

TCP supports a reliable, full duplex, network transport byte stream. TCP usually packetizes application data into segments that fit into a single IP packet. The largest TCP segment that can be sent on a connection holds maximum segment size (MSS) bytes and is called an MSS segment.

HTTPS and HTTP are client/server protocols that use TCP. Typically, a client/server interaction (or transaction) consists of a small request message sent by the client to the server and a response message sent back. Depending on the implementation, an application message is sent with one or with several TCP socket writes. TCP may map application data arbitrarily to segment boundaries.

A TCP receiver acknowledges receipt of data by sending the sender the sequence number of the next expected byte. In addition, a receiver manages buffer space by advertising an available ‘window’ beyond data it has received.

2.1.1  The Silly Window Syndrome

As described in RFC 1122 [Braden 89] and Section 13.29 of [Comer 95], early TCP implementations exhibited a problem known as the silly window syndrome (SWS). In SWS a connection reaches a steady state in which each acknowledgement advertises a small window and each data segment carries a small amount of data. SWS occurs, for example, when the receiver repeatedly reads just one byte from a connection with no advertised window.

The TCP standard [Braden 89] requires both senders and receivers to incorporate algorithms that avoid SWS. In brief, a receiver avoids advertising small TCP windows and delays transmitting acknowledgements. A sender implements the Nagle algorithm, which delays transmission of partially filled segments until all previously transmitted data has been acknowledged. For more detail on SWS avoidance, we review the TCP specification.

2.1.2  The TCP Specification

The TCP specification describes how the receiver and sender avoid SWS. Section 4.2.3.2 states that at the receiver

A TCP SHOULD implement a delayed ACK, but an ACK should not be excessively delayed; in particular, the delay MUST be less than 0.5 seconds, and in a stream of full-sized segments there SHOULD be an ACK for at least every second segment.

Therefore a receiver can always delay acknowledging a partial segment.

Section 4.2.3.3 says that

A TCP MUST include a SWS avoidance algorithm in the receiver. […] The receiver's SWS avoidance algorithm determines when the right window edge may be advanced; […]

For realistic receive buffers (greater than twice the MSS) window advances are announced in increments of MSS.
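The receiver rule behind this observation can be written as a single comparison. RFC 1122 keeps the right window edge fixed until RCV.BUFF - RCV.USER - RCV.WND >= min(Fr * RCV.BUFF, Eff.snd.MSS), with Fr = 1/2 recommended. The following is a minimal sketch of that test, not code from any TCP stack; the function and parameter names are our own illustrative choices.

```python
def receiver_may_advance_window(rcv_buff, rcv_user, rcv_wnd, mss, fr=0.5):
    """Receiver-side SWS avoidance test, following RFC 1122 section 4.2.3.3.

    rcv_buff: total receive buffer size in bytes
    rcv_user: bytes received but not yet read by the application
    rcv_wnd:  window currently advertised to the sender
    mss:      effective send MSS of the peer
    fr:       fraction of the buffer (1/2 is the recommended value)

    Returns True when the right window edge may be advanced.
    """
    possible_advance = rcv_buff - rcv_user - rcv_wnd
    return possible_advance >= min(fr * rcv_buff, mss)
```

When the buffer is larger than twice the MSS, min(Fr * RCV.BUFF, MSS) equals the MSS, which is why window advances then come in MSS-sized steps.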

Section 4.2.3.4, “When to Send Data” says that

A TCP MUST include a SWS avoidance algorithm in the sender. […] A TCP SHOULD implement the Nagle Algorithm [Nagle 84] to coalesce short segments. However, there MUST be a way for an application to disable the Nagle algorithm on an individual connection. […]

The Nagle algorithm is generally as follows: If there is unacknowledged data […] then the sending TCP buffers all user data […] until the outstanding data has been acknowledged or until the TCP can send a full-sized segment […]

If the receiver delays acknowledgements, and the application writes less than MSS to the socket, and Nagle is enabled, then the sending TCP delays transmission.
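The sender-side decision quoted above can be sketched as a small predicate. This is a simplified illustration of the rule as the specification states it, not the logic of any particular TCP implementation; the function name and its parameters are hypothetical.

```python
def nagle_may_send(unacked_bytes, queued_bytes, mss, nodelay=False):
    """Decide whether a sender may transmit its queued data now.

    unacked_bytes: bytes already sent but not yet acknowledged
    queued_bytes:  user data buffered and waiting to be sent
    mss:           maximum segment size of the connection
    nodelay:       True when the application has disabled Nagle
    """
    if queued_bytes == 0:
        return False              # nothing to send
    if nodelay:
        return True               # Nagle disabled: send immediately
    if queued_bytes >= mss:
        return True               # a full-sized segment can always go out
    return unacked_bytes == 0     # a partial segment waits for outstanding acks
```

In the situation described here, a 10-byte write that follows an unacknowledged 10-byte segment gives `nagle_may_send(10, 10, 1460) == False`, so the data waits for the receiver's (delayed) acknowledgement.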

The specification also says

To avoid a resulting deadlock, it is necessary to have a timeout to force transmission of data […].

but in all traces we collected, the delayed acknowledgement appears to time out before the Nagle algorithm does.

2.2  HTTPS and HTTP Performance Problems

In several situations HTTPS and HTTP trigger SWS avoidance in both the sender and receiver, thereby creating substantial delays. The application layer situations were the following:

·  HTTPS / SSL key exchange, new and reused session key: The server writes two small messages and blocks waiting for response; the browser reads both messages and responds.

·  HTTPS / SSL key exchange, reused session key: Same situation, but with directions reversed. The browser writes two small messages and blocks waiting for response; the server reads both messages and responds.

·  HTTP image (GIF), smaller than MSS: The server sends the HTTP response in two small separate writes, containing headers and body (image data), respectively.

All three cases lead to the same TCP situation. The sender transmits the first message in a separate segment, then waits for its acknowledgement. It transmits the second message in a separate segment when the acknowledgement arrives. The receiver receives the first segment, but delays the acknowledgement because the segment is partial and the window available to advertise is less than MSS. Eventually, a time-out triggers the acknowledgement, thus causing the sender to send the second segment.

The avoidable delay is determined by the delayed ack time-out. As documented by [Microsoft 97] and confirmed by our measurements, Win32 TCP typically delays acks by 200 ms. Our measurements indicate that Sun’s Solaris delays acks by about 45 ms, on average.
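The interaction can be captured in a rough timing model. The sketch below assumes that the two writes happen back to back, that the first segment reaches the receiver after half a round trip, and that the ack travels back in the other half; the function name and the round-trip value in the usage note are illustrative assumptions, not measurements.

```python
def second_segment_delay_ms(delayed_ack_ms, rtt_ms, nagle_enabled):
    """Time the second small segment waits at the sender, in milliseconds.

    delayed_ack_ms: delayed-ack time-out of the receiving TCP
    rtt_ms:         network round-trip time
    nagle_enabled:  whether the Nagle algorithm is active at the sender
    """
    if not nagle_enabled:
        return 0.0                      # second write is transmitted at once
    # First segment travels rtt/2, the receiver holds its ack for the
    # time-out, and the ack travels rtt/2 back before Nagle releases the data.
    return rtt_ms + delayed_ack_ms
```

With the 200 ms Win32 time-out and an assumed 0.5 ms round trip, the model gives 200.5 ms of waiting per transaction; with Nagle disabled it gives none.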

3  Related Work

Network designers understood that delayed acknowledgements may slow application communications, as described in [Stevens 98], [Microsoft 98] and [Sun 98].

[Heidemann 97] discusses this problem in persistent HTTP. He states that the problem does not occur in HTTP, versions 1.0 and earlier, but we find it does.

Microsoft acknowledges the problem [Microsoft 97] and [Microsoft 97], but indicates that it only occurs when making small sends.

[Nielsen 98] and [Nielsen 97] measure the cumulative performance of a set of accesses to a representative Web site. In [Nielsen 98] enabling Nagle in an HTTP/1.1 server slows performance:

Situation / Time (sec) / Time (sec)
Nagle / 0.48 / 0.27
NoNagle / 0.45 / 0.21

However, the client ran on a Digital Alpha station 400 4/233, UNIX 4.0a, rather than on Win32, which is our concern here.

4  The lab experiments

4.1  A simple client-server test application

Before discussing production cases, we analyze the performance behavior of a simple client-server laboratory application that triggers the bug. The test application runs 100 identical transactions. Each transaction consists of the exchange of a 10-byte client request and a 20-byte server response. The 20-byte response is written to the socket by two 10-byte write() calls. No computational overhead is involved. All communication is strictly sequential: the client only initiates a subsequent transaction after the entire 20-byte server response has been received.

We ran the same test application on Win32, Solaris and Linux, in different client/server combinations. The source code remained unchanged for all systems. The application uses the standard Berkeley socket interface. There is no delay between the writes. The resulting fragmentation in the application layer at the server is maintained in lower layers.

In half of our tests, the Nagle algorithm was activated. In the other half, Nagle was deactivated. To turn Nagle on or off, the setsockopt() system call was used with the TCP_NODELAY option.
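For illustration, one transaction of such a test can be sketched with Python's socket module, which exposes the same Berkeley calls. This is not the original test program: it runs client and server in one process over the loopback interface, and all names (`set_nagle`, `recv_exact`, `run_transaction`) are hypothetical.

```python
import socket

REQUEST = b"q" * 10          # 10-byte client request
HALF_RESPONSE = b"r" * 10    # the server writes this twice


def set_nagle(sock, enabled):
    """Turn the Nagle algorithm on or off for one TCP connection."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0 if enabled else 1)


def recv_exact(sock, n):
    """Read exactly n bytes, since TCP may split or merge writes."""
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        data += chunk
    return data


def run_transaction(nagle_on_server):
    """Run one request/response exchange over loopback and return the response."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            server, _ = listener.accept()
            with server:
                set_nagle(server, nagle_on_server)
                client.sendall(REQUEST)
                recv_exact(server, len(REQUEST))
                server.sendall(HALF_RESPONSE)    # first 10-byte write
                server.sendall(HALF_RESPONSE)    # second write, no delay between
                return recv_exact(client, 2 * len(HALF_RESPONSE))
```

Timing `run_transaction(True)` against `run_transaction(False)` on a given platform would show whether its delayed-ack time-out is exposed; on a loopback interface the result need not match the Ethernet measurements reported here.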

4.2  Experimental setup

Table 1 lists the clients and servers used in our experiments. All computers except Win98 are connected to the same 100Mbit Ethernet. Win98 is connected to a 100Mbit Ethernet separated from the others by one router. Network congestion was insignificant during our experiments. IP segment traces were collected with Network General's NetXRay network monitoring tool. The traces appear complete and accurate.

Name / Processor / Operating System
NT4Wa / Pentium, 100 MHz / NT Workstation 4.00.1381
NT4Wb / Pentium, 200 MHz / NT Workstation 4.00.1381
NT4S / Pentium, 233 MHz / NT Server 4.00.1381, SP 3
NT5S / Pentium, 233 MHz / NT Server 5.00.1671, beta 1
Win95 / Pentium, 100 MHz / Windows 95
Win98 / Pentium, 90 MHz / Windows 98
Linux / i486, 66 MHz / Linux 2.0.31
Solaris / Sun SPARCstation 5 / SunOS 5.6

Table 1. Test machines and operating systems.

The test application was run on these six pairs of machines from Table 1:

NT4Wa/NT4S, NT4Wb/Win95, NT4Wb/Win98, NT4Wa/NT5S, NT4Wa/Solaris, NT4Wa/Linux

Although the focus was on covering all Win32 implementations (NT4 Workstation and Server, NT5 beta, Windows 95 and 98), we also tested a Win32/Linux and a Win32/Solaris configuration. In each of the six combinations, each partner acted as both client and server in two successive executions. Each execution was run twice, with the Nagle algorithm enabled and disabled. In total, the test application was run 6 × 2 × 2 = 24 times.
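The count of executions follows from enumerating the machine pairs, the two role assignments, and the two Nagle settings. A small sketch of that enumeration is given below; the tuple layout and the function name are illustrative choices.

```python
from itertools import product

# The six machine pairs listed above.
PAIRS = [
    ("NT4Wa", "NT4S"), ("NT4Wb", "Win95"), ("NT4Wb", "Win98"),
    ("NT4Wa", "NT5S"), ("NT4Wa", "Solaris"), ("NT4Wa", "Linux"),
]


def executions():
    """Return every (client, server, nagle_enabled) combination that was run."""
    runs = []
    for (a, b), swap_roles, nagle in product(PAIRS, (False, True), (True, False)):
        client, server = (b, a) if swap_roles else (a, b)
        runs.append((client, server, nagle))
    return runs
```

Each entry names the client, the server, and whether Nagle was enabled for that execution.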

In the following sections, we present two traces with lengthy ack delays of about 45ms for Solaris and 190ms for Win32, respectively. Then, we show a recorded trace of an execution without lengthy ack delay between the first and second server response. Finally, we give a performance summary of all 24 executions.

4.3  TCP segment traces: lengthy ack delays

The following two traces exhibit lengthy ack delays.

The trace in Table 2 was recorded between the NT client NT4Wa and the NT server NT4S, with Nagle active on the server. In all 100 transactions, the first server 10-byte response segment is acknowledged separately by the client after a delay of, on average, 187.3ms. Packets 6 and 10 are acks sent by the client, which were delayed because the client TCP has no data to send and has received a partial segment.

No. / Direction / Segment, payload [bytes] / Delta [ms] / Cumulative [ms]
1 / → / TCP handshake / - / -
2 / ← / TCP handshake / 0.3 / 0.3
3 / → / TCP handshake / 0.3 / 0.6
4 / → / Request, 16 / 9.1 / 9.7
5 / ← / First response, 10 / 3.6 / 13.3
6 / → / (ack) / 114.4 / 127.7
7 / ← / Second response, 10 / 0.2 / 127.9
8 / → / Request, 16 / 8.3 / 136.2
9 / ← / First response, 10 / 3.5 / 139.7
10 / → / (ack) / 188.3 / 328.0
11 / ← / Second response, 10 / 0.2 / 328.2
and so on

Table 2. NT client NT4Wa and NT server NT4S, with Nagle active on the server. In all transactions, the client acknowledges the first server response segment separately and after lengthy delay. In this configuration, the delay is 187.3ms on average, after a slightly lower initial delay of 114.4ms in the first transaction.
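To make the arithmetic of such a trace explicit, the sketch below recomputes the cumulative times from the per-segment deltas of Table 2 and extracts the delayed acks. The row layout and the helper names are our own; the numbers are copied from the table, and Decimal is used so that the tenths of milliseconds add up exactly.

```python
from decimal import Decimal
from itertools import accumulate

# (segment number, kind, delta in ms) for segments 2 to 11 of Table 2.
# Segment 1 starts the clock and therefore has no delta.
TABLE_2 = [
    (2, "handshake", "0.3"), (3, "handshake", "0.3"),
    (4, "request", "9.1"), (5, "first response", "3.6"),
    (6, "ack", "114.4"), (7, "second response", "0.2"),
    (8, "request", "8.3"), (9, "first response", "3.5"),
    (10, "ack", "188.3"), (11, "second response", "0.2"),
]


def cumulative_ms(rows):
    """Running sum of the deltas, i.e. the cumulative time column."""
    return list(accumulate(Decimal(delta) for _, _, delta in rows))


def ack_delays_ms(rows):
    """Delays of the separate acks that precede each second response."""
    return [Decimal(delta) for _, kind, delta in rows if kind == "ack"]


def ack_share(rows):
    """Fraction of the elapsed time spent waiting for delayed acks."""
    return sum(ack_delays_ms(rows)) / cumulative_ms(rows)[-1]
```

For the rows shown, the two acks account for 302.7 ms of the 328.2 ms elapsed, so more than nine tenths of the time is spent waiting for delayed acknowledgements.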

The trace in Table 3 was recorded between the Solaris client and the NT server NT4Wa, with Nagle active on the server. For all 100 transactions, the client acknowledges the first server 10-byte response segment after a delay of, on average, 46.2 ms. Again, Nagle prevents the server from sending the second half of its response earlier. Although in this case the delay is much smaller than for Win32 clients, it still exceeds the overall average transaction duration of 4.3 ms by an order of magnitude.

No. / Direction / Segment, payload [bytes] / Delta [ms] / Cumulative [ms]
1 / → / TCP handshake / - / -
2 / ← / TCP handshake / 0.4 / 0.4
3 / → / TCP handshake / 0.7 / 1.1
4 / → / Request, 16 / 3.1 / 4.2
5 / ← / First response, 10 / 0.8 / 5.0
6 / → / (ack) / 0.6 / 5.6
7 / ← / Second response, 10 / 0.3 / 5.9
8 / → / Request, 16 / 2.1 / 8.0
9 / ← / First response, 10 / 0.8 / 8.8
10 / → / (ack) / 41.8 / 50.6
11 / ← / Second response, 10 / 0.3 / 50.9
and so on

Table 3. Solaris client and NT server NT4Wa, with Nagle active on the server. The server’s first response segment is always acknowledged separately. The ack is delayed in the second and subsequent transactions. In this configuration, the delay is 46.2 ms on average, after virtually no delay in the first transaction. Throughout this paper, the arrow (← or →) indicates a segment's direction between client on the left and server on the right. So → indicates a segment from client to server, as in "client → server", and vice versa.

We only show one sample trace of all Win32/Win32 combinations (Table 2), because all Win32/Win32 traces perform similarly. The Solaris/NT trace shows that the amount of delay chosen by the actual TCP implementation can vary widely (here, by a factor of 4).

These delays are consistent with the analysis of delay durations in Section 9 of [Paxson 97].

4.4  TCP segment traces: performance with short ack delay, and Nagle off

The trace shown in Table 4 was recorded between the Linux client and the NT server NT4Wa, with Nagle deactivated on the server. In this trace, the Linux client does not send a separate ack for the first 10-byte server response segment (segments 5 and 9 in Table 4). Since Nagle is deactivated, the NT server immediately pushes the second 10-byte server response segment to the client (segments 6 and 10 in Table 4), resulting in a very low overall transaction duration of about 4.4 ms on average.