440GX Application Note
Overview of TCP/IP Acceleration Hardware
January 22, 2008

Introduction

Modern interconnect technology offers Gigabit/second (Gb/s) speeds that have shifted the bottleneck in communication from the physical connection to the protocol stack. In traditional systems, message processing by the operating system can use a significant number of CPU cycles. The TCP/IP Acceleration Hardware (TAH) subsystem of the 440GX offloads the checksum and segmentation aspects of protocol processing from the operating system, leaving the CPU free to dedicate more cycles to the application. This application note describes the features and benefits of TAH and its implementation in the 440GX. For additional information on the 440GX, please refer to http://www.amcc.com.

TCP/IP Acceleration Hardware

In the 440GX, TAH provides hardware acceleration functions for the two 10/100/Gigabit Ethernet Media Access Controllers (EMACs) to improve bandwidth and lower CPU utilization. TAH provides checksum verification for TCP/UDP/IP headers in the receive path, checksum generation for TCP/UDP/IP headers in the transmit path, and TCP segmentation support in the transmit path. TAH supports both standard and jumbo packets. No acceleration functions are available for the two 10/100 EMACs.

For received packets, setting the Checksum Verification on Receive (CVR) bit in the accelerate mode register, TAHx_MR, enables hardware checksum verification for all incoming packets for a given EMAC/TAH. For transmitted packets, hardware checksum generation and/or packet segmentation is selected on a per-packet basis by setting the proper bits in the descriptor control/status field of the buffer descriptor. The size that packets are segmented into (if segmentation is enabled) is controlled by one of six Segment Size Registers (TAH_SSRx). Bits in the buffer descriptor also determine which SSR is used.

TCP/IP Overview

TCP/IP is a term that commonly refers to a collection of protocols more accurately called the “Internet protocol suite”. In addition to TCP (Transmission Control Protocol) and IP (Internet Protocol), the suite also includes additional protocols such as UDP (User Datagram Protocol) and ICMP (Internet Control Message Protocol). Because TAH only manipulates TCP, IP, and UDP packets, these are the only protocols discussed in this paper. The Internet protocol suite is still evolving through the Request For Comments (RFC) mechanism. RFCs are available online at ftp://ftp.rfc-editor.org/in-notes.

The Internet Protocol (RFC 791) provides services that are roughly equivalent to the OSI Network Layer. The unit of transfer in an IP network is called a datagram. IP provides a connectionless datagram transport service across the network. This service is sometimes referred to as unreliable because packets may be lost, arrive out of order, or even be duplicated. The network does not guarantee delivery or notify the end host system about packets lost due to errors or network congestion. IP assumes higher-layer protocols will address these anomalies. IP datagrams contain a message, or one fragment of a message, that may be up to 65,535 bytes (octets) in length. IP does not provide a mechanism for flow control.

The TCP and UDP protocols correspond to the OSI Transport Layer. TCP, described in RFC 793, provides a virtual circuit (connection-oriented) communication service across the network. TCP includes rules for formatting messages, establishing and terminating virtual circuits, sequencing, flow control, and error correction. Most of the applications in the TCP/IP suite operate over the reliable transport service provided by TCP. Common applications such as ftp and smtp communicate through TCP.

UDP, described in RFC 768, provides an end-to-end datagram (connectionless) service. Some applications, such as those that involve a simple query and response, are better suited to the datagram service of UDP because there is no time lost to virtual circuit establishment and termination. UDP's primary function is to add a port number to the IP address to provide a socket for the application. Applications such as bootp and rtelnet communicate through UDP.

Revision 1.01 Application Note (Proprietary) AN2017 440GX Application Note

Each of these three components of the protocol suite adds its own header to the data to be transferred (the payload). The UDP header includes source and destination port numbers, a length, and a checksum. Because TCP is a connection-oriented protocol, more information is needed in its header. The TCP header includes source and destination port numbers, a sequence number, an acknowledgement number, an ‘urgent pointer’, and a checksum. Once the IP layer receives the packet, it needs to add information to ensure the packet is sent to the proper destination. The IP header includes a version number, the type of service, the length of the packet, another checksum, source and destination addresses, plus some additional fields that will not be discussed in detail. The format of a TCP/IP packet is shown in Figure 1.

Figure 1: TCP/IP Packet Format

(Each row is one 32-bit word; fields are listed left to right.)

IP Header
    VERS | HLEN | SERVICE TYPE | TOTAL LENGTH
    IDENTIFICATION | FLAGS | FRAGMENT OFFSET
    TIME TO LIVE | PROTOCOL | HEADER CHECKSUM
    SOURCE IP ADDRESS
    DESTINATION IP ADDRESS
    IP OPTIONS (If Any) | PADDING

TCP Header
    SOURCE PORT | DESTINATION PORT
    SEQUENCE NUMBER
    ACKNOWLEDGEMENT NUMBER
    HLEN | RESERVED | CODE BITS | WINDOW
    CHECKSUM | URGENT POINTER
    OPTIONS (IF ANY) | PADDING

DATA ......
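The field layout in Figure 1 can be sketched as C structures. This is a minimal illustration that assumes headers without options; the type and field names (ipv4_hdr, tcp_hdr, and so on) are hypothetical and are not taken from any 440GX driver. Multi-byte fields travel in network (big-endian) byte order on the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; layout follows Figure 1 for headers without options. */
struct ipv4_hdr {
    uint8_t  vers_hlen;       /* VERS (high 4 bits), HLEN (low 4 bits, in 32-bit words) */
    uint8_t  service_type;
    uint16_t total_length;
    uint16_t identification;
    uint16_t flags_frag_off;  /* FLAGS (3 bits), FRAGMENT OFFSET (13 bits) */
    uint8_t  time_to_live;
    uint8_t  protocol;        /* 6 = TCP, 17 = UDP */
    uint16_t header_checksum;
    uint32_t source_addr;
    uint32_t dest_addr;
};

struct tcp_hdr {
    uint16_t source_port;
    uint16_t dest_port;
    uint32_t sequence_number;
    uint32_t ack_number;
    uint16_t hlen_code_bits;  /* HLEN (4 bits), RESERVED, CODE BITS */
    uint16_t window;
    uint16_t checksum;
    uint16_t urgent_pointer;
};

/* IP header length in bytes, decoded from the first byte of the header. */
static inline unsigned ipv4_header_bytes(uint8_t vers_hlen)
{
    return (unsigned)(vers_hlen & 0x0Fu) * 4u;
}
```

With these layouts, the IP header checksum sits at byte offset 10 of the IP header and the TCP checksum at byte offset 16 of the TCP header. These are the two fields that hardware checksum generation fills in on transmit.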

Memory Access Layer (MAL)

One additional component used to facilitate network communications within the 440GX is the Memory Access Layer (MAL). The MAL is a hardware core that manages data transfers between the TAH (or the EMAC if no TAH is present) and memory. The MAL utilizes a buffer descriptor ring structure in memory. A software device driver, such as the TCP/IP protocol stack, uses the buffer descriptors to inform the MAL about buffer locations and packet or buffer status. The MAL uses the buffer descriptors to convey packet transfer status from the EMAC back to the protocol stack.

Packet Send with TCP Acceleration

Consider an application that wishes to send a packet to another system on the network. This example assumes a simple TCP application sending a packet that does not require segmentation. The application first builds an arbitrary payload and sends it to TCP. TCP adds its header and, in a system without TAH enabled, calculates a checksum for the entire packet, including the header itself. With hardware checksum calculation enabled, the checksum is not calculated until the TCP/IP Acceleration Hardware receives the packet.

The packet is then sent to the IP software layer, and another header is added. The IP layer needs to be able to verify that the header does not get damaged in transit, so another checksum is needed. If hardware checksum calculation is disabled, IP calculates a new checksum and stores it in the appropriate location in the IP header. If hardware checksum calculation is enabled, software does not need to modify either checksum field. The remaining sequence of events is illustrated in Figure 2.


The protocol stack portion of the operating system initiates a packet transmit (1). The device driver parses the protocol stack buffer into descriptor table entries and buffers (2).

Note that the buffer descriptors should be placed in noncacheable memory because they are eight bytes each and must be contiguous. If they are placed in cacheable memory, maintaining software cache coherency may not be possible, because a cache flush of a single descriptor could corrupt the other three descriptors (in real memory) that share the same cache line. The 440GX includes 256KB of on-chip SRAM that is well suited to storing buffers and buffer descriptors.

The device driver then instructs the EMAC to process a new transmit packet (3). The EMAC requests the TAH to retrieve descriptor information (4). This request is passed to MAL (5). The MAL then fetches the buffer descriptor (6), writes it to TAH (7), and initiates a data move. The packet is then passed through the MAL and written to TAH (8). If hardware checksum calculation is enabled, the checksums are calculated on the fly as the packet arrives from MAL and are written into the appropriate packet headers. After TAH has finished inserting the checksums, the packet is sent to the EMAC (9) and transmitted on the media (10). The EMAC then sends a read packet status request to TAH (11). The status information is passed through MAL (12) and written to the buffer descriptor (13). Software is then interrupted (14). The device driver acknowledges the interrupt and clears the interrupt status bits in the EMAC and MAL (15). Finally, the device driver notifies the protocol stack that the operation is complete (16).
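The cache-line concern raised for buffer descriptors can be made concrete with a sketch of an eight-byte descriptor. The split into a 16-bit control/status field, a 16-bit length, and a 32-bit buffer pointer is an assumption made for illustration, as are all of the names; the text states only the eight-byte size and the existence of a control/status field. The 32-byte cache line is also an assumption, inferred from the statement that one flushed descriptor can affect three others.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* assumed: four 8-byte descriptors per line */

struct mal_bd {                /* hypothetical layout and names */
    uint16_t ctrl;             /* descriptor control/status field */
    uint16_t data_len;         /* buffer length in bytes (assumed) */
    uint32_t data_ptr;         /* buffer address (assumed to be 32 bits) */
};

_Static_assert(sizeof(struct mal_bd) == 8, "descriptor must be eight bytes");

/* Index of the first descriptor that shares a cache line with descriptor i,
 * assuming the ring starts on a cache-line boundary. */
static inline unsigned bd_line_start(unsigned i)
{
    unsigned per_line = CACHE_LINE_BYTES / (unsigned)sizeof(struct mal_bd);
    return i - (i % per_line);
}
```

Under these assumptions, flushing descriptor 5 writes back the whole line that holds descriptors 4 through 7. Any status the hardware has written to descriptors 4, 6, or 7 in the meantime could be overwritten with stale cached values.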

Figure 2: Send Operation with Acceleration

(Block diagram showing the OS, CPU, buffers, MAL, TAH, and EMAC, with arrows numbered 1 through 16 corresponding to the steps described above.)

Packet processing occurs in much the same way when hardware segmentation is enabled. The difference occurs when manipulating the packet headers within the hardware accelerate function, between steps 8 and 9 in the previous example. When segmentation is enabled, the original headers from the packet received from MAL are saved for later use. New headers based on the original are built and stored. When an amount of data equal to the selected segment size has been transferred from MAL, the TAH stores the new checksums in the appropriate locations in the header and sends the packet on to the EMAC. At the same time, TAH uses the previously stored headers to create headers for the next packet and continues transferring data from MAL. This process continues until the entire original packet has been transferred.
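The arithmetic behind the segmentation loop can be modelled in software as follows. This is only a sketch and does not describe the TAH's internal implementation. It assumes the segment size counts payload bytes and that each new TCP header carries a sequence number advanced by the number of payload bytes already sent. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
    uint32_t seq;   /* TCP sequence number for this segment */
    size_t   off;   /* offset of this segment's data in the original payload */
    size_t   len;   /* payload bytes carried by this segment */
};

/* Number of segments needed to send payload_len bytes. */
static size_t segment_count(size_t payload_len, size_t seg_size)
{
    assert(seg_size > 0);
    return (payload_len + seg_size - 1) / seg_size;
}

/* Describe segment number idx (0-based) of the original packet. */
static struct segment segment_at(uint32_t first_seq, size_t payload_len,
                                 size_t seg_size, size_t idx)
{
    struct segment s;
    assert(idx < segment_count(payload_len, seg_size));
    s.off = idx * seg_size;
    s.len = (payload_len - s.off < seg_size) ? payload_len - s.off : seg_size;
    s.seq = first_seq + (uint32_t)s.off;  /* wraps modulo 2^32, as TCP does */
    return s;
}
```

For example, an 8192-byte payload with a 1460-byte segment size yields six segments: five of 1460 bytes and a final one of 892 bytes.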


Receive Operations

When performing a receive operation, the TAH is responsible for transferring packets from the EMAC to the MAL. If enabled, it also verifies the TCP/UDP/IP checksums. At the end of the reception process, the TAH adds the checksum status, if any, to the status/error word provided by the EMAC before presenting the status to MAL.
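Checksum verification relies on a property of the one's-complement sum: when the sum is taken over data that already contains a correct checksum field, the folded 16-bit result is 0xFFFF. The following is a minimal software model of that check, not a description of the TAH logic. The helper names are hypothetical, bytes are read in network order, and the pseudo-header that TCP and UDP checksums also cover is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Folded 16-bit one's-complement sum of len bytes, read in network order.
 * A 32-bit accumulator is sufficient for len up to 65,535 bytes. */
static uint16_t ones_sum(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)p[i] << 8) | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;   /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16); /* fold the carries back in */
    return (uint16_t)sum;
}

/* Nonzero if the data, including its checksum field, sums to 0xFFFF. */
static int checksum_ok(const uint8_t *p, size_t len)
{
    return ones_sum(p, len) == 0xFFFFu;
}
```

Applied to a 20-byte IP header, checksum_ok returns nonzero only if the header checksum field matches the rest of the header, so a corrupted byte in transit causes the check to fail.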

Enabling Hardware Accelerate

As mentioned earlier, hardware accelerate functions can be enabled on a per-packet basis for transmit operations, and on a per EMAC/TAH basis for receive operations. For each transmitted packet, MAL uses the descriptor control/status field of the buffer descriptor to provide control information to TAH. If the Generate FCS bit (bit 6) and the Generate Padding bit (bit 7) are set to 1, TAH calculates checksums, stores them in the appropriate locations within the headers, and adds padding to the packet as appropriate. Bits 12-14 are the Hardware Accelerate bits. Possible values and interpretations are shown in Table 1.

Table 1: Hardware Accelerate Bit Settings

Bits 12-14    Function
000           Hardware accelerate is disabled
001           TCP segmentation enabled, use TAHxSSR0
010           TCP segmentation enabled, use TAHxSSR1
011           TCP segmentation enabled, use TAHxSSR2
100           TCP segmentation enabled, use TAHxSSR3
101           TCP segmentation enabled, use TAHxSSR4
110           TCP segmentation enabled, use TAHxSSR5
111           Hardware checksum generation enabled

If the hardware accelerate bits are not set to 000 (that is, some form of hardware accelerate is enabled), bits 6 and 7 must also be set. Failure to do so will cause improper operation. If "Generate FCS" is not set, the FCS will be bad, and EMAC will set bit 6 "Bad FCS on transmitted frame". If "Generate padding" is not set, the frame might not meet the minimum packet size requirement of the network. In that case, the receiving side would consider the frame a “runt frame” and would either report it as such or discard it.
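The rules above and the codes in Table 1 can be captured in a small helper that builds the transmit control bits. This is a sketch under two stated assumptions: that the control/status field is 16 bits wide, and that bits are numbered with bit 0 as the most significant bit (the usual PowerPC convention). All macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16-bit field with bit 0 as the most significant bit. */
#define BD_BIT(n)        ((uint16_t)(1u << (15 - (n))))

#define TX_GEN_FCS       BD_BIT(6)   /* Generate FCS */
#define TX_GEN_PAD       BD_BIT(7)   /* Generate Padding */
#define TX_ACCEL_SHIFT   1u          /* bits 12-14 sit one bit above the LSB */
#define TX_ACCEL_MASK    ((uint16_t)(0x7u << TX_ACCEL_SHIFT))

#define ACCEL_DISABLED   0x0u        /* Table 1 code 000 */
#define ACCEL_CSUM_ONLY  0x7u        /* Table 1 code 111 */

/* Table 1 code that selects Segment Size Register ssr (0 through 5). */
static unsigned accel_code_for_ssr(unsigned ssr)
{
    assert(ssr <= 5u);
    return ssr + 1u;
}

/* Control bits for a Table 1 code. Bits 6 and 7 are always set together
 * with any non-zero accelerate code, as the text requires. */
static uint16_t tx_ctrl_accel(unsigned accel_code)
{
    assert(accel_code <= 0x7u);
    if (accel_code == ACCEL_DISABLED)
        return 0;
    return (uint16_t)((accel_code << TX_ACCEL_SHIFT) | TX_GEN_FCS | TX_GEN_PAD);
}
```

A driver following this sketch would OR the result of tx_ctrl_accel(accel_code_for_ssr(0)) into the descriptor control field to request segmentation using the first Segment Size Register, or tx_ctrl_accel(ACCEL_CSUM_ONLY) to request checksum generation without segmentation.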


Performance

Use of TCP Accelerate functions can greatly increase both network and processor efficiency. Typical checksum generation code within the protocol stack may be similar to this example:

    static unsigned short checksum(unsigned short *addr, int len, long sum,
                                   int udp_sum)
    {
        int nleft = len;
        unsigned short *w = addr;
        unsigned short answer;

        /* Simple checksum algorithm, using a 32-bit accumulator, we add
           sequential 16-bit words to it, and at the end fold back all the
           carry bits from the top 16 bits into the lower 16 bits */
        while (nleft > 1) {
            sum += *w++;
            nleft -= 2;
        }

        /* Clean up an odd byte, if necessary. */
        if (nleft == 1) {
            if (udp_sum == 0) {
                sum += *(unsigned char *)w;
            } else {
                sum += (*(unsigned char *)w) << 8;
            }
        }

        /* Add back carry outs from top 16 bits to low 16 bits.
           Add hi 16 to low 16. Add carry. Truncate to 16 bits (sort of). */
        sum = (sum >> 16) + (sum & 0xffff);
        sum += (sum >> 16);
        answer = ~sum;
        return (answer);
    }


Running this code on a 440GP evaluation board with a 400MHz CPU clock, calculating the checksum on a 1KB packet (len = 1024) takes approximately 6 microseconds (2592 cycles) if both the buffer and code are resident in the cache at the time of execution. The actual time to execute this same sequence could be much longer, depending on the system’s memory configuration.

To determine the performance and benefits of offloading checksum generation for more realistic applications, AMCC built a model of the 440GX and simulated a variety of network traffic scenarios. Each scenario assumed software checksum generation would take 8 cycles per word (consistent with results obtained in the previous example), general packet processing would take 6130 cycles, and ACK processing would take 2400 cycles.

Figure 3 shows the results of simulating read operations in a system with two active 1Gb Ethernet interfaces. This chart represents fixed packet sizes of 512 bytes, 1536 bytes, and 8192 bytes. As the chart illustrates, in this scenario when running without TCP acceleration enabled, CPU utilization is at 100% even though network utilization is well below 100%, particularly in the case of small (512 byte) packets. With checksum generation enabled, CPU utilization drops to about 35% for 8KB packets, but remains at 100% for the other two packet sizes. In all cases, however, network utilization is increased.
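Returning to the 1KB measurement quoted above, the cycle count can be converted into time and into a rough share of the CPU. The sketch below is illustrative arithmetic only and is not part of the AMCC simulation model. It assumes a single 1 Gb/s link delivering back-to-back 1024-byte packets and ignores headers, framing, and inter-frame gaps. The function names are hypothetical.

```c
#include <assert.h>

/* Time in microseconds taken by a given number of cycles at cpu_hz. */
static double cycles_to_us(double cycles, double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return cycles / cpu_hz * 1e6;
}

/* Fraction of the CPU spent on checksums alone when packets of
 * packet_bytes arrive back to back on a link of link_bps. */
static double checksum_cpu_fraction(double link_bps, double packet_bytes,
                                    double cycles_per_packet, double cpu_hz)
{
    double packets_per_sec;

    assert(packet_bytes > 0.0 && cpu_hz > 0.0);
    packets_per_sec = link_bps / (packet_bytes * 8.0);
    return packets_per_sec * cycles_per_packet / cpu_hz;
}
```

With the measured 2592 cycles at 400 MHz, cycles_to_us(2592, 400e6) gives about 6.48 microseconds. Under the stated assumptions, checksum_cpu_fraction(1e9, 1024, 2592, 400e6) comes to roughly 0.79, meaning software checksumming alone would take about 79% of that CPU at line rate, before any other packet processing.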

Figure 3: Read Operation

(Bar chart titled "Read - 2x1G Ethernet, 600 MHz Processor". Vertical axis: utilization, 0 to 100. Horizontal axis: packet sizes of 512B, 1536B, and 8192B. Series: CPU-No TCP Assist, RX-No TCP Assist, CPU-with TCP Assist, RX-with TCP Assist.)

Figure 4 illustrates CPU and network utilization for write operations. Test parameters were the same as in the previous example.

Figure 4: Write Operation

(Bar chart titled "Write - 2x1G Ethernet, 600 MHz Processor". Vertical axis: utilization, 0 to 100. Horizontal axis: packet sizes of 512B, 1536B, and 8192B. Series: CPU-No TCP Assist, RX-No TCP Assist, CPU-with TCP Assist, RX-with TCP Assist.)


As in the read operation, CPU utilization is at 100% when checksum generation is not active, and network utilization increases for all packet sizes when it is enabled. Both examples show that offloading some of the TCP/IP functions increases the number of free CPU cycles. The most significant performance increase is seen when working with 8KB packets. This is most beneficial to applications that move large units of data, such as storage applications.

Comparable results are achieved in designs using only one of the 1Gb Ethernet controllers. Figures 5 and 6 show the simulation results of read and write operations across a single channel. All other simulation parameters were the same as in the previous examples.

Figure 5: Read Operation

(Bar chart titled "Read - 1x1G Ethernet, 600 MHz Processor". Vertical axis: utilization, 0 to 100. Horizontal axis: packet sizes of 512B, 1536B, and 8192B. Series: CPU-No TCP Assist, RX-No TCP Assist, CPU-with TCP Assist, RX-with TCP Assist.)

Figure 6: Write Operation

(Bar chart titled "Write - 1x1G Ethernet, 600 MHz Processor". Vertical axis: utilization, 0 to 100. Horizontal axis: packet sizes of 512B, 1536B, and 8192B. Series: CPU-No TCP Assist, RX-No TCP Assist, CPU-with TCP Assist, RX-with TCP Assist.)

These examples show that it is possible to fully utilize the network with or without TCP Acceleration, but that offloading checksum functions frees more microprocessor cycles. Again, an 8KB packet size shows the best performance increase. While every attempt was made to build an accurate simulation model, results obtained on actual hardware may vary.


Conclusion

Network bandwidth has increased at a much faster pace than processor and memory performance. Because of this gap, CPU utilization can easily reach 100% long before the network is saturated. Although 1Gb Ethernet is ten times faster than 100Mb, the overhead of packet processing in software may result in only a 3x or 4x increase in overall throughput. By offloading some of the packet processing that is traditionally done by software, significant performance increases can be realized. The TCP/IP Acceleration Hardware provides this capability in the 440GX.


Document Revision History

Revision Date Description

v1.01 1/22/08 Converted layout to AMCC format.

Applied Micro Circuits Corporation
6310 Sequence Dr., San Diego, CA 92121
Main Phone: (858) 450-9333
Technical Support Phone: (858) 535-6517, (800) 840-6055
http://www.amcc.com

AMCC reserves the right to make changes to its products, its datasheets, or related documentation, without notice and warrants its products solely pursuant to its terms and conditions of sale, only to substantially comply with the latest available datasheet. Please consult AMCC’s Terms and Conditions of Sale for its warranties and other terms, conditions and limitations. AMCC may discontinue any semiconductor product or service without notice, and advises its customers to obtain the latest version of relevant information to verify, before placing orders, that the information is current. AMCC does not assume any liability arising out of the application or use of any product or circuit described herein, neither does it convey any license under its patent rights nor the rights of others. AMCC reserves the right to ship devices of higher grade in place of those of lower grade. AMCC SEMICONDUCTOR PRODUCTS ARE NOT DESIGNED, INTENDED, AUTHORIZED, OR WARRANTED TO BE SUITABLE FOR USE IN LIFE-SUPPORT APPLICATIONS, DEVICES OR SYSTEMS OR OTHER CRITICAL APPLICATIONS. AMCC is a registered Trademark of Applied Micro Circuits Corporation. Copyright © 2008 Applied Micro Circuits Corporation. I2C BUS® is a registered Trademark of Philips N.V. Corporation Netherlands.
