A MULTI-HOMED GATEWAY FOR REDUNDANT ACCESS

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI'I IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

IN

ELECTRICAL ENGINEERING

August 2005

By Kin Ho Tung

Thesis Committee:

E.J. Weldon, Jr., Chairperson
Tep Dobry
Galen Sasaki

ACKNOWLEDGEMENTS

I would like to express my sincere appreciation to my advisor, Dr. Weldon, for providing me with advice and insights during my work on this thesis. In addition, I would like to thank Dr. Tep Dobry and Dr. Galen Sasaki for taking the time to serve on my committee. Finally, I would like to thank Ed Nakamoto and Keith Oshiro of Spirent Communications for providing the equipment that I needed to properly implement this project.

ABSTRACT

The Internet is fast becoming the world's most important communication medium. It is used not only by large corporations, but also by small businesses and individual consumers. High-speed broadband subscriptions continue to increase each year as more customers depend on the Internet as a vital part of their everyday lives. As such, network availability and reliability are becoming a major concern. The reliability and availability of the network are largely dictated by "the last mile" connection between the customers and their Internet Service Providers (ISPs). This thesis focuses on making this last-mile connection as reliable as possible by creating multiple links between the customer and the Internet. The device, called the multihomed gateway, allows users to connect to the Internet through multiple ISPs. In addition to increasing network availability and reliability, the multihomed gateway can provide improved performance in terms of both latency and bandwidth.

TABLE OF CONTENTS

Acknowledgements iii
Abstract iv
List of Tables vii
List of Figures viii
Chapter 1 - Introduction 1
1.1 Background 1
1.2 The multihomed gateway 4
1.3 Hardware platform 5
1.4 Software platform 7
1.5 Thesis overview 9
Chapter 2 - Overview of the Internet 12
2.1 Internet wiring 12
2.2 OSI model 14
2.3 Internet routing 16
2.4 Address shortage and network address translation 16
2.5 A day in the life of a packet 19
Chapter 3 - Inherent problems with consumer Internet services 21
3.1 Unreliable broadband availability 22
3.2 Reduced service quality due to loading 23
3.3 Inflexibility of service offering 25
Chapter 4 - Multihoming solution 27
4.1 Multihoming advantages 27
4.1.1 Channel redundancy 28
4.1.2 Better response time 28
4.1.3 Load balancing 28
4.1.4 Reduced cost 28
4.2 Multihomed gateway overview 29
4.3 Data granularity 30
4.4 Selecting a channel 31
4.4.1 Channel performance monitor 33
4.4.2 User traffic monitor 37
4.4.3 Channel chooser 38
4.5 Network address translation (NAT) engine 39
4.6 Packet forwarder 49
Chapter 5 - Implementation of multihoming using a network processor 51
5.1 Packet processing with the Intel 2800 network processor 51
5.1.1 Intel 2800 network processor block diagram 52
5.1.2 Setup and boot of the Intel 2800 network processor 55
5.1.3 Flow of a typical packet 58
5.2 Implementation of the multihomed gateway on the Intel 2800 63
5.3 Packet processing logic 65
5.3.1 Dispatch loop 65
5.3.2 Channel selection 68

5.3.3 Packet processing 74
5.3.4 Statistics update 77
Chapter 6 - Functional test of the multihomed gateway 78
6.1 Test setup 78
6.1.1 User PC 79
6.1.2 Boot and debug manager 80
6.1.3 Channel impairment device 81
6.2 Multihomed gateway operational test 83
6.3 Performance of the multihomed gateway using the capacity estimation algorithm results 85
6.3.1 Results under externally loaded condition 87
6.3.2 Results under channel outage condition 90
6.3.3 Results under internally loaded condition 92
6.3.4 Results under unloaded condition 92
6.4 Summary of test results 94
Chapter 7 - Conclusions & suggestions for future work 96
7.1 Advantages of a network processor vs. FPGA 96
7.2 Applications for the multihomed gateway 97
7.3 Suggestions for future work 97
Appendix A - Intel IXP2800 Network Processor 99
A.1 Embedded Xscale Core 101
A.2 Microengines 102
A.3 DRAM / SRAM controllers 105
A.4 SHaC 107
A.5 Media and switch fabric interface 108
Appendix B - The Intel IXDP 2800 Advanced Development System 111
B.1 IXDP 2800 Overview 111
B.2 IXBM 2800 dual network processor base card 112
B.3 IXDP 2810 mezzanine card 114
B.4 Intel IXA software development kit 115
Appendix C - Channel selection using other methods 118
C.1 Balanced load 118
C.2 Random session placement 119
Appendix D - Source code 120
D.1 Network processor microcode 120
D.2 Xscale C code 132

LIST OF TABLES

Table Page

2.1 The seven layers of the OSI model 15

4.1 Packet identification used by NAT 42

5.1 ARP frame fields 70

5.2 RFC 1624 equation for checksum calculation 76

6.1 External multihomed gateway services 80

LIST OF FIGURES

Figure Page

2.1 Link or node connections in a network 13

2.2 Network routing 16

2.3 Typical network topology in a small business environment 18

3.1 ISP service to customers 22

3.2 24-hour ping response time (sample 1) 25

3.3 24-hour ping response time (sample 2) 25

4.1 Multihomed vs. single-homed node 27

4.2 Multihomed gateway block diagram 29

4.3 Channel chooser block diagram 32

4.4 Maximum data rate measurement 35

4.5 Service type field in the IP header 38

4.6 Network configuration with the multihomed gateway 40

4.7 Multihomed gateway connection to the Internet 43

4.8 DNAT multihomed gateway configuration 45

4.9 DNAT flow chart 46

4.10 SNAT multihomed gateway configuration 47

4.11 SNAT flow chart 48

5.1 Intel 2800 network processor block diagram 52

5.2 Packet flow through the network processor 59

5.3 Hyper task chaining of microengines 61

5.4 Pool of threads of microengines 62

5.5 HTC and POTs in a typical system 63

5.6 Dispatch loop flow diagram 65

5.7 Channel selection flow diagram 68

5.8 ARP frame 69

5.9 Packet processing flow diagram 74

5.10 Statistics update flow diagram 77

6.1 Functional test setup diagram 79

6.2 Channel 1 jnettop screen shot 84

6.3 Channel 2 jnettop screen shot 85

6.4 Traffic during externally loaded condition 87

6.5 Externally loaded condition debug statistics 89

6.6 Channel outage condition debug statistics 91

6.7 Traffic during internally loaded condition 92

6.8 Unloaded condition 93

A.1 Intel IXP2800 network processor 100

A.2 Embedded Xscale core block diagram 101

A.3 Microengine block diagram 103

A.4 DRAM/SRAM controller block diagram 106

A.5 SHaC block diagram 107

A.6 MSF receive block diagram 109

A.7 MSF transmit block diagram 109

B.1 Intel IXDP 2800 development system 111

B.2 Intel IXBM 2800 dual network processor base card 112

B.3 Intel IXBM 2800 block diagram 113

B.4 Intel IXDP 2810 mezzanine card 114

B.5 Fiber/copper SFP modules 115

B.6 Intel developer workbench IDE 116

B.7 Intel developer workbench transactor 116

B.8 Intel developer workbench debugger 117

Chapter 1

Introduction

The Internet has grown to be an integral part of everyday life. The different services that the Internet provides allow people to communicate instantaneously. The flow of information increases each year, and as more people are connected, data traffic will continue to grow. The Internet spawned such tools as e-mail, the World Wide Web, instant messaging and video conferencing. As these tools become more integrated into our lives, we start to take them for granted. We expect them to work well, all the time.

Unfortunately, the home and small-business user must be content with services that are unreliable compared with services such as the telephone. With the Internet becoming the primary communication tool, reliability and quality of service are quickly becoming significant issues.

1.1 Background

As we expect more from our Internet experience, we sometimes forget the cost of that experience. Dynamic web pages, animation and video all require massive amounts of bandwidth. During the early days of the web, websites were mostly text-centric. Some had a few pictures; however, most of the content delivered was text. It was very unusual for a webpage to be larger than 100K bytes.

Today's websites have colorful advertisements, pictures and animations. For example, the front page of msnbc.com is slightly under 500K bytes in size. Using that site as a sample, it would take roughly 70 seconds (500K bytes x 8 bits per byte / 56 kbits/s), over a minute, to download this page using a 56K modem. In a world where we are accustomed to instant communication, a minute is much too long to wait. To make matters worse, future technologies are expected to increase bandwidth requirements. Dial-up connections will not be an option, even by today's standards. Broadband Internet access is therefore essential to fully utilize the Internet.

The first widely accepted broadband access was the cable modem, offered by the customer's local cable company around 1997. Shortly after the introduction of broadband Internet access by cable modems, the telephone industry introduced its own broadband access method using the Digital Subscriber Loop (DSL). Prior to the cable modem, ISDN was the broadband service of choice. However, due to the high cost and difficult setup, ISDN only served large businesses that demanded high-speed access at any cost.

The cable modem, as well as its competitor DSL, did not have a very good start. With only a few hundred thousand subscribers in 1999, neither caught on as well as the high-speed providers had hoped. However, at the time the dot-com boom had just started, and Windows 95 was shipped with new computers. By the late 1990s, more people began to use the Internet more frequently and content became more complex.

People began to switch from dial-up service to broadband service. Prices began to fall, which attracted more customers.

In 2003, broadband users grew by 42 percent as 8.3 million more homes and businesses began using broadband. The total number of broadband users at the end of 2003 was 28.2 million, or about 11% of all Internet users. The number of broadband users is expected to increase: the Yankee Group, a communications and networking research and consulting firm, projects broadband usage at 61 million users by 2008, or nearly 25% of all Internet users.

Typical broadband access for homes and small businesses is provided by a single Internet Service Provider (ISP), causing the user to be dependent on this single connection for Internet access. If the broadband link goes down, the user's network will be cut off from the rest of the world. The link may go down for hours or even days if the cause of the problem is hard to identify. This type of downtime is considered unacceptable for big businesses; however, home and small-business users are usually left to "deal with it."

The world has become a place where even small businesses depend heavily on Internet access. Customers need to access business websites or contact businesses via e-mail. A business like an Internet cafe requires reliable Internet access as part of its business model. Businesses need to access their supply chain to check on the status of their items. Internet access is no longer a luxury but a necessity.

Home users also depend greatly on high-speed Internet access. For example, a few days of disruption can cause serious financial trouble for a user who pays bills online, or who trades stocks online. A few days of downtime, while waiting for the technician to fix the problem, can be disastrous.

By making use of broadband access from multiple providers, home and small-business users can make use of the same redundancy techniques used by big businesses to assure network availability. This "multihoming" technique connects an internal network to two or more external channels, which would in turn connect to different ISPs. Once users have multiple external connections, they can then choose which to use at any one time.

1.2 The multihomed gateway

A multihoming device has more than one external channel to the outside world. With multiple channels, the user's network can operate satisfactorily if one channel goes down. This is done automatically by the multihoming gateway.

Another advantage of having multiple external channels is that the multihoming device can perform load balancing among all outgoing channels. In load balancing, all sessions requested by the user's internal network are weighed against the current load situation of the external channels. The multihoming device then picks the best possible channel to carry the session. The result is an intelligent session placement that will provide the best throughput.

The purpose of this thesis is to design and implement such a multihomed gateway. This device will allow the user to connect to two separate ISPs. In addition, it will allow the user of the multihomed gateway to connect multiple client computers on the internal network and share the external network connections using Network Address Translation (NAT). The network address translation allows the multihomed gateway to work transparently with the user's internal network and behave as a normal residential gateway.

The multihomed gateway will provide the user with channel fault detection and traffic diversion if a channel fails. Additionally, the multihomed gateway will employ a channel capacity estimation technique that estimates the channel capacity at any given time. This allows the multihomed gateway to recognize channel loading conditions and react accordingly. As with the residential gateway function, channel fault tolerance and channel capacity estimation are transparent to the user.

The multihomed gateway designed in this project has been implemented using the Intel IXP2800 Network Processor. All the functions were written in the Intel Network Processor assembly language using Intel's development studio. Section 1.3 will provide more details of the network processor's hardware; Section 1.4 will explain the software.

1.3 The hardware platform

The hardware platform chosen to implement the multihomed gateway is the Intel IXP2800 network processor. A network processor is a chip designed with specific instructions and logic units to process network packets. Many vendors offer different flavors of network processors, each designed for a particular performance and price range. Aside from Intel, companies such as Agere Systems, Conexant, Ubicom and Vitesse offer versions of network processors.

Although made by different vendors, network processors cater to specific target audiences. Some network processors operate at very high speed and can handle network traffic up to the OC-192 line rate. These processors are expensive. Low-end network processors cost less but will only handle traffic at 10/100 Mbps. Other processors exist between these two extremes and offer a variation in speed and price.

Despite their differences, network processors have a common design theme. Variations among network processors are analogous to how general computing CPUs differ. For example, the Intel x86 family of processors is very different in design from the IBM PowerPC line of processors. The x86 uses a CISC instruction set while the PowerPC uses a RISC instruction set. One would not expect binary files compiled for the x86 to work correctly, or even at all, on the PowerPC. However, if the processors are viewed from a high-level design perspective, it is easy to see many similarities between the two. For example, both the x86 and PowerPC have an Arithmetic Logic Unit (ALU), a Memory Management Unit (MMU), caches and generic IO ports. Network processors are similar: they share the same basic design, yet it is impossible to match specifics between network processors. Generally speaking, all network processors have an embedded, or management, processor; a group of sub-processors that handle the "fast path" network processing; memory controllers; and external IO ports. As with general-purpose processors, binary files are incompatible between different vendors' network processors. The Intel IXP2800's embedded processor is the Intel Xscale processor. The embedded processor runs an embedded operating system and is used to control the group of sub-processors, which Intel calls microengines.

Unfortunately, network processor technology is relatively new. Designers have to write programs in the network processor's native assembly language. This means that code written for a specific processor will not work with any other network processor, since they do not share the same instruction set. However, the industry is moving away from assembly code to C, a higher-level language. In time, more intelligent and optimized compilers will be written to compile generic C code to the network processors' native assembly code. When that happens, code portability between network processors will reach a level equivalent to the current portability between general processors.

The specific network processor used for this thesis is the Intel IXP2800. This is Intel's third-generation network processor and is designed to handle up to 10 gigabits of network traffic. The prototype board provided by Intel is the IXBM 2800 (Deer Island) development system and includes two IXP2800 network processors, 768 megabytes of SDRAM, 16 megabytes of SRAM and 4 megabytes of boot flash.

The IXBM 2800's external connections include two RS-232 serial console ports and two 10/100 development Ethernet ports, each connected to one of the network processors' embedded processors. It also has the IXDP 2810 mezzanine card that contains ten separate 1-gigabit Ethernet ports. Finally, the development board is housed inside a standard 2U rack-mountable chassis.

1.4 The software platform

While working with network processors, the bulk of the effort consists of writing the software. Although the network processor was designed to be generic enough to manipulate any type of network traffic, it must be programmed similarly to a general processor. However, unlike a general processor, a network processor is a complete system-on-chip solution. Therefore, many facets of software must work together for the entire chip to function properly.

The single most important piece of software running on the network processor is the embedded operating system, which runs on the embedded processor. The entire embedded system is used as a control and distribution node for the rest of the chip.

The Intel IXP2800 network processor may run WindRiver's VxWorks or Linux. For this thesis, Linux is used because the operating system can be obtained freely. The installation discs that came with the development system contained the binary files and the complete source tree to build the embedded Linux kernel. The installation discs also included the development suite for VxWorks; a separate license must be purchased from WindRiver to use that suite. The choice of VxWorks or Linux is based on the knowledge of the individual developer, as there are no operational differences between the two. In other words, both environments can fully utilize all the features of the network processor. However, since VxWorks is a commercial product, it is much better supported by WindRiver. It should also be noted that the environment and tools provided by WindRiver are better equipped, and debugging in the VxWorks environment is generally viewed as superior to the Linux environment.

The Linux kernel used by the embedded processor is the MontaVista Linux release v3.1. This version of the kernel includes modifications by Intel to communicate with specific hardware found only in Intel network processors. These changes are not part of the official Linux kernel and therefore do not exist outside the Intel network processor circle.

The development suite used to develop the network processor code is a combination of the Intel developer studio and Linux. The Intel developer studio is primarily used to write, compile and simulate the microengine code. The heart of the developer studio is the Transactor, the microengine simulator used by the developer studio to simulate the actions of the microengine code (microcode) on the microengines. The Transactor allows the developer to visually debug code and to optimize memory usage. The Transactor takes into account all memory latency delays and provides a graphical timeline of how each microengine will react to the code. This allows the developer to lay out the flow of the code in time correctly so that the code can hide memory latency. This is especially important since, without latency hiding, the network processor will not operate optimally.

1.5 Thesis overview

This thesis is separated into seven chapters. The first chapter introduces the idea of multihoming. It also touches on the hardware and software platforms used to implement the multihoming gateway.

Chapter Two briefly describes the inner workings of the modern Internet. It discusses how the Internet is physically connected and how the OSI model is used by the Internet to send information back and forth using the Internet protocols. This chapter also discusses how routing is performed and how network address translation works. Finally, Chapter Two follows the path of a packet traveling across the Internet, from transmit to receive across a NAT interface.

Chapter Three describes why a multihoming device is important. It explains how a single Internet connection for home and small-business broadband access is unreliable. In addition, a single link can be loaded down, which can cause slowdowns.

Chapter Four deals with the operation of a multihoming device. This chapter describes how a multihoming device works and what functions it performs. It delves into the hardware block diagram and detailed descriptions of primary functions such as NAT, and assesses important concepts such as link availability and load balancing.

Chapter Five describes the implementation of the multihoming device discussed in Chapter Four using the Intel IXP2800 Network Processor. It explains how to set up Linux, how to boot the embedded processor, and how to set up the microengines and the multihomed device's program flow.

Chapter Six describes the functional testing to be performed on the multihoming device. The test setup and the necessary software are listed in this chapter. All of the functions of the multihomed gateway are tested to ensure they work correctly. The chapter also describes several methods for the multihoming device to perform load balancing. It explains how one load-balancing scheme can be more effective than another.

Chapter Seven concludes the thesis and shows how a multihoming device can be used. It also discusses possible future projects that can be derived from this thesis or the network processor. There is also a short section on the advantages of a network processor versus a traditional FPGA.

There are four appendices:

Appendix A contains a description of all components of the Intel IXP2800 Network Processor. This includes block diagrams with detailed descriptions of the embedded processor, the microengines and all support peripherals.

Appendix B contains an overview of the Intel IXDP 2800 Advanced Development System. It describes functions and peripherals bundled with the development system and the Intel IXP developer studio.

Appendix C contains other methods of channel selection. These methods are compared against the method used by the multihomed gateway.

Appendix D contains the complete source code for the multihoming device. This includes the front-end user application to set up the multihoming device and the microengine source code.

Chapter 2

Overview of the Internet

In 1973, the U.S. Defense Advanced Research Projects Agency (DARPA) funded a research program to interconnect various computer centers across the country. These centers, or nodes, were operated by different government agencies and universities. The goal was to interlink all the nodes so they could communicate with each other. A secondary goal was to ensure that the network created could handle node failures, so that failure of one link or node would not bring down the entire network.

The purpose of this resilient network, known as ARPAnet, was to allow important government units to communicate despite network catastrophes that might occur during war or natural disaster. In time, the ARPAnet evolved into the Internet.

2.1 Internet wiring

The Internet is best described as a network of networks. However, the recent dramatic increase in the number of users demands a new definition for this gigantic network. The Internet today consists of networks layered on top of each other. Each layer serves a different function. The inner core, known generally as the backbone, consists of high-capacity links provided by common carriers such as AT&T, UUNET and Sprint. This backbone routes traffic between areas as large as continents. As one moves away from the backbone, the successive layers service smaller geographic regions. For example, a second-tier network provider may be responsible for a regional connection like an entire state. Further from the backbone, the service region becomes smaller until it reaches a network which provides services to many users. This can be a private network (e.g., UHnet) or an ISP that services individual users. Users today have home networks that service several computers on an internal LAN. This LAN is then connected via a single connection to their local ISP. However, this layering structure is not absolute. It is possible for a small, localized network to have a direct tier-two or backbone connection to the Internet. The cost of such a connection would be substantial and usually impractical, since backbone connections have such a large bandwidth capacity.

In order to achieve the goal of network resiliency, all backbone nodes must have more than one connection to other nodes. This allows the network to absorb the failure of any one node and provides alternate links for communication. In Figure 2.1, each black circle represents a node in the network. The connecting lines between them are links, which could be fiber cables, copper cables or satellite/wireless connections.

Left: a resilient network, in which disabling any one link or node cannot bring down or segment the whole network. Right: a network which could be segmented if a node is disabled; in this case, the two nodes with arrows and the dotted path between them are vulnerable.

Figure 2.1 Link or node connections in a network

The disadvantage of connecting multiple links to any node is the many paths that travel from one node to another. Each node must have enough intelligence to determine how to send data to another node. In addition, the nodes have to handle the dynamic nature of the network, since a node failure is usually not a scheduled event.

The unit used to transmit data from one node to another is called a packet, and the act of determining how to send the packet from one node to another is called routing. Each node must decide how to efficiently route packets to their destinations. Routers must also be able to deal with congestion and broken links and then determine alternate routes.

The Internet is a packet-switched network. This contrasts with the telephone system, which is a circuit-switched network. A packet-switched network transmits data by segmenting it into blocks or packets. The specific contents of a packet depend on the type of network. Generally, a packet contains a header, a body and a trailer. The header of the packet contains routing information such as the source and destination addresses. The body of the packet contains the data to be transmitted. The trailer contains the packet checksum or FCS. However, certain routing protocols omit the checksum if it is deemed unnecessary.
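
To make this layout concrete, the sketch below expresses such a packet as a C structure. The field names and sizes are illustrative assumptions only; real protocols define their own formats.

    #include <stdint.h>

    /* A generic packet: a header with routing information, a body with
       the data, and a trailer with a checksum. Field names and sizes
       are hypothetical, not those of any particular protocol. */
    struct packet {
        uint32_t src_addr;    /* header: source address */
        uint32_t dst_addr;    /* header: destination address */
        uint16_t body_len;    /* header: length of the body in bytes */
        uint8_t  body[1500];  /* body: the data to be transmitted */
        uint32_t fcs;         /* trailer: checksum (omitted by some protocols) */
    };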

2.2 The OSI model

The Open System Interconnection (OSI) model is the framework that networks use to communicate. The seven-layer model separates the tasks that must be accomplished in order to transmit and receive information over a network. The framework defines the messages passed between each layer. The messages are built so that one layer can only communicate with the layers directly above and below. The model also stipulates that any particular layer can only communicate with the same layer on the far end. For example, Layer 3 (the network layer) of the OSI model can only pass physical messages to Layer 4 and Layer 2 of the same network stack, and it can only send and receive virtual messages from Layer 3 of the far end when communication is established. The advantage of using this model is that inter-working between manufacturers' network equipment is guaranteed if the network equipment uses the OSI model. In other words, if a piece of network equipment is designed to work on Layer 3, it can assume that all Layer 2 and Layer 4 equipment will work with it. This also means that the far-end node does not need to use the same manufacturer's equipment as the originator, since all equipment uses the same protocols.

The OSI model and a brief description of its layers are given in Table 2.1:

Layer 7 (Application): The application layer defines the particular applications used by the user. The application can be an e-mail client, FTP or Telnet. The syntax and data format on this layer is completely defined by the application.

Layer 6 (Presentation): The presentation layer provides the translation between independent data formats and a common network format. Security and encryption are also performed in this layer if necessary.

Layer 5 (Session): The session layer handles high-level sessions between the two ends. Each session has a start and an end. It is up to the session layer to manage all possible simultaneous transmissions on a given machine.

Layer 4 (Transport): The transport layer provides transparent data movement for the upper layers. It handles end-to-end data recovery and flow control, and basically creates an error-free transmission medium for the upper layers.

Layer 3 (Network): The network layer handles routing and forwarding of packets between nodes. It determines the logical paths that a packet must take to reach its destination.

Layer 2 (Data Link): The data link layer is divided into two sections, the Media Access Control (MAC) layer and the Logical Link Control (LLC) layer. The MAC allows the computer to access the physical medium and the LLC provides frame synchronization.

Layer 1 (Physical): The physical layer contains the transmission hardware that sends the bits over its medium. The medium itself is not important, as the particular Layer 1 hardware is designed to operate over the medium.

Table 2.1 The seven layers of the OSI model

2.3 Internet Routing

Routing on the Internet with IP is performed at the individual node level (see Figure 2.2). Once the packet leaves the source, it is passed along different routers toward its destination. Each router checks the destination address of the packet and sends it to the next router that moves the packet along the correct path. Each router has an internal routing table that determines the correct route of a packet. The internal routing table is updated periodically to reflect the current state of the network.

[Diagram: source and destination nodes with one possible route between them. The actual route could change dynamically depending on the network status.]

Figure 2.2 Network routing
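
As an illustration of a routing-table lookup, the sketch below performs a linear longest-prefix match over a small table. Real routers use far faster structures (tries or TCAMs); all names here are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    struct route {
        uint32_t prefix;    /* network prefix (host byte order) */
        uint32_t mask;      /* network mask */
        int      next_hop;  /* outgoing interface index */
    };

    /* Return the next hop for dst using longest-prefix match, or -1. */
    int route_lookup(const struct route *table, size_t n, uint32_t dst)
    {
        int best = -1;
        uint32_t best_mask = 0;

        for (size_t i = 0; i < n; i++) {
            /* The entry matches if dst falls inside its prefix; prefer
               the longest (numerically largest) contiguous mask seen. */
            if ((dst & table[i].mask) == table[i].prefix &&
                (best < 0 || table[i].mask > best_mask)) {
                best = table[i].next_hop;
                best_mask = table[i].mask;
            }
        }
        return best;
    }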

2.4 Address shortage & network address translation

A driving force behind Network Address Translation (NAT) is the problem of the Internet running out of IP addresses. This problem grew worse during the explosion in the number of new users in the late 1990s. The older protocol, IPv4, or simply the IP protocol, defines IP addresses as 4-byte values. This means that there can be a maximum of 2^32, or about 4.2 billion, unique addresses. Although this would appear to be enough addresses, because of administrative difficulties there is actually a shortage of addresses at many sites. (A newer protocol, IPv6, expands the IP address field to 16 bytes, but this technology has been slow to be adopted due to the high cost of replacement hardware.)

NAT provides a temporary solution to the address shortage problem by translating addresses from a private network to a public network. This allows private network IP addresses to be reused in many networks on the Internet. As shown in Figure 2.3, a company of one hundred computers no longer needs one hundred separate IP addresses. The company could use a few, or even just one, public IP address and, through NAT, allow all internal computers to access the Internet. Using this scheme, it is possible for an ISP to reduce the number of IP addresses required to support users. This reduces the assigned IP block ranges and alleviates the problem of running out of IP addresses.

[Diagram: computers on an internal network switch use internal IP addresses; network address translation equipment maps them to an external IP address toward the Internet and far-end services.]

Figure 2.3 Typical network topology in a small business environment

NAT does not work in all situations. NAT cannot be used if all computers on the network must be publicly accessible from the Internet. Also, since all traffic is funneled into a single IP address, it is necessary to remap the well-known ports of the internal computers to other ports of the NAT machine. This can cause problems for software that cannot dynamically allocate port use.

Despite its shortcomings, NAT has become an integral part of home and small-business networking, as it is now possible to share a single Internet connection among many computers in a home or office. In addition, it is much simpler to apply firewall protections to the internal network. All outside (and therefore suspect) traffic is funneled through a single machine before it is distributed to the internal network.
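
A minimal sketch of the outbound half of this translation is shown below. The table layout and names are assumptions for illustration, and the checksum fixup a real NAT must perform (RFC 1624) is omitted; this is not the implementation described later in this thesis.

    #include <stdint.h>

    /* One NAT table entry: an internal (address, port) pair is mapped
       to a port on the gateway's single public address. */
    struct nat_entry {
        uint32_t internal_ip;
        uint16_t internal_port;
        uint16_t external_port;  /* remapped port on the public address */
    };

    /* Rewrite the source of an outbound packet: relabel it with the
       public IP and the remapped port so replies can be matched later. */
    void nat_outbound(uint32_t *src_ip, uint16_t *src_port,
                      const struct nat_entry *e, uint32_t public_ip)
    {
        *src_ip   = public_ip;
        *src_port = e->external_port;
    }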

2.5 A day in the life of a packet

An explanation of how a packet travels across the Internet will provide a better understanding of how routing and NAT work. In this case, the example application is e-mail.

1. The user of the e-mail client software (at OSI layer 7) completes writing the e-mail and presses the send button. The e-mail client then takes the e-mail, packages it inside the SMTP protocol (at OSI layer 6) and sends it down to the lower OSI layers.

2. The e-mail is divided into packets for transmission; each packet is encapsulated by Layer 4's error control codes, then by Layer 3 routing information. The packet continues to move down the OSI layers until it is transmitted to the next node on the network.

3. The next node on the network is the gateway. The gateway acts as the NAT and is situated between the internal private network and the external Internet. The gateway recovers packets up to the Layer 3 level and reads the source and destination IP addresses. NAT then replaces the internal source address with the external IP address. Essentially, the packet is relabeled with the public IP address so it can be routed correctly by the Internet.

4. After NAT, the packet is sent out of the gateway to the ISP's router. The router then determines the best possible route to the packet's destination. Once that route is determined, the router passes the packet to the next router, which does the same thing. At some point, the packet arrives at the destination computer.

5. Once the packet arrives at the destination, it works its way up the OSI model, with each layer stripping away its own information. The layers perform checks on the packet to ensure it was received correctly. Layer 4 of the OSI model deals with packets that are received out of order and also checks for errors. A retransmission may be necessary.

6. Once the packets leave Layer 4, they are assumed to be in order and error-free. The application can then retrieve and operate on the data. The e-mail client would notify the user that new mail has arrived.

In practice, the transmission of a packet from source to destination is much more complex than has been described here. Details such as error control, congestion control and port mappings are only touched on because they are outside the scope of this thesis.

Chapter 3

Inherent problems with consumer Internet services

The Internet can be considered a relatively new technology. As with all new technologies, there are problems to overcome. Conceived in 1973, the Internet remained a tool used by universities and government agencies until the early 1990s. It was not until the Internet boom of the mid-1990s that the general public was able to log on to the Internet. Broadband access appeared later, with its initial rollout coming in 1997. With about a decade of experience, Internet service providers still routinely struggle with technical problems.

In comparison, the telephone was invented in 1876 and steadily gained momentum throughout the 20th century. Thus there have been more than 100 years of refinements to that technology. During this time, the telephone became the primary mode of communication for everyone, and its performance and reliability continued to increase. The reliability of telephone service is so great that outages are typically caused by natural disasters or an errant tree branch falling on telephone lines.

The difference between the development of the telephone system and the Internet is that the Internet continues to grow rapidly. It is poised to surpass the telephone as the primary choice for communication. Consumers are now demanding the level of performance and reliability enjoyed with telephone service. Unfortunately, ISPs and their infrastructures cannot handle the surging traffic and bandwidth demands. In addition, the telephone system has been built from the ground up to maximize reliability, and thus measures its downtime in minutes per year. In contrast, the Internet was designed to move data as quickly as possible and is based on a "best effort" delivery philosophy. As a result, outages due to equipment failure or other factors still plague consumers. The three major problems of Internet services are unreliability, poor service quality and service inflexibility; these problems are discussed in the following sections.

3.1 Unreliable broadband availability

Small businesses and consumers connect to the Internet through their local ISP. As shown in Figure 3.1, the ISP's function is to route the customer's Internet traffic through its system and onto the Internet.

[Diagram: the customer's PC connects through a cable or DSL service to the Internet Service Provider (ISP), which in turn connects to the Internet.]

Figure 3.1 ISP service to customers

The customer is completely dependent on the ISP for their Internet service. The user can lose service if the channel between the user and the ISP goes down, if the ISP crashes, or if the channel between the ISP and the Internet goes down.

Users can be disconnected from the Internet for several reasons. For example, the issue can be a setup problem with the ISP's equipment, or it can be a physical problem with the line. The ISP schedules maintenance periodically, or equipment may be upgraded; regardless, Internet service is temporarily disabled. Whatever the reason, Internet outages are common for broadband services.

3.2 Reduced service quality due to loading

ISPs generate revenue by forcing their customers to share bandwidth. In a sense, ISPs oversell available bandwidth. This business decision keeps prices low for customers. The strategy is based on the idea that only a small percentage of customers will send and receive traffic at any one time. Therefore, a user's average bandwidth requirement is a small percentage of his maximum bandwidth requirement. In effect, under ideal conditions, all users appear to have very fast, dedicated connections to the Internet.

The problem with overselling bandwidth occurs when users begin to rely heavily on the Internet. The bandwidth requirement per user increases. Therefore, it is possible for the average bandwidth requirement of an ISP to increase over time without the addition of new customers. This leads to the problem that the ISP must continuously increase its bandwidth to the Internet. Otherwise, the level of service to its customers will be reduced.

In addition, an issue with overselling is that the system might not have reserve bandwidth during peak hours, especially in the evening. This problem causes the most aggravation among users, since they cannot access the Internet when they want. It is especially frustrating since the ISP can claim more than enough bandwidth because hardly anyone uses the system during working hours; thus, the average usage for the day is well within specifications. This type of loading generally causes slowdowns where the upload and download rates are drastically reduced. At worst, the user cannot access the Internet due to high latency and dropped packets.

Lastly, the bandwidth requirement for a typical user is increasing at a dramatic rate. Peer-to-peer (P2P) networks became popular when Napster appeared on the scene. Since then, other P2P networks have created an enormous amount of traffic. Furthermore, users now routinely leave their computers on with P2P running in the background. This means the user is continuously sending traffic at maximum speed. In effect, the user is no longer participating in the bandwidth-sharing scheme but consuming all available bandwidth continuously. Clearly this will increase congestion.

Figures 3.2 and 3.3 illustrate how loading can affect service quality by providing a chart of the ping response time of various Internet sites from a standard Road Runner broadband connection. Each figure is a chart of the ping response to six different Internet hosts sampled every minute for a 24-hour period. The response time is measured and logged. The chart provides a good indication of the health of the channel over that 24-hour period. As seen in Figure 3.2, there is a significant disturbance in response time from around 10:00am to 1:30pm. During this time, the ping times for all the hosts varied by a significant amount. In Figure 3.3, the disturbances occurred at 2:30am to 2:45am and 9:45am to 10:30am. It is not possible to determine exactly what caused those disturbances. However, since the disturbance occurred on all the sampled hosts, it is possible to say that the problem lies with the ISP.

[Chart: ping response time in milliseconds, sampled over 24 hours, for hosts 24.25.227.33, 24.25.227.64, 128.171.3.13, 128.171.1.1, 169.229.131.109 and 18.181.0.31.]

Figure 3.2 24-hour ping response time (sample 1)

[Chart: ping response time in milliseconds over 24 hours for the same six hosts.]

Figure 3.3 24-hour ping response time (sample 2)

3.3 Inflexibility of service offering

Consumers and businesses have many choices for broadband service. Cable and ADSL are the most popular selections. However, it is possible to choose services such as satellite broadband or a community-sponsored wireless connection. Each service provides slightly different performance. For example, a satellite link may provide good bandwidth but poor response time, while a wireless connection may offer a good response time but poor bandwidth. In essence, any particular broadband service has inherent strengths and weaknesses. For the most part, when users select a broadband provider, they are locked into the strengths and weaknesses of the provider. This often makes the service offered poorly matched to the type of traffic generated by the user.

Chapter 4

Multihoming solution

The term "multihoming" or "multihomed" refers to a node that has more than one channel to the main network (see Figure 4.1). A multihomed node, therefore, has multiple paths from itself to the network. The multihoming concept is not new. Core routers and even edge routers use multihoming with the Border Gateway Protocol (BGP) to provide redundancy for their backbone connections. It is only recently that high-speed connections have come down in price such that consumers and small businesses can afford the technology.

[Diagram: a multihomed node with multiple channels to the network, beside a single-homed node with one channel.]

Figure 4.1 Multihomed vs. single-homed node

4.1 Multihoming advantages

A multihomed node has four main advantages over a single-homed node: channel redundancy, better response time, load balancing, and lower overall cost than a comparable single-homed solution.

4.1.1 - Channel redundancy

Channel redundancy is the ability of the multihomed gateway to detect failed channels and divert traffic to the active channel. This function creates a more robust user connection to the Internet. In addition, the channel fault detection is transparent to the user and is performed automatically by the multihomed gateway.

4.1.2 - Better response time

Multiple network paths mean that it is possible to match each packet to the available channels. Some channels are better at handling high-volume traffic, while others are better for low-latency traffic. Once the multihomed gateway classifies the user traffic, it can route that traffic to the channel that will provide the best performance. For example, a voice-over-IP call would be routed to the channel with the lowest latency to the destination.

A large FTP transfer would be routed to the channel offering the highest bandwidth.

4.1.3 - Load balancing

In the traditional multihomed solution used by core routers, the backup channel is unused unless the primary channel fails. The multihomed gateway provides load balancing across all the outbound links. This allows the most efficient use of bandwidth for the target audience.

4.1.4 - Reduced cost

The multihomed gateway increases reliability and bandwidth by adding channels. An additional channel increases the system's reliability because it acts as a backup to the other channels. Also, because the system is equipped with load balancing, the additional channel will be used to route traffic as soon as it is configured. Compared with a single, equally reliable ISP connection, multihoming reduces system cost.

4.2 Multihomed gateway overview

When a new session is requested by the internal network, the multihomed gateway determines which channel to the external network (the Internet) would best serve this session, translates the individual packets of the session into the external network's address, and then sends the packets out. In the other direction, when a session is established from the external network, the individual packets are translated back to the internal network address and sent to the internal network. Three components, shown in Figure 4.2, accomplish this task: the channel selection module, the Network Address Translation (NAT) engine, and the packet forwarder.

[Block diagram: the channel selection module (channel chooser, channel performance monitor and user traffic monitor), the NAT engine and the packet forwarder sit between the internal network and the external channels.]

Figure 4.2 Multihomed gateway block diagram

The channel selection module is the heart of the multihomed gateway. This module encapsulates the functions of the channel chooser, the channel performance monitor and the user traffic monitor. The channel performance monitor determines the performance profile, in terms of bandwidth capacity and packet latency, of each of the external channels. The user traffic monitor determines the bandwidth and latency requirements of the user based on TCP port numbers. The channel chooser uses the information from the channel performance monitor and the user traffic monitor to match the traffic profile with the channel characteristics.

The NAT engine serves as the bridge between the internal and external networks. All traffic in and out of the multihomed gateway must pass through NAT. NAT translates the internal network address to the external network address. This separation between the internal and external networks allows the computers inside the internal network to talk to only one IP address (the multihomed gateway's internal address) instead of the two external addresses the multihomed gateway supports.

The packet forwarder is the simplest of all the major components in the multihomed gateway. The packet forwarder takes the translated packets from the NAT engine and forwards them to the correct port. In reverse, the packet forwarder takes packets from the external network and forwards them to the NAT engine.

4.3 Data granularity

To efficiently balance the traffic load, it is necessary to make the data granularity as small as possible. Small granularity means the load balancer can react quickly to changing network conditions when routing packets. Accordingly, the multihomed gateway deals with sessions. When the user requests a new session, the multihomed gateway samples the current network conditions and routes the session to the best channel. Once a session is opened, all subsequent traffic is routed to the same channel. This step is necessary because applications cannot handle traffic of the same session having different source addresses.

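A minimal sketch of this session pinning is shown below: the first packet of a session triggers a channel choice, and the chosen channel is cached for the rest of the session. The fixed-size linear table and all names are illustrative assumptions, not the gateway's actual data structures.

    #include <stdint.h>

    #define MAX_SESSIONS 1024

    struct session {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        int      channel;   /* external channel assigned to this session */
        int      in_use;
    };

    static struct session sessions[MAX_SESSIONS];

    extern int choose_channel(uint16_t dst_port);  /* channel chooser, Section 4.4 */

    /* Return the channel for a packet, assigning one if its session is new. */
    int channel_for_packet(uint32_t sip, uint32_t dip,
                           uint16_t sport, uint16_t dport)
    {
        int free_slot = -1;

        for (int i = 0; i < MAX_SESSIONS; i++) {
            if (sessions[i].in_use) {
                if (sessions[i].src_ip == sip && sessions[i].dst_ip == dip &&
                    sessions[i].src_port == sport && sessions[i].dst_port == dport)
                    return sessions[i].channel;   /* existing session: keep its channel */
            } else if (free_slot < 0) {
                free_slot = i;
            }
        }
        if (free_slot < 0)
            return 0;                             /* table full: fall back to channel 0 */

        sessions[free_slot] = (struct session){ sip, dip, sport, dport,
                                                choose_channel(dport), 1 };
        return sessions[free_slot].channel;
    }
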
4.4 Selecting a channel

Choosing a channel for a new session is the most important function performed by the multihomed gateway. The traffic across the external channels should be as balanced as possible in order to maximize traffic performance for the internal network.

The channel selection function of the multihomed gateway attempts to provide the best possible service for the user based on the recent channel conditions and the user's traffic pattern. All external channels are continuously monitored by the multihomed gateway to gauge their performance in terms of bandwidth capacity and latency. The bandwidth capacity and latency information is then compared with the user's traffic pattern to determine which external channel should be used. The user's traffic pattern is also actively monitored by the multihomed gateway to determine if the traffic is more sensitive to latency or to bandwidth.

[Block diagram: a new session's TCP port number enters the channel chooser, which consults the channel performance monitor and the user traffic monitor to produce the outgoing channel number.]

Figure 4.3 Channel chooser block diagram

As shown in Figure 4.3, channel selection is divided into three interdependent modules: the Channel Performance Monitor, the User Traffic Monitor and the Channel Chooser. The channel performance monitor determines the maximum downstream(1) channel capacity as well as the latency information for each of the external channels. Downstream capacity is more relevant to the multihomed gateway because download traffic is expected to be much higher than upload traffic. The user traffic monitor collects the average packet length for each TCP port number. The average packet length is used to profile the traffic pattern of a TCP port. Finally, the channel chooser uses the information provided by the channel performance monitor and the user traffic monitor, as well as the new session's TCP port number, in order to determine the best channel for the new session.

(1) Downstream means from the external network to the internal network; upstream is the opposite direction.

4.4.1 - Channel performance monitor

The channel performance monitor collects data from the external channels. The data collected is used to determine the capacity (maximum data rate) of the channel and the round-trip latency of packets. However, both the maximum capacity and the latency are not static; they change depending on the load on the local ISP. Since the channel's performance cannot be predetermined, it is necessary for the channel performance monitor to sample the channel periodically to determine its current latency and capacity.

Channel latency is measured by the round-trip delay of the SYN packet during the TCP three-way handshake. The channel performance monitor recognizes the start of a new session by checking for the existence of the session in its active session list. When a new session is initiated, the channel performance monitor tags the first packet with the current timestamp value. When the reply packet is seen from the remote host, the channel performance monitor compares the current timestamp value with the one tagged earlier. The difference between the two times is the round-trip delay of the packet. The channel with the lower round-trip delay is marked as the channel with the better latency characteristic.
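
The sketch below shows the timestamp bookkeeping for one outstanding handshake. A real implementation would key the probes off the active session list, and the timestamp source is an assumed platform function.

    #include <stdint.h>

    extern uint64_t now_usec(void);   /* assumed platform timestamp source */

    struct syn_probe {
        uint32_t dst_ip;
        uint16_t dst_port;
        uint64_t sent_at;    /* when the outbound SYN was seen */
        int      pending;
    };

    /* Call when the first (SYN) packet of a new session goes out. */
    void tag_syn(struct syn_probe *p, uint32_t dip, uint16_t dport)
    {
        p->dst_ip   = dip;
        p->dst_port = dport;
        p->sent_at  = now_usec();
        p->pending  = 1;
    }

    /* Call when the matching reply (SYN-ACK) arrives from the remote
       host; the difference is the round-trip delay on this channel. */
    uint64_t syn_rtt(struct syn_probe *p)
    {
        p->pending = 0;
        return now_usec() - p->sent_at;
    }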

The current downstream channel capacity can be reasonably deduced by looking at the recent bandwidth history of the channel. The multihomed gateway samples the bit rate of the channel each second. From those samples, it can determine the maximum bit rate of the channel by applying a ceiling function to the collected data. The idea is that if the channel is measured as having a specific current data rate, then the channel itself must have at least that much capacity; otherwise it would not have been able to receive that much traffic.

Unfortunately, it is not possible to use the ceiling function alone because this would only give the maximum data rate received since the unit has been powered on. As stated before, the channel capacity is not static. The capacity can be reduced or increased depending on the load. Therefore, the multihomed gateway cannot keep a static ceiling value without taking into account that the capacity can change.

Figure 4.4 shows a plot of downstream traffic bandwidth and the estimate of channel capacity using a ceiling function with the addition of an exponential decay function. This decay function reduces the apparent channel capacity seen by the multihomed gateway as time passes. As seen in Figure 4.4, the traffic bandwidth spiked high in the beginning, and then dropped to a lower level in the middle. It is impossible for the multihomed gateway to determine whether the bandwidth reduction is due to diminished channel capacity or simply less traffic. After all, the multihomed gateway cannot simply assume that the channel capacity was reduced and therefore immediately adjust the apparent channel capacity. Nor can it keep the apparent channel capacity unchanged, because loading could indeed have caused the reduction in traffic. The decay function allows the multihomed gateway to reduce the apparent channel capacity over time so that it can smooth out the bursty nature of the traffic pattern.

[Plot: actual traffic bandwidth versus time, with the apparent capacity estimate seen by the multihomed device riding on the peaks of the traffic and decaying exponentially between them.]

Figure 4.4 Maximum data rate measurement

The scheme of using a ceiling function with the decay function produces good results if there is a lot of traffic between the internal and external networks. The more traffic there is, the less likely the traffic pattern will have large peaks and valleys, since it will tend to be smoothed out. However, this method of determining the channel capacity is not without its flaws. The decay function causes the system to react slowly to the changing condition of the channel. For example, suppose the first drop of traffic bandwidth in Figure 4.4 was actually due to channel loading. Then the multihomed gateway will mistakenly think that there is plenty of available bandwidth on this channel. If the internal network initiates a new session at this time, the session can potentially be placed onto this channel. This re-adjustment period will persist until the decay function reduces the apparent maximum capacity of the channel to align with the actual capacity. While this scenario can happen, it is not as large a pitfall as it might seem. First, the decay function can be adjusted to a larger decay value. The larger decay value will cause the apparent maximum capacity of the channel to decay faster, thus reducing the window of exposure that the user may face. Second, the channel is slow, but not unusable. If a session was placed onto the wrong channel, the user will simply experience some slowdown while the system re-adjusts itself. Third, this scenario can only happen if the channel experiences a significant drop in capacity. According to the experiment shown in Chapter 3, this happens rarely, roughly once per 24-hour period.

The choice of the decay value is based on the need of the multihomed gateway to cope with the changing channel capacity. A small decay value makes the multihomed gateway slow to react to changes in capacity, but provides a better channel choice during times when both channels are functioning properly. On the other hand, a large decay value allows the multihomed gateway to track the data rate much faster, but causes the multihomed gateway to fluctuate between the two channels more often than necessary. The experiment in Chapter 3 shows that the channel capacity is, for the most part, very steady. Therefore, it makes sense to use a small decay value, because it is likely that the perceived channel capacity taken from its history is a good measurement of the current channel capacity. The multihomed gateway is implemented with a decay rate of 1 Kbyte per second of decay per second.

It is difficult to estimate channel capacity. Ideally, the multihomed gateway should determine the immediate current channel capacity as a new session is initiated. Based on that, the multihomed gateway would be able to place the new session on the best channel. However, this is not possible even in theory, because there is no way of "asking" the channel for its current capacity. Given the circumstances, the use of the ceiling function with exponential decay provides a good estimate of channel capacity.
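The estimator can be summarized in a few lines of C. The sketch below is illustrative, not the thesis microcode: the structure and function names are invented, and it assumes the fixed decay rate of 1 Kbyte per second per second chosen above (in the implementation the update happens in the per-second statistics update of Section 5.3.4).

    #include <stdint.h>

    #define DECAY_BYTES_PER_SEC 1024   /* 1 Kbyte/s of decay per second, as chosen above */

    struct channel_state {             /* hypothetical per-channel record */
        uint32_t capacity_Bps;         /* apparent capacity in bytes per second */
    };

    /* Called once per second with the number of bytes observed on the
     * channel during that second.  The ceiling raises the estimate to
     * any measured rate; the decay lowers it over time so a stale peak
     * does not persist after conditions change. */
    static void update_capacity(struct channel_state *ch, uint32_t bytes_this_sec)
    {
        if (bytes_this_sec > ch->capacity_Bps)
            ch->capacity_Bps = bytes_this_sec;        /* ceiling function */
        else if (ch->capacity_Bps > DECAY_BYTES_PER_SEC)
            ch->capacity_Bps -= DECAY_BYTES_PER_SEC;  /* decay toward reality */
        else
            ch->capacity_Bps = 0;
    }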

4.4.2 - User traffic monitor

The user traffic monitor can tell which application sent a packet based on the packet's port number. In addition, application writers tend to use small packets when the application needs very low latency, because a small packet can be sent as soon as it is ready rather than waiting for a large buffer to fill. On the other hand, applications that send large amounts of data will try to make packets large so that the transmission overhead remains a small percentage of the packet size. Using that information, the user traffic monitor keeps track of each TCP port's average packet size. When a particular TCP port has a large average packet size, it is likely that port is devoted to high traffic applications like FTP. In contrast, if a TCP port has a small average packet size, then it is likely that TCP port is used for an application such as voice over IP. In both cases, the traffic can now be filtered as a function of port number.

The user traffic monitor keeps a log of the average packet length per TCP port; the average packet length is a good indication of the type of traffic being sent. The idea behind this is that traffic is distributed based on its port numbers; that is to say, each application has its own distinct port. There are well-known ports, such as Telnet at 23 and FTP at 21. The system would be lacking, however, if the user had to manually enter what traffic type goes on which port. It would also be impractical for the multihomed gateway to use a fixed table of applications' well-known ports, because such a table cannot adapt to new software. To allow for dynamic traffic classification, the multihomed gateway instead employs the user traffic monitor, which takes advantage of the fact that applications use distinct port numbers.
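As a rough illustration, the per-port log can be kept as a direct-indexed table of average packet lengths, updated with a weighted average so that history dominates any single packet. This C sketch uses invented names and an assumed 7/8-to-1/8 weighting; the thesis does not specify the exact weight.

    #include <stdint.h>

    static uint16_t avg_pkt_len[65536];   /* average packet length, indexed by TCP port */

    /* Record one observed packet for a port.  Weighting old samples 7/8
     * against 1/8 for the new sample keeps the average stable against
     * occasional small packets (see Section 5.3.4). */
    static void utm_record(uint16_t port, uint16_t pkt_len)
    {
        if (avg_pkt_len[port] == 0)
            avg_pkt_len[port] = pkt_len;   /* first packet seen on this port */
        else
            avg_pkt_len[port] = (uint16_t)((7u * avg_pkt_len[port] + pkt_len) / 8u);
    }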

4.4.3 - Channel chooser

The channel chooser is the decision maker of the channel selection module. It uses the information presented to it by the channel performance monitor and the user traffic monitor to determine which external channel is best for the new session. The channel chooser first attempts to classify the new session itself by examining the packet's service type field in the IP datagram header. The service type field is an 8-bit field sometimes called Type of Service (TOS); it specifies what type of data the packet is likely to carry. The structure of this field is shown in Figure 4.5.

bit:    0  1  2     3   4   5    6  7
     | precedence | D | T | R | unused |

Figure 4.5 Type of Service field in the IP header

The precedence field is a three-bit code that specifies the importance of the data. Normal traffic will have a low value, while very important traffic, such as routing information, will have a value of 6 or 7. However, some routers ignore this field to avoid misuse of the system. The D bit denotes that the packet would like low latency, the T bit denotes that the packet needs high throughput, and the R bit means the packet needs high reliability. The channel chooser looks at the D and T bits to determine what type of traffic this session will be.

Not all systems use the Type of Service field. Most simply zero out the entire 8-bit field, because ISPs usually set up their routers to ignore it for fear of abuse (one can only imagine the havoc that would result if someone "accidentally" tagged all their packets with the high priority and high reliability bits all the time). Therefore, in the absence of Type of Service information, the channel chooser will usually rely on the traffic profiles provided by the user traffic monitor. Once the traffic is classified as either latency-sensitive or bandwidth-sensitive, the channel chooser matches the new session's bandwidth and latency requirements with the external connection profiles provided by the channel performance monitor.
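Read as bits, the check the channel chooser performs reduces to two mask tests. The following C sketch is illustrative (the real logic is microengine assembly); the masks follow the bit layout of Figure 4.5, and all names are invented:

    #include <stdint.h>

    #define TOS_D 0x10   /* bit 3: minimize delay */
    #define TOS_T 0x08   /* bit 4: maximize throughput */

    enum traffic_class { TC_UNKNOWN, TC_LATENCY, TC_BANDWIDTH };

    /* Classify a new session from the IP header's Type of Service byte. */
    static enum traffic_class classify_tos(uint8_t tos)
    {
        int d = (tos & TOS_D) != 0;
        int t = (tos & TOS_T) != 0;

        if (d && !t)
            return TC_LATENCY;     /* packet asks for low latency */
        if (t && !d)
            return TC_BANDWIDTH;   /* packet asks for high throughput */
        return TC_UNKNOWN;         /* field unused or ambiguous: use traffic profiles */
    }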

4.5 Network address translation (NAT) engine

A basic operation which a multihoming device must perform is Network Address Translation (NAT). NAT allows the multihomed gateway to multiplex traffic from an internal network and present it to the Internet as if the traffic were coming from a single external IP address. A multihoming device expands on this concept by intelligently translating an internal IP address to one of two different external IP addresses, and vice versa.

Figure 4.6 shows the typical layout of the system. Non-routable IP addresses are designated as private by the Internet Assigned Numbers Authority (IANA) and reserved for internal use. All routers in service honor these designations and will not route these addresses. There are three groups of private addresses:

10.0.0.0 - 10.255.255.255

172.16.0.0 - 172.31.255.255

192.168.0.0 - 192.168.255.255

Figure 4.6 Network configuration with the multihomed gateway (ISP 1 at 66.45.21.155 and ISP 2 at 128.171.60.245 connect to the MH Gateway, internal IP 192.168.1.1, which serves PCs 192.168.1.2 through 192.168.1.5 over a 10/100 Ethernet switch)

The multihomed gateway has three Ethernet connections: one connects to the internal network and two connect to two different ISPs. The external IP addresses are provided by the ISPs. In the diagram, ISP 1 assigned the IP address 66.45.21.155 and ISP 2 assigned the IP address 128.171.60.245. These addresses are external, real Internet addresses and thus fully routable. The internal IP address is assigned by the system administrator and can be any non-routable, private IP address.

The multihomed gateway can use any one of these network ranges as the internal address range. In this example, the range chosen is 192.168.1.xxx. The multihomed gateway's internal IP address should reflect the fact that it will act as the local Internet gateway: all internal traffic destined for the external network must pass through this device. By convention, we use the host number .1 to denote the gateway device, and all other network devices are then given other host numbers. This numbering scheme is merely conventional; there is no technical reason why the gateway device needs the host number .1. Any number would work, as long as it is within the same subnet as the rest of the internal network.

It may seem that if the internal network has only one device, the multihomed gateway would not need to perform NAT functions. However, upon closer inspection, this is not the case, because the internal PC can be reached through two possible external IP addresses. If the multihomed gateway relied on simple IP forwarding without address translation, the PC would need to understand that it could select between two IP addresses when building its packets. Such a capability does not exist on a PC and would have to be written. Also, software that does not understand this scheme would not work, as it expects to have only one IP address. With NAT, the handling of the two separate external addresses is dealt with in the multihomed gateway itself. All PCs in the internal network treat the multihomed gateway as the Internet gateway and pass Internet traffic to its IP address.

Four fields in the packet header, as shown in Table 4.1, uniquely identify all TCP and UDP packets: source address, source port, destination address and destination port. NAT uses this arrangement to correctly translate packets from one form to another; without it, NAT in general cannot work. The source address and port indicate to the receiver the address of the sender and the sender's originating port. The receiver uses the destination address and port to identify whether the packet is meant for itself.

Source address | Source port | Destination address | Destination port

Table 4.1 Packet identification used by NAT

The NAT function is generally divided into two parts: Source Network Address Translation (SNAT) and Destination Network Address Translation (DNAT). In SNAT, the TCP session is initiated by the internal network; in DNAT, it is initiated by the external network. In SNAT, the source address of the TCP/IP header is changed to match the external IP address. With DNAT, the destination address is changed to match the internal PC's address. The multihomed gateway must at least provide SNAT functions for the internal PCs to access the external network. If there is an internal service that needs to be accessed by the external network, then the multihomed gateway must also provide DNAT functions.

If the internal network has server functions that the external network needs to access, then DNAT must be implemented. DNAT is used when an external PC needs to access server resources located on a computer in the internal network. Figure 4.7 shows such a case.

Figure 4.7 Multihomed gateway connection to the Internet (EXT PC 1 at 204.193.10.35 and EXT PC 2 at 66.135.192.19 reach the MH Gateway through ISP 1 at 66.45.21.155; internally the gateway, at 192.168.1.1, serves PCs 192.168.1.2 through 192.168.1.5 over a 10/100 Ethernet switch, with an FTP service at local port 21 and a web service at local port 80)

In Figure 4.7, a web server is running on 192.168.1.4 on port 80 and an FTP service is running on 192.168.1.2 on port 21. If EXT PC 1 (204.193.10.35) wants to access the internal web service, it cannot directly connect to 192.168.1.4, since that address does not exist on the Internet. Therefore, EXT PC 1 must go through the multihomed gateway, which translates the address; with the new address, it can access the internal web server. In this case, EXT PC 1 decides to access the internal network using ISP 2's address. It does not matter which ISP's address the external PC uses, as long as it is kept the same throughout the entire session. The multihomed gateway, running the DNAT service, needs a specific port number defined as the port used to divert traffic to the web server at 192.168.1.4. The port number chosen is arbitrary and can be any value from 1 to 65535. It is important to note that the same port number cannot be reused for another service on the internal network; in other words, the FTP service and the web service running on the internal network cannot have the same diverting port number on the multihomed gateway. In our simple case, the multihomed gateway's port numbers can be set to match the well-known port numbers of the services provided.

When EXT PC 1 wants to access the internal web server, it establishes a regular TCP/IP connection to one of the multihomed gateway's IP addresses with the port number 80. The multihomed gateway will then replace the destination address of the packet with the address of the internal web server, 192.168.1.4, and forward the packet to the internal network.

DNAT does not have to match the well-known ports to each other. In the previous example, it is not necessary to forward port 80 of the multihomed gateway to port 80 on the web server. This is especially true if there are multiple web servers on the internal network.

Figure 4.8 DNAT multihomed gateway configuration (the same topology as Figure 4.7, but with web services at local port 80 on three of the internal PCs)

In Figure 4.8, three web servers running on the internal network all use local port 80 as their web service port. This would cause a problem for external users, as there is only one port 80 on the multihomed gateway: the gateway cannot know which machine it must forward web traffic to. In this case, the multihomed gateway is set up to forward traffic on a different port for each server, with each port forwarding traffic to a specific machine. For example, DNAT could be set up such that port 10000 is mapped to 192.168.1.2, port 10001 to 192.168.1.3 and port 10002 to 192.168.1.4. When an external PC wants to access a specific web server, it simply connects to the multihomed gateway using the corresponding port. In this case, DNAT replaces both the destination address and the destination port of the TCP/IP packet. Port translation is necessary because, from each web server's point of view, all web traffic needs to arrive on its local port 80.
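A DNAT rule table of this kind amounts to a list of (gateway port, internal address, internal port) triples. The following hedged C sketch encodes the example mapping above; the structure and names are illustrative, not taken from the thesis source code.

    #include <stddef.h>
    #include <stdint.h>

    struct dnat_rule {
        uint16_t gw_port;        /* port on the multihomed gateway */
        uint32_t internal_ip;    /* internal server address */
        uint16_t internal_port;  /* local service port on that server */
    };

    static const struct dnat_rule dnat_rules[] = {
        { 10000, 0xC0A80102, 80 },  /* -> 192.168.1.2:80 */
        { 10001, 0xC0A80103, 80 },  /* -> 192.168.1.3:80 */
        { 10002, 0xC0A80104, 80 },  /* -> 192.168.1.4:80 */
    };

    /* Match an inbound packet's destination port against the rules. */
    static const struct dnat_rule *dnat_lookup(uint16_t dst_port)
    {
        for (size_t i = 0; i < sizeof dnat_rules / sizeof dnat_rules[0]; i++)
            if (dnat_rules[i].gw_port == dst_port)
                return &dnat_rules[i];
        return NULL;  /* no rule: not a forwarded internal service */
    }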

DNAT performs this translation by following a set of rules, as shown in Figure 4.9. When a packet comes in that matches the rules listed, DNAT performs the translation and forwards the packet. The following diagram shows the flow of a packet from EXT PC 1 as it tries to establish a web connection to internal PC 192.168.1.4. Remember that port 10002 was set as the port number that forwards traffic to 192.168.1.4.

Figure 4.9 DNAT flow chart. The original figure traces the packet in both directions; its content is reproduced here as steps.

Inbound, EXT PC 1 to the web server:
1. EXT PC 1 sends a packet to the multihomed gateway using ISP 2's address; the header reads source 204.193.10.35:80, destination 128.171.60.245:10002.
2. The multihomed gateway replaces the destination address with the web server's internal IP address and changes the port number to match the web server's local port, giving destination 192.168.1.4:80.
3. The multihomed gateway forwards the packet to the internal network, where it is received by the web server.

Outbound, web server back to EXT PC 1:
4. When the web server replies, the source address is the web server itself (192.168.1.4:80) and the destination address is the external PC (204.193.10.35:80).
5. The multihomed gateway replaces the source address with its own external IP address and replaces the port number to match its existing rules, giving source 128.171.60.245:10002.
6. The multihomed gateway forwards the packet to the external network using ISP 2.

SNAT is almost the reverse of DNAT, because it manipulates packets going from the internal to the external network (see Figure 4.10). While DNAT translates the destination address, SNAT translates the source address. This allows devices on the internal network to access the external network through the multihomed gateway. SNAT also has the advantage of multiplexing internal traffic onto a single external network connection, so that many PCs share one external connection. The multihomed gateway expands this function by intelligently choosing one of two external network connections for the translation, depending on the current link conditions. Another difference between SNAT and DNAT is that SNAT requires the device to keep a translation table. This table allows SNAT to match incoming traffic so that it knows to which internal PC to forward the response.

Figure 4.10 SNAT multihomed gateway configuration (a remote service at 128.171.60.1 offers telnet at local port 23; the MH Gateway, internal IP 192.168.1.1, reaches it through ISP 1 at 66.45.21.155 and serves PC 1 through PC 4 at 192.168.1.2 through 192.168.1.5 over a 10/100 Ethernet switch)

Figure 4.10 shows a remote service at 128.171.60.1. The multihomed gateway is running an SNAT service that will translate all outgoing Internet traffic to ISP 2's address. Suppose PC 1 wants to open a telnet session to the remote service. Figure 4.11 shows the flow of traffic.

Figure 4.11 SNAT flow chart. The original figure traces the telnet session in both directions; its content is reproduced here as steps.

Outbound, PC 1 to the remote service:
1. PC 1 sends a packet to 128.171.60.1 on port 23, requesting a telnet session (source 192.168.1.2:23, destination 128.171.60.1:23).
2. The source address of the packet is replaced with ISP 2's address, and a new source port is assigned in place of the original port number, giving source 128.171.60.245:1000. A new entry is created in the NAT table showing this translation:

   Port number | Original source address | Original source port
   1000        | 192.168.1.2             | 23

3. The multihomed gateway forwards the packet to the external network, where it is correctly routed to the remote service.

Inbound, remote service back to PC 1:
4. The return packet arrives with source 128.171.60.1:23 and destination 128.171.60.245:1000, the address and port shown in the table above.
5. The destination port number is checked against the NAT table to find out to whom this packet belongs. The gateway then replaces the destination address and port with the proper values, giving destination 192.168.1.2:23.
6. The multihomed gateway forwards the packet to the internal network.

The NAT table entry is session based, meaning a new entry is required whenever a new session begins; traffic from the same session uses the same table entry. Once started, every entry has a timeout counter. The timeout counter kills entries that have seen no traffic for a set period of time. This is needed because sessions occasionally fail to terminate correctly; without a timeout, the table would quickly fill with dead entries.

The way the table is set up allows for concurrent accesses from multiple internal clients. For example, suppose both internal PC 1 and PC 2 want to use telnet to connect to the remote service. SNAT will assign a different source port number to each PC for each separate session, and the difference in port number determines the destination of the individual arriving packets. In other words, the SNAT table might assign source port 1000 to PC 1's telnet session and source port 1001 to PC 2's telnet session. It is then possible to correctly route the arriving packets to their destination machines.
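Concretely, a SNAT table entry of the kind described above needs only the assigned external port, the original internal address and port, and a timestamp for the timeout. This is an illustrative C sketch with invented names; the idle timeout value is an assumption, as the thesis does not state one.

    #include <stdint.h>
    #include <time.h>

    struct snat_entry {
        uint16_t ext_port;    /* source port assigned on the external address */
        uint32_t int_ip;      /* original internal source address */
        uint16_t int_port;    /* original internal source port */
        time_t   last_seen;   /* refreshed on every packet of the session */
    };

    #define SNAT_IDLE_TIMEOUT 300.0   /* assumed idle timeout, in seconds */

    /* Kill entries whose sessions have gone quiet so that sessions which
     * failed to terminate correctly do not fill the table. */
    static void snat_expire(struct snat_entry *tab, int n, time_t now)
    {
        for (int i = 0; i < n; i++)
            if (tab[i].ext_port != 0 &&
                difftime(now, tab[i].last_seen) > SNAT_IDLE_TIMEOUT)
                tab[i].ext_port = 0;   /* mark entry as free */
    }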

4.6 Packet forwarder

Once all address translation is completed and the external network determined, the packets travel to the packet forwarder. The packet forwarder queues all outgoing packets and sends them to the external interface as it becomes available. No processing is performed by the packet forwarder. However, it is still a good idea to separate the packet forwarder from the rest of the system, because the packet forwarder is what interacts with the external network, and the external network can take on many different forms in terms of connection types and protocols. With this separation, the rest of the system can be reused even if the external connection type changes; only the relatively simple packet forwarder module needs to be altered.
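The forwarder's queue can be as simple as a fixed-size ring buffer of packet handles, one per external interface. The sketch below is a minimal illustration with invented names and sizes, not the thesis implementation.

    #include <stdint.h>

    #define QLEN 256   /* illustrative queue depth */

    struct pkt_queue {
        uint32_t handle[QLEN];
        unsigned head, tail;   /* dequeue at head, enqueue at tail */
    };

    /* Enqueue an outgoing packet handle; returns -1 if the queue is full. */
    static int pq_push(struct pkt_queue *q, uint32_t h)
    {
        unsigned next = (q->tail + 1) % QLEN;
        if (next == q->head)
            return -1;           /* full: caller must drop or retry */
        q->handle[q->tail] = h;
        q->tail = next;
        return 0;
    }

    /* Dequeue the next handle when the interface is ready to transmit. */
    static int pq_pop(struct pkt_queue *q, uint32_t *h)
    {
        if (q->head == q->tail)
            return -1;           /* empty */
        *h = q->handle[q->head];
        q->head = (q->head + 1) % QLEN;
        return 0;
    }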

Chapter 5

Implementation of multihoming using a network processor

Network processors are programmable devices that operate similarly to a general-purpose processor, but with hardware optimized for network packet processing.

The network processor is used primarily in embedded network equipment such as switches and routers. Current generations of network processors are also starting to appear in web cache engines, IPSec accelerators and load balancers.

Several companies make network processors. Although the processors differ in details such as processor and instruction type, they share the same basic packet processing structure: receive, process, transmit. This is similar to general-purpose processors, which load, execute and store. Another fundamental similarity among network processors is the separation of the control and data planes, otherwise known as the slow path and the fast path. This separation allows the network processor to obtain as much speed as possible while processing data packets, while at the same time allowing a control packet to be sent to a separate processor for slower processing.

5.1 Packet processing with the Intel 2800 network processor

The Intel 2800 is a second-generation network processor. It was chosen for this project primarily due to the availability of a flexible, full-featured evaluation platform. This "Deer Island" evaluation platform has two IXP2800 network processors. The platform also has an external media card that supports ten 1-gigabit fiber/copper connections and is rated to handle traffic as high as 20 gigabits per second.

5.1.1 - Structure of the Intel 2800 network processor

The 2800 network processor separates the data and control planes with two types of processors inside the chip: microengines for processing data and an embedded Xscale processor for handling control packets. Peripherals such as memory controllers and input/output buffers complete the system.

Figure 5.1 Intel 2800 network processor block diagram (the diagram shows the two microengine clusters and the Intel Xscale core connected over a common internal bus with SRAM controllers 0-3, DRAM controllers 0-2 for Rambus DRAM, the media switch fabric interface to external media devices over SPI4/CSIX, the PCI controller with its 64-bit, 33/66 MHz bus to an optional host CPU and PCI bus devices, and the SHaC unit comprising scratchpad, hash unit and CAP)

As shown in Figure 5.1, the 2800 network processor has eight main hardware components linked by a high-speed common bus. The components are as follows:

1. The microengine clusters - There are two microengine clusters, each of which holds eight microengines, for a total of sixteen. The microengines handle all data plane traffic and are designed to process network packets very quickly. All microengines run in parallel, with special registers for inter-microengine communication. The sixteen microengines are identical, so it makes no difference which particular one is assigned to a task.

2. The Intel Xscale core - The Xscale core is used to set up and load the microengines and to handle control plane traffic once the system is running. The Xscale core is a fully compliant ARM v5TE architecture as defined by ARM Limited [1]. The core contains everything one expects from an ARM processor, such as Icache, Dcache, MMU, ALU, buffers and debugging tools. The Xscale operates using either WindRiver's VxWorks or Linux; Intel supports both operating systems, and there is no functional difference between them.

3. The SRAM controllers - Four separate SRAM controllers each handle a single bank of Quad Data Rate (QDR) SRAMs. The reason for four separate controllers is to minimize memory access delays to the microengines. Since all microengines operate in parallel, it is very likely that multiple microengines will access SRAM resources at the same time. With four separate controllers, the SRAM can service four separate microengines simultaneously.

4. The DRAM controllers - The three DRAM controllers use Rambus Double Data Rate (DDR) DRAMs. The rationale for having multiple DRAM controllers is the same as for the SRAM controllers.

5. The Media Switch Fabric (MSF) interface - This is the main I/O port for network traffic. The MSF connects the external network to the network processor through the external media device, which can range from a frame relay interface to Ethernet to POS. All external devices connected to the MSF must properly translate their input to either SPI4 [2] or CSIX [3] bus cycles for the MSF.

6. The PCI interface - A standard PCI interface connects peripherals such as an external host controller to the network processor. This interface conforms to the PCI v2 standard.

7. SHaC - SHaC is an acronym for Scratchpad, Hashing and CAP, a collection of useful functions, none large enough to warrant its own unit. The scratchpad is a 16-Kbyte local cache memory that microengines use for fast memory access. The hashing unit is an optimized hash generator for 48-, 64- and 128-bit hash keys. The CAP unit is an interface to system-wide registers such as the system timestamp.

8. Crypto unit - The crypto unit performs real-time encryption and decryption. It supports the Advanced Encryption Standard (AES) and Triple Data Encryption Standard (3DES) symmetric-key ciphers commonly used by Virtual Private Networks (VPNs).

9. High-speed common bus - The high-speed common bus that connects the components is largely invisible to the programmer. The bus is 64 bits wide and operates at the same frequency as the microengines, up to 1.4 GHz. The common bus also has a special circuit that allows the bus to be split if the shared accesses are both 32 bits wide. In other words, the system can use the entire 64-bit-wide bus, or it can allow two 32-bit-wide bus accesses to occur simultaneously, depending on system demand.

[1] ARM v5TE is a RISC-based processor architecture and instruction set designed by Advanced RISC Machines (ARM) Limited.
[2] System Packet Interface Level 4 (SPI4) is an industry-standard interface for packet and cell transfer between a physical layer device and a link layer device.
[3] The Common Switch Interface (CSIX) defines an interface between a traffic manager and a switch fabric for ATM, IP, MPLS, Ethernet or similar data transmission.

5.1.2 - Setup and boot of the Intel 2800 network processor

The Intel 2800's most important components are the Xscale processor and the embedded operating system. All other functions of the network processor are under the direct control of the operating system, including the microengines. It is more intuitive for the 2800 user to view the system from a top-down perspective instead of a bottom-up perspective. This might sound backwards at first, since the microengines are doing all the work. However, one can view the microengines as specialized processors that offload packet processing tasks from the Xscale processor, much as DMA engines offload the task of moving data from the main processor to other locations.

The operation of the 2800 network processor follows a series of steps that must occur in strict sequence. For the most part, the design process for the 2800 follows these steps closely, as the embedded operating system must be working before it can program the microengines, and the bootstrap code must be working before the operating system can load. The steps to set up the 2800 network processor are listed below:

1. Bootstrap - The bootstrap process is the first procedure the system attempts after power is applied. The visionWARE bootstrap code, similar to the BIOS in a PC, is supplied by WindRiver. The bootstrap code initializes the Xscale and other peripherals such as the Ethernet connection. It is a very specialized setup code that works only with the Intel 2800 network processor; future revisions of the 2800 will require a different bootstrap code. The bootstrap code physically resides in the on-board flash and can be updated with a flash utility, although changing the bootstrap code can be risky. If a bug is introduced into the code, or if the flash is corrupted due to a bad update, then system recovery could be difficult; at the very least, an external flash programmer would be required. Therefore, the user is advised not to change the bootstrap code unless it is absolutely necessary.

2. Loading the kernel - After the bootstrap is loaded, it attempts to load the embedded operating system, Linux. The Linux kernel is a compressed image created on the development system using a cross-compiler. The kernel image can reside in any location to which the bootstrap has access. During the development cycle, it is advantageous to store the kernel image on the development system; the bootstrap then uses TFTP to retrieve the image from the development system and place it into RAM. After development is complete, the final kernel image is burned onto the flash, from which it can be loaded.

3. Loading the filesystem - The kernel image doesn't have an embedded filesystem, although it is possible to build a kernel with the initrd option. That approach makes development more difficult, since any change to the filesystem requires a change to the compressed kernel image. For this project, the filesystem exists on the development system, where it is remotely mounted via NFS. Once development is complete, the final filesystem can be compressed and burned onto the flash memory.

4. Setting up the resource manager - Once Linux is booted, it uses the resource manager to initialize the microengines and the hardware. The resource manager is a kernel module provided by Intel that allows the Linux kernel to access the internal memory and registers of the 2800 network processor. The source code of the resource manager is available and can be changed as needed. Through the resource manager, the kernel resets all the microengines and their associated logic, such as the MSF.

5. Setting up the workbench server - After the resource manager is loaded and the rest of the 2800 is placed in the ready state, Linux starts the workbench server to accept incoming microcode from the development system. This step is only necessary during the development cycle, during which the microcode is expected to change constantly. After development, the microcode will most likely reside in the flash and therefore will not need the workbench server for access.

6. Starting the microengines - In order to start the microengines, the microcode must be loaded into each microengine's instruction store. This is done either via the workbench server or a setup script in Linux. When all the microcode is loaded, the system is taken out of reset and into the run state. The microengines then start fetching and executing instructions from their local instruction stores.

7. Running the system - At this point, the 2800 network processor is considered "running" and ready for packet processing. The Xscale processor is now reduced to handling exception packets that the microengines flag. In most designs, the exception packets are control plane packets used to set up routing information. However, this is not a limitation of the system, just a design philosophy; any packet can be passed to the Xscale for further processing as needed. The Xscale processor can access these exception packets through the resource manager.

5.1.3 - Flow of a typical packet

The Intel 2800 network processor is, fundamentally, a packet processor. Its basic function of input, process, output works at the packet level, and each packet received by the network processor must follow these steps. Figure 5.2 shows the flow of a single packet through the network processor using all of the 2800's various components. Each step is numbered and subsequently explained.

Figure 5.2 Packet flow through the network processor (the diagram shows the numbered steps below moving a packet between the external media devices, the MSF, the DRAM and SRAM controllers, the scratchpad in the SHaC, the microengine clusters and the Intel Xscale core)

1. A packet is received from the external media device.
2. The MSF (Media Switch Fabric) is notified of the arrival of a packet, and the packet is moved from the external media device to the MSF's local buffer.
3. The receive microengine is notified of the new packet by the MSF, along with important packet information such as packet length and type.
4. The receive microengine allocates the necessary space in DRAM and commands the MSF to move the packet to the new address.
5. The receive microengine then puts the address of the new packet (the packet handle) into the scratchpad memory and notifies the next available microengine.
6. The next available microengine assumes ownership of the packet and retrieves the packet handle from the scratchpad memory.
7, 8, 9. The microengine processes the packet, possibly using a combination of DRAM and SRAM for the task. Once complete, the new packet is placed back into DRAM and the transmit microengine is notified.
10. The transmit microengine receives the packet handle and locates the packet in memory.
11. The transmit microengine commands the MSF to retrieve the packet from the DRAM and place it in its local buffer.
12. Once the packet is ready, the MSF sends it to the external media device.
13. The external media device transmits the packet.

The network processor can be configured to handle packet processing in two ways: Hyper Task Chaining (HTC) and Pool of Threads (POTs). The difference is conceptual, not technological; in other words, there is no flag or register in the network processor that selects HTC mode or POTs mode. Rather, the user of the network processor must understand the problem and architect a solution using one of the two ideologies.

Hyper Task Chaining, as shown in Figure 5.3, is a pipelined approach to using the microengines to move a packet along the data path. The concept is closely related to the pipelining used in modern general-purpose processors. In HTC, each microengine is assigned a small task that is a subset of the larger task. When a microengine is done with its task, it passes the data to the next microengine. All microengines are placed in stages and only move data to the next stage when the previous stage completes.

Figure 5.3 Hyper task chaining of microengines (tasks 0 through m are assigned to a chain of microengines; at each time step t = 0, 1, 2, 3 the packets advance one stage down the chain)

HTC's greatest strength is the potential parallel use of all microengines. If the design permits and the designer is careful in laying out tasks for the microengines, it is possible to have all the microengines running concurrently, utilizing all of the network processor's capability.

On the other hand, most problems cannot be divided evenly among all the microengines, so the speed at which the system can process packets is limited by the processing time of the slowest stage. Fortunately, an HTC design in a network processor does not suffer from pipeline stalling the way a general-purpose processor does, because, by design, each stage manipulates the packets as much as possible before handing them off to the next stage. While it is thus easy to avoid the data dependencies that stall pipelines, an HTC design does suffer from memory latency stalling if the pipeline has not been designed so that no two stages access the same memory space simultaneously.

The second technique for parallel processing is called Pool of Threads. This concept, as shown in Figure 5.4, is simple in theory and is generally favored over HTC. In POTs, each microengine contains the necessary instructions to process a packet from start to completion. All the microengines are placed into a pool of available resources, and when a packet enters the system, it is assigned to the next available microengine, where all of its tasks are performed.

Figure 5.4 Pool of threads of microengines

The advantage of using POTs is its simpler system design. The designer only needs to implement one set of microengine code, which is placed on all microengines. This reduces complexity by eliminating the need for inter-microengine communication, since each microengine can completely process a packet. The drawback of POTs is that it wastes some system potential, since it allows the microengines to run freely: all microengines run in parallel, and each is unaware of what the others are doing. One issue is that microengines will attempt to access the same memory space, causing higher memory latency.

Generally, a network processor design will use primarily POTs with a slight mix of HTC. This is true for the multihomed gateway. As shown in Figure 5.5, one microengine is tasked with receiving packets from the MSF and moving the packets to DRAM. Another microengine takes processed packets from the DRAM and moves them to the MSF for transmission. In the middle, the system uses POTs, where any available microengine is allowed to take packets from the DRAM and process them.

I."::l~ ~I."::l ~ Figure 5.5 HTC and POTs in a typical system

This layout results because only one MSF exists in the network processor, and only one microengine is allowed to talk to the TX or RX side. Another reason for this layout is the separation of packet receive and transmit from packet processing. This improves code reuse, since the receiving and transmitting of packets is unlikely to change between designs. Intel provides well-documented examples of this design concept.

5.2 Implementation of the multihomed gateway on the Intel 2800

The multihomed gateway's microengine code, or microcode, was compiled using Intel's developer workbench v3.51. The microcode was written in microengine assembly. The setup and monitor application that runs on the embedded Xscale processor was written in C using the MontaVista v3.1 Linux cross-compiler. The system relies on low-level building blocks provided by Intel to accomplish tasks such as receiving and transmitting packets, L2 encapsulation/decapsulation and IP forwarding. The system also uses the resource manager for setup and configuration of the microengines from the Xscale processor.

The overall program structure is based on Intel's suggested NPU structure. It consists of three basic stages of operation: receive, process and transmit. Each stage is mapped to a different microengine to take advantage of the hyper task chaining concept. The process stage is where the multihomed gateway's functional code is located, and it uses the pool of threads concept to maximize packet flow.

The receive and transmit stages are provided by Intel's Internet Exchange Architecture (IXA) software building blocks. The IXA blocks consist of pre-defined functions that perform basic operations needed by the NPU. The receive and transmit blocks provide the basic packet operations that receive and transmit packets and handle the communication between the microengines and the MSF. The IXA blocks are written specifically for good performance, and Intel went to great effort to optimize them; it is unlikely a programmer would need to rewrite these blocks unless a very specific function is required. There are other IXA blocks provided by Intel, such as a queue manager and a scheduler, but the multihomed gateway only uses the receive and transmit blocks.

The process stage is the main focus of the multihomed gateway project. All necessary functions performed by the multihomed gateway are located in the process stage, which has a single main entry point called the dispatch loop. The dispatch loop is an NPU programming concept similar to the "main" function of a C program: a continuous loop that retrieves a packet from the receive stage, operates on that packet and then sends the packet to the transmit stage. The process stage also uses Intel's optimized data plane libraries. These libraries contain useful functions, such as memory compare and header retrieval, that programmers would otherwise have to write themselves.

5.3 Packet processing logic

The packet processing logic of the multihomed gateway is structured with the dispatch loop as its main function. The dispatch loop calls sub-functions to operate on the packet. Packet processing is strictly serialized; the previous step must be completed before the next step can start.

5.3.1 - Dispatch loop

Figure 5.6 Dispatch loop flow diagram (blocks in order: receive packet, layer 2 decapsulation, channel selection, packet processing, statistics update, layer 2 encapsulation, transmit packet)

The dispatch loop, as shown in Figure 5.6, does not perform packet processing. Its purpose is to hand out work to the rest of the system. It starts by receiving a packet from the receive stage, then forwards the packet to the various process stage blocks, and finally sends the packet to the transmit stage; then the cycle repeats. It is important to note that the dispatch loop should never stall at any point, and a properly coded dispatch loop should not wait for an event to occur. For example, it might seem logical to wait for a packet to arrive before continuing to the next block, since the program has nothing to do without a packet to process. In a microengine program, however, it is undesirable for the code to wait, because waiting stalls the microengine from performing other tasks; other tasks cannot take control of the microengine unless the running task voluntarily gives it up.

Therefore, the dispatch loop should check whether a packet is available to the processor. If none is, the program simply continues, ignoring any branches in the flow that require a packet.
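Expressed in C-like form, the dispatch loop reduces to the skeleton below. This is a sketch only: the real dispatch loop is microengine assembly, and everything here except the dl_source/dl_sink names (which come from Intel's IXA dispatch-loop interface) is an invented stand-in.

    #include <stdint.h>

    enum pkt_type { PKT_INVALID, PKT_ARP, PKT_IPV4 };

    struct pkt_handle { enum pkt_type type; uint32_t buffer; };

    struct pkt_handle dl_source(void);          /* receive stage; always returns */
    void dl_sink(struct pkt_handle *p);         /* hand off to transmit stage */
    void l2_decap(struct pkt_handle *p);        /* block 2 */
    void channel_select(struct pkt_handle *p);  /* block 3 (Section 5.3.2) */
    void packet_process(struct pkt_handle *p);  /* block 4 (Section 5.3.3) */
    void stats_update(struct pkt_handle *p);    /* block 5 (Section 5.3.4) */
    void l2_encap(struct pkt_handle *p);        /* block 6 */

    void dispatch_loop(void)
    {
        for (;;) {
            struct pkt_handle pkt = dl_source();  /* non-blocking by design */
            if (pkt.type == PKT_INVALID)
                continue;                         /* no packet: never wait or stall */
            l2_decap(&pkt);
            channel_select(&pkt);
            packet_process(&pkt);
            stats_update(&pkt);
            l2_encap(&pkt);
            dl_sink(&pkt);                        /* block 7: transmit */
        }
    }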

The receive packet block (1) could be considered the beginning of the program flow, since the rest of the code relies on this block. The receive packet block gets a new packet from the receive stage by calling the dl_source function. This function is defined in the IXA application interface, which provides a generic way for a process stage to receive a packet. The dl_source function returns the packet handle associated with the packet just received. The packet handle contains information on the packet such as length and type. It also contains the pointer to the memory where the packet is stored, the port on which the packet arrived and the port to be used for exit. The packet handle need not have all fields filled out upon arrival from the receive stage. For example, the output port is not yet defined, since the multihomed gateway does not yet know which output port to route the packet to. It is assumed the rest of the program will fill in the necessary fields as the information becomes available; when the packet is ready for transmission, all required fields are properly filled out. Interestingly, the dl_source function always returns even if no packet was received, in accordance with the concept of "no waiting" used by the dispatch loop. The packet handle, however, will be tagged as an invalid packet type so that the dispatch loop can act accordingly.

The next block is layer 2 decapsulation (2). This block decapsulates the Ethernet header (L2) from the IP packet (L3); the result is a packet without the MAC addresses, Ethernet header and Ethernet FCS. The layer 2 decapsulation block also classifies the packet type as ARP or normal IPv4 and places that information in the packet's header. The layer 2 decapsulation block is provided by Intel as an IXA building block.

Channel selection (3) contains the logic that determines on which channel the packet should be transmitted. This section contains the code for packet classification, the active sessions list, the channel health monitor and the channel chooser. Channel selection is discussed in further detail in Section 5.3.2.

The packet processing (4) block contains functions to manipulate the packet. The NAT engine and the IP/TCP/UDP checksum calculator are in the packet processing block, which is further discussed in Section 5.3.3.

The statistics update (5) block updates statistics values in the multihomed gateway. These statistics include average packet length, channel capacity and various housekeeping values. The statistics block is further discussed in Section 5.3.4.

The layer 2 encapsulation (6) block does the exact opposite of the decapsulation block: it takes the layer 3 IP packet from the previous block and encapsulates it inside layer 2, placing the Ethernet MACs, Ethernet header and Ethernet FCS around the packet. Like the decapsulation block, the layer 2 encapsulation block is included as an IXA building block.

The final block in the dispatch loop is the transmit block (7). The transmit block is similar to the receive block in that it uses the IXA application interface. The transmit block takes the packet handle information and sends it to the transmit stage, which is invoked by the dl_sink() function call with the packet handle as its argument.

5.3.2 - Channel selection

The channel selection block contains the code that determines on which channel the packet will be transmitted. The flow chart is shown in Figure 5.7.

Figure 5.7 Channel selection flow diagram (the blocks, numbered (1) through (12) in the text below, cover packet classification, ARP handling, the round-trip timer check for external packets, the active sessions lookup and the loading of channel information)

The first block in channel selection is to receive control from the dispatch loop (1). At this point, the channel selection block assumes a packet has been received and properly decapsulated. The header information is copied from the SRAM to the microengine's local memory for faster access.

The next block (2) classifies the packet as ARP or IPv4. The actual classification is performed by the layer 2 decapsulation block, and the packet type is placed in the packet header. The multihomed gateway responds only to ARP and normal IPv4 packets; all other packets are marked as ignored so they can be dropped later.

If the packet is classified as ARP, the multihomed gateway will fill out the ARP entries (3) so the two devices can obtain each other's MAC address. A typical ARP packet has the format shown in Figure 5.8.

Frame Header | Frame Data (ARP Message)

The ARP message is laid out as follows:

Hardware Type | Protocol Type
HLEN | PLEN | Operation
Sender HA (0-3)
Sender HA (4-5) | Sender IP (0-1)
Sender IP (2-3) | Target HA (0-1)
Target HA (2-5)
Target IP (0-3)

Figure 5.8 ARP packet

The frame header is an Ethernet header with the destination MAC address set to the broadcast address, which allows all computers on the network to receive the ARP message. The ARP message itself contains the fields listed in Table 5.1.

Hardware type: Specifies the hardware interface type; for Ethernet, this is set to 0x01.
Protocol type: The protocol used by the sender; this field is set to 0x0800 for IP addresses.
HLEN: Length of the hardware address.
PLEN: Length of the protocol address.
Operation: The type of ARP message: ARP request (1), ARP response (2), RARP request (3) or RARP response (4).
Sender HA: The hardware (MAC) address of the sender.
Sender IP: The IP address of the sender.
Target HA: The hardware (MAC) address of the target.
Target IP: The IP address of the target.

Table 5.1 ARP frame fields

When the multihomed gateway receives an ARP packet, the packet only has the sender's MAC address and IP address fields filled out. The multihomed gateway then inserts its own MAC address and IP address in the proper fields. ARP packets are transmitted out on the same channel that received them.

If the received packet is a normal IPv4 packet, the multihomed gateway first determines the direction of the packet (4). The direction of the packet determines the type of processing needed to determine the transmit channel.

If the packet came from the external network, it may be a timing packet sent earlier to check the channel latency (5). The multihomed gateway has an internal scratchram location written with the transmit timestamp whenever it sends out a timing packet. When a packet is received from the external network, the packet's address and port are matched against the values written into the scratchram timing location. If the address and port match, the current timestamp is read, and the difference between the current timestamp and the transmit timestamp is the packet's round-trip delay. Packets from the external network are always transmitted to the internal network, which is channel 0.
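The round-trip measurement amounts to matching a returning packet against the remembered transmit record. A hedged C sketch, with invented names for the scratchram record:

    #include <stdint.h>

    struct timing_probe {      /* remembered in scratchram when a probe is sent */
        uint32_t addr;         /* address the timing packet was sent to */
        uint16_t port;         /* port the timing packet was sent to */
        uint32_t tx_stamp;     /* timestamp at transmission */
    };

    /* On a packet from the external network, return the round-trip delay
     * in timer ticks if it matches the outstanding probe, else 0. */
    static uint32_t timing_check(const struct timing_probe *p,
                                 uint32_t src_addr, uint16_t src_port,
                                 uint32_t now)
    {
        if (src_addr == p->addr && src_port == p->port)
            return now - p->tx_stamp;   /* round-trip delay */
        return 0;                       /* not a timing reply */
    }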

If the packet came from the internal network, the multihomed gateway may need to choose a new channel for it, if this packet is the first of a session. To make that determination, the multihomed gateway checks the source and destination addresses of the IP header and searches through its active sessions list (6). If the source and destination addresses match one of the entries in the active sessions list, it can be assumed the packet belongs to a current session. If no match is found, the multihomed gateway assumes a new session is being initiated, and it will choose a channel for the session.

If the packet's addresses match an entry in the active sessions list, the channel number is read (7) from the list and the packet is transmitted on that channel. The active sessions entry for that session is also updated with the latest transmit timestamp. The transmit timestamp is used by the multihomed gateway to determine which session entry to remove when it runs out of space in the active sessions list: a Least Recently Used (LRU) algorithm selects the entry to be removed.
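The lookup and the LRU eviction can share one pass over the list, as in this illustrative C sketch (names and the list size are invented; the real list lives in SRAM and is walked by microcode):

    #include <stddef.h>
    #include <stdint.h>

    #define NSESSIONS 128   /* assumed list size */

    struct session {
        uint32_t src_ip, dst_ip;   /* session key */
        uint8_t  channel;          /* channel chosen when the session began */
        uint32_t last_tx;          /* timestamp of the most recent transmit */
    };

    /* Find an active session; on a miss, report the least recently used
     * slot so the caller can evict it for the new session. */
    static struct session *session_find(struct session *list,
                                        uint32_t src, uint32_t dst,
                                        struct session **lru_out)
    {
        struct session *lru = &list[0];
        for (int i = 0; i < NSESSIONS; i++) {
            if (list[i].src_ip == src && list[i].dst_ip == dst)
                return &list[i];          /* existing session: reuse its channel */
            if (list[i].last_tx < lru->last_tx)
                lru = &list[i];
        }
        *lru_out = lru;                   /* LRU candidate for eviction */
        return NULL;                      /* new session */
    }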

If the packet does not have a corresponding entry in the active sessions list, a channel must be chosen for the new session. The first step the multihomed gateway performs is to load the channel information (8) from the scratchram. The channel information is generated by the channel performance monitor and the user traffic monitor; the values are loaded into local registers.

Once all information is loaded, the multihomed gateway checks the channel to ensure it still functions (9). The process of detecting dead channels differs from determining channel capacity. A channel is considered dead if it has not received data within 60 seconds after a packet was transmitted into the channel. No matter what type of traffic uses the external channel, it will need to send a packet back to the internal network; the packet could be an acknowledgement or a retransmission. Any packet will trigger the multihomed gateway to mark the channel as active. Only when the channel is quiet for a long time will it be marked as dead. If a channel is marked as dead (10), the multihomed gateway will place the new session onto a channel that is not broken.

Otherwise the multihomed gateway will need to estimate the best channel for the new session (11). The session is first classified to determine whether it is more latency sensitive or bandwidth sensitive. The multihomed gateway does this by first looking at the service type field in the IP header. This field is extracted from the IP header in local memory by the function call xbuf_extract(), which is provided by Intel's optimized data plane library. The multihomed gateway then examines the D and T flags in the service type field to determine if the packet demands low latency (D flag) or high bandwidth (T flag). If both the D and T flags are set, or if neither is set, then the multihomed gateway assumes the field is not being used by the internal network. This is to be expected, because the service type field is not widely used and most equipment ignores it to avoid abuse.

In the case where the traffic type cannot be determined by the service type field, the multihomed gateway relies on the port history to determine the likely requirements of the traffic. The port history table is a 65,536-entry content addressable memory (CAM). The table is located in scratchram and contains the average length of the packets seen using a particular port. Once the multihomed gateway has determined that it is necessary to consult the table, it extracts the destination port and indexes the port history table by that port number. Using a CAM as the port history table allows O(1) access to the history information. Once the average packet length for the port is retrieved, the multihomed gateway simply compares that value to 1400 bytes. If the average packet length is smaller than 1400 bytes, it is likely that this port's traffic pattern is latency sensitive; otherwise, the multihomed gateway assumes the traffic is bandwidth sensitive. The value of 1400 bytes is chosen because it is slightly under the normal Ethernet MTU size of 1500 bytes. When large bandwidth is required, usually because the user is downloading a large file, the file will be broken up into the largest MTU block size to minimize overhead. Therefore, if the average packet length is smaller than 1400 bytes, most of the traffic consists of small packets, which is a good indication that the traffic pattern is latency sensitive. Once the traffic pattern is determined, the multihomed gateway compares the values reported by the channel performance monitor and places the new session on the most appropriate channel. The active sessions list is also updated to reflect the new session.
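Putting the two classification sources together, the decision logic can be sketched in C as follows (illustrative names only; avg_pkt_len stands for the scratchram port history table, and classify_tos for the D/T test described in Section 4.4.3):

    #include <stdint.h>

    #define MTU_THRESHOLD 1400   /* just under the 1500-byte Ethernet MTU */

    extern uint16_t avg_pkt_len[65536];            /* port history table */
    enum traffic_class { TC_UNKNOWN, TC_LATENCY, TC_BANDWIDTH };
    enum traffic_class classify_tos(uint8_t tos);  /* TOS test from Section 4.4.3 */

    /* Classify a new session: trust the TOS field when it is usable,
     * otherwise fall back to the average packet length seen on the port. */
    static enum traffic_class classify_session(uint8_t tos, uint16_t dst_port)
    {
        enum traffic_class tc = classify_tos(tos);
        if (tc != TC_UNKNOWN)
            return tc;
        return (avg_pkt_len[dst_port] < MTU_THRESHOLD) ? TC_LATENCY : TC_BANDWIDTH;
    }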

Once a channel is chosen, it is stored in a local register. Control is passed back to the dispatch loop (12).

5.3.3 - Packet processing

The packet processing block translates the packet's address to match the outgoing channel. It also recalculates the IP/TCP/UDP checksums and replaces them in the packet. The packet processing flow chart is shown in Figure 5.9.

Figure 5.9 Packet processing flow diagram (blocks in order: control from dispatch loop, NAT, fix IP checksum, fix TCP checksum, fix UDP checksum, control to dispatch loop)

The packet processing block begins by getting control from the dispatch loop (1). Then the packet processing block performs NAT (2) on the packet. When the program reaches this point, the multihomed gateway has already determined the channel on which this packet will proceed. If the packet originates from the external network, the destination address of the packet is changed to the internal network address. If the packet originates from the internal network, the packet's source address is changed to match the respective output channel of the multihomed gateway. In both cases, the old address is replaced with the new address.

Once the address is replaced, the multihomed gateway recalculates the necessary checksums (3, 4, 5). The multihomed gateway can recalculate three types of checksums: IP, TCP and UDP. The formula used to recalculate the checksums is the same for all three; the difference between them is the offset, that is, where the checksum is located in the header. All IPv4 packets have an IP checksum to protect the IP header. However, a packet may or may not have a UDP or TCP checksum, depending on the packet type; note that a packet will have either a UDP or a TCP checksum, never both. If a packet carries a UDP payload inside a TCP frame, the multihomed gateway will only operate on the outermost packet type. The multihomed gateway uses the IP_PROTOCOL field in the IP header to determine whether the packet is TCP or UDP: a value of 0x06 in the IP_PROTOCOL field means a TCP packet, and a value of 0x11 means a UDP packet. Any other value is ignored, and the packet is treated as a normal IP packet.

All three checksums are calculated using the same function. The checksum "is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header". This means all the data protected by the checksum is divided into 16-bit words and summed together; whenever the sum overflows 16 bits, the upper 16 bits are added back into the lower 16 bits to form the checksum. The IP checksum protects just the IP header. Both UDP and TCP create a pseudo header that contains parts of the IP header, and the pseudo header is protected along with the UDP or TCP header. The pseudo header contains different fields for each protocol, but both protocols include the source and destination IP addresses. This is why the multihomed gateway must update the UDP and TCP checksums.

The multihomed gateway uses an arithmetic approach to recalculate the checksums instead of calculating them from scratch. This is especially useful for the TCP checksum, which protects the entire packet; if the multihomed gateway had to recalculate the TCP checksum from scratch for every TCP packet, it would require many additional microengine cycles. The arithmetic approach to recalculating the checksums is outlined in Request for Comments document 1624 (RFC 1624). The equation described in RFC 1624 is shown in Table 5.2.

HC' = HC - ~m - m'

HC' : the new computed checksum
HC  : the old checksum
m   : the old value of a 16-bit field
m'  : the new value of a 16-bit field

Table 5.2 RFC 1624 equation for checksum calculation

The result of this equation is the new checksum, based on the change made to any 16-bit value in the protected data area. Because the IP address field is 32 bits wide, the multihomed gateway must apply this equation twice, once for each 16-bit half of the address, to compute the new checksum correctly.
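A C sketch of this incremental update is shown below. The helper names are hypothetical; the code uses the one's complement form HC' = ~(~HC + ~m + m') given in RFC 1624, applied once per 16-bit half of the rewritten 32-bit address:

#include <stdint.h>

/* Update a checksum after one 16-bit field changes from m_old to m_new. */
static uint16_t cksum_update16(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~m_old;
    sum += m_new;
    while (sum >> 16)                       /* fold one's complement carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* A 32-bit IP address change is handled as two 16-bit updates. */
static uint16_t cksum_update_addr(uint16_t hc, uint32_t old_addr, uint32_t new_addr)
{
    hc = cksum_update16(hc, (uint16_t)(old_addr >> 16), (uint16_t)(new_addr >> 16));
    hc = cksum_update16(hc, (uint16_t)(old_addr & 0xffff), (uint16_t)(new_addr & 0xffff));
    return hc;
}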

Once the address and all appropriate checksums are updated, the multihomed gateway is ready to transmit the packet (12), and control is returned to the dispatch loop.

5.3.4 - Statistics update

The multihomed gateway updates channel statistics before the packet is transmitted. These statistics are collected by the channel performance monitor (CPM) and the user traffic monitor (UTM). Both functions are shown as blocks in Figure 5.10.

[Flow diagram: control from dispatch loop -> update UTM -> update CPM -> control to dispatch loop]

Figure 5.10 Statistics update flow diagram

Statistics update receives control from the dispatch loop (1) and updates the user traffic monitor (2). The user traffic monitor keeps track of the average bandwidth per TCP port number. However, the user traffic monitor needs to weigh older values more heavily than new values because of traffic segmentation at the TCP level. If a user has a large FTP download using TCP port 21, the traffic pattern will show numerous 1,500-byte packets followed by a single smaller packet at the end. The last small packet consists of the remnant bytes left over after the transmission of the 1,500-byte packets. If that smaller packet is given the same weight as the rest of the packets, it will greatly reduce the average packet length. Therefore, it is necessary to weigh previous packets more heavily than recent packets.
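One simple way to realize this weighting is an exponentially weighted moving average, as in the C sketch below. The 15/16 weight is an assumed parameter chosen for illustration, not a value from the thesis:

#include <stdint.h>

/* Update the running average packet length for one TCP port. With
 * most of the weight on the stored history, a single small trailing
 * packet barely moves the average. */
static void utm_update_avg_len(uint32_t avg_len[], uint16_t port, uint32_t pkt_len)
{
    avg_len[port] = (avg_len[port] * 15 + pkt_len) / 16;
}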

The channel performance monitor (3) updates the statistics of both channels. The first statistic collected is the current bandwidth usage. Once determined, the current bandwidth usage is compared with the channel capacity. If the current bandwidth usage is higher than the channel capacity, then the channel capacity is updated with the value of the current bandwidth usage. The channel capacity is also reduced by a small amount on each update, following the decay in the algorithm set out in Chapter 4. After all statistics are updated, control is passed back to the dispatch loop (4).
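This capacity estimate can be summarized by the C sketch below; DECAY_STEP stands in for the "small amount" of Chapter 4's algorithm and is an assumed constant:

#include <stdint.h>

#define DECAY_STEP 1  /* assumed per-update decay of the capacity estimate */

/* The apparent capacity rises instantly to any observed usage peak
 * and decays slowly otherwise, so a stale high estimate is
 * eventually corrected. */
static void cpm_update_capacity(uint32_t *capacity, uint32_t usage)
{
    if (usage > *capacity)
        *capacity = usage;          /* new observed peak */
    else if (*capacity >= DECAY_STEP)
        *capacity -= DECAY_STEP;    /* decay toward actual usage */
}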

Chapter 6

Functional test of the multihomed gateway

The functional test of the multihomed gateway involves connecting an internal network to the Internet through two different ports or channels. The internal network side of the multihomed gateway is attached to a computer that continuously requests traffic from various remote hosts on the Internet. The purpose of this test is to exercise the multihomed gateway's various functions. The functional test also exercises the channel selection algorithm to see how well it performs under various network loads. Finally, the functional test examines the case in which one channel goes dead.

6.1 Test setup

As shown in Figure 6.1, the test setup for the multihomed gateway requires four separate functional units: the user PC, the boot and debug manager, and two channel impairment devices. The user PC represents the internal network and is connected to channel 0 of the multihomed gateway. The two channel impairment devices are connected to channel 1 and channel 2 of the multihomed gateway. They are used to artificially degrade the channels and simulate traffic loading on them. Each channel impairment device is connected to the Internet via an ISP; in this case, the ISP is UHNet. Finally, the boot and debug manager runs the necessary services used by the multihomed gateway to boot correctly. These services are DHCP, NFS, TFTP and the Intel workbench studio. These functional units will now be explained in detail.

[Diagram: user PC on the internal network; multihomed gateway; channel impairment devices 1 and 2 and the boot & debug manager on the external network; ISP (UHNet); Internet]

Figure 6.1 Functional test setup diagram

6.1.1- User PC

The user PC represents the internal network to the multihomed gateway. The purpose of the user PC is to generate traffic through the multihomed gateway to test its channel choosing function. To generate the necessary traffic, a script was written that takes a random website from a file and fetches the website's start page. The program that fetches the website is called wget. To generate as much traffic as possible, four copies of the script run at the same time. Each copy of the script uses a different website list so the scripts will not all try to fetch from the same website. The traffic generated is similar to what would be experienced in a real network environment. Experiments using the test scripts show that the scripts can generate traffic at over 100 Kbps.

6.1.2- Boot and debug manager

The boot and debug manager serves boot and runtime files for the multihomed gateway. Several necessary services must be functioning for the multihomed gateway to boot and run properly. The services used by the multihomed gateway are listed in Table 6.1.

Service: TFTP
Function: The Trivial File Transfer Protocol (TFTP) is commonly used to retrieve boot files from a server. TFTP is lightweight and does not require authentication to retrieve files. The small footprint of TFTP makes it ideal to store in a flash environment where space is limited. For the multihomed gateway, TFTP is used to retrieve the Linux kernel image from the boot and debug manager.

Service: DHCP
Function: The Dynamic Host Configuration Protocol (DHCP) is used by the multihomed gateway to set up its IP address. DHCP allows all IP addresses to be stored in a central location from which the client "leases" an address. The dynamic nature of DHCP allows the multihomed gateway to avoid configuring a permanent IP address and instead ask for one during boot. In the case of the multihomed gateway, DHCP also leases the root filesystem mount from the server; the root filesystem is remotely hosted by the boot and debug manager.

Service: NFS
Function: Network File System (NFS) is a special filesystem used by UNIX operating systems. It allows a client to remotely access the server's filesystem. In the case of the multihomed gateway, NFS mounts the root filesystem from the boot and debug server. In other words, the entire filesystem exists on the boot and debug manager, and nothing is stored locally on the multihomed gateway.

Service: Intel workbench
Function: The Intel workbench downloads the microengine code from the PC to the microengines.

Table 6.1 External multihomed gateway services

For the test setup, the boot and debug manager is not a separate PC. Its services are only used during boot and rarely afterward; therefore, the services share the resources of one of the channel impairment devices. This lessens the number of computers needed to set up the test.

6.1.3- Channel impairment device

The channel impairment device is a PC running Linux. The Linux operating system's IP forwarding module and network statistics applications are very useful in this test. The channel impairment PC has two network cards installed: one is connected to the Internet and the other is connected to the multihomed gateway. Linux is then put into IP masquerading mode, where all traffic from the multihomed gateway is forwarded to the Internet and vice versa. Linux offers many advantages as the channel impairment device. It provides a wealth of applications that can be used to artificially degrade the channel, and other applications are used in this test to monitor the channel for better insight into how it is behaving. The three applications used in the test are shaper, jnettop and nload.

IP masquerading is part of Linux's iptables facility. Iptables manipulates incoming and outgoing packets based on a set of rules; each packet that arrives at or leaves the system is compared against the rule set to see if any of the rules match. In the test case, a simple SNAT forwarding rule is set on the iptables postrouting chain. To set up the rule, enter the following on the Linux command line:

echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source 10.14.19.1

Most Linux distributions have iptables support built in by default. For other distributions, iptables support must be compiled into the kernel after installation. Information regarding kernel compilation and other details for setting up iptables can be found online. Another information source is the documentation directory of the Linux source tree.

The shaper application is used to degrade the channel. Shaper is short for traffic shaper; it limits traffic flow from a network device. In the test scenario, shaper limits the bandwidth between the channel impairment device and the multihomed gateway. To set up shaper, enter the following on the Linux command line:

modprobe shaper
ifconfig shaper0 192.168.2.2
route del -net 192.168.2.2 dev eth1
shapecfg attach shaper0 eth1
shapecfg speed shaper0 64000

The first line installs the shaper kernel module. This command may or may not be necessary, depending on whether shaper was compiled into the kernel or as a kernel module; in this test case, shaper is a kernel module, so the line is required. The second line configures the shaper virtual network device to have the same network address as the physical network connected to the multihomed gateway. The third line deletes the default route of the physical network so that all packets destined for the multihomed gateway will be routed through the shaper virtual network device. The fourth line attaches the virtual shaper device to the real physical device. The last line sets the speed of shaper, with the final value giving the data rate in bits per second. Once shaper is configured, it is not necessary to repeat the entire process to change the speed; this can be done with the last line alone.

The applications jnettop and nload are network statistics tools that show the current network statistics of the Linux system. Jnettop shows the currently active sessions between the Internet and the multihomed gateway. Nload displays an ASCII chart of recent network usage.

6.2 Multihomed gateway operational test

The operational test of the multihomed gateway starts by running the test scripts on the user PC. The test scripts begin to retrieve data from various hosts on the Internet and store the data locally. The jnettop application monitors traffic traveling between the Internet and the multihomed gateway.

Test scripts continuously fetch data from various remote hosts on the Internet. The fetched data is stored locally until the script runs through its entire list of remote hosts, at which point all data is deleted and the script restarts. The data itself is not important, as the scripts' only role is to generate background traffic.

The application jnettop is used to monitor the user streams from the multihomed gateway to the Internet. Each channel impairment device runs a local copy of jnettop. Figures 6.2 and 6.3 show screenshots of the jnettop output on each of the two channel impairment devices. The output of jnettop is a detailed view of all currently active streams on that particular channel impairment device. For example, Figure 6.2 shows the active streams on channel 1, which has current streams from amazon.com, mit.edu and cnn.com. Also listed in jnettop are the TCP port numbers used by those streams and the bandwidth they utilize. There is also data on total bandwidth usage near the bottom of the screen.

[jnettop screenshot: active streams on channel 1 between 192.168.1.1 and hosts including amazon.com, mit.edu and cnn.com, with TCP ports and per-stream bandwidth; totals of 50.0k/s and 3.06k/s at the bottom]

Figure 6.2 Channel 1 jnettop screen shot

As seen in Figure 6.3, channel 2 has different active streams than channel 1. The currently active streams on channel 2 are from various hosts on yahoo.com. The two jnettop outputs show the multihomed gateway placing sessions to different remote hosts on the two different channels.

[jnettop screenshot: active streams on channel 2 between 192.168.2.1 and various yahoo.com hosts, with TCP ports and per-stream bandwidth]

Figure 6.3 Channel 2 jnettop screen shot

6.3 Performance of the multihomed gateway using the capacity estimation algorithm results

The heart of the multihomed gateway is the channel selection module. The functional test exercises this module in the various network environments the multihomed gateway will encounter. The algorithm used by the multihomed gateway is the capacity estimation algorithm laid out in Section 4.4. The capacity estimation algorithm selects a channel by looking at the history of the two channels to estimate their maximum capacity.

The functional test of the capacity estimation algorithm involves running the multihomed gateway through four network conditions: externally loaded, channel outage, internally loaded and unloaded. The externally loaded condition occurs when a channel is loaded from an external source, such as the ISP, and the channel capacity is reduced from normal. The channel outage condition is when one channel is disconnected from the Internet due to a fault. An internally loaded channel occurs when the traffic generated by the internal network overwhelms one or both external channels; this condition happens during peak usage. The unloaded condition occurs during off-peak periods, when just one external channel has more than enough bandwidth to handle all current traffic requirements.

The externally loaded condition and the channel outage condition are the main reasons for building the multihomed gateway. Both conditions adversely affect the performance of the network and either reduce or disable Internet access. Furthermore, both conditions usually happen without warning, so internal network users cannot plan for such outages. The multihomed gateway reduces the downtime of the external network by estimating the health of the channels and acting accordingly when problems arise. The other conditions, internally loaded and unloaded, are presented here to complete the tests and to show the performance characteristics of the multihomed gateway over its range of operation.

6.3.1- Results under externally loaded condition

The externally loaded condition occurs when the external channel's capacity is reduced significantly. This reduction could result from a loading condition on the channel or from the ISP suffering an external attack. The cause of the externally loaded condition is not important, only the fact that it happened; the multihomed gateway must recognize and deal with the condition.

To simulate the externally loaded condition, the multihomed gateway and the user PC run for 60 seconds with the external channels in the unloaded condition. This allows the multihomed gateway to sample the channels and determine their capacities. After 60 seconds, one channel (in this case, channel 2) has its throughput reduced by the shaper program running on the channel impairment device. The shaper program reduces the available bandwidth of the channel to about 10 Kbps. At the same time, the data rates of the channels are collected, and the data is used to plot data rate versus time. The chart is shown in Figure 6.4.

[Chart: data rate (bps) versus time (s) for channel 1 and channel 2]

Figure 6.4 Traffic during externally loaded condition

Figure 6.4 shows channel 2's capacity reduced to about 10 Kbps at about 60 seconds into the test. Channel 1's data rate is immediately reduced to zero. This follows the algorithm exactly, because the multihomed gateway sampled channel 2's bandwidth capacity at about 140 Kbps immediately before the drop. The multihomed gateway therefore assumes plenty of bandwidth is available on channel 2 and places new sessions there. The multihomed gateway eventually corrects this mistake through the decay function, which reduces the apparent bandwidth capacity of channel 2 to the new level. Once that happens, the multihomed gateway starts placing new sessions on channel 1. This happens at about 190 seconds into the test.

To get a better picture of the inner workings of the multihomed gateway during the externally loaded condition, it is possible to view important statistics values of the multihomed gateway through the debug port. Figure 6.5 shows a screenshot of the statistics page. The channel statistics and aggregate totals are near the top of the page. In this case, channel 1 is favored over channel 2 because channel 2 is externally loaded.

The next set of statistics shows the health of the channels. During this test, both channels are active. The active sessions list follows the channel health statistics and shows all active sessions of which the multihomed gateway is aware. The list is separated into four columns. The left column shows the source address; in the test case, the source address is always the same, since a single user PC is used. The second column is the destination address. This is the remote host's address and changes with each session. The third column is the timestamp of the last packet. The timestamp determines which session was least recently used and can be eliminated from the list when space is required. Finally, the rightmost column shows the channel on which a particular session is placed. The last statistic on the page is the average packet length per TCP port. The statistics page only shows TCP ports that have carried sessions; in other words, if a TCP port never saw packets, it will not be shown on the statistics page. In this case, port 53 (DNS), port 80 (HTTP) and port 138 (NETBIOS) are shown.
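The four columns of the active sessions list correspond naturally to one record per session. The C struct below is a hypothetical illustration of such an entry, not the actual layout in the gateway's memory:

#include <stdint.h>

/* One entry of the active sessions list. The source and destination
 * addresses identify the session, the timestamp of the last packet
 * supports least-recently-used eviction, and 'channel' records the
 * output channel the session was placed on (0 marks a free entry). */
struct session_entry {
    uint32_t src_addr;
    uint32_t dst_addr;
    uint32_t last_ts;
    uint32_t channel;
};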

[Debug screenshot: per-channel packet counts and aggregate totals, channel status (both OK), the active sessions list with source address, destination address, timestamp and channel columns, and average packet length per TCP port]

Figure 6.5 Externally loaded condition debug statistics

Figure 6.5 shows that once the multihomed gateway determined channel 2 was loaded, nearly all new sessions were placed on channel 1. The active sessions list shows only three sessions on channel 2, with the rest placed on channel 1. The results show that the multihomed gateway correctly recognized the externally loaded condition and placed new sessions on the unloaded channel. This is in contrast to other load balancing algorithms, which do not handle this network condition. The results of the externally loaded condition using other algorithms are shown in Appendix C.

6.3.2- Results under channel outage condition

The multihomed gateway monitors channel connectivity to determine whether each channel is still active. When a channel is determined to have failed, all previous sessions of the failed channel are deleted from the active sessions list and new sessions are placed on the channel still functioning. To test the channel outage condition, the multihomed gateway is first run for 60 seconds so it can determine channel capacities. Once it is established that both channels function correctly, one channel (in this case, channel 2) is unplugged from the Internet to simulate the outage. The multihomed gateway's debug statistics screen is then reviewed to see whether the failed channel is detected correctly.

If the multihomed gateway determines a channel is disconnected, all active sessions of the disconnected channel must be cleared from the active sessions list. The reason is potential misplacement of packets: if another packet arrives for a session that was on the failed channel, the stale channel number in the active sessions list could cause that packet to be placed onto the failed channel. If the active sessions list is purged of all entries for the failed channel, the multihomed gateway will instead place the packet onto the channel that is still working. The active sessions on the failed channel cannot be recovered in any case, because the channel has failed. The multihomed gateway needs to position itself such that, upon reconnection, it will place new sessions on the active channel. In Figure 6.6, the debug statistics screen shows that channel 2 has failed. Also, the active sessions list has no entries for channel 2, because channel 2's sessions were deleted and zeroed out.
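The purge can be pictured as a sweep over the session table, as in the C sketch below. It mirrors the behavior of the microcode in Appendix D but is illustrative only; the struct repeats the hypothetical session record sketched earlier:

#include <stdint.h>

struct session_entry { uint32_t src_addr, dst_addr, last_ts, channel; };

/* Zero out every session bound to the failed channel so that later
 * packets are looked up afresh and land on the working channel. */
static void purge_failed_channel(struct session_entry table[], int nentries,
                                 uint32_t failed_channel)
{
    for (int i = 0; i < nentries; i++) {
        if (table[i].channel == failed_channel) {
            table[i].src_addr = 0;
            table[i].dst_addr = 0;
            table[i].last_ts  = 0;
            table[i].channel  = 0;   /* entry is now free */
        }
    }
}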

[Debug screenshot: channel 1 status OK, channel 2 status FAIL; the active sessions list contains entries only for channel 1, with channel 2's entries zeroed out]

Figure 6.6 Channel outage condition debug statistics

6.3.3- Results under internally loaded condition

During an internally loaded condition, the traffic from the internal network overwhelms the external channels. The multihomed gateway is expected to balance the load in this condition and intelligently place traffic on both channels to maximize throughput.

[Chart: data rate (bps) versus time (s) for channel 1 and channel 2]

Figure 6.7 Traffic during internally loaded condition

Figure 6.7 shows what happens during this condition. At about 90 seconds into the test, the internal network traffic spikes and utilizes both channels. Before that time, traffic generated by the internal network is primarily handled by channel 2. During peak usage, channel 1's data rate jumps to the same level as channel 2's. This shows that the multihomed gateway correctly places new sessions on the unloaded channel when the load demands it.

6.3.4- Results under unloaded condition

In the unloaded condition, a single channel should be adequate for the bandwidth generated by the internal network. In this case, it is not important that the data rates of the two channels be completely balanced, since overall traffic throughput does not improve. In other words, a new session requested by the internal network can be served by either channel 1 or channel 2. It does not matter that channel 1's data rate is currently higher than channel 2's, because channel 1 has more than enough capacity to handle the new data. Figure 6.8 shows the chart of the unloaded condition.

[Chart: data rate (bps) versus time (s) for channel 1 and channel 2]

Figure 6.8 Unloaded condition

The seesaw pattern of channel usage is a random function of the session arrival pattern of the internal network. The traffic generated by the scripts varies in size: a download from amazon.com can be very large due to graphics on the website, while Google's main page has fewer graphics and loads faster. At some point, the multihomed gateway will receive a session with a high bandwidth requirement and will place it onto channel 1 or channel 2. Suppose the high bandwidth session is placed on channel 2 and begins to download its content. Channel 2's bandwidth usage spikes because of that session. The spike is absorbed by channel 2, which has the capacity to handle that much traffic, but the multihomed gateway also records that channel 2's capacity is very high because of the spike. After the high bandwidth session finishes, channel 2's usage drops. The multihomed gateway then sees a large gap between channel 2's current usage and the channel's capacity, and will therefore place most of the new sessions onto channel 2. In effect, the multihomed gateway now favors channel 2 because of its proven high capacity, due to a random event. This does not mean channel 2 will be favored from this point forward: the capacity estimation algorithm has a decay function attached to the apparent channel capacity, so the apparent capacity is reduced over time if channel usage is minimal. This behavior is evident in Figure 6.8. At about 190 seconds into the test, channel 1's data rate spiked up and channel 2's data rate dropped, probably because channel 1 had just received a session with high bandwidth requirements. Channel 1's data rate remained roughly the same until about 240 seconds into the test, at which point it started to decline. This shows the decay function in action, where the apparent capacity of channel 1 is slowly reduced to match actual bandwidth usage. In contrast, channel 2's data rate slowly increases as more and more sessions are placed onto that channel.

6.4 Summary of test results

The main purpose of the multihomed gateway is to provide redundancy when the external channels are not performing as expected. The multihomed gateway was able to correctly identify the externally loaded condition during the test and redirect traffic to the other channel. The multihomed gateway was also able to correctly identify the scenario where one of the channels was disconnected from the Internet; during that test, all traffic was correctly diverted to the channel still in operation. Furthermore, the multihomed gateway was able to utilize both channels during the internally loaded and unloaded conditions. This shows that the multihomed gateway performs correctly across all the network conditions tested.

The functional demonstration was carried out in the University of Hawaii Engineering building distance learning room. Three computers were used in the test: the user PC and two channel impairment devices. The boot and debug manager ran on one of the channel impairment devices. The multihomed gateway was connected to the Internet via an Asante router connected to UHnet. The demonstration ran as expected until the Asante router stopped transmitting and receiving traffic from the Internet. The crash was probably due to the large amount of traffic generated by the multihomed gateway; it is not unusual for a small, consumer grade router to crash under heavy load. The functional demonstration resumed once the multihomed gateway was reconnected to the Internet directly via UHnet. The demonstration ran correctly, and the results were the same as those taken during the testing phase of the project.

Chapter 7

Conclusions & suggestions for future work

The multihomed gateway was designed and implemented successfully using the Intel IXP 2800 network processor. Test results presented in Chapter 6 show that under a range of conditions the multihomed gateway provides the user with performance that is significantly superior to a single-homed gateway.

7.1 Advantages of a network processor vs. FPGA

The main advantage of a network processor over an FPGA is the speed at which the developer can create a complete product. The network processor already has all the components required for quick and efficient processing of packets. In contrast, an FPGA necessitates manual coding using an HDL. It would have been possible to implement the multihomed gateway using an FPGA; however, the engineering effort to finish the project would have been too great. The network processor allowed the developer to quickly access highly optimized packet processing libraries to inspect packets, making it possible to classify and manipulate incoming packets into something useful. A function that would require days to implement on an FPGA can be implemented in minutes using a network processor.

7.2 Applications for the multihomed gateway

There are many useful applications for the multihomed gateway. The multihomed gateway is designed for a network environment that serves several hundred users. The external channels to the Internet need not be highly reliable, since compensating for unreliable channels is the very reason the multihomed gateway was designed. The multihomed gateway could therefore be useful as a residential gateway for an entire condominium, or for small- to medium-sized businesses that require very high reliability from their Internet connections.

7.3 Suggestions for future work

The multihomed gateway can be extended from its present form into something that provides much more functionality. Currently, the multihomed gateway only accepts two external channels. Future work can extend the concept and allow for more external channels. In fact, the current multihomed gateway requires advance knowledge of how many external channels are connected; it would be very interesting to extend the multihomed gateway to dynamically set up new channels as they are plugged in. This dynamic setup would increase flexibility in how and where a multihomed gateway can operate.

Future extensions to the multihomed gateway could incorporate a network switch directly into the multihomed gateway. Currently, the multihomed gateway has a single port for the internal network connection, so additional users require an external switch. The network processor has the processing power to absorb the functionality of the network switch. These additions would increase the usability of the multihomed gateway.

Appendix A

The Intel IXP2800 network processor

The Intel IXP2800 network processor is Intel's second generation network processor. The design of the IXP2800 is based on Intel's first generation IXP1200 network processor. The IXP2800 extends the functionality and performance of the IXP1200 by enhancing nearly all of its internal structures. The IXP2800 is capable of processing complex algorithms, deep packet inspection, traffic management, and forwarding at wire speed. The IXP2800 is rated at a sustained 10 Gbps packet forwarding speed.

The IXP2800 is every bit as complex as any modern general purpose processor. The main design obstacle that plagues all network processors, including the IXP2800, is memory bandwidth. The network processor must be able to move data from the wire to its internal memory faster than wire speed; the extra headroom is needed because of the overhead associated with processing each packet. In the case of the IXP2800, it must be able to move data at more than 10 Gbps from the wire to its internal memory. This requires an enormous amount of bandwidth between the incoming port and the memory controllers. The entire network processor is designed to move data at these rates.

[Block diagram: microengine clusters 0 and 1, embedded Xscale core, SHaC unit (scratchpad, hash, CAP), DRAM and SRAM controllers, PCI controller, media and switch fabric interface, command bus arbiters]

Figure A.1 Intel IXP2800 network processor

As shown in Figure A.1, the main components of the network processor are the embedded Xscale core, the microengines, the DRAM and SRAM controllers, the SHaC unit and the media and switch fabric interface. Each component is discussed in detail below.

A.1 Embedded Xscale core

The embedded Xscale core (see Figure A.2) is a fully ARM V5TE compliant microprocessor. The Xscale has been designed for high performance and low power, and is one of the leading embedded processors, used in many different applications. The embedded Xscale processor contains all the major components seen in modern processors.

[Block diagram: 32-Kbyte 32-way instruction cache, 32-Kbyte data cache with 2-Kbyte mini-data cache, IMMU and DMMU with 32-entry TLBs, fill and write buffers, branch target buffer, MAC, power management]

Figure A.2 Embedded Xscale core block diagram

The instruction cache is a 32-Kbyte, 32-way set associative instruction cache with a cache line size of 32 bytes. All cache misses generate a 32-byte read request to external memory.

The data cache is a 32-Kbyte, 32-way set associative data cache, with a separate mini-data cache that is 2 Kbytes and 2-way set associative. The data cache line size is 32 bytes. The data cache supports both write-through and write-back caching.

The branch target buffer is used for branch prediction. The branch target buffer stores the target addresses of up to 128 possible branches.

The Instruction Memory Management Unit (IMMU) and the Data Memory Management Unit (DMMU) are the memory managers for the Xscale processor. Both MMUs are fully associative, and a Translation Look-aside Buffer (TLB) is used to accelerate virtual to physical address translation.

A.2 Microengines

The microengines (uE) in the IXP2800 are the second generation of microengines (uEv2) developed by Intel (see Figure A.3). There are a total of sixteen microengines in the IXP2800. There are no differences between the microengines in the IXP2800; they are simply clones of each other. Therefore, any discussion of "the" microengine applies to all sixteen microengines.

[Block diagram: control store, two GPR banks, next neighbor registers, local memory, D-XFER and S-XFER in/out registers, CRC unit, and the execution datapath (shift, add, subtract, multiply, logicals, find first bit, CAM)]

Figure A.3 Microengine block diagram

The single biggest hurdle for the microengine is memory latency. From the point of view of the microengines, accessing any external memory is extremely slow. Even accessing the scratchpad RAM can take dozens of cycles, and the time needed to access DRAM can be as bad as hundreds of cycles. In order to minimize the dead time associated with accessing external memory, Intel has equipped the microengine with eight independent context threads. Each context has its own set of context-specific local registers and program counters. The idea is that when a piece of code reaches a point where it needs to access memory, it signals the memory controller to read or write the memory location, and while the memory controller is doing that, the microengine can switch context and run another section of code. This kind of latency hiding is the cornerstone of how the network processor achieves the speed required. Context switching is so important that every memory access instruction has a context switch flag that allows the microengine to switch context.

The microengine has 640 32-bit words of local memory that can be accessed at full speed. This memory is primarily used for storing the packet header, since the header is the portion of the packet that the microengine is most likely to access.

There are two banks of General Purpose Registers (GPRs). Each bank can hold 128 32-bit words. The GPRs are used by the execution path for various data storage needs, and they also provide the operands used by the ALU.

There are 128 32-bit next neighbor registers. This group of registers is specifically used to transfer information from one microengine to the next. This accelerates the hypertask chaining approach, where data is moved from one microengine to another, because it is much faster to move data using these registers than to copy it through external memory.

The DRAM transfer (D-XFER) and SRAM transfer (S-XFER) registers are used to read and write data from the DRAM or SRAM. Each transfer register group holds 128 32-bit words. There are a total of four of these transfer register groups: D-XFER in, D-XFER out, S-XFER in and S-XFER out. Intel decided to have separate transfer-in and transfer-out registers because the microengine may want to read something and write something at the same time; if a single group of registers served both directions, it would slow down the microengine.

The control store is where the microengine code is stored and executed. The control store can hold a total of 8192 instructions, each 40 bits wide. The control store is initialized by the Xscale processor during the setup procedure.

The execution unit of the microengine is a scaled down version of a RISC processor. It operates like any other processor in that it fetches, executes and stores. The inputs to the execution unit must come from the GPRs; however, the output of the execution unit can go to the GPRs as well as to the D/S-XFER out registers.

A.3 DRAM / SRAM controllers

The DRAM and SRAM controllers share the same design philosophy: maximize the bandwidth available to the microengines. Both controllers have multiple channels connected to separate RAMs.

[Block diagram: four SRAM controllers, each with its own SRAM chips and/or co-processor, connected to the microengine clusters via command and push/pull buses]

Figure A.4 DRAM/SRAM controller block diagram

Figure A.4 shows the configuration of the SRAM controller, but the configuration is the same for the DRAM controller. The purpose of having several controllers is that it gives the microengines more than one channel to the RAM. In essence, a microengine can have four outstanding SRAM accesses at the same time on the four separate channels. This effectively quadruples the memory bandwidth and is one of the key reasons the IXP2800 can perform the way it does.

A.4 SHaC

The Scratchpad, Hash unit and CSRs (SHaC) unit is a multifunction block that contains the scratchpad memory, the hashing unit and the processor-wide Control and Status Registers (CSRs). As shown in Figure A.5, all of the component units of the SHaC are connected through a locally shared bus that is bridged to both the Xscale core and all of the microengines.

[Block diagram: scratchpad memory, hash unit and CSRs on a shared local bus]

Figure A.5 SHaC block diagram

The scratchpad memory is 16 Kbytes of on-chip memory organized as 4K 32-bit words. The scratchpad memory is similar to the L2 cache used by a general purpose processor; its access time is faster than both DRAM and SRAM. The scratchpad memory also supports useful microengine operations such as atomic reads and writes, which help avoid deadlock conditions in the microcode.

The hashing unit can take 48-bit, 64-bit or 128-bit data and produce a 48-bit, 64-bit or 128-bit hash index. The hashing unit can produce three distinct hashes in a single clock cycle. The output hash can be used to access CAMs or as a pointer into main memory.

The CSRs are accessible through the SHaC unit. The CSRs are global register sets that provide various generic functions to the network processor. For example, the CSRs contain a 64-bit timestamp value that increments every 16 clock cycles; the timestamp can be used to time various events throughout the microcode. Another useful CSR is the random number register, which produces a different pseudo random number each clock cycle. The complete list of available CSRs is located in the Intel IXP2400/2800 network processor programmer's reference manual.

A.5 Media and switch fabric interface

The Media and Switch Fabric (MSF) interface connects the IXP 2800 network processor to a PHY or a switch fabric through the network processor's SPI-4 or CSIX bus. The MSF translates the data from SPI-4 or CSIX to an internal representation of the data called mpackets. The MSF also separates receive (Figure A.6) and transmit (Figure A.7) into two almost independent entities. This allows full duplex operation on the SPI-4 or CSIX bus to increase bandwidth.

[Block diagram: SPI-4 protocol and flow control logic, receive buffers and checksum unit feeding the receive datapath]

Figure A.6 MSF receive block diagram

[Block diagram: transmit datapath from the microengines through the transmit buffers to the SPI-4/CSIX pins, with flow control FIFOs and ready bits]

Figure A.7 MSF transmit block diagram

The network processor internally stores network packets as mpackets. Each mpacket has a fixed length of 64 bytes, with a small amount of side-band data that holds information about the mpacket. It is more efficient in hardware terms to break packets into fixed-size chunks, because the hardware can then be optimized for that specific size. All mpackets are marked as start, stop or none, which allows the network processor to tell the difference between the start of a packet, the end of a packet and a middle mpacket. The separation and reassembly of packets into mpackets is similar to how large data blocks are separated into packets for transmission over the network.
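As a conceptual model, an mpacket can be described by the following C record. The field names and the exact contents of the side-band data are assumptions for illustration; the real format is fixed by the MSF hardware:

#include <stdint.h>

enum mpkt_mark { MPKT_NONE, MPKT_START, MPKT_STOP };

/* A 64-byte mpacket plus side-band data. */
struct mpacket {
    uint8_t        data[64];   /* fixed-size chunk of the original packet */
    enum mpkt_mark mark;       /* start of packet, end of packet, or middle */
    uint8_t        valid_len;  /* bytes of data[] actually used (for the tail) */
};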

Appendix B

The Intel IXDP 2800 advanced development platform

The Intel IXDP 2800 advanced development system is a rapid prototyping platform designed by Intel. The IXDP 2800 advanced development system is also known as "Deer Island". The IXDP 2800 platform allows developers to quickly get up to speed with the IXP 2800 network processor without devoting time to building their own prototype board. The IXDP 2800 platform has all the necessary components to build a fully functional system that utilizes all of the power of the IXP 2800 network processor.

B.1 IXDP 2800 overview

Figure B.1 Intel IXDP 2800 development system

The Intel IXDP 2800, as shown in Figure B.1, is a development platform consisting of a 19-inch rack-mountable 2U enclosure that houses one network processor base card, one modular mezzanine card and a power supply. The network processor base card is based on the Intel dual network processor reference design for the IXP 2800. The mezzanine card that comes with the development platform is the IXDP2810, which has ten one-gigabit Ethernet ports with replaceable copper/fiber SFP modules. Another mezzanine card available for the IXDP 2800 is the IXD 28192, which has a single 10 Gb fiber Ethernet connection.

B.2 IXBM 2800 dual network processor base card

Figure B.2 Intel IXBM 2800 dual network processor base card

The Intel IXBM 2800 dual network processor base card, shown in Figure B.2, is designed around two IXP 2800 network processors (see Figure B.3). The two processors are separated into an ingress configuration (receiving data from the network) and an egress configuration (sending data to the network). The combined processing power of the two IXP 2800 network processors allows the IXBM 2800 to achieve a sustained data rate of OC-192 full duplex.

[Block diagram: ingress and egress IXP 2800 network processors connected by SPI-4 and CSIX buses]

Figure B.3 Intel IXBM 2800 block diagram

Each network processor has its own set of DRAM, SRAM and flash. This configuration allows a complete separation of network processor duties. The ingress and egress network processors are connected via a high speed communication bus that moves packets from the ingress network processor to the egress network processor.

Each network processor has a total of 768 MB of Rambus RDRAM, 128 MB of QDR SRAM and 16 MB of flash memory. The RDRAM is separated into three channels, while the QDR SRAM is separated into four channels. The flash is an Intel Strataflash non-volatile memory used to store the boot code and the embedded operating system.

The IXBM 2800 also has an Intel 21555 PCI bridge from the network processors to an external PCI device, and two 82559 10/100 Ethernet PHYs connected to the network processors' debug ports. Finally, the IXBM 2800 has a general purpose media interface connected to the SPI-4 bus of both network processors, allowing them to connect to the Intel mezzanine card.

B.3 IXDP 2810 mezzanine card

[Photograph of the IXDP 2810 mezzanine card]

Figure B.4 Intel IXDP 2810 mezzanine card

The IXDP 2810 mezzanine card (see Figure B.4) is based on the Intel IXF 1110 ten-port gigabit Ethernet switch. The IXDP 2810 provides ten individual 1000 Mbps full duplex Ethernet connections. Each Ethernet port is connected to a Small Form factor Pluggable (SFP) transceiver module. The SFP modules (see Figure B.5) give the user the flexibility to plug in either a fiber or a copper connector.

Figure B.5 Fiber/copper SFP modules

B.4 Intel IXA software development kit

The Intel IXA software development kit (SDK) contains all the necessary tools to write, compile and debug a complete network processor project. The SDK also contains the Intel Control Plane Platform Development Kit (CP-PDK), the Portability Framework and the Intel Building Blocks for network processors, as well as both the microengine C compiler and the assembler.

The centerpiece of the Intel SDK is the Developer Workbench (see Figure B.6). This application is used to write, compile and debug the microcode. The project management part of the Developer Workbench is similar to that of any Integrated Development Environment (IDE): it shows the current working source in an editor window as well as all the source files in the project.

[Screenshot of the Developer Workbench IDE with a project file tree and microcode source window]

Figure B.6 Intel developer workbench IDE

The Developer Workbench also has a cycle-accurate network processor simulator called the transactor (see Figure B.7). The transactor allows the developer to view microengine usage over time.

[Screenshot of the transactor's thread and cycle view]

Figure B.7 Intel developer workbench transactor

The Developer Workbench debugger (see Figure B.8) allows the developer to view debugging information in both simulation mode and hardware mode. In simulation mode, a simulated packet generator lets the developer inject packets into the simulated system to see how it will perform.

[Screenshot of the Developer Workbench debugger]

Figure B.8 Intel developer workbench debugger

Appendix C

Channel selection using other methods

The channel selection method used by the multihomed gateway is unique in that it attempts to determine the health of the external channels. Other channel selection methods are used by similar systems; these systems usually attempt to load balance across the external channels and assume that the channels are uniform and performing correctly. The main drawback of these channel selection methods is that they do not take external loading into account, so they break down during a channel loaded event. The two other channel selection methods discussed in this appendix are balanced load and random session placement.

C.1 Balanced load

The balanced load method simply balances the load across all external channels. This method requires the hardware to track the current channel usage in bits per second. When a new session arrives, the hardware compares the current usage of all the channels and places the new session on the channel with the least usage.

The balanced load method works well when both channels are operating correctly. More importantly, the balanced load method requires that all external channels have the same capacity; otherwise, all external channels will be forced to operate at the rate of the lowest capacity channel. The reason is that the algorithm strives to balance the load across all channels. Suppose channel 1 has a capacity of 50 KBps and channel 2 has a capacity of 100 KBps. If the internal network's traffic requirement is 100 KBps, the balanced load algorithm will divide the load into 50 KBps per channel. However, if the traffic requirements were to increase, the algorithm would break down. For example, suppose the traffic requirements of the internal network increase to 150 KBps. The algorithm will correctly place the first 100 KBps. Then, the algorithm will attempt to balance the remaining 50 KBps at 25 KBps per channel. The problem is that channel 1 no longer has the capacity to handle the extra 25 KBps, so channel 1's data rate remains at 50 KBps. The algorithm will then mistakenly attempt to balance the load and place more sessions onto channel 1, because channel 2's current data rate is a little higher than 50 KBps. This causes a condition where channel 2's extra capacity is not used while channel 1 is overloaded.
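For comparison with the capacity estimation approach, the balanced load decision itself is trivial, as the following C sketch shows (an illustrative fragment, not code from the thesis):

#include <stdint.h>

/* Balanced load: place a new session on whichever channel currently
 * carries the least traffic. There is no notion of per-channel
 * capacity, which is exactly why the method breaks down when the
 * channels are not uniform. */
static int choose_channel_balanced(const uint32_t usage_bps[], int nchannels)
{
    int best = 0;
    for (int ch = 1; ch < nchannels; ch++)
        if (usage_bps[ch] < usage_bps[best])
            best = ch;
    return best;
}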

C.2 Random session placement

Random session placement is another method used by some load balancers. The random session placement method produces surprisingly good results if many new sessions are created, because the randomness results in a roughly equal number of sessions placed on each channel. The main advantage of this method is that it is very easy to implement and does not require a lot of engineering effort. It is the only channel selection method that does not require at least some packet inspection by the hardware.

Random session placement performs similarly to the balanced load method, except that it obviously does not balance the load as well, since session placements are based on a random function. However, it is that very same random function that allows the random placement algorithm to work better in an externally loaded or channel capacity mismatch situation. The randomness keeps the decision making process from being trapped in a state where it always places new sessions one way. Therefore, in the case of mismatched channel capacities, the random algorithm stands a good chance of placing new sessions on the unloaded channel. The random session placement method performs reasonably well and is very easy to implement, which is why it is still used by many load balancers.
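Random placement needs no per-channel state at all; a minimal C sketch is shown below (illustrative only):

#include <stdlib.h>

/* Random session placement: pick any channel with equal probability.
 * No usage tracking and no packet inspection are required. */
static int choose_channel_random(int nchannels)
{
    return rand() % nchannels;
}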

Appendix D

Source code

D.1 Network processor microcode

//------------------------------------------------------------------------------
// dispatch loop
//------------------------------------------------------------------------------
.begin
.reg sig_mask
immed32[sig_mask, 0]

// get packets in order.
dl_source[DL_THREAD_ORDER, sig_mask]

.end

// Thread 0 of the IPV4 START_ME from here on
// will process only exception packets. There
// is enough headroom available and will not affect
// performance

#ifdef IPV4_START_ME
br=ctx[0, _handle_exception_pkt#]
#endif

loop#:

#ifdef ETHER_WITHOUT_L2
_ether_decap_classify(__PKTHDR, ETHER_HEADER_OFFSET)
#endif

// Align and copy the IP header into local memory. The
// header will be copied to a known local memory address
// depending on the header type (IPv4 or IPv6).
dl_copy_iphdr_to_lmem()

// If the packet is IPv4, the IPv4 header will now be in the local
// memory buffer lmem_ipv4_hdr at offset 0. First set the value
// of the IPv4 default output to the tunnel decap block.

// If the packet was encapsulated it needs to be handled by the IPv4
// forwarder. Reset the value of the IPv4 forwarder default output so
// that the packet will be directed to the next processing stage instead
// of the tunnel decapsulation block.

// Now call DL macros to send the current packet
// and receive the next packet. Depending on the dl_next_block, these
// macros can drop/send exception/send to next block. Also, we need to
// maintain packet order when we send/receive packets.

.reg sig_mask
.sig sig_scr_put

// allocate xfer registers to be used by dl_qm_sink.
#ifdef POTS
xbuf_alloc[$wxfer, 4, write]
#else
xbuf_alloc[$wxfer, 3, write]
#endif
immed32[sig_mask, 0]

// send packet.
dl_qm_sink[sig_mask, sig_scr_put, $wxfer]

// receive packet
dl_source[DL_THREAD_NO_ORDER, sig_mask]

// free the xfer registers allocated
xbuf_free[$wxfer]

// start again.
br[loop#]

#endif

// We never execute the below code. This is here for completeness only.
term#:

// frees the IP header cache
dl_iphdr_cache_fini[]

// Control should never come here.
kill#:
nop
nop

//////////////////////////////////////////////////////////////////////////////
// dl_copy_iphdr_to_lmem:
//
// description:
//
//   This macro copies the IP header to an aligned location in local
//   memory. The IP header can be either an IPv4 or IPv6 header.
//
// Outputs:
//   IP header is copied to local memory cache buffer.
//
// Inputs:
//   Header is currently in transfer registers.
//
//////////////////////////////////////////////////////////////////////////////
#macro dl_copy_iphdr_to_lmem()
.begin

.reg hdr_type
.reg context_id
.reg newiphdr
.reg oldchksum
.reg newchksum
.reg olddstaddr
.reg newdstaddr
.reg counter_base
.reg systemstat_base
.sig stats_sig
.sig stats_update_sig
.reg ch1_bwusage
.reg ch1_bwmax
.reg ch2_bwusage
.reg ch2_bwmax
.reg ch1_availableBW
.reg ch2_availableBW
.reg availableBW_compare
.reg hwtype
.reg hwaddr1
.reg hwaddr2
.reg Sipaddr1
.reg Sipaddr2
.reg Tipaddr1
.reg Tipaddr2
.reg inputport
.reg oldtcpchksum
.reg newtcpchksum
.reg oldudpchksum
.reg newudpchksum
.reg proto
.reg sa
.reg da
.reg ts
.reg loopcount
.reg streamcache_base
.reg streamcache_offset
.reg free_streamcache
.reg oldest_stream
.reg channeldiff
.reg rnumber
.reg ch1stat
.reg ch2stat

// Get thread ID for xbuf_activate.
dl_meta_get_header_type(hdr_type)
dl_meta_get_input_port(inputport)

// check to see if reset is requested
alu[systemstat_base, --, B, @ipv4_stats_base]
alu_shf[systemstat_base, systemstat_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]
xbuf_alloc[$resetsystem_xfer, 4, read_write]
scratch_read($resetsystem_xfer[0], systemstat_base, 0x20, 4, stats_sig, stats_sig, -)
.if ($resetsystem_xfer[0] == 1)
    immed32(loopcount, 0)
    immed32($resetsystem_xfer[0], 0)
    immed32($resetsystem_xfer[1], 0)
    immed32($resetsystem_xfer[2], 0)
    immed32($resetsystem_xfer[3], 0)
    .while (loopcount != 0x200)
        scratch_write($resetsystem_xfer[0], systemstat_base, loopcount, 4, stats_sig, stats_sig, -)
        alu[loopcount, loopcount, +, 0x10]
    .endw
.endif
xbuf_free[$resetsystem_xfer]

alu[streamcache_base, --, B, @ipv4_stats_base]
alu_shf[streamcache_base, streamcache_base, OR, 3, <<STATS_PER_PORT_SHF_VAL]
xbuf_alloc[$deadchannel_xfer, 1, read_write]
xbuf_alloc[$killsession_xfer, 4, read_write]
immed32(streamcache_offset, 0)
immed32(ch1stat, 0)
immed32(ch2stat, 0)

// see if channel 1 is still working
scratch_read($deadchannel_xfer[0], systemstat_base, 0x60, 1, stats_sig, stats_sig, -)
.if ($deadchannel_xfer[0] != 0)
    get_shifted_ts(ts)
    alu[channeldiff, ts, -, $deadchannel_xfer[0]]
    .if (channeldiff > 0x00008000)
        // channel 1 failed, mark it and delete channel 1's active sessions
        immed32(ch1stat, 1)
        move[$deadchannel_xfer[0], 1]
        scratch_write($deadchannel_xfer[0], systemstat_base, 0x28, 1, stats_sig, stats_sig, -)
        .while (streamcache_offset != 0x160)
            scratch_read($killsession_xfer[0], streamcache_base, streamcache_offset, 4, stats_sig, stats_sig, -)
            .if ($killsession_xfer[3] == 1)
                immed32($killsession_xfer[0], 0)
                immed32($killsession_xfer[1], 0)
                immed32($killsession_xfer[2], 0)
                immed32($killsession_xfer[3], 0)
                scratch_write($killsession_xfer[0], streamcache_base, streamcache_offset, 4, stats_sig, stats_sig, -)
            .endif
            alu[streamcache_offset, streamcache_offset, +, 0x10]
        .endw
    .endif
.endif

// see if channel 2 is still working
scratch_read($deadchannel_xfer[0], systemstat_base, 0xa0, 1, stats_sig, stats_sig, -)
.if ($deadchannel_xfer[0] != 0)
    get_shifted_ts(ts)
    alu[channeldiff, ts, -, $deadchannel_xfer[0]]
    .if (channeldiff > 0x00008000)
        // channel 2 failed, mark it and delete channel 2's active sessions
        immed32(ch2stat, 1)
        move[$deadchannel_xfer[0], 1]
        scratch_write($deadchannel_xfer[0], systemstat_base, 0x2c, 1, stats_sig, stats_sig, -)
        .while (streamcache_offset != 0x160)
            scratch_read($killsession_xfer[0], streamcache_base, streamcache_offset, 4, stats_sig, stats_sig, -)
            .if ($killsession_xfer[3] == 2)
                immed32($killsession_xfer[0], 0)
                immed32($killsession_xfer[1], 0)
                immed32($killsession_xfer[2], 0)
                immed32($killsession_xfer[3], 0)
                scratch_write($killsession_xfer[0], streamcache_base, streamcache_offset, 4, stats_sig, stats_sig, -)
            .endif
            alu[streamcache_offset, streamcache_offset, +, 0x10]
        .endw
    .endif
.endif

xbuf_free[$deadchannel_xfer]
xbuf_free[$killsession_xfer]
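// To summarize the two tests above: a channel is declared dead when
// its outstanding tx timestamp is nonzero and more than 0x8000
// shifted-timestamp ticks old; the stream cache entries tagged with
// that channel number are then zeroed so those sessions can be
// re-placed on the surviving channel.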

// only run this code if this is an IPv4 or ARP packet

// update the packet count as seen here
alu[counter_base, --, B, @ipv4_stats_base]
alu_shf[counter_base, counter_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]
scratch_incr(counter_base, 0x30)

// Activate the IPv4 buffer and then copy the IPv4 header
// into the local memory buffer.
xbuf_activate(LM_IPV4_HDR, 0, context_id, 1)
xbuf_copy(LM_IPV4_HDR, 0, 0, $$iphdr, PKT_HDR_BYTES_ETHER, 0, LM_XFER_BYTES_TCP, 0)

alu[systemstat_base, --, B, @ipv4_stats_base]
alu_shf[systemstat_base, systemstat_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]

xbuf_extract(hwtype, LM_IPV4_HDR, 0, 0, 2)
.if (hwtype == 0x0001)
    // this is an ARP packet
    xbuf_insert[LM_IPV4_HDR, 0x0002, 0, 6, 2]   // make this an ARP response
    // move the source address to target
    xbuf_extract(hwaddr1, LM_IPV4_HDR, 0, 8, 4)
    xbuf_extract(hwaddr2, LM_IPV4_HDR, 0, 12, 2)
    xbuf_extract(Sipaddr1, LM_IPV4_HDR, 0, 14, 2)
    xbuf_extract(Sipaddr2, LM_IPV4_HDR, 0, 16, 2)
    xbuf_extract(Tipaddr1, LM_IPV4_HDR, 0, 24, 2)
    xbuf_extract(Tipaddr2, LM_IPV4_HDR, 0, 26, 2)
    xbuf_insert[LM_IPV4_HDR, hwaddr1, 0, 18, 4]
    xbuf_insert[LM_IPV4_HDR, hwaddr2, 0, 22, 2]
    xbuf_insert[LM_IPV4_HDR, Sipaddr1, 0, 24, 2]
    xbuf_insert[LM_IPV4_HDR, Sipaddr2, 0, 26, 2]
    // load the appropriate MAC depending on the port
    immed32(hwaddr1, 0x0090d800)
    move[hwaddr2, inputport]
    // insert the MAC address into the frame
    xbuf_insert[LM_IPV4_HDR, hwaddr1, 0, 8, 4]
    xbuf_insert[LM_IPV4_HDR, hwaddr2, 0, 12, 2]
    xbuf_insert[LM_IPV4_HDR, Tipaddr1, 0, 14, 2]
    xbuf_insert[LM_IPV4_HDR, Tipaddr2, 0, 16, 2]
.else
    xbuf_extract(proto, LM_IPV4_HDR, 0, IP_PROTOCOL)
    xbuf_extract(oldchksum, LM_IPV4_HDR, 0, IP_CHECKSUM)
    xbuf_extract(oldtcpchksum, LM_IPV4_HDR, 0, 36, 2)
    xbuf_extract(oldudpchksum, LM_IPV4_HDR, 0, 26, 2)
    xbuf_extract(sa, LM_IPV4_HDR, 0, IP_SOURCE_ADDRESS)
    xbuf_extract(da, LM_IPV4_HDR, 0, IP_DESTINATION_ADDRESS)
    xbuf_alloc[$channelselect_xfer, 8, read_write]

    // allocate space for the packet timers and read them from scratch RAM
    xbuf_alloc[$pkt_timer, 4, read_write]
    scratch_read($pkt_timer[0], systemstat_base, 0x10, 4, stats_sig, stats_sig, -)

    .if (inputport == 0x0000)
        // incoming from port 0, attempt to load balance.
        // check the stream cache for this stream
        immed32(loopcount, 0)
        immed32(streamcache_offset, 0)
        immed32(free_streamcache, 0)
        immed32(availableBW_compare, 0)
        immed32(oldest_stream, 0x0fffffff)
        .while (loopcount != 16)
            scratch_read($channelselect_xfer[0], streamcache_base, streamcache_offset, 4, stats_sig, stats_sig, -)
            .if (da == $channelselect_xfer[1])
                move[free_streamcache, streamcache_offset]
                .if ($channelselect_xfer[3] == 1)

                    br[select_ch1#]
                .else
                    br[select_ch2#]
                .endif
            .endif
            // no match found so far; maybe an entry should be bumped out of
            // the streamcache table. See if this is the entry to bump.
            .if ($channelselect_xfer[0] == 0)
                .if (oldest_stream > $channelselect_xfer[2])
                    alu[oldest_stream, --, B, $channelselect_xfer[2]]
                    // move[oldest_stream, $channelselect_xfer[2]]
                    move[free_streamcache, streamcache_offset]
                .endif
            .endif
            alu[loopcount, loopcount, +, 0x01]
            alu[streamcache_offset, streamcache_offset, +, 0x10]
        .endw

        // get the stats out of scratch RAM for both channels
        alu[counter_base, --, B, @ipv4_stats_base]
        alu_shf[counter_base, counter_base, OR, 1, <<STATS_PER_PORT_SHF_VAL]
        scratch_read($channelselect_xfer[0], counter_base, 0, 4, stats_sig, stats_sig, -)
        move[ch1_bwusage, $channelselect_xfer[0]]
        move[ch1_bwmax, $channelselect_xfer[1]]
        alu[ch1_availableBW, ch1_bwmax, -, ch1_bwusage]

        alu[counter_base, --, B, @ipv4_stats_base]
        alu_shf[counter_base, counter_base, OR, 2, <<STATS_PER_PORT_SHF_VAL]
        scratch_read($channelselect_xfer[0], counter_base, 0, 4, stats_sig, stats_sig, -)
        move[ch2_bwusage, $channelselect_xfer[0]]
        move[ch2_bwmax, $channelselect_xfer[1]]
        alu[ch2_availableBW, ch2_bwmax, -, ch2_bwusage]

        move[$channelselect_xfer[2], ch1_availableBW]
        move[$channelselect_xfer[3], ch2_availableBW]
        scratch_write($channelselect_xfer[2], systemstat_base, 8, 1, stats_sig, stats_sig, -)
        scratch_write($channelselect_xfer[3], systemstat_base, 12, 1, stats_sig, stats_sig, -)

        // never select a channel that's dead
        local_csr_rd[pseudo_random_number]
        immed32[rnumber, 0]
        alu_shf[rnumber, --, B, rnumber, >>16]
        .if (ch1stat == 1)
            // unless it's time to probe the channel
            .if (rnumber < 0x1000)
                br[select_ch1#]
            .else
                br[select_ch2#]
            .endif
        .endif
        .if (ch2stat == 1)
            // unless it's time to probe the channel
            .if (rnumber < 0x1000)
                br[select_ch2#]
            .else
                br[select_ch1#]
            .endif
        .endif
        // always select a channel with 0 max bw
        .if (ch1_bwmax == 0)
            br[select_ch1#]

        .endif
        .if (ch2_bwmax == 0)
            br[select_ch2#]
        .endif

        // determine which channel is more free
        // .if (ch1_availableBW > ch2_availableBW)
        // .if (ch1_bwusage < ch2_bwusage)
        .if (rnumber < 0x8000)
            // check to see if the timer slot is open
            .if ($pkt_timer[0] == 0)
                // timer slot is free, write da and current time
                move[$pkt_timer[0], da]
                get_shifted_ts(ts)
                move[$pkt_timer[1], ts]
                scratch_write($pkt_timer[0], systemstat_base, 0x10, 2, stats_sig, stats_sig, -)
            .endif
            br[select_ch1#]
        .else
            .if ($pkt_timer[2] == 0)
                // timer slot is free, write da and current time
                move[$pkt_timer[2], da]
                get_shifted_ts(ts)
                move[$pkt_timer[3], ts]
                scratch_write($pkt_timer[2], systemstat_base, 0x18, 2, stats_sig, stats_sig, -)
            .endif
            br[select_ch2#]
        .endif
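        // Note: the two commented-out conditions above are the
        // available-bandwidth and usage-based choosers; as shipped, the
        // code splits new sessions 50/50 on the random number and, when
        // the corresponding timer slot is free, records the destination
        // address and a timestamp so the return packet can serve as a
        // timing probe.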

select_ch1#:
// change the source address to channel 1
immed32($channelselect_xfer[3], 1)
move[newiphdr, 0xc0a80101]
immed[olddstaddr, 0x0002, 0]
immed[newdstaddr, 0x0101, 0]
// write channel 1 tx timestamp
xbuf_alloc[$ch1ts_xfer, 1, read_write]
scratch_read($ch1ts_xfer[0], systemstat_base, 0x60, 1, stats_sig, stats_sig, -)
.if ($ch1ts_xfer[0] == 0)
    get_shifted_ts(ts)
    move[$ch1ts_xfer[0], ts]
    scratch_write($ch1ts_xfer[0], systemstat_base, 0x60, 1, stats_sig, stats_sig, -)
.endif
xbuf_free[$ch1ts_xfer]
br[stream_update#]

select_ch2#:
// change the source address to channel 2
immed32($channelselect_xfer[3], 2)
move[newiphdr, 0xc0a80201]
immed[olddstaddr, 0x0002, 0]
immed[newdstaddr, 0x0201, 0]
// write channel 2 tx timestamp
xbuf_alloc[$ch2ts_xfer, 1, read_write]
scratch_read($ch2ts_xfer[0], systemstat_base, 0xa0, 1, stats_sig, stats_sig, -)
.if ($ch2ts_xfer[0] == 0)
    get_shifted_ts(ts)
    move[$ch2ts_xfer[0], ts]
    scratch_write($ch2ts_xfer[0], systemstat_base, 0xa0, 1, stats_sig, stats_sig, -)
.endif
xbuf_free[$ch2ts_xfer]

stream_update#:
// update the stream statistics
move[$channelselect_xfer[0], sa]
move[$channelselect_xfer[1], da]
get_shifted_ts(ts)
move[$channelselect_xfer[2], ts]
scratch_write($channelselect_xfer[0], streamcache_base, free_streamcache, 4, stats_sig, stats_sig, -)

continue_nat#:
_get_new_checksum(newchksum, newdstaddr, olddstaddr, oldchksum)
xbuf_insert[LM_IPV4_HDR, newiphdr, 0, IP_SOURCE_ADDRESS]
xbuf_insert[LM_IPV4_HDR, newchksum, 0, IP_CHECKSUM]
.if (proto == 0x06)
    // tcp
    _get_new_checksum(newtcpchksum, newdstaddr, olddstaddr, oldtcpchksum)
    xbuf_insert[LM_IPV4_HDR, newtcpchksum, 0, 36, 2]
.elif (proto == 0x11)
    // udp
    _get_new_checksum(newudpchksum, newdstaddr, olddstaddr, oldudpchksum)
    xbuf_insert[LM_IPV4_HDR, newudpchksum, 0, 26, 2]
.endif
move[$channelselect_xfer[1], availableBW_compare]
scratch_write($channelselect_xfer[1], systemstat_base, 4, 1, stats_sig, stats_sig, -)

.else
    // else, change the DA and route to port 0
    move[newiphdr, 0xc0a80002]
    .if (inputport == 0x0001)
        immed[olddstaddr, 0x0101, 0]
    .else
        immed[olddstaddr, 0x0201, 0]
    .endif
    immed[newdstaddr, 0x0002, 0]
    _get_new_checksum(newchksum, newdstaddr, olddstaddr, oldchksum)
    xbuf_insert[LM_IPV4_HDR, newiphdr, 0, IP_DESTINATION_ADDRESS]
    xbuf_insert[LM_IPV4_HDR, newchksum, 0, IP_CHECKSUM]
    .if (proto == 0x06)
        _get_new_checksum(newtcpchksum, newdstaddr, olddstaddr, oldtcpchksum)
        xbuf_insert[LM_IPV4_HDR, newtcpchksum, 0, 36, 2]
    .elif (proto == 0x11)
        _get_new_checksum(newudpchksum, newdstaddr, olddstaddr, oldudpchksum)
        xbuf_insert[LM_IPV4_HDR, newudpchksum, 0, 26, 2]
    .endif

    xbuf_alloc[$chts_xfer, 1, read_write]
    .if (inputport == 0x01)
        // clear ch1's tx timestamp, ch1 still good
        move[$chts_xfer[0], 0]
        scratch_write($chts_xfer[0], systemstat_base, 0x60, 1, stats_sig, stats_sig, -)
        scratch_write($chts_xfer[0], systemstat_base, 0x28, 1, stats_sig, stats_sig, -)
    .elif (inputport == 0x02)
        // clear ch2's tx timestamp, ch2 still good
        move[$chts_xfer[0], 0]
        scratch_write($chts_xfer[0], systemstat_base, 0xa0, 1, stats_sig, stats_sig, -)
        scratch_write($chts_xfer[0], systemstat_base, 0x2c, 1, stats_sig, stats_sig, -)
    .endif
    xbuf_free[$chts_xfer]

    // check to see if this is a return timing packet
    .if (sa == $pkt_timer[0])
        // matched to channel 1 timer
        alu[counter_base, --, B, @ipv4_stats_base]
        alu_shf[counter_base, counter_base, OR, 1, <<STATS_PER_PORT_SHF_VAL]
        get_shifted_ts(ts)
        move[$channelselect_xfer[0], ts]
        move[$channelselect_xfer[1], $pkt_timer[1]]
        alu[$channelselect_xfer[2], ts, -, $pkt_timer[1]]
        scratch_write($channelselect_xfer[0], counter_base, 0x10, 3, stats_sig, stats_sig, -)
        move[$pkt_timer[0], 0]
        scratch_write($pkt_timer[0], systemstat_base, 0x10, 1, stats_sig, stats_sig, -)
    .endif
    .if (sa == $pkt_timer[2])
        // matched to channel 2 timer
        alu[counter_base, --, B, @ipv4_stats_base]
        alu_shf[counter_base, counter_base, OR, 2, <<STATS_PER_PORT_SHF_VAL]
        get_shifted_ts(ts)
        move[$channelselect_xfer[0], ts]
        move[$channelselect_xfer[1], $pkt_timer[3]]
        alu[$channelselect_xfer[2], ts, -, $pkt_timer[3]]
        scratch_write($channelselect_xfer[0], counter_base, 0x10, 3, stats_sig, stats_sig, -)
        move[$pkt_timer[2], 0]
        scratch_write($pkt_timer[2], systemstat_base, 0x18, 1, stats_sig, stats_sig, -)
    .endif
.endif
xbuf_free[$channelselect_xfer]
xbuf_free[$pkt_timer]
.endif

.endif

copy_cache_done#:

.end

#endm
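The _get_new_checksum calls in the macro above patch the IPv4 header checksum (and the TCP or UDP checksum) incrementally after a 16-bit address field is rewritten, rather than recomputing the checksum over the whole header. The following is a minimal C sketch of that adjustment, assuming the standard one's-complement incremental update of RFC 1624; the register-level form of the microcode helper is not reproduced here:

#include <stdint.h>

/* Adjust a 16-bit one's-complement checksum after a 16-bit field
 * changes from old_val to new_val (RFC 1624, Eqn. 3):
 *     HC' = ~(~HC + ~m + m')
 * A full 32-bit address change is handled as two 16-bit updates. */
static uint16_t csum_adjust(uint16_t csum, uint16_t old_val,
                            uint16_t new_val)
{
    uint32_t sum = (uint16_t)~csum;
    sum += (uint16_t)~old_val;
    sum += new_val;
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}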

//////////////////////////////////////////////////////////////////////////////
//
// _ipv4_fwder()
//
// Description:
//
//   Verify IP header and forward the packet.
//
// Outputs:
//
//   out_result - IPV4_SUCCESS or IPV4_FAILURE or exception code
//   out_ip     - Register buffer for modified IP header
//
// Inputs:
//
//   in_ip - buffer containing packet header
//
//   IPHDR_WR_START_BYTE - byte address relative to start of out_ip
//
//   IPHDR_RD_START_BYTE - byte address relative to start of in_ip
//
// Size:
//
//   ?? instructions. (Worst case cycle count)
//
//////////////////////////////////////////////////////////////////////////////
#macro _ipv4_fwder(out_ip, in_ip, IPHDR_WR_START_BYTE, IPHDR_RD_START_BYTE)

// get rid of white space from constants
#define_eval WR_START IPHDR_WR_START_BYTE
#define_eval RD_START IPHDR_RD_START_BYTE

.begin
.reg ip_total_len        // total length read from IP header
.reg in_port             // port from which this packet is rxed
.reg counter_base        // base where the stats are maintained
.reg nexthop_index       // index of next hop information
.reg out_result          // result of macro calls
.sig control_block_sig   // signal used to read control block
.sig da_block_sig
.reg dacompare
.reg tslow
.reg tscompare
.reg bwcurrent
.reg bwmax
.reg dastats
.reg exit_port
.reg exit_hop
.reg dlval
.reg hwtype
.reg sa
.reg droppacket
.reg portnumber
.reg portoffset
.reg avglen_base
.reg avglen
.reg proto

.begin
.reg thread_id

xbuf_activate(in_ip, 0, thread_id, 0)
xbuf_activate(out_ip, 1, thread_id, 0)

.end
#endif

// check for our block id. If it doesn't match, exit.
br!=byte[dl_next_block, 0, BID_IPV4, ipv4_hdr_copy#]

// update the total length count as seen here
xbuf_alloc[$len_xfer, 1, read_write]
alu[counter_base, --, B, @ipv4_stats_base]
alu_shf[counter_base, counter_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]
xbuf_extract(ip_total_len, in_ip, RD_START, IP_TOTAL_LENGTH)
alu[$len_xfer[0], ip_total_len, +, 0x18]
scratch_add($len_xfer[0], counter_base, 0x34, da_block_sig, da_block_sig, -)
xbuf_free[$len_xfer]

// get the input port number.
dl_meta_get_input_port(in_port)

// retrieve the exit port from SRAM
alu[counter_base, --, B, @ipv4_stats_base]
alu_shf[counter_base, counter_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]
xbuf_extract(sa, in_ip, RD_START, IP_SOURCE_ADDRESS)
.if (sa == 0xc0a80101)
    move[exit_port, 1]
    move[exit_hop, 4]
.elif (sa == 0xc0a80201)
    move[exit_port, 2]
    move[exit_hop, 5]
.else
    move[exit_port, 0]
    move[exit_hop, 3]
.endif

continue_cbase#:

// setup the output port and hop id
xbuf_extract(hwtype, in_ip, RD_START, 0, 2)
.if (hwtype == 0x0001)
    // this is an ARP packet
    dl_meta_set_output_port[in_port]   // ARP packets always go out the same port
    dl_meta_set_nexthop_id_type[0x06]
    dl_meta_set_nexthop_id[in_port]
    dl_meta_set_fabric_port[0x01]
.else
    // normal IP packet
    dl_meta_set_output_port[exit_port] // output port already defined
    dl_meta_set_nexthop_id_type[0x00]
    dl_meta_set_nexthop_id[exit_hop]
    dl_meta_set_fabric_port[0x02]
    xbuf_alloc[$channelstatus_xfer, 1, read_write]
    .if (in_port == 0x00)
        // if incoming from port 0, don't calculate bw
        br[continue_fwder#]
    .endif

    alu[counter_base, --, B, @ipv4_stats_base]
    alu_shf[counter_base, counter_base, OR, in_port, <<STATS_PER_PORT_SHF_VAL]

    // increment the current length count
    xbuf_extract(ip_total_len, in_ip, RD_START, IP_TOTAL_LENGTH)
    alu[$channelstatus_xfer[0], ip_total_len, +, 0x18]
    scratch_add($channelstatus_xfer[0], counter_base, 0, da_block_sig, da_block_sig, -)

    // increment the channel total packet and length count
    scratch_incr(counter_base, 0x08)
    scratch_add($channelstatus_xfer[0], counter_base, 0x0c, da_block_sig, da_block_sig, -)

    // update the average length table
    xbuf_alloc[$avglen_xfer, 2, read_write]
    xbuf_extract(portnumber, LM_IPV4_HDR, 0, 20, 2)
    xbuf_extract(proto, LM_IPV4_HDR, 0, IP_PROTOCOL)
    alu_shf[portoffset, --, B, portnumber, <<2]
    alu[avglen_base, --, B, @ipv4_stats_base]
    alu_shf[avglen_base, avglen_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]
    alu[$avglen_xfer[0], --, B, portnumber]
    alu[$avglen_xfer[1], --, B, portoffset]
    scratch_write($avglen_xfer[0], avglen_base, 0x38, 2, da_block_sig, da_block_sig, -)
    .if (proto == 0x06)
        br[cal_avglen#]
    .elif (proto == 0x11)
        br[cal_avglen#]
    .else
        br[end_avglen#]
    .endif

cal_avglen#:
    alu[avglen_base, --, B, @ipv4_stats_base]
    alu_shf[avglen_base, avglen_base, OR, 7, <<STATS_PER_PORT_SHF_VAL]
    scratch_read($avglen_xfer[0], avglen_base, portoffset, 1, da_block_sig, da_block_sig, -)
    .if ($avglen_xfer[0] == 0)
        alu[$avglen_xfer[0], --, B, ip_total_len]
    .else
        // exponentially weighted average: avg = (15*avg + len) / 16
        alu_shf[avglen, --, B, $avglen_xfer[0], <<4]
        alu[avglen, avglen, -, $avglen_xfer[0]]
        alu[avglen, avglen, +, ip_total_len]
        alu_shf[$avglen_xfer[0], --, B, avglen, >>4]
    .endif
    scratch_write($avglen_xfer[0], avglen_base, portoffset, 1, da_block_sig, da_block_sig, -)

end_avglen#:
    xbuf_free[$avglen_xfer]
    // load the old timestamp
    alu[counter_base, --, B, @ipv4_stats_base]
    alu_shf[counter_base, counter_base, OR, 0, <<STATS_PER_PORT_SHF_VAL]
    scratch_read($channelstatus_xfer[0], counter_base, 0, 1, da_block_sig, da_block_sig, -)
    // check to see if the timestamp changed
    local_csr_rd[timestamp_low]
    immed32[tslow, 0]
    alu_shf[tslow, --, B, tslow, >>26]
    alu[tscompare, tslow, -, $channelstatus_xfer[0]]

    // if the timestamp is the same, jump to continue_fwder
    beq[continue_fwder#]

    // otherwise, update the old timestamp with the new value
    move($channelstatus_xfer[0], tslow)
    scratch_write($channelstatus_xfer[0], counter_base, 0, 1, da_block_sig, da_block_sig, -)

    // update the max bandwidth of channel 1
    alu[counter_base, --, B, @ipv4_stats_base]
    alu_shf[counter_base, counter_base, OR, 1, <<STATS_PER_PORT_SHF_VAL]
    scratch_read($channelstatus_xfer[0], counter_base, 0, 1, da_block_sig, da_block_sig, -)
    move(bwcurrent, $channelstatus_xfer[0])
    scratch_read($channelstatus_xfer[0], counter_base, 4, 1, da_block_sig, da_block_sig, -)
    move(bwmax, $channelstatus_xfer[0])
    move($channelstatus_xfer[0], 0x00000000)
    scratch_write($channelstatus_xfer[0], counter_base, 0, 1, da_block_sig, da_block_sig, -)

    .if (bwcurrent > bwmax)
        // update the max bw
        move($channelstatus_xfer[0], bwcurrent)
    .else
        // decay the max bw
        .if (bwcurrent <= 0x10)
            move($channelstatus_xfer[0], 0x00)
        .else
            alu[$channelstatus_xfer[0], bwcurrent, -, 0x01]
        .endif
    .endif
    scratch_write($channelstatus_xfer[0], counter_base, 4, 1, da_block_sig, da_block_sig, -)

continue_ch2#:
    // update the max bandwidth of channel 2
    alu[counter_base, --, B, @ipv4_stats_base]
    alu_shf[counter_base, counter_base, OR, 2, <<STATS_PER_PORT_SHF_VAL]
    scratch_read($channelstatus_xfer[0], counter_base, 0, 1, da_block_sig, da_block_sig, -)
    move(bwcurrent, $channelstatus_xfer[0])
    scratch_read($channelstatus_xfer[0], counter_base, 4, 1, da_block_sig, da_block_sig, -)
    move(bwmax, $channelstatus_xfer[0])
    move($channelstatus_xfer[0], 0x00000000)
    scratch_write($channelstatus_xfer[0], counter_base, 0, 1, da_block_sig, da_block_sig, -)

    .if (bwcurrent > bwmax)
        // update the max bw
        move($channelstatus_xfer[0], bwcurrent)
    .else
        // decay the max bw
        .if (bwcurrent <= 0x10)
            move($channelstatus_xfer[0], 0x00)
        .else
            alu[$channelstatus_xfer[0], bwcurrent, -, 0x01]
        .endif
    .endif
    scratch_write($channelstatus_xfer[0], counter_base, 4, 1, da_block_sig, da_block_sig, -)

continue_fwder#:
    xbuf_free[$channelstatus_xfer]

.endif

// we are done with ipv4 fwding. Exit this macro.
br[ipv4_fwder_finish#], defer[1]
immed[dl_next_block, IPV4_NEXT1]

ipv4_hdr_copy#:
// copy header into dst buffer
xbuf_copy(out_ip, 0, WR_START, in_ip, RD_START, 0, 20, 0)
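The capacity-estimation bookkeeping that the two channel blocks above perform in scratch memory reduces to the following C sketch (illustrative names; the microcode runs this once per coarse timestamp tick). The byte count observed in the last interval raises the stored maximum when it exceeds it; otherwise the maximum is pulled down to just under the current usage so that stale capacity estimates decay:

#include <stdint.h>

/* Per-tick capacity estimate for one channel.  bwcurrent is the
 * byte count accumulated since the previous timestamp change. */
static void update_bwmax(uint32_t *bwmax, uint32_t bwcurrent)
{
    if (bwcurrent > *bwmax)
        *bwmax = bwcurrent;       /* new high-water mark       */
    else if (bwcurrent <= 0x10)
        *bwmax = 0;               /* essentially idle: reset   */
    else
        *bwmax = bwcurrent - 1;   /* decay toward recent usage */
}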

D.2 Xscale C code

int xscale_me_status(char *buf, char **start, off_t offset,
                     int len, int *eof, void *data)
{
    int i;
    unsigned int base = IPV4_STATS_TABLE_BASE;
    unsigned int val1, val2, val3, val4;
    char saddrbuf[100], daddrbuf[100];

    sprintf(buf, "\n\nmultihomed gateway proc\n");
    sprintf(buf+strlen(buf), "Kenny Tung\n\n");

    sprintf(buf+strlen(buf), "            (packets)    (bytes)\n");
    sprintf(buf+strlen(buf), "aggregate %10d %10d\n",
            get_scratch(base+0x30), get_scratch(base+0x34));
    sprintf(buf+strlen(buf), "channel 1 %10d %10d\n",
            get_scratch(base+0x48), get_scratch(base+0x4c));
    sprintf(buf+strlen(buf), "channel 2 %10d %10d\n\n",
            get_scratch(base+0x88), get_scratch(base+0x8c));

    sprintf(buf+strlen(buf), "channel 1 last rx: 0x%08x, status: ",
            get_scratch(base+0x60));
    if (get_scratch(base+0x28) == 0)
        sprintf(buf+strlen(buf), "OK\n");
    else
        sprintf(buf+strlen(buf), "FAIL\n");
    sprintf(buf+strlen(buf), "channel 2 last rx: 0x%08x, status: ",
            get_scratch(base+0xa0));
    if (get_scratch(base+0x2c) == 0)
        sprintf(buf+strlen(buf), "OK\n");
    else
        sprintf(buf+strlen(buf), "FAIL\n\n");

    //sprintf(buf+strlen(buf), "0x%08x, 0x%08x, %d, %d\n\n",
    //        get_scratch(base+0x60), get_scratch(base+0xa0),
    //        get_scratch(base+0x28), get_scratch(base+0x2c));
    sprintf(buf+strlen(buf), "ch1 rt: 0x%08x, ch2 rt: 0x%08x\n\n",
            get_scratch(base+0x58), get_scratch(base+0x98));

    sprintf(buf+strlen(buf),
            "(srce addr) (dest addr) (timestamp) (channel)\n");
    for (i = 0; i != 16; i++) {
        decodeaddr(get_scratch(base+0xc0+(i*0x10)), saddrbuf);
        decodeaddr(get_scratch(base+0xc4+(i*0x10)), daddrbuf);
        sprintf(buf+strlen(buf), "%s %s 0x%08x %d\n",
                saddrbuf, daddrbuf,
                get_scratch(base+0xc8+(i*0x10)),
                get_scratch(base+0xcc+(i*0x10)));
    }

    sprintf(buf+strlen(buf), "\n(port) (avg len)\n");
    for (i = 0; i

    /*
    // for (i = 0; i
    sprintf(buf+strlen(buf), "0x%04x: 0x%08x 0x%08x 0x%08x 0x%08x\n",
            i, val1, val2, val3, val4);
    }
    */

    *eof = 1;
    return strlen(buf);
}
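xscale_me_status() follows the read_proc callback convention of the Linux kernels of this era. A hookup would typically look like the sketch below; the /proc entry name "mhgw" and the module wrapper are illustrative assumptions, not taken from the thesis sources:

#include <linux/module.h>
#include <linux/proc_fs.h>

extern int xscale_me_status(char *buf, char **start, off_t offset,
                            int len, int *eof, void *data);

static int __init mhgw_proc_init(void)
{
    /* Expose the gateway statistics as /proc/mhgw. */
    create_proc_read_entry("mhgw", 0, NULL, xscale_me_status, NULL);
    return 0;
}

static void __exit mhgw_proc_exit(void)
{
    remove_proc_entry("mhgw", NULL);
}

module_init(mhgw_proc_init);
module_exit(mhgw_proc_exit);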
