Bowman: A Node OS for Active Networks

¡ ¢

S. Merugu S. Bhattacharjee E. Zegura K. Calvert

¡ ¢

College of Computing, Dept. of Computer Science, Dept. of Computer Science, Georgia Tech., Univ. of Maryland, Univ. of Kentucky,

Atlanta, GA. College Park, MD. Lexington, KY. £

merugu, ewz ¤ @cc.gatech.edu [email protected] [email protected]

AbstractÐBowman is an extensible platform for active networking: it ef®cient and ¯exible packet classi®cation algorithm. Com- layers active-networking functionality in user-space software over variants putation on behalf of a ¯ow occurs in its own set of compute of the System V UNIX operating system. The packet processing path im- plemented in Bowman incorporates an ef®cient and ¯exible packet classi®- threads. Thus, the internal path of Bow- cation algorithm, supports multi-threaded per-¯ow processing, and utilizes man is inherently multi-threaded. real-time processor schedulingto achieve deterministicperformance inuser- ¥ Provide a fast-path. For packets that do not require per- space. In this paper we describe the design and implementation of Bowman; ¯ow processing, Bowman provides cut-through channels, discuss the support that Bowman provides for implementing execution en- vironments for active networking; discuss the network-level architecture i.e., paths through the Bowman packet processing cycle that of Bowman that can be used to implement virtual networks; and present do not incur the overheads of multi-threaded processing. performance data. Bowman is able to sustain 100 Mbps while ¥ Enable a network-wide architecture. Globally, Bowman forwarding IP packets over fast . implements a network-wide architecture by providing sys- tem support for multiple simultaneous abstract topologies, I. INTRODUCTION i.e., abstractions that can be used to imple- Active networks provide a programmable user-network inter- ment virtual networks. face that supports dynamic modi®cation of the network's be- ¥ Maintain reasonable performance. Bowman is multi- havior. Such dynamic control is potentially useful on multiple processor capable and provides deterministic performance levels: in user-space by utilizing POSIX real-time extensions for

¥ For a network provider, active networks have the potential processor scheduling [2]. Even though Bowman is imple- to reduce the time required to deploy new protocols and mented entirely in user space, it delivers high performance. network services. IP forwarding through Bowman saturates 100 Mbps Ether-

¥ At a ®ner level of granularity, active networks might en- net, and the Bowman classi®er is able to classify packets at able users or third parties to create and tailor services to gigabit rates while matching on multiple ®elds. particular applications and current network conditions.

¥ For researchers, a dynamically-programmable network of- fers a platform for experimenting with new network services The Bowman software architecture is highly modular and exten- and features on a realistic scale without disrupting regular sible. All the major parts of a Bowman node Ðcommunication network service. protocol implementations, protocols and associated data The programming interface supported by the active network structures, code-fetch mechanisms, per-¯ow processing and out- de®nesthe ªvirtual machineºpresent at each node of the network. put queuing mechanismsÐ are loaded dynamically at run-time. Depending on its design, such a virtual machine can provide a The Bowman implementation can be ported to UNIX systems generic mechanism such as a language interpreter at each node or that support the POSIX system call interface [2]. it may just allow the user of the network to choose a service from a set of services provided at each node. Particular programming In the next section, we provide an overview of Bowman. We interfaces for active networks are implemented by node-resident start with the architectural context for the development; we then programs called Execution Environments (EEs). discuss the abstractions supported by Bowman and provide an This paper describes the design and implementation of the overview of the network architecture. In Section III, we present Bowman node operating system, a software platform for im- selected details of the Bowman implementation, with a focus on plementing execution environments within active nodes. Bow- the packet processing path and the packet classi®er. We present man was speci®cally implemented as a platform for the CANEs a set of performance results for Bowman in Section IV including EE [1]; however, it provides a general platform over which other details of the forwarding performance and individual overheads EEs may also be implemented. in the packet processing path. We also provide a performance The design goals for Bowman are the following: analysis of the Bowman packet classi®cation algorithm and the

¥ Support per-¯ow processing. Bowman provides system effect of real-time processor scheduling. In Section V, we de- support for long-lived ¯ows. Flows are classi®ed using an scribe how the CANEs EE is implemented over Bowman. We present a survey of related work in Section VI, and conclude in Work supported by DARPA under contract N66001-97-C-8512. Section VII. II. OVERVIEW basic services (again, the real users of the EE API are the

In this section we provide an overview of the design and programmers who create the active applications). ¥ implementation of Bowman. The active applications provide an interface by which end- users invoke (by sending packets to an EE with appropriate instructions) and possibly customize (via in-band or out-of- Active Network User Network Interfaces band signaling) their services. The ability of EEs to provide interesting and novel services de- ANTS CANEs . . . . . IP v4 pends greatly on the degree to which the NodeOS exposes lower- level mechanisms. For example, providing active applications execution environments with ®ne-grained control over scheduling requires a greater de- gree of access to output and processor scheduling decisions than Common Objects is traditionally provided in general-purpose operating systems. (Routing Tables) In designing Bowman we have attempted to provide this kind of access without re-implementing available components. Thus Bowman makes use of an underlying OS, and attempts to P P . . . hide the differences between low-level capabilities (e.g., network IO interfaces Processors State Store interfaces) across different platforms by providing a consistent node resources interface. Node Operating System While the NodeOS/EE interface generally marks the separa- tion of local and global concerns, the NodeOS cannot totally Fig. 1. DARPA active networking architecture ignore end-to-end considerations. For example, the NodeOS is responsible for seeing that packets transmitted by end users Ðincluding both data and signaling packetsÐ reach the proper A. Architectural Context nodes and the proper EE at a node. Bowman provides a packet- Figure 1 shows the architecture for active network nodes de- matching mechanism enabling EEs to describe packets they want veloped within the DARPA active network research commu- to receive. In addition, Bowman makes it possible to group and nity [3]. The primary functional components in the architecture identify channels consistently across multiple nodes, so that EEs are the Node Operating System (NodeOS) and the Execution can de®ne and use virtual topologies. In addition, the role of the Environments (EEs). Generally speaking, the NodeOS is re- NodeOS (and Bowman) includes support for common objects sponsible for managing the local resources at a node, while each such as routing tables that are likely to be useful to all EEs. EE is responsible for implementing a User-Network interface, Bowman implements a subset of the emerging DARPA Node i.e., a service exported to users of the active network. In par- OS interface [5]. The interface exported by Bowman is intended ticular, each EE de®nes a ªvirtual machineº that interprets the to be used primarily by EE-implementors, although it can be packets delivered to it at each node. Thus an Protocol used for other purposes as well, such as experimentation with implementation might be considered a very simple EE, whose different network architectures. virtual machine is ªprogrammableº only to the extent of control- B. Bowman Abstractions ling where packets are delivered. On the other hand, the ANTS execution environment [4] uses a virtual machine that interprets Bowman is built around three key abstractions: channels, a- Java bytecodes. ¯ows, and state store. To understand how Bowman ®ts into the picture, it may be ¥ Channels. Channels in Bowman are communication end- useful to consider the purpose of the programming interface points that support sending and receiving packets via an exported by each component, along with its intended ªusersº: extensible set of protocols. Bowman exports a set of func-

¥ The NodeOS exports an API that provides access to node tions enabling EEs to create, destroy, query and commu- resources including computing, storage, and transmission nicate over channels that implement traditional protocols ; its ªusersº are the Execution Environments, or (e.g., TCP, UDP, IP) in various con®gurations. In addition, more precisely the programmers who implement them. A Bowman channels allow other forms of processing such as functional speci®cation of this API is currently in draft [5]. compression and forward error correction to be included as

¥ The Execution Environment provides basic end-to-end ser- well. Details of the Bowman channels with examples are vices and some form of programmabilityÐi.e., some way presented in Section II-D. of composing or extending basic services to create new ¥ A-Flows. The a-¯ow1 is the primary abstraction for compu- ones. A primary goal of the node architecture is to enable tation in Bowman. A-¯ows encapsulateprocessing contexts the network as a whole to provide multiple user-network and user state. Each a-¯ow consists of at least one thread interfaces; thus the primary reason for the NodeOS's exis- and executes on behalf of an identi®ed principal (i.e., user tence as a separate entity is to support multiple EEs, which of the system). Bowman provides interfaces to create, de- may offer different forms and degrees of programmability. stroy and run a-¯ows. Further, multiple threads may be The basic abstractions and form of composition depend on 1The term ª¯owº is often used to describe per-user computation state at an the EE. The EE's users are the end-system active appli- active node [5]. Since these ¯ows are, in a manner, activeÐ they refer to both a cations whose packets invoke and/or customize the EE's set of packets and some processingÐ we adopt the use of the term ªa-¯owº. active within a Bowman a-¯owÐthe level of concurrency Bowman and typically contains control code for the EE and a set within an a-¯ow is user-selected. of routines that implement the user-network interface supported

¥ State-store. The state-store provides a mechanism for a- by the EE. In order for the EE to process packets, the control ¯ows to store and retrieve state that is indexed by a unique code must spawn at least one a-¯ow. In the monolithic approach, key. The Bowman state-store interface provides functions the EE creates exactly one a-¯ow and does not expose its inter- for creating, storing and retrieving data from named state- nal processing structure to Bowman. The EE submits a blanket stores. Using the underlying state store mechanism, Bow- request for all packets ªaddressedº to the EE to be delivered to man provides an interface for creating named registries; that a-¯ow, which may do its own additional demultiplexing. such registries provide a mechanism for data sharing be- In the multi-a-¯ow approach, the EE code spawns one control tween a-¯ows without sharing program variables. a-¯ow that can be used for the EE signaling and housekeep- ing. The control a-¯ow starts a different a-¯ow for each user's C. Bowman Structure packets, and submits a request for only that user's packets to One of the goals of our implementation is for Bowman to be be delivered to that a-¯ow. Note that in each case, third-party widely deployable and run on a variety of hardware. To this end, code, loaded on behalf of users of the network, is linked against it is built as an extensible library of C-language function calls; the API that is exported by the EE; however, in the multi-a-¯ow in this section we describe the parts of the implementation. approach, the user's computation is scheduled by Bowman (and not by the EE itself). C.1 The Host OS In order to implement the channel, a-¯ow, and state-store ab- D. Bowman Abstract Topologies stractions, Bowman requires interfaces to lower level hardware- Along with the node architecture described above, Bowman speci®c mechanisms such as memory management, thread cre- implements a network-wide architecture at the node OS level by ation and scheduling, synchronization primitives, and I/O de- providing support for abstract links and abstract topologies. Ab- vices. The Bowman implementation is layered on top of a host stract links provide connectivity between a set of Bowman nodes operating system that provides these lower level services. The and implement protocol processing as well as data transmission. host operating system implements hardware-speci®c routines for Bowman channels are end-points of abstract links. Named sets device and memory access, synchronization etc.2 Thus, Bow- of abstract links form named abstract topologies. man provides a uniform active node OS interface over different Abstract topologies implement user-de®ned connectivity over hardware and operating systems. Bowman can be ported to op- arbitrary physical topologies 3. Abstract topologies have glob- erating systems that support the POSIX system call interface. ally unique names (assigned when the topology is created) and The current Bowman implementation has been ported to SunOS all packets transmitted and received by a Bowman node traverse (versions 5.5 and higher) and Linux (kernel versions 2.2.0 and some abstract topology. Abstract topologies in Bowman can be higher). used to instantiate overlay networks (much like ªvirtual private networksº in the Internet) in which each node possesses a set C.2 Bowman Extensions of speci®ed attributes (e.g., a particular EE is present at each The goal of active networking is to provide enhanced net- node). When a packet is received at a Bowman node, it is classi- work services to network users. The Bowman channel, a-¯ow, ®ed to determine to which abstract topology it belongs; packets and state-store abstractions provide basic capabilities, but more that belong to no topology are discarded. A-Flows subscribe to sophisticated services may be needed. Bowman provides an subsets of packets on speci®c topologies; matching packets are extension mechanism that is analogous to loadable modules in demultiplexed to proper a-¯ows. In order to send a packet to traditional operating systems. Using extensions, the Bowman a neighboring node on a particular abstract topology, an a-¯ow interface can be extended, dynamically, to provide support for transmits the packet on a channel that belongs to the topology additional abstractions such as queuing mechanisms, routing and has the required node as the other end-point. By default, protocols, user-speci®c protocols and other network services. Bowman abstract topologies do not provide any resource guar- The extension mechanism provides a common underlying foun- antees at active nodes or isolation of data between topologies; dation for EEs that provide various composition mechanisms. however, such properties can be added as a matter of policy Bowman extensions are written using Bowman system calls and during topology instantiation. calls exported by other extensions. In the latter case, all required We conclude this overview of Bowman with an example il- extensions must already be loaded into the system. lustrating how the Bowman abstract topology capability can be used to realize a network architecture (Figure 2). In Figure 2, we C.3 EEs on Bowman show a network of ®ve nodes ( ¦¨§©§ § ) and their physical connec- EEs running on Bowman can choose from at least two mod- tivity. The next step in realizing an abstract topology is to create els of operation: the monolithic approach and the multi-a-¯ow a set of abstract links between the nodes. Note that abstract links approach. The code for an EE can be loaded as an extension to can be de®ned using MAC protocols such as or net-

2Operating systems, of course, provide additional services, such as protection 3Obviously, abstract topologies cannot provide connectivity that is not sup- between programs running in different address spaces, and security mechanisms ported by the transitive closure of the underlying physical adjacency relation. based on (weak) notions of ªprincipalº. Presently Bowman does not depend on However, by using abstract links that implement protocols such as IP, abstract services other than mentioned above, and would work as well on a Host OS that topologies may provide logical adjacency between nodes that have more than does not provide such additional services. one physical hop between them.

Physical Ethernet UDP a−flow processing  0 a b c 1 F Physical  F F2 n With abstract 1 links ...

 configured User d e Code Bowman  2 System inQ Logical Topologies packets packets input output by

for a−flow a−flows

Topology 0 Topology 1 processing cut through path Fig. 2. The Bowman network-wide architecture: abstract links that incorporate packet protocol processing are overlaid on the physical topology. An abstract outQ topology is a named collection of abstract links. classifi− cation

Packets recd. Packets recd. logical input logical output on Topology 0 on Topology 1 channels channels

Fig. 4. Bowman packet processing path. channels: udp ends of eth eth ip abstract eth A. Packet Processing Path links Figure 4 shows a schematic of the Bowman packet process- ing path. The ®gure shows the demarcation between Bowman physical wire wire wire system code and user code that runs within particular a-¯ows. endpoint endpoint endpoint Within the underlying Bowman system, there are (at least) three b c d processing threads active at all times: a packet input processing thread, a packet output processing thread and a system timer

Fig. 3. Details of links and channels at node  due to the creation of abstract thread. Depending on the input and output scheduling algo- links and abstract topologies as shown in previous ®gure rithms, it is possible for multiple input and output threads to be active within the Bowman. The semantics and implementa- work/transport protocols such as IP and UDP. Abstract links that tion of Bowman timers are detailed in the next section; in the implement protocols such as UDP are akin to UDP tunnels. In rest of this section, we provide an overview of the input/output the last step, abstract topologies are created by identifying sets processing within Bowman. of abstract links. When an abstract link with a Bowman node Packets received on a physical interface undergo a classi®- as the end-point is created, a corresponding channel is set up cation step that identi®es a set of channels on which the re- at the Bowman node. Figure 3 shows the channels created at ceived packet should be processed. In this manner, incoming packets are demultiplexed to speci®c channel(s) where they un- the Bowman node due to the creation of the abstract links in

dergo channel-speci®c processing. Once channel input process-

  Figure 2. Physically, node is connected to nodes , , and  . In the example, three abstract (point-to-point) links are cre- ing completes, the packet is considered to have ªarrivedº on the abstract link associated with the channel. At this stage, the ated with node as the endpoint. As shown in Figure 3, each packet undergoes a second classi®cation step that identi®es the link results in one channel being created at node . As abstract topologies are con®gured in the network, the channels at each further processing needed by the packet. Such processing is one

node are associated with particular topologies. This information of three types: ¥ is used by routing protocols to send data to particular nodes over A-Flow processing. The packet can be assigned to an particular topologies. a-¯ow for ¯ow-speci®c processing. The system thread en- queues the packet into the a-¯ow input queue (assuming We have designed the ALP (Abstract Link Protocol) and the the input queue is not fullÐotherwise the packet is dis- ATP (Abstract Topology Protocol) protocols to create abstract carded). The a-¯ow processing thread eventually dequeues links and disburse abstract topology information. These two the packet and processes it. protocols have been implemented as a-¯ows in Bowman. ¥ Cut-through forwarding. As shown in Figure 4, pack- ets may be enqueued directly on an output queue, without III. BOWMAN: IMPLEMENTATION DETAILS undergoing any a-¯ow processing. This processing corre- In this section, we present selected details from the imple- sponds to cut-through paths through the Bowman node. mentation of Bowman. Speci®cally, we present details of the ¥ Channel input. Speci®edpacketsreceived on a channel can Bowman packet processing path, packet classi®cation algorithm, be routed as input to other channels. For example, packets a-¯ow scheduling and Bowman timers. received on an IP channel with the IP protocol number equal to 17 (UDP) may be input to an UDP channel. In The packet ®lter rules themselves are speci®ed in ASCII text this manner, Bowman supports hierarchical protocol graphs and specify a set of headers and ®elds to be matched. The

similar to the  -kernel [6]. Bowman protocol graphs are following example shows a ®lter expression that matches IP run-time con®gurable. packets over an ethernet link; further, the IP datagram must A-¯ow processing is determined by the code speci®ed when be from a source with DNS name Discovery.Space.Net the a-¯ow is created. Using the Bowman extension mechanism, and be destined to Earth.Space.Net. Further, the IP pro- a-¯ow code can be dynamically introduced into a node; this tocol must be UDP, and the UDP destination port number mechanism is also used for dynamic code loading (as is required must be 2001: [ethernet (proto = ETHERTYPE IP) for some active network execution environments). A-¯ows can | ip (src = Discovery.Space.Net)(dst = transmit packets on particular topologies: the tuple (topology, Earth.Space.Net) (proto = IPPROTO UDP) | routing metric, destination) identi®es a channel on which the udp (dport = 2001)| *]. packet must be sent. The Bowman classi®ers export calls to create and destroy The output queuing mechanism employed within Bowman is classi®ers, teach new protocols to an existing classi®er, add ®lter dynamically loaded at boot time. This provides the ¯exibil- rules to a classi®er, and to match a buffer against a set of rules. ity to con®gure a Bowman node with a wide range of output queuing and scheduling algorithms. Our Bowman implemen- C. A-Flow scheduling and timers tation provides the requisite primitives (such as system threads The Bowman packet processing path incorporates multiple and ®ne-grained timers) for the output schedulers to support dif- threads and queues; thus, the Bowman implementation has to ferent services such as per-¯ow queues, fair sharing of output context-switch between threads in order to process each packet. bandwidth, etc. Further, asynchrony due to the queues on the packet processing path introduces delays for each packet. Two speci®c features of B. Bowman packet classi®er Bowman help in minimizing these overheads. First, the Bowman Ef®cient packet classi®cation to identify ¯ows is an essen- implementation is multi-processor capable; on multi-processor tial part of ¯ow-speci®c processing. The complexity of packet machines, Bowman threads may execute concurrently and some classi®cation algorithms( [7], [8], [9]) depend on (1) the type thread context-switch times are eliminated. Second, on the So- of matching required (®rst match, longest pre®x match, or best laris platform, Bowman can be con®gured to run in real-time match based on attributes of matched ®elds), and (2) the number mode using the Solaris Real-Time Extensions.4 As we will see of ®elds and types of headers being matched. in the performance ®gures in the next section, with real time ex- As a part of the Bowman, we have implemented a packet tensions, Bowman can deliver high throughput and low delay for classi®er that can match on an arbitrary number of ®elds in the packets traversing through a-¯ows. To leverage the performance packet header. The classi®er can be con®gured (as a matter of advantages of a multi-processor multi-threaded implementation, node-wide policy) to operate in one of the three modes: return we have implemented the a-¯ow input and output queues so the ®rst match, return all matches, and return the best match(es) that, for single-reader single-writer queues, there is no locking according to a cost associated with each ®eld in the classi®cation (or blocking) unless the queue is empty. rule-base. Bowman provides a fast timeout routine that supports multi- The Bowman packet classi®er implements a trie-based search ple timers per-a-¯ow. Using the Bowman timeout routine, we algorithm. The nodes in the trie are constructed from a set of have implemented a Bowman extension for threaded per-a-¯ow packet ®lter rules; each node in the trie corresponds to a ®eld to timers. Each a-¯ow contains one timer thread that executes call- be matched on the packet header. Each node in the classi®cation back functions when speci®c timers expire. The resolution of trie contains an unique ®eld match Ð packet ®lters with common the timers and the number of outstanding timers are con®gurable pre®xes share nodes in the classi®cation trie. Within Bowman, a parameters. packet classi®er is allocated for each channel. Subscriptions to the channel are speci®ed as packet ®lter rules and are stored as IV. BOWMAN PERFORMANCE up-calls inside the trie nodes that terminate the ®lter rule. When In this section, we present a set of Bowman performance re- a packet arrives on the channel, the classi®er trie is traversed. It sults, beginning with base forwarding performance. Note that is possible for incoming packets to match more than one ®lter this base performance represents an upper bound on the through- in the trie; the values returned by the packet classi®er depend on put achievable by a single ¯ow through a Bowman node; actual how it is con®gured (return ®rst, best or all matches). for user ¯ows will certainly depend on the ¯ow- The Bowman packet classi®cation algorithm is not dependent speci®c processing at each node. Thus, the goal of this section on matching particular protocols, instead, using a small lan- is to quantify the overheads incurred due to Bowman itself. To guage, protocols can be dynamically taught to the classi®er at that end, in Section IV-A.1, we analyze the performance of each run-time. An example of this language that teaches a packet component of the packet processing path. In Section IV-B, we classi®er the Ethernet header is given below. provide results of the performance of the Bowman packet clas- si®er. Because the classi®er can handle packets faster than the (protocol ethernet (hdr_len 14) test network can transmit packets, we present results that are in- (field dst offset 0 length 6) (field src offset 6 length 6) 4See the priocntl(2) system call on SunOS 5.5+. An effort is underway (field proto offset 12 length 2)) to port Bowman real-time extensions to RT-Linux. 100 26000

90 24000

80 22000

70 Solaris kernel 20000 Solaris kernel C gateway C gateway

60 Bowman Aflow 18000 Bowman Aflow

  50 16000

40 14000

30 12000 Throughput at receiver (Mbps)

20 10000 Received packets per second at receiver

10 8000 200 400 600 800 1000 1200 1400 200 400 600 800 1000 1200 1400 Packet Size (Bytes) Packet Size (Bytes)

Fig. 5. Sustained throughput Fig. 6. Sustained packet processing rate dependent of the network, and depend only on the test machine overheads due to the queues in the Bowman packet processing processor. path. The C gateway incurs the ®rst three overheads as well6. We conclude the section with an analysis of the effects of It is clear from the ®gures that the system call, context switch real-time scheduling for a-¯ows. and data copy overheads for small packet sizes precludes both the C gateway and Bowman from sustaining high throughput. A. Forwarding performance However, as packet sizes increase, the context switch and system call dispatch overheads are amortized as more data traverses the The forwarding performance experiments were performed on system and user space boundary for each system call. Bowman a three-node testbed con®gured in a straight line topology. The is able to saturate the 100 Mbps ethernet for packet sizes of 1400 interior node executed Bowman. The edge nodes were Sun bytes or more; Bowman performance is within 5% of kernel Ultra-5 300 MHz machines connected via 100 Mbps Ethernet forwarding for packet sizes greater than 1200 bytes. We should to the Bowman node. The Bowman node was a Sun Ultra-2 2- note that the Bowman performance results shown here are for processor machine with each processor rated at 168 MHz. The a path that includes an a-¯ow; Bowman cut-through processing edge nodes ran the SunOS 5.7 operating system while the inte- results are similar to the C gateway results 7. rior node executed SunOS 5.5.1. The edge nodes were directly connected to separate 100 Mbps interfaces to the Bowman node. A.1 Overheads on the Packet Processing Path We compare the performance of the Bowman packet process- ing path with that of the forwarding performance of the Solaris 250 kernel and a gateway program written in C. The single threaded C gateway program uses socket calls to read and immediately write out UDP segments. For this experiment, both the read and 200 the write socket buffer sizes for the C gateway were set to 256 Kbytes. In each case, the sending node transmits UDP segments 150 of a speci®ed size to the destination node. The routing table  of one of the edge nodes was con®gured to use the Bowman 100 node as its next hop to the other edge node in our testbed. The System Write A-flow processing Bowman topology was con®gured with two bidirectional UDP 50

Cumulative Time (microseconds) A-flow dequeue A-flow enqueue channels Ð from the edge nodes to the interior node. (For the Channel processing System read C gateway and Bowman forwarding experiments, the Solaris 0 200 400 600 800 1000 1200 1400 kernel forwarding was turned off). All the times reported were Packet size (bytes) gathered using the Solaris high (nanosecond) resolution timer5. Figure 5 shows the sustained lossless throughput through Fig. 7. Overhead of Bowman components Bowman, the C gateway, and the Solaris kernel as measured at the receiver. Each data point is an average of 30 samples In this section, we quantify the overhead of Bowman multi- where each sample measured the time to receive 1000 pack- threaded processing, context switches and internal queues. In ets of the given size. Figure 6 shows the raw packet rate (in packets/second) at the receiver for the same experiment. 6Note that, in general, Bowman multiplexes multiple input/output channels Compared to kernel forwarding, Bowman incurs four extra and may incur an extra overhead due to a poll(2) system call (called by select(3c)) not present in the read-write loop for the C gateway. In our overheads: (1) overhead for at least two context switches for implementation, Bowman is optimized not to call poll unless there is no data data read and write, (2) overhead for data copies in and out of available on its last input descriptor. user space, (3) overhead for dispatching system calls, and (4) 7All the user-space results (Bowman and the C gateway) presented here are for a 32-bit executables; we are in the process of porting Bowman for a native 64-bit platform. At the very least, this will alleviate some of the system copy 5Please see gethrtime(3c) manual page on SunOS 5.5+. overheads seen in these results. order to quantify this overhead, we augmented the Bowman the ®elds uniformly at random using the 48-bit UNIX pseudo- packet processing path with several high resolution timers that random number generator8. On average, each ®lter rule had allowed us to record precise measurements of the time consumed approximately six ®elds and the classi®er was con®gured to ®nd by each component of the Bowman packet processing path. Aug- all the matches. The average classi®cation time as the number menting the packet processing path with these timers degraded of rules were varied is shown in Table I. Note that in Bowman, the performance of the system by about 5%. packets are classi®ed on the ends of channels Ð as such we do Figure 7 shows the cumulative processing time taken by each not expect a large number of rules to be added to a classi®er (as step in the Bowman packet processing path. The Bowman pro- each rule can result in an a-¯ow at the node). Nonetheless, the cessing is quanti®ed in the following steps: (1) channel process- Bowman packet classi®er performs well with a relatively large ing: this includes allocation for packet memory from a dedicated number of rules, e.g., with 256 rules, each match takes about

packet ring, a trivial packet classi®cation (the packet classi®er is 7.4  -seconds. This corresponds to a classi®cation rate of over

5  called with a ®lter that matches all packets) and the subsequent 1 § 3 10 packets per second (or a throughput of about 1.5 Gbps demultiplex that calls the a-¯ow enqueue function; (2) the a- assuming 1514 byte packets). ¯ow enqueue; (3) the a-¯ow dequeue (by the a-¯ow processing thread); (4) null a-¯ow processing; and (5) the Bowman channel Number of Avg. Classi®cation Time dispatch on write and deallocation of packet memory. Since Fields (nanoseconds) Bowman does not internally copy packets, the Bowman over- 1 3174 head is relatively constant at about 25  -seconds for all packet 2 3809 sizes. Of this, about 5  -seconds are incurred due to the timers. 3 4095 The processing time is, in fact, dominated by the system read 4 4788

and write calls. For small packet sizes (200 bytes), the system 5 5117  read and write calls take approximately 24  -seconds and 90 - 6 5417 seconds, respectively, and the overhead due to Bowman is around 7 5687 15%. However, for large packet sizes (1514 bytes), the system 8 6189 read and write calls take 95 and 125  -seconds respectively and 9 6486 the Bowman overhead reduces to approximately 10%. 10 6730 11 7017 B. Packet classi®er performance TABLE II Number of Avg. Classi®cation Time CHANGE IN CLASSIFICATION TIME VS. NUMBER OF FIELDS IN THE BEST Filter Rules (nanoseconds) MATCHING FILTER RULE. 1 2578 2 2853 4 3325 Table II shows the variation in packet classi®cation time as 8 3628 the number of ®elds in the best matching rule is increased. In 16 3909 each case, the classi®cation trie consists of exactly 32 ®lter rules; 32 5192 however, the number of ®elds in the best matching ®lter rule is

64 6806 varied. Further, the experiment with  ®elds in the best matching ! 128 6720 ®lter rule includes the previous  1 best matching ®lter 256 7477 rules in its set of 32 ®lter rules, e.g., if the best matching ®lter 512 8187 is [ip (src = alpha) (dst = beta) | *] then the 1024 9952 rule [ip (src = alpha) | *] is also included in the rule set. We report the results for the case when the classi®er ®nds

TABLE I all matching rules in the rule set; thus, in case the best matching  VARIATION OF AVERAGE PACKET CLASSIFICATION TIME VS. NUMBER OF FILTER rule has  ®elds, the classi®er outputs matching rules. In RULES Table II, we see that there is an approximately linear increase in the classi®cation time as the number of ®elds is increased. This is because common ®elds share the same node in the classi®cation In this section, we present the results of two experiments that trie and each new ®eld adds a new node to the trie. The extra analyze the performance of the Bowman packet classi®er algo- time in consecutive measurements is due to the traversal of this rithm. In the ®rst experiment, we report the packet classi®cation new node (and the corresponding time it takes to access the new time as the number of rules in the classi®er is varied. These ®eld in the matched packet's header). results were generated on a 300 MHz Ultra-5 SunOS 5.7 ma- chine with 32-bit executables. The ®lter rules were generated C. Effect of real-time scheduling for a packet with Ethernet, IP, UDP, and an application-level In Figure 8, wepresent a snapshot of the queue lengths in a run- header. All four protocols (including the application-speci®c ning Bowman system to show the effect of real-time scheduling. protocol) were taught to the packet classi®er. The ®lter rules were created using a program that generates ®elds and values for 8Please see the drand48(3c) manual page. 1000 consists of three parts: a computation part that speci®es a set of code modules required by the ¯ow, a data part that is fed as input to the computation, and an input/output part that speci®es the set of packets to be input to the computation. When the signaling a-¯ow receives a valid user request, it downloads the requested

 100 Input Queue (TS) code modules from a code ; the code server(s) are spec- Input Queue (RT) i®ed as part of the signaling message. The protocol to fetch code is loaded as an extension into Bowman and is a generic Queue Size (packets) service provide by the node to the EE. The code fetch protocol exports a well-known interface that is used by the CANEs EE to download the code required by per-¯ow computations. The 10 2000 2200 2400 2600 2800 3000 downloaded code modules for CANEs are in the form of shared Packet Index objects that are linked into the Bowman node using the Bowman Fig. 8. Effect of real-time scheduling on system and a-¯ow queue sizes. The extension mechanism. In case of ¯ow-speci®c code, however, dots under each curve correspond to the sizes of the output queues. the newly linked symbols are not exported (as is the case for reg- ular Bowman extensions) as system calls; instead, the CANEs signaling thread extracts speci®c functions from the linked-in The two curves in the ®gure correspond to a-¯ow input queue shared object using names supplied as part of the signaling mes- sizes when the system scheduler is run in time-sharing mode (TS) sage. When all the code modules required for a ¯ow are retrieved versus real-time (RT) mode. Under each curve is a set of dots that and linked into the Bowman node, the signaling thread spawns a show the size of the output queues for the corresponding time new a-¯ow. The new a-¯ow subscribes to the packets speci®ed period. There are three different kernel threads active: the Bow- in the signaling message (possibly creating new channels in the man input thread enqueues packets into the a-¯ow input queue. process) and then executes the ¯ow-speci®c computation. The a-¯ow input queue is drained by the a-¯ow thread which then enqueues the packet in the a-¯ow output queue; the a-¯ow Routing Extensions output queue is drained by the Bowman output thread. Under Typically, ¯ow-speci®c computations do not implement their time-shared scheduling, these threads are scheduled on far too own routing protocols; instead a set of routing protocols are coarse a quantum; as such, the queue sizes ¯uctuate wildly and linked into the node as extensions and execute within their own the average queue size is over 300 packets. (Note that larger and a-¯ows. These routing protocols use the Bowman state-store widely varying queue sizes also imply longer delays and more mechanism to export routing tables for different abstract topolo- jitter through the node.) In the real-time case, we run each kernel gies (and different routing metrics). The routing protocols also thread with the same real-time priority and give each thread a export well-known methods to access these routing tables. Un- time quantum of 1000 nanoseconds. Due to the deterministic less an user ¯ow implements a routing protocol, it uses the and ®ne grained scheduling, the queue sizes are smaller by an generic routing services exported by the routing protocols to order of magnitude and the system is far more stable. For this forward its data. experiment, the priority of each thread was static; in subsequent work, we will investigate dynamic processor priority scheduling Support Protocols schemes for Bowman threads. Along with the code fetch and routing protocols, a number of V. CONFIGURATION OF BOWMAN FOR ACTIVE NETWORKING other extensions are also routinely initialized as part of a Bow- man node. These include extensions for output queue schedul- In this section, we give an overview of how Bowman is being ing; speci®c protocol implementations such as Ethernet,IP, UDP, used for active networking. Speci®cally, we discuss: and TCP that are required to implement channels of speci®c ¥ loading of the EE and user code into Bowman as extensions; types; protocols that provide name service such as DNS and ¥ structuring the EE and user computations across multiple ARP; multi-threaded timers that can be used by per-¯ow com- a-¯ows; putations; protocols for creating and managing abstract links and ¥ using extensions for accessing node-wide state such as rout- topologies. Lastly, a login shell for Bowman called Hal, is also ing tables; and loaded as an extension: Hal provides an extensible command ¥ the set of support protocols and extensions required to con- interpreter that can be used to query the state of various objects ®gure a working active node. Ðsuch as network interfaces, channels, and a-¯owsÐ that are present in a Bowman node. CANEs over Bowman The CANEs[1] execution environment allows network pro- VI. RELATED WORK grammability on a per-¯ow basis and has been implemented us- In this section, we provide a survey of work related to Bow- ing the system call interface exported by Bowman. The CANEs man: this includes the areas of system support for ¯ow-speci®c EE is loaded as an extension to Bowman. Initially, the EE cre- processing at routers and development of new operating systems. ates a signaling a-¯ow; the signaling a-¯ow creates a signaling A software architecture for ¯ow-speci®c processing has been channel on a well-known UDP port and subscribes to all mes- developed as part of the Plugins project [10] at Wash- sages that arrive on that channel. A CANEs signaling message ington University. This work has adapted a NetBSD kernel to incorporate user-processing at speci®c points (gates) on the IP ton University has designed an open, general purpose comput-

forwarding" path with about 8% overhead. The major feature ing/. They share the same design goals of Bowman that is missing from Router Plugins is the general as Bowman: support complex forwarding logic and high perfor- set of communication abstractions Ðchannels, abstract links mance while reusing available building blocks. The architecture and topologiesÐ that is supported by Bowman; this re¯ects the for Extensible Routers is based on ideas of I/O paths from the IP-based focus of the Router Plugins project. Router plugins Scout [16] operating system and I/O optimizations developed for classify incoming packets to speci®c ¯ows using a packet clas- the SHRIMP [17] multi-computer. si®cation scheme similar to Bowman: the packet classi®cation step identi®es the function that is bound to a packet at each pro- VII. CONCLUSIONS cessing gate. As the packet processing path in Router Plugins We have designed and implemented Bowman, a node OS for is integrated with the ¯ow-speci®c processing, Router Plugins active networks. The software architecture of Bowman is highly do not incur the thread-switching and queuing overheads of the modular and dynamically con®gurable at run-time. The Bow- multi-threaded multi-queue packet processing path of Bowman. man design and implementation is decoupled from any particular Architecturally however, Router Plugins merge Node OS and EE network protocol or routing scheme. Bowman provides system- functionality and thus, unlike Bowman, Router Plugins do not 9 level support for abstract topologies that can be used to create export a node OS interface over which different EEs can be built . virtual networks. The Bowman multi-threaded packet process- This architectural difference between Bowman and Router Plu- ing path is an example of how to structure system software for gins clearly exposes the performance versus generality tradeoff routers that may support signi®cant per-¯ow processing. Fur- inherent in the design of extensible router platforms. ther, we have shown how real-time processor scheduling can The aim of Practical Active Network (PAN) [12] project at be used to provide deterministic packet forwarding performance MIT is to build a high performance capsule-based active node even in user-space. Our user-space implementation is portable that is based on the ANTS [4] framework. Experiments have to systems that support the POSIX system call interface, though shown that the forwarding performance of capsules containing of course speci®c Bowman capabilities that use capabilities of Java [13] byte-code is dominated by overhead incurred due to the underlying system beyond POSIX may not be available on execution within the Java virtual machine. PAN in-kernel per- all platforms. formance is comparable to Bowman user-space performance; Our results show that Bowman packet forwarding perfor- in their experiments with 1500 byte capsules containing native mance is relatively high: Bowman is able to saturate 100 Mbps code, executed in-kernel, they were able to saturate a 100 Mbps ethernet with only 1400 byte packets. The Bowman packet Ethernet. Unlike PAN, Bowman does not mandate that code classi®er implements a protocol-independent algorithm that can be transported in band; instead the code transport and execution classify packets at gigabit rates in software. mechanism is implemented by EEs that execute over Bowman. Our work on Bowman has been motivated by the need for a Architecturally, PAN is an in-kernel implementation of an EE node OS for the CANEs EE. In future, we hope to port other EEs using Linux as the node OS. to execute over Bowman. Bowman provides a ¯exible platform Another effort based on ANTS is the development of a new for experimentation with a number of major open issues in active operating system called Janos [14] at the University of Utah. networking, including algorithms for integrated processor and Janos implements a special Java Virtual Machine and Java run- link scheduling; algorithms for allocation of resources to ¯ows time for executing Java byte code. Janos includes a modi®ed within active nodes; and policies for instantiation and isolation version of ANTS that supports resource management. Their of different abstract topologies. We hope to use Bowman as a main design goal is to provide a strong protection and separation research vehicle to investigate some of these open problems. mechanism between different user code executed at the active node. While execution of user code within a virtual machine VIII. ACKNOWLEDGMENTS (like Java) provides a high degree safety of execution, it also has associated performance costs. Unlike Janos, Bowman does not We are grateful to Matt Sanders for helping us at vari- enforce safety constraints on loaded code at run time. During ous stages of software development. He has also provided the signaling phase in Bowman, the code-fetching mechanism us extensive and insightful comments on a draft of this pa- must decide, based on credentials provided by the user and the per. We would also like to thank Youngsu Chae for his con- node security policy, whether it trusts the requested code; if tributions on the Bowman abstract topologies and their im- not, the request is denied. This is the time-honored approach plementation. This work is supported by Defense Advanced of relying on the code supplier for safety in order to obtain Research Projects Agency under contract N66001-97-C-8512. better performance. The choice of ªCº programming language to Information about the Bowman software can be found at: implement the Bowman system call interface has given us good performance, but probably sacri®ces some ¯exibility (code can only come from a priori-trusted parties) compared to Java-based REFERENCES approaches. [1] K. Calvert, S. Bhattacharjee, E. Zegura, and J. Sterbenz, ªDirections in With an aim to push functionality from end devices into active networks,º IEEE Communications Magazine, 1998. [2] System Application Program Interface (API) [C Language], Information the network, the Extensible Routers project [15] at Prince- technologyÐPortable Operating System Interface (POSIX). IEEE Com- puter Society, 345 E. 47th St, New York, NY 10017, USA, 1990. 9An extension to Router Plugins for active networking, ªActive Router Plug- [3] Kenneth L. Calvert (Editor), ªArchitectural Framework for Active Net- insº [11] has been designed. works,º DARPA AN Working Group Draft, 1998. [4] D. Wetherall, J. Guttag, and D. L. Tennenhouse, ªANTS: A toolkit for

building# and dynamically deploying network protocols,º in IEEE OPE- NARCH'98, San Francisco, CA, April 1998. [5] Larry Peterson (Editor), ªNodeOS Interface Speci®cation,º DARPA AN NodeOS Working Group Draft, 1999. [6] Norman C. Hutchinson and Larry L. Peterson, ªDesign of the x-kernel,º in Proceedings of the SIGCOMM '88 Symposium, Stanford, California, 16±19 August 1988, pp. 65±75, ACM Press. [7] Mary L. Bailey, Burra Gopal, Micheal A. Pagels, and Larry L. Peterson, ªPATHFINDER: A pattern-based packet classi®er.,º in Proceedings of the ®rst USENIX Symposium on Operating Systems Design and Implementa- tion, Nov 1994. [8] V.Srinivasan, S.Suri, and G.Varghese, ªPacket Classi®cation using Tuple Space Search,º in Proceedings of ACM SIGCOMM, Sept 1999. [9] Pankaj Gupta and Nick McKeown, ªPacket Classi®cation on Multiple Fields,º in Proceedings of ACM SIGCOMM, Sept 1999. [10] Dan Decasper, Zubin Dittia, Guru Parulkar, and Bernhard Plattner, ªRouter Plugins: A Software Architecture for Next Generation Routers,º in Pro- ceedings of SIGCOMM '98, Vancouver, CA, Sept 1998. [11] Dan Decasper, A Software Architecturefor Next GenerationRouters, Ph.D. thesis, Washington University, 1999. [12] Erik L. Nygren, Stephen J. Garland, and M. Frans Kaashoek, ªPAN: A High-Performance Active Network Node Supporting Multiple Mobile Code Systems,º in Proceedings of IEEE OpenArch '99, March 1999, pp. 78±89. [13] Ken Arnold and James Gosling, The Java Programming Language, Addi- son Wesley, Reading, Massachusetts, 1996. [14] ªJanos: A Java-based Active Network Operating System,º http://www.cs.utah.edu/projects/¯ux/janos/summary.html. [15] Larry Peterson, Scott Karlin, and Kai Li, ªOS Support for General-Purpose Routers,º in HotOS Workshop, March 1999. [16] A. Montz, D. Mosberger, S. O'Malley, L. Peterson, T. Proebsting, and J. Hartman, ªScout: A communications-oriented operationg system,ºTech. Rep. 94-20, Department of Computer Science, The University of Arizona, 1994. [17] ªSHRIMP: Scalable High-performance Really Inexpensive Multi- Processor,º http://www.cs.princeton.edu/shrimp/.