TESLA: A Transparent, Extensible Session-Layer Architecture for End-to-end Network Services

Jon Salz, MIT Laboratory for Computer Science ([email protected])
Alex C. Snoeren, University of California, San Diego ([email protected])
Hari Balakrishnan, MIT Laboratory for Computer Science ([email protected])

Abstract

Session-layer services for enhancing functionality and improving network performance are gaining in importance in the Internet. Examples of such services include connection multiplexing, congestion state sharing, application-level routing, mobility/migration support, and encryption. This paper describes TESLA, a transparent and extensible framework allowing session-layer services to be developed using a high-level flow-based abstraction. TESLA services can be deployed transparently using dynamic library interposition and can be composed by chaining event handlers in a graph structure. We show how TESLA can be used to implement several session-layer services including encryption, SOCKS, application-controlled routing, flow migration, and traffic rate shaping, all with acceptably low performance degradation.

1 Introduction

Modern network applications must meet several increasing demands for performance and enhanced functionality. Much current research is devoted to augmenting the transport-level functionality implemented by standard protocols such as TCP and UDP. Examples abound:

• Setting up multiple connections between a source and destination to improve the throughput of a single logical data transfer (e.g., file transfers over high-speed networks where a single TCP connection alone does not provide adequate utilization [2, 15]).

• Sharing congestion information across connections sharing the same network path.

• Application-level routing, where applications route traffic in an overlay network to the final destination.

• End-to-end session migration for mobility across network disconnections.

• Encryption services for sealing or signing flows.

• General-purpose compression over low-bandwidth links.

• Traffic shaping and policing functions.

These examples illustrate the increasing importance of session-layer services in the Internet—services that operate on groups of flows between a source and destination, and produce resulting groups of flows using shared code and sometimes shared state.

Authors of new services such as these often implement enhanced functionality by augmenting the link, network, and transport layers, all of which are typically implemented in the kernel or in a shared, trusted intermediary [12]. While this model has sufficed in the past, we believe that a generalized, high-level framework for session-layer services would greatly ease their development and deployment. This paper argues that Internet end hosts can benefit from a systematic approach to developing session-layer services compared to the largely ad-hoc point approaches used today, and presents TESLA (a Transparent, Extensible Session-Layer Architecture), a framework that facilitates the development of session-layer services like the ones mentioned above.

Our work with TESLA derives heavily from our own previous experience developing, debugging, and deploying a variety of Internet session-layer services. The earliest example is the Congestion Manager (CM) [4], which allows concurrent flows with a common source and destination to share congestion information, allocate available bandwidth, and adapt to changing network conditions. Other services include Resilient Overlay Networks (RON) [3], which provides application-layer routing in an overlay network, and the Migrate mobility architecture [23, 24], which preserves end-to-end communication across relocation and periods of disconnection.

Each of these services was originally implemented at the kernel level, though it would be advantageous (for portability and ease of development and deployment) to have them available at the user level. Unfortunately this tends to be quite an intricate process. Not only must the implementation specify the internal logic, algorithms, and API, but considerable care must be taken handling the details of non-blocking and blocking sockets, interprocess communication, process management, and integrating the API with the application's event loop. The end result is that more programmer time and effort is spent setting up the session-layer plumbing than on the service's logic itself. Our frustrations with the development and implementation of these services at the user level were the prime motivation behind TESLA and led to three explicit design goals.

First, it became apparent to us that the standard BSD sockets API is not a convenient abstraction for programming session-layer services. Sockets can be duplicated and shared across processes; operations on them can be blocking or non-blocking on a descriptor-by-descriptor basis; and reads and writes can be multiplexed using several different APIs (e.g., select and poll). It is undesirable to require each service author to implement the entire sockets API. In response, TESLA exports a higher level of abstraction to session services, allowing them to operate on network flows (rather than simply socket descriptors) and treat flows as objects to be manipulated.

Second, there are many session services that are required ex post facto, often not originally thought of by the application developer but desired later by a user. For example, the ability to shape or police flows to conform to a specified peak rate is often useful, and being able to do so without kernel modifications is a deployment advantage. This requires the ability to configure session services transparently to the application. To do this, TESLA uses an old idea—dynamic library interposition [9]—taking advantage of the fact that most applications today on modern operating systems use dynamically linked libraries to gain access to kernel services. This does not, however, mean that TESLA session-layer services must be transparent. On the contrary, TESLA allows services to define APIs to be exported to enhanced applications.

Third, unlike traditional transport and network services, there is a great diversity in session services, as the examples earlier in this section show. This implies that application developers can benefit from composing different available services to provide interesting new functionality. To facilitate this, TESLA arranges for session services to be written as event handlers, with a callback-oriented interface between handlers that are arranged in a graph structure in the system.

We argue that a generalized architecture for the development and deployment of session-layer functionality will significantly assist in the implementation and use of new network services. This paper describes the design and implementation of TESLA, a generic framework for development and deployment of session-layer services. TESLA consists of a set of C++ application program interfaces (APIs) specifying how to write these services, and an interposition agent that can be used to instantiate these services for use by existing applications.

2 Related Work

Today's commodity operating systems commonly allow the dynamic installation of network protocols on a system-wide or per-interface basis (e.g., Linux kernel modules and FreeBSD's netgraph), but these extensions can only be accessed by the super-user. Some operating systems, such as SPIN [5] and the Exokernel [12], push many operating system features (like network and file system access) out from the kernel into application-specific, user-configurable libraries, allowing ordinary users fine-grained control. Alternatively, extensions were developed for both operating systems to allow applications to define application-specific handlers that may be installed directly into the kernel (Plexus [14] and ASHs [27]).

Operating systems such as Scout [21] and the x-kernel [16] were designed explicitly to support sophisticated network-based applications. In these systems, users may even redefine network or transport protocol functions in an application-specific fashion [6]. With TESLA, our goal is to bring some of the power of these systems to commodity operating systems in the context of session-layer services.

In contrast to highly platform-dependent systems such as U-Net [26] and Alpine [11], TESLA does not attempt to allow users to replace or modify the system's network stack. Instead, it focuses on allowing users to dynamically extend the protocol suite by composing additional end-to-end session-layer protocols on top of the existing transport- and network-layer protocols, achieving greater platform independence and usability.

TESLA's modular structure, comprising directed graphs of processing nodes, shares commonalities with a number of previous systems such as the x-kernel, the Click modular router [19], and UNIX System V streams [22]. Unlike the transport and network protocols typically considered in these systems, however, we view session-layer protocols as an extension of the application itself rather than a system-wide resource. This ensures that they are subject to the same scheduling, protection, and resource constraints as the application, minimizing the amount of effort (and privileges) required to deploy services.

To avoid making changes to the operating system or to the application itself, TESLA transparently interposes itself between the application and the kernel, intercepting and modifying the interaction between the application and the system—acting as an interposition agent [18] by means of dynamic library interposition [9]. This technique is popular with user-level file systems such as IFS [10] and Ufo [1], and with several libraries that provide specific, transparent network services such as SOCKS [20], Reliable Sockets [29], and Migrate [23, 24]. Each of these systems provides only a specific service, however, not an architecture usable by third parties.

Conductor [28] traps application network operations and transparently layers composable "adaptors" on TCP connections, but its focus is on optimizing flows' performance characteristics, not on providing arbitrary additional services. Similarly, Protocol Boosters [13] proposes interposing transparent agents between communication endpoints to improve performance over particular links (e.g., compression or forward error correction). While Protocol Boosters were originally implemented in the FreeBSD and Linux kernels, they are an excellent example of a service that could be implemented in a generic fashion using TESLA. Thain and Livny recently proposed Bypass, a dynamic-library-based interposition toolkit for building the split-execution agents commonly found in distributed systems [25]. However, because of their generality, none of these systems provides any assistance in building modular network services, particularly at the session layer. To the best of our knowledge, TESLA is the first interposition toolkit to specifically support generic session-layer network services.

3 Architecture

As discussed previously, many tools exist that, like TESLA, provide an interposition (or "shim") layer between applications and operating system kernels or libraries. However, TESLA raises the level of abstraction of programming for session-layer services. It does so by introducing a flow handler as the main object manipulated by session services. Each session service is implemented as an instantiation of a flow handler, and TESLA takes care of the plumbing required to allow session services to communicate with one another.

A network flow is a stream of bytes that all share the same logical source and destination (generally identified by source and destination IP addresses, source and destination port numbers, and transport protocol). Each flow handler takes as input a single network flow, and produces zero or more network flows as output. Flow handlers perform some particular operation or transformation on the byte stream, such as transparent flow migration, encryption, compression, etc.

A flow handler, illustrated in Figure 1, is explicitly defined and constructed to operate on only one input flow from an upstream handler (or end application), and is hence devoid of any demultiplexing operations. Conceptually, therefore, one might think of a flow handler as dealing with traffic corresponding to a single socket only (as opposed to an interposition layer coded from scratch, which must potentially deal with operations on all open file descriptors). A flow handler generates zero or more output flows, which map one-to-one to downstream handlers (or the network send routine).¹ While a flow handler always has one input flow, multiple flow handlers may coexist in a single process, so they may easily share global state. (We will expound on this point as we further discuss the TESLA architecture.)

Figure 1: A flow handler takes as input one network flow and generates zero or more output flows.
The left stack in Figure 2 illustrates an instance of TESLA where stream encryption is the only enabled handler. While the application's I/O calls appear to be reading and writing plaintext to some flow f, in reality TESLA intercepts these I/O calls and passes them to the encryption handler, which actually reads and writes ciphertext on some other flow g. The right stack in Figure 2 has more than two flows: f is the flow as viewed by the application, i.e., plaintext; g is the flow between the encryption handler and the migration handler, i.e., ciphertext; and h1, h2, ..., hn are the n flows that a migration flow handler uses to implement flow g. (The migration handler initially opens flow h1. When the host's network address changes and h1 is disconnected, the migration handler opens flow h2 to the peer, and so forth.) From the standpoint of the encryption handler, f is the input flow and g is the output flow. From the standpoint of the migration handler, g is the input flow and h1, h2, ..., hn are the output flows.

Figure 2: Two TESLA stacks. The encryption flow handler implements input flow f with output flow g. The migration flow handler implements input flow g with output flows h1 ... hn.

Each TESLA-enabled process has a particular handler configuration, an ordered list of handlers which will be used to handle flows. For instance, the configuration for the application on the right in Figure 2 is "encryption; migration." Flows through this application will be encryption- and migration-enabled.

3.1 Interposition

To achieve the goal of application transparency, TESLA acts as an interposition agent at the C-library level. We provide a wrapper program, tesla, which sets up the environment (adding our libtesla.so shared library to LD_PRELOAD) and invokes an application, e.g.:

  tesla +crypt -key=sec.gpg +migrate ftp mars

This would open an FTP connection to the host named mars, with encryption (using sec.gpg as the private key) and end-to-end migration enabled.

We refer to the libtesla.so library as the TESLA stub, since it contains the interposition agent but not any of the handlers themselves (as we will discuss later).

3.2 The flow_handler API

Every TESLA session service operates on flows and is implemented as a derived class of flow_handler, shown in Figure 3. To instantiate a downstream flow handler, a flow handler invokes its protected plumb method. TESLA handles plumb by instantiating handlers which appear after the current handler in the configuration.

For example, assume that TESLA is configured to perform compression, then encryption, then session migration on each flow. When the compression handler's constructor calls plumb, TESLA responds by instantiating the next downstream handler, namely the encryption handler; when its constructor in turn calls plumb, TESLA instantiates the next downstream handler, namely migration. Its plumb method creates a TESLA-internal handler that implements an actual TCP socket connection.

To send data to a downstream flow handler, a flow handler invokes the latter's write method, which in turn typically performs some processing and invokes its own downstream handler's write method. Downstream handlers communicate with upstream ones via callbacks, which are invoked to make events available to the upstream flow. Flow handler methods are asynchronous and event-driven—each method call must return immediately, without blocking.

  class flow_handler {
  protected:
    flow_handler *plumb(int domain, int type);
    flow_handler* const upstream;
    vector<flow_handler*> downstream;

  public:
    // DOWNSTREAM methods
    virtual int connect(address);
    virtual int bind(address);
    virtual pair<flow_handler*, address> accept();
    virtual int close();
    virtual bool write(data);
    virtual int shutdown(bool r, bool w);
    virtual int listen(int backlog);
    virtual address getsockname();
    virtual address getpeername();
    virtual void may_avail(bool);
    virtual string getsockopt(...);
    virtual int setsockopt(...);
    virtual int ioctl(...);

    // UPSTREAM methods ('from' is a downstream flow)
    virtual void connected(flow_handler *from, bool success);
    virtual void accept_ready(flow_handler *from);
    virtual bool avail(flow_handler *from, data);
    virtual void may_write(flow_handler *from, bool may);
  };

Figure 3: An excerpt from the flow_handler class definition. address and data are C++ wrapper classes for addresses and data buffers, respectively.
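To make the programming model concrete, the following is a minimal sketch (ours, not from the paper) of a pass-through service written against the Figure 3 interface; it behaves like the no-op dummy handler used in the experiments of Section 5.2. As Section 3.3 explains, the inherited default methods already propagate all of the remaining calls.

  // A sketch of a no-op session service: forward both data paths
  // unchanged. The class name and constructor form follow Figure 5.
  class dummy_handler : public flow_handler {
  public:
    dummy_handler(init_context& ctxt) : flow_handler(ctxt) {}

    bool write(data d) {                       // application-to-network path
      return downstream[0]->write(d);
    }
    bool avail(flow_handler *from, data d) {   // network-to-application path
      return upstream->avail(this, d);
    }
  };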

3.3 Handler method semantics

Many of the virtual flow_handler methods presented in Figure 3—write, connect, getpeername, etc.—have semantics similar to the corresponding non-blocking function calls in the C library. (A handler class may override any method to implement it specially, i.e., to change the behavior of the flow according to the session service the handler provides.) These methods are invoked by a handler's input flow. Since, in general, handlers will propagate these messages to their output flows (i.e., downstream), these methods are called downstream methods and can be thought of as providing an abstract flow service to the upstream handler.

(One downstream method has no C-library analogue: may_avail informs the handler whether or not it may make any more data available to its upstream handler. We further discuss this method in Section 3.5.)

In contrast, the final four methods are invoked by the handler's output flows; each has a from argument identifying which downstream handler is invoking the method. Since handlers will typically propagate these messages to their input flow (i.e., upstream), these methods are called upstream methods; they can be thought of as callbacks invoked by downstream handlers:

• connected is invoked when a connection request (i.e., a connect method invocation) on a downstream flow has completed. The argument is true if the request succeeded or false if not.

• accept_ready is invoked when listen has been called and a remote host attempts to establish a connection.
Typically a flow handler will respond to this by calling the downstream flow's accept method to accept the connection.

• avail is invoked when a downstream handler has received bytes. It returns false if the upstream handler is no longer able to receive bytes (i.e., the connection has shut down for reading).

• may_write, analogous to may_avail, informs the handler whether or not it may write any more data downstream. This method is further discussed in Section 3.5.

Many handlers, such as the encryption handler, perform only simple processing on the data flow, have exactly one output flow for one input flow, and do not modify the semantics of other methods such as connect or accept. For this reason we provide a default implementation of each method which simply propagates method calls to the downstream handler (in the case of downstream methods) or the upstream handler (in the case of upstream methods).

We further discuss the input and output primitives, timers, and blocking in the next few sections.

3.4 Input/output semantics

Since data input and output are the two fundamental operations on a flow, we shall describe them in more detail. The write method call writes bytes to the flow. It returns a boolean indicating success, unlike the C library's write call, which returns a number of bytes or an EAGAIN error message (we will explain this design decision shortly). Our interface lacks a read method; rather, we use a callback, avail, which is invoked by a downstream handler whenever bytes are available. The upstream handler handles the bytes immediately, generally by performing some processing on the bytes and passing them to its own upstream handler.

A key difference between flow handler semantics and the C library's I/O semantics is that, in flow handlers, writes and avails are guaranteed to complete. In particular, a write or avail call returns false, indicating failure, only when it will never again be possible to write to or receive bytes from the flow (similar to the C library returning 0 for write or read). Contrast this with the C library's write and read function calls, which (in non-blocking mode) may return an EAGAIN error, requiring the caller to retry the operation later. Our semantics make handler implementation considerably simpler, since handlers do not need to worry about handling the common but difficult situation where a downstream handler is unable to accept all the bytes it needs to write.

Consider the simple case where an encryption handler receives a write request from an upstream handler, performs stream encryption on the bytes (updating its state), and then attempts to write the encrypted data to the downstream handler. If the downstream handler could return EAGAIN, the handler would have to buffer the unwritten encrypted bytes, since by updating its stream encryption state the handler has "committed" to accepting the bytes from the upstream handler. Thus the handler would have to maintain a ring buffer for the unwritten bytes and register a callback for when the output flow is again available for writing.

Our approach (guaranteed completion) benefits from the observation that given upstream and downstream handlers that support guaranteed completion, it is easy to write a handler to support guaranteed completion.² This is an inductive argument, of course—there must be some blocking mechanism in the system, i.e., completion cannot be guaranteed everywhere. Our decision to impose guaranteed completion on handlers isolates the complexity of blocking in the implementation of TESLA itself (as described in Section 5.1), relieving handler authors of having to deal with multiplexing between different flows (e.g., with select).³
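To see concretely what guaranteed completion spares a handler author, here is a sketch of the machinery the encryption handler's write path would need if downstream writes could fail with EAGAIN; everything in it (try_write, pending_, register_writable_callback) is hypothetical and exists in no TESLA API.

  // Hypothetical write path WITHOUT guaranteed completion.
  bool write(data d) {
    data cipher = out_stream.process(d);          // cipher state now committed
    size_t n = try_write(downstream[0], cipher);  // may accept only a prefix
    if (n < cipher.size()) {
      pending_.append(cipher, n);                 // ring-buffer the suffix
      register_writable_callback(downstream[0]);  // resume when writable
    }
    return true;
  }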

3.5 Timers and flow control

As we have mentioned before, TESLA handlers are event-driven, hence all flow handler methods must return immediately. Clearly, however, some flow control mechanism is required in flow handlers—otherwise, in an application where the network is the bottleneck, the application could submit data to TESLA faster than TESLA could ship data off to the network, requiring TESLA to buffer a potentially unbounded amount of data. Similarly, if the application were the bottleneck, TESLA would be required to buffer all the data flowing upstream from the network until the application was ready to process it.

A flow handler may signal its input flow to stop sending it data, i.e., to stop invoking its write method, by invoking may_write(false); it can later restart the handler by invoking may_write(true). Likewise, a handler may throttle an output flow, preventing it from calling avail, by invoking may_avail(false). Flow handlers are required to respect may_avail and may_write requests from downstream and upstream handlers. This does not complicate our guaranteed-completion semantics, since a handler which (like most) lacks a buffer in which to store partially-processed data can simply propagate may_writes upstream and may_avails downstream to avoid receiving data that it cannot handle.

A handler may need to register a time-based callback, e.g., to re-enable data flow from its upstream or downstream handlers after a certain amount of time has passed. For this we provide a timer facility allowing handlers to instruct the TESLA event loop to invoke a callback at a particular time.
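For a bufferless handler, honoring these requests is pure propagation. A small sketch of ours, mirroring the default implementations mentioned in Section 3.3:

  // Flow-control propagation in a handler with no internal buffer.
  void may_write(flow_handler *from, bool may) {
    upstream->may_write(this, may);   // our output flow throttled us, so
                                      // throttle whoever writes to us
  }
  void may_avail(bool may) {
    downstream[0]->may_avail(may);    // our input flow throttled us, so
                                      // throttle whoever sends us bytes
  }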
3.6 Handler-specific services

Each flow handler exposes (by definition) the flow_handler interface, but some handlers may need to provide additional services to applications that wish to support them specifically (but still operate properly if TESLA or the particular handler is not available or not enabled). For instance, the end-to-end migration handler can "freeze" an application upon network disconnection, preserving the process's state and re-creating it upon reconnection. To enable this functionality we introduced the ioctl method to flow_handler. A "Migrate-aware" application (or, in the general case, a "TESLA-aware" application) may use the ts_ioctl macro to send a control message to a handler and receive a response:

  struct freeze_params_t params = { ... };
  struct freeze_return_t ret;
  int rc = ts_ioctl(fd, "migrate", MIGRATE_FREEZE,
                    &params, sizeof params, &ret, sizeof ret);

We also provide a mechanism for handlers to send events to the application asynchronously; e.g., the migration handler can be configured to notify the application when a flow has been migrated between IP addresses.
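On the handler side, the corresponding ioctl method might dispatch on the request code roughly as follows. This is a sketch only: Figure 3 elides the ioctl argument list, so the signature and the do_freeze helper used here are assumptions modeled on the ts_ioctl call above.

  // Sketch: handler-side dispatch for the control message above.
  int migrate_handler::ioctl(int request, const void *in, size_t in_len,
                             void *out, size_t out_len) {
    switch (request) {
    case MIGRATE_FREEZE:
      // Freeze application state, filling the caller's freeze_return_t.
      return do_freeze(static_cast<const freeze_params_t *>(in),
                       static_cast<freeze_return_t *>(out));
    default:
      // Unrecognized requests propagate downstream, matching the
      // default-propagation behavior of Section 3.3.
      return downstream[0]->ioctl(request, in, in_len, out, out_len);
    }
  }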


3.7 Process management

Flow handlers comprise executable code and runtime state, and in designing TESLA there were two obvious choices regarding the context within which the flow handler code would run. The simplest approach would place flow handlers directly within the address space of application processes; TESLA would delegate invocations of POSIX I/O routines to the appropriate flow handler methods. Alternatively, flow handlers could execute in separate processes from the application; TESLA would manage the communication between application processes and flow handler processes.

The former, simpler approach would not require any interprocess communication or context switching, minimizing performance degradation. Furthermore, this approach would ensure that each flow handler has the same process priority as the application using it, and that there are no inter-handler protection problems to worry about.

Unfortunately, this approach proves problematic in several important cases. First, some session services (e.g., shared congestion management as in CM) require state to be shared across flows that may belong to different application processes, and making each flow handler run linked with the application would greatly complicate the ability to share session-layer state across them. Second, significant difficulties arise if the application process shares its flows (i.e., the file descriptors corresponding to the flows) with another process, either by creating a child process through fork or through file descriptor passing. Ensuring the proper sharing semantics becomes difficult and expensive. A fork results in two identical TESLA instances running in the parent and child, making it rather cumbersome to ensure that they coordinate correctly. Furthermore, maintaining the correct semantics of asynchronous I/O calls like select and poll is very difficult in this model.

For these reasons, we choose to separate the context in which TESLA flow handlers run from the application processes that generate the corresponding flows. When an application is invoked through the tesla wrapper (and hence linked against the stub), the stub library creates or connects to a master process, a process dedicated to executing handlers.

Each master process has its own handler configuration (determined by the user, as we shall describe later). When the application establishes a connection, the master process determines whether the configuration contains any applicable handlers. If so, the master instantiates the applicable handlers and links them together in what we call a TESLA instance.

The application and master process communicate via a TESLA-internal socket for each application-level flow, and all actual network operations are performed by the master process. When the application invokes a socket operation, TESLA traps it and converts it to a message which is transmitted to the master process via the master socket. The master process returns an immediate response (this is possible since all TESLA handler operations are non-blocking, i.e., return immediately).

Master processes are forked from TESLA-enabled applications and are therefore themselves linked against the TESLA stub. This allows TESLA instances to be chained, as in Figure 4: an application-level flow may be connected to an instance in a first master process, which is in turn connected via its stub to an instance in another master process. Chaining enables some handlers to run in a protected, user-specific context, whereas handlers that require system-wide scope (like system-wide congestion management) can run in a privileged, system-wide context.

Figure 4: Possible interprocess data flow for an application running under TESLA. M is the migration handler, E is encryption, and CM is the congestion manager handler.

3.8 Security considerations

Like most software components, flow handlers can potentially wreak havoc within their protection context if they are malicious or buggy. Our chaining approach allows flow handlers to run within the protection context of the user to minimize the damage they can do; but in any case, flow handlers must be trusted within their protection contexts. Since a malicious flow handler could potentially sabotage all connections in the same master process, any flow handlers in master processes intended to have system-wide scope (such as a Congestion Manager handler) must be trusted.

In general, TESLA support for setuid and setgid applications, which assume the protection context of the binary itself rather than of the user who invokes them, is a tricky affair: one cannot trust a user-provided handler to behave properly within the binary's protection context. Even if the master process (and hence the flow handlers) run within the protection context of the user, user-provided handlers might still modify network semantics to change the behavior of the application. Consider, for example, a setuid binary which assumes root privileges, performs a lookup over the network to determine whether the user is a system administrator and, if so, allows the user to perform an administrative operation. If a malicious user provides flow handlers that spoof a positive response for the lookup, the binary may be tricked into granting him or her administrative privileges.

Nevertheless, many widely-used applications (e.g., ssh) are setuid, so we do require some way to support them. We can make the tesla wrapper setuid root, so that it is allowed to add libtesla.so to the LD_PRELOAD environment variable even for setuid applications. The tesla wrapper then invokes the requested binary, linked against libtesla.so, with the appropriate privileges. libtesla.so creates a master process and begins instantiating handlers only once the process resets its effective user ID to the user's real user ID, as is the case with applications such as ssh. This ensures that only network connections within the user's protection context can be affected (maliciously or otherwise) by handler code.

4 Example handlers

We now describe a few flow handlers we have implemented. Our main goal is to illustrate how non-trivial session services can be implemented easily with TESLA, showing how its flow-oriented API is both convenient and useful. We describe our handlers for transparent use of a SOCKS proxy, traffic shaping, encryption, compression, and end-to-end flow migration.

The compression, encryption, and flow migration handlers require a similarly configured TESLA installation on the peer endpoint, but the remaining handlers are useful even when communicating with a remote application that is not TESLA-enabled.

4.1 Traffic shaping

It is often useful to be able to limit the maximum throughput of a TCP connection. For example, one might want to prevent a background network operation, such as mirroring a large software distribution, from impacting network performance. TESLA allows us to provide a generic, user-level traffic-shaping service as an alternative to building shaping directly into an application (as in the rsync --bwlimit option) or using an OS-level packet-filtering service (which would require special setup and superuser privileges).

The traffic shaper keeps count of the number of bytes it has read and written during the current 100-millisecond timeslice. If a write request would exceed the outbound limit for the timeslice, then the portion of the data which could not be written is saved, and the upstream handler is throttled via may_write. A timer is created to notify the shaper at the end of the current timeslice so it can continue writing and unthrottle the upstream handler when appropriate. Similarly, once the inbound bytes-per-timeslice limit is met, any remaining data provided by avail is buffered, downstream flows are throttled, and a timer is registered to continue reading later. (A sketch of the outbound path appears below.)

We have also developed latency_handler, which delays each byte supplied to or by the network for a user-configurable amount of time (maintaining this data in an internal ring buffer). This handler is useful for simulating the effects of network latency on existing applications.
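The outbound half of the shaper logic described above might look roughly like the following sketch; the paper does not show the shaper's source, and the data helpers (prefix, suffix, take) and the arm_timer call are illustrative names for the buffer and timer facilities.

  // Sketch of shaper_handler's write path. out_limit_ is the per-slice
  // byte budget; sent_ counts bytes written in the current 100-ms slice.
  bool shaper_handler::write(data d) {
    size_t room = out_limit_ - sent_;
    if (d.size() > room) {
      pending_.append(d.suffix(room));   // save the portion over budget
      d = d.prefix(room);
      upstream->may_write(this, false);  // throttle the upstream handler
      arm_timer(end_of_slice_);          // wake up when the slice ends
    }
    sent_ += d.size();
    return d.size() == 0 || downstream[0]->write(d);
  }

  void shaper_handler::on_timer() {      // a new timeslice begins
    sent_ = 0;
    if (!pending_.empty())
      write(pending_.take());            // re-run the budget logic
    if (pending_.empty())
      upstream->may_write(this, true);   // unthrottle the upstream handler
  }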
4.2 SOCKS and application-level routing

Our SOCKS handler is functionally similar to existing transparent SOCKS libraries [7, 17], although its implementation is significantly simpler. Our handler overrides the connect method to establish a connection to a proxy server, rather than a connection directly to the requested host. When its connected method is invoked, it does not pass the event upstream, but rather negotiates the authentication mechanism with the server and then passes it the actual destination address, as specified by the SOCKS protocol.

If the SOCKS server indicates that it was able to establish the connection with the remote host, then the SOCKS handler invokes connected(true) on its upstream handler; if not, it invokes connected(false). The upstream handler, of course, is never aware that the connection is physically to the SOCKS server (as opposed to the destination address originally provided to connect).

We utilize the SOCKS server to provide transparent support for Resilient Overlay Networks [3], or RON, an architecture allowing a group of hosts to route around failures or inefficiencies in a network. We provide a ron_handler to allow a user to connect to a remote host transparently through a RON. Each RON server exports a SOCKS interface, so ron_handler can use the same mechanism as socks_handler to connect to a RON host as a proxy and open an indirect connection to the remote host. In the future, ron_handler may also utilize the end-to-end migration of migrate_handler (presented below in Section 4.4) to enable RON to hand off the proxy connection to a different node if it discovers a more efficient route from the client to the peer.

4.3 Encryption and compression

crypt_handler is a triple-DES encryption handler for TCP streams. We use OpenSSL [8] to provide the DES implementation. Figure 5 shows how crypt_handler uses the TESLA flow_handler API.

  class crypt_handler : public flow_handler {
    des3_cfb64_stream in_stream, out_stream;

  public:
    crypt_handler(init_context& ctxt) : flow_handler(ctxt),
      in_stream(des3_cfb64_stream::DECRYPT),
      out_stream(des3_cfb64_stream::ENCRYPT)
    {}

    bool write(data d) {
      // encrypt and pass downstream
      return downstream[0]->write(out_stream.process(d));
    }

    bool avail(flow_handler *from, data d) {
      // decrypt and pass upstream
      return upstream->avail(this, in_stream.process(d));
    }
  };

Figure 5: A transparent encryption/decryption handler.

Note how simple the handler is to write: we merely provide alternative implementations for the write and avail methods, routing data first through a DES encryption or decryption step (des3_cfb64_stream, a C++ wrapper we have written for OpenSSL functionality).

Our compression handler is very similar: we merely wrote a wrapper class (analogous to des3_cfb64_stream) for the freely available zlib stream compression library; a sketch of such a wrapper appears below.
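The paper does not show the compression wrapper itself; the following is a sketch of what its deflate direction might look like with the standard zlib API, exposing a process() interface analogous to des3_cfb64_stream.

  #include <zlib.h>
  #include <string>
  #include <stdexcept>

  // Sketch of a des3_cfb64_stream-style wrapper for zlib (compression
  // side only; the decompression side would use inflate analogously).
  class zlib_deflate_stream {
    z_stream zs_{};   // zero-initialized: zalloc/zfree/opaque = Z_NULL
  public:
    zlib_deflate_stream() {
      if (deflateInit(&zs_, Z_DEFAULT_COMPRESSION) != Z_OK)
        throw std::runtime_error("deflateInit failed");
    }
    ~zlib_deflate_stream() { deflateEnd(&zs_); }

    // Compress one chunk, flushing so the peer can decode it promptly.
    std::string process(const std::string& in) {
      std::string out;
      char buf[4096];
      zs_.next_in  = (Bytef*)in.data();
      zs_.avail_in = (uInt)in.size();
      do {
        zs_.next_out  = (Bytef*)buf;
        zs_.avail_out = sizeof buf;
        if (deflate(&zs_, Z_SYNC_FLUSH) == Z_STREAM_ERROR)
          throw std::runtime_error("deflate failed");
        out.append(buf, sizeof buf - zs_.avail_out);
      } while (zs_.avail_out == 0);   // loop until output buffer not full
      return out;
    }
  };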
4.4 Session migration

We have used TESLA to implement transparent support for the Migrate session-layer mobility service [23]. In transparent mode, Migrate preserves open network connections across changes of address or periods of disconnection. Since TCP connections are bound to precisely one remote endpoint and do not in general survive periods of disconnection, Migrate must synthesize a logical flow out of possibly multiple physical connections (a new connection must be established each time either endpoint moves or reconnects). Further, since data may be lost upon connection failure, Migrate must double-buffer in-flight data for possible re-transmission.

This basic functionality is straightforward to provide in TESLA. We simply create a handler that splices its input flow to an output flow and conceals any mobility events from the application by automatically initiating a new output flow using the new endpoint locations. The handler stores a copy of all outgoing data locally in a ring buffer and re-transmits any lost bytes after re-establishing connectivity on a new output flow. Incoming data is trickier: because received data may not have been read by the application before a mobility event occurs, Migrate must instantiate a new flow immediately after movement but continue to supply the buffered data from the previous flow to the application until it is completely consumed. Only then will the handler begin delivering data received on the subsequent flow(s).

Note that because Migrate wishes to conceal mobility when operating in transparent mode, it is important that the handler be able to override normal signaling mechanisms. In particular, it must intercept connection failure messages (the "connection reset" messages typically experienced during long periods of disconnection) and prevent them from reaching the application, instead taking appropriate action to manage and conceal the changes in endpoints. In doing so, the handler also overrides the getsockname() and getpeername() calls to return the original endpoints, irrespective of the current location.

5 Implementation

The TESLA stub consists largely of wrapper functions for the sockets API functions in the C library. When an application creates a socket, TESLA intercepts the socket library call and sends a message to the master process, inquiring whether the master has any handlers registered for the requested domain and type. If not, the master returns a negative acknowledgement and the application process simply uses the socket system call to create and return the requested socket.

If, on the other hand, there is a handler registered for the requested domain and type, the master process instantiates the handler class. Once the handler's constructor returns, the master process creates a pair of connected UNIX-domain sockets, one retained in the master and the other passed to the application process. The application process notes the received filehandle (remembering that it is a TESLA-wrapped filehandle) and returns it as the result of the socket call.
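The interception itself follows the standard LD_PRELOAD pattern. Here is a sketch of how a libtesla.so-style socket wrapper might be structured; query_master() is a stand-in for the stub-to-master message exchange just described, whose wire format the paper does not specify.

  // Sketch of an LD_PRELOAD wrapper for socket(). Building this needs
  // _GNU_SOURCE (for RTLD_NEXT) and linking against libdl.
  #include <dlfcn.h>

  extern "C" int socket(int domain, int type, int protocol) {
    // Locate the C library's real socket() implementation.
    static int (*real_socket)(int, int, int) =
        (int (*)(int, int, int))dlsym(RTLD_NEXT, "socket");

    int wrapped_fd;
    if (query_master(domain, type, &wrapped_fd))
      return wrapped_fd;   // master built a handler stack and passed us
                           // one end of a UNIX-domain socket pair
    return real_socket(domain, type, protocol);  // no handler registered
  }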
Later, when the application invokes a socket API call, such as connect or getsockname, on a TESLA-wrapped filehandle, the wrapper function informs the master process, which invokes the corresponding method on the handler object for that flow. The master process returns this result to the stub, which returns the result to the application.

In general the TESLA stub does not need to specially handle reads, writes, or multiplexing on wrapped sockets. The master process and application process share a socket for each application-level flow, so to handle a read, write, or select, whether blocking or non-blocking, the application process merely uses the corresponding unwrapped C library call. Even if the operation blocks, the master process can continue handling I/O while the application process waits.

Implementing fork, dup, and dup2 is quite simple. Having a single flow available to more than one application process, or as more than one filehandle, presents no problem: application processes can use each copy of the filehandle as usual. Once all copies of the filehandle are shut down, the master process can simply detect via a signal that the filehandle has closed and invoke the handler's close method.

5.1 top_handler and bottom_handler

We have described at length the interface between handlers, but we have not yet discussed precisely how handlers receive data and events from the application. We have stated that every handler has an upstream flow which invokes its methods such as connect, write, etc.; the upstream flow of the top-most handler for each flow (e.g., the encryption handler in Figure 2) is a special flow handler called top_handler.

When a TESLA master's event loop detects bytes ready to deliver to a particular flow via the application/master-process socket pair which exists for that flow, it invokes the write method of the top_handler, which passes the write call on to the first real flow handler (e.g., encryption). Similarly, when a connect, bind, listen, or accept message appears on the master socket, the TESLA event loop invokes the corresponding method of the top_handler for that flow.

Similarly, each flow handler at the bottom of the TESLA stack has a bottom_handler as its downstream. For each non-callback method—i.e., connect, write, etc.—the bottom_handler actually performs the corresponding network operation.

top_handlers and bottom_handlers maintain buffers, in case a handler writes data to the application or network but the underlying system buffer is full (i.e., sending data asynchronously to the application or network results in an EAGAIN condition). If this occurs in a top_handler, the top_handler requests via may_avail that its downstream flow stop sending it data using avail; similarly, if this occurs in a bottom_handler, it requests via may_write that its upstream flow stop sending it data using write.

5.2 Performance

When adding a session-layer service to an application, we may expect to incur one or both of two kinds of performance penalties. First, the service's algorithm, e.g., encryption or compression, may require computational resources. Second, the interposition mechanism, if any, may introduce overhead such as additional memory copies, interprocess communication, scheduling contention, etc., depending on its implementation. We refer to these two kinds of performance penalty as algorithmic and architectural, respectively.

If the service is implemented directly within the application, e.g., via an application-specific input/output indirection layer, then the architectural overhead is probably minimal and most of the overhead is likely to be algorithmic. Services implemented entirely within the operating system kernel are also likely to impose little in the way of architectural overhead. However, adding an interposition agent like TESLA outside the contexts of both the application and the operating system may add significant architectural overhead.

To analyze TESLA's architectural overhead, we constructed several simple tests to measure how using TESLA affects latency and throughput. We compare TESLA's performance to that of a tunneling, or proxy, agent, where the application explicitly tunnels flows through a local proxy server, and to an application-internal agent. Although we do not consider it here, we expect that an in-kernel service implementation would perform similarly to the application-internal agent.

We provide two benchmarks. In bandwidth(s), a client establishes a TCP connection to a server and reads data into an s-byte buffer until the connection closes, measuring the number of bytes per second received. (The server writes data s bytes at a time as well.) In latency, a client connects to a server, sends it a single byte, waits for a 1-byte response, and so forth, measuring the number of round-trip volleys per second.

We run bandwidth and latency under several configurations:

1. The benchmark program running (a) unmodified and (b) with triple-DES encryption performed directly within the benchmark application.

2. The benchmark program, with each connection passing through TCP proxy servers on both the client and server hosts. In (a), the proxy servers perform no processing on the data; in (b) the proxy servers encrypt and decrypt traffic between the two hosts.

3. Same as 2(a) and 2(b), except with proxy servers listening on UNIX-domain sockets rather than TCP sockets. This is similar to the way TESLA works internally: the proxy server corresponds to the TESLA master process.

4. The benchmark program running under TESLA with (a) the dummy handler enabled or (b) the crypt handler enabled. (dummy is a simple no-op handler that simply passes data through.)

We can think of the (a) configurations as providing a completely trivial transparent service, an "identity service" with no algorithm (and hence no algorithmic overhead) at all. Comparing benchmark 1(a) with the other (a) configurations isolates the architectural overhead of a typical proxy server (i.e., an extra pair of socket streams through which all data must flow) and the architectural overhead of TESLA (i.e., an extra pair of socket streams plus any overhead incurred by the master processes). Comparing the (a) benchmarks, shown on the left of Figure 6, with the corresponding (b) benchmarks, shown on the right, isolates the algorithmic overhead of the service.

To eliminate the network as a bottleneck, we first ran the bandwidth test on a single host (a 500-MHz Pentium III running Linux 2.4.18) with the client and server connecting over the loopback interface. Figure 6 shows these results. Here the increased demand of interprocess communication is apparent.


For large block sizes (which we would expect in bandwidth-intensive applications), introducing TCP proxies, and hence tripling the number of TCP socket streams involved, reduces the throughput by nearly two thirds. TESLA does a little better, since it uses UNIX-domain sockets instead of TCP sockets.

Figure 6: Results of the bandwidth test on a loopback interface. (Left panel: null throughput in MB/s; right panel: 3-DES throughput in MB/s; both versus block size in bytes, for the internal, TCP proxy, UNIX proxy, and TESLA configurations.)

In contrast, Figure 7 shows the results of running bandwidth with the client and server (both 500-MHz Pentium IIIs) connected via a 100-BaseT network.⁴ Here neither TESLA nor a proxy server incurs a significant throughput penalty. For reasonably large block sizes either the network (at about 89 Mb/s, or 11 MB/s) or the encryption algorithm (at about 3 MB/s) becomes the bottleneck; for small block sizes the high frequency of small application reads and writes is the bottleneck.

We conclude that on relatively slow machines and very fast networks, TESLA—or any other IPC-intensive interposition mechanism—may cause a decrease in peak throughput, but when used over typical networks, or when used to implement nontrivial network services, TESLA causes little or no throughput reduction.

Figure 7: Results of the bandwidth test on a 100-BaseT network. Only the null test is shown; triple-DES performance was similar to performance on a loopback interface (the right graph of Figure 6). (Axes: null throughput in MB/s versus block size in bytes, for the internal, TCP proxy, UNIX proxy, and TESLA configurations.)

The latency benchmark yields different results, as illustrated in Figure 8: since data rates never exceed a few kilobytes per second, the computational overhead of the algorithm itself is negligible and any slowdown is due exclusively to the architectural overhead. We see almost no difference between the speed of the trivial service and the nontrivial service. Introducing a TCP proxy server on each host incurs a noticeable performance penalty, since data must now flow over a total of three streams (rather than one) from client application to server application. TESLA does a little better, as it uses UNIX-domain sockets rather than TCP sockets to transport data between the applications and the TESLA masters; under Linux, UNIX-domain sockets appear to have a lower latency than the TCP sockets used by the proxy server. We consider this performance hit quite acceptable: TESLA increases the end-to-end latency by only tens of microseconds (4000 round-trips per second versus 7000), whereas network latencies are typically on the order of milliseconds or tens of milliseconds.

Figure 8: Results of the latency test over a 100-BaseT network. The block size is one byte. (Axis: round-trips per second for the null and 3-DES handler types, under the internal, TCP proxy, UNIX proxy, and TESLA configurations.)
We showed how TESLA can be used to design several [4] BALAKRISHNAN, H., RAHUL, H. S., AND SESHAN, S. interesting session layer services including encryption, An integrated congestion management architecture for In- SOCKS and application-controlled routing, ¤ow migra- ternet hosts. In Proc. ACM SIGCOMM '99 (Sept. 1999), tion, and traf£c rate shaping, all with acceptably low per- pp. 175–187. formance degradation. In TESLA, we have provided a [5] BERSHAD, B. N., SAVAGE, S., PARDYAK, P., SIRER, powerful, extensible, high-level C++ framework which E. G., BECKER, D., FIUCZYNSKI, M., CHAMBERS, C., makes it easy to develop, deploy, and transparently use AND EGGERS, S. Extensibility, safety and performance session-layer services. We are pleased to report that other in the SPIN operating system. In Proc. ACM SOSP '95 (Dec. 1995), pp. 267–284. researchers have recently found TESLA useful in deploy- ing their own session layer services, and even adopted it [6] BHATTI, N. T., AND SCHLICHTING, R. D. A system for as a teaching tool. constructing con£gurable high-level protocols. In Proc. ACM SIGCOMM '95 (Aug. 1995), pp. 138–150. Currently, to use TESLA services such as compression, [7] CLOWES, S. tsocks: A transparent SOCKS proxying li- encryption, and migration that require TESLA support on brary. http://tsocks.sourceforge.net/. both endpoints, the client and server must use identical [8] COX, M. J., ENGELSCHALL, R. S., HENSON, S., AND con£gurations of TESLA, i.e., invoke the tesla wrapper LAURIE, B. Openssl: The open source toolkit for with the same handlers speci£ed on the command line. SSL/TLS. http://www.openssl.org/. We plan to add a negotiation mechanism so that once [9] CURRY, T. W. Pro£ling and tracing dynamic library us- mutual TESLA support is detected, the endpoint can dy- age via interposition. In Proc. Summer USENIX '94 (June namically determine the con£guration based on the in- 1994), pp. 267–278. tersection of the sets of supported handlers. We also plan to add per-¤ow con£guration so that the user can specify [10] EGGERT, P. R., AND PARKER, D. S. File systems in user space. In Proc. Winter USENIX '93 (Jan. 1993), pp. 229– which handlers to apply to ¤ows based on ¤ow attributes 240. such as port and address. Finally, we plan to explore the possibility of moving some handler functionality directly [11] ELY, D., SAVAGE, S., AND WETHERALL, D. Alpine: A user-level infrastructure for network protocol develop- into the application process or operating system kernel to ment. In Proc. 3rd USITS (Mar. 2001), pp. 171–183. minimize TESLA’s architectural overhead. [12] ENGLER, D. R., KAASHOEK, M. F., AND O’TOOLE TESLA is available for download at http://nms.lcs. JR., J. Exokernel: An operating system architecture for mit.edu/software/tesla/. application-level resource management. In Proc. ACM SOSP '95 (Dec. 1995), pp. 251–266. [13] FELDMEIER, D. C., MCAULEY, A. J., SMITH, J. M., Acknowledgments BAKIN, D. S., MARCUS, W. S., AND RALEIGH, T. M. Protocol boosters. IEEE JSAC 16, 3 (Apr. 1998), 437– This work was funded by NTT Inc. under the NTT-MIT 444. research collaboration, by Acer Inc., Delta Electronics [14] FIUCZYNSKI, M. E., AND BERSHAD, B. N. An extensi- Inc., HP Corp., NTT Inc., Nokia Research Center, and ble protocol architecture for application-speci£c network- Philips Research under the MIT Project Oxygen partner- ing. In Proc. USENIX '96 (Jan. 1996), pp. 55–64. [15] GEVROS, P., RISSO, F., AND KIRSTEIN, P. Analysis Notes of a method for differential TCP service. In Proc. 
[16] Hutchinson, N. C., and Peterson, L. L. The x-kernel: An architecture for implementing network protocols. IEEE Transactions on Software Engineering 17, 1 (Jan. 1991), 64–76.

[17] Inferno Nettverk A/S. Dante: A free SOCKS implementation. http://www.inet.no/dante/.

[18] Jones, M. B. Interposition agents: Transparently interposing user code at the system interface. In Proc. ACM SOSP '93 (Dec. 1993), pp. 80–93.

[19] Kohler, E., Morris, R., Chen, B., Jannotti, J., and Kaashoek, M. F. The Click modular router. ACM TOCS 18, 3 (Aug. 2000), 263–297.

[20] Leech, M., Ganis, M., Lee, Y., Kuris, R., Koblas, D., and Jones, L. SOCKS Protocol Version 5. IETF, Mar. 1996. RFC 1928.

[21] Mosberger, D., and Peterson, L. L. Making paths explicit in the Scout operating system. In Proc. OSDI '96 (Oct. 1996), pp. 153–167.

[22] Ritchie, D. M. A stream input-output system. AT&T Bell Laboratories Technical Journal 63, 8 (Oct. 1984), 1897–1910.

[23] Snoeren, A. C. A Session-Based Architecture for Internet Mobility. PhD thesis, Massachusetts Institute of Technology, Dec. 2002.

[24] Snoeren, A. C., Balakrishnan, H., and Kaashoek, M. F. Reconsidering Internet mobility. In Proc. HotOS-VIII (May 2001), pp. 41–46.

[25] Thain, D., and Livny, M. Multiple bypass: Interposition agents for distributed computing. Cluster Computing 4, 1 (Mar. 2001), 39–47.

[26] von Eicken, T., Basu, A., Buch, V., and Vogels, W. U-Net: A user-level network interface for parallel and distributed computing. In Proc. ACM SOSP '95 (Dec. 1995), pp. 40–53.

[27] Wallach, D. A., Engler, D. R., and Kaashoek, M. F. ASHs: Application-specific handlers for high-performance messaging. In Proc. ACM SIGCOMM '96 (Aug. 1996), pp. 40–52.

[28] Yarvis, M., Reiher, P., and Popek, G. J. Conductor: A framework for distributed adaptation. In Proc. HotOS-VII (Mar. 1999), pp. 44–51.

[29] Zandy, V. C., and Miller, B. P. Reliable network connections. In Proc. ACM/IEEE MobiCom '02 (Sept. 2002), pp. 95–106.

Notes

1. It is important not to take the terms upstream and downstream too literally in terms of the flow of actual bytes; rather, think of TESLA as the session layer in the canonical network stack, with upstream handlers placed closer to the presentation or application layer and downstream handlers closer to the transport layer.

2. The inverse—"given upstream and downstream handlers that do not support guaranteed completion, it is easy to write a handler that does not support guaranteed completion"—is not true, for the reason described in the previous paragraph.

3. The guaranteed-completion semantics we discuss in this section apply to the TESLA flow handler API only, not to standard socket I/O functions (read, write, etc.). TESLA makes sure that the semantics of the POSIX I/O functions remain unchanged, for transparency's sake.

4. Note the anomaly with small block sizes: on our test machines, using a UNIX-domain socket to connect the application to the intermediary process (whether a proxy server or the TESLA master) incurs a severe performance hit. Reconfiguring TESLA to use TCP sockets internally (rather than UNIX-domain sockets) increases TESLA's performance to that of the TCP proxy server. Since this anomaly does not seem specific to TESLA, and occurs only in a somewhat unrealistic situation—a high-bandwidth application using a 1- or 16-byte block size—we do not examine it further.