Trickle: A Userland Bandwidth Shaper for Unix-like Systems

Marius A. Eriksen [email protected]

Abstract

As with any finite resource, it is often necessary to apply policies to the shared usage of network resources. Existing solutions typically implement this by employing traffic management in edge routers. However, users of smaller networks regularly find themselves in need of nothing more than ad-hoc rate limiting. Furthermore, the networks in question are typically unmanaged, with no network administrator(s) to manage complicated traffic management schemes. Trickle fills this gap by providing a simple and portable solution to rate limit the TCP connections of a given process or group of processes.

Trickle works by taking advantage of the Unix dynamic loader's preloading functionality to interposition itself in front of the libc API interfaces that are used for sending and receiving data through BSD network sockets. Trickle then performs traffic shaping by delaying and truncating socket I/Os. Trickle runs entirely in user space, does not require administrator privileges, and is portable across a wide variety of Unix-like platforms. Instances of Trickle can cooperate, even across networks, allowing for the specification of global rate limiting policies.

1 Introduction

Bandwidth shaping is traditionally employed monolithically, as part of the network infrastructure or in the local operating system kernel. These solutions work well for providing traffic management to large networks, but they typically require dedicated administration and privileged access to network routers or the local operating system.

Users of unmanaged networks without any set bandwidth usage policies (for example, home and small office networks) typically do not need mandatory traffic management. More likely, their need for bandwidth shaping is largely ad-hoc, to be employed when and where it is needed. For example,

• an occasional bulk transfer may adversely impact an interactive session, and the two should receive differentiated service, or

• bulk transfers may need to be prioritized.

Furthermore, such users may not have administrative access to their operating system(s) or network infrastructure in order to apply traditional bandwidth shaping techniques.

Some operating systems provide the ability to shape traffic of local origin (usually as extensions to the firewall functionality provided by the OS). This functionality is usually embedded directly in the network stack and resides in the operating system kernel. Network traffic is not associated with the local processes responsible for generating it; rather, other criteria such as IP/TCP/UDP protocol and destination IP address are used when specifying which network traffic to shape. These policies are typically global to the host (thus applying to all of its users). Since the policies are mandatory and global, it is the task of the system administrator to manage them.

These are the many burdens that become evident if one wishes simply to employ bandwidth shaping in an ad-hoc manner. While there have been a few attempts to add voluntary bandwidth shaping capabilities to the aforementioned in-kernel shapers[24], there is still a lack of a viable implementation and no use of collaboration between multiple hosts. These solutions are also non-portable, and there is a lack of any standard application or user interfaces.

Ideally, we would like to be able to employ rate limiting on a case-by-case basis, by any unprivileged user, with no need for special kernel support. Trickle addresses precisely this scenario: voluntary ad-hoc rate limiting without the use of a network-wide policy. Trickle is a portable solution to rate limiting, and it runs entirely in user space. Instances of Trickle may collaborate with each other to enforce a network-wide rate limiting policy, or they may run independently.
Trickle works properly only with applications utilizing the BSD socket layer over TCP connections. We do not feel this is a serious restriction: recent measurements attribute to TCP over 90% of the traffic volume in one major provider's backbone[16]. The majority of non-TCP traffic is DNS (UDP), which is rarely desirable to shape anyway.

We strive to maintain a few simple design criteria for Trickle:

• Semantic transparency: Trickle should never change the behavior or correctness of the process it is shaping (other than its data transfer rates, of course).

• Portability: Trickle should be extraordinarily portable, working with any Unix-like operating system that has shared library and preloading support.

• Simplicity: No need for excessively expressive policies that confuse users. Trickle should be designed to do one thing, and to do it well. Don't add features that will be used by only 1 in 20 users. No setup cost: a user should be able to make use of Trickle immediately after examining just the command line options (and there should be very few command line options).

The rest of the paper is organized as follows. The linking and preloading features of modern Unix-like systems are described in section 2. A high level overview of how Trickle works is given in section 3. The Trickle scheduler is detailed in section 4. Section 5 discusses related work and section 6 describes possible future directions for Trickle. Finally, section 7 concludes.

2 Linking and (Pre)Loading

Trickle uses the dynamic linker and loader present in Unix-like systems. We give an overview of this important component of Unix-like systems and discuss the two specific features that Trickle uses to interposition its middleware in between the shaped process and libc.

Dynamic linking and loading have been widely used in Unix-like environments for a long time. They allow an application to refer to an external symbol which does not need to be resolved to an address in memory until the run time of the particular binary. The traditional use of this capability has been to implement shared libraries. Shared libraries allow an operating system to share one copy of commonly used code among any number of processes; e.g. libc may be shared between every process that uses it. We refer to resolving these references to external objects as link editing, and to unresolved external symbols simply as external symbols.

After compilation, at link time, the linker specifies a list of libraries that are needed to resolve all external symbols. This list is then embedded in the final executable file. At load time (before program execution), the link editor maps the specified libraries into memory and resolves all external symbols. The existence of any unresolved symbols at this stage results in a run time error.

To load its middleware into memory, Trickle uses a feature of link editors in Unix-like environments called preloading. Preloading allows the user to specify a list of preload objects that are to be loaded together with the shared libraries. The link editor will first try to resolve symbols against the preload objects (in order), thus selectively bypassing symbols provided by the shared libraries specified by the program. Trickle uses preloading to provide an alternative version of the BSD socket API, so that socket calls are handled by Trickle. This feature has chiefly been used for program and systems diagnostics; for example, to match malloc to free calls, one would provide alternative versions of these functions via a preload library that has the additional matching functionality.

In practice, this feature is used by listing the libraries to preload in an environment variable. Preloading does not work for set-UID or set-GID binaries, for security reasons: a user could otherwise perform privilege elevation or arbitrary code execution by specifying a preload object that redefines some functionality known to be used by the target application.

We are interested in interpositioning Trickle in between the shaped process and the socket implementation provided by the system; another way to look at it is that Trickle acts as a proxy between the two. We therefore need some way to call the procedures Trickle is proxying. The link editor provides this functionality through an API that allows a program to load an arbitrary shared object and resolve any symbol contained therein. The API is very simple: given a string representation of the symbol to resolve, a pointer to the location of that symbol is returned. A common use of this feature is to provide plug-in functionality, wherein plugins are shared objects that may be loaded and unloaded dynamically.

Figure 1 illustrates Trickle's interpositioning.
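The symbol lookup just described is exposed on most Unix-like systems as the dlopen()/dlsym() API. As a small illustration (not code from Trickle itself, and assuming a Unix-like host), Python's ctypes module performs exactly this kind of lookup, resolving a symbol by its string name and returning a callable pointer to it:

```python
import ctypes

# dlopen(NULL): obtain a handle covering the symbols already loaded
# into this process, including those provided by libc.
libc = ctypes.CDLL(None)

# dlsym()-style lookup: the string "strlen" is resolved to a pointer
# to the real libc implementation. An interpositioning library such as
# Trickle keeps such pointers so it can call the original versions of
# the very functions it has replaced.
strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t

def real_strlen(s: bytes) -> int:
    """Invoke the dynamically resolved libc strlen()."""
    return int(strlen(s))
```

Preloading supplies the replacement symbols; this lookup supplies the originals. Together they form the proxy arrangement described above.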

3 How Trickle Works

[Figure 1: Illustrating the interpositioning mechanism. The application's send() and recv() calls enter libtrickle.so, which delays and smooths them before invoking the corresponding routines in libc.so, which in turn issue system calls such as sys_send() and sys_recv() to the OS kernel.]
Trickle applies rate limiting to any dynamically linked binary that uses the BSD socket layer. By providing a standard command line utility, Trickle provides a simple and consistent user interface to specify rate limiting parameters. By communicating with the trickle daemon, all instances of Trickle can participate in collaborative rate limiting, even across hosts.

There are several other advantages to using this approach for rate limiting besides portability. There is no need for extending the kernel nor for configuring such extensions; any user may use and configure Trickle any way she wants, making it ideal for ad-hoc rate limiting. There are also a number of advantages to this approach from the developer's point of view: being entirely contained in userland has made Trickle inherently easier to develop. It is easier to perform experiments, and the software is easier to maintain and will be understood by a wider audience.

The primary disadvantage to using this approach is that all usage of Trickle is voluntary – that is, one cannot enforce rate limiting by policy (though some operating systems provide a mechanism for administrators to enforce preload libraries, there are still ways to get around the interpositioning). Note that for its intended usage this is not a big drawback for Trickle, as ad-hoc bandwidth shaping implies that users employ it voluntarily. Secondly, and with smaller impact, Trickle cannot work with statically linked binaries; most modern Unix-like systems ship with very few statically linked binaries.

3.1 A Simple Rate Limiting Scheme

A process utilizing BSD sockets may perform its own rate limiting. For upstream limiting, the application can do this by simply limiting the rate of data that is written to a socket. Similarly, for downstream limiting, an application may limit the rate of data it reads from a socket. However, the reason why this works is not immediately obvious. When the application neglects to read some data from a socket, its socket receive buffers fill up. This in turn will cause the receiving TCP to advertise a smaller receiver window (rwnd), creating back pressure on the underlying TCP connection and thus limiting its data flow. Eventually this "trickle-down" effect achieves end-to-end rate limiting. Depending on buffering in all layers of the network stack, this effect may take some time to propagate. More detail about the interaction between this scheme and TCP is provided in section 4.

While this scheme is practical, two issues would hinder its widespread employment. Firstly, the scheme outlined is deceptively simple: as we will see in section 4, there are many details which make shaping at this level of abstraction complicated. The second issue is that there are no standard protocols or APIs for multiple processes to collaborate.

We also argue that employing bandwidth shaping inside of an application breaks abstraction layers. It is really the task of the operating system to apply policies to bandwidth usage, and it should not need to be a feature of the application. Even if libraries were developed to assist application developers, employing rate limiting in this manner would still put considerable burden on the developers, and it should not be expected that every developer would even support it. The socket API provided by the OS provides certain functionality, and it should be the freedom of the application to use it unchanged, and not have to rely on semantics at the lower levels of abstraction in order to limit bandwidth usage.

There are also exceptions to these arguments. For example, certain protocols may benefit from application level semantics to perform shaping. Another example is that some applications may be able to instruct the sending party to limit its rate of outbound traffic[9], which is clearly preferable to relying on TCP semantics to perform traffic shaping.

Trickle provides a bandwidth shaping service without the need to modify applications. Trickle augments the operating system by interpositioning its middleware in front of the libc socket interface; from there, it can intercept and shape the application's socket traffic.

3.2 The Mechanics of Library Interpositioning

With very rare exceptions, network software for Unix-like systems uses the socket abstraction provided by the operating system. In reality, the socket abstraction is entirely contained in the system call layer, with corresponding libc shims (a shim being a small software component used to provide an interface to another software component – a trivial "adapter"). Thus, with the use of the link editor's preload functionality, we interposition the Trickle middleware at a convenient level of abstraction, and we do so entirely in user space.

Using preload objects, we replace the BSD socket abstraction layer provided by libc. However, to successfully interposition the Trickle middleware, we must be able to call the original version of the very interface we have replaced. To solve this issue, we take advantage of the second feature of the link editor we discussed: we simply resolve the libc shims explicitly and call them as needed. This is done by opening the actual object file that contains libc, and using the link editor API to resolve the symbols needed. The location of the libc shared object is discovered in the configuration/compilation cycle, but could just as easily be discovered dynamically at run time. Figure 1 attempts to illustrate the mechanics of the interpositioning of the Trickle middleware.

In practice, preload objects are specified by the environment variable LD_PRELOAD. Trickle's command line utility, trickle, sets this environment variable to the object that contains Trickle's middleware. Additionally, it passes any parameters specified by the user in other environment variables in a well defined namespace. These parameters may include the upstream or downstream rates to apply, as well as whether or not this instance of Trickle should collaborate with other instances of Trickle.

3.3 The Life of a Socket

At this point, the preload object has been loaded and Trickle has interpositioned itself in the BSD socket layer, and it will intercept every call to a libc shim that operates on a socket in any relevant manner.

New sockets are created with either the socket() or accept() interfaces. An old socket is aliased with calls to dup() or dup2(). Any new or duplicated socket is marked by Trickle, which keeps an internal table indexing every such socket. File descriptors that are not marked are ignored by Trickle, and relevant calls specifying them as the file descriptor argument are simply passed through to the libc shims without any processing. Note that it is also possible for an application to perform file descriptor passing: an application may send an arbitrary file descriptor to another over local inter-process communication (IPC), and the receiving application may use that file descriptor as any other. File descriptor passing is currently not detected by Trickle. When a socket is closed, it is unmarked by Trickle. We say that any marked socket is tracked by Trickle.

Two categories of socket operations are most pertinent to Trickle: socket I/O and socket I/O multiplexing. In the following discussion we assume that we possess a black box. This black box has as its input a unique socket identifier (e.g. the file descriptor number) and the direction and length of the I/O operation to be performed on the socket. A priority for every socket may also be specified as a means to indicate the wish for differentiated service levels between them. The black box outputs a recommendation to either delay the I/O, to truncate the length of the I/O, or a combination of the two. We refer to this black box as the Trickle scheduler; it is discussed in detail in a later section.

The operation of Trickle, then, is quite simple: given a socket I/O operation, Trickle simply consults the scheduler and delays the operation by the time specified, and when that delay has elapsed, it reads or writes at most the number of bytes specified by the scheduler. If the socket is marked non-blocking, the scheduler will specify the length of I/O that is immediately allowable. Trickle will perform this (possibly truncated) I/O and return immediately, so as not to block and violate the semantics of non-blocking sockets. Note that BSD socket semantics allow socket I/O operations to return short counts – that is, an operation is not required to complete in its entirety, and it is up to the caller to ensure all data is sent (for example by looping or multiplexing over calls to send() and recv()). In practice, this means that the Trickle middleware is also allowed to return short I/O counts for socket I/O operations without affecting the semantics of the socket abstraction. This is an essential property of the BSD socket abstraction that we use in Trickle.

Multiplexing I/O operations, namely calls to select() and poll(), are more complex. (Modern operating systems have introduced new and more efficient interfaces for multiplexing; these are discussed in section 6.) The purpose of the I/O multiplexing interface is, given a set of file descriptors and conditions to watch for each, to notify the caller when any condition is satisfied (e.g. file descriptor x is ready for reading). One or more of these file descriptors may be tracked by Trickle, so it is pertinent for Trickle to wrap these interfaces as well. Specifically, select() and poll() are wrapped; these may additionally wait for a timeout event (which is satisfied as soon as the specified timeout value has elapsed).

To simplify the discussion around how Trickle handles multiplexing I/O, we abstract away the particular interface used, and assume that we deal only with a set of file descriptors, one or more of which may be tracked by Trickle. Also specified is a global timeout. For every file descriptor that is in the set and tracked by Trickle, the scheduler is invoked to see if the file descriptor would be capable of I/O immediately. If it is not, it is removed from the set and added to a holding set. The scheduler also returns the amount of time needed for the file descriptor to become capable of I/O, the holding time; the scheduler calculates this on the basis of previously observed I/O rates on that socket. Trickle then recalculates the timeout to use for the multiplexing call: this is the minimum of the set of holding times and the global timeout.

Trickle then proceeds to issue the multiplexing call with the new set of file descriptors (that is, the original set minus the holding set) and the new timeout. If the call returns because a given condition has been satisfied, Trickle returns control to the caller. If it returns due to a timeout imposed by Trickle, the process is repeated, with the global timeout reduced by the time elapsed since the original invocation of the multiplexing call (wrapper). In practice, a shortcut is taken here, where only file descriptors from the holding set are examined, and rolled back in if ready. The process is repeated until some user specified condition is satisfied by the underlying multiplexing call.

3.4 Collaboration

We have detailed how Trickle works with a set of sockets in a single process. More often than not, it is highly practical to apply global rate limits to a set of processes that perform network I/O. Also, these processes do not necessarily reside on just one host: it is often useful to apply limits to every process that contributes to the network traffic passing through a particular gateway router, or to use a global limit to control the utilization of the local area network. It may also be desirable to apply individual rate limiting policies to processes or classes of processes.

Trickle solves this by running a daemon, trickled, which coordinates among multiple instances of Trickle. The user specifies to trickled the global rate limitations which apply across all instances of Trickle. That is, the aggregate I/O rates over all processes shaped by Trickle may not exceed the global rates. Furthermore, the user can specify a priority per instance or type of instance (e.g. interactive applications), allowing her to provide differentiated network services to the various client applications.

trickled listens on a Unix domain socket and accepts connections from instances of Trickle. These instances then request a bandwidth allocation from trickled. The bandwidth allocation is computed using the same black box scheduler described previously. It is used in a slightly different mode, in which the scheduler simply outputs the current rate the entity is assigned. These rate allocations may change frequently, and so the instances of Trickle get updated allocations with some preset regularity.

Note that there is a simple denial of service attack here, should there be a rogue user on the system. This user could simply create as many fake Trickle instances as necessary and, without actually doing any socket I/O, report data transfers to trickled. Of course, such a user could, though using more resources to do so, also consume as much network resources as possible, in effect achieving the same result by exploiting TCP fairness.

The same model of collaboration can be applied across several hosts. Instead of listening on a Unix domain socket, trickled can listen on a TCP socket, and thus schedule network resource usage across any number of hosts. In this scenario, rate allocation updates may start to consume a lot of local network resources, so care must be taken when setting the frequency at which updates are sent.

4 I/O With Rate Restrictions

We now describe the scheduler which we have so far treated as a black box. The problem of rate limiting in Trickle can be generalized to the following (reusable) abstraction: given a number of entities capable of transmitting or receiving data, a global rate limit must be enforced. Furthermore, entities may have different priorities relative to each other, so as to differentiate their relative service levels. In Trickle, we use this abstraction twice: in a shaped process, every socket is represented as an entity with priority 1; in trickled, every collaborating process is represented by an entity (the collaborating processes may even reside on different hosts), and every entity is assigned a priority according to a user specified policy.

When an entity is ready to perform some I/O, it must consult the scheduler. The scheduler may then advise the entity to delay its request, to partially complete the request (i.e. truncate the I/O operation), or a combination of the two. In this capacity, the scheduler is global and coordinates the I/O allocation over all entities. In another mode, the scheduler simply outputs the current global rate allocation for the requesting entity.

After an entity has performed an I/O operation, it notifies the Trickle scheduler of the direction (sent or received) and length of the I/O. Trickle then updates a bandwidth statistics structure associated with that entity and direction of data. This structure stores the average data throughput rate for the entire lifetime of that entity, as well as a windowed average over a fixed number of bytes. Also, an aggregate statistic covering all entities is updated.

Before an entity performs I/O, it consults the Trickle scheduler to see how much delay it must apply and how much data it is allowed to send or receive after the delay has elapsed. Let us for a moment assume that the scheduler need only decide for how long to delay the requested I/O operation.
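The statistics structure and the delay-only decision can be sketched as follows. This is an illustration rather than Trickle's actual code: the names and the injectable clock are ours, and a single allotted rate stands in for the per-entity allotment the scheduler derives.

```python
import collections
import time

class EntityStats:
    """Per-entity, per-direction bandwidth statistics (illustrative).

    Tracks the lifetime average throughput and a windowed average over
    roughly the last `window_bytes` bytes observed. The `now` parameter
    is an injectable clock, which makes the sketch testable.
    """

    def __init__(self, window_bytes=64 * 1024, now=time.monotonic):
        self.now = now
        self.start = now()
        self.total = 0                     # bytes over the entity's lifetime
        self.window_bytes = window_bytes
        self.window = collections.deque()  # (timestamp, length) samples
        self.window_total = 0

    def update(self, length):
        """Record a completed I/O of `length` bytes."""
        self.total += length
        self.window.append((self.now(), length))
        self.window_total += length
        # Evict the oldest samples once the window exceeds window_bytes.
        while self.window_total - self.window[0][1] >= self.window_bytes:
            _, n = self.window.popleft()
            self.window_total -= n

    def lifetime_rate(self):
        """Average throughput (bytes/s) since the entity was created."""
        elapsed = self.now() - self.start
        return self.total / elapsed if elapsed > 0 else 0.0

    def windowed_rate(self):
        """Average throughput (bytes/s) over the current window."""
        if not self.window:
            return 0.0
        elapsed = self.now() - self.window[0][0]
        return self.window_total / elapsed if elapsed > 0 else float("inf")

def delay_for(stats, length, allotted_rate):
    """Seconds to delay an I/O of `length` bytes: zero if the entity is
    under its allotted rate (bytes/s), otherwise the time it would take
    to transfer `length` bytes at exactly that rate."""
    if stats.windowed_rate() < allotted_rate:
        return 0.0
    return length / allotted_rate
```

Windowing over a fixed number of bytes, rather than a fixed time interval, keeps the average responsive for both busy and mostly idle sockets.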

4.1 Distribution and allocation

Every entity is assigned a number of points proportional to that entity's priority. The global rate limit is divided by the total number of points over all entities, and this is the rate allotment per point. If every entity performed I/O at a rate equal to its number of points multiplied by the per-point allotment, the total rate over all entities would be at the rate limit, and every entity would perform I/O at a rate proportional to its assigned priority. Since Trickle is performing bandwidth shaping, the entities most often have the ability to exceed the transfer rates that they are assigned by the scheduler. The entities only very seldom behave in any predictable manner: their data transfer rates may be bursty, they may have consistent transfer rates lower than their allotted rates, or they might be idle. At the same time, the scheduler needs to make sure that the entities in aggregate may transfer data at a rate capped only by the total rate limit: Trickle should never hinder its client applications from fully utilizing the bandwidth allocated to them.

Statistics structures as well as limits are kept independently per direction, and thus Trickle is fully asymmetric. Furthermore, it is worth noting that Trickle makes scheduling decisions based on a windowed average; dealing with instantaneous bursts is discussed later.

[Figure 2: An illustration of the effect of smoothing in Trickle.]

4.2 Scheduling

Trickle uses a simple but effective algorithm to schedule I/O requests. It is a global scheduler that preserves every requirement outlined above. We maintain a total T, which is initialized with the total number of points allotted over all entities. We also maintain a per-point allotment P, which is initially calculated as outlined above. An entity consumes bandwidth less than its allotment if its measured (windowed average) consumption is less than its number of points Ep multiplied by the per-point allotment P.

For every entity E that consumes bandwidth less than its allotment, we subtract Ep from T, and then add the difference between E's actual consumption and its allotted consumption to a free pool, F. After iterating over the entities, the value of the free pool is redistributed amongst the remaining entities. In practice, this is done by inflating the per-point allotment: P = P + F/T.

This process is then repeated until either no entities remain or every remaining entity consumes at least its (inflated) allotment. This is the stable state. We call the final allotment of each entity after this process its adjusted allotment.

After the adjustment process, if the entity being scheduled is currently consuming bandwidth at a rate less than its adjusted allotment, Trickle allows the operation to proceed immediately. If not, it requests that the entity delay the operation by the time it would take to send the requested number of bytes at the adjusted rate.

4.3 Smoothing

This naïve approach of delaying I/O operations tends to result in very bursty network behavior, since we are blocking an I/O of any length for some time, and then letting it complete in full. As expected, this behavior is especially prevalent when operations are large. In the short term, burstiness may even result in over-shaping: as network conditions change, the scheduler could have allocated more I/O to the stream in question.

The length of an I/O may also be unpredictable, especially in applications whose network traffic is driven by user input (e.g. interactive login sessions or games). Such applications are naturally bursty, and it would be advantageous for Trickle to dampen these bursts.

Note that even if Trickle considered instantaneous bandwidth consumption in addition to the windowed average, as netbrake[5] does, bursty behavior would still be present in many applications. When shaping is based on both instantaneous and average bandwidths, the hope is that the buffers underneath the application layer will provide dampening. For I/Os with lengths approaching and exceeding the bandwidth × delay product (keep in mind that applications are allowed to make arbitrarily large I/O requests to the socket layer), buffering provides little dampening.

Thus, we introduce techniques to smooth these bursts. The techniques we introduce are generic and apply equally to both instantaneous and TCP burstiness. Our technique makes use of two parameters to normalize the traffic transmitted or received by the socket layer.

In the following discussion, we use the variable pointvalue to indicate the value of a point after scheduling, numpoints for the number of points allocated to the entity in question, and length for the (original) length of the I/O being scheduled.

We first introduce a time smoothing parameter. We set the delay imposed on a socket to the minimum of the time smoothing parameter and the delay requested (by the process outlined in the previous subsection). If the time smoothing delay is the smaller of the two, the length is truncated so that the entity meets its adjusted rate allotment. This is called the adjusted length:

adjlen = pointvalue × numpoints × timesmoothingparam.

The aim of the time smoothing parameter is to introduce a certain continuity in the data transfer.

We must be able to handle the case where the adjusted length is 0, that is, when the time smoothing parameter is too small to send even one byte. To mitigate this situation, we introduce a length smoothing parameter. When the adjusted length is 0, we simply set the length to the length smoothing parameter, and adjust the delay accordingly:

delay = length/(pointvalue × numpoints).

The effect of smoothing is illustrated in figure 2. Here, Iperf[26], a network measurement tool, was run for 60 seconds. The source node was rate limited with Trickle, which was configured to rate limit at 80 KB/s. The thin line indicates the resulting Iperf behavior without any smoothing applied; the thick line applies a time smoothing parameter of 500 ms. The transfer rates shown are instantaneous, with a sampling rate of once per second.

In practice, the scheduler is deployed as follows. In a single Trickle instance, the entities are sockets, all with priority 1, and the global limit is either specified by the user or provided by trickled. trickled again uses the same scheduler: here the entities are the instances of Trickle, and the global limit is specified by the user. Note that in this case, the scheduler does not need to apply delay or smoothing; it simply needs to report back to each instance of Trickle what its allotment is at that point in time.

4.4 Streams vs. Packets

One big difference between Trickle and in-kernel rate limiters is that Trickle works at a much higher level of abstraction. The only context available to Trickle is BSD sockets, while in-kernel rate limiters typically work at the packet level (e.g. IP packets). Trickle can only provide bandwidth shaping by delaying and truncating I/O on a socket (e.g. a TCP/IP stream), and must rely on the underlying semantics to be effective. Furthermore, Trickle is affected by local buffering in every OSI[23] layer, which effectively reduces the best reaction time Trickle can provide. Together, these conditions severely reduce the granularity at which Trickle can operate.

In-kernel rate limiters can schedule discrete packets; they live just one abstraction above the network interfaces, and can thus exercise complete control over the actual outgoing data rates. Ingress rate limiting is provided by a technique called "policing," which amounts to simply dropping matching incoming packets even though there is no real local contention that would otherwise prevent them from being delivered. Note that policing is a different strategy for shaping incoming traffic: when a policing router drops a packet, it creates artificial congestion in the view of the sending TCP, as the retransmit timer will trigger fast retransmit, or even slow start. Trickle's approach tries to avoid artificial congestion by shrinking the advertised TCP receiver window, causing the sending TCP to limit the amount of data it can send.

4.5 The Interactions of Delay, Smoothing and TCP

To recapitulate, Trickle shapes network traffic by delaying and truncating I/O requests according to a few simple parameters. Trickle attempts to reduce burstiness by smoothing, which in effect introduces some time and length normalization for the emitted I/O operations. We now explore how our shaping techniques interact with TCP.

For ingress traffic, smoothing should result in a less volatile advertised window (rwnd) in the receiving TCP, since the utilization of the socket receive buffers has smaller variance.

Smoothing is also beneficial for the transmitting TCP. Because the data flow from the application layer is less bursty, the TCP does not have to deal with long idle times, which may reduce responsiveness: it is standard practice to reduce the congestion window (cwnd) and perform slow start upon TCP idleness beyond one retransmission timeout (RTO)[19].

Smoothing may also be used for adapting to interactive network protocols. For example, a smaller time smoothing parameter should cause data to be sent in a more continuous and consistent manner, whereas the lack of smoothing would likely cause awkward pauses in user interactions.

Another tradeoff made when smoothing is that some accuracy is likely lost because of timing. When using timers in userland, the value used is the floor of the actual timeout you will get, and thus when sleeping on a timer just once for some I/O, the inaccuracy is amortized over the entirety of that I/O. However, smoothing is likely to break this I/O up into many smaller I/Os, and the timer inaccuracies may be more pronounced.

…porate more advanced bandwidth shaping. These modules are typically application layer shapers: they exploit additional context within Apache and the HTTP protocol. For example, such modules could have the ability to perform rate limiting by particular cookies, or on CPU intensive CGI scripts.

Many peer-to-peer applications also offer traffic shaping. Again, here there is great opportunity to use application level knowledge to apply policies for shaping[2, 4].

Netbrake[5] is another bandwidth shaper that uses shared library preloading. Like Trickle, it delays I/Os on sockets. Netbrake does not have the ability to coordinate bandwidth allocation amongst several instances. Netbrake calculates the aggregate bandwidth usage of all sockets in a process, and shapes only according to this: that is, if a given socket I/O causes the global rate to exceed the specified limit, that socket is penalized with a delay so that bandwidth consumption converges to the limit, and there is no equivalent to the smoothing in Trickle. Thus, Netbrake does not retain a sense of "fairness" among sockets: one "overzealous" socket could cause delays in other sockets performing I/O at lower rates. This does not retain the TCP fairness semantics (nor does it attempt to), and could cause uneven
The ultimate compromise here would be application performance, one example being an appli- to use the effects of buffering whenever you can, and to cation that uses different streams for control and data. use smoothing whenever you have to. This is left for Netbrake also does not distinguish between the two di- future work. rections of data; incoming data will add to the same observed rate as outgoing data. Netbrake is not semantically transparent (nor does 5 Related Work it aim to be); • it does not handle non-blocking I/O nor There are a number of generic rate limiting soft- ware solutions and one is included in nearly every • I/O multiplexing (select(), poll(), etc.) nor major open source operating system. These oper- • ate in a mostly traditional manner (defining discrete socket aliasing (dup(), etc.). packet queues and applying policies on these queues). Trickle provides semantic transparency and the abil- What differentiates these is typically what operat- ity to provide fairness or managed differentiated band- ing system(s) they run on, and how expressive their width usage to different sockets or applications. Trickle policies are. Examples are AltQ[14], Netnice[20], also allows applications to cooperate as to retain (a) Dummynet[22] and Netfilter[25]. global bandwidth limit(s). Several network client and server applications in- corporate rate limiting as a feature. For example, OpenSSH[8]’s has the ability to rate limit scp file trans- fers. rsync[9] too, features rate limiting. Both use a 6 Future Work simple scheme that sleeps whenever the average or in- stantaneous bandwidth rates go beyond their thresh- Trickle does not have much control over how the old(s). rsync has the additional advantage that they lower layers of the network stack behave. A future area may control the sender as well (their protocol is pro- of exploration is to dynamically adjust any relevant prietary) and so bandwidth is shaped at the sender, socket options. 
Especially interesting is to adjust the which is easier and more sensible. 3The author of Netbrake notes that Netbrake is intended to There are a few modules for Apache[1] which incor- simply be a hack, and that it is not mature software. 9 socket send and receive buffers as to lessen the reaction operating systems including OpenBSD[7], NetBSD[6], time of Trickle’s actions. Another area of future work FreeBSD[3], Linux[10] and Sun Solaris[11]. Trickle is is dynamic adjustment of smoothing settings, parame- by its very nature also architecture agnostic. terized by various observed network characteristics and At the socket layer the number of ways an operation usage patterns (e.g. interactive, bulk transfers) of a can be manipulated is limited. Furthermore, we gain particular socket. no access to lower layers in the network stack at which There also exists a need for Trickle to employ more rate limiters typically reside, and so we have to rely expressive and dynamic policies. For example, shaping on the semantics of TCP to cause reductions in band- by remote host or by protocol. width consumption. We developed several techniques There are a few new and disparate interfaces for including smoothing that help normalize the behavior dealing with socket I/O multiplexing. In the BSD oper- observed in the lower network layers and avoids bursty ating systems, there is the [17] event notification throughput. layer, Solaris has /dev/poll[13] and Linux epoll[18]. There are many venues to explore in the future de- Trickle stands to gain from supporting these interfaces velopment of Trickle, and we believe it will remain a as they are becoming more pervasive. very useful utility for ad-hoc rate limiting. Further- By using a system call filter such as systrace[21], more, we believe that this is the typical usage case for Trickle could address its two highest impact issues. 
rate limiting by end users requiring occasional service By using such a system call filter, Trickle could in- differentiation. terposition itself in the system call layer, while still Trickle source code is available under a BSD running entirely in userland, hence gaining the ability style license at http://www.monkey.org/~marius/ to work with statically linked binaries. Furthermore, trickle/. these tools provide the means to actually enforce the usage of Trickle, thus enforcing bandwidth policies. In order to do collaborative rate limiting when join- References ing a new network, a user would have to manually find which host (if any) is running trickled. Trickle would [1] Apache. http://www.apache.org/. thus benefit from some sort of service discovery pro- [2] Azureus - java bittorrent client. http://azureus. tocol akin to DHCP[15]. Perhaps using Zeroconf[12] sourceforge.net/. technologies will also be beneficial. [3] FreeBSD: An advanced operating system. http:// www.freebsd.org/. [4] Kazaa 3.0. http://www.kazaa.com/. 7 Acknowledgments [5] Netbrake. http://www.hping.org/netbrake/. [6] NetBSD: the free, secure, and highly portable Unix- If you are reading this (and provide feedback). You like Open Source operating system. http://www. will be in mentioned here. Thank you! .org/. [7] OpenBSD: The proactively secure Unix-like operating system. http://www.openbsd.org/. 8 Summary and Conclusion [8] OpenSSH. http://www.openssh.com/. [9] Rsync. http://samba.anu.edu.au/rsync/. Trickle provides a practical and portable solution to ad-hoc rate limiting which runs entirely in userland. It [10] The Linux Kernel. http://www.kernel.org/. has been shown to work extremely well in practice, and [11] The Sun Solaris Operating System. http://wwws.sun. none of its inherent limitations seem to be a problem com/software/solaris/. for its target set of users. [12] The Zeroconf Working Group. http://www.zeroconf. Since the time Trickle was released in March, 2003, org/. 
it has enjoyed a steady user base. It is widely used, [13] S. Acharya. Using the devpoll (/dev/poll) in- especially by home users in need of ad-hoc rate limiting. terface. http://access1.sun.com/techarticles/ Trickle has also been used in research. devpoll.html. Trickle works by interpositioning its middleware at [14] Kenjiro Cho. Managing traffic with ALTQ. In Pro- the BSD socket abstraction layer which it can do ceedings of the USENIX 1999 Annual Technical Con- this entirely in userland by preloading itself using the ference, pages 121–128, June 1999. link editor present in Unix-like systems. Trickle has [15] R. Droms. Dynamic Host Configuration Protocol. been reported to work on a wide variety of Unix-like RFC 2131, 1997. 10 [16] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, [21] Niels Provos. Improving host security with system call D. Moll, R. Rockell, T. Seely, and C. Diot. Packet- policies. In Proceedings of the 11th USENIX Security level traffic measurements from the sprint IP back- Symposium, pages 257–272. USENIX, August 2003. bone. IEEE Network, 2003. [22] Luigi Rizzo. Dummynet: a simple approach to the [17] Jonathan Lemon. Kqueue - a generic and scal- evaluation of network protocols. SIGCOMM Comput. able event notification facility. In Proceedings of Commun. Rev., 27(1):31–41, 1997. the FREENIX Track: 2001 USENIX Annual Techni- [23] Marshall T. Rose. The Open Book: A Practical Per- cal Conference, pages 141–153. USENIX Association, spective on OSI. Prentice Hall, 1990. 2001. [24] T. Loder S. Bottoms and R. Wash. nlimit: A New [18] D. Libenzi. /dev/epoll home page. http://www. Voluntary Bandwidth Limiting Facility. Unpublished xmailserver.org/linux-patches/nio-improve. manuscript, 2002. html. [25] The Netfilter Core Team. The Netfilter Project: [19] W. Stevens M. Allman, V. Paxson. TCP Congestion Packet Mangling for Linux 2.4. http://www. Control. RFC 2581, Apr 1999. netfilter.org/, 1999. [20] T. Okumura and D. Moss´e. 
Virtualizing network i/o [26] Ajay Tirumala. End-to-end bandwidth measurement on end-host operating system: Operating system sup- using iperf. In SC’2001 Conference CD, Denver, port for network control and resource protection. IEEE November 2001. ACM SIGARCH/IEEE. National Transactions on Computers , 2004. Laboratory for Applied Network Research (NLANR).