Full Exploitation of the Deficit Round Robin Capabilities by Efficient Implementation and Parameter Tuning

Technical Report, October 2003

L. Lenzini, E. Mingozzi, G. Stea
Dipartimento di Ingegneria della Informazione, University of Pisa
Via Diotisalvi 2, 56126 Pisa – Italy
{l.lenzini, e.mingozzi, g.stea}@iet.unipi.it

Abstract—Deficit Round Robin (DRR) is a scheduling algorithm devised for providing fair queueing in the presence of variable length packets. The main attractive feature of DRR is its simplicity of implementation: in fact, it can exhibit O(1) complexity, provided that specific allocation constraints are met. However, according to the DRR implementation proposed in [7], meeting such constraints often implies tolerating high latency and poor fairness. In this paper, we first derive new and exact bounds for DRR latency and fairness. On the basis of these results, we then propose a novel implementation technique, called Active List Queue Method (Aliquem), which allows a remarkable gain in latency and fairness to be achieved, while still preserving O(1) complexity. We show that DRR achieves better performance metrics than those of other round robin algorithms such as Pre-Order Deficit Round Robin and Smoothed Round Robin. We also show by simulation that the proposed implementation allows the average delay and the jitter to be reduced.

Keywords—DRR; scheduling algorithms; Quality of Service

I. INTRODUCTION

Multi-service packet networks are required to carry traffic pertaining to different applications, such as e-mail or file transfer, which do not require pre-specified service guarantees, and real-time video or telephony, which require performance guarantees. The best-effort service model, though suitable for the first type of applications, is not so for applications of the other type. Therefore, multi-service packet networks need to enable Quality of Service (QoS) provisioning. A key component for QoS-enabling networks is the scheduling algorithm, which selects the next packet to transmit, and when it should be transmitted, on the basis of some expected performance metrics. During the last decade, this research area has been widely investigated, as proved by the abundance of literature on the subject (see [2-18]). Various scheduling algorithms have been devised, which exhibit different fairness and latency properties at different worst-case per-packet complexities. An algorithm is said to have O(1) worst-case per-packet complexity (hereafter complexity) if the number of operations needed to select the next packet to be transmitted is constant with respect to the number of active flows.

The existing work-conserving scheduling algorithms are commonly classified as sorted-priority or frame-based. Sorted-priority algorithms associate a timestamp with each queued packet and transmit packets by increasing timestamp order. On the other hand, frame-based algorithms divide time into frames, and select which packet to transmit on a per-frame basis. Within the frame-based class, round-robin algorithms service the various flows cyclically, and within a round each flow is serviced for up to a given quantum, so that the frame length can vary up to a given maximum. Sorted-priority algorithms generally exhibit better latency and fairness properties compared to round-robin ones, but have a higher complexity, due to the calculation of the timestamp and to the sorting process [3]. Specifically, the task of sorting packets from N different flows has an intrinsic complexity of O(log N)¹. At high link speeds, such a complexity may limit the scalability. Simpler algorithms requiring a constant number of operations per packet transmission would then be desirable. Round-robin scheduling algorithms, instead, can exhibit O(1) complexity.

¹ It has been shown that this task can be performed at O(log log N) complexity if coarsened timestamps are used [8].

Deficit Round Robin (DRR) [7] is a scheduling algorithm devised for providing fair queueing in the presence of variable length packets. Recent research in the DiffServ area [19] proposes it as a feasible solution for implementing the Expedited Forwarding Per-hop Behavior [20,21]. According to the implementation proposed in [7], DRR exhibits O(1) complexity provided that each flow is allocated a quantum no smaller than its maximum packet size. As observed in [16], removing this hypothesis would entail operating at a complexity which can be as large as O(N). On the other hand, in this paper we show that meeting such a constraint may lead to poor performance if flows with very different rate requirements are scheduled. This is because a frame can be very large under the above constraint, and this in turn implies poor fairness and longer delays.

The purpose of this paper is twofold. Firstly, we propose a novel implementation technique, called Active List Queue Method (Aliquem), which allows DRR to operate at O(1) complexity even if quanta are smaller than the maximum packet size. More specifically, the Aliquem implementation allows DRR to operate at O(1) complexity, provided that each flow is allocated a quantum no smaller than its maximum packet size scaled by a tunable parameter q. By appropriately selecting the q value, it is then possible to make a trade-off between operational overhead and performance. We propose different solutions for implementing Aliquem, employing different data structures and requiring different space occupancy and operational overhead. In addition, we present a variant of the Aliquem technique, called Smooth Aliquem, which further reduces the output burstiness at the same complexity.

However, in order to fully understand the performance gains which are achieved by means of the Aliquem implementation, a careful analysis of the DRR performance – specifically of its latency and fairness – is required. In fact, the results related to DRR reported in the literature only apply to the case in which each flow is allocated a quantum no smaller than its maximum packet size [3-4], [7]. Therefore, as a second issue, we also provide a comprehensive analysis of the latency and fairness bounds of DRR, presenting new and exact results. We show that DRR achieves better performance metrics than those of other round robin algorithms such as Pre-Order Deficit Round Robin (PDRR) [13] and Smoothed Round Robin (SRR) [14]; we also compare our implementation with Self-Clocked Fair Queueing (SCFQ) [6], which is a sorted-priority algorithm. Finally, we report simulation results showing that the Aliquem and Smooth Aliquem techniques allow the average delay and the jitter to be reduced.

The rest of the paper is organized as follows. Section II gives the results related to the DRR latency and fairness. The Aliquem implementation is described in Section III and analyzed in Section IV, whilst in Section V we describe the Smooth Aliquem DRR. We compare Aliquem DRR with previous work in Section VI. We show simulation results in Section VII, and draw conclusions in Section VIII.

II. DRR LATENCY AND FAIRNESS ANALYSIS

In this section we briefly recall the DRR scheduling algorithm and the proposed implementation. We then give generalized results related to the latency and fairness properties of DRR, and propose guidelines for selecting parameters.

A. DRR operation and implementation complexity

Deficit Round Robin is a variation of Weighted Round-Robin (WRR) that allows flows with variable packet lengths to share the link bandwidth fairly. Each flow i is characterized by a quantum of φ_i bits, which measures the quantity of traffic that flow i should ideally transmit during a round, and by a deficit variable Δ_i. When a backlogged flow is serviced, a burst of packets is allowed to be transmitted of an overall length not exceeding φ_i + Δ_i. When a flow is not able to send a packet in a round because the packet is too large, the number of bits which could not be used for transmission in the current round is saved into the flow's deficit variable, and is therefore made available to the same flow in the next round. More specifically, the deficit variable is managed as follows:

• it is reset to zero when the flow is not backlogged;
• it is increased by φ_i when the flow is selected for service during a round;
• it is decreased by the packet length when a packet is transmitted.

Let L_i be the maximum length of a packet for flow i (measured in bits). In [7] the following inequality has been proved to hold right after a flow has been serviced during a round:

  0 ≤ Δ_i < L_i    (1)

which means that a flow's deficit never reaches the maximum packet length for that flow.

DRR has O(1) complexity under certain specific conditions. Let us briefly recall the DRR implementation proposed in [7] and discuss those conditions. A FIFO list, called the active list, stores the references to the backlogged flows. When an idle flow becomes backlogged, its reference is added to the tail of the list. Cyclically, if the list is not empty, the flow which is at the head of the list is dequeued and serviced. After it has been serviced, if still backlogged, the flow is added to the tail of the list. It is worth noting that:

• dequeueing and enqueueing flows in a FIFO list are O(1) operations;
• since backlogged flows are demoted to the tail of the list after reaching the head of the list and receiving service, the relative service order of any two backlogged flows is preserved through the various rounds.

It has been proved in [7] that the following inequality must hold for DRR to exhibit O(1) complexity:

  φ_i ≥ L_i, ∀i.    (2)

Inequality (2) states that each flow's quantum must be large enough to contain the maximum length packet for that flow. If some flows violate (2), a flow which is selected for transmission (i.e. dequeued from the head of the list), though backlogged, may not be able to transmit a packet during the current round. Nevertheless, its deficit variable must be updated and the flow must be queued back at the tail of the list. In the worst case, this can happen as many consecutive times as the number of flows violating (2). Therefore, if (2) does not hold, the number of operations needed to transmit a packet in the worst case can be as large as O(N) [16].

B. Latency Analysis

Let us assume that DRR is used to schedule packets from N flows on a link whose capacity is C. Let

  F = Σ_{i=1..N} φ_i    (3)

denote the frame length, i.e. the number of bits that should ideally be transmitted during a round if all flows were backlogged.

DRR has been proved to be a latency-rate server (LR server) [3]. An LR server is characterized by its latency, which may be regarded as the worst-case delay experienced by the first packet of a flow busy period, and by its rate, i.e. the guaranteed service rate of a flow. The following expression has been derived as the DRR latency in [4]:

  Θ*_i = (3·F − 2·φ_i)/C.    (4)

However, (4) was obtained under the assumption that quanta are selected equal to the maximum packet length, i.e. φ_j = L_j, ∀j. As a consequence, it does not apply to cases in which the above hypothesis does not hold, as in the following example.

Example – Suppose that N flows with maximum packet lengths L_1 = L_2 = ... = L_N = L are scheduled, and that φ_j = L/100, j = 1..N. According to (4) the latency should be:

  Θ*_i = (3·F − 2·φ_i)/C = (3·N − 2)·L/(100·C).

However, a packet can actually arrive at a (previously idle) flow i when every other flow in the active list has a maximum length packet ready for transmission and a deficit value large enough to transmit it. Therefore the delay bound for a packet that starts a busy period for flow i is no less than (N − 1)·L/C, i.e., much higher than (4).

To the best of our knowledge, no general latency expression for DRR has yet been derived. The following result is proved in the Appendix.

Theorem 1
“The latency of DRR is

  Θ_i = (1/C)·[(F − φ_i)·(1 + L_i/φ_i) + Σ_{j=1..N} L_j].”    (5)

Note that, when φ_j = L_j, ∀j, it is Θ_i = Θ*_i.

Another common figure of merit for a scheduling algorithm is the start-up latency, defined as the upper bound on the time it takes for the last bit of a flow's head-of-line packet to leave the scheduler. Such a figure of merit is less meaningful than latency, since it does not take into account the correlation among the delays of a sequence of packets in a flow, but it is generally easier to compute. Moreover, computing the start-up latency allows DRR to be compared with scheduling algorithms not included in the LR servers category, such as SRR. We prove the following result:

Theorem 2
“The start-up latency of DRR is

  S_i = (1/C)·[⌈L_i/φ_i⌉·(F − φ_i) + Σ_{j=1..N} L_j].”    (6)

Proof
A head-of-line packet of length L_i will be serviced after flow i receives exactly ⌈L_i/φ_i⌉ service opportunities. Meanwhile, all other flows are serviced ⌈L_i/φ_i⌉ times as well. It has been proved in [7] that the following inequality bounds the service received by a backlogged flow which is serviced m times in (t1, t2]:

  m·φ_i − L_i < W_i(t1, t2) < m·φ_i + L_i.    (7)

Thus, the maximum service that each flow j ≠ i can receive in ⌈L_i/φ_i⌉ visits is upper bounded by ⌈L_i/φ_i⌉·φ_j + L_j. Therefore, a packet of length L_i will leave the scheduler after at most

  (1/C)·[⌈L_i/φ_i⌉·Σ_{j=1..N, j≠i} φ_j + Σ_{j=1..N, j≠i} L_j + L_i]    (8)

have elapsed since it became the head-of-line packet for flow i. By considering that L_i bounds the packet length for flow i and substituting the definition of frame (3) into (8), we straightforwardly obtain (6).

C. Fairness Analysis

DRR has been devised as a scheduling algorithm for providing fair queueing. A scheduling algorithm is fair if its fairness measure is bounded. The fairness measure was introduced by Golestani [6], and may be seen as the maximum absolute difference between the normalized service received by two flows over any time interval in which both are backlogged. Let us assume that DRR schedules N flows requiring rates ρ_1, ρ_2, ..., ρ_N on a link whose capacity is C, with Σ_i ρ_i ≤ C. Let us denote with W_i(t1, t2) the service received by flow i during the time interval (t1, t2]. The fairness measure is given by the following expression:

  FM = max_{i,j; t2 > t1} (1/C)·| W_i(t1, t2)/(ρ_i/Σ_h ρ_h) − W_j(t1, t2)/(ρ_j/Σ_h ρ_h) |.

As a consequence of (7), if we want flows to share the link bandwidth fairly, the ratio of any two flows' quanta must match their required rate ratio. Therefore, the following set of equalities constrains the choice of the quanta:

  φ_i/φ_j = ρ_i/ρ_j, ∀i, j, i ≠ j.    (9)

A fair system would thus be one for which the fraction of the frame length for which a flow is serviced is equal to the fraction of the link bandwidth the flow requires. Let us define

  f_i = φ_i/F = φ_i/Σ_{j=1..N} φ_j.    (10)

Therefore, the following equality follows from (9) and (10):

  f_i = ρ_i/Σ_{j=1..N} ρ_j.    (11)

It can be easily shown that the fairness measure for DRR is

  FM_{i,j} = (1/C)·[F + L_i/f_i + L_j/f_j].    (12)

The proof is obtained by considering the following worst-case scenario (see Lemma 2 in [7]):

• in (t1, t2], flow i is serviced one more time than flow j, say m times against m − 1;
• W_i(t1, t2) = m·φ_i + L_i, i.e. flow i has the maximum deficit at time t1 and no deficit at time t2;
• W_j(t1, t2) = (m − 1)·φ_j − L_j, i.e. flow j has no deficit at time t1 and the maximum deficit at time t2.

By substituting the above expressions for W_i(t1, t2) and W_j(t1, t2) in the fairness measure, and keeping into account (10) and (11), expression (12) follows straightforwardly.

D. Parameter Selection

A known DRR problem (which is also common to other round-robin schedulers) is that the latency and fairness depend on the frame length: the longer the frame, the higher the latency and the fairness measure. In order for DRR to exhibit lower latency and better fairness, the frame length should therefore be kept as small as possible. Unfortunately, given a set of flows, it is not possible to select the frame length arbitrarily if we want DRR to operate at O(1) complexity. In fact, inequality (2) establishes a lower bound for each quantum in order for DRR to operate in the O(1) region. Assuming that fairness is a key requirement, and therefore (11) holds, in order for (2) to hold, a frame cannot be smaller than

  F^S = max_{j=1..N} (L_j/f_j).    (13)

It is straightforward to see that if the frame length is F^S, there is at least one flow for which equality holds in (2). Therefore, operating with a frame smaller than F^S implies not operating at O(1) complexity.

Even if the frame length is selected according to (13), it can be very large in practical cases, as shown by the following numerical example. Suppose that 20 data flows requiring 50 Kbps each share a 10 Mbps link with 9 video flows requiring 1 Mbps each. The maximum packet length is L = 1500 bytes for all flows. According to our scheme, we obtain F^S = 300 Kbytes (240 ms at 10 Mbps), and each video flow has a quantum of 30 Kbytes, i.e. it is allowed to transmit a burst of 20 maximum length packets in a round. The latency for a video flow is 262 ms, and the latency for a data flow is 512.4 ms.

As the above example shows, when DRR is used to schedule flows with very different rate requirements, quanta and frame length can be very large, thus leading to high latencies for all flows and bursty transmissions for the flows with high rates. In the next section, we present an implementation that allows the frame length to be reduced without the drawback of operating at O(N) complexity.

III. THE ALIQUEM IMPLEMENTATION

The Active List Queue Method (Aliquem) is an implementation technique that allows DRR to obtain better delay and fairness properties while preserving the O(1) complexity. In order to distinguish between the Aliquem-based implementation of DRR and the implementation proposed in [7] and described in the previous section, we will call the former Aliquem DRR and the latter Standard DRR.

The rationale behind Aliquem DRR is the following. Suppose that the frame length is selected as F = α·F^S, 0 < α < 1. This implies that there are flows violating (2). Let flow i be one of those. As already stated, if a single FIFO active list is used to store references to the backlogged flows, it is possible that i is dequeued and enqueued several times before its deficit variable becomes greater than the head packet length. However, by only looking at flow i's quantum, deficit and head-packet length, it is possible to state in advance how many rounds will elapse before flow i actually transmits its head packet, and we can defer updating flow i's deficit variable until that time. Specifically, let L_i be the length of the packet which is at the head of flow i's queue right after the flow has been serviced in a round, and let Δ_i be the deficit value at that time. We can assert that the head packet will be transmitted in the R-th subsequent round, where R is computed as follows:

  R = ⌈(L_i − Δ_i)/φ_i⌉    (14)

and the deficit right before the head packet transmission is started will be

  R·φ_i + Δ_i.    (15)

Note that (14) and (15) can also be applied when the packet arrives at a previously idle flow; in this case the flow's deficit when the arriving packet reaches the head of the flow's queue is null.

Aliquem DRR avoids unnecessary enqueueing, dequeueing and deficit updating by taking (14) and (15) into account. Instead of considering a single FIFO active list, we consider the data structure shown in Figure 1: a circular queue of q > 1 elements contains the head and tail references to q different FIFO active lists (list 0 to list q − 1).

  Figure 1. Active List Queue.
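The numbers of the Section II.D example can be reproduced directly from Theorem 1 and equation (13); the sketch below (Python, variable and function names ours) works in bits and assumes quanta are assigned as φ_i = f_i·F^S:

```python
C = 10e6                                 # link capacity (bit/s)
L = 1500 * 8                             # maximum packet length (bits)
rates = [50e3] * 20 + [1e6] * 9          # 20 data flows, 9 video flows (bit/s)
f = [r / sum(rates) for r in rates]      # frame fractions, eq. (11)

# Minimum frame length for O(1) Standard DRR, eq. (13).
FS = max(L / fi for fi in f)             # bits

def latency(i):
    """DRR latency of flow i, Theorem 1, with phi_i = f_i * FS."""
    phi = f[i] * FS
    return ((FS - phi) * (1 + L / phi) + len(rates) * L) / C

print(FS / 8)         # frame length in bytes (300 Kbytes, i.e. 240 ms at 10 Mbps)
print(latency(0))     # ~0.5124 s for a data flow
print(latency(20))    # ~0.2616 s for a video flow
```

With these inputs the video quantum f_video·F^S comes out at 30 Kbytes, and the two latencies match the 512.4 ms and 262 ms quoted above.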
While there are packets in the system, the following g = (CurrentList+R) mod q; EnqueueFlow(I, ActiveList(g)); steps are performed cyclically: } • as long as the current active list is not empty, the head EnqueuePacket(p, FlowQueue(I)); } flow is dequeued: its deficit is updated according to AliquemDRRScheduler () { (15), and its packets are transmitted until either the while (SystemBusy) { flow is empty or its deficit is not sufficient to transmit while ( NotEmpty(CurrentList) ) { the next packet; I = DequeueFlow(CurrentList); L = Length(HeadPacket(I)); • if the flow which has just been serviced is still back- ∆I = ceiling((L-∆I)/φI)*φI + ∆I; logged, it is inserted into another active list located by loop { applying (14) and (16); otherwise its reference is sim- L=Length(HeadPacket(I)); if (L==0) { ply discarded; Backlogged=false; • when the current list has been emptied (i.e. all packets exit loop; } which were to leave during the current round have been transmitted), the next non-empty active list from if (L>∆I){ Backlogged=true; which to dequeue flows, i.e. the active list to be se- exit loop; lected as the next current list, has to be located. The } function NextNonEmptyList() is intentionally left p=DequeueHeadPacket(I); unspecified for the moment. In the following section TransmitPacket(p); we will propose two different implementations for it, ∆I = ∆I -L; } which yield different worst-case and average com- if (Backlogged) { plexity and require different space occupancy.

R = ceiling((L-∆I)/φI); g = (CurrentList+R) mod q; IV. ALIQUEM DRR ANALYSIS EnqueueFlow(I, ActiveList(g)); } A. Operational Overhead and Space Occupancy } CurrentList=NextNonEmptyList(); The space occupancy introduced by Aliquem DRR is quite } modest. The active lists pool can contain at most N flow ref- } erences (the same as Standard DRR). The circular queue can Figure 2. Pseudo-code for Aliquem DRR. be implemented by a vector of 2⋅ q flow references (head and tail of an active list for each element). Thus, the only space backlogged flows. During round k , only the flows whose overhead introduced by Aliquem is reference is stored in the ()kqmod -th active list (the current 2⋅⋅q sizeof(reference). active list) are considered for transmission, and they transmit Suppose that Aliquem DRR is servicing flows from the their packets according to the DRR pseudo-code. current active list k (0 ≤
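The Figure 2 pseudo-code can be exercised as a small runnable sketch (Python). The class and method names below are ours, NextNonEmptyList() is rendered as a naive circular advance rather than one of the efficient implementations, and quanta are assumed large enough that the round gap R of eq. (14) never reaches q:

```python
import math
from collections import deque

class AliquemDRR:
    """Minimal sketch of Aliquem DRR (Figure 2); names are ours."""
    def __init__(self, quanta, q):
        self.phi = quanta                           # flow -> quantum (bits)
        self.q = q                                  # number of active lists
        self.lists = [deque() for _ in range(q)]    # circular queue of FIFO lists
        self.queues = {i: deque() for i in quanta}  # per-flow packet queues
        self.deficit = {i: 0 for i in quanta}
        self.current = 0                            # index of the current list

    def _rounds_to_wait(self, i):
        # Eq. (14): rounds until the head packet fits in the deficit.
        return math.ceil((self.queues[i][0] - self.deficit[i]) / self.phi[i])

    def arrive(self, i, length):
        """PacketArrival of Figure 2."""
        was_idle = not self.queues[i]
        self.queues[i].append(length)
        if was_idle:
            self.deficit[i] = 0
            g = (self.current + self._rounds_to_wait(i)) % self.q
            self.lists[g].append(i)

    def run(self):
        """Drain all queued packets; returns (flow, length) in service order."""
        order = []
        while any(self.queues.values()):
            current_list = self.lists[self.current]
            while current_list:
                i = current_list.popleft()
                # Eq. (15): deferred deficit update, applied only on dequeue.
                self.deficit[i] += self._rounds_to_wait(i) * self.phi[i]
                while self.queues[i] and self.queues[i][0] <= self.deficit[i]:
                    self.deficit[i] -= self.queues[i][0]
                    order.append((i, self.queues[i].popleft()))
                if self.queues[i]:                  # still backlogged: re-insert
                    g = (self.current + self._rounds_to_wait(i)) % self.q
                    self.lists[g].append(i)
                else:
                    self.deficit[i] = 0
            # NextNonEmptyList() stands in as a plain circular advance here.
            self.current = (self.current + 1) % self.q
        return order
```

For instance, with quanta of 300 bits, q = 8, packets of 500 and 400 bits at flow 0 and one 1000-bit packet at flow 1, flow 0 is first visited in round 2 and flow 1 in round 4, exactly as (14) predicts, and no empty visit ever occurs.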

  Figure 5. Example of a 2-level Bit Vector Tree for accessing an Aliquem structure, containing information about the state of the q lists.

Locating the least significant bit set to 1 in a given word takes one machine instruction; on the other hand, locating the correct word in a set of M can be efficiently accomplished by considering them as leaf nodes of a Χ-ary tree structure (as shown in Figure 5). In the Χ-ary tree, each non-leaf node is itself a word, whose x-th bit, x = 0...Χ − 1, is set if there is at least one non-empty list in the node's x-th sub-tree.

Given an Aliquem structure with q active lists, the Χ-ary tree structure has a log_Χ q depth, and therefore it is possible to locate the next non-empty list in log_Χ q time. As Χ ranges from 32 to 128 in today's processors, the operational overhead of such a data structure is negligible. The same can be said about the space occupancy of a Χ-ary tree, in which only a single bit is associated with each list at the leaf nodes. As an example, consider a system with Χ = 32. It is possible to manage q = 1K active lists with a 2-level tree, or q = 32K active lists with a 3-level tree. The tree root can be stored directly in a register and the overall tree occupancy (128 bytes for a 2-level tree, ~4 Kbytes for a 3-level tree) is negligible.

  Figure 6. Inversion in the service order: (a) packets queued at flows i and j; (b) flow queueing in Aliquem DRR (lists k through k + 3); (c) packet transmission sequence – Standard DRR: p_i,1, p_j,1, p_j,2, p_i,2, p_i,3, p_j,3; Aliquem DRR: p_i,1, p_j,1, p_j,2, p_i,2, p_j,3, p_i,3.
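The bit vector tree can be sketched as follows (Python, with Χ = 32; class and function names are ours). Each non-leaf word summarizes its sub-trees, and the next non-empty list is located by a couple of find-first-set steps:

```python
X = 32  # machine word width (the X of the text)

def _ffs(word):
    """Index of the least significant set bit (one instruction in practice)."""
    return (word & -word).bit_length() - 1

class BitVectorTree:
    """2-level bit vector tree over q <= X*X active lists."""
    def __init__(self, q):
        self.q = q
        self.root = 0            # bit w set => leaf word w contains a set bit
        self.leaf = [0] * X

    def set(self, i):            # list i became non-empty
        self.leaf[i // X] |= 1 << (i % X)
        self.root |= 1 << (i // X)

    def clear(self, i):          # list i became empty
        self.leaf[i // X] &= ~(1 << (i % X))
        if self.leaf[i // X] == 0:
            self.root &= ~(1 << (i // X))

    def next_nonempty(self, start):
        """First non-empty list at or after `start`, wrapping circularly."""
        w, b = divmod(start, X)
        masked = self.leaf[w] & (~0 << b)   # bits of word w from position b up
        if masked:
            return w * X + _ffs(masked)
        # Otherwise descend from the root: later words first, then wrap around.
        for root_mask in (self.root & (~0 << (w + 1)), self.root):
            if root_mask:
                w2 = _ffs(root_mask)
                return w2 * X + _ffs(self.leaf[w2])
        return None                          # every list is empty
```

With q up to 1024 this performs at most one root lookup and one leaf lookup per call, matching the 2-level tree of Figure 5; a third level would extend the same scheme to q = 32K.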

B. Latency

A side result of the proof process of Theorem 1 is that the latency of DRR does not change if it is implemented as Aliquem DRR. This is because Aliquem DRR just avoids polling flows that cannot transmit a packet during a round. This means that packets that are supposed to leave during a round in Standard DRR will leave during the same round in Aliquem DRR (though not necessarily in the same order, as we explain later on). Therefore, the scenario under which the Standard DRR latency is computed is also feasible for Aliquem DRR. Note that, in that particular scenario, packets from the tagged flow are the last to leave on each round.

However, although the latency for Standard and Aliquem DRR is the same, it is not obtained at the same complexity. The lowest latency for Standard DRR operating at O(1) complexity, achieved by selecting the frame length according to (13), is

  Θ^S_i = (1/C)·[(1 − f_i)·(F^S + L_i/f_i) + Σ_{j=1..N} L_j].    (18)

Aliquem DRR allows the frame size to be reduced by a factor of q − 1 with respect to the minimum frame length allowed by Standard DRR at O(1) complexity. In order for Aliquem DRR to work, inequality (17) (rather than (2)) must hold. Inequality (17) allows quanta (and therefore frames) to be smaller by a factor of q − 1 with respect to (2). The minimum frame length for Aliquem DRR is thus the following:

  F^A = F^S/(q − 1) = (1/(q − 1))·max_{j=1..N} (L_j/f_j).    (19)

The lowest latency for Aliquem DRR with q active lists, obtained when the frame length is selected according to (19), is therefore:

  Θ^A_i = (1/C)·[(1 − f_i)·(F^S/(q − 1) + L_i/f_i) + Σ_{j=1..N} L_j].

Whereas in Standard DRR latency Θ^A_i is only achievable when operating at O(N) complexity, in Aliquem DRR latency Θ^A_i can be obtained at the cost of a much smaller complexity.

As far as the start-up latency is concerned, by manipulating (6) we obtain the following lower bound for Standard DRR at O(1) complexity:

  S^S_i = (1/C)·[⌈L_i/(f_i·F^S)⌉·F^S·(1 − f_i) + Σ_{j=1..N} L_j].    (20)

The lowest start-up latency for Aliquem DRR with q active lists, obtained when the frame length is selected according to (19), is thus:

  S^A_i = (1/C)·[⌈L_i·(q − 1)/(f_i·F^S)⌉·(F^S/(q − 1))·(1 − f_i) + Σ_{j=1..N} L_j].    (21)

It can be easily shown that, when q ≥ 2, S^A_i ≤ S^S_i, and that the reduction in the start-up latency is non-decreasing with q. However, the amount of reduction in the start-up latency also depends on the value of L_i and f_i: specifically, the smaller L_i/f_i is with respect to F^S, the higher the start-up latency reduction is for flow i. In the case in which L_i/f_i = F^S, we have S^A_i = S^S_i for any value of q.

C. Fairness

In Standard DRR, since flows are inserted in and extracted from a unique FIFO active list, the service sequence of any

two backlogged flows is preserved during the rounds: if flow i is serviced before flow j during round k and both flows are still backlogged at round k + 1, flow i will be serviced before flow j at round k + 1. Thus, if packets from both flows are supposed to leave during a given round, flow i's packets will be transmitted before flow j's. In Aliquem DRR, due to the presence of multiple FIFO active lists, it is possible that the service sequence of two backlogged flows is altered from one round to another.

Let us show this with a simple example. Let us consider flows i and j, which are backlogged at a given round k. Let us assume that the sequence of packets queued at both flows is the one reported in Figure 6(a). Both flows transmit a packet during round k, and the order of transmission of the packets is p_i,1, p_j,1. In Standard DRR, this implies that, as long as both flows are backlogged, flow i is serviced before flow j; thus, the packet transmission sequence during round k + 3 (i.e. in a round when packets from both flows are transmitted) is p_i,3, p_j,3 (see Figure 6(c)). Consider the same scenario in Aliquem DRR: both flows are initially queued in the k-th active list (assume for simplicity that k ≤ q − 3), and flow i is queued before flow j (see Figure 6(b)). However, even though both flows are constantly backlogged from round k to round k + 3, the packet transmission sequence during round k + 3 is the opposite, i.e., p_j,3, p_i,3.

Since in Aliquem DRR backlogged flows do not necessarily share the same active list on every round, it is not possible to preserve their service sequence by employing FIFO active lists. Doing so would require using sorted lists, which have an intrinsic complexity of O(log N). This would lead to a higher complexity.

Note that in Aliquem DRR the service sequence inversion does not affect couples of flows whose quantum contains a maximum length packet. In fact, if φ_i ≥ L_i and φ_j ≥ L_j, the two flows are always dequeued from the head of the same list (the current list) and enqueued at the tail of the same list (the subsequent list); therefore, no service inversion can take place.

Let us refer to the worst-case scenario under which (12) was derived. Due to the flow sequence inversion, during (t1, t2], flow i can receive service two more times than flow j; in the above example, this happens if t1 and t2 are selected as the time instants in which packets p_j,1 and p_i,3 start being transmitted. By following the same procedure used for computing the Standard DRR fairness measure, it is straightforward to prove that the fairness measure of Aliquem DRR is:

  FM^A_{i,j} = (1/C)·[F + L_i/f_i + L_j/f_j]     if φ_i ≥ L_i and φ_j ≥ L_j;
  FM^A_{i,j} = (1/C)·[2·F + L_i/f_i + L_j/f_j]   otherwise.

Given a frame length F, Aliquem DRR can have a worse fairness measure than Standard DRR. However, as already stated in the previous subsection, Aliquem DRR allows a frame length to be selected which is less than the minimum frame length of Standard DRR operating at O(1) complexity by a factor of q − 1. Therefore, by employing Aliquem DRR with q > 2, we obtain a better fairness measure than that of Standard DRR operating at O(1) complexity. We also observe that, if q = 2, flow sequence inversion cannot take place, and therefore the fairness measure of an Aliquem DRR with two active lists is the same as that of Standard DRR operating at O(1) complexity.

  Figure 7. Latency gain against q.
  Figure 8. Fairness gain among video flows against q.

D. Trade-off between Performance and Overhead

As seen earlier, Aliquem DRR allows us to obtain much better performance bounds (both for latency and for fairness) by selecting smaller frames. Smaller frames are obtained by employing a large number of active lists, i.e., a large q value, and the q value affects the operational overhead. However, with a modest increase in the operational overhead it is possible to achieve a significant improvement in both latency and fairness. Let us recall the example of Section II.D: if Aliquem DRR with q = 10 is employed, the latency of a data flow is reduced by 41% and the latency of a video flow is reduced by 73%, as shown in Figures 7 and 8. In addition, the fairness measure between any two video flows is reduced by 80%.
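These reductions can be recomputed from the latency bound of Section IV.B and the Aliquem fairness measure; the sketch below (Python, variable names ours, under our reading of those formulas) applies them to the 20-data/9-video example of Section II.D:

```python
C = 10e6                                  # link capacity (bit/s)
L = 1500 * 8                              # maximum packet length (bits)
rates = [50e3] * 20 + [1e6] * 9           # required rates (bit/s)
f = [r / sum(rates) for r in rates]       # frame fractions, eq. (11)
FS = max(L / fi for fi in f)              # minimum Standard DRR frame, eq. (13)

def theta(i, F):
    """Latency of flow i for frame length F (Theorem 1, phi_i = f_i * F)."""
    phi = f[i] * F
    return ((F - phi) * (1 + L / phi) + len(rates) * L) / C

def fm_video(F, quanta_fit):
    """Fairness measure between two video flows: eq. (12), doubled frame
    term in the Aliquem case when a quantum is below the packet size."""
    i = j = 20                            # index of a video flow
    return ((F if quanta_fit else 2 * F) + L / f[i] + L / f[j]) / C

q = 10
FA = FS / (q - 1)                         # reduced Aliquem frame, eq. (19)

data_gain = 1 - theta(0, FA) / theta(0, FS)
video_gain = 1 - theta(20, FA) / theta(20, FS)
# With q = 10, each video quantum f_video * FA still exceeds L here.
fair_gain = 1 - fm_video(FA, True) / fm_video(FS, True)

print(round(data_gain, 2), round(video_gain, 2), round(fair_gain, 2))
```

The three gains come out at roughly 41%, 73% and 80%, matching the figures quoted in Section IV.D.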
V. SMOOTH ALIQUEM DRR

According to DRR, during a round a flow is serviced as long as its deficit is larger than the head packet. Aliquem DRR isolates the subset of flows that can actually transmit packets during a round, but, according to the pseudo-code in Figure 2, it services each flow exhaustively before dequeuing the next one. Therefore, a burst of packets of overall length at most equal to L_i + φ_i can leave the server when a flow is serviced. Clearly, reducing the frame length also implies reducing the flow burstiness as a side effect, since flows are polled more frequently and for smaller quanta. However, we can further reduce it by forcing each flow to transmit only one packet when selected for transmission; a flow which is entitled to transmit further packets during the current round is then queued back in the current list. The pseudo-code for Smooth Aliquem DRR is reported in Figure 10.

  Aliquem DRR:        p_a,1 p_a,2 p_a,3 p_b,1 p_b,2 p_b,3 p_c,1 p_c,2 p_c,3 p_c,4
  Smooth Aliquem DRR: p_a,1 p_b,1 p_c,1 p_a,2 p_b,2 p_c,2 p_a,3 p_b,3 p_c,3 p_c,4
  Figure 9. Aliquem DRR and Smooth Aliquem DRR Output.

Since a flow might receive service more than once per round, a flag YetServiced[i] is needed in order to distinguish whether it is the first time that flow i is going to be serviced during the round (in which case its deficit must be updated) or not. This flag is set to false for a newly backlogged flow and for a flow which is enqueued in a different active list after being serviced, and is set to true when the flow is queued back in the same active list.

  PacketArrival(packet p, flow I) {
    if (NotBacklogged(I)) {
      ∆I = 0;
      YetServiced[I] = false;
      R = ceiling(Length(p)/φI);
      g = (CurrentList + R) mod q;
      EnqueueFlow(I, ActiveList(g));
    }
    EnqueuePacket(p, FlowQueue(I));
  }

  SmoothAliquemDRRScheduler() {
    while (SystemBusy) {
      while (NotEmpty(CurrentList)) {
        I = DequeueFlow(CurrentList);
        L = Length(HeadPacket(I));
        if (not(YetServiced[I]))
          ∆I = ceiling((L - ∆I)/φI)*φI + ∆I;
        p = DequeueHeadPacket(I);
        TransmitPacket(p);
        ∆I = ∆I - L;
        L = Length(HeadPacket(I));
        if (L > ∆I) {
          R = ceiling((L - ∆I)/φI);
          g = (CurrentList + R) mod q;
          YetServiced[I] = false;
          EnqueueFlow(I, ActiveList(g));
        } else if (L > 0) {
          YetServiced[I] = true;
          EnqueueFlow(I, CurrentList);
        }
      }
      CurrentList = NextNonEmptyList();
    }
  }

  Figure 10. Pseudo-code for Smooth Aliquem DRR.

Smooth Aliquem DRR has a similar operational overhead to Aliquem DRR, which only depends on the q value and on the chosen implementation of the NextNonEmptyList() function. The only space overhead added by Smooth Aliquem DRR is the vector of N flags YetServiced[]. However, it is possible to show that the latency, start-up latency and fairness measure of Smooth Aliquem DRR are the same as Aliquem DRR's.

VI. COMPARISON WITH PREVIOUS WORK

A. Comparison with Other Implementation Techniques

In this subsection we discuss the differences between our proposal and the one described in [15] for reducing the output burstiness of a DRR scheduler. In [15], it is observed that the output burstiness of a DRR scheduler could be reduced by allowing a flow to be serviced several times within a round, one packet at a time. In order to do so, a DRR round is divided into sub-frames; each sub-frame is associated with a FIFO queue, in which references to backlogged flows are stored. Sub-frame queues are visited orderly, and each time a flow is dequeued it is only allowed to transmit one packet; after that, it will be enqueued in another sub-frame queue if still backlogged. The correct sub-frame queue a backlogged flow has to be enqueued into is located by considering the finishing timestamp of the flow's head-of-line packet. The data structure proposed in [15] consists of two arrays of q sub-frames each, associated with the current and subsequent rounds respectively, which are cyclically swapped as rounds elapse. Obviously, the number of operations needed to select the next non-empty sub-frame queue increases linearly with the array dimension. It can be observed that – though this aspect is not dealt with in [15] – it could be possible to reduce the operational overhead of the sub-frames array by employing two VEB-PQs or bit vector trees (one for the array related to the current round and another for the array related to the subsequent round). Thus, assuming that the dimension of a sub-frames array and the number of active lists in Aliquem are
The flow is then queued back (if still backlogged) comparable, both implementations have the same operational either in the current active list (if the deficit is still larger than overhead; however, the sub-frames array data structure takes the new head packet) or in a new active list, located by applying twice as much space as an active list queue of the same length (16). Reducing the output burstiness has several well-known (either with or without employing additional data structures). advantages, such as reducing the buffer requirements on the It is observed in [15] that, although the proposed implementa- nodes of a multi-node path, and improving the average fairness tion reduces the typical DRR burstiness, it does not reduce its and average delay. We define a slightly modified version of performance bounds such as the latency or fairness measure; Aliquem DRR, called Smooth Aliquem DRR, in which flows this is probably due to the fact that the proposed implementa- transmit one packet at a time. Figure 9 shows an example of the tion cannot reduce the frame length, which is in fact bounded difference in the output packet sequence between Aliquem by (2). On the other hand, the Aliquem implementation allows DRR and Smooth Aliquem DRR, assuming that three flows are the frame length to be reduced, which actually reduces latency and fairness measure. B. Comparison with Other Scheduling Algorithms SRR smoothens the round-robin output burstiness by allow- In this subsection we compare Aliquem DRR with some ing a flow to send one maximum length packet worth of bytes existing scheduling algorithms for packet networks. Specifi- every time it is serviced. The relative differentiation of the flows cally, we compare Aliquem DRR with Self-Clocked Fair is achieved by visiting them a different number of times during a Queueing (SCFQ) [6], Pre-Order DRR (PDRR) [13] and round, according to their rate requirements, and two visits to the Smoothed Round Robin (SRR) [14]. 
same flow are spread apart as far as possible. The space occu- pancy required by SRR – which is mainly due to the data struc- We have already observed that the DRR latency and fairness ture needed to store the sequence of visits – grows exponentially measure are related to the frame length. If we let the frame with the number of bits k employed to memorize the flows rate length go to zero, from (5) and (12) we obtain the following requirements. If k = 32, the space occupancy is about 200 DRR limit latency and limit fairness measure respectively: Kbytes. Aliquem DRR does not require such space occupancy. Although the complexity of SRR does not depend on the number N of active flows, Ok() operations must be performed whenever a Θ=0 1 + , (22) iiijL fL∑ flow becomes idle or backlogged, and this may happen many C =≠ jji1, times in a round. It has been observed in [14] that such a burden may be comparable to the ON(log ) complexity of sorted- 1 L L priority algorithms. On the other hand, the operational overhead FM 0 =+i j . (23) ij,  arising from searching for the next non-empty list in Aliquem Cfij f DRR (which has to be done at most once per round) is expected to be much lower, especially if either of the two additional data The limit latency and fairness measure of DRR are equal to structures proposed in Section IV.A is employed. the latency and fairness measure of SCFQ. The latter is a In [14], it is claimed that the start-up latency of SRR (re- sorted-priority scheduling algorithm that has ON(log ) com- ferred to therein as the scheduling delay) is much lower that plexity, and it has been analyzed as an LR server in [3], [4]. that of DRR. We then compare the start-up latency of both al- Thus, Aliquem DRR performance bounds are bounded from gorithms. Let us suppose that a set of N flows requiring rates below by those of SCFQ, to which they tend as q →∞. 
Recall- ρρ ρ 12, ,..., N are scheduled on a link whose capacity is C , ing the example in Section IV.D, we obtain that, when q = 20 , ∑ ρ = with i i C . In that case, after some straightforward ma- latency bounds are 4.1% above the limit latency for data flows nipulation, we obtain the following result for SRR: and 22% above the limit latency for the video flows. PDRR is aimed at emulating the Packet Generalized Proces- 2L  sor Sharing (PGPS) [5] by coupling a DRR module with a pri- SRR<+−=max 1 SRR (24) SNSii1 ority module that manages Z FIFO priority queues. Each Cfi packet which should be transmitted in the current DRR round is instead sent to one of such priority queues, selected according to = where Lmax max ()Li the expected packet leaving time under PGPS. Packets are de- iN=1.. queued from the priority queues and transmitted. The higher the For Aliquem DRR, assuming as a worst case that = S = number of priority queues Z is, the closer PGPS is emulated. Li Lmax and F Lfii, we obtain from (21): In order to locate the non empty priority queue with the highest priority, a min heap structure is employed. It is said in [13] that N SRR PDRR exhibits OZ()log worst-case per packet complexity, the A LS1 Lj S =+−≤max ∑ 1 i (25) latter being the cost of a single insertion in the min heap, pro- i Cfi j=1 Lmax 2 vided that its DRR module operates at O(1) complexity. How- ever, a careful analysis of the PDRR pseudo code shows that this is not the case. In fact, at the beginning of a new round, the This result, which counters the claim made in [14], can be min heap is empty, and it is filled up by dequeueing packets partially explained by considering that SRR performs a large k − from all backlogged flows and sending each of them to the number of visits (up to 21) in a round, and on each visit a relevant priority queue. 
This process has to be completed before flow’s deficit is increased by Lmax , regardless of the flow’s packet transmission is started, otherwise the PDRR ordering actual maximum packet size Li . would be thwarted. Looping through all backlogged flows re- quires ON() iterations, Z of which might trigger a min heap VII. SIMULATIONS insertion. Thus, transmitting the first packet in a round requires In this section we show some of the Aliquem DRR and ON()+ Zlog Z operations. Clearly, the complexity of Smooth Aliquem DRR properties by simulation. We have im- Aliquem DRR is instead much lower. As far as latency is con- plemented both schedulers in the ns simulator [25]. cerned, the expression computed in [13] applied to the example of Section II.D yields a latency for a video flow which is 480ms A. Operational Overhead = when Z 10 , i.e., higher than that of Standard DRR itself We consider a scenario consisting of a single node with a 1 computed by applying (5). This probably implies that PDRR Mbps output link shared by 40 traffic sources. All sources latency bound is not tight, which makes a well-based compari- transmit UDP packets with lengths uniformly distributed be- son impossible. tween 500 and 1500 bytes and are kept constantly backlogged. Twenty traffic sources require 10 Kbps, and the remaining 300 twenty require 40 Kbps. The scenario is simulated for 10 sec- DRR Aliquem onds with both Standard DRR and Aliquem DRR, and the 250 number of operations that are required for each packet trans- mission is traced. 200 By following the guidelines for DRR quanta allocation out- φ = lined in Section II.D, we would obtain i 1500 bytes for the 150 φ = 10 Kbps sources and i 6000 bytes for the 40 Kbps sources,

S operations which yield a frame length F = 150 Kbytes. We want to com- 100 pare the number of operations per-packet transmission of Stan- dard DRR and Aliquem DRR if the frame length is selected as 50 F S 100 , and quanta are allocated consequently, so as to obtain a latency which is close to the limit latency. In Aliquem DRR, = 0 this requires q 101 active lists. In this simulation, we use the 600 620 640 660 680 700 linear search NextNonEmptyList() implementation for packet no. Aliquem DRR and we consider as an operation unit a flow en- Figure 11. Per-packet operations comparison. queueing/dequeueing or an iteration of the loop in NextNon- packet transmission in Aliquem DRR (which is 3.2) is very far EmptyList(). Clearly, this is the most unfavorable scenario from the upper bound, which is += . A thorough for assessing the Aliquem DRR operational overhead reduction, q 2103 since no additional data structure (as a VEB-PQ or a bit vector evaluation of the operational overhead would also entail taking into account protocol-related and architectural issues (such as tree) is employed. Nevertheless, as Figure 11 clearly shows, the header processing, memory transfers and so on), which cannot number of operations per packet transmission is much lower in Aliquem DRR. Moreover, the average number of operations per be easily covered through ns simulation.
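The per-packet mechanics of the Figure 10 pseudo-code can be sketched in Python as follows. This is an illustrative rendering with hypothetical names (it is not the ns-2 implementation used for the simulations above); packet lengths are in bytes, and it assumes every packet fits within (q − 1) quanta, so that the computed active-list index never wraps past the current round.

```python
import math
from collections import deque

class Flow:
    def __init__(self, fid, quantum):
        self.fid = fid
        self.quantum = quantum       # phi_i, in bytes
        self.deficit = 0             # Delta_i
        self.yet_serviced = False    # YetServiced[i]
        self.queue = deque()         # queued packet lengths (bytes)

class SmoothAliquemDRR:
    """Sketch of the Smooth Aliquem DRR scheduler of Figure 10."""
    def __init__(self, q):
        self.q = q
        self.lists = [deque() for _ in range(q)]  # active lists
        self.current = 0                          # CurrentList index

    def packet_arrival(self, p, flow):
        if not flow.queue:                        # flow was not backlogged
            flow.deficit = 0
            flow.yet_serviced = False
            r = math.ceil(p / flow.quantum)
            g = (self.current + r) % self.q       # list of the earliest useful round
            self.lists[g].append(flow)
        flow.queue.append(p)

    def _next_non_empty_list(self):
        # Linear search, the simplest NextNonEmptyList() implementation.
        for step in range(1, self.q + 1):
            g = (self.current + step) % self.q
            if self.lists[g]:
                return g
        return None

    def run(self):
        """Serve all queued packets; return (flow id, packet length) in
        transmission order."""
        order = []
        while True:
            while self.lists[self.current]:
                flow = self.lists[self.current].popleft()
                length = flow.queue[0]
                if not flow.yet_serviced:         # first service this round:
                    # bring the deficit up by as many quanta as needed
                    flow.deficit += math.ceil(
                        (length - flow.deficit) / flow.quantum) * flow.quantum
                p = flow.queue.popleft()          # transmit exactly one packet
                order.append((flow.fid, p))
                flow.deficit -= p
                if flow.queue:
                    length = flow.queue[0]
                    if length > flow.deficit:     # must wait for a future round
                        r = math.ceil((length - flow.deficit) / flow.quantum)
                        g = (self.current + r) % self.q
                        flow.yet_serviced = False
                        self.lists[g].append(flow)
                    else:                         # can send again this round
                        flow.yet_serviced = True
                        self.lists[self.current].append(flow)
            nxt = self._next_non_empty_list()
            if nxt is None:
                return order
            self.current = nxt
```

With two flows holding two 500-byte packets each and 1000-byte quanta, the output interleaves one packet per flow per visit, as in the Smooth sequence of Figure 9.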

TABLE I. AVERAGE AND MAXIMUM DELAY COMPARISON.

                         Aliquem DRR                       Smooth Aliquem DRR
                   q=2      q=5      q=11     q=101     q=2      q=5      q=11     q=101
avg delay (ms)     28.953   19.142   18.669   18.201    24.768   19.263   18.633   18.197
avg delay gain %   -        33.9%    35.5%    37.1%     14.5%    33.5%    35.6%    37.2%
max delay (ms)     61.444   39.627   38.870   37.968    61.689   39.942   38.870   37.968
max delay gain %   -        35.5%    36.7%    38.2%     -0.4%    35.0%    36.7%    38.2%
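The shrinking delays in Table I mirror the behavior of the analytical bounds: the sketch below evaluates the latency bound (5) for decreasing frame lengths and compares it against the limit latency (22). The numeric setting (two flows with equal shares on a 1 Mbit/s link) is illustrative only, not one of the paper's scenarios.

```python
def drr_latency(F, phi, L, C, i):
    """Latency bound of eq. (5): Theta_i = [(F - phi_i)(1 + L_i/phi_i)
    + L_i + sum_{j != i} L_j] / C.  Lengths in bits, C in bit/s."""
    others = sum(Lj for j, Lj in enumerate(L) if j != i)
    return ((F - phi[i]) * (1.0 + L[i] / phi[i]) + L[i] + others) / C

def limit_latency(f, L, C, i):
    """Limit latency of eq. (22): Theta_i^0 = [L_i/f_i + sum_{j != i} L_j] / C."""
    others = sum(Lj for j, Lj in enumerate(L) if j != i)
    return (L[i] / f[i] + others) / C

# Illustrative setting: two flows, equal shares, 1 Mbit/s link,
# 1500-byte (12000-bit) maximum packets.
C, L, f = 1e6, [12000.0, 12000.0], [0.5, 0.5]
for F in (240000.0, 48000.0, 24000.0):          # shrinking frame length (bits)
    phi = [fi * F for fi in f]                  # phi_i = f_i * F
    print(F, drr_latency(F, phi, L, C, 0))
print(limit_latency(f, L, C, 0))                # the F -> 0 limit
```

The printed latencies decrease monotonically with the frame length and approach the limit latency from above, which is the behavior Aliquem exploits by making small frames affordable.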

B. Delay

We have shown in the previous sections how it is possible to achieve lower latencies with Aliquem DRR. In this simulation we show how a flow's average delay can be reduced by increasing the number of active lists q and by using the Smooth Aliquem version. We consider a scenario consisting of a single node with a 10 Mbps output link shared by 40 traffic sources. The flow we monitor sends 500-byte UDP packets at a constant rate of 100 Kbps. The other 39 sources transmit UDP packets with lengths uniformly distributed between 100 and 1500 bytes. Nineteen traffic sources require 100 Kbps, and the remaining twenty require 400 Kbps. The scenario is simulated for 100 seconds with various values of q, under both Aliquem DRR and Smooth Aliquem DRR, and the sum of the queueing and transmission delays of each packet of the tagged flow is traced.

We report the average and maximum delay experienced by the tagged flow's packets in the various experiments in Table I. In order to show the delay gain, we take Aliquem DRR with q = 2 (which exhibits the same behavior as Standard DRR operating at O(1) complexity) as a reference. The table shows that an average and maximum delay reduction of about 33% can be achieved even with q as small as 5; furthermore, employing Smooth Aliquem DRR reduces the average delay by about 15% even when q = 2. We observe that, as q gets higher, the performance metrics of Smooth Aliquem DRR and Aliquem DRR tend to coincide. This is because quanta get smaller, and therefore the probability that a flow is able to send more than one packet per round decreases. Figure 12 shows two delay traces, from which the benefits introduced by Aliquem and Smooth Aliquem are also evident.

Figure 12. Delay trace comparison (delay vs. time, q=2 against q=5, and q=2 against smooth, q=2).

C. Delay Variation in a Multi-node Path

In this experiment we show that reducing the frame length also helps to preserve the traffic profile of a CBR flow in a multi-node path, thus reducing the need for buffering both in the nodes and at the receiving host. We consider a scenario consisting of three scheduling nodes connected by 10 Mbps links. A CBR source sends 125-byte packets with a 10 ms period across the path, thus occupying 100 Kbps. Other traffic sources (kept in asymptotic conditions) are added as background traffic along the path, so that the capacity of each link is fully utilized. On each link, the background traffic consists of 21 sources sending packets with lengths uniformly distributed between 50 and 1500 bytes. Six sources require 0.9 Mbps each, five sources require 0.6 Mbps each, and the remaining ten require 0.15 Mbps each. To avoid correlation on the downstream nodes, background traffic sent from a node to its downstream neighbor is not re-scheduled. Instead, each node removes the incoming background traffic and generates new outgoing background traffic, as shown in Figure 13.

Figure 13. Simulation scenario.

We simulate the network for 350 seconds employing both Aliquem DRR and Smooth Aliquem DRR as schedulers. We select the frame length according to (19), which yields F = 100·(q − 1) Kbytes, and perform experiments with various values of q, tracing the inter-packet spacing of the CBR source packets at the destination host. In an ideal fluid-flow fair queueing system, the inter-packet spacing would be constant (and obviously equal to 10 ms). In a packet-wise fair queueing system, we can expect the inter-packet spacing to be distributed in some way around an average value of 10 ms: in this case, the narrower the distribution, the closer the ideal case is approximated. Figure 14 shows the probability distribution of the inter-packet spacing (confidence intervals are not reported, since they are too small to be visible). For small q values, the distribution tends to be bimodal: this means that the scheduling on the various nodes has turned the original CBR traffic injected by the source into bursty on/off traffic at the destination host. The latter will then need buffering in order to counterbalance the jitter introduced.

Figure 14. Distribution of the inter-packet spacing for the CBR flows for various values of q in Aliquem DRR (left) and Smooth Aliquem DRR (right).
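As a quick sanity check of the scenario figures above, the 21 background sources plus the monitored CBR flow should saturate each 10 Mbps link, and one 125-byte packet every 10 ms should amount to exactly 100 Kbps. The following sketch is plain arithmetic over the stated rates, not a simulation.

```python
# Sanity check of the Section VII.C scenario figures (rates in Kbps).
background_kbps = 6 * 900 + 5 * 600 + 10 * 150    # 21 background sources
packets_per_second = 100                          # one packet every 10 ms
cbr_kbps = 125 * 8 * packets_per_second / 1000    # 125-byte packets -> Kbps
total_kbps = background_kbps + cbr_kbps
print(background_kbps, cbr_kbps, total_kbps)      # -> 9900 100.0 10000.0
```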
As q grows, the curves become narrower around the average value, thus denoting a smoother traffic profile at the destination. The same behavior with respect to a variation of the q parameter can be observed both in Aliquem DRR and in Smooth Aliquem DRR. However, in the latter the distributions for a given q value are generally narrower (note that the horizontal scale of the graph related to Smooth Aliquem DRR is one half of the other), due to the smoothing effect.

VIII. CONCLUSIONS

In this paper we have analyzed the Deficit Round-Robin scheduling algorithm, and we have derived new and exact bounds on its latency and fairness. Based on these results, we have proposed an implementation technique, called the Active Lists Queue Method (Aliquem), which allows DRR to work with smaller frames while still preserving O(1) complexity. As a consequence, Aliquem DRR achieves better latency and fairness. We have proposed several solutions for implementing Aliquem DRR, employing different data structures and requiring different space occupancy and operational overhead. We have also presented a variation of Aliquem DRR, called Smooth Aliquem DRR, which further reduces the output burstiness at the same complexity. We have compared Aliquem DRR to PDRR and SRR, showing that it either achieves better performance with the same operational overhead, or provides comparable performance at a lower operational overhead. Our simulations showed that Aliquem DRR allows the average delay to be reduced, and that it also lessens the likelihood of bursty transmission in a multi-hop environment.

REFERENCES

[1] L. Lenzini, E. Mingozzi, and G. Stea, "Aliquem: a Novel DRR Implementation to Achieve Better Latency and Fairness at O(1) Complexity", in Proc. of the 10th International Workshop on Quality of Service (IWQoS), Miami Beach, USA, May 2002, pp. 77-86.
[2] H. Zhang, "Service Disciplines for Guaranteed Performance Service in Packet-Switching Networks", Proceedings of the IEEE, Vol. 83, No. 10, pp. 1374-1396, October 1995.
[3] D. Stiliadis and A. Varma, "Latency-Rate Servers: A General Model for Analysis of Traffic Scheduling Algorithms," IEEE/ACM Trans. on Networking, Vol. 6, pp. 675-689, October 1998.
[4] D. Stiliadis and A. Varma, "Latency-Rate Servers: A General Model for Analysis of Traffic Scheduling Algorithms," Tech. Rep. CRL-95-38, University of California at Santa Cruz, USA, July 1995.
[5] A. K. Parekh and R. G. Gallager, "A Generalized Processor Sharing Approach to Flow Control in Integrated Services Networks: the Single Node Case," IEEE/ACM Trans. on Networking, Vol. 1, pp. 344-357, June 1993.
[6] S. J. Golestani, "A Self-Clocked Fair Queueing Scheme for Broadband Applications," in Proc. of IEEE INFOCOM'94, pp. 636-646, Jun. 14-16, 1994, Toronto, Canada.
[7] M. Shreedhar and G. Varghese, "Efficient Fair Queueing Using Deficit Round Robin," IEEE/ACM Trans. on Networking, Vol. 4, pp. 375-385, June 1996.
[8] S. Suri, G. Varghese, and G. Chandranmenon, "Leap Forward Virtual Clock: A New Fair Queueing Scheme With Guaranteed Delay and Throughput Fairness," in Proc. of IEEE INFOCOM'97, Apr. 7-11, 1997, Kobe, Japan.
[9] P. Goyal, H. M. Vin, and H. Cheng, "Start-Time Fair Queueing: A Scheduling Algorithm for Integrated Services Packet Switching Networks", IEEE/ACM Trans. on Networking, Vol. 5, No. 5, pp. 690-704, October 1997.
[10] J. Bennett and H. Zhang, "Hierarchical Packet Fair Queueing Algorithms", IEEE/ACM Trans. on Networking, Vol. 5, No. 5, pp. 675-689, October 1997.
[11] F. Toutain, "Decoupled Generalized Processor Sharing: A Fair Queuing Principle for Adaptive Multimedia Applications", in Proc. of the 17th IEEE Infocom '98, San Francisco, CA, USA, March-April 1998.
[12] D. Saha, S. Mukherjee, and K. Tripathi, "Carry-Over Round Robin: A Simple Cell Scheduling Mechanism for ATM Networks", IEEE/ACM Trans. on Networking, Vol. 6, No. 6, pp. 779-796, December 1998.
[13] S.-C. Tsao and Y.-D. Lin, "Pre-Order Deficit Round Robin: a new scheduling algorithm for packet switched networks," Computer Networks, Vol. 35, pp. 287-305, February 2001.
[14] G. Chuanxiong, "SRR: An O(1) Time Complexity Packet Scheduler for Flows in Multi-Service Packet Networks," in Proc. of ACM SIGCOMM'01, pp. 211-222, Aug. 27-31, 2001, San Diego, CA, USA.
[15] A. Francini, F. M. Chiussi, R. T. Clancy, K. D. Drucker, and N. E. Idirene, "Enhanced Schedulers For Accurate Bandwidth Distribution in Packet Networks," Computer Networks, Vol. 37, pp. 561-578, Nov. 2001.
[16] S. S. Kanhere, H. Sethu, and A. B. Parekh, "Fair and Efficient Packet Scheduling Using Elastic Round-Robin", IEEE Trans. on Parallel and Distributed Systems, Vol. 13, No. 3, March 2002.
[17] L. Lenzini, E. Mingozzi, and G. Stea, "A unifying service discipline for providing rate-based guaranteed and fair queueing services based on the Timed Token Protocol", IEEE Trans. on Computers, Vol. 51, No. 9, Sept. 2002, pp. 1011-1025.
[18] L. Lenzini, E. Mingozzi, and G. Stea, "Packet Timed Token Service Discipline: a scheduling algorithm based on the dual-class paradigm for providing QoS in integrated services networks", Computer Networks, Vol. 39, No. 4, July 2002, pp. 363-384.
[19] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, "An Architecture for Differentiated Services", RFC 2475, The Internet Society, December 1998.
[20] B. Davie et al., "An Expedited Forwarding PHB (Per-Hop Behavior)", RFC 3246, The Internet Society, March 2002.
[21] A. Charny et al., "Supplemental Information for the New Definition of the EF PHB (Expedited Forwarding Per-Hop Behavior)", RFC 3247, The Internet Society, March 2002.
[22] P. van Emde Boas, R. Kaas, and E. Zijlstra, "Design and Implementation of an Efficient Priority Queue," Math. Syst. Theory, Vol. 10, pp. 99-127, 1977.
[23] K. Mehlhorn, "Data Structures and Algorithms 1: Sorting and Searching", EATCS Monographs on Theoretical Computer Science, Springer-Verlag.
[24] A. Brodnik, S. Carlsson, J. Karlsson, and J. Ian Munro, "Worst Case Constant Time Priority Queue", in Proc. of the 12th ACM/SIAM Symposium on Discrete Algorithms, Washington, DC, USA, Jan. 2001, pp. 523-528.
[25] The Network Simulator - ns-2.

IX. APPENDIX

A. Proof of Theorem 1

Before proving Theorem 1, let us report the following lemmas.

Lemma 1
"Let φ_i be flow i's quantum, and let L_i be its maximum packet length; the service received by flow i in a time interval [t1, t2) in which it is always backlogged is lower bounded by:

    W_i(t1, t2) > v·φ_i − L_i

where v is the number of service opportunities in [t1, t2)."

Proof: see [7].

Lemma 2
"Let flow i be always backlogged since the beginning of round k. If at least one packet is serviced from flow i during round k+v, v ≥ 1, then the deficit value right after the (k+v)-th service opportunity satisfies Δ_i^{k+v} < φ_i, where the deficit evolves according to:

    Δ_i^{k+v} = Δ_i^{k+v−1} + φ_i − S_i(k+v)    (26)

and S_i(k+v) denotes the service received during the (k+v)-th service opportunity."

Proof
It is easy to prove that (26) holds. Suppose that Δ_i^{k+v−1} > 0 (otherwise Δ_i^{k+v} = φ_i − S_i(k+v) and the thesis obviously holds). This implies that a packet which is longer than Δ_i^{k+v−1} could not be serviced during round k+v−1; let L > Δ_i^{k+v−1} be its length. Clearly, S_i(k+v) ≥ L (otherwise no packet transmission takes place at round k+v), and therefore S_i(k+v) ≥ L > Δ_i^{k+v−1}. By substituting into (26) we obtain Δ_i^{k+v} < φ_i.

We are now ready to compute the latency of a DRR system. We observe that the following proof process can easily be adapted in order to compute the latency of other round-robin LR servers, such as the Packet Timed Token Service Discipline [17], [18].

Theorem 1
"The latency of DRR is:

    Θ_i = (1/C) · [ (F − φ_i)·(1 + L_i/φ_i) + L_i + Σ_{j=1, j≠i}^{N} L_j ]    (5)
"

Proof
Let us denote with W_i(t1, t) the service received by flow i in the time interval [t1, t) (the service curve of flow i). We assume that W_i(t1, t) increases by a quantity equal to the packet length when the last bit of the packet has been serviced. In order to derive the DRR latency, we apply the methodology described in [3], Lemma 7.

a) Case 1: φ_i ≥ L_i
When φ_i ≥ L_i, a backlogged flow can transmit at least one packet on each service opportunity. For mathematical convenience, we define two functions which bound the service curve W_i(0, t):
- W̄_i(0, t), which increases with a C slope whenever flow i is being serviced. More formally: for any instant s in a time interval [s1, s2) ⊆ [0, t) in which flow i is not being serviced, W̄_i(0, s) = W̄_i(0, s1), i.e. W̄_i(0, s) remains constant; for any instant s in a time interval [s1, s2) ⊆ [0, t) in which flow i is being serviced, W̄_i(0, s) = W̄_i(0, s1) + C·(s − s1), i.e. W̄_i(0, s) increases with a C slope;
- w_i(0, t), which increases stepwise, with a step equal to the received service, at the end of each time interval in which flow i is being serviced, and remains constant elsewhere.

Clearly, for any time instant t, the following relationships hold: w_i(0, t) ≤ W_i(0, t) ≤ W̄_i(0, t) whenever i is being serviced, and w_i(0, t) = W_i(0, t) = W̄_i(0, t) whenever i is not being serviced. Figure 15 shows the three curves. For the sake of readability, the channel capacity is assumed to be equal to 1 in the figures, so that both the horizontal and vertical axes have the same scale.

Figure 15. Bounds on the service curve.
Specifically, we assume that flow i is continuously backlogged starting from time t and we [ ) 1 prove that, in a generic time interval tt1, , it is: units of time beforetime of units can assert that, after after that, assert can dition a) holds) under following holds) a) conditions: the dition oe on ntesriecreta (,) (0, theboundlower service on curve than atighter obtain then can we therefore packet; length mum the maxi- than greater by astep increase cannot curve service service curve reaches a height: a curve reaches height: service each time instant instant time each that This implies Each time instant Each instant time We will denote by denoteWe will curve The Lemma b) and we1, assumption to according case, this In • 3

When When

Figure 16. Improved bounds on the service curve for for curve service the on bounds Improved 16. Figure service each flow is always is each backlogged; flow service serviced being i not Figure 15. Bounds on the service curve for v = twtWtL wt wt 1 iiii ' 0 a 0 ,(,) (0, ), (0, max ) (0,

i , ' 0 ) (0, wt Wt T i 0 ) (0, v T i =− , 1 0 ) (0, v t L Wtv isshown in Figure 16. T v s (note that that (note v v v iii service opportunities, the worst-case worst-case the opportunities, service increases by increases astep equal to comes “as late as possible” (i.e. con- (i.e. as late comes “as possible” 0 ) (0, i being serviced i being > thatthe lies time instant W LC increases by increases to a equal step W i i { W ( 3 0 i i , . In a finite capacityIn a finite . system, the ( ( t 0 ⋅− =⋅ 0 ) , s , t t ) v )

φ Ts vv

= i not being serviced being not i wt i

if wt T v i 0 ) (0, φ φ ii wt W wt i by considering: ii = 0 ) (0, φ i ' i } 0 ) (0, ≥ ( ii 0 L

− , φ L t ii ) () L t ). ). φ ≥

ii . L − t

L φ

(28) i at at C

in fact fact in

flow to allocated rate the normalized that confirms which

However, the deficit can be arbitrarily close to the maximum to length. close the packet arbitrarily be can deficit the However, valueof maximumthe length packet, since inequality holds into Lemma 1. curve prove theFigure line will is We this 17). that points points

service 2 φ 3 • • Let us define: φ 1 for that, observe We • Under the above conditions, we have: φ ii ii − ii 4 −

− L We observe that itnotpossible isthat observe deficitfor the variable We to assume the L for flow flow for L Tk L L kF TT () kh i h hi kk TFLL L F TT deficit iscarried over onto subsequent rounds); way always quantum is its that fully exploited(i.e. no every flow flow each flow each first servicefirst opportunity. ∆≅ s CC fC 21 ++ ⋅+ − += =+ iv vi i hh ,(0,) ⋅≤ ++=+− + = + =+ ws =−=−− − = − Θ= ' − sw C ws ws Figure17. Latency computationfor L i 1 ki i ik ik TLL '' i 0 0 ) (0, ) (0,

1 . 4 CC 11 CC sT 11 ; T =+− =++− . This also implies that the line the. This also line that implies that joins 1 22           sF ss C C hi 1 1 ∑∑ ih hi hi + kk ∑∑ hi ih hi , 1 1 ≠= + φφφ ≠ ≠=       1 ≠ ≠ CCfC C fC iiiii ii v φφ ∑∑ FLL ihi hi φφ − −⋅ ii −−− ≠≠ ihi h hi ⋅⋅ > always packets transmits insuch a k Θ starts with the with starts maximum deficit is serviced before flow flow before serviced is L φφ i > ∑ hhii h , has a slope N = 1 time , hi

s 2 ==⋅ T 2 2

    22 () LL φ φ wt fC N ii i i ' 0 ) (0, 1 ⋅ fC N ≥ i 1    L

, (as shown, (as in

s latency-rate 3

2 T 3 i getsits

(33) (32) (31) (30) (29) i is

$\Theta_i$ marks the intersection of the latency-rate curve with the time axis. By substituting (30) into (33), after a few algebraic manipulations we obtain expression (5). According to the methodology described in [3-4], we need to prove that $w'_i(0,t) \ge f_i C (t - \Theta_i)$.

Let us first prove that $w'_i(0,s_k) = f_i C (s_k - \Theta_i)$, $\forall k > 1$. By definition we have $w'_i(0,s_k) = (k-1)\phi_i - L_i$, whilst:

$f_i C (s_k - \Theta_i) = f_i C (s_k - s_2) + \phi_i - L_i = (k-1)\phi_i - L_i$;

this proves the equality. We then prove that $w'_i(0,t) \ge f_i C (t - \Theta_i)$ when $t \ne s_k$, $k > 1$.

Suppose that $t \in (s_k, T_k]$, $k > 1$. In these regions, the curve $w'_i(0,t)$ increases with a slope which is at least $C$ (footnote 5), whilst the latency-rate curve increases with a constant slope $f_i C \le C$. This implies that $w'_i(0,t) \ge f_i C (t - \Theta_i)$ for any $t \in (s_k, T_k]$, $k > 1$, since $w'_i(0,s_k) = f_i C (s_k - \Theta_i)$.

Suppose then that $t \in (T_k, s_{k+1})$, $k > 1$. Since the curve $w'_i(0,t)$ has a step in $T_k$, we can assert that $w'_i(0,T_k) > f_i C (T_k - \Theta_i)$. We also know that $w'_i(0,s_{k+1}) = f_i C (s_{k+1} - \Theta_i)$. Since in $(T_k, s_{k+1})$, $k > 1$, both curves have a constant slope (which is null for $w'_i(0,t)$ and $f_i C > 0$ for the latency-rate curve), we conclude that $w'_i(0,t) > f_i C (t - \Theta_i)$ when $t \in (T_k, s_{k+1})$, $k > 1$.

Suppose now that $t \in [0, s_2)$. When $t \in [0, \Theta_i)$ the latency-rate curve is null, and therefore the assertion holds; when $t \in [\Theta_i, s_2)$ two sub-cases are given:

1. $\Theta_i \ge T_1$: in this case we have $w'_i(0,\Theta_i) \ge f_i C (\Theta_i - \Theta_i) = 0$ and $w'_i(0,s_2) = f_i C (s_2 - \Theta_i)$; in the time interval $[\Theta_i, s_2)$ both curves have a constant slope, and therefore $w'_i(0,t) \ge f_i C (t - \Theta_i)$ when $t \in [\Theta_i, s_2)$;

2. $\Theta_i < T_1$: by using (29), (30) and (33), after a few algebraic manipulations it is possible to prove that this condition is equivalent to:

$\phi_i > \dfrac{1+f_i}{1-f_i}\, L_i \ge 2 L_i$   (34)

This implies that the curve $w'_i(0,t)$ has the profile shown in Figure 18. By using (29), (30) and (33), it is also easy to prove that (34) implies $\Theta_i \ge t^* = T_1 - (2L_i - \phi_i)/C$; therefore the same considerations used for the former cases also apply to this sub-case.

Figure 18. Latency for sub-case 2

This proves that, when $\phi_i \ge L_i$, $w'_i(0,t) \ge f_i C (t - \Theta_i)$, and therefore that $\Theta_i$ is an upper bound to the latency of DRR. We will now show that $\Theta_i$ is a tight bound. In order to prove that the latency bound (5) is tight, we need to prove that there exists a scenario in which the service curve is arbitrarily close to the sloped segment of the latency-rate curve. We prove that it is possible to find a packet sequence for flow $i$ such that point $(s_2, w'_i(0,s_2))$ (which belongs to the latency-rate curve) is arbitrarily close to point $(s_2, W_i(0,s_2))$.

Let us assume that flow $i$ becomes backlogged at time 0, and at that time $\Delta_j^0 = L_j - \delta_j$ with an arbitrarily small $\delta_j$ for any $j \ne i$. Since packets can be of arbitrary length, it is possible to assume that each flow $j \ne i$ fully exploits its quantum on each round, carrying a null deficit onto the subsequent round. Assume that flow $i$ has the following packet length sequence: one small packet of length $\delta_i$, one packet of length $\phi_i \bmod L_i$ (footnote 6), and then $\lfloor \phi_i / L_i \rfloor$ packets of length $L_i$ (as shown in Figure 19). In this case, flow $i$ will not be able to transmit the last packet of length $L_i$ at the first service opportunity, and will therefore accumulate a deficit equal to $\Delta_i^1 = L_i - \delta_i$.

Figure 19. Packets queued at flow i

Let us compute time $T_1$:

$T_1 = \dfrac{1}{C}\left[\sum_{h \ne i}(\phi_h + \Delta_h^0) + (\phi_i - \Delta_i^1)\right] = \dfrac{1}{C}\left[F + \sum_{h=1}^{N} L_h - 2L_i - \left(\sum_{h=1}^{N} \delta_h - 2\delta_i\right)\right]$

Footnote 5: Note that at time $T_k$ the curve has a step.
Footnote 6: This can be null if the quantum contains an integer number of maximum length packets.
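The round-1 behaviour of flow $i$ in this scenario can be replayed with a few lines of code. The helper function and the numeric values below are ours and purely illustrative (they are not part of the original analysis); the deficit-counter update follows the standard DRR discipline of [7]:

```python
def drr_round(queue, deficit, quantum):
    """One DRR service opportunity for a single flow.

    queue: FIFO list of packet lengths; deficit: carried deficit counter.
    Returns (bytes_sent, new_deficit); the deficit is reset to zero
    when the queue empties, as in DRR.
    """
    deficit += quantum
    sent = 0
    while queue and queue[0] <= deficit:
        pkt = queue.pop(0)
        deficit -= pkt
        sent += pkt
    if not queue:
        deficit = 0
    return sent, deficit

# Worst-case sequence of Figure 19 with illustrative numbers:
# L_i = 1000, phi_i = 2500 >= L_i (case 1), delta_i = 1.
L_i, phi_i, delta_i = 1000, 2500, 1
queue = [delta_i, phi_i % L_i] + [L_i] * (phi_i // L_i)  # [1, 500, 1000, 1000]

# Round 1: the last L_i packet stays queued, leaving deficit L_i - delta_i
# and a transmitted amount of phi_i - L_i + delta_i.
sent1, d1 = drr_round(queue, 0, phi_i)

# Round 2: the remaining maximum-length packet goes out.
sent2, d2 = drr_round(queue, d1, phi_i)
```

With these numbers, round 1 transmits $\phi_i - L_i + \delta_i = 1501$ and carries a deficit of $L_i - \delta_i = 999$, matching the values used in the proof.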

At time $T_1$ the service curve of flow $i$ reaches a height of $W_i(0,T_1) = \phi_i - L_i + \delta_i$. At the second service opportunity, flow $i$ will be able to transmit the remaining maximum length packet. Since we have assumed that the service curve only increases when the last bit of a packet has been serviced, $W_i(0,t) = W_i(0,T_1) = \phi_i - L_i + \delta_i$ until the last bit of the grayed packet is transmitted. Let us compute the time instant $t'$ at which the gray packet is transmitted:

$t' = T_1 + \dfrac{1}{C}\left(\sum_{h \ne i} \phi_h + L_i\right) = s_2 - \dfrac{1}{C}\left(\sum_{h=1}^{N} \delta_h - 2\delta_i\right)$

Therefore, point

$\left(s_2 - \dfrac{\sum_{h=1}^{N} \delta_h - 2\delta_i}{C},\; \phi_i - L_i + \delta_i\right)$

belongs to the worst-case service curve. This point can be made arbitrarily close to $(s_2, \phi_i - L_i)$ (which belongs to the latency-rate curve) by choosing appropriately small $\delta_j$ values. This proves that the latency bound derived for $\phi_i \ge L_i$ is tight.

b) Case 2: $\phi_i < L_i$

In this case, flow $i$ is not guaranteed (as it was for case 1) to transmit packets on each service opportunity. According to Lemma 1, we can bound the service received by flow $i$ after $v$ service opportunities as follows:

$W_i(0,T_v) \ge \max\{0,\; v\phi_i - L_i\}$   (35)

Let $v^* = \lceil L_i / \phi_i \rceil$: expression (35) ensures that, after the $v^*$-th service opportunity, flow $i$ has transmitted at least one packet.

Let us assume that flow $i$ transmits a packet during service opportunity $v > v^*$. Then, by Lemma 2, right after time $T_v$, the curve $W_i(0,t)$ reaches a height which is greater than $(v-1)\phi_i$, as shown in Figure 20. Since packet lengths are upper bounded by $L_i$, at time $T_v$ we have:

$W_i(0,T_v) > (v-1)\phi_i - L_i = w_i(0,T_v)$   (36)

Figure 20. Latency computation for $\phi_i < L_i$

The time instant $T_v$ comes "as late as possible" under the same conditions expressed for case 1. Under such conditions, each flow $h \ne i$ receives a service equal to $v\phi_h + L_h$ in $[0,T_v)$, whilst flow $i$ receives a service equal to $(v-1)\phi_i$ in the same interval. Therefore we have:

$T_v = \dfrac{1}{C}\left[\sum_{h \ne i}(v\phi_h + L_h) + (v-1)\phi_i\right] = \dfrac{1}{C}\left[vF + \sum_{h=1}^{N} L_h - (\phi_i + L_i)\right]$   (37)

For any $v > v^*$ in which flow $i$ transmits a packet, the following relationship holds:

$\dfrac{w_i(0,T_{v+1}) - w_i(0,T_v)}{T_{v+1} - T_v} = \dfrac{\phi_i}{F} \cdot C = f_i C$

which implies that the points $(T_v, w_i(0,T_v))$ are joined by a line of slope $f_i C$, which we will prove to be the latency-rate curve. Let us denote with $\Theta_i$ the intersection of the above-mentioned line with the time axis. We have:

$\Theta_i = T_v - \dfrac{(v-1)\phi_i - L_i}{f_i C}$

By substituting (37) into the above expression, after a few algebraic manipulations we obtain expression (5). According to the methodology described in [3-4], we need to prove that:

$W_i(0,t) \ge f_i C (t - \Theta_i)$   (38)

In a worst-case scenario, flow $i$ transmits no packet until time $T_{v^*}$. Let us then prove that $\Theta_i > T_{v^*}$ as a first step:

$T_{v^*} \le \dfrac{1}{C}\left[\sum_{h \ne i}(\phi_h + L_h) + (v^*-1)\sum_{h \ne i} \phi_h\right] = \dfrac{1}{C}\left[\sum_{h=1}^{N} L_h - L_i + v^* F (1-f_i)\right]$

We also know that:

$v^* \phi_i = \lceil L_i / \phi_i \rceil \phi_i < L_i + \phi_i$

Therefore:

$T_{v^*} < \dfrac{1}{C}\left[\sum_{h=1}^{N} L_h + (L_i + \phi_i)\left(\dfrac{1}{f_i} - 1\right)\right] = \Theta_i$

This implies that, for any $t \in [0, \Theta_i]$, (38) holds. We now prove that (38) also holds when $t > \Theta_i$.

Since $W_i(0,t)$ may increase only during a service opportunity, we can assume as a worst case that it remains constant until the end of the subsequent service opportunity (say, the $n$-th) following time $t$, i.e. at time $T_n \ge t$. Therefore, we only need to prove that (39) holds right before the end of each service opportunity (footnote 7), i.e.:

$W_i(0,T_n) \ge f_i C (T_n - \Theta_i)$   (40)

Note that $t > \Theta_i$ also implies $n > v^*$. If flow $i$ has transmitted a packet at the $n$-th service opportunity, we can apply Lemma 2, which ensures that (40) holds. Let us then assume that flow $i$ has transmitted no packets at the $n$-th service opportunity, and let us denote with $m$ the last round before round $n$ at which flow $i$ has transmitted a packet. Let $\Delta_i^m$ be the deficit value of flow $i$ right after the $m$-th service opportunity. We have $v^* \le m < n$ by hypothesis, and Lemma 1 ensures that $\Delta_i^n = \Delta_i^m + (n-m)\phi_i < L_i$. Therefore we can assert the following:

$n - m < \dfrac{L_i - \Delta_i^m}{\phi_i} < \dfrac{L_i - \Delta_i^m}{\phi_i} + 1$   (41)

Since no transmission takes place after round $m$, we have:

$W_i(0,T_n) = m\phi_i - \Delta_i^m$   (42)

If every flow $h \ne i$ always transmits for the maximum time allowed by DRR, the time instant $T_n$ is upper bounded by:

$T_n = \dfrac{1}{C}\left[\sum_{h \ne i}(L_h + n\phi_h) + m\phi_i - \Delta_i^m\right] = \dfrac{1}{C}\left[\sum_{h=1}^{N} L_h - L_i + nF - (n-m)\phi_i - \Delta_i^m\right]$   (43)

Let us compute the difference $T_n - \Theta_i$:

$T_n - \Theta_i = \dfrac{1}{C}\left[(n-m)(F - \phi_i) + mF - \Delta_i^m - F + \phi_i - \dfrac{L_i}{f_i}\right]$

By substituting (41) into the above expression, after some algebraic manipulations we obtain:

$T_n - \Theta_i < \dfrac{1}{C}\left(mF - \dfrac{\Delta_i^m}{f_i} - L_i\right)$   (44)

and therefore:

$f_i C (T_n - \Theta_i) < m\phi_i - \Delta_i^m - f_i L_i$   (45)

By putting (45) and (42) together we obtain:

$W_i(0,T_n) = m\phi_i - \Delta_i^m > m\phi_i - \Delta_i^m - f_i L_i > f_i C (T_n - \Theta_i)$

and therefore (40) holds. This proves that expression (5) is an upper bound to the latency when $\phi_i < L_i$.

In order to prove that (5) is a tight bound, we need to build a scenario in which a point of the service curve can be arbitrarily close to a point of the sloped segment of the latency-rate curve. Let us assume that flow $i$ becomes backlogged at time 0, and at that time $\Delta_j^0 = L_j - \delta_j$ with an arbitrarily small $\delta_j$ for any $j \ne i$. Since packets can be arbitrarily small, it is possible to assume that each flow $j \ne i$ fully exploits its quantum on each round, carrying a null deficit onto the subsequent round. Let us express $L_i$ as $L_i = k\phi_i + r$, with $k \ge 1$, $0 \le r < \phi_i$, and let us assume that flow $i$ has the following packet length sequence:
• $k+1$ packets of length $\phi_i$;
• one packet of length $\phi_i - (r - \delta_i)$;
• one packet of length $L_i$;
• one packet of length $l > \phi_i$.

It is easy to see that, after $k+2$ service opportunities (i.e. after the packet of length $\phi_i - (r - \delta_i)$ has been transmitted), flow $i$ has a deficit equal to $\Delta_i^{k+2} = r - \delta_i$. The subsequent packet (of length $L_i$) is transmitted during the $(2k+3)$-th service opportunity, after which flow $i$ has a deficit equal to $\Delta_i^{2k+3} = \phi_i - \delta_i$, which is in accord with Lemma 2. Therefore, we can assert that:

$W_i(0,T_{2k+3}) = (k+2)\phi_i - (r - \delta_i)$   (46)

We will show that point $(T_{2k+3}, W_i(0,T_{2k+3}))$ can be arbitrarily close to the latency-rate curve. Let us compute the time instant $T_{2k+3}$:

$T_{2k+3} = \dfrac{1}{C}\left[\sum_{h=1}^{N}(L_h - \delta_h) + (2k+3)F - \left((k+1)\phi_i + r\right) + 2\delta_i\right]$   (47)

We can then compute the latency-rate curve value at time $t = T_{2k+3}$, i.e. expression $f_i C (T_{2k+3} - \Theta_i)$:

$f_i C (T_{2k+3} - \Theta_i) = (k+2)\phi_i - r - f_i\left(\sum_{h=1}^{N} \delta_h - 2\delta_i\right)$   (48)
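The comparison between (46) and (48) can be made explicit. Reading (46) as $W_i(0,T_{2k+3}) = (k+2)\phi_i - (r - \delta_i)$ and (48) as $f_i C (T_{2k+3} - \Theta_i) = (k+2)\phi_i - r - f_i(\sum_{h=1}^{N} \delta_h - 2\delta_i)$, the vertical distance between the two points, both taken at abscissa $T_{2k+3}$, collects only $\delta$-terms:

```latex
W_i(0,T_{2k+3}) - f_i C \left(T_{2k+3} - \Theta_i\right)
  = \Bigl[(k+2)\phi_i - r + \delta_i\Bigr]
  - \Bigl[(k+2)\phi_i - r
      - f_i\Bigl(\textstyle\sum_{h=1}^{N}\delta_h - 2\delta_i\Bigr)\Bigr]
  = \delta_i + f_i\Bigl(\textstyle\sum_{h=1}^{N}\delta_h - 2\delta_i\Bigr)
```

which vanishes as the $\delta_h$ are taken to zero.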

By comparing (46) and (48) it clearly emerges that, by choosing arbitrarily small $\delta_h$, points $(T_{2k+3}, W_i(0,T_{2k+3}))$ and $(T_{2k+3}, f_i C (T_{2k+3} - \Theta_i))$ can be made arbitrarily close. This proves that the latency bound (5) is tight when $\phi_i < L_i$.

Footnote 7: Note that we have assumed that the service curve increases right after the end of a packet transmission, and therefore right after the end of a service opportunity.
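The deficit bookkeeping claimed for this packet sequence can be replayed numerically. The helper function and the concrete values ($\phi_i = 300$, $k = 2$, $r = 100$, $\delta_i = 1$) are ours and purely illustrative; the per-round update follows the standard DRR deficit-counter discipline of [7]:

```python
def drr_round(queue, deficit, quantum):
    # One DRR service opportunity for a single flow: add the quantum to the
    # deficit counter, then send head-of-line packets while they fit.
    deficit += quantum
    sent = 0
    while queue and queue[0] <= deficit:
        deficit -= queue[0]
        sent += queue.pop(0)
    return sent, (0 if not queue else deficit)

phi_i, k, r, delta_i = 300, 2, 100, 1          # illustrative values only
L_i = k * phi_i + r                            # 700, so phi_i < L_i (case 2)
queue = ([phi_i] * (k + 1)                     # k+1 packets of length phi_i
         + [phi_i - (r - delta_i),             # one packet of that length
            L_i,                               # the maximum-length packet
            phi_i + 100])                      # a final packet l > phi_i

deficit, sent, deficits = 0, [], []
for _ in range(2 * k + 3):                     # service opportunities 1..2k+3
    s, deficit = drr_round(queue, deficit, phi_i)
    sent.append(s)
    deficits.append(deficit)

# After opportunity k+2 the deficit is r - delta_i; opportunities
# k+3 .. 2k+2 transmit nothing; the L_i packet goes out at round 2k+3,
# leaving a deficit of phi_i - delta_i, as claimed in the text.
```

With these numbers the service accumulated over the first $2k+2$ opportunities is $(k+2)\phi_i - (r - \delta_i) = 1101$, the value appearing in (46).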