1 Packet Clustering Introduced by Routers: Modeling, Analysis and Experiments

Chiun Lin Lim1, Ki Suh Lee2, Han Wang2, Hakim Weatherspoon2, Ao Tang1 1 School of Electrical and Computer Engineering, Cornell University 2 Department of Computer Science, Cornell University [email protected], kslee, hwang, [email protected], [email protected]

Abstract Utilizing a highly precise network measurement device, we investigate router’s inherent variation on packet processing time and its effect on interpacket delay and packet clustering. We propose a simple pipeline model incorporating the inherent variation, and two metrics, one to measure packet clustering and one to quantify inherent variation. To isolate the effect of the inherent variation, we begin our analysis with no cross traffic and step through setups where the input streams have different data rate, packet size and go through different number of hops. We show that a homogeneous input stream with a sufficiently large interpacket gap will emerge at the router’s output with interpacket delays that are negative correlated with adjacent values and have symmetrical distributions. We show that for smaller interpacket gaps, the change in packet clustering is smaller. It is also shown that the degree of packet clustering could in fact decrease for a clustered input. We generalize our results by adding cross traffic. We apply these results to demonstrate how we could reduce jitter by minimizing interpacket gap. All the results predicted by the model are validated with experiments with real routers.

I. Introduction For real-world network traffic, the observation that packets tend to cluster together or become bursty after passing through one or multiple routers is well-documented for several timescales [6], [4], [16], [13]. On longer timescales, proposed explanations for burstiness are centered on input traffic characteristics such as the distribution of user’s idle and active time [27], the distribution of file sizes [9], and TCP congestion control [13], [25]. At the packet level, the clustering could be attributed to contention and scheduling with cross-traffic at the switching fabric and timing variation in routing table lookup. Is there any other factor that can cause bursty traffic, maybe on an even finer timescale? Consider an experimental setup where all the above mentioned factors are not present. Let a single packet stream with fixed packet size, constant interpacket gap, as well as identical destination goes through a single isolated and idle router with no cross traffic. One would expect the router’s output packet stream to have the same constant interpacket gap as the input stream. However, it has been observed that the interpacket gap is not constant but instead exhibits some variation on the order of 100 ns at the output, and when the experiment is repeated for various interpacket delay constants, the variation is observed to be sufficient to induce packet clustering in some experiments [12]. Moreover, if the input stream goes through multiple routers, the clustering effect can be more prominent. In the absence of external factors, such variation could only be explained by factors inherent in the router design itself. Possible explanations include clock drift, buffering scheme and quantization of packets into cells [7]. Our goal in this paper, though, is not on the causes of the variation, but on the effect of the router’s inherent variation on interpacket delay and packet clustering. Such a fine-scale investigation was not feasible experimentally prior to [12], as network measurement devices have significant measurement error, see e.g. [20]. BIFOCALS as introduced in [12] allows exact timing measurement of network packet streams by directly capturing the symbol stream in real-time and time-stamping in off-line post-processing. The time-stamps are exact since the precision of the device is smaller than the width of a single symbol. Our investigation can have potential impact on works and applications related to packet clustering as well as interpacket delay and its variation, i.e. jitter. Clustering packets or bursty traffic affects network performance such as queuing delay [22] and packet loss rate [20]. It is also an important consideration for resource or bandwidth allocation [19], particularly for protocols involving pre-negotiated resources such as Diffserv [3]. Interpacket delay, on the other hand, has been used to detect congestion[5], estimate bandwidth with packet train [18] and trace encrypted connections [26]. Lastly, jitter is an important metric for multimedia and real time protocols such as RTP 2

Ethernet Nominal After 64/66b Frame Data IPD IPG Encoding Size Rate [bytes] [bits] [Gbps] [bits] [bits] 1526 12588 1 125928 113340 520 4288 6 7194 2906 72 594 3 1980 1386 72 594 6 990 396 72 594 9 660 66 Table I: Interpacket delay and interpacket gap for various packet sizes and data rates. All values except for frame size are measured from the physical layer of 10 .

[23]. As our research focus of interpacket delay variation is directly related to jitter, we will in fact extend and apply the results obtained and show how jitter could be controlled. In this work, we couple analysis with experiments by proposing a simple model firmly grounded in experimental validation to further our understanding on the input-output characteristics of a router with an inherent variation. In particular, we focus on the change in interpacket delay and on how the router’s inherent variation is able to induce packet clustering. The paper contributes to the knowledge of networking in the following aspects: • We propose a simple device-independent model incorporating a router’s inherent variation. The model admits a pipeline structure. Using the model, we make several predictions on the properties of the interpacket delay at the router’s output. We predict that adjacent interpacket delays are negatively correlated (Theorem 3) and the histogram of the interpacket delay is symmetrical in shape (Theorem 4). We provide a benchmark to quantify the inherent variation of a router. • A metric is proposed to quantify packet clustering. We show how packet clustering in terms of this metric changes as it passes through one or multiple routers (Theorem 7 and Section V-B). We also develop conditions on when the output is less clustered than the input (Corollary 8). • In terms of analysis, we approach the problem by carefully categorizing different regions (Figures 4 and 8). They have different packet clustering behavior depending on both data rate and packet size intricately. • We generalize our results by incorporating cross traffic (Lemma 10 and Theorem 13). We interpret the results obtained under the context of jitter and show how to control jitter. • We are able to provide qualitative and quantitative explanations on the observed phenomena of [12]. We also verify all the model results with repeatable experiments using SoNIC [15], which is a software-defined network interface card that achieves the same level of precision as BIFOCALS. II. Motivation: Observed Phenomena In this section, we motivate our study by presenting the key phenomena observed in the experiments of [12]. The data presented here are obtained by replicating the experiment using SoNIC [15]. Readers interested in further details should refer to either paper for more information. We first establish some terminologies and conventions before moving on to describe the various experiment setups and their associated observations. We start with the basic measurement. The interpacket delay (IPD) is the space, in bits, between the first bit of a packet and the first bit of the subsequent packet. The interpacket gap (IPG) is the space between the last bit of a packet and the first bit of the subsequent packet, i.e. IPG = IPD − packet size. Both IPD and IPG are measured on the physical layer of the (10 GbE) and so to be precise, packet size from here on refers to the length of the after preamble bits and control characters are inserted by 64/66b encoding (see Clause 49 of IEEE802.3 [2]). So while Ethernet frame sizes range from 72 to 1526 bytes, the actual packet size is longer after 64/66b encoding. Table I shows various Ethernet frame sizes and their corresponding packet sizes. Note that due to 64/66b encoding, the actual capacity of 10 GbE is 10 Gbps × 66/64 = 10.3125 Gbps. For convenience, we will often refer to time in units of bits instead, where 1 bit is equivalent to 97 ps (1/10.3125 Gbps). We say that a stream of packets is homogeneous if all the packets have the same size, the same interpacket delay, and are heading for the same destination. We use shorthand such as 1526B 3G to refer to a homogenous packet stream with 1526-byte Ethernet frames and 3 Gbps data rate. The first experiment of [12] sends a homogeneous packet stream through an isolated Cisco 6500 and the resulting IPD of the output stream is measured. We repeat the experiment and plot the IPD histogram in Figure 1a. The x-axis 3

5 10 105 105 Histogram count 0 Histogram count Histogram count 10 100 100 1.24 1.25 1.26 1.27 0 2000 4000 6000 4200 4400 4600 4800 IPD (bits) x 105 IPD (bits) IPD (bits) (a) 1526-byte packets and 1 Gbps data (b) 72-byte packets and 3 Gbps data rate (c) 520-byte packets and maximum data rate rate Figure 1: Histogram of 1 million observed interpacket delay at the output of Cisco 6500 for various combination of data rates and packet size. The dotted red line marks the interpacket delay of the homogeneous packet stream. is the IPD measured in bits while the y-axis is the count of each IPD on a log-scale. Without inherent variation, since the input is homogeneous, the histogram should be a vertical delta function centered at the input stream’s IPD (similar to Figure 1c). However, the router’s inherent variation causes the actual histogram to be spread out. Further complications arise when the experiment is repeated with different data rates, packet size and number of routers. While the variation may seem small when the IPD is large, when the setup is varied by having a higher data rate and smaller packet size, the variation coupled with the small IPD is sufficient to induce packet clusters (see figure 1b, note that the highest count is at the leftmost value, where the IPG is minimal). The probability of observing such packet clusters further increases if the packets have to pass through multiple routers. In short, inherent router variation is observed to induce packet clusters. However, when the input data rate is close to capacity, the effect of inherent variation seems to disappear (Figure 1c). Our analysis in the subsequent sections are motivated by Figure 1. We attempt to answer the following questions: 1) How should routers with an inherent variation be modeled? (Section III-A) 2) How can we quantify packet clustering? (Section III-B) 3) How do IPD and the degree of clustering change as a packet stream passes through one router? (Section IV) Multiple routers? (Section V) With cross traffic? (Section VI) 4) Finally, does our model reflect reality? We physically validate our model in Section VII using Cisco 6500 routers. The results are consistent in all other routers where the experiments are repeated. This includes Cisco 4948, IBM G8264 and HP Procurve 2900. III. Problem Setup A. Model We assume the input packet stream is homogeneous; though all our results in this section hold as long as the IPD is independent and identically distributed (i.i.d.). The packets will go through one isolated router (one hop) and each packet transitioning through the router experiences a service time, which is the sum of various delays related to processing time and and transmission time. Typically the constant transmission time is the dominant delay while other delays are shorter and could have some inherent variation. Without loss of generality, we group all the possible sources of variation together and refer to it as part of processing time. The model then is simple — the total service time, S, of a packet going through a router is the sum of delays through two servers: S = X + u (1) In the first server, the packet takes a random processing time, X ≥ 0 to be processed. In the second stage, the packet takes a constant transmission time, u to be transmitted (see figure 2). The transmission time is simply the packet size divided by capacity rate we will sometimes refer to u as the packet size. However, we will see in the later sections that in most cases, the actual value of u does not really matter as it cancels out. This two-server model means that it is possible for parallel processing and transmission of different packet, i.e. pipelining, to occur. Each packet of the input stream is indexed with i and its associated processing time, Xi is assumed to be i.i.d. Note that while white papers, for instance [1], exist to explain conceptually how a packet transitions through a 4

queue queue output input

processing, 푋 transmission, 푢

Figure 2: A two-server model of a router with the total service time S being a sum of X, a random processing time and u, a constant transmission time commercial router, the actual implementation is proprietary. Without such information, we have no way of knowing if the processing time is correlated over time. As such, we start out with the i.i.d. assumption though we will come back to reexamine its validity after some further analysis in section IV-A. We denote the interpacket delay between packet i and i + 1 with the random variable Di. We assume packet arrival rate is smaller than the capacity rate, i.e. E [D] > u and packets are processed faster than it can be transmitted, i.e. u > E [X]. We also assume there is sufficient buffer such that neither the processing server nor the transmission server ever overflows. In short, we have modeled the router with two serial servers with deterministic arrivals, independent service time and a first-in, first-out queue discipline. In the terminology of queuing theory, the processing server is a D/G/1 queue while the transmission server is a G/D/1 queue. With this simple model, we are going to analyze how interpacket delay varies as the input stream passes through a router. But first we clarify the notational convention of this paper. We use superscript on the variables to denote 1 router number, not just on D, but also on S, X and other yet to be introduced variables, e.g. Di is the ith interpacket delay after going through the first router. The 0 superscript refers to the input packet stream. Since we will have expressions involving exponents, except for 0 or 1, all other integer superscripts should be interpreted as exponents, 2 e.g. Di is the square of the ith interpacket delay and not the ith interpacket delay after passing through 2 routers. The superscript is sometimes omitted when we are referring to interpacket delay in general while the subscript is sometimes omitted when the statement is applicable to all packets. We use the corresponding lowercase letter when a random variable takes on a particular value, e.g. Pr (D = d) is the probability of the random variable D taking on a value of d. We denote E [·] as the expectation function and any random variable with a tilde accent represents its zero-mean equivalent, e.g. X˜ = X − E [X]. As we do not expect all the audience of this report to have a sufficient background in probability and queuing theory, most of the proofs in this report are intentionally more verbose and may contain steps that seem to be trivial for readers well-versed in the topics. B. Packet Clustering Metric We propose a metric to represent the degree of packet clustering. To start, given that the packet arrival rate is smaller than the router capacity rate, we know the average input data rate is the same as the average output data rate. This simple observation implies that the average IPD is the same before and after going through a router. Armed with this observation, we consider the simplest setup with a sequence of 3 packets as shown in Figure 3 to gain intuition. Since the average IPD is the same, the only variable here is the relative position of the second packet. Intuitively, the least clustering setup is when the packets are uniformly spaced, i.e. the IPG between packet 1 and 2 is the same as between packet 2 and 3. Packets are seen to be more clustered as packet 2 is closer to either packet 1 or 3 and the packets are most clustered when two packets have minimal IPG in between. A metric that fits the intuition is the sum of the square of the IPG. The packet clustering metric, c, given the IPG of all packets, g is given by n 1 X c (g) = g2 (2) n i i=1 where we have normalized by n, the number of IPGs. Next, we want to obtain an equivalent metric that is expressed in terms of IPD instead of IPG. To do so, we first clarify what we mean by an equivalent metric. A metric is equivalent to another if it is order-preserving, i.e. if packet stream a is rated as more clustered than packet stream setup b under the first metric, then the second 5

g g Minimum 1 2 3 2g2 (1-p)g (1+p)g 2 2 Increasing 1 2 3 2(1+p )g 2g Maximum 1 2 3 4g2 Figure 3: An intuitive figure depicting how packet clustering evolves as interpacket delay changes. The square of the interpacket gap fits the intuitive notion.

metric should also rate packet stream a as more clustered than b. We claim that since the packet size is constant, an equivalent metric is the sum of the square of the IPD n 1 X c (d) = d2 (3) n i i=1 where di is the ith IPD . Similarly, the sample variance n 1 X 2 σ2 = d − d¯ (4) n − 1 i i=1 ¯ P where d = i di/n is the average IPD, is an equivalent metric. We leave the verification that (3) and (4) are equivalent metrics to the reader. The metric c (d) computes a value when each Di takes on a particular value di. Often times, we are interested with Di in general, in which case, the expected packet clustering metric is more relevant: " n # 1 X C (D) = E D2 (5) n i i=1 Using the identity var (D) = E D2 − E [D]2, we obtain an equivalent metric n 1 X C (D) = var (D ) (6) v n i i=1  2 When Di, i = 1, 2, . . . , n are i.i.d., C (D) is just the expected second moment of D, i.e. E D , while Cv (D) is the variance of D, i.e. var (D). In short, the metric Cv (D) states that a packet stream is more clustered if its IPD is more variable. Thus, for a homogeneous packet stream, since the IPD is a constant, var D0 = 0, which conforms with our intuition that the homogenous packet stream is the least clustered. For the rest of the paper, for conciseness, we refer to (expected) packet clustering metric as simply packet clustering. For clarity, the guideline for which equivalent metric is applied is as follows. When Di is random and i.i.d., we use var (D). If Di is random but not i.i.d., we use C (D) or Cv (D). To calculate packet clustering for experimental data, we use sample variance regardless of whether the underlying distribution is i.i.d. or not. We note here that our proposed metric is not the only one that fits the intuitive notion of packet clustering in Figure 3. Other convex metrics are possible and we choose the current metric due to its relative ease of analysis. Whenever possible, we will explain the results obtained in an intuitive manner without relying on our choice of metric. IV. The Single-Hop Case We begin our analysis with the single hop case. There are two regions to consider here: when all packets have no waiting time (large interpacket gap) and when some of them do (small interpacket gap). We will give a more detailed explanation in the next subsection but the condition for all packets to have no waiting time occurs when the interpacket gap G0 is strictly greater than the random processing time X, i.e. G0 > X. Conversely, if the 6

14000 1526 bytes, 1Gbps 72 bytes, 6Gbps 12000 520 bytes, 6Gbps 10000 Large interpacket gap 8000 P(G0>S) = 1 6000 Small

Packet Size (bits) 4000 interpacket gap

2000 P(G0>S) < 1

0 0 2 4 6 8 10 Data Rate (Gbps) 9 x 10 Figure 4: Large and small interpacket gap region for Cisco 6500. Also shown are 3 packet size plus data rate combinations that will be encountered in the paper.

Packet 1 arrives Packet 2 arrives 0 퐷1 Processing 푡 1 1 푋1 푋2 Transmission

푢 푢 푡 0 1 1 퐷1 − 푋1 + 푋2 Packet 1 leaves Packet 2 leaves Figure 5: Large interpacket gap interpacket gap is small enough such that X > G0 with positive probability, then some packets will have a waiting time. In other words, suppose the maximum value the random processing time, X could take on with positive 0 probability is xmax, then if G > xmax, we are operating in the large interpacket gap region and vice versa. We use a Cisco 6500 router operating on a 10 Gigabit Ethernet to illustrate the two region. To completely specify a homogeneous packet stream we need to specify its packet size, u and data rate, R. Using the capacity rate of C = 10.3125 Gbps, we could figure out the constant interpacket gap with the formula u/ (IPG + u) = R/C. The region surrounded by borders in Figure 4 represents all the feasible packet size and data rate combinations for a homogeneous packet stream. The maximum value of processing time, xmax is estimated to be about 2176 bits or 0.211 µs (see the next subsection for estimation method) and we could use the value to divide the region into two with the dividing line u = xmax · R/ (C − R). A representative setup for the large interpacket gap region is 1526B 1G. From Table I, the interpacket gap is 113340 bits, which is far greater than xmax. Similarly, a representative setup for the small interpacket gap region is 72B 9G. The interpacket gap is only 66 bits, which is far smaller than xmax. A. Large interpacket gap With no waiting time, the interpacket delay after one router is relatively straightforward to figure out. Consider the sequence of discrete events that occur on packet 1 and 2 as shown in figure 5. Packet 1 arrives at the router, 1 1 receives service for a time period of S1 = X1 + u and leaves the router. The second packet then arrives and 1 1 receives service for a time period of S2 = X2 + u. The interpacket delay after 1 router for packet 1 and 2 is thus 1 0 1 1 D1 = D1 − S1 + S2 . One could continue by considering packets 3, 4, 5 and so on to find that the same results hold: 1 0 1 1 0 1 1 Di = Di − Si + Si+1 = Di − Xi + Xi+1, i = 1, 2,... (7) 7

Notice that u cancels off in the final equation as we pointed out earlier. Observe also that equation (7) only holds 0 1 when the subsequent packet arrives after the prior packet completes service, i.e. Di − Si > 0 always or in other words, G0 > X1, which is the stated condition at the beginning for the large interpacket gap scenario. Equation (7) allows us to estimate max (X), albeit crudely: from the histogram of the output’s interpacket delay (Figure 1a), find the maximum interpacket delay value and subtract the mean interpacket delay from it. This would give us maxi (Xi+1 − Xi), which is an underestimation of max (X). In addition, the distribution of X is dependent on packet size, and estimation of max (X) spans the range from 1716 to 2176 bits. Thus, the line separating the two regions in Figure 4 is not entirely correct though the idea that one could divide the feasible region into two remains intact and we also know the 520B 6G setup belongs to the large interpacket gap scenario, implying that the dividing line is not too far off the mark. While equation (7) looks simple, we can extract plenty of information: symmetry of interpacket histogram (Lemma 2 and Theorem 4), negative correlation of adjacent interpacket delay (Theorem 3) and an accurate unbiased estimator to represent a router’s inherent variation (Lemma 5). Note that while we have assumed a homogeneous input packet stream, all the results in this subsection hold as long as D0 is i.i.d., symmetrical about E D0 and satisfy the assumption that G0 > X1. To start, a function f (x) is said to be symmetric about y = d if f (x + d) = f (−x + d). Similarly, a random variable D is said to be symmetrical about d if its distribution (probability density function) is symmetric about d, i.e. for all x, Pr (D = x + d) = Pr (D = −x + d) (8) which we denote with the notation D − d ∼ − (D − d) where ∼ means same distribution. We need two additional results before proceeding. Lemma 1. Given two independent random variables X and Y 1) The distribution of Z = X + Y is given by the convolution of X and Y ’s distributions: ∞ X Pr (Z = z) = Pr (X = z) · Pr (Y = z − w) w=−∞ 2) If X is symmetrical about x¯ and Y is symmetrical about y¯, then Z = X + Y is symmetrical about x¯ +y ¯, i.e. if X − x¯ ∼ − (X − x¯) and Y − y¯ ∼ − (Y − y¯), then X + Y − (¯x +y ¯) ∼ − [X + Y − (¯x +y ¯)]. To put it simply, symmetry of distribution is preserved over addition of independent random variables. Proof: The first part is an application of (a) total probability and (b) independence between X and Y : ∞ ∞ (a) X (b) X Pr (Z = z) = Pr (X = z) · Pr (Y = z − w|X = z) = Pr (X = z) · Pr (Y = z − w) w=−∞ w=−∞ As for the second part, we start from the LHS of (8) Pr (Z = z +x ¯ +y ¯) = Pr (X − x¯ + Y − y¯ = z) ∞ (c) X = Pr (X − x¯ = z) · Pr (Y − y¯ = z − w) w=−∞ ∞ (d) X = Pr (X − x¯ = −z) · Pr (Y − y¯ = − (z − w)) w=−∞ ∞ X = Pr (X − x¯ = −z) · Pr (Y − y¯ = −z − v) v=−∞ (c) = Pr (X − x¯ + Y − y¯ = −z) = Pr (X + Y = −z +x ¯ +y ¯) = Pr (Z = −z +x ¯ +y ¯) 8

We have applied (c) first part of current lemma and (d) symmetry of X and Y about x¯ and y¯ respectively. 1 We now have the necessary tools to prove that Di is symmetric for all i. 0  0 0 Lemma 2. If the distribution of D is symmetric about E D , then the distribution of Di −Xi +Xi+1 is symmetric about E D0 for all i. Proof: D0 is symmetric about E D0 implies that D0 − E D0 ∼ − D0 − E D0 (9) 1 1 By applying (a) the sequence X1 ,X2 ,... is i.i.d. and (b) part 1 of Lemma 1: ∞ 1 1  (a),(b) X 1  1  Pr −Xi + Xi+1 = x = Pr −Xi = x · Pr Xi+1 = x − w w=−∞ ∞ (a) X 1  1  = Pr −Xi+1 = x · Pr Xi = x − w w=−∞ ∞ X 1  1  = Pr Xi+1 = −x · Pr −Xi = −x − v v=−∞ (b) 1 1  = Pr −Xi + Xi+1 = −x 1 1 This means that −Xi + Xi+1 is symmetric about 0, i.e. 1 1 1 1 − Xi + Xi+1 ∼ Xi − Xi+1 (10) 0 1 1 Since D is independent of −Xi + Xi+1, part 2 of Lemma 1 implies that 0  0 1 1 0  0 1 1  Di − E D − Xi + Xi+1 ∼ − Di − E D − Xi + Xi+1 1 0  0 i.e. the distribution of Di = Di − Xi + Xi+1 is symmetric about E D . 1 1 However, Lemma 2 is not sufficient to prove symmetry, as the interpacket delay sequence D1,D2,... is not i.i.d. 0 1 1 0 1 1 Neighboring terms of the sequence are correlated: D1 − X1 + X2 is not independent of D2 − X2 + X3 due to 1 1 the X2 term. Since the interpacket delays are correlated via the X2 term, which has an opposite sign in each of the interpacket delay, we expect the correlation to be negative. In simple words, if the processing time of packet 1 1 1 2, X2 is large, D1 would be large while packet 3 will catch up to packet 2 making D2 small and vice versa. We state the negative correlation formally. Theorem 3. After one hop, the correlation coefficient between two neighboring interpacket delay is -1/2, that is

 1 1 cov Di ,Dj 1 ρi,j , r = − , for |i − j| = 1 (11) 1  1 2 var Di var Dj Proof: TTo determine the correlation, first define the following zero-mean equivalent: ˜ 1 1  1 ˜ 0 0  0 Xi = Xi − E X , Di = Di − E D The covariance of the interpacket delay is, by definition, 1 1 h ˜ 1 ˜ 1i h ˜ 0 ˜ 1 ˜ 1   ˜ 0 ˜ 1 ˜ 1 i cov Di ,Dj , E Di Dj = E Di − Xi + Xi+1 Dj − Xj + Xj+1 ˜ 0 ˜ 0 Expanding the equation gives us 9 terms. However, 5 terms involving either Di or Dj is equal to zero since it h i h i h i ˜ 0 ˜ 1 ˜ 0 ˜ 1 is independent of all other terms. The independence implies, for instance, E Di Xj+1 = E Di E Xj+1 = 0. 1 1 For the other 4 terms, since the sequence X1 ,X2 ,... is i.i.d.,  h i h i ˜ 1 ˜ 1 E Xi E Xj = 0 , if i 6= j h ˜ 1 ˜ 1i E Xi Xj =  2 ˜ 1 1 E Xi = var X , if i = j 9

Taking the sum of all 4 terms give  2var X1 , if i = j  1 1 1 cov Di ,Dj = −var X , if |i − j| = 1 (12) 0 , otherwise

1 1 1 Since var Di = cov Di ,Di , the correlation coefficient is given by   1 1  1 1 1 , if i = j cov Di ,Dj cov Di ,Dj   1 ρi,j = r = 1 = − 2 , if |i − j| = 1 1  1 2var (X )  var Di var Dj 0 , otherwise

Even though the interpacket delay is negatively correlated, we can still prove symmetry by observing that every other term of the interpacket delay sequence is i.i.d and by using superposition. Theorem 4. The interpacket delay histogram of a homogeneous packet stream at a router’s output is symmetrical about E D0. 1 0 1 1 1 0 1 1 1 1 Proof: Consider the first and third IPD: D1 = D1 − X1 + X2 and D3 = D3 − X3 + X4 . Since X1 ,X2 , ... 0 1 1 are i.i.d. while D is just a constant for a homogeneous packet stream, D1 and D3 are i.i.d. of each other. We 1 1 can easily generalize this and find that the odd-indexed sequence D1,D3,... is i.i.d., and so is the even-indexed 1 1 sequence D2,D4,.... From Lemma 2, we know that the IPD histogram of either the odd or the even-indexed sequence is symmetrical about E D0. The IPD histogram of the whole sequence is just the superposition of the two, and is hence symmetrical about E D0. Note that Lemma 2, Theorem 3 and Theorem 4 all make heavy use of the assumption that the random processing time X is i.i.d. Thus, if experimental data turns out to match the analysis here, then it is reasonable to adopt the i.i.d. assumption. We will see in Section VII that this is indeed the case. We now switch attention to packet clustering. Packet clustering could be determined using equation (7). Since 0 1 1 Di , Xi and Xi+1 are all independent of each other 1 0 1 Cv D = var D + 2var X (13) 0 1 1 For a homogeneous input packet stream, D is a constant and thus Cv D = 2var X . The variance of the processing time var X1 is an important expression that will continue to reappear for the rest of the paper. From a router performance evaluation standpoint, we could think of var X1 as either the metric or the benchmark characterizing the inherent variation of a particular router model. Thus, for practical purposes, it is important to be able to estimate it from available data as accurately as possible. Keeping in mind the negative correlation of the IPD, we could estimate the variance by applying the unbiased sample variance formula on the i.i.d. odd and evenly indexed IPD and then average the two estimates. However, it turns out that despite the negative correlation, we could make an unbiased estimation with a better accuracy from the full IPD sequence. 1 1 1 Lemma 5. Given D1,D2,...,Dn, the sequence of interpacket delays from the output of a router fed with a homogeneous input packet stream, the estimator

n n X 2 β = D1 − D¯ (n + 1) (n − 1) i i=1 ¯ 1 Pn 1 1 1 where D = n i=1 Di is the sample mean, is an unbiased estimator of 2var X , i.e. E [β] = 2var X . 10

Packet 푖 arrives Packet 푖 + 1 arrives Packet 푖 arrives Packet 푖 + 1 arrives 0 푊1 퐷푖 퐷0 푖+1 푊1 = 0 푖 푖+1 Processing Processing 푡 푡 1 1 1 1 1 1 푊 푋 푘 푋 푊 푋 푘 푋 푖 푖 퐼 푖+1 Transmission 푖 푖 퐼 = 0 푖+1 Transmission 푖 푖 푡 푡 푢 푢 푢 푢 푘 푘 Packet 푖 leaves 푘 퐼푖 + 푋푖+1 푋푖+1 Packet 푖 leaves Packet 푖 + 1 leaves Packet 푖 + 1 leaves (a) IPD when there is positive idle time (b) IPD when there is zero idle time Figure 6: Two possible sequence of events in between the arrivals and departures of packets i and i + 1

Packet 푖 arrives Packet 푖 + 1 arrives 1 0 푊 퐷푖 푖+1 Processing 푡 1 1 1 푊 푋 푘 푋 푖 푖 퐼 = 0 푖+1 Transmission 푖 푡 푢 푢 Packet 푖 leaves Packet 푖 + 1 leaves Figure 7: IPD when there is waiting time at the transmission server. The idle time could be positive as in Figure 6a.

Proof: 2  n   ˜ X ˜ 1 n (n + 1) (n − 1) E [β] = E nDi − Dj   j=1 n X X  = var (X) · 2n2 − 4n + 2n 1 + 2n − 2 (n − 1) i=1 |i−j|=1 = 2nvar (X) n2 − 2n + 2 (n − 1) + n − (n − 1) = 2n (n + 1) (n − 1) var (X) where the second inequality follows from applying correlation values from Theorem 3 and noting that if we expand     Pn ˜ 1 Pn ˜ 1 out j=1 Dj k=1 Dk , there are 2 (n − 1) terms where |j − k| = 1. B. Small interpacket gap The interpacket gap in this region is small while the service time is sometimes long enough to induce a waiting time in the next packet. The IPD histogram is no longer symmetrical as its left portion is distorted by the physical requirement of a non-negative interpacket gap (Figure 1b). To get a clearer picture and to figure out how packet clustering happens, we follow the development of the previous subsection by considering the sequence of discrete events that occur on packet i and i + 1. It turns out that we need to consider 3 general cases: positive idle time (figure 6a), zero idle time (figure 6b), and positive waiting time at the transmission server (figure 7). We define Ii as the processing server idle time in between packet i and i + 1 and Wi as the waiting time of packet i at the processing server. 11

From figure 6a and 6b, we see that the idle time is given by 1 0 1 1 0 1 1+ Ii = max 0,Di − Wi − Xi = Di − Wi − Xi (14) where a+ = max (0, a). Similarly, the waiting time of packet i + 1 is given by 1 1 1 0 0 1 1− Wi+1 = max 0,Wi + Xi − Di = Di − Wi − Xi (15) where a− = max (0, −a). In the queuing theory literature, equation (15) is also known as the Lindley equation [17]. Combining all three cases, the IPD after one-hop is given by 1  1 1 Di = max u, Ii + Xi+1 (16) The identity a = a+ − a− gives us 0 1 1 1 1 Di − Wi − Xi = Ii − Wi+1 (17) Combining equation (16) and (17) gives 1 0 1 1 1 1  Di ≥ Di − Wi + Xi + Wi+1 + Xi+1 (18) Recall that for the large interpacket gap region, equation (13) tells us that packet clustering increases by 2var X1. For the small interpacket gap region, some packets are prevented from getting closer together due to the physical requirement of a minimum service time. This implies that intuitively, packet clustering after one router should increase to a value that is less than 2var X1. Before determining the exact value, we need a supporting lemma. Lemma 6. For any random variable Y = Y + − Y −, var (Y ) = var (Y +) + var (Y −) + 2E [Y +] E [Y −]. Theorem 7. For a packet stream with i.i.d. interpacket delay and n → ∞ number of packets, after one hop, 1 Cv D is given by var D1 ≤ var D0 + 2 var X1 − E W 1 E I1 (19) Proof: We start with equation (16) and noting that forcing a minimum value reduces variance 1 1 1  1 1  var Di ≤ var Ii + Xi+1 = var Ii + var Xi+1 (20) 1 1 where the second equality follows from independence of Ii and Xi+1. We look for an expression to substitute for 1 0 1 1 var Ii by applying lemma 6 to Y = Di − Wi − Si . 0 1 1 1 1   1  1  var Di − Wi − Xi = var Ii + var Wi+1 + 2E Ii E Wi+1 (21) 1 0 1 0 1 0 Since Xi is independent of Di and Wi , and Di is independent of Wi , LHS of (21) is equal to var Di + 1 1 1 var Wi +var Xi . Rearranging the terms, we obtain an expression for var Ii and substitute it into (20) to find 1 0 1 1 1   1  1  var Di ≤ var Di + var Wi + 2var X − var Wi+1 − 2E Ii E Wi+1 1 1 1 1 1 1 As n → ∞, each term converges to equilibrium values, e.g. Di → D , Ii → I and Wi+1 → W . We obtain var D1 ≤ var D0 + 2 var X1 − E W 1 E I1 (22) as desired. Since E W 1 E I1 > 0, Theorem 7 confirms our intuition that packet clustering increases by a value less than 2var (X). The theorem also tells us that it is possible for the packet stream to be less clustered if the input packet stream is not homogeneous and E W 1 E I1 > var (X). However, we may not be able to check for it in practice, since we may not have sufficient knowledge or access to determine E W 1 and E I1. We thus have the question: for an input packet stream with i.i.d. IPD, is there an easily verifiable condition to know when the packet stream will be less clustered? Corollary 8. An input packet stream with i.i.d. interpacket delay will be less clustered after going through a router n o2 if 2var X1 ≤ E D0 − X1− . 12

14000 1526 bytes, 1Gbps 72 bytes, 6Gbps 12000 520 bytes, 6Gbps Large 10000 interpacket gap 8000 Medium 6000 interpacket gap

Packet Size (bits) 4000 Small interpacket 2000 gap

0 0 2 4 6 8 10 Data Rate (Gbps) 9 x 10 Figure 8: Large, medium and small interpacket gap region for m = 8 Cisco 6500

Proof: We apply a lower bound by Kingman [14]

n −o2 2E I1 E W 1 ≥ E D0 − X1

n o2 If 2var X1 ≤ E D0 − X1− , then the lower bound and (19) gives

n −o2 var D1 ≤ var D0 + 2var X1 − E D0 − X1 ≤ var D0 i.e. packet clustering has decreased after passing through a router. n o2 Note that E D0 − X1− is large if D0 has a large probability of taking small values, i.e. a large portion of packets are close together. There are two scenarios for the packet to be closer together: one, the input packet stream is getting more clustered and two, the input data rate is getting higher, as higher data rate implies smaller n o2 interpacket delay. Since var D0 + 2var X1 − E D0 − X1− is an upper bound for var D1, the upper bound is getting smaller as either scenario occurs. As such, it is reasonable to expect that the change in packet clustering, var D1 − var D0 to decrease as the input packet stream is getting more clustered or as the input data rate increases (see Figure 15). In cases where the input packet stream is very clustered, we have shown in Corollary 8 that the change in packet clustering could in fact be negative (see Figure 12) V. The Multi-Hop Case In this section, we extend our results to the multiple-router scenario. We denote m as the number of routers and the routers are not assumed to be identical unless otherwise stated. There are now 3 regions to consider. Similar to before, the regions depend on the interpacket gap and maximum processing time. If all packets have zero waiting 0 Pm k time while passing through all routers, i.e. G > k=1 xmax, then we are in the large interpacket gap region. If all packets have zero waiting time for the first j routers and some packets have positive waiting time for the remaining 0 Pl k 0 Pl k routers, i.e. G > k=1 xmax for all l ≤ j < m while for m ≥ l > j, G < k=1 xmax with positive probability, then we are in the medium interpacket gap region. In the small interpacket gap region, some packet have positive 0 Pl k waiting time while passing through each router, i.e. G < k=1 xmax with positive probability for l = 1, . . . , m. For m = 8 identical Cisco 6500, the large, medium and small interpacket gap region are shown in Figure 8. With reference to the single-hop Figure 4, the small interpacket gap region remains the same while the large interpacket gap region is now divided into two. As the number of router grows, the medium interpacket gap region grows while the large interpacket gap region shrinks. 13

A. Large interpacket gap The analysis is this subsection is mostly a straightforward generalization of the results of the single-hop case. Note that for this subsection only the integer exponent refers to router number. First consider m = 2, since we are in the large interpacket gap region, equation (7) holds and we have 2 1 2 2 Di = Di − Xi + Xi+1 1 Applying equation (7) again to Di gives 2 0 1 2 1 2  Di = Di − Xi + Xi + Xi+1 + Xi+1 Generalizing, we obtain for all l ≤ m l l l 0 X k X k Di = Di − Xi + Xi+1 (23) k=1 k=1 The other results concerning negative correlation and symmetry of the IPD histogram generalize in the same manner. In addition, we have a new result concerning how packet clustering changes as it passes through the routers. We state them all formally in the following theorem. l l Theorem 9. In the large interpacket gap region, for any 1 ≤ l ≤ m, the interpacket delay sequence D1,D2,... has the following 3 properties: adjacent interpacket delay has a correlation coefficient of −1/2, the IPD histogram at the final router’s output is symmetrical and if the m routers are identical, then packet clustering increases linearly with the number of routers. Pl k Proof: To prove the first two properties, denote Yi = k=1 Xi to get l 0 Di = Di − Yi + Yi+1

This is exactly equation (7) and since the sequence Y1,Y2,... is i.i.d., negative correlation and symmetry follows from Theorem 3 and 4. Finally, l  l 0 X  k var Di = var D + 2 var X k=1 l l For a homogeneous packet stream and identical routers, var Di = 2lvar (X) and thus Cv D = 2lvar (X) B. Medium and small interpacket gap For medium interpacket gap, for the first j routers where there is no waiting time, packet clustering increases linearly and for the remaining routers, the change in packet clustering would be similar to the small interpacket gap region, which we will analyze now. For small interpacket gap, the results are not as easy to generalize from the single hop case as large interpacket gap. The main difficulty arises from the correlated interpacket delay after the first hop. Such correlation implies that the setup may not converge to a steady state distribution even if the input packet stream is sufficiently long. Additionally, even if the setup does converge, we could go through the same steps as the proof of Theorem 7 to find that due to correlation, we now have an additional covariance term, cov D0,W  which does not have a clear interpretation. As such, we approximate by ignoring the correlation and assuming that the input to all the routers have i.i.d. interpacket delay and our aim in this subsection is not to derive analytical expressions but to use the results from Section IV-B to argue qualitatively how packet clustering would evolve with increasing number of identical hops. Recall that after Corollary 8 we argue that the change in packet clustering, var D1−var D0 would decrease as the input packet stream is getting more clustered or as the input data rate increases. Given the i.i.d. input assumption to all routers, we could now apply this argument at each hop. This means that if we start with a homogeneous input packet stream, then packet clustering would be an increasing and concave function in the number of identical hops. Since var D1 − var D0 could be negative when packet clustering exceeds the threshold specified in Corollary 8, we expect packet clustering to converge to a value for sufficiently large number of identical hops. Conversely, if we start out with a very clustered input packet stream, then packet clustering would be a decreasing and convex 14

Packet 푖 arrives Packet 푖 + 1 arrives Packet 푖 arrives Packet 푖 + 1 arrives 푝 0 0 푇 퐷 퐷 푖+1 푖 푇푝 = 0 푖 푖+1 Processing Processing 푡 푡 푝 푝 푝 푝 푝 푊 푋푖 퐶 푋 푊 푋푖 푋 푖 퐼푖 푖+1 푖+1 푖 푝 퐶푖+1 푖+1 퐼푖 = 0 푝 푝 푝 퐼푖 + 퐶푖+1 + 푋푖+1 퐶푖+1 + 푋푖+1 Packet 푖 leaves Packet 푖 + 1 leaves Packet 푖 leaves Packet 푖 + 1 leaves p p (a) Positive idle time Ii > 0 (b) Zero idle time Ii = 0 Figure 9: Two possible sequence of events at the processing server in between the arrivals and departures of target packets i and i + 1

function in the number of identical hops, and would also converge to a fix value for sufficiently large number of identical hops (see Figure 12). The second implication is that if we have two input packet streams with equal packet clustering but different data rate, then for the input stream with a higher data rate, its rate of change for packet clustering in the number of identical hops would be lower. For instance, suppose we have two homogeneous packet stream with 4G and 6G data rate. Then while packet clustering packet clustering would evolve in an increasing and concave manner in the number of identical hops for both, the function for 4G would be strictly above that of 6G’s (see Figure 15). VI. Adding Cross Traffic In this section, we add in cross traffic and show that the results and discussions from Section IV carry over. We need new terms and assumptions with the addition of cross traffic. We call the packet stream that we are tracking as it goes through the router the target traffic while any other traffic that interferes with it, the cross traffic. It is well-known that Internet traffic is usually not Poisson in nature and is positively correlated over time [16]. For such cross traffic, the setup that we are currently analyzing does not necessarily converge to the steady state distribution. To make the analysis tractable, we make the assumption that the cross traffic is not time varying, i.e. stationary, and the current setup does converge to the steady state distribution. We assume packets are served in a first-in, first-out manner. This is again a simplifying assumption as packets that arrive at different router input ports typically have to undergo contention and scheduling and the final packet service order is not necessarily first-in, first-out. There are two key steps to analyzing the interaction of the two types of traffic. The first step is to divide our analysis into two stages by first finding the interpacket delay after passing through the processing server, ∆1, before moving on to deal with the transmission server. The second step is to further subdivide the waiting time and idle time by source. The waiting time of the ith target packet at the processing server is now p p p Wi = Ti + Ci (24) p p where Ti is the waiting time incurred on the ith target packet till target packet i − 1 is processed and Ci is the waiting time incurred on the ith target packet due to cross traffic that is processed after target packet i − 1. p We define Ii as the idle time the processing server spends not processing any target traffic after the departure of the ith packet and before the arrival of the (i + 1)th packet. Note that it is possible for the processing server to r r r r be processing cross traffic during such idle time. The terms Wi , Ti , Ci and Ii are defined analogously for the transmission server. To prevent notational clustering, we will drop the superscript 1 from all non interpacket delay terms in this section. The analysis to derive the interpacket delay and packet clustering follow the same path as the model with no cross traffic. There are two general cases as shown in Figure 9. Going through the same steps as before, we arrive 15 at p 0 p+ Ii = Di − Xi − Wi (25) p 0 p− Ti+1 = Di − Xi − Wi (26) 1 p p ∆i = Ii + Ci+1 + Xi+1 (27) 0 p p p Di − Xi − Wi = Ii − Ti+1 (28) 1 0 p p ∆i = Di − Wi − Xi + Wi+1 + Xi+1 (29) We next find out the change in packet clustering. Lemma 10. For a target packet stream with i.i.d. interpacket delay and n → ∞ number of packets, if the current setup with cross traffic converges, then var ∆1 = var D0 + 2 [var (X) + cov (Ip + Cp,W p)] (30) Proof: The proof is almost identical to the proof for Theorem 7 and thus some details some will be skipped. We start with equation (27) 1 p p  p p  var ∆i = var (Ii ) + var Ci+1 + var (X) + 2cov Ii ,Ci+1 (31) p We will find an expression for var (Ii ) by applying lemma 6 together with equation (25) and (26) 0 p p p  p  p  var Di + var (X) + var (Wi ) = var (Ii ) + var Ti+1 + 2E [Ii ] E Ti+1 (32) p Rearranging, substituting the expression for var (Ii ) into equation (31) and taking the limit as i → ∞ where each term converges to equilibrium distribution, we find var ∆1 =var D0 + 2var (X) − 2E [Ip] E [T p] + var (W p) − var (T p) + var (Cp) + 2cov (Ip,Cp) (33) Since var (W p) − var (T p) = var (Cp) + 2cov (T p,Cp),−E [Ip] E [T p] = cov (Ip,T p) and var (Cp) = cov (Cp,Cp), the last 5 terms on the RHS of equation (33) simplifies to 2cov (Ip + Cp,W p) and thus we are done. We then move on to the transmission server. Now note that the analysis is identical to the two cases shown in 1 0 Figure 9, with ∆ replacing D , u replacing Xi and Xi+1, and waiting and idle time terms for the transmission server replacing analogous terms for the processing server. Theorem 11. Assuming the setup converges to equilibrium distributions, then var D1 ≤ var D0 + 2var (X) + 2cov (Ip + Cp,W p) + 2cov (Ir + Cr,W r) (34) Proof: All the steps are the same as the proof of Lemma 10, except for equation (32) where we have an 1 r 1 additional covariance term, 2cov ∆i ,Wi due to the correlation of ∆i with adjacent interpacket delay values. 1 r p Substituting ∆i using equation (29) and noting that all the terms are independent of Wi except Wi , we arrive at 1 r p r 2cov ∆i ,Wi = −2cov (Wi ,Wi ) < 0 As we assume packets are processed at a faster rate than they could be transmitted, a larger waiting time at the processing server implies that we should expect a larger waiting time at the transmission server and thus p r cov (Wi ,Wi ) > 0. For the proof here, the equality of equation (32) is thus replaced with ≥, which carries through to give var D1 ≤ var ∆1 + 2cov (Ir + Cr,W r) (35) Now apply equation (10) and we are done. As a simple check, when there is no cross traffic, Cp = 0, Cr = 0, Ip = I, T p = W , cov (I,W ) = −E [I] E [W ], cov (Ir,T r) = −E [Ir] E [T r] < 0 and the expression degenerates to equation (19). Before showing that packet clustering can decrease, we backtrack to the large interpacket gap scenario. From equation (29) and its analogous version for the transmission server, we have 1 0 p r p r D = Di − Wi − Wi − Xi + Wi+1 + Wi+1 + Xi+1 (36) 16

Consider what happens when the target traffic is homogeneous with a sufficiently large interpacket gap. The p p r r waiting time is mostly due to cross traffic, i.e. Wi ≈ Ci and Wi ≈ Ci . Since the cross traffic is stationary, p p r r Ci is approximately independent of and has similar distribution as Ci+1 and similarly for Ci and Ci+1. In such a scenario we could apply the same reasoning as Section IV-A to find that at the output, the target traffic has interpacket delays with adjacent values having a correlation of −1/2 and with a histogram that is symmetrical in shape (see Section VII-C). We now show that even with cross traffic, it is still possible for packet clustering to decrease. Theorem 8 tells us that packet clustering could decrease when the input stream is already clustered. With the addition of cross traffic, we should expect the decrease could only happen when the input stream is more clustered than is required for the no cross traffic scenario. We need an additional result before showing that. Lemma 12. At steady state (i) E [Ip] = E D0 − X − W p h i n o2 (ii) E {Ip}2 ≤ E D0 − X − Cp+ n o2 (iii) var (Ip) is upper bounded by var D0 + var (X) + var (Cp) − E D0 − X − Cp− n o2 (iv) var (Ir) is upper bounded by var ∆1 + var (Cr) − E ∆1 − u − Cr− Proof: The first expression could be obtained by taking the expectation on equation (27), apply E ∆1 = E D0 and rearrange the terms. As for the second equation, start from the inequality D0 − X − W p+ ≤ D0 − X − Cp+ , square both sides, take expectation and apply equation (25). The upper bound for var (Ip) is derived by applying the 2 expressions we just derived. h i var (Ip) = E {Ip}2 − E [Ip]2

n +o2 2 ≤ E D0 − X − Cp − E D0 − X − Cp

 2 n −o2 2 = E D0 − X − Cp − D0 − X − Cp − E D0 − X − Cp

n −o2 = var D0 − X − Cp − E D0 − X − Cp

Expand out the first term and we are done. The upper bound for var (Ir) is derived in a similar manner Theorem 13. With cross traffic, the change in packet clustering is upper bounded by

var D1 − var D0 ≤ 2 [var (X) + var (Cp) + cov (Ip,Cp) + var (Cr) + cov (Ir,Cr)]

n −o2 n −o2 − E D0 − X − Cp − E ∆1 − u − Cr (37)

Proof: Start from ∆1 = Ip + X + Cp, take the variance, and apply the upper bound for var (Ip) from Lemma 12 to obtain

n −o2 var ∆1 − var D0 ≤ 2 [var (X) + var (Cp) + cov (Ip,Cp)] − E D0 − X − Cp (38)

Similarly with D1 = Ir + u + Cr, take the variance, and apply the upper bound for var (Ir) from Lemma 12 to obtain

n −o2 var D1 − var ∆1 ≤ 2 [var (Cr) + cov (Ir,Cr)] − E ∆1 − u − Cr (39)

Sum inequalities (38) and (39) and we are done. 17

Similar to the case with no cross traffic, the last two terms on the RHS of (37) is large if D0 has a large probability of taking small values, and as such, the observation from the end of Section IV-B carry over and it is reasonable to expect that even with cross traffic, the change in packet clustering, var D1 − var D0 to decrease as the input packet stream is getting more clustered or as the input data rate increases. We now apply the observation that we just made to understand jitter. There exists several definitions for jitter and the one we are going to look at is interpacket delay variation (IPDV) as defined in RFC5481 [21]. More specifically, we are investigating the variance of IPDV, which we will refer to n 0 interchangeably with jitter. In the notation of this paper, IPDV is defined as Di − Di and thus for homogeneous n input traffic, its variance is simply var (Di ) and we could apply the results that we have so far in the context of jitter. From the discussion that we just had, we know that jitter would grow slower or even decrease if the interpacket gap in between packets are smaller. The phenomena of decreasing jitter with smaller interpacket gap has been observed in [10], albeit the definition of jitter used is different. Thus, to control jitter, we need to decrease interpacket gap, which could be achieved in three ways. The first is by sending at a higher data rate, which will increase the end-to-end delay as a tradeoff. Alternatively, we could send with smaller packet sizes, though this means we have to send more packets and thus more data overhead in terms of packet headers. Finally, we could send the traffic via a dedicated, rate-limited tunnel through the network though setting up such a tunnel is expensive. Each method of controlling jitter comes with its own tradeoff, and we leave a more thorough investigation for future work. VII. Experimental Validation A. Experiment Setup To validate our model, we deploy a SoNIC board on a Dell T7500 workstation, and use a Cisco Catalyst 6500 router. The workstation is equipped with two 2.93 GHz six-core Xeon 5670 processors, and with 12 GB RAM, 6 GB connected to each of the processors. Two Myricom SFP+ LR transceivers are plugged into the SoNIC board for wire connections. We will call this machine, a SoNIC server. The Cisco router uses Cisco Supervisor Engine 720 for the control plane, and four-port 10 Gigabit Ethernet Fiber module (WS-X6704-10GE) for the data plane. The router is configured to run in default L2 forwarding mode. We connect two ports of the SoNIC board to two 10 GbE ports of the router via LC/SC optical fibers. Then, we use one port of SoNIC to generate packets, and the other port of SoNIC to capture packets routed from the router. We extend the functionality of SoNIC to mimic a multi-hop routing environment. In particular, after capturing and recording the IPD from the output port, we use it to generate a new stream of packets with IPDs identical to the recorded data. This new packet stream is then sent to the input port and the process can be repeated any number of times. This is possible because SoNIC can capture what is sent including the size of packets along with headers and the number of idle characters and bits between any two packets. Therefore, SoNIC can easily reconstruct an identical packet stream with captured information. We use this functionality to validate our model and scale the experiment up to 100 hops of routers. B. Simulation A good way to check the validity of the model is to simulate the model and compare it with experimental data with the same parameters. The model is coded from scratch in Matlab without relying on external libraries. For accurate modeling, ideally we would need to figure out the distribution of the processing time, X for the Cisco 6500 router that we are simulating. Figuring out such a method is also important because different routers have different processing time distribution. However, we did not manage to figure out a way to accurately extract the processing time from experimental data. Without going into details, the most promising approach is by using equation (IV-A) from the single-hop, large interpacket delay setup. The idea is that by comparing the packet timing at the output with the corresponding packet timing at the input, the difference between the two should ideally give a fixed latency difference and a variable processing time. However, we observe that the latency difference is changing too as shown in Figure 10. We conjecture that the change is likely due to clock drift and we leave a detailed investigation of this phenomena to future work. Fortunately, it turns out that the variance of processing time, var (X), alone is sufficient to run the simulation. What we observe is that if we are only concerned about packet clustering, then the underlying distribution does not matter too much as long as its variance is the same as var (X), as estimated using Lemma 5. The simulation in 18

15000

10000

5000

~4000 bits Timing difference of each packet (bits) 0 0 2 4 6 8 10 Packet number 5 x 10

Figure 10: Figure shows the difference between packet timing at Cisco 6500 output and input for homogeneous traffic with 1500-byte packets and 1 Gbps data rate. All the values have been shifted by a constant to illustrate the magnitude of the changing latency difference. Note that the band height of approximately 4000 bits is in agreement with the spread of values in Figure 1a. this section are thus all run with the processing time taking on an exponential distribution with the same variance as var (X). C. Validation All the experiments in this subsection are performed with the target traffic having at least 1 million packets. The packet size of the cross traffic is assumed to follow a log-normal distribution [11], while its variance could be estimated from Internet packet size study, e.g. [24], to be in the order of 106 bits2 or higher. In all the experiments, packet clustering is calculated by finding the variance of the interpacket delays of the target traffic. Experiment 1: Negative correlation of adjacent interpacket delays (Theorem 3): The correlation coefficient is computed using 1 million IPDs obtained from experiment for various setups. The results are summarized in Table II. We can see that the computed values are close to the value of -0.5. The 3 setups where the correlation coefficients are far from -0.5 are from the small interpacket delay region (see Figure 4). Even then, they confirm our intuition that the correlation coefficient for adjacent IPD is negative. Experiment 2: Symmetry in the observed distribution of output IPD for large interpacket gap (Theorem 3 and Theorem 4): We already know that this is confirmed by figure 1a for the no cross traffic case. For the cross traffic case, its observation is mentioned in RFC5481 [21]. We show here an experiment with cross traffic demonstrating the symmetry in Figure 11. The correlation coefficient of adjacent interpacket delays is calculated to be -0.4944.

1526-byte packet 72-byte packet 1 Gbps -0.4942 -0.5022 3 Gbps -0.5494 -0.8673* 6 Gbps -0.4740 -0.1788* 9 Gbps -0.4931 -0.6924* Table II: Correlation coefficient values under various data rates and packet sizes. The 3 starred values are setups that belong to the small interpacket delay region (see Figure 4). 19

x 106 15 Homogeneous Clustered

10

5 Packet Clustering

0 0 20 40 60 80 100 Number of hops Figure 12: How packet clustering evolves over multiple routers for clustered vs. homogenous packet stream. Both packet streams are 72B 3G.

6 10

4 10

2 10 Histogram count

0 10 3 4 5 6 IPD (bits) 4 x 10 Figure 11: IPD histogram for a target traffic of 520B 1G and a cross traffic of 1.2G and a packet size following a log-normal distribution with a mean of 520 bytes and variance of 5.12 × 106 bits2.

Experiment 3: Packet clustering could decrease after passing through one router (Corollary 8 and Theorem 13): We verify that for a very clustered input packet stream, packet clustering decreases after the packet stream passes through the router. Without cross traffic, the result is shown in Figure 12. The clustered packet stream has 10 packets clustered together with minimal IPG and one huge gap before the next cluster. As a comparison, we also plotted the change in packet clustering for a homogenous packet stream with the same packet size and data rate. Notice that as the number of hops increases, packet clustering varies in an increasing and concave manner for the homogeneous packet stream, and decreasing and convex for the clustered packet stream. This agrees with the discussion in Section V-B. With cross traffic, the result is shown in Figure 13. The red dashed line is the y = x line while the blue line represents the collected data on packet clustering at the output. The height difference between the blue line and the red line represents the change in packet clustering. When the blue line is above the red line, it means that change in packet clustering from input to output is positive and vice versa. We see from the figure that as packet clustering at the input increases, the change in packet clustering decreases and eventually goes negative. Experiment 4: IPD properties for the multi-hop, large IPG setup (Theorem 9): For the large interpacket gap region, a 1526B 1G homogeneous packet stream goes through 8 identical routers and the IPD is recorded. The IPD histogram is observed to be symmetrical and packet clustering grows linearly. In addition, the correlation coefficient between adjacent IPD is calculated as -0.4858, which is close to the theoretical value of -0.5. Experiment 5: How packet clustering evolves for different data rates with increasing number of hops (Section V-B): Figure 15 shows experimental data for fixed packet size but varying data rates. The 4, 6 and 7 Gbps setup could be thought of as representing medium IPD, and 8 Gbps the small IPD. For the first few hops, packet clustering evolves almost linearly for all data rates. As the number of hops increases, the increment decreases at a faster rate for the higher rate. The figure tells us that for increasing data rates, packet clustering increases at a decreasing rate, in agreement with the discussion in Section V-B. 20

5 x 10

y = x 15 Data

10

5

Packet clustering at output 0 0 5 10 15 Packet clustering at input 5 x 10 Figure 13: How packet clustering changes from input to output. The target traffic has a data rate of 3G and packet size of 800B. We adjust the packet clustering by varying the parameter for a geometric random variable representing the number of packets which are clustered together with minimal IPG.

6 5 10 x 10 15 5 10 10 5

4 Burstiness 0 10 0 1 2 3 4 5 6 7 8 Number of hops 3 10

2

Histogram count 10

1 10

0 10 1.2 1.22 1.24 1.26 1.28 1.3 1.32 Interpacket delay (bits) 5 x 10 Figure 14: Interpacket delay after 8 hops for homogeneous input traffic of 1526-byte packet and 1 Gbps data rate. Inset shows the linear growth of traffic burstiness.

Experiment 6: Validation via simulation: Figure 16 shows how packet clustering changes as a homogeneous traffic passes through multiple routers for experimental and simulated data. Note that though we have used an exponential distribution to represent X, the simulated values for the model agrees closely with the experimental data even after many hops. VIII. Conclusion and Future Works Our model has formed a framework for understanding how inherent variation in a router affects the input-output characteristic of a router. Using the proposed packet clustering metric, we have carefully analyzed over and experimentally validate various setups with different packet sizes, data rates, number of hops and with or without cross traffic. We have shown that when the interpacket gap is sufficiently large, the interpacket delay at the output will always have a correlation of -1/2 with adjacent values, and have a histogram that is symmetrical in shape. In a more general setting, we have shown that, rather surprisingly, it is possible for packet clustering to decrease after passing through a router. More importantly, we show that the rate of change in packet clustering going from the input to the output of a router is decreasing in the interpacket gap. This result allows us to understand how we could reduce jitter by minimizing interpacket gap. There are plenty of interesting research directions that we could pursue with this paper as a starting point. On the practical side, we could delve into the workings of a router, try to understand how inherent variation comes about and build a practical router model in the same spirit as [8] but which incorporates inherent variation as an important parameter. On the application side, as mentioned before, we could investigate deeper into the tradeoffs 21

4 x 10 2.5 2k⋅ var(X) 2 4G 6G 7G 1.5 8G

1 Packet Clustering 0.5

0 0 5 10 15 20 Number of hops, k

Figure 15: How packet clustering evolves as a homogenous packet stream of 520-byte packets passes through different number of routers for various data rates.

6 x 10 4.5

4

3.5

3 Simulation

Packet clustering 2.5 Experimental data 2 0 20 40 60 80 100 Number of hops Figure 16: Simulation vs. experiment. The input homogeneous traffic has 72-byte packet and 3 Gbps data rate. between the various methods of controlling jitter by minimizing interpacket gap. We could also look at bandwidth estimation using packet-train dispersion [18] and see if our analysis could help in improving the accuracy of existing bandwidth estimation methods. References [1] Cisco Catalyst 6500 white paper. http://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps708/prod_white_ paper0900aecd80673385.html. [2] IEEE Standard 802.3-2008. http://standards.ieee.org/about/get/802/802.3.html. [3] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss. RFC 2475: An architecture for differentiated services. 1998. [4] Jean-Chrysotome Bolot. End-to-end packet delay and loss behavior in the internet. SIGCOMM, 1993. [5] L. S. Brakmo and L.L. Peterson. TCP Vegas: End to end congestion avoidance on a global internet. IEEE J. Sel. Area. Comm., 13(8):1465–1480, 1995. [6] A. Broido, R. King, E. Nemeth, and KC Claffy. Radon spectroscopy of inter-packet delay. In HSN Workshop, 2003. [7] H.J. Chao and B. Liu. High Performance Switches and Routers. Wiley, 2007. [8] R. Chertov and S. Fahmy. Forwarding devices: From measurements to simulations. ACM T. Model Comput. S., 21(2):12, 2011. [9] M.E. Crovella and A. Bestavros. Self-similarity in world wide web traffic: Evidence and possible causes. IEEE/ACM Trans. Netw., 5(6):835–846, 1997. [10] H. Dahmouni, A. Girard, and B. Sansò. An analytical model for jitter in IP networks. Ann. Telecommun., 67(1-2):81–90, 2012. [11] A. B. Downey. The structural cause of file size distributions. In MASCOTS, pages 361–370, 2001. [12] D.A. Freedman, T. Marian, J. H. Lee, K. Birman, H. Weatherspoon, and C. Xu. Exact temporal characterization of 10 Gbps optical wide-area network. In IMC, 2010. [13] H. Jiang and C. Dovrolis. Why is the internet traffic bursty in short time scales. In Sigmetrics, 2005. [14] J. F. C. Kingman. Inequalities in the theory of queues. J. R. Stat. Soc. Ser. B Stat. Methodol., 32, 1970. [15] K.S. Lee, H. Wang, and H. Weatherspoon. SoNIC: Precise Realtime Software Access and Control of Wired Networks. In NSDI, 2013. [16] W.E. Leland, M.S. Taqqu, W. Willinger, and D.V. Wilson. On the self-similar nature of ethernet traffic. In SIGCOMM, 1993. 22

[17] D.V. Lindley. The theory of queues with a single server. In Math. Proc. Carmbridge, volume 48, pages 277–289. Cambridge University Press, 1952. [18] X. Liu, K. Ravindran, and D. Loguinov. A queueing-theoretic foundation of available bandwidth estimation: Single-hop analysis. IEEE/ACM Trans. Netw.., 15(4):918 –931, aug. 2007. [19] Y. Luo and N. Ansari. Bandwidth allocation for multiservice access on epons. IEEE Communications Magazine, 43(2):S16–S21, 2005. [20] T. Marian, D.A. Freedman, K. Birman, and H. Weatherspoon. Empirical characterization of uncongested optical lambda networks and 10gbe commodity endpoints. In DSN, 2010. [21] A. Morton and B. Claise. RFC 5481: Packet delay variation applicability statement. 2009. [22] K. Park, G. Kim, and M. Crovella. On the relationship between file sizes, transport protocols, and self-similar network traffic. In ICNP, 1996. [23] Henning Schulzrinne, Steven Casner, R Frederick, and Van Jacobson. RFC 3550 RTP: A transport protocol for real-time applications, 2003. [24] R. Sinha, C. Papadopoulos, and J. Heidemann. Internet packet size distributions: Some observations. Available: http://netweb. usc. edu/˜ rsinha/pkt-sizes, 2007. [25] A. Tang, Andrew L., Jacobsson K., Johansson K., Hjalmarsson H., and Low S. Queue dynamics with window flow control. IEEE/ACM Trans. Netw., 18(5):1422–1435, 2010. [26] X. Wang, D.S. Reeves, and S. F Wu. Inter-packet delay based correlation for tracing encrypted connections through stepping stones. In ESORICS. 2002. [27] W. Willinger, M.S. Taqqu, R. Sherman, and D.V. Wilson. Self-similarity through high-variability: Statistical analysis of ethernet lan traffic at the source level. IEEE/ACM Trans. Netw., 5(1):71–86, 1997.