Performance Analysis of a Robust Algorithm for Scalable Input-Queued Switches

Itamar Elhanany†, Dan Sadot Department of Electrical and Computer Engineering Ben-Gurion University, Israel †Email: [email protected]

Abstract— In this paper a high-performance, robust and scalable introduced such that the internal fabric bandwidth is S times scheduling algorithm for input-queued switches, called higher than that of incoming traffic, where if N is the number distributed sequential allocation (DISA), is presented and of ports then 1

Switch Port #N packet Buffering (VOQ) Central processing & Arbitration Arbitration Unit Queue Selection Logic (QSL) engine Module (CAU)

Reserved N Offered Destination Bus (ODS) N Output N

Figure 1. The proposed crosspoint-based, single-stage switch fabric architecture. N-bit OR

The efficiency of the scheduler has a paramount impact on New ODS (ϕ*) the amount of buffering required at the linecards and, consequently, the latency through the system. Two types of control links are utilized in the proposed architecture between Figure 2. The output allocation function performed by each ingress port. the nodes and the fabric. The first is an N-bit bus called the offered destination set (ODS), which all nodes have read and write access to, indicating the reservation status of each of the At the initial step, the ODS lines either grant or discard N outputs. We denote a logical ‘1’ on the ODS as representing queue requests using a layer of W-bit AND gates, where W is an available output port while a logical ‘0’ a previously the number of bits per weight. As a result, only weights reserved one. In addition to the ODS, each node receives a corresponding to available outputs advance to the next level. dedicated signal from a central arbitration unit (CAU) which The following step is comprised of a queue selection logic indicates when the node is to perform reservation, as will be (QSL) block which determines the output to be reserved. In the clarified in the following section. The switching interval is a single-bit-weight case, this level may be implemented using a fixed multiple of the time slot duration, where a time slot randomized priority decoder circuit or if queue sizes are represents a single cell time. considered as weights then a maximal value circuit should be deployed. The output of the QSL is an N-bit vector with at B. Scheduling Algorithm most a single bit set to ‘1’ denoting the index of the selected output. This word is inserted into a N-bit OR gate together with The proposed scheduling algorithm is carried out as * sequential selection process. For each switching interval, which ϕ to yield ϕ - the new ODS. In typical staged designs, the may span over one or more time slots, a single output is propagation delay of such QSL circuits is of O(logdN·tc), where allocated by each of the ports. Using the dedicated links d is the number of comparisons performed in each level and tc between the CAU and the ports, the CAU signals each port, in is the propagation delay contributed by each comparison block. turn, to perform allocation. A random order of signaling the Using current 0.13µ CMOS VLSI technology, for 128 ports ports is drawn at the beginning of each interval. Once the last with 2 comparisons per level and tc~500psec, the critical path port completes allocation, the crosspoint switches are duration is approximately 3.5nsec. configured and each port transmits cells to its designated output Upon completion of allocation by the first input node, the in accordance with the selected configuration while, CAU randomly signals the next node to commence output concurrently, a new allocations interval (for the following allocation. That node will perform the same allocation process transmission cycle) begins. Since transmission and scheduling only with the new ODS as a mask of previously allocated are carried out simultaneously, there is no transmission “dead- outputs. Contention is inherently avoided since at any given time” involved,. time only one node attempts to reserve an output. After all N In figure 2, a block diagram of the channel allocation nodes perform reservation, the CAU configures the crosspoint scheme performed by each node is depicted. We let wj denote switches and data traverses the fabric. When multiple classes of th the weight for queue j within the VOQ, and ϕi the i bit service are deployed, a different queue (and hence a unique element in the ODS indicating the availability status of output weight) is associated with each class of service. A basic i. Upon receiving a signal from the CAU, each node performs precondition for supporting the analysis is that the algorithms is output reservation according to two primary guidelines: (a) stable. To that end, we have the following theorem addressing global switch resources status, i.e., available outputs and (b) the issue of stability: local considerations reflected by the priorities map of the Theorem 1: (Stability) The DISA scheduling algorithm with no node's queues. We note that as will be elaborated in subsequent speedup is strictly stable for any admissible, uniformly sections, the term queue weight may refer either to a single bit distributed traffic. denoting queue occupancy state (empty or non-empty), or a value corresponding to the queue size. The reader may refer to appendix A for a detailed proof.

(1) III. QUEUEING MODEL FORMULATION When rearranged, (4) yields the well-known result for We begin our analysis by establishing the notation and Geo/Geo/1 systems [8], queueing formulation framework, which will be used to derive µ − λ π = . the performance metrics. We assume a discrete-time VOQ 0 (5) µ(1− λ) system with a single-server and infinite buffer capacity. The number of queues, N, corresponds to the number of distinct The common goal for each examined scenario is to find a destinations in the system. All events occur at discrete time slot closed-form expression for the steady-state queue size intervals in which at most a single arrival and a single service distribution, π . Of particular interest is π , which represents λ m 0 event may occur. Let ij(n) denote the probability of arrival to the share of time that the queue is empty of cells. In some virtual output queue j in port i at time step n. We further label cases, π is sufficient to derive the queue size distribution while λ 0 ij as the corresponding average arrival rate. In order to in others additional information is required. Based on the queue guarantee admissibility of the traffic, we require size distribution, important performance metrics can ∞be directly λ ≤ λ ≤ = π that∑ ij 1, ∑ ij 1. For convenience, we employ the early- obtained including the mean queue size, E[Q] ∑ m m and i j m=1 arrival model whereby an arrival will precede a service event using Little’s theorem [8], the mean queue latency, within any given time slot. As will be explicitly shown in the E[τ ] = E[Q]/ λ . For any given stable queueing system we ∞ following sections, the service discipline governing each VOQ π = necessitate that πm satisfy ∑ m 1. system (ingress port) can be approximated by a Bernoulli i.i.d. m=0 process, resulting in geometrically distributed interservice times. In the context of the proposed architecture, the task of Throughout the paper we call attention to the distinction the scheduler is to determine which of the virtual output queues between two independent attributes of traffic patterns: the in each port is to be granted transmission during the next destination distribution and the arrival process. Destination switching interval. distribution corresponds to the probability of an arriving cell to be destined to each of the output ports. The most common and Let µ denote the probability of service to the VOQ during simple case is that of uniform distribution, whereby a packet each interval. Once a service event occurs, the internal has an identical likelihood of being destined to each output, arbitration scheme determines which of the queues is granted λ = λ = thereby implying that i / N, i, 1,2,..N. However, real life transmission. Let Qij(n) designate the queue occupancy of queue j in port i at time step n, such that traffic has been shown to be characterized by non-uniform destination distribution throughout all levels of the network = { − + − } Qij (n) max Qij (n 1) Aij (n) Dij (n),0 (1) hierarchy [9], [10].

where Aij(n)∈{0,1} and Dij(n)∈{0,1} are the number of arrivals and departures during time step n, respectively. To ensure the IV. UNIFORM BERNOULLI ARRIVALS existence of a stochastic equilibrium of the queueing system, The most commonly examined traffic is uniformly the arrival rate for each queue should converge to the departure distributed between the outputs and obeys a Bernoulli i.i.d. rate, yielding the condition υ arrival process. Recalling that k denotes the mean size of the n n ∑ A(t) ∑ D(t) ODS at the beginning of the kth phase (k = 1,2,…N), we note lim t =1  = lim t =1  (2) →∞ →∞ that υ1 = N since at the beginning of the first phase all outputs n  n  n  n  are unreserved. Moreover, υk forms a monotonically non- Utilizing the fact that µ ≡ 0 together with the statement of increasing series given that during each phase either a node 0 υ υ convergence between the arrival and departure rates, a general selects an output, resulting in k+1 = k-1 or, alternatively, it υ υ expression for the queueing can be written as does not select any output implying that k+1 = k. ∞ The probability that a node selects no outputs may be λ = µ π ∑ m m (3) interpreted as the probability that all non-empty queues m=1 belonging to the ODS are empty. In all other cases one output is selected. Employing the early arrival model described in where λ is the mean arrival rate, πm is the steady-state queue chapter 2, the probability of a queue not selecting any outputs size distribution (i.e. πm = P{Q = m}) and µm is the probability of service given that the queue size is m. For a Geo/Geo/1 equals the probability that all queues within the ODS were empty at the beginning of the time slot and no cell has arrived system in which µ =µ, a direct outcome of the steady-state m during the current time slot. Expressing the above balance equation for the early arrival model can thus be written mathematically, gives us the following recursive expression as forυk+1 : ∞ , (4) λ = ∑µ()()π ()1− λ + π − λ =µ 1− π (1− λ)  υ υ m m 1 0 υ with prob. π k (1− k λ) m=1  k 0 υ = N k+1  (6) where the term π0(1-λ) expresses the probability that the queue  υ υ υ −1 with prob. 1-π k (1− k λ) was empty prior to the arrival phase and no cell has arrived.  k 0 N υ and hence, combining the probabilistic terms in (6) into an µ = expected value expression yields υ − −π + (12) N[( 1)(1 0 ) 1]  N k = 0  Substituting (8) in (11) and assuming that 1/N<<1, we finally υ = (7) E[ k +1 ]  υ  [υ ]  obtain υ − + π E[ k ]  − k λ  ≤ ≤ − E[ k ] 1 0 1 0 k N 1 .   N  − λ + π ≈ (1 )(N 1) . 0 (13) By applying summation and rearranging (7), it can be shown (1− λ)(N +1) + λ that an approximation of the mean ODS size, υ = υ , is E[ k ] Utilizing the results from the Geo/Geo/1 model we find the given by mean queue occupancy to be

2 −π λ N +1 π [N(1−π ) − λ] 1 0 υ = + 0 0 (8) E[Q] = = (14) −1 2 π − λ + −π 0 (1 )(N 1) 2 N(1 0 ) and, accordingly, the mean queueing delay is For high load conditions (λ→1) the probability of a queue being empty is low, resulting in the intuitive conclusion that E[Q] N υ E[τ ] = = . (15) →(N+1)/2, which is the mean of the arithmetic series λ / N (1− λ)(N +1) decreasing from N to 1. Our initial examination is that of random selection arbitration where non-empty queues contend The latter implies that the mean queueing delay, for large for transmission in an equal manner. In other words, a queue values of N, does not depend on N and is roughly 1/(1-λ). containing 3 cells and another containing 10 cells have Figure 3 depicts the mean queueing delay of a 128-port switch identical probabilities of prevailing for transmission. In the with Bernoulli i.i.d. uniformly distributed arrivals, compared to context of the DISA algorithm, there are three conditions that the theoretical limit of an output queued switch. must be met for a queue to be selected: (1) the queue must reside within the ODS, (2) the queue must be non-empty and In practice, completing the scheduling cycle within a single (3) the queue must prevail when contending against the other cell time (approximately 50 nsec for 10 Gbps links) for N>16 non-empty queues in the ODS such that is impractical. In view of the latter, we next investigate the performance implications of increasing the switching interval Q prevails for  duration beyond one time slot. It is apparent from the nature of = {}{}∈ψ >  ij ∈ψ >  Pdeparture P Qij P Qij 0 P Qij ,Qij 0 (9) a multi-time-slot service discipline that we are still considering  transmission  a memoryless process, particularly in view of the fact that consecutive service (switching) events are independent of From (8), an estimated average probability of being in the previous ones. Accordingly, we may continue to utilize the υ ODS is / N and the probability that a queue is non-empty Geo/Geo/1 model providing that we find an expression for π0 following an arrival phase is (1-π0(1-λ/N)). The probability of which reflects on the lengthy switching intervals. It is intuitive prevailing in the internal contention for transmission can be that as these intervals increase in duration, more cells expressed as accumulate in the queues and hence the mean queue occupancy is expected to increase and π to decrease. Using similar   0 Qij prevails for  1 P Q ∈ψ ,Q > 0 = (10) rationale to that of (3), we may write ij ij υ − −π +  transmission  ( 1)(1 0 ) 1 λ B B = µ′∑α i where (υ −1)(1−π ) represents the mean number of non- (17) 0 N i=1 empty queues in Ψ (other than the queue analyzed) to which where α = ()−π − λ representing the probability of a we add 1 to account for the analyzed queue. Substituting these 1 0 (1 / N) terms in (9) yields queue being non-empty, B is the duration of the switching interval in time slots and λ υ ()1−π (1− λ / N) P = = P = 0 (11) υ arrival N departure N[(υ −1)(1−π ) +1] µ′ = . (18) 0 υ − −π + N[( 1)(1 0 ) 1] The service discipline is governed by a memoryless process, primarily since within each time slot decisions are The left hand side of (17) corresponds to the mean number made regardless of the outcome of previous time slots. To that of arrivals that are expected within B time slots, while the right end, the queueing behavior may be described using a hand side pertains to the probability of a service event Geo/Geo/1 model whereby both the arrival and service occurring ( µ′) multiplied by the probabilities of having B non- processes are the outcomes of Bernoulli i.i.d. trials with empty time slots in succession. Equation (18) is a polynomial parameters λ and µ, respectively. Equation (11) states that the expression of order B with only a single solution for α in the probability of departure equals the probability of the queue region (0,1). Using α we find π0, from which the mean queue being non-empty multiplied by a term for the probability of occupancy and mean waiting time are derived. Figure 4 depicts service. Accordingly, we may isolate the probability of service, the mean queueing delay as a function of the offered load for µ, as various switching interval durations.

50 4 10 Output Queueing B = 1 45 DISA B = 2 B = 4 B = 8 3 40 10 ) st ) o 35 st l ol s s e e 2 m 30 mi 10 ti (t ( y y al al e e 25 D d g ni 1 g e ni u 10 e 20 e u u e Q u n q 15 a n e a M 0 e 10

M 10 Mean Queueing Delay (cell times) Mean Queueing Delay (cell times)

5 -1 10 0 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Offered Load Offered Load Figure 4. Mean queueing delay for N=128 with random selection arbitration Figure 3. Mean queueing delay for DISA and an output queued switch with and various switching interval durations. N=128 and Bernoulli i.i.d. uniformly distributed arrivals

αΒ We next examine longest-queue-first (LQF) arbitration in which, in contrast to random selection, the queue with the α α 2 largest occupancy is selected for transmission. The significance 1 of considering the queue size becomes apparent when the switching interval is larger than one time slot thus allowing for 0 1 2 . . . B the accumulation of cells. Figure 5 illustrates a Markov chain β β that describes the queueing behavior with LQF and a switching 1 2 β interval of B time slots. The states correspond to queue Β occupancy where the n-step forward and reverse transition probabilities are defined as Figure 5. Markov chain for a system with LQF arbitration where αn and βn are the n-step forward and reverse transition probabilities, respectively. −  B λ n  λ B n α =    1−  n n  N   N  , n = 1,2,…,B reflecting on the probability that the size of υ −1 queues are −  B λ  B n  λ n independently smaller or equal to m. Numerical techniques β =    1−  π n n  N   N  (19) may be applied to obtain approximations of m from which the

mean queue occupancy and mean delay are derived. Assuming there are J+1 states in the system, implying that ()> → P Q J 0 , we obtain J+1 equations from the Markov V. NON-UNIFORM DESTINATION DISTRIBUTION π chain. The variables included in this set of equations are m λ µ Let k denote the non-homogeneous probability that a cell (m=0,1,…J) and m (m=1,2,…J), amounting to 2J+1 is destined to output k, such that for a given VOQ independent variables. Since the Markov chain is assumed to N ∞ ∑λ = λ π = system k . The service discipline is independent of the be in equilibrium, an additional equation is ∑ m 1. The k =1 m=0 arrival process and remains the same as before. We assume remaining J equations are given by the approximation random selection between non-empty queues as the VOQ arbitration scheme. We further define Q {k=1,2,..N} as the size 1 N  υ P()Q ≤ m,i ∈ν  k µ = ∑  i  of the kth queue having the arrival ratesλ . m υ − π + (20) k N υ=1  N  ( 1) m 1  Utilizing the Geo/Geo/1 model for the same reasons which is the expected probability that the VOQ is serviced described in the case of uniform distribution, we focus our multiplied by the probability that the size of all other υ -1 analysis on expressing the steady-state probability of Qk being queues in the ODS is smaller or equal to m and the queue π k = = empty, i.e. 0 Pr{Qk 0}. Recalling the three conditions for prevails in contending against the set of other queues with size transmission and accordingly rewriting (3) by replacing the equal to m. The inner probabilistic term in (20) an be coarsely generic term λ with λ , we obtain approximated as k k k υ−1 ()1 − π (1 − λ ) ()1 − π (1 − λ ) m λ = µ o k = µ o k   k ˆ ˆ (22) ()≤ ∈ν ≈  π  ∑ (1 − π j ) +1 N − ∑π j P Qi m,i ∑ j  (21) 0 0  j=0  j≠k j≠k 4 p 10 Uniform, Random Selection Uniform, LQF Zipf(k=1),Random Selection p p Zipf(k=1), LQF 3 10 κ ) . . 2 s . t ol s κ e 2 κ 1 m 10 Ν (1-q)/N (ti y al e D g ni 1 e 10 1-p u e 1-p u Q n (1-q)/N a e 1-p M 0 10 (1-q)/N Mean Queueing Delay (cell times) κ 0 -1 10 0.3 0.4 0.5 0.6 0.7 0.8 0.9 (OFF) Offered Load q Figure 6. A comparison of the mean queue occupancy for a 16-port switch with Bernoulli i.i.d. arrivals obeying uniform and Zipfk=1 destination distributions. Figure 7. Markov chain characterizing the behavior of a uniformly distributed ON/OFF arrival process to a VOQ with N ports. Each port receives µ = υ π k where ˆ / N . Isolating 0 gives us a mean offered load of λ/N. λ   k 1 j π = − k N − ∑π  0 − λ µ − λ 0 (23) 1 k ˆ(1 k )  j≠k  3 10 DISA (B=8 cells) Letting Π = π 1 π 2 π N T and rearranging (23) for each of N { 0 , 0 ,..., 0 } Output Queueing the queues, we arrive at a matrix solution in the form of

) 2  1 α α α ... α   α − β N  s 10  1 1 1 1   1 1  e mti  α 1 α α ... α   α − β N  l 2 2 2 2 2 2 e     c( α α 1 α ... α Π = α − β N (24) y  3 3 3 3  N  3 3  c n  ......   ...  et a     l 1 n α α α ... α 1 α − β N  a 10 N N N N N N e M where Mean Queueing Delay (cell times) λ α = − k β = − 1 k , k (25) µˆ(1− λ ) 1− λ 0 k k 10 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Solving this linear system, we find expressions for which, by Offered Load

ρ π k letting k=1- 0 and utilizing the results for the Geo/Geo/1 model, yields Figure 8. Performance of the DISA scheduling algorithm in a 128-port switch under bursty traffic with mean burst size of 32 cells. LQF arbitration is E[Q ] ρ employed utilizing switching intervals of 8 cells in duration. E[τ ] = k = k . (26) k λ λ − ρ k k (1 k ) Figure 6 illustrates the mean delay for a 16-port switch with VI. CORRELATED ARRIVALS

Bernoulli i.i.d. arrivals which are distributed both uniformly In an aim to evaluate the DISA algorithm under traffic and according to Zipfr=1, where patterns that more accurately portray the nature of Internet −1 traffic, we focus our attention on scenarios with ON/OFF [10]  N  = λ = −r  −r  (27) correlated arrival patterns. We extend the basic two-state Zipf r (k) k k ∑ j   j=0  ON/OFF model to accommodate for an N-queue VOQ system. Within each burst all cells are destined to the same output. The and r is a model parameter, pragmatically assumed to equal bursts are geometrically distributed in size, as with the classic approximately 1. As can be noted from the results, random two-state ON/OFF model. The process alternates equally selection leads to significant degradation in the performance between the various queues such that each queue receives, on when non-uniform destination distribution is introduced, average, an equivalent portion of the traffic, i.e. λi=λ/N for particularly at high loads. However, when utilizing LQF the i=1,2,…N. The Markov chain describing the behavior of this algorithm exhibits notable robustness to the destination arrival process is depicted in Figure 7. distribution. Figure 8 illustrates the mean latency of the switch, where analyzed, without loss of generality, as a GI/G/1 queueing bursty traffic is applied to a 128-port switch with envelope system with Q(n) denoting the queue occupancy at time step n. switching intervals and LQF arbitration employed. The reader The latter implies generic arrival and service processes. We will note that the performance is relatively close to that of an label E[Tn] as the mean interarrival time and E[Sn] as the mean output queued switch. The reason for the exceptionally low interservice time. It has been shown that for a GI/G/1 queue, if delay lies in the fact that correlated traffic results in temporarily E[Tn]>E[Sn] then limn→∞{Q(n)}=C exists, and thus the queueing un-evenly populated queues, allowing the scheduler to more system is stable. An alternative and identical interpretation of efficiently utilize lengthy switching intervals. As a result, the this stability criterion is that the mean probability of arrival, -1 latency obtained under bursty scenarios is lower than that of λ=(E[Tn]) is smaller than the mean probability of service, Bernoulli i.i.d. traffic. The deployment of LQF allows the µ=(E[S ])-1. Under the assumption of uniformly distributed scheduler to grant service to the most populated queue, hence n arrivals, the mean probability of arrival is λ /N. Recalling that improving the switching capacity utilization. Since real-life υ th traffic does tend to be correlated on several levels, the k denotes the mean size of the ODS at the beginning of the i robustness exhibited by the DISA algorithm under bursty phase (i = 1,2,…N). By observing that at most a single output is arrivals is perhaps one of its key properties. selected by each input during the reservation process, we have υ ≥ ()− ()− . Accordingly, we can write k N i 1 / N VII. CONCLUSIONS 1  N − ()i −1  Pr{service| phase = i} ≥   . (28) We have presented the DISA scheduling algorithm along ri  N  with a complementary switch architecture. Analytical where ri represents the mean number of non-empty queues at foundations coupled with simulation results have been th provided for evaluating the performance of the system under the i phase. The latter pertains to the worst case scenario different traffic scenarios, including non-uniform destination whereby in each phase an output is matched to an input and distribution and correlated arrivals. Low latency and maximal thus removed from the ODS. If this is not the case, the throughput are exhibited for admissible traffic loads. Ease of probability of service increases since more destinations are implementation in conjunction with the ability to utilize offered resulting from a larger ODS. Since the probability that commercially available crosspoint switches with moderate a queue contends for transmission during each of the phases is configuration rates, further accentuate the attractiveness of the uniform and equal to 1/N, we have proposed scheme for high-port density switching platforms. λ < E[Pr{service}] = N . REFERENCES (29) 1  1 (N −1) 1 (N − 2) 1 1   + + + ...+  N  r N r N r N  [1] Information available at: http://www.mindspeed.com. 1 2 3 [2] Information available at: http://www.vitesse.com. However, we know that r ≤(N-(i-1))

APPENDIX A: PROOF OF STABILITY FOR THE DISA ALGORITHM

In this appendix we provide a proof for theorem 1. We being by noting that each of the virtual output queues can be