Performance Analysis of a Robust Scheduling Algorithm for Scalable Input-Queued Switches
Itamar Elhanany†, Dan Sadot
Department of Electrical and Computer Engineering, Ben-Gurion University, Israel
†Email: [email protected]

Abstract— In this paper a high-performance, robust and scalable scheduling algorithm for input-queued switches, called distributed sequential allocation (DISA), is presented and analyzed. In contrast to pointer-based arbitration schemes, the proposed algorithm is based on a synchronized channel reservation cycle whereby each input selects a designated output, considering both its local requests and global channel availability information. The distinctiveness of the algorithm lies in its ability to offer high performance when multiple cells are transmitted within each switching interval. The resulting relaxed switching-time requirement allows the utilization of commercially available crosspoint switches, yielding a pragmatic and scalable solution for high port-density switching platforms. The efficiency of the algorithm and its robustness are established through analysis and computer simulations addressing diverse traffic scenarios.

Keywords- Packet scheduling algorithms, switch fabric, input-queued switches, non-uniform destination distribution

I. INTRODUCTION

Scalable packet scheduling algorithms that offer high performance for input-buffered switches and routers have been the focus of many academic and industry studies in recent years. The ever-growing need for additional bandwidth and more sophisticated service provisioning in next-generation networks requires the introduction of scalable packet scheduling solutions that go beyond legacy schemes. As port densities and line rates increase, input-buffered switch architectures are acknowledged as a pragmatic approach for implementing scalable switches and routers. In these architectures, arriving packets or cells are stored in queues at the ingress port until scheduled for transmission. With the introduction of commercially available high port-density crosspoint switches ([1], [2]), it has become more compelling to propose packet scheduling techniques that can control crosspoint switches to offer low chip-count, high-performance and cost-efficient switch fabric solutions.

The majority of the proposed scheduling algorithms targeting high port-density switches are based on an iterative request-grant-accept process [3], [4], [5]. Although such algorithms offer ease of implementation, they typically experience degradation in performance at high loads when the traffic is correlated or non-uniformly distributed between the destinations. As a means of overcoming the inherent limitations of pointer-based schemes, speedup is usually introduced such that the internal fabric bandwidth is S times higher than that of the incoming traffic, where if N is the number of ports then 1 < S < N. A speedup of N yields performance equivalent to that of an output-queued switch. However, speeding up the fabric carries enormous implementation ramifications that motivate the search for switching architectures and scheduling algorithms that overcome the need for a large speedup. An additional drawback of many pointer-based schemes is their connectivity complexity, which is known to be O(N²). Consequently, these algorithms have been proposed primarily for switches with a small number of ports (i.e., N ≤ 32) [3], and are generally unsuitable for next-generation switches where hundreds and even thousands of ports are considered.

In this paper we propose a scheduling algorithm called distributed sequential allocation (DISA) that represents a shift from legacy scheduling schemes by constituting a non-pointer-based approach in which the contention resolution process takes into consideration both local and global resource availability [6]. Moreover, the DISA algorithm requires only O(N log N) connectivity and no speedup.

This paper is organized as follows. Section II describes the DISA scheduling algorithm and a complementing switch architecture. Section III outlines the queueing notation and model formulations utilized throughout the paper for performance analysis. Section IV presents analysis for Bernoulli i.i.d. uniformly distributed arrival patterns, while section V focuses on analysis of non-uniform destination distribution. In section VI performance results under correlated arrivals are shown, and in section VII the conclusions are drawn.

II. THE DISA SCHEDULING ALGORITHM

A. Switch Architecture

Figure 1 shows a diagram of the proposed switch fabric architecture. Packets received from the switch port at each line card flow into a network processor or traffic manager. The processed packets are subsequently forwarded to an ingress fabric interface device, which typically segments the packets into smaller fixed-size cells and provides a buffering stage to the fabric. In order to avoid head-of-line blocking [7], a queueing technique called virtual output queueing (VOQ) is deployed, whereby a unique queue is maintained at the ingress for each of the outputs. We use the terms channel and output interchangeably in the context of allocation by an input port.

[Figure 1 diagram: linecards (packet processing engine, VOQ buffering and arbitration module) connected via high-speed serial links to N x N crosspoint switches, a central arbitration unit (CAU) and the offered destination bus (ODS).]
Figure 1. The proposed crosspoint-based, single-stage switch fabric architecture.
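The VOQ discipline described above can be sketched in a few lines of Python. This is an illustrative model only; the class and method names are assumptions for exposition and do not appear in the paper.

```python
from collections import deque

class VOQIngress:
    """Ingress fabric interface: one virtual output queue per output port.

    Illustrative sketch of the VOQ discipline; names are hypothetical.
    """

    def __init__(self, num_ports):
        self.num_ports = num_ports
        # One FIFO per destination output avoids head-of-line blocking:
        # a cell waiting for a busy output never blocks cells for others.
        self.voq = [deque() for _ in range(num_ports)]

    def enqueue(self, cell, output):
        self.voq[output].append(cell)

    def weights(self, single_bit=False):
        """Per-output weights offered to the arbiter: either occupancy
        bits (empty / non-empty) or queue sizes, as discussed in Section II."""
        if single_bit:
            return [1 if q else 0 for q in self.voq]
        return [len(q) for q in self.voq]

# Example: three cells destined to two different outputs
ingress = VOQIngress(num_ports=4)
ingress.enqueue("cell-a", 0)
ingress.enqueue("cell-b", 0)
ingress.enqueue("cell-c", 2)
print(ingress.weights())                 # [2, 0, 1, 0]
print(ingress.weights(single_bit=True))  # [1, 0, 1, 0]
```

The two weight modes correspond to the two queue-weight interpretations used later in the paper: a single occupancy bit versus the full queue size.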
The efficiency of the scheduler has a paramount impact on the amount of buffering required at the linecards and, consequently, on the latency through the system. Two types of control links are utilized in the proposed architecture between the nodes and the fabric. The first is an N-bit bus called the offered destination set (ODS), to which all nodes have read and write access, indicating the reservation status of each of the N outputs. We denote a logical '1' on the ODS as representing an available output port and a logical '0' a previously reserved one. In addition to the ODS, each node receives a dedicated signal from a central arbitration unit (CAU), which indicates when the node is to perform reservation, as will be clarified in the following section. The switching interval is a fixed multiple of the time slot duration, where a time slot represents a single cell time.

B. Scheduling Algorithm

The proposed scheduling algorithm is carried out as a sequential selection process. For each switching interval, which may span one or more time slots, a single output is allocated by each of the ports. Using the dedicated links between the CAU and the ports, the CAU signals each port, in turn, to perform allocation. A random order of signaling the ports is drawn at the beginning of each interval. Once the last port completes allocation, the crosspoint switches are configured and each port transmits cells to its designated output in accordance with the selected configuration while, concurrently, a new allocation interval (for the following transmission cycle) begins. Since transmission and scheduling are carried out simultaneously, there is no transmission "dead time" involved.

In figure 2, a block diagram of the channel allocation scheme performed by each node is depicted. We let wj denote the weight of queue j within the VOQ, and ϕi the i-th bit element in the ODS, indicating the availability status of output i.

[Figure 2 diagram: the ODS (ϕ) masks the queue weights w1…wN through a layer of W-bit AND gates; the queue selection logic (QSL) picks one output, and an N-bit OR gate combines the selection with ϕ to produce the new ODS (ϕ*).]
Figure 2. The output allocation function performed by each ingress port.

Upon receiving a signal from the CAU, each node performs output reservation according to two primary guidelines: (a) the global switch resource status, i.e., the available outputs, and (b) local considerations reflected by the priority map of the node's queues. We note that, as will be elaborated in subsequent sections, the term queue weight may refer either to a single bit denoting the queue occupancy state (empty or non-empty) or to a value corresponding to the queue size.

At the initial step, the ODS lines either grant or discard queue requests using a layer of W-bit AND gates, where W is the number of bits per weight. As a result, only weights corresponding to available outputs advance to the next level. The following step is comprised of a queue selection logic (QSL) block, which determines the output to be reserved. In the single-bit-weight case, this level may be implemented using a randomized priority decoder circuit; if queue sizes are considered as weights, then a maximal-value circuit should be deployed. The output of the QSL is an N-bit vector with at most a single bit set to '1', denoting the index of the selected output. This word is fed into an N-bit OR gate together with ϕ to yield ϕ*, the new ODS. In typical staged designs, the propagation delay of such QSL circuits is O(log_d N · t_c), where d is the number of comparisons performed at each level and t_c is the propagation delay contributed by each comparison block. Using current 0.13µm CMOS VLSI technology, for 128 ports with 2 comparisons per level and t_c ≈ 500 psec, the critical path duration is approximately 3.5 nsec.

Upon completion of allocation by the first input node, the CAU randomly signals the next node to commence output allocation. That node performs the same allocation process, only with the new ODS as a mask of previously allocated outputs. Contention is inherently avoided since at any given time only one node attempts to reserve an output. After all N nodes have performed reservation, the CAU configures the crosspoint switches and data traverses the fabric. When multiple classes of service are deployed, a different queue (and hence a unique weight) is associated with each class of service.

A basic precondition for supporting the analysis is that the algorithm is stable. To that end, we have the following theorem addressing the issue of stability:

Theorem 1: (Stability) The DISA scheduling algorithm with no speedup is strictly stable for any admissible, uniformly distributed traffic.
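The reservation cycle described above can be sketched in software as follows. This is an illustrative Python model of the behavior, not the authors' implementation; the function and variable names are hypothetical. The CAU draws a random port order, and each port in turn masks its VOQ weights by the current ODS, selects a maximal-weight available output with random tie-breaking, and marks it reserved.

```python
import random

def disa_allocate(weights, rng=random):
    """One DISA reservation cycle (illustrative sketch, names hypothetical).

    weights[i][j] is the weight input i assigns to output j (e.g. VOQ size).
    Returns match, where match[i] is the output reserved by input i, or None.
    """
    n = len(weights)
    ods = [True] * n                  # offered destination set: True = available
    match = [None] * n
    order = list(range(n))
    rng.shuffle(order)                # the CAU draws a random port order

    for i in order:                   # ports allocate strictly one at a time
        # Mask requests by output availability (the W-bit AND layer of Fig. 2)
        candidates = [j for j in range(n) if ods[j] and weights[i][j] > 0]
        if not candidates:
            continue
        # QSL step: maximal-weight selection with randomized tie-breaking
        best = max(weights[i][j] for j in candidates)
        j = rng.choice([c for c in candidates if weights[i][c] == best])
        match[i] = j
        ods[j] = False                # fold the selection back into the ODS
    return match

# Example: inputs 0 and 1 contend for output 0; whichever is signaled
# first wins it, and no output is ever granted twice.
w = [[3, 0, 1],
     [2, 0, 0],
     [0, 5, 0]]
m = disa_allocate(w)
granted = [j for j in m if j is not None]
assert len(granted) == len(set(granted))  # contention-free by construction
```

Because only one port writes to the ODS at any instant, contention resolution needs no iteration: the example always yields a conflict-free configuration, mirroring the argument in the text.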