Improving Convergence Time of Routing Protocols
Götz Lichtwald, Uwe Walter and Martina Zitterbart
Institute of Telematics, University of Karlsruhe, Germany
Email: {lichtwald, walter, zit}@tm.uka.de

Abstract: One of the main design goals of the Internet is robustness against failures. Normally, this is accomplished by redundancy and dynamic routing protocols that automatically adapt to failures: if a link is unavailable, data packets can generally be sent via alternative paths. An essential requirement for this is a fast mechanism for failure detection, since routing protocols can only start to reroute traffic around problems once they become aware of them. This paper proposes a novel design of a generic failure detection service, to be utilized by routing protocols, that aims at dramatically decreasing the detection times of today's mechanisms. After the introduction of the concept and its evaluation, the integration of the new failure detection service into BGP is described. Furthermore, an example of the adaptation of a routing protocol using the proposed service is given.

I. INTRODUCTION

Data packets traveling through the Internet typically traverse multiple routers and, thus, multiple physical links interconnecting them. Whenever such a link fails, dynamic routing protocols try to provide an alternative path towards the destination. For this task, it is crucial that the routing protocol quickly detects such a link failure. Especially with the increasing use of the Internet for mission-critical applications, any unnecessary loss of connectivity can hardly be tolerated and has to be kept as short as possible.

One of the fastest ways to detect a link failure is cooperation with the lower (link-level) layers. If they are able to detect a link breakdown in hardware, e.g. by the loss of the physical or optical signal, they can immediately notify the network layer about the failure. This is already deployed, but has some shortcomings. Fixing implementation errors in router operating systems [1] that inhibit a quick notification may help sometimes. In general, however, link-level failure detection is not always possible.

For example, in an environment where switches are involved in the router interconnection, links may fail behind such a switch, which rules out fast link-level failure detection.

This is why existing routing protocols, like Routing Information Protocol (RIP), Open Shortest Path First (OSPF), Intermediate System to Intermediate System (IS-IS) or Border Gateway Protocol (BGP), typically exchange periodic messages to check whether their peer is still reachable and alive, as described in section II. These periodic test messages, e.g. KEEPALIVE or HELLO messages, however, consume bandwidth and processing time. For these reasons, sometimes there are no dedicated check packets; instead, normal routing protocol traffic takes over this function and serves as a regular alive beacon.

In order to further limit resource consumption, rather long time intervals between consecutive check messages have been chosen. This also helps to reduce the number of so-called false positives. A false positive occurs whenever a link is wrongly declared to be broken due to several short periods of link noise or temporary router processor overload.

A tradeoff must always be found between delayed failure detection and too many false positives. If a link breakdown has occurred but has not yet been recognized, packets sent via the defective link are inevitably lost during that period. On the other hand, during every false positive, i.e. whenever an operational link is mistakenly declared to be down, packets are unnecessarily rerouted and other routers are notified of the alleged failure, leading to routing fluctuations and instabilities.

Considering the very high-speed links, the powerful routers and the low bit error rates within today's Internet, long time intervals between periodic check messages no longer appear appropriate. Liveness check mechanisms of routing protocols, especially the timeout value, need to be updated accordingly.

This is in accordance with recent efforts of major routing vendors (e.g. Juniper) as well as with current standardization efforts within the IETF (cf. section V). Common to all these efforts is the goal to improve, i.e. to shorten, the link failure detection time.

This paper goes a step further: it does not only propose to change the default intervals between periodic check messages, but develops a novel generic service for failure detection, called Adjacent Peer Check Service (APCS) [2], that enables any routing protocol to detect link outages faster than before (see section III). Furthermore, the APCS checks not only the physical reachability but also the operational state of the control plane, and it can be integrated into existing routing protocols. So, in contrast to the efforts of the IETF [3] and major routing vendors, the Adjacent Peer Check Service is designed to improve existing routing protocols.

As the APCS is a generic, routing-protocol-independent service, network operators can define the time intervals for the check messages depending on their network demands. This means that the check message interval for WLAN connections can be set to a different value than for LAN connections to accommodate the different bandwidths and loss rates.
Furthermore, it is possible to define the threshold of check message losses until a link is declared to be broken. As mentioned before, these time intervals can, on the one hand, easily be adjusted to the physical network environment, e.g. fiber, radio or coax. On the other hand, they can be adapted to the changing demands of a connection, e.g. when a connection carries more high-priority traffic and any outage has to be kept as short as possible.

This paper presents the Adjacent Peer Check Service protocol design including its evaluation in several test-bed scenarios (see section IV). Those scenarios demonstrate the improvements that can be achieved by extending currently deployed routing protocols like RIP, OSPF and BGP with the Adjacent Peer Check Service. The evaluation also took highly loaded links into consideration to show that the novel approach also works in congested networks.

Furthermore, an analysis of how the currently deployed routing protocols can be improved by the Adjacent Peer Check Service is provided. The paper concludes in section VI with a detailed description of how BGP can be improved with the novel Adjacent Peer Check Service. In addition, a detailed analysis of the improvements in inter-domain convergence time that would come along with this BGP extension is given.

II. FAILURE DETECTION MECHANISMS IN ROUTING PROTOCOLS

In the following, an overview is given of the mechanisms that some widely deployed routing protocols use to detect failures, for example link or router outages, which trigger the process of finding alternative paths. This demonstrates the similarities between these protocols and helps in understanding the concept of the Adjacent Peer Check Service, described in section III.

A. Routing Information Protocol

The Routing Information Protocol (RIP), developed in 1988 [4], is an example of a Distance Vector Protocol. A distance vector consists of a destination address and a metric that represents the cost for reaching this destination. RIP exchanges such distance vectors between all routers to allow them to calculate the optimal paths to all possible [...] process, as the remaining neighbor routers are notified of the broken connection by sending out new distance vectors.

B. Open Shortest Path First

The Open Shortest Path First (OSPF) protocol [5] was created by the OSPF working group of the IETF as an IP-based routing protocol to be used inside Autonomous Systems (AS). OSPF uses a Link State Database that describes the topology of the AS in which it is deployed. To synchronize this database consistently among all OSPF-speaking routers, the contained information is flooded throughout the whole AS via so-called Link State Advertisements (LSAs). This allows all routers to build the same Link State Database and to calculate the shortest paths to all possible destinations on their own.

Different sorts of LSAs exist, all of which are transmitted via IP packets carrying the protocol number 89. The most important OSPF packet type is the HELLO packet, used for automatic detection of neighbors and for failure detection. HELLO packets are broadcast by every router via each of its interfaces at regular time intervals called the Hello Interval. Communication with a router is declared to be broken if no HELLO packet has been received from it for another important time interval, the Router Dead Interval. Both time intervals are included in every HELLO packet and are, therefore, identical for all OSPF routers inside the AS. The default values proposed in [5] for the Hello Interval are 10 seconds for local area networks (LANs) and 30 seconds for wide area networks (WANs). The same document recommends setting the Router Dead Interval to four times the Hello Interval. This means that a failure is detected after a maximum time of 40 to 120 seconds, depending on the network type, before the process of finding alternative paths can be started.

C. Border Gateway Protocol

The Border Gateway Protocol (BGP) [6] is the standard inter-AS routing protocol deployed between all Autonomous Systems of the Internet. It is used to exchange information between all BGP-speaking routers about the reachability of destination networks in the form of path vectors, i.e. BGP is a so-called path vector routing protocol. Essentially, these path vectors consist of a specific AS path towards a destination network (represented by an IP prefix) and are exchanged via TCP connections on port 179 between adjacent routers.