A Review of Fault Management in WDM Mesh Networks: Basic Concepts and Research Challenges
Total Page:16
File Type:pdf, Size:1020Kb
A Review of Fault Management in WDM Mesh Networks: Basic Concepts and Research Challenges Jing Zhang and Biswanath Mukherjee, University of California Abstract This article first presents a broad overview of the fault management mechanisms involved in deploying a survivable optical mesh network, employing optical cross- connects. We review various protection and restoration schemes, primary and back- up route computation methods, sharability optimization, and dynamic restoration. Then we describe different parameters that can measure the quality of service pro- vided by a WDM mesh network to upper protocol layers (e.g., IP network back- bones, ATM network backbones, leased lines, virtual private networks), such as service availability, service reliability, restoration time, and service restorability. We review these concepts, the factors that affect them, and how to improve them. In par- ticular, we present a framework for cost-effective availability-aware connection provi- sioning to provide differentiated services in WDM mesh networks. Through the framework, the more realistic scenario of multiple near-simultaneous failures can be handled. In addition, the emerging problem of protecting low-speed connections of different bandwidth granularities is also reviewed. ith the maturing of wavelength-division multi- 109 hours, Tx denotes optical transmitters, Rx denotes optical plexing (WDM) technology, one single strand receivers, and MTTR means mean time to repair. of fiber can provide tremendous bandwidth With the frequent occurrence of fiber cuts and the tremen- (potentially a few tens of terabits per second) dous traffic loss a failure may cause, network survivability WWby multiplexing many nonoverlapping wavelength channels becomes a critical concern in network design and real-time (WDM channels). Each wavelength channel can be operated operation. As networks migrate from stacked rings to meshes asynchronously and in parallel at any desirable speed (e.g., because of the poor scalability of interconnected rings and the peak electronic speed of a few gigabits per second). In a excessive resource redundancy used in ring-based fault man- wavelength-routed WDM network, an optical crossconnect agement schemes, designing and operating a survivable WDM (OXC) can switch the optical signal on a WDM channel from mesh network have received increasing attention [2–6]. Most an input port to an output port without any optoelectronic of the research work on survivability in WDM networks focus- conversion of the signal; thus, a lightpath may be established es on recovery from a single link or node failure, where one from a source node to a destination node, and it may span failure is repaired before another failure is assumed to occur multiple fiber links. in the network. Nevertheless, as our knowledge on this subject A fiber cut usually occurs due to a duct cut1 during con- has matured, the more realistic scenario of multiple near- struction or destructive natural events, such as earthquakes, simultaneous failures should now be considered (e.g., more etc. All the lightpaths that traverse the failed fiber will be dis- than one link may be affected when a natural disaster such as rupted so a fiber cut can lead to tremendous traffic loss. Other an earthquake occurs). network equipment (OXC, amplifier, etc.) may also fail. Table Meanwhile, as our knowledge of resource management in 1 shows some typical data on network component (transmitter, survivable network design and real-time operation continues receiver, fiber link [cable], etc.) failure rates and failure repair to mature, more and more researchers are shifting their atten- times according to Bellcore (now Telcordia) [1]. In Table 1, tion to a service perspective. Naturally, how to provide a cer- failure-in-time (FIT) denotes the average number of failures in tain quality of service (QoS) per a customer’s requirement and how to guarantee the service quality become critical con- cerns. The rationale behind this is as follows. A WDM mesh This work has been supported by the National Science Foundation (NSF) network may provide services for IP network backbones, asyn- under Grant No. ANI-9805285, and by Sprint Advanced Technology Lab- chronous transfer mode (ATM) network backbones, leased oratories (ATL). lines, virtual private networks (VPNs), and so on. The QoS requirements for these services can be very different because 1 A duct is a bidirectional physical pipe between two nodes. In practice, of their diverse characteristics; for example, online trading, fibers are put into cables, which are buried into ducts under the ground. military applications, and banking services will require strin- IEEE Network • March/April 2004 0890-8044/04/$20.00 © 2004 IEEE 41 Metric Telcordia statistics Equipment MTTR 2 h gent reliability, while IP best effort is rerouted through a backup route packet delivery service may be satis- Cable-cut MTTR 12 h (backup path) once a link failure fied without a special constraint on occurs on its working path (primary reliability. Service quality can be Cable-cut rate 4.39/yr/1000 sheath mi path). The primary and backup paths measured in many different ways for a connection must be link disjoint Tx failure rate 10,867 FIT such as signal quality, service avail- so that no single link failure can affect ability, service reliability, restoration Rx failure rate 4311 FIT both paths. In link protection, the traf- time, and service restorability. Signal fic is rerouted only around the failed quality is mainly represented by the I Table 1. Failure rates and repair times (Tel- link. While path protection leads to optical signal-to-noise ratio (OSNR), cordia) [1]. efficient utilization of backup bit error rate (BER), and other fac- resources and lower end-to-end prop- tors, and is affected by the transmis- agation delay for the recovered route, sion equipment characteristics. This is a problem in all-optical link protection provides shorter protection switching time. networks, and is out of the scope of our current discussion. Recently, researchers have proposed the idea of subpath pro- Our interest is in the availability of service paths in WDM tection in a mesh network by dividing a primary path into a mesh networks. Usually, availability is defined as the probabil- sequence of segments and protecting each segment separately ity that the service or connection will be found in the operat- (or dividing the whole network into different domains where a ing state at a random time in the future [7]. Connection lightpath segment in one domain must be protected by the availability can be computed statistically based on the failure resources in the same domain) [10, 11]. Compared to path frequency and failure repair rate, reflecting the percentage of protection, subpath protection can achieve high scalability and time a connection is “alive” or “up” during its entire service fast recovery time for a modest sacrifice in resource utilization. period. Although the problem of how connection availability Note that node failures can also be considered by calculat- is affected by network failures is currently attracting more ing node disjoint routes. However, one should also note that research interest [1, 7–9], we still lack a systematic methodolo- carrier-class OXCs in network nodes must be 1 + 1 gy to quantitatively estimate a connection’s availability, espe- (master/slave) protected in the hardware for both the OXC’s cially when protection schemes are applied to the connection. switch fabric and its control unit. The OXC’s port cards, how- In this review we shall discuss the rationale and challenges for ever, do not have to be 1 + 1 protected since they take up the availability analysis in WDM mesh networks. In particular, we bulk of the space (perhaps over 80 percent) and cost of an shall present a framework for a generic connection provision- OXC; also, a port card failure can be handled as link and/or ing problem with due consideration of network component wavelength channel failure(s). However, node failures are failure characteristics so that all possible network failure sce- important to protect against in scenarios where an entire node narios and their effects can be incorporated. (or a collection of nodes in a part of the network) may be The objective of this article is to present a broad overview taken down, possibly due to a natural disaster or by a mali- of the fault management issues involved in designing an opti- cious attacker. cal mesh network employing OXCs and real-time network Link, subpath, and path protection schemes can be dedicat- operation including dynamic connection provisioning. We ed or shared. In dedicated protection there is no sharing introduce the basic concepts in fault management schemes: between backup resources, while in shared protection backup various protection and restoration schemes, primary and back- wavelengths can be shared on some links as long as their pro- up route computation methods, sharability optimization, and tected segments (links, subpaths, paths) are mutually diverse. dynamic restoration. Then we present a framework for avail- OXCs on backup paths cannot be configured until the failure ability-aware connection provisioning to provide differentiated occurs if shared protection is used. Thus, recovery time in services in a mesh network and handle multiple failures. We shared protection is longer but its resource utilization is better also outline other criteria that may affect service quality such than dedicated protection. as service reliability, restoration time, and service restorability. Dynamic restoration [3, 12] can also be classified as link-, Online algorithms for protecting low-speed connections are subpath-, or path-based depending on the type of rerouting. also reviewed. In link restoration the end nodes of the failed link dynamically discover a route around the link for each connection (or “live” wavelength) that traverses the link. In path restoration Basic Concepts in Fault Management when a link fails, the source and destination nodes of each There are two types of fault recovery mechanisms [3].