Data Center Network Failure Detection
Total Page:16
File Type:pdf, Size:1020Kb
Data Center Network Failure Detection BRKDCT-2333 Arkadiy Shapiro Manager, Technical Marketing NX-OS and Nexus 2000 – 7000 [email protected] or @ArkadiyShapiro Session Goals At the end of the session, the participants should understand: • Where failure detection fits in achieving network fast convergence • Design aspects of network failure detection in a data center environment • Which failure detection technologies are needed to achieve Data Center business needs and SLAs • Advances in network failure detection technologies BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 5 Session Non-goals This session does not include: • Discussion on other aspects of fast convergence beyond failure detection • Discussion on user-driven failure detection methods (ping, traceroute etc) and using scripts / EEM to automate reaction based on result / Syslog / SNMP trap • Troubleshooting • Detailed roadmap discussion for related Cisco products BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 6 Agenda • Overview • Layer 1 Failure Detection • Layer 2 Failure Detection • Layer 3 Failure Detection • Additional Failure Detection Mechanisms • Summary BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 7 Agenda • Overview • Layer 1 Failure Detection • Layer 2 Failure Detection • Layer 3 Failure Detection • Additional Failure Detection Mechanisms • Summary BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 8 Routing Convergence in Action A quick reminder… D: I don’t care, nothing changes for me D B: my link to C is down C: my link to B is down A B C A: Ok, fine, will use path via D B: Ooops.. Problem Loss of Connectivity = t4 – t0 t0 t1 t2 t3 t4 BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 9 Routing Convergence in Action A quick reminder… D A B C B: Ooops.. Problem t0 t1 BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 10 Routing Convergence in Action A quick reminder… D B: my link to C is down C: my link to B is down A B C B: Ooops.. Problem t0 t1 t2 BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 11 Routing Convergence in Action A quick reminder… D: I don’t care, nothing changes for me D A B C A: Ok, fine, will use path via D t0 t1 t2 t3 BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 12 Routing Convergence in Action A quick reminder… D A B C Loss of Connectivity = t4 – t0 t0 t1 t2 t3 t4 BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 13 Routing Convergence Components 1. Failure Detection 2. Failure Propagation (flooding, etc.) IGP and BGP Reaction 3. Topology/Routing Recalculation 4. Update of the routing and forwarding table (RIB & FIB) 1 2 3 4 t0 t1 t2 t3 t4 BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 14 Failure Detection Overview . Detecting the failure is most critical and most challenging part of network convergence . Failure Detection can occur on different levels / layers: Physical Layer (1) Data link Layer (2) Network Layer (3) Service / Application (not covered) . Do you really need to touch all the layers? BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 15 Failure Detection Tools Layered Approach Application / IP SLA Service FabricPath OAM Layer 3 Aggressive Timers for Various Protocols BFD for BGP, OSPF, IS-IS, EIGRP, FHRPs, static and FabricPath / TRILL MPLS BFD for MPLS LSPs / TE-FRR 802.1ag CFM/ 802.3ah UDLD LACP Layer 2 Y.1731 FM Link OAM Bit transmission Layer 1 Signaling: Auto-negotiation / Remote Fault Indication Other: Carrier Delay / Debounce BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 16 Interconnection Options IP/MPLS L3 A D Ethernet/FR/ATM L2 … C SONET SDH L1 OTN DWDM B A. Layer 3 p2p B. Layer 3 with a Layer 1 (DWDM) “bump” in wire C. Layer 3 with a Layer 2 (Ethernet / Frame Relay / ATM switch) “bump” in wire D. Layer 3 with a Layer 3 (Firewall / router) “bump” in wire BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 17 Data Center Requirements Impact on network failure detection Requirement Meaning for Failure Detection Fast (often sub-second) network Sub-second detection convergence Link / path isolation Port density and multi-tenant scale High number of protocol sessions Protocol offload High Availability SSO / ISSU / Graceful insertion and removal Link aggregation or wide ECMP Ability to operate on all interface types Simple to configure and maintain More failure scenarios covered by one technology • Will traditional enterprise networking approaches apply? • Will routing failure detection technologies apply to switching environment? BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 18 Data Center Reference Topology – VPC Core / Edge Aggregation / DCI / Dark Services Fiber OTV OTV Access L2 link FEX link L3 link BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 19 Data Center Reference Topology – FabricPath Edge Spine / Services Dark Fiber OTV OTV Leafs FP link CE link FEX link L3 link BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 20 20 Data Center Reference Topology – FabricPath with Enhanced Forwarding & Automation Spine Leafs Compute and Storage L3 Cloud More information: BRKDCT-3378 - Building simplified, automated and scalable Data Center network with Overlays (VXLAN/FabricPath) LABDCT-2227 - Building simplified, automated and scalable DataCenter network with Unified Fabric BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 21 Focus on Specific Data Center Scenarios . Layer 2 Classical Ethernet Single p2p link Bundle . FabricPath Single p2p link Bundle . Layer 3 Single p2p link SVI SVI Bundle SVI SVI Sub-interfaces SVI on top of Classical Ethernet SVI on top of FabricPath BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 22 Agenda . Overview . Layer 1 Failure Detection . Layer 2 Failure Detection . Layer 3 Failure Detection . Additional Failure Detection Mechanisms . Summary Bit transmission Layer 1 Signaling: Auto-negotiation / Remote Fault Indication Other: Carrier Delay / Debounce BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 24 Layer 1 Failure Detection – Ethernet Link Fault Signaling . Ethernet mechanisms like auto-negotiation (1 GigE) and link fault signalling (10 GigE 802.3ae / 40 and 100 GigE 802.3ba) can signal local failures to the remote end R1 R2 rx tx X tx rx . Challenge to get this signal across an optical cloud as relaying the fault information to the other end is not always possible R2 R1 MUX-A Optical Transport MUX-B rx tx tx rx X tx rx rx tx “Bump” in Layer 1 link BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 25 Carrier Delay . Running timer in software . Standard routing platform feature . Filters link up and down events, notifies protocols . This behaviour is not desirable for Fast Convergence interface … carrier-delay msec 0 . NX-OS only supports on SVI . Sets timer at 100 msec to suppress short flaps . Not recommended to set carrier-delay to 0 on SVI BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 26 Debounce Timer . Delay link down notification only . Runs in firmware . 100 msec default in NX-OS . Most cases recommended to keep it at default . Standard switching platform feature switch(config)interface … NX-OS switch(config-if)# link debounce time ? <0-5000> Timer value (in milliseconds) BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 27 Carrier Delay vs Debounce timer General Guidance Carrier Delay Debounce timer Runs in software Runs in firmware Applicable to: Applicable to : • Routers except Ethernet LAN switching • Switches except WAN interfaces ((i.e. interfaces (i.e. Cisco 7600 with WS- ES+ or SIP/SPA on Catalyst 6500) X6708 card) • Ethernet LAN switching interfaces on • WAN interfaces on switches (i.e. ES+ routers (i.e. Cisco 7600 with WS-X6708 or SIP/SPA on Catalyst 6500) card) • SVIs on switches Filters link down and up events Filters link down events only Make sure to test before implementing! BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 28 Agenda . Overview . Layer 1 Failure Detection . Layer 2 Failure Detection . Layer 3 Failure Detection . Additional Failure Detection Mechanisms . Summary 802.1ag CFM/ 802.3ah UDLD LACP Layer 2 Y.1731 FM Link OAM BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 30 Unidirectional Link Detection (UDLD) . Light-weight Layer 2 failure detection protocol . Designed for detecting: One-way connections due to physical or soft failure Mis-wiring detection (loopback or triangle) Rx Tx . Cisco proprietary, but listed in informational RFC 5171 . Runs on any single Ethernet link, even inside bundle . Centralized implementation in DC switching platforms Rx Tx . Message interval: 7 - 90 sec (default: 15 seconds) . Detection: 2.5 x interval + timeout value (4 sec) ~ 41 sec BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public 31 UDLD Scenario 1 Empty-Echo condition or age out Switch A e x/y e w/z Switch B X X X e P U P k k U D t t D L M M L D g g D r r . Echo Packet from A to B has “My Switch-ID A, My Port-ID e x/y” . When B sends the echo-reply back, it is expected to have “My Switch-ID B, My Port-ID e w/z” AND “Your Switch-ID A, Your Port-ID e x/y”.