Data Center Network Failure Detection


BRKDCT-2333
Arkadiy Shapiro
Manager, Technical Marketing, NX-OS and Nexus 2000 – 7000
[email protected] or @ArkadiyShapiro

Session Goals
At the end of the session, the participants should understand:
• Where failure detection fits in achieving network fast convergence
• Design aspects of network failure detection in a data center environment
• Which failure detection technologies are needed to achieve Data Center business needs and SLAs
• Advances in network failure detection technologies

BRKDCT-2333 © 2015 Cisco and/or its affiliates. All rights reserved. Cisco Public

Session Non-goals
This session does not include:
• Discussion on other aspects of fast convergence beyond failure detection
• Discussion on user-driven failure detection methods (ping, traceroute, etc.) and using scripts / EEM to automate reaction based on result / Syslog / SNMP trap
• Troubleshooting
• Detailed roadmap discussion for related Cisco products

Agenda
• Overview
• Layer 1 Failure Detection
• Layer 2 Failure Detection
• Layer 3 Failure Detection
• Additional Failure Detection Mechanisms
• Summary

Routing Convergence in Action
A quick reminder…
[Figure: routers A, B, C and D on a timeline t0 … t4; the link between B and C fails]
• By t1: B: Ooops.. Problem
• By t2: B: my link to C is down / C: my link to B is down
• By t3: D: I don’t care, nothing changes for me / A: Ok, fine, will use path via D
• Loss of Connectivity = t4 – t0
Routing Convergence Components
1. Failure Detection
2. Failure Propagation (flooding, etc.)
   IGP and BGP Reaction
3. Topology/Routing Recalculation
4. Update of the routing and forwarding table (RIB & FIB)
Timeline: t0 | 1 | t1 | 2 | t2 | 3 | t3 | 4 | t4

Failure Detection Overview
• Detecting the failure is the most critical and most challenging part of network convergence
• Failure detection can occur on different levels / layers:
    Physical Layer (1)
    Data link Layer (2)
    Network Layer (3)
    Service / Application (not covered)
• Do you really need to touch all the layers?

Failure Detection Tools
Layered Approach
Application / Service: IP SLA; FabricPath OAM
Layer 3: Aggressive Timers for Various Protocols; BFD for BGP, OSPF, IS-IS, EIGRP, FHRPs, static and FabricPath / TRILL
MPLS: BFD for MPLS LSPs / TE-FRR
Layer 2: 802.1ag CFM / Y.1731 FM; 802.3ah Link OAM; UDLD; LACP
Layer 1: Bit transmission; Signaling: Auto-negotiation / Remote Fault Indication; Other: Carrier Delay / Debounce

Interconnection Options
[Figure: protocol stack with IP/MPLS (L3), Ethernet/FR/ATM (L2), SONET/SDH, OTN and DWDM (L1), showing interconnection options A to D]
A. Layer 3 p2p
B. Layer 3 with a Layer 1 (DWDM) “bump” in wire
C. Layer 3 with a Layer 2 (Ethernet / Frame Relay / ATM switch) “bump” in wire
D. Layer 3 with a Layer 3 (Firewall / router) “bump” in wire

Data Center Requirements
Impact on network failure detection
Requirement | Meaning for Failure Detection
Fast (often sub-second) network convergence | Sub-second detection; Link / path isolation
Port density and multi-tenant scale | High number of protocol sessions; Protocol offload
High Availability | SSO / ISSU / Graceful insertion and removal
Link aggregation or wide ECMP | Ability to operate on all interface types
Simple to configure and maintain | More failure scenarios covered by one technology
• Will traditional enterprise networking approaches apply?
• Will routing failure detection technologies apply to a switching environment?

Data Center Reference Topology – VPC
[Figure: Core / Edge, Aggregation / Services and Access tiers; DCI over dark fiber with OTV; legend: L2 link, FEX link, L3 link]

Data Center Reference Topology – FabricPath
[Figure: Edge, Spine / Services and Leafs; dark fiber with OTV; legend: FP link, CE link, FEX link, L3 link]

Data Center Reference Topology – FabricPath with Enhanced Forwarding & Automation
[Figure: Spine, Leafs, Compute and Storage, L3 Cloud]
More information:
BRKDCT-3378 - Building simplified, automated and scalable Data Center network with Overlays (VXLAN/FabricPath)
LABDCT-2227 - Building simplified, automated and scalable DataCenter network with Unified Fabric

Focus on Specific Data Center Scenarios
• Layer 2 Classical Ethernet
    Single p2p link
    Bundle
• FabricPath
    Single p2p link
    Bundle
• Layer 3
    Single p2p link
    Bundle
    Sub-interfaces
    SVI on top of Classical Ethernet
    SVI on top of FabricPath

Layer 1 Failure Detection
Layer 1 tools: Bit transmission; Signaling: Auto-negotiation / Remote Fault Indication; Other: Carrier Delay / Debounce

Layer 1 Failure Detection – Ethernet Link Fault Signaling
• Ethernet mechanisms like auto-negotiation (1 GigE) and link fault signalling (10 GigE 802.3ae / 40 and 100 GigE 802.3ba) can signal local failures to the remote end
  [Figure: R1 and R2 directly connected by tx/rx pairs, with a failure (X) in one direction]
• Getting this signal across an optical cloud is a challenge, as relaying the fault information to the other end is not always possible
  [Figure: R1, MUX-A, Optical Transport, MUX-B, R2, with a failure (X) in one direction: a “bump” in the Layer 1 link]

Carrier Delay
• Running timer in software
• Standard routing platform feature
• Filters link up and down events, notifies protocols
• This behaviour is not desirable for Fast Convergence
    interface …
     carrier-delay msec 0
• NX-OS supports carrier delay only on SVIs
• Sets timer at 100 msec to suppress short flaps
• Not recommended to set carrier-delay to 0 on SVI

Debounce Timer
• Delays link down notification only
• Runs in firmware
• 100 msec default in NX-OS
• In most cases it is recommended to keep it at the default
• Standard switching platform feature
NX-OS:
    switch(config)# interface …
    switch(config-if)# link debounce time ?
      <0-5000>  Timer value (in milliseconds)
Carrier Delay vs Debounce timer
General Guidance
Carrier Delay:
• Runs in software
• Applicable to:
    Routers except Ethernet LAN switching interfaces (i.e. Cisco 7600 with WS-X6708 card)
    WAN interfaces on switches (i.e. ES+ or SIP/SPA on Catalyst 6500)
    SVIs on switches
• Filters link down and up events
Debounce timer:
• Runs in firmware
• Applicable to:
    Switches except WAN interfaces (i.e. ES+ or SIP/SPA on Catalyst 6500)
    Ethernet LAN switching interfaces on routers (i.e. Cisco 7600 with WS-X6708 card)
• Filters link down events only
Make sure to test before implementing!

Layer 2 Failure Detection
Layer 2 tools: 802.1ag CFM / Y.1731 FM; 802.3ah Link OAM; UDLD; LACP

Unidirectional Link Detection (UDLD)
• Light-weight Layer 2 failure detection protocol
• Designed for detecting:
    One-way connections due to physical or soft failure
    Mis-wiring detection (loopback or triangle)
• Cisco proprietary, but listed in informational RFC 5171
• Runs on any single Ethernet link, even inside bundle
• Centralized implementation in DC switching platforms
• Message interval: 7 - 90 sec (default: 15 seconds)
• Detection: 2.5 x interval + timeout value (4 sec) ~ 41 sec

UDLD Scenario 1
Empty-Echo condition or age out
[Figure: Switch A (port e x/y) connected to Switch B (port e w/z), each running UDLD and PktMgr; failures (X) marked on the link]
• Echo Packet from A to B has “My Switch-ID A, My Port-ID e x/y”
• When B sends the echo-reply back, it is expected to have “My Switch-ID B, My Port-ID e w/z” AND “Your Switch-ID A, Your Port-ID e x/y”.
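
The UDLD detection formula and the echo check from Scenario 1 can be worked through in a few lines. This is a minimal sketch under stated assumptions: the names `udld_detection_time`, `PortId`, `UdldPacket` and `echo_reply_ok` are hypothetical and do not correspond to any Cisco API or to the UDLD packet format; only the formula (2.5 x interval + 4 sec timeout), the 7 to 90 sec interval range and the expected echo-reply contents come from the slides.

```python
from dataclasses import dataclass
from typing import Optional

UDLD_TIMEOUT_S = 4.0  # timeout value quoted on the slide


def udld_detection_time(interval_s: float = 15.0,
                        timeout_s: float = UDLD_TIMEOUT_S) -> float:
    """Detection time = 2.5 x message interval + timeout value."""
    if not 7 <= interval_s <= 90:
        raise ValueError("message interval must be within 7-90 seconds")
    return 2.5 * interval_s + timeout_s


@dataclass(frozen=True)
class PortId:
    """Hypothetical (switch, port) identity pair."""
    switch_id: str
    port_id: str


@dataclass(frozen=True)
class UdldPacket:
    """Simplified packet: sender identity plus the echoed neighbor identity."""
    my: PortId
    your: Optional[PortId] = None  # None models an empty echo


def echo_reply_ok(local: PortId, neighbor: PortId, reply: UdldPacket) -> bool:
    """True if the reply carries the neighbor as 'My' and us as 'Your'."""
    return reply.my == neighbor and reply.your == local


a = PortId("A", "e x/y")
b = PortId("B", "e w/z")
print(udld_detection_time())                       # default 15 s interval
print(echo_reply_ok(a, b, UdldPacket(my=b, your=a)))
print(echo_reply_ok(a, b, UdldPacket(my=b)))       # empty echo
```

With the default 15 second interval the formula gives 2.5 x 15 + 4 = 41.5 seconds, matching the "~ 41 sec" on the slide; at the minimum interval of 7 seconds it is still 21.5 seconds, which shows why UDLD alone does not meet sub-second detection requirements.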
