Introduction to the Cloud Computing Network Control Plane: Architecture and Protocols for Data Center Networks

Outline

❒ Data Center Networking Basics
❒ Problem Solving with Traditional Design Techniques
❒ Virtual/Overlay Network Functional Architecture
❒ Virtual/Overlay Network Design and Implementation

Data Center Networking Basics: Lots of Servers

❒ Data centers consist of massive numbers of servers
  ❍ Up to 100,000's
❒ Each server has multiple processors
  ❍ 8 or more
❒ Each processor has multiple cores
  ❍ 32 max for commodity processors, more coming
❒ Each server has multiple NICs
  ❍ Usually at least 2 for redundancy
  ❍ 1G common, 10G on the upswing

Source: http://img.clubic.com/05468563-photo-google-datacenter.jpg

Mostly Virtualized

❒ Hypervisor provides a compute abstraction layer
  ❍ Looks like hardware to the operating system
  ❍ OSes run as multiple Virtual Machines (VMs) on a single server
❒ Hypervisor maps VMs to processors
  ❍ Virtual cores (vCores)
❒ Virtual switch provides networking between VMs and to the DC network
  ❍ Virtual NICs (vNICs)
❒ Without oversubscription, usually as many VMs as cores
  ❍ Up to 256 for 8p x 32c
  ❍ Typical is 32 for 4p x 8c
❒ VMs can be moved from one machine to another

[Figure: VM1-VM4, each with a vNIC, attached to the virtual switch in the hypervisor, which connects to NIC1 and NIC2 on the server hardware]

Data Center Network Problem

❒ For a single virtualized data center built with cheap commodity servers:
  ❍ 32 VMs per server
  ❍ 100,000 servers
  ❍ 32 x 100,000 = 3.2 million VMs!
❒ Each VM needs a MAC address and an IP address
❒ Infrastructure needs IP and MAC addresses too
  ❍ Routers, switches
  ❍ Physical servers for management
❒ Clearly a scaling problem!

Common Data Center Network Architectures: Three Tier

❒ Server NICs connected directly to edge switch ports
❒ Aggregation layer switches connect multiple edge switches
❒ Top layer switches connect aggregation switches
  ❍ Top layer can also connect to the Internet (these can be IP routers)
❒ Usually some redundancy (for more €s)
❒ Pluses
  ❍ Common
  ❍ Simple
❒ Minuses
  ❍ Top layer massively over-subscribed
  ❍ Reduced cross-sectional bandwidth
    • 4:1 oversubscription means only 25% of bandwidth available
  ❍ Scalability at the top layer requires expensive enterprise switches

[Figure: three-tier topology with Top of Rack (ToR) switches and, sometimes, End of Row switches at the edge]

Source: K. Bilal, S. U. Khan, L. Zhang, H. Li, K. Hayat, S. A. Madani, N. Min-Allah, L. Wang, D. Chen, M. Iqbal, C.-Z. Xu, and A. Y. Zomaya, "Quantitative Comparisons of the State of the Art Data Center Architectures," Concurrency and Computation: Practice and Experience, vol. 25, no. 12, pp. 1771-1783, 2013.

Common Data Center Network Architectures: Fat Tree

❒ CLOS network, with origins in the 1950's telephone network
❒ Data center divided into k pods
  ❍ Maximum # of pods = # of switch ports
❒ Each pod has 2 x (k/2) = k switches
  ❍ k/2 access, k/2 aggregation
❒ Core has (k/2)² switches
❒ 1:1 oversubscription ratio and full bisection bandwidth
❒ Pluses
  ❍ No oversubscription
  ❍ Full bisection bandwidth
❒ Minuses
  ❍ Needs a specialized addressing scheme
  ❍ Number of pods limited to the number of ports on a switch

[Figure: k=4 example]
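To make the counts concrete, here is a small Python sketch (my own illustration, not from the slides) that computes the component counts of a k-ary fat tree:

```python
def fat_tree_sizes(k: int) -> dict:
    """Component counts for a k-ary fat tree (k must be even)."""
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "access_switches": k * (k // 2),        # k/2 per pod
        "aggregation_switches": k * (k // 2),   # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": k * (k // 2) ** 2,             # (k/2)^2 hosts per pod
    }

# The k=4 example from the figure: 4 pods, 16 hosts, 4 core switches.
print(fat_tree_sizes(4))
# A fabric of 48-port switches scales to 27,648 hosts.
print(fat_tree_sizes(48)["hosts"])
```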

Source: Bilal, et. al.

Problem Solving with Traditional Design Techniques

Problem #1: ARP/ND Handling

❒ IP nodes use ARP (IPv4) and N(eighbor) D(iscovery) (IPv6) to resolve an IP address to a MAC address
  ❍ Broadcast (ARP) and multicast (ND)
❒ Problem:
  ❍ Broadcast forwarding load on large, flat L2 networks can be overwhelming

Source: http://www.louiewong.com/wp-content/uploads/2010/09/ARP.jpg
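As a rough back-of-the-envelope sketch of why this matters (my own illustration, with hypothetical numbers): every ARP request is flooded to every node in the broadcast domain, so the per-node load grows with the size of the flat L2 network.

```python
def arp_broadcast_load(nodes: int, arps_per_node_per_sec: float) -> float:
    """Broadcast frames per second that *each* node must receive and process
    when all nodes share one flat L2 broadcast domain."""
    return nodes * arps_per_node_per_sec

# Hypothetical numbers: 3.2 million VMs, each issuing one ARP request per minute.
print(f"{arp_broadcast_load(3_200_000, 1 / 60):,.0f} broadcasts/sec per node")
# ~53,333 broadcasts/sec hitting every VM's vNIC and every switch CPU.
```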

Problem #2: VM Movement

❒ Data center operators need to move VMs around
  ❍ Reasons: server maintenance, server optimization for energy use, performance improvement, etc.
  ❍ The MAC address can stay fixed (provided it is unique in the data center)
  ❍ If the subnet changes, the IP address must change because it is bound to the VM's location in the topology
    • For "hot" migration, the IP address cannot change
❒ Problem:
  ❍ How broadcast domains are provisioned affects where VMs can be moved

[Figure: a VM migrating between hypervisors on different servers]

Source: http://www.freesoftwaremagazine.com/files/nodes/1159/slide4.jpg

Solutions Using Traditional Network Design Principles: IP Subnets

Where to put the last hop router?

❒ ToR == last hop router
  ❍ Subnet (broadcast domain) limited to the rack
  ❍ Good broadcast/multicast limitation
  ❍ Poor VM mobility
❒ Aggregation switch == last hop router
  ❍ Subnet limited to the racks controlled by the aggregation switch
  ❍ Complex configuration
    • Subnet VLAN to all access switches and servers on the served racks
  ❍ Moderate broadcast/multicast limitation
  ❍ Moderate VM mobility
❒ Core switch/router == last hop router
  ❍ Subnet can extend to any rack covered
  ❍ Poor broadcast/multicast limitation
  ❍ Good VM mobility

Note: These solutions only work if the data center is single tenant!

Source: Bilal, et. al.

Problem #3: Dynamic Provisioning of Tenant Networks

❒ Virtualized data centers enable renting infrastructure to outside parties (aka tenants)
  ❍ Infrastructure as a Service (IaaS) model
  ❍ Amazon Web Services, Microsoft Azure, Google Compute Engine, etc.
❒ Customers get dynamic server provisioning through VMs
  ❍ They expect the same dynamic "as a service" provisioning for networks too
❒ Characteristics of a tenant network
  ❍ Traffic isolation
  ❍ Address isolation
    • From other tenants
    • From infrastructure

Solution Using Traditional Network Design Principles

❒ Use a different VLAN for each tenant network
❒ Problem #1
  ❍ There are only 4096 VLAN tags for 802.1q VLANs*
  ❍ Forces tenant network provisioning along physical network lines
❒ Problem #2
  ❍ For fully dynamic VM placement, each ToR-server link must be dynamically configured as a trunk
❒ Problem #3
  ❍ Can only move VMs to servers where the VLAN tag is available
    • Ties VM movement to physical infrastructure

*except for carrier Ethernet, about which more shortly
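A quick numeric sketch of the tag-space problem (my own illustration; the 24-bit figure anticipates the VSID/VNI fields discussed later in this deck):

```python
# 802.1Q carries a 12-bit VLAN ID; the overlay encapsulations discussed later
# (NVGRE VSID, VxLAN VNI) carry a 24-bit virtual network context.
VLAN_ID_BITS = 12
OVERLAY_VNC_BITS = 24

print(2 ** VLAN_ID_BITS)      # 4096 possible VLANs (a few values are reserved)
print(2 ** OVERLAY_VNC_BITS)  # 16,777,216 possible tenant networks
```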

Summary

❒ Configuring subnets based on the hierarchical switch architecture always results in a tradeoff between broadcast limitation and VM movement freedom
  ❍ On top of which, traffic isolation for multitenant networks can't be achieved
❒ Configuring multitenant networks with VLAN tags for traffic isolation ties tenant configuration to the physical data center layout
  ❍ Severely limits where VMs can be provisioned and moved
  ❍ Requires complicated dynamic trunking
❒ For multitenant, virtualized data centers, there is no good solution using traditional techniques!

Virtual/Overlay Network Functional Architecture

Virtual Networks through Overlays

❒ Basic idea of an overlay:
  ❍ Tunnel tenant packets through the underlying physical Ethernet or IP network
  ❍ The overlay forms a conceptually separate network providing a separate service from the underlay
❒ L2 service, like VPLS or EVPN
  ❍ Overlay spans a separate broadcast domain
❒ L3 service, like BGP IP VPNs
  ❍ Different tenant networks have separate IP address spaces
❒ Dynamically provision and remove overlays as tenants need network service
❒ Multiple tenants with separate networks on the same server
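A minimal sketch of the tunneling idea (illustrative pseudo-types of my own, not any particular standard): the tenant's frame becomes the payload of an underlay packet addressed between tunnel endpoints, and a virtual network context identifies which tenant it belongs to.

```python
from dataclasses import dataclass

@dataclass
class TenantFrame:          # what the tenant VM actually sent
    dst_mac: str
    src_mac: str
    payload: bytes

@dataclass
class OverlayPacket:        # what travels across the underlay
    outer_src_ip: str       # encapsulating NVE (tunnel source)
    outer_dst_ip: str       # decapsulating NVE (tunnel destination)
    vn_context: int         # identifies the tenant network (e.g. a 24-bit ID)
    inner: TenantFrame      # the original frame, carried unchanged

def encapsulate(frame: TenantFrame, local_nve: str, remote_nve: str,
                vn_context: int) -> OverlayPacket:
    """Wrap a tenant frame for transport between two tunnel endpoints."""
    return OverlayPacket(local_nve, remote_nve, vn_context, frame)
```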

Source: Bilal, et. al.

[Figure: blue and yellow tenant networks overlaid on the same physical servers]

Advantages of Overlays

❒ Tunneling is used to aggregate traffic
❒ Addresses in the underlay are hidden from the tenant
  ❍ Inhibits unauthorized tenants from accessing data center infrastructure
❒ Tenant addresses in the overlay are hidden from the underlay and from other tenants
  ❍ Multiple tenants can use the same IP address space
❒ Overlays can potentially support large numbers of tenant networks
❒ Virtual network state and end node reachability are handled in the end nodes

Challenges of Overlays

❒ Management tools to coordinate overlay and underlay
  ❍ Overlay networks probe for bandwidth and packet loss, which can lead to inaccurate information
  ❍ Lack of communication between overlay and underlay can lead to inefficient usage of network resources
  ❍ Lack of communication between overlays can lead to contention and other performance issues
❒ Overlay packets may fail to traverse firewalls
❒ Path MTU limits may cause fragmentation
❒ Efficient multicast is challenging

Functional Architecture: Definitions

❒ Virtual Network
  ❍ Overlay network defined over the Layer 2 or Layer 3 underlay (physical) network
  ❍ Provides either a Layer 2 or a Layer 3 service to the tenant
❒ Virtual Network Instance (VNI) or Tenant Network
  ❍ A specific instance of a virtual network
❒ Virtual Network Context (VNC)
  ❍ A tag or field in the encapsulation header that identifies the specific tenant network

Functional Architecture: More Definitions

❒ Network Virtualization Edge (NVE)
  ❍ Data plane entity that sits at the edge of an underlay network and implements L2 and/or L3 network virtualization functions
    • Example: virtual switch, aka Virtual Edge Bridge (VEB)
  ❍ Terminates the virtual network towards the tenant VMs and towards outside networks
❒ Network Virtualization Authority (NVA)
  ❍ Control plane entity that provides information about reachability and connectivity for all tenants in the data center

Overlay Network Architecture

[Figure: overlay network architecture — tenant systems attach to NVEs over LAN or point-to-point links (an NVE may also be integrated into the end system); the NVEs tunnel across the data center L2/L3 network (data plane) and are coordinated by the NVA (control plane)]

Virtual/Overlay Network Design and Implementation

Implementing Overlays: Tagging or Encapsulation?

❒ At or above Layer 2 but below Layer 3:
  ❍ Insert a tag at a standards-specified place in the pre-Layer 3 header (a byte-level sketch appears after the next two slides)
❒ At Layer 3:
  ❍ Encapsulate the tenant packet with an encapsulation protocol header and an IP header
❒ Tenant network identified by the Virtual Network Context
  ❍ Tag for tagging
  ❍ Context identifier in the protocol header for encapsulation

L2 Virtual Networks: Tagging Options

❒ Simple 802.1q VLANs
  ❍ 4096 limit problem
  ❍ Trunking complexity
❒ MPLS
  ❍ Nobody uses MPLS directly on the switching hardware
    • One experimental system (Zepplin)
  ❍ Switches are perceived to be too expensive
❒ TRILL
  ❍ IETF standard for L2 encapsulation
  ❍ Not widely adopted
    • Brocade and Cisco implement it
❒ Collection of enhancements to 802.1 since 2000
  ❍ 802.1qbg Virtual Edge Bridging (VEB) and Virtual Ethernet Port Aggregation (VEPA) (data plane)
  ❍ 802.1qbc Provider Bridging (data plane)
  ❍ 802.1qbf Provider Backbone Bridging (data plane)
    • Also does MAC-in-MAC encapsulation
  ❍ 802.1aq Shortest-Path Bridging (control plane)
  ❍ Note: These are also used by carriers for the wide area network (Carrier Ethernet)

802.1qbg: Standard Virtual Switch/VEB

❒ Virtual switch software sits in the hypervisor and switches packets between VMs
❒ Every time a packet arrives for a VM, the hypervisor takes an interrupt
  ❍ Potential performance issue
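As promised above, a byte-level sketch of the "insert a tag" option (my own illustration using Python's struct; field layout per 802.1Q, with the tag inserted after the source MAC):

```python
import struct

TPID_8021Q = 0x8100   # tag protocol identifier inserted after the source MAC

def insert_vlan_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag into an untagged Ethernet frame (illustrative only)."""
    assert 0 <= vid < 4096                    # the 12-bit VID: hence the 4096 limit
    tci = (pcp << 13) | vid                   # priority (3) | DEI (1, left 0) | VID (12)
    tag = struct.pack("!HH", TPID_8021Q, tci)
    return frame[:12] + tag + frame[12:]      # dst MAC (6) + src MAC (6), then the tag
```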

Source: D. Kamath, et al., "Edge Virtual Bridge Proposal Version 0, Rev 0.1", March 2010.

802.1qbg: Hardware Supported VEB

❒ SR-IOV is a PCI Express bus standard that allows VMs to communicate directly with the NIC
  ❍ No hypervisor interrupt
❒ Improves the performance of virtual switching
❒ Downsides
  ❍ More expensive NIC hardware
  ❍ More complex virtual switch
  ❍ Constrains VM movement

802.1qbg: VEB Forwarding

❒ At 1, the VEB forwards between a VM and the outside network via an external physical bridge (e.g. ToR)
❒ At 2, the VEB forwards between two VMs belonging to the blue tenant on the same hypervisor
❒ At 3, forwarding between two logical uplink ports is not allowed
  ❍ (a toy model of these rules appears after the VEPA slide below)

802.1qbg: VEB Characteristics

❒ Works in the absence of any ToR switch support
❒ Only supports a single physical uplink
❒ The VEB does not participate in spanning tree calculations
❒ Maximizes bandwidth
  ❍ As opposed to VEPA, which uses trombone forwarding (as we will shortly see)
❒ Minimizes latency for co-located VMs because there is no external network to cross
❒ Migration of VMs between servers is straightforward
  ❍ If both servers support SR-IOV for the hardware-supported VEB

802.1qbg: VEB Drawbacks (as of 2010)

❒ Limited additional features (ACLs, etc.)
❒ Limited security features
❒ Limited monitoring (Netflow, etc.)
❒ Limited support for 802.1 protocols (802.1x authentication, etc.)
❒ Limited support for promiscuous mode
❒ All of these are supported in the ToR
❒ Assumption: the only way to get support for these is to forward frames to the ToR before sending them to the VM

802.1qbg: Virtual Edge Port Aggregation (VEPA)

❒ Firmware upgrade to the switch to allow forwarding out of the same physical port the packet arrived on, under certain conditions
❒ VMs send all packets to the switch
  ❍ Packets to VMs on VLANs on the same machine are turned around and sent back
❒ Trombone routing halves the capacity of the ToR-server link
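The three numbered VEB forwarding rules above can be summarized in a toy decision function (my own illustrative sketch, not standards text):

```python
def veb_forward(in_port: str, out_port: str, same_tenant: bool) -> bool:
    """Toy 802.1qbg VEB forwarding check.
    Ports are either a VM-facing 'vnic' or an 'uplink' toward the external bridge."""
    if in_port == "uplink" and out_port == "uplink":
        return False                # rule 3: never forward uplink-to-uplink
    if in_port == "vnic" and out_port == "vnic":
        return same_tenant          # rule 2: local VM-to-VM only within the same tenant
    return True                     # rule 1: VM <-> uplink via the external bridge

print(veb_forward("vnic", "uplink", same_tenant=False))   # True
print(veb_forward("vnic", "vnic", same_tenant=True))      # True
print(veb_forward("uplink", "uplink", same_tenant=True))  # False
```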

5 Years Later: VEBs support most of these

❒ Open vSwitch (OVS) supports ACLs
❒ OVS supports Netflow
❒ The VMWare virtual switch supports promiscuous mode, and OVS supports it if the NIC is in promiscuous mode
❒ OVS doesn't support 802.1x
❒ Conclusion: programming support into software is a much better solution than making a hardware standard that reduces performance

Ethernet Data Plane Evolution: Not Your Father's Ethernet Anymore

[Timeline figure: 802.1D bridging (1990), 802.1Q VLAN bridging (1999), 802.1Qbc Provider Bridging (2005), 802.1Qbf Provider Backbone Bridging (2008)]

Source: P. Thaler, N. Finn, D. Fedyk, G. Parsons, and E. Gray, "IEEE 802.1Q: Media Access Control Bridges and Virtual Bridged Local Area Networks", IETF-86 Tutorial, March 19, 2013
Source: evolutionanimation.wordpress.com

Ethernet Control Plane Evolution

❒ Rapid Spanning Tree Protocol (RSTP): a single spanning tree for all traffic
❒ Multiple Spanning Tree Protocol (MSTP): different VLANs share separate paths
❒ Shortest Path Bridging (SPB): use a link state protocol (ISIS) to give each node its own shortest path tree

Source: P. Thaler, et. al., 2013

SPB Data Center Virtualization

[Figure: an NVA (e.g. a Software Defined Network controller) instructs NVEs (edge switches 1-3, connected through a central switch across the data center L2 network) to (1) create the Red tenant network (I-SID 1, VN-1) and (2) distribute shortest path routes with ISIS — a hybrid centralized/distributed control plane]


L2 Virtualization: Challenges Handled

❒ "Hot" VM movement
  ❍ IP address space configured on the I-SID
  ❍ But only within the data center
❒ ARP containment
  ❍ Limit the broadcast domain to the I-SID
❒ Firewall traversal
  ❍ No firewalls at L2
❒ Path MTU
  ❍ Handled by the IP layer
❒ Multicast
  ❍ ISIS handles it
❒ Management
  ❍ Whole suite of management tools for 802.1 networks

L2 Virtualization Summary

❒ Possible to virtualize a data center with standardized L2 overlays
  ❍ Advances in the 802.1Q data plane provide one layer of MAC-in-MAC encapsulation and an extra layer of VLAN tags
  ❍ Centralized, decentralized, or hybrid control plane
❒ But most existing deployments use proprietary extensions
  ❍ Cisco UCS uses TRILL
❒ But using IP overlays is cheaper
  ❍ Switches supporting carrier Ethernet extensions and TRILL are more expensive than simple 802.1Q switches

L3 Virtual Networks: Advantages

❒ Easy IP provisioning through the hypervisor/virtual switch
  ❍ End host provisioning
  ❍ No need for a distributed control plane
❒ Cheap NICs and switching hardware
❒ Support in the hypervisor/virtual switch
❒ No limitation on the number and placement of virtual networks
  ❍ A virtual network can even extend into the WAN

L3 Virtual Networks: Challenges

❒ Path MTU limitation may cause fragmentation
❒ Lack of tools for management
❒ Some performance hit
  ❍ Encapsulation/decapsulation
  ❍ Lack of NIC hardware support
❒ But the low cost of NICs and switching hardware trumps all!!
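On the MTU point: each tenant frame picks up outer headers, so the usable tenant MTU shrinks unless the underlay MTU is raised. A small sketch of the arithmetic (header sizes are the usual IPv4/UDP/VxLAN values, used here purely for illustration; the outer Ethernet header is not counted against the IP MTU):

```python
# Outer headers added by a typical UDP-based L3 encapsulation (e.g. VxLAN over IPv4):
OUTER_IPV4, OUTER_UDP, VXLAN_HDR, INNER_ETH = 20, 8, 8, 14

def tenant_payload_mtu(underlay_mtu: int = 1500) -> int:
    """Largest tenant IP packet that fits without fragmenting the underlay packet."""
    return underlay_mtu - (OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH)

print(tenant_payload_mtu(1500))  # 1450: either the tenant MTU drops, or
print(tenant_payload_mtu(1600))  # 1550: the underlay runs larger (baby giant/jumbo) frames
```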

L3 Virtual Networks: Encapsulation Options

❒ IP in IP
  ❍ Use the IP address as the VNC
  ❍ Problem for IPv4: lack of address space
❒ IPSec in infrastructure mode
  ❍ Provides additional confidentiality
  ❍ Problem: key distribution complexity
  ❍ Problem: larger performance hit, even with hardware encryption assist
❒ In practice:
  ❍ STT
    • Proprietary VMWare/NSX protocol
    • Designed to leverage TLS hardware support on NICs
  ❍ GRE and NVGRE
  ❍ VxLAN
❒ Coming
  ❍ Geneve
    • Proposed unified protocol framework for encapsulation headers

NVGRE: Network Virtualization using Generic Routing Encapsulation

❒ Microsoft-proposed GRE extension built on:
  ❍ RFC 2784 GRE
  ❍ RFC 2890 GRE Key Extension
❒ Provides a Layer 2 service tunneled over IP
  ❍ No VLAN id in the tunneled frame (NO C-VID!)
❒ VNC is a Virtual Subnet Identifier (VSID)
  ❍ 24-bit GRE Key field
    • Each VSID constitutes a separate broadcast domain, like a VLAN
  ❍ 8-bit FlowID
    • Adds entropy for Equal Cost Multipath (ECMP) routing

[Packet layout: outer Ethernet (P-DMAC, P-SMAC, P-VID, Ethertype=0x0800), outer IP (P-SIP, P-DIP, Protocol=0x2F GRE), GRE header with the Key bit set and the Checksum and Sequence # bits clear, Protocol=0x6558 indicating Transparent Ethernet Bridging, Key = 24-bit VSID + 8-bit FlowID, then the customer frame (C-DMAC, C-SMAC, Ethertype=0x0800 with no C-VID, C-SIP, C-DIP)]
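A minimal sketch of the NVGRE header described above (my own illustration of the bit layout using Python's struct; in a real packet this sits between an outer IPv4 header with protocol 47 and the tenant's Ethernet frame):

```python
import struct

GRE_PROTO_TEB = 0x6558   # Transparent Ethernet Bridging: the payload is an Ethernet frame
GRE_FLAG_KEY  = 0x2000   # only the Key Present bit is set (no checksum, no sequence number)

def nvgre_header(vsid: int, flow_id: int) -> bytes:
    """Build the 8-byte NVGRE header: flags/version(16) | protocol(16) | key(32)."""
    assert 0 <= vsid < 2**24 and 0 <= flow_id < 2**8
    key = (vsid << 8) | flow_id               # 24-bit VSID + 8-bit FlowID
    return struct.pack("!HHI", GRE_FLAG_KEY, GRE_PROTO_TEB, key)

print(nvgre_header(vsid=0x00ABCD, flow_id=0x2A).hex())  # 2000655800abcd2a
```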

❒ Path MTU discovery must be performed by the originating NVE
❒ Encapsulated MAC header VLAN tag handling
  ❍ The originating NVE must strip out any 802.1Q VLAN tag
  ❍ The receiving NVE must add the required 802.1Q VLAN tag back
  ❍ Requires the NVA to maintain and provision the VLAN tag to VN Key mapping
❒ Multicast handling
  ❍ Multicast routing deployed in the infrastructure
    • The provider provisions a multicast address per VSID
    • The address carries all multicast and broadcast traffic originating in the VSID
  ❍ No multicast routing deployed in the infrastructure
    • N-way unicast by the NVEs, or a dedicated VM multicast router

VxLAN: Virtual eXtensible Local Area Network

❒ RFC 7348
  ❍ Consortium led by Intel, VMWare and Cisco
❒ Full Layer 2 service provided over IP
  ❍ VLAN id OK (C-VID allowed in the tunneled frame)
  ❍ VxLAN segments constitute a broadcast domain
❒ VNC is the VxLAN Network Identifier (VNI)
  ❍ 24-bit VxLAN Segment Identifier
❒ Recommended that the UDP source port be randomized to provide entropy for ECMP routing

[Packet layout: outer Ethernet (P-DMAC, P-SMAC, P-VID, Ethertype=0x0800), outer IP (P-SIP, P-DIP, Protocol=0x11 UDP), UDP header (randomized source port, destination port 4789, checksum = 0), VxLAN header (flag bit set to 1 for a valid VNI, other bits reserved and ignored, 24-bit VNI), then the customer frame (C-DMAC, C-SMAC, C-VID, Ethertype=0x0800, C-SIP, C-DIP)]
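A matching sketch of the 8-byte VxLAN header from RFC 7348 (again, just an illustration of the field layout with Python's struct; a full packet wraps this in UDP to port 4789 plus outer IP and Ethernet headers):

```python
import struct

VXLAN_PORT = 4789        # IANA-assigned UDP destination port
VXLAN_FLAG_VNI = 0x08    # "I" flag: the VNI field is valid

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VxLAN header: flags(8) | reserved(24) | VNI(24) | reserved(8)."""
    assert 0 <= vni < 2**24
    return struct.pack("!II", VXLAN_FLAG_VNI << 24, vni << 8)

print(vxlan_header(vni=5001).hex())   # 0800000000138900
```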

❒ Problem: an IP multicast control plane is required by RFC 7348
  ❍ An IP multicast address is allocated per VNI for determining the IP unicast address to MAC address mapping
  ❍ Multicast routing is not widely deployed in data centers
    • Most VxLAN deployments use an NVA/SDN Controller instead
❒ Solution: VxLAN is just used as an encapsulation format
❒ A UDP endpoint constitutes a VxLAN Tunnel End Point (VTEP)
  ❍ Handled at the application layer
❒ Path MTU discovery is performed by the VTEP
❒ Multicast handling is like NVGRE
  ❍ Can be handled by using underlay multicast
  ❍ Mostly handled using N-way unicast

VxLAN Data Center Virtualization

[Figure: an NVA (e.g. a Software Defined Network controller) creates the Red tenant network (VNI-1) by provisioning NVEs/VTEPs at ToR1, ToR2 and ToR3, which tunnel across the data center L3 network — a centralized control plane]
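The figure's centralized control plane can be sketched as a mapping the NVA pushes to each VTEP: given a tenant's (VNI, destination MAC), the VTEP learns which remote VTEP IP to tunnel to. The names and table layout below are my own illustration, not taken from any standard:

```python
from typing import Dict, Tuple, Optional

# (VNI, inner destination MAC) -> remote VTEP IP, as provisioned by the NVA/SDN controller
ForwardingTable = Dict[Tuple[int, str], str]

nva_provisioned: ForwardingTable = {
    (1, "52:54:00:aa:bb:01"): "10.0.0.2",   # red tenant VM behind ToR2's VTEP
    (1, "52:54:00:aa:bb:02"): "10.0.0.3",   # red tenant VM behind ToR3's VTEP
}

def lookup_remote_vtep(table: ForwardingTable, vni: int, dst_mac: str) -> Optional[str]:
    """Return the underlay IP of the VTEP to tunnel to, or None (unknown: flood/N-way unicast)."""
    return table.get((vni, dst_mac))

print(lookup_remote_vtep(nva_provisioned, 1, "52:54:00:aa:bb:01"))  # 10.0.0.2
print(lookup_remote_vtep(nva_provisioned, 1, "52:54:00:ff:ff:ff"))  # None
```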

L3 Virtual Networks Summary

❒ Despite the challenges with IP overlays, they are widely deployed
  ❍ Usually with workarounds for the challenges
❒ Software availability
  ❍ Lots of open source software
  ❍ Proprietary solutions also available
❒ Can extend the overlay into the WAN
  ❍ Between data centers
  ❍ Between an enterprise network and the data center
❒ Deployments almost exclusively use centralized control
  ❍ NVA implemented using an SDN controller