
CLOUD COMPUTING: NETWORKING AND COMMUNICATIONS CHALLENGES

Communication within Clouds: Open Standards and Proprietary Protocols for Data Center Networking

Carolyn J. Sher DeCusatis and Aparico Carranza, New York City College of Technology; Casimer M. DeCusatis, IBM Corporation

ABSTRACT

Cloud computing and other highly virtualized data center applications have placed many new and unique requirements on the data center network infrastructure. Conventional network protocols and architectures such as Spanning Tree Protocol and multichassis link aggregation can limit the scale, latency, throughput, and virtual machine mobility of large cloud networks. This has led to a multitude of new networking protocols and architectures. We present a tutorial on some of the key requirements for cloud computing networks and the various approaches that have been proposed to implement them. These include industry standards (e.g., TRILL, SPB, software-defined networking, and OpenFlow), best practices for standards-based data center networking (e.g., the Open Datacenter Interoperable Network), as well as vendor proprietary approaches (e.g., FabricPath, VCS, and QFabric).

INTRODUCTION

Cloud computing is a method of delivering computing services from a large, highly virtualized data center to many independent end users, using shared applications and pooled resources. While there are many different definitions of cloud computing [1], it is typically distinguished by the following attributes: on-demand self-service, broad network access, resource pooling, rapid and elastic resource provisioning, and metered service at various quality levels.

Implementing these attributes as part of a large enterprise-class cloud computing service that provides continuous availability to a large number of users typically requires significantly more server, networking, and storage resources than a conventional data center (up to an order of magnitude more in many cases). This is only achievable through extensive use of virtualization. While server virtualization has existed since the 1960s, when it was first implemented on IBM mainframes, it has only become widely available on affordable commodity x86 servers within the last decade or so. In recent years, many equipment vendors have contributed to the hardware and software infrastructure, which has made enterprise-class virtualization widely available. This, in turn, enables new designs for cloud computing, including hosting multiple independent tenants on a shared infrastructure, rapid and dynamic provisioning of new features, and implementing advanced load balancing, security, and business continuity functions (including multisite transaction mirroring). This has brought about profound changes in many aspects of data center design, including new requirements for the data center network.

Modern cloud data centers employ resource pooling to make more efficient use of data center appliances and to enable dynamic reprovisioning in response to changing application needs. Examples include elastic workloads, where application components are added, removed, or resized based on the traffic load; mobile applications that relocate to different hosts based on distance from the host or hardware availability; and proactive disaster recovery solutions, which relocate applications in response to a planned site shutdown or a natural disaster. It has been shown [2] that highly virtualized servers place unique requirements on the data center network. Cloud data center networks must contend with huge numbers of attached devices (both physical and virtual), large numbers of isolated independent networks, multitenancy (application components belonging to different tenants collocated on a single host), and automated creation, deletion, and migration of virtual machines (facilitated by large layer 2 network domains). Furthermore, many cloud data centers now contain clusters or pods of servers, storage, and networking, configured so that the vast majority of traffic (80–90 percent in some cases) flows between adjacent servers within a pod (so-called east-west traffic). This is a very different traffic pattern from that of conventional data center networks, which supported higher levels of traffic between server racks or pods (so-called north-south traffic).


Figure 1. MC-LAG configuration without STP (left) and with STP (right).

To cope with these problems, many attempts have been made to develop best practices for networking design and management. Several new industry standards and proprietary network architectures have recently been proposed. Many network designers, users, and administrators have long expressed a desire for standardization and multivendor interoperability in the data center, to simplify management, improve reliability, and avoid being locked into one particular vendor’s proprietary product offerings and pricing structure. These conclusions were supported by a recent analyst report [3], which determined that multisourcing of network equipment is not only practical, but can reduce total cost of ownership by 15–25 percent. Furthermore, a recent survey of 468 business technology professionals on their data networking purchasing preferences [4] showed that adherence to industry standards was their second highest requirement, behind virtualization support. Standardization also encourages future-proofing of the network and helps promote buying confidence. Despite these advantages, many large network equipment providers have advocated for proprietary network protocols. A recent study [5] showed that five of the six largest network equipment manufacturers include proprietary features in their products, and only three of the six claimed interoperability with other vendors’ access layer switches.

In this article, we present a tutorial on cloud networking design practices, including both industry standard and vendor proprietary alternatives. It should be noted that although modern data centers will almost certainly require some version of these new protocols, many of these approaches are far less mature than conventional network designs. Early adopters should use caution when evaluating the best choices for their data center needs.

STP, LAG, AND MULTICHASSIS LINK AGGREGATION

Spanning Tree Protocol (STP) is a layer 2 switching protocol used by classical Ethernet that ensures loop-free network topologies by always creating a single path tree structure through the network. In the event of a link failure or reconfiguration, the network halts all traffic while the spanning tree algorithm recalculates the allowed loop-free paths through the network. (A loop-free topology can also be maintained without STP by using Multi Chassis EtherChannel [MCEC], also referred to as Virtual Port Channels [vPC], on Cisco switches.)

The changing requirements of cloud data center networks are forcing designers to reexamine the role of STP. One drawback of a spanning tree protocol is that, in blocking redundant ports and paths, it significantly reduces the aggregate available network bandwidth. Additionally, STP can result in circuitous and suboptimal communication paths through the network, adding latency and degrading application performance. A spanning tree cannot easily be segregated into smaller domains to provide better scalability, fault isolation, or multitenancy. Finally, the time taken to recompute the spanning tree and propagate the changes in the event of a failure can vary widely, and sometimes becomes quite large (seconds to minutes). This is highly disruptive for elastic applications and virtual machine migrations, and can lead to cascaded system-level failures.

To help overcome the limitations of STP, several enhancements have been standardized. These include Multiple STP (MSTP), which configures a separate spanning tree for each virtual local area network (VLAN) group and blocks all but one of the possible alternate paths within each spanning tree. Also, the link aggregation group (LAG) standard (IEEE 802.3ad) allows two or more physical links to be bonded into a single logical link, either between two switches or between a server and a switch. Since a LAG introduces a loop in the network, STP has to be disabled on network ports using LAGs. It is possible for one end of a link aggregated port group to be dual-homed into two different devices to provide device-level redundancy, while the other end of the group remains single-homed and continues to run normal LAG. This extension to the LAG specification is called multichassis link aggregation (MC-LAG), and is standardized as IEEE 802.1AX-2008. As shown in Fig. 1, MC-LAG can be used to create a loop-free topology without relying on STP; because STP views the LAG as a single link, it will not exclude redundant links within the LAG. For example, a pair of network interface cards (NICs) can be dual-homed into a pair of access switches (using NIC teaming), and MC-LAG can then be used to interconnect the access switches with a pair of core switches.
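The way a LAG (or MC-LAG) spreads traffic over its member links is worth illustrating, since, as noted below, the hashing algorithms involved are not standardized across vendors. The Python sketch below shows one plausible approach: hash a flow's header fields and use the result to pick a member link, so that packets of the same flow stay in order while different flows are spread across the group. The field selection and hash function are illustrative assumptions, not any particular vendor's algorithm.

    # Illustrative sketch (not from the article): one way a switch might pick a
    # member link of a LAG for a given flow by hashing header fields. MC-LAG
    # hashing is NOT standardized, so the fields and hash below are assumptions.

    import hashlib

    def select_lag_member(src_mac, dst_mac, src_ip, dst_ip, src_port, dst_port,
                          num_links):
        """Hash a flow's headers to a member link index (0 .. num_links-1)."""
        key = f"{src_mac}{dst_mac}{src_ip}{dst_ip}{src_port}{dst_port}".encode()
        digest = hashlib.sha256(key).digest()
        # All packets of the same flow hash to the same link, preserving
        # in-order delivery; different flows spread across the LAG members.
        return int.from_bytes(digest[:4], "big") % num_links

    if __name__ == "__main__":
        flows = [
            ("aa:00", "bb:00", "10.1.1.5", "10.1.2.7", 49152, 80),
            ("aa:01", "bb:00", "10.1.1.6", "10.1.2.7", 49153, 80),
            ("aa:02", "bb:01", "10.1.1.7", "10.1.2.8", 49154, 443),
        ]
        for flow in flows:
            print(flow, "-> LAG member", select_lag_member(*flow, num_links=4))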


Most MC-LAG systems allow dual homing across only two paths; in practice, MC-LAG systems are limited to dual core switches because it is extremely difficult to maintain a coherent state between more than two devices with submicrosecond refresh times. Unfortunately, the hashing algorithms associated with MC-LAG are not standardized; care needs to be taken to ensure that the two switches on the same tier of the network are from the same vendor (switches from different vendors can be used on different tiers of the network). As a relatively mature standard, MC-LAG has been deployed extensively, does not require new forms of data encapsulation, and works with existing network management systems and multicast protocols.

LAYER 3 VS. LAYER 2 DESIGNS FOR CLOUD COMPUTING

Many conventional data center networks are based on well established, proven approaches such as a layer 3 “fat tree” design (or Clos network) using equal cost multipathing (ECMP). These approaches can be adapted to cloud computing environments. As shown in Fig. 2, a layer 3 ECMP design creates multiple load balanced paths between nodes. The number of paths is variable, and bandwidth can be adjusted by adding or removing paths up to the maximum allowed number of links. Unlike a layer 2 STP network, no links are blocked with this approach. Broadcast loops are avoided by using different VLANs, and the network can route around link failures.

Figure 2. Example of a four-way layer 3 ECMP design.

Typically, all attached servers are dual-homed (each server has two connections to the first network switch using active-active NIC teaming). This approach is known as a spine and leaf architecture, where the switches closest to the servers are leaf switches that interconnect with a set of spine switches. Using a two-tier design with reasonably sized (48-port) leaf and spine switches and relatively low oversubscription (3:1), as illustrated in Fig. 3, it is possible to scale this network up to around 1000–2000 or more physical ports.

If devices attached to the network support Link Aggregation Control Protocol (LACP), it becomes possible to logically aggregate multiple connections to the same device under a common virtual link aggregation group (VLAG). It is also possible to use VLAG interswitch links (ISLs) combined with Virtual Router Redundancy Protocol (VRRP) to interconnect switches at the same tier of the network. VRRP supports IP forwarding between subnets, and protocols such as Open Shortest Path First (OSPF) or Border Gateway Protocol (BGP) can be used to route around link failures. Virtual machine migration is limited to servers within a VLAG.

Layer 3 ECMP designs offer several advantages. They are based on proven, standardized technology that leverages smaller, less expensive rack or blade chassis switches (virtual soft switches typically do not provide layer 3 functions and would not participate in an ECMP network). The control plane is distributed, and smaller isolated domains may be created.

There are also some trade-offs when using a layer 3 ECMP design. The native layer 2 domains are relatively small, which limits the ability to perform live virtual machine (VM) migrations from any server to any other server. Each individual domain must be managed as a separate entity. Such designs can be fairly complex, requiring expertise in IP routing to set up and manage the network, and presenting complications with multicast domains. Scaling is affected by the control plane, which can become unstable under some conditions (e.g., if all the servers attached to a leaf switch boot up at once, the switch’s ability to process Address Resolution Protocol [ARP] and Dynamic Host Configuration Protocol [DHCP] relay requests will be a bottleneck in overall performance). In a layer 3 design, the size of the ARP table supported by the switches can become a limiting factor in scaling the design, even if the medium access control (MAC) address tables are quite large. Finally, complications may result from the use of different hashing algorithms on the spine and leaf switches.

New protocols are being proposed to address the limitations on live VM migration presented by a layer 3 ECMP design, while at the same time overcoming the limitations of layer 2 designs based on STP or MC-LAG. All of these approaches involve some implementation of multipath routing, which allows for a more flexible topology than STP [5] (Fig. 4). In the following sections, we discuss two recent standards, TRILL and SPB, both of which are essentially layer 2 ECMP designs with multipath routing.

TRILL

Transparent Interconnection of Lots of Links (TRILL) is an Internet Engineering Task Force (IETF) industry standard protocol originally proposed by Radia Perlman, who also invented STP. TRILL runs a link state protocol between devices called routing bridges (RBridges). Specifically, TRILL uses a form of layer 2 link state protocol called intermediate system to intermediate system (IS-IS) to identify the shortest paths between switches on a hop-by-hop basis, and load balance across those paths. In other words, connectivity information is broadcast across the network so that each RBridge knows about all the other RBridges and the connections between them. This gives RBridges enough information to compute pair-wise optimal paths for unicast traffic. For multicast/broadcast groups or delivery to unknown destinations, TRILL uses distribution trees and an RBridge as the root for forwarding.
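To make the IS-IS-based path computation concrete, the following minimal sketch runs Dijkstra's algorithm over a small link-state database of the kind each RBridge builds from the flooded connectivity information. The five-node topology and unit link costs are assumptions for illustration; they are not taken from the standard or the article.

    # A minimal sketch of the shortest-path computation performed by a link
    # state protocol such as IS-IS; TRILL RBridges use this kind of calculation
    # to find pair-wise shortest paths. Topology and costs are hypothetical.

    import heapq

    def shortest_paths(links, source):
        """Dijkstra over an undirected link-state database {(a, b): cost}."""
        graph = {}
        for (a, b), cost in links.items():
            graph.setdefault(a, []).append((b, cost))
            graph.setdefault(b, []).append((a, cost))
        dist, prev = {source: 0}, {}
        heap = [(0, source)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue                      # stale heap entry
            for neighbor, cost in graph[node]:
                nd = d + cost
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor], prev[neighbor] = nd, node
                    heapq.heappush(heap, (nd, neighbor))
        return dist, prev

    def path_to(prev, source, dest):
        """Walk the predecessor map back from dest to source."""
        hops = [dest]
        while hops[-1] != source:
            hops.append(prev[hops[-1]])
        return list(reversed(hops))

    if __name__ == "__main__":
        # Five RBridges with equal-cost links (hypothetical topology).
        links = {("Rb1", "Rb3"): 1, ("Rb1", "Rb4"): 1, ("Rb2", "Rb4"): 1,
                 ("Rb2", "Rb5"): 1, ("Rb3", "Rb4"): 1, ("Rb4", "Rb5"): 1}
        dist, prev = shortest_paths(links, "Rb1")
        print("Distances from Rb1:", dist)
        print("Path Rb1 -> Rb5:", path_to(prev, "Rb1", "Rb5"))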


Figure 3. Example of an L3 ECMP leaf-spine design (4 spine switches, 16 leaf switches, 48 servers per leaf switch).
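The scaling estimate given earlier for a leaf-spine design of this kind (around 1000–2000 or more physical ports from 48-port switches at 3:1 oversubscription) can be checked with simple arithmetic. The sketch below sizes a generic two-tier fabric; the particular split of 12 uplinks and 36 downlinks per leaf is an illustrative assumption, one of several ways to land in that range, and does not reproduce the exact configuration of Fig. 3.

    # Back-of-the-envelope sizing of a two-tier leaf-spine fabric built from
    # identical fixed-port switches. Port counts below are illustrative
    # assumptions, not the article's bill of materials.

    def leaf_spine_capacity(switch_ports, uplinks_per_leaf):
        """Return (spines, leaves, server_ports, oversubscription)."""
        downlinks_per_leaf = switch_ports - uplinks_per_leaf
        num_spines = uplinks_per_leaf      # one uplink from every leaf to every spine
        num_leaves = switch_ports          # each spine port connects one leaf
        server_ports = num_leaves * downlinks_per_leaf
        oversubscription = downlinks_per_leaf / uplinks_per_leaf  # equal link speeds
        return num_spines, num_leaves, server_ports, oversubscription

    if __name__ == "__main__":
        spines, leaves, servers, oversub = leaf_spine_capacity(48, 12)
        print(f"{spines} spines, {leaves} leaves, {servers} server-facing ports, "
              f"{oversub:.0f}:1 oversubscription")
        # -> 12 spines, 48 leaves, 1728 server-facing ports, 3:1 oversubscription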

Figure 4. Example of multipath routing between three servers A, B, and C.

Each node of the network recalculates the TRILL header and performs other functions such as MAC address swapping. STP is not required in a TRILL network.

There are many potential benefits of TRILL, including enhanced scalability. TRILL allows the construction of loop-free multipath topologies without the complexity of MCEC, which reduces the need for synchronization between switches and eliminates possible failure conditions that would result from this complexity. TRILL should also help alleviate issues associated with excessively large MAC address tables (approaching 20,000 entries) that must be discovered and updated in conventional Ethernet networks. Furthermore, the protocol can be extended by defining new TLV (type-length-value) data elements for carrying TRILL information (some network equipment vendors are expected to implement proprietary TLV extensions in addition to industry standard TRILL).

For example, consider the topologies shown in Figs. 5a and 5b. In a classic STP network with bridged domains, a single multicast tree is available. However, in a TRILL fabric, there are several possible multicast trees based on IS-IS. This allows multidestination frames to be efficiently distributed over the entire network. There are also several possible active paths in a TRILL fabric, which makes more efficient use of bandwidth than does STP.

SHORTEST PATH BRIDGING

Shortest path bridging (SPB) is a layer 2 standard (IEEE 802.1aq) addressing the same basic problems as TRILL, although using a slightly different approach. SPB was originally introduced to the IEEE as provider link state bridging (PLSB), a technology developed for the telecommunications carrier market, which was itself an evolution of the IEEE 802.1ah standard (provider backbone bridging, or PBB). The SPB standard reuses the PBB 802.1ah data path, and therefore fully supports the IEEE 802.1ag-based operations, administration, and management (OA&M) functionality. The 802.1ah frame format provides a service identifier (I-SID) that is completely separate from the backbone MAC addresses and the VLAN IDs. This enables simplified data center virtualization by separating the connectivity services layer from the physical network infrastructure. The I-SID abstracts the service from the network. By mapping a VLAN or multiple VLANs to an I-SID at the service access point, SPB automatically builds a shortest path through the network. The I-SID also provides a mechanism for granular traffic control by mapping services (applications) into specific I-SIDs.

When a new device is attached to the SPB network and wishes to establish communication with an existing device, there is an exchange (enabled by the IS-IS protocol) to identify the requesting device and learn its immediate neighboring nodes. Learning is restricted to the edge of the network, which is reachable by the I-SID. The shortest bidirectional paths from the requesting device to the destination are then computed using link metrics such as ECMP. The same approach is used for both unicast and broadcast/multicast packets, which differs from TRILL. Once the entire multicast tree has been developed, the tree is then pruned, and traffic is assigned to a preferred path. The endpoints thus learn how to reach one another by transmitting on a specific output address port, and this path remains in place until there is a configuration change in the network. In this manner, the endpoints are fully aware of the entire traffic path, which was not the case for TRILL.
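As a rough illustration of the service abstraction described above, the sketch below maps customer VLANs to I-SIDs at a service access point and builds a conceptual SPBM (Mac-in-Mac style) outer header. The VLAN numbers, I-SID values, and header fields are hypothetical; real SPBM encapsulation carries considerably more state than this.

    # Conceptual sketch of VLAN-to-I-SID mapping for SPBM: customer VLANs are
    # mapped to a backbone service identifier at the service access point, and
    # the backbone forwards on backbone MAC and backbone VLAN ID. All values
    # below are hypothetical examples.

    # Service access point configuration: customer VLAN -> I-SID.
    VLAN_TO_ISID = {
        10: 20010,   # tenant A
        11: 20010,   # tenant A, second VLAN mapped into the same service
        20: 20020,   # tenant B
    }

    def encapsulate(frame_vlan, dst_bmac, bvid):
        """Return a conceptual SPBM outer header for a customer frame."""
        isid = VLAN_TO_ISID.get(frame_vlan)
        if isid is None:
            raise ValueError(f"VLAN {frame_vlan} is not mapped to any service")
        # Backbone switches forward on (B-MAC, B-VID); the I-SID keeps tenant
        # services separate without exposing customer VLANs to the backbone.
        return {"B-MAC": dst_bmac, "B-VID": bvid, "I-SID": isid}

    if __name__ == "__main__":
        print(encapsulate(frame_vlan=10, dst_bmac="00:16:ca:00:00:02", bvid=4001))
        print(encapsulate(frame_vlan=20, dst_bmac="00:16:ca:00:00:03", bvid=4001))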


Figure 5. a) STP multicast with bridged domains; b) TRILL multicast domain with Rbridges.
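A small companion example to Fig. 5: the code below builds shortest-path trees rooted at different RBridges in a hypothetical five-node fabric and shows that the trees differ, which is why a TRILL fabric can spread multidestination frames over more links than the single tree available to STP. The topology is a stand-in inspired by Fig. 5b, not an exact reproduction of it.

    # Illustrative only: different tree roots yield different distribution
    # trees in a TRILL-like fabric, whereas STP fixes a single tree.

    from collections import deque

    LINKS = [("Rb1", "Rb3"), ("Rb1", "Rb4"), ("Rb2", "Rb4"),
             ("Rb2", "Rb5"), ("Rb3", "Rb4"), ("Rb4", "Rb5")]

    def tree_rooted_at(root):
        """BFS shortest-path tree: returns the set of links used from this root."""
        adjacency = {}
        for a, b in LINKS:
            adjacency.setdefault(a, []).append(b)
            adjacency.setdefault(b, []).append(a)
        visited, tree, queue = {root}, set(), deque([root])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    tree.add(tuple(sorted((node, neighbor))))
                    queue.append(neighbor)
        return tree

    if __name__ == "__main__":
        for root in ("Rb1", "Rb2", "Rb3"):
            print(f"Distribution tree rooted at {root}:", sorted(tree_rooted_at(root)))
        # Different roots use different links, so multidestination traffic can
        # be spread over more of the fabric than a single STP tree allows.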

The route packets take through the network can be determined from the source address, destination address, and VLAN ID. Traffic experiences the same latency in both directions on the resulting paths through the network, also known as congruent pathing.

Within SPB there are two models for multipath bridging: shortest path bridging VLAN (SPBV) and SPB Mac-in-Mac (SPBM). Both variants use IS-IS as the link state topology protocol, and both compute shortest path trees between nodes. SPBV uses a shortest path VLAN ID (SPVID) to designate nodal reachability. SPBM uses a backbone MAC (BMAC) and backbone VLAN ID (BVID) combination to designate nodal reachability. Both SPBV and SPBM provide interoperability with STP. For data center applications, SPBM is the preferred technology. There are several other proposed enhancements and vendor proprietary extensions to SPB that we do not discuss here; interested readers should refer to the latest working drafts of proposed SPB features and extensions.

TRILL or SPB may be used in the following cases:
1. When the number of access ports needed exceeds the MC-LAG capacity (typically a few thousand ports) and the network needs an additional core switch that does not participate in the MC-LAG
2. When implementing a multivendor network with one vendor’s switches at the access layer and another vendor’s switch in the core
3. When implementing different switch product lines from a single vendor, and one product cannot participate in the other product’s fabric

Since both TRILL and SPB were developed to address the same underlying problems, comparisons between the two are inevitable. The main difference between the two approaches is that TRILL attempts to optimize the route between hops in the network, while SPB makes the entire path known to the edge nodes. TRILL uses a new form of MAC-in-MAC encapsulation and OA&M, while SPB has variants for both existing MAC-in-MAC and Q-in-Q encapsulation, each with their associated OA&M features. Only SPB currently supports standard 802.1 OA&M interfaces. TRILL uses different paths for unicast and multicast traffic than those used by SPB, which likely will not make a difference for the vast majority of IP traffic applications. There are also variations in the loop prevention mechanisms, number of supported virtualization instances, lookup and forwarding, handling of multicast traffic, and other features between these two standard protocols. The real impact of these differences on the end user or network administrator remains to be seen.

As of this writing, several companies have announced their intention to support SPB (including Avaya and Alcatel-Lucent) or to support TRILL (including IBM, Huawei, and Extreme Networks). Other companies have announced proprietary protocols that essentially address the same basic problems but claim to offer other advantages. The following sections briefly discuss three alternatives from some of the largest networking companies (Cisco, Brocade, and Juniper Networks); other proprietary approaches exist, but we do not review them here.

FABRICPATH

Cisco produces switches that use either a version of TRILL or a proprietary multipath layer 2 encapsulation protocol called FabricPath, which is based on TRILL but has several important differences. For example, FabricPath frames do not include TRILL’s next-hop header [6], and FabricPath has a different MAC learning technique than TRILL, which according to Cisco may be more efficient. Conversational learning has each FabricPath switch learn the MAC addresses it needs based on conversations, rather than learning all MAC addresses in the domain. According to Cisco, this may have advantages in cases where a large number of VMs creates far more MAC addresses than a switch could normally handle. FabricPath also supports multiple topologies based on VLANs, and allows for the creation of separate layer 2 domains. FabricPath uses vPC+, Cisco’s proprietary multichassis link aggregation protocol, to connect to non-FabricPath switches. According to Cisco, FabricPath has the ability to scale beyond the limits of TRILL or SPB in the network core, supporting up to eight core switches on the same tier in a single layer 2 domain. Edge devices such as the Cisco Nexus 5000 series support up to 32,000 MAC addresses; core devices such as the Cisco Nexus 7000 series support up to 16,000 MAC addresses and 128,000 IP addresses. This approach can likely scale to a few thousand ports or more with reasonable levels of oversubscription (around 4:1). Theoretically, this approach may scale to even larger networks, although in practice scale may be limited by other factors, such as the acceptable size of a single failure domain within the network.
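The conversational MAC learning idea described above for FabricPath can be sketched in a few lines: an edge switch installs a remote MAC address only when one of its local hosts is actually conversing with it. This is a conceptual illustration of the technique, not Cisco's implementation.

    # Conceptual sketch of "conversational" MAC learning: learn a remote source
    # address only when the frame is addressed to a locally attached host,
    # instead of learning every address flooded through the domain.

    class ConversationalMacTable:
        def __init__(self):
            self.local_macs = set()    # hosts attached to this switch
            self.remote_macs = {}      # remote MAC -> switch ID it sits behind

        def attach_local_host(self, mac):
            self.local_macs.add(mac)

        def observe_frame(self, src_mac, dst_mac, ingress_switch):
            """Learn the remote source only if the destination is a local host."""
            if dst_mac in self.local_macs and src_mac not in self.local_macs:
                self.remote_macs[src_mac] = ingress_switch
            # Frames merely flooded or transiting the switch are ignored,
            # keeping the table much smaller than classic flood-and-learn.

    if __name__ == "__main__":
        table = ConversationalMacTable()
        table.attach_local_host("aa:aa:aa:00:00:01")
        # A conversation directed at the local host is learned...
        table.observe_frame("bb:bb:bb:00:00:09", "aa:aa:aa:00:00:01", "switch-7")
        # ...but a flooded frame for some other destination is not.
        table.observe_frame("cc:cc:cc:00:00:03", "dd:dd:dd:00:00:05", "switch-9")
        print(table.remote_macs)   # {'bb:bb:bb:00:00:09': 'switch-7'}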


Scale can also be extended with the addition of a third or fourth tier in the network switch hierarchy if additional latency is not a concern. In June 2012 Cisco announced a collection of networking products and solutions collectively known as Cisco ONE, which includes FabricPath and other features. According to Cisco, this approach is different from the standard definition of software-defined networking, and may include both proprietary and industry standard protocols.

VIRTUAL CLUSTER SWITCHING

Brocade produces switches using a proprietary multipath layer 2 encapsulation protocol called Virtual Cluster Switching (VCS). This approach is based on TRILL (e.g., the data plane forwarding is compliant with TRILL and uses the TRILL frame format [7], and edge ports in VCS support standard LAG and LACP), but it is not necessarily compatible with other TRILL implementations because it does not use an IS-IS core. Brocade’s core uses Fabric Shortest Path First (FSPF), the standard path selection protocol in Fibre Channel storage area networks. VCS also uses a proprietary method to discover compatible neighboring switches and connect to them appropriately. According to Brocade, up to 32,000 MAC addresses are synchronized across a VCS fabric. The currently available switches from Brocade can likely scale to around 600 physical ports in a single fabric, although in practice the upper limit may be reduced (e.g., if some ports are configured for Fibre Channel connectivity or interswitch links).

QFABRIC

Juniper Networks produces switches using a proprietary multipath layer 2/3 encapsulation protocol called QFabric [8]. According to Juniper, QFabric allows multiple distributed physical devices in the network to share a common control plane and a separate common management plane, thereby behaving as if they were a single large switch entity. For this reason, QFabric devices are not referred to as edge or core switches, and the overall approach is called a fabric rather than a network. According to Juniper, QFabric provides equal latency between any two ports on the fabric, up to the current theoretical maximum supported scale of 6144 physical ports with 3:1 oversubscription at the edge of the fabric (the maximum scale fabric requires 128 edge devices and 4 core interconnects). As of March 2012, the largest publicly released independent QFabric testing involves 1536 ports of 10G Ethernet, or approximately 25 percent of QFabric’s theoretical maximum capacity [8]. According to Juniper, QFabric provides up to 96,000 MAC addresses and up to 24,000 IP addresses. In May 2012 Juniper announced the QFX3000-M product line (commonly known as micro-fabric), which enables a version of QFabric optimized for smaller-scale networks. QFabric is managed by the Junos operating system, and requires a separate fabric controller and an out-of-band management network (often provided by traditional Ethernet switches, e.g., the Juniper EX4200).

SOFTWARE-DEFINED NETWORKING AND OPENFLOW: EMERGING NEXT GENERATION PROTOCOLS

Some of the network traffic routing concerns discussed previously may also be addressed by a new industry standard approach known as software-defined networking (SDN), which advocates virtualizing the data center network with a software overlay and network controller that allows attached servers to control features such as packet flows, topology changes, and network management. There have been several different industry standard software overlay proposals, including the IETF proposed standard distributed overlay virtual Ethernet (DOVE) [2]. SDN is used to simplify network control and management, automate network virtualization services, and provide a platform from which to build agile network services. SDN leverages both industry standard network virtualization overlays and the emerging OpenFlow industry standard protocol. The OpenFlow standard moves the network control plane into software running on an attached server or network controller. The flow of network traffic can then be controlled dynamically, without the need to rewire the data center network. Some of the benefits of this approach include better scalability, larger layer 2 domains and virtual devices, and faster network convergence. These technologies can form the basis for networking as a service (NaaS) in modern cloud data centers.

The OpenFlow specification has been under development as an academic initiative for several years [9]. In early 2011, it was formalized under the control of a non-profit industry consortium called the Open Networking Foundation (ONF). The ONF is led by a board of directors consisting of some of the largest network operators in the world, including Google, Facebook, Microsoft, Verizon, NTT, Deutsche Telekom, and Yahoo. It is noteworthy that all of these companies are end users of networking technology, rather than network equipment manufacturers; this indicates a strong market interest in OpenFlow technology. With this motivation, over 70 equipment manufacturers have joined the ONF (including Cisco, Brocade, Juniper Networks, IBM, and others), which released the latest version of its specification, OpenFlow 1.3, in June 2012 [9]. Since SDN and OpenFlow are still evolving, care must be taken to implement only the approved industry standard version of these protocols to maximize interoperability in a multivendor network and fully realize the benefits intended by the ONF.

OpenFlow takes advantage of the fact that most modern Ethernet switches and routers contain flow tables, which run at line rate and are used to implement functions such as quality of service (QoS), security firewalls, and statistical analysis of data streams. OpenFlow standardizes a common set of functions that operate on these flows and will be extended in the future as the standard evolves. An OpenFlow switch consists of three parts: flow tables in the switch, a remote controller on a server, and a secure communication channel between them.
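The flow table abstraction behind this model is easy to sketch. The minimal example below shows a table of match/action entries installed by a controller, with priority ordering and a table-miss action; the field names and matching logic are simplified assumptions and do not follow the OpenFlow 1.3 wire format.

    # Simplified sketch of an OpenFlow-style flow table: match fields, an
    # action (output to a port, drop, or send to the controller), and entries
    # installed by a remote controller. Not the real OpenFlow message format.

    DROP = {"type": "drop"}

    def output(port):
        return {"type": "output", "port": port}

    class FlowTable:
        def __init__(self):
            self.entries = []   # installed by a (remote) controller

        def install(self, match, action, priority=0):
            self.entries.append({"match": match, "action": action,
                                 "priority": priority})
            self.entries.sort(key=lambda e: -e["priority"])

        def lookup(self, packet):
            """Return the action of the highest-priority matching entry."""
            for entry in self.entries:
                if all(packet.get(k) == v for k, v in entry["match"].items()):
                    return entry["action"]
            return {"type": "send_to_controller"}   # table miss

    if __name__ == "__main__":
        table = FlowTable()
        # Controller policy: steer one tenant's VLAN to port 3, drop a noisy host.
        table.install({"vlan": 100}, output(3), priority=10)
        table.install({"src_ip": "10.0.0.66"}, DROP, priority=20)
        print(table.lookup({"vlan": 100, "src_ip": "10.0.0.5"}))   # output to port 3
        print(table.lookup({"src_ip": "10.0.0.66", "vlan": 100}))  # drop
        print(table.lookup({"vlan": 200}))                         # table miss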


The OpenFlow protocol allows entries in the flow table to be defined by an external server. For example, a flow could be a TCP connection, all the packets from a particular MAC or IP address, or all packets with the same VLAN tag. Each flow table entry has a specific action associated with a particular flow, such as forwarding the flow to a given switch port (at line rate), encapsulating and forwarding the flow to a controller for processing, or dropping a flow’s packets (for example, to help prevent denial of service attacks).

There are many potential applications for OpenFlow in modern networks. For example, a network administrator could create on-demand ‘express lanes’ for voice and data traffic. Software could also be used to combine several links into a larger virtual pipe to temporarily handle bursts of traffic; afterward, the channels would automatically separate again. In cloud computing environments, OpenFlow improves scalability, enables multitenancy and resource pooling, and will likely coexist with other layer 2/3 protocols and network overlays for some time.

THE OPEN DATACENTER INTEROPERABLE NETWORK

In May 2012, IBM released a series of five technical briefs known as the Open Datacenter Interoperable Network (ODIN) [10], intended to describe best practices for designing a data center network based on open industry standards. It is noteworthy that these documents do not refer to specific products or service offerings from IBM or other companies. According to IBM, specific implementations of ODIN, which may use products from IBM and other companies, will be described in additional technical documentation (e.g., in June 2012 a solution for storage stretch volume clusters using components from IBM, Brocade, Adva, and Ciena was presented at the Storage Edge conference). The ODIN documents describe the evolution from traditional enterprise data networks into a multivendor environment optimized for cloud computing and other highly virtualized applications. ODIN deals with various industry standards and best practices, including layer 2/3 ECMP spine-leaf designs, TRILL, lossless Ethernet, SDN, OpenFlow, wide area networking, and ultra-low-latency networks. As of this writing, ODIN has been publicly endorsed by eight other companies and one college [10], including Brocade, Juniper, Huawei, NEC, Extreme Networks, BigSwitch, Adva, Ciena, and Marist College.

CONCLUSIONS

In this article, we have presented an overview of layer 2 and 3 protocols that may be used to address the unique requirements of a highly virtualized cloud data center. While existing layer 3 “fat tree” networks provide a proven approach to these issues, there are several industry standards that enhance features of a flattened layer 2 network (TRILL, SPB) or have the potential to enhance future systems (SDN and OpenFlow). Many of these standards are described in the ODIN reference architecture, which has been endorsed by numerous networking equipment companies. Some major networking equipment manufacturers have also developed vendor proprietary protocols to address the same issues (FabricPath, VCS, and QFabric), each with different features for scalability, latency, oversubscription, and management. None of these proposed solutions has yet reached the same level of maturity as STP and MC-LAG, but there are many promising evaluations underway by early adopters. Within the next year or so, compliance testing of network equipment using common use cases for cloud computing should begin to emerge. Unfortunately, many of the recently developed protocols do not interoperate, and this lack of consensus represents a serious concern for the networking industry, coming at a critical inflection point in the development of next-generation cloud-enabled networks. Some designers may claim that radical architectural innovation in the cloud network requires them to outpace industry standards; this approach is only acceptable if the resulting system either opens its architecture to others after a reasonable time period, or agrees to support any developing industry standards as a free, nondisruptive upgrade in addition to any proprietary alternatives. The same approach holds true for pre-standard implementations, which should provide a free upgrade path to the eventual finalized form of the industry standard. Finally, we recognize that the choice of networking equipment involves trade-offs between many factors in addition to those described here, including total cost of ownership, energy consumption, and support for features such as Fibre Channel over Ethernet (FCoE), IPv6, and remote direct memory access (RDMA) over converged Ethernet (RoCE), as well as layer 4–7 functionality. These topics will be examined in more detail as part of ongoing research in this area.

ACKNOWLEDGEMENTS

All company names and product names are registered trademarks of their respective companies. The names Cisco, FabricPath, vPC+, ONE, and Nexus are registered trademarks of Cisco Systems Inc. The names Brocade, Virtual Cluster Switching, and VCS are registered trademarks of Brocade Communications Systems Inc. The names Juniper, QFabric, Junos, QFX3000-M, and EX4200 are registered trademarks of Juniper Networks Inc.

REFERENCES

[1] P. Mell and T. Grance, “The NIST Definition of Cloud Computing,” 7 Oct. 2009, http://csrc.nist.gov/groups/SNS/cloud-computing/, accessed 29-Jan-2011.
[2] K. Barabesh et al., “A Case for Overlays in DCN Virtualization,” Proc. 2011 IEEE DC CAVES Wksp., collocated with ITC 22.
[3] M. Fabbi and D. Curtis, “Debunking the Myth of the Single-Vendor Network,” Gartner Research Note G00208758, Nov. 2010.
[4] A. Wittmann, “IT Pro Ranking: Data Center Networking,” InformationWeek, Sept. 2010.
[5] C. J. Sher-DeCusatis, The Effects of Cloud Computing on Vendor’s Data Center Network Component Design, doctoral thesis, Pace University, White Plains, New York, May 2011.


[6] Cisco white paper C45-605626-00, “Cisco FabricPath At-A-Glance,” http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9402/at_a_glance_c45-605626.pdf, accessed 27-Dec-2011.
[7] Brocade Technical Brief GA-TB-372-01, “Brocade VCS Fabric Technical Architecture,” http://www.brocade.com/downloads/documents/technical_briefs/vcs-technical-architecture-tb.pdf, accessed 27-Dec-2011.
[8] Juniper white paper 2000380-003-EN (Dec. 2011), “Revolutionizing Network Design: Flattening the Data Center Network with the Qfabric Architecture,” http://www.juniper.net/us/en/local/pdf/whitepapers/2000380-en.pdf, accessed 27-Dec-2011; see also independent Qfabric test results, available: http://newsroom.juniper.net/press-releases/juniper-networks-qfabric-sets-new-standard-for-net-nyse-jnpr-0859594, accessed 30 June 2012.
[9] N. McKeown et al., “OpenFlow: Enabling Innovation in Campus Networks,” ACM SIGCOMM Comp. Commun. Rev., vol. 38, no. 2, pp. 69–74, 2008; see also http://www.openflow.org, accessed 30 June 2012.
[10] C. M. DeCusatis, “Towards an Open Datacenter with an Interoperable Network (ODIN),” vols. 1–5, http://www.03.ibm.com/systems/networking/solutions/odin.html, accessed 30 June 2012; all ODIN endorsements available: https://www-304.ibm.com/connections/blogs/DCN, accessed 30 June 2012.

BIOGRAPHIES

CAROLYN J. SHER DECUSATIS ([email protected]) is an adjunct assistant professor in the Computer Engineering Technology Department of New York City College of Technology, Brooklyn. She holds a B.A. from Columbia College and an M.A. from Stony Brook University (formerly State University of New York at Stony Brook) in physics, and a Doctor of Professional Studies in computing from Pace University. She is co-author of the handbook Fiber Optic Essentials.

APARICO CARRANZA ([email protected]) is an associate professor and chair of the Computer Engineering Technology Department of New York City College of Technology, Brooklyn. He earned his Associate’s degree (summa cum laude) in electronics circuits and systems from Technical Career Institutes, New York, New York; his Bachelor’s degree (summa cum laude) and Master’s in electrical engineering from City College/University of New York (CC/CUNY); and his Ph.D. degree in electrical engineering from the Graduate School and University Center, CUNY. He also worked for IBM (RS6000 parallel computers and S/390 mainframe computers) for several years.

CASIMER M. DECUSATIS [F] ([email protected]) is an IBM Distinguished Engineer and CTO for System Networking Strategic Alliances, based in Poughkeepsie, New York. He is an IBM Master Inventor with over 110 patents, editor of the Handbook of Fiber Optic Data Communication, and recipient of several industry awards. He holds an M.S. and Ph.D. from Rensselaer Polytechnic Institute, and a B.S. (magna cum laude) in the Engineering Science Honors Program from Pennsylvania State University. He is a Fellow of the OSA and SPIE, a Distinguished Member of Sigma Xi, and a member of the IBM Academy of Technology. His blog on Data Center Networking is https://www-304.ibm.com/connections/blogs/DCN.
