CERN Data Centre Network Architecture: proposed evolution and implementation plan

Authors: Edoardo Martelli (CERN, IT-CS-PO), Tony Cass (CERN, IT-CS), with input from the members of the DCNA Working Group and the CERN IT-CS-CE section.

Last updated: 26th of June 2018

Abstract

In 2017 the IT-CS group decided to take advantage of the planned hardware upgrade of the data centre routers to deliver more advanced network features to the CERN Data Centre user community. In order to involve all relevant IT groups in the selection and prioritisation of the desired features, a Data Centre Network Architecture (DCNA) working group was formed. This working group met several times throughout 2017.

This document summarises the conclusions of the Working Group and details the features that will be implemented, giving an indication of the technologies that can be used. An implementation plan, with proposed deadlines dictated by existing constraints, is also included.

Table of Contents

1 Terminology and Acronyms
2 2017 Data Centre Network
  2.1 Domains
  2.2 Features
  2.3 Limitations
  2.4 Network diagram
3 Designing a new Data Centre Network
  3.1 Design objectives
  3.2 Design constraints
  3.3 User requirements
4 The New Data Centre Network Architecture
  4.1 Addressing the objectives, constraints and requirements
    4.1.1 Security
    4.1.2 Agile domain membership
    4.1.3 Router redundancy
    4.1.4 ToR switch redundancy
    4.1.5 Virtual machine mobility
    4.1.6 Second NIC for storage servers
    4.1.7 Jumbo frames
    4.1.8 Faster DNS and DHCP updates
    4.1.9 OpenStack information in LANDB
5 Implementation plan
  5.1 Dependencies
  5.2 Constraints
  5.3 Proposed schedule
6 References

1 Terminology and Acronyms

BF = Blocking Factor (uplink bandwidth over-subscription: aggregate downlink bandwidth divided by aggregate uplink bandwidth)
CT = Container
HV = Hypervisor
HC = Host for Containers
LANDB = IT-CS network database
MLAG = Multi-Chassis Link Aggregation Group
NIC = Network Interface Card
NAT = Network Address Translation
SS = Storage Server
TN = Technical Network: LHC accelerator control and management network
ToR = Top of Rack (switch)
VRRP = Virtual Router Redundancy Protocol
VM = Virtual Machine

2 2017 Data Centre Network

This chapter briefly sets out the architecture of the data centre networks in Building 513 (Geneva) and Building 9918 (Budapest), describing the features available at the time the DCNA working group was established in 2017.

2.1 Domains

We talk of data centre networks since distinct network domains coexist to address four distinct classes of network requirements.

• The LCG network, with direct access to the LHCOPN and LHCONE network connections to Tier1 and Tier2 sites and extensive support for high-bandwidth connections, is provided to support physics services.
• Non-physics services are connected to the ITS network, which includes a zone (in the “Barn”) configured for router redundancy with connections to the diesel-backed power supply.
• A Technical Network (TN) presence (in B513, but not B9918) provides connectivity to this network for relevant IT-managed servers. Access to this network is restricted to authorised servers by a “gate”.
• A low bandwidth, non-redundant MGMT network is provided to support connections to dedicated server management interfaces (mostly IPMI interfaces). As for the TN, access to this network is gate protected.

Whilst the TN and MGMT networks are physically distinct infrastructures, the LCG and ITS domains are implemented by using virtual routing (VRF) over a common infrastructure.

These network domains are shown schematically in this picture:

2.2 Features

The key features of the data centre network architecture in 2017 were:

• line rate performance (subject to the agreed Blocking Factor); no NAT, no encapsulation,
• dual stack IPv4 and IPv6,
• switch redundancy for critical services, implemented with the Switch Stacking feature,
• router redundancy for critical services, implemented with VRRP,
• firewall protection with LANDB-driven automatic updates every 15 minutes,
• LANDB-driven automatic updates of the DNS and DHCP services (every 10 and 5 minutes respectively) to support dynamic virtual machine creation.

Although VLAN extensions for live VM migration were also supported, this was only on an ad-hoc and temporary basis.

2.3 Limitations

Key drawbacks of this data centre network architecture, as perceived by clients, are as follows, in roughly decreasing order of importance.

• The separation between the LCG and ITS domains is inflexible and of low granularity. Domain membership is decided at the switch level, so it is not possible to move a single machine between domains, and a network renumbering is required when domain assignment changes. In practice, machines are therefore rarely moved between domains, leading to many machines supporting general IT services being directly exposed to LHCOPN and LHCONE, which is not ideal from a security standpoint.
• The blocking factors are mostly unknown to users and, linked to the point above, may be different if logically comparable servers are in different network domains.
• There is no integration of network domains with AI availability zones, so machines that are supposed to be in different availability zones may in fact be behind a single router.
• It is not possible for the OpenStack virtual machine orchestrator to ensure that dynamic network changes are correctly recorded in LANDB. For example, the real hosting hypervisor for a particular virtual machine may not be the one declared in LANDB.
• Lack of full support for IP mobility restricts the possibility to deliver load-balanced or high availability solutions.
• The latency for DNS and DHCP updates (up to 10 and 5 minutes respectively) does not match the speed at which virtual machines and containers can be provisioned.

A further drawback, from the network architecture point of view, is that since VRRP is used to provide router redundancy, the backup links are idle.

2.4 Network diagram

This diagram shows the most important interconnections of the data centre routers:

3 Designing a new Data Centre Network

Evidently, the main aim for the redesign of the data centre network architecture was to address the drawbacks set out just above, i.e. to increase flexibility, to improve support for virtual-machine and container based services and to improve the integration between network management and virtual machine orchestration. Certain general objectives, constraints and requirements also had to be taken into account and these are set out in the sections that follow.

3.1 Design objectives

The new data centre architecture has been designed to be affordable, cost-effective and, in decreasing order of importance:

• Performant: guaranteeing NIC speeds, up to the agreed blocking factor,
• Flexible: allowing easy implementation of new services and software features,
• Reliable and Resilient: the data centre network must be reliable and resilient for any service. In particular, single points of failure for critical services are forbidden, although this might require such services to be in distinct zones (e.g. the Barn),
• Manageable and Operable: with extensive use of automation and standard configurations,
• Scalable: allowing easy growth in bandwidth and in the number of connections.

3.2 Design constraints

The design of the new data centre architecture also reflects three constraints imposed by the Officer.

• No sharing of network devices between the TN and other domains is permitted,
• VMs and containers from different logical domains may not be hosted by the same server, and
• Unnecessary access to the LHCOPN/ONE bypass is forbidden.

3.3 User requirements

In addition to the general requirement for a “more flexible and more modern” network, specific user requirements collected during the DCNA meetings included the following, again roughly in decreasing order of priority.

• Full router redundancy for all but the management network connections.
• Switch redundancy for IT-DB, Skype, Licence and Load Balancing servers.
• Support for multiple network connections for storage servers, database servers and hypervisors.
• Better support for VMs and Containers, such as faster DNS and DHCP updates and the integration of OpenStack information in LANDB.
• Support for jumbo frames for server-to-server communications (notably for EOS and database servers).
• Support for floating IP addresses, VM migration and load balancers.

4 The New Data Centre Network Architecture

To meet the design objectives and deliver the requested features, we propose to migrate from today’s architecture to an IP fabric, i.e. an infrastructure where all the components are reachable in a routed, redundant and load-balanced network. All the services for the data centre users are built on top of the fabric, either as integrated components or in overlay networks. The main advantage of such an architecture is the increased resiliency in case of failure of a network link or device.
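As a rough illustration of this resiliency property (not a description of the actual CERN topology; the device names and counts below are invented), the following Python sketch models a minimal leaf-spine fabric in which every ToR/leaf has an equal-cost uplink to every spine, and verifies that no single spine failure isolates a leaf:

    # Minimal sketch of a leaf-spine IP fabric (hypothetical device names and sizes).
    # Each leaf (ToR) is connected to every spine; traffic is load-balanced over all
    # equal-cost uplinks, and the loss of a single spine leaves every leaf reachable.

    from itertools import product

    spines = ["spine-1", "spine-2"]
    leaves = [f"leaf-{i}" for i in range(1, 5)]

    # Full mesh between leaves and spines: one link per (leaf, spine) pair.
    links = set(product(leaves, spines))

    def reachable_leaves(failed_spine=None):
        """Return the leaves that still have at least one working uplink."""
        working = {spine for spine in spines if spine != failed_spine}
        return {leaf for leaf, spine in links if spine in working}

    # With no failure, every leaf has len(spines) equal-cost paths towards the core.
    assert reachable_leaves() == set(leaves)

    # Any single spine can fail (or be drained for maintenance) without isolating a leaf.
    for spine in spines:
        assert reachable_leaves(failed_spine=spine) == set(leaves)

    print("single-spine failure tolerated; equal-cost uplinks per leaf:", len(spines))

In a real fabric the same property is what allows a Spine router to be drained for maintenance while traffic continues to be load-balanced over the remaining equal-cost paths.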

The interconnections and components of a typical IP fabric architecture are shown schematically in the following picture:

4.1 Addressing the objectives, constraints and requirements

This new network design addresses the objectives, constraints and requirements set out in section 3. We revisit below the specific points set out in that section, explaining how they are addressed by the new architecture.

4.1.1 Security

As today, four distinct network domains will be supported: LCG and ITS for physics and general services respectively, plus the TN and MGMT networks. A small number of additional domains could be added, at some expense in terms of hardware and software resources, but no need was identified during the DCNA discussions. As today, the TN domain will be provided with physically distinct hardware whilst a common hardware infrastructure will be used to support the LCG and ITS network domains. The MGMT domain may be integrated with the LCG and ITS infrastructure if economically favourable.

Unlike today, domain membership for at least some servers will effectively be decided at the switch port level, as explained in the next section, so individual machines can be moved between the LCG and ITS domains as capacity demands fluctuate. The new network architecture thus better supports the security requirement to avoid unnecessary exposure to the LHCOPN/ONE firewall bypass (although it is the responsibility of service managers to ensure that this requirement is in fact fulfilled), and also the requirement that there is no mix of VMs and containers for different logical domains on a single server.

4.1.2 Agile domain membership

In the immediate future, servers will be divided into three categories:

1. those with a single network connection connected to the ITS domain,
2. those with two network connections to the ITS domain, and
3. those with one network connection to the ITS domain and one to the LCG domain.

The principal reason for the 2nd and 3rd categories of servers is the desire to separate out “control” network connections (to the hypervisor, for example, or for EOS management traffic) from service network traffic (to the VMs or containers, for example). For evident reasons the control connection will always be in the ITS domain.

It will thus be possible for servers in the 3rd category to host VMs or containers for either physics or general IT services, but with the limitation that service and control traffic will flow over the same network interface in the latter case.

In the longer term, with the schedule dependent on the effort available, the network automation system (CFMGR) will be adapted to allow the possibility of adding tagged VLANs to switch ports, i.e. effectively allowing any switch port to be a member of either the LCG or the ITS domain if the interface is (re)numbered appropriately.
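As an indication only of what this adaptation could produce (the interface name, VLAN names and VLAN IDs below are purely illustrative and do not reflect the real CFMGR data model or LANDB content), a Python sketch generating Junos-style “set” statements for a tagged switch port might look as follows:

    # Illustrative only: generate Junos-style "set" statements for a switch port that
    # carries tagged VLANs for both the ITS and LCG domains. The interface name and
    # VLAN names/IDs are hypothetical, not taken from LANDB or the real cfmgr data model.

    def trunk_port_config(interface, vlans):
        """Return configuration lines that make `interface` a tagged member of `vlans`."""
        lines = []
        for name, vlan_id in vlans.items():
            lines.append(f"set vlans {name} vlan-id {vlan_id}")
        lines.append(f"set interfaces {interface} unit 0 family ethernet-switching "
                     f"interface-mode trunk")
        for name in vlans:
            lines.append(f"set interfaces {interface} unit 0 family ethernet-switching "
                         f"vlan members {name}")
        return lines

    if __name__ == "__main__":
        for line in trunk_port_config("xe-0/0/12", {"ITS-SERVICE": 210, "LCG-SERVICE": 310}):
            print(line)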

4.1.3 Router redundancy

All new ToR switches (except those for the management domain) will be connected to two Spine routers and configured either as Leaf routers or connected through multi-chassis aggregated links (MLAG). The double connection to two different routers is necessary to ensure that switches are not isolated in case of failure or disruptive maintenance of a router.

There are some important differences between the Leaf Router and the MLAG solutions. The former uses standard protocols, so avoids any vendor lock-in, but is less easy to operate in case of failure, because a ToR switch cannot work as a Leaf Router without the proper configuration. The MLAG solution, on the other hand, requires proprietary protocols on the Spine routers (vendor lock-in), but is easier to operate, because no special configuration is required for the ToR switches.

Connecting ToR switches to two Spine routers clearly requires multiple uplinks but, as most switches anyway need multiple uplinks to deliver the agreed blocking factor, there is minimal additional cost. Where multiple uplinks are needed solely to deliver router redundancy, the improved redundancy for servers and the greater freedom to perform router maintenance without close coordination with, or impact on, services amply justify the cost of the second uplink.
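A worked example with hypothetical port counts: a ToR switch with 48 x 10G server ports and a single 40G uplink has a blocking factor of 12, while the same switch with two 40G uplinks, one per Spine router, comes down to 6, so the second uplink is often needed anyway to meet the agreed blocking factor and the redundancy comes at little extra cost.

    # Blocking factor = aggregate downlink bandwidth / aggregate uplink bandwidth.
    # The port counts and speeds below are hypothetical, chosen only to illustrate
    # why a ToR switch often needs two uplinks anyway to meet the agreed BF.

    def blocking_factor(downlinks, downlink_gbps, uplinks, uplink_gbps):
        return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)

    # 48 x 10G server ports, single 40G uplink: BF = 480 / 40 = 12
    print(blocking_factor(48, 10, 1, 40))   # 12.0

    # Same switch with two 40G uplinks (one per Spine router): BF = 480 / 80 = 6
    print(blocking_factor(48, 10, 2, 40))   # 6.0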

The disadvantage of the move to this configuration is that the repair time for a failing switch will be longer, especially if it is configured as a Leaf Router. The move to this configuration thus conflicts with the manageability/operability objective, but the impact can be mitigated by using the MLAG configuration where possible.

Once the Super-Spine and Spine routers are installed, router redundancy can be provided on request for existing services, but it must be realised that this implies downtime and cost for the replacement or reconfiguration of the existing ToR switch.

4.1.4 ToR switch redundancy

Servers with dual NICs and high reliability needs can be connected to two different switches running in a Stacking or Virtual Chassis configuration (i.e. as a single managed virtual entity).

ToR Switch redundancy will be implemented only for critical services and only when requested by the service manager.

4.1.5 Virtual machine mobility

Enabling a VM to move to a different HV without changing IP address requires extending the VM’s broadcast domain (“IP service” in IT-CS terminology) to include the new HV, which is possibly in another broadcast domain. Two options for implementing this were considered:

• defining network tunnels on the HVs/HCs and routing these at the spine/leaf router level; and
• creating Virtual Extensible LAN (VXLAN) tunnels between ToR ports.

The first option is preferred as it is the simplest; its feasibility will be tested by IT-CS and IT-CM.

Both permanent extensions/tunnels (for Load-balancers and High Availability services) and temporary extensions (to enable the migration of VMs out of retiring HVs and HCs) will be supported.
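As a sketch of what the first option could look like on a hypervisor (the VNI, addresses, device and bridge names are hypothetical, and the actual mechanism remains subject to the IT-CS/IT-CM feasibility tests), the standard Linux iproute2 commands for a point-to-point VXLAN tunnel carried over the routed fabric would be:

    # Sketch only: Linux iproute2 commands for a point-to-point VXLAN tunnel between
    # two hypervisors, carried over the routed IP fabric. VNI, device names and IP
    # addresses are hypothetical; the real mechanism is still to be validated by the
    # IT-CS / IT-CM feasibility tests.

    def vxlan_tunnel_cmds(vni, local_ip, remote_ip, underlay_dev, bridge):
        vxlan_dev = f"vxlan{vni}"
        return [
            # Create the VXLAN interface bound to the hypervisor's routed address.
            f"ip link add {vxlan_dev} type vxlan id {vni} "
            f"local {local_ip} remote {remote_ip} dstport 4789 dev {underlay_dev}",
            # Attach it to the bridge where the VM's tap interface lives,
            # so the VM keeps its IP service (broadcast domain) after migration.
            f"ip link set {vxlan_dev} master {bridge}",
            f"ip link set {vxlan_dev} up",
        ]

    if __name__ == "__main__":
        for cmd in vxlan_tunnel_cmds(10123, "10.0.1.11", "10.0.2.22", "eth0", "br-service"):
            print(cmd)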

4.1.6 Second NIC for storage servers

Support for multiple NICs is a simple matter of ensuring that enough switch ports are provided. As has been mentioned above, multiple NICs are foreseen for hypervisors to separate out control traffic, and can similarly be considered for storage servers. Higher speed NICs (25G or 40G) for data transfers can also be supported.

4.1.7 Jumbo frames

Jumbo frames can improve the performance of large data transfers, especially when source and destination are distant. Unfortunately, jumbo frames are not widely deployed and there are some risks when machines with different frame sizes communicate.

Jumbo frames can be supported on services connecting storage servers and DB RACs. Special care must be taken to ensure that all servers connected to a Jumbo-enabled service are configured with the same frame size. This is necessary because the protocols that discover the frame size on the path may fail to work if they are blocked by strict firewall settings.
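A minimal sketch of such a consistency check, assuming the MTU values have already been collected from a configuration inventory (host names and values below are invented):

    # Sketch of a consistency check for a jumbo-enabled IP service: every connected
    # server must be configured with the same MTU, otherwise path-MTU discovery
    # (ICMP "fragmentation needed" / "packet too big") is relied upon and must not
    # be blocked by strict firewall settings. Host names and MTU values are hypothetical.

    EXPECTED_MTU = 9000  # typical jumbo frame MTU; the agreed value is an assumption here

    def check_service_mtu(reported, expected=EXPECTED_MTU):
        """Return the hosts whose configured MTU differs from the agreed value."""
        return [host for host, mtu in reported.items() if mtu != expected]

    if __name__ == "__main__":
        # Values as they might be collected from a configuration inventory (hypothetical).
        inventory = {"eosserver01": 9000, "eosserver02": 9000, "dbrac01": 1500}
        mismatched = check_service_mtu(inventory)
        if mismatched:
            print("MTU mismatch, do not enable jumbo frames yet:", mismatched)
        else:
            print("all servers agree on MTU", EXPECTED_MTU)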

4.1.8 Faster DNS and DHCP updates

There is no agreement on any mechanism to improve the frequency of DNS and DHCP updates. Two implementations are possible:

1. increasing the frequency of updates for the entire cern.ch domain, and
2. creating a dedicated subdomain for VMs with a high DNS/DHCP update frequency.

IT-CS considers the first option to be too risky and the second creates problems for the transfer of Kerberos credentials between cern.ch and the subdomain.

As IT-CM have implemented local DHCP servers in their hypervisors, the current frequencies for DHCP updates have only a limited impact on the speed at which VMs can be created. As such, there is no urgency to implement either of the options mentioned above.

4.1.9 OpenStack information in LANDB

Discussions at the DCNA meetings confirmed that LANDB should remain the authoritative database for network information. Software interfaces will be introduced to allow OpenStack to inject relevant information into LANDB and to keep it in sync. They will be implemented together with the deployment of the LANDB changes required by the advanced features related to OpenStack.
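For illustration only, an OpenStack-side agent could inject such information through a SOAP call; the WSDL URL and operation name below are placeholders for the interface still to be defined, not the existing LANDB SOAP API:

    # Sketch only: how an OpenStack-side agent might record the hosting hypervisor of a
    # VM in LANDB through a SOAP interface. The WSDL URL, authentication token and
    # operation name used here are placeholders for whatever interface is finally agreed.

    from zeep import Client

    WSDL_URL = "https://network.example.cern.ch/landb/soap?wsdl"  # placeholder URL

    def record_vm_placement(vm_name, hypervisor, token):
        """Declare in LANDB which hypervisor currently hosts `vm_name` (hypothetical call)."""
        client = Client(WSDL_URL)
        # Hypothetical operation: the real interface will be defined together with the
        # LANDB changes described in this section.
        client.service.vmSetHostingHypervisor(token, vm_name, hypervisor)

    # Example usage (names are illustrative):
    # record_vm_placement("vm-lbaas-42", "hv-p06636642", token="...")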

5 Implementation plan

5.1 Dependencies

Delivery of the new data centre network architecture and the associated new features depends not only on the installation of the Juniper hardware, but also on a set of software components:

• additional database records are needed in LANDB for the automatic configuration of the switches and of the advanced features in the routers, as well as the related CSDB, SOAP and LANDB Portal features to manipulate them;

• Cfmgr modules are needed for the automatic configuration of Juniper devices. We will take this opportunity to modernise cfmgr, migrating to a more scalable architecture, exploiting open source libraries and implementing the new modules in a modern programming language (Python rather than Perl); a minimal sketch of such a module is given after this list; and

• implementation of the Agile domain membership and VM mobility features will require increased integration between OpenStack and LANDB.
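As an indication of what such a cfmgr module could look like (the device name, user and configuration statements are hypothetical, and no cfmgr or LANDB integration is shown), a minimal sketch based on Juniper's open source junos-eznc (PyEZ) library follows:

    # Indicative sketch of a cfmgr-style module pushing configuration to a Juniper
    # device with the junos-eznc (PyEZ) open source library. Device name, credentials
    # and the configuration lines themselves are hypothetical.

    from jnpr.junos import Device
    from jnpr.junos.utils.config import Config

    def push_config(host, user, set_commands):
        """Load a list of Junos 'set' statements on `host` and commit them."""
        with Device(host=host, user=user) as dev:          # NETCONF session (ssh key auth)
            with Config(dev, mode="exclusive") as cu:
                cu.load("\n".join(set_commands), format="set")
                cu.pdiff()                                  # show the candidate diff
                cu.commit(comment="pushed by cfmgr sketch")

    if __name__ == "__main__":
        push_config(
            host="tor-0513-r-0001.example.cern.ch",         # hypothetical ToR switch
            user="cfmgr",
            set_commands=["set interfaces xe-0/0/12 description 'storage server, 2nd NIC'"],
        )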

5.2 Constraints

The next major server delivery for the Geneva data centre is foreseen for October 2018. Sufficient Juniper routers and switches must have been installed by then to provide the necessary connectivity, although not all the new features have to be implemented by that time.

Brocade MLXE disinvestment: due to the very high cost of 40G and 100G interfaces on the Brocade MLXE platform, any increase of bandwidth must be delivered with Juniper equipment.

The LHC experiments have started testing higher capacity DAQ connections to the data centre in B513. These may already be needed during the last months of Run 2. The current capacity of the existing Brocade MLXE infrastructure may not be able to cope with a major increase of data rates.

WLCG Tier1s are upgrading their LHCOPN links to CERN to higher capacity for the rest of Run 2. Again, the data centre network must be able to address the demand for increased bandwidth.

The new architecture must be production ready in time for LHC Run 3, which is foreseen to start in 2021. Meeting this deadline requires implementation and extensive testing during LS2 (Long Shutdown 2) in the years 2019-2020.

5.3 Proposed schedule

The major tasks and milestones for the implementation of the proposed architecture are as follows.

Basic support for Juniper routers must be implemented in cfmgr by summer 2018 to allow the deployment of routers with static configurations. Full support, i.e. for features such as dynamic ACLs, user services and routing protocols, must be implemented in time for the major server delivery foreseen in October 2018.

An initial IP fabric, consisting of a pair of Super-Spine routers with associated Spine routers and connected ToR/Leaf devices, must be ready for the server delivery of October 2018. An IP fabric made of Juniper devices will be built beside the existing Brocade MLXE based data centre. Legacy racks of servers may be migrated from the Brocade to the Juniper infrastructure to accelerate the phase-out of the Brocade routers.

Brocade core router replacement must start as soon as possible in order that needs for greater capacity can be met. Replacement of all Brocade data centre routers must be completed before the start of Run 3 to avoid disruptions in the data centre during data taking.

Cfmgr and LANDB support for advanced features such as Agile Domain Membership and VM Mobility will be implemented only once the previous tasks are completed. This is expected to be during 2019, allowing time for refinements during 2020 to ensure that we are production ready for Run 3.

The plan is summarized in the chart below:

[Planning chart: quarters Q1 2018 to Q2 2021, covering Run 2 (2018), LS2 (2019-2020) and the start of Run 3 (2021). Task bars: Cfmgr - basic; Cfmgr - advanced; IP Fabric - minimum; IP Fabric - full; Brocade MLXE replacement; DC advanced features.]

6 References

DCNA Meetings agenda and minutes: https://indico.cern.ch/category/9080/

Presentation at ASDF meeting: https://indico.cern.ch/event/671141/contributions/2745584/attachments/1550870/2436372/DCNA-WG-20171102-for-ASFD.pdf

NCM meetings: https://indico.cern.ch/category/1200/