Integration Path for Intel® Omni-Path Fabric Attached Intel® Enterprise Edition for Lustre (IEEL) LNET
Table of Contents

Introduction
Architecture for LNET
Integration
Proof of Concept routing for multiple fabrics
Ko2iblnd settings
Client Mounting
Production LNET routing for CSD3
Performance Tuning
Performance Tests and Results
Summary
Glossary
Appendix A
LNET Networks and corresponding network types
LNET Routing Configuration
Lustre Server Configuration
Lustre Client Configuration (EDR)
Lustre Client Configuration (OPA)

Introduction

As High Performance Computing centres grow, data centre infrastructure becomes more complex as new and older services are integrated. This increases the number of server, storage and network technologies that must be connected together, making it critical to the successful operation of services that they work together seamlessly. A centre's growth is also reflected in the extension of its service portfolio, which puts pressure on the provision of flexible and scalable platforms, especially storage. Storage requirements grow with each new service an HPC centre adopts, often doubling the existing capacity. Across the services provided it is often desirable to have a common storage infrastructure that can be accessed from each of them. Allowing users to move data effectively across different systems can be challenging, and creates a risk of duplicating data and wasting storage space, as well as placing undue stress on network resources.

In the case of the University of Cambridge Research Computing Service (RCS), a new set of supercomputing resources has recently been procured and installed for the growing needs of science, both at the University and nationally within the UK. The Cambridge Service for Data Driven Discovery (CSD3) provides three new supercomputing resources alongside the existing Intel and dedicated GPU supercomputers. The RCS has made use of Lustre parallel file systems for most of its main resources, and they have been the backbone for providing high performance scalable storage across all research computing platforms. Lustre filesystems support high performance networks such as Ethernet, InfiniBand and Intel® Omni-Path Fabric. The older HPC service has five Lustre filesystems providing 5PB of storage, and CSD3 introduces an additional five Lustre filesystems, giving the service another 5PB of storage space. The new storage platform has been designed and deployed with the intention of allowing both old and new systems to mount the new filesystems, so that users can migrate and consume data as they switch between CSD3 and the existing resources.

In order to take advantage of platform-specific features at the time of acquisition, the CSD3 GPU system (Wilkes-2 in Figure 1 below) uses Mellanox EDR InfiniBand, while the new Intel® Xeon Phi™ and Intel® Xeon® Gold 6142 CPU resources use the Intel® Omni-Path Fabric. The goal of building a common Lustre storage system that can be accessed over HPC fabrics of different generations and technologies can be achieved through the use of LNET routing. LNET routing allows the RCS to expand beyond the confines of the existing FDR InfiniBand fabric by translating between fabrics. Services on Intel® Omni-Path Fabric, EDR/FDR InfiniBand and Ethernet can now consume existing and new Lustre storage.
For example, a user on CSD3 can now write files to Lustre and launch a visualisation instance in the RCS OpenStack cloud that accesses the same Lustre storage concurrently, without the user being aware of the underlying infrastructure and data placement. LNET routing is not only useful for joining dispersed supercomputing resources; LNET routers can be deployed in the same way as conventional Ethernet routers, with multiple routers allowing Lustre traffic to traverse several hops of a complicated networking infrastructure and providing fine-grained routing as scientific computing progresses beyond Petascale systems.

Architecture for LNET

Figure 1 High Level Diagram of the University of Cambridge Research Computing Services estate integrating LNET routers

CSD3 incorporates two distinct processor technologies, Intel® Xeon® Gold 6142, internally referred to as Skylake, and Intel® Xeon Phi™ x200, together with NVIDIA P100 GPGPUs in the system known as Wilkes-2, all underpinned by multiple Lustre filesystems attached to Intel® Omni-Path. While the Intel systems use Intel® Omni-Path directly, Wilkes-2 uses EDR InfiniBand as its fabric, and this presents an integration challenge. Figure 1 shows a high-level view of the current RCS estate. The LNET routers shown in the centre of the diagram provide a translation layer between the different types of network fabric, allowing Lustre access across all systems within the RCS where convergence on one type of interconnect is not possible.

Figure 2 shows the LNET routers that connect the storage and servers on the Intel® Omni-Path Fabric to Wilkes-2, providing it with access to the common storage. An LNET router does not mount the Lustre file system but merely forwards traffic from one network to the other. In this example two LNET routers load balance traffic and act as an active-active failover pair for Lustre traffic. Additional nodes can be added throughout the network topology to balance network routes across systems; details on load balancing can be found in the Intel user guide for LNET [1]. Current production services, such as the Darwin CPU cluster and the existing Wilkes GPU cluster, connect to the LNET routers over FDR InfiniBand. This flexibility allows users to migrate their data over a high speed interconnect as they transition to the new service.

Figure 2 LNET Detail showing a pair of routers between Peta-4 and Wilkes-2

Integration

Before progressing with the deployment of a production LNET service, an initial experimental routing set-up was completed. This concept demonstrator was then used to guide the construction of the production LNET routing within CSD3.

Proof of Concept routing for multiple fabrics

When integrating LNET, it is best to map out an LNET network for each of the fabric or TCP networks that will connect to the router. Each fabric must have its own o2ib network in order to distinguish between the fabrics. Table 1 shows an example from the concept demonstrator system:

Fabric Type                LNET Network tag    Router IP        IP Subnet
Intel® Omni-Path Fabric    o2ib0               192.168.0.254    192.168.0.0/24
InfiniBand                 o2ib1               10.144.60.230    10.144.0.0/16
Ethernet                   tcp0                172.10.2.254     172.10.2.0/24

Table 1 Example LNET layout

All Lustre clients must define the list of LNET network tags and the address of the router reachable on their respective fabric.
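From Lustre 2.10 onwards the same layout can also be applied at run time with the lnetctl utility instead of static module options. The following is a minimal sketch for an Omni-Path client, re-using the addresses from Table 1 and assuming the IPoIB interface is named ib0 (interface names are illustrative and will differ per site); the static module configuration actually used in the proof of concept is shown next.

# Load LNET and enable dynamic configuration
modprobe lnet
lnetctl lnet configure

# Declare the local network: o2ib0 on the Omni-Path interface ib0
lnetctl net add --net o2ib0 --if ib0

# Reach the Ethernet network tcp0 via the router at 192.168.0.254 on o2ib0
lnetctl route add --net tcp0 --gateway 192.168.0.254@o2ib0

# Confirm the resulting networks and routes
lnetctl net show
lnetctl route show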
A compute node on o2ib0 would have the following router definition within its /etc/modprobe.d/lustre.conf:

options lnet networks="o2ib0(ib0)" routes="tcp0 192.168.0.254@o2ib0" \
    live_router_check_interval=60 dead_router_check_interval=60 router_ping_timeout=60

Figure 3 Example compute node router configuration

Lustre servers (MDS/MGS and OSS nodes) should define similar configurations in reverse: they must know about all of the LNET fabrics from which clients will mount Lustre. The test system used a set of Lustre storage servers built within an OpenStack tenant to allow for quick development. Again the server's /etc/modprobe.d/lustre.conf is shown:

options lnet networks="tcp0(em2)" routes="o2ib0 172.10.2.254@tcp0; \
    o2ib1 172.10.2.254@tcp0" live_router_check_interval=60 \
    dead_router_check_interval=60 router_ping_timeout=60

Figure 4 Example Lustre server route configuration

Each node can define multiple routers for a network by using a bracketed list of IP addresses within the module configuration:

routes="o2ib0 172.10.2.254@tcp0; o2ib1 172.10.2.[251,252,253,254]@tcp0"

Figure 5 LNET route expressing multiple routers

This tells the LNET server or client that, in order to reach the fabric o2ib1, any of the routers 172.10.2.251 through 172.10.2.254 on tcp0 can be used. Further settings tell the nodes how to treat a router in the event that Lustre RPCs cannot be successfully routed. When implementing LNET routing it is important to think in terms of Lustre traffic rather than standard ICMP packets: while a network port might be 'up' in the traditional sense, if an lctl ping fails, or if there is no endpoint, each LNET router will mark the route as 'down'. The status of the available routing paths can be viewed using lctl route_list, as shown below:

net o2ib0 hops 4294967295 gw 172.10.2.254@tcp up pri 0
net o2ib0 hops 4294967295 gw 172.10.2.254@tcp up pri 0
[root@lnet-mds-0 ~]#

Figure 6 Output of lctl route_list showing the status of available routing paths for LNET traffic

Router nodes receive the following configuration, which sets the node's LNET to forward traffic between fabrics:

options lnet networks="o2ib0(ib0),o2ib1(ib1),tcp0(em2)" forwarding=enabled

Figure 7 LNET router node configuration

The router-check options shown after each network definition are presented as sensible defaults. They ensure that, should a router go down, clients and servers can mitigate the issue while the system administrator remediates the situation.

Ko2iblnd settings

The ko2iblnd module should have the same settings on all participating LNET nodes. Due to compatibility issues between the Mellanox and Intel® Omni-Path Fabric drivers, users may need to increase the value of the map_on_demand option to 256, depending on the version of Lustre used. From Lustre 2.10 this can be varied with dynamic LNET configuration.

options ko2iblnd-opa peer_credits=128 peer_credits_hiw=64 credits=1024 \
    concurrent_sends=256 ntx=2048 map_on_demand=256 fmr_pool_size=2048 \
    fmr_flush_trigger=512 fmr_cache=1

Figure 8 ko2iblnd settings for LNET estates that mix Intel® Omni-Path Fabric and Mellanox InfiniBand

Client Mounting

Clients on the respective fabrics mount using the address of the MGS (the Lustre Management Server) as normal.
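For example, a client on o2ib0 would mount a routed filesystem served from tcp0 with the standard mount syntax; only the route configured in lustre.conf is needed for the MGS NID on the far fabric to be reachable. The MGS address and filesystem name below are placeholders for illustration, not the CSD3 production values.

# Mount a Lustre filesystem named lustre1 whose MGS NID (172.10.2.10@tcp0, hypothetical)
# sits on the far side of the LNET router
mount -t lustre 172.10.2.10@tcp0:/lustre1 /mnt/lustre1

# Check that the filesystem is mounted and report usage per target
lfs df -h /mnt/lustre1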