GridPP experiences of monitoring the WLCG network infrastructure with perfSONAR
Duncan Rand, Imperial College London / Jisc — DI4R 2016

Outline
• Large Hadron Collider (LHC)
• Worldwide LHC Computing Grid (WLCG)
• WLCG tiered networking model
• perfSONAR toolkit
• Monitoring with perfSONAR
• IPv4 and IPv6

Large Hadron Collider
• The LHC is located at CERN on the Franco-Swiss border
• Proton-proton and heavy-ion collider with four main experiments
– two general purpose: ATLAS and CMS
– two specialist: LHCb and ALICE (heavy ions)
• During Run 1, at 8 TeV: the Higgs particle was found in 2012
• Started Run 2 in 2015 at 13 TeV
– Tentative signs of a new particle at 750 GeV, but it was not to be
• Computing for the LHC experiments is carried out by the Worldwide LHC Computing Grid (WLCG, or 'the Grid')

Worldwide LHC Computing Grid
• The Worldwide LHC Computing Grid (WLCG) is a global collaboration of more than 170 computing centres in 42 countries
• Its mission is to provide the global computing resources to store, distribute and analyse the ~30 petabytes of data generated per year by the LHC experiments
• GridPP is a collaboration providing data-intensive distributed computing resources for the UK HEP community and the UK contribution to the WLCG
• Hierarchically arranged in four tiers:
– Tier-0 at CERN (and Wigner in Hungary)
– 13 Tier-1s (mainly national laboratories)
– 149 Tier-2s (generally university physics laboratories)
– Tier-3s

WLCG sites
[Map of WLCG sites: Tier-0, Tier-1 and Tier-2]

WLCG tier networking model
• Initial modelling of LHC computing requirements suggested a hierarchical, tier-based data management and transfer model
• Data was exported from the Tier-0 at CERN to each Tier-1 and then on to the Tier-2s
• However, better-than-expected network bandwidth means that the LHC experiments have been able to relax this hierarchy
– Data is now transferred in an all-to-all mesh configuration
• Data is often transferred across multiple domains
– e.g. a CMS transfer to Imperial College London might come predominantly from Fermilab, near Chicago, along with other CMS sites

Remote access reading
• New modes of data transfer from federated data storage use the HEP-specific protocol xrootd
– CMS experiment: AAA (Any data, Any time, Anywhere)
– ATLAS experiment: FAX (Federated ATLAS storage systems using XRootD)
• This allows the direct reading of experiment data by an analysis job at one site from storage at another
• So the network quality between two institutes has become a crucial requirement for good operation of the WLCG
• The WLCG is using perfSONAR to monitor the cross-domain network infrastructure

perfSONAR
• Network monitoring tool developed by ESnet, GÉANT, Indiana University and Internet2
• 'perfSONAR is a widely-deployed test and measurement infrastructure that is used by science networks and facilities around the world to monitor and ensure network performance.'
• 'perfSONAR's purpose is to aid in network diagnosis by allowing users to characterize and isolate problems. It provides measurements of network performance metrics over time as well as "on-demand" tests' (http://www.perfsonar.net/about/what-is-perfsonar/)
• WLCG goals:
– Find and isolate "network" problems; alert in time
– Characterize network use, such as finding baseline performance
– In the future: provide a source of network metrics for higher-level services
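As an aside on that last goal: each toolkit host stores its results in a local esmond measurement archive that can be queried over HTTP, which is one way a higher-level service could consume the metrics. The sketch below is a minimal illustration of such a query; the archive path, query parameters and JSON field names follow the esmond API shipped with perfSONAR toolkits of this era but should be treated as assumptions to verify against your deployment, and the destination host name is a placeholder.

```python
# Sketch: pull recent throughput results out of a perfSONAR host's esmond
# measurement archive. Paths and field names are assumptions based on the
# esmond API of this perfSONAR generation; the destination is a placeholder.
import requests

HOST = "http://t2ps-bandwidth.physics.ox.ac.uk"   # the Oxford toolkit shown later
ARCHIVE = HOST + "/esmond/perfsonar/archive/"

def recent_throughput(destination, time_range=86400):
    """Yield (timestamp, Gbit/s) for throughput tests towards `destination`."""
    # Metadata entries describe each measured source/destination pair.
    metadata = requests.get(ARCHIVE, params={
        "event-type": "throughput",
        "destination": destination,
        "time-range": time_range,
    }).json()
    for entry in metadata:
        for et in entry.get("event-types", []):
            if et.get("event-type") != "throughput":
                continue
            # base-uri points at the raw time series for this metadata entry.
            points = requests.get(HOST + et["base-uri"],
                                  params={"time-range": time_range}).json()
            for p in points:
                yield p["ts"], p["val"] / 1e9   # esmond stores bits per second

if __name__ == "__main__":
    # Placeholder destination host, for illustration only.
    for ts, gbps in sorted(recent_throughput("ps-bandwidth.example.ac.uk")):
        print(ts, round(gbps, 2), "Gbit/s")
```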
1775 perfSONAR hosts worldwide
[Map from http://stats.es.net/ServicesDirectory/]

245 active WLCG perfSONAR instances
[Maps from http://stats.es.net/ServicesDirectory/]
• WLCG perfSONAR hosts are typically co-located with storage

WLCG perfSONAR testing
• Testing is coordinated by the WLCG Network Throughput working group
• Current perfSONAR configuration for WLCG testing:
– latency (one direction only, 10 Hz)
– bandwidth (one direction only, daily)
– traceroute (bi-directional, hourly)
• Tests are either configured locally by the perfSONAR administrator or organised in centrally configured 'meshes' accessed via a mesh URL

perfSONAR configuration interface
• A perfSONAR host can participate in multiple meshes
• The configuration interface and auto-URL enable dynamic configuration of the entire network (McKee et al., CHEP 2015)

perfSONAR (http://t2ps-bandwidth.physics.ox.ac.uk/toolkit/)
• The home page of a host shows its current configuration and status
• Tests can be configured with other perfSONAR hosts
• On-demand testing tools are useful for debugging, e.g. reverse traceroute can help pick up asymmetric routing

Bandwidth
[Plots: throughput and reverse throughput]

Latency and Loss
[Plots: latency (ms) and loss (%)]

Traceroute
[Traceroute listing]

MaDDash
• With large meshes it is difficult to visualise overall performance
• Centralised dashboards really help
• MaDDash (Monitoring and Debugging Dashboard) displays meshes of perfSONAR hosts
• A WLCG-wide MaDDash dashboard has been developed (http://psmad.grid.iu.edu/maddash-webui/)
• This has two aspects:
– Open Monitoring Distribution: Nagios monitoring
– MaDDash
• Initial meshes were set up on a country basis, e.g. the UK

UK mesh
[MaDDash grid of source versus destination sites]
• Cell colours reflect rate thresholds
• Hover over a cell to get rates from both measurement archives of the sites involved
• Click on a cell to drill down to detailed graphs (examples follow)
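MaDDash colours each source–destination cell by comparing the measured rate against configured thresholds. The snippet below is a self-contained sketch of that idea rather than MaDDash's own logic; the warning/critical values and the site-pair measurements are invented for illustration.

```python
# Illustrative only: classify measured throughput per site pair the way a
# dashboard's rate thresholds might. Threshold values and sample measurements
# are invented; MaDDash reads both from its own configuration and archives.
WARNING_GBPS = 0.5   # below this a cell might turn orange
CRITICAL_GBPS = 0.1  # below this a cell might turn red

def classify(gbps):
    if gbps is None:
        return "UNKNOWN"   # no recent measurement for this pair
    if gbps < CRITICAL_GBPS:
        return "CRITICAL"
    if gbps < WARNING_GBPS:
        return "WARNING"
    return "OK"

# Example measurements (Gbit/s) between hypothetical site pairs.
measured = {
    ("UKI-SOUTHGRID-OX-HEP", "UKI-LT2-QMUL"): 1.3,
    ("UKI-SCOTGRID-DURHAM", "UKI-LT2-IC-HEP"): 0.04,
    ("UKI-NORTHGRID-LANCS-HEP", "UKI-LT2-QMUL"): None,
}

for (src, dst), gbps in measured.items():
    print(f"{src} -> {dst}: {classify(gbps)}")
```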
UKI-SCOTGRID-DURHAM
[Drill-down plot; annotation: replaced motherboard]

UKI-NORTHGRID-LANCS-HEP
[Drill-down plot; annotation: "a number of major tweaks to our network configuration"]

UKI-SOUTHGRID-OX-HEP
[Drill-down plots; annotations: "a number of major tweaks to our network configuration" and re-configuration of the site core network]

CMS bandwidth hosts
[Mesh of CMS bandwidth hosts]

LHCb latency hosts
[Mesh of LHCb latency hosts]

Dual-stack perfSONAR measurements
• The WLCG has an ongoing effort to promote the adoption of IPv6
• The aim is to allow sites to offer IPv6-only computing resources to the WLCG by April 2017
– coordinated by the HEPiX IPv6 working group
• perfSONAR supports the use of both IPv4 and IPv6
• A HEPiX/WLCG dual-stack mesh has been set up to monitor IPv4 and IPv6 performance
• Twenty-one WLCG perfSONAR dual-stack nodes are in the mesh
• Adding and removing hosts from the mesh configuration is very simple

Building the perfSONAR dual-stack mesh
• WLCG meshes are constructed using the US Open Science Grid (OSG) OIM interface, populated by data from the OSG and GOCDB host configuration databases
• The web-based GUI makes it very easy to add and remove sites

IPv4 versus IPv6 throughput
• Would like to be able to show the difference in IPv4 and IPv6 throughput directly (a rough command-line sketch of such a comparison is given at the end of this document)

Oxford to QMUL IPv4/IPv6 throughput, October 2015
• IPv4: ~5 Gbps
• IPv6: ~0.5 Gbps

Oxford to QMUL IPv4/IPv6 throughput, September 2016
• IPv4: dropped to ~1.3 Gbps
• IPv6: increased to ~1.3 Gbps

IPv4 and IPv6 traceroute
[Traceroute listings over IPv4 and IPv6]

Summary
• The LHC experiments' distributed computing resources, provided by the WLCG, depend heavily on cross-domain, often long-distance, network links
• A network of 245 perfSONAR hosts, organised into experiment-specific meshes, has been set up to monitor these links
• MaDDash dashboards give a very useful overview of the state of a mesh
• It is easy to drill down to detailed graphs of bandwidth and latency
• Traceroute information is also available
• A dual-stack mesh is in use to help monitor the IPv6 roll-out across the WLCG

• Thank you to the WLCG Network Throughput working group, especially the coordinators Shawn McKee (University of Michigan) and Marian Babik (CERN)
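Sketch referenced from the "IPv4 versus IPv6 throughput" slide above: a quick, ad-hoc way to compare throughput over the two address families to the same dual-stacked endpoint using iperf3. This bypasses perfSONAR's own test scheduling, assumes iperf3 is installed locally and that the far end runs an iperf3 server you are permitted to test against, and the target host name is a placeholder.

```python
# Rough sketch: measure throughput to a dual-stacked host over IPv4 and IPv6
# with iperf3 and compare the results. This is an ad-hoc check, not how
# perfSONAR schedules its tests; the target host is a placeholder and must be
# running an iperf3 server (iperf3 -s) that you are allowed to test against.
import json
import subprocess

TARGET = "ps-bandwidth.example.ac.uk"  # placeholder dual-stacked endpoint

def run_iperf3(family_flag):
    """Run a 10-second iperf3 test (-4 or -6) and return the rate in Gbit/s."""
    out = subprocess.run(
        ["iperf3", "-c", TARGET, family_flag, "-t", "10", "-J"],
        capture_output=True, text=True, check=True).stdout
    report = json.loads(out)
    # For a TCP test the receiver-side summary sits under end.sum_received.
    return report["end"]["sum_received"]["bits_per_second"] / 1e9

if __name__ == "__main__":
    v4 = run_iperf3("-4")
    v6 = run_iperf3("-6")
    print(f"IPv4: {v4:.2f} Gbit/s   IPv6: {v6:.2f} Gbit/s")
```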