Archer Update
Andrew Washbrook, University of Edinburgh
HPC working group meeting, 10th December 2014

Archer Details
• Archer is the UK’s primary academic research supercomputer
• Operational since Nov 2013
• NEW: Phase 2 upgrade completed in Nov 2014

• Cray XC30 system
• Each compute node comprises:
  • 2 x 12-core 2.7 GHz Ivy Bridge processors
  • At least 64 GB of DDR3-1833 MHz main memory
• Cray Aries interconnect (multi-tier all-to-all connectivity)
• 4.4 PB scratch storage (Lustre)

• Compute nodes: 3008 → 4920, i.e. 72,192 → 118,080 cores
• Theoretical peak performance: 1.56 → over 2 Petaflops
Questions for ATLAS Weekly
How much are we using for G4 and what are the prospects for using more?
• We currently have access to Archer via a nominal allocation pledged to University of Edinburgh researchers (Archer director’s time)
• An effective proof of concept demonstrating use of opportunistic slots would help bolster the case for requesting additional resources
What are the main challenges for using HPCs in general?
• Specific challenges are covered in the following slides
• Some of these will be Archer specific
• (Hopefully) some of these will have been addressed at other HPC sites
Compute Node Connectivity
Challenge: Outgoing connectivity is (in general) required throughout the lifetime of a job
• Until Archer Phase 2 there was no external connectivity available on the compute nodes
Approaches
• External connectivity may now be possible through the new Cray Realm-Specific IP addressing (RSIP) for compute nodes
  • Allows compute and service nodes to share the IP addresses configured on the external Gigabit Ethernet interfaces of network nodes
  • Included on Archer for third-party software licence validation
  • Not meant for large-scale data transfer
  • Scaling tests are underway to determine limitations (if any); see the probe sketch below
Alternatives
• ssh port forwarding to select servers on the Archer network
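As a starting point for the scaling tests, something along the lines of the sketch below could be launched on a batch of compute nodes via aprun to confirm that RSIP routes outbound TCP connections. The target host and port are placeholders, not a real ATLAS or Archer endpoint.

```python
#!/usr/bin/env python
# Minimal outbound-connectivity probe, intended to be run on a compute node
# once RSIP is enabled.  TARGET_HOST/TARGET_PORT are hypothetical placeholders.
import socket
import sys
import time

TARGET_HOST = "external-service.example.org"   # placeholder endpoint
TARGET_PORT = 443
TIMEOUT_S = 10

def probe(host, port):
    """Return the connection setup time in seconds, or None on failure."""
    start = time.time()
    try:
        sock = socket.create_connection((host, port), timeout=TIMEOUT_S)
        sock.close()
        return time.time() - start
    except (socket.error, socket.timeout) as err:
        print("connection to %s:%d failed: %s" % (host, port, err))
        return None

if __name__ == "__main__":
    latency = probe(TARGET_HOST, TARGET_PORT)
    if latency is None:
        sys.exit(1)
    print("connected to %s:%d in %.3f s" % (TARGET_HOST, TARGET_PORT, latency))
```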
Software Delivery

Challenge: Availability of ATLAS software on HPC compute nodes
Approaches
• CVMFS is not available from compute nodes
  • External connectivity issues (see previous slide)
  • Installing the CVMFS software, FUSE and autofs on compute nodes is problematic
• Currently have a local snapshot of CVMFS as a pragmatic option (see the sketch below)
  • The CVMFS directory is rsynced to a machine resident at a Tier-2 site
  • The repository copy is cleansed of absolute paths, which are replaced with the local path
  • The copy is rsynced over to Archer shared storage
  • For now only selected releases are extracted during the testing phase
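A rough sketch of the snapshot preparation step described above: rewrite absolute /cvmfs paths in a local copy of the repository to the path used on Archer shared storage, then push the result across with rsync. All paths and the destination host are illustrative only, not the actual configuration.

```python
#!/usr/bin/env python
# Sketch of the snapshot "cleansing" step before transfer to Archer.
# Paths and destination are hypothetical examples.
import os
import subprocess

SNAPSHOT_DIR = "/data/cvmfs-snapshot/atlas.cern.ch"        # local copy at the Tier-2 site
OLD_PREFIX   = "/cvmfs/atlas.cern.ch"
NEW_PREFIX   = "/work/projects/atlas/cvmfs/atlas.cern.ch"  # hypothetical Archer path
ARCHER_DEST  = "user@login.archer.ac.uk:/work/projects/atlas/cvmfs/"
TEXT_SUFFIXES = (".sh", ".csh", ".py", ".cfg", ".xml", ".txt")  # only rewrite likely text files

def rewrite_paths(root, old, new):
    """Replace absolute repository paths in setup scripts and config files."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(TEXT_SUFFIXES):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "r") as f:
                content = f.read()
            if old in content:
                with open(path, "w") as f:
                    f.write(content.replace(old, new))

if __name__ == "__main__":
    rewrite_paths(SNAPSHOT_DIR, OLD_PREFIX, NEW_PREFIX)
    subprocess.check_call(["rsync", "-a", "--delete", SNAPSHOT_DIR, ARCHER_DEST])
```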
Alternatives
• Archer will allow an external filesystem resident on an edge server to be mounted on the scheduling nodes
  • Not clear how this would be beneficial - could try a CVMFS-over-NFS solution
• Parrot/CVMFS
• Pacman
Job Environment

Challenge: Define the HPC compute node environment before each job
Approaches
• Comparable with the Grid environment script + worker node tarball solution used for some shared Tier-2 facilities (e.g. ECDF)
• HEP-specific libraries (covered by the HEP_OSlibs meta rpm) have to be made available without the rpm installation method
• Need to address potential conflicts with common tools and libraries used by other HPC users (gcc, python)
• Favoured the use of a TCL module to define the path setup, consistent with the software management of other HPC applications on Archer (see the wrapper sketch below)
• The asetup command works with some tweaking of absolute paths
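A minimal sketch of the idea behind the module-based setup: generate a per-job wrapper that points the payload at the worker-node tarball and the local CVMFS snapshot before the payload runs. The directory locations and environment variables below are hypothetical examples, not the variables actually exported by the Archer module.

```python
#!/usr/bin/env python
# Sketch of a per-job wrapper generator playing the same role as the TCL module.
# Every path and environment variable here is an assumed example.
import os
import stat

SNAPSHOT_ROOT = "/work/projects/atlas/cvmfs/atlas.cern.ch"   # local CVMFS snapshot
WN_TARBALL    = "/work/projects/atlas/wn"                    # unpacked worker-node tarball
HEP_LIBS      = "/work/projects/atlas/hep_oslibs"            # libraries normally from HEP_OSlibs

WRAPPER_TEMPLATE = """#!/bin/bash
# Generated job wrapper - sets up the ATLAS environment on a compute node.
export VO_ATLAS_SW_DIR={snapshot}
export PATH={wn}/bin:$PATH
export LD_LIBRARY_PATH={hep}/lib64:{wn}/lib:$LD_LIBRARY_PATH
# payload command is appended by the pilot
"""

def write_wrapper(path):
    with open(path, "w") as f:
        f.write(WRAPPER_TEMPLATE.format(snapshot=SNAPSHOT_ROOT, wn=WN_TARBALL, hep=HEP_LIBS))
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)   # make it executable

if __name__ == "__main__":
    write_wrapper("job_wrapper.sh")
```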
HPC Pilots
Challenge: A pilot submitted to an HPC batch system will (in general) have to request and manage workload across many compute nodes
• The pilot executes on a scheduling node and then requests compute resources using aprun

Approaches
• The easiest approach would be to simply submit wholenode pilots
  • Archer queue limitations: maximum 16 queued jobs, 8 running jobs per user
  • Assuming no MPI implementation, the pilot would need to handle many instances of wholenode jobs (see the sketch below)
  • If all wholenode jobs are launched simultaneously, the job finishes at the speed of the slowest instance
  • Inefficient resource allocation over time, and burns through the quota
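A minimal sketch of the "many wholenode jobs from one pilot" pattern, and of why the batch finishes at the speed of the slowest instance: the pilot launches several aprun commands from the scheduling node and has to wait for all of them. The payload command and instance count are placeholders.

```python
#!/usr/bin/env python
# Sketch of a pilot driving several wholenode payloads through aprun from the
# scheduling node.  Overall wallclock is set by the slowest instance, which is
# the inefficiency noted above.  Payload command and counts are placeholders.
import subprocess
import time

CORES_PER_NODE = 24                     # 2 x 12-core Ivy Bridge
N_INSTANCES    = 4                      # wholenode payloads managed by this pilot
PAYLOAD        = ["./run_payload.sh"]   # hypothetical payload wrapper

def launch():
    # -n: total processing elements, -N: PEs per node (one full node here)
    cmd = ["aprun", "-n", str(CORES_PER_NODE), "-N", str(CORES_PER_NODE)] + PAYLOAD
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    start = time.time()
    procs = [launch() for _ in range(N_INSTANCES)]
    exit_codes = [p.wait() for p in procs]     # blocks until the slowest instance finishes
    print("all %d instances done in %.0f s, exit codes %s"
          % (N_INSTANCES, time.time() - start, exit_codes))
```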
Alternatives
• The Yoda Event Service approach makes sense given the job throughput limitations on Archer
• Note that this challenge is not isolated to HEP - an alternative HPC pilot solution (on top of SAGA) is being used for the Molecular Dynamics simulation framework (ExTASY) on Archer
  • RADICAL pilot: http://radical-cybertools.github.io/radical-pilot/index.html
Workload Scheduling and Backfilling

Challenge: Assuming that pilots handle multi-node workloads, how many compute resources (and how much wallclock time) should a pilot request at any given time?
Approaches
• We could just estimate a fixed value that has a reasonable chance of being scheduled
• Would like to follow the approach taken on Titan
  • The pilot polls the scheduler directly to determine the most efficient resource allocation based on backfill information
  • Unfortunately we cannot do this on Archer due to a different scheduler implementation
    • Titan has PBSpro and access to the “showbf” command
    • Archer has Cray ALPS
  • Ongoing discussions with Cray on alternatives - the information could be derived from a superset of the output of the apstat, qstat and xtnodestat commands, but this is less trivial than the direct method (see the sketch below)
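On Titan the pilot can call showbf directly; the nearest equivalent on Archer would be to scrape node availability from apstat, as sketched below. The column layout assumed here (a "Compute node summary" table with an "avail" column) is an assumption about the apstat output and would need checking on Archer, and a node count alone says nothing about how long the idle nodes will remain free.

```python
#!/usr/bin/env python
# Sketch: estimate currently idle compute nodes from apstat output, as a crude
# stand-in for Titan's showbf.  The output format assumed here is unverified.
import subprocess

def idle_nodes_from_apstat():
    """Return the number of available compute nodes, or None if parsing fails."""
    out = subprocess.check_output(["apstat"]).decode("utf-8", "replace")
    lines = out.splitlines()
    for i, line in enumerate(lines):
        # Assumed layout: a header row followed by a row of counts.
        if "Compute node summary" in line and i + 2 < len(lines):
            header = lines[i + 1].split()
            values = lines[i + 2].split()
            if "avail" in header and len(values) == len(header):
                return int(values[header.index("avail")])
    return None

if __name__ == "__main__":
    idle = idle_nodes_from_apstat()
    if idle is None:
        print("could not parse apstat output - format differs from assumption")
    else:
        print("roughly %d compute nodes currently idle" % idle)
```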
Other Challenges and Outlook

Grid storage interaction
• May need a two-stage copy to move job output across to local storage (see the sketch below)
• Could be done via a post-processing node outside of the job lifetime

Dynamic shared libraries vs static libraries
• aprun normally expects static libraries
• Need to explore whether this is a real issue for ATLAS workloads
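A rough sketch of the two-stage copy idea mentioned above: first stage job output from Archer scratch to a post-processing node outside the job lifetime, then upload to grid storage from there. The hostnames, paths and the final upload command are all placeholders, not the actual configuration.

```python
#!/usr/bin/env python
# Sketch of the two-stage output copy, run on a post-processing/edge node:
# (1) pull results from Archer scratch, (2) hand them to a grid transfer tool.
# Hostnames, paths and the upload command are placeholders.
import subprocess

ARCHER_SCRATCH = "user@login.archer.ac.uk:/work/projects/atlas/output/"  # hypothetical
STAGING_DIR    = "/data/atlas/staging/"                                   # on the edge node
GRID_UPLOAD    = ["./upload_to_se.sh", STAGING_DIR]   # placeholder for the chosen grid copy tool

def two_stage_copy():
    # Stage 1: copy job output off Archer scratch to local staging space
    subprocess.check_call(["rsync", "-a", ARCHER_SCRATCH, STAGING_DIR])
    # Stage 2: upload the staged files to grid storage
    subprocess.check_call(GRID_UPLOAD)

if __name__ == "__main__":
    two_stage_copy()
```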
Outlook
• No major blockers on progress (for now)
• External connectivity was a long-standing issue but could be resolved with the Phase 2 setup
• A troubleshooting session is scheduled with the Archer admins to resolve remaining issues in the validation exercise
• A dormant ARC CE is connected to Archer - will revive the service to allow low-level test jobs if suitable pilots are available
• Would like scheduling and resource allocation to be as efficient as possible
  • The Yoda Event Service could be a more efficient use of resources - will perform initial testing in step with deployment
• Aiming for a reliable service early next year if the challenges can be addressed (or at least mitigated)