Archer Update
Andrew Washbrook, University of Edinburgh

HPC working group meeting, 10th December 2014

Details

• Archer is the UK’s primary academic research supercomputer
• Operational since Nov 2013
• NEW: Phase 2 upgrade completed in Nov 2014

• Cray XC30 system
• Each compute node comprises:
  • 2 x 12-core 2.7 GHz Ivy Bridge processors
  • At least 64 GB of DDR3-1833 MHz main memory
• Cray Aries interconnect (multi-tier all-to-all connectivity)
• 4.4 PB scratch storage (Lustre)

• 3008 → 4920 compute nodes (72,192 → 118,080 cores)
• 1.56 → >2 Petaflops of theoretical peak performance
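
As a back-of-the-envelope check, the Phase 2 peak figure follows from a simple product, assuming 8 double-precision FLOPs per core per cycle for Ivy Bridge with AVX:

    # Rough theoretical peak for the Phase 2 system, assuming 8 double-precision
    # FLOPs per core per cycle (Ivy Bridge AVX add + multiply)
    nodes = 4920
    cores_per_node = 24
    clock_hz = 2.7e9
    flops_per_cycle = 8

    peak = nodes * cores_per_node * clock_hz * flops_per_cycle
    print(f"Theoretical peak: {peak / 1e15:.2f} PFLOP/s")  # ~2.55 PFLOP/s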

Questions for ATLAS Weekly

How much are we using for G4 and what are the prospects for using more?

• We currently have access to Archer via a nominal allocation pledged to University of Edinburgh researchers (Archer Director's Time)
• An effective proof of concept demonstrating use of opportunistic slots would help bolster the case for requesting additional resources

What are the main challenges for using HPCs in general?

• Specific challenges are covered in the following slides
• Some of these will be Archer-specific
• (Hopefully) some of these will have been addressed at other HPC sites

Compute Node Connectivity

Challenge: Outgoing connectivity is (in general) required throughout the lifetime of a job

• Up until Archer Phase 2 there was no external connectivity available on the compute nodes

Approaches
• External connectivity may now be possible through the use of the new Cray Realm-Specific IP addressing (RSIP) for compute nodes
  • Allows compute and service nodes to share the IP addresses configured on the external Gigabit Ethernet interfaces of network nodes
  • Included on Archer for third-party software licence validation
  • Not meant for large-scale data transfer
• Scaling tests are underway to determine limitations (if any); a minimal probe is sketched below
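
A minimal outbound probe of the sort these scaling tests build on could look like the sketch below; the target URL is a placeholder, and the script would be launched on a compute node (e.g. with aprun -n 1):

    # Minimal outbound-connectivity probe for a compute node (placeholder
    # endpoint); intended to be launched via aprun, e.g. "aprun -n 1 python probe.py"
    import socket
    import time
    import urllib.request

    TARGET = "http://example.cern.ch/"  # placeholder URL, not a real service

    start = time.time()
    try:
        with urllib.request.urlopen(TARGET, timeout=30) as response:
            result = f"HTTP {response.status}"
    except Exception as exc:  # no route, RSIP not configured, timeout, ...
        result = f"failed ({exc})"
    print(f"{socket.gethostname()}: {TARGET} -> {result} after {time.time() - start:.1f}s")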

Alternatives
• ssh port forwarding to select servers on the Archer network

Software Delivery

Challenge: Availability of ATLAS software on HPC compute nodes

Approaches
• CVMFS is not available from compute nodes
  • External connectivity issues (see previous slide)
  • Problematic to install the CVMFS software, FUSE and autofs on compute nodes
• Currently have a local snapshot of CVMFS as a pragmatic option (sketched below)
  • The CVMFS directory is rsynced to a machine resident at a Tier-2 site
  • The repository copy is cleansed of absolute paths, which are replaced with the local path
  • The copy is then rsynced over to Archer shared storage
  • For now only selected releases are extracted during the testing phase
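
A rough sketch of this snapshot pipeline is below; the hostnames, repository paths and target prefix are all illustrative placeholders rather than the actual setup:

    # Sketch of the CVMFS snapshot pipeline: pull a repository copy from a
    # Tier-2 resident machine, rewrite absolute /cvmfs paths to a local prefix,
    # then push the cleansed copy to Archer shared storage.
    import pathlib
    import subprocess

    SNAPSHOT = "/data/cvmfs-snapshot/atlas.cern.ch/"          # local staging copy (placeholder)
    ARCHER_DEST = "archer:/work/atlas/cvmfs/atlas.cern.ch/"   # shared storage (placeholder)
    OLD_PREFIX = "/cvmfs/atlas.cern.ch"
    NEW_PREFIX = "/work/atlas/cvmfs/atlas.cern.ch"

    def cleanse(snapshot):
        """Replace absolute CVMFS paths in text files with the local prefix."""
        for path in pathlib.Path(snapshot).rglob("*"):
            if not path.is_file() or path.is_symlink():
                continue
            try:
                text = path.read_text()
            except UnicodeDecodeError:
                continue  # skip binaries; a real cleansing pass needs more care here
            if OLD_PREFIX in text:
                path.write_text(text.replace(OLD_PREFIX, NEW_PREFIX))

    # 1) pull selected releases from the Tier-2 resident copy (placeholder host)
    subprocess.check_call(["rsync", "-a", "--delete",
                           "tier2-host:/cvmfs/atlas.cern.ch/", SNAPSHOT])
    # 2) cleanse absolute paths in the local copy
    cleanse(SNAPSHOT)
    # 3) push the cleansed snapshot to Archer shared storage
    subprocess.check_call(["rsync", "-a", "--delete", SNAPSHOT, ARCHER_DEST])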

Alternatives
• Archer will allow an external filesystem resident on an edge server to be mounted on the scheduling nodes
  • Not clear how this would be beneficial; could try a CVMFS-over-NFS solution
• Parrot/CVMFS
• Pacman

Job Environment

Challenge: Define HPC compute node environment before each job

Approaches

• Comparable with the Grid environment script + worker node tarball solution used for some shared Tier-2 facilities (e.g. ECDF)
• HEP-specific libraries (covered by the HEP_OSlibs meta RPM) have to be made available without the RPM installation method
• Need to address potential conflicts with common tools and libraries used by other HPC users (gcc, python)
• Favoured the use of a TCL module to define the path setup, to be consistent with the software management of other HPC applications on Archer (illustrated below)
• The asetup command works with some tweaking of absolute paths
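
For illustration only, the kind of path setup the TCL module encodes could be expressed as below (written in Python here purely for readability; the variable name and all paths are placeholders for the snapshot and tarball locations on Archer):

    # Illustration (not the actual module) of the path setup a TCL module would
    # encode: prepend the worker-node tarball and CVMFS snapshot locations.
    # All paths are placeholders.
    import os

    SNAPSHOT = "/work/atlas/cvmfs/atlas.cern.ch"   # local CVMFS snapshot (placeholder)
    WN_TARBALL = "/work/atlas/wn-software"         # worker-node tarball area (placeholder)

    def atlas_environment(base=None):
        env = dict(os.environ if base is None else base)
        env["ATLAS_SW_BASE"] = SNAPSHOT            # hypothetical variable name
        prepend = {
            "PATH": f"{WN_TARBALL}/bin",
            "LD_LIBRARY_PATH": f"{WN_TARBALL}/lib",     # HEP_OSlibs-style libraries
            "PYTHONPATH": f"{WN_TARBALL}/lib/python",
        }
        for var, path in prepend.items():
            env[var] = path + (":" + env[var] if env.get(var) else "")
        return env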

HPC Pilots

Challenge: A pilot submitted to an HPC batch system will (in general) have to request and manage workload across many compute nodes
• The pilot executes on a scheduling node and then requests compute resources using aprun

Approaches
• The easiest approach would be just to submit wholenode pilots (see the sketch after this list)
• Archer queue limitations: maximum of 16 queued jobs and 8 running jobs per user
• Assuming no MPI implementation, the pilot would need to handle many instances of wholenode jobs
  • The job finishes at the speed of the slowest if all wholenode jobs are launched simultaneously
  • Inefficient resource allocation over time and burns through the quota
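
A minimal sketch of the simplest wholenode pattern follows: the pilot runs on the scheduling node inside one multi-node batch job and fans the payload out with one aprun per node (the payload wrapper name is a placeholder):

    # Sketch of a "wholenode pilot": one aprun per node, each launching 24
    # copies of a placeholder payload wrapper (one per core).  The pilot
    # itself runs on the scheduling (MOM) node of the enclosing batch job.
    import subprocess

    NODES = 8             # nodes requested by the enclosing batch job
    CORES_PER_NODE = 24

    procs = []
    for slot in range(NODES):
        procs.append(subprocess.Popen(
            ["aprun", "-n", str(CORES_PER_NODE), "-N", str(CORES_PER_NODE),
             "./run_payload.sh", str(slot)]))    # placeholder wrapper script

    # the batch job (and so the pilot) finishes at the speed of the slowest payload
    exit_codes = [p.wait() for p in procs]
    print("payload exit codes:", exit_codes)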

Alternatives
• The Yoda Event Service approach makes sense to use given the job throughput limitations on Archer
• Note that this challenge is not isolated to HEP: an alternative HPC pilot solution (built on top of SAGA) is being used for the Molecular Dynamics simulation framework (ExTASY) on Archer
  • RADICAL-Pilot: http://radical-cybertools.github.io/radical-pilot/index.html

Workload Scheduling and Backfilling

Challenge: Assuming that pilots handle multi-node workloads, how many compute resources (and how much wallclock time) should a pilot request at any given time?

Approaches
• We could just estimate a fixed value that has a reasonable chance of being scheduled
• Would like to follow the approach taken on Titan
  • The pilot polls the scheduler directly to determine the most efficient resource allocation based on backfill information
• Unfortunately we cannot do this on Archer due to a different scheduler implementation
  • Titan has PBS Pro and has access to the "showbf" command
  • Archer has Cray ALPS
• Ongoing discussions with Cray on alternatives: the information could be derived from a superset of the information given by the apstat, qstat and xtnodestat commands, but this is less trivial than the direct method (see the sketch below)
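
An indicative starting point for gathering that information is sketched below; the exact apstat output format on Archer would need to be confirmed, so the parsing shown is only a placeholder heuristic:

    # Indicative sketch of deriving backfill information without "showbf":
    # capture raw apstat output and estimate the number of idle compute nodes.
    # A fuller version would fold in qstat and xtnodestat output as well.
    import subprocess

    def command_output(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    def estimate_idle_nodes():
        apstat_out = command_output(["apstat"])
        # Placeholder heuristic: look for a summary line mentioning idle or
        # available nodes (the exact apstat format must be checked on Archer).
        for line in apstat_out.splitlines():
            if "idle" in line.lower() or "avail" in line.lower():
                numbers = [int(tok) for tok in line.split() if tok.isdigit()]
                if numbers:
                    return numbers[-1]
        return 0

    print("estimated idle nodes for backfill sizing:", estimate_idle_nodes())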

Other Challenges and Outlook

Grid storage interaction
• May need a two-stage copy to move job output across to local storage
• Could be done via a post-processing node outside of the job lifetime (a sketch follows these bullets)

Dynamic shared libraries vs static libraries
• aprun normally expects static libraries
• Need to explore whether this is a real issue for ATLAS workloads
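
One possible shape for the two-stage copy is sketched below; the staging area, the endpoint, and the choice of gfal-copy for the grid-side transfer are assumptions, and the second stage would run on a node with external connectivity outside the job lifetime:

    # Sketch of a two-stage output copy: stage one runs inside the job and only
    # touches Archer storage; stage two runs later on a post-processing / edge
    # node with external connectivity.  Paths, endpoint and transfer tool are
    # placeholders.
    import os
    import subprocess

    STAGING_AREA = "/work/atlas/outbox"                # Archer shared storage (placeholder)
    GRID_ENDPOINT = "srm://se.example.ac.uk/atlas/"    # placeholder storage endpoint

    def stage_one(output_file):
        """Inside the job: copy the output to the shared staging area."""
        subprocess.check_call(["cp", output_file, STAGING_AREA])

    def stage_two(filename):
        """After the job, on the edge node: push the staged file to grid storage."""
        subprocess.check_call(
            ["gfal-copy",
             "file://" + os.path.join(STAGING_AREA, filename),
             GRID_ENDPOINT + filename])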

Outlook
• No major blockers to progress (for now)
• External connectivity was a long-standing issue but could be resolved in the Phase 2 setup
• A troubleshooting session is scheduled with the Archer admins to resolve remaining issues in the validation exercise
• A dormant ARC CE is connected to Archer; the service will be revived to allow low-level test jobs if suitable pilots are available
• Would like scheduling and resource allocation to be as efficient as possible
  • The Yoda Event Service could be a more efficient use of resources; initial testing will be performed in step with deployment
• Aiming for a reliable service early next year if the challenges can be addressed (or at least mitigated)