Overview of the System

This overview of the ARCHER system is provided as information to bidders in the ARCHER Service Provider and CSE Provider procurements. The ARCHER system is a Cray XC30. This massively parallel supercomputer provides balanced performance between processors, memory, interconnect, and I/O. The Cray XC30 system combines Intel processors, Cray’s Aries network and the Cray Linux Environment to provide a system capable of delivering production processing capability to ARCHER users. A schematic representation of the ARCHER system can be found in Figure 1.

[Figure 1 schematic. Recoverable labels: the ARCHER Cray XC30 and the ARCHER TDS, each with standard memory compute nodes, high memory compute nodes, and service and I/O nodes (DVS, LNET, Login/MOM, Boot and SDB nodes); SMWs and boot RAID controllers; esLogin and PostProc nodes; Sonexion parallel file systems; home file system; tape library; Research Data Facility; SuperJANET; linked by FDR IB, 10GbE, 40GbE and Fibre Channel networks.]

Figure 1: Schematic representation of proposed ARCHER system

Key components of the ARCHER system are:
• A Cray XC30 supercomputer including:
  o Standard memory compute nodes
  o High memory compute nodes
  o Service and I/O nodes distributed throughout the system
  o Cray Aries high-performance interconnect
  o External login nodes through which users access the ARCHER system
• Four Cray Sonexion storage appliances providing an aggregate of 4.8 PB usable capacity
• An FDR InfiniBand network connecting the storage systems to the XC30
• A NetApp home storage system, supporting snapshots for daily backup to online storage, and a SpectraLogic tape library for weekly backup to removable media
• A test and development system (TDS)
• Two pre/post processing systems
• A 10GbE network giving the XC30, its login nodes and the pre/post processing nodes access to the home file system, the RDF and SuperJANET

Cray UK Ltd Archer System 1.0 May 4th 2013

Cray XC30 software

The ARCHER system runs the Cray Linux Environment (CLE), which includes a SUSE Linux-based OS distribution, system management tools, job scheduler, high performance file system, hardware supervisory system and energy monitoring system. Scheduling services are provided through the combination of Altair’s PBS Professional workload management software and the Cray Application Layer Placement Scheduler (ALPS).

The Cray XC30 system includes the fully integrated Cray programming environment with tools designed to maximize programmer productivity, application scalability, and performance. The proposed Cray solution includes:
• Unlimited licenses for the Cray Compiling Environment (CCE)
• Floating license for the Intel Composer XE compilers
• The GNU C/C++ and Fortran compilers
• Cray Programming Environment
• The Cray Scientific and Math Libraries and Intel Math Kernel Libraries
• The Cray Performance, Measurement and Analysis Tools (CPMAT)
• Allinea DDT parallel debugger

Most elements of the ARCHER programming environment are in use today on the HECToR system, the exception being the Intel compilers.

The external login nodes, pre/post processing nodes and backup server are managed using Bright Cluster Manager.
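As an illustration of how PBS Professional and ALPS fit together, the sketch below generates a minimal batch script in which PBS reserves the nodes and the ALPS launcher `aprun` places the application on them. The helper `render_job_script`, the job name, the executable `./my_app` and the figure of 24 ranks per node are all hypothetical values chosen for the example; none of them comes from this overview, and the resource-request syntax on a real system depends on the site's PBS configuration.

```python
def render_job_script(job_name, nodes, ranks_per_node, walltime, executable):
    """Return the text of a PBS Professional job script that launches via aprun.

    All argument values are supplied by the caller; nothing here reflects the
    actual ARCHER queue or node configuration.
    """
    total_ranks = nodes * ranks_per_node
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",           # job name shown by the scheduler
        f"#PBS -l select={nodes}",       # number of compute nodes requested
        f"#PBS -l walltime={walltime}",  # wall-clock limit, HH:MM:SS
        "",
        "cd $PBS_O_WORKDIR",             # directory the job was submitted from
        # aprun: -n is the total number of ranks, -N the ranks per node
        f"aprun -n {total_ranks} -N {ranks_per_node} {executable}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical request: 2 nodes with an assumed 24 ranks on each.
script = render_job_script(
    job_name="example",
    nodes=2,
    ranks_per_node=24,
    walltime="00:20:00",
    executable="./my_app",
)
print(script)
```

The generated text would be saved to a file and submitted with `qsub` from a login node; the division of labour (PBS for reservation, `aprun` for placement) is the point of the sketch rather than the specific values.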

File Systems

The Cray ARCHER system includes four Cray Sonexion file systems. Each Cray Sonexion file system consists of a single Metadata Management Unit (MMU) and one or more Scalable Storage Units (SSUs). The Cray ARCHER system has a total of four MMUs and twenty SSUs, configured as one large file system (11 SSUs) and three smaller file systems (3 SSUs each).

The ARCHER system includes a NetApp network attached storage appliance providing 214 TB of home storage. It comprises:
• Dual FAS3250 active-active controllers, configured for high availability
• DS4246 disk shelves, with 3 TB 7.2K SATA drives
• SpectraLogic T380 tape library

Access to the home file system is provided over 10 Gbps Ethernet. The storage appliance and its backup server are each connected to a pair of Ethernet switches, as are all client nodes requiring Ethernet access. Compute nodes access the home file system via Cray XC30 I/O nodes running the Cray Data Virtualization Service (DVS). The tape library is connected to the home file system via Fibre Channel links. Symantec NetBackup is provided to manage backup of the home file system.
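The Sonexion layout can be checked with a few lines of arithmetic. The sketch below confirms that one 11-SSU file system plus three 3-SSU file systems accounts for all twenty SSUs, and then estimates each file system's share of the 4.8 PB aggregate quoted earlier in this overview. The per-file-system figures rest on the assumption that every SSU contributes an equal share of usable capacity, which this document does not state.

```python
# SSU counts per file system, as given in the text: one large, three smaller.
ssus_per_filesystem = [11, 3, 3, 3]

# Aggregate usable capacity across the four Sonexion appliances, in PB.
total_usable_pb = 4.8

total_ssus = sum(ssus_per_filesystem)
assert total_ssus == 20, "layout should account for all twenty SSUs"
assert len(ssus_per_filesystem) == 4, "one MMU per file system, four in total"

# Assumption (not from the source): capacity is spread evenly across SSUs.
capacity_pb = [total_usable_pb * n / total_ssus for n in ssus_per_filesystem]

for index, (n, cap) in enumerate(zip(ssus_per_filesystem, capacity_pb), start=1):
    print(f"file system {index}: {n:2d} SSUs, approx. {cap:.2f} PB")
```

Under that even-split assumption the large file system holds a little over half of the total capacity.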

Test and Development System

An air-cooled Cray XC system is provided as a Test and Development System (TDS). The TDS includes all major hardware and software components of the main system. TDS nodes have the same processor and memory configurations as are used in the main system.

The TDS is capable of operating standalone. It can be powered up/down independently of the main system. The operating system image for the TDS is independent of that installed on the main system; a separate boot RAID is provided to support this. A Fibre Channel disk array provides direct attached storage for standalone operation.

The TDS has dedicated I/O gateways, providing connectivity to the parallel file systems over FDR InfiniBand. The TDS can be connected to or disconnected from the file systems without impacting the main system. The TDS has dedicated service nodes and a dedicated login node.

Pre/Post Processing Systems

The proposed system includes two x86 pre/post processing nodes. The expectation is that the pre/post processing systems will operate in the same manner as the HECToR large memory server, and the nodes have been configured accordingly. Users do not access the pre/post processing nodes directly; access is via the login nodes. Jobs are submitted to them via PBS. The pre/post processing nodes operate independently of the main system.

Networking

The ARCHER system includes two Ethernet switches providing connectivity between the XC30, the TDS, the login nodes, the pre/post processing nodes and the home file system. The ARCHER system includes 40 Gbps Ethernet links from each of the two Ethernet switches to each of two RDF switches (4 x 40GbE in total).
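The "4 x 40GE" figure follows from linking every Ethernet switch to every RDF switch. The sketch below enumerates those links; the switch names are hypothetical labels, and the aggregate figure is only the nominal sum of link rates, not a measured or guaranteed throughput.

```python
from itertools import product

# Hypothetical labels for the two ARCHER Ethernet switches and two RDF switches.
ethernet_switches = ["archer-eth-1", "archer-eth-2"]
rdf_switches = ["rdf-sw-1", "rdf-sw-2"]

link_speed_gbps = 40  # nominal rate of each link, as stated in the text

# Each Ethernet switch has one link to each RDF switch.
links = list(product(ethernet_switches, rdf_switches))
aggregate_gbps = len(links) * link_speed_gbps

for eth, rdf in links:
    print(f"{eth} <-> {rdf}: {link_speed_gbps} Gbps")
print(f"{len(links)} links, {aggregate_gbps} Gbps nominal aggregate")
```

A full mesh of this kind means that the loss of any single switch still leaves a path between the ARCHER network and the RDF.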

Training

Cray will provide a limited amount of ARCHER-specific training as described below. Additional training may be required by the Service Provider or CSE Provider staff depending on their skills and experience with Cray systems.

Training included within the Cray proposal

Cray will run two ARCHER-specific training courses:
• Introduction to ARCHER system for HECToR users
• Introduction to ARCHER system for Service Provider and ACF staff

These training courses will be open to existing HECToR users, Service Provider staff, Authority staff, and others as appropriate. Introductory material will be provided as part of the documentation set.

Introduction to ARCHER system for HECToR users
• System overview and configuration
• Changes to the programming environment between HECToR and ARCHER

This course could be run at the Service Provider site with dial-in/online access for remote users. A repeat could be run online to allow more users to attend.

Introduction to ARCHER system for Service Provider and ACF staff
• Power and cooling systems
• Power monitoring
• Log files
• Tape library operation
• Safety

Cray would expect to run this course at the ACF.

Additional Training

Cray recommends that, as a minimum, Service Provider staff take training courses in the following subjects:
• Cray XC30 System Administration
• Cray Sonexion
• Cray External Services

Additional courses may be useful depending on staff skill levels and experience on Cray systems. Information on course schedules, location and pricing should be obtained from Cray; see http://www.cray.com/CustomerSupport/training.aspx

Cray System Administration Part I

Length: 4 days

Description: This course covers important features of software support, software maintenance and system administration of the Cray XC30 system. Through lectures and laboratories, students learn the fundamentals of hardware architecture, software installation and configuration, monitoring and maintenance of the Cray XC30 systems. This course is intended for both Cray field support personnel and Cray customers.

Delivery Options: The lecture portion of the class is delivered in a standard classroom setting. Hands-on experience in a laboratory setting supplements the lectures.

Skills/Topics Addressed:
• Hardware description
• Boot RAID disk system
• Cabling
• Hardware Supervisory System
• Power, cooling, and control
• Cray Linux Environment:
  o Software installation and configuration
  o Booting and shutting down the system
  o Configuring and maintaining file systems
  o Backing up and restoring file systems
• An overview of esLogin systems, the Cray storage subsystems

Cray Sonexion Storage System

Length: 2 days

Description: This course presents the installation and operation of the Cray Sonexion integrated storage appliance. Students are also taught the support strategy, tools, and procedures that are required to monitor and maintain the Sonexion system. Lab sessions provide students with hands-on exposure to the maintenance tools and field replacement procedures.

Delivery Options: Instructor-led class and lab.

Prerequisites: Previous experience with Lustre file systems is helpful but not required.

Skills/Topics Addressed:
• Overview of Sonexion and Lustre file systems
• Hardware components and configuration options
• Software components and networks
• Installation and support strategies
• Installation and maintenance tools

Cray External Services

Length: 3 days

Description: This course presents the important hardware and software features of the Cray External Services equipment. Through lectures and labs, students learn the fundamentals of hardware architecture, configuration, software installation, maintenance, and system administration.

Delivery Options: Instructor-led class and lab.

Prerequisites: None

Skills/Topics Addressed:
• Hardware description
• Node types and functions
• System configurations
• Networks
• Bright Cluster Manager tool
• Software installation
• System management
