DiRAC-3 Technical Case

Contributors: Peter Boyle, Paul Calleja, Lydia Heck, Juha Jaykka, Adrian Jenkins, Stuart Rankin, Chris Rudge, Paul Shellard, Mark Wilkinson, Jeremy Yates, on behalf of the DiRAC project.

Table of Contents

1 Executive summary
2 Technical case for support
  1.1 Service description
  1.2 Track record
    1.2.1 Hosting capability
  1.3 International competition
  1.4 Risk Management
  1.5 Impact
    1.5.1 Industrial Collaboration
  1.6 RFI based Technology Survey
    1.6.1 Processor technology survey
    1.6.2 Interconnect technology survey
    1.6.3 Storage technology survey
  1.7 Benchmarking
  1.8 Procurement strategy
  1.9 Timeline
  1.10 Expenditure profile
3 Case Studies
  1.11 Data Intensive (DI) Service
    1.11.1 Illustrative workflows
    1.11.2 Technical requirements from science case
    1.11.3 Proposed System characteristics
  1.12 Extreme scaling service
    1.12.1 Extreme Scaling system characteristics
  1.13 Memory Intensive Service
4 Appendix: Risk Register

1 Executive summary

The Distributed Research through Advanced Computing facility (DiRAC) serves as the integrated computational resource for the STFC-funded theoretical particle physics, astronomy, cosmology, and astro-particle physics communities. Since 2009 it has provided two generations of computing service, and presently delivers an aggregate two petaflop/s across five installations. DiRAC has been a global pioneer of the co-design of scientific software and computing hardware, and has worked closely with technology companies such as IBM and Intel. The fundamental research themes included in DiRAC are of great public interest and have regularly appeared in news articles with worldwide impact. These same fundamental themes consume substantial fractions of the computational resources of other countries throughout the developed world, and it is essential that the UK retain its competitive international position.

The DiRAC community has constructed a scientific case for a transformative theoretical physics and astronomy research programme, tackling the full remit of STFC science challenges. This requires an order of magnitude increase in computational resources. This technical case aims to establish that the computing resources required to support this science programme are achievable within the available budget and timeframe, and that we have the capabilities and track record to deliver these services to the scientific community.

A survey of our scientific needs identified three broad problem types characteristic of the scientific programme, which may drive the technologies selected. These are problems where the key challenges are, respectively: extreme scaling, intensive use of memory, and the rapid access and analysis of large volumes of data.

This technical case also establishes a matrix mapping the three services to the different areas of the science programme. It is expected that the peer review of the scientific case will be fed back through this mapping by STFC to tension the resources allocated to the three services. Some detailed case studies are provided for each service to show the typical usage pattern addressed by each service.

Data intensive: These are problems involving data-driven scientific discovery, for example the confrontation of complex simulation outputs with large-volume data sets from flagship astronomical satellites such as Planck and Gaia. Over the period 2015-18, data intensive research will become increasingly vital as the magnitude of both simulated and observed data sets in many DiRAC science areas enters the Petascale era. In addition to novel accelerated I/O, this service may include both an accelerated cluster and a shared memory component to add significant flexibility to our hardware solution to the Big Data challenges which the DiRAC community will face in the coming years. Some components of the associated computation support fine-grained parallelism and may be accelerated, while other components may require the extreme flexibility associated with symmetric multiprocessing. Novel storage components, such as non-volatile memory in the storage subsystem, may be very useful for these problems.

Extreme scaling: These problems require the maximum computational effort to be applied to a problem of fixed size. This requires maximal interconnect and memory bandwidth, but relatively limited memory capacity; a good example scientific problem is Lattice QCD simulation in theoretical particle physics. This field provides theoretical input on the properties of hadrons to assist the interpretation of experiments such as the Large Hadron Collider. Lattice QCD simulations involve easily parallelized operations on regular arrays, and can make use of the highest levels of parallelism within processing nodes possessing many cores and vector instructions.

Memory intensive: These problems require a larger memory footprint as the problem size grows with increasing machine power; a good example scientific application is the simulation of dark matter and structure formation in the primordial universe, where additional computing power enables a larger simulated volume of the universe. As structure forms, the denser portions of the universe experience greater gravitational forces, and these "hotspots" require a disproportionate amount of work. The hotspots present greater challenges to parallelization, and processors with the best performance on serial calculations may therefore be preferred.

The DiRAC technical working group has conducted a technology survey (under a formal Request for Information (RFI) issued by UCL) for possible solutions that will provide a step change in our scientific output in calendar years 2015 and 2016. Many useful vendor responses were received and indicate several competing solutions within the correct price, power and performance envelope for each of the three required service types. Significant changes in available memory and processor technology are anticipated in Q1 2016, and in particular the availability of highly parallel processors integrating high bandwidth 3D stacks of memory will be transformative for those codes that can exploit them.

2 Technical case for support

The aim of DiRAC-3 is to provide the computing hardware that will enable a step change in scientific capability for the DiRAC science projects in theoretical particle physics and astronomy. These projects were described in the DiRAC-3 Science Case, which has been submitted to STFC for peer review. In particular, the Science Case describes all aspects of the match of these projects to the STFC Science and Strategy, and places them in their scientific context. The present document is intended to translate the scientific requirements into a service specification, and to lay out a technically and economically sound plan for procuring and operating the services required to enable the science programme.

The step change in scientific reach requires for most projects an order of magnitude increase in the computing power available over DiRAC-2 resources, with a more modest increase in the available storage and storage bandwidths. A subset of the projects require more than an order of magnitude increase in available storage and storage bandwidth.

DiRAC-3 will define three broad service categories and invite potential service providers, such as Universities, to bid to procure and operate one or more of these services. Successful bidders will participate in a coordinated procurement run on behalf of STFC. The service providers will subsequently operate services meeting these requirements for the STFC science communities, with resources allocated by STFC panels. As with DiRAC-2, the facility will be integrated in a single project and user management system, with integrated job accounting and facility operations reporting managed by the DiRAC Technical Working Group.

1.1 Service description

The DiRAC consortium undertook a requirement identification process through March 2014, and in April 2014 issued a formal request for information from vendors relating to their products in 2015/16. Three classes of need were identified: "Data intensive", "Extreme scaling", and "Memory intensive". The broad characteristics of each are summarised below.

Service           | Node count | Node parallelism | Interconnect bandwidth | Node memory | I/O bandwidth
Data intensive    | High       | High             | High                   | Maximal     | Maximal
Extreme scaling   | Maximal    | Maximal          | Maximal                | Limited     | Limited
Memory intensive  | Maximal    | High             | High                   | Maximal     | High

Data intensive: The data intensive portion of the workload involves processing petabyte-scale datasets, both simulated and observational. This includes the confrontation of realistic theoretical models with the next generation of astronomical datasets, as well as purely theoretical problems where advancing power has turned the analysis of simulation outputs into a very significant I/O challenge.

Extreme Scaling: Extreme scaling involves strong-scaling problems, where extreme computational power must be focused on a problem of fixed size. Power efficiency, price efficiency and sufficiently fast communication networks are the technological barriers to scaling problems of limited size across many nodes to obtain high performance. The DiRAC problems associated with this system have geometric regularity and can in principle scale both to very large numbers of nodes and to wide SIMD instructions, after some software engineering investment. To obtain reasonable strong-scaling performance, computing nodes must refer frequently to off-node data, and it is necessary that the per-node interconnect bandwidth not fall much below 1/10th of the bandwidth to the lowest level of cache.
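To make this balance concrete, the following minimal Python sketch checks a candidate node against the 1/10th rule described above; the bandwidth figures are illustrative assumptions rather than vendor-quoted numbers.

    # Check whether a candidate node meets the strong-scaling balance rule above:
    # per-node interconnect bandwidth >= 1/10 of the bandwidth to the lowest level of cache.
    # All figures below are illustrative assumptions, not vendor-quoted numbers.

    def meets_strong_scaling_rule(interconnect_gb_s, cache_bw_gb_s, min_ratio=0.1):
        return interconnect_gb_s >= min_ratio * cache_bw_gb_s

    # Example: an assumed 400 GB/s last-level cache paired with a 25 GB/s network port.
    print(meets_strong_scaling_rule(25.0, 400.0))   # False: a faster network is needed
    print(meets_strong_scaling_rule(50.0, 400.0))   # True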

Memory Intensive: The memory intensive class of problem involves weak scaling, where adequately resolved simulations of the largest problem sizes are required. These problems have a somewhat lower network bandwidth requirement, while memory capacity, memory bandwidth and I/O bandwidth are significant technology challenges. Within this class of problem, there are DiRAC applications for which the arithmetic intensity is not uniform across the dataset, and load balancing becomes an issue. Where there are load balancing issues there is a clear preference for obtaining the same peak performance from fewer but faster processor cores, rather than from more but slower processor cores. Nodes constructed from fewer but faster processor cores are typically less power efficient for the same peak performance, so for these algorithms the design point must be selected to optimize for a mix of serial and parallel components.

The memory footprint for a simulation eight times larger than those already performed using DiRAC-2 resources is 250TB. This should be provided in a system with excellent memory bandwidth, and the load balancing sensitive algorithms mean that there is a serial component to the problem.
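A minimal sketch of the corresponding capacity planning is given below; the 250TB target is from the text above, while the per-node memory options are illustrative assumptions.

    # Weak-scaling memory estimate for the Memory Intensive service.
    # The 250 TB figure (an 8x larger simulation than DiRAC-2) is taken from the text;
    # the candidate per-node memory capacities are assumptions.

    target_footprint_tb = 250
    implied_dirac2_footprint_tb = target_footprint_tb / 8
    print(f"implied DiRAC-2 footprint: ~{implied_dirac2_footprint_tb:.0f} TB")

    for node_mem_gb in (256, 512, 1024):
        nodes_needed = target_footprint_tb * 1024 / node_mem_gb
        print(f"{node_mem_gb} GB/node -> at least {nodes_needed:.0f} nodes just to hold the problem")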

Service integration: It is possible that a single highly capable computing system could simultaneously fulfill all of the above requirements. However, this may not be the most cost-effective approach, since a tailored mix of architectures can be cheaper than buying a single system with maximal capability in all areas. A simple analogy may be useful: a two-car family may choose one people carrier and one small car; insisting on a people carrier that achieves 60 MPG, even if possible, could be an unnecessarily expensive solution.

Silicon Graphics and Cray proposed architectures that can include a mix of computing nodes of different types, and could potentially meet all three requirements. However, it is also quite possible that distinct technological solutions from different vendors may be price optimal for each of the service classes. For the DiRAC-1 and DiRAC-2 procurements, systems from multiple vendors were “right-sized” to the scientific application mix. These systems were integrated under a single user-account, project, accounting and monitoring framework. DiRAC has, perhaps, set a standard for the coherent and coordinated operation of multi-site UK e-Infrastructure in an integrated framework. Such integration allows mobility of research codes between systems. We believe that DiRAC-2 has demonstrated that, in principle, all tiers of UK HPC provision could be integrated and that this would be a worthwhile step.

The responses of vendors to our request for information enable us to demonstrate that technological solutions enabling our various scientific goals will almost certainly exist. Since complete information will only be available in the future, DiRAC-3 will remain completely open to procuring a mix of solutions "right sized" to match the mix of scientific problems, using the most cost effective approach identified by benchmarking metrics. Currently we can project the capital and operational budgets given the budgetary advice of the vendors. These figures would only become definite under binding quotations made during competitive tendering, and the corresponding final system size obtained within a fixed budget could vary.

1.2 Track record

The DiRAC consortium has a significant track record of procuring, installing and operating scientific computing resources for the STFC theory community. This includes a substantial technical hosting capability at a number of sites, with both infrastructure support from several Universities and a significant body of skilled technical staff. The DiRAC consortium has installed and operated two generations of computing systems following a three to four year replacement cycle. The DiRAC-1 computing systems were installed in 2009 and the DiRAC-2 computing systems in 2012. The current DiRAC-2 resources comprise five systems installed in Cambridge (two systems), Durham, Edinburgh and Leicester.

Site       | Technology         | Cores                         | Peak performance                           | Storage
Cambridge  | Infiniband cluster | 4800                          | 100 Tflop/s                                | 1PB
Cambridge  | SGI UV2000 SMP     | 1856 (Xeon) + 1860 (Xeon Phi) | 42 Tflop/s (Xeon) + 37 Tflop/s (Xeon Phi)  | 146TB
Durham     | Infiniband cluster | 6740                          | 150 Tflop/s                                | 2PB
Edinburgh  | IBM BlueGene/Q     | 98304                         | 1260 Tflop/s                               | 1PB
Leicester  | Infiniband cluster | 4352                          | 95 Tflop/s                                 | 0.8PB

STFC has managed the DiRAC resources through a number of committee and oversight structures. The project management board (PMB) is responsible for the organization and decision making of the consortium. The PMB chairs and technical working group (TWG) chair report regularly to an Oversight Committee (OSC). The computer time is managed by a resource allocation committee (RAC). RAC chairs are appointed by PMB, and the RAC manages the peer review and scientific tensioning of calls for proposals.

The service has been run as a distributed facility, integrated by the DiRAC TWG. The TWG meets biweekly and, with the help of the Edinburgh Parallel Computing Centre, has implemented a unified system administration framework (SAFE). SAFE is also used by the HECToR and ARCHER national facilities and provides a unified project and user account management interface. SAFE also records every DiRAC job submitted through the distributed system, and provides visibility of the functioning of the system to the TWG, the RAC and STFC's Oversight Committee.

Computing cycles delivered: Since the DiRAC-2 service started, over several billion core-hours have been used for scientific computation by the STFC theory community, with utilization of order 90%. A chart of the usage by project, for the accounting year 1 April 2013 through 1 April 2014, is given below. The chart shows both core-hours and the percentage of the total DiRAC core-hours used by each DiRAC project, on a logarithmic scale. For simplicity of illustration we treat all core-hours as equivalent, regardless of the architecture used and the code efficiency.


Ordered from left, projects 1, 2, 3 & 6 require extreme scaling and are based on sparse matrix PDE solution. Projects 4, 5 & 10 are memory intensive, simulating the largest possible multi-physics spatial grids. Projects 7, 8, 9 & 11 are data intensive: either requiring vast I/O throughput, or having a large volume of SMP code.
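The ~90% utilization figure quoted above can be reproduced directly from SAFE-style job accounting records; a minimal sketch, using invented job records and an assumed system size, is given below.

    # Aggregate utilization from per-job accounting records (records are invented examples).
    jobs = [
        {"cores": 1024, "hours": 12.0},
        {"cores": 4096, "hours": 6.5},
        {"cores": 512,  "hours": 48.0},
    ]
    used_core_hours = sum(j["cores"] * j["hours"] for j in jobs)

    total_cores = 4800              # assumed cluster size
    period_hours = 24 * 365         # one accounting year
    available_core_hours = total_cores * period_hours

    print(f"utilization = {100 * used_core_hours / available_core_hours:.2f}%")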

1.2.1 Hosting capability

The DiRAC hosting sites have made significant investment (including various mixtures of self-investment and funding from external sources) to develop advanced machine room capabilities. These comprise considerable raised floor-space, site and room power distribution capacity, and various sophisticated and environmentally friendly room and rack level cooling technologies. Network connectivity is provided through dedicated high bandwidth links to the JANET backbone. This considerable investment in hosting infrastructure provides leverage for the capital expenditure on computing hardware. One of the hosting sites is willing to invest up to £2.5M of new money in HPC hosting infrastructure to support DiRAC-3.

Site       | Floorspace | Power/Cooling
Cambridge  | 328 m²     | 1 MW
Durham     | 200 m²     | 1 MW
Edinburgh  | 300 m²     | 2.7 MW
Leicester  | 30 m²      | 300 kW

1.3 International competition

We can compare our resources to the current, and to some extent planned, computational resources available to the international competitors and collaborators of UK scientists working in the scientific areas of DiRAC. Upgrade plans are tabulated where known; this list is necessarily incomplete.

Country | Site                          | Year | System
Japan   | Kobe                          | 2011 | 10.5 PF K-computer
Germany | Juelich                       | 2012 | 5.9 PF BlueGene/Q
Italy   | Cineca                        | 2012 | 1.7 PF BlueGene/Q
Japan   | Tsukuba                       | 2012 | 0.8 PF HA-PACS
Japan   | KEK                           | 2012 | 1.26 PF BlueGene/Q
US      | Argonne/Mira                  | 2012 | 10 PF BlueGene/Q
US      | ORNL/Titan                    | 2012 | 27 PF Cray XK7
US      | NCSA/BlueWaters               | 2013 | 13 PF Cray XE6/XK7
US      | NERSC/Edison                  | 2014 | 2.6 PF Cray XC30
US      | NERSC/Cori                    | 2016 | 28 PF Cray XC30/KNL
US      | Oak Ridge, Argonne, Livermore | 2017 | >300 PF combined upgrade

In the US, three large systems can be considered for which allocations have been made public on the facilities' web pages: across these US systems, allocations in the 20% to 40% range were made to DiRAC scientific areas.

– The 2014 INCITE allocation of the 27PF Titan system totaled 2.2 billion hours. Of this, one third was made to the DiRAC scientific areas (particle physics, nuclear physics, MHD and astronomy).
– The BlueWaters system displays usage by scientific domain, with 40%-80% typical for Particle Physics and Astronomy.
– In 2013/2014 the allocations to our US competitors in DiRAC science areas on the 10PF Mira system totaled over 2 billion core-hours. This 30-40% scale of allocation is consistent with the other large US systems.
– A substantial part of the 5.9PF Juelich system has been used by the BMW lattice QCD collaboration.

Many of the above internationally competing systems will be upgraded, and it is their replacements that will compete with DiRAC-3. The information in this section demonstrates that a substantial fraction of large US, European and Japanese machines is being used to compete in the DiRAC scientific areas. The DiRAC-3 upgrade is required to maintain scientific competitiveness.

1.4 Risk Management

The Risk Register for the DiRAC-3 installation is included as an appendix to this document.

Following a careful Request for Information exercise and requirements identification, the TWG have identified the Intel Knights Landing (KNL) processor as the best candidate for one, and possibly two of the services. This processor is in development, and the timing of the procurement should be chosen to align with general availability, presently targeted at Q1 2016. The dependence of this strategy on successful product development by Intel is the most significant risk in the project.

Delay of the KNL processor by Intel is addressed by retaining sufficient flexibility to commit a significant component of capital expenditure throughout fiscal year 2016/17. Cancellation is viewed as unlikely; in that event the extreme scaling service could still be meaningfully upgraded with GPU accelerated computing nodes. The other services can deliver their goals with standard Xeon technologies.

1.5 Impact

DiRAC has significant societal impact through the deep public interest in the fundamental nature of our science. This is reflected by a large number of news articles about the research areas included in the DiRAC science programme. In particular there has been worldwide news activity about the 2012 discovery of the Higgs Boson at the Large Hadron Collider, and the associated Nobel Prize won by Professor Higgs in 2013. This completed the observation of the standard model of particle physics. A major theme of DiRAC research is to calculate hadron properties to assist the Large Hadron Collider in its search for new discoveries. Edinburgh DiRAC academics recently lectured in a massively open online course explaining the Higgs Boson and its role in the standard model and cosmic inflation to many thousands of members of the public.

Recently, astonishing evidence of gravitational waves in the cosmic microwave background was discovered by the American BICEP project. These waves are indicative of new laws of physics, even more fundamental and potentially simpler than the standard model, entering at a Grand Unification scale. Further, they back up the idea of inflation in the early universe as a mechanism for the exponential primordial expansion. The COSMOS project within DiRAC makes use of the European Planck satellite’s observations of the cosmic microwave background, and requires DiRAC-3 to confirm and improve upon the observations of BICEP.

Another major theme of DiRAC activity concerns large-scale structure and galaxy formation. The DiRAC researchers in the Virgo Consortium have performed a series of groundbreaking simulations that have regularly received widespread attention both from scientists and in the news. The work of UK Virgo members has been recognised most recently by the 2014 Eddington Gold medal received by Carlos Frenk (also a Gruber prize winner in 2011), and in the award of the 2014 Shaw Prize in Astronomy to Shaun Cole and John Peacock (together with a US astronomer).

Examples of mainstream press activity covering the subject areas within DiRAC research are given below:
http://www.bbc.co.uk/news/world-18702455
http://www.bbc.co.uk/news/science-environment-24436781
http://www.bbc.co.uk/news/science-environment-26605974
http://www.bbc.co.uk/news/science-environment-21866464

The DiRAC community regularly engages directly with the public, in events ranging from BBC Stargazing Live through to public lectures and workshops in primary schools. The DiRAC collaboration also trains a steady stream of graduate students and post-doctoral researchers, whose career progression in many cases ultimately takes them to disparate fields throughout the economy. Career destinations include the financial sector, as well as varied destinations such as the Meteorological Office and technology and HPC companies such as NVIDIA. Many researchers from particle physics have also assumed key roles in the Edinburgh Parallel Computing Centre. The EPCC has, for several decades, played a leading role in HPC provision and education for the academic and industry communities across the UK and the EU. The impact of placing HPC provision in higher education institutions cannot be overstated: the highest bandwidth knowledge transfer involves the motion of educated brains from academia to industry. Embedding specialist HPC resources in multiple places of learning, for use by intelligent early career researchers, establishes a far greater flow of knowledge than centralised HPC provision by centres retaining a handful of experts in knowledge transfer positions.

1.5.1 Industrial Collaboration

The UK lattice gauge theory community has an excellent record of industrial engagement in the development and exploitation of computing hardware for simulation. The community has twice partnered with IBM Research in the last decade to develop national simulation resources. Firstly, between 2000 and 2004 the UKQCD community developed the QCDOC supercomputer with Columbia University and IBM Research. This was operated as a national resource until 2009. Our collaborators within IBM went on to develop the general purpose BlueGene supercomputer designs that became IBM's premier HPC product.

In 2007, Edinburgh University was invited by IBM to enter a project to jointly develop the BlueGene/Q computer chip. Theoretical physicists among Edinburgh and Columbia's academic staff designed the memory prefetch engine for IBM, while using QCD codes as a performance optimization benchmark. This academic-industrial collaboration on internationally leading technology is globally unique, and perhaps one of the best examples of hardware/software co-design. The resulting designs have subsequently been deployed at flagship, internationally leading supercomputing sites around the world, including in the USA (Argonne, LLNL), the UK (Daresbury), Italy (CINECA), Germany (Juelich), and Japan (KEK). These systems are accelerating scientific computing in diverse applications ranging from computational biology through CFD and even nuclear stockpile stewardship.

Edinburgh University has coauthored four US patents and several scientific publications about the BlueGene/Q design with IBM. One of these joint papers between Edinburgh and IBM won the Gauss Award at the 2012 International Supercomputing Conference. The Edinburgh codes are highly efficient, sustaining around 30% of peak performance, with key routines delivering as much as 71% of peak performance. The codes are freely available and are used around the world for QCD simulations; for example, Edinburgh wrote the code for, and co-authored, a Gordon Bell Prize finalist paper with Lawrence Livermore National Laboratory at Supercomputing 2013, sustaining 7.2 Pflop/s on the Sequoia BlueGene/Q system. The BlueGene/Q system was designed to be particularly energy efficient and topped the Green500 list as the most energy efficient computer in the world from 2010 until 2012.

COSMOS has a long-standing collaboration with SGI, which was joined by Intel in 2003. This has ensured early access to state-of-the-art SMP machines and COSMOS had the first large Origin 2000, Altix Itanium 3000, UV1000 and UV2000. This UV2000 was also the first large shared memory machine with Xeon Phi coprocessors, with the "MG" blades housing them being specifically designed for and with COSMOS.

More recently, DiRAC has been very successful in attracting Intel Parallel Computing Centre (IPCC) investment to help reengineer our codes for novel and emerging architectures such as the Intel many-core devices. Two centres in Cambridge and one in Edinburgh have been opened with the primary goal of supporting the aggressive optimization of COSMOS and UKQCD collaboration codes, and of investigating the optimization of structure formation codes for these devices. A third IPCC has been established in Cambridge to support the Square Kilometre Array project. The combined value of these investments is several hundred thousand pounds p.a., providing three FTE of software support.

1.6 RFI based Technology Survey

The underlying technologies will be developed by the computing industry. The DiRAC-3 service delivery will, however, involve significant effort on the part of the DiRAC Technical Working Group, Project Management Board and a Technical Benchmarking working group to select the appropriate mix of technologies.

To prepare for this exercise, in March 2014 DiRAC carried out a Request for Information (RFI) exercise to survey candidate technologies for each of the three service requirements. It is anticipated that these technologies might be combined in an appropriate mix to deliver the DiRAC-3 service. Although this exercise is necessarily forward looking and considerable uncertainty remains, seeking the best information available and being as technologically informed as possible while planning for service procurement reduces the risk to a minimum.

The following vendors and resellers made very useful responses to the RFI: Cray Inc; Silicon Graphics; Dell; Hewlett Packard; IBM; OCF; Clustervision.

We provide some summary charts of the responses; the thousands of pages of responses can be made available on request. Various technologies were suggested by the vendors in the areas of processing, interconnect and storage. For the computing element of the systems proposed by vendors, a broad summary of the electrical power, computational power, interconnect and memory performance in relation to pricing is given in the table below.

Architecture             | Peak/Power (PF/MW) | Peak/Capex (PF/£M) | Peak/TCO (PF/£M) | Network/Peak (GB/s per TF) | Memory BW/Peak (GB/s per TF)
Intel KNL self-host      | 8                  | 0.6                | 0.4              | 8-16                       | 200
Intel Broadwell cluster  | 2                  | 0.12               | 0.1              | 12-24                      | 150
Accelerated cluster      | 5                  | 0.4                | 0.3              | 3-10                       | 200-400
SMP                      | 1.6                | 0.06               | 0.05             | 64                         | 150
Accelerated SMP          | 2.6                | 0.09               | 0.08             | 20                         | 200
BlueGene/Q               | 2.4                | 0.1                | 0.09             | 200                        | 200

It is worth noting that the KNL based systems from Cray and SGI appear the most efficient in both performance/power and performance/capex in 2016, by a substantial margin. The peak performance of both Xeon and KNL can only be obtained by using vector instructions, and the vector length of KNL is twice that of Xeon. For codes in the DiRAC application base that only use scalar operations, the relative advantage between KNL and Xeon is therefore reduced by a factor of two, but KNL systems should retain a reasonable advantage for the subset of scalar codes that can be efficiently threaded.
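The effect described here can be illustrated with a simple Amdahl-style estimate; the peak figures are taken from the survey table above, while the vector widths and vectorizable fractions are assumptions for illustration only.

    # Effective node performance as a function of the vectorizable fraction of the work.
    # A fraction f of the flops runs at peak; the remainder runs at the scalar rate
    # peak/vector_width. Vector widths of 8 vs 4 doubles reflect KNL's wider vector unit.

    def effective_gflops(peak, vector_width, f):
        return peak / (f + (1.0 - f) * vector_width)

    knl_peak, xeon_peak = 3000.0, 600.0   # Gflop/s per chip, from the survey table
    for f in (1.0, 0.5, 0.0):
        ratio = effective_gflops(knl_peak, 8, f) / effective_gflops(xeon_peak, 4, f)
        print(f"vectorizable fraction {f:.1f}: KNL/Xeon advantage ~{ratio:.1f}x")

With a fully vectorized code the ratio is ~5x; for a purely scalar code it falls to ~2.5x, reproducing the factor-of-two reduction described above.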

The 500kW BlueGene/Q system, used to provide around 60% of the DiRAC-2 computing cycles, represents a baseline for the extreme scaling service. It can be seen that only GPU accelerators or systems based on the Knights Landing processor improve on its power efficiency. This improvement is required to provide a substantial upgrade within a reasonable power envelope. Knights Landing is projected to be first available in Q1 2016.
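As a rough sizing exercise (a sketch only: the efficiency figures are the budgetary estimates from the survey table, while the capital and power envelopes are illustrative placeholders), the deliverable peak of a KNL-based system is the smaller of the power-limited and capex-limited sizes:

    # Rough sizing of a KNL-based system against capital and power envelopes.
    pf_per_mw = 8.0          # Peak/Power from the survey table
    pf_per_million = 0.6     # Peak/Capex from the survey table

    budget_millions = 10.0   # assumed capital envelope (£M)
    power_mw = 0.55          # assumed power envelope (MW)

    power_limited = pf_per_mw * power_mw
    capex_limited = pf_per_million * budget_millions
    print(f"deliverable peak ~{min(power_limited, capex_limited):.1f} PF "
          f"(power-limited {power_limited:.1f} PF, capex-limited {capex_limited:.1f} PF)")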

The ratio of network bandwidth to peak performance is a significant metric for the capability of a computer to scale problems to very large numbers of nodes, since when a problem is spread very widely the portion on each node is small and off-node references become common. The table above clearly shows the trend that 3D memory packaging will essentially solve the memory bandwidth problem by changing the price proposition for many high speed wires. A corresponding breakthrough for long range interconnect may, in future, come from silicon photonics, but not within timeframes relevant to DiRAC-3.

1.6.1 Processor technology survey

There are several possible processing technologies planned for the relevant time frames. The most standard of these is based on the server versions of Intel multi-core chips, essentially a continuation of the long running and standard Xeon approach. These represent a stable software platform, although some codes find it difficult to use vector instructions efficiently. Some novel architectures were also proposed, including processing based on NVIDIA graphics processor units and Intel Knights Landing many-core chips. All of these are future processors; since the introduction of new technologies is a challenging and complex task, details of plans may necessarily change before the final procurement.

Accelerator specific considerations: The GPU and Intel many-core chips are novel products which substantially improve the power efficiency and memory bandwidth per computing device. The NVIDIA power efficiency quoted above is based on a balance of two GPUs per hosting node; the overhead of the host server reduces the power efficiency of the device somewhat compared to the devices in isolation. Submissions were made using the Knights Landing product in two configurations: as an accelerator, which yields a similar power performance to GPUs, and in a self-hosting mode with better power performance.

Technology                                | Peak performance per chip | Peak performance/watt (system level) | Availability estimates
Intel multi-core (Broadwell)              | 600 Gflop/s               | 2 PF/MW                              | 2015/6
NVIDIA GPU (Pascal)                       | >2000 Gflop/s             | 5 PF/MW                              | 2015/6
Intel many-core Knights Landing self-host | >3000 Gflop/s             | 8 PF/MW                              | Q1 2016
Intel many-core Knights Landing hosted    | >3000 Gflop/s             | 5 PF/MW                              | Q1 2016

Novel architectures bring the disadvantage of novel programming models. The Knights Landing product maintains source compatibility with the standard MPI programming model in C, C++ and Fortran, while the NVIDIA product in particular would require substantial software re-engineering, and this may only be possible for a subset of the DiRAC code bases. The power efficiency of self-hosted Knights Landing systems is fortuitous, since the self-hosting mode also provides software compatibility with existing multi-core processor based clusters: self-hosting presents both the best software compatibility and the best power efficiency.

1.6.2 Interconnect technology survey

The interconnect of a parallel computing system has a major impact on application performance. Sufficient bandwidth between nodes must be provided to ensure that the application runtime is dominated by computation and the nodes are not left waiting excessively for inter-node communication. The fastest communication networks can constitute a sizable proportion of the cost of a supercomputer. Since the balance between local computation and communication depends on the scientific application and algorithm, there may be cost savings if the most expensive networks are procured only for those portions of the scientific codes that require them.
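A crude estimate of this balance for a regular halo-exchange code is sketched below; every input is an illustrative assumption rather than a measured DiRAC figure, but the structure of the estimate is what a procurement benchmark effectively measures.

    # Communication/computation balance for a regular 3-d halo-exchange code.
    # Every input is an illustrative assumption, not a measured DiRAC figure.
    local_extent = 64                  # lattice sites per dimension held on one node
    bytes_per_surface_site = 2304      # halo data exchanged per surface site
    flops_per_site = 4000              # arithmetic per site per iteration

    sustained_flops = 1.0e12           # 1 Tflop/s sustained per node
    network_bw = 12.5e9                # 12.5 GB/s per node

    volume_sites = local_extent ** 3
    surface_sites = 6 * local_extent ** 2

    t_compute = volume_sites * flops_per_site / sustained_flops
    t_exchange = surface_sites * bytes_per_surface_site / network_bw
    print(f"compute {t_compute * 1e3:.2f} ms vs halo exchange {t_exchange * 1e3:.2f} ms per iteration")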

Our RFI exercise has identified that in most cases the vendors are using broadly similar and commodity interconnect technology, based on the Infiniband standard. Infiniband bandwidth has evolved through a series of generations (QDR, FDR, EDR) with data rates increasing by a factor of two with each generation, and the price structure reflects the data rate. There are several proprietary networks worth mentioning: SGI is developing Hypercube interconnected systems based on the Intel StormLake-1 interconnect. This interconnect is broadly similar to, and competitive with, Infiniband EDR, and Knights Landing nodes may use two links per processor. Cray has a proprietary high performance all-to-all (Dragonfly) interconnect. SGI has also submitted shared memory systems based on a highly capable NUMAlink cache coherent interconnect, but this adds significant expense.

Interconnect technology | Bandwidth per port (transmit+receive) | Bandwidth per processor (transmit+receive)
QDR Infiniband          | 8 GB/s                                | 2-8 GB/s
FDR Infiniband          | 13.6 GB/s                             | 3.4-13.6 GB/s
EDR Infiniband          | 25 GB/s                               | 6.25-25 GB/s
Intel StormLake-1       | 25 GB/s                               | 12.5-50 GB/s
Cray Dragonfly          | 32 GB/s                               | 16-32 GB/s
SGI NumaLink            | 32 GB/s                               | 32 GB/s
BlueGene/Q              | 42 GB/s                               | 42 GB/s

The ratio of network ports to the number of processors and/or accelerators in a system is a key tunable characteristic that significantly affects the system price and the delivered application performance. Benchmarking during procurement is the key mechanism that will ensure the right balance of capital expenditure on network performance, computational performance and storage subsystems. The interconnect should be fast enough to ensure the hardware is used efficiently, without spending more than is required.

It is worth noting that the DiRAC-2 BlueGene/Q service provided 42 GB/s interconnect bandwidth for a 205Gflop/s computing node. Despite node performance growing in some cases to 3Tflop/s, only one vendor (SGI) plans an interconnect (slightly) exceeding this bandwidth.
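The point is easily quantified; the BlueGene/Q figures are those quoted above, and the comparison node is an assumed 3 Tflop/s node with a single 25 GB/s port.

    # Network bandwidth per unit of node peak performance.
    nodes = {
        "BlueGene/Q (DiRAC-2)": (42.0, 0.205),                   # GB/s, Tflop/s (from the text)
        "assumed 3 Tflop/s node with 25 GB/s port": (25.0, 3.0),
    }
    for name, (bw_gb_s, peak_tf) in nodes.items():
        print(f"{name}: {bw_gb_s / peak_tf:.0f} GB/s per Tflop/s")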

1.6.3 Storage technology survey

High performance and large capacity storage is a significant requirement for all three types of service, although the level of requirement differs between them. The three most common media technologies are non-volatile memory, magnetic disk, and magnetic tape. Non-volatile memory includes solid-state disk and similar, but more server oriented, devices. Given the pricing and speed of these technologies, a sensible cost optimization approach includes a mix, with long term bulk storage residing on tape and frequently accessed data remaining on redundant arrays of inexpensive disks. Hierarchical storage manager software solutions make the distinction transparent to the user, with a suite of devices appearing as a single and easy to use file system. The predicted storage pricing is summarized below.

Technology | Pricing (PB/£M) | Speed
Flash      | 1 PB/£M         | Fast
Disk       | 3 PB/£M         | Medium
Tape       | 10 PB/£M        | Slow

The balance between storage types may be tailored to each service, balancing the need for high speed access to data that is currently being used against slower speed access to longer term data. Advanced software, such as hierarchical storage managers, was proposed by several vendors and is commonly used to make the management transparent and to provide multiple copies of critical data to ensure integrity.
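An indicative costing of a tiered configuration, using the survey pricing above, is sketched below; the capacity split across tiers is an illustrative assumption, not a proposed DiRAC-3 configuration.

    # Indicative cost of a hierarchical storage mix versus a single-tier solution,
    # using the survey pricing above (Flash 1, Disk 3, Tape 10 PB per £M).
    price_pb_per_million = {"flash": 1.0, "disk": 3.0, "tape": 10.0}
    capacity_pb = {"flash": 0.5, "disk": 4.0, "tape": 10.0}   # assumed tier sizes

    tiered_cost = sum(capacity_pb[t] / price_pb_per_million[t] for t in capacity_pb)
    total_pb = sum(capacity_pb.values())
    all_flash_cost = total_pb / price_pb_per_million["flash"]

    print(f"{total_pb:.1f} PB tiered for ~£{tiered_cost:.2f}M, versus ~£{all_flash_cost:.1f}M if all flash")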

Since storage is typically not filled immediately and since prices decrease over time, it is common best practice in procurement to install storage in a series of tranches, spending 4/7th, 2/7th and 1/7th of the storage capital over three years. With storage prices roughly halving year on year, this profile yields an equal volume of newly installed storage in each year. In a hierarchical storage system, this profile may only apply to the lowest layer in the hierarchy, such as tape. DiRAC expects to negotiate and commit to such an installation profile with successful vendors and sites.

1.7 Benchmarking

The large DiRAC projects have provided benchmarks to the TWG that represent the bulk of the computing cycles currently allocated by STFC's RAC to DiRAC science. These will be used to benchmark the hardware bids for each of the three services during hardware procurement. A shared repository on the RCUK Research Data Facility has been set up by the TWG for holding these benchmarks; the TWG will be responsible for making them available as part of a vendor package for each system during the coordinated procurement of hardware. The codes are listed in the following table.

Project | Code           | Service          | Project | Code                | Service
dp002   | MODAL          | Data Intensive   | dp007   | BQCD-clover         | Extreme Scaling
dp002   | CACTUS         | Data Intensive   | dp008   | Bagel-dwf           | Extreme Scaling
dp005   | Gadget-2-agn   | Data Intensive   | dp008   | Bagel-clover        | Extreme Scaling
dp005   | Gadget-2-disc  | Data Intensive   | dp008   | Bagel-wilson        | Extreme Scaling
dp019   | MILC-hisq      | Data Intensive   | dp008   | Bagel-wilsonAdjoint | Extreme Scaling
dp020   | TROVE          | Data Intensive   | dp009   | HiRep/WilsonAdjoint | Extreme Scaling
new     | PRIMAL         | Data Intensive   | dp015   | sphNG               | Data Intensive
new     | SMAUG          | Data Intensive   | dp015   | TORUS               | Data Intensive
dp004   | Swift          | Memory Intensive | dp016   | RAMSES-AMR          | Data Intensive
dp004   | Gadget-eagle   | Memory Intensive | dp018   | Enzo                | All
dp004   | Gadget-4       | Memory Intensive | dp010   | Lare3d              | All
dp004   | RAMSES-ECOSMOG | Memory Intensive |         |                     |

1.8 Procurement strategy

We will describe in this section how DiRAC proposes to tension, procure and operate the three services. The considerations that must be balanced are the relative scientific weight of DiRAC science codes, how this mix of codes should be mapped to hardware, and the value added by siting or hosting decisions for the services.

Scientific refereeing (Completed June 2014): The DiRAC science case has been submitted to international peer review, organized by the RAC chairs, and the referee assessments of the DiRAC projects will be returned to the Resource Tensioning Panel.

Technical refereeing (July 2014): The STFC Computing Advisory Panel (CAP) have been asked to appoint independent and internationally respected experts in HPC provision to referee this technical proposal. They will also be asked to report on four aspects for each of the three proposed services (and four machine requirements): the cost-effectiveness in both capital (1) and recurrent expenditure (2), the efficiency of the proposed hardware for the target applications (3), and the efficiency of the hardware for a broader range of HPC applications (4). CAP has been asked to ensure that this refereeing takes place in July 2014, with reports returned to the Resource Tensioning Panel.

Resource Tensioning Panel (September 2014): The RAC chairs will be asked to construct an independent scientific tensioning panel to allocate capital resources to the services in September 2014. This tensioning panel will take into account the scientific merit of the DiRAC projects for which each service is viewed as ideal, the technical assessments of the cost-effectiveness, the total cost-of-ownership of DiRAC, and the level of freedom each specified service will subsequently give to RAC for allocating time to multiple science domains. This document includes three price points for each service, designed to allow the Resource Tensioning Panel to interpolate or extrapolate system metrics to whatever level of funding the panel decides.

A key outcome of the Resource Tensioning Panel will be the capital expenditure profile, with the broad balance between expenditure in 2015/16 and 2016/17 being fixed by their decisions. The capital expenditure profile and technology balance decided upon will be communicated to the DiRAC Project Management Board for comments, which would then be returned to STFC by November 2014.

Service provider call and selection (14 October 2014 – 14 December 2014): STFC will select a coordinating University to organize a two stage coordinated procurement for the three services. The coordination of procurement will enable STFC to leverage the bargaining power of scale should the same vendor be the most technologically appropriate for two or more of the services. University College London played this role for DiRAC in organizing the DiRAC Request for Information, and would be non-conflicted in reprising this role for selection of service providers and the coordination of procurement.

The coordinating University will run an open competition for Service Providers to bid to run one or more of the three services. The call will be issued in October 2014 with submissions received by December 1 2014. Bids will be required to include detailed information:

- Specification of at least two technological solutions that could provide the service;
- Electrical power, cooling and recurrent electrical budgets sought from STFC;
- Infrastructure capacity and margin for the Host to operate the system;
- Service Level Agreement terms;
- Staffing levels and salaries sought from STFC for operating the system;
- Staffing levels and salaries sought from STFC for user support for the system;
- Monetary and equivalent value added by the Bidder. Value add considerations include:
  - Electrical budget, hosting infrastructure investment and similar;
  - Additional staffing support for user support and administration;
  - Additional staffing support for application optimization and code development;
  - Additional staffing support for the integration of the broader facility with the rest of DiRAC and the UK e-infrastructure;
  - Contribution to community training; and
  - Industrial engagement, collaboration, and industry contributions.

The STFC RAC chairs will appoint an independent technical review panel to evaluate and select Service Providers from the bids, with this panel meeting in December 2014.

Benchmarking preparation (TWG April 2014-April 2015): The large DiRAC projects have identified benchmark codes to be used in evaluating hardware for each of the service types. For each service the benchmarks submitted by those projects whose scientific merit was used to tension that service will be used as the performance evaluation metric during hardware procurement. The Resource Tensioning Panel may also instruct the TWG to include additional benchmarks in the evaluation of any of the systems.

Coordinated hardware procurement (April 2015-Dec 2015): The successful bidders for the provision of service will be contractually obliged to participate in a subsequent coordinated procurement of the hardware for these services. Each of the service categories will be evaluated against the benchmarks provided by the DiRAC projects for the science domains against which the service was evaluated in the tensioning process. The STFC Technical Working Group and the coordinating University will retain panel representation in the evaluation of the hardware vendor bids. The procurement and benchmarking may extend through 2015 for systems based on the Knights Landing processor.

Service Integration: The successful service bidders will be contractually obliged to work with the TWG to include the services in the integrated user and project administration, accounting, monitoring and help desk framework that will enable the operation of DiRAC-3 as an integrated facility. This integration will enable STFC to perform resource allocation and facility monitoring.

1.9 Timeline

Selection of service nature and provision | Begin            | End
Benchmark preparation                     | 1 April 2014     | 1 Jan 2015
Scientific refereeing                     | 1 May 2014       | Completed 8 Jun 2014
Technical refereeing                      | 1 July 2014      | 1 August 2014
Resource Tensioning Panel                 | 1 October 2014   |
Service provider call                     | 15 October 2014  | 1 December 2014
Service provider selection                | 14 December 2014 |

Hardware Phase 1                          | Begin            | End
Coordinated procurement                   | 1 April 2015     | 1 August 2015
Hardware installation                     | 1 August 2015    | 1 October 2015
RAC Service commence                      | 1 December 2015  |

Hardware Phase 2                          | Begin            | End
Coordinated procurement                   | 1 October 2015   | 1 December 2015
Hardware installation                     | 1 April 2016     | 1 December 2016
RAC Service commence                      | 1 Jan 2017       |

Service Integration                       | Begin            | End
Extend and enhance the DiRAC-2 SAFE, leveraging considerable pre-existing DiRAC software infrastructure and expertise | 1 April 2015 | Integration completed by Phase-1 commencement; incorporate Phase-2 in the second year; support over the entire duration of the service

1.10 Expenditure profile

The case studies that follow document several price points for each of the services. The division of DiRAC expenditure between the DiRAC-3 services will be determined by the Resource Tensioning Panel. This tensioning will incorporate referee responses from the Peer Review of the DiRAC-3 Science Case, and referee responses to this case. Sufficient information is provided in this Technical Case to ensure that the Resource Tensioning Panel can estimate the effect on facility performance of a reallocation of funds. The precise capital expenditure profile will be an output from the resource tensioning process and at this stage only a best guess can be given; some flexibility must be retained to move expenditure across the fiscal year boundary in response to the Resource Tensioning Panel recommendations. A proposed capital expenditure profile is as follows:

                     | Capital | Electrical power | Fiscal year
DiRAC compute        | £10M    | 550 kW           | 2015/16
DiRAC data facility  | £3M     | 20 kW            | 2015/16
DiRAC compute        | £16M    | 1200 kW          | 2016/17

3 Case Studies

The remainder of this document presents case studies intended to provide detailed service specification and justification of required features for each of the proposed DiRAC-3 computer systems, with reference to the primary scientific applications for which each is tailored.

1.11 Data Intensive (DI) Service

We propose a large-scale, distributed, data-centric cloud HPC service for PPAN science in the UK. The DiRAC-3 Science Case shows that progress in many areas of our research programme demands a step-change in our capability to handle large data sets, both to perform and analyse precision theoretical simulations and to confront them with the next generation of observational data. Many key DiRAC-3 projects also involve the exploration of high-dimensional parameter spaces using statistical techniques which generate large numbers of computationally intensive models. The DI service will support data intensive science, supporting a range of architectures (distributed memory, cache-coherent shared memory, fat nodes, accelerators, tightly-coupled storage) and programming paradigms (MPI, OpenMP, Hybrid, Accelerated). The DI service will appear to the user as a single logical system, using the cloud computing platform OpenStack. Individual DI sub-systems will have their own high speed local parallel file systems (scratch areas), but will share a common filesystem for Tertiary Storage (data products, applications software, etc.). To ensure optimal use of its state-of-the-art capabilities, the service will build on the innovative code development work within DiRAC-2 (e.g. the COSMOS@DiRAC Intel Parallel Computing Centre, Cambridge) by providing research software development support for code porting and optimisation. In addition to the workflows below, the DI service will help maintain UK leadership in projects across the PPAN science domains and facilitate the scientific exploitation of high-profile STFC-funded experiments. It will enhance the HPC skills of UK researchers and lead to new collaborations between industry and academia.

1.11.1 Illustrative workflows

Astrophysics (Science Case A.(vi)): The Gaia satellite project will revolutionise our understanding of the Milky Way, but presents many computational and theoretical challenges. DiRAC-3 Gaia modelling involves three key steps: 1) The probability distribution functions (PDFs) for the physical properties of each star are vital for a proper Bayesian analysis of the 2016 data release (for which incompleteness renders robust error distributions essential) but are not provided by the Gaia data processing. Analysis of the current PDF generation code shows that this requires ~5x10^21 flops for 10^9 stars. 2) Dynamical model construction at minimal velocity/spatial resolution requires 10^9-10^10 points. The simplest chemodynamic models require ~200GB RAM and ~2x10^19 flops (1s/object, 50-70% peak performance), while a single N-body model with N~5x10^9 particles (using PRIMAL) requires at least 9x10^6 core-hours and 2.5TB of RAM, and generates a time series of ~0.25PB. 3) The confrontation of models with data is limited by tightly-coupled storage volume. For the simplest models, including only kinematics and binned data, confrontation of a minimal set of 10^3 models with 10^9 stars requires ~7x10^19 flops (assuming linear scaling with the number of stars). More complex models must compare the full PDFs for each star. For 10^9 stars, each model involves moving ~1PB of Gaia data (see Table 1) into/out of the RAM being used; again, a minimal study requires 10^3 models. To be competitive, this can take at most 2 months (90 min/model), requiring data transfer rates of ~200GB/s (actual).

Table 1: Expected Gaia data volume comparison
Data          | N_stars | MB/star | Total
Gaia (2016)   | 10^9    | 0.2     | 200TB
Stellar PDFs  | 10^9    | 1       | 1 PB
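The ~200GB/s figure can be checked with a back-of-envelope calculation (a sketch only, using the 1PB-per-model and two-month figures quoted above):

    # Sustained data rate needed to confront 10^3 models, each streaming ~1 PB of
    # stellar PDF data (Table 1), within a two-month window.
    bytes_per_model = 1.0e15
    n_models = 1000
    window_seconds = 2 * 30 * 24 * 3600   # two months, approximately

    seconds_per_model = window_seconds / n_models
    rate_gb_s = bytes_per_model / seconds_per_model / 1e9
    print(f"~{seconds_per_model / 60:.0f} min/model -> sustained rate ~{rate_gb_s:.0f} GB/s")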
Particle physics (Science Case C.(i) and C.(ii)): Numerical solution of Quantum Chromodynamics provides key results for the experimental programme at the Large Hadron Collider at CERN. The expensive step is calculating propagators for light up and down quarks on gluon fields defined on large, fine space-time lattices. These are needed for multiple physics results and so storage, rather than re-calculation, is optimal. Storage of 2000 current 64^3x96 gluon field configurations (20TB) and 16 quark propagators (each 4GB) per configuration gives a total of 128TB. In the next 2-3 years, lattice spacings will halve for the same 4-d volume and so storage requirements will grow by a factor of 2^4 = 16. The cost of calculating light quark propagators will increase by more than a factor of 16. A typical job, once light propagators have been calculated, will then require reading in a ~200GB configuration and one or two ~60GB propagators before making temporary strange or charm propagators and combining into a meson correlation function. Calculations on a full ensemble in one month require repeatedly moving 300TB of configurations and 2PB of propagators in and out of RAM; much of these data would need to be in fast storage at any one time. Typical job lengths range from 1-24 hours, thus data rates of 5GB/s (actual) would suffice (~1 minute to read in 260GB). With additional data management overhead, the volume required in fast storage could be reduced by a further 50% using staging between fast and slow storage during the calculation.

Cosmology (Science Case A(ii)): DiRAC Planck satellite science exploitation has yielded some of the most highly cited physics papers of 2013-14, and most Planck non-Gaussian results were obtained from the COSMOS Modal Bispectrum pipelines. The current Planck satellite pipeline for modal resolution p=27 (with n=1000 modes) uses 1.1TB of shared memory at runtime, with the variance being estimated from 1000 map realizations. On DiRAC-3, Planck 'hi-res' bispectrum estimation will target double spatial resolution using p=54 modes (with the linear term scaling as p^2, the memory is estimated to be 4.3TB). The Planck trispectrum estimation (four-point correlator) is even more demanding: good convergence of the "linear term" requires 10K simulated maps to minimize Gaussian bias, and 10M or 15M core-hours at resolutions of p=27 or p=54, respectively. The CMB modal bispectrum methodology has been implemented in a 3D OpenMP pipeline for calculating the dark matter and halo bispectrum in N-body simulations, and will be applied to galaxy survey data, notably SDSS/BOSS and DES. Current investigations of structure formation in N-body codes use 3840^3 particles at resolution p=8 (n=50 modes) and require ~4TB of shared memory. Defining requirements come from bispectrum estimation of galaxy survey data: DES 4096^3 mock data sets with dark matter and halos consume 7-8TB in memory (particle position/velocity data) and 10-14TB on disk for several redshifts. Mock catalogue analysis at an effective resolution of 2048^3 uses 5TB of shared memory at the current p=8 resolution, and the DES bispectrum estimation will target double resolution, p=16 (n=237 modes), requiring 19TB of shared memory and 12M core-hours, for which high throughput and I/O bandwidth are essential.
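The quoted memory growth for the 'hi-res' pipeline follows directly from the stated p^2 scaling of the linear term; a one-line check, using only the figures above, is:

    # Shared-memory footprint of the modal bispectrum pipeline, scaling as p^2 from
    # the quoted 1.1 TB at p = 27; the p = 54 result reproduces the ~4.3 TB estimate.
    mem_p27_tb = 1.1
    for p in (27, 54):
        print(f"p = {p}: ~{mem_p27_tb * (p / 27) ** 2:.1f} TB shared memory")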

1.11.2 Technical requirements from science case

The full set of technical specifications required to deliver the DI science programme of DiRAC-3 (see Table 2) is:
- Compute: a minimal requirement based on the estimates of the individual DI projects in the Science Case is 1.5-2 PF. It is difficult to be more precise due to the heterogeneous nature of the DI workflows.
- Tightly coupled storage: The projected requirements for the Gaia satellite project (~1PB at 200GB/s) exemplify the data rates necessary to support DiRAC DI science in the Petascale era for observed datasets.
- 20TB SMP node: This will enable fast prototyping and deployment of complex data analysis pipelines. The size of the single-image SMP node assumes that the DI service includes significant software engineering support to migrate the majority of our current SMP workflows onto the distributed memory components of the DI service (the 14TB SMP service of DiRAC-2 will be used for this code porting).
- 256GB standard nodes: The memory requirements for the standard nodes will be determined by benchmarking of existing codes to ensure an optimal GB/GF ratio for the science programme.
- 1TB fat nodes: A set of 1TB fat nodes for smaller OpenMP workflows.
- Accelerated nodes: GPU and KNL nodes will enable DI users to explore the use of these new technologies. We note that ~80% of the PF requirement should be delivered by x86 nodes: at present the majority of DI service users do not use accelerators, and not all algorithms are amenable to acceleration.
- 14PB storage: This is for medium term storage only and assumes that a long-term DiRAC-3 data repository is provided separately.

Table 2: Technical requirements for DI service from DiRAC-3 Science Case
Requirement                                  | Science Case | Code (name/type)        | Comment
~1.5 PF                                      | C(i)         | MILC (MPI)              |
1PB storage (200+ GB/s)                      | A(vi)        | SMAUG (hybrid)          | Code is vectorisable
20TB/39TF SMP                                | A(ii), B(v)  | WALLS (modified), TROVE | Focus on largest SMP calculations
1TB fat nodes                                | A(ii)        | Many OpenMP codes       | Focus on smaller OpenMP calculations to relieve pressure on SMP
>8 GB RAM/core                               | A(v)         | RAMSES (MPI)            | Benchmarking required
128 GPU nodes                                | C(i)         | MILC/HadSpec            | Code-porting underway
32 KNL nodes                                 | A(vi)        | TORUS                   | Fraction of code base that can be ported will depend on memory capacity
<1s interconnect latency                     | A(ii)        | WALLS                   |
14PB storage                                 | All          | All                     | Modest increase in PB/PF ratio from 12 to 14
Remote 3D visualisation                      | A(viii)      | sphNG (hybrid)          | Benefit to many projects; will make use of 20TB SMP
Software development support (36 FTE-months) | A(ii), B(v)  | WALLS, TROVE            | Other OpenMP codes will need porting to MPI
SMP component for code porting               | A(ii)        | OpenMP codes            | Machine to be used for porting SMP codes to distributed memory
Software engineering support (36 FTE-months): The DiRAC-3 Technical Review Committee (TRC) noted the value of an SMP component within the DI service to support projects whose workflows “have a long standing dependence upon large complex codes which would be difficult to adapt to a DM [distributed memory] machine and require a very substantial investment in software engineering.” The SMP provision we propose will only deliver the DiRAC-3 science goals if we are able to relieve the pressure on the 20TB SMP node by porting most of our OpenMP code base to MPI in a timely manner. Thus, it is vital that the DI service provides dedicated, intensive applications support both to port suitable dp002/020 OpenMP pipelines to MPI and to optimize many existing codes. The TRC noted that this effort “is crucial to evolution of the code bases that are intimately tied up with technology”. Examples from the COSMOS@DiRAC-2 IPCC of the benefits of dedicated software engineering support include an algorithmic breakthrough which led to a 730x speed-up for part of the modal bi-/trispectrum code. We note that this support is required specifically for the SMP code base of the DI service. Software development and migration efforts for other DiRAC users will be supported by the proposed virtual DiRAC hardware and Software Innovation Centre.

1.11.3 Proposed System characteristics
The table below indicates one outline technical solution for the DI service and the associated cost envelope. The system capital cost includes £250K for software engineering (36 FTE-months) and a £0.5M SMP cost (i.e. an equivalent number of PF without SMP provision would cost £0.5M less). We present the service as a single system to emphasise the unified nature of the DI service. However, a two-site DI solution, with the capabilities distributed appropriately between the two sites, provides quantifiable advantages to DiRAC-3. Within an individual site, all DI components would share a common filesystem and service management team.

Case A: this solution is the minimal service required to deliver all the science goals. Its key features include:
• 1.24 PF: the shortfall in PF relative to the requirements is expected to be made up via vendor discounts.
• Hierarchy of nodes: standard, fat, SMP and accelerated nodes, some accessing tightly-coupled storage.
• Tightly-coupled storage: the above solution delivers bandwidth by aggregating the bandwidth of spinning discs in conjunction with Flash storage – other possible configurations will be considered and benchmarked prior to procurement. Such a system requires the scheduler to be aware of both the compute and bandwidth needs of particular codes, as well as the distribution of large data sets across the storage volume.
• SMP: at this time the mode of SMP provision is not pre-determined – benchmarking of our largest codes is underway and will be used to identify the most appropriate SMP solution (e.g. hardware vs vSMP). The existing UV2000 will be used for code migration, but will not require DiRAC-3 investment.

A multi-site DI service offers significant benefits: (1) leveraged access to the skillsets of large support teams at lower FTE cost (e.g. DiRAC-2 support at Leicester is provided by a team of 15 staff: DiRAC pays for 1 FTE of effort). Broad skillsets are essential to manage a novel, complex, heterogeneous system with up to 1000 users. This model also allows flexible access to additional staff effort at key times (e.g. procurement and installation) and provides illness/holiday cover; (2) data assurance; (3) development/testing of an innovative, fully-federated storage and data analytics system as a prototype for other national facilities. We consider the two-site model for the DI service to be vital since it necessitates the deployment of both software tools and ways of working that facilitate computational and data federation across multiple sites. This is essential for any national-level data service and is therefore an important contribution of DiRAC to the UK national e-Infrastructure. Clearly, however, multi-site service bids must demonstrate that their provision will be cost effective.

Reduced funding:
Case B: reduced capability means that, for example, Gaia work would require sole use of the tightly-coupled storage for extended periods, impacting negatively on delivery of the UKQCD science case and other projects.
Case C: lack of capability means that a significant fraction of DiRAC-3 DI science is not delivered: delayed Gaia/Planck analysis will potentially damage UK leadership.

Table 3: Example DI implementations compared to DiRAC-2 service
Item | DiRAC-2¹ | Case A | Case B | Case C
Capital price (inc. VAT) | £5.2M | £10.75M | £8.75M | £5.75M
Floating pt. (double) | 0.226PF | 1.24PF = 0.82/0.13/0.25/0.04 PF (x86/KNL/GPU/SMP) | 0.69PF = 0.65/0.2/0.04 PF (x86/KNL/SMP) | 0.47PF = 0.43/0.04 PF (x86/SMP)
Nodes | 572/1 (x86/SMP) | 800/32/64/1 (x86/KNL/GPU/SMP) + 10x(48-core, 1TB) | 650/32/1 (x86/KNL/SMP) + 10x(48-core, 1TB) | 425/1 (x86/SMP)
Cores | 9150/1784 (x86/SMP) | 19680/64/128/2560 (x86/KNL/GPU/SMP) | 16080/64/2560 (x86/KNL/SMP) | 10200/2560 (x86/SMP)
Flash I/O | None | 0.6PB | 0.3PB | 0.6PB
SMP | 14TB | 24TB | 24TB | 24TB
Storage | 1.81PB | 14.25PB | 14.25PB | 7.25PB
Power (kW) | 278 | 488 | 405 | 251
Prog. Effort | 0 | 1 FTE (36 months) | 1 FTE (36 months) | 1 FTE (36 months)
Notes: ¹DiRAC-2 numbers combine Complexity, HPCS and COSMOS

Table 4: Node configurations in example DI implementations
Node type | Standard (x86) | Fat (x86) | SMP | Accelerated
Cores | 24 | 48 | TBC | 2 (KNL or GPU)
Memory | 256GB | 1TB | 24TB | 128GB
Fl. pt. (double) | 1TF/s | 2TF/s | 80TF/s | 7/5TF/s (KNL/GPU)
Mem. bw. | TBC | TBC | 68GB/s (Xeon) + KNL | TBC
L2 cache bw. | TBC | TBC | 166GB/s (Xeon) + KNL | TBC
Network bw./node | 56/100Gb/s (4xFDR/4xEDR) | 56/100Gb/s (4xFDR/4xEDR) | 64GB/s | 56/100Gb/s (4xFDR/4xEDR)

1.12 Extreme scaling service
The Extreme Scaling service will particularly target the needs of scientific areas C(i), C(ii), C(iii), C(iv), C(v) and C(vi) and will support, among others, the needs of DiRAC projects dp006, dp007, dp008 and dp009. These projects consumed around 83% of the computing cycles on the DiRAC-2 BlueGene/Q system. Many other DiRAC projects and science areas could use this hardware, but may also be able to use hardware with a less capable interconnect. The system would, by design, also be useful for all halo-exchange PDE problems, for example UKMHD dp010 and HPQCD dp019. It is worth noting that systems with an extremely capable network are also able to support applications with lower network requirements.

QCD simulations require the repeated solution of the Dirac equation on many snapshots of the force-carrying gauge fields, while sampling the Feynman path integral. The discrete systems are large and are regularly laid out in structured four-dimensional grids, and DiRAC-2 enabled simulations on volumes of 64^3x128 and 48^3x256. Each solution of the Dirac equation is dependent on the result of the previous step, and it is necessary to run a single simulation very quickly for up to one year. As a result, it is also necessary that the largest partitions of the machine are efficient for our largest simulations.

The DiRAC-3 system must find the best balance of total floating point performance, network performance, power requirements, delivered application performance, and ease of use for a large community. The step change in simulations targeted in the science case requires lattices of extent up to 128^3x256, increasing the volume by around a factor of sixteen. This volume is taken as the design point, and provides a calculation of the required network bandwidth for each node. The regularity of the problem allows for massive and fine-grained parallelism (and so novel architectures). The simulations can be scaled to very large system sizes, provided adequate network bandwidth is available.

Floating point performance: The science programme outlined in the scientific case could be delivered with a 6-9 Pflop/s (double precision) and 12-18 Pflop/s (single precision) system. This provides a substantial increase over the DiRAC-2 1.26 Pflop/s (single/double) performance, and is consistent with the increase in volume. Thread-level and vector-instruction parallelism may, with software effort, be exploited with good efficiency. The RFI suggests 6-9 Pflop/s double precision peak performance using Knights Landing or GPU technology is viable in terms of capital and recurrent budgets. This could be achieved with around 2048-3072 nodes, each of around 3 Tflop/s peak, and a £10-16M budget. The computational power should be accompanied by adequately performing memory and network subsystems, and minimal electrical power.
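The volume and node-count arithmetic behind these figures can be made explicit; the sketch below uses the lattice extents from the design point and the ~3 Tflop/s per-node peak quoted above (the exact per-node figure is indicative, not a procurement requirement).

```python
# Sketch of the scaling argument: the target lattice is ~16x the largest
# DiRAC-2 volume, and 6-9 Pflop/s peak can be assembled from nodes of
# roughly 3 Tflop/s each, giving the 2048-3072 node range quoted.

def lattice_sites(*extents):
    """Number of sites in a lattice with the given extents."""
    n = 1
    for e in extents:
        n *= e
    return n

volume_ratio = lattice_sites(128, 128, 128, 256) / lattice_sites(64, 64, 64, 128)
print(f"volume increase over DiRAC-2: {volume_ratio:.0f}x")   # 16x

node_peak_tflops = 3.0
for system_pflops in (6, 9):
    nodes = system_pflops * 1000 / node_peak_tflops
    print(f"{system_pflops} Pflop/s at {node_peak_tflops} Tflop/s per node -> ~{nodes:.0f} nodes")
```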

Energy efficiency: For the DiRAC-3 upgrade, a substantially more power-efficient new technology is required since each MW translates to £1M p.a. of recurrent expenditure. This requirement rules out conventional server processors as a technology direction since they are no more power efficient than the existing BlueGene/Q at 2.3 PF per MW (equivalently 2.3 GF per Watt). This efficiency will be substantially improved in 2016 by systems based on Intel Knights Landing devices. Cray and SGI give the most power-efficient responses at 8 PF per MW for double precision, and 16 PF per MW for single precision. This factor of four greater efficiency of Knights Landing over Xeon-based systems is essential to upgrade the extreme scaling service. GPU-accelerated computing nodes are also a viable option in terms of energy efficiency, and having two appropriate processing technologies planned for 2016 mitigates risk.
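The recurrent-cost implication of these efficiency figures can be sketched directly; the efficiencies and the £1M per MW per annum figure are those quoted above.

```python
# Sketch: convert the quoted power efficiencies into electrical power and
# annual running cost for a 6-9 Pflop/s (double precision) system.

COST_PER_MW_PA = 1.0e6   # £ per MW per annum, from the text

efficiencies_pf_per_mw = {
    "BlueGene/Q-class (2.3 PF/MW)": 2.3,
    "KNL-era system per RFI (8 PF/MW)": 8.0,
}

for label, pf_per_mw in efficiencies_pf_per_mw.items():
    for target_pf in (6, 9):
        power_mw = target_pf / pf_per_mw
        print(f"{label}: {target_pf} PF -> {power_mw:.2f} MW, "
              f"~£{power_mw * COST_PER_MW_PA / 1e6:.2f}M p.a.")
```

At 8 PF per MW the 6-9 Pflop/s design point sits at roughly 0.75-1.1 MW, broadly consistent with the power figures in the system table in section 1.12.1; at BlueGene/Q-class efficiency the same capability would cost several £M per annum to run.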

Memory bandwidth: NVIDIA and Intel plan to use a revolutionary 3D chip-stack memory technology in the 2016 timeframe for their HPC products. The number of signaling wires, and hence the bandwidth, is vastly increased, and this disruptive technology will, to a great extent, address memory bottlenecks in HPC.

The arithmetic performed per element of data accessed dictates that lattice gauge theory codes require around 1TB/s of cache bandwidth for each sustained Tflop/s of performance. A 50% efficient node with a 3Tflop/s peak would therefore transfer 1.5TB/s from the L2 cache. A 600GB/s chip-stack memory would support this (there is a modest cache reuse factor), while a conventional memory system would give poor performance. The fraction of off-node references then determines the required network bandwidth.
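A minimal sketch of this balance argument, using the figures above (the required cache reuse factor over the chip-stack memory is inferred, not stated in the RFI responses):

```python
# Sketch of the node balance: cache traffic implied by a 50%-efficient
# 3 Tflop/s node, and the cache reuse needed for 600 GB/s chip-stack
# memory to keep up with that traffic.

node_peak_tflops   = 3.0
efficiency         = 0.5    # sustained / peak
cache_bw_per_tflop = 1.0    # TB/s of L2 traffic per sustained Tflop/s

sustained_tflops = node_peak_tflops * efficiency
l2_traffic_tb_s  = sustained_tflops * cache_bw_per_tflop
print(f"sustained: {sustained_tflops:.1f} Tflop/s, L2 traffic: {l2_traffic_tb_s:.1f} TB/s")

hbm_bw_tb_s = 0.6           # ~600 GB/s chip-stack memory
print(f"cache reuse factor needed: {l2_traffic_tb_s / hbm_bw_tb_s:.1f}x")
```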

Network bandwidth and topology: Consider distributing the two target calculations, 64^3x128 and 128^3x256, over 2048 nodes arranged as a 4x8^3 network. One can then calculate the percentage of data references made to neighbouring nodes for DiRAC projects dp006, dp007, dp008 and dp009 (the counting is sketched after the table below). Project dp008 has used the most computing cycles of all projects in DiRAC-2 and has the largest network bandwidth requirement, which is taken as the requirement below; this is the typical case. The on-node accesses come predominantly from an on-chip cache. The required network bandwidth is 5%-10% of the cache bandwidth: 5% represents a reasonable engineering design point. Assuming a 50% efficient calculation and 1.5TB/s of L2 cache bandwidth, the required network bandwidth is in the 45-75 GB/s range for all except the smallest local volume in the table below. The best-performing networks offered by Cray and SGI in this time frame are a reasonably close match, in the 32-50GB/s range.

Global Volume | Nodes | Local volume | % off-node access | Node working set
64^3x128 | 2048 | 16^2x8^2 | 9% | 350MB
64^3x128 | 512 | 16^4 | 6% | 1.4GB
128^3x256 | 2048 | 32^2x16^2 | 5% | 5GB
128^3x256 | 512 | 32^4 | 3% | 20GB

The analysis assumes scalability is determined solely by the interface bandwidth on each node, ignoring network topology effects. QCD has avoided global network contention by precise mapping of the application torus onto a physical network torus, as was done on the BlueGene/Q and QCDOC systems. This is possible in the SGI topology, but careful benchmarking is needed to demonstrate global interconnect scaling.
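The off-node percentages in the table follow from simple surface-to-volume counting for a nearest-neighbour four-dimensional stencil; the sketch below reproduces them, together with the implied per-node network bandwidth (the stencil counting is our assumption about the dominant Dirac-operator communication pattern; the 1.5TB/s cache traffic figure is from the text above).

```python
# Sketch: off-node fraction of nearest-neighbour references for a 4D
# halo-exchange stencil with local extents (l1, l2, l3, l4), and the
# network bandwidth implied as that fraction of the L2 cache traffic.

def off_node_fraction(extents):
    """A +/- neighbour in direction mu is off-node for a fraction 1/l_mu of
    sites; averaged over the 8 directions this is sum(1/l_mu) / 4."""
    return sum(1.0 / l for l in extents) / 4.0

cache_traffic_gb_s = 1500.0   # 50%-efficient 3 Tflop/s node (see above)

configs = {
    "64^3x128  / 2048 nodes (16^2x8^2)":  (16, 16, 8, 8),
    "64^3x128  / 512 nodes  (16^4)":      (16, 16, 16, 16),
    "128^3x256 / 2048 nodes (32^2x16^2)": (32, 32, 16, 16),
    "128^3x256 / 512 nodes  (32^4)":      (32, 32, 32, 32),
}
for label, local in configs.items():
    frac = off_node_fraction(local)
    print(f"{label}: {100 * frac:.0f}% off-node, ~{frac * cache_traffic_gb_s:.0f} GB/s")
```

For the 128^3x256 design point this gives roughly 47-70 GB/s per node; the 32-50GB/s RFI networks are a reasonably close match, with the communication-reducing algorithms discussed in the following paragraphs helping to close the remaining gap.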

Solver algorithms, such as the conjugate gradient methods used in QCD codes, will scale reasonably well to the full system, while smaller partitions may be used to run even more efficiently. The algorithms are not fixed: the prohibitive cost of increasing network bandwidth has given rise to new preconditioned sparse matrix inverters that reduce communication bandwidth. UKQCD’s new HDCG algorithm makes a substantial portion of the work less sensitive to network performance, and allows mixed-precision acceleration. The DiRAC-3 networks will remain balanced for these new algorithms, and in single precision will give an 8-14 fold speed-up.

Ease of use: There are many community codes in use: CPS, Bagel, Chroma, MILC. These are already programmed with hybrid message passing and thread parallelism. The Cray and SGI RFI responses proposed self-hosted KNL nodes connected directly to network interfaces, using the hybrid MPI/OpenMP programming model. The familiar x86 instruction set and multicore environment make these systems reasonably inclusive for non-experts. Vector instructions are required to obtain the absolute best performance, and it is possible that initially only QCD projects would be able to achieve this. Most of the QCD codes could immediately make use of wide SIMD: UKQCD has developed a compiler (BAGEL) for this purpose, giving 71% of peak on SIMD routines on BlueGene/Q. BAGEL has been ported to Xeon Phi.

NVIDIA accelerator products are, from a technical perspective, a good alternative technology. However, the majority of applications that use NVIDIA accelerators efficiently do so through the proprietary CUDA language, which requires substantial vendor-specific code development. NVIDIA maintains a library, QUDA, which supports many of the QCD projects. A GPU environment would nevertheless present a real barrier to use by the broader range of DiRAC projects, and a barrier to the scientific productivity of QCD projects.

1.12.1 Extreme Scaling system characteristics The extreme scaling design point, performance metrics, and a comparison to those of the DiRAC-2 BlueGene/Q system are given below. The BlueGene/Q was designed for extreme scalability in large (>1M cores) US installations, and the DiRAC-3 design point will only attain the same scalability for lattice gauge theory by using communication compression techniques. This would give adequate scalability on older algorithms, and is necessitated by technology trends and facilitated by new algorithms. Both KNL and GPU processor options should be considered in competitive procurement, though significant weight must be given to inclusivity. In a close competition the simpler programming environment of KNL is clearly preferred.

 | DiRAC-2 BlueGene/Q | Extreme Scaling Service (2016)
Capital price | £9.8M | £9M | £10.5M | £16M
System
Floating point (double) | 1.26 Pflop/s | 5 PF | 6 PF | 9 PF
Floating point (single) | 1.26 Pflop/s | 10 PF | 12 PF | 18 PF
Nodes | 6144 | 1728+ | 2048+ | 3072+
Cores | 98k | 121k | 143k | 215k
Electrical power | 500 kW | 0.7-0.8 MW | 0.8-1 MW | 1.2-1.5 MW
Storage | 1 Petabyte | 3+ PB disk + 5-10 PB tape (all options)
Node
Cores | 16 | 70+
Floating point (double/single) | 205.6/205.6 Gflop/s | 3+/6+ Tflop/s (KNL/GPU)
Memory bandwidth | 42 GB/s | 500+ GB/s HBM + 120 GB/s DDR
L2 cache bandwidth | 570 GB/s | 3+ TB/s
Network bandwidth per node | 42 GB/s | 25-50 GB/s

1.13 Memory Intensive Service
The memory intensive system is designed to be cost effective for scientific problems for which the problem size scales as the computational power is increased. For these problems the powerful computing nodes detailed in the RFI responses must be equipped with large amounts of memory. We take examples from the Virgo Consortium (dp004) project as a case study for this system, but note that the resulting machine would also fulfil the requirements of the UKMHD (dp010) project. The Virgo collaboration performs cosmological simulations describing the evolution of the Universe from early times after the Big Bang to the present day. Some of these calculations require modelling large volumes of space at high numerical resolution and cannot be run on the DiRAC-2 infrastructure for lack of total memory. We give two specific examples, taken from sections A(iv) and A(v) of the DiRAC-3 science case, that define the minimum total memory and the minimum memory per core for a DiRAC-3 memory intensive machine. The examples in this case study, together with appropriate benchmarks, will set out the parameters used to select the optimal system characteristics, including memory per node, memory bandwidth, processor architecture and processor count, storage capacity and bandwidth, network performance, and total memory.

Cosmological N-body simulations: these are required to make realistic mock galaxy catalogues, which are crucial for the interpretation of data from billion-dollar observational missions such as Euclid and LSST, and require simulations of Gigaparsec volumes. The goal of modelling the volume of Virgo's MXXL simulation at the resolution of Virgo's Millennium simulation requires 250TB of total memory, approximately five times more than is available in the DiRAC-2 Datacentric system at Durham.

Galaxy formation: galaxy formation simulations such as Virgo's Eagle Project, run on the DiRAC-2 Datacentric system, are unable to model a volume large enough to capture a representative sample of galaxy clusters. 256TB of total memory is required to support a volume eight times greater than Eagle.

Processor architecture and processor count: Software considerations make a system based on either standard Xeon or Knights Landing nodes compatible with current and planned codes. The simulation algorithms in this case study must evolve the universe forward in time over billions of years. As rich structure and galaxies form, the dynamics of particles in the dense regions must be followed with a higher resolution in time (i.e. many small time-steps) than in low-density regions. The processing of these dense regions sets the ultimate limit on strong scalability. At fixed peak performance, hardware with fewer-but-faster nodes will outperform hardware with more-but-slower nodes due to the improved processing of this work-intensive component of the algorithm. Similarly, at fixed total memory, a smaller number of high-memory nodes is essential to minimise load imbalance, as larger nodes hold a larger and overall more representative set of environments of different densities. Detailed benchmarking will be required to determine whether the most work-intensive components of these algorithms benefit more from the improved speed of each core in a Xeon system, or whether the improved memory bandwidth makes Knights Landing in 2016 the correct choice to maximize the return on investment.
The DiRAC benchmark suite will be used to guide this decision, and we are engaged with Intel representatives and with the DiRAC IPCCs to obtain access to accurate simulation of the Knights Landing.

Memory capacity: The minimum aggregate memory capacity of the compute nodes in the system will be 256TB. N-body problems are memory-bandwidth limited: delivered performance is better correlated with memory bandwidth than with peak floating point performance. Measured performance on Virgo’s benchmark codes will be a key metric in the procurement. There is insufficient information at this time in the vendor responses to the RFI to assess whether a Xeon-based node or a Knights Landing-based node will be preferable from a performance and memory capacity perspective. Knights Landing solutions were indicated to support up to 192GB per node, and include a novel memory system with 16GB of exceedingly fast 600GB/s on-package 3D memory/cache. The figure of 192GB per node is unattractive for a memory intensive system, but we expect that by 2016 Knights Landing-based nodes will be able to host a total of 384GB, using six 64GB DIMMs. Experience of running galaxy formation simulations on systems with differing amounts of memory per core suggests that the work-intensive component in cosmology simulations causes poor load balancing when the memory per core falls below 5GB. This empirical figure stems from the dynamic range in density found in the (simulated) universe. A memory requirement of at least 4GB per core is also important for UKMHD (dp010).

Storage capacity and bandwidth: The target of the memory intensive system is simulations that fill the memory. The volumes required for both check-pointing and storing simulation results will be scaled up with the memory capacity, following the workflow used by Virgo on DiRAC-2. These increased simulation sizes require a 10PB total storage volume on delivery, upgraded annually with a 4/7, 2/7, 1/7 phasing of the storage expenditure. Hierarchical bulk storage is reasonable. The total memory of the system (256TB) should be check-pointed in a reasonable time (<30 minutes), giving an application-accessible bandwidth requirement of around 140GB/s. This should preferably be delivered to a robust and easy-to-use parallel file system if technologically possible, and might require burst-buffer technology using local SSDs in each node. Storage throughput will be a key metric assessed during procurement.
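The checkpoint bandwidth figure follows directly from the memory capacity and the time target; a minimal sketch:

```python
# Sketch: bandwidth needed to checkpoint the full 256 TB of system memory
# within the 30-minute target quoted above.

total_memory_tb = 256
checkpoint_seconds = 30 * 60

bandwidth_gb_s = total_memory_tb * 1000 / checkpoint_seconds
print(f"required application-accessible bandwidth: ~{bandwidth_gb_s:.0f} GB/s")
# ~142 GB/s, consistent with the ~140 GB/s requirement stated above.
```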

Interconnect: The most cost-effective hardware uses distributed memory with message passing between nodes. The network should deliver an MPI latency no worse than about a microsecond between any pair of nodes, which favours a relatively flat network topology. However, the bandwidth need not be the highest possible if a lower specification leads to better pricing, since the design point places large volumes of memory-intensive work on each node.

Machine stability: The run times of jobs such as those in the case study are typically weeks to months. A requirement for the memory intensive system is therefore that the hosting site provides UPS and generator support.

Summary of the system characteristics: Depending on whether benchmarking indicates that Xeon or Knights Landing computing nodes are the most appropriate technology for this service, the system will range between one and two PFLOPS peak. The key factor in this decision is the relative performance of a Xeon core and a Knights Landing core on the portions of the simulation where load imbalance is significant. Detailed benchmarking will guide this, and since it is the key technology decision to be made it will receive particular attention from the Technical Working Group. For a cost of £9M (all prices include VAT):

Item | DiRAC-2 COSMA (Xeon) | DiRAC-3 Memory Intensive @ £9M (Xeon) | DiRAC-3 Memory Intensive @ £9M (KNL)
System
Floating point performance (double) | 140 TFLOPS | 1 PFLOPS | 2 PFLOPS
Nodes | 420 | 800 | 680
Number of cores | 6720 | 24000 | 48960
Total memory | 54 TB | 256 TB | 256 TB
Cluster-wide latency | order of several to ~10 microseconds | order of 1 microsecond | order of 1 microsecond
Electrical power (kW) | 150 | 450 | 300
Node
Cores | 16 | 32 | 72
Floating point (double) | 332 GFLOPS | 1.33 TFLOPS | 3 TFLOPS
Total memory bandwidth | 84 GB/s (42 GB/s per socket) | 136 GB/s (68 GB/s per socket) | EDRAM at >500 GB/s, RAM at >120 GB/s
Network bandwidth per node | 8 GB/s | 25 GB/s | 25 GB/s

Below we give two additional systems at lower and higher price points: £7M and £12M. All three proposed systems meet the minimum memory and disk storage requirements. For the £7M system the KNL option is not possible because the minimum system memory cannot be met: it would require 512GB of RAM per node, which we do not think feasible in early to mid-2016. The cluster component of the £12M system is a larger version of the £9M system, and the total storage has been scaled up by a third.

Item | DiRAC-3 Memory Intensive @ £7M (Xeon) | DiRAC-3 Memory Intensive @ £12M (Xeon) | DiRAC-3 Memory Intensive @ £12M (KNL)
System
Floating point performance (double) | 0.7 PFLOPS | 1.4 PFLOPS | 2.7 PFLOPS
Nodes | 512 | 1024 | 900
Number of cores | 16384 | 32768 | 64800
Total memory | 256 TB | 320 TB | 384 TB
Cluster-wide latency | order of 1 microsecond | order of 1 microsecond | order of 1 microsecond
Electrical power (kW) | 340 | 550 | 390
Node
Cores | 32 | 32 | 72
Floating point (double) | 1.33 TFLOPS | 1.33 TFLOPS | 3 TFLOPS
Total memory bandwidth | 136 GB/s (68 GB/s per socket) | 136 GB/s (68 GB/s per socket) | EDRAM at >500 GB/s, RAM at >120 GB/s
Network bandwidth per node | 25 GB/s | 25 GB/s | 25 GB/s
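As a cross-check on the configurations above, the node counts, core counts and total memory figures can be tested against the 256TB minimum and the ~5GB-per-core load-balancing guideline; the sketch below infers per-node and per-core memory directly from the tabulated values.

```python
# Sketch: check the example memory intensive configurations against the
# 256 TB total-memory minimum and the ~5 GB/core load-balancing guideline.
# Node, core and memory figures are taken from the tables above.

systems = {
    "£9M Xeon":  {"nodes": 800,  "cores": 24000, "memory_tb": 256},
    "£9M KNL":   {"nodes": 680,  "cores": 48960, "memory_tb": 256},
    "£7M Xeon":  {"nodes": 512,  "cores": 16384, "memory_tb": 256},
    "£12M Xeon": {"nodes": 1024, "cores": 32768, "memory_tb": 320},
    "£12M KNL":  {"nodes": 900,  "cores": 64800, "memory_tb": 384},
}

for name, s in systems.items():
    gb_per_node = s["memory_tb"] * 1000 / s["nodes"]
    gb_per_core = s["memory_tb"] * 1000 / s["cores"]
    print(f"{name}: {gb_per_node:.0f} GB/node, {gb_per_core:.1f} GB/core")
```

The Xeon configurations sit comfortably above the 5GB/core guideline, while the KNL options (roughly 5.2 and 5.9 GB/core) sit just above it, which is why per-node memory capacity (192GB versus 384GB) is flagged above as a key consideration for the Knights Landing option.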

4 Appendix: Risk Register

Risk | Probability | Consequence | Mitigation
Delay to Knights Landing product | Moderate | Installation of extreme scaling system must take place in 2016. | STFC must plan the capital expenditure profile to allow delivery throughout calendar year 2016.
Cancellation of Knights Landing product | Low | Alternate power-efficient technology required. | In this circumstance the extreme scaling service will have to adopt GPU nodes, which are more difficult to programme.
GPU programming difficulty | Very high | Limited number of scientific applications run successfully. | Strictly limit GPU hardware to a level consistent with codes proven to use this hardware effectively. Create a preference for standard MPI+threads software during procurement.
Xeon vector instruction usage | High | Scientific applications executing scalar code do not use the full capability of the hardware. | Community best practice education. Limited software effort is available, managed by RAC, to provide additional effort according to scientific need. Several IPCCs have been awarded to DiRAC by Intel.
Knights Landing vector usage | High | Scientific applications executing scalar code do not use the full capability of the hardware. | Community best practice education. Limited software effort is available, managed by RAC, to provide additional effort according to scientific need. Several IPCCs have been awarded to DiRAC by Intel.
Insufficient funds to deliver target capacity | Moderate | A smaller than anticipated upgrade is delivered. | The reduced resource will be managed by RAC. The most appropriate response (scaling back allocations, or modifying some research programmes) will be decided.
No appropriate hosting sites | Low | One or more service does not receive an appropriate hosting bid. | This is unlikely since several universities have made considerable investments in hosting infrastructure. Bids for hosting more than one service will be encouraged to engender competition.
Power requirements exceed hosting capabilities | Low | | Hosting bids from multiple sites for each service will be encouraged. Funds can be reallocated between services as part of coordinated procurement as required. Sufficient margin on machine room power infrastructure will be required of hosting sites during competitive selection of service providers.
Power requirements exceed planned recurrent budget | Moderate | | Total cost of ownership will be a key metric in the coordinated procurement and these costs will be known prior to installation, with the hosting site award terms and conditions setting the recurrent budget. This controls the risk.
Insufficient system administration manpower | Moderate | Service standards fall short. | Host sites will be required to detail the level of manpower, and KPIs that will be met, for operating the systems, along with recurrent pricing for this as part of the bidding process.
Insufficient user support manpower | Moderate | Service standards fall short. | Host sites will be required to detail the level of manpower, and KPIs that will be met, for supporting the systems, along with recurrent pricing for this as part of the bidding process.
Hardware vendor ceases operation during procurement or prior to installation | Low | Delay to service if this occurs during procurement. | Financial checks during procurement.
Hardware vendor ceases operation after installation | Moderate | Maintenance of systems may be placed at risk. | Financial checks during procurement. Service provider insurance.
