DiRAC-3 Technical Case

Contributors: Peter Boyle, Paul Calleja, Lydia Heck, Juha Jaykka, Adrian Jenkins, Stuart Rankin, Chris Rudge, Paul Shellard, Mark Wilkinson, Jeremy Yates, on behalf of the DiRAC project.

Table of Contents

1 Executive summary
2 Technical case for support
  1.1 Service description
  1.2 Track record
    1.2.1 Hosting capability
  1.3 International competition
  1.4 Risk Management
  1.5 Impact
    1.5.1 Industrial Collaboration
  1.6 RFI based Technology Survey
    1.6.1 Processor technology survey
    1.6.2 Interconnect technology survey
    1.6.3 Storage technology survey
  1.7 Benchmarking
  1.8 Procurement strategy
  1.9 Timeline
  1.10 Expenditure profile
3 Case Studies
  1.11 Data Intensive (DI) Service
    1.11.1 Illustrative workflows
    1.11.2 Technical requirements from science case
    1.11.3 Proposed System characteristics
  1.12 Extreme scaling service
    1.12.1 Extreme Scaling system characteristics
  1.13 Memory Intensive Service
4 Appendix: Risk Register

1 Executive summary

The Distributed Research through Advanced Computing facility (DiRAC) serves as the integrated computational resource for the STFC-funded theoretical particle physics, astronomy, cosmology, and astro-particle physics communities. Since 2009 it has provided two generations of computing service, and presently delivers an aggregate two petaflop/s across five installations. DiRAC has been a global pioneer of the co-design of scientific software and computing hardware, and has worked closely with technology companies such as IBM and Intel. The fundamental research themes included in DiRAC are of great public interest and have regularly appeared in news articles with worldwide impact. These same fundamental themes consume substantial fractions of the computational resources of other countries throughout the developed world, and it is essential that the UK retain its competitive international position.

The DiRAC community has constructed a scientific case for a transformative theoretical physics and astronomy research programme, tackling the full remit of STFC science challenges. This requires an order of magnitude increase in computational resources. This technical case aims to establish that the computing resources required to support this science programme are achievable within the available budget and timeframe, and that we have the capabilities and track record to deliver these services to the scientific community.

A survey of our scientific needs identified three broad problem types characteristic of the scientific programme, which may drive the technologies selected. These are problems where the key challenges are, respectively: extreme scaling, intensive use of memory, and the rapid access and analysis of large volumes of data.

This technical case also establishes a matrix mapping the three services to the different areas of the science programme. It is expected that the peer review of the scientific case will be fed back through this mapping by STFC to tension the resources allocated to the three services. Some detailed case studies are provided for each service to show the typical usage pattern addressed by each service.

Data intensive: These are problems involving data-driven scientific discovery, for example the confrontation of complex simulation outputs with large-volume data sets from flagship astronomical satellites such as Planck and Gaia. Over the period 2015-18, data intensive research will become increasingly vital as the magnitude of both simulated and observed data sets in many DiRAC science areas enters the Petascale era. In addition to novel accelerated I/O, this service may include both an accelerated cluster and a shared memory component to add significant flexibility to our hardware solution to the Big Data challenges which the DiRAC community will face in the coming years. Some components of the associated computation support fine-grained parallelism and may be accelerated, while other components may require the extreme flexibility associated with symmetric multiprocessing. Novel storage components, such as non-volatile memory in the storage subsystem, may be very useful for these problems.

Extreme scaling: These problems require the maximum computational effort to be applied to a problem of fixed size. This requires maximal interconnect and memory bandwidth, but relatively limited memory capacity; a good example scientific problem is Lattice QCD simulation in theoretical particle physics. This field provides theoretical input on the properties of hadrons to assist the interpretation of experiments such as the Large Hadron Collider. Lattice QCD simulations involve easily parallelized operations on regular arrays, and can make use of the highest levels of parallelism within processing nodes possessing many cores and vector instructions.

Memory intensive: These problems require a larger memory footprint as the problem size grows with increasing machine power; a good example scientific application is the simulation of dark matter and structure formation in the primordial universe, where additional computing power enables a larger simulated volume of the universe. As structure forms, the denser portions of the universe experience greater gravitational forces, and these "hotspots" require a disproportionate amount of work. The hotspots present greater challenges to parallelization, and processors with the best performance on serial calculations may therefore be preferred.

The DiRAC technical working group has conducted a technology survey (under a formal Request for Information (RFI) issued by UCL) for possible solutions that will provide a step change in our scientific output in calendar years 2015 and 2016. Many useful vendor responses were received and indicate several competing solutions within the correct price, power and performance envelope for each of the three required service types. Significant changes in available memory and processor technology are anticipated in Q1 2016, and in particular the availability of highly parallel processors integrating high bandwidth 3D stacks of memory will be transformative for those codes that can exploit them.

2 Technical case for support

The aim of DiRAC-3 is to provide the computing hardware that will enable a step change in scientific capability for the DiRAC science projects in theoretical particle physics and astronomy. These projects were described in the DiRAC-3 Science Case, which has been submitted to STFC for peer review. In particular, the Science Case describes all aspects of the match of these projects to the STFC Science and Strategy, and places them in their scientific context. The present document is intended to translate the scientific requirements into a service specification, and to lay out a technically and economically sound plan for procuring and operating the services required to enable the science programme.

The step change in scientific reach requires for most projects an order of magnitude increase in the computing power available over DiRAC-2 resources, with a more modest increase in the available storage and storage bandwidths. A subset of the projects require more than an order of magnitude increase in available storage and storage bandwidth.

DiRAC-3 will define three broad service categories and invite potential service providers, such as Universities, to bid to procure and operate one or more of these services. Successful bidders will participate in a coordinated procurement run on behalf of STFC. The service providers will subsequently operate services meeting these requirements for the STFC science communities, with resources allocated by STFC panels. As with DiRAC-2, the facility will be integrated in a single project and user management system, with integrated job accounting and facility operations reporting managed by the DiRAC Technical Working Group.

1.1 Service description

The DiRAC consortium undertook a requirement identification process through March 2014, and in April 2014 issued a formal request for information from vendors relating to their products in 2015/16. Three classes of need were identified: "Data intensive", "Extreme scaling", and "Memory intensive". The broad characteristics of each are summarised below.

Service           | Node count | Node parallelism | Interconnect bandwidth | Node memory | I/O bandwidth
Data intensive    | High       | High             | High                   | Maximal     | Maximal
Extreme scaling   | Maximal    | Maximal          | Maximal                | Limited     | Limited
Memory intensive  | Maximal    | High             | High                   | Maximal     | High

Data intensive: The data intensive portion of the workload involves processing petabyte-scale datasets, both simulated and observational. This includes the confrontation of realistic theoretical models with the next generation of astronomical datasets, as well as purely theoretical problems where advancing power has turned the analysis of simulation outputs into a very significant I/O challenge.

Extreme Scaling: Extreme scaling involves strong-scaling problems, where extreme computational power must be focused on a problem of fixed size. Power efficiency, price efficiency and sufficiently fast communication networks are the technological barriers to scaling problems of limited size across many nodes to obtain high performance. The DiRAC problems associated with this system have geometric regularity and can in principle scale both to very large numbers of nodes and to wide SIMD instructions, after some software engineering investment. To obtain reasonable strong-scaling performance, computing nodes must refer frequently to off-node data, and it is necessary that the per-node interconnect bandwidth not fall much below 1/10th of the bandwidth to the lowest level of cache.
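To make this balance concrete, the following minimal Python sketch checks a candidate node against the 1/10th rule described above; the bandwidth figures are illustrative assumptions rather than vendor-quoted numbers.

    # Check whether a candidate node meets the strong-scaling balance rule above:
    # per-node interconnect bandwidth >= 1/10 of the bandwidth to the lowest level of cache.
    # All figures below are illustrative assumptions, not vendor-quoted numbers.

    def meets_strong_scaling_rule(interconnect_gb_s, cache_bw_gb_s, min_ratio=0.1):
        return interconnect_gb_s >= min_ratio * cache_bw_gb_s

    # Example: an assumed 400 GB/s last-level cache paired with a 25 GB/s network port.
    print(meets_strong_scaling_rule(25.0, 400.0))   # False: a faster network is needed
    print(meets_strong_scaling_rule(50.0, 400.0))   # True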

Memory Intensive: The memory intensive class of problem involves weak scaling, where adequately resolved simulations of the largest problem sizes are required. These problems have a somewhat lower network bandwidth requirement, while memory capacity, memory bandwidth and I/O bandwidth are significant technology challenges. Within this class of problem, there are DiRAC applications for which the arithmetic intensity is not uniform across the dataset, and load balancing becomes an issue. Where there are load balancing issues there is a clear preference for obtaining the same peak performance from fewer but faster processor cores, rather than from more but slower processor cores. Nodes constructed from fewer but faster processor cores are typically less power efficient for the same peak performance, so for these algorithms the design point must be selected to optimize for a mix of serial and parallel components.

The memory footprint for a simulation eight times larger than those already performed using DiRAC-2 resources is 250TB. This should be provided in a system with excellent memory bandwidth, and the load balancing sensitive algorithms mean that there is a serial component to the problem.
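A minimal sketch of the corresponding capacity planning is given below; the 250TB target is from the text above, while the per-node memory options are illustrative assumptions.

    # Weak-scaling memory estimate for the Memory Intensive service.
    # The 250 TB figure (an 8x larger simulation than DiRAC-2) is taken from the text;
    # the candidate per-node memory capacities are assumptions.

    target_footprint_tb = 250
    implied_dirac2_footprint_tb = target_footprint_tb / 8
    print(f"implied DiRAC-2 footprint: ~{implied_dirac2_footprint_tb:.0f} TB")

    for node_mem_gb in (256, 512, 1024):
        nodes_needed = target_footprint_tb * 1024 / node_mem_gb
        print(f"{node_mem_gb} GB/node -> at least {nodes_needed:.0f} nodes just to hold the problem")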

Service integration: It is possible that a single highly capable computing system could simultaneously fulfill all of the above requirements. However, this may not be the most cost-effective approach, since a tailored mix of architectures can be cheaper than buying a single system with maximal capability in all areas. A simple analogy may be useful: a two-car family may choose one people carrier and one small car; insisting on a people carrier that achieves 60 MPG, even if possible, could be an unnecessarily expensive solution.

Silicon Graphics and Cray proposed architectures that can include a mix of computing nodes of different types, and could potentially meet all three requirements. However, it is also quite possible that distinct technological solutions from different vendors may be price optimal for each of the service classes. For the DiRAC-1 and DiRAC-2 procurements, systems from multiple vendors were “right-sized” to the scientific application mix. These systems were integrated under a single user-account, project, accounting and monitoring framework. DiRAC has, perhaps, set a standard for the coherent and coordinated operation of multi-site UK e-Infrastructure in an integrated framework. Such integration allows mobility of research codes between systems. We believe that DiRAC-2 has demonstrated that, in principle, all tiers of UK HPC provision could be integrated and that this would be a worthwhile step.

The responses of vendors to our request for information enable us to demonstrate that technological solutions enabling our various scientific goals will almost certainly exist. Since complete information will only be available in the future, DiRAC-3 will remain completely open to procuring a mix of solutions "right sized" to match the mix of scientific problems, using the most cost effective approach identified by benchmarking metrics. Currently we can project the capital and operational budgets given the budgetary advice of the vendors. These figures would only become definite under binding quotations made during competitive tendering, and the corresponding final system size obtained within a fixed budget could vary.

1.2 Track record

The DiRAC consortium has a significant track record of procuring, installing and operating scientific computing resources for the STFC theory community. This includes a substantial technical hosting capability at a number of sites, with both infrastructure support from several Universities and a significant body of skilled technical staff. The DiRAC consortium has installed and operated two generations of computing systems following a three to four year replacement cycle. The DiRAC-1 computing systems were installed in 2009 and the DiRAC-2 computing systems in 2012. The current DiRAC-2 resources comprise five systems installed in Cambridge (two systems), Durham, Edinburgh and Leicester.

Site       | Technology         | Cores                         | Peak performance                           | Storage
Cambridge  | Infiniband cluster | 4800                          | 100 Tflop/s                                | 1PB
Cambridge  | SGI UV2000 SMP     | 1856 (Xeon) + 1860 (Xeon Phi) | 42 Tflop/s (Xeon) + 37 Tflop/s (Xeon Phi)  | 146TB
Durham     | Infiniband cluster | 6740                          | 150 Tflop/s                                | 2PB
Edinburgh  | IBM BlueGene/Q     | 98304                         | 1260 Tflop/s                               | 1PB
Leicester  | Infiniband cluster | 4352                          | 95 Tflop/s                                 | 0.8PB

STFC has managed the DiRAC resources through a number of committee and oversight structures. The project management board (PMB) is responsible for the organization and decision making of the consortium. The PMB chairs and technical working group (TWG) chair report regularly to an Oversight Committee (OSC). The computer time is managed by a resource allocation committee (RAC). RAC chairs are appointed by PMB, and the RAC manages the peer review and scientific tensioning of calls for proposals.

The service has been run as a distributed facility, integrated by the DiRAC TWG. The TWG meets biweekly and, with the help of the Edinburgh Parallel Computing Centre, has implemented a unified system administration framework (SAFE). SAFE is also used by the HECToR and ARCHER national facilities and provides a unified project and user account management interface. SAFE also records every DiRAC job submitted through the distributed system, and provides visibility of the functioning of the system to the TWG, the RAC and STFC's Oversight Committee.

Computing cycles delivered: Since the DiRAC-2 service started, over several billion core-hours have been used for scientific computation by the STFC theory community, with utilization of order 90%. A chart of the usage by project, for the accounting year 1 April 2013 through 1 April 2014, is given below. The chart shows both core-hours and the percentage of the total DiRAC core-hours used by each DiRAC project, on a logarithmic scale. For simplicity of illustration we treat all core-hours as equivalent, regardless of the architecture used and the code efficiency.


Ordered from left, projects 1, 2, 3 & 6 require extreme scaling and are based on sparse matrix PDE solution. Projects 4, 5 & 10 are memory intensive, simulating the largest possible multi-physics spatial grids. Projects 7, 8, 9 & 11 are data intensive: either requiring vast I/O throughput, or having a large volume of SMP code.
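The ~90% utilization figure quoted above can be reproduced directly from SAFE-style job accounting records; a minimal sketch, using invented job records and an assumed system size, is given below.

    # Aggregate utilization from per-job accounting records (records are invented examples).
    jobs = [
        {"cores": 1024, "hours": 12.0},
        {"cores": 4096, "hours": 6.5},
        {"cores": 512,  "hours": 48.0},
    ]
    used_core_hours = sum(j["cores"] * j["hours"] for j in jobs)

    total_cores = 4800              # assumed cluster size
    period_hours = 24 * 365         # one accounting year
    available_core_hours = total_cores * period_hours

    print(f"utilization = {100 * used_core_hours / available_core_hours:.2f}%")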

1.2.1 Hosting capability

The DiRAC hosting sites have made significant investment (including various mixtures of self-investment and funding from external sources) to develop advanced machine room capabilities. These comprise considerable raised floor-space, site and room power distribution capacity, and various sophisticated and environmentally friendly room and rack level cooling technologies. Network connectivity is provided through dedicated high bandwidth links to the JANET backbone. This considerable investment in hosting infrastructure provides leverage for the capital expenditure on computing hardware. One of the hosting sites is willing to invest up to £2.5M of new money in HPC hosting infrastructure to support DiRAC-3.

Site       | Floorspace | Power/Cooling
Cambridge  | 328 m²     | 1 MW
Durham     | 200 m²     | 1 MW
Edinburgh  | 300 m²     | 2.7 MW
Leicester  | 30 m²      | 300 kW

1.3 International competition

We can compare our resources to the current, and to some extent planned, computational resources available to the international competitors and collaborators of UK scientists working in the scientific areas of DiRAC. Upgrade plans are tabulated where known; this list is necessarily incomplete.

Country | Site                          | Year | System
Japan   | Kobe                          | 2011 | 10.5 PF K-computer
Germany | Juelich                       | 2012 | 5.9 PF BlueGene/Q
Italy   | Cineca                        | 2012 | 1.7 PF BlueGene/Q
Japan   | Tsukuba                       | 2012 | 0.8 PF HA-PACS
Japan   | KEK                           | 2012 | 1.26 PF BlueGene/Q
US      | Argonne/Mira                  | 2012 | 10 PF BlueGene/Q
US      | ORNL/Titan                    | 2012 | 27 PF Cray XK7
US      | NCSA/BlueWaters               | 2013 | 13 PF Cray XE6/XK7
US      | NERSC/Edison                  | 2014 | 2.6 PF Cray XC30
US      | NERSC/Cori                    | 2016 | 28 PF Cray XC30/KNL
US      | Oak Ridge, Argonne, Livermore | 2017 | >300 PF combined upgrade

In the US, three large systems can be considered for which allocations have been made public on the facilities' web pages: across these US systems, allocations in the 20% to 40% range were made to DiRAC scientific areas.

– The 2014 INCITE allocation of the 27PF Titan system totaled 2.2 billion hours. Of this, one third was made to the DiRAC scientific areas (particle physics, nuclear physics, MHD and astronomy).
– The BlueWaters system displays usage by scientific domain, with 40%-80% typical for Particle Physics and Astronomy.
– In 2013/2014 the allocations to our US competitors in DiRAC science areas on the 10PF Mira system totaled over 2 billion core-hours. This 30-40% scale of allocation is consistent with the other large US systems.
– A substantial part of the 5.9PF Juelich system has been used by the BMW lattice QCD collaboration.

Many of the above internationally competing systems will be upgraded, and it is their replacements that will compete with DiRAC-3. The information in this section demonstrates that a substantial fraction of large US, European and Japanese machines is being used to compete in the DiRAC scientific areas. The DiRAC-3 upgrade is required to maintain scientific competitiveness.

1.4 Risk Management

The Risk Register for the DiRAC-3 installation is included as an appendix to this document.

Following a careful Request for Information exercise and requirements identification, the TWG have identified the Intel Knights Landing (KNL) processor as the best candidate for one, and possibly two of the services. This processor is in development, and the timing of the procurement should be chosen to align with general availability, presently targeted at Q1 2016. The dependence of this strategy on successful product development by Intel is the most significant risk in the project.

Delay of the KNL processor by Intel is addressed by retaining sufficient flexibility to commit a significant component of capital expenditure throughout fiscal year 2016/17. Cancellation is viewed as unlikely; in that event the extreme scaling service could still be meaningfully upgraded with GPU accelerated computing nodes. The other services can deliver their goals with standard Xeon technologies.

1.5 Impact

DiRAC has significant societal impact through the deep public interest in the fundamental nature of our science. This is reflected by a large number of news articles about the research areas included in the DiRAC science programme. In particular there has been worldwide news activity about the 2012 discovery of the Higgs Boson at the Large Hadron Collider, and the associated Nobel Prize won by Professor Higgs in 2013. This completed the observation of the standard model of particle physics. A major theme of DiRAC research is to calculate hadron properties to assist the Large Hadron Collider in its search for new discoveries. Edinburgh DiRAC academics recently lectured in a massively open online course explaining the Higgs Boson and its role in the standard model and cosmic inflation to many thousands of members of the public.

Recently, astonishing evidence of gravitational waves in the cosmic microwave background was discovered by the American BICEP project. These waves are indicative of new laws of physics, even more fundamental and potentially simpler than the standard model, entering at a Grand Unification scale. Further, they back up the idea of inflation in the early universe as a mechanism for the exponential primordial expansion. The COSMOS project within DiRAC makes use of the European Planck satellite’s observations of the cosmic microwave background, and requires DiRAC-3 to confirm and improve upon the observations of BICEP.

Another major theme of DiRAC activity concerns large-scale structure and galaxy formation. The DiRAC researchers in the Virgo Consortium have performed a series of groundbreaking simulations that have regularly received widespread attention both from scientists and in the news. The work of UK Virgo members has been recognised most recently by the 2014 Eddington Gold medal received by Carlos Frenk (also a Gruber prize winner in 2011), and in the award of the 2014 Shaw Prize in Astronomy to Shaun Cole and John Peacock (together with a US astronomer).

Examples of mainstream press activity covering the subject areas within DiRAC research are given below:
http://www.bbc.co.uk/news/world-18702455
http://www.bbc.co.uk/news/science-environment-24436781
http://www.bbc.co.uk/news/science-environment-26605974
http://www.bbc.co.uk/news/science-environment-21866464

The DiRAC community regularly engages directly with the public, in events ranging from BBC Stargazing Live through to public lectures and workshops in primary schools. The DiRAC collaboration also trains a steady stream of graduate students and post-doctoral researchers, whose career progression in many cases ultimately takes them to disparate fields throughout the economy. Career destinations include the financial sector, as well as varied destinations such as the Meteorological Office and technology and HPC companies such as NVIDIA. Many researchers from particle physics have also assumed key roles in the Edinburgh Parallel Computing Centre. The EPCC has, for several decades, played a leading role in HPC provision and education for the academic and industry communities across the UK and the EU. The impact of placing HPC provision in higher education institutions cannot be overstated: the highest bandwidth knowledge transfer involves the motion of educated brains from academia to industry. Embedding specialist HPC resources in multiple places of learning, for use by intelligent early career researchers, establishes a far greater flow of knowledge than centralised HPC provision by centres retaining a handful of experts in knowledge transfer positions.

1.5.1 Industrial Collaboration

The UK lattice gauge theory community has an excellent record of industrial engagement in the development and exploitation of computing hardware for simulation. The community has twice partnered with IBM Research in the last decade to develop national simulation resources. Firstly, between 2000 and 2004 the UKQCD community developed the QCDOC supercomputer with Columbia University and IBM Research. This was operated as a national resource until 2009. Our collaborators within IBM went on to develop the general purpose BlueGene supercomputer designs that became IBM's premier HPC product.

In 2007, Edinburgh University was invited by IBM to enter a project to jointly develop the BlueGene/Q computer chip. Theoretical physicists among Edinburgh and Columbia's academic staff designed the memory prefetch engine for IBM, while using QCD codes as a performance optimization benchmark. This academic-industrial collaboration on internationally leading technology is globally unique, and perhaps one of the best examples of hardware/software co-design. The resulting designs have subsequently been deployed at flagship, internationally leading supercomputing sites around the world, including in the USA (Argonne, LLNL), the UK (Daresbury), Italy (CINECA), Germany (Juelich), and Japan (KEK). These systems are accelerating scientific computing in diverse applications ranging from computational biology through CFD and even nuclear stockpile stewardship.

Edinburgh University has coauthored four US patents and several scientific publications about the BlueGene/Q design with IBM. One of these joint papers between Edinburgh and IBM won the Gauss Award at the 2012 International Supercomputing Conference. The Edinburgh codes are highly efficient, sustaining around 30% of peak performance, with key routines delivering as much as 71% of peak performance. The codes are freely available and are used around the world for QCD simulations; for example, Edinburgh wrote the code for, and co-authored, a Gordon Bell Prize finalist paper with Lawrence Livermore National Laboratory at Supercomputing 2013, sustaining 7.2 Pflop/s on the Sequoia BlueGene/Q system. The BlueGene/Q system was designed to be particularly energy efficient and topped the Green500 list as the most energy efficient computer in the world from 2010 until 2012.

COSMOS has a long-standing collaboration with SGI, which was joined by Intel in 2003. This has ensured early access to state-of-the-art SMP machines and COSMOS had the first large Origin 2000, Altix Itanium 3000, UV1000 and UV2000. This UV2000 was also the first large shared memory machine with Xeon Phi coprocessors, with the "MG" blades housing them being specifically designed for and with COSMOS.

More recently, DiRAC has been very successful in attracting Intel Parallel Computing Centre (IPCC) investment to help reengineer our codes for novel and emerging architectures such as the Intel many-core devices. Two centres in Cambridge and one in Edinburgh have been opened with the primary goal of supporting the aggressive optimization of COSMOS and UKQCD collaboration codes, and of investigating the optimization of structure formation codes for these devices. A third IPCC has been established in Cambridge to support the Square Kilometre Array project. The combined value of these investments is several hundred thousand pounds p.a., providing three FTE of software support.

1.6 RFI based Technology Survey

The underlying technologies will be developed by the computing industry. The DiRAC-3 service delivery will, however, involve significant effort on the part of the DiRAC Technical Working Group, Project Management Board and a Technical Benchmarking working group to select the appropriate mix of technologies.

To prepare for this exercise, in March 2014 DiRAC carried out a Request for Information (RFI) exercise to survey candidate technologies for each of the three service requirements. It is anticipated that these technologies might be combined in an appropriate mix to deliver the DiRAC-3 service. Although this exercise is necessarily forward looking and considerable uncertainty remains, seeking the best information available and being as technologically informed as possible while planning for service procurement reduces the risk to a minimum.

The following vendors and resellers made very useful responses to the RFI: Cray Inc; Silicon Graphics; Dell; Hewlett Packard; IBM; OCF; Clustervision.

We provide some summary charts of the responses; the thousands of pages of responses can be made available on request. Various technologies were suggested by the vendors in the areas of processing, interconnect and storage. For the computing element of the systems proposed by vendors, a broad summary of the electrical power, computational power, interconnect and memory performance in relation to pricing is given in the table below.

Architecture             | Peak/Power (PF/MW) | Peak/Capex (PF/£M) | Peak/TCO (PF/£M) | Network/Peak (GB/s per TF) | Memory BW/Peak (GB/s per TF)
Intel KNL self-host      | 8                  | 0.6                | 0.4              | 8-16                       | 200
Intel Broadwell cluster  | 2                  | 0.12               | 0.1              | 12-24                      | 150
Accelerated cluster      | 5                  | 0.4                | 0.3              | 3-10                       | 200-400
SMP                      | 1.6                | 0.06               | 0.05             | 64                         | 150
Accelerated SMP          | 2.6                | 0.09               | 0.08             | 20                         | 200
BlueGene/Q               | 2.4                | 0.1                | 0.09             | 200                        | 200

It is worth noting that the KNL based systems from Cray and SGI appear the most efficient in both performance/power and performance/capex in 2016, by a substantial margin. The peak performance of both Xeon and KNL can only be obtained by using vector instructions, and the vector length of KNL is twice that of Xeon. For codes in the DiRAC application base that only use scalar operations, the relative advantage between KNL and Xeon is therefore reduced by a factor of two, but KNL systems should retain a reasonable advantage for the subset of scalar codes that can be efficiently threaded.
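The effect described here can be illustrated with a simple Amdahl-style estimate; the peak figures are taken from the survey table above, while the vector widths and vectorizable fractions are assumptions for illustration only.

    # Effective node performance as a function of the vectorizable fraction of the work.
    # A fraction f of the flops runs at peak; the remainder runs at the scalar rate
    # peak/vector_width. Vector widths of 8 vs 4 doubles reflect KNL's wider vector unit.

    def effective_gflops(peak, vector_width, f):
        return peak / (f + (1.0 - f) * vector_width)

    knl_peak, xeon_peak = 3000.0, 600.0   # Gflop/s per chip, from the survey table
    for f in (1.0, 0.5, 0.0):
        ratio = effective_gflops(knl_peak, 8, f) / effective_gflops(xeon_peak, 4, f)
        print(f"vectorizable fraction {f:.1f}: KNL/Xeon advantage ~{ratio:.1f}x")

With a fully vectorized code the ratio is ~5x; for a purely scalar code it falls to ~2.5x, reproducing the factor-of-two reduction described above.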

The 500kW BlueGene/Q system, used to provide around 60% of the DiRAC-2 computing cycles, represents a baseline for the extreme scaling service. It can be seen that only GPU accelerators or systems based on the Knights Landing processor improve on its power efficiency. This improvement is required to provide a substantial upgrade within a reasonable power envelope. Knights Landing is projected to be first available in Q1 2016.
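As a rough sizing exercise (a sketch only: the efficiency figures are the budgetary estimates from the survey table, while the capital and power envelopes are illustrative placeholders), the deliverable peak of a KNL-based system is the smaller of the power-limited and capex-limited sizes:

    # Rough sizing of a KNL-based system against capital and power envelopes.
    pf_per_mw = 8.0          # Peak/Power from the survey table
    pf_per_million = 0.6     # Peak/Capex from the survey table

    budget_millions = 10.0   # assumed capital envelope (£M)
    power_mw = 0.55          # assumed power envelope (MW)

    power_limited = pf_per_mw * power_mw
    capex_limited = pf_per_million * budget_millions
    print(f"deliverable peak ~{min(power_limited, capex_limited):.1f} PF "
          f"(power-limited {power_limited:.1f} PF, capex-limited {capex_limited:.1f} PF)")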

The ratio of network bandwidth to peak performance is a significant metric for the capability of a computer to scale problems to very large numbers of nodes, since when a problem is spread very widely the portion on each node is small and off-node references become common. The table above clearly shows the trend that 3D memory packaging will essentially solve the memory bandwidth problem by changing the price proposition for many high speed wires. A corresponding breakthrough for long range interconnect may, in future, come from silicon photonics, but not within timeframes relevant to DiRAC-3.

1.6.1 Processor technology survey

There are several possible processing technologies planned for the relevant time frames. The most standard of these is based on the server versions of Intel multi-core chips, essentially a continuation of the long running and standard Xeon approach. These represent a stable software platform, although some codes find it difficult to use vector instructions efficiently. Some novel architectures were also proposed, including processing based on NVIDIA graphics processor units and Intel Knights Landing many-core chips. All of these are future processors; since the introduction of new technologies is a challenging and complex task, details of plans may necessarily change before the final procurement.

Accelerator specific considerations: The GPU and Intel many-core chips are novel products which substantially improve the power efficiency and memory bandwidth per computing device. The NVIDIA power efficiency quoted above is based on a balance of two GPUs per hosting node; the overhead of the host server reduces the power efficiency of the device somewhat compared to the devices in isolation. Submissions were made using the Knights Landing product in two configurations: as an accelerator, which yields a similar power performance to GPUs, and in a self-hosting mode with better power performance.

Technology                                | Peak performance per chip | Peak performance/watt (system level) | Availability estimates
Intel multi-core (Broadwell)              | 600 Gflop/s               | 2 PF/MW                              | 2015/6
NVIDIA GPU (Pascal)                       | >2000 Gflop/s             | 5 PF/MW                              | 2015/6
Intel many-core Knights Landing self-host | >3000 Gflop/s             | 8 PF/MW                              | Q1 2016
Intel many-core Knights Landing hosted    | >3000 Gflop/s             | 5 PF/MW                              | Q1 2016

Novel architectures bring the disadvantage of novel programming models. The Knights Landing product maintains source compatibility with the standard MPI programming model in C, C++ and Fortran, while the NVIDIA product in particular would require substantial software re-engineering, and this may only be possible for a subset of the DiRAC code bases. The power efficiency of self-hosted Knights Landing systems is fortuitous, since the self-hosting mode also provides software compatibility with existing multi-core processor based clusters: self-hosting presents both the best software compatibility and the best power efficiency.

1.6.2 Interconnect technology survey

The interconnect of a parallel computing system has a major impact on application performance. Sufficient bandwidth between nodes must be provided to ensure that the application runtime is dominated by computation and the nodes are not left waiting excessively for inter-node communication. The fastest communication networks can constitute a sizable proportion of the cost of a supercomputer. Since the balance between local computation and communication depends on the scientific application and algorithm, there may be cost savings if the most expensive networks are procured only for those portions of the scientific codes that require them.
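A crude estimate of this balance for a regular halo-exchange code is sketched below; every input is an illustrative assumption rather than a measured DiRAC figure, but the structure of the estimate is what a procurement benchmark effectively measures.

    # Communication/computation balance for a regular 3-d halo-exchange code.
    # Every input is an illustrative assumption, not a measured DiRAC figure.
    local_extent = 64                  # lattice sites per dimension held on one node
    bytes_per_surface_site = 2304      # halo data exchanged per surface site
    flops_per_site = 4000              # arithmetic per site per iteration

    sustained_flops = 1.0e12           # 1 Tflop/s sustained per node
    network_bw = 12.5e9                # 12.5 GB/s per node

    volume_sites = local_extent ** 3
    surface_sites = 6 * local_extent ** 2

    t_compute = volume_sites * flops_per_site / sustained_flops
    t_exchange = surface_sites * bytes_per_surface_site / network_bw
    print(f"compute {t_compute * 1e3:.2f} ms vs halo exchange {t_exchange * 1e3:.2f} ms per iteration")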

Our RFI exercise has identified that in most cases the vendors are using broadly similar and commodity interconnect technology, based on the Infiniband standard. Infiniband bandwidth has evolved through a series of generations (QDR, FDR, EDR) with data rates increasing by a factor of two with each generation, and the price structure reflects the data rate. There are several proprietary networks worth mentioning: SGI is developing Hypercube interconnected systems based on the Intel StormLake-1 interconnect. This interconnect is broadly similar to, and competitive with, Infiniband EDR, and Knights Landing nodes may use two links per processor. Cray has a proprietary high performance all-to-all (Dragonfly) interconnect. SGI has also submitted shared memory systems based on a highly capable NUMAlink cache coherent interconnect, but this adds significant expense.

Interconnect technology | Bandwidth per port (transmit+receive) | Bandwidth per processor (transmit+receive)
QDR Infiniband          | 8 GB/s                                | 2-8 GB/s
FDR Infiniband          | 13.6 GB/s                             | 3.4-13.6 GB/s
EDR Infiniband          | 25 GB/s                               | 6.25-25 GB/s
Intel StormLake-1       | 25 GB/s                               | 12.5-50 GB/s
Cray Dragonfly          | 32 GB/s                               | 16-32 GB/s
SGI NumaLink            | 32 GB/s                               | 32 GB/s
BlueGene/Q              | 42 GB/s                               | 42 GB/s

The ratio of network ports to the number of processors and/or accelerators in a system is a key tunable characteristic that significantly affects the system price and the delivered application performance. Benchmarking during procurement is the key mechanism that will ensure the right balance of capital expenditure on network performance, computational performance and storage subsystems. The interconnect should be fast enough to ensure the hardware is used efficiently, without spending more than is required.

It is worth noting that the DiRAC-2 BlueGene/Q service provided 42 GB/s interconnect bandwidth for a 205Gflop/s computing node. Despite node performance growing in some cases to 3Tflop/s, only one vendor (SGI) plans an interconnect (slightly) exceeding this bandwidth.
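The point is easily quantified; the BlueGene/Q figures are those quoted above, and the comparison node is an assumed 3 Tflop/s node with a single 25 GB/s port.

    # Network bandwidth per unit of node peak performance.
    nodes = {
        "BlueGene/Q (DiRAC-2)": (42.0, 0.205),                   # GB/s, Tflop/s (from the text)
        "assumed 3 Tflop/s node with 25 GB/s port": (25.0, 3.0),
    }
    for name, (bw_gb_s, peak_tf) in nodes.items():
        print(f"{name}: {bw_gb_s / peak_tf:.0f} GB/s per Tflop/s")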

1.6.3 Storage technology survey

High performance and large capacity storage is a significant requirement for all three types of service, although the level of requirement differs between them. The three most common media technologies are non-volatile memory, magnetic disk, and magnetic tape. Non-volatile memory includes solid-state disk and similar, but more server oriented, devices. Given the pricing and speed of these technologies, a sensible cost optimization approach includes a mix, with long term bulk storage residing on tape and frequently accessed data remaining on redundant arrays of inexpensive disks. Hierarchical storage manager software solutions make the distinction transparent to the user, with a suite of devices appearing as a single and easy to use file system. The predicted storage pricing is summarized below.

Technology | Pricing (PB/£M) | Speed
Flash      | 1 PB/£M         | Fast
Disk       | 3 PB/£M         | Medium
Tape       | 10 PB/£M        | Slow

The balance between storage types may be tailored to each service, balancing the need for high speed access to data that is currently being used against slower speed access to longer term data. Advanced software, such as hierarchical storage managers, was proposed by several vendors and is commonly used to make the management transparent and to provide multiple copies of critical data to ensure integrity.
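An indicative costing of a tiered configuration, using the survey pricing above, is sketched below; the capacity split across tiers is an illustrative assumption, not a proposed DiRAC-3 configuration.

    # Indicative cost of a hierarchical storage mix versus a single-tier solution,
    # using the survey pricing above (Flash 1, Disk 3, Tape 10 PB per £M).
    price_pb_per_million = {"flash": 1.0, "disk": 3.0, "tape": 10.0}
    capacity_pb = {"flash": 0.5, "disk": 4.0, "tape": 10.0}   # assumed tier sizes

    tiered_cost = sum(capacity_pb[t] / price_pb_per_million[t] for t in capacity_pb)
    total_pb = sum(capacity_pb.values())
    all_flash_cost = total_pb / price_pb_per_million["flash"]

    print(f"{total_pb:.1f} PB tiered for ~£{tiered_cost:.2f}M, versus ~£{all_flash_cost:.1f}M if all flash")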

Since storage is typically not filled immediately and since prices decrease over time, it is common best practice in procurement to install storage in a series of tranches, spending 4/7th, 2/7th and 1/7th of the storage capital over three years. With storage prices roughly halving year on year, this profile yields an equal volume of newly installed storage in each year. In a hierarchical storage system, this profile may only apply to the lowest layer in the hierarchy, such as tape. DiRAC expects to negotiate and commit to such an installation profile with successful vendors and sites.

1.7 Benchmarking

The large DiRAC projects have provided benchmarks to the TWG that represent the bulk of the computing cycles currently allocated by STFC's RAC to DiRAC science. These will be used to benchmark the hardware bids for each of the three services during hardware procurement. A shared repository on the RCUK Research Data Facility has been set up by the TWG for holding these benchmarks; the TWG will be responsible for making them available as part of a vendor package for each system during the coordinated procurement of hardware. The codes are listed in the following table.

Project | Code           | Service          | Project | Code                | Service
dp002   | MODAL          | Data Intensive   | dp007   | BQCD-clover         | Extreme Scaling
dp002   | CACTUS         | Data Intensive   | dp008   | Bagel-dwf           | Extreme Scaling
dp005   | Gadget-2-agn   | Data Intensive   | dp008   | Bagel-clover        | Extreme Scaling
dp005   | Gadget-2-disc  | Data Intensive   | dp008   | Bagel-wilson        | Extreme Scaling
dp019   | MILC-hisq      | Data Intensive   | dp008   | Bagel-wilsonAdjoint | Extreme Scaling
dp020   | TROVE          | Data Intensive   | dp009   | HiRep/WilsonAdjoint | Extreme Scaling
new     | PRIMAL         | Data Intensive   | dp015   | sphNG               | Data Intensive
new     | SMAUG          | Data Intensive   | dp015   | TORUS               | Data Intensive
dp004   | Swift          | Memory Intensive | dp016   | RAMSES-AMR          | Data Intensive
dp004   | Gadget-eagle   | Memory Intensive | dp018   | Enzo                | All
dp004   | Gadget-4       | Memory Intensive | dp010   | Lare3d              | All
dp004   | RAMSES-ECOSMOG | Memory Intensive |         |                     |

1.8 Procurement strategy

We will describe in this section how DiRAC proposes to tension, procure and operate the three services. The considerations that must be balanced are the relative scientific weight of DiRAC science codes, how this mix of codes should be mapped to hardware, and the value added by siting or hosting decisions for the services.

Scientific refereeing (Completed June 2014): The DiRAC science case has been submitted to international peer review, organized by the RAC chairs, and the referee assessments of the DiRAC projects will be returned to the Resource Tensioning Panel.

Technical refereeing (July 2014): The STFC Computing Advisory Panel (CAP) have been asked to appoint independent and internationally respected experts in HPC provision to referee this technical proposal. They will also be asked to report on four aspects for each of the three proposed services (and four machine requirements): the cost-effectiveness in both capital (1) and recurrent expenditure (2), the efficiency of the proposed hardware for the target applications (3), and the efficiency of the hardware for a broader range of HPC applications (4). CAP has been asked to ensure that this refereeing takes place in July 2014, with reports returned to the Resource Tensioning Panel.

Resource Tensioning Panel (September 2014): The RAC chairs will be asked to construct an independent scientific tensioning panel to allocate capital resources to the services in September 2014. This tensioning panel will take into account the scientific merit of the DiRAC projects for which each service is viewed as ideal, the technical assessments of the cost-effectiveness, the total cost-of-ownership of DiRAC, and the level of freedom each specified service will subsequently give to RAC for allocating time to multiple science domains. This document includes three price points for each service, designed to allow the Resource Tensioning Panel to interpolate or extrapolate system metrics to whatever level of funding the panel decides.

A key outcome of the Resource Tensioning Panel will be the capital expenditure profile, with the broad balance between expenditure in 2015/16 and 2016/17 being fixed by their decisions. The capital expenditure profile and technology balance decided upon will be communicated to the DiRAC Project Management Board for comments, which would then be returned to STFC by November 2014.

Service provider call and selection (14 October 2014 – 14 December 2014): STFC will select a coordinating University to organize a two stage coordinated procurement for the three services. The coordination of procurement will enable STFC to leverage the bargaining power of scale should the same vendor be the most technologically appropriate for two or more of the services. University College London played this role for DiRAC in organizing the DiRAC Request for Information, and would be non-conflicted in reprising this role for selection of service providers and the coordination of procurement.

The coordinating University will run an open competition for Service Providers to bid to run one or more of the three services. The call will be issued in October 2014 with submissions received by December 1 2014. Bids will be required to include detailed information:

- Specification of at least two technological solutions that could provide the service;
- Electrical power, cooling and recurrent electrical budgets sought from STFC;
- Infrastructure capacity and margin for the Host to operate the system;
- Service Level Agreement terms;
- Staffing levels and salaries sought from STFC for operating the system;
- Staffing levels and salaries sought from STFC for user support for the system;
- Monetary and equivalent value added by the Bidder. Value add considerations include:
  - Electrical budget, hosting infrastructure investment and similar;
  - Additional staffing support for user support and administration;
  - Additional staffing support for application optimization and code development;
  - Additional staffing support for the integration of the broader facility with the rest of DiRAC and the UK e-infrastructure;
  - Contribution to community training; and
  - Industrial engagement, collaboration, and industry contributions.

The STFC RAC chairs will appoint an independent technical review panel to evaluate and select Service Providers from the bids, with this panel meeting in December 2014.

Benchmarking preparation (TWG April 2014-April 2015): The large DiRAC projects have identified benchmark codes to be used in evaluating hardware for each of the service types. For each service the benchmarks submitted by those projects whose scientific merit was used to tension that service will be used as the performance evaluation metric during hardware procurement. The Resource Tensioning Panel may also instruct the TWG to include additional benchmarks in the evaluation of any of the systems.

Coordinated hardware procurement (April 2015-Dec 2015): The successful bidders for the provision of service will be contractually obliged to participate in a subsequent coordinated procurement of the hardware for these services. Each of the service categories will be evaluated against the benchmarks provided by the DiRAC projects for the science domains against which the service was evaluated in the tensioning process. The STFC Technical Working Group and the coordinating University will retain panel representation in the evaluation of the hardware vendor bids. The procurement and benchmarking may extend through 2015 for systems based on the Knights Landing processor.

Service Integration: The successful service bidders will be contractually obliged to work with the TWG to include the services in the integrated user and project administration, accounting, monitoring and help desk framework that will enable the operation of DiRAC-3 as an integrated facility. This integration will enable STFC to perform resource allocation and facility monitoring.

1.9 Timeline

Selection of service nature and provision | Begin            | End
Benchmark preparation                     | 1 April 2014     | 1 Jan 2015
Scientific refereeing                     | 1 May 2014       | Completed 8 Jun 2014
Technical refereeing                      | 1 July 2014      | 1 August 2014
Resource Tensioning Panel                 | 1 October 2014   |
Service provider call                     | 15 October 2014  | 1 December 2014
Service provider selection                | 14 December 2014 |

Hardware Phase 1                          | Begin            | End
Coordinated procurement                   | 1 April 2015     | 1 August 2015
Hardware installation                     | 1 August 2015    | 1 October 2015
RAC Service commence                      | 1 December 2015  |

Hardware Phase 2                          | Begin            | End
Coordinated procurement                   | 1 October 2015   | 1 December 2015
Hardware installation                     | 1 April 2016     | 1 December 2016
RAC Service commence                      | 1 Jan 2017       |

Service Integration                       | Begin            | End
Extend and enhance the DiRAC-2 SAFE, leveraging considerable pre-existing DiRAC software infrastructure and expertise | 1 April 2015 | Integration completed by Phase-1 commencement; incorporate Phase-2 in the second year; support over the entire duration of the service

1.10 Expenditure profile

The case studies that follow document several price points for each of the services. The division of DiRAC expenditure between the DiRAC-3 services will be determined by the Resource Tensioning Panel. This tensioning will incorporate referee responses from the Peer Review of the DiRAC-3 Science Case, and referee responses to this case. Sufficient information is provided in this Technical Case to ensure that the Resource Tensioning Panel can estimate the effect on facility performance of a reallocation of funds. The precise capital expenditure profile will be an output from the resource tensioning process and at this stage only a best guess can be given; some flexibility must be retained to move expenditure across the fiscal year boundary in response to the Resource Tensioning Panel recommendations. A proposed capital expenditure profile is as follows:

                     | Capital | Electrical power | Fiscal year
DiRAC compute        | £10M    | 550 kW           | 2015/16
DiRAC data facility  | £3M     | 20 kW            | 2015/16
DiRAC compute        | £16M    | 1200 kW          | 2016/17

3 Case Studies

The remainder of this document presents case studies intended to provide detailed service specification and justification of required features for each of the proposed DiRAC-3 computer systems, with reference to the primary scientific applications for which each is tailored.

1.11 Data Intensive (DI) Service

We propose a large-scale, distributed, data-centric cloud HPC service for PPAN science in the UK. The DiRAC-3 Science Case shows that progress in many areas of our research programme demands a step-change in our capability to handle large data sets, both to perform and analyse precision theoretical simulations and to confront them with the next generation of observational data. Many key DiRAC-3 projects also involve the exploration of high-dimensional parameter spaces using statistical techniques which generate large numbers of computationally intensive models. The DI service will support data intensive science, supporting a range of architectures (distributed memory, cache-coherent shared memory, fat nodes, accelerators, tightly-coupled storage) and programming paradigms (MPI, OpenMP, Hybrid, Accelerated). The DI service will appear to the user as a single logical system, using the cloud computing platform OpenStack. Individual DI sub-systems will have their own high speed local parallel file systems (scratch areas), but will share a common filesystem for Tertiary Storage (data products, applications software, etc.). To ensure optimal use of its state-of-the-art capabilities, the service will build on the innovative code development work within DiRAC-2 (e.g. the COSMOS@DiRAC Intel Parallel Computing Centre, Cambridge) by providing research software development support for code porting and optimisation. In addition to the workflows below, the DI service will help maintain UK leadership in projects across the PPAN science domains and facilitate the scientific exploitation of high-profile STFC-funded experiments. It will enhance the HPC skills of UK researchers and lead to new collaborations between industry and academia.

1.11.1 Illustrative workflows

Astrophysics (Science Case A.(vi)): The Gaia satellite project will revolutionise our understanding of the Milky Way, but presents many computational and theoretical challenges. DiRAC-3 Gaia modelling involves three key steps: 1) The probability distribution functions (PDFs) for the physical properties of each star are vital for a proper Bayesian analysis of the 2016 data release (for which incompleteness renders robust error distributions essential) but are not provided by the Gaia data processing. Analysis of the current PDF generation code shows that this requires ~5x10^21 flops for 10^9 stars. 2) Dynamical model construction at minimal velocity/spatial resolution requires 10^9-10^10 points. The simplest chemodynamic models require ~200GB RAM and ~2x10^19 flops (1s/object, 50-70% peak performance), while a single N-body model with N~5x10^9 particles (using PRIMAL) requires at least 9x10^6 core-hours and 2.5TB of RAM, and generates a time series of ~0.25PB. 3) The confrontation of models with data is limited by tightly-coupled storage volume. For the simplest models, including only kinematics and binned data, confrontation of a minimal set of 10^3 models with 10^9 stars requires ~7x10^19 flops (assuming linear scaling with the number of stars). More complex models must compare the full PDFs for each star. For 10^9 stars, each model involves moving ~1PB of Gaia data (see Table 1) into/out of the RAM being used; again, a minimal study requires 10^3 models. To be competitive, this can take at most 2 months (90 min/model), requiring data transfer rates of ~200GB/s (actual).

Table 1: Expected Gaia data volume comparison
Data          | N_stars | MB/star | Total
Gaia (2016)   | 10^9    | 0.2     | 200TB
Stellar PDFs  | 10^9    | 1       | 1 PB
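The ~200GB/s figure can be checked with a back-of-envelope calculation (a sketch only, using the 1PB-per-model and two-month figures quoted above):

    # Sustained data rate needed to confront 10^3 models, each streaming ~1 PB of
    # stellar PDF data (Table 1), within a two-month window.
    bytes_per_model = 1.0e15
    n_models = 1000
    window_seconds = 2 * 30 * 24 * 3600   # two months, approximately

    seconds_per_model = window_seconds / n_models
    rate_gb_s = bytes_per_model / seconds_per_model / 1e9
    print(f"~{seconds_per_model / 60:.0f} min/model -> sustained rate ~{rate_gb_s:.0f} GB/s")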
Particle physics (Science Case C.(i) and C.(ii)): Numerical solution of Quantum Chromodynamics provides key results for the experimental programme at the Large Hadron Collider at CERN. The expensive step is calculating propagators for light up and down quarks on gluon fields defined on large, fine space-time lattices. These are needed for multiple physics results and so storage, rather than re-calculation, is optimal. Storage of 2000 current 64^3x96 gluon field configurations (20TB) and 16 quark propagators (each 4GB) per configuration gives a total of 128TB. In the next 2-3 years, lattice spacings will halve for the same 4-d volume and so storage requirements will grow by a factor of 2^4 = 16. The cost of calculating light quark propagators will increase by more than a factor of 16. A typical job, once light propagators have been calculated, will then require reading in a ~200GB configuration and one or two ~60GB propagators before making temporary strange or charm propagators and combining into a meson correlation function. Calculations on a full ensemble in one month require repeatedly moving 300TB of configurations and 2PB of propagators in and out of RAM; much of these data would need to be in fast storage at any one time. Typical job lengths range from 1-24 hours, thus data rates of 5GB/s (actual) would suffice (~1 minute to read in 260GB). With additional data management overhead, the volume required in fast storage could be reduced by a further 50% using staging between fast and slow storage during the calculation.

Cosmology (Science Case A(ii)): DiRAC Planck satellite science exploitation has yielded some of the most highly cited physics papers of 2013-14, and most Planck non-Gaussian results were obtained from the COSMOS Modal Bispectrum pipelines. The current Planck satellite pipeline for modal resolution p=27 (with n=1000 modes) uses 1.1TB of shared memory at runtime, with the variance being estimated from 1000 map realizations. On DiRAC-3, Planck 'hi-res' bispectrum estimation will target double spatial resolution using p=54 modes (with the linear term scaling as p^2, the memory is estimated to be 4.3TB). The Planck trispectrum estimation (four-point correlator) is even more demanding: good convergence of the "linear term" requires 10K simulated maps to minimize Gaussian bias, and 10M or 15M core-hours at resolutions of p=27 or p=54, respectively. The CMB modal bispectrum methodology has been implemented in a 3D OpenMP pipeline for calculating the dark matter and halo bispectrum in N-body simulations, and will be applied to galaxy survey data, notably SDSS/BOSS and DES. Current investigations of structure formation in N-body codes use 3840^3 particles at resolution p=8 (n=50 modes) and require ~4TB of shared memory. Defining requirements come from bispectrum estimation of galaxy survey data: DES 4096^3 mock data sets with dark matter and halos consume 7-8TB in memory (particle position/velocity data) and 10-14TB on disk for several redshifts. Mock catalogue analysis at an effective resolution of 2048^3 uses 5TB of shared memory at the current p=8 resolution, and the DES bispectrum estimation will target double resolution, p=16 (n=237 modes), requiring 19TB of shared memory and 12M core-hours, for which high throughput and I/O bandwidth are essential.
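The quoted memory growth for the 'hi-res' pipeline follows directly from the stated p^2 scaling of the linear term; a one-line check, using only the figures above, is:

    # Shared-memory footprint of the modal bispectrum pipeline, scaling as p^2 from
    # the quoted 1.1 TB at p = 27; the p = 54 result reproduces the ~4.3 TB estimate.
    mem_p27_tb = 1.1
    for p in (27, 54):
        print(f"p = {p}: ~{mem_p27_tb * (p / 27) ** 2:.1f} TB shared memory")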

1.11.2 Technical requirements from science case

The full set of technical specifications required to deliver the DI science programme of DiRAC-3 (see Table 2) is:
- Compute: a minimal requirement based on the estimates of the individual DI projects in the Science Case is 1.5-2 PF. It is difficult to be more precise due to the heterogeneous nature of the DI workflows.
- Tightly coupled storage: The projected requirements for the Gaia satellite project (~1PB at 200GB/s) exemplify the data rates necessary to support DiRAC DI science in the Petascale era for observed datasets.
- 20TB SMP node: This will enable fast prototyping and deployment of complex data analysis pipelines. The size of the single-image SMP node assumes that the DI service includes significant software engineering support to migrate the majority of our current SMP workflows onto the distributed memory components of the DI service (the 14TB SMP service of DiRAC-2 will be used for this code porting).
- 256GB standard nodes: The memory requirements for the standard nodes will be determined by benchmarking of existing codes to ensure an optimal GB/GF ratio for the science programme.
- 1TB fat nodes: A set of 1TB fat nodes for smaller OpenMP workflows.
- Accelerated nodes: GPU and KNL nodes will enable DI users to explore the use of these new technologies. We note that ~80% of the PF requirement should be delivered by x86 nodes: at present the majority of DI service users do not use accelerators, and not all algorithms are amenable to acceleration.
- 14PB storage: This is for medium term storage only and assumes that a long-term DiRAC-3 data repository is provided separately.

Table 2: Technical requirements for DI service from DiRAC-3 Science Case
Requirement                                  | Science Case | Code (name/type)        | Comment
~1.5 PF                                      | C(i)         | MILC (MPI)              |
1PB storage (200+ GB/s)                      | A(vi)        | SMAUG (hybrid)          | Code is vectorisable
20TB/39TF SMP                                | A(ii), B(v)  | WALLS (modified), TROVE | Focus on largest SMP calculations
1TB fat nodes                                | A(ii)        | Many OpenMP codes       | Focus on smaller OpenMP calculations to relieve pressure on SMP
>8 GB RAM/core                               | A(v)         | RAMSES (MPI)            | Benchmarking required
128 GPU nodes                                | C(i)         | MILC/HadSpec            | Code-porting underway
32 KNL nodes                                 | A(vi)        | TORUS                   | Fraction of code base that can be ported will depend on memory capacity
<1s interconnect latency                     | A(ii)        | WALLS                   |
14PB storage                                 | All          | All                     | Modest increase in PB/PF ratio from 12 to 14
Remote 3D visualisation                      | A(viii)      | sphNG (hybrid)          | Benefit to many projects; will make use of 20TB SMP
Software development support (36 FTE-months) | A(ii), B(v)  | WALLS, TROVE            | Other OpenMP codes will need porting to MPI
SMP component for code porting               | A(ii)        | OpenMP codes            | Machine to be used for porting SMP codes to distributed memory
Software engineering support (36 FTE-months): The DiRAC-3 Technical Review Committee (TRC) noted the value of an SMP component within the DI service to support projects whose workflows “have a long standing dependence upon large complex codes which would be difficult to adapt to a DM [distributed memory] machine and require a very substantial investment in software engineering.” The SMP provision we propose will only deliver the DiRAC-3 science goals if we are able to relieve the pressure on the 20TB SMP node by porting most of our OpenMP code base to MPI in a timely manner. Thus, it is vital that the DI service provides dedicated, intensive applications support both to port suitable dp002/020 OpenMP pipelines to MPI and to optimize many existing codes. The TRC noted that this effort “is crucial to evolution of the code bases that are intimately tied up with technology”. Examples from the COSMOS@DiRAC-2 IPCC of the benefits of dedicated software engineering support include an algorithmic breakthrough which led to a 730x speed-up for part of the modal bi-/trispectrum code. We note that this support is required specifically for the SMP code base of the DI service. Software development and migration efforts for other DiRAC users will be supported by the proposed virtual DiRAC hardware and Software Innovation Centre.

1.11.3 Proposed System characteristics
The table below indicates one outline technical solution for the DI service and the associated cost envelope. The system capital cost includes £250K for software engineering (36 FTE-months) and a £0.5M SMP cost (i.e. an equivalent number of PF without SMP provision would cost £0.5M less). We present the service as a single system to emphasise the unified nature of the DI service. However, a two-site DI solution, with the capabilities distributed appropriately between the two sites, provides quantifiable advantages to DiRAC-3. Within an individual site, all DI components would share a common filesystem and service management team.

Case A: this solution is the minimal service required to deliver all the science goals. Its key features include:
• 1.24 PF: the shortfall in PF relative to the requirements is expected to be made up via vendor discounts.
• Hierarchy of nodes: standard, fat, SMP and accelerated nodes, some accessing tightly-coupled storage.
• Tightly-coupled storage: the above solution delivers bandwidth by aggregating the bandwidth of spinning discs in conjunction with Flash storage – other possible configurations will be considered and benchmarked prior to procurement. Such a system requires the scheduler to be aware of both the compute and bandwidth needs of particular codes, as well as the distribution of large data sets across the storage volume.
• SMP: at this time the mode of SMP provision is not pre-determined – benchmarking of our largest codes is underway and will be used to identify the most appropriate SMP solution (e.g. hardware vs vSMP). The existing UV2000 will be used for code migration, but will not require DiRAC-3 investment.

A multi-site DI service offers significant benefits: (1) leveraged access to the skillsets of large support teams at lower FTE cost (e.g. DiRAC-2 support at Leicester is provided by a team of 15 staff: DiRAC pays for 1 FTE of effort). Broad skillsets are essential to manage a novel, complex, heterogeneous system with up to 1000 users. This model also allows flexible access to additional staff effort at key times (e.g. procurement and installation) and provides illness/holiday cover; (2) data assurance; (3) development/testing of an innovative, fully-federated storage and data analytics system as a prototype for other national facilities. We consider the two-site model for the DI service to be vital since it necessitates the deployment of both software tools and ways of working that facilitate computational and data federation across multiple sites. This is essential for any national-level data service and is therefore an important contribution of DiRAC to the UK national e-Infrastructure. Clearly, however, multi-site service bids must demonstrate that their provision will be cost effective.

Reduced funding:
Case B: reduced capability means that, for example, Gaia work would require sole use of the tightly-coupled storage for extended periods, impacting negatively on delivery of the UKQCD science case and other projects.
Case C: lack of capability means that a significant fraction of DiRAC-3 DI science is not delivered: delayed Gaia/Planck analysis will potentially damage UK leadership.

Table 3: Example DI implementations compared to DiRAC-2 service
Item | DiRAC-2¹ | Case A | Case B | Case C
Capital price (inc. VAT) | £5.2M | £10.75M | £8.75M | £5.75M
Floating pt. (double) | 0.226PF | 1.24PF = 0.82/0.13/0.25/0.04 PF (x86/KNL/GPU/SMP) | 0.69PF = 0.65/0.2/0.04 PF (x86/KNL/SMP) | 0.47PF = 0.43/0.04 PF (x86/SMP)
Nodes | 572/1 (x86/SMP) | 800/32/64/1 (x86/KNL/GPU/SMP) + 10x(48-core, 1TB) | 650/32/1 (x86/KNL/SMP) + 10x(48-core, 1TB) | 425/1 (x86/SMP)
Cores | 9150/1784 (x86/SMP) | 19680/64/128/2560 (x86/KNL/GPU/SMP) | 16080/64/2560 (x86/KNL/SMP) | 10200/2560 (x86/SMP)
Flash I/O | None | 0.6PB | 0.3PB | 0.6PB
SMP | 14TB | 24TB | 24TB | 24TB
Storage | 1.81PB | 14.25PB | 14.25PB | 7.25PB
Power (kW) | 278 | 488 | 405 | 251
Prog. Effort | 0 | 1 FTE (36 months) | 1 FTE (36 months) | 1 FTE (36 months)
Notes: ¹DiRAC-2 numbers combine Complexity, HPCS and COSMOS

Table 4: Node configurations in example DI implementations
Node type | Standard (x86) | Fat (x86) | SMP | Accelerated
Cores | 24 | 48 | TBC | 2 (KNL or GPU)
Memory | 256GB | 1TB | 24TB | 128GB
Fl. pt. (double) | 1TF/s | 2TF/s | 80TF/s | 7/5TF/s (KNL/GPU)
Mem. bw. | TBC | TBC | 68GB/s (Xeon) + KNL | TBC
L2 cache bw. | TBC | TBC | 166GB/s (Xeon) + KNL | TBC
Network bw./node | 56/100Gb/s (4xFDR/4xEDR) | 56/100Gb/s (4xFDR/4xEDR) | 64GB/s | 56/100Gb/s (4xFDR/4xEDR)

1.12 Extreme scaling service
The Extreme Scaling service will particularly target the needs of scientific areas C(i), C(ii), C(iii), C(iv), C(v) and C(vi) and will support, among others, the needs of DiRAC projects dp006, dp007, dp008 and dp009. These projects consumed around 83% of the computing cycles on the DiRAC-2 BlueGene/Q system. Many other DiRAC projects and science areas could use this hardware, but may also be able to use hardware with a less capable interconnect. The system would, by design, also be useful for all halo-exchange PDE problems, for example UKMHD dp010 and HPQCD dp019. It is worth noting that systems with an extremely capable network are also able to support applications with lower network requirements.

QCD simulations require the repeated solution of the Dirac equation on many snapshots of the force-carrying gauge fields, while sampling the Feynman path integral. The discrete systems are large and are regularly laid out in structured four-dimensional grids, and DiRAC-2 enabled simulations on volumes of 64^3x128 and 48^3x256. Each solution of the Dirac equation is dependent on the result of the previous step, and it is necessary to run a single simulation very quickly for up to one year. As a result, it is also necessary that the largest partitions of the machine are efficient for our largest simulations.

The DiRAC-3 system must find the best balance of total floating point performance, network performance, power requirements, delivered application performance, and ease of use for a large community. The step change in simulations targeted in the science case requires lattices of extent up to 128^3x256, increasing the volume by around a factor of sixteen. This volume is taken as the design point, and provides a calculation of the required network bandwidth for each node. The regularity of the problem allows for massive and fine-grained parallelism (and so novel architectures). The simulations can be scaled to very large system sizes, provided adequate network bandwidth is available.

Floating point performance: The science programme outlined in the scientific case could be delivered with a 6-9 Pflop/s (double precision) and 12-18 Pflop/s (single precision) system. This provides a substantial increase over the DiRAC-2 1.26 Pflop/s (single/double) performance, and is consistent with the increase in volume. Thread-level and vector-instruction parallelism may, with software effort, be exploited with good efficiency. The RFI suggests 6-9 Pflop/s double precision peak performance using Knights Landing or GPU technology is viable in terms of capital and recurrent budgets. This could be achieved with around 2048-3072 nodes, each of around 3 Tflop/s peak, and a £10-16M budget. The computational power should be accompanied by adequately performing memory and network subsystems, and minimal electrical power.
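The volume and node-count arithmetic behind these figures can be made explicit; the sketch below uses the lattice extents from the design point and the ~3 Tflop/s per-node peak quoted above (the exact per-node figure is indicative, not a procurement requirement).

```python
# Sketch of the scaling argument: the target lattice is ~16x the largest
# DiRAC-2 volume, and 6-9 Pflop/s peak can be assembled from nodes of
# roughly 3 Tflop/s each, giving the 2048-3072 node range quoted.

def lattice_sites(*extents):
    """Number of sites in a lattice with the given extents."""
    n = 1
    for e in extents:
        n *= e
    return n

volume_ratio = lattice_sites(128, 128, 128, 256) / lattice_sites(64, 64, 64, 128)
print(f"volume increase over DiRAC-2: {volume_ratio:.0f}x")   # 16x

node_peak_tflops = 3.0
for system_pflops in (6, 9):
    nodes = system_pflops * 1000 / node_peak_tflops
    print(f"{system_pflops} Pflop/s at {node_peak_tflops} Tflop/s per node -> ~{nodes:.0f} nodes")
```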

Energy efficiency: For the DiRAC-3 upgrade, a substantially more power-efficient new technology is required since each MW translates to £1M p.a. of recurrent expenditure. This requirement rules out conventional server processors as a technology direction since they are no more power efficient than the existing BlueGene/Q at 2.3 PF per MW (equivalently 2.3 GF per Watt). This efficiency will be substantially improved in 2016 by systems based on Intel Knights Landing devices. Cray and SGI give the most power-efficient responses at 8 PF per MW for double precision, and 16 PF per MW for single precision. This factor of four greater efficiency of Knights Landing over Xeon-based systems is essential to upgrade the extreme scaling service. GPU-accelerated computing nodes are also a viable option in terms of energy efficiency, and having two appropriate processing technologies planned for 2016 mitigates risk.
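The recurrent-cost implication of these efficiency figures can be sketched directly; the efficiencies and the £1M per MW per annum figure are those quoted above.

```python
# Sketch: convert the quoted power efficiencies into electrical power and
# annual running cost for a 6-9 Pflop/s (double precision) system.

COST_PER_MW_PA = 1.0e6   # £ per MW per annum, from the text

efficiencies_pf_per_mw = {
    "BlueGene/Q-class (2.3 PF/MW)": 2.3,
    "KNL-era system per RFI (8 PF/MW)": 8.0,
}

for label, pf_per_mw in efficiencies_pf_per_mw.items():
    for target_pf in (6, 9):
        power_mw = target_pf / pf_per_mw
        print(f"{label}: {target_pf} PF -> {power_mw:.2f} MW, "
              f"~£{power_mw * COST_PER_MW_PA / 1e6:.2f}M p.a.")
```

At 8 PF per MW the 6-9 Pflop/s design point sits at roughly 0.75-1.1 MW, broadly consistent with the power figures in the system table in section 1.12.1; at BlueGene/Q-class efficiency the same capability would cost several £M per annum to run.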

Memory bandwidth: NVIDIA and Intel plan to use a revolutionary 3D chip-stack memory technology in the 2016 timeframe for their HPC products. The number of signaling wires, and hence the bandwidth, is vastly increased, and this disruptive technology will, to a great extent, address memory bottlenecks in HPC.

The arithmetic performed per element of data accessed dictates that lattice gauge theory codes require around 1TB/s of cache bandwidth for each sustained Tflop/s of performance. A 50% efficient node with a 3Tflop/s peak would therefore transfer 1.5TB/s from the L2 cache. A 600GB/s chip-stack memory would support this (there is a modest cache reuse factor), while a conventional memory system would give poor performance. The fraction of off-node references then determines the required network bandwidth.
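A minimal sketch of this balance argument, using the figures above (the required cache reuse factor over the chip-stack memory is inferred, not stated in the RFI responses):

```python
# Sketch of the node balance: cache traffic implied by a 50%-efficient
# 3 Tflop/s node, and the cache reuse needed for 600 GB/s chip-stack
# memory to keep up with that traffic.

node_peak_tflops   = 3.0
efficiency         = 0.5    # sustained / peak
cache_bw_per_tflop = 1.0    # TB/s of L2 traffic per sustained Tflop/s

sustained_tflops = node_peak_tflops * efficiency
l2_traffic_tb_s  = sustained_tflops * cache_bw_per_tflop
print(f"sustained: {sustained_tflops:.1f} Tflop/s, L2 traffic: {l2_traffic_tb_s:.1f} TB/s")

hbm_bw_tb_s = 0.6           # ~600 GB/s chip-stack memory
print(f"cache reuse factor needed: {l2_traffic_tb_s / hbm_bw_tb_s:.1f}x")
```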

Network bandwidth and topology: Consider distributing the two target calculations, 64^3x128 and 128^3x256, over 2048 nodes arranged as a 4x8^3 network. One can then calculate the percentage of data references made to neighbouring nodes for DiRAC projects dp006, dp007, dp008 and dp009 (the counting is sketched after the table below). Project dp008 has used the most computing cycles of all projects in DiRAC-2 and has the largest network bandwidth requirement, which is taken as the requirement below; this is the typical case. The on-node accesses come predominantly from an on-chip cache. The required network bandwidth is 5%-10% of the cache bandwidth: 5% represents a reasonable engineering design point. Assuming a 50% efficient calculation and 1.5TB/s of L2 cache bandwidth, the required network bandwidth is in the 45-75 GB/s range for all except the smallest local volume in the table below. The best-performing networks offered by Cray and SGI in this time frame are a reasonably close match, in the 32-50GB/s range.

Global Volume | Nodes | Local volume | % off-node access | Node working set
64^3x128 | 2048 | 16^2x8^2 | 9% | 350MB
64^3x128 | 512 | 16^4 | 6% | 1.4GB
128^3x256 | 2048 | 32^2x16^2 | 5% | 5GB
128^3x256 | 512 | 32^4 | 3% | 20GB

The analysis assumes scalability is determined solely by the interface bandwidth on each node, ignoring network topology effects. QCD has avoided global network contention by precise mapping of the application torus onto a physical network torus, as was done on the BlueGene/Q and QCDOC systems. This is possible in the SGI topology, but careful benchmarking is needed to demonstrate global interconnect scaling.
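The off-node percentages in the table follow from simple surface-to-volume counting for a nearest-neighbour four-dimensional stencil; the sketch below reproduces them, together with the implied per-node network bandwidth (the stencil counting is our assumption about the dominant Dirac-operator communication pattern; the 1.5TB/s cache traffic figure is from the text above).

```python
# Sketch: off-node fraction of nearest-neighbour references for a 4D
# halo-exchange stencil with local extents (l1, l2, l3, l4), and the
# network bandwidth implied as that fraction of the L2 cache traffic.

def off_node_fraction(extents):
    """A +/- neighbour in direction mu is off-node for a fraction 1/l_mu of
    sites; averaged over the 8 directions this is sum(1/l_mu) / 4."""
    return sum(1.0 / l for l in extents) / 4.0

cache_traffic_gb_s = 1500.0   # 50%-efficient 3 Tflop/s node (see above)

configs = {
    "64^3x128  / 2048 nodes (16^2x8^2)":  (16, 16, 8, 8),
    "64^3x128  / 512 nodes  (16^4)":      (16, 16, 16, 16),
    "128^3x256 / 2048 nodes (32^2x16^2)": (32, 32, 16, 16),
    "128^3x256 / 512 nodes  (32^4)":      (32, 32, 32, 32),
}
for label, local in configs.items():
    frac = off_node_fraction(local)
    print(f"{label}: {100 * frac:.0f}% off-node, ~{frac * cache_traffic_gb_s:.0f} GB/s")
```

For the 128^3x256 design point this gives roughly 47-70 GB/s per node; the 32-50GB/s RFI networks are a reasonably close match, with the communication-reducing algorithms discussed in the following paragraphs helping to close the remaining gap.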

Solver algorithms, such as the conjugate gradient methods used in QCD codes, will scale reasonably well to the full system, while smaller partitions may be used to run even more efficiently. The algorithms are not fixed: the prohibitive cost of increasing network bandwidth has given rise to new preconditioned sparse matrix inverters that reduce communication bandwidth. UKQCD’s new HDCG algorithm makes a substantial portion of the work less sensitive to network performance, and allows mixed-precision acceleration. The DiRAC-3 networks will remain balanced for these new algorithms, and in single precision will give an 8-14 fold speed-up.

Ease of use: There are many community codes in use: CPS, Bagel, Chroma, MILC. These are already programmed with hybrid message passing and thread parallelism. The Cray and SGI RFI responses proposed self-hosted KNL nodes connected directly to network interfaces, using the hybrid MPI/OpenMP programming model. The familiar x86 instruction set and multicore environment make these systems reasonably inclusive for non-experts. Vector instructions are required to obtain the absolute best performance, and it is possible that initially only QCD projects would be able to achieve this. Most of the QCD codes could immediately make use of wide SIMD: UKQCD has developed a compiler (BAGEL) for this purpose, giving 71% of peak on SIMD routines on BlueGene/Q. BAGEL has been ported to Xeon Phi.

NVIDIA accelerator products are, from a technical perspective, a good alternative technology. However, the majority of applications that use NVIDIA accelerators efficiently do so through the proprietary CUDA language, which requires substantial vendor-specific code development. NVIDIA maintains a library, QUDA, which supports many of the QCD projects. A GPU environment would nevertheless present a real barrier to use by the broader range of DiRAC projects, and a barrier to the scientific productivity of QCD projects.

1.12.1 Extreme Scaling system characteristics The extreme scaling design point, performance metrics, and a comparison to those of the DiRAC-2 BlueGene/Q system are given below. The BlueGene/Q was designed for extreme scalability in large (>1M cores) US installations, and the DiRAC-3 design point will only attain the same scalability for lattice gauge theory by using communication compression techniques. This would give adequate scalability on older algorithms, and is necessitated by technology trends and facilitated by new algorithms. Both KNL and GPU processor options should be considered in competitive procurement, though significant weight must be given to inclusivity. In a close competition the simpler programming environment of KNL is clearly preferred.

 | DiRAC-2 BlueGene/Q | Extreme Scaling Service (2016)
Capital price | £9.8M | £9M | £10.5M | £16M
System
Floating point (double) | 1.26 Pflop/s | 5 PF | 6 PF | 9 PF
Floating point (single) | 1.26 Pflop/s | 10 PF | 12 PF | 18 PF
Nodes | 6144 | 1728+ | 2048+ | 3072+
Cores | 98k | 121k | 143k | 215k
Electrical power | 500 kW | 0.7-0.8 MW | 0.8-1 MW | 1.2-1.5 MW
Storage | 1 Petabyte | 3+ PB disk + 5-10 PB tape (all options)
Node
Cores | 16 | 70+
Floating point (double/single) | 205.6/205.6 Gflop/s | 3+/6+ Tflop/s (KNL/GPU)
Memory bandwidth | 42 GB/s | 500+ GB/s HBM + 120 GB/s DDR
L2 cache bandwidth | 570 GB/s | 3+ TB/s
Network bandwidth per node | 42 GB/s | 25-50 GB/s

1.13 Memory Intensive Service
The memory intensive system is designed to be cost effective for scientific problems for which the problem size scales as the computational power is increased. For these problems the powerful computing nodes detailed in the RFI responses must be equipped with large amounts of memory. We take examples from the Virgo Consortium (dp004) project as a case study for this system, but note that the resulting machine would also fulfil the requirements of the UKMHD (dp010) project. The Virgo collaboration performs cosmological simulations describing the evolution of the Universe from early times after the Big Bang to the present day. Some of these calculations require modelling large volumes of space at high numerical resolution and cannot be run on the DiRAC-2 infrastructure for lack of total memory. We give two specific examples, taken from sections A(iv) and A(v) of the DiRAC-3 science case, that define the minimum total memory and the minimum memory per core for a DiRAC-3 memory intensive machine. The examples in this case study, together with appropriate benchmarks, will set out the parameters used to select the optimal system characteristics, including memory per node, memory bandwidth, processor architecture and processor count, storage capacity and bandwidth, network performance, and total memory.

Cosmological N-body simulations: these are required to make realistic mock galaxy catalogues, which are crucial for the interpretation of data from billion-dollar observational missions such as Euclid and LSST, and require simulations of Gigaparsec volumes. The goal of modelling the volume of Virgo's MXXL simulation at the resolution of Virgo's Millennium simulation requires 250TB of total memory, approximately five times more than is available in the DiRAC-2 Datacentric system at Durham.

Galaxy formation: galaxy formation simulations such as Virgo's Eagle Project, run on the DiRAC-2 Datacentric system, are unable to model a volume large enough to capture a representative sample of galaxy clusters. 256TB of total memory is required to support a volume eight times greater than Eagle.

Processor architecture and processor count: Software considerations make a system based on either standard Xeon or Knights Landing nodes compatible with current and planned codes. The simulation algorithms in this case study must evolve the universe forward in time over billions of years. As rich structure and galaxies form, the dynamics of particles in the dense regions must be followed with a higher resolution in time (i.e. many small time-steps) than in low-density regions. The processing of these dense regions sets the ultimate limit on strong scalability. At fixed peak performance, hardware with fewer-but-faster nodes will outperform hardware with more-but-slower nodes due to the improved processing of this work-intensive component of the algorithm. Similarly, at fixed total memory, a smaller number of high-memory nodes is essential to minimise load imbalance, as larger nodes hold a larger and overall more representative set of environments of different densities. Detailed benchmarking will be required to determine whether the most work-intensive components of these algorithms benefit more from the improved speed of each core in a Xeon system, or whether the improved memory bandwidth makes Knights Landing in 2016 the correct choice to maximize the return on investment.
The DiRAC benchmark suite will be used to guide this decision, and we are engaged with Intel representatives and with the DiRAC IPCCs to obtain access to accurate simulation of the Knights Landing.

Memory capacity: The minimum aggregate memory capacity of the compute nodes in the system will be 256TB. N-body problems are memory-bandwidth limited: delivered performance is better correlated with memory bandwidth than with peak floating point performance. Measured performance on Virgo’s benchmark codes will be a key metric in the procurement. There is insufficient information at this time in the vendor responses to the RFI to assess whether a Xeon-based node or a Knights Landing-based node will be preferable from a performance and memory capacity perspective. Knights Landing solutions were indicated to support up to 192GB per node, and include a novel memory system with 16GB of exceedingly fast 600GB/s on-package 3D memory/cache. The figure of 192GB per node is unattractive for a memory intensive system, but we expect that by 2016 Knights Landing-based nodes will be able to host a total of 384GB, using six 64GB DIMMs. Experience of running galaxy formation simulations on systems with differing amounts of memory per core suggests that the work-intensive component in cosmology simulations causes poor load balancing when the memory per core falls below 5GB. This empirical figure stems from the dynamic range in density found in the (simulated) universe. A memory requirement of at least 4GB per core is also important for UKMHD (dp010).

Storage capacity and bandwidth: The target of the memory intensive system is simulations that fill the memory. The volumes required for both check-pointing and storing simulation results will be scaled up with the memory capacity, following the workflow used by Virgo on DiRAC-2. These increased simulation sizes require a 10PB total storage volume on delivery, upgraded annually with a 4/7, 2/7, 1/7 phasing of the storage expenditure. Hierarchical bulk storage is reasonable. The total memory of the system (256TB) should be check-pointed in a reasonable time (<30 minutes), giving an application-accessible bandwidth requirement of around 140GB/s. This should preferably be delivered to a robust and easy-to-use parallel file system if technologically possible, and might require burst-buffer technology using local SSDs in each node. Storage throughput will be a key metric assessed during procurement.
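The checkpoint bandwidth figure follows directly from the memory capacity and the time target; a minimal sketch:

```python
# Sketch: bandwidth needed to checkpoint the full 256 TB of system memory
# within the 30-minute target quoted above.

total_memory_tb = 256
checkpoint_seconds = 30 * 60

bandwidth_gb_s = total_memory_tb * 1000 / checkpoint_seconds
print(f"required application-accessible bandwidth: ~{bandwidth_gb_s:.0f} GB/s")
# ~142 GB/s, consistent with the ~140 GB/s requirement stated above.
```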

Interconnect: The most cost-effective hardware uses distributed memory with message passing between nodes. The network should deliver an MPI latency no worse than about a microsecond between any pair of nodes, which favours a relatively flat network topology. However, the bandwidth need not be the highest possible if a lower specification leads to better pricing, since the design point places large volumes of memory-intensive work on each node.

Machine stability: The run times of jobs such as those in the case study are typically weeks to months. A requirement for the memory intensive system is therefore that the hosting site provides UPS and generator support.

Summary of the system characteristics: Depending on whether benchmarking indicates that Xeon or Knights Landing computing nodes are the most appropriate technology for this service, the system will range between one and two PFLOPS peak. The key factor in this decision is the relative performance of a Xeon core and a Knights Landing core on the portions of the simulation where load imbalance is significant. Detailed benchmarking will guide this, and since it is the key technology decision to be made it will receive particular attention from the Technical Working Group. For a cost of £9M (all prices include VAT):

Item | DiRAC-2 COSMA (Xeon) | DiRAC-3 Memory Intensive @ £9M (Xeon) | DiRAC-3 Memory Intensive @ £9M (KNL)
System
Floating point performance (double) | 140 TFLOPS | 1 PFLOPS | 2 PFLOPS
Nodes | 420 | 800 | 680
Number of cores | 6720 | 24000 | 48960
Total memory | 54 TB | 256 TB | 256 TB
Cluster-wide latency | order of several to ~10 microseconds | order of 1 microsecond | order of 1 microsecond
Electrical power (kW) | 150 | 450 | 300
Node
Cores | 16 | 32 | 72
Floating point (double) | 332 GFLOPS | 1.33 TFLOPS | 3 TFLOPS
Total memory bandwidth | 84 GB/s (42 GB/s per socket) | 136 GB/s (68 GB/s per socket) | EDRAM at >500 GB/s, RAM at >120 GB/s
Network bandwidth per node | 8 GB/s | 25 GB/s | 25 GB/s

Below we give two additional systems at lower and higher price points: £7M and £12M. All three proposed systems meet the minimum memory and disk storage requirements. For the £7M system the KNL option is not possible because the minimum system memory cannot be met: it would require 512GB of RAM per node, which we do not think feasible in early to mid-2016. The cluster component of the £12M system is a larger version of the £9M system, and the total storage has been scaled up by a third.

Item | DiRAC-3 Memory Intensive @ £7M (Xeon) | DiRAC-3 Memory Intensive @ £12M (Xeon) | DiRAC-3 Memory Intensive @ £12M (KNL)
System
Floating point performance (double) | 0.7 PFLOPS | 1.4 PFLOPS | 2.7 PFLOPS
Nodes | 512 | 1024 | 900
Number of cores | 16384 | 32768 | 64800
Total memory | 256 TB | 320 TB | 384 TB
Cluster-wide latency | order of 1 microsecond | order of 1 microsecond | order of 1 microsecond
Electrical power (kW) | 340 | 550 | 390
Node
Cores | 32 | 32 | 72
Floating point (double) | 1.33 TFLOPS | 1.33 TFLOPS | 3 TFLOPS
Total memory bandwidth | 136 GB/s (68 GB/s per socket) | 136 GB/s (68 GB/s per socket) | EDRAM at >500 GB/s, RAM at >120 GB/s
Network bandwidth per node | 25 GB/s | 25 GB/s | 25 GB/s
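As a cross-check on the configurations above, the node counts, core counts and total memory figures can be tested against the 256TB minimum and the ~5GB-per-core load-balancing guideline; the sketch below infers per-node and per-core memory directly from the tabulated values.

```python
# Sketch: check the example memory intensive configurations against the
# 256 TB total-memory minimum and the ~5 GB/core load-balancing guideline.
# Node, core and memory figures are taken from the tables above.

systems = {
    "£9M Xeon":  {"nodes": 800,  "cores": 24000, "memory_tb": 256},
    "£9M KNL":   {"nodes": 680,  "cores": 48960, "memory_tb": 256},
    "£7M Xeon":  {"nodes": 512,  "cores": 16384, "memory_tb": 256},
    "£12M Xeon": {"nodes": 1024, "cores": 32768, "memory_tb": 320},
    "£12M KNL":  {"nodes": 900,  "cores": 64800, "memory_tb": 384},
}

for name, s in systems.items():
    gb_per_node = s["memory_tb"] * 1000 / s["nodes"]
    gb_per_core = s["memory_tb"] * 1000 / s["cores"]
    print(f"{name}: {gb_per_node:.0f} GB/node, {gb_per_core:.1f} GB/core")
```

The Xeon configurations sit comfortably above the 5GB/core guideline, while the KNL options (roughly 5.2 and 5.9 GB/core) sit just above it, which is why per-node memory capacity (192GB versus 384GB) is flagged above as a key consideration for the Knights Landing option.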

4 Appendix: Risk Register

Risk | Probability | Consequence | Mitigation
Delay to Knights Landing product | Moderate | Installation of extreme scaling system must take place in 2016. | STFC must plan the capital expenditure profile to allow delivery throughout calendar year 2016.
Cancellation of Knights Landing product | Low | Alternate power-efficient technology required. | In this circumstance the extreme scaling service will have to adopt GPU nodes, which are more difficult to programme.
GPU programming difficulty | Very high | Limited number of scientific applications run successfully. | Strictly limit GPU hardware to a level consistent with codes proven to use this hardware effectively. Create a preference for standard MPI+threads software during procurement.
Xeon vector instruction usage | High | Scientific applications executing scalar code do not use the full capability of the hardware. | Community best practice education. Limited software effort is available, managed by RAC, to provide additional effort according to scientific need. Several IPCCs have been awarded to DiRAC by Intel.
Knights Landing vector usage | High | Scientific applications executing scalar code do not use the full capability of the hardware. | Community best practice education. Limited software effort is available, managed by RAC, to provide additional effort according to scientific need. Several IPCCs have been awarded to DiRAC by Intel.
Insufficient funds to deliver target capacity | Moderate | A smaller than anticipated upgrade is delivered. | The reduced resource will be managed by RAC. The most appropriate response (scaling back allocations, or modifying some research programmes) will be decided.
No appropriate hosting sites | Low | One or more service does not receive an appropriate hosting bid. | This is unlikely since several universities have made considerable investments in hosting infrastructure. Bids for hosting more than one service will be encouraged to engender competition.
Power requirements exceed hosting capabilities | Low | | Hosting bids from multiple sites for each service will be encouraged. Funds can be reallocated between services as part of coordinated procurement as required. Sufficient margin on machine room power infrastructure will be required of hosting sites during competitive selection of service providers.
Power requirements exceed planned recurrent budget | Moderate | | Total cost of ownership will be a key metric in the coordinated procurement and these costs will be known prior to installation, with the hosting site award terms and conditions setting the recurrent budget. This controls the risk.
Insufficient system administration manpower | Moderate | Service standards fall short. | Host sites will be required to detail the level of manpower, and KPIs that will be met, for operating the systems, along with recurrent pricing for this as part of the bidding process.
Insufficient user support manpower | Moderate | Service standards fall short. | Host sites will be required to detail the level of manpower, and KPIs that will be met, for supporting the systems, along with recurrent pricing for this as part of the bidding process.
Hardware vendor ceases operation during procurement or prior to installation | Low | Delay to service if this occurs during procurement. | Financial checks during procurement.
Hardware vendor ceases operation after installation | Moderate | Maintenance of systems may be placed at risk. | Financial checks during procurement. Service provider insurance.
