DiRAC-3 Technical Case

Contributors: Peter Boyle, Paul Calleja, Lydia Heck, Juha Jaykka, Adrian Jenkins, Stuart Rankin, Chris Rudge, Paul Shellard, Mark Wilkinson, Jeremy Yates, on behalf of the DiRAC project.

Table of Contents
1 Executive summary
2 Technical case for support
1.1 Service description
1.2 Track record
1.2.1 Hosting capability
1.3 International competition
1.4 Risk Management
1.5 Impact
1.5.1 Industrial Collaboration
1.6 RFI based Technology Survey
1.6.1 Processor technology survey
1.6.2 Interconnect technology survey
1.6.3 Storage technology survey
1.7 Benchmarking
1.8 Procurement strategy
1.9 Timeline
1.10 Expenditure profile
3 Case Studies
1.11 Data Intensive (DI) Service
1.11.1 Illustrative workflows
1.11.2 Technical requirements from science case
1.11.3 Proposed System characteristics
1.12 Extreme scaling service
1.12.1 Extreme Scaling system characteristics
1.13 Memory Intensive Service
4 Appendix: Risk Register

1 Executive summary

The Distributed Research through Advanced Computing facility (DiRAC) serves as the integrated computational resource for the STFC-funded theoretical particle physics, astronomy, cosmology, and astro-particle physics communities. Since 2009 it has provided two generations of computing service, and presently delivers an aggregate two petaflop/s across five installations. DiRAC has been a global pioneer of the close co-design of scientific software and computing hardware, and has worked closely with technology companies such as IBM and Intel. The fundamental research themes included in DiRAC are of great public interest and have regularly appeared in news articles with worldwide impact. These same fundamental themes consume substantial fractions of computational resources in other countries throughout the developed world, and it is essential that the UK retain its competitive international position.

The DiRAC community has constructed a scientific case for a transformative theoretical physics and astronomy research programme, tackling the full remit of STFC science challenges. This requires an order of magnitude increase in computational resources. This technical case aims to establish that the computing resources required to support this science programme are achievable within the budget and timeframe, and that we have the capabilities and track record to deliver these services to the scientific community.

A survey of our scientific needs identified three broad problem types characteristic of the scientific programme, which may drive the technologies selected. These are problems where the key challenges are: extreme scaling, intensive use of memory, and the rapid access and analysis of large volumes of data. This technical case also establishes a matrix mapping the three services to the different areas of the science programme. It is expected that the peer review of the scientific case will be fed back through this mapping by STFC to tension the resources allocated to the three services. Detailed case studies are provided for each service to show the typical usage pattern it addresses.

Data intensive: These are problems involving data-driven scientific discovery, for example the confrontation of complex simulation outputs with large-volume data sets from flagship astronomical satellites such as Planck and Gaia. Over the period 2015-18, data intensive research will become increasingly vital as the magnitude of both simulated and observed data sets in many DiRAC science areas enters the Petascale era.
In addition to novel accelerated I/O, this service may include both an accelerated cluster and a shared memory component to add significant flexibility to our hardware solution to the Big Data challenges which the DiRAC community will face in the coming years. Some components of the associated computation support fine-grained parallelism and may be accelerated, while some components may require the extreme flexibility associated with symmetric multiprocessing. Use of novel storage subsystem components, such as non-volatile memory, may be very useful for these problems.

Extreme scaling: These problems require the maximum computational effort to be applied to a problem of fixed size. This requires maximal interconnect and memory bandwidth, but relatively limited memory capacity; a good example scientific problem is Lattice QCD simulation in theoretical particle physics. This field provides theoretical input on the properties of hadrons to assist the interpretation of experiments such as the Large Hadron Collider. Lattice QCD simulations involve easily parallelized operations on regular arrays, and can make use of the highest levels of parallelism within processing nodes possessing many cores and vector instructions (a rough strong-scaling sketch is given below).

Memory intensive: These problems require a larger memory footprint as the problem size grows with increasing machine power; a good example scientific application is the simulation of dark matter and structure formation in the primordial universe, where a larger snapshot universe is enabled with additional computing power. As structure forms, the denser portions of the universe experience greater gravitational forces, and these “hotspots” require a disproportionate amount of work. The hotspots present greater challenges to parallelization, and processors with the best performance on serial calculations may be preferred.

The DiRAC technical working group has conducted a technology survey (under a formal Request for Information (RFI) issued by UCL) for possible solutions that will provide a step change in our scientific output in calendar years 2015 and 2016. Many useful vendor responses were received and indicate several competing solutions within the correct price, power and performance envelope for each of the three required service types. Significant changes in available memory and processor technology are anticipated in Q1 2016, and in particular the availability of highly parallel processors integrating high-bandwidth 3D stacks of memory will be transformative for those codes that can exploit them.

2 Technical case for support

The aim of DiRAC-3 is to provide the computing hardware that will enable a step change in scientific
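
As a rough illustration of the strong-scaling behaviour described above (using a hypothetical lattice extent and node counts, not figures for any DiRAC system), the following sketch shows how the halo data each node must exchange shrinks more slowly than its local work when a fixed-size 4D lattice is divided across more nodes, so that interconnect bandwidth increasingly sets the pace:

```python
# Rough strong-scaling arithmetic for a fixed-size 4D lattice spread over a
# growing 4D grid of nodes.  The lattice extent and node counts are
# hypothetical and chosen only to illustrate the trend: local work falls as
# (L/n)^4 while halo data exchanged with neighbours falls only as (L/n)^3,
# so communication claims a growing share of the time.
L = 64                          # hypothetical global lattice extent (L^4 sites)

for n in (2, 4, 8, 16):         # nodes arranged as an n^4 grid
    local = L // n              # local sub-lattice extent per node
    sites = local ** 4          # local compute work ~ number of local sites
    halo = 8 * local ** 3       # face data exchanged with the 8 neighbouring nodes
    print(f"{n**4:6d} nodes: {sites:9d} local sites, halo/volume = {halo / sites:.2f}")
```

Once the halo-to-volume ratio approaches one, a node exchanges roughly as much boundary data as it has local sites to update, so further strong scaling is bought almost entirely with interconnect bandwidth rather than additional flops.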
Recommended publications
  • Durham Unlocks Cosmological Secrets
    Unlocking cosmological secrets
    Drawing on the power of high performance computing, Durham University and DiRAC scientists are expanding our understanding of the universe and its origins.
    Scientific Research | United Kingdom

    Business needs
    Durham University and the DiRAC HPC facility need leading-edge high performance computing systems to power data- and compute-intensive cosmological research.

    Solutions at a glance
    • Dell EMC PowerEdge C6525 servers
    • AMD® EPYC™ 7H12 processors
    • CoolIT® Systems Direct Liquid Cooling technology
    • Mellanox® HDR 200 interconnect

    Business results
    • Providing DiRAC researchers with world-class capabilities
    • Enabling a much greater level of discovery
    • Accelerating computational cosmology research
    • Keeping Durham University at the forefront of cosmology research

    The COSMA8 cluster has as many as 76,000 cores. The AMD EPYC processor used in the COSMA8 cluster has 64 cores per CPU.

    Researching very big questions
    In scientific circles, more computing power can lead to bigger discoveries in less time. This is the case at Durham University, whose researchers are unlocking insights into our universe with powerful high performance computing clusters from Dell Technologies. What is the universe? What is it made of? What is dark matter? What is dark energy? These are the types of questions explored

    “This will primarily be used for testing, for getting code up to scratch,” says Dr. Alastair Basden, technical manager for the COSMA high performance computing system at Durham University. “But then, coming very soon, we hope to place an order for another 600 nodes.” Deploying the full 600 nodes would provide a whopping 76,800 cores. “It’s a big uplift,” Dr. Basden notes. “COSMA7 currently is about 12,000 cores. So COSMA8 will have six times as many cores. It will have more than twice as much DRAM as previous systems, which means that we’ll be able to more than double the size of the simulations.
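
The core counts quoted in the excerpt follow from simple arithmetic, reproduced in the sketch below; the 64-core figure is the EPYC 7H12 from the excerpt, while the two-sockets-per-node figure is an assumption made here, chosen because it reproduces the quoted 76,800 cores.

```python
# Back-of-envelope check of the COSMA8 core counts quoted above.
NODES = 600                # "another 600 nodes"
SOCKETS_PER_NODE = 2       # assumed dual-socket nodes (not stated in the excerpt)
CORES_PER_CPU = 64         # AMD EPYC 7H12

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_CPU
print(f"COSMA8 cores: {total_cores:,}")                           # 76,800

COSMA7_CORES = 12_000      # "about 12,000 cores"
print(f"uplift over COSMA7: {total_cores / COSMA7_CORES:.1f}x")   # ~6.4x, i.e. "six times"
```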
  • Call for Applications for DiRAC Community Development Director
    DiRAC Health Data Science and AI Placement Opportunity
    DiRAC will award one Innovation Placement in 2021 in the area of Health Data Science and the application of AI. The nominal length is 6 months and it has to be completed by 30 September 2021. In this scheme a final year PhD student or an early career researcher can have a funded placement (up to £25k) with the Getting It Right First Time (GIRFT) programme. GIRFT is funded by the UK Department of Health and Social Care and is a collaboration between NHS England & NHS Improvement and the Royal National Orthopaedic Hospital NHS Trust. GIRFT uses comprehensive benchmarking data analysis to identify unwarranted variation in healthcare provision and outcomes in National Health Service (NHS) hospitals in England, and combines this with deep-dive visits to the hospital by clinicians, with follow-up on agreed actions by an improvement team. The programme covers the majority of healthcare specialities. You have to be working on research that falls within the STFC remit in order to qualify for the placement; however, you can be funded by other organisations besides STFC, as long as the subject area is identifiable as being in Particle Physics, Astronomy & Cosmology, Solar Physics and Planetary Science, Astro-particle Physics, and Nuclear Physics. To check your eligibility please contact Jeremy Yates ([email protected]) and Maria Marcha ([email protected]). You must get your Supervisor's or PI's permission before applying for this placement. It is allowed under UKRI's rules, but only with your supervisor's/PI's consent.
  • NVIDIA Powers Next-Generation Supercomputer at University of Edinburgh
    NVIDIA Powers Next-Generation Supercomputer at University of Edinburgh
    DiRAC Selects NVIDIA HDR InfiniBand Connected HGX Platform to Accelerate Scientific Discovery at Its Four Sites
    ISC—NVIDIA today announced that its NVIDIA HGX™ high performance computing platform will power Tursa, the new DiRAC supercomputer to be hosted by the University of Edinburgh. Optimized for computational particle physics, Tursa is the third of four DiRAC next-generation supercomputers formally announced that will be accelerated by one or more NVIDIA HGX platform technologies, including NVIDIA A100 Tensor Core GPUs, NVIDIA HDR 200Gb/s InfiniBand networking and NVIDIA Magnum IO™ software. The final DiRAC next-generation supercomputer is to feature NVIDIA InfiniBand networking.
    Tursa will allow researchers to carry out the ultra-high-precision calculations of the properties of subatomic particles needed to interpret data from massive particle physics experiments, such as the Large Hadron Collider.
    “DiRAC is helping researchers unlock the mysteries of the universe,” said Gilad Shainer, senior vice president of networking at NVIDIA. “Our collaboration with DiRAC will accelerate cutting-edge scientific exploration across a diverse range of workloads that take advantage of the unrivaled performance of NVIDIA GPUs, DPUs and InfiniBand in-network computing acceleration engines.”
    “Tursa is designed to tackle unique research challenges to unlock new possibilities for scientific modeling and simulation,” said Luigi Del Debbio, professor of theoretical physics at the University of Edinburgh and project lead for the DiRAC-3 deployment. “The NVIDIA accelerated computing platform enables the extreme-scaling service to propel new discoveries by precisely balancing network bandwidth and flops to achieve the unrivaled performance our research demands.”
    The Tursa supercomputer, built with Atos and expected to go into operation later this year, will feature 448 NVIDIA A100 Tensor Core GPUs and include 4x NVIDIA HDR 200Gb/s InfiniBand networking adapters per node.
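
The "balancing network bandwidth and flops" point can be made concrete with a rough estimate. The sketch below assumes four A100 GPUs per node (a 1:1 GPU-to-adapter ratio; the announcement gives the total GPU count and the four HDR adapters per node, but not the GPUs per node) and uses nominal published peak figures, so it is an order-of-magnitude illustration rather than a description of the delivered system.

```python
# Order-of-magnitude network/compute balance for a Tursa-like node.
TOTAL_GPUS = 448
GPUS_PER_NODE = 4                   # assumption (1:1 with the InfiniBand adapters)
ADAPTERS_PER_NODE = 4               # HDR InfiniBand adapters per node, per the announcement
HDR_GBIT_PER_S = 200                # per adapter
A100_FP64_TFLOPS = 9.7              # nominal A100 FP64 peak (non-Tensor-Core)

nodes = TOTAL_GPUS // GPUS_PER_NODE                             # 112 nodes
flops_per_node = GPUS_PER_NODE * A100_FP64_TFLOPS * 1e12        # FLOP/s
inject_bytes_per_s = ADAPTERS_PER_NODE * HDR_GBIT_PER_S / 8 * 1e9

print(f"nodes: {nodes}")
print(f"per-node peak FP64:      {flops_per_node / 1e12:.1f} TFLOP/s")
print(f"per-node injection rate: {inject_bytes_per_s / 1e9:.0f} GB/s")
print(f"network bytes per flop:  {inject_bytes_per_s / flops_per_node:.4f}")
```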
  • School of Physics and Astronomy – Yearbook 2019
    SCHOOL OF PHYSICS AND ASTRONOMY – YEARBOOK 2019

    Table of Contents
    Introduction
    School Events & Activities
    Science News
    Space Park Leicester News
    Physicists Away from the Department
    Celebrating Success
    Physics Special Topics: Editors Pick
    Comings and Goings

    A very warm welcome to the first School of Physics and Astronomy Yearbook! We have moved away from the termly email news bulletins of the past, towards an online blog that celebrates all the successes of our School.

    Introduction
    This 2019 Yearbook brings together the highlights from the blog, and from our School's press releases, to provide a record of our activities over the past twelve months. In fact, this edition will be packed full of even more goodies, spanning from October 2018 (the last newsletter) to December 2019. We expect that future editions of the Yearbook will focus on a single calendar year.
    As we look forward to the dawn of a new decade, we can reflect on the continued growth and success of the School of Physics and Astronomy. Our Department has evolved to become a School. Our five existing research groups are exploring ways to break down any barriers between them, and to restructure the way our School works. The exciting dreams of Space Park Leicester are becoming a reality, with incredible new opportunities on the horizon. Our foyer has been thoroughly updated and modernised, to create a welcoming new shared space that represents our ambitions as a School. Our website has been dragged, kicking and screaming, into the 21st century (SiteCore) to showcase the rich portfolio of research, teaching, and
  • Press Release
    Press release
    Atos supercomputer to help unlock secrets of the Universe
    Paris (France) and London (UK), 1 June 2021 – Atos today announces it has been awarded a contract by the University of Edinburgh to deliver its most efficient supercomputer, the BullSequana XH2000, the most energy-efficient supercomputing system on the market. This is the largest system dedicated to GPU computing deployed at a customer site in the UK. The new system will constitute the Extreme Scaling Service of the UK's DiRAC HPC Facility.
    The state-of-the-art platform will allow scientists across the STFC theory community to drive forward world-leading research in particle physics, among other areas, using NVIDIA Graphics Processing Units (GPUs) and AMD processors. It represents a major boost to DiRAC's computing capacity, significantly increasing the power of the Extreme Scaling service.
    DiRAC is a distributed facility with high performance computing resources hosted by the Universities of Edinburgh, Cambridge, Durham and Leicester. These systems support fundamental research in particle physics, astrophysics, nuclear physics and cosmology.
    This agreement forms part of a £20 million investment by the UK Research and Innovation (UKRI) World Class Laboratories scheme, through the Science and Technology Facilities Council (STFC), to fund an upgrade of the DiRAC facility. The investment is delivering new systems which are up to four times more powerful than the existing DiRAC machines, providing computing capacity that can also be used to address immediate and emerging issues such as the COVID-19 pandemic. The upgraded DiRAC-3 facility will also be much more energy efficient than previous generations.
  • The BlueGene/Q Supercomputer
    The BlueGene/Q Supercomputer
    P A Boyle∗
    University of Edinburgh
    E-mail: [email protected]
    I introduce the BlueGene/Q supercomputer design, with an emphasis on features that enhance QCD performance, discuss the architecture's performance for optimised QCD codes and compare these characteristics to other leading architectures.
    The 30th International Symposium on Lattice Field Theory - Lattice 2012, June 24-29, 2012, Cairns, Australia
    ∗Speaker.
    © Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence. http://pos.sissa.it/
    1. QCD computer design challenge
    The floating point operation count in the Monte Carlo evaluation of the QCD path integral is dominated by the solution of the discretised Euclidean Dirac equation. The solution of this equation is a classic sparse matrix inversion problem, for which one uses one of a number of iterative Krylov methods. The naive dimension of the (sparse) matrix is of O((10⁹)²), and hundreds of thousands of inversions must be performed in a serially dependent chain to importance sample the QCD path integral. The precision with which we can solve QCD numerically is computationally limited, and the development of faster computers is of great interest to the field. Many processors are in principle faster than one, providing we can arrange for them to co-ordinate work effectively. QCD is easily parallelised with a geometrical decomposition spreading L⁴ space-time points across multiple N⁴ processing nodes, each containing
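
As a minimal illustration of the Krylov approach described in the excerpt, the sketch below applies an unpreconditioned conjugate gradient iteration to a small, randomly generated Hermitian positive-definite system, standing in for the normal equations of the discretised Dirac operator. Production lattice QCD codes never form the matrix explicitly; they apply the sparse stencil operator to fields distributed over the N⁴ node grid, with (L/N)⁴ sites per node, and use heavily preconditioned, communication-optimised variants of this loop.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-10, max_iter=500):
    """Solve A x = b for a Hermitian positive-definite operator given as a matvec."""
    x = np.zeros_like(b)
    r = b - apply_A(x)                      # residual
    p = r.copy()                            # search direction
    rs_old = np.vdot(r, r).real
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs_old / np.vdot(p, Ap).real
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r).real
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy stand-in for the Dirac normal operator M†M: a small, explicitly formed
# Hermitian positive-definite matrix.  Real lattice codes apply the sparse
# stencil to fields distributed across nodes instead of forming the matrix.
rng = np.random.default_rng(0)
n = 256
M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = M.conj().T @ M + n * np.eye(n)
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)

x = conjugate_gradient(lambda v: A @ v, b)
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```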