
From Thread to Transcontinental Computer: Disturbing Lessons in Distributed Supercomputing

Derek Groen*
Centre for Computational Science and CoMPLEX, University College London, London, United Kingdom
Email: [email protected]

Simon Portegies Zwart*
Leiden Observatory, Leiden University, Leiden, the Netherlands
Email: [email protected]

Abstract—We describe the political and technical complications encountered during the astronomical CosmoGrid project. CosmoGrid is a numerical study on the formation of large scale structure in the universe. The simulations are challenging due to the enormous dynamic range in spatial and temporal coordinates, as well as the enormous computer resources required. In CosmoGrid we dealt with the computational requirements by connecting up to four supercomputers via an optical network and making them operate as a single machine. This was challenging, if only for the fact that the supercomputers of our choice are separated by half the planet: three of them are scattered across Europe and the fourth one is in Tokyo. The co-operation of multiple computers and the 'gridification' of the code enabled us to achieve an efficiency of up to 93% for this distributed intercontinental supercomputer. In this work, we find that high-performance computing on a grid can be done much more effectively if the sites involved are willing to be flexible about their user policies, and that having facilities to provide such flexibility could be key to strengthening the position of the HPC community in an increasingly Cloud-dominated computing landscape. Given that smaller computer clusters owned by research groups or university departments usually have flexible user policies, we argue that it could be easier to instead realize distributed supercomputing by combining tens, hundreds or even thousands of these resources.

I. INTRODUCTION

Computers have become an integral part of modern life, and are essential for most academic research [1]. Since the middle of last century, researchers have invented new techniques to increase their calculation rate, e.g. by engineering superior hardware, designing more effective algorithms, and introducing increased parallelism. Due to a range of physical limitations which constrain the performance of single processing units [2], recent research is frequently geared towards enabling increased parallelism for existing applications.

By definition, parallelism is obtained by concurrently using the calculation power of multiple processing units. From small to large spatial scales, this is respectively done by facilitating concurrent operation of instruction threads within a single core, of cores within a single processor, of processors within a node, of nodes within a cluster or supercomputer, and of supercomputers within a distributed supercomputing environment. The vision of aggregating existing computers to form a global unified computing resource, and to focus that power for a single purpose, has been very popular both in popular fiction (e.g., the Borg collective mind in Star Trek or Big Brother in Orwell's 1984) and in scientific research (e.g., Amazon EC2, projects such as TeraGrid/XSEDE and EGI, and numerous distributed computing projects [3], [?], [?], [4], [5], [6], [7], [8]). Although many have tried, none have yet succeeded in linking up more than a handful of major computers in the world to solve a major high-performance computing problem.

Very few research endeavors aim to do distributed computing at such a scale to obtain more performance. Although it requires a rather large labour investment across several time zones, accompanied by political complexities, it is technically possible to combine supercomputers to form an intercontinental grid. We consider that combining supercomputers in such a way is probably worth the effort if many machines are involved, rather than a few. Combining a small number of machines is hardly worth the effort of merely doubling the performance of a single machine, but combining hundreds or maybe even thousands of computers together could increase performance by orders of magnitude [9].

Here we share our experiences, and lessons learned, in performing a large cosmological simulation using an intercontinental infrastructure of multiple supercomputers. Our work was part of the CosmoGrid project [10], [4], an effort that was eventually successful but which suffered from a range of difficulties and set-backs. The issues we faced have impacted on our personal research ambitions, and have led to insights which could benefit researchers in any large-scale computing community.

We provide a short overview of the CosmoGrid project, and describe our initial assumptions, in Section II. We summarize the challenges we faced, ascending the hierarchy from thread to transcontinental computer, in Section III, and we summarize how our insights affected our ensuing research agenda in Section IV. We discuss the long-term implications of CosmoGrid in Section V and conclude the paper with some reflections in Section VI.
II. THE COSMOGRID PROJECT: VISION AND (IMPLICIT) ASSUMPTIONS

The aim of CosmoGrid was to interconnect four supercomputers (one in Japan, and three across Europe) using light paths and 10 Gigabit wide area networks, and to use them concurrently to run a very large cosmological simulation. We performed the project in two stages: first by running simulations across two supercomputers, and then by extending our implementation to use four supercomputers concurrently. The project started as a collaboration between researchers in the Netherlands, Japan and the United States in October 2007, and received support from several major supercomputing centres (SARA in Amsterdam, EPCC in Edinburgh, CSC in Espoo and NAOJ in Tokyo). CosmoGrid mainly served a two-fold purpose: to predict the statistical properties of small dark matter halos from an astrophysics perspective, and to enable production simulations using an intercontinental network of supercomputers from a computer science perspective.

A. The software: GreeM and SUSHI

For CosmoGrid, we required a code to model the formation of dark matter structures (using 2048^3 particles in total) over a period of over 13 billion years. We adopted a hybrid Tree/Particle-Mesh (TreePM) N-body code named GreeM [11], [12], which is highly scalable and straightforward to install on supercomputers. GreeM uses a Barnes-Hut tree [13] to calculate force interactions between dark matter particles over short distances, and a particle-mesh algorithm to calculate force interactions over long distances [14].
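For reference, the force on a particle in such a hybrid scheme is split into a short-range part handled by the tree and a long-range part handled by the mesh. A common way to write this split (our notation; GreeM's exact cutoff kernel may differ) is the Ewald-like decomposition of the pairwise potential:

\[
\mathbf{F}_i = \mathbf{F}_i^{\mathrm{SR}} + \mathbf{F}_i^{\mathrm{LR}}, \qquad
\frac{1}{r} \;=\; \underbrace{\frac{\mathrm{erfc}\!\left(r/2r_s\right)}{r}}_{\text{tree, } r < r_{\mathrm{cut}}} \;+\; \underbrace{\frac{\mathrm{erf}\!\left(r/2r_s\right)}{r}}_{\text{mesh}},
\]

where r_s is the force-splitting scale and r_cut (a few times r_s) is the preconfigured cutoff beyond which the tree contribution is neglected.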
Later in the project, we realized that further code changes were required to enable execution across supercomputers. As a result, we created a separate version of GreeM solely for this purpose. This modified code is named SUSHI, which stands for Simulating Universe Structure formation on Heterogeneous Infrastructures [4], [15].

B. Assumptions

Our case for a distributed supercomputing approach was focused on a classic argument used to justify parallelism: multiple resources can do more work than a single one. Even the world's largest supercomputer is about an order of magnitude less powerful than the top 500 supercomputers in the world combined [16]. In terms of interconnectivity the case was also clear. Our performance models predicted that a 1 Gbps wide area network would already result in good simulation performance (we had 10 Gbps links at our disposal), and that the round-trip time of about 0.27 s between the Netherlands and Japan would only impose a limited overhead on a simulation that would require approximately 100 s per integration time step. World-leading performance of our cosmological N-body integrator was essential to make our simulations possible, and our Japanese colleagues optimized the code for single-machine performance as part of the project. Snapshots/checkpoints would then be written distributed across sites, and gathered at run-time. At the start of CosmoGrid, we anticipated running across two supercomputers by the summer of 2008, and across four supercomputers by the summer of 2009.

We assumed a number of political benefits: the simulation we proposed required a large number of core hours and would produce an exceptionally large amount of data. These requirements would have been a very heavy burden for a single machine, and by executing a distributed setup we could mitigate the computational, storage and data I/O load imposed on individual machines. We were also aware of the varying loads of machines at different times, and could accommodate for that by rebalancing the core distribution whenever we would restart the distributed simulation from a checkpoint.

Overall, we mainly expected technical problems, particularly in establishing a parallelization platform which works across supercomputers. Installing homogeneous software across heterogeneous (and frequently evolving) supercomputer platforms appeared difficult to accomplish, particularly since we did not possess administrative rights on any of the machines. In addition, the GreeM code had not been tested in a distributed environment prior to the project.

III. DISTRIBUTED SUPERCOMPUTING IN PRACTICE

Although we finalized the production simulations about a year later than anticipated, CosmoGrid was successful in a number of fundamental areas. We managed to successfully execute cosmological test simulations across up to four supercomputers, and full-size production simulations across up to three supercomputers [4], [15]. In addition, our astrophysical results have led to new insights on the mass distribution of satellite halos around Milky-Way sized galaxies [17], on the existence of small groups of galaxies in dark-matter deprived voids [18], the structure of voids [19] and the evolution of baryonic-dominated star clusters in a dark matter halo [20].

However, these results, though valuable in their own right, do not capture some of the most important and disturbing lessons we have learned from CosmoGrid about distributed supercomputing. Here we summarize our experiences on engineering a code from the level of threads to that of a transcontinental machine, establishing a linked infrastructure to deploy the code, reserving the required resources to execute the code, and the software engineering and sustainability aspects surrounding distributed supercomputing codes.

We are not aware of previous publications of practical experiences on the subject, and for that reason this paper may help achieve a more successful outcome for existing research efforts in distributed HPC.

A. Engineering a code from thread to transcontinental machine

The GreeM code was greatly re-engineered during CosmoGrid. This was necessary to achieve a complete production simulation within the core hour allocations that we obtained from NCF and DEISA. Our infrastructure consisted of three Cray XT4 machines with little-endian chips and one IBM machine with big-endian Power7 chips. At the start of CosmoGrid, the Power7 architecture was not yet in place. GreeM had been optimized for the use of SSE and AVX instruction sets, executing 10 times faster when these instructions are supported. SSE was available in Intel chips and AVX was expected to be available in Power7 chips. However, the support for AVX in Power7 never materialized, forcing us to find alternative optimization approaches. An initial 2-month effort by IBM engineers, leading to a 10% performance improvement, did not speed up the code sufficiently. We then resorted to manual optimization without the use of specialized instruction sets, e.g., by reordering data structures and unrolling loops. This effort resulted in a ~300% performance increase, which was within a factor of 3 of our original target.
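To give a flavour of what this manual optimization entailed, the sketch below shows a structure-of-arrays particle layout and an inner force loop unrolled by a factor of two. This is a hypothetical, simplified example of "reordering data structures and unrolling loops", not the actual GreeM/SUSHI kernel.

// Hypothetical sketch of "reordering data structures and unrolling loops";
// not the actual GreeM/SUSHI force kernel.
#include <cmath>
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: coordinates and masses are stored contiguously,
// so the inner loop walks memory sequentially.
struct ParticlesSoA {
    std::vector<double> x, y, z, m;
};

// Softened gravitational acceleration on particle i (G = 1 code units),
// with the inner loop unrolled by a factor of two.
void acceleration(const ParticlesSoA& p, std::size_t i, double eps2,
                  double& ax, double& ay, double& az) {
    ax = ay = az = 0.0;
    const double xi = p.x[i], yi = p.y[i], zi = p.z[i];
    const std::size_t n = p.x.size();
    std::size_t j = 0;
    for (; j + 2 <= n; j += 2) {   // two interactions per iteration
        const double dx0 = p.x[j] - xi, dy0 = p.y[j] - yi, dz0 = p.z[j] - zi;
        const double dx1 = p.x[j + 1] - xi, dy1 = p.y[j + 1] - yi, dz1 = p.z[j + 1] - zi;
        const double r20 = dx0 * dx0 + dy0 * dy0 + dz0 * dz0 + eps2;
        const double r21 = dx1 * dx1 + dy1 * dy1 + dz1 * dz1 + eps2;
        const double w0 = p.m[j] / (r20 * std::sqrt(r20));
        const double w1 = p.m[j + 1] / (r21 * std::sqrt(r21));
        ax += w0 * dx0 + w1 * dx1;
        ay += w0 * dy0 + w1 * dy1;
        az += w0 * dz0 + w1 * dz1;
    }
    for (; j < n; ++j) {           // remainder iteration
        const double dx = p.x[j] - xi, dy = p.y[j] - yi, dz = p.z[j] - zi;
        const double r2 = dx * dx + dy * dy + dz * dz + eps2;
        const double w = p.m[j] / (r2 * std::sqrt(r2));
        ax += w * dx; ay += w * dy; az += w * dz;
    }
}

A structure-of-arrays form also gives the compiler a better chance of auto-vectorizing the loop on architectures where no hand-tuned SIMD intrinsics are available.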
Much of the parallelization work on GreeM was highly successful, as evidenced by the Gordon Bell Prize awarded in 2012 to Ishiyama et al. [21]. However, one unanticipated problem arose while scaling up the simulation to larger problem sizes. GreeM applies a Particle-Mesh (PM) algorithm to resolve the interactions beyond a preconfigured cutoff limit. The implementation of this algorithm was initially serial, as the overhead was a negligible component (< 1%) of the total execution time for smaller problems. However, the overhead became much larger as we scaled up to mesh sizes beyond 256^3 mesh cells, forcing us to move from a serial implementation to a parallel one.
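One minimal way to parallelize a mesh assignment of this kind is to let every MPI rank deposit its locally owned particles onto a copy of the density mesh and sum the copies with a global reduction. The sketch below illustrates this under our own simplifying assumptions (nearest-grid-point assignment, a replicated 64^3 mesh) and is not the decomposition actually used in GreeM/SUSHI.

// Schematic sketch of a parallel particle-mesh density assignment:
// each rank fills a local copy of the mesh, and MPI_Allreduce sums the copies.
// A production PM solver would use a domain-decomposed mesh and parallel FFTs.
#include <mpi.h>
#include <cstddef>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    const int N = 64;          // mesh cells per dimension (illustrative value)
    const double L = 1.0;      // box size in code units (illustrative value)
    std::vector<double> rho(static_cast<std::size_t>(N) * N * N, 0.0);

    // Particles owned by this rank: positions in [0, L) and masses (placeholders).
    std::vector<double> px{0.1, 0.5}, py{0.2, 0.6}, pz{0.3, 0.7}, pm{1.0, 1.0};

    // Nearest-grid-point mass assignment for the local particles.
    for (std::size_t p = 0; p < px.size(); ++p) {
        const int ix = static_cast<int>(px[p] / L * N) % N;
        const int iy = static_cast<int>(py[p] / L * N) % N;
        const int iz = static_cast<int>(pz[p] / L * N) % N;
        rho[(static_cast<std::size_t>(ix) * N + iy) * N + iz] += pm[p];
    }

    // Sum the partial meshes over all ranks; every rank then holds the global
    // density field and can proceed with the long-range (FFT-based) solve.
    MPI_Allreduce(MPI_IN_PLACE, rho.data(), static_cast<int>(rho.size()),
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

Even such a replicated-mesh approach stops scaling at large mesh sizes, which is why a fully distributed mesh eventually becomes necessary.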
1) SUSHI and MPWide: Initially we considered executing GreeM as-is, and using a middleware solution to enable execution across supercomputers. In the years leading up to CosmoGrid, a large number of libraries emerged for distributed HPC (e.g., MPICH-G2 [22], OpenMPI [23], [24], MPIg [25] and PACX-MPI [26]). Many of these were strongly recommended by colleagues in the field, and provided MPI layers that allowed applications to pass messages across different computational resources. Although these were well-suited for distributed HPC across self-administered clusters, we quickly found that a distributed supercomputing environment was substantially different.

First, supercomputers are both more expensive and less common than clusters, and the centres managing them are reluctant to install libraries that require administrative privileges, due to risks to security and ease of maintenance (MPI distributions tend to require such privileges). Second, the networks interconnecting the supercomputers are managed by separate organizational entities, and the default configurations at each network endpoint are almost always different and frequently conflicting. This is not the case in more traditional (national) grid infrastructures, where uniform configurations can be imposed by the overarching project team. The heterogeneity in network configurations resulted in severe performance and reliability penalties when using standard TCP-based middleware (such as MPI and scp), unless we could find some way to either (a) customize the network configuration for individual paths or (b) adopt a different protocol (e.g., UDP) which ignores these preset configurations. In either case, we realized that using standard TCP-based MPI libraries for the communication between supercomputers was no longer a viable option.

Using any other communication solution had the inevitable consequence of modifying the main code, and eventually we chose to customize GreeM (the customized version is named SUSHI [15]) and to establish a separate communication library (MPWide [27], [28]) for distributed supercomputing.

B. Establishing a distributed supercomputing infrastructure

Having a code to run, and computer time to run it on, is insufficient to do distributed concurrent supercomputing. The amount of data to be transported is ~10 Gb per integration time step, and this should not become the limiting factor for overall performance. Each integration step would take about 100 s. When allowing a 10% overhead, we would require a network speed of at least ~1 Gb/s. Our collaboration with Cees de Laat (University of Amsterdam) and Kei Hiraki (Tokyo University) enabled us to have two network and data transport specialists at each side of the light path.
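Spelled out, the back-of-the-envelope requirement behind these numbers is:

\[
t_{\mathrm{comm}} \le 0.1\, t_{\mathrm{step}} = 10\ \mathrm{s}
\quad\Longrightarrow\quad
B \ge \frac{V}{t_{\mathrm{comm}}} = \frac{10\ \mathrm{Gb}}{10\ \mathrm{s}} = 1\ \mathrm{Gb/s},
\]

with V ~ 10 Gb of data per step and t_step ~ 100 s of computation, comfortably within the 10 Gb/s capacity of the light path.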
At the time, Russia had planned to make their military 10 Gbps dark fiber available for scientific experiments, but due to its enhanced military use we were unable to secure access to this cable. The eventual route of the optical cable is presented in Fig. 1. We had a backup network between the NTT-com router at JGN2plus and the StarLIGHT interconnect to guarantee that our data stream remained stable throughout our calculations.

Fig. 1. Network topology of the Amsterdam-Tokyo light path, shown with reduced complexity, as it was envisioned in 2009. The revision number (v0.9) is given in the top corner (together with an incorrect date), correctly implying that throughout the project we updated this topology map at least eight times. Reproduced from the preprint version of Portegies Zwart et al. [10].

One of the interesting final quirks in our network topology was the absence of an optical network interface in the Edinburgh machine (which was installed later), and the fact that the optical cable at the Japanese side reached the computer science building on the Mitaka campus in Tokyo, next to the building where the supercomputer at NAOJ was located. A person had to physically dig a hole and install a connecting cable between the two buildings.

From a software perspective, we present our design considerations on MPWide fully in Groen et al. [28]. Here we summarize the main experiences and lessons that we learned from CosmoGrid, as well as our experiences in readying the network in terms of software configuration. We initially attempted to homogenize the network configuration settings between the different supercomputers. This effort failed, as it was complicated by the presence of over a dozen stakeholder organizations, and further undermined by the lack of diagnostic information available to us. For example, it was not always possible for us to pinpoint specific configuration errors to individual routers, and to their respective owners. We also assessed the performance of UDP-based solutions such as Quanta [29] and UDT [30], which operate outside of the TCP-specific configuration settings of network devices. However, we were not able to universally adopt such a solution, as some types of routers filter or restrict UDP traffic.

We eventually converged on basing MPWide on multi-stream TCP [31], combined with mechanisms to customize TCP settings on each of the communication nodes [28]. Our initial tests were marred by network glitches, particularly on the path between Amsterdam and Tokyo (see Fig. 2 for an example where packets were periodically stalled). However, later runs resulted in more stable performance once we used a different path and adjusted the MPWide configuration to use very small TCP buffer sizes per stream [4].
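To illustrate the approach, the sketch below opens several TCP connections to the same endpoint, shrinks each stream's socket buffers, and stripes a message across the streams in fixed-size chunks. It is a simplified POSIX-sockets illustration of the multi-stream idea, not MPWide's actual API or implementation; a real library would drive the streams concurrently, and the receiver must reassemble the chunks in the same order.

// Simplified sketch of multi-stream TCP with per-stream buffer customization.
// Illustrates the concept only; MPWide's real implementation differs.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Open nstreams TCP connections to host:port, each with small send/receive
// buffers, and return the connected socket descriptors.
std::vector<int> open_streams(const char* host, int port, int nstreams,
                              int bufsize = 32 * 1024) {
    std::vector<int> socks;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host, &addr.sin_addr);
    for (int i = 0; i < nstreams; ++i) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        // The "very small TCP buffers per stream" setting discussed above.
        setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
        if (connect(s, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0)
            socks.push_back(s);
        else
            close(s);
    }
    return socks;
}

// Stripe len bytes round-robin over the streams in fixed-size chunks.
void send_striped(const std::vector<int>& socks, const char* data,
                  std::size_t len, std::size_t chunk = 64 * 1024) {
    std::size_t offset = 0, stream = 0;
    while (offset < len && !socks.empty()) {
        const std::size_t n = std::min(chunk, len - offset);
        const ssize_t sent = send(socks[stream], data + offset, n, 0);
        if (sent <= 0) break;   // error handling omitted in this sketch
        offset += static_cast<std::size_t>(sent);
        stream = (stream + 1) % socks.size();
    }
}

The rationale for many streams with small buffers is that a packet loss then stalls only one stream (and a small amount of in-flight data) while the other streams keep the long, high-latency path filled.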

Fig. 2. Example of glitches observed in the communication path. This is a performance measurement of a cosmological simulation using 256^3 particles, run in parallel across the supercomputers in Amsterdam and Tokyo (32 cores per site). We present the total time spent per step (blue dotted line), the time spent on calculation (green dashed line), and the time spent on communication with MPWide (red solid line). Stalls in the communication due to dropped packets resulted in visible peaks in the communication performance measurements. Reproduced from Groen et al. [27]. Note: the communication time was relatively high (~35% of total runtime) in this test run due to the small problem size; in [27] we also present results from a test run with 2048^3 particles, which had a communication overhead of ~13% of total runtime.

C. Reserving supercomputers

The ability to have concurrent access to multiple supercomputers is an essential requirement for distributed supercomputing. Within CosmoGrid, we initially agreed that the four institutions involved would provide so-called "phone-based" advance reservation for the purpose of this project. However, due to delays in the commissioning of the light path, and due to political resistance regarding advance reservation within some of the supercomputing centres (in part caused by the increasing demand on the machines), it was no longer possible to use this means of advance reservation on all sites. Eventually, we ended up with a different "reservation" mechanism for each of the supercomputers. The original approach of calling up was still supported for the Huygens machine in the Netherlands. For the Louhi machine in Finland, calling up was also possible, but the reservation was established through an exclusively dedicated queue, as no direct reservation system was in place. This approach had the side effect of locking other users out, and therefore we only opted to use it as a last resort. For the HECToR machine in the UK, it was possible to request higher priority, but not to actually reserve resources in advance. Support for advance reservation was provided there shortly after CosmoGrid concluded, but at the time the best method to "reserve" the machine was to submit a very large high-priority job right before the machine maintenance time slot. We then needed to align all other reservations with the end of that maintenance time slot, which was usually presumed to be 6 hours after the start of the maintenance. For the Cray machine in Japan, reservation was no longer possible due to the high work load. However, some mechanisms of augmented priority could be established indirectly, e.g. by chaining jobs, which allowed for a job to be kept running at all times.

The combination of these strategies made it impossible to perform a large production run using all four sites. We did perform smaller tests using the full infrastructure (using regular scheduling queues and hoping for the best), but we were only able to do the largest runs using either the three European machines, or Huygens combined with the Cray in Tokyo.

D. Software engineering and sustainability

The process of engineering a code for distributed supercomputing is accompanied by a number of unusual challenges, particularly in the areas of sustainability and testing [32]. At the time, we had to ensure that GreeM and SUSHI remained fully compatible with all four supercomputer platforms. With no infrastructure in place to do continuous integration testing on supercomputer nodes (even today this is still a rarity), we performed this testing periodically by hand. In general, testing on a single site is straightforward, but testing across sites less so.

We were able to arrange proof-of-concept tests across 4 sites using very small jobs (e.g., 16 cores per site) by putting these jobs with long runtimes in the queue on each machine and waiting to start the run until the jobs were running simultaneously. For slightly larger jobs (e.g., 64 cores per site) this became difficult during peak usage hours, as the queuing times of each job became longer and less predictable. However, we have been able to perform a number of tests during more quiet periods of the week (e.g., at 2 AM), without using advance reservation. For yet larger test runs we were required to use advance reservation, reduce the number of supercomputers involved, or both.

Testing is instrumental to overall development, particularly when developing software in a challenging environment like an intercontinental network of supercomputers. Here, the lack of facilities for advance reservation and continuous integration made testing large systems prohibitively difficult, and had an adverse effect on the development progress of SUSHI. We eventually managed to get an efficient production calculation running for 12 hours across three sites and with the full system size of 2048^3 particles [15], but with better facilities for testing across sites we could well have tackled larger problems using higher core counts.

More emphasis on, and investment in, testing facilities at the supercomputer centres would have boosted the CosmoGrid project, and such support would arguably be of great benefit to increase the user uptake of supercomputers in general.

Fig. 3. Overview of situations where the CosmoGrid project deviated from the original planning. We summarize the four main computational activities in the circles at the top, and provide examples of assumptions which were shown to be incorrect during the project using red boxes with transparent crosses. We present examples of effective workarounds using green boxes. We provide further details on each situation in Sec. III.
IV. RESEARCH DIRECTIONS AFTER COSMOGRID

Our experience with CosmoGrid changed how we approached our computational research in two important ways. First, due to the political hardships and the lack of facilities for advance reservation and testing, we changed our emphasis from distributed concurrent supercomputing towards improving code performance on single sites [21], [33]. Second, our expertise enabled us to enter the relatively young field of distributed multiscale computing with a head start. Multiscale computing involves the combination of multiple solvers to solve a problem that encompasses several different length and time scales. Each of these solvers may have different resource requirements, and as a result such simulations are frequently well-suited to being run across multiple resources. Drost et al. applied some of our expertise, combining it with their experience using IBIS, to enable the AMUSE environment to run different coupled solvers on different resources concurrently [34].

Our experiences with CosmoGrid were an important argument towards redeveloping the MUSCLE coupling environment for the MAPPER project. In this EU-funded consortium, Borgdorff et al. developed a successor (MUSCLE 2), which is optimized for easier installation on supercomputers, and which automates the startup of solvers that run concurrently on different sites (a task that was done manually in CosmoGrid). In addition, we integrated MPWide in MUSCLE 2 [35] and used MPWide directly to enable concurrently running coupled simulations across sites [36].

V. FUTURE PROSPECTS OF DISTRIBUTED SUPERCOMPUTING

Distributed high-performance supercomputing is not for the faint of heart. It requires excellent programming and parallelization skills, stamina, determination, politics and hard labour. One can wonder if it is worth the effort, but the answer depends on the available resources and the success of proposal writing. About 20-30% of the proposals submitted to INCITE (http://www.doeleadershipcomputing.org/faqs/) or PRACE (http://www.prace-ri.eu/prace-kpi/) are successful, but these success rates are generally lower for the very largest infrastructures (e.g., ORNL). In addition, some of the largest supercomputers (such as Tianhe-2 and the K computer) provide access only to closely connected research groups or to projects that have a native Principal Investigator. Acquiring compute time on these largest architectures can be sufficiently challenging that running your calculations on a number of less powerful but more easily accessible machines may be easier to accomplish.

Accessing several of such machines through one project is even harder, and probably not very realistic. Similarly, given the different architectures, it would be very curious to develop a code that works optimally on the K computer and ORNL Titan concurrently. Achieving 25 PFlops on Titan alone is already a major undertaking [33], and combining such an optimized code with a Tofu-type interconnect (which is present on the K computer) would make optimization a challenge of a different order.

We therefore do not think that distributed architectures will be used to beef up the world's fastest computers, nor to connect a number of top-10 to top-100 supercomputers to out-compute the number 1. The type of distributed HPC discussed in this article is probably best applied to large industrial or small academic computer clusters. These sub-2-PetaFLOPS architectures are found in many academic settings and small countries, and are relatively easily accessible, via peer-reviewed proposals or academic license agreements. In this light, we think it is more feasible to connect 10 to 100 of such machines to outperform a top 1 to 10 computer.
VI. CONCLUSIONS

We have presented our experiences from the CosmoGrid project in high-performance distributed supercomputing. Plainly put, distributed high-performance supercomputing is a tough undertaking, and most of our initial assumptions were proven wrong. Much of the hardware and software infrastructure was constructed with very specific use-cases in mind, and was simply not fit for purpose to do distributed high-performance supercomputing. A major reason why we have been able to establish distributed simulations at all is the tremendous effort of all the people involved, from research groups, networking departments and supercomputer centres. It was their efforts to navigate the project around the numerous technical and political obstacles that made distributed supercomputing possible at all.

CosmoGrid was unsuccessful in establishing high-performance distributed supercomputing as the future paradigm for using very large supercomputers. However, the project did provide a substantial knowledge boost to our subsequent research efforts, which is reflected by the success of projects such as MAPPER [8], [36] and AMUSE [34], [37]. The somewhat different approach taken in these projects (aiming for more local resource infrastructures, and with a focus on coupling different solvers instead of parallelizing a single one) resulted in tens of publications which relied on distributed (super-)computing.

The HPC community has recently received criticism for its conservative approaches and resistance to change (e.g., [38], [39]). Through CosmoGrid, it became obvious to us that resource providers can be subject to tricky dilemmas, where the benefits of supporting one research project need to be weighed against the possible reduced service (or support) incurred by other users. In light of that, we do understand the conservative approaches followed in HPC to some extent. In CosmoGrid, we tried to work around this by ensuring that our software was installable without any administrative privileges, and we recommend that new researchers who wish to do distributed supercomputing do so as well (or, perhaps, adopt very robust, flexible and well-performing tools for virtualization). In addition, we believe that CosmoGrid would have been greatly helped if innovations such as automated advance reservation systems for resources and network links, facilities for systematic software testing and continuous integration, and streamlined procedures for obtaining access to multiple sites had been in place. Even today, such facilities would make HPC infrastructures more convenient for new types of users and applications, and strengthen the position of the HPC community in an increasingly Cloud-dominated computing landscape.

ACKNOWLEDGMENT

Both SPZ and DG contributed equally to this work. We are grateful to Tomoaki Ishiyama, Keigo Nitadori, Jun Makino, Steven Rieder, Stefan Harfst, Cees de Laat, Paola Grosso, Steve McMillan, Mary Inaba, Hans Blom, Jeroen Bédorf, Juha Fagerholm, Esko Keränen, Walter Lioen, Petri Nikunen, Gavin Pringle and Joni Virtanen for their contributions to this work. This research is supported by the Netherlands Organisation for Scientific Research (NWO) grants #639.073.803, #643.200.503 and #643.000.803, and by the Stichting Nationale Computerfaciliteiten (project #SH-095-08). We thank the DEISA Consortium (EU FP6 project RI-031513 and FP7 project RI-222919) for support within the DEISA Extreme Computing Initiative (GBBP project). This paper has been made possible with funding from the UK Engineering and Physical Sciences Research Council under grant number EP/I017909/1 (http://www.science.net).

REFERENCES

[1] S. Hettrick, M. Antonioletti, L. Carr, N. Chue Hong, S. Crouch, D. De Roure, I. Emsley, C. Goble, A. Hay, D. Inupakutika, M. Jackson, A. Nenadic, T. Parkinson, M. I. Parsons, A. Pawlik, G. Peru, A. Proeme, J. Robinson, and S. Sufi, "UK Research Software Survey 2014," Dec. 2014. [Online]. Available: http://dx.doi.org/10.5281/zenodo.14809
[2] I. L. Markov, "Limits on fundamental limits to computation," Nature, vol. 512, no. 7513, pp. 147-154, 2014.
[3] A. Gualandris, S. Portegies Zwart, and A. Tirado-Ramos, "Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems," Parallel Computing, vol. 33, no. 3, pp. 159-173, 2007.
[4] D. Groen, S. Portegies Zwart, T. Ishiyama, and J. Makino, "High performance gravitational N-body simulations on a planet-wide distributed supercomputer," Computational Science and Discovery, vol. 4, no. 015001, Jan. 2011.
[5] E. Agullo, C. Coti, T. Herault, J. Langou, S. Peyronnet, A. Rezmerita, F. Cappello, and J. Dongarra, "QCG-OMPI: MPI applications on grids," Future Generation Computer Systems, vol. 27, no. 4, pp. 357-369, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X10002359
[6] F. J. Seinstra, J. Maassen, R. V. van Nieuwpoort, N. Drost, T. van Kessel, B. van Werkhoven, J. Urbani, C. Jacobs, T. Kielmann, and H. E. Bal, "Jungle computing: Distributed supercomputing beyond clusters, grids, and clouds," in Grids, Clouds and Virtualization, ser. Computer Communications and Networks, M. Cafaro and G. Aloisio, Eds. Springer London, 2011, pp. 167-197.
[7] M. Ben Belgacem, B. Chopard, J. Borgdorff, M. Mamonski, K. Rycerz, and D. Harezlak, "Distributed multiscale computations using the MAPPER framework," Procedia Computer Science, vol. 18, pp. 1106-1115, 2013 (2013 International Conference on Computational Science). [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050913004195
[8] J. Borgdorff, M. Ben Belgacem, C. Bona-Casas, L. Fazendeiro, D. Groen, O. Hoenen, A. Mizeranschi, J. L. Suter, D. Coster, P. V. Coveney, W. Dubitzky, A. G. Hoekstra, P. Strand, and B. Chopard, "Performance of distributed multiscale simulations," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 372, no. 2021, 2014.
[9] A. G. Hoekstra, S. Portegies Zwart, M. Bubak, and P. Sloot, "Towards distributed petascale computing," in Petascale Computing: Algorithms and Applications, D. A. Bader, Ed. Chapman & Hall/CRC Computational Science Series, 2008 (ISBN: 9781584889090).
[10] S. Portegies Zwart, T. Ishiyama, D. Groen, K. Nitadori, J. Makino, C. de Laat, S. McMillan, K. Hiraki, S. Harfst, and P. Grosso, "Simulating the universe on an intercontinental grid," Computer, vol. 43, pp. 63-70, 2010.
[11] T. Ishiyama, T. Fukushige, and J. Makino, "GreeM: Massively parallel TreePM code for large cosmological N-body simulations," Publications of the Astronomical Society of Japan, vol. 61, no. 6, pp. 1319-1330, 2009.
[12] K. Yoshikawa and T. Fukushige, "PPPM and TreePM methods on GRAPE systems for cosmological N-body simulations," Publications of the Astronomical Society of Japan, vol. 57, pp. 849-860, Dec. 2005.
[13] J. Barnes and P. Hut, "A hierarchical O(N log N) force-calculation algorithm," Nature, vol. 324, pp. 446-449, Dec. 1986.
[14] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles. New York: McGraw-Hill, 1981.
[15] D. Groen, S. Rieder, and S. Portegies Zwart, "High performance cosmological simulations on a grid of supercomputers," in Proceedings of INFOCOMP 2011. Thinkmind.org, Sep. 2011.
[16] TOP500.org, "Top500 supercomputer sites." [Online]. Available: http://www.top500.org
[17] T. Ishiyama, S. Rieder, J. Makino, S. Portegies Zwart, D. Groen, K. Nitadori, C. de Laat, S. McMillan, K. Hiraki, and S. Harfst, "The CosmoGrid simulation: Statistical properties of small dark matter halos," The Astrophysical Journal, vol. 767, no. 2, p. 146, 2013. [Online]. Available: http://stacks.iop.org/0004-637X/767/i=2/a=146
[18] S. Rieder, R. van de Weygaert, M. Cautun, B. Beygu, and S. Portegies Zwart, "Assembly of filamentary void galaxy configurations," MNRAS, vol. 435, pp. 222-241, Oct. 2013.
[19] S. Rieder, R. van de Weygaert, M. Cautun, B. Beygu, and S. Portegies Zwart, "The cosmic web in CosmoGrid void regions," ArXiv e-prints, Nov. 2014.
[20] S. Rieder, T. Ishiyama, P. Langelaan, J. Makino, S. L. W. McMillan, and S. Portegies Zwart, "Evolution of star clusters in a cosmological tidal field," MNRAS, vol. 436, pp. 3695-3706, Dec. 2013.
[21] T. Ishiyama, K. Nitadori, and J. Makino, "4.45 Pflops astrophysical N-body simulation on K computer: The gravitational trillion-body problem," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Los Alamitos, CA, USA: IEEE Computer Society Press, 2012, pp. 5:1-5:10.
[22] N. Karonis, B. Toonen, and I. Foster, "MPICH-G2: A grid-enabled implementation of the Message Passing Interface," Journal of Parallel and Distributed Computing, vol. 63, no. 5, pp. 551-563, 2003, special issue on computational grids.
[23] "Open MPI: Open Source High Performance Computing." [Online]. Available: http://www.open-mpi.org
[24] C. Coti, T. Herault, and F. Cappello, "MPI applications on grids: A topology aware approach," in Euro-Par 2009 Parallel Processing, ser. Lecture Notes in Computer Science, H. Sips, D. Epema, and H.-X. Lin, Eds. Springer Berlin Heidelberg, 2009, vol. 5704, pp. 466-477.
[25] S. Manos, M. Mazzeo, O. Kenway, P. V. Coveney, N. T. Karonis, and B. Toonen, "Distributed MPI cross-site run performance using MPIg," in Proceedings of the 17th International Symposium on High Performance Distributed Computing, ser. HPDC '08. New York, NY, USA: ACM, 2008, pp. 229-230.
[26] M. Muller, M. Hess, and E. Gabriel, "Grid enabled MPI solutions for clusters," in Cluster Computing and the Grid, 2003 (CCGrid 2003), 3rd IEEE/ACM International Symposium on. IEEE, 2003, pp. 18-25.
[27] D. Groen, S. Rieder, P. Grosso, C. de Laat, and S. Portegies Zwart, "A light-weight communication library for distributed computing," Computational Science and Discovery, vol. 3, no. 015002, Aug. 2010.
[28] D. Groen, S. Rieder, and S. Portegies Zwart, "MPWide: a light-weight library for efficient message passing over wide area networks," Journal of Open Research Software, vol. 1, no. 1, 2013.
[29] E. He, J. Alimohideen, J. Eliason, N. Krishnaprasad, J. Leigh, O. Yu, and T. DeFanti, "Quanta: a toolkit for high performance data delivery over photonic networks," Future Generation Computer Systems, vol. 19, no. 6, pp. 919-933, 2003.
[30] Y. Gu and R. L. Grossman, "UDT: UDP-based data transfer for high-speed wide area networks," Computer Networks, vol. 51, no. 7, pp. 1777-1799, May 2007.
[31] T. J. Hacker, B. D. Athey, and B. Noble, "The end-to-end performance effects of parallel TCP sockets on a lossy wide-area network," in Vehicle Navigation and Information Systems Conference, 1993, Proceedings of the IEEE-IEE, Oct. 1993.
[32] D. Groen, "Tips for sustainable software development on supercomputers," 2013. [Online]. Available: http://www.software.ac.uk/blog/2013-12-16-tips-sustainable-software-development-supercomputers
[33] J. Bédorf, E. Gaburov, M. S. Fujii, K. Nitadori, T. Ishiyama, and S. Portegies Zwart, "24.77 Pflops on a gravitational tree-code to simulate the Milky Way galaxy with 18600 GPUs," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 54-65. [Online]. Available: http://dx.doi.org/10.1109/SC.2014.10
[34] N. Drost, J. Maassen, M. van Meersbergen, H. Bal, F. Pelupessy, S. Portegies Zwart, M. Kliphuis, H. Dijkstra, and F. Seinstra, "High-performance distributed multi-model / multi-kernel simulations: A case-study in jungle computing," in Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2012 IEEE 26th International, May 2012, pp. 150-162.
[35] J. Borgdorff, M. Mamonski, B. Bosak, K. Kurowski, M. Ben Belgacem, B. Chopard, D. Groen, P. V. Coveney, and A. G. Hoekstra, "Distributed multiscale computing with MUSCLE 2, the multiscale coupling library and environment," Journal of Computational Science, vol. 5, no. 5, pp. 719-731, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877750314000465
[36] D. Groen, J. Borgdorff, C. Bona-Casas, J. Hetherington, R. Nash, S. Zasada, I. Saverchenko, M. Mamonski, K. Kurowski, M. Bernabeu, A. Hoekstra, and P. Coveney, "Flexible composition and execution of high performance, high fidelity multiscale biomedical simulations," Interface Focus, vol. 3, no. 2, p. 20120087, 2013.
[37] S. F. Portegies Zwart, S. L. W. McMillan, A. van Elteren, F. I. Pelupessy, and N. de Vries, "Multi-physics simulations using a hierarchical interchangeable software interface," Computer Physics Communications, vol. 184, no. 3, pp. 456-468, 2013.
[38] J. Dursi, "HPC is dying, and MPI is killing it," 2015. [Online]. Available: http://www.dursi.ca/hpc-is-dying-and-mpi-is-killing-it/
[39] J. Squyres, "That Jonathan Dursi blog entry," 2015. [Online]. Available: http://blogs.cisco.com/performance/that-jonathan-dursi-blog-entry