Deic Super Computing 2019 Report Fact Finding Tour at Super Computing 19, November - Denver, Colorado
Total Page:16
File Type:pdf, Size:1020Kb
Downloaded from orbit.dtu.dk on: Oct 07, 2021 DeiC Super Computing 2019 Report Fact Finding Tour at Super Computing 19, November - Denver, Colorado Boye, Mads; Bortolozzo, Pietro Antonio; Hansen, Niels Carl; Happe, Hans Henrik; Madsen, Erik Østergaard; Nielsen, Ole Holm; Visling, Jannick Publication date: 2020 Document Version Publisher's PDF, also known as Version of record Link back to DTU Orbit Citation (APA): Boye, M., Bortolozzo, P. A., Hansen, N. C., Happe, H. H., Madsen, E. Ø., Nielsen, O. H., & Visling, J. (2020). DeiC Super Computing 2019 Report: Fact Finding Tour at Super Computing 19, November - Denver, Colorado. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. Users may download and print one copy of any publication from the public portal for the purpose of private study or research. You may not further distribute the material or use it for any profit-making activity or commercial gain You may freely distribute the URL identifying the publication in the public portal If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim. DeiC Super Computing 2019 Report Fact Finding Tour at Super Computing 19, November - Denver, Colorado Boye, Mads - [email protected] Bortolozzo, Pietro Antonio - [email protected] Hansen, Niels Carl - [email protected] Happe, Hans Henrik - [email protected] Madsen, Erik B. - [email protected] Nielsen, Ole Holm - [email protected] Visling, Jannick - [email protected] 2020-02-24 Contents 1 Preface . 2 2 Satellite Events . 2 2.1 DELL EMC Visit in Round Rock . 2 2.2 HP-CAST 33 . 3 2.3 Intel HPC Developer Conference . 4 2.4 SC19 workshops . 5 3 Central Processing Unit (CPU) . 6 3.1 AMD . 6 3.2 Intel . 7 4 Accelerators . 7 4.1 AMD . 7 4.2 Intel . 7 4.3 Nvidia . 8 4.4 FPGA . 8 5 Interconnect/Fabric . 9 5.1 Mellanox . 9 5.2 OmniPath . 9 6 Servers . 10 6.1 Huawei . 10 6.2 Lenovo . 10 7 Storage . 11 7.1 BeeGFS . 11 7.2 Ceph . 12 7.3 Lustre . 12 7.4 DAOS . 12 8 Middleware / Software . 12 8.1 Slurm . 12 8.2 OneAPI . 13 8.3 Cloud bursting from local HPC clusters . 13 8.4 EasyBuild & Spack . 14 8.5 EasyBuild . 14 8.6 Spack . 14 9 Liquid Cooling . 14 10 Artificial Intelligence . 17 11 Top 500 Announcements . 17 12 Conclusion . 18 1 1 Preface The SC19 International Conference for High Performance Computing, Network- ing, Storage and Analysis was held in Denver, Colorado where a Danish dele- gation represented the five Danish universities: • Aarhus University (AU) • Technical University of Denmark (DTU) • University of Copenhagen (KU) • University of Southern Denmark (SDU) • Aalborg University (AAU) The purpose of the present document is to briefly report the delegation's findings. The trip was subsidized in part by the Danish e-infrastructure Co- operation (DeiC). Prior to SC19 conference, parts of the delegation attended various events, such as a DELL EMC visit in Round Rock, HP-CAST 33 con- ference and the INTEL HPC Developer Conference 2019. During these confer- ences the delegation attended a number of Birds of a Feather sessions (BoF), Talks, Workshops, and Technical Sessions. In addition, the delegation had ar- ranged meetings with a number of vendors to gain knowledge of hardware and software developments within the High Performance Computing (HPC) field. Information obtained during vendor meetings was mostly under Non-disclosure agreements (NDA), and cannot be disclosed in this report. 2 Satellite Events Prior to the SC19 conference, it was possible to visit various events, which is not part of the technical program for SC19. The delegation covered these events, based on interest. The knowledge obtained at these events are covered in the following section. 2.1 DELL EMC Visit in Round Rock The visit at DELL EMC had a day with NDA briefing at DELL EMC headquar- ters in Round Rock, Austin, TX. Without breaking the NDA it is safe to say that the future of HPC is very power hungry, which much of our talks revolved around. And we had insights into how DELL EMC deals with that. We also got some interesting information about performance of the current CPUs. Especially the newly released AMD EPYC 7002 series (Rome). DELL has published some of the performance results, together with curiosities about the CPU architecture[13]. Especially the ”InfiniBand bandwidth and message rate" section of the "AMD Rome is it for real? Architecture and initial HPC performance"[13] article is important for HPC. DELL EMC also presented upcoming additions to their HPC storage port- folio, which already include Lustre, BeeGFS, NFS and Isilon. The Data Accelerator from University of Cambridge was also introduced. This work shows how to make a cost-effective burst buffer with standard NVMe servers[14]. 2 Figure 1: Dell PowerEdge C6420 server with custom water cooling for CPU. The DELL EMC HPC and AI Innovation Lab has published a lot about their AI findings[15]. 2.2 HP-CAST 33 Figure 2: HP-CAST33 Logo This year was permeated by the fact that Hewlett Packard Enterprise (HPE) bought CRAY almost two years ago and the merge became ratified a year ago. All of the sessions was presented by at least two people on the scene, one from the old CRAY, and one from the HPE part of the consortia. The same message was repeated again and again: \We will not let any costumer down. We will support everything you bought, and in the future build products with the best of both worlds." There were some interesting words for us university people. HPE see a clear path for the CPU market in the upcoming years. As a fact they believe suppliers will be capable of producing CPUs down to 2 nm technology, maybe down to 1:6 nm technology, within the next two decades, if you are to believe HPE, the market needs new technologies, and here they ask for help on technologies like analog, quantum, biologic, or other principles that can get Flops pr. Watt at a higher level that the transistor can bring to the market. 3 HPE also has a clear view on an exascale computer model, and the most interesting idea here is the network. HPE believes that the interconnect is ethernet with a speed that can be compared with InfiniBand, but with new technologies, called Slingshot to prevent congestion.[1] HPE is still looking at the old concept of The Machine[19], but nowadays it is in the form of Gen-Z, where HPE is working on a bus structure as the heart of the super computer and not the CPU. They are now capable of producing light emitter switches that can be used for this, and HPE believe, that we soon will see different forms of products designed to be used under standards of Gen-Z e.g HPEs Cadet project[2]. 2.3 Intel HPC Developer Conference Figure 3: Presentation slide from Intel HPC Developer Conference 2019 This year's Developer Conference was more of an excuse than a tell of how the HPC developers world will be the next coming years. The few workshops compared to earlier years, had a bad setup where only few people were capable hearing the speech of the speaker. Seen from a developers sight, the only new thing was OneAPI. An idea of a compiler that chooses the right hardware from whatever the compiler is offered. Intel gave a sneak peek of the future. Intel is coming up with a new packing of CPUs (Intel Next Gen 7 nm Process & Foveros packing), where CPUs will be built in layers. And a few slides later Intel also revealed a new GPU (code name Ponte Vecchio) under the name Xe. This is what Intel needs to build one of the exascale computers in 2021 named Aurora. A really interesting point of the HPC Developer conference is that Intel does not mention OmniPath or a new fabric product to follow. 4 Figure 4: Presentation slide of how a compute unit of the Aurora Exascale super computer will designed. 2.4 SC19 workshops Super Computing 19, provides a wide range of workshops, covering all areas of HPC professions. The attended workshop and the main conclusions is summed up in the list below: • Reproducible computations in exascale. There is a need of better verification methods that match large data set. Computational modules must be robust, as some fields of science have much noise in data set. Large distributed jobs must be analyzed as global summation and rounding, can have a larger impact. • Training of system professionals and users. A common challenge for HPC centers, is that the usage of HPC is growing to new fields of science, where there is no history for programming. This causes much bad code to run on cluster, which can cause the cluster to run inefficient. This problem was solved in different ways for the presenters: Oregon State University: Provides courses for scientific staff on how to use collaboration tools, source code licensing, structuring of python packages, object oriented programming among other things. Los Alamos National Labs: They provided a long training ses- sion, where the attendees would have two weeks during the summer, to setup a HPC cluster, starting form a bunch of servers, cables, and an empty rack, to a running system.