Proceedings of MCHPC'19: Workshop on Memory Centric High Performance Computing

Held in conjunction with SC19: The International Conference for High Performance Computing, Networking, Storage and Analysis Denver, Colorado, November 17-22, 2019

MCHPC’19: Workshop on Memory Centric High Performance Computing

Table of Contents

Message from the Workshop Organizers ...... iii
Organization ...... iv
Keynote Talks ...... 1

Research Papers

Emerging Memory and Architectures

1. Vladimir Mironov, Igor Chernykh, Igor Kulikov, Alexander Moskovsky, Evgeny Epifanovsky, and Andrey Kudryavtsev, Performance Evaluation of the Intel Optane DC Memory With Scientific Benchmarks ...... 2

2. Thomas B. Rolinger, Christopher D. Krieger, and Alan Sussman, Optimizing Data Layouts for Irregular Applications on a Migratory Thread Architecture ...... 8

3. Chih Chieh Chou, Yuan Chen, Dejan Milojicic, A. L. Narasimha Reddy, and Paul V. Gratz, Optimizing Post-Copy Live Migration with System-Level Checkpoint Using Fabric-Attached Memory ...... 17

Application and Performance Optimization

4. Osamu Watanabe, Yuta Hougi, Kazuhiko Komatsu, Masayuki Sato, Akihiro Musa, and Hiroaki Kobayashi, Optimizing Memory Layout of Hyperplane Ordering for Vector SX-Aurora TSUBASA ...... 26

5. Jiayu Li, Fugang Wang, Takuya Araki, and Judy Qiu, Generalized Sparse Matrix-Matrix Multiplication for Vector Engines and Graph Applications ...... 34

6. Esma Yildirim, Shaohua Duan, and Xin Qi, A Distributed Deep Memory Hierarchy System for Content-based Image Retrieval of Big Whole Slide Image Datasets ...... 44


Unified and Heterogeneous Memory

7. Wei Der Chien, Ivy B. Peng, and Stefano Markidis, Performance Evaluation of Advanced Features in CUDA Unified Memory ...... 51

8. Swann Perarnau, Brice Videau, Nicolas Denoyelle, Florence Monna, Kamil Iskra, and Pete Beckman, Explicit Data Layout Management for Autotuning Exploration on Complex Memory Topologies ...... 59

9. Hailu Xu, Murali Emani, Pei-Hung Lin, Liting Hu, and Chunhua Liao, Machine Learning Guided Optimal Use of GPU Unified Memory ...... 65

Converging Storage and Memory

10. Ivy B. Peng, Marty McFadden, Eric Green, Keita Iwabuchi, Kai Wu, Dong Li, Roger Pearce, and Maya Gokhale, UMap: Enabling Application-driven Optimizations for Page Management ...... 72

11. Kewei Yan, Anjia Wang, Xinyao Yi, and Yonghong Yan, Extending OpenMP map Clause to Bridge Storage and Device Memory ...... 80

Panel ……………………………………………………..………………………………..… 87

Software and Hardware Support for Programming Heterogeneous Memory
Moderator: Maya B. Gokhale (Lawrence Livermore National Laboratory)
Panelists: Mike Lang (LANL), Jeffrey Vetter (ORNL), Vivek Sarkar (Georgia Tech), David Beckingsale (LLNL), Paolo Faraboschi (HPE)


Organization

Organizers
Yonghong Yan, University of North Carolina at Charlotte, USA
Ron Brightwell, Sandia National Laboratories, USA
Xian-He Sun, Illinois Institute of Technology, USA
Maya B. Gokhale, Lawrence Livermore National Laboratory, USA

Program Committee
Ron Brightwell, Co-Chair, Sandia National Laboratories, USA
Yonghong Yan, Co-Chair, University of North Carolina at Charlotte, USA
Xian-He Sun, Illinois Institute of Technology, USA
Maya B. Gokhale, Lawrence Livermore National Laboratory, USA
Mingyu Chen, Chinese Academy of Sciences, China
Bronis R. de Supinski, Lawrence Livermore National Laboratory, USA
Tom Deakin, University of Bristol, UK
Hal Finkel, Argonne National Laboratory and LLVM Foundation, USA
Kyle Hale, Illinois Institute of Technology, USA
Jeff R. Hammond, Intel Corporation, USA
Dong Li, University of California, USA
Scott Lloyd, Lawrence Livermore National Laboratory, USA
Ivy B. Peng, Oak Ridge National Laboratory, USA
Christian Terboven, RWTH Aachen University, Germany
Alice Koniges, University of Hawaii, Maui High Performance Computing Center, USA
Arun Rodrigues, Sandia National Laboratories, USA
Chunhua Liao, Lawrence Livermore National Laboratory, USA


Keynote Talks

Prospects for Memory
J. Thomas Pawlowski

Abstract: The outlook for memory components and systems is at an all-time height of excitement. This talk examines the path we have traversed and the likely future directions we must pursue. We will examine scaling across many parameters; "walls" of all sorts, concluding with the one that has emerged as the true wall; the dynamics of heterogeneity in systems and their constituent memories; the need for abstraction (still unsatisfied); options available for processing in or nearer to memory; and the imperative technology redirection that is required to make significant strides forward. Time permitting, a lively question and answer period is anticipated.

Brief Bio: J. Thomas Pawlowski is a systems architect and multi-discipline design engineer currently self-employed as a consultant and entrepreneur. He recently retired from Micron Technology, Inc., where he was employed for 27 years, holding titles including Senior Director, Fellow, Chief Architect and Chief Technologist. He was at the center of numerous new memory architectures, technologies and concepts, including synchronous pipelined burst memory, zero bus turnaround memory, double data rate memory, quad data rate memory, reduced latency memory, SerDes memory, multi-channel memory, 3D memory, abstracted memory, smart memory, non-deterministic finite automata (processing-in-memory technology), processing in interbank memory regions, processing in memory controllers, processing in NAND memory, processing in 3D memory stacks, memory controllers, 3D XPoint memory management, and cryogenic memory, among others. Prior to Micron he served for almost 10 years at Allied-Signal Aerospace/Garrett Canada (now Honeywell). In his colorful career Thomas has had many design firsts, including first FPGAs, microcontrollers and custom designs in aerospace applications, one of the early laptop computer designs (with perhaps the world's first SSD, comprising NOR flash), a novel electronic musical instrument, a line of high-end loudspeakers, and a from-scratch 2-seat electric vehicle design achieving nearly 400 MPGe efficiency. Thomas holds a BASc degree in electrical engineering from the University of Waterloo and is an IEEE Fellow. Thomas is available for consulting opportunities and would consider employment offers too good to refuse.

Performance Evaluation of the Intel Optane DC Memory With Scientific Benchmarks

Vladimir Mironov, Department of Chemistry, Lomonosov Moscow State University, Moscow, Russian Federation ([email protected])
Igor Chernykh, Supercomputing Lab, ICMMG SB RAS, Novosibirsk, Russian Federation ([email protected])
Igor Kulikov, Supercomputing Lab, ICMMG SB RAS, Novosibirsk, Russian Federation ([email protected])
Alexander Moskovsky, RSC Technologies, Moscow, Russia ([email protected])
Evgeny Epifanovsky, Q-Chem, Inc., Pleasanton, CA, USA ([email protected])
Andrey Kudryavtsev, Intel Corporation, Folsom, CA, USA ([email protected])

Abstract—Intel Optane technology is a cost-effective solution for creating large non-volatile memory pools, which is attractive to many scientific applications. Last year Intel Optane DC Persistent Memory in DIMM form factor (PMM) was introduced, which can be used as a main memory device. This technology promises faster memory access than the previously available Intel Optane DC P4800X SSD with Intel Memory Drive Technology. In this paper, we present new benchmark data for the Intel Optane DC Persistent Memory. We studied the performance of scientific applications from the domains of quantum chemistry and computational astrophysics. To put the performance of the Intel Optane DC PMM in context, we used two reference memory configurations: a DDR4-only system and Optane DC SSDs with Intel Memory Drive Technology (IMDT), another option for memory extension. Unsurprisingly, PMM is superior to IMDT in almost all our benchmarks. However, PMM was only 20-30% more performant in quantum chemistry and only marginally better for the astrophysics simulations. We also found that, for practical calculations, the hybrid setup demonstrates performance on the same order of magnitude as the DDR4 system.

Keywords—Non-volatile memory, Intel Optane DC, benchmarks

I. INTRODUCTION

Scientific applications are among the most important consumers of high-performance computing (HPC) resources. They need more than central processing unit (CPU) power: complex scientific problems usually operate on large data volumes, and the memory capacity of a node is a common practical limit for scientific calculations. In the past decade, the number of CPU cores per node increased dramatically, from a few to more than a hundred. At the same time, the memory capacity per core, and thus the amount of memory per process, almost did not change.

An important limitation on extending the memory capacity of a node is the high price of dynamic random-access memory (DRAM) chips, which are the usual choice for system memory. As a result, DRAM can account for up to 80-90% of the total price of a computer system with a large DRAM pool. It is not a surprise that a lot of effort has been made to find a less expensive and possibly non-volatile alternative to DRAM [1]–[5].

Traditionally, disk drives or persistent memory devices were employed to overcome DRAM limits. The modern attractive option is to use solid-state drives (SSDs). However, even the fastest SSDs have orders-of-magnitude lower bandwidth and higher latency than DRAM; these devices also operate on data in large blocks instead of bytes. Recently Intel introduced new Intel Optane DC SSDs based on the low-latency Optane media technology [6]. In contrast to traditional SSDs, they have much lower latency, higher IOPS performance under low queue depth, and higher endurance, even though an NVMe interface is employed by the SSD. In addition, an Optane DC SSD can be used as a memory extension of traditional DDR4 with Intel Memory Drive Technology [7]. It has been shown [8] that using Optane drives as software-defined memory (Intel Memory Drive Technology, IMDT) brings acceptable system performance at a much lower cost. However, this setup inherits block-based access to memory from traditional SSDs and suffers from operating system overheads when using data on disks [9].

In the current study, we investigate the performance of the next-generation Intel Optane DC Persistent Memory Modules (PMM). It is a DIMM form factor product and can operate in two modes. In App Direct mode it operates in addition to DDR4 and allows persistency to be realized through the PMEM library [10]. In Memory Mode it uses DDR4 memory as a cache and transparently provides a larger volatile memory, with up to 6 TB capacity even for dual-socket systems [11]. The latter configuration should not have the above-mentioned limitations of block devices, resulting in improved latency and bandwidth.

The performance of Intel Optane DC PMM is an active research topic [12]–[14]. Our interest is to investigate the performance of this novel hardware with real-life scientific problems. Two numerical simulation packages were tested: one is the quantum chemistry package Q-Chem [15] and the other is the astrophysical hydrodynamics framework HydroBox3D [16]. Q-Chem was chosen because its implementation of coupled-cluster methods is considered best-of-breed, while those methods usually produce huge amounts of intermediate data. HydroBox3D is recently developed software with fine-grained optimizations for the novel Intel AVX instruction set, limited mostly by main memory volume in producing more precise astrophysical simulations. In both cases, we used PMM in Memory Mode, which does not require any software modification.

The paper is organized as follows. First, we give a short introduction to the novel Intel Optane PMM technology (Section II). A brief overview of related papers on PMM benchmarking is given in Section III. The methodology and benchmark description are presented in Section IV. In Sections V and VI, we present and discuss the performance results. Concluding remarks are given in Section VII.

II. OVERVIEW OF INTEL OPTANE DC PMM

Non-volatile Intel Optane DC PMM memory features much higher capacities and lower power consumption than DRAM memory, but at the cost of increased latency and lower bandwidth. Intel Optane PMM is currently supported only by the 2nd generation Intel Xeon Scalable processors. This is because PMMs communicate with the CPU memory controller through a proprietary DDR-T protocol, which is implemented only in this type of processor. DDR-T has a lot in common with the standard DDR4 protocol, but accounts for non-volatile memory specifics. On the Optane side, memory accesses are processed by the Optane memory controller. It is responsible for scheduling I/O operations to the Optane media, handling address mapping, data encryption, power/thermal control, etc. It also includes a small DRAM buffer for the address indirection table. It is worth noting that while the DDR-T protocol uses 64 B data granularity (same as DDR4), data access to the Optane media is 256 B wide.

As mentioned before, Intel Optane PMM can operate in two modes: App Direct mode and Memory Mode. In App Direct mode, PMM is directly exposed to the operating system as a non-volatile memory device. It is separated from the main memory, and in order to use it one needs to rely on specialized libraries, such as PMDK [10], or install a file system on it. In this mode the programmer controls every aspect of using data on PMM.

In Memory Mode, PMMs are used to expand the main memory of a node transparently to the operating system and the user. DRAM is used as a sophisticated combination of inclusive and non-inclusive caches [17] for the Intel Optane DC memory. Since only part of the data resides in DRAM at any time, Memory Mode does not guarantee persistency.
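To make the distinction concrete, the sketch below shows one way a program can place data on persistent memory in App Direct mode using PMDK's libpmem. This example is ours, not taken from the paper, and the path /mnt/pmem0/buffer is a placeholder for wherever a DAX-capable file system on a PMM namespace is mounted.

```c
/* Minimal App Direct sketch using PMDK's libpmem (illustrative, not from
 * the paper). Assumes a DAX-mounted file system, e.g. /mnt/pmem0.
 * Build with: cc appdirect.c -lpmem */
#include <stdio.h>
#include <string.h>
#include <libpmem.h>

int main(void)
{
    size_t mapped_len;
    int is_pmem;

    /* Create (or open) a 64 MiB file on persistent memory and map it. */
    char *buf = pmem_map_file("/mnt/pmem0/buffer", (size_t)64 << 20,
                              PMEM_FILE_CREATE, 0666,
                              &mapped_len, &is_pmem);
    if (buf == NULL) {
        perror("pmem_map_file");
        return 1;
    }

    strcpy(buf, "intermediate data that survives a restart");

    /* Flush stores to the persistence domain; fall back to msync if the
     * mapping is not actually persistent memory. */
    if (is_pmem)
        pmem_persist(buf, mapped_len);
    else
        pmem_msync(buf, mapped_len);

    pmem_unmap(buf, mapped_len);
    return 0;
}
```

In Memory Mode, by contrast, no such calls are needed, which is why the benchmarks in this study could run Q-Chem and HydroBox3D unmodified.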
III. RELATED WORK

Since its inception, several performance evaluations of the Intel Optane DC PMM technology have been reported. A detailed study of PMM performance in App Direct and Memory modes with synthetic benchmarks is presented in [12]–[14]. The paper [14] also focuses on the power efficiency of PMM-enabled nodes and reports advanced memory allocation techniques which can improve memory bandwidth or latency. The authors of [14] also studied the performance of graph-processing frameworks. In [12] the authors additionally used the SPEC and PARSEC benchmark suites to compare the performance of PMM with DRAM. Papers [12], [18] report the performance of database software using PMM. The work [19] is focused on the performance of PMM in App Direct mode in a virtualized environment.

Most of these papers note the performance difference between read and write operations on PMM: writing data on Optane can be up to three times slower than reading it. Another important observation is that PMM performance degrades when a large number of threads is used or when data is accessed on a remote NUMA node.

However, the performance of numerical simulation software has not been studied in detail. With the current paper we focus on real-life examples of scientific problems that need large amounts of memory.

IV. METHODOLOGY

A. Hardware configuration

In the current work we used a dual-socket node with 2nd Generation Intel Xeon Scalable Gold 6254 processors. Each CPU has six memory channels and three UPI links. Three hardware setups were used for the benchmarks:

1) DRAM only. In this setup the node was equipped with 24x64 GB of DRAM memory (1,536 GB total).

2) Optane SSD with Intel Memory Drive Technology (IMDT). In this configuration we used four Optane SSD P4800X drives as part of the IMDT software-defined memory. IMDT was configured to use only 192 GB of the DRAM memory as a cache and 1,328 GB of Optane memory (1.5 TB total memory).

3) Intel Optane DC PMM. In the third setup a mixture of 12x16 GB DRAM modules and 12x128 GB of the novel Optane DC DIMM modules in Memory Mode was used. The resulting system memory size was 1.5 TB.

The detailed node configurations for all the above cases are presented in Table I.

TABLE I. HARDWARE CONFIGURATION
- DRAM memory: 12x16 GB reg. ECC DDR4 (IMDT setup); 12x16 GB reg. ECC DDR4 (PMM setup); 24x64 GB reg. ECC DDR4 (DRAM-only setup)
- Optane memory: 4x 375 GB Intel Optane DC P4800X, 320 GB per drive in IMDT memory mode (IMDT setup); 12x128 GB Intel Optane DC PMM in memory mode (PMM setup); none (DRAM-only setup)
- CPU: 2x Intel Xeon Gold 6254 (2x18 cores, 3.1 GHz base)
- Board: Intel Server Board S2600WFT, BIOS FW 1.93.870CF4F0
- Storage: OS: 375 GB Intel SSD DC S3700; scratch: 1.5 TB Intel Optane DC P4800X
- OS: CentOS Linux 7.6, kernel ver. 3.10.0
- IMDT: F/W 9.0.3365.41 (IMDT setup only)
- Software: Intel Parallel Studio XE 2019.4; Q-Chem v5.2

B. Benchmark description

In the current study, we used two families of benchmarks. In the first one, we perform high-level quantum chemistry calculations using the Q-Chem program package. It is a popular quantum chemistry software package with a proprietary license and contains highly efficient implementations of various quantum chemistry methods.

In this study, we target the coupled-cluster method with single and double excitations (CCSD), which is routinely used for precise calculation of the properties of small-to-medium size molecules. The computational complexity of the CCSD benchmark is O(N^6), where N denotes the size of the basis set and depends almost linearly on the number of atoms. This kind of calculation consumes a large amount of memory to store intermediate data (two-electron integrals and excitation amplitudes) and is thus a good candidate for Optane DC PMM benchmarking.

The other family is the astrophysical hydrodynamics framework HydroBox3D. It is a novel astrophysical package which combines the astrophysical codes AstroPhi [20], GPUPEGAS [21], and PADME [22]. As a result, this framework can be used for complex simulations that account for both classical and relativistic hydrodynamics, chemical reactions and magnetic interactions. An example is the modeling of type Ia supernova explosions (see [23]), which occur when a white dwarf's mass reaches the Chandrasekhar limit and which are used by astronomers as "standard candles" when measuring the distance to remote galaxies. Unfortunately, some physical and chemical processes cannot be modeled efficiently in a multi-node environment. In this case, a single node running multiple threads is used. The node must contain enough memory to fit all grid data, which is typically of terabyte scale. Thus, it is also a relevant benchmark for the novel Optane memory.

Both benchmarks were run on all three hardware configurations described above. The details of each benchmark are presented below in the corresponding sections.

1) Q-Chem: In this benchmark, we computed the CCSD energy and analytical gradient on the electronic ground and first excited states of the PYP chromophore system with two water molecules (see Fig. 1), using the aug-cc-pVDZ basis set. The total number of basis set functions is 421. For the excited state calculation, we used the equation-of-motion (EOM) CCSD method. The PYP chromophore is a model molecule for the photophysical properties of fluorescent proteins, which are a common tool in biotechnology. We used Q-Chem v5.2 to perform all quantum chemistry benchmarks.

Fig. 1. Chemical system used in EOM-CCSD benchmarks. Carbon, oxygen and hydrogen atoms are colored in green, red and blue respectively.

The time-consuming part of the CCSD energy algorithm is the iterative, self-consistent step of computing amplitudes for the singles and doubles excitation operators. It involves a set of tensor contractions to compute molecular orbital integrals and the values of the singles and doubles projections. Many of these operations depend on the current values of the amplitudes and need to be repeated on every CCSD iteration. The CCSD gradient calculation involves several additional steps, which are also based on tensor contraction operations. CCSD tensor contraction routines involve multiple matrix multiplications. In theory, they are compute-bound, but they may still suffer from bandwidth problems due to the relatively small size of the matrices and vectors. In practice, the performance of the Q-Chem coupled-cluster implementation may depend on the bandwidth of the memory and storage. Q-Chem uses a statically linked Intel MKL library for dense linear algebra calculations. The details of the tensor contraction implementation in Q-Chem can be found elsewhere [24].

Another time-consuming operation of the CCSD method is the DIIS (Direct Inversion in the Iterative Subspace) optimization of the CCSD amplitudes. The DIIS method consumes a lot of memory to store amplitudes and errors from several CCSD iterations. It is usually bandwidth bound.

We have used five CCSD benchmarks. The first two benchmarks are CCSD and EOM-CCSD calculations that use a Cholesky decomposition (CD) of the electron-repulsion integrals tensor. The next two benchmarks are CCSD and EOM-CCSD calculations that use the resolution-of-identity (RI) approach to transform four-center electron repulsion integrals to three-center integrals. RI-CCSD and CD-CCSD have somewhat reduced memory requirements compared to the regular CCSD method, which can affect the performance of hybrid memory setups. The last benchmark is a regular ground-state CCSD energy and analytical gradient calculation taken for reference. In all EOM-CCSD calculations, the gradient was computed for the excited state.

In all Q-Chem benchmarks, we used 1.3 TB of memory, of which 1 TB was used for the intermediate data of the coupled-cluster method. The final memory footprint exceeds the 196 GB of DRAM in the hybrid Optane+DRAM configurations and approaches the 1 TB limit of CCSD memory. We used 36 threads to perform the calculations.

2) HydroBox3D: In this benchmark, we simulated white dwarf evolution using the AstroPhi module for the numerical solution of the hydrodynamics equations. This solver uses novel numerical methods based on a combination of Godunov's method for conservation laws, calculating fluxes through the boundaries [25], an operator-splitting method to construct a scheme that approximates the advection terms invariantly with respect to rotation [25]–[27], and Rusanov's method for solving Riemann problems [28] to determine the fluxes with vectorization of the calculations [29]. Two main approaches can be used for numerical hydrodynamics simulations of astrophysical problems: Lagrangian smoothed particle hydrodynamics methods and Eulerian mesh-based methods. The most important problem of Lagrangian smoothed particle hydrodynamics methods is the inaccurate computation of large gradients and discontinuities for astrophysical problems. This is why a mesh-based piecewise parabolic method on the local stencil (PPML) was used in the solver. To solve the Riemann problems, a compact scheme for a piecewise-parabolic representation of the solution in each of the directions is used [30]–[32]. The advantages of this numerical scheme are: 1) high scalability; 2) accuracy on sufficiently smooth solutions and low dissipation on discontinuous solutions; 3) guaranteed non-decrease of the entropy; 4) extensibility by hyperbolic models; 5) limiter-free and artificial-viscosity-free implementation; 6) Galilean invariance. The numerical scheme is considered in detail in [29]. The bottleneck of this calculation is the Lagrangian stage, which is parallelized using OpenMP. We used 44 OpenMP threads to perform the calculations. This is more threads than the number of hardware threads, but for this benchmark the best performance is observed when the node is slightly oversubscribed [29].

We tested the different memory configurations on six tests with increasing grid sizes (see Table II for details). The largest workload size was 1 TB. Each test was run at least three times on each node configuration from Table I. In this benchmark, we collected the average time of the Lagrangian stage of the AstroPhi solver, which is the dominant contribution to the total time.

TABLE II. HYDROBOX3D MEMORY FOOTPRINT IN BENCHMARKS
Grid box dimension: 1024, 1512, 1768, 2000, 2304, 2560
Memory size, GB: 67, 209, 332, 480, 732, 1000


Fig. 2. IMDT and PMM efficiency plots for the Q-Chem CCSD energy+gradient benchmark. By efficiency we mean the relative performance of the benchmark on hybrid memory and on DRAM (T_DRAM / T_hybrid). Higher efficiency is better; a value of 1.0 corresponds to DRAM performance.

Fig. 3. IMDT and PMM efficiency plots for the AstroPhi solver, which was run as a part of the HydroBox3D framework. By efficiency we mean the relative performance of the benchmark on hybrid memory and on DRAM (T_DRAM / T_hybrid). Higher efficiency is better; a value of 1.0 corresponds to DRAM performance.

V. RESULTS

1) Q-Chem: As expected, the highest performance values are observed for the DRAM-configured system. The efficiency plot for the hybrid Optane+DRAM setups versus the pure DRAM system is presented in Fig. 2. For the CCSD energy and gradient calculation, the hybrid PMM memory is on average 35% slower than the DRAM-only node configuration. The IMDT configuration is roughly two times slower than DRAM and 25% slower than Optane DC PMM. Slightly reduced efficiency of the hybrid Optane node configurations versus DRAM is observed for the RI family of methods. For the regular CCSD and CD-CCSD calculations, the efficiency of the novel Optane DC PMM setup approaches 70%. Also, the EOM-CCSD method appears to be 5-10% less efficient on both Optane-based memory setups than the corresponding CCSD calculations.

2) HydroBox3D: The results of the HydroBox3D benchmark are presented in Fig. 3. According to these data, both Optane-based node configurations are approximately three times slower than the DRAM-based configuration when the workload size exceeds the DRAM capacity of the node (196 GB in both cases). At small workload sizes the IMDT approach is superior to PMM. At higher memory footprints the performance of IMDT and PMM become close to each other, but IMDT is somewhat more efficient than PMM. Only when the workload size reaches 1 TB does PMM slightly overtake IMDT. We attribute this to a NUMA effect. In the case of IMDT and PMM we have two NUMA memory regions of 750 GB, which correspond to the CPU sockets. With the default NUMA policy, memory is allocated first on one NUMA node until no space is left, and then on the other. Hence, if only one NUMA node is used, then only half of the Intel Optane DC PMMs, and thus only half of their aggregated bandwidth, is used. IMDT is slightly more efficient in this case due to its ability to prefetch data to the DRAM that is local to the CPU. The same effect is not pronounced in the quantum chemistry benchmark due to the much higher arithmetic intensity of the latter (BLAS level 3 in Q-Chem versus stencil numerical methods in HydroBox3D).

VI. DISCUSSION

As shown in our previous paper [8], IMDT hybrid memory pools composed of DRAM and previous-generation Intel Optane drives demonstrate performance comparable to DRAM-only memory in certain scientific benchmarks. To run efficiently on this hybrid setup, an application kernel must have arithmetic intensity high enough to compensate for the long time needed to deliver data from memory to the CPU. The novel Optane DC PMM has somewhat improved bandwidth and latency characteristics, but the performance gap between PMM and DRAM is still tangible.

We observe different behavior of our benchmark applications on the IMDT and PMM configurations. We attribute this to the dissimilar memory access patterns and arithmetic intensity of the computational kernels.

PMM differs from IMDT in a number of ways. First, PMM does not have the ability to analyze data access patterns and prefetch data into DRAM before the CPU needs it, which makes it less preferable for certain workloads with complex access patterns where it is possible to overlap computation and data access. Instead, PMM relies on hardware-supported caching mechanisms for the same purpose, which works well if a workload spends a lot of time working with the same piece of the dataset (temporal locality). Second, PMM offers higher bandwidth and lower latencies when accessing Optane memory, which is beneficial for random or streaming access patterns. However, in practice, not all workloads can utilize the new Intel Optane PMM efficiently.

A general recommendation would be to use application and system profiling tools, such as Intel VTune Amplifier or the Platform Profiler package, to identify system bottlenecks and to monitor memory bandwidth utilization separately for the DRAM cache and the Optane DC PMM. Standard practice for NUMA locality optimization applies here as well, and monitoring cross-NUMA traffic is likewise important [33].

VII. CONCLUSION

Intel Optane DC PMM is the first mass-market, commercially available non-volatile DIMM product. Its bandwidth and latency characteristics put it between the DRAM and SSD tiers of the memory hierarchy. It can be used as a main memory extension by hardware only. This was also possible with the previous Intel Optane generation, which is an NVMe device; however, that requires a proprietary software-defined memory solution such as Intel Memory Drive Technology.

Many scientific problems that are limited only by the system memory capacity can benefit from Intel Optane PMM. However, the advantage of PMM over IMDT is not always pronounced, as we see in our benchmark tests. In any case, Intel Optane technology is becoming an affordable solution for large memory configurations employed for numerical simulations.

Integrating hybrid memory technologies is critical for modern HPC infrastructure. Such memory is not faster than traditional DDR4 memory; however, for certain applications that are heavy on cross-rank communication, it is a way to reduce the number of compute nodes and the fabric overhead by moving the problem under a bigger memory domain, potentially reducing the runtime. Also, with the adoption of various accelerator technologies in HPC, host memory capacity plays a significant role in the data path by reducing the storage bottleneck. This is another opportunity for hybrid memory solutions, and exploring these possibilities is the scope of a follow-up study.

ACKNOWLEDGMENT

I. C. and I. K. thank the Russian Foundation for Basic Research (projects no. 18-07-00757 and 18-01-00166) and budget project no. 0315-2019-0009. V. M. and A. M. thank the Russian Science Foundation (project 19-73-20032). We would like to thank the Siberian Supercomputer Center for providing access to HPC facilities.

REFERENCES

[1] H. Akinaga and H. Shima, "Resistive random access memory (ReRAM) based on metal oxides," Proc. IEEE, vol. 98, no. 12, pp. 2237–2251, 2010.
[2] D. Apalkov et al., "Spin-transfer torque magnetic random access memory (STT-MRAM)," ACM J. Emerg. Technol. Comput. Syst., vol. 9, no. 2, 2013.
[3] J. Handy, "Understanding the Intel / Micron 3D XPoint Memory," 2015.
[4] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," Proc. Int. Symp. Comput. Archit., pp. 2–13, 2009.
[5] G. W. Burr et al., "Phase change memory technology," J. Vac. Sci. Technol. B, Nanotechnol. Microelectron. Mater. Process. Meas. Phenom., vol. 28, no. 2, pp. 223–262, 2010.
[6] Intel Corporation, "Intel Optane Technology." [Online]. Available: http://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html. [Accessed: 29-Jan-2017].
[7] Intel Corporation, "Intel Memory Drive Technology." [Online]. Available: https://www.intel.com/content/www/us/en/software/intel-memory-drive-technology.html. [Accessed: 30-Aug-2019].
[8] V. Mironov, A. Moskovsky, A. Kudryavtsev, I. Kulikov, Y. Alexeev, and I. Chernykh, "Evaluation of Intel Memory Drive Technology performance for scientific applications," ACM Int. Conf. Proceeding Ser., pp. 14–21, 2018.
[9] A. Eisenman et al., "Reducing DRAM footprint with NVM in Facebook," in Proceedings of the Thirteenth EuroSys Conference (EuroSys '18), 2018, pp. 1–13.
[10] The PMDK team, "Persistent Memory Programming." [Online]. Available: https://pmem.io.
[11] Intel Corporation, "Intel Optane DC Persistent Memory." [Online]. Available: https://www.intel.com/content/www/us/en/architecture-and-technology/optane-dc-persistent-memory.html. [Accessed: 30-Aug-2019].
[12] J. Izraelevitz et al., "Basic Performance Measurements of the Intel Optane DC Persistent Memory Module," 2019.
[13] J. Yang, J. Kim, M. Hoseinzadeh, J. Izraelevitz, and S. Swanson, "An Empirical Guide to the Behavior and Use of Scalable Persistent Memory," Aug. 2019.
[14] I. B. Peng, M. B. Gokhale, and E. W. Green, "System Evaluation of the Intel Optane Byte-addressable NVM," Aug. 2019.
[15] Y. Shao et al., "Advances in molecular quantum chemistry contained in the Q-Chem 4 program package," Mol. Phys., vol. 113, no. 2, pp. 184–215, 2015.
[16] I. Kulikov, I. Chernykh, and A. Tutukov, "A New Hydrodynamic Code with Explicit Vectorization Instructions Optimizations that Is Dedicated to the Numerical Simulation of Astrophysical Gas Flow. I. Numerical Method, Tests, and Model Problems," Astrophys. J. Suppl. Ser., vol. 243, no. 1, p. 4, 2019.
[17] M. Arafa et al., "Cascade Lake: Next Generation Intel Xeon Scalable Processor," IEEE Micro, vol. 39, no. 2, pp. 29–36, 2019.
[18] K. Wu, A. Arpaci-Dusseau, R. Arpaci-Dusseau, R. Sen, and K. Park, "Exploiting Intel Optane SSD for Microsoft SQL Server," Proc. 15th Int. Workshop on Data Management on New Hardware (DaMoN '19), pp. 1–3, 2019.
[19] Q. Ali and P. Yedlapal, "Persistent Memory Performance in vSphere 6.7 with Intel Optane DC Persistent Memory," 2019. [Online]. Available: https://www.vmware.com/techpapers/2018/optane-dc-pmem-vsphere67-perf.html. [Accessed: 30-Aug-2019].
[20] I. M. Kulikov, I. G. Chernykh, A. V. Snytnikov, B. M. Glinskiy, and A. V. Tutukov, "AstroPhi: A code for complex simulation of the dynamics of astrophysical objects using hybrid supercomputers," Comput. Phys. Commun., vol. 186, pp. 71–80, 2015.
[21] I. Kulikov, "GPUPEGAS: A new GPU-accelerated hydrodynamic code for numerical simulations of interacting galaxies," Astrophys. J. Suppl. Ser., vol. 214, no. 1, 2014.
[22] V. Protasov, I. Kulikov, I. Chernykh, and I. Gubaydullin, "PADME - New code for modeling of planet georesources formation on heterogeneous computing systems," MATEC Web Conf., vol. 158, 2018.

[23] I. Iben, Jr. and A. V. Tutukov, "On the Evolution of Close Triple Stars That Produce Type Ia Supernovae," Astrophys. J., vol. 511, no. 1, pp. 324–334, 1999.
[24] I. A. Kaliman and A. I. Krylov, "New algorithm for tensor contractions on multi-core CPUs, GPUs, and accelerators enables CCSD and EOM-CCSD calculations with over 1000 basis functions on a single compute node," J. Comput. Chem., vol. 38, no. 11, pp. 842–853, Apr. 2017.
[25] S. K. Godunov and I. M. Kulikov, "Computation of discontinuous solutions of fluid dynamics equations with entropy nondecrease guarantee," Comput. Math. Math. Phys., vol. 54, no. 6, pp. 1012–1024, 2014.
[26] V. A. Vshivkov, G. G. Lazareva, A. V. Snytnikov, I. M. Kulikov, and A. V. Tutukov, "Computational methods for ill-posed problems of gravitational gasodynamics," J. Inverse Ill-Posed Probl., vol. 19, no. 1, pp. 151–166, 2011.
[27] I. Kulikov, G. Lazareva, A. Snytnikov, and V. Vshivkov, "Supercomputer simulation of an astrophysical object collapse by the fluids-in-cell method," Lect. Notes Comput. Sci., vol. 5698 LNCS, pp. 414–422, 2009.
[28] V. V. Rusanov, "The Calculation of the Interaction of Non-Stationary Shock Waves with Barriers," Zh. Vychisl. Mat. Mat. Fiz., vol. 1, no. 2, pp. 267–279, 1961.
[29] I. M. Kulikov, I. G. Chernykh, and A. V. Tutukov, "A New Parallel Intel Xeon Phi Hydrodynamics Code for Massively Parallel Supercomputers," Lobachevskii J. Math., vol. 39, no. 9, pp. 1207–1216, 2018.
[30] M. V. Popov and S. D. Ustyugov, "Piecewise parabolic method on local stencil for gasdynamic simulations," Comput. Math. Math. Phys., vol. 47, no. 12, pp. 1970–1989, 2007.
[31] M. V. Popov and S. D. Ustyugov, "Piecewise parabolic method on a local stencil for ideal magnetohydrodynamics," Comput. Math. Math. Phys., vol. 48, no. 3, pp. 477–499, 2008.
[32] I. Kulikov and E. Vorobyov, "Using the PPML approach for constructing a low-dissipation, operator-splitting scheme for numerical simulations of hydrodynamic flows," J. Comput. Phys., vol. 317, pp. 318–346, 2016.
[33] Intel Corporation, "Using the Latest Performance Analysis Tools to Prepare for Intel Optane DC Persistent Memory." [Online]. Available: https://techdecoded.intel.io/resources/using-the-latest-performance-analysis-tools-to-prepare-for-intel-optane-dc-persistent-memory/. [Accessed: 30-Aug-2019].

Optimizing Data Layouts for Irregular Applications on a Migratory Thread Architecture

Thomas B. Rolinger∗†, Christopher D. Krieger† and Alan Sussman∗

∗ University of Maryland, College Park, MD USA † Laboratory for Physical Sciences, College Park, MD USA

[email protected], [email protected], [email protected]

Abstract—Applications that operate on sparse data induce irregular data access patterns and cannot take full advantage of caches and prefetching. Novel hardware architectures have been proposed to address the disparity between processor and memory speeds by moving computation closer to memory. One such architecture is the Emu system, which employs light-weight threads that migrate to the location of the data being accessed. While smart heuristics and profile-guided techniques have been developed to derive good data layouts for traditional machines, these methods are largely ineffective when applied to a migratory thread architecture. In this work, we present an application-independent framework for data layout optimizations that targets the Emu architecture. We discuss the necessary tools and concepts to facilitate such optimizations, including a data-centric profiler, data distribution library, and cost model. To demonstrate the framework, we have designed a block placement optimization that distributes blocks of data across the system such that access latency is reduced. The optimization was applied to sparse matrix-vector multiplication on an Emu FPGA implementation, achieving a geometric mean speedup of 12.5% across 57 matrices. Only one matrix experienced a performance loss, of 6%, while the maximum runtime speedup was 50%.

Keywords-migratory threads, data layout, optimizations, irregular applications

I. INTRODUCTION

While processor speeds have steadily increased, memory speed has only achieved a fraction of that scaling. As a result, modern processors can become data-starved, as the memory system cannot keep up with the rate at which data is needed by the processors. Architectural advances such as cache hierarchies have reduced memory latency, allowing many applications to take advantage of the increase in processor speeds. However, applications that operate on sparse data, such as graphs and tensors, induce irregular data access patterns that pose significant challenges due to the lack of locality in their memory accesses.

Due to the importance of irregular applications for high performance data analytics, optimizing both the algorithms as well as the way in which sparse data is stored has been an active area of research [1], [2]. On the other hand, rather than adapting irregular applications to traditional computing systems, another approach is to develop novel architectures that address the challenges directly in hardware [3], [4]. While there have been several such approaches, a common theme among them is to provide fine-grained memory accesses and to move computation closer to memory. One recent example is the Emu migratory thread architecture [5]. Instead of bringing data through a cache-memory hierarchy to the processor, the Emu system uses light-weight migratory threads that physically move to the location of the data being accessed. By only transferring light-weight threads on remote memory accesses, the Emu system aims to reduce the total load on the memory system.

For many memory-bound applications, the way in which data is distributed across system resources is important for performance. For the Emu system, this is critically important because computation migrates to statically placed data, which means that data layout explicitly controls load balancing and utilization of compute resources. A poor data layout can lead to a significant surge of threads migrating to the same hardware resource, creating load imbalance.

An efficient algorithm to determine an optimal data layout that minimizes a metric such as cache misses is not only difficult to approximate, but is in fact NP-hard [6]. In response, smart heuristics and profile-guided techniques have been developed to find "good" data layouts. However, traditional optimization methods are largely ineffective on a system like Emu due to fundamental architectural differences, such as a lack of hierarchical memory. Irregular applications also defeat many static data optimizations because their data access patterns are not known until runtime. To make matters worse, even when all memory accesses are known for a particular execution via profiling, it is still not obvious how the data should be laid out. Therefore, while data layouts are crucial to performance, there are significant challenges to determining a high performing layout for an irregular application on Emu.

In this work, we present a framework that facilitates data layout optimizations for the Emu architecture. This framework is application-independent, as it only relies on memory access behavior rather than knowledge about application intent or data structures. The primary goal of these data layout optimizations is to reduce memory access latency by avoiding thread migrations, while still maintaining a good load balance across the system resources.

Our contributions are as follows:
• We design and implement the required tools to study and optimize data layouts on the Emu system. These include a data block distribution library and a data-centric memory profiler.
• We develop a cost model for capturing the underlying performance characteristics of data block distributions on the Emu architecture. This cost model is used to guide optimizations.
• We design and implement a data block placement optimization that aims to redistribute blocks of data across the system such that their memory access latency is minimized.
• To demonstrate the cost model and block placement optimization, we perform a case study on sparse matrix vector multiply (SpMV) across 57 matrices. We show that data layouts can be generated for the input vector of SpMV that provide as much as a 50% reduction in runtime and a geometric mean reduction of 12.5% when compared to a default round-robin layout. Furthermore, only one matrix had a performance loss after optimization, of 6%.

The rest of this paper is organized as follows: Section II presents an overview of the current Emu architecture and relevant aspects of its programming model. Section III describes the foundational tools that were developed to facilitate data layout optimizations. The cost model and block placement optimization are presented in Sections IV and V, respectively. Section VI presents the case study of SpMV and our performance results. Related work is discussed in Section VII. Finally, future work and conclusions are discussed in Section VIII.

II. EMU ARCHITECTURE

The basic building block in an Emu system is a nodelet. A nodelet contains a number of cache-less, multi-threaded Gossamer Cores, banks of narrow channel DRAM, and memory-side processors. A group of nodelets, connected by a migration engine to provide a means for the movement of threads between nodelets, is called a node in the system. Groups of nodes are further connected over a communication network to form a complete Emu system.
A Gossamer Core (GC) is a general purpose processing unit developed specifically for the Emu architecture. The design complexity of a GC is significantly reduced relative to a traditional CPU due to the lack of caches and, thus, the lack of cache coherency logic. A single GC can support up to 64 concurrent light-weight threads, where each thread is limited to one active instruction at any given time.

One way that Emu addresses the challenge of irregular access patterns is by utilizing narrow channel DRAM (NCDRAM) on each nodelet. The NCDRAM modules consist of eight 8-bit channels rather than a single 64-bit interface. This provides fine-grained memory accesses that are better suited for applications that typically only need 8 bytes of data out of an entire cache line.

Attached to each bank of NCDRAM is a memory-side processor (MSP). An MSP can perform a remote operation on behalf of an executing thread, atomically. Remote operations do not return a result and do not migrate the entire thread, but instead generate a packet that encapsulates the operation to perform, the data on which to perform it, and the address at which to store the result (which also serves as a second data operand). Once the remote operation has completed, an acknowledgement is sent back to the issuing thread. The thread issuing the remote operation may continue execution after the packet has been sent off to the destination nodelet. However, a thread cannot migrate until all outstanding acknowledgements have been received. All writes to remote memory are automatically transformed into remote writes by the Emu compiler.

The currently available Emu system is called the Emu Chick. It consists of 8 nodes, where each node is comprised of four nodelets. Figure 1 depicts a single node in the Emu Chick system. On each nodelet, there are three GCs and two banks of NCDRAM, each bank with its own MSP. A given nodelet can support up to 192 concurrently executing threads. The nodes are connected via a Serial RapidIO (SRIO) interconnect in a mesh-like network where each node has direct links to 6 other nodes; the remaining nodes require two hops for communication. Each node is implemented on an Arria 10 FPGA. More details regarding the Emu system and architecture can be found in the work by Dysart et al. [5].

Fig. 1. A single node in the Emu Chick system. Each of the 4 nodelets contains 3 Gossamer Cores (GCs), two banks of narrow channel DRAM (NCDRAM) and two memory-side processors (MSPs).

Programs executed on an Emu system are written in C, where specialized software routines are provided to control data allocation (see Section II-B). Parallelism is controlled using the Cilk parallel extensions to C [7], where the cilk_spawn keyword will spawn a thread to execute a specified function. Users can give hints to cilk_spawn to indicate the nodelet where the thread should be created, providing a mechanism to initially distribute threads across the system.
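As a concrete illustration of this model, the sketch below spawns one Cilk worker per nodelet-resident block. It is our own example, not code from the paper: emu_alloc_on_nodelet() is a hypothetical stand-in for the Emu allocation routine that places memory on a chosen nodelet (Section II-B), and the spawn-hint syntax the toolchain provides is not shown; on Emu, a thread that touches remote data simply migrates to the nodelet holding it.

```c
/* Illustrative sketch only (not from the paper): one Cilk worker per block.
 * emu_alloc_on_nodelet() is a hypothetical placeholder for the Emu routine
 * that allocates memory on a specific nodelet; here it falls back to malloc
 * so the sketch is self-contained on an ordinary machine. */
#include <cilk/cilk.h>
#include <stdlib.h>

#define NUM_NODELETS 32            /* 8 nodes x 4 nodelets in the Emu Chick */

static long *emu_alloc_on_nodelet(int nodelet, size_t nelems)
{
    (void)nodelet;                 /* hypothetical: would pin block to nodelet */
    return calloc(nelems, sizeof(long));
}

static void process_block(long *block, size_t nelems)
{
    /* Once the thread reaches (or starts on) the block's nodelet,
     * all of this work is local and causes no further migrations. */
    for (size_t i = 0; i < nelems; i++)
        block[i] += 1;
}

void process_all(size_t nelems_per_block)
{
    long *blocks[NUM_NODELETS];

    for (int n = 0; n < NUM_NODELETS; n++)
        blocks[n] = emu_alloc_on_nodelet(n, nelems_per_block);

    for (int n = 0; n < NUM_NODELETS; n++)
        cilk_spawn process_block(blocks[n], nelems_per_block);
    cilk_sync;

    for (int n = 0; n < NUM_NODELETS; n++)
        free(blocks[n]);
}
```

The point of the pattern is that thread placement follows data placement, which is exactly why the data layout decisions discussed in the rest of the paper control load balance.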
If routine is provided to allocate memory on the nodelet where a programmer chooses not to use the block distribution routine the thread is executing. A modified allocation routine is also provided by Emu, he or she can create a customized layout, provided that allows memory to be allocated on a specified but will be responsible for directly coding data allocations remote nodelet. and index mappings. Consequently, any changes to a custom For distributed allocations, a simple block distribution rou- layout, such as block sizes or locations, would require the tine is provided, where blocks of data are allocated and application source code to be modified to account for the new distributed across the nodelets. Users specify the number of index mapping. blocks and the block size in bytes, but all blocks are required To address these issues, we developed a block distribution to have the same size. Furthermore, the location of the blocks library that supports variable block sizes as well as the is restricted to a cyclic, round-robin placement on the nodelets. flexibility to define block placement strategies beyond cyclic The routine returns a 2-dimensional array, where it is the round-robin. Additionally, the library abstracts index mapping programmer’s responsibility to handle the mapping from the for the user. If the layout is subsequently modified, including original, conceptual index space to the blocked 2-dimensional by an optimizer, the user’s code that accesses the data is indices, typically using integer division and modulo arithmetic. unchanged. For distributions where all the blocks are the same The Emu toolchain provides a utility library to assist with these size, the library will use optimized routines for indexing. These index calculations. routines incur the same amount of overhead as those provided by the Emu toolchain. For distributions with variable block C. Simulating and Profiling Applications sizes, we adopt the approach used by Global Arrays [9], which Emu also provides an architectural simulator that can exe- does add overhead due to the complexity of finding the correct cute the same programs that run on the hardware and output block for a given access. detailed metrics regarding thread migrations and memory accesses. As the current Emu hardware does not provide any B. Data-centric Memory Profiler performance counters, the simulator is crucial to understanding Another fundamental prerequisite for data layout optimiza- an application’s behavior. The Emu simulator provides code- tions is the ability to gather performance metrics related centric profiling support that includes thread and migration directly to a data layout, rather than the entire program. activity at the scope of functions and/or the entire program. To address the limitations of Emu’s code-centric profiling However, to provide this data, the simulator must run in capabilities, we developed a data-centric profiler that can timed architectural mode, which increases the wall clock time generate memory access profiles for specific allocations in a required to simulate a program by more than 20 times over program, as specified by the user. execution in untimed mode. Our profiler is split into three distinct stages. First, we The profiler lacks direct support for data-centric views of provide a light-weight API that allows users to directly specify the program, such as showing how many migrations were within their code which data allocation they wish to profile. 
B. Data-centric Memory Profiler

Another fundamental prerequisite for data layout optimizations is the ability to gather performance metrics related directly to a data layout, rather than the entire program. To address the limitations of Emu's code-centric profiling capabilities, we developed a data-centric profiler that can generate memory access profiles for specific allocations in a program, as specified by the user.

Our profiler is split into three distinct stages. First, we provide a light-weight API that allows users to directly specify within their code which data allocation they wish to profile. This involves "registering" the allocation and then making explicit calls to start and stop "tracking" the allocation, giving full control of when the profiling should take place. The second stage is executing the program within the Emu simulator. During this stage, two output files are generated: (1) tracking metadata and (2) a memory trace. The tracking metadata is generated by our API and includes the address ranges of the allocated blocks and the timestamps for when the profiling should occur. The memory trace is generated directly by the simulator, as discussed in Section II-C. The trace information is not stored in memory nor made available to the running application.

The metadata from our API and the memory trace file are ingested during the final stage, which performs the actual profiling. A memory access event from the trace is attributed to a data allocation if it occurred during the relevant cycle range of profiling and if it accessed an address that was within the distributed blocks of that allocation. The output from the profiler consists of various files that describe metrics related to the block distribution, such as incoming and outgoing memory accesses to/from each block. The profiler also breaks these metrics down across discrete windows of time. With a data-centric view of the program, we can gain insight into temporal behavior, such as how the read and write activity from/to a block changes over time.
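The paper does not list the API's actual function names, so the sketch below uses hypothetical ones (dcprof_register, dcprof_start, dcprof_stop) purely to show how the register-then-track pattern described above brackets the allocation and the region of interest; the stub bodies stand in for the metadata the real API would emit.

```c
/* Hypothetical sketch of the register/track pattern described above.
 * The dcprof_* names are placeholders, not the profiler's real interface. */
#include <stddef.h>
#include <stdio.h>

static void dcprof_register(const char *label, void **blocks,
                            const size_t *bytes, size_t nblocks)
{
    /* Real API: record label plus [address, address+bytes) for each block. */
    (void)blocks; (void)bytes;
    printf("registered %s (%zu blocks)\n", label, nblocks);
}

static void dcprof_start(const char *label) { printf("start tracking %s\n", label); }
static void dcprof_stop(const char *label)  { printf("stop tracking %s\n", label);  }

/* Application kernel whose accesses to the tracked allocation we care about. */
static void spmv(void) { /* ... */ }

int main(void)
{
    void  *blocks[4] = {0};
    size_t bytes[4]  = {0};

    /* Stage 1: register the distributed allocation and bracket the region
     * to profile. Stage 2 is the simulator run that produces the memory
     * trace; stage 3 joins trace and metadata offline. */
    dcprof_register("spmv_input_vector", blocks, bytes, 4);
    dcprof_start("spmv_input_vector");
    spmv();
    dcprof_stop("spmv_input_vector");
    return 0;
}
```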

Symbol Description N number of nodelets, numbered from 1 to N B number of blocks, numbered from 1 to B W number of windows of activity, numbered from 1 to W C NxN map of memory access latency, in cycles, between nodelets P 1xB map of blocks to their current nodelet location A NxB map of the number of memory accesses between nodelets and blocks L NxB map of memory access latency, in cycles, between nodelets and blocks Tmax maximum number of concurrent threads supported on a nodelet Lb total memory access latency, in cycles, for block b across all nodelets Lµ average total memory access latency, in cycles, across all blocks Tblk(b,w) number of active threads for block b during window w Tndlt(n,w) number of active threads on nodelet n during window w Utilblk(b,w) resource utilization of block b during window w Utilndlt(n,w) resource utilization of nodelet n during window w Ublk(b) median resource utilization of block b across all windows Undlt(n) median resource utilization of nodelet n across all windows PerformanceImpactb expected performance impact of block b PlacementCostb,n expected cost of placing block b on nodelet n metadata and (2) a memory trace. The tracking metadata is used in the cost model along with a brief description of their generated by our API and includes the address ranges of the meaning. allocated blocks and the timestamps for when the profiling should occur. The memory trace is generated directly by the A. Communication Cost Map simulator, as discussed in Section II-C. The trace information is not stored in memory nor made available to the running While the profiler produces metrics for how many memory application. accesses were made to each block of the distribution, it does The metadata from our API and the memory trace file not provide information about how expensive a given memory are ingested during the final step, which performs the actual access is. The Emu system does not have a traditional mem- profiling. A memory access event from the trace is attributed ory hierarchy, but does exhibit non-uniform memory access to a data allocation if it occurred during the relevant cycle latency depending on the source and destination nodelets range of profiling and if it accessed an address that was within involved in the access. Inter-node migrations have a higher the distributed blocks of that allocation. The output from the latency than intra-node migrations since the former use the profiler consists of various files that describe metrics related to SRIO interconnect while the latter use the faster, crossbar- the block distribution such as incoming and outgoing memory based migration engine. Furthermore, since the Emu Chick accesses to/from each block. The profiler also breaks down does not have a direct all-to-all communication topology these metrics across discrete windows of time. With a data- between nodes, there are pairs of nodes where migrations centric view of the program, we can gain insight into temporal between them have an even higher multi-hop latency. behavior, such as how the read and write activity from/to a In order to capture these costs, we developed a benchmark to block changes over time. measure memory access latency between all pairs of nodelets in the system. We store these costs within an NxN map C, IV. COST MODEL where element ci,j is the cost of a migration, in cycles, from We now have the ability to use a data distribution in a nodelet i to nodelet j. 
The diagonal entries in C represent way that allows for easy modification and the ability to gather local memory accesses and are weighted negatively. This is performance metrics for a given distribution. However, a cost done to encourage local accesses by reducing the cost metrics model is needed to decide which blocks the optimizations derived from C. should be applied to. The model needs to consider the penalty We observe that with respect to local accesses, intra-node of migrations while also factoring in the load over time on the migrations require 2x as many cycles. For migrations between nodelets. In this section, we describe how we derived our cost nodelets on separate nodes, which are connected with a single model from both the profile data as well as hardware related SRIO link, the cost is 3x as much as a local access. Finally, metrics. In this work, we use the model only to evaluate the migrations between nodelets on separate nodes that require benefit of relocating blocks to different nodelets, but the model two hops over the SRIO network require 4x as many cycles is general enough to guide other optimizations, which will be as a local access. considered in future work. We note that the cost model currently does not consider In the following subsections, we describe the components remote operations. The reason is that thread migrations are of the model and how they are derived. We refer to the number significantly more expensive, as they require moving the entire of nodelets in the system as N and the number of blocks in thread context and block the thread from executing any further the distribution as B. Table I presents the various symbols instructions. Future work will incorporate remote operations into the model, as they are relevant for other optimizations threads supported on a nodelet. Similarly, the load on a nodelet not considered in this work. considers how many threads are active on the nodelet as compared to the maximum number of threads supported. The B. Placement Map and Block Memory Access Map cost model computes the amount of over- or under-utilization, One of the defining features of a block distribution is the where 0 indicates perfect utilization, negative values indicate nodelet location of each block. This information is stored in under-utilization and positive values indicate over-utilization. a one-dimensional map P, where pi represents the nodelet ID For a given block b and nodelet n, their respective utilization where block i is currently placed. This information is provided during window w is computed as: directly by the data distribution definition. In addition to the placement of the blocks, the model needs Tblk(b,w) Utilblk(b,w)= − 1 (1) to know how many accesses were made to each block. The Tmax profiler produces such information, which includes the source T (n,w) nodelet for each access. We store the memory accesses in a Util (n,w)= ndlt − 1 (2) ndlt T NxB map A, where ai,j represents the number of memory max accesses from nodelet i to block j. We can derive a complete where Tblk(b,w) and Tndlt(n,w) represent the number of nodelet-to-nodelet trace of the memory accesses by looking active threads for block b and nodelet n during window up block j in P. w, respectively. Tmax is the number of concurrent threads C. Block Latency Map supported per nodelet, which is 192 for the current Emu Chick system. 
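As a sanity check on Equations (1) and (2), the per-window utilization terms reduce to a few lines of C. This is a minimal sketch that assumes the thread counts have already been extracted from the profiler output; the qsort-based median selection is our own choice, not something the paper prescribes.

```c
#include <stdlib.h>

#define T_MAX 192.0  /* Tmax on the current Emu Chick */

/* Eq. (1): Util_blk(b,w) = T_blk(b,w) / Tmax - 1 */
static double util_blk(int t_blk_bw)   { return t_blk_bw  / T_MAX - 1.0; }

/* Eq. (2): Util_ndlt(n,w) = T_ndlt(n,w) / Tmax - 1 */
static double util_ndlt(int t_ndlt_nw) { return t_ndlt_nw / T_MAX - 1.0; }

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median utilization across all W windows, i.e. U_blk(b) or U_ndlt(n).
 * 0 means perfect utilization, <0 under-utilization, >0 over-utilization.
 * Note: sorts the caller's array in place. */
static double median_util(double *util, int w)
{
    qsort(util, (size_t)w, sizeof(double), cmp_double);
    return (w % 2) ? util[w / 2] : 0.5 * (util[w / 2 - 1] + util[w / 2]);
}
```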
Note that Tndlt(n,w) includes not only the threads Given the C and A maps, we can compute the total memory that are accessing data layout blocks on nodelet n during access latency cost for each block with respect to a given window w, but threads that are accessing any memory on the nodelet by multiplying the number of accesses made to a nodelet. This allows the cost model to account for activity on block by the latency cost for those accesses. We store this the nodelet that is not directly related to the data layout in NxB information in a map L. The total memory access question, but is still relevant for performance. latency incurred by block i across all nodelets can be computed The cost model takes the median of the utilization values by simply summing up the ith column of L. We represent across all windows to arrive at an estimate of the load over the this total latency value for block b as L . As local accesses b program execution. We will refer to these median utilization are negatively weighted, a block that is dominated by local values as simply U (b) and U (n) for block b and nodelet accesses will have a negative total memory access latency. blk ndlt n, respectively. D. Resource Utilization E. Block Performance Impact The cost model described thus far only has a static view of the data layout performance. What it lacks is how the various Because the migratory thread architecture moves compu- blocks are accessed over the execution of the program, as tation to the nodelet that contains needed data, changing a well as the amount of activity on each nodelet over time. By block’s placement also creates a rebalancing of work. This only considering an end-of-program cost of memory accesses, loading effect places an additional constraint on selecting optimizations may make poor decisions, such as relocating a block for relocation. For example, if a block is heavily all the blocks to a single nodelet. The end result would utilizing resources, it will have greater impact on performance be severe resource over-subscription, leading to increased to relocate it since the load that it induces may lead to over- execution time. subscription on its new nodelet. Likewise, if a block is lightly To address these issues, the cost model computes a resource utilizing resources, it will be much easier to move to a different utilization statistic over time. Our data-centric profiler provides nodelet without negative performance consequences. For these various memory access statistics over discrete windows of time reasons, the cost model computes the performance impact of that are computed for every block and every nodelet. For a a block by considering both the latency cost as well as the given block, the number of unique threads that made memory utilization induced by the block. accesses to the block is logged for each window. Similarly, This performance impact cost is used to guide the selection the number of unique threads that made memory accesses to of blocks for relocation and is computed as follows for a given any data on a nodelet is logged for each window. We consider block b: such threads as active threads during a given window. This PerformanceImpact = Lb +(Lµ × Ublk(b)) (3) information is crucial when considering optimizations such b as block placement because regardless of where a block is where Lµ is the average memory access latency across all located, the threads will still issue accesses to that block. The blocks. 
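Equation (3) itself is a one-liner; the sketch below simply restates it in C, assuming L_b, L_mu, and U_blk(b) have been computed as described above.

```c
/* Eq. (3): PerformanceImpact_b = L_b + (L_mu * U_blk(b)).
 * L_b     : total access latency of block b (cycles, negative if mostly local)
 * L_mu    : average total access latency across all blocks (cycles)
 * u_blk_b : median utilization of block b (0 = perfect, >0 = over-utilized) */
static double performance_impact(double L_b, double L_mu, double u_blk_b)
{
    return L_b + L_mu * u_blk_b;
}
```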
The purpose of using Lµ is to convert the utilization cost model needs to be aware of how much “load” the block term into the same cycle units as Lb, as well as provide for will bring to a nodelet when it is moved, as well as how much a means to scale the performance impact by the degree of load is already on that nodelet. over- or under-utilization. The more over-utilized a block is, We define the load of a block as the ratio of the number the higher its performance impact will be. Likewise, under- of threads that accessed the block to the maximum number of utilization will lower the performance impact. In future work, we plan to better understand the effect of thread activity on Algorithm 1 Block Placement Optimization the memory system and incorporate that into the model. INPUT: blocks, nodelets, cost model OUTPUT: updated P map F. Block Placement Cost 1: for each block b do When deciding on whether to relocate a block to a particular 2: compute PerformanceImpact nodelet, it is necessary for the model to consider the impact on b 3: end for the block’s latency cost as well as the impact on the utilization 4: pick candidate blocks with positive PerformanceImpact of the nodelet, which can be hosting multiple blocks. This b 5: sort candidate blocks by decreasing PerformanceImpact is important to consider, as over-utilizing the nodelet can b 6: for each candidate block b do have negative performance effects for all blocks hosted there, 7: MinLatency = L not just the block being moved. To this end, the cost model b 8: MinCost = PlacementCost computes a placement cost for block b on nodelet n as: b,pb 9: MinNodelet = pb PlacementCostb,n = Lb +(Lµ × Undlt(n)) (4) 10: for each nodelet n, n =6 pb do 11: compute Lb′ for b if it were on n where L reflects the latency cost of b if it were on nodelet n b 12: if Lb′ < MinLatency then and U (n) represents the median utilization of nodelet n, ndlt 13: compute PlacementCostb,n assuming that b were to be placed on n. 14: if PlacementCostb,n < MinCost then With respect to nodelets, the Emu system is capable of 15: MinLatency = Lb′ running more than T threads on a nodelet, but every thread max 16: MinCost = PlacementCostb,n that exceeds Tmax will be idle until it can be scheduled. Like 17: MinNodelet = n the block performance impact, the model discourages over- 18: end if utilization of a nodelet by increasing the block’s placement 19: end if cost by a percentage of the average block latency. 20: end for 21: p MinNodelet V. OPTIMIZATION:BLOCK PLACEMENT b = 22: update cost model The block placement optimization is straightforward: place 23: end for block b on the nodelet n that minimizes b’s cost as given by the model. The high-level algorithm is presented in Algorithm 1. In the following subsections, relevant portions of the algorithm proceed as a prospective target nodelet. By doing this, nodelets are explained in detail, as well as its complexity. which do not reduce the memory access latency for a given block are excluded from consideration. This strategy targets A. Block Selection and Sorting our original goal of reducing memory access latency. Because we are focusing on irregular memory access pat- If placement on n would produce a lower latency cost for b, terns, some blocks will likely have higher performance impact then the algorithm accounts for the utilization on n if b were to than others. If the optimization attempts to move all of the be relocated there (Line 13). 
Line 14 then determines whether blocks, we have observed that layouts can be generated that the placement cost of b on n is lower than the current minimum lead to significant performance loss (see Section VI-B). In cost for b. If that is the case, then the algorithm updates the order to select only the most beneficial blocks for relocation, state to reflect that n is the nodelet that produces the minimum the optimization will ignore any block with a negative perfor- cost for b (Lines 15–17). Doing this allows for the optimization mance impact, as such blocks are dominated by local accesses to discourage over-utilization of resources while still aiming and/or are significantly under-utilized. Line 4 in Algorithm 1 to reduce memory access latency. This process is repeated for performs this block selection procedure. each nodelet, at which point the block placement map P is Before performing the actual block placement, it is impor- updated to map b to the nodelet that produces the lowest cost tant to determine the order in which the blocks will be moved. (Line 21). To this end, blocks with high performance impact are given priority (Line 5 in Algorithm 1). Moving these “heavy” blocks C. Cost Model Update first is likely to yield the largest performance gains. We address After each block placement, the cost model needs to be the performance impact of this sorting in Section VI-B. updated to reflect the changes induced by relocating the block (Line 22 in Algorithm 1). Because iteratively rerunning B. Nodelet Comparisons the program after each block placement to determine its The main goal of the block placement optimization is to find new performance is prohibitively expensive, we instead infer the nodelet n that produces the minimum placement cost for a the changes to the memory access profile using the metrics given block b. To achieve this, the algorithm first recomputes provided by the data-centric profiler and our understanding of b’s latency as if it were located on the prospective nodelet n. the Emu architecture, as we now describe. If this recomputed latency, referred to as Lb′ in Algorithm 1, For a block b, its original nodelet n and each window w, is lower than the current “minimum” latency for b, then n can threads that access b and do not make any other accesses on n are subtracted from Tndlt(n,w). Similarly, for the new nodelet optimized. We then present performance results across a wide location n′ and each window w, the threads that access b, and range of sparse matrices and discuss their implications. ′ ′ are not already active on n , are added to Tndlt(n ,w). By moving block b to a new nodelet, the number and source A. Implementation Details of the incoming memory accesses do not change. However, the The SpMV implementation used in this study is taken latency cost of the accesses will change. This is reflected by directly from our prior work [10], where we formulate the updating the block latency map L. Specifically, column b is operation as Ax = b. A sparse matrix A is stored in Com- recomputed to account for b’s new location. pressed Sparse Row (CSR) format, where an equal number Relocating block b can change the source of incoming of contiguous rows are distributed to each nodelet and then accesses for other blocks. Consider a scenario where some further distributed to some number of threads that are spawned number of accesses were made to block b′, which is located on each nodelet. The dense vectors x and b are divided into on a different nodelet from b itself. 
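Algorithm 1 can be transcribed almost line for line. The C sketch below builds on the cost_model structure sketched after Table I and assumes helper routines with hypothetical names (block_latency_on, placement_cost, update_cost_model, performance_impact_of) that encapsulate the L, A, C, and utilization bookkeeping described above; it follows the structure of the algorithm, not the authors' actual implementation.

```c
#include <stdlib.h>

/* Hypothetical helpers: recompute L_b as if block b lived on nodelet n,
 * evaluate Eq. (4), fold a completed move back into the model state,
 * and evaluate Eq. (3) for a block. */
double block_latency_on(struct cost_model *m, int b, int n);
double placement_cost(struct cost_model *m, int b, int n);
void   update_cost_model(struct cost_model *m, int b, int new_nodelet);
double performance_impact_of(struct cost_model *m, int b);

struct candidate { int b; double impact; };

static int by_impact_desc(const void *x, const void *y)
{
    const struct candidate *a = x, *c = y;
    return (c->impact > a->impact) - (c->impact < a->impact);
}

/* Algorithm 1: place each candidate block on the nodelet that minimizes
 * its placement cost, visiting "heavy" blocks first. Updates m->P. */
void block_placement(struct cost_model *m, int num_blocks, int num_nodelets)
{
    struct candidate cand[N_BLOCKS];
    int ncand = 0;

    /* Lines 1-5: keep only blocks with positive impact, sorted descending. */
    for (int b = 0; b < num_blocks; b++) {
        double imp = performance_impact_of(m, b);
        if (imp > 0.0)
            cand[ncand++] = (struct candidate){ .b = b, .impact = imp };
    }
    qsort(cand, (size_t)ncand, sizeof(cand[0]), by_impact_desc);

    /* Lines 6-23: evaluate every other nodelet for each candidate block. */
    for (int i = 0; i < ncand; i++) {
        int b = cand[i].b;
        double min_latency = block_latency_on(m, b, m->P[b]);   /* Line 7 */
        double min_cost    = placement_cost(m, b, m->P[b]);     /* Line 8 */
        int    min_nodelet = m->P[b];                           /* Line 9 */

        for (int n = 0; n < num_nodelets; n++) {
            if (n == m->P[b])
                continue;
            double lat = block_latency_on(m, b, n);             /* Line 11 */
            if (lat < min_latency) {                            /* Line 12 */
                double cost = placement_cost(m, b, n);          /* Line 13 */
                if (cost < min_cost) {                          /* Line 14 */
                    min_latency = lat;
                    min_cost    = cost;
                    min_nodelet = n;
                }
            }
        }
        m->P[b] = min_nodelet;                 /* Line 21 */
        update_cost_model(m, b, min_nodelet);  /* Line 22 */
    }
}
```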
Furthermore, assume equal-sized blocks and distributed to the nodelets. A thread that those accesses were outgoing from b, which means they operates on the non-zeros within its assigned rows, potentially were made immediately after accessing block b. From the migrating to access elements of x that are not local. If a thread perspective of b′, those accesses were incoming from b’s migrates to access x, it will always migrate back to the “home” original nodelet location. However, now that b has moved to nodelet where its portion of the CSR structure is stored so it another location, those accesses are now coming from b’s new can process the next non-zero. nodelet. If b were to be moved onto the same nodelet as b′, For the experiments in this work, we focus on optimizing then those accesses would be considered local, significantly the data layout for the input vector x. For each experiment, changing the cost model. Fortunately, the data-centric profiler we initialize the layout of x to the default layout, in which keeps track of these outgoing memory accesses for the blocks, the blocks of x are distributed to the nodelets in a round-robin which enables the cost model to be updated to account for such fashion, such that block i is stored on nodelet i, and each changes. nodelet is given exactly one block. D. Complexity Analysis B. Performance Evaluation Our analysis will focus on the loop on lines 6–23 in We evaluated 57 matrices from the SuiteSparse Matrix Algorithm 1, which dominates the complexity cost. The inner Collection [11]. The matrices are drawn from a wide range loop on lines 10–20 performs O(N) iterations and computing of domains, from graph analytics and optimization problems the latency of b on nodelet n (Line 11) requires O(N) to computational fluid dynamics and circuit simulation. The operations. Determining the placement cost of block b on number of non-zeros range from 200,000 to 37 million and nodelet n (Line 13) is independent of the number of blocks the fraction of non-zeros varies from 0.5 to 1.81×10−7. and nodelets, so its cost is constant. Therefore, the entire inner The experiments were performed on all 8 nodes of the Emu loop requires O(N 2) operations. The outer loop on lines 6–23 Chick system using version 19.02 of the toolchain. For each performs O(B) iterations and updating the cost model on line experiment, the input vector x is partitioned into 32 blocks, 22 requires O(B + N) operations. Thus, the complexity of where each of the 32 nodelets receives a single block as per the Algorithm 1 is O(BN 2). default layout. Each nodelet spawns 192 threads that operate Commonly, the number of blocks is the same as the number on the rows assigned to it. of nodelets. In that case, the complexity is simply O(N 3). Of the 57 matrices, the cost model determined that 22 This suggests that the algorithm may have scalability issues of them would benefit from a new data layout while the for a system with a large number of nodelets. However, the remaining 35 matrices were left unchanged. With respect to current Emu Chick system only has 32 nodelets. Addressing the 22 matrices that the cost model assigned new layouts, the scalability of the algorithm is left for future work. we observe runtime speed-ups, some as much as 50%, for 21 of the matrices. The remaining matrix suffers a small VI.CASE STUDY:SPARSE MATRIX VECTOR MULTIPLY performance loss of 6%. Overall, the geometric mean of To evaluate the cost model and block placement optimiza- the speed-up of the 22 matrices is 12.5%. 
These results are tion, described in Sections IV and V, we chose sparse matrix- presented in Figure 2. vector multiply (SpMV) as our case study. SpMV is a funda- mental kernel in sparse linear algebra and is prevalent in graph C. Discussion analytics. Furthermore, our prior work investigated SpMV on We hypothesize that poor data layouts were generated for Emu and demonstrated the importance of data layout choices the HEP-th-new matrix, as well as the matrices that achieved for performance [10]. However, that work relied on existing only minimal performance gains, due to thread migration matrix reordering techniques to control data layout, which hot spots. These matrices exhibit similar sparsity patterns, differ from the application-independent framework presented with dense columns of non-zeros that span most of the rows in this work. in the matrix. In particular, the execution time behavior of In the following sections, we briefly describe the implemen- the cop20k A matrix was evaluated in detail in our prior tation details of SpMV and which aspect of its operation we work [10], where we showed a migration queue became SpMV Performance Gains New Data Layouts Vs Default 1.60 1.50 1.40 1.30 1.20 1.10 1.00 speed-up over default default over speed-up 0.90

Fig. 2. Performance results on SpMV when using new data layouts. The vertical axis is runtime speed-up relative to the default layout for each respective matrix. The dotted horizontal line represents neutral performance. Green bars represent performance gain while red bars represent performance loss. The horizontal axis is sorted by increasing performance from left to right.
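For reference, the single-node core of the kernel discussed in Section VI-A looks as follows. This is a generic C sketch of a CSR sparse matrix-vector product, not the authors' Emu implementation, which additionally distributes contiguous row blocks and the blocks of x across nodelets as described above; the output vector is named y here purely to avoid clashing with the block index b.

```c
#include <stddef.h>

/* Generic CSR sparse matrix-vector multiply: y = A * x.
 * row_ptr has n_rows+1 entries; col_idx/vals hold the non-zeros row by row. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr, const size_t *col_idx,
              const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];  /* on Emu, x[col_idx[k]] may live
                                                on a remote nodelet and trigger
                                                a thread migration */
        y[i] = sum;
    }
}
```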

flooded with threads attempting to leave a nodelet at the same that work, benchmarks such as Breadth First Search (BFS), time. This happens because the first block in x is needed by RandomAccess and SpMV were evaluated on the Emu simu- almost every thread in the system at roughly the same time. lator. It was not until the work of Hein et al. [8], [13] that Currently, our cost model does not fully capture the detailed the actual Emu Chick hardware was used in experiments, behavior of the migration queue due to a lack of accurate where they provided an initial characterization of the Emu information from the simulator. Chick by evaluating several benchmarks and applications, We observe that the matrices that benefit the most from the including STREAM, pointer chasing, and BFS. Rolinger and data layout optimization are generally those whose non-zeros Krieger [10] studied the impact of traditional optimizations for are not clustered around the main diagonal. As shown in our SpMV on the Emu Chick, showing that common techniques prior work, when the non-zeros are tightly clustered around used to enforce hardware load balancing do not provide the main diagonal, SpMV on the Emu architecture benefits performance gains due to the migratory nature of Emu threads. greatly due to minimal migrations and good load balancing. Beyond direct application studies, Chatarasi and Sarkar [14] Therefore, the data layout optimization is unlikely to find a conducted a preliminary study of compiler transformations higher performing layout given such a matrix. on the Emu system to reduce thread migrations, specifically As important as it is to achieve performance gains, it is targeting graph applications. Our work differs from previous equally important to avoid significant performance losses. Two work in that, rather than porting an application and evaluating algorithmic steps, down-selecting candidate blocks and sorting its performance, we now focus on developing a framework the blocks prior to the optimization, discourage selection of to provide the means for application-independent data layout blocks that might reduce performance. To measure the impact optimization. of these steps, we ran the same set of experiments with those Data partitioning is one of the fundamental techniques used features removed. We found that by excluding sorting, the in parallel and distributed computing and has been widely maximum performance loss for the same experiments is 23%. studied. Challenges for performance-driven optimization in- When the optimization does not select candidate blocks, but clude gathering the appropriate metrics that can be used instead attempts to move all of the blocks, the maximum to guide optimization decisions, as well as determining the performance loss is 28%. success of such decisions. Memory trace analysis can be The targeted use-cases for the framework described in this very expensive, and simply re-running the entire application work are those where the cost of profiling and optimizing data to evaluate an optimization is often not feasible. Rubin et layouts can be amortized by executing many iterations of the al. [15] present a framework for finding high performing data application. For example, SpMV is the underlying kernel in layouts via profile-driven feedback on shared-memory cache- conjugate gradient solvers [12], where only the values within based systems. Their contribution was to simulate a candidate the input vector change during each iteration. 
In such a case, layout on representative memory traces rather than re-running the data layout would only need to be generated once. the application. To address the problem of high overhead approaches to gathering memory access information, Yu et VII.RELATED WORK al. [16] proposed LWPTool, which uses light-weight address The details of the Emu architecture, including hardware and sampling and novel methods to determine memory access software components, were presented by Dysart et al. [5]. In patterns to guide data layout optimization. Our work differs from prior efforts by focusing on data [5] T. Dysart, P. Kogge, M. Deneroff, E. Bovell, P. Briggs, J. Brockman, layout optimization for irregular applications on a novel mi- K. Jacobsen, Y. Juan, S. Kuntz, R. Lethin, J. McMahon, C. Pawar, M. Perrigo, S. Rucker, J. Ruttenberg, M. Ruttenberg, and S. Stein, gratory thread architecture, which is fundamentally different “Highly scalable near memory processing with migrating threads from the architectures studied in prior works. The Emu system on the Emu system architecture,” in Proceedings of the Sixth has no caches or special-purpose memories, and it supports Workshop on Irregular Applications: Architectures and Algorithms. Piscataway, NJ, USA: IEEE Press, 2016, pp. 2–9. [Online]. Available: thousands of independently migrating threads. Due to the https://doi.org/10.1109/IA3.2016.7 migratory nature of the threads, efforts to statically distribute [6] E. Petrank and D. Rawitz, “The hardness of cache conscious data work evenly across the system are negated as soon as threads placement,” in Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL ’02. begin to migrate. Furthermore, as the system is relatively new New York, NY, USA: ACM, 2002, pp. 101–112. [Online]. Available: and implemented on FPGAs, it lacks many of the low-level http://doi.acm.org/10.1145/503272.503283 hardware features that would allow for the more sophisticated [7] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: An efficient multithreaded runtime system,” techniques used in the referenced prior work. Our work Journal of parallel and distributed computing, vol. 37, no. 1, pp. 55–69, presents a preliminary cost model for understanding memory 1996. access performance on Emu as it relates to data layouts, as [8] E. Hein, T. Conte, J. Young, S. Eswar, J. Li, P. Lavain, R. Vuduc, and J. Riedy, “An initial characterization of the Emu Chick,” in The 8th well as the necessary tools for obtaining the metrics used in International Workshop on Accelerators and Hybrid Exascale Systems the cost model. (AsHES), 2018. [9] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, “Global arrays: A nonuniform memory access programming model for high-performance VIII.CONCLUSIONSAND FUTURE WORK computers,” The Journal of Supercomputing, vol. 10, no. 2, pp. 169–189, 1996. With the Emu migratory thread system, data layout choices [10] T. B. Rolinger and C. D. Krieger, “Impact of traditional sparse optimiza- play a crucial role in obtaining high performance. Optimiza- tions on a migratory thread architecture,” 2018 IEEE/ACM 8th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pp. 45– tions targeting data layout for hierarchical memory systems 52, 2018. are ineffective on Emu due to the fundamental differences [11] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix in architecture. 
To address this gap, we have presented a Collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663 preliminary framework for facilitating block-based data layout [12] J. Dongarra, M. A. Heroux, and P. Luszczek, “High-performance optimization on the Emu architecture. conjugate-gradient benchmark: A new metric for ranking high- To demonstrate the framework’s utility, we have designed performance computing systems,” The International Journal of High Performance Computing Applications, vol. 30, no. 1, pp. 3–10, 2016. and implemented a straightforward block placement optimiza- [13] E. R. Hein, S. Eswar, A. Yasar, J. Li, J. S. Young, T. M. Conte, U.¨ V. tion. The optimization achieves speedups of up to 50% on the C¸atalyurek,¨ R. Vuduc, E. J. Riedy, and B. Uc¸ar, “Programming strategies SpMV kernel, while never decreasing performance by more for irregular algorithms on the emu chick,” CoRR, vol. abs/1901.02775, 2019. [Online]. Available: http://arxiv.org/abs/1901.02775 than 6%. [14] P. Chatarasi and V. Sarkar, “A preliminary study of compiler We are developing more data layout optimizations, such as transformations for graph applications on the Emu system,” in adjusting block sizes and replicating blocks. We also believe Proceedings of the Workshop on Memory Centric High Performance Computing, ser. MCHPC’18. New York, NY, USA: ACM, 2018, pp. 37– that the cost model could be improved by incorporating more 44. [Online]. Available: http://doi.acm.org/10.1145/3286475.3286481 details of the underlying Emu architecture, such as the various [15] S. Rubin, R. Bod´ık, and T. Chilimbi, “An efficient profile- queues. Furthermore, we will include memory usage statistics analysis framework for data-layout optimizations,” SIGPLAN Not., vol. 37, no. 1, pp. 140–153, Jan. 2002. [Online]. Available: as part of the cost model, which will be crucial for the block http://doi.acm.org/10.1145/565816.503287 replication optimization. As the Emu architecture matures, we [16] C. Yu, P. Roy, Y. Bai, H. Yang, and X. Liu, “Lwptool: A lightweight plan to improve our tools by taking advantage of hardware profiler to guide data layout optimization,” IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 11, pp. 2489–2502, Nov performance counters and other features that would allow us 2018. to reduce the overhead of our framework, and perhaps move towards performing data layout optimizations at runtime.

REFERENCES

[1] S. Beamer, K. Asanovic,´ and D. Patterson, “Direction-optimizing breadth-first search,” Scientific Programming, vol. 21, no. 3-4, pp. 137– 148, 2013. [2] J. Li, J. Sun, and R. Vuduc, “Hicoo: Hierarchical storage of sparse tensors,” in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2018, pp. 238– 252. [3] P. M. Kogge, “Execube-a new architecture for scaleable mpps,” in 1994 International Conference on Parallel Processing Vol. 1, vol. 1. IEEE, 1994, pp. 77–84. [4] D. H. Yoon, M. K. Jeong, M. Sullivan, and M. Erez, “The dynamic granularity memory system,” in Proceedings of the 39th Annual International Symposium on Computer Architecture, ser. ISCA ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 548–559. [Online]. Available: http://dl.acm.org/citation.cfm?id=2337159.2337222 Optimizing Post-Copy Live Migration with System-Level Checkpoint Using Fabric-Attached Memory

Chih Chieh Chou∗[1], Yuan Chen†[2], Dejan Milojicic‡, A. L. Narasimha Reddy∗, and Paul V. Gratz∗ ∗Department of Electrical and Computer Engineering Texas A&M University Email: {ccchou2003, reddy, pgratz}@tamu.edu †JD.com Silicon Valley R&D Center Email: [email protected] ‡Hewlett Packard Labs Email: [email protected]

Abstract—Emerging Non-Volatile Memories have byte- work focused on new system designs using NVM [6]–[11] and addressability and low latency, close to the latency of main has already shown promising performance improvements. memory, together with the non-volatility of storage devices. Migration is a crucial technique for load balancing. Migra- Similarly, recently emerging interconnect fabrics, such as Gen- Z, provide high bandwidth, together with exceptionally low tion allows system administrators to remove some applications latency. These concurrently emerging technologies are making from stressed physical nodes in order to redistribute load, and possible new system architectures in the data centers including therefore to increase overall system performance. In addition, systems with Fabric-Attached Memories (FAMs). FAMs can serve migration can also provide power saving [12], and online to create scalable, high-bandwidth, distributed, shared, byte- maintenance. Traditional approaches to migration are non-live; addressable, and non-volatile memory pools at a rack scale, opening up new usage models and opportunities. that is, they require the application to be taken off-line while Based on these attractive properties, in this paper we pro- the migration occurs. Non-live migration can be divided into pose FAM-aware, checkpoint-based, post-copy live migration three steps: 1) the program is checkpointed at source, 2) the mechanism to improve the performance of migration. We have checkpointed data are copied from source to target, and 3) the implemented our prototype with a Linux open source checkpoint program is restarted at target. The main drawback of non-live tool, CRIU (Checkpoint/Restore In Userspace). According to our evaluation results, compared to the existing solution, our FAM- migration is that its application downtime (off-line time) is aware post-copy can improve at least 15% the total migration too long. To reduce this downtime, live migrations [13] are time, at least 33% the busy time, and can let the migrated proposed to migrate most of (or all) pages before (pre-copy) application perform at least 12% better during migration. or after (post-copy) “real” migration, and therefore to reduce the number of migrated pages during the downtime. However, I.INTRODUCTION the side effect of live migrations is that they require longer, The emerging Non-Volatile Memories (NVM), such as compared to non-live migration, busy time of source node. phase-change memory (PCM) [1], NVDIMM [2], and 3D Here, we define the busy time as the duration from the be- XPoint [3], have byte-addressability and low latency, close ginning of the migration to the time that migrated applications 1 to that of main memory, together with the non-volatility can be killed at source node. As far as we know, all prior of storage devices. Similarly, recently emerging interconnect works of post-copy focused on total migration time. However, fabrics, such as Gen-Z [4], provide high bandwidth, together we would like to make another point here that the busy time with exceptionally low latency. These concurrently emerging might be more important because it has direct impacts on technologies are making possible new system architectures in the system performance; it decides when computing resources, the data centers including systems with Fabric-Attached Mem- such as CPU and memory, occupied by migrated applications ories (FAMs) [5]. 
FAMs can serve to create scalable, high- to be released and be reallocated to remaining applications bandwidth, distributed, shared, and byte-addressable NVM in source node. For example, when applications are suffering pools at a rack scale, opening up new usage models and from swapping due to the lack of memory in a “hot” node, opportunities. These technologies have great potential to im- simply migrating an application with large memory footprint prove the system/application performance as well as to provide might be able to stop (after busy time) all remaining applica- scalability and reliability of applications/service. Many prior tions at source from swapping and therefore to improve the

[1] The author started this work as an intern at Hewlett Packard Labs. 1Note: after migration completes, the migrated applications would continue [2] The author contributed while he was at Hewlett Packard Labs. to execute at target node. overall system performance. MEM MEM In this work, we consider that migration can greatly benefit from FAM. Although the non-live migration techniques can be easily improved with FAM by the removal of the copy phase CPU CPU (step 2), the optimal post-copy page fault handler, however, requires significant redesign. This new handler should rely on Gen-Z Gen-Z the both non-volatility and shareability of FAM to provide the switch switch optimal (shortest) busy time as well as total migration time. In particular, we introduce FAM-aware, checkpoint-based, post- NVM NVM copy live migration, which checkpoints the entire application to FAM in the beginning and transfers all pages through Fig. 1: Fabric-Attached Memory. FAM, rather than through the network connection, which the traditional migration techniques use. The contributions of this paper are as follows: Sahni et al. [16] proposed a hybrid approach combining pre-copy and post-copy to take advantage of both types of • Propose a new, checkpoint-based, page fault handler for live migration. The post-copy could suffer less page faults post-copy migration using FAM to provide the best busy if the pages of the working set have already been sent by time and better total migration time. the pre-copy phase. Ye et al. [17] investigated the migration • Implement our FAM-aware post-copy migration on a efficiency under different resource reservation strategies, such Linux open source checkpointing tool, CRIU (Check- as CPU and memory reserved at target or source nodes. point/Restore in Userspace). CQNCR [18] considered an optimal migration plan to migrate • Evaluate our enhancements of CRIU with synthetic and massive virtual machines in the data center by deciding a realistic (YCSB+REDIS) workloads and show significant migration order to have less migration time and system impact. performance improvement. These works, however, largely employ traditional network- The remainder of this paper is organized as follows. Sec- ing as the only transfer media, thus they incur high access tion II describes the background and related work. Section III latencies. Besides, they only emphasize the total migration presents the motivation and design overview of our FAM- time, not busy time. aware post-copy live migration. Section IV explains the imple- mentation of our work by modifying and enhancing existing B. Fabric-Attached Memory CRIU in more detail. Section V presents our results of evalua- Fabric-Attached Memory (FAM) is a system architecture tion of post-copy live migration using some macro benchmarks component in which high performance networking fabrics are and realistic workloads. Finally, section VI concludes. used to interconnect NVM between multiple nodes in a rack to create a global, shared NVM pool. In our system model, II.BACKGROUND we consider a cluster consisting of many nodes, each of which A. Migration contains both DRAM and NVM. DRAM can only be accessed Traditionally, in non-live migration of applications, at first locally by local memory controller and serves as fast, local the migration program needs to stop the applications. It then memory in each node. 
NVM and the processor (in each node) checkpoints their state (mostly as files) into local storage are connected to a switch fabric and these switches are inter- devices, copies these checkpointed data to the remote storage connected to each other. Here we focus on Gen-Z [4] as one devices (at the target machine), and finally resumes them back such fabric because it supports hundreds gigabyte per second to the checkpointed state at target. The downtime is mostly bandwidth and memory semantics; however, any other fabric proportional to the amount of migrated memory. of similar latency and bandwidth could be used. The CPU can Unlike non-live migration, live migration means that ap- access NVM at other nodes with memory-semantics through plication is (or appears to be) responsive during the entire the fabric interface and libraries. Therefore, all interconnected migration process. Post-copy [14], [15] transfers the processor NVM can be treated as the slow, global memory pool in a state, register, etc., to target and resumes immediately at rack scale [5]. This byte-addressable, shared, high-bandwidth, the target host. When resumed application accesses some global NVM pool is so-called Fabric-Attached Memory (FAM) pages which have not yet been transferred, page faults are in this work. Fig. 1 shows the overall memory, NVM, CPU, triggered, and those pages are retrieved from source node and Gen-Z interconnections. through network. Although post-copy has the almost minimum downtime, C. Checkpoint/Restore in Userspace (CRIU) it would suffer from the network page fault overhead, and The Checkpoint/Restore in Userspace (CRIU) [19] is a therefore degrade the migrated application performance dur- Linux open source checkpoint tool which saves the current ing migration. Also, post-copy usually requires the longer state of a running application into the local storage devices migration time (which depends on page access pattern of and restarts the application whenever necessary. To checkpoint, application), compared with non-live one. CRIU uses ptrace system call to inject a piece of parasite 1 2 from source to target. The timer is kept reset whenever a page Source 4 Page Lazy Target server page fault event happens. Before the timer is expired, the lazy-page mode daemon daemon only handles the page fault event, and it would start app app to actively push all remaining pages only after expiration.

III. MOTIVATION AND SYSTEM DESIGN OVERVIEW

In this section, we describe the motivation and design overview of our FAM-aware post-copy live migration.

Fig. 2: Existing CRIU post-copy using FAM. (1) CRIU A. Fabric-Attached Memory-Aware Post-Copy Live Migration checkpoints all files except for page image file to FAM, and We propose an optimized FAM-aware post-copy migration, transforms itself as a page-server. (2) At target, CRIU creates which exploits the properties of FAM to achieve low appli- a lazy-page daemon. (3) After (1) is completed, CRIU restores cation downtime, low total migration time, low application application immediately at target. (4) If restored application at degradation, low resume time, and especially low busy time. target accesses a page and causes a page fault, it would notify To simplify the understanding of readers, we first briefly the lazy-page daemon, which then requests to and obtains from restate some key metrics of live migration proposed by Hines page-server at source that faulting page. et al. [15] and we further introduce the concept of busy time. • Downtime: The time that the application is stopped and cannot respond while the state of the processors code into the checkpointed application. Through the injected and some pages are transferred. Post-copy, depending on parasite, CRIU daemonizes the checkpointed application and implementation, might transfer a few (or no) pages. Non- lets it begin to accept commands sent by CRIU. Hereafter, live migration transfers all pages here, so its downtime is CRIU starts the checkpoint process. the longest. The most time-consuming part of checkpoint process is to • Resume Time: The time from the application starts dump pages of application. The page-dumping request asks the executing to the end the entire migration process. This application to execute vmsplice system call, which maps time is mainly required for post-copy to handle the page the pages of VMAs of the process into pipes (kernel buffers). faults happening at the target node, and is near negligible Finally, after all pages have already been mapped to the pipes, to the non-live migration. CRIU can access them directly (without the help of parasite) • Busy Time: The time from the beginning of the migration through splice system call, which copies dumped pages to the time that migrated applications can be killed and from pipes to a page image file in the storage devices. their resources (especially CPU and memory) can be CRIU also supports migration features, including non-live, released at source node. This may be the most important pre-copy, and post-copy live migrations. For post-copy, the metric for migration since system administrators can only page fault handling is the main challenge. Like on-demand alleviate the loading of “hot” nodes after this time. paging used in virtual memory systems, CRIU’s post-copy • Migration Time: The total time of downtime and resume employs the userfaultfd system call to allow paging in the time. Usually the migration time (of live migration) user space. Fig. 2 shows the sequential steps of the existing equals to the busy time; however, this is not the case CRIU post-copy implementation. To achieve the post-copy, if we employ FAM for migration. We will discuss this like checkpoint, at first CRIU maps the pages of VMAs of later. applications, at the source node, into pipes and then places • Application Degradation: The extent that application is itself into a page-server mode. Then a lazy-page daemon at the slowed down during the operations of the migration. 
The target node is created to handle the page fault and other events application degradation of post-copy, because it handles requested during the following restore operations. Finally, the the page fault at target and needs to get the faulting pages migrated application is resumed at the target node. When the from source, might be the most severe. restored application accesses a page which still remains in the source, the application is halted temporarily, and sends B. Motivation and Design the page fault request to the lazy-page daemon, which in Intuitively, the performance of non-live migration, espe- turn communicates with page-server at the source node to cially migration time, could benefit from adopting the FAM obtain that faulting page from pipes through network transfer. simply because the copy phase can be eliminated through Although all other checkpointed state can be sent through direct FAM accesses. Therefore, the entire migration process is FAM. The page transfer still needs to rely on socket interface now simplified as (1) checkpoint application to FAM at source if page fault handling is not redesigned. and (2) restart application at target. We note that when using pure on-demand paging, migration Post-copy migration is much more complicated than non- process may never complete, since some pages may not be ever live one because the most critical part of post-copy is page accessed. CRIU thus employs a timer to trigger the active fault handling (or network fault as termed by Hines et al. [15]) pushing; that is, sequentially dumping all remaining pages in the target. The behavior and implementation of the page fault handler will greatly impact the resume time (application Source Target performance degradation) and busy time. Simply removing the copy phase, if applicable, is obviously not good enough. app app Therefore, our work focuses on optimizing FAM-aware post- 1. Checkpoint copy migration based on the following three guidelines. 2. Restore
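The vmsplice/splice dump path described above can be illustrated with a short standalone program. This is not CRIU code: it gifts one buffer of the calling process's own memory to a pipe with vmsplice and then splices it into an image file, which is the same pair of system calls CRIU applies (through the injected parasite) to the VMAs of the checkpointed task. The file path, buffer size, and lack of a loop for regions larger than the default pipe capacity are simplifications of our own.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    /* Stand-in for one dumped memory region; kept within the default
     * 64 KB pipe capacity so a single vmsplice/splice pair suffices. */
    size_t len = 64 * 1024;
    char *region = malloc(len);
    if (!region) return 1;
    memset(region, 0x5a, len);

    int pipefd[2];
    if (pipe(pipefd) < 0) { perror("pipe"); return 1; }

    /* Page image file; in the FAM scenario this path would sit on
     * fabric-attached memory (for example, a DAX-style mount). */
    int img = open("pages.img", O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (img < 0) { perror("open"); return 1; }

    struct iovec iov = { .iov_base = region, .iov_len = len };

    /* Map the pages into the pipe (kernel buffers) ... */
    if (vmsplice(pipefd[1], &iov, 1, 0) < 0) { perror("vmsplice"); return 1; }

    /* ... then move them from the pipe into the image file. */
    ssize_t moved = splice(pipefd[0], NULL, img, NULL, len, SPLICE_F_MOVE);
    if (moved < 0) { perror("splice"); return 1; }

    printf("dumped %zd bytes\n", moved);
    close(img);
    return 0;
}
```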

Checkpoint-Based Migration: The most apparent draw- page File back of non-live migration is its very long downtime (which FAM equals to the total migration time). However, the busy time of non-live migration is the best (compared to the live migrations) Fig. 3: Ideal migration method using FAM. and is also much shorter than its total migration time because non-live migration first checkpoints everything to storage leveraging the information of the dumped pages. The source devices, and therefore, since a complete snapshot is saved in could notify target of the information of all dumped pages, persistent media, the checkpointed application can be killed at and therefore the following faulting pages (if they have been the source. dumped to FAM) can be accessed synchronously at target from Traditional live migrations have a long busy time (which FAM without the need of communication. equals to the total migration time) because they utilize memory Thus, our post-copy optimization is mainly based on non- to temporarily store pages and transfer pages by network. volatility and shared-ability of FAM, and can be divided into So, killing the application must wait until the completion of three parts. entire migration. Otherwise, if target crashes before migration completes, then the application will crash, too. • to achieve the shortest downtime, we checkpoint the Due to the low-latency and non-volatility of FAM, storing processor state, registers, etc., (excluding pages of VMA) migrated pages to FAM directly only slightly impacts per- of the victim application to FAM and resume victim formance (compared to DRAM), but it also saves the entire application at the target. migrated state to persistent media. Therefore, application can • to achieve low busy time, at source, after checkpointing be killed after being checkpointed and its busy time could be the necessary information, all the remaining pages con- very close to that of non-live migration. tinue to be dumped to FAM on the background. The Accessing FAM as Shared Memory: Most existing live victim application can be killed right after the page migration techniques [16]–[18], [20] still migrate their data dumping is finished. This provides near optimal busy content through communication network because they do not time, which is the checkpoint time of non-live migration. have a shared memory across multiple nodes. Therefore, their • to achieve low resume time and low application degra- approaches will suffer the network overhead and network dation, all faulting pages at target will be served/received bandwidth. On the other hand, with the help of FAM, serving from FAM directly and the need of asynchronous com- as a shared memory pool within the same rack, data could be munication is also tried to minimize. Therefore, with of simply migrated through memory semantics (load/store help of FAM, the networking overhead as well as the instructions), which bypass the significant networking protocol page fault latency can be reduced significantly. overhead. This could improve the critical page fault latency IV. IMPLEMENTATION and therefore resume time and application degradation. Retrieving Faulting Pages Synchronously: The page fault In this section, we explain our implementation of FAM- handler of the existing shared memory within a machine aware, checkpoint-based, post-copy live migration in detail. is controlled by the central operating system. 
The OS only The existing CRIU [19] migration tool is used as a baseline needs to setup the page table of each process, then faulting implementation, and augmented with our FAM-aware post- pages can be mapped to virtual space of processes correctly. copy technique in this work. In particular, our case study im- Our migration scheme, however, differs because a central plementation is based on modifying CRIU-dev branch version controlling OS across multiple hosts does not exist. The FAM 3.2. Although our implementation is based upon CRIU, the across machines can only be employed as a connecting media technique developed is universal and may be easily imple- between nodes. mented in other migration schemes. Besides, our page fault handler must contain two steps: first Fig. 3 shows an ideal migration between nodes with FAM. pages are written from source to FAM and then pages are read First, an empty file is created in FAM, and the migration from FAM to target. So, page fault handling must be executed program at target can mmap this whole file and directly read asynchronously through communication between the source dumped pages without having to wait until the whole file (writing to FAM) and target nodes (reading from FAM); that is dumped from the source. Ideally, (if the future can be is, the faulting address of pages must be sent to source first and predicted,) the source node would write a certain page to then target must wait for the response to read that page. This FAM each time before the page is needed at target. From communication impacts the page fault latency (even though all the perspective of migration, which means that not only the pages are already transferred by FAM) as well as migration restored application does not have to wait for the completion performance, and must be avoided as much as possible by of page-dumping (that is, live migration), but it also avoids page fault event to the lazy-page daemon, which in turn sends page fault request to page-server. If the faulting page has not been dumped to FAM, the page-server waits for background thread to finish the dumping of current I/O vector, stops the background thread (by a spin lock), and dumps a specific I/O vector containing the requested faulting page. After dumping that (2MB) I/O vector, the page-server acknowledges page fault request and resumes the background thread. Arrows with number 2.1 to 2.4 in Fig. 4(a) illustrate this process. (a) 1. Background thread dumps all lazy pages to FAM. 2.1. Alternately, if the faulting page has been dumped, the page- A page fault happens. A page fault event is sent to lazy- page daemon by kernel. 2.2. Lazy-page daemon requests page- server acknowledges immediately. server for that faulting page. 2.3. Page-server writes the entire The responses (arrow with number 2.4 in Fig. 4(a)) sent by vector, rather than a single faulting page, to FAM. 2.4. Page- the page-server not only indicate the “completion of dumping” server notifies the lazy-page daemon of the completion of of faulting pages, but they also contain some extra informa- dumping. 2.5. Lazy-page daemon directly reads faulting page tion for lazy-page daemon: the background thread’s dumping from FAM. progress (the largest virtual address of dumped pages) and the Lazy Target page virtual address space of this entire (2MB) dumped I/O vector. daemon The lazy-page daemon employs such information to construct app a lookup table. (Actually, we build an LRU linked list). 
If the following faulting pages whose addresses can be found at the lookup table or are smaller than the dumping progress, they can be read from FAM synchronously without requesting to Lazy page file page page-server. This can eliminate a lot of communications and FAM overheads between source and target. (b) The page-server and application at source are killed after Once the background thread dumps all the lazy pages, all lazy pages have been dumped. All remaining pages can be the page-server actively notifies the lazy-page daemon of directly/synchronously accessed from target. the completion of the checkpoint. Hereafter, the lazy-page Fig. 4: FAM-aware post-copy live migration. daemon can synchronously read all faulting pages from FAM without any communication with page-server. Both migrated application and page-server can be killed at the source now as Fig. 4(b) shows. much of the network transmission (required by the existing Our implementation has some advantages: (1) The page- CRIU and other previous implementations) through memory- server only needs to handle faulting pages until the checkpoint semantics. of all “lazy” pages is finished. After that, all pages can be We have implemented the concept shown in Fig. 3 in our synchronously read from FAM. A lot of network transmissions optimized post-copy migration in CRIU. Fig. 4 illustrates of pages and protocol overhead can be eliminated. (2) The the main difference between our post-copy design versus the information of dumped pages is utilized to further reduce the existing CRIU implementation (in Fig. 2). communication between target and source nodes. It signifi- Like the existing CRIU post-copy, we also employ a lazy- cantly minimizes the page fault latency and therefore improves page daemon and page-server mode in our implementation. performance. (3) After the background thread finishes dump- To migrate, first all required data are checkpointed as files to ing the lazy files, the application and page-server can be killed FAM except for the page image file, as the existing CRIU and their resources can be released; that is, the busy time does. After that, CRIU at source enters a page-server mode, would be much shorter than total migration time. and launches a background thread to dump (checkpoint) all remaining “lazy” pages to FAM as a single lazy-page file (per V. EVALUATION process) (arrow with number 1 in Fig. 4(a)). Without having In this section, we experimentally evaluate the performance to wait for this background thread to finish its dumping job, improvement of our optimized FAM-aware, checkpoint-based, system administrator can launch a lazy-page daemon and then post-copy live migration. Our platform contains 20GB DDR3- can begin to restore the migrated application at the target node. 1600, 12GB NVM (emulated with DRAM [21]), and Intel To improve the checkpoint performance by batching and to i7-4770 four-core 3.4 GHz processor with hyperthreading avoid too long critical latency of page fault, we partition the enabled. Linux 4.15.0 is used in our platform. VMAs of application as several I/O vectors with the maximum size of 2MB. Each I/O vector contains pages with contiguous A. Workloads and Experimental Setup virtual addresses. Therefore, the background thread writes the To examine the performance of our FAM-aware post-copy pages to FAM with at most 2MB of data at a time. 
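To make the page-fault path concrete, the fragment below sketches the userfaultfd pattern on which CRIU's lazy-page handling is built, combined with the idea of serving a missing page directly from an mmap'ed lazy-page file instead of over a socket. It is our illustrative reduction, not the authors' implementation: the file path, PAGE_SZ, the page_offset_for helper, and the single-region setup are all hypothetical, and the real lazy-page daemon additionally consults the page-server for pages that have not yet been dumped and works in 2 MB I/O vectors as described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SZ 4096UL

/* Resolve a faulting address to its offset in the dumped lazy-page file.
 * Hypothetical: a real handler would consult the checkpoint metadata. */
static off_t page_offset_for(unsigned long addr, unsigned long region_start)
{
    return (off_t)((addr & ~(PAGE_SZ - 1)) - region_start);
}

/* Serve faults for [region_start, region_start + region_len) from a
 * lazy-page file that lives on FAM (here just a path on a DAX-style mount). */
static void serve_from_fam(int uffd, unsigned long region_start,
                           unsigned long region_len, const char *fam_path)
{
    int fd = open(fam_path, O_RDONLY);
    char *pages = mmap(NULL, region_len, PROT_READ, MAP_SHARED, fd, 0);

    for (;;) {
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            break;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        /* Copy the faulting page out of FAM into the restored address space. */
        off_t off = page_offset_for((unsigned long)msg.arg.pagefault.address,
                                    region_start);
        struct uffdio_copy copy = {
            .dst  = msg.arg.pagefault.address & ~(PAGE_SZ - 1),
            .src  = (unsigned long)(pages + off),
            .len  = PAGE_SZ,
            .mode = 0,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
    munmap(pages, region_len);
    close(fd);
}

int main(void)
{
    /* Create the userfaultfd and register a demo region with it. */
    int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
    struct uffdio_api api = { .api = UFFD_API, .features = 0 };
    ioctl(uffd, UFFDIO_API, &api);

    unsigned long len = 16 * PAGE_SZ;
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)region, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    /* A real restorer would run serve_from_fam() in a separate daemon thread
     * while the restored application touches `region` and faults. */
    serve_from_fam(uffd, (unsigned long)region, len, "/mnt/fam/lazy-pages.img");
    return 0;
}
```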
When migration scheme, we leverage the NAS Parallel benchmarks the restored application encounters a page fault, it sends the (NPB) 3.3.1 [22] Class C, PARSEC 3.0 [23] native input, and (a) One thread per benchmark. (Higher is better.) Fig. 5: The delay model of our evaluation.

TABLE I: Parameters for extra delay.

Media          Read lat. (us)   Write lat. (us)   BW (GB/s)
PCM            1                1                 2
10Gb Ethernet  0                0                 1.25

SPLASH-2X [24] native input benchmark suites. All work- loads are migrated after executing twenty seconds, with the (b) Four threads per benchmark. (Higher is better.) exception of “NPB IS” which migrated after five seconds for one thread and two seconds for four threads, and “SPLASH- Fig. 6: The busy time and total migration time performance 2X radix” is migrated after five seconds for four threads. of FAM-aware and existing CRIU post-copy migration using We further examine the migration of REDIS [25] to inves- FAM. Expiration time is 100ms. tigate application performance degradation during live migra- tion. REDIS is an in-memory data structure store and can be to correctly emulate the “slow” NVM, we use the same method employed as database or in-memory cache. Through loading proposed by Volos et al. [6] to add extra delay (via busy YCSB (Yahoo! Cloud Serving Benchmark) [26] records into waiting) when accessing slow FAM (or Ethernet). Fig. 5 REDIS, migrating REDIS to the target, and immediately illustrates our delay model. The delay is added both when accessing REDIS with YCSB at the target before the migration writing to and reading from FAM. The latency resulted from is finished, we can observe the impact of page fault overhead Gen-Z fabric is negligible compared to that of PCM [28]. on REDIS performance with different YCSB workloads. Therefore, we only consider the access latency and bandwidth To evaluate FAM-aware migration, two limitations must be of NVM (PCM or NVDIMM). The delay calculation formula overcome. First, CRIU requires migrated applications to use is as follows: the original pid they were checkpointed with. This means ap- plication cannot be “lively” migrated within the same physical host. Next, we do not have real FAM hardware, so we do not DelayR/W = LatencyR/W +(data/bandwidth) have a global, shared NVM across physical nodes. These two limitations seem to contradict with each other. Fortunately, The parameters of NVM (and 10 Gigabit Ethernet used with the help of a container virtualization technique, FAM later) are shown at Table I. The bandwidth of NVM (PCM) is can be emulated within a host machine by means of container assumed to be 2GB/s [29]. Therefore, to access a single 4KB and NVM: Two docker containers [27] are treated as target page from FAM, we have to wait around 3us and then can and source nodes; the emulated NVM, which is bind-mounted access FAM. into containers so applications in containers can access NVM B. Evaluation Results concurrently, can be emulated as FAM in our platform. Fig. 6 (a) and (b) show the normalized total migration time To compare the performance of our FAM-aware with ex- and busy time performance of our FAM-aware vs. existing2 isting CRIU post-copy, we assume all process states, except CRIU post-copy with one and four threads. All the mea- for memory pages, are migrated through FAM, so post- surements are normalized to the total migration time of the copy migration contains only 2 steps: checkpoint from source existing post-copy (socket) of the same benchmark. FAM and container and restart at target container. However, the methods socket (in Fig. 6) stand for the mechanism of page transfer by of page transfer are different: one is by FAM, the other is through socket interface. 2Note: for existing CRIU post-copy, the busy time equals to the migration Furthermore, since NVM (FAM) is emulated from DRAM, time, so only total migration times are shown. TABLE II: Average improvements of FAM-aware post-copy vs. existing CRIU post-copy.
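The delay model is simple enough to state directly in code. The sketch below plugs in the PCM parameters from Table I above (1 us read/write latency, 2 GB/s bandwidth); actually busy-waiting for the computed duration, as in the emulation approach of Volos et al. [6], is left as a comment, and the constant names are ours.

```c
#include <stddef.h>

/* Emulated-FAM parameters from Table I (PCM). */
#define PCM_LAT_NS  1000.0   /* 1 us read or write latency */
#define PCM_BW_BPS  2.0e9    /* 2 GB/s bandwidth           */

/* Delay_{R/W} = Latency_{R/W} + data / bandwidth, returned in nanoseconds.
 * For a 4 KB page: 1000 ns + (4096 / 2e9) s ~= 3048 ns, i.e. about 3 us,
 * matching the figure quoted in the text. The caller would busy-wait for
 * this duration around each emulated FAM access. */
static double fam_delay_ns(size_t bytes)
{
    return PCM_LAT_NS + ((double)bytes / PCM_BW_BPS) * 1e9;
}
```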

# of threads   Mig. time imprvmt   Busy time imprvmt
1              2.00X               26.74X
4              1.63X               12.72X
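The numbers in Table II are geometric means of the per-benchmark improvement ratios; a minimal helper for this kind of aggregation (an illustration only, not the authors' analysis script) is:

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-benchmark improvement ratios (e.g., migration-time speedups).
double geometric_mean(const std::vector<double>& ratios) {
    if (ratios.empty()) return 1.0;           // guard: no benchmarks, no improvement
    double log_sum = 0.0;
    for (double r : ratios) log_sum += std::log(r);
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```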

FAM (2GB/s) means the extra delay is added based on the 2GB/s bandwidth of PCM. The expiration time of the active-pushing timer is set to 100ms. Remember that this timer is reset whenever a page fault happens. We think 100ms is a reasonable duration for workloads to access all working-set pages before the timer expires. So, the total migration time here should be a good indicator of the different page fault overheads of these two mechanisms.

Tab. II summarizes the performance improvements of our FAM-aware post-copy. Each number is the geometric mean over all benchmarks. From those results, we can draw some useful observations.

First, the busy time (i.e., checkpoint time) of post-copy is impacted mostly by the dumped memory size rather than by workload behavior. Thus, busy times are relatively small compared to total migration times.

Second, the improvements in total migration time (2.00X and 1.63X) come from faster demand-paging handling, showing that our FAM-aware migration incurs less page fault overhead.

Third, the improvements in both total migration time and busy time shrink as the number of threads increases. For migration time, that is mainly because of the available bandwidth. We limit the FAM bandwidth to 2GB/s (PCM). On the other hand, although the existing CRIU acquires pages from the source through a socket, we do not apply any bandwidth limitation to it. Docker containers in a single host use the Linux bridge component and bypass the NIC (network interface card) to communicate with each other. To estimate the bandwidth between docker containers, we use the iperf3 [30] tool and measure an average throughput of 7.0 GB/s (compared to 2GB/s in the FAM-aware case). So, we conclude that the reduction in the total-migration-time improvement with more threads comes from the bandwidth limitation of FAM.

For busy time, the busy time of FAM-aware post-copy is the checkpoint time, which is almost the same as long as the memory footprint does not change, regardless of the number of executing threads. As for the existing post-copy, its busy time is the total migration time, which improves as the number of threads increases because more threads access pages more quickly. Therefore, the decreasing busy-time improvement results from the total-migration-time improvement of the existing post-copy as more threads are executed.

Fig. 7 shows a similar comparison of FAM-aware and existing CRIU post-copy, except that the expiration time is set to 0ms. Only the workloads whose migrated memory sizes are larger than 1GB are selected. All workloads are executed with one thread and are migrated after executing for twenty seconds. All the measurements are normalized to the total migration time of socket (10Gb/s) of the same benchmarks. Socket (10Gb/s) assumes that the underlying network is 10 Gigabit Ethernet. FAM (without bandwidth limitation) means no delay is added when accessing NVM, which can be treated as the NVDIMM case and whose performance is also the best among all cases. In our platform, the average throughput of accessing DRAM via file-system writes is around 6.6 GB/s (a little lower than the 7 GB/s socket throughput between containers).

The 0ms expiration time means that the lazy-page daemon actively pushes pages from the source from the beginning. Meanwhile, if a page fault happens, the daemon still tries to handle the page fault event on a best-effort basis. From Fig. 7, we conclude that (A) the bandwidth provided by the transmission media dominates the migration performance; and (B) if the bandwidths are close to each other, FAM-aware is better than the existing post-copy due to its lighter overhead. Tab. III summarizes the improvements.

Fig. 7: The comparison of busy time and total migration time performance with the expiration time of 0ms. (Higher is better.)

TABLE III: Average improvements of FAM-aware post-copy for the realistic case (FAM (2GB/s) vs. socket (10Gb/s)) and the ideal case (FAM vs. socket).

                                     Mig. time   Busy time
FAM (2GB/s) vs. socket (10Gb/s)      15.44%      33.68%
FAM vs. socket                       15.16%      47.08%

Finally, REDIS and YCSB are utilized to investigate the migration performance, especially performance degradation. 500K records are loaded by YCSB into REDIS at the source node; each record has 100 fields, and the size of each field is fixed to 10B. This configuration results in 939MB of pages to be migrated. After the downtime, YCSB accesses REDIS at the target immediately with different operation counts (from 10K to 50K). The YCSB workloads are all configured with a uniform distribution, readallfields and writeallfields are false, and the R/W ratio is 70/30. YCSB employs one and four threads to access REDIS. Fig. 8 shows the results: (a) to (c) use a 150ms expiration time and (d) to (f) use 3ms.

Fig. 8 (a) shows the measured total migration times normalized to the busy time of FAM (2GB/s).

(a) Normalized total migration time. (Lower is better.) (b) Normalized throughput with one thread. (Higher is better.) (c) Normalized throughput with four threads. (Higher is better.)
(d) Normalized total migration time and busy time performance. (Higher is better.) (e) Normalized throughput with one thread. (Higher is better.) (f) Normalized throughput with four threads. (Higher is better.)
Fig. 8: Migration performance of REDIS accessed by YCSB. (a) to (c): The expiration time is 150ms; (d) to (f): The expiration time is 3ms. (Series: FAM (2GB/s) and Socket with one and four threads; x-axis: YCSB operation count, 10K to 50K; y-axes: normalized time and normalized throughput.)

The busy times of one and four threads are almost the same since the same amount of data (939MB) is checkpointed. FAM-aware post-copy improves the average total migration time by 23.92% and 22.48% with one and four threads, respectively. Fig. 8 (a) also indicates that the total migration time is not related to the number of YCSB threads. The reason is that REDIS is single-threaded, so more requests (threads) from YCSB cannot make the pages of REDIS be accessed faster. Fig. 8 (b) and (c) show the REDIS performance during migration. FAM-aware post-copy lets REDIS perform 22.3% and 23.4% better with one and four threads, respectively. Fig. 8 (a), (b), and (c) show that the total migration time and application degradation are correlated, because both are impacted mostly by the latency of page fault handling if the expiration time is large enough.

Fig. 8 (d) shows the normalized total migration time and busy time performance with a 3ms expiration time. Because those times are almost the same at different YCSB operation counts, we only take the average. This result also looks like Fig. 7. Tab. IV summarizes the results of Fig. 8 (e) and (f).

TABLE IV: Average REDIS throughput improvements of FAM-aware post-copy for the real case (FAM (2GB/s) vs. socket (10Gb/s)) and the ideal case (FAM vs. socket).

                                     1 thread   4 threads
FAM (2GB/s) vs. socket (10Gb/s)      19.8%      25.69%
FAM vs. socket                       12.7%      21.79%

Again, Fig. 8 (d), (e), and (f) also show that the total migration time and performance degradation are both influenced by the bandwidth of the transmission media if the expiration time is too small and active pushing is triggered soon enough. The throughput of REDIS increases as the number of operations increases; this is because the chance of page faults decreases as more pages are migrated.

VI. CONCLUSION

We presented FAM-aware, checkpoint-based, post-copy migration. Through FAM, a global, shared NVM pool at rack scale, we can map the entire migrated memory space onto FAM. Thus, the data migration can be simplified into memory semantics to achieve a much lower page-fault latency path. We have implemented our prototype in CRIU and shown that our approach has lower busy time (by at least 33%) and lower total migration time (by at least 15%), and that the migrated application can perform at least 12% better (i.e., lower application degradation) during the migration process.

ACKNOWLEDGMENTS

We first would like to thank Dr. Jaemin Jung, who was a postdoc at Texas A&M University, for his helpful comments and suggestions about this paper. We also thank the anonymous reviewers for their valuable and useful comments and feedback to improve the content and quality of this paper. Finally, we want to thank the National Science Foundation, which supports this work through grants I/UCRC-1439722 and FoMR-1823403, and Hewlett Packard Enterprise for its generous support.

REFERENCES
[1] B. C. Lee, P. Zhou, J. Yang, Y. Zhang, B. Zhao, E. Ipek, O. Mutlu, and D. Burger, "Phase-change technology and the future of main memory," IEEE Micro, vol. 30, pp. 131–141, Mar. 2010.
[2] D. Narayanan and O. Hodson, "Whole-system persistence," in ASPLOS '12. London, England, UK: ACM, Mar. 2012, pp. 401–410.
[3] "Intel Optane technology," https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html.
[4] Gen-Z specifications. https://genzconsortium.org/specifications/.
[5] K. Keeton, "Memory-driven computing," in FAST '17. Santa Clara, CA: USENIX, 2017.
[6] H. Volos, A. J. Tack, and M. M. Swift, "Mnemosyne: Lightweight persistent memory," in ASPLOS '11. Newport Beach, CA: ACM, Mar. 2011, pp. 91–104.
[7] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson, "NV-Heaps: Making persistent objects fast and safe with next-generation, non-volatile memories," in ASPLOS '11. Newport Beach, CA: ACM, Mar. 2011, pp. 105–118.
[8] J. Yang, Q. Wei, C. Chen, C. Wang, K. L. Yong, and B. He, "NV-Tree: Reducing consistency cost for NVM-based single level systems," in FAST '15. Santa Clara, CA: USENIX, Feb. 2015, pp. 167–181.
[9] V. Fedorov, J. Kim, M. Qin, P. V. Gratz, and A. L. N. Reddy, "Speculative paging for future NVM storage," in MEMSYS '17. Alexandria, Virginia: ACM, Oct. 2017, pp. 399–410.
[10] L. Liang, R. Chen, H. Chen, Y. Xia, K. Park, B. Zang, and H. Guan, "A case for virtualizing persistent memory," in SoCC '16. Santa Clara, CA: ACM, Oct. 2016, pp. 126–140.
[11] Y. Zhang, J. Yang, A. Memaripour, and S. Swanson, "Mojim: A reliable and highly-available non-volatile memory system," in ASPLOS '15. Istanbul, Turkey: ACM, Mar. 2015, pp. 3–18.
[12] K. Ye, D. Huang, X. Jiang, H. Chen, and S. Wu, "Virtual machine based energy-efficient data center architecture for cloud computing: A performance perspective," in GREENCOM-CPSCOM '10. IEEE, Dec. 2010, pp. 171–178.
[13] K. Chanchio and X.-H. Sun, "Communication state transfer for the mobility of concurrent heterogeneous computing," IEEE Transactions on Computers, vol. 53, pp. 1260–1273, 2004.
[14] E. R. Zayas, "Attacking the process migration bottleneck," in SOSP '87. Austin, TX: ACM, Nov. 1987, pp. 13–24.
[15] M. R. Hines, U. Deshpande, and K. Gopalan, "Post-copy live migration of virtual machines," ACM SIGOPS Operating Systems Review, vol. 43, pp. 14–26, Jul. 2009.
[16] S. Sahni and V. Varma, "A hybrid approach to live migration of virtual machines," in CCEM '12. Bangalore, India: IEEE, Oct. 2012.
[17] K. Ye, X. Jiang, D. Huang, J. Chen, and B. Wang, "Live migration of multiple virtual machines with resource reservation in cloud computing environments," in CLOUD '11. Washington, DC: IEEE, Jul. 2011.
[18] M. F. Bari, M. F. Zhani, Q. Zhang, R. Ahmed, and R. Boutaba, "CQNCR: Optimal VM migration planning in cloud data centers," in IFIP '14. Trondheim, Norway: IEEE, Jun. 2014.
[19] Checkpoint/Restore In Userspace (CRIU). https://www.criu.org/.
[20] C. Jo, E. Gustafsson, J. Son, and B. Egger, "Efficient live migration of virtual machines using shared storage," in VEE '13. Houston, TX: ACM, Mar. 2013, pp. 41–50.
[21] Emulate NVDIMM in Linux. https://nvdimm.wiki.kernel.org/.
[22] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga, "The NAS parallel benchmarks: summary and preliminary results," in SC '91. ACM, 1991, pp. 158–165.
[23] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in PACT '08. Toronto, Ontario, Canada: ACM, Oct. 2008, pp. 72–81.
[24] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in ISCA '95. S. Margherita Ligure, Italy: ACM, Jun. 1995, pp. 24–36.
[25] Redis: An in-memory data structure store. http://redis.io/.
[26] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in SoCC '10. Indianapolis, Indiana: ACM, Jun. 2010, pp. 143–154.
[27] Docker container. https://www.docker.com/.
[28] "Gen-Z overview," https://genzconsortium.org/wp-content/uploads/2018/05/Gen-Z-Overview.pdf.
[29] S. Kannan, A. Gavrilovska, K. Schwan, and D. Milojicic, "Optimizing checkpoints using NVM as virtual memory," in IPDPS '13. Boston, MA: IEEE, May 2013.
[30] iperf: the ultimate speed test tool for TCP, UDP and SCTP. https://iperf.fr/iperf-download.php.

Optimizing Memory Layout of Hyperplane Ordering for Vector Supercomputer SX-Aurora TSUBASA

Osamu Watanabe†∗, Yuta Hougi†, Kazuhiko Komatsu‡, Masayuki Sato†, Akihiro Musa‡∗, Hiroaki Kobayashi† †Graduate School of Information Sciences, Tohoku University, Sendai, Japan. ‡Cyberscience Center, Tohoku University, Sendai, Japan. ∗ NEC Corporation, Tokyo, Japan. Email: {osamu.watanabe.t5, yuta.hougi.t8}@dc.tohoku.ac.jp, {komatsu, masa, musa, koba}@tohoku.ac.jp

Abstract—This paper describes the performance optimization of hyperplane ordering methods applied to the high-cost routine of the turbine simulation code called "Numerical Turbine" for the newest vector supercomputer. The Numerical Turbine code is a computational fluid dynamics code developed at Tohoku University, which can execute large-scale parallel calculations of the entire thermal flow through the multistage cascades of gas and steam turbines. The Numerical Turbine code is a memory-intensive application that requires a high memory bandwidth to achieve a high sustained performance. For this reason, it is implemented on a vector supercomputer equipped with a high-performance memory subsystem. The main performance bottleneck of the Numerical Turbine code is the time-integration routine. To vectorize the lower-upper symmetric Gauss-Seidel method used in this time integration routine, a hyperplane ordering method is used. We clarify the problems of the current hyperplane ordering methods for the newest vector supercomputer NEC SX-Aurora TSUBASA and propose an optimized hyperplane ordering method that changes the data layout in memory to resolve this bottleneck. Through the performance evaluation, it is clarified that the proposed hyperplane ordering can achieve a further performance improvement of up to 2.77×, and 1.27× on average.

Index Terms—data structure, hyperplane ordering method, turbine simulation code, vector supercomputer, performance optimization

I. INTRODUCTION

Thanks to advances in large-scale simulations, various phenomena in the real world can be reproduced more realistically by using supercomputer systems. However, there are still many issues to be addressed in the real world, and the impact of problems with social infrastructures on our society is immeasurable. There is no doubt that preventing these problems is beneficial for promoting a safe society. For example, gas and steam turbines are used for thermal power generation. However, a failure of these turbines will have serious social and economic impact.

To prevent such untoward effects, it is necessary to predict such failures in advance. However, this is very difficult for actual turbines. Therefore, it is necessary to conduct a numerical simulation of a turbine using a supercomputer to simulate the various phenomena occurring in the turbine. Regarding these efforts, Industry 4.0 has been proposed internationally. For example, the manufacturing industry has rapidly moved to digitize everything from design to manufacturing and operation, and Industry 4.0 has been put to practical use at Siemens, GE, and so on.

The internal structure of a turbine is composed of multi-stage stator and rotor blades, with the total number of blades exceeding 1,000. Experimentally designing these turbines in a short period is difficult in terms of cost and time. To design reliable and modern gas and steam turbines, various physical phenomena generated by the thermal flow must be simultaneously addressed. To design an efficient and highly reliable turbine, it is necessary to develop a multiphysics computational fluid dynamics (CFD) technique for numerically analyzing the mathematical model that describes these multiphysics as governing equations of thermal fluids. However, as multiphysics mutual interference in a turbine is caused by complex interactions with the total thermal flow field, it is necessary to analyze the total thermal flow field in the turbine.

To solve these problems, Tohoku University has developed a multiphysics CFD code called "Numerical Turbine" for large-scale simulations of unsteady wet-steam flow with non-equilibrium condensation inside a turbine [1]. The code can be used to analyze the unsteady wet-steam flow in the final multistage cascade of an actual steam turbine. The code also applies numerical solutions for analyzing the complex thermal flow generated inside the final stage of a steam turbine. The code incorporating these mathematical models can numerically elucidate the multiphysics interaction of the thermal flow inside a turbine, and we are able to determine in advance a catastrophic situation leading to turbine instability and blade destruction. Thus, multiphysics CFD is very useful for next-generation turbine design.

To accurately reproduce the various phenomena occurring inside a turbine, it is necessary to reproduce the conditions of galls and cracks on each blade in the turbine. Therefore, it is indispensable to simulate the whole turbine. The number of grid points to be calculated exceeds 700 million, which makes the simulation a huge amount of calculation. To practically apply the analysis results, it is also necessary to complete the calculation within the required time.

To satisfy these computational requirements, processors of modern supercomputer systems have mainly adopted vector processing mechanisms, such as single instruction stream, multiple data stream (SIMD) processing, to greatly improve computational performance. To use the high computing power of a supercomputer with such a vector processor, the simulation code must be vectorized. Although the lower-upper symmetric Gauss-Seidel (LU-SGS) relaxation method [2] is used in the time integration routine, which is one of the high-cost parts of the Numerical Turbine code, it cannot be automatically vectorized. Therefore, a hyperplane ordering method [3] is adopted for vectorization. Thus, the Numerical Turbine code can achieve high sustained performance by using the high vector computing capability of NEC's vector supercomputer SX series [4].

Although the computational performance of supercomputers continues to improve, the improvement in memory performance has not kept up. The gap between the computational speed of the processors and the data-transfer capability of memories has widened. To reduce the gap, the processors of modern vector supercomputers are equipped with an on-chip vector-cache mechanism [5] [6]. Therefore, to achieve high computational performance on such vector supercomputers, simulation codes need to effectively take into account the effect of the vector cache. In this paper, we propose an optimized hyperplane ordering method that changes the data layout in memory to further improve sustained performance.

The outline of this paper is as follows. In Section II, we present related work regarding optimization of hyperplane ordering methods. In Section III, we discuss the two vector supercomputers used in our study. In Section IV, we give an overview of the Numerical Turbine code and describe the conventional 2D and 3D hyperplane ordering methods and the results of the preliminary evaluation. In Section V, we describe our proposed optimized 3D hyperplane ordering method that changes the data layout to improve memory access. In Section VI, we discuss the performance results. We conclude the paper along with future work in Section VII.

II. RELATED WORK

Regarding the hyperplane ordering methods implemented in the Numerical Turbine code, we previously reported a method suitable for SX-ACE [7]. In that report, we showed that the 3D hyperplane ordering method could not effectively use the vector cache implemented in SX-ACE because this method uses indirect memory accesses with long strides. Hence, we proposed a version of the 2D hyperplane ordering method as an alternative to the 3D hyperplane ordering method in that study. Although this 2D hyperplane ordering method has a shorter vector-loop length than the 3D hyperplane ordering method, the spatial locality of its data is higher than that of the 3D hyperplane ordering method. Thus, the vector cache can be used effectively even though the vector length is not sufficient for SX-ACE. The 2D hyperplane ordering method also does not need to use indirect memory access. As a result, its performance is 4.6 times higher than that of the 3D hyperplane ordering method. In our previous study, we argued that memory optimization is more important than calculation optimization for modern supercomputers.

Regarding the investigation of the optimal memory-access pattern with hyperplane ordering methods, Burger et al. showed that the memory-access performance improved by changing the data layout of the 3D array used in the 3D hyperplane ordering method [8]. This is a technique to improve the efficiency of memory access by continuously storing the data elements and creating a 2D array on each hyperplane. In that technique, each row of the 2D array is vectorized. Therefore, the vectorized loop length is still smaller than the loop length of the 3D hyperplane ordering method.

III. MODERN VECTOR SUPERCOMPUTERS

The number of cores in modern vector supercomputer processors has been increasing, and the theoretical computational performance of the processors is also improving. Although the memory bandwidth has been improving, the improvement rate is lower than that of computational performance. To fill the performance gap between computation and memory, vector supercomputers are equipped with a cache mechanism for vector processing. Effectively using such a vector cache is indispensable for achieving high computational performance when executing an application on a vector supercomputer. This section outlines the vector supercomputers SX-ACE and SX-Aurora TSUBASA, which are suitable for memory-intensive applications. Table I lists the hardware specifications of SX-ACE and SX-Aurora TSUBASA.

TABLE I: Specifications of SX-ACE and SX-Aurora TSUBASA

                                        SX-ACE         SX-Aurora TSUBASA
Frequency                               1.0 GHz        1.4 GHz
Theoretical performance / core          64 Gflop/s     268.8 Gflop/s
Number of cores                         4              8
Theoretical performance / socket        256 Gflop/s    2.15 Tflop/s
Memory bandwidth                        256 GB/s       1.22 TB/s
Memory capacity                         64 GB          48 GB
Last level cache bandwidth              1 TB/s         2.66 TB/s
Last level cache capacity               1 MB private   16 MB shared

A. SX-ACE

SX-ACE is the previous-generation vector supercomputer, launched by NEC in 2013 [5]. The SX-ACE processor is composed of four vector cores, each consisting of a vector processing unit (VPU) and an assignable data buffer (ADB). The VPU is an important component of SX-ACE and can process up to 256 vector elements of 8 bytes each with one vector instruction. The VPU is equipped with two multiply units and two addition units, which can operate independently with different vector instructions. The theoretical computational performance per core of SX-ACE is 64 Gflop/s, and the theoretical computational performance of one processor is 256 Gflop/s. Each core is connected to a memory control unit (MCU) via a memory crossbar network at a memory bandwidth of 256 GB/s, and the bandwidth is shared by the four cores. Each core is equipped with a 1 MB ADB. This is a software-controllable data buffer that the VPU can access at a rate of 256 GB/s.

B. SX-Aurora TSUBASA

SX-Aurora TSUBASA is the newest vector supercomputer, launched by NEC in 2017 [6]. The SX-Aurora TSUBASA system consists of a vector host (VH) and vector engines (VEs). A VE is mounted as a PCI Express (PCIe) card equipped with a vector processor, and the card is connected to the VH, which is a standard x86/Linux node, via PCIe. An entire program with its data executes on the VE with a dedicated high-bandwidth memory, while the VH mainly provides OS functions to the connected VEs. Up to eight VEs can be controlled by one VH. A VE consists of eight vector cores, a 16-MB last level cache (LLC), and six High Bandwidth Memory 2 (HBM2) memory modules. The VPU installed in each vector core has three vector fused multiply-add (VFMA) units. The vector length of SX-Aurora TSUBASA is 256, the same as that of SX-ACE, and 256 vector elements can be processed with one vector instruction. The theoretical computational performance per core of a VE is 268.8 Gflop/s, and the theoretical computational performance of one VE is 2.15 Tflop/s. The LLC is directly connected to the vector register of each core and shared by the eight cores. The six HBM2 memory modules have a high memory bandwidth of 1.288 TB/s in total.

IV. OVERVIEW OF NUMERICAL TURBINE CODE

The physical phenomenon in steam turbines is nonequilibrium condensation flow. The condensation observed in steam turbines is quite important in engineering. The phase change is governed by homogeneous nucleation and the nonequilibrium process of condensation. The latent heat of water is released to the surrounding non-condensed vapor, increasing its temperature and pressure. It is known that condensed water droplets affect the performance of a steam turbine. The blades of a steam turbine are occasionally damaged by erosion due to interaction with the condensed water droplets. However, the precise mechanism behind the erosion remains unknown. Transonic wet-steam flows in a steam-turbine cascade channel have been studied [9] [10] [11]. Young [11] calculated two-dimensional wet-steam turbine cascade flows by solving the Euler equation with the Lagrangian method for integrating the growth equation of a water droplet along each streamline.

A. Fundamental equations

The fundamental equations consist of the conservation laws of total density, momentum, total energy, liquid water density, and number density of water droplets, together with the shear stress transport (SST) turbulence model [12], as shown in the following equation:

∂Q/∂t + ∂F_i/∂ξ_i + S + H = 0,   (1)

where Q, F_i, S, H, t, and ξ_i (i = 1, 2, 3) are the vector of unknown variables, the flux, the viscous term, the source term, the physical time, and the general curvilinear coordinates, respectively. Also, the equations of state and the sound speed in wet steam, assuming that the condensate mass fraction β is sufficiently small (β < 0.1), are written as

p = ρRT(1 − β),   c² = ( C_pm / (C_pm − (1 − β)R) ) (p/ρ),   (2)

where R is the gas constant and C_pm is defined as the linear combination of the isobaric specific heats in the gas and liquid phases by using β. Here, Q, F_i, S, and H are defined in the following forms (written as transposed column vectors):

Q = J (ρ, ρw1, ρw2, ρw3, e, ρβ, ρn, ρk, ρω)^T,

F_i = J (ρW_i, ρw1 W_i + (∂ξ_i/∂x1) p, ρw2 W_i + (∂ξ_i/∂x2) p, ρw3 W_i + (∂ξ_i/∂x3) p, (e + p) W_i, ρβ W_i, ρn W_i, ρk W_i, ρω W_i)^T,

S = −J (∂ξ_i/∂x_j) (∂/∂ξ_i) (0, τ_1j, τ_2j, τ_3j, τ_kj w_k + (κ + κ^t) ∂T/∂x_j, 0, 0, σ_kj, σ_ωj)^T,

H = −J (0, 0, ρ(Ω² x2 + 2Ω w3), ρ(Ω² x3 + 2Ω w2), 0, Γ, I, S_k, S_ω)^T,

where the nomenclature is as follows: J is the Jacobian of the transformation; W_i are the relative contra-variant velocities; e is the total internal energy per unit volume; κ and κ^t are the laminar and turbulent thermal conductivity coefficients; β is the condensate mass fraction (wetness); n is the number density of water droplets per unit mass; k is the turbulent kinetic energy; ω is the turbulent kinetic energy dissipation ratio; σ_kj and σ_ωj are the diffusion terms for the k and ω equations; Ω is the angular velocity of rotation; ρ is the total density; p is the static pressure; T is the static temperature; w_i are the relative physical velocities; x_i are the Cartesian coordinates; and τ_ij are the viscous stress tensors.

Fig. 1: Flowchart of the iteration loop of the Numerical Turbine code (space difference scheme, physical models, time integration, calculation of unknown variables, boundary conditions with data communications, and updating of unknown variables, repeated until done).

B. Condensation model

The mass generation rate of the liquid phase, Γ, is based on classical condensation theory. Ishizaka et al. further simplified it to the following equation [13]:

Γ = (4/3) π ρ_l I r_*³ + 3 ρ n r² (dr/dt).   (3)

C. Numerical schemes

To calculate these equations, the Numerical Turbine code uses the fourth-order compact MUSCL TVD (Compact MUSCL) scheme [14] and the Roe approximate Riemann solver [15] for the finite-difference scheme in Eq. (1). The viscous term is calculated using the second-order central-difference scheme. The SST turbulence model is used for the turbulence modeling [16], and the LU-SGS method [2] is used for time integration.

Figure 1 shows the flowchart of the main iteration loop of the Numerical Turbine code. Its performance is primarily dominated by the calculations of the space difference, time integration, and physical model routines.

D. 2D and 3D hyperplane ordering methods

Since the Numerical Turbine code is a memory-intensive code [4], it has been executed on vector supercomputers with high memory bandwidth. The space difference routine and the physical model routine are automatically vectorized by the SX Fortran compiler because there are no dependencies among the computational data. The LU-SGS method used in the time integration routine cannot be automatically vectorized, because the Gauss-Seidel method depends on the calculated results of its predecessors.

The Gauss-Seidel method has the data dependency shown in Fig. 2. As this figure shows, grid point q(i,j,k) refers to grid points q(i−1,j,k), q(i,j−1,k), and q(i,j,k−1) in the i-, j-, and k-directions for the calculation, so the calculation of grid point q(i,j,k) depends on grid points q(i−1,j,k), q(i,j−1,k), and q(i,j,k−1). Therefore, the method cannot be vectorized. To vectorize this method used in the time integration routine of the Numerical Turbine code, a hyperplane ordering method is applied to avoid such data dependencies. Although the time integration routine is vectorized by applying a hyperplane ordering method, the main performance bottleneck in the iteration loop is still the time integration routine. Therefore, it is necessary to improve the computational performance of this routine.

Fig. 2: Data dependency of the LU-SGS method (grid point q(i,j,k) refers to q(i-1,j,k), q(i,j-1,k), and q(i,j,k-1)).

The hyperplane ordering method is a parallelization method that avoids such data dependencies by changing the order of calculation. As described in Fujino et al.'s study [3], there are 2D hyperplane ordering and 3D hyperplane ordering methods. Figure 3(a) is a diagram of the ordering with the 3D hyperplane ordering method. In this figure, the orange plane is composed of grid points that are not dependent on each other, and each plane is a skew-cutting plane through the grid points ordered in the 3D space. Such a plane is called a "hyperplane." When calculating grid point q(i,j,k) on this hyperplane, grid points q(i−1,j,k), q(i,j−1,k), and q(i,j,k−1) are not updated. Therefore, the data dependency related to grid point q(i,j,k) can be avoided, and the calculation of the grid points on the hyperplane can be vectorized.

With the 3D hyperplane ordering method, the loop length for vectorization can generally be increased. Since indirect memory access is used, however, long-stride memory accesses occur and the memory load increases. Vector supercomputers prior to SX-ACE had a high ratio of memory bandwidth to floating-point operation performance, known as the Bytes per Flop ratio. Therefore, the 3D hyperplane ordering method is effective in increasing the loop length for exploiting the high sustained performance of such vector supercomputers.

Although the computational performance and memory bandwidth of the previous model SX-ACE have improved, the improvement rate of memory bandwidth is lower than that of computational performance. Therefore, the 3D hyperplane ordering method cannot exploit the high computational performance of SX-ACE due to the high memory load caused by the method.

To exploit the performance of SX-ACE, we use the 2D hyperplane ordering method to effectively use the ADB, which is the on-chip memory of SX-ACE. As shown in Fig. 3(b), a hyperplane of the 2D hyperplane ordering method does not consist of a set of grid points with no dependency in the 3D space. However, there is a set of grid points with no dependency on a 2D plane, and each hyperplane in the 2D hyperplane ordering method can be vectorized with a higher locality of data references than with the 3D hyperplane ordering method. Since the 2D hyperplane ordering method has a simple structure, it uses direct memory accesses to refer to the grid points by calculating the locations of these points instead of using indirect memory accesses, resulting in a reduced memory load.

Fig. 3: Hyperplane ordering. (a) 3D hyperplane. (b) 2D hyperplane.

E. Preliminary evaluation

To clarify the performance characteristics of the hyperplane ordering methods, a preliminary evaluation is conducted. Two types of kernel codes of the 2D and 3D hyperplane ordering methods are created for this evaluation. The performance evaluation of these methods using these kernel codes is conducted using one node each of SX-ACE and SX-Aurora TSUBASA. One node of SX-ACE corresponds to one CPU, and one node of SX-Aurora TSUBASA corresponds to one VE.

To evaluate the performance of the hyperplane ordering methods for various problem sizes, the matrix size of the kernel codes is changed. The matrix size is N, N, and 181 in the i-, j-, and k-directions, respectively, and the range of N is from 31 to 141. These matrix sizes are set in consideration of the actual model sizes used in the Numerical Turbine code.

Figure 4 shows the results of the performance evaluation of the 2D and 3D hyperplane ordering methods on SX-ACE and SX-Aurora TSUBASA. As shown in Fig. 4(a), on SX-ACE, when the matrix size N is 54 or more, the performance of the 2D hyperplane ordering method is higher than that of the 3D hyperplane ordering method. On the other hand, on SX-Aurora TSUBASA, as shown in Fig. 4(b), the performance of the 3D hyperplane ordering method is better than that of the 2D hyperplane ordering method regardless of the matrix size. The high computational performance of SX-ACE can be exploited when the cache hit ratio and the average vector length are well balanced. However, on SX-Aurora TSUBASA, the cache hit ratio is almost 100% regardless of the matrix size for the 2D hyperplane ordering method.

It is clear from Fig. 4(b) that there is a cross-point between the performance of the 2D and 3D hyperplane ordering methods at a larger matrix size. However, the matrix sizes considered in the actual simulation are from 31 to 141. Therefore, matrix sizes smaller than the cross-point are appropriate for the evaluation.

Figure 4 also indicates that precipitous performance degradation appears at some matrix sizes on both SX-ACE and SX-Aurora TSUBASA. The memory systems of both use multi-bank interleaved memory, and the degradation is due to bank conflicts. With regard to the performance of the 2D hyperplane ordering method on SX-Aurora TSUBASA, the degradation is greatly mitigated. This is because memory accesses decrease thanks to the high cache hit ratio on SX-Aurora TSUBASA.

This preliminary evaluation suggests that SX-Aurora TSUBASA prefers the 3D hyperplane ordering method over the 2D one because the 3D hyperplane ordering method provides a longer vector length, even with an ineffective use of the cache. That is, on SX-Aurora TSUBASA, the increase in the cache capacity may diminish the effectiveness of the improvement in the cache hit ratio. In such a case, the effect of other factors, such as the vector length, on the performance of these hyperplane ordering methods increases. However, since the 3D hyperplane ordering method is accompanied by a high memory load, it is necessary to reduce the indirect memory accesses causing the memory load to further improve performance.

V. OPTIMIZING MEMORY-ACCESS PATTERN OF 3D HYPERPLANE ORDERING METHOD

The 3D hyperplane ordering method is expected to be more effective than the 2D hyperplane ordering method on SX-Aurora TSUBASA. However, it has drawbacks: the distance of stride accesses in memory is long, and indirect memory access is used for accessing each grid point. It is necessary to mitigate these drawbacks to make calculation with the 3D hyperplane ordering method even faster. Therefore, we propose an optimized 3D hyperplane ordering method to improve memory-access performance by changing the data layout in memory.
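To make the dependency pattern of the LU-SGS sweep and the hyperplane-ordered traversal concrete, here is a minimal C++ sketch (the Numerical Turbine code itself is Fortran; the grid sizes, the update formula, and the plane-point lists below are placeholders, not the paper's code):

```cpp
#include <vector>

constexpr int NI = 64, NJ = 64, NK = 64;                  // placeholder grid sizes
inline int idx(int i, int j, int k) { return (i * NJ + j) * NK + k; }

// Dependent triple loop: each point needs already-updated neighbours in the i-, j-,
// and k-directions, so the compiler cannot vectorize the innermost loop as written.
void lu_sgs_like(std::vector<double>& q) {
    for (int i = 1; i < NI; ++i)
        for (int j = 1; j < NJ; ++j)
            for (int k = 1; k < NK; ++k)
                q[idx(i, j, k)] =
                    (q[idx(i - 1, j, k)] + q[idx(i, j - 1, k)] + q[idx(i, j, k - 1)]) / 3.0;
}

// 3D hyperplane ordering: all points with the same m = i + j + k are independent, so the
// loop over one plane's point list is vectorizable. The gathered flat indices are
// non-contiguous, which is exactly the indirect, long-stride access the paper optimizes.
void hyperplane_ordered(std::vector<double>& q,
                        const std::vector<std::vector<int>>& plane_points) {
    for (const auto& points : plane_points) {             // planes in increasing m
        for (int p : points) {                            // independent -> vectorizable
            int i = p / (NJ * NK), j = (p / NK) % NJ, k = p % NK;
            if (i == 0 || j == 0 || k == 0) continue;     // boundary points unchanged
            q[idx(i, j, k)] =
                (q[idx(i - 1, j, k)] + q[idx(i, j - 1, k)] + q[idx(i, j, k - 1)]) / 3.0;
        }
    }
}
```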

(a) Performance on SX-ACE. (b) Performance on SX-Aurora TSUBASA.
Fig. 4: Performance of the 2D and 3D hyperplane ordering methods (series: 2D Hyperplane, 3D Hyperplane; x-axis: matrix size, 31 to 141; y-axis: performance in GFLOP/s).

(a) Original 3D data layout. (b) New 1D data layout.
Fig. 5: Changing 3D data layout to 1D data layout.

Fig. 6: 1D data layout and each plane number (Plane1 (I+J+K=5) through Plane9 (I+J+K=13); the grid points of each hyperplane, including the boundary points, are numbered consecutively along the direction of memory access, and the per-plane arrays are arranged along the direction of the hyperplanes).

With the proposed method, the grid points on each hyperplane of the 3D hyperplane ordering method are stored contiguously in a 1D array so that the grid points in the 1D array corresponding to each hyperplane can be accessed sequentially.

A. Changing data layout on memory for sequential access

Figure 5(a) shows the data layout of the 3D hyperplane ordering method, and Fig. 5(b) shows the new 1D data layout of the proposed method. As shown in Fig. 5(a), with the 3D hyperplane ordering method, each hyperplane is a skew-cutting plane through the 3D space. Thus, the distance in memory to the adjacent grid points on the hyperplane is long, and indirect memory accesses are used for accessing the grid points on each hyperplane. To reduce these indirect memory accesses, in the new 1D data layout of the proposed method the grid points on each hyperplane are arranged in order of calculation. Thus, the grid points of each hyperplane are stored in a 1D array corresponding to the hyperplane, as shown in Fig. 5(b). Hence, the memory can be accessed contiguously when using the grid points in each 1D array.

For each grid point (i,j,k) on a hyperplane, the sum of i, j, and k has the same value, and each hyperplane can be indexed using this sum. Therefore, as shown in Fig. 6, the 1D arrays corresponding to the hyperplanes are arranged in the order of the hyperplanes, and grid points (i−1,j,k), (i,j−1,k), and (i,j,k−1), which are referred to for the calculation of grid point (i,j,k) in a 1D array, are located in the adjacent 1D array. As described later, this data layout is created using an array of structures whose member is a 1D array, and each hyperplane is specified using the index of this array of structures. In Fig. 6, for example, to access the fifth hyperplane (plane 5), the fifth element of the structure array is referred to. When storing grid points in each 1D array, it is necessary to also store the grid points of the boundary area of each hyperplane, as shown in Fig. 6 (the translucent circles indicate grid points in the boundary areas).

B. Implementation of memory-efficient data layout

To create this data structure, the structure is dynamically allocated with the number of hyperplanes and the number of grid points on each hyperplane. The specific allocation method is shown in Listing 1; the 1D arrays are allocated in the hyperplane order shown in Fig. 6.

Listing 1: Define and allocate dynamic allocatable arrays
      ···
      TYPE ARRAY1D
        REAL*8,ALLOCATABLE,DIMENSION(:):: ARRAY
      END TYPE ARRAY1D
      TYPE(ARRAY1D),PUBLIC,ALLOCATABLE,DIMENSION(:):: A_
      ···
      ALLOCATE(A_(0:NUM_OF_PLANES+1))
      ···
      DO M=0,NUM_OF_PLANES+1
      ···
        ALLOCATE(A_(M)%ARRAY(MDIM1_MAX(M)))
      ···
      ENDDO
      ···

Derived type ARRAY1D is declared as the hyperplane structure. This structure has the member ARRAY, a dynamically allocatable array that stores the grid points of one hyperplane. Then, array A_ in Listing 1 is declared as a dynamically allocatable array of this structure. Next, this array is allocated according to the number of hyperplanes. Finally, structure member ARRAY is allocated to store each hyperplane according to the number of grid points on that hyperplane. Each element of array MDIM1_MAX stores the number of grid points on the corresponding hyperplane.
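For readers who prefer C++, a rough analogue of the ARRAY1D array of structures allocated in Listing 1 (names and sizes are assumptions for illustration, not the paper's code) looks like this:

```cpp
#include <vector>

// One contiguous buffer per hyperplane, indexed by the plane number m = i + j + k.
struct Plane1D {
    std::vector<double> array;   // grid points of one hyperplane, in calculation order
};

// points_per_plane plays the role of MDIM1_MAX in Listing 1.
std::vector<Plane1D> build_planes(const std::vector<int>& points_per_plane) {
    std::vector<Plane1D> planes(points_per_plane.size());
    for (std::size_t m = 0; m < planes.size(); ++m)
        planes[m].array.resize(points_per_plane[m]);      // like ALLOCATE(A_(M)%ARRAY(...))
    return planes;
}
```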

Listing 2: Copy data to the 1D layout arrays
      ···
      DO K=KK1,KKF
      DO J=JJ1,JJF
!$NEC vovertake
      DO I=II1,IIF
        LHP = LIST_3DA(I,J,K,L)
        ···
        M = I+J+K-(MMIN-1)
        A_(M)%ARRAY(LHP) = A(I,J,K)
      ENDDO
      ENDDO
      ENDDO
      ···
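The copy step of Listing 2 can also be pictured with the following C++ sketch (again an illustration with assumed index arrays, not the paper's Fortran; the !$NEC vovertake directive has no counterpart here):

```cpp
#include <vector>

// Gather each grid point A(i,j,k) into the 1D array of its hyperplane m = i + j + k,
// at the position given by a precomputed list (the role of LIST_3DA in Listing 2).
// 'planes' is assumed to be pre-sized so that planes[m] holds that plane's points.
void copy_to_planes(const std::vector<double>& a,          // original 3D array, flattened
                    const std::vector<int>& list_3da,       // position of (i,j,k) within its plane
                    int ni, int nj, int nk, int mmin,
                    std::vector<std::vector<double>>& planes) {
    for (int i = 0; i < ni; ++i)
        for (int j = 0; j < nj; ++j)
            for (int k = 0; k < nk; ++k) {
                const int flat = (i * nj + j) * nk + k;
                const int m = i + j + k - (mmin - 1);        // hyperplane number
                planes[m][list_3da[flat]] = a[flat];         // scatter into the plane's 1D array
            }
}
```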

With the proposed method, as shown in Listing 2, it is necessary to copy the data in the arrays into the 1D arrays. This data copy is, of course, not necessary in the original 3D hyperplane ordering method. Array LIST_3DA stores the position of each grid point within its 1D array. Specifically, the number of the respective grid point is stored, as shown in Fig. 6. Such data copies are a time-consuming process because the copies occur every time the time integration routine is called. To shorten this time, the compiler directive vovertake of SX-Aurora TSUBASA is used. With this directive, vector stores in the loop may be overtaken by subsequent vector loads so that the data-copy time can be reduced.

Listing 3 shows the calculation part when using the proposed method. In Listings 1, 2, and 3, M indicates the hyperplane number. In the original code, accesses to grid points A(I,J,K), A(I-1,J,K), A(I,J-1,K), and A(I,J,K-1) are indirect memory accesses. On the other hand, in the code using the 1D arrays, accesses to A_(M)%ARRAY(LHP) corresponding to A(I,J,K) are sequential, and the number of indirect memory accesses can be reduced. Moreover, since the 1D arrays are arranged in the order of the hyperplane numbers in the array of structures, adjacent grid points can be found in the adjacent array of the structure, and the access distance to adjacent grid points is shorter than with the 3D hyperplane ordering method.

Listing 3: 3D hyperplane ordering using 1D layout arrays
      ···
      DO LHP=1,LIST_3DC(M,L)
        I = LIST_3DI(LHP,M,L)
        J = LIST_3DJ(LHP,M,L)
        K = LIST_3DK(LHP,M,L)
        IM=LIST_3DA(I-1,J ,K ,L)
        JM=LIST_3DA(I ,J-1,K ,L)
        KM=LIST_3DA(I ,J ,K-1,L)
        ···
        AIM = (A_(M-1)%ARRAY(IM)*A_(M)%ARRAY(LHP))*0.5D0
        AJM = (A_(M-1)%ARRAY(JM)*A_(M)%ARRAY(LHP))*0.5D0
        AKM = (A_(M-1)%ARRAY(KM)*A_(M)%ARRAY(LHP))*0.5D0
        ···
      ENDDO
      ···

VI. PERFORMANCE EVALUATION

We evaluate the performance of the proposed method compared with the 3D hyperplane ordering method on SX-Aurora TSUBASA. Figure 7 shows the evaluation results. The horizontal axis indicates the matrix size, and the vertical axis indicates the computational performance of these methods. The proposed method using the 1D arrays performs better than the original 3D hyperplane ordering method for any matrix size. The performance improves by up to 2.77×, and by 1.27× on average. The performance of the proposed method improves as the matrix size increases.

Fig. 7: Performance of three types of the hyperplane ordering method on SX-Aurora TSUBASA (series: 2D Hyperplane, 3D Hyperplane, 3D Hyperplane using 1D data layout; x-axis: matrix size, 31 to 141; y-axis: performance in GFLOP/s).

The reason the performance of the proposed method increases with the matrix size is the ratio of calculation time to data-copy time for calculation preparation. Figure 8 shows these ratios as a function of the matrix size with the proposed method. The horizontal axis indicates the matrix size, and the vertical axis indicates these ratios. The blue portion indicates the ratio of calculation time, and the red portion indicates that of data-copy time. These results indicate that as the matrix size increases, the ratio of calculation time increases; thus, the performance of the time integration routine improves. In other words, reducing the data-copy time can result in superior performance of the proposed method.

Fig. 8: Cost distribution of data copy and calculation with the proposed method (x-axis: matrix size, 31 to 139; y-axis: ratio of calculation time and data-copy time, 0% to 100%).

VII. CONCLUSIONS

We discuss two types of hyperplane ordering methods used for the time integration routine of the Numerical Turbine code on the newest vector supercomputer SX-Aurora TSUBASA. With the 2D hyperplane ordering method, although the cache hit rate is almost 100%, the short vector length harms the sustained performance. On the other hand, the 3D hyperplane ordering method provides a longer vector length, even with an ineffective use of the cache. Since the 3D hyperplane ordering method uses indirect memory accesses, long-stride memory accesses occur and the memory load increases.

To further improve the performance of the 3D hyperplane ordering method by reducing the indirect memory accesses, we propose an optimized 3D hyperplane ordering method using a 1D data layout. To sequentially access the grid points, we rearrange the data layout so that the grid points of each hyperplane are accessed in a 1D-array fashion. Thus, the memory for the grid points in a 1D array can be accessed contiguously. To demonstrate the effectiveness of the proposed method, we evaluated its performance. The experimental results indicate that the proposed method reduces the execution time of the Gauss-Seidel kernel by up to 2.77×, and by 1.27× on average, compared with the 3D hyperplane ordering method without the 1D array layout, and its performance further improves as the matrix size becomes larger. We confirm that indirect memory accesses can be reduced by enabling sequential memory accesses for the 3D hyperplane ordering method, resulting in further performance improvement.

Although we evaluate the proposed method using a kernel code in this study, we will implement the method in the whole Numerical Turbine code and evaluate its effect on the performance. We will also examine the effect of the method on widely used SIMD architectures with larger cache capacity, such as Intel Xeon and AMD EPYC processors.

ACKNOWLEDGMENT

This research was supported in part by MEXT as "Next Generation High-Performance Computing Infrastructures and Applications R&D Program," entitled "R&D of A Quantum-Annealing-Assisted Next Generation HPC Infrastructure and its Applications." The authors thank Satoru Yamamoto, Takashi Furusawa, and Hironori Miyazawa of Tohoku University for their fruitful discussions and valuable comments.

REFERENCES
[1] S. Miyake, I. Koda, S. Yamamoto, Y. Sasao, K. Momma, T. Miyawaki, and H. Ooyama, "Unsteady Wake and Vortex Interaction in 3-D Steam Turbine Low Pressure Final Three Stages," Proc. ASME Turbo Expo 2014, Düsseldorf, Germany, GT2014-25491, 2014.
[2] S. Yoon and A. Jameson, "Lower-upper Symmetric-Gauss-Seidel method for the Euler and Navier-Stokes equations," AIAA Journal, 26 (1988), 1025–1026.
[3] S. Fujino, M. Mori, and T. Takeuchi, "Performance of hyperplane ordering on vector computers," Journal of Computational and Applied Mathematics, 38 (1991), 125–136, North-Holland.
[4] H. Kobayashi, "Feasibility Study of a Future HPC System for Memory-Intensive Applications: Final Report," Sustained Simulation Performance 2014, Springer, (2014), 3–16.
[5] R. Egawa, K. Komatsu, S. Momose, Y. Isobe, A. Musa, H. Takizawa, and H. Kobayashi, "Potential of a Modern Vector Supercomputer for Practical Applications - Performance Evaluation of SX-ACE," The Journal of Supercomputing, 73(9), 3948–3976 (2017), DOI: 10.1007/s11227-017-1993-y.
[6] K. Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, and H. Kobayashi, "Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA," Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2018.
[7] 26th International Conference on Parallel Computational Fluid Dynamics, Parallel CFD 2016, http://www.conf.kit.ac.jp/parcfd2016/conference.html.
[8] M. Burger and C. Bischof, "Optimizing the memory access performance of FASTEST's sipsol routine," 6th European Conference on Computational Fluid Dynamics, ECFD VI, July 2014, Barcelona, Spain. DOI: 10.13140/RG.2.2.32568.14089.
[9] F. Bakhtar and M. T. Mohammadi Tochai, "An Investigation of Two-Dimensional Flows of Nucleating and Wet Steam by the Time-Marching Method," Int. J. Heat Fluid Flow, Vol. 2, No. 1 (1980), pp. 5–18.
[10] M. Moheban and J. B. Young, "A Study of Thermal Nonequilibrium Effect in Low-Pressure Wet-Steam Turbine Using a Blade-to-Blade Time-Marching Technique," Int. J. Heat and Fluid Flow, Vol. 6, (1985), pp. 269–278.
[11] J. B. Young, "Two-Dimensional, Nonequilibrium, Wet-Steam Calculations for Nozzles and Turbine Cascades," Trans. ASME, J. Turbomachinery, 114 (1992), 569–579.
[12] F. R. Menter, "Two-equation Eddy-viscosity Turbulence Models for Engineering Applications," AIAA Journal, 32 (1994), 1598–1605.
[13] K. Ishizaka, T. Ikohagi, and H. Daiguji, "A High-Resolution Finite Difference Scheme for Supersonic Wet-steam Flow," Proceedings of the 6th International Symposium on Computational Fluid Dynamics (in Japan), Vol. 1, (1995), pp. 479–484.
[14] S. Yamamoto and H. Daiguji, "Higher-Order-Accurate Upwind Schemes for Solving the Compressible Euler and Navier-Stokes Equations," Computers and Fluids, 22 (1993), 357–372.
[15] P. L. Roe, "Approximate Riemann Solvers, Parameter Vectors, and Difference Schemes," J. Comp. Phys., 43 (1981), 357–372.
[16] F. R. Menter, "Two-equation Eddy-viscosity Turbulence Models for Engineering Applications," AIAA Journal, 32 (1994), 1598–1605.

Generalized Sparse Matrix-Matrix Multiplication for Vector Engines and Graph Applications

Jiayu Li, Intelligent Systems Engineering, Indiana University, Bloomington, USA ([email protected])
Fugang Wang, Intelligent Systems Engineering, Indiana University, Bloomington, USA ([email protected])
Takuya Araki, Data Science Research Laboratories, NEC, Kanagawa, Japan ([email protected])
Judy Qiu, Intelligent Systems Engineering, Indiana University, Bloomington, USA ([email protected])

Abstract—Generalized sparse matrix-matrix multiplication (SpGEMM) is a key primitive kernel for many high-performance graph algorithms as well as for machine learning and data analysis algorithms. Although many SpGEMM algorithms have been proposed, such as ESC and SPA, there is currently no SpGEMM kernel optimized for vector engines (VEs). NEC SX-Aurora is the new vector computing system that can achieve high performance by leveraging the 1.2TB/s high bandwidth memory and the long vectors of VEs, where the execution of scientific applications is limited by memory bandwidth. In this paper, we present significant initial work on a SpGEMM kernel for a vector engine and use it to vectorize several essential graph analysis algorithms: butterfly counting and triangle counting. We propose a SpGEMM algorithm with a novel hybrid method based on sparse vectors and loop raking to maximize the length of vectorizable code for vector machine architectures. The experimental results show that the vector engine has advantages on more massive data sets. This work contributes to the high performance and portability of the SpGEMM kernel on a new family of heterogeneous computing systems, namely a Vector Host (VH) equipped with different accelerators or VEs.

Index Terms—Sparse Linear Algebra Kernel, NEC Vector Engine, Graph

I. INTRODUCTION

Generalized sparse matrix-matrix multiplication (SpGEMM) is a primitive kernel for many high-performance graph analytics and machine learning algorithms. Although many SpGEMM algorithms have been proposed, there is currently no SpGEMM kernel optimized for vector engines.

The NEC SX-Aurora TSUBASA is a vector processor of the NEC SX architecture family [1], a CPU machine with a Vector Engine (VE) for accelerated computing using vectorization. The concept is that the full application runs on the high-performance Vector Engine, and the operating system tasks are taken care of by the Vector Host (VH), which is a standard x86 server.

As shown in Figure 1, a NEC SX-Aurora node, also called a Vector Island (VI), is comprised of a Vector Host (VH) and one or more Vector Engines (VEs). The VH is an x86 server with one or more standard server CPUs running the Linux operating system. One or multiple VEs are connected to each VH CPU. Each VE has 8 cores and dedicated memory.

Fig. 1. Hardware configuration of NEC SX-Aurora VH and VEs

Figure 2 shows the detailed architecture of a VE. The Vector Engine processor integrates 8 vector cores and 48 GB of high bandwidth memory (HBM2), providing a peak performance of up to 2.45 TeraFLOPS. The computational efficiency is achieved by the unrivaled memory bandwidth of up to 1.2 TB/s per CPU and by the latency-hiding effect of the vector architecture. The arithmetic unit of a single Vector Engine processor core can execute 32 double-precision floating-point operations per cycle, with its vector registers holding 256 floating-point values. With 3 fused-multiply-add units (FMA), each core has a peak performance of 192 FLOP per cycle, or up to 307.2 GigaFLOPS (double precision). The Vector Engine has a peak performance of up to 2.45 TeraFLOPS.

Fig. 2. NEC SX-Aurora VE architecture

In this paper, we design a new SpGEMM kernel for the vector engine (VE) as an addition to the family of accelerators and further study two subgraph counting algorithms: triangle counting [2] and butterfly counting [3]. Our main contributions in this paper are:
• We propose a unique hybrid method that enlarges the vector length for non-zero values and leverages the High Bandwidth Memory (HBM). This enables vector architectures to exert their potential with 1.2TB/s of HBM and a long vector length of 256 elements.
• We deploy loop raking to vectorize loops and increase memory access efficiency.
• The SpGEMM kernel is used to implement several important graph analysis algorithms on the vector machine.
The experimental results show that the vector engine achieves high performance on large data sets. We implemented the algorithms in C++ and have made the open-source code available on GitHub¹ [4].

¹https://github.com/dsc-nec/frovedis_matrix

II. RELATED WORK

Counting subgraphs in a large network is fundamental in graph problems. It has been used in real-world applications across a range of disciplines, such as bioinformatics [5], social network analysis, and neuroscience [6]. Many graph algorithms have been previously presented in the language of linear algebra [7], [8]. The matrix-based triangle counting algorithm we use is mainly based on the text [2]. [3] proposed a fast butterfly counting algorithm, which we converted into a matrix operation form; we make it highly parallel and take full advantage of vector machines.

SX-Aurora TSUBASA is a vector engine developed by NEC. Comparisons of SX-Aurora TSUBASA with other computing architectures include [9][10]. The hybrid SpGEMM algorithm we propose is based on the following related work: the ESC algorithm [11], hash-based SpGEMM [12] [13], and the SPA algorithm [14]. They will be explained in the next section, together with our implementation.

III. IMPLEMENTATION OF SPGEMM ON VECTOR ENGINE

In the Compressed Sparse Row (CSR) format, we represent a matrix M of size m×n by three 1-D arrays (vectors) called A, IA, and JA. Let NNZ denote the number of non-zero elements in M. The A vector is of size NNZ, and it stores the values of the non-zero elements of the matrix. The values appear in the order of traversing the matrix row by row. The JA vector stores the column index of each element in the A vector. The IA vector is of size m+1 and stores the cumulative number of non-zero elements up to (not including) the i-th row. For example, to calculate the number of non-zero elements in row 0, we simply compute IA[1] − IA[0] = 2 = NNZ_row0.

Fig. 3. CSR representation of a sparse matrix
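As a small worked example of the CSR layout (a hypothetical 3×4 matrix, not the matrix shown in Fig. 3):

```cpp
#include <vector>

// Hypothetical 3x4 matrix:
//     [10  0 20  0]
//     [ 0 30  0 40]
//     [50  0  0  0]
// A  : non-zero values in row-major order
// JA : column index of each value in A
// IA : IA[i] = number of non-zeros before row i; IA[m] = NNZ
struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<double> a;
    std::vector<int> ja;
    std::vector<int> ia;
};

CsrMatrix make_example() {
    CsrMatrix m;
    m.rows = 3; m.cols = 4;
    m.a  = {10, 20, 30, 40, 50};
    m.ja = { 0,  2,  1,  3,  0};
    m.ia = { 0,  2,  4,  5};      // row 0 has IA[1] - IA[0] = 2 non-zeros
    return m;
}
```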

II. RELATED WORK

Counting subgraphs in a large network is a fundamental graph problem. It has been used in real-world applications across a range of disciplines, such as bioinformatics [5], social network analysis, and neuroscience [6]. Many graph algorithms have been previously presented in the language of linear algebra [7], [8]. The matrix-based triangle counting algorithm we use is mainly based on the text [2]. [3] proposed a fast butterfly counting algorithm, which we converted into a matrix operation form; we make it highly parallel and take full advantage of vector machines.

SX-Aurora TSUBASA is a vector engine developed by NEC; comparisons of SX-Aurora TSUBASA with other computing architectures include [9], [10]. The hybrid SpGEMM algorithm we propose builds on the following related work: the ESC algorithm [11], hash-based SpGEMM [12], [13], and the SPA algorithm [14]. They are explained in the next section, together with our implementation.

III. IMPLEMENTATION OF SPGEMM ON VECTOR ENGINE

In the Compressed Sparse Row (CSR) format, a matrix M of size m×n is represented by three 1-D arrays (vectors) called A, IA, and JA. Let NNZ denote the number of non-zero elements in M.

Fig. 3. CSR representation of a sparse matrix

The A vector is of size NNZ and stores the values of the non-zero elements of the matrix; the values appear in the order obtained by traversing the matrix row by row. The JA vector stores the column index of each element in the A vector. The IA vector is of size m+1 and stores the cumulative number of non-zero elements up to (but not including) the i-th row. For example, the number of non-zero elements in row 0 is IA[1] - IA[0] = 2 = NNZ_row0.

SpGEMM can be used for various kinds of graph algorithms, including the triangle counting and butterfly counting addressed in this paper. Since both operand matrices are sparse, the implementation is not straightforward; several algorithms exist, namely the ESC algorithm, the hash-based algorithm, and the sparse accumulator (SPA).

Figure 4 shows the basic structure of the SpGEMM algorithm. It computes C = A · B; we assume that the sparse matrices are in CSR format.

Fig. 4. Basic structure of SpGEMM algorithm

Here, each row of A·B creates the corresponding row of C; we focus on the k-th row. The k-th row of A has two non-zero elements, whose column indices are i and j. Because the other elements are zero, we only need to consider the i-th and j-th rows of the matrix B. They are multiplied by the corresponding non-zero elements of the k-th row of A and added to create the k-th row of C. There are several algorithms to add these sparse vectors.

We implemented these algorithms on SX-Aurora TSUBASA using the technique called loop raking, and propose a novel hybrid method. To the best of our knowledge, this is the first attempt to implement SpGEMM on a vector architecture.

A. Loop raking

Loop raking is a long-forgotten technique that was proposed in the early '90s to implement radix sort [15]. However, it is an essential technique for enhancing vectorization and enlarging vector length. The key idea of loop raking is to view each element of a vector register as a virtual processor. Here we take the union of sets as an example to introduce the loop raking technique.

Set operations (union, intersection, merge, difference, etc.) on sorted integers can be easily implemented, but vectorizing them is not trivial. The traditional algorithm compares the two lists element by element, and the result of the comparison determines which pointer is advanced. It consists of the following steps:

1) Set the pointers to the first elements of the sets.
2) Compare the data at the pointers.
3) Output the smaller value and advance its pointer; if the data are equal, output the value and advance both pointers.
4) Go to 2).

In contrast, the loop raking method divides the data set into many groups (in our case, 256 groups, since the length of a vector register is 256) and compares the first elements of all groups at the same time. It consists of the following steps:

1) Divide the data into groups.
2) Place the first unused element of each group in the vector register.
3) Compare the two vector registers (vectorized).
4) Go to 2).

If the computation of each virtual processor becomes complex, it might contain a branch. A loop that contains a branch can still be vectorized, but it is implemented using a mask register: the value of the condition is stored in the mask register, the instruction is executed regardless of the condition, and the result is reflected to a register or to memory according to the mask value. Therefore, if the condition becomes complex, many of the results end up unused, which hurts the performance of such branches.

B. ESC Algorithm

The ESC algorithm [11] was proposed by Bell et al. for GPUs. It consists of three phases: expansion, sorting, and compression. In the expansion phase, it creates the sparse vectors multiplied by the non-zeros of A, as explained above; this phase is done for all the rows of A (and C) in parallel. In the sorting phase, the resulting non-zeros of the sparse vectors are sorted according to their row and column indices. In the compression phase, the non-zeros that have the same row and column index are added into one non-zero value, and the result becomes the matrix C. Each phase of this algorithm has high parallelism. For the sorting phase, which is the most time-consuming part of the algorithm, we used the radix sort based on loop raking proposed in [15].
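To make the row-wise formulation of Figure 4 concrete, here is a minimal scalar (non-vectorized) reference sketch of SpGEMM over CSR operands. It is our illustration rather than the paper's vectorized kernel, and it accumulates each output row in a std::map purely for clarity.

    #include <cstdint>
    #include <map>
    #include <vector>

    // CSR matrix: values A, column indices JA, and row offsets IA of size m+1,
    // so that IA[i+1] - IA[i] is the number of non-zeros in row i.
    struct Csr {
        int64_t m = 0, n = 0;
        std::vector<double>  A;
        std::vector<int64_t> JA;
        std::vector<int64_t> IA;
    };

    // C = A * B computed row by row: C(k,:) = sum over non-zeros A(k,j) of A(k,j) * B(j,:).
    Csr spgemm_reference(const Csr& A, const Csr& B) {
        Csr C;
        C.m = A.m;
        C.n = B.n;
        C.IA.push_back(0);
        for (int64_t k = 0; k < A.m; ++k) {
            std::map<int64_t, double> acc;                     // column index -> accumulated value
            for (int64_t p = A.IA[k]; p < A.IA[k + 1]; ++p) {
                const int64_t j = A.JA[p];                     // column of the non-zero A(k,j)
                const double  a = A.A[p];
                for (int64_t q = B.IA[j]; q < B.IA[j + 1]; ++q)
                    acc[B.JA[q]] += a * B.A[q];                // scale row j of B and accumulate
            }
            for (const auto& [col, val] : acc) {               // emit row k of C with sorted column indices
                C.JA.push_back(col);
                C.A.push_back(val);
            }
            C.IA.push_back(static_cast<int64_t>(C.JA.size()));
        }
        return C;
    }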

(a) Sequential algorithm: the two sorted lists are compared element by element (1 < 2: output 1; 3 > 2: output 2; 3 = 3: output 3; 7 > 5: output 5; and so on).

(b) Loop raking: the first unused elements of the groups are gathered into two vector registers (left and right), the registers are compared element-wise ("left < right?", "left == right?"), and the outputs are selected through mask registers.

Fig. 5. Comparison of the sequential algorithm and loop raking.

Fig. 6. ESC algorithm.

In the implementation on the vector architecture, we separate the matrix A into blocks and perform the steps block by block to utilize the cache (LLC), which is similar to the strategy proposed by Dalton et al. [16]. Besides, we added special-case support for matrices whose values are only 0 or 1 (i.e., the non-zero values are always 1). This is typical when the matrix is an adjacency matrix and the edge weights of the graph are always 1. In this case, we can speed up the sort phase, because we only sort indices instead of (index, value) pairs.
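The sketch below is our scalar illustration of the expand-sort-compress flow described above, operating on the same Csr struct as the earlier reference sketch; std::sort stands in for the loop-raking radix sort used on the VE.

    #include <algorithm>
    #include <cstdint>
    #include <tuple>
    #include <vector>

    struct Triple { int64_t row, col; double val; };

    // One ESC pass over CSR operands: expand all partial products,
    // sort them by (row, col), then compress duplicates.
    std::vector<Triple> esc_spgemm(const Csr& A, const Csr& B) {
        std::vector<Triple> t;
        // Expansion: every non-zero A(k,j) contributes A(k,j) * B(j,:) to row k of C.
        for (int64_t k = 0; k < A.m; ++k)
            for (int64_t p = A.IA[k]; p < A.IA[k + 1]; ++p) {
                const int64_t j = A.JA[p];
                for (int64_t q = B.IA[j]; q < B.IA[j + 1]; ++q)
                    t.push_back({k, B.JA[q], A.A[p] * B.A[q]});
            }
        // Sorting by (row, col); the paper uses a loop-raking radix sort here instead of std::sort.
        std::sort(t.begin(), t.end(), [](const Triple& x, const Triple& y) {
            return std::tie(x.row, x.col) < std::tie(y.row, y.col);
        });
        // Compression: entries sharing the same (row, col) are added into one non-zero.
        std::vector<Triple> c;
        for (const Triple& e : t) {
            if (!c.empty() && c.back().row == e.row && c.back().col == e.col)
                c.back().val += e.val;
            else
                c.push_back(e);
        }
        return c;
    }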

Loop raking makes it possible to vectorize a loop that C. Hash based Algorithm cannot be vectorized otherwise. Besides, it can be used to The SpGEMM algorithm in the cuSPARSE library uses the enlarge vector length. However, it has several drawbacks. One hash table for the addition of sparse vectors [12]. Where the drawback is that memory access becomes non-contiguous. column index becomes the key of the hash table; the value is Especially, if the memory access becomes scatter or gather inserted if the key is not stored in the hash table. Otherwise, (e.g., access like a[b[i]]), the performance of memory access the value is accumulated to the already stored value. After this becomes non-optimal. Another drawback is the performance process, we can get the result row by extracting the stored key value pairs from the hash table. We can do this process for all Though SPA is a quite efficient algorithm, implementations the rows in parallel. of SpGEMM for highly parallel architectures, such as GPUs, We used loop raking technique to implement hash based usually avoid it. This is because if multiple rows of A are algorithm; each virtual processor (element of vector register) processed in parallel, the required number of the dense vectors processes different rows. In this case, each virtual processor is the number of parallelisms, which is too large and not updates the hash table for the corresponding rows sequentially; affordable memory size. we do not have to worry about the parallel update of the In the implementation of vector architecture, adding sparse hash table. To handle the collision, if a collision occurs, the vectors into the dense vector is processed in a vectorized way key is stored in a different array; the contents of the array is without loop raking technique to avoid using too many dense processed again by adding 1 to the hash value, which realizes vectors. In this case, parallelism (which corresponds to vector open address linear probing. length) is limited by the number of non-zeros of each sparse The column indices of the result matrix are not sorted in vector. Saving the non-zero index is done by loop raking the case of the hash-based algorithm. Since some algorithms manner; separate memory space is assigned to each virtual assume that they are sorted, we added an option to sort them, processor, and indices are stored independently. Like the hash- though it decreases the performance. based algorithm, The column indices of the result matrix is not sorted. We also added an option to sort them. E. Hybrid Algorithm As described above, The parallelism of SPA is limited by the number of non-zero elements of each sparse vector. In practical applications, the number of non-zero elements per row will vary greatly, as they usually follow the power-law distribution. Therefore, we propose a novel hybrid method. It combines SPA and other methods according to the average numbers of the non-zeros of intermediate sparse vectors. For example, that of kth row of A in Figure 4 is (3 + 4) / 2 = 3.5. First, the average number of non-zeros of each row is calculated, and the rows are sorted according to this value. Then, the matrix is divided by the user-defined threshold; Fig. 7. Hash based Algorithm the part with a higher average number of the non-zeros is processed by SPA, another part is processed by ESC or hash- based algorithm. D. Sparse Accumulator (SPA) Liu et al. proposed a method that separates the matrix into bins and uses different algorithms for each bin [17]. 
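The paper's SPA is vectorized on the VE; the following is only a minimal scalar sketch of the sparse-accumulator idea described in this subsection (dense value vector, flag vector, and a list of touched indices per output row), reusing the Csr struct from the reference sketch above.

    #include <cstdint>
    #include <vector>

    // Compute one output row C(k,:) of C = A * B with a sparse accumulator (SPA).
    // 'dense' and 'flag' have size B.n, are reused across rows, and 'flag' must be all
    // zeros on entry; 'touched' collects the column indices that become non-zero.
    void spa_row(const Csr& A, const Csr& B, int64_t k,
                 std::vector<double>& dense, std::vector<char>& flag,
                 std::vector<int64_t>& touched,
                 std::vector<int64_t>& out_cols, std::vector<double>& out_vals) {
        touched.clear();
        for (int64_t p = A.IA[k]; p < A.IA[k + 1]; ++p) {
            const int64_t j = A.JA[p];
            const double  a = A.A[p];
            for (int64_t q = B.IA[j]; q < B.IA[j + 1]; ++q) {
                const int64_t col = B.JA[q];
                if (!flag[col]) {               // first contribution to this column of row k
                    flag[col] = 1;
                    dense[col] = 0.0;
                    touched.push_back(col);     // so we never have to scan the whole dense vector
                }
                dense[col] += a * B.A[q];
            }
        }
        for (int64_t col : touched) {           // emit the non-zero part and reset the flags
            out_cols.push_back(col);
            out_vals.push_back(dense[col]);
            flag[col] = 0;
        }
    }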
It Sparse accumulator (or SPA) is a classic algorithm proposed is similar to our method in that it uses multiple algorithms. by Gilbert et al. [14]. It uses dense vectors whose size is the However, our method is unique in that it uses an efficient SPA same as the number of columns of C. The sparse vectors are algorithm for the part where the average number of the non- added into the dense vector. There is another flag vector that zeros is high to enlarge vector length for vector architecture. contains the information if the index of the vector contains a value or not. If the corresponding index of the flag vector F. Parallelization of SpGEMM with Vector Engine is false, it is set to true, and the index is saved into another SpGEMM can be parallelized by dividing the matrix by vector; this vector stores the non-zero part of the dense vector row and assigning them to each processor. However, to get after the process. By using this vector, the non-zero part can better scalability, it is essential to assign tasks evenly to the be known without scanning the flag vector or the dense vector processors for load balancing. In our implementation, we count that contains the value. the number of non-zero of intermediate sparse vectors and divide the matrix according to the number, which achieved better load balancing. Figure 9 illustrates this process. For the two operand ma- trices of SpGEMM - A and B, we applied 1D decomposition [18] by row for the left side matrix, i.e., we leave the right side matrix B untouched, but for the left side matrix A we split it into row blocks. However, to achieve better load balancing, we do not split the rows evenly but based on the number of non- zeros (denoted by a solid dot). In this example, matrix A is split into 4 slices. Although the number of rows is not the same Fig. 8. SPA Algorithm for each slice, getting the non-zeros evenly distributed can help A (m x p) B (p x n) 1 socket 2 sockets Slice A1 6 Slice A2 4

Slice A3 2

Slice A4 0

Performance improvement Protein FEM webbasecit-patentswb-edu Circuit Accelerator

Fig. 10. Performance improvement over CPU. Calculated by NEC-Hybrid GFLOPS . NEC-Hybrid has an average performance improvement of SpGEMM(A1,B) SpGEMM(A2,B) SpGEMM(A3,B) SpGEMM(A4,B) CPU GFLOPS VE Core 1 VE Core 2 VE Core 3 VE Core 4 139% over CPU, with a maximum performance improvement of 6.43x.

MPI Reduce and 2 sockets of Xeon 6126 Gold. The API we used is mkl_sparse_spmm and mkl_sparse_order; the API Final result mkl_sparse_order is for sorting the index of the result. C (m x n) We utilized shared memory parallelization for MKL that is provided by the library. Fig. 9. Parallelizing SpGEMM using MPI with partitioned row blocks Figure 11 shows the evaluation result. Hybrid is a hybrid of ESC and SPA method. We used the single-precision floating- achieve better load balance. We utilize MPI for parallelization, point as the value and 32bit integer as the index. All the results and each slice is assigned to a different MPI process so that include the sorting time of the index. the work can be conducted in parallel on the VE cores. For The matrices are grouped into three categories. As for actual datasets, the split number could be in thousands. Protein and FEM/Accelerator, the NNZ per row is relatively G. Evaluation large. Therefore, SPA performs better than other methods. As for webbase and cit-patents, NNZ per row is small; There- We evaluated our implementation using sparse matrices fore, the ESC shows better performance than the SPA. For from the SuiteSparse Matrix Collection [19] that are com- wb-edu, the network size is relatively large, and the maximum monly used in papers like [13]. Table I shows the matrix used NNZ per row is much larger than average. Therefore, our for evaluation. hybrid method performs the best. Our hybrid method shows TABLE I stable performance. Although it is not the best performer in MATRICES FOR EVALUATION all situations, it avoids some of the noticeable shortcomings of ESC and SPA. In all the cases, the hash-based method nnz/row max nnz/row intermed. nnz does not show better performance than other methods. Since Protein 119.3 204 555,322,659 it consists of a complex branch in the loop raking technique FEM/Accelerator 21.7 81 79,883,385 to implement the hash table; the overhead of loop raking webbase 3.1 4700 69,524,195 caused poor performance. Our implementation shows better cit-patents 4.4 770 82,152,992 performance than CPU and also shows good scalability. wb-edu 5.8 3841 1,559,579,990 Circuit 5.6 353 8,676,313 IV. VECTORIZATION OF GRAPH ALGORITHMSWITH SPGEMM The performance is evaluated by A2. The FLOPS is cal- culated as (the number of non-zero of intermediate sparse We have implemented a high-performance linear algebra vectors) * 2 / (execution time), which is commonly used as kernel SpGEMM on the vector engine. In this section, we will SpGEMM evaluation. The execution time does not contain I/O introduce two important graph analysis algorithms: triangle but contains the counting cost for load balancing. counting and butterfly counting. We will use linear algebra We measured the performance using 1, 2, 4, and 8 VEs. operations, mainly SpGEMM, to implement these algorithms Since SX-Aurora TSUBASA (A300-4) contains 4 VEs per so they can take advantage of the NEC vector engine. server. We used two servers for evaluation of 8 VEs, which are connected via InfiniBand. We used MPI also for parallelization A. Triangle Counting within the VE (flat-MPI); in the case of 8 VEs, there are 64 ranks in total. A triangle is a special type of a subgraph that is commonly To compare absolute performance, we also evaluated used for computing important measures in a graph. The the performance on Xeon using Intel MKL. We used 1 triangle counting algorithm consists of the following steps: « ¬

Fig. 11, panels (a) 1 socket, (b) 2 sockets, and (c) 4 sockets: GFLOPS (log scale) of the hash, ESC, SPA, NEC-Hybrid, and CPU kernels on Protein, FEM/Accelerator, webbase, cit-patents, wb-edu, and Circuit.

(a) Input graph G. There is only one 2-path (3,2,5) between node 3 and node 5 that satisfies 2 < 3 and 2 < 5. Path (3,4,5) is not considered since 4 > 3.

(b) Adjacency matrix A of G. The lower triangular part L and the upper triangular part U are marked in blue and red.

                Vertex1 Vertex2 Vertex3 Vertex4 Vertex5
        Vertex1 [  0       1       1       0       0  ]
        Vertex2 [  1       0       1       1       1  ]
    A = Vertex3 [  1       1       0       1       1  ]
        Vertex4 [  0       1       1       0       1  ]
        Vertex5 [  0       1       1       1       0  ]

(c) Line 3 of Algorithm 1: B = LU using the SpGEMM kernel. B_{5,3} and B_{3,5} correspond to the number of 2-paths between node 5 and node 3 satisfying 2 < 3 and 2 < 5.

             [ 0 0 0 0 0 ]   [ 0 1 1 0 0 ]   [ 0 0 0 0 0 ]
             [ 1 0 0 0 0 ]   [ 0 0 1 1 1 ]   [ 0 1 1 0 0 ]
    B = LU = [ 1 1 0 0 0 ] x [ 0 0 0 1 1 ] = [ 0 1 2 1 1 ]
             [ 0 1 1 0 0 ]   [ 0 0 0 0 1 ]   [ 0 0 1 2 2 ]
             [ 0 1 1 1 0 ]   [ 0 0 0 0 0 ]   [ 0 0 1 2 3 ]   (5x5)

Fig. 12. Adjacency matrix

Algorithm 1: Triangle counting
    input : Graph G = (V, E)
    output: number of triangles in G
    1  Generate the adjacency matrix A
    2  Split A into a lower triangular L and an upper triangular U
    3  B = LU                 // SpGEMM kernel
    4  C = U .* B             // element-wise multiplication
    5  return Σ_{i,j} C_{i,j}
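As a small self-contained check of Algorithm 1, the following sketch (ours, using plain dense matrices rather than the SpGEMM kernel) splits A into L and U, forms B = LU, masks it with U, and sums the result for the graph of Figure 12.

    #include <cstdio>
    #include <vector>

    using Mat = std::vector<std::vector<int>>;

    // Triangle counting following Algorithm 1 on a dense adjacency matrix.
    int count_triangles(const Mat& A) {
        const int n = static_cast<int>(A.size());
        long long total = 0;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                if (!(i < j && A[i][j])) continue;   // C = U .* B keeps only entries where U has an edge
                long long b = 0;                     // B[i][j] = (L*U)[i][j]: wedges i - k - j with k < i and k < j
                for (int k = 0; k < n; ++k)
                    b += (k < i ? A[i][k] : 0) * (k < j ? A[k][j] : 0);
                total += b;
            }
        return static_cast<int>(total);
    }

    int main() {
        // Adjacency matrix of the 5-vertex graph in Figure 12; it contains 5 triangles.
        Mat A = {{0,1,1,0,0},
                 {1,0,1,1,1},
                 {1,1,0,1,1},
                 {0,1,1,0,1},
                 {0,1,1,1,0}};
        std::printf("triangles = %d\n", count_triangles(A));   // prints: triangles = 5
        return 0;
    }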


Fig. 11, panel (d) 8 sockets: GFLOPS of the same kernels and matrices.

Fig. 11. Evaluation of SpGEMM kernels on VE

1) Split A into a lower triangular L and an upper triangular U: Given an adjacency matrix A, as shown in Figure 12 (a) and (b), the algorithm splits A into lower triangular and upper triangular pieces via A = L + U.
2) Calculate B = LU: In graph terms, the multiplication of L by U counts all the wedges of the form (i,j,k) where j is the smallest-numbered vertex. As shown in Figure 12 (c), only one wedge (5, 2, 3) between node 5 and node 3 satisfies 2 < 5 and 2 < 3. Correspondingly, B_{5,3} = 1.
3) Calculate U .* B (element-wise multiplication): The final step is to find whether the wedges close, by doing element-wise multiplication with the original matrix.

B. Butterfly Counting

A butterfly is a loop of length 4 in a bipartite graph; it is the simplest cohesive higher-order structure in a bipartite graph. [3] presented exact algorithms for butterfly counting, as shown in Algorithm 2, which can be considered the state of the art. Although this algorithm is fast, it is a loop-based, sequential algorithm. To achieve parallelization and vectorization, we have improved Algorithm 2 into Algorithm 3, which fully utilizes the linear algebra kernels. Our butterfly counting algorithm consists of the following steps:

1) Create the adjacency matrix A: Let G = (V = (L,R), E) be a bipartite graph, where L and R are its left and right parts, respectively. Suppose L = {L1, L2, ..., Lm} and R = {R1, R2, ..., Rn}. Then we can represent the adjacency matrix of G as A_{m×n}, where A_{i,j} = 1 if and only if Li and Rj are connected, and A_{i,j} = 0 otherwise. In this case, A is sometimes called the biadjacency matrix.
2) Calculate AA⊺: According to the properties of the adjacency matrix [20], if B = AA⊺, then B_{i,j} gives the number of walks of length 2 from vertex Li to vertex Lj. As shown in Figure 13, there are three paths of length 2 between L1 and L2, which are marked in three colors: red, blue, and green.
3) Set the elements on the diagonal of matrix B to 0 and add up: Since each butterfly is made up of two paths of length 2, the two paths share endpoints in L (or R). Thus, if there are B_{i,j} paths between Li and Lj, then the number of butterflies

(a) Input bipartite graph and the 2-paths between L1 and L2.

(b) Biadjacency matrix A of the input graph:

             R1 R2 R3 R4
        L1 [  1  1  0  1 ]
    A = L2 [  1  1  1  1 ]
        L3 [  0  0  1  1 ]

(c) Line 2 of Algorithm 3: B = AA⊺. B_{1,2} and B_{2,1} represent the number of 2-paths between L1 and L2.

              [ 1 1 0 1 ]   [ 1 1 0 ]   [ 3 3 1 ]
    B = AA⊺ = [ 1 1 1 1 ] x [ 1 1 0 ] = [ 3 4 2 ]
              [ 0 0 1 1 ]   [ 0 1 1 ]   [ 1 2 2 ]   (3x3)
                            [ 1 1 1 ]

Fig. 13. Adjacency matrix of bipartite graph

Algorithm 2: ExactBFC: Sequential Butterfly Counting
    input : Graph G = (V = (L,R), E)
    output: Butterfly(G)
    1  if Σ_{u∈L} d_u² < Σ_{v∈R} d_v² then
    2      A ← R
    3  else
    4      A ← L
    5  for v ∈ A do
    6      C ← hashmap
    7      for u ∈ Neighbour(v) do
    8          for w ∈ Neighbour(u) do
    9              C[w] ← C[w] + 1
   10      for w ∈ C do
   11          Butterfly(G) ← Butterfly(G) + (C[w] choose 2)

   12  return Butterfly(G)/2

Fig. 14 legend: Transpose, Split, NEC Hybrid SpGEMM, Element-wise Product, Communication.

Algorithm 3: Vectorized butterfly counting
    input : Graph G = (V = (L,R), E)
    output: number of butterflies in G
    1  Generate the adjacency matrix A
    2  B ← AA⊺                // SpGEMM kernel
    3  Set the elements on the diagonal of matrix B to 0
    4  return (1/2) Σ_{i,j} (B_{i,j} choose 2)

Fig. 14, panel (a) Triangle counting: normalized execution-time breakdown (0–100%) over 1, 2, 4, 8, 16, and 32 cores.
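As a companion check of Algorithm 3, this sketch (ours, dense and unvectorized) computes B = AA⊺, zeroes the diagonal, and sums the per-pair binomial coefficients for the bipartite graph of Figure 13.

    #include <cstdio>
    #include <vector>

    int main() {
        // Biadjacency matrix of the bipartite graph in Figure 13 (left vertices L1-L3, right vertices R1-R4).
        std::vector<std::vector<int>> A = {{1, 1, 0, 1},
                                           {1, 1, 1, 1},
                                           {0, 0, 1, 1}};
        const int m = 3, n = 4;
        long long twice = 0;                  // accumulates 2 x (number of butterflies)
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < m; ++j) {
                if (i == j) continue;         // zeroed diagonal: the two path endpoints must differ
                long long b = 0;              // B[i][j] = (A * A^T)[i][j] = number of 2-paths Li - Rk - Lj
                for (int k = 0; k < n; ++k)
                    b += A[i][k] * A[j][k];
                twice += b * (b - 1) / 2;     // choose 2 of the B[i][j] 2-paths to close a butterfly
            }
        // Every butterfly is counted once for (i, j) and once for (j, i), hence the division by 2.
        std::printf("butterflies = %lld\n", twice / 2);    // prints: butterflies = 4
        return 0;
    }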

Bi,j with Li,Lj as the endpoint is . Note that we only count 100% 2  2-paths that differ from the start and endpoints, so we set the 80% elements on the diagonal of matrix B to 0 to exclude those 60% paths where the start and endpoints are the same. For example, 40% 20 the matrix in Figure 13 is transformed into the following form % 0% 3 1 1 2 4 8 16 32 3 3 1 0 2 2 0 3 0    3  2   B = 3 4 2 → 0 = 3 0 1 2 2 Number of Cores 1 2 2  1 2 0  0 1 0 2 2 3×3 (b) Butterfly counting 1 Bi,j Fig. 14. Normalized execution time breakdown. Finally, we calculate 2 i,j 2 to get the exact number of “butterflies”. P  V. EXPERIMENTSONSUBGRAPHCOUNTING ANDMACHINELEARNINGALGORITHMS datasets. The results are shown in Figure 14 for both triangle A. System Architecture and Implementation counting and butterfly counting. The fact that the two computa- In the experiments, we use: 1) a single node of dual-socket tion kernels (SpGEMM and Element-wise Product) consume Intel(R) Xeon(R) Gold 6126 (architecture Skylake), 2) a single the majority of the execution time was the motivation why node of a dual-socket Intel(R) Xeon(R) CPU E5-2670 v3 we have implemented and optimized the algebra kernels on (architecture Haswell), 3) a single NEC Aurora node with 4 the NEC Aurora platform. We can see from the figure that VEs. More details of the testbed hardware can be seen in Table with the increasing number of cores used, the two kernels II. are consuming less proportional time which suggests that the parallel execution of the kernels has decreased the overall B. Execution Time Breakdown execution time. On another hand, the proportion of the time We have applied the NEC SpGEMM algebra kernels in spent on non-parallelized code, which includes the preparation our triangle counting and butterfly counting implementation. of the data and other overhead introduced with handling multi- We instrumented the code to show the normalized time spent processes (split, transpose operations, and MPI communication breakdown on different portions of the code with different and synchronizations) is increasing. TABLE II HARDWARE SPECIFICATIONS AND DATASETS USED Vector reg- memory memory L1 L2 L3 frequencyphysical Vector Peak Per- Arch Model ister width bandwidth capacity cache cache cache (GHz) cores register formance (bits) (GB/s) (GB) (KB) (MB) (MB) CPU Xeon Gold 2.6 12 512 2 998GF(SP) 119 125 768 12 19.25 Skylake 6126 CPU E5-2670 883GF(SP) 2.3 12 256 1x24 95 125 768 3 30 Haswell v3 441GF(DP) SX-Aurora Vector 4.91TF(SP) TSUB- 1.4 8 16384 256 1200 48 32x8 2 16 Engine 2.45TF(DP) ASA

TABLE III 1VE (1 Core) 1VE (8 Cores) 4VEs (32 Cores) GRAPH DATASETS USED IN THE EXPERIMENTS

Data Vertices Edges Avg Deg 102 mouse-gene 29.0M 28.9M 2.00 coPaperDBLP 540k 30.5M 112.83 101 soc-LiveJournal1 4.8M 85.7M 35.36 100

wb-edu 9.8M 92.4M 18.78 Execution time (s) cage15 5.2M 94.0M 36.49 europe-osm 50.9M 109.1M 4.25 cage15 coPaper europe wb-edu hollywood-2009 1.1M 112.8M 197.83 hollywoodmouse-geneLiveJournal DBPedia-Location 172K+53K 293K 1.30 (a) Triangle counting Wiki-fr 288K+3.9M 22M 5.25 Twitter 175K+530K 1.8M 2.55 Amazon 2.1M+1.2M 5.7M 1.73 103 Journal 3.2M+7.4M 112M 10.57 1 Wiki-en 3.8M+21.4M 122M 4.84 10 Deli 833K+33.7M 101M 2.92 −1

Execution time (s) 10 web-trackers 27.6M+12.7M 140M 3.47

twitter amazon frwiki enwiki dbpedia livejournal delicious trackers C. Execution Time Evaluation (b) Butterfly counting We used the datasets in Table III to evaluate the performance Fig. 15. Execution Time of Triangle counting and Butterfly counting on a different number of processes utilizing the up to 32 cores of the 4 VEs on the single NEC Aurora node. For butterfly 6 counting, we also compared the execution with the reference 4 BFC exact [3] running on Haswell CPU. Figure 15 shows the execution time of triangle counting and butterfly counting for 2 Speedup the different datasets on single VE (1 core and 8 cores) and 0 4 VEs (32 cores being utilized). Figure 16 shows the speedup of our algorithm while using one single VE comparing to the frwiki dbpedia twitter amazon enwiki trackers BFC Exact result (normalized to 1). We can see from this livejournal delicious figure that for the larger datasets, single VE (with all 8 cores Fig. 16. Butterfly counting speedup (1VE vs BFC Exact on CPU) being utilized) achieved better performance with a factor of 3- 5 comparing to the BFC Exact algorithm running on Haswell CPU. utilizing multiple VE cores comparing to one core on the left side; and on the right side, it shows the results from the work D. Strong Scaling Evaluation in [2]. For the scaling on VE most datasets showing close The execution time breakdown shown in Figure 14 has to linear scalability till the number of processes reached over already suggested that the algebra kernels helped achieve good 10, which is better than the right side chart. Figure 18 shows strong scaling. Here we are showing this more clearly in similar results for Butterfly counting, although it decreases Figure 17 and Figure 18. Figure 17 shows the speedup when after 8 processes when comparing to the Triangle counting results. choice among all, the hybrid mode with default parameters can produce balanced and good enough results without the need cage15 copapers europe mouse to get more details of the dataset beforehand. As mentioned before in the SpGEMM kernel evaluation section, knowing 16 16 the characteristics of the dataset, e.g., the distribution of the 8 8 NNZ per row, could help choose the best algorithm/parameter 4 4 options, but our experiments were done with hybrid mode and

Speedup Speedup default parameters already show good results. For the SPA and 2 2 SPA SORT options, the difference is that with SPA SORT the 1 1 column indices of the result matrix are sorted while for SPA, 1 2 4 8 16 32 1 2 4 8 16 32 they are not. While sorting added more overhead compared Number of processes Number of processes to the non-sorted version, this feature is especially useful (a) VE (b) CPU if the result matrix is to be chained as input to another Fig. 17. Triangle counting scalability test on single process algorithm that assumes the column indices are sorted. E.g., in our triangle counting application, we have an element-wise multiplication step after SpGEMM, which utilized another kernel that requires the input matrices to be sorted with column dbpedia 16 indices. Thus we would need to use the hybrid algorithm of twitter ESC and SPA SORT. While for butterfly counting, the order 8 amazon of the column indicators does not affect the counting result. 4 frwiki Thus we use the hybrid algorithms of ESC and SPA to get

Speedup enwiki better performance. 2 livejournal

1 delicious 1 2 4 8 16 32 Hybrid Default Threshold Hybrid Increased Threshold trackers Number of processes Hybrid Decreased Threshold ESC SPA_Sort

6 Fig. 18. Butterfly counting scalability with increasing number of processes 4

E. SpGEMM Kernel Algorithms and Parameter Options Time (s) 2 The SpGEMM kernel provided multiple algorithms to 0 choose from, among HASH, SPA, SPA SORT, ESC, and coPaper livejournal hollywood hybrid of any two of these. For the hybrid algorithm, a user can change the default threshold parameters as part of the Fig. 19. SpGEMM Time from Triangle Counting Execution Breakdown using function call. The evaluation we have done was using the Different SpGEMM Algorithms and Parameter Options hybrid algorithm of ESC and HPA/HPA SORT with default parameters. In this section, we are showing the results of alter- native algorithms and parameters. We used triangle counting VI. CONCLUSION as an example on three different datasets to show the impact of the algorithms and parameters options. For the algorithms, In this paper, we have introduced a new vectorized We tested the ESC alone, SPA SORT, and the hybrid of the SpGEMM algorithm for butterfly counting in bipartite graphs two; For parameters choice of the hybrid mode, a user can and also adapted another vectorized triangle counting algo- specify 3 parameters N1, N2, N, meaning number of columns rithm, on the NEC Aurora platform. The algorithms are all to process at a time for the first separated matrix, the number vectorized, makes it very suitable to run on the vector engine of columns to process for the second matrix, and the threshold of the NEC Aurora system. to separate the two matrices, respectively. The default values In the SpGEMM kernel evaluation (section III-G), NEC- for these 3 parameters are 256, 4096, and 64, in that order. Hybrid has an average performance improvement of 139% We used default values, the increased values by timing 4, over CPU, with a maximum performance improvement of and the decreased values by dividing 4 from the default 6.43x. In the test of the triangle counting algorithm, our parameters. The results are shown in Figure reffig:spgemm- implementation shows high scalability compared to related options. The columns are the SpGEMM breakdown time from work [2]. In the test of the butterfly counting algorithm, our the overall execution time of running triangle counting on the implementation on large datasets has achieved up to 6 times testing datasets. The left-most column for each dataset is the faster performance even with a single VE, and more than 10 hybrid algorithm with default parameters setting. We can see times faster when multiple VEs are used from one node. With from figure 19 that while this may not be the best-performed the optimized linear algebra kernels such as SpGEMM, both Graph algorithms demonstrate good performance and scalabil- [18] A. Buluc¸ and J. R. Gilbert, “Highly parallel sparse matrix-matrix ity. This work can be extended to support other applications multiplication,” arXiv preprint arXiv:1006.2183, 2010. [19] T. A. Davis and Y. Hu, “The university of florida sparse matrix and architecture-specific code optimizations in future work. collection,” ACM Trans. Math. Softw., vol. 38, pp. 1:1–1:25, Dec. 2011. [20] R. P. Stanley, “Algebraic combinatorics,” Springer, vol. 20, p. 22, 2013. VII. 
ACKNOWLEDGMENT
We gratefully acknowledge the support from NEC Corp.; NSF CIF21 DIBBS 1443054: Middleware and High Performance Analytics Libraries for Scalable Data Science; NSF EEC 1720625: Network for Computational Nanotechnology (NCN) Engineered nanoBio node; NSF OAC 1835631 CINES: A Scalable Cyberinfrastructure for Sustained Innovation in Network Engineering and Science; and Intel Parallel Computing Center (IPCC) grants. We would like to express our special appreciation to the FutureSystems team.

REFERENCES [1] “Nec sx-aurora tsubasa - vector engine.” https://www.nec.com/en/global/ solutions/hpc/sx/vector engine.html. Accessed: 2019-09-05. [2] A. Azad, A. Buluc¸, and J. Gilbert, “Parallel triangle counting and enumeration using matrix algebra,” in 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 804– 811, IEEE, 2015. [3] S.-V. Sanei-Mehri, A. E. Sariyuce, and S. Tirthapura, “Butterfly counting in bipartite networks,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2150–2159, ACM, 2018. [4] “Spgemm on nec vector engine.” https://github.com/dsc-nec/frovedis matrix. Accessed: 2019-09-05. [5] N. Alon, P. Dao, I. Hajirasouliha, F. Hormozdiari, and S. C. Sahinalp, “Biomolecular network motif counting and discovery by color coding,” Bioinformatics, vol. 24, no. 13, pp. i241–i249, 2008. [6] F. Battiston, V. Nicosia, M. Chavez, and V. Latora, “Multilayer motif analysis of brain networks,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 27, no. 4, p. 047404, 2017. [7] J. Kepner and J. Gilbert, Graph algorithms in the language of linear algebra. SIAM, 2011. [8] L. Chen, J. Li, A. Azad, L. Jiang, M. Marathe, A. Vullikanti, A. Niko- laev, E. Smirnov, R. Israfilov, and J. Qiu, “A graphblas approach for subgraph counting,” arXiv preprint arXiv:1903.04395, 2019. [9] I. V. Afanasyev, V. V. Voevodin, V. V. Voevodin, K. Komatsu, and H. Kobayashi, “Analysis of relationship between simd-processing fea- tures used in nvidia gpus and nec sx-aurora tsubasa vector processors,” in International Conference on Parallel Computing Technologies, pp. 125– 139, Springer, 2019. [10] K. Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, and H. Kobayashi, “Performance evaluation of a vector supercomputer sx-aurora tsubasa,” in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 685–696, IEEE, 2018. [11] N. Bell, S. Dalton, and L. N. Olson, “Exposing fine-grained parallelism in algebraic multigrid methods,” SIAM J. Sci. Comput., vol. 34, no. 4, pp. C123–C152, 2012. [12] J. Demouth, “Sparse matrix-matrix multiplication on the gpu,” in Pro- ceedings of the GPU Technology Conference, 2012. [13] Y. Nagasaka, A. Nukada, and S. Matsuoka, “High-performance and memory-saving sparse general matrix-matrix multiplication for nvidia pascal gpu,” in Proceedings of 46th International Conference on Parallel Processing (ICPP), 2017. [14] J. R. Gilbert, C. Moler, and R. Schreiber, “Sparse matrices in MATLAB: Design and implementation,” SIAM J. Matrix Anal. Appl., vol. 13, no. 1, pp. 333–356, 1992. [15] M. Zagha and G. E. Blelloch, “Radix sort for vector multiprocessors,” in Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pp. 712–721, ACM, 1991. [16] S. Dalton, L. Olson, and N. Bell, “Optimizing sparse ma- trix—matrix multiplication for the gpu,” ACM Trans. Math. Softw., vol. 41, pp. 25:1–25:20, Oct. 2015. [17] W. Liu and B. Vinter, “A framework for general sparse matrix-matrix multiplication on gpus and heterogeneous processors,” J. Parallel Dis- trib. Comput., vol. 85, pp. 47–61, Nov. 2015. A Distributed Deep Memory Hierarchy System for Content-based Image Retrieval of Big Whole Slide Image Datasets

Esma Yildirim Shaohua Duan Xin Qi Department of Mathematics and Rutgers Discovery Informatics Institute Rutgers Cancer Institute of New Jersey Computer Science Rutgers University Rutgers University Queensborough Community Piscataway, NJ New Brunswick, NJ College of CUNY [email protected] [email protected] Bayside, NY [email protected]

Abstract—Whole slide images (WSIs) are very large (30-50GB [1], [5] . In a recent study [7], we have developed efficient, each in uncompressed format), multiple resolution tissue images highly scalable parallel and distributed data access methods produced by digital slide scanners, and are widely used by for WSI datasets, however these methods are still impaired by pathology departments for diagnostic, educational and research purposes. Content-based Image Retrieval (CBIR) applications disk I/O performance because the WSI data is very large and allow pathologists to perform a sub-region search on WSIs to can only be accessed from the file system. automatically identify image patterns that are consistent with DataSpaces [8] is a distributed data staging framework a given query patch containing cancerous tissue patterns. The designed for scientific workflows. Data, represented in a multi- results can then be used to draw comparisons among patient dimensional tensor format, is indexed and distributed among samples in order to make informed decisions regarding likely prognoses and most appropriate treatment regimens, leading to the memories of distributed data staging servers in a cluster new discoveries in precision and preventive medicine. of computers and is accessed using simple get() and put() CBIR applications often require repeated, random or se- operations, given the partial coordinate plane information with quential access to WSIs, and most of the time the images are respect to the dimension sizes in the tensor data object. preprocessed into smaller tiles, as it is infeasible to bring the One downside of this approach is that the memory space of entire WSI into the memory of a computer node. In this study, we have designed and implemented a distributed deep memory each node is limited and it would require a large number of hierarchy data staging system that leverages Solid-State Drives data staging nodes to accommodate the entire WSI dataset. (SSDs) and provides an illusion of a very large memory space Such a system could benefit from deep memory hierarchies that can accommodate big WSI datasets and prevent subsequent that include SSDs, which are faster than disk systems. SSD accesses to the file system. An I/O intensive sequential CBIR performance has improved even more with the introduction of workflow for searching cancerous patterns in prostate carcinoma datasets was parallelized and the I/O paths were altered to NVM-e technology. While SATA SSDs perform much better include the proposed memory system. Our results indicate that than hard-disk drives(HDDs), NVMe SSDs present stellar the parallel performance of the CBIR workflow improves and our performances in terms of bandwidth, I/O operations per second deep memory hierarchy, staging framework produces negligible and latency surpassing the performance of SATA SSDs [9], overheads for the application performance even when the number [10]. Current HPC systems are turning into installing SSDs of staging servers and their memory sizes are limited. as node-local storage to improve application I/O performance Index Terms—whole slide images, distributed data staging, big data analytics, content-based image retrieval rather than having none at all or depend on a shared parallel disk system. 
In this study, we design and implement the necessary con- I.INTRODUCTION structs and algorithms for a distributed deep memory hierarchy Content-based Image Retrieval workflows require expensive staging system that leverages NVMe SSDs to improve the computational operations such as low-level transformations, I/O performance of content-based image retrieval workflows, object segmentation, feature computation, filtering and sub- even in the presence of limited memory space. Our proposed sampling [1], [2]. In most studies [1], [2], [3], [4], [5], the architecture is coded as modules into DataSpaces. As a case workflow starts with the preprocessing of WSI data into study, we have parallelized a sequential CBIR workflow smaller tiles so that they can fit into the available memory that searches for glandular cancerous structures in prostate on the system, and the computation is easily parallelized carcinoma datasets, using MPI [11] and OpenMP [12]. The by distributing these tiles. The very few parallel data access experimental results show that the hierarchical memory system methods, do not scale well [6] or introduce additional overhead introduces only slight increase in data read performances compared to the pure memory scenario and it performs very well compared to disk access scenario.

II.CASE STUDY:APROSTATE CARCINOMA CBIR WORKFLOW The example CBIR workflow is based on the feature ex- traction and clustering stages of the work of Qi et al. [2]. We first parallelized the stages using MPI and OpenMP and then altered the I/O paths to integrate DataSpaces into the workflow. The CBIR workflow consists of multiple feature extraction and clustering stages to find the coordinates of areas (Regions Of Interest (ROIs)) in a WSI dataset that show similar charac- teristics to the query patch at hand. The workflow starts with the Coarse Searching stage(Fig.1.a). In this stage, a sliding (a) Coarse Searching window approach is used to compare the query patch image to the window (a.k.a ROI) in the WSI tile. The size of the sliding window is equal to the size of the query patch and also each tile is a cropped piece from a much larger WSI. For each comparison pair (query patch, ROI), the pair of images selected are divided into rectangular rings. As a result of a fea- ture extraction operation(e.g. colour histograms, texture), the distance between the feature vectors are calculated and sorted in ascending order. Then top 10% of the ROI coordinates with best distance results are selected and fed into a Fine Searching stage (Fig.1.b), where accesses to the WSI data is random, because the best coordinate results can come from anywhere from the dataset. This percentage is variable parameter and can be decided by the user. In the Fine Searching stage, the comparison pair is also divided into segments in addition to the rectangular rings. After a Clustering (Fig.1.c) operation, in (b) Fine Searching which overlapping ROI coordinates are combined on the top 10% of the results of the Fine Searching stage, the distance values of the clustered coordinates are re-calculated, which again requires a Final Fine Searching stage. The parallelisation process starts with constructing a list of tile coordinates with given width and height (e.g.1024x1024 pixels) values from the entire WSI dataset. The coordinates of the tiles are then shared among multiple MPI processes. Each process accesses the parallel file system (GPFS) and reads the tiles assigned to itself. Then another level of parallelism is applied once the tile is in the memory. The feature calculations (c) Clustering of rectangular rings are shared among OpenMP threads. The Fig. 1: CBIR workflow stages. highest possible level of parallelism that can be achieved in Coarse Searching and Fine Searching Stages is # of processes x # threads, which corresponds to the number of computer stages improves dramatically because memory access is much nodes of the cluster and number of cores per node respectively. faster than disk access. In the Clustering stage, only MPI processes are used to share tiles due to the restrictions on the clustering algorithm. III.SYSTEM ARCHITECTURE We have redesigned the I/O paths of the workflow, so that only in the Coarse Searching stage, the WSI dataset is read One downside of distributed data staging is that the memory from the parallel file system by the application and then it space of each DataSpaces server can be limited and it would is written into DataSpaces using dspaces put() interface. It require a large number of data staging servers to accommodate is accessed from there using dspaces get() interface by the if the WSI dataset is very big. Such a system could benefit subsequent stages of the application (Fig. 2). 
Therefore the from deep memory hierarchies that include SSDs, which are performance of Fine Searching and the Final Fine Searching faster than disk systems. The proposed framework consists of Fig. 2: CBIR I/O Path with DataSpaces Fig. 3: Flowchart of a ds hint() call

Persistence, Caching, and Prefetching modules that are coded operates as a concurrent thread under the main DataSpaces into the DataSpaces architecture. server thread to maximise the performance by pipelining the prefetching I/O operations between SSD and DRAM memory A. Persistence Module and the I/O operations happening between client application The Persistence Module provides a light-weight approach and DataSpaces server. Figure 3 presents the actions performed for exploiting SSD as a secondary memory partition so that by the prefetch module after a call to ds hint() is made. Caching and Prefetching Modules can explicitly allocate mem- ory regions therein and read/write data. The POSIX mmap() IV. INTERFACES interface offers a viable way to map files in SSD giving the The framework offers the following updated and newly illusion that they reside in memory. We build management introduced interfaces: structures for the efficient use of node-local SSD memory of each DataSpaces server. The module creates one big mapped • dspaces init() calls persistence module to create the address space during initialisation then, it manages alloca- mapped file in SSD and launches the Prefetching thread. tion/deallocation and fragmentation of space through doubly • dspaces get() queries DataSpaces server to retrieve data. linked list data structures. If the data for the requested coordinates is in DRAM memory, the server returns it to the client. If it is in SSD, B. Caching Module it caches it into DRAM and changes data storage status The Caching Module implements a Least Recently into In memory ssd. If DRAM has no space, then it calls Used(LRU) algorithm to keep recently requested data in the cache replacement algorithm in the Caching Module DRAM memory and evict less recently used data back to to evict some data into SSD before bringing requested SSD. Data could be located in DRAM memory, SSD or data into DRAM. both, therefore we mark data storage status with three types: • dspaces put() inserts data in DRAM and changes data In memory, In ssd, and In memory ssd. storage status into In memory. If DRAM has no space then it calls the cache replacement algorithm before C. Prefetching Module inserting data into DRAM. The Prefetching Module presents an illusion of infinite • dspaces hint() queries DataSpaces servers to check if memory by making data available in DRAM memory be- data is in SSD. If so, the hint is inserted into prefetching fore the application uses it, thereby masking the latency of circular array. It wakes prefetching thread if the thread the slower SSD memory. We offer a new user interface is sleeping to fetch data from SSD to DRAM. After the dspaces hint() for the user application. By using this interface, operation is complete data storage status is changed into the user can hint the system of the upcoming data coordi- In memory ssd. nate planes the application would like to access, hence the Listing 1 presents a code snippet on how to use Prefetching Module can bring data from SSD beforehand. A dspaces hint() function in the Final Fine Searching stage. The circular array data structure is used to keep the prefetching coordinate information regarding the current tile is stored in requests. Unlike the previous modules, the Prefetching module vectors[i]. Before loading that tile into memory, the program hints about the tile that will be accessed next and the infor- the difference between the pure DRAM and DRAM+SSD mation about the next tile is stored in vectors[i+1]. 
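Below is a hypothetical end-to-end sketch of the interfaces listed above: the Coarse Searching stage writes a tile with dspaces_put(), and later stages read it back with dspaces_get(). The TileRef struct, the header name, and the put/get argument order (name, version, element size, ndim, lb/ub, buffer, mirroring the dspaces_hint() call in Listing 1) are our assumptions and should be checked against the DataSpaces headers used by the paper.

    #include <cstdint>
    #include <cstdio>
    #include <vector>
    #include "dataspaces.h"   // DataSpaces client header (assumed name; check the installed version)

    // Hypothetical tile descriptor; the field names follow the vectors[i] usage around Listing 1.
    struct TileRef {
        const char* file_name;
        long long   tile_x, tile_y;
        uint64_t    tile_width, tile_height;
    };

    // Stage one WSI tile into DataSpaces after it has been read from GPFS (Coarse Searching stage).
    // The argument order mirrors dspaces_hint() in Listing 1 and is an assumption, not a verified prototype.
    void stage_tile(const TileRef& t, std::vector<uint32_t>& pixels) {
        char name[256];
        std::snprintf(name, sizeof(name), "%s-%lld-%lld", t.file_name, t.tile_x, t.tile_y);
        uint64_t lb[2] = {0, 0};
        uint64_t ub[2] = {t.tile_width * t.tile_height - 1, 0};   // tile flattened to 1-D, as in Listing 1
        dspaces_put(name, 0, sizeof(uint32_t), 2, lb, ub, pixels.data());
    }

    // Retrieve the same tile in a later stage (Fine Searching / Final Fine Searching).
    void fetch_tile(const TileRef& t, std::vector<uint32_t>& pixels) {
        char name[256];
        std::snprintf(name, sizeof(name), "%s-%lld-%lld", t.file_name, t.tile_x, t.tile_y);
        uint64_t lb[2] = {0, 0};
        uint64_t ub[2] = {t.tile_width * t.tile_height - 1, 0};
        pixels.resize(t.tile_width * t.tile_height);
        dspaces_get(name, 0, sizeof(uint32_t), 2, lb, ub, pixels.data());
    }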
The lb performances seems negligible. and ub variables contain the dimension information while the In another setting, we increased the number of DataSpaces hint name is used to refer to a data object. After adding an servers from 2 to 4 (Fig. 5) but kept the total DRAM entry to the circular prefetching array structure dspaces hint() sizes equal. We observed that it only helped improve the function returns immediately. performance when the number of the parallel client processes

1 were very high (e.g. 192). For 24 core results which are not 2 sprintf(hint name , ”%s−%l l d −%l l d ” ,vectors[i+1][0]. presented here, increasing the number of servers negatively file name ,vectors[i+1][0].tile x ,vectors[i impacted the performance. This indicates that increasing the +1][0]. tile y ) ; 3 u i n t 6 4 t ub[2],lb[2]; number of staging servers is good only if the client application 4 l b [ 0 ] = 0 ; I/O access load is very high. Otherwise the indexing and extra 5 ub[0] = lb[0] + vectors[i+1][0].tile w i d t h * I/O messaging of the staging system only causes performance vectors[i+1][0]. tile h e i g h t − 1 ; 6 lb[1] = ub[1] = 0; degradation. 7 d s p a c e s h i n t ( hint name , 0 , s i z e o f ( u i n t 3 2 t ) , ndim , Another observation was that although we doubled the lb, ub, buffer); number of servers, it did not result in twice the speed up in Listing 1: Code snippet for use of dspaces hint() most cases. The reason for that is although we increased the number of servers, we kept the total DRAM size equal in both 2 server and 4 server settings. While an 8GB DRAM refers to V. EXPERIMENTAL RESULTS 4GB per server in 2-server setting, it refers to 2G per server In this section, we present the performance results of the in 4-server setting. In Figure 6, the difference between the proposed deep memory hierarchy system on the I/O operations performances of 2-server and 4-server settings is much more of the CBIR workflow. The input dataset consists of 100 significant when both settings have a 4G/server DRAM size. WSI images selected from the Cancer Genome Atlas [13] Prostate Carcinoma dataset in the second resolution level VI.CONCLUSION (48GB). A 500x500 pixel query patch with cancerous tissue The proposed deep memory hierarchy system to stage big patterns is searched in the dataset. The experimental testbed is WSI data reduced the execution time of a content-based image a cluster residing at Rutgers Discovery Informatics Institute, retrieval workflow while only introducing very little overhead. which consist of 128 24-core nodes with 256GB of memory Combining SSDs with DRAM provided the illusion of a very connected via FDR Infiniband. In this cluster, only a few nodes large data staging space even in the presence of limited DRAM were installed with the state-of-the-art NVM-e based SSDs and memory, and performed very well compared to the scenario we had limited access to them. Therefore, we were able to use where data was purely accessed from the filesystem. at most 4 NVM-e nodes to test our system. Fig.4.a and b present the read performance of the fine ACKNOWLEDGMENT searching and final fine searching stages in which the data is randomly accessed from 2 DataSpaces servers using pure The results published or shown here are in whole or part DRAM memory and 24G/16G/8G total DRAM + SSD mem- based upon data generated by the TCGA Research Network: ory settings and they are compared to the pure disk access http://cancergenome.nih.gov/. This research was funded, in version. The number of cores represents the parallelism level part, by grants from the National Institutes of Health through of the CBIR workflow and is ranged between 24-192. The contract 5R01CA156386-10 and 7R01CA161375-06 from the first observation is that the staging approach performs much National Cancer Institute; and contract 4R01LM009239-08 better than the parallel file system(disk) access. The effect of from the National Library of Medicine. 
introducing a hierarchical memory system produces negligible REFERENCES overhead and this overhead slightly increases as we decrease the DRAM memory size limit from 24GB to 8GB. This result [1] G. Teodoro, T. Kurc, J. Kong, L. Cooper, and J. Saltz, “Comparative proves our claim that using SSD is a viable solution when the performance analysis of intel (r) xeon phi (tm), gpu, and cpu: a case study from microscopy image analysis,” in Parallel and Distributed memory size per staging node is limited and the number of Processing Symposium, 2014 IEEE 28th International. IEEE, 2014, staging nodes are scarce for applications that stages big data. pp. 1063–1072. The total execution time of the workflow (Fig.4.c) also [2] X. Qi, D. Wang, I. Rodero, J. Diaz-Montes, R. H. Gensure, F. Xing, H. Zhong, L. Goodell, M. Parashar, D. J. Foran et al., “Content-based improves even though an extra stage of writing to DataS- histopathology image retrieval using cometcloud,” BMC bioinformatics, paces is introduced compared with the disk version results. vol. 15, no. 1, p. 287, 2014. The CBIR application is a compute-intensive application: a [3] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz, “Efficient multiple instance convolutional neural networks for gigapixel very large percentage of total execution time is spent on resolution image classification,” arXiv preprint, 2015. computation rather than I/O. Therefore, although significant [4] Y. Xu, T. Mo, Q. Feng, P. Zhong, M. Lai, I. Eric, and C. Chang, “Deep I/O improvements were achieved with our system (Fig.4.a learning of feature representation with multiple instance learning for medical image analysis,” in Acoustics, Speech and Signal Processing and b) compared with the pure disk version, that showed (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. itself have a little impact on the total execution time and 1626–1630. a)Fine Searching Stage - Tile Read Performance 128 DRAM 24G(12G/server)+SSD 64 16G(8G/server)+SSD 8G(4G/server)+SSD Disk 32

16 Seconds 8

4

2 24 48 96 192

#cores

b)Final Fine Searching Stage - Tile Read Performance 16 DRAM 24G(12G/server)+SSD 16G(8G/server)+SSD 8 8G(4G/server)+SSD Disk

4

Seconds 2

1

0.5 24 48 96 192

#cores

c)Total Workflow Execution Time 2048 DRAM 24G(12G/server)+SSD 16G(8G/server)+SSD 8G(4G/server)+SSD Disk 1024 Seconds 512

256 24 48 96 192

#cores

Fig. 4: Performance Results - #DSpaces servers=2 a)Fine Searching Stage - Tile Read Performance - 192 cores 5 2 servers 4 servers 4

3

Seconds 2

1

0 DRAM 24G+SSD 16G+SSD 8G+SSD

Memory Options

b)Final Fine Searching Stage - Tile Read Performance - 192 cores 1 2 servers 4 servers 0.8

0.6

Seconds 0.4

0.2

0 DRAM 24G+SSD 16G+SSD 8G+SSD

Memory Options

c)Total Workflow Execution Time - 192 cores

280 2 servers 4 servers 270 260

250 240 Seconds 230 220

210 200 DRAM 24G+SSD 16G+SSD 8G+SSD

Memory Options Fig. 5: Effect of Number of Staging Servers on Performance

Fig. 6: 4G/server performances of Fine Searching and Final Fine Searching stages in DRAM+SSD setting (192 cores; y-axis: seconds; series: 2 servers, 4 servers).

[5] N. Zerbe, P. Hufnagl, and K. Schlüns, "Distributed computing in image analysis using open source frameworks and application to image sharpness assessment of histological whole slide images," in Diagnostic Pathology, vol. 6, no. 1. BioMed Central, 2011, p. S16.
[6] G. Bueno, R. González, O. Déniz, M. García-Rojo, J. González-García, M. Fernández-Carrobles, N. Vállez, and J. Salido, "A parallel solution for high resolution histological image analysis," Computer Methods and Programs in Biomedicine, vol. 108, no. 1, pp. 388–401, 2012.
[7] E. Yildirim and D. J. Foran, "Parallel versus distributed data access for gigapixel-resolution histology images: Challenges and opportunities," IEEE Journal of Biomedical and Health Informatics, vol. 21, no. 4, pp. 1049–1057, 2017.
[8] C. Docan, M. Parashar, and S. Klasky, "DataSpaces: an interaction and coordination framework for coupled simulation workflows," Cluster Computing, vol. 15, no. 2, pp. 163–181, 2012.
[9] (2019) NVM Express overview. [Online]. Available: https://nvmexpress.org/wp-content/uploads/NVMe Overview.pdf
[10] (2019) NVMe vs SSD vs HDD. [Online]. Available: https://www.netweaver.uk/nvme-vs-ssd-vs-hdd/
[11] (2019) MPI standard. [Online]. Available: http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
[12] (2019) OpenMP standard. [Online]. Available: https://www.openmp.org
[13] R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt, "Toward a shared vision for cancer genomic data," New England Journal of Medicine, vol. 375, no. 12, pp. 1109–1112, 2016.

Performance Evaluation of Advanced Features in CUDA Unified Memory

Steven W. D. Chien Ivy B. Peng Stefano Markidis KTH Royal Institute of Technology Lawrence Livermore National Laboratory KTH Royal Institute of Technology Stockholm, Sweden Livermore, USA Stockholm, Sweden [email protected] [email protected] [email protected]

Abstract—CUDA Unified Memory improves the GPU pro- uses the virtual memory abstraction to hide the heterogeneity grammability and also enables GPU memory oversubscription. in GPU and CPU memories. Therefore, pages in the virtual Recently, two advanced memory features, memory advises and address space in an application process may be mapped to asynchronous prefetch, have been introduced. In this work, we evaluate the new features on two platforms that feature different physical pages either on CPU or GPU memory. Based on UM, CPUs, GPUs, and interconnects. We derive a benchmark suite CUDA runtime can leverage page faults, which is supported for the experiments and stress the memory system to evaluate on recent GPU architectures, e.g., Nvidia Pascal and Volta both in-memory and oversubscription performance. architectures, to enable automatic data migration between de- The results show that memory advises on the Intel-Volta/Pascal- vice and host memories. For instance, when a device accesses PCIe platform bring negligible improvement for in-memory exe- cutions. However, when GPU memory is oversubscribed by about a virtual page that is not mapped to a physical page on the 50%, using memory advises results in up to 25% performance device memory, a page fault is generated. Then, the runtime improvement compared to the basic CUDA Unified Memory. In resolves the fault by remapping the page to a physical page contrast, the Power9-Volta-NVLink platform can substantially on the device memory and copying the data. This procedure benefit from memory advises, achieving up to 34% performance is also called on-demand paging. Now with the hardware- gain for in-memory executions. However, when GPU memory is oversubscribed on this platform, using memory advises increases supported page fault and the runtime-managed data migration GPU page faults and results in considerable performance loss. on UM, oversubscribing the GPU memory becomes feasible. The CUDA prefetch also shows different performance impact on For instance, when there is no physical memory available on the two platforms. It improves performance by up to 50% on the the device for newly accessed pages, the runtime evicts pages Intel-Volta/Pascal-PCI-E platform but brings little benefit to the from GPU to CPU and then bring the on-demand page. Power9-Volta-NVLink platform. CUDA has introduced new features for optimizing the data Index Terms—CUDA Unified Memory, UVM, CUDA memory hints, GPU, memory oversubscription migration on UM, i.e., memory advises and prefetch. Instead of solely relying on page faults, the memory advises feature allows the programmer to provide data access pattern for each I.INTRODUCTION memory object so that the runtime can optimize migration Recently, leadership supercomputers are becoming increas- decisions. The prefetch proactively triggers asynchronous data ingly heterogeneous. For instance, the two fastest super- migration to GPU before the data is accessed, which reduces computers in the world [17], Summit and Sierra, are both page faults and, consequently, the overhead in handling page equipped with Nvidia V100 GPUs [6], [12] for accelerating faults. workloads. One major challenge in programming applications In this paper, we evaluate the effectiveness of these new on these heterogeneous systems arises from the physically memory features on CUDA applications using UM. Due separate memories on the host (CPU) and the device (GPU). 
to the absence of benchmarks designed for this purpose, Kernel execution on GPU can only access data stored on the we developed a benchmark suite of six GPU applications device memory. Thus, programmers either need to explicitly using UM. We evaluate the impact of the memory features manage data using the memory management API in CUDA or in both in-memory and oversubscription executions on two relying on programming systems, such as OpenMP 4.5 [7] and platforms. The use of memory advises results in performance RAJA [5], for generating portable programs. Today, a GPU can improvement only when when we oversubscribe the GPU have up to 16 GB memory on top supercomputers while the memory on the Intel-Volta/Pascal-PCI-E systems. On Power9- system memory on the host can reach 256 GB. Leveraging the Volta-NVLink based system, using memory advises leads to large CPU memory as a memory extension to the relatively performance improvement only for in-memory executions. small GPU memory becomes a promising and yet challenging With GPU memory oversubscription, it results in substantial task for enabling large-scale HPC applications. performance degradation. Our main contributions in this work CUDA Unified Memory (UM) addresses the challenges as are as follows: mentioned above by providing a single and consistent logical • We survey state-of-art practice in UM memory advises, view of the host and device memories on a system. UM prefetch, and GPU memory oversubscription. CPU GPU warps accessing the same page can cause multiple duplicated faults [18]. (1) *a = 1

(2) Page fault (3) Unmap from GPU B. Data Movement Advises (5) *a = 1 CUDA 8.0 introduces a new programming interface, (4) Map on Host called memory advise [15]. The concept is similar to Fig. 1: CPU writes to a page resident on the GPU, triggering posix_madvise in Linux, which uses application knowl- a page fault and the page is migrated to CPU. edge about access patterns to make informed decisions on page handling [1]. The UM advise focuses on data locality, i.e., whether a page is likely to be accessed from the host • We design a UM benchmark suite consisting of six or device. The main objective is to reduce unnecessary page applications for evaluating advanced memory features. migration and their associated overhead. Currently, developers • We evaluate the performance impact of memory advises, can specify three access patterns to the CUDA runtime: prefetch on two systems with Intel, Nvidia Pascal, and cudaMemAdviseSetReadMostly implies a read-intensive Volta GPUs connected via PCI-E and a system with IBM data region. In the basic UM, accessing a page on Power9 and Nvidia Volta GPU connected via NVLink. a remote side triggers page migration. However, with • Our results indicate that using memory advises improves cudaMemAdviseSetReadMostly, a read-only duplicate application performance in oversubscription execution on of the page will be created on the faulting side, which the Intel platform and in-memory executions on the IBM prevents page faults and data migration in the future. Figure 2a platform. illustrates an example, where the second access (step 5) has • We show that UM prefetch provides a significant per- no page fault and is local access. This mechanism, however, formance improvement on the Intel-Volta/Pascal-PCI-E results in a high overhead if there is any update to this memory based systems while it does not show a performance region because all copies of the corresponding page will be improvement on the Power9-Volta-NVLink based system. invalidated to preserve consistency between different copies. II.UNIFIED MEMORY Thus, this advice is often used in read-only data structures, In this section, we introduce the underlying mechanism in such as lookup tables and application parameters. GPU UM, and the three memory advises. We also describe cudaMemAdviseSetPreferredLocation sets the preferred the prefetching and memory oversubscription. physical location of pages. This advice pins a page and prevents it from migrating to other memories. Figure 2b A. CUDA Unified Memory illustrates a page preferred on the host side, and GPU UM creates a unified logical view of the physically separate uses remote mapping to access the page. This advice es- memories across host and GPU. Currently, modern CPUs sup- tablished a direct (remote) mapping to the memory page. port 48-bit memory addresses while Unified Memory uses 49- When accessing the page remotely, data is fetched through bit virtual addressing, which can address both host and GPU the remote memory instead of generating a page fault. If memories [14]. One of the main goals of Unified Memory is to the underlying hardware does not support the remote map- provide a consistent view of data between devices. The system ping, the page will be migrated as in the standard UM. ensures a memory page can only be accessed by one process at cudaMemAdviseSetPreferredLocation is useful for a time. When a process accesses a page that is not resident of applications with little data sharing between CPU and GPU, its memory system, a page fault occurs. 
The memory system i.e., part of the application is executed completely on the GPU, holding the requested page will unmap it from its page table, and the rest of the application executes on the host. Data that and the page will be migrated to the faulting process. Figure 1 is being used mostly by the GPU can be pinned to the GPU illustrates an example when the CPU accesses a page on GPU with the advice, avoiding memory thrashing. memory, and the page is migrated to CPU memory. Similarly, cudaMemAdviseSetAccessedBy establishes a direct map- when GPU accesses a page not physically stored on GPU ping of data to a specified device. Figure 2c illustrates an memory, the page will be moved to GPU. example of a physical page on GPU being remotely access UM was first introduced in CUDA 6.0 [21]. Only until from the host. When cudaMemAdviseSetPreferredLocation is the recent Nvidia Pascal that has hardware applied, CUDA runtime tries to build a direct mapping to the support for page faults, bi-directional on-demand page migra- page to avoid data migration so that the destination can access tion becomes feasible [14]. Resolving a page fault has high data remotely. Differently from cudaMemAdviseSetPreferred- overhead, and memory thrashing that moves the same pages Location, this cudaMemAdviseSetAccessedBy does not try to back and forth between the memories is even a performance pin pages on a specific device; instead, its main effect is to bottleneck. The massive parallelism on GPU further exacer- establish mapping on the remote device. This advice takes bates the page fault overhead because processes stall when effect on the creation of the memory pages. The mapping will page faults are being resolved, and multiple threads in different be re-established after the pages are migrated. CPU GPU CPU GPU A GPU B CPU GPU (1) create mapping (2) *a = 1 2) int x = *a; (1) *a = 1 (3) Remote access without fault 1) Mark Read-mostly 3) Page fault (2) page fault (3) create 5) Access mapping (4) Remote 4) Create read only copy access (4) Mapping re-created even after migration 6) Access (without fault) (b) A host-preferred region is directly re- (c) A GPU-resident region with accessed-by (a) A read-mostly region is duplicated to the mote mapped to allow remote access from CPU advise can be accessed by CPU through GPU to avoid page faults in the future. the GPU. remote memory access. Fig. 2: Page fault mechanism and effects of the three Memory Advise in Unified Memory.

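To make the three advises concrete, the sketch below shows how they are typically applied through the CUDA runtime API. This is a minimal illustration, not code from the benchmark suite: the buffer names, sizes, and the choice of which advise is placed on which buffer are hypothetical.

// Minimal sketch (not from the benchmark suite): applying the three advises
// described above to managed allocations. Buffer names and sizes are
// hypothetical and used only for illustration.
#include <cuda_runtime.h>

void apply_advises(size_t bytes, int gpu_id)
{
    double *lookup_table, *gpu_data;
    cudaMallocManaged(&lookup_table, bytes);
    cudaMallocManaged(&gpu_data, bytes);

    /* ... initialize lookup_table on the host ... */

    // Read-intensive data: a read-only copy is duplicated on the faulting side,
    // avoiding future faults and migrations (all copies are invalidated on a write).
    cudaMemAdvise(lookup_table, bytes, cudaMemAdviseSetReadMostly, gpu_id);

    // Pin pages on the GPU; supported hosts then reach them through a remote mapping.
    cudaMemAdvise(gpu_data, bytes, cudaMemAdviseSetPreferredLocation, gpu_id);

    // Let the CPU map the GPU-resident pages directly instead of migrating them.
    cudaMemAdvise(gpu_data, bytes, cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);
}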
C. Prefetching A. Application and Benchmarks The CUDA interface introduces an asynchronous page Our benchmark suite includes six applications, as specified prefetching mechanism, i.e., cudaMemPrefetchAsync() [15], in Table I. These applications include numerical solvers, to trigger data migration. The data migration occurs in a financial application, image processing, and graph problems. background CUDA stream to avoid stalling the computation The benchmark suite is available at a repository 1. threads. One natural optimization for prefetching a large For each application, we develop four versions in addition to number of pages is to split into multiple streams, i.e., a bulk the original version that uses explicit GPU memory allocation. transfer, to prefetch pages in a batch of streams concurrently. Our benchmarks use long data types to support large input If the page is prefetched to the device memory before the data problems in oversubscription executions. We use GPU kernel access, no page faults will occur, and the GPU benefits from execution time as the figure of merit. the high bandwidth on its local memory. We present detailed tracing results for BS, CG, and FDTD3d The behavior of the prefetching mechanism might change on selected platforms to study the implications of data move- when used in combination with CUDA memory advises. For ment. BS is a financial application that performs option pric- example, when cudaMemAdviseSetReadMostly is set, a read- ing. BS features good data reuse because the same input data only copy will be immediately created. Also, when prefetching set is used in multiple iterations in the application lifetime. CG a region with cudaMemAdviseSetPreferredLocation set to an- is a linear solver that solves a linear system Ax = b on the other destination memory, the pages will no longer be pinned GPU. An error is computed on the host using the results from to the preferred location. Thus, our evaluation considers the GPU computation after the solving iteration finishes. FDTD3d interplay between these two types of memory features. is a finite difference solver that reads and writes to two arrays D. Oversubscription of Device Memory in an interleaving manner. Both arrays are being initialized GPU memory has a relatively small capacity compared using the same data. The output eventually resides in one of to the system memory on CPU. One major limitation when the arrays. porting large-scale applications to GPUs is to overcome their 1) UM: The first version is an implementation that uses memory capacity to enable larger problems. UM in the post- UM with minimal changes. We simply replace the memory Pascal page fault capable GPUs can oversubscribe GPU mem- allocation in applications from cudaMalloc() to cudaMalloc- ory, allowing GPU kernels to use more memory than the Managed() and eliminate explicit data copy, i.e., cudaMem- physical capacity on the device. The memory oversubscription cpy(), between host and device. After the completion of a GPU is achieved through the traditional virtual memory manage- kernel, if the application has no subsequent host computation ment, i.e., selected memory pages on the device are evicted using the GPU results, an explicit data copy by memcpy() is to CPU to make space for newly requested pages. Currently, inserted to simulate a CPU computation using the results. the CUDA runtime uses the Least Recently Used (LRU) 2) UM Advise: The second version is UM with Advise. 
replacement policy to select victim pages when running out This version is based on the basic UM version and applies of space [19]. Some work also proposed pre-eviction to start memory advises to data structures in the application. A stall page eviction early to avoid stalling on the critical path [3]. in GPU execution, e.g., for resolving page fault, has a sig- nificant impact on performance due to massive parallelism. III.METHODOLOGY Thus, the main consideration for memory advises is to keep We develop a benchmark suite for evaluating UM and data used by GPU close to GPU memory. Therefore, we different data migration policies. Although several porting set a cudaMemAdviseSetPreferredLocation and specify the efforts have been reported for specific applications, there preferred location to GPU memory after the memory allo- lacks a suite of diverse kernels for controlled experiments cation of a data structure that is accessed by GPU in the across platforms. Thus, we extend the memory management in computation. If the data structure is initialized by the CPU, popular GPU benchmark and applications to utilize UM with advanced advise and prefetching features. 1https://github.com/steven-chien/um-apps TABLE I: Applications and data input sizes on different platforms.

Input size Intel-Pascal Input size Intel-Volta & P9-Volta Name Description (Approximate GB) (Approximate GB) In-memory Oversubscribe In-memory Oversubscribe Black-Scholes (BS) A financial application that performs option pricing. 4 6.4 15.2 26 Matrix Multiplication (cuBLAS) A general matrix matrix multiplication in single precision using cuBLAS. 3.9 6.3 15.2 25.4 Conjugate Gradient (CG) A conjugate gradient solver that solves a sparse linear system using cusparse. 3.8 6.4 15.4 25.4 Graph500 Breadth-first search (BFS) kernel of Graph500. 3.63 7.62 8.52 N/A Convolution 0 (conv0) A FFT-based image convolution using Real-to-Complex and Complex-to-Real FFT plans. 2.8 6.4 11.6 25.6 Convolution 1 (conv1) A FFT-based image convolution using Complex-to-Complex FFT plan. 3.5 6.7 13.6 25.5 Convolution 2 (conv2) A FFT-based image convolution using Complex-to-Complex FFT plan. 3.0 6.4 11.6 25.5 Finite Difference Time Domain (FDTD3d) A finite difference solver in three dimension. 3.8 6.4 15.2 25.3 we set a cudaMemAdviseSetAccessedBy CPU to keep the data the experiments. For each application variation, we perform physically on GPU but establish a remote mapping on CPU. benchmark runs up to five times and present the average GPU With this optimization, the host data initialization performs kernel execution time and standard deviation. An exception is remote accesses to initialize data in GPU memory directly. For Graph500, where we report the average and standard deviation constant data structures, the cudaMemAdviseSetReadMostly of BFS iterations. We separate our experiments into two advice is set after data initialization. This optimization will cases: when problem size fits into GPU memory and when only have page fault at the first access but keep all subsequent oversubscription of memory is required. Their problem sizes accesses local. are selected to be approximately 80% and 150% to GPU 3) UM Prefetch: The third version is UM with prefetch. We memory, respectively. A detailed list of sizes is presented in apply cudaMemPrefetchAsync to trigger page migration Table I. Due to the limitation in implementation for input at appropriate sites explicitly. We prefetch large data structures data size, we only examine Graph500 with oversubscription that will be accessed by GPU kernels in a background stream on Intel-Pascal. However, the input size does not follow the while the GPU kernel is launched in the default stream. 150% data size rule. After completing the GPU kernel execution, we prefetch the Apart from benchmark executions, we perform profiling arrays containing results to the host memory in the default runs using nvprof for selected applications. We obtain the trace stream. One advantage of bulk transfer in prefetch, compared by –print-gpu-trace. By selecting entries with Unified Memory to resolving individual page fault groups, is high memory Memcpy HtoD and Unified Memory Memcpy DtoH, we can bandwidth to utilize the hardware capability fully. Explicitly build a time series of data movement. Through a comparison triggering page pages in bulk improves transfer efficiency. of the time series and time spent on memory movement, it Furthermore, to prefetch pages avoids page faults as data is possible to compare and characterize the intensity of data already resides in the physical memory when the kernel starts movement between different application variations. executing. 4) UM Both: Finally, in the fourth version, we combine IV. 
RESULTS memory advises and prefetch to examine the mutual effects In this section, we present the performance and profiling of both techniques. results of the applications in four configurations: basic UM (UM), UM with Advise (UM Advise), UM with Prefetch (UM B. Test Environment Prefetch) and UM with both Advise and Prefetch (UM Both). We evaluate our benchmark applications on three platforms: Each application is evaluated in each configuration with two 1) Intel-Pascal is a single node system with Intel Core i7- problem sizes: one that fits into GPU memory (in-memory 7820X processor and 32 GB of RAM. It has one GeForce execution) and one that oversubscribes GPU memory (over- GTX 1050 ti GPU with 4GB memory. The GPU is subscription execution). We report the average and standard connected through PCIe. The operating system is Ubuntu deviation of GPU kernel execution time for each application. 18.10 and the host compiler is GCC 8.3. 2) Intel-Volta is a GPU node on Kebnekaise at HPC2N in A. In-Memory Execution Umea.˚ It has an Intel Xeon Gold 6132 processor with We present the GPU kernel execution time of the applica- 192 GB of RAM. The node has two Tesla V100 GPU tions in Fig. 3. The performance of all applications decreases with 16 GB memory and the GPU is connected through when using basic UM instead of explicit data movement be- PCIe. The operating system is Ubuntu 16.04 and the host tween CPU and GPU memories. Performances on Volta GPUs compiler is GCC 8.2. platforms have a larger performance decrease. In particular, 3) P9-Volta is a node with an IBM Power9 processor and our convolution and FDTD3d applications exhibit a drastic 256 GB of RAM. The system has four Tesla V100 GPUs increase in execution time. The execution time of conv2 and with 16 GB of HBM. The GPU is connected through FDTD3d are 14× and 9× higher respectively on P9-Volta. NVLINK to CPU. Performance change is similar to Intel-Volta. Performance Our platforms consist of two Intel systems that use Pascal decrease is less drastic but still considerable on Intel-Pascal. and Volta GPUs, and a Power9 system that uses Volta GPU. The execution of both applications is 2 − 3× slower than the All the systems use CUDA 10.1. We only use one GPU in execution time of applications using explicit data movement. (a) Intel-Pascal (b) Intel-Volta (c) Power9-Volta. Fig. 3: GPU kernel execution time of applications where data fits in GPU memory.
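As a concrete illustration of the UM Prefetch variant described in Section III, the sketch below allocates managed buffers, prefetches them to the GPU in a background stream before the kernel launch, and prefetches the result back to the host afterwards. It is a simplified example rather than the benchmark code: the kernel, buffer names, and launch geometry are hypothetical, and the explicit synchronization is added for clarity (the benchmarks overlap the prefetch with a kernel launched in the default stream).

#include <cuda_runtime.h>

__global__ void compute(const double *in, double *out, size_t n); // hypothetical kernel

void run_with_prefetch(size_t n, int gpu_id)
{
    size_t bytes = n * sizeof(double);
    double *in, *out;
    cudaMallocManaged(&in, bytes);   // replaces cudaMalloc(); no explicit cudaMemcpy() needed
    cudaMallocManaged(&out, bytes);

    /* ... initialize in[] on the host ... */

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Bulk migration to device memory avoids resolving many individual page fault groups.
    cudaMemPrefetchAsync(in, bytes, gpu_id, s);
    cudaMemPrefetchAsync(out, bytes, gpu_id, s);
    cudaStreamSynchronize(s);

    unsigned blocks = (unsigned)((n + 255) / 256);
    compute<<<blocks, 256>>>(in, out, n);

    // Bring the results back to host memory for the CPU post-processing step.
    cudaMemPrefetchAsync(out, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s);
    cudaFree(in);
    cudaFree(out);
}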

Fig. 4: Breakdown of total time spent handling page faults and data movement when applications are running in-memory. (a) BS on Intel-Pascal; (b) CG on Intel-Pascal; (c) BS on P9-Volta; (d) CG on P9-Volta.

Fig. 5: UM data transfer traces when running in-memory. (a) BS Intel-Pascal; (b) CG Intel-Pascal; (c) BS P9-Volta; (d) CG P9-Volta.

After applying advises, the performance of our applications performance of FDTD3d improves by up to 65%. Performance generally improves. It is possible to improve execution time improves by 50% on the P9-Volta system. However, the up to 15% on Intel-based platforms. The impact of advises improvement is less than when only advises are applied. is higher for the three FFT based convolution applications on Despite that, we notice that when both advises and prefetch Intel platforms. Advises have a significant impact on all the ap- are used together, it generally outperforms the performance of plications, and execution time can be improved by up to 70% applications using only advises or prefetch. on the P9 platform. Applications, such as CG and cuBLAS, To better understand the difference in terms of data move- results in similar execution time to the original version with ment between the versions, we plot the total time spent on explicit memory allocation. This implies that some advises are different UM events in Fig. 4 as stacked bar plots. They more effective than others on the P9 platform. show the total time spent on GPU page fault group handling Expensive page fault handling can be avoided by prefetching and data transfer, respectively. In particular, we have selected data to the GPU before execution. Our results show that BS and CG for the comparison. The bar plot reveals two prefetch has a much higher impact on Intel-based platforms important information for comparison: the time spent on data than P9-based platforms. Application performance generally movement, which correlates to the amount and efficiency of improves when prefetch is used: our results show that it has data transferred, as well as stalls due to page fault, which a much higher impact on Intel-based platforms than the case correlates to the number of page fault and efficiency of fault advise is used. The performance of FDTD3d improves by up resolution. to 56% on the Intel-Pascal system. The performance of Black- Since the Black-Scholes application uses the same input Scholes application is close to on-par with the application dataset repeatedly over iterations, when data size fits in mem- version using explicit data transfer. As for Intel-Volta, the ory, the first iteration will be slower due to page migration. Subsequent iterations should be able to execute at full speed as to secondary storage to free up memory in classical virtual data already resides in device memory. For this particular ap- memory management. Similarly to the CPU memory subscrip- plication, the advise cudaMemAdviseSetReadMostly is applied tion case, excessive use can lead to system slowdown and to the input arrays. No other advise is applied. The same goes can severely impact performance. Our results show that all for prefetch. Figs. 4a and 4c show the break down of total time applications execute correctly, even when running out of GPU spent on data-related activities on the two platforms for the memory. However, techniques that improved performance Black-Scholes application. Comparing to Intel-Pascal, the data for in-memory do not necessarily perform well when GPU transfer is much faster on P9-Volta, while the impact of stalling memory is oversubscribed. On the contrary, the use of these is less profound on Intel-Pascal. This can be attributed to the techniques without careful optimization can lead to severe larger input data used and a faster interconnect on P9-Volta. performance degradation. 
For UM Advise, the time spent on data transfer is similar We present the execution time of our applications in Fig. 6. while the time spent stalling due to page fault has reduced. Since the case does not exist with original versions with This suggests that page fault handling becomes more efficient explicit allocation, a comparison is not possible. Instead, the when the advises are applied. The observation is similar for minimal UM version is used as a baseline. By using advise, both Intel-Pascal and Intel-Volta similarly. When prefetch is specific applications can achieve up to over 20% improvement used, the same amount of data is being transferred while the on Intel platforms. Our P9 platform, on the other hand, shows stall due to page fault is eliminated. This implies the complete a negative impact when advises are used. To better understand elimination of page faults. By prefetching pages in bulk, data data movement, we perform tracing with the BS and CG on can be transferred at a fast pace to avoid future page faults Intel-Pascal, and with BS and FDTD3d on P9-Volta. when accessed. The observation can be confirmed by Figs. 5a For the Black-Scholes application, the use of advise results and 5c, where the detailed transfers are plotted as a time series. in performance improvement on Intel-Pascal. Fig. 7a shows When prefetch is applied, data is transferred as a block at a the breakdown of time spent on page-fault related events of much higher rate. BS between host and device while Fig. 8a shows the detailed The Conjugate Gradient application solves a linear system tracing on Intel Pascal. One significant difference between Ax = b iteratively. When applying advises, we set the default UM and UM advise is that a lot less time is spent preferred location of matrix A and vector b to GPU mem- on transferring data back to the host. The reduction in data ory. We also set a read-mostly advise on the sparse matrix movement can contribute to the improvement in performance. after completing initialization. The breakdown of time spent One possible reason for the reduction in data transfer from on Intel-Pascal and P9-Volta are shown in Figs.4b and 4d, device to host is that instead of migrating data from GPU to respectively. The use of advises results in similar time spent on host memory to make space, read-only data can simply be data transfer from host to device but a slight reduction in time discarded as a copy already exists on host memory. On the on stalls on Intel-Pascal system. A considerable reduction in other hand, on P9-Volta, significantly more time is spent on time spent on the host to device transfer and stall is observed stalls. This can be seen in Fig. 7c, where the total time is a few on the P9-Volta system. One reason is the use of preferred times higher than when no advise is used. Fig. 8c examines location advise, where the data arrays are initialized from the the data movement traces and clearly shows an intense data host on GPU memory through remote memory access. On movement in both directions. This implies that the read-mostly Power9, it is possible for the CPU to access GPU memory advise has an interestingly negative effect on P9-Volta when while this is not possible on Intel platforms. At the same data size exceeds device memory. A naive prefetch on Intel- time, time spent on transfer from device to host is largely Pascal provides performance improvement; however, it has eliminated on Intel-Pascal. One possible reason is due to the little to no effect on P9-Volta. 
read-mostly advise. Instead of migrating pages to the GPU CG on the Intel-Pascal platform benefits from using advise. from host memory, a read-only copy is copied to the GPU. The time breakdown for page faults and data movement This means that a copy of data exists in both memory systems. is shown in Fig. 7b. As in the case of the Black-Scholes When the Ax is being computed, A can be fetched directly in application, less time is spent on transferring data back to the host memory. Since P9-Volta initializes data directly in GPU host than in the case of basic UM. However, we note that a memory, a copy has to be fetched back to the host. In this similar amount of data is sent from host to device in the two case, the naive use of prefetch results in a reduction of time cases. This can also be seen in the detailed tracing in Fig. 8b, spent stalling. Despite the fact that more data is transferred where less device to host transfer is made. from device to host, the use of prefetch results in a higher FDTD3d is a finite difference solver, and it reads and transfer rate. The data transfer trace is presented in Figs. 5b writes to two arrays in an interleaving manner. Both arrays and 5d. When used in combination with advises, it results in are being initialized using the same data. One of the arrays is a reduction of time for data transfer and stall. being set to prefer GPU memory and will be accessed by the CPU. No advise is set on the other array. Since both B. Oversubscription Execution arrays will be written to during execution, no read-mostly Oversubscription of GPU memory is a key new feature advise is set for them. However, read-mostly is set for a small of UM. It resembles the paging of unused memory pages array that contains coefficients. Fig. 7d shows the breakdown (a) Intel-Pascal (b) Intel-Volta (c) Power9-Volta. Fig. 6: GPU kernel execution time of applications where data do not fit in GPU memory.
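To illustrate the oversubscription setup used in this section, the sketch below sizes a managed allocation at roughly 150% of the device memory reported by the runtime; when the kernel touches more pages than fit on the GPU, the runtime transparently evicts least-recently-used pages to host memory. This is an illustrative example only: the kernel, the launch configuration, and the sizing logic are hypothetical.

#include <cuda_runtime.h>

__global__ void touch(double *data, size_t n); // hypothetical kernel that sweeps the array

void run_oversubscribed(int gpu_id)
{
    size_t free_bytes, total_bytes;
    cudaSetDevice(gpu_id);
    cudaMemGetInfo(&free_bytes, &total_bytes);

    // Allocate ~150% of device memory as managed memory (the oversubscription ratio used here).
    size_t bytes = (size_t)(1.5 * (double)total_bytes);
    size_t n = bytes / sizeof(double);

    double *data;
    cudaMallocManaged(&data, n * sizeof(double));

    // The kernel still runs: pages that do not fit are evicted to the host on demand.
    touch<<<4096, 256>>>(data, n);
    cudaDeviceSynchronize();

    cudaFree(data);
}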

Fig. 7: Breakdown of total time spent handling page faults and data movement when input size exceeds GPU memory. (a) BS on Intel-Pascal; (b) CG on Intel-Pascal; (c) BS on P9-Volta; (d) FDTD3d on P9-Volta.

(a) BS Intel-Pascal (b) CG Intel-Pascal (c) BS P9-Volta (d) FDTD3d P9-Volta Fig. 8: UM data transfer traces when input size exceeds GPU memory. of time in handling data movement and page faults on P9- the impact of UM in applications while [13] investigated Volta. Similarly to the Black-Scholes application, the usage the programming model support for UM in OpenMP through of advise results in much higher spent on stalling. Execution an extended LLVM compiler. These studies lack the support time also increased significantly by approximately 3×. When of advanced memory features, which only become available prefetching, only one of those two data arrays is prefetched recently. Recent efforts in the operating system, such as as they are originally identical. Interestingly, less data is seen Heterogeneous Memory Management (HMM) in the Linux transferring in both directions when prefetch is used. Fig. 8d kernel [4], [8], [20], provides mechanisms to mirror CPU page shows the detailed tracing of the application. Smaller data table on GPU and integrate device memory pages in the system transfers at the beginning become a bulk transfer. This is also page table by adding a new type of struct page. reflected in the execution time, which is reduced from 60.9s CPU to GPU interconnect is another factor that impacts to 45.3s as well as a reduction in time spent stalling. One the performance of data movement directly. Extensive efforts possible reason is the size of the array being prefetched. Since have reported evaluation on modern GPU systems [6], [9]. For only one array, which represents 50% of the total problem instance, [16] developed a microbenchmark tool to evaluate size, is prefetched, the entire array can reside entirely on GPU the raw bandwidth performance with UM. While their works memory without needed to evict previously prefetched data. focus on interconnect performance and provide optimization V. RELATED WORK insights, our work focuses on the impact of advanced memory The separate memory system between host and GPU has features in optimizing the locality of pages. long been a programming challenge for developers. With UM, Some of the recent works that apply advanced features the runtime can transparently handle data movement between of UM are Deep-Learning frameworks. One example is OC- CPU and GPU. Earlier works [11], [13] have investigated DNN [2], an extended Caffe framework that uses UM to support the training of out-of-core batch sizes. They use [3] Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem. Interplay memory advises to trigger data eviction and prefetch to trigger Between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory. In Proceedings of the 46th International migration. They find these techniques useful in optimizing Symposium on Computer Architecture, ISCA ’19, pages 224–235. ACM, training performance. However, incorrect use can lead to 2019. performance degradation. [4] Jerome Glisse. Redhat heterogeneous memory management. https://linuxplumbersconf.org/event/2/contributions/70/attachments/ The memory oversubscription in GPU memory requires 14/6/hmm-lpc18.pdf, 2018. efficient page eviction to make space for newly requested [5] Richard D Hornung and Jeffrey A Keasler. The raja portability layer: pages. [3] proposed two pre-eviction policies using a tree- overview and status. Technical report, Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), 2014. 
based neighborhood prefetching technique to select candidate [6] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. pages. [10] introduced an ETC framework for eager page pre- Dissecting the NVIDIA Volta GPU Architecture via Microbenchmark- eviction and memory throttling in memory trashing. However, ing. arXiv preprint arXiv:1804.06826, 2018. [7] Ian Karlin, Tom Scogland, Arpith C Jacob, Samuel F Antao, Gheorghe- these optimization techniques target future GPU designs that Teodor Bercea, Carlo Bertolli, Bronis R de Supinski, Erik W Draeger, require hardware modifications. Alexandre E Eichenberger, Jim Glosli, et al. Early experiences porting three applications to openmp 4.5. In International Workshop on OpenMP, pages 281–292. Springer, 2016. VI.CONCLUSION [8] The kernel development community. The linux kernel 4.18.0. https: //www.kernel.org/doc/html/v4.18/vm/hmm.html, 2019. In this work, we investigated the impact of UM memory [9] Ang Li, Shuaiwen Leon Song, Jieyang Chen, Xu Liu, Nathan Tallent, advises, prefetch, and GPU memory oversubscription, on and Kevin Barker. Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite. In 2018 IEEE International Symposium CUDA application performance. We found that the perfor- on Workload Characterization (IISWC), pages 191–202. IEEE, 2018. mance of memory advises mostly depends on the system in [10] Chen Li, Rachata Ausavarungnirun, Christopher J. Rossbach, Youtao use and whether the GPU memory is oversubscribed. The Zhang, Onur Mutlu, Yang Guo, and Jun Yang. A framework for memory oversubscription management in graphics processing units. In Proceed- use of memory advises results in a performance improvement ings of the Twenty-Fourth International Conference on Architectural only when the GPU memory is oversubscribed on the Intel- Support for Programming Languages and Operating Systems, ASPLOS Volta/Pascal-PCI-E systems. The use of memory advises on ’19, pages 49–63, New York, NY, USA, 2019. ACM. [11] Wenqiang Li, Guanghao Jin, Xuewen Cui, and Simon See. An Power9-Volta-NVLink based system, leads to a performance Evaluation of Unified Memory Technology on Nvidia GPUs. In 2015 improvement when applications run in-memory while it re- 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid sults in a considerable performance degradation with GPU Computing, pages 1092–1098. IEEE, 2015. [12] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. memory oversubscription. CUDA Unified prefetch provides NVIDIA Tensor Core Programmability, Performance Precision. In 2018 a performance improvement only on the Intel-Volta/Pascal- IEEE International Parallel and Distributed Processing Symposium PCI-E based systems while it does not show a performance Workshops (IPDPSW), pages 522–531, May 2018. [13] Alok Mishra, Lingda Li, Martin Kong, Hal Finkel, and Barbara Chap- improvement on the Power9-Volta-NVLink based system. man. Benchmarking and evaluating unified memory for OpenMP GPU In this work, we have set memory advises for each memory offloading. In Proceedings of the Fourth Workshop on the LLVM object following best-practice guidelines from Nvidia. How- Compiler Infrastructure in HPC, page 6. ACM, 2017. [14] NVIDIA. P100 white paper. NVIDIA Corporation, 2016. ever, a future study on how to select optimal advise placement [15] NVIDIA. CUDA C Programming Guide. NVIDIA Corporation, 2019. 
would help programmers derive different combinations of [16] Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I-Hsin Chung, advises in different applications. In general, we found both Jinjun Xiong, and Wen-Mei Hwu. Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects. memory advises and prefetch to be simple and effective. In Proceedings of the 2019 ACM/SPEC International Conference on Overall, we showed that UM is a promising technology that Performance Engineering, ICPE ’19, pages 209–218. ACM, 2019. can be used effectively when programming applications for [17] The TOP500 project. Top500 lists. https://www.top500.org/lists/2019/ 06/, 2019. GPU systems. [18] Nikolay Sakharnykh. Maximizing unified memory performance in cuda. https://devblogs.nvidia.com/ ACKNOWLEDGMENT maximizing-unified-memory-performance-cuda/, 2017. [19] Nikolay Sakharnykh. Unified memory on pascal and Funding for the work is received from the European Commission H2020 program, volta. http://on-demand.gputechconf.com/gtc/2017/presentation/ Grant Agreement No. 801039 (EPiGRAM-HS). Experiments were performed on re- s7285-nikolay-sakharnykh-unified-memory-on-pascal-and-volta.pdf, sources provided by the Swedish National Infrastructure for Computing (SNIC) at 2017. HPC2N and Lassen supercomputer at LLNL. Part of this work was performed under the [20] Nikolay Sakharnykh. Unified memory on pascal and volta. In GPU auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory Technology Conference (GTC), 2017. under Contract DE-AC52-07NA27344 LLNL-PROC-788778. This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of [21] Nikolay Sakharnykh. Everything you need to know about unified the U.S. Department of Energy Office of Science and the National Nuclear Security memory. NVIDIA GTC, 2018. Administration.

REFERENCES

[1] Linux programmer's manual. http://man7.org/linux/man-pages/man3/posix_madvise.3.html, 2019.
[2] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K Panda. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC), pages 143–152. IEEE, 2018.

Explicit Data Layout Management for Autotuning Exploration on Complex Memory Topologies

Swann Perarnau, Brice Videau, Nicolas Denoyelle, Florence Monna, Kamil Iskra, Pete Beckman
Argonne National Laboratory
{swann, ndenoyelle, bvideau, fmonna}@anl.gov, {iskra, beckman}@mcs.anl.gov

Abstract—The memory topology of high-performance com- We believe that these automatic cache-like management puting platforms is becoming more complex. Future exascale schemes are leaving performance on the table and, more platforms in particular are expected to feature multiple types importantly, prevent the internal structure of applications from of memory technologies, and multiple accelerator devices per compute node. treating the placement, movement, and shape of data as key In this paper, we discuss the use of explicit management of performance concerns. In this paper, we revisit typical memory the layout of data in memory across memory nodes and devices optimization strategies in the autotuning of dense linear alge- for performance exploration purposes. Indeed, many classic bra kernels as first-class abstractions that can be part of an ap- optimization techniques rely on reshaping or tiling input data in plication design space for heterogeneous architectures. These specific ways to achieve peak efficiency on a given architecture. With autotuning of a linear algebra code as the end goal, we abstractions are implemented in AML: a library providing present AML: a framework to treat three memory management building blocks for the creation of explicit, application-aware abstractions as first-class citizens: data layout in memory, tiling memory management policies. Our library allows application of data for parallelism, and data movement across memory and runtime developers to quickly design memory manage- types. By providing access to these abstractions as part of ment schemes adapted to complex memory hierarchies, while the performance exploration design space, our framework eases the design and validation of complex, efficient algorithms for still being agnostic to the specific hardware topology exhibited heterogeneous platforms. by the architecture. As a memory management framework, Using the Intel Knights Landing architecture in one of its AML provides facilities to build custom data layouts, adapt most NUMA configurations as a proxy platform, we showcase memory placement according to those layouts, and handle our framework by exploring tiling and prefetching schemes for coarse-grained memory movement between memory layers, a DGEMM algorithm. Index Terms—Deep memory, high-bandwidth memory, explicit as required for prefetching mechanisms. memory management To motivate exposing those abstractions to application de- velopers, we present an extensive performance exploration I.INTRODUCTION study combining autotuning and careful data movement or- As we approach exascale, high-performance computing chestration to achieve efficiency on large input sizes for a (HPC) platforms are increasingly featuring complex memory DGEMM algorithm on a KNL system configured as a large topologies. Intel’s Knights Landing [1] (KNL) processor was NUMA topology. We demonstrate that our library enables us an early indicator of this trend, with a configuration that to implement efficient tiling schemes and dynamic memory can result in a single socket being split into 8 NUMA movements to prefetch inputs into the more efficient memory. domains: 4 quadrants for the on-chip network and 2 types This paper is organized as follows. Section II presents of memory (high-bandwidth MCDRAM and regular DRAM) our motivation: the reproduction, using autotuning, of an per quadrant. Exascale-class systems are also expected to optimization strategy for DGEMM with large inputs. 
Using feature a topology including multiple accelerator devices with this complex memory management optimization as a starting their own memory banks along with their host multicore point, we present in Section III three abstractions necessary processor. Regardless of the specific technology, these systems for building a complex data orchestration policy on a hetero- will exhibit deep memory: multiple levels of cache-coherent, geneous topology and discuss how they can be provided to byte-addressable memory. application performance specialists as building blocks inside How these complex topologies should be managed by HPC the same framework. We explore the full performance of our applications is still an open research question. Vendors have data orchestration scheme in Section IV. Section V discusses spent considerable effort on hiding most of this complexity related work, and we conclude in Section VI with a summary behind automatic mechanisms that expose a single coherent and some observations about future platform requirements. virtual address space with each compute element pulling data as close as possible when needed. This is a case for the cache II.MOTIVATION mode of the KNL, which uses the high-bandwidth memory Our motivating problem is the fine-grained optimization as a last-level direct-mapped cache, as well as for the unified of a DGEMM kernel (C = A ∗ B + C) for large matrix memory feature on Nvidia GPUs, which allows the GPU to sizes, utilizing a state-of-the-art strategy [2]. This strategy pull pages of data allocated on the CPU side as needed. is a bottom-up, architecture-aware approach for maximizing performance: for each level of the memory hierarchy of the 1 void inner_kernel( target architecture, use subkernels and reorganize data to 2 double ah[KB][MR] , achieve the best compute intensity (ratio of memory loads to 3 double bh[KB][NR] , compute instructions). Such a strategy involves finely tuned 4 double ch[MR][NR]) { 5 / ∗ ch += ah ∗ bh ∗ / reorganization of data across cache levels in a non portable 6 } way. For productivity and portability reasons, we would like 7 to generalize and simplify the memory management of this 8 void intermediary_kernel( 9 i n t nblockn , kind of approach. 10 double a t [NBLOCKA] [KB] [MR] , The DGEMM kernel proposed in [2] is tailored to target the 11 double bt[nblockn ][KB][NR] , 12 double ct [nblockn ][NBLOCKA][MR][NR]) { Knights Landing architecture, which is notoriously difficult to 13 # pragma omp parallel f o r num_threads (NUM_INNER_TH) tune for. In order to reach peak performance, every level of the 14 f o r ( i n t jr = 0; jr < nblockn; jr++) { memory hierarchy needs to be accounted for. The KNL node 15 f o r ( i n t ir = 0; ir < NBLOCKA; ir++) { 16 inner_kernel( &at[ir ][0][0], used in this study has 64 cores with 32 vector registers 512 17 &bt[jr][0][0], bits wide (8 double-precision floats), 32 KiB of L1 cache and 18 &ct[jr][ir][0][0]); 1 MiB of L2 cache shared between each pair of cores. The 19 } 20 } node also has 16 GiB of MCDRAM memory, and although it 21 } can be used as a cache, it will exposed as a NUMA memory 22 node in this study. Our strategy thus involves 4 levels of kernel 23 void outer_kernel( 24 i n t nblockm , i n t nblockn , i n t nblockk , optimization. 25 double ad[ nblockk ][ nblockm ][NBLOCKA][KB][MR] , Inner Kernel: The inner kernel is optimized for register- 26 double bd[nblockk ][ nblockn ][KB][NR] , 27 double cd[nblockm ][ nblockn ][NBLOCKA][MR][NR]) { level data access patterns. 
Register reuse is achieved by careful 28 f o r ( i n t p = 0; p < nblockk; p++) { vectorization and limited use of load and store instructions. 29 # pragma omp parallel f o r num_threads (NUM_OUTER_TH) Specifically, the inner kernel works on small tiles of A, B, and 30 f o r ( i n t i = 0; i < nblockm; i++) { 31 intermediary_kernel( C with sizes [kb, mr], [kb,nr], and [mr, nr], respectively. We b b b b 32 nblockn , denote these tiles A, B, and C (note that A is transposed). Tile 33 &ad[p][i ][0][0][0] sizes are chosen so that nr is a multiple of the vector length 34 &bd[p][0][0][0] , 35 &cd[i][0][0][0][0]); (8 in our case), while mr is small enough that the whole b b 36 } tile C and a row of B fit inside the registers simultaneously. 37 } Thus, for this architecture, the possible values of [mr, nr] are 38 } (31, 8), (15, 16) and (7, 32). The last parameter, kb, should Listing 1. Pseudocode for the DGEMM outer kernel. b be large enough to amortize data transfer overheads on C but b also small enough for A to fit inside half of the L1 cache. Intermediary Kernel: The intermediary kernel works on [nblockk,kb,nblockn,nr], and [nblockm, nblockn, nblocka, sets of tiles of A, B, and C with sizes of [kb, mb], [kb,n], mr, nr], respectively. We denote them A˙, B˙ , and C˙ . The sizes e e and [mb, n], respectively. We denote these tilesets A, B, and m, n, and k are chosen so that the blocks are approximately e e C. A is stored as mb/mr (or nblocka) consecutive tiles square and large enough to amortize the transform cost. The b e of size [kb, mr] (A), C is stored as (mb/mr) ∗ (n/nr) (or pseudocode for this entire algorithm is given in Listing 1. b nblocka∗nblockn) consecutive tiles of size [mr, nr] (C), and Top Kernel: The top kernel is responsible for orchestrating e b B is stored as n/nr (or nblockn) tiles of size [kb,nr] (B). the transformation of the input matrices and the launch of These sizes are chosen to take advantage of L2 caches, and the outer kernel. Matrix C¯ is of size [M,N], matrix A¯ of e in particular mb is chosen so that A fits inside half of the L2 size [M,K], and matrix B¯ of size [K,N]. The matrices are cache available to a core. arranged in blocks of size [m, n] (C), [m, k] (A), and [k,n] Outer Kernel: The outer kernel updates a block C of (B), respectively. Those blocks are transferred from main size [m, n], using a block A of size [k, m] and a block B memory to MCDRAM and transformed on the fly into blocks of size [k,n]. The layouts used here are as follows: C is C˙ , A˙, and B˙ , respectively. Since MCDRAM can hold several e stored as m/mb (or nblockm) blocks of C, A is stored as of those blocks simultaneously, computation and transfer can e (k/kb) ∗ (m/mb) (or nblockk ∗ nblockm) blocks of A, and B overlap, and some blocks can be reused, depending on the e is stored as (k/kb) (or nblockk) blocks of B. This is where order of evaluation. our approach differs from [2]. In the original algorithm, the Challenge: This multilayered algorithm has one major b transposition of A is performed in the intermediary kernel, drawback: its memory management is not an explicit parame- b while B is transposed in the outer kernel. We extracted ter of the algorithm. Indeed, to port this algorithm to a differ- those transformations to take place before the outer kernel. ent architecture, one must rediscover the appropriate shapes This approach results in another level of blocking in the and sizes of the data for each topology level. 
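As a quick sanity check on these tile shapes (our arithmetic, assuming the 32 vector registers of 8 doubles stated above, i.e., 256 double-precision slots): the inner kernel must keep the whole C-tile of mr x nr elements plus one nr-wide row of B in registers, so (mr + 1) * nr <= 256. The three admissible shapes saturate this bound exactly, since (31 + 1) * 8 = (15 + 1) * 16 = (7 + 1) * 32 = 256.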
To automate algorithm: the outer kernel is operating on blocks of A, B, this work, memory management must be made explicit and and C with a shape of [nblockk,nblockm,nblocka,kb,mr], composable enough that an application performance specialist can automate the search for ideal parameters, using autotuning Table I for example. Moreover, the layouts are complex enough that SUMMARY OF DATA MANAGEMENT ABSTRACTIONS facilities to track and reason about these geometries are Abstraction Intent Methods required. Layout data organization in memory deref, reshape Tiling generate/index slices index, slice III.ABSTRACTIONSFOR EFFICIENT IN-MEMORY DATA Movement orchestrate movement copy, transform ORCHESTRATION Our motivating problem points towards the need for a principled approach to three memory abstractions: layout, be placed or worked on together. Based on our motivating tiling, and movement across a topology. In order to improve problem, tiling is the abstraction in charge of providing an on existing algorithms and adapt our work to future platforms, indexation and generation mechanism over slices of a layout. these three abstractions need to become available as first-class Indeed, for a blocked DGEMM algorithm it is critical to be constructs that we can use for implementation and for further able to reason about the layout of entire input matrices as experimentation. tiles for which coordinates are available so that the tiles can Such goal is also in line with the PADAL whitepa- be iterated on correctly. per [3], which identified that as we approach exascale, high- Moving Data: The core abstraction behind the movement performance computing platforms will move toward cheap of data inside a topology can be identified by two main and massively parallel compute power while data movement functions: the means to perform the movement itself and the will dominate energy and performance costs. To facilitate the type of movement being performed. Indeed, depending on the development of applications on those platforms, the authors architecture and the specific programming model, moving data called for the community to establish locality-preserving ab- between memories can be done by a regular memcpy, by stractions. Data layout, tiling, and topology were identified as moving physical pages of data around, by calling into device- critical components of such an effort. specific APIs (cudaMemcpy), by queuing copy operations We present here how these three abstractions can be made (clEnqueueCopyBuffer in OpenCL), or even by using available inside a single framework, as building blocks for fur- parallel runtimes (OpenMP 5.0). On the other hand, the ther performance optimization studies. Our framework should data movement operation itself might include more than just have the following goals: copying and moving data around, for example, transposing matrices for better efficiency or packing/filtering the data to • Composable: application developers and performance ex- improve locality. perts should be able to pick and choose which building These abstractions, summarized in Table I, were imple- blocks to use depending on their specific needs. mented in AML, a lightweight framework written in C99 using • Flexible: one should be able to customize, replace, or pthreads for its asynchronous movement facilities. change the configuration of each building block as much as possible. IV. 
PERFORMANCE EXPLORATION OF DGEMM ONA • Declarative: to the extent possible, each building block LARGE HETEROGENEOUS TOPOLOGY should provide the means for users to describe how it is used by the application, without having to change the We now present how our memory management framework existing programming model. can be used to generalize the DGEMM optimization presented • Hardware-oblivious: given the right initialization param- earlier, on Intel’s Knights Landing architecture. eters, the resulting memory management policies should work on any hardware configuration. A. Experimental Setup Given these goals, we detail the scope and intent for each All the experiments presented in this section run on an Intel abstraction and highlight how they interact with one another. Knights Landing processor model 7210, comprising 16 GiB of Data Layouts: This abstraction is in charge of representing high-bandwidth, on-package MCDRAM and 192 GiB of DDR. how the bytes of a data structure are organized in a linear This platform has 64 cores available, running at 1.3 GHz, and virtual address space. In this paper, we focus on layouts hyperthreading is deactivated. Unless otherwise specified, the typically encountered when dealing with dense linear algebra node is configured at boot time to run in Flat/Quad mode, algorithms: multidimensional arrays of a single element type, meaning that the MCDRAM is exposed to the system as a with optional strides. This abstraction provides methods to find single NUMA node and that the distributed cache directory is a single element and, more important, slicing and reshaping. configured to improve latency on the cache misses. We identify slicing as the act of returning a subset of the The node is running CentOS 7.5, kernel version 3.10. The layout; reshaping provides the user with a view of the layout frequency governor has been set to performance. The using different dimensions (without changing the underlying benchmarks are compiled with icc 17.0.1; architecture-specific data organization). optimizations (-xHost) are active. All experiments are run Tiling Data: Tiling, or blocking, is a common optimization using 60 cores for computation (with OpenMP) and 4 cores strategy to improve the data locality of an algorithm. It can for data movement. The inner kernel is generated ahead of be summarized as the action of grouping data elements to time by using the BOAST autotuning framework [4]. 1 / ∗ 2 A dims: {NB, M, NB, K}; 3 B dims: {NB, K, NB, N}; 4 C dims: {NB, M, NB, N}; 5 6 T i l e s : 7 Adims: {M,K}; 8 A r e s h a p e d : {NBLOCKM, NBLOCKA, MR, NBLOCKK, KB} ; 9 A t r a n s p o s e d : {NBLOCKK, NBLOCKM, NBLOCKA, KB, MR} ; 10 11 Bdims: {K,N}; 12 B r e s h a p e d : {NBLOCKK, KB, NBLOCKN, NR} ; 13 B t r a n s p o s e d : {NBLOCKK, NBLOCKN, KB, NR} ; 14 15 Cdims: {M,N}; 16 C r e s h a p e d : {NBLOCKM, NBLOCKA, MR, NBLOCKN, NR} ; 17 C t r a n s p o s e d : {NBLOCKM, NBLOCKN, NBLOCKA, MR, NR} ; 18 19 Tiles are prefetched from DDR to MCDRAM and 20 reshaped and transposed during transfer. A˙ Figure 1. Example of a before and after transposition. Elements of each 21 ∗ / Ab have the same color. 
22 void top_kernel( const double ∗A, const double ∗B, 23 double ∗C){ 24 25 prefetch(firstTileOf_A); 26 prefetch(firstTileOf_B); 27 prefecth(firstTileOf_C); 28 f o r ( i n t i = 0; i < block_number; i++) { 29 f o r ( i n t j = 0; j < block_number; j++) { 30 wait(tile of C); 31 prefetch(nextTileOf_C) 32 f o r ( i n t k = 0; k < block_number; k++) { 33 wait(tileOf_A); 34 prefetch(nextTileOf_A); 35 wait(tileOf_B); 36 prefetch(nextTileOf_B); 37 o u t e r _ k e r n e l (NBLOCKM, NBLOCKN, NBLOCKK, 38 tileOf_A , tileOf_B , tileOf_C); 39 } 40 wait(flushPreviousTileOf_C); 41 flush(tileOf_C); 42 } 43 } 44 wait(flushLastTileOf_C); 45 } Figure 2. Example of a B˙ before and after transposition. Elements of each Bb have the same color. Listing 2. Pseudocode for the DGEMM top kernel.

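To make the on-the-fly reshape and transposition concrete, the following plain-C routine packs a row-major A of size [M, K] into the A-dot layout {NBLOCKK, NBLOCKM, NBLOCKA, KB, MR} used by the outer kernel (cf. the comments in Listing 2 and Figure 1). This is an illustrative sketch, not AML code: the exact index ordering is our assumption from the stated dimensions, the tile parameters below are placeholder values, and M and K are assumed to be exact multiples of the block sizes.

/* Illustrative packing of A[M][K] (row-major) into
 * A_dot[nblockk][nblockm][NBLOCKA][KB][MR]; each innermost [KB][MR] tile is the
 * transposed A-hat tile of Section II. Not the AML implementation. */
#include <stddef.h>

#define NBLOCKA 4   /* placeholder tile parameters, for illustration only */
#define KB      8
#define MR      15

void pack_A(size_t M, size_t K, const double *A,
            size_t nblockm, size_t nblockk, double *A_dot)
{
    (void)M; /* M is implied by nblockm * NBLOCKA * MR */
    for (size_t pk = 0; pk < nblockk; pk++)           /* block column of A   */
        for (size_t pm = 0; pm < nblockm; pm++)       /* block row of A      */
            for (size_t a = 0; a < NBLOCKA; a++)      /* tile inside a block */
                for (size_t kk = 0; kk < KB; kk++)    /* column inside tile  */
                    for (size_t r = 0; r < MR; r++) { /* row inside tile     */
                        size_t i = (pm * NBLOCKA + a) * MR + r; /* row in A    */
                        size_t k = pk * KB + kk;                /* column in A */
                        size_t dst = ((((pk * nblockm + pm) * NBLOCKA + a) * KB + kk) * MR) + r;
                        A_dot[dst] = A[i * K + k];
                    }
}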
Inline Transformation Top Kernel: The DGEMM algo- rithm presented in Section II is adapted to use our framework to handle its tiling as well as the data movement and simul- taneous transformations. Data movement is executed ahead of time when possible, using a scheme similar to double buffering. That is, for each matrix, the next block of the outer kernel is prefetched. Result tiles (those of matrix C) are copied back into the source data in DRAM, too. To perform this data movement, dedicated threads each perform active polling on a workqueue protected by a spinlock (one lock per queue per thread). Each thread is in charge of one of the 4 operations in this algorithm: prefetch and transform A˙, B˙ , C˙ , and flush previous C˙ . Figures 1, 2, and 3 showcase examples of these transformations between the top and outer kernel, for MR = 2, NR = 3, KB = 5, NBLOCKA = 7, Figure 3. Example of a C˙ before and after transposition. Elements of each , , . b NBLOCKM = 2 NBLOCKN = 5 NBLOCKK = 4 C have the same color. Listing 2 provides the pseudocode for this kernel, while detailing the layouts and data movements. 1,500 the APIs available for placement, similar to work on general NUMA architectures: memkind [5], the Simplified Interface for Complex Memory [6], and Hexe [7]. These libraries pro- vide applications with intent-based allocation policies, letting users specify bandwidth-bound data or latency-sensitive data, 1,000 bind+transform for example. While placement is critical for the efficient use of nobind+transform deep memory architectures, these mechanisms lack the means bind+unopt-transform to move data around to accommodate workloads that cannot projected performance fit in the right layer, resulting in missing optimization oppor- 500 tunities. Nevertheless, our framework also provides placement features as the basis for its data movement facilities. Performance (GFLOP/S) Data migration addresses the issue of moving data dy- namically across memory types during the execution of the 0 application. Our preliminary work [8] on this approach show- 0 2 4 6 8 10 12 cased that performance of a simple stencil benchmark could be C Matrix Size (in GiB) improved by migration, using a scheme similar to out-of-core algorithms, when the compute density of the application kernel Figure 4. DGEMM performance depending on transform implementation. is high enough to provide compute/migration overlapping. Fur- ther work [9] studied performance models for such strategies, including heuristics to migrate data as part of the execution of Performance Exploration: For our performance explo- a workflow with task dependencies. The prefetching strategy ration we compare several versions of our kernel, depending used in this paper matches some of these heuristics. Another on two changes: study [10] discussed a runtime method to schedule tasks with • Transform performance: whether the implementation uses data dependencies on a deep memory platform. Unfortunately, a generic and unoptimized algorithm to handle each trans- the scheduling algorithm is limited to scheduling a task only formation or a custom code, optimized for the specific after all its input data has been moved to faster memory. block sizes involved here, • Thread binding: whether the threads performing the VI.CONCLUSION data transformation share cores with the regular worker We presented in this paper how additional memory manage- threads or use dedicated resources. 
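The dedicated data-movement threads mentioned above can be pictured with the simplified sketch below: a worker spins on a spinlock-protected queue and services prefetch requests (a copy that could also transform the tile), so that movement overlaps with the compute threads. This is an illustration of the scheme only, not the AML implementation; the types, names, and single-slot queue are hypothetical, and the queue is assumed to be initialized elsewhere with pthread_spin_init().

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct request {
    void *dst;            /* tile destination (e.g., MCDRAM buffer) */
    const void *src;      /* tile source (e.g., DDR-resident block) */
    size_t bytes;
    volatile bool done;   /* polled by the compute side's wait()    */
};

struct workqueue {
    pthread_spinlock_t lock;  /* one lock per queue per thread, as described above */
    struct request *pending;  /* single-slot queue, kept simple for illustration   */
    bool stop;
};

/* Worker body: actively poll the queue and execute movement requests. */
static void *mover_thread(void *arg)
{
    struct workqueue *q = arg;
    for (;;) {
        pthread_spin_lock(&q->lock);
        struct request *r = q->pending;
        bool stop = q->stop;
        q->pending = NULL;
        pthread_spin_unlock(&q->lock);

        if (r) {
            memcpy(r->dst, r->src, r->bytes); /* a real transform would reshape/transpose here */
            r->done = true;
        } else if (stop) {
            break;
        }
    }
    return NULL;
}

/* Compute side: post a prefetch request; wait() later spins on r->done. */
static void post_prefetch(struct workqueue *q, struct request *r)
{
    r->done = false;
    pthread_spin_lock(&q->lock);
    q->pending = r;
    pthread_spin_unlock(&q->lock);
}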
ment abstractions can be used to simplify and extend complex Figure 4 presents the results of these experiments, for optimization strategies for deep memory and heterogeneous varying sizes of the result matrix. Experiments are repeated platforms. While we used Intel’s Knights Landing architecture 10 times, and standard error is displayed. We included in as a basis for this study, we expect that exascale topologies will the figure the performance of one tile of the outer kernel, exhibit the same kind of complexity and will require the same as a reference for maximum reachable performance for this kind of optimization strategies. In particular, heterogeneous strategy. Note that this performance is in line with the reported platforms are ideal for such strategies since a separate compute data in the original study. element can be used for transform operations. These results highlight the impact of the method of trans- Further tuning of the various abstractions can also be form and the placement of data management threads on overall performed, in particular autotuning of the transform operators. performance of this algorithm. This is precisely the kind As we move closer to exascale, we will also continue to of experiments that are made easier by providing higher- improve the abstractions offered by our framework for future level abstractions about memory management in a complex architectures. These improvements will include support for algorithm. We hope to explore further these design points for heterogeneous platforms (CPU-GPU with unified memory), as this algorithm and others in the future. well as better abstractions to handle the distribution of data layouts over multiple NUMA nodes or the use of helper cores V. RELATED WORK to perform data movement, which might be necessary for the Deep memory architectures have become widely available A64FX Post- architecture. only in the past couple of years, and studies focusing on them The AML library, documentation, and links to the bench- are rare. Furthermore, since vendors recommend using them as marks are available online at https://argo-aml.readthedocs.io/ another level of hardware-managed cache, few works make the en/latest/. case for explicit management of these memory types. Among existing ones, two major trends can be identified: studies ACKNOWLEDGMENTS arguing either for data placement or for data migration. Argonne National Laboratory’s work was supported by the Data placement [3] addresses the issue of distributing data U.S. Department of Energy, Office of Science, Advanced among all available memory types only once, usually at allo- Scientific Computer Research, under Contract DE-AC02- cation time. Several efforts in this direction aim at simplifying 06CH11357. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the [20] D. Doerfler, J. Deslippe, S. Williams, L. Oliker, B. Cook, T. Kurth, U.S. Department of Energy Office of Science and the National M. Lobet, T. Malas, J.-L. Vay, and H. Vincenti, “Applying the roofline performance model to the Intel Xeon Phi Knights Landing processor,” Nuclear Security Administration. in International Conference on High Performance Computing, 2016. [21] C. Pohl, “Stream processing on high-bandwidth memory,” in Grundla- REFERENCES gen von Datenbanken, 2018. [1] A. Sodani, “Knights Landing (KNL): 2nd generation Intel R Xeon Phi [22] N. Butcher, S. L. Olivier, J. Berry, S. D. Hammond, and P. M. processor,” in 27th IEEE Hot Chips Symposium (HCS), 2015. 
Kogge, “Optimizing for KNL usage modes when data doesn’t fit in [2] R. Lim, Y. Lee, R. Kim, and J. Choi, “An implementation MCDRAM,” in International Conference on Parallel Processing, 2018. of matrix–matrix multiplication on the Intel KNL processor with [Online]. Available: http://par.nsf.gov/biblio/10064736 AVX-512,” Cluster Computing, June 2018. [Online]. Available: [23] (2018) OpenBLAS: An optimized BLAS library. [Online]. Available: https://doi.org/10.1007/s10586-018-2810-y https://www.openblas.net/ [3] D. Unat, J. Shalf, T. Hoefler, T. Schulthess, A. Dubey, and others (Eds.), [24] J. Dongarra, S. Hammarling, N. J. Higham, S. D. Relton, P. Valero- “Programming abstractions for data locality,” Tech. Rep., 04 2014. Lara, and M. Zounon, “The design and performance of batched [4] B. Videau, K. Pouget, L. Genovese, T. Deutsch, D. Komatitsch, BLAS on modern high-performance computing systems,” Procedia F. Desprez, and J.-F. Méhaut, “BOAST: A metaprogramming Computer Science, vol. 108, pp. 495–504, 2017, International framework to produce portable and efficient computing kernels for HPC Conference on Computational Science (ICCS). [Online]. Available: applications,” International Journal of High Performance Computing http://www.sciencedirect.com/science/article/pii/S1877050917307056 Applications, vol. 32, no. 1, pp. 28–44, Jan. 2018. [Online]. Available: [25] R. Asai, “Clustering modes in Knights Landing processors: Developer’s https://hal.archives-ouvertes.fr/hal-01620778 guide,” Colfax International, Tech. Rep., 05 2016. [5] Intel Corporation, “Memkind: A user extensible heap manager,” https: [26] A. Vladimirov and R. Asai, “MCDRAM as high-bandwith memory //memkind.github.io/, 2018. (HBM) in Knights Landing processors: Developer’s guide,” Colfax [6] “Simplified interface to complex memory,” https://github.com/lanl/ International, Tech. Rep., 05 2016. SICM, 2017. [27] A. J. Pena and P. Balaji, “Toward the efficient use of multiple explicitly [7] L. Oden and P. Balaji, “Hexe: A toolkit for heterogeneous memory man- managed memory subsystems,” in IEEE Int. Conf. on Cluster Computing agement,” in IEEE International Conference on Parallel and Distributed (CLUSTER), 2014, pp. 123–131. Systems (ICPADS), 2017. [28] H. Servat, A. J. Peña, G. Llort, E. Mercadal, H. Hoppe, and [8] S. Perarnau, J. A. Zounmevo, B. Gerofi, K. Iskra, and P. Beckman, J. Labarta, “Automating the application data placement in hybrid “Exploring data migration for future deep-memory many-core systems,” memory systems,” in IEEE International Conference on Cluster in IEEE International Conference on Cluster Computing (CLUSTER), Computing, (CLUSTER), 2017, pp. 126–136. [Online]. Available: 2016. https://doi.org/10.1109/CLUSTER.2017.50 [9] A. Benoit, S. Perarnau, L. Pottier, and Y. Robert, “A performance model [29] G. Voskuilen, A. F. Rodrigues, and S. D. Hammond, “Analyzing to execute workflows on high-bandwidth-memory architectures,” in Pro- allocation behavior for multi-level memory,” in Proceedings of the ceedings of the 47th International Conference on Parallel Processing, Second International Symposium on Memory Systems, (MEMSYS), (ICPP), 2018. 2016, pp. 204–207. [Online]. Available: http://doi.acm.org/10.1145/ [10] K. Chandrasekar, X. Ni, and L. V. Kalé, “A memory heterogeneity- 2989081.2989116 aware runtime system for bandwidth-sensitive HPC applications,” in [30] NVIDIA, “CUDA: Unified memory programming,” http: IEEE Int. 
Parallel and Distributed Processing Symposium Workshops, //docs.nvidia.com/cuda/cuda-c-programming-guide/index.html# Orlando, FL, USA, 2017, pp. 1293–1300. [Online]. Available: um-unified-memory-programming-hd, 2018. https://doi.org/10.1109/IPDPSW.2017.168 [31] R. Landaverde, T. Zhang, A. K. Coskun, and M. Herbordt, “An investi- [11] A. Haidar, S. Tomov, K. Arturov, M. Guney, S. Story, and J. Don- gation of unified memory access performance in CUDA,” in IEEE High garra, “LU, QR, and Cholesky factorizations: Programming model, Performance Extreme Computing Conference (HPEC), 2014. performance analysis and optimization techniques for the Intel Knights [32] C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst, “Data-aware Landing Xeon Phi,” in IEEE High Performance Extreme Computing task scheduling on multi-accelerator based platforms,” in IEEE Int. Conf. Conference (HPEC), 2016. on Parallel and Distributed Systems, Dec 2010, pp. 291–298. [12] A. Heinecke, G. Henry, M. Hutchinson, and H. Pabst, “LIBXSMM: Accelerating small matrix multiplications by runtime code generation,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016. [13] Intel Math Kernel Library. Reference Manual, 2018. [Online]. Available: https://software.intel.com/en-us/articles/mkl-reference-manual [14] K. Kim, T. B. Costa, M. Deveci, A. M. Bradley, S. D. Hammond, M. E. Guney, S. Knepper, S. Story, and S. Rajamanickam, “Designing vector- friendly compact BLAS and LAPACK kernels,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017. [15] L. Dagum and R. Menon, “OpenMP: An industry-standard API for shared-memory programming,” IEEE Comput. Sci. Eng., 1998. [16] I. Z. Reguly, G. R. Mudalige, and M. B. Giles, “Beyond 16GB: Out-of- core stencil computations,” in Proceedings of the Workshop on Memory Centric Programming for HPC, 2017. [17] L. Alvarez, M. Casas, J. Labarta, E. Ayguade, M. Valero, and M. Moreto, “Runtime-guided management of stacked DRAM memories in task parallel programs,” in Proceedings of the 2018 International Conference on Supercomputing (ICS), Beijing, China, 2018. [Online]. Available: http://ics2018.ict.ac.cn/essay/ICS18-Paper130.pdf [18] L. Alvarez, M. Moreto, M. Casas, E. Castillo, X. Martorell, J. Labarta, E. Ayguade, and M. Valero, “Runtime-guided management of scratchpad memories in multicore architectures,” in Proceedings of the International Conference on Parallel Architecture and Compilation (PACT), 2015. [19] C. Rosales, J. Cazes, K. Milfeld, A. Gómez-Iglesias, L. Koesterke, L. Huang, and J. Vienne, “A comparative study of application perfor- mance and scalability on the Intel Knights Landing processor,” in ISC Workshops, 2016. Machine Learning Guided Optimal Use of GPU Unified Memory

Hailu Xu, Florida International University, Miami, FL, USA
Murali Emani, Argonne National Laboratory, Lemont, IL, USA
Pei-Hung Lin, Lawrence Livermore National Laboratory, Livermore, CA, USA
Liting Hu, Florida International University, Miami, FL, USA
Chunhua Liao, Lawrence Livermore National Laboratory, Livermore, CA, USA

ABSTRACT
NVIDIA's unified memory (UM) creates a pool of managed memory on top of physically separated CPU and GPU memories. UM automatically migrates page-level data on demand so programmers can quickly write CUDA codes on heterogeneous machines without tedious and error-prone manual memory management. To improve performance, NVIDIA allows advanced programmers to pass additional memory use hints to its UM driver. However, it is extremely difficult for programmers to decide when and how to efficiently use unified memory, given the complex interactions between applications and hardware. In this paper, we present a machine learning-based approach to choosing between discrete memory and unified memory, with additional consideration of different memory hints. Our approach utilizes profiler-generated metrics of CUDA programs to train a model offline, which is later used to guide optimal use of UM for multiple applications at runtime. We evaluate our approach on an NVIDIA Volta GPU with a set of benchmarks. Results show that the proposed model achieves 96% prediction accuracy in correctly identifying the optimal memory advice choice.

CCS CONCEPTS
• Computer systems organization → Heterogeneous (hybrid) systems; • Computing methodologies → Machine learning.

KEYWORDS
Unified memory, GPU, Data allocation, Machine learning

ACM Reference Format:
Hailu Xu, Murali Emani, Pei-Hung Lin, Liting Hu, and Chunhua Liao. 2019. Machine Learning Guided Optimal Use of GPU Unified Memory. In MCHPC '19: Workshop on Memory Centric High Performance Computing, November 18, 2019, Denver, Colorado, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/1122445.1122456

1 INTRODUCTION
Graphics Processing Units (GPUs) have been fundamental hardware components for supporting high performance computing, artificial intelligence, and data analysis in datacenters and supercomputers. As of June 2019, 133 of the Top500 supercomputers are GPU-accelerated, and 128 systems debuting on the TOP500 are accelerated with NVIDIA GPUs [16]. Efficient memory management across CPUs and GPUs has been a challenging problem, while it is critical to performance and energy efficiency. Before CUDA 6.0, data shared by CPUs and GPUs was allocated in discrete memories, which require explicit memory copy calls to transfer data between CPUs and GPUs. Since CUDA 6.0, NVIDIA has introduced Unified Memory (UM) with a single unified programmable memory space within a heterogeneous CPU-GPU architecture consisting of separated physical memory spaces. UM relieves programmers from manual management of data migration between CPUs and GPUs, such as inserting memory copy calls and deep copying pointers. It tremendously improves productivity and also enables oversubscribing GPU memory.

NVIDIA continuously improves UM throughout different generations of GPUs. The latest UM implementation has accumulated a rich set of features including GPU page faults, on-demand migration, over-subscription of GPU memory, concurrent access and atomics, access counters, and so on. Moreover, NVIDIA provides the cudaMemAdvise API (Nvidia CUDA Runtime API, May 2019: https://docs.nvidia.com/cuda/cuda-runtime-api/index.html) to advise the UM driver about the usage pattern of memory objects (e.g., dynamically allocated arrays). Different hints (such as ReadMostly, PreferredLocation, AccessedBy) can be specified in this API by programmers to improve the performance of UM.

However, it is extremely challenging for programmers to decide when and how to efficiently use UM for various kinds of applications. For a given memory object, there is a wide range of choices, including managing it with the traditional discrete memory API, the unified memory API without advice, and the unified memory API combined with various memory hints.

In this paper, we present a novel approach to choosing between discrete memory and unified memory on GPUs, with additional consideration of different memory usage hints. Our approach consists of two phases: an offline learning phase and an online inference phase. 1) The offline learning phase involves building a classifier via supervised learning. It first collects runtime GPU kernel features from selected benchmarks and labels the best advice based upon the performance obtained with various GPU memory usage choices. After that, it constructs a classifier that can predict the best advice for new applications. 2) The online inference phase determines the proper advice at runtime for a running CUDA program. By combining offline learning and online inference, our method can effectively and accurately obtain optimal use of GPU memory for different kinds of CUDA applications, and it presents fine-grain control over managed memory allocations.

This paper makes the following contributions:
• We study the hybrid use of both discrete and unified memory APIs on GPUs, with additional consideration for selecting different memory advice choices.
• A machine learning-based approach is proposed to guide optimal use of GPU unified memory.
• We design code transformations to enable runtime adaptation of CUDA programs leveraging online inference decisions.
• We incorporate kernel features at runtime to provide fine-grain control over GPU memory.
• Our experiments show that our approach is effective in predicting the optimal memory advice choices for the selected benchmarks.

The remainder of this paper is organized as follows: Section 2 presents the background of GPU memory and our motivation. Section 3 describes the details of our design and methodology. Section 4 presents the experimental results of the evaluation on GPUs. We discuss related work in Section 5 and conclude this work in Section 6.

2 BACKGROUND AND MOTIVATION
2.1 Choices for Using GPU Memory
Programmers often encounter multiple choices to manage their data on GPU memory. NVIDIA's CUDA traditionally exposes GPU device memory as a memory space discrete from the CPU memory space. Programmers are responsible for using a set of memory API functions to explicitly manage the entire life cycle of data objects stored in GPU memory, including allocation, de-allocation, data copying, etc. Since CUDA 6.0, NVIDIA has introduced unified memory (UM) with a new set of API functions. The idea of UM is to present developers a single memory space unifying both CPU and GPU memories. CUDA uses a unified memory driver to automatically migrate data between CPU and GPU memories at runtime. As a result, UM significantly improves the productivity of GPU programming. Both traditional memory APIs and unified memory APIs can be used together within a single CUDA program.

To enable better performance of UM, CUDA allows developers to give the UM driver additional advice on managing a given GPU memory range via an API function named cudaMemAdvise(const void *, size_t, enum cudaMemoryAdvise, int). The first two parameters of this function accept a pointer to a memory range with a specified size. The memory range should be allocated via cudaMallocManaged or declared via __managed__ variables. The third parameter sets the advice for the memory range. The last parameter indicates the associated device's id, which can indicate either a CPU or a GPU device. The details and differences of the four kinds of advice are as follows:

• Default: This represents the default on-demand page migration to the accessing processor, using a first-touch policy.
• cudaMemAdviseSetReadMostly: This advice is used for data which is mostly going to be read from and only occasionally written to. The UM driver may create read-only copies of the data in a processor's memory when that processor accesses it. If this region encounters any write requests, only the page on which the write occurred remains valid and the other copies are invalidated.
• cudaMemAdviseSetPreferredLocation: Once a target device is specified, that device's memory can be set as the preferred location for the allocated data. The host memory can also be specified as the preferred location. Setting the preferred location does not cause data to migrate to that location immediately. The policy only guides what happens when a fault occurs on the specified memory region: if the data is already in its preferred location, the faulting processor will try to directly establish a mapping to the region without causing page migration. Otherwise, the data will be migrated to the processor accessing it if the data is not in the preferred location or if a direct mapping cannot be established.
• cudaMemAdviseSetAccessedBy: This advice implies that the data will be accessed by a specified CPU or GPU device. It has no impact on the data location and will not cause data migration. It only causes the data to be always mapped in the specified processor's page tables, when applicable. The mapping is updated accordingly if the data is migrated. This advice is useful to indicate that avoiding faults is important for some data, especially when the data is accessed by a GPU within a system containing multiple GPUs with peer-to-peer access enabled.

The effect of cudaMemAdvise can be reverted with the following options: UnsetReadMostly, UnsetPreferredLocation, and UnsetAccessedBy.

2.2 Impact of Different Usage of GPU Memory
Various applications have diverse data access patterns throughout their executions. Different choices of memory APIs and their parameter values often result in a wide variation in performance. To explore the impact of various memory usage choices, we modify several benchmarks from Rodinia [3] to use different memory allocations and advice choices and subsequently examine their execution times. We focus on large dynamically allocated data objects since they usually have a major impact on execution time.

Table 1 lists a subset of the code variants for the gaussian benchmark in Rodinia. There are two matrices a, m and one array b, which are the major data objects in gaussian. We apply different memory usage choices to these objects and obtain multiple combinations. Code variant 1 is the baseline version using the default discrete memory for the three data objects. Variants 2 to 7 use unified memory for matrix a with different memory advice of AccessedBy, ReadMostly, and PreferredLocation. We can further specify the CPU or the GPU as the device for AccessedBy and PreferredLocation. We evaluate the performance of all the code variants and present the results in Figure 2. The input data is a 1024 × 1024 matrix and the measurement includes all the memory transfers between the CPU memory and the GPU device memory.
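To make the variants discussed above concrete, the following snippet shows the CUDA runtime calls they boil down to: allocating a matrix with cudaMallocManaged and attaching one of the advice settings from Section 2.1. The matrix size and device id are illustrative, not taken from the benchmark.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t n = 1024;
        const size_t bytes = n * n * sizeof(float);
        int gpu = 0;                   // illustrative device id
        float *a = nullptr;

        // Variant 2 style: managed (unified) memory instead of cudaMalloc + cudaMemcpy.
        cudaMallocManaged(&a, bytes);

        // Variant 3 style: mark the matrix as read-mostly so read-only replicas may be created.
        cudaMemAdvise(a, bytes, cudaMemAdviseSetReadMostly, gpu);

        // Variant 4 style: prefer keeping the data resident on the GPU.
        // cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, gpu);

        // Variant 6 style: prefer the host instead (cudaCpuDeviceId names the CPU).
        // cudaMemAdvise(a, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

        // ... launch kernels that use a ...

        cudaFree(a);
        printf("done\n");
        return 0;
    }

Variant 1 in Table 1 would instead pair cudaMalloc with explicit cudaMemcpy calls, and variants 5 and 7 would swap in cudaMemAdviseSetAccessedBy for the GPU or CPU, respectively.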

Figure 1: The workflow of the proposed approach in guiding the GPU unified memory advice. (Offline: benchmark variants using the memory choices DiscreteMem, UnifiedMemOnly, ReadMostly, PreferredLocation, and AccessedBy are profiled, their runtime metrics are processed into a training dataset, and a classifier is trained. Online: runtime metrics of running applications are fed to the classifier, which returns the decision of choices.)

It is shown that the code achieves a speedup of 3.5× when matrix a is given the PreferredLocation advice for the GPU (variant 4). A 200× performance degradation is observed when the ReadMostly advice (variant 3) is wrongfully given to matrix a. This experiment demonstrates the significant performance impact of choosing the right usage of GPU memory.

Table 1: Code variants in the gaussian benchmark
Variant 1: baseline using discrete memory for all objects
Variant 2: modified to use unified memory for all objects
Variant 3: set array a with the ReadMostly advice
Variant 4: set array a with the PreferredLocation on GPU
Variant 5: set array a with the AccessedBy for GPU
Variant 6: set array a with the PreferredLocation on CPU
Variant 7: set array a with the AccessedBy for CPU

Figure 2: Speedup of different code variants in gaussian (variants 1-7, speedup relative to the baseline).

3 DESIGN
CUDA allows programmers to use either discrete memory or unified memory APIs, potentially combined with different kinds of advice, to manage memory objects in one application or benchmark. Applications and benchmarks can be deployed with various combinations of these choices at different granularities, such as program-level, kernel-level, or object-level. Coarse-grain memory usage optimization is easy to implement but may not deliver the best performance gain. On the other hand, fine-grain memory usage optimization involves carefully deciding a choice for each object of each kernel, possibly even considering the kernel's calling context. This is challenging to implement but may yield the best performance improvements.

We limit the scope of this work to finding optimal memory usage choices at the object level, i.e., different memory usage choices are used for different data objects of a given kernel function under all calling contexts. For example, if a program has two kernel functions and both refer to two data objects during execution, we can assign the default unified memory choice to the first data object of the first kernel function and assign UM combined with the ReadMostly advice to the other data object. We can re-assign different choices to the data objects for the next kernel.

3.1 Approach
We are developing a machine learning framework to automatically decide the optimal choice of GPU memory usage for CUDA applications. The framework has a two-phase workflow: offline learning and online inference. As shown in Figure 1, the offline learning phase uses code variants with different GPU memory usage choices to collect a set of runtime profiling metrics via the Nsight Compute CUDA profiler (Nvidia Compute Command Line Interface: https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html). The best performing versions are identified and labeled. We then use the collected data as a training data set to construct a classifier model via supervised learning. We explore different kinds of machine learning classifiers, such as Random Forest, Random Tree, and LogitBoost, and select the one that yields the highest accuracy and F1 measures. The online inference phase uses Nsight to collect runtime metrics of running applications and passes the input feature vector to the learned model to guide the runtime GPU memory usage choices for various applications. We elaborate the two phases in the following subsections.

3.2 Offline Learning
In this offline learning phase, we design training configurations that help to capture diverse memory usage variants. Running these experiments yields the raw training data to train the classifiers.

Training Benchmarks. We manually prepare several variants of selected benchmarks and execute them to find the best performing variant, which is then labelled to support training later. The Rodinia benchmark suite [3] is selected to implement different memory usage choices for selected arrays or data structures. Figure 3 presents an example using unified memory and different cudaMemAdvise() settings for two arrays (a and b) in the gaussian benchmark. xplacer_malloc() is a wrapper function we introduce to switch between the discrete and unified memory versions of CUDA memory allocation.

Figure 3: Code showing the gaussian benchmark using unified memory with memory advice.

    ...
    // Modified to utilize unified memory in xplacer_malloc()
    a = (float *) xplacer_malloc(Size * Size * sizeof(float), Managed);
    b = (float *) xplacer_malloc(Size * sizeof(float), Managed);
    ...
    // Assign unified memory advice for a specific data object
    #if advOptionA == 1
      cudaMemAdvise(a, Size*Size*sizeof(float), cudaMemAdviseSetReadMostly, 0);
    #elif advOptionA == 2
      cudaMemAdvise(a, Size*Size*sizeof(float), cudaMemAdviseSetPreferredLocation, 0);
    ...
    #endif
    #if advOptionB == 1
      cudaMemAdvise(b, Size*sizeof(float), cudaMemAdviseSetAccessedBy, 0);
    #elif advOptionB == 2
      cudaMemAdvise(b, Size*sizeof(float), cudaMemAdviseSetPreferredLocation, 0);
    #endif
    ...
    ForwardSub(); // Run the kernel function
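The paper introduces xplacer_malloc() only as a wrapper that switches between discrete and managed allocation; its implementation is not shown. A minimal sketch of what such a wrapper might look like is given below; the Placement enum and the behavior of the discrete branch are assumptions made for illustration only.

    #include <cuda_runtime.h>
    #include <cstddef>
    #include <cstdlib>

    enum Placement { Discrete, Managed };  // hypothetical tag mirroring the "Managed" argument in Figure 3

    // Hypothetical wrapper: one call site, two allocation strategies.
    // In the Managed case the same pointer is valid on host and device; in the
    // Discrete case the benchmark would keep its original host allocation plus
    // explicit cudaMalloc/cudaMemcpy calls (not shown here).
    static void* xplacer_malloc(size_t bytes, Placement place) {
        void* ptr = nullptr;
        if (place == Managed) {
            if (cudaMallocManaged(&ptr, bytes) != cudaSuccess) return nullptr;
        } else {
            ptr = std::malloc(bytes);
        }
        return ptr;
    }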

Feature Engineering. We utilize the Nsight Compute command line profiler to fetch detailed runtime performance metrics of the benchmarks. We implement the data collection on machines using Tesla V100 GPUs. Nsight Compute provides metrics organized within different sections, each focusing on a specific part of the kernel analysis. The default profiling phase contains 8 sections: GPU Speed Of Light, Compute Workload Analysis, Memory Workload Analysis, Scheduler Statistics, Warp State Statistics, Instruction Statistics, Launch Statistics, and Occupancy. We collect a total of 49 non-zero-valued metrics that correspond to these sections. We then utilize feature correlation and information gain techniques to remove redundant features. The remaining 9 useful features are listed in Table 2.

Table 2: List of selected features in the model.
1. Elapsed Cycles
2. Duration
3. SM Active Cycles
4. Memory Throughput
5. Max Bandwidth
6. Avg. Executed Instructions Per Scheduler
7. Grid Size
8. Number of Threads
9. Achieved Active Warps Per SM

Model Training. We evaluate multiple classical machine learning classification algorithms with the collected data. These models include classifiers such as Random Forest, Random Tree, and Decision Tree. To guarantee the robustness of our model, we use 10-fold cross validation to verify the model's performance and to ensure the model is evaluated on unseen data. We rely on the model's prediction accuracy as the metric of evaluation since the GPU memory choice determined by this model has a direct correlation to the program execution time.

3.3 Online Inference
The online inference consists of five major steps, as shown in Figure 4. First, we fetch the runtime metrics from the running applications with the Nsight profiler. Next, a feature vector is composed after normalizing these metrics, which is then passed as input to the offline-trained model. The model then outputs its predicted memory advice for each kernel instance. Once the new predicted choice is given, as shown in the fourth step, we modify the original application code to implement the optimal choice of memory usage. Modifications to the source code can be automated in the future by a source-to-source tool, if available, or by library support to switch among the memory advice options. Finally, the optimized code is run with the corresponding memory advice.

Figure 4: Workflow of the online inference. (Steps: 1, fetch runtime metrics with Nsight, e.g., sm_cycles.sum, gpu_time.avg, dram_bytes.avg, l1tex_cycles.avg; 2, process them via the decision tree model; 3, obtain the optimal choice; 4, modify the benchmark, e.g., euler3d.cu, to use unified memory via xplacer_malloc() and add the specific cudaMemAdvise advice; 5, run with reduced overhead and lower execution latency.)

4 EVALUATION
4.1 Experiment Settings
We evaluate our approach with multiple benchmarks running on the Lassen supercomputer at Livermore Computing [8]. Each compute node of Lassen has two IBM Power9 CPUs and four Tesla V100 GPUs.

We selected four benchmarks from Rodinia [3] for our evaluation: Computational Fluid Dynamics Simulation (CFD), Breadth-first Search (BFS), Gaussian Elimination (Gaussian), and HotSpot, as shown in Table 3. All the variants are generated by two options of flags (cudaMemAttachGlobal or cudaMemAttachHost) given to data allocation by the cudaMallocManaged API, and six memory advise options (no advise, ReadMostly, PreferredLocation for GPU, PreferredLocation for CPU, AccessedBy GPU, and AccessedBy CPU) for each kernel in the benchmark. For example, we specify memory advises to three arrays in the CFD benchmark, so the overall number of variants becomes 432 (2×6×6×6). There are six arrays used in one GPU kernel for the BFS benchmark, with a total of 93312 (2 × 6^6) variants; we reduce the variant number down to only 84 (2×6×6) by only specifying advise to one array in a variant. Note that we use a large number of variants to extract the different runtime metrics. When used in training and prediction, we categorize these variants by program-level advice based on the most common advice among the variants with minimal execution time. We use the default input data provided by the Rodinia benchmark suite and generate additional input data sets for HotSpot and Gaussian following the instructions given by the Rodinia suite.

Table 3: Benchmarks for experiments (kernels / arrays / variants / input data sets)
CFD: 4 kernels, 3 arrays, 2×6×6×6 variants, 3 input data sets
BFS: 2 kernels, 6 arrays, 2×6×6 variants, 3 input data sets
Gaussian: 2 kernels, 3 arrays, 2×6×6×6 variants, 67 input data sets
HotSpot: 1 kernel, 2 arrays, 2×6×6 variants, 8 input data sets

4.2 Preliminary Results
We collect a total of 2,753 instances for the training data. After normalization and reformatting, the data is merged into one single dataset. We split it into ten subsets, train the evaluated models with many algorithms based on nine of the ten, and use the final subset to examine the advice predictions of the trained models.

We evaluate the collected data with various classifiers and display the F-measure scores in Figure 5, compared against the ground truth, which is the memory advice that yields the best performance. Note that the F-measure score is the harmonic mean of the fraction of correctly predicted advice among all predicted results and the fraction of correctly identified advice among the original results. The measured values are across all the evaluated benchmarks with various input data set sizes. It can be observed that the model with a Random Forest classifier achieves the best performance, with an F-measure of up to 96.3%. These results establish that our approach can effectively predict optimal choices for the benchmarks. The model is generic and portable across different applications and input data set sizes and thus demonstrates its potential use in guiding optimal memory choices.

Figure 5: Evaluation of the model prediction accuracies (F-Measure, %) of different classifiers: Random Forest, Random Tree, J48 Classifier, Bagging, and REPTree (range shown: roughly 90-98%).

Figure 6 shows performance comparisons between the execution times of the original benchmarks, which are the baseline for this evaluation, and of the codes with the predicted memory advice. All the selected benchmarks with the predicted memory advice achieve equivalent or better performance compared to the original benchmarks. The model can thus effectively assist in achieving better performance for the selected benchmarks.

5 RELATED WORK
Numerous studies have explored optimization strategies to place data within the various types of memories and caches of GPUs, without considering unified memory. For example, PORPLE [4, 5] is a portable approach using a lightweight performance model to guide run-time selection of optimal data placement policies. Huang and Li [6] analyzed correlations among different data placements and used a sample data placement to predict performance for other data placements. Jang et al. [7] presented several rules based on data access patterns to guide memory selection for a Tesla GPU. Yang et al. [18] proposed a compiler-based approach to generate kernel variants for exploiting memory characteristics. More recently, Stoltzfus et al. [15] designed a machine learning approach for guiding data placement using offline profiling and online inference on Volta GPUs. Bari et al. [2] studied the impact of data placement on newer generations of GPUs such as Volta.

Many applications have been implemented with unified memory in high performance computing to reduce the complexities of memory management [12]. An investigation of an early implementation of unified memory [9] showed that applications did not perform well in most cases due to high overhead caused by CUDA 6. Sakharnykh [14] presented a comprehensive overview of unified memory on three generations of GPUs (Kepler, Pascal and Volta), with a few application studies using advanced UM features. Awan et al. [1] exploited advanced unified memory features in CUDA 9 and Volta GPUs for out-of-core DNN training. They observed minor performance degradation in OC-Caffe with the help of memory prefetching and advise operations.

Unified memory has also been studied in the context of OpenACC and OpenMP. OpenARC [10] is an OpenACC compiler with extensions to support unified memory. They found that unified memory is beneficial if only a small random portion of the data is accessed. Wolfe et al. [17] studied how the data model is supported in several OpenACC implementations and mentioned that some implementations were able to use unified memory. Mishra et al. [13] evaluated unified memory for OpenMP GPU offloading. They reported that UM performance was competitive for benchmarks with little data reuse, while it incurred significant overhead for large amounts of data reuse with memory over-subscription. Li et al. [11] proposed a compiler-runtime collaborative approach to optimize GPU data under unified memory within an OpenMP implementation. Static and runtime analysis are used to collect data access properties to guide whether data should be placed on CPU or GPU memory, and how to transfer the data (explicitly through traditional memory copy operations vs. implicitly through UM) if mapped to the GPU.

Figure 6: The comparison of execution times between baseline benchmarks and model-predicted performances. ((a) BFS, (b) CFD, (c) gaussian, (d) hotspot; bars show baseline versus model-predicted execution time in seconds for several input sizes and kernels, with the model-predicted versions reducing execution latency.)

6 CONCLUSION & FUTURE WORK
In this paper, we present a novel machine learning-based approach which can guide the optimal use of unified memory of GPUs for various applications at runtime. It consists of two phases: offline learning and online inference. After collecting and filtering the offline metrics from multiple benchmarks, we train a machine learning model based on the remaining useful metrics. We then use the trained model to guide the online execution of applications by predicting the optimal memory choices for each kernel based on its runtime metrics. The experimental results show that, given a set of CUDA benchmarks, the proposed approach is able to accurately determine what kind of memory choices are optimal: either in discrete memory or unified memory space (combined with various memory advice hints). It alleviates the burden on application developers by automating the complex decision-making process which would otherwise require extensive, time-consuming experiments.

In the future, we will extend this work to evaluate the advice choices at a finer granularity considering calling context. Second, using collaborative compiler and runtime support, we will employ runtime code generation and/or adaptation techniques to automatically generate codes using the suggested optimal memory choices. Third, we will evaluate the overhead of collecting training data and investigate how to reduce it. Last but not least, the model will be applied to more hardware platforms.

ACKNOWLEDGMENTS
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and supported by LLNL-LDRD 18-ERD-006. This research was funded in part by the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. LLNL-CONF-793704.

REFERENCES
[1] Ammar Ahmad Awan, Ching-Hsiang Chu, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K Panda. 2018. OC-DNN: Exploiting Advanced Unified Memory Capabilities in CUDA 9 and Volta GPUs for Out-of-Core DNN Training. In 2018 IEEE 25th International Conference on High Performance Computing (HiPC). IEEE, 143-152.
[2] M Bari, Larisa Stoltzfus, P Lin, Chunhua Liao, Murali Emani, and Barbara Chapman. 2018. Is data placement optimization still relevant on newer GPUs? Technical Report. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
[3] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 44-54.
[4] Guoyang Chen, Xipeng Shen, Bo Wu, and Dong Li. 2017. Optimizing data placement on GPU memory: A portable approach. IEEE Trans. Comput. 66, 3 (2017), 473-487.
[5] Guoyang Chen, Bo Wu, Dong Li, and Xipeng Shen. 2014. PORPLE: An extensible optimizer for portable data placement on GPU. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 88-100.
[6] Yingchao Huang and Dong Li. 2017. Performance modeling for optimal data placement on GPU with heterogeneous memory systems. In Cluster Computing (CLUSTER), 2017 IEEE International Conference on. IEEE, 166-177.
[7] Byunghyun Jang, Dana Schaa, Perhaad Mistry, and David Kaeli. 2010. Exploiting memory access patterns to improve memory performance in data-parallel architectures. IEEE Transactions on Parallel & Distributed Systems 1 (2010), 105-118.
[8] Lawrence Livermore National Laboratory. 2019. Lassen supercomputer. https://computing.llnl.gov/computers/lassen
[9] Raphael Landaverde, Tiansheng Zhang, Ayse K Coskun, and Martin Herbordt. 2014. An investigation of unified memory access performance in CUDA. In 2014 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1-6.
[10] Seyong Lee and Jeffrey S Vetter. 2014. OpenARC: extensible OpenACC compiler framework for directive-based accelerator programming study. In Proceedings of the First Workshop on Accelerator Programming using Directives. IEEE Press, 1-11.
[11] Lingda Li, Hal Finkel, Martin Kong, and Barbara Chapman. 2018. Manage OpenMP GPU Data Environment Under Unified Address Space. In International Workshop on OpenMP. Springer, 69-81.
[12] Wenqiang Li, Guanghao Jin, Xuewen Cui, and Simon See. 2015. An evaluation of unified memory technology on NVIDIA GPUs. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. IEEE, 1092-1098.
[13] Alok Mishra, Lingda Li, Martin Kong, Hal Finkel, and Barbara Chapman. 2017. Benchmarking and evaluating unified memory for OpenMP GPU offloading. In Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC. ACM, 6.
[14] Nikolay Sakharnykh. 2018. Everything You Need to Know About Unified Memory. http://on-demand.gputechconf.com/gtc/2018/presentation/s8430-everything-you-need-to-know-about-unified-memory.pdf
[15] Larisa Stoltzfus, Murali Emani, Pei-Hung Lin, and Chunhua Liao. 2018. Data Placement Optimization in GPU Memory Hierarchy using Predictive Modeling. In Proceedings of the Workshop on Memory Centric High Performance Computing. ACM, 45-49.
[16] Tiffany Trader. 2019. Top500 Purely Petaflops; US Maintains Performance Lead. (June 2019). https://www.hpcwire.com/2019/06/17/us-maintains-performance-lead-petaflops--list/
[17] Michael Wolfe, Seyong Lee, Jungwon Kim, Xiaonan Tian, Rengan Xu, Sunita Chandrasekaran, and Barbara Chapman. 2017. Implementing the OpenACC data model. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 662-672.
[18] Yi Yang, Ping Xiang, Jingfei Kong, and Huiyang Zhou. 2010. A GPGPU Compiler for Memory Optimization and Parallelism Management. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '10). ACM, New York, NY, USA, 86-97. https://doi.org/10.1145/1806596.1806606

Appendix: Artifact Description/Artifact Evaluation

ARTIFACT AVAILABILITY
Software Artifact Availability: List of URLs and/or DOIs where artifacts are available:
• Evaluation artifact: https://gitlab.com/AndrewXu22/optimal_unified_memory.git
• Original Rodinia Benchmark: http://lava.cs.virginia.edu/Rodinia/download.htm

SUMMARY OF THE EXPERIMENTS REPORTED
BASELINE EXPERIMENTAL SETUP, AND MODIFICATIONS MADE FOR THE PAPER
Relevant hardware details: Nvidia GPU Tesla V100
Operating systems and versions: Debian GNU/Linux 16.04
Compilers and versions: NVCC
Applications and versions: Rodinia Benchmark 3.1
Libraries and versions: CUDA 10.1 with Nsight command line tool
Key algorithms: Random Forest, J48
Input datasets and versions: provided in evaluation artifact
Paper Modifications:

ARTIFACT EVALUATION
The workflow can be performed with the following commands:
git clone https://gitlab.com/AndrewXu22/optimal_unified_memory.git
cd optimal_unified_memory
./script/driver.sh

UMap : Enabling Application-driven Optimizations for Page Management

Ivy B. Peng∗, Marty McFadden∗, Eric Green∗, Keita Iwabuchi∗, Kai Wu†, Dong Li†, Roger Pearce∗, Maya B. Gokhale∗
∗Lawrence Livermore National Laboratory, Livermore, CA, USA ({peng8, mcfadden8, green77, iwabuchi1, pearce7, gokhale2}@llnl.gov)
†University of California, Merced, CA, USA ({kwu42, dli35}@ucmerced.edu)

Abstract—Leadership supercomputers feature a diversity of storage, from node-local persistent memory and NVMe SSDs to network-interconnected flash memory and HDD. Memory mapping files on different tiers of storage provides a uniform interface in applications. However, system-wide services like mmap are optimized for generality and lack flexibility for enabling application-specific optimizations. In this work, we present UMap to enable user-space page management that can be easily adapted to access patterns in applications and storage characteristics. UMap uses the userfaultfd mechanism to handle page faults in multi-threaded applications efficiently. By providing a data object abstraction layer, UMap is extensible to support various backing stores. The design of UMap supports dynamic load balancing and I/O decoupling for scalable performance. UMap also uses application hints to improve the selection of caching, prefetching, and eviction policies. We evaluate UMap in five benchmarks and real applications on two systems. Our results show that leveraging application knowledge for page management can substantially improve performance. On average, UMap achieved 1.25 to 2.5 times improvement using the adapted configurations compared to the system service.

Index Terms—memory mapping, memmap, page fault, user-space paging, userfaultfd, page management

I. INTRODUCTION
Recently, leadership supercomputers provide enormous storage resources to cope with expanding data sets in applications. The storage resources come in a hybrid format for balanced cost and performance [9], [11], [12]. Fast and small storage, implemented using advanced technologies like persistent memory and NVMe SSDs, is often co-located with computing units inside the compute node. Storage with massive capacity, on the other hand, uses cost-effective technologies like HDD and is interconnected to compute nodes through the network. In between, burst buffers use fast memory technologies and are accessible through the network. Memory mapping provides a uniform interface to access files on different types of storage as if they were dynamically allocated memory. For instance, out-of-core data analytic workloads often need to process large datasets that exceed the memory capacity of a compute node [17]. Using memory mapping to access these datasets shifts the burden of paging, prefetching, and caching data between storage and memory to the operating system.

Currently, operating systems provide the mmap system call to map files or devices into memory. This system service performs well in loading dynamic libraries and could also support out-of-core execution. However, as a system-level service, it has to be tuned for performance reliability and consistency over a broad range of workloads. Therefore, it may reduce opportunities for optimizing performance based on application characteristics. Moreover, backing stores on different storage exhibit distinctive performance characteristics. Consequently, configurations tuned for one type of storage will need to be adjusted when mapping on another type of storage. In this work, we provide UMap to enable application-specific optimizations for page management when memory mapping various backing stores. UMap is highly configurable to adapt user-space paging to suit application needs. It facilitates application control of caching, prefetching, and eviction policies with minimal porting effort from the programmer. As a user-level solution, UMap confines changes within an application without impacting other applications sharing the platform, which is unachievable in system-level approaches.

We prioritize four design choices for UMap based on surveying realistic use cases. First, we choose to implement UMap as a user-level library so that it can maintain compatibility with the fast-moving Linux kernel without the need to track and modify for frequent kernel updates. Second, we employ the recent userfaultfd [7] mechanism, rather than the signal handling plus callback function approach, to reduce overhead and performance variance in multi-threaded applications. Third, we target an adaptive solution that sustains performance even at high concurrency for data-intensive applications, which often employ a large number of threads to hide data access latency. Our design pays particular attention to load imbalance among service threads to improve the utilization of shared resources even when data accesses to pages are skewed: UMap dynamically balances workloads among all service threads to eliminate the bottleneck of serving hot pages. Finally, for flexible and portable tuning on different computing systems, UMap provides both API and environmental controls to enable configurable page sizes, eviction strategies, application-specific prefetching, and detailed diagnosis information for the programmer.

We evaluate the effectiveness of UMap in five use cases, including two data-intensive benchmarks, i.e., a synthetic sort benchmark and a breadth-first search (BFS) kernel, and three real applications, i.e., Lrzip [8], the N-Store database [2], and an asteroid detection application that processes massive data sets from telescopes. We conduct out-of-core experiments on two systems with node-local SSD and network-interconnected HDD storage. Our results show that UMap can enable flexible user-space page management in data-intensive applications. On the AMD testbed with local NVMe SSD, applications achieved 1.25 to 2.5 times improvement compared to the standard system service. On the Intel testbed with network-interconnected HDD, UMap brings the performance of the asteroid detection application close to that of using local SSD for 500 GB data sets. In summary, our main contributions are as follows:
• We propose an open-source library, called UMap (v2.0.0, https://github.com/LLNL/umap), that leverages the lightweight userfaultfd mechanism to enable application-driven page management.
• We describe the design of UMap for achieving scalable performance in multi-threaded data-intensive applications.
• We demonstrate five use cases of UMap and show that enabling configurable page sizes is essential for performance tuning in data-intensive applications.
• UMap improves the performance of tested applications by 1.25 to 2.5 times compared to the standard mmap system service.
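As a point of reference for the comparisons above, the minimal C-style example below memory-maps a file read-only with the standard mmap call and lets the kernel resolve every fault, which is exactly the behavior UMap moves into user space. The file name is illustrative.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "dataset.bin";          /* illustrative input file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Kernel-managed mapping: paging, read-ahead and eviction are up to the OS. */
        unsigned char *base = (unsigned char *)mmap(NULL, st.st_size, PROT_READ,
                                                    MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)   /* each first touch may page-fault */
            sum += base[i];
        printf("checksum %lu\n", sum);

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }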

Fig. 1: The UMap architecture. (An application's virtual address space contains UMap regions; page faults on these regions are resolved by filler threads that read data through store objects backed by PM, SSD, network-attached SSD, network-attached HDD, or other storage, subject to prefetching and eviction policies over an internal buffer, while evictor threads write dirty pages back to the backend storage.)

II. BACKGROUND AND MOTIVATION
In this section, we introduce memory mapping, the prospective benefits of user-space page management, and the enabling mechanism, userfaultfd.

A. Memory Mapping
Memory mapping links images and files in persistent storage to the virtual address space of a process. The operating system employs demand paging to bring only accessed virtual pages into physical memory, because virtual memory can be much larger than physical memory. An access to a memory-mapped region triggers a page fault if no page table entry (PTE) is present for the accessed page. When such a page fault is raised, the operating system resolves it by copying the physical data page from storage into the in-memory page cache.

Common strategies for optimizing memory mapping in the operating system include the page cache, read-ahead, and madvise hints. The page cache is used to keep frequently used pages in memory, while less important pages may need to be evicted from memory to make room for newly requested pages; the Least Recently Used (LRU) policy is commonly used for selecting pages to be evicted. The operating system may proactively flush dirty pages, i.e., modified pages in the page cache, to storage when the ratio of dirty pages exceeds a threshold value [19]. Read-ahead preloads pages into physical memory to avoid the overhead associated with page fault handling, TLB misses, and user-to-kernel mode transitions. Finally, the madvise interface takes hints that allow the operating system to make informed decisions for managing pages.

B. User-space Page Management
User-space page management uses application threads to resolve page faults and manage virtual memory in the background as defined by the application. The userfaultfd mechanism is a lightweight way to enable user-space paging compared to the traditional SIGSEGV signal and callback function approach [7]. Applications register address ranges to be managed in user space and specify the types of events, e.g., page faults and events in un-cooperative mode, to be tracked. Page faults in the registered address ranges are delivered asynchronously so that the faulting process is blocked instead of idling, allowing other processes to be scheduled to proceed.

The fault-handling thread in the application can atomically resolve page faults with the UFFDIO_COPY ioctl, which ensures the faulting process is (optionally) woken up only after the requested page has been fully copied into physical memory [7]. The fault-handling threads may utilize application-specific knowledge to optimize this procedure, providing flexibility that is unachievable in kernel mode. For instance, the application could select arbitrary page sizes or read-ahead window sizes, or provide specific pages for prefetching or eviction. All these optimizations remain inside one application and will not impact other applications sharing the same system. User-space paging is also not limited to backing stores on file systems: in contrast to kernel mode, the fault-handling thread has the liberty to fetch data from a variety of backing stores, such as a memory server, a database, or even another process.
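The userfaultfd workflow described above is exposed through a file descriptor and a pair of ioctls. The sketch below shows the minimal sequence on Linux: create the descriptor, register a range for missing-page events, and show how a fault event would be resolved with UFFDIO_COPY. Error handling is reduced to asserts, the page is simply zero-filled, and no thread actually touches the region here (so the non-blocking read returns immediately); UMap's real service threads are of course more elaborate.

    #include <assert.h>
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);

        /* 1. Create the userfaultfd descriptor and negotiate the API. */
        int uffd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
        assert(uffd >= 0);
        struct uffdio_api api = { .api = UFFD_API, .features = 0 };
        assert(ioctl(uffd, UFFDIO_API, &api) == 0);

        /* 2. Create a region and register it for missing-page faults. */
        char *region = (char *)mmap(NULL, page, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(region != MAP_FAILED);
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)region, .len = page },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        assert(ioctl(uffd, UFFDIO_REGISTER, &reg) == 0);

        /* In a real program, service threads block on this read while application
           threads touch the region; here nothing faults, so the read just returns. */
        struct uffd_msg msg;
        if (read(uffd, &msg, sizeof(msg)) == sizeof(msg) &&
            msg.event == UFFD_EVENT_PAGEFAULT) {
            /* 3. Build the page contents (zeros here; UMap reads its backing store). */
            static char buf[65536];
            memset(buf, 0, sizeof(buf));
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~(page - 1),
                .src = (unsigned long)buf,
                .len = page,
                .mode = 0,
            };
            /* 4. Atomically install the page and wake the faulting thread. */
            ioctl(uffd, UFFDIO_COPY, &copy);
        }

        munmap(region, page);
        close(uffd);
        return 0;
    }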
D. Extensible Back Store

UMap provides a data object abstraction layer to support different types of backing stores. Currently, applications running on leadership supercomputers have multiple choices of storage, including local SSD, network-interconnected SSD, and HDD. In the future, architectures that provide disaggregated memory and storage resources are likely to emerge. Based on this observation, our design ensures that UMap is extensible for current and future architectures.

UMap allows applications to associate their own backing store with each memory region. The application has explicit control over which storage layer is accessed to resolve a page fault. In this way, an application is presented with a uniform interface, the virtual memory address space, while UMap in the backend handles data movement to and from various types of storage.

E. User-controlled Page Flushing

We design UMap to enable user-space control of page flushing to a persistent store. There are two motivations. First, the system service may write dirty pages to storage whenever the operating system deems appropriate. Unpredictable behavior may occur if a memory range requires strong consistency, such as atomicity among multiple pages. Second, frequent page flushing is known to cause increased performance variation and degradation. For instance, RHEL triggers page flushing when more than 10% of pages are dirty [19]. With user control, the application can avoid aggressive page flushing by setting a high threshold, or even postpone page flushing to a later stage. UMap monitors the ratio of dirty pages and compares it with a user-defined high watermark that triggers page flushing, as well as a low watermark that suspends page flushing.

F. Application-Specific Optimization

UMap maintains a set of parameters for programmers with application knowledge to configure page management. One of the most performance-critical parameters is the internal page size of a memory region, denoted as the UMap page. UMap supports an arbitrary page size for each memory region, while the system service only supports fixed page sizes. The UMap page defines the finest granularity of data movement between memory and the backing store. For the same memory region, choosing a large UMap page could reduce the overhead of metadata, but may also move more than the accessed data into memory. By tuning the page size, an application can identify an optimal configuration that balances the overhead and data usage. Also, an application can control the page buffer size, which can alleviate out-of-memory (OOM) situations in unconstrained mmap.

UMap also supports a flexible prefetching policy that can fetch pages even in irregular patterns. Operating systems usually recognize page accesses as either sequential or random, to increase or decrease the read-ahead window size, respectively. Real-world applications, however, exhibit complex access patterns, and the general prefetching mechanism becomes insufficient. In contrast, UMap can prefetch a set of arbitrary pages into memory, as informed by the application. Moreover, an application can control the start of prefetching to avoid premature data migration that interferes with pages in use. This flexibility, together with knowledge from the application algorithm or offline profiling, eases application performance tuning.
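One way to picture the store-object abstraction of Section III-D is as a small read/write interface implemented once per backing store; the classes below are a sketch under that assumption rather than UMap's actual Store API.

    #include <cstddef>
    #include <unistd.h>

    // Hypothetical backend interface: one instance per UMap region.
    struct BackingStore {
      virtual ~BackingStore() = default;
      virtual ssize_t read_pages(void* buf, size_t len, off_t region_offset) = 0;
      virtual ssize_t write_pages(const void* buf, size_t len, off_t region_offset) = 0;
    };

    // A file-backed store using positional I/O; a network-attached or multi-file
    // store would implement the same two calls against its own transport.
    struct FileStore : BackingStore {
      int fd;
      explicit FileStore(int f) : fd(f) {}
      ssize_t read_pages(void* buf, size_t len, off_t off) override {
        return pread(fd, buf, len, off);
      }
      ssize_t write_pages(const void* buf, size_t len, off_t off) override {
        return pwrite(fd, buf, len, off);
      }
    };

Because fillers and evictors only ever call these two operations, the rest of the runtime stays independent of the storage type.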
IV. IMPLEMENTATION

UMap is implemented in C++ and uses the userfaultfd system call [1]. UMap enables application control of page management through both an API and environment variables. The fault-handling thread resolves a page fault by calling the application-supplied function (if provided), or by performing direct I/O to the backing store through the defined access functions. UMap uses the UFFDIO_COPY ioctl [7] to ensure an atomic copy to the allocated memory page before waking up the blocked process.

A. API

UMap provides interfaces similar to mmap to ease porting existing applications. An application can register and unregister multiple memory regions to be managed by UMap through the umap and uunmap interfaces. One additional flexibility provided by UMap is the multi-file backed region: given a set of files, each with individual offsets and sizes, UMap maps them into a contiguous memory region. While applications can rely on the UMap runtime for managing pages, UMap also provides a plugin architecture that allows applications to register callback functions. A set of configuration interfaces with the naming convention umapcfg_set_xx allows the application to control paging explicitly: (1) the maximum size of physical memory used for buffering pages; (2) the level of concurrency for processing I/O operations in each group of workers; (3) the threshold values for starting or suspending the writing of dirty pages to back stores. Listing 1 illustrates a simple application that uses the paging and prefetching services in UMap.

Listing 1: UMap API

    int fd = open(fname, O_RDWR);
    char* base = (char*) umap(NULL, totalbytes, PROT_READ|PROT_WRITE, UMAP_PRIVATE, fd, 0);

    // Select two non-contiguous pages to prefetch
    std::vector<umap_prefetch_item> pfi;
    umap_prefetch_item p0 = { .page_base_addr = &base[5 * psize] };
    pfi.push_back(p0);
    umap_prefetch_item p1 = { .page_base_addr = &base[15 * psize] };
    pfi.push_back(p1);
    umap_prefetch(num_prefetch_pages, &pfi[0]);

    computation();

    // release resources
    uunmap(base, totalbytes);

B. Environmental Controls

UMap uses a set of environment variables to control the number of fillers and evictors, the buffer size, the buffer draining policy, and the read-ahead window size. We highlight the key environment variables that UMap tracks to dictate its runtime behavior:
• UMAP_PAGESIZE sets the internal page size for memory regions.
• UMAP_PAGE_FILLERS sets the number of workers that perform read operations from the backing store. Default: the number of hardware threads.
• UMAP_PAGE_EVICTORS sets the number of evictors that perform eviction of pages. Eviction includes writing to the backing store if the page is dirty and informing the operating system that the page is no longer needed. Default: the number of hardware threads.
• UMAP_EVICT_HIGH_WATER_THRESHOLD sets the threshold of the UMap buffer that triggers the evicting procedure. Default: 90%.
• UMAP_EVICT_LOW_WATER_THRESHOLD sets the threshold of the UMap buffer that suspends the evicting procedure. Default: 70%.
• UMAP_BUFSIZE sets the amount of physical memory to be used for buffering UMap pages. Default: 80% of available memory.
• UMAP_READ_AHEAD sets the number of pages to read ahead when resolving a demand-paging fault. Default: 0.
• UMAP_MAX_FAULT_EVENTS sets the maximum number of page fault events that will be read from the kernel in a single call. Default: the number of hardware threads.

C. Limitations

The current implementation uses the write-protection support from the kernel to track dirty pages in physical memory. For pages in write-protected memory ranges, a write triggers a fault that sends a UFFD message to the handling threads. Currently, the write-protection support in userfaultfd is only available in the experimental Linux kernel (patch series at https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git).
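As a usage illustration of the controls in Section IV-B, the fragment below configures the runtime through environment variables before creating a region; the header path, the value formats, and the specific numbers are assumptions chosen for the example, not recommended settings.

    #include <cstdlib>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <umap/umap.h>   // assumed header path for the UMap API

    void* map_with_tuning(const char* fname, size_t totalbytes) {
      // Configure the runtime before the first region is created.
      // Values are placeholders; see Section IV-B for the meaning of each variable.
      setenv("UMAP_PAGESIZE", "8388608", 1);   // e.g., an 8 MiB internal page
      setenv("UMAP_PAGE_FILLERS", "48", 1);
      setenv("UMAP_PAGE_EVICTORS", "24", 1);

      int fd = open(fname, O_RDWR);
      // Same calling convention as Listing 1.
      return umap(NULL, totalbytes, PROT_READ | PROT_WRITE, UMAP_PRIVATE, fd, 0);
    }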
• UMAP EVICT LOW WATER THRESHOLD sets the threshold in UMap buffer to suspend evicting procedure. A. API Default: 70% • UMAP BUFSIZE sets the size of physical memory to be UMap provides similar interfaces as mmap to ease porting used for buffering UMap pages. Default: (80% of available existing applications. An application can register/unregister memory) multiple memory regions to be managed by UMap through • UMAP READ AHEAD sets the number of pages to read- the umap and uunmap interface. One additional flexibility ahead when resolving a demand paging. Default: 0 provided by UMap is the multi-file backed region. Given a • UMAP MAX FAULT EVENTS: sets the maximum num- set of files, each with individual offsets and size, UMap maps ber of page fault events that will be read from the kernel in a them into a contiguous memory region. While applications single call. Default: the number of hardware threads. can rely on UMap runtime for managing pages, UMap also provides a plugin architecture that allows application to reg- C. Limitations ister callback functions. A set of configuration interfaces with The current implementation uses the write protection sup- naming convention umapcfg_set_xx, allow the application port from the kernel to track dirty pages in the physical to control paging explicitly: (1) the maximum size of physical memory. For pages in write-protected memory ranges, a writes memory used for buffering pages; (2) the level of concurrency will trigger a fault that sends a UFFD message to handling for processing I/O operations in each group of workers; (3) the threads. Currently, the write protection support in userfaultfd threshold value for starting or suspending writing dirty pages is only available in the experimental Linux kernel 2. to back stores. Listing 1 illustrates a simple application that uses paging and prefetching services in UMap . 2Linux Patch https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git. TABLE I: The AMD Testbed Specifications 5E+04 3.0 mmap Platform Penguin R Altus R XE2112 (Base Board: MZ91-FS0-ZB) 5E+04 UMap 2.5 Processor AMD EPYC 7401 4E+04 Reference CPU 24 cores (48 hardware threads) × 2 sockets Speedup 4E+04 Speed 1.2 GHz 2.0 3E+04 Caches 64KB 8-way L1d and 32KB 4-way L1i, 512KB 8-way private L2, 8MB 8-way shared L3 per three cores 3E+04 1.5 Memory 16 GB DDR4 RDIMM × 8 channels (2400 MT/s) × 2 sockets 2E+04 Speedup

Storage ≈ 3 TB NVMe (type: HGST SN200) Time (second) 1.0 2E+04 1E+04 TABLE II: The Intel Testbed Specifications 0.5 5E+03 Platform S2600WTTR (Base Board: S2600WTTR) 0E+00 0.0 Processor Intel Xeon E5-2670 v3 (Haswell) 4K 64K 128K 256K 512K 1M 2M 4M 8M 16M 32M CPU 12 cores (24 hardware threads) × 2 sockets Page Size Speed 2.3 GHz (Turbo 3.1 GHz) Caches 32KB 8-way L1d and 32KB 8-way L1i, 256KB 8-way private L2, 30MB Fig. 2: The performance of UMap for sorting 500 GiB data 20-way shared L3 on NVMe-SSD on the AMD testbed, as normalized to that of Memory 2 16 GB DDR4 RDIMM × 4 channels (1866 MT/s) × 2 sockets Storage ≈ 1.5 TB NVMe SSD(type: HGST SN200) mmap. UMap starts outperforming mmap when the page size is larger than 64KB. At the page size of 8 MB, UMap achievs 2.5 times improvement compared to mmap. V. EXPERIMENTAL SETUP In this section, we describe the experimental setup for the evaluation. We summarize the configuration parameters of two 96 OpenMP threads on the AMD testbed with 256GiB of testbeds in Table I and II. The AMD testbed includes three physical memory. The data set is stored on the local NVMe- identical machines (Altus, Bertha, Pmemio) that feature two SSD device configured with its default boot-time values. We AMD EPYC 7401 (24 cores /48 hardware threads) processors. report the experimental results in Figure 2. The testbed has a total of 256 GB DDR4 DRAM and 16 We used different numbers of fillers and evictors to identify memory channels that operate at 2400 MT/s. Each machine the optimal concurrency for this benchmark. In most tested has a total of 4.65 TiB disk capacity, including 1.8 GB SATA cases, using 48 fillers and 24 evictors brings the best perfor- Micron 5200 Series SATA SSD. The platform runs Fedora 29 mance. We then fixed the number of fillers and evictors to with Linux kernel 5.1.0-rc4-uffd-wp-207866-gcc66ef4-dirty test the impact of different page sizes. For the mmap tests, we (experimental version) . We compiled all applications using use its default setting and the standard 4KiB page size. For GCC 8.3.1 compiler with support for OpenMP. We use the UMap tests, we change the page size to identify the optimal local SSD on the AMD testbed to evaluate the impact of configuration. At the smallest page size, UMap shows much UMap page sizes in all applications. The second testbed, the higher overhead than mmap. We find that increasing page sizes Intel testbed is on a cluster called flash. Its storage includes a in UMap steadily improves the performance. At 64KiB page, remote HDD through Lustre parallel distributed file system. It UMap starts outperforming mmap. By adjusting UMap page also features 1.5 TB local SSD. We test the asteroid detection size to 8MiB, the UMap version achieves 2.5 times speedup application on this testbed to compare the performance of the compared to the mmap version. One reason for the improved backing store on Lustre with the local SSD. The platform runs performance at larger page sizes is that the reduction in page the Red Hat Enterprise Linux 7.6 kernel. We compiled all faults, which reduces the time spent in servicing page faults applications using GCC 8.1.0 compiler. and also aggregate smaller data transfers into bulky transfers to exploit bandwidth. As the change is localized to the application VI.EVALUATION process, there is no need to modify any OS page size or file In this section, we evaluate the performance of UMap in system prefetch settings. data-intensive benchmarks and applications. In particular, we B. 
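The umapsort source is not reproduced in the paper; the sketch below only illustrates the pattern it describes, mapping the file-backed data set with umap() and sorting it in place in descending order (the OpenMP-parallel quicksort is replaced by std::sort for brevity, and the umap/umap.h header path is assumed).

    #include <algorithm>
    #include <cstdint>
    #include <fcntl.h>
    #include <functional>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <umap/umap.h>   // assumed header path

    void sort_file_descending(const char* path) {
      int fd = open(path, O_RDWR);
      struct stat st;
      fstat(fd, &st);

      // Map the whole data set; the UMap runtime faults pages in and evicts them as needed.
      uint64_t* words = static_cast<uint64_t*>(
          umap(NULL, st.st_size, PROT_READ | PROT_WRITE, UMAP_PRIVATE, fd, 0));

      // Read-write workload: sorting rewrites the mapped region in place.
      std::sort(words, words + st.st_size / sizeof(uint64_t), std::greater<uint64_t>());

      uunmap(words, st.st_size);
      close(fd);
    }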
B. Graph Application

We implemented a conventional level-synchronous BFS algorithm. Our BFS program takes a graph in compressed sparse row (CSR) format and stores only the CSR graph on the storage device. We used a separate program to generate the CSR graph so that the benchmark is read-only, and we dropped the page cache before running the benchmark to achieve consistent results. As for the dataset, we used an R-MAT graph generator with the edge probabilities used in Graph500.

Figure 3 shows UMap's BFS performance normalized to mmap's best-performing case, where read-ahead is off. We varied the UMap page size from 4 KB to 4 MB and used the default values for its other environment variables. UMap showed its best performance and outperformed mmap by 1.8x with a 512 KB page size, whereas mmap slowed down as the page size increased. We clearly confirmed the benefit of UMap's variable page size feature in terms of not only providing user-level control but also better performance.
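For reference, a generic level-synchronous BFS over a CSR graph follows the shape below; this is a textbook formulation rather than the authors' benchmark, and it assumes row_ptr and col_idx point into the memory-mapped CSR arrays.

    #include <cstdint>
    #include <vector>

    // One BFS from 'source' over a CSR graph with n vertices.
    // row_ptr has n+1 entries; col_idx holds the neighbor lists back to back.
    std::vector<int64_t> bfs_csr(const uint64_t* row_ptr, const uint64_t* col_idx,
                                 uint64_t n, uint64_t source) {
      std::vector<int64_t> level(n, -1);
      std::vector<uint64_t> frontier{source}, next;
      level[source] = 0;

      // Level-synchronous: expand the whole frontier before moving to the next level.
      for (int64_t depth = 1; !frontier.empty(); ++depth) {
        next.clear();
        for (uint64_t u : frontier)
          for (uint64_t e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
            uint64_t v = col_idx[e];
            if (level[v] < 0) { level[v] = depth; next.push_back(v); }
          }
        frontier.swap(next);
      }
      return level;
    }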

Fig. 3: The relative performance of UMap compared to that of mmap for BFS on an R-MAT scale-31 CSR graph (529 GB of data) on NVMe on the AMD testbed.

Fig. 4: The relative performance of UMap compared to that of the mmap version for LRZIP on 64 GB of random data on NVMe on the AMD testbed.

C. File Compression

Long Range ZIP (lrzip) is a program that implements a full-file compression algorithm [8]. Compression algorithms detect redundancies in input files to reduce their size. Lrzip uses a modified RZIP algorithm to achieve an effectively unlimited compression window size. The original mmap version of lrzip uses a large buffer, e.g., one-third of system memory, to mmap a window that 'slides' through the input file. When matches are found, lrzip may use a secondary 64 KB mmap region to page in any matching regions outside the main window. The UMap version removes these sliding buffers and replaces them with a single UMap region spanning the entire input file. The UMap runtime automatically manages the amount of file data paged in memory during execution.

Our experiments run lrzip in pre-processing mode to compare the performance of mmap with UMap in the RZIP algorithm. We constrain the available memory to the program to ensure out-of-core execution, i.e., 16 GB of memory and 64 GB of input data. The UMap version sets the environment variable that limits the UMap buffer for caching pages in memory. The mmap version requires a command-line option to override the system memory on the testbed. In Figure 4, lrzip shows low sensitivity to the change in page size.
This insensitivity is likely due to the mostly sequential access pattern in lrzip, which only has occasional data reuse of earlier portions of the input file, i.e., when duplicated hash values are found. Once the page size exceeds 1 MB, the UMap version stabilizes its performance at about 1.25 times that of the mmap system call.

D. Asteroid Detection Application

In this case study, we use UMap for an on-going study that searches for transient objects, such as asteroids, in intermittent time-series telescope data. We use UMap to create a 3D cube of virtual address space, where each page is directly mapped to pixel data in a series of image files. UMap has the extensibility to integrate an application-specific FITS handler for resolving a page fault to a particular file, which would require extensive porting effort to achieve with mmap.

The application creates millions or even billions of vectors and then virtually 'traces' them through the image cube to calculate the median pixel value along each vector. The starting point of each vector has a uniform random distribution in the data, and their slope follows a given linear function. The backing store contains thousands of FITS-format image files. Page faults are resolved to the FITS files containing the requested data, where the pixel data is subsequently read and decoded before being copied into the faulting page. Note that a page fault may require accessing data in multiple files.

The evaluation uses a synthetic data set derived from 537 random images taken from an astronomical survey performed on 12/23/2018 by the Dark Energy Camera in Chile. These files were resized via bicubic resampling to four times their original dimension in each axis in order to emulate the characteristics of real-world datasets. Each file is approximately 977 MB with dimensions of 16,000 by 16,000 pixels after this operation. The entire dataset is approximately 512 GB. For the Lustre tests, transparent Lustre compression and de-duplication reduce this size to 223 GB.

The experiments process a single pass of 32 million vectors with a UMap buffer size of 64 GB. We demonstrate two types of backing stores in this application. The first uses the local SSD on the AMD testbed. The second uses a backing store mapped to remote disks through a Lustre parallel file system on the Intel testbed. Figures 5 and 6 present the results. Our results show that the application has low sensitivity to page sizes because of data reuse among the vectors. Performance degrades slightly at large page sizes because larger pages bring in more unused data. The execution time initially decreases to an optimal minimum at a 1 MiB page and then slightly increases as larger amounts of unused data begin to contend for buffer space.
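The per-vector median can be pictured as sampling one pixel per image plane along a line through the mapped cube; the sketch below assumes a plane-major float cube and a simple linear trace, which are assumptions made for illustration rather than the application's actual layout or code.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Median pixel value along one vector traced through a 3D cube mapped at 'cube'.
    // (x0, y0) is the random starting point; (dx, dy) is the per-plane slope.
    float median_along_vector(const float* cube, size_t depth, size_t rows, size_t cols,
                              double x0, double y0, double dx, double dy) {
      std::vector<float> samples;
      samples.reserve(depth);
      for (size_t z = 0; z < depth; ++z) {
        size_t x = static_cast<size_t>(x0 + dx * z);   // column index in plane z
        size_t y = static_cast<size_t>(y0 + dy * z);   // row index in plane z
        if (x < cols && y < rows)
          samples.push_back(cube[(z * rows + y) * cols + x]);  // touches one mapped page
      }
      if (samples.empty()) return 0.0f;
      std::nth_element(samples.begin(), samples.begin() + samples.size() / 2, samples.end());
      return samples[samples.size() / 2];
    }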

Fig. 5: Execution time of the asteroid application on local SSD at various UMap page sizes with a 256 GB input.

Fig. 6: Performance of the asteroid application on local SSD and Lustre using a 512 GB input.

Fig. 7: Database throughput using mmap and UMap. UMap achieves up to a 34% improvement at a 32 KB page size.

Fig. 8: A scaling test in N-Store with an increasing number of executors shows that UMap sustains performance scaling at increased application concurrency.

E. Database Workload

This use case demonstrates that UMap can be easily plugged into existing database applications to improve user-space control over memory mapping. We ported N-Store [2], an efficient NVM database, to use the UMap API by changing approximately ten lines of code. N-Store uses persistent memory, like SSD, as the memory pool for data. Our experiments use a 384 GB persistent memory pool on the local NVMe-SSD on the AMD testbed. N-Store supports multiple executors to execute transactions against the database concurrently. In our evaluation, we sweep 4-32 executors to understand the scalability of UMap under variable concurrency. Our workload uses the popular YCSB [4] benchmark with eight million transactions and five million keys. The measurement is repeated ten times, and we report the throughput from N-Store as the performance metric.

We tested different numbers of fillers and evictors and selected a concurrency of 48 fillers and 24 evictors for this benchmark. Then, with a fixed number of fillers and evictors, we test the impact of different page sizes. Figure 7 reports the throughput of the UMap version at different page sizes and of the original mmap version at the default 4 KiB page. We find that increasing the page size in UMap shows a trend of increasing performance, as in the other applications. The highest throughput is achieved at a 32 KiB page size, about a 34% improvement over the mmap version. This page size is smaller than the optimal page sizes in other applications because the access pattern in this benchmark has low locality and is mostly random.

Figure 8 reports the throughput of the database at increasing application concurrency, i.e., as the number of executors increases. The scaling test results demonstrate the advantage of UMap in addressing application requirements that change dynamically. When the number of executors increases from four to 32, the gap between the UMap version and the mmap version widens (the gray bars). In particular, the speedup by UMap increases steadily from 1.3x to 1.6x (the red line). This result highlights the importance of a scalable design in UMap for handling various application workloads.

VII. DISCUSSION

There are several future directions for UMap to support emerging architectures.

Multi-tiered storage has tiered access latency and bandwidth. Currently, UMap is extensible to new layers by defining new data objects. In future work, we will automate data migration between data objects and adapt it to application characteristics to improve storage utilization.

A disaggregated memory architecture has large-capacity memory servers connected to compute nodes through a high-performance network to provide memory on demand. UMap can be used to port applications to such an architecture by providing a backing store that defines access functions, likely using RDMA, for moving data to and from a memory server.

Byte-addressable NVM requires strong consistency for system software like file systems, and DAX-aware mmap lacks such support [20]. The UMap buffer could provide applications with explicit control over when to persist changes cached in volatile memory.
volatile memory. [2] Joy Arulraj, Andrew Pavlo, and Subramanya R Dulloor. Let’s talk about storage & recovery methods for non-volatile memory database systems. VIII.RELATED WORKS In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 707–722. ACM, 2015. Previous works have identified limitations in system ser- [3] Danny Cobb and Amber Huffman. NVMe overview. In Intel Developer vices for data-intensive applications that perform out-of-core Forum. Intel, 2012. execution for large data sets [5], [18]. [16] analyzes the [4] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with ycsb. In overhead in the path through Linux virtual memory subsystem Proceedings of the 1st ACM symposium on Cloud computing, pages for handling memory-mapped I/O. They conclude that kernel- 143–154. ACM, 2010. based paging will prevent applications to exploit fast storage. [5] Michael Cox and David Ellsworth. Application-controlled demand paging for out-of-core visualization. In Proceedings. Visualization’97 Our approach aims to provide flexibility to adapt memory (Cat. No. 97CB36155), pages 235–244. IEEE, 1997. mapping to application characteristics and back store features. [6] Sandhya Dwarkadas, Nikolaos Hardavellas, Leonidas Kontothanassis, DI-MMAP [17] provides a loadable kernel module that Rishiyur Nikhil, and Robert Stets. Cashmere-vlm: Remote memory paging for software distributed shared memory. In Proceedings 13th combines with a runtime to optimize page eviction and TLB International Parallel Processing Symposium and 10th Symposium on performance. This approach requires updates to remain com- Parallel and Distributed Processing. IPPS/SPDP 1999, pages 153–159. patible with the fast-moving kernel. CO-PAGER [10] also IEEE, 1999. [7] Linux kernel. Userfaultfd. https://www.kernel.org/doc/Documentation/- provides a user-space paging service by combining a kernel vm/userfaultfd.txt, 2019. module with a user-space component. CO-PAGER bypasses [8] Con Kolivas. Lrzip – long range zip. https://github.com/ckolivas/lrzip, complex I/O subsystem in the kernel to reduce the overhead of 2019. [9] Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. Hermes: a accessing NVM. Our approach stays in user-space completely, heterogeneous-aware multi-tiered distributed I/O buffering system. In and require no modification in the kernel or updates due to Proceedings of the 27th International Symposium on High-Performance kernel updates. Moreover, our design can support a variety of Parallel and Distributed Computing, pages 219–230. ACM, 2018. [10] Feng Li, Daniel G Waddington, and Fengguang Song. Userland co- back stores. For instance, remote memory paging that fetches pager: boosting data-intensive applications with non-volatile memory, data from a memory server or compute node [6], [14] could be userspace paging. In Proceedings of the 3rd International Conference easily integrated into UMap by providing a new store object. on High Performance Compilation, Computing and Communications, pages 78–83. ACM, 2019. IX.CONCLUSIONS [11] Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Stefano Markidis, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Dirk In this work, we provide a user-space page management Pleiter, and Shaun De Witt. Sage: percipient storage for exascale data library, called UMap , to flexibly adapt memory mapping to centric computing. Parallel Computing, 83:22–33, 2019. [12] I. B. Peng and J. S. Vetter. 
IX. CONCLUSIONS

In this work, we provide a user-space page management library, called UMap, to flexibly adapt memory mapping to application characteristics and storage features. UMap employs the lightweight userfaultfd mechanism to enable applications to control critical parameters that impact the performance of memory mapping large data sets, while confining the customizations within the application and without impacting other applications on the same system. We evaluate UMap in five applications using large data sets on both local SSD and remote HDD. By adapting the page size in each application, UMap achieved 1.25 to 2.5 times improvement compared to the system service mmap. In summary, UMap can be easily plugged into data-intensive applications to enable application-specific optimization.

ACKNOWLEDGMENT

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-PROC-788145). This research was also supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

REFERENCES

[1] Andrea Arcangeli. Userland page faults and beyond. https://schd.ws/hosted_files/lcccna2016/c4/userfaultfd.pdf, 2019.
[2] Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor. Let's talk about storage & recovery methods for non-volatile memory database systems. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 707-722. ACM, 2015.
[3] Danny Cobb and Amber Huffman. NVMe overview. In Intel Developer Forum. Intel, 2012.
[4] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 143-154. ACM, 2010.
[5] Michael Cox and David Ellsworth. Application-controlled demand paging for out-of-core visualization. In Proceedings of Visualization '97 (Cat. No. 97CB36155), pages 235-244. IEEE, 1997.
[6] Sandhya Dwarkadas, Nikolaos Hardavellas, Leonidas Kontothanassis, Rishiyur Nikhil, and Robert Stets. Cashmere-VLM: Remote memory paging for software distributed shared memory. In Proceedings of the 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing (IPPS/SPDP 1999), pages 153-159. IEEE, 1999.
[7] Linux kernel. Userfaultfd. https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt, 2019.
[8] Con Kolivas. Lrzip - long range zip. https://github.com/ckolivas/lrzip, 2019.
[9] Anthony Kougkas, Hariharan Devarajan, and Xian-He Sun. Hermes: A heterogeneous-aware multi-tiered distributed I/O buffering system. In Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, pages 219-230. ACM, 2018.
[10] Feng Li, Daniel G. Waddington, and Fengguang Song. Userland CO-PAGER: Boosting data-intensive applications with non-volatile memory, userspace paging. In Proceedings of the 3rd International Conference on High Performance Compilation, Computing and Communications, pages 78-83. ACM, 2019.
[11] Sai Narasimhamurthy, Nikita Danilov, Sining Wu, Ganesan Umanesan, Stefano Markidis, Sergio Rivas-Gomez, Ivy Bo Peng, Erwin Laure, Dirk Pleiter, and Shaun De Witt. SAGE: Percipient storage for exascale data centric computing. Parallel Computing, 83:22-33, 2019.
[12] I. B. Peng and J. S. Vetter. Siena: Exploring the design space of heterogeneous memory systems. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 427-440, Nov 2018.
[13] Ivy B. Peng, Maya B. Gokhale, and Eric W. Green. System evaluation of the Intel Optane byte-addressable NVM. In Proceedings of the International Symposium on Memory Systems. ACM, 2019.
[14] Sergio Rivas-Gomez, Roberto Gioiosa, Ivy Bo Peng, Gokcen Kestor, Sai Narasimhamurthy, Erwin Laure, and Stefano Markidis. MPI windows on storage for HPC applications. Parallel Computing, 77:38-56, 2018.
[15] Arch Robison, Michael Voss, and Alexey Kukanov. Optimization via reflection on work stealing in TBB. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1-8. IEEE, 2008.
[16] Nae Young Song, Yongseok Son, Hyuck Han, and Heon Young Yeom. Efficient memory-mapped I/O on fast storage device. ACM Transactions on Storage (TOS), 12(4):19, 2016.
[17] Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, and Maya Gokhale. DI-MMAP: A scalable memory-map runtime for out-of-core data-intensive applications. Cluster Computing, 2013.
[18] Brian Van Essen, Roger Pearce, Sasha Ames, and Maya Gokhale. On the role of NVRAM in data-intensive architectures: An evaluation. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 703-714. IEEE, 2012.
[19] Rik van Riel and Peter W. Morreale. Sysctl in kernel version 2.6.29. https://www.kernel.org/doc/Documentation/sysctl/vm.txt, 2008.
[20] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 323-338, 2016.
Extending OpenMP map Clause to Bridge Storage and Device Memory

Kewei Yan, University of North Carolina at Charlotte, Charlotte, North Carolina, USA, [email protected]
Anjia Wang, University of North Carolina at Charlotte, Charlotte, North Carolina, USA, [email protected]

Xinyao Yi, University of North Carolina at Charlotte, Charlotte, North Carolina, USA, [email protected]
Yonghong Yan, University of North Carolina at Charlotte, Charlotte, North Carolina, USA, [email protected]

Abstract—Heterogeneous architectures for high performance computing, particularly those systems that have GPU devices attached to the host CPU system, offer accelerated performance for a variety of workloads. To use those systems, applications are commonly developed to offload most computation and data onto an accelerator while utilizing host processors for helper tasks such as I/O and data movement. This approach requires users to program I/O operations for reading and writing data from and to storage, and to and from host memory. Users are then required to program operations for moving data between host memory and device memory. In this paper, we present our extension to the OpenMP map clause for programming direct reading and writing of data between storage and device memory. The extension includes a mechanism for handling metadata such that metadata can be manipulated independently from the data itself. This work demonstrates a prototype runtime and support for binary and image data formats, including jpeg and png, with OpenCV. Experiments on matrix and image processing kernels show that the designed extension can significantly reduce programming efforts for manipulating data and metadata among storage, host memory, and device memory.

Index Terms—OpenMP, map clause, accelerator, data movement, storage and memory

I. INTRODUCTION

The past decade has seen dramatically increased complexity of computer systems for high performance computing. We have experienced the increase of parallelism from tens to hundreds and thousands of cores. Memory and storage systems have changed significantly, including the adoption of 3D-stacked memory [1] and the use of high-density persistent memory that has both storage and memory properties. The widely-used heterogeneous architectures, particularly those that use GPU accelerators for HPC applications, introduce a discrete and separate memory space between host memory and device memory, adding another dimension of memory complexity for node-level parallel programming. On top of decomposing computation and mapping parallelism onto hardware [2], [3], users now often spend a large amount of effort developing and tuning code for data movements between different memory spaces, and optimizing local and shared data access with regard to the memory hierarchy and storage system.

Parallel programming models for HPC such as OpenMP and OpenACC provide high-level APIs for programming accelerators by annotating sequential or CPU-parallelized code with pragmas. These pragmas, such as OpenMP's target directive and map clause, and OpenACC's similar directives including acc, copyin and copyout, can be used to indicate to the compiler how to generate code for the runtime operations that offload data and computation between the host and accelerators (or devices). Compared with using low-level APIs such as NVIDIA CUDA and OpenCL, such APIs significantly improve the productivity of programming accelerators by reducing the large amount of hand-tuning effort needed for platform-dependent optimization. A common practice for programming applications on accelerators is to let the application offload most computation and data onto an accelerator while utilizing host processors for helper tasks such as I/O and data movement. The approach requires users to program I/O operations for reading and writing data from and to storage, and to and from host memory. Then users are required to program operations for moving data between host memory and device memory. While GPU unified memory (with the host) and the upcoming GPUDirect Storage [4] technique offer driver-based solutions to enable the GPU's direct access to host memory and storage, correctly using those APIs still requires understanding of the storage and memory system in heterogeneous architectures, and a substantial amount of programming effort.
In this paper, we explore high-level APIs for bridging data movement between storage and device memory in accelerator-based heterogeneous systems, with the goal of reducing the programming effort of handling I/O between storage and host memory, and of moving data between host and device. The technical contributions include: 1) we develop an extension to the OpenMP map clause for reading and writing data from and to an I/O URI, to and from device memory; this extension enables direct storage and network access for a program from the device without the need to write code for I/O operations; 2) we develop a mechanism for handling metadata such that metadata can be manipulated independently from the data itself; 3) we have demonstrated an implementation of the runtime support and experimented with POSIX stream data and image data formats with OpenCV, using matrix and image processing kernels. Our prototype and experiments show that the designed extension can significantly reduce programming efforts for manipulating data and metadata among storage, host memory, and device memory.

In the rest of the paper, Section II describes the background and motivation of this work. In Section III, we present our extension to the OpenMP map clause to address the needs mentioned in the paper. In Section IV, we discuss how the runtime supports those extensions in the implementation, and present evaluation results generated from matrix and image processing kernels. Related work is discussed in Section V, and the paper concludes in Section VI.

II. BACKGROUND AND MOTIVATION

High-level programming models such as OpenMP and OpenACC provide pragma APIs for users to annotate code and data regions to be offloaded to accelerators for computation. In this section, we describe how to use those APIs, focusing on the data mapping and data movement clause, and thus motivate our work.
A. OpenMP map clause

The OpenMP map clause is used with the OpenMP device-related directives (e.g., target, target data) to indicate how data should be copied between host memory and device memory [5]. The syntax of the map clause is as follows, and Figure 1 demonstrates its usage.

    map([[map-type-modifier[,] [map-type-modifier[,] ...] map-type :] locator-list)

Figure 1:

    extern void init(float*, float*, int);
    void vec_mult(float *p, float *v1, float *v2, int N) {
      init(v1, v2, N);
      #pragma omp target map(to: N, v1[0:N], v2[:N]) map(from: p)
      #pragma omp parallel for
      for (int i=0; i<N; i++)
        p[i] = v1[i] * v2[i];
    }

The map-type can be to, from, tofrom, alloc, release, or delete. The to, from, and tofrom types indicate the direction in which data is copied and that memory needs to be allocated for the data. Modifiers release and delete indicate that the allocated memory needs to be released or deleted, respectively. For similar functionality to OpenMP's map clause, OpenACC has the clauses copyin, copyout, copy, create, delete, and present.

The map-type-modifiers can be always, close, and mapper(mapper-identifier). The always option means that, regardless of whether the items in the locator-list are deleted or added, the data is always mapped according to the map-type. For example, if the map-type is to or tofrom and always is used, the value of the original list item will be copied to the device environment, regardless of whether the item was mapped before or not. The close map-type-modifier is a hint to the runtime to allocate memory close to the target device. The mapper map-type-modifier indicates that the data will be mapped following a predefined pattern, and the mapper-identifier indicates the mapper to be used in the current map clause.

OpenMP allows mapping user-defined data types, e.g., a struct variable. In general, if a list item in a map clause is a variable of struct type, then it is treated as if each struct element in the variable were a list item in the clause. If the list item is an element of a struct, then all other elements in the struct form a struct sibling list for mapping. Figure 2 gives an example. In this example, S.a, S.b, and S.p in the list of the map clause have corresponding variables and storage on the device, but S.buffera, S.bufferb, and S.x cannot be accessed from the device since they are not specified in any map clause.

Figure 2:

    struct foo {
      char buffera[1000000], bufferb[1000000];
      float x, a, b, *p;
    };

    int main() {
      struct foo S;
      S.a = 2.0; S.b = 4.0;
      S.p = (float *)malloc(sizeof(float)*100);
      for (int i=0; i<100; i++) S.p[i] = i;
      #pragma omp target map(alloc:S.p) map(S.p[0:100]) map(to:S.a, S.b)
      for (int j=0; j<100; j++)
        S.p[j] = S.p[j] * S.a + S.b;
      return 0;
    }

1 void mm(float *A, const float *B, float *C, int numElements) { 1 uchar* imgin, imgout; 2 #pragma omp target map(to:A[0:numElements]={" 2 void* meta_in, meta_out; data/vectorA.data"}, B[0:numElements]={"data 3 int row = 800, col = 600; /vectorB.data"}) map(from:C[0:numElements 4 #pragma omp target \ ]={"data/vectorC.data"}) 5 map (to: imgin={jpeg: "image_in.jpg", metadata 3 int i,j,k; (host_only: meta_in)})\ 4 #pragma omp for private(i,j,k) 6 map (alloc: imgout[800*600]) 5 // for loops 7 for (i=1; i

2) image data: Smoothing is very common in image pro- Fig. 6: Example of using the extended map clause for resizing cessing. It applies a filtering kernel to each pixel of the image image and update the values properly. Figure 5 shows that the first map clause is used to transfer the image data from the input IV. PROTOTYPE IMPLEMENTATION FOR THE RUNTIME file image_in.jpg to device directly and its metadata is stored in meta_in variable for later use. Users do not need A. Implementation to write extra code for I/O operations since they are covered In this section we demonstrate how the extended map clause by the extended map clause. Similarly, the second map clause can be implemented in the runtime for NVIDIA GPU with is used to write the updated image data imgout on GPU to CUDA using host memory as bounce buffer. As indicated in the output file image_out.jpg. our related work, some of existing solutions use host memory In this case, only image data need to be mapped to acceler- as bounce buffer in the library level or driver level, thus our ator for computing. The metadata, such as image size, will not prototype is representative of those work. The implementation be modified. They are copied from input file to output file via of using host-bypass solutions such as NVIDIA GPUDirect an intermediate variable meta_in. The place-modifier host storage is our future work when the support for GPUDirect in both map clauses indicates that the metadata are stored only or other similar techniques are available. At the end of on host temporarily. this section, we further discuss the potential benefits of the In the following example, Figure 6 shows an application extension using matrix and smoothing kernels. involving data movement as well as metadata handling. It’s There are two ways to use CPU memory as bounce buffer. used for image resizing and the metadata are modified. In For the first one, host memory is used as bulk bounce buffer this case, users have to provide modified metadata in the for device global memory. Data are read into host bounce map clause to generate correct output file. Because the new buffer from storage according to the type of data-format- metadata have to be ready ahead of the output, to resize an driver. Then, cudaMalloc() is called to allocate a bulk image requires two pragmas. Line 5 shows that metadata of memory for data which is to be offload to GPU. Next step, the image are loaded into meta_in variable on host only. cudaMemcpy() is called for copying data from host bounce Line 10 shows the code that generates the new metadata buffer to device memory. After finishing computing work, the stored in meta_out variable. They both are stored in the result is copied back by calling cudaMemcpy() as well. In host environment since we only need CPU to process the this approach, computation data are read or written in bulk directly between storage, host memory and device memory. It 1 fd = fopen("data/vectorA.data", "rb"); is hard to achieve pipelined data movement and overlapping 2 fread(tA, sizeof(float), N*K, fd); data movement and computation in this approach since the 3 fclose(fd); 4 ... amount of data to be moved is determined by the application. 5 cudaMalloc(&A, sizeof(float)*N*K); The amount of data is also limited by the size of device 6 cudaMemcpy(A, tA, sizeof(float)*N*K, memory. However, for application that has high reuse data cudaMemcpyHostToDevice); 7 ... 
access pattern, this approach enables near data access for all 8 cudaMalloc(&C, sizeof(float)*N*M); the data since they are copied to the device memory at once. 9 ... 10 float *h_C = (float *)malloc(sizeof(float)*N*M); The second way of using host memory as bounce buffer is 11 ... through CUDA Unified Memory approach. Unified memory 12 // MM kernel can be allocated using cudaMallocManaged() function. 13 ... 14 cudaMemcpy(h_C, C, sizeof(float)*N*M, Data in CUDA Unified Memory is shared between host and cudaMemcpyDeviceToHost); GPU via paging mechanism, thus we call it page bounce 15 ... 16 FILE *f3; buffer. No explicit cudaMemcpy() calls are needed. With 17 f3 = fopen("data/vectorC.data", "wb"); GPU paging mechanism, when a GPU accesses data that is in 18 fwrite(h_C, sizeof(float), N*M, f3); unified memory, but not in GPU, GPU initiates page fault 19 fclose(f3); signal to the driver on the CPU. The driver has a paging thread that sends the page from host to GPU. With unified Fig. 7: Implementation of extended map clause using GPU memory and paging mechanism, data are copied to GPU on global memory for matrix multiplication. demand, which may increase data access latency compared with bulk data movement. But it allows for overlapping paging 1 fd = fopen(argv[1], "rb"); and computation. 2 printf("The size of array1 is: %ld\n", N*M); cudaMallocManaged(&A, sizeof(float) N M); In the next two subsections, implementations for both 3 * * 4 fread(A, sizeof(float), N*M, fd); POSIX stream data and image data are shown for the two 5 fclose(fd); approaches of using host memory as bounce buffer. 6 ... 7 cudaMallocManaged(&C, sizeof(float)*N*M); 1) POSIX stream data: Pseudo codes shown in Figure 7 8 ... 9 // MM kernel and Figure 8 are the implementations of matrix multiplication 10 ... example using the extended map clause, which is shown in 11 FILE *f3; Figure 4. Figure 7 is for using host memory as bulk bounce 12 f3 = fopen("data/vectorC.data", "wb"); 13 fwrite(C, sizeof(float), N*M, f3); buffer for GPU global memory. In this case, fread() is used 14 fclose(f3); to copy the bulk data from storage to host memory. The size of array is given by N K, determined by the application. The * Fig. 8: Implementation of extended map clause using GPU data type of the array element is float. cudaMalloc() unified memory for matrix multiplication. and cudaMemcpy() are used for memory allocation and data movement between host memory and device memory. Figure 8 is for using host memory as page bounce buffer via GPU Uni- B. Benefits fied Memory. In this version, only cudaMallocManaged() is needed to allocate unified memory, and on-demand paging The extension of map clause reduces the programming between host memory and device memory are handled by the effort significantly. With the map extension, users do not need driver. to implement I/O operations and data movements on their own with lengthy code (compare Figure 5 and Figure 9 or 10). 2) image data: For the program in Figure 5, one can The other benefit is that the extension enables more oppor- manually implement in the similar way as we present in tunities of optimization. We choose two use cases to measure Figure 9, which demonstrates a typical runtime implemen- detailed execution time cost, which are image smoothing and tation to support our extension using host memory as bulk matrix multiplication. They are implemented with GPU global bounce buffer for global memory. Figure 10) shows the memory and unified memory, respectively. 
implementation using GPU Unified Memory. In this way the The break-down execution time of image smoothing are CUDA runtime will manage the data transfer between host and shown in Tab. I and II. The time for writing output data to device automatically using paging mechanism. However, these storage dominates the total time in both global memory and two implementations requires users to have sufficient CUDA unified memory cases in this example. One of the potential programming experiences besides OpenMP knowledge. optimizations is to overlap the I/O operation with kernel As for handling metadata when processing images, if meta- execution via mmap() function or image processing pipeline. data are specified, OpenCV functions will be called to load mmap() maps the storage to memory and only transfers the both image data and metadata from storage to the destination data where they are actually required via CPU paging. given by users. The destination could be host or device. The other typical application is matrix multiplication, which Image Size Input Output HtoD DtoH Kernel 1 uchar* imgin_d, imgout_d, imgout_h; 2 uchar* gpu_filter(uchar*); 512x512 3 274 0.066 0.061 0.204 3 Mat image = cv::imread("image_in.jpg"); 512x1024 6 629 0.142 0.123 0.526 4 size_t img_size = input.ncols * input.nrows; 1024x1024 10 1285 0.338 0.706 0.853 5 cudaMalloc(imgin_d, img_size); 1024x2048 20 2622 0.694 2.197 2.186 6 cudaMalloc(imgout_d, img_size); 2048x2048 35 4833 1.471 5.289 3.545 7 malloc(imgout, img_size); 8 // copy data HtoD TABLE I: Breakdown of execution time for image smoothing 9 cudaMemcpy(imgin_d, image.data, img_size, cudaMemcpyHostToDevice); using global memory (ms) 10 // run GPU kernel 11 imgout_d = gpu_filter(imgin_d); 12 // copy data DtoH Image Size Input Output HtoD DtoH Page Fault Kernel 13 cudamemcpy(imgout_h, imgout_d, img_size, 512x512 3 316 0.250 0.174 2.294 2.499 cudaMemcpyDeviceToHost); 512x1024 6 660 0.303 0.239 3.096 3.176 14 // write result to a new file 1024x1024 10 1288 0.381 0.305 2.718 3.491 15 image.data = imgout_h; 1024x2048 19 2637 0.813 0.600 5.314 7.241 16 cv::imwrite("image_out.jpg", image); 2048x2048 37 4823 1.381 1.085 8.693 11.785

Fig. 9: Implementation of extended map clause using OpenCV TABLE II: Breakdown of execution time for image smoothing and global memory. Metadata is loaded to the host only and it using unified memory (ms) is used to write to the output image in this example. For other algorithms, such as enlarging images, metadata may need to be changed to match the output image. storage device and network. GPUDirect RDMA [6] enables GPU-GPU direct connections via PCIe. Some changes are applied to the device drivers to make it work. Traditionally, 1 uchar* imgin, imgout; CPU MMU is used as memory I/O address for mapping 2 uchar* gpu_filter(uchar*); 3 Mat image = cv::imread("image_in.jpg"); resources to user space or kernel address space. However, after 4 size_t img_size = input.ncols * input.nrows; changing drivers, NVIDIA drivers enable data mapping or 5 cudaMallocManaged(imgin, img_size); paging from a third party device to GPU. Thus in this situation, 6 cudaMallocManaged(imgout, img_size); 7 memcpy(imgin, image.data, img_size); system memory is bypassed and communication overheads are 8 // run GPU kernel eliminated. Based on this feature, some hardware products 9 imgout = gpu_filter(imgin); are created, such as Fusion-io ioMemory flash storage [7]. 10 // write result to a new file 11 image.data = imgout; GPUDirect RDMA could also be extended for Ethernet and 12 cv::imwrite("image_out.jpg", image); Internet communications with Chelsio iWARP RDMA tech- nology [8]. A hardware TCP/IP stack is applied to run in the Fig. 10: Implementation of extended map clause using adapter, completely bypassing the host software stack, thus OpenCV and unified memory. Metadata is loaded to the host eliminating any inefficiencies due to software processing. only and it is used to write to the output image. According to NVIDIA, the upcoming GPUDirect Storage [4] solution requires no more bounce buffer on system memory and PCIe switch handles data communication between NVMe is more compute-intensive than image smoothing. According and GPU. Similar to GPUDirect Storage, DRAGON [9] pro- to Tab. III and IV, GPU kernel consumed the most of total vide a host-based framework to enable user transparent and time. Therefore, the data transfer and I/O operations can be direct NVMe access by enhancing NVIDIA driver. For NVM- divided into multiple smaller pieces and overlapped in the specific operations, DRAGON’s solution extends the driver GPU kernel computation. for CUDA unified memory to page data between GPU and Our implementation can leverage emerging techniques such NVM. Data in file can be read or write by common load/store as GPUDirect Storage from NVIDIA, which is to be released instructions like what mmap() does. Thus pages will be in the near future. This library skips the system memory allocated and data will be paged to GPU. thus no bounce buffer is involved. We may apply it to our Besides, GPUfs [10] is created to expose the file system implementation, for direct access to any storage or NVMe API to GPU programs. It enables I/O read and write from the from GPU. Not only the storage, the network could also be CPU context and further allows the OS optimize data access considered as a main source for data. We have noticed that and locality across independently-developed GPU modules. GPUDirect RDMA from mellanox, or now part of NVIDIA, Mustafas [11] presents GPUDrive to remedy performance is already there. 
We believe that extending map clause for degradation of GPU-accelerated data processing caused by multiple source could be possible. file-driven data movement. It is handled by designing the flash array and optimizing the system software stacks associated to V. RELATED WORK GPU computing. Previous work has shown efforts of enhancing connecting Techniques and programming interfaces for managing het- GPU to a third party peer device directly, including GPU, erogeneous memory, such as device memory, host memory, Size Input Output HtoD DtoH Kernel reduces programming efforts for manipulating data and meta- 512x512 2 1 0.087 0.081 8.778 data among storage, host memory, and device memory. As 512x1024 5 1 0.235 0.169 17.469 recent memory and storage technologies indicate converging of 1024x1024 7 3 0.473 1.195 69.515 1024x2048 16 5 1.467 3.179 141.750 storage and memory in the form of storage-class memory (such 2048x2048 25 13 2.012 7.243 566.620 as NVMe and NVDIMM) or memory-capable storage (such as byte-addressable SSD), our work explores programming TABLE III: Breakdown of execution time for matrix multipli- interface and system support for enabling those techniques cation using global memory (ms) with high programmability to the end-users. For the future work, the extension and implementation will be refined and Size Input Output HtoD DtoH Page Fault Kernel enhanced. We will explore GPUDirect Storage library for the 512x512 2 2 0.191 0.087 1.615 10.317 implementation and performance evaluation, when it becomes 512x1024 4 2 0.482 0.177 3.064 20.465 available. 1024x1024 8 3 0.770 0.349 4.612 73.686 1024x2048 14 6 1.817 0.511 9.168 151.79 REFERENCES 2048x2048 25 15 3.248 1.363 17.259 581 [1] G. H. Loh, “3d-stacked memory architectures for multi-core processors,” TABLE IV: Breakdown of execution time for matrix multipli- in Proceedings of the 35th Annual International Symposium on cation using unified memory (ms) Computer Architecture, ser. ISCA ’08. Washington, DC, USA: IEEE Computer Society, 2008, pp. 453–464. [Online]. Available: http://dx.doi.org/10.1109/ISCA.2008.15 [2] L. Carrington, A. Snavely, and N. Wolter, “A performance prediction high bandwidth memory, and persistent memory have been framework for scientific applications,” Future Gener. Comput. Syst., vol. 22, no. 3, pp. 336–346, Feb. 2006. [Online]. Available: popular topics recently. Efforts such as pmem.io [12], which http://dx.doi.org/10.1016/j.future.2004.11.019 focuses on programming persistency on the emerging non- [3] Performance Optimization of Numerically Intensive Codes. Philadel- volatile memory, and memkind and hbmalloc library [13] phia, PA, USA: Society for Industrial and Applied Mathematics, 2001. [4] “Gpudirect storage: A direct path between storage and gpu memory,” that are designed to manage high-bandwidth memory for https://devblogs.nvidia.com/gpudirect-storage/. applications, are popular. OpenMP 5.0 standard has its own [5] “Openmp application programming interface, version 5.0 novem- memory management support [14], which is also platform- ber 2018,” https://www.openmp.org/wp-content/uploads/OpenMP-API- Specification-5.0.pdf. agnostic. malloc() like interfaces such as omp_alloc() [6] “Developing a linux kernel module using gpudirect rdma,” and allocate directive are introduced for users to manage https://docs.nvidia.com/cuda/gpudirect-rdma/index.html. 
memory on different memory spaces, including host memory, [7] “Fusion-io products,” https://www.westerndigital.com/products/storage- systems#all-flash-hybrid-storage. device memory, and others. Umpire [15] introduces unified, [8] “The gpudirect solution overview,” https://www.chelsio.com/the- application-focused API to handle different kind of platforms gpudirect-solution-overview/. as well as different demand of memory allocation. Once the [9] P. Markthub, M. E. Belviranli, S. Lee, J. S. Vetter, and S. Matsuoka, “Dragon: breaking gpu memory capacity limits with direct nvm access,” system is detected, the corresponding memory resources will in Proceedings of the International Conference for High Performance be created thus user do not need to care about the programming Computing, Networking, Storage, and Analysis. IEEE Press, 2018, platform too much and effort for memory allocation would be p. 32. [10] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, “Gpufs: integrating saved. a file system with gpus,” in ACM SIGPLAN Notices, vol. 48, no. 4. Those techniques and efforts demonstrate the needs and ACM, 2013, pp. 485–498. feasibility of direct access from accelerator to storage or [11] M. Shihab, K. Taht, and M. Jung, “Gpudrive: Reconsidering storage ac- cesses for gpu acceleration,” in Workshop on Architectures and Systems network devices, and for managing heterogeneous memories for Big Data, 2014. in the existing computer systems. The work in this paper [12] “pmem.io: Persistent Memory Programming,” https://pmem.io. aims to address the programmability challenge of using those [13] C. Cantalupo, V. Venkatesan, and J. R. Hammond, “User extensible heap manager for heterogeneous memory platforms and mixed memory diversified storage and memory by leveraging and extending policies,” 2015. high-level OpenMP APIs for users to simplify programming [14] “Memory management support for openmp 5.0,” I/O and memories. Our implementation also enables additional https://www.openmp.org/wp-content/uploads/openmp-TR5-final.pdf. [15] “An application-focused api for memory management on numa & gpu optimization opportunities at both compiler level and runtime architectures,” https://github.com/LLNL/Umpire. level, such as pipelining I/O with data movement between memories and computation, and data layout optimization on both storage and memory according to data format and data processing algorithms.

VI.CONCLUSION In this paper, we present the extension of the OpenMP map clause for programming direct data read and write between storage and device memory. The mechanisms for handling metadata are also included in the extension so that metadata can be manipulated independently. The extension significantly MCHPC'19: Workshop on Memory Centric High Performance Computing

Panel: Software and Hardware Support for Programming Heterogeneous Memory

Moderator: Maya B Gokhale (Lawrence Livermore National Laboratory) Panelist: Mike Lang (LANL), Jeffrey Vetter (ORNL), Vivek Sarkar (Georgia Tech), David Beckinsale (LLNL), Paolo Faraboschi (HPE)

Usability and programmability of complex and heterogeneous memory systems remain significant challenges facing the HPC and data analytics communities. Existing memory systems that include DRAM, SRAM, discrete memory, software unified memory, and distributed memory are difficult to exploit while maintaining portable performance. Approaches include programming language constructs and runtime libraries, OS enhancements, and even hardware mechanisms to enable the competing goals of programmability and portability. We would like the panel to address challenges and solutions that address the problems of maintaining portability in applications that must navigate the complex memory hierarchy without sacrificing performance and capability.

87