18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios

Haohuan Fu, Conghui He, Bingwei Chen (Tsinghua University; National Supercomputing Center in Wuxi, China), Zekun Yin (Shandong University), Zhenguo Zhang (Southern University of Science and Technology, China), Wenqiang Zhang (University of Science and Technology of China), Tingjian Zhang (Shandong University), Wei Xue (Tsinghua University; National Supercomputing Center in Wuxi, China), Weiguo Liu (Shandong University), Wanwang Yin (National Research Center of Parallel Computer Engineering and Technology, China), Guangwen Yang (Tsinghua University; National Supercomputing Center in Wuxi, China), Xiaofei Chen (Southern University of Science and Technology, China)

ABSTRACT
This paper reports our large-scale nonlinear earthquake simulation software on Sunway TaihuLight. Our innovations include: (1) a customized parallelization scheme that employs the 10 million cores efficiently at both the process and the thread levels; (2) an elaborate memory scheme that integrates on-chip halo exchange through register communication, optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion; (3) on-the-fly compression that doubles the maximum problem size and further improves the performance by 24%. With these innovations to remove the memory constraints of Sunway TaihuLight, our software achieves over 15% of the system's peak, better than the 11.8% efficiency achieved by a similar software running on Titan, whose byte-to-flop ratio is 5 times better than TaihuLight's. The extreme cases demonstrate a sustained performance of over 18.9 Pflops, enabling the simulation of the Tangshan earthquake as an 18-Hz scenario with an 8-meter resolution.

KEYWORDS
Sunway TaihuLight, computational seismology, earthquake ground motions, parallel scalability

ACM Reference format:
Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, Guangwen Yang, and Xiaofei Chen. 2017. 18.9-Pflops Nonlinear Earthquake Simulation on Sunway TaihuLight: Enabling Depiction of 18-Hz and 8-Meter Scenarios. In Proceedings of SC17, Denver, CO, USA, November 12-17, 2017, 12 pages. DOI: 10.1145/3126908.3126910

1 JUSTIFICATION FOR THE GORDON BELL PRIZE
18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight (15% of the peak), using 10,400,000 cores, for 18-Hz and 8-meter scenarios. We adopt comprehensive memory-related optimizations to resolve the bandwidth constraints, and propose on-the-fly compression to double the maximum problem size and to further improve the performance by 24%.
2 PERFORMANCE ATTRIBUTES
In the following table, we not only show the information about the category of our work, but also demonstrate the performance highlights and experiment updates in the recent months.

Performance Attributes | Content
Performance | 18.9 Pflops (improved from 15.2 Pflops with on-the-fly compression added)
Resolution and Frequency | 18-Hz and 8-meter (improved from 10-Hz and 20-meter)
Maximum Problem Size | 7.8 trillion points (improved from 3.99 trillion); 40,000 x 39,000 x 5,000; 724 TB (80.4% of total available memory)
Tangshan Earthquake Simulation | 16-m resolution seismic hazard map; 450 TB of broadband seismic data
Category of achievement | scalability, peak performance, time to solution
Type of method used | explicit
Results reported on basis of | whole application with I/O
Precision reported | single precision
System scale | measured on full-scale system
Measurement mechanism | flop counts and timers

3 OVERVIEW OF THE PROBLEM: EARTHQUAKE SIMULATION
In traditional Chinese culture, the highest praise for a scholar is that he understands both "the upper sky and the underneath earth". While science has made significant progress in the last century, the interior of the earth still remains largely unknown to scientists. As a result, earthquakes, which are significant disasters that can lead to huge losses, are still among the ultimate scientific challenges that wait for deciphering [1].

The seismoscope designed by Zhang Heng in 132 AD is probably the earliest scientific effort towards earthquake disaster mitigation in Chinese history [22]. As shown in Fig. 1, with eight dragons holding balls in their mouths, this seismoscope developed two thousand years ago was able to indicate earthquakes thousands of miles away, and exhibited technical features similar to modern seismic measurement instruments that originated in the 1880s [17].

Figure 1: The internal structure and external appearance of Zhang Heng's seismoscope designed in 132 AD [12].

In the 20th century, with more sensitive and accurate seismic measurement devices to record the seismic signals of natural earthquakes, statistics-based methods were developed as the major tools for various forms of earthquake risk analysis. In the recent two to three decades, with more powerful supercomputers, numerical simulation has become another leg to support the table. The earthquake simulation tool is like a 'digital seismoscope' that enables scientists to 'replay' or 'project' an earthquake, so as to provide quantitative evaluations of earthquake-related risks, and to improve our understanding of both the underlying structure and the development mechanism of natural earthquakes. For engineering purposes, such a tool can also be coupled with other mechanical models to conduct scenario tests and to guide the design of resilient utility systems in seismically active regions.

While numerical earthquake simulation provides a unique 'experimental platform' with valuable research functions, it is also a traditional 'grand challenge' for the supercomputing community. Spatially, earthquake simulations cover hundreds of kilometers along the plane, and tens of kilometers along the depth. Even with a grid size of over 100 meters, such a problem involves billions to trillions of unknowns [7]. The engineering demand nowadays requires the support of a wide range of frequencies up to at least 10 Hz, which pushes the grid size to 20 meters or even below, with a time step in the range of milliseconds [8]. Even though we only care about a duration of tens of seconds for an earthquake, such a spatial range and such spatial/temporal resolutions bring tremendous challenges even for leading-edge systems.

Moreover, within each grid point, the computational and memory pressure is also high. Even for a linear case, we solve a coupled system of velocity and stress tensor equations:

\partial_t \mathbf{v} = \frac{1}{\rho}\,\nabla\cdot\boldsymbol{\sigma}   (1)

\partial_t \boldsymbol{\sigma} = \lambda\,(\nabla\cdot\mathbf{v})\,\mathbf{I} + \mu\,\left(\nabla\mathbf{v} + (\nabla\mathbf{v})^{T}\right)   (2)

where \mathbf{v} = (v_x, v_y, v_z) is the velocity and \boldsymbol{\sigma} = (\sigma_{xx}, \sigma_{yy}, \sigma_{zz}, \sigma_{xy}, \sigma_{xz}, \sigma_{yz}) is the stress. For cases with a higher frequency, nonlinear responses in rocks and soils, and the nonlinear behavior of the shallow sedimentary rock underlying the basins, become important factors to consider. To accommodate these nonlinear effects, we need to incorporate the Drucker-Prager plasticity [21], which defines the yield stress as:

Y(\sigma) = \max\left(0,\; c\,\cos\phi - (\sigma_m + P_f)\,\sin\phi\right)   (3)

where c is the cohesion, \phi the friction angle, P_f the fluid pressure, and \sigma_m the mean stress. The yield function is used to determine whether to update the stress:

\sigma_{ij} = \sigma_m\,\delta_{ij} + r\,s_{ij}^{\mathrm{trial}}   (4)

where r is calculated from the yield stress Y(\sigma) and s_{ij}^{\mathrm{trial}} is the trial stress deviator. As a result, when moving from linear to nonlinear cases, we need over 35 instead of just 28 3D arrays [21], which increases both the memory capacity and the memory bandwidth requirements by almost 25%, making it extremely difficult to achieve high efficiency on today's many-core architectures.
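To make the per-grid-point cost of this nonlinear update concrete, the following is a minimal C sketch of a Drucker-Prager style stress adjustment implementing equations (3) and (4). It is an illustration only: the function and variable names are not the ones used in our kernels, the scale factor r is computed in the simplest possible way, and the actual kernels operate on many more 3D arrays, as noted above.

```c
#include <math.h>

/* Illustrative Drucker-Prager stress adjustment for one grid point.
 * sxx..syz: trial stress components; c: cohesion; phi: friction angle
 * (radians); pf: fluid pressure.  Follows Eqs. (3)-(4): if the shear
 * measure exceeds the yield stress Y, the deviatoric part is scaled by r. */
static void drucker_prager_adjust(float *sxx, float *syy, float *szz,
                                  float *sxy, float *sxz, float *syz,
                                  float c, float phi, float pf)
{
    float sm  = (*sxx + *syy + *szz) / 3.0f;               /* mean stress */
    float dxx = *sxx - sm, dyy = *syy - sm, dzz = *szz - sm;
    /* second invariant of the stress deviator -> scalar shear measure tau */
    float tau = sqrtf(0.5f * (dxx*dxx + dyy*dyy + dzz*dzz)
                      + (*sxy)*(*sxy) + (*sxz)*(*sxz) + (*syz)*(*syz));
    float Y = fmaxf(0.0f, c * cosf(phi) - (sm + pf) * sinf(phi));   /* Eq. (3) */

    if (tau > Y && tau > 0.0f) {                            /* yielding      */
        float r = Y / tau;                                  /* return factor */
        *sxx = sm + r * dxx;  *syy = sm + r * dyy;  *szz = sm + r * dzz;
        *sxy *= r;            *sxz *= r;            *syz *= r;      /* Eq. (4) */
    }
}
```

Even in this stripped-down form, the adjustment touches all six stress components of every grid point in addition to the velocity update, which is where the extra arrays and the extra memory traffic of the nonlinear case come from.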

With Sunway TaihuLight announced [9] in 2016, leadership supercomputers have finally entered the scale of over 100 Pflops and over 10 million cores. While TaihuLight presents an unprecedented level of computing power (3 times that of Tianhe-2, and 5 times that of Titan), its memory system is relatively modest. As shown in Table 1, the total memory size is similar to that of other systems (only larger than the two GPU-based systems Piz Daint and Titan), with a significantly lower byte-to-flop ratio (1/5 of other heterogeneous systems, and 1/10 of K). Such a system presents both high potential and severe challenges for scaling scientific applications to the next level, especially for the earthquake simulation problem in this work, which requires both a large memory space and a high memory bandwidth; breaking the memory wall therefore becomes the top challenge.

Table 1: A brief comparison between Sunway TaihuLight and other leadership systems. sizeMEM and BWMEM refer to the size and the bandwidth of the memory respectively. BWMEM/PEAK computes the ratio of memory bandwidth to peak computing performance of the system.

System | PEAK (Pflops) | LINPACK (Pflops) | sizeMEM (TB) | BWMEM (TB/s) | BWMEM/PEAK (byte per flop)
TaihuLight | 125 | 93 | 1,310 | 4,473 | 0.038
Tianhe-2 | 54.9 | 33.9 | 1,375 | 10,312 | 0.188
Piz Daint | 25.3 | 19.6 | 425.6 | 4,256 | 0.168
Titan | 27.1 | 17.6 | 710 | 5,475 | 0.202
Sequoia | 20.1 | 17.2 | 1,572 | 4,188 | 0.208
K | 11.28 | 10.51 | 1,410 | 5,640 | 0.5

Originating from the AWP-ODC [7] and CG-FDM [23] codes, we have built a fully optimized design that carefully combines algorithmic and implementation techniques, and manage to push the various capabilities of the Sunway system (mainly the memory-related capabilities for this memory-bound problem), such as the DRAM space, the local data memory (LDM) space (Sunway's user-controlled scratch-pad cache), and the effective memory bandwidth, to the range of 80% to 90% of the hardware upper bounds. We also propose an on-the-fly compression method, carefully designed to limit the computational and memory costs it introduces. Our method can double the maximum problem size and further improve the performance by 24%. The simulation in the extreme case provides a sustained performance of 14.2 Pflops for the linear case, and 18.9 Pflops for the nonlinear case. The computational efficiency we achieve is up to 15%, which is better than the efficiency of 11.8% achieved in the previous efforts on Titan [21], even though TaihuLight's byte-to-flop ratio is only 1/5 of Titan's.

By squeezing all the possible performance out of Sunway TaihuLight, we are able to perform a series of simulations for the Tangshan earthquake (M7.8, 1976) with a problem domain of 320 km by 312 km by 40 km, with the spatial resolution increasing from 500 m to 8 m, supporting a frequency range up to 18 Hz. To the best of our knowledge, this is the first time a nonlinear plasticity earthquake simulation has been performed at such a scale, and with such a high frequency and high resolution. The plasticity ground motion simulation for the Tangshan earthquake allows us for the first time to quantitatively estimate the hazards of the Tangshan earthquake in the affected area, and to provide guidance for designing proper seismic engineering standards for buildings in North China.

4 CURRENT STATE OF THE ART
As a traditional 'grand challenge' for high performance computing, one of the earliest efforts on earthquake simulation was performed on the 256 processors of a Cray T3D [2], using an unstructured mesh to simulate a region with a dimension size of 140 km by 100 km by 20 km, providing a performance of 8 Gflops.

Due to the concerns over earthquake risks, most large-scale earthquake simulation efforts come from seismically active regions, such as Japan and the bay area in the US. For example, the collaboration between Caltech (US) and JAMSTEC (Japan) [15] led to the first Gordon Bell Prize winner in the domain of earthquake simulation. By using the spectral element method (SEM), a high-degree finite-element technique, seismic wave modeling across the entire earth was performed using a cubed-sphere mesh with an average grid spacing of 2.9 km, providing a performance of 5 Tflops when using 1,944 processors of the Earth Simulator machine. The work was later extended to a software package called SPECFEM3D GLOBE that evolves with newer generations of supercomputers. In 2008, SPECFEM3D GLOBE further improved the grid spacing to around 800 m, and the frequency range to around 0.5 Hz, with a performance of 28.7 Tflops by using 32,000 cores of Ranger, and a performance of 35.7 Tflops by using 29,000 cores of Jaguar (Jaguar's better performance is mainly due to its higher memory bandwidth for this memory-bound problem). In 2012, the SPECFEM3D code was further updated to support an efficient utilization of GPU devices, and scaled to 896 CPU-GPU nodes, providing a performance of 35 Tflops [20].

Another numerical method that enables large-scale earthquake simulations is the arbitrary high-order derivative discontinuous Galerkin finite element method (DG-FEM) used in the SeisSol software. SeisSol achieved a sustained performance of 8.6 Pflops on 8,192 nodes of Tianhe-2 [11], to support the simulation of the 1992 Landers M7.2 earthquake with 191 million tetrahedral elements and 96 billion degrees of freedom. Compared with the other methods, the DG-FEM in SeisSol converts the numerical problem into compute-bound dense matrix operations, thus leading to a more efficient utilization of the Xeon Phi accelerators. However, the high-order method also increases the complexity in both compute and memory, which again limits the largest size of solvable problems as well as the time to solution. SeisSol was recently upgraded into EDGE [3] to include the feature of fused seismic simulations of similar inputs, which further improves the throughput of the DG-FEM approach to 10.4 Pflops on Cori-II.

Recent efforts on the K computer explore the potential of implicit finite element solvers. T. Ichimura and his colleagues proposed GAMERA [14] and GOJIRA [13], the latter of which solves over 1 trillion degrees of freedom with a time step of 29.7 seconds, providing a sustained performance of 1.97 Pflops by using all the 663,552 CPU cores of the K computer. However, these two implementations focus more on scenarios coupled with buildings in cities (usually in the size of tens of kilometers) and on seismic wave amplification simulation, and they cannot tackle large-scale earthquake simulation in the range of hundreds of kilometers.

As a world-leading earthquake research consortium, and probably the world's largest geo-science collaboration across disciplines and countries, the Southern California Earthquake Center (SCEC) led some of the most important developments in earthquake simulation. SCEC initialized the TeraShake project in 2004 [6], which started with the 2,048 processors of the NSF-funded TeraGrid [5], and improved to a larger scale of 40,000 BlueGene processors.

Table 2: A summary of existing work on large-scale earthquake simulations on supercomputers. The numbers are obtained from the published papers. Unreported values are labelled as '-'. For the numerical method, FD refers to the finite difference method, SEM refers to the spectral element method, and DG-FEM refers to the discontinuous Galerkin finite element method.

Work | Year | Machine | Arch | Scale | # grid points | # DOFs | Flops | Mem | Method
[2] | 1996 | Cray T3D | Alpha CPU | 256 processors | 13.4 million | 40.2 million | 8 Gflops | 16 GB | FD
SPECFEM3D [15] | 2003 | Earth Simulator | NEC SX-6 | 1,944 processors | 5.5 billion | 14.6 billion | 5 Tflops | 2.5 TB | SEM
[4] | 2008 | Ranger; Jaguar | 4-core Opteron | 32,000 cores; 29,000 cores | - | - | 28.7 Tflops; 35.7 Tflops | - | SEM
[20] | 2012 | Cray XK6 | 16-core Opteron and Fermi M2090 GPU | 896 GPUs | 8 billion | 22 billion | 135 Tflops | 3.5 TB | SEM
SeisSol [11] | 2014 | Tianhe-2 | 12-core Xeon and 57-core Xeon Phi | 196,608 + 1,400,832 cores | 191 million tetrahedrons | 96 billion | 8.6 Pflops | - | DG-FEM
EDGE [3] | 2017 | Cori-II | 68-core Xeon Phi | 612,000 cores | 341 million tetrahedrons | - | 10.4 Pflops | 32 TB | DG-FEM
GAMERA [14] | 2014 | K Computer | 8-core SPARC64 | 663,552 cores | - | 27 billion | 0.804 Pflops | - | implicit FEM
GOJIRA [13] | 2015 | K Computer | 8-core SPARC64 | 663,552 cores | 270 billion | 1.08 trillion | 1.97 Pflops | - | implicit FEM
AWP-ODC [7] | 2010 | Jaguar | 6-core Istanbul | 223,074 cores | 436 billion | 1.31 trillion | 220 Tflops | 127 TB | FD, linear
[8] | 2013 | Titan | 16-core Opteron and K20x GPU | 229,376 SMXs (16,384 GPUs) | 859 billion | 2.58 trillion | 2.33 Pflops | 250 TB | FD, linear
[21] | 2016 | Titan | same as above | 114,688 SMXs (8,192 GPUs) | 329 billion | 987 billion | 1.6 Pflops | 129 TB | FD, non-linear
our work, without compression | 2017 | Sunway TaihuLight | 260-core SW26010 | 10,400,000 cores | 3.99 trillion | 11.98 trillion | 15.2 Pflops | 892 TB | FD, non-linear
our work, with compression | 2017 | Sunway TaihuLight | 260-core SW26010 | 10,400,000 cores | 7.8 trillion | 23.4 trillion | 18.9 Pflops | 724 TB | FD, non-linear

The TeraShake project discovered how the rupture directivity of the southern San Andreas fault, a source effect, could couple to the excitation of sedimentary basins, a site effect, to substantially increase the seismic hazard in Los Angeles, and it clearly demonstrated the scientific benefits of large-scale earthquake simulations. Another significant result of the project is the open-source software AWP-ODC (Anelastic Wave Propagation by Olsen, Day and Cui), which has supported numerous seismic research projects afterwards.

Around the year 2010, the simulation capability of AWP-ODC was further improved to petascale supercomputers [7]. By using 223,074 cores of Jaguar, a simulation was achieved for the 800 km by 400 km area in Southern California, with a spatial resolution of 40 m, a maximum frequency of 2 Hz, and a sustained performance of 220 Tflops. In 2013, AWP-ODC was extended to support GPU accelerators. For a problem size of 20,480 by 20,480 by 2,048 mesh points, a sustained performance of 2.33 Pflops was achieved by using 16,384 GPUs on Titan [8].
In 2016, the work was further improved to support the simulation of nonlinear effects, which is a key element to be considered in high-frequency earthquake simulations, and achieved a performance of 1.6 Pflops when using half of Titan [21].

Table 2 summarizes the existing work on large-scale earthquake simulation. Within roughly two decades, with the machines evolving from 256 processors to 10 million cores, and the problems expanding from millions to trillions of elements (occupied memory space growing from GB to PB), the simulation performance has also improved from Gflops to over 15 Pflops. With the different numerical methods taken by the different software packages (SEM for SPECFEM3D, DG-FEM for SeisSol and EDGE, implicit FEM for GAMERA and GOJIRA, and FD for AWP-ODC), it is not possible to perform an apples-to-apples comparison among them. In general, FEM-based approaches have a better capability for handling topography and can simulate complex scenarios with a small number of grids, but can face serious efficiency problems for the convergence of nonlinear problems. In contrast, the FD methods, which used to be regarded as an impractical solution due to their high demand for both computing and memory capacity [2], are nowadays considered more suitable for massively parallel computing environments because of their regularity in computing, memory access, and inter-node communication.

Considering all these major software codes, AWP-ODC, which accumulates SCEC's software efforts of the last few decades, provides the most advanced features, with well-tuned physics for plasticity and anelastic attenuation, and the capability to accomplish large-scale earthquake simulations within half a day. Therefore, in our work, we redesign AWP-ODC for the completely different hardware architecture of Sunway TaihuLight. By squeezing the machine's capability to the limits in almost all aspects, we manage to further push the performance from 2.3 Pflops on Titan to 15.2 Pflops on TaihuLight, and to support problems that are 4 to 5 times larger. With our compression scheme combined, the simulation capability is further improved to a performance of 18.9 Pflops, and can support scenarios of 18-Hz frequency and 8-meter resolution. In summary, compared with the AWP work on Titan, our work has expanded scientists' simulation capabilities for earthquakes by 8 times in simulation performance, and by 9 to 10 times in the largest problem size possible nowadays. Note that these improvements were achieved for this memory-bound application under the circumstance of a 5 to 10 times lower byte-to-flop ratio, which would be impossible without our innovations to break the memory constraints, detailed in Section 6.

5 SUNWAY TAIHULIGHT AND ITS ARCHITECTURAL CHALLENGES FOR LARGE-SCALE SCIENTIFIC APPLICATIONS

5.1 System Architecture
The computing power of TaihuLight is provided by the homegrown many-core SW26010 CPU that includes 4 core groups (CGs) [9], each of which includes one management processing element (MPE), one computing processing element (CPE) cluster with 8 by 8 CPEs, and one memory controller (MC). With 260 processing elements in one CPU, a single SW26010 provides a peak performance of over 3 Tflops. With a total of 40,960 SW26010 CPUs in the system, Sunway TaihuLight provides a peak performance of 125 Pflops with 10,649,600 cores. While TaihuLight demonstrates significant advantages over other systems on the computing side, the system is relatively modest in terms of memory capacity, with 32 GB of memory and a memory bandwidth of 136 GB/s for each node.

5.2 Major Design Challenges
As the first system in the world with over 10 million cores, the first design challenge is to derive the right parallelization scheme that maps our target application onto the processes and threads in the system. Similar to other heterogeneous supercomputers with many-core accelerators, we take a two-level 'MPI+X' approach. Each CG usually corresponds to one MPI process. Within each CG, we have two different options: one is Sunway OpenACC, a customized parallel compilation tool that supports the OpenACC 2.0 syntax and targets the CPE cluster; the other is a high-performance and light-weight thread library named Athread, which provides interfaces similar to Pthreads to exploit fine-grained parallelism. In our work, we take the Athread approach, which requires more programming effort but also exposes more possibilities to tune both the computation and the memory access schemes.
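As a rough illustration of this two-level structure, the sketch below shows how one time step could be organized: MPI ranks map to CGs, and each rank drives its CPE cluster through a spawned kernel. The athread_spawn/athread_join entry points follow the commonly documented host-side Athread usage (their exact prototypes come from the platform's athread.h), and the block type, halo-exchange routine, and kernel names are placeholders rather than the actual routines in our code.

```c
/* #include <athread.h>   -- host-side Athread interface on SW26010
 *                           (provides athread_init/spawn/join)          */

/* Placeholder data layout: the real code carries 30+ 3D arrays. */
typedef struct { float *vel, *stress; int nx, ny, nz; } Block;

extern void exchange_halos_mpi(Block *b);   /* hypothetical MPI halo exchange */
extern void wave_step_cpe(void *block);     /* hypothetical CPE (slave) kernel */

/* One time step at the process level: each MPI rank owns one CG and one
 * sub-domain of the 2D horizontal decomposition described in Section 6.3. */
static void time_step(Block *b)
{
    exchange_halos_mpi(b);             /* inter-process halos (overlapped in practice) */
    athread_spawn(wave_step_cpe, b);   /* launch the 64 CPE threads of this CG        */
    /* ... MPE-side work (boundaries, source injection) can overlap here ...          */
    athread_join();                    /* wait for the CPE kernel to finish            */
}
```
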
The second challenge, especially for this memory-bound problem of FD-based earthquake simulation, is to break the memory wall of this system. As highlighted in Table 1, with a byte-to-flop ratio that is 5 to 10 times lower than other leadership systems, we need extraordinary memory-related innovations to scale the simulation capability with the 125 Pflops computing performance.

Fig. 2 shows the memory hierarchy of the CPEs. The top level of the hierarchy is the 32 floating-point registers of each CPE. The local registers take only one cycle to access. Using the row and column communication buses in the CPE cluster, the register communication function takes roughly 11 cycles to fetch data from the remote registers of other CPEs in the same row or column. The second level is the 64-KB LDM, which needs to be manually managed by the programmer in a fine-grained way. The third level is the 8 GB main memory of each CG with a bandwidth of 34 GB/s, aggregating to a total of 32 GB of memory and 136 GB/s of bandwidth for each processor.

Figure 2: The memory hierarchy of the CPEs (registers: 32 per CPE, 1 cycle local, roughly 11 cycles remote via register communication; LDM: 64 KB per CPE, about 4 cycles; main memory: 8 GB per CG at 34 GB/s and 120+ cycles, 136 GB/s per processor).

As both the size and the bandwidth at the third level are relatively modest when compared to the 3 Tflops computing performance of the Sunway processor, a lot of our efforts in this work focus on making maximum utilization of the 64-KB LDM, the registers, and the register communication feature at the second and top levels. Even when we design the parallelization scheme, we carefully map the computing parts to the memory footprints that we can afford at the different levels. More details are given in Section 6.

6 MAJOR INNOVATIVE CONTRIBUTIONS

6.1 Contribution Summary
Our major contribution is a software framework on Sunway TaihuLight that can support both the generation of dynamic ruptures and the simulation of seismic wave propagation (the complete cycle of large-scale earthquake simulation) in a massively parallel way. The software framework manages to remove the constraints set by the system's relatively modest memory system, and achieves a simulation capability that scales with the unprecedented 125 Pflops computing power of TaihuLight, even though its byte-to-flop ratio is 5 to 10 times lower than that of other leading-edge supercomputers. Our key innovations are:
• A unified software framework that includes the dynamic rupture generator, the wave propagation part, and the other supporting functions, such as the source partitioner, 3D model generator, restart controller, and parallel I/O functions.
• A customized parallelization scheme that employs the 10 million cores efficiently at both the process level and the thread level.
• An elaborate memory scheme that integrates on-chip halo exchange through register communication, optimized blocking configuration guided by an analytic model, and coalesced DMA access with array fusion.
• An on-the-fly compression that pushes the application-available memory size and bandwidth to the next level, and expands both the highest performance and the largest problem size we can achieve on Sunway TaihuLight.

6.2 Our Unified Software Framework
The entire workflow of large-scale earthquake simulation is a complex process that consists of different components, ranging from dynamic rupture source generation and mesh generation to the most time-consuming wave propagation part. These different components bring challenges to various aspects (computing, memory, communication, and I/O) of the supercomputer. To resolve these challenges, we build a unified software framework that integrates the different functions, as shown in Fig. 3.

The dynamic rupture generator is based on the CG-FDM code [23], with functions to initialize the fault stress, to perform friction-law control, and to generate the sources through wave propagation. To support large-scale simulation, between the source and the wave propagation, we develop a source partitioner that maps one single large source input into different files for the different source-responsible MPI processes. We also provide a 3D model interpolator that remaps the velocity and density model to the target mesh.

Figure 3: The general structure of our earthquake simulation framework on Sunway TaihuLight (dynamic rupture source generator based on CG-FDM: fault stress initialization, friction law control, wave equation solver; source partitioner and 3D velocity/density model interpolator; seismic wave propagation based on AWP-ODC: velocity update, stress update, source injection, stress adjustment for plasticity, restart controller, snapshot/seismogram recorder; LZ4 compression, group I/O, and balanced I/O forwarding).

The wave propagation part, originating from AWP-ODC [7] and redesigned completely for Sunway TaihuLight, consumes most of the compute cycles and integrates most of our innovative optimizations. Key functions include: velocity update, stress update, source injection, and stress adjustment for plasticity.

The I/O part is another important factor for large-scale earthquake simulation. The toughest challenge comes from the checkpoints for restart. The wavefields required by the checkpoint aggregate to a size of 108 TB in the 16-meter resolution case in our study, which is clearly beyond both the I/O bandwidth and the capacity provided by this system. Therefore, we integrate LZ4 compression to reduce the size for a smoother run, and adopt techniques such as group I/O and balanced I/O forwarding, achieving a peak I/O bandwidth of 120 GB/s (92.3% of the file system we use).
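As a minimal illustration of this checkpoint path, the sketch below compresses one wavefield buffer with the standard LZ4 API before it is written. The checkpoint_compress helper, the buffer layout, and the fallback behavior are placeholders for illustration, not the actual I/O code of the framework; only the two LZ4 calls are the library's real interface.

```c
#include <stdlib.h>
#include <lz4.h>   /* LZ4_compressBound, LZ4_compress_default */

/* Hypothetical helper: compress one wavefield block before checkpointing.
 * Returns the compressed size in bytes, or 0 if the caller should fall
 * back to writing the raw data.                                          */
static int checkpoint_compress(const float *field, int n_points, char **out_buf)
{
    int src_size = n_points * (int)sizeof(float);
    int bound    = LZ4_compressBound(src_size);     /* worst-case output size */
    char *dst    = malloc(bound);
    if (!dst) return 0;

    int csize = LZ4_compress_default((const char *)field, dst, src_size, bound);
    if (csize <= 0) { free(dst); return 0; }         /* incompressible: raw write */
    *out_buf = dst;                                  /* caller writes csize bytes */
    return csize;
}
```

In the framework, the compressed buffers are then written through grouped writes and balanced I/O forwarding, so that many ranks share a small number of writers instead of all 160,000 processes hitting the file system at once.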

6.3 A Customized Parallelization Design
In the most time-consuming wave propagation part, we have computational kernels that involve reading and writing over 20 variable arrays that cover the entire mesh grid. In such cases, many previous optimization techniques, such as the 3.5D blocking scheme [18], become impractical due to the extremely high memory volume requirement. Therefore, we apply a customized domain decomposition scheme for these different kernels, which exposes enough parallelism for the 10 million cores while minimizing the related memory costs.

Fig. 4 shows our multi-level approach that decomposes the domain into different partitions for each MPI process, and further into different regions for each CPE thread, detailed as follows (a schematic loop structure is sketched after this list):

(1) 2D decomposition for MPI processes: for the storage of all the 3D arrays, we take the z axis (the vertical direction) as the fastest axis, the y axis as the second, and the x axis as the slowest axis. In typical earthquake simulation scenarios, we generally have significantly larger dimensions for x and y (hundreds of kilometers) than for z (tens of kilometers). Therefore, to minimize communication among the different processes, at the first level, instead of taking a 3D approach, we decompose the horizontal plane into Mx by My partitions, each corresponding to a specific MPI process. With the well-designed MPI scheme inherited from AWP-ODC [7] that hides halo communication behind computation, in extreme cases we can have up to 160,000 (400 by 400) MPI processes to scale to the full machine.

(2) Blocking for each CG: at the second level, instead of assigning all the mesh points to the different cores within a CG, we add a blocking mechanism along the y and z axes to assign a suitably sized block to the CG, so as to achieve a more efficient utilization of the 64-KB LDM of each CPE, detailed in Section 6.4. Each CG iterates over these different blocks to finish the processing.

(3) 2D decomposition for Athreads: we further partition the block into different regions for each CPE, along the y and z dimensions (with each thread iterating along the x direction), so as to achieve fast memory accesses for the different threads.

(4) LDM buffering scheme: for each CPE, we load a suitably sized portion of the computing domain (both the central and the halo parts) into the LDM using DMA operations, and perform the computation afterwards. The DMA operations are designed to be asynchronous, so as to overlap with the computation part.
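The following C sketch outlines, under simplifying assumptions, how levels (2)-(4) could translate into a per-CPE loop: each CPE claims one (Wy x Wz) region of the CG block and sweeps along x, keeping a ring of y-z slices in the LDM. The dma_get_async/dma_wait_all calls and the stencil_update routine are placeholders standing in for the platform's DMA operations and the real kernels; buffer sizes and the single-transfer slice fetch are simplified for readability.

```c
/* Illustrative per-CPE loop nest for levels (2)-(4) of the decomposition.
 * Cy follows Section 6.4 (Cz = 1, Cy = 64); Wx/Wy/Wz is the LDM working set. */
#define CY 64                 /* CPE layout inside one CG (one column of wz)   */
#define WX 5                  /* y-z slices kept in LDM for a 4th-order stencil */
#define MAX_TILE 1024         /* upper bound on wy * wz assumed in this sketch  */

extern void dma_get_async(void *ldm_dst, const void *mem_src, int bytes);   /* placeholder */
extern void dma_wait_all(void);                                             /* placeholder */
extern void stencil_update(float *out, float *in[WX], int wy, int wz);      /* placeholder */

void cpe_block_sweep(int cpe_id,             /* 0..63                         */
                     const float *field,     /* one fused array in DRAM, z fastest */
                     int nx, int ny, int nz, /* CG block dimensions           */
                     int wy, int wz)         /* LDM tile sizes, wy*wz <= MAX_TILE */
{
    static float slab[WX * MAX_TILE];        /* LDM storage for WX slices     */
    static float out[MAX_TILE];
    float *slice[WX];
    for (int s = 0; s < WX; s++) slice[s] = slab + s * MAX_TILE;

    int y0 = (cpe_id % CY) * wy;             /* level (3): this CPE's region  */
    int z0 = (cpe_id / CY) * wz;             /* 0 when Cz = 1                 */

    for (int x = 0; x < nx; x++) {           /* level (4): sweep along x      */
        /* Fetch the y-z slice at this x (shown here as one transfer; the
         * real transfer is strided over wy rows of wz contiguous points).   */
        dma_get_async(slice[x % WX],
                      field + ((long)x * ny + y0) * nz + z0,
                      wy * wz * (int)sizeof(float));
        dma_wait_all();                      /* overlapped with compute in the real code */
        if (x >= WX - 1)
            stencil_update(out, slice, wy, wz);
        /* asynchronous DMA put of 'out' back to DRAM omitted for brevity    */
    }
}
```
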

Figure 4: The multi-level domain decomposition scheme: (1) MPI decomposition; (2) CG blocking; (3) Athread decomposition; and (4) LDM buffering scheme (finished, computing, buffer, and unfinished areas along the compute direction).

6.4 A Balanced and Optimized Memory Scheme
To break the memory constraints, a key part of the solution is to efficiently utilize the memory hierarchy of the SW26010 processor. One unique feature of SW26010 is the register communication among the 64 CPEs in each CG, which provides a perfect solution for data reuse in stencil-like computations.

Using register-communication-based halo exchange, inside each CG, a CPE thread only needs to load its corresponding central region, and can acquire the halo regions from the neighboring threads through register communication operations. Only the boundary CPE threads that need to communicate across different CGs still need to initiate DMA loads for the corresponding halo regions.

Combining the parallelization scheme detailed in Section 6.3 and the register communication scheme, we can then derive an analytic model to determine an optimized configuration for the decomposition and the LDM buffering.

Our first set of design parameters are Cy and Cz, shown in Fig. 4, which determine the layout of the parallel CPE threads in each CG. The second set are Wz, Wy, and Wx, also shown in Fig. 4, which determine the size of the region loaded into the LDM of each CPE thread. The design constraints for Cz, Cy, Wz, and Wy are:

C_z \cdot C_y = 64   (5)

W_z \cdot W_y \cdot W_x \cdot N_{\mathrm{array}} \cdot N_{\mathrm{bytes\ per\ variable}} < 64 \times 1024   (6)

where N_array denotes the number of 3D arrays needed for the current kernel, and N_bytes_per_variable is the number of bytes we use for each variable (4 for single-precision floats).

The design goal is to: (1) minimize the number of DMA loads required for redundant halo region reads; (2) maximize the effective memory bandwidth by using a large chunk size.

To achieve the first goal of minimizing the number of redundant DMA loads, we derive the number of extra DMA loads as (the halo region fetching within a CG is performed through register communication, and is not considered here):

N_{\mathrm{redundant\ DMA\ load}} = 2H \cdot N_y \cdot \left(\frac{N_z}{C_z W_z} - 1\right) + 2H \cdot N_z \cdot \left(\frac{N_y}{C_y W_y} - 1\right)   (7)

where H is the number of points needed for the stencil halo, and N_y and N_z are the sizes of the block processed by the CG. Combining equations 5, 6, and 7, we can easily derive that to achieve the minimum number of redundant halo DMA loads, we need to ensure that C_y \cdot W_y = C_z \cdot W_z.

For the memory bandwidth part, each CG of SW26010 connects to a DDR3 interface with a bandwidth of 34 GB/s. However, the block size of the data chunk that we read or write with DMA operations largely determines the portion of the bandwidth that can be effectively utilized, as shown in Table 3. Only for block sizes above 512 bytes do we start to see a reasonable memory bandwidth utilization.

Table 3: Measured DMA bandwidths for different block sizes.

Block Size (bytes) | DMA Get, 1 CG (GB/s) | DMA Get, 4 CGs (GB/s) | DMA Put, 1 CG (GB/s) | DMA Put, 4 CGs (GB/s)
32 | 3.28 | 13.21 | 2.58 | 8.07
128 | 17.81 | 72.02 | 19.05 | 77.10
512 | 27.8 | 104.86 | 30.48 | 107.88
2048 | 31.3 | 119.2 | 34.2 | 133

In our model, the block size is determined by Wz, which specifies the dimension size along the fastest axis. Therefore, we need to keep Wz as large as possible. Combining the first and second goals of our model, we can derive that a small value of Cz is preferred to achieve an efficient memory behavior. Therefore, in most cases, we find Cz = 1 and Cy = 64 to be the most suitable configuration within a CG. Each of the 64 CPE threads initiates DMA load operations to fetch the corresponding cubic region (Wz by Wy by Wx) and the next y-z plane for the stencil operations afterwards.

Even after adopting the optimal parallelization scheme mentioned above, in most cases, due to the large number of arrays that we need to access, the 64 KB LDM limits us to a too small portion of each array, and hence a low efficiency of the DMA reads. For example, for the kernel dvelcx (update of velocity), we need to read in 10 different arrays (u, v, w, the six stress components sigma_xx, sigma_yy, sigma_zz, sigma_xy, sigma_xz, sigma_yz, and the density d). To compute the fourth-order-in-space stencil in this kernel, we need at least 5 slices loaded in the LDM, so the minimum value of Wx is 5. For the parameter Wy, as (Wy - 2H) is the effective region loaded into the LDM, we need to set Wy to at least 9 (for H = 2) to keep the halo cost at a reasonable level. Therefore, with 10 input arrays, from equation 6, we derive:

W_z \cdot 9 \cdot 5 \cdot 10 \cdot 4 < 64 \times 1024   (8)

which gives a maximum Wz value of around 32. In such a case, the DMA block size is 128 bytes, leading to a low memory bandwidth utilization of only 50%.

To resolve the above issue, we analyze all the different related kernels, and identify a set of co-located arrays that demonstrate a common behavior for a majority of the kernels. In the example of dvelcx and other similar kernels, we identify the arrays (u, v, w, sigma_xx, sigma_yy, sigma_zz, sigma_xy, sigma_xz, and sigma_yz) to be co-located arrays that demonstrate identical memory access patterns among the different kernels. Therefore, we make the design strategy to fuse u, v, and w into an array of vectors with three elements, and to fuse sigma_xx, sigma_yy, sigma_zz, sigma_xy, sigma_xz, and sigma_yz into an array of vectors with six elements.
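A minimal sketch of how this model can be evaluated in code is given below: it checks the LDM capacity constraint of equation (6), counts the redundant halo DMA loads of equation (7), and reports the largest admissible Wz for 10 separate arrays versus 3 fused arrays. The constants and the helper functions are illustrative, not part of the production code; the helper returns the pure capacity bound of equation (6), and the values of around 32 and 108 quoted in the text stay below this bound.

```c
#include <stdio.h>

#define LDM_BYTES     (64 * 1024)   /* per-CPE scratch-pad capacity */
#define BYTES_PER_VAR 4             /* single precision             */

/* Eq. (6): largest Wz such that Wz*Wy*Wx*Narray*4 fits in the LDM. */
static int max_wz(int wx, int wy, int n_array)
{
    return LDM_BYTES / (wx * wy * n_array * BYTES_PER_VAR);
}

/* Eq. (7): redundant halo DMA loads for a CG block of Ny x Nz points,
 * halo width H, CPE layout Cy x Cz, and LDM tile Wy x Wz.             */
static long redundant_dma_loads(int H, int Ny, int Nz,
                                int Cy, int Cz, int Wy, int Wz)
{
    return 2L * H * Ny * (Nz / (Cz * Wz) - 1)
         + 2L * H * Nz * (Ny / (Cy * Wy) - 1);
}

int main(void)
{
    /* 4th-order stencil: Wx = 5 slices, Wy = 9 (H = 2), cf. Eqs. (8)-(9) */
    printf("capacity bound on Wz, 10 separate arrays: %d\n", max_wz(5, 9, 10)); /* 36  */
    printf("capacity bound on Wz,  3 fused arrays   : %d\n", max_wz(5, 9, 3));  /* 121 */

    /* halo cost for an illustrative 512 x 512 block with Cz=1, Cy=64 */
    printf("redundant halo DMA loads: %ld\n",
           redundant_dma_loads(2, 512, 512, 64, 1, 8, 64));
    return 0;
}
```
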

After the fusion of the arrays, with only 3 separate arrays to read, from equation 6, we derive:

W_z \cdot 9 \cdot 5 \cdot 3 \cdot 4 < 64 \times 1024   (9)

which gives a maximum Wz value of around 108. In such a case, the DMA block size is 432 bytes, improving the memory bandwidth utilization to around 80%. In the extreme case of the dstrqc kernel, the array fusion technique increases the DMA block size from 84 bytes to 512 bytes, improving the effective memory bandwidth from 50.47 GB/s to 104.82 GB/s.

After combining all the techniques mentioned above (register-communication-based halo exchange, optimized blocking configuration guided by an analytic model, and array fusion), we derive a balanced memory scheme that achieves an efficient utilization of SW26010's memory hierarchy.

6.5 On-the-fly Compression
After squeezing all the capabilities of Sunway's hardware, our next innovation is an on-the-fly compression scheme, which not only doubles the available memory space, but also releases more computing power under the same physical memory bandwidth.

As shown in Fig. 5, our compression scheme consists of four major parts. Part (a) is a preprocessing step, which performs a complete simulation with a coarser resolution, so as to generate the statistics (such as the maximum and minimum values of the variables) that the high-resolution simulations afterwards utilize in their compression processes.

Part (b) explains the workflow of our compression scheme. The CPEs perform the tasks in the following sequence: load the compressed numbers into the LDM using DMA instructions, decompress, compute, compress the results into 16-bit values, and store the compressed results back to main memory, also with DMA instructions.

Part (c) explains the buffering scheme for the decompress, compute, and compress workflow. Similar to the scheme explained in Fig. 4, we perform the compression of points in y-z planes. Each time, we load a plane into the LDM, and perform the decompression with vectorization switched on.

Considering the characteristics of seismic wave propagation, we adopt three different compression methods in our work, shown in part (d), all of which compress a 32-bit float to 16 bits. With the goal of improving the computational performance with compression, we choose to apply lossy instead of lossless compression schemes, so as to minimize the extra computation that we introduce.

Method (1) directly uses the 16-bit half precision defined by the IEEE 754 standard, using 5 bits for the exponent and 10 bits for the mantissa. The inherent support for single-to-half precision conversion makes the compression part efficient. However, for variables with a larger dynamic range than the 5-bit exponent can cover, this compression scheme incurs numerical problems. In contrast, for variables with a smaller dynamic range, the 5-bit exponent becomes a waste.

To resolve the above issues in method (1), our method (2) determines the required exponent bit-width according to the recorded maximum dynamic range from the first part, and uses the remaining bits for the mantissa. Method (2) assures the coverage of the full dynamic range, and can reserve more bits for the mantissa parts of variables with a small dynamic range. The only disadvantage is the relatively high computational cost.

Method (3) is the most balanced between accuracy and efficiency. According to the statistics from the first part, we normalize all the values of the same array to the range between 1 and 2, which corresponds to an exponent value of zero. Therefore, after the normalization, we can shift the bits to get the mantissa part as the compressed value directly, which significantly simplifies the compression process.
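The sketch below illustrates, with hypothetical helper names, what a method-(3)-style compressor can look like: values are normalized into [1, 2) using the per-array minimum and maximum collected in part (a), so the exponent becomes constant and the leading fraction bits serve directly as the compressed value. It is a scalar sketch of the idea only; the format described above keeps a sign bit plus 15 fraction bits, while this simplification keeps 16 fraction bits, and the production code is vectorized and fused with the kernels at the assembly level.

```c
#include <stdint.h>
#include <string.h>

/* Method-(3)-style lossy 32->16 bit compression (illustrative only).
 * norm = 1 + (v - vmin)/(vmax - vmin) lies in [1, 2), so its IEEE 754
 * exponent is fixed (biased 127) and only the fraction bits carry data.
 * Assumes vmin <= v < vmax; values at vmax should be clamped in practice. */
static inline uint16_t compress16(float v, float vmin, float vmax)
{
    float norm = 1.0f + (v - vmin) / (vmax - vmin);      /* in [1, 2)        */
    uint32_t bits;
    memcpy(&bits, &norm, sizeof bits);                   /* reinterpret bits  */
    return (uint16_t)((bits >> 7) & 0xFFFFu);            /* top fraction bits */
}

static inline float decompress16(uint16_t c, float vmin, float vmax)
{
    uint32_t bits = (127u << 23) | ((uint32_t)c << 7);   /* rebuild 1.xxx...  */
    float norm;
    memcpy(&norm, &bits, sizeof bits);
    return vmin + (norm - 1.0f) * (vmax - vmin);         /* undo normalization */
}
```

Because the fixed exponent removes the need for any rounding or range analysis at run time, this variant costs only a handful of integer operations per value, which is why it ends up as the default choice for most velocity and stress variables, as discussed next.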
The compression scheme itself introduces extra computations, mostly integer operations, as well as intensive memory accesses. Even though these accesses all happen in the LDM, the frequent loads and stores could largely reduce the computational efficiency of the CPEs. As a result, our first version with compression only achieves 1/3 of the performance without compression. To reduce the performance penalty of the compression scheme, we perform a series of optimizations: (1) further increase the DMA block size (determined by Wz) to reduce the number of DMA loads by 70%; (2) adopt less complex algorithms for the cases that can still meet the accuracy requirement; (3) couple the decompression and the computation codes tightly (at the level of assembly code) to maximize the lifetime of variables in the registers, so as to decrease the loads and stores to the LDM.

In our final design, we adopt method (3) for most velocity and stress variables. Combining the compression scheme with our wave propagation kernels, we see 24% more computational performance, as our compression scheme helps to pump more data through the same physical bandwidth. Moreover, with a compression ratio of 2, we can effectively double the maximum problem size that can be solved on this supercomputer.

To validate our compression scheme, we perform an extensive comparison between simulation results generated with and without the compression scheme under different resolutions. Fig. 6 shows the comparison of the seismograms of two stations, Ninghe and Cangzhou, with compression switched on and off. Ninghe county, which is located near the Tangshan earthquake fault, endured great damage during the earthquake. The blue solid line in Fig. 6 is the result without compression, while the red dashed line is the compressed one. The sharp onsets of the compressed and uncompressed results are similar to each other, while the coda wave of the compressed result does not perfectly match the reference one because of the accuracy loss in compression. Nevertheless, most details of the compressed result are still well retained, which is accurate enough for strong ground motion simulations. Furthermore, we also compare the Cangzhou station, which is far away from the epicenter. While the compressed result endures slightly more accuracy loss due to the longer propagation time and error accumulation, the lines still match well with each other even to the end of the 120-s simulation. These comparison results demonstrate that, although the compression introduces errors to some extent, we can still achieve meaningful results using compression with better performance and larger problem sizes.
7 PERFORMANCE AND SCALABILITY

7.1 How Performance Was Measured
The performance is measured using the average time spent on one time step after running a benchmark test for 100 time steps. The number of floating point operations is measured using two different methods: by counting all floating point arithmetic instructions in the assembly code, and by using the hardware performance monitor, PERF, which is provided along with the Sunway TaihuLight compiler. The two methods generate similar operation counts, and we use the PERF tool to measure the average floating point operations in this study. Note that all the operations added for optimization purposes, such as the compression-related operations, are not counted in the number of FLOPs.

7.2 Kernel Optimization Results
For the velocity updates, the kernels dvelcx and dvelcy, which compute both the central region and the halo regions, are the major consumers of computing cycles. For the stress updates, the kernel dstrqc for the central region consumes the most cycles. The plasticity part, kernel drprecprc_calc, is the most time-consuming part of the entire program. The remaining kernels, including fstr, drprecprc_app, and the pre-processing and post-processing of MPI (unpack_VY, gather_VX, and unpack_VX), consume about 1-2% of the total running time; we also optimize them to achieve the highest performance.

Figure 5: The on-the-fly compression scheme. (a) Statistics (per-variable min/max) collected from a coarse-grid run; (b) the computation workflow on the CPEs: DMA get of compressed data from host memory, 16-bit to 32-bit decompression, general 32-bit computation, 32-bit to 16-bit compression, DMA put; (c) the decompress-compute-compress buffering scheme over the 13-point stencil with in-place decomposition between the compressed grid and the decompressed block; (d) the three compression algorithms converting the IEEE 754 32-bit format into 16-bit formats: (1) standard half precision (sign, 5-bit exponent, 10-bit fraction) for variables such as vel, ww0, phi, cohes, taxx...taxz; (2) variable exponent width (0-8 bits) chosen from log2(Emax - Emin), with the remaining 7-15 bits for the fraction, for variables such as str, r1...r6, sigma2, yldfac; (3) normalization to [1, 2) followed by a sign bit and 15 fraction bits, for variables such as d1, lam, mu, qp, qs, vx1, vx2, ww.

Figure 6: The validation of the compression for the Tangshan 500 m case: seismograms at the Ninghe and Cangzhou stations over 0-120 s (blue solid line: baseline without compression; red dashed line: with compression).

Fig. 7 demonstrates the performance and bandwidth improvements for the different kernels when applying the different approaches. We can observe that the speedups of almost all the most time-consuming kernels are in the same range of around 30x, and their memory bandwidths are in the same range of around 70%-80% of the full bandwidth. The only exception is the fstr kernel, which only achieves a 4 to 5 times speedup due to its extremely low arithmetic density. As a result, the distribution of the different kernels in the total execution time does not change much after applying the series of optimization techniques. Among the different optimization schemes, the fusion of co-located arrays plays a major part, improving the performance by up to 4 times for the most time-consuming kernels.

Figure 7: The speedup and memory bandwidth of the different kernels (mainly for the wave propagation part) when applying the different optimization techniques. 'MPE' stands for the original version that uses the MPE only. 'PAR' refers to the version that applies our specific parallelization scheme and uses both the MPE and the 64 CPEs for the computation. 'MEM' refers to the version that adopts all the memory-related optimizations. 'CMPR' refers to the version that further applies the on-the-fly compression scheme.

7.3 Weak Scaling Results
Fig. 8 demonstrates the weak scaling results for the linear and nonlinear cases. For both cases, we use each CG to compute a decomposed mesh of size 160 by 160 by 512, and scale that gradually to the entire machine. For weak scaling, we see an almost perfect linear speedup from 8,000 processes to 160,000 processes. Without compression, we achieve a sustained performance of 15.2 Pflops when using 160,000 processes in the nonlinear case, compared to 10.7 Pflops in the linear case. The compression scheme, which feeds more data through the same memory bandwidth, further improves the performance to 18.9 Pflops and 14.2 Pflops for the nonlinear and linear cases respectively.

Figure 8: The weak scaling results of the linear and nonlinear earthquake simulations, scaling from 8,000 to 160,000 MPI processes; each CG corresponds to one MPI process (measured peaks and parallel efficiencies: linear 10.7 Pflops, 97.9%; nonlinear 15.2 Pflops, 80.1%; linear with compression 14.2 Pflops, 96.5%; nonlinear with compression 18.9 Pflops, 79.5%).

7.4 Strong Scaling Results
Fig. 9 shows the strong scaling benchmark test results for the linear and nonlinear cases, with and without compression, based on three different mesh sizes. Our software achieves similar speedups in both linear and nonlinear cases, with or without compression. The degradation in performance as the number of processes increases results from two aspects: the ratio of computation to communication decreases, and so does the ratio of the outer halo region to the sub-volume size, which makes the AWP application less effective in overlapping communication and computation.

Figure 9: The strong scaling results of the linear and nonlinear earthquake simulations, scaling from 8,000 to 160,000 MPI processes, for three different problem sizes (dx = 100 m, 50 m, and 16 m), with and without compression. Each CG corresponds to one MPI process.

8 THE TANGSHAN EARTHQUAKE SIMULATION ON SUNWAY TAIHULIGHT
Previous simulations of violent earthquakes are usually limited to low frequencies, since enormous memory and time consumption are required for high-frequency simulations. With the strong computing power of Sunway TaihuLight, we manage to simulate the seismic wave propagation in Northern China caused by the Tangshan earthquake with a maximum frequency of 18 Hz. The entire computational domain is 320 km by 312 km by 40 km. The dynamic source derived from the spontaneous rupture on a fault with complex geometry is used to drive the seismic wave propagation and to estimate the seismic hazard distribution of the Tangshan earthquake. The geometry of the Tangshan fault, as well as the tectonic stress fields, is derived from observations and reasonable inference. The 3D velocity model of North China, with a resolution of 25 km in the horizontal and of 1-2 km in the vertical direction, is implemented in both the dynamic and the ground motion simulations. Additionally, the sediment structure is added into the strong ground motion simulations. These 3D complexities make our simulations of the Tangshan earthquake close to reality.

8.1 Dynamic Sources
To recover the seismic hazards caused by the Tangshan earthquake, we need to describe the earthquake environment as accurately as possible. After the occurrence of the earthquake, a surface rupture zone about 8 to 11 km long in the south of Tangshan city was reported. This rupture zone was composed of more than ten NE-trending right-lateral strike-slip left-stepping echelon ruptures, with a general strike of N30°E, 1.5-2.3 m of horizontal displacement, and 0.2-0.7 m of vertical displacement. At the south of the surface rupture zone, the vertical displacement turns from up on the west to up on the east. After the discovery of the new surface rupture zone [19], more geological evidence proved that the surface rupture zone of the 1976 Tangshan earthquake extends to the southwest of the city by more than 47 km [10], as shown in Fig. 10a. Moreover, deep seismic reflection profiles indicated that the Tangshan fault system is extremely complicated and may go deep into the Moho [16].

Based on previous investigations and observations of the Tangshan earthquake, we construct the 3D geometry of the Tangshan fault, as shown in Fig. 10b. The non-planar fault extends about 70 km and 35 km along the strike and dip directions, respectively. Two horizontal principal compressive stresses, with the directions shown in Fig. 10a, are used as the driving force in the dynamic simulation; the third principal compressive stress is vertical and not shown in the figure. A simple slip-weakening friction law with depth-dependent parameters is implemented.

The complexities in the fault geometry require powerful numerical methods to implement the dynamic conditions on the irregular fault plane. We perform the dynamic rupture simulation of the Tangshan earthquake on a fault of such complex geometry to generate the source for the later ground motion simulation on Sunway TaihuLight. Fig. 10b illustrates the snapshot of the absolute slip rate of the Tangshan earthquake rupture at T = 10.5 sec. The northeast side of the rupture fault shows more complexity because of the curvature of the fault strike. This confirms the importance of the complex 3D fault geometry during dynamic rupture simulations.

8.2 Strong Ground Motion
To fully recover the seismic hazard distribution of the larger area adjacent to the Tangshan earthquake, we use the dynamic rupture sources as the input of the wave propagation program to calculate the strong ground motion of a larger area.

The region we simulate is 320 km by 312 km by 40 km (about 115.7°E-119.7°E, 38.0°N-41.7°N, including major cities such as Tangshan, Beijing, and Tianjin), and we perform the simulation with the spatial resolution gradually increasing from 500 m to 8 m, and with the maximum frequency increasing from 0.5 Hz to 18 Hz.

Figure 10: (a) The simulation region (320 km by 312 km by 40 km) of the strong ground motion for the Tangshan earthquake. The sediment depth of this region is indicated by color; the epicenter (red star), the earthquake fault (red curved line), and the stress field (red and blue vectors) are shown. (b) The dynamic source calculated by the CG-FDM method (absolute slip rate on the fault at T = 10.5 sec).

Due to the dynamic rupture, the complex 3D media, and the geological structure, high-frequency energy (up to at least 10 Hz) can be generated by earthquakes. Seismograms with sufficient high-frequency content are important data for engineering seismology analysis to design proper standards for the seismic protection of buildings. As Figs. 11a-b show, there exists an obvious discrepancy between the different spatial resolution simulations, because a low spatial resolution such as 200 m is not enough to describe the basin structure very well (in Fig. 10a, the maximum sediment depth is 800 m). Thus, the coda wave calculated at low spatial resolution is not correct: there are many more coda wave vibrations after the main seismic energy in Cangzhou city when simulated at the higher spatial resolution compared with the lower one. Moreover, as the epicenter of the Tangshan earthquake is located in the sediment basin, the main peak of the earthquake cannot even be calculated accurately at low resolution (Figs. 11a-b, Ninghe). In a word, a high spatial resolution at the level of 16 m or finer is a key element for performing accurate earthquake simulations in the presence of 3D complex structures.

We also compare the snapshots of the wave field and the hazard map to illustrate the details of the different results. Figs. 11c-d show the snapshots 60 seconds after the occurrence of the Tangshan earthquake. While the main behavior of the wave field is similar, we see clear differences at small scales. The zoom-in pictures in the top right corner illustrate the sediment effect during this earthquake, which cannot be described very well at low spatial resolution.

The hazard map (expressed by seismic intensity) for the Tangshan earthquake can be obtained by calculating the horizontal peak ground velocity. We have compared the results of the 200 m and 16 m resolutions. As Figs. 11e-f show, the hazard distribution is presented by color. The red color (degrees 9-11), which indicates severe damage during this earthquake, is strongly dependent on the rupture fault location, but is redistributed by the sediment effect: Luannan county in Tangshan city, which is not located adjacent to the fault trace, also experienced great damage. The hazard map at high spatial resolution (Fig. 11f) shows large differences compared with the low resolution one (Fig. 11e). For example, the intensity in Wuqing is level 6 (blue color) in the left map but 7 (green color) in the right map, which again demonstrates the importance of high resolution simulation in achieving an accurate seismic hazard map.

Figure 11: The comparison between different spatial resolution simulations; the left column is simulated at 200 m while the right column is at 16 m. (a-b) The seismograms of velocity (W-E component) in Ninghe county and Cangzhou city; (c-d) the snapshots of the velocity field at 60 seconds after the seismic nucleation, with the zoomed subfigures in the top right corner showing the details of the area with the sediment effect (the dashed square frame); (e-f) the seismic hazard distribution, represented by the Chinese intensity map calculated from the horizontal peak ground velocity field.

9 IMPLICATIONS
One message we want to emphasize here is that this work manages to push the performance in various aspects of the Sunway TaihuLight machine to its hardware limits. As shown in Table 4, we have made close-to-extreme utilizations, especially for the memory system.
memory system (a byte-to-￿op ratio that is only 1/5 of Titan), with- ￿e hazard map (expressed by seismic intensity) for Tangshan out using the compression scheme, we already achieve 15.2-P￿ops earthquake can be obtained by calculating the horizontal peak nonlinear earthquake simulation by using the 10,400,000 cores of ground velocity. We have compared the results of 200m and 16m Sunway TaihuLight, up to 12.2% of the peak. SC17, November 12–17, 2017, Denver, CO, USA H. Fu et al.
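To make this post-processing step concrete, the sketch below computes the horizontal peak ground velocity at a single surface grid point from the simulated velocity time series. The function and array names are hypothetical, and the table that maps PGV to Chinese seismic intensity (used for Fig. 11e∼f) is only indicated, not reproduced.

```c
#include <math.h>

/* Hypothetical sketch: horizontal peak ground velocity (PGV) at one
 * surface grid point, from the W-E (vx) and N-S (vy) velocity time
 * series of length nt produced by the wave-propagation run. */
float horizontal_pgv(const float *vx, const float *vy, int nt)
{
    float pgv = 0.0f;
    for (int it = 0; it < nt; ++it) {
        /* magnitude of the horizontal velocity vector at this time step */
        float v = sqrtf(vx[it] * vx[it] + vy[it] * vy[it]);
        if (v > pgv)
            pgv = v;
    }
    /* the hazard map colors each point by this value, converted to a
     * Chinese intensity grade through a standard PGV-to-intensity table */
    return pgv;
}
```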

9 IMPLICATIONS
One message we want to emphasize here is that this work manages to push the performance of various aspects of the Sunway TaihuLight machine to its hardware limits. As shown in Table 4, we achieve close-to-extreme utilization, especially of the memory system. As a result, even for this memory-bound problem (30 3D arrays to process for each kernel) on TaihuLight's relatively modest memory system (a byte-to-flop ratio that is only 1/5 that of Titan), and without using the compression scheme, we already achieve a 15.2-Pflops nonlinear earthquake simulation using the 10,400,000 cores of Sunway TaihuLight, up to 12.2% of the peak.

Table 4: Computing and memory performance for the runs of the largest cases without compression. The computing performance, memory size, and bandwidth are for a single CG; the LDM size is for a single CPE. Note that, out of the 8 GB of memory per CG, we have to reserve 2.5 GB for system and MPI buffers in full-machine cases.

                          Effectively used    Peak          %
Computing performance     98.7 Gflops         765 Gflops    12.9%
Memory size               5.2 GB              5.5 GB        94.5%
Memory bandwidth          25 GB/s             34 GB/s       73.5%
LDM size                  60 KB               64 KB         93.8%
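A simple roofline-style reading of Table 4, using only the numbers in the table and taking the measured flop-to-byte ratio as the kernel's effective arithmetic intensity, shows how close this run sits to the memory roof:

$$
\frac{98.7\ \text{Gflops}}{25\ \text{GB/s}} \approx 3.9\ \text{flops/byte},
\qquad
34\ \text{GB/s} \times 3.9\ \text{flops/byte} \approx 134\ \text{Gflops} \approx 17.5\%\ \text{of the 765-Gflops peak}.
$$

In other words, even at 100% of the measured DRAM bandwidth this kernel could not exceed roughly 17.5% of a CG's compute peak, so achieving 12.9% of the peak at 73.5% bandwidth utilization means the run is essentially sitting on the bandwidth limit, which also helps explain why reducing memory traffic through compression translates directly into additional speed.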
The even more exciting innovation is of course the on-the-fly compression scheme, which, at the cost of an acceptable level of accuracy loss, scales our simulation performance and capabilities even beyond the machine's physical constraints. With on-the-fly compression integrated into all the 10,485,760 CPEs, which is probably also the largest-scale runtime compression and decompression for scientific simulation in history, we manage to double the maximum problem size we can support. Moreover, with a careful design of the compression scheme, even though we did introduce more computational and memory complexity, the compression actually improves the performance by a further 24%.

Our compression scheme expands our computational performance to the level of 18.9 Pflops (15% of the peak) and enables us to support 18-Hz, 8-meter simulations, a big jump from the previous state of the art. While the current compression scheme is largely customized for our specific application and the Sunway architecture, we believe the idea has great potential to be applied to other applications and other architectures, especially in the exascale era, where the memory wall becomes the major constraint.
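As the text notes, the actual scheme is customized for this application and for the Sunway architecture; purely to illustrate the general idea, the sketch below shows one simple form that blockwise, lossy on-the-fly compression can take: quantizing 32-bit floats to 16-bit integers with a per-block scale before a block is written back to main memory, and inverting the step after a block is loaded into the LDM. The block size, the names, and the quantization choice are assumptions of this sketch, not the design used in this work.

```c
#include <math.h>
#include <stdint.h>

/* Illustrative sketch only (not the scheme used in this work): lossy
 * blockwise compression of 32-bit floats into 16-bit integers before a
 * block is written back to main memory, and the inverse after a block
 * is loaded into the LDM. Halving the in-memory footprint is what can
 * double the supported problem size; halving the DMA volume is what
 * can buy back performance, as long as the quantization error stays
 * within the accuracy budget. */

#define BLK 256                 /* elements per compressed block (assumed) */

typedef struct {
    float   scale;              /* per-block scale factor */
    int16_t q[BLK];             /* quantized samples */
} blk16_t;

static void compress_block(const float *x, blk16_t *c)
{
    float amax = 0.0f;
    for (int i = 0; i < BLK; ++i)           /* block maximum magnitude */
        amax = fmaxf(amax, fabsf(x[i]));
    c->scale = (amax > 0.0f) ? amax / 32767.0f : 1.0f;
    for (int i = 0; i < BLK; ++i)           /* quantize into the 16-bit range */
        c->q[i] = (int16_t)lrintf(x[i] / c->scale);
}

static void decompress_block(const blk16_t *c, float *x)
{
    for (int i = 0; i < BLK; ++i)           /* reconstruct approximate values */
        x[i] = c->scale * (float)c->q[i];
}
```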
10 ACKNOWLEDGMENT
We thank Y. Cui, A. Breuer, J. Tobin, and D. Mu from UCSD, and S. Day, D. Roten, and K. Olsen from SDSU for their discussion and great advice on the AWP project over Sunway TaihuLight. This work was partially supported by the National Key Research & Development Plan of China under grant #2017YFA0604500, the NSF of China under grants #41374113, 91530323, 41661134014, 41504040, and 61361120098, and the Shandong Province Taishan Scholar program. The corresponding authors are H. Fu ([email protected]), C. He ([email protected]), W. Xue ([email protected]), and X. Chen ([email protected]).