Architectures for the Post-Moore Era

Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture

This article presents an efficient dataflow methodology for solving Euler atmospheric dynamic equations. The authors map a complex Euler stencil kernel into a single field-programmable gate array chip and develop a long streaming pipeline that can perform 956 mixed-precision operations per cycle. Their dataflow design outperforms traditional multicore and many-core counterparts in both time to solution and energy to solution.

Lin Gan and Haohuan Fu, Tsinghua University
Wayne Luk, Imperial College London
Chao Yang, Chinese Academy of Sciences
Wei Xue and Guangwen Yang, Tsinghua University

Numerical atmospheric simulation is an essential method for studying the climate system and a key tool for verifying climate change mechanisms, making predictions into the future, and providing guidance for protecting the planet from severe issues such as climate change and weather disasters. Due to atmospheric models' high demand for computing power, a lot of effort has been made to study and simulate the atmospheric model using the world's most powerful supercomputers. However, traditional high-performance computing (HPC) systems based on multicore and many-core architectures face architectural constraints on data representation, memory access patterns, communications, and bandwidth, as well as challenges from the extremely complex atmospheric algorithms. Thus, novel architectures and revolutionary optimization techniques are in great demand for achieving better results in both time to solution and energy to solution.

In recent years, dataflow engines (DFEs) based on reconfigurable field-programmable gate arrays (FPGAs) have developed quickly and achieved many inspiring results in various applications such as deep learning.1 Unlike traditional architectures, reconfigurable DFEs2 obtain high performance by deploying a long pipeline of concurrent operations corresponding to the target algorithm, while maintaining high power efficiency thanks to a lower clock frequency. Their customizable hardware circuits, data representations, and on-chip memory provide great flexibility and optimization space for tackling complex numerical algorithms.

Figure 1. The dataflow engine (DFE) architectures and algorithmic computing mechanisms. (a) General architecture of the hybrid CPU-DFE system using PCI Express as a bridge. (b) Dataflow computing model. Each dataflow core refers to a specific operation. (c) The 25-point Euler stencil. (d) The hybrid domain decomposition methodology assigns the inner stencil to the DFE and the outer area to the CPU. (e) Workflow of the hybrid methodology, in which C2D and D2C refer to the data exchange between the CPU and the DFE.

In this article, we present an efficient dataflow computing model to solve the 3D Euler atmospheric equations, which is the essential step in describing mesoscale atmospheric dynamics. By fully exploiting the customizable features of the selected FPGA platforms, we propose a set of novel optimizing techniques, such as a customizable window buffer, algorithmic offsetting arithmetic, and a mixed-precision mechanism; manage to map the complex Euler kernel into a single FPGA chip; and build a long computing pipeline that can perform 956 mixed-precision operations per cycle. We also present our efforts to fully optimize the Euler algorithm on traditional platforms, including the CPU, Many Integrated Core (MIC), GPU, and SW26010 CPU.

Our dataflow design outperforms traditional multicore and many-core counterparts in terms of both time to solution and energy to solution. The performance comparisons with other major multicore and many-core processors prove dataflow computing to be a promising method in the post-Moore era to break the architectural constraints and bring inspiring achievements.

Background

This section introduces the hardware architecture, equations, and algorithm, and summarizes the existing efforts.

Dataflow Engines

Figure 1a shows the general architecture of a hybrid CPU-DFE system. Similar to a traditional hybrid system, it contains CPUs to handle issues such as DFE initialization, task scheduling, and data exchange.

Compared with traditional multicore and many-core architectures such as the CPU, MIC, and GPU, the DFE achieves high performance by developing within a single FPGA chip a long streaming pipeline that comprises many concurrent operations.


Different hardware resources, such as logic units and digital signal processors (DSPs), are customized to deploy a specific hardware circuit that corresponds exactly to the targeted algorithm (see Figure 1b). Data originally stored in the onboard memory (DRAM) streams through the different dataflow cores of the pipeline and gets computed. By streaming different dataflows through different dataflow cores, we can execute multiple instructions on multiple data (MIMD) and thus boost the overall performance.

Euler Equations

We can write the Euler equations3 as the following set of conservation laws:

\[
\frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} + \frac{\partial G}{\partial y} + \frac{\partial H}{\partial z} + S = 0, \tag{1}
\]

in which

\[
\begin{aligned}
Q &= (\rho',\ \rho u,\ \rho v,\ \rho w,\ (\rho\theta)')^T,\\
F &= (\rho u,\ \rho u u + p',\ \rho u v,\ \rho u w,\ \rho u \theta)^T,\\
G &= (\rho v,\ \rho v u,\ \rho v v + p',\ \rho v w,\ \rho v \theta)^T,\\
H &= (\rho w,\ \rho w u,\ \rho w v,\ \rho w w + p',\ \rho w \theta)^T,\\
S &= (0,\ 0,\ 0,\ \rho' g,\ 0)^T,
\end{aligned} \tag{2}
\]

where \(\rho\) is the density, \(\mathbf{v} = (u, v, w)\) is the velocity, \(p\) is the pressure, and \(\theta\) is the potential temperature of the atmosphere. The system is closed with the equation of state

\[
p = p_{00} \left( \frac{\rho R \theta}{p_{00}} \right)^{\gamma}, \tag{3}
\]

in which \(p_{00} = 1013.25\,\text{hPa}\) is the ground-level pressure, \(R = 287.04\,\text{J/(kg}\cdot\text{K)}\) is the gas constant for dry air, and \(\gamma = 1.4\). We remark here that because of the conservative property of the finite volume scheme,3 the numerical fluxes on a common edge of two consecutive mesh elements are identical.

We used a cell-centered finite volume scheme plus an explicit Runge-Kutta time-stepping method,3 after which each time step in solving the Euler equations (that is, the Euler algorithm) requires two Euler stencil sweeps applied at all mesh elements. Figure 1c shows the Euler stencil in the 3D channel: to update a mesh element (for example, the central one in Figure 1c), we need to access 25 elements. After the original channel is decomposed into different subdomains corresponding to different MPI processes (see Figure 1d), the halo area needs to exchange data with neighboring subdomain halos, which is marked as halo updating.

Figure 2 shows the Euler algorithm, which contains two major steps: halo updating and 3D Euler stencil computing.

1: for (k, j, i) ← (0, 0, 0) to (Nk − 1, Nj − 1, Ni − 1) do
2:   if (k, j, i) ∈ Halo then
3:     Halo Updating
4:   end if
5:   3D Euler Stencil
6: end for

Figure 2. Euler algorithm per stencil sweep.

Unlike an ordinary stencil kernel, the 3D Euler stencil has to perform over 2,100 double-precision operations to update one element per sweep,4 and it is more complex in its communication pattern, resulting in tough computational challenges for traditional HPC platforms.

Computational Challenges

Traditional HPC systems based on many-core and multicore architectures face tough challenges, in part from the complex algorithmic kernel. The more realistic 3D model built on the Euler equations brings more complex algorithms and data communication patterns, and thus puts great pressure on the memory bandwidth. The neighboring points of a 25-point stencil are stored in different rows, columns, or slices, and can be far apart in the memory space. Accessing all 25 points incurs a lot of cache misses.

Although complex atmospheric stencils have been widely studied on traditional HPC platforms, the achieved efficiency is relatively low (generally less than 10 percent). Therefore, novel architectures and revolutionary optimization techniques are in great demand to achieve better efficiency.
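To make the access pattern concrete, the following C sketch (our own illustration, with hypothetical array and dimension names; the real kernel touches 25 points across several fields) shows how even the closest stencil neighbors map to widely separated linear addresses:

/* Illustrative only: q is one flattened field of the channel, and
   nx, ny are the x and y dimensions; (k, j, i) is an interior point. */
#include <stddef.h>

static inline size_t idx(size_t k, size_t j, size_t i,
                         size_t nx, size_t ny)
{
    return (k * ny + j) * nx + i;   /* linearized (k, j, i) address */
}

double seven_point_sum(const double *q, size_t k, size_t j, size_t i,
                       size_t nx, size_t ny)
{
    /* The i +/- 1 neighbors are adjacent doubles, but the j neighbors
       sit nx elements away and the k neighbors nx*ny elements away,
       so one update touches cache lines spread across three slices. */
    return q[idx(k, j, i,     nx, ny)]
         + q[idx(k, j, i - 1, nx, ny)] + q[idx(k, j, i + 1, nx, ny)]
         + q[idx(k, j - 1, i, nx, ny)] + q[idx(k, j + 1, i, nx, ny)]
         + q[idx(k - 1, j, i, nx, ny)] + q[idx(k + 1, j, i, nx, ny)];
}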

Existing Efforts

With increased computing power in the form of many-core processors and accelerators (for example, the many-core CPU, GPU, MIC, and so on), we see a lot of effort at porting and refactoring atmospheric models on the most powerful supercomputers to achieve ultra-high scalability and resolution. A recent work5 that won the 2016 ACM Gordon Bell Prize scales a fully implicit Euler solver to over 10.5 million cores of the Sunway TaihuLight supercomputer6 and sustains a double-precision performance of 7.95 Pflops with 488-m horizontal resolution. Although these many-core architectures demonstrate potential for improving the performance of existing models, we still see a gap of several orders of magnitude to advance to the next level of science.

On the other hand, preliminary work on FPGAs has achieved promising results in recent years. Melissa Smith and her colleagues accelerated the Parallel Spectral Transform Shallow Water Model using ORNL's SRC computers.7 They managed to deploy and accelerate the key subroutines (such as the fast Fourier transform or Laplace transform) on the FPGA clusters. Diego Oriato and his colleagues accelerated a realistic dynamic core of a meteorological limited area model (LAM)8 using the Maxeler DFE platform.2 It was a successful trial that reduced resource usage through fixed-point arithmetic, with satisfactory speedup over multiple CPU cores. Compared with these works, the Euler equations we study are more complex in their spatial discretization schemes and data layouts, and more meaningful for achieving cloud-resolving simulation in future models.

There are also many successful works on porting stencil kernels onto FPGAs. Kentaro Sano and his colleagues employed nine FPGAs (Altera Stratix IV EP4SGX230 and Stratix V EP5SGSD8 FPGAs) to compute 2D and 3D Jacobi computations in a deep-pipeline approach and obtained computing efficiencies of 87.4 and 83.9 percent, respectively.9 Xinyu Niu and his colleagues proposed an approach to exploit the runtime reconfigurability of FPGAs for optimizing the reverse time migration (RTM) stencil.10 The achieved performance on a Xilinx Virtex-6 SX475T FPGA is up to two orders of magnitude faster than the CPU reference designs. Hasitha Waidyasooriya and his colleagues proposed an OpenCL-based FPGA design and achieved better stencil performance than CPUs and GPUs.11 Koji Okina and his colleagues selected two essential 3D stencil kernels from the heat conduction and electromagnetic fields to evaluate their performance on stream-oriented FPGAs.12 Comparatively, the 3D Euler stencil in this work is much more challenging to solve, because more than 2,100 double-precision floating-point operations must be performed.

Dataflow-Oriented Euler Designs

This section focuses on our dataflow designs, including the essential optimizations to boost the overall performance.

Design Overview

Efficiently mapping a given algorithm into a DFE generally requires consideration of several issues.

Hybrid domain decomposition. The first step is to focus on the heterogeneous architecture for the best utilization of both the CPU and the DFE. The original algorithm can be divided into different tasks processed by either the CPU or the DFE, separately but simultaneously.

Hardware resource analysis. Because on-chip resources such as flip-flops and DSPs are required to build the hardware circuit for the streaming pipeline, those limited resources might not be enough to fulfill the demands of complex algorithms that contain too many operations, such as the Euler algorithm, which originally contained more than 2,100 double-precision floating-point operations. In the rest of this section, we discuss the key designs we propose to decrease the Euler algorithm's resource demands.

Hybrid Domain Decomposition

Figure 1d shows the hybrid decomposition methodology. We decompose each subdomain into an inner cube (the shaded elements in Figure 1d) and an outer area (the blank elements). All halo elements thus go to the outer area, and the inner cube contains only stencil computation. We assign the DFE to process the inner-cube stencil computations, and we assign the CPU to process the outer-area stencil computation and all communication. Figure 1e shows the workflow of hybrid domain decomposition.
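The per-sweep workflow in Figure 1e can be sketched in C as follows; the function names (dfe_run_async, cpu_outer_stencil, halo_exchange, and dfe_wait) are hypothetical stand-ins for the DFE runtime and MPI calls, not the actual API:

/* A minimal sketch of one hybrid stencil sweep, assuming an
   asynchronous DFE launch; all names below are illustrative. */
typedef int dfe_handle;
struct region;                        /* opaque inner/outer mesh parts */
struct subdomain { struct region *inner, *outer; };

dfe_handle dfe_run_async(struct region *r);  /* C2D + DFE stencil      */
void cpu_outer_stencil(struct region *r);    /* CPU outer-area stencil */
void halo_exchange(struct subdomain *sd);    /* MPI halo updating      */
void dfe_wait(dfe_handle h);                 /* D2C + synchronization  */

void euler_sweep(struct subdomain *sd)
{
    dfe_handle h = dfe_run_async(sd->inner); /* DFE: inner cube        */
    cpu_outer_stencil(sd->outer);            /* CPU: outer area...     */
    halo_exchange(sd);                       /* ...and communication,  */
    dfe_wait(h);                             /* hidden in the DFE time */
}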


The CPU and DFE now work simultaneously, and the CPU time for computation and communication can be hidden inside the DFE time for stencil computation, which results in efficient computation-communication overlapping. Moreover, the DFE no longer needs to spend resources implementing the outer-part computation.

Memory Optimization and DFE Programming

In this section, we focus on the optimizations based on the hierarchical memory.

Customizable window buffer. As we explained earlier, accessing the 25 points of the 3D Euler stencil always incurs a lot of cache misses on traditional architectures. On an FPGA, BRAM can be used to construct a customizable window buffer to accommodate the streaming data of the 3D Euler stencil. The window buffer is sized to cover all 25 points required by the current stencil computation, and thus provides a perfect cache mechanism with the following behaviors:

• Data streams into the window buffer when required for stencil computing.
• Data streams out of the window buffer when no longer required for stencil computing.
• During stencil computing, all 25 required points are in the window buffer and can be accessed immediately.
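As a software analogue (our own sketch, not the hardware implementation), the window buffer behaves like a circular buffer that retains just enough of the stream to serve every stencil offset:

/* Illustrative circular buffer: WINDOW must cover the largest stencil
   offset, on the order of 2*nx*ny elements for a 3D stencil. */
#define WINDOW 131072

struct window_buffer {
    double data[WINDOW];
    long   head;                      /* position of the newest element */
};

static void wb_push(struct window_buffer *w, double v)
{
    w->head = (w->head + 1) % WINDOW; /* oldest element streams out     */
    w->data[w->head] = v;             /* newest element streams in      */
}

static double wb_get(const struct window_buffer *w, long offset)
{
    /* Any of the 25 buffered points is available immediately;
       offset counts elements behind the newest one. */
    return w->data[((w->head - offset) % WINDOW + WINDOW) % WINDOW];
}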
Fast memory table. In the stencil loop, variables that rely only on the index coordinates can be precalculated at compile time by the CPU and stored as a fast memory table in BRAM. Those variables can then be accessed by looking up the table. If the BRAMs are not big enough to store all the coordinate variables, we can alternatively store them in DRAM as constant variables. In effect, we trade memory for operations.

Data accommodation strategy. Different types of data from the Euler algorithm are stored in different memories on the DFE side. The large on-board DRAM is used to accommodate the wave propagation variables3 that must be updated every step, whereas the on-chip BRAM is used to store the coordinate variables that form the fast memory table. Because the kernel computation is not bounded by the memory bandwidth, the data can be accessed in time according to the FPGA frequency (180 MHz in our case) and in a 2D pattern by using MaxCompiler,2 a source-to-source compiler that provides a high-level hardware description language.

Algorithm Optimization

The identical rule for computing the fluxes on a common edge of two consecutive mesh elements provides a big optimizing space to simplify the Euler stencil based on the dataflow computing model. For example, Figure 3a shows the C code to compute variables y0 and y1. Computing y0 and y1 applies an identical rule to different elements of x, and the x elements for computing y1 are one time step ahead of the x elements for computing y0. In other words, the result of y1 at step t equals the result of y0 at step t − 1. So, in a dataflow computing model, all the dataflow operations for computing stream dfe_y1 can be replaced by offsetting stream dfe_y0 one time step backward (see Figure 3c). Algorithmic offsetting lets us eliminate a large number of redundant floating-point operations. Figure 3b shows how to transfer data between the CPU and the DFE and how to pipeline the kernel for size cycles.

Figures 3b and 3c also illustrate the programming details based on MaxCompiler.2 The programming complexity is reduced to the level of software programming, while the flexibility to play with hardware design options is mostly retained.

In Figure 3, input stream x goes through different operations to construct a computing pipeline for output streams y0 and y1. Similarly, when x goes to other pipelines for different output streams simultaneously, a MIMD model corresponding to what is shown in Figure 1b is formed.

Precision Optimization

The customizable data representation is another means of reducing resource usage. By using fixed-point or reduced-precision floating-point data to replace the original double-precision data, we can greatly decrease the Euler algorithm's resource requirements. To determine the best data precision, we applied range analysis and precision analysis.

// x, y0, and y1 are the variables in the original Euler code
1: double y0 = 24*x[k][j][i] - (x[k][j][i+1] + x[k][j][i-1] + x[k][j+1][i] + x[k][j-1][i] + x[k+1][j][i] + x[k-1][j][i]);
2: double y1 = 24*x[k][j][i-1] - (x[k][j][i] + x[k][j][i-2] + x[k][j+1][i-1] + x[k][j-1][i-1] + x[k+1][j][i-1] + x[k-1][j][i-1]);
(a)

// nx and ny are the domain size on the x and y dimensions, respectively
1: const int size = nx * ny;
2: int sizeBytes = size * sizeof(double);
3: Stencil_Compute_writeLMem(0, sizeBytes, x); // CPU transfers array x to DFE DRAM as stream data
4: Stencil_Compute(size, y0, y1); // the DFE stencil kernel is executed and pipelined for size cycles
(b)

// dfe_x, dfe_y0, and dfe_y1 refer to the dataflows that correspond to x, y0, and y1
// nx and ny are the domain size on the x and y dimensions, respectively
// in Java class Stencil_Compute
1: DFEVar dfe_x = io.input("x", hwFloat(11, 53));
2: DFEVar dfe_y0 = 24 * dfe_x - (stream.offset(dfe_x, 1) + stream.offset(dfe_x, -1) + stream.offset(dfe_x, nx) + stream.offset(dfe_x, -nx) + stream.offset(dfe_x, nx * ny) + stream.offset(dfe_x, -nx * ny));
3: DFEVar dfe_y1 = stream.offset(dfe_y0, -1); // by using algorithmic offsetting, the original operations are eliminated
4: io.output("y0", dfe_y0, hwFloat(11, 53));
5: io.output("y1", dfe_y1, hwFloat(11, 53));
(c)

Figure 3. An example of algorithmic offsetting and programming based on MaxCompiler.2 (a) The original Euler code, written in the C programming language. Its corresponding DFE implementation in MaxCompiler is shown in (b) and (c). (b) The host CPU code that transfers x from the CPU to the DFE (line 3) and pipelines the kernel for size cycles (line 4). The signals in the dataflow are called DFEVar variables, each of which is associated with a data type (such as hwFloat(11, 53)). (c) The stream.offset utility automates the construction of the window buffer (line 2) and facilitates the implementation of the algorithmic offsetting method (line 3), so algorithmic offsetting is easier to implement on the DFE than on traditional counterparts. Lines 1, 4, and 5 in (c) show the functions that transfer data to and from the DFE. MaxCompiler provides tools for hardware simulation and resource analysis.

The basic idea of range analysis is to track and record the range of all variables throughout the iteration and to estimate the best data type and width range to represent the variables. Fixed-point data can be used for modules whose variables cover a small range, whereas reduced-precision floating-point data can be used for the remaining modules with a wide variable range. The widths, such as the fractional part for fixed-point and the exponent part for floating-point, can also be estimated based on the range of the variables.

The basic idea of precision analysis is to dynamically trace the influence on the final accuracy of a set of different bit widths and to find the smallest bit width that can still guarantee satisfying accuracy. For example, to determine the mantissa bits, we explore a set of bit widths from 53 down to 24 and observe the trend of the relative error of divergence and the on-chip resource cost accordingly. The relative error of divergence can be used as an important indicator for a quick estimation of accuracy, and it shall stay under 10⁻¹¹ to ensure accuracy.
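Both analyses can be emulated in software. Here is a hedged C sketch (the helper names are ours, not the article's tooling) that records a variable's observed range and rounds values to a trial mantissa width, which reproduces the 53-to-24-bit exploration in spirit:

#include <float.h>
#include <math.h>

/* Range analysis: record the observed range of one variable; a narrow
   range suggests fixed-point, a wide one reduced-precision float. */
struct range { double lo, hi; };      /* init to {DBL_MAX, -DBL_MAX}   */

static void range_record(struct range *r, double v)
{
    if (v < r->lo) r->lo = v;
    if (v > r->hi) r->hi = v;
}

/* Precision analysis: round v to mbits mantissa bits, rerun the
   solver, and check the relative error of divergence against 10^-11. */
static double round_to_mantissa(double v, int mbits)
{
    int e;
    double f = frexp(v, &e);          /* v = f * 2^e, 0.5 <= |f| < 1   */
    double scale = ldexp(1.0, mbits); /* 2^mbits                       */
    return ldexp(nearbyint(f * scale) / scale, e);
}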


Selected Hardware and Implementation

The hybrid CPU-DFE node contains two six-core E5650 CPUs (hyperthreading enabled) and eight DFE accelerator cards. Each DFE card has one Altera Stratix V D8 FPGA chip and up to 48 Gbytes of on-board memory. There are also 6 Mbytes of on-chip BRAM that can provide a bandwidth of 14 Tbytes per second. The DFEs are connected with the CPU through PCI Express 2.0.

The channel is first divided into eight subdomains, and the hybrid domain decomposition is applied to each subdomain. Each DFE thus processes only the inner-part stencil computing while, simultaneously, the CPU processes the outer-part stencil computing and the halo updating. No direct communication is thereby required between DFE cards. The halo exchange is done automatically by using the neighboring communication functions from the PETSc (Portable, Extensible Toolkit for Scientific Computation) framework.3

For the inner-part stencil computation, we applied the memory and algorithm optimizations to reduce the total number of computations to 956 (see Table 1). We then applied the precision optimization to reduce the resource requirement by roughly 30 percent, to an amount (see Table 2) that fits the whole Euler algorithm onto the selected FPGA. The relative error of divergence increases from 10⁻¹⁴ to 10⁻¹², but it still stays under 10⁻¹¹, ensuring the accuracy of the numerical simulation, which is validated in previous work.4

Table 1. Number of floating-point operations per sweep.

Operations     | +, −  | ×   | ÷   | Pow, Sqrt | OFFSET
Original       | 1,192 | 697 | 170 | 48        | 132*
ALG offsetting | 619   | 549 | 76  | 21        | 30* + 140†
Look-up table  | 424   | 460 | 51  | 21        | 30* + 140†

*OFFSET operations on original input streams.
†New OFFSET operations generated after using ALG offsetting.

Table 2. Resource usage of the mixed-precision algorithm.

Resource        | Flip-flops | Multipliers (18 × 18) | DSP blocks | Block memory
Stratix V D8    | 524,800    | 3,926                 | 1,963      | 2,567
Mixed precision | 382,599    | 652                   | 326        | 1,041

Multicore and Many-Core Designs

In this section, we explain our design details for the reference counterparts targeting the same Euler code. Table 3 shows the system specifications. A lot of effort, including the algorithmic optimizations proposed earlier, went into the reference designs for fair comparisons.

Hybrid CPU-MIC Design

The hybrid CPU-MIC node contains two 12-core Intel E5-2697 (Ivy Bridge) CPUs and three 60-core Intel Xeon Phi 5120d (MIC) cards (see Table 3). The same hybrid methodology we proposed earlier is also applied here to fully utilize both CPU and MIC resources, so that the MIC needs only to process the inner area of each subdomain, with the CPU cores processing the outer area simultaneously. Both the CPU and MIC programs are fully optimized through OpenMP multithreading and vectorization to improve parallelism at different levels, and through cache blocking to improve data reuse. Other efforts include loop splitting, prefetching, and the array-of-structures to structure-of-arrays transformation (sketched below).

The fully optimized version based on two 12-core Intel E5-2697 CPUs is the CPU version used for later performance comparison.
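As one concrete illustration of the array-of-structures to structure-of-arrays transformation mentioned above, consider this hedged C sketch (the field names follow the Euler variables in Equation 2 but are our own):

/* AoS: the five Euler variables of one element sit together, so a
   vectorized loop over elements reads each field with stride 5. */
struct cell_aos { double rho, ru, rv, rw, rtheta; };

/* SoA: each field is a contiguous array, giving unit-stride,
   vectorizable loads on both the CPU and the MIC. */
struct fields_soa { double *rho, *ru, *rv, *rw, *rtheta; };

void aos_to_soa(const struct cell_aos *in, struct fields_soa *out, long n)
{
    #pragma omp parallel for          /* OpenMP multithreading, as in */
    for (long i = 0; i < n; i++) {    /* the optimized reference code */
        out->rho[i]    = in[i].rho;
        out->ru[i]     = in[i].ru;
        out->rv[i]     = in[i].rv;
        out->rw[i]     = in[i].rw;
        out->rtheta[i] = in[i].rtheta;
    }
}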

Hybrid CPU-GPU Design

The hybrid CPU-GPU node contains two 12-core Intel E5-2697 (Ivy Bridge) CPUs and two Tesla K40 GPUs (see Table 3). Besides the hybrid domain decomposition methodology and general tuning techniques such as multithreading, shared memory, coalesced access, and kernel splitting, we further propose a set of novel GPU tuning techniques specifically designed for the Euler algorithm, including a customizable data caching mechanism and a thread warp rescheduling scheme.

Table 3. Multicore and many-core system specifications.

Architecture  | Clock (GHz) | Peak (Tflops, float/double) | Memory (Gbytes) | Bandwidth (Gbytes/s, theoretical/measured) | Year*
CPU (E5-2697) | 2.70        | 0.52/0.26                   | 64              | 100/60                                     | 2013
MIC (5120d)   | 1.09        | 2.03/1.01                   | 8               | 352/159                                    | 2013
GPU (K40)     | 0.75        | 4.29/1.43                   | 12              | 288/100                                    | 2013
CPU (SW26010) | 1.45        | 3.06/3.06                   | 32              | 130/103                                    | 2016

*Selected processors were announced close to the year when the selected FPGA was announced, except for the SW26010.

Table 4. Performance and power efficiency. (Mesh size is 260 × 240 × 228.)

Platform*    | Performance (Gflops) | Speedup (×) | Power (W) | Efficiency (performance/W) | Power efficiency (×)
CPU node     | 85                   | 1           | 427       | 0.20                       | 1
CPU-MIC node | 255                  | 3           | 815       | 0.31                       | 1.6
CPU-GPU node | 220                  | 2.6         | 625       | 0.35                       | 1.8
SW26010 CPU  | 520                  | 6           | 380       | 1.36                       | 6.8
CPU-DFE node | 1,570                | 18          | 950       | 1.66                       | 8.3

*Details of the different platforms can be found in the "Multicore and Many-Core Designs" and "Selected Hardware and Implementation" sections.
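The derived columns in Table 4 follow directly from the measured ones; for example, for the CPU-DFE node relative to the CPU node:

\[
\text{speedup} = \frac{1{,}570\ \text{Gflops}}{85\ \text{Gflops}} \approx 18,
\qquad
\frac{1{,}570/950}{85/427} \approx \frac{1.66\ \text{Gflops/W}}{0.20\ \text{Gflops/W}} \approx 8.3 .
\]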

Remarkable performance boosts are achieved on mainstream GPUs, including the Tesla Fermi C2050, K20x, K40, and K80. A speedup of 31.64 is obtained on the Tesla K80 over a 12-core E5-2697 CPU.

SW26010 Many-Core CPU

The SW26010 CPU6 is the many-core processor used to construct the Sunway TaihuLight, the world's most powerful supercomputer. Each SW26010 contains four core groups, and each core group contains 65 cores: 1 management processing element (MPE) plus 64 computing processing elements (CPEs). CPE cores within a core group can communicate with each other through low-latency register communication (P2P and collective communications). The MPE has 32 Kbytes of L1 cache and 256 Kbytes of L2 cache, whereas each CPE has 64 Kbytes of scratchpad memory.

As for the optimization of the Euler algorithm, on the parallelism level, the algorithm is scaled first among the 64 CPE cores and then among the vector units inside each CPE core. On the memory level, a customized data-sharing scheme through register communication and an on-the-fly array transposition are also proposed; the basic aim is to improve bandwidth utilization and decrease the bandwidth requirement. A speedup of more than 100 times is achieved over the MPE-only version. Optimization details can be found in our previous work.5

Performance Evaluation

This section presents the performance results and comparisons, with corresponding analysis.

Performance and Power Efficiency

Table 4 summarizes the performance and power efficiency. The power consumption is measured with a power meter, and the FPGA runs at a frequency of 180 MHz. The dataflow design based on a hybrid CPU-DFE node outperforms traditional multicore and many-core counterparts in both time to solution and energy to solution. The achieved performance on a hybrid CPU-FPGA node is 18 times faster and 8 times more power efficient than a CPU node, 6 times faster and 5 times more power efficient than a CPU-MIC node, 7 times faster and 4.7 times more power efficient than a CPU-GPU node, and 3 times faster and 1.2 times more power efficient than a Sunway SW26010 CPU.

Discussion

The customizable on-chip hardware resources, memory, and data representations offer significant optimizing space to overcome major constraints confronting multicore and many-core architectures.

For example, compared with the inflexible data representation of traditional HPC platforms, the mixed-precision arithmetic enabled by flexible data formats and precisions can greatly improve the computing efficiency of the logic units and decrease the consumption of on-chip resources. Algorithmic offsetting based on the streaming model can eliminate a large number of redundant operations. We thereby greatly decrease the resource demands of the Euler algorithm and manage to build a long streaming pipeline that can perform 956 mixed-precision operations per cycle. The customizable window buffer provides better caching behavior, whereas the on-chip fast memory decreases both the number of operations and the data transfers. Therefore, we can push data closer to the computing side and break the memory-wall restriction.

Moreover, the FPGA chip's low clock frequency leads to better power efficiency compared with other platforms. These reasons contribute to the inspiring results achieved in this work. Note that the selected multicore and many-core processors were announced close to the year when the selected FPGA was announced, so we can compare different systems from the same era.

As for the programming model, new methods such as OpenCL and MaxCompiler can support high-level programming and a user-friendly development environment, so the programming complexity for reconfigurable hardware is reduced to a level similar to that of software programming.

Compared with traditional multicore and many-core counterparts, the FPGA-based dataflow architecture provides a novel computing methodology by mapping parallelism onto the increased number of transistors, and it can greatly improve performance and power efficiency. With reconfigurable hardware becoming a component of future computing systems, we see it as a promising candidate for providing highly efficient and green HPC solutions in the post-Moore era. Future work will include using more powerful FPGAs and testing the scalability across multiple nodes.

Acknowledgments

This work was supported in part by the National Key R&D Program of China (grant nos. 2016YFA0602200 and 2016YFA0602103), the National Natural Science Foundation of China (grant nos. 41374113 and 91530323), the National High-Tech R&D (863) Program of China (grant no. 2015AA01A302), the China Postdoctoral Science Foundation (grant no. 2016M601031), the Tsinghua University Initiative Scientific Research Program (grant no. 20131089356), the Major Science and Technology Foundation Program of the Ministry of Education (special support for NSCC-Guangzhou, Sun Yat-sen University), the China Special Fund for Meteorological Research in the Public Interest (grant no. GYHY201306062), the Key Laboratory of Data Analysis and Application of the State Oceanic Administration of China, the European Union Horizon 2020 Research and Innovation Programme (grant no. 671653), the UK EPSRC (EP/I012036/1, EP/L00058X/1, EP/L016796/1, and EP/N031768/1), the Maxeler University Programme, and the Intel Programmable Solutions Group.

References

1. H. Sharma et al., "From High-Level Deep Neural Models to FPGAs," Proc. 49th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2016, pp. 1–12.
2. O. Pell and V. Averbukh, "Maximum Performance Computing with Dataflow Engines," Computing in Science & Eng., vol. 14, no. 4, 2012, pp. 98–103.
3. C. Yang and X.-C. Cai, "A Scalable Fully Implicit Compressible Euler Solver for Mesoscale Nonhydrostatic Simulation of Atmospheric Flows," SIAM J. Scientific Computing, vol. 36, no. 5, 2014, pp. S23–S47.
4. L. Gan et al., "A Highly-Efficient and Green Data Flow Engine for Solving Euler Atmospheric Equations," Proc. 24th Int'l Conf. Field Programmable Logic and Applications, 2014, pp. 1–6.
5. C. Yang et al., "10M-Core Scalable Fully-Implicit Solver for Nonhydrostatic Atmospheric Dynamics," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis, 2016, p. 6.
6. H. Fu et al., "The Sunway TaihuLight Supercomputer: System and Applications," Science China Information Sciences, vol. 59, no. 7, 2016; doi:10.1007/s11432-016-5588-7.
7. M.C. Smith, J.S. Vetter, and X. Liang, "Accelerating Scientific Applications with the SRC-6 Reconfigurable Computer: Methodologies and Analysis," Proc. 19th IEEE Int'l Parallel and Distributed Processing Symp., 2005; doi:10.1109/IPDPS.2005.75.
8. D. Oriato et al., "Acceleration of a Meteorological Limited Area Model with Dataflow Engines," Proc. Symp. Application Accelerators in High Performance Computing, 2012, pp. 129–132.
9. K. Sano, Y. Hatsuda, and S. Yamamoto, "Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth," IEEE Trans. Parallel and Distributed Systems, vol. 25, no. 3, 2014, pp. 695–705.
10. X. Niu et al., "Exploiting Run-time Reconfiguration in Stencil Computation," Proc. 22nd Int'l Conf. Field Programmable Logic and Applications, 2012, pp. 173–180.
11. H.M. Waidyasooriya et al., "OpenCL-Based FPGA-Platform for Stencil Computation and Its Optimization Methodology," IEEE Trans. Parallel and Distributed Systems, vol. 28, no. 5, 2017, pp. 1390–1402.
12. K. Okina et al., "Power Performance Profiling of 3D Stencil Computation on an FPGA Accelerator for Efficient Pipeline Optimization," ACM SIGARCH Computer Architecture News, vol. 43, no. 4, 2016, pp. 9–14.

Lin Gan is a postdoctoral research fellow in the Department of Computer Science and Technology at Tsinghua University and the assistant director of the National Supercomputing Center in Wuxi. His research interests include high-performance computing solutions to geoscience applications based on hybrid platforms such as CPUs, FPGAs, and GPUs. Gan received a PhD in computer science from Tsinghua University. He has received the 2016 ACM Gordon Bell Prize, the Tsinghua-Inspur Computational Geosciences Youth Talent Award, and the FPL Significant Paper Award. He is a member of IEEE. Contact him at [email protected].

Haohuan Fu is an associate professor in the Ministry of Education Key Laboratory for Earth System Modeling and in the Department of Earth System Science at Tsinghua University, and the deputy director of the National Supercomputing Center in Wuxi. He is an affiliated researcher in the Laboratory for Regional Oceanography and Numerical Modeling, Qingdao National Laboratory for Marine Science and Technology. His research interests include high-performance computing in earth and environmental sciences, computer architectures, performance optimizations, and programming tools in parallel computing. Fu received a PhD in computing from Imperial College London. He has received the 2016 ACM Gordon Bell Prize, the Tsinghua-Inspur Computational Geosciences Youth Talent Award, and the FPL Significant Paper Award. He is a member of IEEE. He is the corresponding author of this article. Contact him at [email protected].


Wayne Luk is a professor of computer engineering at Imperial College London and the director of the EPSRC Centre for Doctoral Training in High Performance Embedded and Distributed Systems. His research focuses on the theory and practice of customizing hardware and software for specific application domains, such as genomic data analysis, climate modeling, and computational finance. Luk received a doctorate in engineering and computing science from the University of Oxford. He is a Fellow of the Royal Academy of Engineering, IEEE, and the British Computer Society. Contact him at [email protected].

Chao Yang is a professor in and vice director of the Laboratory of Parallel Software and Computational Sciences, Institute of Software, Chinese Academy of Sciences. His research interests include numerical analysis and modeling, large-scale scientific computing, and parallel numerical software. Yang received a PhD in computer science from the Institute of Software, Chinese Academy of Sciences. He has received the 2016 ACM Gordon Bell Prize. He is a member of IEEE and ACM. Contact him at [email protected].

Wei Xue is an associate professor in the Department of Computer Science and Technology at Tsinghua University. His research interests include scientific computing and uncertainty quantification. Xue received a PhD in electrical engineering from Tsinghua University. He has received the 2016 ACM Gordon Bell Prize and the Tsinghua-Inspur Computational Geosciences Youth Talent Award. He is a member of IEEE. Contact him at [email protected].

Guangwen Yang is a professor in the Department of Computer Science and Technology at Tsinghua University and the director of the National Supercomputing Center in Wuxi. His research interests include parallel algorithms, cloud computing, and the earth system model. Yang received a PhD in computer science from Tsinghua University. He has received the 2016 ACM Gordon Bell Prize. He is a member of IEEE. Contact him at [email protected].
