kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos

Keichi Takahashi, Wassapon Watanakeesuntorn, Kohei Ichikawa (Nara Institute of Science and Technology, Nara, Japan); Joseph Park (U.S. Department of the Interior, Homestead, Florida, USA); Ryousei Takano, Jason Haga (National Institute of Advanced Industrial Science and Technology, Tsukuba, Japan); George Sugihara (University of California San Diego, La Jolla, California, USA); Gerald M. Pao (Salk Institute for Biological Studies, La Jolla, California, USA)

ABSTRACT
Empirical Dynamic Modeling (EDM) is a state-of-the-art non-linear time-series analysis framework. Despite its wide applicability, EDM was not scalable to large datasets due to its expensive computational cost. To overcome this obstacle, researchers have attempted and succeeded in accelerating EDM from both algorithmic and implementational aspects. In previous work, we developed a massively parallel implementation of EDM targeting HPC systems (mpEDM). However, mpEDM maintains different backends for different architectures. This design becomes a burden when porting to new hardware in an increasingly diverse HPC landscape. In this paper, we design and develop a performance-portable implementation of EDM based on the Kokkos performance portability framework (kEDM), which runs on both CPUs and GPUs from a single codebase. Furthermore, we optimize individual kernels specifically for EDM computation, and use real-world datasets to demonstrate up to 5.5× speedup over mpEDM in convergent cross mapping computation.

CCS CONCEPTS
• Computing methodologies → Parallel computing methodologies; • General and reference → Performance; • Applied computing → Mathematics and statistics.

KEYWORDS
Empirical Dynamic Modeling, Performance Portability, Kokkos, GPU, High Performance Computing

ACM Reference Format:
Keichi Takahashi, Wassapon Watanakeesuntorn, Kohei Ichikawa, Joseph Park, Ryousei Takano, Jason Haga, George Sugihara, and Gerald M. Pao. 2021. kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos. In Practice and Experience in Advanced Research Computing (PEARC '21), July 18–22, 2021, Boston, MA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3437359.3465571

1 INTRODUCTION
Empirical Dynamic Modeling (EDM) [2] is a state-of-the-art non-linear time-series analysis framework used for various tasks such as assessing the non-linearity of a system, making short-term forecasts, and identifying the existence and strength of causal relationships between variables. Despite its wide applicability, EDM was not scalable to large datasets due to its expensive computational cost. To overcome this challenge, several studies have been conducted to accelerate EDM by improving the algorithm [10] and taking advantage of parallel and distributed computing [14].

We tackle this challenge by taking advantage of modern HPC systems equipped with multi-core CPUs and GPUs. We have been developing a massively parallel distributed implementation of EDM optimized for GPU-centric HPC systems, which we refer to as mpEDM. In our previous work [22], we deployed mpEDM on the AI Bridging Cloud Infrastructure (ABCI)¹ to obtain an all-to-all causal relationship map of all 10^5 neurons in an entire larval zebrafish brain. To date, this is the first causal analysis of a whole vertebrate brain at single neuron resolution.

Although mpEDM has successfully enabled EDM computation at an unprecedented scale, challenges remain. The primary challenge is performance portability across diverse hardware platforms. The recent HPC landscape has seen a significant increase in the diversity of processors and accelerators. This is reflected in the design of upcoming exascale HPC systems: for example, the Frontier system at Oak Ridge National Laboratory will use AMD EPYC CPUs and Radeon GPUs, while the Aurora system at Argonne National Laboratory will employ Intel Xeon CPUs and Xe GPUs; additionally, the Fugaku system at RIKEN uses Fujitsu A64FX ARM CPUs.

¹ https://abci.ai


Figure 1: Overview of Convergent Cross Mapping (CCM)

The current design of HPC applications has failed to keep up with this trend of rapidly diversifying HPC systems. Computational kernels are developed with the native programming model for the respective hardware (e.g., CUDA on NVIDIA GPUs). mpEDM is no exception and maintains two completely independent backends for x86_64 CPUs and NVIDIA GPUs. However, this design becomes a burden when supporting a diverse set of hardware platforms because a new backend needs to be developed and maintained for every platform. Based on this trend, various performance portability frameworks [3, 4] have emerged to develop performance-portable applications from a single codebase.

In this paper, we use the Kokkos [1] framework and develop a performance-portable implementation of EDM that runs on both CPUs and GPUs, herein referred to as kEDM. This new implementation is based on a single-source design and facilitates future development and porting to new hardware. Furthermore, we identify and take advantage of optimization opportunities in mpEDM and achieve up to 5.5× higher performance on NVIDIA V100 GPUs.

The rest of this paper is organized as follows: Section 2 first introduces EDM briefly and discusses the challenges in mpEDM; Section 3 presents kEDM, a novel implementation of EDM based on the Kokkos performance portability framework; Section 4 compares kEDM and mpEDM using both synthetic and real-world datasets and assesses the efficiency of kEDM; and finally Section 5 concludes the paper and discusses future work.

2 BACKGROUND

2.1 Empirical Dynamic Modeling (EDM)
Empirical Dynamic Modeling (EDM) [2, 25] is a non-linear time series analysis framework based on Takens' embedding theorem [6, 19]. Takens' theorem states that given a time-series observation of a deterministic dynamical system, one can reconstruct the latent attractor manifold of the dynamical system using time-delayed embeddings of the observation. While the reconstructed manifold might not preserve the global structure of the original manifold, it preserves the local topological features (i.e., it is a diffeomorphism).

Convergent Cross Mapping (CCM) [13, 18, 21] is one of the widely used EDM methods that identifies and quantifies the causal relationship between two time series variables. Figure 1 illustrates the overview of CCM. To assess whether a time series Y(t) (hereinafter called the target) causes another time series X(t) (hereinafter called the library), CCM performs the following four steps:

(1) Embedding: Both time series X and Y are embedded into an E-dimensional state space using their time lags. For example, the embedding of X is denoted by x, where x(t) = (X(t), X(t − τ), ..., X(t − (E − 1)τ)). Here, τ is the time lag and E is the embedding dimension, which is empirically determined and usually E < 20 in real-world datasets.

(2) k-Nearest Neighbor Search: For every library point x(t), its E + 1 nearest neighbors in the state space are searched. These neighbors form an E-dimensional simplex that encloses x(t) in the state space. We refer to these nearest neighbors as x(t1), x(t2), ..., x(tE+1), and to the Euclidean distance between x(t) and x(ti) as d(t, ti) = ‖x(t) − x(ti)‖.

(3) Lookup: The prediction ŷ(t) for a target point y(t) is a linear combination of its neighbors y(t1), y(t2), ..., y(tE+1). Specifically,

    \hat{y}(t) = \sum_{i=1}^{E+1} \frac{w_i}{\sum_{j=1}^{E+1} w_j} \, y(t_i)

where

    w_i = \exp\!\left( -\frac{d(t, t_i)}{\min_{1 \le j \le E+1} d(t, t_j)} \right).

The prediction Ŷ(t) for Y(t) is made by extracting the first component of ŷ(t).

(4) Assessment of Prediction: Pearson's correlation ρ between the target time series Y and the predicted time series Ŷ is computed to assess the predictive skill. If ρ is high, we conclude that Y "CCM-causes" X.

These four steps are repeated for all pairs of time series when performing pairwise CCM between multiple time series. Of these steps, the k-nearest neighbor search and the lookup consume significant runtime and need to be optimized. The k-nearest neighbor search and lookup in the state space are fundamental operations in EDM and generally dominate the runtime in other EDM algorithms as well. We showed in our previous work [22] that one can precompute the nearest neighbors of every point in x (all k-NN search) and store them as a lookup table of distances and indices. This table can then be used to make predictions for all target time series. This approach reduces the number of k-NN searches and provides significant speedup.

2.2 Challenges in mpEDM
This subsection discusses the two major challenges in mpEDM², our previous parallel implementation of EDM.

2.2.1 Performance portability across diverse hardware. The GPU backend of mpEDM was based on ArrayFire [11], a general-purpose library for GPU computing. We chose ArrayFire primarily for its productivity and portability. ArrayFire provides CUDA and OpenCL backends to run on NVIDIA GPUs and OpenCL devices. Although it also provides a CPU backend, most of the functions provided by the CPU backend are neither multi-threaded nor vectorized. We therefore developed our own implementation for CPUs using OpenMP.

As a result, mpEDM had an ArrayFire-based GPU implementation and an OpenMP-based CPU implementation of EDM, which doubled the maintenance cost. Considering the diversifying target hardware, a unified implementation that achieves consistent and reasonable performance across a diverse set of hardware is required.

2.2.2 Kernels tailored for EDM. Since mpEDM relied on ArrayFire's k-nearest neighbor search function nearestNeighbour(), we were unable to modify or tune the k-NN search to suit our use case and missed opportunities for further optimization. ArrayFire's nearest neighbor function accepts arrays of reference and query points, and returns arrays of the closest reference points and their distances for every query point. This interface is simple and easy to use; however, when applied to EDM, the time-delayed embedding needs to be performed on the CPU in advance and then passed on to ArrayFire. This hinders performance because it increases the amount of memory copies between the CPU and the GPU and memory reads from GPU memory.

Another potential optimization opportunity is the partial sort function topk() invoked in the k-NN search. ArrayFire uses NVIDIA's CUB³ library to implement partial sort. CUB is a collection of highly optimized parallel primitives and is being used by other popular libraries such as Thrust⁴. ArrayFire's topk() function divides the input array into equal-sized sub-arrays and then calls CUB's parallel radix sort function to sort each sub-array. It then extracts the top-k elements from each sub-array and concatenates them into a new array. This is recursively repeated until the global top-k elements are found. Even though this implementation is well-optimized, it may not be optimal for EDM because the target k is relatively small (≤ 20).

Finally, we were unable to implement efficient lookups on GPUs using ArrayFire. ArrayFire provides a construct for embarrassingly parallel computation called GFOR that allows one to perform independent for-loops in parallel. Although we were able to implement lookups using GFOR, the attained performance was poor. Managing memory consumption and memory copies was also challenging because ArrayFire implicitly allocates, deallocates and copies arrays.

3 KEDM

3.1 Overview
kEDM⁵ is our performance-portable implementation of EDM based on the Kokkos framework. We retain the high-level design of mpEDM, but reimplement the whole application using Kokkos and optimize the bottleneck kernels (i.e., the all k-nearest neighbor search and the lookup). To ensure the correctness of the implementation, outputs from kEDM are validated against mpEDM as well as the reference implementation of EDM (cppEDM⁶) using automated unit tests.

Prior to implementing kEDM, we examined a number of popular performance portability frameworks, including OpenMP, OpenACC, OpenCL and SYCL. We chose Kokkos because recent studies [3, 4, 12] have shown that it delivers portable performance on a large set of devices compared to its alternatives. In addition, it has already been adopted by multiple production applications [5, 8, 17]. SYCL and OpenMP are certainly attractive choices as they have grown rapidly over the last few years in terms of hardware coverage and delivered performance, but we still consider them immature compared to Kokkos at the time of writing. Therefore, we chose Kokkos to implement kEDM.

3.2 Kokkos
Kokkos [1] is a performance portability framework developed at Sandia National Laboratories. The aim of Kokkos is to abstract away the differences between low-level programming models such as OpenMP, CUDA and HIP, and to expose a high-level C++ programming interface to the developer. Kokkos allows developers to build cross-platform and performance-portable applications from a single codebase.

² https://github.com/keichi/mpEDM
³ https://nvlabs.github.io/cub
⁴ https://github.com/NVIDIA/thrust
⁵ https://github.com/keichi/kEDM
⁶ https://github.com/SugiharaLab/cppEDM

Listing 1: Basic data parallel loop

    Kokkos::parallel_for(RangePolicy(N), KOKKOS_LAMBDA(int i) {
        y(i) = a * x(i) + y(i);
    });

Listing 2: Hierarchical data parallel loop

    parallel_for(TeamPolicy(M, AUTO),
                 KOKKOS_LAMBDA(const member_type &member) {
        int i = member.league_rank();
        float sum = 0.0f;

        parallel_reduce(TeamThreadRange(member, N),
                        [=](int j, float &tmp) {
            tmp += A(i, j) * x(j);
        }, sum);

        single(PerTeam(member), [=]() {
            y(i) = sum;
        });
    });

To parallelize a loop in the Kokkos programming model, the developer specifies (1) the parallel pattern, (2) the computational body, and (3) the execution policy of the loop. Available parallel patterns include parallel-for, parallel-reduce and parallel-scan. The computational body of a loop is given as a lambda function.

The execution policy defines how a loop is executed. For example, RangePolicy defines a simple 1D range of indices. TeamPolicy and TeamThreadRange are used to launch teams of threads to exploit the hierarchical parallelism of the hardware. For example, on a GPU, teams and threads map to thread blocks and threads within thread blocks, respectively. On a CPU, teams map to physical cores and threads map to hardware threads within cores. TeamPolicy gives access to team-private and thread-private scratch memory, an abstraction of shared memory on GPUs. Each execution policy is bound to an execution space, an abstraction of where the code runs. The latest release of Kokkos supports Serial, OpenMP, OpenMP Target, CUDA, HIP, Pthread, HPX and SYCL as execution spaces.

Views are fundamental data types in Kokkos that represent homogeneous multidimensional arrays. Views are explicitly allocated by the user and deallocated automatically by Kokkos using reference counting. Each view is associated with a memory space, an abstraction of where the data resides.

Listing 1 shows a vector addition kernel implemented in Kokkos. In this example, a parallel-for loop is launched that iterates over the 1D range [1, N]. Listing 2 shows a matrix-vector multiplication kernel leveraging hierarchical parallelism. The outer parallel-for loop launches M teams that each computes one row of the output vector y. The inner parallel-reduce computes the dot product between one row of A and x.

3.3 All k-Nearest Neighbor Search
We implement the all k-NN search using an exhaustive approach similar to [7]. That is, we first calculate the pairwise distances between all embedded library points in the state space and obtain a pairwise distance matrix. Subsequently, each row of the obtained distance matrix is partially sorted to find the distances and indices of the top-k closest points.

3.3.1 Pairwise distances. As discussed in Section 2.2, storing the time-delayed embeddings in memory and then calculating the pairwise distances is inefficient since it increases memory access. Instead, we perform the time-delayed embedding and the distance calculation simultaneously.

Algorithm 1 shows the pairwise distance calculation algorithm in kEDM. Using Kokkos' TeamPolicy and TeamThreadRange, we map the outer-most i-loop to thread teams and the j-loop to threads within a team. A consideration on CPUs is which loop to vectorize. Since the inner-most k-loop is short (E ≤ 20) in our use case, vectorizing it is not profitable. We therefore use Kokkos' SIMD types⁷ to vectorize the j-loop. SIMD types are short vectors with overloaded operators that map to intrinsic functions. SIMD types are mapped to scalars on GPUs and do not impose any overhead. Note that the library time series x is reused if E > 1, and we can expect more reuse with larger E. In addition, we explicitly cache x(kτ + i) (where k = [1, E]) in team-local scratch memory to speed up memory access.

⁷ https://github.com/Kokkos/simd-math

Algorithm 1: Pairwise distances
Input: Library time series x of length L
Output: L × L pairwise distance matrix D

    parallel for i ← 1 to L do                       // TeamPolicy
        parallel for j ← 1 to L do                   // TeamThreadRange
            D(i, j) ← 0
            for k ← 1 to E do
                D(i, j) ← D(i, j) + (x(kτ + i) − x(kτ + j))²
            end
        end
    end
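For concreteness, the following is a minimal Kokkos sketch of Algorithm 1, not the verbatim kEDM source: it fuses the time-delayed embedding with the distance calculation and stages the reused lags of x in team-local scratch memory, as described above. Indices are 0-based, the SIMD types are omitted, and the function name calc_pairwise_distances is ours.

    #include <Kokkos_Core.hpp>

    // Hedged sketch of Algorithm 1: one thread team per row i of D, one thread
    // per column j. The embedding is never materialized in memory; each team
    // caches the E lagged samples of x it reuses in scratch memory (shared
    // memory on GPUs, a per-core buffer on CPUs).
    void calc_pairwise_distances(Kokkos::View<const float *> x,
                                 Kokkos::View<float **> D, int E, int tau)
    {
        using policy_type = Kokkos::TeamPolicy<>;
        using member_type = policy_type::member_type;
        using scratch_view =
            Kokkos::View<float *,
                         Kokkos::DefaultExecutionSpace::scratch_memory_space,
                         Kokkos::MemoryTraits<Kokkos::Unmanaged>>;

        // Number of embeddable points (points with all E lags available)
        const int n = static_cast<int>(x.extent(0)) - (E - 1) * tau;
        const size_t scratch_bytes = scratch_view::shmem_size(E);

        Kokkos::parallel_for(
            "pairwise_distances",
            policy_type(n, Kokkos::AUTO)
                .set_scratch_size(0, Kokkos::PerTeam(scratch_bytes)),
            KOKKOS_LAMBDA(const member_type &member) {
                const int i = member.league_rank();

                // Cooperatively load x(i), x(i + tau), ..., x(i + (E-1)*tau)
                scratch_view x_i(member.team_scratch(0), E);
                Kokkos::parallel_for(Kokkos::TeamThreadRange(member, E),
                                     [=](int k) { x_i(k) = x(i + k * tau); });
                member.team_barrier();

                Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(member, n), [=](int j) {
                        float dist = 0.0f;
                        for (int k = 0; k < E; k++) {
                            const float diff = x_i(k) - x(j + k * tau);
                            dist += diff * diff;
                        }
                        D(i, j) = dist;  // squared Euclidean distance
                    });
            });
    }

Because the same source compiles for the OpenMP and CUDA execution spaces, the scratch view maps to CUDA shared memory on GPUs and to an ordinary per-team buffer on CPUs without any code change.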

3.3.2 Top-k search. The top-k search kernel is particularly challenging to implement in a performance-portable manner because state-of-the-art top-k search algorithms [9, 16] are usually optimized for specific hardware. Thus, we designed and implemented a top-k search algorithm that works efficiently on both CPUs and GPUs.

Algorithm 2 outlines our top-k search algorithm. In our algorithm, each thread team finds the top-k elements from one row of the distance matrix. Each thread within a thread team maintains a local priority queue in team-private scratch memory that holds the top-k elements it has seen so far. Threads read the distance matrix in a coalesced manner and push the distances and indices to their local priority queues. Once all elements are processed, one leader thread in each thread team merges the local queues within the team and writes the final top-k elements to global memory.

Algorithm 2: Partial sort
Input: L × L pairwise distance matrix D
Output: L × k top-k distance matrix D_k and index matrix I_k

    parallel for i ← 1 to L do                       // TeamPolicy
        parallel for j ← 1 to L do                   // TeamThreadRange
            Insert ⟨D(i, j), j⟩ into the thread-local priority queue
        end
        for j ← 1 to k do
            for each thread in the team do
                D_k(i, j), I_k(i, j) ← Pop element from its priority queue
            end
        end
        // Normalize D_k
    end

3.4 Lookup
To increase the degree of parallelism and the reuse of the precomputed distance and index matrices, we implement batched lookups. That is, we perform the lookups for multiple target time series in parallel. CCM requires the library time series to be embedded in the optimal embedding dimension of the target time series. Therefore, we first group the target time series by their optimal embedding dimensions and then perform lookups in parallel for target time series that share the same optimal embedding dimension.

Algorithm 3 shows our lookup algorithm. The outer-most i-loop iterates over all time series whose optimal embedding dimension is E. The loop is parallelized using TeamPolicy, where each team performs the prediction of one target time series. The j-loop is parallelized using TeamThreadRange, where each thread makes a prediction for one time point within a time series. The inner-most serial k-loop is unrolled to increase instruction-level parallelism. Since Kokkos currently lacks a feature to indicate loop unrolling, we use the #pragma unroll directive here. The loop body requires indirect access to the target time series using the indices table. To speed up random memory access, we cache the target time series in team-private scratch memory.

Algorithm 3: Lookup
Input: Array of target time series y, top-k distance matrix D_k and index matrix I_k computed from the library time series
Output: Array of predicted time series ŷ

    parallel for i ← 1 to N do                       // TeamPolicy
        parallel for j ← 1 to L do                   // TeamThreadRange
            ŷ(i, j) ← 0
            for k ← 1 to E + 1 do
                ŷ(i, j) ← ŷ(i, j) + D_k(j, k) · y(i, I_k(j, k))
            end
        end
    end
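As a companion to Algorithm 3, here is a condensed Kokkos sketch of the batched lookup for one group of targets sharing the same E (again, not the verbatim kEDM kernel): it assumes Dk already holds the normalized simplex weights produced by the all k-NN search, uses 0-based indices, treats y as an N × L array of target series, and omits the scratch-memory caching and #pragma unroll mentioned above.

    #include <Kokkos_Core.hpp>

    // Hedged sketch of Algorithm 3 (batched lookup): one team per target time
    // series, one thread per predicted time point. Dk holds normalized simplex
    // weights and Ik the neighbor indices from the all k-NN search.
    void lookup(Kokkos::View<const float **> y,   // N x L target time series
                Kokkos::View<const float **> Dk,  // L x (E+1) weights
                Kokkos::View<const int **> Ik,    // L x (E+1) neighbor indices
                Kokkos::View<float **> yhat,      // N x L predictions
                int N, int L, int E)
    {
        using member_type = Kokkos::TeamPolicy<>::member_type;

        Kokkos::parallel_for(
            "lookup", Kokkos::TeamPolicy<>(N, Kokkos::AUTO),
            KOKKOS_LAMBDA(const member_type &member) {
                const int i = member.league_rank();  // target time series index

                Kokkos::parallel_for(
                    Kokkos::TeamThreadRange(member, L), [=](int j) {
                        float pred = 0.0f;
                        for (int k = 0; k < E + 1; k++) {
                            // Weighted average of the E+1 nearest neighbors
                            pred += Dk(j, k) * y(i, Ik(j, k));
                        }
                        yhat(i, j) = pred;
                    });
            });
    }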
In case the raw prediction is not needed but only the assessment of the predictive skill, kEDM does not write the predicted time series out to global memory. Instead, Pearson's correlation is computed on the fly. Kokkos' custom reduction feature is used to implement the parallel calculation of the correlation coefficient. The algorithm is based on a numerically stable algorithm for computing covariance proposed in [15].

4 EVALUATION
In this section, we compare kEDM with mpEDM using micro benchmarks and real-world datasets. Furthermore, we conduct a roofline analysis of kEDM to assess the efficiency of our implementation.

4.1 Evaluation Environment
We evaluated kEDM on two compute servers installed at the Salk Institute: Ika and Aori. Ika is equipped with two sockets of 20-core Intel Xeon Gold 6148 CPUs, one NVIDIA V100 PCIe card and 384 GiB of DDR4 RAM. Aori is equipped with two sockets of 64-core AMD EPYC 7742 CPUs and 1 TiB of DDR4 RAM. kEDM was built with Kokkos 3.2 on both machines. On Aori, we used the AMD Optimizing C/C++ Compiler (AOCC) 2.2.0, a fork of Clang by AMD. On Ika, we used the NVIDIA CUDA Compiler (NVCC) 10.1.

4.2 Micro Benchmarks
We implemented micro benchmarks to measure the performance of the individual kernels described in Section 3 and compare it with that of mpEDM.

Figure 2 shows the runtime of the all k-NN search kernel of kEDM and mpEDM on NVIDIA V100. Here, we generated a synthetic time series with 10^4 time steps and varied the embedding dimension. The results indicate that the pairwise distance calculation in kEDM is significantly faster than in mpEDM (up to 6.6×). This is because the time-delayed embedding is performed during the distance calculation on the GPU. The partial sort is also faster than mpEDM if E is small (6.2× faster if E = 1). However, kEDM's partial sort performance degrades as E increases and slightly underperforms mpEDM when E = 20. Since the local priority queues are stored in shared memory, increasing the capacity of the local queues increases shared memory usage and results in lower multiprocessor occupancy. We confirmed this fact using NVIDIA's Nsight Compute profiler. Figure 3 shows the runtime on AMD EPYC 7742. kEDM exhibits identical performance to mpEDM on EPYC 7742.

Figures 4 and 5 show the runtime of the lookup kernel (without cross correlation calculation) on V100 and EPYC 7742, respectively. Here, we generated 10^5 synthetic target time series each having 10^4 time steps and performed lookups from a library time series of the same length. Note that we only executed kEDM on V100 since mpEDM lacks a lookup kernel for GPUs, as described earlier in Section 2.2.2. These plots indicate that kEDM on V100 consistently outperforms EPYC 7742 by a factor of 3–4×. Interestingly, kEDM slightly outperforms mpEDM on EPYC as well. This might be attributed to the fact that we load the target time series into scratch memory before performing lookups. Even though CPUs do not have user-manageable memory as opposed to GPUs, accessing the target time series might have loaded the time series into cache and improved the cache hit ratio.

Figure 2: Breakdown of all k-NN search runtime on V100 (L = 10^4)

Figure 5: Runtime of lookups on EPYC 7742 (L = 10^4, N = 10^5)

Figure 3: Breakdown of all k-NN search runtime on EPYC 7742 (L = 10^4)

Figure 4: Runtime of lookups on V100 (L = 10^4, N = 10^5)

4.3 Real-world Datasets
We prepared six real-world datasets with diverse numbers and lengths of time series that reflect the variety of use cases. We then measured the runtime of kEDM for completing pairwise CCM calculations.

Table 1 shows the runtime for performing a pairwise CCM on each dataset. Fish1_Normo is a subset of 154 representative neurons of the dominant default zebrafish larvae neuronal behaviors collected by lightsheet microscopy of fish transgenic with nuclear localized GCAMP6f, a calcium indicator. Fly80XY is a Drosophila melanogaster whole brain lightfield microscopy GCAMP6 recording, where distinct brain areas were identified by independent component analysis, with the fly left, right and forward walking speed behaviors collected on a styrofoam ball. Genes_MEF contains the gene expression profiles of all genes and small RNAs from mouse embryo fibroblast genes over 96 time steps of two cycles of serum induction and starvation stimulation. Subject6 and Subject11 are whole brain light sheet microscopy GCAMP6f recordings at whole brain scale and single neuron resolution of larval zebrafish. F1 is a subset of a larval zebrafish biochemically induced seizure recording with three phases: control conditions, pre-seizure and full seizure.

The results clearly demonstrate that kEDM outperforms mpEDM in most cases. In particular, kEDM shows significantly higher (up to 5.5×) performance than mpEDM on NVIDIA Tesla V100 and Intel Xeon Gold 6148. This performance gain is obtained from the optimized k-nearest neighbor search and the GPU-enabled lookup.

4.4 Efficiency
To assess the efficiency of our implementation, we conducted a roofline analysis [23] of our kernels. The compute and memory ceilings on each platform were measured using the Empirical Roofline Toolkit (ERT)⁸ 1.1.0. Since ERT fails to measure the L1 cache bandwidth on GPUs, we used the theoretical peak performance instead. We followed the methodology presented in [24] to measure the arithmetic intensity and the attained FLOP/s. Nvprof 10.1 and likwid [20] 5.0.1 were used to collect the required metrics on the GPU and CPU, respectively.

⁸ https://bitbucket.org/berkeleylab/cs-roofline-toolkit
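As a reminder of the model underlying this analysis (the standard roofline formulation of [23], not anything specific to kEDM), a kernel with arithmetic intensity I running against a memory level with bandwidth B and a compute ceiling P_peak can attain at most

    \[
    P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B\right),
    \qquad
    I = \frac{\text{FLOPs executed}}{\text{bytes moved}}
    \]

A kernel sitting on one of the sloped ceilings in Figures 6–9 is therefore bandwidth-bound at that memory level, while a kernel reaching the horizontal ceiling is compute-bound.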

Table 1: Benchmark of CCM runtime using real-world datasets

Dataset        # of Time Series   # of Time Steps   V100 & Xeon Gold 6148            EPYC 7742
                                                    kEDM      mpEDM     Speedup      kEDM      mpEDM     Speedup
Fish1_Normo    154                1,600             3s        11s       3.67×        3s        4s        1.33×
Fly80XY        82                 10,608            11s       50s       4.55×        22s       30s       1.36×
Genes_MEF      45,318             96                344s      334s      0.97×        96s       139s      1.45×
Subject6       92,538             3,780             5,391s    29,579s   5.49×        12,145s   11,571s   0.95×
Subject11      101,729            8,528             20,517s   85,217s   4.15×        43,812s   38,542s   0.88×
F1             8,520              29,484            11,372s   20,128s   1.77×        23,001s   19,950s   0.87×

Figure 6: Roofline analysis of pairwise distance kernel on V100 (L = 10^4)

Figure 7: Roofline analysis of pairwise distance kernel on EPYC 7742 (L = 10^4)

We used an artificial dataset containing 10^5 time series each having 10^4 time steps. This scale reflects our largest use case.

Figures 6 and 7 depict the hierarchical roofline models for the pairwise distance kernel on V100 and EPYC 7742, respectively. As expected, the arithmetic intensity of the pairwise distance kernel increases with the embedding dimension since the reuse of the time series improves. On V100, the kernel was bounded by HBM bandwidth when E = 1. However, the kernel was not able to reach the rooflines as E increases. We found out using the NVIDIA Nsight profiler that the load/store units were saturated because of excessive memory transactions. A common technique to reduce the number of memory transactions is to use vectorized loads and stores; however, it is not applicable here because the memory alignment depends on the user-supplied parameters E and τ. On EPYC 7742, the kernel initially hits the L3 cache roofline and is then bounded by L1 and L2 cache bandwidth.

Figure 8: Roofline analysis of lookup kernel on V100 (L = 10^4, N = 10^5)

Figures 8 and 9 present the rooflines for the lookup kernel on V100 and EPYC 7742, respectively. These plots suggest that the lookup kernel is bounded by the L2 cache on V100 and the L1 cache on EPYC 7742. This is explained by the fact that the distance and index matrices fit in the respective caches. For example, the total size of the distance and index matrices is 1.6 MB if E = 20, which fits in EPYC 7742's L1 cache (4 MiB) and V100's L2 cache (6 MiB).
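A quick sanity check of that figure (our arithmetic, assuming 4-byte floats for the distances, 4-byte integers for the indices, and k = E + 1 = 21 neighbors per point):

    \[
    \underbrace{2}_{\text{dist.} + \text{idx.}} \times \underbrace{10^4}_{L} \times \underbrace{21}_{E+1} \times 4\,\text{B}
    = 1.68 \times 10^{6}\,\text{B} \approx 1.6\,\text{MiB}
    \]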

Figure 9: Roofline analysis of lookup kernel on EPYC 7742 (L = 10^4, N = 10^5)

Overall, these roofline models indicate that kEDM operates close to the ceilings and achieves high efficiency in most cases. Furthermore, the roofline models reveal that EDM is an inherently memory-bound algorithm, primarily bounded by memory or cache bandwidth depending on the embedding dimension. It does not enter the compute-bound region of the roofline model in our use cases. This observation suggests that kEDM would benefit from hardware with higher memory, cache, and load/store unit bandwidth.

5 CONCLUSION & FUTURE WORK
We designed and developed kEDM, a performance-portable implementation of EDM using the Kokkos performance portability framework. The new implementation is based on a single codebase and runs on both CPUs and GPUs. Furthermore, we removed several inefficiencies present in mpEDM and custom-tailored the kernels to EDM. Benchmarks using real-world datasets indicate up to 5.5× speedup in convergent cross mapping computation.

In the future, we will implement a Python binding to facilitate adoption by users. We are also planning to evaluate kEDM on other hardware platforms such as AMD GPUs and Fujitsu A64FX ARM processors.

ACKNOWLEDGMENTS
This work was partly supported by JSPS KAKENHI Grant Number JP20K19808 (KT) and an Innovation grant by the Kavli Institute for Brain and Mind (GMP). The authors would like to thank Dominic R. W. Burrows at the MRC Centre for Neurodevelopmental Disorders, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, UK for providing the F1 dataset used in the performance evaluation.

REFERENCES
[1] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J. Parallel and Distrib. Comput. 74, 12 (Dec. 2014), 3202–3216.
[2] Chun Wei Chang, Masayuki Ushio, and Chih Hao Hsieh. 2017. Empirical dynamic modeling for beginners. Ecological Research 32, 6 (Nov. 2017), 785–796.
[3] Tom Deakin, Simon McIntosh-Smith, James Price, Andrei Poenaru, Patrick Atkinson, Codrin Popa, and Justin Salmon. 2019. Performance Portability across Diverse Computer Architectures. In International Workshop on Performance, Portability and Productivity in HPC. 1–13.
[4] Tom Deakin, Andrei Poenaru, Tom Lin, and Simon McIntosh-Smith. 2020. Tracking Performance Portability on the Yellow Brick Road to Exascale. In International Workshop on Performance, Portability and Productivity in HPC.
[5] Irina Demeshko, Jerry Watkins, Irina K. Tezaur, Oksana Guba, William F. Spotz, Andrew G. Salinger, Roger P. Pawlowski, and Michael A. Heroux. 2019. Toward performance portability of the Albany finite element analysis code using the Kokkos library. International Journal of High Performance Computing Applications 33, 2 (2019), 332–352.
[6] Ethan R. Deyle and George Sugihara. 2011. Generalized theorems for nonlinear state space reconstruction. PLoS ONE 6, 3 (2011), 8 pages.
[7] Vincent Garcia, Éric Debreuve, Frank Nielsen, and Michel Barlaud. 2010. k-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. International Conference on Image Processing (Sept. 2010), 3757–3760.
[8] John K. Holmen, Alan Humphrey, Daniel Sunderland, and Martin Berzins. 2017. Improving Uintah's Scalability Through the Use of Portable Kokkos-Based Data Parallel Tasks. In Practice and Experience in Advanced Research Computing, Vol. Part F1287. 1–8.
[9] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019), 12 pages.
[10] Huanfei Ma, Kazuyuki Aihara, and Luonan Chen. 2014. Detecting causality from nonlinear dynamics with short-term time series. Scientific Reports 4 (2014), 1–10.
[11] James Malcolm, Pavan Yalamanchili, Chris McClanahan, Vishwanath Venugopalakrishnan, Krunal Patel, and John Melonakos. 2012. ArrayFire: a GPU acceleration platform. In Modeling and Simulation for Defense Systems and Applications VII, Vol. 8403. 84030A.
[12] Matthew Martineau, Simon McIntosh-Smith, and Wayne Gaudin. 2017. Assessing the performance portability of modern parallel programming models using TeaLeaf. Concurrency and Computation: Practice and Experience 29, 15 (2017), 1–15.
[13] Hiroaki Natsukawa and Koji Koyamada. 2017. Visual analytics of brain effective connectivity using convergent cross mapping. In SIGGRAPH Asia 2017 Symposium on Visualization. 9 pages.
[14] Bo Pu, Lujie Duan, and Nathaniel D. Osgood. 2019. Parallelizing Convergent Cross Mapping Using Apache Spark. In International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS 2019). 133–142.
[15] Erich Schubert and Michael Gertz. 2018. Numerically stable parallel computation of (co-)variance. In 30th International Conference on Scientific and Statistical Database Management. 1–12.
[16] Anil Shanbhag, Holger Pirk, and Samuel Madden. 2018. Efficient Top-K Query Processing on Massively Parallel Hardware. In International Conference on Management of Data. 1557–1570.
[17] M. A. Sprague, S. Ananthan, G. Vijayakumar, and M. Robinson. 2020. ExaWind: A multifidelity modeling and simulation environment for wind energy. Journal of Physics: Conference Series 1452, 1 (2020), 13 pages.
[18] George Sugihara, Robert May, Hao Ye, Chih Hao Hsieh, Ethan Deyle, Michael Fogarty, and Stephan Munch. 2012. Detecting causality in complex ecosystems. Science 338, 6106 (2012), 496–500.
[19] Floris Takens. 1981. Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980. Springer, 366–381.
[20] Jan Treibig, Georg Hager, and Gerhard Wellein. 2010. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments. In 39th International Conference on Parallel Processing Workshops. 207–216.
[21] Niels van Berkel, Simon Dennis, Michael Zyphur, Jinjing Li, Andrew Heathcote, and Vassilis Kostakos. 2020. Modeling interaction as a complex system. Human-Computer Interaction 36, 4 (2020), 1–27.
[22] Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga, and Gerald M. Pao. 2020. Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution. In 26th International Conference on Parallel and Distributed Systems. 10 pages.
[23] Samuel Williams, Andrew Waterman, and David Patterson. 2008. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. Commun. ACM 52, 4 (2008), 65–76.
[24] Charlene Yang, Thorsten Kurth, and Samuel Williams. 2020. Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC-9 Perlmutter system. Concurrency and Computation: Practice and Experience 32, 20 (Oct. 2020), 12 pages.
[25] Hao Ye, Richard J. Beamish, Sarah M. Glaser, Sue C. H. Grant, Chih-hao Hsieh, Laura J. Richards, Jon T. Schnute, and George Sugihara. 2015. Equation-free mechanistic ecosystem forecasting using empirical dynamic modeling. Proceedings of the National Academy of Sciences 112, 13 (2015), E1569–E1576.