HPC and AI Initiatives for Supercomputer Fugaku and Future Prospects
Atsushi Nukariya, Kazutoshi Akao, Jin Takahashi, Naoto Fukumoto, Kentaro Kawakami, Akiyoshi Kuroda, Kazuo Minami, Kento Sato, Satoshi Matsuoka
Fujitsu Technical Review

The development of AI technology is indispensable for the realization of Society 5.0, which has been announced as part of Japan's science and technology policy. As such development proceeds, the computational resources required for AI learning continue to increase. Through the development of the K computer and the supercomputer Fugaku (hereafter, Fugaku), Fujitsu has been providing high performance computing (HPC) systems with abundant computing resources. Now, in order to utilize the abundant computational resources of HPC systems for AI learning, we are working with RIKEN to develop an AI infrastructure on Fugaku. This article describes the current status of our joint project with RIKEN to test and evaluate the performance of AI-related software on Fugaku, as well as our future work on HPC and AI.

1. Introduction

In 2016, Society 5.0 was proposed as part of Japan's science and technology policy [1]. In the society envisioned by Society 5.0, one aim is to add value to industry through the analysis of big data using AI.

The development of AI technology is essential for the utilization of AI in society and for leading the world in the realization of Society 5.0. As such development proceeds, the computational resources (e.g. processors and memory) required for AI learning and inference continue to increase.

Against this background, the utilization of AI by high performance computing (HPC), which is rich in computational resources, is being promoted.

This article describes the relationship between HPC and AI and the AI-related work being performed using the supercomputer Fugaku (hereafter, Fugaku). It goes on to describe the future outlook of this initiative.

2. HPC and AI

This section describes the relationship between HPC and AI from the perspective of the use of AI in HPC.

2.1 Fugaku as an AI platform

Three factors are considered to be key contributors to the development of AI technology, which is essential for the realization of Society 5.0 [2]:
• Large amounts of data required for AI learning
• A high-performance AI platform
• Talented AI personnel

We aim for Fugaku to become the core infrastructure of Society 5.0 by providing a high-performance AI platform that can process large amounts of data at high speed. Further, by providing such a platform, we expect to attract talented AI personnel from around the world, which will promote the development of AI technology.

2.2 Utilization of AI in HPC

This subsection describes the use of AI in HPC, including the acceleration of deep learning and examples of the application of AI to HPC applications.

1) Acceleration of deep learning
The main reason for the rapid development of AI in recent years is the improvement of deep learning technology. Deep learning can learn more features than conventional machine learning, but it requires a large number of computations. Moreover, since a large amount of data is used for learning, a mechanism to process that data at high speed is necessary.

In recent years, however, the abundance of computational resources has made learning in a realistic time possible, and the use of deep learning is expanding as a result. The CPUs of the K computer and Fugaku have high computing performance, high parallel performance, and wide memory bandwidth, which makes them well suited to deep learning. Going forward, AI applications that use deep learning are expected to be used increasingly on HPC systems rich in computational resources, with further acceleration of deep learning.

2) Application of AI technology to HPC applications
Initiatives have also been launched to realize Society 5.0 through the application of AI technology to HPC applications. For example, deep learning is expected to be useful in coping with large-scale natural disasters by improving the resolution of satellite images and removing cloud cover to provide a more detailed picture of the ground surface [3]. AI is also being applied to system software tuning, using deep learning to enhance HPC [4]. In addition, AI is expected to be used to accelerate data augmentation, a method for improving the accuracy of deep learning itself [5].

The development of an AI platform for HPC is therefore also important for performing AI-related computations as part of HPC applications.

3. Deep learning initiatives for Fugaku

This section describes the DL4Fugaku project and other deep learning initiatives for Fugaku.

3.1 Launch of the DL4Fugaku project

The DL4Fugaku project was launched with the aim of running AI software at high speed in order to provide Fugaku as an AI platform [6]. The project develops AI software through tuning and operation verification, and RIKEN and Fujitsu have signed a memorandum of understanding under which they are conducting joint research and development [7]. Research and development of frameworks and applications that enable efficient distributed learning is also underway.

The software developed by the project will be released on an ongoing basis as open-source software (OSS) [8].

3.2 Development of a deep learning framework

This subsection gives an overview of the deep learning framework work carried out in the DL4Fugaku project and its performance evaluation.

1) Trends in and structure of deep learning frameworks
PyTorch, TensorFlow, Chainer, and MXNet, among other deep learning frameworks, have been released as OSS. PyTorch and TensorFlow adopt a design in which the application programming interface (API) part, provided for various languages such as Python, is separated from the operation part, which is implemented in C++ (Figure 1). The operation part is implemented in C++ because it performs most of the learning and inference processing that requires high-speed operation.

The main processing content of the operation part in deep learning is matrix operations. Learning and inference are therefore sped up by calling arithmetic libraries optimized for Intel CPUs [9], such as the oneAPI Deep Neural Network Library (oneDNN) and the Basic Linear Algebra Subprograms (BLAS), from the operation part.

Figure 1: Internal structure of deep learning frameworks. An API part (Python etc.) sits above an operation part that calls oneDNN and BLAS on Intel CPUs, and dnnl_aarch64 and the TCS mathematical library on Fugaku.

2) Demonstration of the deep learning performance of Fugaku with Chainer
To verify the feasibility of improving the performance of AI software on Fugaku, we performed performance tuning and evaluation during massively parallel execution using Chainer, which is widely used in Japan.

Although there are several algorithms for the convolution operations used frequently in deep learning, in Chainer they are often implemented using GEMM (General Matrix Multiply) operations. Therefore, with a view to combining performance tuning and evaluation with the deep learning library dnnl_aarch64 and the mathematical library of the Technical Computing Suite (TCS) system software of the K computer, we created a prototype of a general-purpose high-speed kernel based on GEMM operations that could run on the CPUs of both the K computer and Fugaku, and we tuned it on the K computer. The tuning consisted of thread parallelization of convolution operations with the TCS mathematical library and acceleration of the Adam optimization algorithm. This made the learning operation on the K computer 36.4 times faster.

On Fugaku, we performed further parallel tuning, such as accelerating I/O processing by pseudo-staging with the newly designed Lightweight Layered IO Accelerator (LLIO) file system and accelerating the import processing for Python packages. Through this tuning, we achieved learning processing performance of 410,726 ips (images per second) on 72 racks (27,648 nodes) using the ResNet-50 v1.5 image recognition neural network. As shown in Figure 2, performance improved in proportion to the number of racks, and considerable performance improvement can be expected in larger configurations.

Figure 2: Performance of distributed learning using Chainer (ResNet-50). The vertical axis shows the number of images processed per second; the horizontal axis shows the number of racks.

Chainer moved into a maintenance phase in 2019, and migration to PyTorch is now encouraged [10]. Future performance evaluation of massively parallel execution will therefore be performed using PyTorch.

3) Acceleration initiatives for Fugaku
Fujitsu is developing the dnnl_aarch64 and TCS mathematical libraries optimized for the Fugaku CPU (A64FX), which uses the Arm architecture. Replacing oneDNN and BLAS, which are optimized for Intel CPUs, with dnnl_aarch64 and the TCS mathematical library, respectively, is expected to accelerate deep learning frameworks on Fugaku.

4) ResNet-50 evaluation with PyTorch/TensorFlow
The dnnl_aarch64 and TCS mathematical libraries were incorporated into PyTorch and TensorFlow, and ResNet-50 v1.5 performance was evaluated on the A64FX. With PyTorch, we achieved 98.8 ips for learning and 383.7 ips for inference on a single node (red figures in Table 1). We also evaluated the performance of Horovod, a distributed learning library for deep learning.

Table 1: Single-node performance with PyTorch/TensorFlow.
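The article notes that Chainer often lowers convolution to GEMM, which is why an optimized GEMM library (oneDNN, BLAS, dnnl_aarch64, TCS) speeds up the whole framework. A minimal pure-Python sketch of this im2col-plus-GEMM lowering follows; all function names are illustrative, not Chainer's actual implementation, and a real framework would hand the final matrix multiply to an optimized BLAS kernel.

```python
# Sketch of convolution-as-GEMM (im2col lowering). Pure Python for
# clarity; in practice the gemm() call is replaced by an optimized
# library kernel such as oneDNN/BLAS (Intel) or dnnl_aarch64/TCS (Fugaku).

def im2col(image, kh, kw):
    """Unfold each kh x kw patch of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows  # shape: (out_h * out_w, kh * kw)

def gemm(a, b):
    """Naive matrix multiply: a (m x k) times b (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def conv2d_gemm(image, kernel):
    """2D valid convolution expressed as a single GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    cols = im2col(image, kh, kw)                      # patches as rows
    kvec = [[kernel[di][dj]] for di in range(kh) for dj in range(kw)]
    flat = [row[0] for row in gemm(cols, kvec)]       # one GEMM = whole conv
    out_w = len(image[0]) - kw + 1
    return [flat[r * out_w:(r + 1) * out_w]
            for r in range(len(flat) // out_w)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]   # picks top-left + bottom-right of each patch
print(conv2d_gemm(image, kernel))  # → [[6, 8], [12, 14]]
```

Because the entire convolution collapses into one large matrix multiply, tuning a single GEMM kernel (as the DL4Fugaku prototype did) accelerates every convolution layer at once.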
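The K computer tuning described above also accelerated the Adam optimization algorithm. For reference, this is the standard Adam update rule being computed, as a plain-Python sketch for a single scalar parameter; it is the textbook algorithm, not Fujitsu's tuned kernel.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m, v are running estimates of the gradient's first and second
    moments; t is the 1-based step count used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad            # first moment
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 5001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(abs(x) < 0.05)  # → True: x converges toward the minimum at 0
```

In a framework these updates run element-wise over every weight tensor each step, so vectorizing and thread-parallelizing them is a natural tuning target alongside the convolution kernels.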
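The distributed results above (near-linear scaling to 72 racks, and the Horovod evaluation) rest on data-parallel learning: each node computes gradients on its own shard of the batch, and the gradients are averaged with an allreduce before the weight update. Horovod performs this with collective operations over MPI/NCCL-style transports; the sketch below is only an in-process toy of the averaging step, not Horovod's API.

```python
# Toy data-parallel gradient averaging ("allreduce" by summation),
# the core step of Horovod-style distributed learning. Workers are
# simulated in-process; a real run uses collectives across nodes.

def allreduce_average(per_worker_grads):
    """Element-wise average of one gradient vector per worker."""
    n = len(per_worker_grads)
    dim = len(per_worker_grads[0])
    return [sum(g[k] for g in per_worker_grads) / n for k in range(dim)]

# 4 simulated workers, each holding gradients for 2 parameters
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(allreduce_average(grads))  # → [4.0, 5.0]

# Proportional scaling check against the article's Fugaku result:
# 410,726 ips on 72 racks ≈ 5,705 ips per rack (the slope in Figure 2).
print(round(410726 / 72))  # → 5705
```

Because each worker sees the same averaged gradient, all replicas stay in sync, and throughput grows with the number of workers as long as the allreduce communication can be overlapped with computation.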