swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

Changxi Liu, Hailong Yang, Zhongzhi Luan and Depei Qian (School of Computer Science and Engineering, Beihang University)
Rujun Sun (State Key Laboratory of Mathematical Engineering and Advanced Computing)
Lin Gan and Guangwen Yang (Department of Computer Science and Technology, Tsinghua University)

ABSTRACT

The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific and deep learning applications. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time compilation for architectures requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during compilation, such as the core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient code for deep learning applications on Sunway. The experimental results show the ability of swTVM to automatically generate code for various deep neural network models on Sunway. The performance of the automatically generated code for AlexNet and VGG-19 by swTVM achieves 6.71× and 2.45× speedup on average over hand-optimized OpenACC implementations on the convolution and fully connected layers respectively. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the high performance architecture, particularly with productivity and efficiency in mind. We would like to open source the implementation so that more people can embrace the power of the deep learning compiler and the Sunway many-core processor.

KEYWORDS

Sunway architecture, Deep Learning, Automatic Compilation

1 INTRODUCTION

Currently, deep learning has achieved outstanding performance in many fields, including self-driving cars [2], face detection [26] and machine translation [7]. Deep learning frameworks such as Caffe [11], TensorFlow [1], PyTorch [12] and MXNet [3] provide an efficient platform to support the research and development of intelligent applications. In the meanwhile, emerging deep learning algorithms exhibit increasing demands for massive computation power. To satisfy the computation demand, various accelerating hardware such as GPU and FPGA has been applied in the deep learning field. Current deep learning frameworks mostly rely on high performance libraries such as cuDNN [6] and MKL [22], which are provided by the hardware vendor to accelerate the deep learning application. With new deep learning algorithms and hardware arising rapidly, the engineering cost of porting the algorithms to the hardware has increased dramatically. It is necessary to find a way to deploy these emerging deep learning algorithms on the various underlying hardware automatically and efficiently.

To address the above problem, end-to-end compilers for deep learning applications have been proposed. The state-of-the-art deep learning compilers include Glow [19], nGraph [8], Tensor Comprehensions [21] and TVM [4]. Take TVM for an example: it adopts a two-level optimization design to automatically generate code for deep neural networks, written on different deep learning frameworks, targeting various hardware devices such as CPU, GPU and FPGA. On the graph level, TVM applies multiple optimizations to the computation graph derived from the deep neural network, such as operator fusion and data layout transformation. On the operator level, TVM converts the computation on operators to tensor operations targeting the specific hardware architecture, and hides the memory latency by optimizing the instruction pipeline. Moreover, TVM can optimize the code generation automatically according to the shape and data layout of the input to each layer for better performance.

Meanwhile, for its compelling computation power, the Sunway many-core processor serves as the basic building block of Sunway TaihuLight, which is the first supercomputer to achieve over 100 petaFlops in the world. The Sunway SW26010 processor consists of four core groups (CGs). Each CG, including a Management Processing Element (MPE) and 64 Computing Processing Elements (CPEs), can achieve 765 GFlops peak performance in double-precision. The memory attached to each CG is 8GB with a bandwidth of 34.1GB/s. The MPE is a complete 64-bit RISC core, typically used for task control and management, whereas the CPE is also a 64-bit RISC core but with limited functionalities, typically used for computation. In addition, each CPE has a 64KB local device memory (LDM) that is managed explicitly by software. The executables on Sunway are generated through cross-compilation with MPE and CPE as different compilation targets. Due to the limitation of the Sunway customized operating system, dynamically linked libraries are not supported.

To embrace the advantage of automatic compilation and high performance for deep learning applications, it is intuitive to adopt TVM to the Sunway architecture. However, the unique compilation environment and architecture features prevent a naive adoption of TVM to Sunway. First, TVM relies on dynamic link libraries to generate executables on different hardware devices, which is not supported on Sunway. In addition, its code organization fails to recognize the different compilation targets for MPE and CPE, and is thus incapable of managing the function calls between MPE and CPE. Second, the memory capacity of each CG on Sunway is quite limited. During the deep learning computation, large memory capacity is required to store the intermediate data as well as the weight parameters. How to allocate the memory space efficiently and leverage the unique architecture features such as DMA for high bandwidth data transfer is important for generating code with high performance. Third, each CPE within a CG contains a 64KB LDM that can be used to buffer data with explicit software management. How to leverage the limited LDM on each CPE with improved data locality is critical for realizing the performance advantage of Sunway during code generation.

To overcome the above challenges, we propose swTVM, a deep learning compiler on Sunway that is tailored for the unique compilation environment and architecture features on Sunway. In swTVM, we provide ahead-of-time (AOT) code generation that manages the function calls as well as the compilation for MPE and CPE explicitly. In addition, we apply several optimizations to the tensor operations so that architecture features such as DMA and LDM are better utilized during code generation. To the best of our knowledge, this is the first work to implement an end-to-end deep learning compiler on Sunway. Our experimental results show that the automatically generated code for AlexNet and VGG-19 by swTVM achieves 6.71× and 2.45× speedup on average for the convolution and fully connected layers respectively compared to hand-optimized OpenACC implementations. Specifically, this paper makes the following contributions:

• We implement the ahead-of-time (AOT) code generation that produces different compilation targets for MPE and CPE as well as manages the function calls between MPE and CPE efficiently. In addition, we manage the intermediate memory space for each tensor operation globally, which avoids the overhead of frequent memory allocation during computation.

• We apply several optimizations to the tensor operations regarding the unique architecture features on Sunway. Specifically, we propose a DMA control interface that manipulates the DMA data transfers for each tensor during computation. In addition, we design an LDM management mechanism that buffers the tensor data as much as possible to reduce the latency of accessing memory. Moreover, the DMA instructions are automatically inserted during code generation to improve the re-accessibility of the buffered data.

• We propose swTVM, which implements AOT code generation and architecture specific optimizations on top of TVM and offers the high performance of the Sunway processor to the deep learning community through automatic compilation. We compare the performance of swTVM for several deep neural networks with hand-optimized implementations, and the results show that the performance of swTVM is even better than hand-optimized OpenACC implementations.

The remainder of this paper is organized as follows. In Section 2, we present the background of deep learning compilers and the Sunway architecture. Section 3 presents the design overview of swTVM. Section 4 and Section 5 describe the details of code generation in AOT mode and the optimizations for tensor operations on Sunway. Section 6 presents the performance results of AlexNet and VGG-19 compared to hand-optimized implementations using OpenACC and swCaffe. Section 7 presents the related work in the fields of deep learning compilers and performance optimization on Sunway. Section 8 concludes this paper.

2 BACKGROUND

2.1 Sunway Architecture

The architecture of the Sunway processor is shown in Figure 1(b). The Sunway SW26010 processor consists of four core groups (CGs), and each CG includes one Management Processing Element (MPE) and 64 Computing Processing Elements (CPEs). Each CG can achieve 765GFlops peak performance in double-precision and 34.1GB/s theoretical memory bandwidth. The MPE is a complete 64-bit RISC core, designed for task management and control, whereas the CPEs are also 64-bit RISC cores but with limited functionalities, focusing on computation.

The executables on Sunway are generated through cross-compilation on x86 processors using customized compilers (sw5cc for C and sw5CC for C++). Due to the limitation of the customized operating system on Sunway, dynamically linked libraries are not supported. Instead, the executables are generated with libraries statically linked. Moreover, the code running on MPE and CPEs is compiled as different compilation targets on Sunway. For instance, with the Sunway C compiler, additional compilation options are added to generate the executable for MPE (-host) and CPEs (-slave) respectively.

The memory hierarchy of the Sunway processor is also different from x86 processors. Each MPE has a 32KB L1 data cache and a 256KB L2 data/instruction cache, whereas each CPE has a 16KB L1 instruction cache and a 64KB local device memory (LDM). The LDM is commonly used as a programmable buffer with explicit software control. There are two ways to access memory on Sunway. The first one is to use DMA, which prefers large and continuous data accesses. The other one is to use the global load/store (Gload/Gstore) instructions, which prefer small and random data accesses.

Two parallel programming models are supported on Sunway to exploit the massive parallelism of the CPEs, including OpenACC and Athread. The OpenACC programming model is more programmer friendly. With OpenACC, programmers can utilize different processors without knowing the underlying architecture details. Athread, in contrast, is the parallel programming model tailored for the Sunway architecture. With Athread, programmers can buffer the data in LDM, which provides the opportunity to reduce the accesses to main memory through explicit control. In this paper, we use Athread to generate the code on Sunway for better performance.

Although the LDM of the CPE sounds similar to the shared memory on GPU, the design philosophy is quite different between them. GPU adopts SIMT parallelism that accesses the shared memory through concurrent threads within a warp. The GPU program achieves better performance if threads within a warp access a continuous memory region at the same time. However, on Sunway the CPEs access the memory and buffer the data in LDM independently. Therefore, without careful management, severe contention on memory bandwidth would occur and thus degrade the performance significantly. In addition, when buffering a large continuous data block to LDM, the DMA data transfer can be utilized for higher memory bandwidth.

2.2 Automated Compilation for Deep Learning

Recently, deep neural networks have obtained significant achievements in self-driving, face detection and machine translation. With the increasing accuracy and application domains, the tendencies for the future development of deep neural networks become clear: 1) the size of the neural network scales with deeper layers; 2) the algorithms of the neural network become more diverse; 3) new hardware is emerging to accelerate the computation of neural networks; 4) various deep learning frameworks are adopted by the developers.

The above tendencies increase the demand for deploying the emerging neural network algorithms to various hardware devices, so that enormous engineering efforts are required to match the algorithms with the hardware efficiently. Currently, the performance of the neural network mainly depends on computation libraries such as cuDNN and MKL provided by hardware vendors. However, it is unsustainable to perform labor intensive performance tuning to match various hardware as new algorithms arise rapidly. The deep learning compiler provides a way to build an efficient mapping between new algorithms and various hardware devices, and thus improves the portability of neural network algorithms. Pioneers attempt to build deep learning compilers for the entire community, including TVM, Glow and nGraph.

Despite the different implementation approaches adopted by different deep learning compilers, their design philosophies (e.g., two-level optimization) are somehow converging. Therefore, we take TVM as an example to illustrate. TVM uses the idea of two-level optimization including the graph level and the operator level. On the graph level, it converts the deep neural network to the computation graph, and then applies optimizations such as operator fusion and data layout transformation. On the operator level, it optimizes the code generation targeting specific hardware through loop fusion and loop re-ordering. However, adopting an existing deep learning compiler such as TVM to Sunway introduces several challenges to be addressed, regarding the unique compilation environment and architecture features on Sunway. These challenges motivate our work in this paper.
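For concreteness, the listing below sketches what this operator-level interface looks like for the matrix multiplication used as the running example in this paper (cf. Figure 2(a)). It is written against the TVM 0.x tensor-expression API; the explicit reduction-axis declaration is added here for completeness, and the tiling factors (128 and 64) simply follow the figure.

    import tvm

    # C = A x B with M = 1, K = N = 1024, as in Figure 2(a).
    M, K, N = 1, 1024, 1024
    A = tvm.placeholder((M, K), name='A')
    B = tvm.placeholder((K, N), name='B')
    k = tvm.reduce_axis((0, K), name='k')
    C = tvm.compute((M, N),
                    lambda x, y: tvm.sum(A[x, k] * B[k, y], axis=k),
                    name='C')

    # Operator-level optimization: tile and re-order the loop nest.
    s = tvm.create_schedule(C.op)
    yo, yi = s[C].split(C.op.axis[1], factor=128)
    ko, ki = s[C].split(k, factor=64)
    s[C].reorder(yo, ko, yi, ki)
    print(tvm.lower(s, [A, B, C], simple_mode=True))  # inspect the lowered loop nest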
2.3 Challenges for DL compilation on Sunway

The first challenge is that the Sunway processor relies on cross-compilation to generate executables and does not support dynamically linked libraries. These limitations prohibit a naive adoption of an existing deep learning compiler such as TVM to Sunway. Therefore, code generation in AOT mode needs to be supported in the deep learning compiler so that it can compile the executables with statically linked libraries. In addition, an efficient code organization is required with AOT code generation in order to support different compilation targets for MPE and CPE as well as function calls between MPE and CPE. For instance, the management and control code should be generated on MPE, whereas the computation code should be generated on CPEs. Moreover, the memory capacity of a CG is quite limited compared to the large volume of data generated during the tensor operations. To avoid the overhead of frequent memory allocation during computation, the memory needs to be managed globally in AOT code generation.

The second challenge is to optimize the generated code regarding the unique architecture features of Sunway. As observed by existing research works [15-17], the key to achieving high performance on Sunway is to 1) fully utilize the computing resources of the CPEs for massive parallelism, and 2) leverage the LDM of each CPE to alleviate the bottleneck of memory access. Therefore, when the neural network compiler optimizes the generated code, the following three rules need to be complied with: 1) use DMA as much as possible when accessing main memory; DMA requires accessing large and continuous data blocks, which provides higher memory bandwidth; 2) leverage the LDM to buffer as much data as possible during the computation; the LDM reduces the latency of accessing main memory; 3) minimize the frequency of memory accesses as much as possible; the computation should exhibit better data locality and re-accessibility after each memory access.

In sum, implementing an end-to-end compiler for deep learning applications requires not only adaptations to the compilation environment on Sunway, but also optimizations targeting the architecture features to improve the performance of the generated code.

3 SWTVM: DESIGN OVERVIEW

To address the challenges described in Section 2.3, we implement AOT code generation as an extension to TVM, and manage the code organization for MPE and CPE respectively. In addition, we manipulate the memory allocation of the tensor operations globally. The grey components in Figure 1(a) show the contribution of our work. To improve the readability and portability of the generated code, we produce C source code in AOT mode, which is then compiled by the Sunway native compiler in order to generate the executables. The advantage of AOT code generation is that the memory allocation for each layer is determined based on the input and output of each layer before the actual computation, which avoids frequent memory allocation during the computation and thus eliminates the overhead of operations related to memory allocation.

Since the MPE supports complete functionality, the code generated in AOT mode can run on MPE directly. For CPE, in contrast, we need to generate a separate file to be compiled for CPE, as shown in Figure 1(c). We use a structure to pass multiple parameters to the function on CPE. In order to remove the dependency on the structure definition from the interface when calling the layer, we encapsulate CPE functions with another interface that renders the layer function calls as ordinary function calls. The encapsulating interface is also useful when handling the memory allocation of the intermediate data for complex layers. The encapsulated function is organized in a separate file with MPE as its compilation target. The parameters stored in the structure file are only visible to the files containing the CPE function and the encapsulated CPE function. We achieve AOT code generation for each layer by organizing the code of each layer into the above three files, in addition to a header file for the encapsulated CPE function.

The MPE code generated in AOT mode is primarily responsible for calling each layer according to the topology of the neural network, whereas the code generated for CPE is responsible for the specific computation of each operation. We optimize the implementation of each operator adapting to the architecture features on Sunway, as shown in Figure 1(a). Specifically, we design a DMA control interface, which provides DMA schedule primitives for high-level languages such as Python. In addition, the LDM on each CPE is only 64KB and the amount of tensor data is too large to fit in entirely. To decide the size of the tensor data accessed through DMA and buffered in the LDM, we propose an LDM management mechanism that controls the number and the size of the tensor data to be buffered automatically, and which also adjusts to the different configurations of input and output of the neural network. Moreover, to improve the re-accessibility of the buffered data, the DMA instructions need to be inserted at the right place in the generated code. To achieve this, we design an algorithm that inserts DMA instructions into the optimal locations with improved data re-accessibility of the generated code automatically.
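As a point of reference for the AOT design, stock TVM can already lower a scheduled operator and print the loop nest that code generation starts from; the sketch below shows this for a simple vector operator. What swTVM adds on top, as described above and in Section 4, is emitting the C source split into MPE and CPE compilation targets and managing memory and function calls across them; the snippet illustrates the baseline only.

    import tvm

    n = 1024
    a = tvm.placeholder((n,), name='a')
    b = tvm.placeholder((n,), name='b')
    c = tvm.compute((n,), lambda i: a[i] + b[i], name='c')
    s = tvm.create_schedule(c.op)

    # Inspect the lowered loop nest that code generation starts from.
    print(tvm.lower(s, [a, b, c], simple_mode=True))
    # In stock TVM a C source module could then be emitted (e.g. via
    # tvm.build(s, [a, b, c], target='c')); swTVM instead generates the
    # layers.c / layers.s.c files described in Section 4 ahead of time.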


Figure 1: (a) The design overview of swTVM, (b) the Sunway architecture and (c) an illustrative example of automatically generated code for a deep neural network.

4 AOT CODE GENERATION

To implement AOT code generation, we should consider not only the implementation of each layer, but also how to convert the neural network topology into the function calls of each layer with the dependence among the layers satisfied. Figure 1(a) shows the topology of a deep neural network consisting of convolution and fully connected layers. After AOT code generation, the implementation is shown in Figure 1(c) as a series of operations such as memory allocation, parameter initialization and function calls in Func main.

4.1 Managing Memory Allocation

When generating code in AOT mode, the memory allocation on both main memory and LDM for the input/output data as well as the temporal variables of each layer needs to be managed. The memory allocated for input/output data includes the intermediate data generated between layers as well as the weight parameters, which cannot be released or overridden during computation. After completing its computation, each operation stores its result into memory, which is then used by other operations. As these data are stored in main memory, the memory space is allocated and released by the MPE. The implementation details are shown in Figure 1(c).

For complex operations, the computation also produces temporal data. These data are never used again once the operation completes. Therefore, the memory space for temporal data can be released or overridden. However, since the size of the temporal data easily exceeds the capacity of LDM, it is also stored in main memory. The memory space for the temporal data is allocated and released in the main function. When an operation is invoked, it uses part of the memory space that has already been allocated in the main function, which reduces the overhead of allocating and releasing memory space across operations. The memory space for temporal data is allocated by MPE and used by CPE. The implementation details are listed in Func main for MPE and Func layer_slave for CPEs in Figure 1(c). The utilization of LDM in Func layer_slave is described in Section 5.

4.2 Managing Function Call

As shown in Figure 1(a), the implementation of the swTVM compiler is organized into three levels, which first transforms the topology of the deep neural network into the computation graph, then applies a series of optimizations at the graph level, and eventually implements the computation on specific hardware at the operator level. In AOT code generation, Func main in Figure 1(c) is responsible for maintaining the dependency of function calls in the computation graph, whereas Func layer_slave in Figure 1(c) implements each operation. Note that Func layer in Figure 1(c) connects Func layer_slave and Func main, and fulfills the function call of each operation in the computational graph.

In addition, the function calls for the architecture specific code can also be organized into three levels, including function calls on MPE, function calls on CPE and function calls from MPE to CPE. These three levels map back to the operation calls at the graph level, at the operator level and from the graph level to the operator level in the implementation of swTVM. The graph level generates Func main shown in Figure 1(c), which runs on MPE. Func layer shown in Figure 1(c) is the implementation of the function call from the graph level to the operator level, which is called by Func main on MPE and then calls Func layer_slave to run on CPE. Func layer_slave implements the computation invoked at the operator level.

The advantage of such a design is that we can implement swTVM by integrating AOT code generation and the Sunway architecture optimizations through function calls, without relying on the implementation details of each level. Using AOT compilation, swTVM does not need to consider the underlying implementation when generating the operations. In addition, by managing the dependency of function calls, swTVM is able to generate code for MPE and CPE as different compilation targets.

4.3 Implementation Details

Func main shown in Figure 1(c) consists of four stages, including the memory allocation stage, the parameter/input initialization stage, the computation stage and the output stage. During the memory allocation stage, in addition to allocating memory for the parameters and input/output of the neural network, a temporal memory space is also allocated for each layer, the size of which satisfies the maximum memory usage across all layers. The dependency across all network layers is analyzed to ensure that the order of function calls is in accordance with the dependency. Each function in the computation stage corresponds to one or more layers in the neural network topology.

The implementation of each operation consists of Func layer on MPE, Func layer_slave on CPEs and the parameter structure Para. Func layer is further divided into three parts: the memory allocation for temporal space, parameter initialization and computation. The memory allocation for temporal space is only required for layers that combine multiple sub-operations, such as convolution and pooling layers. For such a layer, the input of one sub-operation depends on the temporal results of the previous sub-operation. Due to the performance overhead of frequent memory allocation, we allocate a temporal memory space in the main function and share it across the layers.

The format for calling a function on CPE is to use the function name and the parameter structure, as shown in Figure 1(c). Para is the parameter structure that is only visible to the corresponding layer.c and layer.slave.c files. Func layer_slave consists of parameter parsing, LDM allocation and computation. At the beginning of the function, the tensor data is loaded from memory and then buffered in LDM. The memory of LDM is allocated through static arrays to buffer tensor data. The memory is accessed through DMA instructions and overlapped with the computation to reduce the latency, the details of which are described in Section 5.

5 OPTIMIZING TENSOR OPERATION

5.1 DMA Control Interface

An efficient DMA control interface plays an important role for swTVM to generate high performance implementations of neural networks on Sunway. The DMA control interface provides schedule primitives for high level languages such as Python to control the DMA instructions in order to manage the data accesses efficiently. Figure 2 shows how to control the tensor data access in matrix multiplication through the DMA control interface. As seen, we are able to specify not only which tensor to buffer, but also the region of the tensor to be buffered during code generation. When partial data along one dimension of the tensor needs to be buffered, the split, buffer_read and buffer_write primitives are applied in turn to split the dimension and buffer the corresponding data. After invoking the above primitives, the Load Data region generates the IR code of the read buffers for tensors B and A, shown in Figure 2(d), whereas the Store Data region generates the IR code of the write buffer for tensor C. These IRs are then translated to DMA instructions during code generation.

Figure 2: An example of generating a matrix multiplication implementation that is optimized on Sunway ((a) tensor expression and schedule, (b) plain IR, (c) buffer read/write primitives, (d) buffered IR with Load Data and Store Data regions, (e) generated code with DMA instructions).

swTVM can buffer data not only along one dimension, but also along multiple dimensions. In Figure 2(c), tensor B is buffered along two dimensions, because the values along these two dimensions of tensor B are required when calculating the region of tensor C to be buffered. Buffering along multiple dimensions also occurs in convolution. Convolution is a computation among high dimensional tensors; however, certain dimensions of the tensors are quite small, for instance rx and ry of tensor W in Figure 3. If buffering data along only one dimension, the memory is not fully utilized. In such cases, buffering the tensor data along multiple dimensions improves the bandwidth utilization. When buffering, we satisfy the data needs of the outer loop with priority, which improves the re-accessibility of the buffered data.

    B = Conv(A, W)
    1 stride = 2
    2 ......
    3 B = tvm.compute((channel, height, width),
    4     lambda ff, yy, xx: tvm.sum(
    5         A[rc, yy * stride + ry, xx * stride + rx] * W[rc, ry, rx],
    6         axis=[rc, ry, rx]), name="B")
    7 s = tvm.create_schedule(B.op)
    8 AA = s.buffer_read(A, [yy*stride+ry, xx*stride+rx])
    9 ......

Figure 3: An example of a generated convolution implementation on Sunway.

In complex layers such as convolution, the subscript used to access the tensor data along one dimension is determined by multiple variables. To handle such cases, the DMA control interface accepts multiple variables and allows the user to specify how to construct the subscript using these variables in order to determine the range of each dimension to be buffered. One such example is shown in the expression yy * stride + ry (line 8) of Figure 3. The DMA control interface also supports expression parsing, which accepts the subscript expression and analyzes the correlation between the variables and the tensor dimensions automatically.
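Putting the primitives of this section together, the schedule below extends the matrix-multiplication example of Figure 2(a) with the proposed buffer_read/buffer_write primitives (Figure 2(c)). These primitives are swTVM extensions rather than part of stock TVM, so the exact spelling is illustrative.

    # Continuing the schedule of Figure 2(a): yo, yi and ko, ki are the split
    # output-column and reduction axes.
    BB = s.buffer_read(B, [ki, yi])   # stage a 64 x 128 tile of B in LDM
    AA = s.buffer_read(A, [ki])       # stage a 64-element strip of A
    CC = s.buffer_write(C, [yi])      # accumulate into a 128-element LDM buffer for C

    # For convolution (Figure 3), a buffered dimension may be addressed by an
    # expression over several loop variables; the interface accepts the
    # expression directly, as in line 8 of Figure 3:
    #   AA = s.buffer_read(A, [yy * stride + ry, xx * stride + rx])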
5.2 LDM Management Mechanism

To better control the data buffering in LDM, we design an LDM management mechanism, which determines the dimensions of the tensors to be buffered and the buffer sizes in LDM. In addition, it re-orders the computation loops in order to improve the re-accessibility of the buffered data.

Algorithm 1 LDM management algorithm.

Algorithm 1 shows the algorithm for managing data buffering in LDM. It uses an approximate algorithm to ensure that the search time for an optimal solution is not too long. The algorithm consists of two parts, the initial part that allocates a pre-defined LDM memory space and the expanding part that maximizes the LDM utilization. In Algorithm 1, the variables can be classified into three types:

• sizevar: determines the buffer size and is the index of the lowest dimension of the tensor that can expand the buffer size;
• numvar: determines the number of DMA instructions and is the index of a dimension of the tensor (except the lowest dimension which can expand the buffer size);
• compvar: satisfies the conditions of both sizevar and numvar.

At the beginning of the algorithm, the sequence of the variables is re-ordered. The compvars are re-ordered by the ascending order of the number of affected tensors, whereas the sizevars are re-ordered by the ascending order of the buffer size (line [7-8]). After that, the buffer for each variable is initialized to a pre-defined size across each tensor (line [10-33]). The buffer size is set to the minimum of the variable size and 64. The number 64 is chosen based on an empirical study that reading 64 floats per memory access achieves good bandwidth on Sunway (line [16-20]). Then, the algorithm checks whether the buffer size is larger than the capacity of LDM. If so, the data to be transferred for the current variable, or even a previously buffered variable, needs to be reduced to fit in the limited capacity of LDM (line [21-28]).

During the initialization, the algorithm invokes the UPDATE function if the range of the variable to be buffered equals the range of the loop variable. When the dimension of the tensor to be buffered is no longer associated with any variables, the higher dimension needs to be adjusted to change the buffer size, and the CLASSIFY function is invoked to update numvars, sizevars and compvars (line [22-23, 29-33, 41-47]). After the initialization, if the LDM still has free space, the buffer size of each variable is expanded to improve the LDM utilization. We use a greedy algorithm to load as much data into LDM as possible. The algorithm terminates when the LDM usage reaches the maximum capacity (line [35-47]).

We take the matrix multiplication in Figure 2 as an example to illustrate the process of the algorithm, where x, y and k are numvar, sizevar and compvar respectively. We set the buffer size of y to 64 and check that the buffer size does not exceed the LDM capacity. Then, we set k to 64, which leads to an LDM usage of 16.5KB. Since there is no numvar satisfying the condition of UPDATE, the algorithm enters the expanding part. When y is set to 128, the LDM usage increases to 32.75KB. Continuing to expand with k set to 128, the buffer size reaches 65KB, which is larger than the LDM capacity. Therefore, x = 1, y = 128 and k = 64 are chosen for the buffer sizes. The time complexity of Algorithm 1 is polynomial.

Algorithm 2 Loop variable re-ordering algorithm.

After initializing the buffer size for each tensor, the loop order is adjusted to improve the re-accessibility of the buffered data. To avoid unnecessary DMA transfers, the loop variables that are not associated with the tensor are put in the loop where the DMA instructions reside. For conflicting DMA instructions, the loops are re-ordered and the order with the least number of DMA instructions is chosen, as shown in Algorithm 2. First, all buffered variables are moved to the innermost loops. Then, the order of the non-buffered variables is determined. The buffered variables with undecided locations are inserted into the current loop, with the number of DMA instructions for all tensors evaluated. The variable with the least number of DMA instructions is chosen as the loop variable for the current loop (line [2-11]). The above process is repeated until there is no variable unevaluated, which derives the final loop order (line [16-22]). The time complexity of Algorithm 2 is also polynomial.

We take the matrix multiplication in Figure 2 as an example to illustrate the loop re-ordering. The variables for which the order is to be decided are x, yo and ko. The least number of DMA instructions for x, yo and ko is 256 + 128, 256 + 16 and 256 + 8 respectively. Therefore, ko is chosen first. Then, the least number of DMA instructions for x and yo is 8 for both, so the original loop order is unchanged. After that, the final loop order for x, yo and ko is determined.

5.3 DMA Auto-Insertion Algorithm

With the DMA control interface and the LDM management mechanism available, we propose an algorithm to implement the auto-insertion of DMA instructions during code generation. The DMA auto-insertion algorithm consists of three parts: 1) determining the buffer size and the starting memory location of each dimension of the tensor; 2) determining the DMA insertion location to improve data re-accessibility; 3) generating code based on the above information and the Sunway instruction syntax.

Determining the buffer size and the starting memory location. First, the buffered dimension is split into two parts, which keeps the range of the inner loop within the buffer size. When the subscript of the dimension is correlated to only one loop variable, the starting memory location of the buffer is calculated by setting the loop variable of the inner loop to 0, whereas the buffer size is the range of the inner loop. All buffer operations in Figure 2(c) belong to this case. However, for complex operations such as the convolution in Figure 3, the subscript of a dimension of the tensor is often correlated to several variables. To obtain the starting memory location of the buffer, the subscript expression is evaluated with all variables set to zero. The size of the buffer is the difference between the result of the subscript expression with all variables set to their maximum values and the starting memory location.

Algorithm 3 DMA auto-insertion algorithm.

Determining the DMA insertion location. In Algorithm 3, all tensors to be buffered constitute the TensorSet. First, the correlation between loop variables and memory accesses of each tensor is analyzed. For each tensor, the correlated loop variables are stored in ItersSet(tensor). Similarly, for each buffered tensor, the correlated loop variables are stored in SubItersSet(tensor). SubItersSet(tensor) is a subset of ItersSet(tensor) (line [3-4]). After the initialization, the location of each DMA instruction is determined. The algorithm checks all locations by iterating through all variables (line [5-22]). When traversing the variable var, it checks whether ItersSet(tensor) - SubItersSet(tensor) is an empty set. If so, all variables the tensor depends on are determined, and the DMA instructions for the tensor should be inserted in the current loop. After that, the tensor is removed from the TensorSet (line [8-11]), and the ItersSet(tensor) of each tensor is updated. If var of the current loop is not correlated to the tensor, the algorithm proceeds to the next tensor. If var belongs to ItersSet(tensor) but does not belong to SubItersSet(tensor), then var is removed from ItersSet(tensor), indicating that var is determined (line [12-22]). When Algorithm 3 terminates, the locations to insert the DMA instructions for each tensor are determined. The time complexity of Algorithm 3 is polynomial.

Generating code with Sunway DMA syntax. Figure 2(e) shows the pseudo-code of the inserted DMA instructions. When generating the code, DMA instructions whose memory addresses and LDM buffer addresses are continuous are combined to reduce the number of DMA instructions.

5.4 Implementation Details

Achieving parallelism with CPEs. To achieve good parallelism with CPEs, load balance and write conflicts need to be considered when generating code on Sunway. Load balance can be achieved using the athread_parallel primitive, which splits the high dimension of the tensor across multiple tasks. Figure 4 shows an example of vector multiplication, v = v1 × v2. For the vector v whose dimension size is 1024, we divide its dimension into 64 chunks and the size of each chunk is 16. _begin and _end indicate the range of tasks for each CPE, which is determined by the id of the CPE and the number of tasks. The worst case happens when the size along the high dimension of the tensor is less than the number of CPEs. Such a case can be solved by using the fuse primitive to combine multiple dimensions until the size is large enough. Write conflicts can be avoided by splitting the tasks along the dimension of the tensor being written.

Figure 4: An example of a generated parallel implementation on Sunway.

Algorithm 4 Analyze the correlation between tensor and loop variable.

Analyzing the correlation between tensor and variable. Algorithm 4 illustrates how to derive the correlation between tensors and variables used in Sections 5.1 and 5.3. First, there is a set of variables and tensors, which contains the subscript information of each dimension (line [1]). Then, the set of correlated variables for all tensors, VarsSet, is initialized to empty and each tensor in TensorSet is analyzed (line [2-10]). When iterating through the TensorSet, all variables in the index expression of each dimension are checked and the ones that belong to the set of loop variables are added to VarsSet (line [7-10]). When Algorithm 4 terminates, the correlation between variables and tensors is determined.

Algorithm 5 Classify the loop variables into compvars, sizevars and numvars.

Loop variable classification. The classified loop variables are used in Section 5.2. Algorithm 5 describes how the loop variables are classified. First, the loop variables in varSet are iterated to obtain their types, which include sizeType, compType and numType. The type of each loop variable is stored in the corresponding set (line [25-33]). The function VARTYPE is used to derive the type of each loop variable (line [2-20]). When iterating through tensorSet, if the var belongs to the range of the current buffered dimension, the value of sizeTypeFlag is set to true. And if var belongs to the range of a higher dimension of the tensor iterated, numTypeFlag is set to true (line [3-10]). The return value of function VARTYPE is compType if the values of sizeTypeFlag and numTypeFlag are both true. Otherwise, the return value is the type whose flag is true (line [11-20]). The variables in varSet are classified by the UPDATE function in Algorithm 5. The current buffered dimension used in function CLASSIFY is updated by function UPDATE. First, UPDATE adjusts the set of variables varSet whose buffer size is not within the loop range (line [36-37]). If the direction of changing is UP and the variable in the current dimension index of the tensor does not exist in varSet, the dimension of the currently buffered tensor is increased. Otherwise, the dimension of the currently buffered tensor is decreased (line [38-54]).
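As a concrete illustration of the parallelization primitive discussed above, the schedule below reproduces the vector example of Figure 4(a) and (c). athread_parallel is the swTVM primitive (not stock TVM); the generated code then iterates over a per-CPE [_begin, _end) range as in Figure 4(d).

    import tvm

    # v = v1 * v2 over 1024 elements, as in Figure 4(a).
    n = 1024
    v1 = tvm.placeholder((n,), name='v1')
    v2 = tvm.placeholder((n,), name='v2')
    v = tvm.compute((n,), lambda i: v1[i] * v2[i], name='v')
    s = tvm.create_schedule(v.op)

    # Split the single axis across the 64 CPEs of a core group: each CPE
    # receives a contiguous chunk of 16 elements (Figure 4(c) and (d)).
    s[v].athread_parallel(v.op.axis[0])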
6 EVALUATION

In this section, we evaluate the performance of the code generated by swTVM. We compare our work with hand-optimized implementations using OpenACC and swCaffe [14] on AlexNet [13] and VGG-19 [20].

6.1 Experimental Setup

swTVM executes on an x86/64 processor running CentOS 7.3. The compilation environment includes gcc/g++ 4.8.5 and Python 3.6.2. We set the batch size to 1 during the evaluation. The code generated by swTVM is a group of C files, which are then compiled by the Sunway native compiler (sw5cc for C and sw5CC for C++) with -O3.

6.2 AlexNet and VGG Evaluation

We compare the execution time of each layer in AlexNet implemented using swTVM, swCaffe and OpenACC respectively. The execution time of the convolution and fully connected layers includes the ReLU layer. For ease of visualization, the values on the Y axis are log scaled. As shown in Figure 5, the performance of the convolution layers generated by swTVM is higher than OpenACC, but lower than swCaffe. This is because swCaffe applies advanced optimization techniques on Sunway such as register communication and instruction re-ordering, which are not yet supported by swTVM. Compared to OpenACC, swTVM achieves better performance due to the efficient DMA data transfers and the utilization of LDM for improved data locality. For the convolution and fully connected layers, swTVM achieves 8.32× and 2.87× speedup on average over OpenACC, with a maximum speedup of 11.8× on the conv1 layer.

Figure 5: The execution time of each layer in AlexNet implemented using swTVM, swCaffe and OpenACC respectively.

From Figure 5 we can see that, for the fully connected layers, the performance of swTVM is better than swCaffe and OpenACC. This is because matrix multiplication is much simpler than convolution. The data sharing through register communication adopted by swCaffe causes additional overhead that outweighs its benefits, whereas swTVM can generate code with optimizations targeting the specific input/output configuration and thus achieves better performance.

TIME( MS, LOG SCALE ) of the neural network. As they pay less attention to the hardware 0 level, it needs significant engineering effort for each hardware and operation combination. And they all not support Sunway architec- tures. TVM [4] proposes the end-to-end compiler for neural networks and now supports various hardware. Recent works such as Glow [19], Figure 6: The execution time of each layer in VGG-19 imple- Tensor Comprehensions [21] and nGraph [8] can all be classified mented using swTVM, swCaffe and OpenACC respectively. into this category. Glow emphasis its two-phase strongly-typed intermediate representation and nGraph pay more attention to how performance. We also notice that the performance of swCaffe and to simplify the connection between neural network frameworks OpenACC is similar, which is due to the reason that OpenACC is and hardware. Tensor Comprehensions provides a language similar able to manipulate the DMA data transfer for simple layer such as to math to describe the neural network and supports optimizing fully connected layer. the computational kernel according to the parameter of neural In addition to AlexNet, we evaluate the performance of VGG-19 networks in JIT mechanism. But they all not support the Sunway implemented using swTVM, swCaffe and OpenACC respectively. architecture. Since the design and configuration of conv3_2, conv3_3 and conv3_4 There are also research works that are extending TVM such as is the same in VGG-19, we combine them into one layer and average VTA [18] and AutoTVM [5]. VTA is proposed to build an end-to-end the execution time across original layers. The four convolution deep learning stack from the high-level deep learning framework layers in conv5 is processed in the same way. As shown in Figure 6, to the actual hardware design and implementation directly. And the performance comparison of the convolution layers and the fully AutoTVM is designed to automatically optimize the tensor operator connected layers in VGG-19 is similar to AlexNet across these three regarding the specific input shape and layout of each neural network implementations. layer. For the convolution layer, the performance of swTVM and swCaffe is similar for the two layers on conv1. Actually swTVM is even bet- ter than swCaffe on conv1_1. This is because the performance of 7.2 Performance Optimization on Sunway swCaffe declines when the size of input and out data is too large to be buffered in LDM. We also notice that even OpenACC is faster As a supercomputer consisting of massive Sunway many-cores than swCaffe on conv1_1. This is because the number of input chan- processors, Sunway TaihuLight achieved the peak performance of nel for conv1_1 is only three, and the convolution kernels are small 125PFlops and ranked the first place in Top500 from 2016 to 2018. enough to be fitted in LDM. The OpenACC implementation takes There are a lot of optimization works targeting the architecture advantage of the above facts to optimize its performance. However, features on Sunway, which are valuable for our work to generate the performance of swTVM is far better than OpenACC due to its high performance code. optimizations using DMA data transfer, LDM buffer and loop re- For applications, Dynamic Model [25] and Earthquake Simula- ordering. For convolution layer and fully connected layer, swTVM tion [10] both win the Gordon Bell Prize of ACM. 
Dynamic Model achieves 6.21× and 2.03× speedup on average over OpenACC with simulates the coarsening dynamics accurately at unprecedented the maximum speedup of 10.15× on conv4_(2,3,4) layer. spatial and time scales, whereas Earthquake Simulation enables In the meantime, we can notice that the performance of swCaffe the simulation of the Tangshan earthquake as 18-Hz scenario with and swTVM is similar in fc6. It indicates that swCaffe achieves better 8-meter resolution. For algorithms, there are plenty of algorithms performance by sharing data through register communication in optimized on Sunway such as BFS [16], SpMV [17] and SpTRSV [15]. large matrix multiplication. This inspires us to add register com- BFS [16] is an essential algorithm in calculating the shortest route munication support into swTVM in order to achieve comparable and the maximum flow problem, and the optimization on Sunway performance as highly optimized libraries on Sunway. achieves 23,755.7 giga-traversed edges per second. Sparse computa- tion such as SpMV is one of the important computational kernels 7 RELATED WORK in scientific applications. The implementation of SpMV[17] on Sunway achieves 15.5× speedup on average over 18 representative 7.1 Deep Learning Compiler datasets. Multi-role SpTRSV [23] assigns different processing roles Currently, the deep learning community develops rapidly. There are to the CPEs that achieves 5.14× speedup on average over 12 rep- always emerging neural network algorithms and hardware devices. resentative datasets. There are also two related works regarding However, the engineering efforts of porting various neural network the neural network on Sunway. swDNN [9] is a neural network li- algorithms to numerous hardware devices increase dramatically. brary customized for Sunway with tremendous engineering efforts. 10 swCaffe [14] proposes a deep learning framework specialized for 2017 IEEE International. IEEE, 615–624. distributed training on Sunway. [10] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, et al. 2017. 18.9- To the best of our knowledge, there is no existing work on the Pflops nonlinear earthquake simulation on : Enabling depiction end-to-end deep learning compiler that exploits the architecture of 18-Hz and 8-meter scenarios. In Proceedings of the International Conference for advantage of Sunway processor. High Performance Computing, Networking, Storage and Analysis. ACM, 2. [11] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolu- tional architecture for fast feature embedding. In Proceedings of the 22nd ACM 8 CONCLUSION AND FUTURE WORK international conference on Multimedia. ACM, 675–678. To improve the productivity and performance of deep neural net- [12] Nikhil Ketkar. 2017. Introduction to pytorch. In Deep Learning with Python. Springer, 195–208. work, new deep learning frameworks and hardware devices are [13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classifica- emerging rapidly. The end-to-end deep learning compiler provides tion with deep convolutional neural networks. In Advances in neural information an efficient way to deploy the deep learning applications tovari- processing systems. 1097–1105. [14] Liandeng Li, Jiarui Fang, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin ous hardware devices without heavy engineering efforts. 
8 CONCLUSION AND FUTURE WORK

To improve the productivity and performance of deep neural networks, new deep learning frameworks and hardware devices are emerging rapidly. The end-to-end deep learning compiler provides an efficient way to deploy deep learning applications to various hardware devices without heavy engineering effort.

To provide such a compiler for the Sunway architecture, this paper proposes swTVM, which implements AOT code generation to address the unique compilation environment on Sunway, such as cross-compilation and the different compilation targets for MPE and CPE. In addition, swTVM adopts several architectural features of Sunway to improve the performance of the generated code. Specifically, a DMA control interface is proposed to better manipulate the data accesses of tensors. Moreover, an LDM management mechanism is designed to buffer data in LDM in order to reduce memory access latency. Finally, a DMA auto-insertion algorithm is proposed to automatically identify the locations for inserting DMA instructions with improved data re-accessibility. The experimental results on AlexNet and VGG-19 show that the code generated by swTVM achieves 6.71× and 2.45× speedup over hand-optimized OpenACC implementations on convolution and fully connected layers respectively. For future work, we would like to exploit other optimization techniques such as register communication and instruction reordering to further improve the performance of the generated code on Sunway.
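As a rough illustration of the data re-accessibility idea behind the DMA auto-insertion algorithm, the C sketch below shows one simple heuristic: walk a loop nest from the outermost level inward and place the DMA load at the outermost level whose tile footprint still fits in LDM, so that the transferred data is reused by all of the inner loops. The footprint numbers, data structure and function names are hypothetical and only illustrate the idea; they are not the algorithm as implemented in swTVM.

```c
#include <stdio.h>
#include <stddef.h>

#define LDM_BUDGET (64 * 1024)   /* per-CPE LDM size in bytes */

/* One loop level of a nest, listed outermost first, together with the
 * number of bytes that would have to reside in LDM if the DMA load
 * were inserted just inside this level. */
typedef struct {
    const char *name;        /* loop variable, e.g. "n", "c", "h" */
    size_t footprint_bytes;  /* tile footprint below this level   */
} LoopLevel;

/* Return the index of the outermost level whose footprint fits in LDM.
 * Inserting the DMA there maximizes reuse of the transferred tile:
 * every iteration of the inner loops re-accesses the same buffer. */
int choose_dma_level(const LoopLevel *nest, int depth) {
    for (int i = 0; i < depth; ++i)
        if (nest[i].footprint_bytes <= LDM_BUDGET)
            return i;
    return depth - 1;   /* fall back to the innermost level */
}

int main(void) {
    /* Hypothetical footprints for a convolution loop nest. */
    LoopLevel nest[] = {
        {"n", 4u << 20},    /* whole batch: too large for LDM        */
        {"c", 256u << 10},  /* all channels of one image: too large  */
        {"h", 48u << 10},   /* one row band: fits in LDM             */
        {"w", 6u << 10},    /* single tile: fits, but with less reuse */
    };
    int level = choose_dma_level(nest, 4);
    printf("insert DMA load just inside loop '%s'\n", nest[level].name);
    return 0;
}
```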
REFERENCES
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, Vol. 16. 265–283.
[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016).
[3] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[4] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.
[5] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. Learning to Optimize Tensor Programs. arXiv preprint arXiv:1805.08166 (2018).
[6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8] Scott Cyphers, Arjun K Bansal, Anahita Bhiwandiwalla, Jayaram Bobba, Matthew Brookhart, Avijit Chakraborty, Will Constable, Christian Convey, Leona Cook, Omar Kanawi, et al. 2018. Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning. arXiv preprint arXiv:1801.08058 (2018).
[9] Jiarui Fang, Haohuan Fu, Wenlai Zhao, Bingwei Chen, Weijie Zheng, and Guangwen Yang. 2017. swDNN: A library for accelerating deep learning applications on Sunway TaihuLight. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 615–624.
[10] Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, et al. 2017. 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2.
[11] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675–678.
[12] Nikhil Ketkar. 2017. Introduction to PyTorch. In Deep Learning with Python. Springer, 195–208.
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[14] Liandeng Li, Jiarui Fang, Haohuan Fu, Jinlei Jiang, Wenlai Zhao, Conghui He, Xin You, and Guangwen Yang. 2018. swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight. In 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 413–422.
[15] Mingzhen Li, Yi Liu, Hailong Yang, Zhongzhi Luan, and Depei Qian. 2018. Multi-role SpTRSV on Sunway Many-Core Architecture. In 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS). IEEE, 594–601.
[16] Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, and Weimin Zheng. 2017. Scalable graph traversal on Sunway TaihuLight with ten million cores. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International. IEEE, 635–645.
[17] Changxi Liu, Biwei Xie, Xin Liu, Wei Xue, Hailong Yang, and Xu Liu. 2018. Towards Efficient SpMV on Sunway Manycore Architectures. In Proceedings of the 2018 International Conference on Supercomputing. ACM, 363–373.
[18] Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. VTA: An Open Hardware-Software Stack for Deep Learning. arXiv preprint arXiv:1807.04188 (2018).
[19] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, et al. 2018. Glow: Graph Lowering Compiler Techniques for Neural Networks. arXiv preprint arXiv:1805.00907 (2018).
[20] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[21] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. 2018. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. arXiv preprint arXiv:1802.04730 (2018).
[22] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel Math Kernel Library. In High-Performance Computing on the Intel® Xeon Phi™. Springer, 167–188.
[23] Xinliang Wang, Wei Xue, Weifeng Liu, and Li Wu. 2018. swSpTRSV: A fast sparse triangular solve with sparse level tile layout on Sunway architectures. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 338–353.
[24] Richard Wei, Lane Schwartz, and Vikram Adve. 2017. DLVM: A modern compiler infrastructure for deep learning systems. arXiv preprint arXiv:1711.03016 (2017).
[25] Jian Zhang, Chunbao Zhou, Yangang Wang, Lili Ju, Qiang Du, Xuebin Chi, Dongsheng Xu, Dexun Chen, Yong Liu, and Zhao Liu. 2016. Extreme-scale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Press, 4.
[26] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23, 10 (2016), 1499–1503.