Arxiv:1809.02697V3 [Cs.DC] 5 Jul 2019 Computation on Cpus
Total Page:16
File Type:pdf, Size:1020Kb
Optimizing CNN Model Inference on CPUs Yizhi Liu*, Yao Wang*, Ruofei Yu, Mu Li, Vin Sharma, Yida Wang Amazon Web Services {yizhiliu, wayao, yuruofei, mli, vinarm, wangyida}@amazon.com Abstract inference is essentially executing a computation graph con- sisting of operations. In practice, people normally use high- The popularity of Convolutional Neural Network (CNN) mod- performance kernel libraries (e.g. Intel MKL-DNN [27] and els and the ubiquity of CPUs imply that better performance of OpenBlas [51]) to obtain high performance for CNN opera- CNN model inference on CPUs can deliver significant gain tions. While these libraries tune very carefully for common to a large number of users. To improve the performance of operations with normal input data shapes (e.g. 2D convolu- CNN inference on CPUs, current approaches like MXNet tions), they only focus on the (mostly, convolution) operations and Intel OpenVINO usually treat the model as a graph and but miss the opportunities to further optimize the end-to-end use the high-performance libraries such as Intel MKL-DNN model inference at the graph level. The graph-level optimiza- to implement the operations of the graph. While achieving tion is often handled by the deep learning frameworks, e.g. reasonable performance on individual operations from the off- TensorFlow [5] and MXNet [8]. the-shelf libraries, this solution makes it inflexible to conduct optimizations at the graph level, as the local operation-level However, the graph-level optimization such as operation fu- optimizations are predefined. Therefore, it is restrictive and sion and data layout planing that a framework can do is limited misses the opportunity to optimize the end-to-end inference because the operation implementation is already predefined pipeline as a whole. This paper presents NeoCPU, a com- in the third-party libraries. Therefore, the optimizations in the prehensive approach of CNN model inference on CPUs that frameworks do not work in concert with the optimizations employs a full-stack and systematic scheme of optimizations. in the kernel library, which leaves significant performance NeoCPU optimizes the operations as templates without rely- gains unrealized in practice. Furthermore, different CPU ar- ing on third-parties libraries, which enables further improve- chitectures rely on different high-performance libraries and ment of the performance via operation- and graph-level joint integrating a library into a deep learning framework requires optimization. Experiments show that NeoCPU achieves up error-prone and time-consuming engineering effort. Lastly, to 3.45× lower latency for CNN model inference than the although those libraries are highly optimized, they present current state-of-the-art implementations on various kinds of as third-party plug-ins, which may introduce contention is- popular CPUs. sues with other libraries in the framework. As an example, TensorFlow originally used the Eigen library [4] to handle arXiv:1809.02697v3 [cs.DC] 5 Jul 2019 computation on CPUs. Later on, MKL-DNN was also in- 1 Introduction troduced. As a consequence, at runtime MKL-DNN threads coexist with Eigen threads, resulting in resource contention. The growing use of Convolutional Neural Network (CNN) In summary, this kind of framework-specific approach for models in computer vision applications makes this model CNN model inference on CPUs is inflexible, cumbersome, architecture a natural focus for performance optimization and sub-optimal. efforts. Similarly, the widespread deployment of CPUs in Because of the constraint imposed by the framework, opti- servers, clients, and edge devices makes this hardware plat- mizing the performance of CNN model inference end-to-end form an attractive target. Therefore, performing CNN model without involving a framework (i.e. a framework-agnostic inference efficiently on CPUs is of critical interest to many method) is of obvious interest to many deep learning prac- users. titioners. Recently, Intel launched a universal CNN model The performance of CNN model inference on CPUs leaves inference engine called OpenVINO Toolkit [16]. This toolkit significant room for improvement. Performing a CNN model optimizes CNN models in the computer vision domain on In- *Equal contribution tel processors (mostly x86 CPUs) and claims to achieve better performance than the deep learning frameworks alone. Yet, Op-level opt Graph-level opt Joint opt Open-source NeoCPU 3 3 3 3 OpenVINO could only provide limited graph-level optimiza- MXNet [8]/TensorFlow [5] 3rd party limited 7 3 tion (e.g. operation fusion as implemented in ngraph [15]) as OpenVINO [16] 3rd party limited ? 7 Original TVM [9] incomplete 3 7 3 it still relies upon MKL-DNN to deliver performance gains Glow [40] single core 3 7 3 for the carefully-tuned operations. Therefore, the optimiza- tion done by OpenVINO is still not sufficient for most of the Table 1: Side-by-side comparison between NeoCPU and ex- CNN models. isting works on CNN model inference Based on the previous observation, we argue that in order to further improve the CNN model inference performance on • Constructs a template to achieve good performance of CPUs, being able to do the flexible end-to-end optimization is convolutions, which is flexible to apply to various con- the key. In this paper, we propose NeoCPU, a comprehensive volution workloads on multiple CPU architectures (x86 approach to optimize CNN models for efficient inference on and ARM) without relying on high-performance kernel CPUs. NeoCPU is full-stack and systematic, which includes libraries; operation- and graph-level joint optimizations and does not • Designs a global scheme to look for the best layout rely on any third-party high-performance libraries. At the combination in different operations of a CNN model, operation level, we follow the well-studied techniques to op- which minimizes the data layout transformation over- timize the most computationally-intensive operations like head between operations while maintaining the high convolution (CONV) in a template, which is applicable to dif- performance of individual operations. ferent workloads on multiple CPU architectures and enables us for flexible graph-level optimization. At the graph level, in It is worth noting that, this paper primarily deals with direct addition to the common techniques such as operation fusion convolution computation, while NeoCPU is compatible to and inference simplification, we coordinate the individual op- other optimziation works on the computationally-intensive eration optimizations by manipulating the data layout flowing kernels, e.g. CONVs via Winograd [7, 29] or FFT [52]. through the entire model for the best end-to-end performance. We evaluated NeoCPU on CPUs with both x86 and ARM In summary, NeoCPU does the end-to-end optimization in a architectures. In general, NeoCPU delivers the best perfor- flexible and automatic fashion, while the existing works rely mance for 13 out of 15 popular networks on Intel Skylake on third-party libraries and lack comprehensive performance CPUs, 14 out of 15 on AMD EYPC CPUs, and all 15 models tuning. on ARM Cortex A72 CPUs. It is worthwhile noting that the NeoCPU is built upon a deep learning compiler stack baselines on x86 CPUs were more carefully tuned by the chip named TVM [9] with a number of enhancements. TVM vendor (Intel MKL-DNN) but the ARM CPUs were less opti- enables the possibility of using own operation-level opti- mized. While the selected framework-specific (MXNet and mizations instead of third-party high-performance libraries, TensorFlow) and framework-agnostic (OpenVINO) solutions which make it flexible to apply our operation- and graph-level may perform well on one case and less favorably on the other joint optimization. However, there was only one customized case, NeoCPU runs efficiently across models on different operation-level optimization on ARM CPUs for convolutions architectures. with specific data shapes and no operation- and graph-level In addition, NeoCPU produces a standalone module with joint optimization in the original TVM stack before our work. minimal size that does not depend on either the frameworks In addition, there exist other deep learning compilers such as or the high-performance kernel libraries, which enables easy Tensor Comprehensions [46] and Glow [40]. Unfortunately, deployment to multiple platforms. NeoCPU is used in Ama- they either do not target on CPUs or not optimize the CPU zon SageMaker Neo Service 1, enabling model developers performance well, e.g. based on the paper description and our to optimize for inference on CPU-based servers in the cloud own experiments, Glow only optimizes the single-core per- and devices at the edge. Using this service, a number of ap- formance for CPUs. Therefore we do not incorporate those plication developers have deployed CNN models optimized works as the baseline. Table1 summarizes the features of for inference in production on several types of platforms. NeoCPU compared to others. To the best of our knowledge, All source code has been released to the open source TVM NeoCPU achieves competitive performance for CNN model project2. inference on various kinds of popular CPUs. The rest of this paper is organized as follows: Section2 Specifically, this paper makes the following contributions: reviews the background of modern CPUs as well as the typical CNN models; Section3 elaborates the optimization ideas • Provides an operation- and graph-level joint optimiza- that we propose and how we implement them, followed by tion scheme to obtain high CNN model inference perfor- evaluations in Section4. We list the related works in Section5 mance on different popular CPUs including Intel, AMD and summarize the paper in Section6. and ARM, which outperforms the current state-of-the-art 1https://aws:amazon:com/sagemaker/neo/ implementations; 2https://github:com/dmlc/tvm 2 Background model is normally abstracted as a computation graph, essen- tially, Directed Acyclic Graph (DAG), in which a node rep- 2.1 Modern CPUs resents an operation and a directed edge pointing from node X to Y represents that the output of operation X serves as (a Although accelerators like GPUs and TPUs demonstrate their part of) the inputs of operation Y (i.e.