Exploring the Vision Processing Unit as Co-processor for Inference
Sergio Rivas-Gomez¹, Antonio J. Peña², David Moloney³, Erwin Laure¹, and Stefano Markidis¹
¹KTH Royal Institute of Technology   ²Barcelona Supercomputing Center (BSC)   ³Intel Ireland Ltd.

Abstract—The success of the exascale supercomputer is largely debated to remain dependent on novel breakthroughs in technology that effectively reduce the power consumption and thermal dissipation requirements. In this work, we consider the integration of co-processors in high-performance computing (HPC) to enable low-power, seamless computation offloading of certain operations. In particular, we explore the so-called Vision Processing Unit (VPU), a highly-parallel vector processor with a power envelope of less than 1W. We evaluate this chip during inference using a pre-trained GoogLeNet convolutional network model and a large image dataset from the ImageNet ILSVRC challenge. Preliminary results indicate that a multi-VPU configuration provides similar performance compared to reference CPU and GPU implementations, while reducing the thermal-design power (TDP) up to 8× in comparison.

Keywords-Vision Processing Unit; High-Performance Computing; Machine Learning

I. INTRODUCTION

The recent advances in deep learning and convolutional networks have dramatically influenced the role of machine learning on a wide range of scientific applications [1], [2]. This fact has been motivated by an increase in object classification and detection accuracy [3], [4], alongside better tools for data mining that allow us to understand large datasets of unstructured information [5], [6]. The inference error rate of machine learning algorithms has become remarkably low as well, reaching a state where the capacity of humans has already been surpassed in certain scenarios [7].

As a consequence, there is an existing trend that proposes the integration of data-centric models on HPC that combine specialized hardware with the aim of fulfilling this need [8]. Upcoming major supercomputers are expected to feature new hardware architectures that provide high-performance 16-bit / 32-bit mixed-arithmetic support for machine learning [9], both during training and inference. In addition, innovation at the software level is also observed with the appearance of novel data formats that use tensors with a shared exponent [10], [11], maximizing the dynamic range of the traditional 16-bit floating-point data format. These breakthroughs provide multiple advantages in terms of performance and power consumption. Specifically, some of the aforementioned architectural changes are expected to increase performance 5–10× in comparison with current large-scale HPC clusters, using just twice the power [12]. Hence, it will be of paramount importance for the success of the exascale supercomputer to embrace these developments in the near-term future.

In this work, we set the initial steps towards the integration of low-power co-processors on HPC. In particular, we analyze the so-called Vision Processing Unit (VPU). This type of processor emerges as a category of chips that aim to provide ultra-low power capabilities without compromising performance. For this purpose, we explore the possibilities of the Movidius Myriad 2 VPU [13], [14] during inference in convolutional networks, over a large image dataset from the ImageNet ILSVRC 2012 challenge [15]. In our evaluations, we use a pre-trained network from the Berkeley Vision and Learning Center (BVLC), which follows the GoogLeNet work by Szegedy et al. [3]. Preliminary results indicate that a combination of several of these chips can potentially provide equivalent performance compared to a reference CPU and GPU implementation, while reducing the thermal-design power (TDP) up to 8×. The observed throughput, measured as the number of inferences per Watt, is over 3× higher in comparison. The estimated top-1 error rate is 32% on average, with a confidence error difference of 0.5%.

The contributions of this work are the following:

• We provide a comprehensive technical overview of the Myriad 2 VPU in the context of the Intel Neural Compute Stick (NCS) platform [16].
• We design and implement a small inference framework based on Caffe [17] and the Neural Compute API [18] to support our experiments on the VPU (a minimal sketch of the offloading path follows this list).
• We illustrate that VPUs feature an excellent ratio between throughput and power consumption compared to reference CPU and GPU implementations, including in multi-VPU configurations.
• We compare the top-1 error rate [3] with a reference CPU implementation to understand the implications of using FP16 on the VPU.
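To make the offloading path concrete, the sketch below outlines how such a framework can drive a single NCS device through the Neural Compute API. It is an illustrative outline under stated assumptions, not the exact implementation of Section III: it presumes the NCSDK v1 Python bindings (mvnc) and a hypothetical file name, googlenet.graph, for a graph precompiled from the BVLC GoogLeNet Caffe model; pre-processing and error handling are omitted.

import numpy
from mvnc import mvncapi as mvnc

# Discover and open the first Neural Compute Stick attached to the host.
devices = mvnc.EnumerateDevices()
device = mvnc.Device(devices[0])
device.OpenDevice()

# Load the precompiled GoogLeNet graph onto the Myriad 2 VPU.
with open('googlenet.graph', 'rb') as f:
    graph = device.AllocateGraph(f.read())

def infer(image):
    # The VPU operates on FP16 tensors, so the pre-processed image
    # (224x224x3, mean-subtracted) is converted before the transfer.
    graph.LoadTensor(image.astype(numpy.float16), 'img')
    scores, _ = graph.GetResult()
    return scores

# ... iterate over the ImageNet validation images here ...

graph.DeallocateGraph()
device.CloseDevice()

Note that the host only prepares the input tensor and collects the class scores; the convolutional layers themselves execute on the SHAVE processors of the VPU.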
The paper is organized as follows. We provide a high-level overview of the VPU in Section II. We describe the implementation considerations of a small inference framework in Section III. The experimental setup and performance evaluation are presented in Section IV. We extend the discussion of the results and provide further insights in Section V. Related work is reported in Section VI. A summary of our conclusions and future work is outlined in Section VII.

II. BACKGROUND

The emergence of machine learning and data-centric applications on HPC poses several constraints on general-purpose processors, mainly due to the irregularity of the memory accesses that they feature [19], [20]. These accesses have reduced temporal or spatial locality, incurring long memory stalls and large bandwidth requirements. As a side effect, the power consumption and thermal dissipation requirements considerably increase as well [21]. Thus, during the last decade, scientists have experimented with the integration of novel algorithms that perform dynamic, in-memory data rearrangements of irregular structures [22], [23]. The aim is to overcome (or partially hide) some of the aforementioned limitations.

Figure 1: High-level representation of one of the SHAVE vector processors featured on the Myriad 2 VPU [14]. The Connection Matrix (CMX) enables seamless interaction between the vector processors and other hardware components.

Nonetheless, the inherent complexity of such techniques, coupled with the adoption of the “CPU + Accelerator” model to enhance the performance of scientific applications [24], makes programming general-purpose processors another key factor to consider. In addition, transferring data among these different hardware layers can also become costly [25]. As a consequence, the industry is shifting towards designing processors where cost, power, and thermal dissipation are key concerns [14]. Specialized co-processors have recently emerged with the purpose of reducing the power envelope constraints, while improving the overall performance on scenarios such as machine learning [26]. In this regard, we observe that other scientific fields can benefit from this trend by adopting part of these technologies. In fact, energy consumption in HPC is considered one of the main limiting factors towards the exascale supercomputer [27].

Figure 2: Approximate implementation of the Myriad 2 VPU used within the Neural Compute Stick (NCS) platform [16]. The Neural Compute API allows us to coordinate the execution on the VPU of one or more NCS devices [18].
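As the caption of Figure 2 indicates, the Neural Compute API allows a host to coordinate more than one NCS device. The following sketch shows one plausible way to do so, again assuming the NCSDK v1 Python bindings (mvnc): every attached stick is enumerated, the same compiled graph is loaded on each VPU, and images are dispatched round-robin. The file name and the scheduling policy are our own illustrative choices, not part of the NCS platform.

import numpy
from mvnc import mvncapi as mvnc

# One device name is reported per attached Neural Compute Stick.
device_names = mvnc.EnumerateDevices()

with open('googlenet.graph', 'rb') as f:
    blob = f.read()

devices, graphs = [], []
for name in device_names:
    dev = mvnc.Device(name)
    dev.OpenDevice()
    devices.append(dev)
    # The same network is replicated on every VPU.
    graphs.append(dev.AllocateGraph(blob))

def infer_all(images):
    # Round-robin dispatch across the sticks. The calls are issued serially
    # for brevity; a real framework would overlap LoadTensor and GetResult
    # so that all VPUs stay busy at the same time.
    results = []
    for i, image in enumerate(images):
        graph = graphs[i % len(graphs)]
        graph.LoadTensor(image.astype(numpy.float16), 'img')
        scores, _ = graph.GetResult()
        results.append(scores)
    return results

for graph in graphs:
    graph.DeallocateGraph()
for dev in devices:
    dev.CloseDevice()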
In this section, we briefly describe the most relevant technical aspects of the Movidius Myriad 2 VPU [13], [14] in the context of the Intel Neural Compute Stick (NCS) platform [16]. Our goal is to understand how this type of low-power co-processor could potentially be integrated for computation offloading on HPC.

A. Vision Processing Unit

The Myriad 2 VPU is designed as a 28-nm co-processor that provides high-performance tensor acceleration. The chip dissipates less than 1W [13]. High-level APIs allow application programmers to easily take advantage of its features and, thus, enhance programming productivity. In addition, the software-controlled memory subsystem enables fine-grained control over different workloads, if required. The term “vision” is employed due to the original purpose of the VPU, which was meant to accelerate computer vision applications on the “edge” [28].

The architecture of this chip is inspired by Agarwal’s observation, which states that, beyond a certain frequency limit for any particular design and target process technology, the cost is quadratic in power for linear increases in operating frequency [14]. Hence, the chip is designed featuring 12 highly-parallelizable vector processors, named Streaming Hybrid Architecture Vector Engines (SHAVE). Each SHAVE processor contains wide register files and several functional units, which are controlled by Variable-Length Long Instruction Word (VLLIW) packets, enabling seamless SIMD operations on the chip. The nominal frequency is 600 MHz.

Figure 1 illustrates a high-level diagram of one of the SHAVE processors and the interactions with other components of the Myriad 2 VPU. The main vector register file (VRF) has 128-bit × 32 entries and 12 ports. A general register file (IRF) is also available with 32-bit × 32 entries and 18 ports. Among the functional units of each SHAVE processor, we highlight the 128-bit Vector Arithmetic Unit (VAU), the 128-bit Compare-and-Move Unit (CMU), the 32-bit Scalar Arithmetic Unit (SAU), and the 32-bit Integer Arithmetic Unit (IAU). The chip supports 8, 16, 32, and