<<

Euphrates: Algorithm-SoC Co-Design for Low-Power Mobile Continuous Vision

Yuhao Zhu1 Anand Samajdar2 Matthew Mattina3 Paul Whatmough3 1University of Rochester 2Georgia Institute of Technology 3ARM Research

Abstract Haar HOG Tiny YOLO 2 SSD YOLO Faster R-CNN 10 Continuous computer vision (CV) tasks increasingly rely on SOTA CNNs 1 convolutional neural networks (CNN). However, CNNs have 10 Compute Capability massive compute demands that far exceed the performance 0 @ 1W Power Budget and energy constraints of mobile devices. In this paper, we 10 propose and develop an algorithm-architecture co-designed -1 Scaled-down 10 CNN Better system, Euphrates, that simultaneously improves the energy- -2 10 efficiency and performance of continuous vision tasks. Hand-crafted Approaches -3 Our key observation is that changes in data between 10 Tera Ops PerSecondOps (TOPS) Tera 0 20 40 60 80 100 consecutive frames represents visual motion. We first propose Accuracy (%) an algorithm that leverages this motion information to relax the number of expensive CNN inferences required by contin- Fig. 1: Accuracy and compute requirement (TOPS) comparison be- uous vision applications. We co-design a mobile System-on- tween object detection techniques. Accuracies are measured against the widely-used PASCAL VOC 2007 dataset [32], and TOPS is based a-Chip (SoC) architecture to maximize the efficiency of the on the 480p (640×480) resolution common in smartphone cameras. new algorithm. The key to our architectural augmentation is to co-optimize different SoC IP blocks in the vision pipeline col- requirements measured in Tera Operations Per Second (TOPS) lectively. Specifically, we propose to expose the motion data as well as accuracies between different detectors under 60 that is naturally generated by the Image Signal Processor (ISP) frames per second (FPS). As a reference, we also overlay the early in the vision pipeline to the CNN engine. Measurement 1 TOPS line, which represents the peak compute capability and synthesis results show that Euphrates achieves up to 66% that today’s CNN accelerators offer under a typical 1 W mo- SoC-level energy savings (4× for the vision computations), bile power budget [21,41]. We find that today’s CNN-based with only 1% accuracy loss. approaches such as YOLOv2 [98], SSD [85], and Faster R- CNN [99] all have at least one order of magnitude higher 1. Introduction compute requirements than accommodated in a mobile device. Computer vision (CV) is the cornerstone of many emerging Reducing the CNN complexity (e.g., Tiny YOLO [97], which application domains, such as advanced driver-assistance sys- is a heavily truncated version of YOLO with 9/22 of its layers) tems (ADAS) and augmented reality (AR). Traditionally, CV or falling back to traditional hand-crafted features such as algorithms were dominated by hand-crafted features (e.g., Haar [61] and HOG [113] lowers the compute demand, which, Haar [109] and HOG [55]), coupled with a classifier such as a however, comes at a significant accuracy penalty. support vector machine (SVM) [54]. These algorithms have The goal of our work is to improve the compute efficiency of low complexity and are practical in constrained environments, continuous vision with small accuracy loss, thereby enabling but only achieve moderate accuracy. Recently, convolutional new mobile use cases. The key idea is to exploit the motion neural networks (CNNs) have rapidly displaced hand-crafted information inherent in real-time videos. 
Specifically, today’s feature extraction, demonstrating significantly higher accuracy continuous vision algorithms treat each frame as a standalone on a range of CV tasks including image classification [104], entity and thus execute an entire CNN inference on every object detection [85, 97, 99], and visual tracking [56, 91]. frame. However, pixel changes across consecutive frames are This paper focuses on continuous vision applications that not arbitrary; instead, they represent visual object motion. We extract high-level semantic information from real-time video propose a new algorithm that leverages the temporal pixel streams. Continuous vision is challenging for mobile archi- motion to synthesize vision results with little computation tects due to its enormous compute requirement [117]. Using while avoiding expensive CNN inferences on many frames. object detection as an example, Fig.1 shows the compute Our main architectural contribution in this paper is to co-

1 design the mobile SoC architecture to support the new algo- Frontend Backend rithm. Our SoC augmentations harness two insights. First, we Camera ISP CNN Sensor Accelerator can greatly improve the compute efficiency while simplifying (~150 mW) (~150 mW) (~700 mW) the architecture design by exploiting the synergy between dif- Image Image Signal RGB/ Computer Semantic

Hardware RAW ferent SoC IP blocks. Specifically, we observe that the pixel Sensing Processing YUV Vision Results motion information is naturally generated by the ISP early in Bayer Domain Conversion RGB Domain Semantic Understanding the vision pipeline owing to ISP’s inherent algorithms, and Dead Pixel Color … Demosaic … … Detection Tracking … thus can be obtained with little compute overhead. We aug- Correction Balance Software ment the SoC with a lightweight hardware extension that ex- poses the motion information to the vision engine. In contrast, Fig. 2: A typical continuous computer vision pipeline. prior work extracts motion information manually, either offline frontend where ISPs are increasingly incorporating motion es- from an already compressed video [45,116], which does not timation, which we exploit in this paper (Sec. 2.2). Finally, we apply to real-time video streams, or by calculating the motion briefly describe the block-based motion estimation algorithm information at runtime at a performance cost [73, 101]. and its data structures that are used in this paper (Sec. 2.3). Second, although the new algorithm is light in compute, implementing it in software is energy-inefficient from a system 2.1. The Mobile Continuous Vision Pipeline perspective because it would frequently wake up the CPU. We The continuous vision pipeline consists of two parts: a fron- argue that always-on continuous computer vision should be tend and a backend, as shown in Fig.2. The frontend prepares task-autonomous, i.e., free from interrupting the CPU. Hence, pixel data for the backend, which in turn extracts semantic we introduce the concept of a motion controller, which is a new information for high-level decision making. hardware IP that autonomously sequences the vision pipeline The frontend uses (off-chip) camera sensors to capture light and performs motion extrapolation—all without interrupting and produce RAW pixels that are transmitted to the mobile the CPU. The motion controller’s microarchitecture resembles SoC, typically over the MIPI camera serial interface (CSI) [20]. a micro-controller, and thus incurs very low design cost. Once on-chip, the Image Signal Processor (ISP) transforms We develop Euphrates, a proof-of-concept system of our the RAW data in the Bayer domain to pixels in the RGB/YUV algorithm-SoC co-designed approach. We evaluate Euphrates domain through a series of algorithms such as dead pixel on two tasks, object tracking and object detection, that are correction, demosacing, and white-balancing. In architecture critical to many continuous vision scenarios such as ADAS terms, the ISP is a specialized IP block in a mobile SoC, and AR. Based on real hardware measurements and RTL im- organized as a pipeline of mostly stencil operations on a set of plementations, we show that Euphrates doubles the object local SRAMs (“line-buffers”). The vision frontend typically detection rate while reducing the SoC energy by 66% at the stores frames in the main memory for communicating with the cost of less than 1% accuracy loss; it also achieves 21% SoC vision backend due to the large size of the image data. energy saving at about 1% accuracy loss for object tracking. The continuous vision backend extracts useful informa- In summary, we make the following contributions: tion from images through semantic-level tasks such as object • To our knowledge, we are the first to exploit sharing motion detection. Traditionally, these algorithms are spread across data across the ISP and other IPs in an SoC. 
DSP, GPU, and CPU. Recently, the rising compute intensity • We propose the Motion Controller, a new IP that au- of CNN-based algorithms and the pressing need for energy- tonomously coordinates the vision pipeline during CV tasks, efficiency have urged mobile SoC vendors to deploy dedicated enabling “always-on” vision with very low CPU load. CNN accelerators. Examples include the Neural Engine in the • We model a commercial mobile SoC, validated with hard- iPhoneX [4] and the CNN co-processor in the HPU [29]. ware measurements and RTL implementations, and achieve Task Autonomy During continuous vision tasks, different significant energy savings and frame rate improvement. SoC components autonomously coordinate among each other The remainder of the paper is organized as follows. Sec.2 with minimal CPU intervention, similar to during phone calls introduces the background. Sec.3 and Sec.4 describe the or music playback [90]. Such a task autonomy frees the CPU motion-based algorithm and the co-designed architecture, re- to either run other OS tasks to maintain system responsiveness, spectively. Sec.5 describes the evaluation methodology, and or stay in the stand-by mode to save power. As a comparison, Sec.6 quantifies the benefits of Euphrates. Sec.7 discusses the typical power consumption of an image sensor, an ISP, and limitations and future developments. Sec.8 puts Euphrates in a CNN accelerator combined is about 1 W (refer to Sec. 5.1 the context of related work, and Sec.9 concludes the paper. for more details), whereas the CPU cluster alone can easily 2. Background and Motivation consume over 3 W [68, 82]. Thus, we must maintain task autonomy by minimizing CPU interactions when optimizing We first give an overview of the continuous vision pipeline energy-efficiency for continuous vision applications. from the software and hardware perspectives (Sec. 2.1). In Object Tracking and Detection This paper focuses on particular, we highlight an important design trend in the vision two continuous vision tasks, object tracking and detection, as

2 they are key enablers for emerging mobile application domains x such as AR [16] and ADAS [24]. Such CV tasks are also prior- y itized by commercial IP vendors, such as ARM, who recently MV = announced standalone Object Detection/Tracking IP [9]. Object tracking involves localizing a moving object across frames by predicting the coordinates of its bounding box, also L Best match in frame N-1 known as a region of interest (ROI). Object detection refers to d simultaneous object classification and localization, usually for Macroblock multiple objects. Both detection and tracking are dominated by in frame N Search window CNN-based techniques. For instance, the state-of-the-art ob- 2d + 1 ject detection network YOLO [97] achieves 37% higher accu- (a) Block-matching. (b) Motion vectors. racy than the best non-CNN based algorithm DPM [65]. Sim- ilarly, CNN-based tracking algorithms such as MDNet [91] Fig. 3: Motion estimation. (a): Block-matching example in a (2d + have shown over 20% higher accuracy compared to classic 1) × (2d + 1) search window. L is the macroblock size. (b): Each hand-crafted approaches such as KCF [72]. arrow reprensents an MB’s motion vector. MBs in the foreground object have much more prounced motions than the background MBs. 2.2. Motion Estimation in ISPs The key idea of BM is to divide a frame into multiple L × L Perceived imaging quality has become a strong product dif- macroblocks (MB), and search in the previous frame for the ferentiator for mobile devices. As such, ISPs are starting to closest match for each MB using Sum of Absolute Differences integrate sophisticated computational photography algorithms (SAD) of all L2 pixels as the matching metric. The search is that are traditionally performed as separate image enhance- performed within a 2-D search window with (2d +1) pixels in ment tasks, possibly off-line, using CPUs or GPUs. A clas- both vertical and horizontal directions, where d is the search sic example is recovering high-dynamic range (HDR) [57], range. Fig. 3a illustrates the basic concepts. which used to be implemented in software (e.g., in Adobe Different BM strategies trade-off search accuracy with com- Photoshop [26]) but is now directly built into many consumer pute efficiency. The most accurate approach is to perform an camera ISPs [8, 10, 13, 30]. exhaustive search (ES) within a search window, which requires Among new algorithms that ISPs are integrating is motion L2 · (2d + 1)2 arithmetic operations per MB. Other BM vari- estimation, which estimates how pixels move between con- ants trade a small accuracy loss for computation reduction. For secutive frames. Motion estimation is at the center of many instance, the classic three step search (TSS) [79] searches only imaging algorithms such as temporal denoising, video stabi- part of the search window by decreasing d in logarithmic steps. lization (i.e., anti-shake), and frame upsampling. For instance, TSS simplifies the amount of arithmetic operations per MB to a temporal denoising algorithm [75,84] uses pixel motion in- 2 L ·(1+8·log2(d +1)), which is a 8/9 reduction under d = 7. formation to replace noisy pixels with their noise-free counter- We refer interested readers to Jakubowski and Pastuszak [74] parts in the previous frame. Similarly, frame upsampling [52] for a comprehensive discussion of BM algorithms. can artificially increase the frame rate by interpolating new Eventually, BM calculates a motion vector (MV) for each frames between successive real frames based on object motion. 
MB, which represents the location offset between the MB and Motion-based imaging algorithms are traditionally per- its closest match in the previous frame as illustrated in Fig. 3a. formed in GPUs or CPUs later in the vision pipeline, but Critically, this offset can be used as an estimation of the MB’s they are increasingly subsumed into ISPs to improve compute motion. For instance, an MV for an MB at location efficiency. Commercial examples of motion-enabled cam- indicates that the MB is moved from location in the previous frame. Fig. 3b visualizes the motion vectors ISP [28], and products from Hikvision [18], ASICFPGA [10], in a frame. Note that MVs can be encoded efficiently. An MV PX4FLOW [27], Pinnacle Imaging Systems [13], and Cent- requires dlog2(2d + 1)e bits for each direction, which equates eye [34], just to name a few based on public information. to just 1 byte of storage under a typical d of seven. Today’s ISPs generate motion information internally and discard it after the frame is processed. Instead, we keep the 3. Motion-based Continuous Vision Algorithm motion information from each frame and expose it at the SoC- level to inmprove the efficiency of the vision backend. The key idea of Euphrates is that pixel changes across frames directly encode object motion. Thus, pixel-level temporal 2.3. Motion Estimation using Block Matching motion information can be used to simplify continuous vi- Among various motion estimation algorithms, block-matching sion tasks through motion extrapolation. This section first (BM) [74] is widely used in ISP algorithms due to its balance provides an overview of the algorithm (Sec. 3.1). We then between accuracy and efficiency. Here, we briefly introduce discuss two important aspects of the algorithm: how to extrap- its algorithms and data structures that we will refer to later. olate (Sec. 3.2) and when to extrapolate (Sec. 3.3).

3 3.1. Overview Inference Extrapolation Inference Extrapolation Extrapolation (I-Frame) (E-Frame) (I-Frame) (E-Frame) (E-Frame) Euphrates makes a distinction between two frame types: In- ference frame (I-frame) and Extrapolation frame (E-frame). An I-frame refers to a frame where vision computation such t0 t1 t2 t3 t4 as detection and tracking is executed using expensive CNN Extrapolation Window = 2 Extrapolation Window = 3 inference with the frame pixel data as input. In contrast, an Fig. 4: Vision computation results such as Region of Interest (ROIs) E-frame refers to a frame where visual results are generated are generated using CNN inference in I-frames. ROIs in E-frames are by extrapolating ROIs from the previous frame, which itself extrapolated from the previous frame using motion information. could either be an I-frame or an E-frame. Fig.4 illustrates this process using object tracking as an example. Rectangles in the during block-matching. Intuitively, a higher SAD value indi- figure represent the ROIs of a single tracked object. Frames cates a lower confidence, and vice versa. Equ.2 formulates the i at t0 and t2 are I-frames with ROIs generated by full CNN confidence calculation, where SADF denotes the SAD value th inference. On the other hand, ROIs in frames at t1, t3, and t4 of the i macroblock in frame F, and L denotes the dimension are extrapolated from their corresponding preceding frames. of the macroblocks. Effectively, we normalize an MV’s SAD Intuitively, increasing the ratio of E-frames to I-frames re- to the maximum possible value (i.e., 255 × L2) and regulate i duces the number of costly CNN inferences, thereby enabling the resultant confidence (αF ) to fall between [0,1]. We then higher frame rates while improving energy-efficiency. How- derive the confidence for an ROI by averaging the confidences ever, this strategy must have little accuracy impact to be useful. of all the MVs encapsulated by the ROI. As such, the challenge of our algorithm is to strike a balance The confidence value can be used to filter out the impact of between accuracy and efficiency. We identify two aspects that very noisy MVs associated with a given ROI. This is achieved affect the accuracy-efficiency trade-off: how to extrapolate by assigning a high weight to the average MV in the current from previous frame, and when to perform extrapolation. frame (µF ), if its confidence is high. Otherwise, the motion from previous frames is emphasized. In a recursive fashion, 3.2. How to Extrapolate this is achieved by directly weighting the contribution of the The goal of extrapolation is to estimate the ROI(s) for the cur- current MV against the average Equ.3, where MVF denotes rent frame without CNN inference by using the motion vectors the final motion vector for frame F, MVF−1 denotes the motion generated by the ISP. Our hypothesis is that the average motion vector for the previous frame, and β is the filter coefficient of all pixels in a visual field can largely estimate the field’s that is determined by α. Empirically, it is sufficient to use global motion. Thus, the first step in the algorithm calculates a simple piece-wise function that sets β to α if α is greater the average motion vector (µ) for a given ROI according to than a threshold, and to 0.5 otherwise. The final motion vector Equ.1, where N denotes the total number of pixels bounded by (MVF ) is composed with the ROI in the previous frame to −→ th the ROI, and vi denotes the motion vector of the i bounded update its new location. 
That is: RF = RF−1 + MVF . pixel. It is important to note that the ISP generates MVs at Handle Deformations Using one global average motion a macroblock-granularity, and as such each pixel inherits the essentially treats an entire scene as a rigid object, which, how- MV from the MB it belongs to. Sec. 6.3 will show that the ever, ignores non-rigid deformations. For instance, the head MV granularity has little impact on accuracy. and the arms of a running athlete have different motions that must be taken into account. Inspired by the classic deformable N −→ parts model [65], we divide an ROI into multiple sub-ROIs µ = ∑i vi / N (1) SADi and apply extrapolation using the above scheme (i.e., Equ.1- αi = 1 − F (2) Equ.3) for each sub-ROI. In this way, we allow each sub-ROI F 255 × L2 to move in different directions with different magnitudes. As a MVF = β · µF + (1 − β) · MVF−1 (3) result, we get several disconnected sub-ROIs. We then derive the final ROI by calculating the minimal bounding box that Extrapolating purely based on average motion, however, encapsulates all the extrapolated sub-ROIs. has two downsides: it is vulnerable to motion vector noise and Computation Characteristics Our algorithm is very effi- it does not consider object deformation. cient to compute. Consider a typical ROI of size 100×50, the Filtering Noisy Motion Vectors The block-based motion extrapolation step requires only about 10 K 4-bit fixed-point estimation algorithm introduces noise in the MVs. For in- operations per frame, several orders of magnitude fewer than stance, when a visual object is blurred or occluded (hidden), the billions of operations required by CNN inferences. the block-matching algorithm may not be able to find a good match within the search window. 3.3. When to Extrapolate To quantify the noise in an MV, we associate a confidence value with each MV. We empirically find that this confidence Another important aspect of the extrapolation algorithm is is highly correlated with the SAD value, which is generated to decide which frames to execute CNN inference on, and

4 SRAM SRAM SRAM Continuous Vision Application

Camera MIPI CSI Sensor CNN Motion CPU Vision Library ISP Sensor Interface Engine Controller (Host)

Raw Hardware Abstraction Layer ROIs ROIs Sensor Labels Labels Data Camera NNX MC SoC Interconnect HAL HAL HAL RGB Motion Legend Frame Vectors Device Driver

Frontend er Results Job

ff Camera NNX MC Pixel Data Metadata DRAM Buffer Descriptor Backend Frame Bu Driver Driver Driver

Fig. 5: Block diagram of the augmented continuous vision subsystem in a mobile SoC. Fig. 6: Vision software stack with modifications shaded. which frames to extrapolate. To simplify the discussion, we 4.1. Design Philosophy and System Overview introduce the notion of an Extrapolation Window (EW), which is the number of consecutive frames between two I-frames Design Principles Two principles guide our SoC design. (exclusive) as shown in Fig.4. Intuitively, as EW increases, First, the vision pipeline in the SoC must act autonomously, the compute efficiency improves, but errors introduced by to avoid constantly interrupting the CPU which needlessly extrapolation also start accumulating, and vice versa. There- burns CPU cycles and power (Sec. 2.1). This design principle fore, EW is an important knob that determines the trade-off motivates us to provide SoC architecture support, rather than between compute efficiency and accuracy. Euphrates provides implementing the new algorithm in software, because the latter two modes for setting EW: constant mode and adaptive mode. would involve the CPU in every frame. Constant mode sets EW statically to a fixed value. This pro- Second, the architectural support for the extrapolation func- vides predictable, bounded performance and energy-efficiency tionality should be decoupled from CNN inference. This de- improvements. For instance, under EW = 2, one can estimate sign principle motivates us to propose a separate IP to support that the amount of computation per frame is reduced by half, the new functionality rather than augmenting an existing CNN translating to 2× performance increase or 50% energy savings. accelerator. The rationale is that CNN accelerators are still However, the constant mode can not adapt to extrapolation evolving rapidly with new models and architectures constantly inaccuracies. For instance, when an object partially enters emerging. Tightly coupling our algorithm with any particular the frame, the block-matching algorithm will either not be CNN accelerator is inflexible in the long term. Our parti- able to find a good match within its search window or find a tioning accommodates future changes in inference algorithm, numerically-matched MB, which, however, does not represent hardware IP (accelerator, GPU, etc.), or even IP vendor. the actual motion. In such case, CNN inference can provide a System Overview Fig.5 illustrates the augmented mobile more accurate vision computation result. SoC architecture. In particular, we propose two architectural We introduce a dynamic control mechanism to respond to extensions. First, motivated by the synergy between the var- inaccuracies introduced by motion extrapolation. Specifically, ious motion-enabled imaging algorithms in the ISP and our whenever a CNN inference is triggered, we compare its results motion extrapolation CV algorithm, we augment the ISP to with those obtained from extrapolation. If the difference is expose the motion vectors to the vision backend. Second, larger than a threshold, we incrementally reduce EW; similarly, to coordinate the backend under the new algorithm without if the difference is consistently lower than the threshold across significant CPU intervention, we propose a new hardware several CNN invocations, we incrementally increase EW. IP called the motion controller. The frontend and backend The two modes are effective in different scenarios. The communicate through the system interconnect and DRAM. 
constant mode is useful when facing a hard energy or frame Our proposed system works in the following way. The CPU rate bound. When free of such constraints, adaptive mode initially configures the IPs in the vision pipeline, and initiates improves compute efficiency with little accuracy loss. a vision task by writing a job descriptor. The camera sensor module captures real-time raw images, which are fed into the 4. Architecture Support ISP. The ISP generates, for each frame, both pixel data and In this section, we start from a state-of-the-art mobile SoC, metadata that are transferred to an allocated frame buffer in and show how to co-design the SoC architecture with the pro- DRAM. The motion vectors and the corresponding confidence posed algorithm. After explaining our design philosophy and data are packed as part of the metadata in the frame buffer. providing an overview (Sec. 4.1), we describe the hardware The motion controller sequences operations in the backend augmentations required in the frontend (Sec. 4.2) and backend and coordinates with the CNN engine. It directs the CNN of the vision subsystem (Sec. 4.3). Finally, we discuss the engine to read image pixel data to execute an inference pass software implications of our architecture extensions (Sec. 4.4). for each I-frame. The inference results, such as predicted

5 ISP Pipeline CNN Engine Motion Controller Temporal Denoising Stage Noisy Denoised MMap Frame Motion Motion Frame Color Sequencer (FSM) Sequencer (FSM) Demosaic Regs Estimation Compensation Balance Scalar Unit ROI Selection Prev. Prev. (ACT, Pooling) 5 Denoised Extrapolation Unit Noisy SRAM ISP Motion New Frame Frame SRAM Scalar ROI 1 Sequencer Vector 4 4-Way ISP Internal Buffer Systolic MAC 2 Buffer SIMD Unit Interconnect Unit Array MVs 3 ROI Conf Frame Buffer Legend DMA (DRAM) DMA MMap Internal MV Flow DMA ROI Base

Regs Addrs

SoC Winsize Interconnect Write-back MV Flow 6

Fig. 7: Euphrates augments the ISP to expose the motion vectors to the rest of the SoC with lightweight hardware extensions. The motion SoC Interconnect vector traffic is off the critical path and has little performance impact. Fig. 8: Euphrates adds the motion controller to the vision backend, alongside an existing, unmodified CNN inference accelerator. Dash ROIs and possibly classification labels for detected objects, lines are control signals and solid lines represent the data flow. are written to dedicated memory mapped registers in the mo- tion controller through the system interconnect. The motion in the TD stage as the DMA buffer. However, this strategy controller combines the CNN inference data and the motion would stall the ISP pipeline due to SRAM resource contention. vector data to extrapolate the results for E-frames. This is because the ISP’s local SRAMs are typically sized to precisely for the data storage required, thanks to the determin- 4.2. Augmenting the Vision Frontend istic data-flow in imaging algorithms. Instead, to take the MV Motion vectors (MVs) are usually reused internally within write traffic off the critical path, we propose to double-buffer an ISP and are discarded after relevant frames are produced. the SRAM in the TD stage at a slight cost in area overhead. Euphrates exposes MVs to the vision backend, which reduces In this way, the DMA engine opportunistically initiates the the overall computation and simplifies the hardware design. MV write-back traffic as it sees fit, effectively overlapping the To simplify the discussion, we assume that the motion vec- write-back with the rest of the ISP pipeline. tors are generated by the temporal denoising (TD) stage in 4.3. Augmenting the Vision Backend an ISP. Fig.7 illustrates how the existing ISP pipeline is aug- mented. The motion estimation block in the TD stage calcu- We augment the vision backend with a new IP called the mo- lates the MVs and buffers them in a small local SRAM. The tion controller. It’s job is two-fold. First, it executes the motion motion compensation block then uses the MVs to denoise the extrapolation algorithm. Second, it coordinates the CNN en- current frame. After the current frame is temporally-denoised, gine without interrupting the CPU. The CNN accelerator is left the corresponding SRAM space can be recycled. In augment- unmodified, with the same interface to the SoC interconnect. ing the ISP pipeline to expose the MV data, we must decide: We design the motion controller engine as a micro- 1) by what means the MVs are exposed to the system with controller (µC) based IP, similar to many sensor co-processors minimal design cost, and 2) how to minimize the performance such as Apple’s Motion Co-processor [3]. It sits on the system impact on the ISP pipeline. interconnect, alongside the CNN accelerator. The main dif- Piggybacking the Frame Buffer We propose to expose ference between our IP and a conventional µC such as ARM the MVs by storing them in the metadata section of the frame Cortex-M series [7] is that the latter typically does not have buffer, which resides in the DRAM and is accessed by other on-chip caches. Instead, our extrapolation engine is equipped SoC IPs through the existing system memory management with an on-chip SRAM that stores the motion vectors and unit. This augmentation is implemented by modifying the is fed by a DMA engine. In addition, the IP replaces the ISP’s sequencer to properly configure the DMA engine. 
conventional instruction fetch and decode mechanisms with Piggybacking the existing frame buffer mechanism rather a programmable sequencer, which reduces energy and area, than adding a dedicated link between the ISP and the vision while still providing programmability to control the datapath. backend has the minimum design cost with negligible memory Fig.8 shows the microarchitecture of the extrapolation con- traffic overhead. Specifically, a 1080p frame (1920 × 1080) troller. Important data and control flows in the figure are with a 16 × 16 macroblock size will produce 8,100 motion numbered. The motion controller is assigned the master role vectors, equivalent to only about 8 KB per frame (Recall and the CNN engine acts as a slave in the system. The master from Sec. 2.3 that each motion vector can be encoded in one IP controls the slave IP by using its sequencer to program the byte), which is a very small fraction of the 6 MB frame pixel slave’s memory-mapped registers ( 1 and 2 in Fig.8). The data that is already committed to the frame buffer. slave IP always returns the computation results to the master Taking MV Traffic off the Critical-path A naive design IP ( 3 ) instead of directly interacting with the CPU. We choose to handle MV write-back might reuse the existing local SRAM this master-slave separation, instead of the other way around,

6 because it allows us to implement all the control logics such Table 1: Details about the modeled vision SoC. as adaptive EW completely in the extrapolator engine without making assumptions about the CNN accelerator’s internals. Component Specification The core of the motion controller’s datapath is an extrapola- Camera Sensor ON Semi AR1335, 1080p @ 60 FPS tion unit which includes a SIMD unit and a scalar unit. The ISP 768 MHz, 1080p @ 60 FPS extrapolation operation is highly parallel (Sec. 3.2), making 24 × 24 systolic MAC array NN Accelerator SIMD a nature fit. The scalar unit is primarily responsible 1.5 MB double-buffered local SRAM (NNX) for generating two signals: one that controls the EW size in 3-channel, 128bit AXI4 DMA Engine the adaptive mode ( 4 ) and the other that chooses between 4-wide SIMD datapath inferenced and extrapolated results ( 5 ). The IP also has a set Motion Controller 8KB local SRAM buffer of memory-mapped registers that are programmed by the CPU (MC) 3-channel, 128bit AXI4 DMA Engine initially and receive CNN engine’s inference results ( 6 ). DRAM 4-channel LPDDR3, 25.6 GB/s peak BW 4.4. Software Implications ory power is obtained through the VDD_SYS_DDR power rail. Euphrates makes no changes to application developers’ pro- We develop RTL models or refer to public data sheets when gramming interface and the CV libraries. This enables soft- direct measurement is unavailable. Below we discuss how ware backward compatibility. Modifications to the Hardware major components are modeled. Abstraction Layer (HAL) and device drivers are required. Image Sensor We model the AR1335 [6], an image sen- We show the computer vision software stack in Fig.6 with sor used in many mobile vision systems including the Nvidia enhancements shaded. The new motion controller (MC) needs Jetson TX1/TX2 modules [15]. In our evaluation, we primar- to be supported at both the HAL and driver layer. In addition, ily consider the common sensing setting that captures 1920 the camera driver is enhanced to configure the base address × 1080 (1080p) videos at 60 FPS. AR1335 is estimated to of motion vectors. However, the camera HAL is left intact consume 180 mW of power under this setting [6], representa- because motion vectors need not be visible to OS and pro- tive of common digital camera sensors available on the mar- grammers. We note that the CNN engine (NNX) driver and ket [5,23, 25]. We directly integrate the power consumptions HAL are also unmodified because of our design decision to reported in the data sheet into our modeling infrastructure. assign the master role to the motion controller IP. ISP We base our ISP modeling on the specifications of the 5. Implementation and Experimental Setup ISP carried on the Jetson TX2 board. We do not model the ISP’s detailed microarchitecture but capture enough informa- This section introduces our hardware modeling methodology tion about its memory traffic. We make this modeling decision (Sec. 5.1) and software infrastructure (Sec. 5.2). because our lightweight ISP modification has little effect on the ISP datapath, but does impact memory traffic (Sec. 4.2). 5.1. Hardware Setup The simulator generates ISP memory traces given a particu- We develop an in-house simulator with a methodology similar lar resolution and frame rate. The memory traces are then to the GemDroid [51] SoC simulator. The simulator includes a interfaced with the DRAM simulator as described later. 
functional model, a performance model, and an power model We take direct measurement of the ISP on TX2 for power for evaluating the continuous vision pipeline. The functional estimation. Under 1080p resolution at 60 FPS, the ISP is model takes in video streams to mimic real-time camera cap- measured to consume 153 mW, comparable to other industrial ture and implements the extrapolation algorithm in OpenCV, ISPs [2]. We could not obtain any public information as to from which we derive accuracy results. The performance whether the TX2 ISP performs motion estimation. According model captures the timing behaviors of various vision pipeline to Sec. 2.3, a 1080p image requires about 50 million arithmetic components including the camera sensor, the ISP, the CNN ac- operations to generate motion vectors, which is about 2.5% celerator, and the motion controller. It then models the timing compute overhead compared to a research ISP [70], and will of cross-IP activities, from which we tabulate SoC events that be much smaller compared to commercial ISPs. We thus are fed into the power model for energy estimation. conservatively factor in 2.5% additional power. Whenever possible, we calibrate the power model by mea- Neural Network Accelerator We develop a systolic array- suring the Nvidia Jetson TX2 module [19], which is widely based CNN accelerator and integrate it into our evaluation used in mobile vision systems. TX2 exposes several SoC and infrastructure. The design is reminiscent of the Tensor board level power rails to a Texas Instruments INA 3221 volt- Processing Unit (TPU) [77], but is much smaller, as befits the age monitor IC, from which we retrieve power consumptions mobile budget [95]. through the I2C interface. Specifically, the ISP power is ob- The accelerator consists of a 24 × 24 fully-pipelined tained from taking the differential power at the VDD_SYS_SOC Multiply-Accumulate array clocked at 1GHz, representing power rail between idle and active mode, and the main mem- a raw peak throughput of 1.152 TOPS. A unified, double-

7 Table 2: Summary of benchmarks. GOPS for each neural network is poses, we also evaluate a scaled-down version of YOLOv2 estimated under the 60 FPS requirement. Note that our baseline CNN called Tiny YOLO. At the cost of 20% accuracy loss [38], accelerator provides 1.15 TOPS peak compute capability. Tiny YOLO reduces the compute requirement to 675 GOPS, which is within the capability of our CNN accelerator. Application Neural Total GOPS Benchmark We evaluate object detection using an in-house video Domain Network Frames dataset. We could not use public object detection benchmarks Object Tiny YOLO 675 In-house Video 7,264 (such as Pascal VOC 2007 [32]) because they are mostly com- Detection YOLOv2 3423 Sequences posed of standalone images without temporal correlation as in Object OTB 100 59,040 MDNet 635 real-time video streams. Instead, we capture a series of videos Tracking VOT 2014 10,213 and extract image sequences. Each image is then manually annotated with bounding boxes and labels. The types of object buffered SRAM array holds both weights and activations and classes are similar to the ones in Pascal VOC 2007 dataset. is 1.5 MB in total. A scalar unit is used to handle per-activation Overall, each frame contains about 6 objects and the whole tasks such as scaling, normalization, activation, and pooling. dataset includes 7,264 frames, similar to the scale of Pascal We implement the accelerator in RTL and synthesize, place, VOC 2007 dataset. We plan to release the dataset in the future. and route the design using Synposys and Cadence tools in a We use the standard Intersect-over-Union (IoU) score as 16nm process technology. Post-layout results show an area of an accuracy metric for object detection [64, 83]. IoU is the 2 1.58 mm and a power consumption of 651 mW. This is equiv- ratio between the intersection and the union area between the alent to a power-efficiency of 1.77 TOPS/W, commensurate predicted ROI and the ground truth. A detection is regarded with recent mobile-class CNN accelerators that demonstrate as a true positive (TP) if the IoU value is above a certain around 1~3 TOPs/W [41, 42, 49, 58, 89, 103] in silicon. threshold; otherwise it is regarded as a false positive (FP). To facilitate future research, we open-source our cycle- The final detection accuracy is evaluated as TP/(TP + FP) accurate simulator of the systolic array-based DNN accelerator, across all detections in all frames, also known as the average which can provide performance, power and area requirements precision (AP) score in object detection literature [64]. for a parameterized accelerator on a given CNN model [100]. Visual Tracking Scenario We evaluate a state-of-the-art, Motion Controller The motion controller has a light com- CNN-based tracker called MDNet [91], which is the win- pute requirement, and is implemented as a 4-wide SIMD dat- ner of the Video Object Tracking (VOT) challenge [36]. Ta- apath with 8 KB local data SRAM. The SRAM is sized to ble2 shows that compared to object detection, tracking is less hold the motion vectors for one 1080p frame with a 16 × 16 compute-intensive and can achieve 60 FPS using our CNN MB size. The IP is clocked at 100 MHz to support 10 ROIs accelerator. However, many visual tracking scenarios such per frame at 60 FPS, sufficient to cover the peak case in our as autonomous drones and video surveillance do not have ac- datasets. We implement the motion controller IP in RTL, and tive cooling. 
Thus, there is an increasing need to reduce the use the same tool chain and 16nm process technology. The power/energy consumption of visual tracking [37]. power consumption is 2.2 mW, which is just slightly more We evaluate two widely used object tracking benchmarks: than a typical micro-controller that has SIMD support (e.g., Object Tracking Benchmark (OTB) 100 [35, 112] and VOT ARM M4 [12]). The area is negligible (35,000 um2). 2014 [36, 39]. OTB 100 contains 100 videos with different DRAM We use DRAMPower [14] for power estimation. visual attributes such as illumination variation and occlusion We model the specification of an 8GB memory with 128-bit that mimic realistic tracking scenarios in the wild. VOT 2014 interface, similar to the one used on the Nvidia Jetson TX2. contains 25 sequences with irregular bounding boxes, com- We further validate the simulated DRAM power consumption plementing the OTB 100 dataset. In total, we evaluate about against the hardware measurement obtained from the Jetson 70,000 frames. We use the standard success rate as the accu- TX2 board. Under the 1080p and 60 FPS camera capture racy metric [111], which represents the percentage of detec- setting, the DRAM consumes about 230 mW. tions that have an IoU ratio above a certain threshold. 5.2. Software Setup 6. Evaluation We evaluate Euphrates under two popular mobile continuous We first quantify the effectiveness of Euphrates under object vision scenarios: object detection and object tracking. Their detection (Sec. 6.1) and tracking (Sec. 6.2) scenarios. We then corresponding workload setup is summarized in Table2. show that Euphrates is robust against motion estimation results Object Detection Scenario In our evaluation, we study a produced in the vision frontend (Sec. 6.3). state-of-the-art object detection CNN called YOLOv2 [97,98], 6.1. Object Detection Results which achieves the best accuracy and performance among all the object detectors. As Table2 shows, YOLOv2 requires Euphrates doubles the achieved FPS with 45% energy saving over 3.4 TOPS compute capability at 60 FPS, significantly at the cost of only 0.58% accuracy loss. Compared to the exceeding the mobile compute budget. For comparison pur- conventional approach of reducing the CNN compute cost by

8

80 1.0 60 70 0.7 Memory Memory Traffic/Frame (GB) Backend 60 0.6 0.8 Memory 48

60 Frontend 50 0.5 FPS YOLOv2 0.6 36 40 0.4 EW-2 40 EW-4 0.4 24 30 0.3 EW-8 EW-16 Norm.Energy 20 0.2 20 0.2 12 EW-32 10 0.1 Average Precision (%) Average Tiny YOLO 0.0 0 Billion Ops/Frame Avg. 0 0.0 0 0.0 0.2 0.4 0.6 0.8 1.0 EW-2 EW-4 EW-8 EW-2 EW-4 EW-8 EW-16EW-32 EW-16 EW-32 YOLOv2 YOLOv2 TinyYOLO IoU Threshold EW-8@CPU (a) Average precision comparison. (b) Energy and FPS comparison. (c) Compute and memory comparison.

Fig. 9: Average precision, normalized energy consumption, and FPS comparisons between various object detection schemes. Energy is broken-down into three main components: backend (CNN engine and motion controller), main memory, and frontend (sensor and ISP). scaling down the network size, Euphrates achieves a higher backend and reducing the SoC memory traffic. Fig. 9c shows frame rate, lower energy consumption, and higher accuracy. the amount of arithmetic operations and SoC-level memory Accuracy Results Fig. 9a compares the average precision traffic (both reads and writes) per frame under various Eu- (AP) between baseline YOLOv2 and Euphrates under different phrates settings. As EW increases, more expensive CNN extrapolation window sizes (EW-N, where N ranges from 2 to inferences are replaced with cheap extrapolations (Sec. 3.2), 32 in powers of 2). For a comprehensive comparison, we vary resulting in significant energy savings. Euphrates also re- the IoU ratio from 0 (no overlap) to 1 (perfect overlap). Each duces the amount of SoC memory traffic. This is because E- point corresponds to the percentage of detections (y) frames access only the motion vector data, and thus avoid the that are above a given IoU ratio (x). Overall, the AP declines huge memory traffic induced by executing the CNNs (SRAM as the IoU ratio increases. spills). Specifically, each I-frame incurs 646 MB memory Replacing expensive NN inference with cheap motion ex- traffic whereas E-frames require only 22.8 MB. trapolation has negligible accuracy loss. EW-2 and EW-4 both Finally, the second to last column in Fig. 9b shows the total achieve a success rate close to the baseline YOLOv2, repre- energy of EW-8 when extrapolation is performed on CPU. sented by the close proximity of their corresponding curves in EW-8 with CPU-based extrapolation consumes almost as high Fig. 9a. Specifically, under an IoU of 0.5, which is commonly energy as EW-4, essentially negating the benefits of extrap- regarded as an acceptable detection threshold [63], EW-2 loses olation. This confirms that our architecture choice of using only 0.58% accuracy compared to the baseline YOLOv2. a dedicated motion controller IP to achieve task autonomy is Energy and Performance The energy savings and FPS important to realizing the full benefits in the vision pipeline. improvements are significant. Fig. 9b shows the energy con- Tiny YOLO Comparison One common way of reducing sumptions of different mechanisms normalized to the baseline energy consumption and improving FPS is to reduce the CNN YOLOv2. We also overlay the FPS results on the right y-axis. model complexity. For instance, Tiny YOLO uses only nine The energy consumption is split into three parts: frontend of YOLOv2’s 24 convolutional layers, and thus has an 80% (sensor and ISP), main memory, and backend (CNN engine MAC operations reduction (Table2). and motion controller). The vision frontend is configured to However, we find that exploiting the temporal motion infor- produce frames at a constant 60 FPS in this experiment. Thus, mation is a more effective approach to improve object detec- the frontend energy is the same across different schemes. tion efficiency than simply truncating a complex network. The The baseline YOLOv2 consumes the highest energy and can bottom curve in Fig. 9a shows the average precision of Tiny only achieve about 17 FPS, which is far from real-time. As we YOLO. 
Although Tiny YOLO executes 20% of YOLOv2’s increase EW, the total energy consumption drops and the FPS MAC operations, its accuracy is even lower than EW-32, improves. Specifically, EW-2 reduces the total energy con- whose computation requirement is only 3.2% of YOLOv2. sumption by 45% and improves the frame rate from 17 to 35; Meanwhile, Tiny YOLO consumes about 1.5 × energy at a EW-4 reduces the energy by 66% and achieves real-time frame lower FPS compared to EW-32 as shown in Fig. 9b. rate at 60 FPS. The frame rate caps at EW-4, limited by the 6.2. Visual Tracking Results frontend. Extrapolating beyond eight consecutive frames have higher accuracy loss with only marginal energy improvements. Visual tracking is a simpler task than object detection. As This is because as EW size increases the energy consumption shown in Table2, the baseline MDNet is able to achieve real- becomes dominated by the vision frontend and memory. time (60 FPS) frame rate using our CNN accelerator. This The significant energy efficiency and performance improve- section shows that without degrading the 60 FPS frame rate, ments come from two sources: relaxing the compute in the Euphrates reduces energy consumption by 21% for the vision

9 100 1.0 100 100

Backend (%) Rate Inference Memory 80 0.8 Frontend 80 80

60 MDNet 0.6 60 60 EW-2 EW-4 40 EW-8 0.4 40 40

EW-16 Norm.Energy EW-A SuccessRate (%) 20 EW-32 0.2 20 SuccessRate (%) 20 EW-2 EW-A EW-4 0 0.0 0 0 0.0 0.2 0.4 0.6 0.8 1.0 0 40 80 120 EW-2 EW-4 EW-8 EW-A IoU Threshold MDNet EW-16EW-32 Video Sequence (a) Success rate comparison. (b) Normalized energy consumption and (c) Success rate for all 125 video se- inference rate comparison. quences under an IoU ratio of 0.5.

Fig. 10: Accuracy loss and energy saving comparisons between baseline MDNet and various Euphrates configurations for OTB 100 and VOT 2014 datasets. Energy is dissected into three parts: backend (CNN engine and motion controller), main memory, and frontend (sensor and ISP). subsystem (50% for the backend), with an 1% accuracy loss. 6.3. Motion Estimation Sensitivity Study Results Fig. 10a compares the success rate of baseline MDNet, Euphrates under different EW sizes (EW-N, where N Euphrates leverages motion vectors generated by the ISP. This ranges from 2 to 32 in power of 2 steps), and Euphrates under section shows that the results of Euphrates are robust against the adaptive mode (EW-A). Similar to object detection, reduc- different motion vector characteristics. In particular, we focus ing the CNN inference rate using motion extrapolation incurs on two key characteristics: granularity and quality. Due to the only a small accuracy loss for visual tracking. Specifically, space limit we only show the results of object tracking on the under an IoU of 0.5 EW-2’s accuracy degrades by only 1%. OTB 100 dataset, but the conclusion holds generally. The energy saving of Euphrates is notable. Fig. 10b shows Granularity Sensitivity Motion vector granularity is cap- the energy breakdown under various Euphrates configurations tured by the macroblock (MB) size during block-matching. 2 normalized to the baseline MDNet. As a reference, the right Specifically, a L × L MB represents the motions of all its L y-axis shows the inference rate, i.e., the percentage of frames pixels using one single vector. Fig. 11a shows how the success where a CNN inference is triggered. EW-2 and EW-4 trigger rate under an IoU ratio of 0.5 changes with the MB size across CNN inference for 50% and 25% of the frames, thereby achiev- three extrapolation windows (2, 8, and 32). ing 21% and 31% energy saving compared to MDNet that has We make two observations. First, Euphrates’ accuracy is an 100% inference rate. The energy savings are smaller than largely insensitive to the MB size when the extrapolation win- in the object detection scenario. This is because MDNet is sim- dow is small. For instance, the success rates across different pler than YOLOv2 (Table2) and therefore, the vision backend MB sizes are almost identical in EW-2. As the extrapola- contributes less to the overall energy consumption. tion window grows from 2 to 32, the impact of motion vector Finally, Euphrates provides an energy-accuracy trade-off granularity becomes more pronounced. This is because errors through tuning EW. For instance, EW-32 trades 42% energy due to imprecise motion estimation tend to accumulate across savings with about 27% accuracy loss at 0.5 IoU. Extrapolating frames under a large extrapolation window. further beyond 32 frames, however, would have only marginal Second, MBs that are too small (e.g., 4) or too large (e.g., energy savings due to the increasing dominance of the vision 128) have a negative impact on accuracy (Fig. 11a). This is frontend and memory, as is evident in Fig. 10b. because overly-small MBs do not capture the global motion Adaptive Mode Compared to the baseline MDNet, the of an object, especially objects that have deformable parts adaptive mode of Euphrates (EW-A) reduces energy by 31% such as a running athlete; overly-large MBs tend to mistake a with a small accuracy loss of 2%, similar to EW-4. However, still background with the motion of a foreground object. 
16 × we find that the adaptive mode has a more uniform success rate 16 strikes the balance and consistently achieves the highest across different tracking scenes compared to EW-4. Fig. 10c success rate under large extrapolation windows. It is thus a shows the success rate of all 125 video sequences in the OTB generally preferable motion vector granularity. 100 and VOT 2014 datasets under EW-A, EW-2, and EW- Quality Sensitivity Motion vector quality can be captured 4, sorted from low to high. Each video sequence presents a by the Sum of Absolute Differences (SAD) between the source different scene that varies in terms of tracked objects (e.g., and destination macroblocks. Lower SAD indicates higher person, car) and visual attributes (e.g., occlusion, blur). EW-A quality. Different block-matching algorithms trade-off quality has a higher success rate than EW-4 across most of the scenes, with computation cost. As discussed in Sec. 2.3, the most indicating its wider applicability to different object detection accurate approach, exhaustive search (ES), leads to 9× more scenarios. Compared to EW-2, the adaptive mode has a similar operations than the simple TSS algorithm [79]. accuracy behavior, but consumes less energy. We find that ES improves the accuracy only marginally

10 MDNet EW-2 100 100 90 EW-2 80 80 80 EW-8 EW-32 70 60 60 60 40 40 EW-2 50

SuccessRate (%) 20 EW-8 20 EW-32 40

Success Rate with TSS (%) TSS SuccessRate with

Average precision (%) Average 0 0 30 Out-of-View In-Plane Rotation Occlusion Fast Motion Deformation Scale Variation Out-of-Plane Rotation Motion Blur Illumination Variation Background Clutters 4 8 16 32 64 128 0 20 40 60 80 100 Macroblock size (log-scale) Success Rate with ES (%) Visual Attributes (a) Sensitivity of success rate to dif- (b) Success rate comparison be- Fig. 12: Accuracy sensivitity to different visual attributes. ferent macroblock sizes. tween ES and TSS. computer vision algorithms, rather than by humans. Fig. 11: Accuracy sensitivity to motion estimation results. That being said, video codecs complement ISPs in scenarios over TSS. Fig. 11b is a scatter plot comparing the success where ISPs are not naturally a part of the image processing rates between ES (x-axis) and TSS (y-axis) where each marker pipeline. For instance, if video frames are streamed from corresponds to a different IoU ratio. Across three different the Internet, which are typically compressed, or are retrieved extrapolation window sizes (2, 8, and 32), the success rates of from compressed video files, video codecs can readily provide ES and TSS are almost identical. motion vectors to augment the continuous vision tasks whereas ISPs typically are idle during these tasks. We leave it as future 7. Discussion work to co-design the codec hardware with the vision backend. Motion Estimation Improvements Euphrates is least effec- 8. Related Work tive when dealing with scenes with fast moving and blurred objects. Let us elaborate using the OTB 100 dataset, in which Motion-based Continuous Vision Euphrates exploits the every video sequence is annotated with a set of visual at- temporal motion information for efficient continuous vision, tributes [35] such as illumination variation. Fig. 12 compares an idea recently getting noticed in the computer vision commu- the average precision between MDNet and EW-2 across differ- nity. Zhang et al. [116] and Chadha et al. [45] propose CNN ent visual attribute categories. Euphrates introduces the most models for action recognition using motion vectors as training accuracy loss in the Fast Motion and Motion Blur category. inputs. TSN [107] trains a CNN directly using multiple frames. Fast moving objects are challenging due to the limit of the Our algorithm differs from all above in that Euphrates does search window size of block matching (Sec. 2.3), in which not require any extra training effort. In the case of TSN which an accurate match is fundamentally unobtainable if an object is also evaluated using OTB 100 datasets, our algorithm ends moves beyond the search window. Enlarging the search win- up achieving about 0.2% higher accuracy. dow might improve the accuracy, but has significant overhead. Fast YOLO [101] reuses detection results from the previous Blurring is also challenging because block-matching might frame if insufficient motion is predicted. Euphrates has two return a macroblock that is the best match for a blurred object advantages. First, Fast YOLO requires training a separate but does not represent the actual motion. CNN for motion prediction while Euphrates leverages motion In the short term, we expect that fast and blurred motion can vectors. Second, Fast YOLO does not perform extrapolation be greatly alleviated by higher frame rate (e.g., 240 FPS) cam- due to the lack of motion estimation whereas Euphrates ex- eras that are already available on consumer mobile devices [1]. trapolates using motion vectors to retain high accuracy. 
In the short term, we expect that fast and blurred motion can be greatly alleviated by high-frame-rate (e.g., 240 FPS) cameras that are already available on consumer mobile devices [1]. A high-frame-rate camera reduces the amount of motion variance between consecutive frames and also has a very short exposure time that diminishes motion blur [66]. In the long term, we believe it is critical to incorporate non-vision sensors such as an Inertial Measurement Unit [22] as alternative sources of motion [3, 80], and to combine vision and non-vision sensors for accurate motion estimation, as exemplified by the video stabilization feature in the Google smartphone [11].

Hardware Design Alternatives. The video codec is known for using block-matching algorithms for video compression [67]. However, video codecs are triggered only when real-time camera captures are to be stored as video files, which is not the dominant use case in continuous vision, where camera captures are consumed as sensor inputs by computer vision algorithms rather than by humans.

That being said, video codecs complement ISPs in scenarios where ISPs are not naturally part of the image processing pipeline. For instance, if video frames are streamed from the Internet, where they are typically compressed, or are retrieved from compressed video files, video codecs can readily provide motion vectors to augment continuous vision tasks, whereas ISPs are typically idle during these tasks. We leave co-designing the codec hardware with the vision backend as future work.

8. Related Work

Motion-based Continuous Vision. Euphrates exploits temporal motion information for efficient continuous vision, an idea that has recently gained attention in the computer vision community. Zhang et al. [116] and Chadha et al. [45] propose CNN models for action recognition that use motion vectors as training inputs. TSN [107] trains a CNN directly using multiple frames. Our algorithm differs from all of the above in that Euphrates does not require any extra training effort. In the case of TSN, which is also evaluated on the OTB 100 dataset, our algorithm ends up achieving about 0.2% higher accuracy.

Fast YOLO [101] reuses detection results from the previous frame if insufficient motion is predicted. Euphrates has two advantages. First, Fast YOLO requires training a separate CNN for motion prediction, while Euphrates leverages motion vectors. Second, Fast YOLO does not perform extrapolation due to the lack of motion estimation, whereas Euphrates extrapolates using motion vectors to retain high accuracy.

Finally, Euphrates is composable with all of the above systems, as they can be used as the baseline inference engine in Euphrates. Our algorithm and hardware extensions do not modify the baseline CNN model, but improve its energy-efficiency.

Energy-Efficient Deep Learning. Much of the recent research on energy-efficient CNN architectures has focused on designing better accelerators [40, 48, 62, 77, 102], leveraging emerging memory technologies [50, 78, 105], exploiting sparsity and compression [60, 69, 92, 110, 115], and better tooling [76, 86, 96]. Euphrates takes a different but complementary approach. While the goal of designing better CNN architectures is to reduce the energy consumption per inference, Euphrates reduces the rate of inference by replacing inferences with simple extrapolation, and thus saves energy.
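To illustrate the rate-versus-per-inference distinction with arithmetic, the short Python sketch below computes the average vision energy per frame when one full CNN inference covers an extrapolation window of N frames and the remaining N-1 frames are handled by motion-vector extrapolation. The energy constants are hypothetical placeholders chosen only for the example; they are not measured Euphrates numbers.

def avg_energy_per_frame(e_infer_mj, e_extrap_mj, window):
    """Average vision energy per frame when one full CNN inference
    covers `window` frames and the other (window - 1) frames are
    handled by cheap motion-vector extrapolation."""
    return (e_infer_mj + (window - 1) * e_extrap_mj) / window

# Hypothetical numbers for illustration only (not measured results):
E_INFER = 20.0   # mJ per full CNN inference on an accelerator
E_EXTRAP = 0.5   # mJ per extrapolated frame
for n in (1, 2, 4, 8):
    print(f"window={n}: {avg_energy_per_frame(E_INFER, E_EXTRAP, n):.2f} mJ/frame")
# A better accelerator shrinks E_INFER; Euphrates instead grows the
# window, shrinking how often E_INFER is paid at all.

With these placeholder numbers, a window of 4 already cuts the average per-frame energy by roughly 3.7x, which mirrors the qualitative argument above.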

Suleiman et al. [106] quantitatively show that classic CV algorithms based on hand-crafted features are extremely energy-efficient, at the cost of a large accuracy loss compared to CNNs. Euphrates shows that motion extrapolation is a promising way to bridge the energy gap with little accuracy loss.

Specialized Imaging & Vision Architectures. Traditional ISPs perform only primitive image processing while leaving advanced photography and vision tasks, such as high dynamic range and motion estimation, to CPUs and GPUs. Modern imaging and vision processors are capable of performing advanced tasks in situ in response to the increasing compute requirements of emerging algorithms. For instance, Google's latest flagship smartphone, Pixel 2, has a dedicated SoC equipped with eight Image Processing Units for advanced computational photography algorithms such as HDR+ [31]. IDEAL [87] accelerates BM3D-based image denoising algorithms. Mazumdar et al. [88] design specialized hardware for face authentication and stereoscopic video processing.

Researchers have also made significant efforts to retain programmability in specialized vision architectures [46, 93, 108]. For instance, Movidius Myriad 2 [41] is a VLIW-based vision processor used in the Google Clips camera [17] and the DJI Phantom 4 drone [33]. Clemons et al. [53] propose a specialized memory system for imaging and vision processors while exposing a flexible programming interface. Coupled with such programmable architecture substrates, domain-specific languages such as Halide [94], Darkroom [70], and Rigel [71] offer developers an even higher degree of flexibility to implement new features. We expect imaging and vision processors to become highly programmable in the future, offering more opportunities to synergistically architect image processing and continuous vision systems together, as we showcased in this paper.

Computer Vision on Raw Sensor Data. Diamond et al. [59] and Buckler et al. [44] both showed that CNN models can be effectively trained using raw image sensor data. RedEye [81] and ASP Vision [47] both move early CNN layer(s) into the camera sensor and compute using raw sensor data. This line of work is complementary to Euphrates in that our algorithm makes no assumption about the image format from which motion vectors are generated. In fact, recent work has shown that motion can be directly estimated from raw image sensor data using block matching [43, 114]. We leave porting Euphrates to support raw data as future work.

9. Conclusion

Delivering real-time continuous vision in an energy-efficient manner is a tall order for mobile system design. To overcome the energy-efficiency barrier, we must expand the research horizon from individual accelerators toward holistically co-designing different mobile SoC components. This paper demonstrates one such co-designed system to enable motion-based synthesis. It leverages the temporal motion information naturally produced by the imaging engine (ISP) to reduce the compute demand of the vision engine (CNN accelerator).

Looking forward, exploiting the synergies across different SoC IP blocks will become ever more important as mobile SoCs incorporate more specialized domain-specific IPs. Future developments should explore cross-IP information sharing beyond just motion metadata and expand the co-design scope to other on/off-chip components. Our work serves as the first step, not the final word, in a promising new direction of research.

References

[1] "A Closer Look at Slow Motion Video on the iPhone 6." https://www.wired.com/2015/01/closer-look-slow-motion-video-iphone-6/
[2] "AP0101CS High-Dynamic Range (HDR) Image Signal Processor (ISP)." http://www.onsemi.com/pub/Collateral/AP0101CS-D.PDF
[3] "Apple Motion Coprocessors." https://en.wikipedia.org/wiki/Apple_motion_coprocessors
[4] "Apple's Neural Engine Infuses the iPhone with AI Smarts." https://www.wired.com/story/apples-neural-engine-infuses-the-iphone-with-ai-smarts/
[5] "AR0542 CMOS Digital Image Sensor." https://www.onsemi.com/pub/Collateral/AR0542-D.PDF
[6] "AR1335 CMOS Digital Image Sensor." https://www.framos.com/media/pdf/06/d8/9a/AR1335-D-PDF-framos.pdf
[7] "Arm Cortex-M Series Processors." https://developer.arm.com/products/processors/cortex-m
[8] "ARM Mali Camera." https://www.arm.com/products/graphics-and-multimedia/mali-camera
[9] "Arm Object Detection Processor." https://developer.arm.com/products/processors/machine-learning/arm-od-processor
[10] "ASICFPGA Camera Image Signal Processing Core." http://asicfpga.com/site_upgrade/asicfpga/pds/isp_pds_files/ASICFPGA_ISP_Core_v4.0_simple.pdf
[11] "Behind the Motion Photos Technology in Pixel 2." https://research.googleblog.com/2018/03/behind-motion-photos-technology-in.html
[12] "Cortex-M4 Technical Reference Manual, Revision r0p0." https://static.docs.arm.com/ddi0439/b/DDI0439B_cortex_m4_r0p0_trm.pdf
[13] "DENALI-MC HDR ISP." http://pinnacleimagingsystems.com/embedded-products/
[14] "DRAMPower: Open-source DRAM Power & Energy Estimation Tool." https://github.com/tukl-msd/DRAMPower
[15] "e-CAM131_CUTX2 - 4K MIPI NVIDIA Jetson TX2/TX1 Camera Board." https://www.e-consystems.com/13mp-nvidia-jetson-tx2-camera-board.asp
[16] "Enhancing Augmented Reality with Advanced Object Detection Techniques." https://www.qualcomm.com/invention/research/projects/computer-vision/3d-object-detection
[17] "Google's Clips camera is powered by a tailor-made AI chip." https://www.theverge.com/circuitbreaker/2017/10/6/16434834/google-clips-camera-ai-movidius-myriad-vpu
[18] "Hikvision Advanced Image Processing: Noise Reduction." http://oversea-download.hikvision.com/UploadFile/file/Hikvision_Advanced_Image_Processing--Noise_Reduction.pdf
[19] "Jetson TX2 Module." http://www.nvidia.com/object/embedded-systems-dev-kits-modules.html
[20] "MIPI Camera Serial Interface 2 (MIPI CSI-2)." https://www.mipi.org/specifications/csi-2
[21] "Movidius Myriad X VPU Product Brief." https://uploads.movidius.com/1503874473-MyriadXVPU_ProductBriefaug25.pdf
[22] "MPU-9250 Product Specification Revision 1.1." https://www.invensense.com/wp-content/uploads/2015/02/PS-MPU-9250A-01-v1.1.pdf
[23] "MT9P031 CMOS Digital Image Sensor." http://www.onsemi.com/pub/Collateral/MT9P031-D.PDF
[24] "Nvidia ADAS." https://www.nvidia.com/en-us/self-driving-cars/adas/
[25] "OmniVision OV5693." http://www.ovt.com/sensors/OV5693
[26] "Photoshop CS2 HDR." https://luminous-landscape.com/photoshop-cs2-hdr/

[27] "PX4FLOW Smart Camera." https://pixhawk.org/modules/px4flow
[28] "Qualcomm pioneers new active depth sensing module." https://www.qualcomm.com/news/onq/2017/08/15/qualcomm-pioneers-new-active-depth-sensing-module
[29] "Second Version of HoloLens HPU will Incorporate AI Coprocessor for Implementing DNNs." https://www.microsoft.com/en-us/research/blog/second-version-hololens-hpu-will-incorporate-ai-coprocessor-implementing-dnns/
[30] "Snapdragon 835 Mobile Platform." https://www.qualcomm.com/products/snapdragon/processors/835
[31] "Surprise! The Pixel 2 is hiding a custom Google SoC for image processing." https://arstechnica.com/gadgets/2017/10/the-pixel-2-contains-a-custom-google-soc-the-pixel-visual-core/
[32] "The PASCAL Visual Object Classes Challenge 2007." http://host.robots.ox.ac.uk/pascal/VOC/voc2007/
[33] "The revolutionary chipmaker behind Google's project is now powering DJI's autonomous drone." https://www.theverge.com/2016/3/16/11242578/movidius-myriad-2-chip-computer-vision-dji-phantom-4
[34] "Vision Chips." http://www.centeye.com/technology/vision-chips/
[35] "Visual Tracker Benchmark." http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html
[36] "VOT2014 Benchmark." http://www.votchallenge.net/vot2014/
[37] "What are the Toughest Challenges in ADAS Design." http://www.electronicdesign.com/power/what-are-toughest-challenges-adas-design
[38] "YOLO: Real-Time Object Detection." https://pjreddie.com/darknet/yolo/
[39] "The Visual Object Tracking VOT2014 Challenge Results," in Proc. of CVPRW, 2015.
[40] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing," in Proc. of ISCA, 2016.
[41] B. Barry, C. Brick, F. Connor, D. Donohoe, D. Moloney, R. Richmond, M. O'Riordan, and V. Toma, "Always-on Vision Processing Unit for Mobile Applications," IEEE Micro, 2015.
[42] K. Bong, S. Choi, C. Kim, S. Kang, Y. Kim, and H.-J. Yoo, "14.6 A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face-Recognition Processor and a CIS Integrated with Always-on Haar-like Face Detector," in Proc. of ISSCC, 2017.
[43] G. Boracchi and A. Foi, "Multiframe Raw-data Denoising Based on Block-matching and 3-D Filtering for Low-light Imaging and Stabilization," in Proc. of the Int. Workshop on Local and Non-Local Approx. in Image Processing, 2008.
[44] M. Buckler, S. Jayasuriya, and A. Sampson, "Reconfiguring the Imaging Pipeline for Computer Vision," in Proc. of ICCV, 2017.
[45] A. Chadha, A. Abbas, and Y. Andreopoulos, "Video Classification With CNNs: Using The Codec As A Spatio-Temporal Activity Sensor," in arXiv:1710.05112, 2017.
[46] N. Chandramoorthy, G. Tagliavini, K. Irick, A. Pullini, S. Advani, S. Al Habsi, M. Cotter, J. Sampson, V. Narayanan, and L. Benini, "Exploring Architectural Heterogeneity in Intelligent Vision Systems," in Proc. of HPCA, 2015.
[47] H. G. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. Molnar, "ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks using Angle Sensitive Pixels," in Proc. of CVPR, 2016.
[48] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in Proc. of ISCA, 2016.
[49] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in Proc. of ISSCC, 2017.
[50] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory," in Proc. of ISCA, 2016.
[51] N. Chidambaram Nachiappan, P. Yedlapalli, N. Soundararajan, M. T. Kandemir, A. Sivasubramaniam, and C. R. Das, "GemDroid: A Framework to Evaluate Mobile Platforms," 2014.
[52] B.-D. Choi, J.-W. Han, C.-S. Kim, and S.-J. Ko, "Motion-Compensated Frame Interpolation Using Bilateral Motion Estimation and Adaptive Overlapped Block Motion Compensation," IEEE Transactions on Circuits and Systems for Video Technology, 2007.
[53] J. Clemons, C.-C. Cheng, I. Frosio, D. Johnson, and S. W. Keckler, "A Patch Memory System for Image Processing and Computer Vision," in Proc. of MICRO, 2016.
[54] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[55] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in Proc. of CVPR, 2005.
[56] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ECO: Efficient Convolution Operators for Tracking," in Proc. of CVPR, 2017.
[57] P. E. Debevec and J. Malik, "Recovering High Dynamic Range Radiance Maps from Photographs," in Proc. of SIGGRAPH, 1997.
[58] G. Desoli, N. Chawla, T. Boesch, S.-p. Singh, E. Guidetti, F. De Ambroggi, T. Majo, P. Zambotti, M. Ayodhyawasi, H. Singh et al., "14.1 A 2.9 TOPS/W Deep Convolutional Neural Network SoC in FD-SOI 28nm for Intelligent Embedded Systems," in Proc. of ISSCC, 2017.
[59] S. Diamond, V. Sitzmann, S. Boyd, G. Wetzstein, and F. Heide, "Dirty Pixels: Optimizing Image Classification Architectures for Raw Sensor Data," in Proc. of CVPR, 2017.
[60] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., "CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices," in Proc. of MICRO, 2017.
[61] P. Dollár, R. Appel, S. Belongie, and P. Perona, "Fast Feature Pyramids for Object Detection," PAMI, 2014.
[62] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor," in Proc. of ISCA, 2015.
[63] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge: A Retrospective," IJCV, 2015.
[64] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," IJCV, 2009.
[65] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," PAMI, 2010.
[66] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey, "Need for Speed: A Benchmark for Higher Frame Rate Object Tracking," in arXiv:1703.05884, 2017.
[67] M. Ghanbari, Standard Codecs: Image Compression to Advanced Video Coding. Institution of Electrical Engineers, 2003.
[68] M. Halpern, Y. Zhu, and V. J. Reddi, "Mobile CPU's Rise to Power: Quantifying the Impact of Generational Mobile CPU Design Trends on Performance, Energy, and User Satisfaction," in Proc. of HPCA, 2016.
[69] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and W. Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. of ISCA, 2016.
[70] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasilyev, M. Horowitz, and P. Hanrahan, "Darkroom: Compiling High-Level Image Processing Code into Hardware Pipelines," in Proc. of SIGGRAPH, 2014.
[71] J. Hegarty, R. Daly, Z. DeVito, J. Ragan-Kelley, M. Horowitz, and P. Hanrahan, "Rigel: Flexible Multi-Rate Image Processing Hardware," in Proc. of SIGGRAPH, 2016.
[72] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-Speed Tracking with Kernelized Correlation Filters," PAMI, 2015.
[73] Y. Inoue, T. Ono, and K. Inoue, "Adaptive Frame-Rate Optimization for Energy-Efficient Object Tracking," in Proc. of the International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV), The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 2016, p. 158.
[74] M. Jakubowski and G. Pastuszak, "Block-Based Motion Estimation Algorithms – A Survey," Opto-Electronics Review, 2013.
[75] H. Ji, C. Liu, Z. Shen, and Y. Xu, "Robust Video Denoising Using Low Rank Matrix Completion," in Proc. of CVPR, 2010.
[76] Y. Ji, Y. Zhang, S. Li, P. Chi, C. Jiang, P. Qu, Y. Xie, and W. Chen, "NEUTRAMS: Neural Network Transformation and Co-design Under Neuromorphic Hardware Constraints," in Proc. of MICRO, 2016.

[77] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, R. C. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-Datacenter Performance Analysis of a Tensor Processing Unit," in Proc. of ISCA, 2017.
[78] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory," in Proc. of ISCA, 2016.
[79] T. Koga, K. Iinuma, A. Hirano, Y. Iijima, and T. Ishiguro, "Motion Compensated Interframe Coding for Video Conferencing," in Proc. of the Nat. Telecommunications Conf., 1981.
[80] S. M. LaValle, A. Yershova, M. Katsev, and M. Antonov, "Head Tracking for the Oculus Rift," in Proc. of ICRA, 2014.
[81] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, "RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision," in Proc. of ISCA, 2016.
[82] R. LiKamWa, Z. Wang, A. Carroll, F. X. Lin, and L. Zhong, "Draining our Glass: An Energy and Heat Characterization of Google Glass," in Proc. of APSys, 2014.
[83] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. of ECCV, 2014.
[84] C. Liu and W. T. Freeman, "A High-Quality Video Denoising Algorithm Based on Reliable Motion Estimation," in Proc. of ECCV, 2010.
[85] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot Multibox Detector," in Proc. of ECCV, 2016.
[86] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh, "TABLA: A Unified Template-based Framework for Accelerating Statistical Machine Learning," in Proc. of HPCA, 2016.
[87] M. Mahmoud, B. Zheng, A. D. Lascorz, F. Heide, J. Assouline, P. Boucher, E. Onzon, and A. Moshovos, "IDEAL: Image DEnoising AcceLerator," in Proc. of MICRO, 2017.
[88] A. Mazumdar, T. Moreau, S. Kim, M. Cowan, A. Alaghi, L. Ceze, M. Oskin, and V. Sathe, "Exploring Computation-Communication Tradeoffs in Camera Systems," in Proc. of IISWC, 2017.
[89] B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, "14.5 Envision: A 0.26-to-10 TOPS/W Subword-Parallel Dynamic-Voltage-Accuracy-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI," in Proc. of ISSCC, 2017.
[90] N. C. Nachiappan, H. Zhang, J. Ryoo, N. Soundararajan, A. Sivasubramaniam, M. T. Kandemir, R. Iyer, and C. R. Das, "VIP: Virtualizing IP Chains on Handheld Platforms," in Proc. of ISCA, 2015.
[91] H. Nam and B. Han, "Learning Multi-Domain Convolutional Neural Networks for Visual Tracking," in Proc. of CVPR, 2016.
[92] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks," in Proc. of ISCA, 2017.
[93] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. A. Horowitz, "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing," in Proc. of ISCA, 2013.
[94] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines," in Proc. of PLDI, 2013.
[95] B. Reagen, R. Adolf, P. Whatmough, G.-Y. Wei, and D. Brooks, Deep Learning for Computer Architects. Morgan & Claypool Publishers, 2017.
[96] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling Low-Power, Highly-Accurate Deep Neural Network Accelerators," in Proc. of ISCA, 2016.
[97] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in Proc. of CVPR, 2016.
[98] J. Redmon and A. Farhadi, "YOLO9000: Better, Faster, Stronger," in arXiv:1612.08242, 2016.
[99] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," in Proc. of NIPS, 2015.
[100] A. Samajdar, Y. Zhu, and P. N. Whatmough, "SCALE-Sim." https://github.com/ARM-software/SCALE-Sim, 2018.
[101] M. J. Shafiee, B. Chywl, F. Li, and A. Wong, "Fast YOLO: A Fast You Only Look Once System for Real-Time Embedded Object Detection in Video," in arXiv:1709.05943v1, 2017.
[102] Y. Shen, M. Ferdman, and P. Milder, "Maximizing CNN Accelerator Efficiency Through Resource Partitioning," 2016.
[103] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, "14.6 A 1.42 TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems," in Proc. of ISSCC, 2016.
[104] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in Proc. of ICLR, 2014.
[105] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning," in Proc. of HPCA, 2017.
[106] A. Suleiman, Y.-H. Chen, J. Emer, and V. Sze, "Towards Closing the Energy Gap Between HOG and CNN Features for Embedded Vision," in Proc. of ISCAS, 2017.
[107] Z. Teng, J. Xing, Q. Wang, C. Lang, S. Feng, and Y. Jin, "Robust Object Tracking Based on Temporal and Spatial Deep Networks," in Proc. of ICCV, 2017.
[108] A. Vasilyev, N. Bhagdikar, A. Pedram, S. Richardson, S. Kvatinsky, and M. Horowitz, "Evaluating Programmable Architectures for Imaging and Vision Applications," in Proc. of MICRO, 2016.
[109] P. Viola and M. J. Jones, "Robust Real-Time Object Detection," IJCV, 2004.
[110] P. N. Whatmough, S. K. Lee, H. Lee, S. Rama, D. Brooks, and G.-Y. Wei, "14.3 A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications," in Proc. of ISSCC, 2017.
[111] Y. Wu, J. Lim, and M.-H. Yang, "Online Object Tracking: A Benchmark," in Proc. of CVPR, 2013.
[112] Y. Wu, J. Lim, and M.-H. Yang, "Object Tracking Benchmark," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.
[113] J. Yan, Z. Lei, L. Wen, and S. Z. Li, "The Fastest Deformable Part Model for Object Detection," in Proc. of CVPR, 2014.
[114] C.-C. Yang, S.-M. Guo, and J. S.-H. Tsai, "Evolutionary Fuzzy Block-Matching-Based Camera Raw Image Denoising," IEEE Transactions on Cybernetics, 2017.
[115] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, "Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism," in Proc. of ISCA, 2017.
[116] B. Zhang, L. Wang, Z. Wang, Y. Qiao, and H. Wang, "Real-Time Action Recognition with Enhanced Motion Vector CNNs," in Proc. of CVPR, 2016.
[117] Y. Zhu, M. Mattina, and P. Whatmough, "Mobile Machine Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective," ArXiv e-prints, Jan. 2018.
