The Pennsylvania State University
The Graduate School

OPTIMIZING VIDEO PROCESSING FOR NEXT-GENERATION MOBILE PLATFORMS

A Dissertation in Computer Science and Engineering by Haibo Zhang

© 2020 Haibo Zhang

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2020

The dissertation of Haibo Zhang was reviewed and approved by the following:

Mahmut T. Kandemir, Professor of Computer Science and Engineering, Dissertation Co-Advisor, Co-Chair of Committee
Anand Sivasubramaniam, Distinguished Professor of Computer Science and Engineering, Dissertation Co-Advisor, Co-Chair of Committee
Chita R. Das, Distinguished Professor of Computer Science and Engineering, Department Head of Computer Science and Engineering
Dinghao Wu, Associate Professor of Information Sciences and Technology

Abstract

Video has become one of the most dominant applications on mobile and Internet-of-Things (IoT) platforms. Video applications span numerous emerging use cases, such as live broadcasts, video posts, video conferences, smart-camera-based object tracking, augmented reality (AR), and virtual reality (VR), expanding the boundaries of communication and entertainment. Video-oriented applications such as Netflix, TikTok, and Snapchat are among the most popular applications in the mobile application stores. Smart mobile devices, commonly equipped with camera(s) and a display, serve as the platform for video applications and therefore contribute a significant volume of video traffic. Better video quality and lower energy consumption are thus key to the continued growth and further acceptance of mobile devices.

While previous research has mainly studied how to transmit video data efficiently over a network, this dissertation provides a systematic study of the inefficiencies within the mobile platforms themselves. It reveals that, even with highly optimized video codecs and video hardware accelerators, today's mobile platforms for video processing still suffer from various energy and Quality-of-Service (QoS) inefficiencies, which lie within the current system and hardware stacks and warrant a revisit. To further improve the QoS and energy efficiency of video processing on mobile platforms, the optimization scope needs to expand to the interactions among multiple video processing stages as well as their memory system. By treating multiple stages as a single optimization entity, this dissertation tackles the above inefficiencies in the system/hardware stacks of video processing by 1) intelligently scheduling video processing requests according to their QoS requirements, 2) designing a new power management policy, and 3) saving energy by leveraging the value locality inherent in video memory traffic. Overall, this dissertation aims to systematically optimize video processing for next-generation mobile platforms.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1  Introduction

Chapter 2  Background
  2.1 Video Basics
  2.2 Display-based Video Processing in Mobile Systems
  2.3 Camera-based Video Processing in Mobile Systems
  2.4 Application Example: Video Streaming

Chapter 3  Flow-Sensitive Scheduling for Multi-Videos on Mobile Platforms
  3.1 Introduction
  3.2 Motivation
  3.3 Solving the Memory Bottleneck
    3.3.1 A Flow-Sensitive Scheduler: FLOSS
    3.3.2 Implementation Details
  3.4 Experimental Evaluation
    3.4.1 Experiment Setup
    3.4.2 Results
  3.5 Related Work
  3.6 Conclusions

Chapter 4  Optimizing QoS and Energy for Display-based Video Processing
  4.1 Introduction
  4.2 Background and Motivation
    4.2.1 Insomnia: Inefficiencies of Current Video Processing
  4.3 Race-to-Sleep
    4.3.1 Batch Decoding
    4.3.2 Race Decoding
    4.3.3 Race-to-Sleep
  4.4 Content Caching
    4.4.1 Address Locality vs. Value Locality
    4.4.2 MACH: Macroblock Cache
    4.4.3 Gradient Blocks
    4.4.4 Implementation of MACH
    4.4.5 Discussion
  4.5 Display Caching
    4.5.1 Display Cache and MACH Buffer
    4.5.2 Memory Access Savings in Display
  4.6 Experimental Evaluation
    4.6.1 Experimental Setup and Workloads
    4.6.2 Results
    4.6.3 Sensitivity Study and Overhead
    4.6.4 Other Potential Applications of MACH
  4.7 Related Work
  4.8 Conclusions

Chapter 5  Reducing Memory Usage and Energy for Camera-based Video Processing
  5.1 Introduction
  5.2 Background and Motivation
    5.2.1 Value Locality in Raw Videos
    5.2.2 Reducing Video Buffer Bandwidth
  5.3 Isolating the Essential Bits
    5.3.1 Base Selection
    5.3.2 Evaluation of MinVB and MidVB
  5.4 Distilling the Essential Bits
    5.4.1 Approximation through Distilling
    5.4.2 Implementation Details of Distill
  5.5 Evaluation
    5.5.1 Workloads and Experimental Setup
    5.5.2 Results
  5.6 Related Work
  5.7 Conclusions

Chapter 6  Future Work
  6.1 Smart Vision Pipeline
  6.2 Smart Display Pipeline

Chapter 7  Concluding Remarks

Bibliography

List of Figures

1.1 Overview of my dissertation research.

2.1 Illustration of display-based video processing.

2.2 Video playback application invoking IPs in parallel.

3.1 FPS improvement due to PIP, and SoC utilization with and without PIP.

3.2 Examples showing the advantages and drawbacks of different QoS-aware memory schedulers. The timeline is measured in units of memory requests, and all time slots are normalized to the time taken to process one memory request.

3.3 FLOSS: rate calculation under various scenarios.

3.4 Effect of different memory scheduling schemes on FPS and per-frame energy efficiency. The FPS and energy-efficiency results are relative to the non-PIP/FR-FCFS baseline.

3.5 Per-frame memory performance and utilization improvement.

4.1 (a) Performance and energy breakdown of video streaming; (b) control flow and data flow of video streaming (YouTube).

4.2 (a) State diagram of power-state transitions. (b) and (c) CDF plots of frame execution time and energy consumption for 5000 frames, sorted in increasing order. Per-frame execution time is fixed at 16.6 ms and per-frame energy at 5 mJ (decoder power of 300 mW × frame time). The plots show the time/energy spent in five different states (S3, S1, transition, short slack, and execution). (d) and (e) depict the time/energy CDFs with batching.

4.3 Pictorial representation of different decoding schemes. tSlack is the time taken to process the next frame. P1 and P0 are P-states. The transition energy between P and S states is shown in red.

4.4 Effects of various schemes. (a) and (b) show the effect of Batch Decoding. (c) and (d) plot the time/energy CDFs with Race Decoding and Race-to-Sleep for processing 5000 frames (similar to Fig. 4.2b and Fig. 4.2c). Race Decoding increases the transition time significantly, whereas Race-to-Sleep largely reduces it.
4.5 (a) Memory access pattern with low- and high-frequency settings. DRAM needs more Act/Pre operations at a low-frequency setting. (b) Energy savings from using a high-frequency VD.

4.6 Impact of Race-to-Sleep. Left: VD at low frequency; right: VD at high frequency (batching up to 16 frames).

4.7 Address locality vs. value locality. (a) A conventional cache can hardly exploit address locality when writing a frame back to memory. (b) More than half of the current frame is already stored in memory.

4.8 Examples illustrating how MACH works.

4.9 Effect of content caching. (a) mab saves 13% and gab saves 34% of memory bandwidth and memory space. (b) gab always captures more matches than mab. (c) (i) Memory layout of mab/gab; (ii) layout of the pointers in Sec. 4.4; and (iii) layout of the pointers/digests in Sec. 4.5.

4.10 Architecture and effects of the display cache and MACH buffers. (a) Illustration of the display cache; (b) illustration of the MACH buffers; (c) sensitivity to the size of the display cache; (d) distribution of gab types; and (e) fraction (%) of memory accesses.

4.11 Normalized energy results with our recipes, shown across 16 videos. L: Baseline; B: Batching; R: Racing; S: Race-to-Sleep; M: MAB; G: GAB. Normalized to the baseline (lower is better).

4.12 Sensitivities to (a) the number of extra frame buffers when using gab; (b) the number of MACH buffer entries; (c) the size of mab; and (d) different hashing mechanisms.

5.1 Overview of a video processing application at edge devices.

5.2 Spatial value similarity in pixel values.

5.3 (a) Number of unique tiles for raw video and encoded video inputs. (b) Pixel intensity (1x1 tiles) of a sequence of tiles (1.8 billion tiles in total, from 1199 frames). 34% of the pixels require 8 bits to fully represent their values. Frames are partitioned into tiles, where each tile consists of 4x4 or 16x16 pixels; a tile's bit-width is determined by the largest bit-width of the pixels in the tile.

5.4 (a) Distribution of the pixel deltas in a tile. Pixel deltas are the differences from the first pixel value in the tile. Around 20% of the pixel deltas are 0, indicating that they are the same as the first pixel value. Only 0.01% of the pixels still require the full 8-bit width. (b) The bit-width needed for a tile is now determined by the number of bits needed to represent the maximum (minimum) of the deltas in the tile.
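To make the tile/delta idea behind Figures 5.3 and 5.4 concrete, the C sketch below sizes one tile by the widest delta from its base (first) pixel. This is a minimal illustration under assumed conventions (a 4x4 tile of 8-bit pixels, sign-magnitude deltas), not the dissertation's actual MinVB/MidVB or Distill implementation; the names tile_bit_width and delta_bits are hypothetical.

/* Illustrative sketch only: pick a per-tile bit-width from pixel deltas.
 * Assumes an 8-bit grayscale 4x4 tile; the base pixel is stored at full
 * precision and the remaining pixels are stored as signed deltas. */
#include <stdint.h>
#include <stdlib.h>

#define TILE_PIXELS 16  /* assumed 4x4 tile */

/* Bits needed to store a signed delta in [-255, 255] (magnitude + sign bit). */
static int delta_bits(int delta)
{
    int mag = abs(delta), bits = 0;
    while ((1 << bits) <= mag)   /* smallest 'bits' with 2^bits > |delta| */
        bits++;
    return bits + 1;             /* +1 sign bit; a zero delta needs 1 bit */
}

/* Bit-width chosen for the whole tile: the widest delta from the base pixel. */
int tile_bit_width(const uint8_t tile[TILE_PIXELS])
{
    int base = tile[0];
    int width = 1;               /* the base itself is kept at full 8 bits */
    for (int i = 1; i < TILE_PIXELS; i++) {
        int w = delta_bits((int)tile[i] - base);
        if (w > width)
            width = w;
    }
    return width;                /* tile storage: 8 bits + 15 deltas * width */
}

With this sizing, a tile whose pixels are all close to the base pixel (the common case suggested by Figure 5.4a) can be stored with only a few bits per delta instead of 8 bits per pixel, which is the bandwidth-saving effect the captions describe.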