Dynamic Resource Management of Network-On-Chip Platforms for Multi-Stream Video Processing
Hashan Roshantha Mendis
Doctor of Engineering
University of York
Computer Science
March 2017

Abstract

This thesis considers resource management in the context of parallel multiple video stream decoding on multicore/many-core platforms. Such platforms have tens or hundreds of on-chip processing elements which are connected via a Network-on-Chip (NoC). Inefficient task allocation configurations can negatively affect the communication cost and resource contention in the platform, leading to predictability and performance issues. Efficient resource management for large-scale complex workloads is considered a challenging research problem, especially when applications such as video streaming and decoding have dynamic and unpredictable workload characteristics. For these types of applications, runtime heuristic-based task mapping techniques are required. As the application and platform size increase, decentralised resource management techniques are more desirable to overcome the reliability and performance bottlenecks in centralised management.

In this work, several heuristic-based runtime resource management techniques targeting real-time video decoding workloads are proposed. Firstly, two admission control approaches are proposed: one is fully deterministic and highly predictable; the other is heuristic-based and balances predictability and performance. Secondly, a pair of runtime task mapping schemes is presented, which make use of limited known application properties, communication cost and blocking-aware heuristics. Combined with the proposed deterministic admission controller, these techniques can provide strict timing guarantees for hard real-time streams whilst improving resource usage.
The third contribution in this thesis is a distributed, bio-inspired, low-overhead task re-allocation technique, which is used to further improve the timeliness and workload distribution of admitted soft real-time streams.

Finally, this thesis explores parallelisation and resource management issues surrounding soft real-time video streams that have been encoded using complex encoding tools and modern codecs such as High Efficiency Video Coding (HEVC). Properties of real streams and decoding trace data are analysed to statistically model and generate synthetic HEVC video decoding workloads. These workloads are shown to have complex and varying task dependency structures and resource requirements. To address these challenges, two novel runtime task clustering and mapping techniques for Tile-parallel HEVC decoding are proposed. These strategies consider the workload communication to computation ratio and stream-specific characteristics to balance predictability improvement and communication energy reduction. Lastly, several task to memory controller port assignment schemes are explored to alleviate performance bottlenecks resulting from memory traffic contention.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Declaration

1 Introduction
  1.1 Background and motivation
    1.1.1 Use-cases for predictable real-time multi-stream video decoding
  1.2 Research problems
    1.2.1 The problem of shared-memory based parallel video decoding
    1.2.2 The problem of resource management for unpredictable workloads
    1.2.3 The problem of scalable management
    1.2.4 The problem of resource sharing
  1.3 Research hypothesis
  1.4 Thesis contributions
  1.5 Thesis outline

2 Literature Survey
  2.1 Video decoding applications
    2.1.1 Video stream decoding overview
    2.1.2 Features of the H.265 (HEVC) coding standard
    2.1.3 Complexities of video decoding
    2.1.4 Parallel video decoding
    2.1.5 Challenges in characterisation of video decoding workload
  2.2 Many-core platforms
    2.2.1 Network-on-chip interconnects
    2.2.2 Predictability in NoCs
    2.2.3 Many-core memory challenges
    2.2.4 Communication models
    2.2.5 Modelling many-core systems
  2.3 Resource management in multicore platforms
    2.3.1 Multiprocessor real-time systems
    2.3.2 Resource management organisation
    2.3.3 Dynamic task mapping
    2.3.4 Runtime task re-allocation
  2.4 Summary

3 Eval. platform, metrics and problem statement
  3.1 System model
    3.1.1 Application model
    3.1.2 Platform model
  3.2 Metrics
    3.2.1 Predictability metrics
    3.2.2 Utilisation-based performance metrics
    3.2.3 Resource management energy-efficiency
    3.2.4 Resource management overhead
  3.3 Simulation-based evaluation
  3.4 Problem statement

4 Admission control strategies to balance predictability and utilisation
  4.1 Runtime task mapping and priority assignment
    4.1.1 Runtime task mapping table
  4.2 Admission control tests
    4.2.1 Deterministic admission controller
    4.2.2 Heuristic-based admission controller
  4.3 Evaluation
    4.3.1 Experimental design
    4.3.2 Results discussion
  4.4 Summary and novel contributions

5 Dynamic task mapping for hard real-time video streams
  5.1 System model
    5.1.1 Application model refinements
    5.1.2 Platform model refinements
    5.1.3 Memory-specific response time analysis refinements
  5.2 Least worst-case remaining slack heuristic
  5.3 LWCRS-aware runtime mapping
    5.3.1 LWCRS mapping complexity analysis
  5.4 Clustered I and P frames mapping
    5.4.1 IPC mapping complexity analysis
  5.5 Baseline static task mapping
  5.6 Evaluation
    5.6.1 Experimental design
    5.6.2 Results discussion
  5.7 Summary and novel contributions

6 Distributed task remapping
  6.1 System model refinement
    6.1.1 Omitting the memory transaction model
    6.1.2 Disabled admission control
    6.1.3 Job-level task mapping/remapping
    6.1.4 Low-overhead remapping procedure
  6.2 Bio-inspired distributed task remapping
    6.2.1 Overview of the PS distributed load-balancing algorithm
    6.2.2 Extensions to the PS algorithm
    6.2.3 PSRM parameter selection
    6.2.4 Complexity and overhead analysis of PSRM
    6.2.5 Challenges of PSRM
    6.2.6 Application of PSRM
  6.3 Cluster-based distributed task remapping
    6.3.1 Improvements to CCPRMV1
    6.3.2 CCPRMV2 parameter selection
    6.3.3 Complexity and overhead analysis of CCPRMV2
  6.4 Evaluation
    6.4.1 Experimental design
    6.4.2 Results discussion
  6.5 Summary and novel contributions

7 Extending the application model to support HEVC encoded video streams
  7.1 Application model refinements
  7.2 HEVC stream analysis and synthetic workload generation methodology
  7.3 Video sequences and codec tools under investigation
    7.3.1 Encoder and decoder settings
  7.4 Adaptive GoP characteristics
    7.4.1 Distribution of different frame types
    7.4.2 Contiguous B-frames and reference frame distances
  7.5 HEVC coding unit characteristics
    7.5.1 CU sizes and types
    7.5.2 CU decoding time
  7.6 Workload communication volume characteristics
    7.6.1 Reference data
    7.6.2 Encoded frame sizes
  7.7 CPU and memory usage