© Copyright 2019 Haichen Shen

System for Serving Deep Neural Networks Efficiently

Haichen Shen

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2019

Reading Committee:

Arvind Krishnamurthy, Chair
Matthai Philipose, Chair
Thomas E. Anderson
Xi Wang

Program Authorized to Offer Degree:
Paul G. Allen School of Computer Science and Engineering

University of Washington

Abstract

System for Serving Deep Neural Networks Efficiently

Haichen Shen

Co-Chairs of the Supervisory Committee:
Professor Arvind Krishnamurthy, Computer Science & Engineering
Affiliate Professor Matthai Philipose, Computer Science & Engineering

Today, Deep Neural Networks (DNNs) can recognize faces, detect objects, and transcribe speech with (almost) human performance. DNN-based applications have become an important workload for both edge and cloud computing. However, it is challenging to design a DNN serving system that satisfies various constraints such as latency, energy, and cost while still achieving high efficiency. First, though server-class accelerators provide significant computing power, it is hard to achieve high efficiency and utilization because the latency constraint limits batching. Second, resource management is a challenging problem for mobile-cloud applications because DNN inference strains device battery capacities and cloud cost budgets. Third, model optimization allows systems to trade off accuracy for lower computation demand, yet it introduces a model selection problem: which optimized model to use and when to use it. This dissertation provides techniques that significantly improve throughput and reduce cost and energy consumption while meeting these constraints, via better scheduling and resource management algorithms in the serving system. We present the design, implementation, and evaluation of three systems: (a) Nexus, a serving system for a cluster of accelerators in the cloud that includes a batch-aware scheduler and a query analyzer for complex queries; (b) MCDNN, an approximation-based execution framework across mobile devices and the cloud that manages resource budgets proportionally to models' frequency of use and systematically trades off accuracy for lower resource use; and (c) sequential specialization, which generates and exploits low-cost, high-accuracy models at runtime once it detects temporal locality in streaming applications. Our evaluation on realistic workloads shows that these systems achieve significant improvements in cost, accuracy, and utilization.

Table of Contents

List of Figures
List of Tables

Chapter 1: Introduction
  1.1 Nexus
  1.2 MCDNN
  1.3 Sequential Specialization
  1.4 Contributions
  1.5 Organization

Chapter 2: Background
  2.1 Deep Neural Networks
  2.2 Model Optimization
  2.3 Related Work
    2.3.1 Nexus
    2.3.2 MCDNN
    2.3.3 Sequential Specialization

Chapter 3: Nexus: Scalable and Efficient DNN Execution on GPU Clusters
  3.1 Background
    3.1.1 Vision pipelines and DNN-based analysis
    3.1.2 Accelerators and the challenge of utilizing them
    3.1.3 Placing, packing and batching DNNs
  3.2 Examples
    3.2.1 Squishy bin packing
    3.2.2 Complex query scheduling
  3.3 System Overview
    3.3.1 Nexus API
  3.4 Global Scheduler
    3.4.1 Scheduling streams of individual DNN tasks
    3.4.2 Scheduling Complex Queries
  3.5 Runtime
    3.5.1 Adaptive Batching
    3.5.2 GPU Multiplexing
    3.5.3 Load balancing
  3.6 Evaluation
    3.6.1 Methodology
    3.6.2 Case Study
    3.6.3 Microbenchmark
    3.6.4 Large-scale deployment with changing workload
  3.7 Conclusion

Chapter 4: MCDNN: Approximation-Based Execution Framework for Mobile-Cloud Environment
  4.1 Background
  4.2 Approximate Model Scheduling
    4.2.1 Fractional packing and paging
    4.2.2 AMS problem definition
    4.2.3 Discussion
  4.3 System Design
    4.3.1 Model catalogs
    4.3.2 System support for novel model optimizations
    4.3.3 MCDNN's approximate model scheduler
    4.3.4 End-to-end architecture
  4.4 Evaluation
    4.4.1 Evaluating optimizations
    4.4.2 Evaluating the runtime
    4.4.3 MCDNN in applications
  4.5 Conclusions

Chapter 5: Sequential Specialization for Streaming Applications
  5.1 Class skew in day-to-day video
  5.2 Specializing Models
  5.3 Sequential Model Specialization
    5.3.1 The Oracle Bandit Problem (OBP)
    5.3.2 The Windowed ε-Greedy (WEG) Algorithm
  5.4 Evaluation
    5.4.1 Synthetic experiments
    5.4.2 Video experiments
  5.5 Conclusion

Chapter 6: Conclusion

Bibliography

Appendix A: Hardness of Fixed-rate GPU Scheduling Problem (FGSP)

Appendix B: Compact Model Architectures

Appendix C: Determine Dominant Classes

List of Figures

1.1  The impact of batching
1.2  Accuracy of object classification versus number of operations
1.3  System suite overview
2.1  Four common operators in deep neural networks
2.2  LeNet [80] model architecture
2.3  AlexNet [77] model architecture
2.4  VGG [117] model architecture
2.5  ResNet [59] model architecture
2.6  Inception [121] model architecture
3.1  Example video streams
3.2  A typical vision processing pipeline
3.3  Impact of batching DNN inputs on model execution throughput and latency
3.4  Resource allocation example
3.5  Query pipeline
3.6  Nexus runtime system overview
3.7  Nexus application API
3.8  Application example of live game streaming analysis
3.9  Traffic monitoring application example
3.10 Process of merging two nodes
3.11 Dataflow representations of two example applications
3.12 Compare the percent of requests that miss their deadlines under three batching policies
3.13 Batch size trace by three batching policies
3.14 Compare the throughput of live game analysis
3.15 Compare the throughput of traffic monitoring application
3.16 Compare the throughput when having multiple models on a single GPU
3.17 Compare the throughput improvement of batching the common prefix
3.18 Evaluate the overhead and memory use of prefix batching
3.19 Compare the throughput between Nexus and Nexus using a batch-oblivious resource allocator
3.20 Compare the throughput between Nexus with and without query analyzer
3.21 Compare the throughput of different latency split plans
3.22 Changing workload of 10 applications over time on 64 GPUs
4.1  Basic components of a continuous mobile vision system
4.2  Memory/accuracy tradeoffs in MCDNN catalogs
4.3  Energy/accuracy tradeoffs in MCDNN catalogs
4.4  Latency/accuracy tradeoffs in MCDNN catalogs
4.5  System support for model specialization
4.6  System support for model sharing
4.7  Architecture of the MCDNN system
4.8  Impact of collective optimization on memory consumption
4.9  Impact of collective optimization on energy consumption
4.10 Impact of collective optimization on execution latency
4.11 Impact of MCDNN's dynamically-sized caching scheme
4.12 Impact of MCDNN's dynamic execution-location selection
4.13 Accuracy of each application over time for Glimpse usage
5.1  Temporal skew of classes in day-to-day video
5.2  Accuracy of compact model O1 trained by varying skews
5.3  Accuracy of compact model O1 tested by varying skews and number of dominant classes
5.4  Accuracy of specialized scene and face classifiers
5.5  Ablation tests of WEG algorithm

List of Tables

3.1  DNN execution latencies and estimated costs per 1000 invocations
3.2  Batching profiles for models used in the example
3.3  Batch execution latency and throughput of models X and Y used in the example query
3.4  Average throughput with three latency split plans for varying α
3.5  Notation for GPU scheduling problem
3.6  Compare the throughput among three load balancing policies
4.1  Description of classification tasks
4.2  Runtime overhead of specialization
5.1  Comparison between oracle classifiers and compact classifiers
5.2  Evaluation of WEG algorithm on synthetic traces
5.3  Evaluation of WEG algorithm on real videos
5.4  Key statistics from WEG usage
B.1  Architecture of compact models
C.1  Probability p_in and p_out under various N, a*, n, p and w, k

Acknowledgments

This dissertation would not have been possible without help from many people. First, I would like to thank my advisors, Arvind Krishnamurthy and Matthai Philipose. They provided much guidance and advice over the years and helped me grow into an independent researcher. Arvind is insightful about how to formulate research problems and identify their essence, and he always encouraged me to dive deeper. Matthai worked closely with me on many projects and has been helpful whenever I encountered problems. I am also grateful to David Wetherall, Aruna Balasubramanian, and Anthony LaMarca for advising me during the early years of my Ph.D. Further, I would like to thank the other members of my dissertation committee, Tom Anderson and Xi Wang. Their feedback and advice are invaluable.

I would also like to thank the many collaborators and colleagues in my Ph.D. life. Seungyeop Han worked with me on several projects and is an exemplary researcher to me. I am also honored to have worked with Tianqi Chen on TVM. He is very talented and inspired me in many ways. I thank Yuchen Jin and Bingyu Kong for their help on Nexus. I thank all my labmates and friends, Qiao, Danyang, Ming, Sophia, Raymond, Will, Nacho, Jialin, Naveen, Antoine, Irene, Dan, Adriana, Taesoo, Colin, Shumo, Ravi, and Qifan. I learned a lot through discussions and by working closely with them. I thank Melody Kadenko, Elise Degoede, and Lindsay Michimoto for their excellent administrative support, which kept the path to my Ph.D. degree free of complications.

Finally, I want to thank my family. My wife Yuhan has been supportive throughout my entire Ph.D. I am deeply grateful that she delivered our lovely daughter Hannah during my last year of the Ph.D. My parents taught me numerous principles and lessons that made me who I am today. I thank them for their unconditional support and love.

Dedication

To my wife Yuhan and my daughter Hannah


Chapter 1: Introduction

Deep Neural Networks (DNNs) have become the dominant approach to solving a variety of computing problems such as handwriting recognition [80], speech recognition [83], and machine translation [119]. In many computer vision tasks, such as face [123] and object [59] recognition, deep learning can almost achieve human-level performance. Sophisticated DNN-based applications can be built by composing several DNNs together. For example, an intelligent assistant can incorporate a speech recognition DNN and a text summarization DNN to transcribe a conversation and summarize it as a memo. A traffic monitoring application can use an object detector first to find all cars in a video frame, and then apply a DNN classifier to recognize car make and model as well as extract license plate information.

Although DNNs have been shown to provide excellent performance, they are also known to be computationally intensive. Unlike traditional machine learning algorithms, DNNs require a significant amount of computational resources during inference, not just during training. DNN inference routinely consumes hundreds of MB of memory and GFLOPs of computing power [77, 123, 59, 121]. Further, DNN-based applications usually need to execute multiple DNN models to process a single user request, and we expect these applications to invoke DNN models frequently, as they usually process high-data-rate streams such as video and speech.

In recent years, specialized hardware and accelerators such as GPUs [2], FPGAs [29], and ASICs like the TPU [71] have become more common both in the data center and on edge devices. Today, modern mobile phones come with mobile GPUs [6, 125] and even specialized chips such as the neural engines [116] in the iPhone A11/A12 chips and Huawei's NPU [34]. Though these accelerators lower the cost and energy consumption of DNN inference compared to CPUs, it is still non-trivial to serve DNN workloads at peak efficiency.


Figure 1.1: The impact of batching. GPU utilization for GEMM vs. GEMV operations, measured on an NVIDIA Titan X (Pascal) GPU. Matrix-matrix multiplication can be viewed as a batched form of matrix-vector multiplication and is more than 60× more efficient.

First, while server-class accelerators provide very high capacity (with peak FLOPS ratings for modern GPUs exceeding 10 TFLOPS), it is necessary to group inputs into a batch before passing them through the computation in order to reap this enormous computing power. For example, Figure 1.1 compares the sustained FLOPS of highly optimized implementations of matrix-vector multiplication ("GEMV") and matrix-matrix multiplication ("GEMM"), which can be viewed as a batched GEMV operation. In relative terms, GEMM can sustain more than 60× the utilization of GEMV. However, batching increases the latency of each request, since all inputs are computed in lockstep, and it must be used with care for applications requiring latency-sensitive inference. This introduces a resource allocation and scheduling problem for cloud operators: how to distribute DNN workloads onto a cluster of accelerators at high utilization while satisfying latency requirements.

Second, in the context of edge execution, DNNs strain device batteries even when the devices are equipped with mobile GPUs or custom DNN ASICs. For example, a mobile GPU can provide 290 GOP/s at 10 W, meaning that each operation consumes 34 pJ. Consider a DNN model that performs 5 billion floating-point operations per invocation and is invoked 15 times per second. Even a large 3000 mAh mobile phone battery would be drained in less than 5 hours by this single model alone. Further, limited memory capacity can also become a bottleneck for large DNN models or when more than one application performs DNN operations.

Figure 1.2: Top-1 accuracy on ImageNet object classification versus number of operations for various models.

Nonetheless, the cost of cloud computation is not cheap either. If we target $10 per client per year, we can get no more than 10 GFLOP/s from EC2 instances. The disparity between the high computation demands of DNNs and the limited resources on mobile devices and cost budgets in the cloud leads to a challenging resource management problem: the serving system needs to carefully decide where to execute DNNs, on the mobile device or in the cloud, without exhausting its resource budgets.
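To make the scale of this disparity concrete, the following back-of-the-envelope calculation reproduces the battery-life figure above (the 3.8 V nominal battery voltage is an added assumption):

\begin{align*}
P_{\text{inference}} &\approx 5\,\text{GFLOP} \times 15\,\text{s}^{-1} \times 34\,\text{pJ/op} \approx 2.6\,\text{W} \\
T_{\text{battery}} &\approx \frac{3000\,\text{mAh} \times 3.8\,\text{V}}{2.6\,\text{W}} \approx \frac{11.4\,\text{Wh}}{2.6\,\text{W}} \approx 4.4\,\text{hours}
\end{align*}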

Third, the machine learning community continues to propose new model architectures with higher accuracy or lower computation demand. For example, Figure 1.2 depicts the accuracy and number of operations of 20 different models for object classification. In addition, another line of work focuses on model optimization techniques that can reduce the memory [139, 138, 26, 55, 65] and processing demands [68, 74, 111] of any DNN, typically at the expense of some loss in classification accuracy. These techniques help generate a model catalog with a range of accuracies and computation costs for a given task. This gives systems an opportunity to trade off accuracy and resource use by tackling the model selection problem, i.e., how to achieve the best accuracy and performance within the resource budgets and constraints at runtime.


Figure 1.3: System suite overview.

Further, given the relevance of the streaming setting to DNN workloads, it is also possible to generate optimized models on the fly that are specialized to the current temporal context for better accuracy and performance. Hence, the model selection problem has to take into account not only the resource settings but also the inference context, and a DNN serving system has to continually generate optimized models and decide whether it is effective to use the specialized models.

The key question this dissertation addresses is how to improve the efficiency of DNN inference (its throughput, utilization, and accuracy) while meeting various constraints such as latency, cost budget, energy consumption, and memory capacity. We argue that, to achieve high efficiency, a serving system must (a) be aware of the properties of devices and accelerators, including energy budget, memory capacity, and the performance curves of models with respect to different batch sizes and operators; (b) understand the characteristics of DNN models, such as their accuracy, energy cost, and execution latency; and (c) track the statistics of the workload, such as incoming request rates and input distribution.
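As an illustration of the kind of state such a serving system must track, the sketch below groups the three categories of information above into simple records; the field names are hypothetical and do not correspond to an actual interface of any system in this dissertation.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DeviceProfile:
    """(a) Properties of a device or accelerator."""
    memory_mb: int                 # on-board memory capacity
    energy_budget_mw: float        # sustainable power budget (edge devices)
    # measured latency (ms) per model for each batch size: model -> {batch: ms}
    batch_latency_ms: Dict[str, Dict[int, float]] = field(default_factory=dict)

@dataclass
class ModelProfile:
    """(b) Characteristics of one optimized model variant."""
    accuracy: float                # e.g., top-1 accuracy on a validation set
    energy_mj_per_call: float      # energy per inference
    latency_ms: float              # single-input execution latency
    memory_mb: float               # size of the loaded model

@dataclass
class WorkloadStats:
    """(c) Statistics of the incoming workload."""
    request_rate_per_s: float      # observed request rate per stream
    class_histogram: Dict[str, float] = field(default_factory=dict)  # input distribution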

In this thesis, I built a system suite that optimizes the execution of DNN inference, depicted in Figure 1.3. It contains three major components.

• Nexus is designed for DNN workload orchestration on a cluster of accelerators in the cloud. It includes techniques for performing resource allocation and scheduling multi-tenant DNN workloads. It improves GPU efficiency and utilization while meeting latency requirements.

• MCDNN considers execution on cloud-backed mobile devices. It manages the available resource budgets for computation, memory, energy, and cloud cost, and it systematically determines which optimized models to use and where to execute them to improve the overall accuracy.

• Sequential specialization is an optimization for streaming applications. It exploits the temporal locality of streams: a small set of objects or persons tends to appear frequently within a short time period. This optimization reduces computational cost and execution latency while providing comparable or better accuracy.

1.1 Nexus

As mentioned above, specialized hardware accelerators for DNNs have emerged in the recent past. Accelerators can process DNNs orders of magnitude faster and cheaper than CPUs in many cases. However, the cost savings from using them depend critically on operating them at sustained high utilization. In this work, we primarily focus on GPUs, but the same principles also apply to other accelerators. Conceptually, the problem can be thought of as sharding inputs coming through a distributed frontend onto DNNs served at "backend" GPUs. Several interacting factors complicate this viewpoint. First, given the size of GPUs, it is often necessary to place different types of networks on the same GPU. It is then important to select and schedule them so as to maximize their combined throughput while satisfying latency bounds. Second, many applications consist of groups of DNNs that feed into each other. It is important to be able to specify these groups and to schedule the execution of the entire group on the cluster so as to maximize performance. Third, as mentioned above, dense linear algebra computations such as DNNs execute much more efficiently when their inputs are batched together. Batching fundamentally complicates scheduling and routing because (a) it benefits from cross-tenant and cross-request coordination and (b) it forces the underlying bin-packing-based scheduling algorithms to incorporate batch size. Fourth, the increasingly common use of transfer learning [41, 137] in today's workloads has led to specialization of networks, where two tasks that

formerly used identical networks now use networks that are only approximately identical. Since batching only works when multiple inputs are applied to the same model in conventional DNN execution systems, the benefits of batching are lost. Nexus is a cloud serving system for DNN execution on GPU clusters that addresses these problems to attain high execution throughput under latency Service Level Objectives (SLOs). It uses three main techniques to do so. First, it relies on a novel batching-aware scheduler (Section 3.4.1) that performs bin packing when the balls (i.e., DNN tasks) being packed into bins (i.e., GPUs) have variable size, depending on the size of the batch they are in. This schedule specifies the GPUs needed, the distribution of DNNs across them, and the order of their execution so as to maximize execution throughput while staying within latency bounds. Second, it allows groups of related DNN invocations to be written as queries and provides automated query optimization that assigns batch sizes to the components of a query so as to maximize the query's overall execution throughput while staying within its latency bounds. Finally, Nexus breaks from orthodoxy and allows batching of parts of networks with different batch sizes. This enables the batched execution of specialized networks.
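To give a flavor of the query optimization step, the following sketch brute-forces the split of a query-level latency SLO between two dependent models given their batching profiles; the profile numbers are invented for illustration, and the actual Nexus algorithm (Section 3.4.2) is more sophisticated.

# Batching profile: batch size -> batch execution latency in ms (illustrative numbers).
PROFILE_A = {1: 10, 4: 20, 8: 30, 16: 50}
PROFILE_B = {1: 15, 4: 30, 8: 50, 16: 90}

def best_throughput(profile, latency_budget_ms):
    """Largest-batch throughput (reqs/s) achievable within a latency budget.

    A request may wait for up to one full batch before it executes, so the
    batch execution latency must not exceed half of the budget.
    """
    best = 0.0
    for batch, lat in profile.items():
        if 2 * lat <= latency_budget_ms:
            best = max(best, batch * 1000.0 / lat)
    return best

def split_query_slo(profile_a, profile_b, slo_ms, step_ms=5):
    """Try every split of the SLO between the two stages; the query throughput
    is limited by the slower stage, so pick the split maximizing that minimum."""
    best_split, best_tp = None, 0.0
    for a_ms in range(step_ms, slo_ms, step_ms):
        tp = min(best_throughput(profile_a, a_ms),
                 best_throughput(profile_b, slo_ms - a_ms))
        if tp > best_tp:
            best_split, best_tp = (a_ms, slo_ms - a_ms), tp
    return best_split, best_tp

print(split_query_slo(PROFILE_A, PROFILE_B, slo_ms=100))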

1.2 MCDNN

Given the relevance of such applications to the mobile setting and the emergence of powerful mobile GPUs [6, 125] and ASICs [116, 34], there is a strong case for executing DNNs on mobile devices. We present a framework, called MCDNN, for executing multiple applications that use large DNNs on (intermittently) cloud-connected mobile devices to process streams of data such as video and speech. We target high-end mobile-GPU-accelerated devices, but our techniques are useful broadly. We anticipate that in the near future, multiple applications will seek to run multiple DNNs on incoming sensor streams such as video, audio, depth, and thermal video. The large number of simultaneous DNNs in operation and the high frequency of their use will strain available resources, even when optimized models are used. We use two insights to address this problem. First, model optimization typically allows a graceful trade-off between accuracy and resource use. Thus, a system can adapt to high workloads by using less accurate variants of optimized models. Second, we adopt two powerful model optimizations for the streaming and multi-model settings: (a) using

sequential specialization to produce lightweight models with low cost and high accuracy, and (b) sharing computation across queries from multiple models on the same stream that may have semantic similarity. We formulate the challenge of adaptively selecting model variants of differing accuracy, in order to remain within per-request resource constraints (e.g., memory) and long-term constraints (e.g., energy) while maximizing average classification accuracy, as a constrained optimization problem we call Approximate Model Scheduling (AMS). To solve AMS, MCDNN adopts two innovations. First, we generate optimized variants of models by automatically applying a variety of model optimization techniques with different settings. We record the accuracy, memory use, execution energy, and execution latency of each variant to form a catalog for each model. Second, we provide a heuristic scheduling algorithm for solving AMS that allocates resources proportionally to models' frequency of use and uses the catalog to select the most accurate corresponding model variant.
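The following is a minimal sketch of the proportional-allocation idea behind that heuristic; the catalog entries and budget are invented for illustration, and the real scheduler (Section 4.3.3) handles multiple resource types, paging, and execution placement.

# Each catalog lists variants as (energy per request in mJ, accuracy); numbers are made up.
CATALOGS = {
    "face":  [(400.0, 0.92), (150.0, 0.88), (60.0, 0.81)],
    "scene": [(300.0, 0.75), (120.0, 0.70), (50.0, 0.62)],
}

def schedule(request_freq, total_budget_mj_per_s, catalogs=CATALOGS):
    """Give each task a share of the long-term energy budget proportional to its
    request frequency, then pick the most accurate catalog variant whose expected
    energy draw (energy/request x requests/s) fits inside that share."""
    total_freq = sum(request_freq.values())
    chosen = {}
    for task, freq in request_freq.items():
        share = total_budget_mj_per_s * freq / total_freq   # proportional allocation
        affordable = [(e, acc) for e, acc in catalogs[task] if e * freq <= share]
        chosen[task] = max(affordable, key=lambda v: v[1], default=None)
    return chosen

print(schedule({"face": 5.0, "scene": 1.0}, total_budget_mj_per_s=800.0))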

1.3 Sequential Specialization

DNN applications quite commonly receive streaming workloads. Consider the workload of recognizing entities such as objects, people, scenes, and activities in every frame of video footage of day-to-day life. In principle, these entities could be drawn from thousands of classes: many of us encounter hundreds to thousands of distinct people, objects, scenes, and activities through our lives. Over short intervals such as minutes, however, we tend to encounter a very small subset of classes of entities. For instance, a wearable camera may see the same set of objects from our desk at work for an hour. We characterize and exploit such short-term class skew to significantly reduce the latency of classifying video using DNNs. We measure a diverse set of videos and demonstrate that for many recognition tasks, day-to-day video often exhibits significant short-term skew in class distribution: for instance, in over 90% of 1-minute windows, at least 90% of the objects interacted with by humans belong to a set of 25 or fewer objects. The underlying recognizer, on the other hand, can recognize up to 1000 objects. We show that similar skews hold for faces and scenes in videos. We then demonstrate that when the class distribution is highly skewed, "specialized" DNNs trained to classify inputs from this distribution can be much more compact than the oracle DNN,

the unoptimized classier. For instance, we present a DNN that executes òýý fewer FLOPs than

VGGFace [Ôýý] model, but has comparable accuracy when over ý% of faces come× from the same Ôý or fewer people. We present similar order-of-magnitude faster specialized CNNs for object and scene recognition. e key challenges for specialization to work at inference time are (a) how to produce accurate versions of DNNs specialized for particular skews within a short time, (b) how to balance exploration (i.e., using the expensive oracle to estimate the skew) with exploitation (i.e., using a model specialized to the current best available estimate of the skew). We rst devise an optimization that only retrains the top layers of models, and thus reduce the training time to a few seconds. Second, we formalize the “bandit”-style sequential decision-making problem as the Oracle Bandit Problem and propose a new exploration/exploitation-based algorithm we dub Windowed ε-Greedy (WEG) to address it.
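The following is a highly simplified sketch of such a windowed exploration/exploitation loop; the window length, ε, skew test, and specialization hook are placeholders, and the actual WEG algorithm is specified in Section 5.3.2.

import random
from collections import Counter, deque

def weg_loop(frames, oracle, specialize, epsilon=0.05, window=300, top_k=10):
    """Simplified windowed epsilon-greedy cascade.

    oracle(frame)       -> label: slow but trusted classifier.
    specialize(classes) -> compact model restricted to `classes`; it returns
                           (label, confident) and defers when not confident.
    """
    recent = deque(maxlen=window)   # sliding window of oracle labels
    compact = None                  # current specialized model, if any
    outputs = []
    for frame in frames:
        use_oracle = compact is None or random.random() < epsilon
        if not use_oracle:
            label, confident = compact(frame)
            if confident:
                outputs.append(label)
                continue            # cheap path: skip the oracle entirely
        label = oracle(frame)       # explore (or handle low-confidence inputs)
        outputs.append(label)
        recent.append(label)
        if len(recent) == window:
            dominant = [c for c, _ in Counter(recent).most_common(top_k)]
            # Re-specialize when the window suggests a skewed class distribution;
            # otherwise drop the compact model and rely on the oracle again.
            if sum(recent.count(c) for c in dominant) > 0.8 * window:
                compact = specialize(dominant)
            else:
                compact = None
    return outputs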

1.4 Contributions

This dissertation demonstrates that we can significantly improve throughput and reduce cost and energy consumption while meeting the relevant constraints (e.g., latency, cost, energy). I explore a design space that ranges from a common optimization for streaming applications that achieves both high accuracy and low cost, to two end-to-end system designs that achieve higher throughput and accuracy in two deployment scenarios. I summarize the contributions made in this dissertation as follows.

• the design and implementation of a cloud serving system for a cluster of accelerators that is aware of batching characteristics when scheduling both standalone DNN tasks and composite or complex DNN jobs;

• a thorough analysis of optimized models regarding the trade-off between accuracy and energy consumption, execution latency, and cost;

• the design and implementation of an approximation-based execution framework that systematically chooses optimized models and schedules execution between mobile devices and the

cloud to satisfy resource constraints as well as optimize the overall accuracy;

• a demonstration that specialized models can be much more compact and still achieve high accuracy when the class distribution is highly skewed;

• the Windowed ε-Greedy algorithm, which balances exploration, using an oracle to estimate the underlying distribution associated with the incoming stream, against exploitation, using a specialized model suited to the current distribution estimate.

1.5 Organization

The remaining chapters are organized as follows. Chapter 2 presents background on the basic operators in deep neural networks and several classic neural network architectures, surveys various model optimization techniques that can reduce the memory and computation demands of models, describes a few DNN application scenarios, and surveys related work. Chapter 3 addresses the problem of serving DNNs at sustained high utilization from a cluster of GPUs in a multi-tenant cloud setting. We present the system design of Nexus and its batch-aware, cluster-scale resource management that performs careful scheduling of GPUs. Chapter 4 considers DNN inference on mobile devices that operate in concert with the cloud. We present the design of the MCDNN system, which systematically trades off accuracy for resource use and balances DNN execution between mobile and cloud to satisfy resource constraints and cost budgets. Chapter 5 presents an optimization for streaming applications that trains "specialized" models on the fly using temporal context to speed up DNN inference. We describe the mechanism for detecting context changes and deciding when to toggle "specialized" models on and off. Finally, Chapter 6 concludes the dissertation and discusses future work.

Chapter 2: Background

This chapter presents a brief introduction to deep neural networks, including the operators used in DNNs and several common model architectures. We then provide a survey of model optimization techniques. Finally, we discuss related work for the three systems in this dissertation.

2.1 Deep Neural Networks

Deep neural networks (DNNs) are networks of dense linear algebra operations. Typical inputs to the network (or "model") include images to be classified, audio snippets to be transcribed, or text to be understood, and the outputs are the corresponding answers. Inputs and outputs are typically encoded as dense vectors of floats.

Each operator (or "layer") in the model is one of a small set of linear algebra operations¹, depicted in Figure 2.1:

matrix-vector multiplication ("GEMV"), also called a fully-connected layer, multiplies the input vector x by a weight matrix W of size, e.g., 1024 × 1024:

$$y = W \cdot x \qquad (2.1)$$

tensor convolution convolves the input 3-D tensor x with a set of convolution kernels W of size K × K, typically K = 3, 5, or 7:

$$y_{mij} = \sum_{p,q \le K,\, n \le N} W_{mnpq} \, x_{n(i+p)(j+q)} \qquad (2.2)$$

pooling replaces each element with its local maximum (or average) within a K × K window (K = 2 or 3), usually shifting with stride S = 2 or 3:

$$y_{cij} = \max_{p,q \le K} x_{c(i \cdot S + p)(j \cdot S + q)} \qquad (2.3)$$

activation replaces each element with a simple non-linear transform σ, e.g., ReLU (Rectified Linear Units), sigmoid, or tanh:

$$y_{ij} = \sigma(x_{ij}) \qquad (2.4)$$

Figure 2.1: Four common operators in deep neural networks: (a) fully-connected, (b) convolution, (c) pooling, (d) activation.

¹Researchers have proposed other operators, but this set is representative.

The matrix-vector multiplication and convolution layers are parameterized by weight arrays that are learned from training data. The network description is called a model architecture, and the trained network with its weights instantiated is a model. Convolutional Neural Networks (CNNs), mostly used in vision tasks, tend to have a linear feed-forward structure (e.g., [77, 117]) but can also include directed acyclic graphs (DAGs) [121, 122, 59]. Meanwhile, Recurrent Neural Networks (RNNs), heavily used in natural language processing, have recurrent graph structures and are fully unrolled over the incoming data before execution. We now explore a few CNN model architectures that are used in later chapters; they are introduced in chronological order.
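For concreteness, the four operators above can be written directly in NumPy; this naive sketch mirrors equations (2.1)-(2.4) and is meant for readability, not speed.

import numpy as np

def fully_connected(W, x):                      # eq. (2.1): y = W . x
    return W @ x

def convolution(W, x):                          # eq. (2.2), no padding or stride
    M, N, K, _ = W.shape                        # output channels, input channels, kernel
    _, H, Wd = x.shape
    y = np.zeros((M, H - K + 1, Wd - K + 1))
    for m in range(M):
        for i in range(H - K + 1):
            for j in range(Wd - K + 1):
                y[m, i, j] = np.sum(W[m] * x[:, i:i + K, j:j + K])
    return y

def max_pool(x, K=2, S=2):                      # eq. (2.3)
    C, H, Wd = x.shape
    y = np.zeros((C, H // S, Wd // S))
    for i in range(H // S):
        for j in range(Wd // S):
            y[:, i, j] = x[:, i * S:i * S + K, j * S:j * S + K].max(axis=(1, 2))
    return y

def relu(x):                                    # eq. (2.4) with sigma = ReLU
    return np.maximum(x, 0.0)

x = np.random.randn(3, 32, 32)                  # a 3-channel 32x32 input
h = relu(convolution(np.random.randn(8, 3, 5, 5), x))
h = max_pool(h)
y = fully_connected(np.random.randn(10, h.size), h.reshape(-1))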

Figure 2.2: LeNet [80] model architecture.

Figure 2.3: AlexNet [77] model architecture.

First, Figure 2.2 shows the architecture of LeNet [80], designed by Yann LeCun et al. in 1998. LeNet is used to recognize handwritten digits. It consists of two convolution layers, two pooling layers, and two fully-connected layers. The model is fairly small compared to the other models, owing to the limited computing power available at the time: its total computation cost is 4.6 million FLOPs, and it contains 0.4 million parameters to be learned.

AlexNet [77], depicted in Figure 2.3, was proposed in 2012 by Alex Krizhevsky et al. and targets object recognition rather than digit recognition. It consists of 5 convolution layers, 2 max pooling layers, and 3 fully-connected layers. This model requires 1.4 billion FLOPs of computation and 61 MB of memory, and achieves 56.6% accuracy over 1000 different object classes.

VGG-Net [117] (Figure 2.4) extends the AlexNet design by adding more convolution layers and using smaller kernel sizes. It has four versions with 11, 13, 16, and 19 layers (counting convolution and fully-connected layers). Models with more layers achieve higher accuracy but require more computation (39 GFLOPs for 19 layers) and memory (144 MB for 19 layers), as shown in Figure 1.2.

Continuing the trend toward deeper structures, ResNet [59] introduces skip connections from

Figure 2.4: VGG [117] model architecture (16-layer version shown).

Figure 2.5: ResNet [59] model architecture (34-layer version shown).

Figure 2.6: Inception [121] model architecture.

previous layers, as shown in Figure 2.5. This skip-connection structure solves the vanishing gradient problem [18, 47] and is therefore able to achieve higher accuracy. The number of layers in ResNet ranges from 18 to 152, and it achieves up to 78.3% accuracy.

Inception [121, 122], on the other hand, reduces the amount of computation by using smaller convolution kernel sizes and a parallel architecture, depicted in Figure 2.6. Inception requires 11.5 GFLOPs of computation and 27 MB of memory while achieving 77.5% accuracy.
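To make these architecture descriptions concrete, the following PyTorch sketch defines a small LeNet-style network with two convolution layers, two pooling layers, and two fully-connected layers; the specific channel counts, kernel sizes, and input resolution are illustrative assumptions rather than the exact configurations discussed above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LeNetLike(nn.Module):
    """A minimal LeNet-style CNN, assuming 28x28 grayscale inputs and 10 classes."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)    # convolution
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)              # convolution
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)          # pooling
        self.fc1 = nn.Linear(16 * 5 * 5, 120)                      # fully-connected
        self.fc2 = nn.Linear(120, num_classes)                     # fully-connected

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # conv -> activation -> pool
        x = self.pool(F.relu(self.conv2(x)))   # conv -> activation -> pool
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

logits = LeNetLike()(torch.randn(8, 1, 28, 28))  # a batch of 8 images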

Deep neural networks go through two phases. During the training phase, the weights and kernels (generically, "parameters") W are estimated from a set of desired input-output pairs (x, y), typically using some variant of the gradient descent algorithm. This phase often takes hours to days, and much recent systems work [35, 28, 86] has focused on increasing the speed of (distributed) gradient descent.

Once the network is trained, the parameters W are fixed, and the network N can be used for inference: for a given input x (e.g., an image), N is evaluated to compute the output y = N(x) (e.g., a list of windows in the image, each labeled with the name of the person in that window). Evaluating a typical network may require executing several GFLOPs, so for many applications inference accounts for the bulk of the cost of execution.

2.2 Model Optimization

Among the four operators described in Section 2.1, matrix-vector products and convolutions typically consume well over 95% of execution time. As a result, recent model optimization techniques for reducing resource use in DNNs have targeted these operators in four main ways:

1. Matrix/tensor factorization replaces the weight matrices and convolution kernels with their low-rank approximations [138, 68, 111, 75]. Replacing a size-M × M weight matrix W_{M×M} with its singular value decomposition U_{M×k}V_{k×M} reduces storage overhead from M² to 2Mk and computational overhead from M³ to 2M²k. This can be further extended to factorize higher-dimensional tensors for convolution [75]. The most recent results [75] report reductions of 5.5× in memory use and 2.7× in FLOPs at a loss of 1.7% accuracy for the "AlexNet" model we use for scene recognition, and 1.25×, 4.9×, and 0.5% for the "VGG16" model we use for object recognition. (A small numerical sketch of this factorization follows the list below.)

2. Matrix pruning [26, 55] sparsifies matrices by zeroing very small values, uses low-bitwidth representations for the remaining non-zero values, and uses compressed (e.g., Huffman coded) representations for these values. The most recent results [55] report 11×/8× reductions in model size and 3×/5× reductions in FLOPs for AlexNet/VGGNet while sacrificing essentially no accuracy. Sacrificing 2 percentage points of accuracy, for instance, can improve the memory size reduction to 20×.

3. Quantization replaces floating-point values with fixed-point values [145] or, in the extreme, binary values [108]. These techniques can reduce the memory footprint by more than 30× and the number of operations by 50× at an accuracy loss of around 10%.

4. Architectural changes explore the space of model architectures, including varying the number of layers, the sizes of weight matrices and kernels, etc. For instance, reducing the number of layers in VGGNet [117] from 19 to 11 results in a drop in accuracy of 4.1 percentage points. Researchers [146] have also proposed systematic search over model architectures for targeted accuracy and computation demands, though this approach requires substantial resources and time to complete.
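The following minimal NumPy sketch illustrates the factorization idea from technique 1 above; the layer size and rank are arbitrary, and a random matrix is used purely to show the mechanics (real DNN weight matrices are far closer to low rank than a random one, so their approximation error is much smaller).

import numpy as np

M, k = 1024, 64                                  # layer width and chosen rank
W = np.random.randn(M, M)                        # a dense fully-connected weight matrix
x = np.random.randn(M)

# Truncated SVD: keep only the top-k singular values/vectors.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k] * s[:k]                           # fold singular values into U
V_k = Vt[:k, :]

y_exact = W @ x                                  # M*M multiply-adds, M*M weights stored
y_approx = U_k @ (V_k @ x)                       # 2*M*k multiply-adds, 2*M*k weights stored

rel_err = np.linalg.norm(y_exact - y_approx) / np.linalg.norm(y_exact)
print(f"storage: {M*M} -> {2*M*k} weights, relative error {rel_err:.3f}")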

2.3 Related Work

The previous sections gave a brief introduction to deep neural networks and model optimization techniques. In this section, we discuss previous work related to the three systems in this dissertation.

2.3.1 Nexus

The projects most closely related to Nexus are Clipper [31] and TensorFlow Serving [17]. Clipper is a "prediction serving system" that serves a variety of machine learning models, including DNNs, on CPUs and GPUs. Given a request to serve a machine learning task, Clipper selects the type of model to serve it, batches requests, and forwards the batched requests to a backend container. By batching requests, and adapting batch sizes online under a latency SLO, Clipper takes a significant step toward Nexus's goal of maximizing serving throughput under latency constraints. Clipper also provides approximation and caching services, complementary to Nexus's focus on executing all requests exactly but efficiently. TensorFlow Serving can be viewed as a variant of Clipper that does not provide approximation and caching but has additional machinery for versioning models. On top of the basic batched-execution architecture of Clipper, Nexus builds along the dimensions of scale, expressivity, and granularity. These techniques address the challenges above and reflect Nexus's focus on executing DNNs on GPUs at high efficiency and scale.

Scale: Nexus provides the machinery to scale serving to large, changing workloads. In particular, it automates the allocation of resources (including GPUs) and the placement and scheduling of

models across allocated resources. It provides a distributed frontend that scales to large numbers of requests, with a work-sharding technique that plays well with batching requests on the backend. These functions are performed on a continuing basis to adapt to workloads, with re-allocation and re-packing structured to cause minimum churn.

Expressivity: Nexus provides a query mechanism that (a) allows related DNN execution tasks to be specified jointly, and (b) allows the user to specify the latency SLO just at the whole-query level. Nexus then analyzes the query and allocates latency bounds and batch sizes to constituent DNN tasks so as to maximize the throughput of the whole query.

Granularity: Where Clipper limits the granularity of batched execution to whole models, Nexus automatically identifies common subgraphs of models and executes them in a batch. This is critical for batching with specialized models, which often share all but the output layer, as described previously. Pretzel [81, 82] also tackles the model granularity problem by matching duplicate sub-modules in the database, but it focuses only on the benefits of memory savings and faster model loading and does not address the execution efficiency of the common parts.

Recent work also explores DNN serving systems for application-specific workloads. Sirius [58] and DjiNN [57] discuss designs of warehouse-scale computers to efficiently serve voice and vision personal assistant workloads, a suite of 7 DNN models. They mainly analyze the performance and total cost of ownership of server designs from an architecture perspective. VideoStorm [142] explores the resource-quality trade-off for video analytics with multi-dimensional configurations such as resolution and sampling rate. While VideoStorm improves performance on quality and lag, the solution is not generally applicable to other applications.

Serving DNNs at scale is similar to other large-scale short-task serving problems. These systems have distributed frontends that dispatch low-latency tasks to queues on the backend servers. Sparrow [99] focuses on dispatch strategies to reduce the delays associated with queuing in such systems. Slicer [11] provides a fast, fault-tolerant service for dividing the backend into shards and load balancing across them. Both systems assume that backend server allocation and task placement are performed at a higher (application) level, using cluster resource managers such as Mesos [60] or Omega [114]. Nexus shares the philosophy of these systems of having a fast data plane that dispatches

incoming messages from the frontend to backend GPUs and a slower control plane that performs more heavyweight scheduling tasks, such as resource allocation, packing and load balancing. Also, the Nexus global scheduler communicates with cluster resource managers to obtain or release resources.

Much work in the machine learning and systems communities has focused on producing faster models. For instance, many model optimization techniques [108, 105, 139] produce faster versions of models, often at small losses in accuracy. Given a variety of models of varying accuracy, they can be combined to maintain high accuracy and performance using static or dynamic techniques [115, 72, 62, 53, 31]. Nexus views the optimization, selection, and combination of models as best done by the application, and provides no special support for these functions. On the other hand, once the models to be executed are selected, Nexus executes them in a fast, efficient manner. Systems such as [35, 86, 133] make training more scalable via parameter servers. Optimus [102] improves the resource efficiency of a training cluster by using online resource-performance models to predict model convergence time. Gandiva [132] provides time slicing and job migration of training jobs in order to improve GPU utilization. However, Gandiva targets a more relaxed problem than Nexus, because training jobs have no latency constraints and there is no variation in the incoming inputs, as the training data is available from the beginning.

Beyond machine learning systems, there is considerable related work from the database and big data communities on query optimization [21, 69] and distributed data analytics systems [36, 67, 140, 97]. Yet most of these systems operate at a coarse time granularity, usually minutes or hours, rather than the sub-second job latencies in Nexus. Streaming query processing systems [8, 7, 141, 96] are closer to Nexus's setting but provide no latency guarantees. Jockey [44] provides latency SLO guarantees by allocating resources based on precomputed statistics and traces from previous executions. It targets parallel execution on CPUs rather than accelerators and does not reason about batching effects. Moreover, unlike SQL-like query processing, where jobs can start on any server, DNN tasks require the model to first be loaded into main memory before processing, and model loading time is significantly higher than execution time. Therefore, Nexus plans ahead and pre-loads models before execution starts, whereas Jockey makes its decisions dynamically.

2.3.2 MCDNN

MCDNN provides shared, efficient infrastructure for mobile-cloud applications that need to process streaming data (especially video) using Deep Neural Networks (DNNs). MCDNN's main innovation is exploiting the systematic approximation of DNNs toward this goal, both revisiting scheduling to accommodate approximation and revisiting DNN approximation techniques to exploit stream and application locality.

Recent work in the machine learning community has focused on reducing the overhead of DNN execution. For important models such as speech and object recognition, research teams have spent considerable effort producing manually optimized versions of individual DNNs that are efficient at run time [121, 111, 83]. Several recent efforts in the machine learning community have introduced automated techniques to optimize DNNs, mostly variants of matrix factorization and sparsification to reduce space [139, 138, 26, 54, 108] and computational demands [68, 111, 75, 108]. Many of these efforts support the folk wisdom that DNN accuracy can broadly be traded off for resource usage. MCDNN is complementary to these efforts, in that it is agnostic to the particular optimizations used to produce model variants that trade off execution accuracy for resources. Our primary goal is to develop a novel approximating runtime that selects between these variants while obeying various practical resource constraints over the short and long term. In doing so, we further provide a more comprehensive set of measurements of accuracy-resource trade-offs for DNNs than any we are aware of, and devise two novel DNN optimizations that apply in the streaming and multi-programming settings.

The hardware community has made rapid progress in developing custom application-specific integrated circuits (ASICs) to support DNNs [24, 88, 27]. We view these efforts as complementary to MCDNN, which can be viewed as a compiler and runtime framework that will benefit most underlying hardware. In particular, we believe that compiler-based automated optimization of DNNs, cross-application sharing of runtime functionality, and approximation-aware stream scheduling will all likely remain relevant and useful whenever (even very efficient) ASICs try to support multi-application, continuous streaming workloads. In the long term, however, it is conceivable that multiple DNNs could be run concurrently and at extremely low power on common client hardware,

making system support less important.

Recent work in the mobile systems community has recognized that moving sensor processing from the application (or library) level to middleware can help avoid application-level replication [98, 113, 87]. MCDNN may be viewed as an instance of such middleware that is specifically focused on managing DNN-based computation, using new and existing DNN optimizations to derive approximation versus resource-use trade-offs for DNNs, and on providing a scheduler that reasons deeply about these trade-offs. We have recently come to know of work on JouleGuard, which provides operating-system support, based on ideas from reinforcement learning and control theory, for trading off energy efficiency for accuracy with guarantees across approximate applications that provide an accuracy/resource-use "knob" [61]. MCDNN is restricted to characterizing (and developing) such "knobs" for DNNs processing streaming workloads in particular. On the other hand, MCDNN schedules to satisfy (on a best-effort basis, with no guarantees) memory-use and cloud-dollar constraints in addition to energy. MCDNN's technical approach derives from the online algorithms community [20, 16]. Understanding better how the two approaches relate is certainly of strong future interest.

Off-loading from mobile devices to the cloud has long been an option for handling heavyweight computations [50, 32, 105]. Although MCDNN supports both off- and on-loading, its focus is on approximating a specific class of computations (stream processing using DNNs) and on deciding automatically where best to execute them. MCDNN is useful even if model execution is completely on-client. Although we do not currently support split execution of models, criteria used by existing work to determine automatically whether and where to partition computations would be relevant to MCDNN.

Cross-application sharing has similarities to standard common sub-expression elimination (CSE) [30] or memoization [51, 91], in that two computations are partially unified to share a set of prefix computations at runtime. Sharing is also similar to "multi-output" DNNs trained by the machine learning community, where related classification tasks share model prefixes and are jointly trained. MCDNN's innovation may be seen as supporting multi-output DNNs in a modular fashion: unlike jointly-trained multi-output models, the "library" model is defined and trained separately from the new model being developed, and only the upper fragment of the new model is trained subsequently. Further, to account for the fact that the (typically expensive) library model may not be available for

sharing at runtime, the MCDNN compiler must train versions that do not assume the presence of a library model, and the MCDNN runtime must use the appropriate variant.

2.3.3 Sequential Specialization

There is a long line of work on cost-sensitive classification, the epitome of which is perhaps the cascaded classification work of Viola and Jones [127]. The essence of this line of work [136, 134] is to treat classification as a sequential process that may exit early if it is confident in its inference, typically by learning sequences that have low cost in expectation over training data. Recent work [84] has even proposed cascading CNNs, as we do. All these techniques assume that testing data is i.i.d. (i.e., not sequential) and that all training happens before any testing, and they rely on skews in training data to capture cost structure. As such, they are not equipped to exploit short-term class skews in test data.

Traditional sequential models such as probabilistic models [130, 38, 103] and Recurrent Neural Networks (RNNs) [40, 73] are aimed at classifying instances that are not independent of each other. Given labeled sequences as training data, these techniques learn more accurate classifiers than those that treat sequence elements as independent. However, to our knowledge, none of these approaches produces classifiers that yield less expensive classification in response to favorable inputs, as we do.

Similar to adaptive cascading, online learning methods [126, 56, 78] customize models at test time. For training, they use data from a sequential stream that typically contains both labeled and unlabeled examples. As with adaptive cascading, the test-time cost of incrementally training the model in these systems needs to be low. A fundamental difference in our work is that we make no assumption that our input stream is partly labeled. Instead, we assume the availability of a large, resource-hungry model that we seek to "compress" into a resource-light cascade stage.

Estimating distributions in sequential data and exploiting them is the focus of the multi-armed bandit (MAB) community [13, 79]. The Oracle Bandit Problem (OBP) we define differs from the classic MAB setting in that, in MAB, exploration and exploitation happen over the same set of arms, whereas in OBP only the oracle "arm" allows exploration while the specialized models allow exploitation. Capturing the connection between these arms is the heart of the OBP formulation. Our Windowed ε-Greedy algorithm is strongly informed by the use of windows in [46] to handle

non-stationarities and by the well-known ε-greedy scheme [120] for balancing exploration and exploitation.

Finally, much recent work has focused on reducing the resource consumption of (convolutional) neural networks [54, 108, 145, 55]. These techniques are oblivious to test-time data skew and are complementary to specialization. We expect that even more pared-down versions of these optimized models will provide good accuracy when specialized at test time.

Chapter 3: Nexus: Scalable and Efficient DNN Execution on GPU Clusters

This chapter describes a cloud serving system called Nexus that aims to serve Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. To realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasons about groups of DNN invocations that need to be co-scheduled, and moves from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. In large-scale case studies on 16 GPUs, Nexus shows 1.8-41× better throughput than state-of-the-art systems while staying within latency constraints more than 99% of the time.

3.1 Background

Nexus provides efficient and timely cloud-based acceleration of applications that analyze video. Figure 3.1 provides some examples of the applications we target. For instance, a city may have hundreds of traffic cameras, a game streaming site [3] may index tens of thousands of concurrent streams for discovery, a media company may index dozens of live broadcasts, and a home monitoring company may analyze camera feeds from millions of customers. Nexus is designed to support many such users and applications simultaneously, pooling workloads across them to achieve efficiency. Below, we examine the structure of these applications and outline the challenges and opportunities in serving them at scale.

Figure 3.1: Example video streams: (a) traffic monitoring, (b) game stream discovery, (c) news indexing, (d) building access, (e) home monitoring. Windows to be analyzed by Nexus using DNNs are marked with boxes.

3.1.1 Vision pipelines and DNN-based analysis

At the highest level, a vision-based application aggregates visual information from one or more video streams using custom "business" logic. Each stream is processed using a pipeline similar to that in Figure 3.2. CPU-based code, either on the edge or in the cloud, selects frames from the stream for processing, applies business logic to identify which parts (or windows) of the image need deeper analysis, applies a set of Deep Neural Networks (DNNs) (a query) to these windows, and aggregates the results in an application-specific way, often writing to a database. A query may represent a single DNN applied to the window, but often it represents a sequence of dependent DNN applications, e.g., running an object detector on the window and then running a car make/model detector on all sub-windows determined to be cars.

Typically, a stream is sampled a few times a second or minute, and the DNN query should complete execution in tens to hundreds of milliseconds (for "live" applications) or within several hours (for "batch" applications). The execution of DNNs dominates the computation pipeline, and the cost of executing them dominates the cost of the vision service. Nexus provides a standalone service that implements the DNN-based analysis stage for vision pipelines.

¹Per-device prices for 1000 invocations assuming peak execution rates on on-demand instances of AWS c5.large (Intel AVX-512), p2.xlarge (NVIDIA K80), p3.2xlarge (NVIDIA V100), and GCP Cloud TPU, as of 9/14/2018.


Figure 3.2: A typical vision processing pipeline. Nexus is designed to provide DNN-based analysis for tens of thousands of streams.

Model        CPU lat.   GPU lat.   CPU cost        TPU cost        GPU cost
                                   (0.1 TF peak)   (180 TF peak)   (125 TF peak)
LeNet        6 ms       <0.1 ms    $0.01           $0.00           $0.00
VGG7         44         <1         0.13            0.01            0.00
ResNet50     1130       6.2        4.22            0.48            0.12
Inception4   2110       7.0        8.09            0.93            0.23
Darknet53    7210       26.3       24.74           2.8             0.70

Table 3.1: DNN execution latencies and estimated costs per 1000 invocations.¹ Acceleration may be necessary to meet latency deadlines, but it can also be cheaper, given the low cost/TFLOPS (written TF above).

3.1.2 Accelerators and the challenge of utilizing them

As Table 3.1 shows, a key to minimizing the cost of executing DNNs is the use of specialized accelerators such as GPUs and TPUs (Tensor Processing Units, essentially specialized ASICs), which are highly optimized to execute the dense linear algebra computations that comprise DNN models. The table shows the execution latency and the dollar cost of 1000 invocations for a few common models on CPUs and GPUs. Execution times on CPUs can be orders of magnitude slower than on GPUs. For many applications, therefore, latency constraints alone may dictate GPU-accelerated execution. Perhaps more fundamentally, GPUs and TPUs promise much lower cost per operation than even highly accelerated CPUs: Table 3.1 lower-bounds the cost of executing a model by assuming that models can be executed at peak speed on each platform.


Figure 3.3: Impact of batching DNN inputs on model execution throughput and latency, measured on an NVIDIA GTX 1080.

Even compared to state-of-the-art CPUs, accelerators can yield a cost advantage of up to 9× (for TPUs) and 34× (for GPUs). On the other hand, accelerators have extremely high computational capacity (e.g., 125 TFLOPS for the NVIDIA V100). To realize their cost savings, it is critical to sustain high utilization of this capacity. Sustaining high utilization is hard, however. For instance, the LeNet model of Table 3.1 consumes 20 MOPs per invocation, implying that a single V100 would require 125 TFLOPS ÷ 20 MOPs = 6.25 M inputs/second to run at full utilization! No single stream, or even most applications, can yield such rates. By aggregating inputs across streams, Nexus is designed to funnel adequate work to each accelerator. However, as we discuss next, having "enough" work is not sufficient to achieve high utilization: it is important to group the right type of work in the right place.

3.1.3 Placing, packing and batching DNNs

DNNs are networks of dense linear algebra operations (e.g., matrix multiplication and convolution), called layers or kernels. Networks are also called models. By default, the GPU simply executes the kernels presented to it in the order received. The kernels themselves are often computationally

intensive, requiring MFLOPs to GFLOPs to execute, and range in size from one MB to hundreds of MBs. These facts have important implications for GPU utilization. First, loading models into memory can cost hundreds of milliseconds to seconds. When serving DNNs at high volume, therefore, it is usually essential to place the DNN on a particular GPU by pre-loading it into GPU memory and then re-using it across many subsequent invocations. Placement brings with it the traditional problems of efficient packing. Which models should be co-located on each GPU, and how should they be scheduled to minimize mutual interference? Second, it is well known that the processor utilization achieved by kernels depends critically upon batching, i.e., grouping input matrices into higher-dimensional ones before applying custom "batched" implementations of the kernels. Intuitively, batching allows kernels to avoid stalling on memory accesses by operating on each loaded input many more times than without batching. As Figure 3.3 shows, batching can improve the throughput of model execution significantly, e.g., improvements of 2.5-24× for batch sizes of 8-64 for the VGG and LeNet models relative to batch size 1, with intermediate gains for other models. Although batching increases execution latency, it usually stays within application-level bounds. Although batching is critical for utilization, it complicates many parts of the system design. Perhaps most fundamentally, the algorithm for packing models on GPUs needs to change because the resource quantum used on each input is "squishy", i.e., it varies with the size of the batch within which that input executes. Further, the latency of execution also depends on the batch size. This new version of bin packing, which we dub squishy bin packing, needs to reason explicitly about batching (Section 3.4.1). Batching also complicates query processing. If a certain latency SLO (Service Level Objective) is allocated to the query as a whole, the system needs to partition the latency across the DNN invocations that comprise the query so that each latency split allows efficient batched execution of the related DNN invocation (Section 3.4.2). We call this complex query scheduling. Finally, batching is conventionally only feasible when the same model is invoked with different inputs. For instance, we expect many applications to use the same well-known, generally applicable models (e.g., ResNet50 for object recognition). However, the generality of these models comes at the price of higher resource use. It has become common practice [94, 48] to use smaller models

specialized (using "transfer learning") to the few objects, faces, etc. relevant to an application by altering ("re-training") just the output layers of the models. Since such customization destroys the uniformity required by conventional batching, making specialized models play well with batching is often critical to efficiency.

3.2 Examples

In this section, we use two simple examples to explain the optimization problems associated with scheduling DNN workloads from Section 3.1.3. The first example explores squishy bin packing, and the second, scheduling complex queries.

3.2.1 Squishy bin packing

Consider a workload that consists of three different types of jobs that invoke different DNN models. Let the desired latency SLOs for jobs invoking models A, B, and C be 200ms, 250ms, and 250ms, respectively. Table 3.2 provides the batch execution latency and throughput at different batch sizes (i.e., the "batching profile") for each model. We first explore the most basic scenario where all three types of jobs are associated with high request rates, so that multiple GPUs are required to handle each job type. To maximize GPU efficiency, we need to choose the largest possible batch size while still meeting the latency SLO. We note that the batch execution cost for a given job type cannot exceed half of the job's latency SLO; a request that missed being scheduled with a batch would be executed as part of the next batch, and its latency would be twice the batch execution cost. For example, the latency SLO of job 1 is 200 ms, so the maximum batch size that job 1 can use is 16, according to Table 3.2. Therefore, the maximum throughput that job 1 can achieve on a single GPU is 160 reqs/sec, and the number of GPUs to be allocated for job 1 should be r1/160, where r1 is the request rate of job 1. Similarly, the number of GPUs for jobs 2 and 3 should be r2/128 and r3/128, where r2 and r3 are the request rates for jobs 2 and 3 respectively. Figure 3.4a depicts the desired schedules for the three types of jobs. We next consider a situation where the request rates for the jobs are not high and each job requires only a fraction of a GPU's computing power. In this case, the scheduler needs to consolidate multiple

Model   Batch size   Batch lat. (ms)   Throughput (reqs/s)
A       4            50                80
A       8            75                107
A       16           100               160
B       4            50                80
B       8            90                89
B       16           125               128
C       4            60                66.7
C       8            95                84
C       16           125               128

Table 3.2: Batching profiles for models used in the example. Batch lat. is the latency for processing a batch.

types of DNN tasks onto the same GPU to optimize resource utilization. Consider a workload where job 1 receives 64 reqs/sec, and jobs 2 and 3 each receive 32 reqs/sec. We consider schedules wherein one or more model types are assigned to each GPU. A GPU then executes batches of the different types of jobs in a round-robin manner, and it cycles through the different model types over a time period that we refer to as the duty cycle. The worst-case latency for a job is no longer twice the batch execution cost but rather the sum of the duty cycle and the batch execution cost for that job type.

Given this setup, we observe that we can schedule model A in batches of 8 as part of a duty cycle of 125 ms; the resulting throughput is the desired rate of 64 reqs/sec, the batch execution cost for 8 tasks is 75 ms, and the worst-case execution delay of 200ms matches the latency SLO (see Figure 3.4b). We then check whether the GPU has sufficient slack to accommodate jobs associated with models B or C. Within a duty cycle of 125 ms, we would need to execute 4 tasks of either B or C to meet the desired rate of 32 reqs/sec. The batch execution cost of 4 model B tasks is 50ms, which can fit into the residual slack in the duty cycle. On the other hand, a batch of 4 model C tasks would incur 60ms and cannot be scheduled inside the duty cycle. Further, the worst-case latency for model B is the sum of the duty cycle and its own batch execution cost, 175 ms (= 125 + 50), which is lower than its latency

SLO of 250ms. Thus, it is possible to co-locate models A and B on the same GPU, but not model C.
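The reasoning in this example can be captured in a few lines of code. The sketch below is illustrative only (it assumes batching profiles given as dictionaries keyed by batch size, as in Table 3.2, and that the computed batch size appears in the profile); it is not Nexus code.

# Duty-cycle reasoning from the example above.
profile_A = {4: 50, 8: 75, 16: 100}   # batch size -> batch latency (ms), Table 3.2
profile_B = {4: 50, 8: 90, 16: 125}
profile_C = {4: 60, 8: 95, 16: 125}

def max_batch_saturated(profile, slo_ms):
    """Largest batch whose worst-case latency (2x the batch latency) fits the SLO."""
    feasible = [b for b, lat in profile.items() if 2 * lat <= slo_ms]
    return max(feasible) if feasible else None

def fits_in_duty_cycle(profile, rate, duty_cycle_ms, slack_ms, slo_ms):
    """Can a low-rate session (reqs/sec) share a GPU with the given duty cycle?"""
    batch = int(rate * duty_cycle_ms / 1000)      # requests arriving per duty cycle
    batch_lat = profile[batch]
    return batch_lat <= slack_ms and duty_cycle_ms + batch_lat <= slo_ms

print(max_batch_saturated(profile_A, 200))              # 16, as in the saturated case
print(fits_in_duty_cycle(profile_B, 32, 125, 50, 250))  # True: B co-locates with A
print(fits_in_duty_cycle(profile_C, 32, 125, 50, 250))  # False: C's 60 ms batch doesn't fit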

[Figure: Gantt-style schedules. (a) Saturated workload: jobs 1-3 each run back-to-back batches of 16 (100-125 ms per batch, 200-250 ms worst-case latency). (b) Residual workload: a 125 ms duty cycle runs a batch of 8 for job 1 (75 ms) followed by a batch of 4 for job 2 (50 ms); a batch of 4 for job 3 (60 ms) does not fit.]

Figure 3.4: Resource allocation example.

Model   Batch latency (ms)   Throughput (reqs/s)
X       40                   200
X       50                   250
X       60                   300
Y       40                   300
Y       50                   400
Y       60                   500

Table 3.3: This table shows the batch execution latency and throughput of models X and Y used in the example query.

[Figure: query pipeline X → Y, where each invocation of X produces α inputs for Y.]

Figure 3.5: Query pipeline.

We now make a few observations regarding the scenario discussed above and why the associated optimization problem cannot be addressed directly by known scheduling algorithms. First, unlike vanilla bin-packing, which would pack fixed-size balls into bins, here the tasks incur lower costs when multiple tasks of the same type are squished together into a GPU. Second, in addition to the capacity constraints associated with the computing and/or memory capabilities of a GPU, there are also latency constraints in generating a valid schedule. Third, there are many degrees of freedom in generating a valid schedule. The batch sizes associated with different model executions are not only a function of the request rate but also of the duty cycle in which the batch is embedded. In Section 3.4.1, we describe how to extend traditional algorithms to accommodate these extra considerations.

Latency budget (X / Y)   Avg. throughput (reqs/s)
                         α = 0.1    α = 1    α = 10
40 / 60                  192.3      142.9    40.0
50 / 50                  235.3      153.8    34.5
60 / 40                  272.7      150.0    27.3

Table 3.4: This table shows the average throughput of three latency split plans for varying α.

3.2.2 Complex query scheduling

In the previous example, we considered simple applications that use only one model, but, in practice, applications usually comprise dependent computations across multiple DNN models. For example, a common pattern is a detection and recognition pipeline that first detects certain objects in the image and then recognizes each object. The developer specifies a latency SLO for the entire query, but since the system hosts and executes the constituent models on different nodes, it has to automatically derive latency SLOs for the invoked models and derive schedules that meet these latency SLOs. We discussed the latter issue in the previous example; we now focus on the former. Consider a basic query that executes model X and feeds the output of X to model Y, as shown in Figure 3.5. Suppose we have a 100ms latency budget for processing this query, and suppose that every invocation of X yields α outputs (on average). When α < 1, model X operates as a filter; when

α = 1, model X maps each input to exactly one output; when α > 1, model X yields multiple outputs from an input (e.g., detection of multiple objects within a frame). Assume that Table 3.3 depicts the batch execution latency and throughput of models X and Y. The system has to decide what latency SLOs to enforce on each of the two models such that the overall latency is within 100ms and the GPU utilization of the query as a whole is maximized. For the purpose of this example, we consider a limited set of latency split plans for models X and Y: (a) 40ms and 60ms, (b) 50ms and 50ms, (c) 60ms and 40ms. It would appear that plan (a) should work best since the sum of the throughputs is largest among the three plans, but a closer examination reveals some interesting details. For workloads involving a large number of requests, let us assume that p and q GPUs execute

models X and Y, respectively. We then have α · p · T_X = q · T_Y, where T_X and T_Y are the throughputs of models X and Y, such that the pipeline won't be bottlenecked by either model. We define the average throughput as the pipeline throughput divided by the total number of GPUs, which is p · T_X / (p + q).

We evaluate the average throughputs for the three latency split plans with α = 0.1, 1, and 10. Table 3.4 shows that each of the three split plans achieves the best performance for a different α value. In fact, there is no universal best split: it depends on α, which can vary over time. We learn two lessons from this example. First, the latency split for a complex query has an impact on overall efficiency, and it is necessary to understand model batch performance and workload statistics to make the best decision. Second, the latency split should not be static but rather adapted over time in accordance with the latest workload distribution. Section 3.4.2 describes how the analysis tool in Nexus automatically and continually derives latency splits for complex queries.
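The arithmetic behind Table 3.4 can be reproduced with a short Python sketch, shown below for illustration; it uses the throughputs from Table 3.3 and the per-GPU throughput expression derived above.

# Average (per-GPU) throughput of the X -> Y pipeline for each latency split plan.
tput_X = {40: 200, 50: 250, 60: 300}   # latency budget (ms) -> reqs/s, from Table 3.3
tput_Y = {40: 300, 50: 400, 60: 500}

def avg_throughput(lx, ly, alpha):
    """With p GPUs on X and q on Y, balance requires alpha * p * T_X = q * T_Y,
    so the per-GPU throughput p*T_X / (p + q) simplifies to T_X*T_Y / (T_Y + alpha*T_X)."""
    tx, ty = tput_X[lx], tput_Y[ly]
    return tx * ty / (ty + alpha * tx)

for lx, ly in [(40, 60), (50, 50), (60, 40)]:
    print(lx, ly, [round(avg_throughput(lx, ly, a), 1) for a in (0.1, 1, 10)])
# (40,60): [192.3, 142.9, 40.0]; (50,50): [235.3, 153.8, 34.5]; (60,40): [272.7, 150.0, 27.3]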

3.3 System Overview

Figure 3.6 provides an overview of Nexus. Nexus works on three planes. The management plane allows developers to ingest and deploy applications and models, at a timescale of hours to weeks. The control plane, via the global scheduler, is responsible for resource allocation and scheduling at a typical timescale of seconds to minutes. The data plane, comprised of in-application Nexus library instances and backend modules (together, the Nexus runtime), dispatches and executes user requests at the timescale of milliseconds to seconds. The global scheduler interacts with the underlying cluster resource manager (e.g., Mesos [60], Azure Scale Sets [95]) to acquire CPUs/GPUs for the frontends/backends. A load balancer (not shown) from the underlying cluster spreads user requests across Nexus's distributed frontend. We sketch the three planes.

Management plane: Developers may ingest models and applications into Nexus. Models are stored in a model database and may be accompanied by either a sample data set or a batching profile (see Table 3.2). Nexus uses the sample dataset, if available, to derive a batching profile. Otherwise, the profile is updated at runtime based on user requests. Applications are containers whose entry points invoke Nexus's App API (Figure 3.7), which will be discussed later. Developers store application

[Figure: Nexus components. User requests flow to application containers (Docker) holding the Nexus library on frontends; the global scheduler monitors workload statistics each epoch, updates latency splits from query latency SLOs, determines whether models can be prefix batched, and performs squishy bin packing; backends host models (including a shared prefix with per-model suffixes) on GPUs managed by GPU schedulers; a model database supports model ingest; the cluster manager provisions nodes. Data flow and control flow are shown separately.]

Figure 3.6: Nexus runtime system overview.

containers in cluster-provided container repositories and may instruct Nexus to ingest a container, at which point it is loaded from the repository onto a frontend CPU.

Control plane: The global scheduler is a cluster-wide resource manager that uses load and usage statistics from the runtime. It uses this profile to add or remove frontend and backend nodes from the cluster, and it invokes the epoch scheduler to decide which models to execute, at what batch size, and on which backend to place the models so as to balance load and maximize utilization. Multiple models may be mapped onto a single backend, albeit with an execution schedule that ensures they do not interfere, as in Section 3.2.1. The mapping from models to the backends where they are to be executed is captured in a routing table that is broadcast to frontends. The matching execution schedule for each backend is captured in a schedule that is communicated to backends. On receiving a routing table, frontends update their current routing table. On receiving a schedule, backends load the appropriate models into GPU memory and set their execution schedule.

class ModelHandler:
    # Async RPC, executes the model on remote backends,
    # and returns an output handler immediately
    def Execute(self, input_data, output_fields)

class App:
    # Sends load model request to global scheduler,
    # and returns a ModelHandler
    def GetModelHandler(self, framework, model_name, version)
    # Implemented by developer
    def Setup(self)
    # Implemented by developer
    def Process(self, request)

Figure 3.7: Nexus application API.

Allocation, scheduling, and routing updates happen at the granularity of an epoch, typically 30-60s, although a new epoch can also be triggered by large changes in workload. Epoch scheduling involves the following:

● Produce an updated split of the latency SLO based on α values estimated from the latest profile (see Section 3.4.2).

● Combine two or more models that share a prefix and latency SLO into a new prefix-batched model.

● Perform profile-guided squishy bin packing to allocate the GPU resources for each model (Section 3.4.1 has details).

Data plane: When a user request comes into (a replica of) an application container, the application invokes DNNs via the Execute method from the Nexus Library. The library consults the local routing table to find a suitable backend for that model, dispatches the request to the backend, and delivers responses back to the application. The application is responsible for packaging and delivering the end result of the query to the user. A backend module uses multiple threads to queue requests from various frontends, selects and executes models on these inputs in batched mode according to the current schedule, and sends the results back to frontends. It can utilize one or more GPUs on a given node, with each GPU managed by a GPU scheduler that schedules tasks on it.

class GameApp(App):
    def Setup(self):
        self.lenet_model = self.GetModelHandler("caffe2", "lenet", 1)
        self.obj_model = self.GetModelHandler("tensorflow", "resnet", 1)

    def Process(self, request):
        ret1 = self.lenet_model.Execute(request.image, ["digit"])
        ret2 = self.obj_model.Execute(request.image, ["name"])
        return Reply(request.user_id, [ret1["digit"], ret2["name"]])

Figure 3.8: Application example of live game streaming analysis.

3.3.1 Nexus API

Figure ç.Þ denes the programming interface of Nexus. App class is the base class to be extended for an application. It contains an API called GetModelHandler, which sends a load model request to the global scheduler with framework, model name, and model version specied by developers. AŸer global scheduler conrms that model is loaded at some backends, it then returns a ModelHandler object. ModelHandler provides only one API, Execute, given the input data and desired output elds. Execute function sends the request to one of the backends that load this model. Inside it performs load balancing according to the routing table. Execute function returns an output future immediately without waiting for the reply from backends. e program will be blocked and forced to synchronize only when it tries to access the data in the output future. is allows multiple requests to be executed in remote backends in parallel when they don’t have dependencies between each other.

There are two functions in the App class that need to be implemented by developers: Setup and Process. Developers specify the models to be loaded in the Setup function. Process, in contrast, contains the processing logic of the application. Setup is invoked once when the application is launched, whereas Process is invoked every time a new user request is received.

Figure 3.8 shows a simple application example that analyzes each frame from a live game stream. It uses two models, LeNet [80] to recognize digits and ResNet [59] to recognize icons and objects in the frame. Note that these two models use different frameworks. Thanks to the Nexus API abstraction, developers can easily adopt models from different frameworks without worrying about model

class TrafficApp(App):
    def Setup(self):
        self.detection = self.GetModelHandler("tensorflow", "ssd_mobilenet", 1)
        self.car_rec = self.GetModelHandler("caffe2", "googlenet_car", 1)
        self.face_rec = self.GetModelHandler("caffe2", "vgg_face", 1)

    def Process(self, request):
        persons, cars = [], []
        for rec in self.detection.Execute(request.image):
            if rec["class"] == "person":
                persons.append(self.face_rec.Execute(request.image[rec["rect"]]))
            elif rec["class"] == "car":
                cars.append(self.car_rec.Execute(request.image[rec["rect"]]))
        return Reply(request.user_id, persons, cars)

Figure 3.9: Traffic monitoring application example.

transformation. In the Process function, because there is no dependency between ret1 and ret2, the requests to LeNet and ResNet execute in parallel. The program won't reply to the client until both return values are received. Figure 3.9 presents a more complicated application, which includes a complex query. It aims to recognize the make and model of vehicles and identify pedestrians in each frame from traffic cameras. It first invokes SSD [89] to detect objects and pedestrians in the images. For every detected person and car, it then invokes the face recognition model and the car recognition model, respectively. The number of requests sent to the face and car recognition models is variable, depending on the input image. These two examples demonstrate that it is easy to develop and deploy an application on Nexus. The APIs are simple yet expressive enough to write a wide variety of applications. More importantly, the complexity of DNN execution, routing, load balancing, and scaling is hidden behind these APIs and managed by the Nexus system. Therefore, developers don't need to deal with those issues.

3.4 Global Scheduler

We now describe the algorithms used by the global scheduler as part of its epoch scheduling. The goal of the global scheduler is to minimize the number of GPUs allocated to processing the workload without violating latency SLOs. This problem is NP-hard (see the proof in Appendix A). We devise the algorithm by extending the classic bin packing greedy strategy. We present the algorithms in three

Notation    Description
M_k         DNN model k
S_i         Session i
L_i         Latency constraint for session S_i
R_i         Request rate for session S_i
ℓ_k(b)      Execution cost for M_k at batch size b

Table 3.5: Notation for the GPU scheduling problem.

steps. First, we consider the case of scheduling streams of individual DNN task requests, given their expected arrival rates and latency SLOs (akin to the example discussed in Section 3.2.1). We then consider how to schedule streams of more complex queries/jobs that invoke multiple DNN tasks (similar to the example discussed in Section 3.2.2).

3.4.1 Scheduling streams of individual DNN tasks

We now consider scheduling streams of individual DNN tasks and build upon the discussion we presented in Section 3.2.1. The scheduler identifies, for each cluster node, the models the node should serve and the target serving throughputs for those models, such that the schedule is computationally feasible and does not violate the latency SLOs. As discussed earlier, the scheduling problem has the structure of the bin-packing problem [19], but the solution is harder as different batching thresholds incur different per-task costs and different co-locations of model types on a node result in different worst-case model execution latencies. The "squishiness" of tasks and the need to meet latency SLOs are the considerations that we address in the algorithm presented below.

Inputs and Notation: The scheduling algorithm is provided with the following inputs.

● It is provided the request rate of invocations for a given model at a given latency SLO. We refer to the requests for a given model and latency SLO as a session. Note that a session would correspond to classification requests from different users and possibly different applications that invoke the model at a given latency constraint. Table 3.5 describes the notation used below. Formally, a session S_i specifies a model M_{k_i} and a latency SLO L_i, and there is a request rate R_i associated with it.

● The algorithm is also provided with the execution costs of different models at varying batch sizes. The latency of executing a batch of M_k invocations of size b is referred to as ℓ_k(b). We assume that throughput is non-decreasing with regard to batch size b.

Scheduling Overview: The scheduler allocates one or more sessions to each GPU and specifies their corresponding target batch sizes. Each GPU node n is then expected to cycle through the sessions allocated to it, execute invocations of each model in batched mode, and complete one entire cycle of batched executions within a duty cycle of d_n. For sessions that have a sufficient number of user requests, one or more GPU nodes are allocated to a single session to execute model invocations. The algorithm described below computes the residual workload for such sessions after allocating an integral number of GPUs and then attempts to perform bin packing with the remaining sessions. For the bin packing process, the scheduler inspects each session in isolation and computes the largest batch size and the corresponding duty cycle for the executing GPU in order to meet the throughput and SLO needs. The intuition behind choosing the largest batch size is to have an initial schedule wherein the GPU operates at the highest efficiency. This initial schedule, however, isn't cost effective as it assumes that each GPU is running just one session within its duty cycle, so the algorithm then attempts to merge multiple sessions within a GPU's duty cycle. In doing so, it should not violate the latency SLOs, so we require the merging process to only reduce the duty cycle of the combined allocation. The algorithm considers sessions in decreasing order of associated work and merges them into existing duty cycles that have the highest allocations, thus following the design principle behind the best-fit decreasing technique for traditional bin packing.

Details: We now describe the global scheduling algorithm for a given workload (which is also depicted in Algorithm 3.1).

Consider a session S_i that uses model M_{k_i} and has latency constraint L_i. We first consider executing the model M_{k_i} solely on one GPU (Lines 1-7 in Algorithm 3.1). Suppose the batch size is b; the execution latency of model M_{k_i} is ℓ_{k_i}(b), and the worst-case latency for any given request is 2ℓ_{k_i}(b), as we explained in Section 3.2.1. Denote by B_i the maximum batch size b that satisfies the constraint 2ℓ_{k_i}(b) ≤ L_i. Therefore, the maximal throughput, denoted by T_i, of model M_{k_i} with

Algorithm 3.1 Global Scheduling Algorithm

GlobalSchedule(Sessions)
 1: residue_loads ← {}
 2: for S_i = ⟨M_{k_i}, L_i, R_i⟩ in Sessions do
 3:     B_i ← argmax_b (2·ℓ_{k_i}(b) ≤ L_i)
 4:     T_i ← B_i / ℓ_{k_i}(B_i)
 5:     let R_i = n·T_i + r_i
 6:     assign n GPU nodes to execute M_{k_i} with batch B_i
 7:     residue_loads ← residue_loads ⊕ ⟨M_{k_i}, L_i, r_i⟩
 8: for ⟨M_{k_i}, L_i, r_i⟩ in residue_loads do
 9:     b_i ← argmax_b (ℓ_{k_i}(b) + b/r_i ≤ L_i)
10:     d_i ← b_i / r_i
11:     occ_i ← ℓ_{k_i}(b_i) / d_i
12: sort residue_loads by occ_i in descending order
13: nodes ← {}
14: for ⟨M_{k_i}, L_i, r_i, b_i, d_i, occ_i⟩ in residue_loads do
15:     max_occ ← 0
16:     max_node ← NULL
17:     for n = ⟨b, d, occ⟩ in nodes do
18:         n′ ← MergeNodes(n, ⟨b_i, d_i, occ_i⟩)
19:         if n′ ≠ NULL and n′.occ > max_occ then
20:             max_occ ← n′.occ
21:             max_node ← n′
22:     if max_node ≠ NULL then
23:         replace max_node for its original node in nodes
24:     else
25:         nodes ← nodes ⊕ ⟨b_i, d_i, occ_i⟩


Figure 3.10: Process of merging two nodes. Use the smaller duty cycle as the new duty cycle for both nodes. Update the batch sizes accordingly and re-estimate the batch execution latencies. If the sum of the execution latencies does not exceed the new duty cycle, the two nodes can be merged into one.

latency constraint L_i on a single GPU is B_i / ℓ_{k_i}(B_i). The number of GPU nodes we allocate to execute just S_i requests is n = ⌊R_i / T_i⌋. Note that n = 0 for sessions that don't have sufficient requests to utilize an entire GPU.

As a consequence, there will be a residual unallocated load for session S_i after taking into account this allocated load. The next phase assigns this residual load to one other node in the system (Lines

8-11 in Algorithm 3.1). Denote r_i = R_i − n·T_i as the request rate of the residual load. Suppose we execute the residual load with batch size b; the duty cycle for gathering b inputs is d = b / r_i. Then, the worst-case latency is d + ℓ_{k_i}(b). Therefore, we have the constraint:

d + ℓ_{k_i}(b) = b / r_i + ℓ_{k_i}(b) ≤ L_i    (3.1)

We begin residual load scheduling by choosing, for session S_i with residual load r_i, the maximum batch size b_i that satisfies the above constraint. Correspondingly, the duty cycle d is also at its maximal value. Note that this batch size maximizes GPU efficiency for the given latency constraint due to the following argument. Denote occupancy (occ) as the fraction of the duty cycle d occupied by S_i's residual load invocations: occ_i(b) = ℓ_{k_i}(b) / d.
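As an illustration of lines 8-11 of Algorithm 3.1 (not the actual Nexus implementation), the sketch below picks the residual batch size, duty cycle, and occupancy from a batching profile given as a dictionary.

def residual_allocation(profile, rate, slo_ms):
    """Pick the largest batch b with b/rate + latency(b) <= SLO; return (b, d, occ)."""
    best = None
    for b, lat in sorted(profile.items()):
        duty_cycle_ms = b / rate * 1000           # time to gather b requests
        if duty_cycle_ms + lat <= slo_ms:
            best = (b, duty_cycle_ms, lat / duty_cycle_ms)
    return best                                    # None if no batch size fits

# Model B's profile (Table 3.2) with a 32 reqs/sec residual load and a 250 ms SLO:
print(residual_allocation({4: 50, 8: 90, 16: 125}, 32, 250))
# -> (4, 125.0, 0.4): batch 4, 125 ms duty cycle, 40% occupancy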

Next, we start to merge these fractional nodes into fewer nodes (Lines 12-25 in Algorithm 3.1). This part resembles the classic bin packing algorithm that first sorts sessions by decreasing occupancy and merges two nodes onto a single node by best fit. The primary difference is how to determine whether two nodes can be merged such that no session violates its latency SLO. Figure 3.10 depicts the process of merging two nodes. Suppose we have two sessions S_1 and S_2, with request rates r_1 and r_2, assigned batch sizes b_1 and b_2, and duty cycles d_1 and d_2. We use d = min(d_1, d_2) as the new duty cycle, since d_i is the maximum duty cycle allowed for each session. Without loss of generality, we assume d = d_2. We then use b_1′ = d · r_1 ≤ b_1 as the new batch size for executing session S_1. Note that the worst-case latency of requests in session S_1 now becomes d + ℓ_{k_1}(b_1′) ≤ d_1 + ℓ_{k_1}(b_1) ≤ L_1, so we won't violate the latency constraint for S_1 by this adjustment. If ℓ_{k_1}(b_1′) + ℓ_{k_2}(b_2) ≤ d and memory capacity permits, a single node can handle the computation of both S_1 and S_2, and we allocate these two sessions to the same node. While the above discussion considers merging two sessions, the underlying principle generalizes to the situation where a session is merged with a set of sessions executing on a given node. Finally, we extend the algorithm to be incremental across epochs, thus minimizing the movement of models across nodes. If the overall workload decreases, the scheduler attempts to move sessions from the least utilized backends to other backends. If a backend no longer executes any session, the scheduler reclaims this backend and relinquishes it to the cluster manager. If the workload increases such that a backend becomes overloaded, we evict the cheapest sessions on this backend until it is no longer overloaded. We then perform bin packing again to place these evicted sessions on other backends.
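The merge feasibility test can be sketched as follows. This is an illustration under simplifying assumptions (each node is a list of sessions represented as dictionaries carrying a batching profile, request rate, and standalone duty cycle), not the MergeNodes routine used by Nexus.

def try_merge(node_a, node_b):
    """Return the merged duty cycle (ms) if both nodes fit on one GPU, else None."""
    d = min(min(s["duty_cycle"] for s in node_a),
            min(s["duty_cycle"] for s in node_b))
    total_latency = 0.0
    for s in node_a + node_b:
        new_batch = int(s["rate"] * d / 1000)      # shrink each batch to the new duty cycle
        if new_batch == 0 or new_batch not in s["profile"]:
            return None                            # sketch assumes the profile covers this batch size
        total_latency += s["profile"][new_batch]
    return d if total_latency <= d else None

a = [{"profile": {4: 50, 8: 75, 16: 100}, "rate": 64, "duty_cycle": 125}]
b = [{"profile": {4: 50, 8: 90, 16: 125}, "rate": 32, "duty_cycle": 125}]
c = [{"profile": {4: 60, 8: 95, 16: 125}, "rate": 32, "duty_cycle": 125}]
print(try_merge(a, b))   # 125: sessions A and B co-locate (75 + 50 <= 125)
print(try_merge(a, c))   # None: 75 + 60 exceeds the 125 ms duty cycle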

3.4.2 Scheduling Complex Queries

We now present the query analysis algorithm that operates on dataflow representations of application queries in order to determine the latency SLO splits for the constituent models. The output of this analysis is given as input to the scheduling algorithm of Section 3.4.1, which works with individual models. The query analysis algorithm extracts the dataflow representations from the application code

[Figure: (a) Game: input → LeNet and ResNet → output. (b) Traffic: input → SSD → face and car recognition → output.]

Figure 3.11: Dataflow representations of two example applications.

(e.g., those depicted in Figure 3.11, corresponding to Figures 3.8 and 3.9) and then computes the stage number of each constituent model as the maximum depth of the model invocation with respect to the input. For example, in Figure 3.11a, both the LeNet and ResNet models have a stage number of 1, and in Figure 3.11b, the SSD model is at stage 1 while the face and car recognition models are at stage 2. Models at the same stage will share the same latency SLO.

In addition, we also collect profiling information from the runtime execution of the query. For each edge M_i → M_j in the dependency graph, the frontends collect information regarding how many instances of M_j are invoked by a single M_i instance. This information is reported to the global scheduler, which aggregates it.
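A minimal sketch of how such per-edge statistics could be turned into a running α estimate is shown below; the exponential smoothing factor and the reporting interface are illustrative assumptions rather than details taken from Nexus.

class EdgeStats:
    """Running estimate of alpha for one dependency edge M_i -> M_j."""
    def __init__(self, smoothing=0.1):             # smoothing factor is an assumption
        self.alpha = 0.0
        self.smoothing = smoothing

    def report(self, parent_invocations, child_invocations):
        """Called with counts gathered by a frontend over one reporting interval."""
        if parent_invocations == 0:
            return
        observed = child_invocations / parent_invocations
        # Exponentially weighted average so alpha tracks workload drift over time.
        self.alpha = (1 - self.smoothing) * self.alpha + self.smoothing * observed

ssd_to_face = EdgeStats()
ssd_to_face.report(parent_invocations=1000, child_invocations=2400)
print(round(ssd_to_face.alpha, 3))                 # 0.24 after the first report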

We now dene the optimization objective for the analysis. Suppose there are k models in the query, Mi Ô i k and that Mi is associated with a latency SLO of Li, with models at the same stage having( ≤ the≤ same) SLO. Further, for a dependency edge Mi M j, let αi j indicate the number of times M j is invoked by an instance of Mi. Also, let Ni be the number→ of GPUs allocated for model Mi. en, we have the constraint that αi jNi Ti Li N jTj L j , where Ti is the max throughput of model Mi given the latency SLO Li. Given these( constraints) = ( and) the target throughput associated with the top-level model, we can compute the GPU allocations for a latency split plan. We then ¥ò

[Figure: bar charts of the miss deadline rate for the earliest, sliding window, and maximal policies under (a) a uniform workload and (b) a Poisson workload.]

Figure 3.12: Comparison of the percent of requests that miss their deadlines under three batching policies with uniform and Poisson workloads. The model used in the two experiments is Inception [121] with a latency SLO of 100ms, and the mean request rate for both workloads is 1300 requests/sec.

choose the latency split plan that minimizes ∑ N_i. For queries with only a few stages, the analysis tool just uses brute force to scan all possible splits and returns the best; otherwise, it uses simulated annealing [42] to search for the best latency SLO split plan within a time limit.
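For a query with only a few stages, the brute-force split search can be sketched as follows. The sketch is illustrative only: it assumes per-stage throughput tables keyed by candidate latency budgets (as in Table 3.3) and a fan-out value per stage.

from itertools import product

def best_split(tput_tables, alphas, total_slo):
    """tput_tables: one {latency budget ms: max reqs/s} dict per stage;
    alphas[i] is the average number of stage-i invocations per query (alphas[0] == 1)."""
    budgets = [sorted(t.keys()) for t in tput_tables]
    best_plan, best_gpus = None, float("inf")
    for plan in product(*budgets):
        if sum(plan) > total_slo:
            continue
        # GPUs needed to sustain one query/sec, summed over stages.
        gpus = sum(a / t[b] for a, b, t in zip(alphas, plan, tput_tables))
        if gpus < best_gpus:
            best_plan, best_gpus = plan, gpus
    return best_plan

# Models X and Y from Table 3.3, a 100 ms query SLO, and alpha = 10:
print(best_split([{40: 200, 50: 250, 60: 300}, {40: 300, 50: 400, 60: 500}],
                 alphas=[1, 10], total_slo=100))   # -> (40, 60), matching Table 3.4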

3.5 Runtime

In this section, we briefly describe some of the runtime mechanisms that control the execution of DNN tasks on Nexus frontend and backend nodes. Adaptive batching and GPU multiplexing describe how backend servers handle fluctuating incoming workloads and support multiple models on a single GPU. We then compare a few load balancing policies used in frontend servers.

3.5.1 Adaptive Batching

During serving, it is unlikely that a backend receives exactly the expected batch size of requests at each round of execution, due to either workload variation or imperfect load balancing. It is necessary for backends to adjust batch sizes when they receive fewer or more requests than the expected batch size. To systematically study this problem, we explore three batching policies,

ranging from more conservative to more aggressive:

• Earliest rst: ensures the earliest request in the queue to be processed within its latency SLO. erefore, the max allowable batch size should be the batch size of which execution time is not longer than the leŸ time budget of the earliest request. is is also the policy used in Clipper [çÔ].

• Sliding window: uses a sliding window of the batch size specified by the global scheduler to scan through the queue. It stops at the first request that will not exceed its deadline under that batch size and drops the earlier requests (see the sketch after this list).

• Maximal window: similar to sliding window, but selects the maximal possible window size, regardless of the batch size suggested by the global scheduler, such that the batch execution time will not exceed the deadline of any request in the batch, and drops the others.
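The sliding window policy can be sketched as follows. The sketch makes simplifying assumptions (requests carry a deadline field in seconds and the batch latency comes from the model's profile); neither the request type nor the helper is a Nexus API.

import time

def sliding_window_batch(queue, target_batch, batch_latency_ms):
    """Scan the queue with a window of the scheduler's target batch size; drop
    requests whose deadlines would be missed if this batch started now."""
    finish = time.time() + batch_latency_ms / 1000.0
    for start in range(len(queue)):
        window = queue[start:start + target_batch]
        if window and window[0].deadline >= finish:
            dropped = queue[:start]                # earlier requests cannot make it
            return window, dropped
    return [], list(queue)                          # nothing can meet its deadline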

To compare these three policies, we use simulation and evaluate the percent of requests that miss their deadlines under the same workload. We use the Inception [121] model with a latency SLO of 100ms, and send requests at a mean rate of 1300 reqs/sec. We generate workloads using uniform and Poisson distributions. Figure 3.12 depicts the miss deadline rate of the three policies mentioned above. We can see that the earliest first policy performs best under the uniform workload because it is the most conservative policy and prioritizes not dropping requests, while the sliding window and maximal window policies aim for higher efficiency and throughput and therefore may drop requests prematurely. However, under the Poisson distribution, the earliest first policy misses the deadline for 73% of requests, significantly higher than the sliding window policy (1.3%) and the maximal window policy (1.7%). To understand why the earliest first policy performs so badly, we take a closer look by plotting the batch sizes used at each round by the three policies, shown in Figure 3.13. It reveals that the earliest first policy uses very small batch sizes after the first few rounds, since it needs to guarantee that the earliest request does not exceed its deadline. This leads to low efficiency on the GPU and therefore low throughput and a higher miss rate. On the other hand, the maximal window policy constantly uses batch sizes larger than the suggested batch size and is too aggressive, dropping requests that could have been served

[Figure: batch sizes used over time (0-120 seconds) by the earliest, sliding window, and maximal policies.]

Figure 3.13: Batch sizes used at each round by the three batching policies. The red line indicates the batch size set by the global scheduler.

within their deadlines. Thus we see a higher miss rate for the maximal window policy than for the sliding window policy on both workloads. Nexus therefore uses the sliding window policy as its adaptive batching policy.

3.5.2 GPU Multiplexing

DNN frameworks provide no specific support for the concurrent execution of multiple models. For example, if two models that share a GPU execute in two processes or containers, they will independently issue requests to execute layers on the underlying GPU (DNN libraries typically require models to be presented as individual layers as execution kernels). The GPU runtime will typically serve these requests in first-come-first-served fashion, usually resulting in an arbitrary interleaving of the layers from the two models. The interleaving increases the execution latency of both models and makes it hard to predict the latency. Instead, Nexus manages the execution of all models, so it is able to pick batch sizes and execution schedules for all models in a round-robin fashion to make sure each model abides by its latency SLO.

Another important observation is that transfer learning [41, 137] adapts a model from one dataset to another or from one task to another by re-training only the last one or a few layers. DNN frameworks assume that if models differ in any layer, they cannot be executed in a batched fashion at all. However, in the common setting of model specialization, several models may differ only by their output layer. Batching the execution of all but the output layer can yield substantial batching gains. Nexus automatically recognizes models that have common prefixes, and splits the models into "common prefix" and "different suffix" parts. The backend executes the common prefix in a batched manner and the different suffix parts sequentially to complete execution.
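A minimal sketch of the prefix/suffix split is shown below. It assumes each model variant is represented as an ordered list of (layer name, weights id) pairs, which is an illustrative representation rather than Nexus's internal one.

def split_common_prefix(models):
    """models: dict name -> list of (layer_name, weights_id). Returns (prefix, suffixes)."""
    layer_lists = list(models.values())
    prefix_len = 0
    for layers in zip(*layer_lists):
        if all(layer == layers[0] for layer in layers):
            prefix_len += 1
        else:
            break
    prefix = layer_lists[0][:prefix_len]
    suffixes = {name: layers[prefix_len:] for name, layers in models.items()}
    return prefix, suffixes

variants = {
    "resnet_game1": [("conv1", "w0"), ("block1", "w1"), ("fc", "game1_fc")],
    "resnet_game2": [("conv1", "w0"), ("block1", "w1"), ("fc", "game2_fc")],
}
prefix, suffixes = split_common_prefix(variants)
print(len(prefix), {k: len(v) for k, v in suffixes.items()})
# 2 shared layers executed as one batched prefix; each variant keeps a 1-layer suffix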

3.5.3 Load balancing

Load balancing is important to achieve high efficiency in Nexus because of batch execution at the backends. If frontends send more requests than the expected batch size to a backend, the excess requests need to wait for one round of batch execution before they get executed, and are therefore likely to violate their latency SLOs. On the other hand, if frontends send fewer requests than the desired batch size, it results in lower efficiency since the batch size becomes smaller. In essence, frontends need to forward requests to backends according to their desired batch sizes within a very short period of time, usually tens or hundreds of ms. This problem becomes more challenging when the desired batch size is small, due to large models or tight latency SLOs, as it leaves less room for imperfect load balancing. We consider three load balancing mechanisms: (a) a centralized frontend that sends a backend a batch of exactly the desired size each time; (b) similar to Sparrow [99], sampling two random backends and forwarding the request to the backend with the shorter queue; (c) randomly selecting a backend weighted by its throughput. In the implementation of the two random choices mechanism, we cache the queue sizes at the frontend for 10ms to avoid excessive overhead from querying backends. We compare these three mechanisms with one frontend server and four backend servers serving the SSD model [89]. Table 3.6 lists the desired batch sizes under each latency SLO and the maximum throughput achieved using the different load balancing mechanisms. We can see that when the latency SLO and batch size are small, the weighted random algorithm performs significantly worse than the centralized frontend due to

Latency SLO (ms)   Batch size   Mechanism         Throughput (req/s)   Relative throughput
200                3            Centralized       142                  100%
                                Two choices       123                  86.6%
                                Weighted random   109                  76.8%
400                7            Centralized       191                  100%
                                Two choices       190                  99.5%
                                Weighted random   171                  89.5%
600                11           Centralized       204                  100%
                                Two choices       204                  100%
                                Weighted random   194                  95.1%

Table 3.6: Comparison of the throughput of three load balancing policies for the SSD model with different latency SLOs.

unbalanced load within a short time. The two choices mechanism also falls short of the centralized frontend at batch size 3. This is mostly caused by using stale backend queue sizes from the cache, but this overhead becomes negligible when the latency SLO grows. The centralized frontend consistently achieves the highest throughput among the three. However, it is nontrivial to extend such a design to a distributed version; the frontend may become the bottleneck when throughput is very high. We leave this extension to future work. Overall, we implement all three load balancing mechanisms in Nexus and use two choices by default.
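The two-random-choices mechanism with cached queue lengths can be sketched as follows (illustrative only: the backend's queue_length and submit methods are assumed interfaces, not Nexus APIs).

import random, time

class TwoChoicesBalancer:
    """Power-of-two-choices dispatch with a short-lived queue-length cache."""
    def __init__(self, backends, cache_ttl=0.010):   # 10 ms cache, as in the text
        self.backends = backends
        self.cache_ttl = cache_ttl
        self.cache = {}                              # backend -> (timestamp, queue length)

    def _queue_len(self, backend):
        ts, qlen = self.cache.get(backend, (0.0, None))
        if qlen is None or time.time() - ts > self.cache_ttl:
            qlen = backend.queue_length()            # RPC to the backend (assumed interface)
            self.cache[backend] = (time.time(), qlen)
        return qlen

    def dispatch(self, request):
        a, b = random.sample(self.backends, 2)
        target = a if self._queue_len(a) <= self._queue_len(b) else b
        target.submit(request)                       # forward to the chosen backend
        return target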

3.6 Evaluation

We implemented Nexus in C++ with 10k lines of code. Nexus supports the execution of models trained by various frameworks including Caffe [70], Caffe2 [43], Tensorflow [9], and Darknet [109]. Nexus can be deployed in a cluster using Docker Swarm [39] or Kubernetes [49].

To evaluate Nexus, we first use two case studies of real-world applications to compare the end-to-end performance against Clipper and Tensorflow serving (denoted as TF serving below). We then perform a set of experiments to quantify the performance gains of each optimization feature in Nexus. The experiments are performed on a cluster of 16 NVIDIA GTX 1080Ti GPUs.

3.6.1 Methodology

For any system and any given workload, we measure the maximum request processing rate such that 99% of the requests are served within their latency SLOs and refer to this measurement as the system's maximum throughput. This metric reflects the GPU efficiency that a system can achieve on a given workload.

Note that neither Clipper nor TF serving provides a cluster-wide solution. We provide a batch-oblivious scheduler as a baseline for both Clipper and Tensorflow serving. It works as follows. We first profile the maximum throughput achieved by Clipper or TF serving on a single node for a given model and latency SLO. The GPU share for each model session is calculated to be proportional to its request rate and inversely proportional to the maximum model throughput measured in the previous step. We then use a greedy algorithm to allocate models to GPUs according to their GPU shares (a sketch of this share computation appears after the list below). For complex queries, since Clipper and TF serving do not provide functionality for analyzing query performance, we evenly split the latency SLO across the different stages and use this latency split plan in our profiling step. We also use the following node-level configurations to instantiate Clipper and TF serving.

● Clipper: Clipper encapsulates each model in a docker container. If the batch-oblivious resource allocator assigns multiple models to a single GPU, we just launch multiple docker containers on this GPU. The Clipper frontend provides a load balancing mechanism. Therefore, we rely on Clipper to load balance requests to multiple replicas of a model.

● TF serving: We cannot specify a latency SLO in TF serving. Instead, we pick a maximum batch size for each model based on its profile, so that TF serving doesn't violate its latency SLO. TF serving doesn't provide a frontend to load balance the requests to the replicas either. We therefore built a frontend for TF serving and dispatched the requests to backends according to the mapping generated by the resource allocator.
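The baseline's share computation and greedy placement can be sketched as follows. This is an illustration of the description above, not Clipper or TF serving code; the specific greedy rule (placing each share on the currently least-loaded GPU) is an assumption.

def gpu_shares(sessions):
    """sessions: list of dicts with 'name', 'rate' (reqs/s), and profiled 'max_tput'."""
    return {s["name"]: s["rate"] / s["max_tput"] for s in sessions}

def greedy_allocate(shares, num_gpus):
    """Place each session's share, largest first, on the currently least-loaded GPU."""
    load = [0.0] * num_gpus
    placement = {}
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        idx = min(range(num_gpus), key=lambda i: load[i])
        load[idx] += share
        placement[name] = idx
    return placement

shares = gpu_shares([{"name": "resnet", "rate": 300, "max_tput": 150},
                     {"name": "lenet", "rate": 600, "max_tput": 6000}])
print(shares)                       # {'resnet': 2.0, 'lenet': 0.1}
print(greedy_allocate(shares, 2))   # {'resnet': 0, 'lenet': 1}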

[Figure: throughput (reqs/s) of Clipper, TF serving, Nexus batch oblivious, Nexus no prefix batch, and Nexus for the Game application, with and without LeNet.]

Figure 3.14: Comparison of the throughput of Nexus, Clipper, and TF serving for 20 live game analysis applications on 16 GPUs. "Nexus batch oblivious" measures the throughput of Nexus using the batch-oblivious scheduler. "Nexus no prefix batch" disables prefix batching in Nexus.

3.6.2 Case Study

3.6.2.1 Live Game Analysis

In this case study, we evaluate the application that analyzes frames from live game streams. On each frame, we invoke LeNet [80] 6 times to recognize digits and ResNet-50 [59] once to recognize the icon. ResNet is specialized to each game by re-training the last layer of the model, while all games share the same LeNet. We include 20 games in the case study, and consider a latency SLO of 50ms. Each game receives a different number of concurrent streams. The popularity of the 20 games follows the Zipf-0.9 distribution. We perform this experiment on 16 GPUs. During the experiment, we noticed that both Clipper and TF serving perform poorly when the query invokes the tiny LeNet model, as the batch-oblivious scheduler allocates too many GPUs for LeNet. Therefore, we also evaluated Clipper and TF serving on a modified game application that does not invoke LeNet but rather uses all 16 GPUs for ResNet. Figure 3.14 depicts the total throughput of the 20 games using Clipper, TF serving, and Nexus. We include two variants of Nexus to tease out the benefits obtained from some of our techniques: (a) a variant of Nexus that disables prefix batching and uses the batch-oblivious scheduler to perform resource allocation, and (b) a variant of Nexus that disables prefix batching. Overall, Nexus achieves

[Figure: throughput (reqs/s) of Clipper, TF serving, and Nexus for daytime and nighttime traffic workloads.]

Figure 3.15: Throughput of Clipper, TF serving, and Nexus for the traffic analysis application on 16 GPUs for daytime and nighttime workloads.

21× and 41× the throughput of TF serving and Clipper, respectively, on the original game application. Moreover, TF serving and Clipper still process 3-4 times fewer requests than Nexus even after we remove LeNet from the application. Comparing across the Nexus variants, our global scheduler achieves 27% higher throughput compared to the batch-oblivious scheduler, and prefix batching bumps up throughput by an extra 16%.

3.6.2.2 Traffic Monitoring

The traffic monitoring application analyzes live video streams from many traffic cameras. It first detects objects using SSD [89] and recognizes the make and model of cars and faces using GoogleNet-car [135] and VGG-Face [101], respectively. Figure 3.9 shows the query code for the traffic app. The latency SLO of this query is 400ms. We measure the throughput for this application using traffic feeds obtained during both daytime and nighttime. Figure 3.15 shows that Nexus achieves 1.8× (day) and 2.9× (night) the throughput of TF serving, and 4.4× (day) and 2.2× (night) the throughput of Clipper. Nexus is able to adapt the latency split for the SSD model from 302ms during daytime to 345ms at night because fewer cars and pedestrians appear in the frames. Therefore, we see a greater increase in throughput at night for Nexus than for TF serving.

[Figure: throughput (reqs/sec) of Clipper, TF serving, Nexus-parallel, and Nexus; (a) versus the number of models (2-5), (b) versus latency SLO (50-200 ms), (c) across SqueezeNet (1.7 GFLOPs), Inception (3.2 GFLOPs), ResNet-50 (8.2 GFLOPs), and VGG-16 (30.8 GFLOPs).]

Figure 3.16: Comparison of throughput when running multiple models on a single GPU among Clipper, Tensorflow Serving, a Nexus variant that executes models in parallel (Nexus-parallel), and standard Nexus, which executes models in a round-robin manner. (a) compares the throughput of Inception models for different numbers of models on one GPU with a latency SLO of 100ms. (b) compares the throughput of Inception models under different latency SLOs when there are 3 models on one GPU. (c) compares the throughput of different model architectures with a latency SLO of 100ms and 3 models on one GPU.

3.6.3 Microbenchmark

We perform several microbenchmarks to evaluate the throughput improvement from each feature used in Nexus.

3.6.3.1 Single GPU performance

We compare the performance of Nexus, Clipper, and Tensorflow Serving when supporting multiple models on a single GPU. In addition, Nexus performs prefix batching when it detects that different DNN models share a common prefix, to further improve throughput.

GPU Multiplexing. Figure 3.16 compares the throughput when executing multiple models on a single GPU. Three factors affect the performance of GPU multiplexing: the number of models that share the GPU, the latency SLO of the models, and the model architecture. By default, we use the Inception [121] model, a latency SLO of 100ms, and deploy 3 models on the same GPU. We include one variant of Nexus in this experiment, denoted Nexus-parallel, which executes models in parallel, to quantify how much GPU interference affects the throughput. Figure 3.16a compares the throughput ranging from 2 models to 5 models on one GPU. The results show that Nexus achieves 10%-30% higher throughput than Nexus-parallel. Interference becomes more severe when more models contend on the same GPU. Figure 3.16b compares the throughput while varying the latency SLO from 50ms to 200ms. When the latency SLO becomes higher, Nexus-parallel tends to achieve relatively higher throughput because there is more slack to tolerate interference. We also repeat the experiment for other model architectures including SqueezeNet [66], ResNet-50 [59], and VGG-16 [117] (shown in Figure 3.16c). We observe that larger models tend to suffer more from interference.

Nexus achieves 1.4-2.1× the throughput of TF serving, and 1.9-9.8× the throughput of Clipper on a single GPU. Because Clipper executes each model in a docker container and there is no coordination mechanism among containers, Clipper suffers from interference between models. We can see that Clipper's throughput drops significantly when more models are allocated to the same GPU or the latency SLO becomes more restrictive. TF serving executes multiple models in a round-robin way, the same as Nexus. But TF serving doesn't overlap the pre- and post-processing on the CPU with model execution on the GPU. TF serving also waits for a target batch size to be filled up until a timeout. Both cause the GPU to be idle and correspondingly harm the throughput.

Prex Batching. Figure ç.ÔÞ evaluates prex batching against executing the models separately on one GPU. Figure ç.ÔÞa depicts the throughput improvement by varying the number of Inception ò

[Figure: throughput improvement of prefix batching; (a) versus the number of models (2-10), (b) versus latency SLO (50-200 ms), (c) across SqueezeNet, Inception, ResNet-50, and VGG-16 for n = 3, 5, and 10 models.]

Figure 3.17: Comparison of the throughput improvement from batching the common prefix of model variants against running these models separately on one GPU. (a) and (b) both use the Inception model. (a) compares throughput improvement for different numbers of models that share a common prefix with a latency SLO of 100ms, whereas (b) compares throughput improvement under different latency SLOs when 3 models share a prefix. (c) shows the throughput improvement for different model architectures under a 100ms latency SLO for 3, 5, and 10 models that share a prefix.

model variants that differ only in the last layer and with a latency SLO of 100ms. It shows that prefix batching improves the throughput by up to 125%. Prefix batching primarily benefits from using a larger batch size for the common prefix; otherwise, models are executed at much smaller batch sizes when more models are multiplexed onto a GPU. Similarly, when the latency SLO becomes smaller, prefix batching provides more throughput gains, as shown in Figure 3.17b. Figure 3.17c applies prefix batching to different model architectures when there are 3, 5, and 10 models. Prefix batching doesn't

[Figure: (a) throughput (reqs/sec) of prefix batching versus the number of models for suffixes of 1, 2, and 3 FC layers, with a no-prefix-batching baseline; (b) memory use (MB) versus the number of models for the same suffix lengths, with a black line showing memory use without prefix batching.]

Figure 3.18: Overhead and memory use of prefix batching for VGG-16 with varying suffix lengths. k-FC means that the suffix contains the last k fully-connected layers. (a) Throughput of prefix batching for different numbers of models and suffix lengths. (b) Memory footprint of prefix batching with regard to the number of models and suffix length. The black line shows the memory use without prefix batching.

improve the throughput for SqueezeNet when there are 3 or 5 model variants, because it is a much smaller model, and the variants are able to saturate GPU efficiency even without prefix batching. For other settings, prefix batching improves throughput by up to 3×.

Note that as prex batching executes dišerent su›xes of model variants sequentially, it introduces certain overheads if the su›x includes heavy-weight computations. Figure ç.ԗa shows the throughput of prex-batched models varying the length of the su›x from the last layer to three fully-connected layers and varying the number of models. When the su›x contains no more than two fully-connected layers, execution of su›xes imposes less than % overhead even with Ôý models. But, because the third from last layer is quite heavy-weight, we can see that prex batching only achieves Þò% of throughput for Ôý models compared to no prex batching. Another benet brought by prex batching is memory savings since we only need to allocate one copy for the common prex. Figure ç.ԗb reveals that the memory use of prex batching grows sub-linearly with the number of models, whereas the GPU runs out of memory at — models without prex batching. ¥

[Figure: relative throughput of Nexus versus the baseline for five scenarios: Mix SLOs Inception, Mix SLOs ResNet, Mix rates Inception, Mix rates ResNet, and Mix models & SLOs; bars annotated with absolute throughputs 6736, 3040, 3710, 1271, and 4736 reqs/s.]

Figure 3.19: Comparison of the throughput of Nexus and of Nexus using the batch-oblivious resource allocator.

3.6.3.2 Resource Allocation on Multiple GPUs

We now compare the global scheduler in Nexus against the batch-oblivious scheduler. We measure the throughput of standard Nexus and, as the baseline, of Nexus using the batch-oblivious scheduler. Both need to allocate 16 sessions on 8 GPUs under three scenarios: (a) 16 Inception or ResNet models with mixed SLOs ranging from 50ms to 200ms, (b) 16 Inception or ResNet models with mixed request rates following a Zipf-0.9 distribution, (c) 8 different model architectures, each associated with two SLOs, 50ms and 100ms. Figure 3.19 depicts the relative throughput of standard Nexus with respect to the baseline. Nexus outperforms the baseline by 11-64% because our global scheduler is aware of the "squishy" performance of batch execution, while the batch-oblivious scheduler is prone to overloading a backend by placing excessive workload on it.

3.6.3.3 Complex Query

To evaluate the performance gain of the query analyzer, we compare the throughput of Nexus with and without the query analyzer. The baseline simply splits the latency SLO evenly across the various stages in the query. The query includes two stages: (a) the first stage executes SSD, and (b) the second invokes the Inception model α times. The experiment is performed on 8 GPUs. We vary the latency SLO

[Figure: throughput (reqs/sec) of the baseline and Nexus for α = 0.1, 1, and 10 at query latency SLOs of 300ms, 400ms, and 500ms.]

Figure 3.20: Comparison of the throughput of Nexus with and without the query analyzer. The query in this experiment includes two models, SSD and Inception. We vary the latency SLO and α in the query, where α indicates the number of times Inception is invoked per SSD invocation.

from 300ms to 500ms and choose α to be 0.1, 1, and 10. Figure 3.20 shows that Nexus with the query analyzer achieves 13-55% higher throughput than the baseline. To further understand the benefit of the query analyzer, we analyze another complex query that uses SSD in the first stage and VGG-Face in the second stage. We fix the latency SLO to 400ms and α = 1, and explore a wide range of latency split plans; the query analyzer outputs the allocation plan of 345ms for SSD and 55ms for VGG-Face. The experiments are measured on 8 GPUs. Figure 3.21 demonstrates that the query analyzer is able to find the best latency allocation strategy, as it achieves the highest throughput.

3.6.4 Large-scale deployment with changing workload

We deploy Nexus on a cluster of 64 GTX 1080Ti GPUs and run 10 applications with mixed models and latency SLOs for 15 minutes with changing workload. Figure 3.22 demonstrates that Nexus is able to scale up when the workload increases, and to consolidate workload and recycle GPUs when the workload decreases, while maintaining a high good rate (the percent of requests that are served within the latency SLO) for all applications.

[Figure: throughput (reqs/sec) for latency split plans 150/250, 200/200, 250/150, 300/100, 345/55 (best), and 370/30.]

Figure 3.21: Exploring different latency split plans for the SSD and VGG-Face query, where the query latency SLO is 400ms and α = 1. A/B indicates that the latency SLOs for SSD and VGG-Face are A ms and B ms respectively. The best split plan, 345/55, is generated by the query analyzer.

3.7 Conclusion

We proposed a scalable and efficient system design for serving Deep Neural Network (DNN) applications. Instead of serving the entire application in an opaque CPU-based container with models embedded in it, which leads to sub-optimal GPU utilization, our system operates directly on models and GPUs. This design enables several optimizations in batching and allows more efficient resource allocation. Our system is fully implemented in C++, and evaluation shows that Nexus can achieve 1.4-41× more throughput relative to state-of-the-art baselines while staying within latency constraints (achieving a "good rate" > 99% of the time).


Figure 3.22: Changing workload of 10 applications over time on 64 GPUs. The top row shows the request rates over time. The middle row shows the number of GPUs allocated to the applications. The bottom row shows the good rate, the percentage of requests served within the latency SLO.

Chapter 4 MCDNN: Approximation-Based Execution Framework for Mobile-Cloud Environment

In this chapter, we consider DNN serving in a different scenario: cloud-backed mobile devices. The computational demands of DNNs are high enough that, without careful resource management, such applications strain device battery, wireless data, and cloud cost budgets. We pose the corresponding resource management problem, which we call Approximate Model Scheduling, as one of serving a stream of heterogeneous (i.e., solving multiple classification problems) requests under resource constraints. We present the design and implementation of an optimizing compiler and runtime scheduler to address this problem. Going beyond traditional resource allocators, we allow each request to be served approximately, by systematically trading off DNN classification accuracy for resource use, and remotely, by reasoning about on-device/cloud execution trade-offs. To inform the resource allocator, we characterize how several common DNNs, when subjected to state-of-the-art optimizations, trade off accuracy for resource use such as memory, computation, and energy. The heterogeneous streaming setting is a novel one for DNN execution, and we introduce two new and powerful DNN optimizations that exploit it. Using the challenging continuous mobile vision domain as a case study, we show that our techniques yield significant reductions in resource usage and perform effectively over a broad range of operating conditions.

4.1 Background

Continuous mobile vision (CMV) refers to the setting in which a user wears a device that includes a continuously-on (we target ten hours of continuous operation) camera that covers their field of view [50, 104, 124, 106, 15]. The device is often a custom wearable such as Google Glass, but possibly just a mobile phone in a pocket.


Figure 4.1: Basic components of a continuous mobile vision system.

Video footage from the camera is analyzed, typically in real time, using computer vision techniques to infer information relevant to the user. While current systems are research efforts aimed at key niche applications such as cognitive assistance for the elderly and navigational assistance for the blind, in the near future we anticipate a multi-programming setting aimed at the general consumer, where multiple applications issue distinct queries simultaneously on the incoming stream. For example, various applications may need to recognize what the wearer eats, who they interact with, what objects they are handling as part of a task, the affect of the people they interact with, and the attributes of the place they are in. In this section, we introduce the resources involved in such a system and motivate careful resource management via controlling the overhead of Deep Neural Network (DNN) execution.

Video processing itself usually involves some combination of detecting regions of interest in each frame (e.g., faces, pedestrians or objects), tracking the trajectory of detected regions across time, and recognizing the detailed identities and attributes of these regions (e.g., recognizing the identity of people and objects interacted with).

Although traditionally detection and tracking have been performed with relatively lightweight algorithms, state-of-the-art variants of these are switching to Deep Neural Networks (DNNs) [85, 76]. More recent work has begun to integrate detection and recognition using DNNs [110]. Since detection and tracking computations are expected to be performed frequently (e.g., a few times a second), we expect DNNs to be applied several times to many of the video frames. We therefore view DNN execution as the bulk of modern vision computation. Figure 4.1 sketches the architecture of a state-of-the-art mobile/cloud Continuous Mobile Vision (CMV) system. The two main physical components are a battery-powered mobile device (typically some combination of a phone and a wearable) and powered computing infrastructure (some combination of a cloudlet and the deep cloud). The camera on the wearable device must usually capture a large field of view at high resolution and moderate frame rate. A resolution of 4k (4096 × 2160 pixels) at 15 frames per second is not unreasonable¹, drawing 90mW from a modern imager.

Consider performing all vision in the cloud, a "pure off-loading" architecture. For high-resolution video, spending 0.5W on compression and 0.7-1W on wireless offload is conservative, yielding a total average power draw of 1.3 to 1.6W for imaging, compression and communication. A realistic 100× compression yields a 16Mbps stream (= 4096 × 2160 × 15 × 1.5 × 8 / 100, using the 1.5 byte-per-pixel YUV representation), or roughly 2.1TB per month at 10 hours of usage per day. Finally, we assume a conservative 1 DNN application per frame (we expect that, in practice, applications may run multiple DNNs on incoming frames). Additionally assuming a conservative 100ms execution latency (we have measured 300-2000ms latencies for models commonly in use) for a DNN on a single CPU, keeping up with 15-30fps will require at least 1.5-3 cores, and often many times more. In comparison, a large 3Ah mobile phone battery of today yields roughly 1.2W over 10 hours. Further, today's consumer mobile plans cap data use at 10GB per month. Finally, continuous use of the required cores will cost roughly $150-300 per year²; more realistic workloads could easily be 10 times as costly.

¹Better vertical coverage (e.g., 4096×4096 pixels total) would be preferable, perhaps from two standard high-resolution imagers.
²Assuming a 3-year upfront lease on a C4.large machine from Amazon EC2. GPU-based costs are at least as high.

Offloading all data for cloud processing using maximally accurate DNNs would thus probably only be practical in usages with a large dedicated wearable battery, a lightly subscribed WiFi connection (thus limiting mobility) and a substantial cloud budget. One path to reducing these resource demands is to reduce the amount of data transmitted by executing "more lightweight" DNN calculations (e.g., detection and tracking) on the device and only transmitting heavier calculations (e.g., recognition) to the cloud. Once at the cloud, reducing the overhead of (DNN) computations can support more modest cloud budgets.

Now consider performing vision computations on the device. Such local computation may be necessary during the inevitable disconnections from the cloud, or just preferable in order to reduce data transmission power and compute overhead. For simplicity, let us focus on the former ("purely local") case. Video encoding and communication overhead would now be replaced (at least partially) by computational overhead. Given mobile CPU execution speeds (several seconds of execution latency for standard DNNs), we focus on using mobile GPUs such as the NVIDIA Tegra K1 [6], which supports 300GFLOPS peak at a 10W whole-system power draw. We time DNN execution on the K1 to take 100ms (for the "AlexNet" object recognition model) to 900ms (for "VGGNet"). Handling the aforementioned conservative processing rate of 1 DNN computation per frame at 15-30fps would require 1.5-30 mobile GPUs, which amounts to 15-300W of mobile power draw on the Jetson board for the K1 GPU. Even assuming a separate battery, a roughly 1-1.2W continuous power draw is at the high end of what is reasonable. Thus, it is important to substantially reduce the execution overhead of DNNs on mobile GPUs.
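The arithmetic behind the preceding two paragraphs is simple enough to spell out; the short script below only restates the numbers quoted above (frame size, frame rate, compression ratio, DNN latencies and GPU power) and is not a measurement.

# Back-of-envelope numbers from the text above; nothing here is measured.
width, height   = 4096, 2160     # 4k frame
fps             = 15
bytes_per_pixel = 1.5            # YUV
compression     = 100

stream_mbps = width * height * fps * bytes_per_pixel * 8 / compression / 1e6
monthly_tb = stream_mbps / 8 * 3600 * 10 * 30 / 1e6     # 10 hours/day, 30 days
print(f"compressed stream: {stream_mbps:.0f} Mbps, ~{monthly_tb:.1f} TB/month")

# Cloud-side compute: one 100 ms DNN invocation per frame.
for frame_rate in (15, 30):
    print(f"{frame_rate} fps needs at least {frame_rate * 0.100:.1f} CPU cores")

# Device-side compute: 100-900 ms per invocation on a ~10 W mobile GPU.
# (The text rounds the upper end up to 30 GPUs / 300 W.)
low, high = 15 * 0.100, 30 * 0.900
print(f"mobile GPUs needed: {low:.1f}-{high:.0f}, "
      f"i.e. {low * 10:.0f}-{high * 10:.0f} W")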

To summarize, significantly reducing the execution costs of DNNs can enable CMV in several ways. Allowing detection and tracking to run on the mobile device while only infrequently shipping to the cloud can make transmission power and data rates manageable. Reducing the cost of execution in the cloud can make dollar costs more attractive. Allowing economical purely local execution maintains CMV service when back-end connectivity is unavailable. Finally, if execution costs are lowered far enough, pure local execution may become the default.

4.2 Approximate Model Scheduling

As described in Section 2.2, a common theme in model optimization is the trading off of resource use for accuracy. When applied to the distributed, streaming, multi-model setting of MCDNN, several related questions present themselves. What is the "best" level of approximation for any model, at any given time, given that many other models must also execute at limited energy and dollar budgets? Where (device or cloud) should this model execute? Do the streaming and multi-programming settings present opportunities for new kinds of optimizations? To clarify the issues involved, we first capture these questions in a formal problem definition below and then describe MCDNN's solution. Given a stream of requests to execute models of various types, MCDNN needs to pick (and possibly generate), at each timestep, an approximate variant of these models and a location (device or cloud) to execute it in. These choices must satisfy both long-term (e.g., day-long) constraints on total energy and dollar budgets over many requests, and short-term capacity constraints (e.g., memory and processing-cycle availability). We call this problem the approximate model scheduling (AMS) problem and formalize it below.

4.2.1 Fractional packing and paging

In formalizing AMS, we are guided by the literature on online paging and packing. Our problem may be viewed as a distributed combination of online fractional packing and paging problems. The standard fractional packing problem [20] seeks to maximize a linear sum of variables while allowing upper bounds on other linear sums of the same variables. In AMS, the assumption is that when some model is requested to be executed at a time step, we have the option of choosing to execute an arbitrary "fraction" of that model. In practice, this fraction is a variant of the model whose accuracy and resource use are fractional with respect to the "best" model. These fractions x_t at each time step t are our variables. Assuming for the moment a linear relation between model fraction and accuracy, we have average accuracy over all time steps proportional to ∑_t a_t x_t, where a_t is the accuracy of the best model for request t. Finally, suppose one-time resource-use cost (e.g., energy, cloud cost) is proportional to the fraction of the model used, with e_t the resource use of the best variant of the model requested at t. A resource budget E over all time steps gives the upper-bound constraint ∑_t e_t x_t ≤ E. Maximizing average accuracy under budget constraints of this form gives a packing problem.

The weighted paging problem [16] addresses a sequence of t requests, each for one of M pages.

Let x_{mj} ∈ [0, 1] indicate what fraction of page m was evicted between its jth and (j+1)th requests. The goal is to minimize, over the t requests, the total number of (weighted) paging events, ∑_{m,j} c_m x_{mj}. At the same time, at every time step we must ensure that cache capacity is not exceeded: if k is the cache size, R_{mt} is the number of requests for page m up to time t, and N_t is the total number of requests in this time, we require ∀t: N_t − ∑_m x_m R_{mt} ≤ k. If the x's are once again interpreted as fractional models and the c's represent energy costs, this formulation minimizes day-long energy costs of paging while respecting cache capacity. Finally, new to the AMS setting, the model may be executed either on the local device or in the cloud. The two settings have different constraints (e.g., devices have serious power and memory constraints), whereas cloud execution is typically constrained by dollars and device-to-cloud communication energy. The online variant of the above problems requires that optimization be performed incrementally at each time step. For instance, which model is requested at time t, and therefore the identity of coefficient e_t, is only revealed at timestep t. Fraction x_t must then be computed before t + 1 (and similarly for c_{mj} and x_{mj}). On the other hand, the upper bound values (e.g., E) are assumed known before the first step. Recent work based on primal-dual methods has yielded algorithms for these problems that have good behavior in theory and practice. Below, we use these packing and paging problems as the basis for a precise specification of AMS. However, the resulting problem does not fall purely into a packing or paging category. Our solution to the problem, described in later sections, is heuristic, albeit based on insights from the theory. Note that we do not model several possibly relevant effects. The "cache size" may vary across timesteps because, e.g., the machine is being shared with other programs. The cost/accuracy tradeoff may change with context, since models are specializable in some contexts. Sharing across models may mean that some groups of models may cost much less together than separately.

Since different apps may register different models to handle the same input type (e.g., face ID, race and gender models may all be registered to process faces), we have a "multi-paging" scenario where models are loaded in small batches. Although we do not model these effects formally, our heuristic scheduling algorithm is flexible enough to handle them.
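For reference, the two classical problems sketched above can be restated compactly in the notation just introduced. This restatement is only a reading aid (it elides the online arrival of the coefficients):

\begin{align*}
\text{(fractional packing)} \quad & \max_{x}\ \sum_{t} a_t x_t
  \quad \text{s.t.} \quad \sum_{t} e_t x_t \le E, \quad 0 \le x_t \le 1,\\
\text{(weighted paging)} \quad & \min_{x}\ \sum_{m,j} c_m x_{mj}
  \quad \text{s.t.} \quad \forall t:\ N_t - \sum_{m} x_m R_{mt} \le k, \quad x_{mj} \in [0,1].
\end{align*}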

4.2.2 AMS problem definition

Our goal in this section is to provide a precise specification of AMS. We seek a form that maximizes (or minimizes) an objective function under inequalities that restrict the feasible region of optimization. Assume a device memory of size S, a device energy budget E and a cloud cost budget of D. Assume a set {M_1, ..., M_n} of models, where each model M_i has a set V_i of variants {M_{ij} | n_i ≥ j ≥ 1}, typically generated by model optimization. For instance, M_1 may be a model for face recognition, M_2 for object recognition, and so on. Say each variant M_{ij} has size s_{ij}, device paging energy cost e_{ij}, device execution energy cost c_{ij}, device execution latency l_{ij}, cloud execution cost d_{ij}, and accuracy a_{ij}. We assume accuracy and paging cost vary monotonically with size. Below, we also consider a continuous version V_i' = {M_{ix} | x ∈ [0, 1]} of the variants, with corresponding costs s_{ix}, e_{ix}, c_{ix}, l_{ix}, d_{ix} and accuracy a_{ix}.

Assume an input request sequence m_{t_1}, m_{t_2}, ..., m_{t_T}. Subscripts t_τ ∈ R are timestamps. We call the indexes τ "timesteps" below. Each m_{t_τ} is a request for model M_i (for some i) at time t_τ. For each request m_{t_τ}, the system must decide where (device or cloud) to execute this model. If on-device, it must decide which variant M_{ij}, if any, of M_i to load into the cache and execute at that timestep, while also deciding which, if any, models must be evicted from the cache to make space for M_{ij}. Similarly, if on-cloud, it must decide which variant to execute. We use the variables x, y and x' to represent these actions (paging, eviction and cloud execution) as detailed below.

Let x_{ik} ∈ [0, 1] represent the variant of model M_i executed on-device when it is requested for the kth time. Note that x_{ik} = 0 implies no on-device execution; presumably execution will be in the cloud (see below). Although in practice x_{ik} ∈ V_i, and is therefore a discrete variable, we use a continuous relaxation x_{ik} ∈ [0, 1]. In this setting, we interpret x_{ik} as representing the variant M_{i x_{ik}}. For model M_i, let τ_{ik} be the timestep at which it is requested for the kth time, and n_{iτ} be the number of times it is requested up to (and including) timestep τ. For any timestep τ, let R(τ) = {M_i | n_{iτ} ≥ 1} be the set of models requested up to and including timestep τ.

Let y_{iτ} ∈ [0, 1] be the fraction of the largest variant M_i^* of M_i evicted in the τth timestep. Let y_i^k = ∑_{τ=τ_{ik}}^{τ_{i(k+1)}−1} y_{iτ} be the total fraction of M_i evicted between its kth and (k+1)th requests. Note that a fraction of several models may be evicted at each timestep. Finally, let x'_{ik} ∈ [0, 1] represent the variant of model M_i executed on the cloud when it is requested for the kth time. We require that a given request for a model be executed either on device or on cloud, but not both, i.e., that x_{ik} x'_{ik} = 0 for all i, k.

Now suppose x_{i(k−1)} and x_{ik} are the variants selected to serve the (k−1)th and kth requests for M_i. The energy cost for paging request k is e_{i x_{ik}} if x_{i(k−1)} ≠ x_{ik}, i.e., a different variant is served than previously, or if M_i was evicted since the last request, i.e., y_i^{k−1} > 0. Otherwise, we hit in the cache and the serving cost is zero. Writing δ(x) = 0 if x = 0 and 1 otherwise, the energy cost for paging is therefore e_{i x_{ik}} δ(|x_{i(k−1)} − x_{ik}| + y_i^{k−1}). Adding on the execution cost c, the total on-device energy cost of serving a request is therefore e_{i x_{ik}} δ(|x_{i(k−1)} − x_{ik}| + y_i^{k−1}) + c_{i x_{ik}}.

We can now list the constraints on an acceptable policy (i.e., an assignment to the x_{ik}'s and y_{iτ}'s). Across all model requests, maximize the aggregate accuracy of the model variants served,

\max_x \sum_{i=1}^{n} \sum_{k=1}^{n_{iT}} \big( a_{i x_{ik}} + a_{i x'_{ik}} \big), \quad (4.1)

while keeping total on-device energy use within budget:

\sum_{i=1}^{n} \sum_{k=1}^{n_{iT}} \Big[ e_{i x_{ik}} \, \delta\big(|x_{i(k-1)} - x_{ik}| + y_i^{k-1}\big) + c_{i x_{ik}} \Big] \le E, \quad (4.2)

keeping within the cloud cost budget:

\sum_{i=1}^{n} \sum_{k=1}^{n_{iT}} d_{i x'_{ik}} \le D, \quad (4.3)

and at each timestep, not exceeding cache capacity:

\forall\, 1 \le \tau \le T: \quad \sum_{\substack{M_i \in R(\tau)\\ 1 \le k \le n_{i\tau}}} s_{i x_{ik}} \;-\; \sum_{\substack{M_i \in R(\tau)\\ 1 \le \tau' \le \tau}} s_i\, y_{i\tau'} \;\le\; S, \quad (4.4)

ensuring that the selected model variant executes fast enough:

\forall\, 1 \le i \le n,\ 1 \le k \le n_{iT}: \quad l_{i x_{ik}} \le t_{\tau_{i(k+1)}} - t_{\tau_{ik}}, \quad (4.5)

and ensuring the various consistency conditions mentioned above:

\forall\, 1 \le i \le n,\ 1 \le k \le n_{iT}: \quad 0 \le x_{ik},\, x'_{ik},\, y_{ik} \le 1 \quad \text{and} \quad x_{ik}\, x'_{ik} = 0. \quad (4.6)
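To make the formulation concrete, the sketch below accumulates objective (4.1) and checks budgets (4.2)-(4.4) and the exclusion condition (4.6) for a candidate assignment. The catalog callables and input structures are hypothetical stand-ins, the latency constraint (4.5) is omitted for brevity, and the cost callables are assumed to return zero at x = 0; this mirrors the equations only and is not MCDNN's scheduler.

def delta(v):
    # delta(v) = 0 if v == 0 else 1, as in the paging-cost expression
    return 0 if v == 0 else 1

def evaluate_schedule(models, x, xc, y, requests, S, E, D):
    """models[i] holds callables 'a', 'e', 'c', 'd', 's' mapping a fractional
    variant in [0, 1] to accuracy/costs; requests lists the model id requested
    at each timestep; x/xc map i -> {k: variant}; y maps i -> {tau: fraction}."""
    acc = energy = dollars = resident = 0.0
    k = {i: 0 for i in models}
    prev = {i: 0.0 for i in models}
    pending = {i: 0.0 for i in models}      # y_i^{k-1}: evicted since last request
    for tau, i in enumerate(requests, start=1):
        k[i] += 1
        cat, xd, xcl = models[i], x[i][k[i]], xc[i][k[i]]
        assert xd * xcl == 0                                             # (4.6)
        acc += cat['a'](xd) + cat['a'](xcl)                              # (4.1)
        energy += cat['e'](xd) * delta(abs(prev[i] - xd) + pending[i]) \
                  + cat['c'](xd)                                         # (4.2)
        dollars += cat['d'](xcl)                                         # (4.3)
        prev[i], pending[i] = xd, 0.0
        resident += cat['s'](xd)
        for j in models:                     # evictions occurring at timestep tau
            frac = y[j].get(tau, 0.0)
            resident -= models[j]['s'](1.0) * frac   # s_j of the largest variant
            pending[j] += frac
        assert resident <= S                                             # (4.4)
    return acc, energy <= E, dollars <= D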

4.2.3 Discussion

We seek to solve the above optimization problem online. On the surface, AMS looks similar to fractional packing: we seek to maximize a weighted sum of variables that are introduced in an online fashion, while upper-bounding various weighted combinations of these variables. If we could reduce AMS to packing, we could use the relevant primal-dual scheme [20] to solve the online variant. However, the AMS problem as stated has several key differences from the standard packing setting:

• Most fundamentally, the paging constraint (Equation 4.4) seeks to impose a cache capacity constraint for every timestep. The fractional packing problem strongly requires the number of constraints to be fixed before streaming, and therefore independent of the number of streaming elements.

• The mappings from the fractional size of the model x_{ik} to accuracy (a_{x_{ik}}), energy cost (e_{x_{ik}}), etc., are left unspecified in AMS. The standard formulation requires these to be linear, e.g., a_{x_{ik}} = a_i x_{ik}.

• The standard formulation requires the upper bounds to be independent of the timestep. By requiring the execution latency at timestep τ_{ik} to be less than the interval t_{τ_{i(k+1)}} − t_{τ_{ik}}, AMS introduces a time dependency on the upper bound.

• The mutual exclusion criterion (Equation 4.6) introduces a non-linearity over the variables.

Given the above wrinkles, it is unclear how to solve AMS with theoretical guarantees. MCDNN instead uses heuristic algorithms motivated by ideas from solutions to packing/paging problems.

4.3 System Design

Solving the AMS problem requires two main components. First, the entire notion of approximate model scheduling is predicated on the claim that the models involved allow a useful tradeoff between resource usage and accuracy. Instead of using a single hand-picked approximate variant of a model, MCDNN dynamically picks the most appropriate variant from its model catalog. The model catalog characterizes precisely how the model optimization techniques of Section 2.2 affect the tradeoffs between accuracy and model-loading energy on device (e_{ix}; Equation 4.2), model-execution energy on device (c_{ix}; Equation 4.2), model-execution dollar cost on cloud (d_{i x_{ik}}; Equation 4.3), model size (s_{i x_{ik}}; Equation 4.4) and model execution latency (l_{ix}; Equation 4.5). Below, we provide a detailed measurement-based characterization of these tradeoffs. We also introduce two new model optimization techniques that exploit the streaming and multi-programming setting introduced by MCDNN. The second component for solving AMS is of course an online algorithm for processing model request streams. We present a heuristically motivated scheduling algorithm below. Finally, we briefly describe the end-to-end system architecture and implementation that incorporates these components.

4.3.1 Model catalogs

MCDNN generates model catalogs, which are maps from versions of models to their accuracy and resource use, by applying model optimizations to DNNs at varying degrees of optimization. In this section, we focus on applying the popular factorization, pruning and architectural-change approaches described in Section 2.2.

Task   Description (# training images, # test images, # classes)
V      VGGNet [117] on ImageNet data
A      AlexNet on ImageNet data [37] for object recognition (1.28M, 50K, 1000)
S      AlexNet on MITPlaces205 data [143] for scene recognition (2.45M, 20K, 205)
M      re-labeled S for inferring manmade/natural scenes
L      re-labeled S for inferring natural/artificial lighting scenes
H      re-labeled S with SUN405 [131] for detecting horizons
D      DeepFaceNet replicating [123] with web-crawled face data (50K, 5K, 200)
Y      re-labeled D for age: 0-30, 30-60, 60+
G      re-labeled D for gender: M, F
R      re-labeled D for race: African American, White, Hispanic, East Asian, South Asian, Other

Table 4.1: Description of classification tasks.

MCDNN applies the following traditional techniques. Factorization: it replaces size-m × n weight matrices with their factored variants of sizes m × k and k × n for progressively smaller values of k; it typically investigates k = n/2 ... n/8. We factor both matrix multiplication and convolutional layers. Pruning: it restricts itself to reducing the bit-widths used to represent weights in our models; we consider 32-, 16- and 8-bit representations. Architectural change: there is a wide variety of architectural transformations possible; we concentrate on the following: (1) for convolutional and locally connected layers, increase kernel/stride size or decrease the number of kernels, to yield a quadratic or linear reduction in computation; (2) for fully connected layers, reduce the size of the output layer to yield a linear reduction in the size of the layer; (3) eliminate convolutional layers entirely.
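As an illustration of the factorization step above, the following sketch replaces an m × n weight matrix with a rank-k product obtained by truncated SVD, for k = n/2, n/4, n/8. It uses a random matrix purely to show the parameter savings; a real catalog entry would be fine-tuned after factorization.

# Minimal factorization sketch: W (m x n) -> U (m x k) times V (k x n).
import numpy as np

def factorize(W, k):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]          # shapes (m, k) and (k, n)

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024))
for k in (W.shape[1] // 2, W.shape[1] // 4, W.shape[1] // 8):
    U, V = factorize(W, k)
    compression = W.size / (U.size + V.size)
    err = np.linalg.norm(W - U @ V) / np.linalg.norm(W)
    print(f"k={k}: {compression:.1f}x fewer parameters, relative error {err:.2f}")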

The individual impact of these optimizations has been reported elsewhere [75, 55, 117]. We focus here on understanding broad features of the tradeoff. For instance, does accuracy plunge as resources are lowered, or does it ramp down gently? Do the various gains and trends persist across a large variety of models?


Figure 4.2: Memory/accuracy tradeoffs in MCDNN catalogs.

How does resource use compare to key mobile budget parameters such as wireless transmission energy and commercial cloud compute costs? How do various options compare with each other: for instance, how does the energy use of model loading compare with that of execution (thus informing the importance of caching)? Such system-level, cross-technique questions are not typically answered by the model-optimization literature.

To study these questions, we have generated catalogs for ten distinct classification tasks (a combination of model architecture and training data, the information a developer would input to MCDNN). We use state-of-the-art architectures and standard large datasets when possible, or similar variants if necessary. The wide variety of tasks hints at the broad utility of DNNs, and partially motivates systems like MCDNN for managing DNNs. Table 4.1 summarizes the classification tasks we use. For each model in the catalog, we measure average accuracy by executing it on its validation dataset, and resource use via the appropriate tool. We summarize key aspects of these catalogs below.

Figure 4.2 illustrates the memory/accuracy tradeoff in the MCDNN catalogs corresponding to the above classification tasks. For each classification task, we plot the set of variants produced by the above model optimization techniques in a single color. MCDNN generated 10 to 68 variants of models for the various tasks.


Figure 4.3: Energy/accuracy tradeoffs in MCDNN catalogs.

We show here points along the "Pareto curves" of the catalog, i.e., the highest-accuracy/lowest-memory frontier. Each point in the graph illustrates the average accuracy and memory requirement of a single model variant. Note that the y-axis uses a log scale. Three points are worth noting. First, as observed elsewhere, optimization can significantly reduce resource demands at very modest accuracy loss. For instance, the VGGNet model loses roughly 4% average accuracy while using almost 10× less memory. Second, and perhaps most critically for MCDNN, accuracy loss falls off gently with resource use. If small reductions in memory use required large sacrifices in accuracy, the resource/quality tradeoff at the heart of MCDNN would be unusable. Third, these trends hold across a variety of tasks.
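The "Pareto curve" points plotted in Figure 4.2 can be extracted from a catalog with a simple dominance filter; the sketch below uses invented (accuracy, memory) pairs rather than actual catalog entries.

def pareto(catalog):
    """Keep only variants for which no other variant is both more accurate
    and no larger. Entries are (accuracy %, memory MB)."""
    frontier = []
    for acc, mem in sorted(catalog, key=lambda v: (-v[0], v[1])):
        if not frontier or mem < frontier[-1][1]:
            frontier.append((acc, mem))
    return frontier

catalog = [(88.1, 520), (87.5, 260), (86.9, 140), (84.2, 55), (86.0, 300),
           (79.8, 21), (83.0, 120), (70.3, 6)]
print(pareto(catalog))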

Figure 4.3 illustrates energy/accuracy tradeoffs for the face, object and scene recognition tasks. Energy numbers are measured for an NVIDIA Tegra K1 GPU on a Jetson board using an attached DC power analyzer [5]. The figure shows both execution (crosses) and load (dots) energies for each model variant. Once again, across tasks, model optimizations yield significantly better resource use at a modest loss of classification accuracy. Further, the falloff is gentle, but not as favorable as for memory use. In scene recognition, for instance, accuracy falls off relatively rapidly with energy use, albeit from a very low baseline. The horizontal dotted lines indicate the energy budget (2.3J) to support 5 events/minute with 25% of a 2Ah battery, the cost to transmit a 10kB packet over LTE (0.9J), and the cost over WiFi (0.5J) [25].


Figure 4.4: Latency/accuracy tradeoffs in MCDNN catalogs. Dotted lines mark latency budgets at a $10/year cloud budget: 9.7ms (cloud GPU, 1 event/s), 582ms (cloud GPU, 1 event/min over 10 hrs, AWS g2.2xlarge) and 7645ms (cloud CPU, 1 event/min, AWS c4.large).

Note that the packet transmit numbers are representative figures based on measurements by us and others; for instance, LTE packet transmit overhead may be as high as 12J for a single packet [63], and a 100ms round trip over a 700mW WiFi link could take as little as 0.07J. For the most part, standard model optimizations do not tip the balance between local execution and transmission. If execution must happen locally due to disconnection, the mobile device could support at most a few queries per minute. Finally, as the colored dots in the figure show, the energy to load models is 2-5× higher than to execute them: reducing model loading is thus likely a key to power efficiency.

Figure 4.4 illustrates latency/accuracy tradeoffs for the face, object and scene recognition tasks. The figure shows the time taken to execute/load models on devices (crosses) and on the cloud (dots). The measurements shown are for GPUs on-device (a Tegra K1) and on-cloud (an NVIDIA K20), and for cloud-based CPUs. The dotted lines again indicate some illuminating thresholds. For instance, a cloud-based GPU can already process one event per second on a $10/year budget (which translates to 9.7ms per event) for scene recognition and some variants of face recognition, but not for object recognition. A cloud-based CPU, however, can process one event per minute but not one per second (roughly a 130ms budget, not shown) at this budget. The efficacy of model optimization carries over to execution and loading speed.


Figure 4.5: Model specialization: (a) Cascading specialized models. (b) MCDNN infrastructure for specialization.

4.3.2 System support for novel model optimizations

The streaming, multi-model setting is a relatively uncommon one for model optimization. We therefore adopt two new optimizations, sequential specialization (described in Chapter 5) and computation sharing, to exploit the streaming and multi-model aspects respectively. We describe the system support that enables these two optimizations below.

4.3.2.1 Support for sequential specialization

Recall that sequential specialization adopts a cascaded approach (Figure 4.5(a)) to exploit temporal locality. Intuitively, we seek to train a resource-light "specialized" variant of the developer-provided model for the few classes that dominate each context. Crucially, this model must also recognize well when an input does not belong to one of these classes; we refer to this class as "other" below. We chain this specialized model in series with the original "generic" variant of the model, which makes no assumptions about context, so that if the specialized model reports that an input is of class "other", the generic model can attempt to further classify it.

Figure 4.5(b) shows the machinery in MCDNN to support model specialization. The profiler maintains a cumulative distribution function (CDF) of the classes resulting from classifying inputs so far with each model. The specializer, which runs in the background in the cloud, determines whether a small fraction of the possible classes "dominate" the CDF for a given model. If so, it adds to the catalog specialized versions of the generic variants (stored in the catalog) of the model by "re-training" them on a subset of the original data dominated by these classes. If a few classes do indeed dominate strongly, we expect even small models, which are not particularly accurate on general inputs, to be quite accurate on inputs drawn from the restricted context.
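A minimal sketch of the cascaded execution path and of the profiler's dominance test follows. The model callables, the "other" label and the thresholds are illustrative assumptions, not MCDNN's actual interfaces.

OTHER = "other"

def cascade_classify(frame, specialized, generic):
    # Both models are assumed to return a (label, confidence) pair.
    label, conf = specialized(frame)
    if label == OTHER:
        # The cheap specialized model abstains; fall back to the generic model.
        label, conf = generic(frame)
    return label, conf

def dominant_classes(class_counts, coverage=0.9, max_classes=7):
    """Return the smallest set of classes covering `coverage` of recent
    classifications, or None if more than `max_classes` are needed (no skew)."""
    total = sum(class_counts.values())
    if total == 0:
        return None
    picked, covered = [], 0
    for cls, cnt in sorted(class_counts.items(), key=lambda kv: -kv[1]):
        picked.append(cls)
        covered += cnt
        if covered / total >= coverage:
            return picked
        if len(picked) >= max_classes:
            return None
    return None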

4.3.2.2 Support for model sharing

Until now, we have considered optimizing individual models for resource consumption. In practice, however, multiple applications could each have multiple models executing at any given time, further straining resource budgets. The model sharing optimization is aimed at addressing this challenge. Figure 4.6(a) illustrates model sharing. Consider the case where (possibly different) applications wish to infer the identity (ID), race, age or gender of incoming faces. One option is to train one DNN for each task, thus incurring the cost of running all four simultaneously. However, recall that the layers of a DNN can be viewed as increasingly less abstract layers of visual representation. It is conceivable therefore that representations captured by lower levels are shareable across many high-level tasks. If this were so, we would save the cost of re-executing the shared bottom layers. Given that the lower (convolutional) layers of a DNN dominate its computational cost, the savings could be considerable. Indeed, we will show in the results section that re-targeting, where the shared fragment is close to the whole model in size, is commonly applicable. Implementing sharing requires cooperation between the MCDNN compiler and runtime. When defining input model schemas, the compiler allows programmers to pick model schemas or prefixes of model schemas from a library appropriate for each domain. We currently simply use prefixes of AlexNet, VGGNet and DeepFace and their variants. Suppose the model schema s input to the MCDNN compiler has the form s = s_l + s_u, and t/v is the training/validation data, where layers s_l are from the library and intended for sharing.

Figure 4.6: Model sharing: (a) Sharing model fragments for facial analysis. (b) MCDNN infrastructure for sharing, replicated in the client and cloud.

Let m_l be the trained version of s_l, also available pre-trained from the library. The compiler assembles a trained model m_l + m_u consisting of two fragments. The lower fragment is m_l. The upper fragment, m_u, is a freshly trained variant of s_u, trained on t' = m_l(t), v' = m_l(v), i.e., the training "input" to the upper fragment is the result of running the original training input through the lower fragment. The compiler records the fragments and their dependence in the catalog passed to the runtime. Note that since the upper fragment needs to be re-trained, sharing is not simply common-subexpression elimination on DNNs. Figure 4.6(b) illustrates the runtime infrastructure to support sharing. The scheduler loads complete models and model fragments into memory for execution, notifying the router of dependencies between fragments. Given an input for classification, the router sends it to all registered models (or their fragments). In the case of fragments without output layers, the router collects their intermediate results and sends them on to all dependent fragments. The results of output layers are the classification results of their respective models. The router returns classification results as they become available.
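The dispatch structure the router uses for shared fragments can be sketched as follows; the fragment objects are placeholders for whatever the runtime has loaded, and the class below is illustrative rather than MCDNN's implementation.

class FragmentRouter:
    def __init__(self, trunk):
        self.trunk = trunk           # shared lower fragment m_l
        self.heads = {}              # task name -> upper fragment m_u

    def register(self, task, head):
        self.heads[task] = head

    def classify(self, frame):
        features = self.trunk(frame)             # executed once per input
        # Every registered task (e.g., face ID, age, gender, race) reuses the
        # same intermediate features produced by the shared lower fragment.
        return {task: head(features) for task, head in self.heads.items()}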

4.3.3 MCDNN's approximate model scheduler

When applications are installed, they register with the scheduler a map from input types to a catalog of model fragments to be used to process those inputs, and handlers to be invoked with the result from each model. The catalog is stored on disk. When a new input appears, the scheduler (with help from the router and profiler) is responsible for identifying the model variants that need to be executed in response, paging the appropriate model variants in from disk to the in-memory model-fragment cache in the appropriate location (i.e., on-device or on-cloud) if needed, executing the models on the input, and dispatching the results to the registered handlers. This online scheduling problem is challenging because it combines several elements considered separately in the scheduling literature. First, it has an "online paging" element [16], in that every time an input is processed, it must reckon with the limited capacity of the model cache. If no space is available for a model that needs to be loaded, it must evict existing models and page in new ones. Second, it has an "online packing" [20] element: over a long period of use (we target 10 hrs), the total energy consumed on the device and the total cash spent on the cloud must not exceed battery budgets and daily cloud cost budgets. Third, it must consider processing a request either on-device, on-cloud or split across the two, introducing a multiple-knapsack element [22]. Finally, it must exploit the tradeoff between model accuracy and resource usage, introducing a fractional aspect. It is possible to show that even a single-knapsack variant of this problem has a competitive ratio lower-bounded proportional to log T, where T is the number of incoming requests. The dependency on T indicates that no very efficient solution exists. We are unaware of a formal algorithm that approaches even this ratio in the simplified setting. We present a heuristic solution here. The goal of the scheduler is to maximize the average accuracy over all requests subject to paging and packing constraints. The overall intuition behind our solution is to back in/out the size (or equivalently, accuracy) of a model as its use increases/decreases; the amount of change and the location (i.e., device/cloud/split) changed to are based on constraints imposed by the long-term budget. Algorithm 4.1 provides the details when processing input i on model n. On a cache miss, the key issues to be decided are where (device, cloud or split) to execute the model (Line 12) and which variant to execute (Line 13). The variant and location selected are the ones with the maximum accuracy (Line 13) under estimated future resource (energy on the device, and cash on the cloud) use (Lines 10, 11).

To estimate future resource use for model n (Lines 15-21), we maintain its frequency of use f_n, the number of times it has been requested per unit time since loading.

Algorithm 4.1 MCDNN scheduler
 1: function process(i, n)                               ▷ i: input, n: model name
 2:   if (l, m) ← cacheLookup(n) ≠ null then             ▷ Cache hit
 3:     r ← exec(m, i)                                   ▷ Classify input i using model m
 4:     async cacheUpdate(n, l, m)                       ▷ Update cache in background
 5:   else                                               ▷ Cache miss
 6:     m ← cacheUpdate(n)
 7:     r ← exec(m, i)
 8:   return r

   ▷ Update the cache by inserting the variant of n most suited to current resource availability in the right location.
 9: function cacheUpdate(n, l, m = nil)                  ▷ Insert n; variant m is already in
10:   e_d, e_cs, c_c ← calcPerRequestBudgets(n)
11:   a_d, a_s, a_c ← catalogLookupRes(n, e_d, e_cs, c_c, m)
12:   a*, l* ← max_l a_l, argmax_l a_l                   ▷ Find optimal location and its accuracy
13:   v* ← catalogLookupAcc(n, a*)                       ▷ Look up variant with accuracy a*
14:   m ← cacheInsert(l*, v*) if m.v ≠ v* or l ≠ l* else m
15:   return m

   ▷ Calculate energy, energy/dollar and dollar budgets for executing model n on the mobile device, split between device/cloud, and cloud only.
16: function calcPerRequestBudgets(n)
17:   e, c ← remainingEnergy(), remainingCash()
      ▷ Allocate remaining resource r so more frequent requests get more resources. f_i is the profiled frequency of model m_i, measured since it was last paged into the cache. T and r are the remaining time and resource budgets. Δr_n is the cost to load n. Δt_n is the time since n was loaded, set to ∞ if n is not in any cache.
18:   def resPerReq(r, l) = (r − Δr_n T/Δt_n) f_n / (T Σ_{i∈Cache_l} f_i²)
19:   e_d, c_d ← resPerReq(e, "dev"), resPerReq(c, "cloud")
      ▷ Account for split models. t_n is the fraction of time spent executing the initial fragment of model n relative to executing the whole.
20:   e_s, c_s ← e_d · t_n, c_d · (1 − t_n)
21:   return e_d, (e_s, c_s), c_d

Algorithm 4.1 MCDNN scheduler (continued)

      ▷ Find the accuracies of the model variants for model n in the catalog that best match energy budget e, dollar budget c and split budget s. If the catalog lookup is due to a miss in the cache (i.e., m is nil), revert to a model that loads fast enough.
22: function catalogLookupRes(n, e, s, c, m)
      ▷ cX2A(n, r) returns the accuracy of model n in location X using resources r
23:   a_e, a_s, a_c ← cD2A(n, e), cS2A(n, s), cC2A(n, c)
      ▷ On a miss, bound load latency. a*_l is the accuracy of the most accurate model that can be loaded to location l at acceptable miss latency.
24:   if m = nil then
25:     a_e, a_s, a_c ← min(a_e, a*_e), min(a_s, a*_s), min(a_c, a*_c)
26:   return a_e, a_s, a_c

      ▷ Insert variant v in location l, where l ∈ {"dev", "split", "cloud"}
28: function cacheInsert(l, v)
29:   s ← size(v)
30:   if (a = cacheAvailableSpace(l)) < s then
31:     cacheEvict(l, s − a)                             ▷ Reclaim space by evicting LRU models
32:   m ← cacheLoad(l, v)                                ▷ Load variant v to location l
33:   return m

Let us focus on how this estimation works for the on-device energy budget (Line 19); cloud cash budgets are identical, and device/cloud split budgets (Line 20) follow easily. If e is the current total remaining energy budget on the device, and T the remaining runtime of the device (currently initialized to 10 hours), we allocate to n a per-request energy budget of e_n = e f_n / (T ∑_i f_i²), where the summation is over all models in the on-device cache. This expression ensures that every future request for model n is allocated energy proportional to f_n and, keeping in mind that each model i will be used T f_i times, that the energy allocations sum to e in total (i.e., ∑_i e_i f_i T = e). To dampen oscillations in loading, we attribute a cost Δe_n to loading n. We further track the time Δt_n since the model was loaded, and estimate that if the model were reloaded now, and reloaded at this frequency in the future, it would be re-loaded T/Δt_n times, with total re-loading cost Δe_n T/Δt_n. Discounting this cost from the total available energy gives a refined per-request energy budget of e_n = (e − Δe_n T/Δt_n) f_n / (T ∑_i f_i²) (Line 18).
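A numerical illustration of the refined per-request budget on Line 18 follows; the frequencies, budgets and load costs below are invented.

def per_request_budget(e, T, f, n, delta_e, delta_t):
    """e: remaining device energy (J); T: remaining runtime (s);
    f: model -> request frequency (req/s) since loading, for models in the
    on-device cache; n: the model being budgeted; delta_e: cost (J) to load n;
    delta_t: seconds since n was loaded (float('inf') if not cached)."""
    reload_discount = delta_e * T / delta_t if delta_t != float('inf') else 0.0
    return (e - reload_discount) * f[n] / (T * sum(fi * fi for fi in f.values()))

f = {"face": 0.5, "scene": 0.2, "object": 0.05}      # requests per second
print(per_request_budget(e=40_000, T=36_000, f=f, n="face",
                         delta_e=5.0, delta_t=600.0))   # roughly 1.9 J per request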

Given the estimated per-request resource budget for each location, we can consult the catalog to identify the variant providing the maximum accuracy for each location (Line 23) and update the cache at that location with that variant (Line 14). Note that even if a request hits in the cache, we consider (in the background) updating the cache for that model if a different location or variant is recommended. This has the effect of "backing in"/"backing out" models by accuracy and dynamically re-locating them: models that are used often (and therefore have high frequencies f_n) are replaced with more accurate variants (and vice versa) over time at their next use.

Figure 4.7: Architecture of the MCDNN system.

4.3.4 End-to-end architecture

Figure 4.7 illustrates the architecture of the MCDNN system. An application developer interested in using a DNN in resource-constrained settings provides the compiler with the type of input to which the model should be applied (e.g., faces), a model schema, and training data. The compiler derives a catalog of trained models from this data, mapping each trained variant of a model to its resource costs, accuracy, and information relevant to executing it (e.g., the runtime context in which it applies). When a user installs the associated application, the catalog is stored on disk on the device and cloud and registered with the MCDNN runtime as handling input of a certain type for a certain app. At run time, inputs for classification stream to the device. For each input, the scheduler selects the appropriate variant of all registered models from their catalogs, selects a location for executing them, pages them into memory if necessary, and executes them. Executing models may require routing data between fragments of models that are shared. After executing the model, the classification results (classes) are dispatched to the applications that registered to receive them. Finally, a profiler continuously collects context statistics on input data. The statistics are used occasionally by a specializer running in the background to specialize models to a detected context, or to select model variants specialized for the context.

4.4 Evaluation

We have implemented the MCDNN system end-to-end. We adapt the open source Caffe [70] DNN library for our DNN infrastructure. As our mobile device, we target the NVIDIA Jetson TK1 board [6], which includes the NVIDIA Tegra K1 mobile GPU (with roughly 300 gflops nominal peak performance), a quad-core ARM Cortex-A15 CPU, and 2GB of memory shared between CPU and GPU. The Jetson is a developer-board variant of the NVIDIA Shield tablet [4], running Ubuntu 14.04 instead of Android as the latter does. For the cloud server, we use a Dell PowerEdge T620 with an NVIDIA K20c GPU (with 5GB of dedicated memory and a nominal peak of 3.52 tflops) and a 24-core Intel Xeon E5-2670 processor with 32 GB of memory, running Ubuntu 14.04. Where cloud costs are mentioned, we use Amazon AWS G2.2xlarge single-GPU instances and c4.large CPU instances, with pricing data obtained in early September 2015. All energy measurements mentioned are directly measured unless otherwise specified, using a Jetson board and Shield tablet instrumented with a DC power analyzer [5]. Our main results are:

• Stand-alone optimizations yield 4-10× relative improvements in the memory use of models with little loss of accuracy. In absolute terms, this allows multiple DNNs to fit within mobile/embedded device memory. However, the energy used by these models, though often lower by 2× or more, is high enough that it is almost always more energy-efficient to offload execution when the device is connected to the cloud. Execution latencies on cloud CPU and GPU are also improved by a similar 2× factor, adequate to apply all but the largest models at 1 frame/minute at an annual budget of $10, but not enough to support a realistic 1 frame/second.

• Collective optimizations (specialization and sharing), when applicable, can yield 10× to 4 orders of magnitude reductions in memory consumption and 1.2× to over 100× reductions in execution latency and energy use. These improvements have significant qualitative implications. These models can fit in a fraction of a modern phone's memory, making them suitable both for consumer mobile devices and for memory-constrained embedded devices such as smart cameras. It is often less expensive to execute them locally than to offload over LTE (or sometimes even WiFi). Finally, it is feasible to process 1 frame/second on a cloud-based CPU at $10/year.

• Selecting both which model variant to execute and where to run it dynamically, as MCDNN does, can result in a significant improvement in the number of requests served at a fixed energy and dollar budget, with little loss in accuracy. Essentially, common variations in classification task mix, optimization opportunities (such as specialization and sharing) and cloud-connectivity latencies defeat static model selection and placement schemes.

• MCDNN seems well-matched with emerging applications such as virtual assistants based on wearable video and query-able personal live video streams such as Meerkat [92] or Periscope [125]. We show promising early measurements supporting these two applications.

4.4.1 Evaluating optimizations

We rst show the impact of collective optimizations, specialization and sharing. For specialization, we train and re-target progressively simpler models under increasingly favorable assumption of data skew, starting from assuming that âý% of all classes drawn belong to a group of at most size òÔ (e.g., âý% of person sightings during a period all come from at most the same òÔ people) to Àý%/Þ classes. We test these models on datasets with the same skew (e.g., âý% of faces selected from òÔ —Ô


Figure 4.8: Impact of collective optimization on memory consumption.

The size, execution energy consumed, and latency of the model executed under these assumptions remain fixed; only the accuracy of using the model changes, resulting in horizontal line segments in the figure. We show the mean, min and max under these assumptions. Two observations are key. First, resource usage is dramatically lower than even the statically optimized case. For instance (Figure 4.8; compare with Figure 4.3), specialized object recognition consumes 0.4-0.7mJ versus the stand-alone-optimized 7-16mJ. Second, accuracy is significantly higher than in the unspecialized setting. For instance, the 0.4mJ model has recognition rates of 70-82%, compared to the baseline 69% for the original VGGNet. Latency wins are similar, as shown in Figure 4.10. These models take less energy to run on the device than to transmit to the cloud, and easily support 1 event per second on a $10/year GPU budget, and often on a CPU budget as well.

There is no free lunch here: specialization only works when the incoming data has class skew, i.e., when a few classes dominate over the period of use of the model. Class skew may seem an onerous restriction if the skew needs to last for hours or days. Table 4.2 details the time required to specialize a few representative variants of models, including two models for face recognition and two for object recognition (C4 is a smaller variant of C0, and A9 of A0; A0 is the standard AlexNet model for object recognition).


Figure 4.9: Impact of collective optimization on energy consumption.

Task (variant)   Time to specialize (s)
                 Full re-train   + Retarget   + Pre-forward
Face (C0)        2.6e4           30.4         4.3
Face (C4)        1.4e4           24.0         4.2
Object (A0)      4.8e5           152.4        14.2
Object (A9)      9.1e4           123.0        14.1

Table 4.2: Runtime overhead of specialization.

Re-training these models from scratch takes hours to days. However, MCDNN's optimizations cut this overhead dramatically. If we only seek to re-target the model (i.e., only retrain the top layer), the overhead falls to tens of seconds, and pre-forwarding (see Section 5.2) training data through the lower layers yields roughly 10-second specialization times. MCDNN is thus well positioned to exploit data skew that lasts for as little as tens of seconds to minutes, a capability that we believe dramatically broadens the applicability of specialization.

The benefits of sharing, when applicable, are even more striking. For scene recognition, assuming that the baseline scene recognition model (dataset S in Table 4.1) is running, we share all but its last layer to infer three attributes: we check if the scene is man-made or not (dataset M), whether lighting is natural or artificial (L), and whether the horizon is visible (H).


Figure 4.10: Impact of collective optimization on execution latency.

Similarly, for face identification (D), we consider inferring three related attributes: age (Y), gender (G) and race (R). When sharing is feasible, the resources consumed by shared models are remarkably low: tens of kilobytes per model (cyan and purple dots in Figure 4.8), roughly 100mJ of energy consumption (Figure 4.9) and under 1ms of execution latency (Figure 4.10), representing over 100× savings in these parameters over standard models. Shared models can very easily run on mobile devices. Put another way, inferring attributes is "almost free" given the sharing optimization.

4.4.2 Evaluating the runtime

Given an incoming request to execute a particular model, MCDNN decides which variant of the model to execute and where to execute it. Selecting the right variant is especially important when executing on-device, because the device has a finite-sized memory and paging models into and out of memory can be quite expensive in terms of power and latency, as discussed in the previous section. MCDNN essentially pages in small variants of models and increases their size over time as their relative execution frequency increases.


Figure 4.11: Impact of MCDNN's dynamically-sized caching scheme. Panels (left to right): fraction of requests serviced, energy consumed per request, and average accuracy, each as a function of the cache load rate (MB/request).

To gauge the benefit of this scheme, in Figure 4.11, we use trace-driven evaluation to compare this dynamic scheme to two schemes that page fixed-sized models in LRU fashion in and out of a 600MB cache. In the "Original" scheme, the models used are the unoptimized models. In the "Best" scheme, we pick models from the knees of the curves in Figure 4.2, so as to get the best possible accuracy-to-size tradeoff. In the "All Models" scheme, the MCDNN scheduler picks an appropriate model variant from a catalog of model sizes.

We generate traces of model requests that result in increasing cache load rates with respect to the original model: the load rate is the miss rate weighted by the size of the missed model. We generate 10 traces per load rate. Each trace has roughly 36000 requests (one per second over 10 hours). Due to the energy required to service each request, none of the schemes is able to service all requests.


Figure 4.12: Impact of MCDNN's dynamic execution-location selection (fraction of total requests served vs. the proportion of specialized requests, for transmit energies of 0.1J, 0.3J and 0.5J).

The figure shows the number of requests served, the average energy expended per request, and the average accuracy per served request. The summary is that MCDNN constantly reduced the size of models in the cache so that even when the cache is increasingly oversubscribed with respect to fixed-size models, it still fits all the reduced-size models. As a result, MCDNN incurs minimal paging cost. This results in a significantly higher percentage of requests served and a lower average energy per request (in fact, the average energy remains fixed at the execution cost of the models rather than growing with paging costs), at a modest drop in accuracy. Note that although the original and best models have higher accuracy for the requests that were served, they serve far fewer requests.

We now move to the decision of where to execute. For each request, MCDNN uses dynamically estimated resource availability and system costs to decide where to execute it. Given the relatively low cost of transmitting frames, a good strawman is a "server first" scheme that always executes on the cloud if the cloud budget and connectivity are available, and on the device otherwise. This scheme is often quite good, but fails when transmission costs exceed local execution costs, which may happen

due to the availability of inexpensive specialized or shared models (Figure 4.9), an unexpectedly large client execution budget, and/or high transmission energy costs (e.g., moving to a location far from the cloud and using a less energy-friendly transmission modality such as LTE). We examine this tradeoff in Figure 4.12. Assuming full connectivity, day-long traces, and three different transmission costs (0.1J, 0.3J and 0.5J), we examine the fraction of total requests served as increasing numbers of specialization and sharing opportunities materialize through the day. In the server-first case, all requests are sent to the server, so the number of requests served depends purely on transmission cost. MCDNN, however, executes locally when it is less expensive to do so, thus handling a greater fraction of requests than the server-first configuration, and an increasing fraction of all requests as specialization and sharing opportunities increase. When transmit costs fall to 100mJ per transmission (blue line), however, remote execution is unconditionally better, so the two policies coincide.
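The location decision MCDNN makes for each request ultimately reduces to this kind of cost comparison; the function below is a simplified sketch with invented numbers, not the scheduler's actual policy (which also accounts for split execution and accuracy).

def choose_location(local_exec_J, transmit_J, device_budget_J, cloud_budget_ok):
    # Prefer the cloud only when it is both affordable and cheaper in energy.
    if cloud_budget_ok and transmit_J < local_exec_J:
        return "cloud"
    if device_budget_J >= local_exec_J:
        return "device"
    return "cloud" if cloud_budget_ok else "drop"

# A cheap specialized model (0.5mJ) beats a 0.5J WiFi transmit, so run locally.
print(choose_location(local_exec_J=0.0005, transmit_J=0.5,
                      device_budget_J=2.3, cloud_budget_ok=True))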

4.4.3 MCDNN in applications

To understand the utility of MCDNN in practice, we built rough prototypes of two applications using it. The first, Glimpse, is a system for analyzing wearable camera input so as to provide personal assistance, essentially the scenario in Section 4.1. We connect the camera to an NVIDIA Shield tablet and thence to the cloud. In a production system, we assume a subsystem other than MCDNN (e.g., face detection hardware, or more generally a gating system [52]) that detects interesting frames and passes them on to MCDNN for classification. MCDNN then analyzes these frames on the wearable and the cloud. We currently use software detectors (which are themselves expensive) as a proxy for this detection sub-system. Figure 4.13 shows a trace reflecting MCDNN's behavior on face analysis (identification, age/race/gender recognition) and scene recognition tasks over a day³. The bottom panel shows the connectivity during the day. The straw man uses only the "best" models from the knees of the curves in Figure 4.3 for all tasks. It always sends to the cloud if possible (connected and with a positive cost budget), and otherwise executes on the device. We evaluate the straw man with and without model sharing for face analysis.

³The data fed to MCDNN in this experiment are synthesized based on statistics collected from the real world.


Figure 4.13: Accuracy of each application over time for the Glimpse usage. Each line shows a single application; the rows show MCDNN, the server-first straw man, the server-first straw man with sharing, and connectivity over the day.

Figure 4.13 (second and third rows) shows that the straw man can achieve good accuracy but exhausts its budgets quickly. Model sharing doubles the duration that applications last, but still fails to sustain a full day. MCDNN makes the trade-offs necessary to keep running through the day, and maintains good accuracy over that time.

Our second application, Panorama, is a system that allows text querying of personal live video streams [Àò, Ôò ] in real time. In this case, potentially millions of streams are sent to the cloud, and users can type textual queries (e.g., “man dog play”) to discover streams of interest. e central challenge is to run VGGNet at Ô-ç frames per second at minimal dollar cost. e key observation is that these streams have high class skew just as with wearable camera streams. We therefore aimed to apply specialized versions of VGGNet to a sample of ten Ô-to- -minute long personal video streams downloaded from YouTube; we labeled Ô¥çÀ of çÀýÔò frames with ground truth on objects present. VGGNet by itself takes an average of çâ¥ms per frame to run on a Kòý GPU at an average accuracy of Þ %. MCDNN produces specialized variants of VGGNet for each video and reduces processing ——

overhead to 42ms at an accuracy of 83%. Panorama does not run on the end-user device: MCDNN is also useful in cloud-only execution scenarios.

4.5 Conclusions

We address the problem of executing Deep Neural Networks (DNNs) on resource-constrained devices that are intermittently connected to the cloud. Our solution combines a system for optimizing DNNs that produces a catalog of variants of each model and a run-time that schedules these variants on devices and the cloud so as to maximize accuracy while staying within resource bounds. We provide evidence that our optimizations can provide dramatic improvements in DNN execution efficiency, and that our run-time can sustain this performance in the face of the variation of day-to-day operating conditions.

Chapter 5 Sequential Specialization for Streaming Applications

Previous chapters describe two systems that optimize DNN inference under two serving scenarios. In this chapter, we introduce a general optimization called sequential specialization, targeted at streaming applications. DNNs can classify across many classes and input distributions with high accuracy without retraining, but at relatively high cost, so applying them to classify video is costly. However, we show that in fact day-to-day video exhibits highly skewed class distributions over the short term, and that these distributions can be classified by much simpler models. We formulate the problem of detecting the short-term skews online and exploiting models based on them as a new sequential decision-making problem dubbed the Oracle Bandit Problem, and present a new algorithm to solve it. When applied to recognizing faces in TV shows and movies, we realize end-to-end classification speedups of 2.4-7.8× on a GPU (2.6-11.2× on a CPU) relative to a state-of-the-art convolutional neural network, at competitive accuracy.

5.1 Class skew in day-to-day video

Specialization depends on skew (or bias) in the temporal distribution of classes presented to the classifier. In this section, we analyze the skew in videos of day-to-day life culled from YouTube. We assembled a set of 30 videos of length 3 minutes to 20 minutes from five classes of daily activities: socializing, home repair, biking around urban areas, cooking, and home tours. We expect this kind of footage to come from a variety of sources such as movies, amateur productions of the kind that dominate YouTube, and wearable videos. We sample one in three frames uniformly from these videos and apply state-of-the-art face (derived from [100]), scene [143] and object recognizers [117] to every sampled frame.

Figure 5.1: Temporal skew of classes in day-to-day video. Panels: (a) 10 s segments, objects; (b) 1 min segments, objects; (c) 3 min segments, objects; (d) 1 min segments, scenes. Each panel plots, for skews of 60%, 70%, 80% and 90%, the cumulative fraction of segments versus the number of labels.

Note that these "oracle" recognizers can recognize up to 2622 faces, 205 scenes and 1000 objects, respectively. For face recognition, we record the top-scoring label for each face detected, and for the others, we record only the top-scoring class on each frame. For object recognition in particular, this substantially undercounts objects in the scene; our count (and specialization) applies to applications that identify some distinctive subset of objects (e.g., all objects "handled" by a person). We seek to compare these numbers to the number of distinct recognized faces, scenes and objects that dominate "epochs" of τ = 10 seconds, 1 minute and 3 minutes.

Figure 5.1 shows the results for object recognition and scene recognition. We partition the sequence of frames into segments of length τ and show one plot per segment length. Each line in a plot corresponds to a percentage skew s ∈ {60, 70, 80, 90} and shows the cumulative distribution representing the fraction of all segments where n labels comprised more than s percent of all labels in the segment. For instance, for 10-second segments (Figure 5.1(a)), typically roughly 100 frames, 5 objects comprised 90% of all objects in a segment 60% of the time (cyan line), whereas they comprised 60% of objects 90% of the time (dark blue). In practice, detecting skews and training models to exploit them within 10 seconds is often challenging. As figures (b) and (c) show, the skew is less pronounced, albeit still very significant, for longer segments. For instance, in 90% of 3-minute segments, the top 15 objects comprise 90% of objects seen. The trend is similar with faces and scenes, with the skew significantly more pronounced, as is apparent from comparing figures (b) and (d); e.g., the cyan line in (d) dominates that in (b). We expect that if we ran a hand detector and only recognized objects in the hand (analogously to recognizing detected faces), the skew would be much sharper.

Specialized models must exploit skews such as these to deliver appreciable speedups over the oracle. Typically, they should be generated in much less than a minute, handle varying amounts of skew gracefully, and deliver substantial speedups when inputs belong to subsets of 20 classes or fewer out of a possible several hundred in the oracle.
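The skew statistics above can be computed directly from the per-frame top-scoring labels. The sketch below shows one way to do so under simplifying assumptions (fixed-length, non-overlapping segments; labels already extracted); the helper names are ours, not part of the measurement code used in this chapter.

# Sketch of the skew measurement behind Figure 5.1 (assumed, simplified): for
# each fixed-length segment of per-frame labels, find the smallest number of
# labels needed to cover at least a fraction `skew` of the frames, then report
# the cumulative fraction of segments covered by n labels.
from collections import Counter

def labels_needed_for_skew(segment_labels, skew):
    """Smallest n such that the n most frequent labels cover >= skew of the segment."""
    counts = sorted(Counter(segment_labels).values(), reverse=True)
    total, covered = sum(counts), 0
    for n, c in enumerate(counts, start=1):
        covered += c
        if covered >= skew * total:
            return n
    return len(counts)

def cdf_over_segments(label_stream, segment_len, skew):
    """Fraction of segments whose top-n labels reach the skew, for n = 1, 2, ..."""
    segments = [label_stream[i:i + segment_len]
                for i in range(0, len(label_stream) - segment_len + 1, segment_len)]
    needed = [labels_needed_for_skew(seg, skew) for seg in segments]
    return {n: sum(m <= n for m in needed) / len(needed) for n in range(1, max(needed) + 1)}

# Toy example: a highly skewed stream of per-frame top-1 labels.
stream = ["dog"] * 90 + ["cat"] * 7 + ["car", "tree", "cup"]
print(cdf_over_segments(stream, segment_len=50, skew=0.9))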

5.2 Specializing Models

In order to exploit skews in the input, we cascade the expensive but comprehensive oracle model with a (hopefully much) less expensive "compact" model. This cascaded classifier is designed so that if its input belongs to the frequent classes in the incoming distribution, it returns early with the classification result of the compact model; otherwise it invokes the oracle model. Thus, if the skew dictates that n frequent classes, or dominant classes, comprise percentage p of the input, or skew, model execution will cost the overhead of just executing the compact model roughly p% of the time, and the overhead of executing the compact model and the oracle sequentially the rest of the time. When p is large, the lower cost of the compact model will be incurred with high probability.

To be more concrete, we use state-of-the-art convolutional neural networks (CNNs) as oracles. In particular, we use GoogLeNet [121] as our oracle model for object recognition, the 16-layer version of VGGNet for scene recognition [143], and the VGGFace network [100] for face recognition.

Task (classes)   Model             Accuracy (%)   FLOPs    CPU latency (ms)    GPU latency (ms)
Object (1000)    GoogLeNet [121]   68.9           3.17G    779.3               11.0
                 O1                48.9           0.82G    218.2 (×3.6)        4.4 (×2.5)
                 O2                47.0           0.43G    109.1 (×7.1)        2.8 (×3.9)
Scene (205)      VGG [144]         58.1           30.9G    2570                28.8
                 S1                48.9           0.5G     152.2 (×16.9)       3.36 (×8.6)
                 S2                40.8           0.43G    141.5 (×18.2)       2.44 (×11.8)
Face (2622)      VGG-Face [100]    95.8           30.9G    2576                28.8
                 F1                84.8           0.60G    90.1 (×28.6)        2.48 (×11.6)
                 F2                80.9           0.13G    40.4 (×63.7)        1.93 (×14.9)

Table 5.1: Oracle classifiers versus compact classifiers in top-1 accuracy, number of FLOPs, and execution time. Execution time is feedforward time of a single image without batching on Caffe [70], on a Linux server with a 24-core Intel Xeon E5-2620 and an NVIDIA K20c GPU.

The compact models are also CNNs. For these, we use architectures derived from the corresponding oracles by systematically (but manually) removing layers, decreasing kernel sizes, increasing kernel strides, and reducing the size of fully-connected layers. The end results are architectures (O[1-2] for objects, S[1-2] for scenes and F[1-2] for faces) that use noticeably fewer resources (Table 5.1), but also yield significantly lower average accuracy when trained and validated on unskewed data, i.e., the same training and validation sets as the oracle models. For instance, O1 requires roughly 4× fewer FLOPs to execute than GoogLeNet, but achieves roughly 70% of its accuracy. Detailed model architecture definitions of the compact models can be found in Appendix B.

However, in our approach, we train these compact models to classify the skewed distributions observed during execution (yielding what we call a specialized classifier), and their performance on skewed distributions is the critical measure. In particular, to generate a specialized model, we create a new training dataset with the data from the n dominant classes of the original data, and a randomly chosen subset from the remaining classes with label "other", such that the dominant classes comprise p percent of the new dataset. We train the compact architecture with this new dataset.
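A minimal sketch of this dataset construction is shown below, assuming labels are already available and using an illustrative build_specialized_dataset helper; the actual training pipeline in this chapter is built on Caffe.

# Minimal sketch (assumed helper, not the dissertation's training code) of how a
# specialization dataset can be assembled: keep the n dominant classes as-is and
# fold a random sample of the remaining classes into a single "other" label so
# that the dominant classes make up fraction p of the new set.
import random

OTHER = "other"

def build_specialized_dataset(samples, dominant, p, seed=0):
    """samples: list of (example, label); dominant: set of labels; p: target skew."""
    rng = random.Random(seed)
    dom = [(x, y) for x, y in samples if y in dominant]
    rest = [(x, OTHER) for x, y in samples if y not in dominant]
    # Number of "other" examples needed so that dominant examples are p of the total.
    n_other = min(len(rest), int(round(len(dom) * (1 - p) / p)))
    return dom + rng.sample(rest, n_other)

# Toy usage: 10 dominant-class examples, 60% training skew.
data = [(i, "a") for i in range(5)] + [(i, "b") for i in range(5)] + \
       [(i, c) for i, c in enumerate(["c", "d", "e", "f", "g", "h", "i", "j"])]
spec = build_specialized_dataset(data, dominant={"a", "b"}, p=0.6)
print(sum(y != OTHER for _, y in spec), "dominant vs", sum(y == OTHER for _, y in spec), "other")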

Figure 5.2 shows how compact models trained on skewed data and cascaded with their oracles perform on validation data of different skews.

Figure 5.2: Accuracy of the cascaded classifier when compact model O1 is trained with various training skews and cascaded with the oracle, tested with various skews, for 10 dominant classes.

We first analyze the case where n = 10, for various combinations of training and validation skews for model O1. Recall from Table 5.1 that O1 delivers only 70% of the oracle's accuracy on unskewed inputs. However, when training and testing are on skewed inputs, the numbers are much more favorable. When O1 is trained on p=90% skewed data with n=10 dominant classes, it delivers over 84% accuracy on average (the left-most dark-blue bar). This is significantly higher than the oracle's average of 68.9% (top-1 accuracy), denoted by the horizontal black line. We also observe from Figure 5.2 that when O1 is trained on 60% skewed data, the cascaded classifier maintains high accuracy across a wide range of testing skews from 90% to 50%. Therefore, in what follows, we use 60% as the fixed training skew to specialize object compact models in the rest of this chapter (similarly, a fixed 70% skew for scene and 50% for face). Figure 5.3 shows that, when n is varied for O1, the cascaded classifier degrades very gracefully with n. Finally, Figure 5.4, which reports similar measurements on compact models S[1-2] and F[1-2], shows that these trends carry over to scene and face recognition.

Figure 5.3: Accuracy of O1 trained with 60% skew and tested with various skews, for different numbers of dominant classes. The dashed line shows the accuracy of GoogLeNet as a baseline. All experiments are repeated 5× with randomly selected dominant sets.

Figure 5.4: Accuracy of scene classifiers trained with a 70% fixed skew and 10 dominant classes, and of face classifiers trained with a 50% fixed skew and 10 dominant classes. Dashed lines show the accuracy of the oracle classifier for the scene and face recognition tasks.

Finally, we note that since skews are only evident at test time, specialized models must be trained extremely fast (ideally a few seconds at most). We use two techniques to accomplish this. First, before we begin processing any inputs, we train all model architectures on the full, unskewed datasets of their oracles. At test time, when the skew (n, p) and the identity of the dominant classes are available, we retrain only the top (fully connected and softmax) layers of the compact model. The lower layers, being "feature calculation" layers, do not need to change with skew. Second, as a pre-processing step, we run all inputs in the training dataset through the lower feature-calculation layers, so that when re-training the top layers at test time we can avoid doing so. This combination of techniques allows us to re-train the specialized model in roughly 4s for F1 and F2 and 14s for O1/O2, many orders of magnitude faster than fully re-training these models.
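The sketch below illustrates the second technique under stated assumptions: features from the frozen lower layers are cached once, and only a small softmax "top" is refit on them at test time. It uses plain NumPy for clarity and is not the Caffe-based implementation used here; all names and shapes are illustrative.

# Sketch of fast specialization: treat the compact model's lower layers as a
# fixed feature extractor whose outputs are cached offline, so that at test
# time only a small softmax top is refit on those cached features.
import numpy as np

def train_top_softmax(features, labels, n_classes, lr=0.1, epochs=200, seed=0):
    """Fit a single softmax layer on pre-computed features (rows) and integer labels."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = 0.01 * rng.standard_normal((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                       # cross-entropy gradient
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

# Offline: cache features once. At test time: refit the top layer in seconds.
cached_feats = np.random.default_rng(1).standard_normal((500, 64))   # stand-in features
labels = np.random.default_rng(2).integers(0, 11, size=500)          # 10 dominant + "other"
W, b = train_top_softmax(cached_feats, labels, n_classes=11)
print("top-layer weights:", W.shape)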

5.3 Sequential Model Specialization

5.3.1 The Oracle Bandit Problem (OBP)

Let x_1, x_2, ..., x_i, ... ∈ X = R^n be a stream of images to be classified. Let y_1, y_2, ..., y_i, ... ∈ Y = [1, ..., k] be the corresponding classification results. Let π : I⁺ → I⁺ be a partition over the stream. Associate the distribution T_j with partition j, so that each pair (x_i, y_i) is sampled from T_π(i). Intuitively, π partitions, or segments, ..., x_i, ... into a sequence of "epochs", where elements from each epoch j are drawn independently from the corresponding stationary distribution T_j. Thus, for the overall series, samples are drawn from an abruptly-changing, piece-wise stationary distribution. At test time, neither the results y_i nor the partition π is known.

Let h* : X → Y be a classifier, designated the "oracle" classifier, trained on distribution T*. Intuitively, T* is a mixture of all distributions comprising the oracle's input stream: T* = Σ_j T_j. Let R* be the cost (e.g., number of cycles), assumed invariant across x ∈ X, needed to execute h* on any x ∈ X. At test time, on each input x_i, we can always consult h* at cost R* to get a label y_i with some (high) accuracy a*.

Let m_1, ..., m_M be model architectures, such as those of O1, O2, S1, S2, F1 and F2 in Table 5.1. Suppose each architecture m_k is trained offline on T* to obtain a "template" classifier h_k. We assume that re-targeting template h_k to a new size-j set of dominant classes has a flat cost R_0.

Finally, for each set of dominant classes D, the corresponding specialized classifier h_D is trained by re-targeting some template h_k, using a dataset that draws half its examples from classes in D and the rest (with a single label "other") from Y − D. Let the cost of executing h_D be R_{h_D}. Chaining h_D with h* gives a cascaded classifier ĥ_D(x) ≜ "if y = h_D(x) ∈ D, return y; otherwise return h*(x)". Note that executing ĥ_D will either cost R_{h_D} (in the case that the condition is true) or R_{h_D} + R* (in the case that it is false). Given that R_{h_D} ≪ R*, developing and using specialized classifiers h_D can thus reduce costs significantly. Since x_i is drawn from some distribution T, each classifier ĥ_D also has a cost that belongs to a corresponding distribution, which we write as R_{Tĥ_D}(x_i).

Now consider a policy (or algorithm) P that, for each incoming image x_i belonging to stationary distribution T_π(i) as above, selects a classifier ĥ_D^(i) (for some choice of the set D) and applies it to x_i. The classifier selected could also be the oracle. The expected total cost of this policy is

    R_P = |∪_i {h_D^(i)}| · R_0 + Σ_i E_{x_i ∼ T_π(i)} [ R_{T_π(i) ĥ_D^(i)}(x_i) ].

We seek a minimal-cost policy P = argmin_P R_P that maintains average accuracy within a threshold τ_a of the oracle accuracy a*.
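A minimal sketch of the cascaded classifier ĥ_D follows; the model objects, dominant set, and costs are placeholders chosen for illustration rather than the system's actual interfaces.

# Minimal sketch of the cascade ĥ_D from this section: run the specialized
# compact model first; if it predicts one of the dominant classes D, return that
# answer and the cheap cost, otherwise fall back to the oracle and pay both costs.

def cascaded_classify(x, compact, oracle, dominant, r_compact, r_oracle):
    """Return (label, cost, used_oracle) for one input under the cascade."""
    y = compact(x)
    if y in dominant:                 # early exit: compact answer trusted
        return y, r_compact, False
    y = oracle(x)                     # cascade: consult the expensive oracle
    return y, r_compact + r_oracle, True

# Toy usage with stand-in "models".
compact = lambda x: "alice" if x % 2 == 0 else "other"
oracle = lambda x: "alice" if x % 2 == 0 else "bob"
print(cascaded_classify(4, compact, oracle, {"alice"}, r_compact=2, r_oracle=28))
print(cascaded_classify(3, compact, oracle, {"alice"}, r_compact=2, r_oracle=28))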

5.3.2 The Windowed ε-Greedy (WEG) Algorithm

A close look at the policy cost above provides some useful intuition on what good policies should do.

First, given the high fixed cost R_0 of re-targeting models, as opposed to just running them, re-targeting should be infrequent. We expect re-targeting to occur roughly once an epoch. Second, the cost of running the cascade is much lower than that of running the oracle if the input x_i is in the dominant class set D, and higher otherwise. It is therefore important to identify promptly when a dominant set D exists, produce a specialized model h_D that does not lose too much accuracy, and revert back to the oracle model when the underlying distribution changes and D is no longer dominant. We provide a heuristic exploration-exploitation based algorithm (Algorithm 5.1) based on these intuitions. The algorithm runs in three phases.

Algorithm 5.1 Windowed ε-Greedy (WEG)
 1: j, S_0 ← 1, []                         ▷ Note: τ_r, τ_a, τ_FP and ε below are hyper-parameters.
    ▷ Window Initialization Phase
 2: Repeat w_min times:
 3:   y_t ← h*(x_t)
 4:   S_j ← S_j ⊕ [y_t]                    ▷ append new sample
 5: if ∥DomClasses(S_{j−1}), DomClasses(S_j)∥ ≤ τ_r then   ▷ dominant classes match sufficiently; old epoch continues
 6:   S_j ← S_{j−1} ⊕ S_j
 7:   w ← |S_j| and go to Line 8
    ▷ Template Selection Phase
 8: D ← DomClasses(last w elements in S_j)
 9: Estimate acc. a_{ĥ_D} of ĥ_D; use p* derived from S_j (Equation 5.1)
10: if a_{ĥ_D} ≥ a* + τ_a then
11:   train specialized classifier h_D on dominant classes D
12:   go to Line 16                        ▷ exploit cascaded classifier ĥ_D
13: y_t ← h*(x_t)                          ▷ else, continue exploring with the oracle
14: S_j ← S_j ⊕ [y_t]
15: go to Line 8
    ▷ Specialized Classification Phase
16: n_c, n*, S ← 0, 0, S_j
17: y_t, c ← ĥ_D(x_t)                      ▷ exploit; c = 0/1 if/if-not cascaded to the oracle
18: n* ← (c ∧ rand() < ε) ? n* + [h*(x_t) ≠ y_t] : n*   ▷ with probability ε, audit a non-cascaded answer against the oracle
19: n_c ← n_c + c                          ▷ increment if ĥ_D did not use the oracle
20: Estimate acc. a_{ĥ_D} of ĥ_D; use p* derived from S (Equation 5.1)
21: if a_{ĥ_D} < a* + τ_a or n*/(n_c · ε) ≥ τ_FP then   ▷ exit specialized classification
22:   j ← j + 1                            ▷ potentially start a new epoch j
23:   go to Line 2                         ▷ go back to check whether the distribution has changed
24: else
25:   S ← S ⊕ [y_t]; go to Line 17

1. Window Initialization [lines 2–7] identifies the dominant classes of the current epoch. To do so, we run the oracle on a fixed number w_min (= 30 in our implementation) of examples. The DomClasses helper identifies the dominant classes in the window as those that appear at least k (= 2 to 3) times in the window (a minimal sketch of this helper appears after the phase descriptions). Appendix C provides a simple analysis of how we determine the threshold k for a given window size. If the dominant classes are each within τ_r (= 2) of those of the previous epoch, we conclude that the previous epoch is continuing and fold the information collected on it into that for the current epoch S_j.

2. Template Selection [lines 8–15]. Given a candidate set of dominant classes D, we estimate (Equation 5.1 below details precisely how) the accuracy of the cascaded classifier ĥ_D for various template classifiers h_i when specialized to D and their current empirical probability skew p* derived from the measured data S_j. Estimating these costs instead of explicitly training the corresponding specialized classifiers h_D is significantly cheaper. If the estimate is within a threshold τ_a (= 0.05 for object and scene recognition, and -0.05 for face recognition, since the accuracy of the oracle is higher) of the oracle, we produce the specialized model and go to the specialized classification phase. If not, we continue running the oracle on inputs and collecting more information on the incoming class skew.

3. Specialized Classification [lines 16–25]. The specialized classification phase simply applies the current cascaded model ĥ_D to inputs (Line 17) until it determines that the distribution it was trained on (as represented by D) does not adequately match the actual current distribution. This determination is non-trivial because, in the specialization phase, we wish to avoid consulting the oracle in order to reduce costs. However, the oracle is (assumed to be) the only unbiased source of samples from the actual current distribution. We therefore run the oracle in addition to the cascaded model with probability ε (= 0.01), as per the standard ε-greedy policy for multi-armed bandits. Given the resulting empirical estimate p* of the skew, we can again estimate the accuracy of the current cascade ĥ_D as per Equation 5.1. If the estimated accuracy of the cascade is too low, or if the classification results of the ĥ_D cascade differ from the oracle's too often (we use a threshold τ_FP = 0.5), we assume that the underlying distribution may have shifted, and return to the Window Initialization phase.
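As referenced in the Window Initialization description, the following is a minimal, assumed sketch of the DomClasses helper and the epoch-continuation check; the symmetric-difference test is our simplification of the ∥·,·∥ ≤ τ_r condition in Algorithm 5.1.

# Minimal sketch (assumed, simplified) of DomClasses: labels that appear at
# least k times in the current window of oracle labels are treated as dominant,
# and two consecutive windows are considered to belong to the same epoch if
# their dominant sets differ by at most tau_r labels.
from collections import Counter

def dom_classes(window, k=2):
    """Set of labels appearing at least k times in the window of oracle labels."""
    return {label for label, c in Counter(window).items() if c >= k}

def same_epoch(prev_window, cur_window, tau_r=2, k=2):
    """True if the dominant sets of two windows differ by at most tau_r labels."""
    diff = dom_classes(prev_window, k) ^ dom_classes(cur_window, k)  # symmetric difference
    return len(diff) <= tau_r

prev = ["alice"] * 12 + ["bob"] * 10 + ["carol"] * 8
cur = ["alice"] * 15 + ["bob"] * 9 + ["dave"] * 6
print(dom_classes(cur), same_epoch(prev, cur))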

Finally, we focus on estimating the expected accuracy of the cascaded classifier given the current skew p of its inputs (i.e., the fraction of its inputs that belong to the dominant class set). The accuracy of the cascaded classifier ĥ_D can be estimated by:

    a_{ĥ_D} = p · a_in + p · e_{in→out} · a* + (1 − p) · a_out · a*        (5.1)

where a_in is the accuracy of the specialized classifier h_D on the n dominant classes, e_{in→out} is the fraction of dominant inputs that h_D classifies as non-dominant, and a_out is the fraction of non-dominant inputs that h_D classifies as non-dominant (note that these inputs will be cascaded to the oracle). We have observed previously (Section 5.2) that these parameters a_in, e_{in→out}, a_out of the specialized classifier h_D are mainly affected by the size of the dominant class set D, not the identity of the elements in it. Thus, we pre-compute these parameters for a fixed set of values of n (averaging over 10 samples of D for each n), and use linear interpolation for other n at test time.
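Equation 5.1, together with the interpolation of the pre-computed parameters, can be implemented in a few lines. The sketch below uses made-up profile numbers purely for illustration; only the formula itself comes from the text.

# Sketch of the accuracy estimate in Equation 5.1, with the pre-computed
# per-n parameters (a_in, e_in_out, a_out) linearly interpolated for sizes of D
# that were not profiled. The profile numbers below are illustrative only.
import numpy as np

# Assumed offline profile: dominant-set size -> (a_in, e_in_out, a_out).
PROFILE = {5: (0.88, 0.05, 0.90), 10: (0.84, 0.07, 0.88), 20: (0.78, 0.10, 0.85)}

def params_for(n):
    """Linearly interpolate (a_in, e_in_out, a_out) for a dominant-set size n."""
    sizes = sorted(PROFILE)
    cols = list(zip(*(PROFILE[s] for s in sizes)))
    return tuple(np.interp(n, sizes, col) for col in cols)

def estimate_cascade_accuracy(p, n, a_star):
    """Equation 5.1: expected accuracy of the cascade at skew p with |D| = n."""
    a_in, e_in_out, a_out = params_for(n)
    return p * a_in + p * e_in_out * a_star + (1 - p) * a_out * a_star

print(round(estimate_cascade_accuracy(p=0.8, n=12, a_star=0.689), 3))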

5.4 Evaluation

We implemented the WEG algorithm with a classification runtime based on Caffe [70]. The system can be fed videos and produces classification results by recognizing frames. Our goal was to measure both how well the large specialized-model speedups of Table 5.1 translate to speedups in diverse settings and on long, real videos. Further, we wished to characterize the extent to which elements of our design contribute to these speedups.

5.4.1 Synthetic experiments

First, we evaluate our system with synthetically generated data in order to study diverse settings. For this experiment, we generate a time-series of images picked from the standard large validation sets of the CNNs we use. Each test set comprises one or two segments, where a segment is defined by the number of dominant classes, the skew, and the duration in minutes. For each segment, we assume that images appear at a fixed interval (1/6 seconds) and that each image is picked from the testing set based on the skew of the segment. For example, for a segment with 5 dominant classes and 90% skew, we pre-select 5 classes as dominant classes and, at each image arrival over the 5-minute duration, pick an image from the dominant classes with 90% probability and from the other classes with 10% probability. Images within a class are picked uniformly at random. We also generate traces with two consecutive segments with different configurations to study the effect of moving from one context to another.
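A sketch of this trace-generation procedure appears below; it is illustrative rather than the exact evaluation harness, and the function name and defaults are ours.

# Sketch (illustrative) of how a synthetic segment can be generated: pre-select
# n dominant classes and, at each image arrival, draw from the dominant set with
# probability equal to the skew and from the remaining classes otherwise.
import random

def synthesize_segment(all_classes, n_dominant, skew, duration_s, interval_s=1 / 6, seed=0):
    """Return (labels, dominant): one class label per simulated image arrival."""
    rng = random.Random(seed)
    dominant = rng.sample(all_classes, n_dominant)
    others = [c for c in all_classes if c not in dominant]
    n_images = int(duration_s / interval_s)
    labels = [rng.choice(dominant) if rng.random() < skew else rng.choice(others)
              for _ in range(n_images)]
    return labels, dominant

trace, dom = synthesize_segment(list(range(1000)), n_dominant=5, skew=0.9, duration_s=300)
print(len(trace), "images; fraction from dominant classes ~",
      round(sum(t in dom for t in trace) / len(trace), 2))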

                       Object                                Scene
Segments           disabled          enabled            disabled          enabled
                 acc(%)  lat(ms)   acc(%)  lat(ms)    acc(%)  lat(ms)   acc(%)  lat(ms)
(n=5, p=.8)       69.5    11.6      77.0     6.0       57.6    28.9      65.2    12.0
(n=10, p=.8)      66.7    11.7      72.5     7.4       57.2    28.9      57.8    18.6
(n=10, p=.9)      71.8    11.6      78.0     5.9       59.1    28.8      63.5    15.4
(n=15, p=.9)      68.7    11.6      68.9     9.1       57.8    28.8      57.2    22.6
(random)          68.1    12.1      68.1    11.5       59.1    28.9      59.1    28.8
(n=10, p=.9)
 + (random)       67.9    11.8      70.2     9.1       57.0    28.8      56.0    22.6
(n=15, p=.9)
 + (n=5, p=.8)    70.6    11.6      73.9     7.8       61.1    28.7      63.0    17.1

                        Face
Segments           disabled          enabled
                 acc(%)  lat(ms)   acc(%)  lat(ms)
(n=5, p=.8)       95.2    28.7      95.1     9.2
(n=5, p=.9)       97.0    28.6      96.2     6.7
(n=10, p=.9)      95.4    28.5      94.3    11.0
(random)          95.9    28.5      95.9    28.5
(n=5, p=.9)
 + (random)       96.2    28.5      96.2    17.6
(n=10, p=.9)
 + (n=10, p=.8)   95.8    28.5      95.2    10.4

Table 5.2: Average accuracy and GPU latency of recognition using the WEG algorithm over segments of synthetic traces. For the segments column, each parenthesized entry indicates a 5-minute segment with the given number of dominant classes and skew.

Table 5.2 shows the average top-1 accuracies and per-image processing latencies using the GPU for the recognition tasks with and without the specializer enabled. The results are averaged over 5 iterations for each experiment. The specializer was configured to use the compact classifiers O2 for objects, S2 for scenes, and F2 for face recognition from Table 5.1. The following points are worth noting. (i) (Row 1 and its sub-rows) WEG is able to detect and exploit skews over 5-minute intervals and get significant speedups over the oracle while preserving accuracy. For the single-segment cases, the GPU latency speedup per image was 1.3× to 2.0×, 1.3× to 2.4×, and 2.6× to 4.3×, for object, scene, and face recognition, respectively. However, due to WEG's overhead, these numbers are noticeably lower than the raw speedups of the specialized models (Table 5.1). When the number of dominant classes increases, the specializer latency increases because it alternates between exploration and exploitation to recognize more dominant classes. The latency also increases when the skew of the dominant classes decreases, because the specializer cascades to the oracle model more often when using the cascaded classifier. (ii) (Row 2) WEG is quite stable in handling random inputs, essentially resorting to the oracle so that accuracy and latency are unchanged. (iii) (Rows 3 and 4) WEG is able to detect abrupt input distribution changes, as the accuracy remains comparable to the oracle's, but with significant speedups when the distribution is skewed (Row 4).

To further understand the limits of the WEG algorithm, we studied how frequently class distributions can change before our technique stops showing benefit. We evaluated face recognition with synthetic traces, changing the distribution every 10/20/30 sec. The WEG algorithm then yields speedups of 0.95×/1.19×/1.48× with roughly unchanged accuracy. For face recognition, therefore, WEG stops gaining benefit for distributions that last less than 20 secs.

5.4.2 Video experiments

We now turn to evaluating WEG on real videos. However, we were unable to find a suitable existing dataset to show off specialization. We need several minutes or more of video that contains small subsets of (the oracle model's) classes that may change over time. In videos of real-world activity, this happens naturally; in popular benchmarks, not so much. For example, the videos in YouTube Faces [129] are short (average 6 sec, max 200 sec) and typically contain only one person. Similarly, clips in UCF-101 [118] have a mean length of 7.2 sec (max 71 sec) and focus on classifying actions for which no oracle model exists. Finally, the Sports-1M dataset [73] assigns labels per video instead of per frame.

Video                     length   oracle                        WEG
                          (min)    acc(%)  CPU lat  GPU lat      acc(%)  CPU lat        GPU lat
Friends                     24      93.2    2576     28.97        93.5     538 (×4.8)    7.0 (×4.1)
Good Will Hunting           14      97.6    2576     28.84        95.1     231 (×11.2)   3.7 (×7.8)
Ellen Show                  11      98.6    2576     29.26        94.6     325 (×7.9)    4.7 (×6.2)
The Departed                 9      93.9    2576     29.18        93.5     508 (×5.1)    6.9 (×4.2)
Ocean's Eleven / Twelve      6      97.9    2576     28.97        96.0    1009 (×2.6)   12.3 (×2.4)

Table 5.3: Accuracy and average processing latency per frame on videos with the oracle vs. WEG (latencies are shown in ms).

Video                     special   cascade   trans.    dom.   window
                          rate(%)   rate(%)   special   size   size
Friends                     88.0      7.5       15       2.8    41.8
Good Will Hunting           95.9      3.4        4       3.5    37.5
Ellen Show                  93.7      4.8       19       1.7    47.4
The Departed                92.0     10.3        9       2.4    40.0
Ocean's Eleven / Twelve     80.1     18.0       23       2.0    52.2

Table 5.4: Key statistics from WEG usage. (a) "special rate" is the percentage of time that the specializer uses the cascaded classifier; (b) "cascade rate" is the percentage of time that the cascaded classifier cascades to the oracle classifier; (c) "trans. special" is the number of transitions to the specialized classification phase; (d) "dom. size" is the average dominant-classes size; (e) "window size" is the average window size.

As a consequence, we collected video clips from three movies, one TV show, and an interview, and manually labeled the faces in the videos¹. The names of the video clips, with their lengths, are listed in Table 5.3. Note that we used the entire videos for Friends and Ellen Show, while we used a video chunk for the movies. For these experiments, we used F2 as the compact classifier. Table 5.3 shows the average accuracies and average latencies for processing a frame of the videos. We generated these by first extracting all faces from the videos to disk using the Viola-Jones detector. We then ran WEG on these faces and measured the total execution time. Dividing by the number of faces gave the average numbers shown here. The most important point is that even on real-world videos, WEG is able to achieve very significant speedups over the oracle, ranging from 2.6×-11.2× (CPU) and 2.4×-7.8× (GPU).

¹The dataset is released at http://syslab.cs.washington.edu/research/neurosys.

To understand the speedup, we summarize the statistics of WEG execution in Table 5.4. "Special rate" indicates the percentage of time that the specializer exploits the cascaded classifier to reduce latency, while "cascade rate" reveals the percentage of time that the cascaded classifier cascades to the oracle classifier, thus hurting performance. A higher special rate and a lower cascade rate yield more speedup. The cascade rate of "Ocean's Eleven" is significantly higher than that of the other videos. We investigated this and found that the specialized compact CNN repeatedly made mistakes on one person in the video, which led to a high cascade rate. "Trans. special" counts the number of times WEG needed to switch between specialized and unspecialized classification to handle distribution changes and insufficient exploration. The average dominant-classes sizes ("dom. size") show that the real videos are skewed to fewer dominant classes than the configurations used in the synthetic experiments. This explains why our system achieved higher speedup on real videos than on synthetic data. Overall, the statistics show that the dataset exercises WEG features such as skew estimation, cascading and specialization.

To understand better the utility of WEG's features, we performed an ablation study: (a) We disable the adaptive window exploration (Lines 5–7 in Algorithm 5.1), and use a fixed window size of 30 and 60. (b) We use the skew of the dominant classes in the input distribution as the training skew for specializing compact CNNs, instead of the fixed (50%) training skew suggested in Section 5.2. (c) We apply a simple (but natural) criterion to exit from the specialized classification phase: WEG now exits when the current skew is lower than the skew when it entered the specialized classification phase, instead of using the estimated accuracy as a soft indicator.

Figure 5.5: Change in accuracy (absolute difference) and speedup (relative) when individual features are disabled (variants: win 30, win 60, dyn. train skew, simple exit; metrics: accuracy, CPU speedup, GPU speedup).

Figure 5.5 shows the comparison between these variants and the WEG algorithm in accuracy and CPU/GPU speedups when recognizing faces in the Friends video. In the figure we show the absolute differences in accuracy and relative differences in CPU/GPU speedup. (a) The fixed window size (30 and 60) variants achieve similar accuracy but lower speedup. As Table 5.4 ("window size" column) shows, the adaptively estimated window size is between 30 and 60. In general, too small a window fails to capture the full set of dominant classes, yielding specializers that exit prematurely; too large a window requires more work by the oracle to fill up the window. (b) Using variable rather than fixed skew for training achieves more speedup, but suffers a 30% loss in accuracy. This is because the training skew is usually very high and, as discussed in Section 5.2, training on highly skewed data produces models vulnerable to false positives on "other" classes. (c) The simple-exit variant achieves almost comparable accuracy, while its latency is more than 50% higher than our system's. This demonstrates the value of our accuracy estimate in modeling the accuracy of cascaded classifiers and preventing premature exit from the specialized classification phase. In summary, the key design elements of WEG each have a role in producing fast and accurate results.

5.5 Conclusion

We characterize class skew in day-to-day video and show that the distribution of classes is often strongly skewed toward a small number of classes that may vary over the life of the video. We further show that skewed distributions are well classified by much simpler (and faster) convolutional neural networks than the large "oracle" models necessary for classifying uniform distributions over many classes. This suggests the possibility of detecting skews at runtime and exploiting them using dynamically trained models. We formulate this sequential model selection problem as the Oracle Bandit Problem and provide a heuristic exploration/exploitation based algorithm, Windowed ε-Greedy (WEG). Our solution speeds up face recognition on TV episodes and movies by 2.4-7.8× on a GPU (2.6-11.2× on a CPU) with little loss in accuracy relative to a modern convolutional neural network.

Chapter 6 Conclusion

As deep learning revolutionizes many areas in machine learning, DNN-based applications are being applied to broader real-world scenarios and are becoming an important workload for both cloud and edge computing. However, it is challenging to design a serving system that satisfies various constraints such as latency, energy, and cost, while achieving high efficiency.

This dissertation identified three major challenges: scheduling and resource allocation, resource management, and model selection. To address these challenges, we investigated the problem through a series of thorough studies of DNN models and operators, accelerator performance profiles, accuracy and resource trade-offs of various model optimization methods, and distribution patterns of DNN workloads. Based on these, I explored the design of DNN serving systems ranging from a cloud serving system and a mobile-cloud execution framework to a general optimization framework for streaming applications. For the cloud environment, I presented Nexus, which addresses the resource allocation and scheduling problem and distributes the workload such that it achieves high utilization on accelerators while meeting latency SLOs. For mobile-cloud execution, I designed MCDNN, which carefully manages the limited resource budgets and systematically selects optimized models generated by various approximation techniques in order to optimize accuracy and performance. Third, for streaming workloads, I devised sequential specialization, a general optimization that generates specialized models based on the current workload distribution and switches between the specialized classification mode and the oracle classification mode appropriately.

Evaluation of the three systems on realistic workloads demonstrates that we can improve throughput and reduce cost and energy consumption significantly through better scheduling, resource management, and model selection algorithms, while meeting all of the relevant constraints. The work also points to avenues for further improvement and extension of serving systems that can be applied to other types of neural networks, applications, and platforms.

Discussion and Future Work

Recurrent neural networks (RNNs). In this dissertation, we mostly focus on convolutional neural networks. However, recurrent neural networks are also an important category and are widely used in natural language processing tasks. In general, the principles used in this dissertation can be applied to RNN models as well, but they require additional attention. First, one primary distinction between RNNs and CNNs is that the input to an RNN model is a sequence of data with uncertain length. This implies that the workload presented to RNNs can vary request by request, which introduces new challenges to scheduling and load balancing due to the higher variance in execution costs. Second, some RNN models, such as language models, require significant amounts of memory in order to include more words and symbols. It is possible that the main memory of a single GPU is not large enough to fit one model. In this case, it is necessary to split a model across multiple GPUs and even multiple servers, and serving systems need to carefully optimize the overhead from synchronization and communication.

Generalization to other applications. The systems described in this dissertation can serve not only DNN applications but also other similar applications. For example, video analytics applications, which do not necessarily require DNNs, can still gain higher efficiency on accelerators when batching requests. Therefore, the Nexus system, which optimizes for batched execution, can also handle such applications and improve cluster utilization.

New hardware. This dissertation mostly explores server and mobile GPUs as accelerators. However, there is an increasing trend of customized FPGAs and ASICs [71, 29, 107, 1, 33, 34] to accelerate DNN execution. Certain principles such as batching still apply to this new hardware for better efficiency, but it raises a bigger challenge regarding how to generate highly efficient kernels for each new hardware target. Recently, new compiler frameworks such as TVM [23], Glow [112], and DLVM [128] aim to provide high-level programming frameworks that enable developers to easily write and tune highly efficient kernels on new hardware. In the future, we should integrate such compiler frameworks into the serving system so that it can select the best kernel implementation for each hardware target, or even use just-in-time (JIT) compilation to generate a new kernel based on currently available resources.

Integration with other systems. In certain settings, DNNs are not used in standalone applications but as an intermediate step in a larger system. For example, Optasia [90] treats DNNs as user-defined functions (UDFs) in SQL queries. MLlib [93] generalizes the MapReduce framework and Spark [140] to support machine learning tasks. Such a system can delegate its DNN execution to Nexus for better efficiency. It would be worth exploring whether Nexus is general and expressive enough to not only develop and deploy new applications but also serve as a subsidiary system to such execution engines.

Bibliography

[Ô] Arm machine learning processor. https://developer.arm.com/products/processors/ machine-learning/arm-ml-processor.

[ò] Tesla data center gpus for servers. https://www.nvidia.com/en-us/data-center/ tesla/.

[ç] Twitch. https://www.twitch.tv/.

[4] GPU-based deep learning inference: A performance and power analysis. https://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf, 2015.

[5] PWRcheck manual. http://www.westmountainradio.com/pdf/PWRcheckManual.pdf, 2015.

[6] NVIDIA Jetson TK1 development board. http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html, 2016.

[7] Daniel J Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, et al. The design of the Borealis stream processing engine. In CIDR, volume 5, pages 277–289, 2005.

[8] Daniel J Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal — The International Journal on Very Large Data Bases, 12(2):120–139, 2003.

[9] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[10] Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra. Fast Cholesky factorization on GPUs for batch and native modes in MAGMA. Journal of Computational Science, 20:85–93, 2017.

[11] Atul Adya, Daniel Myers, Jon Howell, Jeremy Elson, Colin Meek, Vishesh Khemani, Stefan Fulger, Pan Gu, Lakshminath Bhuvanagiri, Jason Hunter, et al. Slicer: Auto-sharding for datacenter applications. In OSDI, pages 739–753, 2016.

[12] ARM. Mali graphics processing. https://www.arm.com/products/graphics-and-multimedia/mali-gpu.

[13] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3), May 2002.

[14] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.

[15] Sven Bambach, John M Franchak, David J Crandall, and Chen Yu. Detecting hands in children's egocentric views to understand embodied attention during social interaction. Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci), pages 134–139, 2014.

[16] Nikhil Bansal, Niv Buchbinder, and Joseph Naor. A primal-dual randomized algorithm for weighted paging. Journal of the ACM (JACM), 59(4):19, 2012.

[17] Denis Baylor, Eric Breck, and Heng-Tze Cheng. TFX: A TensorFlow-based production-scale machine learning platform, 2017.

[18] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[19] Korte Bernhard and J Vygen. Bin-Packing, pages 449–465. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.

[20] Niv Buchbinder and Joseph Naor. Online primal-dual algorithms for covering and packing. Mathematics of Operations Research, 34(2):270–286, 2009.

[21] Surajit Chaudhuri. An overview of query optimization in relational systems. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 34–43. ACM, 1998.

[22] Chandra Chekuri and Sanjeev Khanna. A polynomial time approximation scheme for the multiple knapsack problem. SIAM Journal on Computing, 35(3):713–728, 2005.

[23] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.

[24] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. In Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, Salt Lake City, UT, USA, March 1–5, 2014.

[25] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems (SenSys), 2015.

[26] Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2285–2294, 2015.

[27] Yu-Hsin Chen, Tushar Krishna, Joel Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), pages 262–263, 2016.

[28] Trishul M. Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014.

[29] Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, et al. Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro, 38(2):8–20, 2018.

[30] John Cocke. Global common subexpression elimination. In Proceedings of the ACM Symposium on Compiler Optimization, pages 20–24. ACM, 1970.

[31] Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J Franklin, Joseph E Gonzalez, and Ion Stoica. Clipper: A low-latency online prediction serving system. In NSDI, pages 613–627, 2017.

[32] Eduardo Cuervo, Aruna Balasubramanian, Dae-ki Cho, Alec Wolman, Stefan Saroiu, Ranveer Chandra, and Paramvir Bahl. MAUI: Making smartphones last longer with code offload. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2010.

[33] Ian Cutress. Cambricon, makers of Huawei's Kirin NPU IP, build a big AI chip and PCIe card. https://www.anandtech.com/show/12815/cambricon-makers-of-huaweis-kirin-npu-ip-build-a-big-ai-chip-and-pcie-card.

[34] Ian Cutress. Huawei shows unannounced Kirin 970 at IFA 2017: Dedicated neural processing unit. https://www.anandtech.com/show/11804/huawei-shows-unannounced-kirin-970-at-ifa-2017-dedicated-neural-processing-unit, 2017.

[35] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Proceedings of the Twenty-sixth Annual Conference on Neural Information Processing Systems (NIPS), 2012.

[36] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[37] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

[38] Chaitanya Desai and Deva Ramanan. Detecting actions, poses, and objects with relational phraselets. In Proceedings of the 12th European Conference on Computer Vision - Volume Part IV, ECCV'12, 2012.

[39] Docker. Docker Swarm. https://github.com/docker/swarm.

[40] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.

[41] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pages 647–655, 2014.

[42] Kathryn A. Dowsland. Simulated annealing. In Colin R. Reeves, editor, Modern Heuristic Techniques for Combinatorial Problems, pages 20–69. McGraw-Hill, 1993.

[43] Facebook. Caffe2. https://caffe2.ai/.

[44] Andrew D Ferguson, Peter Bodik, Srikanth Kandula, Eric Boutin, and Rodrigo Fonseca. Jockey: guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 99–112. ACM, 2012.

[45] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1979.

[46] Aurélien Garivier and Eric Moulines. On upper-confidence bound policies for non-stationary bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT), 2011.

[47] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[48] Google. Cloud AutoML Vision. https://cloud.google.com/vision/automl/docs/.

[49] Google. Kubernetes: Production-grade container orchestration. https://kubernetes.io/.

[50] Kiryong Ha, Zhuo Chen, Wenlu Hu, Wolfgang Richter, Padmanabhan Pillai, and Mahadev Satyanarayanan. Towards wearable cognitive assistance. In Proceedings of the 12th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2014.

[51] Marty Hall and J Paul McNamee. Improving software performance with automatic memoization. Johns Hopkins APL Technical Digest, 18(2):255, 1997.

[52] Seungyeop Han, Rajalakshmi Nandakumar, Matthai Philipose, Arvind Krishnamurthy, and David Wetherall. GlimpseData: Towards continuous vision-based personal analytics. In Proceedings of the 1st Workshop on Physical Analytics (WPA), 2014.

[53] Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. MCDNN: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 123–136. ACM, 2016.

[54] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[55] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Proceedings of the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), pages 1135–1143, 2015.

[56] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In 2011 International Conference on Computer Vision, 2011.

[57] Johann Hauswald, Yiping Kang, Michael A Laurenzano, Quan Chen, Cheng Li, Trevor Mudge, Ronald G Dreslinski, Jason Mars, and Lingjia Tang. DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), pages 27–40. ACM, 2015.

[58] Johann Hauswald, Michael A Laurenzano, Yunqi Zhang, Cheng Li, Austin Rovinski, Arjun Khurana, Ronald G Dreslinski, Trevor Mudge, Vinicius Petrucci, Lingjia Tang, et al. Sirius: An open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 223–238. ACM, 2015.

[59] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[60] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy H Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In NSDI, volume 11, pages 22–22, 2011.

[61] Henry Hoffmann. JouleGuard: Energy guarantees for approximate applications. In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP), pages 198–214, 2015.

[62] Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B Gibbons, and Onur Mutlu. Focus: Querying large video datasets with low latency and low cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 269–286, 2018.

[63] Junxian Huang, Feng Qian, Alexandre Gerber, Zhuoqing Morley Mao, Subhabrata Sen, and Oliver Spatscheck. A close examination of performance and power characteristics of 4G LTE networks. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 225–238, 2012.

[64] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, abs/1609.07061, 2016.

[65] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2018.

[66] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.

[67] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, volume 41, pages 59–72. ACM, 2007.

[68] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference (BMVC), 2014.

[69] Matthias Jarke and Jürgen Koch. Query optimization in database systems. ACM Computing Surveys (CSUR), 16(2):111–152, 1984.

[70] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[71] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pages 1–12. IEEE, 2017.

[72] Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment, 10(11):1586–1597, 2017.

[73] Andrej Karpathy et al. Large-scale video classification with convolutional neural networks. In CVPR, 2014.

[74] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional networks for fast, low power mobile applications. ICLR, 2016.

[75] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[76] Matej Kristan et al. The visual object tracking VOT2015 challenge results. In IEEE International Conference on Computer Vision Workshops (ICCVW) - Visual Object Tracking Challenge (VOT), 2015.

[77] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the Twenty-sixth Annual Conference on Neural Information Processing Systems (NIPS), 2012.

[78] B. Kveton and M. Valko. Learning from a single labeled face and a stream of unlabeled data. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, 2013.

[79] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1), March 1985.

[80] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[81] Yunseong Lee, Alberto Scolari, Byung-Gon Chun, Marco Domenico Santambrogio, Markus Weimer, and Matteo Interlandi. PRETZEL: Opening the black box of machine learning prediction serving systems. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 611–626. USENIX Association, 2018.

[82] Yunseong Lee, Alberto Scolari, Matteo Interlandi, WA Redmond, Markus Weimer, and Byung-Gon Chun. Towards high-performance prediction serving systems. In Proceedings of the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS), 2017.

[83] Xin Lei, Andrew Senior, Alexander Gruenstein, and Jeffrey Sorensen. Accurate and compact large vocabulary speech recognition on mobile devices. In Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech), 2013.

[84] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In CVPR, 2015.

[85] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5325–5334, 2015.

[86] Mu Li, David G Andersen, Jun Woo Park, Alexander J Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In OSDI, volume 1, page 3, 2014.

[87] Robert LiKamWa and Lin Zhong. Starfish: Efficient concurrency support for computer vision applications. In Proceedings of the 13th International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 213–226, 2015.

[88] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. DianNaoYu: An instruction set architecture for neural networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), 2016.

[89] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.

[90] Yao Lu, Aakanksha Chowdhery, and Srikanth Kandula. Optasia: A relational platform for efficient large-scale video analytics. In Proceedings of the Seventh ACM Symposium on Cloud Computing, pages 57–70. ACM, 2016.

[91] Paul McNamee and Marty Hall. Developing a tool for memoizing functions in C++. ACM SIGPLAN Notices, 33(8):17–22, 1998.

[92] Meerkat. Meerkatstreams website. http://meerkatstreams.com/, 2015.

[93] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.

[94] Microsoft. Custom Vision. https://azure.microsoft.com/en-us/services/cognitive-services/custom-vision-service/.

[95] Microsoft. Virtual machine scale sets. https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-overview.

[96] Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.

[97] Derek G Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. CIEL: a universal execution engine for distributed data-flow computing. In Proc. 8th ACM/USENIX Symposium on Networked Systems Design and Implementation, pages 113–126, 2011.

[98] Suman Nath. ACE: exploiting correlation for energy-efficient and continuous context sensing. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 29–42, 2012.

[ÀÀ] Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. Sparrow: distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages âÀ–—¥. ACM, òýÔç. Ôԗ

[Ôýý] Omkar M Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In BMVC, òýÔ .

[ÔýÔ] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In BMVC, volume Ô, page â, òýÔ .

[Ôýò] Yanghua Peng, Yixin Bao, Yangrui Chen, Chuan Wu, and Chuanxiong Guo. Optimus: an e›cient dynamic resource scheduler for deep learning clusters. In Proceedings of the irteenth EuroSys Conference, page ç. ACM, òýԗ.

[103] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012.

[104] Hamed Pirsiavash and Deva Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), pages 2847–2854, 2012.

[105] Moo-Ryong Ra, A. Sheth, L. Mummert, P. Pillai, D. Wetherall, and R. Govindan. Odessa: Enabling interactive perception applications on mobile devices. In Proceedings of the 9th International Conference on Mobile Systems, Applications, and Services (MobiSys), 2011.

[106] Swati Rallapalli, Aishwarya Ganesan, Krishna Chintalapudi, Venkat N. Padmanabhan, and Lili Qiu. Enabling physical analytics in retail stores using smart glasses. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom), pages 115–126, 2014.

[107] Naveen Rao. Intel Nervana neural network processors (NNP) redefine AI silicon. https://ai.intel.com/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/.

[108] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.

[109] Joseph Redmon. Darknet: Open source neural networks in C. http://pjreddie.com/darknet/, 2013–2016.

[110] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR), 2016.

[111] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[112] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov, James Hegeman, Roman Levenstein, Bert Maher, Satish Nadathur, Jakob Olesen, et al. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907, 2018.

[113] Ardalan Amiri Sani, Kevin Boos, Min Hong Yun, and Lin Zhong. Rio: A system solution for sharing I/O between mobile systems. In Proceedings of the 12th International Conference on Mobile Systems, Applications, and Services (MobiSys), pages 259–272, 2014.

[114] Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 351–364. ACM, 2013.

[115] Haichen Shen, Seungyeop Han, Matthai Philipose, and Arvind Krishnamurthy. Fast video classification via adaptive cascading of deep models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[116] Tom Simonite. Apple's 'neural engine' infuses the iPhone with AI smarts. https://www.wired.com/story/apples-neural-engine-infuses-the-iphone-with-ai-smarts/, 2017.

[117] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

[118] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[119] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[120] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press/Bradford Books, 1998.

[121] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[122] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[123] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[124] Ruxandra Tapu, Bogdan Mocanu, and Titus B. Zaharia. ALICE: A smartphone assistant used to increase the mobility of visual impaired people. Journal of Ambient Intelligence and Smart Environments (JAISE), 7(5):659–678, 2015.

[125] Twitter. Twitter Periscope website. http://www.periscope.tv, 2015.

[126] Michal Valko, Branislav Kveton, Ling Huang, and Daniel Ting. Online semi-supervised learning on quantized graphs. In Proceedings of the Twenty-Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), pages 606–614, Corvallis, Oregon, 2010. AUAI Press.

[127] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001.

[128] Richard Wei, Lane Schwartz, and Vikram Adve. DLVM: A modern compiler infrastructure for deep learning systems. arXiv preprint arXiv:1711.03016, 2017.

[129] Lior Wolf, Tal Hassner, and Itay Maoz. Face recognition in unconstrained videos with matched background similarity. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 529–534. IEEE, 2011.

[130] J. Wu, A. Osuntogun, T. Choudhury, M. Philipose, and J. M. Rehg. A scalable approach to activity recognition based on object use. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.

[131] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[132] Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, et al. Gandiva: Introspective cluster scheduling for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 595–610, 2018.

[133] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67, 2015.

[134] Zhixiang Xu, Matt Kusner, Minmin Chen, and Kilian Q. Weinberger. Cost-sensitive tree of classifiers. In ICML, 2013.

[135] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.

[136] Qiang Yang, Charles X. Ling, Xiaoyong Chai, and Rong Pan. Test-cost sensitive classification on data with missing values. IEEE Trans. Knowl. Data Eng., 18(5):626–638, 2006.

[137] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.

[138] Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yifan Gong. Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2014.

[139] Dong Yu, Frank Seide, Gang Li, and Li Deng. Exploiting sparseness in deep neural networks for large vocabulary speech recognition. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2012.

[140] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.

[141] Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. HotCloud, 12:10–10, 2012.

[142] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. Live video analytics at scale with approximation and delay-tolerance. In NSDI, pages 377–392, 2017.

[143] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Proceedings of the Twenty-eighth Annual Conference on Neural Information Processing Systems (NIPS), 2014.

[144] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.

[145] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

[146] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.

Appendix A Hardness of Fixed-rate GPU Scheduling Problem (FGSP)

We now justify the use of an approximate algorithm for GPU cluster scheduling. We define the Fixed-rate GPU Scheduling Problem (FGSP), which is a highly restricted version of the general problem, and we show that even the restricted version is intractable. FGSP:

Input: models $M_i$, $1 \le i \le n$, with corresponding latencies $L_i$, latency bounds $B_i$, and GPU count $C$. (The latencies correspond to the fixed rates.)

Output: a partition of the models into $C$ sets such that in each set $S$ we have $D + L_i \le B_i$ for all $i \in S$, where $D = \sum_{i \in S} L_i$ is the duty cycle of the set.

We show that FGSP is strongly NP-hard by reduction from 3-PARTITION [45].

Theorem 1. FGSP is strongly NP-complete.

Proof. We start with a given instance of 3-PARTITION, which consists of a bound $B$ and $3n$ numbers $a_1, a_2, \ldots, a_{3n}$ with $B/4 \le a_i \le B/2$; the goal of 3-PARTITION is to partition the $a_i$'s into triples such that the sum of each triple is $B$. Observe that without loss of generality we may assume that $\sum_{1 \le i \le 3n} a_i = nB$.

From the given instance of 3-PARTITION we create an instance of FGSP by setting $L_i = 2B + a_i$ and $B_i = 9B + a_i$ for all $1 \le i \le 3n$, and $C = n$.

It is clear that if there exists a solution to the 3-PARTITION instance, then the same partition into $n$ triples yields a partition of the FGSP instance into $C = n$ sets such that $D + L_i \le 9B + a_i$, since $D = 7B$ and $L_i = 2B + a_i$.

In the other direction, suppose there exists a solution to FGSP. Observe that in any solution to FGSP every set can have at most 3 models, because otherwise the duty cycle $D$ would exceed $8B$ and the constraint $D + L_i \le B_i$ would be violated for every $i$ in the set, since $D + L_i > 10B$ but $B_i < 10B$. Since there are a total of $3n$ models and $C = n$ sets, every set must have exactly 3 models, i.e., every set must be a triple. Since $D + L_i \le B_i$ for any $i$ in the set, we have $D + 2B + a_i \le 9B + a_i$, or $D \le 7B$. This implies that in every triple the sum of the $L_i$'s is at most $7B$, or equivalently the sum of the corresponding $a_i$'s is at most $B$. But since the sum over all $n$ triples is $nB$ and each triple sums to at most $B$, the sum of each triple must be exactly $B$. This means that the partition of the models of the FGSP instance into sets is also a solution to the partition of the corresponding $a_i$'s into triples in 3-PARTITION.
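To make the reduction concrete, here is a small illustrative sketch (not taken from the dissertation's implementation; the function names are ours) that constructs the FGSP instance L_i = 2B + a_i, B_i = 9B + a_i from a 3-PARTITION instance and checks a candidate partition against the duty-cycle constraint.

from typing import List

def reduce_3partition_to_fgsp(a: List[int], B: int):
    """Map a 3-PARTITION instance (numbers a_i, bound B) to an FGSP instance:
    each a_i becomes a model with latency L_i = 2B + a_i and latency bound
    B_i = 9B + a_i, and the GPU count is C = n = len(a) / 3."""
    assert len(a) % 3 == 0
    latencies = [2 * B + ai for ai in a]
    bounds = [9 * B + ai for ai in a]
    return latencies, bounds, len(a) // 3

def fgsp_feasible(partition: List[List[int]], latencies: List[int], bounds: List[int]) -> bool:
    """Check the FGSP constraint: for every set S, D + L_i <= B_i for all i in S,
    where D, the sum of the latencies in S, is the duty cycle of the set."""
    for S in partition:
        D = sum(latencies[i] for i in S)
        if any(D + latencies[i] > bounds[i] for i in S):
            return False
    return True

# Example: a 3-PARTITION instance with bound B = 12 and n = 2 triples.
a = [4, 4, 4, 5, 4, 3]          # {4, 4, 4} and {5, 4, 3} both sum to B
latencies, bounds, C = reduce_3partition_to_fgsp(a, B=12)
print(fgsp_feasible([[0, 1, 2], [3, 4, 5]], latencies, bounds))  # True: both triples sum to 12
print(fgsp_feasible([[0, 3, 4], [1, 2, 5]], latencies, bounds))  # False: the a-sums are 13 and 11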

Appendix B Compact Model Architectures

Table B.1 shows the six compact models used in Chapter 5. All convolution and fully-connected layers are followed by a ReLU activation layer. We created these models by systematically applying the following operations to publicly available model architectures, such as AlexNet [77] and VGGNet [117]: (a) reduce the number of feature maps, or increase the stride size in a convolution layer; (b) reduce the size of a fully-connected layer; (c) merge two convolution layers, or a convolution layer and a max-pooling layer, into a single layer. The models are unremarkable except that they have fewer operations and layers than the original versions and hence run faster, yet achieve high accuracy when applied to small subsets of the original model's domain. In fact, how these architectures are derived is orthogonal to the WEG algorithm (Algorithm 5.1). We have also tested two fully automatic approximation techniques, tensor factorization [74] and representation quantization [64], to generate compact models.

O1: input 3 × 227 × 227 → conv[11, 96, 4, 0] → pool[3, 2] → conv[5, 256, 2, 2] → pool[3, 2] → conv[3, 384, 1, 1] → conv[3, 384, 1, 1] → conv[3, 256, 1, 1] → pool[3, 2] → fc[4096] → fc[4096] → fc[1000]

O2: input 3 × 227 × 227 → conv[11, 64, 4, 0] → pool[3, 2] → conv[5, 256, 2, 2] → pool[3, 2] → conv[3, 256, 1, 1] → conv[3, 256, 1, 1] → pool[3, 2] → fc[1024] → fc[2048] → fc[1000]

S1: input 3 × 227 × 227 → conv[11, 96, 4, 0] → pool[3, 2] → conv[5, 256, 2, 2] → pool[3, 2] → conv[3, 384, 1, 1] → conv[3, 384, 1, 1] → conv[3, 256, 1, 1] → pool[3, 2] → fc[2048] → fc[2048] → fc[205]

S2: input 3 × 227 × 227 → conv[11, 64, 4, 0] → pool[3, 2] → conv[5, 256, 2, 2] → pool[3, 2] → conv[3, 256, 1, 1] → conv[3, 256, 1, 1] → pool[3, 2] → fc[1024] → fc[2048] → fc[205]

F1: input 3 × 152 × 152 → conv[3, 64, 2, 1] → pool[2, 2] → conv[3, 128, 1, 1] → pool[2, 2] → conv[3, 256, 1, 1] → pool[2, 2] → conv[3, 256, 1, 1] → pool[2, 2] → fc[2048] → fc[2048] → fc[2622]

F2: input 3 × 152 × 152 → conv[3, 64, 2, 1] → pool[2, 2] → conv[3, 128, 2, 1] → conv[3, 128, 2, 1] → conv[3, 256, 2, 1] → fc[1024] → fc[1024] → fc[2622]

Table B.1: Architecture of compact models O1, O2, S1, S2, F1, and F2. The table specifies convolution layers as conv[kernel size, number of feature maps, stride, padding]; max-pooling layers as pool[kernel size, stride]; fully-connected layers as fc[output size].
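To make the layer notation concrete, the sketch below re-expresses the O2 column of Table B.1 in PyTorch. This is only an illustrative rendering under assumptions (the dissertation's models were not necessarily built with this framework), with a ReLU after every convolution and fully-connected layer as described above, except the final classifier layer. The flattened feature size (256 * 2 * 2) follows from the listed kernel sizes, strides, and paddings for a 227 × 227 input.

import torch
import torch.nn as nn

# Compact model O2 from Table B.1: conv[k, f, s, p] -> Conv2d, pool[k, s] -> MaxPool2d, fc[d] -> Linear.
o2 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=0), nn.ReLU(),   # conv[11, 64, 4, 0]
    nn.MaxPool2d(kernel_size=3, stride=2),                              # pool[3, 2]
    nn.Conv2d(64, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),  # conv[5, 256, 2, 2]
    nn.MaxPool2d(kernel_size=3, stride=2),                              # pool[3, 2]
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(), # conv[3, 256, 1, 1]
    nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(), # conv[3, 256, 1, 1]
    nn.MaxPool2d(kernel_size=3, stride=2),                              # pool[3, 2]
    nn.Flatten(),
    nn.Linear(256 * 2 * 2, 1024), nn.ReLU(),                            # fc[1024]
    nn.Linear(1024, 2048), nn.ReLU(),                                   # fc[2048]
    nn.Linear(2048, 1000),                                              # fc[1000]
)

x = torch.randn(1, 3, 227, 227)   # input 3 x 227 x 227
print(o2(x).shape)                # torch.Size([1, 1000])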

Appendix C Determine Dominant Classes

In the WEG algorithm, an important component is deciding the dominant classes from a sliding window (the function DomClasses in Algorithm 5.1). The decision rule used in the algorithm is fairly simple: return as dominant classes those classes that appear at least k times in a sliding window w of classification-result history from the oracle $h^*$, where k is the minimum support number. Here we use a simple model to analyze how to choose the minimum support number k for different window sizes w in the WEG algorithm. Suppose N is the total number of classes classified by the oracle $h^*$ and the accuracy of $h^*$ is $a^*$. If the classification result from the oracle is wrong, we assume the oracle is equally likely to classify the input as any of the other classes. Suppose the input sequences are drawn independently from a skewed distribution $T$ which has n dominant classes with skew p. Denote the set of dominant classes as $\mathcal{D}$ and the set of non-dominant classes as $\mathcal{O}$. Based on these definitions, we can compute the probability of a single class being output by the oracle classifier $h^*$ given the input sequences. Consider one dominant class $\ell \in \mathcal{D}$; the probability that it is output by the oracle is:

$$\mathrm{prob}(\ell \in \mathcal{D}) = \hat{p} = \frac{1}{n}\left(p \cdot a^* + p(1-a^*)\frac{n-1}{N-1} + (1-p)(1-a^*)\frac{n}{N-1}\right) \tag{C.1}$$

And the probability that a non-dominant class $\ell' \in \mathcal{O}$ is output by the oracle is:

$$\mathrm{prob}(\ell' \in \mathcal{O}) = \frac{1}{N-n}\left((1-p)a^* + (1-p)(1-a^*)\frac{N-n-1}{N-1} + p(1-a^*)\frac{N-n}{N-1}\right) \tag{C.2}$$

index   N      a*     n    p     w    k    p_in    p_out
1       1000   0.65   5    0.9   30   2    0.896   6.52E-5
2       1000   0.65   10   0.9   30   2    0.558   6.53E-5
3       1000   0.65   10   0.9   60   2    0.891   2.64E-4
4       1000   0.65   10   0.7   60   2    0.789   4.80E-4
5       1000   0.65   10   0.7   90   2    0.933   1.08E-3
6       20     0.58   10   0.9   90   2    0.959   0.019
7       20     0.58   10   0.9   90   3    0.872   1.31E-3
8       20     0.58   0    N/A   90   3    N/A     9.29E-3

Table C.1: Probability p_in and p_out under various N, a*, n, p, w, and k. In row 8, n = 0 and p = N/A indicate that the distribution has no skew and is uniform across all N classes.

From Equations C.1 and C.2, we can then derive, using the binomial distribution, the probability that a dominant or a non-dominant class is classified at least k times in the window w. Intuitively, we want the probability that a dominant class appears at least k times in the window (p_in) to be as high as possible, while the probability that any non-dominant class does so (p_out) should be as low as possible. We considered different combinations of N, a*, n, and p under different w and k settings, and computed p_in and p_out. Table C.1 shows how p_in and p_out change under the different settings. From the table, we can tell that as the number of dominant classes increases and the skew decreases, we need a larger window to detect the dominant classes with high probability. However, when the window size is larger, we also need to increase the minimum support number k (rows 6 and 7) to limit the probability that a non-dominant class appears k times. In addition, even when there is no skew in the distribution (row 8), the minimum support number k is effective at filtering out most of the non-dominant classes. In practice, we set k = 2 for w < 90 and k = 3 for w ≥ 90.
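As an illustration (not code from the dissertation; the parameter and function names are ours), the sketch below implements the k-of-w decision rule and evaluates Equations C.1 and C.2 together with the binomial tail probabilities p_in and p_out for one choice of N, a*, n, p, w, and k in the style of Table C.1.

from collections import Counter
from math import comb

def dominant_classes(window, k):
    """The decision rule described above: classes that appear at least k times
    in the sliding window of oracle classification results."""
    return {c for c, cnt in Counter(window).items() if cnt >= k}

def class_probs(N, a_star, n, p):
    """Per-class output probabilities of the oracle, following Equations C.1 and C.2."""
    p_dom = (1.0 / n) * (p * a_star
                         + p * (1 - a_star) * (n - 1) / (N - 1)
                         + (1 - p) * (1 - a_star) * n / (N - 1))
    p_non = (1.0 / (N - n)) * ((1 - p) * a_star
                               + (1 - p) * (1 - a_star) * (N - n - 1) / (N - 1)
                               + p * (1 - a_star) * (N - n) / (N - 1))
    return p_dom, p_non

def at_least_k(prob, w, k):
    """P[Binomial(w, prob) >= k]: the class appears at least k times in a window of size w."""
    return 1.0 - sum(comb(w, j) * prob**j * (1 - prob)**(w - j) for j in range(k))

# One example setting: N classes, oracle accuracy a*, n dominant classes with skew p,
# window size w, and minimum support k.
p_dom, p_non = class_probs(N=1000, a_star=0.65, n=5, p=0.9)
p_in = at_least_k(p_dom, w=30, k=2)
p_out = at_least_k(p_non, w=30, k=2)
print(f"p_in = {p_in:.3f}, p_out = {p_out:.2e}")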