Copyright by Jian He, 2020

The Dissertation Committee for Jian He certifies that this is the approved version of the following dissertation:

Empowering Video Applications for Mobile Devices

Committee:

Lili Qiu, Supervisor

Mohamed G. Gouda

Aloysius Mok

Xiaoqing Zhu

Empowering Video Applications for Mobile Devices

by

Jian He

DISSERTATION

Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

THE UNIVERSITY OF TEXAS AT AUSTIN

May 2020

Acknowledgments

First and foremost, I want to thank my advisor, Prof. Lili Qiu, for the support and guidance I have received over the past few years. I appreciate all her contributions of time, ideas and funding to make my Ph.D. experience productive and stimulating. The enthusiasm she has for her research significantly motivated me to concentrate on my own research, especially during tough times in my Ph.D. pursuit. She taught me how to crystallize ideas into solid and compelling research work. I firmly believe that working with her will help me have a more successful career in the future.

I also want to thank all the members of my dissertation committee, Prof. Mohamed G. Gouda, Prof. Aloysius Mok and Dr. Xiaoqing Zhu. I owe many thanks to them for their insightful comments on my dissertation.

I was very fortunate to collaborate with Wenguang Mao, Mubashir Qureshi, Ghufran Baig, Zaiwei Zhang, Yuchen Cui, Sangki Yun, Zhaoyuan He, Chenxi Yang, Wangyang Li and Yichao Chen on many interesting projects. They always had the time and passion to devote to my research projects. Without their support, I could not have completed those projects smoothly. I also want to thank my colleagues Mei Wang, Wei Sun and Swadhin Pradhan for their great help in making my research life enjoyable.

I would like to thank Xiaoqing Zhu, Shruti Sanadhya, Sangki Yun, Christina Vlachou and Kyu-Han Kim, who were my mentors during my internships at Cisco and HP Labs. I had a lot of fun working with them and learning how to carry out projects in industry and research labs. They greatly encouraged me to seek further success in my career.

I am extremely grateful to have many friends who brought lots of joy to my life at UT: Chen Chen, Lei Xu, Yuxiang Lin, Wenhui Zhang, Hangchen Yu, Zhiting Zhu, Yuepeng Wang, Xinyu Wang, Ye Zhang, and many others. More importantly, I owe my sincere gratitude to Xiaoting Liu, who gave me her continuous care. I will never forget the days and nights we spent together getting through the hard times during the COVID-19 outbreak. I wish you all the best!

Finally, I would like to thank my family for all their love and encouragement. I dedicate this dissertation to you.

Empowering Video Applications for Mobile Devices

Jian He, Ph.D.
The University of Texas at Austin, 2020

Supervisor: Lili Qiu

The popularity of video applications has grown rapidly. There are two main trends in the development of video applications: (i) video streaming supporting higher-resolution videos and 360◦ videos, and (ii) providing video analytics (e.g., running object detection on video frames). In this dissertation, we focus on how to improve the performance of streaming 360◦ and 4K videos and of running real-time video analytics on mobile devices.

We identify a few major challenges in guaranteeing a high-quality user experience for video applications on mobile devices. First, existing video applications call for high-resolution videos (e.g., 4K). Due to limited hardware resources on mobile devices, coding high-resolution videos is slow. It is critical to design a light-weight video codec that provides fast video coding as well as high compression efficiency for mobile devices. Second, wireless channels have unpredictable throughput fluctuation. It is necessary to design a robust rate adaptation algorithm that adjusts video quality according to the varying network conditions. Third, streaming the entire panoramic video view wastes lots of bandwidth, while transmitting only the portion visible in the user's FoV significantly degrades video quality. It is hard to save bandwidth while maintaining high video quality in the presence of inevitable head movement prediction error. Last, motion based object tracking can speed up video analytics, but existing motion estimation is noisy due to the presence of complex background and changes in object size or shape.

In this dissertation, we show how to address the above challenges. We propose a new layered coding design to code high-resolution video data. It can effectively adapt to varying data rates on demand by first sending the base layer and then opportunistically sending more layers whenever the link allows. We further design an optimization algorithm to decide which video layers to send according to the available throughput. Compared with existing rate adaptation algorithms, our algorithm adds a new dimension: deciding the number of layers to transmit. We also design a novel layered tile-based encoding framework for 360◦ videos. It achieves efficient video coding, bandwidth savings, and robustness against head movement prediction error. Moreover, we design a robust technique to extract reliable motion from video frames. We use a combination of feature maps and motion to generate a representative mask which can reliably capture the motion of object pixels and the changes in the overall object shape or size.

First, we implement our tile-based layered encoding framework Rubiks on mobile devices for 360◦ video streaming. We exploit the spatial and temporal characteristics of 360◦ videos for encoding. Specifically, Rubiks splits the 360◦ video spatially into tiles and temporally into layers. The client runs an optimization routine to determine the video data that needs to be fetched to optimize user QoE. Using this encoding approach, we can send the video portions that have a high probability of being viewed at a higher quality and the portions that have a lower probability of being viewed at a lower quality. By controlling the amount of data sent, the data can be decoded in time. Rubiks saves significant bandwidth while maximizing the user's QoE and decoding the video in a timely manner. Compared with existing approaches, Rubiks achieves up to 69% improvement in user QoE and 49% in bandwidth savings.

Next, we design Jigsaw, a system that supports live 4K video streaming over wireless networks using commodity mobile devices. Given the high data rate requirement of 4K videos, 60GHz is appealing, but its large and unpredictable throughput fluctuation makes it hard to provide a desirable user experience. Jigsaw consists of (i) easy-to-compute layered video coding to seamlessly adapt to unpredictable wireless link fluctuations, (ii) an efficient GPU implementation of video coding on commodity mobile devices, and (iii) effective use of both WiFi and WiGig through delayed video adaptation and smart scheduling. Using real experiments and emulation, we demonstrate the feasibility and effectiveness of Jigsaw. Our results show that it improves PSNR by 6-15dB and SSIM by 0.011-0.217 over state-of-the-art approaches.

Finally, we develop Sight, a novel mobile video analytics system. Its unique features include (i) high accuracy, (ii) real-time operation, and (iii) running exclusively on a mobile device without the need for an edge/cloud server or network connectivity. At its heart lies an effective technique to reliably extract motion from video frames and use the motion to speed up video analytics. Unlike existing motion extraction, our technique is robust to background noise and changes in object sizes. Using extensive evaluation, we show that Sight can support real-time object tracking at 30 frames/second (fps) on Jetson TX2. For single-object tracking, Sight improves the average Intersection-over-Union (IoU) by 88%, improves the mean Average Precision (mAP) by 207% and reduces the average hardware resource usage by 45% over the state-of-the-art approach. For multi-object tracking, Sight improves IoU by 69%, improves mAP by 173% and reduces resource usage by around 32%.

Table of Contents

Acknowledgments

Abstract

List of Tables

List of Figures

Chapter 1. Introduction
1.1 Background
1.2 Motivation
1.2.1 Video Streaming
1.2.2 Mobile Video Analytics
1.3 Challenges
1.3.1 Video Streaming
1.3.2 Video Coding
1.3.3 Video Analytics
1.4 Our Approach
1.5 Summary of Contributions
1.6 Dissertation Outline

Chapter 2. Related Work
2.1 Video Streaming Algorithms
2.2 Wireless Technologies
2.3 Video Coding
2.4 Mobile Video Analytics

Chapter 3. Practical 360◦ Video Streaming for Smartphones
3.1 Background for 360◦ Video Streaming
3.1.1 Existing Streaming Framework
3.1.2 H.264 and HEVC Codecs
3.1.3 Scalable Video Coding
3.2 Motivation
3.2.1 Real-Time Media Codecs
3.2.2 Limitations of Existing Approaches
3.2.2.1 Decoding Time
3.2.2.2 Bandwidth Savings
3.2.2.3 Video Quality
3.2.3 Insights From Existing Approaches
3.3 Challenges
3.4 Our Approach
3.4.1 Video Encoding
3.4.2 360◦ Video Rate Adaptation
3.4.2.1 MPC-based Optimization Framework
3.4.2.2 User QoE
3.4.2.3 Estimate Video Quality
3.4.2.4 Decoding Time
3.4.2.5 Improving Efficiency
3.5 System Design for Rubiks
3.5.1 System Architecture
3.5.2 Server Side
3.5.3 Client Side
3.6 Evaluation for Rubiks
3.6.1 Evaluation Methodology
3.6.1.1 Experiment Setup
3.6.1.2 Experiment Settings
3.6.2 Micro Benchmarks
3.6.2.1 Head Movement Prediction Error
3.6.2.2 Decoding Time Modeling
3.6.3 System Results
3.6.3.1 Rubiks for 4K Videos
3.6.3.2 Rubiks for 8K Videos
3.6.3.3 Energy Consumption
3.6.4 Summary and Discussion of Results

Chapter 4. Robust Live 4K Video Streaming
4.1 Motivation
4.1.1 4K Videos Need Compression
4.1.2 Rate Adaptation Requirement
4.1.3 Limitations of Existing Video Codecs
4.1.4 WiGig and WiFi Interaction
4.2 Challenges
4.3 Our Approach
4.3.1 Light-weight Layered Coding
4.3.1.1 Our Design
4.3.2 Layered Coding Implementation
4.3.2.1 GPU Implementation
4.3.2.2 Jigsaw GPU Encoder
4.3.2.3 Jigsaw GPU Decoder
4.3.2.4 Pipelining
4.3.3 Video Transmission
4.4 Evaluation for Jigsaw
4.4.1 Evaluation Methodology
4.4.2 Micro-benchmarks
4.4.3 System Results
4.4.4 Emulation Results

Chapter 5. Real-Time Deep Video Analytics on Mobile Devices
5.1 Motivation
5.1.1 Deep Model Inference Latency
5.1.2 Motion Estimation Based Tracking
5.1.3 Object Size Changes
5.2 Challenges
5.3 Approach
5.3.1 Reliable Mask Extraction
5.3.2 Object Size Adaptation
5.3.3 Adaptive Inference
5.4 System Implementation
5.4.1 Inference Module
5.4.2 Tracking Module
5.5 Evaluation
5.5.1 Evaluation Methodology
5.5.2 Micro-benchmarks
5.5.3 Single Object Tracking
5.5.4 Robustness to Different Types of Videos
5.5.5 Multi-Object Tracking

Chapter 6. Conclusion

Bibliography

List of Tables

3.1 Decoding Time Lookup Table
3.2 Energy Consumption

4.1 Codecs' encoding and decoding time per frame
4.2 Performance over different GPUs
4.3 Video SSIM under various mobility patterns

5.1 Inference latency for a single frame

List of Figures

3.1 Architecture of Existing 360◦ Streaming Systems
3.2 Decoding time and bandwidth savings
3.3 Video quality of existing approaches
3.4 Design Space of 360◦ Video Streaming Algorithms
3.5 Spatial and temporal splitting of 360◦ chunk
3.6 Correlation between video quality and bitrate
3.7 System Architecture for Rubiks
3.8 Head Movement Prediction Analysis
3.9 Modeling Error
3.10 Performance of 4K videos (8 videos, 10 throughput traces, 80 head movement traces)
3.11 Performance of Rubiks under low throughput
3.12 Performance of Rubiks for high motion videos
3.13 Performance of 8K videos (8 videos, 10 throughput traces, 80 head movement traces)

4.1 Example 60GHz Throughput Traces
4.2 Example WiFi Throughput Traces
4.3 Our layered coding for Live 4K Video Streaming
4.4 Compression Efficiency
4.5 Codec GPU Modules Running Time
4.6 Jigsaw Pipeline
4.7 Video quality vs. number of received layers
4.8 Impacts of using WiGig and WiFi
4.9 Impacts of interface scheduler
4.10 Impacts of inter-frame scheduler
4.11 Impacts of frame deadline
4.12 Performance of Jigsaw under various mobility patterns (HR: High-Richness, LR: Low-Richness)

4.13 Frame quality correlation with throughput
4.14 Frame quality comparison

5.1 Successfully Tracked Frames
5.2 Impacts of object size changes (Red Box: ground-truth result. Green Box: tracking result.)
5.3 Calculating motion from feature maps (Left: raw frames or feature maps. Right: optical flow.)
5.4 Representative mask generated from the intersection between feature-map mask and optical-flow mask
5.5 Ratio of Object Pixels in Masks
5.6 Size Adaptive Box Update
5.7 Representative Mask based Tracking (Green Box: tracking result)
5.8 Stale Inference Update (Green: stale inference. Red: ground-truth. Yellow: updated stale inference.)
5.9 System Architecture for Sight
5.10 Micro-benchmark results over dataset D1
5.11 Performance of Sight over dataset D1
5.12 System latency
5.13 Performance of Sight using the dataset D3
5.14 Sight for multi-object tracking (dataset D2)

Chapter 1

Introduction

The popularity of videos has grown significantly in the past few years. Cisco [21] estimates that video traffic will constitute 82% of all consumer Internet traffic by 2021. Users demand an immersive experience from various video applications such as YouTube, Netflix, gaming, Augmented Reality, and Virtual Reality. There are two major trends in the development of video applications: (i) video streaming aiming for higher-resolution videos and 360◦ videos, both of which call for high bandwidth; even with recent 5G technologies, it is hard to satisfy the bandwidth requirement; and (ii) video analytics, which applies sophisticated computer vision technologies (e.g., object detection, activity analysis) to video data. Beyond viewing videos, video analytics provides a more interactive experience for users. In this dissertation, we propose systematic approaches to improve the performance of both.

1.1 Background

Mobile Video Applications: Mobile devices consume video data in three steps: (i) receiving video data, (ii) decoding and displaying video, and (iii) analyzing the received video content. Video streaming applications are responsible for transmitting high quality video content to users with low latency. Mobile devices need to decode the received video data and display the decoded video on the screen when running video streaming applications. Video analytics enables user interaction with video by running sophisticated computer vision technologies on the received video content. For example, users can place virtual objects into the real scene captured by an AR application.

360◦ Videos: A 360◦ video consists of panoramic video frames. To watch a 360◦ video, a user wears a headset that blocks the outside view so the user focuses only on what is being displayed on the smartphone or VR headset. 360◦ videos are shot using omni-directional cameras or multiple cameras whose images are stitched together. The resulting effect of either approach is a video consisting of spherical images. While watching a 360◦ video, a user views a defined portion of the whole image, usually 110◦ along the X axis and 90◦ along the Y axis. This portion is termed the Field-of-View (FoV). The view automatically adapts to the user's head movement: as the user moves his or her head along the X, Y, or Z axis, the video player automatically updates the FoV.
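To make the FoV geometry concrete, the following minimal Python sketch computes the angular region a viewer sees for a given head orientation; the 110◦ × 90◦ FoV and the simple yaw/pitch model are illustrative assumptions, not the player's actual implementation:

def fov_bounds(yaw, pitch, fov_w=110.0, fov_h=90.0):
    # Angular extent (degrees) of the user's FoV, centered on the current
    # head orientation; yaw wraps around, pitch is clamped to [-90, 90].
    yaw_range = ((yaw - fov_w / 2) % 360.0, (yaw + fov_w / 2) % 360.0)
    pitch_range = (max(pitch - fov_h / 2, -90.0), min(pitch + fov_h / 2, 90.0))
    return yaw_range, pitch_range

print(fov_bounds(30.0, 10.0))   # head turned 30 degrees right, tilted 10 degrees up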

4K Videos: Live 4K videos are generally generated by interactive video applications, such as VR and gaming. They demand not only high resolution but also low latency (e.g., within 60 ms [44, 57, 78]). The delay requirement for streaming a live 4K video at 30 fps means that we should finish encoding, transmitting, and decoding each video frame (4096 × 2160 pixels) within 60 ms [44, 57, 78]. Existing commodity devices are not powerful enough to perform encoding and decoding in real time.

Video Analytics: State-of-the-art video analytics approaches involve running deep models [54, 93, 119, 120]. There are different types of video analytics tasks, for example, tracking the positions of objects in video frames, detecting the types of activities performed by the subjects in the received video frames, and segmenting background and foreground. Executing a deep model on a video frame is defined as inference, which outputs the bounding boxes and labels of objects.
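As an illustration of a single inference step, the following Python sketch runs an off-the-shelf detector from torchvision on one frame; the specific model and input size are placeholders, not the models evaluated in this dissertation:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load a pre-trained object detector and switch it to inference mode.
model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

frame = torch.rand(3, 480, 640)      # stand-in for a decoded video frame in [0, 1]
with torch.no_grad():
    detections = model([frame])[0]   # one result dict per input image

# Each detection comes with a bounding box, a class label, and a confidence score.
print(detections["boxes"].shape, detections["labels"], detections["scores"])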

1.2 Motivation

1.2.1 Video Streaming

Higher Video Resolution: The popularity of 4K videos has grown rapidly. Gaming, Virtual Reality (VR) and Augmented Reality (AR) all call for 4K videos, since resolution has a profound impact on the immersive user experience. Compared with regular videos, 4K videos have much higher resolution. On popular video sharing websites like YouTube and Netflix, the highest resolution for regular videos is 1920 × 1080 (1080p) or 1280 × 720 (720p), whereas the 4K resolution is 3840 × 2160. Due to the much higher resolution, transmitting 4K videos demands a high data rate. With existing video codecs (such as H.264 or HEVC), the throughput requirement for transmitting 1080p videos at 30 frames per second (fps) is 3Mbps, while transmitting 30fps 4K videos needs around 16Mbps. This requirement puts significant stress on the network. Even with rapid innovation in networking, it is still challenging to sustain such a high bandwidth requirement.

Fast Video Coding: Coding high-resolution videos is slow on commodity mobile devices due to their limited hardware resources. Existing video codecs (H.264 or HEVC) exploit complex compression algorithms to reduce the video size without significant quality degradation. Even with hardware acceleration for video codecs, existing mobile devices still have slow encoding/decoding speed. Scalable video coding (SVC) is the state-of-the-art approach to guarantee robustness to varying network conditions. It divides video data into multiple layers, and video quality increases with the number of received layers. However, running SVC on commodity mobile devices results in large delay due to its high complexity. Therefore, high-resolution videos demand a light-weight codec that can run on mobile devices in a timely manner.

Immersive 360◦ Videos: A significant recent advance in video technology is Virtual Reality (VR) or 360◦ video. It provides panoramic views to give an immersive user experience. The resolution of 360◦ videos can be as high as 7680 × 4320 (8K), which is even higher than the 4K resolution. Existing systems stream 360◦ videos in a similar way as regular videos, where all data of the panoramic view is transmitted. This is not practical, since the throughput demand of streaming entire 360◦ views can exceed 100Mbps. It is also wasteful, since a user only views a small portion of the 360◦ view. To save bandwidth, recent works propose tile-based streaming, which divides the panoramic view into multiple smaller tiles and streams only the tiles within a user's field of view (FoV) predicted based on the recent head position. Users suffer significant video quality degradation due to head movement prediction error, which is hard to avoid. Moreover, existing tile-based streaming approaches cannot run in real time even on the latest smartphones due to hardware and software limitations. It is very challenging to stream 360◦ content on smartphones while avoiding bandwidth wastage and maintaining high video quality.

1.2.2 Mobile Video Analytics

Video analytics requires more hardware resources than video streaming applications because analyzing video content is a computation intensive task. There is an increasing demand for real-time mobile video analytics (e.g., object tracking, new object detection, scene change detection) since it can enable a wide range of applications, such as VR/AR, cognitive assistance, video surveillance, smart driving, unmanned delivery, and cashier-free stores. To achieve seamless performance for these applications, video analytics must run in real time at 30 frames/second (fps) [48]. This poses an interesting system challenge: how to enable real-time video analytics on mobile devices with limited computation resources?

Video analytics has been a widely studied topic. Many existing works focus on accuracy (e.g., by developing better models). There are many deep models for object detection, including FasterRCNN [120], SSD [93], R-FCN [54], and Yolo [119]. These models have become increasingly complex and require very powerful hardware to run. Video analytics consists of two major tasks [88, 115, 146]: detecting where the object is and recognizing what the object is in video frames. In this dissertation, we focus on detecting the object position in the video.

One way to speed up video analytics is to offload it to an edge/cloud server. For example, recent work [88] runs deep models on edge servers equipped with a powerful Nvidia TITAN XP GPU [9], which draws 30× more power than a mobile GPU (e.g., TX2 [6]). However, it is not always practical to assume a powerful edge server nearby or good network connectivity to reach the cloud server. This is particularly challenging in remote areas or congested regions. Moreover, offloading is expensive and may incur large delay arising from the network and server processing. A large delay means that the analysis results are stale for the current video frame, which translates to low accuracy. In addition, offloading raises privacy concerns. Therefore, it is important to investigate how to run real-time video analytics exclusively on mobile devices.

1.3 Challenges

To improve the performance of mobile video applications, we need to handle the following challenges in video transmission, coding and analysis.

1.3.1 Video Streaming

Unpredictable Throughput Fluctuation: Mobile devices receive video data through wireless links. The rate adaptation algorithms used in existing video streaming applications adapt the video bitrate according to the available bandwidth. However, as many measurement studies have shown, the data rate of wireless links can fluctuate widely and is hard to predict. This makes it challenging to adapt video quality based on throughput prediction.

FoV Constrained Streaming: 360◦ videos are pre-encoded and stored at the video server, so their coding delay comes only from decoding. Streaming all pixels of a 360◦ video is wasteful since the user only views a small portion of the video due to the limited FoV. Decoding all pixels also incurs large decoding delay. To reduce bandwidth wastage and decoding delay, it is critical to decide which portion of each 360◦ video frame to transmit.

Streaming Live Video over WiGig: Live videos cannot be pre-encoded in advance, so the coding delay of live 4K videos includes both encoding and decoding. To remove the coding delay altogether, streaming raw video data is one option, but it requires an extremely high data rate. The recent 60GHz technology can provide higher bandwidth than existing wireless networks. Consider a raw 4K video streamed at 30 fps using a 12-bit YUV color space: without any compression, it requires ∼3 Gbps. Our commodity devices with the latest 60GHz technology can only achieve 2.4 Gbps in the best case. With mobility, the throughput can quickly diminish and sometimes the link can be completely broken. Therefore, we need a light-weight codec to compress live 4K video data. Moreover, 60GHz alone is often insufficient to support 4K video streaming since its data rate may drop by orders of magnitude even with small movement or obstruction. We need an effective way to exploit different wireless links (e.g., WiFi and 60GHz links) to improve the total throughput.

1.3.2 Video Coding

Hardware Resource Constrained Video Coding: While existing video codecs (e.g., H.264 and HEVC) achieve high compression rates, they are too slow for 4K or 360◦ video encoding and decoding due to the high resolution. Fouladi et al. [61] show that the YouTube H.264 encoder takes 36.5 minutes to encode a 15-minute 4K video at 24 fps, which is far too slow. For an 8K 360◦ video, decoding on Android phones takes more than 45ms per frame, which means the user can watch the 360◦ video at only about 22fps, below the 30fps playback rate. Therefore, a fast video coding algorithm is needed to stream 360◦ and 4K videos.

1.3.3 Video Analytics

Noisy Object Motion Estimation: Due to the high hardware resource demand, deep object tracking models run at less than 3 fps on mobile devices [90, 146, 153]. Existing works [88] propose using motion tracking to estimate the object position. Object motion estimation is noisy because it is challenging to separate background pixels from object pixels. Moreover, the object size or shape can change significantly due to the movement of the object or camera. Such changes can result in large error if the motion estimation cannot adapt to the object size robustly.

To solve the above mentioned challenges, we propose a novel layered tile-based streaming framework for 360◦ videos, a new layered coding approach for streaming 4K videos and representative mask based motion tracking for video analytics.

Layered Tile-Based 360◦ Video Streaming: We design a novel tile-based layered encoding framework for 360◦ videos. We exploit the spatial and temporal characteristics of 360◦ videos for encoding. Our approach splits the 360◦ video spatially into tiles and temporally into layers. The client runs an optimization routine to determine the video data that needs to be fetched to optimize user QoE. Using this encoding approach, we can send the video portions that have a high probability of being viewed at a higher quality and the portions that have a lower probability of being viewed at a lower quality. By controlling the amount of data sent, the data can be decoded without rebuffering. Our approach saves significant bandwidth while maximizing the user's QoE and decoding the video in a timely manner.

Layered 4K Video Streaming: We design a novel layered video coding scheme for transmitting high-quality video data to mobile devices with low delay.

• Layered Coding: To handle unpredictable data rates, we propose a simple yet effective layered coding design. It can effectively adapt to varying data rates on demand by first sending the base layer and then opportunistically sending more layers whenever the link allows.

• Fast Video Codec: We efficiently exploit the available hardware resources for our video codec, which incorporates our layered coding design. Compared with the existing default video codecs, our codec is fast to run on commodity mobile devices.

• Video Layer Transmission Optimization: With our layered coding, we need to decide how many layers to transmit in addition to the encoding bitrate. We develop efficient optimization algorithms to determine the number of layers to transmit and the encoding bitrate of those layers (a minimal sketch of the layer selection follows this list).
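As a minimal sketch of the layer selection, the client can greedily add enhancement layers while the per-frame transmission budget allows. The layer sizes, link estimate, and deadline below are hypothetical, and the full algorithm in Chapter 4 also chooses encoding bitrates and schedules data across WiFi and WiGig:

def select_layers(layer_sizes_bits, est_throughput_bps, frame_deadline_s):
    # Always send the base layer (index 0); add enhancement layers while the
    # estimated transmission time still fits within the per-frame deadline.
    budget_bits = est_throughput_bps * frame_deadline_s
    sent_bits = 0
    num_layers = 0
    for size in layer_sizes_bits:
        if num_layers > 0 and sent_bits + size > budget_bits:
            break
        sent_bits += size
        num_layers += 1
    return num_layers

# Example: a 4-layer frame, 400 Mbps estimated link, 30 ms per-frame budget.
print(select_layers([6e6, 3e6, 3e6, 3e6], 400e6, 0.030))   # -> 3 layers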

Robust Real-Time Video Analytics: We use estimated motion to speed up video analytics. Our approach is inspired by existing works [88, 123, 150] but goes beyond them in the following aspects: (i) we derive motion from feature maps instead of raw video frames and show that feature maps give more accurate motion estimation because they effectively filter out the background pixels, (ii) we use a combination of feature maps and motion to efficiently generate a mask for motion estimation and size adaptation, and (iii) we significantly speed up motion estimation so that we can run video analytics exclusively on a mobile device.

Specifically, we generate feature maps from the first convolutional layer of deep models [55, 99, 140]. A single convolutional layer has many convolutional filters, each of which produces a feature map. We design an effective metric to select the feature map that gives high tracking accuracy. Running one filter is much faster than executing the whole model; thus, we can extract the selected feature map within 10ms.

To adapt to changes in the object size, we develop an efficient way of generating a representative mask. The representative mask may only include a subset of object pixels, but it can reliably capture the changes in the overall object shape or size. We generate the mask based on both the feature map and optical flow. Instead of extracting the complete mask of the object, we can generate the representative mask in a few milliseconds. To the best of our knowledge, this is the first work that efficiently adapts the object size for motion based object tracking.
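The following Python sketch illustrates the idea of intersecting a motion mask with a feature-map mask; the thresholds and the Farneback optical-flow choice are illustrative assumptions, not Sight's exact parameters:

import cv2
import numpy as np

def representative_mask(prev_gray, curr_gray, feature_map,
                        flow_thresh=1.0, fmap_thresh=0.5):
    # Optical-flow mask: pixels whose motion magnitude exceeds a threshold.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    flow_mask = np.linalg.norm(flow, axis=2) > flow_thresh

    # Feature-map mask: strong activations, resized to the frame resolution.
    fmap = cv2.resize(feature_map, (curr_gray.shape[1], curr_gray.shape[0]))
    fmap_mask = fmap > fmap_thresh * fmap.max()

    # Representative mask: intersection of the two masks.
    return np.logical_and(flow_mask, fmap_mask)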

1.5 Summary of Contributions

We summarize the major contributions of this dissertation as follows.

360◦ Video Streaming:

• We develop a novel tile-based layered coding that exploits spatial and temporal characteristics of 360◦ videos to reduce the decoding overhead while saving bandwidth and accommodating head movement prediction error.

• We implement our 360◦ video streaming scheme Rubiks as an Android app. Extensive experimental results show that Rubiks improves user QoE by 69% and saves bandwidth by 35% over existing approaches for 4K videos. It provides 36% improvement in QoE and 49% in bandwidth savings for 8K videos.

4K Video Streaming:

• We propose a new layered coding design for 4K video streaming. The layered design divides video data into layers. It can effectively adapt to unpredictable data rates on demand by first sending the base layer and then opportunistically sending more layers when the link allows. The layered coding is robust to throughput fluctuation.

• We design optimization algorithms to adapt the number of layers to transmit according to the available throughput. Our algorithms include a new control dimension: which video layers to transmit.

• We implement fast video codecs for our layered coding design. Our video codecs can compress video data and are fast to run.

• We implement our live 4K video streaming scheme Jigsaw on commodity mobile devices. Our evaluation results show that Jigsaw can stream 4K live videos at 30 FPS using WiFi and WiGig under various mobility patterns.

Video Analytics:

• We propose a robust motion estimation scheme. It can efficiently and accurately extract motion from feature maps. By filtering out the background pixels, feature maps provide more reliable motion estimation than the raw video frames. Our scheme can also efficiently adapt to changes in object size/shape by generating a representative mask and using the mask (which may only contain a subset of object pixels) to adjust the bounding box size.

• We implement our video analytics system Sight. It runs locally on mobile devices to perform real-time analytics on 30 fps videos without powerful edge/cloud servers or network connectivity. Sight reduces the average hardware resource usage by 45% for single-object tracking and 32% for multi-object tracking without degrading accuracy.

1.6 Dissertation Outline

Chapter 2 overviews the related work. Chapter 3 describes our 360◦ video streaming approach in detail. Chapter 4 fully explains our live 4K video streaming work. Chapter 5 shows the design of our real-time video analytics system. Chapter 6 concludes this dissertation and describes the future work.

Chapter 2

Related Work

In this chapter, we review existing works related to video streaming algorithms, wireless technologies, video coding and video analytics.

2.1 Video Streaming Algorithms

Video Adaptation for Regular Videos: There has been a lot of recent work on video streaming under limited and fluctuating network throughput [38, 71, 77, 87, 100, 104, 148]. A video is divided into chunks, each of which contains a few seconds of video content. The client adapts the bitrate of video chunks according to the network condition. These works try to maximize the user QoE, which can be defined using multiple metrics: video bitrate, bitrate changes between successive chunks, and rebuffering. Yin et al. [148] propose an MPC-based optimization framework for video streaming. It casts the problem as a utility optimization, where the utility is defined as the weighted sum of the above metrics over the next K video chunks. FESTIVE [77] balances stability and efficiency, and provides fairness among video players by performing randomized chunk scheduling and bitrate selection of future chunks. Pensieve [100] trains a neural network that selects the bitrate of future chunks based on the performance data collected from video players. However, none of these works considers the transmission of high-resolution videos.

Most of these works only investigate video-on-demand (VoD) services in which entire videos are encoded and stored at the server side before users start downloading. Compared with VoD, live video streaming [56, 111, 147] is more delay sensitive. Existing live video streaming approaches stream video content in chunks, and users have to wait a few seconds before a video chunk is ready to play. In our work on 4K video streaming, we focus on supporting live video streaming in which the delay of a video frame is as small as tens of milliseconds. Such live streaming is crucial for interactive applications, like gaming, virtual reality and augmented reality [44, 78].

360◦ Video Streaming Algorithms: Recently, there have been some works targeting 360◦ video content streaming. In a 360◦ video, a user looks at only some portion of the video at any given time, so there is an opportunity to save bandwidth without sacrificing quality. Qian et al. [114] propose a scheme that divides an entire 360◦ frame into several smaller rectangular tiles and only streams the tiles that overlap with the predicted FoV. This approach can lead to rebuffering or a blank screen in case of inaccurate head movement prediction. Hosseini et al. [70] propose an approach where the video is divided into multiple tiles and the tiles that are more likely to be viewed are streamed earlier. Bao et al. [40] propose an optimization framework that takes into account the head movement prediction error and requests some additional tiles to account for the prediction error. However, none of these approaches was implemented on smartphones, so their feasibility and performance on smartphones are unclear. POI360 [144] proposes adapting the compression ratio of video tiles according to the network throughput, but it still suffers from slow decoding since the encoded rate of the tiles does not affect decoding time. Moreover, POI360 is not implemented using the video codecs available on commercial smartphones. Recently, Liu et al. [96] propose using SVC for 360◦ videos. However, SVC is currently not supported on smartphones, so they do not have a real implementation.

2.2 Wireless Technologies

Streaming Regular Videos over Conventional Wireless Networks: Wireless links are not reliable and have large fluctuations in available throughput. Flexcast [37] incorporates a rateless code into the existing video codec such that video quality does not drop drastically when the network condition varies. Softcast [74] exploits 3D DCT to achieve a similar goal. Parcast [94] investigates how to transmit more important video data over more reliable wireless channels. PiStream [145] uses physical layer information from LTE links to enhance video rate adaptation. However, these works do not study high-resolution video streaming (e.g., 4K). Due to the large data size, existing video coding cannot code high-resolution video content in real time. Furion [83] uses multiple codec instances to encode and decode VR video content in parallel, but it supports a much lower resolution than 4K.

Some dedicated hardware can support high-resolution videos in real time [89], but it is generally not available on laptops and mobile devices. Our work focuses on supporting efficient high-resolution video encoding, transmission, and decoding on commodity devices.

Streaming High-Resolution Videos over 60GHz: Owing to the large bandwidth at 60GHz, streaming uncompressed video has become a key application for WiGig [132, 142]. Choi et al. [52] propose a link adaptation policy that allocates different amounts of resources to different data bits in a pixel such that the total amount of allocated resources is minimized while maintaining good video quality. He et al. [68] encode an uncompressed video into multiple descriptions using RS coding; the video quality improves as more descriptions are received. [52, 68] use unequal error protection to protect different bits in a pixel based on importance. Shao et al. [128] compress pixel difference values using run length coding, which is difficult to parallelize since the run length codes of different pixels cannot be known beforehand. Singh et al. [131] propose to partition adjacent pixels into different packets and adapt the number of pixels to send based on throughput estimation, but video quality degrades significantly when throughput drops since there is no compression. Li et al. [86] develop an interface that exposes WiGig and WiFi link information to the application layer so that the video server can determine an appropriate resolution of the video to send. In general, these works do not consider encoding time, which can be significant for 4K videos. They also do not explore how to leverage multiple links for video streaming, which our system addresses.

Other Wireless Approaches for High-Resolution Videos: Developing other high-bandwidth, reliable wireless networks is also an interesting direction for supporting 4K video streaming. A recent router [13] claims to support Gbps-level throughput by exploiting one 2.4GHz and two 5GHz bands. However, it still cannot support raw 4K video transmission. Our system can efficiently fit 4K video into this throughput range.

2.3 Video Coding

Layered Coding: Scalable video coding (SVC) is an extension of the H.264 standard. It divides the video data into a base layer and enhancement layers, and the video quality improves as more enhancement layers are received. SVC uses layered coding to achieve robustness under varying network conditions, but this comes at the cost of high computational complexity, since SVC needs additional prediction mechanisms such as inter-layer prediction [126]. Due to this higher complexity, SVC has rarely been used commercially even though it has been standardized as an H.264 extension [95]. None of today's smartphones has hardware SVC encoders/decoders. Therefore, running SVC on commodity mobile devices incurs large delay.

Video Encoding Schemes for 360◦ Videos: There have been many works (e.g., [46, 69, 75, 97, 133, 139]) on video encoding where some parts of the video, commonly referred to as the Region of Interest (ROI), are encoded at a higher bitrate while other parts are encoded at a lower bitrate. This is done for regular videos to account for the fact that some regions of the video contain more critical or useful information and should be encoded at a higher bitrate. This encoding is not suitable for 360◦ videos since the user can change the FoV, so it does not scale: one would have to handle a large number of possible ROIs. Some works [63, 108, 125] use SVC to encode the ROI with high quality. These works focus on optimizing user experience by exploiting SVC to reduce transmission delay, increase the quality of the region of interest, and avoid rebuffering.

Our work for 360◦ videos is inspired by these works, but differs from them in that it specifically targets 360◦ videos and incorporates tile-based coding with layered coding to achieve efficient decoding, bandwidth savings, and robustness against prediction error.

2.4 Mobile Video Analytics

Object Detection Acceleration: There have been many deep models for object detection (e.g., SSD [93], YOLO [118], R-FCN [54], Faster R-CNN [120], Mask R-CNN [66]). Higher object detection accuracy requires more complex models, which incur large latency on mobile devices with commodity hardware. We divide the existing deep model acceleration techniques into three classes: (i) Deep model compression: Weights in deep models have high redundancy. Compressing weights via pruning and quantization [64] can speed up deep model running time. Liu et al. [92] train a deep network to learn how to compress the layers in a deep model. (ii) Edge server offloading: Glimpse [48] offloads key frames to a nearby edge server which runs the expensive SIFT feature extraction. However, its end-to-end latency can be higher than 400ms because of the large offloading overhead. Liu et al. [88] offload the object detection task to the edge server, but their system needs a powerful edge server to support real-time object tracking. DARE [91] and DeepDecision [115] adapt the system configuration for offloading (e.g., video resolution, frame rate, deep model complexity) to find the best trade-off between object detection accuracy and end-to-end latency. MARVEL [47] solves the object detection problem by matching an offloaded image, captured by the mobile device camera, with a database of images tagged with object labels stored at the server. NeuroSurgeon [79] minimizes latency by splitting the deep model into two parts and running one part on the edge server. (iii) Cache-based acceleration: DeepMon [73] reuses the feature maps of regions similar to those in previous frames. DeepCache [146] finds similar regions using the motion compensation algorithm of the H.264 video codec, so it can also exploit similar regions at different locations.

Deep model compression and caching can be applied to our work to speed up inference. Offloading is not always feasible, as it needs a powerful edge server and good network connectivity, and it also raises privacy concerns. Therefore we focus on supporting real-time video analytics on mobile devices without offloading.

Motion-Assisted Object Tracking: It is challenging to run a deep model at high fps. However, video frames have high temporal locality since objects tend to have continuous motion across consecutive frames. Exploiting temporal information has the potential to reduce the inference frequency. Liu et al. [88] exploit this insight along with offloading to a powerful edge server to enable real-time video analytics. Since the inference runs on the server, its latency is short and the system can afford to run frequent inference. Therefore, it simply uses the average motion of all pixels in the bounding box. But as we show, average motion based tracking is insufficient when the inference is done on a mobile device, which incurs a large delay. In the context of video surveillance, [123, 150] use motion vectors in the H.264/AVC encoded bit-stream to track moving objects. However, using motion vectors does not work when the camera moves. [127, 129] use optical flow to track the movement of objects, but it is very sensitive to object deformation, camera movement, lighting condition changes, etc. Our work goes beyond the existing work by efficiently and reliably generating an object mask to achieve accurate motion estimation and accommodate changes in object size and shape.

Feature Map-Assisted Object Tracking: Convolutional filters [51, 55, 99, 113, 140] in deep models remove noise from the background, so object motion has less noise in the feature maps than in the original frame. Zhu et al. [153] warp feature maps from neighboring frames using optical flow and combine them with the feature maps of the current frame; combining feature maps improves object detection accuracy. Liu et al. [90] incorporate the feature maps from previous frames by adding LSTM layers into existing deep models. However, these works improve tracking accuracy at the cost of slower speed. It remains open how to achieve both high accuracy and fast speed, which is required for real-time video analytics on mobile devices.

Chapter 3

Practical 360◦ Video Streaming for Smartphones

3.1 Background for 360◦ Video Streaming

3.1.1 Existing Streaming Framework

DASH [134] is widely used for video streaming over the Internet due to its simplicity and compatibility with existing CDN servers. In this framework, a video is divided into multiple chunks with an equal play duration (e.g., a few seconds). A 360◦ video chunk is spatially divided into equal size portions, called tiles, generally 15-40 tiles [62]. Each tile is encoded independently at a few bitrates.

In tile based streaming frameworks, the client only requests a subset of tiles according to head movement prediction and throughput estimation. Due to independent encoding, the client can still decode this subset of tiles successfully. The client constructs the 360◦ frame from the decoded tiles and displays the FoV on the screen.

Fig. 3.1 shows the architecture of existing 360◦ video streaming systems like YouTube and Facebook. The user sends video requests to the video server. In the request, the user has to specify which video segment and which bitrate to request. In tile based streaming frameworks, the user also needs to specify the tiles to request. The video server transmits the 360◦ video data to the user, who is connected to the Internet via a WiFi access point. The user wears a mobile VR headset to watch 360◦ videos. After receiving the 360◦ video data, the smartphone displays the video content within the FoV, which is determined by the head orientation. The performance of existing 360◦ streaming systems can be affected by multiple factors, including smartphone computational resources, network throughput, and head movement prediction accuracy.

Figure 3.1: Architecture of Existing 360◦ Streaming Systems

3.1.2 H.264 and HEVC Codecs

H.264 [24] is an industry standard for video compression. Frames are encoded at the macroblock level; a typical macroblock size is 16 × 16 pixels. H.264 uses motion prediction to reduce the size of the encoded data. As frames within a chunk of a few seconds are highly correlated in the temporal domain, this greatly reduces the data size. HEVC [25] is the successor to H.264. It divides a video frame into independent rectangular regions, each of which can be encoded independently. A region is essentially a tile as used in tile based streaming frameworks. Each region is encoded at the 64 × 64 coding tree unit (CTU) level. Due to the larger block size, HEVC achieves higher compression than H.264.

HEVC is more suitable for tile based streaming approaches. All tiles in HEVC are contained in a single encoded video file and can be decoded in parallel using one decoder instance. The video file is still decodable even if we remove some tiles. In contrast, H.264 has to encode tiles into separate video files. If the user requests multiple tiles, H.264 needs to decode multiple video files. The smartphone only allows a small number of concurrent video decoders (e.g., 4 decoders on Samsung S7 and Huawei Mate 9), as explained in Sec. 3.2. When the number of video files is greater than the number of concurrent hardware decoders, some video files have to be decoded one after another, which results in longer decoding delay. Therefore, H.264 is not scalable for tile based streaming. Even for HEVC, our measurement results in Sec. 3.2 show that it cannot decode all video tiles in real time for 8K videos. Thus, it is necessary to explore how to avoid sending all video tiles without significant video quality degradation.

3.1.3 Scalable Video Coding

Scalable Video Coding (SVC) [42] is an extension of H.264. It is a layered scheme where a high quality video bit stream is composed of lower quality subset bit streams. A subset bit stream is obtained by removing some data from the higher quality bit stream such that it can still be played, albeit at a lower spatial and/or temporal resolution. SVC could potentially save bandwidth for tile based streaming by adapting the video quality of each tile based on the likelihood of viewing it. However, Android currently does not support this extension [20].

3.2 Motivation

In this section, we perform extensive measurements to understand the performance of existing 360◦ video streaming approaches on the Samsung S7. We identify several significant limitations of the existing approaches and leverage these insights to develop a practical tile based streaming scheme tailored for smartphones. First, we introduce the media codecs available on Android.

3.2.1 Real-Time Media Codecs

There are two main options for decoding videos on Android: ffmpeg-android and MediaCodec.

ffmpeg-android: This version of ffmpeg is tailored for Android and is completely software based. While ffmpeg supports an unlimited number of threads, it cannot decode videos in real time on smartphones because ffmpeg cannot perform hardware decoding [33]. Instead, it decodes everything in software, which is very slow. For example, decoding a 1-second video chunk with resolution 3840 × 1920 and 4 tiles takes more than 3 seconds, which causes long rebuffering time.

MediaCodec: Android provides the MediaCodec class for developers to encode/decode video data. It can access low-level multimedia infrastructure such as hardware decoders, which decode video much faster than ffmpeg-android. First, the decoder is set up based on the video format, such as H.264 or HEVC. Setting up the decoder enables access to the input and output buffers of the corresponding codec. The data that needs to be decoded is pushed into the input buffers and the decoded data is collected from the output buffers. MediaCodec is the best option for video decoding on Android as it uses dedicated hardware resources. With the HEVC decoder of MediaCodec, decoding a 1-second video chunk with resolution 3840 × 1920 and 36 tiles takes around 0.5 seconds.

Observation: Only hardware-accelerated media codecs can support real-time decoding.

3.2.2 Limitations of Existing Approaches

We focus on MediaCodec as ffmpeg-android is infeasible for real-time decoding. We use the following baselines in our analysis: (i) YouTube [152], which streams all data belonging to the whole 360◦ frames to the client; (ii) Naive Tile-Based, which divides 360◦ frames into 4 tiles and streams all tiles to the client; (iii) FoV-only [114], which divides the video into 36 tiles and only streams the tiles predicted to be in the user's FoV; and (iv) FoV+ [40], which divides the video into 36 tiles and streams data in both the FoV and the surrounding region, where the surrounding region is selected based on the estimated prediction error of the FoV: if the estimated head movement prediction error is e degrees along the X axis, it extends the FoV width on both sides of the X axis by e, and similarly for the Y axis. We quantify the performance using three metrics: (i) Decoding Time, (ii) Bandwidth Savings, and (iii) Video Quality.

In our experiments, we use HEVC as the decoder. A recent measurement study [152] finds that existing 360◦ streaming systems (e.g., YouTube and Oculus) stream the entire 360◦ frames. H.264 can only decode one tile from an input video file, while HEVC can include multiple video tiles in the input file and decode them in parallel. Due to the limited number of concurrent hardware decoders on smartphones, H.264 has to decode some tiles serially when decoding multiple video tiles. HEVC has faster decoding speed in tile based streaming frameworks than H.264 since it can decode all requested video tiles in parallel. Our YouTube baseline follows the existing 360◦ streaming systems by streaming the entire 360◦ frames and uses the faster decoder available on smartphones.

3.2.2.1 Decoding Time

The feasibility of current approaches for streaming high quality 360◦ videos depends on whether they can decode the data and display it in time. To answer this question, we decode 4K and 8K videos for both the YouTube and Naive Tile-Based approaches on a smartphone. The input videos have a frame rate of 30fps and 30 chunks each, where each chunk contains 1 second of video. We run our experiments 5 times for the same video.

The YouTube approach directly encodes the entire 360◦ chunk into a single tile. The Naive Tile-Based approach divides each 360◦ frame into 4 video tiles and encodes them into independent video files; we run 4 video decoders simultaneously to decode the video tiles in parallel. We only show the results of the Naive Tile-Based approach with 4 tiles, since this gives the best performance: further increasing the number of tiles can only be slower because at most 4 decoding threads can run in parallel.

Figure 3.2: Decoding time and bandwidth savings. (a) Decoding Time. (b) Bandwidth Savings.

We first measure the decoding time of 4K 360◦ videos. The resolution of the test video is 3840 × 1920. Fig. 3.2(a) shows that both YouTube and Naive Tile-Based approaches can decode a 4K video chunk in 0.55 seconds on average. So 4K videos can be decoded and displayed in time.

Next, we measure the decoding time of 8K 360◦ videos. The resolution of the test video is 7680 × 3840. Fig. 3.2(a) shows the average decoding time of each chunk. YouTube needs 1.4 sec on average to decode one chunk, which results in rebuffering since the decoding speed cannot keep up with the playback speed. The Naive Tile-Based approach needs 1.3 sec on average to decode one chunk. It speeds up decoding by utilizing parallel threads. However, this is still insufficient to support real-time decoding for 8K videos: using parallel threads cannot reduce the amount of data to read from the video decoder output buffers, which limits the performance gain.

Observation: Traditional decoding and existing tile-based decoding cannot support 8K or higher video resolution.

3.2.2.2 Bandwidth Savings

YouTube wastes a lot of bandwidth because it streams entire frames while a user views only a small portion. FoV-only and FoV+ can save bandwidth. Fig. 3.2(b) shows the bandwidth savings of FoV-only and FoV+ compared with YouTube. We test the bandwidth savings for the same video and 10 different head movement traces. For FoV-only, we use the oracle head movement, calculated from gyroscope readings, to estimate the maximum possible bandwidth savings. The video bitrate is set to a constant value for all chunks and approaches. Tiles with high motion have larger size than those with lower motion, so if a user views the tiles with larger motion, the bandwidth saving is smaller. Since users have different viewing behaviors, we observe different bandwidth savings across head movement traces. The FoV-only approach can save up to 80% bandwidth but may incur significant video quality degradation due to prediction error. The FoV+ approach saves 18% less bandwidth than FoV-only, but incurs smaller degradation in video quality.

Observation: Due to the limited FoV, a significant amount of bandwidth can be saved.

3.2.2.3 Video Quality

To save bandwidth, the tiles outside the predicted FoV are not streamed in the FoV-only approach. We use linear regression to predict head movement. As explained in Sec. 3.6, our predictor uses the head movement in the past 1-second window to predict the head movement for the future 2-second window. When the head movement prediction has large error, the user sees blank areas, which results in a very poor viewing experience. In the FoV+ approach, additional tiles are streamed to account for head movement prediction error, but this error estimate itself can be inaccurate.
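A minimal sketch of such a predictor for a single angle is shown below (ignoring the 360◦ wrap-around of yaw, which a real implementation must handle):

import numpy as np

def predict_head_angle(timestamps, angles, horizon):
    # Fit a line to the orientation samples from the past window (e.g., the
    # last 1 second) and extrapolate 'horizon' seconds beyond the last sample.
    slope, intercept = np.polyfit(timestamps, angles, 1)
    return slope * (timestamps[-1] + horizon) + intercept

# Example: yaw sampled every 100 ms over the past second, predicted 2 s ahead.
t = np.arange(0.0, 1.0, 0.1)
yaw = 5.0 * t + np.random.normal(0.0, 0.5, t.size)   # slow drift plus sensor noise
print(predict_head_angle(t, yaw, 2.0))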

Figure 3.3: Video quality of existing approaches. (a) Video quality; (b) Head prediction error; (c) Head trace (X Axis); (d) Head trace (Y Axis).

We quantify the video quality using SSIM [141], defined as the structural difference between the streamed and original user's FoV. Fig. 3.3(a) shows the SSIM for 10 different head movement traces and an example 4K video. The video bitrate is set to the highest value in our experiments. YouTube can always achieve SSIM close to 1, while the average SSIM of FoV-only and FoV+ is only 0.77 and 0.87, respectively. The average prediction error along the X axis and Y axis for the test head movement traces is 30 and 9 degrees, respectively. On average, our head movement traces have prediction error larger than 50◦ (or 10◦) along the X axis (or Y axis) for around 20% of the time. In the FoV-only approach, we observe that 20% of chunks have 23% blank areas on average along the X axis and 18% along the Y axis. By including extra tiles, the FoV+ approach can reduce the blank areas along the X axis to 14% and along the Y axis to 11% on average. Thus, even sending extra data cannot avoid blank areas, since the head movement prediction error is itself unpredictable. Fig. 3.3(c) and Fig. 3.3(d) show one example head movement trace and its corresponding prediction error. We observe that when the head moves quickly, the prediction error goes up significantly. For example, the prediction error for the X axis increases to 100 degrees at time 22 sec and that for the Y axis increases to 40 degrees at time 12 sec. We find that fast head movement mainly happens when the user randomly explores different scenes in the video or follows an interesting fast-moving object.

Observation: Streaming a few extra tiles is not robust enough to head movement prediction error.

Figure 3.4: Design Space of 360◦ Video Streaming Algorithms (axes: video quality, bandwidth savings, and decoding speed; schemes: YouTube, FoV-only, FoV+, Rubiks, and Optimal).

3.2.3 Insights From Existing Approaches

Figure 3.4 summarizes the existing algorithms. YouTube achieves the highest video quality but at the expense of bandwidth and decoding speed. FoV-only and FoV+ save bandwidth and increase decoding speed, but suffer from degraded video quality. A desirable algorithm should simultaneously optimize all three metrics: bandwidth saving, decoding speed, and video quality.

An important design decision is how to encode 360◦ videos to optimize these three metrics. Such a scheme should (i) adapt the data to stream based on the FoV to save bandwidth, (ii) support fast decoding using a limited number of threads, and (iii) tolerate significant head movement prediction error. (i) suggests a tile-like scheme is desirable. (ii) suggests we do not have the luxury of allocating a tile to each decoding thread, but should use a different task-to-thread assignment. (iii) suggests we should still stream entire video frames, albeit at a lower resolution, in case of unpredictable head movement.

3.3 Challenges

360◦ videos are streamed in a similar way as regular videos: the client requests a short duration of data, typically 1-2 seconds, at a time. However, 360◦ videos have much higher bitrates because the spherical images of 360◦ videos contain more pixels and higher fidelity in order to provide a good viewing experience when watched from a close distance. Typical resolutions are 4K to 8K. Streaming all pixels in 360◦ videos is wasteful since the user only views a small portion of the video due to the limited FoV. Moreover, streaming all pixels also creates a significant burden for a smartphone, which has limited storage, computation resources and power.

One recent approach [114] tailored for 360◦ videos is tile-based streaming. In this approach, each 360◦ frame is divided into smaller non-overlapping rectangular regions called tiles. Each tile can be decoded independently. The client requests those tiles that are expected to be in the FoV using head movement prediction techniques. This reduces the decoding and bandwidth usage at the receiver. However, whenever there is a prediction error, the user either sees part of the screen as blank or experiences rebuffering due to missing data. This can severely degrade the user QoE. Moreover, these works explore tile-based streaming either in simulation or in desktop implementations but not on smartphones. We observed in our experiments that 8K 360◦ videos (which are widely available on video websites like YouTube) streamed using either regular or tile-based streaming techniques cannot be decoded and displayed to the user in time due to resource constraints on smartphones. This limitation primarily stems from how the video data is encoded: encoding produces independently decodable data segments that are very rich in data, so decoding them takes a long time.

3.4 Our Approach

We propose a novel encoding framework for 360◦ videos to simultaneously optimize bandwidth saving, decoding time, and video quality. Each video chunk is divided spatially into tiles and temporally into layers. This two-dimensional splitting allows us to achieve the following two major benefits.

• We can stream different video portions at different bitrates and with different numbers of layers. The video portions with a high probability of viewing are streamed at a higher quality. The ones with a lower chance of viewing are sent at a lower quality instead of not sending them at all. This allows us to save network bandwidth while improving robustness against head movement prediction error.

• By managing the amount of data sent for tiles, we can control the decoding time. For 8K videos, we cannot decode all tiles in time due to hardware constraints, so we can selectively send the tiles with a lower viewing probability to improve efficiency.

The performance of 360◦ video streaming is determined by video coding and video rate adaptation. Below we examine them in turn.

3.4.1 Video Encoding

We propose a tile-based layered video encoding scheme. A 360◦ video chunk is spatially divided into tiles, which are further temporally split into layers. We remove redundant I-frames to eliminate the encoding overhead due to layering.

Spatial Splitting: As shown in Fig. 3.5, in the spatial domain, each 360◦ frame is divided into multiple equal-sized regions, called tiles. These tiles are encoded independently so that each tile can be decoded individually. Each tile is encoded at several bitrates so that a client can control the quality of the tiles.

Temporal Splitting: Each tile consists of N frames distributed across M layers. The base layer includes the first of every M frames, the second layer includes the second of every M frames, and so on. For example, for a video chunk consisting of 16 frames, frames 1, 5, 9, 13 form the base layer; frames 2, 6, 10, 14 form the second layer; frames 3, 7, 11, 15 form the third layer; and frames 4, 8, 12, 16 form the fourth layer. Fig. 3.5 shows how temporal splitting is applied to each tile; we consider the highlighted tile. Each tile can be decomposed into M layers by distributing its N frames as described above. We let n_{tl} denote the t-th tile at the l-th layer. This is the granularity at which we encode video data.
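To make the frame-to-layer assignment concrete, the following Python sketch (our own illustration, not the Rubiks encoder itself) distributes a tile's frames round-robin across M temporal layers, reproducing the example above for M = 4.

```python
def split_into_layers(frames, num_layers=4):
    """Distribute a tile's frames round-robin across temporal layers.

    Layer j (0-indexed) receives frames j, j+M, j+2M, ... so that the
    base layer (j=0) holds frames 1, 5, 9, 13 of a 16-frame chunk.
    """
    layers = [[] for _ in range(num_layers)]
    for idx, frame in enumerate(frames):
        layers[idx % num_layers].append(frame)
    return layers

# Example: a 16-frame chunk, frames numbered 1..16.
chunk = list(range(1, 17))
for j, layer in enumerate(split_into_layers(chunk), start=1):
    print("layer", j, ":", layer)   # layer 1 : [1, 5, 9, 13], ...
```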

Figure 3.5: Spatial and temporal splitting of a 360◦ chunk

Reducing Encoding Overhead: In an encoded video file, the first frame is encoded as an I-frame, which is decoded independently and is large in size. The subsequent frames are encoded as B-frames or P-frames, which reference neighboring frames to significantly reduce the frame size. We generate M independently decodable video files corresponding to the M layers of each chunk, and each layer has a separate I-frame. In comparison, the YouTube approach generates a single I-frame since it encodes all data in a single file.

To eliminate this coding overhead, we remove the I-frames from the enhancement layers as follows. We insert the first frame from the base layer at the beginning of each enhancement layer. After encoding, the I-frame of each enhancement layer can be removed since it is identical to the first frame in the base layer. When decoding the video, we just need to copy the I-frame from the base layer to the beginning of each encoded enhancement layer, thereby reducing the video size by removing the redundant I-frames.
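The sketch below illustrates this idea at a conceptual level, treating an encoded layer as a list of frame-level bitstream units rather than a real HEVC bitstream; the function and variable names are our own.

```python
def strip_redundant_iframe(encoded_enh_layer):
    """Server side: drop the leading I-frame unit from an encoded
    enhancement layer; it is identical to the base layer's I-frame
    because the same first frame was prepended before encoding."""
    return encoded_enh_layer[1:]

def restore_iframe(base_layer, stripped_enh_layer):
    """Client side: copy the base layer's I-frame unit to the front of
    the stripped enhancement layer so it becomes decodable again."""
    return [base_layer[0]] + stripped_enh_layer

# Toy example with frame-level units represented as strings.
base = ["I1", "P5", "P9", "P13"]                # base layer of a 16-frame chunk
enh2 = ["I1", "P2", "P6", "P10", "P14"]         # layer 2 encoded with the prepended frame
sent = strip_redundant_iframe(enh2)             # what is stored and transmitted
decodable = restore_iframe(base, sent)          # reconstructed before decoding
assert decodable == enh2
```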

3.4.2 360◦ Video Rate Adaptation

Despite significant work on rate adaptation, 360◦ video rate adaptation is a new and under-explored topic. Unlike traditional rate adaptation, where the user views the entire frame, a user only views a small portion of a 360◦ video. Therefore, the high-level problem is to select which portion to stream and at what rate, so as to maximize the user QoE. This is challenging because of unpredictable head movement. We use a Model Predictive Control (MPC) based framework [148], which can efficiently optimize the user QoE even when network throughput fluctuation is unpredictable. Our optimization takes the following inputs: the predicted FoV center, the estimated FoV center prediction error, the predicted throughput and the buffer occupancy. It outputs the number of tiles in each layer and the bitrate of each tile. We first introduce our MPC framework, and then describe how to compute each term in the optimization objective.

3.4.2.1 MPC-based Optimization Framework

To handle random throughput fluctuation, our optimization framework optimizes the QoE of multiple chunks in a future time window. Given the predicted network throughput during the next w video chunks, it optimizes the QoE of these w video chunks. The QoE is a function of the bitrate, the number of tiles to download for each layer, and the FoV. This can be formulated as follows:

    \max_{(r_i, e_i),\, i \in [t, t+w-1]} \; \sum_{i=t}^{t+w-1} QoE_i(r_i, e_i, c_i)    (3.1)

where QoE_i denotes the QoE of chunk i, w denotes the optimization window size, r_i denotes the bitrate of the tiles to download for chunk i, and e_i is a tuple whose element e_i^j denotes the number of tiles to download for the j-th layer in the i-th chunk. Since a user's FoV varies across frames in a chunk, we explicitly take that into account by computing the QoE based on c_i^k = (x_i^k, y_i^k), which denotes the X and Y coordinates of the FoV center of frame k in chunk i. We search for the (r_i, e_i) that maximizes the objective within the optimization window, and then request the data for chunk t according to the optimal solution. In the next interval, we move the optimization window forward to [t + 1, t + w] to optimize the next chunk t + 1.
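A minimal sketch of this rolling-horizon search is shown below; the helper `qoe_fn` and the candidate sets are assumptions, and for simplicity the same decision is applied to every chunk in the window (cf. the temporal constraint in Sec. 3.4.2.5).

```python
import itertools

def mpc_select(chunk_id, window, candidate_rates, candidate_tiles,
               predicted_fov, qoe_fn):
    """Pick a (rate, tile-count tuple) for chunk `chunk_id` by maximizing
    the summed QoE over the next `window` chunks.  `qoe_fn(i, r, e, fov)`
    is assumed to return the QoE of chunk i under decision (r, e)."""
    best_decision, best_qoe = None, float("-inf")
    for r, e in itertools.product(candidate_rates, candidate_tiles):
        total = sum(qoe_fn(i, r, e, predicted_fov[i])
                    for i in range(chunk_id, chunk_id + window))
        if total > best_qoe:
            best_decision, best_qoe = (r, e), total
    return best_decision  # only chunk `chunk_id` is requested with this decision
```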

3.4.2.2 User QoE

Next, we define the user QoE metric. It is widely recognized that the user perceived QoE for a video chunk is determined by the following factors: video quality, quality changes, and rebuffering time [100, 136, 148].

Video quality: Each video chunk has K frames. The quality of frame k in chunk i is a function of the bitrate r_i, the number of tiles to download e_i, and the FoV center c_i^k. We let h(r_i, e_i, c_i^k) denote this frame quality. By averaging the quality across all frames in the current chunk, we get the quality of chunk i as follows:

    f_1(r_i, e_i, c_i) = \frac{1}{K} \sum_{k=1}^{K} h(r_i, e_i, c_i^k)    (3.2)

Note that different frames may have different FoVs, so their quality is determined by their corresponding FoV center c_i^k.

Quality changes: The video quality change between two consecutive chunks is defined as

    f_2(r_i, r_{i-1}, e_i, e_{i-1}, c_i, c_{i-1}) = |f_1(r_i, e_i, c_i) - f_1(r_{i-1}, e_{i-1}, c_{i-1})|    (3.3)

where (r_i, e_i) and (r_{i-1}, e_{i-1}) represent the bitrates and numbers of tiles to download for chunk i and chunk i-1, respectively.

Rebuffering: To compute the rebuffering time, we observe that the chunk size depends on the bitrate and the set of tiles downloaded. Let v_i(r_i, e_i) denote the chunk size. We start requesting the next chunk after the previous chunk has been completely downloaded. The buffer occupancy has a unit of seconds. Let B_i denote the buffer occupancy at the time of requesting chunk i. Each chunk contains L seconds of video data. Let W_i denote the predicted network throughput while downloading chunk i. Then, the buffer occupancy when requesting chunk i+1 can be calculated as:

    B_{i+1} = \max\left(B_i - \frac{v_i(r_i, e_i)}{W_i}, 0\right) + L    (3.4)

The first term indicates that we play v_i(r_i, e_i)/W_i seconds of video data while downloading chunk i. Afterwards, L seconds of video data are added to the buffer. The rebuffering time of chunk i is

    f_3(r_i, e_i) = \max\left(\frac{v_i(r_i, e_i)}{W_i} - B_i, 0\right) + \max(\tau_i - L, 0)    (3.5)

where \tau_i denotes the decoding time of chunk i and is derived from our measurements. The first term denotes the rebuffering time incurred due to slow downloading and the second term denotes the rebuffering time incurred due to slow decoding. The expression reflects the fact that chunk i's decoding starts after everything ahead of the chunk has finished playing out. Note that we can start playing a chunk even if it is only partially decoded.

Putting them together, we compute the QoE of chunk i as follows:

    QoE_i(r_i, e_i, c_i) = \alpha f_1(r_i, e_i, c_i) - \beta f_2(r_i, r_{i-1}, e_i, e_{i-1}, c_i, c_{i-1}) - \gamma f_3(r_i, e_i)    (3.6)

where \alpha, \beta and \gamma are the weights of video quality, quality changes and rebuffering, respectively. The latter two terms are negative since we want to minimize them.
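The sketch below puts Equations (3.2)-(3.6) together in Python. It is illustrative only: `tile_quality` maps each tile to a quality score and stands in for the per-tile quality q(·) defined later in Eq. (3.7), and all inputs (FoV tile sets, throughput, decoding time) are assumed to be given.

```python
def chunk_quality(frame_fovs, tile_quality):
    """f1 (Eq. 3.2): average per-frame quality; each frame's quality is
    the mean quality of the tiles inside its FoV (cf. Eq. 3.7)."""
    per_frame = [sum(tile_quality[t] for t in fov) / len(fov)
                 for fov in frame_fovs]
    return sum(per_frame) / len(per_frame)

def quality_change(q_cur, q_prev):
    """f2 (Eq. 3.3): absolute quality change between consecutive chunks."""
    return abs(q_cur - q_prev)

def next_buffer(buffer_sec, chunk_bytes, throughput, L):
    """Buffer update (Eq. 3.4): drain while downloading, then add L sec."""
    return max(buffer_sec - chunk_bytes / throughput, 0.0) + L

def rebuffer_time(chunk_bytes, throughput, buffer_sec, decode_sec, L):
    """f3 (Eq. 3.5): rebuffering due to slow downloading plus slow decoding."""
    download_sec = chunk_bytes / throughput
    return max(download_sec - buffer_sec, 0.0) + max(decode_sec - L, 0.0)

def qoe(q_cur, q_prev, rebuf, alpha=1.0, beta=0.5, gamma=1.0):
    """Eq. 3.6: weighted combination (weights as in our evaluation)."""
    return alpha * q_cur - beta * quality_change(q_cur, q_prev) - gamma * rebuf
```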

3.4.2.3 Estimate Video Quality

The first and second terms in the user QoE are determined by the video quality. To support efficient optimization, we need to quickly compute the video quality for a given streaming strategy. That is, for a given FoV, we need to determine the video quality of streaming a selected set of tiles in all layers and bitrates.

Video Quality Metric Approximation: In our optimization problem, we need to estimate the video quality of the predicted FoV. Exactly computing the video quality metric for every candidate decision in the optimization is too expensive. Moreover, since user head movement is not known in advance, computing video quality offline would require computing it for all possible FoVs, which is also too expensive. We therefore develop an efficient scheme to approximate the video quality metric.

Before we introduce our algorithm, we first describe how the video is constructed from the various layers. For each tile, we extract the data from all layers associated with the tile. As described in Section 3.4.1, we divide videos temporally into layers, and layer j corresponds to the j-th frame in every group of 4 frames. Therefore, given a tile that has k layers, we put the downloaded frames of these k layers at the corresponding positions and fill in each missing frame by duplicating the most recent received frame. For example, if a tile only has the base layer, we duplicate the base-layer frame 3 times in every group of 4 frames. If a tile has the first 3 layers, we use the data from layers 1, 2 and 3 to form the first three frames and duplicate the third frame to form frame 4 of every group. According to the above video construction, we derive the following metric based on the observation that there exists a strong correlation between video quality and bitrate. This indicates an opportunity to use the video bitrate in the optimization. The quantization parameter (QP) [26] is used by HEVC to control the video bitrate; a larger QP indicates a lower bitrate. The quality of frame k in chunk i, denoted as h(r_i, e_i, c_i^k), is defined as follows:

    h(r_i, e_i, c_i^k) = \frac{1}{|FoV(c_i^k, e_i)|} \sum_{l \in FoV(c_i^k, e_i)} q(r_i^l, e_i)    (3.7)

where FoV(c_i^k, e_i) denotes the set of tiles within the FoV centered at c_i^k and q(r_i^l, e_i) represents the quality of tile l in the FoV. h(r_i, e_i, c_i^k) averages the quality over all tiles in the FoV, and the quality of each tile is determined by the number of layers streamed for the tile and the data rate at which it is streamed.
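The frame-construction rule above (duplicating the most recent received frame to fill the positions of missing layers) can be written as a small routine; the sketch below is our own illustration for M = 4.

```python
def reconstruct_tile_frames(received_layers, num_layers=4):
    """Rebuild a tile's frame sequence from the layers that arrived.

    received_layers[j] is the frame list of layer j+1 (or None if that
    layer was not downloaded).  Missing positions in each group of
    `num_layers` frames are filled by duplicating the latest received frame.
    """
    group_count = len(received_layers[0])        # base layer is always present
    frames = []
    for g in range(group_count):
        last = None
        for j in range(num_layers):
            layer = received_layers[j] if j < len(received_layers) else None
            if layer is not None:
                last = layer[g]
            frames.append(last)                  # duplicate when the layer is missing
    return frames

# Example: only the base layer and layer 2 were downloaded for this tile.
base = ["f1", "f5", "f9", "f13"]
layer2 = ["f2", "f6", "f10", "f14"]
print(reconstruct_tile_frames([base, layer2, None, None]))
# -> ['f1', 'f2', 'f2', 'f2', 'f5', 'f6', 'f6', 'f6', ...]
```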

Figure 3.6: Correlation between video quality (SSIM) and bitrate (QP). (a) 8K Videos; (b) 4K Videos.

The tiles in the predicted FoV have the highest probability of being viewed. Therefore, we stream all layers of the tiles in the predicted FoV, and evaluate q(r_i^l, e_i) assuming 4 layers.

We study the correlation between r_i^l and SSIM. We set r_i^l to different QP values and fix the number of layers to 4. The input video is divided into 36 tiles. We set the FoV center to the center of each video tile to calculate the video quality under 36 different FoVs. The FoV quality is the average SSIM of the same FoV across all video chunks. Fig. 3.6 shows the correlation between the average FoV quality and QP for both 4K and 8K videos. We observe that the average FoV quality decreases linearly with the normalized video QP. The average correlation coefficient among all test videos is 0.98. We approximate the FoV quality using -0.004 × QP + b. From Fig. 3.6, we can see that the value of b varies across different videos. However, b remains constant for all chunks of the same video. We set b to 0 since removing a constant from the objective function does not affect the optimization solution.
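As a concrete illustration, the following sketch evaluates Eq. (3.7) with the linear SSIM-QP approximation above; the per-tile QP assignment and tile indexing are our own simplifications.

```python
def tile_quality(qp, slope=-0.004, b=0.0):
    """Approximate a tile's quality from its quantization parameter (QP).
    b is a per-video constant; it is set to 0 because a constant offset
    does not change the optimization result."""
    return slope * qp + b

def fov_quality(fov_tiles, tile_qp):
    """Eq. (3.7): average approximate quality of the tiles inside the FoV."""
    return sum(tile_quality(tile_qp[t]) for t in fov_tiles) / len(fov_tiles)

# Example: 4 FoV tiles streamed at QP 22 and the remaining tiles at QP 37.
tile_qp = {t: 22 for t in range(4)}
tile_qp.update({t: 37 for t in range(4, 36)})
print(fov_quality(range(4), tile_qp))   # -> -0.088 (a relative quality score)
```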

3.4.2.4 Decoding Time

The third term in the user QoE is the rebuffering time, which is affected by both the downloading time and the decoding time. Existing rate adaptation schemes ignore the decoding time and only use the downloading time to determine the rebuffering time. This is acceptable for regular videos with much fewer pixels and for fast desktops. But the decoding time for 360◦ videos on smartphones can be significant and sometimes exceeds the downloading time. Therefore, it is important to estimate the decoding time for different streaming strategies to support the optimization.

Decoding time depends on the number of tiles in each layer. Moreover, there is also variation in decoding time even when decoding the same tile configuration due to other competing apps on a smartphone. We model this decoding time and take into account the variation.

To model the decoding time, we measure the decoding time for 8 videos by varying the bitrate and the number of tiles in each layer. We decode each configuration 3 times and record the results. Since we have 4 threads, each measurement has 4 decoding time values, and the overall decoding time is dictated by the thread that takes the longest time. Each thread decodes one layer. An underlying assumption here is that all decoding threads start at the same time. We observe that there is no significant variation in the start time of these threads, so this assumption works well in practice.

We implement a simple table lookup for the decoding time based on measurements, where the table is indexed by (# tiles layer-2, # tiles layer-3, # tiles layer-4). We do not consider the number of tiles in the base layer as it always contains all tiles. Moreover, we also do not consider the video bitrate because we observe that it does not impact the decoding time significantly. This is because the bitrate only affects the quantization level, while the video decoding complexity mainly depends on the resolution of the input video. The decoding time entries in the table are populated by averaging the maximum decoding time of each instance over all measurement sets. We use a simple table lookup for the decoding time because the variation in decoding time for the same configuration is not large (e.g., within 7% and 6% on average for the Samsung S7 and Huawei Mate9, respectively).
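A minimal version of this lookup could look as follows; the values are a few Samsung S7 entries from Table 3.1, and the 0.1 sec safety margin anticipates the modeling error discussed in Sec. 3.6.2.2.

```python
# Decoding-time lookup keyed by (# tiles in layer 2, layer 3, layer 4);
# the base layer always contains all 36 tiles.  Values are averaged
# measurements in seconds (Samsung S7 entries from Table 3.1).
DECODE_TIME_S7 = {
    (16, 12, 9): 0.71,
    (20, 12, 9): 0.79,
    (25, 16, 9): 0.91,
    (25, 16, 12): 0.95,
    (25, 20, 20): 1.05,
}

SAFETY_MARGIN = 0.1  # sec, absorbs the modeling error (Sec. 3.6.2.2)

def estimated_decode_time(tiles_per_enh_layer, table=DECODE_TIME_S7):
    """Return the predicted decoding time for a tile configuration,
    inflated by a small margin before it is used to estimate rebuffering."""
    return table[tuple(tiles_per_enh_layer)] + SAFETY_MARGIN

print(estimated_decode_time((25, 16, 12)))   # -> 1.05
```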

3.4.2.5 Improving Efficiency

The large search space poses a significant challenge for real-time optimization. To support efficient optimization, we identify the following important constraints that can be used to prune the search space.

Constraints on the numbers of tiles: e_i^j ≥ e_i^{j'} for any layers j < j'. This is intuitive, as the lower-layer tiles should cover no smaller an area in order to tolerate prediction errors.

Constraints on the bitrates: r_i^l ≥ r_i^{l'} for any tiles l ∈ FoV(c_i^k, e_i) and l' ∉ FoV(c_i^k, e_i), where r_i^l denotes the bitrate of tile l for chunk i. This means that tiles outside the predicted FoV should have no higher bitrate, since they are less likely to be viewed.

Temporal constraints: When the throughput is stable over the optimization window (i.e., there is no significant increasing or decreasing trend), all future chunks have the same streaming strategy (i.e., (r_i, e_i) = (r_{i'}, e_{i'}), where i and i' are any of the future w chunks).
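The sketch below shows how such constraints could be used to prune candidate decisions before their QoE is evaluated; it is a simplified check with our own variable names, where bitrates are represented as quality levels (higher is better).

```python
def satisfies_constraints(tiles_per_layer, rate_in_fov, rate_outside_fov):
    """Return True only for decisions that respect the pruning rules:
    lower layers cover at least as many tiles as higher layers, and tiles
    outside the predicted FoV get no higher bitrate than tiles inside it."""
    monotone_tiles = all(tiles_per_layer[j] >= tiles_per_layer[j + 1]
                         for j in range(len(tiles_per_layer) - 1))
    monotone_rates = rate_in_fov >= rate_outside_fov
    return monotone_tiles and monotone_rates

# Example: 36/25/16/12 tiles across the four layers with FoV tiles at a
# higher bitrate level than the remaining tiles is a valid candidate.
print(satisfies_constraints((36, 25, 16, 12), rate_in_fov=3, rate_outside_fov=1))  # True
print(satisfies_constraints((36, 16, 25, 12), rate_in_fov=3, rate_outside_fov=1))  # False
```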

3.5 System Design for Rubiks

3.5.1 System Architecture

Fig. 3.7 shows the two major components of our system: (i) the Client side estimates the predicted FoV and network throughput, runs the optimization, and generates requests accordingly; (ii) the Server side handles the video encoding (i.e., spatially partitioning into tiles and temporally splitting into layers) and streams data according to the client requests.

Figure 3.7: System Architecture for Rubiks

The Client side runs the optimization, which takes the head movement prediction of the user, the prediction error, the playback buffer and the network throughput as input, and outputs the data that needs to be requested next. The objective is to maximize the user QoE while taking into account the network throughput and the decoding time of incoming data.

The red arrows in Fig. 3.7 indicate the complete process from video chunk request generation to chunk playback.

3.5.2 Server Side

Fig. 3.7 shows the different modules of our server. We use the standard equirectangular projection to store raw 360◦ frames, which is also used by YouTube [35]. There are other ways to store raw 360◦ frames, such as the cubemap [22] proposed by Facebook. The cubemap projection tries to reduce the size of the raw 360◦ video without degrading video quality. Rubiks focuses on how to transmit tiles in the projected 360◦ frames. Projection methods cannot speed up video decoding since they do not reduce the video resolution, which determines the decoding speed.

Video Layer Extractor and Video Tile Extractor divide the video data spatially and temporally as described in Sec. 3.4. We use 36 tiles and 4 layers in our implementation. We use an open-source HEVC encoder kvazaar [29] for encoding. We let kvazaar restrict motion compensation within video tiles such that each tile can be decoded independently. Encoded video data is stored in a video database.

When the video request handler receives a request, it queries the video database to find the requested tiles. The video database sends the requested tiles to the tile merger to generate a merged chunk for each layer. We spatially merge the independently decodable tiles from the same layer into a video chunk file. The client can decode the portion of the 360◦ view covered by the tiles contained in the merged video chunk. Since the client needs the size of the video tiles to optimize video requests, the video request handler sends the tile sizes as metadata before sending the encoded video data.

3.5.3 Client Side

As shown in Fig. 3.7, the client first predicts head movement, and uses this information along with the network throughput to perform 360◦ video rate adaptation. Then it requests the corresponding video from the server, decodes each layer, merges all layers, and finally plays the resulting video. Next we describe each module.

Tracking and predicting head position: We need head movement prediction to determine which part of the video the user is going to view in the next few seconds. When watching a 360◦ video, the user's head position can be tracked using the gyroscope, which is widely available on smartphones. Note that the head position estimated from the gyroscope readings is considered the ground-truth head position in our system. We can then use this head position to get the center of the FoV. The X-axis can go from -180◦ to 180◦, while the Y-axis and Z-axis can go from -90◦ to 90◦. Head movement exhibits strong auto-correlation on small time scales [40], so we use past head motion to predict future head motion.

We collect head movement traces from 20 users for 10 videos, each lasting around 30 seconds. We randomly select half of the head movement traces as the training data and use the other half for testing. The sampling interval of the gyroscope measurements is set to 0.1 sec. We use least squares to solve Y = AX, where X is the past head movement and Y is the future movement. We learn A using the training data X and Y. We then apply the learned A and the past head movement X to predict the future movement Y. Moreover, we also estimate the error of our head movement prediction using least squares and use this error to determine the additional tiles to request for robustness. We train a separate model for each of the three axes. In our evaluation, we use the past 1-second head movement to predict the future 2-second movement. In our system, the time to predict the head position and error is only 2 ms.
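The prediction step amounts to an ordinary least-squares fit per axis. The numpy sketch below shows the idea in a row-vector form with synthetic data; it is our own simplification rather than the exact Rubiks implementation.

```python
import numpy as np

def fit_predictor(past_windows, future_windows):
    """Learn A in Y = A X by least squares (one axis).
    past_windows:   (n_samples, n_past)   past head positions
    future_windows: (n_samples, n_future) corresponding future positions
    """
    A, *_ = np.linalg.lstsq(past_windows, future_windows, rcond=None)
    return A            # shape (n_past, n_future)

def predict(A, past_window):
    """Predict the future positions from the most recent past window."""
    return past_window @ A

# Toy example: 1-sec history (10 samples at 0.1 s) -> 2-sec future (20 samples).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                # synthetic training histories
true_A = rng.normal(size=(10, 20))
Y = X @ true_A + 0.01 * rng.normal(size=(500, 20))
A = fit_predictor(X, Y)
print(predict(A, X[0]).shape)                 # -> (20,)
```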

Network throughput predictor: The client continuously monitors the network throughput when downloading video data. It records the network throughput in the previous 10-sec window, and uses the harmonic-mean predictor [77] to predict the network throughput in the next optimization window.
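A harmonic-mean predictor over the recent throughput samples can be written in a few lines; the sketch below is illustrative and the sample values are made up.

```python
def harmonic_mean_predictor(recent_throughputs):
    """Predict the next-window throughput as the harmonic mean of the
    samples observed in the previous window; the harmonic mean is robust
    to occasional throughput spikes."""
    samples = [t for t in recent_throughputs if t > 0]
    if not samples:
        return 0.0
    return len(samples) / sum(1.0 / t for t in samples)

print(harmonic_mean_predictor([20.0, 25.0, 5.0, 22.0]))  # ~11.9 (Mbps)
```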

Video request optimizer: Given the predicted head position and throughput, the optimization searches for the optimal decision. The decision specifies the set of tiles to download from each layer and their corresponding bitrates. We implement the optimization routine using the Android JNI since it is much faster than plain Java. The optimization window size w is set to 3. Due to the small search space, the optimization finishes within 12 ms on average.

Video downloader: It maintains HTTP connections with the server to download video. The optimization results are used to construct the HTTP requests.

Video decoder: We exploit the hardware-accelerated Android MediaCodec to decode video data. Four parallel video decoders are initialized before we start playing the video. Note that four threads are the maximum number of concurrent decoders we can run due to limited hardware resources. Each video layer has one merged video chunk file, so each decoder decodes all the requested tiles of one layer. When decoding a video chunk file, the corresponding video decoder runs as a background thread, which is implemented using an Android AsyncTask. Decoding has to run in the background to avoid blocking video playback in the UI thread.

Frame constructor: 360◦ frames are reconstructed from the downloaded tiles of each layer.

Video Player: The 360◦ video frames produced by the frame constructor are rendered through Android OpenGL ES, which uses the GPU to render video frames. The head monitor tells the video player which portion of the 360◦ frame should be displayed.

3.6 Evaluation for Rubiks

We implement Rubiks as an Android app and run our experiments on the Samsung Galaxy S7 and Huawei Mate9. We show that Rubiks can not only support 8K videos on smartphones, which is infeasible for the existing tile-based streaming approaches, but also enhance user experience and save bandwidth for both 4K and 8K videos.

3.6.1 Evaluation Methodology

In this section, we first introduce the experiment setup. Then, we explain the experiment configuration.

3.6.1.1 Experiment Setup

We run our video server on a laptop equipped with an Intel AC-8260 wireless NIC. The client runs on a Samsung Galaxy S7 and a Huawei Mate 9. We configure the laptop as an access point so that the smartphone has a wireless connection to the video server through WiFi. The smartphone remains close (e.g., <1 m) to the laptop so that it always has a stable wireless connection to the laptop. We use tc [30] to control the throughput between the smartphone and the laptop. For trace-driven experiments, the video player renders the video according to the user FoV in the head movement traces. We quantify the performance of Rubiks using the user QoE as we vary the videos, head movement traces and network throughput traces. We also examine the individual terms in the QoE metric, including video quality, video quality changes, rebuffering and bandwidth savings.

3.6.1.2 Experiment Settings

Video traces: We test the performance of our system on both 4K and 8K videos. We download 16 videos from YouTube: eight 4K videos and eight 8K videos. The resolution of the 4K videos is 3840 × 1920, while the 8K resolution is 7680 × 3840. Each video is divided into 30 video chunks, where each chunk has 32 frames. Among the 4K videos, 4 have fast motion (e.g., videos of football games [16], roller coasters [17], etc.), while the other 4 have slow motion (e.g., videos of an underwater scene [19], sailing [18], etc.). The 8K videos have similar motion characteristics.

A video chunk is further divided into 4 layers. Each layer divides a 360◦ frame into 6 × 6 uniform tiles. For 4K videos, the resolution of a video tile is 640 × 320, while a video tile from an 8K video has resolution 1280 × 640. We use the Quantization Parameter (QP) to specify the encoding bitrate of the videos. The QP value is chosen from 22, 27, 32, 37 and 42, as recommended by recent work [62].

Throughput traces: We select 10 throughput traces from the dataset HSDPA [27] with varying throughput patterns. To ensure sufficient throughput to support 4K and 8K videos, we scale up the traces according to the bitrate of the test videos. For 4K videos, we scale the traces to have average throughput in the range 0.15 MBps-2.1 MBps. For 8K, we scale to the range 3 MBps-27 MBps.

Head movement traces: We collect real head movement traces from users watching our test 360◦ videos with a Samsung Gear VR headset. For each video, we collect 20 head movement traces. We randomly select half of the traces to learn the weights of the linear predictor. The trained linear predictor is used by the head predictor module to estimate the head movement of users while watching a 360◦ video. We use the other half of the traces as realistic user head movement behaviors to evaluate the performance of Rubiks.

Video quality metric: In our experimental results, we report the actual video quality via SSIM [141] since it has a high correlation with actual user QoE. It is defined as the structural difference between the streamed and actual FoV. A higher SSIM indicates a higher user QoE.

QoE weights: We set the weights in the QoE definition as follows: α = 1, β = 0.5 and γ = 1, which is a commonly used setting in existing works [100, 136, 148].

3.6.2 Micro Benchmarks

In this section, we quantify the head movement prediction error and decoding time modeling error.

3.6.2.1 Head Movement Prediction Error

We want to understand how the head movement prediction error changes when we vary (i) the future window in which to predict head movement, called the prediction window, (ii) the algorithm used for prediction, and (iii) the amount of historical information used for prediction. We compare two algorithms: neural networks and linear regression. In our analysis, we use 100 head movement traces, collected from 10 users by showing them 10 videos. We train the model using 30% of the data and test on the remaining 70%. Only X- and Y-axis movement is considered because most of the head movement is along these dimensions.

Figure 3.8(a) shows how the prediction error changes along the X and Y axes when we vary the prediction window size. As expected, the prediction error increases with the prediction window size. For the 2-sec prediction window that we use, the average prediction errors along the X and Y axes are 25◦ and 8◦, respectively, and the 99th percentile errors are 60◦ and 20◦, respectively. Traditional tile-based streaming suffers from a poor viewing experience when the prediction error is very large, because the user either sees a black screen or waits for the missing data to be fetched, which incurs rebuffering. In our traces, the maximum error along the X-axis is 170◦ and along the Y-axis is 69◦. With Rubiks, however, the user can still see the content outside the predicted FoV, so it is more robust to large head movement prediction errors.

Figure 3.8: Head Movement Prediction Analysis. (a) Vary Prediction Window; (b) Regression vs. Neural Net.

We compare linear regression with a non-linear modeling tool: neural networks. We train a neural network with 3 hidden layers, each containing 30 neurons. As shown in Figure 3.8(b), there is no performance difference between neural networks and linear regression, so we opt for linear regression for its simplicity. We also vary the historical window size while fixing the prediction window to 1 sec. We observe that the past 0.5 seconds is enough to predict the future head movement because head movements in the distant past are not highly correlated with future movement. The distant head position variables are given low weights, so we use the past 1 sec of historical information.

3.6.2.2 Decoding Time Modeling

We evaluate how the decoding time is affected when we use different tile configurations. This helps estimate the maximum number of tiles that can be decoded in time. Moreover, we also quantify the accuracy of our model. We measure the decoding time of eight 8K videos. It is not necessary to model the decoding time for 4K videos, since all of the tested algorithms can decode them in real time. For tile configurations, we decode all tiles for the base layer. The number of tiles for each enhancement layer can take any value from the list {9, 12, 16, 20, 25, 30}. The test videos have 5 different bitrates. For a specific bitrate and tile configuration, we run the decoding time experiment 3 times. We avoid measuring the decoding time of tile configurations whose lower layers have fewer tiles than higher layers since that is not practical.

Table 3.1 shows a few example entries in the decoding time lookup table trained on the Samsung S7 and Huawei Mate9. Note that the tile configuration tuple indicates the number of tiles from layers 2, 3 and 4, respectively; the number of tiles from layer 1 is fixed at 36. There are 56 entries in the lookup table. We populate the lookup table by averaging the decoding time across videos with the same bitrate and tile configuration. We observe that the maximum number of tiles which can be decoded in real time is (25, 20, 20). To evaluate the impact of newer hardware on decoding time, we also include the decoding time of the Samsung S8, which is equipped with more recent hardware. We find that the S8 has very similar decoding time to the Mate9. In Sec. 3.6.4, we will discuss how Rubiks achieves significant improvements in video quality and bandwidth savings compared with existing state-of-the-art algorithms, in addition to speeding up decoding.

Figure 3.9: Modeling Error (CDF of the decoding time error for the same/different bitrate and same/different video cases).

Tile Conf.   16,12,9   20,12,9   25,16,9   25,16,12   25,20,20
S7           0.71s     0.79s     0.91s     0.95s      1.05s
Mate9        0.68s     0.75s     0.87s     0.90s      1.03s
S8           0.66s     0.74s     0.85s     0.89s      1.02s

Table 3.1: Decoding Time Lookup Table

Fig. 3.9 shows the CDF of the decoding time modeling error for the configuration (25, 16, 12) in the following cases: (1) applying the model to the same video at the same bitrate, (2) applying the model to the same video at a different bitrate, (3) applying the model to a different video at the same bitrate, and (4) applying the model to a different video at a different bitrate. We build the lookup table from one video to test the modeling accuracy across different videos. We can see that the 90th percentile error for all cases is less than 0.1 sec. To handle this modeling error, we inflate the decoding time by 0.1 sec when the optimization routine estimates the rebuffering time based on the decoding time for a given strategy. Since the cross-video and cross-bitrate cases do not increase the modeling error, we can populate the lookup table with similar modeling error as shown in Fig. 3.9 by measuring the decoding time for a single bitrate and one 30-sec video. Thus, it is easy to generate a lookup table for a new phone within half an hour. We can generate decoding tables for different phones and store them in the app database. A user downloads all these tables alongside the app, and the app then chooses the appropriate table based on the user's smartphone model.

Figure 3.10: Performance of 4K videos (8 videos, 10 throughput traces, 80 head movement traces). (a) QoE; (b) Video Quality; (c) Quality Changes; (d) Rebuffering; (e) Bandwidth Savings.

3.6.3 System Results

In this section, we evaluate our system. We compare it against the following three baseline schemes: (i) YouTube [152]: streaming the 360◦ video as a single-tile video; (ii) FoV-only [114]: the video is divided into 6 × 6 uniform tiles and only the tiles in the predicted FoV are streamed; (iii) FoV+ [40]: we enlarge the predicted FoV to include the surrounding parts according to the prediction error and stream the content in the enlarged FoV.

3.6.3.1 Rubiks for 4K videos

User QoE: Fig. 3.10(a) shows that on average Rubiks outperforms YouTube, FoV-only and FoV+ by 14%, 69%, and 26%, respectively. All schemes can decode 4K videos in real time. Rubiks improves QoE over YouTube because it reduces the amount of data to send, which leads to less rebuffering. Rubiks improves over FoV-only and FoV+ due to its robustness to head movement prediction errors.

We test the performance of Rubiks on both the Samsung S7 and the Huawei Mate 9. Because both phones can support real-time decoding for 4K videos, the difference between the QoE achieved by the two phones is within 1%, as shown in Fig. 3.10(a).

Rebuffering: Rubiks incurs less rebuffering time since it sends much less data than YouTube. Fig. 3.10(d) shows that the average rebuffering time of Rubiks and YouTube is 1.2 sec and 4.1 sec, respectively. The reduction in rebuffering time accounts for most of the QoE improvement in Rubiks.

Video Quality: Fig. 3.10(b) shows that the average video quality of Rubiks, YouTube, FoV-only, and FoV+ is 0.97, 0.98, 0.7 and 0.84, respectively. FoV-only is very vulnerable to prediction error because only the predicted FoV content is streamed. Even though FoV+ includes extra data to improve robustness against prediction error, it is still insufficient under large prediction errors. In comparison, Rubiks does not incur noticeable video quality degradation: its difference from YouTube is only 0.01. This is because it streams entire video frames, albeit at a lower quality for the tiles outside the predicted FoV.

Video Quality Changes: Fig. 3.10(c) shows that the video quality change is 0.15 and 0.28 for FoV-only and FoV+, respectively, while the average quality change is around 0.01 for both Rubiks and YouTube. FoV-only and FoV+ have higher video quality changes whenever the prediction error becomes large (e.g., the user looks at video portions that lie outside the streamed content, which leads to a large drop in viewing quality). Sending the tiles not included in the predicted FoV at a lower quality allows Rubiks to avoid such a large drop in video quality.

Bandwidth Savings: Fig. 3.10(e) shows that for 4K videos Rubiks saves 35% bandwidth compared with YouTube. Of this, 19% comes from Rubiks not sending all layers for all tiles, 11% comes from Rubiks sending the tiles outside the predicted FoV at a lower rate, and 5% comes from the removal of the redundant I-frames in the enhancement layers introduced in Section 3.4.1. FoV-only and FoV+ save 56% and 41% bandwidth compared with YouTube, respectively. The bandwidth saving of Rubiks is significant, but lower than that of FoV-only and FoV+ since Rubiks streams entire video frames to improve robustness to movement prediction error. We believe this is a reasonable trade-off.

Figure 3.11: Performance of Rubiks under low throughput. (a) QoE; (b) Bandwidth Savings.

Low Throughput: Fig. 3.11 shows the benefits of Rubiks when the throughput is lower than the lowest video bitrate. All approaches tend to select the lowest bitrate. YouTube has to send entire 360◦ frames, which results in larger rebuffering. The average QoE is 0.39, −0.52, 0.31 and 0.29 for Rubiks, YouTube, FoV-only and FoV+, respectively. Compared with YouTube, Rubiks improves QoE by 0.9 due to a significant reduction in rebuffering time. Even though YouTube can support real-time decoding for 4K videos, reducing the amount of data sent provides significant benefits for Rubiks when the network throughput is low. The average bandwidth savings of Rubiks, FoV-only, and FoV+ are 39%, 61% and 48%, respectively.

High-Motion Videos: Fig. 3.12 shows the benefits of Rubiks for high-motion videos, which have larger encoded sizes. When the user views high-motion tiles, there is less opportunity to save bandwidth. The average bandwidth saving of Rubiks is 24%. Rubiks improves QoE by 16% on average over YouTube.

Figure 3.12: Performance of Rubiks for high motion videos. (a) QoE; (b) Bandwidth Savings.

3.6.3.2 Rubiks for 8K Videos

Next, we evaluate Rubiks for 8K videos. YouTube cannot decode 8K video chunks in time. For Rubiks and FoV+, we use an upper bound, derived from our decoding time model, to limit the number of tiles sent to the client so that video chunks can be decoded before the playback deadline.

QoE: Fig. 3.13(a) shows that Rubiks always achieves the best user QoE. Rubiks outperforms YouTube, FoV-only and FoV+ by 36%, 40% and 20% in QoE, respectively. Compared with 4K videos, Rubiks achieves a smaller QoE improvement over FoV-only for 8K videos due to different video content and user head movement. Because YouTube cannot decode 8K video data in time, Rubiks achieves a larger QoE improvement over YouTube for 8K videos than for 4K videos.

The QoE of Rubiks on Huawei Mate9 is 2% higher than that on Samsung S7. From the decoding time model, we can see that Mate9 has slightly faster decoding speed than S7. Thus, Mate9 can decode more tiles to provide higher robustness to head movement prediction error, which results in slightly higher QoE for Mate9.

Figure 3.13: Performance of 8K videos (8 videos, 10 throughput traces, 80 head movement traces). (a) QoE; (b) Video Quality; (c) Rebuffering; (d) Bandwidth Savings.

Rebuffering: Fig. 3.13(c) shows the rebuffering time. The overall average rebuffering time of YouTube is 8.0 sec. Slow video decoding in the YouTube approach results in 7.1 sec of average rebuffering time; the rest comes from throughput fluctuation during downloading. This leads to large QoE degradation. Rubiks, FoV-only and FoV+ incur average rebuffering times in the range of 0.2-0.3 sec, which is very small compared to YouTube. About 0.01 sec of the average rebuffering time for Rubiks comes from inaccurate decoding time modeling. Compared with 4K videos, speeding up video decoding helps Rubiks achieve a larger reduction in rebuffering, which translates to a higher QoE improvement over YouTube.

Quality and Quality Changes: We observe similar video quality and quality change patterns as for 4K videos. The average video quality is 0.96, 0.98, 0.79 and 0.87 for Rubiks, YouTube, FoV-only and FoV+, respectively. Compared with 4K videos, FoV-only achieves higher video quality due to slightly better head movement prediction. Nevertheless, Rubiks still significantly outperforms both FoV-only and FoV+. Moreover, FoV-only and FoV+ experience quality changes of 0.21 and 0.13, which are larger than those of the other approaches due to movement prediction error, whereas for Rubiks it is 0.02.

Bandwidth Savings: Fig. 3.13(d) shows that Rubiks, FoV-only and FoV+ save 49%, 66%, and 54% bandwidth, respectively, compared with YouTube. The encoded file size difference between two consecutive bitrates for 8K videos is larger than that in 4K videos. This means when we switch between bitrates in 8K videos, there is a larger difference in the amount of data when compared to 4K. So Rubiks yields larger bandwidth savings for 8K videos.

3.6.3.3 Energy Consumption

We use the Google battery historian tool [36] to monitor the energy consumption of our system. We stream multiple 4K and 8K videos using various algorithms and record the energy consumption in each experiment, where each video lasts for 5 minutes. Table 3.2 shows the average total energy consumption across videos for each algorithm. Both FoV-only and FoV+ consume up to 33% less energy than YouTube and Rubiks since they decode fewer tiles, at the expense of much worse video quality. On average, Rubiks consumes 9% and 19% less energy than YouTube for 4K and 8K videos, respectively. This is because Rubiks does not decode all video tiles and takes a shorter time to finish decoding than YouTube. When the decoder finishes decoding one chunk, it goes to the idle state until the next chunk is ready to decode. As we would expect, all algorithms consume more energy in decoding the 8K videos than the 4K videos.

            YouTube    Rubiks     FoV-only   FoV+
4K (S7)     43.4mAh    39.8mAh    30.3mAh    34.5mAh
4K (Mate9)  41.9mAh    37.6mAh    29.2mAh    32.1mAh
8K (S7)     93.7mAh    76.2mAh    62.4mAh    68.6mAh
8K (Mate9)  91.8mAh    75.5mAh    61.7mAh    65.2mAh

Table 3.2: Energy Consumption.

3.6.4 Summary and Discussion of Results

Our main findings can be summarized as follows:

• Rubiks achieves significant QoE improvement due to the reduction in rebuffering time and its enhanced robustness against head movement prediction error.

• Rubiks saves substantial bandwidth by sending video tiles with a lower viewing probability at a lower quality.

• Users give higher ratings to Rubiks due to smaller rebuffering or higher video quality.

Though we evaluate Rubiks' performance on only a few smartphones, the techniques employed in Rubiks can improve the performance of streaming 360◦ videos on any smartphone hardware platform. In addition to speeding up the decoding process, Rubiks provides the following benefits:

• Rubiks is more robust to head movement prediction errors even if decoding is fast. Compared with the existing state-of-the-art tile-based streaming approaches FoV-only and FoV+, Rubiks improves video quality by 38% and 15% for 4K videos; the improvement is 22% and 10% for 8K videos. Note that both FoV-only and FoV+ can decode all 4K and 8K videos in real time. Thus, even if decoding is fast, the current tile-based streaming approaches are not robust enough to head movement prediction errors.

• Rubiks needs much less bandwidth to achieve high video quality. Compared with YouTube, Rubiks saves 35% bandwidth for 4K videos, while achieving similar average video quality (0.97 in Rubiks and 0.98 in YouTube).

Chapter 4

Robust Live 4K Video Streaming

4.1 Motivation

Uncompressed 4K video streaming requires a data rate of around 3 Gbps. WiGig is the commodity wireless technology that comes closest to such a high data rate. In this section, we examine the feasibility and challenges of streaming live 4K videos over WiGig from both system and network perspectives. This study identifies major issues that should be addressed in supporting live 4K video streaming.

4.1.1 4K Videos Need Compression

WiGig throughput in our wireless card varies from 0 to 2.4 Gbps. Even in the best case, it is lower than the data rate of 4K raw videos at 30 FPS, which is 3 Gbps. Therefore, sending 4K raw videos is too expensive, and video coding is necessary to reduce the amount of data to transmit.

4.1.2 Rate Adaptation Requirement

WiGig links are sensitive to mobility. Even minor movement at the transmitter (TX) or receiver (RX) side can induce a drastic change in throughput. In the extreme case, where an obstacle is blocking the line-of-sight path, throughput can reduce to 0. Evidently, such a throughput drop results in severe degradation in the video quality.

Figure 4.1: Example 60 GHz Throughput Traces. (a) Throughput Variation; (b) Prediction Error.

Large fluctuations: Fig. 4.1(a) shows the WiGig link throughput in our system when both the transmitter and receiver are static (Static) and when the TX rotates around a human body slowly while the RX remains static (Rotating). The WiGig link can be blocked when the human stands between the RX and TX. As we can see, the amount of throughput fluctuation can be up to 1 Gbps in the Static case. This happens due to multipath, beam searching and adaptation [116]. Since WiGig links are highly directional and have a large attenuation factor due to their high frequency, the impacts of multipath and beam direction on throughput are severe. In the Rotating case, the throughput reduces from 2 Gbps to 0 Gbps. We observe 0 Gbps when the user completely blocks the transmission between the TX and RX. Therefore, drastic throughput fluctuation is common at 60 GHz, and a desirable video streaming scheme should adapt to such fluctuation.

Large prediction error: In order to handle throughput fluctuation, existing streaming approaches use historical throughput measurements to predict the future throughput. The predicted throughput is then used to determine the amount of data to be sent. Due to the large throughput fluctuation of WiGig links, the throughput prediction error can be large. In Fig. 4.1(b), we evaluate the accuracy of using the average throughput of the previous 40 ms window to predict the average throughput of the next 30 ms window. We observe large prediction error when there is a drastic change in the throughput. Even though the throughput remains relatively stable in the Static case, the prediction error can reach up to 0.5 Gbps. When link blockage happens in the Rotating case, the prediction error can be even higher than 1 Gbps. Such large prediction errors cause the sender to select either too low or too high a video rate; the former degrades the video quality, while the latter results in partially blank video frames because only part of each frame can arrive in time.
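This error metric can be computed with a simple sliding-window routine such as the sketch below; the code and the synthetic trace are our own and serve only to illustrate the measurement.

```python
def window_prediction_errors(samples_gbps, sample_ms=10, hist_ms=40, pred_ms=30):
    """For each position in a throughput trace, predict the average
    throughput of the next `pred_ms` as the average of the previous
    `hist_ms`, and return the absolute prediction errors."""
    h, p = hist_ms // sample_ms, pred_ms // sample_ms
    errors = []
    for i in range(h, len(samples_gbps) - p + 1):
        predicted = sum(samples_gbps[i - h:i]) / h
        actual = sum(samples_gbps[i:i + p]) / p
        errors.append(abs(predicted - actual))
    return errors

# Synthetic trace with a sudden blockage-like drop from ~2 Gbps to 0.
trace = [2.0] * 20 + [0.0] * 10 + [2.0] * 20
print(max(window_prediction_errors(trace)))   # large error around the drop
```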

Insight: The large and unpredictable fluctuation of 60 GHz links suggests that we should adapt promptly to unanticipated throughput variation. Layered coding is promising since the sender can opportunistically send less or more data depending on the network condition instead of selecting a fixed video rate in advance.

4.1.3 Limitations of Existing Video Codecs

We explore the feasibility of using traditional codecs like H.264 and HEVC for 4K video encoding and decoding. These codecs are attractive due to their high compression ratios. We also study the feasibility of the state-of-the-art layered encoding scheme since it is designed to adapt to varying network conditions.

Cost of conventional software/hardware video codecs: Recent work [61] shows that YouTube's H.264 software encoder takes 36.5 minutes to encode a 15-minute 4K video, and the VP8 [39] software encoder takes 149 minutes to encode a 15-minute video. It is clear that these coding schemes are too slow to support live 4K video streaming.

To test the performance of hardware encoding, we use the NVIDIA NVENC and NVDEC [10] hardware codecs for H.264 encoding and decoding. We use a desktop equipped with a 3.6 GHz quad-core Intel Xeon processor, 8 GB RAM, a 256 GB SSD and an NVIDIA GeForce GTX 1050 GPU to collect the encoding and decoding times. We measure the 4K video encoding time where one frame is fed to the encoder every 33 ms. Table 4.1 shows the average encoding and decoding time of a single 4K video frame played at 30 FPS. We use ffmpeg [3] for encoding and decoding with the NVENC and NVDEC codecs. The maximum tolerable encoding and decoding time is 60 ms, as mentioned earlier. We observe that the encoding and decoding time is well beyond this threshold even with GPU support. This means that even with the latest hardware codecs and commodity hardware, video content cannot be coded in time, resulting in unacceptable user experience. Such a large coding time is not surprising: in order to achieve high compression rates, these codecs use sophisticated motion compensation [135, 143], which is computationally expensive.

Codec   Video      Enc (ms)   Dec (ms)   Total (ms)
H.264   Ritual     160        50         210
H.264   Barscene   170        60         230
HEVC    Ritual     140        40         180
HEVC    Barscene   150        50         200

Table 4.1: Codecs encoding and decoding time per frame

A recent work [89] uses a very powerful desktop equipped with an expensive NVIDIA GPU ($1200) to stream 4K videos in real time. This work is interesting, but such GPUs are not available on common devices due to their high cost, bulky size (e.g., 4.376″ × 10.5″ [8]) and large system power consumption (600W [8]). This makes it hard to deploy on laptops and smartphones. Therefore, we aim to bring the capability of live 4K video streaming to devices with limited GPU capabilities like laptops and other mobile devices.

Cost of layered video codecs: Scalable video coding (SVC) is an extension of the H.264 standard. It divides the video data into a base layer and enhancement layers. The video quality improves as more enhancement layers are received. SVC uses layered coding to achieve robustness under varying network conditions, but this comes at a high computational cost since SVC needs additional prediction mechanisms such as inter-layer prediction [126]. Due to its higher complexity, SVC has rarely been used commercially even though it has been standardized as an H.264 extension [95]. No mobile devices have hardware SVC encoders or decoders.

To compare the performance of SVC with standard H.264, we use OpenH264 [11], which has software implementations of both SVC and H.264. Our results show that SVC with 3 layers takes 1.3× the encoding time of H.264. Since H.264 alone is not feasible for live 4K encoding, we conclude that SVC is not suitable for live 4K video streaming.

Insight: Traditional video codecs are too expensive for 4K video encoding and decoding. We need a cheap 4K video codec that runs fast on commodity devices.

4.1.4 WiGig and WiFi Interaction

Since WiGig links may not be stable and can be broken, we should seek an alternative means of communication. As WiFi is widely available and low-cost, it is an attractive candidate. There have been several interesting works that use WiFi and WiGig together; however, they are not sufficient for our purpose as they do not consider delay-sensitive applications like video streaming.

Reactive use of WiFi: Existing works [137] use WiFi in a reactive manner (only when WiGig fails). Reactive use of WiFi falls short for two reasons. First, WiFi is not used at all as long as WiGig is not completely disconnected. However, even when WiGig is not disconnected, its throughput can be quite low, and the WiFi throughput becomes significant in comparison. Second, it is hard to accurately detect a WiGig outage. Too early a detection may result in an unnecessary switch to WiFi, which tends to have lower throughput than WiGig, while too late a detection may lead to a long outage.

WiFi throughput: Second, it is non-trivial to use WiFi and WiGig links simultaneously since the throughput of both may fluctuate widely. This is shown in Fig. 4.2. As we can see, the WiFi throughput also fluctuates in both static and mobile scenarios. In the static case, the fluctuation is mainly due to contention. In the mobile case, the fluctuation mainly comes from both wireless interference and signal variation due to the multipath effect caused by mobility.

Figure 4.2: Example WiFi Throughput Traces

Insight: It is desirable to use WiFi in a proactive manner along with the WiGig link. Moreover, it is important to carefully schedule the data across the WiFi and WiGig links to maximize the benefit of WiFi.

4.2 Challenges

Adapting to highly variable and unpredictable wireless links: A natural approach is to adapt the video encoding bitrate according to the available bandwidth. However, as many measurement studies have shown, the data rate of a 60 GHz link can fluctuate widely and is hard to predict. Even a small obstruction or movement can significantly degrade the throughput. This makes it challenging to adapt video quality in advance.

Fast encoding and decoding on commodity devices: It is too expensive to stream the raw pixels of 4K videos; even the latest 60 GHz links cannot meet the bandwidth requirement. On the other hand, the time to stream live videos includes not only the transmission delay but also the encoding and decoding delay. While existing video codecs (e.g., H.264 and HEVC) achieve high compression rates, they are too slow for real-time 4K video encoding and decoding. Fouladi et al. [61] show that the YouTube H.264 encoder takes 36.5 minutes to encode a 15-minute 4K video at 24 FPS, which is far from real time. Therefore, a fast video coding algorithm is needed to stream live 4K videos.

Exploiting different links: WiGig alone is often insufficient to support 4K video streaming since its data rate may drop by orders of magnitude even with small movement or obstruction. Sur et al. [137] have developed approaches that detect when WiGig is broken based on the throughput in a recent window and then switch to WiFi reactively. However, it is challenging to select the right window size: a window that is too small results in an unnecessary switch to WiFi and being constrained by the limited WiFi throughput, while a window that is too large results in a long link outage. In addition, even when the WiGig link is not broken, WiFi can complement WiGig by increasing the total throughput. The WiFi throughput can be significant compared with the WiGig throughput, since the latter can be arbitrarily small depending on the distance, orientation, and movement. Moreover, existing works mainly focus on bulk data transfer and do not consider delay-sensitive applications like live video streaming.

4.3 Our Approach

In this section, we describe the important components of our system Jigsaw for live 4K video streaming: (i) light-weight layered coding, (ii) efficient implementation using GPU and pipelining, and (iii) effective use of WiGig and WiFi.

4.3.1 Light-weight Layered Coding

Motivation: Existing video codecs, such as H.264 and HEVC, are too expensive for live 4K video streaming. A natural approach is to develop a faster codec, albeit with a lower compression rate, and pick a sending rate based on predicted throughput. However, if a sender sends more data than the network can support, the data arriving at the receiver before the deadline may be incomplete and insufficient to construct a valid frame. In this case, the user will see a blank screen. This can be quite common for WiGig links. On the other hand, if a sender sends at a rate lower than the supported data rate (e.g., also due to prediction error), the video quality degrades unnecessarily.

In comparison, layered coding is robust to throughput fluctuation. The base layer is small and usually delivered even under unfavorable network con- ditions so that the user can see some version of a video frame albeit at a lower resolution. Enhancement layers can be opportunistically sent based on network conditions. While layered coding is promising, the existing layered coding schemes are too computationally expensive as shown in Section 4.1. We seek a layered coding that can (i) be easily computed, (ii) support parallelism to leverage GPUs, (iii) compress the data, and (iv) take advantage of partial

layers, which are common since a layer can be large.

Figure 4.3: Our layered coding for Live 4K Video Streaming

4.3.1.1 Our Design

A video frame can be seen as a 2D array of pixels. There are two raw video formats: RGB and YUV where YUV is becoming more popular due to better compression than RGB. Each pixel in RGB format can be represented using three 8-bit unsigned integers in RGB (one integer for red, one for green, and one for blue). In the YUV420 planar format, four pixels are represented using four luminance (Y) values and two chrominance (UV) values. Our im- plementation uses YUV but the general idea applies to RGB.

We divide a video frame into non-overlapping 8x8 blocks of pixels. We further divide each 8x8 block into 4x4, 2x2 and 1x1 blocks. We then compute the average of all pixels in each 8x8 block which makes up the base layer or Layer 0. We round each average into an 8-bit unsigned integer, denoted as

A0(i, j), where (i, j) denotes the block index. Layer 0 has only 1/64 of the original data size, which translates to around ∼50 Mbps. Since we use both WiGig and WiFi, it is very rare for the total throughput to fall below ∼50 Mbps. Therefore, layer 0 is almost always delivered.1 While receiving only layer 0 gives a 512x270 video, which is a very low resolution, it is still much better than a partially blank screen, which may happen if we try to send a higher-resolution video than the link can support.

Next, we go down to the 4x4 block level and compute averages of these smaller blocks. Let A1(i, j, k) denote the average of a 4x4 block where (i, j, k) is the index of the k-th 4x4 block within the (i, j)-th 8x8 block and k = 1, 2, 3, 4. D1(i, j, k) = A1(i, j, k)−A0(i, j) forms layer 1. Using three of these differences, we can reconstruct the fourth one since the 8x8 block contains the average of the four 4x4 blocks. This reconstruction is not perfect due to rounding error. The rounding error is small: MSE is 1.1 in our videos where the maximum pixel value is 255. This has minimum impact on the final video quality. Therefore, the layer 1 consists of D1(i, j, k) for k = 1 to 3 from all 8x8 blocks. We call D1(i, j, 1) the first sublayer in layer 1, D1(i, j, 2) the second sublayer, and D1(i, j, 3) the third.

Following the same principle, we form layer 2 by dividing the frame into 2x2 blocks and computing A2 − A1, and form layer 3 by dividing into 1x1 blocks and computing A3 − A2. As before, we omit the fourth value in layer 2

1If the typical throughput is even lower, one can construct 16x16 blocks and use average of these blocks to form layer 0.

and layer 3 blocks. The complete layering structure of an 8x8 block is shown in Fig. 4.3. The values marked in grey are not sent and can be reconstructed at the receiver. This layering strategy gives us 1 average value for layer 0, 3 difference values for layer 1, 12 difference values for layer 2 and 48 difference values for layer 3, i.e., a total of 64 values per 8x8 block to be transmitted. Note that due to spatial locality, the difference values are likely to be small and can be represented using fewer than 8 bits. In the YUV format, an 8x8 block's representation requires 96 bytes; our strategy allows us to use fewer than 96 bytes.
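To make the layering concrete, here is a small Python/NumPy sketch (our illustration, not the dissertation's GPU implementation; the helper names are hypothetical) that computes layer 0 and the difference layers 1-3 for a single 8x8 Y block. A matching reconstruction sketch appears with the decoding discussion below.

import numpy as np

# Average a block over non-overlapping size x size sub-blocks.
def block_averages(block, size):
    h, w = block.shape
    return block.reshape(h // size, size, w // size, size).mean(axis=(1, 3))

# Return layer 0 (one 8x8 average) and layers 1-3 (hierarchical differences).
def encode_block(block):
    a2 = block_averages(block, 2)              # 4x4 grid of 2x2 averages
    a1 = block_averages(block, 4)              # 2x2 grid of 4x4 averages
    a0 = block_averages(block, 8)              # 1x1 grid: the 8x8 average
    layer0 = np.round(a0).astype(np.uint8)     # 8-bit base layer
    d1 = np.round(a1) - np.repeat(np.repeat(layer0, 2, 0), 2, 1)
    d2 = np.round(a2) - np.repeat(np.repeat(np.round(a1), 2, 0), 2, 1)
    d3 = block.astype(np.int16) - np.repeat(np.repeat(np.round(a2), 2, 0), 2, 1)
    # On the wire, one of every four values in d1-d3 is omitted and
    # reconstructed at the receiver from the coarser average.
    return layer0, d1.astype(np.int16), d2.astype(np.int16), d3.astype(np.int16)

block = np.random.randint(0, 256, (8, 8), dtype=np.uint8)   # example input
layer0, d1, d2, d3 = encode_block(block)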

Encoding: The difference values calculated in layers 1-3 vary in magnitude. If we used the minimum number of bits to represent each difference value individually, the receiver would need to know that number; however, sending that number for every difference value defeats the purpose of compression and we would end up sending more data. To minimize the overhead, we use the same number of bits to represent the difference values that belong to the same layer of an 8x8 block. Furthermore, we group eight spatially co-located 8x8 blocks, referred to as a block-group, and use the same number of bits to represent their difference values for every layer. This serves two purposes: (i) it further reduces the overhead, and (ii) the compressed data per block-group is always byte-aligned. To understand (ii), consider that if b bits are used to represent the difference values for a layer in a block-group, the total of b ∗ 8 bits contributed across the 8 blocks is always byte-aligned. This enables more parallelism in GPU threads, as explained in Sec. 4.3.2. Our meta-data is less than 0.3% of the original data.
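As a hedged illustration of this bit-width selection and byte-aligned packing (the function names are ours; the real encoder performs this inside GPU threads), one layer's differences for a block-group could be packed as follows:

import numpy as np

# Minimum number of bits for the signed differences of one layer in a block-group.
def bits_needed(diffs):
    m = int(np.abs(diffs).max())
    return max(1, int(np.ceil(np.log2(m + 1))) + 1)     # +1 for the sign

# Pack the signed differences of all 8 blocks using nbits per value.
def pack_group(diffs, nbits):
    offset = 1 << (nbits - 1)                            # bias so values are non-negative
    bits = ''.join(format(int(v) + offset, '0{}b'.format(nbits)) for v in diffs.ravel())
    bits += '0' * (-len(bits) % 8)                       # padding is a no-op when all 8 blocks are present
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))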

Decoding: Each pixel at the receiver side is constructed by combining the data from all the received layers corresponding to each block. So the pixel value at location (i, j, k, l, m), where (i, j), k, l, m correspond to the 8x8, 4x4, 2x2 and 1x1 block indices respectively, can be reconstructed as

A0(i, j) + D1(i, j, k) + D2(i, j, k, l) + D3(i, j, k, l, m)

If some differences are not received, they are assumed to be zero. When partial layers are received, we first construct the pixels for which we received more layers and then use them to interpolate the pixels with fewer layers based on the average of the larger block and the current blocks received so far.
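A minimal reconstruction sketch matching the encoder sketch above (our illustration; any missing difference layer simply defaults to zero, as described):

import numpy as np

# Rebuild an 8x8 block from whichever layers were received.
def decode_block(layer0, d1=None, d2=None, d3=None):
    a = np.zeros((8, 8)) + layer0.astype(np.float64)     # layer 0 broadcast to all pixels
    if d1 is not None:
        a += np.repeat(np.repeat(d1, 4, 0), 4, 1)        # refine to 4x4 averages
    if d2 is not None:
        a += np.repeat(np.repeat(d2, 2, 0), 2, 1)        # refine to 2x2 averages
    if d3 is not None:
        a += d3                                          # refine to individual pixels
    return np.clip(np.round(a), 0, 255).astype(np.uint8)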

Encoding Efficiency: We evaluate our layered coding efficiency using 7 video traces. We classify the videos into two categories based on the spatial correlation of the pixels in the videos. A video is considered rich if it has low spatial locality. Fig. 4.4(a) shows the distribution across block-groups of the minimum number of bits required to encode difference values for all layers. For less rich videos, 4 bits are sufficient to encode more than 80% of the difference values. For more detailed or rich videos, 6 bits are needed to encode 80% of the difference values. Fig. 4.4(b) shows the average compression ratio for each video, where the compression ratio is defined as the ratio between the encoded frame size and the original frame size. Error bars represent the maximum and minimum compression ratio across all frames of a video. Our layering strategy can reduce the size of the data to be transmitted by 40-62%. Less rich videos achieve higher compression ratios due to relatively smaller difference values.


Figure 4.4: Compression Efficiency. (a) Difference bits; (b) Compression ratio.

4.3.2 Layered Coding Implementation

Our layered coding encodes a frame as a combination of multiple in- dependent blocks. This allows encoding of a full frame to be divided into multiple independent computations and makes it possible to leverage GPU. GPU consists of many cores with very simple control logic that can efficiently support thousands of threads to perform the same task on different inputs to improve throughput. In comparison, CPU consists of a few cores with com- plex control logic that can handle only a small number of threads running different tasks to minimize the latency of a single task. Thus, to minimize our encoding and decoding latency, we implement our schemes using GPU. However, achieving maximum efficiency over GPU is not straightforward. To maximize the efficiency of GPU implementation, we address several significant challenges.

GPU threads synchronization: An efficient GPU implementation should have minimal synchronization between different threads for independent tasks. Synchronization requires data sharing, which greatly impacts the performance.

78 We divide computation into multiple independently computable blocks. How- ever, blocks still have to synchronize since blocks vary in size and have variable writing offsets in memory. We design our scheme such that we minimize the synchronization need among the threads by putting the compression of a block- group in a single thread.

Memory copying overhead: We transmit encoded data as sublayers. The receiver has to copy the received data packets of sublayers from CPU to GPU memory. A single copy operation has non-negligible overhead. Copying pack- ets individually incurs too much overhead, while delaying copying packets until all packets are received incurs significant startup delay for the decoding pro- cess. We design a mechanism to determine when to copy packets for each sublayer. We manage a buffer to cache sublayers residing in GPU memory. Because copying a sublayer takes time, GPU is likely to be idle if the buffer is small. Therefore, we try to copy packets of a new sublayer while we are decoding a sublayer in the buffer. Our mechanism can reduce GPU idle time by parallelizing copying new sublayers and decoding buffered sublayers.

Our implementation works for a wide range of GPUs, including low to mid-range GPUs [4], which are common on commodity devices.

4.3.2.1 GPU Implementation

GPU background: To maximize the efficiency of GPU implementation, we follow the two guidelines below: (i) A good GPU implementation should divide the job into many small independently computable tasks. (ii) It should

minimize memory synchronization across threads, as memory operations can be expensive.

Memory optimization is critical to the performance of GPU imple- mentation. GPU memory can be classified into three types: Global, Shared and Register. Global memory is standard off-chip memory accessible to GPU threads through bus interface. Shared memory and register are located on- chip, so their access time is ∼100× faster than global memory. However, they are very small. But they give us an opportunity to optimize performance based on the type of computation.

In GPU terms, a function that is parallelized among multiple GPU threads is called kernel. If different threads in a kernel access contiguous memory, global memory access for different threads can be coalesced such that memory for all threads can be accessed in a single read instead of individual reads. This can greatly speed up memory access and enhance the performance. Multiple threads can work together using the shared memory, which is much faster than the global memory. So the threads with dependency can first read and write to the shared memory, thereby speeding up the computation. Ultimately, the results are pushed to the global memory.

Implementation Overview: To maximize the efficiency of GPU imple- mentation, we divide the encoder and decoder into many small independently computable tasks. In order to achieve independence across threads, we should satisfy the following two properties:

No write byte overlap: If no two threads write to the same byte, there is no need for memory synchronization among threads. This is a desired property, as thread synchronization in GPU is expensive. Since layers 1-3 may use fewer than 8 bits to represent the differences, a single thread processing an 8x8 block may generate output that is not a multiple of 8 bits. In this case, different threads may write to the same byte and

require memory synchronization. Instead, we use one thread to encode the difference values per block-group, where a block-group consists of 8 spatially co-located blocks. Since all blocks in the block-group use the same number of bits to represent difference values, the output size of a block-group is a multiple of 8 bits and always byte-aligned.

Read/write independence among blocks: Each GPU thread should know in advance where to read the data from or where to write the data to. As the number of bits used to represent the difference values varies across block-groups, a block-group's writing offset depends on how many bits were used by the previous block-groups. Similarly, the decoder should know the read offsets. Before spawning GPU threads for encoding or decoding, we derive the read/write memory offsets for each thread using a cumulative sum of the number of bits used per block-group.
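A small sketch of this offset derivation (our illustration of the cumulative-sum idea; the real version is computed before the kernels are launched):

import numpy as np

# Byte offset at which each block-group writes (or reads) one layer's data,
# given the per-block-group bit width and the number of values per group.
def layer_offsets(bits_per_group, values_per_group):
    sizes = np.asarray(bits_per_group) * values_per_group // 8   # byte-aligned by design
    return np.concatenate(([0], np.cumsum(sizes)[:-1]))

# Example: four block-groups of layer 1 (3 differences x 8 blocks = 24 values each).
print(layer_offsets([4, 6, 5, 4], 24))                            # -> [ 0 12 30 45]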

4.3.2.2 Jigsaw GPU Encoder

Overview: Encoder consists of three major tasks: (i) calculating averages and differences, (ii) constructing meta-data, and (iii) compressing differences.

Specifically, consider a YUV video frame of dimensions a×b. The encoder first spawns ab/64 threads for the Y values and 0.5ab/64 threads for the UV values. Each thread operates on an 8x8 block to (i) compute the average of the 8x8 block, (ii) compute the averages of all 4x4 blocks within the 8x8 block and take the differences between the average of the 8x8 block and those of the 4x4 blocks, and similarly compute the hierarchical differences for the 2x2 and 1x1 blocks, and (iii) coordinate with the other threads in its block-group using the shared memory to get the minimum number of bits required to represent the difference values for its block-group. Next, the encoder derives the meta-data and memory offsets. Then it spawns (1.5 · a · b)/(64 · 8) threads, each of which compresses the difference values for a group of 8 blocks.

Calculating averages and differences: The encoder spawns multiple threads to compute averages and differences. Each thread processes an 8x8 block. It calculates all the averages and difference values required to generate all layers for that block. It reads eight 8-byte chunks for a block one by one to compute the average. In GPU architecture, multiple threads of the same kernel execute the same instruction for a given cycle. In our implementa- tion, successive threads work on contiguous 8x8 blocks. This makes successive threads access and operate on contiguous memory, so the global memory reads are coalesced and memory access is optimized.

After computing the average at each block level, each thread computes the differences for each layer and writes it to an output buffer. The thread also keeps track of the minimum number of bits required to represent the

differences for each layer. Let b_i denote the number of bits required for layer i. All 8 threads for a block-group use atomic operations in a shared memory location to get the number of bits required to represent the differences for that block-group. We use B^i_j to denote the number of bits used by block-group j for layer i.

Meta-data processing: We compute the memory offset at which the compressed values for the i-th layer of block-group j should be written using a cumulative sum C^i_j = Σ_{k=0}^{j} B^i_k. Based on the cumulative sum, the encoding kernel generates multiple threads to process the difference values concurrently without write byte overlap. The B^i_j values are transmitted, based on which the receiver can compute the memory offsets of the compressed difference values. One meta-data value is transmitted per block-group for each layer except the base layer (i.e., 3 values per block-group), and we need 4 bits to represent one meta-data value. Therefore a total of (3 · 1.5 · a · b)/(64 · 8 · 2) bytes are sent as meta-data, which is within 0.3% of the original data.

Encoding: A thread is responsible for encoding all the difference values for

a block-group in a layer. A thread for block-group j uses B^i_j to compress and combine the difference values for layer i from all its blocks. It then writes the compressed values into an output buffer at the given memory offset. Our design ensures that consecutive threads read and write contiguous memory locations in the global GPU memory to take advantage of memory coalescing. Fig. 4.5(a) shows the average running time of each step in our compression algorithm. On average, all steps finish in around 10ms. Error bars represent the maximum and minimum running times across different runs.

4.3.2.3 Jigsaw GPU Decoder

Overview: The receiver first receives the layer 0 and meta-data. It then pro- cesses the meta-data to compute the read offsets. Each layer except the base layer has 3 sublayers as described in Section 4.3.1. Once all data corresponding to a sublayer are received, a thread is spawned to decode the sublayer and add the difference value to the previously reconstructed lower level average. This process is repeated for every sublayer that is received completely.

The decoder consists of the following steps: (i) processing meta-data, (ii) decoding, and (iii) reconstructing a frame.

Meta-data processing: Meta-data contains the number of bits used for dif- ference values per layer per block group. The kernel calculates the cumulative

sum C^i_j from the meta-data values, similarly to the encoder described previously. This cumulative sum indicates the read offset at which block-group j can read the compressed difference values for any sublayer in layer i. Since all sublayers in a layer are generated using the same number of bits, their relative offset is the same.

Decoding: A decoding thread decodes all the difference values for a block- group corresponding to a sublayer. This thread is responsible for decompress- ing and adding the difference value to the previously reconstructed lower level average. This process is repeated for every sublayer that is received completely. Each block is composed of 4 smaller sub-blocks. Difference values for three of these sub-blocks are transmitted. Once sublayers corresponding to the three


sub-blocks of the same block are received, the fourth sublayer is constructed based on the average principle.

Figure 4.5: Codec GPU Modules Running Time. (a) Encoder Modules; (b) Decoder Modules.

Received sublayers reside in the main memory and should be copied to GPU memory before they can be decoded. Each memory copy incurs overhead so copying every packet individually to GPU has significant overhead. On the other hand, memory copy for a sublayer cannot be deferred to the point at which its decoding kernel is scheduled. Otherwise, it would stall the kernel till the sublayer is completely copied to GPU memory.

We implement a smart delayed copy mechanism. It parallelizes the memory copy for one sublayer with the decoding of another sublayer. We always keep a maximum of 4 sublayers in GPU memory; they are next in line to decode. As soon as one of these 4 sublayers is scheduled to be decoded on the GPU, we choose a new complete sublayer to copy from CPU to GPU memory. If no new complete sublayer is available, we copy a partial sublayer to the GPU. In the latter case, all future incoming packets for that sublayer are directly copied to the GPU without any delay. Our smart delayed copy mechanism allows


us to reduce the memory copy time between GPU and main memory by 4x at the cost of only 1% of the kernels experiencing a 0.15ms stall on average due to delayed memory copying. The total running time of the decoding process for a sublayer consists of the memory copy time and the kernel running time. Because the memory copy time is large, our mechanism significantly reduces the total running time.

Figure 4.6: Jigsaw Pipeline
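The following host-side Python sketch (our simplified illustration; the class and method names are hypothetical and the real copies are asynchronous GPU transfers) captures the policy of keeping at most four sublayers staged in GPU memory and overlapping the next copy with the current decode:

from collections import deque

MAX_STAGED = 4          # sublayers kept ready in GPU memory

class DelayedCopy:
    def __init__(self, copy_to_gpu):
        self.copy_to_gpu = copy_to_gpu   # e.g., wraps an asynchronous host-to-device copy
        self.complete = deque()          # fully received sublayers still in CPU memory
        self.staged = deque()            # sublayers already copied to GPU memory

    def on_sublayer_received(self, sublayer):
        self.complete.append(sublayer)
        self._fill()

    def on_decode_scheduled(self):
        # Called when the GPU picks the next staged sublayer to decode.
        if not self.staged:
            return None
        decoding = self.staged.popleft()
        self._fill()                     # overlap the next copy with this decode
        return decoding

    def _fill(self):
        while len(self.staged) < MAX_STAGED and self.complete:
            self.staged.append(self.copy_to_gpu(self.complete.popleft()))

When no complete sublayer is available, the real system additionally starts copying a partial sublayer and streams its remaining packets directly to the GPU; that refinement is omitted here for brevity.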

Frame reconstruction: Once the deadline for a frame is imminent, we stop decoding its sublayers and prepare for reconstruction. We interpolate for the partial sublayers as explained in Section 4.3.1. The receiver reconstructs a pixel based on all the received sublayers. Finally, reconstructed pixels are organized as a frame and rendered on the screen.

Fig. 4.5(b) shows the average running time of each step in our decom- pression algorithm, where the error bars represent the maximum and minimum values. On average, the total running time is around 19ms.

4.3.2.4 Pipelining

A 4K video frame takes a significant amount of time to transmit. Starting transmission only after the whole frame finishes encoding increases delay; similarly, decoding only after receiving a whole frame is also inefficient. Pipelining can potentially be used to reduce the delay. However, not all encoders and decoders can be pipelined: many encoders have no intermediate point where transmission can be pipelined, and due to data dependency it is not possible to start decoding before all dependent data has been received.

Our layering strategy has nice properties that (i) the lower layers are independent of the higher layers and can be encoded or decoded independently and (ii) sublayers within a layer can be independently computed. This means our encoding can be pipelined with data transmission as shown in Fig. 4.6. Transmission can start as soon as the average and difference calculation is done, which can happen as soon as 2ms after the frame is generated. This is possible since the base layer consists of only average values and does not re- quire any additional computation. Subsequent sublayers are encoded in order and scheduled for transmission as soon as any sublayer is encoded. Average encoding time for a single sublayer is 200-300us.

A sublayer can be decoded once all its lower layers have been received. As the data are sent in order, by the time a sublayer arrives, its dependent data are already available. So a sublayer is ready to be decoded as soon as it is received. As shown in Fig. 4.6, we pipeline sublayer decoding with data reception. Final frame reconstruction cannot happen until all the data of the

frame have been received and decoded. As the frame reconstruction takes around 3-4ms, we stop receiving data for a frame at its playout deadline minus the reconstruction time. Pipelining reduces the delay from 10ms to 2ms at the sender, and from 20ms to 3ms at the receiver.

4.3.3 Video Transmission

We implement a driver module on top of the standard network interface bonding driver, which bonds the WiFi and WiGig interfaces. It is responsible for the following important tasks: (i) adapting video quality based on the wireless throughput, (ii) selecting the appropriate interface to transmit the packets (i.e., intra-video-frame scheduling), and (iii) determining how much data to send for each video frame before switching to the next one (i.e., inter-video-frame scheduling). Our high-level approach is to defer decisions until transmission time to minimize the impact of prediction errors. Below we describe these components in detail.

Delayed video rate adaptation: A natural approach to transmit data is to let the application decide how much data to generate and pass it to the lower network layers based on its throughput estimate. This is widely used in the existing video rate adaptation. However, due to unpredictable throughput fluctuation, it is easy to generate too much or too little data. This issue is exacerbated by the fact that the 60 GHz link throughput can fluctuate widely and rapidly.

To fully utilize network resources while minimizing the effect on user quality, we delay the transmission decision as late as possible to remove the need for throughput prediction at the application layer. Specifically, for each video frame we let the application generate all layers and pass them to the driver in the order of layers 0, 1, 2, 3. The driver will transmit as much data as the network interface card allows before the sender switches to the next video frame.

While the concept is intuitive, realizing it involves significant effort as detailed below. First, our module determines whether a packet belongs to a video stream based on the destination port. Instead of forwarding all video packets directly to the interfaces, they are added to a large circular buffer. This allows us to make decisions based on any optimization objective. Packets are dequeued from this buffer and added into the transmission queue of an interface whenever a transmission decision is made. Moreover, to minimize throughput wastage, this module drops the packets that are past their deadline. The module uses the header information attached to each video packet to determine which video frame the packet belongs to and its deadline. Before forwarding the packets to the interface, the module estimates whether that packet can be delivered within the deadline based on current throughput estimate and queue length of the interfaces. The interface queue depletion rate is used to estimate the achievable throughput. If a packet cannot make the deadline, all the remaining packets belonging to the same video frame are marked for deletion since they have the same deadline. The buffer head is moved to the

start of the next frame. Packets marked for deletion are removed from the socket buffer in a separate low-priority thread, since the highest priority is to forward packets to the interfaces to maintain throughput.
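A minimal sketch of the per-packet deadline check (our illustration; the real logic lives inside the bonding driver module and the names are hypothetical):

import time

# Decide whether a packet forwarded now can still arrive before its deadline,
# given the bytes already queued at the interface and its estimated throughput.
def can_meet_deadline(pkt_bytes, deadline_ts, queued_bytes, est_throughput_bps):
    tx_time = (queued_bytes + pkt_bytes) * 8.0 / est_throughput_bps
    return time.time() + tx_time <= deadline_ts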

Intra-frame scheduling: The interface firmware beneath the wireless driver is responsible for forwarding packets. Transmission is out of the driver's control once a packet is put into an interface's queue. This means that the packet cannot be dropped even if it misses its deadline. To minimize such waste, we enqueue only the minimum number of packets at both interfaces needed to sustain the maximum throughput. The interface card notifies the driver whenever a packet has been successfully delivered, and the driver removes it from that interface queue automatically.

Our module keeps track of the queue lengths of both the WiFi and WiGig interfaces. As soon as the queue size of an interface goes below the threshold, it forwards packets from the packet buffer to that interface. As shown in Section 4.1, both interfaces have frequent inactivity intervals during which they are unable to transmit any packets. This can result in significant delay for the packets sent to the inactive interface. Such delay can sometimes cause lower-layer packets to get stuck in one interface queue while higher-layer packets are transmitted on the other interface. Since our layered encoding requires the lower-layer packets to be received in order to decode the higher-layer packets, it is important to ensure that the lower-layer packets are transmitted before the higher layers.

To achieve this goal, the sender monitors the queue on both interfaces.

If no packet is removed from that queue for a duration T, it declares that interface inactive. When the queue of the active interface shrinks, which means that it is now sending packets and is ready to accept more, we compare the layer of the new packet with that of the packet queued at the inactive interface. If the latter belongs to a lower layer (which means it is required in order to decode the higher-layer packets), we move the packet from the inactive interface to the active interface. No rescheduling is done for packets in the same layer since these packets have the same priority. In this way, we ensure that all the lower-layer packets are received prior to the reception of higher-layer packets. T is adapted dynamically based on the current frame's deadline: if the deadline is more than 10 ms away, we set T = 4ms; otherwise it is set to 2ms. This is because we need to react more quickly when the deadline is closer, so that all the enqueued packets can be transmitted on the active interface and received before the deadline.
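A sketch of the inactivity check and lower-layer rescheduling (our illustration; the threshold values follow the text, while the packet data structures are hypothetical):

# Inactivity threshold T depends on how close the frame deadline is.
def inactivity_threshold(ms_to_deadline):
    return 0.004 if ms_to_deadline > 10 else 0.002        # 4 ms vs. 2 ms

# If the inactive interface holds a lower-layer packet than the one we are
# about to enqueue on the active interface, send that packet first instead.
def steal_lower_layer(next_pkt_layer, inactive_queue, idle_time, ms_to_deadline):
    if idle_time < inactivity_threshold(ms_to_deadline) or not inactive_queue:
        return None
    if inactive_queue[0].layer < next_pkt_layer:
        return inactive_queue.pop(0)                       # move it to the active interface
    return None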

Inter-frame scheduling: If the playout deadline for a frame is larger than the frame generation interval, the module can have packets from different video frames at a given instant. For example, in a 30 FPS video, a new frame is generated every 33 msec. If the deadline to play a frame is set to 60 ms from its generation as suggested in recent studies [44], two consecutive frames will overlap for 27 ms. In this case, we have an opportunity to decide when to start transmitting the next video frame. Certainly, no packets should be transmitted after their deadline. However, we sometimes need to move on to transmit the next video frame before the deadline of the previous frame arrives, to ensure

similar numbers of packets are transmitted for two consecutive frames and there is no significant variation in the quality of consecutive frames.

To achieve this, whenever the interface queue has room to accept a new packet, our module predicts the number of packets that can be transmitted for the next frame. If the estimate is smaller than the number of packets already sent for the current frame minus a threshold (80 packets in our system), we switch to transmitting the next video frame. This reduces variation in the quality of two consecutive video frames. Note that a frame is never preempted before its base layer finishes transmission (unless it passes the deadline), to ensure that at least the lowest-resolution frame is fully transmitted.
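A sketch of the switch decision (our illustration; the 80-packet threshold is the value reported for the prototype):

SWITCH_THRESHOLD = 80   # packets

# Switch to the next frame when it would otherwise receive far fewer packets
# than the current one; never preempt before the base layer is fully sent.
def should_switch_frame(pkts_sent_current, est_pkts_next, base_layer_sent):
    return base_layer_sent and est_pkts_next < pkts_sent_current - SWITCH_THRESHOLD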

4.4 Evaluation for Jigsaw

In this section, we first describe our evaluation methodology. Then we perform micro-benchmarks to understand the impact of our design choices. Finally, we compare Jigsaw with the existing approaches.

4.4.1 Evaluation Methodology

Evaluation methods: We classify our experiments into two categories: Real-Experiment and Emulation. Experiments are done on two laptops equipped with a QCA9008 AC+AD WiFi module that supports 802.11ad on the 60 GHz band and 802.11ac on the 2.4/5 GHz bands. Each laptop has a 2.4GHz dual-core processor, 8 GB RAM and an NVIDIA Geforce 940M GPU. One laptop serves as the sender and the other serves as the receiver. The sender encodes the data while the

receiver decodes the data and displays the video.

While real experiments are valuable, the network may vary significantly when we run different schemes even for the same movement pattern. This makes it hard to compare different schemes. Emulation allows us to run different schemes using exactly the same throughput trace. For emulation, we connect two desktops, each with a 3.6GHz quad-core processor, 8 GB RAM, a 256GB SSD and an NVIDIA Geforce GTX 1050 GPU, using a 10 Gbps fiber optic link. We collect packet traces over WiFi and WiGig using our laptops and use these traces to run trace-driven emulation. The sender and receiver code run on the two desktops in real time. We emulate two interfaces, and delay the packets according to the packet traces from WiFi and WiGig before sending them over the 10 Gbps fiber optic link. We verify the correctness of our emulation by comparing the instantaneous throughput between the real traces and the emulated experiment, and find that the emulated throughput is within 0.5% of the real trace's throughput.

Video Traces: We use 7 uncompressed videos in YUV420 format with resolution 4096x2160 from Derf's collection under Xiph [14]. We choose videos with different motion and spatial locality to evaluate their impact. Videos are streamed at 30 FPS, and each video is concatenated with itself to generate the desired duration. We classify the videos into two categories based on the spatial correlation of their pixels. We use the variance of the Y values in 8x8 blocks over all frames of a video to quantify its spatial correlation. Videos that have high variance are classified as high-richness (HR) videos and videos with

low variance are classified as low-richness (LR) videos. We use 2 HR videos and 5 LR videos.
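Our reading of this classification as a quick sketch (assumption: the richness score is the mean per-8x8-block variance of the Y plane; the HR/LR threshold is not specified in the text):

import numpy as np

# Mean variance of Y values over non-overlapping 8x8 blocks of one frame.
def frame_richness(y_plane, block=8):
    h, w = y_plane.shape
    crop = y_plane[:h - h % block, :w - w % block].astype(np.float64)
    blocks = crop.reshape(h // block, block, w // block, block)
    return float(blocks.var(axis=(1, 3)).mean())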

Mobility Patterns: We collect traces by fixing the sender and moving the receiver. We use the following movement patterns in our experiments: (i) Static: no movement of the sender or receiver; (ii) Watching a 360° video: a user watches a 360° video and rotates the receiver laptop up and down, left and right according to the video displayed on the laptop, with the static sender around 1.5m away; (iii) Walking: the user walks around with the receiver laptop in hand within a 3m radius of the sender; (iv) Blockage: the sender and receiver are static, but another user moves between them and may block the link from time to time, thus inducing environment mobility.

For our real experiments, we run each experiment 5 times for each mobility pattern and then average the results. For emulation, we collect 5 different packet level traces for each mobility pattern.

Performance metrics: We use Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) to quantify video quality. PSNR is a widely used video quality metric. Videos with PSNR greater than 45 dB are considered to have excellent quality, 33-45 dB good, and 27-33 dB fair. A 1 dB difference in PSNR is already visible, and a 3 dB difference indicates that the video quality is doubled. Videos with SSIM greater than 0.99 are considered to have excellent quality, 0.95-0.99 good, and 0.88-0.95 fair [103].
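For reference, PSNR for 8-bit frames is computed as follows (standard definition, not specific to this dissertation):

import numpy as np

# Peak Signal-to-Noise Ratio between a reconstructed frame and its reference.
def psnr(frame, reference, max_val=255.0):
    mse = np.mean((frame.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)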



Figure 4.7: Video quality vs no. of received layers.

In all our results, the error bars represent maximum and minimum values unless otherwise specified.

4.4.2 Micro-benchmarks

We use emulation to quantify the impact of different system compo- nents since we can keep the network condition identical.

Impact of layers on video quality: Our layered coding allows us to exploit spatial correlation. It can not only adapt the video quality on the fly but also reduce the amount of data to transmit. We quantify the compression rate for different types of videos. Fig. 4.7 shows the frame PSNR as we vary the number of layers. For the LR videos, receiving only 2 layers can achieve an average PSNR close to 40dB and receiving only 3 layers gives PSNR of 42 dB. For the HR videos, the average PSNR is around 7dB lower than the LR video when receiving 3 layers. Our scheme can achieve similar PSNR values for both kind of videos when all the layers are received. Moreover, as we would expect,

less rich videos have higher compression ratios due to smaller difference values, which can be represented using fewer bits.


Figure 4.8: Impacts of using WiGig and WiFi.

Impact of using WiGig and WiFi: Fig. 4.8 shows the PSNR per frame when using WiGig only. We use the throughput trace collected when a user is watching 360° videos. Without WiFi, PSNR drops below 10dB when a disconnection happens. When WiGig throughput changes drastically, PSNR can decrease by more than 10dB even if WiGig is not completely disconnected. WiFi improves PSNR by over 25dB when WiGig is disconnected, and by 2dB even when WiGig is not disconnected. A WiGig disconnection can result in partially blank frames because the available throughput may not be able to transmit the whole layer 0 in time. The complementary throughput from WiFi removes partially blank frames effectively, so we observe a large improvement in PSNR when a WiGig disconnection happens. We can also transmit more layers when using both WiGig and WiFi because of the higher total available throughput. Thus,

WiFi still improves PSNR even if WiGig is not disconnected.


Figure 4.9: Impacts of interface scheduler.

Impact of interface scheduler: When the WiGig throughput reduces, data at the WiGig interface gets stuck. Such situation is especially bad when the data packets queued at WiGig are from the lower layer than those queued at WiFi. In this case, even if WiFi successfully delivers the data, they are not decodable without the complete lower layer data. To avoid this situation, our scheduler can effectively move the data from the inactive interface to the active interface whenever the inactive interface has lower layer data. When Layer-0 data get stuck, the user will see part of the received frame in blank. Fig. 4.9 shows the percentage of partial frames that do not have complete Layer-0 under various mobility patterns. Our scheduler reduces the percentage of partial frames by 90%, 82% and 49% for watching, walking, and blockage traces, respectively. In the static case, we do not observe any partial frame. Our scheme gives a partially blank frame only when the throughput is below

the minimum required throughput of 50 Mbps. As shown in Fig. 4.9, the number of partially blank frames for our scheme is within 0.1% of what we receive in the ideal case for these traces.

GPU Cores   Power Rating (W)   Encoding Time (ms)   Decoding Time (ms)
384         30                 10.1                 19.3
768         75                 1.7                  5.7
2816        250                1.1                  5.3

Table 4.2: Performance over different GPUs

Impact of Inter-frame Scheduler: When throughput fluctuates, the video quality can vary significantly across frames. Our inter-frame scheduler tries to balance the throughput allocated to consecutive frames. Without this scheduling, we simply send a frame's data until its deadline. As shown in Fig. 4.10, the scheduler improves the quality of frames by around 10dB when the throughput is low. The impact is minimal when the throughput is already high and stable.


Figure 4.10: Impacts of inter-frame scheduler.

Impact of GPU: Next we study the impact of GPU on the encoding and


decoding time. Table 4.2 shows the performance of Jigsaw for three types of GPUs with different numbers of cores. A more powerful GPU has more cores and reduces the encoding and decoding time significantly, leaving more time to transmit data. However, this comes at the cost of higher power consumption. Even the low- to mid-end GPUs that are generally available on mobile devices can successfully encode and decode the data in real time.

Figure 4.11: Impacts of frame deadline.

Impact of Frame Deadline: The frame deadline is determined by the desired user experience of the application. Fig. 4.11 shows the video PSNR when varying the frame deadline from 16 ms to 66 ms. The performance of Jigsaw using a 36 ms deadline is similar to that using a 66 ms deadline, and 36 ms is much lower than the 60 ms delay threshold for interactive mobile applications [44, 57, 78]. We use 60ms as our frame deadline threshold because this is the threshold deemed tolerable by earlier works [44, 57, 78]. However, as we show, Jigsaw can tolerate frame deadlines as low as 36ms, so even if future applications require lower delay, our system can still support them.

4.4.3 System Results

We compare our system with the following three baseline schemes using real experiments as well as emulation.

• Pixel Partitioning (PP): The video frame is divided into 8x8 blocks. The first pixels of all blocks are transmitted, followed by the second pixels of all blocks, and so on. If not all pixels are received for a block, the remaining ones are replaced by the average of the pixels received so far for that block.

• Raw: Raw video frame is transmitted. This is uncompressed video streaming. Packets after their deadline are dropped to avoid wastage.

• Rate adaptation (Rate): Throughput is estimated using historical information. Based on this estimate, the uncompressed video is down- sampled accordingly before transmission. The receiver up-samples it to 4K and displays it. This is similar in concept to DASH streaming, which is the current standard of video-on-demand streaming.

Benefits of Jigsaw: Fig. 4.12 shows the video quality of all schemes under various mobility patterns and videos. We perform real experiments by running each video 5 times over each mobility trace and report the average. We make the following observations.


Figure 4.12: Performance of Jigsaw under various mobility patterns (HR: High-Richness, LR: Low-Richness). (a) HR Video, Static; (b) HR Video, Watching; (c) HR Video, Walking; (d) HR Video, Blockage; (e) LR Video, Static; (f) LR Video, Watching; (g) LR Video, Walking; (h) LR Video, Blockage.

First, Jigsaw achieves much higher PSNR than the other schemes across all mobility scenarios and videos. Raw always has partially blank frames and a very low PSNR of around 10dB in all settings, as the throughput cannot support full 4K frame transmission even in the best case. PP also transmits the complete 4K frame, so it is never able to deliver a complete frame in any setting; however, the impact of not receiving all the data is not as severe as for Raw. This is because some data from different parts of the frame is received, and frames are no longer partially blank. Moreover, using interpolation to estimate the unreceived pixels also improves quality. Rate achieves high PSNR in static scenarios since the throughput prediction is more accurate and it can adapt the resolution. However, it cannot transmit at full 4K resolution.

Second, the benefit of Jigsaw is higher under mobility. Jigsaw improves the median PSNR by up to 6 dB over Rate, 12 dB over PP and 35 dB over Raw in static settings. The corresponding improvements in mobile settings are 15 dB over Rate, 16 dB over PP and 38 dB over Raw, because Jigsaw can quickly adapt to throughput changes due to its layered coding design, delayed video rate adaptation, and smart scheduling.

Third, for all schemes, HR videos suffer more when less data can be transmitted, so all schemes achieve higher PSNR for the LR videos. The median video PSNR for the LR (HR) videos is 46dB (41dB), 38dB (31dB), 33dB (26dB) and 10dB (9dB) for Jigsaw, Rate, PP and Raw, respectively. These results show that Jigsaw is effective and significantly outperforms the existing approaches for a variety of network conditions and videos.

Table 4.3 shows the median video SSIM for all schemes. We observe a similar trend as with PSNR: Jigsaw achieves the highest SSIM, and it achieves at least good video quality for all videos and mobility traces.


Figure 4.13: Frame quality correlation with throughput.

Effectiveness of layer adaptation: Fig. 4.13 shows the correlation between throughput and video frame quality for Jigsaw. We can see that the changes in video quality follow a very similar pattern to the throughput changes. Jigsaw only receives layer 0 and a partial layer 1 when the throughput is close to 0; in those cases, the frame quality drops to around 30dB. When the throughput stays close to 1.8Gbps, the frame quality reaches around 49dB. Because we keep our interface queues small, our packet transmission rate closely follows the packet depletion rate of the interface queues. Hence, our layer adaptation can quickly respond to any throughput fluctuation.

Scenario        Jigsaw   Rate    PP      Raw
HR, Static      0.993    0.978   0.838   0.575
HR, Watching    0.965    0.749   0.818   0.489
HR, Walking     0.957    0.719   0.805   0.421
HR, Blockage    0.971    0.853   0.811   0.511
LR, Static      0.996    0.985   0.949   0.584
LR, Watching    0.971    0.785   0.897   0.559
LR, Walking     0.965    0.748   0.903   0.481
LR, Blockage    0.984    0.862   0.907   0.560

Table 4.3: Video SSIM under various mobility patterns.

4.4.4 Emulation Results

In this section, we compare the frame-quality time series of Jigsaw with the other schemes when using the same throughput trace. We use emulation to compare the different schemes under the same conditions. Fig. 4.14 shows the quality of each frame using an example throughput trace collected from the Watching mobility pattern. As we can see, Jigsaw achieves the highest frame quality and the least variation.


Figure 4.14: Frame quality comparison

Chapter 5

Real-Time Deep Video Analytics on Mobile Devices

5.1 Motivation

Applications that need real-time video analytics mostly rely on computation offloading to reduce inference latency. Computation offloading is extensively studied. DeepDecision [115] and DARE [91] offload computationally intensive inference to an edge/cloud server based on the network condition. Liu et al. [88] propose to offload inference of a few frames to an edge server and use motion to track objects for the frames between two offloaded frames. However, offloading (i) requires a powerful server (e.g., [88] uses an Nvidia TITAN XP GPU, which is far more powerful than the hardware typically present on mobile devices), (ii) requires good network connectivity, which may not be available in remote or crowded areas, and (iii) raises privacy concerns. Thus, there have been many recent works [73, 84, 102, 146] investigating how to run deep models on client devices only, but they can only support video analytics slower than 3 fps. To realize real-time video analytics on mobile devices, we first investigate the major challenges involved.

5.1.1 Deep Model Inference Latency

A natural way to detect objects in a video is to run a deep model on every frame. These models detect the objects in the frame and produce a bounding box for each object. The state of art deep models have many layers. For example, both FasterRCNN [120] and SSD [93] use VGG [130] as the base network and have more than 19 layers. R-FCN [54] uses ResNet-101 [67] and has 101 layers. Even the recent model Yolo [119] designed for real-time object detection has 24 layers. Having many layers means that inference takes a long time.

We evaluate the inference latency of several models, including Yolo [119], FasterRCNN [120], SSD [93], and R-FCN [54]. We run these models on two commodity GPUs: the Nvidia Jetson TX2 (a low-power embedded GPU [6]) and the Nvidia Geforce 940M (a typical mid-tier laptop GPU [5]). Table 5.1 shows the inference latency for a single frame and the accuracy of these models on the PASCAL VOC 2007 test set [58]. The results show that on the embedded GPU, the fastest model can process frames at slightly less than 3 fps, which is an order of magnitude too slow to support real-time object tracking in a typical 30 fps video.

Model            Yolo        SSD-300     F-RCNN      R-FCN
TX2 (GPU)        390ms       370ms       1082ms      1609ms
940M (GPU)       320ms       265ms       954ms       1049ms
Accuracy (mAP)   78.6 [119]  74.3 [93]   73.2 [120]  79.5 [54]

Table 5.1: Inference latency for a single frame

Insight: Commodity mobile hardware cannot run inference on every frame to support real-time object tracking. We need a fast way to adjust the inference results for the frames falling between consecutive inferences.

5.1.2 Motion Estimation Based Tracking

Instead of running inference on every frame, one can use the previous inference results and motion to derive the object position in the current frame. This idea has been explored in a recent work [88], which uses a combination of motion estimation and edge offloading to improve efficiency. Since inference runs on a very powerful edge server, its delay is within 30 ms, so complete inference runs frequently when motion-based tracking fails. Motion-based tracking is used for only a few frames (e.g., 3) between two inferences.

We examine the feasibility of motion tracking when the mobile device runs inference locally. In this case, it incurs a large delay (i.e., around 15 frames). We evaluate two approaches: (i) reuse the bounding box from the previous inference in the current frame and (ii) use the average motion of pixels in the bounding box to track the box from the previous frame to the current frame. To calculate motion, we use either optical flow or motion vectors.
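The avg-tracking baseline evaluated here can be sketched as follows (our illustration using OpenCV's Farneback optical flow; not the exact code used in the measurement):

import cv2

# Shift the previous bounding box by the mean optical flow of the pixels inside it.
def avg_track(prev_gray, cur_gray, box):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    x, y, w, h = box                                   # box in (x, y, width, height)
    dx, dy = flow[y:y + h, x:x + w].reshape(-1, 2).mean(axis=0)
    return (int(round(x + dx)), int(round(y + dy)), w, h)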

We use a widely used accuracy metric IoU [59], which quantifies the ratio between the intersection of the tracked bounding box and ground-truth bounding box and their union. As in the existing work [80], we consider the tracking has failed if IoU is below 0.5.
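The IoU computation itself is the standard one for axis-aligned boxes given as (x1, y1, x2, y2):

# Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2).
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0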

107 1 0.8 0.6


Figure 5.1: Successfully Tracked Frames

In Fig. 5.1, we show the number of successfully tracked frames in a group of 15 consecutive frames. We test 5 videos in our experiments. Our results show that, on average, optical flow based tracking fails after 5 frames. The results for motion vector based tracking and the box-reusing approach are even worse. Motion-based tracking fails because the bounding box contains pixels from both the object and the background. The motion of the background pixels may be very different from that of the object, so the average motion of all pixels in the box may deviate from the object motion. Motion vectors estimate motion at a much coarser granularity than optical flow, hence their accuracy is worse.

Insight : The existing motion based schemes require frequent inference to

108 1 0.8 0.6


maintain high accuracy. It is infeasible to run frequent inference on mobile hardware, so we need better motion estimation.

Figure 5.2: Impacts of object size changes. (Red box: ground-truth result; green box: tracking result.) (a) Size-changes example; (b) IoU degradation.

5.1.3 Object Size Changes

When we use motion to track an object, the bounding box size remains the same. In practical applications (e.g., Augmented Reality, Autonomous Driving, etc), the movement from both object and camera can result in signif- icant changes in the object size. As a result, overestimating or underestimating the box size can significantly degrade the tracking accuracy.

Fig. 5.2(a) shows an example video with a significant change in the object size. Frame 0 has the ground-truth bounding box. The object size decreases significantly at frame 6. More background pixels are included in the tracked box, which causes a significant deviation from the ground-truth box for the subsequent frames.

To quantify the impact of object size changes on IoU, we analyze the

IoU for any group of 15 consecutive frames in 5 videos. The first frame in the group has the ground-truth box. For every subsequent frame, we move the box so that the center of the tracked box matches the center of the ground-truth box but its size remains the same. If the object size did not change, the IoU would be 1.0. Fig. 5.2(b) shows that the average IoU degrades by 0.25 due to the object size changes. This means that even an ideal tracking scheme that does not adapt the size can achieve only 75% accuracy.

Insight : It is important to adapt to object size changes. Otherwise, there will be severe accuracy degradation.

5.2 Challenges

Without offloading, state-of-the-art approaches [90, 146, 153] can only support object tracking at slower than 3 fps on mobile devices (e.g., the Nvidia Jetson TX2). Running object detection on a frame using deep models is called inference. Due to the large latency of running inference on a mobile device (i.e., the inference latency), many frames between two inferences do not have object detection results. Liu et al. [88] use motion tracking to estimate the object positions in those frames. Motion tracking updates the bounding box (generated by inference for each object) using the previous detection result and the average motion of the pixels in the object bounding box. Bounding box tracking based on the average motion of the pixels in the box is referred to as avg-tracking. However, motion estimation is noisy due to the presence of background pixels in the object bounding box. It is challenging to separate the object and background

pixels. A mask is the set of all pixels of the detected object. Existing object mask extraction approaches [66, 98, 110] use complex deep models and have high latency on mobile devices (e.g., Mask RCNN [66] takes more than 1 sec on the Nvidia Jetson TX2).

Moreover, object size or shape can change significantly due to the movement of the object or the camera. Over-estimating or under-estimating the object size can introduce significant error in motion estimation, which translates into a large tracking error. Existing works [91, 115, 140] can detect size changes by running inference on each frame, which is infeasible on mobile devices. Existing motion-based object tracking approaches [88, 123, 127, 129, 150] do not adapt the object size during tracking. It remains open how to design an efficient and robust mechanism to adapt object size for motion-based tracking.

5.3 Approach

As shown in Section 5.1, running deep models on mobile devices is com- putationally expensive and incurs large latency, which adversely impacts the accuracy of object tracking. Leveraging the motion estimation can potentially reduce the frequency of running deep models so that we can achieve both low latency and high accuracy.

In order to leverage motion estimation for this purpose, its accuracy should be high. Otherwise, motion estimation error may quickly accumulate across frames. We identify the following challenges in fully realizing the poten- tial of motion estimation: (i) we should reliably estimate the motion using only

the pixels belonging to the object and avoid contamination from the motion of background pixels, which could be quite different; (ii) it should adapt to changes in object size and shape, which are quite common (e.g., an object moving towards or away from the camera, or non-rigid objects); and (iii) it should automatically detect when motion estimation has accumulated a large error so that we can re-run the complete inference using a deep model.

To address (i), we seek an efficient and reliable method to generate an object mask that filters as many background pixels as possible and uses only the pixels on the mask to estimate motion. Existing methods of mask generation (e.g., [66, 98, 110]) are too expensive for real-time execution on mobile devices. Therefore, we develop a new mask generation method. Our key insight is that a convolutional feature map captures visual features of the object and filters out many background pixels, and is very helpful for mask generation and motion estimation. Combining the visual features with motion further improves the quality of the object mask.

To address (ii), we leverage the mask generated in the previous step and track how pixels on the mask change across frames to adapt the bounding box. However, since neither the mask nor the motion estimation is perfect, we develop a simple yet effective procedure to remove outliers and compute the new box in a way that is more resilient to these errors.

To address (iii), we design an adaptive strategy to decide when to run inference. We develop a simple metric based on the similarity between the tracked box and the ground-truth box from the latest inference. This allows the system to avoid running unnecessary inference.

Next we describe the three major components in detail: (i) mask generation, (ii) adaptation to object size and shape changes, and (iii) adaptive inference.

5.3.1 Reliable Mask Extraction

As mentioned earlier, accurate motion estimation needs efficient and reliable filtering of the background pixels, so that we only use object pixels for motion estimation. On one hand, many existing works can generate accurate masks (e.g., [66, 98, 110]); however, they are too expensive to run. For example, it takes 210 ms for Mask R-CNN [66] to generate a mask on a powerful desktop GPU. On the other hand, one can try to use cheaper clustering-based schemes to differentiate between the object and background from the raw image. However, they are sensitive to the colors and patterns of the object and background [49, 53, 109]. How to efficiently and reliably generate a mask is an important open problem.

Through experimenting with various options, we identify that feature maps can be used for this purpose. Feature maps are the outputs of convolutional layers in deep models. They are fast to compute and can distinguish between background and object pixels. However, not all feature maps are effective, and good feature maps vary across videos. Therefore, we need a way to select an appropriate feature map to use. Although a good feature map helps distinguish most background pixels from object pixels, quite a few background pixels may still look similar to object pixels (e.g., when the background has a similar appearance to the object). This may contaminate the object motion estimation. To remove the remaining background pixels, we observe that the object and background not only differ in their visual features but also in their motion. Therefore, we filter the remaining background pixels using motion. Our mask generation consists of (i) selecting an appropriate feature map, (ii) extracting the feature map, and (iii) computing the motion using the feature map.

Figure 5.3: Calculating motion from feature maps (Left: raw frames or feature maps, Right: optical flow)

Feature map: Feature maps are desirable for mask generation since they are fast to compute and can discriminate many background pixels from object pixels. A feature map is the output from applying a convolutional filter to the input image. A convolutional layer in a trained deep model consists of multiple convolutional filters. Each filter is small and generates one feature map. Refer to [1] for more details of the feature map generation procedure.

Recent works [55, 99, 140] show that the feature maps in the initial layers of deep models extract features correlated with object visual cues. Each element in a feature map is referred to as an activation value. Some feature maps have large activation in the object region and low activation in the background region, while others are the opposite. We prefer feature maps with low activation in the background region so that we can apply clustering to the feature map to effectively filter out background pixels.

Figure 5.4: Representative mask generated from the intersection between the feature-map mask and the optical-flow mask (panels: raw image, feature map, ground-truth mask, feature-map mask, optical-flow mask, representative mask)

As an example, consider tracking the bike in the video frame in Fig. 5.3. The feature map is generated from the first convolutional layer and the second max-pool layer in the Yolo model [119]. The max-pool layer downsamples the feature map output of the first layer by taking the maximum of each 2 × 2 grid. The first column illustrates the original frame and feature maps, and the second column shows the corresponding optical flow. We can see that filter 9 has very sparse activation and the object region is highly correlated with the visual features of the actual object. Applying clustering to this feature map removes most of the background pixels. This shows the potential of using feature maps to remove background pixels.

Estimating motion: Optical flow is widely used to estimate the motion between two consecutive frames [41]. It is generally calculated from raw video frames. However, computing optical flow directly on raw frames is error-prone when the background and object look similar. Fig. 5.3 shows that the optical flow from raw frames cannot effectively separate object and background motion even when they are quite different.

Since feature maps are effective in separating the background and object pixels, we compute the optical flow on feature maps. Fig. 5.3 shows the optical flow of a good feature map, which correctly captures the motion of object pixels. However, not all feature maps have good optical flow. For example, filter 10 does not generate a sparse feature map, so its optical flow is very noisy and fails to capture the object movement.

Feature Map Selection: A convolutional layer generates multiple feature maps (e.g., 32 in the first convolutional layer of Yolo [119]). Generating and processing all feature maps is not only expensive but can also introduce errors, since using bad feature maps can significantly degrade the performance.

We develop a simple metric to select the right feature map. It is based on the observation that a good feature map should have high activation in the object region and sparse activation in the background region. Therefore, our metric is the ratio between the total activation of the pixels in the ground-truth bounding box and the total activation of all other pixels. We call this metric contrast. A higher contrast is preferred.
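As a concrete illustration, the contrast metric can be computed directly from the activation values. The following Python/NumPy sketch (the helper names contrast and select_feature_map are ours, not part of the Sight implementation) assumes the feature map is a 2-D array and the box is given in feature-map coordinates:

import numpy as np

def contrast(feature_map, box):
    """Contrast of one feature map: total activation inside the
    ground-truth box divided by the total activation of all other
    pixels (higher is better).

    feature_map: 2-D array of activation values (H x W).
    box: (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = box
    inside = feature_map[y1:y2, x1:x2].sum()
    outside = feature_map.sum() - inside
    return inside / (outside + 1e-9)   # guard against an all-object map

def select_feature_map(feature_maps, box):
    """Index of the feature map with the highest contrast."""
    return int(np.argmax([contrast(fm, box) for fm in feature_maps]))

The metric only needs to be evaluated at inference time, when a fresh ground-truth box is available.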

The ground-truth bounding box is obtained by running a deep model. This means that the ground truth is available only when we run inference. A natural question is whether the feature map selected at the time of inference remains good for the subsequent frames until the next inference. To answer this question, we estimate our contrast metric over 20 frames, which is the average number of frames between two inferences across different videos. We find that the contrast metric remains stable until the next inference, staying within 17% of the contrast value of the frame at the time of inference.

Clustering: We use a combination of visual features and motion to group the pixels in the box into two clusters: object and background. This is inspired by the following observation. Object pixels differ from background pixels in both visual features [49, 53, 109] and motion [45, 85, 107]. However, neither alone is reliable, since objects can have similar visual features (e.g., color, texture) and similar motion to some part of the background. By using both, we can significantly enhance the reliability of clustering. The intuition is that pixels identified as object pixels by both visual features and motion are more likely to be true object pixels.

We generate the feature-map mask by simply clustering the pixels in the bounding box into two clusters based on their activation values in the feature map. We select as the object cluster the one whose average activation value differs more from the average activation of the pixels surrounding the box.

We obtain the optical-flow mask by performing similar clustering on the optical flow calculated from feature maps. To further remove background pixels, we take the intersection of the clusters generated from the feature map and the optical flow. We call this the representative mask.
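A minimal sketch of this clustering-and-intersection step is shown below, using scikit-learn's K-means for illustration (Sight itself uses Armadillo's K-means, as noted in Section 5.4). The helper names, and the use of flow magnitude as the clustering feature for the optical-flow mask, are our assumptions:

import numpy as np
from sklearn.cluster import KMeans

def two_way_mask(values, surround_mean):
    """Cluster per-pixel values inside the box into two groups and keep
    the cluster whose mean differs more from the mean of the pixels
    surrounding the box.

    values: 1-D array with one value per pixel in the box (activation
            for the feature-map mask, flow magnitude for the
            optical-flow mask).
    Returns a boolean array over the box pixels.
    """
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(values.reshape(-1, 1))
    means = np.array([values[labels == k].mean() for k in (0, 1)])
    keep = int(np.argmax(np.abs(means - surround_mean)))
    return labels == keep

def representative_mask(activations, act_surround_mean, flow, flow_surround_mean):
    """Intersection of the feature-map mask and the optical-flow mask."""
    fmap_mask = two_way_mask(activations.ravel(), act_surround_mean)
    flow_mag = np.linalg.norm(flow.reshape(-1, 2), axis=1)
    flow_mask = two_way_mask(flow_mag, flow_surround_mean)
    return fmap_mask & flow_mask   # boolean mask over the pixels in the box

The returned mask is defined over the pixels inside the current bounding box; only those pixels are used for motion estimation.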

Figure 5.5: Ratio of Object Pixels in Masks ((a) feature map vs. raw frame; (b) individual masks vs. their intersection)

The top row in Fig. 5.4 shows the representative mask generation on a feature map, while the bottom row shows the masks from raw video frames. The masks from the feature map and optical flow alone cannot completely remove background pixels. The intersection filters out most of the background pixels, but it can also remove some of the object pixels. We design our tracking algorithm such that it only requires the mask to contain enough object pixels for motion estimation and is robust to missing object pixels.

Next, we compare different mask generation schemes. We generate representative masks on feature maps and raw frames for bounding boxes having an IoU of 1.0 and 0.6. Fig. 5.5(a) summarizes the results for all frames. When the IoU is 1.0, on average 90% and 65% of the pixels are object pixels in the masks generated from feature maps and raw frames, respectively. When the IoU is 0.6, the box has more background pixels. The corresponding numbers become 82% and 53%, respectively. The reduction is expected since including more background pixels brings error into mask generation. Moreover, even when the box drifts from the ground-truth box, most pixels on the representative mask generated from feature maps still belong to the object. Fig. 5.5(b) shows that the ratio of object pixels on masks generated from the feature map or optical flow alone is only 64% and 51%, respectively. These results show that (i) generating the mask from feature maps is more reliable than from raw frames, and (ii) combining both visual features and motion improves the mask generation.

5.3.2 Object Size Adaptation

Figure 5.6: Size Adaptive Box Update

Object sizes may vary due to changes in their distance from the camera and/or because objects are non-rigid and different parts move differently. We develop Map&Box to address this issue.


Figure 5.7: Representative Mask based Tracking. (Green Box: tracking result)

Map&Box: Fig. 5.6 illustrates our method. First, we map all pixels in the mask (yellow pixels) to pixels in the current frame (green pixels) using estimated motion. Then we pick a new box that covers all the mapped pixels (dashed green box). This Map&Box technique allows the box to change its dimensions. In the figure, we can see that along the horizontal axis, some of the pixels have larger movement than others, which causes the new dashed green box to have larger width than the original dashed yellow box.

Each pixel is moved independently using its own motion estimate. This helps avoid errors due to averaging the motion over all pixels. However, it is sensitive to the presence of background pixels in the mask, since it keeps all the mapped pixels in the updated box. If the background and object have different motions, the box may grow too large and become inaccurate. To avoid this, we perform the following outlier removal. For each pixel in the desired cluster, we examine whether there are at least c pixels in its neighborhood that belong to the same cluster, where our implementation uses c = 5 and defines the neighborhood as within 2 pixels left, right, up, and down around the pixel. The intuition is that an object is not spatially discontinuous, so sparse points far away from the others in the mask are most likely noise. Furthermore, we also remove the pixels whose estimated motion is more than 1.5 standard deviations away from the average motion of all pixels in the mask.

Our outlier removal discards additional background pixels to improve the robustness of our tracking. While it also removes some object pixels, our tracking algorithm is robust to missing a few object pixels.
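The two outlier-removal rules can be sketched as follows in Python/NumPy; the pairwise neighbor count and the per-component motion test are simplifications chosen for clarity rather than the exact implementation:

import numpy as np

def remove_outliers(coords, flows, c=5, radius=2, k_std=1.5):
    """Two-rule outlier removal for mask pixels. A pixel is kept only if
    (i) at least `c` other mask pixels lie within `radius` pixels of it
    and (ii) its motion is within `k_std` standard deviations of the
    mean motion of all mask pixels.

    coords: (N, 2) integer pixel coordinates of mask pixels.
    flows:  (N, 2) per-pixel motion vectors (dx, dy).
    """
    coords, flows = np.asarray(coords), np.asarray(flows, dtype=float)

    # Rule (i): spatial density (Chebyshev distance <= radius).
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).max(axis=2)
    dense = (dist <= radius).sum(axis=1) - 1 >= c      # exclude the pixel itself

    # Rule (ii): motion consistency.
    mean, std = flows.mean(axis=0), flows.std(axis=0) + 1e-9
    consistent = (np.abs(flows - mean) <= k_std * std).all(axis=1)

    keep = dense & consistent
    return coords[keep], flows[keep]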

Representative Mask Tracking: Applying Map&Box to the representative mask produces a new box that covers all the mapped pixels; however, this does not give us the new object box. This is because a representative mask covers only part of the object pixels. Therefore, a box that covers only the mapped pixels will usually underestimate the object box.

Since the above procedure tends to underestimate the object box, we do not use its output directly as the object box. Instead, we use the changes in the box covering the representative mask to adapt the original object box.

As shown in Fig. 5.6, once we have the mapped box, we compute how much the mapped box moves relative to the original box. We compute ∆X1 and ∆X2, the changes of the two sides along the horizontal axis, and ∆Y1 and ∆Y2, the changes of the two sides along the vertical axis. We use these changes to update the object box and obtain the new box.
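A minimal sketch of the Map&Box update, assuming outliers have already been removed (the function and argument names are ours):

import numpy as np

def map_and_box(mask_coords, flows, mask_box, object_box):
    """Map&Box: move each mask pixel by its own motion, bound the mapped
    pixels, and apply the per-side changes to the object box.

    mask_coords: (N, 2) (x, y) of representative-mask pixels.
    flows:       (N, 2) per-pixel motion (dx, dy).
    mask_box:    (x1, y1, x2, y2) tight box around the mask pixels.
    object_box:  (x1, y1, x2, y2) current object bounding box.
    """
    mapped = np.asarray(mask_coords) + np.asarray(flows)

    # Box covering all the mapped pixels.
    nx1, ny1 = mapped.min(axis=0)
    nx2, ny2 = mapped.max(axis=0)

    # Per-side changes of the mapped box relative to the original mask box.
    mx1, my1, mx2, my2 = mask_box
    dX1, dY1 = nx1 - mx1, ny1 - my1
    dX2, dY2 = nx2 - mx2, ny2 - my2

    # Apply the same per-side changes to the object box.
    ox1, oy1, ox2, oy2 = object_box
    return (ox1 + dX1, oy1 + dY1, ox2 + dX2, oy2 + dY2)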

Note that our representative mask based tracking (referred to as mask-tracking) is designed to tolerate missing object pixels in the representative mask. It can track the object robustly and adapt the box to size changes as long as the mask contains enough pixels to capture the changes in object shape and size. Our evaluation shows the robustness of mask-tracking. We also note that this technique can tolerate some background noise in the representative mask. Since the box dimensions are chosen so that the box covers all mapped pixels, changes in the box size depend mainly on the motion of the pixels close to the boundary. In other words, it is the pixels with the lowest and highest coordinates that determine the dimensions. This makes Map&Box robust against the presence of background pixels in the representative mask that are far from the box boundary.

Fig. 5.7 shows an example of Sight tracking a person for 12 frames. The red points show the representative mask, which is generated using one of the feature maps of that frame. The green box is the tracked box. As it shows, Sight tracks the person effectively: it adapts the box dimensions accordingly when the person changes shape. This is achieved even when the mask covers only part of the person, so most object pixels are not part of the mask. The figure also shows Sight is robust against the presence of some background noise in the representative mask. For frames 3, 9 and 12, some of the background pixels are incorrectly added to the mask. However, the accuracy of the tracked box remains high.

5.3.3 Adaptive Inference

Sight needs an effective mechanism to decide when to run complete inference. One possibility is to run inference on the current frame whenever the previous one completes. However, running inference is expensive and should be avoided unless the accuracy is low. Our goal is to design an adaptive strategy that runs inference only when needed. This involves (i) designing a metric to determine when the accuracy is low, and (ii) developing an efficient way to use the inference result to update the tracking results for the subsequent frames. (ii) is needed because it takes several frames to run inference on a mobile device. By the time the inference result is available, several frames have already passed, and the inference results need to be updated before they can be used.

Inference Triggering Condition: Tracking accuracy is high if the tracked box contains all or most of the object. To quantify the accuracy, we use the similarity between the tracked box and ground-truth box from the last inference. We need a similarity measure that is robust to the relative position change inside the box.

The H.264/AVC [143] codecs encode video frames by exploiting similar macroblocks (with sizes from 4 × 4 to 16 × 16) across frames. For each macroblock, if we find a similar one in the previous frame, it is encoded using the inter-coded mode. The number of inter-coded macroblocks indicates how similar two frames are. For each tracked box, we calculate the number of inter-coded macroblocks when encoding it against the ground-truth box from the last inference. Let n denote the number of inter-coded macroblocks in the current tracked box, and N the maximum number of inter-coded macroblocks seen in any tracked box after the last inference. We run inference if we observe that n/N < α for t frames. Based on our experiments, we find that α = 0.4 and t = 2 give good performance.
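The triggering rule can be sketched as a small state machine; the class below is our illustration, with n supplied per frame by the H.264 encoder as described above:

class InferenceTrigger:
    """Request inference once n/N < alpha for t consecutive frames
    (alpha = 0.4, t = 2 in the text). n is the number of inter-coded
    macroblocks when encoding the current tracked box against the
    ground-truth box from the last inference; N is the largest n
    observed since that inference.
    """
    def __init__(self, alpha=0.4, t=2):
        self.alpha, self.t = alpha, t
        self.reset()

    def reset(self):
        """Call whenever a new inference completes."""
        self.max_n = 0          # N in the text
        self.low_streak = 0     # consecutive frames with n/N < alpha

    def update(self, n):
        """Feed n for the current frame; return True to run inference."""
        self.max_n = max(self.max_n, n)
        if self.max_n > 0 and n / self.max_n < self.alpha:
            self.low_streak += 1
        else:
            self.low_streak = 0
        return self.low_streak >= self.t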

To test our metric, we run tracking over all videos in our dataset and record the IoU of the tracked box whenever the system decides to run inference. We find that the average IoU when we run inference is 0.6. This indicates the metric can reliably predict when the system is about to fail tracking (e.g., when the IoU goes below 0.5).

Stale inference: Fig. 5.8 shows the process of running a single inference. It takes several frames (10 in our experiments) to complete an inference. The object position in the latest frame may change a lot from the inference result, as shown in the figure. If we simply reuse the inference result (green box) in the current frame, the accuracy can be very low. Mask-tracking cannot be used to update the stale inference across the intermediate frames in real time, because clustering every intermediate frame incurs too much delay.

Therefore, we use the average motion of pixels in the bounding box to derive the new box. The average motion is computed from the feature-map optical flow, which has already filtered out most of the noise from background pixels.

Figure 5.8: Stale Inference Update (Green: stale inference. Red: ground-truth. Yellow: updated stale inference.)

Furthermore, since we start from the ground-truth box, which has few background pixels, the error accumulates slowly. This is still less accurate than mask-tracking since there is no size adaptation for these frames, but we trade some accuracy for speed. Our experiments show that the average IoU of the frame after updating the stale inference is 0.7, which shows the effectiveness of this stale inference update strategy.
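A sketch of this stale-inference update, assuming the per-frame motion vectors inside the box have been cached from the feature-map optical flow while inference was running:

import numpy as np

def update_stale_inference(inferred_box, cached_flows):
    """Catch a stale inference result up to the latest frame by applying,
    for every frame that elapsed while inference ran, the average motion
    of the pixels inside the box (avg-tracking on feature-map flow).
    The box is only translated; no size adaptation is applied here.

    inferred_box: (x1, y1, x2, y2) box returned by the stale inference.
    cached_flows: one (N, 2) array of motion vectors per elapsed frame,
                  taken from the pixels inside the box on that frame.
    """
    x1, y1, x2, y2 = inferred_box
    for flows in cached_flows:
        dx, dy = np.asarray(flows, dtype=float).mean(axis=0)
        x1, y1, x2, y2 = x1 + dx, y1 + dy, x2 + dx, y2 + dy
    return (x1, y1, x2, y2)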

Tracking-Aided Inference: In our experiments, we observe that Sight never completely loses track of the object, and whenever it decides to run inference on a frame, the tracked bounding box still partially overlaps with the actual object. We use this observation to reduce the inference latency. Instead of running inference on the whole frame, we only need to run it on a cropped area that has a high probability of including the complete object. Since the tracked box partially overlaps with the object when running inference, we know that this region should be in the neighborhood of the box. However, if our estimate of this region is incorrect, this will increase the inference latency, as we will have to run inference over the whole image after we fail to detect the object in the smaller region. Therefore, we need to pick the dimensions of this region with care. We get the region by extending the tracked box equally in all directions.

5.4 System Implementation

Fig. 5.9 shows the overall workflow of Sight. There are two main modules running in parallel threads. First, the inference module runs inference on the first frame and produces a bounding box for the object. It also selects the feature map for tracking. Once we have the initial box, the tracking module updates the size and position of the box using mask-tracking for every new frame. The tracked box is used to calculate the similarity metric using the H.264 encoder [106]. As long as the similarity metric remains high, the inference module remains idle. Whenever the similarity falls below a threshold, the inference module starts. It runs inference on the current frame, selects the feature map, and updates the stale box using avg-tracking. In the meantime, the tracking module continues to track the object based on the previously tracked box until the inference task completes. Once the inference finishes, the tracking module takes the inferred results and updates the bounding box using the most recently selected feature map.
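The control flow of the two modules can be sketched as follows; the callables infer, mask_track, avg_track, and similarity_low are hypothetical stand-ins for the components described above, and the real system additionally shares the GPU between the two threads with priority given to tracking jobs:

from concurrent.futures import ThreadPoolExecutor

def sight_loop(frames, infer, mask_track, avg_track, similarity_low):
    """Control-flow sketch of Sight. infer(frame) -> (box, fmap_index) is
    the inference module, mask_track(frame, box, fmap_index) -> box is the
    tracking module, avg_track(stale_box, pending_frames) -> box is the
    stale-inference update, and similarity_low(frame, box) -> bool is the
    H.264-based similarity test.
    """
    pool = ThreadPoolExecutor(max_workers=1)       # inference thread
    box, fmap = infer(frames[0])                   # bootstrap on the first frame
    future, pending = None, []

    for frame in frames[1:]:
        box = mask_track(frame, box, fmap)         # every frame: mask-tracking
        if future is None:
            if similarity_low(frame, box):         # trigger a new inference
                future, pending = pool.submit(infer, frame), []
        else:
            pending.append(frame)
            if future.done():                      # inference finished:
                stale_box, fmap = future.result()  # update the stale box with
                box = avg_track(stale_box, pending)  # avg-tracking, then resume
                future, pending = None, []
        yield box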

Figure 5.9: System Architecture for Sight

Sight uses darknet [117] to run Yolo [119], for both inference and feature map extraction. Darknet is an open source neural network framework. It is built upon cuDNN [50], a GPU-accelerated library from Nvidia to run deep neural networks. Jobs from the tracking module always have high priority, so they can preempt any task from the inference module running on the GPU. Sight is flexible and can easily support other models trained with different resolutions. However, we choose Yolo because other models either have high latency (e.g., Faster RCNN [120], R-FCN [67]) or low inference accuracy (e.g., SSD [93]). In comparison, Yolo has low latency and comparable accuracy to R-FCN [54].

5.4.1 Inference Module

The inference module runs the Yolo model for inference. To speed up inference, it crops the input frame to a region centered at the center of the current tracked box, whose dimensions are the larger of 40% of the entire frame and twice the tracked box dimensions. If the model cannot detect the object in the cropped frame, it re-runs the model on the whole frame.
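A sketch of this cropping rule (the function name and the clipping to frame boundaries are our additions):

def crop_for_inference(frame_w, frame_h, box, frac=0.4, scale=2.0):
    """Region for tracking-aided inference: centered on the tracked box,
    with each dimension the larger of `frac` of the frame and `scale`
    times the box, clipped to the frame boundaries.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w = max(frac * frame_w, scale * (x2 - x1))
    h = max(frac * frame_h, scale * (y2 - y1))
    rx1, ry1 = max(0, int(cx - w / 2)), max(0, int(cy - h / 2))
    rx2, ry2 = min(frame_w, int(cx + w / 2)), min(frame_h, int(cy + h / 2))
    return rx1, ry1, rx2, ry2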

While running the inference, it also ranks the feature maps from the first convolutional and pooling layer in terms of the contrast metric defined in Section 5.3. It then passes the index of the selected feature map to the tracking module along with the inference results.

5.4.2 Tracking Module

The tracking module consists of several subtasks and below we describe each of them.

Feature Map Extraction: We use a CNN with a single convolutional layer and a single max-pool layer to extract the feature map. The convolutional layer has a single filter, which is one of the 32 filters in Yolo's first layer. This filter is selected based on the index from the inference module. To minimize latency, we only compute the feature map of a small region that includes the current box. Specifically, our empirical evaluation suggests computing the feature map of the region centered at the center of the current box, whose dimensions are the larger of the current box enlarged by 20 pixels in both width and height and 0.7 times the original frame. Such a choice can completely cover the object in subsequent frames.
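A simplified sketch of this single-filter extraction is shown below; it uses a single-channel kernel and a plain ReLU for clarity, whereas the actual first-layer Yolo filter operates on three color channels:

import numpy as np
from scipy.signal import correlate2d

def extract_feature_map(region, kernel):
    """One convolution followed by 2x2 max-pooling over the cropped
    region, mirroring the single-filter CNN used by the tracking module.
    `kernel` stands in for one of Yolo's first-layer filters.
    """
    fmap = np.maximum(correlate2d(region, kernel, mode="same"), 0.0)  # conv + ReLU
    h, w = (fmap.shape[0] // 2) * 2, (fmap.shape[1] // 2) * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))  # 2x2 max-pool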

Motion Calculation: Sight uses the GPU-accelerated optical flow implementation from OpenCV [43] to extract motion across consecutive frames. We compute dense optical flow using Gunnar Farneback's algorithm [60], which produces the optical flow for all pixels in the frame. While the inference is running, the optical flow values are cached and used to update the box based on the inference results.
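For illustration, the flow computation on feature maps can be written with OpenCV's Python API as follows; Sight uses the GPU-accelerated implementation, and the Farneback parameters shown are illustrative rather than the tuned values of the real system:

import cv2
import numpy as np

def feature_map_flow(prev_fmap, curr_fmap):
    """Dense Farneback optical flow between two consecutive feature maps.
    Feature maps are rescaled to 8-bit images before flow computation.
    """
    def to_u8(fm):
        fm = fm.astype(np.float32)
        return cv2.normalize(fm, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    flow = cv2.calcOpticalFlowFarneback(
        to_u8(prev_fmap), to_u8(curr_fmap), None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # (H, W, 2) array of per-pixel (dx, dy)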

Representative Mask Generation: The major component of mask generation is clustering using both the feature map and the optical flow. We use K-means clustering due to its efficiency. For both the feature map and the optical flow, we set K = 2. We use the K-means implementation from Armadillo [124], which is a C++ library for scientific computation. We run clustering over the feature map in parallel with the optical flow computation.

5.5 Evaluation

In this section, we describe our evaluation methodology and present evaluation results.

5.5.1 Evaluation Methodology

Video Datasets: We use 3 video datasets in our evaluation. Videos in our datasets are captured by mobile cameras, and the objects in them have different sizes and move around the frame in all directions. Our video datasets include (i) D1: The dataset D1 consists of 15 videos, where 10 videos are from the DAVIS dataset [112] and 5 videos are from the VOT dataset [80]. The DAVIS and VOT datasets are designed to test object segmentation and tracking. Each video in D1 has a single object to track across all frames. There are 3000 frames in D1. (ii) D2: The dataset D2 consists of 10 videos from the DAVIS dataset. The number of objects in those videos varies from 2 to 6. It contains 1400 frames. (iii) D3: The dataset D3 has 10 videos from the ImageNet ILSVRC2017 dataset [122]. D3 has different object categories from those in D1. D3 consists of single-object videos and has 1080 frames. For all traces, we only extract video segments in which the target objects are not completely occluded. We resize the frame resolution to 544×544 to apply the Yolo model.

Ground-truth Object Bounding Box: We use the object detection results from Yolo [119] as the ground-truth. The original datasets, DAVIS and VOT, provide both video frames and manually labeled ground-truth bounding boxes. For fewer than 3% of frames, Yolo fails to detect the target object. For those cases, we use the object annotation as the ground-truth.

Tracking Accuracy Metric: We use Intersection over Union (IoU) [59] and mean Average Precision (mAP) [80] to quantify the accuracy. Both metrics are widely used. The IoU is the ratio of the area of overlap between the tracked bounding box and the ground-truth bounding box to the area of their union. The mAP [80] quantifies the percentage of frames in which the object is successfully tracked (i.e., IoU ≥ 0.5). For both metrics, higher values are preferred.
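Both metrics can be computed as follows (a straightforward sketch; box coordinates are (x1, y1, x2, y2)):

import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def mean_average_precision(tracked, ground_truth, thresh=0.5):
    """Fraction of frames whose tracked box overlaps the ground truth with
    IoU >= thresh (the mAP convention used in this chapter, following [80])."""
    hits = [iou(t, g) >= thresh for t, g in zip(tracked, ground_truth)]
    return float(np.mean(hits))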

Experiment Hardware: We run our experiments on the Nvidia Jetson TX2 [6], which is a power-efficient module for mobile devices like robots, drones, smart cameras and portable medical devices. It is equipped with an Nvidia Pascal GPU with 256 CUDA cores. It has 8GB memory shared by both the CPU and GPU. The maximum CPU frequency is 2.0GHz and the maximum GPU frequency is 1.3GHz. We use the default running mode, which uses Dynamic Voltage and Frequency Scaling (DVFS) [2] to adjust the clock frequency at run time according to the load and power consumption.

Figure 5.10: Micro-benchmark results over dataset D1 ((a) feature map benefits; (b) feature map selection; (c) representative mask; (d) motion calculation; (e) object size adaptation; (f) inference interval; (g) adaptive inference; (h) tracking-aided inference)

Algorithms: There are some related works on speeding up video analytics. Liu et al. [90] use an LSTM model to incorporate the feature maps of previous frames to track the object in the current frame. To achieve accuracy comparable to our approach Sight, their approach [90] only supports 3 fps. Using a simpler model, it can support 15 fps but with 30% accuracy degradation. Zhu et al. [153] use optical flow to generate the feature maps of the current frame from those of previous frames. The high complexity of feature map generation yields a large processing delay (e.g., > 700 ms per frame). Since these approaches are too slow for real-time use, we skip their implementation and compare with the following algorithms: (i) Baseline, which is implemented in [88] and uses the average motion calculated from raw frames, (ii) Baseline+Map, which uses the average motion calculated from the feature maps, and (iii) Sight, which uses the representative mask as described in Sec. 5.3.1. The benefit of (iii) over (ii) comes from using the representative mask and size/shape adaptation, and the benefit of (iii) over (i) further includes the use of feature maps over raw frames.

5.5.2 Micro-benchmarks

First, we evaluate the benefit of each individual component in Sight using the dataset D1.

Benefits of Feature Maps: Fig. 5.10(a) shows the accuracy of running mask-tracking on raw frame and feature map. The feature map improves the accuracy by around 20% as it can remove noise from background pixels and more reliably captures the object movement.

Feature Map Selection: There are around 8 feature maps whose accuracy is within 10% of the best one in our videos. Fig. 5.10(b) shows our selected feature map is within 3% of the best. It is 30% and 180% better than the median and worst feature maps. This demonstrates the effectiveness of our feature map selection.

Representative Mask: We compare generating a representative mask using the intersection between feature map and optical flow versus using only the feature map or the optical flow. Fig. 5.10(c) shows that the average accuracy of the optical flow mask and the feature map mask is 0.46 and 0.51, respectively. In comparison, the accuracy of Sight is 0.60. Note that the benefit of our mask generation varies across videos. For some videos with good feature maps (i.e., low activation for background pixels), the feature map mask is already good. Using the intersection still improves over the feature map mask, but not by a lot. For videos with complex background, the feature map alone produces a very noisy mask. The intersection can significantly improve the performance. For videos with large camera movement, the optical flow mask is very noisy and the intersection gives significant improvement.

Motion Calculation: In Sight, we use dense optical flow [60] to get motion estimates for all pixels. Another way of estimating motion is to use motion vectors, which is much faster than optical flow. However, motion vectors only contain coarse movement patterns at the macroblock level, which varies from 4 × 4 to 16 × 16. Moreover, the motion vector in the H.264 codec is designed for compression instead of accurate motion tracking [143]. Fig. 5.10(d) shows that optical flow outperforms the motion vector by 22%.

Object Size Adaptation: In Fig. 5.10(e), Sight outperforms the scheme that updates the bounding box using the average motion over all pixels in the representative mask by 14%. The benefit comes from adapting the bounding box size and allowing different parts of objects to move differently when adjusting the box.

Figure 5.11: Performance of Sight over dataset D1 ((a) GPU usage; (b) average IoU; (c) mAP; (d) frame IoU)

Adaptive Inference: To reduce resource usage, Sight runs inference adaptively. Fig. 5.10(f) shows that the average number of frames between two consecutive inferences in each video is between 14 and 29. Fig. 5.10(g) shows that adaptive inference performs similarly to the most frequent inference, which runs inference whenever the previous one finishes. The most frequent inference is sometimes less accurate because it more frequently uses avg-tracking to update the inference, and avg-tracking is less accurate than mask-tracking.

Tracking-Aided Inference: Fig. 5.10(h) shows the accuracy of Sight with and without using cropped frames for inference. If we use the entire input frame, the inference latency of Sight increases to the duration of 15 frames and the average accuracy reduces by 10%, because a larger inference latency means more frames use avg-tracking for update, and avg-tracking is less accurate than mask-tracking.

Resource Usage: Sight leverages the GPU to run inference and tracking. The average GPU usage is around 64% when running both inference and tracking simultaneously. In comparison, the usage is 4% when running tracking only. The inference latency in Sight is 10 frames, so an inference interval of 10 means the system keeps running inference all the time. Fig. 5.11(a) shows the GPU usage for different inference intervals. As we would expect, increasing the inference interval allows the GPU usage to stay low for a longer time. Therefore, the average usage decreases with the inference interval. Sight has an average inference interval of 20 frames, which shows that adaptive inference reduces the resource usage by around 45% over the most frequent inference.

5.5.3 Single Object Tracking

In this section, we show the performance of our complete system using the dataset D1.

Tracking Accuracy Improvement: Fig. 5.11(b) shows the average IoU for all schemes across 15 videos. The average IoU is 0.32, 0.44 and 0.60 for Baseline, Baseline+Map and Sight, respectively. Sight improves the IoU by 88% and 38% over Baseline and Baseline+Map, respectively. The benefit over Baseline+Map comes from size-adaptive mask-tracking, and the additional benefit over Baseline comes from reliable motion estimation using feature maps. Moreover, the high IoU of Sight shows that even with a large inference delay (equal to 10 frames) Sight can track objects accurately and adapt the box to object size and shape changes through mask-tracking.

Fig. 5.11(c) shows the average mAP across all videos is 0.24, 0.38 and 0.74 for Baseline, Baseline+Map and Sight, respectively. This translates to 207% and 95% improvement over Baseline and Baseline+Map, respectively. Baseline and Baseline+Map run inference whenever the previous inference completes, which avoids performance loss in these schemes due to delayed inference. In comparison, Sight runs inference adaptively. The average number of frames between two inferences is 20. Sight achieves much better performance despite running inference less frequently.

Next we look at the performance on videos where the object moves in such a way that the bounding box size changes a lot. We identify 5 videos in our dataset that have more than 20% average object size change between consecutive inferences. For these videos, the average IoU of Baseline is 0.22, and Sight improves the average IoU to 0.59, a 2.7× improvement. Thus, Sight yields larger improvement for videos that have large changes in object size.

We show the distribution of IoU for all frames in Fig. 5.11(d). Baseline has around 18% of frames with zero IoU, whereas Sight has no such frames. The median IoU improves from 30% in Baseline to 60% in Sight, which translates to a 2× improvement.

Figure 5.12: System latency ((a) inference latency for Baseline, Baseline+Map, and Sight; (b) tracking latency breakdown: map extraction, mask generation, motion calculation, and box tracking, for small and large objects)

Latency: Fig. 5.12(a) shows the average inference latency of a single frame across all videos. In Baseline, the tracked box drifts a lot from the ground-truth box, so running inference over the cropped frame fails often, which increases the inference latency. Therefore, we disable cropping in Baseline. The average inference latency for Baseline is 360 msec. Baseline+Map and Sight use GPU resources to compute the feature map and optical flow while running inference. Higher tracking accuracy allows us to speed up inference by running it over cropped frames. Thus, the inference latency is around 12, 11 and 10 frames for 30 fps videos in Baseline, Baseline+Map and Sight, respectively. Fig. 5.12(b) shows the latency of individual components in the tracking module. In mask generation, larger objects result in higher latency due to more points to cluster. We observe that DVFS keeps the board at the lowest power mode with the lowest clock frequency when only running tracking. We can see that even at the lowest power settings, the total tracking latency can support tracking at 30 fps. At the highest clock frequency, the tracking latency is less than 10 msec per frame, but this comes at the cost of high power consumption.

Figure 5.13: Performance of Sight using the dataset D3 ((a) average IoU; (b) mAP)

5.5.4 Robustness to Different Types of Videos

To validate the robustness of Sight to various types of videos, we apply Sight to the dataset D3, which has different object categories from D1. Fig. 5.13(a) shows that the average IoU of Baseline, Baseline+Map and Sight is 0.34, 0.47 and 0.62, respectively. This translates to 82% and 32% improvements over Baseline and Baseline+Map, respectively. Fig. 5.13(b) shows the mAP is 0.26, 0.36 and 0.72 for Baseline, Baseline+Map and Sight, respectively. Sight improves mAP by 177% and 100% over Baseline and Baseline+Map, respectively. Sight achieves substantial improvements across various types of videos.

5.5.5 Multi-Object Tracking

Sight can also track multiple objects in video frames. We evaluate multi-object tracking using the dataset D2.

Different objects may need different feature maps for tracking. However, extracting multiple feature maps cannot support object tracking at 30 fps. We extend our contrast metric to select a single feature map for all objects. For each feature map, we calculate the multi-object contrast metric by averaging the value of the contrast metric across all objects. We select the feature map with the highest average value. Fig. 5.14 shows the effectiveness of feature map selection for multi-object tracking. Sight(Multi Maps) uses the optimal feature map for each object, while Sight(One Map) uses a single feature map for all objects. Note that Sight(Multi Maps) cannot run in real time and its results are generated offline. The average IoU is 0.63 and 0.59 for Sight(Multi Maps) and Sight(One Map), respectively. Selecting a single feature map only degrades the tracking accuracy by around 6%.
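A sketch of this selection rule, reusing the contrast helper sketched in Section 5.3.1:

import numpy as np

def select_feature_map_multi(feature_maps, boxes):
    """Pick one shared feature map for all tracked objects by averaging
    the per-object contrast over all objects and taking the maximum.
    """
    scores = [np.mean([contrast(fm, box) for box in boxes]) for fm in feature_maps]
    return int(np.argmax(scores))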

Fig. 5.14(a) shows that the average IoU for Baseline, Baseline+Map, Sight(One Map) and Sight(Multi Maps) is 0.35, 0.44, 0.59 and 0.63, respectively. Sight(One Map) improves the average IoU by 69% and 34% over Baseline and Baseline+Map, respectively. Fig. 5.14(b) shows that the mAP for Baseline, Baseline+Map, Sight(One Map) and Sight(Multi Maps) is 0.26, 0.40, 0.71 and 0.77, respectively. Sight(One Map) achieves 173% and 77% improvements over Baseline and Baseline+Map, respectively. For multi-object tracking, the average number of frames between two inferences is 16, which is smaller than the value for single-object tracking.

Figure 5.14: Sight for multi-object tracking (dataset D2) ((a) average IoU; (b) mAP)

Sight reduces the hardware resource usage by around 32% compared with the most frequent inference. Thus, Sight achieves significant improvement over other approaches when tracking multiple objects.

Chapter 6

Conclusion

In this dissertation, we have explained our techniques to improve the performance of video applications for mobile devices.

We propose a novel layered coding design to improve the robustness to throughput fluctuation. With our layered coding, we can adapt to varying data rates by first sending the base layer and opportunistically sending enhancement layers when the network throughput allows. To speed up coding 360◦ and 4K videos, we implement our own video codecs, which incorporate our layered coding design and are fast to run. Our codecs can efficiently run on the available hardware of commodity mobile devices. To adapt to the available throughput, we further develop optimization algorithms to determine the number of layers to transmit. In addition to video bitrate, which is the only dimension used in existing rate adaptation algorithms, our algorithms include the new dimension of deciding the number of video layers to transmit. We implement our layered coding scheme Rubiks for 360◦ video streaming and Jigsaw for live 4K video streaming on commodity mobile devices. Extensive evaluation results demonstrate that Rubiks can achieve up to 69% improvement in user QoE and 49% in bandwidth savings compared with existing approaches. Jigsaw improves PSNR by 6-15dB and improves SSIM by 0.011-0.217 over state-of-the-art approaches.

Moreover, we design a system, Sight, to run real-time video analytics on mobile devices. It is efficient, accurate, and flexible, and runs exclusively on a mobile device without the need for an edge server or network connectivity. It achieves these desirable properties by (i) generating a representative mask, (ii) adapting to changes in object size and shape, and (iii) using the mask to accurately estimate object motion so that we can track an object even with infrequent inference. Our extensive evaluation results from the Nvidia Jetson TX2 demonstrate that we can track objects in real time for 30 fps videos. Compared with the state-of-the-art, Sight improves the average IoU by 88%, improves the average mAP by 207%, and reduces hardware resource usage by 45% for single-object tracking. Sight improves IoU by 69%, improves mAP by 173%, and reduces hardware usage by 32% for multi-object tracking.

Our work shows the following useful observations for the development of mobile video applications. (i) Video streaming and analytics applications are increasingly dominating the usage of mobile devices and it is critical to design effective techniques to guarantee good user experience. (ii) Designing low-cost layered coding for mobile devices enables us to stream high-resolution and 360◦ videos with low latency and high quality. (iii) Identifying when to skip analytics through efficiently and reliably reusing or adapting previous inferences works well for mobile video analytics.

Bibliography

[1] Convolutional filter. http://cs231n.github.io/convolutional-networks/ #conv.

[2] Dynamic voltage and frequency scaling on nvidia jetson tx2. https:// devblogs.nvidia.com/jetson-tx2-delivers-twice-intelligence-edge/.

[3] ffmpeg. https://www.ffmpeg.org/.

[4] Nvidia 940m. https://www.notebookcheck.net/NVIDIA-GeForce-940M. 138027.0.html.

[5] Nvidia geforce 940m. https://www.geforce.com/hardware/notebook-gpus/ geforce-940m.

[6] Nvidia jetson tx2. https://developer.nvidia.com/embedded/buy/ jetson-tx2.

[7] Nvidia jetson tx2 nvpmodel tool. https://www.jetsonhacks.com/ 2017/03/25/nvpmodel-nvidia-jetson-tx2-development-kit/.

[8] Nvidia titan x specs. https://www.nvidia.com/en-us/geforce/products/ 10series/titan-x-pascal/.

[9] Nvidia titan xp. https://www.nvidia.com/en-us/titan/titan-xp/.

[10] Nvidia video codec. https://developer.nvidia.com/nvidia-video-codec-sdk.

[11] Openh264. https://www.openh264.org/.

[12] Power management for jetson agx xavier devices. https://docs.nvidia. com/jetson/l4t/#page/\%2520Linux\%2520Driver\%2520Package\ %2520Development\%2520Guide\%2Fpower_management_jetson_xavier. html\%23wwpID0E0VO0HA.

[13] Tp-link archer c5400x mu-mimo tri-band gaming router. https:// venturebeat.com/2018/09/06/tp-link-launches-gaming-router-for-4k-video-stream-era/.

[14] Video dataset. https://media.xiph.org/video/derf/.

[15] Youtube 4k bitrates. https://support.google.com/youtube/answer/ 1722171?hl=en.

[16] 360-degree football game video, 2017. https://www.youtube.com/ watch?v=E0HUVPM_A00.

[17] 360-degree rollercoaster video, 2017. https://www.youtube.com/watch? v=8lsB-P8nGSM.

[18] 360-degree sailing video, 2017. https://www.youtube.com/watch?v= IJ_CwOFTZyM.

[19] 360-degree shark encounter video, 2017. https://www.youtube.com/ watch?v=rG4jSz_2HDY&t=15s.

[20] Android supported media formats, 2017. https://developer.android.com/guide/topics/media/media-formats.html.

[21] Cisco visual networking index report, 2017.

[22] Facebook cubemap for 360 degree videos, 2017. https://code.facebook. com/posts/1126354007399553/next-generation-video-encoding-techniques-for-360-video-and-vr/.

[23] Google cardboard, 2017. https://store.google.com/us/product/ google_cardboard.

[24] H264, 2017. https://www.itu.int/rec/T-REC-H.264.

[25] Hevc, 2017. https://www.itu.int/rec/T-REC-H.265.

[26] Hevc transform and quantization, 2017. https://link.springer.com/ chapter/10.1007/978-3-319-06895-4_6.

[27] Hsdpa tcp dataset, 2017. http://home.ifi.uio.no/paalh/dataset/ hsdpa-tcp-logs/.

[28] Htc vive, 2017. https://www.vive.com.

[29] Kvazaar, 2017. https://github.com/ultravideo/kvazaar.

[30] Linux tc, 2017. https://linux.die.net/man/8/tc.

[31] Oculus, 2017. https://www.oculus.com.

[32] Samsung gear vr, 2017. http://www.samsung.com/us/mobile/virtual-reality/ gear-vr.

[33] Video codec hardware acceleration, 2017. https://trac.ffmpeg.org/wiki/HWAccelIntro.

[34] Vr/ar market, 2017.

[35] Youtube encoder settings for 360 degree videos, 2017. https://support. google.com/youtube/answer/6396222?hl=en.

[36] Google battery historian tool, 2018. https://github.com/google/ battery-historian.

[37] Siripuram Aditya and Sachin Katti. Flexcast: Graceful wireless video streaming. In Proceedings of the 17th annual international conference on Mobile computing and networking, pages 277–288. ACM, 2011.

[38] Saamer Akhshabi, Lakshmi Anantakrishnan, Constantine Dovrolis, and Ali C. Begen. Server-based traffic shaping for stabilizing oscillating adaptive streaming players. In Proceeding of the 23rd ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, NOSSDAV ’13, pages 19–24, New York, NY, USA, 2013. ACM.

[39] J. Bankoski, J. Koleszar, L. Quillio, J. Salonen, P. Wilkins, and Y. Xu. Vp8 data format and decoding guide. RFC 6386, Google Inc., November 2011.

[40] Y. Bao, H. Wu, T. Zhang, A. A. Ramli, and X. Liu. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In

2016 IEEE International Conference on Big Data (Big Data), pages 1161–1170, Dec 2016.

[41] John L Barron, David J Fleet, and Steven S Beauchemin. Performance of optical flow techniques. International journal of computer vision, 12(1):43–77, 1994.

[42] A. Bjelopera and S. Grgi. Scalable video coding extension of h.264/avc. In Proceedings ELMAR-2012, pages 7–12, Sept 2012.

[43] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.

[44] Chun-Ming Chang, Cheng-Hsin Hsu, Chih-Fan Hsu, and Kuan-Ta Chen. Performance measurements of virtual reality systems: Quantifying the timing and positioning accuracy. In Proceedings of the 2016 ACM on Multimedia Conference, pages 655–659. ACM, 2016.

[45] Michael M Chang, A Murat Tekalp, and M Ibrahim Sezan. Simultaneous motion estimation and segmentation. IEEE transactions on image processing, 6(9):1326–1333, 1997.

[46] H. Chen, G. Braeckman, S. M. Satti, P. Schelkens, and A. Munteanu. Hevc-based video coding with lossless region of interest for telemedicine applications. In 2013 20th International Conference on Systems, Signals and Image Processing (IWSSIP), pages 129–132, July 2013.

[47] Kaifei Chen, Tong Li, Hyung-Sin Kim, David E Culler, and Randy H Katz. Marvel: Enabling mobile augmented reality with low energy and low latency. In Proceedings of the 16th ACM Conference on Embedded Networked Sensor Systems, pages 292–304. ACM, 2018.

[48] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. Glimpse: Continuous, real-time object recogni- tion on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pages 155–168. ACM, 2015.

[49] Tse-Wei Chen, Yi-Ling Chen, and Shao-Yi Chien. Fast image segmentation based on k-means clustering with histograms in HSV color space. In 2008 IEEE 10th Workshop on Multimedia Signal Processing, pages 322–325. IEEE, 2008.

[50] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.

[51] Zhizhen Chi, Hongyang Li, Huchuan Lu, and Ming-Hsuan Yang. Dual deep network for visual tracking. IEEE Transactions on Image Process- ing, 26(4):2005–2015, 2017.

[52] Munhwan Choi, Gyujin Lee, Sunggeun Jin, Jonghoe Koo, Byoungjin Kim, and Sunghyun Choi. Link adaptation for high-quality uncom- pressed video streaming in 60-ghz wireless networks. IEEE Transactions on Multimedia, 18(4):627–642, 2016.

[53] Guy Barrett Coleman and Harry C Andrews. Image segmentation by clustering. Proceedings of the IEEE, 67(5):773–785, 1979.

[54] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.

[55] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 58–66, 2015.

[56] Luca De Cicco, Saverio Mascolo, and Vittorio Palmisano. Feedback control for adaptive live video streaming. In Proceedings of the second annual ACM conference on Multimedia systems, pages 145–156. ACM, 2011.

[57] Jonathan Deber, Ricardo Jota, Clifton Forlines, and Daniel Wigdor. How much faster is fast enough?: User perception of latency & latency improvements in direct and indirect touch. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 1827–1836. ACM, 2015.

[58] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge 2007 (voc2007) results. 2007.

[59] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.

[60] Gunnar Farnebäck. Two-frame motion estimation based on polynomial expansion. In Josef Bigun and Tomas Gustavsson, editors, Image Analysis, pages 363–370, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.

[61] Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Va- suki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivara- man, George Porter, and Keith Winstein. Encoding, fast and slow: Low-latency video processing using thousands of tiny threads. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 363–376, Boston, MA, 2017. USENIX Association.

[62] Mario Graf, Christian Timmerer, and Christopher Mueller. Towards bandwidth efficient adaptive streaming of omnidirectional video over http: Design, implementation, and evaluation. In Proceedings of the 8th ACM on Multimedia Systems Conference, pages 261–271. ACM, 2017.

[63] D. Grois, E. Kaminsky, and O. Hadar. Roi adaptive scalable video coding for limited bandwidth wireless networks. In 2010 IFIP Wireless Days, pages 1–5, Oct 2010.

[64] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[65] Jian He, Mubashir Adnan Qureshi, Lili Qiu, Jin Li, Feng Li, and Lei Han. Rubiks: Practical 360-degree streaming for smartphones. In Pro- ceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. ACM, 2018.

[66] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

[67] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[68] Zhifeng He and Shiwen Mao. Multiple description coding for uncom- pressed video streaming over 60ghz networks. In Proceedings of the 1st ACM workshop on Cognitive radio architectures for broadband, pages 61–68. ACM, 2013.

[69] I. Himawan, W. Song, and D. Tjondronegoro. Impact of region-of- interest video coding on perceived quality in mobile video. In 2012 IEEE International Conference on Multimedia and Expo, pages 79–84, July 2012.

[70] Mohammad Hosseini and Viswanathan Swaminathan. Adaptive 360 VR video streaming: Divide and conquer! CoRR, abs/1609.08729, 2016.

[71] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. A buffer-based approach to rate adaptation: Evidence from a large video streaming service. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM ’14, pages 187–198, New York, NY, USA, 2014. ACM.

[72] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. A buffer-based approach to rate adaptation: Evi- dence from a large video streaming service. ACM SIGCOMM Computer Communication Review, 44(4):187–198, 2015.

[73] Loc N Huynh, Youngki Lee, and Rajesh Krishna Balan. Deepmon: Mobile gpu-based deep learning framework for continuous vision appli- cations. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 82–95. ACM, 2017.

[74] Szymon Jakubczak and Dina Katabi. A cross-layer design for scalable mobile video. In Proceedings of the 17th annual international conference on Mobile computing and networking, pages 289–300. ACM, 2011.

[75] A. Jerbi, Jian Wang, and S. Shirani. Error-resilient region-of-interest video coding. IEEE Transactions on Circuits and Systems for Video Technology, 15(9):1175–1181, Sept 2005.

[76] Junchen Jiang, Vyas Sekar, Henry Milner, Davis Shepherd, Ion Stoica, and Hui Zhang. CFA: A practical prediction system for video QoE optimization. In NSDI, pages 137–150, 2016.

[77] Junchen Jiang, Vyas Sekar, and Hui Zhang. Improving fairness, effi- ciency, and stability in http-based adaptive video streaming with fes- tive. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, CoNEXT ’12, pages 97–108, New York, NY, USA, 2012. ACM.

[78] Teemu Kämäräinen, Matti Siekkinen, Antti Ylä-Jääski, Wenxiao Zhang, and Pan Hui. Dissecting the end-to-end latency of interactive mobile video applications. In Proceedings of the 18th International Workshop on Mobile Computing Systems and Applications, pages 61–66. ACM, 2017.

[79] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGPLAN No- tices, 52(4):615–629, 2017.

[80] Matej Kristan, Jiri Matas, Aleš Leonardis, Tomáš Vojíř, Roman Pflugfelder, Gustavo Fernandez, Georg Nebehay, Fatih Porikli, and Luka Čehovin. A novel performance evaluation methodology for single-target trackers. IEEE transactions on pattern analysis and machine intelligence, 38(11):2137–2155, 2016.

[81] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[82] Robert Kuschnig, Ingo Kofler, and Hermann Hellwagner. An evaluation of tcp-based rate-control algorithms for adaptive internet streaming of h. 264/svc. In Proceedings of the first annual ACM SIGMM conference on Multimedia systems, pages 157–168. ACM, 2010.

[83] Zeqi Lai, Y Charlie Hu, Yong Cui, Linhui Sun, and Ningwei Dai. Furion: Engineering high-quality immersive virtual reality on today’s mobile de- vices. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 409–421. ACM, 2017.

[84] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio For- livesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. Deepx: A software accelerator for low-power deep learning inference on mobile devices. In Proceedings of the 15th International Conference on Information Pro- cessing in Sensor Networks, page 23. IEEE Press, 2016.

[85] José Lezama, Karteek Alahari, Josef Sivic, and Ivan Laptev. Track to the future: Spatio-temporal video segmentation with long-range motion cues. In CVPR 2011, pages 3369–3376. IEEE, 2011.

[86] Yao-Yu Li, Chi-Yu Li, Wei-Han Chen, Chia-Jui Yeh, and Kuochen Wang. Enabling seamless wigig/wifi handovers in tri-band wireless systems. In

Network Protocols (ICNP), 2017 IEEE 25th International Conference on, pages 1–2. IEEE, 2017.

[87] Z. Li, X. Zhu, J. Gahm, R. Pan, H. Hu, A. C. Begen, and D. Oran. Probe and adapt: Rate adaptation for http video streaming at scale. IEEE Journal on Selected Areas in Communications, 32(4):719–733, April 2014.

[88] Luyang Liu, Hongyu Li, and Marco Gruteser. Edge assisted real-time object detection for mobile augmented reality. In ACM Mobicom, 2019.

[89] Luyang Liu, Ruiguang Zhong, Wuyang Zhang, Yunxin Liu, Jiansong Zhang, Lintao Zhang, and Marco Gruteser. Cutting the cord: Designing a high-quality untethered vr system with low latency remote rendering. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services, pages 68–80. ACM, 2018.

[90] Mason Liu and Menglong Zhu. Mobile video object detection with temporally-aware feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5686–5695, 2018.

[91] Qiang Liu and Tao Han. Dare: Dynamic adaptive mobile augmented reality with edge computing. In 2018 IEEE 26th International Conference on Network Protocols (ICNP), pages 1–11. IEEE, 2018.

[92] Sicong Liu, Yingyan Lin, Zimu Zhou, Kaiming Nan, Hui Liu, and Junzhao Du. On-demand deep model compression for mobile devices: A usage-driven model selection framework. In ACM Mobisys, 2018.

[93] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[94] Xiao Lin Liu, Wenjun Hu, Qifan Pu, Feng Wu, and Yongguang Zhang. Parcast: Soft video delivery in mimo-ofdm wlans. In Proceedings of the 18th annual international conference on Mobile computing and networking, pages 233–244. ACM, 2012.

[95] Xing Liu, Qingyang Xiao, Vijay Gopalakrishnan, Bo Han, Feng Qian, and Matteo Varvello. 360° innovations for panoramic video streaming. In Proceedings of the 16th ACM Workshop on Hot Topics in Networks, HotNets-XVI, pages 50–56, New York, NY, USA, 2017. ACM.

[96] Xing Liu, Qingyang Xiao, Vijay Gopalakrishnan, Bo Han, Feng Qian, and Matteo Varvello. 360 innovations for panoramic video streaming. In Proc. of HotNets, pages 50–56, November 2017.

[97] Y. Liu, Z. G. Li, and Y. C. Soh. Region-of-interest based resource allocation for conversational video communication of h.264/avc. IEEE Transactions on Circuits and Systems for Video Technology, 18(1):134–139, Jan 2008.

[98] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

[99] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE international conference on computer vision, pages 3074–3082, 2015.

[100] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’17, pages 197–210, New York, NY, USA, 2017. ACM.

[101] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 197–210. ACM, 2017.

[102] Akhil Mathur, Nicholas D Lane, Sourav Bhattacharya, Aidan Boran, Claudio Forlivesi, and Fahim Kawsar. Deepeye: Resource efficient local execution of multiple deep vision models using wearable commodity hardware. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pages 68–81. ACM, 2017.

[103] A. Moldovan and C. H. Muntean. Qoe-aware video resolution thresholds computation for adaptive multimedia. In 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), pages 1–6, June 2017.

[104] C. Mueller, S. Lederer, J. Poecher, and Ch. Timmerer. libdash - an open source software library for the mpeg-dash standard. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) 2013, San Jose, USA, pages 1–2, 2013.

[105] T. Nitsche, C. Cordeiro, A. B. Flores, E. W. Knightly, E. Perahia, and J. C. Widmer. Ieee 802.11ad: directional 60 ghz communication for multi-gigabit-per-second wi-fi [invited paper]. IEEE Communications Magazine, 52(12):132–141, December 2014.

[106] Nvidia. Nvidia video codec sdk. https://developer.nvidia.com/nvidia-video-codec-sdk, 2019.

[107] Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 36(6):1187–1200, 2014.

[108] J. R. Ohm. Advances in scalable video coding. Proceedings of the IEEE, 93(1):42–56, Jan 2005.

[109] Thrasyvoulos N Pappas. An adaptive clustering algorithm for image segmentation. IEEE Transactions on signal processing, 40(4):901–914, 1992.

[110] Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990–1998, 2015.

[111] Karine Pires and Gwendal Simon. Youtube live and twitch: a tour of user-generated live streaming systems. In Proceedings of the 6th ACM Multimedia Systems Conference, pages 225–230. ACM, 2015.

[112] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.

[113] Yuankai Qi, Shengping Zhang, Lei Qin, Hongxun Yao, Qingming Huang, Jongwoo Lim, and Ming-Hsuan Yang. Hedged deep tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4303–4311, 2016.

[114] Feng Qian, Lusheng Ji, Bo Han, and Vijay Gopalakrishnan. Optimizing 360 video delivery over cellular networks. In Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges, ATC ’16, pages 1–6, New York, NY, USA, 2016. ACM.

[115] Xukan Ran, Haoliang Chen, Xiaodan Zhu, Zhenming Liu, and Jiasi Chen. Deepdecision: A mobile deep learning framework for edge video analytics. In IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pages 1421–1429. IEEE, 2018.

[116] Theodore Rappaport. Wireless Communications: Principles and Practice. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 2001.

[117] Joseph Redmon. Darknet: Open source neural networks in c. http://pjreddie.com/darknet/, 2013–2016.

[118] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.

[119] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.

[120] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[121] Shaoqing Ren, Kaiming He, Ross Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE transactions on pattern analysis and machine intelligence, 39(7):1476–1481, 2017.

[122] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.

[123] Houari Sabirin and Munchurl Kim. Moving object detection and tracking using a spatio-temporal graph in h.264/avc bitstreams for video surveillance. IEEE Transactions on Multimedia, 14(3):657–668, 2012.

[124] C. Sanderson and R. Curtin. Armadillo: a template-based C++ library for linear algebra. The Journal of Open Source Software, 1:26, June 2016.

[125] T. Schierl, T. Stockhammer, and T. Wiegand. Mobile video transmission using scalable video coding. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1204–1217, Sept 2007.

[126] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the scalable video coding extension of the h.264/avc standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1103–1120, Sept 2007.

[127] Sanjivani Shantaiya, Kesari Verma, and Kamal Mehta. Multiple object tracking using kalman filter and optical flow. European Journal of Advances in Engineering and Technology, 2(2):34–39, 2015.

[128] Huai-Rong Shao, Julan Hsu, Chiu Ngo, and ChangYeul Kweon. Progressive transmission of uncompressed video over mmw wireless. In Consumer Communications and Networking Conference (CCNC), 2010 7th IEEE, pages 1–5. IEEE, 2010.

[129] Jeongho Shin, Sangjin Kim, Sangkyu Kang, Seong-Won Lee, Joonki Paik, Besma Abidi, and Mongi Abidi. Optical flow-based real-time object tracking using non-prior training active feature model. Real-Time Imaging, 11(3):204–218, 2005.

[130] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[131] Harkirat Singh, Jisung Oh, Changyeul Kweon, Xiangping Qin, Huai- Rong Shao, and Chiu Ngo. A 60 ghz wireless network for enabling uncompressed video communication. IEEE Communications Magazine, 46(12), 2008.

[132] Harkirat Singh, Xiangping Qin, Huai-rong Shao, Chiu Ngo, Chang Yeul Kwon, and Seong Soo Kim. Support of uncompressed video streaming over 60ghz wireless networks. In Consumer Communications and Networking Conference, 2008. CCNC 2008. 5th IEEE, pages 243–248. IEEE, 2008.

[133] P. Sivanantharasa, W. A. C. Fernando, and H. K. Arachchi. Region of interest video coding with flexible macroblock ordering. In First International Conference on Industrial and Information Systems, pages 596–599, Aug 2006.

[134] Thomas Stockhammer. Dynamic adaptive streaming over http–: standards and design principles. In Proceedings of the second annual ACM conference on Multimedia systems, pages 133–144. ACM, 2011.

[135] Gary J Sullivan, Jens-Rainer Ohm, Woo-Jin Han, Thomas Wiegand, et al. Overview of the high efficiency video coding (hevc) standard. IEEE Transactions on circuits and systems for video technology, 22(12):1649–1668, 2012.

[136] Yi Sun, Xiaoqi Yin, Junchen Jiang, Vyas Sekar, Fuyuan Lin, Nanshu Wang, Tao Liu, and Bruno Sinopoli. Cs2p: Improving video bitrate selection and adaptation with data-driven throughput prediction. In Proceedings of the 2016 conference on ACM SIGCOMM 2016 Conference, pages 272–285. ACM, 2016.

[137] Sanjib Sur, Ioannis Pefkianakis, Xinyu Zhang, and Kyu-Han Kim. Wifi-assisted 60 ghz wireless networks. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, MobiCom ’17, pages 28–41, New York, NY, USA, 2017. ACM.

[138] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[139] N. Tsapatsoulis, C. Loizou, and C. Pattichis. Region of interest video coding for low bit-rate transmission of carotid ultrasound videos over 3g wireless networks. In 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 3717–3720, Aug 2007.

[140] Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. Visual tracking with fully convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 3119–3127, 2015.

[141] Zhou Wang, Ligang Lu, and Alan C Bovik. Video quality assessment based on structural distortion measurement. Signal processing: Image communication, 19(2):121–132, 2004.

[142] Teng Wei and Xinyu Zhang. Pose information assisted 60 ghz networks: Towards seamless coverage and mobility support. In Proceedings of the 23rd Annual International Conference on Mobile Computing and Networking, pages 42–55. ACM, 2017.

[143] Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the h.264/avc video coding standard. IEEE Transactions on circuits and systems for video technology, 13(7):560–576, 2003.

[144] Xiufeng Xie and Xinyu Zhang. Poi360: Panoramic mobile video telephony over lte cellular networks. In Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies, pages 336–349. ACM, 2017.

[145] Xiufeng Xie, Xinyu Zhang, Swarun Kumar, and Li Erran Li. pistream: Physical layer informed adaptive video streaming over lte. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pages 413–425. ACM, 2015.

[146] Mengwei Xu, Mengze Zhu, Yunxin Liu, Felix Xiaozhu Lin, and Xuanzhe Liu. Deepcache: Principled cache for mobile deep vision. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pages 129–144. ACM, 2018.

[147] Hao Yin, Xuening Liu, Tongyu Zhan, Vyas Sekar, Feng Qiu, Chuang Lin, Hui Zhang, and Bo Li. Design and deployment of a hybrid cdn-p2p system for live video streaming: experiences with livesky. In Proceedings of the 17th ACM international conference on Multimedia, pages 25–34. ACM, 2009.

[148] Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. A control-theoretic approach for dynamic adaptive video streaming over http. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM ’15, pages 325–338, New York, NY, USA, 2015. ACM.

[149] Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. A control-theoretic approach for dynamic adaptive video streaming over http. In ACM SIGCOMM Computer Communication Review, volume 45(4), pages 325–338. ACM, 2015.

[150] Wonsang You, MS Houari Sabirin, and Munchurl Kim. Moving object tracking in h.264/avc bitstream. In Multimedia Content Analysis and Mining, pages 483–492. Springer, 2007.

[151] Zhilong Zhang, Danpu Liu, and Xin Wang. Real-time uncompressed video transmission over wireless channels using unequal power allocation. IEEE Systems Journal, 2016.

[152] Chao Zhou, Zhenhua Li, and Yao Liu. A measurement study of oculus 360 degree video streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, pages 27–37. ACM, 2017.

[153] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, volume 3, 2017.

[154] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In CVPR, volume 1, page 3, 2017.
