Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics

Ling-Yu Duan, Jiaying Liu, Wenhan Yang, Tiejun Huang, Wen Gao

Abstract—Video coding, which targets to compress and reconstruct the whole frame, and feature compression, which only preserves and transmits the most critical information, stand at two ends of the scale. That is, one is with compactness and efficiency to serve for machine vision, and the other is with full fidelity, bowing to human perception. The recent endeavors in imminent trends of video compression, e.g. deep learning based coding tools and end-to-end image/video coding, and MPEG-7 compact feature descriptor standards, i.e. Compact Descriptors for Visual Search and Compact Descriptors for Video Analysis, promote the sustainable and fast development in their own directions, respectively. In this paper, thanks to booming AI technology, e.g. prediction and generation models, we carry out exploration in the new area, Video Coding for Machines (VCM), arising from the emerging MPEG standardization efforts1. Towards collaborative compression and intelligent analytics, VCM attempts to bridge the gap between feature coding for machine vision and video coding for human vision. Aligning with the rising Analyze then Compress instance Digital Retina, the definition, formulation, and paradigm of VCM are given first. Meanwhile, we systematically review state-of-the-art techniques in video compression and feature compression from the unique perspective of MPEG standardization, which provides the academic and industrial evidence to realize the collaborative compression of video and feature streams in a broad range of AI applications. Finally, we come up with potential VCM solutions, and the preliminary results have demonstrated the performance and efficiency gains. Further direction is discussed as well.

Index Terms—Video coding for machine, video compression, feature compression, generative model, prediction model

L.-Y. Duan, J. Liu, W. Yang, T. Huang, and W. Gao are with Peking University, Beijing 100871, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

1 https://lists.aau.at/mailman/listinfo/mpeg-vcm

I. INTRODUCTION

In the big data era, massive videos are fed into machines to realize intelligent analysis in numerous applications of smart cities or Internet of things (IoT). Like the explosion of surveillance systems deployed in urban areas, there arise important concerns on how to efficiently manage massive video data. There is a unique set of challenges (e.g. low latency and high accuracy) regarding efficiently analyzing and searching the target within the millions of objects/events captured everyday. In particular, video compression and transmission constitute the basic infrastructure to support these applications from the perspective of Compress then Analyze. Over the past decades, a series of standards (e.g. MPEG-4 AVC/H.264 [1] and High Efficiency Video Coding (HEVC) [2]), Audio Video coding Standard (AVS) [3] are built to significantly improve the video coding efficiency, by squeezing out the spatial-temporal pixel-level redundancy of video frames based on the visual statistics and the priors of human perception.

More recently, deep learning based video coding makes great progress. With the hierarchical model architecture and the large-scale data priors, these methods largely outperform the state-of-the-art codecs by utilizing deep-network aided coding tools. Rather than directly targeting machines, these methods focus on efficiently reconstructing the pixels for human vision, in which the spatial-temporal volume of pixel arrays can be fed into machine learning and pattern recognition algorithms to complete high-level analysis and retrieval tasks.

However, when facing big data and video analytics, existing video coding methods (even for the deep learning based) are still questionable, regarding whether such big video data can be efficiently handled by visual signal level compression. Moreover, the full-resolution videos are of low density in practical values. It is prohibitive to compress and store all video data first and then perform analytics over the decompressed video stream. By degrading the quality of compressed videos, it might save more bitrates, but incur the risk of degraded analytics performance due to the poorly extracted features. To facilitate the high-level machine vision tasks in terms of performance and efficiency, lots of research efforts have been dedicated to extracting those pieces of key information, i.e., visual features, from the pixels, which is usually compressed and represented in a very compact form. This poses an alternative strategy Analyze then Compress, which extracts, saves, and transmits compact features to satisfy various intelligent video analytics tasks, by using significantly less data than the compressed video itself. In particular, to meet the demand for large-scale video analysis in smart city applications, the feature stream instead of the video signal stream can be transmitted. In view of the necessity and importance of transmitting feature descriptors, MPEG has finalized the standardization of compact descriptors for visual search (CDVS) (ISO/IEC 15938-13) in Sep. 2015 [16] and compact descriptors for video analysis (CDVA) (ISO/IEC 15938-15) [17] in July 2019 to enable the interoperability for efficient and effective image/video retrieval and analysis by standardizing the bitstream syntax of compact feature descriptors. In CDVS, hand-crafted local and global descriptors are designed to represent the visual characteristics of images. In CDVA, the deep learning features are adopted to further boost the video analysis performance. Over the course of the standardization process, remarkable improvements are achieved in reducing the size of features while maintaining their discriminative power for machine vision tasks. Such compact features cannot reconstruct the full resolution videos

for human observers, thereby incurring two successive stages II.REVIEWOF VIDEO COMPRESSION:FROM PIXEL of analysis and compression for machine and human vision. FEATURE PERSPECTIVE For either Compress then Analyze or Analyze then Com- Visual information takes up at least 83% of all informa- press, the optimization jobs of video coding and feature tion [18] that people can feel. It is important for humans coding are separate. Due to the very nature of multi-tasks for to record, store, and view the image/videos efficiently. For machine vision and human vision, the intrusive setup of two past decades, lots of academic and industrial efforts have separate stages is sub-optimal. It is expected to explore more been devoted to video compression, which is to maximize collaborative operations between video and feature streams, the compression efficiency from the pixel feature perspective. which opens up more space for improving the performance of Below we review the advance of traditional video coding as intelligent analytics, optimizing the video coding efficiency, well as the impact of deep learning based compression on and thereby reducing the total cost. Good opportunities to visual data coding for human vision in a general sense. bridge the cross-domain research on machine vision and A. Traditional Hybrid Video Coding human vision have been there, as deep neural network has demonstrated its excellent capability of multi-task end-to- Video coding transforms the input video into a compact end optimization as well as abstracting the representations of binary code for more economic and light-weighted storage multiple granularities in a hierarchical architecture. and transmission, and targets reconstructing videos visually by the decoding process. In 1975, the hybrid spatial-temporal In this paper, we attempt to identify the opportunities and coding architecture [20] is proposed to take the lead and challenges of developing collaborative compression techniques occupy the major proportion during the next few decades. for humans and machines. Through reviewing the advance After that, the following video coding standards have evolved of two separate tracks of video coding and feature coding, through the development of the ITU-T and ISO/IEC stan- we present the necessity of machine-human collaborative dards. The ITU-T produced H.261 [21] and H.263 [22], compression, and formulate a new problem of video coding ISO/IEC produced MPEG-1 [23] and MPEG-4 Visual [24], for machines (VCM). Furthermore, to promote the emerging and the two organizations worked together to produce the MPEG-VCM standardization efforts and collect for evidences H.262/MPEG-2 Video [25], H.264/MPEG-4 Advanced Video for MPEG Ad hoc Group of VCM, we propose trial exem- Coding (AVC) [26] standards, and H.265/MPEG-H (Part 2) plar VCM architectures and conduct preliminary experiments, High Efficiency Video Coding (HEVC) [27] standards. which are also expected to provide insights into bridging the The design and development of all these standards follows cross-domain research from visual signal processing, the block-based video coding approach. The first technical vision, and machine learning, when AI meets the video big feature is block-wise partition. Each coded picture is parti- data. The contributions are summarized as follows, tioned into macroblocks (MB) of luma and chroma samples. • We present and formulate a new problem of Video MBs will be divided into slices and coded independently. 
Each Coding for Machines by identifying three elements of slice is further partitioned into coding tree units. After that, tasks, features, and resources, in whicha novel feedback the coding unit (CU), prediction unit (PU), and transform mechanism for collaborative and scalable modes is intro- unit (TU) are obtained to make the coding, prediction and duced to improve the coding efficiency and the analytics transform processes more flexible. Based on the block-based performance for both human and machine-oriented tasks. design, the intra and inter-predictions are applied based on PU • We review the state-of-the-art approaches in video com- and the corresponding contexts, i.e. neighboring blocks and pression and feature compression from a unique perspec- reference frames in the intra and inter modes, respectively. But tive of standardized technologies, and study the impact of these kinds of designed patterns just cover parts of the context more recent deep image/video prediction and generation information, which limits the modeling capacity in prediction. models on the potential VCM related techniques. Moreover, the block-wise prediction, along with transform and • We propose exemplar VCM architectures and provide quantization, leads to the discontinuity at the block boundaries. potential solutions. Preliminary experimental results have With the quantization of the residue or original signal in shown the advantages of VCM collaborative compression the transform block, the blockness appears. To suppress the in improving video and feature coding efficiency and artifacts, the loop filter is applied for smoothing. performance for human and machine vision. Another important technical feature is hybrid video coding. Intra and inter-prediction are used to remove temporal and spa- The rest of the article is organized as follows. SectionII tial statistical redundancies, respectively. For intra-prediction, and III briefly review previous works on video coding and HEVC utilizes a line of preceding reconstructed pixels above feature compression, respectively. SectionIV provides the and on the left side of the current PU as the reference for definition, formulation, and paradigm of VCM. SectionV generating predictions. The number of intra modes is 35, illustrates the emerging AI technique, which provides useful including planar mode, DC mode, and 33 angular modes. evidence for VCM, After that, in Section VII, we provide It is performed in the transform unit. For inter-prediction, potential solutions for VCM problem. In Section VII, the HEVC derives a motion-compensated prediction for a block preliminary experimental results are reported. In Section VIII, of image samples. The homogeneous motion inside a block is several issues, and future directions are discussed. In Sec- assumed, and the size of a moving object is usually larger than tionIX, the concluding remarks are provided. one block. Reference blocks will be searched from previously 3 coded pictures for inter prediction. For both intra and inter- powerful and unified prediction model, these methods obtain predictions, the best mode is selected by the Rate- superior R-D performance. For intra-prediction, the diversified Optimization (RDO) [28]. However, the multi-line prediction modes derived from RDO in the traditional video codec are scheme and the block-wise reference block might not provide a replaced by a learned general model given a certain kind desirable prediction reference when the structures and motion of context. 
In [7], based on the multi-line reference context, are complex. Besides, when the RDO boosts the modeling fully-connected (FC) neural networks are utilized to generate capacity, the overhead of signaled bits and computation occur. the prediction result. In [36], the strengths of FC and CNN The target of optimizing the rate distortion efficiency is to networks are combined. In [8], benefiting from the end- seek for the trade-off in bitrate and signal distortion. It can be to-end learning and the block-level reference scheme, the solved via Lagrangian optimization techniques. Coder control proposed predictor employs context information in a large finally determines a set of coding parameters that affect the scale to suppress quantization . For inter-prediction, encoded bit-streams. deep network-based methods [37, 38] break the limit of With tremendous expert efforts, the coding performance is block-wise motion compensation and bring in better inter- dramatically improved in the past decade that there is the rule frame prediction, namely generating better reference frames of thumb that, one generation of video coding standards almost by using all the reconstructed frames. For loop-filter, the surpasses the previous one by up to 50% in coding efficiency. powerful network architectures and the full-resolution input of all reconstructed frames [39, 40] significantly improve the performance of loop filters, say up to 10% BD-rate reduction B. Deep Learning Based Video Coding as reported in many methods. The deep learning techniques significantly promote the The end-to-end learning based compression leverages deep development of video coding. The seminar work [29] in 2015 networks to model the pixel distribution and generate complex opened a door to the end-to-end learned video coding. The from a very compact representation. The pioneering deep learning based coding tools [30, 31] are developed since work [33] proposes a parametric nonlinear transformation 2016. Benefiting from the bonus of big data, powerful architec- to well Gaussianize data for reducing mutual information tures, end-to-end optimization, and other advanced techniques, between transformed components and show impressive results e.g. unsupervised learning, the emerging deep learning excel in image compression. Meanwhile, in [29], a general frame- in learning data-driven priors for effective video coding. First, work is built upon convolutional and deconvolutional LSTM complex nonlinear mappings can be modeled by the hierarchi- recurrent networks, trained once, to support variable-rate im- cal structures of neural networks, which improves prediction age compression via reconstructing the image progressively. models to make the reconstructed signal more similar to the Later works [32, 41–43] continue to improve compression original one. Second, deep structures, such as PixelCNN and efficiency by following the routes. All these methods attempt PixelRNN, are capable to model the pixel probability and to reduce the overall R-D cost on a large-scale dataset. Due provide powerful generation functions to model the visual to the model’s flexibility, there are also more practical ways signal in a more compact form. to control the bitrate adaptively, such as applying the attention The deep learning based coding methods do not rely on mask [44] to guide the use of more bits on complex regions. 
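Both the mode decision of traditional hybrid codecs and the training objective of end-to-end learned codecs revolve around the same Lagrangian rate-distortion cost J = D + λR mentioned above. The following toy sketch, with entirely made-up distortion and rate numbers and no real codec involved, only illustrates how the choice of λ tilts the decision between quality and bitrate.

```python
# Minimal sketch of the Lagrangian rate-distortion trade-off used both for
# mode decision in hybrid codecs and as the training loss of learned codecs.
# All numbers below are made up for illustration.

def rd_cost(distortion: float, rate_bits: float, lam: float) -> float:
    """J = D + lambda * R: lower J means a better rate-distortion trade-off."""
    return distortion + lam * rate_bits

def select_mode(candidates, lam):
    """candidates: list of (mode_name, distortion, rate_bits) triples."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))

if __name__ == "__main__":
    # Hypothetical per-block statistics for three prediction modes.
    modes = [("planar_intra", 120.0, 96.0),
             ("angular_intra", 95.0, 160.0),
             ("inter_skip", 140.0, 8.0)]
    for lam in (0.1, 1.0, 5.0):   # small lambda favours quality, large favours low rate
        best = select_mode(modes, lam)
        print(f"lambda={lam:4.1f} -> pick {best[0]}")
```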
the partition scheme and support full resolution coding, and Besides, as the R-D cost is optimized in an end-to-end manner, thereby removing the blocking artifacts naturally. Since the it is flexible to adapt the rate and distortion to accommodate partition is not required, these modules can access more a variety of end applications, e.g. machine vision tasks. context in a larger region. As the features are extracted via Although video coding performance is improved constantly, a hierarchical network and jointly optimized with the recon- some intrinsic problems exist, especially when tremendous struction task, the resulting features tend to comprehensive and volumes of data need to be processed and analyzed. The low- powerful for high efficient coding. More efforts are put into value data volume still constitutes a major part. So these increasing the receptive field for better perception of larger methods of reconstructing whole pictures cannot fulfill the regions via recurrent neural network [32, 33] and nonlocal requirement of real-time video content analytics when dealing attention [34, 35], leading to improved coding performance. with large-scale video data. However, the strengths in deep The former infers the latent representations of image/videos learning based image/video coding, i.e. the excellent prediction progressively. In each iteration, with the previously perceived and generation capacity of deep models and the flexibility of context, the network removes the unnecessary bits from the R-D cost optimization, provide opportunities to develop VCM latent representations to achieve more compactness. The latter technology to address these challenges. makes efforts to figure out the nonlocal correspondence among endpixels/regions, in order to remove the long-term spatial III.REVIEWOF FEATURE COMPRESSION:FROM redundancy. SEMANTIC FEATURE PERSPECTIVE The efforts in improving the performance of neural network- aided coding tools rely on the excellent prediction ability The traditional video coding targets high visual fidelity for of deep networks. Many works attempt to effectively learn humans. With the proliferation of applications that capture the end-to-end mapping in a series of video coding tasks, video for (remote) consumption by a machine, such as con- e.g., intra-prediction [7,8], inter-prediction [4–6,9, 10], nected vehicles, video surveillance systems, and video capture deblocking [11–14], and fast mode decision [15]. With a for smart cities, more recent efforts on feature compression 4 target low bitrate intelligent analytics( e.g., image/video recog- It is worthy to mention that CDVS makes a normative nition, classification, retrieval) for machines, as transmission encoder (the feature extraction process is fixed), which is bandwidth for raw visual features is often at a premium, even completely distinct from conventional video coding standards for the emerging strategy of Analyze then Compress. in making a normative decoder. The success of CDVS stan- However, it is not new to explore feature descriptors for dardization originates from the mature computer vision algo- MPEG. Back in 1998, MPEG initiated MPEG-7, formally rithm (like SIFT [64]) on the reliable image matching between known as Multimedia Content Description Interface, driven different views of an object or scene. However, when dealing by the needs for tools and systems to index, search, filter, and with more complex analysis tasks in the autonomous drive and manage audio-visual content. 
Towards interoperable interface, video surveillances, the normative encoder process incurs more MPEG-7 [45] defines the syntax and semantics of various tools risks of lower performance than those task-specific fine-tuned for describing color, texture, shape, motion, etc. Such descrip- features directly derived from an end-to-end learning frame- tions of streamed or stored media help human or machine work. Fortunately, the collaborative compression of video and users to identify, retrieve, or filter audio-visual information. feature streams is expected to address this issue, as we may Early visual descriptors developed in MPEG-7 have limited leverage the joint advantages of the normative feature encoding usage, as those low-level descriptors are sensitive to scale, process and the normative video decoding process. rotation, lighting, occlusion, , etc. More recently, the advance of computer vision and deep learning has significantly B. CDVA pushed forward the standardized visual descriptions in MPEG- 7. In particular, CDVS [46] and CDVA [47] have been in the The bold idea of CDVA initiated in the 111th MPEG vanguard of the trend of Analyze then Compress by extracting meeting in Feb. 2015 is to have a normative video feature and transmitting compact visual descriptors. descriptor based on neural networks for machine vision tasks, targeting an exponential increase in the demand for video analysis in autonomous drive, video surveillance systems, A. CDVS entertainment, and smart cities. The standardization at the CDVS can trace back to the requirements for early mobile encoder requires the deterministic deep network model and visual search systems by 2010, such as faster search, higher parameters. However, due to fast-moving deep learning tech- accuracy, and better user experience. Initial research [48– niques and their end-to-end optimization nature, there is a lack 53] demonstrated that one could reduce transmission data of a generic deep model sufficing for a broad of video analytics by at least an order of magnitude by extracting compact tasks. This is the primary challenge for CDVA. visual features on the mobile device and sending descriptors To kick off this standardization, CDVA narrows down to the at low bitrates to a remote machine for search. Moreover, a general task of video matching and retrieval, aims at determin- significant reduction in latency could also be achieved when ing if a pair of videos share the object or scene with similar performing all processing on the device itself. Following initial content, and searching for videos containing similar segments research, an exploratory activity in the MPEG was initiated at to the one in the query video. Extensive experiments [65] the 91st meeting (Kyoto, Jan. 2010). In July 2011, MPEG over CDVA benchmark report comparable performances of launched the standardization of CDVS. The CDVS standard the deep learning features with different off-the-shelf CNN (formally known as MPEG-7, Part 13) was published by models (like VGG16 [66], AlexNet [77], ResNet [68]), which ISO on Aug. 25, 2015, which specifies a normative bitstream provide useful evidence for normative deep learning features of standardized compact visual descriptors for mobile visual for the task of video matching and retrieval. Due to the search and augmented reality applications. 
hardware friendly merits of uniformly sized small filters (3x3 Over the course of the standardization process, CDVS has convolution kernel and 2x2 max pooling), and the competitive made remarkable improvements over a large-scale benchmark performance of combining convolution layers of the small in image matching/retrieval performance with very compact filters by replacing a large filter (5 × 5 or 7 × 7), VGG16 feature descriptors (at six predefined descriptor lengths: 512B, is adopted by CDVA as the normative backbone network to 1KB, 2KB, 4KB, 8KB, and 16KB) [16]. High performance derive compact deep invariant feature representations. is achieved while stringent memory and computational com- The CDVA standard (formally known as MPEG-7, Part plexity requirements are satisfied to make the standard ideally 15) was published by ISO in July 2019. Based on a previ- suited for both hardware and software implementations [54]. ously agreed-upon methodology, key technologies are devel- To guarantee the interoperability, CDVS makes a normative oped by MPEG experts, including Nested Invariance Pooling feature extraction pipeline including interest point detection, (NIP) [65] for deep invariance global descriptor, Adaptive local feature selection, local feature description, local feature Binary Arithmetic Coding (ABAC) based temporal coding of descriptor aggregation, local feature descriptor compression, global and local descriptors [47], the integration of hand- and local feature location compression. Key techniques, e.g. crafted and deep learning features [17]. The NIP method low-degree polynomial detector [46, 55, 56], light-weighted produces compact global descriptors from a CNN model by interest point selection [57, 58], scalable compressed fisher progressive pooling operations to improve the translation, scale vector [59, 60], and location histogram coding [61, 62], have and rotation invariance over the feature maps of intermediate been developed by competitive and collaborative experiments network layers. The extensive experiments have demonstrated within a rigorous evaluation framework [63]. that the NIP (derived from the last pooling layer, i.e., pool5, of 刘家瑛


[Figure 1 diagram: video acquisition feeds a VCM encoder that emits a video stream (video encoder / reconstructed video), a feature stream (feature encoder / reconstructed features), and a model stream (prediction, generation, transform, nonlinear mapping modules), coupled by collaborative and scalable feedback; the VCM decoder serves human perception and machine intelligence.]
Fig. 1. The proposed video coding for machines (VCM) framework by incorporating the feedback mechanism into the collaborative coding of multiple streams including videos, features, and models, targeting the multi-tasks of human perception and machine intelligence.
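As a purely illustrative reading of Fig. 1, the sketch below models a VCM payload as up to three sub-streams plus flags for the two feedback modes. The field names are our own invention and do not correspond to any normative syntax.

```python
from dataclasses import dataclass

# Illustrative container mirroring Fig. 1: a VCM payload carries up to three
# sub-streams plus flags describing which feedback modes were used. This is a
# conceptual sketch only, not a normative bitstream syntax.

@dataclass
class VCMPayload:
    feature_stream: bytes                 # compact task features (always present)
    video_stream: bytes = b""             # optional key frames / pixel residual
    model_stream: bytes = b""             # optional model updates
    collaborative_feedback: bool = False  # feature stream reused for video coding
    scalable_feedback: bool = False       # extra enhancement bits were requested

    def total_bits(self) -> int:
        return 8 * (len(self.feature_stream) + len(self.video_stream) + len(self.model_stream))

payload = VCMPayload(feature_stream=b"\x01" * 256, video_stream=b"\x00" * 4096,
                     collaborative_feedback=True)
print(payload.total_bits())  # 34816 bits for this toy payload
```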

VGG16) outperforms the state-of-the-art deep and canonical A. Definition and Formulation hand-crafted descriptors with significant gains. In particular, the combination of (deep learning) NIP global descriptors Traditional video coding targets visual signal fidelity at and (hand-crafted) CDVS global descriptors has significantly high bitrates, while feature coding targets high performance boosted the performance with a comparable descriptor length. of vision tasks at very low bitrates. Like CDVS and CDVA, More recently, great efforts are made to extend the most of the existing feature coding approaches heavily rely on CDVA standard for VCM. Smart tiled CDVA [69] targeting specific tasks, which significantly save bitrate by ignoring the video/image analysis of higher resolution input, can be applied requirement of full-resolution video reconstruction. By con- to rich tasks including fine-grained feature extraction network, trast, the video codecs like HEVC focus on the reconstruction small object detection, segmentation, and other use cases. of full-resolution pictures from the compressed bit stream. Smart sensing [70] is an extension of the CDVA standard Unfortunately, the pixel-level features from data decompres- to directly process raw Bayer pattern data, without the need sion cannot suffice for large-scale video analysis in a huge of a traditional ISP. This enables all Bayer pattern sensors or a moderate scale camera network efficiently, due to the to be natively CDVA compatible. SuperCDVA [71] is an bottleneck of computation, communication, and storage [72]. application of the CDVA standard for video understanding by We attempt to identify the role of Video Coding for Ma- using the temporal information, in which each CDVA vector is chines( VCM), as shown in Fig.1, to bridge the gap between embedded as one block in an image, and the multiple blocks coding semantic features for machine vision tasks and coding are put in a sequential manner to represent the sequence of pixel features for human vision. The scalability is meant to video frames for classification by a 2-D CNN model. incrementally improve the performance in a variety of vision In summary, the new MPEG-7 standard CDVA opens the tasks as well as the coding efficiency by optimizing bit utility door to exciting new opportunities in supporting machine-only between multiple low-level and high-level feature streams. encoding as well as hybrid (machine and human) encoding Moreover, VCM is committed to developing key techniques formats to unburden the network and storage resources for for economizing the use of a bit budget to collaboratively dramatic benefits in various service models and a broad range complete multiple tasks targeting humans and/or machines. of applications with low latency and reduced cost. VCM aims to jointly maximize the performance of multiple tasks ranging from low-level processing to high-level semantic analysis, but minimize the use of communication and compu- IV. VIDEO CODINGFOR MACHINES:DEFINITION, tational resources. VCM relates to three key elements: FORMULATION AND PARADIGM • Tasks. VCM incurs low-level signal fidelity as well as The emerging requirements of video processing and content high-level semantics, which may request a complex opti- analysis in a collaborative manner are expected to support both mization objective from multiple tasks. It is important to human vision and machine vision. In practice, two separate figure out an efficient coding scheme across tasks. 
streams of compressed video data and compact visual feature • Resources. VCM is committed to tackling the practical descriptors are involved to satisfy the human observers and performance issue, which cannot be well solved by tradi- a variety of low-level and high-level machine vision tasks, tional video coding or feature coding approaches solely, respectively. How to further improve the visual signal com- subject to the constraints of resources like bandwidth, pression efficiency as well as the visual task performance by computation, storage, and energy. leveraging multiple granularities of visual features remains a • Features. VCM is to explore a suite of rate distortion challenging but promising task, in which the image/video data functions and optimization solutions from a unified per- is considered as a sort of pixel-level feature. spective of leveraging the features at different granular- 6

ities including the pixels for humans to derive efficient compression functions (e.g., transform and prediction).

Here we formulate the VCM problem. The features are denoted by F = {F^0, F^1, ..., F^z} of (z + 1) tasks, where V := F^z is the pixel feature, associated with the task of improving visual signal fidelity for human observers. F^i denotes more task (non-)specific semantic or syntactic features if 0 ≤ i < z. A smaller i denotes that the feature is more abstract. The performance of the task i is defined as follows:

q^i = Φ^i(F̂^i),   (1)

where Φ^i(·) is the quality metric related to task i and F̂^i is the reconstructed feature undergoing the encoding and decoding processes. Note that, certain compressed domain processing or analysis may not require the complete reconstruction process. We use C(·|θ_c), D(·|θ_d), and G(·|θ_g) to denote the processes of compression, decompression, and feature prediction, where θ_c, θ_d, and θ_g are parameters of the corresponding processes. We define S(·) to measure the resource usage of computing a given feature, namely compressing, transmitting, and storing the feature, as well as further analyzing the feature. By incorporating the elements of tasks, resources, and features, VCM explicitly or non-explicitly models a complex objective function for maximizing the performance and minimizing the resource cost, in which joint optimization applies. Generally, the VCM optimization function is described as follows,

argmax_{Θ={θ_c, θ_d, θ_g}}  Σ_{0≤i≤z} ω_i q^i,
subject to:  Σ_{0≤i≤z} ω_i = 1,   (2)
S(R_{F^0}) + Σ_{0<i≤z} min_{0≤j<i} S(R_{F^i→j}) + S(R_M) + S(Θ) ≤ S_T,

where S_T is the total resource cost constraint, and ω_i balances the importance of different tasks in the optimization. The first two terms in the resource constraint are the resource cost of compressing and transmitting features of all tasks with feature prediction. Note that, R_V = min_j S(R_{F^z→j}) is the resource cost to encode videos F^z with feature prediction. The third term in the resource constraint is the resource overhead from leveraging models and updating models accordingly. The last term S(Θ) calculates the resource cost caused by using a more complex model, such as longer training time, response time due to feedbacks, and a larger-scale training dataset.

Principle rules apply to the key elements as below,

R_{F^0} = C(F^0 | θ_c),   (3)
R_{F^i→j} = C(F^i − G(F^j, i | θ_g)),   for i ≠ 0,   (4)
F̂^0 = D(R_{F^0} | θ_d),   (5)
F̂^i = D(R_{F^i→j} | θ_d) + G(F̂^j, i | θ_g),   for i ≠ 0,   (6)

where G(·, j|θ_g) projects the input feature to the space of F^j. The hat ˆ· denotes that the feature decoded from the bit stream would be suffering from compression loss. In practice, the VCM problem can be rephrased in Eq. (2) by designing more task specific terms C(·|θ_c), D(·|θ_d) and G(·|θ_g).
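A toy numerical reading of Eq. (2) may help: each candidate operating point bundles per-task qualities (standing in for Φ^i) and a total resource cost (standing in for the S(·) terms), and the weighted-sum objective picks the best point that fits the budget S_T. All names and numbers below are invented for illustration; no real codec or task model is evaluated.

```python
# Toy reading of the VCM objective in Eq. (2): maximize the weighted task
# qualities subject to a total resource budget. Quality and cost numbers are
# placeholders standing in for Phi^i(.) and S(.).

def vcm_objective(configs, weights, budget):
    """configs: candidate operating points, each with per-task qualities q[i]
    and a total resource cost s; returns the best feasible point or None."""
    feasible = [c for c in configs if c["s"] <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda c: sum(w * q for w, q in zip(weights, c["q"])))

operating_points = [
    {"name": "feature_only",        "q": [0.82, 0.30], "s": 1.0},  # [machine task, human fidelity]
    {"name": "feature+residual",    "q": [0.85, 0.78], "s": 4.0},
    {"name": "full_video+features", "q": [0.86, 0.92], "s": 9.0},
]
weights = [0.7, 0.3]   # omega_i, summing to 1; machine task weighted higher here
for budget in (2.0, 5.0, 10.0):
    best = vcm_objective(operating_points, weights, budget)
    print(budget, best["name"] if best else "infeasible")
```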
B. A VCM Paradigm: Digital Retina

As a VCM instance, digital retina [72–74] is to solve the problem of real-time surveillance video analysis collaboratively from massive cameras in smart cities. Three streams are established for human vision and machine vision as follows:
• Video stream: Compressed visual data is transmitted to learn data-driven priors to impact the optimization function in Eq. (2). In addition, the fully reconstructed video is optionally for humans as requested on demand.
• Model stream: To improve the task-specific performance and hybrid video coding efficiency, hand-crafted or learning based models play a significant role. This stream is expected to guide the model updating and the model based prediction subsequently. The model learning works on the task-specific performance metric in Eq. (1).
• Feature stream: This stream consisting of task-specific semantic or syntactic features extracted at the front end devices is transmitted to the server end. As formulated in Eq. (2), a joint optimization is expected to reduce the resource cost of video and feature streams.

Instead of working on video stream alone, the digital retina may leverage multiple streams of features to reduce the bandwidth cost by transmitting task-specific features, and balance the computation load by moving part of feature computing from the back end to the front end. As an emerging VCM approach, the digital retina is revolutionizing the vision system of smart cities. However, for the current digital retina solution, the optimization in Eq. (2) is reduced to minimizing an objective for each single stream separately, without the aid of feature prediction in Eq. (4) and (6), rather than multiple streams jointly. For example, state-of-the-art video coding standards like HEVC are applied to compress video streams. The compact visual descriptor standards CDVS/CDVA are applied to compress visual features. The collaboration between two different types of streams works in a combination mode. There is a lack of joint optimization across streams in terms of task-specific performance and coding efficiency.

Generally speaking, the traditional video coding standardization on HEVC/AVS, together with recent compact feature descriptor standardization on CDVS/CDVA, targets efficient data exchange between human and human, machine and human, and machine and machine. To further boost the functionality of feature streams and open up more room for collaborating between streams, the digital retina introduces a novel model stream, which takes advantage of pixel-level features, semantic features (representation), and even existing models to generate a new model for improving the performance as well as the generality of task-related features. More recent work [73] proposed to reuse multiple models to improve the performance of a target model, and further came up with a collaborative approach [75] to low cost and intelligent visual sensing analysis. However, how to improve the video coding efficiency and/or optimize the performance by collaborating different streams is still an open and challenging issue.

C. Reflection on VCM

Prior to injecting model streams, the digital retina [76] correlations for semantic label prediction of videos. In [84], has limitations. First, the video stream and feature stream several approaches extend the temporal connectivity of a are handled separately, which limits the utilization of more CNN to fully make use of local spatio-temporal information streams. Second, the processes of video coding and feature for video classification. In [85], a two-stream ConvNet ar- coding are uni-directional, thereby limiting the optimization chitecture incorporated with spatial appearance and motion performance due to the lack of feedback mechanism, which is information is built for action recognition. Their successive crucial from the perspective of a vision system in smart cities. works are based on mixed architectures with both CNN and Beyond the basic digital retina solution, VCM is supposed RNN [86], and 3D convolutional networks [87] for action to jointly optimize the compression and utilization of feature, recognition [88, 89], scene recognition [87], captioning, com- video, and model streams in a scalable way by introducing menting [90]. These models are capable to extract discrimina- feedback mechanisms as shown in Fig.1: tive and compact joint spatial and temporal features, potential 1) Feedback for collaborative mode: The pixel and/or se- to benefit squeezing out the redundancy in videos. mantic features, equivalently video and feature streams, can be jointly optimized towards higher coding effi- B. Generative Models ciency for humans and/or machines in a collaborative manner. That is, the features can be fed back to each 1) Generative Adversarial Networks: The advance of gen- other between streams for improving the task-specific erative adversarial networks (GANs) [91] makes a significant performance for machines, in addition to optimizing impact in machine vision. Recent years have witnessed the the coding efficiency for humans. Advanced learning- prosperity of image generation and its related field [92–95]. based prediction or generation models may be applied In general, most of the existing methods can be categorized to bridge the gap between streams. into two classes: supervised and unsupervised methods. In 2) Feedback for scalable mode: When bit budget cannot supervised methods [96–98], GANs act as a powerful loss suffice for video or feature streams, or namely, the qual- function to capture the visual property that the traditional ity of reconstructed feature and video are not desirable, losses fail to describe. Pix2pix [99] is a milestone work based more additional resources are utilized, along with the on conditional GAN to apply the image-to-image translation previously coded streams, to improve the quality of both from the perspective of domain transfer. Later on, more feature and video streams with a triggered feedback. efforts [92, 100] are dedicated to generating high-resolution Therefore, the desired behavior of incrementally improv- photo-realistic images with a progressive refinement frame- ing the performance of those human and machine vision work. In unsupervised methods, due to the lack of the paired tasks subject to an increasing bit budget can be expected. ground truth, the cycle reconstruction consistency [101–103] is introduced to model the cross-domain mapping. 2) Guided Image Generation: Some works focus on guided V. NEW TRENDSAND TECHNOLOGIES image generation, where semantic features, e.g. 
human pose The efficient transition between the features of different and semantic map, are taken as the guidance input. The early attempt pays attention to pose-guided image generation and granularities is essential for VCM. In this section, we review 2 more recent advances in the image/video predictive and gen- a two-stage network PG [104] is built to coarsely generate erative models. the output image under the target pose in the first stage, and then refine it in the second stage. In [105], deformable skips are utilized to transform high-level features of each body A. Predictive Models part to better model shapes and appearances. In [106], the 1) Image Predictive Models: Deep convolutional neural body part segmentation masks are used as guidance for image networks [77, 78] have been proven to be effective to predict generation. However, the above-mentioned methods [104–106] semantic labels of images/videos. This superior capacity has rely on paired data. To address the limitation, in [107], a been witnessed in many visual tasks, e.g. image classifica- fully unsupervised GAN is designed, inspired by [101, 108]. tion [78, 79], object detection [80, 81], semantic segmenta- Furthermore, the works in [109, 110] resort to sampling tion [82], pose estimation [83]. The key to the success of from the feature space based on the data distribution. These these tasks is to extract discriminative features to effectively techniques bring in precious opportunities to develop efficient model critical factors highly related to the final predictions. video coding techniques. The semantic feature guidance is Deep networks have hierarchical structures to extract features much more compact. With the support of the semantic feature, from low-level to high-level progressively. The features at the original video can be well reconstructed economically. the deep layer are more compact and contain less redundant 3) Video Prediction and Generation: Another branch of information, with very sparse activated regions. As such, the generation models are for video prediction and generation. image predictive model can capture the compact and critical Video prediction aims to produce future frames based on feature from a single image, which provides an opportunity to previous frames of a video sequence in a deterministic manner, develop efficient compression approaches. in which recurrent neural networks are often used to model the 2) Video Predictive Models: The deep networks designed temporal dynamics [111–113]. In [111], an LSTM encoder- for video analytics pay additional attention to modeling tempo- decoder network is utilized to learn patch-level video repre- ral dynamics and utilizing complex joint spatial and temporal sentations. In [112], a convolutional LSTM is built to predict 刘家瑛 8

[Figure 2 diagram: within the VCM scope, video acquisition feeds pretrained models and a learned feature extractor; intermediate features serve machine tasks A, B, C, task-specific features serve tasks E, F, and a compensating visual processor supports human perception.]
Fig. 2. The key modules of VCM architecture, as well as the relationship between features and human/machine tasks. video frames. In [113], a multi-layer LSTM is constructed to concatenating and fusing features from different levels, like progressively refine the prediction process. Some methods do DenseNet [149], lead to better representations of pixels and not rely on recurrent networks, e.g. 3D convolutional neural their contexts for low-level visions. The related beneficial tasks network [114, 115]. Some other methods [116–118] estimate include super-resolution [96, 150–153], rain removal [154, local and global transforms and then apply these transforms 155], dehazing [151], inpainting [156], compression artifacts to generate future frames indirectly. removal [157], and deblurring [158]. Zhang et al. [151] Comparatively, video generation methods aim to produce combined the structure of ResNet and DenseNet. Dense blocks visually authentic video sequences in a probabilistic manner. are used to obtain dense local features. All features in each In the literature, methods based on GAN [119–122] and Vari- dense block are connected by skip connections, and then fused ational AutoEncoder (VAE) [123–127] are built. The above- in the last layer adaptively in a holistic way. mentioned methods just predict a few video frames. Later In video coding, there is also a similar tendency to pursue works [128] target long-term video prediction. In [128], up a progressive image and video compression, namely, scalable to 100 future frames of an Atari game are generated. The image/video coding [131], affording to form the bit-stream at future frame is generated with the guidance of the encoded any bitrate. The bit-stream usually consists of several code features by CNN and LSTM from previous frames. In [129], layers (one base layer and several enhancement layers). The a new model is proposed to generate real video sequences. The base layer is responsible for basic but coarse modeling of high-level structures are estimated from past frames, which are image/video. The enhancement layers progressively improve further propagated to those of the future frame via an LSTM. the reconstruction quality with additional bit-streams.A typ- In [130], an unsupervised method is built to extract a high-level ical example is JPEG2000 [132], where an image pyramid feature which is further used to predict the future high-level from the wavelet transform is build up for scalable image feature. After that, the model can predict future frames based reconstruction based on compact feature representations. Later on the predicted features and the first frame. on, the extension of the scalable video codec is made in the Undoubtedly, inter prediction plays a significant role in H.264 standard [133]. The base layer bit-stream is formed by video coding. The video prediction and generation models are compressing the original video frames, and the enhanced layer expected to leverage compact features to propagate the context bit-stream is formed by encoding the residue signal. of video for improving coding efficiency. VCM is expected to incorporate the scalability into the collaborative coding of multiple task-specific data streams for C. Progressive Generation and Enhancement humans and/or machines. It is worthy to note that, the rapid development of DNNs (deep neural networks) is proliferating Deep learning brings in the wealth of ‘deep’. That is, with scalable coding schemes. 
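The base-layer/enhancement-layer idea behind scalable coding can be written down in a few lines. The sketch below layers progressively finer quantizers on the residual of the previous layer; it mimics only the general mechanism and is not the SVC extension of H.264 nor any of the learned schemes cited here.

```python
import numpy as np

# Minimal illustration of scalable (layered) coding: a coarse base layer plus
# enhancement layers that successively re-encode the remaining residual with
# finer quantization steps. Decoding a prefix of the layers gives a coarser
# reconstruction; adding layers improves it.

def encode_layers(signal: np.ndarray, steps=(16.0, 4.0, 1.0)):
    layers, residual = [], signal.astype(np.float64)
    for q in steps:                              # coarse -> fine quantization steps
        code = np.round(residual / q)            # this layer's quantized symbols
        layers.append((q, code))
        residual = residual - code * q           # what is left for the next layer
    return layers

def decode_layers(layers, upto: int) -> np.ndarray:
    """Reconstruct using only the first `upto` layers (truncated bit-stream)."""
    return sum(q * code for q, code in layers[:upto])

x = np.random.default_rng(1).normal(scale=20.0, size=1000)
layers = encode_layers(x)
for k in range(1, 4):
    err = np.mean((x - decode_layers(layers, k)) ** 2)
    print(f"{k} layer(s): MSE = {err:.3f}")       # error shrinks as layers are added
```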
In [32], an RNN model is utilized the hierarchical layer structure, the information is processed to realize variable-bitrate compression. In [134], bidirectional and distilled progressively, which benefits both high-level ConvLSTM is adopted to decompose the bit plane by effi- semantic understanding and low-level vision reconstruction. ciently memorizing long-term dependency. In [135], inspired For VCM, an important property is scalability, a capacity of by the self-attention, a transformer-based decorrelation unit is feature prediction ranging from the extreme compactness to designed to reduce the feature redundancy at different levels. the injection of redundant visual data for serving humans and More recently, several works [136–138, 140, 141] attempt to machines, which is closely correlated to the hot topics of deep jointly compress videos and features in a scalable way, which progressive generation and enhancement. shows more evidence for the scalability in VCM. A series of works have been proposed to generate or en- hance images progressively. In [147], a cascade of CNN within a Laplacian pyramid framework is built to generate images VI.VIDEO CODINGFOR MACHINES:POTENTIAL in a coarse-to-fine fashion. In [148], a similar framework is SOLUTIONS applied for single-image super-resolution. Later works refining In this section, we present several exemplar VCM solutions: features progressively at the feature-level, like ResNet [68], or deep intermediate feature compression, predictive coding with 9 collaborative feedback, and enhancing predictive coding with B. Predictive Coding with Collaborative Feedback scalable feedback. Based on the fact that the pretrained deep learning networks, e.g. VGG and AlexNet, can support a wide Fig.3 (a) shows the overview pipeline of a joint feature and range of tasks, such as classification and object detection, video compression approach [140, 141]. At the encoder side, we first investigate the issue of compressing intermediate a set of key frames vk will be first selected and compressed features (usually before pool5 layer) extracted from off-the- with traditional video codecs to form the bit-stream BI . shelf pretrained networks (left part of Fig.2), less specific The coded key frames convey the appearance information to given tasks, via state-of-the-art video coding techniques. and are transmitted to the decoder side to synthesize the In the next step, we explore the solution that learns to extract reconstructed non-key frames. Then, the network learns to key points as a highly discriminative image-level feature (right represent V = {v1, v2, ..., vN } with the learned sparse points part of Fig.2) to support the motion-related tasks for both F = {f1, f2, ..., fN } to describe temporal changes and object machine and human vision with collaborative feedback in a motion among frames. We employ prediction model P (·) and feature assisted predictive way. Finally, we attempt to pursue generation model G (·) to implement the feature transition a preliminary scheme to offer a general feature scalability to operation G(·) to convert the feature from a redundant form derive both pixel and image-level representations and improve to compact one and vice verse, respectively. More specifically, the coding performance incrementally. P (·) and G(·) are the processes to extract key points from videos and generate videos based on the learned key points. To extract the key points, we have: A. 
Deep Intermediate Feature Compression In the VCM, features are the key bridge for both front F = P (V, λ|θgp) , (7) and back user-ends as well as high and low-level visions. It naturally raises questions about the optimized feature ex- where θgp is a learnable parameter. F is a compact feature, traction and model updating in the VCM framework. For which only requires very fewer bit streams for transmission deep model-based applications, the feature compression is and storage. λ is a rate control parameter. The compression hindered by that, the models are generally tuned for specific model CF (·|θcf ) compresses F into the feature stream BF : tasks, and that, the top-layer features are very task-specific and hard to be generalized. The work in [139] explores a BF =CF (F |θcf ) , (8) novel problem: the intermediate layer feature compression, reducing the computing burden while being capable to support where θcf is a learnable parameter. different kinds of visual analytics applications. In practice, it Then, a motion guided generation network calculates the provides a compromise between the traditional video coding motion based on these points and then transfers the appearance and feature compression and yields a good trade-off among from the reconstructed key frames to those remaining non-key the computational load, communication cost, and the general- frames. Specifically, for the t-th frame to be reconstructed, ization capacity. we denote its previous reconstructed key frame, previous As illustrated in Fig.2, VCM attempts to connect the reconstructed key points, and the current reconstructed key features of different granularities to the human/machine vi- points by vbψ(t),fbψ(t) and fbt: where ψ(t) maps the index of sion tasks from the perspective of a general deep learning the key frame of the t-th frame. The target frame vet ∈ Ve is framework. The intermediate layer features are compressed synthesized as follows: and transmitted instead of the original video or top layer fea-   tures. Compared with the deep layer features, the intermediate vet = G vbψ(t), fbψ(t), fbt|θgg , (9) features from shallow layers contain more informative cues in a general sense, as the end-to-end learning usually makes the where θgg is a learnable parameter. After that, the resid- deep layer features more task-specific with a large receptive ual video R = V − Ve can be calculated, where Ve = field. To accommodate a wide range of machine vision tasks, {v1, v2, ..., vN }. The sparse point descriptor and residual video VCM prefers to optimize the compression of intermediate e e e will be quantized and compressed to the bit stream BF and features for general purposes. An interesting view is that, the BV for transmission. That is BV = {BI ,BR}. We can adjust shallow layers closer to the input image/video, are supposed the total bitrate via controlling the bitrates of BF and BV . to be less important than the deep layers, as the semantics At the decoder side, the key frames will be first recon- play a significant role in data compressing even for humans. structed from B . The sparse point representations are also As indicated in Fig.2, VCM prefers to use the deep layers I n o features in improving coding efficiency for humans. decompressed as Fb = fb1, fb2, ..., fbN from BF as follows, Beyond that, a further idea is to propose the problem of feature recomposition. The features for different tasks are Fb=DF (BF |θdf ) , (10) with various granularities. 
It is worthwhile to explore how to evaluate the feature coding performance of all tasks in a unified where DF (·|θdf ) is a feature decompression model, and θdf perspective, and further decide to recompose the feature of is a learnable parameter. The videos are reconstructed via: different tasks, sharing common features and organizing the Vb = Ve + Rb. Finally, Fb along with Vb serves machine analysis related compression in a more or less scalable way. and human vision, respectively. 刘家瑛
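The data flow around Eq. (7)-(10) can be sketched as follows. Here extract_keypoints() and generate_frame() are crude placeholders for the learned models P(·) and G(·), and the actual entropy coding of B_I, B_F, and B_R is omitted, so only the way key frames, sparse points, and residuals fit together is illustrated.

```python
import numpy as np

# Schematic sketch of the collaborative predictive coding pipeline around
# Eq. (7)-(10). The two functions below are placeholders, not the learned
# key point extractor and motion-guided generator of the cited approach.

def extract_keypoints(frame, n_points=20):                # stands in for P(V, lambda)
    h, w = frame.shape
    ys, xs = np.unravel_index(np.argsort(frame.ravel())[-n_points:], (h, w))
    return np.stack([ys, xs], axis=1).astype(np.float32)

def generate_frame(key_frame, key_pts_ref, key_pts_cur):  # stands in for G(.)
    dy, dx = (key_pts_cur - key_pts_ref).mean(axis=0)     # crude global motion
    return np.roll(key_frame, (int(round(dy)), int(round(dx))), axis=(0, 1))

def encode_clip(frames):
    key = frames[0]                                  # key frame -> bit-stream B_I
    points = [extract_keypoints(f) for f in frames]  # sparse points -> feature stream B_F
    residuals = [f - generate_frame(key, points[0], p)      # residue video -> B_R
                 for f, p in zip(frames, points)]
    return key, points, residuals

def decode_clip(key, points, residuals):
    return [generate_frame(key, points[0], p) + r for p, r in zip(points, residuals)]

clip = [np.random.default_rng(i).random((64, 64)) for i in range(4)]
recon = decode_clip(*encode_clip(clip))
print(max(np.abs(a - b).max() for a, b in zip(clip, recon)))  # ~0: lossless toy chain
```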


[Figure 3 diagrams: (a) a key point extractor P and prediction/generation G produce learned sparse points F (compressed by CF into bit-stream BF) and a residue video R (compressed by CV into bit-stream BV); (b) additionally, feature upgrading Y yields an upgraded feature FU (compressed by CFU into bit-stream BFU) and an incremental residue coded into bit-stream BDV.]
Fig. 3. Two potential VCM solutions: (a) Predictive coding with collaborative feedback, and (b) Enhancing predictive coding with scalable feedback.

C. Enhancing Predictive Coding with Scalable Feedback TABLE I LOSSY FEATURE COMPRESSION RESULTS FOR DIFFERENT TASKS Fig.3 (b) shows an optimized architecture for VCM, by (COMP.RATE—FIDELITY2) adding scalable feedback. Similarly, the key points of video Feature Classification Retrieval Detection frames F are extracted. After that, the redundancy of the video QP 22 42 22 42 22 42 v VGGNet and key frames bψ(t) is removed by applying feature-guided conv1 0.080 0.985 0.020 0.839 0.041 0.997 0.006 0.955 0.065 0.954 0.013 0.850 prediction. Then, the residue video R is compressed into the pool1 0.099 0.984 0.023 0.693 0.039 0.996 0.005 0.923 0.085 0.942 0.018 0.820 conv2 0.098 0.972 0.035 0.790 0.069 0.996 0.027 0.955 0.090 0.950 0.030 0.858 video stream, which is further passed to the decoder. pool2 0.138 0.982 0.047 0.745 0.119 0.997 0.037 0.945 0.128 0.953 0.040 0.815 When the feature and video qualities after decompression conv3 0.080 0.986 0.034 0.840 0.089 0.998 0.048 0.976 0.033 0.954 0.015 0.845 pool3 0.140 0.981 0.063 0.819 ---- 0.066 0.955 0.032 0.826 do not meet the requirements, a scalable feedback is launched conv4 0.053 0.984 0.028 0.865 0.070 0.997 0.043 0.959 0.019 0.960 0.008 0.877 pool4 0.127 0.974 0.065 0.864 ---- 0.041 0.960 0.019 0.847 and more guidance features are introduced: conv5 0.046 0.989 0.023 0.920 0.041 0.995 0.030 0.952 0.023 0.956 0.005 0.741 pool5 0.129 0.986 0.075 0.908 0.200 0.996 0.146 0.960 ---- ResNet ∆F = Y (F,V |θy) , (11) conv1 0.041 0.935 0.005 0.356 0.018 0.964 0.001 0.792 0.029 0.915 0.002 0.713 U pool1 0.043 0.937 0.004 0.087 0.025 0.963 0.002 0.720 0.030 0.889 0.002 0.470 F = F + ∆F, (12) conv2 0.095 0.986 0.014 0.765 0.027 0.939 0.001 0.554 0.107 0.949 0.016 0.660 conv3 0.134 0.989 0.028 0.854 0.041 0.986 0.003 0.551 0.118 0.971 0.021 0.684 conv4 0.170 0.992 0.034 0.932 0.035 0.997 0.012 0.799 0.056 0.964 0.008 0.833 where θy is a learnable parameter. With the key points F in conv5 0.131 0.998 0.063 0.961 0.040 0.999 0.014 0.992 ---- the base layer (the result in the scheme in Fig.3 (a)) and the original video V , we generate the residual feature ∆F to effectively utilize additionally allocated bits to encode more VII.PRELIMINARY EXPERIMENTAL RESULTS abundant information and form the updated feature F U . In this section, we provide preliminary experimental results Then, GU is used to refine V by taking the reconstructed from the perspectives of intermediate deep feature compres- key points and video as its input: sion and machine-human collaborative compression.

$\Delta \tilde{V} = G^U\big(\hat{F}, \Delta\hat{F}, \hat{V} \,|\, \theta_h\big)$,  (13)

where $\theta_h$ is a learnable parameter. Then, we can infer the incremental residue video $R^U = V - (\tilde{V} + \Delta\tilde{V}) - R$, which is compressed into the bit stream $B_{DV}$. At the decoder side, $\hat{R}^U$ is decompressed from $B_{DV}$. Then, the video with a high quality is inferred via $\hat{V}^U = \tilde{V} + \Delta\tilde{V} + \hat{R}^U + \hat{R}$. This allows introducing more bits by launching the scalable feedback.
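To make the data flow of Eqs. (11)-(13) concrete, the following is a minimal decoder-side sketch, assuming the learned predictor G, the refiner G^U, and the decoded signals are available as callables and dense arrays. The array shapes, the 20-key-point layout, and the stub generators are illustrative assumptions, not part of the proposed design.

```python
import numpy as np

def base_layer_reconstruction(f_hat, r_hat, G):
    """Fig. 3(a): predict frames from the decoded key points, V_tilde = G(F_hat),
    then add the decoded residue, V_hat = V_tilde + R_hat."""
    v_tilde = G(f_hat)
    return v_tilde, v_tilde + r_hat

def enhancement_layer_reconstruction(f_hat, delta_f_hat, v_hat, v_tilde,
                                     r_hat, r_u_hat, G_U):
    """Fig. 3(b): Delta_V_tilde = G_U(F_hat, Delta_F_hat, V_hat) (Eq. 13), then
    V_hat_U = V_tilde + Delta_V_tilde + R_hat_U + R_hat."""
    delta_v_tilde = G_U(f_hat, delta_f_hat, v_hat)
    return v_tilde + delta_v_tilde + r_u_hat + r_hat

if __name__ == "__main__":
    T, H, W = 4, 64, 64                                 # toy clip: 4 frames of 64x64
    rng = np.random.default_rng(0)
    f_hat = rng.normal(size=(T, 20, 2))                 # 20 decoded key points per frame
    delta_f_hat = 0.1 * rng.normal(size=(T, 20, 2))     # decoded residual feature
    r_hat = 0.1 * rng.normal(size=(T, H, W))            # decoded base-layer residue
    r_u_hat = 0.05 * rng.normal(size=(T, H, W))         # decoded incremental residue

    # Stand-ins for the learned generators; the real G and G_U are neural networks.
    G = lambda f: np.full((T, H, W), f.mean())
    G_U = lambda f, df, v: 0.1 * (v - v.mean())

    v_tilde, v_hat = base_layer_reconstruction(f_hat, r_hat, G)
    v_hat_u = enhancement_layer_reconstruction(f_hat, delta_f_hat, v_hat, v_tilde,
                                               r_hat, r_u_hat, G_U)
    print(v_hat.shape, v_hat_u.shape)                   # (4, 64, 64) (4, 64, 64)
```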

VII. PRELIMINARY EXPERIMENTAL RESULTS

In this section, we provide preliminary experimental results from the perspectives of intermediate deep feature compression and machine-human collaborative compression.

A. Deep Intermediate Feature Compression for Different Tasks

1) Compression Results: We show the compression performance on the intermediate features for different tasks in Table I. The compression rate is calculated as the ratio between the size of the compressed bit-stream and that of the original intermediate deep features. As to the fidelity evaluation², the reconstructed features are passed to their birth-layer of the original neural network to infer the network outputs, which are compared with the pristine outputs to evaluate the information loss of the lossy compression methods. More results and details on the evaluation framework can be found in [142].

²https://github.com/ZoomChen/DeepFeatureCoding/tree/master/Coding and Evaluation

TABLE I
LOSSY FEATURE COMPRESSION RESULTS FOR DIFFERENT TASKS (EACH CELL: COMP. RATE / FIDELITY²)

Feature | Classification QP 22 | Classification QP 42 | Retrieval QP 22 | Retrieval QP 42 | Detection QP 22 | Detection QP 42
VGGNet
conv1 | 0.080/0.985 | 0.020/0.839 | 0.041/0.997 | 0.006/0.955 | 0.065/0.954 | 0.013/0.850
pool1 | 0.099/0.984 | 0.023/0.693 | 0.039/0.996 | 0.005/0.923 | 0.085/0.942 | 0.018/0.820
conv2 | 0.098/0.972 | 0.035/0.790 | 0.069/0.996 | 0.027/0.955 | 0.090/0.950 | 0.030/0.858
pool2 | 0.138/0.982 | 0.047/0.745 | 0.119/0.997 | 0.037/0.945 | 0.128/0.953 | 0.040/0.815
conv3 | 0.080/0.986 | 0.034/0.840 | 0.089/0.998 | 0.048/0.976 | 0.033/0.954 | 0.015/0.845
pool3 | 0.140/0.981 | 0.063/0.819 | - | - | 0.066/0.955 | 0.032/0.826
conv4 | 0.053/0.984 | 0.028/0.865 | 0.070/0.997 | 0.043/0.959 | 0.019/0.960 | 0.008/0.877
pool4 | 0.127/0.974 | 0.065/0.864 | - | - | 0.041/0.960 | 0.019/0.847
conv5 | 0.046/0.989 | 0.023/0.920 | 0.041/0.995 | 0.030/0.952 | 0.023/0.956 | 0.005/0.741
pool5 | 0.129/0.986 | 0.075/0.908 | 0.200/0.996 | 0.146/0.960 | - | -
ResNet
conv1 | 0.041/0.935 | 0.005/0.356 | 0.018/0.964 | 0.001/0.792 | 0.029/0.915 | 0.002/0.713
pool1 | 0.043/0.937 | 0.004/0.087 | 0.025/0.963 | 0.002/0.720 | 0.030/0.889 | 0.002/0.470
conv2 | 0.095/0.986 | 0.014/0.765 | 0.027/0.939 | 0.001/0.554 | 0.107/0.949 | 0.016/0.660
conv3 | 0.134/0.989 | 0.028/0.854 | 0.041/0.986 | 0.003/0.551 | 0.118/0.971 | 0.021/0.684
conv4 | 0.170/0.992 | 0.034/0.932 | 0.035/0.997 | 0.012/0.799 | 0.056/0.964 | 0.008/0.833
conv5 | 0.131/0.998 | 0.063/0.961 | 0.040/0.999 | 0.014/0.992 | - | -

From Table I, several interesting observations are reached. First, the potential of lossy compression is inspiring. In the extreme case, for example in image retrieval, the ResNet conv2 feature achieves at least a 1000× compression ratio at QP 42, while lossless methods usually provide 2-5×. Second, for each feature type, the fidelity metric decreases with a larger QP value. Third, QP 22 generally provides high fidelity and a fair compression ratio. Fourth, upper-layer features, like conv4 to pool5, tend to be more robust to heavy compression.
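For reference, the two quantities reported in Table I can be computed as sketched below, assuming the task network is split at the feature's birth layer and using, purely for illustration, a cosine similarity between the pristine and reconstructed task outputs. The exact fidelity definition follows the evaluation framework of [142]; the `tail` head used here is a hypothetical stand-in for the real classification, retrieval, or detection branch.

```python
import numpy as np

def compression_rate(original_feature_bytes, bitstream_bytes):
    # Table I reports the compressed size relative to the raw feature volume.
    return bitstream_bytes / original_feature_bytes

def output_fidelity(tail_network, feature, feature_rec):
    """Pass pristine and reconstructed features to the feature's birth layer
    (the remaining 'tail' of the network) and compare the two outputs."""
    out_ref = tail_network(feature)
    out_rec = tail_network(feature_rec)
    num = float(np.dot(out_ref.ravel(), out_rec.ravel()))
    den = float(np.linalg.norm(out_ref) * np.linalg.norm(out_rec) + 1e-12)
    return num / den  # 1.0 would indicate no measurable information loss

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feat = rng.normal(size=(256, 14, 14)).astype(np.float32)  # conv4-like feature map
    feat_rec = feat + rng.normal(scale=0.05, size=feat.shape).astype(np.float32)

    tail = lambda x: x.mean(axis=(1, 2))  # stand-in for the real task head
    rate = compression_rate(feat.nbytes, bitstream_bytes=feat.nbytes * 0.05)
    print(f"comp. rate {rate:.3f}, fidelity {output_fidelity(tail, feat, feat_rec):.3f}")
```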

2) Channel Packaging: Deep features have multiple channels. They need to be arranged into single-channel or three-channel maps and then compressed with existing video codecs. Three modes are studied: channel concatenation, channel concatenation by descending difference, and channel tiling. For channel concatenation, each channel of the feature map corresponds to a frame in the input data of a traditional video encoder. The height and width of the feature map are padded to the height and width that meet the input requirements of the video encoder. The feature map channel order is the original order and remains unchanged. In this mode, inter-coding of HEVC is applied. For channel concatenation by descending difference, to obtain higher compression efficiency, the channels of the feature map are reordered before being fed into a traditional video encoder. The first channel is fixed, and the remaining channels are arranged according to the L2 norm of the difference between the current channel and the previous one. For channel tiling, multiple channels are tiled into a two-dimensional map, serving as an input to a video encoder. The results are presented in Table II. These results are preliminary, and more efforts are expected to improve the efficiency of compressing deep feature maps.

TABLE II
LOSSY FEATURE COMPRESSION RESULTS FOR DIFFERENT TASKS. DD-CHANNEL CONCATENATION DENOTES CHANNEL CONCATENATION BY DESCENDING DIFFERENCE (EACH CELL: COMP. RATE / FIDELITY²)

Feature | Channel Concat. QP 12 | Channel Concat. QP 42 | DD-Channel Concat. QP 12 | DD-Channel Concat. QP 42 | Channel Tiling QP 12 | Channel Tiling QP 42
conv1 | 0.100/0.991 | 0.006/0.666 | 0.100/0.992 | 0.006/0.694 | 0.116/0.998 | 0.020/0.839
pool1 | 0.124/0.987 | 0.006/0.401 | 0.124/0.986 | 0.005/0.412 | 0.145/0.993 | 0.023/0.686
conv2 | 0.116/0.989 | 0.012/0.343 | 0.116/0.988 | 0.012/0.455 | 0.130/0.992 | 0.035/0.781
pool2 | 0.166/0.991 | 0.014/0.260 | 0.165/0.991 | 0.014/0.369 | 0.184/0.996 | 0.045/0.756
conv3 | 0.093/0.989 | 0.013/0.586 | 0.093/0.990 | 0.013/0.617 | 0.100/0.995 | 0.033/0.835
pool3 | 0.164/0.990 | 0.023/0.477 | 0.164/0.994 | 0.023/0.541 | 0.163/0.992 | 0.053/0.791
conv4 | 0.059/0.992 | 0.012/0.700 | 0.059/0.988 | 0.012/0.706 | 0.051/0.993 | 0.021/0.857
pool4 | 0.140/0.990 | 0.030/0.624 | 0.142/0.991 | 0.028/0.611 | 0.100/0.993 | 0.042/0.855
conv5 | 0.046/0.995 | 0.016/0.809 | 0.046/0.990 | 0.015/0.812 | 0.028/0.996 | 0.010/0.920
pool5 | 0.127/0.992 | 0.055/0.776 | 0.127/0.994 | 0.053/0.746 | 0.054/0.996 | 0.024/0.903
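The three packaging modes can be sketched directly. The routines below, assuming a C x H x W feature map, return the frame sequence (for the two concatenation modes) or the single mosaic (for tiling) that would be handed to the HEVC encoder; padding to codec-legal dimensions is omitted, and the descending-difference reordering is paraphrased here as a greedy ordering by the L2 difference to the previously placed channel, so the precise rule used for Table II may differ.

```python
import numpy as np

def channel_concatenation(feature):
    """Each channel becomes one frame, in the original channel order."""
    return [feature[c] for c in range(feature.shape[0])]

def dd_channel_concatenation(feature):
    """Keep the first channel fixed, then greedily append the remaining channel
    with the smallest L2 difference to the previously selected one (one possible
    reading of the descending-difference reordering)."""
    remaining = list(range(1, feature.shape[0]))
    order = [0]
    while remaining:
        prev = feature[order[-1]]
        nxt = min(remaining, key=lambda c: np.linalg.norm(feature[c] - prev))
        order.append(nxt)
        remaining.remove(nxt)
    return [feature[c] for c in order]

def channel_tiling(feature, cols):
    """Tile all channels into one two-dimensional map."""
    c, h, w = feature.shape
    rows = int(np.ceil(c / cols))
    mosaic = np.zeros((rows * h, cols * w), dtype=feature.dtype)
    for idx in range(c):
        r, k = divmod(idx, cols)
        mosaic[r * h:(r + 1) * h, k * w:(k + 1) * w] = feature[idx]
    return mosaic

if __name__ == "__main__":
    feat = np.random.default_rng(0).normal(size=(64, 28, 28)).astype(np.float32)
    print(len(channel_concatenation(feat)), len(dd_channel_concatenation(feat)),
          channel_tiling(feat, cols=8).shape)
```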
B. Joint Compression of Feature and Video

Let us evaluate the effectiveness of the potential solution: feature-assisted predictive coding with the collaborative feedback. We show the results of compression for machine vision, including action recognition and human detection, and of compression for human vision, i.e., video reconstruction.

1) Experimental Settings: The PKU-MMD dataset [143] is used to generate the testing samples. In total, 3317 clips, each sampling 32 frames, are used for training, and 227 clips, each sampling 32 frames, are used for testing. All frames are cropped and resized to 512 × 512. To verify the coding efficiency, we use HEVC, implemented in FFmpeg version 2.8.15³, as the anchor for comparison by compressing all frames with the HEVC codec in the constant rate factor mode. To evaluate the performance of feature-assisted predictive coding, the sparse motion pattern [144] is extracted to serve machine vision. For a given input frame, a U-Net followed by softmax activations is used to extract heatmaps for key point prediction. The covariance matrix is additionally generated to capture the correlations between each key point and its neighboring pixels. For each key point, in total 6 float numbers, including two numbers indicating the position and 4 numbers in the covariance matrix, are used for description. The selected key frames are compressed and transmitted, along with the sparse motion pattern, to generate the full picture for human vision. In the testing process, we select the first frame in each clip as the key frame. At the encoder side, the key frame is coded with the HEVC codec in the constant rate factor mode. Besides the key frame, 20 key points of each frame form the sparse motion representation. Each key point contains 6 float numbers. For the two position numbers, a quantization with a step of 2 is performed for compression. For the other 4 float numbers belonging to the covariance matrix, we calculate the inverse of the matrix in advance, and then quantize the 4 values with a step of 64. Then, the quantized key point values are further losslessly compressed by the Lempel-Ziv-Markov chain algorithm (LZMA) [145].

³https://www.ffmpeg.org/

2) Action Recognition: We first evaluate the efficiency of the learned key points for action recognition. Although each key point is represented with 6 numbers, we only use the two quantized position numbers for action recognition. To align with the bitrate cost of the features, the clips are first down-scaled to 256 × 256 and then compressed with the constant rate factor 51 with HEVC. All 227 clips are used in the testing. Table III shows the action recognition accuracy and the corresponding bitrate costs of different kinds of data. Our method can obtain considerable action recognition accuracy with only a 5.2 Kbps bitrate cost, superior to that of HEVC.

TABLE III
ACTION RECOGNITION ACCURACY OF DIFFERENT METHODS AND CORRESPONDING BITRATE COSTS

Codec | Bitrate (Kbps) | Accu. (%)
HEVC | 16.2 | 65.2
Ours | 5.2 | 74.6

3) Human Detection: Apart from action recognition, human detection accuracy is also compared. The original skeleton information in the dataset is used to form the ground-truth bounding boxes, and a classical detection algorithm, YOLO v3 [159], is adopted for comparison. All 227 clips are used in the testing. The testing clips are down-scaled to different scales of 64×64, 128×128, 256×256 and 512×512 and compressed by HEVC with the constant rate factor 51 to form the input of YOLO v3. Fig. 5 shows the Intersection over Union (IoU) of different methods and their corresponding bitrates. Our method can achieve much better detection accuracy with lower bitrate costs. Some subjective results of human detection are shown in Fig. 4; we can see that our method can achieve better detection accuracy with fewer bit costs.
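The key point bitstream described above can be produced with a few lines. The sketch below follows the stated steps (position quantization with step 2, inversion of the 2×2 covariance followed by quantization with step 64, and lossless coding with Python's standard lzma module); the serialization layout and data types are assumptions made only for illustration.

```python
import lzma
import numpy as np

def quantize_key_points(positions, covariances, pos_step=2.0, cov_step=64.0):
    """positions: (N, 2); covariances: (N, 2, 2).
    Positions are quantized with step 2; the covariance is inverted first and
    its four entries are quantized with step 64, as described above."""
    q_pos = np.round(positions / pos_step).astype(np.int32)
    inv_cov = np.linalg.inv(covariances)          # invert before quantization
    q_cov = np.round(inv_cov / cov_step).astype(np.int32)
    return q_pos, q_cov

def pack_and_compress(q_pos, q_cov):
    """Serialize the quantized integers and compress them losslessly with LZMA."""
    payload = np.concatenate([q_pos.ravel(), q_cov.ravel()]).astype(np.int32)
    return lzma.compress(payload.tobytes())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.uniform(0, 512, size=(20, 2))              # 20 key points per frame
    a = rng.normal(size=(20, 2, 2))
    cov = a @ a.transpose(0, 2, 1) + 0.5 * np.eye(2)     # SPD 2x2 covariance matrices
    stream = pack_and_compress(*quantize_key_points(pos, cov))
    print(f"{len(stream)} bytes for one frame of key points")
```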

Fig. 7. The visual results of the reconstructed videos by HEVC (left panel) and our method (right panel), respectively.

Fig. 4. Subjective results of human detection: (a) GT, (b) YOLO v3, (c) Proposed. The coding cost represents the bits required to code the corresponding testing clip.
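Fig. 5 reports detection quality as the Intersection over Union (IoU) between detected and ground-truth person boxes. For reference, a minimal IoU routine over [x1, y1, x2, y2] boxes is given below; the example boxes are made up.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    gt = [100, 120, 260, 400]   # ground-truth person box derived from the skeleton
    det = [110, 130, 250, 390]  # detected box (e.g. from YOLO v3)
    print(f"IoU = {box_iou(gt, det):.4f}")
```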


Fig. 5. Human detection IoU under different bitrates.

Fig. 6. Rate distortion curves of HEVC and the proposed method.

Fig. 8. Video reconstruction results of different methods: (a) Ground Truth, (b) HEVC, (c) Proposed.

4) Video Reconstruction: The video reconstruction quality of the proposed method is compared with that of HEVC. During the testing phase, we compress the key frames with a constant rate factor of 32 to maintain a high appearance quality. The bitrate is calculated by considering the compressed key frames and key points. As for HEVC, we compress all frames with a constant rate factor of 44 to achieve a comparable bitrate.

TABLE IV
SSIM COMPARISON BETWEEN DIFFERENT METHODS AND CORRESPONDING BITRATE COSTS

Codec | Bitrate (Kbps) | SSIM
HEVC | 33.0 | 0.9008
Ours | 32.1 | 0.9071

Table IV shows the quantitative reconstruction quality of different methods. SSIM values are adopted for quantitative comparison. It can be observed that our method achieves better reconstruction quality than HEVC with a lower bitrate cost. The subjective results of different methods are shown in Fig. 8. There are obvious compression artifacts in the reconstruction results of HEVC, which heavily degrade the visual quality. Compared with HEVC, our method can provide far more visually pleasing results.

Moreover, we add a rate distortion curve for comparison. HEVC is used as the anchor, undergoing four constant rate factors: 44, 47, 49 and 51. For our method, the key frames are compressed respectively under constant rate factors 32, 35, 37 and 40. The rate distortion curve is shown in Fig. 6. Our method yields better reconstruction quality at all bitrates.

VIII. DISCUSSION AND FUTURE DIRECTIONS

A. Entropy Bounds for Tasks

Machine vision tasks rely on features at different levels of granularity. High-level tasks prefer more discriminative and compact features, while low-level tasks need abundant pixels for fine modeling. As VCM aims to explore collaborative compression and intelligent analytics over multiple tasks, it is valuable to figure out the intrinsic relationship among a variety of tasks in a general sense. There is some preliminary work to reveal the connection between typical vision tasks [146]. However, there is no theoretical evidence to measure the information associated with any given task. We may resort to extensive experiments on the optimal compression ratio vs. the desired performance for each specific task.
But such an empirical study hinders the collaborative optimization across multiple tasks due to the complex objective and the heavy computational cost. Moreover, the proposed VCM solution benefits from the incorporation of collaborative and scalable feedback over tasks. How to mathematically formulate the connection of different tasks helps to pave the path to the completeness in theory of the feedback mechanism in VCM. In particular, the theoretical study on entropy bounds for tasks is important for VCM to improve the performance and efficiency for machine and human vision in a broad range of AI applications.

B. Bio-Inspired Data Acquisition and Coding

Recently, inspired by the biological mechanism of human vision, researchers have invented the bio-inspired spike camera, which continuously accumulates luminance intensity and launches spikes when reaching the dispatch threshold. The spike camera brings about a new capacity of capturing fast-moving scenes in a frame-free manner while reconstructing full texture, which provides new insights into the gap between human vision and machine vision, and new opportunities for addressing the fundamental scientific and technical issues in video coding for machines.
Valuable works have been done to investigate the spike firing mechanism [160], the spatio-temporal distribution of the spikes [161], and a lossy spike coding framework for efficient spike data coding [162]. The advance of spike coding shows further potential for VCM.

C. Domain Shift in Prediction and Generation

By employing data-driven methods, VCM makes more compact and informative features. The risk is that those methods might be trapped in over-fitting due to the domain shift problem. Targeting more reliable VCM, how to improve the domain generalization of the prediction and generation models, and how to realize domain adaptation (say, via online learning), are important topics.

IX. CONCLUSION

As a response to the emerging MPEG standardization efforts on VCM, this paper formulates a new problem of video coding for machines, targeting the collaborative optimization of video and feature coding for human and/or machine vision. Potential solutions, preliminary results, and future directions of VCM are presented. The state-of-the-art video coding, feature coding, and general learning approaches, from the perspective of predictive and generative models, are reviewed comprehensively as well. As an initial attempt at identifying the roles and principles of VCM, this work is expected to call for more evidence of VCM from both academia and industry, especially when AI meets the big data era.

REFERENCES

[1] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE TCSVT, Jul. 2003.
[2] G. J. Sullivan, J. Ohm, W. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE TCSVT, Dec. 2012.
[3] S. Ma, T. Huang, C. Reader, and W. Gao, "AVS2: making video coding smarter [standards in a nutshell]," SPM, 2015.
[4] S. Xia, W. Yang, Y. Hu, S. Ma, and J. Liu, "A group variational transformation neural network for fractional interpolation of video coding," in IEEE DCC, Mar. 2018.
[5] J. Liu, S. Xia, W. Yang, M. Li, and D. Liu, "One-for-All: Grouped variation network-based fractional interpolation in video coding," IEEE TIP, May 2019.
[6] N. Yan, D. Liu, H. Li, T. Xu, F. Wu, and B. Li, "Convolutional neural network-based invertible half-pixel interpolation filter for video coding," in IEEE ICIP, Oct. 2018.
[7] J. Li, B. Li, J. Xu, and R. Xiong, "Efficient multiple-line-based intra prediction for HEVC," IEEE TCSVT, Apr. 2018.
[8] Y. Hu, W. Yang, M. Li, and J. Liu, "Progressive spatial recurrent neural network for intra prediction," IEEE TMM, Dec. 2019.
[9] J. Liu, S. Xia, and W. Yang, "Deep reference generation with multi-domain hierarchical constraints for inter prediction," IEEE TMM, Dec. 2019.
[10] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced motion-compensated video coding with deep virtual reference frame generation," IEEE TIP, Oct. 2019.
[11] X. Zhang, W. Yang, Y. Hu, and J. Liu, "DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal," in IEEE ICIP, Oct. 2018.
[12] C. Jia, S. Wang, X. Zhang, S. Wang, J. Liu, S. Pu, and S. Ma, "Content-aware convolutional neural network for in-loop filtering in high efficiency video coding," IEEE TIP, Jul. 2019.
[13] D. Wang, S. Xia, W. Yang, Y. Hu, and J. Liu, "Partition tree guided progressive rethinking network for in-loop filtering of HEVC," in IEEE ICIP, Sep. 2019.
[14] Y. Dai, D. Liu, Z. Zha, and F. Wu, "A CNN-based in-loop filter with CU classification for HEVC," in IEEE VCIP, Dec. 2018.
[15] Z. Liu, X. Yu, S. Chen, and D. Wang, "CNN oriented fast HEVC intra CU mode decision," in IEEE ISCAS, May 2016.
[16] L.-Y. Duan, V. Chandrasekhar, J. Chen, J. Lin, Z. Wang, T. Huang, B. Girod, and W. Gao, "Overview of the MPEG-CDVS standard," IEEE TIP, Jan. 2016.
[17] L.-Y. Duan, Y. Lou, Y. Bai, T. Huang, W. Gao, V. Chandrasekhar, J. Lin, S. Wang, and A. C. Kot, "Compact descriptors for video analysis: The emerging MPEG standard," IEEE TMM, Apr. 2019.
[18] Y. Xu, C. Lv, S. Li, P. Xin, S. Ma, H. Zou, and W. Zhang, "Review of development of visual neural computing," Computer Engineering and Applications, 2017.
[19] A. Habibi, "Hybrid coding of pictorial data," IEEE TOC, 1974.
[20] R. J. A. and R. G. S., "Combined spatial and temporal coding of digital image sequences," SPIE, 1975.
[21] "Video codec for audiovisual services at p×64 kbit/s," ITU-T Rec. H.261, Nov. 1990.
[22] "Video coding for low bit rate communication," ITU-T Rec. H.263, Nov. 1995.
[23] "Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - part 2: Video, ISO/IEC 11172-2 (MPEG-1)," ISO/IEC JTC 1, 1993.
[24] "Coding of audio-visual objects - part 2: Visual, ISO/IEC 14496-2 (MPEG-4 visual version 1)," ISO/IEC JTC 1, Apr. 1999.
[25] "Generic coding of moving pictures and associated audio information - part 2: Video, ITU-T Rec. H.262 and ISO/IEC 13818-2 (MPEG 2 video)," ITU-T and ISO/IEC JTC 1, Nov. 1994.
[26] "Advanced video coding for generic audio-visual services, ITU-T Rec. H.264 and ISO/IEC 14496-10 (AVC)," ITU-T and ISO/IEC JTC 1, May 2003.
[27] "H.265: High efficiency video coding," ITU-T and ISO/IEC JTC 1, Apr. 2013.
[28] G. J. Sullivan and T. Wiegand, "Rate-distortion optimization for video compression," IEEE SPM, Nov. 1998.
[29] J. Ballé, V. Laparra, and E. P. Simoncelli, "Density modeling of images using a generalized normalization transformation," arXiv e-prints, arXiv:1511.06281, Nov. 2015.
[30] W. Park and M. Kim, "CNN-based in-loop filtering for coding efficiency improvement," in IEEE IVMSP, Jul. 2016.
[31] Z. Liu, X. Yu, Y. Gao, S. Chen, X. Ji, and D. Wang, "CU partition mode decision for HEVC hardwired intra encoder using convolution neural network," IEEE TIP, Nov. 2016.
[32] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in CVPR, Jul. 2017.
[33] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv e-prints, arXiv:1511.06085, Nov. 2015.
[34] H. Liu, T. Chen, Q. Shen, and Z. Ma, "Practical stacked non-local attention modules for image compression," in CVPRW, Jun. 2019.
[35] H. Liu, T. Chen, P. Guo, Q. Shen, X. Cao, Y. Wang, and Z. Ma, "Non-local attention optimized deep image compression," arXiv e-prints, arXiv:1904.09757, Apr. 2019.
[36] T. Dumas, A. Roumy, and C. Guillemot, "Context-adaptive neural network-based prediction for image compression," IEEE TIP, 2020.
[37] L. Jiaying, X. Sifeng, and Y. Wenhan, "Deep reference generation with multi-domain hierarchical constraints for inter prediction," IEEE TMM, Dec. 2019.

[38] L. Zhao, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Enhanced CTU-level inter prediction with deep frame rate up-conversion for high efficiency video coding," in IEEE ICIP, Oct. 2018.
[39] Y. Dai, D. Liu, Z. Zha, and F. Wu, "A CNN-based in-loop filter with CU classification for HEVC," in IEEE VCIP, Dec. 2018.
[40] Y. B. Wang, Z. Han, Y. M. Li, Z. Z. Chen, and S. Liu, "Dense residual convolutional neural network based in-loop filter for HEVC," in IEEE ICIP, 2018.
[41] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimization of nonlinear transform codes for perceptual quality," arXiv e-prints, arXiv:1607.05006, Jul. 2016.
[42] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," arXiv e-prints, arXiv:1611.01704, Nov. 2016.
[43] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv e-prints, arXiv:1802.01436, Jan. 2018.
[44] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," in CVPR, Jun. 2018, pp. 3214-3223.
[45] M. Bober, "MPEG-7 visual shape descriptors," IEEE TCSVT, Jun. 2001.
[46] "Information technology on multimedia content description interface part 13: Compact descriptors for visual search," ISO/IEC 15938-13, Sep. 2015.
[47] "Information technology on multimedia content description interface part 15: Compact descriptors for video analysis," ISO/IEC 15938-15, Jul. 2019.
[48] B. Girod, V. Chandrasekhar, D. M. Chen, N. Cheung, R. Grzeszczuk, Y. Reznik, G. Takacs, S. S. Tsai, and R. Vedantham, "Mobile visual search," IEEE SPM, Jul. 2011.
[49] V. Chandrasekhar, G. Takacs, D. M. Chen, S. S. Tsai, Y. Reznik, R. Grzeszczuk, and B. Girod, "Compressed histogram of gradients: A low-bitrate descriptor," IJCV, Feb. 2012.
[50] D. M. Chen and B. Girod, "Memory-efficient image databases for mobile visual search," IEEE MultiMedia, Jan. 2014.
[51] D. M. Chen, S. S. Tsai, V. Chandrasekhar, G. Takacs, J. Singh, and B. Girod, "Tree histogram coding for mobile image matching," in IEEE DCC, Mar. 2009.
[52] R. Ji, L.-Y. Duan, J. Chen, H. Yao, Y. Rui, S.-F. Chang, and W. Gao, "Towards low bit rate mobile visual search with multiple-channel coding," in ACM MM, 2011.
[53] R. Ji, L.-Y. Duan, J. Chen, H. Yao, J. Yuan, Y. Rui, and W. Gao, "Location discriminative vocabulary coding for mobile landmark search," IJCV, Feb. 2012.
[54] L.-Y. Duan, W. Sun, X. Zhang, S. Wang, J. Chen, J. Yin, S. See, T. Huang, A. C. Kot, and W. Gao, "Fast MPEG-CDVS encoder with GPU-CPU hybrid computing," IEEE TIP, May 2018.
[55] "CDVS: Telecom Italia's response to CE1 - interest point detection," ISO/IEC JTC1/SC29/WG11/M31369, 2013.
[56] C. Loiacono, M. Balestri, G. Cabodi, G. Francini, S. Quer, A. Garbo, and D. Patti, "Accurate and efficient visual search on embedded systems," in CCIT, 2015.
[57] F. Gao, X. Zhang, Y. Huang, Y. Luo, X. Li, and L.-Y. Duan, "Data-driven lightweight interest point selection for large-scale visual search," IEEE TMM, Oct. 2018.
[58] G. Francini, S. Lepsøy, and M. Balestri, "Selection of local features for visual search," Signal Processing: Image Communication, 2013.
[59] Y. Wu, F. Gao, Y. Huang, J. Lin, V. Chandrasekhar, J. Yuan, and L.-Y. Duan, "Codebook-free compact descriptor for scalable visual search," IEEE TMM, Feb. 2019.
[60] J. Lin, L.-Y. Duan, Y. Huang, S. Luo, T. Huang, and W. Gao, "Rate-adaptive compact Fisher codes for mobile visual search," IEEE SPL, Feb. 2014.
[61] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. P. Singh, and B. Girod, "Location coding for mobile image retrieval," in International ICST Mobile Multimedia Communications Conference, 2009.
[62] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, M. Makar, R. Grzeszczuk, and B. Girod, "Improved coding for image feature location information," in Applications of Digital Image Processing, 2012.
[63] "Evaluation framework for compact descriptors for visual search," ISO/IEC JTC1/SC29/WG11/N12202, 2011.
[64] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, Nov. 2004.
[65] J. Lin, L.-Y. Duan, S. Wang, Y. Bai, Y. Lou, V. Chandrasekhar, T. Huang, A. Kot, and W. Gao, "HNIP: Compact deep invariant representations for video matching, localization, and retrieval," IEEE TMM, Sep. 2017.
[66] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[67] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, 2012.
[68] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, Jun. 2016.
[69] "Smart TiledCDVA - MPEG128," ISO/IEC JTC1/SC29/WG11/M50976, Oct. 2019.
[70] "Smart sensing - MPEG128," ISO/IEC JTC1/SC29/WG11/M50966, Oct. 2019.
[71] "SuperCDVA - MPEG128," ISO/IEC JTC1/SC29/WG11/M50974, Oct. 2019.
[72] Y. Lou, L. Duan, S. Wang, Z. Chen, Y. Bai, C. Chen, and W. Gao, "Front-end smart visual sensing and back-end intelligent analysis: A unified infrastructure for economizing the visual system of city brain," IEEE JCAS, Jul. 2019.
[73] Y. Lou, L. Duan, Y. Luo, Z. Chen, T. Liu, S. Wang, and W. Gao, "Towards digital retina in smart cities: A model generation, utilization and communication paradigm," in ICME, Jul. 2019.
[74] Z. Chen, L. Duan, S. Wang, Y. Lou, T. Huang, D. O. Wu, and W. Gao, "Toward knowledge as a service over networks: A deep learning model communication paradigm," IEEE JSAC, Jun. 2019.
[75] Y. Bai, L.-Y. Duan, Y. Luo, S. Wang, Y. Wen, and W. Gao, "Toward intelligent visual sensing and low-cost analysis: A collaborative computing approach," in IEEE VCIP, Dec. 2019.
[76] W. Gao, Y. Tian, and J. Wang, "Digital retina: revolutionizing camera systems for the smart city (in Chinese)," Sci Sin Inform, 2018.
[77] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in NeurIPS, 2012.
[78] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, Jun. 2016.
[79] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," IJCV, Dec. 2015.
[80] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in CVPR, Jun. 2014.
[81] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE TPAMI, Jun. 2017.
[82] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, Jun. 2015.
[83] S. Kreiss, L. Bertoni, and A. Alahi, "PifPaf: Composite fields for human pose estimation," in CVPR, Jun. 2019.
[84] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in CVPR, 2014.
[85] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NeurIPS, 2014.
[86] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," IEEE TPAMI, Apr. 2017.
[87] Z. Qiu, T. Yao, and T. Mei, "Learning spatio-temporal representation with pseudo-3D residual networks," in ICCV, 2017.
[88] J. Liu, Y. Li, S. Song, J. Xing, C. Lan, and W. Zeng, "Multi-modality multi-task recurrent neural network for online action detection," IEEE TCSVT, Sep. 2019.
[89] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu, "Spatio-temporal attention-based LSTM networks for 3D action recognition and detection," IEEE TIP, Jul. 2018.
[90] Y. Li, T. Yao, T. Mei, H. Chao, and Y. Rui, "Share-and-chat: Achieving human-level video commenting by search and multi-view embedding," in ACM MM, 2016.
[91] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NeurIPS, 2014.
[92] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv e-prints, arXiv:1710.10196, Oct. 2017.
[93] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in CVPR, Jun. 2018.
[94] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in CVPR, Jun. 2018.
[95] W. Xian, P. Sangkloy, V. Agrawal, A. Raj, J. Lu, C. Fang, F. Yu, and J. Hays, "TextureGAN: Controlling deep image synthesis with texture patches," in CVPR, Jun. 2018.
[96] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," in CVPR, Jul. 2017.
[97] X. Wang, K. Yu, C. Dong, and C. Change Loy, "Recovering realistic texture in image super-resolution by deep spatial feature transform," in CVPR, Jun. 2018.
[98] J. Guo and H. Chao, "One-to-many network for visually pleasing compression artifacts reduction," in CVPR, Jul. 2017.
[99] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in CVPR, Jul. 2017.
[100] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," in CVPR, Jun. 2018.
[101] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, Oct. 2017.
[102] Z. Yi, H. Zhang, P. Tan, and M. Gong, "DualGAN: Unsupervised dual learning for image-to-image translation," in ICCV, Oct. 2017.
[103] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," in ICLR, 2017.
[104] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, "Pose guided person image generation," in NeurIPS, 2017.
[105] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe, "Deformable GANs for pose-based human image generation," in CVPR, Jun. 2018.
[106] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. Guttag, "Synthesizing images of humans in unseen poses," in CVPR, Jun. 2018.
[107] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, "Unsupervised person image synthesis in arbitrary poses," in CVPR, Jun. 2018.
[108] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in NeurIPS, 2016.
[109] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz, "Disentangled person image generation," in CVPR, Jun. 2018.
[110] P. Esser and E. Sutter, "A variational U-Net for conditional appearance and shape generation," in CVPR, Jun. 2018.
[111] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised learning of video representations using LSTMs," in ICML, 2015.
[112] C. Finn, I. Goodfellow, and S. Levine, "Unsupervised learning for physical interaction through video prediction," in NeurIPS, 2016.
[113] W. Lotter, G. Kreiman, and D. Cox, "Deep predictive coding networks for video prediction and unsupervised learning," arXiv e-prints, arXiv:1605.08104, May 2016.
[114] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv e-prints, arXiv:1511.05440, Nov. 2015.
[115] S. Aigner and M. Körner, "FutureGAN: Anticipating the future frames of video sequences using spatio-temporal 3D convolutions in progressively growing GANs," ISPRS, Sep. 2019.
[116] B. Chen, W. Wang, and J. Wang, "Video imagination from a single image with transformation generation," in Thematic Workshops of ACM Multimedia, 2017.

[117] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool, "Dynamic filter networks," in NeurIPS, 2016.
[118] J. van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran, and S. Chintala, "Transformation-based models of video sequences," arXiv e-prints, arXiv:1701.08435, Jan. 2017.
[119] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," in NeurIPS, 2016.
[120] M. Saito, E. Matsumoto, and S. Saito, "Temporal generative adversarial nets with singular value clipping," in ICCV, Oct. 2017.
[121] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, "Stochastic variational video prediction," arXiv e-prints, Oct. 2017.
[122] E. Denton and R. Fergus, "Stochastic video generation with a learned prior," in ICML, Jul. 2018.
[123] T. Xue, J. Wu, K. Bouman, and B. Freeman, "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks," in NeurIPS, 2016.
[124] J. Walker, C. Doersch, A. Gupta, and M. Hebert, "An uncertain future: Forecasting from static images using variational autoencoders," in ECCV, 2016.
[125] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine, "Stochastic adversarial video prediction," arXiv e-prints, arXiv:1804.01523, Apr. 2018.
[126] S. Tulyakov, M. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," in CVPR, Jun. 2018.
[127] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, "Decomposing motion and content for natural video sequence prediction," arXiv e-prints, arXiv:1706.08033, Jun. 2017.
[128] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh, "Action-conditional video prediction using deep networks in Atari games," in NeurIPS, 2015.
[129] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, "Learning to generate long-term future via hierarchical prediction," in ICML, 2017.
[130] N. Wichers, R. Villegas, D. Erhan, and H. Lee, "Hierarchical long-term video prediction without supervision," arXiv e-prints, arXiv:1806.04768, Jun. 2018.
[131] W. Li, "Overview of fine granularity scalability in MPEG-4 video standard," IEEE TCSVT, Mar. 2001.
[132] A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 still image compression standard," IEEE SPM, Sep. 2001.
[133] H. Schwarz, D. Marpe, and T. Wiegand, "Overview of the scalable video coding extension of the H.264/AVC standard," IEEE TCSVT, Sep. 2007.
[134] Z. Zhang, Z. Chen, J. Lin, and W. Li, "Learned scalable image compression with bidirectional context disentanglement network," arXiv e-prints, arXiv:1812.09443, Dec. 2018.
[135] Z. Guo, Z. Zhang, and Z. Chen, "Deep scalable image compression via hierarchical feature decorrelation," in IEEE PCS, 2019.
[136] H. Liu, H. Shen, L. Huang, M. Lu, T. Chen, and Z. Ma, "Learned video compression via joint spatial-temporal correlation exploration," arXiv e-prints, Dec. 2019.
[137] H. Choi and I. V. Bajic, "Near-lossless deep feature compression for collaborative intelligence," arXiv e-prints, arXiv:1804.09963, Apr. 2018.
[138] S. Wang, S. Wang, X. Zhang, S. Wang, S. Ma, and W. Gao, "Scalable facial image compression with deep feature reconstruction," in IEEE ICIP, Sep. 2019.
[139] Z. Chen, K. Fan, S. Wang, L.-Y. Duan, W. Lin, and A. C. Kot, "Intermediate deep feature compression: Toward intelligent sensing," IEEE TIP, 2019.
[140] S. Xia, K. Liang, W. Yang, L.-Y. Duan, and J. Liu, "An emerging coding paradigm VCM: A scalable coding approach beyond feature and signal," arXiv e-prints, arXiv:2001.03004, Jan. 2020.
[141] Y. Hu, S. Yang, W. Yang, L.-Y. Duan, and J. Liu, "Towards coding for human and machine vision: A scalable image coding approach," arXiv e-prints, arXiv:2001.02915, Jan. 2020.
[142] Z. Chen, K. Fan, S. Wang, L.-Y. Duan, W. Lin, and A. Kot, "Lossy intermediate deep learning feature compression and evaluation," in ACM MM, 2019.
[143] C. Liu, Y. Hu, Y. Li, S. Song, and J. Liu, "PKU-MMD: A large scale benchmark for skeleton-based human action understanding," in ACM MM, 2017.
[144] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, "Animating arbitrary objects via deep motion transfer," in CVPR, 2019.
[145] I. Pavlov, "Lempel Ziv Markov chain algorithm," http://en.wikipedia.org/wiki/LZMA.
[146] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, "Taskonomy: Disentangling task transfer learning," in CVPR, Jun. 2018.
[147] E. Denton, S. Chintala, A. Szlam, and R. Fergus, "Deep generative image models using a Laplacian pyramid of adversarial networks," arXiv e-prints, arXiv:1506.05751, Jun. 2015.
[148] W. Lai, J. Huang, N. Ahuja, and M. Yang, "Fast and accurate image super-resolution with deep Laplacian pyramid networks," IEEE TPAMI, Nov. 2019.
[149] G. Huang, Z. Liu, L. v. d. Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, Jul. 2017.
[150] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan, "Deep edge guided recurrent residual learning for image super-resolution," IEEE TIP, Dec. 2017.
[151] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, "Residual dense network for image super-resolution," in CVPR, Jun. 2018.
[152] Y. Tai, J. Yang, X. Liu, and C. Xu, "MemNet: A persistent memory network for image restoration," in ICCV, Oct. 2017.
[153] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image super-resolution using very deep residual channel attention networks," in ICCV, Jul. 2018.
[154] W. Yang, R. T. Tan, J. Feng, J. Liu, Z. Guo, and S. Yan, "Deep joint rain detection and removal from a single image," in CVPR, Jul. 2017.
[155] H. Zhang and V. M. Patel, "Density-aware single image de-raining using a multi-stream dense network," in CVPR, Jun. 2018.
[156] W. Dong, M. Yuan, X. Li, and G. Shi, "Joint demosaicing and denoising with perceptual optimization on a generative adversarial network," arXiv e-prints, arXiv:1802.04723, 2018.
[157] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, "Deep generative adversarial compression artifact removal," in CVPR, 2017.
[158] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, "DeblurGAN: Blind motion deblurring using conditional adversarial networks," in CVPR, Jun. 2018.
[159] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv e-prints, arXiv:1804.02767, 2018.
[160] S. Dong, Z. Bi, Y. Tian, and T. Huang, "Spike coding for dynamic vision sensor in intelligent driving," IEEE Internet of Things Journal, Feb. 2019.
[161] S. Dong, L. Zhu, D. Xu, Y. Tian, and T. Huang, "An efficient coding method for spike camera using inter-spike intervals," in IEEE DCC, Mar. 2019.
[162] Y. Fu, J. Li, S. Dong, Y. Tian, and T. Huang, "Spike coding: Towards lossy compression for dynamic vision sensor," in IEEE DCC, Mar. 2019.