Residual in Residual Based Convolutional Neural Network In-loop Filter for AVS3

Kai Lin∗, Chuanmin Jia∗, Zhenghui Zhao∗, Li Wang†, Shanshe Wang∗, Siwei Ma∗ and Wen Gao∗
∗Institute of Digital Media, Peking University, Beijing, China
Email: {kailin, cmjia, zhzhao, sswang, swma, wgao}@pku.edu.cn
†Hikvision Research Institute, Hangzhou, China
Email: [email protected]

Abstract—Deep learning based video coding tool development has been an emerging topic recently. In this paper, we propose a novel deep convolutional neural network (CNN) based in-loop filter algorithm for the third generation of the Audio Video Coding Standard (AVS3). Specifically, we first introduce a residual block based CNN model with a global identity connection for the luminance in-loop filter, replacing the conventional rule-based algorithms in AVS3. Subsequently, the reconstructed luminance channel is deployed as textural and structural guidance for chrominance filtering. The corresponding syntax elements are also designed for the CNN based in-loop filtering. In addition, we build a large scale database for the learning based in-loop filtering algorithm. Experimental results show that our method achieves on average 7.5%, 16.9% and 18.6% BD-rate reduction under the all intra (AI) configuration on common test sequences. In particular, the performance for 4K videos is 6.4%, 15.5% and 17.5% respectively. Moreover, under the random access (RA) configuration, the proposed method brings 3.3%, 14.4% and 13.6% BD-rate reduction respectively.

Index Terms—Video Coding, In-loop Filter, Audio Video Coding Standard, Convolutional Neural Network

I. INTRODUCTION

The third generation of the Audio Video Coding Standard (AVS3) is a newly established video coding standard developed by the China AVS working group. On top of the second generation of the standard (AVS2) [1] [2], AVS3 achieves higher compression efficiency [3], especially for ultra high definition content. However, artifacts such as blocking, ringing and blurring appear in the reconstructed images after block-based hybrid coding. Such artifacts significantly affect the quality of experience (QoE) of the compressed videos, and the degraded images cannot be properly referenced by the coding of subsequent pictures. As a result, suppressing these artifacts is of vital importance, and plenty of in-loop filter algorithms have been investigated for this purpose.

Located after the inverse transform, the in-loop filter alleviates artifacts and improves both the subjective and objective quality of degraded frames. To deal with discontinuities along block boundaries, AVS2 adopted the deblocking filter (DBF) [4] [5]. However, DBF cannot handle samples in the interior of a coding block. In addition to DBF, sample adaptive offset (SAO) [6] and the adaptive loop filter (ALF) [7] were introduced step by step to compensate for the shortcomings of DBF, addressing ringing artifacts and the overall pixel-level error respectively. Inherited from AVS2, DBF, SAO and ALF are applied sequentially in AVS3. The combination of the three algorithms increases the flexibility of in-loop filtering and significantly enhances degraded frames.

Convolutional neural networks (CNN) have facilitated the development of numerous computer vision tasks, such as image classification [8] and segmentation [9]. When it comes to video compression, CNNs also show enormous potential for deep learning based coding tools [10]. Deep learning based methods have emerged recently to replace conventional coding tools, including intra prediction [11], inter prediction [12], [13], in-loop filtering [14] and so on. These methods were integrated into state-of-the-art video coding standards and obtained satisfying performance. In particular, the in-loop filter is naturally suited to CNNs when modeled as a restoration, i.e., inverse, problem.

In this paper, we propose a deep residual in residual based neural network to replace the conventional in-loop filter methods and further improve subjective and objective quality. Our contributions can be summarized as follows:

• We build a large scale database for removing compression artifacts based on DIV2K [15]. Our database contains high resolution original videos and reconstructed videos compressed by the AVS3 reference software with quantization parameter (QP) ranging from 27 to 50.
• For the luma component, we propose a residual block based fully convolutional network with a global identity connection to replace the combination of DBF, SAO and ALF. Afterwards, to obtain better enhancement of the chroma components, the reconstructed luminance channel is fed into the network together with the chrominance channels as guidance.
• Rate-distortion optimization (RDO) is adopted to ensure the overall rate-distortion performance. According to RDO, frame level and coding tree unit (CTU) level flags are designed to provide more adaptivity and flexibility.

II. RELATED WORK

Although coding performance continues to improve with the progress of video coding standards, unpleasant artifacts still exist due to coarse quantization and the block based prediction-transform coding framework. In this section, we review previous work on conventional in-loop filter methods and deep learning based approaches separately.

A. In-loop filter methods in AVS3

The combination of DBF, SAO and ALF is deployed in AVS3. DBF deals with visible artifacts at block boundaries [5] [16]: it detects discontinuities at the block boundaries and attenuates them by applying a filter, where the decision between different filter strengths is made according to the distribution of samples near the boundaries. Beyond the samples along block boundaries, SAO adjusts samples within a block based on statistics [6]. SAO classifies reconstructed samples into different categories and derives an offset for each category. The offset of each category is signaled after calculation at the encoder, and reconstructed samples are adjusted by the offset of their class [17]. The adaptive loop filter minimizes the mean square error between original and reconstructed samples with a Wiener-based adaptive filter [7] [18]. Different filters are applied to different regions, and the filter parameters are also transmitted to the decoder. To save bits, the filter parameters of different regions can be merged according to RDO.

B. Deep learning based loop filter

Many CNN based image restoration methods have emerged recently and obtained state-of-the-art performance. For deep learning based in-loop filtering in particular, on the one hand, network architectures have gradually become more sophisticated. Dai et al. [19] proposed a variable-filter-size residue-learning CNN which contained only four convolution layers. Zhang et al. [20] explored a more complex network, designing various types of highway units and building a deep residual highway CNN from them. To adapt to various image contents, Jia et al. [14] developed a content-aware multi-model filter, in which each CTU selects the optimal restoration model under the guidance of a discriminative network.

On the other hand, more prior knowledge shared by both the encoder and the decoder has been extracted from the encoding process to offer better assistance. Jia et al. proposed a spatial-temporal residue network based in-loop filter in [21], where the current block and the co-located block in the reference frame are both fed into the network. He et al. [22] extracted the CU partition, generated masks from it, and further discussed fusion strategies for the mask and the reconstructed frame. Li et al. [23] utilized the difference between reconstructed and predicted pixels as an additional input to the network.

III. PROPOSED METHOD

In this section, we describe the proposed network architecture and training details. In addition, the corresponding syntax elements and RDO are discussed.

A. Network Architecture

We present an end-to-end deep residual in residual based CNN model to remove the artifacts of reconstructed frames. Compared with directly learning to restore the original frames, the difference between original and reconstructed frames is much simpler for the network to capture. Therefore, a global shortcut is appended to make the network focus on the residual and on detail recovery. Inspired by ResNet [8], our proposed network is built from several residual blocks. As depicted in Fig. 1, each residual block has two convolutional layers separated by a Rectified Linear Unit (ReLU). As mentioned above, the identity connection is utilized to accelerate the learning process as well as to restore details. As shown in Fig. 2, the network architecture for the luma component is a fully convolutional network composed of three parts, and the number of residual blocks determines the depth of the whole network.

Fig. 1. Structure of residual block

Fig. 2. Network architecture of in-loop filtering for luma component
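For concreteness, the following is a minimal PyTorch sketch of the luma network described above: residual blocks of two convolutional layers separated by a ReLU (Fig. 1), stacked between a head and a tail convolution, with the global identity connection added around the whole body (Fig. 2). The 3x3 kernels and 64-channel width are illustrative assumptions; the paper does not specify them here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv layers separated by ReLU, plus a local identity shortcut (Fig. 1)."""
    def __init__(self, channels=64):  # channel width is an assumption
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.conv2(self.relu(self.conv1(x)))

class LumaFilterNet(nn.Module):
    """Luma in-loop filter: head conv -> stacked residual blocks -> tail conv,
    with a global shortcut so the network only learns the coding residual."""
    def __init__(self, num_blocks=20, channels=64):
        super().__init__()
        self.head = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, y_rec):
        # global identity connection: filtered = reconstruction + predicted residual
        return y_rec + self.tail(self.body(self.head(y_rec)))
```

A 128x128 reconstructed luma patch would be filtered by `LumaFilterNet(num_blocks=20)(patch)`, matching the 20-block depth reported in Section IV.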
The human visual system (HVS) is more sensitive to adjustments of luminance; consequently, the chroma channels are commonly downsampled to reduce the volume of stored data. YUV420, in which the height and width of the chroma components are both half those of the luma component, is a commonly used format in video coding. Since luminance accommodates many more details, it can serve as guidance in the filtering of the chroma components. To obtain three channels of the same size, we upsample the chroma components by nearest-neighbor replication of each sample, which introduces no additional artifacts. The network architecture for the chroma components is depicted in Fig. 3. Residual blocks are the main foundation of the whole network structure as well. In particular, there are two branches of part 1 in the chrominance network architecture: the enlarged reconstructed chroma channel and the reconstructed luma channel are processed by separate branches and then concatenated. Afterwards, the fused feature maps are handled by the same parts as in the previous network structure. It is worth mentioning that an average pooling layer is placed at the end to downsample the feature maps back to the chroma resolution.

Fig. 3. Network architecture of in-loop filtering for chroma component
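Under the same assumptions, a sketch of the chroma network, reusing the ResidualBlock defined above. The text fixes nearest-neighbor upsampling, two part-1 branches, concatenation, and a final average pooling layer; the branch widths and the exact placement of the shortcut relative to the pooling are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChromaFilterNet(nn.Module):
    """Chroma in-loop filter guided by the reconstructed luma channel (Fig. 3)."""
    def __init__(self, num_blocks=10, channels=64):
        super().__init__()
        # two part-1 branches: one for the enlarged chroma, one for the luma guide
        self.chroma_branch = nn.Conv2d(1, channels // 2, kernel_size=3, padding=1)
        self.luma_branch = nn.Conv2d(1, channels // 2, kernel_size=3, padding=1)
        self.body = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, c_rec, y_rec):
        # nearest-neighbor 2x upsampling repeats samples, importing no new values
        c_up = F.interpolate(c_rec, scale_factor=2, mode='nearest')
        fused = torch.cat([self.chroma_branch(c_up), self.luma_branch(y_rec)], dim=1)
        out = c_up + self.tail(self.body(fused))  # shortcut placement is assumed
        # average pooling returns the output to the original chroma resolution
        return F.avg_pool2d(out, kernel_size=2)
```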
B. Rate-distortion optimization

Following [14], frame level and coding tree unit (CTU) level syntax elements are designed to maximize the coding efficiency of the proposed CNN based in-loop filtering algorithm. In particular, we design frame flags for both the luma and chroma components. Moreover, CTU flags are additionally utilized for luminance. All flags are determined by rate-distortion optimization (RDO).

The frame-level flag controls the on/off state of the proposed coding tool for the whole frame. If the frame flag is turned off, the CTU-level determination process no longer proceeds, and no CTU level syntax elements are transmitted. On the contrary, one syntax element per CTU is required when the frame flag is enabled. Let D_1 and D_2 denote the distortion with the proposed loop filter enabled and disabled respectively, and let R_1 and R_2 denote the corresponding bits transmitted to the decoder: R_1 equals the number of CTUs (one flag bit each), while R_2 is zero. The frame flag is enabled if J_1 < J_2 and disabled otherwise, where

    J_1 = D_1 + λ · R_1    (1)
    J_2 = D_2 + λ · R_2    (2)

At the CTU level, one bit switches the proposed scheme on or off. D_3 and D_4 denote the distortion with the CTU flag enabled and disabled; since the flag bit is transmitted in either case, the rate terms cancel and the CTU flag is determined by the distortions alone. The flag is enabled if J_3 < J_4 and disabled otherwise, where

    J_3 = D_3,  J_4 = D_4    (3)
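Since the rate terms are known in closed form, both decisions reduce to simple comparisons. A schematic sketch of the logic (not the HPM implementation), assuming the distortions have already been measured, e.g., as sums of squared errors over the frame or CTU:

```python
def frame_flag_decision(d_on, d_off, num_ctus, lam):
    """Frame-level flag via Eqs. (1)-(2): enabling the tool costs one flag bit
    per CTU (R1 = number of CTUs); disabling it costs nothing (R2 = 0)."""
    j1 = d_on + lam * num_ctus   # J1 = D1 + lambda * R1
    j2 = d_off                   # J2 = D2 + lambda * 0
    return j1 < j2               # frame flag enabled iff J1 < J2

def ctu_flag_decision(d_on, d_off):
    """CTU-level flag via Eq. (3): the flag bit is sent either way, so the
    rate terms cancel and only the distortions are compared."""
    return d_on < d_off          # CTU flag enabled iff J3 < J4
```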

C. Training Process

We first build a large scale database for CNN based in-loop filter algorithms using the DIV2K dataset [15], which contains 800 high resolution images for training and another 100 for validation. We convert the original RGB images into YUV420 format video sequences with ffmpeg¹. These video sequences are then compressed by the AVS3 reference software (HPM3.1) under the AI configuration with QP ranging from 27 to 50. We switch DBF, SAO and ALF off in the compression process, and keep the other parameters the same as the AVS3 common test conditions. In order to adapt to multiple QP values while limiting the number of CNN models, we classify the QP range from 27 to 50 into 4 QP bands: [27, 31], [32, 37], [38, 44] and [45, 50]. A model is trained for every QP band with the samples belonging to that band.

To shrink the training time, we split the compressed high resolution frames into 128x128 patches. We denote the training patches as (x_i^luma, x_i^chroma) and the target patches as (y_i^luma, y_i^chroma), where x_i is a codec-compressed patch and y_i represents its corresponding pristine patch. In particular, the pair (x_i^luma, x_i^chroma) is fed into the network when performing in-loop filtering for chrominance. The loss function of the above networks is the mean squared error (MSE), which pulls samples toward their true values and penalizes samples far from them. We train the proposed networks with PyTorch on an NVIDIA 1080Ti GPU for every combination of QP band and YUV channel, giving a total of 12 models for the different circumstances.

We use the first-order gradient-based optimizer Adam [24] to train our networks, with the Adam parameters set to the PyTorch defaults. The batch size of each iteration is 64 and the training samples are shuffled randomly. The learning rate is initially 1e-4 for all models, and we decrease it by a factor of 0.1 when the PSNR on the validation dataset no longer improves.

¹http://ffmpeg.org
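A condensed sketch of this procedure under the stated settings (MSE loss, Adam with PyTorch defaults, batch size 64, initial learning rate 1e-4, decay by a factor of 0.1 when the validation PSNR plateaus). The dataset objects and the epoch count are assumptions, and validate_psnr is a hypothetical helper shown for completeness.

```python
import torch
from torch.utils.data import DataLoader

# one model per QP band and per Y/U/V channel: 4 x 3 = 12 models in total
QP_BANDS = [(27, 31), (32, 37), (38, 44), (45, 50)]

def validate_psnr(model, val_set, device='cuda'):
    """Mean PSNR over validation patches, assuming samples scaled to [0, 1]."""
    model.eval()
    psnrs = []
    with torch.no_grad():
        for x, y in DataLoader(val_set, batch_size=64):
            mse = torch.mean((model(x.to(device)) - y.to(device)) ** 2)
            psnrs.append((10 * torch.log10(1.0 / mse)).item())
    return sum(psnrs) / len(psnrs)

def train(model, train_set, val_set, epochs=100, device='cuda'):
    """train_set/val_set yield (compressed, pristine) 128x128 patch pairs."""
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # PyTorch defaults otherwise
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1)
    criterion = torch.nn.MSELoss()
    for _ in range(epochs):
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step(validate_psnr(model, val_set, device))  # lr x0.1 on PSNR plateau
```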
IV. EXPERIMENTAL RESULTS

We provide experimental results on both objective and subjective quality in this section. We integrate the proposed approach into the AVS3 reference software (HPM3.1) as the only in-loop filter coding tool. The numbers of residual blocks in the networks for the luminance and chrominance components are set to 20 and 10 respectively.

A. Objective Quality

To fully evaluate the compression efficiency, we conduct a 1-second test with HPM3.1 under the AI and RA configurations respectively. The anchor here refers to HPM3.1 with DBF, SAO and ALF enabled. The other configurations of the two encoders are set according to the AVS3 common test conditions. The experimental results on the benchmark test sequences under the AI configuration are given in Table I. The proposed approach achieves 7.51%, 16.88% and 18.59% BD-rate saving on average compared with the anchor. It is worth noting that all 4K test sequences, together with MarketPlace and RitualDance, are 10-bit test sequences; the proposed scheme achieves high compression efficiency on both 8-bit and 10-bit content. For ultra high definition content, the BD-rate reductions over the 4K sequences are 6.40%, 15.54% and 17.52%. Moreover, Table II details the PSNR increment under the AI configuration with QP 45.

TABLE I
BD-RATE SAVING UNDER AI CONFIGURATION IN HPM3.1

Class   Sequence          Y(%)    U(%)    V(%)
UHD4K   Campfire         -6.74  -10.40  -15.27
        DaylightRoad2    -8.60  -21.44  -22.02
        ParkRunning3     -4.64   -8.48  -12.47
        Tango2           -5.62  -21.85  -20.31
1080p   BasketballDrive  -7.74  -19.18  -22.95
        Cactus           -8.06  -17.23  -17.84
        MarketPlace      -5.02  -21.66  -21.59
        RitualDance     -12.67  -22.23  -21.46
720p    City             -3.65  -16.26  -15.07
        Crew             -4.39  -16.70  -21.88
        Vidyo1          -13.21  -17.33  -17.14
        Vidyo3           -9.77   -9.84  -15.09
        Average          -7.51  -16.88  -18.59

TABLE II
PSNR INCREMENT UNDER AI CONFIGURATION WITH QP 45

Class     Y(dB)   U(dB)   V(dB)
4K         0.25    0.53    0.55
1080p      0.37    0.68    0.81
720p       0.50    0.54    0.62
Average    0.37    0.58    0.66

Although our training samples are all intra frames, the enhanced intra frames provide more accurate references for P and B frames under the random access (RA) configuration. Detailed BD-rate reductions under the RA configuration are given in Table III; the proposed method brings 3.28%, 14.37% and 13.59% BD-rate saving on average compared with the anchor.

TABLE III
BD-RATE SAVING UNDER RA CONFIGURATION IN HPM3.1

Class   Sequence          Y(%)    U(%)    V(%)
UHD4K   Campfire        -10.97   -9.26   -6.69
        DaylightRoad2    -5.10  -18.22  -21.72
        ParkRunning3      3.13   -2.11   -9.09
        Tango2           -1.44  -18.80  -11.52
1080p   BasketballDrive  -2.13   -6.02   -6.54
        Cactus           -5.04  -18.19  -10.04
        MarketPlace       0.26  -17.63  -20.55
        RitualDance      -3.52  -14.84   -1.19
720p    City             -1.42  -23.35  -26.25
        Crew             -2.61  -14.04  -13.18
        Vidyo1           -9.09  -21.51  -17.59
        Vidyo3           -1.46   -8.41  -18.71
        Average          -3.28  -14.37  -13.59

B. Subjective Quality

We compare the subjective quality of images reconstructed with our proposed approach and with the traditional loop filters. In addition, we show the pixel-level differences produced by the two loop filtering methods. As depicted in Fig. 4, artifacts including blocking and ringing disappear in the image reconstructed by our method, which is closer to the original image. Compared with the conventional loop filters, the differences introduced by our approach are mainly concentrated on texture contours and CU boundaries.

Fig. 4. Subjective quality of the proposed method and conventional loop filter methods
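For reference, the BD-rate figures in Tables I and III follow the standard Bjontegaard metric: fit log-rate versus PSNR curves for the anchor and the test codec and average their horizontal gap over the overlapping quality range. A generic sketch (not code from this paper), taking four (bitrate, PSNR) points per codec:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate in percent; negative values mean bitrate savings."""
    # cubic fit of log-rate as a function of PSNR for each codec
    p_anchor = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_test = np.polyfit(psnr_test, np.log(rate_test), 3)
    # integrate over the overlapping PSNR interval
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_anchor = np.polyint(p_anchor)
    int_test = np.polyint(p_test)
    avg_log_diff = ((np.polyval(int_test, hi) - np.polyval(int_test, lo))
                    - (np.polyval(int_anchor, hi) - np.polyval(int_anchor, lo))) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```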

V. CONCLUSION

In this paper, we proposed a residual in residual based CNN model for in-loop filtering in the AVS3 standard. The CNN model contains several residual blocks as well as a global identity connection. We first built a large scale training database for compression artifact removal. We then filtered the luminance channel with the CNN model and subsequently used the reconstructed luma channel as guidance for chrominance filtering. Extensive experimental results show that the proposed CNN based loop filtering algorithm obtains significant compression performance under both the AI and RA configurations, especially for the chrominance components.

VI. ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China (61632001) and the High-performance Computing Platform of Peking University, which are gratefully acknowledged.

REFERENCES

[1] Zhichu He, Lu Yu, Xiaozhen Zheng, Siwei Ma, and Yun He. Framework of AVS2-video coding. In 2013 IEEE International Conference on Image Processing, pages 1515–1519. IEEE, 2013.
[2] Siwei Ma, Tiejun Huang, and Wen Gao. The second generation IEEE 1857 video coding standard. In 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), pages 171–175. IEEE, 2015.
[3] Junru Li, Meng Wang, Li Zhang, Kai Zhang, Hongbin Liu, Shiqi Wang, Siwei Ma, and Wen Gao. History-based motion vector prediction for future video coding. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 67–72. IEEE, 2019.
[4] Yu Qin, Ma Siwei, and He Zhichu. Suggested video platform for AVS2. AVS-Doc, M2972, 2012.06.
[5] He Jianqiang and Ma Siwei. Improvement of de-blocking filter in AVS2. AVS-Doc, M3013, 2012.12.
[6] Chen Jie, Lee Sunil, Alshina Elena, Kim Chanyul, Fu Chih-Ming, Huang Yu-Wen, and Lei Shawmin. Sample adaptive offset for AVS2. AVS-Doc, M3197, 2013.09.
[7] Zhang Xinfeng, Junjun Si, Wang Shanshe, Ma Siwei, Cai Jiayang, Chen Qinghua, Huang Yu-Wen, and Lei Shawmin. Adaptive loop filter for AVS2. AVS-Doc, M3292, 2014.04.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[10] Siwei Ma, Xinfeng Zhang, Chuanmin Jia, Zhenghui Zhao, Shiqi Wang, and Shanshe Wang. Image and video compression with neural networks: A review. IEEE Transactions on Circuits and Systems for Video Technology, 2019.
[11] Jiahao Li, Bin Li, Jizheng Xu, Ruiqin Xiong, and Wen Gao. Fully connected network-based intra prediction for image coding. IEEE Transactions on Image Processing, 27(7):3236–3247, 2018.
[12] Lei Zhao, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao. Enhanced motion-compensated video coding with deep virtual reference frame generation. IEEE Transactions on Image Processing, 2019.
[13] Zhenghui Zhao, Shiqi Wang, Shanshe Wang, Xinfeng Zhang, Siwei Ma, and Jiansheng Yang. Enhanced bi-prediction with convolutional neural network for high efficiency video coding. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[14] Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, Jiaying Liu, Shiliang Pu, and Siwei Ma. Content-aware convolutional neural network for in-loop filtering in high efficiency video coding. IEEE Transactions on Image Processing, 2019.
[15] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[16] Andrey Norkin, Gisle Bjontegaard, Arild Fuldseth, Matthias Narroschke, Masaru Ikeda, Kenneth Andersson, Minhua Zhou, and Geert Van der Auwera. HEVC deblocking filter. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1746–1754, 2012.
[17] Chih-Ming Fu, Elena Alshina, Alexander Alshin, Yu-Wen Huang, Ching-Yeh Chen, Chia-Yang Tsai, Chih-Wei Hsu, Shaw-Min Lei, Jeong-Hoon Park, and Woo-Jin Han. Sample adaptive offset in the HEVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 22(12):1755–1764, 2012.
[18] Chia-Yang Tsai, Ching-Yeh Chen, Tomoo Yamakage, In Suk Chong, Yu-Wen Huang, Chih-Ming Fu, Takayuki Itoh, Takashi Watanabe, Takeshi Chujoh, Marta Karczewicz, et al. Adaptive loop filtering for video coding. IEEE Journal of Selected Topics in Signal Processing, 7(6):934–945, 2013.
[19] Yuanying Dai, Dong Liu, and Feng Wu. A convolutional neural network approach for post-processing in HEVC intra coding. In International Conference on Multimedia Modeling, pages 28–39. Springer, 2017.
[20] Yongbing Zhang, Tao Shen, Xiangyang Ji, Yun Zhang, Ruiqin Xiong, and Qionghai Dai. Residual highway convolutional neural networks for in-loop filtering in HEVC. IEEE Transactions on Image Processing, 27(8):3827–3841, 2018.
[21] Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, and Siwei Ma. Spatial-temporal residue network based in-loop filter for video coding. In 2017 IEEE Visual Communications and Image Processing (VCIP), pages 1–4. IEEE, 2017.
[22] Xiaoyi He, Qiang Hu, Xiaoyun Zhang, Chongyang Zhang, Weiyao Lin, and Xintong Han. Enhancing HEVC compressed videos with a partition-masked convolutional neural network. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 216–220. IEEE, 2018.
[23] Daowen Li and Lu Yu. An in-loop filter based on low-complexity CNN using residuals in intra video coding. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2019.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.