Deep Kalman Filtering Network for Video Compression Artifact Reduction

Guo Lu1, Wanli Ouyang2,3, Dong Xu2, Xiaoyun Zhang1, Zhiyong Gao1, and Ming-Ting Sun4

1 Shanghai Jiao Tong University, China {luguo2014,xiaoyun.zhang,zhiyong.gao}@sjtu.edu.cn 2 The University of Sydney, Australia {wanli.ouyang,dong.xu}@sydney.edu.au 3 The University of Sydney, SenseTime Computer Vision Group 4 University of Washington, USA [email protected]

Abstract. When lossy video compression algorithms are applied, compression artifacts often appear in videos, making decoded videos unpleasant for human visual systems. In this paper, we model the video artifact reduction task as a Kalman filtering procedure and restore decoded frames through a deep Kalman filtering network. Different from the existing works that use the noisy previous decoded frames as temporal information in the restoration problem, we utilize the less noisy previous restored frame and build a recursive filtering scheme based on the Kalman model. This strategy can provide more accurate and consistent temporal information, which produces higher quality restoration results. In addition, the strong prior information of the prediction residual is also exploited for restoration through a well designed neural network. These two components are combined under the Kalman framework and optimized through the deep Kalman filtering network. Our approach can well bridge the gap between model-based methods and learning-based methods by integrating the recursive nature of the Kalman model and the highly non-linear transformation ability of deep neural networks. Experimental results on the benchmark dataset demonstrate the effectiveness of our proposed method.

Keywords: Compression artifact reduction, deep neural network, Kalman model, recursive filtering, video restoration

1 Introduction

Compression artifact reduction methods aim at generating artifact-free images from lossy decoded images. To store and transfer the large number of images and videos on the Internet, image and video compression algorithms (e.g., JPEG, H.264) are widely used [1–3]. However, these algorithms often introduce undesired compression artifacts, such as blocking, blurring and ringing artifacts. Thus, compression artifact reduction has attracted increasing attention and many methods have been developed in the past few decades.


Fig. 1. Different methodologies for video artifact reduction: (a) the traditional pipeline without considering previous restored frames; (b) our Kalman filtering pipeline. Artifact reduction results of (c) Xue et al. [21] and (d) our proposed deep Kalman filtering network. (e) Original frame, (f) decoded frame, (g) quantized prediction residual, and (h) distortion between the original frame and the decoded frame.

Early works use manually designed filters [4, 5] and sparse coding methods [6–9] to remove compression artifacts. Recently, convolutional neural network (CNN) based approaches have been successfully applied to many computer vision tasks [10–20], such as super-resolution [15, 16], denoising [17] and artifact reduction [18–20]. In particular, Dong et al. [18] first proposed a four-layer neural network to eliminate JPEG compression artifacts.

For the video artifact reduction task, our motivations are two-fold. First, the restoration process for the current frame could benefit from the previous restored frames, because the previous restored frame can provide more accurate information than the decoded frame. Temporal information (such as motion clues) from neighbouring frames is therefore more precise and robust, which offers the potential to further improve performance. In addition, the dependence on previous restored frames naturally leads to a recursive pipeline for video artifact reduction: the current frame is recursively restored by potentially utilizing all past restored frames, which means we can leverage effective information propagated from previous estimations. Currently, most state-of-the-art approaches for compression artifact reduction are limited to removing artifacts in a single image [16–19]. Although the video artifact reduction method [21] and video super-resolution methods [22–24] try to combine temporal information for the restoration tasks, these methods ignore the previous restored frame and restore each frame separately, as shown in Fig. 1(a). Therefore, the video artifact reduction performance can be further improved by using an appropriate dynamic filtering scheme.

Second, modern video compression algorithms may contain powerful prior information that can be utilized to restore the decoded frame. It has been observed that practical video compression standards are not optimal according to information theory [9], so the resulting compressed code streams still have redundancies. It is possible, at least theoretically, to improve the restoration results by exploiting the knowledge hidden in the code streams. For video compression algorithms, inter prediction is a fundamental technique used to reduce temporal redundancy. Therefore, the decoded frame consists of two components: the prediction frame and the quantized prediction residual. As shown in Fig. 1(g)-(h), the distortion between the original frame and the decoded frame has a strong relationship with the quantized prediction residual, i.e., regions with high distortion often correspond to high quantized prediction residual values. Therefore, it is possible to enhance the restoration by employing this task-specific prior information.

In this paper, we propose a deep Kalman filtering network (DKFN) for video artifact reduction. The proposed approach can be used as a post-processing technique, and thus can be applied to different compression algorithms.
Specifically, we model the video artifact reduction problem as a Kalman filtering procedure, which can recursively restore the decoded frame and capture information propagated from previous restored frames. To perform Kalman filtering for decoded frames, we build two deep convolutional neural networks: a prediction network and a measurement network. The prediction network aims to obtain a prior estimation based on the previous restored frame. At the same time, we investigate the quantized prediction residual in the coding algorithms, and a novel measurement network incorporating this strong prior information is proposed for robust measurement. After that, the restored frame can be obtained by fusing the prior estimation and the measurement under the Kalman framework. Our proposed approach bridges the gap between model-based methods and learning-based methods by integrating the recursive nature of the Kalman model and the highly non-linear transformation ability of neural networks. Therefore, our approach can restore high quality frames from a series of decoded video frames. To the best of our knowledge, we are the first to develop a deep convolutional neural network under the Kalman filtering framework for video artifact reduction.

In summary, the main contributions of this work are two-fold. First, we formulate the video artifact reduction problem as a Kalman filtering procedure, which can recursively restore the decoded frames. In this procedure, we utilize CNNs to predict and update the state of Kalman filtering. Second, we employ the quantized prediction residual as strong prior information for video artifact reduction through a deep neural network. Experimental results show that our proposed approach outperforms the state-of-the-art methods for reducing video compression artifacts.

2 Related Work

2.1 Single Image Compression Artifact Reduction

Many methods have been proposed to remove compression artifacts. Early methods [25, 26] designed new filters to reduce blocking and ringing artifacts. One disadvantage of these methods is that such manually designed filters cannot sufficiently handle the compression degradation and may over-smooth the decoded images. Learning methods based on sparse coding were also proposed for image artifact reduction [8, 9, 27]. Chang et al. [8] proposed to learn a sparse representation from a training image set, which is used to reduce artifacts introduced by compression. Liu et al. [9] exploited the DCT information and built a sparsity-based dual domain approach.

Recently, deep convolutional neural network based methods have been successfully utilized for low-level computer vision tasks. Dong et al. [18] proposed the artifact reduction CNN (ARCNN) to remove the artifacts from JPEG compression. Inspired by ARCNN, several methods have been proposed to reduce compression artifacts by using various techniques, such as residual learning [28], skip connections [28, 29], batch normalization [17], perceptual loss [30], residual blocks [20] and generative adversarial networks [20]. For example, Zhang et al. [17] proposed a 20-layer neural network based on batch normalization and residual learning to eliminate Gaussian noise with unknown noise level. Tai et al. [16] proposed a memory block, consisting of a recursive unit and a gate unit, to explicitly mine persistent memory through an adaptive learning process.
In addition, many methods [31–33] were proposed to learn the image prior by using CNNs and achieved competitive results for the image restoration tasks.

As mentioned in [9], compressed code streams still have redundancies. Therefore, it is possible to obtain a more robust estimation by exploiting the prior knowledge hidden in the encoder. However, most of the previous works do not exploit this important prior information. Although the works in [9, 19, 27] proposed to combine DCT information, it is not sufficient, especially for the video artifact reduction task. In our work, we further exploit the prior information in the code streams and incorporate the prediction residual into our framework for robust compression artifact reduction.

2.2 Deep Learning for Video Restoration

Due to the popularity of neural networks for image restoration, several CNN based methods [21, 22, 34–36] were also proposed for the video restoration tasks. For video super-resolution, Liao et al. [34] first generated an ensemble of SR drafts via motion compensation, and then used a CNN model to restore the high resolution frame from all drafts. Kappeler et al. [35] estimated optical flow and selected the corresponding patches across frames to train a CNN model for video super-resolution. Based on the spatial transformer network (STN) [36], the works in [21–24] aligned the neighboring frames according to the estimated optical flow or transform parameters and increased the temporal coherence for the video SR task. Tao et al. [22] achieved sub-pixel motion compensation and resolution enhancement with high performance. Xue et al. [21] utilized a joint training strategy to optimize the motion estimation and video restoration tasks and achieved the state-of-the-art results for video artifact reduction.

Compared to the methods for single image compression artifact reduction (see Section 2.1), the video restoration methods exploit temporal information. However, these methods process noisy/low-resolution videos separately without considering the previous restored frames. Therefore, they cannot improve video restoration performance by utilizing more accurate temporal information. In our work, we recursively restore each frame in the videos by leveraging the previous restored frame for video artifact reduction. Although the works in [37, 38] try to combine deep neural networks and the Kalman filter, they are not designed for image/video enhancement tasks.

3 Methodology

We first give a brief introduction to the basic Kalman filter and then describe our formulation for video artifact reduction and the corresponding network design.

Introduction of Denotations. Let $V = \{X_1, X_2, \ldots, X_{t-1}, X_t, \ldots\}$ denote an uncompressed video sequence, where $X_t \in \mathbb{R}^{mn \times 1}$ is the video frame at time step $t$ and $m \times n$ represents the spatial resolution. In order to simplify the description, we only analyze video frames with a single channel, although we consider RGB/YUV channels in our implementation. After compression, $X_t^c$ is the decoded frame of $X_t$. $\hat{X}_t^-$ denotes the prior estimation and $\hat{X}_t$ denotes the posterior estimation for restoring $X_t$ from the decoded frame $X_t^c$. $R_t^c$ is the quantized prediction residual in video coding standards, such as H.264, and $R_t$ is the corresponding unquantized prediction residual.

3.1 Brief Introduction of Kalman Filter

The Kalman filter [39] is an efficient recursive filter that estimates the internal state from a series of noisy measurements. In the artifact reduction task, the internal state is the original image to be restored, and the noisy measurements can be considered as the images with compression artifacts.

Preliminary Formulation. The Kalman filter model assumes that the state $X_t$ at time $t$ evolves from the state $X_{t-1}$ at time $t-1$ according to

$$X_t = A_t X_{t-1} + w_{t-1}, \qquad (1)$$

where $A_t$ is the transition matrix at time $t$ and $w_{t-1}$ is the process noise. The measurement $Z_t$ of the true state $X_t$ is defined as follows,

$$Z_t = H X_t + v_t, \qquad (2)$$

where $H$ is the measurement matrix and $v_t$ represents the measurement noise. However, the system may be non-linear in some complex scenarios. Therefore, a non-linear model for the transition process in Eq. (1) can be formulated as follows,
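For concreteness, the linear model of Eq. (1) and Eq. (2) can be simulated in a few lines of Python; this is an illustrative sketch with toy dimensions and noise levels, not part of the proposed method.

```python
import numpy as np

# Toy simulation of the linear state-space model in Eq. (1) and Eq. (2).
# A 4-dimensional state stands in for a tiny image; all values are illustrative.
rng = np.random.default_rng(0)
n = 4
A = np.eye(n)                    # transition matrix A_t (identity for a static scene)
H = np.eye(n)                    # measurement matrix H
Q = 1e-4 * np.eye(n)             # covariance of the process noise w_{t-1}
U = 1e-2 * np.eye(n)             # covariance of the measurement noise v_t

x = rng.normal(size=n)           # initial true state X_0
for t in range(1, 4):
    w = rng.multivariate_normal(np.zeros(n), Q)
    v = rng.multivariate_normal(np.zeros(n), U)
    x = A @ x + w                # Eq. (1): X_t = A_t X_{t-1} + w_{t-1}
    z = H @ x + v                # Eq. (2): Z_t = H X_t + v_t
    print(t, np.round(x, 3), np.round(z, 3))
```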

$$X_t = f(X_{t-1}, w_{t-1}), \qquad (3)$$

where $f(\cdot)$ is the non-linear transition model. The linear Kalman filter corresponds to Eq. (1) and Eq. (2); the non-linear Kalman filter corresponds to Eq. (3) and Eq. (2).

Kalman Filtering. As shown in Fig. 2(a), Kalman filtering consists of two steps: prediction and update. In the prediction step, it calculates the prior estimation from the posterior estimation of the previous time step.


Fig. 2. (a) Basic Kalman model. (b) Overview of the proposed deep Kalman filtering network for video artifact reduction. $X_t^c$ is the decoded frame at time $t$. $\hat{X}_{t-1}$ represents the restored frame from time $t-1$. The prediction network generates the prior estimate $\hat{X}_t^-$ for the original frame based on $X_t^c$ and $\hat{X}_{t-1}$. The measurement network uses the decoded frame $X_t^c$ and the quantized prediction residual $R_t^c$ to obtain an initial measurement $Z_t$. After that, we can build the posterior estimate by fusing the prior estimate $\hat{X}_t^-$ and the measurement $Z_t$.

For the non-linear model given by Eq. (3) and Eq. (2), the prediction step is accomplished by two sub-steps as follows,

$$\text{Prior state estimation:}\quad \hat{X}_t^- = f(\hat{X}_{t-1}, 0), \qquad (4)$$
$$\text{Covariance estimation:}\quad P_t^- = A_t P_{t-1} A_t^T + Q_{t-1}, \qquad (5)$$

where $Q_{t-1}$ is the covariance of the process noise $w_{t-1}$ at time $t-1$, and $P_t^-$ is a covariance matrix used for the update step. $A_t$ in Eq. (5) is defined as the Jacobian matrix of $f(\cdot)$, i.e., $A_t = \partial f(\hat{X}_{t-1}, 0) / \partial X$ [40]. The prediction procedure for the linear model can be considered as a special case by setting $f(\hat{X}_{t-1}, 0) = A_t \hat{X}_{t-1}$.

In the update step, the Kalman filter calculates the posterior estimate by fusing the prior estimate from the prediction step and the measurement. Details about the update step can be found in Section 3.6.

An overview of Kalman filtering is shown in Fig. 2(a). First, the prediction step uses the estimated state $\hat{X}_{t-1}$ at time $t-1$ to obtain a prior state estimation $\hat{X}_t^-$ at time $t$. Then the prior state estimation $\hat{X}_t^-$ and the measurement $Z_t$ are used by the update step to obtain the posterior estimation of the state, denoted by $\hat{X}_t$. These two steps are performed recursively.
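To make the prediction step concrete, a minimal NumPy sketch of Eq. (4) and Eq. (5) is given below. The non-linear function `f` and the finite-difference Jacobian are illustrative stand-ins; in our framework they are replaced by the temporal mapping network (Section 3.3) and the linearization network (Section 3.4).

```python
import numpy as np

def ekf_predict(x_post, P_post, f, Q, eps=1e-6):
    """Prediction step of the non-linear Kalman filter: Eq. (4) and Eq. (5)."""
    n = x_post.size
    x_prior = f(x_post)                      # Eq. (4): prior state, zero process noise
    # Finite-difference Jacobian A_t = df/dX evaluated at x_post; the proposed
    # framework sidesteps this computation with a linearization network.
    A = np.empty((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        A[:, j] = (f(x_post + dx) - f(x_post)) / eps
    P_prior = A @ P_post @ A.T + Q           # Eq. (5): prior covariance
    return x_prior, P_prior

# Toy usage with a mildly non-linear transition function.
f = lambda x: x + 0.1 * np.tanh(x)
x_prior, P_prior = ekf_predict(np.ones(4), np.eye(4), f, 1e-4 * np.eye(4))
```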

3.2 Overview of our Deep Kalman Filtering Network

Fig. 2(b) shows an overview of the proposed deep Kalman filtering network. In this framework, the previous restored frame $\hat{X}_{t-1}$ and the decoded frame $X_t^c$ are

used for obtaining a prior estimate $\hat{X}_t^-$. Then the measurement $Z_t$ obtained from the measurement network and the prior estimate $\hat{X}_t^-$ are used by the update step for obtaining the posterior estimation $\hat{X}_t$. The proposed DKFN framework follows the Kalman filtering procedure by using the prediction and update steps for obtaining the estimated state $\hat{X}_t$. The main differences from the original Kalman filtering are as follows.

First, in the prior estimation sub-step of the prediction step, we use the temporal mapping sub-network as the non-linear function $f(\cdot)$ in Eq. (3) to obtain the prior state estimation. Specifically, the temporal mapping sub-network takes the previous restored frame $\hat{X}_{t-1}$ and the decoded frame $X_t^c$ as the input and generates the prior estimate $\hat{X}_t^-$ for the current frame. Details are given in Section 3.3.

Second, in the covariance estimation sub-step, the transition matrix $A_t$ in Eq. (5) is approximated by a linearization sub-network. In the conventional non-linear Kalman filter, the Jacobian matrix of the non-linear function $f(\cdot)$ is used to obtain $A_t$. For our CNN implementation of $f(\cdot)$, however, it is too complex to compute, especially when we need to compute the Jacobian matrix for each pixel location. Therefore, we approximate the calculation of the Jacobian matrix by using a linearization sub-network. Details are given in Section 3.4.

Third, in the update procedure, we use a measurement network to generate the measurement. In comparison, the conventional Kalman filter might directly use the decoded frame with compression artifacts as the measurement. Details are given in Section 3.5.

3.3 Temporal Mapping Network

Mathematical Formulation. Based on the temporal characteristics of video sequences, the temporal mapping sub-network is used for implementing the non-linear function $f(\cdot)$ in the prior state estimation as follows:

$$\hat{X}_t^- = F(\hat{X}_{t-1}, X_t^c; \theta_f), \qquad (6)$$

where $\theta_f$ are the trainable parameters. Eq. (6) indicates that the prior estimation of the current frame $X_t$ is related to its estimated temporal neighbouring frame $\hat{X}_{t-1}$ and its decoded frame $X_t^c$ at the current time step. This formulation is based on the following assumptions. First, the temporal evolution characteristics can provide a strong motion clue to predict $X_t$ based on the previous frame $\hat{X}_{t-1}$. Second, due to the existence of complex motion scenarios with occlusion, it is necessary to exploit the information from the decoded frame $X_t^c$ to build a more accurate estimation for $X_t$. Based on this assumption, our formulation in Eq. (6) adds the decoded frame $X_t^c$ as an extra input to the transition function $f(\cdot)$ defined in Eq. (4).

Network Implementation. The temporal mapping sub-network architecture is shown in Fig. 3(a). Specifically, each residual block contains two convolutional layers with the pre-activation structure [41]. We use convolutional layers with 3×3 kernels and 64 output channels in the whole network except the last layer. The last layer is a convolutional layer with one feature map and without a non-linear transform. Generalized divisive normalization (GDN) and inverse GDN (IGDN) [42] are used because they are well-suited for Gaussianizing data from natural images. More training details are discussed in Section 3.7.


Fig. 3. Network architecture of the proposed (a) temporal mapping network, (b) linearization network, and (c) measurement network. For better illustration, we omit the matrix conversion process of $\hat{X}_t$, $X_t^c$ and $R_t^c$ from $\mathbb{R}^{mn \times 1}$ to $\mathbb{R}^{m \times n}$. Here 'Conv,3x3,64' represents the convolution operation with 3×3 kernels and 64 feature maps. 'Reshape, $m \times n$' is the operation that reshapes one matrix to $m \times n$. $\oplus$ and $\otimes$ represent element-wise addition and matrix multiplication.
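A minimal tf.keras sketch of the temporal mapping sub-network trunk is shown below, assuming 256×448 single-channel inputs. To keep the sketch dependency-free, the GDN/IGDN layers described above are replaced by plain ReLU activations; the actual model uses GDN/IGDN.

```python
import tensorflow as tf

def res_block(x, filters=64):
    # Pre-activation residual block [41]. The paper places GDN/IGDN [42] in the
    # trunk; plain ReLU is used here to keep the sketch dependency-free.
    y = tf.keras.layers.ReLU()(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.ReLU()(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    return tf.keras.layers.Add()([x, y])

def temporal_mapping_net(h=256, w=448):
    x_prev = tf.keras.Input((h, w, 1), name="x_hat_prev")  # restored frame at t-1
    x_dec = tf.keras.Input((h, w, 1), name="x_dec")        # decoded frame X_t^c
    y = tf.keras.layers.Concatenate()([x_prev, x_dec])
    y = tf.keras.layers.Conv2D(64, 3, padding="same")(y)
    for _ in range(6):                          # 6 residual blocks, as in Fig. 3(a)
        y = res_block(y)
    # Last layer: one feature map, no non-linear transform.
    y = tf.keras.layers.Conv2D(1, 3, padding="same")(y)
    return tf.keras.Model([x_prev, x_dec], y, name="temporal_mapping")

model = temporal_mapping_net()
```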

3.4 Linearization Network

The linearization network aims to learn a linear transition matrix $A_t$ for the covariance estimation in Eq. (5) adaptively for different image regions. It is non-trivial to calculate the Jacobian matrix of the transition function $F(\cdot)$ and linearize it through a Taylor series. Therefore, we use a simple neural network to learn a transition matrix. Specifically, given the prior estimation $\hat{X}_t^-$, the previous restored frame $\hat{X}_{t-1}$ and the decoded frame $X_t^c$, the problem is expressed in the following way,

$$\hat{X}_t^- = F(\hat{X}_{t-1}, X_t^c; \theta_f) \approx \tilde{A}_t \hat{X}_{t-1}, \quad \text{where } \tilde{A}_t = G(\hat{X}_{t-1}, X_t^c; \theta_m), \qquad (7)$$

where $G(\hat{X}_{t-1}, X_t^c; \theta_m)$ is the linearization network with the parameters $\theta_m$, and $\tilde{A}_t$ is the output of this network. The network architecture is shown in Fig. 3(b). Given $\hat{X}_{t-1}, X_t^c \in \mathbb{R}^{mn \times 1}$, the network generates a transition matrix $\tilde{A}_t \in \mathbb{R}^{mn \times mn}$ as an approximation to $A_t$ in Eq. (5).
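The following is a hedged tf.keras sketch of such a patch-level linearization network, assuming 4×4 patches (the patch size used during training, see Section 3.7); the internal layer configuration is illustrative rather than the exact one in Fig. 3(b).

```python
import numpy as np
import tensorflow as tf

def linearization_net(p=4):
    """Patch-level sketch of G(.; theta_m): maps two p x p patches to a
    (p*p) x (p*p) transition matrix (Eq. (7)). Layer sizes are illustrative."""
    x_prev = tf.keras.Input((p, p, 1))       # patch of the restored frame at t-1
    x_dec = tf.keras.Input((p, p, 1))        # co-located patch of X_t^c
    y = tf.keras.layers.Concatenate()([x_prev, x_dec])
    y = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(y)
    y = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(y)
    y = tf.keras.layers.Flatten()(y)
    y = tf.keras.layers.Dense(p * p * p * p)(y)
    A_tilde = tf.keras.layers.Reshape((p * p, p * p))(y)
    return tf.keras.Model([x_prev, x_dec], A_tilde, name="linearization")

# Usage: \tilde{A}_t should map the flattened previous patch to the prior patch.
net = linearization_net()
prev = np.random.rand(1, 4, 4, 1).astype("float32")
dec = np.random.rand(1, 4, 4, 1).astype("float32")
A_tilde = net([prev, dec]).numpy()[0]        # (16, 16) transition matrix
prior_patch = A_tilde @ prev.reshape(16)     # right-hand side of Eq. (7)
```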

3.5 Measurement Network

Since prediction based coding is one of the most important techniques in video coding standards (such as MPEG-2, H.264 and H.265), we take this coding approach into consideration when designing the measurement network. In prediction based coding, the decoded frame $X_t^c$ can be decoupled into two components, i.e., $X_t^c = X_t^p + R_t^c$, where $X_t^p$ and $R_t^c$ represent the prediction frame and the quantized prediction residual, respectively. Note that quantization is only applied to the prediction residual, so the distortion in compression only comes from $R_t^c$. For non-predictive video codecs, such as JPEG2000, we use an existing warp operation [23, 36] from the previous decoded frame to the current decoded frame, and the difference between them is considered as the prediction residual $R_t^c$. For most video codecs (e.g., H.264 and H.265), we can directly utilize the quantized prediction residual in the code streams.

We obtain the measurement using the quantized prediction residual $R_t^c$ as follows,

$$Z_t = X_t^p + \hat{R}_t, \quad \text{where } \hat{R}_t = M(X_t^c, R_t^c; \theta_z), \qquad (8)$$

where $\hat{R}_t$ is the restored residual, in which the effect of quantization is removed so that $Z_t$ is close to the original image $X_t$. We use a deep neural network as shown in Fig. 3(c) (with the same architecture as Fig. 3(a)) for the function $M(\cdot)$. This network takes the decoded frame $X_t^c$ and the quantized prediction residual $R_t^c$ as the input and estimates the restored residual $\hat{R}_t$. There are two advantages of our formulation for measurement. On one hand, instead of utilizing the decoded frame $X_t^c$ as the measurement, our formulation avoids explicitly modeling the complex relationship between original frames and decoded frames. On the other hand, most of the existing artifact reduction methods can be embedded into our model as the measurement method, which provides a flexible framework to obtain a more accurate measurement value.

3.6 Update Step

Given the prior state estimation $\hat{X}_t^-$ from the temporal mapping network (Section 3.3), the transition matrix $\tilde{A}_t$ obtained from the linearization network (Section 3.4), and the measurement $Z_t$ obtained from the measurement network (Section 3.5), we can use the following steps to obtain the posterior estimation of the restored image:

$$P_t^- = \tilde{A}_t P_{t-1} \tilde{A}_t^T + Q_{t-1}, \qquad (9)$$
$$K_t = P_t^- H^T (H P_t^- H^T + U_t)^{-1}, \qquad (10)$$
$$\hat{X}_t = \hat{X}_t^- + K_t (Z_t - H \hat{X}_t^-), \qquad (11)$$
$$P_t = (I - K_t H) P_t^-, \qquad (12)$$

where $\hat{X}_t$ represents the posterior estimation of the image $X_t$ (Eq. (9) corresponds to the covariance estimation and is listed here for better presentation). $P_t^-$ and $P_t$ are the estimated state covariance matrices for the prior estimation and the posterior estimation, respectively. $K_t$ is the Kalman gain at time $t$.

$H$ is the measurement matrix defined in Eq. (2) and is assumed to be an identity matrix in this work. $Q_{t-1}$ and $U_t$ are the process noise covariance matrix and the measurement noise covariance matrix, respectively. We assume $Q_{t-1}$ and $U_t$ to be constant over time. For more details about the update procedure of Kalman filtering, please refer to [40].

Discussion. Our approach can address the error accumulation problem of the recursive pipeline through the adaptive Kalman gain. For example, when errors accumulate in the previous restored frames, the degree of reliability of the prior estimation (i.e., information from the previous frame) decreases, and the final result depends more on the measurement (i.e., the current frame).
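Putting Sections 3.3–3.6 together, one recursive restoration step can be sketched as follows. For readability the sketch uses per-pixel (element-wise) covariances and $H = I$; the three sub-networks are passed in as stand-in callables rather than the trained models.

```python
import numpy as np

def dkfn_step(x_prev_hat, P_prev, x_dec, r_dec, nets, Q, U):
    """One recursive DKFN restoration step with H = I and per-pixel
    (element-wise) covariances; `nets` holds stand-ins for the learned
    temporal mapping, linearization and measurement networks."""
    # Prediction (Sections 3.3 and 3.4).
    x_prior = nets["temporal_mapping"](x_prev_hat, x_dec)      # Eq. (4)/(6)
    a = nets["linearization"](x_prev_hat, x_dec)               # per-pixel ~A_t
    P_prior = a * P_prev * a + Q                               # Eq. (9)

    # Measurement (Section 3.5): X_t^p = X_t^c - R_t^c, Z_t = X_t^p + R_hat_t.
    z = (x_dec - r_dec) + nets["measurement"](x_dec, r_dec)    # Eq. (8)

    # Update (Section 3.6) with H = I.
    K = P_prior / (P_prior + U)                                # Eq. (10)
    x_post = x_prior + K * (z - x_prior)                       # Eq. (11)
    P_post = (1.0 - K) * P_prior                               # Eq. (12)
    return x_post, P_post

# Trivial stand-ins so the sketch runs end-to-end on random frames.
nets = {
    "temporal_mapping": lambda xp, xd: 0.5 * (xp + xd),
    "linearization": lambda xp, xd: np.ones_like(xp),
    "measurement": lambda xd, rd: np.zeros_like(rd),
}
frame = np.random.rand(256, 448)
x_post, P_post = dkfn_step(frame, np.ones_like(frame), frame, 0.1 * frame,
                           nets, Q=1e-4, U=1e-2)
```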

3.7 Training Strategy

There are three sets of trainable parameters, $\theta_f$, $\theta_m$ and $\theta_z$, in our approach. First, the parameters $\theta_f$ in the temporal mapping network are optimized as follows,

$$L_f(\theta_f) = \| X_t - F(\hat{X}_{t-1}, X_t^c; \theta_f) \|_2^2. \qquad (13)$$

Note that the minimization procedure for Eq. (13) requires the restored frame $\hat{X}_{t-1}$ of the previous time step, which leads to a chicken-and-egg problem. A straightforward method is to feed several frames of a clip into the network and train on all the input frames in each iteration. However, this strategy increases GPU memory consumption significantly, and simultaneously training multiple frames of a video clip is non-trivial. Alternatively, we adopt an on-line update strategy. Specifically, the estimation result $\hat{X}_t$ in each iteration is saved in a buffer. In the following iterations, $\hat{X}_t$ in the buffer is used to provide more accurate temporal information when estimating $X_{t+1}$. Therefore, each training sample in the buffer is updated once per epoch, and we only need to optimize one frame of a video clip in each iteration, which is more efficient.

After that, the parameters $\theta_f$ are fixed and we optimize the linearization network $G(\theta_m)$ by using the following loss function:

$$L_m(\theta_m) = \| \hat{X}_t^- - G(\hat{X}_{t-1}, X_t^c; \theta_m) \hat{X}_{t-1} \|_2^2. \qquad (14)$$

Note that we use a small patch size (4×4) to reduce the computational cost when optimizing $\theta_m$. Then, we train the measurement network and optimize $\theta_z$ based on the following loss function,

$$L_z(\theta_z) = \| X_t - (M(X_t^c, R_t^c; \theta_z) + X_t^p) \|_2^2. \qquad (15)$$

Finally, we fine-tune the whole deep Kalman filtering network based on the loss $L$ defined as follows,

$$L(\theta) = \| X_t - \hat{X}_t \|_2^2, \qquad (16)$$

where $\theta$ are the trainable parameters in the deep Kalman filtering network.
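A sketch of the on-line update strategy for Eq. (13) is given below; the model interface and buffer layout are hypothetical and only illustrate how the restored frame of each clip is cached and reused as the temporal input in later iterations.

```python
import numpy as np

# Hypothetical sketch of the on-line update strategy: the buffer caches the
# latest restored frame of every clip, so that later iterations can use it as
# the temporal input instead of re-processing the whole clip.
buffer = {}  # clip_id -> last restored frame \hat{X}_t

def train_one_frame(model, clip_id, x_dec_t, x_orig_t, x_dec_prev):
    # Fall back to the previous decoded frame until a restored one is cached.
    x_prev = buffer.get(clip_id, x_dec_prev)
    x_restored = model(x_prev, x_dec_t)              # F(.; theta_f) in Eq. (6)
    loss = np.mean((x_orig_t - x_restored) ** 2)     # Eq. (13), MSE
    # ... backpropagate `loss` and update theta_f here ...
    buffer[clip_id] = x_restored                     # refresh the buffer entry
    return loss
```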

4 Experiments

To demonstrate the effectiveness of the proposed model for video artifact reduction, we perform experiments on the benchmark dataset Vimeo-90K [21]. Our approach is implemented on the TensorFlow [43] platform. It takes 22 hours to train the whole model using two Titan X GPUs.

Fig. 4. Quantitative (PSNR/SSIM) and visual comparison of JPEG2000 artifact reduction on the Vimeo dataset for q=20 (from left to right: JPEG2000, ARCNN, ToFlow, Ours).

4.1 Experimental Setup

Dataset. The Vimeo-90K dataset [21] was recently built for evaluating different video processing tasks, such as video denoising, video super-resolution (SR) and video artifact reduction. It consists of 4,278 videos with 89,800 independent clips that are different from each other in content. All frames have a resolution of 448×256. For video compression artifact reduction, we follow [21] and use 64,612 clips for training and 7,824 clips for performance evaluation. In this section, PSNR and SSIM [44] are utilized as the evaluation metrics.

To demonstrate the effectiveness of the proposed method, we generate compressed/decoded frames through two coding settings, i.e., the HEVC codec (x265) with quantization parameters qp=32 and qp=37, and the JPEG2000 codec with quality q=20 and q=40.

Implementation Details. For model training, we use the Adam solver [45] with an initial learning rate of 0.001, β1 = 0.9 and β2 = 0.999. The learning rate is divided by 10 after every 20 epochs. We apply gradient clipping with a global norm of 0.001 to stabilize the training process. The mini-batch size is set to 32. We use the method in [46] for weight initialization. Our approach takes 0.15s to restore a color image with the size of 448×256.
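As a reference, these optimization settings map onto TensorFlow roughly as follows; `model` and `loss_fn` are placeholders, and the step-decay schedule is omitted for brevity.

```python
import tensorflow as tf

# Adam with the hyper-parameters described above.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999)

def train_step(model, loss_fn, inputs, target):
    with tf.GradientTape() as tape:
        loss = loss_fn(target, model(inputs))
    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, 0.001)  # global-norm gradient clipping
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```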

We first train the temporal mapping network using the loss $L_f$ in Eq. (13). After 40 epochs, we fix the parameters $\theta_f$ and train the linearization network using the loss $L_m$. Then we train the measurement network using the loss $L_z$ in Eq. (15). After 40 epochs, the training loss becomes stable. Finally, we fine-tune the whole model. In the following experiments, we train different models for different codecs and quality levels.


Fig. 5. Quantitative (PSNR/SSIM) and visual comparison of different methods for HEVC artifact reduction on the Vimeo dataset at qp=37 (from left to right: HEVC, ARCNN, HEVC-LF, Ours).

4.2 Experimental Results

Comparison with the State-of-the-art Methods. To demonstrate the effectiveness of our approach, we compare it with several recent image and video artifact reduction methods: ARCNN [18], DnCNN [17], V-BM4D [47] and ToFlow [21]. In addition, modern video codecs already have a default artifact reduction scheme. For example, HEVC utilizes a loop filter [1] (HEVC-LF) to reduce blocking artifacts. This technique is also included for comparison. For ARCNN [18] and DnCNN [17], we use the code provided by the authors and train their models on the Vimeo training dataset. For V-BM4D and ToFlow, we directly cite their results from [21]. The results of HEVC-LF are generated by enabling the loop filter and SAO [1] in the HEVC codec (x265). For a fair comparison with the existing approaches, we follow [21] and only evaluate the 4th frame of each clip in the Vimeo dataset. The quantitative results are reported in Table 1 and Table 2. As we can see, our proposed approach outperforms the state-of-the-art methods by more than 0.6 dB in terms of PSNR.

Table 1. Average PSNR/SSIM results on the Vimeo dataset for JPEG2000 artifact reduction (q=20, 40).

| Dataset | Setting | ARCNN [18] | DnCNN [17] | V-BM4D [47] | ToFlow [21] | Ours |
|---------|---------|------------|------------|-------------|-------------|------|
| Vimeo | q=20 | 36.11/0.960 | 37.26/0.967 | 35.75/0.959 | 36.92/0.966 | 37.93/0.971 |
| Vimeo | q=40 | 34.21/0.944 | 35.22/0.953 | 33.99/0.940 | 34.97/0.953 | 35.88/0.958 |

Table 2. Average PSNR/SSIM results on the Vimeo test sequences for HEVC artifact reduction (qp=32, 37).

| Dataset | Setting | ARCNN [18] | DnCNN [17] | HEVC-LF [1] | Ours |
|---------|---------|------------|------------|-------------|------|
| Vimeo | qp=32 | 34.87/0.954 | 35.58/0.961 | 34.19/0.950 | 35.81/0.962 |
| Vimeo | qp=37 | 32.54/0.930 | 33.01/0.936 | 31.98/0.923 | 33.23/0.939 |

Table 3. Ablation study of the proposed deep Kalman filtering method on the Vimeo-90K dataset. The results with and without using the prediction residual (PR) in the measurement network (MN) are reported in the first two rows. The results with and without using the recursive filtering (RF) scheme in the temporal mapping network (TM) are reported in the 3rd and 4th rows. Our full model is MN+PR+TM+RF (the 5th row).

| MN | PR | TM | RF | PSNR/SSIM |
|----|----|----|----|-----------|
| ✓ | | | | 37.15/0.967 |
| ✓ | ✓ | | | 37.49/0.968 |
| | | ✓ | | 37.35/0.967 |
| | | ✓ | ✓ | 37.76/0.970 |
| ✓ | ✓ | ✓ | ✓ | 37.93/0.971 |

Qualitative comparisons of ARCNN [18], ToFlow [21], HEVC-LF [1] and ours are shown in Fig. 4 and Fig. 5. In these figures, blocking artifacts exist in the JPEG2000/HEVC decoded frames; our proposed method successfully removes these artifacts while other methods still produce observable artifacts. For example, the equipment (the fourth row in Fig. 4) and the railing (the fourth row in Fig. 5) both have complex texture and structure; our method can well recover these complex regions while the other baseline methods may fail.

Ablation Study of the Measurement Network (MN). In this subsection, we investigate the effectiveness of the proposed measurement network. Note that the output of our measurement network itself can be readily used as the artifact reduction result, so the results in this subsection are obtained without using the temporal mapping network. In order to validate that the prediction residual can serve as important prior information for improving the performance, we train another model with the same architecture but without using the prediction residual (PR) as the input. Therefore, it generates restored frames by only using the decoded frames as the input. Quantitative results on the Vimeo-90K dataset are listed in Table 3. When compared with our simplified model without the prediction residual (see the 1st row), our simplified model with the prediction residual (MN+PR, see the 2nd row) boosts the performance by 0.34 dB in terms of PSNR. This demonstrates that incorporating strong prior information can improve the restoration performance.

Ablation Study of the Temporal Mapping Network (TM). We further evaluate the effectiveness of the temporal mapping network. Note that the output of our temporal mapping network itself can also be readily used for video artifact reduction, so the results in this subsection are obtained without using the measurement network. For comparison, we train another model that utilizes the same network architecture as our temporal mapping network but takes the concatenation of $X_t^c$ and $X_{t-1}^c$ as the input. Namely, it restores the current frame without considering previous restored frames. The quantitative results are reported in Table 3.

When compared with our simplified model without using recursive filtering (RF) (see the 3rd row), our simplified model with recursive filtering (TM+RF, see the 4th row) significantly improves the quality of the restored frames by 0.41 dB in terms of PSNR. A possible explanation is that our recursive filtering scheme can effectively leverage information from previous restored frames, which provides more accurate pixel information. It is worth mentioning that the result in the 5th row is the best, as we combine the outputs from both the measurement network and the temporal mapping network through the Kalman update process.

Cross Dataset Validation. The results on the HEVC standard sequences (Class D) and the MPI Sintel Flow dataset in Table 4 show that our approach performs better than the state-of-the-art methods.

Table 4. Average PSNR/SSIM results evaluated on two new datasets for video artifact reduction (JPEG2000, q=20) for cross dataset validation.

| Test Dataset | ToFlow [21] | DnCNN [17] | Ours |
|--------------|-------------|------------|------|
| HEVC Sequences | 32.37/0.948 | 33.19/0.953 | 33.83/0.958 |
| MPI Sintel dataset | 34.78/0.959 | 36.40/0.969 | 37.01/0.973 |

Comparison with the RNN Based Approach. We use a recurrent network to completely replace the Kalman filter in Fig. 2. Specifically, the same CNN architecture is used to extract features from the distorted frames at each time step, and a convolutional gated recurrent unit (GRU) module is used to restore the original image based on these features. The result of our work is 37.93 dB, which outperforms the recurrent network based method (37.10 dB). One possible explanation is that it is difficult to train the recurrent network, while our pipeline makes the network easier to learn by using the domain knowledge of the prediction residual and combining both the measurement and the prior estimation.

5 Conclusions

In this paper, we have proposed a deep Kalman filtering network for video artifact reduction. We model the video compression artifact reduction task as a Kalman filtering procedure and update the state function by learning deep neural networks. Our framework can take advantage of both the recursive nature of Kalman filtering and the representation learning ability of neural networks. Experimental results have demonstrated the superiority of our deep Kalman filtering network over the state-of-the-art methods. Our methodology can also be extended to solve other low-level computer vision tasks, such as video super-resolution or denoising, which will be studied in the future.

Acknowledgement: This work was supported by a research project from SenseTime. This work was also supported in part by the National Natural Science Foundation of China (61771306, 61521062), the Natural Science Foundation of Shanghai (18ZR1418100), the Chinese National Key S&T Special Program (2013ZX01033001-002-002), STCSM Grant 17DZ1205602, and the Shanghai Key Laboratory of Digital Media Processing and Transmissions (STCSM 18DZ2270700).

References

1. Sullivan, G.J., Ohm, J., Han, W.J., Wiegand, T.: Overview of the high efficiency video coding (HEVC) standard. TCSVT 22(12) (2012)
2. Schwarz, H., Marpe, D., Wiegand, T.: Overview of the scalable video coding extension of the H.264/AVC standard. TCSVT 17(9) (2007)
3. Lu, G., Zhang, X., Chen, L., Gao, Z.: Novel integration of frame rate up conversion and HEVC coding based on rate-distortion optimization. TIP 27(2) (2018)
4. Shen, M.Y., Kuo, C.C.J.: Review of postprocessing techniques for compression artifact removal. Journal of Visual Communication and Image Representation 9(1) (1998)
5. Reeve, H.C., Lim, J.S.: Reduction of blocking effects in image coding. Optical Engineering 23(1) (1984)
6. Jung, C., Jiao, L., Qi, H., Sun, T.: Image deblocking via sparse representation. Signal Processing: Image Communication 27(6) (2012)
7. Choi, I., Kim, S., Brown, M.S., Tai, Y.W.: A learning-based approach to reduce JPEG artifacts in image matting. In: ICCV. (2013)
8. Chang, H., Ng, M.K., Zeng, T.: Reducing artifacts in JPEG decompression via a learned dictionary. IEEE Transactions on Signal Processing 62(3) (2014)
9. Liu, X., Wu, X., Zhou, J., Zhao, D.: Data-driven sparsity-based restoration of JPEG-compressed images in dual transform-pixel domain. In: CVPR. (2015)
10. Ouyang, W., Wang, X.: Joint deep learning for pedestrian detection. In: ICCV. (2013)
11. Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., Li, H., Yang, S., Wang, Z., Loy, C.C., et al.: DeepID-Net: Deformable deep convolutional neural networks for object detection. In: CVPR. (2015)
12. Wang, L., Ouyang, W., Wang, X., Lu, H.: Visual tracking with fully convolutional networks. In: ICCV. (2015)
13. Zhao, R., Ouyang, W., Wang, X.: Unsupervised salience learning for person re-identification. In: CVPR. (2013)
14. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: ECCV. (2014)
15. Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2) (2016)
16. Tai, Y., Yang, J., Liu, X., Xu, C.: MemNet: A persistent memory network for image restoration. In: CVPR. (2017)
17. Zhang, K., Zuo, W., Chen, Y., Meng, D., Zhang, L.: Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. TIP 26(7) (2017)
18. Dong, C., Deng, Y., Change Loy, C., Tang, X.: Compression artifacts reduction by a deep convolutional network. In: ICCV. (2015)
19. Guo, J., Chao, H.: Building dual-domain representations for compression artifacts reduction. In: ECCV. (2016)
20. Galteri, L., Seidenari, L., Bertini, M., Del Bimbo, A.: Deep generative adversarial compression artifact removal. arXiv preprint arXiv:1704.02518 (2017)
21. Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. arXiv preprint arXiv:1711.09078 (2017)
22. Tao, X., Gao, H., Liao, R., Wang, J., Jia, J.: Detail-revealing deep video super-resolution. In: ICCV. (2017)

23. Liu, D., Wang, Z., Fan, Y., Liu, X., Wang, Z., Chang, S., Huang, T.: Robust video super-resolution with learned temporal dynamics. In: CVPR. (2017)
24. Caballero, J., Ledig, C., Aitken, A., Acosta, A., Totz, J., Wang, Z., Shi, W.: Real-time video super-resolution with spatio-temporal networks and motion compensation. In: CVPR. (2017)
25. Foi, A., Katkovnik, V., Egiazarian, K.: Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. TIP 16(5) (2007)
26. Zhang, X., Xiong, R., Fan, X., Ma, S., Gao, W.: Compression artifact reduction by overlapped-block transform coefficient estimation with block similarity. TIP 22(12) (2013)
27. Wang, Z., Liu, D., Chang, S., Ling, Q., Yang, Y., Huang, T.S.: D3: Deep dual-domain based fast restoration of JPEG-compressed images. In: CVPR. (2016)
28. Svoboda, P., Hradis, M., Barina, D., Zemcik, P.: Compression artifacts removal using convolutional neural networks. arXiv preprint arXiv:1605.00366 (2016)
29. Mao, X.J., Shen, C., Yang, Y.B.: Image denoising using very deep fully convolutional encoder-decoder networks with symmetric skip connections. arXiv preprint (2016)
30. Guo, J., Chao, H.: One-to-many network for visually pleasing compression artifacts reduction. In: CVPR. (2017)
31. Zhang, K., Zuo, W., Gu, S., Zhang, L.: Learning deep CNN denoiser prior for image restoration. arXiv preprint (2017)
32. Chang, J.R., Li, C.L., Poczos, B., Kumar, B.V., Sankaranarayanan, A.C.: One network to solve them all—solving linear inverse problems using deep projection models. arXiv preprint (2017)
33. Bigdeli, S.A., Zwicker, M., Favaro, P., Jin, M.: Deep mean-shift priors for image restoration. In: NIPS. (2017)
34. Liao, R., Tao, X., Li, R., Ma, Z., Jia, J.: Video super-resolution via deep draft-ensemble learning. In: ICCV. (2015)
35. Kappeler, A., Yoo, S., Dai, Q., Katsaggelos, A.K.: Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging 2(2) (2016)
36. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: NIPS. (2015)
37. Shashua, S.D.C., Mannor, S.: Deep robust Kalman filter. arXiv preprint arXiv:1703.02310 (2017)
38. Krishnan, R.G., Shalit, U., Sontag, D.: Deep Kalman filters. arXiv preprint arXiv:1511.05121 (2015)
39. Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82(1) (1960)
40. Haykin, S.S., et al.: Kalman filtering and neural networks. Wiley Online Library (2001)
41. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV. (2016)
42. Ballé, J., Laparra, V., Simoncelli, E.P.: Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281 (2015)
43. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., et al.: TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016)
44. Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP 13(4) (2004)

45. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
46. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: International Conference on Artificial Intelligence and Statistics. (2010)
47. Maggioni, M., Boracchi, G., Foi, A., Egiazarian, K.: Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. TIP 21(9) (2012)