arXiv:2012.11152v1 [eess.IV] 21 Dec 2020

Explainable Machine Learning based Transform Coding for High Efficiency Intra Prediction

Na Li, Yun Zhang, Senior Member, IEEE, C.-C. Jay Kuo, Fellow, IEEE

Na Li and Yun Zhang are with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, e-mail: {na.li1, yun.zhang}@siat.ac.cn. C.-C. Jay Kuo is with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, California, USA, e-mail: [email protected].

Abstract—Machine learning techniques provide a chance to explore the coding performance potential of transforms. In this work, we propose an explainable transform based intra video coding to improve the coding efficiency. Firstly, we model the machine learning based transform design as an optimization problem of maximizing the energy compaction or decorrelation capability. The explainable machine learning based transform, i.e., the Subspace Approximation with Adjusted Bias (Saab) transform, is analyzed and compared with the mainstream Discrete Cosine Transform (DCT) on their energy compaction and decorrelation capabilities. Secondly, we propose a Saab transform based intra video coding framework with off-line Saab transform learning. Meanwhile, an intra mode dependent Saab transform is developed. Then, the Rate-Distortion (RD) gain of Saab transform based intra video coding is theoretically and experimentally analyzed in detail. Finally, three strategies of integrating the Saab transform into intra video coding are developed to improve the coding efficiency. Experimental results demonstrate that the proposed 8×8 Saab transform based intra video coding can achieve a Bjøntegaard Delta Bit Rate (BDBR) from −1.19% to −10.00%, and −3.07% on average, as compared with the mainstream 8×8 DCT based coding scheme.

Index Terms—Video Coding, Explainable Machine Learning based Transform, Intra prediction, Subspace Approximation with Adjusted Bias (Saab), Rate-Distortion Optimization (RDO), Energy Compaction, Decorrelation.

I. INTRODUCTION

VIDEO data contributes the most in the era of big data due to its realistic representation and wide application. Commercial broadcasting video resolutions are expected to be extended from 4K Ultra-High Definition (UHD) to 8K UHD in the near future. Meanwhile, High Dynamic Range (HDR), Three-Dimension (3D) and holographic videos boost the attractiveness of video by providing more realistic and immersive visual experiences. In addition, the usage of video applications is striving with the increasing number of internet or video connected devices, e.g., smartphones, surveillance cameras, drones, TVs, laptops and the Internet of Things (IoT). Along with the global increase of both video usage and video quality, the volume of video data doubles every two years, which is the bottleneck for video storage and transmission over the network. In the development of video coding standards, from MPEG-2, H.264/Advanced Video Coding (AVC) and H.265/High Efficiency Video Coding (HEVC) to the latest Versatile Video Coding (VVC) [1], the video compression ratio has doubled almost every ten years. Higher coding efficiency techniques are still highly desired.

In these standards, a hybrid video coding framework composed of predictive coding, transform coding and entropy coding has been adopted. Firstly, predictive coding removes temporal and spatial redundancies among video content by exploiting the spatial correlation among neighboring blocks and the temporal correlation among successive frames. Higher prediction accuracy leads to fewer residuals to be encoded, and thus a higher compression ratio. Predictive coding can be classified into intra and inter prediction, based on whether its reference comes from the spatial or the temporal domain. Secondly, transform coding mainly consists of transform and quantization, i.e., residuals from predictive coding are transformed to spectral coefficients in the transform domain, and the spectral coefficients are then quantized to further exploit spatial and spectral perceptual redundancies. For example, the Human Vision System (HVS) is generally more sensitive to low frequency than to high frequency signals, so larger quantization scales could be given to high frequency coefficients. Finally, entropy coding approaches the statistical property of the transformed coefficients, i.e., their entropy. Generally, it encodes symbols of higher probability with fewer bits whereas it encodes symbols of lower probability with more bits. In this paper, we focus on developing an explainable machine learning based transform coding that improves the coding efficiency in the hybrid video coding framework.
Although researchers have kept on developing video coding techniques over the past decades, there is still a gap between the increase of the global video data volume and the improvement of the video compression ratio.

In the latest three generations of video coding standards, the Karhunen-Loève Transform (KLT) is an ideal transform for energy compaction and decorrelation, which requires calculating an autocorrelation matrix for each source input block. In video coding, the autocorrelation matrix shall be encoded and transmitted along with the transformed coefficients, which brings an additional bit rate overhead; there are also cases where the autocorrelation matrices are derived and stored without transmission. The outstanding energy compaction and decorrelation capabilities of KLT attract researchers to study data-driven transforms. Dvir et al. [2] constructed a new transform from a discrete directional Laplacian system with an eigen-decomposition of the system matrix. Lan et al. [3] trained a one-dimensional KLT with patches similar to the current block found through searching reconstructed frames, which brings computational overhead. Koo et al. [4] learned non-separable secondary transforms from both the video sequences and still images of the Call for Proposals (CfP), and adopted the five transforms with the best Rate-Distortion (RD) cost in the reference encoder as the final non-separable transform. Cai et al. [5] proposed to only estimate the residual covariance as a function of the coded boundary gradient, considering that prediction is very sensitive to the accuracy of the prediction direction in image regions with sharp discontinuities. Wang et al. [6] proposed to optimize transform and quantization together with RD optimization (RDO). Zhang et al. [7] designed a highly efficient KLT based image compression algorithm. The Graph Based Transform (GBT) was proposed as a derivation of the traditional KLT in [8], which incorporates a Laplacian with structural constraints to reflect the inherent model assumptions about the video signal. Arrufat et al. [9] designed a KLT based transform for each intra prediction mode in the Mode-Dependent Directional Transform (MDDT). Takamura et al. [10] proposed a non-separable mode-dependent data-dependent transform and created offline 2D-KLT kernels for each intra prediction mode and each Transform Unit (TU) size. In these recent studies, researchers focused on generating the autocorrelation matrix for the data-dependent KLT and optimizing the data-dependent KLT with the constrained autocorrelation matrix. It is not only computationally difficult to estimate or encode the autocorrelation matrix for dynamic block residuals, but also memory consuming to store the transform kernels offline.

The Discrete Cosine Transform (DCT) performs similarly to the KLT on energy compaction when the input signal approximates a Gaussian distribution. Due to its good energy compaction capability and relatively low complexity, DCT has been widely used in video coding standards, including MPEG-1, MPEG-2, MPEG-4, H.261, H.262 and H.263. H.264/AVC and later coding standards adopted the Integer DCT (ICT) to approximate the float-point DCT calculation for lower complexity and hardware cost. The ICT is computed exactly in integer arithmetic instead of float-point computation, and multiplications are replaced with additions and shifts. Since DCT transform kernels are fixed and difficult to adapt to all video contents and modes, advanced DCT optimizations have been proposed to improve the transform coding efficiency through jointly utilizing multiple transforms and RDO [11]. For intra video coding in HEVC, an integer Discrete Sine Transform (DST) [12] was further applied to 4×4 luminance (Y) block residuals. Han et al. [13] proposed a variant of the DST named the Asymmetric DST (ADST) [13] regarding the prediction direction and boundary information.

Furthermore, due to the diversity of video contents and their distributions, multiple transforms from the DCT/DST families were jointly utilized rather than a single transform to enhance the coding efficiency. Zhao et al. [14] presented the Enhanced Multiple Transform (EMT), which selects the optimal transform from multiple candidates based on the source properties and distributions. EMT is intra mode dependent, where DCT/DST transform kernels are selected based on the intra direction. As the coding efficiency of EMT comes at the cost of higher computational complexity at the encoder side, Kammoun et al. [15] proposed an efficient pipelined hardware implementation. In 2018, EMT was simplified to Multiple Transform Selection (MTS) and adopted in VVC by the Joint Video Expert Team (JVET) [16], which consists of experts from ISO/IEC MPEG and ITU-T VCEG. Zeng et al. [17] presented a two-stage transform framework, where the coefficients produced at the first stage by all directional transforms were arranged appropriately for the secondary transform. Considering that multi-core transforms and non-separable transforms can capture diverse directional texture patterns more efficiently, EMT [14] and the Non-Separable Secondary Transform (NSST) [18] were combined to provide better coding performance in the reference software of the next generation video coding standard. Zhang et al. [19] presented a method on the Implicit-Selected Transform (IST) to improve the performance of transform coding for AVS-3. Pfaff et al. [20] applied mode dependent transforms with primary and secondary transforms to improve transform coding in HEVC. Park et al. [21] introduced fast computation methods for the N-point DCT-V and DCT-VIII. Garrido et al. [22] proposed an architecture to implement transform types for different sizes. DCT is a pre-defined and fixed transform that approaches the KLT's performance for a Gaussian distributed source. However, the Gaussian source assumption cannot always be guaranteed due to the diversity of video contents and variable block patterns, which enlarges the gap between the coding efficiency of DCT and that of the best KLT. Video content with complex textures hardly follows a Gaussian distribution. In addition, using multiple transform kernels for different source assumptions and selecting the optimal one with RDO significantly increases the coding complexity.

Machine learning based transform is a possible solution to achieve a good trade-off between the data-dependent KLT and the fixed DCT and to improve the video coding performance. Lu et al. [23] utilized a non-linear residual encoder-decoder network implemented as a Convolutional Neural Network (CNN) to replace the linear transform. The parameters of these CNNs are determined by backpropagation, which lacks algorithmic transparency and human comprehensibility, is not robust to perturbations, and is hard to improve layer by layer. To handle these problems, explainable machine learning is a possible solution to improve the interpretability, scalability and robustness of learning. Therefore, Kuo et al. [24] proposed an interpretable feedforward design, noted as the Subspace Approximation with Adjusted Bias (Saab) transform, which is statistics-centric and learned in an unsupervised manner. Motivated by the analyses of the nonlinear activation of CNNs in [25], [26], Kuo et al. [27] proposed the Subspace Approximation with Augmented Kernels (Saak), where each transform kernel is the KLT kernel augmented with its negative so as to resolve the sign confusion problem. In the Saab transform [24], the sign confusion problem is instead solved by shifting the transform input of the next layer with a relatively large bias. As an explainable machine learning method, the Saab transform [24] interprets the cascaded convolutional layers as a sequence of approximating spatial-spectral subspaces. The data-dependent, multi-stage and non-separable Saab transform was proposed to interpret CNNs for recognition tasks, and it also has good energy compaction capability. In [28], Saab transforms were learned from a video coding dataset and showed the potential to outperform DCT in energy compaction capability for variable block size residuals from intra prediction. However, the coding performance of
Saab transform for intra coding needs to be further investigated. Saab transform based intra video coding for a specific block size could be reproduced for other block sizes under the same methodology. Therefore, we mainly explore the 8×8 Saab transform for intra predicted 8×8 block residuals to illustrate the coding performance potential of the Saab transform.

In this work, we propose Saab transform based intra video coding to improve the coding efficiency. The paper is organized as follows. The Saab transform and its transform performances are analyzed in Section II. A framework of Saab transform based intra video coding and the intra mode dependent Saab transform are illustrated in Section III. Then, the RD performances and computational complexity are analyzed in Section IV. Extensive experimental results and analyses are presented in Section V. Finally, conclusions are drawn in Section VI.

Fig. 1. Diagram of learning and testing of a one-stage Saab transform.

II. EXPLAINABLE MACHINE LEARNING BASED TRANSFORM AND ANALYSIS

A. Problem Formulation

Transform in image/video compression is generally linear and aims to improve transform performances, such as energy compaction and decorrelation, for the transformed coefficients. Let x = {x_i} be an input source, which is forward transformed to an output y = {y_k} in another transform domain as

y_k = \sum_{i=0}^{K-1} a_{k,i} x_i, or y = Ax, (1)

where a_{k,i} is the transform element in the forward transform kernel A. The inverse transform from y_k to x_i is presented as

x_i = \sum_{k=0}^{K-1} u_{k,i} y_k, or x = Uy, (2)

where u_{k,i} is the transform element in the inverse transform matrix U. U is the inverse matrix of A satisfying U = A^{-1} and UA = AU = I, where I is the identity matrix. If the transform is orthogonal, which means the rows of the transform matrix form an orthogonal basis set, the inverse transform matrix satisfies U = A^{-1} = A^T.

As a machine learning based transform, A is estimated from data samples D = [d_0, ..., d_{T-1}]. Generally, transforms are learned from subspaces of data samples with different learning strategies. Given a transform set 𝒜, the optimal transform A* is selected from 𝒜 through solving the optimization problem

A* = argmax_{A_i ∈ 𝒜} M(y), (3)

where M(y) indicates a target transform performance of the transformed coefficients y. The optimal transform could be selected by maximizing the value of M(y). In video coding, M(y) can be defined as, but is not limited to, the energy compaction or decorrelation capability, which relates to the compression efficiency. For example, DCT is predefined as a general orthogonal transform for all block residuals, whereas the KLT kernel is derived by maximizing the decorrelation, which varies for each block residual. Although KLT outperforms DCT on energy compaction and decorrelation, KLT based video coding requires transmitting kernel information for each block, which causes a large overhead in bits. The Saab transform is able to learn the statistics of a large number of blocks, which is a possible solution for improving the coding performance of existing codecs.

B. Saab Transform

The Saab transform [24] is conducted as a data-dependent, multi-stage and non-separable transform in a local window to get a local spectral vector. The Saab transform is an explainable machine learning based transform, which is algorithmically transparent, human comprehensible, more robust to perturbations, and can be improved layer by layer. The diagram of testing and training the 2D one-stage Saab transform is presented in Fig. 1, where the left part is the Saab transform and the right part is learning the transform kernels.

Given an M×N dimensional input x in the space R^{M×N}, it is rearranged into a vector in lexicographic order as

x = [x_{0,0}, x_{0,1}, ..., x_{0,N-1}, x_{1,0}, x_{1,1}, ..., x_{1,N-1}, ..., x_{M-1,0}, ..., x_{M-1,N-1}]^T. (4)

Then, the output transformed coefficients from the Saab transform can be computed as

y_k = \sum_{j=0}^{K-1} a_{k,j} x_j + b_k = a_k^T x + b_k, (5)

where a_k are the transform kernels and b_k is the bias, K = M×N, k = 0, 1, ..., K−1. In the Saab transform, the DC kernel and AC kernels are composed of A_Saab = {a_k}_{k=0}^{K-1} and b = {b_k}_{k=0}^{K-1}, which are unsupervisedly learned from the training dataset D = [d_0, ..., d_{T-1}], as illustrated in the right part of Fig. 1. The number of samples in the training dataset, i.e., T, is around 60K. y is generally organized as a coefficients grid.
In the process of the forward Saab transform, for an input x in the space R^{M×N}, the DC and AC coefficients of y are computed separately as follows.

DC Coefficient: The DC coefficient is computed as y_0 = \frac{1}{\sqrt{K}} \sum_{j=0}^{K-1} x_j + b_0, where the DC kernel is a_0 = \frac{1}{\sqrt{K}}(1, ..., 1)^T and the corresponding bias is b_0 = 0.

AC Coefficients: Firstly, z' is computed as z' = x − (a_0^T x + b_0) \mathbf{1}/\|\mathbf{1}\|, where \mathbf{1} = (1, 1, ..., 1)^T is the constant unit vector. Then, the AC coefficients are computed as y_k = \sum_{j=0}^{K-1} a_{k,j} z'_j + b_k = a_k^T z' + b_k. The AC kernel a_k is the eigenvector w_k of the covariance matrix C = E\{ZZ^T\}, where Z = [z_0, ..., z_t, ..., z_{T-1}] and z = d − (a_0^T d + b_0)\mathbf{1}/\|\mathbf{1}\| are derived from the training dataset D. The corresponding bias is b_k = \max_{d \in D} \|d\|.

The inverse Saab transform process is symmetric to the forward Saab transform. Since the one-stage Saab transform is orthogonal [24], the inverse Saab transform U_Saab is simply the transpose of the forward Saab transform, noted as U_Saab = A_Saab^T. The vector y is inverse transformed into x' with x'_k = a'_k(\{y_j\}_{j=1}^{K-1} − \{b_j\}_{j=1}^{K-1}) + y_0 a_{0,k}, where a'_k = [a_{1,k}, a_{2,k}, ..., a_{K-1,k}], k = 0, 1, ..., K−1.

The multi-stage Saab transform [24] can be built by cascading multiple one-stage Saab transforms, which can be used to extract high-level recognition features. To solve the sign confusion problem for pattern recognition, the input of the next stage is shifted to be positive by the bias. It has been exploited in handwritten digit recognition and object classification [24], [?]. In this paper, we explore the potential of the Saab transform for video coding.

C. Energy Compaction and Decorrelation Capabilities of Saab Transform

In video data compression, one discipline of transform is to save bits by transforming the input x to another domain with fewer non-zero elements, which is noted as energy compaction. Therefore, the energy compaction is mathematically defined as [29]

E(y) = \frac{\sum_{k=0}^{i} y_k^2}{\sigma_x^2}, (6)

where \sigma_x^2 is the variance of the input x and i is the number of coefficients. Accordingly, we analyze the energy compaction of 8×8 transforms for video coding, i.e., KLT, DCT, the one-stage Saab transform and the two-stage Saab transform. Without losing generalization, both the one-stage Saab transform (denoted as "Saab Transform [8×8]") and the two-stage Saab transform (denoted as "Saab Transform [4×4,2×2]") were learned from over 70K 8×8 luminance (Y) block residuals of "Planar" mode collected from encoding the video sequence "FourPeople" with Quantization Parameters (QPs) ∈ {22, 27, 32, 37} in HEVC. Only one Saab transform was trained off-line and applied to transform all blocks in "Saab Transform [8×8]" and "Saab Transform [4×4,2×2]", respectively. Then, another 500 8×8 luminance (Y) block residuals were randomly selected to compute the energy compaction of KLT, DCT and these two Saab transforms.

Fig. 2. Energy compaction E(y) comparison among KLT, DCT and Saab transforms.

Fig. 2 shows the energy compaction E(y) comparison among KLT, DCT, the one-stage Saab transform and the two-stage Saab transform, from which we can draw the following two key observations: 1) KLT outperforms DCT by a large margin on energy compaction, since KLT is specified for each block while DCT is fixed and pre-defined for all blocks. DCT is designed to approach the best energy compaction performance for a source following a Gaussian distribution. In many cases, however, the Gaussian source assumption is not satisfied in video coding of various contents and settings. For video content with complex texture, which is not an exact Gaussian source, the energy compaction performance gap between DCT and the best KLT is larger than for content with smooth texture, which is closer to a Gaussian source. 2) The one-stage and two-stage 8×8 Saab transforms, which learn fixed transform kernels off-line from training data and then are applied to transform all testing blocks, perform better than DCT in energy compaction. Therefore, the 8×8 Saab transform has the potential to improve the coding performance of existing video codecs with the single choice of DCT.

Another discipline of transform in video coding is removing the redundancy or correlation of the input signals x via the transformation, i.e., decorrelation. To evaluate the decorrelation capability of a transform, we measure the decorrelation cost of the transformed coefficients y with its covariance as

C(y) = \sum_{i \neq j} |cov(y_i, y_j)| = \sum_{i \neq j} |E\{(y_i − µ_i)(y_j − µ_j)\}|, 0 ≤ i, j ≤ K−1, (7)

where cov(y_i, y_j) is the covariance between y_i and y_j, i ≠ j, and µ_i and µ_j are the means of y_i and y_j. A smaller C(y) value indicates a better decorrelation capability of a transform. The value of C(y) in the transform domain of KLT is 0, which means y_i and y_j are completely uncorrelated for i ≠ j. In other words, the redundancy among the elements y_i of the transformed coefficients is minimized to 0.
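The decorrelation cost of Eq. (7) is straightforward to estimate from a set of transformed blocks. The sketch below is our own illustration (zero means are assumed, as in the analysis that follows); it compares C(y) before and after a sample-estimated KLT:

```python
import numpy as np

# Estimate the decorrelation cost C(y) of Eq. (7): the sum of the absolute
# off-diagonal covariances of the transformed coefficients, assuming zero mean.
def decorrelation_cost(Y):
    """Y: (n_blocks, K) matrix, one transformed block per row."""
    cov = Y.T @ Y / Y.shape[0]              # E{y_i y_j} with mu_i = mu_j = 0
    return np.abs(cov - np.diag(np.diag(cov))).sum()

rng = np.random.default_rng(1)
# Correlated toy "residuals" with covariance 0.5*I + 0.5 (all pairs correlated).
L = np.linalg.cholesky(0.5 * np.eye(8) + 0.5)
X = rng.standard_normal((500, 8)) @ L.T
# A KLT estimated on the same samples diagonalizes the sample covariance,
# driving C(y) toward 0.
_, V = np.linalg.eigh(X.T @ X / X.shape[0])
Y = X @ V
assert decorrelation_cost(Y) < 1e-6 * decorrelation_cost(X)
```

The same routine, applied to DCT- or Saab-transformed residuals, reproduces the kind of comparison reported in Table I.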
TABLE I
DECORRELATION CAPABILITY C(y) COMPARISON AMONG KLT, DCT AND SAAB TRANSFORMS.

| Sequence        | QP | DCT     | KLT | Saab Transform [8×8] | Saab Transform [4×4,2×2] |
|-----------------|----|---------|-----|----------------------|--------------------------|
| BasketballDrill | 22 | 581.78  | 0   | 574.26               | 582.46                   |
|                 | 27 | 1453.53 | 0   | 1309.17              | 1357.91                  |
|                 | 32 | 3266.85 | 0   | 2691.37              | 2769.59                  |
|                 | 37 | 5886.47 | 0   | 4574.13              | 4802.21                  |
| RaceHorses      | 22 | 588.19  | 0   | 586.50               | 591.71                   |
|                 | 27 | 1385.87 | 0   | 1361.44              | 1382.41                  |
|                 | 32 | 3911.97 | 0   | 3814.71              | 3887.13                  |
|                 | 37 | 8281.33 | 0   | 7818.34              | 8148.87                  |
| FourPeople      | 22 | 358.67  | 0   | 371.54               | 381.89                   |
|                 | 27 | 1042.21 | 0   | 1010.10              | 1054.60                  |
|                 | 32 | 1348.89 | 0   | 1331.15              | 1394.65                  |
|                 | 37 | 2842.04 | 0   | 2691.33              | 2950.05                  |
| Average         |    | 2578.98 | 0   | 2344.50              | 2441.96                  |
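The one-stage Saab construction of Section II-B (DC kernel a_0, DC-removed residual z', PCA-derived AC kernels w_k, and bias b_k = max_d ||d||) can be sketched as below. This is our own minimal reimplementation for illustration, not the authors' code; all function names are ours.

```python
import numpy as np

def learn_saab(D):
    """Learn one-stage Saab kernels from training samples D of shape (T, K)."""
    T, K = D.shape
    a0 = np.ones(K) / np.sqrt(K)                 # DC kernel a_0, bias b_0 = 0
    Z = D - np.outer(D @ a0, a0)                 # z = d - (a_0^T d) a_0 (remove DC)
    C = Z.T @ Z / T                              # covariance E{zz^T}
    _, V = np.linalg.eigh(C)                     # eigenvectors, ascending order
    A_ac = V[:, ::-1].T[:K - 1]                  # K-1 AC kernels w_k, largest first
    b = np.full(K - 1, np.linalg.norm(D, axis=1).max())  # bias b_k = max_d ||d||
    return a0, A_ac, b

def saab_forward(x, a0, A_ac, b):
    y0 = a0 @ x                                  # DC coefficient
    z = x - y0 * a0                              # DC-removed residual z'
    return np.concatenate(([y0], A_ac @ z + b))  # AC: y_k = w_k^T z' + b_k

def saab_inverse(y, a0, A_ac, b):
    # One-stage Saab is orthogonal, so the inverse uses the transposed kernels.
    return y[0] * a0 + A_ac.T @ (y[1:] - b)

rng = np.random.default_rng(2)
D = rng.standard_normal((1000, 64))              # stand-in for 8x8 block residuals
a0, A_ac, b = learn_saab(D)
x = rng.standard_normal(64)
y = saab_forward(x, a0, A_ac, b)
assert np.allclose(saab_inverse(y, a0, A_ac, b), x)   # lossless roundtrip
```

Because the DC direction carries no variance after the subtraction, the top K−1 eigenvectors span the orthogonal complement of a_0, so the forward/inverse pair reconstructs x exactly.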
Fig. 3. Framework of Saab transform based intra video coding.

Experimental analyses of the decorrelation capability of the one-stage Saab transform, the two-stage Saab transform and DCT for 8×8 block residuals were performed. Saab transform kernels were learned from the three video sequences {"BasketballDrill", "RaceHorses", "FourPeople"}. For each video sequence, the value of C(y) was computed for 500 blocks of coefficients randomly selected in the transform domain of each transform. As the elements of a block residual are either negative or positive integers, the expectation of a block residual element is supposed to be zero, i.e., µ_i = µ_j = 0, in computing the decorrelation cost C(y). In Table I, the decorrelation costs in terms of Eq. (7) of KLT, the one-stage Saab transform, the two-stage Saab transform and DCT are 0, 2344.50, 2441.96 and 2578.98 on average, respectively. We can draw the following three key observations: 1) The decorrelation cost C(y) of KLT is 0, which is confirmed to be the best. 2) The Saab transform performs better than DCT on average. For a smaller QP, i.e., 22, as well as for some specific video sequences, e.g., "FourPeople", DCT performs better on decorrelation than the two-stage Saab transform, which shows that DCT is well designed for some cases, but not all of them. 3) The one-stage Saab transform is better than the two-stage Saab transform on decorrelation, mainly because increasing the number of stages in the Saab transform is designed to be beneficial for distinguishing the categories of blocks, but not necessarily for the decorrelation cost C(y). This motivates us to explore one-stage 8×8 Saab transform based video coding to improve the coding efficiency.

III. SAAB TRANSFORM BASED INTRA VIDEO CODING

In this section, the Saab transform is applied to a video codec to improve the intra video coding efficiency. Firstly, a framework of Saab transform based intra video coding is presented. Then, the intra mode dependent Saab transform is developed. Finally, three integration strategies of the Saab transform for intra video coding are proposed to improve the coding performance.

A. Framework of Saab Transform based Intra Video Coding

Fig. 3 illustrates the proposed coding framework of Saab transform based intra video coding, where A_Saab denotes the Saab transform, β denotes the quantizer and γ denotes entropy coding. The Saab transform contains a kernel learning stage and a transform stage, which consist of four key components: (a) collecting intra prediction block residuals, (b) dividing block residuals into groups based on the intra mode, (c) learning off-line a set of intra mode dependent Saab transforms, and (d) selecting the kernel based on the best intra prediction mode to perform the forward and inverse Saab transforms.

At the stage of learning the Saab transform kernels, the intra prediction block residuals D_Train = {x} are collected off-line from a conventional DCT based video encoder. Since the distribution of the residual data highly depends on the intra mode [14], all intra modes are divided into n mode sets M_i, i ∈ [0, n] and n <= 35 for HEVC. Then, the block residuals D_Train = {x} are divided into groups {g_i} according to whether their intra mode is in M_i or not. A number of Saab transform kernels {SBT_0, ..., SBT_n} are learned individually based on the intra modes in M_i and their block residual groups {g_i}; n is 23 in the proposed scheme. For example, only blocks in g_i are used to train SBT_i. Note that this is an off-line training process in which various video sequences and settings can be used to train the Saab transform kernels. The complexity of the Saab transform training is negligible to the codec if it is performed off-line. Also, the trained Saab transform kernels are transmitted only once and stored at the client side for the inverse transform in decoding.

At the stage of using the Saab transform kernels, the learned kernels {SBT_0, ..., SBT_n} are utilized to transform block residuals based on the intra mode, according to the learning schemes in Section III-B, which is different from the existing mode dependent transform selection strategies, e.g., MDDT [9]. For example, SBT_i will be used to transform the block residuals from the modes in M_i. Note that there are several cases to integrate the Saab transform into the video encoder. One is to replace the conventional DCT with the Saab transform. The other is to add the Saab transform as an alternative transform option and select the optimal one between the Saab transform A_Saab and the conventional transform A†, i.e., DCT,
by RD cost comparison. In the latter case, a signaling flag of choosing the Saab transform or DCT is required to be encoded and transmitted to the client side for decoding.

At the decoder side, if the conventional DCT is replaced with the Saab transform, then based on the intra mode in M_i, the Saab transform kernel SBT_i will be used in the inverse Saab transform to reconstruct the block residuals. Otherwise, based on the signaling flag and the intra mode in M_i, either DCT or SBT_i will be selectively used in the inverse transform.

Fig. 4. Percentage of luminance (Y) blocks coded with each intra prediction mode indexed by 0 ∼ 34.

TABLE II
TWO TRAINING STRATEGIES FOR MODE DEPENDENT SAAB TRANSFORM.

| Category | Mode ID                | Saab Learning Scheme |
|----------|------------------------|----------------------|
| A        | 2 ∼ 7, 13 ∼ 23, 29 ∼ 34 | CGSL                |
| B        | 0 ∼ 1, 8 ∼ 12, 24 ∼ 28  | FGSL                |

Fig. 5. Grouping intra prediction block residuals on the basis of intra prediction modes. Saab transforms, noted as SBT_k, 1 ≤ k ≤ 24, are applied to the corresponding block residuals.

B. Intra Mode Dependent Saab Transform

The Saab transform is a data-driven transform which is learned based on the statistical characteristics of the source input. However, the statistical characteristics of the intra prediction block residuals {x} depend on the intra prediction accuracy [5] as well as the image texture [30]. There is still significant directional information left in the intra prediction block residuals, and the residuals generated by the same intra prediction mode still have highly varying statistics. Therefore, it is necessary to learn Saab transforms based on the statistical characteristics of the angular intra modes and to develop the intra mode dependent Saab transform.

We propose to divide the intra prediction block residuals into groups g_i in terms of the intra prediction modes M_i, which are then used to learn the Saab transform kernels SBT_i. The statistical characteristics of block residuals from a single intra prediction mode are relatively easier to represent, in comparison to the case of multiple intra prediction modes. Hence, a Saab transform learned unsupervisedly from the block residuals of a single intra prediction mode, noted as Fine-Grained Saab Learning (FGSL), may have better performance than one learned from multiple intra prediction modes, noted as Coarse-Grained Saab Learning (CGSL). However, FGSL trains an SBT_i for each intra mode, 0 ≤ i < 35. It means there are 35 SBTs for each Transform Unit (TU) size in HEVC, and even more kernels shall be learned for standards beyond HEVC, which significantly increases the difficulty of codec design. In addition, the ratio of blocks for each mode is distributed unevenly. The distributions of the 35 intra prediction modes were statistically analyzed on the number of 8×8 luminance (Y) block residuals encoded by each of these modes. 100 frames of each video sequence in {"BasketballDrill", "FourPeople", "RaceHorses"} were encoded with four QPs in {22, 27, 32, 37} in HEVC. As shown in Fig. 4, block residuals of "Planar", "DC", "Horizontal" and "Vertical" and their neighboring modes have higher percentages than the rest of the modes, which indicates that these modes have higher impacts on the coding efficiency than the others. Therefore, considering the coding efficiency and the complexity of designing the codec, we train Saab transform kernels for "Planar (0)", "DC (1)", "Horizontal (10)" and "Vertical (26)" and their neighboring modes (8 ∼ 12 and 24 ∼ 28), shown as Category B in Table II, with the FGSL scheme. Saab transform kernels for the rest of the modes, shown as Category A in Table II, are trained with the CGSL scheme. In total, 24 Saab transforms for intra predicted block residuals are trained with the FGSL and CGSL schemes. In comparison to other KLT derivation based intra prediction mode dependent transform selection methods, which design an adapted transform for each of the 35 intra prediction modes, FGSL and CGSL save memory for storing the transform kernels as well as the complexity of designing them. Compared with the DCT/DST based intra prediction mode dependent multiple transforms, FGSL and CGSL are supposed to learn the statistics of the block residuals better through grouping the block residuals in terms of intra prediction modes. The details of the FGSL and CGSL schemes are:

• CGSL scheme for modes in Category A: Grouping the
intra prediction block residuals generated by the intra prediction mode indexed by mode ID i with the block residuals related to intra prediction mode ID i−1 to learn SBTk. SBTk will be applied to the block residuals of the intra prediction mode with mode ID i.
• FGSL scheme for modes in Category B: Grouping the intra prediction block residuals generated by the intra prediction mode indexed by mode ID i (i ≥ 3) with the block residuals related to the intra prediction mode indexed by mode ID i−1 to learn SBTk. Different from the CGSL scheme, the learned Saab transform SBTk will be applied to the block residuals generated by the intra prediction modes indexed by mode IDs i−1 and i. Residuals generated by "Planar" (indexed with i = 0) and "DC" (indexed with i = 1) are grouped respectively as two groups for learning and testing SBTk.

Results of the proposed grouping scheme are illustrated in Fig. 5, where 24 Saab transform kernels are learned and applied to intra prediction block residuals, noted as SBTk, 0 ≤ k ≤ 23. FGSL is adopted for block residuals generated by intra prediction modes indexed by mode ID in {0∼7, 13∼23, 29∼34}. According to FGSL, SBT0 and SBT1 are learned for the block residuals of "Planar" and "DC" separately. SBTk with k in {2∼4, 10∼15, 21∼23} are learned from block residuals generated by the intra prediction mode pairs in {(2,3), (4,5), (6,7), (13,14), (15,16), (17,18), (19,20), (21,22), (22,23), (29,30), (31,32), (33,34)}. These Saab transforms are applied to the corresponding block residuals according to FGSL, except SBT15, which is learned from the block residuals corresponding to the intra prediction mode pair (22,23) rather than (23,24) and is only applied to the block residuals generated by the intra prediction mode indexed by 23, as CGSL is utilized to learn the Saab transform for block residuals of the intra prediction mode indexed by 24. CGSL is adopted for intra prediction modes indexed by mode ID in {8∼12, 23∼28}. Regarding CGSL, SBTk with index k in {5∼9, 15∼20} are learned from block residuals generated by the intra prediction mode pairs in {(7,8), (8,9), (9,10), (10,11), (11,12), (12,13), (23,24), (24,25), (25,26), (26,27), (27,28)}. These Saab transforms are applied to the block residuals according to CGSL.

C. Integration Strategies for Saab Transform

Since the Saab transform has good performance in energy compaction and decorrelation, as analyzed in Subsection II-C, we propose three integration strategies that integrate the intra mode dependent Saab transform with DCT in the intra video codec to improve the coding efficiency. Table III shows these three integration strategies, noted as sI, sII and sIII, for the Saab transform based intra video codec. In sI, each intra prediction mode adopts either the Saab transform or DCT for transform coding. Intra prediction modes in {0∼7, 13∼23, 29∼34} utilize SBTk with index k in {0∼4, 10∼15, 21∼23} as their transforms. For the intra prediction modes around the horizontal and vertical directions, the block residuals retain more dynamic statistical characteristics of the block content than those of the other modes. Directly replacing DCT with the Saab transform for these modes obtains little or even negative RD performance gain; detailed analysis can be found in Subsection IV-B. So, the original DCT in the existing intra video codec is used for these modes. To complement the Saab transform with DCT, strategy sII in the middle row of Table III is derived, where SBTk is applied to intra prediction modes in {0∼7, 13∼23, 29∼34}. Meanwhile, the DCT is also activated. The optimal transform is selected from DCT and SBTk by choosing the one with the lower RD cost, i.e., RDO. N/A indicates there is no available SBTk for modes in {8∼12, 24∼28}, so DCT is used for them directly. Furthermore, to maximize the coding efficiency, strategy sIII is proposed, as shown in the bottom row of Table III. SBTk, 0 ≤ k ≤ 23, are applied to all 35 intra prediction modes in intra coding. Meanwhile, DCT is also activated. The optimal transform is selected from SBTk and DCT by RD cost comparison. In integration strategies sII and sIII, the optimal transform is selected with RDO from SBTk and DCT. In these cases, a 1-bit signaling flag shall be added to the bitstream for each TU, where 0/1 indicates using DCT/SBTk, respectively. Note that k in SBTk is determined by the intra mode according to the learning scheme. The RD performance gains of the Saab transform and the three integration strategies in intra video coding are analyzed in the following sections.

IV. RD PERFORMANCE AND COMPUTATIONAL COMPLEXITY OF SAAB TRANSFORM BASED INTRA CODING

In this section, the RD performance and computational complexity of Saab transform based intra video coding are analyzed theoretically and experimentally. Firstly, we analyze the RD cost of Saab transform based intra video coding, and two sufficient conditions are derived for using the Saab transform to improve the RD performance. Then, these two sufficient conditions of using SBTk are validated individually with coding experiments. Finally, the computational complexity of the one-stage Saab transform is compared with that of DCT.

A. Theoretical RD Cost Analysis on Saab Transform

The objective of video coding is to minimize the distortion D subject to the constraint that the bit rate R is lower than a target bit rate. By introducing the Lagrangian multiplier λ, the RD optimization problem in video coding can be formulated as minimizing the RD cost J(Q):

min J(Q),  J(Q) = D(Q) + λ·R(Q),    (8)

where Q is the quantization step, and D(Q) and R(Q) are the distortion and bit rate at a given Q. So, it is necessary to analyze the RD cost of Saab transform based intra coding and that of DCT based intra coding to validate its effectiveness.

The rate R(Q) and distortion D(Q) of using the Saab transform are modelled and theoretically analyzed. The bit rate R can be modelled with the entropy of the transformed coefficient y. When the transformed coefficient y is quantized with quantization step Q, the bit rate R(Q) can be modelled as its entropy minus log Q, which is [31]

R(Q) ≈ −∫_{−∞}^{+∞} f_y(y) log f_y(y) dy − log Q.    (9)
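As a concrete check of the rate model in Eq. (9), the following sketch compares the high-rate approximation h(f_y) − log Q against the exact entropy of a uniformly quantized coefficient. The zero-mean Laplacian model for y and the values of b and Q are illustrative assumptions of ours, not taken from the paper's experiments:

```python
import math

def laplace_cdf(y, b):
    """CDF of a zero-mean Laplacian distribution with scale b."""
    return 0.5 * math.exp(y / b) if y < 0 else 1.0 - 0.5 * math.exp(-y / b)

def quantized_entropy(b, Q, n_max=5000):
    """Exact entropy (in nats) of a Laplacian coefficient after uniform
    mid-tread quantization with step Q; bin n covers [(n-0.5)Q, (n+0.5)Q)."""
    H = 0.0
    for n in range(-n_max, n_max + 1):
        p = laplace_cdf((n + 0.5) * Q, b) - laplace_cdf((n - 0.5) * Q, b)
        if p > 0.0:
            H -= p * math.log(p)
    return H

b, Q = 10.0, 1.0
h_diff = 1.0 + math.log(2.0 * b)   # differential entropy of the Laplacian (nats)
R_approx = h_diff - math.log(Q)    # Eq. (9): R(Q) ~ h(f_y) - log Q
R_exact = quantized_entropy(b, Q)
print(R_approx, R_exact)
```

In the high-rate regime (Q much smaller than b) the two values agree closely; the approximation loosens as Q grows toward b.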
TABLE III
INTRA MODE DEPENDENT SAAB TRANSFORM SET {SBTk} AND INTEGRATION STRATEGIES.

Strategy sI (either DCT or SBTk, depending on the intra mode¹):
modes 0∼7: SBT indices 0 1 2 2 3 3 4 4; modes 8∼12: DCT; modes 13∼23: SBT indices 10 10 11 11 12 12 13 13 14 14 15; modes 24∼28: DCT; modes 29∼34: SBT indices 21 21 22 22 23 23.

Strategy sII (optimal transform selected from DCT and SBTk with RDO²):
modes 0∼7: SBT indices 0 1 2 2 3 3 4 4; modes 8∼12: N/A (DCT only); modes 13∼23: SBT indices 10 10 11 11 12 12 13 13 14 14 15; modes 24∼28: N/A (DCT only); modes 29∼34: SBT indices 21 21 22 22 23 23. DCT is activated for all modes.

Strategy sIII (optimal transform selected from DCT and SBTk with RDO²):
modes 0∼34: SBT indices 0 1 2 2 3 3 4 4 5 6 7 8 9 10 10 11 11 12 12 13 13 14 14 15 16 17 18 19 20 21 21 22 22 23 23. DCT is activated for all modes.

¹ Either DCT or SBTk is used depending on the intra mode.
² The optimal transform is selected from DCT and SBTk with RDO. A one-bit signaling flag is transmitted to indicate the type, where 0/1 indicates DCT/SBTk.
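Table III can be read as a per-mode lookup. The sketch below encodes that lookup; the function name `select_transform` and the candidate-list representation are our own illustrative choices, with the mode-to-index mapping taken from the sIII row of the table:

```python
# SBT index for every intra mode (0-34) under strategy sIII (Table III, bottom row).
SBT_INDEX_SIII = [0, 1, 2, 2, 3, 3, 4, 4,                      # Planar, DC, modes 2-7
                  5, 6, 7, 8, 9,                               # modes 8-12
                  10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15,  # modes 13-23
                  16, 17, 18, 19, 20,                          # modes 24-28
                  21, 21, 22, 22, 23, 23]                      # modes 29-34

# Modes near the horizontal/vertical directions: sI and sII use DCT only ("N/A").
DCT_ONLY_MODES = set(range(8, 13)) | set(range(24, 29))

def select_transform(mode, strategy):
    """Return the candidate transform list for an intra mode.

    sI  : one transform per mode, no signaling flag.
    sII : RDO between DCT and SBTk, except the DCT-only modes.
    sIII: RDO between DCT and SBTk for all 35 modes.
    """
    k = SBT_INDEX_SIII[mode]
    if strategy == "sI":
        return ["DCT"] if mode in DCT_ONLY_MODES else [("SBT", k)]
    if strategy == "sII":
        return ["DCT"] if mode in DCT_ONLY_MODES else ["DCT", ("SBT", k)]
    return ["DCT", ("SBT", k)]  # sIII
```

For strategies sII and sIII, a returned two-element list corresponds to the RDO choice plus the 1-bit signaling flag; the single-element lists need no flag.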
To analyze the transformed coefficient y output from the Saab transform, we collected 1000 8×8 luminance (Y) intra prediction block residuals generated by the "Planar" mode from encoding "FourPeople" in HEVC. Histograms of the transformed coefficients of the Saab transform at two locations of the coeffi-