
Neural Image Compression via Non-Local Attention Optimization and Improved Context Modeling

Tong Chen, Haojie Liu, Zhan Ma, Qiu Shen, Xun Cao, and Yao Wang

Abstract—This paper proposes a novel Non-Local Attention optimization and Improved Context modeling-based image compression (NLAIC) algorithm, which is built on top of the deep neural network (DNN)-based variational auto-encoder (VAE) structure. Our NLAIC 1) embeds non-local network operations as non-linear transforms in the encoders and decoders for both the image and the latent representation probability information (known as the hyperprior) to capture both local and global correlations, 2) applies an attention mechanism to generate masks that are used to weigh the features, which implicitly adapts the bit allocation for feature elements based on their importance, and 3) implements improved conditional entropy modeling of the latent features using joint 3D convolutional neural network (CNN)-based autoregressive contexts and hyperpriors. Towards practical application, additional enhancements are also introduced to speed up processing (e.g., parallel 3D CNN-based context prediction), reduce memory consumption (e.g., sparse non-local processing) and alleviate the implementation complexity (e.g., a unified model for variable rates without re-training). The proposed model outperforms existing methods on the Kodak and CLIC datasets, with state-of-the-art compression efficiency reported against learned and conventional (e.g., BPG, JPEG2000, JPEG) image compression methods, for both PSNR and MS-SSIM distortion metrics.

Index Terms—Non-local network, attention mechanism, conditional probability prediction, variable-rate model, end-to-end learning

I. INTRODUCTION

Light reflected from an object surface travels across the 3D environment and finally reaches the sensor plane of a camera, or the retina of our Human Visual System (HVS), as a projected 2D image representing the natural scene. Nowadays, images spread everywhere via social networking (e.g., Facebook, WeChat), professional photography sharing (e.g., flickr), online advertisement (e.g., Google Ads) and so on, mostly in compliant formats compressed using JPEG [1], JPEG2000 [2], H.264/AVC [3] or the High Efficiency Video Coding (HEVC) [4] intra-picture coding based image profile (a.k.a., BPG, https://bellard.org/bpg/), etc. A better compression method¹ is always desired, preserving higher image quality at less bits consumption. This would save file storage at the Internet scale (e.g., > 350 million images submitted and shared to Facebook per day), and enable faster and more efficient image sharing/exchange with better quality of experience (QoE).

Fundamentally, image coding/compression tries to exploit signal redundancy and represent the original pixel samples (in RGB or another color space such as YCbCr) using a compact and high-fidelity format. This is also referred to as source coding [5]. Conventional transform coding (e.g., JPEG, JPEG 2000) or hybrid transform/prediction coding (e.g., intra coding of H.264/AVC and HEVC) is utilized. Here, typical transforms are the Discrete Cosine Transform (DCT) [6], the Wavelet Transform [7], and so on. The transforms referred to here usually have fixed bases that are designed in advance, presuming knowledge of the source signal distribution. On the other hand, intra prediction usually leverages local [8] and global correlations [9] to exploit the redundancy. Since intra prediction can be expressed as a linear superimposition of causal samples, it can be treated as an alternative transform as well. Lossy compression is then achieved via applying quantization to the transform coefficients, followed by adaptive entropy coding. Thus, the typical image compression pipeline can be simply illustrated by "transform", "quantization" and "entropy coding" performed consecutively.

Instead of applying the handcrafted components of existing image compression methods, such as the DCT, scalar quantization, etc., most recently emerged machine learning based image compression algorithms [10], [11], [12] leverage the autoencoder structure, which transforms raw pixels into compressible latent features via stacked convolutional neural networks (CNNs) in a nonlinear manner [13]. These latent features are subsequently quantized and entropy coded by further exploiting the statistical redundancy. Recent works have revealed that compression efficiency can be improved by exploring the conditional probabilities via the contexts of autoregressive spatial-channel neighbors and hyperpriors [12], [14], [10], [15] for the compression of features. Typically, rate-distortion optimization (RDO) [16] is fulfilled by minimizing the Lagrangian cost J = R + λD when performing the end-to-end learning. Here, R is referred to as the entropy rate, and D is the distortion measured by either mean squared error (MSE), multiscale structural similarity (MS-SSIM) [17], or even a feature or adversarial loss [18], [19].

However, existing methods still present several limitations. For example, most of the operations, such as stacked convolutions, are performed locally with a limited receptive field, even with pyramidal decomposition. Furthermore, latent features are mostly treated with equal importance in both the spatial and channel dimensions, without considering the diverse visual sensitivities to various content at different frequencies [20]. Thus, attempts have been made in [14], [12] to exploit importance maps on top of the quantized latent feature vectors for adaptive bit allocation, but still only at the bottleneck layer. These methods usually signal the importance maps explicitly; if the importance map is not embedded explicitly, coding efficiency is slightly affected because of the probability estimation error reported in [12].

T. Chen and H. Liu contributed equally to this work.

¹Since the focus of this paper is lossy compression, for simplicity, we use "compression" to represent "lossy compression" throughout this work, unless pointed out specifically.


Fig. 1: Non-Local Attention optimization and Improved Context modeling-based image compression (NLAIC). (a) NLAIC: a variational autoencoder with embedded non-local attention optimization in the main and hyperprior encoders and decoders (i.e., E_m, E_h, D_m, and D_h). "Conv 5×5×192 s2" indicates a convolution layer using a kernel of size 5×5, 192 output channels, and a stride of 2 (in the decoders D_m and D_h, "Conv" indicates transposed convolution). NLAM represents the Non-Local Attention Modules. "Q" stands for quantization, "AE" and "AD" are arithmetic encoding and decoding, P denotes the probability model serving for entropy coding, and "k1" in the context model means a 3D convolution kernel of size 1×1×1. (b) NLAM: the main branch consists of three ResBlocks. The mask branch combines non-local modules with ResBlocks for attention mask generation. The details of the ResBlock are shown in the dashed frame. (c) Non-local network (NLN): H×W×C denotes the size of the feature maps with height H, width W and channel C. ⊕ is the add operation and ⊗ is matrix multiplication.

In this paper, our NLAIC introduces non-local processing blocks into the variational autoencoder (VAE) structure to capture both local and global correlations among pixels. An attention mechanism is also embedded to generate more compact representations for both the latent features and the hyperpriors. A simple rectified linear unit (ReLU) is applied for nonlinear activation. Different from the existing methods in [14], [12], our non-local attention masks are applied at different layers (not only to the quantized features at the bottleneck), to mask and adapt intelligently through the end-to-end learning framework. We also improve the context modeling of the entropy engine for better latent feature compression, by using a masked 3D CNN (i.e., 5×5×5) based prediction to approximate more accurate conditional statistics.

Even with coding efficiency outperforming most existing traditional image compression standards, recent learning-based methods [21], [10], [12] are still far from massive adoption in practice. For practical application, compression algorithms need to be carefully evaluated and justified by their coding performance, computational and space complexity (e.g., memory consumption), hardware implementation friendliness, etc. Few works [11], [22] have been developed in this direction towards a practical learned image compression coder. In this paper, we propose additional enhancements to simplify the proposed NLAIC, including 1) a unified network model for variable bitrates using quality scaling factors; 2) sparse non-local processing for memory reduction; and 3) parallel 3D masked CNN-based context modeling for computational throughput improvement. All of these attempts greatly reduce the space and time complexity of the proposed NLAIC, with negligible sacrifice of coding efficiency.

Our NLAIC has outperformed all existing learned and traditional image compression methods, offering state-of-the-art coding efficiency in terms of rate-distortion performance, with distortion measured by both MS-SSIM and PSNR (Peak Signal-to-Noise Ratio). Experiments have been executed using common test datasets such as Kodak [23] and the CLIC testing samples that are widely studied in [10], [21], [12]. When compared with the same JPEG anchors, our NLAIC shows BD-Rate gains of 64.39%, followed by Minnen2018 [21] at 59.84%, BPG (YCbCr 4:4:4) HM at 59.46%, Ballé2018 [10] at 56.19%, and JPEG2000 at 38.02%, respectively.

Additional ablation studies have been conducted to analyze different components of the proposed NLAIC, including the impacts of sparse non-local processing, parallel 3D context modeling, the unified multi-rate model, non-local operations, etc., on coding performance and system complexity (e.g., time and space). These investigations further demonstrate the efficiency of our NLAIC for potential practical applications.

Contributions. We highlight the novelties of this paper below:
• We are the first to introduce non-local operations into a compression framework to capture both local and global correlations among the pixels in the original image and in the latent features.
• We apply an attention mechanism together with the non-local operations to generate implicit importance masks at various layers to guide the adaptive processing. These masks essentially help to allocate more bits to the more important areas that are critical for rate-distortion efficiency.
• We employ a single-layer masked 3D CNN to exploit the spatial-channel correlations in the latent features, the output of which is then concatenated with hyperpriors to estimate the conditional statistics of the latent features, enabling more efficient entropy coding.
• We introduce simplifications of the original NLAIC, including sparse non-local processing, parallel masked 3D CNN-based context modeling, and a unified model for variable rates, to reduce computational complexity and memory storage, and to improve implementation friendliness for practical applications.

The remainder of this paper proceeds as follows: Section II gives a brief review of related works, while the proposed NLAIC is discussed in Section III, followed by the proposed complexity-reduction options in Section IV. Section V evaluates the coding efficiency of the proposed method in comparison to traditional image codecs and recently emerged learning-based approaches, while Section VI presents ablation studies that examine the impact of the various components of the proposed NLAIC framework on coding efficiency and complexity. Finally, concluding remarks and future works are described in Section VII.

II. RELATED WORK

In this section, we review prior works related to non-local operations in image/video processing, the attention mechanism, and learned image compression algorithms.

Non-local Operations. Most traditional filters (such as Gaussian and mean filters) process the data locally, using a weighted average of spatially neighboring pixels, which usually produces over-smoothed reconstructions. Classical non-local methods for image restoration problems (e.g., low-rank modeling [24], joint sparsity [25] and non-local means [26]) have shown superior efficiency for quality improvement by exploiting non-local correlations. Recently, non-local operations have been extended using DNNs for video classification [27] and image restoration (e.g., denoising, artifacts removal and super-resolution) [15], [28], etc., yielding significant reported performance improvements. It is also worth pointing out that non-local operations have been applied in intra coding, such as the intra block copy in HEVC-based screen content compression [9], which allows block search in the current frame to exploit non-local correlations.

Self Attention. The self-attention mechanism was popularized in deep learning based natural language processing (NLP) [29], [30], [31]. It can be described as a mapping strategy which queries a set of key-value pairs to produce an output. For example, Vaswani et al. [31] proposed multi-headed attention methods which are extensively used for machine translation. For low-level vision tasks [28], [14], [12], the self-attention mechanism endows the generated features with spatially adaptive activation and enables adaptive information allocation with emphasis on more challenging areas (i.e., rich textures, saliency, etc.).

In image compression, quantized attention masks are used for adaptive bit allocation; e.g., Li et al. [14] use three layers of local convolutions and Mentzer et al. [12] select one of the quantized features. Unfortunately, these methods require extra explicit signaling overhead; disabling the explicit signaling induces probability estimation errors. Our model adopts an attention mechanism close to [14], [12] but applies multiple layers of non-local as well as convolutional operations to automatically generate attention masks from the input image. The attention masks are applied to the temporary latent features directly to generate the final latent features to be coded. Thus, there is no need to use extra bits to code the masks.

Image Compression Architectures. DNN-based image compression generally relies on well-known autoencoders. One direction is based on recurrent neural networks (RNNs) (e.g., convolutional LSTM) in [32], [33], [34]. Note that these works require only a single network model for variable rates, without resorting to re-training. An explicit spatially adaptive bit rate allocation for compression control was suggested in [34] to account for content variations for better quality at the same bit rate. More advanced bit allocation schemes, such as attention-driven approaches, will be discussed later.

In another avenue, (non-recurrent) CNN-based approaches [35], [36], [11], [10], [12], [14], [37], [21], [38] have recently attracted more attention in both industry and academia.

Among them, the variational autoencoder (VAE) has proven to be an effective structure for compression, initially reported in [35]. Significant advances have been developed progressively in the main components, including non-linear transforms (such as convolutions plus generalized divisive normalization, GDN, in [35]), differentiable quantization (such as uniform noise approximated quantization, UNAQ [35], and soft-to-hard quantization [36]), and conditional entropy probability modeling following Bayesian generative rules (for example, via hyperpriors [10], and joint 2D PixelCNN-based [39] autoregressive contexts and hyperpriors [21]). Importance map-based bit rate control is studied in [14], [12] to apply more weight to important areas. In addition to the common MSE or MS-SSIM loss used in practice, we have witnessed other loss function designs in learning to improve the image quality, such as feature-based or adversarial losses [37], [40], [19].

III. NLAIC: NON-LOCAL ATTENTION OPTIMIZED IMAGE COMPRESSION

Referring to the VAE structure shown in Fig. 1(a), we can formulate the problem as

$\min J = R_{\hat{X}} + R_{\hat{Z}} + \lambda \cdot d\{Y, \hat{Y}\},$  (1)
$\hat{X} = Q\{W_{e_K} \circledast (\cdots (W_{e_2} \circledast (W_{e_1} \circledast Y)))\},$  (2)
$\hat{Y} = W_{d_K} \circledast (\cdots (W_{d_2} \circledast (W_{d_1} \circledast \hat{X}))),$  (3)
$R_{\hat{X}} = -\mathbb{E}\{\log_2 p_{\hat{X}|\hat{Z}}(\hat{x}_n | \hat{x}_{n-1}, \ldots, \hat{x}_0, \hat{Z})\},$  (4)
$R_{\hat{Z}} = -\mathbb{E}\{\log_2 p_{\hat{Z}}(\hat{z}_n)\}.$  (5)

We wish to find appropriate parameters of the transforms W_{e_k} and W_{d_k} (k ∈ [1, K]), quantization Q{} and conditional entropy coding for better compression efficiency, in an end-to-end learning fashion. The distortion d{} between the original input Y and the reconstructed image Ŷ is measured by either MSE or negative MS-SSIM in the current study; other distortion measurements, such as a feature loss [18], can be applied in this learning framework as well. The bitrate is estimated using the expected entropy of the latent and hyperprior features. For instance, the bitrate for the hyperpriors R_Ẑ is based on their own probability distribution (e.g., Eq. (5)), while the bit rate for the latent features R_X̂ is derived by exploring the probability conditioned on the distribution of both autoregressive neighbors in the feature maps (e.g., causal spatial-channel neighbors) and hyperpriors (e.g., Eq. (4)). ⊛ denotes the convolutional operations.

Figure 1(a) illustrates the detailed network structures and associated parameter settings of the five components of our NLAIC system. Our NLAIC is built on a variational autoencoder structure [10], with non-local attention modules (NLAM) as basic units in both the main and hyperprior encoder-decoder pairs (i.e., E_M, D_M, E_h and D_h). E_M with quantization Q is used to generate the quantized latent features, and D_M decodes the features into the reconstructed image. E_h and D_h are applied to provide side information ẑ about the probability distribution of the quantized latent features (known as hyperpriors), to enable efficient entropy coding. The hyperpriors, as well as the autoregressive spatial-channel neighbors of the latent features, are then processed through the conditional context model P to perform conditional probability estimation for entropy coding of the quantized latent features. The NLAM module is shown in Fig. 1(b), and explained in Sections III-A and III-B below.

A. Non-local Network Processing

Our NLAM adopts the Non-Local Network (NLN) proposed in [27] as a basic block, as shown in Figs. 1(b) and 1(c). The NLN has mainly been used for image/video processing rather than compression. This NLN computes the output at the i-th position, Y_i, using a weighted average of the transformed feature values of the input X:

$Y_i = \frac{1}{C(X)} \sum_{\forall j} f(X_i, X_j)\, g(X_j),$  (6)

where i is the location index of the output vector Y and j enumerates all accessible positions of the input X. X and Y share the same size. The function f(·) computes the correlations between X_i and X_j, and g(·) derives the representation of the input at position j. C(X) is a normalization factor used to generate the final response, set as $C(X) = \sum_{\forall j} f(X_i, X_j)$. A variety of forms of f(·) have already been discussed in [27]; in this work, we directly use the embedded Gaussian function for f(·) for simplicity, i.e.,

$f(X_i, X_j) = e^{\theta(X_i)^{\mathsf T} \phi(X_j)}.$  (7)

Here, θ(X_i) = W_θ X_i and φ(X_j) = W_φ X_j, where W_θ and W_φ denote cross-channel transforms implemented as 1×1 convolutions in our framework. The weights f(X_i, X_j) are further normalized using a softmax operation, and the operation defined in Eq. (6) can be written in matrix form [27] as

$Y = \mathrm{softmax}(X^{\mathsf T} W_\theta^{\mathsf T} W_\phi X)\, g(X).$  (8)

In addition, a residual connection can be added for better convergence, as suggested in [27] and shown in Fig. 1(c), i.e.,

$Z = W_z Y + X,$  (9)

where W_z is also a linear 1×1 convolution across all channels, and Z is the final output vector, augmenting the global and local correlations of the input X via the above NLN.
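For concreteness, the NLN of Eqs. (6)-(9) can be sketched in PyTorch as below. This is a minimal illustration of the embedded-Gaussian form following [27], not our released implementation; the C/2 channel bottleneck and all names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block (Eqs. (6)-(9))."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        # theta, phi, g: 1x1 cross-channel transforms of Eq. (7)
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        # W_z restores the channel count for the residual of Eq. (9)
        self.w_z = nn.Conv2d(inter, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2)              # (N, C/2, HW)
        phi = self.phi(x).flatten(2)                  # (N, C/2, HW)
        g = self.g(x).flatten(2)                      # (N, C/2, HW)
        # softmax-normalized pairwise weights over all positions j (Eq. (8))
        attn = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)
        # weighted average of g(X_j) for every output position i (Eq. (6))
        y = torch.bmm(g, attn.transpose(1, 2)).view(n, -1, h, w)
        return self.w_z(y) + x                        # residual (Eq. (9))
```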


Fig. 2: Visualization of attention masks generated by NLAM. Brighter means more attention. The attention masks have the same size as the latent features, with several channels; here channels 4 and 19 are picked to show the spatial and channel attention mechanism (values are in the range (0, 1)). The term "sum of attention" denotes the attention maps accumulated over all channels.

Fig. 3: 3D Masked Convolution. A 3×3×3 masked convolution exemplified for exploring contexts of spatial and channel neighbors jointly. The current pixel (purple grid cube) is predicted by the causal/processed pixels (yellow for neighbors from the previous channel, green for vertical neighbors, and blue for horizontal neighbors) in a 3D space. The unprocessed pixels (white cubes) and the current pixel are masked with zeros.

B. Non-local Attention Module (NLAM)

Note that importance (attention) maps have been studied in [14], [12] to adaptively allocate information to quantized latent features. For instance, we can give more bits to edge areas and fewer bits elsewhere, resulting in better visual quality at a similar bit rate consumption. Such adaptive allocation can be implemented using an explicit mask encapsulated in the compressed stream with additional bits. An explicit mask can be implemented with implicit signaling as well, but with prediction error-induced coding efficiency degradation, as reported in [12]. In addition, the existing mask generation methods in [14], [12], operating only at the bottleneck layer, are too simple to handle areas with more complex content characteristics.

As shown in Fig. 1(b), the entire NLAM presents three branches. The main (or feature) branch uses conventional stacked convolutional networks (e.g., three residual blocks, a.k.a. ResBlocks, in this work [41]) to generate features, while the mask branch applies the NLN, followed by another three ResBlocks, one 1×1 convolution and a non-linear sigmoid activation, to produce a joint spatial-channel attention mask M, i.e.,

$M = \mathrm{sigmoid}(F_{\rm NLN}(X)),$  (10)

where M denotes the attention mask and X is the input feature vector. F_NLN(·) represents the operations using the NLN, ResBlocks and 1×1 convolution in Fig. 1(b). This attention mask M, with elements 0 < M_k < 1, M_k ∈ ℝ, is multiplied element-wise with the corresponding elements of the feature maps from the main branch to perform adaptive processing. Another residual connection, shown in Fig. 1(b), is added for faster convergence [41].

In comparison to the existing masking methods [14], [12], our NLAM only uses attention masks implicitly. Furthermore, multiple NLAMs (e.g., two pairs of NLAMs in the main encoder-decoder, and one pair in the hyperprior encoder-decoder) are embedded, to massively exploit the non-local and local correlations at multiple scales for accurate mask generation, rather than performing mask generation only at the bottleneck layer as in [14], [12]. NLAMs at different layers are able to provide attention masks with multiple levels of granularity. A visualization of the attention masks generated by our NLAM is shown in Fig. 2. As reported in later experiments, NLAM embedded at various layers offers noticeable compression performance improvement.

Since batch normalization (BN) is sensitive to the data distribution, we avoid any BN layers and only use one ReLU in our ResBlock, justified through our experimental observations. Note that in existing learned image compression methods, particularly those with superior performance [35], [10], [21], [42], the GDN [35] activation has proven more efficient than ReLU, tanh, sigmoid, leakyReLU, etc. This may be due to the fact that GDN captures the global information across all feature channels at the same pixel location. In contrast, our NLAIC shows that a simple ReLU function works effectively without resorting to GDN, owing to the fact that the proposed NLAM captures non-local correlations efficiently.
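Putting the pieces together, a minimal sketch of the NLAM itself (Fig. 1(b), Eq. (10)) is given below, reusing the NonLocalBlock defined above. The simplified ResBlock (a single ReLU, no batch normalization) mirrors the description in the text, but the exact widths and layer counts here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simplified residual block: conv-ReLU-conv plus identity."""
    def __init__(self, c: int):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

class NLAM(nn.Module):
    """Non-Local Attention Module: main branch, mask branch, residual."""
    def __init__(self, c: int):
        super().__init__()
        self.main = nn.Sequential(ResBlock(c), ResBlock(c), ResBlock(c))
        self.mask = nn.Sequential(NonLocalBlock(c), ResBlock(c),
                                  ResBlock(c), ResBlock(c), nn.Conv2d(c, c, 1))

    def forward(self, x):
        m = torch.sigmoid(self.mask(x))   # Eq. (10): joint spatial-channel mask
        return x + self.main(x) * m       # masked features + residual connection
```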
C. Conditional Entropy Rate Modeling

The previous sections presented our novel NLAM scheme to transform the input pixels into more compact latent features. This section details the entropy rate modeling, which is critical for the overall rate-distortion efficiency.

1) Context Modeling Using Hyperpriors: Similar to [10], a non-parametric, fully factorized density model is trained for the hyperpriors ẑ, described as

$p_{\hat{z}|\psi}(\hat{z}|\psi) = \prod_i \left(p_{\hat{z}_i|\psi^{(i)}}(\psi^{(i)}) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{z}_i),$  (11)

where ψ^(i) represents the parameters of each univariate distribution p_{ẑ_i|ψ^(i)}.

For the quantized latent features x̂, each element x̂_i can be modeled as a conditional Gaussian distribution:

$p_{\hat{x}|\hat{z}}(\hat{x}|\hat{z}) = \prod_i \left(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{x}_i),$  (12)

where its μ_i and σ_i are predicted from the hyperpriors ẑ. We evaluate the bits of x̂ and ẑ using

$R_{\hat{x}} = -\sum_i \log_2\left(p_{\hat{x}_i|\hat{z}_i}(\hat{x}_i|\hat{z}_i)\right),$  (13)
$R_{\hat{z}} = -\sum_i \log_2\left(p_{\hat{z}_i|\psi^{(i)}}(\hat{z}_i|\psi^{(i)})\right).$  (14)

Usually, we take ẑ as the side information for estimating μ_i and σ_i, and ẑ only occupies a very small fraction of the bits, as revealed in Section VI and shown in Fig. 13.
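As a worked example of Eqs. (12)-(13), the likelihood of a quantized symbol under N(μ, σ²) convolved with U(−1/2, 1/2) is the Gaussian probability mass on [x̂ − 1/2, x̂ + 1/2], which a sketch can evaluate through the Gaussian CDF. The clamping values and names below are illustrative assumptions.

```python
import torch

def gaussian_rate_bits(x_hat, mu, sigma, eps=1e-9):
    """Estimated bits for quantized latents under N(mu, sigma^2) * U(-1/2, 1/2)."""
    sigma = sigma.clamp(min=1e-6)                 # keep the scale strictly positive
    dist = torch.distributions.Normal(mu, sigma)
    # Eq. (12): p(x_hat) = CDF(x_hat + 0.5) - CDF(x_hat - 0.5)
    p = dist.cdf(x_hat + 0.5) - dist.cdf(x_hat - 0.5)
    # Eq. (13): R = -sum_i log2 p(x_hat_i)
    return -torch.log2(p.clamp(min=eps)).sum()

# Usage: total estimated rate in bits for a batch of quantized feature maps.
x_hat = torch.round(3.0 * torch.randn(1, 192, 16, 16))
rate = gaussian_rate_bits(x_hat, torch.zeros_like(x_hat), torch.ones_like(x_hat))
```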

2) Context Modeling Using Joint Autoregressive Spatial-Channel Neighbors and Hyperpriors: Local image neighbors usually present high correlations. PixelCNNs and PixelRNNs [39] have been used for effective modeling of the probabilistic distribution of images using local neighbors in an autoregressive way, and this has been further extended for adaptive context modeling in compression frameworks with noticeable improvement [33]. For example, Minnen et al. [21] proposed extracting autoregressive information with a 2D 5×5 masked convolution at each feature channel. Such neighbor information is then combined with the hyperpriors using stacked 1×1 convolutions for probability estimation. The model in [21] was reported as the first learning-based method with better PSNR than BPG-YUV444 at the same bit rate.

In NLAIC, we use a 5×5×5 3D masked convolution to exploit the spatial and cross-channel correlations jointly. This 5×5×5 convolutional kernel shares the same parameters for all channels, offering reduced model complexity for implementation. For simplicity, a 3×3×3 example is shown in Fig. 3. Traditional 2D PixelCNNs in [21] need to search for a well-structured channel order to exploit the conditional probability efficiently. Instead, our proposed 3D masked convolutions implicitly capture the correlations across adjacent channels. Compared with the 2D masked CNN used in [21], our 3D CNN-based approach significantly reduces the network parameters for conditional context modeling, by enforcing the same convolutional kernels across the entire spatial-channel space. Leveraging the additional contexts from spatial-channel neighbors in an autoregressive fashion, we can obtain a better conditional Gaussian distribution to model the entropy:

$p_{\hat{x}}(\hat{x}_i|\hat{x}_1, \ldots, \hat{x}_{i-1}, \hat{z}) = \prod_i \left(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{x}_i),$  (15)

where x̂_1, x̂_2, ..., x̂_{i−1} denote the causal (and possibly reconstructed) elements prior to the current x̂_i in the feature maps. μ_i and σ_i are estimated from these causal samples and the hyperpriors ẑ.

IV. EXTENSIONS OF NLAIC FOR COMPLEXITY REDUCTION

This section introduces a series of enhancements to reduce the model complexity of our proposed NLAIC. For example, sparse NLAM is applied to reduce memory consumption, while parallel masked 3D convolution is used to reduce computational complexity. Another improvement is a unified neural model for variable bitrates via quality mapping factors, by which the model complexity for deployment is greatly alleviated.

A. Sparse NLAM

Referring to the NLN described above, a naïve implementation typically requires a large amount of space/memory to host a correlation matrix of size HW × HW, where H and W are the height and width of the input feature map. Sparsity has been widely applied in image processing by leveraging local pixel similarity. This motivates us to perform maxpooling-based sampling to reduce the size of the correlation matrix. We set a downsampling factor s to balance between coding efficiency and memory consumption, as shown in Fig. 4. Our experimental studies showed that s = 8 (i.e., downsampling the correlation matrix by a factor of 64×) achieves a good trade-off. In practice, the factor s can be chosen according to the memory constraints imposed by the underlying system.

Fig. 4: Sparse NLN. Downsampling is utilized to scale down the memory consumption of the correlation matrix.
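A minimal sketch of this sparse NLN is shown below: max-pooling the φ and g paths by a factor s shrinks the correlation matrix from HW × HW to HW × HW/s², as in Fig. 4. It assumes feature dimensions divisible by s; s = 8 follows the trade-off reported above, and the remaining details are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseNonLocalBlock(nn.Module):
    def __init__(self, channels: int, s: int = 8):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.w_z = nn.Conv2d(inter, channels, 1)
        self.pool = nn.MaxPool2d(s)        # downsample the key/value positions

    def forward(self, x):
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2)          # (N, C/2, HW)
        phi = self.pool(self.phi(x)).flatten(2)   # (N, C/2, HW/s^2)
        g = self.pool(self.g(x)).flatten(2)       # (N, C/2, HW/s^2)
        # correlation matrix is (HW x HW/s^2) instead of (HW x HW)
        attn = F.softmax(torch.bmm(theta.transpose(1, 2), phi), dim=-1)
        y = torch.bmm(g, attn.transpose(1, 2)).view(n, -1, h, w)
        return self.w_z(y) + x
```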

B. Parallel 3D Masked CNN based Context Modeling

Both 2D and 3D masked convolutions [39] can be extended to model the conditional probability distribution of the quantized latent features pixel by pixel, as in Fig. 3. Although the masked convolutions can leverage the neighboring pixels to predict the current pixel efficiently, this usually incurs a large computational penalty because of the strictly sequential pixel-by-pixel processing, keeping the compression framework far from practical application.

Recalling the examples in Fig. 3, parallelism is mainly broken by using the left (horizontal) neighbor (highlighted in blue) in the masked convolutions, due to the raster-scan processing order. With this design, completing all pixel elements in the feature maps requires H×W×C convolutions, with the computational complexity noted as O(H×W×C). Here, C represents the number of channels of the quantized latent features.

One simplification is to remove the dependency on the left neighbors, using only the vertical and channel neighbors. The convolutions for each row can then be performed in parallel by W processors, each computing H×C convolutions; theoretically, this reduces the processing time to O(H×C). We can further remove the vertical neighbors in Fig. 3, so that the convolutions for all pixels in each feature map channel can be computed simultaneously using H×W processors, reducing the processing time to O(C). Simulations in the ablation studies and in Fig. 12 show that the performance impact is negligible when using such parallel 3D convolution-based context modeling.
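To make the causality pattern of Fig. 3 and the parallel relaxations above concrete, the sketch below builds the binary kernel mask for a 3D masked convolution over a (channel, height, width) latent volume. The three mode names are our own labels, not from the paper: "full" keeps channel, vertical and left neighbors; "no_left" drops the horizontal dependency (row-parallel decoding); "channel_only" keeps only previous channels (per-channel-parallel decoding).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def causal_mask3d(k: int, mode: str = "full") -> torch.Tensor:
    """Binary (k, k, k) mask over (channel, vertical, horizontal) offsets."""
    m = torch.zeros(k, k, k)
    c = k // 2                    # center offset in every dimension
    m[:c] = 1                     # neighbors from previous channels (yellow)
    if mode in ("full", "no_left"):
        m[c, :c] = 1              # vertical neighbors in current channel (green)
    if mode == "full":
        m[c, c, :c] = 1           # left horizontal neighbors (blue)
    return m                      # current element and future ones stay zero

class MaskedConv3d(nn.Conv3d):
    """Masked 3D conv shared across all channels; input is (N, 1, C, H, W)."""
    def __init__(self, k: int = 5, mode: str = "full"):
        super().__init__(1, 1, kernel_size=k, padding=k // 2)
        self.register_buffer("mask", causal_mask3d(k, mode)[None, None])

    def forward(self, x):
        return F.conv3d(x, self.weight * self.mask, self.bias,
                        padding=self.padding)

# Context features over a latent volume of 192 channels:
ctx = MaskedConv3d(5, mode="no_left")(torch.randn(1, 1, 192, 16, 16))
```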

Fig. 6: Quality Scaling Factor. Feature maps are scaled using a scaling factor (SF) before quantization, and an inverse scaling factor (ISF) is applied to entropy-decoded elements appropriately. A fixed context model is used for entropy probability estimation.

Fig. 5: Visualization at Variable Rates. (a) Convolutional kernels at bitrates of 0.14 bpp (first row), 0.41 bpp (second row), and 1.11 bpp (last row); (b)-(d) feature maps at the corresponding bitrates.

C. A Unified Model for Variable Rates

Deep learning-based image coders are usually trained by minimizing a rate-distortion criterion, i.e., R + λD, where λ is a hyperparameter controlling the trade-off between R and D. Most studies [10], [21], [12] train different models for different target bitrates by varying λ. Using different models for different rates not only brings significant memory consumption to host/cache them, but also introduces additional model switching overhead when performing bitrate adaptation in encoder optimization. Thus, a single or unified model covering a reasonable range of bitrates is highly desirable in practice.

A few attempts have been made in pursuit of a single/unified model for variable rates without re-training for individual bitrates, such as [32], [33], [34], [43], all of which code bit-planes progressively. These methods inherently offer bit-depth scalability. Standing upon this scalability perspective, a layered network design [44] was proposed to produce fine-grain bit rates in a unified framework, but it is actually a concatenation of a base layer model trained at a low bit rate and a set of models trained for successive refinement. Recently, Dumas et al. [45] tried to learn the transform and quantization jointly in an exemplified simple network, to offer quantization-independent compression of the luminance image.

As aforementioned, adapting λ yields the best coding efficiency for a specific bitrate target. To examine how the learnt filters and the feature maps change with the bit rate, we first train a model at a high bitrate (≈ 1 bpp), and retrain additional models at a variety of bitrates (using increasingly larger λ) with this high-bitrate model as the initialization. The corresponding feature maps and convolutional kernels are presented in Fig. 5. It shows that, for models at variable rates, both kernels and feature maps keep almost the same pattern, with only the intensity of each element scaled. This suggests that we can simply apply a set of scaling factors to adapt the final feature maps to be coded, trained at a high-bitrate scenario, to other bitrate ranges, without re-training the entire model for individual rates any more. This is analogous to the scaling operations [46] used in H.264/AVC or HEVC for transformed coefficients prior to quantization.

Thus, we propose a set of quality scaling factors (s_f) that are embedded in the autoencoder, as shown in Fig. 6, for bit rate adaptation. For any input image Y, the encoder E generates the corresponding feature maps X⁰ = E(Y) at a specific bit rate, typically a high bit rate R₀. Scaling factors (s_f ∈ {a₀, b₀}, {a₁, b₁}, ..., {a_n, b_n}) are devised to linearly scale each of the n channels of the input features to the new bitrate. Thus, for feature maps X^new at the new bitrate target, we have

$\hat{B}_i = Q\{X_i^{\rm new}\} = Q\{{\rm SF}\{X_i^0\}\} = Q\{a_i \cdot X_i^0 + b_i\}.$  (16)

B̂_i is the vector of quantized elements in the i-th channel for entropy coding, and will be inversely scaled,

$\hat{X}_i = {\rm ISF}\{\hat{B}_i\} = \frac{\hat{B}_i - b_i}{a_i},$  (17)

prior to being fed into the decoder network D for the reconstructed image Ŷ = D(X̂).

Note that the entropy context model P is actively employed in learned image compression to accurately capture the probability distribution of feature map elements for rate-distortion optimization and entropy coding. Given that the elements in the feature maps are scaled and biased, we need to adjust the quantization step in the probability calculation accordingly, i.e.,

$p(\hat{x}) = \prod_i \left(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\right)(\hat{x}_i)$  (18)
$\Rightarrow \prod_i \left(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2a_i}, \tfrac{1}{2a_i})\right)(\hat{x}_i^{\rm new}).$  (19)

Though we have exemplified the quality scaling factor using an illustrative autoencoder in Fig. 6, it can be extended to other complex network structures easily. In practice, our NLAIC implements both main and hyper encoders and decoders. However, our simulations have revealed that hyperpriors only occupy a small overhead, e.g., 2%-8% (the larger number for smaller bpp), of the entire compressed bitstream. Thus, in the view of low complexity and practical application, we only apply scaling factors in the main codec, leaving the hyper codec fixed and trained at the highest bit rate.
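A minimal sketch of the per-channel scaling and inverse scaling of Eqs. (16)-(17) is given below; the (a_i, b_i) pairs stand for one learned quality-factor set, and all names are illustrative assumptions. Note that, per Eqs. (18)-(19), the matching entropy model should also shrink its quantization bin to 1/a_i.

```python
import torch

def scale_and_quantize(x0, a, b):
    """Eq. (16): B_i = Q{a_i * X_i^0 + b_i}, applied per channel i."""
    return torch.round(a.view(1, -1, 1, 1) * x0 + b.view(1, -1, 1, 1))

def inverse_scale(b_hat, a, b):
    """Eq. (17): X_i = (B_i - b_i) / a_i, before the main decoder."""
    return (b_hat - b.view(1, -1, 1, 1)) / a.view(1, -1, 1, 1)

x0 = torch.randn(1, 192, 16, 16)                  # latents of the high-rate model
a, b = torch.full((192,), 0.5), torch.zeros(192)  # one (a_i, b_i) pair per channel
x_hat = inverse_scale(scale_and_quantize(x0, a, b), a, b)
```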

V. EXPERIMENTAL STUDIES

This section presents a comprehensive performance evaluation. Training and measurement follow the common practices used by other compression algorithms [47], [2], [10], [21] for a fair comparison.

A. Training

We use the COCO [48] and CLIC [49] datasets to train our NLAIC framework. We randomly crop images into 192×192×3 patches for subsequent learning. The well-known RDO process is applied for end-to-end training at various bit rates via L = λ·d(Ŷ, Y) + R_x + R_z, where d(·) is a distortion measurement between the reconstructed image Ŷ and the original image Y. Both negative MS-SSIM and MSE are used in our work as distortion loss functions, marked as "MS-SSIM opt." and "MSE opt.", respectively. R_x and R_z represent the estimated bit rates of the latent features and the hyperpriors, respectively. Note that all components of our NLAIC are trained together. We set the learning rate (LR) for E_M, D_M, E_h, D_h and P to 3e−5 in the beginning, but for P, its LR is clipped to 1e−5 after 30 epochs. The batch size is set to 16 and the entire model is trained on 4 GPUs in parallel.
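For concreteness, a sketch of this training objective is shown below; the model interface returning the reconstruction together with the two estimated rates (in bits) is an assumption for illustration, not the exact training code.

```python
import torch

def rd_loss(y, y_hat, rate_x_bits, rate_z_bits, lam):
    """L = lambda * d(Y_hat, Y) + Rx + Rz, with MSE distortion ("MSE opt.")."""
    num_pixels = y.size(0) * y.size(2) * y.size(3)
    bpp = (rate_x_bits + rate_z_bits) / num_pixels  # rate term in bits per pixel
    mse = torch.mean((y_hat - y) ** 2)              # distortion term d(.)
    return lam * mse + bpp                          # larger lam -> higher bitrate
```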

B. Rate-Distortion Efficiency

We evaluate our NLAIC models by comparing the rate-distortion performance averaged over the publicly available Kodak dataset. Figure 7 shows the performance when distortion is measured by MS-SSIM and PSNR, respectively. MS-SSIM and PSNR are widely used in image and video compression tasks. Here, PSNR represents the pixel-level distortion while MS-SSIM describes the structural similarity. MS-SSIM is reported to offer higher correlation with human perception, especially at low bit rates [17]. As we can see, our NLAIC provides the state-of-the-art performance, with a noticeable performance gain compared with the other existing leading methods, such as Minnen2018 [21] and Ballé2018 [10].

Fig. 7: Rate-distortion Efficiency. Illustrative comparisons are given using the public Kodak data. (a) Distortion is measured by MS-SSIM (dB); here we use −10 log10(1−d) to represent raw MS-SSIM (d) in dB scale. (b) PSNR is used for distortion evaluation. (c) Numerical coding gains with JPEG as the anchor; distortion is measured by PSNR.

Objective Measurement. As shown in Fig. 7(a), using MS-SSIM for both loss and final distortion measurement, and in Fig. 7(b), using MSE as the loss function and PSNR for final distortion, NLAIC ranks first in both cases, offering the state-of-the-art coding efficiency. Figure 7(c) compares the average BD-Rate reductions of various methods over the legacy JPEG encoder. Our NLAIC model shows 64.39% and 11.97% BD-Rate [50] reduction against JPEG (4:2:0) and BPG (YCbCr 4:4:4) HM, respectively. Here, BPG HM is compiled with the HEVC HM reference software, which has slightly better performance than the x265 used in default BPG.

Subjective Evaluation. We also evaluate our method on the BSD500 [51] dataset, which is widely used in image restoration problems. Figure 8 shows the results of different image codecs at similar bit rates. Our NLAIC provides the best subjective quality at a relatively smaller bit rate. In practice, some bit rate points cannot be reached for BPG and JPEG; thus we choose the closest one to match our NLAIC bit rate.
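The BD-Rate numbers above follow the standard Bjontegaard metric [50]. For reference, a compact sketch is given below, fitting log-rate as a cubic polynomial in distortion and integrating over the overlapping quality range; a negative result means average bitrate savings of the test codec over the anchor.

```python
import numpy as np

def bd_rate(rate_anchor, q_anchor, rate_test, q_test):
    """Average bitrate difference (%) between two R-D curves [50]."""
    p_a = np.polyfit(q_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(q_test, np.log(rate_test), 3)
    lo = max(min(q_anchor), min(q_test))     # overlapping quality range
    hi = min(max(q_anchor), max(q_test))
    int_a = np.polyval(np.polyint(p_a), [lo, hi])
    int_t = np.polyval(np.polyint(p_t), [lo, hi])
    avg = ((int_t[1] - int_t[0]) - (int_a[1] - int_a[0])) / (hi - lo)
    return (np.exp(avg) - 1) * 100
```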

Fig. 8: Subjective Evaluation. Visual comparison among JPEG (4:2:0), BPG (4:4:4), NLAIC MSE opt., NLAIC MS-SSIM opt. and the original image, from left to right. Our method achieves the best visual quality, containing more texture without blocky or blurring artifacts. First example: (a) JPEG: 0.3014 bpp, PSNR 21.23, MS-SSIM 0.8504; (b) BPG: 0.3464 bpp, PSNR 24.84, MS-SSIM 0.9270; (c) NLAIC MSE opt.: 0.2929 bpp, PSNR 24.71, MS-SSIM 0.9277; (d) NLAIC MS-SSIM opt.: 0.3087 bpp, PSNR 23.57, MS-SSIM 0.9551; (e) original. Second example: (a) JPEG: 0.2127 bpp, PSNR 25.17, MS-SSIM 0.8629; (b) BPG: 0.1142 bpp, PSNR 31.97, MS-SSIM 0.9581; (c) NLAIC MSE opt.: 0.1276 bpp, PSNR 34.63, MS-SSIM 0.9738; (d) NLAIC MS-SSIM opt.: 0.1074 bpp, PSNR 32.54, MS-SSIM 0.9759; (e) original.

C. Complexity Analysis

We perform all tests on an NVIDIA P100 GPU with the PyTorch toolbox. The trained model has a size of about 262 MB. For an input image of size 512×768×3, the model requires 291.8G FLOPs for the encoder, 353.2G FLOPs for the decoder and 3.46G FLOPs for the context model. Forward encoding requires 6172 MB of running memory and takes about 438 ms. At the decoder side, the default line-by-line decoding with context models takes most of the decoding time, which can be remarkably decreased by channel-wise parallel context modeling, as discussed in Section IV-B.

VI. ABLATION STUDIES

We further analyze our NLAIC in the following aspects to understand the capability of our system in practice.

Impacts of Loss Functions. Considering that MS-SSIM loss optimized results demonstrate much lower PSNR at high bit rates in Fig. 7(a), we visualize decompressed images at a high bit rate using models optimized for PSNR and MS-SSIM loss, as shown in Fig. 9. We find that MS-SSIM loss optimized results exhibit worse details compared with PSNR loss optimized models at high bit rates. This may be due to the fact that pixel distortion becomes more significant at high bit rates, whereas structural similarity puts more weight on a fairly low bit rate range. It will be interesting to explore a better metric combining the advantages of PSNR at high bit rates and MS-SSIM at low bit rates for overall optimal efficiency.

Impacts of Contexts. To understand the contribution of the context modeling using spatial-channel neighbors, we offer an alternative implementation, referred to as the "NLAIC baseline", that only uses the hyperpriors to estimate the means and variances of the latent features (see Eq. (12)). In contrast, the default NLAIC uses both hyperpriors and previously coded pixels in the latent feature maps (see Eq. (15)).

Referring to Fig. 7(a), even our "NLAIC baseline" outperforms the existing methods when using MS-SSIM as the loss and evaluation measure. For the case that uses MSE as the loss and PSNR as the distortion measurement, the "NLAIC baseline" is slightly worse than the model in [21], which uses contexts from both hyperpriors and autoregressive neighbors jointly as our "NLAIC" does, but, for a fair comparison, better than the work [10] that only uses the hyperpriors for context modeling.

Fig. 9: Impacts of Loss Functions. Illustrative reconstruction samples from MS-SSIM and PSNR (MSE) loss optimized compression, respectively. First example: (a) MS-SSIM opt.: 0.8743 bpp, PSNR 31.49, MS-SSIM 0.9956; (b) MSE opt.: 0.8798 bpp, PSNR 35.12, MS-SSIM 0.9935; (c) original. Second example: (d) MS-SSIM opt.: 0.6056 bpp, PSNR 28.84, MS-SSIM 0.9879; (e) MSE opt.: 0.6045 bpp, PSNR 30.43, MS-SSIM 0.9815; (f) original.

Fig. 10: Impacts of NLAM. Efficiency illustration when removing NLAM components gradually and re-training the model.

In this work, we first train the "NLAIC baseline" models. To train the NLAIC model, one way is to fix the main and hyperprior encoders and decoders of the baseline model, and update only the conditional context model P. Compared with the "NLAIC baseline", such transfer learning based "NLAIC" provides a 3% bit rate reduction at the same distortion. Alternatively, we could use the baseline models as the starting point, and refine all the modules in the "NLAIC" system. In this way, "NLAIC" offers more than 9% bit rate reduction over the "NLAIC baseline" at the same quality. Thus, we choose the latter for better performance.

Impacts of NLAM. To further delineate the gain due to the newly introduced NLAM, we remove the mask branches in the NLAM pairs gradually, and retrain our framework for performance evaluation. For this study, we use the baseline context modeling (only hyperpriors) in all cases, and use MSE as the loss function and PSNR as the final distortion measurement, shown in Fig. 10. For illustrative understanding, we also provide two anchors, i.e., "Ballé2018" [10] and "NLAIC", respectively. However, to see the degradation caused by gradually removing the mask branches in NLAMs, one should compare with the NLAIC baseline curve.

Removing the mask branches of the first NLAM pair in the main encoder-decoders (referred to as "remove first") yields a PSNR drop of about 0.1 dB compared to the "NLAIC baseline" at the same bit rate. The PSNR drop is further enlarged noticeably when removing all NLAM pairs' mask branches in the main encoder-decoders (a.k.a., "remove main"). The worst performance occurs when further disabling the NLAM pair's mask branches in the hyperprior encoder-decoders, resulting in a traditional variational autoencoder without non-local characteristics exploration (i.e., "remove all").

We further compare the conditional context modeling efficiency of the model variants in Fig. 11. As we can see, with embedded NLAM and joint context modeling, our NLAIC provides more compact latent features and a smaller normalized feature prediction error, both contributing to its leading coding efficiency.

Impacts of Parallel Context Modeling. Parallel 3D CNN-based context modeling is introduced in Section IV-B to speed up the computational throughput, by removing the prediction dependency on the left, or the left and upper, neighbors. First, we found that the left neighbors have a negligible effect on the performance; in this case, we simply use contexts without left neighbors in the default NLAIC. As shown in Fig. 12, we have examined three scenarios for conditional probability modeling using 1) hyperprior only (a.k.a., NLAIC baseline), 2) hyperprior + ctx#1 (a.k.a., contexts w/ channel neighbors only), and 3) hyperprior + ctx#2 (a.k.a., contexts w/o left neighbor), leading to 6.76%, 3.73%, and ≈ 0% BD-Rate losses measured by MS-SSIM for the respective 1), 2), 3) use cases, when compared with the default NLAIC with full access to complete information for probability modeling. On the other hand, the computational complexity is greatly reduced by enforcing the parallel context modeling.

Hyperpriors ẑ. The hyperpriors ẑ make a noticeable contribution to the overall compression performance [21], [10]. Their percentage decreases as the overall bit rate increases, as shown in Fig. 13. The percentage of ẑ for the MSE loss optimized model is higher than for MS-SSIM loss optimization, but still much less than the bits consumed by the latent features.

Fig. 12: Impacts of Parallel Context Modeling. Coding efficiency illustrations for a variety of conditional probability predictions using hyperprior only (a.k.a., NLAIC baseline), hyperprior + ctx#1 (a.k.a., contexts w/ channel neighbors only), and hyperprior + ctx#2 (a.k.a., contexts w/o left neighbor).

Fig. 11: Prediction error with different models at similar bit rates. Column-wise, the plots depict, from left to right, the latent features, the predicted mean, the predicted scale, the normalized prediction error (i.e., (feature − mean)/scale), and the distribution of the normalized prediction error. Each row represents a different model (e.g., various combinations of NLAM components and context prediction). These figures show that with NLAM and joint contexts from hyperpriors and autoregressive neighbors, the latent features capture more information (indicated by a larger dynamic range), which leads to larger values for both the predicted mean and scale (standard deviation of features). Meanwhile, the distribution of the final normalized feature prediction error remains compact, which leads to the lowest bit rate.

Fig. 13: Percentage of ẑ. Bits consumed by ẑ in the entire bitstream. The percentage of ẑ for the MSE loss optimized method is noticeably higher than in the scenario using MS-SSIM loss.

Impacts of the Unified Model for Variable Rates. A unified model for variable rates is highly desired for practical application, without resorting to model re-training for individual rates. As discussed in Section IV-C, we have developed quality scaling factors to re-use models trained at the high-bit-rate scenario directly for another set of bitrates.

Since the original feature maps are shifted, scaled and quantized, the scaling and inverse scaling operations may induce data distribution variations, leading to degradation of coding efficiency at lower bitrates. Usually, coding efficiency degrades more as the bitrate distance grows. Thus, we suggest applying three models instead of a single one to avoid an unexpected performance gap at ultra-low bitrates (assuming each model is trained at a high bitrate), as shown in Fig. 14, to cover the respective low, medium and high bitrate ranges. For each bitrate range, its model is trained at the highest bitrate; note that the bitrate ranges may overlap. Simulations have revealed that our approach with three unified models for variable rates still outperforms BPG by a significant performance margin, and offers comparable efficiency to Minnen2018 [21], which requires model retraining for each bitrate. Note that the bitrate of the hyperpriors is fixed for the entire bitrate range, which causes the performance loss at low bitrates. We can further apply quality factors to the hyper encoder and decoder as well, to narrow the gap between our unified model and models that are separately trained.

VII. CONCLUDING REMARKS AND FUTURE WORKS

In this paper, we proposed a neural image compression algorithm using non-local attention optimization and improved context modeling (NLAIC).

Fig. 14: RD performance of multi-bitrate generation. Three unified models (NLAIC QF low, NLAIC QF mid, NLAIC QF high) are compared against NLAIC (MS-SSIM opt.), Minnen (2018) and BPG (YCbCr 4:4:4) HM.

Our NLAIC method achieves state-of-the-art performance for both MS-SSIM and PSNR evaluations at the same bit rate, when compared with existing image compression methods, including the well-known BPG, JPEG2000 and JPEG, as well as the most recent learning-based schemes [21], [10].

Key novelties of NLAIC are the non-local operations as non-linear transforms to capture both local and global correlations for the latent features and hyperpriors, implicit attention mechanisms to adapt more weight to salient image areas, and conditional entropy modeling using joint 3D CNN-based spatial-channel neighbors and hyperpriors. For practical applications, sparse non-local operations, parallel 3D CNN-based context modeling, and a unified model for variable rates are introduced to reduce the space, time, and implementation complexity.

For future study, we would like to extend our framework to an end-to-end video compression framework with more priors acquired from spatial and temporal information. Another interesting avenue is to further simplify the models for embedded systems, via fixed-point implementations, platform-friendly network structures, etc.

VIII. ACKNOWLEDGEMENT

We are very grateful for the constructive comments from the anonymous reviewers to improve this manuscript.

REFERENCES

[1] G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991.
[2] D. T. Lee, "JPEG 2000: Retrospective and new developments," Proceedings of the IEEE, vol. 93, no. 1, pp. 32–41, Jan 2005.
[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.
[4] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec 2012.
[5] T. Berger, "Rate-distortion theory," Wiley Encyclopedia of Telecommunications, 2003.
[6] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. 100, no. 1, pp. 90–93, 1974.
[7] B.-F. Wu and C.-F. Lin, "A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 12, pp. 1615–1628, 2005.
[8] J. Lainema, F. Bossen, W.-J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1792–1801, 2012.
[9] X. Xu, S. Liu, T. Chuang, Y. Huang, S. Lei, K. Rapaka, C. Pang, V. Seregin, Y. Wang, and M. Karczewicz, "Intra block copy in HEVC screen content coding extensions," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 4, pp. 409–419, Dec 2016.
[10] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.
[11] O. Rippel and L. Bourdev, "Real-time adaptive image compression," arXiv preprint arXiv:1705.05823, 2017.
[12] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Conditional probability models for deep image compression," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2018, p. 3.
[13] J. Ballé, "Efficient nonlinear transforms for lossy image compression," in 2018 Picture Coding Symposium (PCS). IEEE, 2018, pp. 248–252.
[14] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," arXiv preprint arXiv:1703.10553, 2017.
[15] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, "Non-local recurrent network for image restoration," in Advances in Neural Information Processing Systems, 2018, pp. 1680–1689.
[16] G. J. Sullivan, T. Wiegand et al., "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 74–90, 1998.
[17] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2. IEEE, 2003, pp. 1398–1402.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[19] C. Huang, H. Liu, T. Chen, S. Pu, Q. Shen, and Z. Ma, "Extreme image compression via multiscale autoencoders with generative adversarial optimization," 2019.
[20] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York, NY: Henry Holt and Co., Inc., 1982.
[21] D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018, pp. 10794–10803.
[22] J. Ballé, N. Johnston, and D. Minnen, "Integer networks for data compression with latent-variable models," 2018.
[23] "Kodak lossless true color image suite," http://r0k.us/graphics/kodak/, Dec. 15th 2016.
[24] S. Gu, L. Zhang, W. Zuo, and X. Feng, "Weighted nuclear norm minimization with application to image denoising," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
[25] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in 2009 IEEE 12th International Conference on Computer Vision (ICCV). IEEE, 2009, pp. 2272–2279.
[26] A. Buades, B. Coll, and J.-M. Morel, "A non-local algorithm for image denoising," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2. IEEE, 2005, pp. 60–65.
[27] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[28] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, "Residual non-local attention networks for image restoration," 2018.
[29] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[30] O. Firat, K. Cho, and Y. Bengio, "Multi-way, multilingual neural machine translation with a shared attention mechanism," arXiv preprint arXiv:1601.01073, 2016.

[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv preprint arXiv:1511.06085, 2015.
[33] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in CVPR. IEEE, 2017, pp. 5435–5443.
[34] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
[35] J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," arXiv preprint arXiv:1611.01704, 2016.
[36] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
[37] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma, "Deep image compression via end-to-end learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
[38] H. Liu, T. Chen, Q. Shen, and Z. Ma, "Practical stacked non-local attention modules for image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[39] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[40] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, "Generative adversarial networks for extreme learned image compression," arXiv preprint arXiv:1804.02958, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[42] H. Liu, T. Chen, P. Guo, Q. Shen, and Z. Ma, "Gated context model with embedded priors for deep image compression," arXiv preprint arXiv:1902.10480, 2019.
[43] Z. Zhang, Z. Chen, J. Lin, and W. Li, "Learned scalable image compression with bidirectional context disentanglement network," arXiv preprint arXiv:1812.09443, 2018.
[44] C. Jia, Z. Liu, Y. Wang, S. Ma, and W. Gao, "Layered image compression using scalable auto-encoder," arXiv preprint arXiv:1904.00553, 2019.
[45] T. Dumas, A. Roumy, and C. Guillemot, "Autoencoder based image compression: can the learning be quantization independent?" in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1188–1192.
[46] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 598–603, July 2003.
[47] Overview of JPEG, https://jpeg.org/jpeg/, 2018.
[48] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[49] Challenge on learned image compression 2018. [Online]. Available: http://www.compression.cc/challenge/
[50] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[51] The Berkeley segmentation dataset and benchmark. [Online]. Available: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/