
Neural Image Compression via Non-Local Attention Optimization and Improved Context Modeling

Tong Chen, Haojie Liu, Zhan Ma, Qiu Shen, Xun Cao, and Yao Wang

T. Chen and H. Liu contributed equally to this work.

arXiv:1910.06244v1 [eess.IV] 11 Oct 2019

Abstract—This paper proposes a novel Non-Local Attention optimization and Improved Context modeling-based image compression (NLAIC) algorithm, which is built on top of the deep neural network (DNN)-based variational auto-encoder (VAE) structure. Our NLAIC 1) embeds non-local network operations as non-linear transforms in the encoders and decoders of both the image and the latent representation probability information (known as the hyperprior) to capture local and global correlations, 2) applies an attention mechanism to generate masks that weigh the features, implicitly adapting the bit allocation for feature elements according to their importance, and 3) implements improved conditional entropy modeling of latent features using joint 3D convolutional neural network (CNN)-based autoregressive contexts and hyperpriors. Towards practical application, additional enhancements are also introduced to speed up processing (e.g., parallel 3D CNN-based context prediction), reduce memory consumption (e.g., sparse non-local processing) and alleviate implementation complexity (e.g., a unified model for variable rates without re-training). The proposed model outperforms existing methods on the Kodak and CLIC datasets with state-of-the-art compression efficiency, including learned and conventional (e.g., BPG, JPEG2000, JPEG) image compression methods, for both PSNR and MS-SSIM distortion metrics.

Index Terms—Non-local network, attention mechanism, conditional probability prediction, variable-rate model, end-to-end learning

I. INTRODUCTION

Light reflected from an object surface travels across the 3D environment and finally reaches the sensor plane of a camera or the retina of our Human Visual System (HVS) as a projected 2D image representing the natural scene. Nowadays, images spread everywhere via social networking (e.g., Facebook, WeChat), professional photography sharing (e.g., flickr), online advertisement (e.g., Google Ads) and so on, mostly in standard-compliant formats compressed using JPEG [1], JPEG2000 [2], H.264/AVC [3] or High Efficiency Video Coding (HEVC) [4] intra-picture coding based image profiles (a.k.a., Better Portable Graphics - BPG, https://bellard.org/bpg/), etc. A better compression method1 is always desired, preserving higher image quality while consuming fewer bits. This would save file storage at the Internet scale (e.g., > 350 million images submitted and shared to Facebook per day), and enable faster and more efficient image sharing/exchange with better quality of experience (QoE).

Fundamentally, image coding/compression tries to exploit signal redundancy and represent the original pixel samples (in RGB or another color space such as YCbCr) in a compact and high-fidelity format. This is also referred to as source coding [5]. Conventional transform coding (e.g., JPEG, JPEG 2000) or hybrid transform/prediction coding (e.g., intra coding of H.264/AVC and HEVC) is utilized. Here, typical transforms are the Discrete Cosine Transform (DCT) [6], the Wavelet Transform [7], and so on. The transforms referred to here usually have fixed bases, derived in advance presuming knowledge of the source signal distribution. On the other hand, intra prediction usually leverages local [8] and global [9] correlations to exploit redundancy. Since intra prediction can be expressed as a linear superimposition of causal samples, it can be treated as an alternative transform as well. Lossy compression is then achieved by applying quantization to the transform coefficients, followed by adaptive entropy coding. Thus, the typical image compression pipeline can be simply illustrated as "transform", "quantization" and "entropy coding", performed consecutively.

Instead of applying the handcrafted components of existing image compression methods, such as the DCT, scalar quantization, etc., recently emerged machine learning based image compression algorithms [10], [11], [12] leverage the autoencoder structure, which transforms raw pixels into compressible latent features via stacked convolutional neural networks (CNNs) in a nonlinear manner [13]. These latent features are subsequently quantized and entropy coded by further exploiting statistical redundancy. Recent works have revealed that compression efficiency can be improved by exploring conditional probabilities via the contexts of autoregressive spatial-channel neighbors and hyperpriors [12], [14], [10], [15] for the compression of features. Typically, rate-distortion optimization (RDO) [16] is fulfilled by minimizing the Lagrangian cost J = R + λD when performing the end-to-end learning. Here, R is the entropy rate, and D is the distortion measured by mean squared error (MSE), multiscale structural similarity (MS-SSIM) [17], or even a feature or adversarial loss [18], [19].

However, existing methods still present several limitations. For example, most operations, such as stacked convolutions, are performed locally with a limited receptive field, even with pyramidal decomposition. Furthermore, latent features are mostly treated with equal importance in both the spatial and channel dimensions, without considering the diverse visual sensitivities to content at different frequencies [20].
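The Lagrangian cost above can be made concrete with a small sketch. The zero-mean Gaussian probability model over unit-width quantization bins used for the rate term below is an illustrative stand-in, not the paper's exact entropy model:

```python
import math
import numpy as np

def gaussian_cdf(x, sigma=1.0):
    """CDF of a zero-mean Gaussian, via the error function."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def rate_bits(q_latents, sigma=1.0):
    """Estimated bits R for quantized latents under an illustrative prior.

    P(y_hat = q) = CDF(q + 0.5) - CDF(q - 0.5): the probability mass of a
    unit-width quantization bin centered at q.
    """
    bits = 0.0
    for q in np.ravel(q_latents):
        p = gaussian_cdf(q + 0.5, sigma) - gaussian_cdf(q - 0.5, sigma)
        bits += -math.log2(max(p, 1e-12))
    return bits

def rd_cost(orig, recon, q_latents, lam):
    """Lagrangian J = R + lambda * D, with D = MSE (MS-SSIM is another choice)."""
    R = rate_bits(q_latents)
    D = float(np.mean((orig - recon) ** 2))
    return R + lam * D
```

Larger λ penalizes distortion more heavily, pushing training toward higher-rate, higher-quality operating points; in end-to-end learning the rounding is replaced by a differentiable proxy such as additive uniform noise.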
Thus, attempts have been made in [14], [12] to exploit importance maps on top of the quantized latent feature vectors for adaptive bit allocation, but still only at the bottleneck layer. These methods usually signal the importance maps explicitly. If the importance map is not embedded explicitly, coding efficiency is slightly affected because of the probability estimation error reported in [12].

1 Since the focus of this paper is lossy compression, for simplicity, we use "compression" to represent "lossy compression" throughout this work, unless pointed out specifically.

In this paper, our NLAIC introduces non-local processing blocks into the variational autoencoder (VAE) structure to capture both local and global correlations among pixels. An attention mechanism is also embedded to generate more compact representations for both latent features and hyperpriors. The simple rectified linear unit (ReLU) is applied for nonlinear activation. Different from the existing methods in [14], [12], our non-local attention masks are applied at different layers (not only to the quantized features at the bottleneck), to mask and adapt features intelligently through the end-to-end learning framework. We also improve the context modeling of the entropy engine for better latent feature compression, by using a masked 3D CNN (i.e., 5×5×5) based prediction to approximate more accurate conditional statistics.

Even with coding efficiency outperforming most existing traditional image compression standards, recent learning-based methods [21], [10], [12] are still far from massive adoption in practice. For practical application, compression algorithms need to be carefully evaluated and justified by their coding performance, computational and space complexity (e.g., memory consumption), hardware implementation friendliness, etc. Little research [11], [22] has been conducted along this line towards a practical learned image compression coder. In this paper, we propose additional enhancements to simplify the proposed NLAIC, including 1) a unified network model for variable bitrates using quality scaling factors; 2) sparse non-local processing for memory reduction; and 3) parallel 3D masked CNN based context modeling for computational throughput improvement. All of these attempts greatly reduce the space and time complexity of the proposed NLAIC, with negligible sacrifice of coding efficiency.

Fig. 1: Non-Local Attention optimization and Improved Context modeling-based image compression - NLAIC. (a) NLAIC: a variational autoencoder with embedded non-local attention optimization in the main and hyperprior encoders and decoders (i.e., Em, Eh, Dm, and Dh). "Conv 5×5×192 s2" indicates a convolution layer using a kernel of size 5×5, 192 output channels, and a stride of 2 (in decoders Dm and Dh, "Conv" indicates transposed convolution). NLAM represents the Non-Local Attention Modules. "Q" is quantization; "AE" and "AD" are arithmetic encoding and decoding; P denotes the probability model serving the arithmetic coder; "k1" in the context model means a 3D conv kernel of size 1×1×1. (b) NLAM: the main branch consists of three ResBlocks; the mask branch combines non-local modules with ResBlocks for attention mask generation; the details of a ResBlock are shown in the dashed frame. (c) Non-local network (NLN): H×W×C denotes the size of feature maps with height H, width W and channel C. ⊕ is the add operation and ⊗ is matrix multiplication.

II. RELATED WORK

In this section, we review prior works related to non-local operations in image/video processing, the attention mechanism, and learned image compression algorithms.
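The non-local network of Fig. 1(c) can be sketched in NumPy, with the 1×1 convolutions reduced to plain matrix products over flattened pixel positions; the embedded-Gaussian pairwise function (softmax of θφᵀ) and the C → C/2 weight shapes are illustrative assumptions following the common non-local formulation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """Non-local operation on a feature map x of shape (H, W, C).

    theta/phi/g play the role of 1x1 convolutions (here: matrix products)
    mapping C -> C/2. Every output position aggregates information from ALL
    HxW positions, giving the global receptive field that stacked small
    convolutions lack; a residual connection preserves the local signal.
    """
    H, W, C = x.shape
    flat = x.reshape(H * W, C)                # treat pixel positions as tokens
    theta, phi, g = flat @ w_theta, flat @ w_phi, flat @ w_g
    attn = softmax(theta @ phi.T, axis=-1)    # (HW, HW) pairwise weights
    y = (attn @ g) @ w_z                      # aggregate, then map C/2 -> C
    return x + y.reshape(H, W, C)             # residual add (the "+" in Fig. 1(c))
```

In the NLAM of Fig. 1(b), the output of such a block feeds ResBlocks and a sigmoid to produce an attention mask in (0, 1) that element-wise scales the main-branch features.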
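The masked 5×5×5 3D convolution used for context modeling needs a causality constraint so that each latent element is predicted only from already-decoded neighbors. A minimal sketch of such a kernel mask, assuming raster-scan decoding order over (channel, height, width):

```python
import numpy as np

def causal_mask_3d(k=5):
    """Binary mask for a k x k x k conv kernel over (channel, height, width).

    Positions strictly before the kernel center in raster-scan order are 1
    (usable context); the center and all later positions are 0, so the
    prediction never sees the element currently being coded or any future
    element. Multiplying the kernel weights by this mask before convolution
    yields the masked 3D CNN context model.
    """
    mask = np.zeros((k, k, k), dtype=np.float32)
    c = k // 2
    for ch in range(k):
        for h in range(k):
            for w in range(k):
                if (ch, h, w) < (c, c, c):  # lexicographic raster order
                    mask[ch, h, w] = 1.0
    return mask
```

Because the mask only exposes previously decoded elements, the same convolution can be evaluated for many positions at once at encode time, which is what enables the parallel context prediction mentioned above.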
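The unified variable-rate model mentioned above avoids re-training one network per bitrate by rescaling latents with quality scaling factors before quantization. A toy version of the idea (the paper's exact scaling scheme may differ, e.g., using learned per-channel factors):

```python
import numpy as np

def encode(latents, s):
    """Quantize scaled latents; larger s means finer effective step size."""
    return np.round(latents * s)

def decode(q, s):
    """Undo the scaling after (simulated) entropy decoding."""
    return q / s

y = np.array([0.26, -1.13, 3.07])
coarse = decode(encode(y, 1.0), 1.0)  # low-rate operating point
fine = decode(encode(y, 8.0), 8.0)    # high-rate operating point
```

A single trained model can thus traverse the rate-distortion curve at inference time simply by changing s, trading reconstruction error against the entropy of the quantized symbols.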