Wavelet-Srnet: a Wavelet-Based CNN for Multi-Scale Face Super Resolution

Wavelet-SRNet: A Wavelet-based CNN for Multi-scale Face Super Resolution Huaibo Huang1,2,3, Ran He1,2,3, Zhenan Sun1,2,3 and Tieniu Tan1,2,3 1School of Engineering Science, University of Chinese Academy of Sciences 2Center for Research on Intelligent Perception and Computing, CASIA 3National Laboratory of Pattern Recognition, CASIA [email protected], rhe, znsun, tnt @nlpr.ia.ac.cn { } Abstract Most modern face super-resolution methods resort to convolutional neural networks (CNN) to infer high- resolution (HR) face images. When dealing with very low resolution (LR) images, the performance of these CNN (a) (b) (c) (d) based methods greatly degrades. Meanwhile, these methods tend to produce over-smoothed outputs and miss some textural details. To address these challenges, this paper presents a wavelet-based CNN approach that can ultra-resolve a very low resolution face image of 16 × 16 or smaller pixel- (e) (f) (g) (h) size to its larger version of multiple scaling factors (2×, 4×, 8× and even 16×) in a unified framework. Different from conventional CNN methods directly inferring HR images, our approach firstly learns to predict the LR’s corresponding series of HR’s wavelet coefficients before reconstructing HR images from them. To capture both global topology (i) (j) (k) (l) information and local texture details of human faces, we Figure 1. Illustration of wavelet decomposition and our wavelet- present a flexible and extensible convolutional neural net- based SR. Top row: (a) The original 128 128 high-resolution work with three types of loss: wavelet prediction loss, tex- face image and its (b) 1 level, (c) 2 level, (d)× 3 level, full wavelet ture loss and full-image loss. Extensive experiments demon- packet decomposition image. Middle row: (h) The 16 16 low- × resolution face image and its (g) 2 , (f) 4 , (e) 8 , upscaling strate that the proposed approach achieves more appealing × × × results both quantitatively and qualitatively than state-of- versions inferred by our network. Bottom row: similar with the middle row except the low-resolution input (l) is 8 8 pixel-size. the-art super-resolution methods. × 1. Introduction ror (MSE) loss in image space to push the outputs pixel- wise closer to the ground-truth HR images in training phase. Face super-resolution (SR), also known as face halluci- However, such approaches tend to produce blurry and over- nation, refers to reconstructing high resolution (HR) face smoothed outputs, lacking some textural details. Besides, images from their corresponding low resolution (LR) input- they seem to only work well on limited up-scaling factors s. It is significant for most face-related applications, e.g. (2× or 4×) and degrades greatly when ultra-resolving a face recognition, where captured faces are of low resolution very small input (like 16×16 or smaller). Several recent ef- and lack in essential facial details. It is a special case of forts [5, 33, 35] have been developed to deal with this issue single image super resolution and many methods have been based on convolutional neural networks. Dahl et al. [5] use proposed to address it. It is a widely known undetermined PixelCNN [27] to synthesize realistic details. Yu et al. [33] inverse problem, i.e., there are various corresponding high- investigate GAN [8] to create perceptually realistic results. resolution answers to explain a given low-resolution input. Zhu et al. [35] combine dense correspondence field estima- Most current single image super-resolution methods [2, tion with face super-resolution. However, the application 6, 14, 15, 23] depend on a pixel-wise mean squared er- of these methods in super-resolution in image space faces 1689 many problems, such as computational complexity [5], in- less of pose and occlusion variations. Experimental results stability in training [33] and poor robustness for pose and collaborate with our assumption and demonstrate that our occlusion variations [35]. Therefore, due to various prob- method can well capture both global topology information lems yet to be solved, image SR remains an open and chal- and local textural details of human faces. lenging task. Main contributions of our work can be summarized as Wavelet transform (WT) has been shown to be an effi- follows: cient and highly intuitive tool to represent and store multi- 1) A novel wavelet-based approach is proposed for resolution images [18]. It can depict the contextual and tex- CNN-based face SR problem. To the best of our knowl- tural information of an image at different levels, which mo- edge, this is the first attempt to transform single image S- tivates us to introduce WT to a CNN-based super-resolution R to wavelet coefficients prediction task in deep learning system. As illustrated in Figure 1, the approximation coef- framework - albeit many wavelet-based researches exist for ficients(i.e. the top-left patches in (b-d)) of different-level SR. wavelet packet decomposition [4] compress the face’s glob- 2) A flexible and extensible fully convolutional neural al topology information at different levels; the detail coeffi- network is presented to make the best use of wavelet trans- cients(i.e. the rest patches in (b-d)) reveal the face’s struc- form. It can apply to different input-resolution faces with ture and texture information. We assume that a high-quality multiple upscaling factors. HR image with abundant textural details and global topolo- 3) We qualitatively and quantitatively explore multi- gy information can be reconstructed via a LR image as long scale face super-resolution, especially on very low input as the corresponding wavelet coefficients are accurately pre- resolutions. Experimental results show that our proposed dicted. Hence, the task of inferring a high-resolution face is approach outperforms state-of-the-art face SR methods. transformed to predicting a series of wavelet coefficients. Emphasis on the prediction of high-frequency wavelet co- 2. Related work efficients helps recovering texture details, while constraints In general, image super-resolution methods can be on the reconstruction of low-frequency wavelet coefficients divided into three types: interpolation-based, statistics- enforces consistence on global topology information. The based [26, 31, 32] and learning-based methods [3, 9, 24]. combination of the two aspects makes the final HR results In the early years, the former two types have attracted most more photo-realistic. of attention for their computationally efficiency. However, To take full advantage of wavelet transform, we present a they are always limited to small upscaling factors. Learn- wavelet-based convolutional neural network for face super- ing based methods employ large quantities of LR/HR image resolution, which consists of three subnetworks: embed- pair data to infer missing high-frequency information and ding, wavelet prediction and reconstruction networks. The promises to break the limitations of big magnification. Re- embedding net takes the low-resolution face as an input and cently deep learning based methods [6, 14, 15, 2, 23] have represents it as a set of feature maps. The wavelet predic- been introduced into SR problem due to their powerful abil- tion net is a series of parallel individual subnetworks, each ity to learn knowledge from large database. Most of these of which aims to learn a certain wavelet coefficient using convolutional methods use MSE loss to learn the map func- the embedded features. The number of these subnetwork- tion of LR/HR image pairs, which leads to over-smooth out- s is flexible and easy to adjust on demand, which makes puts when the input resolution is very low and the magnifi- the magnification factor flexible as well. The reconstruc- cation is large. tion network is used to recover the inferred wavelet coef- Specific to face super-resolution, there have been about ficients to the expected HR image, acting as a learned ma- three ways to alleviate this problem. The first one [29, 13, trix. These three subnetworks are coordinated with three 28, 30, 35] is to exploits the specific static information of types of loss: wavelet prediction loss, texture loss and full- face images with the help of face analysis technique. Yang image loss. The wavelet prediction loss and texture loss et al. [29] estimate landmarks and facial pose before recon- correspond with the wavelet prediction subnetwork, impos- structing HR images while the accurate estimation is dif- ing constraint in wavelet domain. The full-image loss is ficult for rather small faces. Zhu et al. [35] present a u- used after the reconstruction subnetwork to add a tradition- nified framework of face super-resolution and dense corre- al MSE constraint in image space. Besides, as each wavelet spondence field estimation to recover textural details. They coefficient shares the same size with the low-resolution in- achieve state-of-the-art results for very low resolution input- put, we use a network configuration to make every feature s but fail on faces with various poses and occlusions, due to map keep the same size with the input, which reduces the the difficulty of accurate spatial prediction. difficulty of training. As our network is fully convolutional The second way [17, 33, 25, 5] is to bring in image prior and trained with simply-aligned faces, it can apply to dif- knowledge with the help of generative models. Yu et al. [33] ferent input resolutions with various magnifications, regard- propose a generative adversarial network (GAN [8]) to re- 1690 solve 16×16 pixel-size faces to its 8× larger versions. Dahl hhigh 2↓ D Along rows et al. [5] present a recursive framework based on PixelCN- hhigh 2↓ N [27] to synthesize details of 4× magnified images with Along colomns hlow 2↓ H x Along rows 8 × 8 LR inputs.

Load more