Multiple Network Fusion with Low-Rank Representation for Image-Based Age Estimation

applied sciences Article Multiple Network Fusion with Low-Rank Representation for Image-Based Age Estimation Chaoqun Hong *,† , Zhiqiang Zeng, Xiaodong Wang and Weiwei Zhuang School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China; [email protected] (Z.Z.); [email protected] (X.W.); [email protected] (W.Z.) * Correspondence: [email protected]; Tel.: +86-592-6291390 † Current address: Ligong Road #600, Houxi Town, Jimei District, Xiamen 361024, Fujian Province, China. Received: 9 August 2018; Accepted: 3 September 2018; Published: 10 September 2018 Featured Application: The proposed method is used in biometric feature recognition of people. Abstract: Image-based age estimation is a challenging task since there are ambiguities between the apparent age of face images and the actual ages of people. Therefore, data-driven methods are popular. To improve data utilization and estimation performance, we propose an image-based age estimation method. Theoretically speaking, the key idea of the proposed method is to integrate multi-modal features of face images. In order to achieve it, we propose a multi-modal learning framework, which is called Multiple Network Fusion with Low-Rank Representation (MNF-LRR). In this process, different deep neural network (DNN) structures, such as autoencoders, Convolutional Neural Networks (CNNs), Recursive Neural Networks (RNNs), and so on, can be used to extract semantic information of facial images. The outputs of these neural networks are then represented in a low-rank feature space. In this way, feature fusion is obtained in this space, and robust multi-modal image features can be computed. An experimental evaluation is conducted on two challenging face datasets for image-based age estimation extracted from the Internet Move Database (IMDB) and Wikipedia (WIKI). The results show the effectiveness of the proposed MNF-LRR. Keywords: age estimation; multi-modal features; deep learning; low-rank representation 1. Introduction Image-based age estimation tries to compute the age or age group with facial images. It can be widely used in many applications such as biometric feature recognition, human–computer interaction (HCI), and so on. Although a number of studies have been conducted [1–3], image-based age estimation is still a challenging task due to the following aspects. First, it often lacks sufficient training samples since each person may be captured by several images in a wide range of ages. Second, facial appearance may not indicate the age accurately since some people may look younger than they actually are and some people may look older. Third, facial images are often captured in wild conditions so they are influenced by large variations such as occlusion, lighting, shadow, and complex backgrounds. Similar to many other applications of computer vision, most existing image-based age estimation approaches focus on two key stages: feature description and feature mapping. Feature description tries to represent facial images without losing details. Traditional methods usually uses texture features or shape features, such as the active appearance model (AAM) [4], holistic subspace features [5,6], local binary patterns (LBPs) [7], Gabor wavelets [4], bio-inspired features (BIFs) [8], and so on. However, most of them make use of hand-crafted features. In this way, strong prior knowledge is required. To solve this problem, learning-based feature descriptors [9,10] have been proposed to compute descriptive features directly from images. Recently, neural networks have Appl. Sci. 2018, 8, 1601; doi:10.3390/app8091601 www.mdpi.com/journal/applsci Appl. Sci. 2018, 8, 1601 2 of 13 been efficient in exploring descriptive representations in natural images, such as autoencoders [11], Convolutional Neural Networks (CNNs) [12], and so on. Among these methods, Liu et al. proposed group-aware deep feature learning (GA-DFL) to estimate ages with facial images [13]. Different from most previous methods using hand-crafted features for facial image description, GA-DFL uses a deep CNN framework to compute a discriminative feature descriptor per image automatically from raw pixels of facial images. Although a large number of feature descriptors have been proposed, most of them can only describe a part of the information inherent in images. Therefore, researchers look into representing images with multiple features. Traditional methods make use of multiple features by directly concatenating them, which is oversimplified. To solve this problem, researchers also apply manifold learning to combine different types of features [14,15]. On the other hand, feature mapping tries to learn the mapping relationship from face images to age labels. With descriptive representations of facial images, age estimation is usually considered as a regression or classification problem [5,16]. Linear regression and twin Gaussian processes are also used for pose estimation [17,18]. Tian et al. proposed conducting age estimation by taking both ordinality and locality into consideration [19]. Previous approaches made an over-simplified assumption that the mapping from images into poses is linear. To tackle this nonlinear issue, methods based on deep learning have been applied. They can train a series of nonlinear mapping models [20–24]. However, these models cannot explicitly define the ordinal relationship between facial images and chronological ages, because they usually suffer from insufficient and unbalanced training data. In this way, they still cannot be used in practical scenarios. Although many methods for age estimation with images have been proposed, they usually use only a single type of feature. Even with popular neural networks, they apply only a single structure of neural networks, which still suffers the so-called “semantic gap”. Currently, multiple types of features have been used in many applications. Inspired by it, we proposed a Multiple Network Fusion with Low-Rank Representation (MNF-LRR) for the age estimation method. The contributions of this paper can be summarized by the following: • The first and key contribution is a novel framework that estimates ages with a single image by fusing multiple deep neural networks. This framework is flexible and the hidden representations are computed independently. In this way, different types of neural networks, different network structures, and different features can be used in this framework. • The second contribution of the proposed method is multiple-network fusion with low-rank learning. Low-rank representation is naturally sparse. Besides, different types of features are extracted by different networks and their distributions can be observed clearly. To improve traditional low-rank learning, we introduce a hypergraph manifold. In this way, samples can be represented in a unified low-rank space and the process of fusion can be achieved in this space. • The third contribution is that the performance of the proposed method is verified on datasets from the Internet Movie Database (IMDB) and Wikipedia (WIKI). They are challenging datasets since the images are collected in natural scenarios and not all of the faces are frontal. The performance on this dataset indicates that the proposed MNF-LRR is suitable for practical and complicated applications. 2. Multiple Network Learning with Low-Rank Representation 2.1. Overview of the Proposed Method The process of the proposed method (MNF-LRR) can be summarized in Figure1. To get rid of the influences of background, we should extract faces in images first. This process depends on the definitions of different datasets. In some datasets, such as IMDB and WIKI, the positions and sizes are provided and they can be used directly. However, in some other datasets or real scenarios, we need face detection or face tracking to determine the face area. We then utilize different networks to extract deep features of facial images. Finally, we use manifold learning based on low-rank representation Appl. Sci. 2018, 8, 1601 3 of 13 to integrate the outputs of these networks. In this way, a unified multi-modal representation can be obtained. Figure 1. The flowchart of the proposed method (Multiple Network Fusion with Low-Rank Representation (MNF-LRR)). 2.2. Definitions In age estimation with regression, given a set of images X = fx1, x2, ..., xng and the corresponding labels Y = fy1, y2, ..., yng with n pairs of samples, we try to learn a model that minimizes the loss: argmin jY − F(X)j (1) d where F is the regression function, d is the regression parameter, and X is the feature representation of X. Therefore, to minimize Equation (1), we need a descriptive X and a reasonable F. In the proposed method, we focus on X. 2.3. Multiple Network Learning As mentioned in the introduction, multiple feature fusion has been proved to be effective in image representation. In this way, to compute X, we propose feature learning by fusing multiple neural networks, which compute features with different neural networks and integrate them to form new features. Neural networks [25,26] have been widely used to explore hidden representations of images and the effectiveness has been proved. Generally speaking, neural networks compute hidden representation by minimizing the loss function: n 2 ∑ k xi − xi k (2) i where xi = Wxi is the hidden representation by mapping xi with weight W. The key to neural networks is optimizing W, which is defined differently by different neural networks. However, they depend on a large number of training data. Usually, age estimation with a single image is achieved with insufficient training samples or classification information. Therefore, we adopt different types of neural networks to extract different types of features and fuse them to improve the descriptive power with a small number of training samples. In MNF-LRR, we use the following neural networks to represent face images. Appl. Sci. 2018, 8, 1601 4 of 13 • Autoencoders (AE). Autoencoders are unsupervised to learn the hidden representation. To solve Equation (2), people usually use denoising autoencoders (DAE). In DAE, inputs x1, ..., xn are corrupted by randomly removing some features.

Multiple Network Fusion with Low-Rank Representation for Image-Based Age Estimation

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support