3D human tongue reconstruction from single “in-the-wild” images
Stylianos Ploumpis1,2 * Stylianos Moschoglou1,2 * Vasileios Triantafyllou2 Stefanos Zafeiriou1,2 1Imperial College London, UK 2Huawei Technologies Co. Ltd 1{s.ploumpis,s.moschoglou,s.zafeiriou}@imperial.ac.uk 2{vasilios.triantafyllou}@huawei.com
Figure 1. We propose a framework that accurately derives the 3D tongue shape from single images. A highly detailed 3D point cloud of the tongue surface and a full head topology along with the tongue expression can be estimated from the image domain. As we demonstrate, our framework is able to capture the tongue shape even in adverse “in-the-wild” conditions.
arXiv:2106.12302v1 [cs.CV] 23 Jun 2021

Abstract

3D face reconstruction from a single image is a task that has garnered increased interest in the Computer Vision community, especially due to its broad use in a number of applications such as realistic 3D avatar creation, pose invariant face recognition and face hallucination. Since the introduction of the 3D Morphable Model in the late 90’s, we have witnessed an explosion of research aimed at tackling this task. Nevertheless, despite the increasing level of detail in 3D face reconstructions from single images, mainly attributed to deep learning advances, finer and highly deformable components of the face such as the tongue are still absent from all 3D face models in the literature, although they are very important for the realness of 3D avatar representations. In this work we present the first, to the best of our knowledge, end-to-end trainable pipeline that accurately reconstructs the 3D face together with the tongue. Moreover, we make this pipeline robust to “in-the-wild” images by introducing a novel GAN method tailored for 3D tongue surface generation. Finally, we make publicly available to the community the first diverse tongue dataset, consisting of 1,800 raw scans of 700 individuals varying in gender, age, and ethnicity backgrounds*. As we demonstrate in an extensive series of quantitative as well as qualitative experiments, our model proves to be robust and realistically captures the 3D tongue structure, even in adverse “in-the-wild” conditions.

*Project url: https://github.com/steliosploumpis/3D_human_tongue_reconstruction
*Authors contributed equally.

1. Introduction

Recently, 3D face reconstruction from single “in-the-wild” images has been a very active topic in Computer Vision with applications ranging from realistic 3D avatar creation to image imputation and face recognition [48, 17, 43, 25, 41, 15]. Nevertheless, despite the improvement in the quality of the 3D reconstructions, none of these methods accommodates any statistical variations in the oral cavity, let alone a tongue template mesh. As a result, the oral region is completely disregarded in the final result.

Being able to reconstruct the tongue expression has multiple advantages in various applications. First of all, the generated avatars would be more realistic and would be able to mimic many more facial expressions. Moreover, speech animation tasks would be improved, as the inclusion of the oral cavity plays a significant role. Finally, face recognition applications could be enhanced, as more extreme poses and expressions would be modeled.

However, as we already pointed out, none of the current state-of-the-art (SOTA) methods [48, 17, 43] contains the tongue component in its implementation. This is because of two reasons: a) there is no publicly available tongue dataset, and b) it is very challenging to carry out 3D reconstruction of the face together with the tongue in “in-the-wild” conditions, because of the highly deformable nature of the human tongue.

To tackle the absence of tongue data, we collected a large and diverse dataset of textured 3D tongue point-clouds (more info about the data in Section 3). Having captured the data, we created a pipeline which comprises the following parts: a) a tongue point-cloud autoencoder (AE) which is used to derive useful 3D features of our raw collected 3D data, b) a tongue image encoder optimized based on the aforementioned 3D features, c) a shape decoder which translates the encoder outputs to the parameter space of the Universal Head Model (UHM) [37]. We should note that the UHM in our case is further rigged/modified so that it can model various tongue shapes/expressions, as explained in Section 3. We begin by training the AE in step a) and then we train steps b) and c) in an end-to-end fashion so that the output tongue expression of the UHM model is as close as possible to the corresponding ground-truth 3D tongue point-cloud of the 2D tongue image.

Since there is a lack of ground-truth 3D tongue data corresponding to “in-the-wild” 2D tongue images, the pipeline we described so far is only trained using our collected data, which were captured under controlled conditions. This results in sub-optimal performance in “in-the-wild” conditions. To remedy this, we developed a novel conditional GAN framework that is able to generate accurate 3D tongue point-clouds based on the image encoder outputs (step b) of the pipeline). Having created new image/point-cloud pairs of “in-the-wild” tongue data, we re-train the pipeline using these new data as well. As we show in Section 4, this addition substantially improves the quality of the tongue reconstructions.

To summarize, the contributions of our work are the following:

• We release a dataset of 1,800 raw tongue scans of various shapes and positions, corresponding to around 700 subjects. Being the first such diverse tongue dataset, it can prove very useful to the community.

• We present a complete pipeline trained in an end-to-end fashion that is able to reconstruct the 3D face together with the tongue from a single image.

• To make this pipeline robust to “in-the-wild” images, we introduce a novel GAN framework which is able to accurately reconstruct 3D tongues from “in-the-wild” images with an increasing level of detail.

2. Related Work

Single view 3D reconstruction of human facial/head parts is undeniably an extremely valuable task in Computer Vision. However, it has posed many challenges to the research community, due to the fundamental depth ambiguities and the ill-posed nature of the problem. In order to constrain the ambiguity of the problem, many statistical parametric models have been introduced for different parts of the human face/head [3, 6, 29, 38].

Due to the increasing interest in facial analysis over the years, the research community has mainly focused on human facial reconstructions. Since the inception of facial 3D Morphable Models (3DMMs) in [3], a myriad of scientific papers have been published focusing solely on the reconstruction of facial shape and appearance [4, 5, 17, 25]. Only recently, with the emergence of 3D scanning data, has the research interest shifted to other significant parts of the human head. A few head models have been introduced during recent years but without any statistical craniofacial correlations [29, 40]. The first craniofacial 3DMM of the human head was introduced in [13] and later extended and leveraged into a 3D head reconstruction setting from unconstrained single images [38]. A few recent works tried to align a skull structure of the human head with the facial topology [30, 31] in order to obtain a distribution of plausible face shapes given a skull shape.

Finer details of the human face/head started to appear with the introduction of 3D human ear modeling [50]. Ears are key structures of the human head that make an important contribution to the biometric recognition and general appearance of a person. The two foremost examples of ear models were introduced in [51, 12], but neither of them was fused to a face/head in order to create a complete appearance.

Moreover, in an attempt to overcome the “uncanny valley” problem, a few approaches have tried to model the independent variations/movements of the human eye and the facial eye region [46, 2]. These efforts are challenging due to the limited amount of data around the eye region and the extreme level of detail required for this task. Moving towards the oral cavity, teeth modeling was introduced in [47, 44], where the 3D structure of the teeth was recovered from 2D images via an elaborate optimization scheme.

Figure 2. Random 3D tongue expressions of our synthetic database based on the mean UHM template. The expressions are rigged and manually sculpted to induce more variance around the tongue surface and the general oral cavity.

Only very recently, a few approaches [28, 37] have tried to combine all of these aforementioned attributes of the human head (eyes, ears, teeth, and inner oral cavity) in order to build a complete model in terms of shape and texture, which
accurately represents the human head. Although these models include an oral topology, none of them deals with the dynamics of the tongue, something which is really important for speech animation and the overall realness of the avatar representation. To this end, in this work we aim at extending these approaches and paving the way towards a realistic human appearance by releasing a diverse 3D tongue dataset to the research community. We also present the first framework for accurate 3D human tongue reconstruction from single images.

3. 3D human tongue reconstruction

In this Section we present the complete tongue reconstruction pipeline. We begin by describing our collected 2D/3D tongue dataset and our manually rigged tongue dataset, which is based on the UHM [37] template. We further provide details about the point-cloud AE, the image encoder, the shape decoder and the overall loss functions we used to optimize the pipeline for the tongue reconstruction. Moreover, we present the novel conditional GAN method which is able to accurately reconstruct 3D tongue point-clouds of “in-the-wild” tongue images. Finally, we explain how we used the generated point-clouds of the GAN to re-train the pipeline to achieve better results in “in-the-wild” conditions.

3.1. Tongue datasets

TongueDB: the first 3D tongue dataset. As mentioned in Section 1, we collected a large dataset comprising textured 3D tongue scans. Our point cloud database, dubbed TongueDB, contains approximately 1,800 3D tongue scans which were captured during a special exhibition in the Science Museum, London. The subjects were instructed to perform a range of tongue expressions (i.e., tongue out left and right, tongue out center, tongue out center round, tongue out center extreme open mouth, tongue inside left and right, etc.). Some example images can be seen in Fig. 6. The capturing apparatus utilized for this task was a 3dMD 4 camera structured light stereo system, which produces high quality dense meshes. We recorded a total of 700 distinct subjects with available metadata about them, including their gender (42% male, 58% female), age, and ethnicity (82% White, 9% Asian, 3% Black and 6% other).

Rigged tongue database. In order to carry out 3D tongue and face reconstruction, we would need to use a face/head model. Nevertheless, one major drawback of all of the currently used face/head models [29, 40, 13] is that they are missing the tongue component. This is because it is a challenging task to non-rigidly capture the 3D topology of the oral cavity in a fixed template. The challenges include: a) the highly deformable nature of the tongue, b) the non-convexity of the mouth region, c) the specular texture of the teeth. In order to alleviate this issue, we constructed a synthetic 3D head and tongue dataset rigged by 3D artists. For our neutral mesh template $\bar{T}$ we utilize the mean template of the UHM [37], as it provides all the necessary components of the human oral cavity in accordance with the entire head statistical structure. The resulting rigged tongue expressions amount to 75 distinct meshes. In order to further augment our synthetic dataset, we performed trilinear interpolation between the closest expression meshes and generated a total of $n_s = 720$ tongue expressions. Some example synthetic expressions are shown in Fig. 2. A standard PCA was applied on the interpolated meshes, resulting in an orthogonal basis matrix $U_t \in \mathbb{R}^{3N \times n_t}$ (where $N$ is the number of mesh vertices and $n_t = 110$ the number of kept components). The PCA is performed on the entire set of head vertices and not solely on the oral cavity. In this way, it is more efficient afterwards to transfer the tongue expression from the mean head to a head with a different facial identity.

3.2. Method

Tongue point-cloud AE. In order to accurately reconstruct a tongue in its 3D form based on a 2D image, our image encoder needs to be guided by meaningful target labels which can capture all the desired 3D point-cloud information. These labels, denoted as $y \in \mathbb{R}^{256}$, are learned by autoencoding the raw point-clouds of our dataset (i.e., the raw 3D tongue scans). For this task, we utilize a self-organizing map framework for hierarchical feature extraction [27].

Tongue image encoder. The task of the tongue image encoder is to produce features which are close to the target 3D features $y$ of the AE. To make the encoder robust to various camera angles or illuminations, we employ a rendering framework where we utilize the textured raw scans (TongueDB in Section 3.1). We render our 1.8K textured meshes with a pre-computed radiance transfer technique using spherical harmonics, which efficiently represent global light scattering. Additionally, we use more than 15 different indoor scenes coupled with random light positions and mesh orientations around all 3D axes, resulting in approximately 100K images. As an encoder we used a ResNet-50 [20] model pre-trained on ImageNet [14] and fine-tuned it on our dataset. In particular, we modified the last layer of the network to output a vector $\tilde{y} \in \mathbb{R}^{256}$, matching the dimension of the ground-truth vector $y$.

[Figure 3 diagram. Labels: Ground Truth Tongue Point Cloud; Point Cloud Auto-Encoder; Rendered input Image; Embedding Net (ResNet-50); MLP; PCA Shape Decoder; Regressed tongue expression.]

Figure 3. An illustration of our tongue reconstruction framework. First we train the point-cloud AE on its own to get meaningful 3D features ($y$) and then we train the image encoder/shape decoder using a number of different losses, as explained in Section 3.

Shape decoder. In order to decode the encoder labels $\tilde{y}$ into meaningful tongue shapes, we use the synthetic PCA model $U_t$ of the rigged tongue expression dataset. To this end, after producing the $\tilde{y}$ labels, we utilize a standard multi-layer perceptron (MLP) which works as a regression scheme to the latent parameters $p_t \in \mathbb{R}^{110}$ of the synthetic PCA tongue model. The statistical nature of the PCA model helps us constrain the final result during training and ensures meaningful deformations which lie inside the spectrum of our rigged/modified UHM model.

Pipeline training. During training, we first train the point-cloud AE on its own and then train the pipeline of the image encoder and shape decoder in an end-to-end fashion. To optimize the pipeline, we apply a total of 6 losses, each one contributing to the quality of the final result. The first 2 losses are calculated between the predicted tongue expression of the rigged/modified UHM model and the ground-truth tongue point-cloud of the corresponding input image. Similarly to [45], we adopt a Chamfer loss [16] $\mathcal{L}_{CD}$ to optimize the position of the resulting template points as well as a normal loss $\mathcal{L}_n$ to correct the orientation of the mesh. In order to compute an accurate Chamfer loss, we only utilize a small area around the oral cavity which is defined based on the ground-truth point-cloud. Additionally, we calculate a Laplacian regularization loss $\mathcal{L}_l$ between our predicted mesh and the mean shape of the PCA model in order to prevent the vertices from moving too freely away from the mean positions and to constrain the resulting shape to be smooth. An edge length loss $\mathcal{L}_e$ is also introduced, which penalizes any flying vertices (outliers). Finally, we employ a collision loss $\mathcal{L}_c$ which prevents the points of the tongue from penetrating the surface of the oral cavity and is formulated as the sum of each collision error around the 12 mouth landmarks of the UHM template (as illustrated in the supplementary material):

$$\mathcal{L}_c = \frac{1}{N} \sum_{k=0}^{11} \sum_{i=0}^{N-1} \max\left(0, d_k^i\right) \tag{1}$$

$$d_k^i = r^2 - \left(q_1^i - x_k\right)^2 - \left(q_2^i - y_k\right)^2 - \left(q_3^i - z_k\right)^2$$

$\mathcal{L}_c$ is calculated as the sum of distances of each collided point $q^i = \{q_1^i, q_2^i, q_3^i\}$ to the sphere $k$ centered at the landmark coordinates $x_k, y_k, z_k$ with radius $r = 1.5$ cm. Lastly, we impose a final L2 loss $\mathcal{L}_y$ in the intermediate step of our pipeline, where we constrain the predicted encoded features $\tilde{y}$ to be as close as possible to the ground-truth features $y$ of the corresponding autoencoded point-cloud. This loss is of paramount importance because: a) the $\tilde{y}$ features in this way contain rich 3D information invariant to texture/illumination variations and b) our “in-the-wild” extension, which we introduce later, is based on a generative point-cloud framework that relies on such rich 3D features. The final loss function $\mathcal{L}_{total}$ is given by:

$$\mathcal{L}_{total} = \lambda_1 \mathcal{L}_{CD} + \lambda_2 \mathcal{L}_n + \lambda_3 \mathcal{L}_l + \lambda_4 \mathcal{L}_e + \lambda_5 \mathcal{L}_c + \lambda_6 \mathcal{L}_y \tag{2}$$
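As a concrete reading of Eqs. (1) and (2), the following is a minimal sketch in plain Python, not the authors' implementation. The function names, the choice of Python tuples for points, and the interpretation of N as the number of tongue points are assumptions made for illustration; coordinates are taken to be in centimetres so that the default radius matches r = 1.5 cm.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def collision_loss(points: Sequence[Point],
                   landmarks: Sequence[Point],
                   radius: float = 1.5) -> float:
    """Sketch of Eq. (1): sum of positive sphere-penetration terms, divided by N.

    `points` are the tongue points q^i, `landmarks` are the sphere centres
    (x_k, y_k, z_k). N is assumed here to be the number of tongue points.
    """
    n = len(points)
    if n == 0:
        return 0.0
    total = 0.0
    for (xk, yk, zk) in landmarks:
        for (q1, q2, q3) in points:
            # d is positive only when the point lies inside the sphere.
            d = radius ** 2 - (q1 - xk) ** 2 - (q2 - yk) ** 2 - (q3 - zk) ** 2
            total += max(0.0, d)
    return total / n


def total_loss(terms: Sequence[float], weights: Sequence[float]) -> float:
    """Sketch of Eq. (2): weighted sum of the individual loss terms."""
    if len(terms) != len(weights):
        raise ValueError("terms and weights must have the same length")
    return sum(w * t for w, t in zip(weights, terms))
```

Points outside every landmark sphere contribute nothing, so the term only becomes active when the predicted tongue intersects the spheres placed around the mouth landmarks.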
[Figure 4 diagram: TongueGAN architecture. Generator (G): Label MLP, Noise MLP, Injection MLP layers (I), main MLP layers (L_G) and an output layer producing the generated surface point G(z, y) = x̃_t, with z ∼ N(0, I). Discriminator (D): Label MLP, Point MLP and main MLP layers (L_D) producing a Real/Fake output.]

Figure 4. Symbol c stands for row-wise concatenation along the channel dimension. Symbol o stands for element-wise (i.e., Hadamard) product. The Generator inputs are a Gaussian noise sample z and a label y corresponding to a particular tongue, from which we want to sample a 3D point. The Discriminator input pairs are a label y, which corresponds to a specific tongue, and either x_t, a real 3D point belonging to the aforementioned tongue point-cloud (sampled as explained in Section 3.3.1), or G(z, y) = x̃_t, a generated point belonging to this tongue. The Discriminator is asked to distinguish the real from the fake point.
where $\lambda_1, \ldots, \lambda_6$ are training hyper-parameters. During inference, the encoder network takes as input a single tongue image and predicts a 3D embedding $\tilde{y}$, which is later transformed to the corresponding $p_t$ parameters of the synthetic expression model through the MLP between the two latent spaces. Finally, we apply the PCA model of the rigged head model on these $p_t$ parameters to derive the final mesh of the head with the tongue expression. An overview of the methodology can be seen in Fig. 3.

3.3. TongueGAN for “in-the-wild” reconstruction

Although the pipeline presented in Section 3.2 provides a good estimation of the tongue pose in the test set of our collected data, it does not perform very well on “in-the-wild” images (Fig. 9). This behavior is expected because our collected data were captured in controlled conditions and the training of the encoder was carried out only with rendered images, which do not fully mimic “in-the-wild” conditions. To make our approach robust to “in-the-wild” images too, we would need to further train the pipeline using such data as well. However, for “in-the-wild” images collected from the web, we do not have their corresponding 3D tongue point-clouds. As a result, to use “in-the-wild” data in the pipeline, we would first need to have a method that can learn the distribution of our collected 3D tongue data and generalize well.

Finding a method to generate novel 3D tongues is tricky. This is because of several unique properties of the human tongue: a) it is a highly deformable object, so we cannot register our collected data in a reference template and apply relevant methods [7, 35, 39], b) it is a non-watertight surface (i.e., it contains holes), so we also cannot use any implicit function approximation methods [36, 42, 33] or volumetric approaches [22, 23, 49]. Therefore, having excluded the aforementioned categories, we decided to use GANs [18] for the 3D tongue surface generations.

In order to generate accurate point-clouds that correspond to certain tongue images, our GAN, dubbed TongueGAN, needs to be guided by meaningful labels which can capture all the desired 3D surface information. These labels are provided by the trained point-cloud AE as described in Section 3. Since the generation is driven by labels, TongueGAN is a conditional GAN [34].

In particular, given a label denoted as $y$ and a random Gaussian noise $z \in \mathbb{R}^{128}$, the generator $G$ produces a novel point-cloud point $G(z, y) \in \mathbb{R}^{3}$, which we denote as $\tilde{x}_t$, that belongs to the tongue surface represented by the label $y$. On the other hand, the discriminator $D$ receives as inputs the label $y$, a real point-cloud point $x_t$ (which belongs to the tongue represented by the label $y$) and the generator output $\tilde{x}_t$, and tries to discriminate the fake (i.e., generated) from the real point. In mathematical parlance, this can be described as:

$$\mathcal{L}_D = \mathbb{E}_{x_t}\left[\log D(x_t, y)\right] - \mathbb{E}_{\tilde{x}_t}\left[\log D(\tilde{x}_t, y)\right], \qquad \mathcal{L}_G = \mathbb{E}_{\tilde{x}_t}\left[\log D(\tilde{x}_t, y)\right] \tag{3}$$

where $D$ tries to maximize $\mathcal{L}_D$, whereas $G$ tries to minimize $\mathcal{L}_G$. The training process is considered complete when $D$ is no longer able to differentiate between the real and fake point-cloud points.

Please note that instead of generating whole point-clouds for every provided pair $(z, y)$ of noise and label, respectively, we merely generate a point corresponding to the surface which the label $y$ represents. That confers several advantages in comparison to the rest of the methods in the literature, such as: a) we do not need to have in our training set point-clouds with the same number of points and as a result we can train our GAN without any data pre-processing on the raw point-clouds, which do not have a fixed number of [...]

[...] of the generated points are discarded as fake by the discriminator with high confidence). To remedy this, we slightly softened the discriminator, especially in the initial steps, by slightly modifying the real points fed to it.
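For reference, the two objectives of Eq. (3) can be evaluated on a batch of discriminator outputs as in the minimal sketch below. It transcribes the expressions literally as printed and is not the authors' training code; the function names, the assumption that D returns a value in (0, 1], and the small clamping constant are all illustrative choices.

```python
import math
from typing import Sequence

EPS = 1e-12  # assumed clamp to avoid log(0); not part of the paper


def mean_log(scores: Sequence[float]) -> float:
    """Batch estimate of E[log D(.)] from discriminator outputs in (0, 1]."""
    return sum(math.log(max(s, EPS)) for s in scores) / len(scores)


def discriminator_objective(d_real: Sequence[float],
                            d_fake: Sequence[float]) -> float:
    """L_D of Eq. (3): E[log D(x_t, y)] - E[log D(x~_t, y)]."""
    return mean_log(d_real) - mean_log(d_fake)


def generator_objective(d_fake: Sequence[float]) -> float:
    """L_G of Eq. (3): E[log D(x~_t, y)], as printed."""
    return mean_log(d_fake)
```

When the discriminator assigns the same score to real and generated points, the two expectations cancel and L_D evaluates to zero, which matches the stopping condition described after Eq. (3).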
To achieve this, instead of directly feeding a real point xt corresponding to a label y to the discriminator, we feed the following: