Universidad Politécnica de Madrid

Escuela Técnica Superior de Ingenieros de Telecomunicación

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN

TRABAJO FIN DE MÁSTER

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Patricia Alonso de Apellániz

2020

UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Autor: Patricia Alonso de Apellániz

Tutor: Dr. Alberto Belmonte Hernández

Departamento: Departamento de Señales, Sistemas y Radiocomunicaciones

MIEMBROS DEL TRIBUNAL:

Presidente:

Vocal:

Secretario:

Suplente:

Realizado el acto de lectura y defensa del Trabajo de Fin de Máster, acuerdan la calificación de:

Calificación:

Madrid, a de de

Universidad Politécnica de Madrid

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN

Patricia Alonso de Apellániz

2020

Summary

Generating synthesized images, and being able to animate or transform them in some way, has lately experienced a remarkable evolution thanks, in part, to the use of neural networks. In particular, transferring facial gestures and audio to an existing image has attracted attention both in research and in society at large, due to its potential applications.

Throughout this Master's Thesis, a study of the state of the art will be carried out on the different techniques that exist for this transfer of facial gestures, including lip movement, between audiovisual media. Specifically, it will focus on the existing methods and research that generate talking faces based on several features extracted from the multimedia information used. From this study, the implementation, development, and evaluation of several systems will be done as follows.

First, given the importance of training deep neural networks on a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a process of conditioning and adaptation, extracting image and audio information from the original videos to be used as the input of the networks. These features are widely used in the state of the art for tasks such as the one mentioned, for example image key points and audio spectrograms.

As the second contribution of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on the implementation of [1], but adding new configurations such as the network that manages the audio features or loss functions that depend on this new architecture and the network's behavior. In other words, the first implementation will consist of the network described in the paper mentioned; to this implementation, an encoder for audio features will be added; and, finally, the training will be based on this last architecture but taking into account a loss calculated for the audio learning.

Finally, to compare and evaluate each network's results, both quantitative metrics and qualitative evaluations will be carried out. Since the final output of these systems will be a clear and realistic video of an arbitrary face to which the gestures of another one have been transferred, perceptual visual evaluation is key to this problem.

Keywords

Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.

Resumen

Generar imágenes sintetizadas, siendo capaces de animarlas o transformarlas de alguna manera, ha experimentado en los últimos años una evolución muy significativa gracias, en parte, al uso de redes neuronales en sus implementaciones. En particular, el intento de transferir diferentes gestos faciales y audio a una imagen existente ha llamado la atención tanto en la investigación como, incluso, socialmente, debido a sus posibles aplicaciones.

A lo largo de este Proyecto de Fin de Máster, se realizará un estudio del estado del arte en las diferentes técnicas que existen para esta transferencia de gestos faciales entre los medios audiovisuales que implican, incluso, el movimiento de los labios. Específicamente, se centrará en los diferentes métodos e investigaciones existentes que generan rostros parlantes basados en varios rasgos de la información multimedia utilizada. A partir de este estudio, la implementación, desarrollo y evaluación de varios sistemas se hará de la siguiente manera.

En primer lugar, conociendo la importancia relevante de entrenar redes neuronales profundas utilizando un conjunto de datos grande y bien procesado, VoxCeleb2 se descargará y sufrirá un proceso de condicionamiento y adaptación en cuanto a la extracción de información de imagen y audio del vídeo original para ser utilizado como entrada de las redes. Estas características serán las que se utilizan normalmente en el estado del arte para tareas como la mencionada, como los puntos clave de la imagen y los espectrogramas de audio.

Como segundo enfoque de esta Tesis, la implementación de tres redes convolucionales diferentes, en particular Generative Adversarial Networks (GANs), se hará basándose en la implementación de [1] pero añadiendo algunas nuevas configuraciones, como la red que gestiona las características de audio o las funciones de pérdidas dependiendo de esta nueva arquitectura y el comportamiento de la red. En otras palabras, la primera implementación consistirá en la red del paper mencionado; a esta implementación se le añadirá un encoder para las características del audio; y, finalmente, el entrenamiento se basará en esta última arquitectura pero teniendo en cuenta la pérdida calculada para el aprendizaje del audio.

Por último, para comparar y evaluar los resultados de cada red se realizarán tanto mediciones cuantitativas como evaluaciones cualitativas. Dado que el resultado final de estos sistemas será la obtención de un vídeo claro y realista con un rostro aleatorio al que se le han transferido gestos de otro, la percepción visual es clave para resolver este problema.

Palabras Clave

Aprendizaje profundo, transferencia de caras, generación de imágenes, imágenes sintetizadas, encoder, Redes Neuronales Convolucionales (CNN), autoencoder, Generative Adversarial Networks (GAN), Generador, Discriminador, procesamiento de datos, dataset, Python, evaluaciones cualitativas y cuantitativas.

Agradecimientos

Gracias al apoyo incondicional de mi tutor, Alberto, porque es una persona todoterreno capaz de centrarse y enseñar a todo el que se lo pida. Hacía mucho tiempo que no conocía a alguien al que le apasionase tanto saber y transmitir, consiguiendo meterme en un mundo en el que quiero seguir desarrollándome siempre, así que muchísimas gracias. Gracias a mi 'comuna' por haber conseguido lo que pocos pueden: aguantarme en mis peores momentos intentando sacarme una sonrisa, aunque sea vacilándome constantemente, y hacer posible que siga adelante con todo. Por último, gracias a madre y padre, que nunca han dudado de mí.

Index

1 Introduction and objectives 1

1.1 Motivation ...... 1

1.2 Objectives ...... 4

1.3 Structure of this document ...... 4

2 State of the art 5

2.1 Image Synthesis ...... 5

2.2 Deep Learning basics ...... 6

2.2.1 Artificial Neural Networks ...... 6

2.2.2 Convolutional Neural Networks ...... 7

2.2.3 Recurrent Neural Networks ...... 13

2.3 DL and Image Synthesis ...... 15

3 Development setup 29

3.1 Federated Learning ...... 29

3.2 PyTorch ...... 33

3.3 Libraries ...... 36

3.4 Overview of the proposed DL process ...... 37

4 Implementation 40

4.1 Data collection ...... 41

4.2 Data preparation ...... 45

4.2.1 Visual features extraction ...... 46

4.2.2 Audio features extraction ...... 48

4.3 Modeling ...... 51

4.3.1 Embedders ...... 53

4.3.2 Generator ...... 56

4.3.3 Discriminator ...... 58

4.4 Training ...... 59

4.4.1 Meta-learning ...... 60

4.4.2 Fine-tuning ...... 65

4.4.3 Other hyper-parameters ...... 67

4.5 Evaluation ...... 68

4.5.1 Quantitative Evaluation ...... 69

4.5.2 PS ...... 73

4.5.3 NMI ...... 73

4.5.4 Qualitative Evaluation ...... 76

5 Simulations and results 77

5.1 Configuration of the evaluation ...... 78

5.1.1 Evaluation dataset ...... 78

5.1.2 Reference system ...... 80

5.2 Project experiments ...... 85

5.2.1 Reference system with this project’s dataset ...... 86

5.2.2 Results with both image and audio features ...... 94

5.2.3 Results with audio loss ...... 101

5.3 Other experiments of possible interest ...... 107

5.3.1 Federated learning ...... 107

5.3.2 Angela Merkel video results ...... 112

5.3.3 Video to Image results ...... 116

5.3.4 Video to not human face ...... 119

6 Conclusions and future lines 123

6.1 Conclusions ...... 123

6.2 Future lines of work ...... 127

References 129

Appendices 140

A Social, economic, environmental, ethical and professional impacts 141

A.1 Introduction ...... 141

A.2 Description of impacts related to the project ...... 142

A.2.1 Ethical impact ...... 142

A.2.2 Social impact ...... 143

A.2.3 Economic impact ...... 143

A.2.4 Environmental impact ...... 143

A.3 Conclusions ...... 144

B Economic budget 145

C Code available 147

D Other Results 149

D.1 Reference system with this project’s dataset ...... 150

D.2 Results with both image and audio features ...... 154

D.3 Results with audio loss ...... 158

Index of figures

1.1 Example of deepfake in "The Shining" (1980). Jim Carrey replaces Jack Nicholson's face through DL techniques [2] ...... 2

1.2 Example of Obama's talking video generation through DL techniques [3] ...... 3

2.1 ANN architecture [9] ...... 7

2.2 Object detection results using the Faster R-CNN system [10] ...... 8

2.3 CNN architecture for classification [14] ...... 9

2.4 GAN architecture [20] ...... 11

2.5 Autoencoder architecture [21] ...... 12

2.6 LSTM architecture [24] ...... 13

2.7 LRCN architecture [26] ...... 15

2.8 Example results of Pix2Pix net on automatically detected edges compared to ground truth [28] ...... 16

2.9 Face generation through the years. Faces on the left were created by Artificial Intelligence in 2014 and the ones on the right, in 2018. [31] ...... 17

2.10 Dense alignment, including key points, and 3D reconstruction results for [38] ...... 18

2.11 OpenFace behaviour analysis pipeline, including facial action unit recognition [41] ...... 19

2.13 X2Face network during the initial training stage [43] ...... 19

2.12 Results of the reenactment system Face2Face [42] ...... 20

2.14 System to synthesize Obama’s talking head [44] ...... 20

2.15 Results comparison between Face2Face and Obama’s talking head generation model [44] ...... 21

2.16 Overall Speech2Vid model [45] ...... 22

2.17 Proposed conditional recurrent adversarial video generation model [46] 22

2.18 Proposed Disentangled Audio-Visual System [47] ...... 23

2.19 Few shot Model Architecture [1] ...... 24

2.20 Few shot Model Results compared to other models seen [1] ...... 24

3.1 Federated learning general representation [53] ...... 30

3.2 Proposed federated learning system architecture ...... 31

3.3 PyTorch Vs. TensorFlow: Number of Unique Mentions. Conference legend: CVPR, ICCV, ECCV - computer vision conferences; NAACL, ACL, EMNLP - NLP conferences; ICML, ICLR, NeurIPS - general ML conferences. [53] ...... 35

3.4 System architecture general training overview ...... 38

3.5 System architecture final application overview ...... 39

4.1 Deep learning process block diagram ...... 40

4.2 VoxCeleb2 faces of speakers in the dataset...... 42

4.3 VoxCeleb2 downloaded folders organization ...... 43

4.4 VoxCeleb2 txt file with video information to download it ...... 43

4.5 VoxCeleb2 video example from YouTube ...... 44

4.6 Visual features extraction [41] ...... 46

4.7 Visual feature extraction. Frame A from dataset video, bounding-box coordinates provided by dataset and landmarks extracted, respectively. 47

4.8 Final visual feature extraction. Input to the network consisting of frame and landmarks concatenated...... 48

4.9 Audio feature extraction. First column: frame A talking from dataset video, audio waveform from frame A, MFCCs and Mel-spectrogram, respectively. Second column: the same for frame B not talking...... 50

4.10 Project’s network architecture ...... 51

4.11 Image Embedder Architecture ...... 53

4.12 Audio Embedder Architecture ...... 54

4.13 Single Residual Block[79] ...... 55

4.14 Residual Down Sampling Block ...... 56

4.15 Generator Architecture Without Audio vector ...... 57

4.16 Generator Architecture With Audio vector ...... 58

4.17 Residual Up Sampling Block ...... 59

4.18 Discriminator Architecture ...... 60

4.19 VGG-19 Architecture ...... 63

4.20 VGG-Face Architecture ...... 63

4.21 Different image distortions to proceed with image evaluation ...... 69

5.1 Frame of generated video of myself with noticeable face gestures and speaking...... 79

5.2 Frames of downloaded video from Pedro Sánchez before and after cutting and cropping it...... 80

5.3 Implementation inference after 5 epochs of training in small dataset [94] 80

5.4 Example of output of the reference system trained using the pre-trained model available. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated. The first column of pictures just trains the meta-learning stage and the second trains for 40 epochs the fine-tuning stage...... 81

5.5 Training losses evolution graph in reference system with pre-trained weights...... 83

5.6 Example of losses output during the fine-tuning stage using the reference system model available...... 84

5.7 Example of output of the reference system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 87

5.8 Different FT T values applied. Example of output of the reference system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 88

5.9 Different FT epochs applied. Example of output of the base system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 89

5.10 Different paddings applied. Example of output of the base system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 89

5.11 Training losses evolution graph in reference system with own dataset. . 91

5.12 Examples of Generator outputs during meta-learning...... 91

5.13 Example of losses output during the fine-tuning stage using the base system model available with this project's dataset with different T values. 92

5.14 Example of losses output during the fine-tuning stage using the base system model available with this project's dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50)...... 93

5.15 Example of output of the Video-Audio system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 95

5.16 Different FT T values applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 96

5.17 Different FT epochs applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 96

5.18 Different paddings applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 97

5.19 Training losses evolution graph in Video-Audio system with own dataset. 98

5.20 Example of losses output during the fine-tuning stage using the Video- Audio system model available with this project’s dataset with different T values...... 99

5.21 Example of losses output during the fine-tuning stage using the Video- Audio system model available with this project’s dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50)...... 100

5.22 Example of output of the Video-Audio with audio loss system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 102

5.23 Different FT T values applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 103

5.24 Different FT epochs applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 103

5.25 Different paddings applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 104

5.26 Training losses evolution graph in Video-Audio system using audio loss with own dataset...... 105

5.27 Example of losses output during the fine-tuning stage using the Video- Audio system model with audio loss available with this project’s dataset with different T values...... 106

5.28 Example of losses output during the fine-tuning stage using the Video- Audio system with audio loss model available with this project’s dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50)...... 106

5.29 Example of output of the Server 2 reference system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 108

5.30 Reference networks' generated image during meta-learning stage ...... 109

5.31 Example of output of the Server 2 Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 110

5.32 Video-Audio networks’ generated image during meta-learning stage . . 110

5.33 Example of output of the Server 2 Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 111

5.34 Video-Audio with audio loss networks’ generated image during meta- learning stage ...... 112

5.35 Angela Merkel video frame [97] ...... 113

5.36 Different visual results for each experiment done using Sánchez's video and using Merkel's ...... 115

5.37 Original image used for the application purpose ...... 116

5.38 Example of output of the reference system trained both the meta- learning and fine-tuning steps. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video’s face translated...... 117

5.39 Example of output of the reference system trained both the meta-learning and fine-tuning steps with the project's dataset for the video to image application. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated...... 118

5.40 Example of output of the Video-Audio system trained just for the meta-learning stage with the project's dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated...... 118

5.41 Example of output of the Video-Audio with audio loss system trained for the meta-learning step with the project’s dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video’s face translated...... 119

5.42 Example of non human face picture [98] ...... 120

5.43 Example of output using a non human face to transfer the gestures from the video of myself ...... 120

5.44 Example of video game face picture [100] ...... 121

5.45 Example of output using a video game face to transfer the gestures from the video of myself ...... 121

5.46 Example of output using the same two videos of myself for the network without Audio Embedder and for the one with it ...... 122

B.1 Economic budget ...... 146

D.1 Generator synthesized frames during meta-learning training in the reference system using this project’s pre-processed dataset. Part I. . . . 150

D.2 Generator synthesized frames during meta-learning training in the reference system using this project’s pre-processed dataset. Part II. . . 151

D.3 Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part IV...... 152

D.4 Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part V...... 153

D.5 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part I. . 154

D.6 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part II. 155

D.7 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part IV. 156

D.8 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part V. 157

D.9 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project’s pre-processed dataset. Part I...... 158

D.10 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part II...... 159

D.11 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project’s pre-processed dataset. Part IV...... 160

D.12 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part V...... 161

Index of tables

2.1 DL and image synthesis summary ...... 27

3.1 Hardware used in the development of the Master’s thesis ...... 32

3.2 Different existing DL frameworks ...... 34

4.1 VoxCeleb2 description ...... 41

4.2 Final number of samples from VoxCeleb2 used ...... 45

4.3 Dataset storage used ...... 45

4.4 Information about spectrogram computation ...... 49

4.5 Metrics values obtained for the different distortions of the first image to show range of values...... 75

5.1 Test videos information ...... 79

5.2 Metrics values obtained for the different configurations for the video results using the reference system ...... 85

5.3 Experiments carried out in this project ...... 86

5.4 Metrics values obtained for the different configurations for the video results using the reference system with this project's dataset ...... 94

5.5 Metrics values obtained for the different configurations for the results using the Video-Audio system with this project’s dataset ...... 101

5.6 Metrics values obtained for the different configurations for the results using the Video-Audio with audio loss system with this project’s dataset 107

5.7 Best metric values obtained for each of the previous experiments. . . . 114

Glossary

DL - Deep Learning

MIT - Massachusetts Institute of Technology

ANN - Artificial Neural Network

CNN - Convolutional Neural Network

R-CNN - Region Convolutional Neural Network

SVM - Support Vector Machine

NLP - Natural Language Processing

ReLU - Rectified Linear Unit

NN - Neural Network

GAN - Generative Adversarial Network

DCGAN - Deep Convolutional Generative Adversarial Network

BN - Batch Normalization

RNN - Recurrent Neural Network

LSTM - Long Short-Term Memory

WLAS - Watch, Listen, Attend and Spell

PRN - Position map Regression Network

CPU - Central Processing Unit

GPU - Graphics Processing Unit

HG - HourGlass

MFCC - Mel-Frequency Cepstral Coefficients

ILSVRC - ImageNet Large Scale Visual Recognition Challenge

LAD - Least Absolute Deviations

PSNR - Peak Signal-to-Noise Ratio

SSIM - Structural Similarity

CSIM - Cosine Similarity

LMD - Landmark Distance Error

IS - Inception Score

FID - Fréchet Inception Distance

PS - Perceptual Similarity

NMI - Normalized Mutual Information

MI - Mutual Information

MSE - Mean Squared Error

WER - Word Error Rate


Chapter 1

Introduction and objectives

This chapter of the Master's Thesis introduces the background that has motivated this project, a brief overview of its main objectives, and the different sections into which the document is divided.

1.1 Motivation

Transforming or manipulating photographs using different techniques or methods dates back to some of the first pictures captured during the 19th Century, not long after the first one was taken in 1825. From paint retouching, airbrushing, or even manipulating negatives while still in the camera, techniques transitioned in just one century to radically new approaches thanks to digitization. Since then, more and more advances have been made in this task as the quality and capabilities of the equipment have improved. Due to those developments, interest slowly moved from still pictures to moving ones.

Being able to animate a still image or to transform a video of a face in a controllable way has had a huge impact on society, since it can be applied to many image editing applications, such as animating an onscreen person with other human expressions or even manipulating what this person is saying. These alterations are generally done for entertainment purposes, in advertising or social media, for example. However, as can be immediately imagined, there are ethical issues and controversies about it, which will be discussed in Appendix A (Social, economic, environmental, ethical and professional impacts), involving, of course, the famous deepfakes among other applications. This synthetic media has gathered widespread attention for unethical uses in fake news, financial fraud, or even adult-content videos. It basically consists of replacing a person's face with another one, as can be seen in Figure 1.1. Nowadays, several online applications can be found just for this purpose which, as the name deepfake suggests, is made possible by advances in Deep Learning (DL) algorithms.

Figure 1.1: Example of deepfake in "The Shining" (1980). Jim Carrey replaces Jack Nicholson's face through DL techniques [2].

While this face replacement configuration has gained a lot of interest in academic and social fields, recent research has focused on synthesizing talking faces, that is, taking a still face image and making it "talk" by moving its lips according to some audio, written sentences or even a video. This is the famous Barack Obama video case (Figure 1.2) in which, given audio, a video of him was created depicting him mouthing the words of the track. This goes further, as mentioned, by taking a video of a person, not a still image, and replacing what he or she is saying by generating lip and facial movement according to a given video different from the original. There are many examples of talking-person video generation techniques that are being developed and improved thanks to DL algorithms.

Figure 1.2: Example of Obama's talking video generation through DL techniques [3].

DL is eclipsing other techniques and gaining a lot of interest in a wide range of computer vision fields and, in particular, in this one, since it allows high-capacity computational models to learn how to generate images based on features extracted from other media. Nevertheless, there is still a long way to go, since generating a natural, realistic and personalized human head is really difficult because of its geometric and kinematic complexity and also because our visual system is able to recognize even minor mistakes in a human's appearance.

As will be seen in the state of the art section of this report, several DL models have been proposed by researchers from all over the world to overcome these challenges, using images from datasets to perform this task. But what about combining those features with audio ones to help a model learn and try to improve on the proposed approaches? In the end, lips need to be synthesized and their motion also depends on the audio features.

1.2 Objectives

As stated before, the main target of this research is to implement a DL model that, given two videos, transfers the facial expressions and poses from the first one to the second one.

This main objective can be broken down into two secondary ones. The first is to test the suitability of adding audio features, in addition to image features, as an input to the neural network model, to assess whether this improves performance or not. The second is to extend the advances in this field, which is a recent one and does not have as many available examples as other fields do.

1.3 Structure of this document

This Thesis' document is structured in six chapters, plus annexes, as follows:

1. Introduction and objectives: the motivation to carry out this project is explained.

2. State of the art: the theoretical and practical background of this field and similar ones is introduced.

3. Development setup: the software and hardware requirements are defined.

4. Implementation: the implemented algorithms are detailed.

5. Simulations and results: the final results of our experiments are presented.

6. Conclusions and future lines: the final conclusions drawn from this project are presented and some future lines of work are proposed.

7. Annexes: additional information about the project, its impacts and the economic budget is provided.

Chapter 2

State of the art

2.1 Image Synthesis

Since the late 1960s, the quality and accuracy of computer-generated images have improved dramatically. The field has gone from simple ambient representations or the synthesis of a single object [4][5], with direct lighting as the only possibility, to generating complete scenes [6] with shadows. This improvement is due to several reasons, but advances in both hardware and software, such as the increased computational speed of the equipment, stand out among the rest, making it possible to generate mature representations with high spatial and color resolution.

So, for some years now, image synthesis has been explored in depth, with strong relevance in applications such as creating high-resolution images from low-resolution ones or generating facial images with different poses. This task [7] addresses the process of generating images that represent the information of a real scene from some sort of description.

Despite the progress in generating images, and although it has strong relevance in multiple fields, creating high-resolution images from a given input remains a challenge. This might be because traditional and newly developed techniques lack the high-level information required for generating images.

By making use of the advances of DL in Computer Vision, Image Synthesis has been getting more and more attention, as it opens up new application fields while improving image generation in terms of cost, scalability, and time consumption.

2.2 Deep Learning basics

Nowadays, both in the discipline defined above and in several others, DL is becoming the main solution. DL techniques vary greatly and are found in fields as diverse as medicine, with tasks such as identifying skin cancer from patient photos, and advertising, offering clients the products that will fit them best, for example.

DL can be defined as a type of machine learning technique in which the input information is processed in hierarchical layers so that the machine understands its features at increasing levels of complexity. These techniques are built on neural networks, which share some properties: they are interconnected neurons organized in layers, differing in architecture and possibly in training. As a good summary, the Massachusetts Institute of Technology (MIT) official introductory course in DL [8] defines it as a technique that extracts patterns from data using neural networks.

As has been mentioned, there are different types of neural networks, depending on the architecture, the training technique and, not less important, the type of raw input to the net. Among those types, the basic ones are described below.

2.2.1 Artificial Neural Networks

The basic type of neural network, the Artificial Neural Network (ANN), consists of processing elements organized in interconnected layers, as Figure 2.1 shows, where the flow of information through the network is unidirectional. This means that data travels in just the forward direction, input to output, in such a way that each neuron is connected by a weighted link to every neuron in the following layer. The input layer sends the information to the hidden one, which processes it and may be interconnected with more hidden layers, and the output layer generates the result. ANN modeling starts with a random selection of weight coefficients, which are modified during training until the output matches the true values.

Figure 2.1: ANN architecture [9]

This architecture might be useful for solving some regression and classification applications. However, when the input is an image, it has to be converted from a 2D array to a 1D vector before the training step, which has two drawbacks. The first is that the number of parameters during training grows with the size of the image, and the second is that the ANN loses the spatial characteristics of the image, which are a huge source of information. Another drawback of this simple architecture and training is that an ANN does not capture sequential information.
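As a minimal illustration of this flattening step, the following PyTorch sketch (layer sizes are arbitrary examples, not taken from any network used later in this Thesis) shows how a fully connected network must first reshape a 2D image into a 1D vector, discarding its spatial structure:

```python
import torch
import torch.nn as nn

# Minimal feed-forward ANN sketch: a 3x32x32 image must be flattened
# into a 3072-dimensional vector before the fully connected layers,
# losing its 2D spatial structure (sizes are illustrative only).
ann = nn.Sequential(
    nn.Flatten(),                 # 3x32x32 image -> 3072 vector
    nn.Linear(3 * 32 * 32, 128),  # input layer -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),           # hidden layer -> 10-class output
)

x = torch.randn(8, 3, 32, 32)     # a batch of 8 dummy images
print(ann(x).shape)               # torch.Size([8, 10])
```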

All these limitations of ANNs are directly addressed by making use of other more complex architectures and training techniques explained below.

2.2.2 Convolutional Neural Networks

In the area of computer vision, Convolutional Neural Networks (CNNs) have positioned themselves above the rest, becoming the core of most systems today, such as object detection and image classification ones.

Figure 2.2: Object detection results using the Faster R-CNN system [10]

Such is their effectiveness in object detection, for example, that many model families have been implemented and tested throughout the years. In [11], an R-CNN (Region-CNN) system for object detection is developed, applying CNNs to bottom-up region proposals to localize and segment objects and finally classifying the resulting feature vectors with a Support Vector Machine (SVM). Compared to other approaches previously reported in the state of the art of this task, it outperforms their results. Since then, several architectures based on R-CNNs have appeared to detect objects: in [10], a Faster R-CNN is implemented to predict the region proposals of the image separately, greatly improving speed compared to the previous ones and obtaining the results shown in Figure 2.2.

More recently, CNNs are also being applied to problems in Natural Language Processing (NLP), like machine translation or text classification, obtaining interesting results. In [12], a sentence classification system is implemented by representing the input text as an array of vectors, just like an image can be represented as an array of pixel values.

So, in DL, a CNN is a deep neural network that processes data with a grid topology, using convolutional and pooling layers to extract features from it. Unlike several other networks, a CNN works with matrices and filters of n dimensions, taking into account the spatial dependency of pixel values. In a CNN, the connections between layers are restricted and the nodes of each layer share the same weights, making them detect the same characteristics in different areas of the image. As can be seen in Figure 2.3, only the last layers of the net are flattened using fully connected ones. A CNN can use this combination of convolutional and pooling layers to classify a dataset. In this example, it classifies the CIFAR dataset [13], which consists of 60,000 color images divided into 10 classes.

Figure 2.3: CNN architecture for classification [14]

The basic functionality of this architecture, which combines the layers defined below, is as follows:

1. The convolution layers scan their input looking for patterns. They are characterized by the number of independent filters, which determines the number of output images, by the kernel size, which gives the size of the sliding filter, and by the stride, which determines the number of pixels the filter slides.

2. As the rectifier or detector layer, the Rectified Linear Unit (ReLU) is usually chosen.

3. The pooling layers perform downsampling after the convolutional ones to reduce dimensionality, so that performance improves and computational efficiency increases. These layers are characterized by the size of the pooling window.

4. Finally, the output is flattened out to a vector and classified through the fully connected layers; a minimal sketch of such a stack is shown after this list.
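The sketch below, written in PyTorch, chains these four kinds of layers in the order just described; channel counts and kernel sizes are arbitrary examples and do not correspond to any specific network from the literature:

```python
import torch
import torch.nn as nn

# Minimal CNN sketch for 10-class classification of 3x32x32 images.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # convolution: scan for patterns
    nn.ReLU(),                                              # detector layer
    nn.MaxPool2d(kernel_size=2),                            # pooling: downsample to 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                            # downsample to 8x8
    nn.Flatten(),                                           # flatten only at the end
    nn.Linear(32 * 8 * 8, 10),                              # fully connected classifier
)

logits = cnn(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```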

As has already been seen at the beginning of this section for object detection, there are several CNN architectures available that have been key in building DL algorithms achieving high-accuracy results. Regarding image recognition, which is a task with an extensive research history, some architectures are worth mentioning too. Examples are AlexNet [15], containing 8 layers: 5 convolutional (some followed by max-pooling layers) and 3 fully connected ones, and VGGNet [16], which uses convolutional kernels of size 3x3 and max-pooling kernels of size 2x2 with stride 2. After the celebrated victories of those two, the ResNet model [17] appeared, providing the DL field with a novel architecture based on "skip connections", which allows training a Neural Network (NN) with a large number of layers while still having lower complexity than the VGGNet mentioned before. It also achieved an error rate that beat human-level performance on the dataset used.
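A skip connection simply adds a block's input back to its output. The following is a minimal sketch of one such residual block in PyTorch, not the exact ResNet configuration of [17]:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block sketch: two convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the "skip connection": add the input back

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # same shape in and out
```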

2.2.2.1 Generative Adversarial Networks

As has been said, there are multiple CNN architectures available, but this state of the art is going to focus on a few that are of special interest. Since this Thesis consists in synthesizing images based on some inputs, which will be discussed in the following sections, a generative model architecture should be used. Generative modeling involves automatically learning patterns from the input in such a way that the model outputs data that apparently could have been obtained from the original set. This is where Generative Adversarial Networks (GANs) come in.

GANs are able to produce or generate new content using DL architectures, such as CNNs. Their architecture was first described in [18] in 2014, but a year later a standardized approach called Deep Convolutional Generative Adversarial Network (DCGAN) [19] was developed, which finally led to more formalized models. It is important to highlight that the DCGAN uses strided convolutions instead of pooling layers to increase and decrease the spatial dimensions of the features, and that it uses Batch Normalization (BN) so that zero mean and unit variance hold in all layers. The final target of this formalized model is to stabilize learning while dealing with poor weight initialization.

Figure 2.4: GAN architecture [20]

As Figure 2.4 shows, a GAN involves two sub-models: a generator for generating new data and a discriminator for deciding (classifying) whether the generated data is real or fake. So the aim of the first one is to maximize the probability of making the discriminator mistake its inputs as real, while the second one aims to guide the generator to create more realistic images.

At first, the generator does not know how to produce images that resemble the real ones, that is, the ones from the training dataset, and the discriminator does not know how to classify images into real and fake. This is why the discriminator model receives two different batches: one with the true images and another one with the images generated from noise. During training, the generator learns how to output images that resemble those of the training set.

The complex part of this architecture is that it needs two losses so that the discriminator outputs probabilities close to 0 for fake images and close to 1 for real images. One loss pushes the probabilities for the real ones towards 1 and the other pushes the probability of the fake ones towards 0. Thus, the total loss for this sub-model is the sum of those partial losses.
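As a sketch of how those two partial losses are usually combined, the snippet below uses binary cross-entropy on the discriminator's output probabilities; this is one common choice in PyTorch, not necessarily the exact formulation of the papers cited:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross-entropy on the discriminator's probabilities

def discriminator_loss(d_real, d_fake):
    # Push real images towards 1 and generated (fake) images towards 0;
    # the total discriminator loss is the sum of the two partial losses.
    real_loss = bce(d_real, torch.ones_like(d_real))
    fake_loss = bce(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

def generator_loss(d_fake):
    # The generator tries to make the discriminator label its outputs as real.
    return bce(d_fake, torch.ones_like(d_fake))
```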

So, as can be seen, GANs have the potential to expand DL horizons, and researchers know it, since they have been developing many techniques for training GANs. This architecture provides a pathway to problems that require a generative solution, such as the target of this Thesis.

2.2.2.2 Autoencoders

When talking about generative neural network models, autoencoders seek to "reconstruct" their input, that is, to output data identical to the input by learning an identity function.

Basically, an autoencoder can be thought of as two sub-networks, as can be seen in Figure 2.5. The encoder accepts the input and compresses it into the latent-space representation, while the decoder takes this representation and reconstructs the data.

Figure 2.5: Autoencoder architecture [21]
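A minimal sketch of this encoder/decoder pair in PyTorch follows; the dimensions are arbitrary examples (e.g. flattened 28x28 images) and not those of any model used later in this project:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder sketch: compress to a latent vector, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)           # latent-space representation
        return self.decoder(z)        # reconstruction of the input

model = Autoencoder()
x = torch.rand(16, 784)               # a batch of flattened dummy images
loss = nn.MSELoss()(model(x), x)      # trained to reproduce its own input
```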

While both GANs and autoencoders are generative models, GANs generate new and realistic data but autoencoders simply compress inputs into a latent-space representation.

So, autoencoders can be seen as neural networks used for applications such as dimensionality reduction or denoising, as well as for outlier detection. Outside the Computer Vision area, autoencoders are used in NLP applications, such as machine translation [22]. These models, combined with other networks, can lead to interesting architectures.

2.2.3 Recurrent Neural Networks

It is important to highlight that a neural network's input data does not always have to be static, since there is data that depends on past instances of itself to predict future ones. Applications such as speech recognition or machine translation in NLP, stock price prediction, or spam detection, among others, process this kind of data. The neural networks that address this kind of data, temporal or sequential, are Recurrent Neural Networks (RNNs). Basically, RNNs store the last output calculated in their memory and use it to predict the new output.

There are many possible architectures for RNNs, and common to all of them is that, as has been described, they feed their outputs from a previous time step as inputs to the net. One of these architectures, a Long Short-Term Memory (LSTM) [23], is shown in Figure 2.6.

Figure 2.6: LSTM architecture [24]

An LSTM shares information through the network, learning from it to predict future data using a memory cell, which is represented in the diagram above. This cell's inner operations can be explained as follows:

1. The first gate decides which details have to be discarded from the block, using a sigmoid function that looks at the previous state.

2. The second gate decides which values from the input are going to be used to modify the memory. A sigmoid function does that task, while a tanh one gives weightage to the values passed, depending on their importance.

3. Finally, the output gate consists of a sigmoid function which, again, decides which values to let through the net, and a tanh one, giving weightage. Basically, the output is decided depending on the input and the memory of the block. The standard update equations are written out after this list.
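In these equations (the common LSTM formulation, whose notation may differ slightly from that of [23]), σ denotes the sigmoid function and ⊙ the element-wise product:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right)          && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right)          && \text{input gate}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right)   && \text{candidate memory}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t               && \text{memory update}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right)          && \text{output gate}\\
h_t &= o_t \odot \tanh(c_t)                                    && \text{new hidden state}
\end{aligned}
```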

When images are used as the input of an RNN, researchers have combined it with a CNN, where the output of the latter is the input of the former. In [25], the problem of multi-label image classification failing because label dependencies in an image are not fully exploited is approached by proposing a CNN-RNN framework in which an image-label embedding is learned to characterize the semantic label dependency. Another example of this combination, but with LSTMs, is [26], where temporal dynamics and convolutional perceptual representations are both learned for a visual recognition task, showing good results compared to the state of the art. Many possible architectures using this combination are proposed in this paper. Figure 2.7 shows how the proposed model, LRCN, processes variable-length visual inputs using the CNN to feed the LSTM, sharing their weights across time.
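A minimal sketch of this CNN-then-LSTM pattern in PyTorch is shown below; it is a generic illustration with arbitrary sizes, not the exact LRCN configuration of [26]:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch: per-frame CNN features fed to an LSTM over time."""
    def __init__(self, feature_dim=128, hidden_dim=256, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feature_dim))
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                      # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))      # apply the same CNN to every frame
        feats = feats.view(b, t, -1)               # back to (batch, time, feature_dim)
        out, _ = self.lstm(feats)                  # model the temporal dynamics
        return self.classifier(out[:, -1])         # predict from the last time step

model = CNNLSTM()
y = model(torch.randn(2, 8, 3, 64, 64))            # 2 clips of 8 frames -> (2, 10)
```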

Figure 2.7: LRCN architecture [26]

2.3 DL and Image Synthesis

DL is attracting a lot of attention and interest, being in constant development and achieving unprecedented levels of success. Its algorithms have even outperformed humans in many tasks, such as in Computer Vision.

In the past few years, there has been a rapid growth of research in GANs, which have been defined before. Several fields are making use of these neural networks, or of architectures based on them, to give solutions to, for example, translating an input image into an output one. Traditionally, this task has been approached with techniques such as stitching together small patches of images [27]. Nowadays, translating an image or an object into another image is tackled as in [28], where a CGAN is proposed to learn this mapping. In Figure 2.8, an example output of the released software for image-to-image translation can be seen. The results of this network, Pix2Pix [29], suggest that this approach is effective, and many internet users have been posting their own results using it. Such is its impact that a Pix2PixHD net [30] has already been implemented for synthesizing high-resolution images, in particular 2048x1024, outperforming existing methods and also generating different results from the same input, allowing a user to edit them interactively.

Figure 2.8: Example results of Pix2Pix net on automatically detected edges compared to ground truth [28]

Another particular area in which progress is getting remarkably good is face generation. Figure 2.9 shows the progress of research in this field over just 4 years, making it possible to generate lifelike faces using neural networks. Generating realistic facial images with different facial expressions, or while keeping the identity information, is an open research topic which is having a deep impact on face recognition, image augmentation, and even face aging and face-to-face translation.

In the first field mentioned, face recognition, a huge dataset is needed for training. There are many available datasets created by companies to train researchers' nets. [32] uses the VoxCeleb2 dataset [33], which contains over 1 million utterances from YouTube videos of over 6,000 celebrities, computing spectrograms from its raw audio to use them as the input of a CNN and finally recognize identities successfully. Another available dataset, used in [34] to recognize words spoken by a human using just the video and not the audio, is the LRW dataset [35]. It consists of 1,000 samples of each of 500 words, spoken by hundreds of speakers. There is a sentence version [36] into which this dataset has evolved, and it has been tried as the input of a Watch, Listen, Attend and Spell (WLAS) network [37] that operates over visual, audio or both inputs to lip-read, outperforming previous techniques.
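Turning raw audio into such a spectrogram is a short preprocessing step. The sketch below uses librosa as one common choice; the file name and parameter values are illustrative assumptions and do not reproduce the exact configuration of [32]:

```python
import librosa
import numpy as np

# Sketch: convert a raw audio file into a log-mel-spectrogram that a CNN can
# consume as a 2D "image" (parameter values are illustrative only).
waveform, sample_rate = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                     n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, time_frames)
```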

Figure 2.9: Face generation through the years. Faces on the left were created by Artificial Intelligence in 2014 and the ones on the right, in 2018. [31]

In one of the previous papers, the audio spectrogram was used as the input of a network. There are several other human features used in face reconstruction or generation, such as key points of the facial structure or the pose, which are worth mentioning together with how to extract them. In [38], a simple CNN is trained to reconstruct a 3D face structure from a 2D image through a representation in UV space which predicts dense alignment. This is achieved using the 300W-LP [39] dataset as the training set of the proposed Position map Regression Network (PRN), resulting in a method robust to illumination, pose, and occlusions. The code for this model is also available in [40].

To help with this field of producing faces under many circumstances and extracting features such as key points, several tools and frameworks are being developed by researchers. In [41], OpenFace, an open-source tool that detects landmarks, head pose, and eye-gaze, among others, has been developed. In Figure 2.11, its analysis pipeline can be seen with each of the features extracted. It is worth highlighting the importance of implementing a model which learns how to extract these characteristics for later applying them to some task as input data.

Figure 2.10: Dense alignment, including key points, and 3D reconstruction results for [38]
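As an illustration of extracting such facial key points, the sketch below uses dlib's 68-point landmark predictor; the image and model file names are assumptions, and this is not necessarily the extraction tool used later in this Thesis:

```python
import dlib
import cv2

# Sketch: detect a face and extract its 68 landmark points with dlib.
# "shape_predictor_68_face_landmarks.dat" is dlib's publicly released model file.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
for face in detector(gray):
    shape = predictor(gray, face)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```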

It has already been shown how to create photo-realistic faces, but what about creating photo-realistic talking heads? Producing a virtual person or animated being that sounds and appears real is a challenge for applications such as special effects. [42] is one of the first approaches for real-time facial reenactment of a target video. It basically animates the facial expressions of the target video with a source video recorded by a webcam, rendering the output in a realistic way. Figure 2.12 shows how they address facial identity recovery, obtaining successful results.

X2Face [43] is a neural network that controls pose and expression, taking two frames as input: a source one and a driving one, the first being the input of the embedding submodel and the second, the input of the driving submodel. This can be understood from Figure 2.13. The embedding network learns how to map from the source frame to a representation, and the driving network learns how to transform the pixels from this representation into a generated frame. It is said to control pose, expression, or identity because it does not make assumptions about them, using the ones in the generated frame.

Figure 2.11: OpenFace behaviour analysis pipeline, including facial action unit recognition [41]

Figure 2.13: X2Face network during the initial training stage [43]

[44] is one of the most famous approaches to this talking head generation task. Its LSTM model takes Obama's audio as input and converts it to a time-varying sparse mouth shape, generating, based on it, a realistic mouth texture which is composited into the mouth region of a video, as shown in Figure 2.14.

Figure 2.12: Results of the reenactment system Face2Face [42]

Figure 2.14: System to synthesize Obama's talking head [44]

Figure 2.15 presents a comparison between Face2Face and the previous net for four different samples of the same speech, using the same video. The second method can synthesize a more realistic mouth, showing natural creases and clearer teeth.

Figure 2.15: Results comparison between Face2Face and Obama’s talking head generation model [44]

There are papers similar to the Obama one being published, with researchers trying to improve their results based on them. Another neural network that uses both audio and still images as input is Speech2Vid [45], which generates a video of a talking face, but this time using an encoder-decoder CNN model and showing that there is a relation that allows generating video data based on audio sources. Figure 2.16 shows this model architecture, with an emphasis on the deblurring block, which is used to refine the output frames.

Figure 2.16: Overall Speech2Vid model [45]

To try to outperform the Speech2Vid model, a new conditional adversarial network is presented in [46], also to generate a video of a talking face. In this case (Figure 2.17), a multi-task adversarial model is trained to treat the audio input as a condition for the recurrent adversarial network, trying to make the transitions of the lips and facial expression smoother. It is important to mention that, to reduce the size of the set without reducing quality, phoneme distribution information has been extracted from the audio. Results show a superior and more accurate visual representation.

Figure 2.17: Proposed conditional recurrent adversarial video generation model [46]

Even though facial expression variation and speech semantics are coupled together because of the movement of the talking face, [47] learns disentangled audio-visual representations through a training process that generates a more realistic face with clear motion patterns. Figure 2.18 shows this model, in which three encoders take part: one for Person-ID information from a visual source and the other two for Word-ID, to extract speech information from visual and audio sources.

Figure 2.18: Proposed Disentangled Audio-Visual System [47]

To end this review, [1] implements a talking head model which, unlike most recent works, learns from a few images (few-shot) instead of just a single one. Figure 2.19 shows this model's architecture, which takes the image key points as the input of the generator. It is also able to initialize the parameters of both submodels, generator and discriminator, in a person-specific way. Since landmarks from a different person are used, the lack of landmark adaptation is usually a problem for this task, but this system achieves a high-realism solution to it (Figure 2.20).

Figure 2.19: Few shot Model Architecture [1]

Figure 2.20: Few shot Model Results compared to other models seen [1]

As mentioned before, huge achievements have been made in the face translation, or face-to-face, field. Being able to animate a still image of a face, whether it is real or not, is a challenging task that can produce mixed feelings due to ethical issues, as can also happen in others of the applications previously mentioned. These ethical issues will be discussed later in this Thesis.

Finally, Table 2.1 shows a brief summary of each of the state of the art algorithms which have been studied and described previously.

SUMMARY: DL and image synthesis. Each entry lists Cite, Paper, Input Data, Model, Dataset and Evaluation (best results).

[32] VoxCeleb2: Deep Speaker Recognition (Jun 2018)
     Input data: Audio spectrogram (Hamming window of 25 ms, 10 ms step)
     Model: VGGVox, based on VGG-M and ResNet architectures
     Dataset: Train: VoxCeleb2. Test: VoxCeleb1.
     Best results: Cost function C_det of 0.429 and Equal Error Rate (EER) of 3.95%.

[34] Lip Reading in the Wild (Nov 2016)
     Input data: Mouth region images
     Model: Four different VGG-M models, differing in architecture and how they "ingest" input data
     Dataset: LRW
     Best results: Top-1 accuracy of 65.4% and Top-10 accuracy of 92.3%.

[37] Lip Reading Sentences in the Wild (Jan 2017)
     Input data: Lip region images (120x120) and MFCC features (25 ms windows at 100 Hz, time-stride of 1)
     Model: WLAS network: three modules (Watch, Listen and Spell), all LSTM with cell sizes 256, 256 and 512, respectively
     Dataset: LRSW
     Best results: Audio and lips: Character Error Rate (CER) of 7.9%, Word Error Rate (WER) of 13.9% and BLEU metric of 87.4.

[38] Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network (Mar 2018)
     Input data: 256x256 images and 3D Morphable Model (3DMM) parameters to generate the UV map
     Model: PRNet, a light-weighted CNN
     Dataset: Train: 300W-LP. Test: AFLW and Florence.
     Best results: Mean Normalized Mean Error (NME) for 3D Face Alignment of 3.62%; for 3D Reconstruction, a mean NME of 3.7551%.

[41] OpenFace: an open source facial behavior analysis toolkit (Apr 2016)
     Input data: Input image or sequence
     Model: Conditional Local Neural Fields (CLNF), based on the Constrained Local Model (CLM) [48], with two components (Point Distribution Model and patch experts)
     Dataset: Train: Multi-PIE, LFPW and Helen. Test: AFW, BU, SEMAINE and MPIIGaze.
     Best results: Among others, mean absolute degree error of 2.6 on the Biwi dataset.

[42] Face2Face: Real-time Face Capture and Reenactment of RGB Videos (Jun 2016)
     Input data: Descriptors of a frame: landmarks, expression parameters, rotation and LBP (Local Binary Pattern)
     Model: Dense, global non-rigid model-based bundling
     Dataset: C. Cao & K. Zhou: shape and comparison data. V. Blanz, T. Vetter & O. Alexander: face data. A. Dai: voice. D. Ritchie: video reenactment.
     Best results: Evaluation based on visual comparison.

[43] X2Face: A network for controlling face generation by using images, audio, and pose codes (Jul 2018)
     Input data: Video frames (differing factors of variation) and audio features
     Model: X2Face: embedding network (U-Net and pix2pix) and driving network (encoder-decoder)
     Dataset: Train: VoxCeleb, AFLW and LRW. Test: VoxCeleb.
     Best results: Mean absolute error (MAE) in degrees for head pose regression of 9.36.

[44] Synthesizing Obama: learning lip sync from audio (Jul 2017)
     Input data: Audio MFCCs, 25 ms sliding window with 10 ms sampling interval
     Model: LSTM (60 LSTM nodes, 20-step time delay)
     Dataset: 14 hours of Barack Obama's videos
     Best results: Evaluation based on visual comparison.

[45] You said that? (May 2017)
     Input data: 112x112x3 images and audio MFCC (0.35-second audio at a 100 Hz sampling rate)
     Model: Speech2Vid: encoder-decoder CNN model that uses a joint embedding of face and audio (audio encoder, identity encoder and image encoder)
     Dataset: VoxCeleb and LRW
     Best results: Evaluation based on visual comparison.

[46] Talking Face Generation by Conditional Recurrent Adversarial Network (Apr 2018)
     Input data: Audio MFCC (350 ms) and cropped lip frames (128x128)
     Model: Audio encoder, image encoder, image discriminator and image decoder architectures, all constructed with convolutional or deconvolutional networks
     Dataset: TCD-TIMIT, VoxCeleb and LRW
     Best results: Peak Signal-to-Noise Ratio (PSNR) of 27.43, Structural Similarity index (SSIM) of 0.918, lip-reading Top-5 accuracy of 63% and Landmark Distance Error (LMD) of 3.14.

[47] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (Jul 2018)
     Input data: Face from video frames (256x256) and audio MFCC (sampling rate of 100 Hz)
     Model: DAVS system: three encoders based on VGG-M, FAN [49] and [50], respectively; the decoder contains 10 convolution layers
     Dataset: LRW and MS-Celeb-1M
     Best results: For the audio approach, PSNR of 26.7 and SSIM of 0.883; for the video approach, PSNR of 26.8 and SSIM of 0.884.

[1] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (May 2019)
     Input data: K frames and face landmarks
     Model: Three networks: Image Embedder, Generator and Discriminator
     Dataset: VoxCeleb1 and VoxCeleb2
     Best results: For K=32, a Frechet Inception Distance (FID) of 30.6, an SSIM of 0.72, a cosine similarity (CSIM) of 0.45 and a user accuracy of detecting fakes of 0.33%.

Table 2.1: DL and image synthesis summary


Chapter 3

Development setup

In this section of the thesis report, an overview of the proposed project system architecture will be presented, involving the main development configurations, frameworks, and tools used to carry out the project.

When it comes to combining huge amounts of data processing with deep learning algorithms, several solutions have been implemented by researchers in the past years to make development easier, since massive processing power and the ability to handle different data layers are required. Some of these solutions have gained a lot of interest among developers from all over the world, providing them with different tools that enable deep learning applications research and production.

3.1 Federated Learning

As it has already been said, massive processing power and a huge amount of data are needed for tasks carried out using deep learning algorithms. To free up Central Processing Unit (CPU) cycles in the device for other jobs that do not concern graphical and mathematical computations, a Graphics Processing Unit (GPU) with the Nvidia CUDA toolkit should take part in this process. Nvidia CUDA-X [51] is a software stack for developers that provides a way to build high-performance GPU-accelerated applications, taking advantage of optimizations such as mixed precision compute on Tensor Cores and accelerating a set of models. One of its libraries worth mentioning is cuDNN [52], which makes it possible to implement highly tuned routines, like forward convolution and pooling, for deep neural networks.

Traditional deep learning tasks involve uploading data to a server and using it to train a model. In other words, and applied to our case of study, due to the complexity of the algorithms chosen and, with it, the complexity of this project's task, which needs a big representative collection of samples, traditional ways of processing and training are not enough. In recent years, researchers and developers have had access to devices with enormous amounts of storage space, but it never seems to be enough. It is also quite common to have data spread over different devices and to have to spend lots of time and power centralizing it on a single one, which will be the one used to train the model. Centralizing personally-identifiable information can also become a privacy problem when using data obtained from different users, which might be our case, since faces of different people from around the world are being used to train the model. These problems regarding data quantity and quality cannot be solved with the traditional, centralized way of training machine learning models. This is where Federated Learning appears.

Figure 3.1: Federated learning general representation [53]

Federated learning is a training technique that basically enables collaborative learning of the same model by several devices. The model is trained on a server using the data stored there, and then every other device downloads this same model and improves it using its own local data. The changes to this improved model on each device are sent back to the main server, where the models are averaged to obtain a combined one. Figure 3.1 shows a generalized representation of how this works: the phone trains the model locally (A), many other devices create updates (B), and these updates are averaged to form a change to the shared model (C).
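To make the averaging step concrete, the following sketch shows one communication round of federated averaging in PyTorch, under the assumption that each device trains a local copy of the shared model and the server simply averages the returned parameters; the helper names (local_update, federated_average) and the SGD settings are illustrative, not the project's actual code.

```python
# Minimal federated-averaging sketch (assumed helpers, not the project's code).
import copy
import torch

def local_update(model, loader, epochs=1, lr=1e-3):
    """Train a local copy of the shared model on one device's own data."""
    local_model = copy.deepcopy(model)
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(local_model(inputs), targets)
            loss.backward()
            optimizer.step()
    return local_model.state_dict()

def federated_average(state_dicts):
    """Average the parameters returned by every participating device."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        for other in state_dicts[1:]:
            avg[key] = avg[key] + other[key]
        avg[key] = avg[key] / len(state_dicts)
    return avg

# One communication round (illustrative usage):
# updates = [local_update(global_model, loader) for loader in device_loaders]
# global_model.load_state_dict(federated_average(updates))
```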

Having said this, and being able to make use of several devices, the proposed decentralized system architecture, shown in Figure 3.2, consists of the following devices and configurations based on the implementation developed in [54], which presents a method for federated learning of networks based on iterative model averaging, proving that it can be made practical with few rounds of communication between devices. This configuration allows faster deployment and testing of the project's model, consuming less power and time.

Figure 3.2: Proposed federated learning system architecture

Since the database, which will be described later, contains a huge amount of data and given the limited time available to develop the project, six different devices from the same network were used in the data preparation step. In the architecture representation, four devices equipped only with a CPU were used for this preprocessing and storage, sending the data obtained through a virtual link to both the main server and a second server. These two servers are equipped with two GPUs and one GPU, respectively, and each one trains the model, saving its local updates. The second server periodically sends its model update to the main server, which computes the average between this information and its own, creating an update for the global model, which is also stored on this server.

The following Table 3.1 summarizes the hardware setup indicated in the proposed architecture used for the development of this project.

Device Pseudonym   Processor                                          OS Version           GPU
PC 1               Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz           Ubuntu 18.04.1 LTS   -
PC 2               Intel(R) Core(TM) 2 Quad CPU Q6600 @ 2.40GHz x 4   Ubuntu 18.04.4 LTS   -
PC 3               Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz x 8      Ubuntu 18.04.1 LTS   -
PC 4               Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz x 8      Ubuntu 16.04.6 LTS   -
SERVER 1           Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz x 20      Ubuntu 18.04.4 LTS   2 x GeForce RTX 2080 Ti/PCIe/SSE2
SERVER 2           Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz x 8        Ubuntu 18.04.3 LTS   GeForce GTX 1080 Ti/PCIe/SSE2

Table 3.1: Hardware used in the development of the Master’s thesis

As has been mentioned, both devices that serve as training machines, SERVER 1 and SERVER 2, have GPUs, which are accessed through CUDA toolkit versions 10.1 and 10.2, respectively.

3.2 PyTorch

Once the hardware setup has been defined, the next step is to describe the software, provided by researchers and developers, that has been chosen for this project, starting with the main framework.

A few years ago, Theano was practically the only player in the field of DL frameworks. Today, there is a myriad of frameworks at the developer's disposal that allow them to develop any kind of ML application in a simpler way, each of them built in a different manner depending on its purpose. These frameworks, thanks to a high-level programming interface, offer different building blocks for designing, training, and evaluating DL neural networks. Table 3.2 provides a brief summary of the most popular DL frameworks.

Even though Facebook's PyTorch (2016) [55] is still relatively new compared to other frameworks, it is already gaining ground on TensorFlow [56], Google's flagship deep learning framework. As a matter of fact, Figure 3.3 shows a comparison graph between the two: the number of unique mentions of TensorFlow and PyTorch implementations in papers at each of the top research conferences over time. This information has been extracted from The Gradient [57], a digital magazine about research and trends in artificial intelligence.

PyTorch, based on the Torch [58] library but running on Python, is used for modeling deep neural networks and executing high-complexity tensor computations, offering a dynamic graph definition. This last feature benefits the implementation of neural network architectures. For example, if a sentence-analysis model using RNNs is implemented, static graphs will not allow the input sequence length to change, forcing a fixed maximum length and padding smaller sequences with zeros. This is not convenient at all, so PyTorch makes it easy to build computational graphs that can change at run time.
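As a brief illustration of this flexibility, the snippet below (a minimal sketch, not taken from the project's code) runs the same GRU on sequences of different lengths without any padding, because PyTorch rebuilds the computation graph on every forward pass; the layer sizes are arbitrary.

```python
# Dynamic graphs: variable-length inputs without padding (illustrative sizes).
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=32, batch_first=True)

for seq_len in (5, 12, 37):              # lengths may change at run time
    x = torch.randn(1, seq_len, 10)      # (batch, time, features)
    output, hidden = gru(x)              # the graph is built dynamically here
    print(output.shape)                  # torch.Size([1, seq_len, 32])
```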

Theano [59] (University of Montreal, 2007). Platform: Linux, macOS, Windows. Languages: Python.
     Notes: tight NumPy integration; extensive unit-testing and error diagnosis; use of GPU.

TensorFlow [60] (Google, 2015). Platform: Linux, macOS, Windows, iOS, Android. Languages: Python and JS (TensorFlow.js), Java, C, Go (experimental).
     Notes: model deployment on mobile or embedded devices; large production environments; extensive coding and static computation graph.

Keras [61] (François Chollet, Google engineer, 2015). Platform: Linux, macOS, Windows. Languages: Python.
     Notes: used for data integration functions; runs on top of TensorFlow, CNTK, and Theano (Keras sits on a higher level); support for data parallelism; limited low-level computation.

PyTorch [62] (Facebook's AI Research lab, 2016). Platform: Linux, macOS, Windows. Languages: Python, C++, CUDA.
     Notes: precise and readable code; small projects and prototyping; dynamic graphs; tensor computation.

Microsoft Cognitive Toolkit (CNTK) [63] (Microsoft Research, 2016). Platform: Windows, Linux. Languages: C++.
     Notes: limited support outside the Microsoft community; scales well in production using GPUs.

NVIDIA Caffe [64] (University of California, 2017). Platform: Linux, macOS, Windows. Languages: C++ with Python interface.
     Notes: Caffe2 is part of PyTorch; support for RNNs is poor; no hard coding for model definition.

MATLAB (Deep Learning Toolbox) [65] (MathWorks, 2018). Platform: Linux, macOS, Windows. Languages: MATLAB, C++, Java.
     Notes: apps for labeling and generating synthetic data; supports Python interoperability; easy migration.

Swift for TensorFlow (S4TF) [66] (Google, 2018). Platform: macOS, Linux. Languages: Swift.
     Notes: a good option if dynamic languages are not suited to the project.

DL4J [67] (Eclipse Foundation, 2019). Platform: Linux, macOS, Windows, iOS, Android. Languages: Java, with support for Scala and Kotlin.
     Notes: brings a Java environment to execute DL; multi-threaded and single-threaded frameworks.

MXNet [68] (Apache Software Foundation, 2020). Platform: Linux, macOS, Windows. Languages: Python, with support for Scala, Java, C++, and R, among others.
     Notes: portable and scalable; multiple GPUs; allows mixing symbolic and imperative programming.

Table 3.2: Different existing DL frameworks

Figure 3.3: PyTorch Vs. TensorFlow: Number of Unique Mentions. Conference legend: CVPR, ICCV, ECCV - computer vision conferences; NAACL, ACL, EMNLP - NLP confer- ences; ICML, ICLR, NeurIPS - general ML conferences. [53]

Another ability provided by this framework is defining basic building blocks that extend functionality, since object-oriented approaches are used. The datasets and nn.Module modules are an example of this, containing wrappers for datasets used to benchmark architectures and providing a way to create complex architectures, respectively. It is true, though, that recent versions of TensorFlow are adapting to this whole ecosystem of wrappers, trying to simplify the way developers work with the library and providing the freedom to use it with other frameworks, such as Keras, to suit some tasks.

One of the biggest characteristics distinguishing PyTorch from TensorFlow is splitting batches of data samples and running the computation for each split in parallel over multiple GPUs using torch.nn.DataParallel. TensorFlow, instead of working this way, lets the developer configure each operation to run on a specific device.
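A minimal sketch of this mechanism is shown below, assuming the two GPUs of SERVER 1 are visible; the placeholder model and batch size are illustrative only.

```python
# Splitting a batch across several GPUs with torch.nn.DataParallel (sketch).
import torch
import torch.nn as nn

# Placeholder model; any of the networks described later could be used instead.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate the model on every visible GPU
model = model.to(device)

batch = torch.randn(8, 3, 256, 256, device=device)
out = model(batch)                   # the batch is split along dim 0 and gathered back
```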

These and many other advantages make PyTorch feel more like a framework, whereas TensorFlow may be seen more as a library. However, both are capable and competitive deep learning tools with robust visualization options that can be used for high-level model implementation and development.

In this project, PyTorch version 1.2.0 (the current stable release being 1.5.0), which is the latest one fully tested and supported [69], has been used together with the recommended torchvision version 0.4.0. This last package provides datasets, model architectures, and image transformations.

3.3 Libraries

It is well known that Python is the go-to choice for deep learning developers due to its huge offering of libraries that provide features and flexible configurations that increase both productivity and code quality, easing the workload. In this project, some libraries are worth mentioning, which have been used for both data processing and feature extraction. These two processes will be explained in more depth in the following sections.

First of all, to download the dataset videos, there are several libraries available in Python with which, given the video reference and some resolution and format filters, the desired video can be obtained without restrictions. Pytube version 9.5.3 [70] is the one chosen for this project; together with the FFmpeg [71] tool, the videos were collected at the desired frame rate along with their audio.
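The sketch below illustrates this step, assuming the pytube 9.x API and an ffmpeg binary available on the system path; the example URL is the one shown later in Figure 4.4, and the output paths and filter options are illustrative only.

```python
# Downloading one dataset video and extracting its audio (hedged sketch).
import subprocess
from pytube import YouTube

url = "https://www.youtube.com/watch?v=2DLq_Kkc1r8"   # example reference from the dataset
yt = YouTube(url)
stream = (yt.streams.filter(progressive=True, file_extension="mp4")
          .order_by("resolution").desc().first())      # best available resolution
stream.download(output_path="raw", filename="video")

# Re-encode to 25 fps and extract the stereo 44.1 kHz audio track with FFmpeg.
subprocess.run(["ffmpeg", "-i", "raw/video.mp4", "-r", "25", "raw/video_25fps.mp4"])
subprocess.run(["ffmpeg", "-i", "raw/video.mp4", "-vn", "-ac", "2",
                "-ar", "44100", "raw/audio.wav"])
```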

The next step is to manipulate the downloaded audio and video. For this there are several Python libraries, such as pydub [72] for audio and OpenCV [73], a well-known open-source library originally developed by Intel and widely used for computer vision. For this project, versions 4.1.2 of OpenCV and 0.23.1 of pydub have been installed.

To detect, compute, and extract the features needed as input to the network developed, versions 1.0.0 of Face Recognition and 0.7.2 of LibROSA have been used, respectively. The first one generates facial landmarks using an accurate face alignment network based on [49], which detects key points in 2D and 3D coordinates. The other is a package that helps with the analysis of audio files, providing tools to generate different information based on their characteristics.

As the process progresses, an evaluation of the model must be carried out. There are several libraries available in Python for this task, which is one of the main steps of the process, allowing the implemented model to be judged both numerically and perceptually. The main library that should always be consulted when looking for a metric that suits the task is the metrics module of the scikit-learn [74] library, which provides simple and efficient tools, such as score functions, distance computations, and performance and pairwise metrics, to analyze predictive models.

As can be seen, Python is a great programming language for analyzing data and developing deep learning tasks, with lots of modules, packages, and libraries to expand its capabilities, making it much easier for researchers and developers to carry out their work. In this project, Python release 3.6 has been used.

3.4 Overview of the proposed DL process

As said in the previous section, several libraries will be used for the different functions or steps in the process of building, training, and evaluating the project's model. The whole system will be described in detail in the following sections, but first a brief summary of the processes followed to achieve the project's target is given.

Figure 3.4 shows an overview of the processes that the data goes through during the training step. First, once collected, the data goes through a feature extraction process that generates the different inputs to the network (in this project's case, a set of frames from videos with their associated landmarks and a landmark image per frame of the video that will be modified). Then, the training process takes these inputs and adjusts the hyper-parameters of the model each time the whole dataset goes through it. This process synthesizes, for each input set, a frame based on the landmark image and the associated spectrogram from the input. Once this process has finished, the final evaluation is done by comparing different metrics that will be mentioned and described in the following chapter.

Figure 3.4: System architecture general training overview

With the model already trained and two different videos not used during the previous process, the final video solution is obtained. The first video will be, for example, a famous person speaking, whose face will be modified using the facial gestures and lip movements of the second video, in this case a video of myself. Both of them are used as inputs to the system, which performs the feature extraction, applies the adequate changes to each frame based on the trained model, and generates a final video with the characteristics described: a face translation from one video to another.

In Figure 3.5, this final application can be seen. In this case, the output, or final video, is a concatenated version of the input, its landmarks, and the generated video, so that the gestures can be identified and compared easily. The final output is not as realistic as it could be, since the model was trained for a short period of time due to limited resources.

Figure 3.5: System architecture final application overview

Chapter 4

Implementation

This section describes the steps carried out to reach the main objective of the project defined in the Introduction and objectives chapter. As in any other DL task, the whole implementation can be divided into the blocks shown in Figure 4.1, so the following subsections go into each of these blocks, explaining what has been done to achieve the final results.

Figure 4.1: Deep learning process block diagram

In the Annex Code available, a brief description of the project can be seen, providing a link to check its functionality, since the code itself belongs to a private project and is not available.

4.1 Data collection

To start with the DL block diagram, sufficient suitable data has to be collected to use as the input of the neural network that is going to be implemented for this project's task. Collecting data for a model of this kind captures a record of labeled events, or events that have already happened, so that they can be analyzed, letting the network find recurring patterns and learn from them. Basically, these predictive models are only as good as the data from which they are built, so this step is crucial to develop a high-performing DL model.

Data is constantly being generated at an unprecedented rate and used for a huge number of applications. In DL, gathering data is an active research topic because, although it saves costs in feature engineering, since these techniques normally generate features from the data themselves, large databases are required. In this project's field, and as seen in the State of the art, there are several datasets of collected videos of people that researchers use to extract features from and feed to their neural networks. As can be seen in Table 2.1, one of the most used is VoxCeleb2.

As mentioned in [32], VoxCeleb2 contains 1 million utterances from YouTube videos of 6,000 celebrities, 61% of whom are male, which is fairly gender-balanced, and it covers different accents, ages, and ethnicities, making it a well-generalized dataset (Figure 4.2). These videos include interviews from different situations, such as red carpets or indoor studios, which means that the audio might be noisy due to degradation from background chatter or room acoustics, among others. Table 4.1 gives a brief description of the dataset statistics that might be of general interest.

VoxCeleb2 dataset
Number of POIs:          6,112
Number of male POIs:     3,761
Number of female POIs:   2,351
Number of hours:         2,442
Number of videos:        150,480
Avg. videos per POI:     25

Table 4.1: VoxCeleb2 description

Figure 4.2: VoxCeleb2 faces of speakers in the dataset.

VoxCeleb2 provides separate training and testing sets, which is quite useful to help researchers evaluate their models. In this project's case, to test the models trained with this dataset, two different kinds of videos will be used: an original one, in which there is a person who may or may not be speaking, and to which the translation of facial gestures from another video will be applied. These two types of videos can be chosen according to the taste and needs of the researcher, so the different videos downloaded and home-made for this purpose are described in the experiments chapter (Simulations and results).

To finally obtain this dataset, several options are provided, such as downloading the audio files, the cropped video files, or the YouTube URLs. The last option was chosen, since it also includes the timestamps of the celebrity utterances. The files downloaded, after filling in a form [75] to specify the purpose and request access, consisted of several celebrity-identified folders containing sub-folders for each video which, in turn, contain one or more files indicating the reference to the YouTube page and the utterance frames with the coordinates of the bounding box surrounding the celebrity's face. Figure 4.3 shows the way these downloaded files are presented and the folder organization. In this case, the whole dataset is inside the ”txt” folder, which consists of 5,994 celebrity folders, fewer than the total of 6,112 specified previously, probably because the file list has not been updated. Since, as explained later, the entire dataset is not going to be used, this is not a problem. The celebrity folders are named after an id; for example, the folder ”id00012” contains 23 video folders named after the YouTube reference, and the first one of them, the ”2DLq Kkc1r8” folder, contains 3 txt files. In Figure 4.4, the content of file ”00016.txt” can be seen, consisting of the reference to the final YouTube link (”https://www.youtube.com/watch?v=2DLq_Kkc1r8”) and the bounding box coordinates for each frame. Figure 4.5 shows that this URL links to a video example from the dataset that will be downloaded.

Figure 4.3: VoxCeleb2 downloaded folders organization

Figure 4.4: VoxCeleb2 txt file with video information to download it

Figure 4.5: VoxCeleb2 video example from YouTube

Due to the short development time and limited storage resources, even though a total of 6 devices were involved, only a small portion of the dataset's videos could finally be downloaded and prepared. They were downloaded in the best resolution available and transformed to 25 frames per second, since the timestamps provided assume the video is stored at this frame rate. It is important to highlight that the audio was also extracted and saved during the process. Table 4.2 shows the final number of samples used in the two devices where the model has been trained.

Information             Server 1   Server 2   Total
Number of POIs          131        124        255
Number of videos        2,545      2,347      4,892
Number of utterances    19,350     17,738     37,088
Avg. videos per POI     19.43      18.93      19.18

Table 4.2: Final number of samples from VoxCeleb2 used

To see the amount of storage needed to hold the raw videos and audio, Table 4.3 gives some figures for the duration and size of the media files. The enormous amount of storage required explains why the entire dataset could not be downloaded, since this space has to be added to that of the subsequent feature-extraction processing.

Information                              Server 1       Server 2       Total
Total video duration                     250.32 hours   224.83 hours   475.15 hours
Total video & audio storage              62.2 GB        48.1 GB        110.3 GB
Total video & audio preprocess storage   400 GB         345 GB         745 GB
Total storage                            460 GB         455 GB         1 TB

Table 4.3: Dataset storage used

Finally, some information about the files: as said, the videos have been downloaded in the best resolution possible, with a frame rate of 25 fps and H.264 (High Profile) as codec, and their audio is stereo, AAC-encoded, with a 44,100 Hz sample rate.

4.2 Data preparation

To ensure data has the necessary format and information, this preparation or ”preprocessing” is a prerequisite that needs a huge amount of time to be analyzed and performed in every DL task. In this case, the VoxCeleb2 dataset will undergo a feature extraction process to reduce dimensionality and obtain a set of characteristics, or features, which describe it accurately. As already mentioned, both audio and visual features will be extracted.

4.2.1 Visual features extraction

Regarding the visual characteristics of a video of a person, it has been seen in the state of the art that it is quite common in this kind of task to use neural networks to extract facial landmarks, bounding boxes, or face alignment. In particular, it seems essential to calculate some points of interest, or landmarks, localizing the upper and lower contours of the eyes and the mouth along with some other facial key points, which allows face tracking. Figure 4.6 shows an example of these visual features that can be extracted from an image.

Figure 4.6: Visual features extraction [41]

There are multiple tools to perform landmark detection in each of the frames of the videos. As mentioned in the Development setup chapter, the face-alignment library allows detecting these key points based on the bounding box provided by the dataset. It is important to highlight that these coordinates are given assuming a video resolution of 224p, so a conversion depending on the resolution of the downloaded video had to be made. The face detection performed by the aforementioned package is based on the FAN network described in [49], which consists of four modified HourGlass (HG) CNN networks from [76] that take a facial image as input and generate a set of heatmaps, one for each landmark.

Figure 4.7: Visual feature extraction. Frame A from dataset video, bounding-box coordi- nates provided by dataset and landmarks extracted, respectively.

In Figure 4.7, these extractions can be seen on a frame of a video from the dataset. Thanks to the coordinates provided by the VoxCeleb2 dataset, a bounding box can be drawn, also taking the head pose estimation into account. Landmarks are usually represented as a set of points associated with the eyes, lips, nose, and other facial components. In this case, the landmarks are joined, obtaining a face representation of each of these components, which is possible because the relations between the extracted key points are known. Finally, the input of the neural network regarding visual features is a cropped version of the frame and its landmarks concatenated, as Figure 4.8 shows. Every frame-landmark pair is cropped to focus just on the person's face and transformed to the necessary input size, 256x256.
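A rough sketch of this per-frame processing is given below, assuming the face-alignment package (version 1.0.0) and an OpenCV frame; the exact layout of the bounding-box values read from the dataset txt files, and the scaling from 224p to the real resolution, are assumptions made for illustration.

```python
# Visual preprocessing sketch: rescale the 224p bounding box, crop the face,
# resize to 256x256 and detect the 68 facial landmarks with face-alignment (FAN).
import cv2
import face_alignment

fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device="cuda")

def crop_and_landmarks(frame, bbox_224):
    """frame: BGR image from OpenCV; bbox_224: assumed (x, y, w, h) for 224p video."""
    h, w = frame.shape[:2]
    sx, sy = w / 224.0, h / 224.0                     # rescale dataset coordinates
    x = int(round(bbox_224[0] * sx))
    y = int(round(bbox_224[1] * sy))
    bw = int(round(bbox_224[2] * sx))
    bh = int(round(bbox_224[3] * sy))
    face = cv2.resize(frame[y:y + bh, x:x + bw], (256, 256))
    # Detection is assumed to succeed here; a real pipeline should check for None.
    landmarks = fa.get_landmarks(cv2.cvtColor(face, cv2.COLOR_BGR2RGB))[0]  # (68, 2)
    return face, landmarks
```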

Figure 4.8: Final visual feature extraction. Input to the network consisting of frame and landmarks concatenated.

4.2.2 Audio features extraction

In the domain of audio analysis, the state of the art describes that audio samples are usually represented by their spectrograms and their Mel-Frequency Cepstrum Coefficients (MFCC). The first is a visual representation of the spectrum of frequencies as the signal varies through time, while the second describes the overall shape of the spectral envelope; in other words, it models the characteristics of the human voice. Sometimes, other characteristics, such as prosodic ones (pitch, duration, or stress), are used to complement these signal representations.

Both of these features were considered when extracting information from short segments of the audio files, where one segment corresponds to one frame of the video. In other words, since every video used was assumed to have a frame rate of 25 fps, the audio file was divided into short segments of 40 ms which, in combination with the configuration specified in Table 4.4, were finally used to compute the Mel-spectrogram. Besides the frame rate, the duration of 40 ms was chosen because this window interval is used in some state of the art papers that collect audio information, such as [77], and because a smaller window generates a more accurate representation in time of the words spoken by the person. The Mel-spectrogram was chosen over the MFCCs since, to calculate the MFCCs, the Mel-spectrogram is computed first, so it was considered sufficient to use the former.

Mel-spectrogram configuration
Duration:         40 ms
Sample rate:      44,100 Hz
NFFT:             2048
Hop length:       512
Power:            2.0
Fmin:             20 Hz
Fmax:             20,000 Hz
Number of mels:   256

Table 4.4: Information about spectrogram computation

Table 4.4 describes the different arguments received by the function from the package introduced in the Development setup chapter: the audio signal with a duration of 40 ms, the sample rate (44,100 Hz in this case), the length of the FFT window, the number of samples between successive frames, the exponent for the magnitude Mel-spectrogram (in this case 2, i.e. power), the lowest and highest frequencies, and the number of Mel bands to generate, respectively. The value of each parameter has been established taking into account different examples of audio feature extraction.

Basically, the NFFT value depends on the number of samples in the segment (N) and is taken as a power of 2. It should be greater than N, and the higher the number, the more frequency resolution the spectrogram will have; so, given the duration and sample rate, 2048 was chosen as NFFT. The hop length depends on the purpose of the analysis: more overlap means more points and, therefore, smoother results. For spectrogram display it is normally set to a quarter of the NFFT. The lowest and highest frequency values correspond to the human audio spectrum. Finally, the default number of mels is 128, but in this case a higher number of bands represented the spectrogram better.
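A possible implementation of this computation with LibROSA 0.7.x, using the values of Table 4.4, could look as follows; the audio path and frame index are illustrative.

```python
# Mel-spectrogram of one 40 ms audio segment, using the Table 4.4 configuration.
import librosa
import numpy as np

sr = 44100
frame_idx = 120                        # index of the 25 fps video frame (illustrative)

# Load the 40 ms audio segment aligned with that video frame.
signal, _ = librosa.load("raw/audio.wav", sr=sr,
                         offset=frame_idx / 25.0, duration=0.040)

mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=2048, hop_length=512,
                                     power=2.0, fmin=20, fmax=20000, n_mels=256)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log scale, as usually displayed
```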

The spectrogram is usually depicted as a heatmap in which the intensity variation can be seen through different colors or brightness, as shown in Figure 4.9, where the audio waveform and the extracted MFCCs can also be seen, to visualize how they look both when the person is talking and when they are not talking and there is only noise. In this case, the spectrogram image is also transformed to the necessary input size, 256x256.

Figure 4.9: Audio feature extraction. First column: frame A talking from dataset video, audio waveform from frame A, MFCCs and Mel-spectrogram, respectively. Second column: the same for frame B not talking.

4.3 Modeling

To continue with the process of DL, the network architecture has to be defined and described based on the different state of the art ones. This project’s architecture will use [1] as a base network but some modifications will be made to adapt it to the input dataset and proposed idea.

In Figure 4.10 an overall view of the different parts of the final architecture can be seen, as well as their inputs and outputs and how they are related to each other. As it can be seen, its task is to finally synthesize frames of a video sequence that contains speech expressions and facial gestures of a person based on a set of face landmarks.

Figure 4.10: Project's network architecture

The parts of the architecture can be easily differentiated and grouped into two feature Encoders (audio and image) and a GAN-based network that consists of the main components of a GAN, the Discriminator and the Generator, the latter combined with a previous encoder, both forming an autoencoder. This modified version of a GAN is meant to share information between the different layers of the Generator through the combination of feature vectors coming from the three different Encoders (audio, image, and the one combined with the Generator), which compress data from the video. As mentioned in the State of the art chapter, GANs are widely used in this field of image synthesis, since the Generator's job is to create candidate images while the Discriminator tries to evaluate whether they are real or not. Basically, the Generator trains to increase the error rate of the Discriminator, ”fooling” it into evaluating a synthesized image as real.

While several state of the art models propose training large networks where both generator and discriminator have a high number of parameters, requiring long videos or large amounts of images to generate talking head models, this system generates talking head models using a handful of images as input (K-shot learning), as can be seen at the input of the Image Embedder in the figure. This ability to learn from a small set of images works thanks to extensive pre-training, which will be called meta-learning, in which the system performs K-shot learning tasks given a small training set of images of the same person. After this stage, another set of images from another person starts a new learning problem using the Generator and the Discriminator pre-trained via the meta-learning process described.

The inputs to the Generator and both Embedders have already been calculated in the previous sections, consisting of visual and audio features: frames and their landmarks from videos of the dataset and the spectrogram of the audio segments mentioned. It is necessary to highlight again that the Image Embedder takes as input a specific number K of frames from a video, in this case K = 8 random frames. From this sequence of random frames, one is selected randomly to be used as input to the Generator to create the synthesized frame. As input to the Audio Embedder, the spectrogram of the audio segment of this randomly selected frame is used.

In the next sections, each block will be described in more detail.

4.3.1 Embedders

As it has been seen in the project’s architecture figure, there are two Embedders involved in the meta-learning stage that take information from the video and map it into two N-dimensional vectors, one each.

In particular, the Image Embedder takes a video frame and its associated landmark and the Audio Embedder takes the spectrogram computed from an audio associated with a frame. During this learning process, the Embedders aim to learn how to generate a vector that contains specific information, such as the person’s identity. The Image Embedder averages the resulting embeddings from the K-set of frames from the same video and, in the case in which the Audio Embedder is used too, it concatenates the result with the embedding vector generated by the Audio Embedder.

Figure 4.11: Image Embedder Architecture

Both Embedders use the same network (differing only in the input), consisting of residual downsampling blocks with spectral normalization layers, a self-attention block from [78] inserted at 32x32 spatial resolution, and a sum pooling over spatial dimensions followed by a ReLU activation function to finally obtain the vectorized outputs.

Figure 4.12: Audio Embedder Architecture

Figures 4.11 and 4.12 show a schematic representation of these networks, whose main objective is to find a representation of the input data that characterizes it. As has been said, the only difference is how the input is managed. The Image Embedder takes two images and concatenates them to create a single input element. As for the padding layer, it is meant for inputs different from 256x256, to adapt them to this resolution. The Audio Embedder, in turn, takes as input the spectrogram image with a size of 256x256, as mentioned in the preprocessing step. This spectrogram is associated with the audio segment corresponding to the input image of the Generator, which will be explained later. In this case too, the padding layer is meant for inputs that differ from the desired size. The rest of the network coincides in both Embedders, and its different layers are explained in what follows.

Residual blocks were introduced to solve the problem of vanishing gradients by feeding the activation of one layer directly to a deeper one of the network (skip connection). As can be seen in Figure 4.13, the identity function can be learned directly by relying just on the skip connections. Looking at the equation in the figure, where x is the input of the neural network block and the desired function to be learned is H(x), since there is an identity connection carrying x, the layers only have to learn the residual R(x) = H(x) - x.

Figure 4.13: Single Residual Block[79]

So, in this case, a residual downsampling block is used in several layers, consisting of several 2-D convolutions with a kernel size of 2 followed by ReLUs, normalized with spectral normalization as said previously, and finally a max-pooling layer. Figure 4.14 shows this architecture.
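One possible PyTorch reading of this block is sketched below; since a kernel size of 2 is unusual for padded convolutions, the sketch uses 3x3 kernels with padding 1 as an assumption, keeping the spectral normalization, ReLU activations, skip connection and final max pooling described above.

```python
# Residual downsampling block sketch (channel counts and kernel size assumed).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResidualDownBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.residual = nn.Sequential(
            nn.ReLU(),
            spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)),
            nn.ReLU(),
            spectral_norm(nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)),
        )
        self.skip = spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = self.residual(x) + self.skip(x)   # skip connection: H(x) = R(x) + x
        return self.pool(out)                   # halve the spatial resolution

# Usage: ResidualDownBlock(64, 128)(torch.randn(1, 64, 256, 256)) -> (1, 128, 128, 128)
```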

The self-attention block, introduced after three of the downsampling blocks, applies an attention mechanism to each position of the input sequence by creating three vectors for each position. This layer basically consists of three 2-D convolutional layers with spectral normalization followed by a softmax function. The max-pooling applied later is adaptive, so the stride and kernel size are automatically selected to adapt to the specified output size (1,1) before going through the activation function defined.
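A hedged sketch of such a block, following the self-attention layer of [78] (SAGAN), is shown below; the channel reduction factor of 8 and the learnable blending scalar are the usual choices in that paper, not values confirmed by this project's code.

```python
# SAGAN-style self-attention block (sketch, following [78]).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.key = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.value = spectral_norm(nn.Conv2d(channels, channels, 1))
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable blending scale

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # attention over positions
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # blend attended features back
```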

As the final output of both Embedders, a vector is generated with information about those features and a dimension depending on the batch size (B) and, in the case of the Image Embedder, on the number of shots (K).

Figure 4.14: Residual Down Sampling Block

4.3.2 Generator

The Generator takes as input the randomly selected landmark image for a video frame. During the training process, it takes the predicted video and audio embeddings, concatenated when both exist, or just the video embedding when only the Image Embedder is used, to share this information through the layers of the network. It trains to maximize the similarity between its outputs, which are synthesized frames, and the ground truth frames. Depending on whether both Embedders are used or not, the input dimension of the Generator varies from (B, 1024, 1) to (B, 512, 1). Figures 4.16 and 4.15 show its architecture in both situations, the only difference being the number of downsampling or upsampling layers needed to adapt dimensions.

The Generator in both cases consists of an Encoder, a Bottleneck block, in which the embedding features are inserted, and a Decoder. The Encoder takes the input landmark image, which goes through 3 residual downsampling blocks, each followed by instance normalization, which normalizes across each channel in each training sample, before applying the self-attention block and finally going through one or two more residual downsampling-instance normalization pairs depending on the dimension. The final data generated by the Encoder has a dimension of 512x16x16 or 1024x8x8, when not making use of the Audio Embedder or when making use of it, respectively. In other words, the Encoder part uses the same architecture as the Embedders but with instance normalization.

Figure 4.15: Generator Architecture Without Audio vector

The Bottleneck block consists of 5 residual blocks where the information of the embedding vector is added to the Encoder output before going through a ReLU layer followed by a 2-D convolutional one with spectral normalization. This addition is performed through an AdaIN block based on [80], which receives an input and a style input and aligns the channel-wise mean and variance of the first to match the second's. This mean and variance of the style input are computed from the embedding vectors.
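The following sketch shows how such an AdaIN step could be written in PyTorch, assuming that the per-channel mean and standard deviation are predicted from the embedding vector through a linear projection; the projection itself is an assumption for illustration.

```python
# Adaptive instance normalization (AdaIN) sketch, in the spirit of [80].
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, channels, embedding_dim):
        super().__init__()
        self.to_style = nn.Linear(embedding_dim, 2 * channels)  # predicts mean and std
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, features, embedding):
        style = self.to_style(embedding)                 # (B, 2C)
        mean, std = style.chunk(2, dim=1)
        mean = mean.unsqueeze(-1).unsqueeze(-1)          # (B, C, 1, 1)
        std = std.unsqueeze(-1).unsqueeze(-1)
        # Normalize the encoder features and align them with the style statistics.
        return (1 + std) * self.norm(features) + mean
```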

Finally, the Decoder performs the opposite of the Encoder with 2 or 3 residual upsampling blocks, depending on the input dimension, then a self-attention block followed by 2 more residual upsampling blocks and an AdaIN one, and finally a ReLU activation function, a 2-D convolutional layer, and a Sigmoid layer. In this case, the residual upsampling blocks are represented as shown in Figure 4.17.

Figure 4.16: Generator Architecture With Audio vector

4.3.3 Discriminator

The last element that ties the whole network together is the Discriminator, which takes as input the synthesized video frame, its associated landmark image, and the ground truth image. As Figure 4.18 shows, its architecture matches the Embedder's: several residual downsampling blocks, with the Discriminator having an additional one at the end operating at 4x4 spatial resolution and, to obtain the vectorized output, a global sum pooling over spatial dimensions followed by a ReLU. It is important to highlight that in this element of the architecture, as in the other ones, all the convolutional layers have been configured with a minimum number of channels of 64 and the maximum, as well as the size of the embedding vectors, set to 512 or 1024 depending on the Embedders-GAN combination.

Figure 4.17: Residual Up Sampling Block

The Discriminator maps the synthesized frame and landmark image into an N-dimensional vector and owns several learning parameters, some of them randomly initialized, which help predict a realism score indicating whether the input frame is a real one from the sequence and whether it matches the landmark image, based on the mapping vector and the learning parameters. This network outputs the realism score and a set of values that, during training, are compared to the ones obtained with the ground truth image to continue the learning process.

Finally, those are all the networks that make up the whole final architecture, with their inputs and outputs. The next step is to see how they interact with each other, in other words, how they are trained.

4.4 Training

This next step of the process concerns every configuration made to train the neural network using the data referred to in previous sections. It has been observed that, depending on the training and hyper-parameter configurations, the efficiency of the model can be improved drastically. As said before, the use of GPUs is of significant help during this process due to the computational requirements of the algorithms. But there are other arrangements worth outlining, such as the two-step training specific to this project's network.

Figure 4.18: Discriminator Architecture

4.4.1 Meta-learning

As already mentioned, the ability to learn from K shots of a video is gained thanks to a pre-training process, which will be called meta-learning, on videos from the dataset described. During this process, the system simulates K-shot learning tasks to transform landmark images into personalized images, making use of a set of frames of a person. In this project, 8 has been chosen as the value of K, since the state of the art of the base model verified that it is a suitable number.

So, this meta-learning stage assumes there are several video sequences of different people, along with the landmark images associated with each of the frames, and trains the three or four different networks defined previously:

• The Image Embedder E_v tries to learn parameters (φ) such that the output vector contains video-specific information that is invariant to the facial gestures in the frame.

• The Audio Embedder E_a tries to do the same with its parameters (ξ), but in this case the output contains audio-specific information referring to the audio segment of the frame.

• The Generator G trains to maximize the similarity between the synthesized output frame and the ground truth one. In this case, the parameters of this network can be divided into those learned during the meta-learning process and those predicted from the embedding vectors through a trainable projection matrix P. The former are the person-generic parameters ψ, and the latter the person-specific parameters µ = P · e, with e the embedding vector.

• The Discriminator D owns learning parameters (θ, W^v, w_0^v, W^a, w_0^a, b), which have been mentioned before and help with the computation of the realism score for each landmark image input.

In each training episode, a random video sequence i and the landmark image of a frame t from that sequence are selected, as well as 8 random frames s_k. The estimated image embedding vector e_i^v for i is computed as shown in 4.1, averaging the embedding vectors E(x_i(s_k), y_i(s_k); φ) obtained from each frame image x_i(s_k) and its landmark image y_i(s_k).

e_i^v = \frac{1}{8} \sum_{k=1}^{8} E\big(x_i(s_k), y_i(s_k); \phi\big)    (4.1)

If the process also uses the Audio Embedder, the audio features are extracted from the audio segment associated with t, and the embedding vector e_i^a is generated and concatenated with e_i^v, obtaining e_i.
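The sketch below illustrates how the combined embedding could be assembled from the two Embedders, following equation 4.1 and the concatenation just described; the tensor shapes and the function signature are assumptions for illustration.

```python
# Sketch: average the K image embeddings and concatenate the audio embedding.
import torch

def compute_embedding(image_embedder, audio_embedder, frames, landmarks,
                      spectrogram=None):
    # frames, landmarks: (K, 3, 256, 256); spectrogram: (1, 3, 256, 256)  (assumed)
    inputs = torch.cat([frames, landmarks], dim=1)           # channel-wise concat
    e_v = image_embedder(inputs).mean(dim=0, keepdim=True)   # average over the K shots
    if spectrogram is None:
        return e_v                                           # video-only embedding
    e_a = audio_embedder(spectrogram)
    return torch.cat([e_v, e_a], dim=1)                      # combined embedding e_i
```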

The Generator then synthesizes a frame \hat{x}_i(t) based on the landmark image of t and the embedding vector:

\hat{x}_i(t) = G\big(y_i(t), e_i; \psi, P\big)    (4.2)

4.4.1.1 Generator loss

After these two steps, the networks' parameters need to be optimized so that the loss function 4.3 is minimized. This function is composed of three or four different terms depending on the situation. The difference lies in the use of the Audio Embedder: if it is used, the term L_{A MCH} and the corresponding parameters (ξ, W^a, w_0^a) are added where necessary; if not, they are omitted.

L(\phi, \xi, \psi, P, W^v, w_0^v, W^a, w_0^a, b) = L_{CNT}(\phi, \xi, \psi, P)
    + L_{ADV}(\phi, \xi, \psi, P, W^v, w_0^v, W^a, w_0^a, b)
    + L_{V MCH}(\phi, W^v)
    + L_{A MCH}(\xi, W^a)    (4.3)

Content loss

The content loss term L_{CNT} measures the distance between \hat{x}_i(t) and the ground truth frame making use of two already-trained networks: VGG19 [81], trained to measure perceptual similarity for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [82], which evaluates algorithms used for object detection and image classification; and VGGFace [83], which is trained to perform face verification.

The architecture of the first one, VGG19, can be seen in Figure 4.19: a 19-layer deep CNN able to classify images into 1,000 classes. Of those 19 layers with learnable weights, 16 are convolutional and 3 are fully connected. The only preprocessing done to the images of the dataset is subtracting the mean RGB value from each pixel, the input being a fixed-size 224x224 RGB image, which passes through a stack of 16 convolutional layers (3x3 filters, stride 1) interleaved with 5 max-pooling layers (2x2 pixel window, stride 2) and, finally, a set of 3 fully-connected layers and a soft-max to perform the classification. This network outperformed all previous models at the time, which is why it became competitive in the field.

Figure 4.19: VGG-19 Architecture

VGGFace, shown in Figure 4.20, has 16 weight layers, differing from VGG16 [81] only in the output layer. A sequence of convolutional + ReLU activation layers, max-pooling, and softmax layers forms the final architecture. This model has been trained on a dataset consisting of 2,622 unique identities with over 2 million faces.

Figure 4.20: VGG-Face Architecture

The final L_{CNT} term is then calculated as the weighted sum (weights of 1.5·10^-1 for VGG-19 and 2.5·10^-2 for VGG-Face) of L1 losses between the features of the networks. The L1 loss stands for Least Absolute Deviations (LAD) and is used to minimize the error, which is the sum of all the absolute differences between the true and the predicted values.

L_1 = \sum_{i=1}^{n} | y_{true} - y_{predicted} |    (4.4)
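As a sketch of how this weighted sum could be computed, assuming helper functions that return the lists of intermediate activations of the two pre-trained networks (the helpers themselves are not part of the project's code):

```python
# Content loss sketch: weighted L1 distances between VGG19 and VGGFace features.
import torch

def l1(a, b):
    return torch.mean(torch.abs(a - b))                  # least absolute deviations

def content_loss(vgg19_feats, vggface_feats, generated, ground_truth):
    loss = 0.0
    for f_gen, f_real in zip(vgg19_feats(generated), vgg19_feats(ground_truth)):
        loss = loss + 1.5e-1 * l1(f_gen, f_real)         # VGG-19 term
    for f_gen, f_real in zip(vggface_feats(generated), vggface_feats(ground_truth)):
        loss = loss + 2.5e-2 * l1(f_gen, f_real)         # VGG-Face term
    return loss
```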

Adversarial loss

The next term in 4.3, represented in 4.6, is the adversarial term L_{ADV}, which refers to the realism score obtained by the Discriminator, which has to be maximized, and to a feature matching term L_{FM} (4.5) based on [30]. The latter depends on the features extracted from T layers of the aforementioned Discriminator and learns to match them between the real and the synthesized frames, helping with the stability of the training.

L_{FM}(G, D) = \mathbb{E}_{(t)} \left[ \sum_{i=1}^{T} \frac{1}{N_i} \, \| D^{(i)}(t) - D^{(i)}(\hat{x}(t)) \|_1 \right]    (4.5)

L_{ADV}(\phi, \xi, \psi, P, W^v, w_0^v, W^a, w_0^a, b) = -D\big(\hat{x}_i(t), y_i(t), t; \theta, W^v, w_0^v, W^a, w_0^a, b\big)
    + L_{FM}(G, D)    (4.6)

The matrices W^v and W^a contain columns with the embeddings that correspond to the videos and audios, while w_0^v and w_0^a correspond to the general realism of the synthesized frame and its compatibility with the input of the Generator, the landmark image. As has been said, the Discriminator maps its inputs to a vector V(\hat{x}_i(t), y_i(t); θ) to generate the realism score as in 4.7, w_i being the audio-video matrix formed by concatenating the columns W_i^v + w_0^v and W_i^a + w_0^a. This loss is added to the total loss with a weight of 10, which guarantees that the loss values are in close ranges, making each of them as important as the rest.

D\big(\hat{x}_i(t), y_i(t), t; \theta, W^v, w_0^v, W^a, w_0^a, b\big) = V\big(\hat{x}_i(t), y_i(t); \theta\big)^{T} (w_i) + b    (4.7)

Match loss

As the previous section described, there is another embedding type apart from the video and audio ones: the columns of W. The final terms of the total loss, L_{V MCH}(φ, W^v) and L_{A MCH}(ξ, W^a), boost the similarity between the different types of embeddings, penalizing the L1 loss between e_i^v and W_i^v for the video, and between e_i^a and W_i^a for the audio. Both losses are added to the total loss with a weight of 10 each.

4.4.1.2 Discriminator loss

In 4.7, the realism score computed by the Discriminator can be seen, and it was mentioned that its trainable parameters are needed to calculate a loss term of the Generator in order to update its parameters. To update the Discriminator's own parameters, L_{DSC}(φ, ψ, P, θ, W^v, w_0^v, W^a, w_0^a, b) is minimized, trying to increase the realism score on real frames and decrease it on synthesized ones.

L_{DSC}(\phi, \psi, P, \theta, W^v, w_0^v, W^a, w_0^a, b) = \max\big(0, 1 + D(\hat{x}_i(t), y_i(t), t; \phi, \psi, \theta, W^v, w_0^v, W^a, w_0^a, b)\big)
    + \max\big(0, 1 - D(x_i(t), y_i(t), t; \theta, W^v, w_0^v, W^a, w_0^a, b)\big)    (4.8)
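A compact sketch of this hinge formulation is shown below, with score_real and score_fake standing for the scalar outputs of equation 4.7 on ground-truth and synthesized frames, respectively; the function name is illustrative.

```python
# Hinge loss for the Discriminator (sketch of equation 4.8).
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(score_real, score_fake):
    loss_real = F.relu(1.0 - score_real).mean()   # max(0, 1 - D(real frame))
    loss_fake = F.relu(1.0 + score_fake).mean()   # max(0, 1 + D(synthesized frame))
    return loss_real + loss_fake
```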

4.4.2 Fine-tuning

Once the meta-learning step has converged, the system is able to learn to synthesize frames of a new person not seen during the previous process. This learning is performed in a T-shot way, given T frames of a video and their associated landmark images. In the paper, this T value varies between 1, 8, and 32; here, several values will be used to see the different results obtained. This step tries to solve the identity gap problem that appears even though the synthesized frame of an unseen person is still realistic. It can be seen as a simplified version of the previous learning, but using just a single, smaller video sequence.

The Embedders trained in the previous step are used to estimate the new embedding vectors e_NEW^v and e_NEW^a for these video frames, both image and audio in case the Audio Embedder is used.

To generate the new synthesized frames, we proceed as in the meta-learning stage using the Generator, which has already been trained, with the new concatenated embedding vector e_NEW. This is where the person-specific parameters µ appear, based on the learned projection matrix P: µ = P · e_NEW.

\hat{x}(t) = G'\big(y(t); \psi, \mu\big)    (4.9)

Finally, the Discriminator proceeds as before to calculate the realism score, with the parameters θ and b initialized to the results of the previous learning step. Since this video contains a person unseen in the training dataset, the information of W_i^v and W_i^a is not available. Nonetheless, the last two loss terms of the total Generator loss ensure the similarity between the embeddings, so w' in 4.10 is the concatenated matrix of w'^v and w'^a, initialized to the sums w_0^v + e_NEW^v and w_0^a + e_NEW^a for video and audio, respectively.

D'\big(\hat{x}(t), y(t); \theta, w'^v, w'^a, b\big) = V\big(\hat{x}(t), y(t); \theta\big)^{T} (w') + b    (4.10)

4.4.2.1 Loss functions

The loss functions for each network are analogous to the previous ones, though the Generator loss loses the L_{V MCH}(φ, W^v) and L_{A MCH}(ξ, W^a) terms and looks as in 4.11,

L'(\psi, \mu, w'^v, w'^a, b) = L'_{CNT}(\psi, \mu) + L'_{ADV}(\psi, \mu, w'^v, w'^a, b)    (4.11)

and the Discriminator loss is optimized as 4.12 shows.

L'_{DSC}(\psi, \mu, \theta, w'^v, w'^a, b) = \max\big(0, 1 + D(\hat{x}(t), y(t); \psi, \mu, \theta, w'^v, w'^a, b)\big)
    + \max\big(0, 1 - D(x(t), y(t); \theta, w'^v, w'^a, b)\big)    (4.12)

4.4.3 Other hyper-parameters

Some other hyper-parameters are worth mentioning, since they directly affect the results obtained and shown in the Simulations and results chapter.

At first, data parallelism was meant to be used, since two GPUs were available in Server 1 and the base model was adapted to use data parallelism if possible. This parallelization across multiple processors helps distribute the data across different nodes to reduce processing time. However, there were several compatibility problems, so this possibility was finally discarded.

The number of epochs was chosen taking into account that a training process of these characteristics lasts several days, even though only a small portion of the dataset was used, and the number of experiments that were to be carried out. The meta-learning process was trained for 400 epochs and, depending on the GPU, lasted from 4 days to approximately 2-3 weeks. The fine-tuning process was set to 200 epochs, which, of course, lasted much less: under an hour.

The base model from the paper mentioned divided the dataset in two to train the model, in other words, it used a batch size of 2. Due to the size of the network, the number of parameters, and the images involved in the process, the GPU reaches up to about 10 GB of memory, so in this case the batch size could not be increased to values larger than 1.

As for the optimization parameters, the ones used in the paper were used as well: Adam as the optimizer for both the Generator and the Discriminator, with learning rates of 5x10^-5 and 2x10^-4, respectively. The first one was updated once per training epoch and the second one, twice. The number of frame shots (input to the Image Embedder) used during training is, as mentioned previously, the same as in the paper. Due to time constraints it was not varied to look for possible efficiency gains, although it seems to be an important parameter to explore.
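
As a sketch, and assuming plain torch.optim.Adam with the learning rates quoted above (the parameter lists here are placeholders, not the project's real modules), the optimizer setup could look as follows:

```python
import torch

# Hypothetical parameter groups; the real project exposes its own Generator/Discriminator modules.
gen_params = [torch.nn.Parameter(torch.randn(10))]
disc_params = [torch.nn.Parameter(torch.randn(10))]

# Learning rates reported in the text: 5e-5 for the Generator, 2e-4 for the Discriminator.
optimizer_g = torch.optim.Adam(gen_params, lr=5e-5)
optimizer_d = torch.optim.Adam(disc_params, lr=2e-4)
```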

So, as can be seen, every hyper-parameter was kept at its original value, since due to resources and time it was not possible to sweep over different values to see whether the results improved or not.

4.5 Evaluation

The final step of this DL process would be to evaluate the efficiency of the whole system. During the training process, the data necessary to evaluate the model has been saved every 5 epochs so that, in case of encountering a failure and having to start the process again, the model could be loaded to continue from that point. Visual results of frames synthesized using the weights saved every 5 epochs during the training process can be seen in Other Results for each of the experiments carried out.

Given that this project's training is separated into two phases (meta-learning and fine-tuning), the evaluation will mainly focus on the fine-tuned models, obtained through few-shot learning on a video from a person, or a set of videos from different people, not seen during the first step. In particular, the model will be evaluated using a number of hold-out frames (32, the number chosen in the paper) from the resulting generated video or videos, comparing them to the original ones.

These networks, as has been explained multiple times, are composed of a Generator and a Discriminator training together to maintain an equilibrium; there is no objective loss function that drives these models to convergence, as there is in other models. There is also no objective way to assess their training progress, let alone their quality, based on the losses alone.

In this kind of task, the model generates images, which can be evaluated both quantitatively and qualitatively, combining both types of information to provide a robust assessment of the model. Even though there are several metrics for this, the objective evaluation of GAN models remains an open problem.

So, the metrics searched for and analyzed will be described in this section as follows. It is important to highlight that better visual results might have worse quantitative ones, since the metrics do not necessarily correlate directly with human perception. Also, it was considered necessary to define ranges in which the values of each of these quantitative and qualitative metrics are considered good, normal, or bad. To do so, a random good-quality 555x780 RGB image, such as the one shown in Figure 4.21 and downloaded from [84], has gone through two processes of Gaussian blur, one heavier than the other, which averages pixels giving more weight to those near the center of a specified kernel, and through the addition of salt-and-pepper noise, which adds white and black pixels to the image. Finally, a distortion consisting of shifting the columns and rows of the image according to a sine function was going to be applied, but the landmark detection did not work on it at all. These images will be evaluated using the metrics finally chosen, in order to obtain some baseline values with which to classify generated images as good or bad quality.

(a) Original image (b) Blurred image (c) High blurred image (d) Noise image

Figure 4.21: Different image distortions to proceed with image evaluation

4.5.1 Quantitative Evaluation

Quantitative GAN metrics refer to the computation of numerical scores that summarize in some way the quality of the synthesized images. Looking at the state-of-the-art evaluation metrics used for this type of task, several metrics have been found.

To measure image quality itself, the Peak Signal-to-Noise Ratio (PSNR) is used; to verify the efficiency of the GAN losses in improving the quality of the synthesized frame, the Structural Similarity (SSIM) from [85]; and to assess identity mismatch, the Cosine Similarity (CSIM). All of these are widely used. Lips are an important characteristic of the generated frames, so to measure the accuracy of their movements at the pixel level, the Landmark Distance Error (LMD) from [86] could be used. To assess the similarity between two images, the Normalized Mutual Information (NMI) and the Perceptual Similarity (PS) metric proposed in [87] have been selected for analysis. Finally, to evaluate photorealism and the preservation of the person's identity in image generation tasks, the Inception Score (IS) and the Fréchet Inception Distance (FID) from [88] are normally measured.

Each of these objective metrics will be described below, together with the decision of whether or not it fits this project's task.

4.5.1.1 PSNR

In general terms, the PSNR is a metric used to represent the ratio between the maximum possible value of a signal and the power of noise that is affecting the quality of the signal representation. It is normally expressed in decibels.

In terms of image quality, the PSNR measures the signal-to-noise relation between two images, the real one and the reconstructed one. The higher the PSNR, the better the synthesized image matches the real one; in other words, the better the implemented algorithm. To compare these images at the pixel level, the Mean Squared Error (MSE) in 4.13, which represents the average of the squared errors between both images, is used. In this case, the error is the amount by which the values of the real image differ from those of the synthesized one.

$$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \| f(i,j) - g(i,j) \|^2 \qquad (4.13)$$

This metric, 4.14, is appealing given that it is quite easy to calculate and has a clear physical meaning. In both equations, m and n are the numbers of rows and columns of pixels of the images, f and g represent the data matrices of the original and the synthesized images, respectively, and, finally, MAX_f is the maximum signal value of the real image.

$$PSNR = 20 \log_{10}\!\left(\frac{MAX_f}{\sqrt{MSE}}\right) \qquad (4.14)$$
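
A minimal NumPy sketch of Eqs. 4.13 and 4.14, assuming 8-bit images so that MAX_f = 255, could be:

```python
import numpy as np

def psnr(f: np.ndarray, g: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR between a real image f and a synthesized image g (Eqs. 4.13 and 4.14)."""
    mse = np.mean((f.astype(np.float64) - g.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images, as in the first row of Table 4.5
    return 20.0 * np.log10(max_value / np.sqrt(mse))
```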

4.5.1.2 SSIM

This metric is used here as a perceptual one that quantifies the degradation of the synthesized image caused by passing through the whole network, measuring its low-level similarity to the ground-truth image. It considers that degradation is due to a change in the structural information of the image, which has a strong dependence between pixels that are close in space. To obtain the structural information of those images, the influence of the illumination has to be analyzed separately, since the attributes of the representation of the objects in the scene should be independent of luminance and contrast. This metric, 4.15, is inexpensive to compute and has a simple analytical form, which is why it is widely used in the state of the art.

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \qquad (4.15)$$

In the previous equation, µx, µy, σx and σy refer to the mean pixel intensities and the standard deviations of these intensities in image patches centered at x and y, respectively, and σxy is their covariance. The constants c1 and c2 are added for numerical stability when the denominator is weak.
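
For reference, a short sketch using the structural_similarity function of scikit-image is shown below; the image pair is a random placeholder and the data_range argument must match the dynamic range of the real inputs.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Placeholder grayscale image pair; the project compares ground-truth and synthesized frames.
real = np.random.rand(256, 256)
fake = np.random.rand(256, 256)

# data_range is 1.0 for float images in [0, 1]; it would be 255 for uint8 images.
score = structural_similarity(real, fake, data_range=1.0)
print(f"SSIM: {score:.3f}")  # 1.0 would mean identical images
```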

4.5.1.3 IS

This metric was proposed in 2015 in [89], becoming probably the most widely adopted score for this type of network evaluation. It mainly involves using a pre-trained neural network to classify the generated images, capturing how much an image looks like a specific class. A higher value of this score indicates a better quality synthesized image.

4.5.1.4 FID

To capture the similarity of synthesized images to the original ones, the FID metric is used. The score summarizes how similar the generated and real images are in terms of statistics computed on computer vision features of the raw images, calculated using the model defined as follows. This way of measuring perceptual realism models the distribution of features obtained from the coding layer of the Inception v3 network [90] using a multivariate Gaussian distribution with mean µ and covariance Σ.

$$FID(x, y) = \| \mu_x - \mu_y \|^2 + \mathrm{Tr}\!\left(\Sigma_x + \Sigma_y - 2\,(\Sigma_x \Sigma_y)^{1/2}\right) \qquad (4.16)$$

The term Tr refers to the trace operation of linear algebra, which is the sum of the elements on the main diagonal of the matrix. In this case, the lower the score, the higher the quality of the images, since it means that the two groups of images have similar statistics.
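
A small sketch of Eq. 4.16 is given below, assuming the Inception-v3 features have already been extracted into two arrays with one row per image (feature extraction itself is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID of Eq. 4.16 computed on pre-extracted Inception-v3 features (one row per image)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```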

4.5.1.5 CSIM

CSIM measures the similarity between two vectors of an inner product space. In other words, applied to this case, it measures the similarity between the embedding vectors in order to look for identity mismatch. It is computed as the cosine of the angle between both vectors, determining whether they point in the same direction. Given the two embedding vectors x and y, this metric is computed as 4.17 shows.

$$CSIM(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \qquad (4.17)$$

In this case, a cosine value of 0 would mean that both vectors are orthogonal and therefore there is no match, so the closer the cosine value is to 1, the greater the match.
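
A one-line NumPy sketch of Eq. 4.17:

```python
import numpy as np

def csim(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity of Eq. 4.17 between two embedding vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```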

4.5.1.6 LMD

The LMD was proposed to evaluate whether the synthesized video shows lip movements that are accurate with respect to the audio. 4.18 shows how it is calculated.

$$LMD = \frac{1}{T}\,\frac{1}{P} \sum_{t=1}^{T} \sum_{p=1}^{P} \| L^{R}_{t,p} - L^{F}_{t,p} \|_2 \qquad (4.18)$$

In the equation, the Euclidean distance between the two sets of landmarks, L^F and L^R, is computed before normalizing it by the temporal length of the video, T, and the total number of landmark key points on each frame, P. To calculate these lip landmarks, L^F and L^R, on the synthesized and real frames respectively, a HOG-based facial landmark detector from Dlib [91] is used.
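
A possible sketch of this computation with Dlib is shown below; it assumes the standard 68-point shape predictor file is available locally and that each 8-bit frame contains exactly one detectable face.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-point model; adjust to wherever it is stored locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_landmarks(frame: np.ndarray) -> np.ndarray:
    """Return the 20 lip key points (indices 48-67 of the 68-point model) as a (20, 2) array."""
    rect = detector(frame, 1)[0]  # assumes one detectable face per frame
    shape = predictor(frame, rect)
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(48, 68)], dtype=np.float64)

def lmd(real_frames: list, fake_frames: list) -> float:
    """Eq. 4.18: mean Euclidean distance between real and synthesized lip landmarks."""
    dists = [np.linalg.norm(lip_landmarks(r) - lip_landmarks(f), axis=1).mean()
             for r, f in zip(real_frames, fake_frames)]
    return float(np.mean(dists))
```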

4.5.2 PS

This metric also relies on pre-trained neural networks, which classify images in a high-level way and allow measuring the similarity between two images. [92] provides an implementation of this metric using PyTorch, trying several network architectures that obtain similar scores, such as AlexNet, which has been described in the chapter State of the art. These architectures have been trained on pairs of images in order to learn their similarities.
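
A minimal sketch, assuming the lpips PyTorch package released with [92] is installed and the frames are tensors scaled to [-1, 1]:

```python
import torch
import lpips  # PyTorch implementation accompanying [92]

# AlexNet backbone, as mentioned above; 'vgg' and 'squeeze' variants give similar scores.
loss_fn = lpips.LPIPS(net='alex')

# Placeholder image tensors of shape (N, 3, H, W), scaled to [-1, 1].
real = torch.rand(1, 3, 256, 256) * 2 - 1
fake = torch.rand(1, 3, 256, 256) * 2 - 1

distance = loss_fn(real, fake)  # lower means perceptually more similar
print(distance.item())
```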

4.5.3 NMI

The NMI is a normalization of the Mutual Information (MI) score, Equation 4.19, used to scale the result between 0, the absence of mutual information, and 1, a perfect correlation, between two different images U and V. Equation 4.20 shows how the MI is normalized by the mean of the entropies of the two compared images (H(U), H(V)), making the result independent of the areas of data covered by the windows. It is quite common to compare specific windows of data from each of the images: the smaller the window, the less accurate the estimation of the probability distribution.

$$MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log\!\left(\frac{N\,|U_i \cap V_j|}{|U_i|\,|V_j|}\right) \qquad (4.19)$$

$$NMI(U, V) = \frac{MI(U, V)}{\mathrm{mean}\big(H(U), H(V)\big)} \qquad (4.20)$$
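
A small NumPy sketch that estimates Eqs. 4.19 and 4.20 from a joint histogram of two grayscale images (the number of bins is an arbitrary choice):

```python
import numpy as np

def nmi(u: np.ndarray, v: np.ndarray, bins: int = 64) -> float:
    """Eqs. 4.19-4.20 estimated from the joint intensity histogram of two grayscale images."""
    joint, _, _ = np.histogram2d(u.ravel(), v.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability estimate
    px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginal probabilities

    nz = pxy > 0                              # avoid log(0)
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))

    hu = -np.sum(px[px > 0] * np.log(px[px > 0]))  # marginal entropies
    hv = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return float(mi / np.mean([hu, hv]))
```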

4.5.3.1 Final metric selection

Once all of these commonly implemented metrics have been defined, it is time to choose whether or not they can help evaluate this project's results. It is important to take into account that, due to the nature of this project's application, the generated frames will never coincide with the original ones, since the face gestures and lip movements are being translated from another video onto the base one.

• PSNR: In this generative model, this metric gives a pixel-level view of the relation between images, which is not that intuitive, since a structural measure probably makes more sense. It is also true that, in this case, comparing the ground-truth image to the generated one is never going to give the best PSNR value, since both images will differ due to the nature of this project's task. However, it is a good metric to measure image quality, so it will be used.

• SSIM: This metric will be used since it evaluates the synthesized image in a block-wise way; in other words, the index is calculated over several windows, so it can be a better way of measuring image quality than the PSNR.

• IS: An improvement of this score was proposed two years later using the same network, so it has been decided to discard it and use the improved one, which is defined in the next point.

• FID: It has been decided to use it, since it performs quite well in terms of robustness and computational efficiency, being consistent with human evaluations and more robust to noise than the previous score.

• CSIM: Given this project's embedding vectors, which are composed of really small values close to 0, the distance between them is really small, so the value of this metric will always be close to 1, which is why it has been discarded.

• LMD: When trying to evaluate lip movements, the LMD cannot precisely reflect their accuracy for two reasons: the same word can be pronounced differently and, as has been said, the generated frames do not have to match the ground-truth ones. So this metric is discarded, since it would not correctly reflect the accuracy of the lip movements.

• PS: It has been demonstrated that this metric outperforms widely used metrics like PSNR or SSIM, since those were not designed for situations where spatial ambiguity is an important factor to take into account.

• NMI: This metric will be used since it allows an empirical analysis of the performance of the method.

Once it has been decided which metrics are going to be used, the picture that underwent the different kinds of distortion has been evaluated with them. Table 4.5 shows the values obtained for each of the distorted images when compared to the original one. Given these results, a range of values for each metric can be established to proceed with the comparisons, taking into account the perceptual quality of each of the images.

Images                                 PSNR    SSIM   FID       PS      NMI
Original image - Original image        inf     1.0    0         0.00    1.0
Blurred image - Original image         25.69   0.73   123.15    0.50    0.42
High blurred image - Original image    22.94   0.66   220.36    0.60    0.35
Noise image - Original image           12.47   0.08   145.162   1.340   0.21

Table 4.5: Metrics values obtained for the different distortions of the first image to show range of values.

As can be seen in the table, the higher the PSNR, the better the quality of the generated image compared to the original one. A normal quality value for this metric would be around 20 dB; below that, the generated image can be considered bad in terms of PSNR. The SSIM values range from -1 to 1, where 1 represents identical images and, according to the original-noise comparison, values around 0.1 indicate bad similarity. Regarding PS, a higher value corresponds to more different images, the optimum being 0, as can be seen in the first comparison using the same image twice; a PS value around 1 indicates highly different images. Finally, a lower FID score indicates more realistic images, because the generated images match the statistical features of the real ones: the optimal score would be 0, values around 100-200 still correspond to fairly realistic images, and values above those indicate images that are not realistic at all.
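
Purely as an orientation aid, the rough ranges discussed above can be collected in a small helper; the exact cut-off values below are approximate choices derived from Table 4.5, not part of the original evaluation.

```python
def rate_frame(psnr: float, ssim: float, fid: float, ps: float) -> str:
    """Rough quality label based on the approximate ranges discussed above."""
    if psnr < 15 or ssim < 0.2 or fid > 250 or ps > 1.0:
        return "bad"
    if psnr >= 20 and ssim >= 0.6 and fid <= 150 and ps <= 0.5:
        return "good"
    return "normal"
```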

Taking these values of each metric as a baseline for the following evaluation will help to compare each of the experiments done for the project.

4.5.4 Qualitative Evaluation

Qualitative metrics are those that do not provide a numerical result but involve subjective human evaluation. Since this is an image generation task, the main evaluation of the system could be a visual one by humans, in other words, a user study based on comparisons between the results of the different networks, analyzing perceptual similarity and realism. This subjective comparison can provide a better idea of the quality of the performance than other types of metrics, being one of the most common and intuitive ways to evaluate GANs.

There are several qualitative measures used for this evaluation, such as Nearest Neighbors and Rapid Scene Categorization. The first one involves selecting examples of ground-truth images and locating, using distance measures, the most similar generated ones for comparison in a given context, in order to evaluate how realistic they are. The second one consists in presenting images to people for a short amount of time so that they classify them as fake or real. But perhaps the most used is Rating and Preference Judgment, where people are asked to compare or rank examples of ground-truth and generated images, which will be the case in this project: the different images will be compared with each other and evaluated visually.

To establish a subjective baseline for the visual perception of image quality in this project, it has been established that the images in Figure 4.21 have good, normal, bad, and bad quality, respectively.

However, due to time constraints, the models used for the experiments in this project could not be trained for a long time or with the whole dataset. So, this comparison might not be enough, since the results may not be fully realistic and it would not make sense to differentiate between real and fake pictures, but it can serve as an approximation of a real comparison.

Chapter 5

Simulations and results

This chapter presents the different experiments carried out and their results, both visual and numerical, taking into account the metric definitions explained previously. The explanation is organized as follows.

The explanations are divided into five main sections. The first one details how the evaluation is carried out, explaining the two main videos used for testing and evaluation, the reference system with its implementation, and some example results. After that, for each of the main experiments using the two videos, a brief description defines which features were used and how the experiment was configured; then the different results are shown as images to see what the model generates; and, finally, the results obtained for each metric are specified. As the fourth evaluation, a comparison using a video different from the original one is performed, giving a more general comparison. As additional information about this project's results, some other tests of general interest are presented. All the system experiments are compared to a reference system, which is the base model [1] defined previously, and also with each other. Finally, a summary of the best configurations of each experiment is shown.

5.1 Configuration of the evaluation

It is important to highlight that the paper defines the evaluation as being performed after the fine-tuning stage, using a few-shot learning set for a person not seen during the previous stage. In that evaluation, the model is compared to the X2Face [43] and Pix2PixHD [30] models, using the VoxCeleb1 dataset and the same number of iterations for the three of them. The first one is used since it is a strong baseline for warping-based methods, which is what this model tries to avoid, and the second one is chosen as a representative of direct synthesis methods. This comparison between different models would be interesting as a future line of work. To show the full potential of their approach, VoxCeleb2 is used. For the quantitative comparisons, the evaluation is done using 32 hold-out frames from each of the 50 videos used as the testing set, which gives a more general evaluation across different videos.

5.1.1 Evaluation dataset

In this project, the final application has already been defined as using two videos as the input of the trained system and generating a video in which the face of one has been transferred to the other. While the reference system uses 50 videos from two datasets for this purpose, it has been decided to download and generate just two videos for this step, in order to focus on the results obtained for each experiment of the final application. It is true that it would be necessary, as the paper suggests, to evaluate the system using a limited set of videos instead of just one, so this experiment will be added in the section on other tests of interest, to see whether the system works well for every kind of video, but the main experiments will focus on the two main videos.

The video from which the facial gestures and speech are taken is a video of myself (a frame is shown in Figure 5.1), "home test video.MOV", which consists of myself talking and making some head and face movements for a duration of 13 seconds, with the characteristics defined in Table 5.1. This video in particular tries to show a clean and well-defined face where landmarks can be easily detected.

This information will be translated to another video of the Spanish president Pedro Sánchez (Figure 5.2), "pedro test video.MOV", downloaded from [93]. A video of a person giving some kind of interview or speech was desired, since these kinds of videos show the person's face clearly and the scene does not present cuts or changes, which is desirable for analyzing the final result. Figure 5.2 shows the transformation that the original video underwent: it was a very long video and, in addition, the scene contained two faces, Sánchez's (the desired one) and the sign-language interpreter's, so the landmark detection would identify both faces and would not work well. Table 5.1 provides the information about each video.

Figure 5.1: Frame of the recorded video of myself with noticeable face gestures and speaking.

Video                       Duration      Resolution  Video codec  Frames per second  Audio codec  Sample rate  Audio channels
home test video.MOV         00:00:13.20   1280x720    H.264        30 fps             AAC          44.100 Hz    2
downloaded test video.MOV   01:12:15.05   1280x720    H.264        30 fps             AAC          44.100 Hz    2
pedro test video.MOV        00:01:51.74   1280x720    H.264        29.97 fps          AAC          48.000 Hz    2

Table 5.1: Test videos information

Other experiments will be carried out using other inputs and configurations, but that information will be explained in each corresponding section, since those are results for different scenarios that might be of interest, not for the main one.

Figure 5.2: Frames of the downloaded video of Pedro Sánchez before and after cutting and cropping it.

5.1.2 Reference system

Figure 5.3: Implementation inference after 5 epochs of training in small dataset [94]

As has been mentioned multiple times, the reference system used is an implementation [95] of [1] that provides the already trained model and the whole system implemented as the paper indicates. The original system trains the model on the whole VoxCeleb2 dataset in two variants: first, a feed-forward (FF) approach consisting of 150 epochs without the L_MCH; the second is a fine-tuning (FT) approach, trained for 75 epochs but making use of L_MCH so that the fine-tuning stage can be run later. The implemented system, which is the one that provides the pre-trained model, trains for just 5 epochs during the meta-learning stage and then for 40 epochs during the fine-tuning one on a small test dataset, due to computation resources. Figure 5.3 shows these results using an image and the webcam, instead of two videos. In other words, the implementation does not follow the two-variant way of training, since the FF approach makes the matrix W and the embedding vectors quite different (on their own, W and the embedding vectors represent a person's specific information), so it would not be possible to fine-tune after that [96].

The results obtained for this reference system with the pre-trained model will simply be a review of what can be obtained visually for the main configurations.

Qualitative results

(a) Results for T = 1, not fine-tuned (b) Results for T = 1, FT epochs = 40

(c) Results for T = 8, not fine-tuned (d) Results for T = 8, FT epochs = 40

(e) Results for T = 32, not fine-tuned (f) Results for T = 32, FT epochs = 40

Figure 5.4: Example of output of the reference system trained using the available pre-trained model. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated. The first column of pictures only trains the meta-learning stage and the second also trains the fine-tuning stage for 40 epochs.

Since the paper mentions that the testing is done with different values of the parameter T, Figure 5.4 shows the results on a frame with the mouth open, so that the translated face can be seen clearly, for a training process using the two available options: FT and not FT. The number of epochs chosen for the FT step follows the information from the paper (40 epochs), and the padding selected when cropping the face for the landmark detection is 50. As can be seen, visually the difference between fine-tuning the model or not is quite big, with much clearer face results in the second column of the figure. However, changing the value of the T parameter does not have any noticeable effect, at least perceptually, indicating that even with just one fine-tuning frame the model learns. Using just a few epochs of training, both meta-learning and fine-tuning, generates good quality images, but in this case the mouth of the synthesized frame should be open and it is not.

Quantitative results

In the implemented example, the model is saved locally after training 99 items from a set of 36,237 videos for each epoch, adding the current loss values for both Generator and Discriminator to a list so that their evolution can be seen in a graph, as 5.5 shows. The pre-trained model that can be downloaded corresponds to a checkpoint saved during epoch 0 and after 99 samples trained. When representing the losses, it makes no sense that the number of iterations is 1,091, since if these weights had been trained with the available code and corresponded to epoch 0 after 99 samples, the number of iterations would be 99. So, basically, the code and the pre-trained model do not match; however, results can still be obtained. It might have happened that the author of the code updated it later or simply did not upload the right scripts.

Although GAN losses are not that intuitive, the Generator and Discriminator compete against each other, so an improvement in one of them means that the loss of the other one increases. It is usual that after some iterations both losses converge to some constant range of values, as can be seen in the orange curve. In this case, the Generator loss decreases slowly until approximately 1,000 iterations and then suddenly drops from a value of 100 to a value of 40 at the same time as the Discriminator loss increases, which could mean that the learning rate configured for the optimizer decreased abruptly. Since the code used to save this information is not available, only assumptions can be made. Analyzing both losses independently, the Generator one seems to be a high value since it is the sum of several losses, which have been calibrated so that the summands do not have very different values, as defined previously. As for the Discriminator, its optimum loss should be around 0.5, discriminating equally between the real images and the synthesized ones. In this case, as can be seen, there are some oscillations between 0 and 3. It would be better if the graphs showed mean values per epoch, but since this information is not precise, they cannot be drawn.

(a) Complete results representation

(b) Generator results representation (c) Discriminator results representation

Figure 5.5: Training losses evolution graph in the reference system with pre-trained weights.

Having shown the graph of the Generator and Discriminator losses during the meta-learning, Figure 5.6 shows the evolution of the curves during the fine-tuning step for the different configurations of the T parameter defined previously. It is important to remember that during this training the Match losses are not calculated. Although the visual perception of the generated images may not vary when changing the T value, the Generator loss does decrease faster using 32 random frames, reaching a value of approximately 9, while the others stay around 12, as can be seen in the graphs, which show the three losses of each experiment together for comparison. The Discriminator losses for T = 32 and T = 8 are really similar, stabilizing near a value of 2. This value of around 2 follows from the Discriminator loss calculation 4.8: the Discriminator scores obtained for real and synthesized frames tend to 0, so each hinge term tends to 1 and their sum gives 2.

(a) Generator Results for FT epochs = 40 (b) Discriminator results FT epochs = 40

Figure 5.6: Example of losses output during the fine-tuning stage using the reference sys- tem model available.

Finally, Table 5.2 shows the different values obtained for the video results of the base system. To evaluate with these metrics, the same hold-out frames have been chosen for each experiment so that they can be compared, and the ground-truth frames have been processed to have the same resolution and padding with respect to the face landmarks, so that the comparison can be done correctly. Every configuration's PSNR presents a positive and fairly high value, meaning that the frames are quite similar in quality but not identical, as is to be expected. On the one hand, SSIM values are around 0.3-0.5, which means that there is perceived similarity but high-intensity pixels might vary. As can be seen, both values are larger for the fine-tuned trainings, which are also the ones with the best visual quality, as the FID metric confirms (obtaining better results). On the other hand, NMI values are really low for every configuration, which means that, in terms of NMI, the real and synthesized frames are not that similar, being worse for the images that are not fine-tuned, as could be guessed. Finally, the PS value should be near 0 for both images to be considered similar; however, the values obtained are around 0.5, which is not that good. Comparing these values to the baseline ones obtained in the chapter Implementation, it could be said that the fine-tuned synthesized images show normal to bad quality, whereas the not fine-tuned ones show bad quality.

Experiments       PSNR    SSIM   FID      PS     NMI
Not FT, T = 1     10.57   0.36   412.64   0.64   0.15
Not FT, T = 8     10.60   0.37   368.38   0.63   0.15
Not FT, T = 32    10.56   0.36   393.56   0.63   0.15
FT, T = 1         15.32   0.50   141.11   0.42   0.21
FT, T = 8         14.06   0.47   138.32   0.42   0.20
FT, T = 32        14.27   0.48   145.50   0.42   0.21

Table 5.2: Metrics values obtained for the different configurations for the video results using the reference system

5.2 Project experiments

This section describes each of the relevant experiments carried out and their results. In particular, the training and testing configurations and results are specified, showing the values obtained for each metric defined previously. As was said, the model was trained using two main servers, so visual results are shown for each one and for the federated learning global model obtained, Server 1 being the main one from which quantitative metrics are obtained.

This chapter is organized as follows: since there are three main models that have been implemented, results, both qualitative and quantitative, for Server 1 are shown for the two options available (transferring video gestures to a video and to a static image); after describing these three, other experiments of interest, including the federated learning ones, are shown. Table 5.3 briefly describes each of the experiments, assuming that each of them tries different configurations.

Experiment | Network | Audio loss L^A_MCH | Input
Reference system with project's dataset | Generator, Discriminator and Image Embedder | No | Two videos: myself and Sánchez
Reference system with project's dataset adding Audio Embedder | Generator, Discriminator, Audio and Image Embedder | No | Two videos: myself and Sánchez
Reference system with project's dataset adding Audio Embedder and audio loss | Generator, Discriminator, Audio and Image Embedder | Yes | Two videos: myself and Sánchez
Federated learning | Every architecture implemented | Every option | Two videos: myself and Sánchez
Video to image | Every architecture implemented | Every option | Video and image
Video to non-human face | Generator, Discriminator, Audio and Image Embedder | Yes | Video and non-human image
Video to same video | Generator, Discriminator, Audio and Image Embedder | Yes | Two equal videos of myself
Merkel video comparison | Every architecture implemented | Every option | Two videos: myself and Merkel

Table 5.3: Experiments carried out in this project

5.2.1 Reference system with this project’s dataset

Since this project aims to compare several configurations of the implemented models to test which one performs most efficiently, the first experiment consists of using the reference network that had the pre-trained model available, but with this project's dataset. In other words, the first experiment consists in training the reference architecture, composed of just the Image Embedder, Generator, and Discriminator, with this project's own dataset, preprocessed as defined previously.

5.2.1.1 Qualitative results

Once the model has been trained for 400 epochs in the meta-learning stage, and given the really small dataset compared to the paper and the implemented systems, the next step, the fine-tuning one, should be executed to obtain better results. As a quick reminder, this main server, Server 1, trained on a dataset of 2,545 videos, which is more than 10 times smaller than the reference training, which is why the results are not as realistic as those shown in the paper. First, to see how a result of the application would look without going through the fine-tuning stage, Figure 5.7 shows the lack of personalized frames, even when the value of T frames used for this stage changes. Compared to the previous one, the mouth is generated as it should look (open), showing a better result in this area of the face.

(a) Results for T = 1

(b) Results for T = 8

(c) Results for T = 32

Figure 5.7: Example of output of the reference system trained only in the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The next step is to execute the fine-tuning step. Several configurations have been tried to check the different results and see which one fits this model best (Figure 5.8). Although in the paper and the implemented system the number of epochs used for this training is 40, in Figure 5.9 200 epochs have also been used, to see whether the model needs more epochs to learn better. These results demonstrate that even if the number of epochs or the value of T increases, the video obtained does not change much, the facial expression being just slightly more detailed.

(a) Results for T = 1, FT epochs = 40

(b) Results for T = 8, FT epochs = 40

(c) Results for T = 32, FT epochs = 40

Figure 5.8: Different FT T values applied. Example of output of the reference system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

(a) Results for T = 32, FT epochs = 40 (b) Results for T = 32, FT epochs = 200

Figure 5.9: Different FT epochs applied. Example of output of the base system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The different results obtained show that the frames of each video are cropped focusing on the face of the person. This is done to obtain a better result, since detecting the landmarks and applying this information to Sánchez's video shows better results, with more defined facial features, for a padding of 50 rather than 200, as Figure 5.10 shows.

(a) Results for T = 32, FT epochs = 40, (b) Results for T = 32, FT epochs = 40, Padding = 50 Padding = 200

Figure 5.10: Different paddings applied. Example of output of the base system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

5.2.1.2 Quantitative results

Having saved the model during training, it is possible to load it and see how the losses have increased or decreased during the meta-learning process. In this case, the saved weights correspond to the last epoch of training (400). Since every single loss value for every video has been saved, the number of iterations is very high, so it was decided to average the values per epoch to obtain a smoother curve. It is true that the Generator loss should decrease with the epochs, since the frames synthesized during training become more realistic epoch by epoch. However, it can be seen in Figure 5.11 that this training's Generator loss increases suddenly at the beginning of the training and then remains in the range of 20-40. On the one hand, the sudden increase at the beginning might be explained by the random initialization of the W matrix, which could be similar to the first generated images during training, which are random too. On the other hand, the stabilization might be due to the small dataset used for this training: since it trains on a single person for each random video, it might or might not learn. The Discriminator loss shows a positive value at the beginning, since the real frames are classified as such. Then this network tends to a value of 1, learning really fast and pushing the Generator to produce good results from early epochs. It seems that at about 50 epochs the complete system begins to generalize, as can be seen in the images shown below.

Figure 5.12 shows the difference between the synthesized frames during the meta-learning training process, making it clear that the network is learning and that, after some epochs, as can be seen in the previous graphs, the Discriminator is no longer able to recognize that face as real or fake. In the annex Other Results, the different generated images for each of these experiments can be found, since it might be of interest to observe how fast and how well the networks learn to synthesize information.

(a) Complete results representation (b) Complete results mean representation

(c) Generator results mean representation (d) Discriminator results mean representation

Figure 5.11: Training losses evolution graph in reference system with own dataset.

(a) Ep 10 (b) Ep 50 (c) Ep 100 (d) Ep 175 (e) Ep 250 (f) Ep 325 (g) Ep 400

Figure 5.12: Examples of Generator outputs during meta-learning.

During the fine-tuning stage, the losses for both the Generator and the Discriminator can be seen in Figure 5.13 for the different values of the T parameter; they are not really different from each other, even coinciding in the Discriminator values. The Generator loss tends to decrease, as expected, since it is learning to generate good images. Figure 5.14 shows the comparison between the different configurations of the number of epochs and the padding used. The picture representing the Generator losses shows that for 200 epochs of fine-tuning training the curve is really similar to the default one, being lower than the curve for the different padding, decreasing faster at the beginning but slowing down after approximately 100 epochs. This makes sense, since the visual results of the configuration using a padding of 200 are a lot worse than the others. The Discriminator losses, though, show some instability from epoch 110 onwards.

(a) Generator results for different T (b) Discriminator results for different T values, FT epochs = 40, Pad = 50 values, FT epochs = 40, Pad = 50

Figure 5.13: Example of losses output during the fine-tuning stage using the base system model with this project's dataset, with different T values.

(a) Generator results for different configura- (b) Discriminator results for different configu- tions rations

Figure 5.14: Example of losses output during the fine-tuning stage using the base system model with this project's dataset, with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).

Finally, Table 5.4 shows the different values obtained for the video results of the base system using the project's processed dataset. In this case, the opposite of the previous experiment happens: the fine-tuned frames have higher image quality, as the FID indicates, but their PSNR values are lower than those of the frames that only went through the meta-learning stage. Hence, the visual quality does not match the numeric values. This might happen because the PSNR performs a pixel-level similarity evaluation, whereas PS and NMI are evaluated by blocks, which makes more sense. This is why the PS values are better for the fine-tuned images, coinciding with the perceptual subjective evaluation.

Experiments                  PSNR    SSIM   FID       PS     NMI
Not FT, T = 1                13.79   0.46   111.73    0.39   0.20
Not FT, T = 8                14.14   0.46   113.49    0.40   0.20
Not FT, T = 32               14.12   0.47   115.76    0.40   0.20
FT, T = 1                    13.58   0.42   106.67    0.35   0.19
FT, T = 8                    13.61   0.43   94.87     0.35   0.20
FT, T = 32                   13.72   0.43   115.60    0.35   0.20
FT, T = 32, Epoch = 200      13.56   0.40   120.944   0.33   0.19
FT, T = 32, Padding = 200    13.99   0.42   469.46    0.56   0.21

Table 5.4: Metrics values obtained for the different configurations for the video results using the reference system with this project’s dataset

5.2.2 Results with both image and audio features

In the previous chapters it has been explained that the model can use both image and audio information from the frames to train for the task of face-to-face translation. In the previous experiment, the only information used was the one extracted from the images. This new experiment takes into account the Audio Embedder too. In this case, the way to proceed is just the same, but better results are expected, at least for the lip movements, since audio information is being processed too. The network's Generator takes as input each of the landmarks associated with the frames of the Sánchez video, the Image Embedder uses T frames and their associated landmark images and, finally, the Audio Embedder uses the audio from those frames to extract the spectrograms, trying to learn from them to generate clearer lip movements.

5.2.2.1 Qualitative results

As the first experiment training this model, Figure 5.15 shows the results obtained after training for 400 epochs with the added Audio Embedder, without going through the fine-tuning stage. The results in this case do not vary depending on the value of T, but it can be seen that the top part of the head is not as defined as in the previous experiment, while the mouth is more realistic.

(a) Results for T = 1

(b) Results for T = 8

(c) Results for T = 32

Figure 5.15: Example of output of the Video-Audio system trained only in the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The same process followed in the previous experiment is applied here, since it provides a better way of comparing the different results obtained. So, Figure 5.16 shows different results obtained varying the value of T, which generates somewhat more detailed faces, not as artificial as the previous ones. Figure 5.17 shows results when increasing the number of epochs that the model trains in the fine-tuning stage, which do not present many changes, the mouth being even less defined.

(a) Results for T = 1, FT epochs = 40

(b) Results for T = 8, FT epochs = 40

(c) Results for T = 32, FT epochs = 40

Figure 5.16: Different FT T values applied. Example of output of the Video-Audio system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40 200

Figure 5.17: Different FT epochs applied. Example of output of the Video-Audio system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Focusing less tightly on the face by using a larger crop of the images shows worse results, as in the previous experiment. The second image of Figure 5.18 crops the face with a bigger padding, 200, demonstrating that the generated frame is less clear and realistic and the open mouth is worse defined, the image being noisier.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40, Padding = 50 40, Padding = 200

Figure 5.18: Different paddings applied. Example of output of the Video-Audio system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

5.2.2.2 Quantitative results

Figure 5.19 shows the Generator and Discriminator losses during the training of this experiment. Since there are lots of iterations, it might be quite difficult to see the behavior, so it was decided to compute the mean value per epoch as well. The Generator loss proceeds in the same way as in the experiment of the previous section, but it stabilizes at a lower level, around 14, the loss being calculated the same way. In this case, the Discriminator loss stabilizes at around 1, which would mean that it is being fooled by the Generator. In general, the strange behavior matches the previous one, with the difference that the Generator loss in this case begins to decrease, as expected from a Generator network.

(a) Complete results representation (b) Complete results mean representation

(c) Generator results mean representation (d) Discriminator results mean representation

Figure 5.19: Training losses evolution graph in Video-Audio system with own dataset.

The losses for both the Generator and the Discriminator during the fine-tuning process can be seen in Figure 5.20 for the different values of the T parameter; they are not really different from each other, highlighting the slightly better performance of the Generator for T = 32. The truth is that, compared to the previous experiment, the Generator loss decreases faster to values of 15, whereas the previous one remained around 20-25. Figure 5.21 compares the different configurations of the number of epochs and the padding used. During the 200-epoch fine-tuning training, the Discriminator shows an increase in oscillations instead of converging to a stable point around 0, due to the training being done with just one person, Pedro Sánchez.

(a) Generator results for different T (b) Discriminator results for different T values, FT epochs = 40, Pad = 50 values, FT epochs = 40, Pad = 50

Figure 5.20: Example of losses output during the fine-tuning stage using the Video-Audio system model with this project's dataset, with different T values.

(a) Generator results for different configura- (b) Discriminator results for different configu- tions rations

Figure 5.21: Example of losses output during the fine-tuning stage using the Video-Audio system model available with this project’s dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).

Finally, Table 5.5 shows the different values obtained for the video results of this experiment. As can be seen, the PSNR and SSIM values match the better visual perception of the fine-tuned images with respect to the others. This can also be seen in the FID and PS values, both being lower for the fine-tuned configurations, so the synthesized images are relatively similar to the original ones. Finally, the NMI metric indicates for every comparison that the mutual information between both images is really low. The configuration using a bigger padding demonstrates again that the similarity between both images and the quality of the synthesized one are the worst, since when going through the network the landmark extraction performs worse, which affects the final generation of the image.

Experiments                  PSNR    SSIM   FID      PS     NMI
Not FT, T = 1                13.22   0.41   159.26   0.46   0.18
Not FT, T = 8                12.90   0.40   159.64   0.45   0.18
Not FT, T = 32               13.04   0.41   159.88   0.45   0.18
FT, T = 1                    14.40   0.45   129.40   0.31   0.20
FT, T = 8                    14.26   0.45   107.16   0.31   0.20
FT, T = 32                   14.05   0.43   107.52   0.31   0.20
FT, T = 32, Epoch = 200      14.12   0.44   109.77   0.28   0.20
FT, T = 32, Padding = 200    13.86   0.43   567.61   0.57   0.21

Table 5.5: Metrics values obtained for the different configurations for the results using the Video-Audio system with this project’s dataset

5.2.3 Results with audio loss

Finally, the third model to evaluate, showing its results for each experiment carried out, consists of the Generator, Discriminator, and Image and Audio Embedders. In other words, it uses the same architecture as the previous one, with the only difference that the L^A_MCH term is added to the total loss function of the Generator. As has been explained previously in the chapter Implementation, this loss penalizes the L1 distance between the calculated audio embedding and the audio matrix W. So, it was thought that learning this way might yield better results, which will be verified below.

5.2.3.1 Qualitative results

The different configurations used in the experiments with this model follow the two previous ones, so that the results can be compared visually. Figure 5.22 shows the first results of just the meta-learning stage, with several T values. It seems that adding the audio loss function helps the network learn better and faster, since these results are more defined than the previous ones, except for the open mouth, even though they do not vary when changing the T value either.

(a) Results for T = 1

(b) Results for T = 8

(c) Results for T = 32

Figure 5.22: Example of output of the Video-Audio with audio loss system trained only in the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The next step is to show the results after the fine-tuning stage, which learns the personalized features for the video not seen during the previous training, in this case Pedro Sánchez's video. In Figure 5.23 these results can be seen for a fine-tuning training of 40 epochs. To compare those results and see how the network performs on the task, Figure 5.24 shows the result for 200 epochs of fine-tuning.

(a) Results for T = 1, FT epochs = 40

(b) Results for T = 8, FT epochs = 40

(c) Results for T = 32, FT epochs = 40

Figure 5.23: Different FT T values applied. Example of output of the Video-Audio with audio loss system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40 200

Figure 5.24: Different FT epochs applied. Example of output of the Video-Audio with audio loss system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Finally, the last experiment worth showing for the video results using this network is the decrease in efficiency when the frame is cropped less tightly around the face, as can be seen in Figure 5.25.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40, Padding = 50 40, Padding = 200

Figure 5.25: Different paddings applied. Example of output of the Video-Audio with audio loss system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

5.2.3.2 Quantitative results

Figure 5.26 shows the evolution of the network losses, both Generator and Discriminator, observing a similar behavior to the previous experiment. The Generator loss increases at the beginning a little more slowly than the previous one, reaching higher values, but then it seems to keep increasing, behaving like a Generator that is not learning at all.

(a) Complete results representation (b) Complete results mean representation

(c) Generator results mean representation (d) Discriminator results mean representation

Figure 5.26: Training losses evolution graph in Video-Audio system using audio loss with own dataset.

Figures 5.27 and 5.28 show the different losses throughout the fine-tuning training for every configuration explained in the previous section for this network architecture. The curves are really similar to their analogues in the experiment where the audio loss is not included in the learning process, the Discriminator values being close to 0 since the network sees the same person during the training process.

(a) Generator results for different T (b) Discriminator results for different T values, FT epochs = 40, Pad = 50 values, FT epochs = 40, Pad = 50

Figure 5.27: Example of losses output during the fine-tuning stage using the Video-Audio system model with audio loss available with this project’s dataset with different T values.

(a) Generator results for different configura- (b) Discriminator results for different configu- tions rations

Figure 5.28: Example of losses output during the fine-tuning stage using the Video-Audio with audio loss system model with this project's dataset, with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).

As the last metric results, Table 5.6 shows the different values obtained for the video results of this experiment. As has been seen before, the PSNR and SSIM values are lower for the configurations that generate perceptually worse images, as the FID value indicates, while the best quality ones present a higher PSNR. The results seem pretty similar to the previous ones, highlighting the improvement in every metric for 200 epochs of fine-tuning training.

Experiments                  PSNR    SSIM   FID      PS     NMI
Not FT, T = 1                12.94   0.39   136.15   0.43   0.17
Not FT, T = 8                12.84   0.40   151.89   0.43   0.18
Not FT, T = 32               12.82   0.39   153.90   0.44   0.18
FT, T = 1                    13.71   0.42   119.90   0.36   0.19
FT, T = 8                    13.67   0.42   135.98   0.36   0.19
FT, T = 32                   13.57   0.42   117.38   0.36   0.19
FT, T = 32, Epoch = 200      13.68   0.43   97.82    0.30   0.20
FT, T = 32, Padding = 200    13.97   0.39   416.31   0.56   0.21

Table 5.6: Metrics values obtained for the different configurations for the results using the Video-Audio with audio loss system with this project’s dataset

5.3 Other experiments of possible interest

5.3.1 Federated learning

Given the resources available and the time allotted for this project's development, the idea of federated learning was interesting. The previous information about the configurations and results refers to Server 1's training but, as explained before, Server 2 also trained the three different models with a dataset of 2,347 videos, so that its results could be combined with the first one to obtain a global model based on a weighted mean of both. The main difference between both trainings was the time spent in the process: in this case, the training lasted several days more than in Server 1.

The global model, generated by combining the trained models from both servers, has been computed only with the information from epoch 400 and a few other epochs, in order to show a visual comparison among the images generated by the network.

This section shows the different results visually, to raise some interest in this learning method and to illustrate its efficiency for this particular task with this project's dataset. The configurations that gave the best results for Server 1 in each model are the ones shown in the next sections.

5.3.1.1 Reference system with this project’s dataset

Server 2

Using as input the same two videos used in the video experiments for Server 1, the base model with this project's dataset, which only uses the Generator, Discriminator, and Image Embedder, produces the results shown in Figure 5.29 when fine-tuning for 40 epochs. The rest of the parameters are the ones that worked best in the previous experiments: a padding of 50 and a T value of 32. In this case, the results obtained are worse than for Server 1, possibly because it was trained with fewer videos or fewer different faces.

Figure 5.29: Example of output of the Server 2 reference system trained through both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Global

The global model computed in this case, for the reference system with this project's own dataset, has been obtained as a weighted arithmetic mean that gives 90% importance to Server 1. During the meta-learning step, the images generated by the system on each server were saved to compare them with the global one. Figure 5.30 compares the output of this network with the outputs of the two servers. As can be seen, the outputs of the Generator from the global model show the worst quality, seemingly because the weights of Server 1 and Server 2 are not that similar. It might be interesting to train for more epochs and check whether the longer both servers train, the more similar their weights become. It is worth highlighting that the network learns very fast, achieving good results in the first epochs.
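As a minimal sketch of this combination step, assuming both servers' models are PyTorch modules with identical architectures and that their checkpoints can be loaded as state dictionaries (the file names below are hypothetical), the weighted averaging could look as follows:

```python
import torch

def weighted_average_state_dicts(state_1, state_2, w1=0.9):
    """Combine two compatible state dicts as w1 * server1 + (1 - w1) * server2."""
    w2 = 1.0 - w1
    merged = {}
    for key, tensor_1 in state_1.items():
        tensor_2 = state_2[key]
        if tensor_1.dtype.is_floating_point:
            merged[key] = w1 * tensor_1 + w2 * tensor_2
        else:
            # Integer buffers (e.g. BatchNorm step counters) cannot be averaged meaningfully.
            merged[key] = tensor_1.clone()
    return merged

# Illustrative usage (checkpoint names are hypothetical):
# g1 = torch.load("server1_generator.pth", map_location="cpu")
# g2 = torch.load("server2_generator.pth", map_location="cpu")
# global_generator.load_state_dict(weighted_average_state_dicts(g1, g2, w1=0.9))
```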

(a) Ep 10 (b) Ep 100 (c) Ep 250 (d) Ep 400 (e) Server 1 Generator outputs during meta-learning.

(f) Ep 10 (g) Ep 100 (h) Ep 250 (i) Ep 400 (j) Server 2 Generator outputs during meta-learning.

(k) Ep 10 (l) Ep 100 (m) Ep 250 (n) Ep 400 (o) Global model Generator outputs during meta-learning.

Figure 5.30: Reference networks' generated images during the meta-learning stage.

5.3.1.2 Results with both image and audio features

Server 2

The training of this model generates the results shown in Figure 5.31, with the evaluation configured as in the previous case, and yields good-quality results even though a small dataset was used in the training process.

Figure 5.31: Example of output of the Server 2 Video-Audio system trained through both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Global

(a) Ep 10 (b) Ep 100 (c) Ep 250 (d) Ep 400 (e) Server 1 Generator outputs during meta-learning.

(f) Ep 10 (g) Ep 100 (h) Ep 250 (i) Ep 400 (j) Server 2 Generator outputs during meta-learning.

(k) Ep 10 (l) Ep 100 (m) Ep 250 (n) Ep 400 (o) Global model Generator outputs during meta-learning.

Figure 5.32: Video-Audio networks' generated images during the meta-learning stage.

The global model computed in this case is obtained with the same calculation method as the previous one. Figure 5.32 compares the output of this architecture during training with the outputs of the two servers. It is difficult to compare this and the previous experiment through these images, but it could be said that the image generated at epoch 400 by the global model is, in this case, perceived as more realistic.

5.3.1.3 Results with audio loss

Server 2

In this case, using the audio loss to force the Generator to learn better, together with the previous configuration, seems to generate better-synthesized faces than using just the image features, as can be seen in Figure 5.33. The face, and the environment in general, are more detailed and better defined.

Figure 5.33: Example of output of the Server 2 Video-Audio with audio loss system trained through both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Global

The global model, though, generates worse images during the meta-learning stage, which seems to make sense since the Server 1 and Server 2 images are also very blurred, perhaps needing more epochs to learn than the other models. It is important to highlight that the first Generator result in Figure 5.34, obtained at epoch 10, is a black image. The weights might not have been saved correctly, or the weighted average may not have worked as expected.

(a) Ep 10 (b) Ep 100 (c) Ep 250 (d) Ep 400 (e) Server 1 Generator outputs during meta-learning.

(f) Ep 10 (g) Ep 100 (h) Ep 250 (i) Ep 400 (j) Server 2 Generator outputs during meta-learning.

(k) Ep 10 (l) Ep 100 (m) Ep 250 (n) Ep 400 (o) Global model Generator outputs during meta-learning.

Figure 5.34: Video-Audio with audio loss networks' generated images during the meta-learning stage.

5.3.2 Angela Merkel video results

The original paper mentions that its evaluation was done using 32 hold-out frames from 50 videos. In this project, the analysis has been done using just one video, Pedro Sánchez's. It would make more sense to use several examples rather than just one, since the application might perform better with one specific video than with another.

It would be optimal to evaluate the model using a small set of unseen videos from VoxCeleb2, since it has already been downloaded. This evaluation would basically consist in introducing two videos, mine and one from the list, fine-tuning on the one from the list, and generating a final video. It would be the same process as the one followed so far, but each metric would be the average of the values obtained for each case, so the results would be more general.

There has not been enough time to perform the evaluation this way, so it has been decided to use another video, "Merkel.mp4", downloaded from [97], with the following specifications: 1280x720 resolution, 01:03 minutes of duration, AAC and H.264 codecs, and 2 audio channels. A frame of it can be seen in Figure 5.35.

Figure 5.35: Angela Merkel video frame [97]

Table 5.7 shows, on the one hand, the best results for each of the experiments done previously. These configurations have then been used to compute the metric values on Merkel's video for each experiment. Some experiments had other configurations with similar metric values, but the configurations below were chosen based on the perceived quality of the generated frames. On the other hand, the last four rows show the values of each metric computed on the Merkel video. Comparing each metric, it can be seen whether the network generalizes well or it just worked well with Sánchez's video.

Videos    Main video-video experiments            Configuration                      PSNR    SSIM   FID      PS     NMI
Sánchez   Reference system                        FT: 40 epochs, pad = 50, T = 1     15.32   0.50   141.11   0.42   0.21
Sánchez   Reference system with own dataset       FT: 40 epochs, pad = 50, T = 32    14.27   0.48   145.50   0.42   0.21
Sánchez   Video-Audio features                    FT: 40 epochs, pad = 50, T = 8     14.26   0.45   107.16   0.31   0.20
Sánchez   Video-Audio features with audio loss    FT: 40 epochs, pad = 50, T = 1     14.40   0.45   129.40   0.31   0.20
Merkel    Reference system                        FT: 40 epochs, pad = 50, T = 1     19.12   0.51   277.32   0.47   0.13
Merkel    Reference system with own dataset       FT: 40 epochs, pad = 50, T = 32    18.34   0.43   169.05   0.40   0.11
Merkel    Video-Audio features                    FT: 40 epochs, pad = 50, T = 8     18.88   0.46   187.33   0.38   0.12
Merkel    Video-Audio features with audio loss    FT: 40 epochs, pad = 50, T = 1     18.54   0.45   261.25   0.39   0.12

Table 5.7: Best metric values obtained for each of the previous experiments.

Analyzing the metrics obtained and comparing Sánchez's and Merkel's qualitative (Figure 5.36) and quantitative results, the outcomes are somewhat similar. Regarding the metrics alone, in the case of Merkel's video the PSNR is higher in every experiment compared to the other video, with the highest value again corresponding to the reference system. The SSIM values lie in roughly the same range for both videos, as do the PS and NMI ones. However, it is striking that the FID values for Merkel's video are much higher than the other ones, although this is consistent with the visual quality, since Merkel's frames are less clear and defined than Sánchez's.

(a) Sánchez, Reference system. Results for T = 1, FT epochs = 40. (b) Merkel, Reference system. Results for T = 1, FT epochs = 40.

(c) Sánchez, Reference system with own dataset. Results for T = 32, FT epochs = 40. (d) Merkel, Reference system with own dataset. Results for T = 32, FT epochs = 40.

(e) Sánchez, Video-Audio features. Results for T = 8, FT epochs = 40. (f) Merkel, Video-Audio features. Results for T = 8, FT epochs = 40.

(g) Sánchez, Video-Audio features with audio loss. Results for T = 1, FT epochs = 40. (h) Merkel, Video-Audio features with audio loss. Results for T = 1, FT epochs = 40.

Figure 5.36: Different visual results for each experiment done using Sánchez's video and using Merkel's.

5.3.3 Video to Image results

Reference system

This GitHub implementation also makes it possible to transfer the face gestures from a video to a static image, in this case a 778x1027 image called "test cranston.jpeg", shown in Figure 5.37. Figure 5.38 shows the resulting translation when training just the meta-learning stage and when training both stages. As can be seen, the definition of the image is not as good as in the video-to-video transfer application, though the fine-tuned one shows a more detailed face.

Figure 5.37: Original image used for this application.

(a) Results without fine-tuning

(b) Results with 40 epochs of fine-tuning

Figure 5.38: Example of output of the reference system trained through both the meta-learning and fine-tuning steps. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.

Reference system with this project’s dataset

As has been shown, the example from the system implemented in the GitHub repository transfers the face captured by a webcam to a static image. This has also been implemented in this project's system but, instead of using a webcam, using the same video as in the previous examples. Figure 5.39 shows the result of this face translation for just the meta-learning process and for both meta-learning and fine-tuning. The padding and number of fine-tuning epochs, chosen because they previously gave better results, are 50 and 40, respectively. In this case, the pictures show a more detailed face and, more importantly, the lips move according to the video of myself.

(a) Results without fine-tuning (b) Results with 40 epochs of fine-tuning

Figure 5.39: Example of output of the reference system trained through both the meta-learning and fine-tuning steps with the project's dataset for the video to image application. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.

Results with both image and audio features

The next experiment with this trained model is transferring the face gestures to the static image to animate it and create a final video. Figure 5.40 shows the results of this face translation for just the meta-learning process. The fine-tuning results are not shown because, in this case, the process uses a single image, so there is no audio information to extract and the second learning stage makes no sense. As in the previous case, the padding chosen because of its previously better results is 50. Comparing this result with the previous one (no audio features), the face is more realistic and its borders are more defined given the same epoch and padding configuration.

(a) Results without fine-tuning

Figure 5.40: Example of output of the Video-Audio system trained just for the meta-learning stage with the project's dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.

Results with audio loss

In this case, the same happens as in the previous section: since there is no audio available in a static image, it makes no sense to go through the fine-tuning process. It would be necessary to create another network that did not make use of the Audio Embedder. Figure 5.41 shows this model's result using a padding of 50 to focus more on the face. Even though the eye and nose areas might seem clearer, in this case the lip movement is not as accurate as in the previous one.

(a) Results without fine-tuning

Figure 5.41: Example of output of the Video-Audio with audio loss system trained for the meta-learning step with the project’s dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video’s face translated.

5.3.4 Video to not human face

An interesting experiment with this implemented network is to see how well it animates, or transfers the talking face of a video to, a non-human head. Analyzing this behavior allows checking whether the network is good at identifying landmarks on non-human faces with different characteristics and at generating human face gestures on them.

Results on animal face

First, it has been decided to try this on a picture of a monkey that has a lot of facial hair but whose face structure is not that different from a human's. Figure 5.42 shows this good-quality 640x675 RGB picture, which has been downloaded from [98].

Figure 5.42: Example of a non-human face picture [98]

This experiment has been done using only the architecture with the Image Embedder, Generator, and Discriminator, since the target is an image and has no audio features with which to go through the fine-tuning process. The results in Figure 5.43 show a frame with the features of a face, although it is not as clear as the previous ones, possibly because of the facial hair; with more intensive training it could work much better. However, it is quite clear that the face has been detected and some gestures have been generated.

Figure 5.43: Example of output using a non-human face to transfer the gestures from the video of myself.

Results on video game face

Second, it has been decided to try these networks on a video game character, since this could be a final application [99]: generating animated video game characters in a simple and fast way without having to design them. The picture used for this is a 744x410 RGB image from The Sims 4 [100], as can be seen in Figure 5.44. It was chosen for its clear and simple face, making it a better image than the monkey one for this application.

Figure 5.44: Example of video game face picture [100]

This experiment has also been done using the same network as before, obtaining a clearer result (Figure 5.45) even though the original picture's face definitely does not look like a real one.

Figure 5.45: Example of output using a video game face to transfer the gestures from the video of myself.

5.3.4.1 Video to same video

Finally, the last experiment of interest that has been carried out is to transfer my face to the exact same video, just to analyze how well it works when the same face structure is used for both videos.

(a) Results for reference system with own dataset

(b) Results for Audio Embedder network

Figure 5.46: Example of output using the same two videos of myself for the network without Audio Embedder and for the one with it.

As can be seen in Figure 5.46, two examples have been generated: the first one using the reference architecture and the second one taking into account the audio features from the video. The first one looks more synthetic, although the generated frame is perfectly recognizable. The second one, however, shows a less clear face, making it impossible to tell whether the mouth is open or not.

Chapter 6

Conclusions and future lines

This Master's Thesis has focused on exploring different approaches based on GANs to generate synthesized frames from landmark information, among other features, and, finally, to perform the face transfer from one video to another.

Thus, after a broad study of some of the state-of-the-art techniques found for this task and closely related ones, the experimental part of this project has been presented. The main objective of this part has been to extend an already implemented GAN system based on [1], adding some new configurations to its architecture and its training procedure.

This chapter, then, includes some final conclusions, both practical and theoretical, extracted during the study, development, implementation, and analysis of results of this project. In addition, some future lines of research and experimentation are defined.

6.1 Conclusions

Firstly, the extensive state-of-the-art search and study done in the Chapter State of the art has introduced concepts about DL, and DL applied to this generative field, which has helped in the development of this project.

In the Chapter’s Development setup and Implementation the algorithm, systems, and different frameworks to carry out this project have been briefly described, focusing on the topics that are closely related to the targets of the Thesis.

Finally, the Chapter Simulations and results presents the different results analyzed, explaining the configurations applied to each of the experiments.

So, the main conclusions that have been extracted from this project are presented as follows.

• After several experiments, the dataset proposed seems to have been too small for the meta-learning process. Taking into account that the base implementation used VoxCeleb2, which provides 150k videos, while in this project only about 5k were used for training, the difference is too big. This is directly visible in the Generator loss graphs: since only a few random videos are seen per epoch, and sometimes the network learns from them and sometimes not, oscillations appear at the beginning, and then the loss stabilizes without decreasing. It is true that downloading and processing the dataset, extracting multimedia features from each item, takes a lot of time and computing power, which is why the dataset was so small.

• Staying with the dataset, the feature extraction should be analyzed to evaluate whether the extracted features are useful for the learning process or not. Extracting both facial landmarks and spectrograms from an image and an audio segment is quite simple using the libraries available in Python. However, while landmarks clearly help with facial analysis tasks such as expression, identity, and face recognition, the audio spectrogram computed might not have really helped in this task. This can be seen, for example, in synthesized frames where the mouth was supposed to be open but was not. Other kinds of features extracted from an audio segment, such as MFCCs, might help.

• The design of the GANs used in this project, with additional architecture components such as the two embedders that extract features, has been a real challenge, since their implementation and training are considered rather complex. Thanks to this, broad knowledge of the area has been gained. These networks can be considered a breakthrough within DL, generating images with information that can help with many day-to-day applications.

• An element to highlight from the architectures used in this project, already described in the Chapter State of the art, is the Autoencoder. In this project, this architecture was implemented using an encoder and the Generator itself. The landmark image was fed to the encoder part to compress the related information, so that the information provided by the other two feature encoders could be inserted and the whole could then be reconstructed by the decoder/generator, synthesizing the frame. It has been a useful way to feed different information from the input video into the network in a low-dimensional form in order to carry out the generative approach.

• The subsequent training of the implemented networks required a lot of time and computing power even though the dataset was small. Several months of data preprocessing were necessary, making the use of GPUs indispensable. Several devices were used for data processing and, finally, for the training process, and although the latter did not take that long, moving and transferring data from one computer to another was complicated and time-consuming. It has been shown that, even though the final visual results are not really good, during the training stage the images synthesized by every implemented network show remarkable quality. This probably happens because random frames of the same person are being learned during training, and the final result may be less optimal because the nature of the training differs from the final application, which is transferring the facial gestures from one video to another.

• To ensure that the network learning process improves at each epoch, several loss functions have been used. In a GAN architecture it makes sense to have one loss function for the Generator and another for the Discriminator, since two models are being trained. It could be said that the Discriminator is updated like any other DL neural network, while the Generator uses the Discriminator as its loss function, meaning that its loss is implicit and learned throughout the process. In this case, the Discriminator seeks to maximize the realism score assigned to real images and minimize the one assigned to synthesized images. The losses proposed for the Generator learning process involve three different terms. The first is a simple one that measures the distance between the ground-truth image and the synthesized one using features from the VGG19 and VGGFace networks. The second one is related to the realism score given by the Discriminator, which the Generator should maximize. Finally, the third loss is a matching term, which can be separated for both embedding feature vectors and encourages the similarity between these embedding vectors and the corresponding ones in the Discriminator, associated with the individual videos. These different loss functions provide different information and comparisons about the process and the results obtained that help with the learning process (a minimal sketch of how such a combined Generator loss could be assembled is shown after this list).

• Results achieved by the developed systems do not always have to be better than those provided by the reference system, as shown by the quantitative evaluations in the previous section. Nevertheless, they are still comparable, with similar results between the reference system and the other three main experiments. In this project's experiments, it could be said that the frames synthesized by the implemented systems with the collected dataset outperform the reference system qualitatively. In other words, the generated images look more defined and clear, to the point of showing the mouth open when it should be open, which does not happen with the implemented reference system. However, the quantitative results, the metrics, show better values for the reference one. So, there should be a discussion about whether to give more weight to visual perception or to the numbers obtained.

• Differentiating between experiments, it seems that adding the audio features has not really helped in the generation of the results, as said previously. The audio feature encoder does not seem indispensable for this task and this dataset; both the quantitative and the qualitative evaluations make this clear.

• Regarding the evaluation part of the system, it should be highlighted that, even though there are several metrics for GAN evaluation, there is actually no consensus on which one best captures the strengths and limitations of these networks. It was therefore quite difficult to choose which metrics to use, finally settling on the ones seen in the state of the art. It is also necessary to take into account that, even though numerical evaluation exists, it is not always precise and does not always match human visual perception.

• As other results of interest, some additional applications were analyzed, such as applying this face transfer to non-human faces (animals and video game characters). Nowadays, video game characters are more and more realistic, with face features that coincide with human ones, so it could be possible to generate person-like faces for different fields. The problem is that a non-human face such as an animal's has features that are not easily identifiable, like the facial hair, so the system should be adapted for better face recognition and transfer.

• Finally, some of the applications that could boost this system may or may not need training for the specific task to be achieved. Some possible ones are facial expression translation to virtual avatars, video game characters, or animated or real characters from films and series. The translation could also be more specific, such as the mouth region only, in which case the system would need to be trained specifically on mouths. This type of mouth transfer could be applied to the automatic modification of a movie character's mouth when dubbing is performed, so that the lip movement matches the spoken lines.
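As referenced in the conclusion about the loss functions, the following minimal sketch illustrates how such a composite Generator loss could be assembled in PyTorch. The perceptual feature extractors, the Discriminator interface, and the weighting factors are assumptions for illustration, not the exact implementation used in this project.

```python
import torch.nn.functional as F

def generator_loss(vgg19_feats, vggface_feats, disc_score,
                   e_img, e_audio, w_img, w_audio,
                   gt_frame, fake_frame,
                   lambda_cnt=10.0, lambda_mch=10.0):
    """Composite Generator loss: content + adversarial + embedding matching.

    vgg19_feats / vggface_feats: callables returning lists of feature maps.
    disc_score: realism score the Discriminator assigned to the fake frame.
    e_img, e_audio: embedding vectors produced by the two Embedders.
    w_img, w_audio: the Discriminator vectors associated with the same video,
                    which the embeddings are encouraged to match.
    """
    # Content term: L1 distance between perceptual features of real and fake frames.
    loss_cnt = sum(F.l1_loss(f, r) for f, r in zip(vgg19_feats(fake_frame), vgg19_feats(gt_frame)))
    loss_cnt += sum(F.l1_loss(f, r) for f, r in zip(vggface_feats(fake_frame), vggface_feats(gt_frame)))

    # Adversarial term: the Generator tries to maximize the realism score.
    loss_adv = -disc_score.mean()

    # Matching term: keep both embeddings close to the Discriminator's video vectors.
    loss_mch = F.l1_loss(e_img, w_img) + F.l1_loss(e_audio, w_audio)

    return lambda_cnt * loss_cnt + loss_adv + lambda_mch * loss_mch
```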

6.2 Future lines of work

To conclude, there are quite a few future lines to continue working on this project and trying to improve its efficiency. The main one concerns the way the system is trained: extending the dataset, either by using the whole VoxCeleb2 dataset or simply more data than the amount used. In line with this, training for more epochs and configuring different hyper-parameters, such as the optimizers and the fine-tuning characteristics, should help in comparing different results and finding the best network configuration.

Not only the configuration should be taken into account: the architecture itself of the different networks could be improved by trying deeper neural networks, which would possibly help increase the quality of the synthesized images. This improvement could also be achieved by implementing networks that generate bigger frames, the so-called BigGANs [101]. Furthermore, combining these networks with RNNs to learn in the time domain, and not just frame by frame, might be of interest.

In this project several networks have been implemented separately, so a useful and efficient idea would be to implement a modular architecture combining the three of them, where the system could be configured to use just the image features, just the audio features, or both together. Also, in this work landmarks and spectrogram features have been extracted from the multimedia information, but other characteristics, such as MFCCs, bounding boxes, or even metadata, could be tried as input to the network.
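As a minimal sketch of the alternative audio features mentioned above, and assuming the librosa library (not used in this project) is available, MFCCs and a log-mel spectrogram could be extracted from an audio segment as follows; all parameter values are illustrative.

```python
import librosa
import numpy as np

def audio_features(wav_path, sr=16000, n_mels=64, n_mfcc=13):
    """Return a log-mel spectrogram and MFCCs for one audio segment."""
    audio, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)              # shape: (n_mels, frames)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return log_mel, mfcc
```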

Lastly, as already mentioned in the evaluation definition, a widely extended evaluation practice consists in comparing results against the different state-of-the-art systems that have been implemented for related tasks. This could be a good way to determine whether this system is performing well or whether new configurations of the whole DL system should be considered.

Bibliography

[1] E. Zakharov, A. Shysheya, E. Burkov, and V. S. Lempitsky, “Few-shot adversarial learning of realistic neural talking head models,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9458–9467, 2019. [2] “Jim carrey deepfake [vfx comparison],” 2019 (accessed May 15, 2020). https: //www.youtube.com/watch?v=JbzVhzNaTdI. [3] “Barack obama is the benchmark for fake lip-sync videos,” 2018 (accessed May 15, 2020). https://medium.com/syncedreview/barack-obama-is-the- benchmark-for-fake-lip-sync-videos-d85057cb90ac. [4] Y. Movshovitz-Attias, T. Kanade, and Y. Sheikh, “How useful is photo-realistic rendering for visual learning?,” in ECCV Workshops, 2016. [5] D. Park and D. Ramanan, “Articulated pose estimation with tiny synthetic videos,” 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 58–66, 2015. [6] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” ArXiv, vol. abs/1608.02192, 2016. [7] M. FathimaShirin and M. S. Meharban, “Review on image synthesis techniques,” 2019 5th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 13–17, 2019. [8] A. Amini, “Introduction to deep learning,” 2020 (accessed March 19, 2020). http: //introtodeeplearning.com/.


[9] S. M. John McGonagle, Jos´eAlonso Garc´ıa,“Feedforward neural networks,” 2020 (accessed April 4, 2020). https://brilliant.org/wiki/feedforward-neural- networks/. [10] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, 2015. [11] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2013. [12] Y. Kim, “Convolutional neural networks for sentence classification,” Association for Computational Linguistics, pp. 1746–1751, 2014. [13] A. Krizhevsky, “The cifar-10 dataset,” 2009 (accessed March 19, 2020). https: //www.cs.toronto.edu/~kriz/cifar.html. [14] S. Bhatt, “Understanding the basics of cnn with image classification,” 2019 (accessed March 19, 2020). https://becominghuman.ai/understanding-the- basics-of-cnn-with-image-classification-7f3a9ddea8f9. [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012. [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog- nition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. [18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014. [19] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” CoRR, vol. abs/1511.06434, 2015. [20] S. Thalles, “A short introduction to generative adversarial networks,” 2017 (ac- cessed March 20, 2020). https://sthalles.github.io/intro-to-gans/. 131

[21] N. Hubens, “Introducci´onal autoencoder,” 2018 (accessed March 20, 2020). https://www.deeplearningitalia.com/introduzione-agli-autoencoder-2/. [22] C. Escolano, M. R. Costa-juss`a, and J. A. R. Fonollosa, “(self-attentive) autoencoder-based universal language representation for machine translation,” ArXiv, vol. abs/1810.06351, 2018. [23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computa- tion, vol. 9, pp. 1735–1780, 1997. [24] C. Olah, “Understanding lstm networks,” 2015 (accessed March 19, 2020). https: //colah.github.io/posts/2015-08-Understanding-LSTMs/. [25] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “Cnn-rnn: A unified framework for multi-label image classification,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2285–2294, 2016. [26] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2625–2634, 2014. [27] A. A. Efros and W. T. Freeman, “Image quilting for texture synthesis and trans- fer,” in SIGGRAPH ’01, 2001. [28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976, 2016. [29] P. Isola, “pix2pix,” 2017 (accessed March 29, 2020). https://github.com/ phillipi/pix2pix/. [30] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 8798–8807, 2017. [31] J. Vincent, “These faces show how far ai image generation has ad- vanced in just four years,” 2018 (accessed March 20, 2020). https: //www.theverge.com/2018/12/17/18144356/ai-image-generation-fake- faces-people-nvidia-generative-adversarial-networks-gans. 132

[32] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recogni- tion,” in INTERSPEECH, 2018. [33] A. Z. J. S. Chung*, A. Nagrani*, “The voxceleb2 dataset,” 2018 (accessed March 29, 2020). http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html. [34] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in ACCV, 2016. [35] A. Z. J. S. Chung, “The oxford-bbc lip reading in the wild (lrw) dataset,” 2016 (accessed March 29, 2020). https://www.robots.ox.ac.uk/~vgg/data/ lip reading/lrw1.html. [36] A. Z. J. S. Chung, “The oxford-bbc lip reading sentences 2 (lrs2) dataset,” 2016 (accessed March 29, 2020). http://www.robots.ox.ac.uk/~vgg/data/ lip reading/lrs2.html. [37] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453, 2016. [38] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou, “Joint 3d face recon- struction and dense alignment with position map regression network,” ArXiv, vol. abs/1803.07835, 2018. [39] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3d solution,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 146–155, 2016. [40] Y. F, “Prnet,” 2018 (accessed March 29, 2020). https://github.com/YadiraF/ PRNet. [41] T. Baltrusaitis, P. Robinson, and L.-P. Morency, “Openface: An open source facial behavior analysis toolkit,” 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10, 2016. [42] J. Thies, M. Zollh¨ofer,M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2387–2395, 2016. [43] O. Wiles, A. S. Koepke, and A. Zisserman, “X2face: A network for controlling face generation by using images, audio, and pose codes,” in ECCV, 2018. 133

[44] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: learning lip sync from audio,” ACM Trans. Graph., vol. 36, pp. 95:1– 95:13, 2017. [45] J. S. Chung, A. Jamaludin, and A. Zisserman, “You said that?,” ArXiv, vol. abs/1705.02966, 2017. [46] Y. Song, J. Zhu, X. Wang, and H. Qi, “Talking face generation by conditional recurrent adversarial network,” in IJCAI, 2019. [47] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang, “Talking face generation by adversarially disentangled audio-visual representation,” in AAAI, 2019. [48] D. Cristinacce and T. F. Cootes, “Feature detection and tracking with constrained local models,” in BMVC, 2006. [49] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d and 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks),” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1021–1030, 2017. [50] J. S. Chung and A. Zisserman, “Out of time: Automated lip sync in the wild,” in ACCV Workshops, 2016. [51] “Develop, optimize and deploy gpu-accelerated apps,” 2020 (accessed May 13, 2020). https://developer.nvidia.com/cuda-toolkit. [52] “Nvidia cudnn,” 2020 (accessed May 13, 2020). https://developer.nvidia.com/ cudnn. [53] “State of ml frameworks,” 2019 (accessed May 12, 2020). https: //thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates- research-tensorflow-dominates-industry/. [54] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017. [55] “Pytorch,” 2020 (accessed May 12, 2020). https://pytorch.org/. [56] “Tensorflow,” 2020 (accessed May 12, 2020). https://www.tensorflow.org/. 134

[57] “The gradient,” 2020 (accessed May 12, 2020). https://thegradient.pub/ about/. [58] “Torch,” 2020 (accessed May 12, 2020). http://torch.ch/. [59] “Theano github.” https://github.com/Theano/Theano. [60] “Tensorflow github.” https://github.com/tensorflow/tensorflow. [61] “Keras github.” https://github.com/keras-team/keras. [62] “Pytorch github.” https://github.com/pytorch/pytorch. [63] “Cntk github.” https://github.com/Microsoft/CNTK. [64] “Caffe github.” https://github.com/BVLC/caffe. [65] “Matlab deep learining toolbox.” https://es.mathworks.com/products/deep- learning.html. [66] “Swift github.” https://github.com/tensorflow/swift. [67] “Dl4j github.” https://github.com/eclipse/deeplearning4j. [68] “Mxnet github.” https://github.com/apache/incubator-mxnet. [69] “Installing previous versions of pytorch,” 2020 (accessed May 12, 2020). https: //pytorch.org/get-started/previous-versions/. [70] “pytube,” (accessed May 14, 2020). https://pypi.org/project/pytube/. [71] “Ffmpeg,” (accessed May 14, 2020). https://ffmpeg.org/. [72] “Pydub,” (accessed May 14, 2020). https://pypi.org/project/pydub/. [73] “Opencv,” (accessed May 14, 2020). https://opencv.org/. [74] “scikit-learn.” https://scikit-learn.org/stable/. [75] “Voxceleb download,” (accessed May 15, 2020). https://docs.google.com/ forms/d/e/1FAIpQLSdQhpq2Be2CktaPhuadUMU7ZDJoQuRlFlzNO45xO- drWQ0AXA/viewform?fbzx=7440236747203254000. 135

[76] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in ECCV, 2016. [77] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, "Classifying environmental sounds using image recognition networks," in KES, 2017. [78] H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena, "Self-attention generative adversarial networks," ArXiv, vol. abs/1805.08318, 2019. [79] "Residual block." https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec. [80] X. Huang and S. J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519, 2017. [81] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2015. [82] "Imagenet - large scale visual recognition challenge (ilsvrc)." http://www.image-net.org/challenges/LSVRC/. [83] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in BMVC, 2015. [84] "Pedro Sánchez." https://es.wikipedia.org/wiki/Pedro_Sánchez. [85] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, pp. 600–612, 2004. [86] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu, "Lip movements generation at a glance," in ECCV, 2018. [87] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.

[88] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NIPS, 2017. [89] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” ArXiv, vol. abs/1606.03498, 2016. [90] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016. [91] D. E. King, “Dlib-ml: A machine learning toolkit,” J. Mach. Learn. Res., vol. 10, pp. 1755–1758, 2009. [92] “Perceptual similarity metric and dataset [project page].” https://github.com/ richzhang/PerceptualSimilarity. [93] “Directo coronavirus — rueda de prensa de pedro s´anchez.” https:// www.youtube.com/watch?v=-ptGIV4LjDc. [94] “Vincent th´evenin - 2019 07 01 20 37 35.” https://www.youtube.com/ watch?feature=player embedded&v=F2vms-eUrYs. [95] “Realistic-neural-talking-head-models.” https://github.com/vincent- thevenin/Realistic-Neural-Talking-Head-Models. [96] “Questions about matching loss #27.” https://github.com/vincent- thevenin/Realistic-Neural-Talking-Head-Models/issues/27. [97] “Angela merkel — 2017 new year’s speech.” https://www.youtube.com/watch?v= mJEKql2QV48. [98] “10 animales salvajes beb´es: ¡qu´e cosas m´as lindas!.” https:// queanimalada.net/animales-salvajes-bebe/. [99] T. Shi, Y. Yuan, C. Fan, Z. Zou, Z. Shi, and Y. Liu, “Face-to-parameter trans- lation for game character auto-creation,” 2019 IEEE/CVF International Confer- ence on Computer Vision (ICCV), pp. 161–170, 2019. [100] “Sims 4 face.” https://picsart.com/i/image-sims-4-no-323937968530201. 137

[101] A. Voynov, S. Morozov, and A. Babenko, "Big gans are watching you: Towards unsupervised object segmentation with off-the-shelf generative models," ArXiv, 2020. [102] "Deepempathy." https://deepempathy.mit.edu/. [103] M. Brundage, S. Avin, J. Clark, H. Toner, P. Eckersley, B. Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, B. Filar, H. S. Anderson, H. Roff, G. C. Allen, J. Steinhardt, C. Flynn, S. Ó hÉigeartaigh, S. Beard, H. Belfield, S. Farquhar, C. Lyle, R. Crootof, O. Evans, M. Page, J. Bryson, R. Yampolskiy, and D. Amodei, "The malicious use of artificial intelligence: Forecasting, prevention, and mitigation," ArXiv, vol. abs/1802.07228, 2018. [104] "Ai4eu." https://www.ai4eu.eu/. [105] "Horizon 2020." https://www.horizon2020.es/. [106] "Docker." https://www.docker.com/.

Appendices



Appendix A

Social, economic, environmental, ethical and professional impacts

A.1 Introduction

This Master’s Thesis has focused on the research and application of an algorithm using AI that translates a face, its gestures, and lip movement, to another one’s face. The main applications that could make use of this function are not really that clear. There are several options, and obviously, potentially catastrophic cases can be found among them. It would be necessary to develop what kind of impacts, both positive and neg- ative, can arise from making use of this application in different living environments. This appendix will describe briefly the different impacts related to the implementation and application of this project. 142

A.2 Description of impacts related to the project

A.2.1 Ethical impact

One of the main ethical impacts to highlight when thinking about how to implement this system is the security and privacy of the different users' information, since videos and images of people's faces are used. In this project, for example, a set of public videos of celebrities has been used, but there are different ways to avoid dealing directly with user information, such as Federated Learning, which avoids storing user information in a central server where it could be at risk of being leaked.

When talking about generating synthesized images of people or talking faces from different videos, the word deepfake emerges because of their possible negative applications, which range from ransomfakes and campaigns against politicians to fake news and deepfake pornography. However, there are also ethical applications of the same generative AI technology, such as synthesizing images to make people empathize with those who have suffered disasters such as wars, by showing them how their own homes would look if they had suffered the same. This is the case of DeepEmpathy [102], a project by UNICEF together with MIT. This basically shows that, like every other technology, generative AI models have both positive and negative applications. Nevertheless, for some other applications it is not clear whether their ethical impact would be positive or negative, such as synthetic resurrection, for example recreating deceased actors or actresses to take part in a movie. In this case, the lines between respectful recreation, commercial exploitation, and psychological damage are very thin and should be analyzed and discussed.

These risks have to be addressed through regulation or legislative definitions for the use of information and of generative AI models. Among the possible measures, some can be highlighted: reviewing critical controls for AI products and fixing possible security vulnerabilities, deploying robust encryption to protect sessions and data, and considering the ethical ramifications if the development is used for malicious applications. [103] gives an overview of the malicious use of AI and its forecasting, prevention, and mitigation.

A.2.2 Social impact

The social impact of a generative image technology depends on its final applications. Indeed, one of the main impacts of AI that should be taken into account is the possible loss of employment caused by replacing workers with an algorithm capable of performing the same function more efficiently. However, this type of system would also generate new jobs dedicated to implementing the algorithm and supervising it, since it is not infallible.

A.2.3 Economic impact

This type of system implementation requires investing in powerful devices equipped with GPUs, since a normal device could take months to process the data and train the neural networks. As can be seen in Appendix B, the number of devices needed to handle even a small dataset (there was no time to use a bigger one) was high, with the most expensive one costing approximately 3,500 €. If the researcher or user wants to implement a more efficient training, the cost will be correspondingly higher; it basically depends on the time one is willing to spend on the deployment of the application.

However, the different final applications that could be built on top of this system might help reduce other types of costs, such as labor costs, which could be interesting for companies or projects seeking an increase in productivity.

A.2.4 Environmental impact

In terms of environmental impact, this project does not have many possible positive or negative consequences. The only material used has been different types of computing devices, some of them with GPUs, which consume a lot of energy for many days, even weeks. That is the only negative environmental impact of this project: the use of large amounts of energy for the implementation and deployment of the application.

A.3 Conclusions

To conclude, the impact that could be most negative is the ethical one, as it depends on whether the system is used for malicious applications or not. In reality, this affects all technologies equally, so specific regulation is needed. In other words, this project has a positive impact on society if it is implemented and used ethically, since the rest of the impacts do not pose a threat.

Appendix B

Economic budget

In this appendix, the economic budget adapted to this project is shown, covering aspects such as material execution costs, professional fees, and the taxes involved.

Figure B.1: Economic budget.

Appendix C

Code available

This Master's Thesis is a pilot version of one of the multiple European projects being carried out at the Grupo de Aplicación de Telecomunicaciones Visuales (GATV) of the Escuela Técnica Superior de Ingenieros de Telecomunicación (ETSIT) at the Universidad Politécnica de Madrid (UPM), Spain. In particular, the project is called AI4EU [104]. This consortium was established in 2019 to build the first European Artificial Intelligence (AI) On-Demand Platform and Ecosystem. Some of its activities involve mobilizing research groups and companies from around the world to make AI-based resources available, enabling scientific research and innovation in this field. It runs from 2019 to 2021 and has received funding from the EU through the Horizon 2020 [105] program.

Since this is an implementation belonging to a GATV project, the code itself is not available, but a simulation environment using Docker [106], a platform that uses virtualization to deliver software in containers, has been set up so that, by providing two videos, the final application can be run.

To access the Docker environment where the project can be launched, the following link should be used: https://www.ai4eu.eu/resource/ai4eu-media-pilot. Registration in the AI4EU platform is necessary; once registered and logged in, the instructions on how to use it are provided.


Appendix D

Other Results

This appendix shows the different frames synthesized by the Generator during the meta-learning stage, where a dataset of 2,545 videos on Server 1 has been used as input to the network. These images are worth showing because the generated frames reach good quality quickly, thanks to the fast learning achieved by the network. The frames generated during training are remarkably better than the final result, the one obtained after the fine-tuning training. The reason behind this could be three-fold: first, the dataset used in this project is really small; second, during the training process images of the same person are used as input to the network; and, finally, the objective of this thesis is to translate facial gestures from one person to another, which is quite different from just generating frames from landmarks. An image has been saved for each model every epoch over a total of 400, but only the images generated at epochs that are multiples of 5 are shown, since this is enough to see the evolution.
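As a rough sketch of how such intermediate samples could be saved during training, assuming a PyTorch Generator whose output lies in [-1, 1]; the module names and the call signature are assumptions, not the project's actual training loop.

```python
import os
import torch
from torchvision.utils import save_image

def save_generator_sample(generator, landmark_batch, epoch,
                          out_dir="meta_learning_samples", every=5):
    """Save one synthesized frame every `every` epochs to track training visually."""
    if epoch % every != 0:
        return
    os.makedirs(out_dir, exist_ok=True)
    generator.eval()
    with torch.no_grad():
        fake = generator(landmark_batch)  # assumed call signature of the Generator
    # Map the output from [-1, 1] to [0, 1] before writing the PNG.
    save_image((fake[0] + 1.0) / 2.0, os.path.join(out_dir, f"epoch_{epoch:03d}.png"))
    generator.train()
```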

D.1 Reference system with this project’s dataset

(a) 0 (b) 5 (c) 10 (d) 15 (e) 20 (f) 25

(g) 30 (h) 35 (i) 40 (j) 45 (k) 50 (l) 55

(m) 60 (n) 65 (o) 70 (p) 75 (q) 80 (r) 85

(s) 90 (t) 95 (u) 100 (v) 105 (w) 110 (x) 115

(y) 120 (z) 125

Figure D.1: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part I.

(a) 130 (b) 135 (c) 140 (d) 145 (e) 150 (f) 155

(g) 160 (h) 165 (i) 170 (j) 175 (k) 180 (l) 185

(m) 190 (n) 195 (o) 200 (p) 205 (q) 210 (r) 215

(s) 220 (t) 225 (u) 230 (v) 235 (w) 240 (x) 245

(y) 250 (z) 255

Figure D.2: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part II.

(a) 260 (b) 265 (c) 270 (d) 275 (e) 280 (f) 285

(g) 290 (h) 295 (i) 300 (j) 305 (k) 310 (l) 315

(m) 320 (n) 325 (o) 330 (p) 335 (q) 340 (r) 345

(s) 350 (t) 355 (u) 360 (v) 365 (w) 370 (x) 375

(y) 380 (z) 385

Figure D.3: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part III.

(a) 390 (b) 395

Figure D.4: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part IV.

D.2 Results with both image and audio features

(a) 0 (b) 5 (c) 10 (d) 15 (e) 20 (f) 25

(g) 30 (h) 35 (i) 40 (j) 45 (k) 50 (l) 55

(m) 60 (n) 65 (o) 70 (p) 75 (q) 80 (r) 85

(s) 90 (t) 95 (u) 100 (v) 105 (w) 110 (x) 115

(y) 120 (z) 125

Figure D.5: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part I.

(a) 130 (b) 135 (c) 140 (d) 145 (e) 150 (f) 155

(g) 160 (h) 165 (i) 170 (j) 175 (k) 180 (l) 185

(m) 190 (n) 195 (o) 200 (p) 205 (q) 210 (r) 215

(s) 220 (t) 225 (u) 230 (v) 235 (w) 240 (x) 245

(y) 250 (z) 255

Figure D.6: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part II.

(a) 260 (b) 265 (c) 270 (d) 275 (e) 280 (f) 285

(g) 290 (h) 295 (i) 300 (j) 305 (k) 310 (l) 315

(m) 320 (n) 325 (o) 330 (p) 335 (q) 340 (r) 345

(s) 350 (t) 355 (u) 360 (v) 365 (w) 370 (x) 375

(y) 380 (z) 385

Figure D.7: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part III.

(a) 390 (b) 395

Figure D.8: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part IV.

D.3 Results with audio loss

(a) 0 (b) 5 (c) 10 (d) 15 (e) 20 (f) 25

(g) 30 (h) 35 (i) 40 (j) 45 (k) 50 (l) 55

(m) 60 (n) 65 (o) 70 (p) 75 (q) 80 (r) 85

(s) 90 (t) 95 (u) 100 (v) 105 (w) 110 (x) 115

(y) 120 (z) 125

Figure D.9: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part I.

(a) 130 (b) 135 (c) 140 (d) 145 (e) 150 (f) 155

(g) 160 (h) 165 (i) 170 (j) 175 (k) 180 (l) 185

(m) 190 (n) 195 (o) 200 (p) 205 (q) 210 (r) 215

(s) 220 (t) 225 (u) 230 (v) 235 (w) 240 (x) 245

(y) 250 (z) 255

Figure D.10: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part II.

(a) 260 (b) 265 (c) 270 (d) 275 (e) 280 (f) 285

(g) 290 (h) 295 (i) 300 (j) 305 (k) 310 (l) 315

(m) 320 (n) 325 (o) 330 (p) 335 (q) 340 (r) 345

(s) 350 (t) 355 (u) 360 (v) 365 (w) 370 (x) 375

(y) 380 (z) 385

Figure D.11: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part III.

(a) 390 (b) 395

Figure D.12: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part IV.