Universidad Politécnica de Madrid

Escuela Técnica Superior de Ingenieros de Telecomunicación

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN

TRABAJO FIN DE MÁSTER

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Patricia Alonso de Apellániz

2020

UNIVERSIDAD POLITÉCNICA DE MADRID

ESCUELA TÉCNICA SUPERIOR DE INGENIEROS DE TELECOMUNICACIÓN

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Autor: Patricia Alonso de Apellániz

Tutor: Dr. Alberto Belmonte Hernández

Departamento: Departamento de Señales, Sistemas y Radiocomunicaciones

MIEMBROS DEL TRIBUNAL:

Presidente:

Vocal:

Secretario:

Suplente:

Realizado el acto de lectura y defensa del Trabajo de Fin de Máster, acuerdan la calificación de:

Calificación:

Madrid, a de de

Universidad Politécnica de Madrid

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN

Patricia Alonso de Apellániz

2020

Summary

Generating synthesized images, and being able to animate or transform them in some way, has lately experienced a remarkable evolution thanks, in part, to the use of neural networks. In particular, transferring facial gestures and audio to an existing image has attracted attention both in research and in society at large, due to its potential applications.

Throughout this Master's Thesis, a study of the state of the art will be carried out on the different techniques that exist for this transfer of facial gestures, including lip movement, between audiovisual media. Specifically, it will focus on the existing methods and research that generate talking faces based on several features extracted from the multimedia information used. From this study, the implementation, development, and evaluation of several systems will be done as follows.

First, given the importance of training deep neural networks on a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a process of conditioning and adaptation, extracting image and audio information from the original videos to be used as the input of the networks. These features are widely used in the state of the art for tasks such as the one mentioned, for example image key points and audio spectrograms.

As the second contribution of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on the implementation of [1], but adding new configurations such as the network that manages the audio features or loss functions that depend on this new architecture and the network's behavior. In other words, the first implementation will consist of the network described in the paper mentioned; to this implementation, an encoder for audio features will be added; and, finally, the training will be based on this last architecture but taking into account a loss calculated for the audio learning.

Finally, to compare and evaluate each network's results, both quantitative metrics and qualitative evaluations will be carried out. Since the final output of these systems will be a clear and realistic video of an arbitrary face to which the gestures of another one have been transferred, perceptual visual evaluation is key to this problem.

Keywords

Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.

Resumen

Generar imágenes sintetizadas, siendo capaces de animarlas o transformarlas de alguna manera, ha experimentado en los últimos años una evolución muy significativa gracias, en parte, al uso de redes neuronales en sus implementaciones. En particular, el intento de transferir diferentes gestos faciales y audio a una imagen existente ha llamado la atención tanto en la investigación como, incluso, socialmente, debido a sus posibles aplicaciones.

A lo largo de este Proyecto de Fin de Máster, se realizará un estudio del estado del arte en las diferentes técnicas que existen para esta transferencia de gestos faciales entre los medios audiovisuales que implican, incluso, el movimiento de los labios. Específicamente, se centrará en los diferentes métodos e investigaciones existentes que generan rostros parlantes basados en varios rasgos de la información multimedia utilizada. A partir de este estudio, la implementación, desarrollo y evaluación de varios sistemas se hará de la siguiente manera.

En primer lugar, conociendo la importancia relevante de entrenar redes neuronales profundas utilizando un conjunto de datos grande y bien procesado, VoxCeleb2 se descargará y sufrirá un proceso de condicionamiento y adaptación en cuanto a la extracción de información de imagen y audio del vídeo original para ser utilizado como entrada de las redes. Estas características serán las que se utilizan normalmente en el estado del arte para tareas como la mencionada, como los puntos clave de la imagen y los espectrogramas de audio.

Como segundo enfoque de esta Tesis, la implementación de tres redes convolucionales diferentes, en particular Generative Adversarial Networks (GANs), se hará basándose en la implementación de [1] pero añadiendo algunas nuevas configuraciones, como la red que gestiona las características de audio o las funciones de pérdidas dependiendo de esta nueva arquitectura y el comportamiento de la red. En otras palabras, la primera implementación consistirá en la red del paper mencionado; a esta implementación se le añadirá un encoder para las características del audio; y, finalmente, el entrenamiento se basará en esta última arquitectura pero teniendo en cuenta la pérdida calculada para el aprendizaje del audio.

Por último, para comparar y evaluar los resultados de cada red se realizarán tanto mediciones cuantitativas como evaluaciones cualitativas. Dado que el resultado final de estos sistemas será la obtención de un vídeo claro y realista con un rostro aleatorio al que se le han transferido gestos de otro, la percepción visual es clave para resolver este problema.

Palabras Clave

Aprendizaje profundo, transferencia de caras, generación de imágenes, imágenes sintetizadas, encoder, Redes Neuronales Convolucionales (CNN), autoencoder, Generative Adversarial Networks (GAN), Generador, Discriminador, procesamiento de datos, dataset, Python, evaluaciones cualitativas y cuantitativas.

Agradecimientos

Gracias al apoyo incondicional de mi tutor, Alberto, porque es una persona todoterreno capaz de centrarse y enseñar a todo el que se lo pida. Hacía mucho tiempo que no conocía a alguien al que le apasionase tanto saber y transmitir, consiguiendo meterme en un mundo en el que quiero seguir desarrollándome siempre, así que muchísimas gracias. Gracias a mi 'comuna' por haber conseguido lo que pocos pueden: aguantarme en mis peores momentos intentando sacarme una sonrisa, aunque sea vacilándome constantemente, y hacer posible que siga adelante con todo. Por último, gracias a madre y padre, que nunca han dudado de mí.

Index

1 Introduction and objectives 1

1.1 Motivation ...... 1

1.2 Objectives ...... 4

1.3 Structure of this document ...... 4

2 State of the art 5

2.1 Image Synthesis ...... 5

2.2 Deep Learning basics ...... 6

2.2.1 Artificial Neural Networks ...... 6

2.2.2 Convolutional Neural Networks ...... 7

2.2.3 Recurrent Neural Networks ...... 13

2.3 DL and Image Synthesis ...... 15

3 Development setup 29

3.1 Federated Learning ...... 29

3.2 PyTorch ...... 33

3.3 Libraries ...... 36

3.4 Overview of the proposed DL process ...... 37

4 Implementation 40

4.1 Data collection ...... 41

4.2 Data preparation ...... 45

4.2.1 Visual features extraction ...... 46

4.2.2 Audio features extraction ...... 48

4.3 Modeling ...... 51

4.3.1 Embedders ...... 53

4.3.2 Generator ...... 56

4.3.3 Discriminator ...... 58

4.4 Training ...... 59

4.4.1 Meta-learning ...... 60

4.4.2 Fine-tuning ...... 65

4.4.3 Other hyper-parameters ...... 67

4.5 Evaluation ...... 68

4.5.1 Quantitative Evaluation ...... 69

4.5.2 PS ...... 73

4.5.3 NMI ...... 73

4.5.4 Qualitative Evaluation ...... 76

5 Simulations and results 77

5.1 Configuration of the evaluation ...... 78

5.1.1 Evaluation dataset ...... 78

5.1.2 Reference system ...... 80

5.2 Project experiments ...... 85

5.2.1 Reference system with this project’s dataset ...... 86

5.2.2 Results with both image and audio features ...... 94

5.2.3 Results with audio loss ...... 101

5.3 Other experiments of possible interest ...... 107

5.3.1 Federated learning ...... 107

5.3.2 Angela Merkel video results ...... 112

5.3.3 Video to Image results ...... 116

5.3.4 Video to not human face ...... 119

6 Conclusions and future lines 123

6.1 Conclusions ...... 123

6.2 Future lines of work ...... 127

References 129

Appendices 140

A Social, economic, environmental, ethical and professional impacts 141

A.1 Introduction ...... 141

A.2 Description of impacts related to the project ...... 142

A.2.1 Ethical impact ...... 142

A.2.2 Social impact ...... 143

A.2.3 Economic impact ...... 143

A.2.4 Environmental impact ...... 143

A.3 Conclusions ...... 144

B Economic budget 145

C Code available 147

D Other Results 149

D.1 Reference system with this project’s dataset ...... 150

D.2 Results with both image and audio features ...... 154

D.3 Results with audio loss ...... 158

Index of figures

1.1 Example of deepfake in "The Shining" (1980). Jim Carrey replaces Jack Nicholson's face through DL techniques [2] ...... 2

1.2 Example of Obama's talking video generation through DL techniques [3] ...... 3

2.1 ANN architecture [9] ...... 7

2.2 Object detection results using the Faster R-CNN system [10] ...... 8

2.3 CNN architecture for classification [14] ...... 9

2.4 GAN architecture [20] ...... 11

2.5 Autoencoder architecture [21] ...... 12

2.6 LSTM architecture [24] ...... 13

2.7 LRCN architecture [26] ...... 15

2.8 Example results of Pix2Pix net on automatically detected edges compared to ground truth [28] ...... 16

2.9 Face generation through the years. Faces on the left were created by Artificial Intelligence in 2014 and the ones on the right, in 2018. [31] ...... 17

2.10 Dense alignment, including key points, and 3D reconstruction results for [38] ...... 18

2.11 OpenFace behaviour analysis pipeline, including facial action unit recognition [41] ...... 19

2.13 X2Face network during the initial training stage [43] ...... 19

2.12 Results of the reenactment system Face2Face [42] ...... 20

2.14 System to synthesize Obama’s talking head [44] ...... 20

2.15 Results comparison between Face2Face and Obama’s talking head generation model [44] ...... 21

2.16 Overall Speech2Vid model [45] ...... 22

2.17 Proposed conditional recurrent adversarial video generation model [46] 22

2.18 Proposed Disentangled Audio-Visual System [47] ...... 23

2.19 Few shot Model Architecture [1] ...... 24

2.20 Few shot Model Results compared to other models seen [1] ...... 24

3.1 Federated learning general representation [53] ...... 30

3.2 Proposed federated learning system architecture ...... 31

3.3 PyTorch Vs. TensorFlow: Number of Unique Mentions. Conference legend: CVPR, ICCV, ECCV - computer vision conferences; NAACL, ACL, EMNLP - NLP conferences; ICML, ICLR, NeurIPS - general ML conferences. [53] ...... 35

3.4 System architecture general training overview ...... 38

3.5 System architecture final application overview ...... 39

4.1 Deep learning process block diagram ...... 40

4.2 VoxCeleb2 faces of speakers in the dataset...... 42

4.3 VoxCeleb2 downloaded folders organization ...... 43

4.4 VoxCeleb2 txt file with video information to download it ...... 43

4.5 VoxCeleb2 video example from YouTube ...... 44

4.6 Visual features extraction [41] ...... 46

4.7 Visual feature extraction. Frame A from dataset video, bounding-box coordinates provided by dataset and landmarks extracted, respectively. 47

4.8 Final visual feature extraction. Input to the network consisting of frame and landmarks concatenated...... 48

4.9 Audio feature extraction. First column: frame A talking from dataset video, audio waveform from frame A, MFCCs and Mel-spectrogram, respectively. Second column: the same for frame B not talking...... 50

4.10 Project’s network architecture ...... 51

4.11 Image Embedder Architecture ...... 53

4.12 Audio Embedder Architecture ...... 54

4.13 Single Residual Block[79] ...... 55

4.14 Residual Down Sampling Block ...... 56

4.15 Generator Architecture Without Audio vector ...... 57

4.16 Generator Architecture With Audio vector ...... 58

4.17 Residual Up Sampling Block ...... 59

4.18 Discriminator Architecture ...... 60

4.19 VGG-19 Architecture ...... 63

4.20 VGG-Face Architecture ...... 63

4.21 Different image distortions to proceed with image evaluation ...... 69

5.1 Frame of generated video of myself with noticeable face gestures and speaking...... 79

5.2 Frames of downloaded video from Pedro Sánchez before and after cutting and cropping it...... 80

5.3 Implementation inference after 5 epochs of training in small dataset [94] 80

5.4 Example of output of the reference system trained using the pre-trained model available. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated. The first column of pictures just trains the meta-learning stage and the second trains for 40 epochs the fine-tuning stage...... 81

5.5 Training losses evolution graph in reference system with pre-trained weights...... 83

5.6 Example of losses output during the fine-tuning stage using the reference system model available...... 84

5.7 Example of output of the reference system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 87

5.8 Different FT T values applied. Example of output of the reference system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 88

5.9 Different FT epochs applied. Example of output of the base system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 89

5.10 Different paddings applied. Example of output of the base system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 89

5.11 Training losses evolution graph in reference system with own dataset. . 91

5.12 Examples of Generator outputs during meta-learning...... 91

5.13 Example of losses output during the fine-tuning stage using the base system model available with this project's dataset with different T values. 92

5.14 Example of losses output during the fine-tuning stage using the base system model available with this project's dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50)...... 93

5.15 Example of output of the Video-Audio system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 95

5.16 Different FT T values applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 96

5.17 Different FT epochs applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 96

5.18 Different paddings applied. Example of output of the Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 97

5.19 Training losses evolution graph in Video-Audio system with own dataset. 98

5.20 Example of losses output during the fine-tuning stage using the Video- Audio system model available with this project’s dataset with different T values...... 99

5.21 Example of losses output during the fine-tuning stage using the Video- Audio system model available with this project’s dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50)...... 100

5.22 Example of output of the Video-Audio with audio loss system trained just the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 102

5.23 Different FT T values applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 103

5.24 Different FT epochs applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 103

5.25 Different paddings applied. Example of output of the Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 104

5.26 Training losses evolution graph in Video-Audio system using audio loss with own dataset...... 105

5.27 Example of losses output during the fine-tuning stage using the Video- Audio system model with audio loss available with this project’s dataset with different T values...... 106

5.28 Example of losses output during the fine-tuning stage using the Video- Audio system with audio loss model available with this project’s dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50)...... 106

5.29 Example of output of the Server 2 reference system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 108

5.30 Reference networks' generated image during meta-learning stage ...... 109

5.31 Example of output of the Server 2 Video-Audio system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 110

5.32 Video-Audio networks’ generated image during meta-learning stage . . 110

5.33 Example of output of the Server 2 Video-Audio with audio loss system trained both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated...... 111

5.34 Video-Audio with audio loss networks’ generated image during meta- learning stage ...... 112

5.35 Angela Merkel video frame [97] ...... 113

5.36 Different visual results for each experiment done using Sánchez's video and using Merkel's ...... 115

5.37 Original image used for the application purpose ...... 116

5.38 Example of output of the reference system trained both the meta- learning and fine-tuning steps. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video’s face translated...... 117

5.39 Example of output of the reference system trained both the meta-learning and fine-tuning steps with the project's dataset for the video to image application. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated...... 118

5.40 Example of output of the Video-Audio system trained just for the meta-learning stage with the project's dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated...... 118

5.41 Example of output of the Video-Audio with audio loss system trained for the meta-learning step with the project’s dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video’s face translated...... 119

5.42 Example of non human face picture [98] ...... 120

5.43 Example of output using a non human face to transfer the gestures from the video of myself ...... 120

5.44 Example of video game face picture [100] ...... 121

5.45 Example of output using a video game face to transfer the gestures from the video of myself ...... 121

5.46 Example of output using the same two videos of myself for the network without Audio Embedder and for the one with it ...... 122

B.1 Economic budget ...... 146

D.1 Generator synthesized frames during meta-learning training in the reference system using this project’s pre-processed dataset. Part I. . . . 150

D.2 Generator synthesized frames during meta-learning training in the reference system using this project’s pre-processed dataset. Part II. . . 151

D.3 Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part IV...... 152

D.4 Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part V...... 153

D.5 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part I. . 154

D.6 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part II. 155

D.7 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part IV. 156

D.8 Generator synthesized frames during meta-learning training in the Audio-Video system using this project’s pre-processed dataset. Part V. 157

D.9 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project’s pre-processed dataset. Part I...... 158

D.10 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part II...... 159

D.11 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project’s pre-processed dataset. Part IV...... 160

D.12 Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part V...... 161

Index of tables

2.1 DL and image synthesis summary ...... 27

3.1 Hardware used in the development of the Master’s thesis ...... 32

3.2 Different existing DL frameworks ...... 34

4.1 VoxCeleb2 description ...... 41

4.2 Final number of samples from VoxCeleb2 used ...... 45

4.3 Dataset storage used ...... 45

4.4 Information about spectrogram computation ...... 49

4.5 Metrics values obtained for the different distortions of the first image to show range of values...... 75

5.1 Test videos information ...... 79

5.2 Metrics values obtained for the different configurations for the video results using the reference system ...... 85

5.3 Experiments carried out in this project ...... 86

5.4 Metrics values obtained for the different configurations for the video results using the reference system with this project's dataset ...... 94

5.5 Metrics values obtained for the different configurations for the results using the Video-Audio system with this project’s dataset ...... 101

5.6 Metrics values obtained for the different configurations for the results using the Video-Audio with audio loss system with this project’s dataset 107

5.7 Best metric values obtained for each of the previous experiments. . . . 114

Glossary

DL - Deep Learning

MIT - Massachusetts Institute of Technology

ANN - Artificial Neural Network

CNN - Convolutional Neural Network

R-CNN - Region Convolutional Neural Network

SVM - Support Vector Machine

NLP - Natural Language Processing

ReLU - Rectified Linear Unit

NN - Neural Network

GAN - Generative Adversarial Network

DCGAN - Deep Convolutional Generative Adversarial Network

BN - Batch Normalization

RNN - Recurrent Neural Network

LSTM - Long Short-Term Memory

WLAS - Watch, Listen, Attend and Spell

PRN - Position map Regression Network

CPU - Central Processing Unit

GPU - Graphics Processing Unit

HG - HourGlass

MFCC - Mel-Frequency Cepstral Coefficients

ILSVRC - ImageNet Large Scale Visual Recognition Challenge

LAD - Least Absolute Deviations

PSNR - Peak Signal-to-Noise Ratio

SSIM - Structural Similarity

CSIM - Cosine Similarity

LMD - Landmark Distance Error

IS - Inception Score

FID - Fréchet Inception Distance

PS - Perceptual Similarity

NMI - Normalized Mutual Information

MI - Mutual Information

MSE - Mean Squared Error

WER - Word Error Rate


Chapter 1

Introduction and objectives

This chapter of the Master's Thesis introduces the background that has motivated this project, a brief overview of its main objectives, and the different sections into which the document is divided.

1.1 Motivation

Transforming or manipulating photographs using different techniques or methods dates back to some of the first pictures captured during the 19th Century, not long after the first one was taken in 1825. From paint retouching, airbrushing, or even manipulating negatives while still in the camera, techniques transitioned in just one century to radically new approaches thanks to digitization. Since then, more and more advances have been made in this task as the quality and capabilities of the equipment have improved. Due to those developments, interest slowly moved from still pictures to moving ones.

Being able to animate a still image or to transform a video of a face in a controllable way has had a huge impact on society, since it can be applied to many image editing applications, such as animating an onscreen person with other human expressions or even manipulating what this person is saying. These alterations are generally done for entertainment purposes, in advertising or social media, for example. However, as can be immediately imagined, there are ethical issues and controversies about it, which will be discussed in Appendix A (Social, economic, environmental, ethical and professional impacts), involving, of course, the famous deepfakes among other applications. This synthetic media has gathered widespread attention for unethical uses in fake news, financial fraud, or even adult-content videos. It basically consists of replacing a person's face with another one, as can be seen in Figure 1.1. Nowadays, several online applications can be found just for this purpose which, as the name deepfake suggests, is made possible by advances in Deep Learning (DL) algorithms.

Figure 1.1: Example of deepfake in "The Shining" (1980). Jim Carrey replaces Jack Nicholson's face through DL techniques [2].

While this face replacement configuration has gained a lot of interest in academic and social fields, recent research has focused on synthesizing talking faces, that is, taking a still face image and making it "talk" by moving its lips according to some audio, written sentences or even a video. This is the famous Barack Obama video case (Figure 1.2) in which, given audio, a video of him was created depicting him mouthing the words of the track. This goes further, as mentioned, by taking a video of a person, not a still image, and replacing what he or she is saying by generating lip and facial movement according to a given video different from the original. There are many examples of talking-person video generation techniques that are being developed and improved thanks to DL algorithms.

Figure 1.2: Example of Obama's talking video generation through DL techniques [3].

DL is eclipsing other techniques and gaining a lot of interest in a wide range of computer vision fields and, in particular, in this one, since it allows high-capacity computational models to learn how to generate images based on features extracted from other media. Nevertheless, there is still a long way to go, since generating a natural, realistic and personalized human head is really difficult because of its geometric and kinematic complexity and also because our visual system is able to recognize even minor mistakes in a human's appearance.

As will be seen in the state of the art section of this report, several DL models have been proposed by researchers from all over the world to overcome these challenges, using images from datasets to perform this task. But what about combining those features with audio ones to help a model learn and try to improve on the proposed approaches? In the end, lips need to be synthesized and their motion also depends on the audio features.

1.2 Objectives

As stated before, the main target of this research is to implement a DL model that, given two videos, transfers the facial expressions and poses from the first one to the second one.

This main objective can be broken down into two secondary ones. The first is to test the suitability of adding audio features, in addition to image features, as an input to the neural network model, to assess whether this improves performance or not. The second is to extend the advances in this field, which is a recent one and does not have as many available examples as other fields do.

1.3 Structure of this document

This Thesis' document is structured in six chapters, plus annexes, as follows:

1. Introduction and objectives: the motivation to carry out this project is explained.

2. State of the art: the theoretical and practical background of this field and similar ones is introduced.

3. Development setup: the software and hardware requirements are defined.

4. Implementation: the implemented algorithms are detailed.

5. Simulations and results: the final results of our experiments are presented.

6. Conclusions and future lines: the final conclusions drawn from this project are presented and some future lines of work are proposed.

7. Annexes: additional information about the project, its impacts and the economic budget is provided.

Chapter 2

State of the art

2.1 Image Synthesis

Since the late 1960s, the quality and accuracy of computer-generated images have improved dramatically. The field has gone from simple ambient representations or the synthesis of a single object [4][5], with direct lighting as the only possibility, to generating complete scenes [6] with shadows. This improvement is due to several reasons, but advances in both hardware and software, such as the increased computational speed of the equipment, stand out among the rest, making it possible to generate mature representations with high spatial and color resolution.

So, for some years now, image synthesis has been explored in depth, with strong relevance in applications such as creating high-resolution images from low-resolution ones or generating facial images with different poses. This task [7] addresses the process of generating images that represent the information of a real scene from some sort of description.

Despite the progress in generating images, and although it has strong relevance in multiple fields, creating high-resolution images from a given input remains a challenge. This might be because traditional and newly developed techniques lack the high-level information required for generating images.

By making use of the advances of DL in Computer Vision, Image Synthesis has been getting more and more attention, as it opens up new application fields while improving image generation in terms of cost, scalability, and time consumption.

2.2 Deep Learning basics

Nowadays, both in the discipline defined above and in several others, DL is becoming the main solution. DL techniques vary greatly and are found in fields as diverse as medicine, with tasks such as identifying skin cancer from patient photos, and advertising, offering clients the products that will fit them best, for example.

DL can be defined as a type of machine learning technique in which the input information is processed in hierarchical layers so that the machine understands its features at increasing levels of complexity. These techniques are built on neural networks, which share some properties: they are interconnected neurons organized in layers, differing in architecture and possibly in training. As a good summary, the Massachusetts Institute of Technology (MIT) official introductory course in DL [8] defines it as a technique that extracts patterns from data using neural networks.

As has been mentioned, there are different types of neural networks, depending on the architecture, the training technique and, not less important, the type of raw input to the net. Among those types, the basic ones are described below.

2.2.1 Artificial Neural Networks

The basic type of neural network, the Artificial Neural Network (ANN), consists of processing elements organized in interconnected layers, as Figure 2.1 shows, where the flow of information through the network is unidirectional. This means that data travels in just the forward direction, input to output, in such a way that each neuron is connected by a weighted link to every neuron in the following layer. The input layer sends the information to the hidden one, which processes it and may be interconnected with more hidden layers, and the output layer generates the result. ANN modeling starts with a random selection of weight coefficients, which are modified during training until the output matches the true values.

Figure 2.1: ANN architecture [9]

This architecture might be useful for solving some regression and classification applications. However, when the input is an image, it has to be converted from a 2D array to a 1D vector before the training step, which has two drawbacks. The first is that the number of parameters during training grows with the size of the image, and the second is that the ANN loses the spatial characteristics of the image, which are a huge source of information. Another drawback of this simple architecture and training is that an ANN does not capture sequential information.
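As a minimal illustration of this flattening step, the following PyTorch sketch (layer sizes are arbitrary examples, not taken from any network used later in this Thesis) shows how a fully connected network must first reshape a 2D image into a 1D vector, discarding its spatial structure:

```python
import torch
import torch.nn as nn

# Minimal feed-forward ANN sketch: a 3x32x32 image must be flattened
# into a 3072-dimensional vector before the fully connected layers,
# losing its 2D spatial structure (sizes are illustrative only).
ann = nn.Sequential(
    nn.Flatten(),                 # 3x32x32 image -> 3072 vector
    nn.Linear(3 * 32 * 32, 128),  # input layer -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),           # hidden layer -> 10-class output
)

x = torch.randn(8, 3, 32, 32)     # a batch of 8 dummy images
print(ann(x).shape)               # torch.Size([8, 10])
```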

All these limitations of ANNs are directly addressed by making use of other more complex architectures and training techniques explained below.

2.2.2 Convolutional Neural Networks

In the area of computer vision, Convolutional Neural Networks (CNNs) have positioned themselves above the rest, becoming the core of most systems today, such as object detection and image classification ones.

Figure 2.2: Object detection results using the Faster R-CNN system [10]

Such is their effectiveness in object detection, for example, that many model families have been implemented and tested throughout the years. In [11], an R-CNN (Region-CNN) system for object detection is developed, applying CNNs to bottom-up region proposals to localize and segment objects and finally classifying the resulting feature vectors with a Support Vector Machine (SVM). Compared to other approaches previously reported in the state of the art of this task, it outperforms their results. Since then, several architectures based on R-CNNs have appeared to detect objects: in [10], a Faster R-CNN is implemented to predict the region proposals of the image separately, greatly improving speed compared to the previous ones and obtaining the results shown in Figure 2.2.

More recently, CNNs are also being applied to problems in Natural Language Processing (NLP), like machine translation or text classification, obtaining interesting results. In [12], a sentence classification system is implemented by representing the input text as an array of vectors, just like an image can be represented as an array of pixel values.

So, in DL, a CNN is a deep neural network that processes data with a grid topology, using convolutional and pooling layers to extract features from it. Unlike several other networks, a CNN works with matrices and filters of n dimensions, taking into account the spatial dependency of pixel values. In a CNN, the connections between layers are restricted and the nodes of each layer share the same weights, making them detect the same characteristics in different areas of the image. As can be seen in Figure 2.3, only the last layers of the net are flattened using fully connected ones. A CNN can use this combination of convolutional and pooling layers to classify a dataset. In this example, it classifies the CIFAR dataset [13], which consists of 60,000 color images divided into 10 classes.

Figure 2.3: CNN architecture for classification [14]

The basic functionality of this architecture, which combines the layers defined below, is as follows:

1. The convolution layers scan their input looking for patterns. They are characterized by the number of independent filters, which determines the number of output images, by the kernel size, which gives the size of the sliding filter, and by the stride, which determines the number of pixels the filter slides.

2. As the rectifier or detector layer, the Rectified Linear Unit (ReLU) is usually chosen.

3. The pooling layers perform downsampling after the convolutional ones to reduce dimensionality, so that performance improves and computational efficiency increases. These layers are characterized by the size of the pooling window.

4. Finally, the output is flattened out to a vector and classified through the fully connected layers; a minimal sketch of such a stack is shown after this list.
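The sketch below, written in PyTorch, chains these four kinds of layers in the order just described; channel counts and kernel sizes are arbitrary examples and do not correspond to any specific network from the literature:

```python
import torch
import torch.nn as nn

# Minimal CNN sketch for 10-class classification of 3x32x32 images.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # convolution: scan for patterns
    nn.ReLU(),                                              # detector layer
    nn.MaxPool2d(kernel_size=2),                            # pooling: downsample to 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                            # downsample to 8x8
    nn.Flatten(),                                           # flatten only at the end
    nn.Linear(32 * 8 * 8, 10),                              # fully connected classifier
)

logits = cnn(torch.randn(4, 3, 32, 32))  # -> shape (4, 10)
```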

As has already been seen at the beginning of this section for object detection, there are several CNN architectures available that have been key in building DL algorithms achieving high-accuracy results. Regarding image recognition, which is a task with an extensive research history, some architectures are worth mentioning too. Examples are AlexNet [15], containing 8 layers: 5 convolutional (some followed by max-pooling layers) and 3 fully connected ones, and VGGNet [16], which uses convolutional kernels of size 3x3 and max-pooling kernels of size 2x2 with stride 2. After the celebrated victories of those two, the ResNet model [17] appeared, providing the DL field with a novel architecture based on "skip connections", which allows training a Neural Network (NN) with a large number of layers while still having lower complexity than the VGGNet mentioned before. It also achieved an error rate that beat human-level performance on the dataset used.
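A skip connection simply adds a block's input back to its output. The following is a minimal sketch of one such residual block in PyTorch, not the exact ResNet configuration of [17]:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block sketch: two convolutions plus a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # the "skip connection": add the input back

block = ResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # same shape in and out
```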

2.2.2.1 Generative Adversarial Networks

As has been said, there are multiple CNN architectures available, but this state of the art is going to focus on a few that are of special interest. Since this Thesis consists in synthesizing images based on some inputs, which will be discussed in the following sections, a generative model architecture should be used. Generative modeling involves automatically learning patterns from the input in such a way that the model outputs data that apparently could have been obtained from the original set. This is where Generative Adversarial Networks (GANs) come in.

GANs are able to produce or generate new content using DL architectures, such as CNNs. Their architecture was first described in [18] in 2014, but a year later a standardized approach called Deep Convolutional Generative Adversarial Network (DCGAN) [19] was developed, which finally led to more formalized models. It is important to highlight that the DCGAN uses strided convolutions instead of pooling layers to increase and decrease the spatial dimensions of the features, and that it uses Batch Normalization (BN) so that zero mean and unit variance hold in all layers. The final target of this formalized model is to stabilize learning while dealing with poor weight initialization.

Figure 2.4: GAN architecture [20]

As Figure 2.4 shows, a GAN involves two sub-models: a generator for generating new data and a discriminator for deciding (classifying) whether the generated data is real or fake. So the aim of the first one is to maximize the probability of making the discriminator mistake its inputs as real, while the second one aims to guide the generator to create more realistic images.

At first, the generator does not know how to produce images that resemble the real ones, that is, the ones from the training dataset, and the discriminator does not know how to classify images into real and fake. This is why the discriminator model receives two different batches: one with the true images and another one with the images generated from noise. During training, the generator learns how to output images that resemble those of the training set.

The complex part of this architecture is that it needs two losses so that the discriminator outputs probabilities close to 0 for fake images and close to 1 for real images. One loss pushes the probabilities for the real ones towards 1 and the other pushes the probability of the fake ones towards 0. Thus, the total loss for this sub-model is the sum of those partial losses.
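As a sketch of how those two partial losses are usually combined, the snippet below uses binary cross-entropy on the discriminator's output probabilities; this is one common choice in PyTorch, not necessarily the exact formulation of the papers cited:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross-entropy on the discriminator's probabilities

def discriminator_loss(d_real, d_fake):
    # Push real images towards 1 and generated (fake) images towards 0;
    # the total discriminator loss is the sum of the two partial losses.
    real_loss = bce(d_real, torch.ones_like(d_real))
    fake_loss = bce(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss

def generator_loss(d_fake):
    # The generator tries to make the discriminator label its outputs as real.
    return bce(d_fake, torch.ones_like(d_fake))
```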

So, as can be seen, GANs have the potential to expand DL horizons, and researchers know it, since they have been developing many techniques for training GANs. This architecture provides a pathway to problems that require a generative solution, such as the target of this Thesis.

2.2.2.2 Autoencoders

When talking about generative neural network models, autoencoders seek to "reconstruct" their input, that is, to output data identical to the input by learning an identity function.

Basically, an autoencoder can be thought of as two sub-networks, as can be seen in Figure 2.5. The encoder accepts the input and compresses it into the latent-space representation, while the decoder takes this representation and reconstructs the data.

Figure 2.5: Autoencoder architecture [21]
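A minimal sketch of this encoder/decoder pair in PyTorch follows; the dimensions are arbitrary examples (e.g. flattened 28x28 images) and not those of any model used later in this project:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal autoencoder sketch: compress to a latent vector, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)           # latent-space representation
        return self.decoder(z)        # reconstruction of the input

model = Autoencoder()
x = torch.rand(16, 784)               # a batch of flattened dummy images
loss = nn.MSELoss()(model(x), x)      # trained to reproduce its own input
```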

While both GANs and autoencoders are generative models, GANs generate new and realistic data but autoencoders simply compress inputs into a latent-space representation.

So, autoencoders can be seen as neural networks used for applications such as dimensionality reduction or denoising, as well as for outlier detection. Outside the Computer Vision area, autoencoders are used in NLP applications, such as machine translation [22]. These models, combined with other networks, can lead to interesting architectures.

2.2.3 Recurrent Neural Networks

It is important to highlight that a neural network's input data does not always have to be static, since there is data that depends on past instances of itself to predict future ones. Applications such as speech recognition or machine translation in NLP, stock price prediction, or spam detection, among others, process this kind of data. The neural networks that address this kind of data, temporal or sequential, are Recurrent Neural Networks (RNNs). Basically, RNNs store the last output calculated in their memory and use it to predict the new output.

There are many possible architectures for RNNs, and common to all of them is that, as has been described, they feed their outputs from a previous time step as inputs to the net. One of these architectures, a Long Short-Term Memory (LSTM) [23], is shown in Figure 2.6.

Figure 2.6: LSTM architecture [24]

An LSTM shares information through the network, learning from it to predict future data using a memory cell, which is represented in the diagram above. This cell's inner operations can be explained as follows:

1. The first gate decides which details have to be discarded from the block, using a sigmoid function that looks at the previous state.

2. The second gate decides which values from the input are going to be used to modify the memory. A sigmoid function does that task, while a tanh one gives weightage to the values passed, depending on their importance.

3. Finally, the output gate consists of a sigmoid function which, again, decides which values to let through the net, and a tanh one, giving weightage. Basically, the output is decided depending on the input and the memory of the block. The standard update equations are written out after this list.
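In these equations (the common LSTM formulation, whose notation may differ slightly from that of [23]), σ denotes the sigmoid function and ⊙ the element-wise product:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f\,[h_{t-1}, x_t] + b_f\right)          && \text{forget gate}\\
i_t &= \sigma\!\left(W_i\,[h_{t-1}, x_t] + b_i\right)          && \text{input gate}\\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1}, x_t] + b_c\right)   && \text{candidate memory}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t               && \text{memory update}\\
o_t &= \sigma\!\left(W_o\,[h_{t-1}, x_t] + b_o\right)          && \text{output gate}\\
h_t &= o_t \odot \tanh(c_t)                                    && \text{new hidden state}
\end{aligned}
```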

When images are used as the input of an RNN, researchers have combined it with a CNN, where the output of the latter is the input of the former. In [25], the problem of multi-label image classification failing because label dependencies in an image are not fully exploited is approached by proposing a CNN-RNN framework in which an image-label embedding is learned to characterize the semantic label dependency. Another example of this combination, but with LSTMs, is [26], where temporal dynamics and convolutional perceptual representations are both learned for a visual recognition task, showing good results compared to the state of the art. Many possible architectures using this combination are proposed in this paper. Figure 2.7 shows how the proposed model, LRCN, processes variable-length visual inputs using the CNN to feed the LSTM, sharing their weights across time.
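A minimal sketch of this CNN-then-LSTM pattern in PyTorch is shown below; it is a generic illustration with arbitrary sizes, not the exact LRCN configuration of [26]:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Sketch: per-frame CNN features fed to an LSTM over time."""
    def __init__(self, feature_dim=128, hidden_dim=256, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feature_dim))
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                      # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))      # apply the same CNN to every frame
        feats = feats.view(b, t, -1)               # back to (batch, time, feature_dim)
        out, _ = self.lstm(feats)                  # model the temporal dynamics
        return self.classifier(out[:, -1])         # predict from the last time step

model = CNNLSTM()
y = model(torch.randn(2, 8, 3, 64, 64))            # 2 clips of 8 frames -> (2, 10)
```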

Figure 2.7: LRCN architecture [26]

2.3 DL and Image Synthesis

DL is attracting a lot of attention and interest, being in constant development and achieving unprecedented levels of success. Its algorithms have even outperformed humans in many tasks, such as in Computer Vision.

In the past few years, there has been a rapid growth of research in GANs, which have been defined before. Several fields are making use of these neural networks, or of architectures based on them, to give solutions to, for example, translating an input image into an output one. Traditionally, this task has been approached with techniques such as stitching together small patches of images [27]. Nowadays, translating an image or an object into another image is tackled as in [28], where a CGAN is proposed to learn this mapping. In Figure 2.8, an example output of the released software for image-to-image translation can be seen. The results of this network, Pix2Pix [29], suggest that this approach is effective, and many internet users have been posting their own results using it. Such is its impact that a Pix2PixHD net [30] has already been implemented for synthesizing high-resolution images, in particular 2048x1024, outperforming existing methods and also generating different results from the same input, allowing a user to edit them interactively.

Figure 2.8: Example results of Pix2Pix net on automatically detected edges compared to ground truth [28]

Another particular area in which progress is getting remarkably good is face generation. Figure 2.9 shows the progress of research in this field over just 4 years, making it possible to generate lifelike faces using neural networks. Generating realistic facial images with different facial expressions, or while keeping the identity information, is an open research topic which is having a deep impact on face recognition, image augmentation, and even face aging and face-to-face translation.

In the first field mentioned, face recognition, a huge dataset is needed for training. There are many available datasets created by companies to train researchers' nets. [32] uses the VoxCeleb2 dataset [33], which contains over 1 million utterances from YouTube videos of over 6,000 celebrities, computing spectrograms from its raw audio to use them as the input of a CNN and finally recognize identities successfully. Another available dataset, used in [34] to recognize words spoken by a human using just the video and not the audio, is the LRW dataset [35]. It consists of 1,000 samples of each of 500 words, spoken by hundreds of speakers. There is a sentence version [36] into which this dataset has evolved, and it has been tried as the input of a Watch, Listen, Attend and Spell (WLAS) network [37] that operates over visual, audio or both inputs to lip-read, outperforming previous techniques.
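Turning raw audio into such a spectrogram is a short preprocessing step. The sketch below uses librosa as one common choice; the file name and parameter values are illustrative assumptions and do not reproduce the exact configuration of [32]:

```python
import librosa
import numpy as np

# Sketch: convert a raw audio file into a log-mel-spectrogram that a CNN can
# consume as a 2D "image" (parameter values are illustrative only).
waveform, sample_rate = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate,
                                     n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, time_frames)
```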

Figure 2.9: Face generation through the years. Faces on the left were created by Artificial Intelligence in 2014 and the ones on the right, in 2018. [31]

In one of the previous papers, the audio spectrogram was used as the input of a network. There are several other human features used in face reconstruction or generation, such as key points of the facial structure or the pose, which are worth mentioning together with how to extract them. In [38], a simple CNN is trained to reconstruct a 3D face structure from a 2D image through a representation in UV space which predicts dense alignment. This is achieved using the 300W-LP [39] dataset as the training set of the proposed Position map Regression Network (PRN), resulting in a method robust to illumination, pose, and occlusions. The code for this model is also available in [40].

To help with this field of producing faces under many circumstances and extracting features such as key points, several tools and frameworks are being developed by researchers. In [41], OpenFace, an open-source tool that detects landmarks, head pose, and eye-gaze, among others, has been developed. In Figure 2.11, its analysis pipeline can be seen with each of the features extracted. It is worth highlighting the importance of implementing a model which learns how to extract these characteristics for later applying them to some task as input data.

Figure 2.10: Dense alignment, including key points, and 3D reconstruction results for [38]
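As an illustration of extracting such facial key points, the sketch below uses dlib's 68-point landmark predictor; the image and model file names are assumptions, and this is not necessarily the extraction tool used later in this Thesis:

```python
import dlib
import cv2

# Sketch: detect a face and extract its 68 landmark points with dlib.
# "shape_predictor_68_face_landmarks.dat" is dlib's publicly released model file.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = cv2.imread("face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
for face in detector(gray):
    shape = predictor(gray, face)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```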

It has already been shown how to create photo-realistic faces, but what about creating photo-realistic talking heads? Producing a virtual person or animated being that sounds and appears real is a challenge for applications such as special effects. [42] is one of the first approaches for real-time facial reenactment of a target video. It basically animates the facial expressions of the target video with a source video recorded by a webcam, rendering the output in a realistic way. Figure 2.12 shows how they address facial identity recovery, obtaining successful results.

X2Face [43] is a neural network that controls pose and expression, taking two frames as input: a source one and a driving one, the first being the input of the embedding submodel and the second, the input of the driving submodel. This can be understood from Figure 2.13. The embedding network learns how to map from the source frame to a representation, and the driving network learns how to transform the pixels from this representation into a generated frame. It is said to control pose, expression, or identity because it does not make assumptions about them, using the ones in the generated frame.

Figure 2.11: OpenFace behaviour analysis pipeline, including facial action unit recognition [41]

Figure 2.13: X2Face network during the initial training stage [43]

[44] is one of the most famous approaches to this talking head generation task. Its LSTM model takes Obama's audio as input and converts it to a time-varying sparse mouth shape, generating, based on it, a realistic mouth texture which is composited into the mouth region of a video, as shown in Figure 2.14.

Figure 2.12: Results of the reenactment system Face2Face [42]

Figure 2.14: System to synthesize Obama's talking head [44]

Figure 2.15 presents a comparison between Face2Face and the previous net for four different samples of the same speech, using the same video. The second method can synthesize a more realistic mouth, showing natural creases and clearer teeth.

Figure 2.15: Results comparison between Face2Face and Obama’s talking head generation model [44]

There are papers similar to the Obama one being published, with researchers trying to improve their results based on them. Another neural network that uses both audio and still images as input is Speech2Vid [45], which generates a video of a talking face, but this time using an encoder-decoder CNN model and showing that there is a relation that allows generating video data based on audio sources. Figure 2.16 shows this model architecture, with an emphasis on the deblurring block, which is used to refine the output frames.

Figure 2.16: Overall Speech2Vid model [45]

To try to outperform the Speech2Vid model, a new conditional adversarial network is presented in [46], also to generate a video of a talking face. In this case (Figure 2.17), a multi-task adversarial model is trained to treat the audio input as a condition for the recurrent adversarial network, trying to make the transitions of the lips and facial expression smoother. It is important to mention that, to reduce the size of the set without reducing quality, phoneme distribution information has been extracted from the audio. Results show a superior and more accurate visual representation.

Figure 2.17: Proposed conditional recurrent adversarial video generation model [46]

Even though facial expression variation and speech semantics are coupled together because of the movement of the talking face, [47] learns disentangled audio-visual representations through a training process that generates a more realistic face with clear motion patterns. Figure 2.18 shows this model, in which three encoders take part: one for Person-ID information from a visual source and the other two for Word-ID, to extract speech information from visual and audio sources.

Figure 2.18: Proposed Disentangled Audio-Visual System [47]

To end this review, [1] implements a talking head model which, unlike most recent works, learns from a few images (few-shot) instead of just a single one. Figure 2.19 shows this model's architecture, which takes the image key points as the input of the generator. It is also able to initialize the parameters of both submodels, generator and discriminator, in a person-specific way. Since landmarks from a different person are used, the lack of landmark adaptation is usually a problem for this task, but this system achieves a high-realism solution to it (Figure 2.20).

Figure 2.19: Few shot Model Architecture [1]

Figure 2.20: Few shot Model Results compared to other models seen [1]

As mentioned before, huge achievements have been made in the face translation, or face-to-face, field. Being able to animate a still image of a face, whether it is real or not, is a challenging task that can produce mixed feelings due to ethical issues, as can also happen in others of the applications previously mentioned. These ethical issues will be discussed later in this Thesis.

Finally, Table 2.1 shows a brief summary of each of the state of the art algorithms which have been studied and described previously.

SUMMARY: DL and image synthesis. Each entry lists Cite, Paper, Input Data, Model, Dataset and Evaluation (best results).

[32] VoxCeleb2: Deep Speaker Recognition (Jun 2018)
     Input data: Audio spectrogram (Hamming window of 25 ms, 10 ms step)
     Model: VGGVox, based on VGG-M and ResNet architectures
     Dataset: Train: VoxCeleb2. Test: VoxCeleb1.
     Best results: Cost function C_det of 0.429 and Equal Error Rate (EER) of 3.95%.

[34] Lip Reading in the Wild (Nov 2016)
     Input data: Mouth region images
     Model: Four different VGG-M models, differing in architecture and how they "ingest" input data
     Dataset: LRW
     Best results: Top-1 accuracy of 65.4% and Top-10 accuracy of 92.3%.

[37] Lip Reading Sentences in the Wild (Jan 2017)
     Input data: Lip region images (120x120) and MFCC features (25 ms windows at 100 Hz, time-stride of 1)
     Model: WLAS network: three modules (Watch, Listen and Spell), all LSTM with cell sizes 256, 256 and 512, respectively
     Dataset: LRSW
     Best results: Audio and lips: Character Error Rate (CER) of 7.9%, Word Error Rate (WER) of 13.9% and BLEU metric of 87.4.

[38] Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network (Mar 2018)
     Input data: 256x256 images and 3D Morphable Model (3DMM) parameters to generate the UV map
     Model: PRNet, a light-weighted CNN
     Dataset: Train: 300W-LP. Test: AFLW and Florence.
     Best results: Mean Normalized Mean Error (NME) for 3D Face Alignment of 3.62%; for 3D Reconstruction, a mean NME of 3.7551%.

[41] OpenFace: an open source facial behavior analysis toolkit (Apr 2016)
     Input data: Input image or sequence
     Model: Conditional Local Neural Fields (CLNF), based on the Constrained Local Model (CLM) [48], with two components (Point Distribution Model and patch experts)
     Dataset: Train: Multi-PIE, LFPW and Helen. Test: AFW, BU, SEMAINE and MPIIGaze.
     Best results: Among others, mean absolute degree error of 2.6 on the Biwi dataset.

[42] Face2Face: Real-time Face Capture and Reenactment of RGB Videos (Jun 2016)
     Input data: Descriptors of a frame: landmarks, expression parameters, rotation and LBP (Local Binary Pattern)
     Model: Dense, global non-rigid model-based bundling
     Dataset: C. Cao & K. Zhou: shape and comparison data. V. Blanz, T. Vetter & O. Alexander: face data. A. Dai: voice. D. Ritchie: video reenactment.
     Best results: Evaluation based on visual comparison.

[43] X2Face: A network for controlling face generation by using images, audio, and pose codes (Jul 2018)
     Input data: Video frames (differing factors of variation) and audio features
     Model: X2Face: embedding network (U-Net and pix2pix) and driving network (encoder-decoder)
     Dataset: Train: VoxCeleb, AFLW and LRW. Test: VoxCeleb.
     Best results: Mean absolute error (MAE) in degrees for head pose regression of 9.36.

[44] Synthesizing Obama: learning lip sync from audio (Jul 2017)
     Input data: Audio MFCCs, 25 ms sliding window with 10 ms sampling interval
     Model: LSTM (60 LSTM nodes, 20-step time delay)
     Dataset: 14 hours of Barack Obama's videos
     Best results: Evaluation based on visual comparison.

[45] You said that? (May 2017)
     Input data: 112x112x3 images and audio MFCC (0.35-second audio at a 100 Hz sampling rate)
     Model: Speech2Vid: encoder-decoder CNN model that uses a joint embedding of face and audio (audio encoder, identity encoder and image encoder)
     Dataset: VoxCeleb and LRW
     Best results: Evaluation based on visual comparison.

[46] Talking Face Generation by Conditional Recurrent Adversarial Network (Apr 2018)
     Input data: Audio MFCC (350 ms) and cropped lip frames (128x128)
     Model: Audio encoder, image encoder, image discriminator and image decoder architectures, all constructed with convolutional or deconvolutional networks
     Dataset: TCD-TIMIT, VoxCeleb and LRW
     Best results: Peak Signal-to-Noise Ratio (PSNR) of 27.43, Structural Similarity index (SSIM) of 0.918, lip-reading Top-5 accuracy of 63% and Landmark Distance Error (LMD) of 3.14.

[47] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation (Jul 2018)
     Input data: Face from video frames (256x256) and audio MFCC (sampling rate of 100 Hz)
     Model: DAVS system: three encoders based on VGG-M, FAN [49] and [50], respectively; the decoder contains 10 convolution layers
     Dataset: LRW and MS-Celeb-1M
     Best results: For the audio approach, PSNR of 26.7 and SSIM of 0.883; for the video approach, PSNR of 26.8 and SSIM of 0.884.

[1] Few-Shot Adversarial Learning of Realistic Neural Talking Head Models (May 2019)
     Input data: K frames and face landmarks
     Model: Three networks: Image Embedder, Generator and Discriminator
     Dataset: VoxCeleb1 and VoxCeleb2
     Best results: For K=32, a Frechet Inception Distance (FID) of 30.6, an SSIM of 0.72, a cosine similarity (CSIM) of 0.45 and a user accuracy of detecting fakes of 0.33%.

Table 2.1: DL and image synthesis summary


Chapter 3

Development setup

In this section of the thesis report, an overview of the proposed project system architecture will be presented, involving the main development configurations, frameworks, and tools used to carry out the project.

When it comes to combining huge amounts of data processing with deep learning algorithms, several solutions have been implemented by researchers in the past years to make development easier, since massive processing power and the ability to handle different data layers are required. Some of these solutions have gained a lot of interest among developers from all over the world, providing them with different tools that enable deep learning applications research and production.

3.1 Federated Learning

As it has already been said, massive processing power and a huge amount of data are needed for tasks carried out using deep learning algorithms. To free up Central Processing Unit (CPU) cycles in the device for other jobs that do not concern graphical and mathematical computations, a Graphics Processing Unit (GPU) with the Nvidia CUDA toolkit should take part in this process. Nvidia CUDA-X [51] is a software stack for developers that provides a way to build high-performance GPU-accelerated applications, taking advantage of optimizations such as mixed precision compute on Tensor Cores and accelerating a set of models. One of its libraries worth mentioning is cuDNN [52], which makes it possible to implement highly tuned routines, like forward convolution and pooling, for deep neural networks.

Traditional deep learning tasks involve uploading data to a server and using it to train a model. In other words, and applied to our case of study, due to the complexity of the algorithms chosen and, with it, the complexity of this project's task, which needs a big representative collection of samples, traditional ways of processing and training are not enough. In recent years, researchers and developers have had access to devices with enormous amounts of storage space, but it never seems to be enough. It is also quite common to have data spread over different devices and to have to spend lots of time and power centralizing it on a single one, which will be the one used to train the model. Centralizing personally-identifiable information can also become a privacy problem when using data obtained from different users, which might be our case, since faces of different people from around the world are being used to train the model. These problems regarding data quantity and quality cannot be solved with the traditional, centralized way of training machine learning models. This is where Federated Learning appears.

Figure 3.1: Federated learning general representation [53]

Federated learning is a training technique that basically enables collaborative learning of the same model by several devices. The model is trained on a server using the data stored there, and then every other device downloads this same model and improves it using its own local data. The changes to this improved model on each device are sent back to the main server, where the models are averaged to obtain a combined one. Figure 3.1 shows a generalized representation of how this works: the phone trains the model locally (A), many other devices create updates (B), and these updates are averaged to form a change to the shared model (C).
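To make the averaging step concrete, the following sketch shows one communication round of federated averaging in PyTorch, under the assumption that each device trains a local copy of the shared model and the server simply averages the returned parameters; the helper names (local_update, federated_average) and the SGD settings are illustrative, not the project's actual code.

```python
# Minimal federated-averaging sketch (assumed helpers, not the project's code).
import copy
import torch

def local_update(model, loader, epochs=1, lr=1e-3):
    """Train a local copy of the shared model on one device's own data."""
    local_model = copy.deepcopy(model)
    optimizer = torch.optim.SGD(local_model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(local_model(inputs), targets)
            loss.backward()
            optimizer.step()
    return local_model.state_dict()

def federated_average(state_dicts):
    """Average the parameters returned by every participating device."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        for other in state_dicts[1:]:
            avg[key] = avg[key] + other[key]
        avg[key] = avg[key] / len(state_dicts)
    return avg

# One communication round (illustrative usage):
# updates = [local_update(global_model, loader) for loader in device_loaders]
# global_model.load_state_dict(federated_average(updates))
```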

Having said this, and being able to make use of several devices, the proposed decentralized system architecture, shown in Figure 3.2, consists of the following devices and configurations based on the implementation developed in [54], which presents a method for federated learning of networks based on iterative model averaging, proving that it can be made practical with few rounds of communication between devices. This configuration allows faster deployment and testing of the project's model, consuming less power and time.

Figure 3.2: Proposed federated learning system architecture

Since the database, which will be described later, contains a huge amount of data and given the limited time available to develop the project, six different devices from the same network were used in the data preparation step. In the architecture representation, four devices equipped only with a CPU were used for this preprocessing and storage, sending the data obtained through a virtual link to both the main server and a second server. These two servers are equipped with two GPUs and one GPU, respectively, and each one trains the model, saving its local updates. The second server periodically sends its model update to the main server, which computes the average between this information and its own, creating an update for the global model, which is also stored on this server.

The following Table 3.1 summarizes the hardware setup indicated in the proposed architecture used for the development of this project.

Device Pseudonym   Processor                                          OS Version           GPU
PC 1               Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz           Ubuntu 18.04.1 LTS   -
PC 2               Intel(R) Core(TM) 2 Quad CPU Q6600 @ 2.40GHz x 4   Ubuntu 18.04.4 LTS   -
PC 3               Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz x 8      Ubuntu 18.04.1 LTS   -
PC 4               Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz x 8      Ubuntu 16.04.6 LTS   -
SERVER 1           Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz x 20      Ubuntu 18.04.4 LTS   2 x GeForce RTX 2080 Ti/PCIe/SSE2
SERVER 2           Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz x 8        Ubuntu 18.04.3 LTS   GeForce GTX 1080 Ti/PCIe/SSE2

Table 3.1: Hardware used in the development of the Master’s thesis

As has been mentioned, both devices that serve as training machines, SERVER 1 and SERVER 2, have GPUs, which are accessed through CUDA toolkit versions 10.1 and 10.2, respectively.

3.2 PyTorch

Once the hardware setup has been defined, the next step is to describe the software, provided by researchers and developers, that has been chosen for this project, starting with the main framework.

A few years ago, Theano was practically the only player in the field of DL frameworks. Today, there is a myriad of frameworks at the developer's disposal that allow them to develop any kind of ML application in a simpler way, each of them built in a different manner depending on its purpose. These frameworks, thanks to a high-level programming interface, offer different building blocks for designing, training, and evaluating DL neural networks. Table 3.2 provides a brief summary of the most popular DL frameworks.

Even though Facebook's PyTorch (2016) [55] is still relatively new compared to other frameworks, it is already gaining ground on TensorFlow [56], Google's flagship deep learning framework. As a matter of fact, Figure 3.3 shows a comparison graph between the two: the number of unique mentions of TensorFlow and PyTorch implementations in papers at each of the top research conferences over time. This information has been extracted from The Gradient [57], a digital magazine about research and trends in artificial intelligence.

PyTorch, based on the Torch [58] library but running on Python, is used for modeling deep neural networks and executing high-complexity tensor computations, offering a dynamic graph definition. This last feature benefits the implementation of neural network architectures. For example, if a sentence-analysis model using RNNs is implemented, static graphs will not allow the input sequence length to change, forcing a fixed maximum length and padding smaller sequences with zeros. This is not convenient at all, so PyTorch makes it easy to build computational graphs that can change at run time.
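As a brief illustration of this flexibility, the snippet below (a minimal sketch, not taken from the project's code) runs the same GRU on sequences of different lengths without any padding, because PyTorch rebuilds the computation graph on every forward pass; the layer sizes are arbitrary.

```python
# Dynamic graphs: variable-length inputs without padding (illustrative sizes).
import torch
import torch.nn as nn

gru = nn.GRU(input_size=10, hidden_size=32, batch_first=True)

for seq_len in (5, 12, 37):              # lengths may change at run time
    x = torch.randn(1, seq_len, 10)      # (batch, time, features)
    output, hidden = gru(x)              # the graph is built dynamically here
    print(output.shape)                  # torch.Size([1, seq_len, 32])
```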

Theano [59] (University of Montreal, 2007). Platform: Linux, macOS, Windows. Languages: Python.
     Notes: tight NumPy integration; extensive unit-testing and error diagnosis; use of GPU.

TensorFlow [60] (Google, 2015). Platform: Linux, macOS, Windows, iOS, Android. Languages: Python and JS (TensorFlow.js), Java, C, Go (experimental).
     Notes: model deployment on mobile or embedded devices; large production environments; extensive coding and static computation graph.

Keras [61] (François Chollet, Google engineer, 2015). Platform: Linux, macOS, Windows. Languages: Python.
     Notes: used for data integration functions; runs on top of TensorFlow, CNTK, and Theano (Keras sits on a higher level); support for data parallelism; limited low-level computation.

PyTorch [62] (Facebook's AI Research lab, 2016). Platform: Linux, macOS, Windows. Languages: Python, C++, CUDA.
     Notes: precise and readable code; small projects and prototyping; dynamic graphs; tensor computation.

Microsoft Cognitive Toolkit (CNTK) [63] (Microsoft Research, 2016). Platform: Windows, Linux. Languages: C++.
     Notes: limited support outside the Microsoft community; scales well in production using GPUs.

NVIDIA Caffe [64] (University of California, 2017). Platform: Linux, macOS, Windows. Languages: C++ with Python interface.
     Notes: Caffe2 is part of PyTorch; support for RNNs is poor; no hard coding for model definition.

MATLAB (Deep Learning Toolbox) [65] (MathWorks, 2018). Platform: Linux, macOS, Windows. Languages: MATLAB, C++, Java.
     Notes: apps for labeling and generating synthetic data; supports Python interoperability; easy migration.

Swift for TensorFlow (S4TF) [66] (Google, 2018). Platform: macOS, Linux. Languages: Swift.
     Notes: a good option if dynamic languages are not suited to the project.

DL4J [67] (Eclipse Foundation, 2019). Platform: Linux, macOS, Windows, iOS, Android. Languages: Java, with support for Scala and Kotlin.
     Notes: brings a Java environment to execute DL; multi-threaded and single-threaded frameworks.

MXNet [68] (Apache Software Foundation, 2020). Platform: Linux, macOS, Windows. Languages: Python, with support for Scala, Java, C++, and R, among others.
     Notes: portable and scalable; multiple GPUs; allows mixing symbolic and imperative programming.

Table 3.2: Different existing DL frameworks

Figure 3.3: PyTorch Vs. TensorFlow: Number of Unique Mentions. Conference legend: CVPR, ICCV, ECCV - computer vision conferences; NAACL, ACL, EMNLP - NLP confer- ences; ICML, ICLR, NeurIPS - general ML conferences. [53]

Another ability provided by this framework is defining basic building blocks that extend functionality, since object-oriented approaches are used. The datasets and nn.Module modules are an example of this, containing wrappers for datasets used to benchmark architectures and providing a way to create complex architectures, respectively. It is true, though, that recent versions of TensorFlow are adapting to this whole ecosystem of wrappers, trying to simplify the way developers work with the library and providing the freedom to use it with other frameworks, such as Keras, to suit some tasks.

One of the biggest characteristics distinguishing PyTorch from TensorFlow is splitting batches of data samples and running the computation for each split in parallel over multiple GPUs using torch.nn.DataParallel. TensorFlow, instead of working this way, lets the developer configure each operation to run on a specific device.
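A minimal sketch of this mechanism is shown below, assuming the two GPUs of SERVER 1 are visible; the placeholder model and batch size are illustrative only.

```python
# Splitting a batch across several GPUs with torch.nn.DataParallel (sketch).
import torch
import torch.nn as nn

# Placeholder model; any of the networks described later could be used instead.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate the model on every visible GPU
model = model.to(device)

batch = torch.randn(8, 3, 256, 256, device=device)
out = model(batch)                   # the batch is split along dim 0 and gathered back
```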

These and many other advantages make PyTorch feel more like a framework, whereas TensorFlow may be seen more as a library. However, both are capable and competitive deep learning tools with robust visualization options that can be used for high-level model implementation and development.

In this project, PyTorch version 1.2.0 (the current stable release being 1.5.0), which is the latest one fully tested and supported [69], has been used together with the recommended torchvision version 0.4.0. This last package provides datasets, model architectures, and image transformations.

3.3 Libraries

It is well known that Python is the go-to choice for deep learning developers due to its huge offering of libraries that provide features and flexible configurations that increase both productivity and code quality, easing the workload. In this project, some libraries are worth mentioning, which have been used for both data processing and feature extraction. These two processes will be explained in more depth in the following sections.

First of all, to download the dataset videos, there are several libraries available in Python with which, given the video reference and some resolution and format filters, the desired video can be obtained without restrictions. Pytube version 9.5.3 [70] is the one chosen for this project; together with the FFmpeg [71] tool, the videos were collected at the desired frame rate along with their audio.
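The sketch below illustrates this step, assuming the pytube 9.x API and an ffmpeg binary available on the system path; the example URL is the one shown later in Figure 4.4, and the output paths and filter options are illustrative only.

```python
# Downloading one dataset video and extracting its audio (hedged sketch).
import subprocess
from pytube import YouTube

url = "https://www.youtube.com/watch?v=2DLq_Kkc1r8"   # example reference from the dataset
yt = YouTube(url)
stream = (yt.streams.filter(progressive=True, file_extension="mp4")
          .order_by("resolution").desc().first())      # best available resolution
stream.download(output_path="raw", filename="video")

# Re-encode to 25 fps and extract the stereo 44.1 kHz audio track with FFmpeg.
subprocess.run(["ffmpeg", "-i", "raw/video.mp4", "-r", "25", "raw/video_25fps.mp4"])
subprocess.run(["ffmpeg", "-i", "raw/video.mp4", "-vn", "-ac", "2",
                "-ar", "44100", "raw/audio.wav"])
```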

The next step is to manipulate the downloaded audio and video. For this there are several Python libraries, such as pydub [72] for audio and OpenCV [73], a well-known open-source library originally developed by Intel and widely used for computer vision. For this project, versions 4.1.2 of OpenCV and 0.23.1 of pydub have been installed.

To detect, compute, and extract the features needed as input to the network developed, versions 1.0.0 of Face Recognition and 0.7.2 of LibROSA have been used, respectively. The first one generates facial landmarks using an accurate face alignment network based on [49], which detects key points in 2D and 3D coordinates. The other is a package that helps with the analysis of audio files, providing tools to generate different information based on their characteristics.

As the process progresses, an evaluation of the model must be carried out. There are several libraries available in Python for this task, which is one of the main steps of the process, allowing the implemented model to be judged both numerically and perceptually. The main library that should always be consulted when looking for a metric that suits the task is the metrics module of the scikit-learn [74] library, which provides simple and efficient tools, such as score functions, distance computations, and performance and pairwise metrics, to analyze predictive models.

As can be seen, Python is a great programming language for analyzing data and developing deep learning tasks, with lots of modules, packages, and libraries to expand its capabilities, making it much easier for researchers and developers to carry out their work. In this project, Python release 3.6 has been used.

3.4 Overview of the proposed DL process

As said in the previous section, several libraries will be used for the different functions or steps in the process of building, training, and evaluating the project's model. The whole system will be described in detail in the following sections, but first a brief summary of the processes followed to achieve the project's target is given.

Figure 3.4 shows an overview of the processes that the data goes through during the training step. First, once collected, the data goes through a feature extraction process that generates the different inputs to the network (in this project's case, a set of frames from videos with their associated landmarks and a landmark image per frame of the video that will be modified). Then, the training process takes these inputs and adjusts the hyper-parameters of the model each time the whole dataset goes through it. This process synthesizes, for each input set, a frame based on the landmark image and the associated spectrogram from the input. Once this process has finished, the final evaluation is done by comparing different metrics that will be mentioned and described in the following chapter.

Figure 3.4: System architecture general training overview

With the model already trained and two different videos not used during the previous process, the final video solution is obtained. The first video will be, for example, a famous person speaking, whose face will be modified using the facial gestures and lip movements of the second video, in this case a video of myself. Both of them are used as inputs to the system, which performs the feature extraction, applies the adequate changes to each frame based on the trained model, and generates a final video with the characteristics described: a face translation from one video to another.

In Figure 3.5, this final application can be seen. In this case, the output, or final video, is a concatenated version of the input, its landmarks, and the generated video, so that the gestures can be identified and compared easily. The final output is not as realistic as it could be, since the model was trained for a short period of time due to limited resources.

Figure 3.5: System architecture final application overview

Chapter 4

Implementation

This section describes the steps carried out to reach the main objective of the project defined in the Introduction and objectives chapter. As in any other DL task, the whole implementation can be divided into the blocks shown in Figure 4.1, so the following subsections go into each of these blocks, explaining what has been done to achieve the final results.

Figure 4.1: Deep learning process block diagram

In the Annex Code available, a brief description of the project can be seen, providing a link to check its functionality, since the code itself belongs to a private project and is not available.

4.1 Data collection

To start with the DL block diagram, sufficient suitable data has to be collected to use as the input of the neural network that is going to be implemented for this project's task. Collecting data for a model of this kind captures a record of labeled events, or events that have already happened, so that they can be analyzed, letting the network find recurring patterns and learn from them. Basically, these predictive models are only as good as the data from which they are built, so this step is crucial to develop a high-performing DL model.

Data is constantly being generated at an unprecedented rate and used for a huge number of applications. In DL, gathering data is an active research topic because, although it saves costs in feature engineering, since these techniques normally generate features from the data themselves, large databases are required. In this project's field, and as seen in the State of the art, there are several datasets of collected videos of people that researchers use to extract features from and feed to their neural networks. As can be seen in Table 2.1, one of the most used is VoxCeleb2.

As mentioned in [32], VoxCeleb2 contains 1 million utterances from YouTube videos of 6,000 celebrities, 61% of whom are male, which is fairly gender-balanced, and it covers different accents, ages, and ethnicities, making it a well-generalized dataset (Figure 4.2). These videos include interviews from different situations, such as red carpets or indoor studios, which means that the audio might be noisy due to degradation from background chatter or room acoustics, among others. Table 4.1 gives a brief description of the dataset statistics that might be of general interest.

VoxCeleb2 dataset
Number of POIs:          6,112
Number of male POIs:     3,761
Number of female POIs:   2,351
Number of hours:         2,442
Number of videos:        150,480
Avg. videos per POI:     25

Table 4.1: VoxCeleb2 description

Figure 4.2: VoxCeleb2 faces of speakers in the dataset.

VoxCeleb2 provides separate training and testing sets, which is quite useful to help researchers evaluate their models. In this project's case, to test the models trained with this dataset, two different kinds of videos will be used: an original one, in which there is a person who may or may not be speaking, and to which the translation of facial gestures from another video will be applied. These two types of videos can be chosen according to the taste and needs of the researcher, so the different videos downloaded and home-made for this purpose are described in the experiments chapter (Simulations and results).

To finally obtain this dataset, several options are provided, such as downloading the audio files, the cropped video files, or the YouTube URLs. The last option was chosen, since it also includes the timestamps of the celebrity utterances. The files downloaded, after filling in a form [75] to specify the purpose and request access, consisted of several celebrity-identified folders containing sub-folders for each video which, in turn, contain one or more files indicating the reference to the YouTube page and the utterance frames with the coordinates of the bounding box surrounding the celebrity's face. Figure 4.3 shows the way these downloaded files are presented and the folder organization. In this case, the whole dataset is inside the ”txt” folder, which consists of 5,994 celebrity folders, fewer than the total of 6,112 specified previously, probably because the file list has not been updated. Since, as explained later, the entire dataset is not going to be used, this is not a problem. The celebrity folders are named after an id; for example, the folder ”id00012” contains 23 video folders named after the YouTube reference, and the first one of them, the ”2DLq Kkc1r8” folder, contains 3 txt files. In Figure 4.4, the content of file ”00016.txt” can be seen, consisting of the reference to the final YouTube link (”https://www.youtube.com/watch?v=2DLq_Kkc1r8”) and the bounding box coordinates for each frame. Figure 4.5 shows that this URL links to a video example from the dataset that will be downloaded.

Figure 4.3: VoxCeleb2 downloaded folders organization

Figure 4.4: VoxCeleb2 txt file with video information to download it

Figure 4.5: VoxCeleb2 video example from YouTube

Due to the short development time and limited storage resources, even though a total of 6 devices were involved, only a small portion of the dataset's videos could finally be downloaded and prepared. They were downloaded in the best resolution available and transformed to 25 frames per second, since the timestamps provided assume the video is stored at this frame rate. It is important to highlight that the audio was also extracted and saved during the process. Table 4.2 shows the final number of samples used in the two devices where the model has been trained.

Information             Server 1   Server 2   Total
Number of POIs          131        124        255
Number of videos        2,545      2,347      4,892
Number of utterances    19,350     17,738     37,088
Avg. videos per POI     19.43      18.93      19.18

Table 4.2: Final number of samples from VoxCeleb2 used

To see the amount of storage needed to hold the raw videos and audio, Table 4.3 gives some figures for the duration and size of the media files. The enormous amount of storage required explains why the entire dataset could not be downloaded, since this space has to be added to that of the subsequent feature-extraction processing.

Information                              Server 1       Server 2       Total
Total video duration                     250.32 hours   224.83 hours   475.15 hours
Total video & audio storage              62.2 GB        48.1 GB        110.3 GB
Total video & audio preprocess storage   400 GB         345 GB         745 GB
Total storage                            460 GB         455 GB         1 TB

Table 4.3: Dataset storage used

Finally, some information about the files: as said, the videos have been downloaded in the best resolution possible, with a frame rate of 25 fps and H.264 (High Profile) as codec, and their audio is stereo, AAC-encoded, with a 44,100 Hz sample rate.

4.2 Data preparation

To ensure data has the necessary format and information, this preparation or ”preprocessing” is a prerequisite that needs a huge amount of time to be analyzed and performed in every DL task. In this case, the VoxCeleb2 dataset will undergo a feature extraction process to reduce dimensionality and obtain a set of characteristics, or features, which describe it accurately. As already mentioned, both audio and visual features will be extracted.

4.2.1 Visual features extraction

Regarding the visual characteristics of a video of a person, it has been seen in the state of the art that it is quite common in this kind of task to use neural networks to extract facial landmarks, bounding boxes, or face alignment. In particular, it seems essential to calculate some points of interest, or landmarks, localizing the upper and lower contours of the eyes and the mouth along with some other facial key points, which allows face tracking. Figure 4.6 shows an example of these visual features that can be extracted from an image.

Figure 4.6: Visual features extraction [41]

There are multiple tools to perform landmark detection in each of the frames of the videos. As mentioned in the Development setup chapter, the face-alignment library allows detecting these key points based on the bounding box provided by the dataset. It is important to highlight that these coordinates are given assuming a video resolution of 224p, so a conversion depending on the resolution of the downloaded video had to be made. The face detection performed by the aforementioned package is based on the FAN network described in [49], which consists of four modified HourGlass (HG) CNN networks from [76] that take a facial image as input and generate a set of heatmaps, one for each landmark.

Figure 4.7: Visual feature extraction. Frame A from dataset video, bounding-box coordi- nates provided by dataset and landmarks extracted, respectively.

In Figure 4.7, these extractions can be seen on a frame of a video from the dataset. Thanks to the coordinates provided by the VoxCeleb2 dataset, a bounding box can be drawn, also taking the head pose estimation into account. Landmarks are usually represented as a set of points associated with the eyes, lips, nose, and other facial components. In this case, the landmarks are joined, obtaining a face representation of each of these components, which is possible because the relations between the extracted key points are known. Finally, the input of the neural network regarding visual features is a cropped version of the frame and its landmarks concatenated, as Figure 4.8 shows. Every frame-landmark pair is cropped to focus just on the person's face and transformed to the necessary input size, 256x256.
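A rough sketch of this per-frame processing is given below, assuming the face-alignment package (version 1.0.0) and an OpenCV frame; the exact layout of the bounding-box values read from the dataset txt files, and the scaling from 224p to the real resolution, are assumptions made for illustration.

```python
# Visual preprocessing sketch: rescale the 224p bounding box, crop the face,
# resize to 256x256 and detect the 68 facial landmarks with face-alignment (FAN).
import cv2
import face_alignment

fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device="cuda")

def crop_and_landmarks(frame, bbox_224):
    """frame: BGR image from OpenCV; bbox_224: assumed (x, y, w, h) for 224p video."""
    h, w = frame.shape[:2]
    sx, sy = w / 224.0, h / 224.0                     # rescale dataset coordinates
    x = int(round(bbox_224[0] * sx))
    y = int(round(bbox_224[1] * sy))
    bw = int(round(bbox_224[2] * sx))
    bh = int(round(bbox_224[3] * sy))
    face = cv2.resize(frame[y:y + bh, x:x + bw], (256, 256))
    # Detection is assumed to succeed here; a real pipeline should check for None.
    landmarks = fa.get_landmarks(cv2.cvtColor(face, cv2.COLOR_BGR2RGB))[0]  # (68, 2)
    return face, landmarks
```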

Figure 4.8: Final visual feature extraction. Input to the network consisting of frame and landmarks concatenated.

4.2.2 Audio features extraction

In the domain of audio analysis, the state of the art describes that audio samples are usually represented by their spectrograms and their Mel-Frequency Cepstrum Coefficients (MFCC). The first is a visual representation of the spectrum of frequencies as the signal varies through time, while the second describes the overall shape of the spectral envelope; in other words, it models the characteristics of the human voice. Sometimes, other characteristics, such as prosodic ones (pitch, duration, or stress), are used to complement these signal representations.

Both of these features were considered when extracting information from short segments of the audio files, where one segment corresponds to one frame of the video. In other words, since every video used was assumed to have a frame rate of 25 fps, the audio file was divided into short segments of 40 ms which, in combination with the configuration specified in Table 4.4, were finally used to compute the Mel-spectrogram. Besides the frame rate, the duration of 40 ms was chosen because this window interval is used in some state of the art papers that collect audio information, such as [77], and because a smaller window generates a more accurate representation in time of the words spoken by the person. The Mel-spectrogram was chosen over the MFCCs since, to calculate the MFCCs, the Mel-spectrogram is computed first, so it was considered sufficient to use the former.

Mel-spectrogram configuration
Duration:         40 ms
Sample rate:      44,100 Hz
NFFT:             2048
Hop length:       512
Power:            2.0
Fmin:             20 Hz
Fmax:             20,000 Hz
Number of mels:   256

Table 4.4: Information about spectrogram computation

Table 4.4 describes the different arguments received by the function from the package introduced in the Development setup chapter: the audio signal with a duration of 40 ms, the sample rate (44,100 Hz in this case), the length of the FFT window, the number of samples between successive frames, the exponent for the magnitude Mel-spectrogram (in this case 2, i.e. power), the lowest and highest frequencies, and the number of Mel bands to generate, respectively. The value of each parameter has been established taking into account different examples of audio feature extraction.

Basically, the NFFT value depends on the number of samples in the segment (N) and is taken as a power of 2. It should be greater than N, and the higher the number, the more frequency resolution the spectrogram will have; so, given the duration and sample rate, 2048 was chosen as NFFT. The hop length depends on the purpose of the analysis: more overlap means more points and, therefore, smoother results. For spectrogram display it is normally set to a quarter of the NFFT. The lowest and highest frequency values correspond to the human audio spectrum. Finally, the default number of mels is 128, but in this case a higher number of bands represented the spectrogram better.
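A possible implementation of this computation with LibROSA 0.7.x, using the values of Table 4.4, could look as follows; the audio path and frame index are illustrative.

```python
# Mel-spectrogram of one 40 ms audio segment, using the Table 4.4 configuration.
import librosa
import numpy as np

sr = 44100
frame_idx = 120                        # index of the 25 fps video frame (illustrative)

# Load the 40 ms audio segment aligned with that video frame.
signal, _ = librosa.load("raw/audio.wav", sr=sr,
                         offset=frame_idx / 25.0, duration=0.040)

mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=2048, hop_length=512,
                                     power=2.0, fmin=20, fmax=20000, n_mels=256)
mel_db = librosa.power_to_db(mel, ref=np.max)   # log scale, as usually displayed
```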

The spectrogram is usually depicted as a heatmap in which the intensity variation can be seen through different colors or brightness, as shown in Figure 4.9, where the audio waveform and the extracted MFCCs can also be seen, to visualize how they look both when the person is talking and when they are not talking and there is only noise. In this case, the spectrogram image is also transformed to the necessary input size, 256x256.

Figure 4.9: Audio feature extraction. First column: frame A talking from dataset video, audio waveform from frame A, MFCCs and Mel-spectrogram, respectively. Second column: the same for frame B not talking.

4.3 Modeling

To continue with the process of DL, the network architecture has to be defined and described based on the different state of the art ones. This project’s architecture will use [1] as a base network but some modifications will be made to adapt it to the input dataset and proposed idea.

In Figure 4.10 an overall view of the different parts of the final architecture can be seen, as well as their inputs and outputs and how they are related to each other. As it can be seen, its task is to finally synthesize frames of a video sequence that contains speech expressions and facial gestures of a person based on a set of face landmarks.

Figure 4.10: Project's network architecture

The parts of the architecture can be easily differentiated and grouped into two feature Encoders (audio and image) and a GAN-based network that consists of the main components of a GAN, the Discriminator and the Generator, the latter combined with a previous encoder, both forming an autoencoder. This modified version of a GAN is meant to share information between the different layers of the Generator through the combination of feature vectors coming from the three different Encoders (audio, image, and the one combined with the Generator), which compress data from the video. As mentioned in the State of the art chapter, GANs are widely used in this field of image synthesis, since the Generator's job is to create candidate images while the Discriminator tries to evaluate whether they are real or not. Basically, the Generator trains to increase the error rate of the Discriminator, ”fooling” it into evaluating a synthesized image as real.

While several state of the art models propose training large networks where both generator and discriminator have a high number of parameters, requiring long videos or large amounts of images to generate talking head models, this system generates talking head models using a handful of images as input (K-shot learning), as can be seen at the input of the Image Embedder in the figure. This ability to learn from a small set of images works thanks to extensive pre-training, which will be called meta-learning, in which the system performs K-shot learning tasks given a small training set of images of the same person. After this stage, another set of images from another person starts a new learning problem using the Generator and the Discriminator pre-trained via the meta-learning process described.

The inputs to the Generator and both Embedders have already been calculated in the previous sections, consisting of visual and audio features: frames and their landmarks from videos of the dataset and the spectrogram of the audio segments mentioned. It is necessary to highlight again that the Image Embedder takes as input a specific number K of frames from a video, in this case K = 8 random frames. From this sequence of random frames, one is selected randomly to be used as input to the Generator to create the synthesized frame. As input to the Audio Embedder, the spectrogram of the audio segment of this randomly selected frame is used.

In the next sections, each block will be described in more detail.

4.3.1 Embedders

As it has been seen in the project’s architecture figure, there are two Embedders involved in the meta-learning stage that take information from the video and map it into two N-dimensional vectors, one each.

In particular, the Image Embedder takes a video frame and its associated landmark and the Audio Embedder takes the spectrogram computed from an audio associated with a frame. During this learning process, the Embedders aim to learn how to generate a vector that contains specific information, such as the person’s identity. The Image Embedder averages the resulting embeddings from the K-set of frames from the same video and, in the case in which the Audio Embedder is used too, it concatenates the result with the embedding vector generated by the Audio Embedder.

Figure 4.11: Image Embedder Architecture

Both Embedders use the same network (differing only in the input), consisting of residual downsampling blocks with spectral normalization layers, a self-attention block from [78] inserted at 32x32 spatial resolution, and a sum pooling over spatial dimensions followed by a ReLU activation function to finally obtain the vectorized outputs.

Figure 4.12: Audio Embedder Architecture

Figures 4.11 and 4.12 show a schematic representation of these networks, whose main objective is to find a representation of the input data that characterizes it. As has been said, the only difference is how the input is managed. The Image Embedder takes two images and concatenates them to create a single input element. As for the padding layer, it is meant for inputs different from 256x256, to adapt them to this resolution. The Audio Embedder, in turn, takes as input the spectrogram image with a size of 256x256, as mentioned in the preprocessing step. This spectrogram is associated with the audio segment corresponding to the input image of the Generator, which will be explained later. In this case too, the padding layer is meant for inputs that differ from the desired size. The rest of the network coincides in both Embedders, and its different layers are explained in what follows.

Residual blocks were introduced to solve the problem of vanishing gradients by feeding the activation of one layer directly to a deeper one of the network (skip connection). As can be seen in Figure 4.13, the identity function can be learned directly by relying just on the skip connections. Looking at the equation in the figure, where x is the input of the neural network block and the desired function to be learned is H(x), since there is an identity connection carrying x, the layers only have to learn the residual R(x) = H(x) - x.

Figure 4.13: Single Residual Block[79]

So, in this case, a residual downsampling block is used in several layers, consisting of several 2-D convolutions with a kernel size of 2 followed by ReLUs, normalized with spectral normalization as said previously, and finally a max-pooling layer. Figure 4.14 shows this architecture.
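One possible PyTorch reading of this block is sketched below; since a kernel size of 2 is unusual for padded convolutions, the sketch uses 3x3 kernels with padding 1 as an assumption, keeping the spectral normalization, ReLU activations, skip connection and final max pooling described above.

```python
# Residual downsampling block sketch (channel counts and kernel size assumed).
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ResidualDownBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.residual = nn.Sequential(
            nn.ReLU(),
            spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)),
            nn.ReLU(),
            spectral_norm(nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)),
        )
        self.skip = spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=1))
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = self.residual(x) + self.skip(x)   # skip connection: H(x) = R(x) + x
        return self.pool(out)                   # halve the spatial resolution

# Usage: ResidualDownBlock(64, 128)(torch.randn(1, 64, 256, 256)) -> (1, 128, 128, 128)
```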

The self-attention block, introduced after three of the downsampling blocks, applies an attention mechanism to each position of the input sequence by creating three vectors for each position. This layer basically consists of three 2-D convolutional layers with spectral normalization followed by a softmax function. The max-pooling applied later is adaptive, so the stride and kernel size are automatically selected to adapt to the specified output size (1,1) before going through the activation function defined.
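A hedged sketch of such a block, following the self-attention layer of [78] (SAGAN), is shown below; the channel reduction factor of 8 and the learnable blending scalar are the usual choices in that paper, not values confirmed by this project's code.

```python
# SAGAN-style self-attention block (sketch, following [78]).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class SelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.key = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.value = spectral_norm(nn.Conv2d(channels, channels, 1))
        self.gamma = nn.Parameter(torch.zeros(1))   # learnable blending scale

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                     # (B, C//8, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)      # attention over positions
        v = self.value(x).flatten(2)                   # (B, C, HW)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # blend attended features back
```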

As the final output of both Embedders, a vector is generated with information about those features and a dimension depending on the batch size (B) and, in the case of the Image Embedder, on the number of shots (K).

Figure 4.14: Residual Down Sampling Block

4.3.2 Generator

The Generator takes as input the randomly selected landmark image for a video frame. During the training process, it takes the predicted video and audio embeddings, concatenated when both exist, or just the video embedding when only the Image Embedder is used, to share this information through the layers of the network. It trains to maximize the similarity between its outputs, which are synthesized frames, and the ground truth frames. Depending on whether both Embedders are used or not, the input dimension of the Generator varies from (B, 1024, 1) to (B, 512, 1). Figures 4.16 and 4.15 show its architecture in both situations, the only difference being the number of downsampling or upsampling layers needed to adapt dimensions.

The Generator in both cases consists of an Encoder, a Bottleneck block, in which the embedding features are inserted, and a Decoder. The Encoder takes the input landmark image, which goes through 3 residual downsampling blocks, each followed by instance normalization, which normalizes across each channel in each training sample, before applying the self-attention block and finally going through one or two more residual downsampling-instance normalization pairs depending on the dimension. The final data generated by the Encoder has a dimension of 512x16x16 or 1024x8x8, when not making use of the Audio Embedder or when making use of it, respectively. In other words, the Encoder part uses the same architecture as the Embedders but with instance normalization.

Figure 4.15: Generator Architecture Without Audio vector

The Bottleneck block consists of 5 residual blocks where the information of the embedding vector is added to the Encoder output before going through a ReLU layer followed by a 2-D convolutional one with spectral normalization. This addition is performed through an AdaIN block based on [80], which receives an input and a style input and aligns the channel-wise mean and variance of the first to match the second's. This mean and variance of the style input are computed from the embedding vectors.
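The following sketch shows how such an AdaIN step could be written in PyTorch, assuming that the per-channel mean and standard deviation are predicted from the embedding vector through a linear projection; the projection itself is an assumption for illustration.

```python
# Adaptive instance normalization (AdaIN) sketch, in the spirit of [80].
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    def __init__(self, channels, embedding_dim):
        super().__init__()
        self.to_style = nn.Linear(embedding_dim, 2 * channels)  # predicts mean and std
        self.norm = nn.InstanceNorm2d(channels, affine=False)

    def forward(self, features, embedding):
        style = self.to_style(embedding)                 # (B, 2C)
        mean, std = style.chunk(2, dim=1)
        mean = mean.unsqueeze(-1).unsqueeze(-1)          # (B, C, 1, 1)
        std = std.unsqueeze(-1).unsqueeze(-1)
        # Normalize the encoder features and align them with the style statistics.
        return (1 + std) * self.norm(features) + mean
```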

Finally, the Decoder performs the opposite of the Encoder with 2 or 3 residual upsampling blocks, depending on the input dimension, then a self-attention block followed by 2 more residual upsampling blocks and an AdaIN one, and finally a ReLU activation function, a 2-D convolutional layer, and a Sigmoid layer. In this case, the residual upsampling blocks are represented as shown in Figure 4.17.

Figure 4.16: Generator Architecture With Audio vector

4.3.3 Discriminator

The last element that ties the whole network together is the Discriminator, which takes as input the synthesized video frame, its associated landmark image, and the ground truth image. As Figure 4.18 shows, its architecture matches the Embedder's: several residual downsampling blocks, with the Discriminator having an additional one at the end operating at 4x4 spatial resolution and, to obtain the vectorized output, a global sum pooling over spatial dimensions followed by a ReLU. It is important to highlight that in this element of the architecture, as in the other ones, all the convolutional layers have been configured with a minimum number of channels of 64 and the maximum, as well as the size of the embedding vectors, set to 512 or 1024 depending on the Embedders-GAN combination.

Figure 4.17: Residual Up Sampling Block

The Discriminator maps the synthesized frame and landmark image into an N-dimensional vector and owns several learning parameters, some of them randomly initialized, which help predict a realism score indicating whether the input frame is a real one from the sequence and whether it matches the landmark image, based on the mapping vector and the learning parameters. This network outputs the realism score and a set of values that, during training, are compared to the ones obtained with the ground truth image to continue the learning process.

Finally, those are all the networks that make up the whole final architecture, with their inputs and outputs. The next step is to see how they interact with each other, in other words, how they are trained.

4.4 Training

This next step of the process concerns every configuration made to train the neural network using the data referred to in previous sections. It has been observed that, depending on the training and hyper-parameter configurations, the efficiency of the model can be improved drastically. As said before, the use of GPUs is of significant help during this process due to the computational requirements of the algorithms. But there are other arrangements worth outlining, such as the two-step training specific to this project's network.

Figure 4.18: Discriminator Architecture

4.4.1 Meta-learning

As already mentioned, the ability to learn from K shots of a video is gained thanks to a pre-training process, which will be called meta-learning, on videos from the dataset described. During this process, the system simulates K-shot learning tasks to transform landmark images into personalized images, making use of a set of frames of a person. In this project, 8 has been chosen as the value of K, since the state of the art of the base model verified that it is a suitable number.

So, this meta-learning stage assumes there are several video sequences of different people, along with the landmark images associated with each of the frames, and trains the three or four different networks defined previously:

• The Image Embedder E_v tries to learn parameters (φ) such that the output vector contains video-specific information that is invariant to the facial gestures in the frame.

• The Audio Embedder E_a tries to do the same with its parameters (ξ), but in this case the output contains audio-specific information referring to the audio segment of the frame.

• The Generator G trains to maximize the similarity between the synthesized output frame and the ground truth one. In this case, the parameters of this network can be divided into those learned during the meta-learning process and those predicted from the embedding vectors through a trainable projection matrix P. The former are the person-generic parameters ψ, and the latter the person-specific parameters µ = P · e, with e the embedding vector.

• The Discriminator D owns learning parameters (θ, W^v, w_0^v, W^a, w_0^a, b), which have been mentioned before and help with the computation of the realism score for each landmark image input.

In each training episode, a random video sequence i and the landmark image of a frame t from that sequence are selected, as well as 8 random frames s_k. The estimated image embedding vector e_i^v for i is computed as shown in 4.1, averaging the embedding vectors E(x_i(s_k), y_i(s_k); φ) obtained from each frame image x_i(s_k) and its landmark image y_i(s_k).

e_i^v = \frac{1}{8} \sum_{k=1}^{8} E\big(x_i(s_k), y_i(s_k); \phi\big)    (4.1)

If the process also uses the Audio Embedder, the audio features are extracted from the audio segment associated with t, and the embedding vector e_i^a is generated and concatenated with e_i^v, obtaining e_i.
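The sketch below illustrates how the combined embedding could be assembled from the two Embedders, following equation 4.1 and the concatenation just described; the tensor shapes and the function signature are assumptions for illustration.

```python
# Sketch: average the K image embeddings and concatenate the audio embedding.
import torch

def compute_embedding(image_embedder, audio_embedder, frames, landmarks,
                      spectrogram=None):
    # frames, landmarks: (K, 3, 256, 256); spectrogram: (1, 3, 256, 256)  (assumed)
    inputs = torch.cat([frames, landmarks], dim=1)           # channel-wise concat
    e_v = image_embedder(inputs).mean(dim=0, keepdim=True)   # average over the K shots
    if spectrogram is None:
        return e_v                                           # video-only embedding
    e_a = audio_embedder(spectrogram)
    return torch.cat([e_v, e_a], dim=1)                      # combined embedding e_i
```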

The Generator then synthesizes a frame \hat{x}_i(t) based on the landmark image of t and the embedding vector:

\hat{x}_i(t) = G\big(y_i(t), e_i; \psi, P\big)    (4.2)

4.4.1.1 Generator loss

After these two steps, the networks' parameters need to be optimized so that the loss function 4.3 is minimized. This function is composed of three or four different terms depending on the situation. The difference lies in the use of the Audio Embedder: if it is used, the term L_{A MCH} and the corresponding parameters (ξ, W^a, w_0^a) are added where necessary; if not, they are omitted.

L(\phi, \xi, \psi, P, W^v, w_0^v, W^a, w_0^a, b) = L_{CNT}(\phi, \xi, \psi, P)
    + L_{ADV}(\phi, \xi, \psi, P, W^v, w_0^v, W^a, w_0^a, b)
    + L_{V MCH}(\phi, W^v)
    + L_{A MCH}(\xi, W^a)    (4.3)

Content loss

The content loss term L_{CNT} measures the distance between \hat{x}_i(t) and the ground truth frame making use of two already-trained networks: VGG19 [81], trained to measure perceptual similarity for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [82], which evaluates algorithms used for object detection and image classification; and VGGFace [83], which is trained to perform face verification.

The architecture of the first one, VGG19, can be seen in Figure 4.19: a 19-layer deep CNN able to classify images into 1,000 classes. Of those 19 layers with learnable weights, 16 are convolutional and 3 are fully connected. The only preprocessing done to the images of the dataset is subtracting the mean RGB value from each pixel, the input being a fixed-size 224x224 RGB image, which passes through a stack of 16 convolutional layers (3x3 filters, stride 1) interleaved with 5 max-pooling layers (2x2 pixel window, stride 2) and, finally, a set of 3 fully-connected layers and a soft-max to perform the classification. This network outperformed all previous models at the time, which is why it became competitive in the field.

Figure 4.19: VGG-19 Architecture

VGGFace, shown in Figure 4.20, has 16 weight layers, differing from VGG16 [81] only in the output layer. A sequence of convolutional + ReLU activation layers, max-pooling, and softmax layers forms the final architecture. This model has been trained on a dataset consisting of 2,622 unique identities with over 2 million faces.

Figure 4.20: VGG-Face Architecture

The final L_{CNT} term is then calculated as the weighted sum (weights of 1.5·10^-1 for VGG-19 and 2.5·10^-2 for VGG-Face) of L1 losses between the features of the networks. The L1 loss stands for Least Absolute Deviations (LAD) and is used to minimize the error, which is the sum of all the absolute differences between the true and the predicted values.

L_1 = \sum_{i=1}^{n} | y_{true} - y_{predicted} |    (4.4)
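As a sketch of how this weighted sum could be computed, assuming helper functions that return the lists of intermediate activations of the two pre-trained networks (the helpers themselves are not part of the project's code):

```python
# Content loss sketch: weighted L1 distances between VGG19 and VGGFace features.
import torch

def l1(a, b):
    return torch.mean(torch.abs(a - b))                  # least absolute deviations

def content_loss(vgg19_feats, vggface_feats, generated, ground_truth):
    loss = 0.0
    for f_gen, f_real in zip(vgg19_feats(generated), vgg19_feats(ground_truth)):
        loss = loss + 1.5e-1 * l1(f_gen, f_real)         # VGG-19 term
    for f_gen, f_real in zip(vggface_feats(generated), vggface_feats(ground_truth)):
        loss = loss + 2.5e-2 * l1(f_gen, f_real)         # VGG-Face term
    return loss
```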

Adversarial loss

The next term in 4.3, represented in 4.6, is the adversarial term L_{ADV}, which refers to the realism score obtained by the Discriminator, which has to be maximized, and to a feature matching term L_{FM} (4.5) based on [30]. The latter depends on the features extracted from T layers of the aforementioned Discriminator and learns to match them between the real and the synthesized frames, helping with the stability of the training.

L_{FM}(G, D) = \mathbb{E}_{(t)} \left[ \sum_{i=1}^{T} \frac{1}{N_i} \, \| D^{(i)}(t) - D^{(i)}(\hat{x}(t)) \|_1 \right]    (4.5)

L_{ADV}(\phi, \xi, \psi, P, W^v, w_0^v, W^a, w_0^a, b) = -D\big(\hat{x}_i(t), y_i(t), t; \theta, W^v, w_0^v, W^a, w_0^a, b\big)
    + L_{FM}(G, D)    (4.6)

The matrices W^v and W^a contain columns with the embeddings that correspond to the videos and audios, while w_0^v and w_0^a correspond to the general realism of the synthesized frame and its compatibility with the input of the Generator, the landmark image. As has been said, the Discriminator maps its inputs to a vector V(\hat{x}_i(t), y_i(t); θ) to generate the realism score as in 4.7, w_i being the audio-video matrix formed by concatenating the columns W_i^v + w_0^v and W_i^a + w_0^a. This loss is added to the total loss with a weight of 10, which guarantees that the loss values are in close ranges, making each of them as important as the rest.

D\big(\hat{x}_i(t), y_i(t), t; \theta, W^v, w_0^v, W^a, w_0^a, b\big) = V\big(\hat{x}_i(t), y_i(t); \theta\big)^{T} (w_i) + b    (4.7)

Match loss

As the previous section described, there is another embedding type apart from the video and audio ones: the columns of W. The final terms of the total loss, L_{V MCH}(φ, W^v) and L_{A MCH}(ξ, W^a), boost the similarity between the different types of embeddings, penalizing the L1 loss between e_i^v and W_i^v for the video, and between e_i^a and W_i^a for the audio. Both losses are added to the total loss with a weight of 10 each.

4.4.1.2 Discriminator loss

In 4.7, the realism score computed by the Discriminator can be seen, and it was mentioned that its trainable parameters are needed to calculate a loss term of the Generator in order to update its parameters. To update the Discriminator's own parameters, L_{DSC}(φ, ψ, P, θ, W^v, w_0^v, W^a, w_0^a, b) is minimized, trying to increase the realism score on real frames and decrease it on synthesized ones.

L_{DSC}(\phi, \psi, P, \theta, W^v, w_0^v, W^a, w_0^a, b) = \max\big(0, 1 + D(\hat{x}_i(t), y_i(t), t; \phi, \psi, \theta, W^v, w_0^v, W^a, w_0^a, b)\big)
    + \max\big(0, 1 - D(x_i(t), y_i(t), t; \theta, W^v, w_0^v, W^a, w_0^a, b)\big)    (4.8)
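A compact sketch of this hinge formulation is shown below, with score_real and score_fake standing for the scalar outputs of equation 4.7 on ground-truth and synthesized frames, respectively; the function name is illustrative.

```python
# Hinge loss for the Discriminator (sketch of equation 4.8).
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(score_real, score_fake):
    loss_real = F.relu(1.0 - score_real).mean()   # max(0, 1 - D(real frame))
    loss_fake = F.relu(1.0 + score_fake).mean()   # max(0, 1 + D(synthesized frame))
    return loss_real + loss_fake
```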

4.4.2 Fine-tuning

Once the meta-learning step has converged, the system is able to learn to synthesize frames of a new person not seen during the previous process. This learning is performed in a T-shot way, given T frames of a video and their associated landmark images. In the paper, this T value varies between 1, 8, and 32; here, several values will be used to see the different results obtained. This step tries to solve the identity gap problem that appears even though the synthesized frame of an unseen person is still realistic. It can be seen as a simplified version of the previous learning, but using just a single, smaller video sequence.

The Embedders trained in the previous step are used to estimate the new embedding vectors e_NEW^v and e_NEW^a for these video frames, both image and audio in case the Audio Embedder is used.

To generate the new synthesized frames, we proceed as in the meta-learning stage using the Generator, which has already been trained, with the new concatenated embedding vector e_NEW. This is where the person-specific parameters µ appear, based on the learned projection matrix P: µ = P · e_NEW.

\hat{x}(t) = G'\big(y(t); \psi, \mu\big)    (4.9)

Finally, the Discriminator proceeds as before to calculate the realism score, with the parameters θ and b initialized to the results of the previous learning step. Since this video contains a person unseen in the training dataset, the information of W_i^v and W_i^a is not available. Nonetheless, the last two loss terms of the total Generator loss ensure the similarity between the embeddings, so w' in 4.10 is the concatenated matrix of w'^v and w'^a, initialized to the sums w_0^v + e_NEW^v and w_0^a + e_NEW^a for video and audio, respectively.

D'\big(\hat{x}(t), y(t); \theta, w'^v, w'^a, b\big) = V\big(\hat{x}(t), y(t); \theta\big)^{T} (w') + b    (4.10)

4.4.2.1 Loss functions

The loss functions for each network are analogous to the previous ones, though the Generator loss loses the L_{V MCH}(φ, W^v) and L_{A MCH}(ξ, W^a) terms and looks as in 4.11,

L'(\psi, \mu, w'^v, w'^a, b) = L'_{CNT}(\psi, \mu) + L'_{ADV}(\psi, \mu, w'^v, w'^a, b)    (4.11)

and the Discriminator loss is optimized as 4.12 shows.

L'_{DSC}(\psi, \mu, \theta, w'^v, w'^a, b) = \max\big(0, 1 + D(\hat{x}(t), y(t); \psi, \mu, \theta, w'^v, w'^a, b)\big)
    + \max\big(0, 1 - D(x(t), y(t); \theta, w'^v, w'^a, b)\big)    (4.12)

4.4.3 Other hyper-parameters

Some other hyper-parameters are worth mentioning, since they directly affect the results obtained and shown in the Simulations and results chapter.

At first, data parallelism was meant to be used, since two GPUs were available in Server 1 and the base model was adapted to use data parallelism if possible. This parallelization across multiple processors helps distribute the data across different nodes to reduce processing time. However, there were several compatibility problems, so this possibility was finally discarded.

The number of epochs was chosen taking into account that a training process of these characteristics lasts several days, even though only a small portion of the dataset was used, and the number of experiments that were to be carried out. The meta-learning process was trained for 400 epochs and, depending on the GPU, lasted from 4 days to approximately 2-3 weeks. The fine-tuning process was set to 200 epochs, which, of course, lasted much less: under an hour.

The base model from the paper mentioned divided the dataset in two to train the model, in other words, it used a batch size of 2. Due to the size of the network, the number of parameters, and the images involved in the process, the GPU reaches up to about 10 GB of memory, so in this case the batch size could not be increased to values larger than 1.

As for the optimization parameters, the ones used in the paper were used as well: Adam as the optimizer for both the Generator and the Discriminator, with learning rates of 5x10^-5 and 2x10^-4, respectively. The first one was updated once per training epoch and the second one, twice. The number of frame shots (input to the Image Embedder) used during training is, as mentioned previously, the same as in the paper. Due to time constraints it was not varied to look for possible efficiency gains, although it seems to be an important parameter to explore.
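
As a sketch, and assuming plain torch.optim.Adam with the learning rates quoted above (the parameter lists here are placeholders, not the project's real modules), the optimizer setup could look as follows:

```python
import torch

# Hypothetical parameter groups; the real project exposes its own Generator/Discriminator modules.
gen_params = [torch.nn.Parameter(torch.randn(10))]
disc_params = [torch.nn.Parameter(torch.randn(10))]

# Learning rates reported in the text: 5e-5 for the Generator, 2e-4 for the Discriminator.
optimizer_g = torch.optim.Adam(gen_params, lr=5e-5)
optimizer_d = torch.optim.Adam(disc_params, lr=2e-4)
```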

So, as can be seen, every hyper-parameter was kept at its original value, since due to resources and time it was not possible to sweep over different values to see whether the results improved or not.

4.5 Evaluation

The final step of this DL process would be to evaluate the efficiency of the whole system. During the training process, the data necessary to evaluate the model has been saved every 5 epochs so that, in case of encountering a failure and having to start the process again, the model could be loaded to continue from that point. Visual results of frames synthesized using the weights saved every 5 epochs during the training process can be seen in Other Results for each of the experiments carried out.

Given that this project's training is separated into two phases (meta-learning and fine-tuning), the evaluation will mainly focus on the fine-tuned models, obtained through few-shot learning on a video from a person, or a set of videos from different people, not seen during the first step. In particular, the model will be evaluated using a number of hold-out frames (32, the number chosen in the paper) from the resulting generated video or videos, comparing them to the original ones.

These networks, as has been explained multiple times, are composed of a Generator and a Discriminator training together to maintain an equilibrium; there is no objective loss function that drives these models to convergence, as there is in other models. There is also no objective way to assess their training progress, let alone their quality, based on the losses alone.

In this kind of task, the model generates images, which can be evaluated both quantitatively and qualitatively, combining both types of information to provide a robust assessment of the model. Even though there are several metrics for this, the objective evaluation of GAN models remains an open problem.

So, the metrics searched for and analyzed will be described in this section as follows. It is important to highlight that better visual results might have worse quantitative ones, since the metrics do not necessarily correlate directly with human perception. Also, it was considered necessary to define ranges in which the values of each of these quantitative and qualitative metrics are considered good, normal, or bad. To do so, a random good-quality 555x780 RGB image, such as the one shown in Figure 4.21 and downloaded from [84], has gone through two processes of Gaussian blur, one heavier than the other, which averages pixels giving more weight to those near the center of a specified kernel, and through the addition of salt-and-pepper noise, which adds white and black pixels to the image. Finally, a distortion consisting of shifting the columns and rows of the image according to a sine function was going to be applied, but the landmark detection did not work on it at all. These images will be evaluated using the metrics finally chosen, in order to obtain some baseline values with which to classify generated images as good or bad quality.

(a) Original image (b) Blurred image (c) High blurred image (d) Noise image

Figure 4.21: Different image distortions to proceed with image evaluation

4.5.1 Quantitative Evaluation

Quantitative GAN metrics refer to the computation of numerical scores that summarize in some way the quality of the synthesized images. Looking at the state-of-the-art evaluation metrics used for this type of task, several metrics have been found.

To measure image quality itself, the Peak Signal-to-Noise Ratio (PSNR) is used; to verify the efficiency of the GAN losses in improving the quality of the synthesized frame, the Structural Similarity (SSIM) from [85]; and to assess identity mismatch, the Cosine Similarity (CSIM). All of these are widely used. Lips are an important characteristic of the generated frames, so to measure the accuracy of their movements at the pixel level, the Landmark Distance Error (LMD) from [86] could be used. To assess the similarity between two images, the Normalized Mutual Information (NMI) and the Perceptual Similarity (PS) metric proposed in [87] have been selected for analysis. Finally, to evaluate photorealism and the preservation of the person's identity in image generation tasks, the Inception Score (IS) and the Fréchet Inception Distance (FID) from [88] are normally measured.

Each of these objective metrics will be described below, together with the decision of whether or not it fits this project's task.

4.5.1.1 PSNR

In general terms, the PSNR is a metric used to represent the ratio between the maximum possible value of a signal and the power of noise that is affecting the quality of the signal representation. It is normally expressed in decibels.

In terms of image quality, the PSNR measures the signal-to-noise relation between two images, the real one and the reconstructed one. The higher the PSNR, the better the synthesized image matches the real one; in other words, the better the implemented algorithm. To compare these images at the pixel level, the Mean Squared Error (MSE) in 4.13, which represents the average of the squared errors between both images, is used. In this case, the error is the amount by which the values of the real image differ from those of the synthesized one.

$$MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \| f(i,j) - g(i,j) \|^2 \qquad (4.13)$$

This metric, 4.14, is appealing given that it is quite easy to calculate and has a clear physical meaning. In both equations, m and n are the numbers of rows and columns of pixels of the images, f and g represent the data matrices of the original and the synthesized images, respectively, and, finally, MAX_f is the maximum signal value of the real image.

$$PSNR = 20 \log_{10}\!\left(\frac{MAX_f}{\sqrt{MSE}}\right) \qquad (4.14)$$
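
A minimal NumPy sketch of Eqs. 4.13 and 4.14, assuming 8-bit images so that MAX_f = 255, could be:

```python
import numpy as np

def psnr(f: np.ndarray, g: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR between a real image f and a synthesized image g (Eqs. 4.13 and 4.14)."""
    mse = np.mean((f.astype(np.float64) - g.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images, as in the first row of Table 4.5
    return 20.0 * np.log10(max_value / np.sqrt(mse))
```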

4.5.1.2 SSIM

This metric is used here as a perceptual one that quantifies the degradation of the synthesized image caused by passing through the whole network, measuring its low-level similarity to the ground-truth image. It considers that degradation is due to a change in the structural information of the image, which has a strong dependence between pixels that are close in space. To obtain the structural information of those images, the influence of the illumination has to be analyzed separately, since the attributes of the representation of the objects in the scene should be independent of luminance and contrast. This metric, 4.15, is inexpensive to compute and has a simple analytical form, which is why it is widely used in the state of the art.

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \qquad (4.15)$$

In the previous equation, µx, µy, σx and σy refer to the mean pixel intensities and the standard deviations of these intensities in image patches centered at x and y, respectively, and σxy is their covariance. The constants c1 and c2 are added for numerical stability when the denominator is weak.
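
For reference, a short sketch using the structural_similarity function of scikit-image is shown below; the image pair is a random placeholder and the data_range argument must match the dynamic range of the real inputs.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Placeholder grayscale image pair; the project compares ground-truth and synthesized frames.
real = np.random.rand(256, 256)
fake = np.random.rand(256, 256)

# data_range is 1.0 for float images in [0, 1]; it would be 255 for uint8 images.
score = structural_similarity(real, fake, data_range=1.0)
print(f"SSIM: {score:.3f}")  # 1.0 would mean identical images
```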

4.5.1.3 IS

This metric was proposed in 2015 in [89], becoming probably the most widely adopted score for this type of network evaluation. It mainly involves using a pre-trained neural network to classify the generated images, capturing how much an image looks like a specific class. A higher value of this score indicates a better quality synthesized image.

4.5.1.4 FID

To capture the similarity of synthesized images to the original ones, the FID metric is used. The score summarizes how similar the generated and real images are in terms of statistics computed on computer vision features of the raw images, calculated using the model defined as follows. This way of measuring perceptual realism models the distribution of features obtained from the coding layer of the Inception v3 network [90] using a multivariate Gaussian distribution with mean µ and covariance Σ.

$$FID(x, y) = \| \mu_x - \mu_y \|^2 + \mathrm{Tr}\!\left(\Sigma_x + \Sigma_y - 2\,(\Sigma_x \Sigma_y)^{1/2}\right) \qquad (4.16)$$

The term Tr refers to the trace operation of linear algebra, which is the sum of the elements on the main diagonal of the matrix. In this case, the lower the score, the higher the quality of the images, since it means that the two groups of images have similar statistics.
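
A small sketch of Eq. 4.16 is given below, assuming the Inception-v3 features have already been extracted into two arrays with one row per image (feature extraction itself is omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID of Eq. 4.16 computed on pre-extracted Inception-v3 features (one row per image)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_f)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(sigma_r + sigma_f - 2.0 * covmean))
```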

4.5.1.5 CSIM

CSIM measures the similarity between two vectors of an inner product space. In other words, applied to this case, it measures the similarity between the embedding vectors in order to look for identity mismatch. It is computed as the cosine of the angle between both vectors, determining whether they point in the same direction. Given the two embedding vectors x and y, this metric is computed as 4.17 shows.

$$CSIM(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|} \qquad (4.17)$$

In this case, a cosine value of 0 would mean that both vectors are orthogonal and therefore there is no match, so the closer the cosine value is to 1, the greater the match.
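
A one-line NumPy sketch of Eq. 4.17:

```python
import numpy as np

def csim(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity of Eq. 4.17 between two embedding vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
```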

4.5.1.6 LMD

The LMD was proposed to evaluate whether the synthesized video shows lip movements that are accurate with respect to the audio. 4.18 shows how it is calculated.

$$LMD = \frac{1}{T}\,\frac{1}{P} \sum_{t=1}^{T} \sum_{p=1}^{P} \| L^{R}_{t,p} - L^{F}_{t,p} \|_2 \qquad (4.18)$$

In the equation, the Euclidean distance between the two sets of landmarks, L^F and L^R, is computed before normalizing it by the temporal length of the video, T, and the total number of landmark key points on each frame, P. To calculate these lip landmarks, L^F and L^R, on the synthesized and real frames respectively, a HOG-based facial landmark detector from Dlib [91] is used.
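
A possible sketch of this computation with Dlib is shown below; it assumes the standard 68-point shape predictor file is available locally and that each 8-bit frame contains exactly one detectable face.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
# Path to the standard 68-point model; adjust to wherever it is stored locally.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_landmarks(frame: np.ndarray) -> np.ndarray:
    """Return the 20 lip key points (indices 48-67 of the 68-point model) as a (20, 2) array."""
    rect = detector(frame, 1)[0]  # assumes one detectable face per frame
    shape = predictor(frame, rect)
    return np.array([[shape.part(i).x, shape.part(i).y] for i in range(48, 68)], dtype=np.float64)

def lmd(real_frames: list, fake_frames: list) -> float:
    """Eq. 4.18: mean Euclidean distance between real and synthesized lip landmarks."""
    dists = [np.linalg.norm(lip_landmarks(r) - lip_landmarks(f), axis=1).mean()
             for r, f in zip(real_frames, fake_frames)]
    return float(np.mean(dists))
```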

4.5.2 PS

This metric also relies on pre-trained neural networks, which classify images in a high-level way and allow measuring the similarity between two images. [92] provides an implementation of this metric using PyTorch, trying several network architectures that obtain similar scores, such as AlexNet, which has been described in the chapter State of the art. These architectures have been trained on pairs of images in order to learn their similarities.
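
A minimal sketch, assuming the lpips PyTorch package released with [92] is installed and the frames are tensors scaled to [-1, 1]:

```python
import torch
import lpips  # PyTorch implementation accompanying [92]

# AlexNet backbone, as mentioned above; 'vgg' and 'squeeze' variants give similar scores.
loss_fn = lpips.LPIPS(net='alex')

# Placeholder image tensors of shape (N, 3, H, W), scaled to [-1, 1].
real = torch.rand(1, 3, 256, 256) * 2 - 1
fake = torch.rand(1, 3, 256, 256) * 2 - 1

distance = loss_fn(real, fake)  # lower means perceptually more similar
print(distance.item())
```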

4.5.3 NMI

The NMI is a normalization of the Mutual Information (MI) score, Equation 4.19, used to scale the result between 0, the absence of mutual information, and 1, a perfect correlation, between two different images U and V. Equation 4.20 shows how the MI is normalized by the mean of the entropies of the two compared images (H(U), H(V)), making the result independent of the areas of data covered by the windows. It is quite common to compare specific windows of data from each of the images: the smaller the window, the less accurate the estimation of the probability distribution.

$$MI(U, V) = \sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i \cap V_j|}{N} \log\!\left(\frac{N\,|U_i \cap V_j|}{|U_i|\,|V_j|}\right) \qquad (4.19)$$

$$NMI(U, V) = \frac{MI(U, V)}{\mathrm{mean}\big(H(U), H(V)\big)} \qquad (4.20)$$
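
A small NumPy sketch that estimates Eqs. 4.19 and 4.20 from a joint histogram of two grayscale images (the number of bins is an arbitrary choice):

```python
import numpy as np

def nmi(u: np.ndarray, v: np.ndarray, bins: int = 64) -> float:
    """Eqs. 4.19-4.20 estimated from the joint intensity histogram of two grayscale images."""
    joint, _, _ = np.histogram2d(u.ravel(), v.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability estimate
    px, py = pxy.sum(axis=1), pxy.sum(axis=0) # marginal probabilities

    nz = pxy > 0                              # avoid log(0)
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))

    hu = -np.sum(px[px > 0] * np.log(px[px > 0]))  # marginal entropies
    hv = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return float(mi / np.mean([hu, hv]))
```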

4.5.3.1 Final metric selection

Once all of these commonly implemented metrics have been defined, it is time to choose whether or not they can help evaluate this project's results. It is important to take into account that, due to the nature of this project's application, the generated frames will never coincide with the original ones, since the face gestures and lip movements are being translated from another video onto the base one.

• PSNR: In this generative model, this metric gives a pixel-level view of the relation between images, which is not that intuitive, since a structural measure probably makes more sense. It is also true that, in this case, comparing the ground-truth image to the generated one is never going to give the best PSNR value, since both images will differ due to the nature of this project's task. However, it is a good metric to measure image quality, so it will be used.

• SSIM: This metric will be used since it evaluates the synthesized image in a block-wise way; in other words, the index is calculated over several windows, so it can be a better way of measuring image quality than the PSNR.

• IS: An improvement of this score was proposed two years later using the same network, so it has been decided to discard it and use the improved one, which is defined in the next point.

• FID: It has been decided to use it, since it performs quite well in terms of robustness and computational efficiency, being consistent with human evaluations and more robust to noise than the previous score.

• CSIM: Given this project's embedding vectors, which are composed of really small values close to 0, the distance between them is really small, so the value of this metric will always be close to 1, which is why it has been discarded.

• LMD: When trying to evaluate lip movements, the LMD cannot precisely reflect their accuracy for two reasons: the same word can be pronounced differently and, as has been said, the generated frames do not have to match the ground-truth ones. So this metric is discarded, since it would not correctly reflect the accuracy of the lip movements.

• PS: It has been demonstrated that this metric outperforms widely used metrics like PSNR or SSIM, since those were not designed for situations where spatial ambiguity is an important factor to take into account.

• NMI: This metric will be used since it allows an empirical analysis of the performance of the method.

Once it has been decided which metrics are going to be used, the picture that underwent the different kinds of distortion has been evaluated with them. Table 4.5 shows the values obtained for each of the distorted images when compared to the original one. Given these results, a range of values for each metric can be established to proceed with the comparisons, taking into account the perceptual quality of each of the images.

Images                                 PSNR    SSIM   FID       PS      NMI
Original image - Original image        inf     1.0    0         0.00    1.0
Blurred image - Original image         25.69   0.73   123.15    0.50    0.42
High blurred image - Original image    22.94   0.66   220.36    0.60    0.35
Noise image - Original image           12.47   0.08   145.162   1.340   0.21

Table 4.5: Metrics values obtained for the different distortions of the first image to show range of values.

As can be seen in the table, the higher the PSNR, the better the quality of the generated image compared to the original one. A normal quality value for this metric would be around 20 dB; below that, the generated image can be considered bad in terms of PSNR. The SSIM values range from -1 to 1, where 1 represents identical images and, according to the original-noise comparison, values around 0.1 indicate bad similarity. Regarding PS, a higher value corresponds to more different images, the optimum being 0, as can be seen in the first comparison using the same image twice; a PS value around 1 indicates highly different images. Finally, a lower FID score indicates more realistic images, because the generated images match the statistical features of the real ones: the optimal score would be 0, values around 100-200 still correspond to fairly realistic images, and values above those indicate images that are not realistic at all.
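
Purely as an orientation aid, the rough ranges discussed above can be collected in a small helper; the exact cut-off values below are approximate choices derived from Table 4.5, not part of the original evaluation.

```python
def rate_frame(psnr: float, ssim: float, fid: float, ps: float) -> str:
    """Rough quality label based on the approximate ranges discussed above."""
    if psnr < 15 or ssim < 0.2 or fid > 250 or ps > 1.0:
        return "bad"
    if psnr >= 20 and ssim >= 0.6 and fid <= 150 and ps <= 0.5:
        return "good"
    return "normal"
```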

Taking these values of each metric as a baseline for the following evaluation will help to compare each of the experiments done for the project.

4.5.4 Qualitative Evaluation

Qualitative metrics are those that do not provide a numerical result but involve subjective human evaluation. Since this is an image generation task, the main evaluation of the system could be a visual one by humans, in other words, a user study based on comparisons between the results of the different networks, analyzing perceptual similarity and realism. This subjective comparison can provide a better idea of the quality of the performance than other types of metrics, being one of the most common and intuitive ways to evaluate GANs.

There are several qualitative measures used for this evaluation, such as Nearest Neighbors and Rapid Scene Categorization. The first one involves selecting examples of ground-truth images and locating, using distance measures, the most similar generated ones for comparison in a given context, in order to evaluate how realistic they are. The second one consists in presenting images to people for a short amount of time so that they classify them as fake or real. But perhaps the most used is Rating and Preference Judgment, where people are asked to compare or rank examples of ground-truth and generated images, which will be the case in this project: the different images will be compared with each other and evaluated visually.

To establish a subjective baseline for the visual perception of image quality in this project, it has been established that the images in Figure 4.21 have good, normal, bad, and bad quality, respectively.

However, due to time constraints, the models used for the experiments in this project could not be trained for a long time or with the whole dataset. So, this comparison might not be enough, since the results may not be fully realistic and it would not make sense to differentiate between real and fake pictures, but it can serve as an approximation of a real comparison.

Chapter 5

Simulations and results

This chapter presents the different experiments carried out and their results, both visual and numerical, taking into account the metric definitions explained previously. The explanation is organized as follows.

The explanations are divided into five main sections. The first one details how the evaluation is carried out, explaining the two main videos used for testing and evaluation, the reference system with its implementation, and some example results. After that, for each of the main experiments using the two videos, a brief description defines which features were used and how the experiment was configured; then the different results are shown as images to see what the model generates; and, finally, the results obtained for each metric are specified. As the fourth evaluation, a comparison using a video different from the original one is performed, giving a more general comparison. As additional information about this project's results, some other tests of general interest are presented. All the system experiments are compared to a reference system, which is the base model [1] defined previously, and also with each other. Finally, a summary of the best configurations of each experiment is shown.

5.1 Configuration of the evaluation

It is important to highlight that the paper defines the evaluation as being performed after the fine-tuning stage, using a few-shot learning set for a person not seen during the previous stage. In that evaluation, the model is compared to the X2Face [43] and Pix2PixHD [30] models, using the VoxCeleb1 dataset and the same number of iterations for the three of them. The first one is used since it is a strong baseline for warping-based methods, which is what this model tries to avoid, and the second one is chosen as a representative of direct synthesis methods. This comparison between different models would be interesting as a future line of work. To show the full potential of their approach, VoxCeleb2 is used. For the quantitative comparisons, the evaluation is done using 32 hold-out frames from each of the 50 videos used as the testing set, which gives a more general evaluation across different videos.

5.1.1 Evaluation dataset

In this project, the final application has already been defined as using two videos as the input of the trained system and generating a video in which the face of one has been transferred to the other. While the reference system uses 50 videos from two datasets for this purpose, it has been decided to download and generate just two videos for this step, in order to focus on the results obtained for each experiment of the final application. It is true that it would be necessary, as the paper suggests, to evaluate the system using a limited set of videos instead of just one, so this experiment will be added in the section on other tests of interest, to see whether the system works well for every kind of video, but the main experiments will focus on the two main videos.

The video from which the facial gestures and speech are taken is a video of myself (a frame is shown in Figure 5.1), "home test video.MOV", which consists of myself talking and making some head and face movements for a duration of 13 seconds, with the characteristics defined in Table 5.1. This video in particular tries to show a clean and well-defined face where landmarks can be easily detected.

This information will be translated to another video of the Spanish president Pedro Sánchez (Figure 5.2), "pedro test video.MOV", downloaded from [93]. A video of a person giving some kind of interview or speech was desired, since these kinds of videos show the person's face clearly and the scene does not present cuts or changes, which is desirable for analyzing the final result. Figure 5.2 shows the transformation that the original video underwent: it was a very long video and, in addition, the scene contained two faces, Sánchez's (the desired one) and the sign-language interpreter's, so the landmark detection would identify both faces and would not work well. Table 5.1 provides the information about each video.

Figure 5.1: Frame of the recorded video of myself with noticeable face gestures and speaking.

Video                       Duration      Resolution  Video codec  Frames per second  Audio codec  Sample rate  Audio channels
home test video.MOV         00:00:13.20   1280x720    H.264        30 fps             AAC          44.100 Hz    2
downloaded test video.MOV   01:12:15.05   1280x720    H.264        30 fps             AAC          44.100 Hz    2
pedro test video.MOV        00:01:51.74   1280x720    H.264        29.97 fps          AAC          48.000 Hz    2

Table 5.1: Test videos information

Other experiments will be carried out using other inputs and configurations, but that information will be explained in each corresponding section, since those are results for different scenarios that might be of interest, not for the main one.

Figure 5.2: Frames of the downloaded video of Pedro Sánchez before and after cutting and cropping it.

5.1.2 Reference system

Figure 5.3: Implementation inference after 5 epochs of training in small dataset [94]

As has been mentioned multiple times, the reference system used is an implementation [95] of [1] that provides the already trained model and the whole system implemented as the paper indicates. The original system trains the model on the whole VoxCeleb2 dataset in two variants: first, a feed-forward (FF) approach consisting of 150 epochs without the L_MCH; the second is a fine-tuning (FT) approach, trained for 75 epochs but making use of L_MCH so that the fine-tuning stage can be run later. The implemented system, which is the one that provides the pre-trained model, trains for just 5 epochs during the meta-learning stage and then for 40 epochs during the fine-tuning one on a small test dataset, due to computation resources. Figure 5.3 shows these results using an image and the webcam, instead of two videos. In other words, the implementation does not follow the two-variant way of training, since the FF approach makes the matrix W and the embedding vectors quite different (on their own, W and the embedding vectors represent a person's specific information), so it would not be possible to fine-tune after that [96].

The results obtained for this reference system with the pre-trained model will simply be a review of what can be obtained visually for the main configurations.

Qualitative results

(a) Results for T = 1, not fine-tuned (b) Results for T = 1, FT epochs = 40

(c) Results for T = 8, not fine-tuned (d) Results for T = 8, FT epochs = 40

(e) Results for T = 32, not fine-tuned (f) Results for T = 32, FT epochs = 40

Figure 5.4: Example of output of the reference system trained using the available pre-trained model. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated. The first column of pictures only trains the meta-learning stage and the second also trains the fine-tuning stage for 40 epochs.

Since the paper mentions that the testing is done with different values of the parameter T, Figure 5.4 shows the results on a frame with the mouth open, so that the translated face can be seen clearly, for a training process using the two available options: FT and not FT. The number of epochs chosen for the FT step follows the information from the paper (40 epochs), and the padding selected when cropping the face for the landmark detection is 50. As can be seen, visually the difference between fine-tuning the model or not is quite big, with much clearer face results in the second column of the figure. However, changing the value of the T parameter does not have any noticeable effect, at least perceptually, indicating that even with just one fine-tuning frame the model learns. Using just a few epochs of training, both meta-learning and fine-tuning, generates good quality images, but in this case the mouth of the synthesized frame should be open and it is not.

Quantitative results

In the implemented example, the model is saved locally after training 99 items from a set of 36,237 videos for each epoch, adding the current loss values for both Generator and Discriminator to a list so that their evolution can be seen in a graph, as 5.5 shows. The pre-trained model that can be downloaded corresponds to a checkpoint saved during epoch 0 and after 99 samples trained. When representing the losses, it makes no sense that the number of iterations is 1,091, since if these weights had been trained with the available code and corresponded to epoch 0 after 99 samples, the number of iterations would be 99. So, basically, the code and the pre-trained model do not match; however, results can still be obtained. It might have happened that the author of the code updated it later or simply did not upload the right scripts.

Although GAN losses are not that intuitive, the Generator and Discriminator compete against each other, so an improvement in one of them means that the loss of the other one increases. It is usual that after some iterations both losses converge to some constant range of values, as can be seen in the orange curve. In this case, the Generator loss decreases slowly until approximately 1,000 iterations and then suddenly drops from a value of 100 to a value of 40 at the same time as the Discriminator loss increases, which could mean that the learning rate configured for the optimizer decreased abruptly. Since the code used to save this information is not available, only assumptions can be made. Analyzing both losses independently, the Generator one seems to be a high value since it is the sum of several losses, which have been calibrated so that the summands do not have very different values, as defined previously. As for the Discriminator, its optimum loss should be around 0.5, discriminating equally between the real images and the synthesized ones. In this case, as can be seen, there are some oscillations between 0 and 3. It would be better if the graphs showed mean values per epoch, but since this information is not precise, they cannot be drawn.

(a) Complete results representation

(b) Generator results representation (c) Discriminator results representation

Figure 5.5: Training losses evolution graph in the reference system with pre-trained weights.

Having shown the graph of the Generator and Discriminator losses during the meta-learning, Figure 5.6 shows the evolution of the curves during the fine-tuning step for the different configurations of the T parameter defined previously. It is important to remember that during this training the Match losses are not calculated. Although the visual perception of the generated images may not vary when changing the T value, the Generator loss does decrease faster using 32 random frames, reaching a value of approximately 9, while the others stay around 12, as can be seen in the graphs, which show the three losses of each experiment together for comparison. The Discriminator losses for T = 32 and T = 8 are really similar, stabilizing near a value of 2. This value of around 2 follows from the Discriminator loss calculation 4.8: the Discriminator scores obtained for real and synthesized frames tend to 0, so each hinge term tends to 1 and their sum gives 2.

(a) Generator Results for FT epochs = 40 (b) Discriminator results FT epochs = 40

Figure 5.6: Example of losses output during the fine-tuning stage using the reference sys- tem model available.

Finally, Table 5.2 shows the different values obtained for the video results of the base system. To evaluate with these metrics, the same hold-out frames have been chosen for each experiment so that they can be compared, and the ground-truth frames have been processed to have the same resolution and padding with respect to the face landmarks, so that the comparison can be done correctly. Every configuration's PSNR presents a positive and fairly high value, meaning that the frames are quite similar in quality but not identical, as is to be expected. On the one hand, SSIM values are around 0.3-0.5, which means that there is perceived similarity but high-intensity pixels might vary. As can be seen, both values are larger for the fine-tuned trainings, which are also the ones with the best visual quality, as the FID metric confirms (obtaining better results). On the other hand, NMI values are really low for every configuration, which means that, in terms of NMI, the real and synthesized frames are not that similar, being worse for the images that are not fine-tuned, as could be guessed. Finally, the PS value should be near 0 for both images to be considered similar; however, the values obtained are around 0.5, which is not that good. Comparing these values to the baseline ones obtained in the chapter Implementation, it could be said that the fine-tuned synthesized images show normal to bad quality, whereas the not fine-tuned ones show bad quality.

Experiments       PSNR    SSIM   FID      PS     NMI
Not FT, T = 1     10.57   0.36   412.64   0.64   0.15
Not FT, T = 8     10.60   0.37   368.38   0.63   0.15
Not FT, T = 32    10.56   0.36   393.56   0.63   0.15
FT, T = 1         15.32   0.50   141.11   0.42   0.21
FT, T = 8         14.06   0.47   138.32   0.42   0.20
FT, T = 32        14.27   0.48   145.50   0.42   0.21

Table 5.2: Metrics values obtained for the different configurations for the video results using the reference system

5.2 Project experiments

This section describes each of the relevant experiments carried out and their results. In particular, the training and testing configurations and results are specified, showing the values obtained for each metric defined previously. As was said, the model was trained using two main servers, so visual results are shown for each one and for the federated learning global model obtained, Server 1 being the main one from which quantitative metrics are obtained.

This chapter is organized as follows: since there are three main models that have been implemented, results, both qualitative and quantitative, for Server 1 are shown for the two options available (transferring video gestures to a video and to a static image); after describing these three, other experiments of interest, including the federated learning ones, are shown. Table 5.3 briefly describes each of the experiments, assuming that each of them tries different configurations.

Experiment | Network | Audio loss L^A_MCH | Input
Reference system with project's dataset | Generator, Discriminator and Image Embedder | No | Two videos: myself and Sánchez
Reference system with project's dataset adding Audio Embedder | Generator, Discriminator, Audio and Image Embedder | No | Two videos: myself and Sánchez
Reference system with project's dataset adding Audio Embedder and audio loss | Generator, Discriminator, Audio and Image Embedder | Yes | Two videos: myself and Sánchez
Federated learning | Every architecture implemented | Every option | Two videos: myself and Sánchez
Video to image | Every architecture implemented | Every option | Video and image
Video to non-human face | Generator, Discriminator, Audio and Image Embedder | Yes | Video and non-human image
Video to same video | Generator, Discriminator, Audio and Image Embedder | Yes | Two equal videos of myself
Merkel video comparison | Every architecture implemented | Every option | Two videos: myself and Merkel

Table 5.3: Experiments carried out in this project

5.2.1 Reference system with this project’s dataset

Since this project aims to compare several configurations of the implemented models to test which one performs most efficiently, the first experiment consists of using the reference network that had the pre-trained model available, but with this project's dataset. In other words, the first experiment consists in training the reference architecture, composed of just the Image Embedder, Generator, and Discriminator, with this project's own dataset, preprocessed as defined previously.

5.2.1.1 Qualitative results

Once the model has been trained for 400 epochs in the meta-learning stage, and given the really small dataset compared to the paper and the implemented systems, the next step, the fine-tuning one, should be executed to obtain better results. As a quick reminder, this main server, Server 1, trained on a dataset of 2,545 videos, which is more than 10 times smaller than the reference training, which is why the results are not as realistic as those shown in the paper. First, to see how a result of the application would look without going through the fine-tuning stage, Figure 5.7 shows the lack of personalized frames, even when the value of T frames used for this stage changes. Compared to the previous one, the mouth is generated as it should look (open), showing a better result in this area of the face.

(a) Results for T = 1

(b) Results for T = 8

(c) Results for T = 32

Figure 5.7: Example of output of the reference system trained only in the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The next step is to execute the fine-tuning step. Several configurations have been tried to check the different results and see which one fits this model best (Figure 5.8). Although in the paper and the implemented system the number of epochs used for this training is 40, in Figure 5.9 200 epochs have also been used, to see whether the model needs more epochs to learn better. These results demonstrate that even if the number of epochs or the value of T increases, the video obtained does not change much, the facial expression being just slightly more detailed.

(a) Results for T = 1, FT epochs = 40

(b) Results for T = 8, FT epochs = 40

(c) Results for T = 32, FT epochs = 40

Figure 5.8: Different FT T values applied. Example of output of the reference system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

(a) Results for T = 32, FT epochs = 40 (b) Results for T = 32, FT epochs = 200

Figure 5.9: Different FT epochs applied. Example of output of the base system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The different results obtained show that the frames of each video are cropped focusing on the face of the person. This is done to obtain a better result, since detecting the landmarks and applying this information to Sánchez's video shows better results, with more defined facial features, for a padding of 50 rather than 200, as Figure 5.10 shows.

(a) Results for T = 32, FT epochs = 40, (b) Results for T = 32, FT epochs = 40, Padding = 50 Padding = 200

Figure 5.10: Different paddings applied. Example of output of the base system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

5.2.1.2 Quantitative results

Having saved the model during training, it is possible to load it and see how the losses have increased or decreased during the meta-learning process. In this case, the saved weights correspond to the last epoch of training (400). Since every single loss value for every video has been saved, the number of iterations is very high, so it was decided to average the values per epoch to obtain a smoother curve. It is true that the Generator loss should decrease with the epochs, since the frames synthesized during training become more realistic epoch by epoch. However, it can be seen in Figure 5.11 that this training's Generator loss increases suddenly at the beginning of the training and then remains in the range of 20-40. On the one hand, the sudden increase at the beginning might be explained by the random initialization of the W matrix, which could be similar to the first generated images during training, which are random too. On the other hand, the stabilization might be due to the small dataset used for this training: since it trains on a single person for each random video, it might or might not learn. The Discriminator loss shows a positive value at the beginning, since the real frames are classified as such. Then this network tends to a value of 1, learning really fast and pushing the Generator to produce good results from early epochs. It seems that at about 50 epochs the complete system begins to generalize, as can be seen in the images shown below.

Figure 5.12 shows the difference between the synthesized frames during the meta-learning training process, making it clear that the network is learning and that, after some epochs, as can be seen in the previous graphs, the Discriminator is no longer able to recognize that face as real or fake. In the annex Other Results, the different generated images for each of these experiments can be found, since it might be of interest to observe how fast and how well the networks learn to synthesize information.

(a) Complete results representation (b) Complete results mean representation

(c) Generator results mean representation (d) Discriminator results mean representation

Figure 5.11: Training losses evolution graph in reference system with own dataset.

(a) Ep 10 (b) Ep 50 (c) Ep 100 (d) Ep 175 (e) Ep 250 (f) Ep 325 (g) Ep 400

Figure 5.12: Examples of Generator outputs during meta-learning.

During the fine-tuning stage, the losses for both the Generator and the Discriminator can be seen in Figure 5.13 for the different values of the T parameter; they are not really different from each other, even coinciding in the Discriminator values. The Generator loss tends to decrease, as expected, since it is learning to generate good images. Figure 5.14 shows the comparison between the different configurations of the number of epochs and the padding used. The picture representing the Generator losses shows that for 200 epochs of fine-tuning training the curve is really similar to the default one, being lower than the curve for the different padding, decreasing faster at the beginning but slowing down after approximately 100 epochs. This makes sense, since the visual results of the configuration using a padding of 200 are a lot worse than the others. The Discriminator losses, though, show some instability from epoch 110 onwards.

(a) Generator results for different T (b) Discriminator results for different T values, FT epochs = 40, Pad = 50 values, FT epochs = 40, Pad = 50

Figure 5.13: Example of losses output during the fine-tuning stage using the base system model with this project's dataset, with different T values.

(a) Generator results for different configura- (b) Discriminator results for different configu- tions rations

Figure 5.14: Example of losses output during the fine-tuning stage using the base system model with this project's dataset, with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).

Finally, Table 5.4 shows the different values obtained for the video results of the base system using the project's processed dataset. In this case, the opposite of the previous experiment happens: the fine-tuned frames have higher image quality, as the FID indicates, but their PSNR values are lower than those of the frames that only went through the meta-learning stage. Hence, the visual quality does not match the numeric values. This might happen because the PSNR performs a pixel-level similarity evaluation, whereas PS and NMI are evaluated by blocks, which makes more sense. This is why the PS values are better for the fine-tuned images, coinciding with the perceptual subjective evaluation.

Experiments                  PSNR    SSIM   FID       PS     NMI
Not FT, T = 1                13.79   0.46   111.73    0.39   0.20
Not FT, T = 8                14.14   0.46   113.49    0.40   0.20
Not FT, T = 32               14.12   0.47   115.76    0.40   0.20
FT, T = 1                    13.58   0.42   106.67    0.35   0.19
FT, T = 8                    13.61   0.43   94.87     0.35   0.20
FT, T = 32                   13.72   0.43   115.60    0.35   0.20
FT, T = 32, Epoch = 200      13.56   0.40   120.944   0.33   0.19
FT, T = 32, Padding = 200    13.99   0.42   469.46    0.56   0.21

Table 5.4: Metrics values obtained for the different configurations for the video results using the reference system with this project’s dataset

5.2.2 Results with both image and audio features

In the previous chapters it has been explained that the model can use both image and audio information from the frames to train for the task of face-to-face translation. In the previous experiment, the only information used was the one extracted from the images. This new experiment takes into account the Audio Embedder too. In this case, the way to proceed is just the same, but better results are expected, at least for the lip movements, since audio information is being processed too. The network's Generator takes as input each of the landmarks associated with the frames of the Sánchez video, the Image Embedder uses T frames and their associated landmark images and, finally, the Audio Embedder uses the audio from those frames to extract the spectrograms, trying to learn from them to generate clearer lip movements.

5.2.2.1 Qualitative results

As the first experiment training this model, Figure 5.15 shows the results obtained after training for 400 epochs with the added Audio Embedder, without going through the fine-tuning stage. The results in this case do not vary depending on the value of T, but it can be seen that the top part of the head is not as defined as in the previous experiment, while the mouth is more realistic.

(a) Results for T = 1

(b) Results for T = 8

(c) Results for T = 32

Figure 5.15: Example of output of the Video-Audio system trained only in the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The same process followed in the previous experiment is applied here, since it provides a better way of comparing the different results obtained. So, Figure 5.16 shows different results obtained varying the value of T, which generates somewhat more detailed faces, not as artificial as the previous ones. Figure 5.17 shows results when increasing the number of epochs that the model trains in the fine-tuning stage, which do not present many changes, the mouth being even less defined.

(a) Results for T = 1, FT epochs = 40

(b) Results for T = 8, FT epochs = 40

(c) Results for T = 32, FT epochs = 40

Figure 5.16: Different FT T values applied. Example of output of the Video-Audio system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40 200

Figure 5.17: Different FT epochs applied. Example of output of the Video-Audio system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Focusing less tightly on the face by using a larger crop of the images shows worse results, as in the previous experiment. The second image of Figure 5.18 crops the face with a bigger padding, 200, demonstrating that the generated frame is less clear and realistic and the open mouth is worse defined, the image being noisier.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40, Padding = 50 40, Padding = 200

Figure 5.18: Different paddings applied. Example of output of the Video-Audio system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

5.2.2.2 Quantitative results

Figure 5.19 shows the Generator and Discriminator losses during the training of this experiment. Since there are lots of iterations, it might be quite difficult to see the behavior, so it was decided to compute the mean value per epoch as well. The Generator loss proceeds in the same way as in the experiment of the previous section, but it stabilizes at a lower level, around 14, the loss being calculated the same way. In this case, the Discriminator loss stabilizes at around 1, which would mean that it is being fooled by the Generator. In general, the strange behavior matches the previous one, with the difference that the Generator loss in this case begins to decrease, as expected from a Generator network.

(a) Complete results representation (b) Complete results mean representation

(c) Generator results mean representation (d) Discriminator results mean representation

Figure 5.19: Training losses evolution graph in Video-Audio system with own dataset.

The losses for both the Generator and the Discriminator during the fine-tuning process can be seen in Figure 5.20 for the different values of the T parameter; they are not really different from each other, highlighting the slightly better performance of the Generator for T = 32. The truth is that, compared to the previous experiment, the Generator loss decreases faster to values of 15, whereas the previous one remained around 20-25. Figure 5.21 compares the different configurations of the number of epochs and the padding used. During the 200-epoch fine-tuning training, the Discriminator shows an increase in oscillations instead of converging to a stable point around 0, due to the training being done with just one person, Pedro Sánchez.

(a) Generator results for different T (b) Discriminator results for different T values, FT epochs = 40, Pad = 50 values, FT epochs = 40, Pad = 50

Figure 5.20: Example of losses output during the fine-tuning stage using the Video-Audio system model with this project's dataset, with different T values.

(a) Generator results for different configura- (b) Discriminator results for different configu- tions rations

Figure 5.21: Example of losses output during the fine-tuning stage using the Video-Audio system model available with this project’s dataset with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).

Finally, Table 5.5 shows the different values obtained for the video results of this experiment. As can be seen, the PSNR and SSIM values match the better visual perception of the fine-tuned images with respect to the others. This can also be seen in the FID and PS values, both being lower for the fine-tuned configurations, so the synthesized images are relatively similar to the original ones. Finally, the NMI metric indicates for every comparison that the mutual information between both images is really low. The configuration using a bigger padding demonstrates again that the similarity between both images and the quality of the synthesized one are the worst, since when going through the network the landmark extraction performs worse, which affects the final generation of the image.

Experiments                  PSNR    SSIM   FID      PS     NMI
Not FT, T = 1                13.22   0.41   159.26   0.46   0.18
Not FT, T = 8                12.90   0.40   159.64   0.45   0.18
Not FT, T = 32               13.04   0.41   159.88   0.45   0.18
FT, T = 1                    14.40   0.45   129.40   0.31   0.20
FT, T = 8                    14.26   0.45   107.16   0.31   0.20
FT, T = 32                   14.05   0.43   107.52   0.31   0.20
FT, T = 32, Epoch = 200      14.12   0.44   109.77   0.28   0.20
FT, T = 32, Padding = 200    13.86   0.43   567.61   0.57   0.21

Table 5.5: Metrics values obtained for the different configurations for the results using the Video-Audio system with this project’s dataset

5.2.3 Results with audio loss

Finally, the third model to evaluate, showing its results for each experiment carried out, consists of the Generator, Discriminator, and Image and Audio Embedders. In other words, it uses the same architecture as the previous one, with the only difference that the L^A_MCH term is added to the total loss function of the Generator. As has been explained previously in the chapter Implementation, this loss penalizes the L1 distance between the calculated audio embedding and the audio matrix W. So, it was thought that learning this way might yield better results, which will be verified below.

5.2.3.1 Qualitative results

The different configurations used in the experiments with this model follow the two previous ones, so that the results can be compared visually. Figure 5.22 shows the first results of just the meta-learning stage, with several T values. It seems that adding the audio loss function helps the network learn better and faster, since these results are more defined than the previous ones, except for the open mouth, even though they do not vary when changing the T value either.

(a) Results for T = 1

(b) Results for T = 8

(c) Results for T = 32

Figure 5.22: Example of output of the Video-Audio with audio loss system trained only in the meta-learning stage with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

The next step is to show the results after the fine-tuning stage, which learns the personalized features for the video not seen during the previous training, in this case Pedro Sánchez's video. In Figure 5.23 these results can be seen for a fine-tuning training of 40 epochs. To compare those results and see how the network performs on the task, Figure 5.24 shows the result for 200 epochs of fine-tuning.

(a) Results for T = 1, FT epochs = 40

(b) Results for T = 8, FT epochs = 40

(c) Results for T = 32, FT epochs = 40

Figure 5.23: Different FT T values applied. Example of output of the Video-Audio with audio loss system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40 200

Figure 5.24: Different FT epochs applied. Example of output of the Video-Audio with audio loss system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Finally, the last experiment worth showing for the video results using this network is the decrease in efficiency when the frame is cropped less tightly around the face, as can be seen in Figure 5.25.

(a) Results for T = 32, FT epochs = (b) Results for T = 32, FT epochs = 40, Padding = 50 40, Padding = 200

Figure 5.25: Different paddings applied. Example of output of the Video-Audio with audio loss system trained in both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

5.2.3.2 Quantitative results

Figure 5.26 shows the evolution of the network losses, both Generator and Discriminator, observing a similar behavior to the previous experiment. The Generator loss increases at the beginning a little more slowly than the previous one, reaching higher values, but then it seems to keep increasing, behaving like a Generator that is not learning at all.

(a) Complete results representation (b) Complete results mean representation

(c) Generator results mean representation (d) Discriminator results mean representation

Figure 5.26: Training losses evolution graph in Video-Audio system using audio loss with own dataset.

Figures 5.27 and 5.28 show the different losses throughout the fine-tuning training for every configuration explained in the previous section for this network architecture. The curves are really similar to their analogues in the experiment where the audio loss is not included in the learning process, the Discriminator values being close to 0 since the network sees the same person during the training process.

(a) Generator results for different T (b) Discriminator results for different T values, FT epochs = 40, Pad = 50 values, FT epochs = 40, Pad = 50

Figure 5.27: Example of losses output during the fine-tuning stage using the Video-Audio system model with audio loss available with this project’s dataset with different T values.

(a) Generator results for different configura- (b) Discriminator results for different configu- tions rations

Figure 5.28: Example of losses output during the fine-tuning stage using the Video-Audio with audio loss system model with this project's dataset, with different configurations (default: T = 32, Ep = 40, Pad = 50; different padding: T = 32, Ep = 40, Pad = 200; and different epochs: T = 32, Ep = 200, Pad = 50).

As the last metric results, Table 5.6 shows the different values obtained for the video results of this experiment. As has been seen before, the PSNR and SSIM values are lower for the configurations that generate perceptually worse images, as the FID value indicates, while the best quality ones present a higher PSNR. The results seem pretty similar to the previous ones, highlighting the improvement in every metric for 200 epochs of fine-tuning training.

Experiments                  PSNR    SSIM   FID      PS     NMI
Not FT, T = 1                12.94   0.39   136.15   0.43   0.17
Not FT, T = 8                12.84   0.40   151.89   0.43   0.18
Not FT, T = 32               12.82   0.39   153.90   0.44   0.18
FT, T = 1                    13.71   0.42   119.90   0.36   0.19
FT, T = 8                    13.67   0.42   135.98   0.36   0.19
FT, T = 32                   13.57   0.42   117.38   0.36   0.19
FT, T = 32, Epoch = 200      13.68   0.43   97.82    0.30   0.20
FT, T = 32, Padding = 200    13.97   0.39   416.31   0.56   0.21

Table 5.6: Metrics values obtained for the different configurations for the results using the Video-Audio with audio loss system with this project’s dataset

5.3 Other experiments of possible interest

5.3.1 Federated learning

Given the resources available and the time allotted for this project's development, the idea of federated learning was interesting. The previous information about the configurations and results refers to Server 1's training but, as explained before, Server 2 also trained the three different models with a dataset of 2,347 videos, so that its results could be combined with the first one to obtain a global model based on a weighted mean of both. The main difference between both trainings was the time spent in the process: in this case, the training lasted several days more than in Server 1.

The global model, generated by combining the trained models from both servers, has been computed only with the information from epoch 400 and a few other epochs, in order to show a visual comparison among the images generated by the network.

This section shows the different results visually, to raise some interest in this learning method and to illustrate its efficiency for this particular task with this project's dataset. The configurations that gave the best results for Server 1 in each model are the ones shown in the next sections.

5.3.1.1 Reference system with this project’s dataset

Server 2

Using as input the same two videos used in the video experiments for Server 1, the base model with this project's dataset, which only uses the Generator, Discriminator, and Image Embedder, produces the results shown in Figure 5.29 when fine-tuning for 40 epochs. The rest of the parameters are the ones that worked best in the previous experiments: a padding of 50 and a T value of 32. In this case, the results obtained are worse than for Server 1, possibly because it was trained with fewer videos or fewer different faces.

Figure 5.29: Example of output of the Server 2 reference system trained through both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Global

The global model computed in this case, for the reference system with this project's own dataset, has been obtained as a weighted arithmetic mean that gives 90% importance to Server 1. During the meta-learning step, the images generated by the system on each server were saved to compare them with the global one. Figure 5.30 compares the output of this network with the outputs of the two servers. As can be seen, the outputs of the Generator from the global model show the worst quality, seemingly because the weights of Server 1 and Server 2 are not that similar. It might be interesting to train for more epochs and check whether the longer both servers train, the more similar their weights become. It is worth highlighting that the network learns very fast, achieving good results in the first epochs.
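As a minimal sketch of this combination step, assuming both servers' models are PyTorch modules with identical architectures and that their checkpoints can be loaded as state dictionaries (the file names below are hypothetical), the weighted averaging could look as follows:

```python
import torch

def weighted_average_state_dicts(state_1, state_2, w1=0.9):
    """Combine two compatible state dicts as w1 * server1 + (1 - w1) * server2."""
    w2 = 1.0 - w1
    merged = {}
    for key, tensor_1 in state_1.items():
        tensor_2 = state_2[key]
        if tensor_1.dtype.is_floating_point:
            merged[key] = w1 * tensor_1 + w2 * tensor_2
        else:
            # Integer buffers (e.g. BatchNorm step counters) cannot be averaged meaningfully.
            merged[key] = tensor_1.clone()
    return merged

# Illustrative usage (checkpoint names are hypothetical):
# g1 = torch.load("server1_generator.pth", map_location="cpu")
# g2 = torch.load("server2_generator.pth", map_location="cpu")
# global_generator.load_state_dict(weighted_average_state_dicts(g1, g2, w1=0.9))
```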

(a) Ep 10 (b) Ep 100 (c) Ep 250 (d) Ep 400 (e) Server 1 Generator outputs during meta-learning.

(f) Ep 10 (g) Ep 100 (h) Ep 250 (i) Ep 400 (j) Server 2 Generator outputs during meta-learning.

(k) Ep 10 (l) Ep 100 (m) Ep 250 (n) Ep 400 (o) Global model Generator outputs during meta-learning.

Figure 5.30: Reference networks' generated images during the meta-learning stage.

5.3.1.2 Results with both image and audio features

Server 2

The training of this model generates the results shown in Figure 5.31, with the evaluation configured as in the previous case, and yields good-quality results even though a small dataset was used in the training process.

Figure 5.31: Example of output of the Server 2 Video-Audio system trained through both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Global

(a) Ep 10 (b) Ep 100 (c) Ep 250 (d) Ep 400 (e) Server 1 Generator outputs during meta-learning.

(f) Ep 10 (g) Ep 100 (h) Ep 250 (i) Ep 400 (j) Server 2 Generator outputs during meta-learning.

(k) Ep 10 (l) Ep 100 (m) Ep 250 (n) Ep 400 (o) Global model Generator outputs during meta-learning.

Figure 5.32: Video-Audio networks' generated images during the meta-learning stage.

The global model computed in this case is obtained with the same calculation method as the previous one. Figure 5.32 compares the output of this architecture during training with the outputs of the two servers. It is difficult to compare this and the previous experiment through these images, but it could be said that the image generated at epoch 400 by the global model is, in this case, perceived as more realistic.

5.3.1.3 Results with audio loss

Server 2

In this case, using the audio loss to force the Generator to learn better, together with the previous configuration, seems to generate better-synthesized faces than using just the image features, as can be seen in Figure 5.33. The face, and the environment in general, are more detailed and better defined.

Figure 5.33: Example of output of the Server 2 Video-Audio with audio loss system trained through both the meta-learning and fine-tuning steps with the project's dataset. Each picture shows a frame of the video of myself speaking with its associated landmarks image and Sánchez's frame with the first video's face translated.

Global

The global model, though, generates worse images during the meta-learning stage, which seems to make sense since the Server 1 and Server 2 images are also very blurred, perhaps needing more epochs to learn than the other models. It is important to highlight that the first Generator result in Figure 5.34, obtained at epoch 10, is a black image. The weights might not have been saved correctly, or the weighted average may not have worked as expected.

(a) Ep 10 (b) Ep 100 (c) Ep 250 (d) Ep 400 (e) Server 1 Generator outputs during meta-learning.

(f) Ep 10 (g) Ep 100 (h) Ep 250 (i) Ep 400 (j) Server 2 Generator outputs during meta-learning.

(k) Ep 10 (l) Ep 100 (m) Ep 250 (n) Ep 400 (o) Global model Generator outputs during meta-learning.

Figure 5.34: Video-Audio with audio loss networks' generated images during the meta-learning stage.

5.3.2 Angela Merkel video results

The original paper mentions that its evaluation was done using 32 hold-out frames from 50 videos. In this project, the analysis has been done using just one video, Pedro Sánchez's. It would make more sense to use several examples rather than just one, since the application might perform better with one specific video than with another.

It would be optimal to evaluate the model using a small set of unseen videos from VoxCeleb2, since it has already been downloaded. This evaluation would basically consist in introducing two videos, mine and one from the list, fine-tuning on the one from the list, and generating a final video. It would be the same process as the one followed so far, but each metric would be the average of the values obtained for each case, so the results would be more general.

There has not been enough time to perform the evaluation this way, so it has been decided to use another video, "Merkel.mp4", downloaded from [97], with the following specifications: 1280x720 resolution, 01:03 minutes of duration, AAC and H.264 codecs, and 2 audio channels. A frame of it can be seen in Figure 5.35.

Figure 5.35: Angela Merkel video frame [97]

Table 5.7 shows, on the one hand, the best results for each of the experiments done previously. These configurations have then been used to compute the metric values on Merkel's video for each experiment. Some experiments had other configurations with similar metric values, but the configurations below were chosen based on the perceived quality of the generated frames. On the other hand, the last four rows show the values of each metric computed on the Merkel video. Comparing each metric, it can be seen whether the network generalizes well or it just worked well with Sánchez's video.

Videos    Main video-video experiments            Configuration                      PSNR    SSIM   FID      PS     NMI
Sánchez   Reference system                        FT: 40 epochs, pad = 50, T = 1     15.32   0.50   141.11   0.42   0.21
Sánchez   Reference system with own dataset       FT: 40 epochs, pad = 50, T = 32    14.27   0.48   145.50   0.42   0.21
Sánchez   Video-Audio features                    FT: 40 epochs, pad = 50, T = 8     14.26   0.45   107.16   0.31   0.20
Sánchez   Video-Audio features with audio loss    FT: 40 epochs, pad = 50, T = 1     14.40   0.45   129.40   0.31   0.20
Merkel    Reference system                        FT: 40 epochs, pad = 50, T = 1     19.12   0.51   277.32   0.47   0.13
Merkel    Reference system with own dataset       FT: 40 epochs, pad = 50, T = 32    18.34   0.43   169.05   0.40   0.11
Merkel    Video-Audio features                    FT: 40 epochs, pad = 50, T = 8     18.88   0.46   187.33   0.38   0.12
Merkel    Video-Audio features with audio loss    FT: 40 epochs, pad = 50, T = 1     18.54   0.45   261.25   0.39   0.12

Table 5.7: Best metric values obtained for each of the previous experiments.

Analyzing the metrics obtained and comparing Sánchez's and Merkel's qualitative (Figure 5.36) and quantitative results, the outcomes are somewhat similar. Regarding the metrics alone, in the case of Merkel's video the PSNR is higher in every experiment compared to the other video, with the highest value again corresponding to the reference system. The SSIM values lie in roughly the same range for both videos, as do the PS and NMI ones. However, it is striking that the FID values for Merkel's video are much higher than the other ones, although this is consistent with the visual quality, since Merkel's frames are less clear and defined than Sánchez's.

(a) Sánchez, Reference system. Results for T = 1, FT epochs = 40. (b) Merkel, Reference system. Results for T = 1, FT epochs = 40.

(c) Sánchez, Reference system with own dataset. Results for T = 32, FT epochs = 40. (d) Merkel, Reference system with own dataset. Results for T = 32, FT epochs = 40.

(e) Sánchez, Video-Audio features. Results for T = 8, FT epochs = 40. (f) Merkel, Video-Audio features. Results for T = 8, FT epochs = 40.

(g) Sánchez, Video-Audio features with audio loss. Results for T = 1, FT epochs = 40. (h) Merkel, Video-Audio features with audio loss. Results for T = 1, FT epochs = 40.

Figure 5.36: Different visual results for each experiment done using Sánchez's video and using Merkel's.

5.3.3 Video to Image results

Reference system

This GitHub implementation also makes it possible to transfer the face gestures from a video to a static image, in this case a 778x1027 image called "test cranston.jpeg", shown in Figure 5.37. Figure 5.38 shows the resulting translation when training just the meta-learning stage and when training both stages. As can be seen, the definition of the image is not as good as in the video-to-video transfer application, though the fine-tuned one shows a more detailed face.

Figure 5.37: Original image used for this application.

(a) Results without fine-tuning

(b) Results with 40 epochs of fine-tuning

Figure 5.38: Example of output of the reference system trained through both the meta-learning and fine-tuning steps. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.

Reference system with this project’s dataset

As has been shown, the example from the system implemented in the GitHub repository transfers the face captured by a webcam to a static image. This has also been implemented in this project's system but, instead of using a webcam, using the same video as in the previous examples. Figure 5.39 shows the result of this face translation for just the meta-learning process and for both meta-learning and fine-tuning. The padding and number of fine-tuning epochs, chosen because they previously gave better results, are 50 and 40, respectively. In this case, the pictures show a more detailed face and, more importantly, the lips move according to the video of myself.

(a) Results without fine-tuning (b) Results with 40 epochs of fine-tuning

Figure 5.39: Example of output of the reference system trained through both the meta-learning and fine-tuning steps with the project's dataset for the video to image application. Each picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.

Results with both image and audio features

The next experiment with this trained model is transferring the face gestures to the static image to animate it and create a final video. Figure 5.40 shows the results of this face translation for just the meta-learning process. The fine-tuning results are not shown because, in this case, the process uses a single image, so there is no audio information to extract and the second learning stage makes no sense. As in the previous case, the padding chosen because of its previously better results is 50. Comparing this result with the previous one (no audio features), the face is more realistic and its borders are more defined given the same epoch and padding configuration.

(a) Results without fine-tuning

Figure 5.40: Example of output of the Video-Audio system trained just for the meta-learning stage with the project's dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video's face translated.

Results with audio loss

In this case, the same happens as in the previous section: since there is no audio available in a static image, it makes no sense to go through the fine-tuning process. It would be necessary to create another network that did not make use of the Audio Embedder. Figure 5.41 shows this model's result using a padding of 50 to focus more on the face. Even though the eye and nose areas might seem clearer, in this case the lip movement is not as accurate as in the previous one.

(a) Results without fine-tuning

Figure 5.41: Example of output of the Video-Audio with audio loss system trained for the meta-learning step with the project’s dataset for the video to image application. The picture shows a frame of the video of myself speaking with its associated landmarks image and the animated image with the first video’s face translated.

5.3.4 Video to not human face

An interesting experiment with this implemented network is to see how well it animates, or transfers the talking face of a video to, a non-human head. Analyzing this behavior allows checking whether the network is good at identifying landmarks on non-human faces with different characteristics and at generating human face gestures on them.

Results on animal face

First, it has been decided to try this on a picture of a monkey that has a lot of facial hair but whose face structure is not that different from a human's. Figure 5.42 shows this good-quality 640x675 RGB picture, which has been downloaded from [98].

Figure 5.42: Example of a non-human face picture [98]

This experiment has been done using only the architecture with the Image Embedder, Generator, and Discriminator, since the target is an image and has no audio features with which to go through the fine-tuning process. The results in Figure 5.43 show a frame with the features of a face, although it is not as clear as the previous ones, possibly because of the facial hair; with more intensive training it could work much better. However, it is quite clear that the face has been detected and some gestures have been generated.

Figure 5.43: Example of output using a non-human face to transfer the gestures from the video of myself.

Results on video game face

Second, it has been decided to try these networks on a video game character, since this could be a final application [99]: generating animated video game characters in a simple and fast way without having to design them. The picture used for this is a 744x410 RGB image from The Sims 4 [100], as can be seen in Figure 5.44. It was chosen for its clear and simple face, making it a better image than the monkey one for this application.

Figure 5.44: Example of video game face picture [100]

This experiment has also been done using the same network as before, obtaining a clearer result (Figure 5.45) even though the original picture's face definitely does not look like a real one.

Figure 5.45: Example of output using a video game face to transfer the gestures from the video of myself.

5.3.4.1 Video to same video

Finally, the last experiment of interest that has been carried out is to transfer my face to the exact same video, just to analyze how well it works when the same face structure is used for both videos.

(a) Results for reference system with own dataset

(b) Results for Audio Embedder network

Figure 5.46: Example of output using the same two videos of myself for the network without Audio Embedder and for the one with it.

As can be seen in Figure 5.46, two examples have been generated: the first one using the reference architecture and the second one taking into account the audio features from the video. The first one looks more synthetic, although the generated frame is perfectly recognizable. The second one, however, shows a less clear face, making it impossible to tell whether the mouth is open or not.

Chapter 6

Conclusions and future lines

This Master's Thesis has focused on exploring different approaches based on GANs to generate synthesized frames from landmark information, among other features, and, finally, to perform the face transfer from one video to another.

Thus, after a broad study of some of the state-of-the-art techniques found for this task and closely related ones, the experimental part of this project has been presented. The main objective of this part has been to extend an already implemented GAN system based on [1], adding some new configurations to its architecture and its training procedure.

This chapter, then, includes some final conclusions, both practical and theoretical, extracted during the study, development, implementation, and analysis of results of this project. In addition, some future lines of research and experimentation are defined.

6.1 Conclusions

Firstly, the extensive state-of-the-art search and study done in the Chapter State of the art has introduced concepts about DL, and DL applied to this generative field, which has helped in the development of this project.

In the Chapter’s Development setup and Implementation the algorithm, systems, and different frameworks to carry out this project have been briefly described, focusing on the topics that are closely related to the targets of the Thesis.

Finally, the Chapter Simulations and results presents the different results analyzed, explaining the configurations applied to each of the experiments.

So, the main conclusions that have been extracted from this project are presented as follows.

• After several experiments, the dataset proposed seems to have been too small for the meta-learning process. Taking into account that the base implementation used VoxCeleb2, which provides 150k videos, while in this project only about 5k were used for training, the difference is too big. This is directly visible in the Generator loss graphs: since only a few random videos are seen per epoch, and sometimes the network learns from them and sometimes not, oscillations appear at the beginning, and then the loss stabilizes without decreasing. It is true that downloading and processing the dataset, extracting multimedia features from each item, takes a lot of time and computing power, which is why the dataset was so small.

• Staying with the dataset, the feature extraction should be analyzed to evaluate whether the extracted features are useful for the learning process or not. Extracting both facial landmarks and spectrograms from an image and an audio segment is quite simple using the libraries available in Python. However, while landmarks clearly help with facial analysis tasks such as expression, identity, and face recognition, the audio spectrogram computed might not have really helped in this task. This can be seen, for example, in synthesized frames where the mouth was supposed to be open but was not. Other kinds of features extracted from an audio segment, such as MFCCs, might help.

• The design of the GANs used in this project, with additional architecture components such as the two embedders that extract features, has been a real challenge, since their implementation and training are considered rather complex. Thanks to this, broad knowledge of the area has been gained. These networks can be considered a breakthrough within DL, generating images with information that can help with many day-to-day applications.

• An element to highlight from the architectures used in this project, already described in the Chapter State of the art, is the Autoencoder. In this project, this architecture was implemented using an encoder and the Generator itself. The landmark image was fed to the encoder part to compress the related information, so that the information provided by the other two feature encoders could be inserted and the whole could then be reconstructed by the decoder/generator, synthesizing the frame. It has been a useful way to feed different information from the input video into the network in a low-dimensional form in order to carry out the generative approach.

• The subsequent training of the implemented networks required a lot of time and computing power even though the dataset was small. Several months of data preprocessing were necessary, making the use of GPUs indispensable. Several devices were used for data processing and, finally, for the training process, and although the latter did not take that long, moving and transferring data from one computer to another was complicated and time-consuming. It has been shown that, even though the final visual results are not really good, during the training stage the images synthesized by every implemented network show remarkable quality. This probably happens because random frames of the same person are being learned during training, and the final result may be less optimal because the nature of the training differs from the final application, which is transferring the facial gestures from one video to another.

• To ensure that the network learning process improves at each epoch, several loss functions have been used. In a GAN architecture it makes sense to have one loss function for the Generator and another for the Discriminator, since two models are being trained. It could be said that the Discriminator is updated like any other DL neural network, while the Generator uses the Discriminator as its loss function, meaning that its loss is implicit and learned throughout the process. In this case, the Discriminator seeks to maximize the realism score assigned to real images and minimize the one assigned to synthesized images. The losses proposed for the Generator learning process involve three different terms. The first is a simple one that measures the distance between the ground-truth image and the synthesized one using features from the VGG19 and VGGFace networks. The second one is related to the realism score given by the Discriminator, which the Generator should maximize. Finally, the third loss is a matching term, which can be separated for both embedding feature vectors and encourages the similarity between these embedding vectors and the corresponding ones in the Discriminator, associated with the individual videos. These different loss functions provide different information and comparisons about the process and the results obtained that help with the learning process (a minimal sketch of how such a combined Generator loss could be assembled is shown after this list).

• Results achieved by the developed systems do not always have to be better than those provided by the reference system, as shown by the quantitative evaluations in the previous section. Nevertheless, they are still comparable, with similar results between the reference system and the other three main experiments. In this project's experiments, it could be said that the frames synthesized by the implemented systems with the collected dataset outperform the reference system qualitatively. In other words, the generated images look more defined and clear, to the point of showing the mouth open when it should be open, which does not happen with the implemented reference system. However, the quantitative results, the metrics, show better values for the reference one. So, there should be a discussion about whether to give more weight to visual perception or to the numbers obtained.

• Differentiating between experiments, it seems that adding the audio features has not really helped in the generation of the results, as said previously. The audio feature encoder does not seem indispensable for this task and this dataset; both the quantitative and the qualitative evaluations make this clear.

• Regarding the evaluation part of the system, it should be highlighted that, even though there are several metrics for GAN evaluation, there is actually no consensus on which one best captures the strengths and limitations of these networks. It was therefore quite difficult to choose which metrics to use, finally settling on the ones seen in the state of the art. It is also necessary to take into account that, even though numerical evaluation exists, it is not always precise and does not always match human visual perception.

• As other results of interest, some additional applications were analyzed, such as applying this face transfer to non-human faces (animals and video game characters). Nowadays, video game characters are more and more realistic, with face features that coincide with human ones, so it could be possible to generate person-like faces for different fields. The problem is that a non-human face such as an animal's has features that are not easily identifiable, like the facial hair, so the system should be adapted for better face recognition and transfer.

• Finally, some of the applications that could boost this system may or may not need training for the specific task to be achieved. Some possible ones are facial expression translation to virtual avatars, video game characters, or animated or real characters from films and series. The translation could also be more specific, such as the mouth region only, in which case the system would need to be trained specifically on mouths. This type of mouth transfer could be applied to the automatic modification of a movie character's mouth when dubbing is performed, so that the lip movement matches the spoken lines.
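As referenced in the conclusion about the loss functions, the following minimal sketch illustrates how such a composite Generator loss could be assembled in PyTorch. The perceptual feature extractors, the Discriminator interface, and the weighting factors are assumptions for illustration, not the exact implementation used in this project.

```python
import torch.nn.functional as F

def generator_loss(vgg19_feats, vggface_feats, disc_score,
                   e_img, e_audio, w_img, w_audio,
                   gt_frame, fake_frame,
                   lambda_cnt=10.0, lambda_mch=10.0):
    """Composite Generator loss: content + adversarial + embedding matching.

    vgg19_feats / vggface_feats: callables returning lists of feature maps.
    disc_score: realism score the Discriminator assigned to the fake frame.
    e_img, e_audio: embedding vectors produced by the two Embedders.
    w_img, w_audio: the Discriminator vectors associated with the same video,
                    which the embeddings are encouraged to match.
    """
    # Content term: L1 distance between perceptual features of real and fake frames.
    loss_cnt = sum(F.l1_loss(f, r) for f, r in zip(vgg19_feats(fake_frame), vgg19_feats(gt_frame)))
    loss_cnt += sum(F.l1_loss(f, r) for f, r in zip(vggface_feats(fake_frame), vggface_feats(gt_frame)))

    # Adversarial term: the Generator tries to maximize the realism score.
    loss_adv = -disc_score.mean()

    # Matching term: keep both embeddings close to the Discriminator's video vectors.
    loss_mch = F.l1_loss(e_img, w_img) + F.l1_loss(e_audio, w_audio)

    return lambda_cnt * loss_cnt + loss_adv + lambda_mch * loss_mch
```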

6.2 Future lines of work

To conclude, there are quite a few future lines to continue working on this project and trying to improve its efficiency. The main one concerns the way the system is trained: extending the dataset, either by using the whole VoxCeleb2 dataset or simply more data than the amount used. In line with this, training for more epochs and configuring different hyper-parameters, such as the optimizers and the fine-tuning characteristics, should help in comparing different results and finding the best network configuration.

Not only the configuration should be taken into account: the architecture itself of the different networks could be improved by trying deeper neural networks, which would possibly help increase the quality of the synthesized images. This improvement could also be achieved by implementing networks that generate bigger frames, the so-called BigGANs [101]. Furthermore, combining these networks with RNNs to learn in the time domain, and not just frame by frame, might be of interest.

In this project several networks have been implemented separately, so a useful and efficient idea would be to implement a modular architecture combining the three of them, where the system could be configured to use just the image features, just the audio features, or both together. Also, in this work landmarks and spectrogram features have been extracted from the multimedia information, but other characteristics, such as MFCCs, bounding boxes, or even metadata, could be tried as input to the network.
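As a minimal sketch of the alternative audio features mentioned above, and assuming the librosa library (not used in this project) is available, MFCCs and a log-mel spectrogram could be extracted from an audio segment as follows; all parameter values are illustrative.

```python
import librosa
import numpy as np

def audio_features(wav_path, sr=16000, n_mels=64, n_mfcc=13):
    """Return a log-mel spectrogram and MFCCs for one audio segment."""
    audio, sr = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)              # shape: (n_mels, frames)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    return log_mel, mfcc
```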

Lastly, as already mentioned in the evaluation definition, a widely extended evaluation practice consists in comparing results against the different state-of-the-art systems that have been implemented for related tasks. This could be a good way to determine whether this system is performing well or whether new configurations of the whole DL system should be considered.

Bibliography

[1] E. Zakharov, A. Shysheya, E. Burkov, and V. S. Lempitsky, “Few-shot adversarial learning of realistic neural talking head models,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9458–9467, 2019. [2] “Jim carrey deepfake [vfx comparison],” 2019 (accessed May 15, 2020). https: //www.youtube.com/watch?v=JbzVhzNaTdI. [3] “Barack obama is the benchmark for fake lip-sync videos,” 2018 (accessed May 15, 2020). https://medium.com/syncedreview/barack-obama-is-the- benchmark-for-fake-lip-sync-videos-d85057cb90ac. [4] Y. Movshovitz-Attias, T. Kanade, and Y. Sheikh, “How useful is photo-realistic rendering for visual learning?,” in ECCV Workshops, 2016. [5] D. Park and D. Ramanan, “Articulated pose estimation with tiny synthetic videos,” 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 58–66, 2015. [6] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” ArXiv, vol. abs/1608.02192, 2016. [7] M. FathimaShirin and M. S. Meharban, “Review on image synthesis techniques,” 2019 5th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 13–17, 2019. [8] A. Amini, “Introduction to deep learning,” 2020 (accessed March 19, 2020). http: //introtodeeplearning.com/.


[9] S. M. John McGonagle, Jos´eAlonso Garc´ıa,“Feedforward neural networks,” 2020 (accessed April 4, 2020). https://brilliant.org/wiki/feedforward-neural- networks/. [10] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, 2015. [11] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2013. [12] Y. Kim, “Convolutional neural networks for sentence classification,” Association for Computational Linguistics, pp. 1746–1751, 2014. [13] A. Krizhevsky, “The cifar-10 dataset,” 2009 (accessed March 19, 2020). https: //www.cs.toronto.edu/~kriz/cifar.html. [14] S. Bhatt, “Understanding the basics of cnn with image classification,” 2019 (accessed March 19, 2020). https://becominghuman.ai/understanding-the- basics-of-cnn-with-image-classification-7f3a9ddea8f9. [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012. [16] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog- nition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. [18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014. [19] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” CoRR, vol. abs/1511.06434, 2015. [20] S. Thalles, “A short introduction to generative adversarial networks,” 2017 (ac- cessed March 20, 2020). https://sthalles.github.io/intro-to-gans/. 131

[21] N. Hubens, “Introducci´onal autoencoder,” 2018 (accessed March 20, 2020). https://www.deeplearningitalia.com/introduzione-agli-autoencoder-2/. [22] C. Escolano, M. R. Costa-juss`a, and J. A. R. Fonollosa, “(self-attentive) autoencoder-based universal language representation for machine translation,” ArXiv, vol. abs/1810.06351, 2018. [23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computa- tion, vol. 9, pp. 1735–1780, 1997. [24] C. Olah, “Understanding lstm networks,” 2015 (accessed March 19, 2020). https: //colah.github.io/posts/2015-08-Understanding-LSTMs/. [25] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, and W. Xu, “Cnn-rnn: A unified framework for multi-label image classification,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2285–2294, 2016. [26] J. Donahue, L. A. Hendricks, M. Rohrbach, S. Venugopalan, S. Guadarrama, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2625–2634, 2014. [27] A. A. Efros and W. T. Freeman, “Image quilting for texture synthesis and trans- fer,” in SIGGRAPH ’01, 2001. [28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967–5976, 2016. [29] P. Isola, “pix2pix,” 2017 (accessed March 29, 2020). https://github.com/ phillipi/pix2pix/. [30] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 8798–8807, 2017. [31] J. Vincent, “These faces show how far ai image generation has ad- vanced in just four years,” 2018 (accessed March 20, 2020). https: //www.theverge.com/2018/12/17/18144356/ai-image-generation-fake- faces-people-nvidia-generative-adversarial-networks-gans. 132

[32] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recogni- tion,” in INTERSPEECH, 2018. [33] A. Z. J. S. Chung*, A. Nagrani*, “The voxceleb2 dataset,” 2018 (accessed March 29, 2020). http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html. [34] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in ACCV, 2016. [35] A. Z. J. S. Chung, “The oxford-bbc lip reading in the wild (lrw) dataset,” 2016 (accessed March 29, 2020). https://www.robots.ox.ac.uk/~vgg/data/ lip reading/lrw1.html. [36] A. Z. J. S. Chung, “The oxford-bbc lip reading sentences 2 (lrs2) dataset,” 2016 (accessed March 29, 2020). http://www.robots.ox.ac.uk/~vgg/data/ lip reading/lrs2.html. [37] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3453, 2016. [38] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou, “Joint 3d face recon- struction and dense alignment with position map regression network,” ArXiv, vol. abs/1803.07835, 2018. [39] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3d solution,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 146–155, 2016. [40] Y. F, “Prnet,” 2018 (accessed March 29, 2020). https://github.com/YadiraF/ PRNet. [41] T. Baltrusaitis, P. Robinson, and L.-P. Morency, “Openface: An open source facial behavior analysis toolkit,” 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–10, 2016. [42] J. Thies, M. Zollh¨ofer,M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2387–2395, 2016. [43] O. Wiles, A. S. Koepke, and A. Zisserman, “X2face: A network for controlling face generation by using images, audio, and pose codes,” in ECCV, 2018. 133

[44] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman, “Synthesizing obama: learning lip sync from audio,” ACM Trans. Graph., vol. 36, pp. 95:1– 95:13, 2017. [45] J. S. Chung, A. Jamaludin, and A. Zisserman, “You said that?,” ArXiv, vol. abs/1705.02966, 2017. [46] Y. Song, J. Zhu, X. Wang, and H. Qi, “Talking face generation by conditional recurrent adversarial network,” in IJCAI, 2019. [47] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang, “Talking face generation by adversarially disentangled audio-visual representation,” in AAAI, 2019. [48] D. Cristinacce and T. F. Cootes, “Feature detection and tracking with constrained local models,” in BMVC, 2006. [49] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d and 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks),” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1021–1030, 2017. [50] J. S. Chung and A. Zisserman, “Out of time: Automated lip sync in the wild,” in ACCV Workshops, 2016. [51] “Develop, optimize and deploy gpu-accelerated apps,” 2020 (accessed May 13, 2020). https://developer.nvidia.com/cuda-toolkit. [52] “Nvidia cudnn,” 2020 (accessed May 13, 2020). https://developer.nvidia.com/ cudnn. [53] “State of ml frameworks,” 2019 (accessed May 12, 2020). https: //thegradient.pub/state-of-ml-frameworks-2019-pytorch-dominates- research-tensorflow-dominates-industry/. [54] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in AISTATS, 2017. [55] “Pytorch,” 2020 (accessed May 12, 2020). https://pytorch.org/. [56] “Tensorflow,” 2020 (accessed May 12, 2020). https://www.tensorflow.org/. 134

[57] “The gradient,” 2020 (accessed May 12, 2020). https://thegradient.pub/ about/. [58] “Torch,” 2020 (accessed May 12, 2020). http://torch.ch/. [59] “Theano github.” https://github.com/Theano/Theano. [60] “Tensorflow github.” https://github.com/tensorflow/tensorflow. [61] “Keras github.” https://github.com/keras-team/keras. [62] “Pytorch github.” https://github.com/pytorch/pytorch. [63] “Cntk github.” https://github.com/Microsoft/CNTK. [64] “Caffe github.” https://github.com/BVLC/caffe. [65] “Matlab deep learining toolbox.” https://es.mathworks.com/products/deep- learning.html. [66] “Swift github.” https://github.com/tensorflow/swift. [67] “Dl4j github.” https://github.com/eclipse/deeplearning4j. [68] “Mxnet github.” https://github.com/apache/incubator-mxnet. [69] “Installing previous versions of pytorch,” 2020 (accessed May 12, 2020). https: //pytorch.org/get-started/previous-versions/. [70] “pytube,” (accessed May 14, 2020). https://pypi.org/project/pytube/. [71] “Ffmpeg,” (accessed May 14, 2020). https://ffmpeg.org/. [72] “Pydub,” (accessed May 14, 2020). https://pypi.org/project/pydub/. [73] “Opencv,” (accessed May 14, 2020). https://opencv.org/. [74] “scikit-learn.” https://scikit-learn.org/stable/. [75] “Voxceleb download,” (accessed May 15, 2020). https://docs.google.com/ forms/d/e/1FAIpQLSdQhpq2Be2CktaPhuadUMU7ZDJoQuRlFlzNO45xO- drWQ0AXA/viewform?fbzx=7440236747203254000. 135

[76] A. Newell, K. Yang, and J. Deng, "Stacked hourglass networks for human pose estimation," in ECCV, 2016. [77] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, "Classifying environmental sounds using image recognition networks," in KES, 2017. [78] H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena, "Self-attention generative adversarial networks," ArXiv, vol. abs/1805.08318, 2019. [79] "Residual block." https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec. [80] X. Huang and S. J. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519, 2017. [81] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2015. [82] "Imagenet - large scale visual recognition challenge (ilsvrc)." http://www.image-net.org/challenges/LSVRC/. [83] O. M. Parkhi, A. Vedaldi, and A. Zisserman, "Deep face recognition," in BMVC, 2015. [84] "Pedro Sánchez." https://es.wikipedia.org/wiki/Pedro_Sánchez. [85] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, pp. 600–612, 2004. [86] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu, "Lip movements generation at a glance," in ECCV, 2018. [87] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 586–595, 2018.

[88] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” in NIPS, 2017. [89] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” ArXiv, vol. abs/1606.03498, 2016. [90] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826, 2016. [91] D. E. King, “Dlib-ml: A machine learning toolkit,” J. Mach. Learn. Res., vol. 10, pp. 1755–1758, 2009. [92] “Perceptual similarity metric and dataset [project page].” https://github.com/ richzhang/PerceptualSimilarity. [93] “Directo coronavirus — rueda de prensa de pedro s´anchez.” https:// www.youtube.com/watch?v=-ptGIV4LjDc. [94] “Vincent th´evenin - 2019 07 01 20 37 35.” https://www.youtube.com/ watch?feature=player embedded&v=F2vms-eUrYs. [95] “Realistic-neural-talking-head-models.” https://github.com/vincent- thevenin/Realistic-Neural-Talking-Head-Models. [96] “Questions about matching loss #27.” https://github.com/vincent- thevenin/Realistic-Neural-Talking-Head-Models/issues/27. [97] “Angela merkel — 2017 new year’s speech.” https://www.youtube.com/watch?v= mJEKql2QV48. [98] “10 animales salvajes beb´es: ¡qu´e cosas m´as lindas!.” https:// queanimalada.net/animales-salvajes-bebe/. [99] T. Shi, Y. Yuan, C. Fan, Z. Zou, Z. Shi, and Y. Liu, “Face-to-parameter trans- lation for game character auto-creation,” 2019 IEEE/CVF International Confer- ence on Computer Vision (ICCV), pp. 161–170, 2019. [100] “Sims 4 face.” https://picsart.com/i/image-sims-4-no-323937968530201. 137

[101] A. Voynov, S. Morozov, and A. Babenko, "Big gans are watching you: Towards unsupervised object segmentation with off-the-shelf generative models," ArXiv, 2020. [102] "Deepempathy." https://deepempathy.mit.edu/. [103] M. Brundage, S. Avin, J. Clark, H. Toner, P. Eckersley, B. Garfinkel, A. Dafoe, P. Scharre, T. Zeitzoff, B. Filar, H. S. Anderson, H. Roff, G. C. Allen, J. Steinhardt, C. Flynn, S. Ó hÉigeartaigh, S. Beard, H. Belfield, S. Farquhar, C. Lyle, R. Crootof, O. Evans, M. Page, J. Bryson, R. Yampolskiy, and D. Amodei, "The malicious use of artificial intelligence: Forecasting, prevention, and mitigation," ArXiv, vol. abs/1802.07228, 2018. [104] "Ai4eu." https://www.ai4eu.eu/. [105] "Horizon 2020." https://www.horizon2020.es/. [106] "Docker." https://www.docker.com/.

Appendices



Appendix A

Social, economic, environmental, ethical and professional impacts

A.1 Introduction

This Master’s Thesis has focused on the research and application of an algorithm using AI that translates a face, its gestures, and lip movement, to another one’s face. The main applications that could make use of this function are not really that clear. There are several options, and obviously, potentially catastrophic cases can be found among them. It would be necessary to develop what kind of impacts, both positive and neg- ative, can arise from making use of this application in different living environments. This appendix will describe briefly the different impacts related to the implementation and application of this project. 142

A.2 Description of impacts related to the project

A.2.1 Ethical impact

One of the main ethical impacts to highlight when thinking about how to implement this system is the security and privacy of the different users' information, since videos and images of people's faces are used. In this project, for example, a set of public videos of celebrities has been used, but there are different ways to avoid dealing directly with user information, such as Federated Learning, which avoids storing user information in a central server where it could be at risk of being leaked.

When talking about generating synthesized images of people or talking faces from different videos, the word deepfake emerges because of their possible negative applications, which range from ransomfakes and campaigns against politicians to fake news and deepfake pornography. However, there are also ethical applications of the same generative AI technology, such as synthesizing images to make people empathize with those who have suffered disasters such as wars, by showing them how their own homes would look if they had suffered the same. This is the case of DeepEmpathy [102], a project by UNICEF together with MIT. This basically shows that, like every other technology, generative AI models have both positive and negative applications. Nevertheless, for some other applications it is not clear whether their ethical impact would be positive or negative, such as synthetic resurrection, for example recreating deceased actors or actresses to take part in a movie. In this case, the lines between respectful recreation, commercial exploitation, and psychological damage are very thin and should be analyzed and discussed.

These risks have to be addressed through regulation or legislative definitions for the use of information and of generative AI models. Among the possible measures, some can be highlighted: reviewing critical controls for AI products and fixing possible security vulnerabilities, deploying robust encryption to protect sessions and data, and considering the ethical ramifications if the development is used for malicious applications. [103] gives an overview of the malicious use of AI and its forecasting, prevention, and mitigation.

A.2.2 Social impact

The social impact of a generative image technology depends on its final applications. Indeed, one of the main impacts of AI that should be taken into account is the possible loss of employment caused by replacing workers with an algorithm capable of performing the same function more efficiently. However, this type of system would also generate new jobs dedicated to implementing the algorithm and supervising it, since it is not infallible.

A.2.3 Economic impact

This type of system implementation requires investing in powerful devices equipped with GPUs, since a normal device could take months to process the data and train the neural networks. As can be seen in Appendix B, the number of devices needed to handle even a small dataset (there was no time to use a bigger one) was high, with the most expensive one costing approximately 3,500 €. If the researcher or user wants to implement a more efficient training, the cost will be correspondingly higher; it basically depends on the time one is willing to spend on the deployment of the application.

However, the different final applications that could be built on top of this system might help reduce other types of costs, such as labor costs, which could be interesting for companies or projects seeking an increase in productivity.

A.2.4 Environmental impact

In terms of environmental impact, this project does not have many possible positive or negative consequences. The only material used has been different types of computing devices, some of them with GPUs, which consume a lot of energy for many days, even weeks. That is the only negative environmental impact of this project: the use of large amounts of energy for the implementation and deployment of the application.

A.3 Conclusions

To conclude, the impact that could be most negative is the ethical one, as it depends on whether the system is used for malicious applications or not. In reality, this affects all technologies equally, so specific regulation is needed. In other words, this project has a positive impact on society if it is implemented and used ethically, since the rest of the impacts do not pose a threat.

Appendix B

Economic budget

In this appendix, the economic budget adapted to this project is shown, covering aspects such as material execution costs, professional fees, and the taxes involved.

Figure B.1: Economic budget.

Appendix C

Code available

This Master's Thesis is a pilot version of one of the multiple European projects being carried out at the Grupo de Aplicación de Telecomunicaciones Visuales (GATV) of the Escuela Técnica Superior de Ingenieros de Telecomunicación (ETSIT) at the Universidad Politécnica de Madrid (UPM), Spain. In particular, the project is called AI4EU [104]. This consortium was established in 2019 to build the first European Artificial Intelligence (AI) On-Demand Platform and Ecosystem. Some of its activities involve mobilizing research groups and companies from around the world to make AI-based resources available, enabling scientific research and innovation in this field. It runs from 2019 to 2021 and has received funding from the EU through the Horizon 2020 [105] program.

Since this is an implementation belonging to a GATV project, the code itself is not available, but a simulation environment using Docker [106], a platform that uses virtualization to deliver software in containers, has been set up so that, by providing two videos, the final application can be run.

To access the Docker environment where the project can be launched, the following link should be used: https://www.ai4eu.eu/resource/ai4eu-media-pilot. Registration in the AI4EU platform is necessary; once registered and logged in, the instructions on how to use it are provided.


Appendix D

Other Results

This appendix shows the different frames synthesized by the Generator during the meta-learning stage, where a dataset of 2,545 videos on Server 1 has been used as input to the network. These images are worth showing because the generated frames reach good quality quickly, thanks to the fast learning achieved by the network. The frames generated during training are remarkably better than the final result, the one obtained after the fine-tuning training. The reason behind this could be three-fold: first, the dataset used in this project is really small; second, during the training process images of the same person are used as input to the network; and, finally, the objective of this thesis is to translate facial gestures from one person to another, which is quite different from just generating frames from landmarks. An image has been saved for each model every epoch over a total of 400, but only the images generated at epochs that are multiples of 5 are shown, since this is enough to see the evolution.
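As a rough sketch of how such intermediate samples could be saved during training, assuming a PyTorch Generator whose output lies in [-1, 1]; the module names and the call signature are assumptions, not the project's actual training loop.

```python
import os
import torch
from torchvision.utils import save_image

def save_generator_sample(generator, landmark_batch, epoch,
                          out_dir="meta_learning_samples", every=5):
    """Save one synthesized frame every `every` epochs to track training visually."""
    if epoch % every != 0:
        return
    os.makedirs(out_dir, exist_ok=True)
    generator.eval()
    with torch.no_grad():
        fake = generator(landmark_batch)  # assumed call signature of the Generator
    # Map the output from [-1, 1] to [0, 1] before writing the PNG.
    save_image((fake[0] + 1.0) / 2.0, os.path.join(out_dir, f"epoch_{epoch:03d}.png"))
    generator.train()
```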

D.1 Reference system with this project’s dataset

(a) 0 (b) 5 (c) 10 (d) 15 (e) 20 (f) 25

(g) 30 (h) 35 (i) 40 (j) 45 (k) 50 (l) 55

(m) 60 (n) 65 (o) 70 (p) 75 (q) 80 (r) 85

(s) 90 (t) 95 (u) 100 (v) 105 (w) 110 (x) 115

(y) 120 (z) 125

Figure D.1: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part I.

(a) 130 (b) 135 (c) 140 (d) 145 (e) 150 (f) 155

(g) 160 (h) 165 (i) 170 (j) 175 (k) 180 (l) 185

(m) 190 (n) 195 (o) 200 (p) 205 (q) 210 (r) 215

(s) 220 (t) 225 (u) 230 (v) 235 (w) 240 (x) 245

(y) 250 (z) 255

Figure D.2: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part II.

(a) 260 (b) 265 (c) 270 (d) 275 (e) 280 (f) 285

(g) 290 (h) 295 (i) 300 (j) 305 (k) 310 (l) 315

(m) 320 (n) 325 (o) 330 (p) 335 (q) 340 (r) 345

(s) 350 (t) 355 (u) 360 (v) 365 (w) 370 (x) 375

(y) 380 (z) 385

Figure D.3: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part III.

(a) 390 (b) 395

Figure D.4: Generator synthesized frames during meta-learning training in the reference system using this project's pre-processed dataset. Part IV.

D.2 Results with both image and audio features

(a) 0 (b) 5 (c) 10 (d) 15 (e) 20 (f) 25

(g) 30 (h) 35 (i) 40 (j) 45 (k) 50 (l) 55

(m) 60 (n) 65 (o) 70 (p) 75 (q) 80 (r) 85

(s) 90 (t) 95 (u) 100 (v) 105 (w) 110 (x) 115

(y) 120 (z) 125

Figure D.5: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part I.

(a) 130 (b) 135 (c) 140 (d) 145 (e) 150 (f) 155

(g) 160 (h) 165 (i) 170 (j) 175 (k) 180 (l) 185

(m) 190 (n) 195 (o) 200 (p) 205 (q) 210 (r) 215

(s) 220 (t) 225 (u) 230 (v) 235 (w) 240 (x) 245

(y) 250 (z) 255

Figure D.6: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part II.

(a) 260 (b) 265 (c) 270 (d) 275 (e) 280 (f) 285

(g) 290 (h) 295 (i) 300 (j) 305 (k) 310 (l) 315

(m) 320 (n) 325 (o) 330 (p) 335 (q) 340 (r) 345

(s) 350 (t) 355 (u) 360 (v) 365 (w) 370 (x) 375

(y) 380 (z) 385

Figure D.7: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part III.

(a) 390 (b) 395

Figure D.8: Generator synthesized frames during meta-learning training in the Audio-Video system using this project's pre-processed dataset. Part IV.

D.3 Results with audio loss

(a) 0 (b) 5 (c) 10 (d) 15 (e) 20 (f) 25

(g) 30 (h) 35 (i) 40 (j) 45 (k) 50 (l) 55

(m) 60 (n) 65 (o) 70 (p) 75 (q) 80 (r) 85

(s) 90 (t) 95 (u) 100 (v) 105 (w) 110 (x) 115

(y) 120 (z) 125

Figure D.9: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part I.

(a) 130 (b) 135 (c) 140 (d) 145 (e) 150 (f) 155

(g) 160 (h) 165 (i) 170 (j) 175 (k) 180 (l) 185

(m) 190 (n) 195 (o) 200 (p) 205 (q) 210 (r) 215

(s) 220 (t) 225 (u) 230 (v) 235 (w) 240 (x) 245

(y) 250 (z) 255

Figure D.10: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part II.

(a) 260 (b) 265 (c) 270 (d) 275 (e) 280 (f) 285

(g) 290 (h) 295 (i) 300 (j) 305 (k) 310 (l) 315

(m) 320 (n) 325 (o) 330 (p) 335 (q) 340 (r) 345

(s) 350 (t) 355 (u) 360 (v) 365 (w) 370 (x) 375

(y) 380 (z) 385

Figure D.11: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part III.

(a) 390 (b) 395

Figure D.12: Generator synthesized frames during meta-learning training in the Audio-Video system with audio loss using this project's pre-processed dataset. Part IV.