Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros de Telecomunicación

MÁSTER UNIVERSITARIO EN INGENIERÍA DE TELECOMUNICACIÓN
TRABAJO FIN DE MÁSTER

Analysis and implementation of deep learning algorithms for face to face translation based on audio-visual representations

Patricia Alonso de Apellániz
2020

Author: Patricia Alonso de Apellániz
Tutor: Dr. Alberto Belmonte Hernández
Department: Departamento de Señales, Sistemas y Radiocomunicaciones

COMMITTEE MEMBERS:
President:
Member:
Secretary:
Substitute:

Following the reading and defense of this Master's Thesis, the committee agrees on the grade of:
Grade:
Madrid, on the __ of __ of __

Summary

Generating synthesized images, being able to animate or transform them in some way, has lately been experiencing a breathtaking evolution thanks, in part, to the use of neural networks. In particular, transferring different facial gestures and audio to an existing image has attracted attention both in research and in society at large, due to its potential applications.

Throughout this Master's Thesis, a study will be carried out of the state of the art in the different techniques that exist for this transfer of facial gestures, including lip movement, between audiovisual media. Specifically, it will focus on different existing methods and research that generate talking faces based on several features extracted from the multimedia information used.
From this study, the implementation, development, and evaluation of several systems will proceed as follows. First, given the importance of training deep neural networks on a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a process of conditioning and adaptation, extracting image and audio information from the original videos to be used as the input to the networks. These features are ones widely used in the state of the art for tasks such as this one: image key points and audio spectrograms.

As the second contribution of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on the implementation in [1], but adding some new configurations, such as the network that handles the audio features, and loss functions that depend on this new architecture and on the network's behavior. In other words, the first implementation will consist of the network described in the paper mentioned; to this implementation, an encoder for audio features will be added; and, finally, the training will be based on this last architecture but taking into account a loss computed for the audio learning.

Finally, to compare and evaluate each network's results, both quantitative metrics and qualitative evaluations will be carried out. Since the final output of these systems is a clear and realistic video of an arbitrary face to which the gestures of another face have been transferred, perceptual visual evaluation is key to this problem.

Keywords

Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.
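The audio features mentioned above are spectrograms. As a minimal illustration of what that preprocessing step involves (the thesis's actual extraction pipeline is detailed later, in the data-preparation chapter; the window and hop sizes below, and the NumPy-only short-time FFT, are illustrative assumptions, not the thesis's exact parameters), a log-magnitude spectrogram can be sketched as:

```python
import numpy as np

def log_spectrogram(wave, win=512, hop=128):
    """Compute a log-magnitude spectrogram via a short-time FFT.

    Frames the signal with a Hann window every `hop` samples,
    takes the real FFT of each frame, and returns the magnitudes
    in decibels. Window/hop values are illustrative assumptions.
    """
    window = np.hanning(win)
    n_frames = 1 + (len(wave) - win) // hop
    frames = np.stack([wave[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    mags = np.abs(np.fft.rfft(frames, axis=1))
    return 20.0 * np.log10(mags + 1e-8)  # small offset avoids log(0)

# Illustrative input: one second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = log_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (frames, win // 2 + 1) -> (122, 257)
```

Each row of the result is one time frame and each column one frequency bin; for the pure tone above, the energy concentrates around bin 440 * 512 / 16000 ≈ 14.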
Resumen

Generating synthesized images, being able to animate or transform them in some way, has experienced in recent years a very significant evolution thanks, in part, to the use of neural networks in its implementations. In particular, the attempt to transfer different facial gestures and audio to an existing image has attracted attention both in research and even socially, due to its possible applications.

Throughout this Master's Thesis, a study will be carried out of the state of the art in the different techniques that exist for this transfer of facial gestures, including lip movement, between audiovisual media. Specifically, it will focus on the different existing methods and research that generate talking faces based on several features of the multimedia information used.

From this study, the implementation, development, and evaluation of several systems will proceed as follows. First, given the relevant importance of training deep neural networks using a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a process of conditioning and adaptation, extracting image and audio information from the original video to be used as the input to the networks. These features are those normally used in the state of the art for tasks such as the one mentioned, such as image key points and audio spectrograms.

As the second contribution of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on the implementation in [1], but adding some new configurations, such as the network that manages the audio features or the loss functions depending on this new architecture and the network's behavior.
In other words, the first implementation will consist of the network from the paper mentioned; to this implementation, an encoder for the audio features will be added; and, finally, the training will be based on this last architecture but taking into account the loss computed for the audio learning.

Finally, to compare and evaluate each network's results, both quantitative measurements and qualitative evaluations will be carried out. Since the final result of these systems is a clear and realistic video of a random face to which the gestures of another have been transferred, visual perception is key to solving this problem.

Palabras Clave

Deep learning, face transfer, image generation, synthesized images, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.

Agradecimientos

Thanks to the unconditional support of my tutor, Alberto, an all-terrain person capable of focusing and teaching anyone who asks him to. It had been a long time since I had met someone so passionate about learning and sharing knowledge, and he drew me into a world in which I want to keep developing, so thank you very much. Thanks to my 'commune' for achieving what few can: putting up with me at my worst moments while trying to make me smile, even by constantly teasing me, and making it possible for me to keep going with everything. Finally, thanks to my mother and father, who have never doubted me.

Index

1 Introduction and objectives
  1.1 Motivation
  1.2 Objectives
  1.3 Structure of this document
2 State of the art
  2.1 Image Synthesis
  2.2 Deep Learning basics
    2.2.1 Artificial Neural Networks
    2.2.2 Convolutional Neural Networks
    2.2.3 Recurrent Neural Networks
  2.3 DL and Image Synthesis
3 Development setup
  3.1 Federated Learning
  3.2 PyTorch
  3.3 Libraries
  3.4 Overview of the proposed DL process
4 Implementation
  4.1 Data collection
  4.2 Data preparation
    4.2.1 Visual features extraction
    4.2.2 Audio features extraction
  4.3 Modeling
    4.3.1 Embedders
    4.3.2 Generator
    4.3.3 Discriminator
  4.4 Training
    4.4.1 Meta-learning
    4.4.2 Fine-tuning
    4.4.3 Other hyper-parameters
  4.5 Evaluation
    4.5.1 Quantitative Evaluation
    4.5.2 PS
    4.5.3 NMI
    4.5.4 Qualitative Evaluation
5 Simulations and results
  5.1 Configuration of the evaluation
    5.1.1 Evaluation dataset
    5.1.2 Reference system
  5.2 Project experiments
    5.2.1 Reference system with this project's dataset
    5.2.2 Results with both image and audio features
    5.2.3 Results with audio loss
  5.3 Other experiments of possible interest
    5.3.1 Federated learning
    5.3.2 Angela Merkel video results
    5.3.3 Video to Image results
    5.3.4 Video to not human face
6 Conclusions and future lines
  6.1 Conclusions
  6.2 Future lines of work
References
Appendices
A Social, economic, environmental, ethical and professional impacts
  A.1 Introduction
  A.2 Description of impacts related to the project
    A.2.1 Ethical impact
    A.2.2 Social impact
    A.2.3 Economic impact
    A.2.4 Environmental impact
  A.3 Conclusions
B Economic budget
C Code available
D Other Results
  D.1 Reference system with this project's dataset
  D.2 Results with both image and audio features
  D.3 Results with audio loss

Index of figures

1.1 Example of deepfake in "The Shining" (1980). Jim Carrey replaces Jack Nicholson through DL techniques [2].
1.2 Example of Obama's talking video generation through DL techniques [3].
2.1 ANN architecture [9]
2.2 Object detection results using the Faster R-CNN system [10]
2.3 CNN architecture for classification [14]
2.4 GAN architecture [20]
2.5 GAN architecture [21]
2.6 LSTM architecture [24]
2.7 LRCN architecture [26]
