Multimodal Deep Learning Methods for Person Annotation in Video Sequences

Universitat Politècnica de Catalunya Escola Tècnica Superior d’Enginyeria de Telecomunicació de Barcelona Signal Theory and Communications Department A DEGREE THESIS by David Rodríguez Navarro Multimodal Deep Learning methods for person annotation in video sequences Academic Supervisor: Prof. Josep Ramon Morros Rubió In partial fulfilment of the requirements for the degree in Audiovisual Systems Engineering ___________________________________________________________________ Barcelona June 2017 Multimodal Deep Learning methods for person annotation in video sequences 0. ABSTRACT In unsupervised identity recognition in video sequences systems, which is a very active field of research in computer vision, the use of convolutional neural networks (CNN’s) is currently gaining a lot of interest due to the great results that this techniques have been shown in face recognition and verification problems in recent years. In this thesis, the improvement of a CNN applied for face verification will be made in the context of an unsupervised identity annotation system developed for the MediaEval 2016 task. This improvement will be achieved by training the 2016 CNN architecture with images from the task database, which is now possible since we can use the last version outputs, along with a data augmentation method applied to the previously extracted samples. In addition, a new multimodal verification system is implemented merging both visual and audio feature vectors. An evaluation of the margin of improvement that these techniques introduce in the whole system will be made, comparing against the State‐of‐the‐Art. Finally some conclusions will be exposed based on the obtained results will be drawn along with some possible future lines of work. Keywords: Deep learning, convolutional neural networks, video annotation, triplet neural network, face identification, face verification. 1 Multimodal Deep Learning methods for person annotation in video sequences 1. RESUM En els sistemes de reconeixement d’identitat no supervisats, el qual és un camp d’investigació molt actiu en la visió per computador, l’ús de xarxes neuronals convolucionals (CNN’s) està rebent molt interès actualment degut als grans resultats que aquestes tècniques están conseguint en tasques de reconeixement i verificació facial els últims anys. En aquesta tesi es realitzarà una millora d’una CNN aplicada a verificació facial en el context d’un sistema d’anotació d’identitat no supervisat, el qual ve ser realitzar per la tasca MediaEval 2016. Aquesta millora serà duta a terme re‐entrenant l’arquitectura neuronal del 2016 amb imatges de la base de dades de la tasca, ara possible degut a que podem utilitzar els resultats del sistema del 2016, a més d’un mètode de data augmentation el qual és aplicat sobre aquestes imatges obtingudes anteriorment. A mes, s’implementarà un nou sistema multimodal de verificació fusionant els vectors de característiques obtinguts per els sistemes de video i audio. També s’evaluaran els marges de millora que introdueixen aquestes tècniques, en comparació amb l’estat de l’art. Per últim, s’exposen algunes conclusions basades en els resultats obtinguts junt amb posibles noves líneas de treball. Paraules clau: Deep learning, xarxes neuronals convolucionals, anotació de video, xarxes neuronals triplet, identificació facial, verificació facial. 2 Multimodal Deep Learning methods for person annotation in video sequences 2. RESUMEN En los sistemas de reconocimiento de identidad no supervisados, el cual es un campo de investigación muy activo en la visión por computador, el uso de redes neuronales convolucionales (CNN’s) está recibiendo mucho interés actualmente, debido a los grandes resultados que estas técnicas están consiguiendo en tareas de reconocimiento i verificación facial en los últimos años. En esta tesis se realizará una mejora de una CNN aplicada a verificación facial en el contexto de un sistema de anotación de identidad no supervisado, el cual fue realizado para la tarea MediaEval 2016. Esta mejora será llevada a cabo re‐entrenando la arquitectura neuronal de 2016 con imágenes de la base de datos de la tarea, lo cual ahora es posible ya que podemos usar los resultados del sistema del 2016, además de un método de data augmentation el cual se aplicará sobre estas imágenes obtenidas anteriormente. Además, se implementará un nuevo sistema multimodal de verificación fusionando los vectores de características obtenidos por los sistemas de video y audio. También se evaluaran los margenes de mejora que introducen estas técnicas, en comparación con el estado del arte. Por último, se exponen algunas conclusiones basadas en los resultados obtenidos junto con posibles líneas de trabajo futuras. Palabras clave: Deep learning, redes neuronals convolucionales, anotación de video, redes neuronales triplet, identificación facial, verificación facial. 3 Multimodal Deep Learning methods for person annotation in video sequences 3. ACKNOWLEDGEMENTS I would like to show my gratitude to Prof. Josep Ramon Morros, my project supervisor, for his help, support, suggestions and the confidence he placed in me to fulfil this project. His knowledge in video processing and deep learning helped me understanding the details of the covered techniques. Furthermore, I want to give my thanks for the help given by the technical team in the D5 building, Albert Gil and Josep Pujal, for their backup in technical inquires that I had during the development of the project. Last, I want to thank my family along with my friends for their encouragement all through this project. 4 Multimodal Deep Learning methods for person annotation in video sequences REVISION HISTORY AND APPROVAL RECORD Revision Date Purpose 0 27/03/2017 Document creation 1 14/04/2017 Document revision 2 15/05/2017 Document revision 3 23/06/2017 Document revision 4 28/06/2017 Document revision 5 29/06/2017 Document revision 6 30/06/2017 Document revision DOCUMENT DISTRIBUTION LIST Name Email David Rodríguez Navarro [email protected] Josep Ramon Morros Rubió [email protected] WRITTEN BY: REVIEWED AND APPROVED BY: Date Date Name David Rodríguez Navarro Name Josep Ramon Morros Rubió Position Project author Position Project supervisor 5 Multimodal Deep Learning methods for person annotation in video sequences 4. CONTENTS 1. Introduction 1.1. Project Overview . 9 1.2. Requirements and specifications . 9 1.3. Work Plan . 10 2. State of the art 2.1. Definition of unsupervised video annotation system . 11 2.2. Face recognition and verification . 12 2.3. Convolutional Neural Networks . 12 2.4. CNN architectures for face verification . 16 2.5. Multimodal feature vectors . 19 3. Methodology and project development 3.1. Systems overview . 21 3.2. Database generation . 22 3.3. Network training . 23 3.4. Verification procedure . 26 3.5. Multimodal feature‐level fusion . 27 3.6. Evaluation metric: Mean Average Precision . 29 4. Experiments and Results 4.1. CNN training . 31 4.2. Multimodal fusion . 34 5. Conclusions and future development 4.1. Future lines of research . 38 Bibliography 6 Multimodal Deep Learning methods for person annotation in video sequences 0. LIST OF FIGURES 1. Different possible casuistry of appearance in videos . 11 2. Examples of verificationd and recognition procedures . 12 3. Scheme of a basic ConvNet . 13 4. Filters from the last convolutional layer from our 2016 VGG‐16 . 13 5. How the filter looks at a region from its input and computes its output . 14 6. Basic distribution of a CNN hidden layer with its dimensions . 14 7. Example of average and max. pooling . 15 8. Scheme of a Siamese neural network architecture . 17 9. Graphical example of how metric learning affects to its input samples . 18 10. Architecture of a generic triplet‐loss CNN . 19 11. Block diagram of the structure from the UPC MediaEval submitted system . 21 12. Block diagram of the structure from our implemented multimodal system . 22 13. Results from applying our data augmentation algorithm . 23 14. VGG‐16 main architecture . 24 15. Diagram of our triplet‐loss neural network architecture . 25 16. Diagram of the MCB Pooling algorithm . 28 17. Tensor Sketch algorithm formalization . 28 18. VGG‐16 finetuning error curve from the first dataset (1e‐6 and 1e‐4 decay) . 32 19. VGG‐16 finetuning error curve from second dataset (1e‐5 decay) . 33 20. VGG‐16 finetuning error curve from third dataset (1e‐5 decay) . 33 21. Examples of two name + face recognition errors from our dataset . 34 22. Two face tracks with a same related name and track assigned to an erroneous name . 37 7 Multimodal Deep Learning methods for person annotation in video sequences 0. LIST OF TABLES 1. Number of parameters to finetune our VGG‐16 architecture . 24 2. Summary of all layers from the developed autoencoder . 29 3. Generated datasets with its source samples, sizes and number of identities . 31 4. Results from applying MAP on the hypothesis generated by our first verification configuration . 35 5. Comparison against concatenation and MCB Pooling + PCA reduction with second‐last or last fully connected layer extracted features . 36 8 Multimodal Deep Learning methods for person annotation in video sequences 1 Introduction This chapter provides a general overview of the project, its main goals and a work plan showing the general project’s organization and deadlines. 1.1 Project Overview Unsupervised identity recognition in video sequences is a difficult issue to deal with due to the lack of labelled data for training the model, since the identities that appear in the video are totally unknown. The use of deep neural networks for this sort of problems is widely extended nowadays because of its robustness when facing non‐seen identities and its considerably fast test execution time. This project is related to previous lines of work followed in the field of unsupervised person annotation in video sequences. Concretely, it continues the Master Thesis Dissertation in Computer Vision made by Gerard Martí Juan [1], where he analysed several convolutional neural network architectures and verification procedures in order to obtain the best face verification performance.

Multimodal Deep Learning Methods for Person Annotation in Video Sequences

Short and Deep: Sketching and Neural Networks

Low-Rank Tucker Approximation of a Tensor from Streaming Data : ; \ \ } Yiming Sun , Yang Guo , Charlene Luo , Joel Tropp , and Madeleine Udell

Fast and Scalable Polynomial Kernels Via Explicit Feature Maps *

Learning-Based Frequency Estimation Algorithms

Polynomial Tensor Sketch for Element-Wise Function of Low-Rank Matrix

Low-Rank Pairwise Alignment Bilinear Network for Few-Shot Fine

Incremental Randomized Sketching for Online Kernel Learning

Compressing Gradient Optimizers Via Count-Sketches

Fast and Guaranteed Tensor Decomposition Via Sketching

Communication-Efficient Federated Learning with Sketching

High-Order Attention Models for Visual Question Answering

Visual Attribute Transfer Through Deep Image Analogy ∗