Universidad Politécnica de Madrid: Analysis and Implementation of Deep Learning Algorithms for Face-to-Face Translation Based on Audio-Visual Representations


Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros de Telecomunicación
Máster Universitario en Ingeniería de Telecomunicación

Trabajo Fin de Máster: Analysis and implementation of deep learning algorithms for face-to-face translation based on audio-visual representations

Author: Patricia Alonso de Apellániz
Tutor: Dr. Alberto Belmonte Hernández
Department: Departamento de Señales, Sistemas y Radiocomunicaciones
Madrid, 2020

Summary

Generating synthesized images, and being able to animate or transform them in some way, has lately undergone a remarkable evolution thanks, in part, to the use of neural networks. In particular, transferring facial gestures and audio to an existing image has attracted attention both in research and in society at large, due to its potential applications.

Throughout this Master's Thesis, a study of the state of the art in the techniques that exist for this transfer of facial gestures, including lip movement, between audiovisual media will be carried out. Specifically, it will focus on existing methods and research that generate talking faces from features extracted from the multimedia information used. From this study, the implementation, development, and evaluation of several systems will be done as follows.

First, knowing the importance of training deep neural networks on a large and well-processed dataset, VoxCeleb2 will be downloaded and will undergo a process of conditioning and adaptation, extracting image and audio information from the original videos to be used as the input of the networks. These features will be those widely used in the state of the art for this kind of task, namely image key points and audio spectrograms.
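As a rough illustration of this data-preparation step, the sketch below extracts per-frame facial key points and a log-mel spectrogram from one clip. It is a minimal sketch under assumed tooling (OpenCV, the face_alignment package, and librosa); the thesis's exact extraction pipeline, parameters, and file layout are not reproduced here.

```python
import cv2
import librosa
import face_alignment

# 2D facial landmark detector (68 key points per face).
# Note: the enum spelling of the landmark type varies across
# face_alignment versions (_2D in the 2020-era releases).
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, device='cpu')

def extract_features(video_path, audio_path, n_mels=80):
    """Return per-frame 68-point landmarks and a log-mel spectrogram."""
    cap = cv2.VideoCapture(video_path)
    landmarks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        pts = fa.get_landmarks(rgb)          # list of (68, 2) arrays, or None
        landmarks.append(pts[0] if pts else None)
    cap.release()

    # Log-mel spectrogram of the matching audio track.
    wav, sr = librosa.load(audio_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return landmarks, librosa.power_to_db(mel)
```

For a dataset such as VoxCeleb2, the audio track would first be demuxed from each clip (e.g. with ffmpeg); the landmark rasters and spectrogram windows then become the network inputs described next.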
As the second approach of this Thesis, three different convolutional networks, in particular Generative Adversarial Networks (GANs), will be implemented based on [1], but adding some new configurations, such as the network that manages the audio features, and loss functions that depend on this new architecture and on the network's behavior. In other words, the first implementation will consist of the network from the cited paper; to this implementation, an encoder for audio features will be added; and, finally, training will be based on this last architecture but will take into account a loss calculated for the audio learning (a minimal sketch follows the keyword list below).

Finally, to compare and evaluate each network's results, both quantitative metrics and qualitative evaluations will be carried out. Since the final output of these systems will be a clear and realistic video of a random face to which gestures from another one have been transferred, perceptual visual evaluation is key to solving this problem.

Keywords

Deep Learning, face transfer, image generation, synthesized frames, encoder, Convolutional Neural Networks (CNNs), autoencoder, Generative Adversarial Networks (GANs), Generator, Discriminator, data processing, dataset, Python, qualitative and quantitative evaluations.
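To make the three GAN configurations described in the summary concrete, the following PyTorch sketch wires a toy audio embedder into a toy generator/discriminator pair and adds an explicit audio term to the generator loss. Every module size, the feature-matching form of the audio loss, and the weight lambda_audio are illustrative assumptions; the actual architecture from [1] and the thesis's audio loss are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEmbedder(nn.Module):
    """Toy encoder: (B, 1, n_mels, T) log-mel window -> (B, dim) vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, spec):
        return self.net(spec)

class FrameEmbedder(nn.Module):
    """Toy image encoder used only by the illustrative audio loss."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, img):
        return self.net(img)

class Generator(nn.Module):
    """Toy generator: landmark raster + conditioning vector -> 64x64 frame."""
    def __init__(self, dim=256):
        super().__init__()
        self.fc = nn.Linear(dim, 16 * 16 * 16)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(16 + 3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, landmark_img, cond):
        feat = self.fc(cond).view(-1, 16, 16, 16)
        lm = F.interpolate(landmark_img, size=(16, 16))
        return self.up(torch.cat([feat, lm], dim=1))

gen, audio_emb, frame_emb = Generator(), AudioEmbedder(), FrameEmbedder()
disc = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(
    list(gen.parameters()) + list(audio_emb.parameters())
    + list(frame_emb.parameters()), lr=2e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()
lambda_audio = 10.0  # assumed weight, not a thesis value

def train_step(landmark_img, spec, real_frame):
    cond = audio_emb(spec)                   # second configuration: audio encoder
    fake = gen(landmark_img, cond)

    # Discriminator update on real vs. generated frames.
    real_logit, fake_logit = disc(real_frame), disc(fake.detach())
    d_loss = bce(real_logit, torch.ones_like(real_logit)) \
           + bce(fake_logit, torch.zeros_like(fake_logit))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: adversarial term plus an illustrative "audio loss"
    # (third configuration), here a stand-in that pulls the audio embedding
    # toward an embedding of the generated frame so appearance and audio agree.
    adv_logit = disc(fake)
    g_loss = bce(adv_logit, torch.ones_like(adv_logit)) \
           + lambda_audio * F.l1_loss(frame_emb(fake), cond)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Smoke test with random tensors: 4 landmark rasters, spectrogram windows,
# and ground-truth frames.
train_step(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 80, 16),
           torch.randn(4, 3, 64, 64))
```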
Acknowledgments

Thanks to the unconditional support of my tutor, Alberto, an all-rounder capable of focusing on and teaching anyone who asks. It had been a long time since I had met someone so passionate about learning and about passing knowledge on, and he drew me into a field in which I want to keep developing, so thank you very much. Thanks to my 'comuna' for achieving what few can: putting up with me at my worst moments while trying to get a smile out of me, even by constantly teasing me, and making it possible for me to keep going with everything. Finally, thanks to my mother and father, who have never doubted me.

Index

1 Introduction and objectives
  1.1 Motivation
  1.2 Objectives
  1.3 Structure of this document
2 State of the art
  2.1 Image Synthesis
  2.2 Deep Learning basics
    2.2.1 Artificial Neural Networks
    2.2.2 Convolutional Neural Networks
    2.2.3 Recurrent Neural Networks
  2.3 DL and Image Synthesis
3 Development setup
  3.1 Federated Learning
  3.2 PyTorch
  3.3 Libraries
  3.4 Overview of the proposed DL process
4 Implementation
  4.1 Data collection
  4.2 Data preparation
    4.2.1 Visual features extraction
    4.2.2 Audio features extraction
  4.3 Modeling
    4.3.1 Embedders
    4.3.2 Generator
    4.3.3 Discriminator
  4.4 Training
    4.4.1 Meta-learning
    4.4.2 Fine-tuning
    4.4.3 Other hyper-parameters
  4.5 Evaluation
    4.5.1 Quantitative Evaluation
    4.5.2 PS
    4.5.3 NMI
    4.5.4 Qualitative Evaluation
5 Simulations and results
  5.1 Configuration of the evaluation
    5.1.1 Evaluation dataset
    5.1.2 Reference system
  5.2 Project experiments
    5.2.1 Reference system with this project's dataset
    5.2.2 Results with both image and audio features
    5.2.3 Results with audio loss
  5.3 Other experiments of possible interest
    5.3.1 Federated learning
    5.3.2 Angela Merkel video results
    5.3.3 Video to Image results
    5.3.4 Video to non-human face
6 Conclusions and future lines
  6.1 Conclusions
  6.2 Future lines of work
References
Appendices
A Social, economic, environmental, ethical and professional impacts
  A.1 Introduction
  A.2 Description of impacts related to the project
    A.2.1 Ethical impact
    A.2.2 Social impact
    A.2.3 Economic impact
    A.2.4 Environmental impact
  A.3 Conclusions
B Economic budget
C Code available
D Other Results
  D.1 Reference system with this project's dataset
  D.2 Results with both image and audio features
  D.3 Results with audio loss

Index of figures

1.1 Example of a deepfake in "The Shining" (1980): Jim Carrey replaces Jack Nicholson through DL techniques [2]
1.2 Example of Obama's talking-video generation through DL techniques [3]
2.1 ANN architecture [9]
2.2 Object detection results using the Faster R-CNN system [10]
2.3 CNN architecture for classification [14]
2.4 GAN architecture [20]
2.5 GAN architecture [21]
2.6 LSTM architecture [24]
2.7 LRCN architecture [26]
Recommended publications
  • Image-Based 3D Reconstruction: Neural Networks vs. Multiview Geometry
    Image-based 3D Reconstruction: Neural Networks vs. Multiview Geometry. Julius Schöning and Gunther Heidemann, Institute of Cognitive Science, Osnabrück University, Osnabrück, Germany.

    Abstract: Methods using multiple view geometry (MVG), like Structure from Motion (SfM), are still the dominant approaches for image-based 3D reconstruction. These reconstruction methods have become quite robust and accurate. However, how robustly and accurately can artificial neural networks (ANNs) reconstruct a priori unknown 3D objects? Going beyond the restriction of object categories, this paper evaluates ANNs for reconstructing arbitrary 3D objects. Using a synthetic scalable cube dataset for training, testing, and validating ANNs, it is shown that ANNs are capable of learning mathematical principles of 3D reconstruction. As inspiration for the design of the different ANN architectures, the global, hierarchical, and incremental key-point matching strategies of SfM approaches were taken into account. Based on these benchmarks and a review of the dataset used, it is shown that voxel-based 3D reconstruction cannot be scaled.

    …algorithms guarantee linear processing time, even in cases where the number and resolution of the input images make MVG-based approaches infeasible. To research whether the underlying mathematical principle of MVG can be learned by ANNs, datasets like ShapeNet [9] and ModelNet40 [10] cannot be used, since they encode object categories such as planes, chairs, and tables. The needed dataset must not contain shape priors of object categories, and it must also be scalable in its complexity, providing a large body of samples with ground-truth data and so ensuring the learning of even deep ANNs. For this reason, we are using the synthetic scalable cube dataset [11]…
  • Amodal 3D Reconstruction for Robotic Manipulation Via Stability and Connectivity
    Amodal 3D Reconstruction for Robotic Manipulation via Stability and Connectivity. William Agnew, Christopher Xie, Aaron Walsman, Octavian Murad, Caelen Wang, Pedro Domingos, Siddhartha Srinivasa, University of Washington.

    Abstract: Learning-based 3D object reconstruction enables single- or few-shot estimation of 3D object models. For robotics, this holds the potential to allow model-based methods to rapidly adapt to novel objects and scenes. Existing 3D reconstruction techniques optimize for visual reconstruction fidelity, typically measured by chamfer distance or voxel IOU. We find that when applied to realistic, cluttered robotics environments, these systems produce reconstructions with low physical realism, resulting in poor task performance when used for model-based control. We propose ARM, an amodal 3D reconstruction system that introduces (1) a stability prior over object shapes, (2) a connectivity prior, and (3) a multi-channel input representation that allows for reasoning over relationships between groups of objects. By using these priors over the physical properties of objects, our system improves reconstruction quality not just by standard visual metrics, but also in the performance of model-based control on a variety of robotic manipulation tasks in challenging, cluttered environments. Code is available at github.com/wagnew3/ARM.

    Keywords: 3D Reconstruction, 3D Vision, Model-Based

    1 Introduction: Manipulating previously unseen objects is a critical functionality for robots to ubiquitously function in unstructured environments. One solution to this problem is to use methods that do not rely on explicit 3D object models, such as model-free reinforcement learning [1, 2]. However, quickly generalizing learned policies across wide ranges of tasks and objects remains an open problem.
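    One natural reading of these stability and connectivity priors, offered purely as an illustrative assumption rather than the paper's actual formulation, is as weighted penalty terms added to the reconstruction objective:

```latex
\mathcal{L}(\hat{S}) \;=\; \mathcal{L}_{\mathrm{recon}}(\hat{S}, I)
\;+\; \lambda_{\mathrm{stab}}\,\mathcal{L}_{\mathrm{stability}}(\hat{S})
\;+\; \lambda_{\mathrm{conn}}\,\mathcal{L}_{\mathrm{connectivity}}(\hat{S})
```

    where \(\hat{S}\) is the predicted shape, \(I\) the input observation, and the \(\lambda\) weights trade visual fidelity against physical plausibility.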
  • Stereoscopic Vision System for Reconstruction of 3D Objects
    Stereoscopic Vision System for Reconstruction of 3D Objects. International Journal of Applied Engineering Research, ISSN 0973-4562, Volume 13, Number 18 (2018), pp. 13762-13766. © Research India Publications, http://www.ripublication.com. Robinson Jimenez-Moreno (Professor), Javier O. Pinzón-Arenas (Research Assistant), and César G. Pachón-Suescún (Research Assistant), Department of Mechatronics Engineering, Nueva Granada Military University, Bogotá, Colombia.

    Abstract: The present work details the implementation of a stereoscopic 3D vision system using two digital cameras of similar characteristics. The system is based on calibrating the two cameras to find their intrinsic and extrinsic parameters, in order to use projective geometry to find the disparity between the images taken by each camera of the same scene. From the disparity, the depth map is obtained, which makes it possible to find the 3D points that form the reconstruction of the scene and to clearly recognize an object in it.

    …applications of mobile robotics, for autonomous anthropomorphic agents [9] or not, require reducing costs and using free software for their development, which laser-based or mono-camera systems do not allow: they provide neither a price reduction nor adequate depth analysis. The system proposed in this article performs reconstruction with a stereoscopic pair, i.e., two cameras that each capture the scene from their own perspective, allowing point correspondences in the scene to be established by means of projective geometry in order to find the calibration matrices of…
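    The depth map mentioned here follows from the standard relation between disparity and depth for a rectified stereo pair (a textbook fact, not a formula quoted from this paper): for focal length \(f\), baseline \(B\), and disparity \(d\) at a pixel, the depth is

```latex
Z = \frac{f \, B}{d}
```

    so larger disparities correspond to closer scene points, and calibration (finding \(f\), \(B\), and the rectifying transforms) is what makes the per-pixel disparity measurable in the first place.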
  • Configurable 3D Scene Synthesis and 2D Image Rendering with Per-Pixel Ground Truth Using Stochastic Grammars
    Configurable 3D Scene Synthesis and 2D Image Rendering with Per-pixel Ground Truth Using Stochastic Grammars. International Journal of Computer Vision, https://doi.org/10.1007/s11263-018-1103-5. Chenfanfu Jiang, Siyuan Qi, Yixin Zhu, Siyuan Huang, Jenny Lin, Lap-Fai Yu, Demetri Terzopoulos, and Song-Chun Zhu. Received: 30 July 2017 / Accepted: 20 June 2018. © Springer Science+Business Media, LLC, part of Springer Nature 2018.

    Abstract: We propose a systematic learning-based approach to the generation of massive quantities of synthetic 3D scenes and arbitrary numbers of photorealistic 2D images thereof, with associated ground truth information, for the purposes of training, benchmarking, and diagnosing learning-based computer vision and robotics algorithms. In particular, we devise a learning-based pipeline of algorithms capable of automatically generating and rendering a potentially infinite variety of indoor scenes by using a stochastic grammar, represented as an attributed Spatial And-Or Graph, in conjunction with state-of-the-art physics-based rendering. Our pipeline is capable of synthesizing scene layouts with high diversity, and it is configurable inasmuch as it enables the precise customization and control of important attributes of the generated scenes. It renders photorealistic RGB images of the generated scenes while automatically synthesizing detailed, per-pixel ground truth data, including visible surface depth and normal, object identity, and material information (detailed to object parts), as well as environments (e.g., illuminations and camera viewpoints). We demonstrate the value of our synthesized dataset by improving performance in certain machine-learning-based scene understanding tasks (depth and surface normal prediction, semantic segmentation, reconstruction, etc.) and by providing benchmarks for and diagnostics of trained models by modifying object attributes and scene properties in a controllable manner.
  • 3D Scene Reconstruction from Multiple Uncalibrated Views
    3D Scene Reconstruction from Multiple Uncalibrated Views. Li Tao and Xuerong Xiao.

    Abstract: In this project, we focus on the problem of 3D scene reconstruction from multiple uncalibrated views. We have studied different 3D scene reconstruction methods, including Structure from Motion (SFM) and volumetric stereo (space carving and voxel coloring). Here we report the results of applying these methods to different scenes, ranging from simple geometric structures to complicated buildings, and compare the performances of the different methods.

    1. Introduction: 3D reconstruction from multiple images is the creation of three-dimensional models from a set of images. It is the reverse process of obtaining 2D images from 3D scenes. In recent decades, there has been an important demand for 3D content for … aerial photo filming. 3D scene reconstruction applications such as Google Earth allow people to take flight over entire metropolitan areas in a virtually real 3D world, explore 3D tours of buildings, cities, and famous landmarks, and take a virtual walk around natural and cultural landmarks without having to be physically there. A computer-vision-based reconstruction method also allows the use of rich image resources from the internet. In this project, we have studied different 3D scene reconstruction methods, including the Structure from Motion (SFM) method and volumetric stereo (space carving and voxel coloring). We report the results of applying these methods to different scenes, ranging from simple geometric structures to complicated buildings, and compare the performances of the different methods.
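    The voxel coloring mentioned here rests on a photo-consistency test (a standard formulation from the volumetric-stereo literature, not a formula taken from this report): a voxel \(v\) is retained only if the colors of its projections into the cameras that see it agree, e.g.

```latex
\operatorname{Var}\bigl\{\, I_c(\pi_c(v)) \;:\; v \text{ visible in camera } c \,\bigr\} \;<\; \tau
```

    where \(I_c\) is image \(c\), \(\pi_c\) the projection into camera \(c\), and \(\tau\) a consistency threshold; space carving applies the same idea while iteratively removing inconsistent voxels.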
  • 3D Shape Reconstruction from Vision and Touch
    3D Shape Reconstruction from Vision and Touch. Edward J. Smith (Facebook AI Research; McGill University), Roberto Calandra (Facebook AI Research), Adriana Romero (Facebook AI Research; McGill University), Georgia Gkioxari (Facebook AI Research), David Meger (McGill University), Jitendra Malik (Facebook AI Research; University of California, Berkeley), and Michal Drozdzal (Facebook AI Research).

    Abstract: When a toddler is presented a new toy, their instinctual behaviour is to pick it up and inspect it with their hand and eyes in tandem, clearly searching over its surface to properly understand what they are playing with. At any instance here, touch provides high-fidelity localized information while vision provides complementary global context. However, in 3D shape reconstruction, the complementary fusion of visual and haptic modalities remains largely unexplored. In this paper, we study this problem and present an effective chart-based approach to multi-modal shape understanding which encourages a similar fusion of vision and touch information. To do so, we introduce a dataset of simulated touch and vision signals from the interaction between a robotic hand and a large array of 3D objects. Our results show that (1) leveraging both vision and touch signals consistently improves single-modality baselines; (2) our approach outperforms alternative modality-fusion methods and strongly benefits from the proposed chart-based structure; (3) the reconstruction quality increases with the number of grasps provided; and (4) the touch information not only enhances the reconstruction at the touch site but also extrapolates to its local neighborhood.

    1 Introduction: From an early age, children clearly and often loudly demonstrate that they need to both look at and touch any new object that has piqued their interest. The instinctual behavior of inspecting with both their eyes and hands in tandem demonstrates the importance of fusing vision and touch information for 3D object understanding.
  • 3D Reconstruction Is Not Just a Low-Level Task: Retrospect and Survey
    3D reconstruction is not just a low-level task: retrospect and survey. Jianxiong Xiao, Massachusetts Institute of Technology.

    Abstract: Although an image is a 2D array, we live in a 3D world. The desire to recover the 3D structure of the world from 2D images is the key that distinguished computer vision from the already existing field of image processing 50 years ago. For the past two decades, the dominant research focus for 3D reconstruction has been obtaining more accurate depth maps or 3D point clouds. However, even when a robot has a depth map, it still cannot manipulate an object, because there is no high-level representation of the 3D world. Essentially, 3D reconstruction is not just a low-level task. Obtaining a depth map that captures a distance at each pixel is analogous to inventing a digital camera that captures the color value at each pixel.

    …3D reconstruction is in obtaining more accurate depth maps [44, 45, 55] or 3D point clouds [47, 48, 58, 50]. We now have reliable techniques [47, 48] for accurately computing a partial 3D model of an environment from thousands of partially overlapping photographs (using keypoint matching and structure from motion). Given a large enough set of views of a particular object, we can create accurate dense 3D surface models (using stereo matching and surface fitting [44, 45, 55, 58, 50, 59]). In particular, using the Microsoft Kinect (also PrimeSense and Asus Xtion), a reliable depth map can be obtained straight out of the box. However, despite all of these advances, the dream of having a computer interpret an image at the same level as a two-year-old (for example, counting all of the objects in a pic…
  • Automatic Reconstruction of Textured 3D Models
    Automatic Reconstruction of Textured 3D Models. Benjamin Pitzer. Dissertation, Karlsruher Institut für Technologie (KIT), Fakultät für Maschinenbau; oral examination on 22 February 2011. Referees: Prof. Dr.-Ing. C. Stiller and Adj. Prof. Dr.-Ing. M. Brünig. Dissertation approved by the Faculty of Mechanical Engineering of the Universität Karlsruhe (TH) for the academic degree of Doktor der Ingenieurwissenschaften, submitted by Dipl.-Ing. Benjamin Pitzer of Menlo Park, CA.

    Published as Band 020 of the series Schriftenreihe Institut für Mess- und Regelungstechnik, Karlsruher Institut für Technologie (KIT), by KIT Scientific Publishing, Straße am Forum 2, D-76131 Karlsruhe (www.ksp.kit.edu). This document, excluding the cover, is licensed under the Creative Commons Attribution-Share Alike 3.0 DE License (CC BY-SA 3.0 DE): http://creativecommons.org/licenses/by-sa/3.0/de/. The cover page is licensed under the Creative Commons Attribution-No Derivatives 3.0 DE License (CC BY-ND 3.0 DE): http://creativecommons.org/licenses/by-nd/3.0/de/. Print on demand, 2014. ISSN 1613-4214. ISBN 978-3-86644-805-6. DOI: 10.5445/KSP/1000025619.
  • Image-Based Synthesis and Re-Synthesis of Viewpoints Guided by 3D Models
    Image-based Synthesis and Re-Synthesis of Viewpoints Guided by 3D Models. Konstantinos Rematas (KU Leuven, iMinds), Tobias Ritschel (Max Planck Institute for Informatics; Saarland University), Mario Fritz (Max Planck Institute for Informatics), and Tinne Tuytelaars (KU Leuven, iMinds).

    Abstract: We propose a technique to use the structural information extracted from a set of 3D models of an object class to improve novel-view synthesis for images showing unknown instances of this class. These novel views can be used to "amplify" training image collections that typically contain only a low number of views or lack certain classes of views entirely (e.g. top views). We extract the correlation of position, normal, reflectance, and appearance from computer-generated images of a few exemplars and use this information to infer new appearance for new instances. We show that our approach can improve the performance of state-of-the-art detectors using real-world training data.

    …routinely by artists to convey presence, most well known as the Ken Burns effect [15]. In contrast, the predominant paradigm in computer vision is to present all possible viewpoints in order to arrive at a model that is robust to out-of-plane rotations. The most prominent detection models lack the domain knowledge that would give them an understanding of how generalization across viewpoints can be achieved from a single-view example. A dense sampling across viewpoints and intra-category variation is tedious to achieve. Recent analysis of such detectors has pointed out that rare cases in viewpoint are indeed one of the frontiers on which there is still significant room to push the state of the art in object detection [14]. In this work, we show how to improve novel view syn…
  • EE290T: 3D Reconstruction and Recognition
    EE290T: 3D Reconstruction and Recognition. Acknowledgement: slides courtesy of Prof. Silvio Savarese.

    [Lecture-slide excerpt. The slides open with an illustrated quote from "A Mad Tea-Party" (Alice's Adventures in Wonderland, Lewis Carroll; illustration by Arthur Rackham): "There was a table set out under a tree in front of the house, and the March Hare and the Hatter were having tea at it. The table was a large one, but the three were all crowded together at one corner of it." They then frame computer vision as a sensing device plus a computational device performing (1) information extraction (features, 3D structure, motion flows) and (2) interpretation (recognizing objects, scenes, actions, and events), attaching semantics, geometry, and spatial and temporal relations to the objects and scene in an image or video. A timeline of applications from 1990 to 2010 covers fingerprint biometrics, augmentation with 3D computer graphics, 3D object prototyping (EosSystems PhotoModeler), face detection, web applications (Photometria), panoramic photography (Autostitch, Kolor), 3D modeling of landmarks, efficient SLAM/SFM, large-scale image repositories, and deep learning.]
  • Bayesian Reconstruction of 3D Human Motion from Single-Camera Video
    Bayesian Reconstruction of 3D Human Motion from Single-Camera Video. Nicholas R. Howe (Department of Computer Science, Cornell University, Ithaca, NY 14850), Michael E. Leventon (Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA 02139), and William T. Freeman (MERL, a Mitsubishi Electric Research Lab, 201 Broadway, Cambridge, MA 02139).

    Abstract: The three-dimensional motion of humans is underdetermined when the observation is limited to a single camera, due to the inherent 3D ambiguity of 2D video. We present a system that reconstructs the 3D motion of human subjects from single-camera video, relying on prior knowledge about human motion, learned from training data, to resolve those ambiguities. After initialization in 2D, the tracking and 3D reconstruction is automatic; we show results for several video sequences. The results show the power of treating 3D body tracking as an inference problem.

    1 Introduction: We seek to capture the 3D motions of humans from video sequences. The potential applications are broad, including industrial computer graphics, virtual reality, and improved human-computer interaction. Recent research attention has focused on unencumbered tracking techniques that don't require attaching markers to the subject's body [4, 5]; see [12] for a survey. Typically, these methods require simultaneous views from multiple cameras. Motion capture from a single camera is important for several reasons. First, though underdetermined, it is a problem people can solve easily, as anyone viewing a dancer in a movie can confirm. Single-camera shots are the most convenient to obtain and, of course, apply to the world's film and video archives.
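    The "inference problem" framing here has a standard Bayesian form (a generic formulation given for illustration, not the paper's exact model): the reconstructed motion is the 3D trajectory that is both consistent with the 2D observations and probable under a motion prior learned from training data,

```latex
\hat{X}_{3D} \;=\; \arg\max_{X_{3D}} \; p\!\left(X_{3D} \mid Y_{2D}\right)
\;=\; \arg\max_{X_{3D}} \; p\!\left(Y_{2D} \mid X_{3D}\right)\, p\!\left(X_{3D}\right)
```

    where \(Y_{2D}\) are the tracked 2D observations, the likelihood measures reprojection agreement, and the learned prior \(p(X_{3D})\) is what resolves the depth ambiguities of a single camera.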
  • Synergetic Reconstruction from 2D Pose and 3D Motion for Wide-Space Multi-Person Video Motion Capture in the Wild
    Synergetic Reconstruction from 2D Pose and 3D Motion for Wide-Space Multi-Person Video Motion Capture in the Wild. Takuya Ohashi (NTT DOCOMO; The University of Tokyo), Yosuke Ikegami (The University of Tokyo), and Yoshihiko Nakamura (The University of Tokyo). arXiv:2001.05613v2 [cs.CV], 14 Oct 2020.

    Figure 1: All futsal players' motions were captured using 12 video cameras surrounding the court. (left) Input images and reprojected joint positions. (right) Bone CG drawing based on the calculated joint angles.

    Abstract: Although many studies have investigated markerless motion capture, the technology has not been applied to real sports or concerts. In this paper, we propose a markerless motion capture method with spatiotemporal accuracy and smoothness from multiple cameras in wide-space and multi-person environments. The proposed method predicts each person's 3D pose and determines a sufficiently small bounding box in the multi-camera images. This prediction and spatiotemporal filtering based on a human skeletal model enable 3D reconstruction of the person and demonstrate…

    …diagnosis, behavioral understanding, and even humanoid robot operation [43, 32, 37]. Various motion capture methods have been developed to obtain such data, e.g., optical motion capture, where reflective markers are attached to characteristic parts of the body and their 3D positions are then measured [1, 5]. Inertial motion capture uses IMU sensors attached to body parts, with positions then calculated from sensor readings [6, 2]. Markerless motion capture uses a depth camera or single/multiple RGB video cameras [34, 38, 3, 4]. However, although various methods for using motion data exist, this technology is only used in limited locations.
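    The multi-camera 3D reconstruction underlying such a pipeline typically reduces to a standard triangulation problem (a textbook formulation assumed here for illustration, not quoted from the paper): given calibrated camera projections \(\pi_c\) and detected 2D joint positions \(y_{j,c}\), each 3D joint \(X_j\) is estimated as

```latex
\hat{X}_j \;=\; \arg\min_{X} \sum_{c \,:\, \text{joint } j \text{ visible in camera } c}
\bigl\lVert\, y_{j,c} - \pi_c(X) \,\bigr\rVert^2
```

    with the skeletal model and temporal filtering then enforcing bone-length and smoothness constraints across frames.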