Monocular Depth Cues in Computer Vision Applications
A dissertation submitted by Diego Cheda at Universitat Autònoma de Barcelona to fulfil the requirements for the degree of Doctor of Philosophy. Bellaterra, September 26, 2012

Director: Dr. Daniel Ponsa Mussarra, Centre de Visió per Computador, Universitat Autònoma de Barcelona
Co-director: Dr. Antonio M. López Peña, Centre de Visió per Computador, Universitat Autònoma de Barcelona

Thesis committee:
Dr. Juan Andrade Cetto, Institut de Robòtica i Informàtica Industrial, Consejo Superior de Investigaciones Científicas
Dr. Ángel Domingo Sappa, Centre de Visió per Computador
Dr. David Masip Rodó, Dept. d'Estudis d'Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya
Dra. Aura Hernandez Sabaté, Dept. de Ciències de la Computació, Universitat Autònoma de Barcelona
Dra. Carme Julià Ferré, Dept. d'Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili

This document was typeset by the author using LaTeX 2ε. The research described in this book was carried out at the Computer Vision Center, Universitat Autònoma de Barcelona.

Copyright © MMXII by Diego Cheda. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the author.

ISBN 978-84-940530-0-9
Printed by Ediciones Gráficas Rey, S.L.

To Martina and Pau

Acknowledgements

Foremost, I would like to thank my advisors, Dr. Daniel Ponsa and Dr. Antonio M. López, for their continuous support, advice and patience during these years. They introduced me to the Computer Vision field and shared with me much of their expertise and research insight. I would like to express my gratitude to all the members of the ADAS group.
I would like to acknowledge the funding, academic and technical support of the Computer Vision Center, the Universitat Autònoma de Barcelona, and the Ministry of Education and Science (projects TRA2011-29454-C03-01 and TIN2011-29494-C03-02, and Consolider Ingenio 2010: MIPRCV (CSD200700018)). I would also like to thank my committee for their time and comments: Dr. Juan Andrade Cetto, Dr. Ángel Domingo Sappa, Dr. David Masip Rodó, Dra. Aura Hernandez Sabaté, and Dra. Carme Julià Ferré. I also want to thank the anonymous conference and journal reviewers for their invaluable comments over these years.

I also owe a great deal to the many friends and colleagues that I have met at the Computer Vision Center. Thank you all for sharing your time in one way or another during some endless days at CVC. I am indebted to Fernando Barrera and David Rojas for the innumerable talks about research, life, rare things and, almost as important, the morning coffee ritual. I thank my closest friends Juan José Tamagnini, Salvador Cavadini, and Daniel Romero for always being there.

Most importantly, I thank my wonderful family. First, to all my big family back in Argentina, I thank you for your continued support and encouragement from abroad. Second, to my small family in Catalonia: to Fernanda for those nice days in Tarragona, and to Mariana and Aarón for those incredible shared moments in Barcelona. Last, to my dear wife Martina for embarking with me on this adventure, for your continued love and support, and for staying at my side, dealing with the good and bad times over these years. During this time, you had the generous experience of raising a family, teaching me how deep your love is. To my gorgeous son, Pau, I thank you for your smiles, your 3am wake-ups asking for milk, for making me play again and again with the animals, and for showing me the impressive way of life.

Abstract

Depth perception is a key aspect of human vision.
It is a routine and essential visual task that humans perform effortlessly in many daily activities. This has often been associated with stereo vision, but humans have an amazing ability to perceive depth relations even from a single image by using several monocular cues.

In the computer vision field, if image depth information were available, many tasks could be posed from a different perspective for the sake of higher performance and robustness. Nevertheless, given a single image, this possibility is usually discarded, since depth information has usually been obtained through three-dimensional reconstruction techniques, which require two or more images of the same scene taken from different viewpoints. Recently, some proposals have shown the feasibility of computing depth information from single images. In essence, the idea is to take advantage of a priori knowledge of the acquisition conditions and the observed scene to estimate depth from monocular pictorial cues. These approaches try to precisely estimate the scene depth maps by employing computationally demanding techniques. However, to assist many computer vision algorithms, it is not really necessary to compute a costly and detailed depth map of the image. Indeed, just a rough depth description can be very valuable in many problems.

In this thesis, we have demonstrated how coarse depth information can be integrated in different tasks following alternative strategies to obtain more precise and robust results. In that sense, we have proposed a simple but sufficiently reliable technique, whereby image regions are categorized into discrete depth ranges to build a coarse depth map. Based on this representation, we have explored the potential usefulness of our method in three application domains from novel viewpoints: camera rotation parameter estimation, background estimation and pedestrian candidate generation.
In the first case, we have computed the rotation of a camera mounted on a moving vehicle by applying two novel methods based on distant elements in the image, where the translation component of the image flow vectors is negligible. In background estimation, we have proposed a novel method to reconstruct the background by penalizing close regions in a cost function that integrates color, motion, and depth terms. Finally, we have exploited the geometric and depth information available in single images for pedestrian candidate generation, significantly reducing the number of generated windows to be further processed by a pedestrian classifier. In all cases, results have shown that our approaches contribute to better performance.

Summary

Depth perception is a key aspect of human vision. Humans carry out this task effortlessly in order to perform many everyday activities. Depth perception has often been associated with binocular vision. Nevertheless, humans have an amazing ability to perceive depth relations even from a single image by using several monocular cues.

In the computer vision field, if the depth information of an image were available, many tasks could be posed from a different perspective for the sake of higher performance and robustness. However, given a single image, this possibility is usually discarded, since depth information is frequently obtained by three-dimensional reconstruction techniques, which require two or more images of the same scene taken from different viewpoints. Recently, some proposals have shown that it is possible to obtain depth information from single images. In essence, the idea is to take advantage of a priori knowledge of the image acquisition conditions and of the observed scene to estimate depth using monocular pictorial cues. These approaches try to estimate the scene depth maps precisely by employing computationally expensive techniques. However, many computer vision algorithms do not need a detailed depth map of the image. Indeed, just an approximate depth description can be very valuable in many problems.

In our work, we have shown that even approximate depth information can be integrated into different tasks following a holistic strategy in order to obtain more precise and robust results. In that sense, we have proposed a simple but reliable technique by which the regions of an image of a scene are classified into discrete depth ranges to build a coarse depth map. Based on this representation, we have explored the usefulness of our method in three application domains from novel viewpoints: camera rotation estimation, scene background estimation, and the generation of windows of interest for pedestrian detection. In the first case, we compute the rotation of a camera mounted on a moving vehicle by means of two new methods that use distant elements in the image, identified through our depth maps, taking advantage of the fact that the translational component of their image flow vectors can be considered negligible. In the reconstruction of the background of an image, we proposed a novel method that penalizes close regions in a cost function that also integrates color and motion information. Finally, we used the geometric and depth information of a scene for pedestrian candidate generation. This method significantly reduces the number of generated windows, which are subsequently processed by a pedestrian classifier. In all cases, the results show that the depth-based approaches contribute to better performance in the applications studied.

Contents

Abstract
Summary
1 Introduction
  1.1 Contributions and outline
2 Depth perception
  2.1 Introduction
  2.2 Depth perception
  2.3 Depth estimation from a single image