UNDERSTANDING CITYSCAPES

Efficient Urban Semantic Scene Understanding

Dissertation approved by Technische Universität Darmstadt, Fachbereich Informatik

for the degree of Doktor-Ingenieur (Dr.-Ing.)

by

marius cordts

Master of Science (M.Sc.), born in Aachen

Examiner: Prof. Stefan Roth, Ph.D.

Co-examiner: Prof. Dr. Bernt Schiele

Date of Submission: September 4th, 2017

Date of Defense: October 17th, 2017

Darmstadt, 2017
D17

Marius Cordts: Understanding Cityscapes, Efficient Urban Semantic Scene Understanding, © September 2017

ABSTRACT

Semantic scene understanding plays a prominent role in the environment perception of autonomous vehicles. The car needs to be aware of the semantics of its surroundings. In particular, it needs to sense other vehicles, bicycles, or pedestrians in order to predict their behavior. Knowledge of the drivable space is required for safe navigation, and landmarks, such as poles, or static infrastructure, such as buildings, form the basis for precise localization. In this work, we focus on visual scene understanding since cameras offer great potential for perceiving semantics while being comparably cheap; we also focus on urban scenarios as fully autonomous vehicles are expected to appear first in inner-city traffic. However, this task also comes with significant challenges. While images are rich in information, the semantics are not readily available and need to be extracted by means of computer vision, typically via machine learning methods. Furthermore, modern cameras have high-resolution sensors as needed for high sensing ranges. As a consequence, large amounts of data need to be processed, while the processing simultaneously requires real-time speeds with low latency. In addition, the resulting semantic environment representation needs to be compressed to allow for fast transmission and downstream processing. Additional challenges for the perception system arise from the scene type, as urban scenes are typically highly cluttered, containing many objects at various scales that are often significantly occluded.

In this dissertation, we address efficient urban semantic scene understanding for autonomous driving from three major perspectives. First, we start with an analysis of the potential of exploiting multiple input modalities, such as depth, motion, or object detectors, for semantic labeling, as these cues are typically available in autonomous vehicles. Our goal is to integrate such data holistically throughout all processing stages, and we show that our system outperforms comparable baseline methods, which confirms the value of multiple input modalities. Second, we aim to leverage modern deep learning methods, which require large amounts of supervised training data, for street scene understanding. Therefore, we introduce Cityscapes, the first large-scale dataset and benchmark for urban scene understanding in terms of pixel- and instance-level semantic labeling. Based on this work, we compare various deep learning methods in terms of their performance on inner-city scenarios facing the challenges introduced above. Leveraging these insights, we combine suitable methods to obtain a real-time capable neural network for pixel-level semantic labeling with high classification accuracy. Third, we combine our previous results and aim for an integration of depth data from stereo vision and semantic information from deep learning methods by means of the Stixel World (Pfeiffer and Franke, 2011b). To this end, we reformulate the Stixel World as a graphical model that provides a clear formalism, based on which we extend the formulation to multiple input modalities. We obtain a compact representation of the environment at real-time speeds that carries semantic as well as 3D information.

ZUSAMMENFASSUNG (SUMMARY)

Semantic scene understanding is of great importance for the environment perception of autonomous vehicles. The autonomous vehicle must be able to perceive and understand its surroundings. In particular, other vehicles, bicycles, or pedestrians must be recognized in order to predict their behavior. Exact knowledge of the drivable surroundings is the basis for safe navigation, while precise localization is enabled via landmarks (e.g. poles) or static infrastructure (e.g. buildings). Since cameras offer great potential for perceiving semantics and are, moreover, comparably cheap, the focus of this work lies on the visual understanding of scenes. A further focus lies on the urban environment, since fully autonomous vehicles are expected first in inner-city traffic. Nevertheless, this task also entails significant challenges. Although images are rich in information, the semantics are not immediately available and must first be extracted by means of image processing, typically via machine learning methods. Furthermore, modern cameras have high-resolution sensors in order to achieve long recognition ranges. As a consequence, large amounts of data must be processed, which has to happen in real time at low latency. Additionally, the resulting semantic environment representation must be compressed in order to enable fast transmission and further processing. Further challenges for the environment perception system arise from the scene type, since urban scenes are typically cluttered and contain many objects of widely varying sizes and with significant occlusions.

This dissertation addresses efficient urban semantic scene understanding for autonomous driving from three main perspectives. First, the potential of using multiple input modalities, such as depth, motion, or object detections, for semantic labeling is analyzed, since this information is typically available in autonomous vehicles. The goal here is to integrate such data completely and consistently into all processing stages, with the result that the system outperforms comparable methods, which confirms the value of multiple input modalities. Second, the aim is to employ modern deep learning methods, which require large amounts of annotated training data, for the understanding of street scenes. To this end, Cityscapes is introduced, the first large-scale dataset and benchmark for urban scene understanding in terms of semantic labeling on the pixel and instance level. Based on this work, various deep learning methods are compared regarding their performance on inner-city scenarios with respect to the challenges above. Building on these insights, suitable methods are combined in order to obtain a real-time capable neural network for pixel-level semantic labeling with high classification accuracy. Third, the previous results are combined with the goal of integrating depth data from stereo vision and semantic information from deep learning methods by means of the Stixel World (Pfeiffer and Franke, 2011b). To this end, the Stixel World is reformulated as a graphical model, so that a clear formalism exists, on the basis of which the model is extended to multiple input modalities. The result is a compact representation of the environment that can be computed in real time and contains semantic as well as 3D information.

ACKNOWLEDGMENTS

First and foremost, I would like to express my gratitude to my advisor Prof. Stefan Roth for accepting me as a Ph.D. student in his Visual Inference Group at TU Darmstadt. Thank you for all the valuable input to my research, your dedication in the days and nights before submission deadlines, and for supporting my development as a research scientist. I am especially grateful for such an excellent level of supervision and for the integration into the research group, given that I am an external Ph.D. candidate. In particular, the retreats allowed for a valuable exchange with other Ph.D. students and researchers in related fields and supported the creation of ideas and collaborations.

Next, I would like to thank Dr. Uwe Franke for providing me with the opportunity to conduct my research in his Image Understanding Group at Daimler R&D. You created an excellent research environment in which it is a great pleasure to work, to experiment, and to discuss the newest ideas. You preserve the research atmosphere despite structural changes and in doing so allow for continuous and focused research. Thank you for all the valuable impulses regarding my work, your trust, and your support of my career.

My sincere and deepest gratitude goes to Dr. Markus Enzweiler, my technical advisor at Daimler R&D. You were my mentor in many aspects throughout the Ph.D. studies and beyond. You were always encouraging and inspiring, you had an open ear for my ideas and problems, and you provided continuous feedback and insightful comments on my work.

My thanks also go to the remaining colleagues from the Image Understanding Group, especially to my closest collaborators Timo Rehfeld and Lukas Schneider. Thank you for all the valuable discussions and the hard questions that allowed me to dive deeper into my research problems and to widen my horizon. Thank you for creating a great atmosphere that makes our working hours and beyond truly enjoyable. I found many long-term friends in you. My warmest gratitude also goes to Nick Schneider, who has become a very close friend during my Ph.D. studies.

I would like to thank my parents Waltraud and Harald as well as my sister Isabell. You always guide me in my decisions and choices, you are there when times are rough, and you are the biggest support in my life. Finally, thank you to Ann-Kathrin, for your endless support while I was working on this dissertation, for joining me during numerous recordings on weekends, and for all your love.


CONTENTS

1 introduction
  1.1 Digital Cameras
  1.2 Semantic Scene Understanding
  1.3 Computer Vision in Automobiles
  1.4 Dissertation Goals
  1.5 Dissertation Outline
    1.5.1 Contributions

2 related work
  2.1 Overview of Pixel-level Semantic Labeling
    2.1.1 Classic processing pipeline
    2.1.2 Deep learning
  2.2 Constrained Scene Type

3 multi-cue semantic labeling
  3.1 Introduction
  3.2 Encode-and-Classify Trees
    3.2.1 Pixel-level semantic labeling
    3.2.2 Texton maps
    3.2.3 Combination
    3.2.4 Training
  3.3 Superpixels
    3.3.1 Incorporating depth
    3.3.2 Incorporating pixel classification
  3.4 Multi-cue Segmentation Tree
    3.4.1 Proximity cues
    3.4.2 Tree construction
  3.5 Multi-cue Region Classification
    3.5.1 Multi-cue features
    3.5.2 Training
  3.6 CRF Inference
    3.6.1 Unaries
    3.6.2 Parent-child
    3.6.3 Inference
  3.7 Temporal Regularization
  3.8 Experiments
    3.8.1 Cues
    3.8.2 Datasets
    3.8.3 Metrics
    3.8.4 Comparison to baselines
    3.8.5 Ablation studies
    3.8.6 Qualitative results


    3.8.7 Run time
  3.9 Discussion

4 large-scale semantic scene understanding
  4.1 Introduction
  4.2 Dataset
    4.2.1 Data specifications
    4.2.2 Classes and annotations
    4.2.3 Dataset splits
    4.2.4 Statistical analysis
  4.3 Pixel-level Semantic Labeling
    4.3.1 Metrics
    4.3.2 Control experiments
    4.3.3 Baselines
    4.3.4 Pixel-level benchmark
    4.3.5 Real-time fully convolutional networks
    4.3.6 Cross-dataset evaluation
  4.4 Instance-level Semantic Labeling
    4.4.1 Metrics
  4.5 Benchmark Suite
  4.6 Impact and Discussion

5 semantic stixels
  5.1 Related Scene Representations
    5.1.1 Unsupervised image segmentation
    5.1.2 Street scene environment models
    5.1.3 Layered environment models
  5.2 The Stixel Model
  5.3 Graphical Model
    5.3.1 Prior
    5.3.2 Data likelihood
  5.4 Inference
    5.4.1 Dynamic programming
    5.4.2 Algorithmic simplification
    5.4.3 Approximations
  5.5 Parameter Learning
  5.6 Object-level Knowledge
    5.6.1 Extended Stixel model
  5.7 Experiments
    5.7.1 Datasets
    5.7.2 Metrics
    5.7.3 Baseline
    5.7.4 Impact of input cues
    5.7.5 Impact of approximations
  5.8 Discussion

6 discussion and outlook

  6.1 Discussion
  6.2 Future Work
a appendix
  a.1 Cityscapes Label Definitions
    a.1.1 Human
    a.1.2 Vehicle
    a.1.3 Nature
    a.1.4 Construction
    a.1.5 Object
    a.1.6 Sky
    a.1.7 Flat
    a.1.8 Void
bibliography

LIST OF FIGURES

Figure 1.1  Examples for scene recognition and image classification
Figure 1.2  Examples for object detection and semantic labeling
Figure 1.3  Examples of complex urban scenes
Figure 3.1  Multi-cue system overview
Figure 3.2  Encode-and-classify tree structure
Figure 3.3  Superpixel variants for region proposal tree
Figure 3.4  Segmentation tree and resulting CRF model
Figure 3.5  Overlapping segments of ground truth and prediction
Figure 3.6  Overview of different annotation and prediction formats
Figure 3.7  Qualitative results on DUS dataset
Figure 3.8  Breakdown of run time
Figure 4.1  Examples from the Cityscapes dataset
Figure 4.2  Preprocessing variants of HDR images
Figure 4.3  Number of annotated pixels
Figure 4.4  Exemplary labeling process
Figure 4.5  Annotation examples at extremes of the Cityscapes dataset
Figure 4.6  Proportions of annotated pixels
Figure 4.7  Dataset statistics regarding scene complexity
Figure 4.8  Histogram of vehicle distances
Figure 4.9  Control experiments: most instances
Figure 4.10 Control experiments: most cars
Figure 4.11 Baseline experiments: most instances
Figure 4.12 Baseline experiments: most cars
Figure 4.13 Benchmarked methods: most instances
Figure 4.14 Benchmarked methods: most cars
Figure 4.15 Histogram over number of correct methods
Figure 4.16 Activity statistics for the Cityscapes dataset over time
Figure 4.17 Best performing methods on the Cityscapes dataset over time
Figure 5.1  Semantic Stixels' scene representation
Figure 5.2  Stixel World as a factor graph
Figure 5.3  Disparity measurements and resulting Stixel segmentation
Figure 5.4  Shortest path problem
Figure 5.5  Examples of object-level knowledge

Figure 5.6  Stixel model with object-level knowledge
Figure 5.7  Exemplary results of our Semantic Stixel model on Cityscapes
Figure 5.8  Qualitative results of ablation studies
Figure 5.9  Analysis of influence of semantic input channel

LIST OF TABLES

Table 3.1  Features for segmentation tree construction
Table 3.2  Features for multi-cue region classification
Table 3.3  Binary confusion matrix
Table 3.4  Averaged quantitative results on DUS dataset
Table 3.5  Averaged quantitative results on KITTI dataset
Table 3.6  Detailed quantitative results on DUS dataset
Table 3.7  Detailed quantitative results on KITTI dataset
Table 4.1  Evaluation of FCN model on different subsets
Table 4.2  Evaluation of FCN models on temperature extremes
Table 4.3  Comparison of Cityscapes to related datasets
Table 4.4  Absolute number and density of annotated pixels
Table 4.5  Absolute and average number of instances
Table 4.6  Control experiments: averages
Table 4.7  Control experiments: classes p1
Table 4.8  Control experiments: classes p2
Table 4.9  Baseline experiments: averages
Table 4.10 Baseline experiments: classes p1
Table 4.11 Baseline experiments: classes p2
Table 4.12 Benchmarked methods: overview
Table 4.13 Benchmarked methods: averages
Table 4.14 Benchmarked methods: classes p1
Table 4.15 Benchmarked methods: classes p2
Table 4.16 Comparison of networks in terms of classification performance and run time
Table 4.17 Quantitative results of FCN variants based on GoogLeNet
Table 4.18 Quantitative results on CamVid and KITTI obtained by Cityscapes model
Table 5.1  Performance of Semantic Stixels with different input combinations

Table 5.2  Performance of approximated inference

LIST OF ALGORITHMS

3.1 Segmentation tree construction
3.2 Generate object instances from label prediction

ACRONYMS

ADAS Advanced Driver Assistance Systems

AP Average Precision

BoF Bag-of-Features

BoW Bag-of-Words

CamVid Cambridge-driving Labeled Video Database (dataset, Brostow et al., 2009)

CMOS Complementary Metal-Oxide-Semiconductor

CNN Convolutional Neural Network

COCO Common Objects in Context (dataset, T.-Y. Lin et al., 2014)

CPU Central Processing Unit

CRF Conditional Random Field

CVPR IEEE Conference on Computer Vision and Pattern Recognition

DAG Directed Acyclic Graph

DNN Deep Neural Network

DP Dynamic Programming

DSLR Digital Single-Lens Reflex

DUS Daimler Urban Segmentation (dataset, Scharwächter et al., 2013)


ERC Extremely Randomized Clustering (Moosmann et al., 2008)

FCN Fully Convolutional Network (Shelhamer et al., 2017)

FN False Negative

FPGA Field-Programmable Gate Array

FP False Positive

GBIS Graph Based Image Segmentation (Felzenszwalb and Huttenlocher, 2004)

gPb Globalized Probability of Boundary (Maire et al., 2008)

GPS Global Positioning System

GPU Graphics Processing Unit

GT Ground Truth

HDR High Dynamic-Range

HIK Histogram Intersection Kernel (J. Wu, 2010)

HMM Hidden Markov Model

ICCV International Conference on Computer Vision

iFN Instance False Negative

iIoU Instance Intersection-over-Union

ILSVRC ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2015)

IoU Intersection-over-Union

iTP Instance True Positive

JI Jaccard Index

KITTI Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago (dataset, Geiger et al., 2013)

KLT Kanade-Lucas-Tomasi (feature tracker, Tomasi and Kanade, 1991)

LAB Lightness L, color-opponents A and B (color space)

LDR Low Dynamic-Range

MAP Maximum-a-Posteriori

MRF Markov Random Field

NYUD2 New York University Depth, Version 2 (dataset, Silberman et al., 2012)

OWT Oriented Watershed Transform

PASCAL Pattern Analysis, Statistical Modelling and Computational Learning

PASCAL VOC PASCAL Visual Object Classes (dataset, Everingham et al., 2014)

R-CNN Region-based Convolutional Neural Network (Girshick et al., 2014)

RNN Recurrent Neural Network

RDF Randomized Decision Forest

RGB-D Red Green Blue Depth (color space with depth data)

RGB Red Green Blue (color space)

S-SVM Structured Support Vector Machine (Nowozin and Lampert, 2011)

SEOF Superpixels in an Energy Optimization Framework (Veksler et al., 2010)

SGD Stochastic Gradient Descent

SGM Semi-Global Matching

SIFT Scale-Invariant Feature Transform

SLIC Simple Linear Iterative Clustering (superpixels, Achanta et al., 2012)

SPC Static Predictor Coarse

SPF Static Predictor Fine

SUN Scene UNderstanding (dataset, Xiao et al., 2016)

SUV Sport Utility Vehicle

SVM Support Vector Machine

TN True Negative

TPR True Positive Rate

TP True Positive

UCM Ultrametric Contour Map (Arbeláez et al., 2011)

VGG Visual Geometry Group (used as acronym for CNN architectures proposed by Simonyan and Zisserman, 2015)

1 INTRODUCTION

contents
1.1 Digital Cameras
1.2 Semantic Scene Understanding
1.3 Computer Vision in Automobiles
1.4 Dissertation Goals
1.5 Dissertation Outline
  1.5.1 Contributions

In this dissertation, we will investigate semantic scene understanding via image analysis with a focus on pixel-level semantic labeling. In doing so, we will target automotive applications and address urban scenes in order to work towards fully autonomous driving. In this chapter, we motivate this research topic and discuss the goals of the dissertation at hand. To this end, we start with an introduction of digital cameras and their impact on modern society in Section 1.1. We continue with an overview of the major tasks for semantic scene understanding (Section 1.2) and then turn to applications in the automotive domain (Section 1.3). Finally, the goals of the dissertation and its outline are provided in Section 1.4 and Section 1.5, respectively.

1.1 digital cameras

The birth of practically and commercially usable photography dates back to a public presentation by Louis-Jacques-Mandé Daguerre in 1839 (Daniel, 2004). Ever since, people have been fascinated by photography and have taken pictures, e.g. for documentation, for preserving memories, or as artwork. Images best reflect the way that humans perceive the environment, i.e. mainly with their eyes, and are rich in detail and contained information. The next major development step of cameras was achieved with the commercial availability of digital cameras around 1990 (Larish, 1990; Said, 1990). These cameras superseded chemical solutions requiring photographic film; they are cheaper and faster to use and allow for digital post-processing, storing, and sharing of images. The third major impact came with the availability of cameras in mobile phones in 2000 (Hill, 2013), leading to today's smartphones, which combine phone and computer functionality in a mobile, internet-capable device.


In parallel to the spreading of digital cameras, they underwent rapid technical development. The image quality improved due to better sensor technology, resolution, and optics. Further, more and more image processing is available within the camera device, such as High Dynamic-Range (HDR) photography or panorama stitching. Moreover, a multitude of online services became available, where images can be uploaded, post-processed, stored, shared, or searched. These services can be categorized into different groups depending on their primary application, such as image hosting websites, e.g. Flickr1 or Instagram2, social networks, e.g. Facebook3 or Google+4, image search engines, e.g. provided by Google5 or Microsoft6, and instant messaging services, e.g. WhatsApp7 or Snapchat8. In addition, there exist platforms primarily for video content, e.g. YouTube9.

Boosted by the rapid spreading of digital cameras and smartphones, there is an exploding number of images taken per day. The number of images shared via online services has been increasing exponentially in recent years (Meeker, 2016). In 2015, the daily number of images shared on Facebook was estimated at 500 million, and the number of images sent via the instant messengers WhatsApp, Facebook Messenger, and Snapchat reached almost three billion per day (Meeker, 2016). In the same year, 400 hours of video material were uploaded to YouTube per minute (Jarboe, 2015). Already impressive, these numbers reflect only a lower bound on the number of pictures and videos taken per day, as they do not count other online services, data that is stored offline, or data that is not stored at all, e.g. from surveillance cameras. These numbers indicate the importance of images for our everyday life and their impact on our society. Especially from a commercial perspective, there is a huge interest in automatically parsing and understanding the large amount of information contained in images. The intention of online services is to gain knowledge about the users and to provide services such as image search or person identification in order to attract users. The major goal in surveillance is to understand the observed scene and to detect anomalies. In robotics and autonomous driving, cameras are often the eyes of the robots, and their images need to be analyzed in order to navigate safely within the environment.

While being an exciting task, automatic image understanding is also a hard problem. If the online services introduced above seek to parse all the images that are uploaded every day, they need to deal with large amounts of data. A single image consists of several million values that need to be processed, imposing high computational demands and requiring efficient methods. Simultaneously, while digital images are easily accessible for computers, the sensing imitates human visual perception; thus, the contained information is not readily available and needs to be extracted via computer vision techniques. Due to this high interest paired with these challenges, image understanding plays a major role in modern computer vision research.

1 www.flickr.com   2 www.instagram.com   3 www.facebook.com
4 plus.google.com   5 images.google.com   6 www.bing.com/images
7 www.whatsapp.com   8 www.snapchat.com   9 www.youtube.com

1.2 semantic scene understanding

The field of semantic scene understanding addresses automated image understanding from a semantic perspective, i.e. parsing the image and assigning a meaning to it or its parts. Such a meaning is typically encoded by a class label, e.g. person, building, or car, while the set of possible class labels is usually defined beforehand. Nowadays, virtually all methods for semantic scene parsing use machine learning techniques (Everingham et al., 2014; T.-Y. Lin et al., 2014; Russakovsky et al., 2015). Note that scene understanding extends to videos as well, since videos are merely a collection of temporally successive frames. These frames can either be processed as individual images, or scene understanding can be improved by exploiting temporal information. In recent years, there has been rapid progress in this field, and many of the above mentioned online services exploit scene understanding for their products. In the following, we provide a brief overview of the major tasks of semantic scene understanding. All these tasks are driven by corresponding datasets for training and evaluating algorithms. These datasets are often paired with benchmarks such that different methods can be directly compared. In addition to the introduced tasks, there exist hybrid or combined variants, as well as more specialized tasks such as human keypoint detection or fine-grained classification.

scene recognition
The task of scene recognition aims at determining the scene type that an image was taken in; cf. Figure 1.1 for example images and ground truth scene types. Note that in the example in Figure 1.1b, the task at hand is to predict the background scenery, i.e. ocean, and not the predominant object in the image, i.e. boat. However, as the example indicates, scene type and contained objects are often closely correlated, and thus so is their recognition. Popular datasets and benchmarks for scene recognition are Scene UNderstanding (SUN; Xiao et al., 2016) and Places (Zhou et al., 2016).

Figure 1.1: Example images and ground truth annotations for a typical scene recognition (image classification) task: (a) kitchen (table, chair), (b) ocean (boat), (c) supermarket (apple).

image classification
Compared with scene recognition, the roles in image classification are essentially flipped, and the task is to classify the dominant objects in the image; cf. Figure 1.1 and the ground truth labels in brackets. The most popular dataset for image classification is ImageNet (Russakovsky et al., 2015), paired with the annual ImageNet Large Scale Visual Recognition Challenges (ILSVRCs).

activity recognition
While scene recognition and image classification focus on the scene type and the foreground objects, respectively, activity recognition aims to determine the ongoing activities in the image. Since activities are most prominent in the temporal domain, this task is typically applied to videos. There exist several datasets for activity recognition that focus on different action types, e.g. Charades (Sigurdsson et al., 2016) for everyday life activities, ActivityNet (Heilbron et al., 2015) containing human activities, or Sports-1M (Karpathy et al., 2014) with sports activities.

bounding box object detection
The three previously introduced tasks all focus on the image/video level, i.e. the output describes the scene (or a temporal subset) as a whole. With bounding box object detection, the task is not only to classify the objects within a scene, but also to localize them in the image. Thus, the ideal output of a system for this task is a set of rectangular, axis-aligned boxes, each forming the smallest bounding box around an object of interest, associated with a class label and often a confidence score. Note that depending on the context, the term bounding box can be omitted in the task's name. Popular datasets and benchmarks for object detection are ILSVRC (Russakovsky et al., 2015), Microsoft Common Objects in Context (COCO; T.-Y. Lin et al., 2014), and KITTI (Geiger et al., 2013). An example image with object bounding boxes is provided in Figure 1.2a.

Figure 1.2: Example images and ground truth annotations for (a) bounding box object detection, (b) pixel-level semantic labeling, and (c) instance-level semantic labeling. Note that object detection separates individual instances, but does not provide any segmentation and is only applicable to compact, non-elongated objects. Pixel-level semantic labeling, in turn, is instance agnostic, but segments arbitrary object shapes. Instance-level semantic labeling aims to combine the two other tasks. The image is taken from the Cityscapes dataset as introduced in Chapter 4, and classes are encoded by false colors, e.g. vehicles are colored in blue, humans in red, etc.

Figure 1.2a also reveals that object detection has three potential shortcomings in representing the scene. First, some classes such as road or sky are not countable, and thus their individual instances are not well defined. Second, elongated objects such as buildings or trees are poorly described by their bounding boxes. Third, as evident from the cars in the left part of Figure 1.2a, bounding boxes of heavily occluded objects overlap and cause ambiguities in their representation.
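To make the detector output format described above concrete, the following minimal Python sketch shows one way to represent a single detection; the field names are illustrative and not taken from any particular benchmark or library.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One object detection: the smallest axis-aligned box around the
    object, a class label, and a confidence score. Field names are
    illustrative, not taken from any particular benchmark."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str
    score: float

# The ideal detector output for a scene is then simply a list of boxes.
detections = [
    Detection(102.0, 310.5, 315.0, 455.0, "car", 0.97),
    Detection(400.0, 298.0, 440.0, 402.0, "person", 0.85),
]
print([d.label for d in detections])  # ['car', 'person']
```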

pixel-level semantic labeling
The task behind pixel-level semantic labeling is to assign a semantic class label to each pixel of the image; cf. Figure 1.2b. In doing so, the limitations of object detection are circumvented, and the labeling can be applied to classes of arbitrary shape. Alternatively, this task is also denoted as scene labeling or semantic segmentation, where the latter emphasizes that image regions belonging to the same semantic class are segmented. Note, however, that this task does not separate individual instances anymore; instead, all instances of the same class are assigned the same label. As a consequence, the cars in Figure 1.2b are all grouped into the same segment. Prominent datasets for this task with corresponding benchmarks are PASCAL VOC (Everingham et al., 2014), PASCAL-Context (Mottaghi et al., 2014), Microsoft COCO (T.-Y. Lin et al., 2014), and SUN RGB-D (Song et al., 2015). A more detailed comparison and discussion of these datasets follows in Chapter 4.
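Since a pixel-level labeling is nothing more than a class-ID map of the image size, the Intersection-over-Union (IoU) metric commonly used to score it (cf. the acronym list and the metric discussions in Chapters 3 and 4) is straightforward to compute. A minimal numpy sketch, with hypothetical class IDs:

```python
import numpy as np

# Hypothetical class IDs for illustration; real datasets define their own.
ROAD, CAR, PERSON = 0, 1, 2

def iou_per_class(prediction, ground_truth, num_classes):
    """Per-class Intersection-over-Union, i.e. TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        pred_c = prediction == c
        gt_c = ground_truth == c
        intersection = np.logical_and(pred_c, gt_c).sum()
        union = np.logical_or(pred_c, gt_c).sum()
        ious.append(intersection / union if union > 0 else float("nan"))
    return ious

# A pixel-level labeling is just an integer map of the image size.
gt = np.full((4, 6), ROAD, dtype=np.int64)
gt[:2, 2:5] = CAR                # a car region in the upper image part
pred = gt.copy()
pred[0, 2] = PERSON              # one mislabeled pixel

print(iou_per_class(pred, gt, num_classes=3))  # [1.0, 0.833..., 0.0]
```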

instance-level semantic labeling
Instance-level semantic labeling extends the pixel-level task by simultaneously assigning each pixel a semantic label and an object instance; cf. Figure 1.2c. In doing so, the representational power of the two previous tasks is combined. Since an instance segmentation of uncountable classes is not well defined, these classes are typically omitted, cf. Figure 1.2c. Nevertheless, a combination with a pixel-level semantic labeling for such classes is straightforward. Benchmark datasets for this task are PASCAL VOC (Everingham et al., 2014) and Microsoft COCO (T.-Y. Lin et al., 2014).
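To illustrate how instance identity extends the pixel-level output, a small numpy sketch follows; the fused class_id * 1000 + instance encoding is only one possible convention, chosen here for illustration, and the class ID is hypothetical.

```python
import numpy as np

CAR = 26  # hypothetical class ID

# Two cars in a toy 3x8 scene: the semantic class map alone would merge
# them into one segment, while the instance map tells them apart.
class_map = np.zeros((3, 8), dtype=np.int64)
instance_map = np.zeros((3, 8), dtype=np.int64)
class_map[:, 1:3] = CAR
instance_map[:, 1:3] = 1          # first car
class_map[:, 5:7] = CAR
instance_map[:, 5:7] = 2          # second car

# One possible fused encoding: a single integer map carrying both the
# class and, for countable classes, the instance index.
fused = np.where(instance_map > 0, class_map * 1000 + instance_map, class_map)
print(np.unique(fused))           # [    0 26001 26002]
```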

1.3 computer vision in automobiles

In parallel with the advances of camera technology, the development of the modern automobile started with the patent of a vehicle with gas engine operation by Benz, 1886. Ever since, automobiles have quickly spread across the world, reaching a global production of 91 million units in 2015 (OICA, 2016). However, the wide dissemination of automobiles comes at a high cost. For 2013, the number of road accident fatalities is estimated at 1.25 million, making road accidents the main cause of death among people aged 15 to 29 years (WHO, 2015). As a consequence, legislators, researchers, and the car industry continuously aim to improve the automobile's safety via passive and active safety systems. Passive safety includes systems such as airbags, seatbelts, or the vehicle's physical structure, which protect occupants and road users during a crash. Active safety aims to prevent the crash altogether and, amongst others, leverages cameras equipped with computer vision to sense the environment.

Such systems are typically denoted as Advanced Driver Assistance Systems (ADAS) and aim to increase car and road safety by supporting the driving process. Examples of such components are lane detection and departure warning systems, pedestrian detection systems with autonomous braking, or traffic sign recognition. These examples also indicate the importance of a semantic understanding and show that object detection systems are already available in actual products.

With increasing advances in technology, the vehicle is capable of conducting more and more driving tasks itself. Accordingly, the report of SAE International, 2014 defines six levels of autonomy, ranging from no automation to full automation, where the latter stands for a vehicle that is capable of driving safely without a driver in all circumstances that can be managed by a human driver. In other words, a fully autonomous vehicle matches, or exceeds, human driving performance. The motivation of researchers and industry to push forward such technology is to reach a level of maximum safety while being able to offer services such as robotaxis that can pick up and drive passengers. For such systems, computer vision plays an important role in sensing the environment with cameras, ultimately constituting the eye of the vehicle.

For ADAS, and even more for autonomous driving, semantic scene understanding as described in Section 1.2 plays a major role. The vehicle needs to sense its environment and needs to know about the scene content for safe navigation and interaction with the environment. In order to achieve 360° coverage around the vehicle, paired with a sufficient level of redundancy, often multiple sensors of different technologies are combined, such as radar sensors, laser scanners, or cameras. Each sensor has its individual strengths; cameras are typically the cheapest, come closest to human perception, and allow for a recognition of many semantic classes.

While the scene type, the set of contained classes, and high-level actions typically vary little when considering street scenes, the location of objects and scene elements is highly important. In order to accurately predict the behavior of active road users, e.g. cars, trucks, pedestrians, or cyclists, the autonomous vehicle needs to classify and localize them. Further, it needs to be aware of the drivable space and should be capable of segmenting different ground regions, e.g. sidewalks, in order to anticipate occluded pedestrians. Further, landmarks such as poles, walls, or fences aid a precise localization of the autonomous vehicle within a map of its environment. Since buildings have a nearly constant appearance during all seasons of the year, they are well suited for a feature-based localization, whereas trees are not. Traffic lights and traffic signs need to be recognized in order to obey traffic regulations.

Figure 1.3: Example images for complex, cluttered, and busy urban scenes: (a) high variations in scale, (b) severe occlusions. Note the challenges that arise from large variations in scale, e.g. the close and far cars in (a). Further, note the occluded pedestrians to the right of the ego vehicle in (b) who are about to enter the road.

Even the sky is an important image region, since depth estimates via stereo vision are typically very noisy in such texture-less regions. Overall, these requirements emphasize the need for a high-performing semantic scene understanding system.

Facing an automotive environment, such a scene understanding system is confronted with several challenges. Particularly in the urban environment, where autonomous vehicles are extensively developed, the scenes are highly unstructured and cluttered. Objects can occur at various scales, and there are often many objects in a scene that need to be detected despite severe occlusions, cf. Figure 1.3. In addition to the challenges stemming from an urban environment, an appropriate system should work reliably under all circumstances and thus must be robust against common effects in the automotive domain, e.g. pixel noise, motion blur, illumination changes, varying weather conditions, or wiper occlusions. Simultaneously, such a system must be efficient and capable of parsing the scene at camera frame rates, typically around 20 Hz, with limited computational resources and power budget. Finally, the output of a perception system is often sent to a downstream processing stage such as sensor fusion or behavior planning. Therefore, the scene representation must be compact, i.e. require low bandwidth, and must be easily parsable for the downstream algorithm. These challenges, paired with the goal of reaching a new level of autonomy and safety, motivate many researchers to work in this domain and also form the foundation of this work.
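To give the compactness requirement a rough scale, consider a back-of-the-envelope comparison between transmitting a dense pixel-level labeling and a compact column-based representation such as the Stixel World of Chapter 5; all numbers below are illustrative assumptions rather than measurements from this work.

```python
# Back-of-the-envelope bandwidth of a dense pixel labeling vs. a compact
# column-based scene representation. All numbers are assumptions chosen
# for illustration only.
width, height, fps = 2048, 1024, 20            # hypothetical camera setup

dense_bytes_per_s = width * height * 1 * fps   # 1 byte class ID per pixel
print(f"dense labeling: {dense_bytes_per_s / 1e6:.1f} MB/s")   # ~41.9 MB/s

stixels_per_frame = 1500                       # assumed primitive count
bytes_per_stixel = 8                           # e.g. column, extent, class, depth
compact_bytes_per_s = stixels_per_frame * bytes_per_stixel * fps
print(f"compact model:  {compact_bytes_per_s / 1e3:.1f} kB/s")  # 240.0 kB/s
```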

1.4 dissertation goals

In the previous sections, we learned that semantic urban scene understanding is an exciting, promising, and challenging task that is required for autonomous driving. We see that there is large interest and progress in general scene understanding, and even in the automotive domain there are production systems based on object detection for individual classes. However, when targeting autonomous driving, semantic scene understanding is not at the desired level of maturity. In fact, as discussed above, we are facing a large number of different classes with different shapes and properties, paired with the challenges that arise in the automotive domain, cf. Section 1.3. In addition, we need a detailed scene understanding that can exceed the coarse representation via bounding boxes. Seeking a system that is in principle capable of meeting these requirements, we focus on pixel-level semantic labeling, which we make accessible to downstream applications via a compact representation called the Stixel World. In doing so, the semantic information is enriched with a 3D scene understanding and with a segmentation that mitigates the lack of instance separation. While pixel-level semantic labeling is an established component for general scene understanding, cf. Section 1.2, there are several open questions that we address in this dissertation.

First, in an automotive setup, multiple input modalities such as temporal information from the video stream or depth information via stereo vision are available, but their full potential is unclear. Second, in such a setup, the scene type is always street scenes, and the camera is fixed in the vehicle, yielding a known pose; information that could and should be exploited. Third, in recent years, immense progress in scene understanding was achieved via the availability of large-scale datasets and accompanying benchmarks, e.g. ImageNet and Microsoft COCO for general scenes or KITTI for urban scenes. However, for pixel-level scene understanding in an automotive environment, there is no such data available. Fourth, an appropriate scene representation carrying the perceived information about the environment is barely considered. We argue that such a representation needs to be compact and robust to be suitable for autonomous driving.

While investigating these open items throughout the dissertation, we will follow a common recipe. We scan the literature for existing solutions, identify the open issues, and address these by combining existing methods and adding what is missing. We do so keeping the particular challenges of an automotive environment in mind and work towards a system that delivers the semantic information of the scene robustly and compactly at high efficiency.

1.5 dissertation outline

The remainder of the dissertation is structured as follows. We start with an overview of related work in Chapter 2 with respect to the scope of the dissertation. This discussion is conducted from a rather general and high-level point of view, as we address more closely related works in the individual technical chapters. In Chapter 3, we investigate the benefits of multi-cue input data, as typically available in autonomous vehicles, for pixel-level semantic labeling. We design our system based on traditional processing concepts, where we extend most stages to leverage the available input modalities. Next, we enable deep learning methods for semantic street scene understanding by introducing Cityscapes, a large-scale dataset of urban scenes, in Chapter 4. The dataset is accompanied by a benchmark, based on which we compare state-of-the-art methods for pixel- and instance-level semantic labeling. In addition, we combine existing concepts and network architectures to obtain an efficient neural network for pixel-level semantic labeling at high classification accuracy. In Chapter 5, in order to combine the previous lessons learned, we reformulate the multi-layer Stixel World, as first proposed by Pfeiffer and Franke, 2011b, as a graphical model, which provides a clear mathematical formalism explicitly modeling statistical independence assumptions. This formulation in turn allows us to integrate multiple input cues, in particular depth from stereo vision and pixel-level semantic labels, yielding the Semantic Stixel World. The obtained representation is a segmentation tailored for street scenes, combines 3D with semantic information, and has properties desirable for autonomous vehicles, such as efficiency, compactness, and robustness. Finally, we conclude the dissertation in Chapter 6 with a discussion and an outlook on future work.

1.5.1 Contributions

efficient multi-cue semantic labeling
Applying semantic labeling to a specific scenario such as autonomous driving requires high accuracies of the system's output obtained in real time. Simultaneously, the scene type is restricted to street scenes, and typically multiple input modalities, such as depth from stereo vision, motion from optical flow, or object detections, are available. Despite these requirements and peculiarities of autonomous driving, semantic labeling approaches are commonly general-purpose methods, are rather slow, and, if at all, leverage only a few additional input channels for a few processing stages. In contrast, in Chapter 3, we propose a semantic labeling system tailored to the application of autonomous driving. We design our model around the given scene type, exploit that the camera pose is nearly constant with respect to the environment, and leverage available input cues holistically throughout all processing stages. In doing so, we show that these cues are complementary and help to improve the overall accuracy. Our method is centered around a segmentation tree that already exploits multiple cues to generate accurate region proposals. Its tree structure allows for efficient inference, which again exploits all available input modalities. Overall, our approach achieves excellent accuracies at real-time speeds.

cityscapes dataset and benchmark suite10
The recent success of deep learning methods for various tasks in semantic scene understanding is mainly driven by the availability of large-scale datasets enabling supervised training of deep neural networks. However, prior to this work, there was no large-scale dataset and benchmark available for the task of urban semantic labeling. To this end, we proposed the Cityscapes dataset and benchmark suite that we introduce in Chapter 4, where we address large-scale urban scene understanding. The dataset consists of several thousand images with fine and coarse annotations for training and testing methods for pixel- and instance-level semantic labeling. In addition to the images and their annotations, the dataset includes additional modalities, such as depth, video, or vehicle odometry. We provide a discussion of our major design choices and an extensive evaluation of the dataset's statistics, and we compare to related works. The dataset is accompanied by a benchmark suite, based on which we evaluate and compare both general-purpose and specific deep learning methods for the task of urban scene understanding. Analyzing these methods and combining the lessons learned, we propose a competitive real-time deep model for pixel-level semantic labeling. Overall, Cityscapes became today's benchmark for urban semantic scene understanding in our research community.

semantic stixels
For autonomous driving, the scene needs to be compactly represented such that the bandwidth required to transfer the information to downstream modules is reduced. Furthermore, the environment model needs to be robust, efficient to compute, and should contain the essential information for autonomous driving. To this end, Pfeiffer and Franke, 2011b proposed the Stixel World, a segmentation of the scene based on depth information, yielding the drivable space and the delimiting obstacles. In Chapter 5, we extend this model and introduce Semantic Stixels. We formalize the Stixel World via a graphical model that allows us to fuse multiple input channels, supported by our findings in Chapter 3. Furthermore, we provide an analysis of the inference scheme and show how to learn the model parameters from training data. Overall, the model is capable of accurately representing the scene in terms of depth and semantic information, where the latter stems from an object detector or a real-time semantic labeling network as proposed in Chapter 4.

10 The creation of the Cityscapes dataset and benchmark suite was initially driven by Marius Cordts and was then continued as joint work with Mohamed Omran.

2 RELATED WORK

contents
2.1 Overview of Pixel-level Semantic Labeling
  2.1.1 Classic processing pipeline
  2.1.2 Deep learning
2.2 Constrained Scene Type

The goal of this chapter is to embed the dissertation into the vast body of related work. We will do so from a high-level perspective and leave the detailed analysis and differentiation of closely related work to the technical Chapters 3 to 5. As the focus of this dissertation is pixel-level semantic labeling, we also focus the analysis of related work on this task and provide an overview in Section 2.1. Subsequently, in Section 2.2, we turn towards scene understanding methods that focus on a constrained scene type and exploit the resulting domain restrictions; a concept that we also follow in this work.

2.1 overview of pixel-level semantic labeling

Methods for pixel-level semantic labeling, also denoted as scene labeling or semantic segmentation, can be roughly split into two lines of research, even though exceptions and hybrid variants exist. We start with a discussion of the classic processing pipeline (Section 2.1.1) and then address state-of-the-art deep learning methods that are trained end-to-end (Section 2.1.2). This structure is also reflected in this dissertation, where we start with an efficient semantic labeling system following a classic processing pipeline in Chapter 3. We then turn towards deep learning-based methods by providing an appropriate large-scale dataset in Chapter 4, which we use to benchmark different variants in terms of accuracy and run time. Finally, we combine both methodologies, yielding our compact Stixel representation in Chapter 5.

2.1.1 Classic processing pipeline

From a high-level perspective, classic methods for semantic labeling are often based on a common processing pipeline that consists of bottom-up image segmentation, feature extraction, region classification, and final inference. Naturally, there exist many variants of this pipeline, where one or multiple components are missing, altered, or added, and there are methods that follow widely unrelated concepts, e.g. Brox et al., 2011; L.-C. Chen et al., 2013; Fröhlich et al., 2012; Yang et al., 2010.

The first step in the processing pipeline consists of the computation of bottom-up image segments. In Fulkerson et al., 2009; Mičušík and Košecká, 2009; Tighe and Lazebnik, 2013, these segments are simple superpixels, e.g. from Graph Based Image Segmentation (GBIS) by Felzenszwalb and Huttenlocher, 2004 or quick shift (Vedaldi and Soatto, 2008). Other approaches, such as Ladický et al., 2013; Plath et al., 2009; Reynolds and Murphy, 2007, also rely on such superpixels, but computed at multiple scales to retrieve a set of overlapping segments. A third group of works (Arbeláez et al., 2012; Lim et al., 2009; Nowozin et al., 2010; Yao et al., 2012) leverages segmentation methods based on edge detectors, such as the gPb-OWT-UCM segmentation tree by Arbeláez et al., 2011. Researchers apply such bottom-up segmentations in their work for various reasons. First, these unsupervised regions aid the semantic segmentation since they can provide precise boundaries of the semantic classes. Second, they can serve as a regularizer, because a single segment typically has a constant semantic class. Third, the segments can serve as proposal regions for a classifier and aid the classification via an increased spatial support. Lastly, the bottom-up segmentation can increase the efficiency if inference is conducted on the superpixel- instead of the pixel-level.

As a second step, feature descriptors for the bottom-up image regions are computed in order to facilitate their classification into the semantic classes. The feature extraction step is typically split into two parts. First, one or multiple pixel-level feature representations, such as textons (Malik et al., 2001), SIFT (Lowe, 2004), Gabor filters, or the color values themselves, are computed, e.g. Fulkerson et al., 2009; Plath et al., 2009; Reynolds and Murphy, 2007. Second, these vectors are pooled over all pixels within the region in order to obtain a fixed-length, compact, and discriminative descriptor of the region. Common pooling techniques include sum-, average-, or max-pooling, cf. Boureau et al., 2010; Carreira et al., 2012a. Inspired by the Bag-of-Words (BoW) model from natural language processing (Lewis, 1998), the Bag-of-Features (BoF) model in computer vision has emerged. Here, prior to pooling, the high-dimensional and often real-valued feature vectors are encoded, i.e. mapped onto an element of a finite codebook, e.g. via k-means clustering (Fulkerson et al., 2009; Ladický et al., 2013; Plath et al., 2009) or decision trees (Moosmann et al., 2008; Scharwächter et al., 2013). This encoding can be thought of as replacing the pixel-level feature vectors with vectors that contain a single one at the index of the codeword, and zeros otherwise. If these vectors are then pooled via average-pooling, one essentially computes the normalized histogram over the codewords describing the region, as the sketch below illustrates.
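The equivalence between encode-then-average-pool and a normalized codeword histogram is easy to verify in code. A minimal numpy sketch, with random vectors standing in for real pixel-level features and a learned codebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 500 pixel-level feature vectors of one region (e.g. textons
# or SIFT) and a codebook of 8 codewords (e.g. from k-means clustering).
features = rng.normal(size=(500, 16))
codebook = rng.normal(size=(8, 16))

# Encoding: map each feature vector to its nearest codeword ...
distances = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
codes = distances.argmin(axis=1)

# ... i.e. replace it by a one-hot vector over the codebook ...
one_hot = np.eye(len(codebook))[codes]

# ... so that average-pooling over the region yields exactly the
# normalized codeword histogram used as the region descriptor.
descriptor = one_hot.mean(axis=0)
assert np.allclose(descriptor, np.bincount(codes, minlength=8) / len(codes))
```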

In addition to pixel-level appearance features, prior work also employs various other region descriptors, e.g. based on geometric and shape properties of the region (Arbeláez et al., 2012; Lim et al., 2009; Tighe and Lazebnik, 2013).

Third, the region descriptors are classified into the semantic classes of interest by leveraging standard classifiers. These classifiers typically yield likelihood scores for each semantic class. Often a Support Vector Machine (SVM) or a variant thereof is used, e.g. Arbeláez et al., 2012; Fulkerson et al., 2009; Plath et al., 2009, but a nearest neighbor variant is employed by Tighe and Lazebnik, 2013, and AdaBoost, as proposed by Freund and Schapire, 1996, is used by Ladický et al., 2013; Reynolds and Murphy, 2007. Yao et al., 2012 directly classify pixels and pool the likelihoods (instead of features) over the image regions.

In the last stage, the final pixel-level labeling is inferred. In the simplest case, the region labels could be directly projected onto the pixels; however, the results are then typically noisy and unsatisfying. Instead, many works model the pixel-level result via a graphical model, e.g. a Markov Random Field (MRF) in Tighe and Lazebnik, 2013, a Conditional Random Field (CRF) in Gonfaus et al., 2010; Kohli et al., 2009, or the pylon model in Lempitsky et al., 2011. Such models include the class likelihoods as unary terms and extend these with prior knowledge as well as higher-order dependencies to regularize and improve the labeling result. For an excellent overview of graphical models, the reader is referred to Blake and Kohli, 2011; Nowozin and Lampert, 2011. Often, the nodes in the graphical model correspond to image regions in order to capture long-range dependencies while keeping the computational complexity tractable, but sometimes there are also pixel-level nodes, e.g. Ladický et al., 2013. The parameters of the graphical model are either manually tuned or learned from training data; in the extreme, the feature classification step is skipped and the feature vectors directly form the unary potentials of a CRF, cf. Nowozin et al., 2010. Besides graphical models, other works use custom inference schemes (Lim et al., 2009) or another set of classifiers based on the region scores (Arbeláez et al., 2012).
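For reference, the energy minimized by such region-level CRF formulations typically takes the generic pairwise form below; the notation is kept generic rather than matching any single cited model.

```latex
% Generic pairwise CRF energy over (region or pixel) labels y_i.
E(\mathbf{y}) = \sum_{i \in \mathcal{V}} \psi_i(y_i)
              + \sum_{(i,j) \in \mathcal{E}} \psi_{ij}(y_i, y_j),
\qquad
\mathbf{y}^\ast = \operatorname*{arg\,min}_{\mathbf{y}} E(\mathbf{y})
```

Here, the unary potentials stem from the classifier's class likelihoods, the pairwise potentials encode prior knowledge such as smoothness between neighboring regions or parent-child consistency in a segmentation tree, and MAP inference corresponds to minimizing the energy over the label assignment.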

2.1.2 Deep learning

With the availability of large-scale datasets and sufficient computa- tional resources, Deep Neural Networks (DNNs) are nowadays at the core of nearly all state-of-the-art semantic scene understanding meth- ods. For an excellent overview on the field of deep learning the reader is referred to Goodfellow et al., 2016. The popular work of Shelhamer et al., 2017 breaks with the classic methodology outlined in the previous section and produces the final pixel-level semantic labeling with a single forward pass of a Fully 16 related work

Convolutional Network (FCN) trained end-to-end for pixel-level se- mantic labeling. Shelhamer et al., 2017 base on a network designed for image classification, i.e. VGG-16 (Simonyan and Zisserman, 2015), and transform the inner product layers to convolutional layers such that the network can be applied to images of arbitrary dimensions. In doing so, they can initialize the semantic labeling model with a net- work that was trained for large-scale image classification and hence has learned generic image representations (Razavian et al., 2014). Due to various pooling layers in the network, the obtained output dimen- sions are 32 times smaller than the input in both directions. In order to regain the original resolution, Shelhamer et al., 2017 propose skip connections that combine the feature maps from multiple layers (and resolutions) to obtain the final feature map, which is only 8 times smaller than the input. Hence, this model is denoted as FCN-8s. The final pixel-level labels are then computed via bilinear interpolation of the confidence scores. In this dissertation, we base on the FCN-8s ar- chitecture by Shelhamer et al., 2017 for our experiments in Chapter 4 and for providing the semantic class information in Chapter 5. Following this line, Badrinarayanan et al., 2017; Ghiasi and Fowlkes, 2016; G. Lin et al., 2017; Wang et al., 2017 propose alternative strate- gies based on sophisticated Convolutional Neural Network (CNN) modules to upscale the coarse feature maps to the desired output resolution. Alternatively, L.-C. Chen et al., 2016; Yu and Koltun, 2016 leverage dilated convolutions that can circumvent the downscaling in the network altogether while maintaining its receptive field, which is important for context knowledge. Pohlen et al., 2017 design a network architecture that couples a full resolution processing stream with a standard downscaling one. Others aim to incorporate global scene context for pixel-level prediction, in particular through a ded- icated network architecture (W. Liu et al., 2015; Sharma et al., 2015; Yu and Koltun, 2016; Zhao et al., 2017) or through Recurrent Neu- ral Networks (RNNs; Byeon et al., 2015; Pinheiro and Collobert, 2014). Naturally, a fourth line of work is to improve performance by basing on better image classification networks, e.g. Z. Wu et al., 2016. Other work aims to bridge the gap towards the classic pipeline introduced in Section 2.1.1. One option is to leverage the CNN to compute pixel-level features or semantic scores and then employ su- perpixels for regularization and precise segmentation (Farabet et al., 2012; Mostajabi et al., 2015; Sharma et al., 2015). Another option is to combine the strengths of CNNs and CRFs. Typically, the CRF is ap- pended to an FCN such that it improves the output via regularization, the modeling of long-range dependencies, and the incorporation of prior knowledge. The CRF can be simply appended to the network as an individual processing block (L.-C. Chen et al., 2016) or can be incorporated in an end-to-end training (Arnab et al., 2016; G. Lin et al., 2016; Z. Liu et al., 2015; Schwing and Urtasun, 2015; S. Zheng 2.2 constrained scene type 17 et al., 2015). Note that in the latter case, the CRF is essentially another layer (or multiple) at the end of the CNN. In a third variant, Xiaoxiao Li et al., 2017 combine cascaded models with deep learning. 
Multiple smaller networks are stacked on top of each other, where earlier stages classify easy image regions and later stages are responsible for the hard cases; nevertheless, the whole model is trained end-to-end.

Only a few works have focused on efficient semantic labeling for real-time autonomous driving applications. Treml et al., 2016 rely on SqueezeNet (Iandola et al., 2016) as the underlying image classification network, where the non-linearities are replaced by exponential linear units (Clevert et al., 2016). In contrast, Paszke et al., 2016 develop a network architecture specifically tailored for efficient pixel-level semantic labeling.
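The appeal of the dilated convolutions mentioned above can be made concrete with a few lines of code. The following is a minimal, generic PyTorch sketch (not the implementation of the cited works): a dilated 3 × 3 convolution enlarges the receptive field while keeping the full input resolution, so no lost resolution has to be regained later.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 128)  # batch, channels, height, width

# Standard 3x3 convolution: receptive field 3, resolution kept via padding.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)

# Dilated 3x3 convolution: receptive field 5 at the same resolution,
# avoiding the downscaling that FCN-style models must undo afterwards.
dilated = nn.Conv2d(3, 16, kernel_size=3, padding=2, dilation=2)

print(conv(x).shape)     # torch.Size([1, 16, 64, 128])
print(dilated(x).shape)  # torch.Size([1, 16, 64, 128])
```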

2.2 constrained scene type

In this dissertation, we focus on pixel-level semantic labeling for autonomous driving. In such a constrained scenario, a scene understanding system can exploit various constraints on the scene type, i.e. street scenes, recorded from a moving vehicle with nearly fixed camera pose with respect to the environment. In addition, meta data such as depth and motion information via stereo vision and optical flow, respectively, is readily available (Franke et al., 2013; Geiger et al., 2013). Throughout the thesis, we exploit this prior knowledge. In Chapter 3, we propose a semantic labeling system that leverages multiple input cues throughout its processing stages. In Chapter 4, we introduce a large-scale dataset that is recorded in urban scenes and is accompanied by a benchmark that we use to compare generic semantic labeling methods for the specific scene type at hand. Eventually, we combine what we have learned and propose a compact semantic scene representation that is tailored for urban scenes. In order to embed this dissertation in the literature, we therefore turn our analysis of related work in this section towards those methods that explicitly exploit such domain knowledge. We also include methods for other scene types than urban street scenes as well as for other scene understanding applications than semantic labeling in our discussion. Our analysis in this section can be grouped into two major categories.

In our first category, we address methods that leverage constraints on the overall scene layout. Typical outdoor scenes share a common layered structure of a ground surface at the bottom, the sky at the top, and objects in between, which is modeled by Felzenszwalb and Veksler, 2010; M.-Y. Liu et al., 2015; Scharwächter et al., 2013; Y. Zheng et al., 2012, whereas A. Gupta et al., 2010 model the environment via 3D geometric blocks. Focusing on the dynamic objects in street scenes, i.e. mainly vehicles and pedestrians, prior work exploits that these objects are typically located on a common ground plane at various distances. This knowledge can be used to predominantly improve accuracy (X. Chen et al., 2015; Pan and Kanade, 2013) or run time (Enzweiler et al., 2012; Leibe et al., 2007) as well as for explicit reasoning about occlusion (Wojek et al., 2013) and scene layout (Geiger et al., 2014). Riemenschneider et al., 2014; Valentin et al., 2013 transform the scene at hand into 3D meshes and base their method on this altered representation. Indoor scenes typically impose fewer overall constraints, but nevertheless exhibit a common layout that is considered by the models of D. Lin et al., 2013; Silberman et al., 2012; J. Zhang et al., 2013.

Our second group discusses multi-cue approaches. When facing a constrained scenario, there are often additional cues available, such as depth estimation via stereo vision and laser scanners for street scenes (Geiger et al., 2013), or structured light and time-of-flight cameras for indoor rooms (Song et al., 2015). Pixel-level semantic labeling performance is improved using features based on the depth information of indoor scenes by Couprie et al., 2013; Deng et al., 2015; S. Gupta et al., 2015, 2014, 2016; Hazirbas et al., 2016; Khan et al., 2016; Z. Li et al., 2016; Müller and Behnke, 2014; X. Ren et al., 2012. Furthermore, Eitel et al., 2015; Hou et al., 2016 leverage depth cues for indoor object detection and Silberman et al., 2012 for support structure inference. Kreso et al., 2016; C. Zhang et al., 2010 use stereo vision, Miksik et al., 2013 optical flow, Scharwächter et al., 2014b both, and R. Zhang et al., 2015 laser scans to aid the semantic labeling of street scenes. Kundu et al., 2014; Sturgess et al., 2009 exploit that a vehicle is driving through a widely static 3D world, which is inferred together with the semantic labeling. The same concept additionally supported by stereo vision is at the core of the approaches by Floros and Leibe, 2012; H. He and Upcroft, 2013; Sengupta et al., 2013. Ardeshir et al., 2015 leverage a geographical information system that provides additional information about urban areas as well as GPS for pixel-level semantic labeling. Eigen and Fergus, 2015; Ladický et al., 2014 leverage depth cues during training such that the model learns about the typical 3D scene structure and can infer the structure at test time jointly with the pixel labeling.

3 multi-cue semantic labeling

contents

3.1 Introduction
3.2 Encode-and-Classify Trees
3.2.1 Pixel-level semantic labeling
3.2.2 Texton maps
3.2.3 Combination
3.2.4 Training
3.3 Superpixels
3.3.1 Incorporating depth
3.3.2 Incorporating pixel classification
3.4 Multi-cue Segmentation Tree
3.4.1 Proximity cues
3.4.2 Tree construction
3.5 Multi-cue Region Classification
3.5.1 Multi-cue features
3.5.2 Training
3.6 CRF Inference
3.6.1 Unaries
3.6.2 Parent-child
3.6.3 Inference
3.7 Temporal Regularization
3.8 Experiments
3.8.1 Cues
3.8.2 Datasets
3.8.3 Metrics
3.8.4 Comparison to baselines
3.8.5 Ablation studies
3.8.6 Qualitative results
3.8.7 Run time
3.9 Discussion

In this chapter, we investigate the benefits of leveraging multiple complementary cues for pixel-level semantic labeling. To this end, we propose a multi-cue segmentation tree that yields accurate region proposals. Based on these proposals, we infer the semantic labeling, again with the help of multiple cues. Experiments show that competitive results on two small-scale urban scene understanding datasets are obtained at real-time speeds. The chapter is based on Cordts et al., 2017a and contains verbatim quotes of that work. The article was the result of joint work with Timo Rehfeld. His contributions are the encode-and-classify trees, which provide the appearance information in our multi-cue system and which we hence present as related work in Section 3.2. His implementation and training methodologies of random forests also form the basis of additional classifiers that we propose in the following.

We start this chapter with an introduction in Section 3.1 and the presentation of encode-and-classify trees in Section 3.2. Subsequently, we describe the individual components of our processing pipeline, consisting of superpixels (Section 3.3), a multi-cue segmentation tree (Section 3.4), the region classification step (Section 3.5), CRF inference (Section 3.6), and a temporal filtering (Section 3.7). Eventually, we evaluate the proposed concept (Section 3.8) and discuss the observations (Section 3.9).

3.1 introduction

The setting of general scene understanding can be best analyzed by inspecting the corresponding datasets, most prominently PASCAL VOC (Everingham et al., 2014), ILSVRC (Russakovsky et al., 2015), and Microsoft COCO (T.-Y. Lin et al., 2014). These datasets have in common that they contain images from the wild, i.e. as found on image hosting websites such as Flickr, 2016, c.f. Chapter 1. Therefore, the general scene understanding task requires a large number of object classes to be recognized and allows only few constraints to be imposed on the overall scene layout. In contrast, this dissertation focuses on task-specific settings that typically involve a more restricted domain, such as indoor scenes in NYUD2 (Silberman et al., 2012) or outdoor street scenes in KITTI (Geiger et al., 2013) and DUS (Scharwächter et al., 2013). In those scenarios, the number of different object classes to be recognized is typically much smaller; moreover, assumptions on the scene layout aid the recognition task. While simplifying the problem somewhat, a number of significant challenges remain, such as highly varying object scale and motion, partial occlusions, and strong demands on computational efficiency, e.g. for real-time mobile or robotics applications. Hence, a number of approaches have focused on these task-specific domains (S. Gupta et al., 2014; Scharwächter et al., 2014a; Sengupta et al., 2013; Yao et al., 2012).

In this chapter, we address the task of urban semantic labeling by introducing a system that combines, integrates, and extends well-known components in an efficient way in order to yield excellent results with the lowest possible execution time. The proposed approach follows a common and well-established processing pipeline, c.f. Figure 3.1 and Section 2.1.1. We start by computing a superpixel segmentation of the input image.

Figure 3.1: System overview. Arrows indicate the contribution of each available cue (rows) to the individual processing steps (columns).

A superpixel is considered as the smallest granularity of the image, which means that all pixels within the same superpixel get assigned the same semantic class label. Based on such an over-segmentation, region proposals are generated in the form of a segmentation tree. These proposals are then classified into semantic classes and label consistency is ensured by a Conditional Random Field (CRF). Eventually, we apply a temporal regularization step. We leverage Randomized Decision Forests (RDFs) as classifiers throughout various processing stages.

When comparing methods with a similar system architecture, we make three basic observations that lead to the research goals of this chapter. First, cues that are complementary to the image channel are readily available (Franke et al., 2013; Geiger et al., 2013) and have been shown to significantly improve recognition performance, e.g. depth cues from stereo or RGB-D (Floros and Leibe, 2012; S. Gupta et al., 2014; Kähler and Reid, 2013; Scharwächter et al., 2014a), object detectors (Arbeláez et al., 2012; S. Gupta et al., 2014; Yao et al., 2012), or motion (Floros and Leibe, 2012; Kähler and Reid, 2013; Mičušík and Košecká, 2009; Scharwächter et al., 2014a). However, such cues have not been exploited to their full extent for the task of semantic labeling, since typically only a few cues are applied to some of the processing stages. Therefore, in the remainder of this chapter we will investigate the benefits of such cues throughout the various processing steps of our system. In particular, we leverage appearance cues and depth estimates from stereo vision as well as temporal information via sparse point tracks and bounding box object detections (if available).

The second observation deals with the fact that high quality region proposals are crucial for classification. A multitude of approaches, e.g. Arbeláez et al., 2012; Farabet et al., 2012; S. Gupta et al., 2014; Lempitsky et al., 2011; Lim et al., 2009; Nowozin et al., 2010; X. Ren et al., 2012; Silberman et al., 2012; Yao et al., 2012, relies on the general-purpose hierarchical segmentation of Arbeláez et al., 2011, whereas others (Ladický et al., 2013; Plath et al., 2009; Reynolds and Murphy, 2007) construct a segmentation tree from multiple runs of standard superpixel algorithms, e.g. GBIS by Felzenszwalb and Huttenlocher, 2004. Only very few methods are based on proposals that are specialized for the respective domain or exploit additional cues during superpixel generation, particularly depth (S. Gupta et al., 2014; Scharwächter et al., 2014a; Silberman et al., 2012). In this work, we propose a method to compute region proposals based on all available cues and compare general-purpose ones with specialized variants.

Third, tree-structured models allow for efficient inference, c.f. Blake and Kohli, 2011; Nowozin and Lampert, 2011. We exploit this observation and center our method around a segmentation tree that allows for fast construction, feature generation, classification, and CRF inference.

With these research goals in mind, three approaches are highly related to ours. The work of Silberman et al., 2012 segments indoor scenes into support relations using RGB-D cues. General-purpose superpixels (Arbeláez et al., 2011) are combined into a region proposal tree by incorporating the depth cue similar to our work. Naturally, geometric cues such as surface normals form the driving force for support inference, while we focus on semantic labeling, where depth cues can only aid the overall inference. The work of S. Gupta et al., 2014 exploits tree structures as well as multiple cues, and shows impressive results on the NYUD2 dataset (Silberman et al., 2012). To obtain region proposals, a segmentation tree is generated using a boundary detector based on appearance and depth. The resulting regions are classified using a depth-augmented CNN-based detector. Eventually, region proposals and detections are used for semantic labeling. In contrast to both of these works, we build the segmentation tree using robust and efficient features, leverage the tree structure for efficient inference, and integrate additional cues such as motion and detectors. In doing so, we propose a near real-time method with a focus on outdoor street scenes in contrast to the indoor setup in Silberman et al., 2012 and S. Gupta et al., 2014.

Also related is the efficient urban semantic labeling approach of Scharwächter et al., 2014a, which uses multi-cue semantic labeling and RDFs. Competitive results are shown by combining fast features with spatio-temporal regularization. However, they only rely on a single layer of greedily generated region proposals and thus cannot recover from errors on this level. In addition, they do not use object detectors and strongly depend on the particular choice of superpixels. We report results in comparison to their work, using the same public benchmark dataset.

3.2 encode-and-classify trees¹

In order to obtain appearance information as part of our multi-cue input data, c.f. Figure 3.1, we opt for encode-and-classify trees as proposed in Cordts et al., 2017a.

¹ This section is based on the work and the contributions of Timo Rehfeld in (Cordts et al., 2017a). The section is included in this dissertation for completeness.

Figure 3.2: Structure of the encode-and-classify trees. The trees are used for pixel-level classification and the generation of texton maps; an exemplary texton map using a single tree and the resulting pixel-level classification are shown at the top. Encode-and-classify trees are decision trees, where certain nodes are marked as encoder nodes (bottom right). These nodes are used to extract the texton maps and form the leaves of a sub-tree. This sub-tree operates on features extracted from rather small patches of size S1 × S1 (bottom left) to obtain more discriminative texton maps. The remaining nodes in the encode-and-classify trees have access to a larger patch of size S2 × S2 in order to achieve accurate pixel-level classification. Courtesy of Timo Rehfeld.

These classifiers are a variant of RDFs and simultaneously generate pixel-level semantic label scores as well as texton maps, which in turn are used for segment-level bag-of-features histograms (Moosmann et al., 2008; Shotton et al., 2008). This concept is visualized in Figure 3.2. We start by discussing both tasks individually in Sections 3.2.1 and 3.2.2 and describe their combination in Section 3.2.3. We conclude the section with a description of the training procedure. Note that we introduce RDFs and their training procedure in greater detail, since we will adopt the same concepts multiple times throughout the remainder of the chapter. The output of the encode-and-classify trees delivers powerful appearance and texture information and is subsequently used for superpixel generation, segmentation tree construction, and region classification. Note that the latter two steps have no direct access to the input image and all their appearance cues stem from the encode-and-classify trees.

3.2.1 Pixel-level semantic labeling

The first component of the encode-and-classify trees is pixel-level semantic labeling, where the task is to assign a probability score for each considered semantic class to each pixel in the image. These probability maps can then be used as soft semantic information in the subsequent processing stages, or we can compute the maximizing class label at each pixel. These labels could in principle serve as the final output of our system, but are typically too noisy to achieve high-quality results. The pixel-level labeling consists of two steps. First, various feature channels are extracted from the input data. Second, a pixel in the image is classified using an RDF classifier based on the features within a small surrounding patch. Note that for efficiency reasons only a sub-sampled set of pixels is actually classified and the remaining pixels obtain their results via nearest neighbor interpolation. The isolated pixel labeling component has been previously proposed in Scharwächter and Franke, 2015.

feature extraction

The first step for the pixel-level labeling is to extract low-level feature maps of the input data. The simplest feature channel is the input image itself in the form of an illumination channel and the two normalized color components. Further, the input image is filtered using a subset of the filter-bank proposed by Leung and Malik, 2001. In addition, depth information from stereo vision is leveraged, which supports the discrimination of geometric classes, such as ground vs. object. The disparity map yields two feature channels: the vertical disparity gradient and the height of each pixel above a planar ground plane. For details on these feature channels, the reader is referred to Scharwächter and Franke, 2015.

pixel classification

Once the feature channels are computed, a pixel is classified using a feature vector, which is the concatenation of all feature channels within a surrounding patch of size S × S. As classifier, an RDF is selected for pixel-level labeling, which was, prior to the recent popularization of CNNs (Chapter 4), a common choice, e.g. Fröhlich et al., 2012; Müller and Behnke, 2014; Shotton et al., 2008. RDFs have a fast inference time and work well on diverse features from different domains with heterogeneous scales and properties. Such an RDF is a set of trees that individually assign class probability scores to the feature vector. The final classification result is then the average over all trees. A single tree is a collection of decision stumps that are arranged in a tree structure, where each decision stump decides, based on an internal criterion on the feature values, whether to continue with the left or right child, c.f. Figure 3.2. Each leaf node in the tree is assigned a class probability map that is used as the tree's output once the leaf is reached. The decision stumps can be arbitrary binary classifiers; however, in this work they are restricted to simple binary comparisons between a feature value and an internal threshold. The index of the feature value and the threshold are determined during training. For details, the reader is referred to Scharwächter and Franke, 2015. Important for the pixel classification performance are the patch size S and a sufficient tree depth. Experiments in Cordts et al., 2017a show that medium to large patch sizes are best, since only then the required context knowledge for an accurate classification is available. Further, the tree needs to be sufficiently deep to be able to perform discriminative decisions.
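The routing of a feature vector through such a tree of binary comparison stumps can be sketched in a few lines of Python. The node layout below is hypothetical and only illustrates the inference procedure, not the implementation of Scharwächter and Franke, 2015:

```python
import numpy as np

# Hypothetical flat node layout of a single tree: internal nodes hold a
# feature index and a threshold; leaves (feature index -1) hold class scores.
tree = {
    "feature":   np.array([2, 0, -1, -1, -1]),
    "threshold": np.array([0.5, 1.2, 0.0, 0.0, 0.0]),
    "left":      np.array([1, 3, -1, -1, -1]),
    "right":     np.array([2, 4, -1, -1, -1]),
    "scores":    {2: [0.9, 0.1], 3: [0.2, 0.8], 4: [0.6, 0.4]},
}

def classify(x, tree):
    """Route feature vector x through the decision stumps to a leaf."""
    n = 0
    while tree["feature"][n] >= 0:
        f, t = tree["feature"][n], tree["threshold"][n]
        n = tree["left"][n] if x[f] < t else tree["right"][n]
    return np.asarray(tree["scores"][n])

# An RDF averages the leaf distributions over all trees in the forest.
forest = [tree]  # several independently trained trees in practice
print(np.mean([classify(np.array([0.3, 2.0, 0.7]), t) for t in forest], axis=0))
```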

3.2.2 Texton maps

The second purpose of the encode-and-classify trees is to produce texton maps that in turn are used to generate region-level bag-of-features histograms, which serve as discriminative features for region classification. Texton maps are a form of feature encoding, where a large set of feature values is condensed into a code-book of a smaller discrete set of values that facilitate discrimination. Such a feature encoding can be conducted via RDFs, which are in this context also referred to as Extremely Randomized Clustering (ERC) forests (Moosmann et al., 2008; Shotton et al., 2008). The core idea is to leverage an RDF for pixel classification, c.f. Section 3.2.1, but instead of assigning the leaf node probability map to a pixel, all leaf nodes in the forest are enumerated and the indices of the reached leaf nodes (one for each tree in the forest) are assigned to the pixel (Scharwächter et al., 2013). An example texton map for a single tree is shown in Figure 3.2, where indices are visualized by false colors.

Once the texton maps are computed, they can be used to generate region-level bag-of-features histograms. A given region in the image is described by a histogram over the leaf indices assigned to the pixels within that region. These histograms form a feature descriptor of the region that can be classified into a semantic class. Due to the superior performance on these types of features, this final classification step is performed via an SVM with Histogram Intersection Kernel (HIK; J. Wu, 2010). The classifier is trained in a one-vs.-all mode and a sigmoid mapping (Platt, 1999) is applied to convert the SVM output to a positive score.

In contrast to the pixel classification task, it turns out that the region classification performance is best when the patch size S of the RDF is fairly small, c.f. Cordts et al., 2017a. Intuitively, smaller patch sizes lead to less correlation in the leaf node indices of neighboring pixels and hence the pooled histograms are more discriminative. Furthermore, the trees for feature encoding are required to be fairly small, since the region classification run time increases quickly with increased tree depth.
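Both the bag-of-features pooling and the Histogram Intersection Kernel are simple to state in code. The following NumPy sketch uses toy data and is only meant to illustrate the computation:

```python
import numpy as np

def texton_histogram(texton_map, region_mask, n_leaves):
    """L1-normalized bag-of-features histogram over leaf indices in a region."""
    h = np.bincount(texton_map[region_mask], minlength=n_leaves).astype(float)
    return h / max(h.sum(), 1.0)

def hik(h1, h2):
    """Histogram Intersection Kernel (J. Wu, 2010) between two histograms."""
    return np.minimum(h1, h2).sum()

# Toy example: a 4x4 texton map (one leaf index per pixel) from a single tree.
textons = np.random.randint(0, 8, size=(4, 4))
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:2] = True
print(hik(texton_histogram(textons, mask_a, 8),
          texton_histogram(textons, ~mask_a, 8)))
```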

3.2.3 Combination

The previous two sections introduced the two tasks that the RDFs have to accomplish. Since both tasks have conflicting requirements on the patch size S and the tree depth, they cannot be simultaneously addressed by standard RDFs. To overcome these limitations, while keeping the computational efficiency high, encode-and-classify trees were introduced in Cordts et al., 2017a. The idea is to leverage sub-trees that are fairly short and have access to a small patch size S1 for feature encoding, and to continue with deeper tree nodes that use features from a larger patch S2, c.f. Figure 3.2. In doing so, both tasks achieve a performance close to their individual optima, while keeping the run time low.

3.2.4 Training

Training the encode-and-classify trees is performed in two steps and follows the work of Geurts et al., 2006 to create extremely randomized binary trees. First, the decision stumps are restricted to unary pixel tests based on a small patch of size S1 × S1 around the current pixel. To recap, this patch is represented by a feature vector, which is the concatenation of all values within the patch from various feature channels. To generate the split tests, a dimension of the feature vector or, equivalently, a pixel location and feature channel, is sampled and the feature value is compared against a randomly drawn threshold. Each such split then separates the training samples D_n at node n into disjoint subsets D_l and D_r. For each decision stump, a fixed number of split tests is sampled and the split with the largest information gain

I = E(D_n) − (|D_l| / |D_n|) E(D_l) − (|D_r| / |D_n|) E(D_r)    (3.1)

w.r.t. the class label distribution is chosen. The function E(D) denotes the Shannon entropy of the empirical probability distribution over the samples in D. Training continues recursively until no further subdivision is possible, which means that all trees are trained to full depth. In other words, if all feature vectors had unique values, each leaf node in each tree would be reached by a single feature vector of the training set. Since such trees are heavily over-fitted to the training set, they are pruned back bottom-up until the desired number of encoder nodes is reached, giving full control over the targeted histogram length. Pruning is an iterative procedure, where nodes are randomly selected out of those split nodes where both children are leaves. These children are then removed, the selected node becomes a new leaf, and the next iteration is started. The described steps are repeated multiple times, once for each tree in the RDF, while each tree is trained on a random subset of the training set in order to increase the diversity amongst different trees. Eventually, this training procedure yields the subtrees with their leaves being the encoder nodes.

Second, starting at the encoder nodes, training is continued to obtain a well-performing pixel classifier, while this time the unary pixel tests have access to the larger S2 × S2 patches. Again, the RDFs are trained to full depth and random pruning is performed until a desired number of leaf nodes is obtained. During pruning, encoder nodes are excluded from becoming pruning candidates.
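The split selection of Equation (3.1) translates directly into code; the NumPy sketch below scores a single candidate split test:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy E(D) of the empirical class distribution."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, feature_values, threshold):
    """Gain of the split test 'feature < threshold', c.f. Equation (3.1)."""
    left = labels[feature_values < threshold]
    right = labels[feature_values >= threshold]
    n = labels.size
    return (entropy(labels)
            - left.size / n * entropy(left)
            - right.size / n * entropy(right))

# A fixed number of (feature index, threshold) candidates is sampled per
# stump; the candidate with the largest gain is kept.
y = np.array([0, 0, 1, 1, 1])
f = np.array([0.1, 0.2, 0.8, 0.9, 0.7])
print(information_gain(y, f, 0.5))  # perfect split: gain equals entropy(y)
```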

3.3 superpixels

Superpixels are often considered as the smallest unit for semantic labeling, which means that all pixels within a superpixel are assigned the same semantic label. Such a hard constraint can be exploited to decrease the run time of the algorithm, since many components need to reason about only a few hundred superpixels instead of millions of pixels. Further, superpixels allow rich features to be extracted within their image region, and long-range dependencies between pixels can be efficiently modeled.

In this work, we opt for three types of superpixels: Superpixels in an Energy Optimization Framework (SEOF; Veksler et al., 2010), Graph Based Image Segmentation (GBIS; Felzenszwalb and Huttenlocher, 2004), and Stixels (Pfeiffer and Franke, 2011b). This selection is based on the main properties of these superpixel methods. All three approaches have in common that they either explicitly formulate weights between two adjacent pixels (GBIS), or solve an underlying energy minimization problem (Stixels), or both (SEOF). Both cases allow for a straightforward incorporation of additional cues as described in Sections 3.3.1 and 3.3.2. Further, the three methods can be seen as representatives of different fundamental types of superpixels. In SEOF and many other superpixel methods, e.g. Simple Linear Iterative Clustering (SLIC; Achanta et al., 2012), the superpixels are hard constrained to a maximum size, which enforces compactness and increases computational efficiency. In contrast, GBIS only supports compactness, but, in principle, superpixels can become arbitrarily large. While SEOF and GBIS were designed for a segmentation of generic images, Stixels are highly optimized for street scenes and model their typical geometric layout of a ground plane with perpendicular obstacles. Note that Stixels are presented in greater detail in Chapter 5. Overall, the selection of superpixels allows us to compare the influence of the different properties on the overall system performance.

Further, we demonstrate the generality of our system with respect to different superpixel methods. To bring all three variants to a comparable level, we introduce modifications as described in the following sections.

3.3.1 Incorporating depth

As Stixels use depth information, we also add depth to SEOF and GBIS. Both superpixel variants explicitly formulate weights between two adjacent pixels. We extend these weights to use the disparity image D in addition to the color or gray value image I for obtaining a segmentation that accurately captures both kinds of edges. For GBIS, we define the weight between two adjacent pixels (p, q) ∈ N as

w_pq = (1/µ_i) |I(p) − I(q)| + (1/µ_d) |D(p) − D(q)| .    (3.2)

The normalization terms

µ_i = (1/|N|) Σ_{(m,n)∈N} |I(m) − I(n)|    (3.3)

and analogously for µ_d are computed as the average absolute differences of all adjacent pixels and balance both channels against each other. For SEOF, the weights are defined similarly as

w_pq = exp(−(I(p) − I(q))² / (2σ_i²)) + exp(−(D(p) − D(q))² / (2σ_d²)) .    (3.4)

Again, the normalization is the average contribution of all neighborhoods and for σ_i computed as

σ_i² = (1/|N|) Σ_{(m,n)∈N} (I(m) − I(n))² .    (3.5)

Finally, the median disparity is assigned to each superpixel in order to obtain a robust depth estimate. Superpixels without valid disparity measurements due to stereo occlusions are labeled as void and removed from further processing, c.f. the transparent areas in Figure 3.3.
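As a concrete illustration of Equations (3.2) and (3.3), the following NumPy sketch computes the multi-cue GBIS edge weights on a toy image; pixel indices are flattened and the edge list is assumed to be given:

```python
import numpy as np

def gbis_weights(I, D, edges):
    """Multi-cue GBIS edge weights, c.f. Equations (3.2) and (3.3).
    I, D: gray value and disparity images; edges: (p, q) flat index pairs."""
    p, q = edges[:, 0], edges[:, 1]
    di = np.abs(I.ravel()[p] - I.ravel()[q])
    dd = np.abs(D.ravel()[p] - D.ravel()[q])
    mu_i, mu_d = di.mean(), dd.mean()  # balance the two channels
    return di / mu_i + dd / mu_d

# Toy 2x2 image with its four 4-neighborhood edges.
I = np.array([[0.0, 1.0], [0.0, 1.0]])
D = np.array([[5.0, 5.0], [9.0, 9.0]])
edges = np.array([[0, 1], [2, 3], [0, 2], [1, 3]])
print(gbis_weights(I, D, edges))
```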

3.3.2 Incorporating pixel classification

Stixel superpixels are extracted from depth information only and do not take the gray value or color information into account. To also make Stixels comparable to GBIS and SEOF, we alter the computation to use the pixel classification scores as an additional data term, as presented in Scharwächter and Franke, 2015.

Figure 3.3: Three superpixel variants we employ at the lowest level of our region proposal tree. From left to right: SEOF (Veksler et al., 2010), GBIS (Felzenszwalb and Huttenlocher, 2004), and Stixels (Pfeiffer and Franke, 2011b). In all variants, superpixels on the ground surface and sky region are already grouped (see text) and not considered in the segmentation tree. Superpixels with invalid depth information are ignored and visualized transparently.

These modifications help especially in regions with weak disparity information, e.g. the sky. Since the Stixel representation explicitly separates ground and sky regions, we also do so for GBIS and SEOF. Specifically, we first compute the average score of our pixel classifier in each superpixel. Superpixels for which sky or ground have the maximal average classification scores are removed from further consideration. Consequently, all three variants deliver a comparable representation, visualized in Figure 3.3, where the remaining obstacle superpixels are highlighted in red. The segmentation tree as introduced in Section 3.4 is constructed from these obstacle superpixels only.
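Averaging the pixel classifier scores per superpixel and dropping ground and sky superpixels can be sketched with NumPy; the class indices and toy data below are assumptions for illustration only:

```python
import numpy as np

def average_scores(sp_labels, pixel_scores, n_superpixels):
    """Average the pixel classifier scores within each superpixel."""
    n_classes = pixel_scores.shape[-1]
    sums = np.zeros((n_superpixels, n_classes))
    np.add.at(sums, sp_labels.ravel(), pixel_scores.reshape(-1, n_classes))
    counts = np.bincount(sp_labels.ravel(), minlength=n_superpixels)
    return sums / np.maximum(counts, 1)[:, None]

# Superpixels whose maximal average score is ground or sky are dropped;
# only the remaining 'obstacle' superpixels enter the segmentation tree.
GROUND, SKY = 0, 1  # assumed class indices
sp = np.random.randint(0, 50, size=(16, 32))  # superpixel id per pixel
scores = np.random.rand(16, 32, 4)            # pixel classifier scores
avg = average_scores(sp, scores, 50)
obstacles = np.where(~np.isin(avg.argmax(axis=1), [GROUND, SKY]))[0]
```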

3.4 multi-cue segmentation tree

A central component in our processing pipeline is the multi-cue segmentation tree, which provides region proposals for later processing. While superpixels typically provide a non-overlapping over-segmentation of the image such that no meaningful edges are missed, region proposals are requested to be larger and represent hypotheses of image regions with a single semantic class. Such a proposal set typically contains overlapping regions forming an over-complete set such that ideally each object in the scene is accurately covered by one region. This maximizes its spatial support, which is especially beneficial for the robustness of region classifiers as used in Section 3.5, since reasoning about the geometry of an object is supported by an accurate segmentation of the object as a whole. Further, appearance-based classifiers perform better on larger regions and therefore a maximal spatial support is desired. Overall, the task of the segmentation tree is to provide overlapping region proposals, such that the objects in the scene are covered by accurate segments, while at the same time keeping the number of regions low for computational efficiency.

We opt for region proposals in the form of a segmentation tree that is constructed by iteratively merging image regions, starting with superpixels, c.f. Section 3.3, as the leaf nodes. Such a tree structure has two main advantages. First, street scenes exhibit a significant range of scales, meaning that there are often simultaneously objects close to the camera occupying large image regions, as well as distant or occluded objects with only a few pixels of size. Such a variation of scale matches well with a segmentation tree, where the lower levels provide small proposals, while segments in higher levels span large image regions. Second, the tree structure allows for a computationally efficient system design. Features on a region level, such as histograms, geometric information, occlusion cues, and average pixel classifier scores, can be computed in a cumulative fashion. Due to the tree structure, they need to be computed only once for all superpixels on the lowest level, and can then be accumulated from leaf to root.

Because the segmentation tree consists of multiple levels, each superpixel is covered by several overlapping regions. The final label decision is thus made in a subsequent inference stage; a tree-structured CRF based on the regions in the segmentation tree allows us to do this in an efficient and globally consistent way.
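The leaf-to-root accumulation of additive region features can be sketched as a simple recursion over the tree; the node ids and feature vectors below are hypothetical:

```python
import numpy as np

def accumulate(children, leaf_features):
    """Accumulate additive region features (e.g. texton histograms) from the
    superpixel leaves towards the root, computing each only once."""
    feats = dict(leaf_features)  # node id -> feature vector, filled bottom-up

    def collect(node):
        if node not in feats:
            feats[node] = sum(collect(c) for c in children[node])
        return feats[node]

    for node in children:
        collect(node)
    return feats

# Hypothetical tree: node 4 merges leaves 0 and 1; node 5 merges 4 and leaf 2.
children = {4: [0, 1], 5: [4, 2]}
leaves = {i: np.ones(3) * (i + 1) for i in range(3)}
print(accumulate(children, leaves)[5])  # [6. 6. 6.] = (1 + 2) + 3 per bin
```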

3.4.1 Proximity cues

For constructing the segmentation tree, we start by grouping superpixels based on similarity measures. Instead of hand-crafting these measures, we use a binary classifier on adjacent superpixels incorporating multiple proximity cues. The definitions of these features are summarized in Table 3.1, while we discuss the motivation behind the individual cues in the following. All features are symmetric and are designed to be invariant across scale.

2d region geometry

Solely from visually inspecting the 2D segments, one gets a rough estimate about which superpixels are more or less likely to belong to the same object. This human intuition is mainly driven by the relative location of the superpixels. Therefore, we design a set of features that compactly describes such properties. Four features (G1–G4) are based on the superpixels' bounding boxes and are computed as the distances of the four respective edges, e.g. the distance in pixels between the top edges of the two bounding boxes of adjacent superpixels. The next three features (G5–G7) describe the relative location of the two region centers, i.e. horizontal and vertical distance, as well as the absolute sine of the angle between their connecting line and the horizontal axis. Feature G8 captures the proximity in terms of a common boundary and is defined as the ratio of common boundary pixels to disjoint ones. The last 2D geometric feature, G9, reflects the region sizes and is computed as the ratio of the size difference (in pixels) to the sum of both sizes.

2D region geometry
G1, G2   bounding boxes: abs. top (bottom) coordinate difference
G3, G4   bounding boxes: abs. left (right) coordinate difference
G5, G6   absolute center row (column) difference
G7       horizontalness, i.e. absolute sine of the angle α between the line connecting the region centers and the horizontal axis: |sin α| = G5 / √(G5² + G6²)
G8       boundary pixel (BP) ratio: |∩BP| / |∪BP \ ∩BP|
G9       relative size (S) difference: |S_1 − S_2| / (S_1 + S_2)

Depth
D1, D2   absolute depth (disparity) difference
D3, D4   average depth (disparity)
D5       D1 minus depth uncertainty at average depth
D6–D9    3D absolute top (right, bottom, left) difference
D10      average center height above ground

Pixel classification
PC1      inner product of average pixel classifier scores
PC2      agreement of arg max pixel classifier labels: 2 if same label, 0 if different label, value shifts towards 1 with decreasing confidence

Texton histograms
TH1      texton histogram intersection

Detectors
DT1      joint coverage by vehicle detector: 2 if both, 1 if one, 0 if neither of the two superpixels is covered by the bounding box
DT2      DT1 with pedestrian detector

Point tracks
PT1      cosine of angle between velocity vectors
PT2      absolute velocity difference [m s⁻¹]
PT3      average velocity [m s⁻¹]

Table 3.1: Features for segmentation tree construction. Each feature is extracted for a pair of adjacent superpixels. A binary RDF classifier is trained on these features to decide whether the pair of adjacent superpixels should be grouped or not.

depth

Since superpixels belonging to the ground plane or the sky are removed from further processing based on the pixel classification, c.f. Section 3.3.2, the segmentation tree is constructed for the remaining objects only. In street scenes, such objects are typically perpendicular to the ground plane and are located at different distances. Thus, the gap in depth between two neighboring superpixels (D1) is a strong cue for region proposal generation, c.f. Scharwächter et al., 2014a. However, since the noise of depth measurements as obtained by stereo vision increases quadratically with the distance, we do not rely on this feature alone. Instead, we add the absolute disparity difference (D2), as well as the average depth and disparity values of the two superpixels (D3, D4). In doing so, the proximity classifier can learn to trust a possible depth gap differently, depending on the superpixels' distances. To further support such reasoning, we add D5, which is the absolute distance difference reduced by the uncertainty of distance measurements at the superpixels' average distance, according to J. Schneider et al., 2016. Another set of features is based on the 3D bounding boxes around superpixels. Since depth measurements from stereo can be noisy and are prone to outliers, such a bounding box needs to be computed robustly. In order to do so, we leverage the median disparity that is assigned to each superpixel, c.f. Section 3.3.1, and project the 2D bounding box into 3D space, yielding robust superpixel extents in the horizontal and vertical direction, but not in the direction of the camera axis. The differences between the four respective edges of these bounding boxes yield D6–D9. Eventually, we add D10, which is the average height above ground of the two superpixels, to allow for a different reasoning at various height levels in the scene.

appearance

Our appearance-based similarity measures are provided by the encode-and-classify trees: pixel classifier label agreement (PC1, PC2) and texton histogram intersection (TH1). The intuition behind the pixel classifier features is that two superpixels belonging to the same object should agree in terms of the semantic labels of the covered pixels as predicted by the pixel classification, c.f. Section 3.2.1. Therefore, the pixel-level semantic scores are averaged over all pixels within each superpixel, yielding a more robust score vector. PC1 is the inner product of these vectors of two adjacent superpixels and expresses the label agreement while considering all semantic scores. In addition, we determine the most likely labels l_a and l_b as well as their probability scores p_a and p_b. Then the feature PC2 is defined as

PC2 = 1 − min(p_a, p_b)    if l_a ≠ l_b
PC2 = 1 + min(p_a, p_b)    otherwise .    (3.6)

This value is between zero and one if both labels disagree, and between one and two otherwise. If both labels have a high confidence, the value is close to zero upon disagreement and close to two upon agreement. In case of at least one small confidence score, the value is close to one.

In addition to the label agreement, we encode the appearance similarity by leveraging the texton histograms obtained by the encode-and-classify trees, c.f. Section 3.2.2. Inspired by the Histogram Intersection Kernel (HIK; J. Wu, 2010) as used in SVM classifiers to compare histograms, we use the texton histogram intersection between two superpixels as feature TH1. These histograms are designed to encode the appearance of a region and hence their distance is a powerful proximity score.

detectors

Often, additional semantic knowledge from bounding box detectors, e.g. for Advanced Driver Assistance Systems (ADAS), is readily available. Such bounding boxes typically contain whole objects and therefore provide strong proximity cues for our superpixels. For each available detector (vehicle and pedestrian in our experiments, c.f. Section 3.8.1), we add a similarity feature DTi that can take three discrete values: 2 if two adjacent superpixels are predominantly covered by an identical bounding box, 1 if not, but at least one superpixel is covered by a bounding box, and 0 otherwise.

point tracks

The last cue that we consider in this work are sparse point tracks. These are typically obtained via local feature trackers that match feature points in temporally consecutive frames. Paired with vehicle odometry and stereo vision, we obtain 3D position and velocity estimates for each track; please refer to Section 3.8.1 for our experimental setup. To obtain a robust velocity estimate of each superpixel, we compute the median velocity vector of all tracks within each superpixel for each principal axis. Since most objects in street scenes are either static or move rigidly, we expect superpixels that belong to the same object to move in the same direction with the same speed. Thus, we compute the features PT1 and PT2, being the direction difference (cosine of the angle between the velocity vectors) and the absolute speed difference, respectively. Since the measurement noise increases with increased speed, we follow a similar strategy as with the depth cues and the stereo noise, see above. We add the feature PT3, which is the average velocity of both superpixels, such that the proximity classifier can trust these features differently, depending on the absolute speed.
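Equation (3.6) amounts to a few lines of code; the sketch below evaluates PC2 on the averaged score vectors of two adjacent superpixels:

```python
import numpy as np

def pc2(scores_a, scores_b):
    """Label agreement feature PC2, c.f. Equation (3.6)."""
    la, lb = scores_a.argmax(), scores_b.argmax()
    m = min(scores_a[la], scores_b[lb])
    return 1.0 + m if la == lb else 1.0 - m

print(pc2(np.array([0.9, 0.1]), np.array([0.8, 0.2])))  # confident agreement: 1.8
print(pc2(np.array([0.9, 0.1]), np.array([0.2, 0.8])))  # confident disagreement: 0.2
```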

3.4.2 Tree construction

Using the proximity features introduced in the previous section, we train a binary classifier to group or separate pairs of superpixels. This task is related to the boundary strength classifier in Silberman et al., 2012, the same-label classifier in Hoiem et al., 2007, or classifier-based pairwise potentials in CRFs, e.g. Huang et al., 2011. As classifier, we opt for a Randomized Decision Forest (RDF), since these classifiers can cope well with such a diverse set of features. For training, we follow the strategy described in Section 3.2.4, i.e. we train until full depth and apply random pruning. We treat all adjacent superpixels with identical majority ground truth label as positive samples and other pairs as negative. If both superpixels have the void label assigned, they do not contribute as training samples.

After running the trained classifier on all pairs of adjacent superpixels, we assign the classifier scores as weights to the respective connecting edges. The segmentation tree is constructed using Algorithm 3.1, which only requires two parameters, the merging rate p and the number of levels n_l. Both parameters are independent of the superpixel method and the actual number of superpixels. During the edge update in the last step of the while loop in Algorithm 3.1, we also need to assign new weights to the edges connecting the regions in the new layer of the segmentation tree. For each pair of adjacent regions, we use the minimum weight of all edges across the region boundaries in the layer below. We also experimented with using the proximity classifier again on the larger regions or training an individual classifier for each layer in the segmentation tree. However, we found that all variants yield comparable performance and opted for the most efficient solution. As an example, three layers of our segmentation tree are visualized in Figure 3.4. In all our experiments, we construct the segmentation tree with a merging rate of p = 0.2 until we obtain n_l = 10 levels.

Algorithm 3.1 Segmentation tree construction
Input: Superpixels, edges E
Parameters: Merging rate p, number of levels n_l
  Level l ← 1
  Regions in first level R_l ← Superpixels
  while l < n_l do
    l ← l + 1
    Threshold w_th ← p-th percentile of weights(E)
    Merged edges E_m ← {e | e ∈ E, weight(e) ≤ w_th}
    R_l ← merge(R_{l−1}, E_m)
    E ← update(E, R_l)
  end while
Output: Tree {R_l | l ∈ {1 . . . n_l}}
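A self-contained Python sketch of Algorithm 3.1 is given below; regions are represented by one label per superpixel and merged with a union-find structure, and the edge update keeps the minimum weight across each new region boundary. This is an illustration of the algorithm, not the actual implementation:

```python
import numpy as np

def build_tree(n_superpixels, edges, weights, p=0.2, n_levels=10):
    """Sketch of Algorithm 3.1: iteratively merge the cheapest edges."""
    parent = list(range(n_superpixels))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    levels = [np.arange(n_superpixels)]
    for _ in range(n_levels - 1):
        if len(weights) == 0:
            break
        w_th = np.percentile(weights, 100 * p)  # merging rate p
        for (a, b), w in zip(edges, weights):
            if w <= w_th:
                parent[find(a)] = find(b)
        levels.append(np.array([find(i) for i in range(n_superpixels)]))
        # Edge update: minimum weight over all edges crossing a boundary.
        merged = {}
        for (a, b), w in zip(edges, weights):
            ra, rb = find(a), find(b)
            if ra != rb:
                key = (min(ra, rb), max(ra, rb))
                merged[key] = min(w, merged.get(key, np.inf))
        edges, weights = list(merged.keys()), np.array(list(merged.values()))
    return levels

# Toy example: four superpixels in a chain with increasing boundary weights.
print(build_tree(4, [(0, 1), (1, 2), (2, 3)], np.array([0.1, 0.5, 0.9]),
                 p=0.4, n_levels=3))
```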

Figure 3.4: Segmentation tree and resulting CRF model. Regions are encoded by false colors, CRF nodes by black dots, and parent-child relations by lines. Highest level: may contain multiple regions, each resulting in an independent tree. Medium level: regions repeated from lower levels do not induce a new CRF node. Lowest level: region proposals become CRF nodes.

3.5 multi-cue region classification

Once the segmentation tree is computed, each segment is considered as a region proposal that is classified by three different classifiers. For each class of interest, the classifiers assign probability scores indicating the likelihood of the region being the respective class. Since our regions overlap and the classifiers might produce contradicting scores, the final pixel-level labels are inferred using a CRF as described in Section 3.6, where the scores serve as unary potentials. In order to maximize the efficiency of our system, all applied classifiers are based on features that can be efficiently computed for each region by exploiting the tree structure and accumulating the features from the leaf nodes towards the root node. The first two classifiers are responsible for classifying regions based on their appearance and build on the encode-and-classify trees as described in Section 3.2. First, we average the RDF pixel classification results (c.f. Section 3.2.1) within the region, yielding s_PC. Second, we classify the aggregated texton histograms (c.f. Section 3.2.2) to obtain s_TH. Third, we propose a multi-cue region classifier providing s_MC. This classifier exploits features from the remaining input channels, c.f. Figure 3.1. Both the features and the classifier training are described in the following.

3.5.1 Multi-cue features

Our features for multi-cue region classification describe the major properties of a proposal region that are discriminative between the different classes but are agnostic of the pixel-level texture. Such features include region geometry, as well as detector and point track information; an overview is provided in Table 3.2.

region geometry

The first set of features aims to describe the shape and geometry of a region in both 2D and 3D. The first feature, G1, is the 2D aspect ratio of the bounding box around the region, which is for example helpful to discriminate upright pedestrians from elongated or square vehicles. Further, we allow the classifier to assess the type of boundary pixels. Buildings are often truncated by the image boundary, whereas objects such as pedestrians or vehicles are located on top of the ground. Therefore, we divide the set of region boundary pixels into four directions (top, bottom, left, right) with respect to the region center. For each direction, we count the number of boundary pixels that separate the region from another one (B_s) or from the image border, ground, sky, or void (B_v), c.f. Figure 3.3, which contains superpixels with all such neighboring constellations. Then the ratio B_s / (B_v + B_s) yields the features G2–G5 for the four directions.

In street scenes, the height above ground is the most discriminative 3D feature, since different classes typically occur at different heights, e.g. vehicles and pedestrians vs. buildings and sky. Therefore, we define the height above ground of the top and bottom point of a 3D bounding box around the region as D1 and D2, respectively. Note that we compute the bounding box extents by aggregating the robust superpixel bounding boxes that were obtained as described in Section 3.4.1. Additionally, we add the height, width, and depth of this 3D bounding box, as well as the center height above ground, as features D3–D6. The last set of geometric features deals with occlusions. While shape and region geometry are highly discriminative for fully visible objects, they might cause confusions when occlusions are present, as common in street scenes. Such occlusions cause the shape to alter and, for example, an occluded car might appear as a pedestrian. Thus, we compiled a collection of features that describes the occluded parts of the regions in order to allow the classifier to leverage the full potential of the remaining features. Using the same definitions for boundary pixels as above, we determine the depth jump in meters for all boundary pixels between object regions (B_s). These values are then averaged for the four directions, yielding D7–D10. In doing so, we avoid thresholding to determine the occlusions and allow for a learned interpretation that takes measurement noise into account. A large positive jump means that the boundary part is probably not occluded, while a negative value stands for an occlusion. A value around zero typically corresponds to an edge between close objects or a boundary within the same object. Note that we use the robust superpixel distances as described above to maximize the robustness of the occlusion features.

detectors and point tracks

Detectors for ADAS typically operate at a high level of precision, i.e. regions that are covered by a detector bounding box most likely belong to the detector's class.

2D region geometry
G1         bounding box aspect ratio
G2–G5 (a)  there are two kinds of boundary pixels: those separating object regions (B_s) and those separating a region from the image border, ground, sky, or void (B_v), c.f. Figure 3.3, which contains superpixels with all such neighboring constellations. This feature is defined as B_s / (B_v + B_s)

Depth
D1, D2     top (bottom) point height above ground
D3–D5      height (width, depth) [m]
D6         center height above ground [m]
D7–D10 (a) occlusion strength: average depth jump (signed) at boundary pixels B_s

Detectors
DT1        overlap with vehicle detector bounding box
DT2        overlap with pedestrian detector bounding box

Point tracks
PT1        average velocity [m s⁻¹]

(a) Evaluated separately for boundary pixels located at top (right, bottom, left) relative to region center.

Table 3.2: Features for multi-cue region classification. Each feature is extracted on a region proposal of the segmentation tree.

However, even these systems might produce false positives and typically have a rather low recall when objects are heavily occluded. Therefore, we add the detector information as features for our multi-cue classifier, such that it can combine this knowledge with the geometric and occlusion features. For each available detector i, we define DTi as the overlap between the region and the detector's predicted bounding boxes.

The last multi-cue feature leverages the point track information. The speed of an object is a discriminative feature, since fast moving regions in street scenes typically contain vehicles. Thus, the feature PT1 is defined as the average velocity of the tracked points within the region. Note that we assign the median velocity of a superpixel to all tracks within that superpixel to obtain a more robust estimate, c.f. Section 3.4.1.

3.5.2 Training

Based on the multi-cue features described in Section 3.5.1, we train a classifier to obtain a probability score for each class and region. As with the superpixel proximity classifier (c.f. Section 3.4.2), we opt for an RDF due to the diversity of the features stemming from different domains with varying dynamic ranges. For training, we follow the same strategy and train until full depth with subsequent random pruning. Since regions from our segmentation tree might cover pixels with different ground truth labels, we need to cope with this effect during training. A simple solution would be to require the majority label within a region to cover at least 50 % of the pixels. However, in doing so, many training samples are discarded and information about the minority classes is lost. Instead, we assign the region's feature vector to each pixel within the region and use these repeated vectors with the individual ground truth classes for training. As a side effect, regions are weighted by their size, which causes the classifier to respect regions from higher levels of the segmentation tree, which are large but rare. Overall, we obtain a multi-cue classifier that tends to yield soft probability scores rather than hard decisions. These scores can then be fused well with the appearance classifiers via CRF inference, c.f. Section 3.6.

3.6 crf inference

The output of the region classification step as described in Section 3.5 is our segmentation tree, where each region is associated with three sets of class scores. These scores represent the likelihoods of the regions belonging to the respective classes as estimated by the three classifiers described above. However, these scores might disagree and also the scores of overlapping regions in different levels of the segmentation tree might contradict each other.

In order to obtain a consistent labeling, we leverage a Conditional Random Field (CRF) to jointly infer the labels y ∈ Lⁿ of all n regions in the segmentation tree. Each region corresponds to a node Y_i in the CRF with the label space L consisting of the sought-after classes plus an additional void label. The latter is introduced, since the regions might not belong to any of the considered classes or might contain a mixture of those. Therefore, we augment the classifier scores with a void label having an artificial score of |L|⁻¹. Further, we normalize the scores such that they sum up to 1. If a parent region of the segmentation tree has a single child only, both child and parent regions contain the same information. Thus, the parent node is excluded from the CRF, see Figure 3.4. Note that in principle, one could use a Bayesian network as the hierarchy is a Directed Acyclic Graph (DAG) (Nowozin and Lampert, 2011); however, we opt for a CRF to allow for a direct integration of classifier scores as unary potentials.

Given all input data x, the posterior probability is defined as

P(Y = y | X = x) = (1/Z(x)) ∏_{i=1}^{n} Φ_i(y_i) ∏_{(c,p)∈A} Ψ_c(y_c, y_p) ,    (3.7)

where A denotes the set of parent-child relations as defined by the segmentation tree and Z(x) the partition function, c.f. Blake and Kohli, 2011; Nowozin and Lampert, 2011. In the following, the unary potentials Φ_i, the parent-child potentials Ψ_c, and the inference scheme are described. Note that there is no potential connecting nodes within the same level of the segmentation tree, which renders Equation (3.7) tree-structured and allows for efficient inference as described in Section 3.6.3. For modeling smoothness priors between nodes in the segmentation tree, we solely rely on the parent-child relations as described in Section 3.6.2.

3.6.1 Unaries

As described in Section 3.5, for each region i in the segmentation tree, we obtain three different classifier scores, termed s_PC, s_TH, and s_MC. A straightforward way to fuse these scores would be to use their product directly as the region's unary. Instead, we interpret the scores on the pixel level and compute the region likelihood as the product of all pixel scores. In doing so, larger parent regions are weighted up compared to their children nodes, which is beneficial, since larger regions are typically more reliably classified due to a sufficient spatial support. Further, the parent-child relations, c.f. Section 3.6.2, are better balanced, since a parent region has the same influence as the sum of all children and a large child becomes more important than a smaller one. Let n_i denote the number of pixels in region i. Then, the unary is defined as

Φ_i(y_i) = ( s_PC(y_i)^{λ_PC} · s_TH(y_i)^{λ_TH} · s_MC(y_i)^{λ_MC} )^{n_i} ,    (3.8)

where the weights λ_PC, λ_TH, and λ_MC capture the different reliabilities of the individual classifiers. These weights are optimized in initial experiments using grid search on the training set with the PASCAL VOC IoU score (Everingham et al., 2014) as objective function, c.f. Section 3.8.3.
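For numerical stability, a product of per-pixel scores like Equation (3.8) would be evaluated in log space in practice. A minimal sketch, with assumed (not optimized) weights:

```python
import numpy as np

def log_unary(s_pc, s_th, s_mc, n_pixels, lam=(1.0, 1.0, 1.0)):
    """log Phi_i, c.f. Equation (3.8): a weighted product of the three
    classifier scores, raised to the region size n_i (log space avoids
    underflow for large regions). The weights lam here are assumptions;
    in the system they are found by grid search."""
    log_phi = (lam[0] * np.log(s_pc) + lam[1] * np.log(s_th)
               + lam[2] * np.log(s_mc))
    return n_pixels * log_phi

# Normalized scores over the label space (incl. void with score 1/|L|).
s = np.array([0.5, 0.3, 0.2])
print(log_unary(s, s, s, n_pixels=120))
```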

3.6.2 Parent-child

Smoothness priors are expressed using factors between a parent node p and its child nodes c, defined as

Ψ_c(y_c, y_p) = 1           if y_c = y_p or y_p = void
Ψ_c(y_c, y_p) = e^{−n_c}    otherwise .    (3.9)

Again, n_c denotes the number of pixels in the child's region. The influence of this factor is similar to Robust P^N potentials (Kohli et al., 2009) and expresses our expectation that parent and child nodes often belong to the same class. However, if a node is likely to contain multiple classes, it can be assigned to void and does not influence its child nodes anymore. Further, a small node may have a different label than the parent, even if the parent's label is meaningful, e.g. a pedestrian in front of a large building.

3.6.3 Inference

The task of the inference step is to maximize the posterior probability mass function as described in Equation (3.7). However, instead of computing the maximizing labeling, we are interested in obtaining the marginal probabilities P(Y_i = y_i | X = x) for all nodes i and apply a temporal filtering on top as described in Section 3.7. Since Equation (3.7) is tree-structured, running sum-product belief propagation (Blake and Kohli, 2011; Nowozin and Lampert, 2011) is very efficient and provides the desired marginals without approximations. For all n_i pixels in a superpixel-level node i, we subsequently compute P(Y_i = y_i | X = x)^{1/n_i}, normalize, and assign the result as marginal posteriors. The exponent 1/n_i is applied to compensate for the weighting in Equation (3.8), in turn resulting in a smoother probability mass function.
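The renormalization step can be sketched as follows; the sum-product marginals themselves would come from standard belief propagation on the tree and are assumed as given here:

```python
import numpy as np

def pixel_normalized_marginal(node_marginal, n_pixels):
    """Undo the pixel-count exponent of Equation (3.8).

    `node_marginal` is the (possibly unnormalized) marginal of a
    superpixel-level node; taking the 1/n_i-th root and renormalizing
    yields a smoother per-pixel probability mass function.
    """
    m = np.asarray(node_marginal, dtype=float) ** (1.0 / n_pixels)
    return m / m.sum()

# A strongly peaked marginal of a 500-pixel node becomes much softer:
print(pixel_normalized_marginal([1e-30, 1.0 - 1e-30], n_pixels=500))
```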

3.7 temporal regularization

The inference mechanisms described so far exploited temporal information in the form of point tracks via the multi-cue segmentation tree and classifier. However, the actual labeling was conducted individually for each frame and might be inconsistent over time. To overcome this limitation, we apply the time-recursive regularization scheme proposed in Scharwächter et al., 2014a. The core idea is that pixels from different frames belonging to the same point track should ideally have the same semantic label. However, due to imperfect class predictions and erroneous tracks, label switches must be allowed, but should be penalized. Therefore, Scharwächter et al., 2014a model these transitions using a Hidden Markov Model (HMM) filter associated with each point track. In our system, we leverage the pixel-level CRF marginals and feed them to the filters in their update phase, yielding temporally consistent posteriors. Since the point tracks are sparse, but we desire to filter all pixels in the image, we assign the filtered posteriors back to superpixels by averaging over all covered point tracks. The final labeling is then obtained as the label maximizing the marginal posterior, and assigned to all pixels in each superpixel.
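Such per-track filtering can be sketched as a standard forward (filtering) recursion; the simple transition matrix below, which mildly penalizes label switches, is an assumption for illustration and not the parametrization used by Scharwächter et al., 2014a:

```python
import numpy as np

def hmm_filter_track(observations, stay_prob=0.9):
    """Forward-filter a point track's per-frame label posteriors.

    `observations` is a (T, K) array of per-frame class posteriors (here:
    the pixel-level CRF marginals at the track position). Label switches
    are allowed but penalized via the transition matrix.
    """
    observations = np.asarray(observations, dtype=float)
    T, K = observations.shape
    # Stay with high probability, switch uniformly otherwise.
    A = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(A, stay_prob)

    belief = observations[0] / observations[0].sum()
    for t in range(1, T):
        predicted = A.T @ belief                 # predict step
        belief = predicted * observations[t]     # update with new evidence
        belief /= belief.sum()
    return belief

# Three frames of noisy two-class evidence yield a stable filtered posterior.
obs = [[0.8, 0.2], [0.4, 0.6], [0.9, 0.1]]
print(hmm_filter_track(obs))
```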

3.8 experiments

In this section, we analyze our method in various experiments. We start by describing the actual algorithms that we used to obtain the input cues (Section 3.8.1). Subsequently, we introduce the datasets on which we conducted the experiments (Section 3.8.2), followed by a discussion of the evaluation metrics (Section 3.8.3). Next, we provide and discuss quantitative results (Section 3.8.4), conduct ablation studies to gain further insights into our method (Section 3.8.5), and provide exemplary result images (Section 3.8.6). We conclude with a detailed analysis of the run time of our system in Section 3.8.7.

3.8.1 Cues

The cues depicted in Figure 3.1 (left column) are briefly outlined in the following. Note that pixel classification and texton histograms are already described in Sections 3.2.1 and 3.2.2, respectively. In principle, our overall approach is independent of the actual depth, motion, or detector approaches used. Our choice for these experiments is mainly motivated by balancing the trade-off between quality and run time.

depth To integrate depth information from stereo vision, we use dense disparity maps computed using Semi-Global Matching (SGM; Hirschmüller, 2008). Our implementation is based on the real-time Field-Programmable Gate Array (FPGA) adaptation described in Gehrig et al., 2015.

point tracks Motion cues are integrated in terms of Kanade-Lucas-Tomasi (KLT) feature tracks (Tomasi and Kanade, 1991), which give a set of sparse but reliable long-term point trajectories. With the odometry information provided in the datasets, motion induced by camera movement is compensated using a Kalman filter, which provides an estimate of the 3D velocity of observed obstacles (Franke et al., 2005). Motion cues are used for grouping and classification of region proposals. Additionally, the tracks are also used to stabilize the final labeling over time, as described in Section 3.7.

detectors Sliding-window object detectors have shown remarkable detection performance. At the same time, constraints on object and scene geometry help with computational efficiency. We employ object detectors for pedestrians and vehicles given that they are the most prominent dynamic objects in street scenes. We use a two-stage detection system for both classes, consisting of a fast Viola-Jones cascade (Viola and Jones, 2004) coupled with a three-layer convolutional neural network (Enzweiler and Gavrila, 2009) as an additional verification module. Multiple detections across location and scale are addressed using mean shift-based non-maximum suppression.

3.8.2 Datasets

We use two public datasets for our experimental evaluation, i.e. Daimler Urban Segmentation (DUS, Scharwächter et al., 2013) and the KITTI Vision Benchmark Suite (Geiger et al., 2013). Both stem from urban street scenes that were recorded from a moving vehicle. As typical for such recordings and also as required by our method, both provide stereo image pairs and intermediate frames for motion cues. The Daimler Urban Segmentation (DUS) dataset contains 500 images with pixel-level semantic annotations of 5 different classes, i.e. ground, vehicle, pedestrian, building, and sky. For KITTI, we collected annotations provided alongside previous publications (H. He and Upcroft, 2013; Ladický et al., 2014; Ros et al., 2015; Sengupta et al., 2013; C. Xu et al., 2013). Since the authors followed different label protocols, we mapped the individual labels to a common set of 6 classes, i.e. ground, vehicle, pedestrian, building, vegetation, and sky. We report numbers on all 216 annotated images provided for the visual odometry challenge and use the remaining 211 images for training. Overall, the two datasets are comparably small, but nevertheless sufficiently large to train the RDFs utilized in this chapter. We turn towards large-scale datasets in Chapter 4.

3.8.3 Metrics

The PASCAL VOC IoU score as proposed in Everingham et al., 2014 is the standard metric for evaluating pixel-level semantic labeling and is also the metric used by the evaluation protocol of the DUS dataset. Thus, we consider this score as the major evaluation metric in our experiments. However, we argue that this metric alone is not sufficient to assess the quality of pixel-level labeling in street scenes and complement the evaluation with an object-centric evaluation. Both scores are motivated, defined, and analyzed in the following.

pixel-level evaluation The essential task of pixel-level semantic labeling is to assign a semantic class label to all pixels in the image. This suggests considering each pixel in all the test images as an individual sample and determining the correctness of classifying these samples. Since often an individual score for each semantic class is desirable, such metrics are typically computed for each class individually and then the overall performance is assessed as the average of the individual class scores. Especially in street scenes, where many image pixels contain ground and buildings and only a few pixels

                                Prediction
                                true                   false
Ground-truth     true           True Positive (TP)     False Negative (FN)
                 false          False Positive (FP)    True Negative (TN)

Table 3.3: Binary confusion matrix.

belong to objects such as cars or pedestrians, this procedure has a beneficial side-effect, since it treats all classes equally, regardless of their size in the images. In the following, we will always follow this idea, which means that an overall metric refers to the average over the individual class performances without explicit mentioning. For the moment, let us consider the classification of a single sample. In our case, this sample is an image pixel, but for other tasks this sample might be the whole image (image classification) or might be a bounding box proposal (object detection). Further, as discussed above, we assess the performance of a single class only, which reduces the analysis to a binary problem, where we assign the value true to the class of interest and false to all other classes. Now the ground truth and the predicted class can be either true or false, rendering four different cases that are summarized in Table 3.3. A true sample that is correctly classified is denoted as True Positive (TP) and as False Negative (FN) otherwise. A false sample that is erroneously classified as true is denoted as a False Positive (FP) and as True Negative (TN) otherwise. Note that in a multi-class setup, a confusion of two classes leads to a FP sample for the one class, and to a FN for the other. By accumulating over all test samples, we obtain the number of occurrences of these cases, which we denote by the respective abbreviations and the prefix Σ. Since in a multi-class setup these numbers need to be computed once for each considered class, a naive implementation would need to iterate over all test samples once for each class. Instead, it is more efficient to compute a multi-class confusion matrix first, and construct the desired numbers subsequently by summing the appropriate entries. Such a confusion matrix simply counts the number of occurrences of all pairs of ground truth and prediction labels. Many widely used metrics for classification can be computed based on the definitions above (Fawcett, 2006). An example is the True Positive Rate (TPR), which is the proportion of true samples that are correctly classified, i.e.

TPR = \frac{\Sigma TP}{\Sigma TP + \Sigma FN} .    (3.10)

For our task of pixel-level semantic labeling, a single metric that reflects ΣTP as well as both error cases ΣFP and ΣFN is desirable. Thus,


Figure 3.5: Overlapping schematic segments of ground truth (red) and prediction (blue) of the same class and the resulting subsets ΣTP, ΣFP, and ΣFN. The Intersection-over-Union (IoU) score is defined as the overlap of ground truth and prediction, i.e. the intersection (ΣTP) over the union (ΣTP + ΣFP + ΣFN).

Everingham et al., 2014 introduce the Intersection-over-Union (IoU) metric, which is defined as

IoU = \frac{\Sigma TP}{\Sigma TP + \Sigma FP + \Sigma FN} .    (3.11)

The term Intersection-over-Union becomes clear when we interpret the pixel-level semantic labeling from a segmentation point of view, as emphasized by semantic segmentation, an equivalent term for pixel-level semantic labeling. If we consider connected regions in the prediction and ground truth images with identical labels as segments, ΣTP, ΣFP, and ΣFN can be defined as the areas of different parts of these segments, c.f. Figure 3.5. Then, the IoU metric is the size of the intersection between prediction and ground truth over the union of both segments, hence its name. Note that the size of such a segment is computed as the number of covered pixels. An alternative name that is sometimes used in the literature is Jaccard Index (JI), which stems from the analysis of sets. Note that in many datasets, some pixels are not annotated in the ground truth or carry the label void. Throughout this work, we exclude such pixels from evaluation.
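As mentioned above, all per-class counts can be derived from a single multi-class confusion matrix; a minimal sketch (illustrative, not the evaluation code used here; the void label 255 is an assumed convention):

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes, void_label=255):
    """Accumulate a multi-class confusion matrix, ignoring void pixels."""
    gt, pred = np.asarray(gt).ravel(), np.asarray(pred).ravel()
    valid = gt != void_label
    idx = gt[valid] * num_classes + pred[valid]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def per_class_iou(conf):
    """IoU per class, c.f. Equation (3.11): TP / (TP + FP + FN)."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp   # predicted as class c but labeled otherwise
    fn = conf.sum(axis=1) - tp   # labeled as class c but predicted otherwise
    return tp / np.maximum(tp + fp + fn, 1)

gt = [[0, 0, 1], [1, 1, 255]]
pred = [[0, 1, 1], [1, 1, 0]]
conf = confusion_matrix(gt, pred, num_classes=2)
print(per_class_iou(conf), per_class_iou(conf).mean())
```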

object-centric evaluation The IoU score introduced above is based on counting pixels. Therefore, we do not consider this score alone to be sufficient for the high range of object scales in urban scenes. The metric is dominated by objects at a large scale, whereas smaller ones have only a minor impact. However, for most practical applications, a metric that answers the question of how well individual objects are represented in the scene is desirable. Therefore, we complement the IoU score with an object-centric evaluation. To obtain such a metric, we divided the vehicle and pedestrian annotations of the DUS dataset into individual instance labels. Since our work aims to perform pixel-level labeling, it does not produce object instance predictions. Yet, we argue that it is beneficial to assess the semantic labeling performance by taking instances into account.

To that end, we define an evaluation protocol, where we create artificial instances given the pixel-level labeling output, see Figure 3.6 for an example and Algorithm 3.2 for details. Each predicted pixel is assigned to the closest ground truth instance with matching label. For this step, we only consider ground truth instances that overlap with the predicted semantic segment. This segment is defined as the connected component with constant label that the predicted pixel belongs to. If no ground truth candidate is available, the whole semantic segment is considered a FP instance. These artificially created instances are then considered as TPs if they overlap with their matching ground truth instances by at least 50 % and as FPs otherwise. Note that the definition of the overlap equals the IoU definition according to Figure 3.5. Unmatched ground truth instances count as FNs. Overall, this procedure allows for an interpretation of the pixel-level output in terms of TP, FP, or FN instances. In other words, instead of considering pixels as our test samples as with the IoU score, we consider whole object instances. In doing so, the contribution of per-pixel errors is normalized with the scale of the instance and the evaluation protocol allows for instance-aware evaluation without instance-level predictions. Based on these instance-level counts, we could again compute the IoU value. However, we use a slightly altered metric named F-score, which is more common in the context of object detection and defined as

F = \frac{2\,\Sigma TP}{2\,\Sigma TP + \Sigma FP + \Sigma FN} .    (3.12)

This metric is related to the AP^r score from Hariharan et al., 2014. The major differences are that Hariharan et al., 2014 evaluate real instance predictions, while we use artificially created ones during our evaluation protocol. Further, Hariharan et al., 2014 compute a precision-recall curve by varying a threshold on the prediction likelihood. Since we do not have such likelihoods here, we are restricted to a single operating point on this curve, for which we report the F-score.
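The final step is a direct evaluation of Equation (3.12); a minimal sketch, assuming the TP/FP/FN counts have already been determined by the matching protocol above:

```python
def f_score(tp, fp, fn):
    """F-score of Equation (3.12) from accumulated instance-level counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Example: 40 matched instances (>= 50 % overlap), 10 spurious, 5 missed.
print(f_score(tp=40, fp=10, fn=5))  # ~0.84
```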

3.8.4 Comparison to baselines

On the DUS dataset, we compare our method to six baselines. The authors either published their performance (M.-Y. Liu et al., 2015; Scharwächter et al., 2014a; Sharma et al., 2015) or have code available (Gould, 2012; Ladický et al., 2010; Shotton et al., 2008). For Gould, 2012, we use the pairwise results from the multiSeg example. The numbers for Shotton et al., 2008 are generated with C# code provided by Matthew Johnson and represent the final result of their two-step approach. Compared to this baseline, which is closely related to the encode-and-classify concept, we significantly improve labeling accuracy. For all experiments that we conduct with public software,

(a) Label annotation A_l    (b) Object instance annotation A_o

(c) Label prediction P_l    (d) Object instance prediction P_o

Figure 3.6: Overview of different annotation and prediction formats. The object instance prediction (d) is generated by Algorithm 3.2. Note the false positive vehicle prediction to the right of the pedestrians.

we use the default parameters set in the code. Scharwächter et al., 2014a and Sharma et al., 2015 kindly provided their inferred results to allow for an evaluation with our novel object-centric metric. For the KITTI dataset, there are no reported results available that make use of all annotated images. Therefore, we only compare to Gould, 2012; Ladický et al., 2010; Shotton et al., 2008. Tables 3.4 to 3.7 indicate that our approach is competitive compared to the baselines on both datasets regarding semantic labeling quality. In particular, traffic participants (vehicle, pedestrian) are recognized with superior performance in terms of pixel and object accuracy, mostly as a result of incorporating the additional object detectors at multiple levels in our system; on KITTI our approach outperforms the baselines by a margin of 30 %. We further observe that the results based on Stixels outperform those based on GBIS, which in turn outperform those based on SEOF. We attribute this to the main properties of the individual superpixel variants. Stixels are specifically designed for street scenes, whereas GBIS and SEOF are more generic. Further, GBIS has no restrictions on the superpixel shape, while SEOF has a strong compactness prior. The latter is especially disadvantageous for objects at larger distances, as indicated by the F-score. In addition to those three superpixel variants, we also show results when using the gPb-OWT-UCM segmentation tree from Arbeláez et al., 2011. This method does not take depth into account, but is still highly competitive and has been used as the basis for the segmentation trees by Silberman et al., 2012 and S. Gupta et al., 2014, which we consider as related to our approach. For the UCM SPS column in Table 3.4, we chose a single threshold in the UCM hierarchy to extract superpixels with a size similar to the other superpixel variants and compute our

                                 IoU^a   IoU^b   F^b     rt^c
Baselines
  Shotton et al., 2008           70.1    58.3    53.3     125
  Gould, 2012                    73.5    44.9    65.1    5200
  Scharwächter et al., 2014a     80.6    72.4    81.4     150^d
  Sharma et al., 2015            84.5    73.8    85.1    2800
  Ladický et al., 2010           86.0    74.5    83.8     105
  M.-Y. Liu et al., 2015         86.3    77.2     -       110
Our multi-cue system
  UCM tree                       81.8    79.6    82.8     104
  UCM SPS                        80.7    81.0    79.7     104
  SEOF                           83.5    78.4    79.6     104
  GBIS                           84.3    79.1    81.2     410
  Stixels                        85.7    79.9    86.4     163
Cues removed (Stixels)
  -DT                            84.2    75.9    82.7     118
  -PT                            84.9    78.8    84.5     138
  -DT, -PT                       82.6    73.2    80.4      93

^a Average over all classes
^b Average over dynamic objects: vehicle, pedestrian
^c Run time in ms
^d This also includes computation time for stereo and Stixels, which are neglected in the run times reported in Scharwächter et al., 2014a.

Table 3.4: Averaged quantitative results on the DUS dataset with official train/test split compared to six baselines. We show pixel accuracy (IoU), the F-score to assess object accuracy (F), and run time (rt). IoU and F-score are given in percent, the run time in ms. Additionally, we report results using Stixels, where detections (-DT), point tracks (-PT), and their combination (-DT, -PT) are removed from the full system to demonstrate their influence on the overall performance.

Algorithm 3.2 Generate object instances from label prediction. These are used for an evaluation at the instance level.

Input: Label prediction P_l, Object instance annotation A_o
Allocate searched object instance prediction P_o
R_p ← connectedRegions(P_l)
for all regions R ∈ R_p do
    l ← label(R)
    // Find all candidate object instances for the region:
    C_v ← {v | v ∈ A_o(R) ∧ label(v) = l}
    if C_v = ∅ then
        P_o(R) ← new object instance (false positive)
    else
        for all pixels p ∈ R do
            // Find closest candidate object instance for p:
            P_o(p) ← argmin_{v ∈ C_v}  min_{pixel q ∈ A_o with label(q) = label(v)}  dist(p, q)
        end for
    end if
end for
Output: P_o
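A compact Python transcription of Algorithm 3.2 could look as follows; this is a sketch assuming label and instance maps are 2D signed-integer arrays with instance ID 0 denoting background, and it uses scipy plus a brute-force nearest-pixel search purely for illustration:

```python
import numpy as np
from scipy import ndimage

def generate_instances(pred_labels, gt_instances, gt_labels):
    """Create artificial instance predictions from a pixel-level labeling.

    pred_labels:  (H, W) predicted semantic labels
    gt_instances: (H, W) ground-truth instance IDs (0 = no instance)
    gt_labels:    (H, W) ground-truth semantic labels
    Returns an (H, W) map of artificial instance IDs; negative IDs mark
    false-positive instances without any ground-truth candidate.
    """
    pred_instances = np.zeros_like(gt_instances)
    next_fp_id = -1
    for label in np.unique(pred_labels):
        # Connected components of the prediction carrying this label.
        regions, n_regions = ndimage.label(pred_labels == label)
        for r in range(1, n_regions + 1):
            region_mask = regions == r
            # Candidate GT instances overlapping the region with matching label.
            candidates = np.unique(
                gt_instances[region_mask & (gt_labels == label)])
            candidates = candidates[candidates != 0]
            if candidates.size == 0:
                pred_instances[region_mask] = next_fp_id  # false positive
                next_fp_id -= 1
                continue
            # Assign every region pixel to the closest candidate instance.
            ys, xs = np.nonzero(region_mask)
            for y, x in zip(ys, xs):
                best, best_dist = None, np.inf
                for cand in candidates:
                    cy, cx = np.nonzero(gt_instances == cand)
                    d = np.min((cy - y) ** 2 + (cx - x) ** 2)
                    if d < best_dist:
                        best, best_dist = cand, d
                pred_instances[y, x] = best
    return pred_instances
```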

                              IoU^a [%]   IoU^b [%]
Baselines
  Shotton et al., 2008          53.7        27.1
  Gould, 2012                   70.7        41.9
  Ladický et al., 2010          73.9        50.4
Our multi-cue system
  Stixels                       75.8        65.2

^a Average over all classes
^b Average over dynamic objects: vehicle, pedestrian

Table 3.5: Averaged quantitative results on the KITTI dataset compared to three baselines.

                                        IoU                                 F-score
                              veh    ped    gro    bui    sky      veh    ped
Baselines
  Shotton et al., 2008        70.5   46.1   93.6   73.5   66.6     67.4   39.3
  Gould, 2012                 68.7   21.3   95.7   87.6   94.2     87.5   42.7
  Scharwächter et al., 2014a  78.9   65.9   93.8   89.2   75.4     85.1   77.7
  Sharma et al., 2015         79.3   68.4   96.7   86.3   91.4     90.8   79.4
  Ladický et al., 2010        76.0   73.0   94.9   90.7   95.5     86.0   81.7
  M.-Y. Liu et al., 2015      83.3   71.1   96.4   91.2   89.5      -      -
This paper
  UCM tree                    84.7   74.4   96.8   84.1   69.0     85.3   80.3
  UCM SPS                     86.5   75.5   96.8   86.2   58.8     86.5   72.9
  SEOF                        81.6   75.2   88.2   88.9   83.9     81.6   77.5
  GBIS                        81.4   76.9   87.6   89.4   85.9     83.7   78.7
  Stixels                     85.4   74.3   96.2   89.6   82.8     90.9   82.0
Cues removed (Stixels)
  -DT                         82.9   68.9   96.2   89.6   83.3     90.8   74.6
  -PT                         85.0   72.6   96.2   89.5   81.2     90.8   78.2
  -DT, -PT                    81.6   64.8   96.2   88.9   81.7     90.0   70.8

Table 3.6: Quantitative results on the DUS dataset with official train/test split compared to six baselines for the classes vehicle, pedestrian, ground, building, and sky. Additionally, we report results using Stixels, where detections (-DT), point tracks (-PT), and their combination (-DT, -PT) are removed from the full system to demonstrate their influence on the overall performance. All numbers are in percent.

hierarchy on top of it. For the UCM tree column, we chose 10 representative thresholds to remove our hierarchy generation completely from the system and use the UCM tree instead. We find that performance drops when using the UCM tree, which shows the benefit of our proposed multi-cue segmentation tree.

3.8.5 Ablation studies

For a second experiment, we limit ourselves to using Stixels as superpixels and evaluate the contributions of the external cues introduced in Section 3.8.1, i.e. detectors (-DT) and point tracks (-PT). If a single cue is omitted, the overall performance decreases, but our method is still competitive with all baselines, c.f. Table 3.4. As soon as both cues are left out, performance drops more significantly. In order to evaluate the role of the specific object detector chosen, we replaced our detector (Enzweiler and Gavrila, 2009), which is specialized for street scenes, with a state-of-the-art general purpose detector, namely Faster R-CNN as proposed in S. Ren et al., 2015.

                          veh    ped    gro    bui    veg    sky
Baselines
  Shotton et al., 2008    50.6    3.6   79.4   63.5   71.0   54.3
  Gould, 2012             75.8    8.0   89.8   83.4   85.8   81.6
  Ladický et al., 2010    76.0   24.8   89.9   84.9   86.6   81.2
This paper
  Stixels                 79.1   51.2   89.7   79.6   77.7   77.5

Table 3.7: Quantitative results on the KITTI dataset compared to three baselines for the classes vehicle, pedestrian, ground, building, vegetation, and sky. All numbers denote the IoU score of the respective class in percent.

We repeated the experiments of our full system using Stixels as superpixels and the Faster R-CNN detector to provide the detector cues. The overall performance is 78.2 % avg. IoU for dynamic objects, which lies in between the performance without a detector (75.9 %) and with our specialized detector (79.9 %), c.f. Table 3.4. Further, the run time of the Faster R-CNN detector is 110 ms higher than that of our standard detector. We conclude that our system efficiently exploits the information provided by the object detector; it naturally depends on the detector's performance, but nevertheless achieves decent results even in the absence of such a detector.

3.8.6 Qualitative results

Figure 3.7 shows example labeling results of our approach on the DUS dataset versus the baseline (Ladický et al., 2010) and the ground truth. Note that for all three superpixel variants our method produces competitive results compared to the baseline while being orders of magnitude faster. We further observe that our multi-cue approach helps to overcome problems with 3D object geometry and scene structure that exist in Ladický et al., 2010. This demonstrates the strength of consistently exploiting multiple cues, which is an integral part of our approach. Note further that the GBIS variant is the best performing method for the class pedestrian, while the Stixel variant is the most accurate for the class ground. These qualitative observations match the quantitative results in Table 3.6.

3.8.7 Run time

In this section, we address the run time of our multi-cue approach, as real-time speeds are crucial for autonomous driving, c.f. Section 1.3. We evaluate run time using a system consisting of an Intel Core i7 CPU and an NVIDIA GTX 770 GPU. First, we report the time needed

Figure 3.7: Qualitative results on the DUS dataset compared to the ground truth and a baseline. From top to bottom: ground truth, baseline (Ladický et al., 2010), and our results for SEOF, GBIS, and Stixels superpixels.

[Figure 3.8 — pie chart of the run time of the individual components in ms: stereo depth maps, detectors, point tracks, Stixels, segmentation tree construction, encode-and-classify trees, and segmentation tree classification & inference.]

Figure 3.8: Breakdown of run time. We show the contribution of the individual components of our overall system to the 163 ms run time reported with Stixel superpixels in Table 3.4. About 95 % of the run time is accounted for by the inputs to our system, out of which Stixels, stereo, detectors, and point tracks account for 90 % of the run time and are typically readily available in an intelligent vehicle (Franke et al., 2013). Our approach centered around the multi-cue segmentation tree needs only little run time on top to produce high-quality semantic labeling results, clearly indicating its efficiency. Courtesy of Timo Rehfeld.

to parse a single image of the DUS dataset in Table 3.4. Note that this overall parsing time includes the computation of all input cues. For depth, we assume 50 ms run time as reported for the dataset in Scharwächter et al., 2013. As the reference implementation of Shotton et al., 2008 is not tuned for fast execution, we quote the optimized timing results originally reported in Shotton et al., 2008. In summary, our system has a faster parsing time compared to most baselines. Scharwächter et al., 2014a and Shotton et al., 2008 are slightly faster, but at the cost of a large drop in classification quality; M.-Y. Liu et al., 2015 is competitive in terms of both run time and accuracy. Next, we analyze the run time of our method with Stixel superpixels in more detail to demonstrate the efficiency of the multi-cue models we focus on in this work. To this end, we break down the overall system run time into its individual components, c.f. Figure 3.8. The proposed multi-cue segmentation tree is very efficient and requires less than 5 % of the overall run time, including construction, classification, and inference. Especially when assuming the remaining components (Stixels, stereo, detectors, and point tracks) as given, c.f. Franke et al., 2013, we add only little overhead to obtain a high-quality semantic labeling output. Since in an autonomous vehicle there are typically multiple computational resources available (Franke et al., 2013), the execution of the individual components can be pipelined, i.e. while the current frame is still being processed, computations on the next frame are already started using different computational resources. In this work, we run stereo depth maps and detectors in parallel on individual parts of an FPGA, point tracks and encode-and-classify trees are computed on the GPU, and Stixels and the multi-cue segmentation tree on the CPU. In doing so, no resource is occupied for more than 50 ms per frame, yielding a throughput of 20 fps at the cost of one frame delay. Conceptually, our approach combines the central ideas of the two run-time-efficient methods (Scharwächter et al., 2014a; Shotton et al., 2008) and extends them with tree-structured CRF inference, multiple cues, and the encode-and-classify concept. In this way, we are able to significantly improve labeling accuracy compared to these methods, while maintaining fast inference time.

3.9 discussion

In this chapter, we proposed and analyzed a system for urban pixel-level semantic labeling. Our system is built on top of various components that are readily available in intelligent vehicles and have been established and tested over many years. We opted for a standard processing pipeline that we altered such that all available cues are exploited to their full extent throughout all system components. Central to this concept is the multi-cue segmentation tree that allows for efficient construction, feature aggregation, classification, and inference. The resulting system achieves high labeling performance while being computationally efficient at the same time. Most prominently, we increase the labeling performance for the two most relevant and challenging classes (vehicles and pedestrians) by a significant margin over the state of the art on the considered datasets while running at near real-time speeds. We take this as evidence for the suitability of our segmentation tree and the holistic integration of multiple complementary cues, particularly from the perspective of real-time applications. The proposed concept is independent of the actual input data and can deal well with different superpixels, detectors, or even with completely missing input data. Best performance is obtained when all available cues are used and when the system is built on Stixels, which are superpixels designed for urban scenes. Overall, our multi-cue approach can be easily added to existing components for autonomous driving and achieves high-quality semantic labelings. However, our method strongly depends on the accuracy of its multi-cue input data. It is designed to maximally exploit the information available in its inputs by combining the strengths of various cues, but it does not have direct access to the input images at a pixel level. We can barely recover from errors present in this data, e.g. under-segmentations of the superpixels, misclassifications of the encode-and-classify trees or detectors, or noise in stereo and point tracks. Therefore, we next consider methods from the field of deep learning

that directly operate on the input pixels. These methods are capable of jointly learning features for scene representation, incorporating context knowledge, and performing classification. While our system needs only a few images for training, deep learning methods typically require a tremendous amount of training data. Thus, the following chapter introduces the Cityscapes dataset and analyzes the potential behind such a large-scale dataset.

4 LARGE-SCALE SEMANTIC SCENE UNDERSTANDING

contents
4.1 Introduction
4.2 Dataset
    4.2.1 Data specifications
    4.2.2 Classes and annotations
    4.2.3 Dataset splits
    4.2.4 Statistical analysis
4.3 Pixel-level Semantic Labeling
    4.3.1 Metrics
    4.3.2 Control experiments
    4.3.3 Baselines
    4.3.4 Pixel-level benchmark
    4.3.5 Real-time fully convolutional networks
    4.3.6 Cross-dataset evaluation
4.4 Instance-level Semantic Labeling
    4.4.1 Metrics
4.5 Benchmark Suite
4.6 Impact and Discussion

In the previous chapter, we investigated the potential of leveraging readily available multi-cue input data for semantic urban scene understanding. We saw that the use of multiple cues is indeed beneficial, but that the performance of the analyzed system was limited by the accuracy of the available input cues. We expect these cues, in turn, to be limited by the amount of available training data, especially the performance of appearance-based classification. Further, both datasets used were recorded in a single city and have rather correlated images stemming from a few video streams only. Thus, the classifiers are likely overfitted to the respective dataset they were trained on and the generalization to real-world applications remains unclear. In this chapter, we follow a different path and investigate methods for large-scale urban semantic scene understanding. Object recognition tasks, such as image classification or object detection, have benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes.


(a) train/val – fine annotation – 3475 images

(b) train – coarse annotation – 20 000 images

(c) test – fine annotation – 1525 images

Figure 4.1: Example images from the Cityscapes dataset with annotations as false-color overlays. Overall, the dataset includes 5000 images with fine and 20 000 images with coarse annotations.

To address this, we introduce Cityscapes¹, a large-scale dataset and benchmark suite to train and evaluate methods for pixel-level and instance-level semantic labeling. Cityscapes is comprised of video sequences recorded with a stereo camera in streets of 50 different cities to achieve a high level of diversity. We provide high-quality pixel-level annotations for 5000 images of these recordings and coarse annotations for 20 000 additional images to facilitate methods that exploit large-scale weakly-labeled data. In doing so, we exceed previous efforts in terms of dataset size, annotation quality and richness, diversity, and scene complexity. Based on our benchmark, we can compare the performance of state-of-the-art approaches and analyze their suitability for applications in autonomous driving. Example images of the Cityscapes dataset can be found in Figure 4.1.

1 www.cityscapes-dataset.net

The remainder of this chapter is structured as follows. We start with an introduction in Section 4.1, where we present related datasets and motivate the need for a new large-scale dataset. Subsequently, we provide a detailed description of the dataset as well as an in-depth analysis of the dataset characteristics (Section 4.2). The accompanying benchmark suite focuses on two major challenges, i.e. pixel-level semantic labeling and instance-level semantic labeling. The first is discussed in Section 4.3, where we introduce the task, gain further insights via control and oracle experiments, and analyze state-of-the-art methods. Subsequently, we briefly introduce the instance-level task in Section 4.4. Eventually, we conclude with a discussion of the benchmark (Section 4.5) and its impact on the research community (Section 4.6). This chapter is largely based on Cordts et al., 2016 as well as Cordts et al., 2015 and contains verbatim quotes of the first work. Sections in this dissertation that share contributions with other authors are marked via footnotes.

4.1 introduction2

The resurrection of deep learning (LeCun et al., 2015) has had a major impact on the current state of the art in machine learning and computer vision. Many top-performing methods in a variety of applications are nowadays built around deep neural networks, c.f. Section 2.1.2. A selection of works with a high impact is Krizhevsky et al., 2012 for image classification, Shelhamer et al., 2017 for pixel-level semantic labeling, and Girshick, 2015 for object detection. A major contributing factor to their success is the availability of large-scale, publicly available datasets such as ImageNet/ILSVRC (Russakovsky et al., 2015), PASCAL VOC (Everingham et al., 2014), PASCAL-Context (Mottaghi et al., 2014), and Microsoft COCO (T.-Y. Lin et al., 2014) that allow deep neural networks to develop their full potential. Despite the existing gap to human performance, scene understanding approaches have started to become essential components of advanced real-world systems. The application of self-driving cars is becoming increasingly popular, but is also particularly challenging and makes extreme demands on system performance and reliability. Consequently, significant research efforts have gone into new vision technologies for understanding complex traffic scenes and driving scenarios, e.g. Badrinarayanan et al., 2017; Franke et al., 2013; Furgale et al., 2013; Geiger et al., 2014; Ros et al., 2015; Scharwächter et al., 2014a or the multi-cue system that we described in Chapter 3. Also in this area, research progress can be heavily linked to the existence of datasets such as

2 This section contains verbatim quotes of Section 1 of the publication Cordts et al., 2016. Note that all authors of that work contributed to this section.

the KITTI Vision Benchmark Suite (Geiger et al., 2013), CamVid (Brostow et al., 2009), Leuven (Leibe et al., 2007), and Daimler Urban Segmentation (DUS, Scharwächter et al., 2013). These urban scene datasets are often much smaller than datasets addressing more general settings. Moreover, we argue that they do not fully capture the variability and complexity of real-world inner-city traffic scenes. Both shortcomings currently inhibit further progress in visual understanding of street scenes. To this end, we propose the Cityscapes benchmark suite and corresponding dataset, specifically tailored for autonomous driving in an urban environment and involving a much wider range of highly complex inner-city street scenes that were recorded in 50 different cities. Cityscapes significantly exceeds previous efforts in terms of size, annotation richness, and, more importantly, regarding scene complexity and variability. Despite the main focus of this dissertation, we go beyond pixel-level semantic labeling by also considering instance-level semantic labeling in both our annotations and evaluation metrics. To facilitate research on 3D and multi-cue scene understanding, we also provide depth information through stereo vision. The research goals of this chapter are two-fold. First, we focus on the Cityscapes dataset itself. We introduce its main properties and discuss their importance for a suitable dataset in the context of urban scene understanding. Further, we analyze the main characteristics, compare them with related datasets, and show that Cityscapes has the potential to establish a new reference for training and testing of algorithms in the field of autonomous driving. As a second research goal, we benchmark the performance of various methods for pixel-level semantic scene understanding. In doing so, we can assess these methods in terms of their suitability for self-driving cars. We can use this knowledge to understand the properties of different approaches and to derive important design choices.

4.2 dataset3

Designing a large-scale dataset requires a multitude of decisions, e.g. on the modalities of data recording, data preparation, and the annotation protocol. Our choices were guided by the ultimate goal of enabling significant progress in the field of semantic urban scene understanding. We discuss and justify these guidelines in this section and analyze the characteristics of the Cityscapes dataset.

3 The creation of the Cityscapes dataset was initially driven by Marius Cordts and was then continued as joint work with Mohamed Omran. The analyses in this section are based on joint work of Mohamed Omran and Marius Cordts in discussions with the other authors of the publication Cordts et al., 2016. This section contains verbatim quotes of Section 2 in that publication.

4.2.1 Data specifications

Our data recording and annotation methodology was carefully designed to capture the high variability of outdoor street scenes. Several hundreds of thousands of frames were acquired from a moving vehicle over the span of several months, covering spring, summer, and fall in 50 cities, primarily in Germany but also in neighboring countries. We deliberately did not record in adverse weather conditions, such as heavy rain or snow, as we believe such conditions to require specialized techniques and datasets (Pfeiffer et al., 2013). Our camera system and post-processing reflect the state of the art in the automotive domain. Images were recorded with an automotive-grade 22 cm baseline stereo camera using 1/3 in CMOS 2 MP sensors (OnSemi AR0331) with rolling shutters at a frame rate of 17 Hz. The sensors were mounted behind the windshield and yield HDR images with 16 bits linear color depth. In contrast to DSLR or smartphone cameras as used for many large-scale datasets (Everingham et al., 2014; T.-Y. Lin et al., 2014; Russakovsky et al., 2015), such automotive cameras have quite opposing requirements. The focus is to record realistic rather than visually pleasant images with over-saturated colors, since the purpose of such cameras is to enable environment perception for the machine and not for the human. Further, automotive cameras face extreme environment conditions, e.g. with temperatures that might range from −40 ◦C to 125 ◦C. Simultaneously, the cameras must have a long lifetime while having only low costs. Overall, these requirements cause our images to be visually different from comparable datasets and support the need for such a dataset recorded with suitable sensors for self-driving cars. Each 16 bit stereo image pair was debayered and rectified after recording. We relied on Krüger et al., 2004 for extrinsic and intrinsic calibration. To ensure calibration accuracy, we re-calibrated on-site before each recording session. For comparability and compatibility with existing datasets, we also provide Low Dynamic-Range (LDR) 8 bit RGB images that are obtained by applying a logarithmic compression curve. Such mappings are common in automotive vision, since they can be computed efficiently and independently for each pixel. To facilitate highest annotation quality, we applied a separate tone mapping to each image. The resulting images are less realistic, computationally expensive to compute, and inconsistent between the left and right image of a stereo pair, which renders them unsuitable for an automotive application. However, they are visually more pleasing and proved easier to annotate. Examples of different variants of preprocessing a 16 bit image are shown in Figure 4.2. The examples unveil the large amount of information present in HDR images, especially when facing challenging illumination conditions.
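A minimal sketch of such a per-pixel logarithmic compression from 16 bit to 8 bit follows; the exact curve used for the published images is not specified here, so the mapping below is only an assumed example:

```python
import numpy as np

def log_compress(hdr, in_bits=16, out_bits=8):
    """Map a linear 16 bit HDR image to 8 bit LDR with a logarithmic curve.

    The mapping is applied independently per pixel, which is why it can be
    computed efficiently and consistently for the left and right image.
    """
    hdr = np.asarray(hdr, dtype=np.float64)
    in_max, out_max = 2 ** in_bits - 1, 2 ** out_bits - 1
    ldr = np.log1p(hdr) / np.log1p(in_max) * out_max
    return ldr.astype(np.uint8)

# A dark and a bright pixel end up well separated in the 8 bit range.
print(log_compress(np.array([100, 60000])))
```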

(a) Bits 3 to 10 (b) Bits 5 to 12

(c) Bits 9 to 16 (d) Gamma correction

(e) Logarithmic compression (f) Tone mapping

Figure 4.2: Different variants of preprocessing of a 16 bit image from our automotive camera. To visualize the information content of this image we clip to different 8 bit sub-ranges (a)–(c). For the published automotive-capable images within this dataset, we applied a logarithmic compression curve (e). For comparison, we show a gamma correction applied to the same input image (d). To facilitate accurate annotations, we leveraged a tone mapping (f).

Out of the video recordings, we selected images for annotation, see Section 4.2.2 for details on the labeling process and Figure 4.1 for examples. Since annotations are very costly and time-consuming to obtain, we sought after images with a high diversity of foreground objects, background, and overall scene layout. To achieve these requirements, we conducted the selection manually and chose 5000 images from 27 cities for dense pixel-level annotation. In practice, we pressed a steering-wheel button during the recordings to mark candidate images. For each such annotated image, we provide a 30-frame video snippet in full, where the annotated image is the 20th frame. In doing so, we can supply context information from the time domain. Causal methods can leverage the 20 frames prior to the one of interest, such that trackers and filters can converge. A-causal methods, e.g. video segmentation, have access to the whole 30-frame snippet. In addition, we include the whole recording drive from one city of our validation set (c.f. Section 4.2.3), such that methods requiring a longer time span at test but not training time can be evaluated. For the remaining 23 cities, a single image every 20 s or 20 m driving distance (whatever comes first) was selected for coarse annotation, yielding 20 000 images in total. Note that the two triggers are on par for an average driving speed of 3.6 km h−1, thus typically an image every 20 m is selected except for very low speeds or when stopping, e.g. at traffic lights. In addition to the rectified 16 bit HDR and 8 bit LDR stereo image pairs and corresponding annotations, our dataset includes disparity maps pre-computed via SGM (Gehrig et al., 2009; Hirschmüller, 2008), vehicle odometry obtained from in-vehicle sensors, outside temperature, and GPS tracks. In doing so, we ensure that all meta data that is typically leveraged by approaches for urban scene understanding is available and that multi-cue methods such as the one discussed in Chapter 3 can be run on this dataset.
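The coarse-annotation image selection can be sketched as a simple trigger that fires on whichever of the two conditions is met first; this is illustrative pseudologic, not the actual recording software:

```python
def select_coarse_frames(timestamps_s, distances_m,
                         time_gap_s=20.0, dist_gap_m=20.0):
    """Select frame indices every 20 s or 20 m of driving, whichever comes first.

    timestamps_s and distances_m are per-frame cumulative values.
    """
    selected = []
    last_t, last_d = None, None
    for i, (t, d) in enumerate(zip(timestamps_s, distances_m)):
        if last_t is None or t - last_t >= time_gap_s or d - last_d >= dist_gap_m:
            selected.append(i)
            last_t, last_d = t, d
    return selected

# While driving, the distance trigger fires; while waiting, the time trigger does.
print(select_coarse_frames([0, 10, 20, 30], [0, 100, 100, 100]))  # [0, 1, 3]
```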

4.2.2 Classes and annotations

We provide fine and coarse annotations at pixel level, extended with instance-level labels for humans and vehicles, c.f. Figure 4.1. We defined 30 visual classes for annotation, which are grouped into eight categories: flat, construction, nature, vehicle, sky, object, human, and void. Classes were selected based on their frequency, relevance from an application standpoint, practical considerations regarding the annotation effort, as well as to facilitate compatibility with existing datasets, e.g. Brostow et al., 2009; Geiger et al., 2013; Xie et al., 2016. Classes that are too rare are excluded from our benchmark, leaving 19 classes for evaluation, see Figure 4.3 for the distribution of occurrences.

[Figure 4.3 — bar chart of the number of finely annotated pixels per class, grouped by category:
flat: road, sidewalk, parking², rail track²
construction: building, fence, wall, bridge², tunnel², guard rail²
nature: vegetation, terrain
vehicle: car¹, bicycle¹, bus¹, truck¹, train¹, motorcycle¹, caravan¹,², trailer¹,²
sky: sky
object: pole, traffic sign, traffic light, pole group²
human: person¹, rider¹
void: static², ground², dynamic²
¹ instance-level annotations are available   ² ignored for evaluation
x-axis: number of pixels (10^6 to 10^10)]

Figure 4.3: Number of finely annotated pixels (x-axis) per class and their associated categories (y-axis). The colors of the individual bars also encode the false-colors that we use to encode classes in figures within this chapter.

Figure 4.4: Exemplary labeling process. Distant objects are annotated first and subsequently their occluders. This ensures that the boundary between these objects is shared and consistent.

In order to achieve highest quality levels with our 5000 fine pixel- level annotations, we found several concepts and guidelines to be crucial. For all these concepts, we followed the human intuition of an urban street scene and not the possible interpretation of a concrete method running on a machine. First, we composed clear definitions of all the labels of interest, c.f. AppendixA.1. We included an ex- ample for each class and made sure to cover potential corner cases and to provide a precise distinction for similar classes. For exam- ple, car is separated from truck by the shape of the contour as visible in images and not by an entry in the vehicle’s paperwork, c.f. Ap- pendixA.1. The class van is on purpose not included, since no clear separation from large cars and small trucks exists. Over time, we collected examples of additional corner cases and added these to the label definitions. Second, we followed a clear label protocol. Our an- notations consist of layered polygons (à la LabelMe; Russell et al., 2008), see Figure 4.4 for an example. Annotators were asked to la- bel the image from back to front such that no object boundary was marked more than once. In other words, distant objects are annotated first, while occluded parts are annotated with a coarser, conservative boundary (possibly larger than the actual object). Subsequently, the occluder is annotated with a polygon that lies in front of the occluded part. Thus, the boundary between these objects is shared and consis- tent. Each annotation thus implicitly provides a depth ordering of the objects in the scene. Given our label scheme, annotations can be easily extended to cover additional or more fine-grained classes. Third, we added rules that keep the labeling effort in a reasonable range. Holes in an object through which a background region can be seen are considered to be part of the object, which ensures that objects can be described via simple polygons forming simply-connected sets. We encouraged labelers to mark regions as void, where the actual object cannot be clearly identified and to use the suffix group (e.g. car- group), when the boundary between object instances with same label 64 large-scale semantic scene understanding

cannot be determined. In doing so, these simplifications are identi- fiable from the annotations and corresponding image regions can be excluded during training and evaluation. Simultaneously, we keep labeling efforts in a reasonable range and avoid forcing the labelers to guess object boundaries. Fourth, we implemented a sophisticated quality control, where in multiple stages more experienced labelers verified results from newer ones. We supported this via a labeling tool that allowed for the labelers to explicitly mark image regions where they were unsure and that allowed for found mistakes to be fed-back to the labelers such that they could improve their skills. Fur- ther, it turned out that highest consistency is achieved when very few people perform a final quality pass. Overall, annotation and quality control required more than 1.5 h on average for a single image. Our annotation tool is published together with the dataset. Figure 4.5 presents several examples of annotated frames from our dataset that exemplify its diversity and difficulty. All examples are taken from the train and val splits (c.f. Section 4.2.3) and were cho- sen by searching for the extremes in terms of the number of traffic participant instances in the scene; see Figure 4.5 for details. For our 20 000 coarse pixel-level annotations, accuracy on object boundaries was traded off for annotation speed. We aimed to cor- rectly annotate as many pixels as possible within a given span of less than 7 min of annotation time per image. This was achieved by label- ing coarse polygons under the sole constraint that each polygon must only include pixels belonging to a single object class. In two experiments we assessed the quality of our labeling. First, 30 images were finely annotated twice by different annotators and passed the same quality control. It turned out that 96 % of all pixels were assigned to the same label. Since our annotators were instructed to choose a void label if unclear (such that the region is ignored in training and evaluation), we exclude pixels having at least one void label and recount, yielding 98 % agreement. Second, all our fine anno- tations were additionally coarsely annotated such that we can enable research on densifying coarse labels. We found that 97 % of all la- beled pixels in the coarse annotations were assigned the same class as in the fine annotations.

4.2.3 Dataset splits

We split our densely annotated images into separate training, validation, and test sets. The coarsely annotated images serve as additional training data only. We chose not to split the data randomly, but rather in a way that ensures each split to be representative of the variability of different street scene scenarios. The underlying split criteria involve a balanced distribution of geographic location and population size of the individual cities, as well as regarding the time of year

Largest number of instances and persons

Largest number of riders Largest number of cars

Largest number of bicycles Largest number of buses

Largest number of trucks Largest number of motorcycles

Large spatial variation of persons Fewest number of instances

Figure 4.5: Examples of our annotations on various images of our train and val sets. The images were selected based on the criteria overlaid on each image.

when recordings took place. Specifically, each of the three split sets is comprised of data recorded with the following properties in equal shares: (i) in large, medium, and small cities; (ii) in the geographic west, center, and east; (iii) in the geographic north, center, and south; (iv) at the beginning, middle, and end of the year. Note that the data is split at the city level, i.e. a city is completely within a single split. The three bins for each characteristic, e.g. large, medium, small, are defined by splitting the cities at the tertiles of the respective characteristic. Following this scheme, we arrive at a unique split consisting of 2975 training and 500 validation images with publicly available annotations, as well as 1525 test images with annotations withheld for benchmarking purposes. In order to assess how uniform (representative) the splits are regarding the four split characteristics, we trained a Fully Convolutional Network (FCN; Shelhamer et al., 2017) on the 500 images in our validation set. This model was then evaluated on the whole test set, as well as on eight subsets thereof that reflect the extreme values of the four characteristics, c.f. Table 4.1. With the exception of the time of year, the performance is very homogeneous, varying by less than 1.5 percentage points (often much less). Interestingly, the performance on the end of the year subset is 3.8 percentage points better than on the whole test set. We hypothesize that this is due to softer lighting conditions in the frequently cloudy fall. To verify this hypothesis, we additionally tested on images taken in low- or high-temperature conditions, c.f. Table 4.2. We find a 4.5 percentage point increase in low temperatures (cloudy) and a 0.9 percentage point decrease in warm (sunny) weather. Moreover, specifically training for either condition leads to an improvement on the respective test set, but not on the balanced set. These findings support our hypothesis and underline the importance of a dataset covering a wide range of conditions encountered in the real world in a balanced way.
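Binning the cities at tertiles of a characteristic can be sketched as follows; this is a simplified illustration of the split criterion, and the city statistics used here are invented example values:

```python
import numpy as np

def tertile_bins(city_values):
    """Assign each city to the 'small', 'medium', or 'large' bin of a
    characteristic (e.g. population) by splitting at the tertiles."""
    values = np.asarray(list(city_values.values()), dtype=float)
    t1, t2 = np.percentile(values, [100 / 3, 200 / 3])
    bins = {}
    for city, v in city_values.items():
        bins[city] = "small" if v <= t1 else ("medium" if v <= t2 else "large")
    return bins

populations = {"A": 80_000, "B": 250_000, "C": 600_000,
               "D": 1_200_000, "E": 150_000, "F": 3_500_000}
print(tertile_bins(populations))  # two cities per bin
```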

4.2.4 Statistical analysis

In Table 4.3 we provide an overview of Cityscapes and other related datasets. We compare the datasets in terms of the type of annotations, the meta information provided, the camera perspective, the type of scenes, and their size. The selected datasets contain real-world images and are either of large scale or focus on street scenes. For the remainder of this section, we compare Cityscapes in greater detail with closely related datasets that are focused on autonomous driving. We analyze the datasets in terms of (i) annotation volume and density, (ii) the distribution of visual classes, and (iii) scene complexity. Regarding the first two aspects, we compare to other publicly available real-world datasets with semantic pixel-wise annotations, i.e. CamVid (Brostow et al., 2009), DUS (Scharwächter et al., 2013), and KITTI (Geiger et al., 2013).

Characteristic            Extreme     Delta
population                large       −0.2
                          small       −0.9
geographic longitude      west         0.7
                          east        −0.2
geographic latitude       north       −1.5
                          south        0.9
time of year              early       −0.7
                          late         3.8

Table 4.1: Evaluation of an FCN model trained on our validation set (500 images) and tested on different subsets of our test set. We evaluate at the first and last thirds of the four characteristics that we used to define the dataset split, please refer to the text for details. Performance is given in percentage points as the delta to the performance of the same model evaluated on the whole test set, i.e. 58.3 %.

                         subset of test
subset of train          all      cold     warm
val                       0.0      4.5     −0.9
cold                     −2.4      0.9     −2.9
warm                     −1.9      1.2      0.2

Table 4.2: Evaluation of three FCN models trained on our validation set (500 images) and on low- and high-temperature subsets of our training set with 500 images each. We tested on the whole test set and on the low- and high-temperature thirds analogously to Table 4.1 and applied the same normalization. All numbers are in percentage points.

Data Labels Col. Vid. Depth Camera Scene #images #classes

[1]B X × × Mixed Mixed 150 k 1000 [2]B X × × Mixed Mixed 20 k 20 [2]C X × × Mixed Mixed 10 k 20 [3]D X × × Mixed Mixed 20 k 400 [4]C X × × Mixed Mixed 300 k 80 [5] D, C X × Kinect Person Indoor 10 k 37 [6a] B XX Laser, Stereo Car Suburban 15 k 3 [6b] D XX Laser, Stereo Car Suburban 700 8 [7]D XX × Car Urban 701 32 [8]D XX Stereo, Manual Car Urban 70 7 [9]D × X Stereo Car Urban 500 5 [10]B XX × Car Urban 250 k 1 [11]D X × × Person Urban 200 2 [12]C X × Stereo Car Facades 86 13 [13]D X × 3D mesh Person Urban 428 8 [14]D XX Laser Car Suburban 400 k 27 Ours D XX Stereo Car Urban 5 k 30 Ours C XX Stereo Car Urban 20 k 30 [1] ImageNet: Russakovsky et al., 2015 [2] PASCAL VOC: Everingham et al., 2014 [3] PASCAL-Context: Mottaghi et al., 2014 [4] Microsoft COCO: T.-Y. Lin et al., 2014 [5] SUN RGB-D: Song et al., 2015 [6a] KITTI detection: Geiger et al., 2013 [6b] KITTI segmentation: Geiger et al., 2013, including the annotations of 3rd-party groups: Güney and Geiger, 2015; H. He and Upcroft, 2013; Kundu et al., 2014; Ladický et al., 2014; Ros et al., 2015; Sengupta et al., 2013; C. Xu et al., 2013; R. Zhang et al., 2015 [7] CamVid: Brostow et al., 2009 [8] Leuven: Leibe et al., 2007 [9] DUS: Scharwächter et al., 2013 [10] Caltech: Dollár et al., 2012 [11] Geosemantic segmentation: Ardeshir et al., 2015 [12] Automatic Dense Visual Semantic Mapping: Sengupta et al., 2012 [13] RueMonge: Riemenschneider et al., 2014 [14] 3D to 2D Label Transfer: Xie et al., 2016

Table 4.3: Comparison to related datasets. We list the type of labels provided, i.e. object bounding boxes (B), dense pixel-level semantic labels (D), and coarse labels (C) that do not aim to label the whole image. Further, we mark if color, video, and depth information are available. We list the camera perspective, the scene type, the number of images, and the number of semantic classes.

Dataset          #pixels [10^9]   annot. density [%]

Ours (fine)           9.43              97.1
Ours (coarse)        26.0               67.5
CamVid                0.62              96.2
DUS                   0.14              63.0
KITTI                 0.23              88.9

Table 4.4: Absolute number and density of annotated pixels for Cityscapes, DUS, KITTI, and CamVid (upscaled to 1280 × 720 pixels to maintain the original aspect ratio).

CamVid consists of ten minutes of video footage with pixel-wise annotations for over 700 frames. DUS consists of a video sequence of 5000 images, of which 500 have been annotated. KITTI addresses several different tasks including semantic labeling and object detection. As no official pixel-wise annotations exist for KITTI, several independent research groups have annotated approximately 700 frames (Güney and Geiger, 2015; H. He and Upcroft, 2013; Kundu et al., 2014; Ladický et al., 2014; Ros et al., 2015; Sengupta et al., 2013; C. Xu et al., 2013; R. Zhang et al., 2015). We map those labels to our high-level categories and analyze this consolidated set; see the sketch below. In comparison, Cityscapes provides significantly more annotated images, i.e. 5000 fine and 20 000 coarse annotations. Moreover, the annotation quality and richness are notably better. As Cityscapes provides recordings from 50 different cities, it also covers a significantly larger area than previous datasets that contain images from a single city only, e.g. Cambridge (CamVid), Heidelberg (DUS), and Karlsruhe (KITTI). In terms of absolute and relative numbers of semantically annotated pixels (training, validation, and test data), Cityscapes compares favorably to CamVid, DUS, and KITTI with up to two orders of magnitude more annotated pixels, c.f. Table 4.4. The majority of all annotated pixels in Cityscapes belong to the coarse annotations, providing many individual (but correlated) training samples, but missing information close to object boundaries. Figures 4.3 and 4.6 compare the distribution of annotations across individual classes and their associated higher-level categories. Notable differences stem from the inherently different configurations of the datasets. Cityscapes involves dense inner-city traffic with wide roads and large intersections, whereas KITTI is composed of less busy suburban traffic scenes. As a result, KITTI exhibits significantly fewer flat ground structures, fewer humans, and more nature. In terms of overall composition, DUS and CamVid seem more aligned with our dataset. Exceptions are an abundance of sky pixels in CamVid, due to cameras with a comparably large vertical field-of-view, and the absence of certain categories in DUS, i.e. nature and object.
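The consolidation of the third-party KITTI annotations amounts to a simple relabeling onto our high-level categories. A minimal sketch with hypothetical source label names (the actual third-party label sets differ between groups):

    # Hypothetical mapping from third-party KITTI class names to the
    # high-level Cityscapes categories used in our analysis.
    KITTI_TO_CATEGORY = {
        "road": "flat", "sidewalk": "flat",
        "building": "construction", "fence": "construction",
        "pole": "object", "sign": "object",
        "tree": "nature", "grass": "nature",
        "car": "vehicle", "truck": "vehicle",
        "pedestrian": "human", "cyclist": "human",
        "sky": "sky",
    }

    def to_category(labels, mapping=KITTI_TO_CATEGORY):
        """Relabel a list of class names; unknown labels become 'void'
        and are excluded from the statistics."""
        return [mapping.get(label, "void") for label in labels]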

Figure 4.6: Proportion of annotated pixels (x-axis) per category (y-axis) for Cityscapes, CamVid, DUS, and KITTI.

Dataset       #humans   #vehicles   #humans     #vehicles
              [10^3]    [10^3]      per image   per image

Ours (fine)    24.4      41.0        7.0         11.8
KITTI           6.1      30.3        0.8          4.1
Caltech       192^a      -           1.5          -

^a via interpolation

Table 4.5: Absolute and average number of instances for Cityscapes, KITTI, and Caltech on the respective training and validation datasets.

Finally, we assess scene complexity, where density and scale of traffic participants (humans and vehicles) serve as proxy measures. Out of the previously discussed datasets, only Cityscapes and KITTI provide instance-level annotations for humans and vehicles. We additionally compare to the Caltech Pedestrian Dataset (Dollár et al., 2012), which only contains annotations for humans, but none for vehicles. Furthermore, KITTI and Caltech only provide instance-level annotations in terms of axis-aligned bounding boxes. We use the respective training and validation splits for our analysis, since test set annotations are not publicly available for all datasets. In absolute terms, Cityscapes contains significantly more object instance annotations than KITTI, see Table 4.5. Being a specialized benchmark, the Caltech dataset provides the most annotations for humans by a margin. The major share of those labels was obtained, however, by interpolation between a sparse set of manual annotations, resulting in significantly degraded label quality. The relative statistics emphasize the much higher complexity of Cityscapes, as the average numbers of object instances per image notably exceed those of KITTI and Caltech. We extend our analysis to Microsoft COCO (T.-Y. Lin et al., 2014) and PASCAL VOC (Everingham et al., 2014), which also contain street scenes while not being specific to them. We analyze the frequency of scenes with a certain number of traffic participant instances, see Figure 4.7. We find our dataset to cover a greater variety of scene complexity and to have a higher portion of highly complex scenes than previous datasets. Using stereo data, we analyze the distribution of vehicle distances to the camera. From Figure 4.8 we observe that, in comparison to KITTI, Cityscapes covers a larger distance range. We attribute this to both our higher-resolution imagery and the careful annotation procedure. As a consequence, algorithms need to take a larger range of scales and object sizes into account to score well on our benchmark. Overall, we find that our dataset covers a greater range of scene complexity and has a larger portion of highly complex scenes than the datasets compared with. Recently, researchers have started to use alternatives to standard image labeling to create large-scale datasets. Xie et al., 2016 announced

Figure 4.7: Dataset statistics regarding scene complexity. Only Microsoft COCO and Cityscapes provide instance segmentation masks.


Figure 4.8: Histogram of object distances in meters for class vehicle.

a new semantic scene labeling dataset for suburban traffic scenes. It provides temporally consistent 3D semantic instance annotations with 2D annotations obtained through back-projection. We consider our efforts to be complementary given the differences in the way that semantic annotations are obtained and in the type of scenes considered, i.e. suburban vs. inner-city traffic. To maximize synergies between both datasets, a common label definition that allows for cross-dataset evaluation has been mutually agreed upon and implemented. Another novel trend is to extract datasets from rendered scenes, either via simulators (Ros et al., 2016) or via video games (Richter et al., 2016; Shafaei et al., 2016). For the future, it will be exciting to combine such datasets, to develop methods for cross-dataset generalization, and to find the optimal trade-off between manual labeling efforts and automated dataset generation via rendering. We believe that Cityscapes will aid such investigations by providing a real-world representative benchmark suite.

4.3 pixel-level semantic labeling

Besides the dataset itself, Cityscapes includes a benchmark suite to evaluate and fairly compare different approaches, c.f. Section 4.5. For this benchmark suite, we focus on two major tasks for autonomous driving. The first task involves predicting a per-pixel semantic labeling of the image without considering higher-level object instance or boundary information and is referred to as pixel-level semantic labeling.

4.3.1 Metrics

To assess labeling performance, we rely on a standard and a novel metric. The first is the standard Jaccard Index, commonly known as the PASCAL VOC Intersection-over-Union (IoU) metric (Everingham et al., 2014) and defined as

\[
\text{IoU} = \frac{TP}{TP + FP + FN}\,, \tag{4.1}
\]

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. Owing to the two semantic granularities, i.e. classes and categories, we report two separate mean performance scores: IoUclass and IoUcategory. In either case, pixels labeled as void do not contribute to the score. Note that this metric was previously introduced in greater detail, c.f. Section 3.8.3. It is well known that the global IoU measure is biased toward object instances that cover a large image area. In street scenes with their strong scale variation this can be problematic. Specifically for traffic participants, which are the key classes in our scenario, we aim to evaluate how well the individual instances in the scene are represented in the labeling. To address this, we introduced an additional object-centric evaluation in Section 3.8.3. For the Cityscapes pixel-level metrics, we would like to follow this idea, but with a slightly altered metric. The object-centric evaluation from Section 3.8.3 has two drawbacks: it is rather complex and hard to understand, and it is computationally expensive to evaluate, since for each pixel in the image the closest candidate instance must be searched, c.f. Algorithm 3.2. Especially the latter renders this metric infeasible for an evaluation server on such a large-scale dataset with 2 MP images. To overcome these limitations, we propose an instance-level intersection-over-union metric defined as

\[
\text{iIoU} = \frac{iTP}{iTP + FP + iFN}\,. \tag{4.2}
\]

Here, iTP and iFN denote weighted counts of true positive and false negative pixels, respectively. In contrast to the standard IoU measure, the contribution of each pixel is weighted by the ratio of the class' average instance size to the size of the respective ground truth instance, where sizes are measured as the number of pixels assigned to the instances. As before, FP is the number of false positive pixels. It is important to note here that, unlike for the instance-level task in Section 4.4, we assume that the methods only yield a

standard per-pixel semantic class labeling as output. Therefore, the false positive pixels are not associated with any instance and thus do not require normalization. The final scores, iIoUclass and iIoUcategory, are obtained as the means for the two semantic granularities, while only classes with instance annotations are included, i.e. vehicles and humans.
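To make the two metrics concrete, the following minimal numpy sketch evaluates Equations 4.1 and 4.2 for a single class. The function and argument names are our own, and the class' average instance size (avg_size) is assumed to be precomputed from the annotations; this is an illustrative sketch, not our evaluation code.

    import numpy as np

    def iou(pred, gt, cls, ignore=255):
        """Standard IoU (Eq. 4.1) for one class; void pixels are ignored."""
        valid = gt != ignore
        tp = np.sum((pred == cls) & (gt == cls) & valid)
        fp = np.sum((pred == cls) & (gt != cls) & valid)
        fn = np.sum((pred != cls) & (gt == cls) & valid)
        return tp / float(tp + fp + fn)

    def iiou(pred, gt_label, gt_instance, cls, avg_size, ignore=255):
        """Instance-weighted IoU (Eq. 4.2): TP and FN pixels are weighted by
        the ratio of the class' average instance size to the size of the
        ground truth instance they belong to; FP pixels stay unweighted."""
        valid = gt_label != ignore
        itp, ifn = 0.0, 0.0
        for inst_id in np.unique(gt_instance[(gt_label == cls) & valid]):
            mask = (gt_instance == inst_id) & (gt_label == cls)
            weight = avg_size / float(mask.sum())
            itp += weight * np.sum(mask & (pred == cls))
            ifn += weight * np.sum(mask & (pred != cls))
        fp = np.sum((pred == cls) & (gt_label != cls) & valid)
        return itp / (itp + fp + ifn)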

4.3.2 Control experiments

We conduct several control experiments to put our baseline results below into perspective. Average results with respect to all of our metrics introduced in Section 4.3.1 are listed in Table 4.6, but for simplicity we focus in the following discussion on the main score IoUclass. Performance numbers for individual classes are provided in Tables 4.7 and 4.8, and qualitative results of the control experiments can be found in Figures 4.9 and 4.10. First, we count the relative frequency of every class label at each pixel location of the fine (coarse) training annotations. We use the most frequent label at each pixel as a constant prediction irrespective of the test image and call this the Static Predictor Fine (SPF) and Static Predictor Coarse (SPC), respectively. This control experiment results in roughly 10 % IoUclass, as shown in Table 4.6. These low scores emphasize the high diversity of our data. SPF and SPC having similar performance indicates the value of our additional coarse annotations. Even if the ground truth segments are re-classified using the most frequent training label (SPF or SPC) within each segment mask, the performance does not notably increase. Second, we re-classify each ground truth segment using FCN-8s (Shelhamer et al., 2017), c.f. Section 4.3.3. We compute the average scores within each segment and assign the maximizing label. The performance is significantly better than that of the static predictors but still far from 100 %. We conclude that it is necessary to optimize both classification and segmentation quality at the same time. Third, we evaluate the performance of subsampled ground truth annotations as predictors. Subsampling was done by majority voting of neighboring pixels, followed by resampling back to full resolution. This yields an upper bound on the performance at a fixed output resolution and is particularly relevant for deep learning approaches that often apply downscaling due to constraints on time, memory, or the network architecture itself. Note that while subsampling by a factor of 2 hardly affects the IoU score, it clearly decreases the iIoU score given its comparatively large impact on small, but nevertheless important objects. This underlines the importance of the separate instance-normalized evaluation. The downsampling factors of 8, 16, and 32 correspond to the strides of the FCN baseline model presented in Section 4.3.3. Downsampling by a factor of 128 is the smallest

                                      Classes          Categories
Metric [%]                           IoU    iIoU      IoU    iIoU

Static Predictor Fine (SPF)         10.1     4.7     26.3    19.9
Static Predictor Coarse (SPC)       10.3     5.0     27.5    21.7
GT^a segmentation with SPF          10.1     6.3     26.5    25.0
GT segmentation with SPC            10.9     6.3     29.6    27.0

GT segmentation with FCN^b          79.4    52.6     93.3    80.9

GT subsampled by 2                  97.2    92.6     97.6    93.3
GT subsampled by 4                  95.2    90.4     96.0    91.2
GT subsampled by 8                  90.7    82.8     92.1    83.9
GT subsampled by 16                 84.6    70.8     87.4    72.9
GT subsampled by 32                 75.4    53.7     80.2    58.1
GT subsampled by 64                 63.8    35.1     71.0    39.6
GT subsampled by 128                50.6    21.1     60.6    29.9

nearest training neighbor           21.3     5.9     39.7    18.6

^a Ground Truth   ^b Long et al., 2015

Table 4.6: Quantitative results of control experiments for pixel-level semantic labeling using the metrics presented in Section 4.3.1.

(power of 2) downsampling for which all images have a distinct labeling. Lastly, we employ 128-times subsampled annotations and retrieve the nearest training annotation in terms of the Hamming distance. The full-resolution version of this training annotation is then used as prediction, resulting in 21 % IoUclass. While outperforming the static predictions, the poor result demonstrates the high variability of our dataset and its demand for approaches that generalize well.
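The subsampling oracle from the third control experiment can be sketched as follows; a minimal numpy version that, for brevity, crops the label map to a multiple of the factor:

    import numpy as np

    def subsample_majority(gt, factor):
        """Majority vote over factor x factor blocks, then nearest-neighbor
        upsampling back to the (cropped) input resolution."""
        h = gt.shape[0] // factor * factor
        w = gt.shape[1] // factor * factor
        blocks = gt[:h, :w].reshape(h // factor, factor, w // factor, factor)
        blocks = blocks.transpose(0, 2, 1, 3).reshape(h // factor, w // factor, -1)
        # per-block majority vote; labels are small non-negative integers
        voted = np.apply_along_axis(lambda v: np.bincount(v).argmax(), -1, blocks)
        return np.kron(voted, np.ones((factor, factor), dtype=int)).astype(gt.dtype)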

4.3.3 Baselines

Our own baseline experiments rely on Fully Convolutional Networks (FCNs), as they are central to most state-of-the-art methods, e.g. L.-C. Chen et al., 2016; G. Lin et al., 2016; Schwing and Urtasun, 2015; Shelhamer et al., 2017; S. Zheng et al., 2015. We adopted a VGG-16 network (Simonyan and Zisserman, 2015) and start with a model pretrained on ImageNet (Russakovsky et al., 2015), i.e. the underlying CNN was initialized with ImageNet VGG weights. We additionally investigated first pretraining on PASCAL-Context (Mottaghi et al., 2014), but found this to not influence performance given a sufficiently large number of

                                sky   person  rider   car   truck   bus   train  motorcycle  bicycle

Static Predictor Fine (SPF)    22.1    0.0     0.0   23.4    0.0    0.0    0.0      0.0        0.0
Static Predictor Coarse (SPC)  24.3    0.0     0.0   26.2    0.0    0.0    0.0      0.0        0.0
GT^a segmentation with SPF     17.9    0.0     0.0   32.9    0.0    0.0    0.0      0.0        0.0
GT segmentation with SPC       29.2    0.0     0.0   34.1    0.0    0.0    0.0      0.0        0.0

GT segmentation with FCN^b     99.1   90.6    69.2   98.0   59.0   66.9   71.6     66.8       85.8

GT subsampled by 2             98.3   96.5    95.7   98.9   98.9   99.1   98.9     96.5       95.8
GT subsampled by 4             97.9   94.1    92.5   98.2   98.1   98.5   98.1     94.1       93.0
GT subsampled by 8             94.5   88.9    86.1   96.2   95.9   96.7   96.1     88.7       86.8
GT subsampled by 16            93.1   81.0    76.0   93.5   93.0   94.4   93.4     80.8       78.0
GT subsampled by 32            88.7   69.4    62.3   88.0   87.4   89.8   88.5     68.6       65.6
GT subsampled by 64            81.6   55.1    46.4   78.8   78.9   82.4   80.2     54.2       50.7
GT subsampled by 128           69.9   41.1    31.5   67.3   66.3   70.1   68.3     36.0       33.3
nearest training neighbor      36.5    4.0     0.4   42.0    9.7   18.3   12.9      0.3        1.7

^a Ground Truth   ^b Long et al., 2015

Table 4.7: Detailed results of our control experiments for the pixel-level semantic labeling task in terms of the IoU score on the class level. We list results for the classes in the categories sky, human, and vehicle. The remaining categories are listed in Table 4.8. All numbers are given in percent.

                                road  sidewalk building  wall  fence  pole  traffic light  traffic sign  vegetation  terrain

Static Predictor Fine (SPF)    80.0   13.2     40.3      0.0   0.0    0.0       0.0            0.0         12.5        0.0
Static Predictor Coarse (SPC)  80.1    9.5     39.5      0.0   0.0    0.0       0.0            0.0         16.4        0.0
GT^a segmentation with SPF     80.8   11.1     44.5      0.0   0.0    0.0       0.0            0.0          4.2        0.0
GT segmentation with SPC       79.6    5.1     46.6      0.0   0.0    0.0       0.0            0.0         11.8        0.0

GT segmentation with FCN^b     99.3   91.9     94.8     44.9  62.0   66.1      81.2           84.3         96.5       80.1

GT subsampled by 2             99.6   98.1     98.6     97.8  97.4   90.4      94.1           95.2         98.7       97.6
GT subsampled by 4             99.4   96.8     98.0     96.1  95.5   83.1      89.7           91.6         98.0       96.0
GT subsampled by 8             98.6   93.4     95.4     92.3  91.1   69.5      80.9           84.2         95.5       92.1
GT subsampled by 16            97.8   88.8     93.1     86.9  84.9   50.9      68.4           73.0         93.4       86.5
GT subsampled by 32            96.0   80.9     88.7     77.6  75.2   30.9      51.6           56.8         89.2       77.3
GT subsampled by 64            92.1   69.6     83.0     65.5  61.0   14.8      32.1           37.6         83.3       65.2
GT subsampled by 128           86.2   55.0     75.2     51.3  45.9    5.7      13.6           17.9         75.2       51.6
nearest training neighbor      85.3   35.6     56.7     15.6   6.2    1.3       0.5            1.0         54.2       23.3

^a Ground Truth   ^b Long et al., 2015

Table 4.8: Detailed results of our control experiments for the pixel-level semantic labeling task in terms of the IoU score on the class level. We list results for the classes in the categories flat, construction, object, and nature. The remaining categories are listed in Table 4.7. All numbers are given in percent.

Figure 4.9: Exemplary output of our control experiments for the pixel-level semantic labeling task. The image is part of our test set and contains both the largest number of instances and the largest number of persons.

Figure 4.10: Exemplary output of our control experiments for the pixel-level semantic labeling task. The image is part of our test set and has the largest number of car instances.

training iterations. For training and testing, we rely on single-frame, monocular LDR images only. After initialization, the FCN is finetuned on Cityscapes using the respective portions listed in Table 4.9. For finetuning, we utilized the training scripts that were published in conjunction with the conference paper by Long et al., 2015, which is a predecessor of the journal article (Shelhamer et al., 2017). We build on the models for the PASCAL-Context dataset and use the Caffe framework (Jia et al., 2014) for training. The only change that we made was to modify the learning rate to match our image resolution under an unnormalized loss. In doing so, we ensured to stay as close as possible to the original implementation. Since VGG-16 training on 2 MP images exceeds even the largest GPU memory available, we split each image into two halves with an overlap that is sufficiently large considering the network's receptive field; see the sketch after Table 4.9. As proposed by Long et al., 2015, we use three-stage training with subsequently smaller strides, i.e. first FCN-32s, then FCN-16s, and then FCN-8s, always initializing with the parameters from the previous stage. Note that the numbers denote the stride of the finest heatmap. We add a fourth stage for which we reduce the learning rate by a factor of 10. Each stage is trained until convergence on the validation set; pixels with void ground truth are ignored such that they do not induce any gradient. Finally, we retrain on train and val together with the same number of epochs, yielding 243 250, 69 500, 62 550, and 5950 iterations for stages 1 through 4. Note that each iteration corresponds to half of an image (see above). For the variant with factor-2 downsampling, no image splitting is necessary, yielding 80 325, 68 425, 35 700, and 5950 iterations in the respective stages. The variant only trained on val (full resolution) uses train for validation, leading to 130 000, 35 700, 47 600, and 0 iterations in the 4 stages. Our last FCN variant is trained using the coarse annotations only, with 386 750, 113 050, 35 700, and 0 iterations in the respective stages; pixels with void ground truth are ignored here as well. Quantitative results of our baseline experiments can be found in Tables 4.9 to 4.11, as well as qualitative examples in Figures 4.11 and 4.12. From studying these results, we can draw some early conclusions: (1) Training the FCN models for finer heatmaps yields better accuracy. Note that the significant improvement from FCN-16s to FCN-8s stands in contrast to the observations in Shelhamer et al., 2017, confirming the high quality of annotations in Cityscapes, such that the evaluation is sensitive to finer predictions at the object boundaries, c.f. Figures 4.11 and 4.12. (2) The amount of downscaling applied during training and testing has a strong negative influence on performance (c.f. FCN-8s vs. FCN-8s at half resolution). We attribute this to the large scale variation present in our dataset, c.f. Figure 4.8. (3) Training FCN-8s with 500 densely annotated images (750 h of annotation) yields comparable IoU performance to a model trained on 20 000

                 train  val  coarse  sub      Classes        Categories
                                              IoU   iIoU     IoU   iIoU

FCN-32s            X     X                   61.3   38.2    82.2   65.4
FCN-16s            X     X                   64.3   41.1    84.5   69.2
FCN-8s             X     X                   65.3   41.7    85.7   70.1
FCN-8s sub         X     X            2      61.9   33.6    81.6   60.9
FCN-8s val               X                   58.3   37.4    83.4   67.2
FCN-8s coarse                 X              58.0   31.8    78.2   58.4

Table 4.9: Quantitative results of baselines for semantic labeling using the metrics presented in Section 4.3.1. We show results for different Fully Convolutional Network (FCN) models (Long et al., 2015) trained on different sets of training data. All numbers are given in percent and we indicate the used training data for each method, i.e. train fine, val fine, coarse extra, as well as a potential downscaling factor (sub) of the input image.

weakly annotated images (1300 h of annotation). However, in both cases the performance is significantly lower than that of FCN-8s trained on all 3475 densely annotated images. Many fine labels are thus important for training standard methods as well as for testing, but the performance using coarse annotations only does not collapse and presents a viable option. (4) The iIoU metric treats instances of any size equally and is therefore more sensitive to errors in predicting small objects compared to the IoU. Since the coarse annotations do not include small or distant instances, their iIoU performance is worse.
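For completeness, the image splitting referenced above can be illustrated as follows; a schematic numpy sketch in which the overlap margin is an assumed placeholder, since the true value must be derived from the network's actual receptive field:

    import numpy as np

    def split_with_overlap(image, margin=202):
        """Split an image (H x W x 3) vertically into two halves whose
        overlap covers the receptive field, so that every pixel sees its
        full context in at least one crop."""
        w = image.shape[1]
        return image[:, : w // 2 + margin], image[:, w // 2 - margin :]

    def merge_halves(scores_left, scores_right, w):
        """Stitch per-pixel class scores (C x H x W_crop) back together;
        each half keeps the region in which its pixels had full context."""
        c, h, _ = scores_left.shape
        merged = np.empty((c, h, w), dtype=scores_left.dtype)
        merged[:, :, : w // 2] = scores_left[:, :, : w // 2]
        merged[:, :, w // 2 :] = scores_right[:, :, -(w - w // 2) :]
        return merged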

4.3.4 Pixel-level benchmark

A major goal of the Cityscapes benchmark suite is to enable a fair comparison of the performances of different methods for semantic urban scene understanding. To achieve this goal, we followed two concepts. First, prior to the public release of the dataset, we asked selected groups that had proposed state-of-the-art semantic labeling approaches in the past to optimize and run their methods on our dataset, such that we could evaluate their predictions on our test set. Second, concurrently with the official release, we launched an evaluation server, where researchers can upload their predictions on the test set, enabling a fully automated evaluation, c.f. Section 4.5. In both cases, it is ensured that the participants do not have access to the test set annotations, and in turn overfitting is prevented. For the scope of this dissertation, we provide and analyze a snapshot of the leaderboard for pixel-level semantic labeling as of April 15th, 2017. We only consider those submissions that are publicly

                 sky   person  rider   car   truck   bus   train  motorcycle  bicycle

FCN-32s         91.6    71.1   46.7   91.0   33.3   46.6   43.8      48.2      59.1
FCN-16s         92.9    75.4   50.5   91.9   35.3   49.1   45.9      50.7      65.2
FCN-8s          93.9    77.1   51.4   92.6   35.3   48.6   46.5      51.6      66.8
FCN-8s sub      92.5    69.5   46.0   90.8   41.9   52.9   50.1      46.5      58.4
FCN-8s val      92.9    73.3   42.7   89.9   22.8   39.2   29.6      42.5      63.1
FCN-8s coarse   90.2    59.6   37.2   86.1   35.4   53.1   39.7      42.6      52.6

Table 4.10: Detailed results of our baseline experiments for the pixel-level semantic labeling task in terms of the IoU score on the class level. We list results for the classes in the categories sky, human, and vehicle. The remaining categories are listed in Table 4.11. All numbers are given in percent.

                 road  sidewalk building  wall  fence  pole  traffic light  traffic sign  vegetation  terrain

FCN-32s         97.1   76.0     87.6     33.1  36.3   35.2      53.2           58.1         89.5       66.7
FCN-16s         97.3   77.6     88.7     34.7  44.0   43.0      57.7           62.0         90.9       68.6
FCN-8s          97.4   78.4     89.2     34.9  44.2   47.4      60.1           65.0         91.4       69.3
FCN-8s sub      97.0   75.4     87.3     37.4  39.0   35.1      47.7           53.3         89.3       66.1
FCN-8s val      95.9   69.7     86.9     23.1  32.6   44.3      52.1           56.8         90.2       60.9
FCN-8s coarse   95.3   67.7     84.6     35.9  41.0   36.0      44.9           52.7         86.6       60.2

Table 4.11: Detailed results of our baseline experiments for the pixel-level semantic labeling task in terms of the IoU score on the class level. We list results for the classes in the categories flat, construction, object, and nature. The remaining categories are listed in Table 4.10. All numbers are given in percent.

Figure 4.11: Exemplary output of our baseline experiments for the pixel-level semantic labeling task. The image is part of our test set and contains both the largest number of instances and the largest number of persons.

Figure 4.12: Exemplary output of our baseline experiments for the pixel-level semantic labeling task. The image is part of our test set and has the largest number of car instances.

listed, yielding 48 submissions. Next, we filter out those entries that are anonymous, i.e. do not refer to a publication, which holds for approximately 44 % of the public submissions. In cases where multiple variants of the same method are listed, we only keep the best performing one, leaving 20 submissions in our snapshot. We then selected the ten best performing approaches for our study. In addition, matching the research goals of this dissertation, we include methods that leverage multiple input modalities, and the two best performing methods that report a run time of less than 500 ms. Note that we also included our real-time model that we will present in Section 4.3.5. An overview of the approaches is given in Table 4.12 and all methods were already briefly presented in Chapter 2. The quantitative performance numbers of the methods are evaluated for each class and reported in Tables 4.14 and 4.15, while the average performance is provided in Table 4.13. Qualitative results of the top eight as well as the additionally selected methods can be found in Figures 4.13 and 4.14. The best performing model is ResNet-38 by Z. Wu et al., 2016 with an average IoUclass score of 80.6 %. Z. Wu et al., 2016 propose to widen neural networks instead of deepening them and combine such a CNN based on ResNet (K. He et al., 2016) with the DeepLab framework (L.-C. Chen et al., 2016). Impressively, considering class-level accuracy, this method outperforms our oracle experiment, where we used an FCN to classify ideal segments, c.f. Table 4.6. We start our discussion with a high-level analysis of the selected methods. Following the current trend in the computer vision community, all methods rely on CNNs for pixel classification. To this end, most authors build on network architectures originally designed for image classification, such as VGG (Simonyan and Zisserman, 2015) or ResNet (K. He et al., 2016). These networks are then slightly modified to allow for a dense pixel-level classification, but the changes are small enough that using a pretrained ImageNet (Russakovsky et al., 2015) classification model for initialization remains possible. After initialization, the network is trained on the target data, i.e. Cityscapes, a procedure commonly referred to as finetuning. While such an approach limits the degrees of freedom, the network benefits from generic features that were learned for large-scale image classification, training converges faster, and the performance is typically increased due to a network that generalizes better. Even though Cityscapes is a large-scale dataset itself, it is biased towards urban street scenes, and additional generic data helps to increase recognition performance. Note that we also adopted this concept for our baseline experiments in Section 4.3.3. Within the analyzed methods, the only exceptions are FRRN (Pohlen et al., 2017) and ENet (Paszke et al., 2016), where the networks are specifically designed for high-resolution pixel classification and efficiency, respectively. Both networks are only trained on Cityscapes and are initialized randomly.

Name                      Reference                  depth  coarse  base net  CRF  mst  sub  rt [ms]

ResNet-38                 Z. Wu et al., 2016                  X       A             X
PSPNet                    Zhao et al., 2017                   X       R             X
TuSimple                  Wang et al., 2017                   X       R
RefineNet                 G. Lin et al., 2017                         R             X
LRR-4x                    Ghiasi and Fowlkes, 2016            X       V
FRRN                      Pohlen et al., 2017                                            2
Adelaide Context          G. Lin et al., 2016                         V        X
Deep Layer Cascade (LC)   Xiaoxiao Li et al., 2017                    I
DeepLab v2 CRF            L.-C. Chen et al., 2016                     R        X
Dilation 10               Yu and Koltun, 2016                         V                      4k
Scale invariant CNN       Kreso et al., 2016           X              V        X
SQ                        Treml et al., 2016                          S                      60
ENet                      Paszke et al., 2016                                            2   13
Ours                      Section 4.3.5                       X       G                      44

Pretrained networks for initialization on ImageNet (Russakovsky et al., 2015):
R ResNet-101 (K. He et al., 2016)
V VGG-16 (Simonyan and Zisserman, 2015)
I Inception ResNet v2 (Szegedy et al., 2016a)
S SqueezeNet v1.1 (Iandola et al., 2016)
G GoogLeNet v1 (Szegedy et al., 2015)
Pretrained on ImageNet (Russakovsky et al., 2015) and Places 365 (Zhou et al., 2016):
A A2, 2 conv (Z. Wu et al., 2016)

Table 4.12: Overview of the methods included in the pixel-level semantic labeling benchmark. We drew a snapshot of the public leaderboard on April 15, 2017 and include the top ten methods, the two fastest, as well as those using multi-cue input data. For each method, we provide a name and a reference, and we indicate if depth meta data was used (other meta data was never used), if coarse annotations were used during training, if a pretrained model was used for network initialization (base net), if a CRF is used as post-processing, if multi-scale testing (mst) was applied to increase test performance, a potential downscaling factor (sub) of the input image during testing, and the reported run time of the method in ms (rt).

                           Classes        Categories
                          IoU   iIoU     IoU   iIoU

ResNet-38                80.6   57.8    91.0   79.1
PSPNet                   80.2   58.1    90.6   78.2
TuSimple                 80.1   56.9    90.7   77.8
RefineNet                73.6   47.2    87.9   70.6
LRR-4x                   71.8   47.9    88.4   73.9
FRRN                     71.8   45.5    88.9   75.1
Adelaide Context         71.6   51.7    87.3   74.1
Deep Layer Cascade (LC)  71.1   47.0    88.1   74.1
DeepLab v2 CRF           70.4   42.6    86.4   67.7
Dilation 10              67.1   42.0    86.5   71.1
Scale invariant CNN      66.3   44.9    85.0   71.2
SQ                       59.8   32.3    84.3   66.0
ENet                     58.3   34.4    80.4   64.0
Ours                     72.6   45.5    87.9   71.6

Table 4.13: Quantitative results of the pixel-level semantic labeling benchmark using the metrics presented in Section 4.3.1. All numbers are given in percent; refer to Table 4.12 for details on the listed methods.

                   sky   person  rider   car   truck   bus   train  motorcycle  bicycle

ResNet-38         95.5    86.8   71.1   96.1   75.2   87.6   81.9      69.8      76.7
PSPNet            95.1    86.3   71.4   96.0   73.5   90.4   80.3      69.9      76.9
TuSimple          95.4    85.9   70.5   95.9   76.1   90.6   83.7      67.4      75.7
RefineNet         94.8    80.9   63.3   94.5   64.6   76.1   64.3      62.2      70.0
LRR-4x            95.0    81.3   60.1   94.3   51.2   67.7   54.6      55.6      69.6
FRRN              94.9    81.6   62.7   94.6   49.1   67.1   55.3      53.5      69.5
Adelaide Context  94.1    81.5   61.1   94.3   61.1   65.1   53.8      61.6      70.6
Deep LC           94.2    81.2   57.9   94.1   50.1   59.6   57.0      58.6      71.1
DeepLab v2 CRF    94.2    79.8   59.8   93.7   56.5   67.5   57.5      57.7      68.8
Dilation 10       93.7    78.9   55.0   93.3   45.5   53.4   47.7      52.2      66.0
Scale inv. CNN    92.2    77.6   55.9   90.1   39.2   51.3   44.4      54.4      66.1
SQ                93.0    73.8   42.6   91.5   18.8   41.2   33.3      34.0      59.9
ENet              90.6    65.5   38.4   90.6   36.9   50.5   48.1      38.8      55.4
Ours              94.7    81.2   61.2   94.6   54.5   76.5   72.2      57.6      68.7

Table 4.14: Detailed results of the pixel-level semantic labeling benchmark in terms of the IoU score on the class level. We list results for the classes in the categories sky, human, and vehicle. The remaining categories are listed in Table 4.15. All numbers are given in percent; refer to Table 4.12 for details on the listed methods.

                   road  sidewalk building  wall  fence  pole  traffic light  traffic sign  vegetation  terrain

ResNet-38         98.7   86.9     93.3     60.4  62.9   67.6      75.0           78.7         93.7       73.7
PSPNet            98.6   86.6     93.2     58.1  63.0   64.5      75.2           79.2         93.4       72.1
TuSimple          98.5   85.9     93.2     57.7  61.1   67.2      73.7           78.0         93.4       72.3
RefineNet         98.2   83.3     91.3     47.8  50.4   56.1      66.9           71.3         92.3       70.3
LRR-4x            97.9   81.5     91.4     50.5  52.7   59.4      66.8           72.7         92.5       70.1
FRRN              98.2   83.3     91.6     45.8  51.1   62.2      69.4           72.4         92.6       70.0
Adelaide Context  98.0   82.6     90.6     44.0  50.7   51.1      65.0           71.7         92.0       72.0
Deep LC           98.1   82.8     91.2     47.1  52.8   57.3      63.9           70.7         92.5       70.5
DeepLab v2 CRF    97.9   81.3     90.3     48.8  47.4   49.6      57.9           67.3         91.9       69.4
Dilation 10       97.6   79.2     89.9     37.3  47.6   53.2      58.6           65.2         91.8       69.4
Scale inv. CNN    96.3   76.8     88.8     40.0  45.4   50.1      63.3           69.6         90.6       67.1
SQ                96.9   75.4     87.9     31.6  35.7   50.9      52.0           61.7         90.9       65.8
ENet              96.3   74.2     85.0     32.2  33.2   43.5      34.1           44.0         88.6       61.4
Ours              98.0   81.4     91.1     44.6  50.7   57.3      64.1           71.2         92.1       68.5

Table 4.15: Detailed results of the pixel-level semantic labeling benchmark in terms of the IoU score on the class level. We list results for the classes in the categories flat, construction, object, and nature. The remaining categories are listed in Table 4.14. All numbers are given in percent; refer to Table 4.12 for details on the listed methods.

Figure 4.13: Exemplary output of benchmarked methods for the pixel-level semantic labeling task. The image is part of our test set and contains both the largest number of instances and the largest number of persons. Please refer to Table 4.12 for details on the shown methods.

Figure 4.14: Exemplary output of benchmarked methods for the pixel-level semantic labeling task. The image is part of our test set and has the largest number of car instances. Please refer to Table 4.12 for details on the shown methods.

However, their performance does not reach the current state of the art, as evident from Table 4.13. To the best of our knowledge, no one has so far designed a network specifically for pixel-level semantic labeling that is nevertheless pretrained on ImageNet to increase its generalization capabilities. Out of all non-anonymous submissions, there is only a single one leveraging the available multi-modal input data, i.e. the scale invariant CNN by Kreso et al., 2016, which employs depth cues. However, the depth information is only used to select an appropriate scale out of a set of multi-scale features. There are no deep learning-based features directly extracted on the depth channel, and the overall performance is rather low, as evident from Table 4.13. All other analyzed methods rely on single LDR images only. A possible explanation is the requirement of an initialization with an ImageNet-like pretrained model, paired with the lack of an available equivalent dataset featuring multi-modal data. Next, we analyze the average performance numbers in Table 4.13 with respect to our different metrics proposed in Section 4.3.1. First, it becomes evident that the class-level and category-level metrics are highly correlated, while the category-level numbers are significantly higher, as this is a simpler task. Because the networks are only evaluated in terms of their category-level performance while still predicting on the class level, we investigated the effects when directly producing category-level results. To this end, we compared the category-level performance of an FCN trained for class-level prediction with the same FCN producing category-level predictions obtained by adding the probability scores prior to computing the argmax. Additionally, we trained an FCN to directly predict category-level labels. Interestingly, we found that all three variants yield performance numbers within a range of 0.3 %. Thus, we conclude that the category-level performance is mainly influenced by the method and hence the network architecture rather than a specific training, which also explains the correlation between class- and category-level metrics. Second, we compare performances in terms of the iIoU metric. We observe that focusing on this score does in fact change the ranking of some methods. Most interesting are Adelaide Context (G. Lin et al., 2016), which performs comparably well considering the iIoU, and DeepLab v2 CRF (L.-C. Chen et al., 2016), which shows a large drop in performance. Both methods leverage a CRF on top of the CNN, which indicates that such a post-processing has large potential in influencing the iIoU metric. In Adelaide Context, all CRF potentials are based on an FCN output, while in DeepLab the pairwise potentials are simpler regularizers that tend to oversmooth small instances. Similar observations were made based on the initial set of submissions for this benchmark (Cordts et al., 2016); however, there are also exceptions, i.e. the Scale invariant CNN by Kreso et al., 2016.
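The conversion from class-level to category-level predictions described above amounts to summing the per-class probabilities of each category before taking the argmax. A minimal sketch; the class-to-category mapping array is illustrative:

    import numpy as np

    def category_probs(class_probs, class_to_cat, num_categories):
        """class_probs: (C, H, W) per-pixel class probabilities;
        class_to_cat[c] gives the category index of class c."""
        _, h, w = class_probs.shape
        cat = np.zeros((num_categories, h, w), dtype=class_probs.dtype)
        for cls_idx, cat_idx in enumerate(class_to_cat):
            cat[cat_idx] += class_probs[cls_idx]
        return cat

    # per-pixel category prediction:
    # cat_pred = category_probs(probs, CLASS_TO_CAT, K).argmax(axis=0)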

We now consider the IoU performance of individual classes as listed in Tables 4.14 and 4.15. We find that the evaluated methods perform best for road, followed by sky and car. Because for pixel-level semantic labeling each pixel is a (correlated) training example, it is natural that these classes are comparably easy, since they are among the most frequent classes of our dataset, c.f. Figure 4.3. However, this distribution alone cannot explain the overall ranking; in particular, wall, fence, and truck are the hardest but not the rarest classes. Instead, we expect typical class-specific properties to influence the overall performance, such as a repeating and hence simple texture for road and sky. In contrast, a standalone wall is easily confused with a building, or a fence might get classified as the object behind it, which can be seen through its holes. Further, we observe that the performance of truck, bus, and train varies strongest across methods. Note that all three classes contain large but rare objects, which hence seems to be a key challenge of the dataset. Next, we analyze the results in order to identify important design choices that lead to well performing neural networks. When comparing the method properties (Table 4.12) with their average performance (Table 4.13), three observations can be made. First, out of the five best performing approaches, four use the additional coarse annotations for training. All authors also report IoUclass performance numbers for the same models trained without the additional data: the smallest gain is 1.2 % (Wang et al., 2017), whereas the largest is 3 % (Ghiasi and Fowlkes, 2016). This clearly confirms the value of the coarse annotations and complements our findings in Section 4.3.3. Second, three of the best four approaches apply the network to multiple image scales for testing and average the obtained results. Z. Wu et al., 2016 report a gain of 0.9 % IoUclass by applying this trick. However, for a real-time application, as is the focus of this dissertation, multi-scale testing is not a preferred choice, since it significantly increases run time. Our last observation is that applying a CRF as post-processing or downscaling the input image potentially harms performance. Only three approaches leverage a CRF and only two apply downscaling, and none of them is within the top five methods. This observation confirms our findings based on the initial set of submissions when publishing the dataset (Cordts et al., 2016). Finally, we aim to understand the correlation across different methods in terms of their predictions on the Cityscapes dataset. For these experiments, we seek to analyze only the best performing methods on the dataset that have approximately the same performance, in order to reduce the effects induced by different levels of accuracy. To this end, we also include those submissions that have a public entry in the result table but are anonymous. We then select the ten best performing methods as of April 15th, 2017, ranging from 76.5 % IoUclass to 80.6 %. Based on this selection, we determine for each pixel

Figure 4.15: Histogram over the number of correct methods. We select the ten best performing methods in the pixel-level semantic labeling benchmark and count for each pixel of the test set that is not ignored during evaluation how many methods correctly predicted the semantic class of that pixel. Based on these counts we then compute the histogram over all pixels and normalize. This evaluation indicates that the methods produce correlated results, as typically all or nearly all methods are either correct or incorrect.

that is not ignored during evaluation, how many methods correctly predicted its semantic class, and compute a histogram over these counts over all evaluated pixels, c.f. Figure 4.15. This evaluation shows that 93.6 % of the pixels are correctly classified by all methods. Out of the remaining pixels, 23 % are misclassified by a single method only and another 23 % are misclassified by all methods; we denote those pixels as hard pixels in the following. These observations indicate that different networks trained and evaluated on the same dataset produce correlated results. Since the analysis so far is based on global statistics, we now address the importance of the hard pixels for the overall IoUclass, where the contribution of individual classes is balanced. To this end, we re-evaluate the best method (ResNet-38 by Z. Wu et al., 2016) while ignoring the hard pixels. In doing so, the performance increases from 81.4 % IoUclass to 88.1 %. When instead ignoring the erroneous but not hard pixels, performance reaches the same level, i.e. 89.5 %, indicating that the hard pixels account for approximately half of the overall performance loss. These findings motivate the question of whether we can combine the predictions of a set of different methods to obtain better results, similar to the ensemble models that became popular for ILSVRC (Russakovsky et al., 2015). The concept of such models is to train multiple neural networks for the same task with different initializations and to combine their results at test time. In our case, we build our ensemble out of the N best submissions and use their majority vote for each pixel as prediction. Such an ensemble has the advantage of consisting of a highly diverse set of models; however, these models have a varying individual performance. To compensate for the latter, we tried all values between 3 and 10 for the number of models N. We found that in all cases the performance increases with respect to the best single model, and combining the top five methods yields the maximum, i.e. 81.4 %. Since this is only a small improvement compared with the best single model (80.6 %), we conclude that even though ensemble models offer a potential for improvements in theory, it is challenging to actually realize a significant margin in practice.
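The majority-vote ensemble and the correct-method counts behind Figure 4.15 are simple to state in code; a minimal (deliberately unoptimized) sketch, with ties resolved towards the smallest label id as our simplification:

    import numpy as np

    def majority_vote(predictions):
        """predictions: list of N label maps (H x W) from different methods;
        returns the per-pixel majority label."""
        stack = np.stack(predictions)                    # (N, H, W)
        n, h, w = stack.shape
        flat = stack.reshape(n, -1)
        out = np.empty(h * w, dtype=stack.dtype)
        for i in range(h * w):                           # slow but explicit
            out[i] = np.bincount(flat[:, i]).argmax()
        return out.reshape(h, w)

    def num_correct_methods(predictions, gt, ignore=255):
        """Per-pixel count of methods predicting the correct class, i.e.
        the statistic histogrammed in Figure 4.15 (void pixels excluded)."""
        stack = np.stack(predictions)
        counts = (stack == gt[None]).sum(axis=0)
        return counts[gt != ignore]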

4.3.5 Real-time fully convolutional networks

The scope of this dissertation is to work towards an efficient system for urban semantic scene understanding, c.f. Section 1.4. To this end, we aim for excellent recognition performance at real-time speeds. However, as evident from Section 4.3.4, the methods that report a run time are either rather slow or have significant drawbacks in terms of accuracy. For those approaches that do not report run time values, we can assume a rather high run time by analyzing the CNN architecture and method details. In this section, we aim to overcome these shortcomings and combine existing approaches from the literature in order to obtain an efficient yet high-performing model for pixel-level semantic labeling. Most state-of-the-art approaches for pixel-level semantic labeling are based on a Fully Convolutional Network (FCN), c.f. Sections 2.1.2 and 4.3.4, which means that the methods are centered around a CNN where all layers have a finite receptive field. This CNN is typically pretrained on ImageNet for the image classification task, e.g. VGG (Simonyan and Zisserman, 2015) in the work of Shelhamer et al., 2017 or ResNet (K. He et al., 2016) in the more recent approach by Zhao et al., 2017. In this section, we follow the described approach and center our experiments around a single FCN that is pretrained on ImageNet. In order to select the network architecture, we conduct a preliminary experiment, where we measure the run time of different models and list the performances of the models on the ILSVRC'12 validation set as reported in the individual publications, c.f. Table 4.16. For timing, we convert the models to an FCN architecture where necessary (c.f. Shelhamer et al., 2017) and run the network on 1.2 MP RGB images. The selected input size is small enough that all inspected models fit within the available GPU memory, and it is large enough that such inputs can be obtained by cropping the lower and upper image parts without losing relevant information, c.f. Figure 4.1. Besides these modifications, we do not alter or train the networks; in particular, we do not include any upscaling or post-processing in our timings. As evident from Table 4.16, GoogLeNet v1 (Szegedy et al., 2015) yields an excellent tradeoff between run time and performance and is therefore selected as the base network for the following experiments. The GoogLeNet architecture proposed in Szegedy et al., 2015 is intended for image classification, where the output is a score vector with as many elements as there are classes. For pixel-level semantic labeling, we instead seek an output that has the same spatial resolution as the input image and assigns such a score vector to each pixel. Further, the label set that we are interested in differs from the one in ImageNet. Therefore, we remove the final inner product and average pooling layers to prepare the network for the new task. As with many other network architectures (Shelhamer et al., 2017), the GoogLeNet has an output stride of 32, meaning that due to strides in convolution or pooling layers the spatial resolution of its output is downsampled by a factor of 32 along both image axes. Since this downscaling is too severe for an accurate segmentation, researchers have proposed various approaches to obtain the desired resolution; refer to Chapter 2 for details. In this work, we follow Shelhamer et al., 2017 and opt for skip layers, as these are computationally very efficient, simple to use, and integrate well into many network architectures.
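Since the skip layers are central to all models in the remainder of this chapter, we sketch the fusion scheme below. This is an illustrative PyTorch rendition of the FCN-style skip architecture (our implementation used Caffe); the channel widths are placeholders, and fixed bilinear upsampling replaces the learned deconvolutions for brevity.

    import torch.nn as nn
    import torch.nn.functional as F

    class SkipHead(nn.Module):
        """FCN-8s-style fusion: 1x1 score layers on features of stride 32,
        16, and 8 are combined by repeated 2x upsampling and addition; a
        final 8x upsampling restores the input resolution."""
        def __init__(self, c32, c16, c8, num_classes):
            super().__init__()
            self.score32 = nn.Conv2d(c32, num_classes, 1)
            self.score16 = nn.Conv2d(c16, num_classes, 1)
            self.score8 = nn.Conv2d(c8, num_classes, 1)

        def forward(self, f8, f16, f32):
            s = self.score32(f32)
            s = F.interpolate(s, scale_factor=2, mode="bilinear", align_corners=False)
            s = s + self.score16(f16)
            s = F.interpolate(s, scale_factor=2, mode="bilinear", align_corners=False)
            s = s + self.score8(f8)
            return F.interpolate(s, scale_factor=8, mode="bilinear", align_corners=False)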
For training our network, we start with the same training parameters as in Section 4.3.3, except that we directly train the 8s variant with a single cross-entropy loss, which we found to cause no harm

                                 Top-5 Error [%]
Network           Ref.     Single    Multi-Crop    Run time [ms]

SqueezeNet v1.1   [1]       19.7        -                13
NiN               [2]        -         18.2^b            22
AlexNet           [3]       19.6^a     18.2              41
GoogLeNet v1      [4]       10.1        7.9              48
ResNet-50         [5]        -          5.3             133
GoogLeNet v3      [6]        5.6        4.2             136
ResNet-101        [5]        -          4.6             221
VGG-19            [7]        8.0        7.1             271
VGG-16            [7]        8.1        7.2             296
ResNet-152        [5]        -          4.5             320

[1] Iandola et al., 2016 [2] M. Lin et al., 2014 [3] Krizhevsky et al., 2012 [4] Szegedy et al., 2015 [5] K. He et al., 2016 [6] Szegedy et al., 2016b [7] Simonyan and Zisserman, 2015

a github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet b Estimated by comparing ImageNet performance (top-1 multi-crop 40.64 %, gist.github.com/mavenlin/d802a5849de39225bcc6#file-readme-md) with AlexNet (40.7 %)

Table 4.16: Comparison of networks in terms of ImageNet classification performance and run time. We list the top-5 error rate without external training data on the ILSVRC'12 validation set as provided in the individual publications, if not stated otherwise. If available, we report the performance of a single model evaluated on a single image crop, which is closest to our setup. In addition, we provide the performance of a single model tested on multiple crops. Run time is determined by converting the individual models to FCN models (where necessary) and evaluating on RGB images with 1.2 MP. We use Caffe (Jia et al., 2014) with NVIDIA cuDNN 5.1 to run the networks on an NVIDIA Titan X (Maxwell) GPU.

                           train  coarse   lr     it       Classes        Categories
                                                           IoU   iIoU     IoU   iIoU

Base GoogLeNet               X            1.2     910     64.2   43.2    85.5   69.2
+ Higher learning rate       X            10      140     65.3   43.6    85.6   68.5
+ Data augmentation          X            10      812     68.5   48.3    86.9   72.9
+ Include coarse labels      X      X     33     1642     72.5   50.6    87.6   71.5
+ Context modules            X      X     33     1621     73.9   49.7    87.6   70.7

Table 4.17: Quantitative results of different FCN variants based on the GoogLeNet network architecture (Szegedy et al., 2015). Performance is evaluated on the Cityscapes validation set using the metrics presented in Section 4.3.1 and given in percent. We also indicate the used training data for each method, i.e. train fine and coarse extra, and list the learning rate (lr) as multiples of 1 × 10^−11, as well as the number of training iterations (it) as multiples of 1 × 10^3. Please refer to the text for details about the individual variants.

when based on GoogLeNet. Further, we train only on the training set and monitor the performance on the validation set. After convergence, the network reaches a performance of 64.2 % IoU, c.f. Table 4.17. This performance is slightly worse than that of the VGG-based FCN-8s in Section 4.3.3 (65.3 % IoU), c.f. Table 4.9. Starting from this initial experiment, we now aim to increase the classification performance by optimizing the training procedure without affecting the run time, c.f. Table 4.17. To this end, we repeat the training with increasing learning rates until the training diverges. The maximum converging learning rate yields an improvement of 1.7 % IoU while the training is sped up by a factor of 6.5. Next, we apply data augmentation, as is common for training deep neural networks, e.g. K. He et al., 2016; Krizhevsky et al., 2012; Zhao et al., 2017. We apply random left-right flipping, RGB color shifts, additive Gaussian noise, Gaussian blur, and affine warping. In doing so, the training takes longer, but the classification performance increases by another 4.9 %. Next, we intend to leverage the coarse annotations that were successfully used by several approaches, c.f. Section 4.3.4. These annotations provide labels for many pixels but cannot aid the precise segmentation, c.f. Figures 4.11 and 4.12. In order to ensure that the gradient during training is balanced with respect to classification (coarse and fine) and segmentation (fine), we use a batch size of 2, consisting of one fine and one coarse data pair. Furthermore, we increase the learning rate by a factor of 3 without losing convergence. Due to the increased amount of training data paired with the extensive data augmentation, the number of iterations needed for convergence doubles, and due to the doubled batch size the training time quadruples.

Overall, by leveraging the coarse labels we obtain another 5.8 % performance gain. While the receptive field of GoogLeNet is as large as 907 × 907 pixels (Shelhamer et al., 2017), the effective field is considerably smaller (Luo et al., 2016). Thus, in our last model, we aim to increase the receptive field of the network and in turn the context information that the network can exploit when classifying a pixel. A common approach towards this goal is to use feature representations at multiple scales, e.g. Farabet et al., 2012; Yu and Koltun, 2016; Zhao et al., 2017. We follow this line of research and opt for the context modules proposed by Yu and Koltun, 2016 as parameterized for the Cityscapes dataset. In that work, a set of dilated convolutions is appended to a VGG-16 network such that context information at various scales is aggregated. In addition, Yu and Koltun, 2016 modify the FCN and use dilated convolutions instead of strided pooling layers such that the network's output stride is 8 instead of 32. Since the latter modification significantly increases the run time, we continue to use the more efficient skip connections for upscaling. Adding the context module to the GoogLeNet FCN yields a performance gain of 1.9 % at an increase in run time of only 6.3 %. However, note that this comes at the cost of a slight decrease in performance with respect to the iIoU, which can be attributed to the context modules serving as a regularizer that suppresses the detection of small objects. Overall, we achieve a performance of 73.9 % IoU on the validation set and of 72.6 % on the test set, c.f. Table 4.13. This performance is below the state of the art, but our model is the best-performing real-time network by a significant margin.
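For illustration, a dilated-convolution context module in the spirit of Yu and Koltun, 2016 can be sketched as follows; this is a PyTorch sketch whose width and dilation schedule are assumptions, not the exact Cityscapes parameterization used in our model.

    import torch.nn as nn

    class ContextModule(nn.Module):
        """Stack of 3x3 convolutions with exponentially growing dilation:
        context is aggregated at multiple scales without changing the
        spatial resolution (padding equals the dilation)."""
        def __init__(self, channels, dilations=(1, 1, 2, 4, 8, 16, 1)):
            super().__init__()
            layers = []
            for d in dilations:
                layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                           nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(channels, channels, 1))  # final 1x1 fusion
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)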

4.3.6 Cross-dataset evaluation

In order to show the compatibility and complementarity of Cityscapes regarding related datasets, we applied an FCN model trained on our data to CamVid (Brostow et al., 2009) and two subsets of KITTI (Ros et al., 2015; Sengupta et al., 2013). We use the half-resolution model (c.f. Section 4.3.3) to better match the target datasets, but we do not apply any specific training or fine-tuning. In all cases, we follow the evaluation of the respective dataset to be able to compare to previously reported results (Badrinarayanan et al., 2017; Vineet et al., 2015). The obtained results in Table 4.18 show that our large-scale dataset enables us to train models that are on par with, or even outperform, methods that were specifically trained on another benchmark and specialized for its test data. Further, our analysis shows that our new dataset integrates well with existing ones and allows for cross-dataset research.

Dataset     Best reported result   Our result

CamVid^a          71.2^b              72.6
KITTI^c           61.6^b              70.9
KITTI^d           82.2^e              81.2

^a Brostow et al., 2009   ^b Badrinarayanan et al., 2017   ^c Ros et al., 2015   ^d Sengupta et al., 2013   ^e Vineet et al., 2015

Table 4.18: Quantitative results (average recall in percent) of our half-resolution FCN-8s model trained on Cityscapes images and tested on CamVid and KITTI.

4.4 instance-level semantic labeling4

In Section 1.2, we introduced common tasks for semantic scene understanding. Pixel-level semantic labeling, which we discussed in Section 4.3 and which is the focus of this dissertation, does not aim to separate individual object instances. The instance-level challenge of the Cityscapes benchmark goes beyond this and requires algorithms to produce segments of individual traffic participants in the scene. Training and evaluation of such methods is enabled by the instance-level annotations of humans and vehicles, c.f. Section 4.2.2. As this task goes beyond the scope of this dissertation, we briefly present the task and its metrics in Section 4.4.1, but omit a discussion of the state of the art.

4.4.1 Metrics

For the task of instance-level semantic labeling, approaches need to produce predictions of traffic participants in the image. Each such detection consists of a pixel-level segmentation mask, a semantic class label, and a confidence score. Note that these detections can potentially overlap, which needs to be taken into account by any evaluation metric. To evaluate the performance of a method, we center our metric around the region-level average precision (APr; Hariharan et al., 2014) that we compute for each class while averaging across a range of different overlap thresholds. Specifically, we start by matching predictions with ground truth instances of the considered semantic class. A pair of such regions matches if they overlap by at least a given

4 This section is based on the contributions of Mohamed Omran with respect to the instance-level semantic labeling benchmark (Section 4 in Cordts et al., 2016). It is included in this dissertation for completeness.

where the overlap is defined as the Intersection-over-Union (IoU), c.f. Figure 3.5. If this overlap threshold is at least 50 %, it holds that no instance prediction can have multiple matches with the ground truth, since the latter is non-overlapping. Based on these matches, we define a predicted instance that matches the ground truth as a True Positive (TP) and as a False Positive (FP) otherwise. Multiple matches with the same ground truth instance are also counted as False Positives (FPs). Unmatched ground truth instances are False Negatives (FNs). After counting occurrences of the different cases, the precision is defined as

\text{Precision} = \frac{\sum \text{TP}}{\sum \text{TP} + \sum \text{FP}}   (4.3)

and denotes the proportion of the predicted instances that are correct. Further, the recall is defined as

\text{Recall} = \frac{\sum \text{TP}}{\sum \text{TP} + \sum \text{FN}}   (4.4)

and denotes the proportion of ground truth instances that are detected. Since a sensible evaluation should not be restricted to a single operating point, a varying threshold on the confidence score of the instance predictions is applied. If this threshold has a rather low value, the recall becomes high, but the precision decreases, and vice versa for high values. If we choose all occurring confidence values as thresholds, we can compute the precision-recall curve, where the precision is plotted against the recall with the confidence threshold as an implicit parameter. The area under this curve is the Average Precision (AP), and the higher this value, the better the method. In terms of the value of the overlap threshold, we follow T.-Y. Lin et al., 2014 and use 10 different overlaps ranging from 0.5 to 0.95 in steps of 0.05. We repeat the above analysis for each of these values, average the resulting APs, and denote the result again as AP for simplicity. In doing so, we avoid giving an advantage to methods that perform particularly well at a single overlap value. Eventually, the main evaluation metric is obtained by averaging over the evaluated classes, yielding the mean average precision AP. In addition, we compute three minor scores. First, we restrict the evaluation to an overlap value of 50 %, yielding AP50%. Second and third, we only consider traffic participants that are within 100 m and 50 m distance from the camera, denoted as AP100m and AP50m, respectively. These minor scores allow for an evaluation that focuses on different domains important for autonomous driving. Often an accurate segmentation is not required, as reflected by AP50%, and objects may have a different importance depending on their distance to the vehicle, as assessed by AP100m and AP50m.
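To make the matching and averaging scheme above concrete, the following is a condensed sketch in Python; it assumes instance masks given as boolean arrays and predictions as mask/score pairs, and all function and variable names are illustrative rather than those of the published evaluation scripts.

import numpy as np

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def average_precision(predictions, gt_masks, overlaps=np.arange(0.5, 1.0, 0.05)):
    """AP for one class: greedily match predictions (sorted by confidence)
    against ground truth masks at several IoU thresholds and average the
    areas under the resulting precision-recall curves."""
    aps = []
    for thr in overlaps:
        order = sorted(predictions, key=lambda p: -p["score"])
        matched = [False] * len(gt_masks)
        tp, fp = [], []
        for pred in order:
            ious = [iou(pred["mask"], gt) for gt in gt_masks]
            best = int(np.argmax(ious)) if ious else -1
            if best >= 0 and ious[best] >= thr and not matched[best]:
                matched[best] = True          # true positive
                tp.append(1); fp.append(0)
            else:                             # unmatched or duplicate match
                tp.append(0); fp.append(1)
        tp, fp = np.cumsum(tp), np.cumsum(fp)
        recall = tp / max(len(gt_masks), 1)
        precision = tp / np.maximum(tp + fp, 1)
        # Area under the precision-recall curve (simple integration).
        aps.append(np.trapz(precision, recall))
    return float(np.mean(aps))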

4.5 benchmark suite

In Sections 4.3 and 4.4, we described the two major tasks within the Cityscapes benchmark suite. For both tasks, the purpose of the benchmark suite is to be able to easily, transparently, and fairly compare and rank different methods. To this end, we followed the three concepts that are described in the following paragraphs and are consistent with previous major datasets, e.g. ImageNet/ILSVRC (Russakovsky et al., 2015), PASCAL VOC (Everingham et al., 2014), Microsoft COCO (T.-Y. Lin et al., 2014), or KITTI (Geiger et al., 2013).

public evaluation scripts For evaluating results of the different tasks in our benchmark, we implemented corresponding evaluation scripts. We published5 these scripts together with the dataset such that the exact evaluation protocol is transparent, the scripts can be reviewed by other researchers, and authors can use exactly the same scripts to evaluate their methods on the validation set. As an additional benefit, the scripts can be reused for future projects, which ideally ensures comparability between performance numbers on different datasets.

evaluation server Based on the evaluation scripts, we implemented an evaluation server6, where users can upload their results such that these are automatically evaluated. In doing so, we can offer a convenient service that ensures the same evaluation scheme for all users and produces performance scores that can be fairly compared. In addition, the evaluation server allows us to keep the test set annotations private. Users can only access the input data for their methods, e.g. the images, run their method on this data, and upload the predictions for evaluation, but they cannot conduct evaluations on the test set on their own. In doing so, we can control how often a method can be evaluated on the test set and therefore effectively prevent overfitting, which in turn enables a realistic comparison. Ideally, a single method would be restricted to a single evaluation run only, since this best reflects the real-world requirements for a system for urban scene understanding. An intelligent vehicle equipped with such a system needs to parse incoming images and decide on their content without getting a second chance. However, in practice researchers develop different variants of a method that they would like to compare, especially prior to conference submission deadlines; the performance of a method might increase over time; or there might be bugs in early implementations. Thus, a sensible number of submissions of the same method should be allowed. Further, the evaluation server can only restrict submissions based on user accounts and not

5 www.github.com/mcordts/cityscapesScripts
6 www.cityscapes-dataset.com/submit

based on method types, since it has no knowledge about the latter. Therefore, we restrict submissions to one every 48 h per user and a maximum of 6 submissions per month. Additionally, we manually check that multiple users with the same or close affiliation do not circumvent this restriction. Overall, automated parameter tuning and overfitting are effectively prevented by our evaluation server, while the requirements mentioned above are simultaneously satisfied.

ranking website Once submissions are evaluated by the evaluation server, users have the option to publish their results in the official benchmark table7. This table provides an overview of various results on the Cityscapes dataset and allows for a fair comparison of different methods. Besides the performance metrics, the table also contains metadata that the participants provided for their submission, such as the input data used by the methods, the leveraged training data, or the method's run time. To satisfy our goal of transparency, we also provide the history for each method, in case the performance numbers were updated over time.

4.6 impact and discussion

In this chapter, we introduced and analyzed Cityscapes, a large-scale dataset paired with a benchmark suite focused on semantic urban scene understanding. We created the largest dataset of street scenes with high-quality and coarse annotations to date. While doing so, we paid particular attention to the needs of autonomous driving applications. Motivated by our findings in Chapter 3, we ensured that typical multi-modal input data is available. Further, we recorded in 50 different cities to assess generalization capabilities. The proposed dataset is extensively studied to provide statistics that are compared with related datasets and confirm the unique characteristics of Cityscapes. Based on our dataset, we developed benchmarks and evaluation methodologies that reflect both established metrics in the research community as well as novel ones tailored to autonomous driving. The benchmark allows us to fairly compare state-of-the-art approaches for semantic scene understanding. Based on our findings, we developed an efficient FCN for pixel-level semantic labeling. Since the publication of Cityscapes, numerous researchers have registered to download the dataset and submitted their results to our evaluation server, as shown in Figure 4.16. We observe a more than linearly increasing number of registered users over time and note that many researchers downloaded the data shortly after release. We attribute this to the fact that we announced the dataset prior to its public release (Cordts et al., 2015) and created a mailing list to inform interested people of its state.

7 www.cityscapes-dataset.com/benchmarks

Figure 4.16: Activity statistics for the Cityscapes dataset over time. We show the number of registered users as well as the number of submissions to our evaluation server. In addition, we mark the submission deadlines for the major computer vision conferences CVPR and ICCV. Note that the dataset was released on February 20th, 2016 with an initial set of 10 submissions. Ever since, we observe a more than linearly increasing number of registered users. The number of submissions increases approximately linearly, with additional submissions prior to the submission deadlines of CVPR and ICCV. These statistics include registered users and submissions prior to April 15th, 2017.


Figure 4.17: Best performing methods on the Cityscapes dataset over time. We show the best IoU value for the pixel-level challenge and the best AP value for the instance-level task at the respective points in time. The ranges of the y-axes are chosen to cover the same absolute range of performance. Note that we only consider publicly visible submissions, but use the dates of the submissions in order to generate this plot, not the publication dates. We observe that considerable progress has been made in both tasks, while more recently the improvements in the instance-level task are more significant. These statistics include all submissions prior to April 15th, 2017.

The number of submissions increases approximately linearly, with an additional increase shortly before submission deadlines of major computer vision conferences. As of April 15th, 2017, i.e. one year and two months after release, there are 2144 registered researchers and 429 submissions to our evaluation server. These statistics correspond to 5.1 registrations and 1 submission on average per day. Enabled by the large-scale Cityscapes benchmark, our research community achieved considerable advances in semantic urban scene understanding. Figure 4.17 documents this progress and shows the performance of the best method on the Cityscapes leaderboard over time. Prior to the publication, we already received submissions from independent research groups running their then state-of-the-art methods for pixel-level semantic labeling on our dataset. The best method (Yu and Koltun, 2016) achieved a performance of 67.1 % IoU. Since then, performance has constantly increased, reaching 80.6 % with the method of Z. Wu et al., 2016. Considering the instance-level task, progress is even larger. Starting with the baseline performance of 4.6 % (Cordts et al., 2016), the best method (K. He et al., 2017) has increased this number by a factor of 7, achieving 32 % AP. Compared to the pixel-level labeling challenge, we attribute the larger advances in the instance-level task to the fact that this topic is newer, less extensively studied, and gaining increasing popularity. In addition, the baseline performance at the release of the benchmark was considerably lower, making larger relative margins easier to achieve.

SEMANTIC STIXELS 5

contents
5.1 Related Scene Representations
    5.1.1 Unsupervised image segmentation
    5.1.2 Street scene environment models
    5.1.3 Layered environment models
5.2 The Stixel Model
5.3 Graphical Model
    5.3.1 Prior
    5.3.2 Data likelihood
5.4 Inference
    5.4.1 Dynamic programming
    5.4.2 Algorithmic simplification
    5.4.3 Approximations
5.5 Parameter Learning
5.6 Object-level Knowledge
    5.6.1 Extended Stixel model
5.7 Experiments
    5.7.1 Datasets
    5.7.2 Metrics
    5.7.3 Baseline
    5.7.4 Impact of input cues
    5.7.5 Impact of approximations
5.8 Discussion

In recent years, more and more vision technology has been deployed in vehicles, mainly due to the low cost and versatility of image sensors. As a result, the number of cameras, their spatial resolution, and the list of algorithms that involve image analysis are constantly increasing. This poses significant challenges in terms of processing power, energy consumption, packaging space, and bandwidth, all of which are traditionally limited on embedded hardware commonly found in a vehicle. In light of these constraints, it becomes clear that an essential part of modern vision architectures in intelligent vehicles is concerned with the sharing of resources. One way towards this goal is to compress the scene into a compact representation of the image content that abstracts the raw sensory data, while being neither too specific nor too generic, so that it can simultaneously facilitate various tasks such as object detection, tracking, segmentation, localization, and mapping.


This representation should allow for structured access to depth, semantics, and color information and should carry high information content while having a low memory footprint to save bandwidth and computational resources. Such a scene representation is obtained by the Stixel World as first proposed by Badino et al., 2009 and extended by Pfeiffer and Franke, 2011b. Stixels are specifically designed for the characteristic layout of street scenes and are based on the observation that the geometry is dominated by planar surfaces. Moreover, prior knowledge constrains the structure of typical street scenes predominantly in the vertical image direction, i.e. objects are on top of a supporting ground plane and the sky is typically in the upper part of the image. Furthermore, there is a characteristic depth ordering from close to far away along this direction. Note that those geometric relations are less pronounced in the horizontal domain. Therefore, the environment is modeled as a set of Stixels: thin stick-like elements that constitute a column-wise segmentation of the image, see Figure 5.1a. Stixels are inferred by solving an energy minimization problem, where the unary terms are driven by depth information via stereo vision, and structural and semantic prior information is taken into account to regularize the solution. Overall, they provide a segmentation that contains all relevant information at a sufficient level of detail while being compact and easily parsable for downstream automotive vision applications, e.g. Benenson et al., 2012; Enzweiler et al., 2012; Erbs et al., 2012; Franke et al., 2013; Xiaofei Li et al., 2016; Muffert et al., 2014; Pfeiffer and Franke, 2011a. The main benefit for such subsequent components is that using Stixels as primitive elements either decreases parsing time, increases accuracy, or even both at the same time. Also, in Chapter 3 we saw that a scene parsing system based on Stixel superpixels yields the best scene recognition performance at real-time speeds. A second way to share computational resources is outlined by deep neural networks, where dedicated features for specialized tasks share a common set of more generic features in lower levels of the network. In Chapter 4, we learned that a pixel-level labeling based on deep learning methods and paired with a sufficient amount of training data is very powerful and leads to state-of-the-art recognition performance. Especially when tuned for real-time speeds as in Section 4.3.5, deep networks become highly attractive for automotive applications. Thus, the goal of this chapter is to work towards the scene representation called Semantic Stixels, c.f. Figure 5.1, combining the strengths of Stixels for representing urban scenes with pixel-level CNNs for semantic scene parsing. To this end, we start by discussing related scene representations in Section 5.1. We continue by presenting the underlying world assumptions in Section 5.2 as originally proposed by Pfeiffer and Franke, 2011b. We then formalize the Stixel model via a Conditional Random Field (CRF) in Section 5.3.

(a) Depth representation, where Stixel colors encode disparities from close (red) to far (green).

(b) Semantic representation, where Stixel colors encode semantic classes following Chapter 4.

Figure 5.1: Scene representation obtained via Semantic Stixels. The scene is represented via its geometric layout (top) and semantic classes (bottom).

Based on approximations and ideas presented by Pfeiffer, 2011, an efficient inference scheme respecting the formalism of the CRF, paired with a complexity analysis, is derived in Section 5.4. Again leveraging the CRF, we outline an automated parameter learning in Section 5.5. Subsequently, we show that we can exploit the CRF formulation and extend the Stixel model with additional data and prior information, e.g. driven by object detectors as sketched in Section 5.6. Eventually, the Semantic Stixel World is evaluated (Section 5.7) and we conclude with a discussion in Section 5.8. This chapter is based on the journal article of Cordts et al., 2017b and contains verbatim quotes. The article was published as joint work with Timo Rehfeld and Lukas Schneider and unifies previous work on the Stixel World. Timo Rehfeld proposed to enrich the Stixel segmentation with semantic information via pixel-level input cues as first presented in (Scharwächter and Franke, 2015). Lukas Schneider

improved this methodology and made the inference tractable for a large number of semantic classes (L. Schneider et al., 2016). The contribution relevant for this dissertation is the formalism of the Stixel World via a graphical model that allows us to unify previous variants into a single model, to provide an analysis of the inference scheme, and to learn the model parameters from training data. A first version of the graphical model was presented in Cordts et al., 2014, paired with an extension with object-level cues that we address in Section 5.6.

5.1 related scene representations

Depending on the perspective, the Semantic Stixel World is related to different lines of research. Considering its output, the Stixel segmentation is related to stereo vision and semantic labeling. However, as this output directly depends on the pixel-level input, a discussion of methods for stereo vision goes beyond the scope of this dissertation and the reader is referred to excellent benchmarks such as KITTI (Geiger et al., 2013). Related work for semantic labeling has already been covered in Chapter 2 and a benchmark is provided in Section 4.3.4. Interpreting the Stixels as a segmentation of the input image based on various cues, we cover unsupervised image segmentation in Section 5.1.1. Simultaneously, the Stixel model is designed as an environment model for autonomous driving in street scenes. Thus, we address related representations with the same scope in Section 5.1.2. In Section 5.1.3, we turn towards methods exploiting similar geometric model assumptions about the environment, i.e. a layered structure of support, vertical, and sky regions. This chapter extends the Stixel World proposed by David Pfeiffer in Pfeiffer, 2011; Pfeiffer and Franke, 2011b. There, the environment model, data and prior terms, as well as inference via Dynamic Programming (DP) were introduced. We extend this work with a graphical model providing a clear formalism, we additionally leverage semantic and color information, we analyze the inference scheme, and we outline automatic parameter learning. Note that there exist several other Stixel-like models implemented by different research groups, i.e. Benenson et al., 2012; Levi et al., 2015; M.-Y. Liu et al., 2015; Rieken et al., 2015; Sanberg et al., 2014.

5.1.1 Unsupervised image segmentation

The task of unsupervised image segmentation is to segment an image into regions of similar color or texture. Approaches in this field include superpixel methods, e.g. Achanta et al., 2012; Comaniciu and Meer, 2002; Felzenszwalb and Huttenlocher, 2004; Levinshtein et al.,

2009, or those producing segmentation hierarchies or large regions (Arbeláez et al., 2011; Carreira and Sminchisescu, 2012). Typically, these unsupervised methods rely on simple model constraints and the resulting segment size is controlled by only a few parameters. Hence, these methods are fairly generic and are often leveraged as the smallest element of processing granularity in contrast to directly accessing individual pixels. In doing so, efficiency can be increased and classification is guided, e.g. for object detection (Girshick et al., 2014; Uijlings et al., 2013), or for semantic labeling (Arbeláez et al., 2012; Carreira et al., 2012b). Stixels, too, can be interpreted as superpixels providing a segmentation of the input image data. However, in contrast to generic unsupervised segmentation methods, Stixels are specific to street scenes. They exploit prior information and the typical geometric layout to yield an accurate segmentation. Further, Stixels are restricted to a rectangular shape and are computed independently for each column. In Chapter 3, we investigated different image segmentation techniques when used as the first processing step in a semantic urban scene understanding system and found that Stixels lead to the best results.

5.1.2 Street scene environment models

When representing the environment of an autonomous vehicle, occupancy grid maps are a common choice, e.g. Dhiman et al., 2014; Muffert et al., 2014; Nuss et al., 2015; Thrun, 2002. While Stixels are inferred in the image plane, grid map models represent the surroundings in bird's eye perspective and partition the scene into a grid. Occupancy information is then assigned to each grid cell, such that the drivable space, obstacles, and occluded areas can be distinguished. Similar to grid maps, the Stixel segmentation distinguishes between support regions, i.e. drivable area, and vertical segments, i.e. obstacles. Furthermore, it allows for an explicit reasoning about non-observable elements of the scene. Overall, transferring the Stixel World into an occupancy grid is straightforward.

5.1.3 Layered environment models

In our Stixel model, we distinguish between three structural classes, i.e. support, vertical, and sky, to represent the environment. An approach to infer the geometric surface layout of elements in an image based on exactly these three classes was previously proposed by Hoiem et al., 2007. The method relies on features from different cues, such as color, texture, superpixel location and shape, as well as perspective information. Hoiem et al., 2007 further classify vertical segments into different properties such as porous or solid, while we classify Stixel segments into different semantic classes.

Felzenszwalb and Veksler, 2010 introduce the tiered scene labeling problem, a model where the scene is vertically split into a bottom, middle, and top part, for example reflecting ground, object, and sky regions. The middle part is then further subdivided horizontally into segments of varying extent. Restricting the scene to such a tiered layout, the globally optimal solution can be inferred via Dynamic Programming (DP), yielding improved results with respect to Hoiem et al., 2007. Nevertheless, the layout is too restrictive to accurately represent complex street scenes, c.f. Chapter 4. The Stixel model assumes a similar scene structure and exploits such constraints for efficient inference. More recently, M.-Y. Liu et al., 2015 combined the tiered scene model with the column-wise Stixel segmentation. The approach infers semantic and depth information jointly based on features from a deep neural network and stereo matching, respectively. In contrast, Stixels rely on precomputed semantic and depth cues paired with confidence metrics. While this is in theory disadvantageous compared to the approach of M.-Y. Liu et al., 2015, doing so allows for a pipelining of the individual components and thus increases the achievable frame rate of the overall system. Furthermore, M.-Y. Liu et al., 2015 extend the tiered model to up to four layers, but the method does not scale well to more layers. In contrast, the Stixel model allows for as many layers per column as there are image pixels, enabling the representation of complex scenes.

5.2 the stixel model

The Stixel World S as proposed by Pfeiffer and Franke, 2011b is a segmentation of an image into superpixels, where each superpixel is a thin stick-like segment with a class label and a 3D planar depth model. The key difference between Stixels and other segmentation approaches is that the segmentation problem of a w × h image is broken down into w/ws individual 1D segmentation problems, one for each column of width ws in the image. The horizontal extent ws ≥ 1 is fixed to typically only a few pixels and chosen in advance to reduce the computational complexity during inference. The vertical extent of each Stixel is inferred explicitly in our model. To further control the run time of our method, we apply an optional downscaling in the vertical direction by a factor of hs ≥ 1. The idea behind this approach is that the dominant structure in road scenes occurs in the vertical domain and can thus be modeled without taking horizontal neighborhoods into account. This simplification allows for efficient inference, as all columns can be segmented in parallel. We regard our model as a medium-level representation for three reasons: (1) each Stixel provides an abstract representation of depth, physical extent, and semantics that is more expressive than individual pixels; (2) the Stixel segmentation is based upon a street scene model, compared to bottom-up superpixel methods; (3) Stixels are not a high-level representation, as individual object instances are covered by multiple Stixels in their horizontal extent. All in all, Stixels deliver a compressed scene representation that subsequent higher-level processing stages can build on. The label set of the Stixel class labels is comprised of two hierarchical layers. The first contains three structural classes, "support" (S), "vertical" (V), and "sky" (Y). All three structural classes can be distinguished exclusively by their geometry and reflect our underlying 3D scene model: support Stixels are parallel to the ground plane at a constant height and vertical/sky Stixels are perpendicular to the ground plane at a constant/infinite distance. In the second layer, we further refine the structural classes to semantic classes that are sub-classes of the sets S, V, and Y. Semantic classes such as road or sidewalk, for example, are in the support set S, whereas building, tree, or vehicle are vertical, i.e. in V. The actual set of semantic classes is highly dependent on the application and is further discussed in the experiments in Section 5.7. The input data for our inference is comprised of a dense depth map, a color image, and pixel-level label scores for each semantic class. Note that all three input channels are optional and are not restricted to specific stereo or pixel classification algorithms. In Section 5.6, we discuss an extension, where we leverage object detections as an additional input channel. In the following sections, we focus on a single image column of width ws and provide a detailed mathematical description of the Stixel model. Our graphical model defines the energy function and gives an intuition on factorization properties and thus statistical independence assumptions. Further, we describe an efficient energy minimization procedure via Dynamic Programming (DP) yielding the segmentation of one image column. Subsequently, we sketch learning of the model parameters from ground truth data via Structured Support Vector Machines (S-SVMs).
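As a minimal illustration of the per-column data structure implied by this model, a Stixel column can be sketched as follows; the field and class names are illustrative, not those of any released implementation.

from dataclasses import dataclass

@dataclass
class Stixel:
    v_bottom: int        # bottom image row of the segment
    v_top: int           # top image row (inclusive), v_top >= v_bottom
    semantic_class: str  # e.g. "road", "car"; implies the structural class
    disparity: float     # constant disparity (vertical Stixel) or constant
                         # disparity offset to the ground plane (support)
    color: tuple         # mean LAB color attribute

# One image column of width w_s is a list of Stixels that partitions the
# image rows without gaps or overlaps, e.g. road below a car below sky:
column = [
    Stixel(0, 119, "road", 0.0, (50.0, 0.0, 0.0)),
    Stixel(120, 299, "car", 42.5, (40.0, 10.0, -5.0)),
    Stixel(300, 511, "sky", 0.0, (80.0, 0.0, -10.0)),
]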

5.3 graphical model

In the following, we define the posterior distribution P(S: | M:) of a Stixel column S: given the measurements M: within the column, based on ideas presented by Pfeiffer and Franke, 2011b. These measurements usually consist of a dense disparity map D:, the color image I:, and pixel-level semantic label scores L:. However, not all of these channels are required simultaneously and we will discuss a differing constellation in Section 5.6. We describe the posterior by means of a graphical model to allow for an intuitive and clear formalism, a structured description of statistical independence, and a visualization


Figure 5.2: The Stixel World as a factor graph that depicts the factorization of the posterior distribution, c.f. Equation (5.1). Each Stixel Si (dashed boxes) stands for the five random variables V_i^b, V_i^t, C_i, D_i, O_i (circles) denoting the Stixel's vertical extent from bottom to top row, its semantic class label, its 3D position, and its color attribute. The hatched node on the left denotes the random variable N describing the number of Stixels n that constitute the final segmentation. Black squares denote factors of the posterior and are labeled according to the descriptions given in the text. The prior distribution factorizes according to the left part, whereas the right part describes the measurement likelihood. The circles D:, I:, and L: denote the measurements, i.e. a column of the disparity map, the color image, and the pixel-level semantic label scores, respectively. If all measurements are observed (indicated by gray shading) and if the number of Stixels N is thought to be fixed (indicated by gray hatched shading), the graph is chain-structured on the Stixel level. This property is exploited for inference via dynamic programming (see Section 5.4).

via a factor graph, c.f. Figure 5.2. The graph provides two high-level factors. The first rates the likelihood of a measurement given a candidate Stixel column and is denoted as P̃(M: | S:). The second factor models our a-priori knowledge about street scenes, i.e. the prior P̃(S:). Note that both likelihood and prior are unnormalized probability mass functions. Together with the normalizing partition function Z, the posterior distribution is defined as

P(S_: \mid M_:) = \frac{1}{Z} \, \tilde{P}(M_: \mid S_:) \, \tilde{P}(S_:)   (5.1)

For ease of notation, we switch to the log-domain and obtain

P(S_: = s_: \mid M_: = m_:) = \frac{1}{Z(m_:)} \, e^{-E(s_:, m_:)}   (5.2)

where E(·) is the energy function, defined as

E(s_:, m_:) = \Phi(s_:, m_:) + \Psi(s_:)   (5.3)

The function Φ(·) represents the likelihood and Ψ(·) the prior. In order to mathematically define the posterior energy function, we need a fixed number of random variables describing the segmentation S:. This is accomplished by splitting the column S: into as many individual Stixels Si as maximally possible, i.e. i ∈ {1 … h}, and into an additional random variable N ∈ {1 … h} denoting the number of Stixels that constitute the final segmentation. For a certain value n, we set all factors connected to a Stixel Si with i > n to zero energy. Thus, these factors do not influence the segmentation obtained. For ease of notation, we continue by assuming i ≤ n and drop the dependency of the factors on N. We revisit these simplifications and discuss their impact during inference later in Section 5.4. Besides the random variable N, the column segmentation S: consists of h Stixels Si, where S1 is the lowest Stixel and Sh the highest. A Stixel Si is in turn split into the five random variables V_i^b, V_i^t, C_i, O_i, and D_i. The first two denote the vertical extent from bottom to top row. The variable Ci represents the Stixel's semantic class label (and in turn the structural class via the label hierarchy), Oi is its color attribute, and Di parameterizes the disparity model. Note that a vertical segment at constant distance maps to a constant disparity, while a support segment at constant height maps to a constant disparity offset relative to the ground plane. Hence, both the distance and the height are captured by a single random variable Di.

5.3.1 Prior

The prior Ψ(s:) from Equation (5.3) captures prior knowledge on the scene structure independent from any measurements. Such a model serves as regularization of the segmentation and becomes especially important in challenging scenarios such as bad weather or

night scenes where the measurements are noisy. The prior term factorizes as

\Psi(s_:) = \Psi_{mc}(n) + \sum_{i=1}^{h} \sum_{id} \Psi_{id}(s_i, s_{i-1}, n)   (5.4)

where id stands for the name of the factors corresponding to their labels in Figure 5.2. Note that not all factors actually depend on all variables si, si−1, or n. In the following, we define and explain the individual factors by grouping them regarding their functionality, i.e. model complexity, segmentation consistency, structural priors, and semantic priors.

model complexity The model complexity prior Ψmc is the main regularization term and controls the compromise between compactness and robustness versus fine granularity and accuracy. The factor is defined as

\Psi_{mc}(n) = \beta_{mc} \, n   (5.5)

The higher the parameter βmc is chosen, the fewer Stixels are obtained, hence the segmentation becomes more compact.

segmentation consistency In order to obtain a consistent segmentation, we define hard constraints to ensure that segments are non-overlapping, connected, and extend over the whole image. This implies that the first Stixel must begin in image row 1 (bottom row) and the last Stixel must end in row h (top row), i.e.

\Psi_{1st}(v_1^b) = \begin{cases} 0 & \text{if } v_1^b = 1 \\ \infty & \text{otherwise} \end{cases}   (5.6)

\Psi_{nth,i}(n, v_i^t) = \begin{cases} \infty & \text{if } n = i \text{ and } v_i^t \neq h \\ 0 & \text{otherwise} \end{cases}   (5.7)

Further, a Stixel's top row must be above the bottom row and consecutive Stixels must be connected, i.e.

\Psi_{t \geq b}(v_i^b, v_i^t) = \begin{cases} 0 & \text{if } v_i^b \leq v_i^t \\ \infty & \text{otherwise} \end{cases}   (5.8)

\Psi_{con}(v_i^b, v_{i-1}^t) = \begin{cases} 0 & \text{if } v_i^b = v_{i-1}^t + 1 \\ \infty & \text{otherwise} \end{cases}   (5.9)

In Section 5.4, we will show that these deterministic constraints can be exploited to significantly reduce the computational effort.

structural priors Road scenes have a typical 3D layout in terms of the structural classes support, vertical, and sky. This layout

is modeled by Ψstr, which is in turn comprised of two factors, each responsible for an individual effect, i.e.

\Psi_{str}(c_i, c_{i-1}, d_i, d_{i-1}, v_i^b) = \Psi_{grav} + \Psi_{do}   (5.10)

The gravity component Ψgrav is only non-zero for ci ∈ V and ci−1 ∈ S and models that in 3D a vertical segment usually stands on the preceding support surface. Consequently, the vertical segment's disparity di and the disparity of the support surface must coincide in row v_i^b. The latter disparity is denoted by the function d_s(v_i^b, d_{i-1}) and their difference as ∆d = d_i − d_s(v_i^b, d_{i-1}). Then, Ψgrav is defined as

\Psi_{grav} = \begin{cases} \alpha_{grav}^- + \beta_{grav}^- \, \Delta d & \text{if } \Delta d < 0 \\ \alpha_{grav}^+ + \beta_{grav}^+ \, \Delta d & \text{if } \Delta d > 0 \\ 0 & \text{otherwise} \end{cases}   (5.11)

The second term Ψdo rates the depth ordering of vertical segments. Usually, an object that is located on top of another object in the image is behind the other one in the 3D scene, i.e. the top disparity is smaller than the bottom one. Therefore, we define

\Psi_{do} = \begin{cases} \alpha_{do} + \beta_{do} \, (d_i - d_{i-1}) & \text{if } c_i \in V, \; c_{i-1} \in V, \; d_i > d_{i-1} \\ 0 & \text{otherwise} \end{cases}   (5.12)

semantic priors The last group of prior factors is responsible for the a-priori knowledge regarding the semantic structure of road scenes. We define the factor as

\Psi_{sem}(c_i, c_{i-1}) = \gamma_{c_i} + \gamma_{c_i, c_{i-1}}   (5.13)

The first term γ_{c_i} describes the a-priori class probability for a certain class ci, i.e. the higher the value, the less likely are Stixels with that class label. The latter term γ_{c_i, c_{i-1}} is defined via a two-dimensional transition matrix for all combinations of classes. Individual entries in this matrix model expectations on relative class locations, e.g. a support Stixel such as road above a vertical Stixel such as car might be rated less likely than vice versa. Note that we capture only first-order relations to allow for efficient inference. Finally, we define Ψ_{sem1}(c_1) analogously for the first Stixel's class, e.g. a first Stixel with a support class such as road might be more likely than one with a vertical class such as infrastructure or sky.
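As a toy illustration of this semantic prior, consider the following sketch restricted to the three structural classes; all energy values are made up for illustration and are not the parameters used in our experiments.

import numpy as np

# Structural classes: 0 = support (S), 1 = vertical (V), 2 = sky (Y).
# gamma_pair[c, c_prev] is the transition energy for a Stixel of class c
# stacked on top of a Stixel of class c_prev. All values are made up:
# e.g. support above vertical (road floating above a car) is penalized
# more strongly than vertical above support (a car standing on the road).
gamma_pair = np.array([
    #  prev: S    V    Y
    [0.5, 3.0, 9.0],   # current: S
    [0.2, 1.0, 9.0],   # current: V
    [2.0, 0.5, 9.0],   # current: Y (anything above sky is implausible)
])
gamma_unary = np.array([0.1, 0.3, 0.2])  # a-priori class energies

def psi_sem(c, c_prev):
    """Semantic prior of Equation (5.13) for two stacked Stixels."""
    return gamma_unary[c] + gamma_pair[c, c_prev]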

5.3.2 Data likelihood

The data likelihood from Equation (5.3) integrates the information from our input modalities (disparity map D:, color image I:, and semantic label scores L:) and rates their compatibility with a candidate

Stixel column. We rely on multiple input modalities mainly for redundancy and in order to combine their individual strengths. The depth information allows us to separate support from vertical segments as well as multiple stacked objects at different distances. Furthermore, these data terms are in principle independent of the actual object type and therefore generalize very well to varying scene types and content. The semantic channel in turn is based on classifiers for a specific set of classes. For these classes it shows excellent recognition performance, but it might fail for novel semantic classes. Further, such a classifier typically does not separate multiple neighboring instances of the same semantic class, e.g. multiple cars behind each other, while the depth information supports their segmentation. Our last channel, the color image itself, is a comparably weak source of information, but nevertheless helps to localize precise object boundaries and is useful whenever an over-segmentation that captures all image boundaries is desirable. The likelihood term factorizes as

\Phi(s_:, m_:) = \sum_{i=1}^{h} \sum_{v=v_i^b}^{v_i^t} \left[ \Phi_D(s_i, d_v, v) + \Phi_I(s_i, i_v) + \Phi_L(s_i, l_v) \right]   (5.14)

Note that we sum over the maximum number of Stixels h, but as described above, all factors for i > n are set to zero. Further, for a given Stixel segmentation s:, we model the data likelihoods to be independent across pixels and therefore their contribution decomposes over the rows v. In the following, we describe the contributions of the individual modalities to a Stixel si.

depth The depth likelihood terms are designed according to our world model consisting of supporting and vertical planar surfaces. Since the width ws of a Stixel is rather small, we can neglect the influence of slanted surfaces. Instead, these are represented, with some discretization, via neighboring Stixels at varying depths. In doing so, the 3D orientation of a Stixel is sufficiently described by its structural class, i.e. support or vertical. Accordingly, the 3D position of a Stixel is parametrized by the single variable Di paired with its 2D position in the image. This variable is the Stixel's constant disparity for a vertical segment and a constant disparity offset relative to the ground plane for a support segment, c.f. Figure 5.3. We use depth measurements in the form of dense disparity maps, where each pixel has an associated disparity value or is flagged as invalid, i.e. d_v ∈ {0 … d_max, d_inv}. The subscript v denotes the row index within the considered column. The depth likelihood term Φ_D(s_i, d_v, v) is derived from a probabilistic, generative measurement model P_v(D_v = d_v | S_i = s_i) according to

\Phi_D(s_i, d_v, v) = -\delta_D(c_i) \log P_v(D_v = d_v \mid S_i = s_i)   (5.15)


Figure 5.3: Disparity measurements (black lines) and resulting Stixel segmentation (colored lines) for a typical image column (right). The purple dashed line symbolizes the ideal linear disparity measurements along a planar ground surface, to which support segments such as road or sidewalk are parallel. Obstacles form vertical Stixels, e.g. person or tree. Sky is modeled with a disparity value of zero. Adapted from Pfeiffer, 2011.

The term incorporates class-specific weights δD(ci) that allow learning the relevance of the depth information for a certain class. Let p_val be the prior probability of a valid disparity measurement. Then, we obtain

P_v(D_v \mid S_i) = \begin{cases} p_{val} \, P_{v,val}(D_v \mid S_i) & \text{if } d_v \neq d_{inv} \\ 1 - p_{val} & \text{otherwise} \end{cases}   (5.16)

where P_{v,val}(D_v | S_i) denotes the measurement model of valid measurements only and is defined as

P_{v,val}(D_v \mid S_i) = \frac{p_{out}}{Z_U} + \frac{1 - p_{out}}{Z_G(s_i)} \, e^{-\frac{1}{2} \left( \frac{d_v - \mu(s_i, v)}{\sigma(s_i)} \right)^2}   (5.17)

This distribution is a mixture of a uniform and a Gaussian distribution and defines the sensor model. While the Gaussian captures typical disparity measurement noise, the uniform distribution increases the robustness against outliers and is weighted with the coefficient p_out. The Gaussian is centered at the disparity value µ(si, v) of the Stixel si, which is constant for vertical Stixels, i.e. µ(si, v) = di, and depends on the row v for support Stixels, c.f. Figure 5.3. The standard deviation σ captures the noise properties of the stereo algorithm and is chosen depending on the class ci, e.g. for the class sky the noise is expected to be higher than for the other classes due to missing texture as needed by stereo matching algorithms. The parameters p_val, p_out,

and σ are either chosen a-priori, obtained by estimating confidences in the stereo matching algorithm as shown by Pfeiffer et al., 2013, or chosen based on empirical measurements as described in Cordts et al., 2017b. The terms Z_G(s_i) and Z_U normalize the two distributions.
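The valid-measurement mixture of Equations (5.16) and (5.17) can be sketched as follows; the snippet assumes disparities quantized to {0, …, d_max} so that Z_U and Z_G normalize discrete distributions, and the default parameter values are illustrative.

import numpy as np

def p_valid(d_v, mu, sigma, p_out=0.1, d_max=127):
    """Valid-measurement model of Equation (5.17): uniform outlier term
    plus a Gaussian centered at the Stixel's expected disparity mu."""
    disparities = np.arange(d_max + 1)
    z_u = float(d_max + 1)                     # uniform normalizer Z_U
    gauss = np.exp(-0.5 * ((disparities - mu) / sigma) ** 2)
    z_g = float(gauss.sum())                   # Gaussian normalizer Z_G
    return p_out / z_u + (1.0 - p_out) * np.exp(
        -0.5 * ((d_v - mu) / sigma) ** 2) / z_g

def phi_depth(d_v, mu, sigma, delta_d=1.0, p_val=0.95, d_inv=-1):
    """Per-pixel depth energy of Equation (5.15), including the
    valid/invalid split of Equation (5.16)."""
    if d_v == d_inv:
        return -delta_d * np.log(1.0 - p_val)
    return -delta_d * np.log(p_val * p_valid(d_v, mu, sigma))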

color image Common superpixel algorithms such as Simple Linear Iterative Clustering (SLIC, Achanta et al., 2012) work by grouping adjacent pixels of similar color. We follow this idea by favoring Stixels with a small squared deviation in LAB color space from their color attribute Oi. Thus, the color likelihood ΦI(si, iv) in Equation (5.14) is defined as

\Phi_I(s_i, i_v) = \delta_I(c_i) \, \lVert i_v - o_i \rVert_2^2   (5.18)

Since some classes might satisfy a constant color assumption better than others, e.g. sky vs. object, each class is associated with an individual weight δI. Note that it is straightforward to extend this likelihood term to more sophisticated color or texture models. In this work, we opt for semantic label scores to incorporate appearance information.

semantic label scores The driving force in terms of semantic scene information is provided by a pixel-level labeling system that delivers normalized semantic scores l_v(c_i) with \sum_{c_i} l_v(c_i) = 1 for all considered classes c_i at all pixels v. This input channel not only separates the structural classes into their subordinate semantic classes, but also guides the segmentation by leveraging appearance information. The semantic scores yield the likelihood term

\Phi_L(s_i, l_v) = -\delta_L(c_i) \log(l_v(c_i))   (5.19)

in Equation (5.14). Again, we use class-specific weights δL(ci).

5.4 inference

We perform inference by finding the maximum-a-posteriori (MAP) solution, i.e. by maximizing Equation (5.1) or, equivalently, by minimizing the energy function in Equation (5.3). One motivation for this is that, as opposed to a maximum marginal estimate, the obtained segmentation is consistent in terms of the constraints described in Section 5.3.1. We describe the inference algorithm in three stages: first as a naive solution, then by exploiting algorithmic simplifications, and eventually by using slight approximations to further reduce the computational effort.

5.4.1 Dynamic programming

If we treat the given measurements as implicit parameters and combine Equations (5.3), (5.4) and (5.14), the optimization problem has the structure

s_:^* = \operatorname*{argmin}_{n, s_1 \ldots s_h} \; \Psi_{mc}(n) + \Phi_1(s_1, n) + \Psi_1(s_1, n) + \sum_{i=2}^{h} \left[ \Phi_i(s_i, n) + \Psi_i(s_i, s_{i-1}, n) \right]   (5.20)

where all prior and likelihood factors of a single Stixel i or a pair of neighboring Stixels are grouped together. A straightforward way to solve Equation (5.20) is to iterate over all possible numbers of Stixels n and solve

c_n^* = \Psi_{mc}(n) + \min_{s_1 \ldots s_h} \; \Phi_{1,n}(s_1) + \Psi_{1,n}(s_1) + \sum_{i=2}^{h} \left[ \Phi_{i,n}(s_i) + \Psi_{i,n}(s_i, s_{i-1}) \right]   (5.21)

for each fixed n. The minimum value of all c_n^* determines n^*, and the minimizing segmentation s_:^* is the optimal segmentation of the current column. Next, for a fixed n, we exploit that \Phi_{i,n} = \Psi_{i,n} = 0 for i > n and that the factor \Psi_{nth,i} reduces to the constraint v_n^t = h. Besides that, neither \Phi_{i,n} nor \Psi_{i,n} depends on n or i (except for i = 1), and it holds that

c_n^* = \Psi_{mc}(n) + \min_{\substack{s_1 \ldots s_n \\ v_n^t = h}} \; \Phi(s_1) + \Psi_1(s_1) + \sum_{i=2}^{n} \left[ \Phi(s_i) + \Psi(s_i, s_{i-1}) \right]   (5.22)

Due to the first-order Markov property on Stixel super-nodes, c.f. Figure 5.2, Equation (5.22) can be solved via the Viterbi algorithm, i.e. Dynamic Programming (DP), by reformulation as

c_n^* = \Psi_{mc}(n) + \min_{s_n, v_n^t = h} \Big( \Phi(s_n) + \min_{s_{n-1}} \big( \Phi(s_{n-1}) + \Psi(s_n, s_{n-1}) + \ldots + \min_{s_1} \left( \Phi(s_1) + \Psi(s_2, s_1) + \Psi_1(s_1) \right) \ldots \big) \Big)   (5.23)

The number of possible states of a Stixel Si is

|S_i| = |V^t| \, |V^b| \, |C| \, |O| \, |D| = h^2 \, |C| \, |O| \, |D|   (5.24)

and hence we obtain a run time of O(N |S_i|^2) = O(N h^4 |C|^2 |O|^2 |D|^2) for each value of n. Since inference is run for n ∈ {1 … h}, the overall run time is O(h^6 |C|^2 |O|^2 |D|^2). This assumes that all prior and likelihood factors can be computed in constant time. In case of the priors, this property becomes evident from the definitions in Section 5.3.1. For the data likelihoods, a constant run time is achieved by leveraging integral tables within each image column. For the disparity data term, we apply the approximations described in Pfeiffer, 2011 to compute the integral table of the disparity measurements. Overall, the asymptotic run time of such pre-computations is small compared to the inference via the Viterbi algorithm and can thus be neglected.

5.4.2 Algorithmic simplification

The run time can be significantly reduced by exploiting the structure of the optimization problem in Equation (5.23). All intermediate problems neither depend on the number of Stixels n nor on the actual Stixel index i, except for i = 1. Further, the model complexity factor Ψmc(n) is linear in the number of Stixels, c.f. Equation (5.5), and can thus be transformed into a constant unary term for each Stixel. Therefore, inference can be performed jointly for all values of n and the overall run time reduces to O(|S_i|^2) = O(h^4 |C|^2 |O|^2 |D|^2). Further improvement is obtained by exploiting the deterministic constraints on the segmentation consistency, c.f. Section 5.3.1. The random variable V_i^b can be substituted with V_{i-1}^t + 1 for i > 1 and with 1 for i = 1, c.f. the factors Ψcon and Ψ1st in Equations (5.6) and (5.9). As shown in Figure 5.2, there is no connection between V_i^b and any random variable of Stixel S_{i+1}. Thus, the substitution does not add a second-order dependency and inference via the Viterbi algorithm is still possible. However, the number of states is reduced to

|S_i| = |V^t| \, |C| \, |O| \, |D| = h \, |C| \, |O| \, |D|   (5.25)

and the overall run time becomes O(h^2 |C|^2 |O|^2 |D|^2). An inspection of all factors defined in Section 5.3 unveils that no pairwise factor depends on the Stixel's color attribute Oi. Instead, only the image data likelihood in Equation (5.18) depends on Oi and its minimization is solved analytically, i.e. o_i^* is the mean image color within the Stixel si. Both o_i^* and the value of Equation (5.18) can be computed in constant time using integral tables. The run time is therefore reduced to O(h^2 |C|^2 |D|^2).
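The integral tables used throughout this section are plain per-column prefix sums; the following sketch (with illustrative names) shows how the mean color of any candidate Stixel, i.e. the analytic minimizer o_i^* of Equation (5.18), becomes available in constant time after a linear pre-computation.

import numpy as np

def build_prefix(column):
    """prefix[v] holds the sum of rows 0..v-1 of one image column."""
    column = np.asarray(column, dtype=np.float64)
    prefix = np.zeros((column.shape[0] + 1,) + column.shape[1:])
    prefix[1:] = np.cumsum(column, axis=0)
    return prefix

def segment_mean(prefix, v_bottom, v_top):
    """Mean over rows v_bottom..v_top (inclusive) in O(1)."""
    return (prefix[v_top + 1] - prefix[v_bottom]) / (v_top + 1 - v_bottom)

# Example: the mean LAB color of a candidate Stixel spanning rows 10..40.
lab_column = np.random.rand(512, 3)
prefix = build_prefix(lab_column)
o_star = segment_mean(prefix, 10, 40)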

Let us now discuss the role of Ci, i.e. the Stixel's semantic class and indirectly its structural class1. The structural prior in Equation (5.10) depends only on the structural class, but not on the semantic class, while the semantic prior in Equation (5.13) can in principle model transitions between semantic classes. However, in practice it is sufficient to restrict this prior to structural classes only and add exceptions for a few select classes depending on the actual application, e.g. Section 5.6. The same holds for the weights of the data likelihoods, i.e. δD(ci), δI(ci), δL(ci). Since there is a fixed number of three structural classes referring to our world model, the evaluation of all pairwise terms is constant with respect to the number of classes. Only the data likelihood in Equation (5.19) depends on |C|, yielding a run time of O(h^2 |C| |D|^2).

5.4.3 Approximations

To further reduce the inference effort, Pfeiffer, 2011 proposed to not infer the variables Di that describe the Stixels' 3D positions. Instead, the value is computed from the given disparity map depending on the Stixel's extent and structural class label by averaging the disparities within a vertical Stixel, respectively the disparity offsets within a support Stixel. In doing so, the isolated disparity data likelihood terms are minimized individually, but the global minimum including the priors is not found exactly. Note that the Stixel's 3D position still depends on its extent, which in turn is inferred by solving the global minimization problem. The average disparities can be computed in constant time using integral tables and the overall run time reduces to O(h^2 |C|), which is a significant improvement compared to the initial run time of O(h^6 |C|^2 |O|^2 |D|^2). A different interpretation of the inference problem is to find the shortest path in the Directed Acyclic Graph (DAG) in Figure 5.4. The edge weights in that graph are defined according to Equation (5.22). For an edge between two Stixels sl (left) and sr (right), the weight is Φ(sl) + Ψ(sr, sl). Between the source and a Stixel sr, the weight is Ψ1(sr), and between a Stixel sl and the sink we obtain Φ(sl). Note that in all three cases, the model complexity factor is decomposed over these weights, yielding an additional term βmc. Solving the shortest path problem is bounded by the number of edges, O(h^2), with the optimal semantic class for each edge found in O(|C|), yielding O(h^2 |C|).

1 The described simplification was originally proposed by Lukas Schneider (L. Schneider et al., 2016) and is here embedded into the formalism provided by the graphical model.


Figure 5.4: Stixel inference algorithm as a shortest path problem, where the Stixel segmentation is obtained by the colored nodes along the shortest path from the source (left gray node) to the sink (right gray node). The color of the circles denotes the Stixel's structural class, i.e. support (purple), vertical (red), and sky (blue). The horizontal position of the nodes is the Stixel's top row v_i^t. The graph is a Directed Acyclic Graph (DAG), where only the incoming edges of support nodes are shown.
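The following is a compact sketch of this column-wise inference viewed as a shortest path over top rows and classes, with the data and prior terms passed in as black-box callables; as written, the explicit double loop over classes costs O(h^2 |C|^2), and restricting the pairwise term to the three structural classes recovers the O(h^2 |C|) bound derived above. All names are illustrative.

import numpy as np

def infer_column(h, num_classes, unary, pairwise, first_prior, beta_mc):
    """Viterbi-style inference for one image column.

    best[v, c] is the minimal energy of a segmentation of rows 0..v whose
    topmost Stixel ends at row v with structural class c. unary(vb, vt, c)
    stands for the summed data terms of a candidate Stixel (O(1) via
    integral tables in a real implementation), pairwise(c, c_prev) for the
    transition priors, and first_prior(c) for the first-Stixel prior.
    """
    best = np.full((h, num_classes), np.inf)
    back = {}
    for vt in range(h):
        for c in range(num_classes):
            # Case 1: this is the first Stixel, starting at the bottom row 0.
            best[vt, c] = first_prior(c) + unary(0, vt, c) + beta_mc
            back[(vt, c)] = None
            # Case 2: the Stixel starts right above a previous one.
            for vb in range(1, vt + 1):
                for cp in range(num_classes):
                    cand = (best[vb - 1, cp] + pairwise(c, cp)
                            + unary(vb, vt, c) + beta_mc)
                    if cand < best[vt, c]:
                        best[vt, c] = cand
                        back[(vt, c)] = (vb - 1, cp)
    # Backtrack from the best terminal state at the top row h-1.
    v, c = h - 1, int(np.argmin(best[h - 1]))
    stixels = []
    while True:
        prev = back[(v, c)]
        v_bottom = 0 if prev is None else prev[0] + 1
        stixels.append((v_bottom, v, c))
        if prev is None:
            return list(reversed(stixels))
        v, c = prev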

5.5 parameter learning

The Stixel segmentation is obtained by minimizing the energy function as introduced in Section 5.3, which is in principle controlled by two groups of parameters. The first group holds the parameters of the generative disparity model, which accounts for the planar Stixel assumption as well as the measurement uncertainty of depth estimates, i.e. p_val, p_out, and σ in Equation (5.16) and Equation (5.17). In order to derive these parameters from data or stereo confidences, the reader is referred to Cordts et al., 2017b and Pfeiffer et al., 2013, respectively. The energy function is linear in all remaining parameters forming the second group, i.e. model complexity (β_mc), structural priors (α_grav^-, β_grav^-, α_grav^+, β_grav^+, α_do, β_do), semantic priors

(γ_{c_i}, γ_{c_i, c_{i-1}}), and the weights between the data likelihoods (δD(ci), δI(ci), δL(ci)). As tuning these parameters manually can be a tedious task, we investigate automatic parameter learning by means of a Structured Support Vector Machine (S-SVM) with margin-rescaled Hinge loss (Nowozin and Lampert, 2011). To do so, we generate ground truth Stixels using the Cityscapes dataset, c.f. Chapter 4. We cut the instance-level annotations into individual columns and assign the corresponding ground truth class label to every instance segment. Further, we approximate ground truth 3D models according to the planar assumptions of support and vertical Stixels. To this end, we rely on the median SGM depth measurements (Gehrig et al., 2009;

Hirschmüller, 2008) within a Stixel, assuming that this approximation is sufficient to estimate the parameters we are interested in. Next, we define a loss that compares a ground truth Stixel segmentation with an inferred one. For each inferred Stixel, we determine the maximally overlapping ground truth Stixel and compute the squared deviations between the top and bottom rows, which we add to the loss. We penalize under-segmentations two orders of magnitude more strongly than over-segmentations, since an under-segmentation might result in an undetected obstacle, which is the worst-case scenario for an autonomous vehicle. The induced loss of both cases is truncated after a significant deviation between the inferred and the ground truth Stixel, considering the cut as missed. In addition to segmentation errors, we add the number of pixels with an erroneous structural class in each column as additional loss. In doing so, the loss decomposes over the inferred Stixels and can be treated as a fourth data term, hence the inference during training is tractable using the same method and approximations as in Section 5.4. While the sketched structured learning approach works in principle and converges to a meaningful parameter set, we found that the obtained results are comparable to, but never surpass, manual tuning that is guided by empirical observations of street scenes. We experimented with different loss variants, but in all cases manually tuned parameters were still slightly better, even in terms of the loss used for training. We attribute this effect to the convex relaxation of the Hinge loss, which in our case may not be tight enough. Hence, parameter learning for the Stixel model yields a valid starting point, but further manual guidance by physical evidence helps to increase robustness and generalization abilities.
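A schematic of the asymmetric loss described above is given below; the hundredfold weight reflects the two-orders-of-magnitude statement, while the truncation threshold and all names are illustrative assumptions.

def stixel_loss(inferred, ground_truth, trunc=400.0, under_weight=100.0):
    """Asymmetric loss between an inferred and a ground truth Stixel column.
    Each Stixel is a (v_bottom, v_top, structural_class) tuple."""
    loss = 0.0
    for vb, vt, cls in inferred:
        # Ground truth Stixel with the largest row overlap.
        gvb, gvt, gcls = max(
            ground_truth, key=lambda g: min(vt, g[1]) - max(vb, g[0]))
        # Squared top/bottom deviations; an inferred Stixel spilling over the
        # ground truth boundaries (under-segmentation) is weighted 100x.
        for dev, under in (((vb - gvb) ** 2, vb < gvb),
                           ((vt - gvt) ** 2, vt > gvt)):
            weighted = dev * (under_weight if under else 1.0)
            loss += min(weighted, trunc)  # truncate: treat the cut as missed
        # Per-pixel penalty for a wrong structural class.
        if cls != gcls:
            loss += vt - vb + 1
    return loss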

5.6 object-level knowledge

In Section 5.3.2, we proposed to leverage semantic information from a pixel-level semantic labeling system as a data term for the Stixel generation. Since such a system shows excellent recognition performance for complex urban scenes with various types of objects in often highly occluded constellations, c.f. Chapter 4, the main focus of this dissertation is on such pixel-level semantic parsing. However, bounding box object detectors (Felzenszwalb et al., 2010; Girshick, 2015; W. Liu et al., 2016; Redmon et al., 2016; Viola and Jones, 2004) also show excellent performance: such detectors were successfully used for autonomous driving systems (Franke et al., 2013) and are already available in series production vehicles, e.g. a pedestrian detector based on Enzweiler and Gavrila, 2009 in Mercedes-Benz cars. Besides bounding box detectors, object-level knowledge can also stem from instance-level segmentation, c.f. Section 4.4, or from simple line detectors for guard rails, e.g. Scharwächter et al., 2014b.

(a) Two different types of object-level knowledge. Vehicles are represented by bounding boxes, whereas guard rails are described by 2D lines delimiting their top extent.

(b) Resulting top/bottom cut energy maps for bounding box and line detections.

Figure 5.5: Examples of object-level knowledge that we incorporate into our Stixel model. Detections (top) are transformed into top and bottom cut energy maps (bottom) that we leverage as data terms during Stixel inference.

In this section, we will briefly discuss a variant of the Stixel model that leverages such object-level knowledge. In doing so, we also underline the flexibility and generality of our CRF-based Stixel model. For details on the extension and an experimental analysis, the reader is referred to Cordts et al., 2014.

5.6.1 Extended Stixel model

From a high-level perspective, both object detectors and pixel-level semantic labeling systems deliver semantic information about the scene. Thus, one might imagine treating both modules analogously without a need for further modifications. However, taking a closer look, we notice substantial differences in the representational power of the two components. While pixel-level labeling delivers a class segmentation, it does not separate individual object instances from each other. In contrast, the object-level knowledge that we would like to exploit in this discussion provides exactly this type of information. In turn, bounding boxes do not aim for an accurate segmentation, and in case of a line detector we might only have information about the end of an object, c.f. Figure 5.5a. In order to cope with these different types of object-level information, we introduce a new data likelihood factor as well as an object height prior, c.f. Figure 5.6. Further, we discuss the usage of specialized semantic priors. Note that for simplicity, we base this section on depth and object-level data terms only.

Figure 5.6: The Stixel world as a factor graph that depicts the factorization of the posterior distribution, c.f. Equation (5.1). The formulation in Section 5.3 and Figure 5.2 is extended by two factors (depicted in red) that are specialized for object-level knowledge, e.g. stemming from bounding box object detectors. For simplicity, we removed the pixel-level semantic data term as well as the image channel and the corresponding attributes.

object-level likelihood As discussed above, we desire to incorporate various types of object-level information into the Stixel model. To achieve such flexibility, we introduce so-called top/bottom cut energy maps as a proxy representation that unifies these various types into a common scheme, c.f. Figure 5.5b. In other words, we assume that we have a model describing for all pixels and all object types a probability score, or equivalently an energy, of being a bottom or top point of the object. The model takes into account the reliability of the source of information, its importance for the Stixel generation, and also a possible uncertainty in the precise location. Such a mapping from a given contour to actual bottom and top point energies could either be obtained from training data or by blurring the contour. In image regions without object detections or for semantic classes without matching detector, we set the maps to zero energy or positive values depending on the typical miss rates of the object detector. Let u^t(v, c_i) be the energy for row v being the top row of a Stixel with semantic class c_i and let u^b(v, c_i) be defined analogously for the bottom row. Then, the data likelihood factor of a candidate Stixel s_i is defined as

Φ_U(s_i, u^t, u^b) = u^t(v_i^t, c_i) + u^b(v_i^b, c_i)    (5.26)

Note that this likelihood factor has a slightly different structure than those defined in Section 5.3.2: it does not decompose over pixels, which allows it to capture knowledge about object instances rather than individual pixels. Nevertheless, the changed structure does not impose any asymptotic run time changes for inference, as the factor can be evaluated in constant time for each Stixel candidate s_i.
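As an illustration, the following sketch builds the top/bottom cut energy maps for a single image column from bounding box detections by placing a blurred energy valley at the detected contour rows, and evaluates Equation (5.26) with two look-ups per candidate. The Gaussian blur, the miss-energy handling, and all names are assumptions for illustration, not the exact model of Cordts et al., 2014.

```python
import numpy as np

def column_energy_maps(height, detections, sigma=4.0, miss_energy=0.5):
    """Top/bottom cut energy maps u^t, u^b for one image column.

    detections: list of (cls, row_top, row_bottom, confidence) tuples for
    bounding boxes intersecting this column. Energies are low (negative)
    near the detected rows, blurred to model localization uncertainty, and
    default to a positive value reflecting the detector's miss rate.
    """
    rows = np.arange(height)
    u_top, u_bottom = {}, {}
    for cls, row_top, row_bottom, confidence in detections:
        ut = u_top.setdefault(cls, np.full(height, miss_energy))
        ub = u_bottom.setdefault(cls, np.full(height, miss_energy))
        # blurred contour: energy valleys around the detected rows
        np.minimum(ut, -confidence * np.exp(-(rows - row_top) ** 2 / (2 * sigma ** 2)), out=ut)
        np.minimum(ub, -confidence * np.exp(-(rows - row_bottom) ** 2 / (2 * sigma ** 2)), out=ub)
    return u_top, u_bottom

def object_likelihood(v_top, v_bottom, cls, u_top, u_bottom, miss_energy=0.5):
    """Data term of Equation (5.26): two look-ups, i.e. O(1) per candidate."""
    if cls not in u_top:  # no matching detector for this class
        return 2.0 * miss_energy
    return u_top[cls][v_top] + u_bottom[cls][v_bottom]
```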

height prior Since object-level information enables a reasoning about individual object instances, we can introduce a height prior that models the typical heights of these instances. Such a prior helps especially when the object-level information is weak, e.g. the guard rail line detector shown in Figure 5.5 that delivers only the top location of guard rails: without a height prior, all Stixels with a matching top row would become guard rail. Naturally, such a prior makes more sense when occlusions are rare, e.g. in highway scenarios as considered in Cordts et al., 2014. In order to model the height of an object as represented by a Stixel, we introduce the factor Ψ_ht for vertical Stixels. Using the Stixel's 2D extent, its distance, and known camera parameters, we can compute the height in the 3D world. Depending on the Stixel's semantic class, implausible heights are then rated down. Again, this factor can be evaluated in constant time and the asymptotic run time does not change.
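A minimal sketch of such a height prior factor could look as follows; the plausible height ranges and the quadratic penalty are assumptions for illustration, and the back-projection uses a standard pinhole stereo model.

```python
# Hypothetical plausible metric height ranges per semantic class.
HEIGHT_RANGE = {"vehicle": (1.0, 4.0), "guard rail": (0.3, 1.2)}

def height_prior_energy(v_top, v_bottom, cls, disparity,
                        focal_px, baseline_m, penalty=10.0):
    """Energy of the height prior for a vertical Stixel: back-project its
    2D extent to a metric height and rate implausible values down.
    Constant time per candidate, so the asymptotic run time is unchanged."""
    if cls not in HEIGHT_RANGE:
        return 0.0
    depth_m = focal_px * baseline_m / max(disparity, 1e-6)  # pinhole stereo
    height_m = (v_bottom - v_top + 1) * depth_m / focal_px  # pixels -> meters
    lo, hi = HEIGHT_RANGE[cls]
    if lo <= height_m <= hi:
        return 0.0
    # quadratic penalty growing with the distance to the plausible interval
    return penalty * min((height_m - lo) ** 2, (height_m - hi) ** 2)
```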

specialized semantic prior In the 3D world, the semantic classes that we consider in this section, i.e. vehicle and guard rail, are typically located on the ground. When occlusions are rare, e.g. in highway scenarios as shown in Figure 5.5, such objects are typically also on top of ground regions in the 2D image domain. Thus, the semantic prior Ψ_sem(c_i, c_{i−1}) as introduced in Equation (5.13) can be used to capture these expectations by modifying γ_{c_i,c_{i−1}} appropriately. Note, however, that the run time analysis in Section 5.4.2 is based on modeling relations between structural classes only, while such modifications require semantic relations. In practice, if only a few semantic classes are explicitly modeled, the inference is still tractable. If the modifications become hard constraints on the relative class locations, e.g. no vehicle or guard rail below the ground surface allowed (Cordts et al., 2014), the constraints can be exploited to mitigate the impacts on run time.
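As a sketch of such class-level relations, consider a hypothetical transition-penalty table γ for the classes discussed here; the entries are illustrative, and infinite penalties act as the hard constraints that inference can exploit.

```python
import math

INF = math.inf  # hard constraint: the combination is excluded from inference

# Hypothetical penalties GAMMA[c_below][c_above] for a Stixel of class
# c_above stacked directly on top of one with class c_below.
GAMMA = {
    "ground":     {"ground": 1.0, "vehicle": 0.0, "guard rail": 0.0, "sky": 0.5},
    "vehicle":    {"ground": INF, "vehicle": 1.0, "guard rail": INF, "sky": 0.0},
    "guard rail": {"ground": INF, "vehicle": INF, "guard rail": 1.0, "sky": 0.0},
    "sky":        {"ground": INF, "vehicle": INF, "guard rail": INF, "sky": INF},
}

def semantic_prior(c_below, c_above):
    """Pairwise prior in the spirit of Equation (5.13): objects resting on
    the ground are cheap, ordering violations such as ground above a
    vehicle are forbidden, and nothing may appear above sky."""
    return GAMMA[c_below][c_above]
```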

5.7 experiments²

In this section, we evaluate the Stixel model to analyze the influence of our input cues (Section 5.7.4) and the impact of our approximations for efficient inference (Section 5.7.5). To this end, we inspect the model's accuracy in terms of depth and semantic labeling estimation. For an extensive analysis of the underlying environment model and the effect of important model parameters, the reader is referred to Cordts et al., 2017b. It turns out that the Stixel segmentation can indeed achieve high accuracies paired with strong compression rates. Prior information helps in particular when facing challenging weather conditions, where robustness and regularization are beneficial. Exemplary results of our Semantic Stixel model on Cityscapes can be found in Figure 5.7. For our experiments, we use SGM (Gehrig et al., 2009; Hirschmüller, 2008) to obtain dense disparity maps as input channel. For semantic label estimates, we rely on FCN (Shelhamer et al., 2017) using GoogLeNet (Szegedy et al., 2015) as underlying neural network. The used model is an early variant of those in Section 4.3.5 and has worse performance. However, for the analyses in this section, the actual source of the input channels is irrelevant: the absolute performance of the Semantic Stixels scales directly with the quality of the inputs, and the observed relative performance differences are expected to be largely independent of the inputs.

2 Except for Section 5.7.5, we recap experiments designed and conducted by Lukas Schneider (Cordts et al., 2017b; L. Schneider et al., 2016). Note that the FCNs providing the pixel-level semantic labeling input channels are based on Section 4.3.5 and are a contribution of this dissertation.

Figure 5.7: Exemplary results of our Semantic Stixel model on Cityscapes (Chapter 4). We show the semantic and depth representation using the same color scheme as in Figure 5.1. From top to bottom: input image, semantic input channel, semantic Stixel representation, depth input channel, depth Stixel representation. Note that the segmentation even captures small objects such as traffic signs and lights. Courtesy of Lukas Schneider.

5.7.1 Datasets

We rely on the subset of KITTI (Geiger et al., 2013) annotated by Ladický et al., 2014 as our first evaluation dataset, denoted as Ladicky in the following. This dataset contains both semantic labeling and depth ground truth, and thus allows for a joint evaluation of both output aspects. We use all 60 images for evaluation and none for training. As suggested by Ladický et al., 2014, 8 annotated classes are used for evaluation and the three rarest classes are ignored. For training the FCN that we use during experiments on the Ladicky dataset, we leverage other parts of KITTI that were annotated by various research groups (H. He and Upcroft, 2013; Kundu et al., 2014; Ros et al., 2015; Sengupta et al., 2013; P. Xu et al., 2013; R. Zhang et al., 2015). Their annotations are mapped onto the 8 classes in Ladicky, yielding a training set consisting of 676 images. Motivated by the observations in Section 4.3.6, we initialize the training with a model trained on Cityscapes. In doing so, the classification performance of the stand-alone FCN input channel is improved by 6 % IoU compared to an FCN without Cityscapes initialization. In addition to Ladicky, we extend our evaluation to two datasets that only allow for an analysis of depth and classification performance individually, since there is no ground truth for the respective other modality available. First, we use the training set belonging to the stereo challenge in KITTI'15 (Menze and Geiger, 2015) consisting of 200 images to report disparity accuracy. Second, we evaluate pixel-level semantic labeling performance on the validation set of Cityscapes, c.f. Chapter 4.

5.7.2 Metrics

For quantitative analysis, we use four performance metrics that analyze distinct aspects of the Semantic Stixel model as well as of our baseline method. First, depth accuracy is reported as the percentage of inlier disparity estimates according to the definition by Menze and Geiger, 2015. An estimated disparity is considered an inlier if the absolute deviation compared with ground truth is less than or equal to 3 px and simultaneously the relative deviation is smaller than 5 %. Second, semantic labeling accuracy is measured via the Intersection-over-Union (IoU) metric averaged over all classes and defined in Section 3.8.3. Third, we determine the run time of the approach, since real-time capability is crucial for autonomous driving. As compute hardware, we employ a 3 GHz, 10-core Intel Xeon CPU, an NVIDIA Titan X (Maxwell) GPU for the FCN, as well as an FPGA for SGM. Last, we aim to measure the compression capabilities of the model and use the number of segments per image as a proxy metric for the representation complexity.
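The two accuracy metrics can be summarized in a few lines; this sketch follows the definitions above, with array shapes and the ignore-label convention as assumptions.

```python
import numpy as np

def disparity_inlier_rate(d_est, d_gt, valid):
    """Percentage of inlier disparities: absolute error <= 3 px and,
    simultaneously, relative error < 5 % (definition used above)."""
    err = np.abs(d_est - d_gt)
    inlier = (err <= 3.0) & (err < 0.05 * d_gt)
    return 100.0 * inlier[valid].mean()

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Intersection-over-Union averaged over all classes that occur."""
    mask = gt != ignore_label
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c) & mask)
        union = np.sum(((pred == c) | (gt == c)) & mask)
        if union > 0:
            ious.append(inter / union)
    return 100.0 * float(np.mean(ious))
```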

5.7.3 Baseline

To allow for an intuitive understanding of quantitative performance numbers, we include a simple baseline model in our analysis that we refer to as smart downsampling. As the name suggests, this reference model produces a downsampled version of the disparity and semantic labeling image. The downsampling factor is selected such that the Stixel segmentation and the baseline have the same complexity, measured by the number of bytes required to encode the individual representations. For example, on KITTI the downsampling factor is chosen as 21, which matches the complexity of 700 Stixels as obtained on average when computed on this dataset. Downsampling of the semantic labeling image is conducted by average pooling of the classification scores. However, applying a simple pooling to the disparity image would lead to bad results in ground regions. Hence, we make this process smart and apply the ground model of the Stixel segmentation: we use the semantic labels to determine ground regions, subtract the ground model from the disparities, and then compute the average. For sky pixels, we assign disparity values of zero.
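A minimal sketch of this baseline is given below; the interface, in particular the per-pixel ground-model disparity and the class-score layout, is a hypothetical simplification of the procedure described above.

```python
import numpy as np

def average_pool(img, f):
    """Block-wise average pooling with downsampling factor f."""
    h, w = img.shape[0] // f * f, img.shape[1] // f * f
    blocks = img[:h, :w].reshape(h // f, f, w // f, f, *img.shape[2:])
    return blocks.mean(axis=(1, 3))

def smart_downsampling(disparity, class_scores, ground_disparity, factor,
                       ground_ids, sky_id):
    """class_scores: (H, W, C) classification scores; ground_disparity: the
    Stixel ground model evaluated per pixel. Scores are average-pooled
    directly; disparities are pooled after subtracting the ground model in
    regions labeled as ground, and sky pixels are set to zero."""
    labels = class_scores.argmax(axis=-1)
    d = disparity.astype(float).copy()
    is_ground = np.isin(labels, list(ground_ids))
    d[is_ground] -= ground_disparity[is_ground]  # pool ground residuals
    d[labels == sky_id] = 0.0                    # sky has zero disparity
    return average_pool(d, factor), average_pool(class_scores, factor)
```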

5.7.4 Impact of input cues

Our Semantic Stixel model is based on three input modalities, i.e. depth maps D:, the color image I:, and semantic label scores L:. In this section, we aim to assess their individual contributions in ablation studies. Therefore, we evaluate the Stixel representations computed using only a single input source at a time, as well as different combinations. As we always seek a representation of depth and semantics, we assign this data in a post-processing step if the respective cue is not used during Stixel inference. Qualitative results of these studies are shown in Figure 5.8 and a quantitative analysis is provided in Table 5.1. Overall, we observe that semantic and color information helps to increase the depth accuracy, while depth and color channels improve semantic labeling performance. Interestingly, the gain obtained by exploiting the color image is fairly small, in particular when semantic input is included. We attribute this to the powerful FCN that already captures important image edges, hence adding the latter helps only marginally. Note further that an improved performance in terms of depth and semantic accuracy comes with a slightly increased number of Stixels (last row in Table 5.1). Surprisingly, the smart downsampling baseline performs reasonably well on Ladicky and KITTI'15 at an output image size of only 58 × 18 pixels. However, on Cityscapes, with more complex scenes and more precise ground truth, the performance is worse. Nevertheless, the quantitative metrics can only capture parts of the drawbacks and the true differences

Figure 5.8: Qualitative results of several ablation studies. We show the obtained semantic labeling of the Semantic Stixel segmentation computed using different input channels (Depth, Color Image, Semantic Labels). In addition, we show ground truth, the FCN input channel, and the smart downsampling baseline. Without the depth channel (L), objects of the same semantic class cannot be separated, c.f. the cars in the front. When removing semantic cues (D), regions are not accurately segmented, e.g. the distant bus (top) or the sidewalk (bottom). The color image helps in particular when an over-segmentation can be tolerated but a precise segmentation of small objects is desirable, e.g. the traffic light (bottom). The smart downsampling baseline fails to represent small objects, which is more pronounced in this qualitative study than in the quantitative results in Table 5.1. Courtesy of Lukas Schneider.

Metric           Data   SGM        FCN     SDS       D      I      L      DI     DL     IL     DIL
Disp. acc. [%]   LA     82.4       —       82.7      80.0   47.9   72.7   80.9   83.0   72.6   83.3
                 KI     90.6       —       88.9      90.4   58.7   77.5   90.7   91.3   81.9   91.4
IoU [%]          LA     —          69.8    65.8      46.1   27.2   62.5   47.3   66.1   66.8   66.5
                 CS     —          60.8    54.1      43.8   18.1   59.7   44.2   60.0   59.8   60.1
Run time [ms]    LA     39.2               —         24.2   2.7    25.5   24.7   45.7   25.7   46.6
                 CS     110                —         70.5   10.5   86.1   70     143    86.4   148
No. of Stixels   KI     0.5 M (a)          745 (b)   509    379    453    652    625    577    745
                 CS     2 M (a)            1395 (b)  1131   1254   1048   1444   1382   1283   1395

(a) We list the number of pixels for SGM and FCN raw data to approximately compare the complexity to Stixels.
(b) We use a downsampling factor that results in the same number of bytes to encode this representation as for the given number of Stixels.

Table 5.1: Quantitative performance analysis of the Semantic Stixel model based on varying combinations of the input channels. We report numbers for all combinations of Depth, Color Image, and Semantic Labels and compare to the raw input channels, i.e. SGM and FCN, as well as the smart downsampling baseline (SDS). Performance is reported for three datasets (Section 5.7.1), i.e. Ladicky (LA), KITTI'15 (KI), and Cityscapes (CS), based on the four metrics introduced in Section 5.7.2. Run time is the total time to parse a single image including the computation of the input channels. To achieve a higher throughput at the cost of one frame delay, we can pipeline the components: SGM and FCN run in parallel to the Stixel algorithm, on FPGA, GPU, and CPU, respectively. In doing so, we achieve a frame rate of 47.6 Hz on KITTI and 15.4 Hz on Cityscapes for the DIL variant.

in representational capabilities become evident when inspecting the qualitative results in Figure 5.8. Since disparities and pixel labels are the driving input channels, we now assess the effects when varying their relative contribution to the Stixel segmentation. To this end, we sweep the semantic labeling weight δL, c.f. Equation (5.19), while keeping the disparity influence δD, c.f. Equation (5.15), constant at 1. We run the experiment in two different setups, first computing Stixels of width 2 and then of our default width 8. Narrower Stixels yield higher depth and semantic accuracy, but are also slower to compute. The results in Figure 5.9 show that an increased semantic influence naturally improves semantic labeling performance, but surprisingly to some extent also helps to increase the disparity accuracy. We find a weight of 5 to produce the best trade-off between both metrics. In case of the narrow Stixel variant, a high influence of the semantic channel even harms the labeling performance, indicating benefits of depth cues for this modality. We attribute these effects to the Stixel model acting as a regularizer that can compensate for some errors in the input channels. Furthermore, these experiments confirm our findings in Chapter 3 that multiple complementary input cues aid scene recognition performance.

Figure 5.9: Analysis of the influence of the semantic input channel on the disparity and semantic labeling accuracies. We vary the semantic weight δL while keeping the disparity weight constant at a value of 1. In the left column, we show results for Stixels of width 2, i.e. high accuracy but slow to compute. In the right, we show the performance when using our default width 8. We report two metrics on three datasets, c.f. Sections 5.7.1 and 5.7.2: (1) Disparity accuracy on Ladicky (red, top), (2) IoU on Ladicky (blue, top), (3) Disparity accuracy on KITTI'15 (red, bottom), and (4) IoU on Cityscapes (blue, bottom). Courtesy of Lukas Schneider.

5.7.5 Impact of approximations

In Section 5.4.3, we introduced approximations during inference to keep asymptotic and practical run time low. In our final experiment, we compare these approximations with an exact inference of our model on Ladicky, where the number and size of the images are small enough to conduct this experiment. As is evident from Table 5.2, both variants yield the same result quality, supporting the validity of our approximations. However, the run time of the approximate Stixel inference is two orders of magnitude lower.

                              approx. inference   exact inference
Disparity accuracy [%]        83.3                83.1
IoU [%]                       66.5                66.3
No. of Stixels per image      745                 741
Inference run time [ms]       7.4                 863

Table 5.2: Performance of our approximated inference, c.f. Section 5.4.3, compared to an exact inference of Equation (5.20). While the accuracy of both variants is virtually identical, the run time of our inference scheme is two orders of magnitude lower.

5.8 discussion

In this chapter, we introduced Semantic Stixels, a medium-level representation of street scenes combining semantic and depth data. The model focuses on the representation of street scenes for autonomous driving in complex urban scenarios. Based on the assumption that road scenes are geometrically mainly structured in the vertical direction, Stixels model the environment as consisting of support (ground), vertical (obstacles), and sky regions. The Stixel model is founded on a graphical model that provides a clear formalism and allows for a clean integration of different input channels. Based on this model, we derive an efficient inference scheme and show a concept to learn the model parameters from training data. Experiments confirm that depth and semantic cues from object detectors or pixel-level semantic labeling complement each other and the Stixel segmentation effectively integrates both modalities. Overall, we obtain an accurate urban scene representation that is compact and efficient to compute.

DISCUSSION AND OUTLOOK 6

contents
6.1 Discussion
6.2 Future Work

In this dissertation, we worked towards a system for semantic scene understanding for autonomous driving. In Chapter 3, we learned that multiple complementary input cues significantly aid pixel-level semantic labeling and we found that best performance is achieved when being based on Stixel superpixels. Next, in Chapter 4, we investigated semantic scene understanding when leveraging large-scale data and benchmarked state-of-the-art methods. Based on our findings, we proposed an efficient yet competitive FCN for pixel-level semantic labeling. Subsequently, we combined the lessons learned in Chapter 5. We proposed Semantic Stixels that integrate our real-time semantic labeling FCN with depth information from stereo vision and yield a compact medium-level representation of road scenes. We now explicitly address the research questions that we initially posed in Section 1.4 and summarize the lessons learned in Section 6.1. Next, we conclude the dissertation with an outlook on possible future work in Section 6.2.

6.1 discussion

potential of multiple input modalities When considering autonomous driving, there are multiple input modalities available that could be exploited. The parsed images typically stem from a video stream providing temporal information. Often, stereo cameras or laser scanners are employed to sense depth information. Also, bounding box object detectors are established components in ADAS that provide additional semantic cues. In order to investigate the benefits of such data, we proposed a system in Chapter 3 that aims to exploit multi-cue information holistically throughout all processing stages. We found that multi-modal data indeed aids semantic labeling performance. Motivated by these findings, we ensured that multi-cue data is also available in Cityscapes, a large-scale dataset and benchmark suite presented in Chapter 4. We included video, stereo, vehicle odometry, and GPS data to facilitate continued research in that domain. Via the graphical model that defines the Semantic Stixel world in Chapter 5, we enabled a flexible formulation that allows various input channels to be integrated. We showed that depth, color, and semantic information in the form of bounding box detections and pixel-level labels can be combined to achieve best performance.

known scene type In the context of urban scene understanding, the variance of occurring scene types is limited. The images always contain street scenes, recorded in inner-city traffic, while the camera is mounted in a vehicle at constant position with known pose. We leveraged this knowledge throughout the dissertation, either via manual modeling or via machine learning. In Chapter 3, we designed features that encode multi-cue information specifically for street scene understanding. Based on the known camera pose, we extracted features such as height above ground or velocity that are specifically discriminative for street scenes. When pairing large-scale data with deep learning methods in Chapter 4, we observe that the classifier can separate road from sidewalk regions. As their texture is very similar, we attribute these capabilities to prior information about the scene type that the network learned during training. The Semantic Stixels in Chapter 5, on the other hand, are specifically designed for the scene type at hand by describing the scene in terms of support, vertical, and sky regions. In addition, the model contains prior terms that model the common scene layout. We also show how the model's parameters can in principle be learned from training data.

large-scale urban scene understanding Inspired by the success of large-scale datasets for general image parsing, we investigated their importance for urban scene understanding. To this end, we proposed the Cityscapes dataset and benchmark suite in Chapter 4, the largest and most diverse dataset of street scenes with high-quality as well as coarse annotations to date. We extensively analyzed the dataset and described our major design choices towards the goal of a suitable dataset for autonomous driving. Based on the dataset, significant progress for urban scene understanding was made and the benchmark allows for a transparent and fair evaluation of suitable methods. Keeping the scope of this dissertation in mind, we combined established approaches to obtain an efficient and competitive FCN for pixel-level semantic labeling.

scene representation A constant trend towards increasing image resolutions creates significant computational challenges for processing in autonomous vehicles. In Chapter 4, we presented the Cityscapes dataset that consists of images with a resolution of 2 MP and we showed that this high resolution in fact increases recognition performance. In order to make this high-resolution data available to subsequent processing stages, we argue that we need an abstraction and compression of the raw data while keeping the accuracy high. To this end, we introduced Semantic Stixels in Chapter 5 that parse the scene at real-time speeds and provide a representation with the desired properties. Simultaneously, the representation is robust due to the incorporation of prior knowledge and combines a 3D with a semantic scene understanding. We even argue that Stixels mitigate the need for an instance-level scene understanding to some extent. When inspecting Figure 5.1, we observe that the Stixel segments are capable of separating object instances of the same class at the cost of an over-segmentation. In the horizontal direction this is enforced by construction, and in the vertical direction depth cues aid the segmentation. Despite the over-segmentation, Stixels provide sufficient spatial support to robustly estimate their motion state, c.f. Section 3.4.1, allowing for behavior prediction without explicit instance-level knowledge.

6.2 future work

We now go beyond the scope of this dissertation and discuss potential future work that could be conducted based on our findings. We explicitly address potential shortcomings of this dissertation and raise open questions that should be answered via future research.

multi-cue deep learning Deep learning is the state-of-the-art machine learning method for classification and many other tasks. However, when it comes to leveraging multiple input modalities for semantic scene understanding, it becomes challenging to achieve performance gains that are proven to be driven by multi-cue data. Possibly this is due to the established training of neural networks via Stochastic Gradient Descent (SGD), where the greedy solution is to focus on the color image and to ignore additional modalities. The latter are more likely to aid generalization, but are less helpful to greedily classify a single datum. Another possible explanation is the lack of large-scale multi-modal data for general image parsing, i.e. an ImageNet equivalent with multi-cue inputs that could be used for model initialization. With Cityscapes, presented in Chapter 4, we made sure to create a benchmark that offers various input modalities, but so far, there are only very few methods using multi-cue data at all, and their performance is not yet state of the art. In particular, the time domain, which seems to be a driving force for human perception, as well as High Dynamic-Range (HDR) images with additional sensory information are barely considered. Recently, Neverova et al., 2017 used Cityscapes to learn the behavior of objects in the scene and to predict their future state. In doing so, the neural network has to become aware of time, and we hope that such work motivates other researchers to follow similar paths. Throughout this dissertation, we studied multi-cue models extensively and confirmed that multiple modalities help, but we relied on classic techniques (Chapter 3) or combined them with single-cue deep learning (Chapter 5). In the future, we plan to bridge the gap and to develop methods that fuse the modalities early to maximally exploit the available information.

extensions of cityscapes We see the Cityscapes dataset and benchmark suite (Chapter 4) as a dynamic and ongoing project that we will adapt to novel developments and needs over time to keep pace with the rapid advances in urban semantic scene understanding. Recently, S. Zhang et al., 2017 annotated bounding boxes around humans in the Cityscapes dataset, including occlusion as well as activity information. We plan to incorporate these annotations into the dataset's website and to extend the benchmark with a pedestrian detection task. Other possible future extensions include adding data from more cities, ideally from all over the world (c.f. Y.-H. Chen et al., 2017), or incorporating more diverse weather and illumination scenarios, e.g. rain, night, or snow. Another planned direction of future research is to re-address the instance-level semantic labeling task. The current challenge is inspired by an object detection perspective, and predictions are represented as a set of detections that have confidence scores and can possibly overlap. The only difference to classic detection challenges is that each prediction is associated with a segmentation instead of a bounding box. In contrast, from a segmentation perspective, an instance-level prediction would be a segmentation of the scene into instances, i.e. each pixel would be associated with exactly one instance or the background. In such a setup, predictions are non-overlapping and confidence scores are rather unnatural, as segmentation is a multi-class problem whereas confidences refer to binary decisions. To account for this second perspective and for the fact that many researchers actually addressed instance-level labeling on Cityscapes from a segmentation perspective, we plan to create a second instance-level challenge. We will develop suitable metrics that fairly assess the performance of segmentation-centered approaches, re-evaluate existing submissions using these metrics, and in doing so provide the baselines for future methods. Note that it is also natural to combine such an instance-level task with pixel-level labeling, where traffic participants are encoded as individual instances and background classes via semantic segments.

semantic stixel model In Chapter 5, we introduced a graphical model that defines the Stixel segmentation problem. However, it remains a challenging task to determine the optimal parameterization that is robust to diverse weather scenarios and produces best results for different tasks and downstream applications. Thus, we investigated automated parameter learning via S-SVMs in Section 5.5, but we could not outperform a manually modeled parameterization yet. Nevertheless, this research should be continued in the future to allow for possible extensions, such as an online parameter learning to adapt to changing environment conditions. Our Semantic Stixel model effectively combines semantic class information with depth cues. As discussed in Section 6.1, this combination is capable of achieving an instance-aware representation to some extent. However, appearance information delineating individual instances of the same semantic class, e.g. stemming from an instance-level labeling system, is ignored. In the future, we plan to extend the Stixel segmentation based on the graphical model to incorporate such knowledge. Ultimately, we would obtain a representation that encodes the full information of a visual perception system for autonomous driving.

APPENDIX A

contents
a.1 Cityscapes Label Definitions
a.1.1 Human
a.1.2 Vehicle
a.1.3 Nature
a.1.4 Construction
a.1.5 Object
a.1.6 Sky
a.1.7 Flat
a.1.8 Void

a.1 cityscapes label definitions¹

This section provides precise definitions of our annotated classes. These definitions were used to guide our labeling process, as well as quality control, c.f. Section 4.2.2. In addition, we include a typical example for each class.

a.1.1 Human

person All humans that would primarily rely on their legs to move if necessary. Consequently, this label includes people who are standing/sitting, or otherwise stationary. This class also includes babies, people pushing a bicycle, or standing next to it with both legs on the same side of the bicycle.

1 The creation of the Cityscapes dataset including the label definitions was initially driven by Marius Cordts and was then continued as joint work with Mohamed Omran. The definitions in this section are based on joint work of Mohamed Omran and Marius Cordts in discussions with the other authors of the publication Cordts et al., 2016. This section contains verbatim quotes of the supplemental material in that publication.


rider Humans relying on some device for movement. This includes drivers, passengers, or riders of bicycles, motorcycles, scooters, skateboards, horses, Segways, (inline) skates, wheelchairs, road cleaning cars, or convertibles. Note that a visible driver of a closed car can only be seen through the window. Since holes are considered part of the surrounding object, the human is included in the car label.

a.1.2 Vehicle

car This includes cars, jeeps, SUVs, and vans with a continuous body shape (i.e. the driver's cabin and cargo compartment are one). Does not include trailers, which have their own separate class.

truck This includes trucks, vans with a body that is separate from the driver's cabin, pickup trucks, as well as their trailers.

bus This includes buses that are intended for 9+ persons for public or long-distance transport.

train All vehicles that move on rails, e.g. trams, trains.

motorcycle This includes motorcycles, mopeds, and scooters without the driver or other passengers. The latter receive the label rider.

bicycle This includes bicycles without the cyclist or other passengers. The latter receive the label rider.

caravan Vehicles that (appear to) contain living quarters. This also includes trailers that are used for living and has priority over the trailer class.

trailer Includes trailers that can be attached to any vehicle, but excludes trailers attached to trucks. The latter are included in the truck label.

a.1.3 Nature

vegetation Trees, hedges, and all kinds of vertically growing vegetation. Plants attached to buildings/walls/fences are not annotated separately, and receive the same label as the surface they are supported by.

terrain Grass, all kinds of horizontally spreading vegetation, soil, or sand. These are areas that are not meant to be driven on. This label may also include a possibly adjacent curb. Single grass stalks or very small patches of grass are not annotated separately and thus are assigned to the label of the region they are growing on.

a.1.4 Construction

building Includes structures that house/shelter humans, e.g. low-rises, skyscrapers, bus stops, car ports. Translucent buildings made of glass still receive the label building. Also includes scaffolding attached to buildings.

wall Individually standing walls that separate two (or more) outdoor areas, and do not provide support for a building.

fence Structures with holes that separate two (or more) outdoor areas, sometimes temporary.

guard rail Metal structure located on the side of the road to prevent serious accidents. Rare in inner cities, but occurs sometimes in curves. Includes the bars holding the rails.

bridge Bridges (on which the ego-vehicle is not driving) including everything (fences, guard rails) permanently attached to them.

tunnel Tunnel walls and the (typically dark) space encased by the tunnel, but excluding vehicles.

a.1.5 Object

traffic sign Front part of signs installed by the state/city authority with the purpose of conveying information to drivers/cyclists/pedestrians, e.g. traffic signs, parking signs, direction signs, or warning reflector posts.

traffic light The traffic light box without its poles in all orientations and for all types of traffic participants, e.g. regular traffic light, bus traffic light, train traffic light.

pole Small, mainly vertically oriented poles, e.g. sign poles or traffic light poles. This does not include objects mounted on the pole, which have a larger diameter than the pole itself (e.g. most street lights).

pole group Multiple poles that are cumbersome to label individually, but where the background can be seen in their gaps.

a.1.6 Sky

sky Open sky (without tree branches/leaves).

a.1.7 Flat

road Horizontal surfaces on which cars usually drive, including road markings. Typically delimited by curbs, rail tracks, or parking areas. However, road is not delimited by road markings and thus may include bicycle lanes or roundabouts, but not curbs.

sidewalk Horizontal surfaces designated for pedestrians or cyclists. Delimited from the road by some obstacle, e.g. curbs or poles (might be small), but not only by markings. Often elevated compared to the road and often located at the side of a road. The curbs are included in the sidewalk label. Also includes the walkable part of traffic islands, as well as pedestrian-only zones, where cars are not allowed to drive during regular business hours. If it is an all-day mixed pedestrian/car area, the correct label is ground.

parking Horizontal surfaces that are intended for parking and separated from the road, either via elevation or via a different texture/material, but not separated merely by markings.

rail track Horizontal surfaces on which only rail cars can normally drive. If rail tracks for trams are embedded in a standard road, they are included in the road label.

a.1.8 Void

ground All other forms of horizontal ground-level structures that do not match any of the above, for example mixed zones (cars and pedestrians), roundabouts that are flat but delimited from the road by a curb, or in general a fallback label for horizontal surfaces that are difficult to classify, e.g. due to having a dual purpose.

dynamic Movable objects that do not correspond to any of the other non-void categories and might not be in the same position in the next day/hour/minute, e.g. movable trash bins, buggies, luggage, animals, chairs, or tables.

static This includes areas of the image that are difficult to identify/label due to occlusion/distance, as well as non-movable objects that do not match any of the non-void categories, e.g. mountains, street lights, reverse sides of traffic signs, or permanently mounted commercial signs.

ego vehicle Since a part of the vehicle from which our data was recorded is visible in all frames, it is assigned to this special label. This label is also available at test time.

unlabeled Pixels that were not explicitly assigned to a label.

out of roi Narrow strip of 5 pixels along the image borders that is not considered for training or evaluation. This label is also available at test time.

rectification border Areas close to the image border that contain artifacts resulting from the stereo pair rectification. This label is also available at test time.

BIBLIOGRAPHY

Achanta, Radhakrishna, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk (2012). „SLIC superpixels compared to state-of-the-art superpixel methods.“ In: Trans. PAMI 34.11, pp. 2274–2282 (cit. on pp. xvi, 27, 110, 120).
Arbeláez, Pablo, Bharath Hariharan, Chunhui Gu, Saurabh Gupta, Lubomir Bourdev, and Jitendra Malik (2012). „Semantic segmentation using regions and parts.“ In: CVPR, pp. 3378–3385 (cit. on pp. 14, 15, 21, 111).
Arbeláez, Pablo, Michael Maire, Charless Fowlkes, and Jitendra Malik (2011). „Contour detection and hierarchical image segmentation.“ In: Trans. PAMI 33.5, pp. 898–916 (cit. on pp. xvi, 14, 21, 22, 46, 111).
Ardeshir, Shervin, Kofi Malcolm Collins-Sibley, and Mubarak Shah (2015). „Geo-semantic segmentation.“ In: CVPR, pp. 2792–2799 (cit. on pp. 18, 68).
Arnab, Anurag, Sadeep Jayasumana, Shuai Zheng, and Philip H S Torr (2016). „Higher order conditional random fields in deep neural networks.“ In: ECCV, pp. 524–540 (cit. on p. 16).
Badino, Hernán, Uwe Franke, and David Pfeiffer (2009). „The Stixel world - A compact medium level representation of the 3D world.“ In: DAGM Symposium, pp. 51–60 (cit. on p. 108).
Badrinarayanan, Vijay, Alex Kendall, and Roberto Cipolla (2017). „SegNet: A deep convolutional encoder-decoder architecture for image segmentation.“ In: Trans. PAMI (cit. on pp. 16, 57, 98, 99).
Benenson, Rodrigo, Markus Mathias, Radu Timofte, and Luc Van Gool (2012). „Fast stixel computation for fast pedestrian detection.“ In: CVVT Workshop (ECCV), pp. 11–20 (cit. on pp. 108, 110).
Benz, Karl Friedrich (1886). Patent No. 37435 (cit. on p. 6).
Blake, Andrew and Pushmeet Kohli (2011). „Introduction to Markov Random Fields.“ In: Markov Random Fields for vision and image processing. Ed. by Andrew Blake, Pushmeet Kohli, and Carsten Rother. MIT Press, pp. 1–28 (cit. on pp. 15, 22, 39, 40).
Boureau, Y-Lan, Jean Ponce, and Yann LeCun (2010). „A theoretical analysis of feature pooling in visual recognition.“ In: ICML, pp. 111–118 (cit. on p. 14).
Brostow, Gabriel J, Julien Fauqueur, and Roberto Cipolla (2009). „Semantic object classes in video: A high-definition ground truth database.“ In: Pattern Recognition Letters 30.2, pp. 88–97 (cit. on pp. xiv, 58, 61, 66, 68, 98, 99).


Brox, Thomas, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik (2011). „Object segmentation by alignment of poselet activations to image contours.“ In: CVPR, pp. 2225–2232 (cit. on p. 14).
Byeon, Wonmin, Thomas M Breuel, Federico Raue, and Marcus Liwicki (2015). „Scene labeling with LSTM recurrent neural networks.“ In: CVPR, pp. 3547–3555 (cit. on p. 16).
Carreira, João, Rui Caseiro, Jorge Batista, and Cristian Sminchisescu (2012a). „Semantic segmentation with second-order pooling.“ In: ECCV, pp. 430–443 (cit. on p. 14).
Carreira, João, Fuxin Li, and Cristian Sminchisescu (2012b). „Object recognition by sequential figure-ground ranking.“ In: IJCV 98.3, pp. 243–262 (cit. on p. 111).
Carreira, João and Cristian Sminchisescu (2012). „CPMC: Automatic object segmentation using constrained parametric min-cuts.“ In: Trans. PAMI 34.7, pp. 1312–1328 (cit. on p. 111).
Chen, Yi-Hsin, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, and Min Sun (2017). „No more discrimination: Cross city adaptation of road scene segmenters.“ In: arXiv:1704.08509v1 [cs.CV] (cit. on p. 140).
Chen, Liang-Chieh, George Papandreou, Iasonas Kokkinos, Kevin P Murphy, and Alan L Yuille (2016). „DeepLab: Semantic image segmentation with deep convolutional nets and fully connected CRFs.“ In: arXiv:1606.00915v1 [cs.CV] (cit. on pp. 16, 75, 84, 85, 91).
Chen, Liang-Chieh, George Papandreou, and Alan L Yuille (2013). „Learning a dictionary of shape epitomes with applications to image labeling.“ In: ICCV, pp. 337–344 (cit. on p. 14).
Chen, Xiaozhi, Kaustav Kundu, Yukun Zhu, Andrew Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun (2015). „3D object proposals for accurate object class detection.“ In: NIPS, pp. 424–432 (cit. on p. 18).
Clevert, Djork-Arné, Thomas Unterthiner, and Sepp Hochreiter (2016). „Fast and accurate deep network learning by Exponential Linear Units (ELUs).“ In: ICLR (cit. on p. 17).
Comaniciu, Dorin and Peter Meer (2002). „Mean shift: A robust approach toward feature space analysis.“ In: Trans. PAMI 24.5, pp. 603–619 (cit. on p. 110).
Cordts, Marius, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele (2016). „The Cityscapes dataset for semantic urban scene understanding.“ In: CVPR, pp. 3213–3223 (cit. on pp. 57, 58, 91, 92, 99, 105, 143).
Cordts, Marius, Mohamed Omran, Sebastian Ramos, Timo Scharwächter, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele (2015). „The Cityscapes dataset.“ In: CVPR Workshop on The Future of Datasets in Vision (cit. on pp. 57, 102).
Cordts, Marius, Timo Rehfeld, Markus Enzweiler, Uwe Franke, and Stefan Roth (2017a). „Tree-structured models for efficient multi-cue scene labeling.“ In: Trans. PAMI 39.7, pp. 1444–1454 (cit. on pp. 19, 22, 25, 26).
Cordts, Marius, Timo Rehfeld, Lukas Schneider, David Pfeiffer, Markus Enzweiler, Stefan Roth, Marc Pollefeys, and Uwe Franke (2017b). „The Stixel world: A medium-level representation of traffic scenes.“ In: IVC (cit. on pp. 109, 120, 124, 129).
Cordts, Marius, Lukas Schneider, Markus Enzweiler, Uwe Franke, and Stefan Roth (2014). „Object-level priors for Stixel generation.“ In: GCPR, pp. 172–183 (cit. on pp. 110, 126, 128, 129).
Couprie, Camille, Clément Farabet, Laurent Najman, and Yann LeCun (2013). „Indoor semantic segmentation using depth information.“ In: ICLR (cit. on p. 18).
Daniel, Malcolm (2004). Daguerre (1787–1851) and the invention of photography. url: http://www.metmuseum.org/toah/hd/dagu/hd_dagu.htm (visited on 12/20/2016) (cit. on p. 1).
Deng, Zhuo, Sinisa Todorovic, and Longin Jan Latecki (2015). „Semantic segmentation of RGBD images with mutex constraints.“ In: ICCV, pp. 1733–1741 (cit. on p. 18).
Dhiman, Vikas, Abhijit Kundu, Frank Dellaert, and Jason J Corso (2014). „Modern MAP inference methods for accurate and fast occupancy grid mapping on higher order factor graphs.“ In: ICRA, pp. 2037–2044 (cit. on p. 111).
Dollár, Piotr, Christian Wojek, Bernt Schiele, and Pietro Perona (2012). „Pedestrian detection: An evaluation of the state of the art.“ In: Trans. PAMI 34.4, pp. 743–761 (cit. on pp. 68, 71).
Eigen, David and Rob Fergus (2015). „Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture.“ In: ICCV, pp. 2650–2658 (cit. on p. 18).
Eitel, Andreas, Jost Tobias Springenberg, Luciano Spinello, Martin Riedmiller, and Wolfram Burgard (2015). „Multimodal deep learning for robust RGB-D object recognition.“ In: IROS, pp. 681–687 (cit. on p. 18).
Enzweiler, Markus and Dariu M Gavrila (2009). „Monocular pedestrian detection: Survey and experiments.“ In: Trans. PAMI 31.12, pp. 2179–2195 (cit. on pp. 41, 49, 125).
Enzweiler, Markus, Matthias Hummel, David Pfeiffer, and Uwe Franke (2012). „Efficient Stixel-based object recognition.“ In: IV Symposium, pp. 1066–1071 (cit. on pp. 18, 108).
Erbs, Friedrich, Beate Schwarz, and Uwe Franke (2012). „Stixmentation: Probabilistic Stixel based traffic scene labeling.“ In: BMVC, pp. 71.1–71.12 (cit. on p. 108).

Everingham, Mark, S M Ali Eslami, Luc Van Gool, Christopher K I Williams, John Winn, and Andrew Zisserman (2014). „The PASCAL visual object classes challenge: A retrospective.“ In: IJCV 111.1, pp. 98–136 (cit. on pp. xvi, 3, 6, 20, 39, 42, 44, 57, 59, 68, 71, 73, 101).
Farabet, Clément, Camille Couprie, Laurent Najman, and Yann LeCun (2012). „Learning hierarchical features for scene labeling.“ In: Trans. PAMI 35.8, pp. 1915–1929 (cit. on pp. 16, 21, 98).
Fawcett, Tom (2006). „An introduction to ROC analysis.“ In: Pattern Recognition Letters 27.8, pp. 867–871 (cit. on p. 43).
Felzenszwalb, Pedro F, Ross B Girshick, David McAllester, and Deva Ramanan (2010). „Object detection with discriminatively trained part based models.“ In: Trans. PAMI 32.9, pp. 1627–45 (cit. on p. 125).
Felzenszwalb, Pedro F and Daniel P Huttenlocher (2004). „Efficient graph-based image segmentation.“ In: IJCV 59.2, pp. 167–181 (cit. on pp. xv, 14, 21, 27, 29, 110).
Felzenszwalb, Pedro F and Olga Veksler (2010). „Tiered scene labeling with dynamic programming.“ In: CVPR, pp. 3097–3104 (cit. on pp. 17, 112).
Flickr (2016). Website. url: http://www.flickr.com/ (cit. on p. 20).
Floros, Georgios and Bastian Leibe (2012). „Joint 2D-3D temporally consistent semantic segmentation of street scenes.“ In: CVPR, pp. 2823–2830 (cit. on pp. 18, 21).
Franke, Uwe, David Pfeiffer, Clemens Rabe, Carsten Knöppel, Markus Enzweiler, Fridtjof Stein, and Ralf G Herrtwich (2013). „Making Bertha see.“ In: ICCV Workshops, pp. 214–221 (cit. on pp. 17, 21, 52, 57, 108, 125).
Franke, Uwe, Clemens Rabe, Hernán Badino, and Stefan K Gehrig (2005). „6D-Vision: Fusion of stereo and motion for robust environment perception.“ In: DAGM Symposium, pp. 216–223 (cit. on p. 41).
Freund, Yoav and Robert E Schapire (1996). „Experiments with a new boosting algorithm.“ In: ICML, pp. 148–156 (cit. on p. 15).
Fröhlich, Björn, Erik Rodner, and Joachim Denzler (2012). „Semantic segmentation with millions of features: Integrating multiple cues in a combined random forest approach.“ In: ACCV, pp. 218–231 (cit. on pp. 14, 24).
Fulkerson, Brian, Andrea Vedaldi, and Stefano Soatto (2009). „Class segmentation and object localization with superpixel neighborhoods.“ In: ICCV, pp. 670–677 (cit. on pp. 14, 15).
Furgale, Paul et al. (2013). „Toward automated driving in cities using close-to-market sensors: An overview of the V-Charge project.“ In: IV Symposium, pp. 809–816 (cit. on p. 57).

Gehrig, Stefan K, Felix Eberli, and Thomas Meyer (2009). „A real-time low-power stereo vision engine using semi-global matching.“ In: ICVS, pp. 134–143 (cit. on pp. 61, 124, 129).
Gehrig, Stefan K, Reto Stalder, and Nicolai Schneider (2015). „A flexible high-resolution real-time low-power stereo vision engine.“ In: ICVS, pp. 69–79 (cit. on p. 41).
Geiger, Andreas, Martin Lauer, Christian Wojek, Christoph Stiller, and Raquel Urtasun (2014). „3D traffic scene understanding from movable platforms.“ In: Trans. PAMI 36.5, pp. 1012–1025 (cit. on pp. 18, 57).
Geiger, Andreas, Philip Lenz, Christoph Stiller, and Raquel Urtasun (2013). „Vision meets robotics: The KITTI dataset.“ In: IJRR 32.11, pp. 1231–1237 (cit. on pp. xv, 4, 17, 18, 20, 21, 42, 58, 61, 66, 68, 101, 110, 131).
Geurts, Pierre, Damien Ernst, and Louis Wehenkel (2006). „Extremely randomized trees.“ In: Machine Learning 63.1, pp. 3–42 (cit. on p. 26).
Ghiasi, Golnaz and Charless Fowlkes (2016). „Laplacian pyramid reconstruction and refinement for semantic segmentation.“ In: ECCV, pp. 519–534 (cit. on pp. 16, 85, 92).
Girshick, Ross B (2015). „Fast R-CNN.“ In: ICCV, pp. 1440–1448 (cit. on pp. 57, 125).
Girshick, Ross B, Jeff Donahue, Trevor Darrell, and Jitendra Malik (2014). „Rich feature hierarchies for accurate object detection and semantic segmentation.“ In: CVPR, pp. 580–587 (cit. on pp. xvi, 111).
Gonfaus, Josep M, Xavier Boix, Joost Van de Weijer, Andrew D Bagdanov, Joan Serrat, and Jordi Gonzàlez (2010). „Harmony potentials for joint classification and segmentation.“ In: CVPR, pp. 3280–3287 (cit. on p. 15).
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press (cit. on p. 15).
Gould, Stephen (2012). „DARWIN: A framework for machine learning and computer vision research and development.“ In: JMLR 13, pp. 3533–3537 (cit. on pp. 45–50).
Güney, Fatma and Andreas Geiger (2015). „Displets: Resolving stereo ambiguities using object knowledge.“ In: CVPR, pp. 4165–4175 (cit. on pp. 68, 69).
Gupta, Abhinav, Alexei A Efros, and Martial Hebert (2010). „Blocks world revisited: Image understanding using qualitative geometry and mechanics.“ In: ECCV, pp. 482–496 (cit. on p. 17).
Gupta, Saurabh, Pablo Arbeláez, Ross B Girshick, and Jitendra Malik (2015). „Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation.“ In: IJCV 112.2, pp. 133–149 (cit. on p. 18).

Gupta, Saurabh, Ross B Girshick, Pablo Arbeláez, and Jitendra Malik (2014). „Learning rich features from RGB-D images for object detection and segmentation.“ In: ECCV, pp. 345–360 (cit. on pp. 18, 20–22, 46).
Gupta, Saurabh, Judy Hoffman, and Jitendra Malik (2016). „Cross modal distillation for supervision transfer.“ In: CVPR, pp. 2827–2836 (cit. on p. 18).
Hariharan, Bharath, Pablo Arbeláez, Ross B Girshick, and Jitendra Malik (2014). „Simultaneous detection and segmentation.“ In: ECCV, pp. 297–312 (cit. on pp. 45, 99).
Hazirbas, Caner, Lingni Ma, Csaba Domokos, and Daniel Cremers (2016). „FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture.“ In: ACCV, pp. 213–228 (cit. on p. 18).
He, Hu and Ben Upcroft (2013). „Nonparametric semantic segmentation for 3D street scenes.“ In: IROS, pp. 3697–3703 (cit. on pp. 18, 42, 68, 69, 131).
He, Kaiming, Georgia Gkioxari, Piotr Dollár, and Ross B Girshick (2017). „Mask R-CNN.“ In: arXiv:1703.06870v2 [cs.CV] (cit. on p. 105).
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2016). „Deep residual learning for image recognition.“ In: CVPR, pp. 770–778 (cit. on pp. 84, 85, 95–97).
Heilbron, Fabian Caba, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles (2015). „ActivityNet: A large-scale video benchmark for human activity understanding.“ In: CVPR, pp. 961–970 (cit. on p. 4).
Hill, Simon (2013). From J-Phone to Lumia 1020: A complete history of the camera phone. url: http://www.digitaltrends.com/mobile/camera-phone-history/ (visited on 12/20/2016) (cit. on p. 1).
Hirschmüller, Heiko (2008). „Stereo processing by semiglobal matching and mutual information.“ In: Trans. PAMI 30.2, pp. 328–341 (cit. on pp. 41, 61, 124, 129).
Hoiem, Derek, Alexei A Efros, and Martial Hebert (2007). „Recovering surface layout from an image.“ In: IJCV 75.1, pp. 151–172 (cit. on pp. 34, 111, 112).
Hou, Saihui, Zilei Wang, and Feng Wu (2016). „Deeply exploit depth information for object detection.“ In: CVPR, pp. 19–27 (cit. on p. 18).
Huang, Qixing, Mei Han, Bo Wu, and Sergey Ioffe (2011). „A hierarchical conditional random field model for labeling and segmenting images of street scenes.“ In: CVPR, pp. 1953–1960 (cit. on p. 34).
Iandola, Forrest N, Matthew W Moskewicz, Khalid Ashraf, Song Han, William J Dally, and Kurt Keutzer (2016). „SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size.“ In: arXiv:1602.07360v4 [cs.CV] (cit. on pp. 17, 85, 96).
Jarboe, Greg (2015). VidCon 2015 haul: Trends, strategic insights, critical data, and tactical advice. url: http://tubularinsights.com/vidcon-2015-strategic-insights-tactical-advice/ (visited on 12/20/2016) (cit. on p. 2).
Jia, Yangqing, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B Girshick, Sergio Guadarrama, and Trevor Darrell (2014). „Caffe: Convolutional architecture for fast feature embedding.“ In: ACMMM, pp. 675–678 (cit. on pp. 80, 96).
Kähler, Olaf and Ian Reid (2013). „Efficient 3D scene labeling using fields of trees.“ In: ICCV, pp. 3064–3071 (cit. on p. 21).
Karpathy, Andrej, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei (2014). „Large-scale video classification with convolutional neural networks.“ In: CVPR, pp. 1725–1732 (cit. on p. 4).
Khan, Salman H, Mohammed Bennamoun, Ferdous Sohel, Roberto Togneri, and Imran Naseem (2016). „Integrating geometrical context for semantic labeling of indoor scenes using RGBD images.“ In: IJCV 117.1, pp. 1–20 (cit. on p. 18).
Kohli, Pushmeet, L’ubor Ladický, and Philip H S Torr (2009). „Robust higher order potentials for enforcing label consistency.“ In: IJCV 82.3, pp. 302–324 (cit. on pp. 15, 40).
Kreso, Ivan, Denis Causevic, Josip Krapac, and Sinisa Segvic (2016). „Convolutional scale invariance for semantic segmentation.“ In: GCPR, pp. 64–75 (cit. on pp. 18, 85, 91).
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton (2012). „ImageNet classification with deep convolutional neural networks.“ In: NIPS, pp. 1097–1105 (cit. on pp. 57, 96, 97).
Krüger, Lars, Christian Wöhler, Alexander Würz-Wessel, and Fridtjof Stein (2004). „In-factory calibration of multiocular camera systems.“ In: SPIE 5457, Optical Metrology in Production Engineering, pp. 126–137 (cit. on p. 59).
Kundu, Abhijit, Yin Li, Frank Dellaert, Fuxin Li, and James M Rehg (2014). „Joint semantic segmentation and 3D reconstruction from monocular video.“ In: ECCV, pp. 703–718 (cit. on pp. 18, 68, 69, 131).
Ladický, L’ubor, Chris Russell, Pushmeet Kohli, and Philip H S Torr (2013). „Associative hierarchical random fields.“ In: Trans. PAMI 36.6, pp. 1056–1077 (cit. on pp. 14, 15, 21).
Ladický, L’ubor, Jianbo Shi, and Marc Pollefeys (2014). „Pulling things out of perspective.“ In: CVPR, pp. 89–96 (cit. on pp. 18, 42, 68, 69, 131).
Ladický, L’ubor, Paul Sturgess, Chris Russell, Sunando Sengupta, Yalin Bastanlar, William Clocksin, and Philip H S Torr (2010). „Joint optimisation for object class segmentation and dense stereo reconstruction.“ In: BMVC, pp. 104.1–104.11 (cit. on pp. 45–51).
Larish, John J (1990). Electronic photography. Computer graphics–technology and management series. TAB Professional and Reference Books (cit. on p. 1).
LeCun, Yann, Yoshua Bengio, and Geoffrey E Hinton (2015). „Deep learning.“ In: Nature 521.7553, pp. 436–444 (cit. on p. 57).
Leibe, Bastian, Nico Cornelis, Kurt Cornelis, and Luc Van Gool (2007). „Dynamic 3D scene analysis from a moving vehicle.“ In: CVPR, pp. 1–8 (cit. on pp. 18, 58, 68).
Lempitsky, Victor, Andrea Vedaldi, and Andrew Zisserman (2011). „A pylon model for semantic segmentation.“ In: NIPS, pp. 1485–1493 (cit. on pp. 15, 21).
Leung, Thomas and Jitendra Malik (2001). „Representing and recognizing the visual appearance of materials using three-dimensional textons.“ In: IJCV 43.1, pp. 29–44 (cit. on p. 24).
Levi, Dan, Noa Garnett, and Ethan Fetaya (2015). „StixelNet: A deep convolutional network for obstacle detection and road segmentation.“ In: BMVC, pp. 109.1–109.12 (cit. on p. 110).
Levinshtein, Alex, Adrian Stere, Kiriakos N Kutulakos, David J Fleet, Sven J Dickinson, and Kaleem Siddiqi (2009). „TurboPixels: Fast superpixels using geometric flows.“ In: Trans. PAMI 31.12, pp. 2290–2297 (cit. on p. 110).
Lewis, David D (1998). „Naive (Bayes) at forty: The independence assumption in information retrieval.“ In: ECML, pp. 4–15 (cit. on p. 14).
Li, Xiaofei, Fabian Flohr, Yue Yang, Hui Xiong, Markus Braun, Shuyue Pan, Keqiang Li, and Dariu M Gavrila (2016). „A new benchmark for vision-based cyclist detection.“ In: IV Symposium, pp. 1028–1033 (cit. on p. 108).
Li, Xiaoxiao, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang (2017). „Not all pixels are equal: Difficulty-aware semantic segmentation via deep layer cascade.“ In: CVPR, to appear (cit. on pp. 17, 85).
Li, Zhen, Yukang Gan, Xiaodan Liang, Yizhou Yu, Hui Cheng, and Liang Lin (2016). „LSTM-CF: Unifying context modeling and fusion with LSTMs for RGB-D scene labeling.“ In: ECCV, pp. 541–557 (cit. on p. 18).
Lim, Joseph J, Pablo Arbeláez, Chunhui Gu, and Jitendra Malik (2009). „Context by region ancestry.“ In: ICCV, pp. 1978–1985 (cit. on pp. 14, 15, 21).
Lin, Dahua, Sanja Fidler, and Raquel Urtasun (2013). „Holistic scene understanding for 3D object detection with RGBD cameras.“ In: ICCV, pp. 1417–1424 (cit. on p. 18).

Lin, Guosheng, Anton Milan, Chunhua Shen, and Ian Reid (2017). „RefineNet: Multi-path refinement networks for high-resolution semantic segmentation.“ In: CVPR, to appear (cit. on pp. 16, 85).
Lin, Guosheng, Chunhua Shen, Ian Reid, and Anton van den Hengel (2016). „Efficient piecewise training of deep structured models for semantic segmentation.“ In: CVPR, pp. 3194–3203 (cit. on pp. 16, 75, 85, 91).
Lin, Min, Qiang Chen, and Shuicheng Yan (2014). „Network in network.“ In: ICLR (cit. on p. 96).
Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross B Girshick, James Hays, Pietro Perona, Deva Ramanan, C Lawrence Zitnick, and Piotr Dollár (2014). „Microsoft COCO: Common objects in context.“ In: ECCV, pp. 740–755 (cit. on pp. xiv, 3, 4, 6, 20, 57, 59, 68, 71, 100, 101).
Liu, Ming-Yu, Shuoxin Lin, Srikumar Ramalingam, and Oncel Tuzel (2015). „Layered interpretation of street view images.“ In: RSS, pp. 1–9 (cit. on pp. 17, 45, 47, 49, 52, 110, 112).
Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg (2016). „SSD: Single Shot MultiBox Detector.“ In: ECCV, pp. 21–37 (cit. on p. 125).
Liu, Wei, Andrew Rabinovich, and Alexander C Berg (2015). „ParseNet: Looking wider to see better.“ In: arXiv:1506.04579v2 [cs.CV] (cit. on p. 16).
Liu, Ziwei, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang (2015). „Semantic image segmentation via deep parsing network.“ In: ICCV, pp. 1377–1385 (cit. on p. 16).
Long, Jonathan, Evan Shelhamer, and Trevor Darrell (2015). „Fully convolutional networks for semantic segmentation.“ In: CVPR, pp. 3431–3440 (cit. on pp. 75–77, 80, 81).
Lowe, David G (2004). „Distinctive image features from scale-invariant keypoints.“ In: IJCV 60.2, pp. 91–110 (cit. on p. 14).
Luo, Wenjie, Yujia Li, Raquel Urtasun, and Richard Zemel (2016). „Understanding the effective receptive field in deep convolutional neural networks.“ In: NIPS, pp. 4898–4906 (cit. on p. 98).
Maire, Michael, Pablo Arbeláez, Charless Fowlkes, and Jitendra Malik (2008). „Using contours to detect and localize junctions in natural images.“ In: CVPR, pp. 1–8 (cit. on p. xv).
Malik, Jitendra, Serge Belongie, Thomas Leung, and Jianbo Shi (2001). „Contour and texture analysis for image segmentation.“ In: IJCV 43.1, pp. 7–27 (cit. on p. 14).
Meeker, Mary (2016). Internet trends 2016: Code conference. url: http://www.kpcb.com/internet-trends (visited on 12/20/2016) (cit. on p. 2).
Menze, Moritz and Andreas Geiger (2015). „Object scene flow for autonomous vehicles.“ In: CVPR, pp. 3061–3070 (cit. on p. 131).

Mičušík, Branislav and Jana Košecká (2009). „Semantic segmentation of street scenes by superpixel co-occurrence and 3D geometry.“ In: ICCV Workshops, pp. 625–632 (cit. on pp. 14, 21).
Miksik, Ondrej, Daniel Munoz, J Andrew Bagnell, and Martial Hebert (2013). „Efficient temporal consistency for streaming video scene analysis.“ In: ICRA, pp. 133–139 (cit. on p. 18).
Moosmann, Frank, Eric Nowak, and Frederic Jurie (2008). „Randomized clustering forests for image classification.“ In: Trans. PAMI 30.9, pp. 1632–1646 (cit. on pp. xv, 14, 23, 25).
Mostajabi, Mohammadreza, Payman Yadollahpour, and Gregory Shakhnarovich (2015). „Feedforward semantic segmentation with zoom-out features.“ In: CVPR, pp. 3379–3385 (cit. on p. 16).
Mottaghi, Roozbeh, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan L Yuille (2014). „The role of context for object detection and semantic segmentation in the wild.“ In: CVPR, pp. 891–898 (cit. on pp. 6, 57, 68, 75).
Muffert, Maximilian, Nicolai Schneider, and Uwe Franke (2014). „StixFusion: A probabilistic Stixel integration technique.“ In: CRV, pp. 16–23 (cit. on pp. 108, 111).
Müller, Andreas C and Sven Behnke (2014). „Learning depth-sensitive conditional random fields for semantic segmentation of RGB-D images.“ In: ICRA, pp. 6232–6237 (cit. on pp. 18, 24).
Neverova, Natalia, Pauline Luc, Camille Couprie, Jakob Verbeek, and Yann LeCun (2017). „Predicting deeper into the future of semantic segmentation.“ In: arXiv:1703.07684v2 [cs.CV] (cit. on p. 139).
Nowozin, Sebastian, Peter Gehler, and Christoph H Lampert (2010). „On parameter learning in CRF-based approaches to object class image segmentation.“ In: ECCV, pp. 98–111 (cit. on pp. 14, 15, 21).
Nowozin, Sebastian and Christoph H Lampert (2011). „Structured learning and prediction in computer vision.“ In: Foundations and Trends in Computer Graphics and Vision 6.3-4, pp. 185–365 (cit. on pp. xvi, 15, 22, 38–40, 124).
Nuss, Dominik, Ting Yuan, Gunther Krehl, Manuel Stuebler, Stephan Reuter, and Klaus Dietmayer (2015). „Fusion of laser and radar sensor data with a sequential Monte Carlo Bayesian occupancy filter.“ In: IV Symposium, pp. 1074–1081 (cit. on p. 111).
OICA (2016). 2015 Production Statistics. url: http://www.oica.net/category/production-statistics/2015-statistics/ (visited on 12/21/2016) (cit. on p. 6).
Pan, Jiyan and Takeo Kanade (2013). „Coherent object detection with 3D geometric context from a single image.“ In: ICCV, pp. 2576–2583 (cit. on p. 18).
Paszke, Adam, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello (2016). „ENet: A deep neural network architecture for real-time semantic segmentation.“ In: arXiv:1606.02147v1 [cs.CV] (cit. on pp. 17, 84, 85).
Pfeiffer, David (2011). „The Stixel world: A compact medium-level representation for efficiently modeling dynamic three-dimensional environments.“ PhD thesis. Humboldt-Universität zu Berlin (cit. on pp. 109, 110, 119, 122, 123).
Pfeiffer, David and Uwe Franke (2011a). „Modeling dynamic 3D environments by means of the Stixel world.“ In: ITS Magazine 3.3, pp. 24–36 (cit. on p. 108).
—(2011b). „Towards a global optimal multi-layer Stixel representation of dense 3D data.“ In: BMVC, pp. 51.1–51.12 (cit. on pp. iv, vi, 10, 11, 27, 29, 108, 110, 112, 113).
Pfeiffer, David, Stefan K Gehrig, and Nicolai Schneider (2013). „Exploiting the power of stereo confidences.“ In: CVPR, pp. 297–304 (cit. on pp. 59, 120, 124).
Pinheiro, Pedro O and Ronan Collobert (2014). „Recurrent convolutional neural networks for scene labeling.“ In: ICML, pp. 82–90 (cit. on p. 16).
Plath, Nils, Marc Toussaint, and Shinichi Nakajima (2009). „Multi-class image segmentation using conditional random fields and global classification.“ In: ICML, pp. 817–824 (cit. on pp. 14, 15, 21).
Platt, John C (1999). „Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.“ In: Advances in Large Margin Classifiers. Ed. by A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans. MIT Press (cit. on p. 25).
Pohlen, Tobias, Alexander Hermans, Markus Mathias, and Bastian Leibe (2017). „Full-resolution residual networks for semantic segmentation in street scenes.“ In: CVPR, to appear (cit. on pp. 16, 84, 85).
Razavian, Ali Sharif, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson (2014). „CNN features off-the-shelf: An astounding baseline for recognition.“ In: CVPR, pp. 806–813 (cit. on p. 16).
Redmon, Joseph, Santosh Divvala, Ross B Girshick, and Ali Farhadi (2016). „You only look once: Unified, real-time object detection.“ In: CVPR, pp. 779–788 (cit. on p. 125).
Ren, Shaoqing, Kaiming He, Ross B Girshick, and Jian Sun (2015). „Faster R-CNN: Towards real-time object detection with region proposal networks.“ In: NIPS, pp. 91–99 (cit. on p. 49).
Ren, Xiaofeng, Liefeng Bo, and Dieter Fox (2012). „RGB-(D) scene labeling: Features and algorithms.“ In: CVPR, pp. 2759–2766 (cit. on pp. 18, 21).
Reynolds, Jordan and Kevin P Murphy (2007). „Figure-ground segmentation using a hierarchical conditional random field.“ In: CRV, pp. 175–182 (cit. on pp. 14, 15, 21).

Richter, Stephan R, Vibhav Vineet, Stefan Roth, and Vladlen Koltun (2016). „Playing for data: Ground truth from computer games.“ In: ECCV, pp. 102–118 (cit. on p. 72).
Rieken, Jens, Richard Matthaei, and Markus Maurer (2015). „Benefits of using explicit ground-plane information for grid-based urban environment modeling.“ In: Fusion, pp. 2049–2056 (cit. on p. 110).
Riemenschneider, Hayko, András Bódis-Szomorú, Julien Weissenberg, and Luc Van Gool (2014). „Learning where to classify in multi-view semantic segmentation.“ In: ECCV, pp. 516–532 (cit. on pp. 18, 68).
Ros, German, Sebastian Ramos, Manuel Granados, David Vazquez, and Antonio M Lopez (2015). „Vision-based offline-online perception paradigm for autonomous driving.“ In: WACV, pp. 231–238 (cit. on pp. 42, 57, 68, 69, 98, 99, 131).
Ros, German, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez (2016). „The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes.“ In: CVPR, pp. 3234–3243 (cit. on p. 72).
Russakovsky, Olga et al. (2015). „ImageNet large scale visual recognition challenge.“ In: IJCV 115.3, pp. 211–252 (cit. on pp. xv, 3, 4, 20, 57, 59, 68, 75, 84, 85, 94, 101).
Russell, Bryan C, Antonio Torralba, Kevin P Murphy, and William T Freeman (2008). „LabelMe: A database and web-based tool for image annotation.“ In: IJCV 77.1-3, pp. 157–173 (cit. on p. 63).
SAE International (2014). Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems (J3016). Tech. rep. On-Road Automated Driving (ORAD) Committee (cit. on p. 7).
Said, Carolyn (1990). „DYCAM Model 1: The first portable digital still camera.“ In: MacWeek 4.35, p. 34 (cit. on p. 1).
Sanberg, Willem P, Gijs Dubbelman, and Peter H N de With (2014). „Extending the Stixel world with online self-supervised color modeling for road-versus-obstacle segmentation.“ In: ITSC, pp. 1400–1407 (cit. on p. 110).
Scharwächter, Timo, Markus Enzweiler, Uwe Franke, and Stefan Roth (2013). „Efficient multi-cue scene segmentation.“ In: GCPR, pp. 435–445 (cit. on pp. xiv, 14, 17, 20, 25, 42, 52, 58, 66, 68).
—(2014a). „Stixmantics: A medium-level model for real-time semantic scene understanding.“ In: ECCV, pp. 533–548 (cit. on pp. 20–22, 32, 40, 45–47, 49, 52, 53, 57).
Scharwächter, Timo and Uwe Franke (2015). „Low-level fusion of color, texture and depth for robust road scene understanding.“ In: IV Symposium, pp. 599–604 (cit. on pp. 24, 25, 29, 109).
Scharwächter, Timo, Manuela Schuler, and Uwe Franke (2014b). „Visual guard rail detection for advanced highway assistance systems.“ In: IV Symposium, pp. 900–905 (cit. on pp. 18, 125).

Schneider, Johannes, Cyrill Stachniss, and Wolfgang Förstner (2016). „On the accuracy of dense fisheye stereo.“ In: IEEE Robotics and Automation Letters 1.1, pp. 227–234 (cit. on p. 32).
Schneider, Lukas, Marius Cordts, Timo Rehfeld, David Pfeiffer, Markus Enzweiler, Uwe Franke, Marc Pollefeys, and Stefan Roth (2016). „Semantic Stixels: Depth is not enough.“ In: IV Symposium, pp. 110–117 (cit. on pp. 110, 123, 129).
Schwing, Alexander G and Raquel Urtasun (2015). „Fully connected deep structured networks.“ In: arXiv:1503.02351v1 [cs.CV] (cit. on pp. 16, 75).
Sengupta, Sunando, Eric Greveson, Ali Shahrokni, and Philip H S Torr (2013). „Urban 3D semantic modelling using stereo vision.“ In: ICRA, pp. 580–585 (cit. on pp. 18, 20, 42, 68, 69, 98, 99, 131).
Sengupta, Sunando, Paul Sturgess, L’ubor Ladický, and Philip H S Torr (2012). „Automatic dense visual semantic mapping from street-level imagery.“ In: IROS, pp. 857–862 (cit. on p. 68).
Shafaei, Alireza, James J Little, and Mark Schmidt (2016). „Play and learn: Using video games to train computer vision models.“ In: BMVC (cit. on p. 72).
Sharma, Abhishek, Oncel Tuzel, and David W Jacobs (2015). „Deep hierarchical parsing for semantic segmentation.“ In: CVPR, pp. 530–538 (cit. on pp. 16, 45–47, 49).
Shelhamer, Evan, Jonathan Long, and Trevor Darrell (2017). „Fully convolutional networks for semantic segmentation.“ In: Trans. PAMI 39.4, pp. 640–651 (cit. on pp. xv, 15, 16, 57, 66, 74, 75, 80, 95, 98, 129).
Shotton, Jamie, Matthew Johnson, and Roberto Cipolla (2008). „Semantic texton forests for image categorization and segmentation.“ In: CVPR, pp. 1–8 (cit. on pp. 23–25, 45–50, 52, 53).
Sigurdsson, Gunnar A, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta (2016). „Hollywood in homes: Crowdsourcing data collection for activity understanding.“ In: ECCV, pp. 510–526 (cit. on p. 4).
Silberman, Nathan, Derek Hoiem, Pushmeet Kohli, and Rob Fergus (2012). „Indoor segmentation and support inference from RGBD images.“ In: ECCV, pp. 746–760 (cit. on pp. xvi, 18, 20–22, 34, 46).
Simonyan, Karen and Andrew Zisserman (2015). „Very deep convolutional networks for large-scale image recognition.“ In: ICLR (cit. on pp. xvii, 16, 75, 84, 85, 95, 96).
Song, Shuran, Samuel P Lichtenberg, and Jianxiong Xiao (2015). „SUN RGB-D: A RGB-D scene understanding benchmark suite.“ In: CVPR, pp. 567–576 (cit. on pp. 6, 18, 68).
Sturgess, Paul, Karteek Alahari, L’ubor Ladický, and Philip H S Torr (2009). „Combining appearance and structure from motion features for road scene understanding.“ In: BMVC, pp. 62.1–62.11 (cit. on p. 18).

Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke (2016a). „Inception-v4, Inception-ResNet and the impact of residual connections on learning.“ In: ICLR Workshops (cit. on p. 85).
Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2015). „Going deeper with convolutions.“ In: CVPR, pp. 1–9 (cit. on pp. 85, 95–97, 129).
Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna (2016b). „Rethinking the inception architecture for computer vision.“ In: CVPR, pp. 2818–2826 (cit. on p. 96).
Thrun, Sebastian (2002). „Robotic mapping: A survey.“ In: Exploring artificial intelligence in the new millennium. Ed. by G Lakemeyer and B Nebel. Morgan Kaufmann Publishers Inc., pp. 1–35 (cit. on p. 111).
Tighe, Joseph and Svetlana Lazebnik (2013). „Superparsing: Scalable nonparametric image parsing with superpixels.“ In: IJCV 101.2, pp. 329–349 (cit. on pp. 14, 15).
Tomasi, Carlo and Takeo Kanade (1991). Detection and tracking of point features. Tech. rep. CMU-CS-91-132. Carnegie Mellon University (cit. on pp. xv, 41).
Treml, Michael et al. (2016). „Speeding up semantic segmentation for autonomous driving.“ In: NIPS Workshop: MLITS (cit. on pp. 17, 85).
Uijlings, Jasper R R, Koen E A Van de Sande, Theo Gevers, and Arnold W M Smeulders (2013). „Selective search for object recognition.“ In: IJCV 104.2, pp. 154–171 (cit. on p. 111).
Valentin, Julien P C, Sunando Sengupta, Jonathan Warrell, Ali Shahrokni, and Philip H S Torr (2013). „Mesh based semantic modelling for indoor and outdoor scenes.“ In: CVPR, pp. 2067–2074 (cit. on p. 18).
Vedaldi, Andrea and Stefano Soatto (2008). „Quick shift and kernel methods for mode seeking.“ In: ECCV, pp. 705–718 (cit. on p. 14).
Veksler, Olga, Yuri Boykov, and Paria Mehrani (2010). „Superpixels and supervoxels in an energy optimization framework.“ In: ECCV, pp. 211–224 (cit. on pp. xvi, 27, 29).
Vineet, Vibhav et al. (2015). „Incremental dense semantic stereo fusion for large-scale semantic scene reconstruction.“ In: ICRA, pp. 75–82 (cit. on pp. 98, 99).
Viola, Paul and Michael J Jones (2004). „Robust real-time face detection.“ In: IJCV 57.2, pp. 137–154 (cit. on pp. 41, 125).
Wang, Panqu, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell (2017). „Understanding convolution for semantic segmentation.“ In: arXiv:1702.08502v1 [cs.CV] (cit. on pp. 16, 85, 92).
WHO (2015). Global status report on road safety. Tech. rep. World Health Organization (cit. on p. 6).

Wojek, Christian, Stefan Walk, Stefan Roth, Konrad Schindler, and Bernt Schiele (2013). „Monocular visual scene understanding: Understanding multi-object traffic scenes.“ In: Trans. PAMI 35.4, pp. 882–897 (cit. on p. 18).
Wu, Jianxin (2010). „A fast dual method for HIK SVM learning.“ In: ECCV, pp. 552–565 (cit. on pp. xv, 25, 33).
Wu, Zifeng, Chunhua Shen, and Anton van den Hengel (2016). „Wider or deeper: Revisiting the ResNet model for visual recognition.“ In: arXiv:1611.10080v1 [cs.CV] (cit. on pp. 16, 84, 85, 92, 94, 105).
Xiao, Jianxiong, Krista A Ehinger, James Hays, Antonio Torralba, and Aude Oliva (2016). „SUN database: Exploring a large collection of scene categories.“ In: IJCV 119.1, pp. 3–22 (cit. on pp. xvi, 3).
Xie, Jun, Martin Kiefel, Ming-Ting Sun, and Andreas Geiger (2016). „Semantic instance annotation of urban scenes by 3D to 2D label transfer.“ In: CVPR, pp. 3688–3697 (cit. on pp. 61, 68, 71).
Xu, Chenliang, Spencer Whitt, and Jason J Corso (2013). „Flattening supervoxel hierarchies by the uniform entropy slice.“ In: ICCV, pp. 2240–2247 (cit. on pp. 42, 68, 69).
Xu, Philippe, Franck Davoine, Jean-Baptiste Bordes, Huijing Zhao, and Thierry Denoeux (2013). „Information fusion on oversegmented images: An application for urban scene understanding.“ In: MVA, pp. 189–193 (cit. on p. 131).
Yang, Yi, Sam Hallman, Deva Ramanan, and Charless Fowlkes (2010). „Layered object detection for multi-class segmentation.“ In: CVPR, pp. 3113–3120 (cit. on p. 14).
Yao, Jian, Sanja Fidler, and Raquel Urtasun (2012). „Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation.“ In: CVPR, pp. 702–709 (cit. on pp. 14, 15, 20, 21).
Yu, Fisher and Vladlen Koltun (2016). „Multi-scale context aggregation by dilated convolutions.“ In: ICLR (cit. on pp. 16, 85, 98, 105).
Zhang, Chenxi, Liang Wang, and Ruigang Yang (2010). „Semantic segmentation of urban scenes using dense depth maps.“ In: ECCV, pp. 708–721 (cit. on p. 18).
Zhang, Jian, Chen Kan, Alexander G Schwing, and Raquel Urtasun (2013). „Estimating the 3D layout of indoor scenes and its clutter from depth sensors.“ In: ICCV, pp. 1273–1280 (cit. on p. 18).
Zhang, Richard, Stefan A Candra, Kai Vetter, and Avideh Zakhor (2015). „Sensor fusion for semantic segmentation of urban scenes.“ In: ICRA, pp. 1850–1857 (cit. on pp. 18, 68, 69, 131).
Zhang, Shanshan, Rodrigo Benenson, and Bernt Schiele (2017). „CityPersons: A diverse dataset for pedestrian detection.“ In: arXiv:1702.05693v1 [cs.CV] (cit. on p. 140).

Zhao, Hengshuang, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia (2017). „Pyramid scene parsing network.“ In: CVPR, to appear (cit. on pp. 16, 85, 95, 97, 98).
Zheng, Shuai, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H S Torr (2015). „Conditional random fields as recurrent neural networks.“ In: ICCV, pp. 1529–1537 (cit. on pp. 16, 75).
Zheng, Ying, Steve Gu, and Carlo Tomasi (2012). „Fast tiered labeling with topological priors.“ In: ECCV, pp. 587–601 (cit. on p. 17).
Zhou, Bolei, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva (2016). „Places: An image database for deep scene understanding.“ In: arXiv:1610.02055v1 [cs.CV] (cit. on pp. 3, 85).

VITA

marius cordts
Date of birth: June 7th, 1988
Place of birth: Aachen, Germany

education

2013-2017  Technische Universität Darmstadt, Germany
           Ph.D. in Computer Science
2010-2013  RWTH Aachen University, Germany
           M.Sc. in Electrical Engineering, Information Technology and Computer Engineering
2007-2010  RWTH Aachen University, Germany
           B.Sc. in Electrical Engineering, Information Technology and Computer Engineering
1998-2007  Bischöfliches Gymnasium St. Ursula, Geilenkirchen, Germany
           Allgemeine Hochschulreife

professional experience

Since 2015  Daimler AG, Böblingen, Germany
            Software Engineer: Environment Perception
2013-2015   Daimler AG, Böblingen, Germany
            Ph.D. Student: Environment Perception
2012        Siemens Corporate Research, Princeton, NJ, USA
            Intern: Imaging and Computer Vision
2011-2012   Rohde & Schwarz GmbH & Co. KG, Munich, Germany
            Intern: R&D RF Design Mobile Radio Testers

honors

2014       Friedrich-Wilhelm-Prize
2014       SEW Eurodrive Foundation Graduate Award
2011-2013  Siemens Masters Program
2009-2013  Studienstiftung des Deutschen Volkes
2011       Henry-Ford II Studies Prize
2009-2011  NRW Scholarship Program
2010       Rohde & Schwarz Best Bachelor Prize
2009       Philips Bachelor Prize


PUBLICATIONS

The work conducted in the context of this dissertation has led to the following publications:

Marius Cordts, Lukas Schneider, Markus Enzweiler, Uwe Franke, Stefan Roth. “Object-level Priors for Stixel Generation.” In: German Conference on Pattern Recognition (GCPR) [oral], 2014.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Scharwächter, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele. “The Cityscapes Dataset.” In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR): Workshop on The Future of Datasets in Vision, 2015.

Lukas Schneider*, Marius Cordts*, Timo Rehfeld, David Pfeiffer, Markus Enzweiler, Uwe Franke, Marc Pollefeys, Stefan Roth. “Semantic Stixels: Depth is Not Enough.” In: IEEE Intelligent Vehicles Symposium (IV) [oral, best paper award], 2016.

Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Scharwächter, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele. “The Cityscapes Dataset for Semantic Urban Scene Understanding.” In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) [spotlight], 2016.

Jonas Uhrig, Marius Cordts, Uwe Franke, Thomas Brox. “Pixel-level Encoding and Depth Layering for Instance-level Semantic Labeling.” In: German Conference on Pattern Recognition (GCPR) [oral, honorable mention award], 2016.

Marius Cordts*, Timo Rehfeld*, Markus Enzweiler, Uwe Franke, Stefan Roth. “Tree-Structured Models for Efficient Multi-Cue Scene Labeling.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39.7, 2017.

Marius Cordts*, Timo Rehfeld*, Lukas Schneider*, Markus Enzweiler, Stefan Roth, Marc Pollefeys, Uwe Franke. “The Stixel World: A Medium-Level Representation of Traffic Scenes.” In: Image and Vision Computing (IVC), 2017.

* Authors contributed equally


MEDIA COVERAGE

The work conducted in the context of this dissertation appeared in the media as follows:

Press release by Koert Groeneveld and Katharina Becker, October 7th, 2015, on the UR:BAN closing presentation, Düsseldorf, Germany; poster stand and vehicle demonstration by Marius Cordts http://media.daimler.com/marsMediaSite/ko/en/9919727

Nvidia Keynote CES 2016, Las Vegas, NV, USA, January 4th, 2016. Talk by Jen-Hsun Huang and Michael Houston referring to the Cityscapes dataset https://youtu.be/HJ58dbd5g8g

Mercedes-Benz Detroit NAIAS New Year’s Reception 2016, Detroit, MI, USA, January 10th, 2016. Talk by Dr. Dieter Zetsche showing pixel-level semantic labeling results obtained by Marius Cordts http://media.daimler.com/marsMediaSite/ko/en/9920198 https://youtu.be/2yE_35yjMyw

Article by Dirk Gulde, December 28th, 2016 showing pixel-level semantic labeling results obtained by Marius Cordts http://www.auto-motor-und-sport.de/news/autonomes-fahren-kuenstliche-intelligenz-11624000.html

Article by Sven Hansen, January 5th, 2017 on a talk by Marius Cordts on artificial intelligence at CES 2017, Las Vegas, NV, USA https://heise.de/-3588654

Article by Roberto Baldwin, February 1st, 2017 referring to the Semantic Stixel World https://www.engadget.com/2017/01/02/inside-mercedes-silicon-valley-research-center

Article by Christoph Stockburger, February 4th, 2017 citing from an interview with Marius Cordts http://www.spiegel.de/auto/aktuell/a-1132759.html
