Mobile Augmented Reality for Semantic 3D Models -A Smartphone-based Approach with CityGML-

Christoph Henning Blut

Veröffentlichung des Geodätischen Instituts der Rheinisch-Westfälischen Technischen Hochschule Aachen Mies-van-der-Rohe-Straße 1, 52074 Aachen

NR. 70

2019 ISSN 0515-0574

Mobile Augmented Reality for Semantic 3D Models -A Smartphone-based Approach with CityGML-

Von der Fakultät für Bauingenieurwesen der Rheinisch-Westfälischen Technischen Hochschule Aachen zur Erlangung des akademischen Grades eines Doktors der Ingenieurwissenschaften genehmigte Dissertation

vorgelegt von

Christoph Henning Blut

Berichter: Univ.-Prof. Dr.-Ing. Jörg Blankenbach Univ.-Prof. Dr.-Ing. habil. Christoph van Treeck

Tag der mündlichen Prüfung: 24.05.2019

Diese Dissertation ist auf den Internetseiten der Universitätsbibliothek online verfügbar.

Veröffentlichung des Geodätischen Instituts der Rheinisch-Westfälischen Technischen Hochschule Aachen Mies-van-der-Rohe-Straße 1, 52074 Aachen

Nr. 70

2019 ISSN 0515-0574

Acknowledgments

This thesis was written during my employment as a research associate at the Geodetic Institute and Chair for Computing in Civil Engineering & Geoinformation Systems of RWTH Aachen University. First and foremost, I would like to express my sincere gratitude towards my supervisor Univ.-Prof. Dr.-Ing. Jörg Blankenbach for his excellent support, the scientific freedom he gave me and the inspirational suggestions that helped me succeed in my work. I would also like to thank Univ.-Prof. Dr.-Ing. habil. Christoph van Treeck for his interest in my work and his willingness to take over the second appraisal. Many thanks go to my fellow colleagues for their valuable ideas regarding my research and the fun after-work activities that will be remembered. Last but not least, I am grateful to my family for the support and motivation they gave me. A very special thank you goes to my brother Timothy Blut for the inspiring, extensive discussions about my work throughout this journey, sometimes until late into the night, that were very helpful in finding great new ideas.

Aachen, June 2019 Christoph Henning Blut

Abstract

The increasing popularity of smartphones over the past 10 years has drastically propelled mobile technology forward, enabling innovative applications and experiences, for example in the form of mobile virtual reality (VR) and mobile augmented reality (AR). While in earlier days mobile AR systems were constructed from multiple large and costly external components carried in bulky and heavy backpacks, today low-cost off-the-shelf mobile devices, such as smartphones, are sufficient, since they provide all the necessary technology right out of the box. However, the realization of highly accurate and performant systems on such devices poses a challenge, since the inexpensive parts (e.g. sensors) are often prone to inaccuracies. Many AR systems are developed for entertainment purposes, but mobile AR potentially also has further beneficial applications in more serious fields, such as archaeology, education, medicine and the military. For civil engineering and city planning, mobile AR is also promising, as it could be used to enhance some typical workflows and planning processes. A real-life example application is the visualization of planned building parts to simplify planning processes and to optimize the communication between the participating decision makers. In this thesis, a concept for a mobile AR system aimed at the mentioned scenarios is presented, implemented and evaluated. For this, on the one hand a suitable mobile AR system and on the other hand appropriate data are necessary. A problem is that much digital 3D building data typically lacks the required spatial referencing and important additional information, like semantics or topology. Some exceptions can be found in the construction sector and in the geographic information domain with the IFC and CityGML formats. While the focus of IFC primarily lies on individual, highly detailed building models, CityGML emphasizes more general, less detailed models in a broader context, thus enabling city- and room-scale visualizations. A proof-of-concept system was realized on an Android-based smartphone using CityGML models. It is fully self-sufficient and operates without external infrastructures. To process the CityGML data, a mobile data processing unit consisting of a SpatiaLite database, a data importer and a data selection method was implemented. The importer is based on an XML Pull parser which reads CityGML 1.0 and CityGML 2.0 data and writes it into the SpatiaLite-based CityGML database that is modelled according to the CityGML schema. The selection algorithm enables efficiently filtering the data that is relevant to the user at the current location from the entirety of data in the database. To visualize the data and make the information of each object accessible, a customized rendering solution was implemented that aims at preserving the object information while maximizing the rendering performance. For preparing the geometry data for rendering, a customized polygon triangulation algorithm was implemented, based on the ear-clipping method. To superimpose the physical objects with these virtual elements, a fine-grained (indoor) pose tracking system was implemented, using a combination of image- and inertial measurement unit (IMU)-based methods. The IMU is utilized to determine initial coarse pose estimates which are then optimized by the CityGML model-based optical pose estimation methods.
For this, a 2D image-based door detector and a 3D corner extraction method that return accurate corners of the door were implemented. These corners are then used for the pose estimations. Lastly, the mobile CityGML AR system was evaluated in terms of data processing/visualization performance and accuracy/stability of the pose tracking solution. The results show that off-the-shelf low-cost mobile devices, such as smartphones, are sufficient to realize a fully-fledged self-sufficient location-based mobile AR system that qualifies for numerous AR scenarios, such as the one described earlier.

Zusammenfassung

Die zunehmende Popularität von Smartphones über die vergangenen 10 Jahre hat die mobile Technologie entscheidend vorangetrieben und ermöglicht innovative Anwendungen und Erfahrungen, wie zum Beispiel in Form von mobiler Virtual Reality (VR) und mobiler Augmented Reality (AR). Wurden zuvor für mobile AR-Systeme noch eine Vielzahl von großen und teuren externen Komponenten benötigt, die in sperrigen und schweren Rucksäcken transportiert wurden, reichen heute preiswerte, handelsübliche mobile Geräte, wie Smartphones, aus, da diese bereits alle erforderlichen Technologien beinhalten. Die Realisierung hochgenauer und performanter Systeme auf Basis solcher Geräte stellt jedoch eine Herausforderung dar, da die kostengünstigen Komponenten (z.B. Sensoren) oft zu Ungenauigkeiten neigen. Vorrangig werden mobile AR-Systeme für den Entertainmentbereich entwickelt, mobile AR hat jedoch auch ein vielversprechendes Potential in anderen Bereichen, wie beispielsweise der Archäologie, Bildung, Medizin oder dem Militär. Auch im Bauwesen und in der Stadtplanung ist mobile AR äußerst vielversprechend, da es zur Optimierung einiger typischer Arbeitsabläufe und Planungsprozesse verwendet werden könnte. Ein Beispiel für eine reale Anwendung ist die Visualisierung von geplanten Bauwerksteilen, um Planungsprozesse zu vereinfachen und die Kommunikation zwischen den beteiligten Entscheidungsträgern zu optimieren. In dieser Arbeit wird ein Konzept für ein AR-System vorgestellt, implementiert und evaluiert, das auf die genannten Szenarien abzielt.

Dazu sind einerseits ein geeignetes mobiles AR-System und andererseits entsprechende Daten notwendig. Problematisch sind die benötigten, jedoch häufig fehlenden räumlichen Bezüge der digitalen 3D-Gebäudedaten und fehlende wesentliche attributive Daten, wie Semantik oder Topologie. Einige Ausnahmen finden sich im Bau- und Geoinformationssektor mit dem IFC- und CityGML-Format. Während der Fokus von IFC in erster Linie auf einzelnen, hochdetaillierten Gebäudemodellen liegt, legt CityGML den Schwerpunkt auf allgemeinere, weniger detaillierte Modelle in einem breiteren Kontext und ermöglicht so Visualisierungen im Stadt- und Raummaßstab. Ein Demonstrator wurde auf einem Android-basierten Smartphone und mit entsprechenden CityGML-Modellen realisiert. Dieser ist vollständig autark und funktioniert ohne externe Infrastrukturen. Zur Verarbeitung der CityGML-Daten wurde eine mobile Datenverarbeitungskomponente implementiert, die aus einer SpatiaLite-Datenbank, einem Datenimporter und einer Datenselektionsmethode besteht. Der Importer basiert auf einem XML Pull-Parser, der CityGML 1.0- und CityGML 2.0-Daten liest und in die SpatiaLite-basierte CityGML-Datenbank schreibt, die nach dem CityGML-Schema modelliert ist. Der Selektionsalgorithmus ermöglicht ein effizientes Filtern der Daten abhängig von der aktuellen Position des Nutzers, sodass nur relevante Daten aus der Datenbank exportiert werden. Für die Visualisierung der Daten und Bereitstellung der Objektinformationen wurde eine spezialisierte Rendering-Lösung implementiert, die es ermöglicht die Objektinformationen zu erhalten, aber gleichzeitig die Rendering-Leistung zu maximieren. Zur Vorbereitung der Geometriedaten für das Rendering wurde ein angepasster Polygontriangulationsalgorithmus, basierend auf der Ear-Clipping-Methode, implementiert. Um die physischen Objekte mit diesen virtuellen Elementen zu überlagern, wurde ein Lagebestimmungssystem, unter Verwendung einer Kombination von bildbasierten und inertialen Messverfahren, implementiert. Die inertiale Messeinheit (IMU) wird verwendet, um erste grobe Posen zu ermitteln, die nachfolgend durch die CityGML-Modell-basierten optischen Verfahren optimiert werden. Dafür wurden ein 2D-bildbasierter Türdetektor und ein 3D-Eckenextraktionsverfahren implementiert, um die Ecken der entsprechenden Tür präzise zurückzugeben und für die Lageschätzung zu verwenden. Schließlich wurde das mobile CityGML-AR-System in Bezug auf die Datenverarbeitungs- und Datenvisualisierungsleistung und die Genauigkeit/Stabilität des Lagebestimmungssystems evaluiert. Die Ergebnisse zeigen, dass kostengünstige Standard-Mobilgeräte wie Smartphones ausreichen, um ein vollwertiges, autarkes, standortbasiertes mobiles AR-System zu realisieren, das für zahlreiche AR-Szenarien, wie das zuvor beschriebene, geeignet ist.

Table of Contents

Acknowledgments ...... I
Abstract ...... II
Zusammenfassung ...... V
Table of Contents ...... VIII
List of Figures ...... XI
List of Tables ...... XXI
List of Abbreviations ...... XXIII
1 Introduction ...... 1
1.1 Thesis Goal ...... 5
1.2 Related Work ...... 7
2 Background ...... 17
2.1 Fundamentals ...... 17
2.1.1 Transforms ...... 17
2.1.2 Camera Models ...... 26
2.2 3D Real-Time Rendering ...... 33
2.2.1 Graphics Rendering Pipeline ...... 33
2.2.2 Tessellation ...... 39
2.2.3 Graphics Libraries ...... 46
2.3 Data for Augmented Reality ...... 50
2.3.1 Geospatial Data ...... 51
2.3.2 Building Information Modeling ...... 56
2.3.3 Data Parsers ...... 59
2.3.4 Databases ...... 61
2.4 Pose Tracking ...... 67
2.4.1 Coordinate Systems for Positioning ...... 68
2.4.2 Coordinate Systems for Orientation ...... 72
2.4.3 Pose Tracking Methods ...... 74
2.4.4 Sensor Fusion ...... 83
2.4.5 Optical Pose Tracking Methods ...... 87
3 Requirement Analysis ...... 121
3.1 Mobile Device ...... 122
3.2 CityGML vs. IFC ...... 125
3.3 Data Processing Options ...... 131
3.3.1 Memory-based Option ...... 133
3.3.2 Web-based Option ...... 136
3.3.3 Local Database-based Option ...... 137
3.4 Polygon Triangulation ...... 140
3.5 Data Rendering ...... 142
3.6 Pose Tracking Methods ...... 144
3.6.1 Sensor-based Tracking ...... 144
3.6.2 Infrastructure-based Tracking ...... 151
3.6.3 Optical Tracking ...... 152
4 Implementation ...... 169
4.1 General Solution Architecture ...... 169
4.2 CityGML Viewer ...... 171
4.2.1 Data Processor ...... 172
4.2.2 Visualizing CityGML ...... 180
4.3 Pose Tracking ...... 193
4.3.1 Sensor-based Pose Tracker ...... 194
4.3.2 Optical Pose Tracker ...... 195
4.3.3 Fusion ...... 199
4.4 Android App ...... 201
5 Evaluation of AR System ...... 204
5.1 System Calibration ...... 204
5.2 Data Performance ...... 207
5.2.1 Data Processing ...... 209
5.2.2 Data Visualization ...... 215
5.3 Door Detection ...... 217
5.3.1 Image Dataset ...... 219
5.3.2 Influence of Image Resolution ...... 220
5.3.3 Influence of Environmental Conditions ...... 224
5.4 Pose Tracking ...... 228
5.4.1 Optical Pose ...... 229
5.4.2 Sensor Pose Stability ...... 230
5.5 General AR System Evaluation ...... 236
6 Conclusions ...... 238
6.1 Data Processing ...... 240
6.2 Rendering ...... 240
6.3 Door Detection ...... 241
6.4 Pose Tracking ...... 242
6.5 BIM as Extension to AR System ...... 243
Bibliography ...... 246


List of Figures

Figure 1.1: Trend of AR in the last 14 years, based on search statistics from Google [5]. The values indicate the search interest relative to the highest interest in the specified time period. ...... 2
Figure 1.2: Concept image of outdoor AR system. ...... 6
Figure 1.3: Concept image of indoor AR system. ...... 6
Figure 1.4: Reality-virtuality continuum according to [12]. ...... 8
Figure 2.1: An example for a translation of a geometrical object. ...... 19
Figure 2.2: An example for a rotation of a geometrical object. ...... 20
Figure 2.3: The three DoF (yaw, pitch and roll). ...... 21
Figure 2.4: An example for gimbal lock when using Euler angles. ...... 22
Figure 2.5: An example for scaling a geometrical object. ...... 24
Figure 2.6: Orthographic projection according to [40]. ...... 25
Figure 2.7: View frustum for a perspective projection according to [40]. ...... 25
Figure 2.8: The pinhole camera model according to [41]. ...... 26
Figure 2.9: The relationship between world coordinates and pixel coordinates according to [43]. ...... 28
Figure 2.10: (a) Undistorted image (left); (b) Image with pincushion distortion (middle); (c) Image with barrel distortion (right). ...... 29
Figure 2.11: Camera calibration chessboard photographed from different angles. ...... 32
Figure 2.12: Encoded field of control points for camera calibration. ...... 32
Figure 2.13: General conceptual parts of the graphics rendering pipeline according to [37]. ...... 33
Figure 2.14: Steps of the geometry stage. ...... 34
Figure 2.15: The axes of the OpenGL coordinate system. ...... 35
Figure 2.16: A screen with a coordinate system in which the x-axis points right and the y-axis points downwards. ...... 38
Figure 2.17: Shapes of a polygon. (a) Convex polygon (left); (b) Concave polygon (middle); (c) Convex polygon with holes (right). ...... 40
Figure 2.18: Polygon triangulated by connecting a vertex with all other vertices of the polygon, except its direct neighbours. ...... 41
Figure 2.19: Progress of the ear-clipping algorithm. (a) Non-triangulated polygon (top-left); (b) polygon with clipped ear (top-right); (c) polygon with two clipped ears (bottom-left); (d) three clipped ears (bottom-right). ...... 42
Figure 2.20: Two interior polygons (holes) linked to the outer polygon by edges. ...... 43
Figure 2.21: Delaunay triangulation with empty circumcircles. ...... 44
Figure 2.22: The triangulation does not meet the Delaunay condition, since the sum of the angles is larger than 180° and the circumcircle contains the vertex. ...... 44
Figure 2.23: Delaunay triangulation with newly inserted point s. ...... 45
Figure 2.24: The gap of a concave polygon should not be closed. ...... 46
Figure 2.25: Example structure of a scene graph. ...... 49
Figure 2.26: Two boundary surfaces of a wall modelled using B-Rep. ...... 53
Figure 2.27: City model of Berlin. Courtesy of Berlin Partner für Wirtschaft und Technologie GmbH [56]. ...... 54
Figure 2.28: An example of a DTM based on data from Aachen [58]. ...... 55
Figure 2.29: An example of a DSM based on data from Aachen [58]. ...... 56
Figure 2.30: A wall modelled using CSG. ...... 58
Figure 2.31: Boolean operations used to construct new objects. The top-right geometry is subtracted from the top-left geometry, resulting in the bottom-right geometry. ...... 58
Figure 2.32: MBRs R1 to R6 are arranged for the R-tree structure. ...... 65
Figure 2.33: Tree data structure of an octree. Each node has eight children. A minimum bounding cuboid is subdivided into octants. ...... 67
Figure 2.34: Geocentric Reference System. ...... 69
Figure 2.35: Geographic Reference System. ...... 70
Figure 2.36: Earth's surface represented by an ellipsoid and geoid. Ellipsoidal height and geoid model can be used to calculate the orthometric height. ...... 72
Figure 2.37: Android's sensor world coordinate system, according to [71]. ...... 73
Figure 2.38: An Android smartphone with its coordinate axes. ...... 74
Figure 2.39: CV is based on image analysis, which in turn is based on image processing. ...... 89
Figure 2.40: Original image (left); Dilation (middle); Erosion (right). ...... 91
Figure 2.41: Gaussian blur applied to an image (left) and the resulting image (right). ...... 92
Figure 2.42: Original image and its schematic pixel-value-representation (left); Vertical edges detected with the Sobel operator using the kernel (middle); Horizontal edges detected with the Sobel operator using the kernel (right). ...... 94
Figure 2.43: Image created with the Canny edge detector. ...... 96
Figure 2.44: Sweep window to find corners based on high variations of intensity. If there are no variations in all directions, then there is no feature (case 1/left); If there are variations in one direction, then an edge has been found (case 2/middle); If there are variations in all directions, then a corner has been found (case 3/right). ...... 98
Figure 2.45: Features found by Harris corner detector. ...... 99
Figure 2.46: Geometry of the three-point space resection problem. ...... 101
Figure 2.47: Example of a fiducial marker. ...... 109
Figure 2.48: Miniature digital 3D model of building interior visualized using a fiducial marker. ...... 110
Figure 2.49: Different types of model-based tracking based on [137]. ...... 112
Figure 2.50: 3D wireframe model of a building. ...... 113
Figure 2.51: Cube with visible and hidden edges. The thin gray lines in the back are covered by the front surfaces. ...... 114
Figure 2.52: A projected edge of a 3D model (black), sample points (red), normals (blue) searching for edges of the door in the image (orange). ...... 115
Figure 2.53: Example result of the SURF detector of [152]. The red circles represent the found interest points and their scale. ...... 118
Figure 2.54: Model captured using the Google Tango tablet. ...... 120
Figure 3.1: Required components of an AR system. ...... 121
Figure 3.2: Custom created LOD4 model of the civil engineering building of the RWTH Aachen University (Model 1). ...... 129
Figure 3.3: An office of the model 1 building. ...... 129
Figure 3.4: Custom created LOD4 building (Model 2). ...... 130
Figure 3.5: The kitchen of model 2. ...... 130
Figure 3.6: The living room of model 2. ...... 131
Figure 3.7: Part of the LOD2 model of Aachen based on data from [161] (Model 3). ...... 131
Figure 3.8: Average RAM consumption for DOM and the pull parser [55]. ...... 134
Figure 3.9: Average loading times for DOM and the pull parser [55]. ...... 135
Figure 3.10: Average runtimes for typical spatial queries. The Oracle and PostGIS queries were run on an Intel Core [email protected] and 8 GB RAM. SpatiaLite was run on a Google Nexus 5 [55]. ...... 139
Figure 3.11: Percentage of convex and concave polygons in the three models M1, M2 and M3. ...... 140
Figure 3.12: Example of sensor drift for the x-axis of the accelerometers of Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL. ...... 147
Figure 3.13: Example of sensor drift for the y-axis of the accelerometers of Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL. ...... 148
Figure 3.14: Example of sensor drift for the z-axis of the accelerometers of Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL. ...... 148
Figure 3.15: Orientation error in indoor and outdoor areas due to the magnetometer. ...... 149
Figure 3.16: Position error when using GNSS. ...... 150
Figure 3.17: The virtual objects are shifted and do not fit the physical objects due to inaccuracies of the calculated pose. ...... 150
Figure 3.18: The Polhemus G4 becomes increasingly inaccurate with growing distance. ...... 151
Figure 3.19: Mean translation error of each method when Gaussian noise is increasingly added to the 2D image points. ...... 155
Figure 3.20: Mean rotation error of each method when Gaussian noise is increasingly added to the 2D image points. ...... 155
Figure 3.21: Mean translation error of each method when increasing the number of points using an image heavily distorted with Gaussian noise. ...... 156
Figure 3.22: Mean rotation error of each method when increasing the number of points using an image heavily distorted with Gaussian noise. ...... 157
Figure 3.23: Mean translation error of each method when Gaussian noise is increasingly added to the 3D object points. ...... 158
Figure 3.24: Mean rotation error of each method when Gaussian noise is increasingly added to the 3D object points. ...... 158
Figure 3.25: Mean translation error of each method when increasing the number of 3D object points heavily distorted with Gaussian noise. ...... 159
Figure 3.26: Mean rotation error of each method when increasing the number of 3D object points heavily distorted with Gaussian noise. ...... 160
Figure 3.27: Comparison of the mean translation error when using methods with and without RANSAC on 2D image points that include outliers. ...... 161
Figure 3.28: Comparison of the mean translation error when using methods with and without RANSAC on 3D object points that include outliers. ...... 162
Figure 3.29: Edge-based Pose Tracking with IMU. ...... 164
Figure 3.30: An example of the edge-based pose tracking algorithm. ...... 165
Figure 4.1: General AR system architecture. ...... 169
Figure 4.2: Activity diagram of the AR system. ...... 170
Figure 4.3: Architecture for the CityGML viewer component of the AR system according to [55]. ...... 171
Figure 4.4: Example of a relationship between different classes in the database [55]. ...... 177
Figure 4.5: Activity diagram for the implemented selection algorithm [55]. ...... 178
Figure 4.6: Activity diagram for the implemented ear-clipping algorithm according to [55]. ...... 184
Figure 4.7: Process of connecting multiple holes (inner polygons) to the outer polygon. ...... 185
Figure 4.8: A fully triangulated LOD4 building. ...... 186
Figure 4.9: Hierarchy of CityGML objects in a scene graph [55]. ...... 188
Figure 4.10: Ray casting-based picking using a view frustum of the perspective projection [55]. ...... 192
Figure 4.11: Activity diagram of the pose tracking system. ...... 200
Figure 5.1: The PHIDIAS markers used for the calibration process captured from different angles. ...... 205
Figure 5.2: Time to perform the queries Q1 - Q5 for each smartphone. ...... 210
Figure 5.3: Loading times for each smartphone for the positions P1, P2 and P5. ...... 212
Figure 5.4: Loading times for each smartphone for the positions P3 and P4. ...... 212
Figure 5.5: The time spent to select data and export it from the database for each smartphone in positions P1, P2 and P5. ...... 213
Figure 5.6: The time spent to select data and export it from the database for each smartphone in positions P3 and P4. ...... 214
Figure 5.7: Required time to prepare the exported CityGML data for visualization. ...... 214
Figure 5.8: Required time to prepare the exported CityGML data for visualization. ...... 215
Figure 5.9: The average draw calls that were required for the rendered scene in P1 - P5. ...... 216
Figure 5.10: Average FPS in the different positions for each smartphone. ...... 217
Figure 5.11: Setup for testing optical pose estimation. ...... 218
Figure 5.12: Some examples of doors used for evaluating the door detection algorithm. ...... 219
Figure 5.13: Example of detected door. The red dots are corner points and the green rectangle is the correctly detected door. ...... 221
Figure 5.14: Example of a partially detected door. ...... 221
Figure 5.15: True Positive detection rate of the door detection algorithm for images with different resolutions. ...... 222
Figure 5.16: True Negative detection rate of the door detection algorithm for images with different resolutions. ...... 222
Figure 5.17: Time required to detect a door in cases using the same images in different resolutions. ...... 223
Figure 5.18: Time required to detect a door in cases M1 - M6 using downscaled images with a resolution of 480×360 pixels. ...... 226
Figure 5.19: Accuracy of the automatically derived door corner coordinates from downscaled images with a resolution of 480×360 pixels. ...... 227
Figure 5.20: Accuracy of automatically estimated position. ...... 229
Figure 5.21: Accuracy of automatically estimated orientation. ...... 230
Figure 5.22: Quality of relative orientation using the Rotation Vector over time. ...... 232
Figure 5.23: Quality of relative orientation using the Game Rotation Vector over time. ...... 232
Figure 5.24: Quality of relative orientation using the Gyroscope over time. ...... 233
Figure 5.25: Quality of relative orientation using the Rotation Vector when AR system is rotated. ...... 235
Figure 5.26: Quality of relative orientation using the Game Rotation Vector when AR system is rotated. ...... 235
Figure 5.27: Quality of relative orientation using the Gyroscope when AR system is rotated. ...... 236
Figure 5.28: The time that the smartphone battery lasts when using the AR framework. ...... 237
Figure 6.1: Door augmented using the AR system. ...... 239
Figure 6.2: Building augmented using the AR system. ...... 239
Figure 6.3: AR view using Google Tango and a custom created BIM model of the RWTH Aachen University [208]. ...... 245

List of Tables

Table 2.1: Positioning/pose tracking technologies according to [74]. ...... 75
Table 3.1: Specifications of the three smartphones. ...... 124
Table 3.2: Statistics of the three models used in the project. ...... 128
Table 3.3: Maximum capabilities of SpatiaLite ([175]). ...... 138
Table 3.4: Information about the accelerometer, magnetometer and gyroscope of the Google Nexus 5. ...... 145
Table 3.5: Information about the accelerometer, magnetometer and gyroscope of the Sony Xperia Z2. ...... 146
Table 3.6: Information about the accelerometer, magnetometer and gyroscope of the Google Pixel 2 XL. ...... 146
Table 3.7: Comparison of corner detectors. ...... 168
Table 4.1: Simplifications of the CityGML classes for the database schema [55]. ...... 173
Table 5.1: The calibration parameters of the Google Nexus 5 obtained from the calibration process using PHIDIAS in calibration C1 and calibration C2. ...... 205
Table 5.2: The calibration parameters of the Sony Xperia Z2 obtained from the calibration process using PHIDIAS in calibration C1 and calibration C2. ...... 206
Table 5.3: The calibration parameters of the Google Pixel 2 XL obtained from the calibration process using PHIDIAS in calibration C1 and calibration C2. ...... 206
Table 5.4: Statistics about the test CityGML database containing the models Model 1, Model 2 and Model 3. ...... 208
Table 5.5: Positions P1 - P5 representing the possible locations that can occur with the AR system and the average amount of polygons and objects that the selection algorithm loaded. ...... 211
Table 5.6: Detection rate of the door detection algorithm using downscaled images with a resolution of 480×360 pixels. ...... 225

List of Abbreviations

AAA-Model AFIS-ALKIS-ATKIS-Model
AC alternating current
AdV Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland
AEC architecture, engineering and construction
AoA angle of arrival
AR augmented reality
AV augmented virtuality
B3DM Batched 3D Model
BIM building information modeling
BLOB Binary Large Object
BMVI Federal Ministry of Transport and Digital Infrastructure
bpp bit per pixel
B-Rep boundary representation
CDT Constrained Delaunay Triangulation
CityGML City Geography Markup Language
COLLADA Collaborative Design Activity
CPU central processing unit
CSG constructive solid geometry
CV computer vision
DBMS database management system
DBS database system
DC pulsed direct current
DEM digital elevation model
DGPS differential global positioning system
DLT direct linear transformation
DoF degrees of freedom
DOM document object model
DR dead reckoning
DSM digital surface model
DT Delaunay triangulation
DTM digital terrain model
ECEF Earth-centered, Earth-fixed
ENU east, north, up
ER Entity-Relationship
ETRS89 European Terrestrial Reference System 1989
EWMA exponentially weighted moving average
FOV field of view
FPS frames per second
GB gigabyte
GDI-DE Spatial Data Infrastructure Germany
GIS geographic information system
glTF GL Transmission Format
GML Geography Markup Language
GNSS global navigation satellite system
GPS global positioning system
GPU graphics processing unit
GUI graphical user interface
HMD head-mounted display
IFC industry foundation classes
IMU inertial measurement unit
IPS indoor positioning system
IR infrared
IRLS Iterative Re-weighted Least Squares
JAXB Java Architecture for XML Binding
JTS JTS Topology Suite
KB kilobyte
LOD level of detail
LOS line-of-sight
MB megabyte
MBR minimum bounding rectangle
MEMS micro electro mechanical sensor
MP megapixel
MR mixed reality
NED north, east, down
NRW North Rhine-Westphalia
ODB object database
OGC Open Geospatial Consortium
OHMD optical head-mounted display
OpenGL Open Graphics Library
OpenGL ES OpenGL for Embedded Systems
OS operating system
OSM OpenStreetMap
P3P Perspective-Three-Point
PDA personal digital assistant
PnP Perspective-n-Point
POI point of interest
RAM random access memory
RANSAC Random Sample Consensus
RDBMS relational database management system
RE real environment
RGB red, green, blue
ROS region of support
RSS received signal strength
SAX Simple API for XML
SDK software development kit
SIFT Scale-Invariant Feature Transform
SIG 3D Special Interest Group 3D
SLAM Simultaneous Localization and Mapping
SQL structured query language
SSID service set identifier
SURF Speeded Up Robust Features
SVD singular value decomposition
TDoA time difference of arrival
TIN triangulated irregular network
ToA time of arrival
ToF time of flight
UI user interface
UTM Universal Transverse Mercator
UWB ultra-wideband
VE virtual environment
VIO visual-inertial odometry
VO visual odometry
VR virtual reality
WFS Web Feature Service
WGS84 World Geodetic System
WKB well-known binary
WKT well-known text
WLAN wireless local area network
μm micrometer

1 Introduction

In human history, geospatial data has always played an important role. Some of the earliest known maps date back to approximately 25,000 BC, depicting mountains, rivers, valleys and routes [1]. Today, people are still just as interested in such spatial information. With more sophisticated data acquisition and visualization techniques driven by the rapid evolution of technology, the data has evolved in detail and complexity. In recent years, a strong trend towards mobile computing has arisen, mainly propelled by mobile devices such as smartphones, which have had a strong impact on worldwide markets since Apple introduced the first iPhone in 2007 [2]. In 2017, about 1.54 billion smartphones were sold [3]. Besides Apple, other major companies, like Google, Samsung, Sony and LG, have also developed numerous models of smartphones and mobile devices, propelling the evolution of mobile technology and allowing for more powerful and yet smaller devices (e.g. wearables like smartwatches). [4] coined the term ubiquitous computing, which describes the main direction in which technology is moving: computing anytime and anywhere. Modern mobile devices essentially offer possibilities equal to desktop computers, with some design- and hardware-specific aspects that must be considered, for example the typically much smaller displays in comparison to a desktop PC or the limited capabilities of low-cost mobile hardware. Therefore, new ways of interaction and visualization are of importance.

Head-mounted displays (HMD), for example, work around the issue of constrained visualization space by bringing displays closer to the user's eyes. This technique is, for instance, used by virtual reality (VR). The goal of VR is to fully immerse the user in the virtual world. VR has been a topic of numerous projects and research efforts since the 1980s, but only very recently has the interest increased significantly, especially since the Oculus Rift 1 was presented in 2012. Along with VR, the interest in mixed reality (MR) and augmented reality (AR) in particular has increased (Figure 1.1). AR adds additional virtual information to the user's real-world perception instead of immersing the user in a virtual world.

Figure 1.1: Trend of AR in the last 14 years, based on search statistics from Google [5]. The values indicate the search interest (in percent, 2004-2018) relative to the highest interest in the specified time period.

1 https://www.oculus.com/rift/

As one of the first global companies, Google announced the development of a wearable vision-based AR solution consisting of a miniature computer combined with an optical head-mounted display (OHMD), named Google Glass 2. For now, however, the developments have been postponed due to hardware restrictions and user experience issues. Besides Google, the entertainment and gaming industry has recently also focused some research activities on AR applications. However, instead of developing special hardware, many solutions rely on smartphones. This is not surprising, since modern smartphones already provide all the necessary components of an AR system. But not only developers and users from the entertainment and gaming industry have found application areas for smartphone-based AR; use cases in the geospatial domain have also proven to be promising. For instance, a location-based mobile AR system can be utilized for the georeferenced visualization of points of interest (POI) or for navigational purposes. The advantage is that the user can view the information in a much more natural way from a first-person perspective. From a civil engineering point of view, the visualization of buildings is highly interesting, as some examples named by [6] show:
• Real estate and planning offices could offer their clients the possibility to visualize planned buildings on parcels of land. With the help of the AR system, clients could inspect the georeferenced virtual 3D building models freely on-site and easily compare colors, sizes, look and feel or overall integration into the cityscape.

2 https://x.company/glass/

• Physical buildings could be augmented to enable the visualization of hidden building parts, such as cables, pipes and beams.
• Tourist information centers could offer tourists visual historic city tours by visualizing historic buildings on-site and displaying additional information and facts about the location and the historic building.

In the following, a location-based mobile AR system is defined as a system that utilizes geospatial data and operates in a global reference system. For such AR solutions, not only suitable hardware is obligatory, but also appropriate data. There are many graphics formats for modelling 3D objects, but these typically lack spatial references and the possibility to store additional information about the object. A promising alternative to these formats are semantic information models, which have become fairly popular for geographic information systems (GIS). A concrete realization of such a model is the Geography Markup Language 3 (GML)-based XML encoding schema City Geography Markup Language 4 (CityGML). It enables modelling, storing and exchanging city models by employing a modular structure and a level of detail (LOD) system. It provides the necessary classes for objects that are generally found in cities, such as buildings, roads and water. Many use cases of CityGML are created around desktop PC environments and typically focus on overview-like presentations of the data. Some typical desktop tasks are large-scale analyses, solar potential analyses, shadow analyses and disaster

3 http://www.opengeospatial.org/standards/gml
4 http://www.opengeospatial.org/standards/citygml

analyses [7]. However, applications of CityGML in mobile environments are still scarce. AR promises to be a suitable technology for mobile CityGML. In this work a mobile location-based AR system for CityGML data is presented.

1.1 Thesis Goal

The goal of this thesis is to develop a concept for a location-based mobile AR system that can visualize buildings or building parts to allow the realization of use cases as described in the previous section. Furthermore, a fully functional self-sufficient mobile AR system prototype should be implemented to evaluate whether modern mobile devices, such as smartphones, satisfy the requirements of an AR system for such scenarios. Figure 1.2 and Figure 1.3 depict exemplary reference visualizations for the targeted application. To develop a system as proposed, various state-of-the-art methods from different disciplines, such as 3D modelling, real-time rendering and computer vision (CV), must be combined, and appropriate hardware and data must be found. For this, a detailed literature review and a familiarization with the practical implementation of each topic are mandatory. To find suitable devices, data and methods for this specific AR system, empirical comparisons between the different possibilities are essential. Lastly, the finished proof-of-concept system should be evaluated.


Figure 1.2: Concept image of outdoor AR system.

Figure 1.3: Concept image of indoor AR system.

1.2 Related Work

Concept of Augmented Reality
Already at the end of the 1960s, Ivan Sutherland presented the first stationary AR system with a stereoscopic HMD. The poor capabilities of the processing units of that time, however, only enabled very simple wireframe models of rooms to be placed over the real world. Sensing of the head's position was done using a heavy mechanical head position sensor, which determined the position and orientation (pose) of the user's head. Due to the weight of the entire system, it was fittingly named "The Sword of Damocles" [8]. But what exactly is AR? Basically, AR adds additional virtual information to the user's real-world perception. Generally speaking, this not only includes augmentations of the visual sense (ophthalmoception), but of all senses, including hearing (audioception), taste (gustaoception), smell (olfacoception) and touch (tactioception). In most cases when referring to AR, the focus lies on visual augmentations, as the majority of past research projects has shown. In this thesis, the focus also lies on visual-based AR. According to [9], AR ideally enables a virtual object and a real object to coexist in the same space. As is often the case, ideas exist long before they are actually realized. For AR, this also holds true. Back in the early 1900s, the novelist L. Frank Baum already mentioned spectacles that overlay data on real-world people [10], but a first AR system was not developed until the 1960s. The actual term Augmented Reality is credited to [11], who coined it in the early 1990s.

Widely accepted definitions of AR were made by Milgram and Azuma. [12] first presented their reality-virtuality continuum in 1994. As depicted in Figure 1.4, the continuum reaches from the real environment (RE) to the virtual environment (VE), with AR and augmented virtuality (AV) being two realms in between and both part of MR. While VR places the user in an entirely computer-generated world, MR environments combine virtual- and real-world elements to enhance the user's perception. AR adds virtual elements to real-world perceptions and AV, in contrast, complements virtual environments with real-world information.

Figure 1.4: Reality-virtuality continuum according to [12].

A later, also widely accepted, but more precise definition of AR was given by [9]. According to this definition, AR is the extension of perception by integrating additional virtual information and can be characterized by three main characteristics:
• It combines real and virtual elements
• It must be registered in 3D
• It is interactive in real time
The combination of "real and virtual elements" defines the actual description of AR and is extended by the necessity of "registration in 3D", which separates it from 2D virtual video overlays. "Interactive in

real-time” further delimits AR from virtual computer graphic augmentations, as used in pictures or television (e.g. overlay of virtual lines in soccer games) by requiring the system to update in real-time and respond to the user. In this thesis, AR is understood as defined by Milgram [12] and Azuma [9].

Types of Visual Augmented Reality
A common interaction scheme for today's computers is the point-and-click action. AR needs a different kind of interaction, especially in terms of mobile AR. [13] states that AR research aims at the development of "intuitive user interfaces" or, in the words of [14], at the transformation of the world into the user's interface. According to [15], the advantages of AR are obvious: When the real and the virtual environment are viewed separately, the information of both environments must be combined by the user. If real and virtual information is perceived already combined, the experience is simplified, making the information likely more comprehensible. [16] gives an overview of possible applications for AR: for example, medical visualizations for surgery training with the help of x-ray visions of patients, manufacturing and repair work for assembly, maintenance and repair of complex machinery, robot path planning or entertainment systems. According to [15], five types of technical realizations of visual AR exist: optical see-through HMD AR, video see-through HMD AR, handheld display AR, projection-based AR with video augmentation and projection-based AR with physical surface augmentation. While optical see-through HMD AR uses a transparent surface through which the user can see the real world, video see-through HMD AR uses displays on which captured images are projected. In both cases

additional virtual information is displayed into the field of view (FOV) of the user. Handheld display AR captures the real world with one or multiple cameras and displays the video images combined with virtual information on the display. Projection-based AR with video augmentation also uses video cameras to capture the real world and a surface onto which the video information, plus its additional virtual information, is projected. Projection-based AR with physical surface augmentation uses real-world objects and projects virtual information onto these, creating enhanced object perceptions. Generally, AR systems can be categorized into marker-based and markerless systems. While markerless systems generally use sensor-based position and orientation trackers, marker-based systems rely on image-based methods with 2D fiducial markers to determine the pose of the device. For marker-based tracking, the environment must be prepared ahead of time by distributing the markers. Markerless tracking, however, needs no such preparations. Therefore, the majority of AR applications generally use marker-based tracking for small-scale environments and markerless tracking for large-scale environments, such as outdoor scenarios.

Augmented Reality Research Projects
While the first AR system by Sutherland [8] only allowed movements in a constrained area with a diameter of roughly 2 m, the full potential of AR only unfolds when the system is mobile. The first mobile augmented reality system, named MARS, was presented by [17] and relied on a portable computer carried in a backpack, a personal digital assistant (PDA) with touchpad, an orientation tracker, a differential global positioning system (DGPS) and a wireless web access point. It was a campus information system and assisted a

user in navigating to POI and allowed querying information about these. [18] also developed multiple mobile AR systems for indoor and outdoor scenarios, inter alia creating a guided campus tour by overlaying models of historic buildings and georeferenced media, such as hypermedia news stories. Around the same time as the first mobile AR systems were presented, one of the first fiducial-marker-based AR systems was presented by [19]. A popular open source library for tracking with six degrees of freedom (DoF) based on fiducial markers was presented in [20], named ARToolKit. The framework has been used in multiple projects and applications and is still maintained and in development today. It has been ported to major platforms such as Windows 5, Linux 6, MacOS 7 and also the mobile platforms iOS 8 and Android 9. [21] first presented a backpack-based system which was developed into Tinmith [22]. It evolved into a combination of a small computer worn on the belt, a helmet with HMD, wireless input devices, and a freeware software framework. Pose tracking was done using the global positioning system (GPS) and an InterSense orientation tracker. [23] presented Archeoguide, a mobile AR system consisting of mobile computers, wireless access points and GPS. The system offered navigation and visualizations of ancient constructions on the historical Olympia site in Greece. [24] also presented an AR system named GEIST for interactive narrative tours through historical

5 https://www.microsoft.com/de-de/windows
6 https://www.linux.org/
7 https://www.apple.com/de/macos/mojave/
8 https://www.apple.com/de/ios/ios-12/
9 https://www.android.com/intl/de_de/

locations, where historical facts and events were told by fictional avatars. Efforts towards indoor solutions were made by [25]. They presented a mobile AR building navigation guide based on the ARToolKit. The system displayed directional information in the see-through heads-up display by overlaying a registered wire-frame model of the building. The overall position of the user in the building could be viewed on a wrist-worn pad displaying a 3D world-in-miniature (WIM) map, which also acted as an input device [25]. Tracking was done using fiducial markers, which were distributed across the building, and an inertial measurement unit (IMU). Another indoor guidance system was developed by [26], which was also based on the ARToolKit and fiducial marker recognition to visualize and overlay 3D models of the environment. The major improvement here was that the system ran completely on a PDA, minimizing the need for infrastructure. [27] presented a mobile outdoor AR system for urban environments which overlays virtual wireframe models over the real buildings. In comparison to their predecessors, they used an edge-based visual matching method and a combination of the gyroscope, magnetometer and gravimeter sensors for tracking. The disadvantage is that a good initial pose is required, in contrast to the approach utilized in this thesis. Besides fiducial marker-based approaches, other visual tracking methods have gained interest. Not only is no preparation of the environment required, as would be the case when using fiducial markers, but the methods can be used equally in both small- and large-scale environments. This, for instance, enables the possibility of precise pose tracking indoors and outdoors. This method has been used frequently, especially indoors, since positional information is

hardly available or not available at all, due to the lack of global navigation satellite system (GNSS) coverage. [28] introduced SiteLens, a mobile AR system for urban planning purposes. The system visualizes referenced 3D objects, mainly buildings. In 2010, a handheld AR system named Vidente that visualizes underground infrastructures, like pipes, was introduced by [29]. [30] presented the LARA system, an assistive AR system that is also able to visualize underground infrastructures. It uses the GeoTools 10 open source library and PostGIS 11. For pose tracking, GNSS and IMU data is fused. [31] presented a mobile system that enables the visualization of CityGML data in different modes, like VR and AR. The data is stored in the PostgreSQL 12 -based 3D City Database 13 (3DCityDB) and is transmitted using a client-server model. The client is implemented on the iOS platform using the Glob3 Mobile API 14, a map and 3D globe framework. Besides the mentioned research projects, open source, freeware and commercial frameworks were also developed, which commonly provide readily implemented low-level methods to realize AR projects more easily. Every framework has a certain focus though, so that one must be chosen depending on the type of project. An overview can be found in [32].

10 http://geotools.org/
11 https://postgis.net/
12 https://www.postgresql.org/
13 https://www.3dcitydb.org
14 http://glob3mobile.com/

Commercial Augmented Reality Systems
Furthermore, complete commercial solutions that combine hardware with software frameworks were developed, for example, by Google and Microsoft. First developer versions of the Google Glass, an OHMD in the shape of eyeglasses with a miniature computer attached to it, developed by Google, were sold in 2013, but consumer versions were postponed due to technical limitations. Nevertheless, advancements in AR were still made with Google's Project Tango [33], which consists of a specialized tablet device and a software development kit (SDK). For sensing, the device includes a 4 megapixel (MP) red, green, blue (RGB)-infrared (IR) pixel sensor camera, a motion tracking camera, a 3D depth sensor, an accelerometer, an ambient light sensor, a barometer, a compass, GPS and a gyroscope. The sensors are backed up by suitable hardware, such as an NVIDIA Tegra K1 15 (192 CUDA cores), 128 gigabytes (GB) of internal storage and 4 GB of random access memory (RAM). 3D pose tracking is realized using three core technologies: motion tracking, area learning and depth perception. Motion tracking uses visual-inertial odometry (VIO), which is a combination of visual odometry (VO) and IMU, to estimate the device's pose relative to a starting pose. VIO utilizes a series of camera images in which the relative positions of different image features are tracked to determine pose changes. Incorporating an IMU into the process introduces drift errors. To correct these, Tango uses its second core technology, area learning. With this, the device gains the ability to record key visual features of the physical 3D world, like edges and corners, which can be used to recognize the area at a later time. The features are mathematically described and saved in a

15 https://www.nvidia.com/object/tegra-k1-processor.html

searchable index. When known features are found, the system can apply drift corrections by adjusting its path according to previous observations made before the drift errors occurred. The third core technology, depth perception, uses IR rays to create point clouds of the surrounding environment. From these point clouds, 3D coordinates can be calculated which enable positional tracking forwards, backwards, up and down. Since 2016, Microsoft has also made its head-mounted see-through holographic headset, named HoloLens [34], available. For displaying holographic images into the user's sight, it features two HD 16:9 light engines, using a custom-built holographic processing unit, 64 GB of flash storage and 2 GB of RAM. For tracking, it utilizes one IMU, four environment understanding cameras, one depth camera, one 2 MP camera, four microphones and one ambient light sensor. It also features possibilities for human interaction by sound, gaze, gesture and voice.

Commercial Augmented Reality Frameworks Next to hardware coupled solutions, the major companies are also developing more general approaches that enable AR on non- specialized devices. Apple, for instance, bought the German AR startup company Metaio in 2015 [35] and recently released ARKit 16 as part of iOS 11. It allows creating AR experiences on normal iPhones and iPads using monocular VIO. Google accordingly announced ARCore 17 , a similar framework that enables AR experiences on normal Android smartphones, also using monocular VIO. The

16 https://developer.apple.com/arkit/
17 https://developers.google.com/ar/

disadvantage of both solutions is that the pose is only estimated in a local coordinate frame, relative to an initial pose. In this thesis the focus lies on developing an AR system that determines absolute poses in a global reference frame.

2 Background

AR combines numerous research fields and highly specialized methods. In general, an AR application can be split into three essential components: the reality component, the virtual component and a method to combine these two components. The following sections discuss fundamental methods for this thesis and related topics of AR systems. Since different domains commonly share certain vocabulary (e.g. CV, photogrammetry, 3D real-time rendering), but in some cases attach a different meaning to it, the terminology in the following sections should be understood in the given context. Related methods or terminology are additionally provided if required.

2.1 Fundamentals

This section provides details on fundamentals that are essential to an AR system. These basics are found in AR-related sub-topics, such as 3D real-time rendering, CV, etc.

2.1.1 Transforms

Some of the most fundamental parts of AR are the transforms. They are essential for important tasks such as 3D real-time rendering, pose tracking, etc. With the help of a transform, objects can be altered in position, orientation or size. A translation is such a transform. It refers to changing the x, y and z values and thus the location of an object. Transforms that preserve vector addition and scalar

multiplication are called linear transforms. For such a transform f, the following holds (equation (2.1)):

$$f(\mathbf{x}) + f(\mathbf{y}) = f(\mathbf{x} + \mathbf{y}), \qquad k\,f(\mathbf{x}) = f(k\,\mathbf{x}) \tag{2.1}$$

Some typical linear transforms are rotation and scaling. However, it is advantageous to be able to combine multiple transforms, such as a translation and a rotation. An affine transform makes this possible by performing a linear transform followed by a translation. In computer graphics, affine transforms are typically represented as 4×4 matrices, using the homogeneous notation, so that translations and perspective projections can be expressed as matrix multiplications, allowing single transforms, like rotation, scaling and translation, to be combined into a single matrix. Especially in computer graphics this uniform representation of transforms is desirable, since hardware can be optimized towards handling 4×4 matrices. This has the advantage that the coordinates of a vertex must be multiplied only once. According to [36] and [37], a rotation matrix R and a translation matrix T can be concatenated to a matrix C, which can be multiplied with a vertex v, resulting in the same rotation and translation as if v had been multiplied with them separately (see equations (2.2) and (2.3)).

$$\mathbf{C}\,\mathbf{v} = \mathbf{T}\,(\mathbf{R}\,\mathbf{v}) \tag{2.2}$$

where:

$$\mathbf{C} = \mathbf{T}\,\mathbf{R} \tag{2.3}$$

The following sections give an overview of typical transforms using the homogeneous form:

Translation

Figure 2.1: An example for a translation of a geometrical object.

A translation is used to move points of a geometry by the same amount in a certain direction. The translation matrix is given by T(t):

$$\mathbf{T}(\mathbf{t}) = \begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

With this matrix, a vertex p = (p_x, p_y, p_z, 1) is translated by the vector t = (t_x, t_y, t_z), resulting in a new vertex p' = (p_x + t_x, p_y + t_y, p_z + t_z, 1). Like this, a shift from one location to another can be accomplished, as depicted in Figure 2.1.
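To make the homogeneous form tangible, the following minimal Java sketch (purely illustrative and not part of the thesis prototype; the class and method names are chosen freely) builds T(t) as a row-major 4×4 array and applies it to a vertex p = (2, 3, 4, 1):

/** Minimal sketch of a homogeneous translation (row-major 4x4 matrix). */
public class TranslationDemo {

    /** Builds the translation matrix T(t). */
    static double[][] translation(double tx, double ty, double tz) {
        return new double[][] {
            {1, 0, 0, tx},
            {0, 1, 0, ty},
            {0, 0, 1, tz},
            {0, 0, 0, 1}
        };
    }

    /** Multiplies a 4x4 matrix with a homogeneous point (x, y, z, 1). */
    static double[] transform(double[][] m, double[] p) {
        double[] r = new double[4];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                r[i] += m[i][j] * p[j];
        return r;
    }

    public static void main(String[] args) {
        double[] p = {2, 3, 4, 1};                         // vertex in homogeneous form
        double[] q = transform(translation(5, 0, -1), p);  // shift by t = (5, 0, -1)
        System.out.printf("p' = (%.1f, %.1f, %.1f, %.1f)%n", q[0], q[1], q[2], q[3]);
        // Prints p' = (7.0, 3.0, 3.0, 1.0), i.e. p translated by t.
    }
}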

Rotation

Figure 2.2: An example for a rotation of a geometrical object.

The following matrices R_x(φ), R_y(φ) and R_z(φ) are used to rotate an instance by φ radians about the x-, y- and z-axis, respectively. Figure 2.2 shows an example of a rotation of a shape.

$$\mathbf{R}_x(\phi) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi & 0 \\ 0 & \sin\phi & \cos\phi & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

$$\mathbf{R}_y(\phi) = \begin{pmatrix} \cos\phi & 0 & \sin\phi & 0 \\ 0 & 1 & 0 & 0 \\ -\sin\phi & 0 & \cos\phi & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

$$\mathbf{R}_z(\phi) = \begin{pmatrix} \cos\phi & -\sin\phi & 0 & 0 \\ \sin\phi & \cos\phi & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

Euler Transform
In virtual worlds, it is important to be able to orient objects, such as the virtual camera. Yaw, pitch and roll are defined by the Euler transform, which is a concatenation of the three rotation matrices, so that E(yaw, pitch, roll) = R_z(roll) R_x(pitch) R_y(yaw) (Figure 2.3).

Figure 2.3: The three DoF (yaw, pitch and roll).

A disadvantage of the Euler transform is that a gimbal lock can occur, in which case one DoF is lost. This is caused by two axes aligning in such a way that a rotation about either of these axes results in a rotation about the other axis. Given a three-gimbal mechanism in its initial state, with the three gimbal axes mutually perpendicular, this can for example occur when a 90° change about the pitch axis is applied.

The yaw axis gimbal and the roll axis gimbal become aligned (Figure 2.4) so that changes to roll and yaw then essentially apply the same rotation.

Figure 2.4: An example for gimbal lock when using Euler angles.

Quaternions

A solution to the gimbal lock problem is the use of quaternions. A quaternion is an extension of the complex numbers, originally described by [38]. It can simply be thought of as a delta rotation that describes the shortest path between two rotations. This is advantageous in comparison to rotations in Euler angles, which are represented as a sequence of steps. While Euler angles are quite intuitive compared to quaternions, they also require some questions to be answered: for example, in which order the rotations are applied or what the reference frame is. Another advantage is that quaternions have a smaller memory footprint, using only 4 floating point numbers in contrast to rotation matrices with 9 (3×3 matrix) or 16 (4×4 matrix) respectively. The mathematical definition of a quaternion is given by equation (2.4):

$\mathbf{q} = q_w + q_x i + q_y j + q_z k, \qquad i^2 = j^2 = k^2 = ijk = -1$ (2.4)

where $q_w$ is the real part of the quaternion and $q_x$, $q_y$, $q_z$ the imaginary part [39]. Rotating a quaternion can simply be performed by multiplying it with another quaternion that represents the rotation that should be applied.
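To make the quaternion rotation just described more tangible, the following minimal Java sketch implements the Hamilton product and uses it to rotate a vector via $\mathbf{q}\,\mathbf{v}\,\mathbf{q}^{-1}$. The class and method names are illustrative only and are not taken from the thesis implementation.

```java
/**
 * Minimal quaternion sketch illustrating equation (2.4): a rotation is
 * applied by multiplying quaternions (Hamilton product).
 */
public final class Quaternion {
    public final double w, x, y, z; // w = real part, (x, y, z) = imaginary part

    public Quaternion(double w, double x, double y, double z) {
        this.w = w; this.x = x; this.y = y; this.z = z;
    }

    /** Unit quaternion representing a rotation of 'angleRad' about a unit axis. */
    public static Quaternion fromAxisAngle(double ax, double ay, double az, double angleRad) {
        double s = Math.sin(angleRad / 2.0);
        return new Quaternion(Math.cos(angleRad / 2.0), ax * s, ay * s, az * s);
    }

    /** Hamilton product this * q; concatenates the two rotations. */
    public Quaternion multiply(Quaternion q) {
        return new Quaternion(
            w * q.w - x * q.x - y * q.y - z * q.z,
            w * q.x + x * q.w + y * q.z - z * q.y,
            w * q.y - x * q.z + y * q.w + z * q.x,
            w * q.z + x * q.y - y * q.x + z * q.w);
    }

    /** Conjugate, equal to the inverse for unit quaternions. */
    public Quaternion conjugate() {
        return new Quaternion(w, -x, -y, -z);
    }

    /** Rotates a 3D vector v by computing q * v * q^-1. */
    public double[] rotate(double vx, double vy, double vz) {
        Quaternion p = new Quaternion(0.0, vx, vy, vz);
        Quaternion r = this.multiply(p).multiply(this.conjugate());
        return new double[] { r.x, r.y, r.z };
    }
}
```

For example, Quaternion.fromAxisAngle(0, 0, 1, Math.PI / 2).rotate(1, 0, 0) yields approximately (0, 1, 0), i.e. a 90° rotation about the z-axis.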

Scaling

A scaling matrix $\mathbf{S}(\mathbf{s})$ is used to enlarge or shrink an instance along the $x$-, $y$- and $z$-axes (Figure 2.5). It is a uniform scaling operation if $s_x = s_y = s_z$ and non-uniform otherwise. A uniform scaling operation with the factor 3, for example, is performed by setting $s_x = s_y = s_z = 3$.

$\mathbf{S}(\mathbf{s}) = \begin{pmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$

Figure 2.5: An example for scaling a geometrical object.
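To illustrate how translation, rotation and scaling are combined into a single homogeneous 4×4 matrix, the following sketch uses Android's android.opengl.Matrix utility (column-major matrices, as expected by OpenGL ES). The concrete transform values and the choice of this utility class are assumptions made for the example, not details of the thesis implementation.

```java
import android.opengl.Matrix;

public class ModelMatrixExample {
    /**
     * Builds a combined transform M = T * R * S as a single 4x4 matrix
     * (column-major, as expected by OpenGL ES) and applies it to a vertex.
     */
    public static float[] transformVertex(float[] localVertex /* {x, y, z, 1} */) {
        float[] model = new float[16];
        Matrix.setIdentityM(model, 0);

        // The calls append transforms on the right, so the vertex is
        // effectively scaled first, then rotated, then translated.
        Matrix.translateM(model, 0, 10f, 0f, -5f);   // T: move the object
        Matrix.rotateM(model, 0, 45f, 0f, 1f, 0f);   // R: 45 deg about the y-axis
        Matrix.scaleM(model, 0, 2f, 2f, 2f);         // S: uniform scaling by 2

        float[] world = new float[4];
        Matrix.multiplyMV(world, 0, model, 0, localVertex, 0); // p' = M * p
        return world;
    }
}
```

Because each call appends its transform on the right, the resulting matrix corresponds to M = T · R · S, so only a single matrix-vector multiplication per vertex is needed.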

Projection

In visual AR, an image-based representation of reality is captured by a physical camera and a representation of the virtual world is captured by an artificial camera. In both cases a transformation from 3D to 2D space is required, referred to as a projection. Projections are important transformations in computer graphics. Two types of projections are commonly used, the parallel projection (Figure 2.6) and the perspective projection (Figure 2.7). Parallel projections use parallel lines to project points along these onto the plane, preserving the size of objects. A much more natural type of projection is the perspective projection, which produces smaller projections for distant objects and larger projections for near objects. This is the result of projecting points along lines originating from a single point, referred to as the center of projection (Figure 2.8). Since this type of projection coincides with the way humans perceive reality, it is more commonly used in 3D real-time rendering than the parallel projection. To perform the projection, camera models are developed to describe the relationship between 3D space and its 2D projection.

Figure 2.6: Orthographic projection according to [40].

Figure 2.7: View frustum for a perspective projection according to [40].

2.1.2 Camera Models

A simple physical camera can be constructed with a lightproof box, a very small hole on one side of the box and some light sensitive paper. This is referred to as the pinhole camera . This simple physical construction is transferred to the pinhole camera model which describes the camera functionality in an abstract way. This model can be used to implement a virtual camera quite easily and perform projections from 3D to 2D space. Figure 2.8 shows the basic setup of the camera model.

Figure 2.8: The pinhole camera model according to [41].

To better describe the pinhole camera model, some terminology must be defined. As shown in Figure 2.8 and described in [42], the pinhole of the camera is referred to as the optical center and the plane on which the image is created is called the image plane. The image plane is placed on the z-axis, which is referred to as the optical axis. The point in which the optical axis intersects the image plane is termed the principal point. The distance between the optical center and the principal point is referred to as the focal length. The size of the pinhole itself is denoted as the aperture. As shown in [41], the pinhole camera model is described by the following equation (2.5):

, (2.5) | where is considered the camera matrix that projects a 3D point to| a 2D point in a 2D plane. is an arbitrary scaling factor.

The intrinsic camera matrix $\mathbf{A}$ contains 5 intrinsic parameters that describe the camera itself, so that the 3D camera coordinates are transformed into 2D pixel coordinates. The extrinsic matrix $[\mathbf{R}\,|\,\mathbf{t}]$ describes the transformation from 3D world coordinates to 3D camera coordinates (Figure 2.9). Therefore, the extrinsic parameters describe the pose of the camera in the world frame.


Figure 2.9: The relationship between world coordinates and pixel coordinates according to [43].

To transform world coordinates to pixel coordinates (Figure 2.9), [41] defines the matrices as follows (equations (2.6), (2.7), (2.8) and (2.9)):

$\mathbf{m}' = (u, v, 1)^T$ (2.6)

where $u$ and $v$ are the pixel coordinates of the projected 3D point $\mathbf{M}' = (X, Y, Z, 1)^T$.

$\mathbf{A} = \begin{pmatrix} f_x & \gamma & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$ (2.7)

where $f_x$ and $f_y$ are the focal lengths, $(c_x, c_y)$ the principal point and $\gamma$ the skew coefficient between the $x$- and $y$-axis, which is typically 0.

$[\mathbf{R}\,|\,\mathbf{t}]$ (2.8)

where $\mathbf{R}$ defines the rotation and $\mathbf{t}$ the translation.

In homogeneous coordinates the full model is expressed as equation (2.9):

$s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_x & 0 & c_x & 0 \\ 0 & f_y & c_y & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$ (2.9)

Lens Distortion

Figure 2.10: (a) Undistorted image (left); (b) Image with pincushion distortion (middle); (c) Image with barrel distortion (right).

The described pinhole camera model is based on an optimal configuration. In reality, physical cameras use lenses which introduce some distortions like radial distortion or tangential distortion. Typical radial distortion is described by barrel distortion (Figure 2.10 - c) and pincushion distortion (Figure 2.10 - b), which occur from the way light rays pass through the lens. Towards the edges of a lens the light rays are bent more than in the center. Generally, this effect depends on the size of the lens; smaller lenses increase the distortion. Tangential distortion occurs when the camera lens and the camera sensor are not parallel to each other. Thus, the pinhole camera model must be extended to account for this. According to

[41], the mentioned pinhole camera model transformation can also be expressed as

$u = f_x \cdot x' + c_x, \qquad v = f_y \cdot y' + c_y$ (2.10)

where $x' = x/z$ and $y' = y/z$ with $z \neq 0$. The camera model is extended by radial and tangential distortion coefficients $k_n$ and $p_n$:

$x'' = x'(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x' y' + p_2(r^2 + 2 x'^2)$
$y'' = y'(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1(r^2 + 2 y'^2) + 2 p_2 x' y'$ (2.11)

where $r^2 = x'^2 + y'^2$, $k_1$, $k_2$ and $k_3$ are radial distortion coefficients and $p_1$, $p_2$ are tangential distortion coefficients.

$u = f_x \cdot x'' + c_x, \qquad v = f_y \cdot y'' + c_y$ (2.12)

Depending on the amount of distortion, typically three distortion coefficients are sufficient to account for radial distortion, as shown in equation (2.13):

$x'' = x'(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$
$y'' = y'(1 + k_1 r^2 + k_2 r^4 + k_3 r^6)$ (2.13)
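The following Java sketch applies the distortion model of equations (2.10) to (2.12) to a point given in normalized camera coordinates. Variable names mirror the symbols of the equations; the method itself is only an illustration, not code from the thesis.

```java
/**
 * Applies the radial and tangential distortion model of equations
 * (2.10)-(2.12) to a point in normalized camera coordinates (x', y')
 * and projects it to pixel coordinates (u, v).
 */
public final class LensDistortion {
    public static double[] distortToPixel(double xPrime, double yPrime,
                                          double fx, double fy, double cx, double cy,
                                          double k1, double k2, double k3,
                                          double p1, double p2) {
        double r2 = xPrime * xPrime + yPrime * yPrime;      // r^2 = x'^2 + y'^2
        double radial = 1.0 + k1 * r2 + k2 * r2 * r2 + k3 * r2 * r2 * r2;

        // Equation (2.11): radial and tangential distortion
        double xDist = xPrime * radial
                + 2.0 * p1 * xPrime * yPrime + p2 * (r2 + 2.0 * xPrime * xPrime);
        double yDist = yPrime * radial
                + p1 * (r2 + 2.0 * yPrime * yPrime) + 2.0 * p2 * xPrime * yPrime;

        // Equation (2.12): project the distorted coordinates to pixel coordinates
        double u = fx * xDist + cx;
        double v = fy * yDist + cy;
        return new double[] { u, v };
    }
}
```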

Camera Calibration

So that the undistorted images of the virtual camera superimpose the distorted images of the physical camera correctly, the distortions must be corrected by determining the parameters of the interior orientation and the distortion coefficients, which together form the camera model. Geometric camera calibration, also known as camera resectioning, is used to estimate the intrinsic and extrinsic parameters and the distortion coefficients. The parameters can be estimated by utilizing known 3D world points and their corresponding 2D image points. In the field of CV this is typically achieved by using an object with easily identifiable and known geometry. A popular calibration rig is a chessboard. Its square shape and interior black and white squares are easily detectable and definable. Chessboards are especially robust and yield accurate calibration results due to their linear features. As shown in Figure 2.11, the chessboard typically is captured from multiple angles. A similar approach is found in the field of photogrammetry, which uses a field of control points, for example encoded markers, as shown in Figure 2.12. As with the chessboard, the test field should optimally be captured from different angles. The obtained camera model is then used to undistort images captured by the camera.
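As an illustration of the chessboard-based calibration workflow just described, the following sketch uses the OpenCV Java bindings (Calib3d.findChessboardCorners and Calib3d.calibrateCamera). OpenCV is chosen here purely as an example toolset and is not necessarily the one used in this thesis.

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;

import java.util.ArrayList;
import java.util.List;

public class ChessboardCalibration {
    /** Estimates the camera matrix and distortion coefficients from chessboard images. */
    public static void calibrate(List<String> imagePaths, int cols, int rows, float squareSize) {
        Size patternSize = new Size(cols, rows);
        List<Mat> objectPoints = new ArrayList<>();
        List<Mat> imagePoints = new ArrayList<>();

        // Ideal 3D chessboard corners (z = 0), identical for every image.
        MatOfPoint3f objectCorners = new MatOfPoint3f();
        List<Point3> corners3d = new ArrayList<>();
        for (int y = 0; y < rows; y++)
            for (int x = 0; x < cols; x++)
                corners3d.add(new Point3(x * squareSize, y * squareSize, 0));
        objectCorners.fromList(corners3d);

        Size imageSize = new Size();
        for (String path : imagePaths) {
            Mat gray = Imgcodecs.imread(path, Imgcodecs.IMREAD_GRAYSCALE);
            imageSize = gray.size();
            MatOfPoint2f corners = new MatOfPoint2f();
            if (Calib3d.findChessboardCorners(gray, patternSize, corners)) {
                imagePoints.add(corners);
                objectPoints.add(objectCorners);
            }
        }

        Mat cameraMatrix = new Mat();   // corresponds to the intrinsic matrix A of (2.7)
        Mat distCoeffs = new Mat();     // distortion coefficients, OpenCV order (k1, k2, p1, p2, k3)
        List<Mat> rvecs = new ArrayList<>();
        List<Mat> tvecs = new ArrayList<>();
        Calib3d.calibrateCamera(objectPoints, imagePoints, imageSize,
                cameraMatrix, distCoeffs, rvecs, tvecs);
    }
}
```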


Figure 2.11: Camera calibration chessboard photographed from different angles.

Figure 2.12: Encoded field of control points for camera calibration.

2.2 3D Real-Time Rendering

Virtual elements are an important part of AR systems. To allow real-time interaction, they must be displayed multiple times per second. In general, real-time rendering is described as the rapid creation and display of images, referred to as frames, on monitors. Graphics can be considered real-time with a rate of at least 15 frames per second (FPS), although some flickering can then still be perceived. According to [37], typically 60 FPS is a good value for a fluent visualization, preventing the human eye from distinguishing between separate frames. The FPS usually depend on the complexity of the calculations during every frame. For instance, the smoothing of edges (anti-aliasing) requires additional calculations and therefore additional time. A fundamental part of 3D real-time rendering is the graphics rendering pipeline. It describes the necessary steps to produce a 2D image from a 3D virtual world using a virtual camera, taking light sources, textures, etc. into account.

2.2.1 Graphics Rendering Pipeline

As in [37], the pipeline can be divided into three conceptual parts: application stage, geometry stage and rasterizer stage , each consisting of sub-pipelines (Figure 2.13).

Figure 2.13: General conceptual parts of the graphics rendering pipeline according to [37].

Application Stage

The application stage deals with the actual software and is carried out on the central processing unit (CPU). Here, the application developer has full control over the implementation and can thus increase or decrease the application's performance. For example, user input from a keyboard or mouse is processed and, most importantly, the preparations for the geometry stage are made.

Geometry Stage

The geometry stage is typically performed on the graphics processing unit (GPU) which is commonly optimized to process rendering primitives, such as points, lines and triangles. Therefore, complex geometries must be decomposed into the three named basic elements. Polygons, for example, can be decomposed into a set of triangles. This process is referred to as polygon triangulation. The geometry stage is where transformations and mappings are executed. More specifically, it is where objects are transformed from model space (local space) to screen space (Figure 2.14).

Figure 2.14: Steps of the geometry stage.

Commonly, translations, rotations and scaling are combined in a single matrix. Some of the most important transformation matrices of the process are the model matrix, view matrix and projection matrix.

An object commonly resides in local space. This is the coordinate space that was used when creating the object, for example by specialized modeling software. So, the vertices of the object are relative to the local origin $(0, 0, 0)$. A virtual 3D world uses its own coordinate system defined by the application programming interface (API). The Open Graphics Library 18 (OpenGL) for instance uses a right-handed system by convention, in which the positive $x$-axis points to the right, the positive $y$-axis up and the positive $z$-axis backwards [44] (Figure 2.15).

Figure 2.15: The axes of the OpenGL coordinate system.

Direct3D 19 on the other side commonly uses a left-handed coordinate system. Objects are placed relative to the global origin $(0, 0, 0)$ in this coordinate space, which is referred to as world space.

18 https://www.opengl.org/
19 https://docs.microsoft.com/en-us/windows/desktop/direct3d

In world space, every object has its own location, rotation and scale, which is where the model matrix becomes important. The model matrix $\mathbf{M}$ is a combination of the translation matrix $\mathbf{T}$, rotation matrix $\mathbf{R}$ and scaling matrix $\mathbf{S}$. $\mathbf{M}$ transforms an object with its vertices from local space to world space. In reality, people can only see what is visible from their position and orientation. Similarly, this should also be the case in the virtual world. This is referred to as view space, eye space or camera space. So, camera space is the space as seen from the camera, thus, the user's view. To accomplish a transformation from world space to view space, combined translations and rotations, saved in the view matrix $\mathbf{V}$, are applied to the world coordinates. Next, the coordinates must be transformed from view space to clip space. In this step only coordinates that fall within a specific range are kept; all other coordinates outside this range are clipped. This is done with the projection matrix $\mathbf{P}$, which creates the clipping volume. Two commonly used projection matrices are the orthographic projection matrix $\mathbf{P}_{ortho}$ and the perspective projection matrix $\mathbf{P}_{persp}$, as shown in [36].

$\mathbf{P}_{ortho} = \begin{pmatrix} \frac{2}{right-left} & 0 & 0 & -\frac{right+left}{right-left} \\ 0 & \frac{2}{top-bottom} & 0 & -\frac{top+bottom}{top-bottom} \\ 0 & 0 & \frac{-2}{far-near} & -\frac{far+near}{far-near} \\ 0 & 0 & 0 & 1 \end{pmatrix}$

$\mathbf{P}_{ortho}$ defines a rectangular box for the clipping volume (Figure 2.6). Characteristics of the orthographic projection (also referred to as parallel projection) are that it does not modify the size of objects, no matter how distant they are in relation to the camera, and that parallel lines remain parallel. This is especially useful for 2D scenes. However, this type of view does not comply with human perception. In reality, the closer an object lies, the larger it appears and vice versa. This can also be achieved in virtual worlds using a perspective projection. $\mathbf{P}_{persp}$ describes a truncated pyramid referred to as a frustum (Figure 2.7). It consists of six planes: top, bottom, left, right, near and far.

$\mathbf{P}_{persp} = \begin{pmatrix} \frac{2 \cdot near}{right-left} & 0 & \frac{right+left}{right-left} & 0 \\ 0 & \frac{2 \cdot near}{top-bottom} & \frac{top+bottom}{top-bottom} & 0 \\ 0 & 0 & -\frac{far+near}{far-near} & -\frac{2 \cdot far \cdot near}{far-near} \\ 0 & 0 & -1 & 0 \end{pmatrix}$

To define a view frustum, usually four parameters are necessary:
• FOV
• Aspect ratio
• Near plane
• Far plane

The FOV is the observable area of sight (a typical value is 45.0 degrees), the aspect ratio is the ratio between width and height of the canvas window, the near plane defines the closest distance and the far plane the farthest distance at which an object can still be observed. Given the values for the near and far plane, the remaining four planes can be calculated with the following equation (2.14), according to [36]:

$top = near \cdot \tan\left(\frac{FOV}{2} \cdot \frac{\pi}{180}\right), \qquad bottom = -top, \qquad right = top \cdot aspect, \qquad left = -right$ (2.14)

The perspective projection is typically used in 3D rendering to create realistic scenes, as in first-person perspective scenes. Concluding, to receive clip coordinates in clip space the following transformation, expressed in equation (2.15), is necessary, given a vertex $\mathbf{v}_{local}$ in local space:

$\mathbf{v}_{clip} = \mathbf{P} \cdot \mathbf{V} \cdot \mathbf{M} \cdot \mathbf{v}_{local}$ (2.15)

Matrix multiplication is reversed, since the multiplication starts on the right side. In a last step, objects are transformed from clip space to screen space and sent to the rasterizer to visualize them on screen. Modern displays use 2D rectangular pixel grids to visualize images (Figure 2.16).
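The following sketch ties equations (2.14) and (2.15) together: the frustum planes are derived from FOV, aspect ratio and the near plane, and a vertex is transformed from local space to clip space. It again assumes Android's android.opengl.Matrix helper as an example; the method and parameter names are illustrative.

```java
import android.opengl.Matrix;

public class ProjectionExample {
    /**
     * Builds a perspective projection matrix from FOV, aspect ratio, near
     * and far plane (equation (2.14)) and combines it with the given view
     * and model matrices to obtain clip coordinates (equation (2.15)).
     */
    public static float[] toClipSpace(float[] model, float[] view, float[] vLocal,
                                      float fovDeg, float aspect, float near, float far) {
        // Equation (2.14): derive the frustum planes from FOV and aspect ratio.
        float top = (float) (near * Math.tan(Math.toRadians(fovDeg / 2.0)));
        float bottom = -top;
        float right = top * aspect;
        float left = -right;

        float[] projection = new float[16];
        Matrix.frustumM(projection, 0, left, right, bottom, top, near, far);

        // Equation (2.15): v_clip = P * V * M * v_local (evaluated right to left).
        float[] pv = new float[16];
        float[] pvm = new float[16];
        Matrix.multiplyMM(pv, 0, projection, 0, view, 0);
        Matrix.multiplyMM(pvm, 0, pv, 0, model, 0);

        float[] vClip = new float[4];
        Matrix.multiplyMV(vClip, 0, pvm, 0, vLocal, 0);
        return vClip;
    }
}
```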

Figure 2.16: A screen with a coordinate system in which the x-axis points right and the y-axis points downwards.

A pixel is the smallest element and has two properties, a position and a color. The position is expressed in $(x, y)$-coordinates with the origin $(0, 0)$ located at the top-left corner of the screen. The $x$-axis points right (width) and the $y$-axis down (height). The color of a pixel is assigned using the RGB color model. The resolution of a display can be calculated by multiplying the rows with the columns of the raster grid. A typical resolution of a modern display is 1920×1080 (Full HD).

Rasterizer Stage

In the last stage of the pipeline, pixels are calculated and assigned colors in the rasterization process. A buffer holds the rasterized image. Commonly, double buffering is used, which consists of a back buffer and a front buffer. The rasterization process is performed in the background in the back buffer, while the contents of the front buffer are displayed on the screen. When the rasterization process in the back buffer is completed, front and back buffer are swapped. This has the advantage that the rasterization process is not visible on screen, making the visualization smoother.

2.2.2 Tessellation

As mentioned, to effectively render geometry on a GPU, surfaces must be split into a set of polygons. This process is referred to as tessellation. Commonly, graphics APIs and hardware are optimized for triangles. The process of polygonal tessellation for triangles is called polygon triangulation. A triangle is defined as a simple polygon with only three points (vertices) connected by lines (edges) in which no diagonal line can be spanned.

A simple polygon is defined as an ordered set of vertices $v_0, \ldots, v_{n-1}$ which are connected consecutively by edges $\langle v_i, v_{i+1} \rangle$ (Figure 2.17). The edges may not intersect each other, except at their shared vertices. Polygons may also contain one or multiple holes (Figure 2.17 - c). The polygon then consists of an outer polygon and one or multiple inner polygons, which are described by a set of vertices, commonly in clockwise order if the outer polygon is defined by vertices in counter-clockwise order, or vice versa.

Figure 2.17: Shapes of a polygon. (a) Convex polygon (left); (b) Concave polygon (middle); (c) Convex polygon with holes (right).

Given a polygon without holes, it is convenient to examine the shape of the polygon before the triangulation process, classifying it as convex (Figure 2.17 - a) or concave (Figure 2.17 - b). This can be tested by calculating the interior angle of two connected edges. If the interior angle is larger than 180 degrees, a vertex is concave (reflex vertex), and if it is smaller than 180 degrees, the vertex is convex. If the polygon is convex, it is trivial to triangulate it in linear time, by for example selecting a random vertex and connecting it with all other vertices, except its neighbouring vertices (Figure 2.18).

Figure 2.18: Polygon triangulated by connecting a vertex with all other vertices of the polygon, except its direct neighbours.

If the polygon is concave, it must be treated differently to retain the holes and gaps. A popular triangulation algorithm in computer graphics is the ear-clipping algorithm.

Polygon Triangulation with Ear-Clipping

The goal of the ear-clipping algorithm is to find a triangle $\{v_{i-1}, v_i, v_{i+1}\}$ for which the diagonal $\langle v_{i-1}, v_{i+1} \rangle$ does not intersect any edges of the polygon, so that $v_i$ forms an ear (ear tip), and remove it from the polygon. This process is repeated until all ears have been removed and the remainder of the polygon is a triangle itself (Figure 2.19).


Figure 2.19: Progress of the ear-clipping algorithm. (a) Non-triangulated polygon (top-left); (b) polygon with clipped ear (top-right); (c) polygon with two clipped ears (bottom-left); (d) three clipped ears (bottom-right).

A simple ear-clipping algorithm can be implemented with an order of $O(n^3)$, by checking if three subsequent vertices form an ear. With some careful attention paid to the ear finding process, the algorithm can be optimized to $O(n^2)$ [45]. The ear-clipping method can also be applied to polygons with holes by adding some additional steps. This can be achieved by connecting the inner polygon rings to the outer polygon ring. A prerequisite for this is that the inner and outer rings are defined in opposite directions, as mentioned earlier. A connecting edge can then be spanned between the interiors and exterior, so that a single continuous exterior ring is created, thus forming a simple polygon (Figure 2.20). The ear-clipping method can then be applied to the polygon.

Figure 2.20: Two interior polygons (holes) linked to the outer polygon by edges.
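The principle of the ear-clipping method can be sketched in a few lines of Java. The following simplified O(n³) variant assumes a simple polygon without holes whose vertices are given in counter-clockwise order; it is an illustration of the general algorithm, not the customized implementation developed in this thesis.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal ear-clipping sketch for a simple CCW polygon without holes. */
public class EarClipping {
    /** Returns triangles as index triples into the input vertex list ({x, y} points). */
    public static List<int[]> triangulate(List<double[]> poly) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < poly.size(); i++) idx.add(i);
        List<int[]> triangles = new ArrayList<>();

        while (idx.size() > 3) {
            boolean clipped = false;
            for (int i = 0; i < idx.size(); i++) {
                int prev = idx.get((i - 1 + idx.size()) % idx.size());
                int curr = idx.get(i);
                int next = idx.get((i + 1) % idx.size());
                if (isEar(poly, idx, prev, curr, next)) {
                    triangles.add(new int[] { prev, curr, next });
                    idx.remove(i);           // clip the ear tip
                    clipped = true;
                    break;
                }
            }
            if (!clipped) break;             // degenerate input, give up
        }
        triangles.add(new int[] { idx.get(0), idx.get(1), idx.get(2) });
        return triangles;
    }

    private static boolean isEar(List<double[]> p, List<Integer> idx, int a, int b, int c) {
        if (cross(p.get(a), p.get(b), p.get(c)) <= 0) return false; // reflex vertex
        for (int j : idx) {                  // no other remaining vertex may lie inside the ear
            if (j == a || j == b || j == c) continue;
            if (inTriangle(p.get(j), p.get(a), p.get(b), p.get(c))) return false;
        }
        return true;
    }

    /** z-component of the cross product (a-o) x (b-o); > 0 means a left turn. */
    private static double cross(double[] o, double[] a, double[] b) {
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0]);
    }

    private static boolean inTriangle(double[] pt, double[] a, double[] b, double[] c) {
        return cross(a, b, pt) >= 0 && cross(b, c, pt) >= 0 && cross(c, a, pt) >= 0;
    }
}
```

Each iteration searches for a convex vertex whose ear triangle contains no other remaining vertex, emits that triangle and removes the ear tip, until only a single triangle remains.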

Delaunay Triangulation

The triangles produced by the ear-clipping algorithm do not conform to any conditions, so that long and thin triangles (sliver triangles) are possible. While the triangle shape generally does not influence the rendering performance, it has undesirable properties for some processes, such as interpolation [46]. Here, it is preferable to avoid slivers and to aim for uniformly shaped triangles that maximize the minimum interior angles. Such a triangulation is referred to as Delaunay triangulation (DT) [47]. A DT employs the concept of circumcircles, which contain the three vertices of a triangle, so that no other vertex of the dataset lies within the circle, as shown in Figure 2.21.


Figure 2.21: Delaunay triangulation with empty circumcircles.

There are different algorithms to construct a DT. As described in [48], a straightforward algorithm uses an edge-flipping technique which progresses as follows: Given two triangles ABD and BCD with a common edge BD, the triangles meet the Delaunay condition if the sum of the angles α and γ is smaller than or equal to 180°.

Figure 2.22: The triangulation does not meet the Delaunay condition, since the sum of the angles α and γ is larger than 180° and the circumcircle of ABD contains the vertex C.

If the two triangles do not meet the Delaunay condition, then flipping the common edge BD to AC results in two triangles that are Delaunay conform. Therefore, given a randomly triangulated set of points $P$, a DT($P$) can be produced by flipping the common edges until all triangles meet the condition. In a worst-case scenario, this algorithm has a complexity of $O(n^2)$. An alternative approach to creating DT($P$) is to incrementally construct it according to [49], [50]. When implemented with caution, the algorithm can run in $O(n \log n)$. The idea is to add each point of $P$ separately and to re-triangulate the affected parts accordingly, maintaining the DT of the previously inserted points. Given an existing DT of a subset $P_i \subset P$, an update step is performed by inserting a new point $s \in P$ into DT($P_i$) and finding the triangle $t$ that contains $s$. The vertices of $t$ are connected to $s$, resulting in three triangles inside of $t$ (Figure 2.23). Next, the edge-flipping algorithm is applied until the triangles meet the Delaunay conditions.

Figure 2.23: Delaunay triangulation with newly inserted point s.
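The Delaunay condition for two triangles sharing an edge can also be checked with the equivalent circumcircle test, as sketched below. The sign convention assumes counter-clockwise vertex order; the code is illustrative only.

```java
/**
 * Checks the Delaunay condition for two triangles ABD and BCD sharing the
 * edge BD: the pair is Delaunay conform if C does not lie strictly inside
 * the circumcircle of ABD. Points are given as {x, y}; the triangle ABD is
 * assumed to be in counter-clockwise order.
 */
public class DelaunayCondition {
    public static boolean isDelaunay(double[] a, double[] b, double[] d, double[] c) {
        return inCircumcircle(a, b, d, c) <= 0; // otherwise, flip edge BD to AC
    }

    /** Determinant test: > 0 if p lies inside the circumcircle of the CCW triangle (a, b, c). */
    private static double inCircumcircle(double[] a, double[] b, double[] c, double[] p) {
        double ax = a[0] - p[0], ay = a[1] - p[1];
        double bx = b[0] - p[0], by = b[1] - p[1];
        double cx = c[0] - p[0], cy = c[1] - p[1];
        return (ax * ax + ay * ay) * (bx * cy - cx * by)
             - (bx * bx + by * by) * (ax * cy - cx * ay)
             + (cx * cx + cy * cy) * (ax * by - bx * ay);
    }
}
```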

Constrained Delaunay Triangulation

If any edges of, for instance, a polygon are already prescribed, the normal DT is not sufficient, since certain vertices then should not be connected to each other. This is the case for concave polygons and polygons with holes (Figure 2.24). For this, a Constrained Delaunay Triangulation (CDT) [51], [52] can be used. A CDT allows the definition of constraining edges which the triangles may not cross. The disadvantage is that some triangles can be produced that do not conform to the Delaunay conditions.

Figure 2.24: The gap of a concave polygon should not be closed.

2.2.3 Graphics Libraries

To visualize the triangulated polygons, graphics libraries and frameworks are utilized that provide the means to create 2D and 3D computer applications.

OpenGL / OpenGL ES

A CPU is one of the main components of a modern computer and is used to execute software, more specifically, the instructions contained in the software. Real-time rendering in 2D and especially 3D is a computationally expensive task. Since real-time rendering has become a fundamental part of everyday life (e.g. computer games), specialized hardware has become available to increase the performance of such applications. GPUs are specifically designed to accelerate graphical processing. The main difference between a CPU and a GPU is their architecture. While a CPU typically only has a few, yet efficient cores, a GPU has many less efficient, but highly specialized ones. As a result, GPUs are more suited to handle particular tasks that can be parallelized, for instance real-time rendering of large amounts of data. When implementing 2D/3D graphics applications, the developer needs some possibility to interact with the GPU. OpenGL is such an established low-level API. It is a cross-platform API for rendering 2D and 3D vector graphics using a set of functions. For instance, triangulated data can be handed to the GPU and instructions on how to handle it can be given. An alternative version, OpenGL for Embedded Systems 20 (OpenGL ES), focuses on embedded systems, such as smartphones, tablets, etc., and uses a subset of the functions of OpenGL. OpenGL and OpenGL ES both mainly focus on rendering though, so that other parts of a full 3D real-time experience, such as sound and physics, must be realized using additional frameworks.

20 https://www.khronos.org/opengles/
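A minimal OpenGL ES 2.0 sketch of handing triangulated data to the GPU is shown below. A valid GL context and a compiled and linked shader program with an attribute named "aPosition" are assumed to exist; their setup is omitted for brevity and the names are purely illustrative.

```java
import android.opengl.GLES20;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Hands triangle vertex data to the GPU and issues a draw call (OpenGL ES 2.0). */
public class TriangleRenderer {
    public static void drawTriangles(int programHandle, float[] triangleVertices) {
        // Copy the vertex data (x, y, z per vertex) into native memory.
        FloatBuffer vertexBuffer = ByteBuffer
                .allocateDirect(triangleVertices.length * 4)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
        vertexBuffer.put(triangleVertices).position(0);

        GLES20.glUseProgram(programHandle);
        int positionHandle = GLES20.glGetAttribLocation(programHandle, "aPosition");
        GLES20.glEnableVertexAttribArray(positionHandle);
        GLES20.glVertexAttribPointer(positionHandle, 3, GLES20.GL_FLOAT,
                false, 3 * 4, vertexBuffer);

        GLES20.glDrawArrays(GLES20.GL_TRIANGLES, 0, triangleVertices.length / 3);
        GLES20.glDisableVertexAttribArray(positionHandle);
    }
}
```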

Game Engines

Driven by the high demand for 3D applications, such as computer games, in the past years, higher-level frameworks have evolved to simplify recurring tasks when creating real-time applications. These frameworks are commonly built on low-level APIs, such as OpenGL or Direct3D. Such frameworks are game engines, typically used in the computer game industry. In many cases, a game engine provides an entire suite of tools for the rapid development of computer games. These tools commonly offer graphical user interfaces (GUI) to visually assist in the development process and take care of all necessary steps for rendering geometries. In comparison to low-level interfaces like OpenGL or Direct3D, game engines usually bundle the necessary methods to solve certain problems in a visual "building block" manner. The 2D/3D graphics renderer of a game engine normally also comprises optimizations towards the visualization process, to allow for fluent real-time rendering. Many engines use a scene graph-based approach to organize the game objects, such as buildings, trees and characters, logically and spatially. Basically, a scene graph is a tree with nodes in which parent and child objects can be defined (Figure 2.25). This has multiple advantages. For example, given a building with multiple components, like walls, floors, ceilings, doors and windows, a simple structure can be constructed: a node represents the building and the building components are attached to this node as child nodes (see Figure 4.9). Now, every operation (e.g. transformation) performed on the building node is also performed on the child nodes, so that only a single operation is sufficient to transform all objects.

Figure 2.25: Example structure of a scene graph.
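The parent-child behaviour of a scene graph can be sketched with a simple node class: each node stores a local transform and its children, and the world transform of a node is the concatenation of all transforms along its parent chain. The sketch below again uses android.opengl.Matrix for the matrix algebra and is not based on any particular engine.

```java
import android.opengl.Matrix;

import java.util.ArrayList;
import java.util.List;

/** Minimal scene graph node: local transform plus children. */
public class SceneNode {
    public final float[] localTransform = new float[16]; // column-major 4x4
    private final List<SceneNode> children = new ArrayList<>();

    public SceneNode() {
        Matrix.setIdentityM(localTransform, 0);
    }

    public void addChild(SceneNode child) {
        children.add(child);
    }

    /** Recursively computes world transforms: world = parentWorld * local. */
    public void updateWorldTransforms(float[] parentWorld, List<float[]> outWorldTransforms) {
        float[] world = new float[16];
        Matrix.multiplyMM(world, 0, parentWorld, 0, localTransform, 0);
        outWorldTransforms.add(world);
        for (SceneNode child : children) {
            child.updateWorldTransforms(world, outWorldTransforms);
        }
    }
}
```

Changing the building node's localTransform therefore implicitly moves all attached walls, doors and windows, as described above.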

In addition to optimizations for the rendering process, game engines generally offer possibilities to apply post-processing effects like ambient cubemaps, ambient occlusion, bloom, color grading, depth of field, eye adaptation, lens flares, light shafts, temporal anti-aliasing, tone mapping or visual effects, such as rain, snow or fog, to the scene. Game engines normally also provide tools for sound, physics and artificial intelligence. Widely accepted and standardized graphics formats such as the Collaborative Design Activity 21 (COLLADA) by the Khronos Group are supported out of the box with suitable loaders. Typically, modern game engines can deploy the developed games or applications to current generation platforms, such as Windows, OSX, Android and iOS. Some of the most popular and

21 https://www.khronos.org/collada/

elaborate game engines are Unreal Engine 22, Unity 23, Frostbite Engine 24, CryEngine 25, jMonkey 26 and Godot 27. A current trend is to provide game engines for free and even make them open source, allowing hobby developers to create applications with professional tools. Unreal Engine, for instance, is available completely for free along with the complete source code for private and commercial use, up to a certain revenue threshold.

2.3 Data for Augmented Reality

While game engines provide the necessary tools for creating and rendering virtual elements, proper data is also necessary for AR. Many exchange formats for 2D/3D models have evolved over the years. Some of the most popular, such as OBJ 28 by Wavefront Technologies or the open standard XML schema COLLADA, are typically supported by game engines. For location-based mobile AR these formats are less interesting though since the data typically has no actual connection to the physical world. World-registered overlays require some kind of spatial reference. Such data that has a reference to the planet’s surface is referred to as geographic data or geospatial data . While in earlier days

22 https://www.unrealengine.com
23 https://unity3d.com
24 https://www.ea.com/frostbite/engine
25 https://www.cryengine.com/
26 http://www.jmonkeyengine.org
27 https://godotengine.org
28 http://www.martinreddy.net/gfx/3d/OBJ.spec

publicly available geospatial data was fairly rare or costly, recent advancements in the geospatial domain have very much changed this. OpenStreetMap 29 (OSM) is a popular example for publicly available geospatial data. The data is created by the community and made available for free. Furthermore, countries are increasingly promoting open data and making government data available to the public using web-based portals. An example for this is the open data portal Open.NRW 30 of the German state of North Rhine-Westphalia (NRW). It gives access to various types of data like economic and geospatial data.

2.3.1 Geospatial Data

Today geospatial data is created, stored and exchanged in various formats. In general, these formats can be categorized into vector and raster formats. Raster data consists of rows and columns of cells which each store a single value to reflect a digitized abstraction of reality. A typical use case is aerial photography. Popular formats are for example GeoTIFF 31 and JPEG2000 32. Vector formats on the other hand describe geographical features using primitive geometries, such as points, lines and polygons. A well-established vector format for exchanging geospatial data is the XML grammar-based GML encoding standard ([53]) issued by the Open Geospatial Consortium (OGC). It consists of two parts, the schema, containing descriptions

29 https://www.openstreetmap.org
30 https://open.nrw/
31 https://trac.osgeo.org/geotiff/
32 https://jpeg.org/jpeg2000/

about the document, and the document which contains the actual data. The GML model includes a variety of primitives (e.g. points, lines or triangles), to model application specific schemas, as for example realized with CityGML.

CityGML

With an increasing demand for digital 3D city models for urban planning purposes, facility management and tourism, the need for more complex analysis tasks has arisen. Based on GML, an application schema named CityGML was developed to provide a standardized way of describing digital 3D city models, including semantic information to enable advanced city and architectural planning, simulations and navigation tasks. CityGML [7] is a common semantic information and open data model based on XML, developed by the Special Interest Group 3D (SIG 3D), which is part of the Spatial Data Infrastructure Germany (GDI-DE) initiative, and issued as an international standard by the OGC and the ISO TC211. It is realized as a GML application schema using version 3.1.1 (GML3). It also uses various standards from the ISO 191xx family, OGC, W3C Consortium, Web 3D Consortium and OASIS. It offers a standardized model to represent, store and exchange virtual 3D city models in five different LODs, with LOD0 offering the lowest modeling accuracy, presentation detail and thematic properties and LOD4 the highest. Unlike pure graphic formats, such as OBJ, CityGML additionally addresses semantic and thematic properties of spatial objects, their relations and aggregations (part-whole-relations). Basically, the CityGML model consists of three separate models, the spatial model, the appearance model and the thematic model. City objects are described geometrically and

topologically by the GML3-based geometry model (ISO 19107 'Spatial Schema'). Objects are represented using the boundary representation (B-Rep) model, which describes a solid by its bounding surfaces that connect to their neighboring surfaces, so that no gaps occur (Figure 2.26).

Figure 2.26: Two boundary surfaces of a wall modelled using B-Rep.

CityGML's features are visually described using the appearance model, derived from specifications of the X3D 33 and COLLADA formats, to define their color and texture properties. CityGML's thematic model describes important objects that are part of a city, such as buildings, roads, bridges and vegetation (Figure 2.27). CityGML offers a total of 13 modules and can be used for a wide range of applications, some typical ones being environmental simulations, energy demand estimations, city lifecycle management,

33 http://www.web3d.org/x3d/what-x3d

urban facility management, real estate appraisal, disaster management, pedestrian navigation, robotics, urban data mining, and location-based marketing [7], [54], [55].

Figure 2.27: City model of Berlin. Courtesy of Berlin Partner für Wirtschaft und Technologie GmbH [56].

Today, CityGML is used worldwide, with Germany, France, the Netherlands, Switzerland, Denmark, Turkey and Malaysia being only a few of the countries that provide digital 3D city models in the CityGML format. As of 2018, in Germany the Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland (AdV), the German land surveying and cadaster committee, will be extending its AFIS-ALKIS-ATKIS-Model (AAA-Model) in Version 7 of the GeoInfoDok with 3D building models, based on the CityGML format [57].

Elevation Models

Generally, there are digital elevation models (DEM), digital surface models (DSM) (Figure 2.29) and digital terrain models (DTM) (Figure 2.28). Unfortunately, the terms are used differently around the world. Commonly, DEM is used as a generic term for DSM and DTM. In most cases, a DSM represents a surface including the objects on it, man-made or natural (e.g. buildings, trees, etc.). In contrast, a DTM represents only the ground surface, excluding the objects on it.

Figure 2.28: An example of a DTM based on data from Aachen [58].

Figure 2.29: An example of a DSM based on data from Aachen [58].

CityGML also enables modeling terrain with its module Relief. It offers four types to specify the terrain: as a regular raster or grid (RasterRelief), as a triangulated irregular network (TIN, TINRelief), by break lines (BreaklineRelief), or by mass points (MasspointRelief). But commonly terrain data is exchanged using simple ASCII files in which a single text row represents the x-, y- and z-data of a single point.

2.3.2 Building Information Modeling

An alternative source of building-related data can be found in the architecture, engineering and construction (AEC) domain, where building information modeling (BIM) has gained great importance in recent years and is already well advanced in Scandinavian countries, the USA, Great Britain and also in Australia, Singapore and China. In Germany, all public projects of the Federal Ministry of Transport

and Digital Infrastructure (BMVI) are to be realized with BIM from 2020. The Road Map for Digital Design and Construction [59] envisages a three-stage implementation for BIM. BIM is a digital 3D model-based process that spans the entire lifecycle of a project, from the planning and design phase to cost, construction and facility management. The extensive list of software solutions (e.g. Autodesk Revit 34, Bentley AECOsim Building Designer 35) to solve BIM-related tasks is growing continuously. As an exchange format, the open file format industry foundation classes (IFC) [60], a data model developed by buildingSMART, has evolved into a standard in the industry. It is registered as an official international standard ISO 16739:2013 and is based on the EXPRESS specification. To store the data in an actual file, three formats are possible: *.ifc using a STEP physical file structure, *.ifcXML using an XML document structure and *.ifcZIP which compresses one of the preceding formats. In IFC, geometries are typically described using constructive solid geometry (CSG) (Figure 2.30, Figure 2.31), but other modelling techniques, such as B-Rep, are also allowed. Further in-depth details about BIM can be found in [61].

34 https://www.autodesk.de/products/revit/overview
35 https://www.bentley.com/en/products/product-line/building-design-software/aecosim-building-designer

Figure 2.30: A wall modelled using CSG.

In contrast to the B-Rep model, simple solid objects, such as cuboids, cylinders, pyramids or cones, referred to as primitives, are commonly used to construct more complex objects by means of Boolean operations, such as difference, intersection or union [62] (Figure 2.31).

Figure 2.31: Boolean operations used to construct new objects. The top-right geometry is subtracted from the top-left geometry, resulting in the bottom-right geometry.

2.3.3 Data Parsers

To import data into a software application, some kind of interface to load the data is required. While, for example, many game engines support typical 3D graphics formats specific to the gaming domain, such as COLLADA, capabilities to handle other formats like CityGML or IFC from other domains must be added using custom solutions. To do so, the data must be read and interpreted in some way. For the XML-based CityGML, XML parsers can be used. In general, a parser reads input data so that it can be made available in a suitable format, depending on the use case. A typical additional functionality is the validation of the data structure and syntax. XML parsers are a specialized form of a parser. They read information encoded according to the XML language specification. Three different types are commonly used.

Document Object Model-based Parsers

The document object model (DOM) [63] is a standard defined by the W3C to access and manipulate HTML and XML documents. It creates a hierarchical tree structure in memory where each element is represented as a node. So, for a document, one tree is created with the nodes arranged according to their original hierarchy. It consists of three different parts:
• Core DOM
• XML DOM
• HTML DOM

The advantages of DOM are that it is language and platform independent, the DOM tree is traversable, and nodes are modifiable.

It allows users to navigate the hierarchy and search for specific information. Information can be added, edited, moved or removed. The main disadvantage of this model is that the memory consumption is relatively high (see chapter 3.3).
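A minimal DOM example in Java is shown below: the whole document is parsed into an in-memory tree which can then be traversed and searched. The element name bldg:Building is only an illustrative choice for a CityGML file.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;

/** Loads an XML document into a DOM tree and searches it by element name. */
public class DomExample {
    public static int countBuildings(File cityGmlFile) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(cityGmlFile);   // entire tree kept in memory

        NodeList buildings = document.getElementsByTagName("bldg:Building");
        return buildings.getLength();
    }
}
```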

Push- / Pull-based Parsers

An alternative method of parsing XML documents is a push-based approach. No actual standard is defined for this method, but the Simple API for XML (SAX) can be considered the common method. SAX operates sequentially on XML data and reports events when they occur. Events are typically triggered when typical components of an XML document are reached, for example an opening XML tag. It is up to the custom implementation to handle the incoming data of such events. This has the advantage that the developer has full control over the functionality. And, especially in comparison to XML DOM, the sequential event-driven approach of SAX is much more memory efficient. Instead of keeping all the information in memory as DOM does, SAX discards it after every event. This makes it a performant method, especially for large XML documents. The disadvantage is that SAX is a relatively low-level API, which in turn is more difficult to use. For instance, there is no straightforward way of moving back to previous events to retrieve information. The developer must also keep track of the current position in the hierarchy with some custom implementation. This is a particularly complex task in deeply nested XML structures. Similar to the push-based method, a pull-based one also exists. The basic process of both is the same. As with the push parser, the pull parser sequentially reads the XML document. The main difference is that instead of firing events when they occur, the pull parser enables

the developer to define events that he is interested in. This has the advantage that other non-relevant events are omitted.
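The following sketch illustrates pull parsing with the XmlPullParser API available on Android: the loop advances through the document and reacts only to the events of interest, here gml:posList elements of a CityGML file. It is a simplified illustration, not the importer implemented in this thesis.

```java
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserFactory;

import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Reads all gml:posList coordinate strings from a CityGML stream. */
public class PullParserExample {
    public static List<String> readPosLists(Reader cityGmlReader) throws Exception {
        XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XmlPullParser parser = factory.newPullParser();
        parser.setInput(cityGmlReader);

        List<String> posLists = new ArrayList<>();
        int event = parser.getEventType();
        while (event != XmlPullParser.END_DOCUMENT) {
            if (event == XmlPullParser.START_TAG && "posList".equals(parser.getName())) {
                posLists.add(parser.nextText()); // coordinate string of one polygon ring
            }
            event = parser.next();               // advance only on demand
        }
        return posLists;
    }
}
```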

2.3.4 Databases

Large amounts of data require some sort of efficient structure and way of access. While any file type can represent a kind of structured data storage, files often are limited in multiple ways concerning such aspects as data retrieval or saving performance, availability or security. A better solution is to utilize a database system (DBS), which is a database in combination with a database management system (DBMS). In general, a database is referred to as a structured collection of data and a DBMS offers the means to interact with the database. Some typical functions of a DBMS are database creation (thus, defining data types and storing data), data retrieval and manipulation, security management, access control and data integrity, backup and recovery. Another important aspect of a database is the abstraction of data. Data can be described in an abstract way by a data model which, for example, defines how data elements relate to each other. Entity-Relationship (ER) models provide a degree of independence from the actual database in the designing process, allowing data models to be described in such an abstract manner. Later in the database creation process, a database-specific data model (logical data model/physical data model) expressed in a schema must be derived from the ER model. This, among others, depends on the type of chosen DBMS.

Database Models

Relational databases and relational database management systems

(RDBMS) are a popular and well-established solution due to their simplicity. They are based on the relational model, which utilizes mathematical relations [64]. A database is formed by a collection of relations, which can be described as tables with values. A row (tuple) of a table represents related values and a column is an attribute. Some examples for DBS using the relational model are Oracle 36, MySQL 37 and PostgreSQL. The structured query language (SQL) offers the means to interact with the data structures and the data itself. It is the most used database language for relational databases and is based on relational algebra. Due to the complexity of some tasks, like storing images or videos, different concepts are necessary. The increase in object-oriented programming languages, such as Java, and the software developed with these, calls for a simplification of integrating databases. Object-oriented databases (or object databases (ODB)) address this issue. Typical concepts of object-orientation like states (attributes) and behaviors (operations) of objects and also hierarchies and inheritance are used. Existing objects can be stored permanently in the object-oriented database, therefore extending the lifecycle of the object and additionally allowing them to be shared across applications. A major advantage of this database model is that the same data model can be used in the programming language and the database. Some examples for object DBS' are DB4Objects 38, Versant 39, Objectivity 40 and

36 https://www.oracle.com/database/index.html
37 https://www.mysql.com
38 http://www.mono-project.com/archived/db4o
39 http://supportservices.actian.com/versant/vod

ObjectStore 41. In general, the acceptance of this model is still quite low, due to a couple of issues. Firstly, there is no actual standard object-oriented query language as there is for the relational model with SQL. Secondly, relational databases have a well-established and experienced user base. Switching to an object-oriented database would imply extensive migrations with complex data mappings, which are time and cost intensive. A compromise between relational databases and ODB are object-relational databases. Basically, these are a combination of the described database models, thus offering the advantages of both while neglecting some of the disadvantages they pose on their own. In addition, SQL supports the necessary commands for objects, classes and inheritance. Also, custom data types and methods can be implemented. An example for such a database is Oracle.

Spatial Databases

With the increase in spatially related questions, specialized databases optimized towards storing and querying objects with spatial characteristics have evolved. These are based on the previously mentioned database models and are referred to as spatial databases. Typically, such databases offer support for basic geometric types like points, lines and polygons. Based on these geometries, usually some spatial measurements like point distances, line lengths or polygon areas are possible. Spatial operations involving multiple geometries, like intersections, unions, etc., are generally also possible.

40 http://www.objectivity.com/products/objectivitydb
41 https://www.ignitetech.com/solutions/information-technology/objectstore

A popular use case for spatial databases is a GIS dealing with geospatial data. The majority of GIS are built around 2D data, so that commonly spatial databases are also restricted to 2D geometries. A few databases can handle 3D objects, or at least 2.5D objects, which are 2D objects including a height component, and offer some rudimentary spatial operations. Oracle Spatial 42 is an example for a spatial database with 2D and 3D support. But GIS is only one use case for such databases. They are also applicable to the AEC sector, for instance. IFC building data can also be stored, managed and processed, for example with BIMserver 43, an open source building information model server. As mentioned, one of the main benefits of a DBS is the performance. Fast data querying is especially important if applications rely on the data provided by the database. A well-established structure for this is an index. A database index works similarly to an index of a book. While a book index gives an overview of where to find certain information in the book, more specifically on which page, a database index is used to quickly find data in the database. The downside of using a database index is the increased time for writing data to the database. This results from the indexing process when inserting data. Spatial databases also have the need for a special type of index, one that can increase the performance of spatial queries. For this, the majority of spatial databases offer spatial indices, commonly implemented using the R-tree.

42 http://www.oracle.com/technetwork/database-options/spatialandgraph/overview/spatialandgraph-1707409.html
43 http://www.bimserver.org

R-tree

As described by [65], the idea of the R-tree structure is to represent geometries by their minimum bounding rectangle (MBR), with minimum and maximum x and y coordinates, and to group neighboring objects in a parent MBR. This hierarchy is realized as a tree (Figure 2.32).

Figure 2.32: MBRs R1 to R6 are arranged for the R-tree structure.

A query is performed by traversing the tree starting from the top and using the bounding boxes to decide whether and into which branch to proceed. This renders the approach especially suitable for large data sets, since non-neighboring geometries are excluded from further searches, leaving only a small subset to inspect more precisely with the

complex and more time-consuming spatial operators. An improved variant of the normal R-tree is presented by the R*tree [66], which usually has a higher performance when querying data. This is accomplished by rearranging the entries to structure the tree in a more advantageous way. The main focus lies on minimizing the overlap between MBRs. This enables higher query performance, since less data must be analysed; every overlapping area requires both MBRs (branches) to be evaluated. A slight decrease in performance when inserting data is the downside of the R*tree in comparison to the normal R-tree. While the R-tree is most frequently utilized for indexing 2D information, variations of the R-tree have also been derived for use in higher dimensions, as for example by [67] in 3D. Here, the general principle of the 2D algorithm using bounding rectangles is extended to bounding boxes.
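The pruning behaviour of an R-tree query can be sketched as follows: a branch is only descended into if its MBR intersects the query window, so non-neighboring geometries are excluded early. Insertion and balancing are omitted; the classes are illustrative and not a production index.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of an R-tree range query over MBRs. */
public class RTreeSketch {
    public static class Mbr {
        public final double minX, minY, maxX, maxY;
        public Mbr(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        public boolean intersects(Mbr o) {
            return minX <= o.maxX && maxX >= o.minX && minY <= o.maxY && maxY >= o.minY;
        }
    }

    public static class Node {
        public final Mbr mbr;
        public final List<Node> children = new ArrayList<>(); // empty for leaf entries
        public final Object geometry;                          // payload of a leaf entry
        public Node(Mbr mbr, Object geometry) { this.mbr = mbr; this.geometry = geometry; }
    }

    /** Collects all leaf geometries whose MBR intersects the query window. */
    public static void search(Node node, Mbr query, List<Object> results) {
        if (!node.mbr.intersects(query)) {
            return;                       // prune this branch entirely
        }
        if (node.children.isEmpty()) {
            results.add(node.geometry);   // candidate for the exact spatial test
        } else {
            for (Node child : node.children) {
                search(child, query, results);
            }
        }
    }
}
```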

Octree

An alternative to a 3D spatial index based on the R-tree is the octree. This tree data structure allows each node to have exactly eight children. Figure 2.33 shows an example for such a tree and a minimum bounding cuboid that is subdivided into octants, according to the octree structure. [68] describes some pioneering work using octrees for 3D computer graphics. An implementation of the octree for an Oracle Spatial DBMS is described by [69].

Figure 2.33: Tree data structure of an octree. Each node has eight children. A minimum bounding cuboid is subdivided into octants.

2.4 Pose Tracking

Before discussing different methods for pose tracking that are applicable in AR, some terminology is explained. A pose generally is referred to as a combination of a translation and an orientation, with 3DoF each. Sometimes the term pose is also used to describe only the orientation of an object, but in this thesis it is referred to as consisting of 6DoF. One of the main characteristics of AR according to [9] is the registration of virtual information in 3D. This refers to joining objects residing in different local coordinate systems in a common coordinate system so that they are aligned to each other. The dynamic process of this is referred to as tracking or 6DoF pose tracking. More practically speaking, the physical camera of an AR device and the virtual camera must be aligned so that their captured images overlay each other on the output display as accurately as possible. Assuming the coordinate frames of the cameras are the same and the physical camera is calibrated to account for image distortions, this can be achieved by determining the extrinsic camera matrix to obtain the camera pose and applying it to the virtual camera accordingly. The goal is to determine the pose as accurately as possible, since any inaccuracies are visible as misalignments of the images, thus of the contained physical and virtual objects.

2.4.1 Coordinate Systems for Positioning

For pose tracking, coordinate systems that describe a point in space are most helpful for defining positions. A widely used system for this is the Cartesian coordinate system. It consists of coordinate axes and an origin, which is the intersection point of the coordinate axes. Numerical coordinates are used to describe a spatial point in the system. A spatial reference system (or Geographical Coordinate System) is a specialized coordinate system that enables the specification of locations on Earth. Specific examples are the geocentric coordinate system , geographic coordinate system and planar (projected) coordinate systems [70].

Geocentric Coordinate System

One way of defining a location on Earth is by using a right-handed Cartesian coordinate-based geocentric XYZ-system, also referred to as Earth-centered, Earth-fixed (ECEF). Its origin is located in Earth's center of mass, the X-axis points to the mean Greenwich meridian, the Z-axis coincides with Earth's mean rotational axis and the Y-axis completes the right-handed coordinate system. The X- and Y-axes are located in the equatorial plane (Figure 2.34).

Figure 2.34: Geocentric Reference System.

Geographic Coordinate System

The widely used geographic coordinates consist of angular values referred to as latitude and longitude (Figure 2.35). Latitude is the angle between the ellipsoidal normal of the point and the equatorial plane. So, on the equator latitude is 0°, and it ranges from +90° (90°N) at the North Pole to -90° (90°S) at the South Pole. Longitude is the angle to the meridian ellipse passing through Greenwich. It ranges from +180° (180°E) eastwards to -180° (180°W) westwards. The angular units of latitude and longitude are commonly expressed in one of several formats. A format using Degrees Minutes Seconds is one of the most common ones. An alternative for example is Decimal Degrees.

Figure 2.35: Geographic Reference System.

Projected Coordinate System

Another widely utilized method of defining a location on Earth uses projected coordinates. These are the result of a map projection, which presents Earth's 3D irregular curved surface on a flat 2D map. 2D Cartesian coordinates describe locations on the map and thus on Earth. The horizontal x-axis is commonly referred to as Easting and the y-axis as Northing. To project Earth's surface to a map, different map projections are used: cylindrical, conical or azimuthal. A widespread projection is the Universal Transverse Mercator (UTM), which uses a transverse cylinder. In fact, it not only uses one map projection but sixty, one for each zone of the system. Each zone is a narrow longitudinal zone of six degrees. By using multiple projections and limiting the zones to six degrees, the distortions, which are inevitable when projecting 3D objects to a 2D plane, can be minimized.

Height

To fully describe a position on Earth, elevation must also be taken into account. Generally, heights are measured in relation to common baselines such as an ellipsoid (ellipsoidal height) or the geoid. As with horizontal coordinates, an ellipsoid is also suited as a vertical datum to mathematically represent Earth's surface. A problem hereby lies in the fact that Earth is not a perfect ellipsoid and that its surface, with mountains and craters, is not smooth like that of an ellipsoid. Therefore, in some cases the ellipsoid lies above or below Earth's actual surface, resulting in height offsets. A better approach is to use a geoid model that describes the actual physical shape of Earth. The geoid is derived from Earth's gravitational field, without taking influences such as winds and tides into account. The exact physical shape of Earth is complex, so it generally is unsuitable for use in, for example, GNSS receivers. Instead, the ellipsoidal height is calculated and then converted to orthometric height using the geoid model that contains the difference between the reference ellipsoid and the geoid, referred to as geoid undulation or geoid height. The orthometric height is calculated with the following equation (2.16):


$H = h - N$ (2.16)

where $H$ is the orthometric height, $h$ the ellipsoidal height and $N$ the geoid undulation (see Figure 2.36).

Figure 2.36: Earth’s surface represented by an ellipsoid and geoid. Ellipsoidal height and geoid model can be used to calculate the orthometric height.

2.4.2 Coordinate Systems for Orientation

Typically, orientation is defined in some world reference system (world frame). Two conventions for such frames are the east, north, up (ENU) system and the north, east, down (NED) system. Android for instance defines the y-axis tangential to the ground, pointing towards Earth's magnetic North Pole at the current location. The z-axis is perpendicular to the ground and points upwards. The x-axis is derived as the vector product of the y- and z-axes. It points east and is also tangential to the ground at the current location (Figure 2.37).

Figure 2.37: Android’s sensor world coordinate system, according to [71].

A body, such as a smartphone, has a system referred to as body frame. The axes of the system are fixed to the structure and rotate with the smartphone. According to [72] and [73], an Android smartphone lying flat on a surface is aligned with the world coordinate system and has the x-axis pointing east, tangential to the ground, the y-axis towards magnetic north (magnetic North Pole), tangential to the ground, and the z-axis towards the sky, perpendicular to the ground. Therefore, yaw defines a rotation about the z-axis, pitch a rotation about the x-axis and roll a rotation about the y-axis (Figure 2.38).

Figure 2.38: An Android smartphone with its coordinate axes.

2.4.3 Pose Tracking Methods

For pose tracking, different solutions have been suggested and used in past research and commercial projects. The type of technology applied for tracking typically depends on the use case and the environment. On the one hand, there are stationary positioning techniques, which commonly are used for limited areas, such as indoor environments, and on the other hand, decoupled methods, which function independently from the infrastructure and are suitable for larger areas. All systems rely on some sort of sensors that register physical phenomena (e.g. magnetic fields) and methods to estimate a pose (e.g. from the signal strength or the time of arrival (ToA) / time of flight (ToF)). ToA uses measurements of the absolute travel time

of a signal between a transmitter and a receiver from which the Euclidean distance can be calculated. Table 2.1 gives an overview of different technologies for positioning/pose tracking.

Table 2.1: Positioning/pose tracking technologies according to [74].

Technology       Method                  Type
GNSS             Satellites (ToA)        Decoupled
Inertial         Dead Reckoning          Decoupled
Optical          Image-based             Decoupled
Infrared         Active Beacons (ToA)    Stationary
Ultrasound       Transmitters (ToA)      Stationary
Wi-Fi            Fingerprinting          Stationary
Ultra-Wideband   Transmitters (ToA)      Stationary
Magnetic         Fingerprinting          Stationary

Decoupled Systems

Decoupled systems have the advantage of being independent of any infrastructure, which is why they are most applicable for large-scale environments in which preparing fixed systems is not possible or too complex.

Global Navigation Satellite Systems

Today, one of the most regularly used systems for positioning is GNSS-based. Especially for vehicle navigation, such as for motor vehicles, watercraft and aircraft, GNSS is a popular choice. But it also

has other applications, like surveying, monitoring or gaming. Strictly speaking, a GNSS is not a decoupled system, since certain infrastructures like the orbiting satellites are required, but since it operates at a global scale, the infrastructure can be assumed to be generally available. The most widely used GNSS today is the Navigational Satellite Timing and Ranging – Global Positioning System, commonly referred to as GPS, which was developed in the USA. The system consists of three main parts, a space segment, a control segment and a user segment [75]. The space segment involves satellites orbiting at an altitude of approximately 20,200 km which carry very precise atomic clocks that are synchronized to each other and to the ground stations. Additionally, the positions of all satellites are known very precisely. Currently there are 32 satellites in the system, of which 31 are actively in use [76]. The satellites' orbits are arranged so that at least six are visible within line-of-sight (LOS) anywhere on Earth, at any time [77]. The control segment, composed of a master control station, a secondary master control station, ground antennas and monitoring stations, is responsible for running and monitoring the system [75]. GPS receivers, as commonly built into modern devices such as smartphones, are referred to as the user segment. To obtain a position, different methods are available, depending on the required accuracy, but all are based on distance measurements between the satellites and the GPS receiver. 3D ECEF coordinates are derived and commonly converted to geographic coordinates and a height value. For vehicle navigation, typically, the ToF of the signal between each satellite (a minimum of 4) and the GPS receiver is used to determine the distance. The receiver's position is obtained from the intersection of the resulting spheres. The accuracy of the positions

typically ranges from 1 – 10 m, but can, under certain conditions, also reach up to 100 m. Some factors that generally influence GPS signals are the number of visible satellites, the quality of the GPS receiver, atmospheric conditions like humidity or pressure, multipath effects caused by signals reflecting off objects such as buildings, the ground or walls, and clock errors. To overcome the inaccuracies of measurements due to these various types of errors, ground-based reference stations that transmit correctional data can be used to calculate more accurate positions. A well-known method is DGPS, which can operate within accuracies of less than 1 m. It requires two GPS receivers, one on a ground-based reference station, which is very accurately surveyed using classical measurement methods, and a mobile GPS receiver. Both receivers determine their positions simultaneously and the reference station then compares its derived GPS position with the reference coordinates and calculates the correctional data, which typically is broadcast to the mobile GPS receivers using radio signals. Other applications, such as surveying, require better accuracies. These can be achieved using carrier phase tracking. In short, the phase of the satellite signal is compared to a reference phase in the receiver, from which the phase shift and thus the distance can be obtained.

Inertial Tracking

An IMU usually consists of accelerometers, gyroscopes and, optionally, magnetometers. Typical use cases for IMUs are positioning tasks using the dead reckoning (DR) method. If an initial position is known, then the following positions can be estimated by analyzing the sensor data. This is especially advantageous in environments without external infrastructure. A disadvantage of DR is that the deviation of

positions grows with time, due to the accumulation of inaccuracies. This is especially common with low-cost micro-electro-mechanical sensors (MEMS), as regularly used in today's smartphones [74]. Generally, accelerometers are used to determine velocity, vibration, impact or inclination. This is accomplished by measuring the acceleration applied to a device and the force influencing the sensor and its proof mass. Basically, they measure the difference between a linear acceleration in the accelerometer's reference frame and the Earth's gravitational field vector. According to Newton's second law of motion, the mass is displaced consistent with the force applied. As described in [78], it can be formulated with the following equation (2.17):

$\mathbf{a} = \frac{1}{m} \sum \mathbf{F}$ (2.17)

But not only active accelerations influence the sensor, the force of gravity has an impact on it too. So, when an accelerometer lies motionless on a table, the measured magnitude is $\approx 9.81\,\mathrm{m/s^2}$ and when it is in free fall, the magnitude is $\approx 0$. This must be taken into consideration, as shown in equation (2.18).

$\mathbf{a} = \frac{1}{m} \sum \mathbf{F} - \mathbf{g}$ (2.18)

To remove the force of gravity from the signal, a high-pass filter can be applied. Vice versa, the gravity component can be extracted with a low-pass filter. Gyroscopes are primarily used to measure angular velocity (rotational speed) and, thus, the rotation angle by integration over time. To calculate the relative orientation, angular changes are

accumulated and multiplied with the time difference between the current and previous measurement. Magnetometers are able to measure magnetic fields. In smartphones they are typically used as a compass by determining magnetic north of Earth's magnetic field. Generally, the magnetometers in smartphones are based on the Hall effect, in which an external magnetic field influences the local magnetic field created around an electrical conductor, which serves as a measurement of the strength of the external magnetic field [79]. Some advantages of MEMS are their small size, low production costs and low energy consumption. Disadvantages are their unreliability and sensitivity to temperature changes.
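To make the gyroscope part of this more concrete, the following sketch integrates the angular rate over time to obtain a relative rotation angle. It is simplified to a single axis (a full implementation would integrate a 3D rotation, e.g. with quaternions) and the class name is illustrative only.

```java
// Sketch: relative orientation by integrating gyroscope readings over time
// (dead reckoning for rotation, reduced to a single axis).
public final class GyroIntegrator {

    private double angleRad = 0.0;   // accumulated rotation about one axis
    private long lastTimestampNs = -1;

    /** rateRadPerSec: angular velocity about the axis; timestampNs: sensor event time. */
    void onGyroSample(double rateRadPerSec, long timestampNs) {
        if (lastTimestampNs > 0) {
            double dt = (timestampNs - lastTimestampNs) * 1e-9; // ns -> s
            angleRad += rateRadPerSec * dt;                     // numerical integration
        }
        lastTimestampNs = timestampNs;
    }

    double angleDegrees() {
        return Math.toDegrees(angleRad);
    }
}
```

As with any dead reckoning approach, the integrated angle drifts over time because measurement noise and sensor bias accumulate.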

Stationary Systems

Typically, stationary systems are applied in bounded areas due to their limited range and the need for offline preparation. Especially indoors (e.g. in buildings), research towards indoor positioning systems (IPS) has focused strongly on stationary systems. While GNSS provides reliable positional data in open spaces, occluded areas are more problematic. Obstacles, such as buildings or trees, or also extreme atmospheric conditions, influence the satellite signals, so that positional data becomes imprecise or even unavailable. Therefore, several different technologies, like IR, ultrasound, wireless local area networks (WLAN), ultra-wideband or magnetic fields, have been employed for indoor positioning.

Infrared

IR solutions employ mobile active IR beacons and stationary beacon receivers. The locations of the stationary receivers are known, whereas

the mobile beacons are unknown. Room-level accuracy can be achieved by placing one receiver per room. To further sub-divide the room and reach room-part accuracy, multiple receivers can be placed across the room. One of the first IR positioning systems is the Active Badge System [80], which uses badges that emit IR pulses with distinctive codes that are received by the fixed IR receivers across the building space. A downside, next to the inaccurate locations, is the positional update rate, which lies at approximately 10 s and, therefore, renders the technology unfeasible for real-time applications [81]. A drawback of IR in general is that the signals are unable to pass through objects such as walls.

Ultrasound

Similar to IR positioning, ultrasound can be used to determine the locations of users with a two-part system. The options are either active or passive device systems. In an active device system, the mobile devices carried by the users actively transmit signals, and in a passive device system, the mobile devices receive signals from the stationary units. Both systems employ ToA measurements for positioning; therefore, a precise synchronization of the transmitters' clocks with the receivers' clocks is a prerequisite. An example of an active device system is the Bat System, also referred to as Active Bat, described by [82] as a follow-up to their research towards an IR-based positioning system. It consists of ultrasonic transmitters (the Bats) and multiple receivers with known locations on the walls and ceilings. Given three or more receivers and their distances to a Bat, the 3D position can be calculated by using trilateration. An advantage of the system is that the orientation can

also be determined if two or more Bats are carried. Using only one Bat, it is also possible to determine the orientation by analysing the pattern of receivers registering signals from the Bat, considering the signal strength. Another example is the Local Positioning System (LPS) presented in [83]. Receivers are installed as reference stations in the vicinity and a mobile measurement rod with two transmitters allows surveying 3D coordinates, with average accuracies between 1 and 5 mm, up to a specified distance of 15 m between transmitters and receivers. [84] describe a passive device system named Cricket which uses actively transmitting beacons, for example attached to the ceiling, and listeners that are attached to a device, such as a laptop. The downside of ultrasonic systems is the influence of atmospheric changes on the ToA method.

Wireless Local Area Network

A popular approach to indoor positioning uses WLAN. This technology is of particular interest, since today multiple WLAN access points are readily available, especially in indoor environments, which makes it possible to rely on existing infrastructures. The most prevalent method of determining positions based on WLAN is the fingerprinting method. A fingerprint is a characteristic measurement of a WLAN signal which typically consists of a MAC address, a service set identifier (SSID) of the WLAN router and the strength of the received signal, referred to as received signal strength (RSS). These fingerprints are collected along with the corresponding known coordinates prior to the actual positioning process and saved in a database to create a radio map of the environment. Later, in the positioning process, this radio map is used to compare current

measurements of the RSS with the ones stored in the fingerprints. Some early work using WLAN was done by [81]. WLAN has a relatively large operating range of approximately 50 to 100 m, making this method suitable for large indoor areas and even small outdoor areas. Depending on the number of WLAN access points, accuracies of approximately 1 to 30 m can be achieved. The major disadvantage of this approach is that appropriate fingerprint databases must be created and maintained. Changes in the environment, like adding, removing or even just rearranging objects, affect the fingerprints corresponding to the known locations, so that the fingerprint database has to be updated.
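A minimal sketch of the fingerprinting idea is given below: the current RSS scan is compared to every stored fingerprint and the entry with the smallest Euclidean distance in signal space is returned as the position estimate. The class and field names are illustrative and not taken from a specific library.

```java
import java.util.List;
import java.util.Map;

// Sketch: nearest-neighbour matching of a current WLAN scan against a radio map.
final class Fingerprint {
    Map<String, Double> rssByMac; // MAC address -> RSS in dBm
    double x, y;                  // known position where the fingerprint was recorded
}

final class FingerprintMatcher {

    /** Returns the radio-map fingerprint that best matches the current scan. */
    static Fingerprint nearest(Map<String, Double> measured, List<Fingerprint> radioMap) {
        Fingerprint best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Fingerprint fp : radioMap) {
            double sum = 0.0;
            for (Map.Entry<String, Double> e : measured.entrySet()) {
                // Access points missing in the stored fingerprint are treated as very weak.
                double stored = fp.rssByMac.getOrDefault(e.getKey(), -100.0);
                double diff = e.getValue() - stored;
                sum += diff * diff;
            }
            double distance = Math.sqrt(sum);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = fp;
            }
        }
        return best;
    }
}
```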

Ultra-Wideband

Another radio technology suitable for positioning is ultra-wideband (UWB). Similar to IR- and ultrasound-based technologies, UWB positioning uses transmitters and receivers. It is possible to either use the ToA method, the time difference of arrival (TDoA) method or the angle of arrival (AoA) method. The main advantage of UWB is that it can provide locations at centimeter level. Some work was done by [85] and [86] using this technology.

Magnetic Field

Artificial magnetic and electromagnetic fields can also be used to retrieve positions. The fields are generated by permanent magnets or coils with an alternating current (AC) or pulsed direct current (DC). The great advantage of magnetic-based positioning systems is that the magnetic signals can penetrate objects, such as walls, and so are not limited to a LOS. Depending on the size of the coil and thus the strength of the magnetic field, such systems can be used at room up to building scale.

The downside is that the environment must be prepared with magnets or coils, which can become quite large. Research using electromagnetic coils was for example conducted by [87]–[89]. A commercial example of such a system is the Polhemus G4 (https://polhemus.com/motion-tracking/all-trackers/g4). It is based on AC electromagnetic fields and consists of sensors that are tracked by the G4 source that produces the electromagnetic field. The sensors are connected to a portable hub that transmits poses to the host PC using a wireless connection. The system delivers 6DoF poses and has coverage capabilities ranging from small to middle-sized rooms.

2.4.4 Sensor Fusion

Commonly, to acquire a 6DoF pose, multiple technologies and information from multiple sensors must be combined. This is typically referred to as sensor fusion .

Accelerometer and Magnetometer

For AR, a common choice of sensor fusion involves merging information from accelerometers and magnetometers to determine the orientation of a device in a world coordinate system. The 2DoF pitch and roll are determined with the accelerometer and yaw is found with the magnetometer. To describe the 3DoF orientation, the values are combined in a single orientation matrix. Typically, accelerometers and magnetometers each produce three-dimensional vectors, so that the values of the accelerometer are given by the vector $\mathbf{a}$ and the values of the magnetometer by the vector $\mathbf{m}$ (equation (2.19)):

$\mathbf{a} = (a_x, a_y, a_z)^T, \quad \mathbf{m} = (m_x, m_y, m_z)^T$ (2.19)

As described in [90]–[92], vector $\mathbf{a}$ is used to define the z-axis perpendicular to the ground. With $\mathbf{a}$ and $\mathbf{m}$ a new vector $\mathbf{h}$ (equation (2.20)) is created by calculating the cross product of both vectors. Since the cross product of two vectors is always perpendicular to both vectors, it defines the x-axis, which is perpendicular to the y-axis and the z-axis.

$\mathbf{h} = \mathbf{m} \times \mathbf{a} = (m_y a_z - m_z a_y,\; m_z a_x - m_x a_z,\; m_x a_y - m_y a_x)^T$ (2.20)

As shown in equations (2.21) and (2.22), the vectors are normalized to receive $\hat{\mathbf{h}}$ and $\hat{\mathbf{a}}$, so that they can be used in a rotation matrix $R$.

$\hat{\mathbf{h}} = \frac{1}{\|\mathbf{h}\|} \mathbf{h}$ (2.21)

$\hat{\mathbf{a}} = \frac{1}{\|\mathbf{a}\|} \mathbf{a}$ (2.22)

The y-axis is derived by calculating the cross product of $\hat{\mathbf{a}}$ and $\hat{\mathbf{h}}$, which ensures that it is perpendicular to the z-axis and the x-axis (equation (2.23)). This also cancels out magnetic dip (also known as magnetic inclination), which is the angle between the horizontal plane and the magnetic field vector. When a compass is moved closer to one

of Earth's magnetic poles, one side of the compass needle dips downwards, resulting in a possible dragging against the compass capsule. This can prevent the compass reading from being accurate. To overcome this, typically, compasses are balanced by manufacturers to account for magnetic inclination. Since magnetic inclination varies depending on the location, this is done for compasses of specific magnetic zones on Earth.

$\hat{\mathbf{b}} = \hat{\mathbf{a}} \times \hat{\mathbf{h}} = (\hat{a}_y \hat{h}_z - \hat{a}_z \hat{h}_y,\; \hat{a}_z \hat{h}_x - \hat{a}_x \hat{h}_z,\; \hat{a}_x \hat{h}_y - \hat{a}_y \hat{h}_x)^T$ (2.23)

$\hat{\mathbf{h}}$, $\hat{\mathbf{b}}$ and $\hat{\mathbf{a}}$ can now be used to create $R$ (equation (2.24)).

$R = \begin{pmatrix} \hat{h}_x & \hat{h}_y & \hat{h}_z \\ \hat{b}_x & \hat{b}_y & \hat{b}_z \\ \hat{a}_x & \hat{a}_y & \hat{a}_z \end{pmatrix}$ (2.24)
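On Android, this accelerometer/magnetometer fusion is provided by the SensorManager API. The sketch below shows one possible way to use it (illustrative, not the thesis implementation): the resulting rotation matrix corresponds to $R$ of equation (2.24), and the GeomagneticField class supplies the magnetic declination needed for the true-north correction discussed in the following paragraphs.

```java
import android.hardware.GeomagneticField;
import android.hardware.SensorManager;

// Sketch: device orientation from accelerometer and magnetometer readings.
public final class OrientationFromSensors {

    /** gravity and geomagnetic are the latest accelerometer / magnetometer vectors. */
    static float[] orientationAngles(float[] gravity, float[] geomagnetic,
                                     float latitude, float longitude, float altitude) {
        float[] r = new float[9];
        float[] i = new float[9];
        float[] angles = new float[3]; // azimuth (yaw), pitch, roll in radians
        if (SensorManager.getRotationMatrix(r, i, gravity, geomagnetic)) {
            SensorManager.getOrientation(r, angles);
            // Correct the magnetic-north azimuth to a true-north bearing.
            GeomagneticField field = new GeomagneticField(
                    latitude, longitude, altitude, System.currentTimeMillis());
            angles[0] += Math.toRadians(field.getDeclination());
        }
        return angles;
    }
}
```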

Next to the magnetic inclination, there is also the magnetic declination [93]. Magnetic declination defines the angle between true north (geographic North Pole) and magnetic north (magnetic North Pole), more specifically the horizontal component of Earth's magnetic field. A compass always points towards magnetic north, therefore the compass readings must be corrected to obtain a true north bearing (true bearing). The declination depends on the location and varies significantly. For example, Aachen in Germany has a positive magnetic declination of 1° 29′, while New York City in the USA has a negative declination of 12° 55′ [94]. To account for this difference, the declination must be applied to $R$. Given the before defined coordinate system, where the x-axis points east, the y-axis north and the z-axis upwards, the rotation about the z-axis (yaw)

must be corrected (equation (2.25)). So, the rotation matrix $R_D$, which describes the rotation about the z-axis by the magnetic declination, is multiplied with $R$ to receive $R'$.

$R' = R_D \, R$ (2.25)

When working with projected coordinates (e.g. UTM coordinates) another angle must be considered. The direction northwards along the grid lines of a map projection is referred to as grid north. Due to the ellipsoidal shape of Earth, its projection onto a flat map surface results in meridians that differ from the grid lines. On the central meridian, grid north and true north are aligned, but they diverge increasingly northwards, since the meridians converge towards the poles. The angle between true north and grid north is referred to as the meridian convergence. The meridian convergence is calculated according to the position and is applied to the true north bearing. Therefore, equation (2.26) is:

$R'' = R_\gamma \, R'$ (2.26)

where $R_\gamma$ describes the rotation about the z-axis by the meridian convergence $\gamma$.

Filters

Depending on the quality of the sensors, raw data often is significantly noisy. Furthermore, accelerometers, for example, are quite sensitive and therefore susceptible to the slightest motion, such as the minor natural movement of a hand holding a smartphone. This is reflected in the output data as small peaks which influence the calculated orientation. Applied to AR, the result is a jittery augmentation that

can make the application less enjoyable or even unusable. To make the data more usable, a solution is to process it by applying filters. Depending on the use case, typical filters are high-pass filters and low-pass filters. While a high-pass filter passes data above a certain value, the low-pass filter passes data below a certain value. To remove noise from acceleration data, typically a low-pass filter is applied. For AR, good results can be achieved with an exponentially weighted moving average (EWMA) filter, which allows for smoother tracking than, for example, a simple moving average filter. It is defined by the equations (2.27) and (2.28):

$\bar{a}_t = \alpha \cdot \bar{a}_{t-1} + (1 - \alpha) \cdot a_t$ (2.27)

where $\bar{a}_t$ is the smoothed acceleration data at time $t$ and $a_t$ the raw acceleration data at that time. The weight $\alpha$ applied to the filter is defined by:

$\alpha = \frac{t_c}{t_c + dT}$, with $0 < \alpha < 1$ (2.28)

where $t_c$ is a time constant describing the relative duration of the signal and $dT$ the event delivery rate [95], [96]. The result is smoothed data in which, for example, strong peaks have been removed. The amount of data that passes the filter depends on the chosen parameters.
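A minimal sketch of this EWMA low-pass filter, following equations (2.27) and (2.28) as reconstructed above, could look as follows; the time constant and event delivery rate are assumed to be given in seconds.

```java
// Sketch: EWMA low-pass filter applied to a 3-axis acceleration vector.
public final class LowPassFilter {

    private final float alpha;
    private final float[] smoothed = new float[3];

    LowPassFilter(float timeConstant, float deliveryRate) {
        this.alpha = timeConstant / (timeConstant + deliveryRate); // eq. (2.28)
    }

    /** Returns the smoothed acceleration for the latest raw sample. */
    float[] filter(float[] raw) {
        for (int i = 0; i < 3; i++) {
            // eq. (2.27): weighted average of the previous smoothed value and the raw sample
            smoothed[i] = alpha * smoothed[i] + (1 - alpha) * raw[i];
        }
        return smoothed;
    }
}
```

A larger time constant moves alpha closer to one, which suppresses jitter more strongly but also makes the filtered orientation react more slowly to real motion.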

2.4.5 Optical Pose Tracking Methods

Due to the rapid evolution of mobile hardware, today's smartphones also contain affordable cameras that produce images in such a quality

that optical tracking is possible. Since smartphone sensors, such as magnetometers, are prone to drift and inaccuracies, optical methods have increasingly gained attention in the AR community. Optical tracking methods typically are independent of additional external infrastructures, so strictly speaking these methods could be categorized as decoupled, but due to the complexity of the topic, they are described in a separate section in this work. In general, optical tracking is a problem from the fields of CV and photogrammetry. For humans, the perception of 3D reality is quite natural. With ease, differently shaped objects, from simple to the most complex, with various colors and illuminations, can be recognized and separated from the most cluttered environments. This, for instance, allows humans to categorize or count objects by shape or color. Teaching a computer to interpret an image on the same level as humans is a complex task which is researched, for example, in the interdisciplinary field of CV. The main steps involved here are image acquisition, processing, analysing and interpreting. Interesting sub-topics of CV for use in AR tracking are the recognition of objects and the estimation of their pose relative to the AR system. This in turn requires methods of photogrammetry. In this thesis the focus lies on monocular tracking, since the majority of today's smartphones typically still only contain a single rear camera.


Figure 2.39: CV is based on image analysis, which in turn is based on image processing.

Image Processing

Optical tracking for AR requires high-level object detection (recognition). To detect an object, lower-level methods like image processing and image analysis are required (Figure 2.39). To understand how CV methods are applied to optical tracking, some fundamentals of image processing and analysis are presented in the following. Since the field of image processing is rather complex, the focus lies on selected methods relevant for this work.

Image

An image can be represented by a function $f(x, y)$ that defines the intensity at position $(x, y)$. Typically, images have a rasterized, rectangular form and a finite number of rows and columns. An intersection of a row and a column is referred to as a pixel. The number of pixels in an image is referred to as the image's resolution. For a digital camera, the range of $(x, y)$ is limited by the size of the sensor. The size of a pixel in turn is defined by the sensor's resolution.

Image Processing Operators

Image processing utilizes basic image processing operators. In general, an image processing operator uses one or multiple input images and creates an output image, which can be denoted by equation (2.29):

$g(x, y) = h(f(x, y))$ (2.29)

Image processing methods can roughly be split into two types, point operators and neighborhood operators. Point operators represent the simplest form of image operations, where each input pixel value has a corresponding output pixel value. Some typical point operators change the brightness and contrast of an image, for example denoted by the following equation (2.30):

$g(x, y) = a \cdot f(x, y) + b$ (2.30)

where $a > 0$ is referred to as the gain (scale) and $b$ as the bias (offset). They are said to adjust the contrast and brightness of an image.
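As a small illustration, the sketch below applies this gain/bias point operator to an 8-bit grayscale image stored as a 2D integer array (a simplified image representation assumed here for brevity).

```java
// Sketch: point operator g(x, y) = a * f(x, y) + b on an 8-bit grayscale image.
public final class PointOperator {

    static int[][] gainBias(int[][] image, double a, double b) {
        int[][] out = new int[image.length][image[0].length];
        for (int y = 0; y < image.length; y++) {
            for (int x = 0; x < image[0].length; x++) {
                int v = (int) Math.round(a * image[y][x] + b); // gain and bias
                out[y][x] = Math.max(0, Math.min(255, v));     // clamp to the valid range
            }
        }
        return out;
    }
}
```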

Neighborhood operators include the pixels of an area around a pixel to produce an output image. Generally, these operators can be divided into linear and non-linear filtering operators. A commonly used neighborhood operator is the linear filter, denoted by equation (2.31):

$g(i, j) = \sum_{k,l} f(i + k, j + l) \, h(k, l)$ (2.31)

where $h(k, l)$ is referred to as the kernel or mask [97].

It determines the values of the output pixels by calculating the weighted sum of the input pixel values. Some examples of linear filtering operators are blurring and sharpening of images. Non-linear filters, on the other side, are more complex, but often provide better performance. A typical non-linear filter is the median filter. Other common non-linear operators are morphological operators, which alter the shape of objects. These operations are executed on binary images that are produced by thresholding. As shown in equation (2.32), binary images have only two possible values for a pixel:

$\theta(f, t) = \begin{cases} 1 & \text{if } f \ge t \\ 0 & \text{else} \end{cases}$ (2.32)

According to [98], the most fundamental morphological operations are dilation and erosion. Generally, dilation increases the object size by adding several pixels, so that holes and gaps decrease or are filled entirely (Figure 2.40 middle). Erosion decreases the object size, which increases holes and gaps (Figure 2.40 right).

Figure 2.40: Original image (left); Dilation (middle); Erosion (right).

Further morphological operations are created by combinations of dilation and erosion. Opening is achieved with an erosion followed by

a dilation. Closing employs a dilation followed by an erosion. Opening typically is used for removing noise surrounding the foreground object without affecting the size of the object itself. Closing is useful for removing small gaps in the foreground object without altering the object size.
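The sketch below illustrates dilation and erosion on such a binary image with a 3×3 structuring element; opening and closing then follow by simply chaining the two operations. The image representation (a 2D array of 0/1 values) is an assumption made for brevity.

```java
// Sketch: binary dilation/erosion with a 3x3 structuring element.
public final class Morphology {

    static int[][] dilate(int[][] img) { return apply(img, true); }
    static int[][] erode(int[][] img)  { return apply(img, false); }

    static int[][] open(int[][] img)  { return dilate(erode(img)); }  // erosion, then dilation
    static int[][] close(int[][] img) { return erode(dilate(img)); }  // dilation, then erosion

    private static int[][] apply(int[][] img, boolean dilate) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Dilation: set the pixel if any neighbour is set.
                // Erosion: keep the pixel only if all neighbours are set.
                int result = dilate ? 0 : 1;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int ny = Math.min(h - 1, Math.max(0, y + dy)); // clamp at borders
                        int nx = Math.min(w - 1, Math.max(0, x + dx));
                        if (dilate && img[ny][nx] == 1) result = 1;
                        if (!dilate && img[ny][nx] == 0) result = 0;
                    }
                }
                out[y][x] = result;
            }
        }
        return out;
    }
}
```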

Gaussian Filter

One of the most commonly used smoothing filters is the Gaussian filter, which employs a Gaussian function. Typical applications are image noise reduction and decreasing of image detail. This is done by reducing the high frequencies in images. The low-pass filter is denoted by the following equation (2.33) [97], [99]:

$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$ (2.33)

where $x$ is the horizontal distance from the origin, $y$ the vertical distance from the origin and $\sigma$ the standard deviation, which determines the amount of blur. Figure 2.41 depicts the result of a blurring with the Gaussian filter.

Figure 2.41: Gaussian blur applied to an image: original image (left) and the resulting blurred image (right).
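For illustration, the following sketch builds a discrete Gaussian kernel from equation (2.33) and normalizes it so that its weights sum to one; convolving an image with this kernel yields a blur like the one shown in Figure 2.41.

```java
// Sketch: construction of a (2k+1)x(2k+1) Gaussian kernel.
public final class GaussianKernel {

    static double[][] create(int k, double sigma) {
        int size = 2 * k + 1;
        double[][] kernel = new double[size][size];
        double sum = 0.0;
        for (int y = -k; y <= k; y++) {
            for (int x = -k; x <= k; x++) {
                double value = 1.0 / (2 * Math.PI * sigma * sigma)
                        * Math.exp(-(x * x + y * y) / (2 * sigma * sigma)); // eq. (2.33)
                kernel[y + k][x + k] = value;
                sum += value;
            }
        }
        for (int y = 0; y < size; y++) {        // normalise so the weights sum to 1
            for (int x = 0; x < size; x++) {
                kernel[y][x] /= sum;
            }
        }
        return kernel;
    }
}
```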

Feature Detection

Generally, a feature is an interesting part of an image that is re-identifiable in the same image or in following ones. Some examples are edges and corners. To detect these features, specialized algorithms exist.

Edge Detector

In general, an edge is a boundary between image regions with a significant change of the intensity perpendicular to the direction of the edge. Intensity changes can be found by taking the image derivatives to obtain the gradient of image $I$, which is a vector that points in the direction of the greatest intensity ascent (equation (2.34)):

$\nabla I = \left( \frac{\partial I}{\partial x}, \frac{\partial I}{\partial y} \right)$ (2.34)

where $\frac{\partial I}{\partial x}$ is the derivative with respect to x (in x-direction) and $\frac{\partial I}{\partial y}$ the derivative with respect to y (in y-direction). Directionally independent information is described by equation (2.35), as the length of the gradient vector:

$\|\nabla I\| = \sqrt{\left( \frac{\partial I}{\partial x} \right)^2 + \left( \frac{\partial I}{\partial y} \right)^2}$ (2.35)

The direction of the gradient vector is obtained using equation (2.36):

$\theta = \arctan\left( \frac{\partial I / \partial y}{\partial I / \partial x} \right)$ (2.36)

A popular operator to find edges is the Sobel operator [100], which calculates approximations of the derivatives (see equation (2.34))

using two 3×3 kernels which are convolved with the original image to detect changes in the x- and y-directions. $G_x$ finds vertical edges (Figure 2.42 middle) and $G_y$ horizontal edges (Figure 2.42 right).

$G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}, \quad G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix}$ (2.37)

Figure 2.42: Original image and its schematic pixel-value representation (left); Vertical edges detected with the Sobel operator using the $G_x$ kernel (middle); Horizontal edges detected with the Sobel operator using the $G_y$ kernel (right).

Since an image derivative emphasizes high frequencies by removing some low frequency components, it increases the amount of high-frequency image noise.

Therefore, image smoothing methods are a necessity. A commonly used edge detection algorithm that incorporates multiple steps is the Canny edge detector [101]. Figure 2.43 shows the result of the Canny edge detector. The steps are the following:
1. A Gaussian filter is applied to smooth the image and reduce the amount of noise. This decreases the probability of detecting false edges in the next step.
2. An edge detection operator (e.g. Sobel [100]) is applied to the image.
3. The edge thinning technique non-maximum suppression is used to find a sharp edge, which is done by determining the sharpest change of intensity values. For this, the edge strength of a pixel is compared to other pixels in the area that belong to the same edge direction. Only the largest value is preserved, others are suppressed.
4. A double threshold is used to eliminate remaining false edges. This is done by using a lower and an upper threshold. An edge pixel with a gradient value above the upper threshold is marked as a strong edge pixel, if it lies between the upper and lower threshold it is marked as a weak edge pixel and if it lies below the lower threshold, it is suppressed.
5. In the last step, the edge tracking by hysteresis, uncertainties involving the weak pixels still must be considered, since these could still be the result of noise. Weak pixels that are connected to strong pixels are preserved, while ones without a connection are suppressed. For this, blobs of pixels are analyzed. Blobs that contain

at least one strong pixel are considered as a positive indication for the weak pixel.

Figure 2.43: Image created with the Canny edge detector.

A practical use case for an edge detector is found in object detection. The extracted edges can, for example, be used for an analysis process that tries to identify an object based on its shape. An advantage of using edges is that these can be extracted from objects with a minimum of textural information, whereas other methods rely on clearly identifiable features in textures.
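Assuming OpenCV's Java bindings are available, the multi-step Canny pipeline described above can be invoked as sketched below; the blur parameters and the two thresholds (50/150) are illustrative values only.

```java
import org.opencv.core.Mat;
import org.opencv.core.Size;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

// Sketch: Canny edge detection with OpenCV.
public final class CannyExample {

    static Mat detectEdges(String path) {
        Mat image = Imgcodecs.imread(path);
        Mat gray = new Mat();
        Imgproc.cvtColor(image, gray, Imgproc.COLOR_BGR2GRAY);
        Imgproc.GaussianBlur(gray, gray, new Size(5, 5), 1.4); // step 1: smoothing
        Mat edges = new Mat();
        Imgproc.Canny(gray, edges, 50, 150); // steps 2-5 are handled internally
        return edges;
    }
}
```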

Corner Detection

Different from edges, corners refer to point features. Generally, a corner is the point in which two edges of different orientation meet. Therefore, the gradient of the image is highly variable in all directions in this region. A well-known algorithm that exploits this is the Harris

corner detector [102]. Given a grayscale image, a sweeping window approach can be used to search for high variations of intensity. This is represented with the following equation (2.38):

$E(u, v) = \sum_{x,y} w(x, y) \, [I(x + u, y + v) - I(x, y)]^2$ (2.38)

where $w(x, y)$ is the sweep window, $I(x, y)$ the intensity and $I(x + u, y + v)$ the shifted intensity (Figure 2.44). By maximizing the term and applying a Taylor expansion to equation (2.38), the following equation (2.39) can be derived:

$E(u, v) \approx (u, v) \, M \begin{pmatrix} u \\ v \end{pmatrix}$ (2.39)

where $M = \sum_{x,y} w(x, y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$.

To determine if a corner is present or not, the eigenvalues of $M$ are analyzed using the following equation (2.40):

$R = \det(M) - k \, (\operatorname{trace}(M))^2$ (2.40)

where $\det(M) = \lambda_1 \lambda_2$ and $\operatorname{trace}(M) = \lambda_1 + \lambda_2$, with $\lambda_1$ and $\lambda_2$ as the eigenvalues of $M$. There are three cases:
1. If $|R|$ is small, there is no feature (Figure 2.44 left)
2. If $R < 0$, then an edge has been found (Figure 2.44 middle)

3. If $R$ is large, a corner has been found (Figure 2.44 right)

Figure 2.44: Sweep window to find corners based on high variations of intensity. If there are no variations in any direction, then there is no feature (case 1/left); If there are variations in one direction, then an edge has been found (case 2/middle); If there are variations in all directions, then a corner has been found (case 3/right)


The result of the corner detector is shown in Figure 2.45.

Figure 2.45: Features found by the Harris corner detector.
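For illustration, the Harris response of equation (2.40) for a single pixel can be evaluated as sketched below, given the summed gradient products of its neighborhood (the entries of $M$); the constant $k$ is commonly chosen around 0.04–0.06.

```java
// Sketch: Harris corner response for one pixel from the entries of M.
public final class HarrisResponse {

    static double response(double sumIx2, double sumIy2, double sumIxIy, double k) {
        double det = sumIx2 * sumIy2 - sumIxIy * sumIxIy; // det(M) = lambda1 * lambda2
        double trace = sumIx2 + sumIy2;                   // trace(M) = lambda1 + lambda2
        return det - k * trace * trace;                   // R = det(M) - k * trace(M)^2
    }
}
```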

Depending on the image, the Harris corner detector often finds large numbers of features. For point-based tracking of objects, it generally is more important to find distinct features of an image instead of many features. Therefore, [103] proposed a different scoring function than the one the Harris detector utilizes. It is given by equation (2.41):

$R = \min(\lambda_1, \lambda_2)$ (2.41)

Using equation (2.41), generally better results can be achieved, since it only determines strong corners in an image. An alternative is a curvature-based corner detector, as for example described by [104]. Curvature is denoted by equation (2.42):

$\kappa(u, \sigma) = \frac{\dot{X}(u, \sigma) \, \ddot{Y}(u, \sigma) - \ddot{X}(u, \sigma) \, \dot{Y}(u, \sigma)}{\left[ \dot{X}(u, \sigma)^2 + \dot{Y}(u, \sigma)^2 \right]^{3/2}}$ (2.42)

where

$\dot{X}(u, \sigma) = x(u) \otimes \dot{g}(u, \sigma), \quad \ddot{X}(u, \sigma) = x(u) \otimes \ddot{g}(u, \sigma),$

$\dot{Y}(u, \sigma) = y(u) \otimes \dot{g}(u, \sigma), \quad \ddot{Y}(u, \sigma) = y(u) \otimes \ddot{g}(u, \sigma),$

and $x(u)$ and $y(u)$ describe the curve by the arc length $u$. The convolution operator $\otimes$ is applied with the Gaussian kernel $g(u, \sigma)$ and its derivatives. The corner detector first creates an edge map using the Canny algorithm and extracts contours from it. Next, initial corner candidates are selected by computing the absolute curvature of points on the contours and finding local maxima. Since the goal is to find corners at right angles, round corners are discarded by applying an adaptive threshold to the curvature values. Additionally, the region of the corner candidates is analyzed regarding the angles of neighboring corners, here referred to as region of support (ROS).

Image-based Pose Estimation

Given 3D object points in a world coordinate system and 2D features in an image coordinate system extracted from an image, a camera pose can be estimated by exploiting the relationships between the points. The relationship is expressed by equation (2.43):

$s \, \mathbf{x} = P \, \mathbf{X}, \quad P = K \, [R \,|\, t]$ (2.43)

where $P$ is the camera matrix that projects a 3D point $\mathbf{X}$ onto a 2D point $\mathbf{x}$ in a 2D plane. $s$ is an arbitrary scaling factor, $K$, the calibration matrix, contains the intrinsic parameters and $[R \,|\, t]$ the extrinsic parameters of the camera, where $R$ is the rotation and $t$ the translation. Typically, pose tracking algorithms assume that $K$ is known and estimate $R$ and $t$. The relationships between 3D points and 2D points were already investigated nearly 180 years ago by [105]. As an example, the solution by [105] is shown below, as described by [106]. The solution is based on the observation that angles between 2D points are equal to the angles between their corresponding 3D points (Figure 2.46).

Figure 2.46: Geometry of the three-point space resection problem.

$P_1$, $P_2$ and $P_3$ are unknown positions that form a triangle. The lengths of the sides are given by equation (2.44):

$a = \|P_2 - P_3\|, \quad b = \|P_1 - P_3\|, \quad c = \|P_1 - P_2\|$ (2.44)

Let the origin of the camera coordinate frame be the center of the projection and $f$ the distance of the image projection plane to the projection center. Let the perspective projection of $P_i$ be

$q_i = (u_i, v_i, f), \quad i = 1, 2, 3$ (2.45)

The angles $\alpha$, $\beta$ and $\gamma$ between the projection rays are given by

$\cos\alpha = \mathbf{j}_2 \cdot \mathbf{j}_3, \quad \cos\beta = \mathbf{j}_1 \cdot \mathbf{j}_3, \quad \cos\gamma = \mathbf{j}_1 \cdot \mathbf{j}_2$ (2.46)

with $\mathbf{j}_i$ being unit vectors calculated by

$\mathbf{j}_i = \frac{1}{\sqrt{u_i^2 + v_i^2 + f^2}} \, (u_i, v_i, f), \quad i = 1, 2, 3$ (2.47)

The positions of the points $P_i$ can be determined by finding the unknown distances $s_1$, $s_2$ and $s_3$ of these to the projection center due to


$P_i = s_i \, \mathbf{j}_i, \quad i = 1, 2, 3$ (2.48)

[105] solved it with the following equations (2.49):

$s_2^2 + s_3^2 - 2 s_2 s_3 \cos\alpha = a^2$
$s_1^2 + s_3^2 - 2 s_1 s_3 \cos\beta = b^2$ (2.49)
$s_1^2 + s_2^2 - 2 s_1 s_2 \cos\gamma = c^2$

With the substitution of $s_2 = u \, s_1$ and $s_3 = v \, s_1$, there are three equations for $s_1^2$:

$s_1^2 (u^2 + v^2 - 2 u v \cos\alpha) = a^2$
$s_1^2 (1 + v^2 - 2 v \cos\beta) = b^2$ (2.50)
$s_1^2 (1 + u^2 - 2 u \cos\gamma) = c^2$

Next, the equations are solved for $s_1^2$ (equation (2.51)):

$s_1^2 = \frac{a^2}{u^2 + v^2 - 2 u v \cos\alpha} = \frac{b^2}{1 + v^2 - 2 v \cos\beta} = \frac{c^2}{1 + u^2 - 2 u \cos\gamma}$ (2.51)

These can be reduced to the two equations (2.52) and (2.53):

$u^2 + \frac{b^2 - a^2}{b^2} v^2 - 2 u v \cos\alpha + \frac{2 a^2}{b^2} v \cos\beta - \frac{a^2}{b^2} = 0$ (2.52)

$u^2 - \frac{c^2}{b^2} v^2 - 2 u \cos\gamma + \frac{2 c^2}{b^2} v \cos\beta + \frac{b^2 - c^2}{b^2} = 0$ (2.53)

Using equation (2.54), a solution for $u$ in terms of $v$ can be obtained (equation (2.55)) by substituting it into equation (2.53).

$u^2 = 2 u v \cos\alpha - \frac{b^2 - a^2}{b^2} v^2 - \frac{2 a^2}{b^2} v \cos\beta + \frac{a^2}{b^2}$ (2.54)

$u = \frac{\left( \frac{a^2 - c^2}{b^2} - 1 \right) v^2 - 2 \, \frac{a^2 - c^2}{b^2} \, v \cos\beta + 1 + \frac{a^2 - c^2}{b^2}}{2 \, (\cos\gamma - v \cos\alpha)}$ (2.55)

This equation, substituted back into equation (2.52), gives a fourth order polynomial in $v$:

$A_4 v^4 + A_3 v^3 + A_2 v^2 + A_1 v + A_0 = 0$ (2.56)

where

$A_4 = \left( \frac{a^2 - c^2}{b^2} - 1 \right)^2 - \frac{4 c^2}{b^2} \cos^2\alpha$

$A_3 = 4 \left[ \frac{a^2 - c^2}{b^2} \left( 1 - \frac{a^2 - c^2}{b^2} \right) \cos\beta - \left( 1 - \frac{a^2 + c^2}{b^2} \right) \cos\alpha \cos\gamma + 2 \, \frac{c^2}{b^2} \cos^2\alpha \cos\beta \right]$

$A_2 = 2 \left[ \left( \frac{a^2 - c^2}{b^2} \right)^2 - 1 + 2 \left( \frac{a^2 - c^2}{b^2} \right)^2 \cos^2\beta + 2 \, \frac{b^2 - c^2}{b^2} \cos^2\alpha - 4 \, \frac{a^2 + c^2}{b^2} \cos\alpha \cos\beta \cos\gamma + 2 \, \frac{b^2 - a^2}{b^2} \cos^2\gamma \right]$

$A_1 = 4 \left[ -\frac{a^2 - c^2}{b^2} \left( 1 + \frac{a^2 - c^2}{b^2} \right) \cos\beta + 2 \, \frac{a^2}{b^2} \cos^2\gamma \cos\beta - \left( 1 - \frac{a^2 + c^2}{b^2} \right) \cos\alpha \cos\gamma \right]$

$A_0 = \left( 1 + \frac{a^2 - c^2}{b^2} \right)^2 - \frac{4 a^2}{b^2} \cos^2\gamma$

Since then, many solutions have been suggested and applied in areas such as robotics, photogrammetry and AR. Today, a wide body of literature describing various solutions to this problem is available. A straightforward solution was introduced by [8] with the direct linear transformation (DLT). Its advantage is that it can be used to estimate the internal and external parameters of the camera. This is done with equation (2.57):

$\begin{pmatrix} \mathbf{X}_i^T & \mathbf{0}^T & -x_i \mathbf{X}_i^T \\ \mathbf{0}^T & \mathbf{X}_i^T & -y_i \mathbf{X}_i^T \end{pmatrix} \begin{pmatrix} \mathbf{p}^1 \\ \mathbf{p}^2 \\ \mathbf{p}^3 \end{pmatrix} = \mathbf{0}$ (2.57)

where $\mathbf{X}_i$ is a 3D point in homogeneous coordinates, $(x_i, y_i)$ its corresponding 2D image point and $\mathbf{p}^1$, $\mathbf{p}^2$, $\mathbf{p}^3$ are the rows of the camera matrix $P$. Stacking these equations for all correspondences yields a homogeneous linear system that can be solved using the singular value decomposition (SVD). The disadvantages of this solution are that a relatively high number of correspondences is necessary and that the results can be rather imprecise [107], [108].

Perspective-Three-Point

Therefore, more favorable solutions use a pre-determined camera calibration matrix $K$, so that only 6 parameters must be found instead of 11. Using $K$ and 3 known correspondences generally is referred to as the Perspective-Three-Point (P3P) problem. Many solutions to P3P have been introduced over the years, for example [105], [109], [110], [111], [112], [113]. Typically, with 3 known correspondences 4 possible solutions are produced, so that at least 4 correspondences are necessary for a unique solution.

Perspective-n-Point

The case $n \ge 4$ is referred to as the Perspective-n-Point (PnP) problem; in its general formulation, PnP also includes P3P. It is generally preferred, since the pose accuracy often can be increased with a higher number of points. Some PnP solutions were presented by [114], [115], [116], [117], [118], [119], [120], [121], [122], [123], [124], [125]. Iterative solutions typically optimize a cost function based on error minimization, for example using the Gauss-Newton algorithm [126] or the Levenberg-Marquardt

([127], [128]) algorithm. Examples are the minimization of the geometric error (e.g. the 2D reprojection error) or of an algebraic error. A disadvantage of iterative solutions is that they require a good initial estimate. Noisy data strongly influences the results, with the risk of falling into local minima. Also, the high computational complexity of many solutions renders them unusable for real-time applications. Therefore, an alternative to iterative solutions are non-iterative ones. A popular non-iterative PnP method is EPnP [120]. The main idea behind it is to represent the points of the object space as a weighted sum of four virtual control points. The coordinates of the control points then can be estimated by expressing them as the weighted sum of the eigenvectors. Some quadratic equations then are solved to determine the weights. An advantage of this approach is its low computational complexity.

Random Sample Consensus

Since the described methods rely on sets of correspondences, any erroneous data, such as false point relationships, can produce unstable or wrong results. To address this problem, often robust estimators, such as the Random Sample Consensus (RANSAC) introduced by [129], are used. Roughly described, RANSAC helps to detect outliers in a set of data and removes these to produce better results with the remaining data. It iteratively estimates parameters of a mathematical model by repeating two essential steps:
1. A subset of data is randomly selected from the full dataset and model parameters are computed. The data subset represents a minimal sample of data that is necessary to estimate the model parameters.

2. The elements of the full dataset are checked against the determined model to see if they are consistent with it. Any element that does not fit the model is considered an outlier.
The algorithm repeats the two steps until a certain number of inliers or iterations is reached. For pose estimation, RANSAC can be used to extract subsets of good correspondences. This is achieved by randomly selecting a subset of data and estimating the camera pose for it. The camera pose then is used to project the 3D points into the 2D image space. The Euclidean distance between the projected 3D points and the corresponding 2D image points determines if the pose estimation produced a valid result. Valid points are added to the inliers and invalid points to the outliers. This process is repeated for other data subsets, thereby minimizing the geometric error. The estimated pose with the largest number of inliers then is returned as the final result.
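Assuming OpenCV's Java bindings, a robust pose estimate from 2D–3D correspondences can be obtained with the RANSAC-based PnP solver as sketched below; lens distortion is ignored here for brevity.

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDouble;
import org.opencv.core.MatOfPoint2f;
import org.opencv.core.MatOfPoint3f;

// Sketch: RANSAC-based PnP pose estimation with OpenCV.
public final class PnPExample {

    static Mat[] estimatePose(MatOfPoint3f objectPoints, MatOfPoint2f imagePoints, Mat cameraMatrix) {
        Mat rvec = new Mat();                       // rotation as a Rodrigues vector
        Mat tvec = new Mat();                       // translation
        MatOfDouble distCoeffs = new MatOfDouble(); // empty: no lens distortion assumed
        Calib3d.solvePnPRansac(objectPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);
        return new Mat[] { rvec, tvec };
    }
}
```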

Marker-based Tracking

Using P3P/PnP to estimate poses from corresponding points, researchers have developed different approaches to apply this to AR. Many existing AR frameworks and applications like [20], [130], [131] use marker-based tracking methods due to their simplicity in implementation and robustness while tracking, even with low-cost cameras. A marker, also referred to as fiducial marker or just fiducial, commonly is a square (or circular [132]) easily detectable black and white pattern (Figure 2.47). One of the first markers was introduced by [19]. The colors black and white are used to ensure a high contrast between the single shapes of the pattern. This makes them more distinguishable.

Figure 2.47: Example of a fiducial marker.

Typically, marker tracking depends on some preliminary offline steps such as:
• Creating a physical marker
• Generating a digital description of the marker and saving it as a reference marker

The actual pose estimation is an online process which can be summed up with the following pipeline [133]:
• Acquire image of scene with marker in it
• Analyze the image and search for square shapes and extract corners of the square
• Analyze the interior patterns of the squares to identify these as markers

Pose Estimation

Given the current view of a marker as an image and a reference marker generated during the offline phase, the motion of the camera between both can be calculated by decomposing the homography. Once the pose of the camera is known, typically 3D models are displayed on the fiducial markers. The models can be viewed from all angles and distances, as long as the marker is visible to the camera

and is recognizable by the pattern recognition algorithms. As mentioned above, fiducial marker tracking relies on some offline preparation steps. The main disadvantage is that environments that should be used for AR must be prepared ahead of time. While distributing markers in small environments is feasible, equipping large areas is a tedious effort. So, for larger scale scenarios it is more advantageous to use already existing objects of the environment for the process.

Figure 2.48: Miniature digital 3D model of building interior visualized using a fiducial marker.

Natural Feature-based Tracking

Existing objects of the physical world are referred to as natural features and can also be used for tracking. Though, in comparison to the above-mentioned fiducial markers, which are artificially generated simple patterns, natural features are often complex objects that are part of a complex environment, which makes them more difficult to detect. Natural features can be planar surfaces, such as book covers, or also

3D objects, such as bottles, or also larger objects like buildings. This type of tracking is referred to as natural feature tracking or markerless tracking. Natural feature tracking has been investigated in numerous studies and projects due to its flexibility and broader range of use cases in contrast to marker-based tracking. In general, the solutions can be classified into model-based methods (Figure 2.49) and model-free methods. For identifying features, interest points (keypoints) of an image are important. As described by [134], these are specific parts of an image that meet certain conditions, optimally:
• clear definition
• clearly defined position
• different from its highly texturized surrounding
• repeatability should be guaranteed so that the interest points can be computed in multiple images of the same scene with variations in illumination, image noise and distortions

In the following, only monocular tracking methods, i.e. using a single camera, are considered.

Model-based Tracking

A frequently applied approach for natural feature-based pose tracking relies on existing 3D models of physical objects or scenes. The coordinates of the 3D model then are used as the counterpart to the 2D image coordinates. One of the main difficulties here is to find correct corresponding 2D-3D points. For this, model-based tracking approaches can be classified into two further categories, namely edge-based and texture-based. The edge- and the texture-based approaches can be implemented in a recursive way (e.g. [135]), where the previous pose

is optimized in the next iteration, which requires a good initial pose, or alternatively, using tracking by detection (e.g. [136]). The advantage of tracking by detection is that no initial pose is required. In some cases, it is advantageous to combine both types of approaches.

Figure 2.49: Different types of model-based tracking based on [137].

Recursive Edge-based Tracking

For recursive edge-based tracking, a 3D wireframe model (e.g. a CAD model) (Figure 2.50) of a physical object is matched to its 2D edges extracted from an image of the object (Figure 2.52). The matched 2D-3D points of the lines are then used for the pose estimation with an appropriate method (e.g. PnP), as described above.

2 Background 113

Figure 2.50: 3D wireframe model of a building.

More specifically, the steps of an edge-based tracking algorithm are the following:
1. Sample Points: Along the edges of the 3D wireframe model, sample points are generated at a predefined or calculated interval. This is done using the current pose, either on the 3D edge, which is then projected into the 2D image together with the sample points, or directly on its 2D projection. The advantage of using the 2D projection is that the sample points are then distributed more uniformly along the edges.
2. Visible Edges: To minimize the number of outliers in the matching process, only edges of the 3D model that are also visible in the 2D image are needed. Edges that are not visible, due to self-occlusion or because they lie outside of the current FOV, are filtered out. This is typically a hardware-based process using the depth-buffer of the renderer, making the calculation much more lightweight. Figure 2.51 shows a model of an example cube. The thin

gray lines in the back are covered by the front surfaces and would not be visible from the given view point.
3. Corresponding Points: Using the visible edges of the model, a 1D search for gradient maxima perpendicular to the model edge is performed for each sample point. This is either based on a single-hypothesis or a multiple-hypothesis approach. The single-hypothesis approach uses only the highest gradient closest to the sample point. The multiple-hypothesis approach utilizes a fixed number of high gradient points which are used as possible candidates.
4. Pose Calculation: The pose is then calculated using a non-linear minimization process with the found corresponding 2D image and 3D model points. The found pose is used for the next iteration of the tracking algorithm.

Figure 2.51: Cube with visible and hidden edges. The thin gray lines in the back are covered by the front surfaces.

Figure 2.52: A projected edge of a 3D model (black), sample points (red), and normals (blue) searching for edges of the door in the image (orange).

Some examples for line-based tracking can be found in [135], [138]–[141], [142], [143], [144].
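The 1D search of step 3 can be sketched as follows for a single sample point (single-hypothesis variant, simplified to take the strongest gradient within the search range). The GradientImage interface is a hypothetical placeholder for access to the gradient magnitude of the camera image.

```java
// Sketch: search along the edge normal for the strongest image gradient.
public final class NormalSearch {

    interface GradientImage { double gradientMagnitude(int x, int y); }

    /** Returns the 1D offset along the normal (nx, ny) with the highest gradient. */
    static int searchAlongNormal(GradientImage img, int sx, int sy,
                                 double nx, double ny, int searchRange) {
        int bestOffset = 0;
        double bestGradient = 0.0;
        for (int d = -searchRange; d <= searchRange; d++) {
            int px = (int) Math.round(sx + d * nx); // step along the normal
            int py = (int) Math.round(sy + d * ny);
            double g = img.gradientMagnitude(px, py);
            if (g > bestGradient) {
                bestGradient = g;
                bestOffset = d;
            }
        }
        return bestOffset; // candidate correspondence for the pose update
    }
}
```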

Recursive Texture-based Tracking

Recursive texture-based tracking methods can be further sub-divided into methods that use template matching and methods that use local image interest points. Template matching relies on registering the current image to a stored template image using global region tracking methods, meaning that the entire image is used, in contrast to local interest point-based methods. A possible algorithm for this is the Lucas-Kanade algorithm

[145], which tries to find the parameters of the deformation that fit the template image to the current input image. Some examples for use in 3D tracking are [146], [147] and [148]. In contrast to global region methods, localized interest point-based approaches are computationally less complex. One way to find local interest points is to manually select these in an offline stage. Another, more convenient way is to detect them automatically. Some typical interest points are corners, which can be defined as the intersection of two edges or as line endings, and can be found using corner detection algorithms like the Harris operator [102] or the Shi-Tomasi detector [103]. The found interest points then are used in the matching process, which utilizes sets of interest points detected from two images. The set of image points of one image is matched with the set of the other by taking an interest point from the first set and searching for its counterpart around the corresponding location in the second image. Searching is done by measuring similarities with cross-correlation. An example can be found in [149].

Edge-based Tracking by Detection

The advantage of detection-based methods, in comparison to recursive methods, is that these do not require an initial pose. On the other side, they typically are computationally more expensive. Edge-based tracking can also be solved by a detection-based approach. These methods typically rely on the recognition of predefined views of a 3D wireframe model which are created in an offline phase. For this, a virtual camera is automatically placed in different poses around the model. These precomputed views are then later applied during the matching process to identify the pose of the current view by measuring similarities between both. This method is non-incremental. An example can be found in [150].

Texture-based Tracking by Detection

Detection-based methods can also be employed using texture-based tracking, as shown in [136]. For this, local image parts are extracted that are invariant to scale, illumination and viewpoint, interchangeably referred to as interest points or keypoints, depending on the method. Typically, keyframes are created in an offline phase from the object that should later be tracked in the online phase. Keypoints are extracted and saved for every keyframe, along with their corresponding pose and 3D coordinates. To extract and save keypoints, a detector and a descriptor are required. Established ones are the Scale-Invariant Feature Transform (SIFT) [151] and Speeded Up Robust Features (SURF) [152] (Figure 2.53). In the online tracking phase, keypoints are extracted from the current frame with the respective algorithm used to extract the keypoints in the offline phase. These keypoints are used for the matching process, which searches in the keypoint database. When matches are found, 2D-3D correspondences can be created. These corresponding points are then utilized for the actual pose estimation by employing a suitable pose estimation algorithm like PnP. Typically, RANSAC is used to filter out the spurious data.


Figure 2.53: Example result of the SURF detector of [152]. The red circles represent the found interest points and their scale.
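A sketch of the keypoint detection and matching step is given below. It uses OpenCV's freely available ORB detector as a stand-in for the SIFT/SURF detectors named in the text; the resulting 2D–2D matches would then be turned into 2D–3D correspondences via the stored keyframe data and passed to a PnP solver.

```java
import org.opencv.core.Mat;
import org.opencv.core.MatOfDMatch;
import org.opencv.core.MatOfKeyPoint;
import org.opencv.features2d.DescriptorMatcher;
import org.opencv.features2d.ORB;

// Sketch: keypoint detection and brute-force matching between a keyframe and the current frame.
public final class KeypointMatching {

    static MatOfDMatch match(Mat keyframe, Mat currentFrame) {
        ORB orb = ORB.create();
        MatOfKeyPoint kp1 = new MatOfKeyPoint(), kp2 = new MatOfKeyPoint();
        Mat desc1 = new Mat(), desc2 = new Mat();
        orb.detectAndCompute(keyframe, new Mat(), kp1, desc1);     // offline phase
        orb.detectAndCompute(currentFrame, new Mat(), kp2, desc2); // online phase
        DescriptorMatcher matcher =
                DescriptorMatcher.create(DescriptorMatcher.BRUTEFORCE_HAMMING);
        MatOfDMatch matches = new MatOfDMatch();
        matcher.match(desc1, desc2, matches); // 2D-2D matches between the two frames
        return matches;
    }
}
```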

Combining Different Tracking Methods

Typically, today multiple pose tracking methods are combined to receive better results, since every tracking method on its own has its advantages and disadvantages. For example, in recursive tracking an initial pose is necessary, which then must be optimized during the process. An example for a combination of different methods is shown by [138]. They present a textureless 3D object detection and tracking approach using a particle filtering framework. The particles represent possible samples of the true state. To generate the particles, chamfer matching is used, which is a shape-based object detection method. It relies on pre-defined templates that are matched against an image. The templates are acquired from a 3D mesh model by setting the intrinsic camera parameters obtained from the physical camera to the virtual camera and capturing different views of the virtual 3D object

by rotating the camera around it. Using the obtained templates, the chamfer matching algorithm is applied, from which coarse pose estimations are acquired. A particle weighting and annealing process then follows to determine the most likely pose hypotheses. Using these hypotheses, the 3D model is projected to employ an edge-based tracking algorithm. Iterative Re-weighted Least Squares (IRLS) is used to minimize the error. Generally, just as in the presented work above, a popular approach is to determine a coarse pose by using a method that requires no initialization, such as the described object detection, followed by a pose optimization process, such as the iterative edge-based approach.

Model-free Tracking

The above-mentioned methods all require some sort of pre-captured model. But even if no model is available, there are still some possibilities for tracking. Especially in recent years, the concept of Simultaneous Localization and Mapping (SLAM) has become more important. SLAM is concerned with mapping an unknown environment while performing localization at the same time. This is popular in the field of robotics to achieve self-localization of robots. In comparison to model-based tracking methods, where the result is a pose in a global coordinate system, SLAM simply uses some local coordinate system, so the pose is relative to a starting point (e.g. the beginning of the SLAM process). Figure 2.54 shows a model of a room that was captured during the SLAM process.


Figure 2.54: Model captured using the Google Tango tablet.

3 Requirement Analysis

Figure 3.1: Required components of an AR-system

While many AR applications are designed towards small-scale use and based on fiducial marker tracking, augmentations of large-scale objects, such as buildings, are still quite uncommon. In this chapter, the requirements of a visual location-based mobile AR system that can display semantic information models, like CityGML, are analyzed and discussed. The focus of this work lies on the design of a low-cost off-the-shelf smartphone-based mobile solution that is self-contained and decoupled from additional external infrastructures, like pose

tracking systems or web-based solutions like Web Feature Services (WFS, http://www.opengeospatial.org/standards/wfs). It should be possible to augment entire building sections of a city, single buildings, sections of a building and single building parts. In the following, different components and methods are evaluated to find a suitable configuration for the proposed goals mentioned above. Figure 3.1 shows the different parts of a mobile AR system.

3.1 Mobile Device

Appropriate hardware is essential for the functionality of an AR system. The strong increase in powerful yet affordable smartphones in the last years encourages various applications to be designed around them. The variety of internal sensors, high-resolution cameras and displays of today's smartphones also offers all components necessary to develop a complete mobile AR system. Today, the two most popular mobile operating systems (OS) are Apple's iOS and Google's Android [153]. Therefore, it is advantageous to develop an AR system using a smartphone with one of the two OS. This has the benefit of a large user base. Android is particularly interesting, since it runs on approximately 70% of all mobile devices, whereas iOS shares the other 30% with other mobile OS. Additionally, Android is based on the platform-independent programming language Java, which for example easily allows reusing source code from existing projects. The amount of different Android smartphones on the market is overwhelming at times, with manufacturers like Samsung producing multiple new devices on a yearly basis.


So, another important factor for selecting a smartphone is the hardware specifications. Many modern smartphones offer displays with at least Full HD resolution. Taking the size of the screen and the distance at which these are typically viewed into account, these displays are sufficient for an AR system. Since an AR system strongly depends on acquiring data of the physical environment, a variety of sensors is a requirement for such a system. Specifically, accelerometers, magnetometers and gyroscopes are essential, as these can be used to determine the orientation of the physical device. Furthermore, the camera is essential for capturing images of the physical world that are later displayed on the screen and internally used for pose estimation. Consequently, the images should be of high quality. Multiple factors such as the number of megapixels (MP), the sensor size and the size of the pixels are important for the image quality. Roughly, it can be said that the larger each of these components is, the better the resulting images are. Typically, in development a trade-off must be found. For example, to increase the number of MP, either a larger sensor must be used, or the size of the pixels must be decreased so they fit onto the sensor area. In smartphones, size is fairly important, limiting the possibilities of increasing the sensor size. On the other side, the smaller a pixel is, the less light it can capture. This has an impact on the performance of a camera in darker conditions. A camera with larger pixels performs better in darker conditions than a camera with smaller ones. In smartphones, the sensor sizes are described as a fractional number in inches (e.g. 1/2.3"), which refers to a type of sensor. Its size is measured along its diagonal. A typical size is 7.66 mm. Pixel sizes

are measured in micrometers ( and typically range from 1 to 2 in today’s smartphones [154], [155]. For this thesis the three smartphones listed below were chosen for the development of an AR system and its evaluation (Table 3.1).

Table 3.1: Specifications of the three smartphones.

                     Google Nexus 5         Sony Xperia Z2         Google Pixel 2 XL
Display Resolution   1080 x 1920 pixels     1080 x 1920 pixels     1440 x 2880 pixels
                     (~445 ppi)             (~424 ppi)             (~538 ppi)
CPU / GPU            Qualcomm Snapdragon    Qualcomm Snapdragon    Qualcomm Snapdragon
                     800 quad-core 2.3 GHz  801 quad-core 2.3 GHz  835 octa-core 4x2.35 GHz
                     processor /            processor /            Kryo & 4x1.9 GHz Kryo /
                     Adreno 330             Adreno 330             Adreno 540
Storage / RAM        32 GB / 2 GB           16 GB / 3 GB           128 GB / 4 GB
Camera               8.0 MP, f/2.4,         20.7 MP, f/2.0,        12.2 MP, f/1.8,
                     1/3.2" sensor size,    1/2.3" sensor size,    1/2.6" sensor size,
                     1.4 µm pixel size      1.12 µm pixel size     1.4 µm pixel size
OS                   Android 6.0.1          Android 6.0.1          Android 8.1


3.2 CityGML vs. IFC

An AR system requires appropriate data. CityGML and IFC each provide data with potential. They generally have quite different application scopes, resulting from their primary domains. The most significant differences lie in the scale and content, but also in the degree of detail and accuracy of the model. Furthermore, fundamentals, such as the geometrical representation, differ, as CityGML uses a surface-based approach and IFC generally a solid-based one. While IFC models commonly focus on a detailed view of a specific building, ranging from basic structures down to small building components, CityGML covers large regions with multiple buildings and additional thematic areas, such as traffic areas, water bodies and vegetation. Buildings in CityGML are commonly less detailed than in IFC. For an AR application both have their advantages and specific information to contribute. CityGML allows a more general application that takes other city objects into account and permits assessments in the overall context. IFC models on the other side allow more detailed views of a single building and its components. So, depending on the target area of the AR application, the format must be chosen accordingly. CityGML data was chosen for this thesis, since it allows for a more generalized approach with city-wide use cases. IFC can be considered in future work to add more detail to the buildings. Due to the growing popularity of CityGML (more specifically its building module), the availability of data sets has increased in the past years. A number of publicly available CityGML models are already accessible, for example, Berlin [156], Hamburg [157], New York [158], [159] or the state of NRW [58].

The open data initiative of the state of NRW in Germany further propels the availability of CityGML data in LOD2 by providing roughly ten million 3D building models of the state. The models are derived by a fully automated process using existing data from aerial laser scans, aerial images and ground plans of buildings [160]. Standardized roof types are added in LOD2 by using the roof type that best fits the point cloud. In addition to the geometry, the models also contain some attributes like the name and height of the building. The data is georeferenced in the European Terrestrial Reference System 1989 (ETRS89) UTM32 and DHHN92/NHN. The height accuracy is specified as 5 m for the LOD1 models and 1 m for LOD2 [160]. The files are sorted and bundled by municipality. The “StädteRegion Aachen” for instance consists of 214 files with a total size of 1715 megabytes (MB) for the LOD2 model. For each of the 396 municipalities of the state of NRW the data sets range from approximately 100 MB up to 4 GB. The size of a single CityGML file depends much on the number of objects it holds, the LOD used and the amount of additional information for each object. For example, a file describing a single building from the “StädteRegion Aachen” dataset has a size of approximately 8 kilobytes (KB) in LOD1 and 15 KB in LOD2. Some custom created test files show the difference in size between a LOD1 and a LOD4 building model: while the LOD1 model of the building is only 14 KB in size, the LOD4 model already measures 16.5 MB. Since data for the “StädteRegion Aachen” is only available in LOD1 or LOD2, a fully georeferenced LOD4 CityGML model (Model 1) (Figure 3.2, Figure 3.3) of the civil engineering faculty building of RWTH Aachen University was created. The model was constructed with the following steps:

1. A geometry model of the interior and exterior was created using a tachymeter. With the tachymeter, the corner points of the building elements, such as walls, floors, ceilings, doors and windows, were surveyed and connected, creating a wireframe model.
2. Building elements were derived from the wireframe model using the software Bentley AECOsim Building Designer to create a BIM model with semantic and descriptive information for each object.
3. The BIM model was converted to a CityGML model using the Feature Manipulation Engine 46 (FME) by Safe Software.
4. The CityGML model was then manually edited to add an address and CityGML-specific object attributes to the building and to fix miscellaneous errors, such as duplicate geometry definitions of elements. Generally, the duplicate geometries are a problem of the BIM-to-CityGML conversion process, due to the different types of geometry modeling typically used (CSG to B-Rep).

Additionally, an artificial LOD4 CityGML building (Model 2) (Figure 3.4, Figure 3.5, Figure 3.6) was created. From the CityGML files of the “StädteRegion Aachen” the Aachen city region was chosen (Model 3) (Figure 3.7). Table 3.2 shows the number of objects for each model.

46 https://www.safe.com/fme/fme-desktop/

Table 3.2: Statistics of the three models used in the project.

Type                      Model 1    Model 2    Model 3
CityModel                 1          1          1
Building                  1          1          94,752
BuildingPart              0          4          32,685
BuildingInstallation      0          2          0
IntBuildingInstallation   9          8          0
RoofSurface               1          4          167,933
WallSurface               45         12         843,969
GroundSurface             1          1          111,920
ClosureSurface            0          0          447
CeilingSurface            38         18         0
InteriorWallSurface       717        15         0
FloorSurface              21         5          0
OuterCeilingSurface       0          0          0
OuterFloorSurface         0          0          0
Window                    644        17         0
Door                      324        13         0
Room                      38         18         0
BuildingFurniture         130        46         0
Polygon                   330,223    99,601     1,173,730
Size                      283 MB     86 MB      1790 MB


Figure 3.2: Custom created LOD4 model of the civil engineering building of the RWTH Aachen University (Model 1).

Figure 3.3: An office of the model 1 building.


Figure 3.4: Custom created LOD4 building (Model 2).

Figure 3.5: The kitchen of model 2.

Figure 3.6: The living room of model 2.

Figure 3.7: Part of the LOD2 model of Aachen based on data from [161] (Model 3).

3.3 Data Processing Options

Research towards data processing was conducted in [55]. As described there, customized solutions for processing the data are a prerequisite due to the limited hardware of smartphones and the complexity of CityGML. While frameworks for visualizing 3D models based on common graphics formats such as COLLADA, OBJ or X3D are generally available (e.g. Unity3D), support for CityGML is still rare, particularly for mobile devices. Commercial software for CityGML visualization like ArcGIS (ESRI) 47, Bentley Map (Bentley Systems) 48 or the CityEditor (3DIS GmbH) 49 and freeware/open source software like the CityGML SpiderViewer (GEORES) 50, the FZKViewer (KIT Karlsruhe) 51 or 3DCityDB are only available for desktop computers. For visualizing CityGML on mobile devices, current developments mainly focus on web-based solutions using client/server architectures and tiling systems that load blocks of data according to the current view. [162] and [163] presented a WebGL-based architecture that handles the data on a server and exports tiles with data in JSON files. [164], [165] and [166] pursue similar concepts by using client/server approaches and a conversion to graphics formats such as X3D, OBJ or KML 52 for a more efficient visualization. The increased focus on HTML5 (WebGL) solutions has introduced entire frameworks for 3D geospatial data visualization, such as Cesium 53 and iTowns 54. They feature an open source JavaScript and WebGL-based virtual globe and map engine that can display terrain, image data and 3D models. Since Cesium and iTowns do not provide native support for loading and visualizing CityGML, approaches have been described by [167] using KML or [168] using the GL Transmission Format (glTF), which are natively supported by Cesium.

47 http://desktop.arcgis.com
48 https://www.bentley.com/en/products/product-line/asset-performance/bentley-map
49 https://www.3dis.de/loesungen/3d-stadtmodelle/cityeditor
50 http://www.geores.de/geoResProdukteSpider_en.html
51 https://www.iai.kit.edu/1648.php
52 https://developers.google.com/kml/
53 https://cesiumjs.org
54 http://www.itowns-project.org

The initial official release of glTF by the Khronos Group was in late 2015 [169], with the goal of minimizing transmission and loading times of 3D scenes for WebGL applications. While employing formats such as COLLADA, OBJ, X3D or glTF makes the rendering process with existing visualization frameworks particularly simple, these pure graphics formats cannot directly store CityGML's semantic information, which negates CityGML's key distinguishing feature. Therefore, a new specification named 3D Tiles, based on glTF, is being developed that promises the efficiency of glTF and the possibility to store additional information [170], [171]. Specifically, the format Batched 3D Model (B3DM) aims at displaying large city models by using batching methods while preserving the per-object properties.

3.3.1 Memory-based Option

A straightforward approach for processing data for visualization is to load it directly from an exchange format (e.g. CityGML) into an in-memory model. The advantage is that no further infrastructure is necessary. Some low-level libraries such as citygml4j 55 that can process CityGML are already available. However, careful attention must be paid to the compatibility of such libraries with the Android system, since Java code as such can be run without problems, but not all libraries of the standard desktop-based Java package are part of the system. For instance, citygml4j utilizes the Java Architecture for XML Binding 56 (JAXB) interface, which is not compatible with Android.

55 https://www.3dcitydb.org/3dcitydb/citygml4j

To evaluate an in-memory model approach, two Java-based CityGML parsers that are able to run on mobile devices were developed, one using a DOM and the other using a pull parser. Both parsers create per-feature Java objects with their corresponding attributes. They were compared in terms of time and memory usage when parsing the CityGML data and transferring it to a Java rendering model.

Figure 3.8: Average RAM consumption for DOM and the pull parser [55].

As Figure 3.8 shows, using the DOM, RAM consumption generally ranged from 5-6 times the source file size, independently of the LOD.

56 https://www.oracle.com/technetwork/articles/javase/index-140168.html

Thus, given a file size of approximately 500 MB, already 3 GB of RAM on average are necessary, and for 1500 MB roughly 9 GB of RAM are required; for Model 1 with its 283 MB, this corresponds to about 1400-1700 MB of RAM. The pull parser on the other hand had a consistent RAM consumption of approximately 10 MB for running, regardless of the LOD or file size, plus the size of the actual in-memory model, which was between about half and the full size of the source file. Furthermore, the measured loading times for both parsers are not applicable for real-time use in AR environments. As seen in Figure 3.9, the pull parser was generally about 2-3 times faster than the DOM, but still too time consuming for real-time applications.

Figure 3.9: Average loading times for DOM and the pull parser [55].

3.3.2 Web-based Option

An alternative to loading the data directly from the file and storing it in memory is to use a web-based solution. The 3DCityDB is a possible solution for this. It is a database schema which maps the CityGML 2.0 model to a relational model, bundled with import, management, analysis, visualization and export tools, and is available for Oracle and PostGIS [172]. The included OGC compliant WFS 2.0 allows for web-based access to the CityGML data. An advantage of a web-based solution is its platform independency, which allows using an application on any device, given a compatible web browser. This enables access to a larger user base and provides the same experience on every device. Also, by employing a client/server structure, data and workloads can be distributed. For instance, data storage, management and queries can be handled by the server while the client device is used for rendering. However, the usability of such web-based solutions heavily depends on the availability of a network connection. While stable and fast network connections are commonly present in desktop environments, mobile connections are often unreliable, especially in rural areas, due to poor network coverage. Bandwidth is another critical issue to consider. CityGML models are regularly multiple GB in size, which is problematic with limited mobile data plans. In CityGML, especially the LOD has a strong impact on file sizes: a building modelled in LOD1 typically is only a few KB in size, but its LOD4 representation can already be multiple GB in size. Another critical issue, which is especially important in the context of AR, is low-level access to the smartphone's hardware, like the sensors. Specifically when using HTML5 (WebGL), the functionality and performance of web applications heavily depend on the type of web browser and its capabilities. Full HTML5 support generally still is uncommon, especially on mobile devices [173]. In Unity3D, for instance, WebGL is not supported at all on mobile devices or performs insufficiently [174]. How well the application performs ultimately depends on the web browser's JavaScript engine and its capabilities to allocate memory. As a consequence of the limitations of web-based solutions and the lack of existing native mobile solutions, a custom solution was developed.

3.3.3 Local Database-based Option

To stay independent from network connectivity, but still be able to handle large amounts of data on the device itself, a fully embedded local database is an option. A solid number of embedded databases is available for Android devices, for example Oracle Berkeley DB 57 or Couchbase Lite 58. Preferably the database should provide spatial capabilities, which is not the case for many. Android, for instance, contains the software library SQLite 59 that implements a self-contained, serverless, zero-configuration, transactional SQL database engine [175]. However, like the majority of embedded databases, it natively does not provide spatial functionalities. SpatiaLite 60 solves this by extending the core functionalities of SQLite to facilitate vector geodatabase functionalities. Multiple types of geometries, such as points, lines and polygons, but also more complex types, such as MultiPolygons, are supported.

57 http://www.oracle.com/technetwork/database/database-technologies/berkeleydb/overview/index.html
58 https://github.com/couchbase/couchbase-lite-android
59 https://www.sqlite.org/
60 https://www.gaia-gis.it/fossil/libspatialite/index

Most important for CityGML is that it is capable of storing 3D geometries. Unfortunately, SpatiaLite only provides a few functions, such as ST_3DDistance, that actually consider the z-value of the geometries during calculations. Oracle Spatial and PostGIS, on the other hand, each have a more extensive list [176]. Nevertheless, SpatiaLite represents a strong solution for storing CityGML data. Just like Oracle Spatial and PostGIS, it is capable of utilizing spatial indices, which are crucial for fast querying of geometries in large database tables. The spatial index uses the SQLite R*Tree algorithm, which employs a tree-like approach in which geometries are represented by their MBR. SQLite also allows attaching multiple databases to a single database connection, enabling them to operate as one. Using this feature, SpatiaLite has the following maximum capabilities (Table 3.3):

Table 3.3: Maximum capabilities of SpatiaLite ([175]).

max string/BLOB length    2.147483647 GB
max pages                 2,147,483,646 pages with a page size of 65.536 KB
max attached databases    125
max database size         140 TB

By being able to utilize multiple databases it is not only possible to store massive models, but also to handle models with different coordinate systems. For example, models with the same coordinate system can be hosted in one database while others are stored in another database. The overall performance of SpatiaLite in terms of 2D/3D spatial queries is on par with that of PostGIS or Oracle Spatial. This was assessed by evaluating typical spatial functions, such as topological operators relevant for querying CityGML data (Figure 3.10).

Figure 3.10: Average runtimes for typical spatial queries. The Oracle and PostGIS queries were run on an Intel Core [email protected] and 8 GB RAM; SpatiaLite was run on a Google Nexus 5 [55].

SpatiaLite's performance and functions make it a suitable option for working with CityGML, even with the limited number of spatial functions that support true 3D. A solution to the missing support is to utilize the in-database 2D functions to pre-select only data of interest and then to apply 3D queries outside of the database to refine the selection.
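The following sketch illustrates this two-step idea. It is a simplified illustration rather than the exact implementation of this work: it assumes a SpatiaLite-enabled SQLite driver that exposes the familiar Android SQLiteDatabase/Cursor API, a table named geometries with precomputed z_min/z_max columns, and it omits hooking the query up to the R*Tree index (e.g. via SpatiaLite's SpatialIndex virtual table).

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import java.util.ArrayList;
import java.util.List;

public final class SpatialPreselect {

    // 2D pre-selection with MbrIntersects/BuildMbr in the database, followed by a
    // 3D refinement in Java using the stored z-range of each geometry.
    public static List<Long> selectCandidates(SQLiteDatabase db,
            double xMin, double yMin, double xMax, double yMax,
            double zMin, double zMax) {
        List<Long> ids = new ArrayList<>();
        String sql = "SELECT id, z_min, z_max FROM geometries "
                + "WHERE MbrIntersects(geom, BuildMbr("
                + xMin + ", " + yMin + ", " + xMax + ", " + yMax + "))";
        Cursor c = db.rawQuery(sql, null);
        try {
            while (c.moveToNext()) {
                double gzMin = c.getDouble(1);
                double gzMax = c.getDouble(2);
                if (gzMax >= zMin && gzMin <= zMax) {   // z-overlap test outside the database
                    ids.add(c.getLong(0));
                }
            }
        } finally {
            c.close();
        }
        return ids;
    }
}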

3.4 Polygon Triangulation

While convex polygons simply require an algorithm that connects a vertex with all other vertices of the polygon, except its direct neighbors, concave polygons require more complex algorithms. A variety of CityGML models from LOD1 to LOD4, including Model 1, Model 2 and Model 3, were evaluated to count their convex and concave polygons and to determine the necessity of sophisticated triangulation algorithms.

Figure 3.11: Percentage of convex and concave polygons in the three models M1, M2 and M3.

Models 1, 2 and 3 reflect the overall conclusion that the majority of polygons in CityGML models typically are convex (Figure 3.11).

Ear-Clipping Triangulation vs. Delaunay Triangulation

While the ear-clipping algorithm is mainly found in literature related to 3D real-time rendering for triangulating simple polygons with an ordered set of vertices, the DT is a commonly utilized triangulation for unordered point sets, as for example found for TINs. To decide which triangulation type is appropriate for the AR system, the results of the analyzed data in Figure 3.11 provide an indicator, since the choice mainly depends on the type of data and the use case for the triangulated data. Two advantages of the DT over the ear-clipping method are the more uniform triangles and the possible runtime of O(n log n) in comparison to O(n²). A disadvantage of the DT is that it is only applicable to convex polygons, due to the lack of knowledge about prescribed edges. For concave polygons or polygons with holes, a yet more sophisticated method is required, like the CDT. In comparison to ear-clipping, it is somewhat more complex to implement to reach the named runtime. Also, by having to use the CDT, one of the advantages of the triangulation is compromised, since not all triangles will conform to the Delaunay conditions. Depending on the algorithm used to obtain the CDT, the straightforward ear-clipping approach with O(n²) can be equally or more efficient, if for instance the CDT requires a complex tree-based structure to be created. Therefore, for practical purposes a trivial triangulation method was chosen for the convex polygons that selects a random vertex and connects it with all other vertices except its direct neighbours, and the ear-clipping method for the concave polygons.
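A minimal sketch of the trivial fan triangulation for the convex case is shown below; for simplicity it fans from the first vertex of the ring instead of a randomly selected one and returns index triples into the polygon's vertex list (the ear-clipping part for concave polygons is not shown).

// Fan triangulation for a convex polygon with vertexCount vertices:
// connect the first vertex with every pair of consecutive non-adjacent vertices.
public static int[] triangulateConvexFan(int vertexCount) {
    if (vertexCount < 3) {
        return new int[0];                // degenerate polygon, nothing to triangulate
    }
    int[] indices = new int[(vertexCount - 2) * 3];
    int k = 0;
    for (int i = 1; i < vertexCount - 1; i++) {
        indices[k++] = 0;                 // fan center (first vertex of the ring)
        indices[k++] = i;                 // current vertex
        indices[k++] = i + 1;             // next vertex
    }
    return indices;                       // index triples into the polygon's vertex list
}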

Polygon Triangulation Libraries

As mentioned in [55], polygon triangulation is not a new problem, so consequently a wealth of libraries exists. Most are written in C++, like CGAL 61. Only a few, such as the JTS Topology Suite (JTS) 62, Poly2Tri 63, jDelaunay 64 or delaunay-triangulator 65, are implemented in Java. These were evaluated, and it was found that many special cases that can occur in CityGML data are not considered. For instance, holes in general, or holes touching the outline of the polygon or another hole, are not processed correctly, resulting in erroneously triangulated polygons. Therefore, a custom triangulation solution was implemented according to [45], which specifically covers these possible special cases.

3.5 Data Rendering

One of the three main characteristics of AR, according to [9], is its interactivity in real-time. 3D real-time rendering, therefore, plays a crucial role for this requirement. According to [37] and [177], a human can perceive about 10-12 individual frames per second; higher frame rates are perceived as continuous motion, although flickering is still very noticeable at low rates. A preferred value typically is 60 FPS, at which flickering is hardly recognizable; an even higher rate is desirable, since it reduces interaction response times. To achieve real-time visualization of CityGML models, efficient rendering algorithms must be implemented by using appropriate graphics APIs and hardware.

61 https://www.cgal.org
62 https://github.com/locationtech/jts
63 http://sites-final.uclouvain.be/mema/Poly2Tri
64 https://github.com/orbisgis/jdelaunay
65 https://github.com/jdiemke/delaunay-triangulator

As described, there are some commonly used APIs such as Direct3D and OpenGL for offline use and WebGL for web-based development. For native Android developments, OpenGL ES can be used. However, as described in [55], instead of using such a low-level API, a game engine is more desirable. For the AR system of this thesis, the jMonkey game engine is best suited, since it is a fully customizable Java-based open source framework that provides all necessary high-level functionalities, while simultaneously facilitating low-level access. Some commercial game engines, such as Unity3D or Unreal Engine, indeed contain more sophisticated development tools or functionalities, but typically limit the developer to engine-specific concepts that make low-level adjustments particularly difficult or impossible. There are several smaller non-commercial open source game engines similar to jMonkey, such as Ogre3D 66 or libGDX 67, but these are typically focused on specific use cases, limited in functionality or poorly documented. From experience, jMonkey delivers the most complete package, as it is stable, comprehensibly structured and offers all necessary features. Additionally, it is well documented and has an active community. In its release version it includes a graphics engine that can simulate lights and shadows, use shaders and materials, and apply filters and effects. Typical game logic, like custom controls to control and animate objects and keyboard/touch input, is available to the developer. Utility classes include physics simulations, networking functionalities and a modular GUI framework.

66 https://www.ogre3d.org
67 https://libgdx.badlogicgames.com

In addition to loading typical game graphics formats like OBJ or FBX, jMonkey also allows importing custom polygon meshes.
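The sketch below indicates how a triangulated CityGML polygon could be wrapped into a jMonkeyEngine mesh; the method name and the assumption that positions and indices come straight from the triangulation step are illustrative, not the exact code of this work.

import com.jme3.scene.Geometry;
import com.jme3.scene.Mesh;
import com.jme3.scene.VertexBuffer.Type;
import com.jme3.util.BufferUtils;

public final class MeshFactory {

    // Builds a custom jMonkey mesh from flat position data (x, y, z triples)
    // and the triangle indices produced by the triangulation step.
    public static Geometry toGeometry(String id, float[] positions, int[] indices) {
        Mesh mesh = new Mesh();
        mesh.setBuffer(Type.Position, 3, BufferUtils.createFloatBuffer(positions));
        mesh.setBuffer(Type.Index, 3, BufferUtils.createIntBuffer(indices));
        mesh.updateBound();                    // needed for culling and picking
        return new Geometry(id, mesh);         // material is assigned by the caller
    }
}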

3.6 Pose Tracking Methods

Today, much of the published research work in the field of AR is related to pose tracking, since it still is challenging for many scenarios. Primarily, the choice of tracking method depends on the use case of the AR system, for example, if it will be used in- or outdoors, in small- or large-scale areas, etc.

3.6.1 Sensor-based Tracking

While in outdoor environments a GNSS like GPS is a reasonable solution, the strong dependency on a LOS renders it impracticable for indoor use. But even outdoors it can be imprecise in particular environments, when the LOS is blocked in any way, as for example in street canyons or dense forests. DGPS can provide more accurate positions but requires additional correction data that is transmitted by radio. Generally, commercial smartphones do not support this system yet. To determine the orientation of the AR device, typically, the accelerometer, magnetometer and gyroscope are used in conjunction with sensor fusion. An example framework for implementing sensor-based AR is DroidAR 68 .

68 https://bitstars.github.io/droidar/

Low-cost inertial sensors in smartphones unfortunately are very susceptible to drift over time and lack precision in general. The following Table 3.4, Table 3.5 and Table 3.6 show detailed specifications for the accelerometers, magnetometers and gyroscopes of the three smartphones used in this thesis.

Table 3.4: Information about the accelerometer, magnetometer and gyroscope of the Google Nexus 5.

Google Nexus 5      Accelerometer        Magnetometer       Gyroscope
Name                MPU6515              AK8963             MPU6515
Max Range           19.613297 m/s²       4911.9995 µT       34.906586 rad/s
Resolution          5.950928E-4 m/s²     0.14953613 µT      0.0010681152 rad/s
Power Consumption   0.4 mA               5.0 mA             3.2 mA
Vendor              InvenSense           AKM                InvenSense


Table 3.5: Information about the accelerometer, magnetometer and gyroscope of the Sony Xperia Z2.

Sony Xperia Z2      Accelerometer         Magnetometer           Gyroscope
Name                BMA2X2                AK09911                BMG160
Max Range           39.2265930176 m/s²    4911.9995117188 µT     34.9065856934 rad/s
Resolution          0.0191497803 m/s²     0.5996704102 µT        0.0010681152 rad/s
Power Consumption   0.13 mA               0.24 mA                5.0 mA
Vendor              Bosch                 AKM                    Bosch

Table 3.6: Information about the accelerometer, magnetometer and gyroscope of the Google Pixel 2 XL.

Google Pixel 2 XL   Accelerometer          Magnetometer           Gyroscope
Name                LSM6DSM                AK09915                LSM6DSM
Max Range           156.9064025879 m/s²    5160.0004882813 µT     17.4532928467 rad/s
Resolution          0.0023956299 m/s²      0.1495361328 µT        0.0012207031 rad/s
Power Consumption   0.15 mA                1.8 mA                 0.45 mA
Vendor              STMicroelectronics     AKM                    STMicroelectronics


Figure 3.12, Figure 3.13 and Figure 3.14 show examples of the sensor drift of the accelerometers for the x-, y- and z-axis. The measurements were taken by placing each smartphone in a neutral area without electromagnetic influences. The devices were kept still for the full 30 min duration of each measurement.

Figure 3.12: Example of sensor drift for the x-axis of the accelerometers of Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL.


Figure 3.13: Example of sensor drift for the y-axis of the accelerometers of Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL.

Figure 3.14: Example of sensor drift for the z-axis of the accelerometers of Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL.

Not only the accumulation of errors over time is a problem, but also the accuracy of the resulting pose in general. For an evaluation, measurements from the accelerometer, magnetometer and gyroscope were fused to determine an absolute pose. Outdoors, locations were obtained using GNSS and indoors a position was manually set.
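A minimal sketch of such a sensor-based orientation estimate with the standard Android API is given below; it is not the exact fusion filter used for the evaluation, and gyroscope smoothing as well as the GNSS position handling are omitted. The parameter names are assumptions.

import android.hardware.SensorManager;

public final class ImuOrientation {

    // Returns azimuth, pitch and roll in degrees, or null if the readings are unusable.
    // lastAccel and lastMagnet are assumed to hold the most recent accelerometer and
    // magnetometer SensorEvent values.
    public static float[] orientationDegrees(float[] lastAccel, float[] lastMagnet) {
        float[] rotation = new float[9];
        float[] inclination = new float[9];
        float[] orientation = new float[3];
        if (!SensorManager.getRotationMatrix(rotation, inclination, lastAccel, lastMagnet)) {
            return null;                                   // sensor data not usable
        }
        SensorManager.getOrientation(rotation, orientation);
        return new float[] {
            (float) Math.toDegrees(orientation[0]),        // azimuth (heading)
            (float) Math.toDegrees(orientation[1]),        // pitch
            (float) Math.toDegrees(orientation[2])         // roll
        };
    }
}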

Figure 3.15: Orientation error in indoor and outdoor areas due to the magnetometer.

Figure 3.15 shows that the orientation accuracy outdoors generally is better than indoors. Much of the error is produced by the magnetometer, possibly due to the influence of electronic devices and metal objects. Nonetheless, with an average error of 22 degrees indoors and 18 degrees even in outdoor areas, all three smartphones are not accurate enough to fulfil the requirements of the AR system proposed here. The error of the magnetometer also confirms the results reported in [179].

Also, the average GNSS positions provided by the three smartphones are not sufficient (Figure 3.16). A position error of 2-3 m results in a noticeable shift between the physical and the virtual object displayed on the screen, as shown in Figure 3.17.

Figure 3.16: Position error when using GNSS.

Figure 3.17: The virtual objects are shifted and do not fit the physical objects due to inaccuracies of the calculated pose.

3.6.2 Infrastructure-based Tracking

As an alternative, the commercial tracking system Polhemus G4 was evaluated, as the manufacturer specifies promising accuracies of 0.20 cm at 1 m distance, 0.64 cm at 2 m and 1.27 cm at 3 m. The system was tested in different environments and configurations, but the targeted accuracies could not be reached.

Figure 3.18: The Polhemus G4 becomes increasingly inaccurate with growing distance.

Figure 3.18 shows that the actual pose determined by the device was accurate enough for constructing an AR system, but its range was not sufficient. Beyond 3 m distance the accuracy decreased drastically. Also, building structures, such as walls, heavily impacted the measurements, so that tracking could not be achieved across multiple rooms. Thus, multiple units would be necessary to cover larger areas. Furthermore, next to installing the system in the environment, it is also necessary to attach external hardware to the smartphone for the tracking process to work. In large outdoor areas such infrastructure-based devices are impracticable altogether. For this reason, image-based tracking methods were evaluated.

3.6.3 Optical Tracking

Optical pose estimation has been used in various AR scenarios and shown for example by [20], [27], [135], [137]–[140], [142]–[144], [180]–[192]. For this thesis, the goal was to develop a visual tracking algorithm that solely relies on data present on the device, so that the AR system is self-sufficient. This implies the following restrictions:
• No offline processes, as for example pre-computing additional data, like point clouds, or creating reference picture databases.
• No network connections to acquire additional information.
• No preparations of the environment with additional tracking infrastructures, like beacons or fiducial markers.

Therefore, a straightforward choice is to utilize the fully georeferenced CityGML data already available directly on the device. Since a CityGML model provides 3D coordinates that correspond to a real-world coordinate system, these can be utilized to determine a real-world position. 2D coordinates on the other side can be obtained from the images produced by the smartphone camera. The challenge hereby lies in finding the right method from the various model-based pose estimation methods available. Since the majority of CityGML models today typically are not texturized, only provide low quality textures or carry limited color information, a geometry-based (e.g. edge-based) approach shows the most potential. The actual pose estimation then is performed by a PnP method. OpenCV 69 provides a couple of different implementations of P3P/PnP methods. Some of these were evaluated, as described in the following, to determine the expectable pose accuracies by conducting synthetic experiments.

Expectable PnP Accuracy

Three methods were evaluated:
• P3P based on the paper of [112].
• An iterative approach based on a Levenberg-Marquardt optimization [127] that minimizes the reprojection error, i.e. the sum of squared distances between observed points and projected points.
• EPNP based on [120].

The test cases were designed similarly to related works such as [111], [116], [119]–[124], [193], [194]. A calibrated camera was simulated with an image size of 3264×2448 px, a focal length of 4.0 mm, which equals 30.45 mm on a full frame (35 mm) camera, and a principal point at the center of the image (1632; 1224). 3D object points were randomly distributed in the x, y, z interval [−2; 2] × [−2; 2] × [5; 10]. The number of points was chosen according to the PnP method, with 4 points for P3P and 500 for PnP. Ground-truth translations were uniformly distributed in a 6×6 m raster at an interval of 0.5 m at distances of 2, 4 and 6 m from the 3D object points (see section 2.1.1 – Translation).

69 https://opencv.org

A ground-truth rotation matrix was randomly generated for each position (see section 2.1.1 – Rotation). 2D/3D point correspondences were generated by projecting the 3D object points into the image coordinate system using the intrinsic and extrinsic parameters (see section 2.1.1 – Projection). The point sets and the camera calibration matrix were then used to estimate a translation and rotation with a PnP method (see section 2.4.5). The translation errors were calculated using the Euclidean norm (equation (3.1)):

$e_t = \| t_{true} - t_{est} \|$   (3.1)

Rotational errors were calculated for each axis with the following equation (3.2):

$e_{r,i} = \arccos\left( r_{true,i} \cdot r_{est,i} \right) \cdot 180 / \pi$   (3.2)

where $r_{true,i}$ and $r_{est,i}$ are the $i$-th column of the rotation matrices $R_{true}$ and $R_{est}$, respectively. Each simulation was run 100 times to determine the mean and maximum error. First, the impact of distorted image coordinates was evaluated. Distortions in the 2D coordinates are common due to automated extraction processes (e.g. door detection and coordinate extraction). To simulate this, Gaussian noise was added to the artificial image coordinates, with the distortion increased by raising the standard deviation σ. Four 2D/3D point correspondences were used. The results are shown in Figure 3.19 and Figure 3.20.

Figure 3.19: Mean translation error of each method when Gaussian noise is increasingly added to the 2D image points.

Figure 3.20: Mean rotation error of each method when Gaussian noise is increasingly added to the 2D image points.

As Figure 3.19 and Figure 3.20 show, the iterative solution is most influenced by noisy image data. The iterative solution and EPNP are each designed to use more than four 2D/3D point correspondences. So, the performance of both algorithms was analyzed with different numbers of points. The 2D data was distorted with Gaussian noise (σ = 5.0 px). Again, the mean translation error and mean rotation error are depicted in Figure 3.21 and Figure 3.22.

Figure 3.21: Mean translation error of each method when increasing the number of points using an image heavily distorted with Gaussian noise.

Figure 3.22: Mean rotation error of each method when increasing the number of points using an image heavily distorted with Gaussian noise.

Figure 3.21 and Figure 3.22 show that the mean translation and rotation errors decrease with an increasing number of points. A second factor is the accuracy of the 3D points. Like for the 2D points, distortions in the 3D coordinates are also common due to automated extraction processes. To simulate this, Gaussian noise was increasingly added to the 3D data. The results of the evaluations using the distorted data are shown in Figure 3.23 and Figure 3.24. Four 2D/3D correspondences were used.


Figure 3.23: Mean translation error of each method when Gaussian noise is increasingly added to the 3D object points.

Figure 3.24: Mean rotation error of each method when Gaussian noise is increasingly added to the 3D object points.

Figure 3.23 and Figure 3.24 show that an increase of noise in the 3D data does not have a significant impact on the mean translation and rotation errors. As with the distorted 2D points, to test an increasing number of 2D/3D point correspondences with distorted 3D data, Gaussian noise (σ = 0.05 m) was added to the 3D data and simulations were run with an increasing number of points (Figure 3.25 and Figure 3.26).

Figure 3.25: Mean translation error of each method when increasing the number of 3D object points heavily distorted with Gaussian noise.

Figure 3.26: Mean rotation error of each method when increasing the number of 3D object points heavily distorted with Gaussian noise.

As Figure 3.25 and Figure 3.26 show, both methods, the iterative solution and EPNP, gain accuracy with an increased number of 2D/3D corresponding points. The last test case to evaluate was the impact of outliers in the data, first with outliers in the 2D data and secondly in the 3D data. Since errors were equally found in the positions and orientations in the previous test cases, in the following only the translation errors are evaluated. The diagrams (Figure 3.27 and Figure 3.28) depict the results of the normal iterative solution and EPNP and of both methods combined with RANSAC (see section 2.4.5), which were run with 500 corresponding points.


Figure 3.27: Comparison of the mean translation error when using methods with and without RANSAC on 2D image points that include outliers.

Figure 3.27 shows that with outliers in the 2D data the iterative solution and EPNP each produce relatively large errors. EPNP performs slightly better though. The increase in outliers does not significantly impact the results. When using RANSAC in combination with the methods, the outliers do not affect the results and an accurate position is estimated.


Figure 3.28: Comparison of the mean translation error when using methods with and without RANSAC on 3D object points that include outliers.

Figure 3.28 shows that when outliers are distributed into the 3D data, the results of the normal EPNP method are similar to the error found in the results of the 2D distorted data. The iterative solution on the other hand performs relatively well in comparison to EPNP with approximately 0.1 m error. When RANSAC is included equally accurate results are produced, as with the 2D altered data.
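For illustration, the sketch below shows how such a RANSAC-backed pose estimate can be obtained with the OpenCV Java bindings (assumed to be available on the device); parameter tuning such as the reprojection threshold or iteration count is left at the library defaults.

import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDouble;
import org.opencv.core.MatOfPoint2f;
import org.opencv.core.MatOfPoint3f;

public final class PnpEstimator {

    // Estimates the camera pose (rvec/tvec) from 2D/3D correspondences,
    // using RANSAC to suppress outliers as in the experiments above.
    public static boolean estimatePose(MatOfPoint3f objectPoints, MatOfPoint2f imagePoints,
                                       Mat cameraMatrix, MatOfDouble distCoeffs,
                                       Mat rvec, Mat tvec) {
        return Calib3d.solvePnPRansac(objectPoints, imagePoints,
                cameraMatrix, distCoeffs, rvec, tvec);
    }
}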

Edge-based Tracking

A first approach to optical pose estimation was to use an edge-based method according to [135], [140]. As shown in Figure 3.29, using the sensor-derived pose of the smartphone, 3D lines of the CityGML objects were extracted and projected into the current camera image. The lines were extracted from the CityGML model by utilizing the depth-buffer, so that only those that were visible in the image were projected. An advantage of using the hardware-based depth-buffer to filter occluded edges is the increased performance compared to a CPU-based solution. Sample points were generated at a predefined interval on the projected lines, from which positive and negative normals were calculated. A 1D search was performed in binary camera images containing edges found with the Canny algorithm. Edges with a similar orientation and a minimal distance to the projected lines were considered as inliers. Once a similar edge was found, the 3D starting point of the normal and the found 2D point in the image were saved as corresponding points. When the algorithm finished, the corresponding points were used for a PnP-based pose estimation.

Figure 3.29: Edge-based pose tracking with IMU (flowchart: acquire camera image and IMU pose, load CityGML data with the pose, remove occluded edges, project the 3D CityGML edges into the image with the camera pose, generate sample points and normals on the projected edges, search for points with strong gradients in the image, and use the 3D sample points and 2D image points for pose estimation).
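Two of the building blocks of this pipeline are sketched below with the OpenCV Java bindings as an illustration: projecting the (already visibility-filtered) 3D edge endpoints with the coarse IMU pose, and computing the binary edge image for the 1D normal search. The Canny thresholds and all helper names are assumptions, not the exact implementation of this work.

import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDouble;
import org.opencv.core.MatOfPoint2f;
import org.opencv.core.MatOfPoint3f;
import org.opencv.imgproc.Imgproc;

public final class EdgeTrackingStep {

    // Projects the 3D endpoints of the visible CityGML edges into the image
    // using the coarse pose derived from the IMU.
    public static MatOfPoint2f projectModelEdges(MatOfPoint3f edgeEndpoints3d,
            Mat rvecImu, Mat tvecImu, Mat cameraMatrix, MatOfDouble distCoeffs) {
        MatOfPoint2f projected = new MatOfPoint2f();
        Calib3d.projectPoints(edgeEndpoints3d, rvecImu, tvecImu,
                cameraMatrix, distCoeffs, projected);
        return projected;                                // 2D endpoints of the projected edges
    }

    // Binary edge image used for the 1D search along the sample-point normals.
    public static Mat edgeImage(Mat grayCameraImage) {
        Mat edges = new Mat();
        Imgproc.Canny(grayCameraImage, edges, 50, 150);  // thresholds are assumptions
        return edges;
    }
}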

The main problem with this approach is that a good initial pose is required. Some test runs showed that the pose must be relatively accurate to project the edges closely to their corresponding 2D image edges. A first idea was to use a pose derived from the smartphone sensors as the initial pose. However, in most cases the resulting projection of the 3D edges was displaced in such a way that no or only false corresponding edges could be found in the image (Figure 3.30). The red lines represent the projected 3D edges and the green lines the normals used for the 1D search for 2D image edges. The green lines end where the algorithm found possible corresponding image edges.

Figure 3.30: An example of the edge-based pose tracking algorithm.

Therefore, an object detection-based approach was implemented, which has the advantage that no preliminary pose is required. CityGML provides multiple objects like furniture, openings or walls that qualify for the tracking process, but openings are especially suited due to their simple geometrical shape. Furthermore, openings are permanent in their position, as opposed to furniture, which is easily moved and is also geometrically more complex.

Door-based Pose Detection

[195] presented two approaches to estimate a 3D position for an IPS using door corners. The first approach uses artificial markers with encoded object information placed on each door to identify them. The second approach uses an external database with images of each door taken from various angles to account for differences in appearance caused by the different viewpoints. Doors are found in an image using their contours, extracted by a Canny algorithm [101] and a Hough transformation [196]. Additionally, special door features, such as door handles, are saved. The extracted texturized door candidates are normalized and compared to the doors in the door database. Disadvantages of these methods are that they require some preparations in advance, like placing QR codes or creating a database of images, and that they rely on a network connection for the communication between the smartphone (client) and the image database (server). Furthermore, the total runtime for the positioning process of approximately 50 s is too long for real-time applications, such as AR. With the CityGML data right on the device it is possible to realize a local smartphone-based approach, independent of preliminary work and network connections. Like for the edge-based approach, first a suitable PnP method is required. Secondly, corresponding 2D/3D points of the door must be found. Therefore, the door must be identified in the camera image and in the CityGML model correspondingly. While finding the CityGML door and extracting its geometry is relatively straightforward, extracting accurate 2D coordinates from an image is more challenging. Ultimately, the door must be identified as accurately as possible in the image. It is not sufficient to only recognize a door in an image; its geometry must also be extractable. [197] apply a door detector that combines edge and corner features so that the geometry becomes available. The detection success much depends on the quality of the corners, since these are tested against certain conditions that apply to a door. In their presented solution they use a curvature-based corner detector. Alternatives are the Harris corner detector and the Shi-Tomasi corner detector. For the implementation of the door-based pose estimation for this work, the three corner detectors were evaluated in terms of corner quality (Table 3.7). Similar to [104], some reference images were captured in which those corners that a human would generally regard as corners, i.e. corners at a right angle, were marked manually. In the following, these are referred to as true corners. The algorithms were run on 480×360 px images. On average there were 51 true corners in the image set. The parameters of each algorithm were set so that at least the door corners could be detected in the majority of the images.
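As an illustration of how the Harris and Shi-Tomasi candidates can be obtained, the sketch below uses OpenCV's goodFeaturesToTrack; the parameter values are assumptions, and the curvature-based detector that was ultimately chosen is a custom implementation not shown here.

import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.imgproc.Imgproc;

public final class CornerDetection {

    // Detects corner candidates in a grayscale image; useHarris switches between
    // the Harris detector (true) and the Shi-Tomasi detector (false).
    public static MatOfPoint detectCorners(Mat grayImage, boolean useHarris) {
        MatOfPoint corners = new MatOfPoint();
        Imgproc.goodFeaturesToTrack(grayImage, corners,
                500,          // maximum number of corners
                0.01,         // quality level
                10,           // minimum distance between corners in px
                new Mat(),    // no mask
                3,            // block size
                useHarris,    // detector selection
                0.04);        // Harris free parameter k
        return corners;
    }
}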


Table 3.7: Comparison of corner detectors.

Corner Detector    Total Found    Found True    Missed True    Found False    Door Corners Included
Harris             4243           37            14             4206           87%
Shi-Tomasi         248            34            17             214            85%
Curvature-based    319            34            17             285            89%

As Table 3.7 shows, the Harris corner detector on average finds the most true corners, but on the downside also detects the most false corners, approximately 20 times more than the Shi-Tomasi or the curvature-based detector. This is crucial for the detection speed of the door detector, since a higher number of corner candidates generally increases the detection time. Comparing the Shi-Tomasi detector to the curvature-based detector, the results are quite similar, with an equal average number of found true corners and only slight differences in the number of detected false corners. The Shi-Tomasi detector on average detects fewer false corners, which speaks for this solution, but the curvature-based detector showed better performance for the specific case of door detection, where the door corners were included for 89% of the evaluation images. Therefore, the curvature-based detector was chosen for the implementation of the door detection solution.

4 Implementation

Based on the requirement analysis in the preceding chapter, the realized AR system is described in detail in the following. First, the overall system architecture is shown and briefly discussed. The separate parts of the solution are then described in more detail.

4.1 General Solution Architecture

Figure 4.1: General AR system architecture.


Figure 4.2: Activity diagram of the AR system.

The implemented AR system can be split into three core components, the virtual world (CityGML viewer), the pose tracker and the physical world, which also define the three pillars of AR in general. To realize the virtual component, a CityGML data processor and renderer were implemented. The physical world is captured and integrated using the image processor, after camera calibration. Both components are linked via the pose tracker. Figure 4.1 shows how these fit into the overall system architecture and which sub-components define each component. Figure 4.2 shows a generalized activity diagram of the system.

4.2 CityGML Viewer

Figure 4.3: Architecture for the CityGML viewer component of the AR system according to [55].

Figure 4.3 shows the architecture of the CityGML viewer that is responsible for the virtual part of the AR system. It consists of two main components: the data processing component that reads, stores and filters the CityGML data, and the visualization component that is responsible for preparing the data, rendering and selecting object information.

4.2.1 Data Processor

In the following sections the CityGML processor is presented, as described in [55]. The processor consists of a CityGML parser, a data storage system and a selection mechanism.

Parsing CityGML to SpatiaLite with XmlPullParser

Before the actual parsing process of the CityGML data can be started, some preparations must be made. First, a new SpatiaLite database (file) is created with a pre-defined name which is derived from the CityGML file name. In the following it is referred to as citydb(.sqlite). Next, the required tables for the CityGML data are created. All classes, attributes and relationships between the classes are preserved in the database. For better accessibility of the schema some simplifications were applied where possible. For example, multiple subclasses, such as Window and Door, are merged into a single table OPENINGS (Table 4.1). For preliminary research towards the use cases of CityGML and AR, the appearance model was reduced to the X3DMaterial feature. Generally, textures consume more memory than simple color information; their overall benefit for AR at the cost of the application's performance therefore first must be assessed in further research. For optimized storing and accessibility, complex object attributes and datatypes were simplified (e.g. multi-valued attributes) to reduce the number of database tables. For instance, multiple functions of a building are represented as a concatenated string divided by a separator, so these can be stored in a single column.

Table 4.1: Simplifications of the CityGML classes for the database schema [55].

Database Tables         CityGML classes
ADDRESSES               Address
BUILDINGS               _CityObject; AbstractBuilding; Building
BUILDINGINSTALLATIONS   _CityObject; BuildingInstallation; IntBuildingInstallation
BUILDINGPARTS           _CityObject; AbstractBuilding; BuildingPart
CITYMODELS              CityModel
BUILDINGFURNITURE       _CityObject; BuildingFurniture
OPENINGS                _CityObject; Opening; Window; Door
GEOMETRIES              GML
ROOMS                   _CityObject; Room
BOUNDARYSURFACES        _CityObject; BoundarySurface; RoofSurface; WallSurface; GroundSurface;
                        ClosureSurface; CeilingSurface; InteriorWallSurface; FloorSurface;
                        OuterCeilingSurface; OuterFloorSurface
APPEARANCES             Appearance; SurfaceData; X3DMaterial
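The sketch below indicates how one of the simplified tables could be set up with a 3D geometry column and an R*Tree index. The column set, the SRID 25832 and the assumption of an Android-style SQLiteDatabase handle on a SpatiaLite-enabled connection are illustrative only.

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;

public final class SchemaSetup {

    public static void createOpeningsTable(SQLiteDatabase db) {
        // Plain SQLite part: the simplified attribute table.
        db.execSQL("CREATE TABLE OPENINGS ("
                + "id TEXT PRIMARY KEY, "
                + "type TEXT, "                       // e.g. 'Window' or 'Door'
                + "name TEXT, "
                + "parent_boundarysurface_id TEXT)");
        // SpatiaLite part: register a 3D polygon column and build the R*Tree index.
        runSelect(db, "SELECT AddGeometryColumn('OPENINGS', 'geom', 25832, 'POLYGON', 'XYZ')");
        runSelect(db, "SELECT CreateSpatialIndex('OPENINGS', 'geom')");
    }

    // SpatiaLite management functions are SELECT-style calls, so they are executed
    // through a cursor instead of execSQL.
    private static void runSelect(SQLiteDatabase db, String sql) {
        Cursor c = db.rawQuery(sql, null);
        c.moveToFirst();
        c.close();
    }
}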

Following the database preparations, the actual parsing procedure is started. The two key methods of the XmlPullParser interface are next() and getEventType(). While next() is used to progress to the next event, getEventType() identifies the current event. There are four event types:
• START_TAG
• TEXT
• END_TAG
• END_DOCUMENT

To comply with the sequential nature of the pull parser, temporary Java objects are created to hold the parsed information. After the contained information has been transferred to the database, it is deleted, and the object is reused for the following objects. This allows keeping the memory requirements low. Relationships between the entities are preserved by introducing variables that hold the identifier of the current parent of the entity. The general hierarchy of the entities is tracked by using an object stack. More specifically, the parsing method proceeds in the following manner: when an event of type START_TAG is triggered, the current XML tag name is queried. If it is a relevant tag (e.g. building), a new Java object is created together with its identifier found in the GML file. Since an identifier is important to preserve the object relationships, one is created if none is found in the file. The object is then placed on the stack. If there is text content for the current XML tag and the event type TEXT is triggered, it is added to the current Java object. For example, a building object could receive information like name, function, etc. When the END_TAG event occurs for the corresponding object, the information contained in the temporary Java object is inserted into the SpatiaLite database. The Java object is then taken off the stack. After the END_DOCUMENT event is triggered, some post-processes like the calculation of bounding boxes for all objects and the creation of indices on the database tables are performed. Bounding boxes are only calculated if no envelope was found for the object while parsing the file. The bounding boxes are necessary for spatial queries later in the AR system. Indices are created for all primary and parent keys. Additionally, spatial indices are placed on the geometry tables. One of the most complex tasks is collecting and inserting the geometries. When geometry tags are found, their text content is retrieved and used for the construction of well-known text (WKT) 70 expressions. WKT is a text mark-up language for describing vector geometries. It supports geometry primitives like points, linestrings and polygons, and also multipart geometries like multi-points, multi-linestrings, multi-polygons and more. An advantage is that it is readable for humans, as shown in the following short example in which a 3D polygon is described:

POLYGON Z((292994.851 5627394.638 229.645, 292991.454 5627391.99 227.743, 292991.454 5627391.99 219.029, 292994.851 5627394.638 219.029, 292994.851 5627394.638 229.645))

Well-known binary (WKB) is equivalent to WKT but represents the information in binary form. SpatiaLite provides methods such as GeomFromText() that expects a WKT expression and an SRID to create native geometry representations.
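A condensed sketch of the described pull-parser loop is given below. The tag handling is reduced to a single case, the WKT assembly is only hinted at, and the SpatiaLite-enabled database handle, the table and column names and the SRID are assumptions for illustration.

import android.database.sqlite.SQLiteDatabase;
import org.xmlpull.v1.XmlPullParser;

public final class CityGmlImportSketch {

    // Filled while parsing; the full implementation tracks these per object on a stack.
    private String currentParentId;
    private String currentWkt;

    public void parse(XmlPullParser parser, SQLiteDatabase db) throws Exception {
        int event = parser.getEventType();
        while (event != XmlPullParser.END_DOCUMENT) {
            switch (event) {
                case XmlPullParser.START_TAG:
                    if ("Building".equals(parser.getName())) {
                        // create a temporary Java object and push it onto the stack ...
                    }
                    break;
                case XmlPullParser.TEXT:
                    // append the text content (e.g. coordinates) to the current object ...
                    break;
                case XmlPullParser.END_TAG:
                    if ("Polygon".equals(parser.getName())) {
                        // transfer the collected geometry into the database via WKT
                        db.execSQL("INSERT INTO GEOMETRIES (parent_id, geom) "
                                        + "VALUES (?, GeomFromText(?, 25832))",
                                new Object[] { currentParentId, currentWkt });
                        // pop the temporary object from the stack and reuse it ...
                    }
                    break;
                default:
                    break;
            }
            event = parser.next();
        }
        // post-processing: bounding boxes and (spatial) indices are created here.
    }
}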

70 http://docs.opengeospatial.org/is/12-063r5/12-063r5.html

Parsing terrain data to SpatiaLite

Additionally, terrain data can be added to the database. This data is then used for a visualization of the terrain and for querying height information in the AR system. CityGML's relief module allows modelling terrain, but typically such data is not provided in the CityGML format. Generally, it is exchanged in plain ASCII files in which the 3D points are saved row by row. Therefore, an additional parser was written for this format to utilize the data in the AR application. Like the CityGML parser, it transfers the data from the source file into a SpatiaLite database. For this, first the necessary table with the corresponding attributes is added to the database schema. Next, the terrain file is read line by line and the easting, northing and altitude coordinates are extracted. These are used to create a native SpatiaLite point geometry. The advantage of keeping the points as separate objects in the database is that they can later be managed and selected more easily, as opposed to, for instance, storing an already triangulated polygon. As with the CityGML data, a post-process creates a spatial index on the points column.

Selecting Data from Database

To retrieve data from the database and forward it to the rendering engine, the right selection algorithms are essential. Since databases may be large and contain entire cities, the selection algorithm was developed to deliver only the information relevant to the current location. For CityGML, selections are made by using the hierarchical structure of the format and exploiting the semantic relationships of its classes (Figure 4.4).

Figure 4.4: Example of a relationship between different classes in the database [55].

This allows reducing the number of computationally expensive geometric queries, such as point-in-polygon tests. In general, there are two cases: the user is outside of a building or the user is inside a building. Figure 4.5 shows how the algorithm proceeds.

Figure 4.5: Activity diagram for the implemented selection algorithm [55].

So, first a geometric query using a given position is run to determine if it lies inside of a city model. For this, the bounding box of the city is used. If a city model is found, its identifier is used to determine all buildings that belong to it. This is done by searching for buildings with a parent_citymodel_id corresponding to the citymodel_id. From these buildings, all nearby buildings are filtered by using a selection buffer around the given position. Another geometric query then determines if the position lies inside or outside of the bounding boxes of the buildings. When viewing a building from the outside in reality, the interior is occluded by walls. These covered objects are also not necessary in the virtual world. This reduces the number of rendered geometries. So, if the user’s position is on the outside, only data of exterior parts, such as building hulls (walls) and exterior building installations, are selected and transferred. If the position is on the inside of a building, again, two cases are possible: either the position is located in the building and in a room or only located in the building. If the user’s position lies on the inside of a building, its identifier is used to search for rooms that belong to it. A following geometric query using the bounding boxes of the found rooms is run to determine if the position is in a room or not. If it is in a room, the room’s identifier is used to find walls, furniture and interior building installations which are then transferred to the renderer. The selection is done room-wise, therefore, only objects relevant to the current position are loaded, since the walls of a room occlude surrounding rooms and objects. If the position is not located in a room (e.g. in a LOD3 model), only the exterior hull of the building is loaded by selecting the walls of the building using the building_id. Exterior objects like exterior building installations or neighbouring buildings are ignored.
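The decision logic of Figure 4.5 can be summarized in a short Java sketch. The helper method names stand for the corresponding SpatiaLite queries and are illustrative only.

```java
import java.util.Collections;
import java.util.List;

// Sketch of the selection logic from Figure 4.5; helper names are placeholders
// for the corresponding SpatiaLite queries.
public class SceneSelector {

    public List<byte[]> selectGeometries(double x, double y, double radius) {
        Long cityModelId = findCityModelContaining(x, y);            // bounding-box test
        if (cityModelId == null) return Collections.emptyList();

        List<Long> nearby = findBuildingsInRadius(cityModelId, x, y, radius);
        Long buildingId = findBuildingContaining(nearby, x, y);

        if (buildingId == null) {
            // Outside of all buildings: only exterior hulls and exterior installations.
            return loadExteriorGeometries(nearby);
        }
        Long roomId = findRoomContaining(buildingId, x, y);
        if (roomId == null) {
            // Inside a building but not inside a room (e.g. LOD3 model): only the building hull.
            return loadBuildingHull(buildingId);
        }
        // Inside a room: room hull, furniture and interior installations of that room.
        return loadRoomGeometries(roomId);
    }

    private Long findCityModelContaining(double x, double y) { return null; }
    private List<Long> findBuildingsInRadius(long cm, double x, double y, double r) { return Collections.emptyList(); }
    private Long findBuildingContaining(List<Long> ids, double x, double y) { return null; }
    private Long findRoomContaining(long buildingId, double x, double y) { return null; }
    private List<byte[]> loadExteriorGeometries(List<Long> buildingIds) { return Collections.emptyList(); }
    private List<byte[]> loadBuildingHull(long buildingId) { return Collections.emptyList(); }
    private List<byte[]> loadRoomGeometries(long roomId) { return Collections.emptyList(); }
}
```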

For the terrain data, a buffer with a defined radius is created around the current position. All terrain points that are contained in the buffer are selected and exported. The CityGML and terrain geometries are streamed in the internal SpatiaLite Binary Large Object (BLOB) format to the renderer, which is quite similar to WKB.

4.2.2 Visualizing CityGML

Research towards CityGML visualization was conducted in [55]. The main parts of this process are the preparation of the geometries by triangulation, the rendering of the geometries and the interaction with the CityGML objects at runtime.

Prerequisites All geometries exported by the implemented selection algorithms are received in the internal SpatiaLite BLOB format. This not only enables compact data transfers and performant interpretation of data, but also quick access to the MBR of the geometries, relevant for data querying. The BLOB-geometries are read using a custom solution that was developed according to the documentation of SpatiaLite [198]. Using the byte offsets, as defined in the documentation, the relevant content is extracted. For geometries of the type polygon, triangulation is a prerequisite for rendering. This is done using a custom implementation of the ear-clipping triangulation method [45] that is applicable for convex and concave polygons with an arbitrary number of holes. Additionally, the algorithm accounts for special cases, such as holes touching the outer ring of the polygon or neighboring polygons.

Custom Ear-Clipping Triangulator Since the ear-clipping algorithm is aimed at the triangulation of 2D polygons, these must initially be transformed from 3D to 2D. [37] describe a possible method that determines a component of each vertex's 3D coordinates to discard and thereby projects the polygon onto one of the three coordinate planes (xy, xz or yz). The projection plane is typically determined either by computing the projected area of the polygon in each plane and choosing the plane with the largest projected area or by calculating the polygon's normal and discarding the component with the greatest magnitude. For instance, from the polygon normal (-6, 4, 2) the x-component would be discarded, since it has the greatest magnitude. Though, in test cases this 3D to 2D projection method did not provide satisfying results for the CityGML data. Multiple polygons were distorted or had self-intersections which made them unusable for further processes. Therefore, an alternative approach was implemented which computes the local basis of the polygon and transforms its points from world space coordinates to the polygon's local space coordinate system. The basis requires three orthonormal vectors, one of which is given by the polygon's normal vector n, which is treated as the z-axis of the basis. The x-axis is determined by projecting the x-axis of the standard basis onto the polygon's plane. This is achieved with equation (4.1), which projects the standard x-axis onto the plane determined by the polygon's normal vector n and subsequently normalizes the result. If the scalar product of the standard x-axis and n equals 1, then both must be equal, in which case the negative z-axis of the standard basis is used instead. The polygon's y-axis is determined by taking the cross product of the z- and x-axis, which is done in equation (4.2). When the local basis of the

polygon is found, the points can be transformed to it by using equation (4.3).

$$\mathbf{x} = \begin{cases} \dfrac{\mathbf{e}_x - (\mathbf{e}_x \cdot \mathbf{n})\,\mathbf{n}}{\lVert \mathbf{e}_x - (\mathbf{e}_x \cdot \mathbf{n})\,\mathbf{n} \rVert} & \text{if } \mathbf{e}_x \cdot \mathbf{n} \neq 1 \\ -\mathbf{e}_z & \text{otherwise} \end{cases} \tag{4.1}$$

$$\mathbf{y} = \mathbf{n} \times \mathbf{x} \tag{4.2}$$

$$\mathbf{p}' = \left( \mathbf{p} \cdot \mathbf{x},\; \mathbf{p} \cdot \mathbf{y} \right) \tag{4.3}$$

The original geometry is kept in 3D space, while references to the points of the triangles in 2D space are made. Hereafter, the received 2D polygon is validated against further conditions, such as that a polygon must consist of an ordered set of vertices and must be free of self-intersections. As defined by [45], an ear is a triangle formed by three consecutive vertices v1, v2 and v3 in which no other vertices of the polygon are contained. v2 is the ear tip and the line (diagonal) from v1 to v3 lies entirely in the polygon. The objective of the algorithm is to find such ears and to remove these from the polygon. Therefore, the overall process can be summarized with the following two-step cycle (a sketch follows the list):

• find_ear(): An ear is searched for by selecting the first convex vertex and forming a triangle by connecting the vertex with its neighbors. An ear is found if no other vertex lies in the triangle.

• cut_ear(): Remove the ear from the polygon and save the triangle made up of the ear tip and its direct neighbors.
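The two-step cycle can be sketched as follows for a simple polygon given in counter-clockwise order. This is a minimal illustration of the principle; the hole-bridging and the special cases mentioned above are omitted, and the class is not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the find_ear()/cut_ear() cycle for a simple (hole-free) 2D polygon
// given in counter-clockwise order.
public class EarClipper {

    public static List<double[][]> triangulate(List<double[]> poly) {
        List<double[]> v = new ArrayList<>(poly);
        List<double[][]> triangles = new ArrayList<>();
        while (v.size() > 3) {
            boolean clipped = false;
            for (int i = 0; i < v.size(); i++) {
                double[] prev = v.get((i - 1 + v.size()) % v.size());
                double[] cur = v.get(i);
                double[] next = v.get((i + 1) % v.size());
                if (!isConvex(prev, cur, next)) continue;        // find_ear(): ear tip must be convex
                if (containsOtherVertex(v, prev, cur, next)) continue;
                triangles.add(new double[][]{prev, cur, next});  // cut_ear(): store the triangle
                v.remove(i);                                     // and remove the ear tip
                clipped = true;
                break;
            }
            if (!clipped) break; // defective polygon, abort
        }
        if (v.size() == 3) triangles.add(new double[][]{v.get(0), v.get(1), v.get(2)});
        return triangles;
    }

    // The sign of the cross product decides whether vertex b is convex (CCW polygon).
    private static boolean isConvex(double[] a, double[] b, double[] c) {
        return cross(a, b, c) > 0;
    }

    private static double cross(double[] a, double[] b, double[] c) {
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]);
    }

    private static boolean containsOtherVertex(List<double[]> v, double[] a, double[] b, double[] c) {
        for (double[] p : v) {
            if (p == a || p == b || p == c) continue;
            if (pointInTriangle(p, a, b, c)) return true;
        }
        return false;
    }

    // Point-in-triangle test via the signs of the three edge cross products.
    private static boolean pointInTriangle(double[] p, double[] a, double[] b, double[] c) {
        double d1 = cross(a, b, p), d2 = cross(b, c, p), d3 = cross(c, a, p);
        boolean hasNeg = d1 < 0 || d2 < 0 || d3 < 0;
        boolean hasPos = d1 > 0 || d2 > 0 || d3 > 0;
        return !(hasNeg && hasPos);
    }
}
```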

Figure 4.6 shows an activity diagram for the implemented ear-clipping algorithm. Initially, all the vertices of a polygon are queried. A pre-test checks if all vertices of the polygon comply with the constraints mentioned above. Next, a search for possible holes decides on an intermediate step. If no hole is found in the search, the algorithm can proceed to the triangulation. However, if a hole or several holes are found in the polygon, they are connected to the outer polygon by creating a link with the outer polygon ring. Connecting each inner polygon to the outer ring creates one continuous vertex sequence consisting of the outer ring and the inner polygons, which can then be used in the triangulation process to handle the holes.

Figure 4.6: Activity diagram for the implemented ear-clipping algorithm according to [55].

Figure 4.7: Process of connecting multiple holes (inner polygons) to the outer polygon.

Subsequently, it is checked whether the polygon is convex or concave, by examining each vertex of the polygon. An edge is drawn between the previous and next neighboring vertex, starting from the current vertex. If the tested vertex is on the right side of the edge, it is convex; if it is on the left side, it is concave. If the vertex lies on the edge itself, it is collinear. If all vertices of the polygon are convex, the triangulation of the polygon is trivial and can be accomplished in linear time by connecting an arbitrary vertex with all other vertices, except its immediate neighbors. If one of the vertices is concave, ear-clipping is applied according to [45]. The following Figure 4.8 shows an example of a fully triangulated LOD4 building with interiors.

Figure 4.8: A fully triangulated LOD4 building.

Rendering with jMonkey Engine Real-time rendering requires some basic components, such as a virtual space, objects to populate the space with, a virtual camera and lighting for the scene. Generally, additional post processing of the generated images is used to enhance the visuals. For rendering, the jMonkey game engine is used. jMonkey has two important classes, the SimpleApplication class and the AppState class. SimpleApplication is the base class of the game engine and is typically extended by a custom class which then is the starting point for every application and provides access to key components like the scene graph, the input-, state-, asset-, and audio manager, physics simulator, etc. Therefore, it should only be extended once. The AppState class also provides access to these basic components but

may be used multiple times. AppStates are typically utilized to modularize the application, by separating different functionalities of the game or application.

Scene Graph For rendering a scene (e.g. CityGML), the scene graph is an important feature. Every application has one scene graph, thus, one root node. The root node of the scene graph may be directly accessed from the SimpleApplication or AppState object. Elements that are attached to the root node become part of the scene graph and are rendered. In jMonkey these elements are referred to as Spatials . Two types of Spatials may be attached to the root node, Geometry and Node . While both contain the location, rotation and scale of an object, a Geometry is a visible object in the scene and the Node is an invisible container that groups multiple Geometries and Nodes together. Using Nodes, a tree is created and relationships between objects are established, so that, for example, walls can be part of a specific building which again can be part of a specific city model (Figure 4.9). These relationships enable the selection of specific objects from the scene, for instance, all walls of a specific room belonging to a certain building. This is useful for visualizing only specific objects. Another advantage is that transformations that are applied to a parent Node are also applied to all child elements. This, for example, allows handling a building with all its sub-elements as a single object.


Figure 4.9: Hierarchy of CityGML objects in a scene graph [55].

To render CityGML data using jMonkey’s scene graph, the geometries of the objects of the CityGML model must be added to the root node using custom meshes. jMonkey provides the class Mesh for this purpose which expects triangulated polygons.
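A minimal sketch of this step is given below, using jMonkey's Mesh, Geometry and Node classes. The node names, the Unshaded material and the overall structure are illustrative assumptions that mirror Figure 4.9, not the actual implementation.

```java
import com.jme3.asset.AssetManager;
import com.jme3.material.Material;
import com.jme3.math.ColorRGBA;
import com.jme3.scene.Geometry;
import com.jme3.scene.Mesh;
import com.jme3.scene.Node;
import com.jme3.scene.VertexBuffer;
import com.jme3.util.BufferUtils;

// Sketch: build a jMonkey Mesh from triangulated CityGML geometry and attach it to a
// Node hierarchy (city model -> building -> wall), mirroring Figure 4.9.
public class CityGmlSceneBuilder {

    public static Geometry buildGeometry(String id, float[] positions, int[] indices,
                                         AssetManager assetManager, ColorRGBA color) {
        Mesh mesh = new Mesh();
        mesh.setBuffer(VertexBuffer.Type.Position, 3, BufferUtils.createFloatBuffer(positions));
        mesh.setBuffer(VertexBuffer.Type.Index, 3, BufferUtils.createIntBuffer(indices));
        mesh.updateBound(); // bounding volume, needed e.g. for culling and picking

        Geometry geom = new Geometry(id, mesh);
        Material mat = new Material(assetManager, "Common/MatDefs/Misc/Unshaded.j3md");
        mat.setColor("Color", color);
        geom.setMaterial(mat);
        return geom;
    }

    public static void attach(Node rootNode, Geometry wall) {
        Node cityModel = new Node("CityModel_1");   // invisible grouping nodes
        Node building = new Node("Building_42");
        building.attachChild(wall);                 // visible Geometry as leaf
        cityModel.attachChild(building);
        rootNode.attachChild(cityModel);            // attaching to the root makes it renderable
    }
}
```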

Optimizing with Draw Call Batching Generally, the number of objects in a scene impacts the rendering performance. The more separate meshes are visible in a scene, the more rendering effort is necessary. Draw calls play an important role here.

To visualize a mesh, the renderer issues a draw call to the graphics API. Draw calls are relatively expensive due to validation and translation steps in the graphics driver which are done between each draw call (e.g. switching to a different material). This is critical when rendering entire city models with hundreds to thousands of buildings with many separate objects in the virtual scene. Therefore, the goal is to reduce the number of draw calls. A solution is to merge multiple meshes, which reduces the number of draw calls. Generally, this is referred to as Draw Call Batching [37]. For the AR system this technique is utilized by merging all meshes with the same X3D material before transferring them to the renderer. For example, all walls, doors and windows with the color gray or all walls and doors with the color red are combined and handled as one large inseparable object. Therefore, scenes with many same-colored objects are most advantageous for this technique.
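As an illustration of the batching concept, jMonkey ships the GeometryBatchFactory helper, which merges all child geometries that share the same material into a few large meshes. The call below is shown only as one possible way to achieve this; the thesis implements its own per-material merge before the data reaches the renderer.

```java
import com.jme3.scene.Node;
import jme3tools.optimize.GeometryBatchFactory;

// Illustration of draw call batching in jMonkey: after the call, e.g. all gray
// walls/doors/windows under the node form one mesh and therefore one draw call.
public class BatchingExample {

    public static void batch(Node buildingNode) {
        GeometryBatchFactory.optimize(buildingNode);
    }
}
```

Note that after batching the individual geometries are no longer separately addressable, which is why the picking mechanism described further below falls back on triangle indices.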

Localizing Objects Location-based mobile AR requires that the virtual objects have some reference to their physical counterparts. Given some CityGML objects in the UTM coordinate system, a straightforward approach is to transfer their coordinates directly into the virtual world’s coordinate system. A virtual position then corresponds to its physical position. However, the successful visualization depends on the length of the coordinates. Due to the limitations of the floating point-based calculations on the GPU, jumping of the geometries occurs (spatial jitter) [199] when the geometries lie too far from the origin of the coordinate system. For example, given data in the ETRS89 UTM32 reference system with six digits for the east value and seven digits for the north value plus multiple digits behind the

decimal point each, the objects appear jittery. To solve this issue, the geometries must be placed closer to the origin of the virtual world to reduce the length of the coordinates. For the AR system, this is achieved by translation, using the coordinates of the first loaded geometry and subtracting them from the following geometries to create local relative coordinates. The calculations are carried out in-memory, so that the geometries stay un-edited in the database.
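The translation to local relative coordinates can be sketched as follows; the class is a minimal illustration, assuming that jMonkey's y-up convention is used for the height axis.

```java
import com.jme3.math.Vector3f;

// Sketch: translate large world coordinates (e.g. ETRS89/UTM32) towards the scene origin to
// avoid GPU floating-point jitter. The offset is taken from the first loaded geometry and
// subtracted in double precision before casting to float for the renderer.
public class LocalOrigin {

    private double offsetE, offsetN, offsetH;
    private boolean initialized = false;

    public Vector3f toLocal(double easting, double northing, double height) {
        if (!initialized) {
            offsetE = easting; offsetN = northing; offsetH = height;
            initialized = true;
        }
        // jMonkey uses a y-up coordinate system, so the height maps to y.
        return new Vector3f((float) (easting - offsetE),
                            (float) (height - offsetH),
                            (float) (northing - offsetN));
    }
}
```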

Scene Illumination As mentioned in [200], lighting is another essential part of a virtual scene in AR. Lighting can help to improve the sensation of realism and to blend the virtual objects in with the physical objects. For example, the virtual object’s shadows should match the physical object’s shadows. jMonkey offers four types of lighting. A SpotLight has a position and shines in a certain direction with a limited range. A PointLight, on the other hand, also has a position but shines in all directions in a limited radius. In contrast to a SpotLight or PointLight, a DirectionalLight has no position, only a direction. Typically, a scene only has one DirectionalLight which is used as a representation of the sun. An AmbientLight has no position or direction and is equally bright in the entire scene. Therefore, it influences the brightness and color of the scene and objects are illuminated by it from all sides. The lights each can be altered in their color to influence the scene's atmosphere. To create realistic lighting for AR, the physical light sources of the area must be simulated. For the AR system, this was realized with the sun as a light source. A directional light is used to represent the physical sun. Its position is calculated based on the current device

position, date and time, so that it directly corresponds to the physical sun, producing similar shadows of the objects in both worlds.

Camera The virtual camera is responsible for capturing images of the virtual scene, so they can be displayed on the screen of the AR device. To visualize the 3D objects of a scene, they are projected from 3D to 2D. For the AR system, the images produced by the physical camera must be overlaid by the images produced by the virtual camera, so that the depicted virtual objects superimpose the depicted physical objects. This requires both cameras to have the same pose. So, when the physical camera is moved or rotated in reality, the virtual camera must be moved or rotated by an equivalent amount. Besides the extrinsic parameters of the physical camera, the intrinsic parameters and lens distortion also must be taken into account, since these also influence the image. For instance, an offset of the principal point shifts the image left/right and up/down. So, to account for this, either the images of the physical camera can be corrected, or the images of the virtual camera altered.
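The basic coupling between the calibrated intrinsics and the virtual camera frustum can be sketched as follows; the near/far values are illustrative, and principal-point offsets and lens distortion still have to be handled separately, as described above.

```java
import com.jme3.renderer.Camera;

// Sketch: derive the virtual camera frustum from the calibrated intrinsics of the
// physical camera, so that both cameras produce comparable images.
public class VirtualCameraSetup {

    public static void applyIntrinsics(Camera cam, double fyPixels,
                                       int imageWidth, int imageHeight) {
        // Vertical field of view from the focal length in pixels: fovY = 2 * atan(h / (2 * fy)).
        double fovY = Math.toDegrees(2.0 * Math.atan(imageHeight / (2.0 * fyPixels)));
        float aspect = (float) imageWidth / (float) imageHeight;
        cam.setFrustumPerspective((float) fovY, aspect, 0.1f, 1000f);
    }
}
```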

Object Selector Every object in the scene has a unique ID which is derived from the CityGML data or generated if no ID is available. Using this ID, the SpatiaLite database is queried for additional information like name, function, etc. of this object (e.g. building). To determine which object in the scene is selected when touching the smartphone’s display, a picking algorithm is applied.

Figure 4.10: Ray casting-based picking using a view frustum of the perspective projection [55].

The algorithm is based on a ray casting approach as described by [201]–[203], which uses a directed line to determine the object of choice, as depicted in Figure 4.10. When the user touches an object, the screen space coordinates of the selection point are calculated and are transformed into 3D virtual world coordinates. A point on the near plane is calculated using the depth value of the near plane. From this point a ray is cast according to the current view of the virtual camera towards the corresponding point on the far plane. Intersections between the ray and objects in the local world space are calculated and used to determine the closest geometry collision from the camera’s point of view, as depicted in Figure 4.10. In turn, the object ID of the collision object can be determined. Since the picking algorithm is mesh-based, it requires every object (e.g. wall) to be associated to a separate mesh to be identifiable.

However, by merging multiple meshes (draw call batching), this relationship is lost. As a solution a unique index number is assigned to each triangle of a combined mesh to keep track of which triangles belong to the original separate meshes. The picking algorithm returns the triangle’s index which allows an exact identification of which CityGML object was selected. This method only has the minor additional cost of storing the triangle indices of each CityGML object, but with the benefit of an improved rendering performance.
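The picking step can be sketched with jMonkey's collision API as follows. The triangle-to-object lookup table is an assumption used here to illustrate the mapping from triangle index back to CityGML object; its concrete form in the thesis implementation is not specified.

```java
import com.jme3.collision.CollisionResult;
import com.jme3.collision.CollisionResults;
import com.jme3.math.Ray;
import com.jme3.math.Vector2f;
import com.jme3.math.Vector3f;
import com.jme3.renderer.Camera;
import com.jme3.scene.Node;
import java.util.Map;

// Sketch of the ray casting-based picking: the touch point is unprojected onto the near and
// far plane, a ray is cast through the scene and the triangle index of the closest hit is
// mapped back to the original CityGML object id.
public class ObjectPicker {

    public static String pick(Camera cam, Node rootNode, float screenX, float screenY,
                              Map<Integer, String> triangleToCityGmlId) {
        Vector3f near = cam.getWorldCoordinates(new Vector2f(screenX, screenY), 0f);
        Vector3f far = cam.getWorldCoordinates(new Vector2f(screenX, screenY), 1f);
        Ray ray = new Ray(near, far.subtract(near).normalizeLocal());

        CollisionResults results = new CollisionResults();
        rootNode.collideWith(ray, results);
        if (results.size() == 0) return null;

        CollisionResult closest = results.getClosestCollision();
        // In a batched mesh the Geometry is no longer unique per CityGML object,
        // so the triangle index is used to recover the original object id.
        return triangleToCityGmlId.get(closest.getTriangleIndex());
    }
}
```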

4.3 Pose Tracking

In the following sections the realized pose tracking methods are presented. The general idea behind the implemented pose tracking algorithm is to determine the pose of the physical camera and transfer it to the virtual camera, so that both are aligned as closely as possible. In an optimal case, by aligning both cameras, the view of the physical world and the view of the virtual world overlap each other when placed in the correct order, creating an augmentation of the physical world. The next sections describe the realized AR tracking system to achieve the following scenario:

- The user is outside and starts the AR system. The tracking system acquires the current location of the device by GNSS and the orientation using the IMU and magnetometer, taking magnetic declination and meridian convergence into account (see chapter 2.4.4). Next, the user moves towards a building and enters it. The indoor-outdoor detection algorithm registers that the user is now located on the inside of a building. This is where the inertial indoor navigation system is activated and GNSS positioning is deactivated.

The last known valid GNSS position is used as a starting point for a DR approach. The system keeps track of the current position down to room-level. When the user arrives in the destination room, the inertial indoor navigation system returns the current room. Based on this position the visible CityGML objects of the room are queried and visualized. The user then selects a CityGML door from the virtual scene to use for the following optical pose estimation algorithm. Using a door detection algorithm, the physical door is extracted from the camera image. The 2D door and 3D door are utilized for a PnP-based pose estimation which registers both to each other.

The AR system uses a tracking management class that fuses the different managers that are responsible for acquiring the sensor and image-based pose information.

4.3.1 Sensor-based Pose Tracker

For positioning outdoors, GNSS data are used. To ensure a stable position, the system waits until at least four satellites are available. Also, the first 10 positions are discarded to exclude inaccuracies, which can typically be observed when starting a GNSS session. The following positions then are collected and averaged. The coordinates are received as geographic latitude and longitude, based on the global datum World Geodetic System (WGS84)71. To determine whether newly received locations are better in terms of accuracy, location strategies, as described by [204], are applied. The strategies include checks for significantly newer locations than previous ones and

71 https://epsg.io/4326

comparisons of accuracy and location provider. Since the altitude derived from GNSS is generally less accurate than the latitude/longitude, an additional DTM is used for querying the height information. Using the current location provided by GNSS, the database is queried. The nearest terrain points are used for an interpolation to determine the best estimate of the current height. To obtain the height of the AR device, the user must enter his body height which is added to the ground DTM height. For indoor positioning, a pedestrian indoor navigation system, according to [205], is used. It utilizes an initial position (e.g. the last known GNSS outdoor position before entering a building) and estimates positions by measuring the distance from the initial position with a step detector and calculations of the current heading. Additionally, the system uses a particle filter to make it more robust to outliers. The general idea of the particle filter is to deploy particles in the environment and to form clusters of these which represent the most probable position estimate. This is achieved by excluding particles that do not comply with certain rules, for instance, particles that lie inside of walls.
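The height lookup from the DTM points can be sketched as follows. Since the thesis does not fix the interpolation method, an inverse-distance weighting of the nearest terrain points is assumed here for illustration; the user's body height is then added to obtain the device height.

```java
import java.util.List;

// Sketch of the DTM height query: inverse-distance weighting of the nearest terrain
// points (an assumption), plus the user's body height.
public class DtmHeightProvider {

    public static double deviceHeight(double e, double n, List<double[]> nearestPoints,
                                      double bodyHeight) {
        double weightSum = 0.0, heightSum = 0.0;
        for (double[] p : nearestPoints) {          // p = {easting, northing, altitude}
            double d = Math.hypot(p[0] - e, p[1] - n);
            if (d < 1e-6) return p[2] + bodyHeight; // standing directly on a terrain point
            double w = 1.0 / (d * d);
            weightSum += w;
            heightSum += w * p[2];
        }
        return heightSum / weightSum + bodyHeight;
    }
}
```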

4.3.2 Optical Pose Tracker

Given a room derived from the coarse pose of the inertial pedestrian navigation system: First, its elements are loaded from the SpatiaLite database and visualized. Next, using the position of the inertial pedestrian navigation system and current orientation derived from the IMU, which is a coarse estimate, the user selects the door that should be used for optical pose estimation. This is done by touch interaction over the smartphone display.

The selection is determined using the picking algorithm which is also applied for querying the object information. Using the selection, the mesh of the object is determined and extracted. In the next step the 3D corner points of the virtual door are extracted from the mesh. Since the door meshes typically consist of hundreds of points, an algorithm is necessary to determine the corner points and sort them in a predefined order. This is done with the following steps (a sketch follows the list):

1. Remove duplicate points.
2. Compute the bounding box of all points that belong to the door.
3. Sort the points from farthest to closest to the center of the bounding box.
4. Save the 4 or 8 points that are farthest from the center, as these most likely are the corner points of the door. If the door is modelled as a solid, it has 8 corner points and if it is modelled using only a surface, it has 4 corner points.
5. If the door is a surface, select the 4 corner points. If the door is a solid, select the 4 corner points that are closest to the camera, as these are the corner points that are visible to the user.
6. Sort the corner points counter-clockwise.
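A compact Java sketch of steps 1–4 is given below; the counter-clockwise sorting and the camera-side selection for solid doors are only indicated as comments, and the class is illustrative rather than the actual implementation.

```java
import com.jme3.math.Vector3f;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch of the corner extraction: duplicate removal, bounding-box centre, distance
// sorting and selection of the 4 or 8 farthest points as corner candidates.
public class DoorCornerExtractor {

    public static List<Vector3f> extractCorners(List<Vector3f> meshPoints, boolean isSolid) {
        // 1. Remove duplicate points.
        List<Vector3f> unique = new ArrayList<>(new LinkedHashSet<>(meshPoints));

        // 2. Compute the centre of the bounding box of all points.
        Vector3f min = new Vector3f(Float.MAX_VALUE, Float.MAX_VALUE, Float.MAX_VALUE);
        Vector3f max = new Vector3f(-Float.MAX_VALUE, -Float.MAX_VALUE, -Float.MAX_VALUE);
        for (Vector3f p : unique) { min.minLocal(p); max.maxLocal(p); }
        Vector3f center = min.add(max).multLocal(0.5f);

        // 3. Sort from farthest to closest to the centre.
        unique.sort(Comparator.comparingDouble((Vector3f p) -> p.distance(center)).reversed());

        // 4. Keep the 4 (surface) or 8 (solid) farthest points as corner candidates.
        int n = isSolid ? 8 : 4;
        List<Vector3f> corners = new ArrayList<>(unique.subList(0, Math.min(n, unique.size())));

        // 5./6. For solids, keep the 4 candidates closest to the camera; then sort CCW (omitted).
        return corners;
    }
}
```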

Next, the 2D points of the physical door are extracted from the current smartphone camera image. For this, two steps are necessary: the recognition of the door in the image and the extraction of its geometry and corner points. The door detection is based on the work of [197]. The approach requires no training or preparations ahead of time. It uses a geometric

door shape model, an edge detector and a corner detector. The algorithm is independent of information, such as color or texture, so that a wide range of doors can be detected.

Edge and Corner Detection Before a search for edges and corners can be applied, the image is pre-processed by down-sampling it to reduce computational cost; a sketch of this pre-processing chain follows the requirement list below. Gaussian blur is applied to remove noise. Next, a binary edge map is created using the Canny algorithm from which the contours are extracted. These contours are used for the corner detector based on [104]. The following steps consist of grouping the found corner candidates into groups of four, according to the geometric model, which defines that a door consists of four corners and four lines, and checking these against geometric requirements, such as:

• a door must have a certain width and height.
• vertical lines should be parallel to each other.
• vertical lines are nearly perpendicular to the horizontal axis of the image.
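The pre-processing chain can be sketched as follows, assuming the OpenCV Java bindings are used (the thesis does not name a specific library here); the thresholds are illustrative and not taken from the thesis.

```java
import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;
import java.util.ArrayList;
import java.util.List;

// Sketch of the pre-processing: down-sampling, Gaussian blur, Canny edge map and
// contour extraction, which feed the corner detector and the geometric door model.
public class DoorEdgeDetector {

    public static List<MatOfPoint> extractContours(Mat grayInput) {
        Mat small = new Mat();
        Imgproc.resize(grayInput, small, new Size(480, 360)); // down-sample to the standard resolution

        Mat blurred = new Mat();
        Imgproc.GaussianBlur(small, blurred, new Size(5, 5), 0); // remove noise

        Mat edges = new Mat();
        Imgproc.Canny(blurred, edges, 50, 150); // binary edge map

        List<MatOfPoint> contours = new ArrayList<>();
        Mat hierarchy = new Mat();
        Imgproc.findContours(edges, contours, hierarchy,
                Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_SIMPLE);
        return contours;
    }
}
```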

To rate the width and height of a door, the relative size is calculated using the lengths of the door lines and the length of the diagonal of the image. The ratio is represented by the equation (4.4):

$$\text{SizeRatio} = \frac{L_{line}}{\sqrt{W_{image}^2 + H_{image}^2}} \tag{4.4}$$

where $L_{line}$ is the length of a door line and $W_{image}$ and $H_{image}$ are the width and height of the image.

The orientation of the lines is calculated using the following equation (4.5):

$$\text{Dir} = \left|\arctan\left(\frac{y_2 - y_1}{x_2 - x_1}\right)\right| \cdot \frac{180}{\pi} \tag{4.5}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the end points of a line. The values are tested against a predefined threshold set for each condition. Candidates that comply with the requirements are further tested by combining them with the edges found by the Canny algorithm. Lines of door candidates are checked whether they match with the edge map by using a fill ratio which defines the amount that the lines overlap the edge map (equation (4.6)).

$$\text{FillRatio} = \frac{L_{overlap}}{L_{line}} \tag{4.6}$$

where $L_{overlap}$ is the length of the overlapping part of the line and $L_{line}$ the total length of the line.

PnP The found 3D points from the CityGML door and the 2D points extracted from the camera image are then used for a PnP-based approach, which estimates a pose for the AR device. The following three PnP methods can be used (the corresponding call is sketched after the list):

• P3P based on the paper of [112].
• An iterative approach based on a Levenberg-Marquardt optimization [127] that minimizes the reprojection error, i.e. the sum of squared distances between observed and projected points.
• EPnP based on [120].
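The PnP step can be sketched with OpenCV's solvePnP, shown here as one possible implementation of the listed methods; the flag selects between SOLVEPNP_P3P, SOLVEPNP_ITERATIVE and SOLVEPNP_EPNP, and the surrounding class is illustrative.

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDouble;
import org.opencv.core.MatOfPoint2f;
import org.opencv.core.MatOfPoint3f;
import org.opencv.core.Point;
import org.opencv.core.Point3;

// Sketch of the PnP-based pose estimation: the four CityGML door corners (object points)
// and the four detected image corners must be passed in the same order.
public class DoorPnp {

    public static void estimatePose(Point3[] doorCorners3d, Point[] doorCorners2d,
                                    Mat cameraMatrix, MatOfDouble distCoeffs,
                                    Mat rvecOut, Mat tvecOut) {
        MatOfPoint3f objectPoints = new MatOfPoint3f(doorCorners3d);
        MatOfPoint2f imagePoints = new MatOfPoint2f(doorCorners2d);

        // SOLVEPNP_P3P, SOLVEPNP_ITERATIVE or SOLVEPNP_EPNP correspond to the three methods above.
        Calib3d.solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs,
                rvecOut, tvecOut, false, Calib3d.SOLVEPNP_EPNP);
    }
}
```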

4.3.3 Fusion

Pose estimation using a door is a relatively expensive operation, so that it cannot be used directly for the tracking process of real-time AR. Another disadvantage is that the door must be continuously visible in the camera images. Therefore, the user cannot freely look around the environment. To overcome these limitations, the optical pose estimation algorithm is combined with sensor-based poses. One solution is to determine the error of the poses derived from the IMU by comparing them to the optical pose and to continuously correct for these differences. After some evaluations it was concluded that the difference between the IMU pose and optical pose varies too strongly to apply only a single correction, mainly due to the data produced by the magnetometer. Instead of using the optical pose as a reference pose to calculate corrections for the IMU, it is used as an initial pose for a relative tracking. The following Figure 4.11 shows the general procedure:

Figure 4.11: Activity diagram of the pose tracking system.

4.4 Android App

The functionality of the AR system is realized as an Android app. It is divided into three general parts: • Core framework • Visual components • Pose tracking manager

Core framework The core framework is realized using Activities. An Activity is one of the essential building blocks of an Android app and provides an entry point for the user to the app. It typically is tightly coupled with a user interface (UI). An Activity has four states, active, paused , stopped and finished . When the activity changes its state, the corresponding methods are triggered. onCreate() for example is triggered when the Activity is created and onPause() when the Activity is sent into the background. In the AR system these methods are used to activate or deactivate the IMU, open or close the camera or start/end the rendering process.

Visual components In Android, UI elements are constructed using Views. A View draws visuals to the screen and handles events triggered by the user through interactions. For the AR system different Views were created, one View for the images of the physical smartphone camera, one View for the images of the virtual camera in the CityGML world generated by jMonkey and one View for some buttons. By layering the Views in the order

1. HardwareCameraView
2. VirtualCameraView
3. ButtonView

the images of the physical world are augmented with the images of the virtual world.

Pose Tracking Manager Android provides managers to obtain data from the smartphone’s hardware. For instance, the LocationManager class enables access to the location services of the system and the SensorManager access to data from the smartphone’s sensors like accelerometer, magnetometer or gyroscope. Android’s LocationManager, for example, allows obtaining periodic updates of geographical locations by network provider and GPS, given the corresponding Android permissions (e.g. ACCESS_COARSE_LOCATION or ACCESS_FINE_LOCATION). To receive GNSS locations, the ACCESS_FINE_LOCATION permission is required. Furthermore, a LocationListener must be registered with the LocationManager. When events occur, a corresponding method is called, for instance, the method onLocationChanged() when a location provider outputs a new location. The returned location object can contain information like latitude, longitude, a timestamp, bearing, altitude and velocity [204]. Similar to the LocationManager, Android’s SensorManager provides access to data produced by the various sensors in a smartphone. To receive periodic updates from the sensors, an instance of the sensor (e.g. accelerometer) must be registered to the SensorManager by providing a desired delay which controls the interval in which data is sent. Some examples are represented by the constants:

• SENSOR_DELAY_GAME (20,000 μs delay) • SENSOR_DELAY_UI (60,000 μs delay) • SENSOR_DELAY_FASTEST (0 μs delay)

Typically, low delays have a higher energy consumption than higher delays, so that a tradeoff between both must be found. For proof of concept the lowest delay was chosen for the AR system to enable real-time tracking. The pose tracking manager (TrackingManager class) of the AR system bundles the different methods to determine the pose of the smartphone. While the Android LocationManager and SensorManager are used to determine coarse values, an OpticalPoseManager was implemented that calculates more accurate values using image-based methods. When the TrackingManager receives a new pose from one of the slave managers, it fires an event that notifies the rendering component to adjust the virtual camera. The data from the IMU is received as an event in the method onSensorChanged(), in which the origin of the event is determined. According to the sender of the data, it is interpreted and represented as a rotation matrix that is later used to set the orientation of the camera. The sensors each provide the data in a default coordinate system. Depending on the rotation of the screen (e.g. portrait or landscape), the axes of the rotation matrix must be adjusted. Since the AR system developed in this thesis is held in landscape mode, i.e. rotated by 90°, the x- and y-axes of the rotation matrix must be remapped, with one of them negated.
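A minimal sketch of this sensor handling is given below, using Android's SensorManager with the rotation vector sensor. The concrete axis remapping (AXIS_Y / AXIS_MINUS_X) is the standard Android example for a device rotated by 90° and is stated here as an assumption, not as the exact mapping of the thesis implementation.

```java
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

// Sketch: obtain a rotation matrix from the rotation vector sensor and remap its axes
// for a landscape-oriented device before applying it to the virtual camera.
public class OrientationTracker implements SensorEventListener {

    private final float[] rotationMatrix = new float[9];
    private final float[] remappedMatrix = new float[9];

    public void register(SensorManager sensorManager) {
        Sensor rotationVector = sensorManager.getDefaultSensor(Sensor.TYPE_ROTATION_VECTOR);
        sensorManager.registerListener(this, rotationVector, SensorManager.SENSOR_DELAY_FASTEST);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        if (event.sensor.getType() != Sensor.TYPE_ROTATION_VECTOR) return;
        SensorManager.getRotationMatrixFromVector(rotationMatrix, event.values);
        // Remap for landscape orientation (assumed standard 90-degree mapping).
        SensorManager.remapCoordinateSystem(rotationMatrix,
                SensorManager.AXIS_Y, SensorManager.AXIS_MINUS_X, remappedMatrix);
        // remappedMatrix is then converted to the orientation of the virtual camera.
    }

    @Override public void onAccuracyChanged(Sensor sensor, int accuracy) { }
}
```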

5 Evaluation of AR System

The mobile CityGML AR system was empirically evaluated in terms of performance and accuracy. The performance evaluations include the loading times for CityGML models, FPS for the augmentation process and the performance of the pose estimation. The AR system was evaluated in an outdoor area, as well as in an indoor area. All evaluations were repeated several times to confirm the reproducibility of the results.

5.1 System Calibration

Before the system can be properly used and evaluated it is necessary to calibrate it, so that it facilitates accurate results. This was done for each smartphone that was part of the evaluation using the photogrammetry software PHIDIAS 72 . As shown in Figure 5.1, markers were placed on a wall and photographed from various angles with each smartphone. The images were then used as input to determine the intrinsic parameters for the rear smartphone camera.

72 http://www.phocad.de

Figure 5.1: The PHIDIAS markers used for the calibration process captured from different angles.

For the three smartphones, the following values were derived as shown in Table 5.1, Table 5.2 and Table 5.3:

Table 5.1: The calibration parameters of the Google Nexus 5 obtained from the calibration process using PHIDIAS in calibration C1 and calibration C2.

Parameter   C1            C2           Unit
cx          1648.266667   1634.6667    px
cy          1233.133333   1244.7333    px
fx          2841.648444   2818.2869    px
fy          2841.648444   2818.2869    px
k1          0.06926731    0.0752
k2          -0.07528287   -0.0993
k3          0.00000000    0.0000
k4          0.00000000    0.0000
p1          -0.00030605   -0.0004
p2          0.00041346    0.0001

Table 5.2: The calibration parameters of the Sony Xperia Z2 obtained from the calibration process using PHIDIAS in calibration C1 and calibration C2.

Parameter   C1           C2           Unit
cx          2637.0508    2629.6780    px
cy          1996.6441    1989.5254    px
fx          4083.1701    4071.8716    px
fy          4083.1701    4071.8716    px
k1          0.3306       0.2972
k2          -1.2830      -1.0865
k3          1.5181       1.2125
k4          0.0000       0.0000
p1          0.0000       0.0000
p2          0.0000       0.0000

Table 5.3: The calibration parameters of the Google Pixel 2 XL obtained from the calibration process using PHIDIAS in calibration C1 and calibration C2.

Parameter   C1           C2           Unit
cx          2021.2857    2037.7143    px
cy          1507.8571    1486.1429    px
fx          3215.4933    3202.8319    px
fy          3215.4933    3202.8319    px
k1          0.1457       0.1494
k2          -0.4873      -0.5150
k3          0.4801       0.5264
k4          0.0000       0.0000
p1          0.0000       0.0000
p2          0.0000       0.0000

The parameters were then set on each smartphone so that these were available to the correction algorithms. Once calibrated, every image captured with the rear smartphone camera is undistorted before using it for the optical pose estimation. It is also important to undistort the images before displaying them on the screen, since the undistorted virtual objects otherwise will not seamlessly overlay the objects shown in the images. The evaluation results presented in the following sections are based on the calibration parameters C1 in Table 5.1, Table 5.2 and Table 5.3. A second calibration C2 was done for each smartphone about a year later to identify the robustness of the calibration parameters. C1 and C2 were each applied to some distorted reference images which contained a matrix of control points. The undistorted images of C1 and C2 then were compared to each other, by comparing the control point coordinates and calculating the distances between them. The conclusion was that no significant differences could be found between the images, thus, showing that a calibration is robust enough over longer periods of time for the described application.

5.2 Data Performance

For AR, it is essential that the virtual information is displayed in real-time. For the AR system presented here, the CityGML database and the jMonkey-based renderer are essential for a good performance. For the evaluation of both, a database containing the models Model 1, Model 2 and Model 3 was created. First, the performance of the data selection process was evaluated and secondly the rendering performance. The following Table 5.4 shows some statistics about the database:

Table 5.4: Statistics about the test CityGML database containing the models Model 1, Model 2 and Model 3.

Type                      #
CityModel                 1
Building                  94,754
BuildingPart              32,689
BuildingInstallation      2
IntBuildingInstallation   17
RoofSurface               167,938
WallSurface               844,026
GroundSurface             111,922
ClosureSurface            447
CeilingSurface            56
InteriorWallSurface       732
FloorSurface              26
OuterCeilingSurface       0
OuterFloorSurface         0
Window                    661
Door                      337
Room                      56
BuildingFurniture         176
Total Objects             1,253,840
Polygons                  1,603,554
Size (MB)                 2140


5.2.1 Data Processing

CityGML especially stands out from other formats, such as pure graphical ones, due to the additional information about the city objects (e.g. semantic information or topology). This enables applications to not only query data geometrically, but also in a more sophisticated manner. For instance, a useful query for the AR system presented here is to select all building parts with a certain attribute that belong to buildings in a defined radius that are part of a city model with a specific identifier. Such queries are important tools to minimize the data load and to only display data that is of interest to the user at his current location. Therefore, specialized selection algorithms were implemented that efficiently select the data that is required for the current scene. Some typical SQL queries necessary for the AR system are the following:

• select all buildings (Q1)
• select a specific building with a given identifier (Q2)
• select all openings of a specific building using its name (Q3)
• select all polygons of surfaces that belong to buildings with a specific height (Q4)
• select buildings that are part of a city model with a defined name and within a radius of a defined position (Q5)

Each query was run using the test database on each smartphone (Figure 5.2).



Figure 5.2: Time to perform the queries Q1 - Q5 for each smartphone.

As expected, the most recent smartphone, the Google Pixel 2 XL, shows the best performance with an average query time of 6 ms, but the other two still perform well at approximately 26 ms. To evaluate the complete process of selecting data in the AR system which includes querying the database, exporting geometries, preparing geometries for rendering (transformation/triangulation) and displaying triangulated meshes on the screen, the overall loading times were measured. Positions inside and outside of LOD2 and LOD4 buildings were chosen to display exteriors and interiors of these. A selection (export) and visualization radius of 100 m was used. Table 5.5 shows the five positions in the city model that were used for evaluating the different cases that can occur and the number of objects the algorithm selected/exported for visualization. Examples for objects are walls, doors, furniture, etc.

Table 5.5: Positions P1 - P5 representing the possible locations that can occur with the AR system and the average amount of polygons and objects that the selection algorithm loaded.

Case:
P1: outside and surrounded by LOD2 buildings
P2: outside and surrounded by LOD2 buildings and a large LOD4 building
P3: inside of a LOD2 building surrounded by LOD2 buildings
P4: inside of a LOD2 building surrounded by LOD2 buildings and a large LOD4 building
P5: inside of a room of a LOD4 building which is surrounded by LOD2 buildings

                  P1      P2       P3    P4    P5
Polygons          3671    47,887   11    19    35,656
Buildings/Rooms   247     10       1     1     1
Objects           3556    1775     11    19    19

As shown in Table 5.5, the selection algorithm efficiently minimizes the number of exported objects and extracts only those necessary to visualize the near vicinity of the user’s position, from a total of 1,253,840 objects. The results of the evaluation of P1 - P5 are split up into a diagram for P1, P2, P5 and a diagram for P3, P4 in the following figures, since they are too different from each other to display them in a single diagram. Figure 5.3 and Figure 5.4 show the total time required from selection to visualization.



Figure 5.3: Loading times for each smartphone for the positions P1, P2 and P5.


Figure 5.4: Loading times for each smartphone for the positions P3 and P4.

The results indicate that the position generally does not influence the

loading time, but the number of objects returned by the algorithm does. Since the total loading times for P2 and P5 are relatively high for the Google Nexus 5 and Sony Xperia Z2 (Figure 5.5 and Figure 5.7), it is interesting to determine the costliest part of the process. Therefore, Figure 5.5 and Figure 5.6 show the required time to select and export the data and Figure 5.7 and Figure 5.8 show the necessary time to prepare the data for visualization (e.g. polygon triangulation). While selecting and exporting is done in the database, the preparations are done outside of the database.


Figure 5.5: The time spent to select data and export it from the database for each smartphone in positions P1, P2 and P5.


Figure 5.6: The time spent to select data and export it from the database for each smartphone in positions P3 and P4.


Figure 5.7: Required time to prepare the exported CityGML data for visualization.



Figure 5.8: Required time to prepare the exported CityGML data for visualization.

The majority of the time is required for preparing the data for the visualization. While querying and exporting the data is finished in less than 10 s for each position, the data preparation takes significantly longer for P2 and P5. P1, P3 and P4, on the other hand, require roughly an equal amount of time for both processes. The high preparation time of P2 and P5 is a result of the relatively large number of polygons in comparison to the other positions, but this is mainly an issue on the older smartphones. On the Google Pixel 2 XL both processes differ only slightly, by a couple of seconds.

5.2.2 Data Visualization

To evaluate the rendering performance on each smartphone a resolution of 1920×1080 was used with a color depth of 32 bit per pixel (bpp), gamma correction enabled, and anti-aliasing turned off.

No further post-effects were applied to the images. In 3D real-time rendering the number of draw calls strongly influences the rendering performance. In 3D Tiles for instance, every tile requires one draw call [170], therefore, the performance depends on the number of tiles. The presented solution instead depends on the variety of colors in the scene. This has the advantage that uniformly colored scenes require a minimum of draw calls. As a reference, the following Figure 5.9 shows the number of draw calls for the positions P1 - P5.


Figure 5.9: The average draw calls that were required for the rendered scene in P1 - P5.

In Figure 5.10, the average FPS for P1 - P5 are displayed. For stability reasons, the maximum FPS was set to 60. Generally, the rendering performance of the Google Nexus 5 and Sony Xperia Z2 is sufficient for an AR system, but they have difficulties handling objects with large amounts of polygons, as for example in P2. Therefore, newer generation smartphones, such as the Google Pixel 2

XL, are a better choice for very large or very detailed scenes.


Figure 5.10: Average FPS in the different positions for each smartphone.

5.3 Door Detection

The door detection algorithm was evaluated in terms of detection time, rate, accuracy and stability. For this, it was tested on a dataset containing images of various doors. Figure 5.11 shows the used evaluation setup: Images were captured from distances of 3 to 9 m. At each distance the doors were photographed straight towards the door and up to angles of 70° in different lighting conditions, so that results can be summed up into six measurement cases M1 – M6.


Figure 5.11: Setup for testing optical pose estimation.

The following general measurements (M1 – M6) were done. • M1: Near (3 m) straight • M2: Far (up to 9 m) straight • M3: Near (3 m) at angle • M4: Far (up to 9 m) at angle • M5: Good lighting • M6: Bad lighting

For M1 and M2, the camera was placed straight towards the door at different distances. For M3 and M4, the camera was placed next to the door, so that it is shot from an angle. This was also done from different distances. For good lighting conditions (M5) images were captured during the day. Indoors the windows were fully opened, and artificial lighting was turned on. For bad lighting conditions (M6),

images were captured on an overcast day with the blinds of the windows shut and all artificial lighting turned off.

5.3.1 Image Dataset

In total 80 images (Figure 5.12) that fit the pre-defined measurement cases (M1 – M6) were captured with each smartphone. The majority of test images contain a door that is part of a complex scene. Complex scenes are defined as scenes that represent a cluttered environment with arbitrary objects such as wall paintings, books or other furniture. Additionally, 20 images in which no door is included, but some similar objects, such as pictures on walls etc., were added to the dataset. Figure 5.12 shows some example images:

Figure 5.12: Some examples of doors used for evaluating the door detection algorithm.

5.3.2 Influence of Image Resolution

In [197], an image resolution of 320×240 is recommended. To identify the optimal resolution for the project presented here, the influence of the image resolution on the detection runtimes and rates was evaluated. Each image of the image dataset was downscaled to seven different resolutions with the same aspect ratio, ranging from 240×180 to 1280×960 pixels. Higher resolutions were not used, since these were not processable on the smartphone hardware.

Detection Rate The rate of detection is defined as the reliability of the detection algorithm, whether or not a door is detected in an image. There are four different cases that can occur: • The image contains a door and it is properly detected (true positive – TP) (Figure 5.15) • The image contains a door and it is not detected (false negative – FN) • The image contains no door and no door is detected (true negative – TN) (Figure 5.16) • The image contains no door, but a door is detected (false positive – FP).

A door is rated as detected when the extracted geometry matches the manually extracted shape of the door (Figure 5.13). Figure 5.14 shows an example of an incorrect detection.

Figure 5.13: Example of detected door. The red dots are corner points and the green rectangle is the correctly detected door.

Figure 5.14: Example of a partially detected door.

Figure 5.15 shows the success rate of detecting a door in images at different resolutions containing a door and Figure 5.16 the success

rate of determining that no door is included in images at different resolutions.


Figure 5.15: True Positive detection rate of the door detection algorithm for images with different resolutions.


Figure 5.16: True Negative detection rate of the door detection algorithm for images with different resolutions.

As Figure 5.15 depicts, with a resolution of 240×180, the algorithm only has a TP detection rate of 19%. It increases rapidly to more than four times as much with a resolution of 480×360. Higher resolutions only slightly improve the TP rate. A similar trend is found for the TN rate (Figure 5.16). With a resolution of 240×180, in more than half of the cases a door is falsely detected in an image without a door in it. For 480×360 images, the rate is reduced to 20% (80% TN rate) and also only improves slightly for higher-resolution images.

Detection Time


Figure 5.17: Time required to detect a door in cases using the same images in different resolutions.


The measured detection time is defined as the duration of the entire detection process, including steps such as downscaling of the image, extracting edges and analyzing these, starting with the original source image. Figure 5.17 shows the mean time required to detect a door in images at different resolutions. The total detection time is lowest for 240×180 with 17 ms for the Google Pixel 2 XL, 56 ms for the Sony Xperia Z2 and 72 ms for the Google Nexus 5. It continually increases with the size of the images to 1.5 s for the Google Pixel 2 XL, 3.5 s for the Sony Xperia Z2 and 5.1 s for the Google Nexus 5.

5.3.3 Influence of Environmental Conditions

Generally, the larger the images are the better the TP/TN detection rates are, but they also require more time to be analyzed. While the detection time grows exponentially with the image resolution, the detection rate stagnates at 480×360 pixels and only increases slightly with higher resolutions. Therefore, an image resolution of 480×360 pixels was chosen as a standard resolution for the AR system. The following section shows the results using the standard resolution.


Detection Rate

Table 5.6: Detection rate of the door detection algorithm using downscaled images with a resolution of 480×360 pixels.

Value                 Description                                      Result   Rate (%)
true positive (TP)    Door in image and correctly identified it        67       84
true negative (TN)    No door in image and correctly identified this   16       80
false positive (FP)   No door in image, but falsely identified one     4        20
false negative (FN)   Door in image, but not detected as one           13       16

Table 5.6 shows that when a door is present in an image, the algorithm has a successful detection rate of 84 %. If there is no door in the image, the algorithm determines this correctly in 80 % of the cases.

Detection Time


Figure 5.18: Time required to detect a door in cases M1 - M6 using downscaled images with a resolution of 480×360 pixels.

As Figure 5.18 shows, the time does not directly correlate with the distance, angle or lighting. A much stronger factor is the background of the door. Backgrounds with various colors and shapes in them increase the required time. The reason is that in cluttered backgrounds the algorithm finds more potential corner points that subsequently must be analyzed. Therefore, doors that are surrounded by multiple objects such as pictures on the wall, books, etc. require more time in comparison to doors that are only surrounded by white walls. Typically, the images captured at an angle and from a distance contain more of the surrounding environment, so that the detection process is more complex.

Detection Accuracy The accuracy of the detection algorithm describes how well the automatically derived corners match the actual corners of the door. For this, the corners of the doors were manually extracted from each test image as reference coordinates. The automatically derived corner coordinates were then compared to these references.


Figure 5.19: Accuracy of the automatically derived door corner coordinates from downscaled images with a resolution of 480×360 pixels.

Figure 5.19 shows that the automatically extracted door corners lie on average only 1.3 pixels from the reference coordinates. The door detection and corner extraction algorithm, therefore, can be considered to be sufficient for optical pose estimation.

5.4 Pose Tracking

Pose tracking is one of the most researched topics in the context of AR, since it is also one of the most challenging problems. Specifically, for the use cases described in this thesis a high accuracy and stability are a necessity. In the following sections the results of the evaluation of the developed tracking system are described. The application was run according to a normal use case (e.g. augmenting a room). Two measurements were mainly of interest, the overall accuracy and the stability of the augmentations. Generally, the size and the position of the augmentations depend on two factors, first, the camera calibration parameters from which, for example, the FOV is calculated and, secondly, the pose of the physical device. To eliminate the first factor, each camera was calibrated using PHIDIAS and a reference pose was manually set. The augmentations were visually inspected to confirm the correctness of the calibration parameters. In the next steps the automatic pose estimation could be evaluated. The accuracy reflects how well the virtual objects overlay the physical objects. Provided that the AR system is fully calibrated so that there are no distortions caused by differences between the images produced by the physical and virtual camera, the accuracy of augmentation depends on the automatically estimated device pose. Therefore, the pose automatically derived by the optical tracker was compared to a reference pose. The stability determines how long the virtual objects remain in a certain pose and how much influence, for instance the sensor-drift, has on the augmentations. Translation errors were calculated using the Euclidean norm (see equation (3.1)) and rotational errors were calculated for each axis using equation (3.2).

5.4.1 Optical Pose

When using images to estimate poses, factors such as the distance or angle to the object (e.g. door) and lighting conditions potentially influence the accuracy of the pose. To confirm or refute this, poses were estimated for the cases M1 - M6.

Accuracy The accuracy of the automatically estimated poses was determined by comparing them to manually calculated reference poses. Figure 5.20 shows the positional differences and Figure 5.21 the angular differences.


Figure 5.20: Accuracy of automatically estimated position.



Figure 5.21: Accuracy of automatically estimated orientation.

Generally, the automatic pose estimations lie on average 17 cm from the reference position and 2.5 degrees on average from the reference orientation. The results show that the accuracy is generally independent of the angle, distance or the lighting conditions.

5.4.2 Sensor Pose Stability

In this section the results of the evaluation of the sensor pose algorithms are presented. Given an initial pose from the optical pose algorithms, the sensors of the smartphone enable the user to look around. Essential for this is the stability of the IMU-based poses. This is inspected by measuring the influence of sensor drift and the impact of relative movements. The stability of the augmentations was analyzed using three different sensor types provided by the Android smartphones: the rotation vector, the game rotation vector and the gyroscope. The rotation vector and the game rotation vector are referred to as composite sensors in Android, since they use multiple hardware sensors and apply sensor fusion. The rotation vector incorporates the accelerometer, magnetometer and gyroscope and reports the orientation relative to magnetic north. The game rotation vector utilizes the accelerometer and gyroscope and measures relative to some starting point, just like the gyroscope. Two cases were evaluated: first, the impact of the sensor drift on pose tracking, for which the devices were placed in a stable position and pose tracking was run for a pre-determined amount of time; secondly, the impact of device movements, for example when the user looks around. For both cases, the initial pose of the device was saved and later compared to the pose at the end of the measurement.
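As a reference for how these three orientation sources can be accessed on Android, a minimal sketch is given below; the class name and the chosen sampling rate are illustrative.

```java
import android.hardware.Sensor;
import android.hardware.SensorEvent;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;

/** Minimal sketch for reading the three orientation sources used in this test. */
public class OrientationReader implements SensorEventListener {
    private final float[] rotationMatrix = new float[9];
    private final float[] orientationAngles = new float[3];

    public void register(SensorManager sm) {
        // Composite sensors: fused from accelerometer, magnetometer and gyroscope
        sm.registerListener(this, sm.getDefaultSensor(Sensor.TYPE_ROTATION_VECTOR),
                SensorManager.SENSOR_DELAY_GAME);
        sm.registerListener(this, sm.getDefaultSensor(Sensor.TYPE_GAME_ROTATION_VECTOR),
                SensorManager.SENSOR_DELAY_GAME);
        // Raw gyroscope: angular rates that have to be integrated over time
        sm.registerListener(this, sm.getDefaultSensor(Sensor.TYPE_GYROSCOPE),
                SensorManager.SENSOR_DELAY_GAME);
    }

    @Override
    public void onSensorChanged(SensorEvent event) {
        int type = event.sensor.getType();
        if (type == Sensor.TYPE_ROTATION_VECTOR || type == Sensor.TYPE_GAME_ROTATION_VECTOR) {
            // Convert the rotation vector to a rotation matrix and Euler angles
            SensorManager.getRotationMatrixFromVector(rotationMatrix, event.values);
            SensorManager.getOrientation(rotationMatrix, orientationAngles);
        }
        // For TYPE_GYROSCOPE, event.values holds the angular velocity (rad/s)
        // around x, y and z, which would be integrated between events.
    }

    @Override
    public void onAccuracyChanged(Sensor sensor, int accuracy) { /* not used */ }
}
```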

Resting AR system For the evaluation of the resting AR system, each smartphone was pointed in a stable direction using a sturdy tripod and a measurement session was started. The starting pose was automatically saved after a delay of 2 s to prevent any distortions caused by interactions with the touch screen. After a pre-determined amount of time the next pose was saved and compared to the starting pose. The resulting orientation errors for each smartphone and the different sensor types are shown in Figure 5.22, Figure 5.23 and Figure 5.24 respectively.


[Figure 5.22 data: mean rotation error in degrees over 0–600 s, per device (Google Nexus 5, Sony Xperia Z2, Google Pixel 2 XL).]

Figure 5.22: Quality of relative orientation using the Rotation Vector over time.

[Figure 5.23 data: mean rotation error in degrees over 0–600 s, per device.]

Figure 5.23: Quality of relative orientation using the Game Rotation Vector over time.

[Figure 5.24 data: mean rotation error in degrees over 0–600 s, per device.]

Figure 5.24: Quality of relative orientation using the Gyroscope over time.

Figure 5.22 shows that the Rotation Vector of the three smartphones performs relatively well in a 10 min window, with mean rotation errors below 1 degree. While the error of the Google Nexus 5 increases most slowly, the Sony Xperia Z2 shows a rather steep linear increase. The curve of the Google Pixel 2 XL shows that the error stabilizes after a certain time. The Game Rotation Vectors (Figure 5.23) of the three smartphones show mean rotation errors under 0.5 degrees in a 10 min window. The Google Pixel 2 XL performs best but shows a linear error increase. In comparison, the errors of the Google Nexus 5 and the Sony Xperia Z2 each stabilize after a certain time. While the average absolute orientation from the Rotation Vector suffers from the imprecise measurements of the magnetometer, resulting in bearing errors of up to 25 degrees (also see [179]), the relative orientation results are much better. When comparing Figure 5.22 and Figure 5.23 with Figure 5.24, it can be seen that the quality of the orientation benefits from sensor fusion in the Rotation Vector and Game Rotation Vector. While the orientation derived from the Gyroscope (Figure 5.24) alone is comparatively good for the Sony Xperia Z2 and Google Pixel 2 XL, the Google Nexus 5 suffers immensely from the sensor drift.

Moving AR system The error produced by the sensors when moving the device around was likewise determined by saving the starting pose and comparing it to the pose after the movements. The start and end pose were each saved by pressing a button. To return the device to the exact physical pose it was in before being moved around, a tripod was used whose initial position was marked on the corresponding surface. The entire tripod with the smartphone was then moved accordingly and later placed exactly on the markings again. The results for each smartphone and type of sensor are displayed in Figure 5.25, Figure 5.26 and Figure 5.27 respectively.


[Figure 5.25 data: mean rotation error in degrees over the number of horizontal turns (1–6), per device.]

Figure 5.25: Quality of relative orientation using the Rotation Vector when AR system is rotated.

[Figure 5.26 data: mean rotation error in degrees over the number of horizontal turns (0–6), per device.]

Figure 5.26: Quality of relative orientation using the Game Rotation Vector when AR system is rotated.

[Figure 5.27 data: mean rotation error in degrees over the number of horizontal turns (0–6), per device.]

Figure 5.27: Quality of relative orientation using the Gyroscope when AR system is rotated.

Figure 5.25 shows that the mean rotation error is not influenced by movements of the smartphone when using the Rotation Vector. Figure 5.26 and Figure 5.27 show that the mean rotation error increases with each turn when using the Game Rotation Vector and the Gyroscope. In both cases the Sony Xperia Z2 shows the best results and the Google Pixel 2 XL the poorest.

5.5 General AR System Evaluation

After evaluating specific components of the AR system, such as the CityGML database and the tracking algorithms, in this section the AR system is evaluated in a more general sense, specifically with regard to measures relevant to the usability, such as the battery life.

Battery Life For good usability the user should be able to work with the system for a reasonable amount of time before having to recharge it. To evaluate the impact of the developed AR framework on each smartphone, each system was run until the battery was depleted. Figure 5.28 shows how long each battery lasted. The main energy consumers were perceived to be the IMU and the display.

[Figure 5.28 data: battery life in minutes for the Google Nexus 5, Sony Xperia Z2 and Google Pixel 2 XL.]

Figure 5.28: The time that the smartphone battery lasts when using the AR framework.

While the Google Nexus 5 lasted 40 min on average, the Sony Xperia Z2 could run nearly three times as long, with 110 min on average. The Google Pixel 2 XL was the most durable with 210 min. Some ways to save energy would be, for example, to decrease the brightness of the display and to lower the sampling rate of each sensor of the IMU. However, this would have the disadvantage that the pose would not be updated as quickly as possible, resulting in delays on the display.
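A minimal sketch of such energy-saving measures on Android is given below; the helper class and the concrete values are illustrative only and not part of the evaluated framework.

```java
import android.app.Activity;
import android.content.Context;
import android.hardware.Sensor;
import android.hardware.SensorEventListener;
import android.hardware.SensorManager;
import android.view.WindowManager;

/** Illustrative energy-saving measures (sketch only). */
public final class PowerSavingHelper {

    /** Re-registers the orientation sensor at a lower rate and dims the screen. */
    public static void reducePowerDraw(Activity activity, SensorEventListener listener) {
        SensorManager sm =
                (SensorManager) activity.getSystemService(Context.SENSOR_SERVICE);
        Sensor gameRotation = sm.getDefaultSensor(Sensor.TYPE_GAME_ROTATION_VECTOR);

        // SENSOR_DELAY_UI delivers fewer events than SENSOR_DELAY_FASTEST,
        // saving energy at the cost of a slightly delayed pose update.
        sm.unregisterListener(listener);
        sm.registerListener(listener, gameRotation, SensorManager.SENSOR_DELAY_UI);

        // Dimming the window brightness (0.0f - 1.0f) lowers the display power draw.
        WindowManager.LayoutParams lp = activity.getWindow().getAttributes();
        lp.screenBrightness = 0.3f;
        activity.getWindow().setAttributes(lp);
    }
}
```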

6 Conclusions

In this thesis a general design for an AR system using CityGML data was presented and discussed. Based on the results of the requirement analyses, a proof-of-concept system was implemented and evaluated, using three Android smartphones: the Google Nexus 5, the Sony Xperia Z2 and the Google Pixel 2 XL. While the fundamental methods of the presented AR system already involve multiple sub-topics, such as pose tracking, that are worth investigating and optimizing, AR offers yet more potential for research. An example is the optimization of the visual coherence of the physical and virtual objects, so that the virtual objects become nearly indistinguishable from the physical objects and are registered to them seamlessly. The results of the AR system evaluation show that a modern smartphone is sufficient to realize a fully-fledged location-based mobile AR system, independent of exterior infrastructures like pose tracking systems or data processing solutions. In terms of runtimes, especially the newest-generation smartphones, such as the Google Pixel 2 XL, provide suitable hardware to execute the algorithms within an appropriate time. Furthermore, a combination of optical pose tracking methods and IMU-based methods provides sufficient results, so that virtual objects, such as the CityGML objects, are accurately overlaid over their physical counterparts. Figure 6.1 and Figure 6.2 show examples of visualized CityGML objects in the AR environment. The augmentation in Figure 6.1 was achieved fully automatically with the optical pose tracker. For Figure 6.2, the pose was derived from the sensor-based pose tracker and manually corrected. Future work includes transferring the optical object-based pose tracker to outdoor environments, to enable fully automatic pose tracking as in indoor areas.

Figure 6.1: Door augmented using the AR system.

Figure 6.2: Building augmented using the AR system.

6.1 Data Processing

With execution times of 6–26 ms for database queries typical of this application, SpatiaLite performs well, even on outdated smartphone hardware. The actual data selection and export requires 2–10 s for roughly 50,000 polygons and 2,000 single objects, depending on the hardware. The differences in hardware become clearer during polygon triangulation, which takes more than 10 min on the Google Nexus 5 but only 7 s on the Google Pixel 2 XL, making it one of the most time-consuming steps in visualizing the data. Therefore, an alternative to triangulating the data on the fly after exporting it from the database could be to triangulate it once before the import, so that the geometry is ready to render on export.
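To illustrate the cost of this step, the following is a minimal 2D ear-clipping sketch for a simple, counter-clockwise polygon that has already been projected into its plane; the class name and the omitted handling of holes and CityGML-specific cases are simplifications, not the implementation used in this work. The ear test scans the remaining vertices for each candidate corner, which explains the quadratic behavior on large polygons.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal 2D ear-clipping triangulator (sketch). */
public final class EarClipper {

    /** Returns triangles as index triples into the input vertex arrays. */
    public static List<int[]> triangulate(float[] xs, float[] ys) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < xs.length; i++) idx.add(i);

        List<int[]> triangles = new ArrayList<>();
        while (idx.size() > 3) {
            boolean clipped = false;
            for (int i = 0; i < idx.size(); i++) {
                int prev = idx.get((i + idx.size() - 1) % idx.size());
                int curr = idx.get(i);
                int next = idx.get((i + 1) % idx.size());
                if (isEar(xs, ys, idx, prev, curr, next)) {
                    triangles.add(new int[]{prev, curr, next});
                    idx.remove(i);      // clip the ear
                    clipped = true;
                    break;
                }
            }
            if (!clipped) break;        // degenerate input, bail out
        }
        if (idx.size() == 3) {
            triangles.add(new int[]{idx.get(0), idx.get(1), idx.get(2)});
        }
        return triangles;
    }

    // An ear is a convex corner whose triangle contains no other polygon vertex.
    private static boolean isEar(float[] xs, float[] ys, List<Integer> idx,
                                 int a, int b, int c) {
        if (cross(xs, ys, a, b, c) <= 0) return false;   // reflex corner
        for (int j : idx) {
            if (j == a || j == b || j == c) continue;
            if (pointInTriangle(xs[j], ys[j], xs, ys, a, b, c)) return false;
        }
        return true;
    }

    private static float cross(float[] xs, float[] ys, int a, int b, int c) {
        return (xs[b] - xs[a]) * (ys[c] - ys[a]) - (ys[b] - ys[a]) * (xs[c] - xs[a]);
    }

    private static boolean pointInTriangle(float px, float py,
                                           float[] xs, float[] ys,
                                           int a, int b, int c) {
        float d1 = sign(px, py, xs[a], ys[a], xs[b], ys[b]);
        float d2 = sign(px, py, xs[b], ys[b], xs[c], ys[c]);
        float d3 = sign(px, py, xs[c], ys[c], xs[a], ys[a]);
        boolean hasNeg = d1 < 0 || d2 < 0 || d3 < 0;
        boolean hasPos = d1 > 0 || d2 > 0 || d3 > 0;
        return !(hasNeg && hasPos);
    }

    private static float sign(float px, float py, float ax, float ay, float bx, float by) {
        return (px - ax) * (by - ay) - (bx - ax) * (py - ay);
    }
}
```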

6.2 Rendering

In nearly all of the tested cases the system reached the maximum of 60 FPS, providing a real-time experience. Besides technical aspects such as optimizing the rendering performance of the virtual objects, the AR system presented here could be further optimized towards more seamless augmentations by matching the visual appearance of the virtual and physical objects. Some essential parts of visual coherence have already been taken into account, such as the correct placement of the CityGML objects in relation to their physical counterparts in a global reference system and the synchronization of the virtual camera with the physical camera. Furthermore, the virtual camera uses a perspective projection that takes the lens distortion and the intrinsic parameters of the physical camera into account. Generally, these aspects can be referred to as registration-related aspects. In addition, some appearance-related parts of visual coherence have also been considered, such as the correct coloring and shadow casting of objects. The colors are simple RGB colors; an enhancement here could be to use textures so that the objects appear more realistic. Their overall benefit for AR, at the cost of the application's performance, first has to be assessed in further research, though. The shadows are cast according to the position of the sun. Possible optimizations would be to also take other light sources, such as artificial lights, into account and to consider indirect lighting from neighboring objects. Depending on the use case, occlusion could also enhance the visual coherence; partly covered objects would then only be partly visible.
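As an illustration of the camera synchronization mentioned above, the sketch below builds an OpenGL-style projection matrix from pinhole intrinsics. The method name is hypothetical and the signs of the principal-point terms depend on where the image origin is assumed to lie, so they may need to be flipped for a particular convention; this is not necessarily the formulation used in this work.

```java
/** Sketch: column-major float[16] projection matrix from camera intrinsics. */
public final class CameraProjection {

    public static float[] fromIntrinsics(float fx, float fy, float cx, float cy,
                                         float width, float height,
                                         float near, float far) {
        float[] p = new float[16];            // initialised to zero
        p[0]  = 2f * fx / width;              // focal length scaling in x
        p[5]  = 2f * fy / height;             // focal length scaling in y
        p[8]  = 1f - 2f * cx / width;         // principal point offset (x)
        p[9]  = 2f * cy / height - 1f;        // principal point offset (y)
        p[10] = -(far + near) / (far - near); // depth remapping to [-1, 1]
        p[11] = -1f;                          // perspective divide by -z
        p[14] = -2f * far * near / (far - near);
        return p;
    }
}
```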

6.3 Door Detection

Currently, the door detection relies on features found by a corner detector. All detected points are analyzed and tested against a set of conditions to find possible door candidates. While the current implementation already provides good results, with a detection speed of 215 ms and a positive detection rate of 80 %, the detection rate could still be improved by increasing the image resolution, which would make features in the images easier to detect. The disadvantage would be a decrease in detection speed. To counter this, the number of corner points could be reduced by additional conditions. One approach could be to exploit the color of the doors: this information is available in the CityGML model, so an additional test could be performed to check whether the found corner points belong to an object of a certain color. The accuracy of the detected door geometry, with an error of only 1.3 pixels, is sufficient for optical pose estimation.
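As an illustration of the corner extraction step, a detector such as Shi-Tomasi can be invoked with OpenCV as sketched below; the choice of detector and the threshold values are assumptions for this sketch, not the parameters used in this work.

```java
import org.opencv.core.Mat;
import org.opencv.core.MatOfPoint;
import org.opencv.core.Point;
import org.opencv.imgproc.Imgproc;

/** Sketch: corner extraction as a possible basis for the door-candidate search. */
public final class CornerFeatures {

    public static Point[] detectCorners(Mat grayImage) {
        MatOfPoint corners = new MatOfPoint();
        // maxCorners, qualityLevel and minDistance limit the number of candidate
        // points and thus the cost of the subsequent door tests.
        Imgproc.goodFeaturesToTrack(grayImage, corners,
                200,    // maxCorners
                0.01,   // qualityLevel
                10);    // minDistance in pixels
        return corners.toArray();
    }
}
```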

6.4 Pose Tracking

As one of the major topics in connection with AR, pose tracking leaves the most space for improvements. In this work, two methods for image-based pose estimation were implemented: a model-based approach using projected 3D edges and a model-based approach using door detection. The edge-based method was not applicable, since the pose derived from the IMU was not sufficient to accurately project the 3D edges. The door detection method, on the other hand, is computationally relatively expensive and suffers from some drift when switching to the IMU-based tracking to look around. On average, an orientation error of 1–6 degrees occurs. This is noticeable in the visual augmentations, but not a major issue. The initial pose derived from the door-based pose estimation on average has a positional accuracy of 17 cm and an orientation accuracy of at most 3 degrees. With these values the virtual objects augment their physical counterparts sufficiently. To further increase the quality of the augmentations and stabilize them, a possibility would be to combine the edge-based and the door-based methods. First, a pose could be estimated using the door detection, which could then be used as a starting pose for the edge-based tracking algorithm instead of the IMU. The disadvantage would be that the door would always have to be visible for the edges to be matched. An alternative to relying solely on doors could be to additionally make use of objects in the vicinity for pose estimation, to either fully rely on image-based tracking or to sporadically estimate correctional data to account for the sensor drift. A challenge would be to find reliable physical objects, which on the one hand have a constant pose over time and on the other hand do not have an overly complex shape, so that they can be detected reliably with CV methods. Alternatively, in rooms an approach similar to that of ARCore could be effective. ARCore finds feature points in the captured images and derives planes, such as wall, ceiling, floor or table surfaces, from them. This information, in combination with building data as provided by CityGML, could be used to stabilize the poses by matching the planes to each other. A disadvantage of this approach is that the surfaces must not be cluttered with objects for them to be detectable. A more general approach therefore could be to employ VO in addition to the IMU tracking system (VIO) to stabilize the relative pose estimation process after obtaining an absolute pose from an object like a door.
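For reference, an initial pose from the four detected door corners and their CityGML counterparts could be computed with a PnP solver as sketched below; the use of OpenCV's solvePnP and the surrounding names are illustrative and not necessarily the solver applied in this work.

```java
import org.opencv.calib3d.Calib3d;
import org.opencv.core.Mat;
import org.opencv.core.MatOfDouble;
import org.opencv.core.MatOfPoint2f;
import org.opencv.core.MatOfPoint3f;
import org.opencv.core.Point;
import org.opencv.core.Point3;

/** Sketch: camera pose from four door corners and their model coordinates. */
public final class DoorPose {

    public static void estimate(Point3[] doorCorners3d, Point[] doorCorners2d,
                                Mat cameraMatrix, MatOfDouble distCoeffs,
                                Mat rvecOut, Mat tvecOut) {
        MatOfPoint3f objectPoints = new MatOfPoint3f(doorCorners3d);
        MatOfPoint2f imagePoints = new MatOfPoint2f(doorCorners2d);
        // With four coplanar correspondences, the PnP solution yields the
        // rotation (rvec) and translation (tvec) of the camera relative to the door.
        Calib3d.solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs,
                rvecOut, tvecOut);
    }
}
```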

6.5 BIM as Extension to AR System

Generally, many AR applications are produced for entertainment purposes, but AR also promises to be valuable in disciplines such as city planning or civil engineering. Here, especially BIM provides the means for forward-looking applications of AR in AEC. Different research works, such as [206], [207], have already focused on related topics. BIM could also contribute to the AR system presented here: the 3D models described by the IFC format could serve as an addition to the CityGML models, since IFC offers the means to describe building objects in more detail and to include additional components. Within the studies for this thesis, [208] implemented a prototype for a BIM-based AR system and investigated its usability in civil engineering. The application was realized on a Google Tango device using the game engine Unity 3D. Generally, the following steps were involved in the development process of the application.

1. First, the IFC files were converted into the FBX format so they could be loaded into Unity, since it does not support IFC natively. An advantage of FBX is that it preserves the identifiers of the objects, which is essential for coupling the objects with additional object information, such as semantics or descriptive data.

2. Since the material information of the objects, such as colors, is only partly transferred from IFC to FBX, it is necessary to manually re-add this information. For this, the 3D modeling program 3ds Max (https://www.autodesk.com/products/3ds-max/overview) was used.

3. Additional object information was extracted from the IFC file into CSV files. The information is linked to the objects via their corresponding identifiers (see the sketch at the end of this section).

4. With the help of the Project Tango SDK, the pose tracking and object interactions were realized. The pose is estimated relative to a predefined starting pose, for example derived from a fiducial marker.

The AR system was evaluated on a construction site with the focus on the usability of AR in construction. The general outcome of the evaluation was that construction processes can benefit from AR, given sufficiently detailed building models. Figure 6.3 shows a screenshot from the live augmentation of a building of RWTH Aachen University.

Figure 6.3: AR view using Google Tango and a custom-created BIM model of the RWTH Aachen University [208].
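A minimal sketch of the identifier-based linking mentioned in step 3 is given below, assuming a semicolon-separated column layout of identifier, attribute name and value; the class name and the file format are assumptions for illustration only.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Sketch: loading per-object attributes exported from IFC into a lookup table. */
public final class IfcAttributeTable {

    /** Maps an object identifier (e.g. the IFC GlobalId) to name/value pairs. */
    public static Map<String, Map<String, String>> load(String csvPath) throws IOException {
        Map<String, Map<String, String>> attributes = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(";");   // assumed layout: id;name;value
                if (cols.length < 3) continue;
                attributes.computeIfAbsent(cols[0], k -> new HashMap<>())
                          .put(cols[1], cols[2]);
            }
        }
        return attributes;
    }
}
```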

Bibliography

[1] C. Choi and C. Brahic, “Found: a pocket guide to prehistoric Spain,” New Sci. , vol. 203, no. 2720, pp. 8–9, 2009. [2] Apple, “Apple reinvents the phone with iPhone,” Apple.com , 2007. [Online]. Available: http://www.apple.com/pr/library/2007/01/09Apple-Reinvents- the-Phone-with-iPhone.html. [Accessed: 20-Apr-2018]. [3] Statista, “Number of smartphones sold to end users worldwide from 2007 to 2015 (in million units),” www.statista.com , 2018. [Online]. Available: http://www.statista.com/statistics/263437/global-smartphone- sales-to-end-users-since-2007/. [Accessed: 20-Apr-2018]. [4] M. Weiser, “The Computer for the 21st Century,” Sci. Am. , vol. 265, no. 3, pp. 94–104, 1991. [5] Google LLC, “Google Trends,” 2018. [Online]. Available: https://trends.google.com/trends/explore?q=Augmented Reality. [Accessed: 07-May-2018]. [6] G. Schall, J. Schöning, V. Paelke, and G. Gartner, “A survey on augmented maps and environments: Approaches, interactions and applications,” Adv. Web-based GIS, Mapp. Serv. Appl. , vol. 9, pp. 207–226, 2011. [7] G. Gröger, T. Kolbe, C. Nagel, and K.-H. Häfele, “OGC City Geography Markup Language (CityGML) En-coding Standard,” OGC , pp. 1–344, 2012. [8] I. E. Sutherland, “A head-mounted three dimensional display,” in Proceedings of the December 9-11, 1968, fall joint computer 247 Bibliography

conference, part I on - AFIPS ’68 (Fall, part I) , 1968, pp. 757–764. [9] R. T. Azuma, “A survey of augmented reality,” Presence Teleoperators Virtual Environ. , vol. 6, no. 4, pp. 355–385, 1997. [10] L. F. Baum, The Master Key: An Electrical Fairy Tale, Founded Upon the Mysteries of Electricity and the Optimism of Its Devotees . 1901. [11] T. P. Caudell and D. W. Mizell, “Augmented Reality: An application of heads-up display technology to manual manufacturing processes,” in Proceedings of the Twenty-Fifth Hawaii International Conference on System Sciences , 1992, vol. 2, pp. 659–669. [12] P. Milgram, H. Takemura, A. Utsumi, and F. Kishino, “Augmented reality: a class of displays on the reality-virtuality continuum,” in Telemanipulator and telepresence technologies , 1995, vol. 2351, pp. 282–292. [13] D. Wagner, “Handheld Augmented Reality,” Graz University of Technology, 2007. [14] T. Höllerer, J. Wither, and S. DiVerdi, “„Anywhere Augmentation“: Towards Mobile Augmented Reality in Unprepared Environments,” Locat. Based Serv. TeleCartography , pp. 393–416, 2007. [15] G. Schall, Mobile Augmented Reality for Human Scale Interaction with Geospatial Models , 1st ed. Gabler Verlag, 2013. [16] Di. Chatzopoulos, C. Bermejo, Z. Huang, and P. Hui, “Mobile Augmented Reality Survey: From Where We Are to Where We Go,” IEEE Access , vol. 5, pp. 6917–6950, 2017. Bibliography 248

[17] S. Feiner, B. Macintyre, T. Höllerer, and A. Webster, “A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring the Urban Environment,” in ISWC ‘97 , 1997, vol. 1, no. 4, pp. 74–81. [18] T. Hollerer, S. Feiner, and J. Pavlik, “Situated documentaries: embedding multimedia presentations in the real world,” in ISWC ’99 , 1999, vol. 99, pp. 79–86. [19] J. Rekimoto, “Matrix: a realtime object identification and registration method for augmented reality,” in Computer Human Interaction, 1998. Proceedings. 3rd Asia Pacific , 1998, pp. 63–68. [20] H. Kato and M. Billinghurst, “Marker tracking and HMD calibration for a video-based augmented reality conferencing system,” in Proceedings 2nd IEEE and ACM International Workshop on Augmented Reality (IWAR’99) , 1999, pp. 85–94. [21] B. Thomas, V. Demczuk, W. Piekarski, D. Hepworth, and B. Gunther, “A wearable computer system with augmented reality to support terrestrial navigation,” in Digest of Papers. Second International Symposium on Wearable Computers , 1998, pp. 168–171. [22] W. Piekarski and B. Thomas, “Augmented Reality With Wearable Computers Running Linux,” in 2nd Australian Linux Conference (Sydney) , 2001, pp. 1–14. [23] V. Vlahakis et al. , “Archeoguide: An augmented reality guide for archaeolog sites,” IEEE Comput. Graph. Appl. , vol. 22, no. 5, pp. 52–60, 2002. [24] U. Kretschmer et al. , “Meeting The Spirit of History,” in Proceedings of the 2001 Conference on Virtual Reality, Archaeology and Cultural Heritage (VAST’01) , 2001, pp. 141– 249 Bibliography

152. [25] M. Kalkusch, T. Lidy, N. Knapp, G. Reitmayr, H. Kaufmann, and D. Schmalstieg, “Structured visual markers for indoor pathfinding,” in Augmented reality toolkit, The First IEEE international workshop , 2002, pp. 8 pp.-. [26] D. Wagner and D. Schmalstieg, “First steps towards handheld augmented reality,” Seventh IEEE Int. Symp. Wearable Comput. 2003. Proceedings. , pp. 127–135, 2003. [27] G. Reitmayr and T. W. Drummond, “Going out: Robust model-based tracking for outdoor augmented reality,” in Proceedings of the 5th IEEE and ACM International Symposium on Mixed and Augmented Reality , 2006, pp. 109– 118. [28] S. White and S. Feiner, “SiteLens: Situated Visualization Techniques for Urban Site Visits,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems , 2009, pp. 1117–1120. [29] G. Schall, D. Schmalstieg, and S. Junghanns, “Vidente - 3D Visualization of Underground Infrastructure using Handheld Augmented Reality,” GeoHydroinformatics Integr. GIS Water , pp. 207–219, 2010. [30] E. Stylianidis et al. , “LARA: A location-based and augmented reality assistive system for underground utilities’ networks through GNSS,” in Proceedings of the 2016 International Conference on Virtual Systems and Multimedia , 2016, pp. 1–9. [31] J. M. Santana, J. Wendel, A. Trujillo, J. P. Suárez, A. Simons, and A. Koch, “Multimodal location based services—semantic 3D city data as virtual and augmented reality,” in Progress in location-based services 2016 , 2017, pp. 329–353. Bibliography 250

[32] Socialcare, “Augmented Reality SDK Comparison,” 2015. [Online]. Available: http://socialcompare.com/en/w/augmented-reality-sdks. [Accessed: 11-Apr-2018]. [33] J. Mundy, “What is Project Tango? Google’s new AR tech explained,” 2017. [Online]. Available: http://www.trustedreviews.com/news/what-is-project-tango- 2941129. [Accessed: 17-Jul-2018]. [34] Microsoft, “Microsoft HoloLens,” 2018. [Online]. Available: https://www.microsoft.com/de-de/hololens. [Accessed: 11-Apr- 2018]. [35] R. Miller and J. Constine, “Apple Acquires Augmented Reality Company Metaio,” Tech Crunch , 2015. [Online]. Available: https://techcrunch.com/2015/05/28/apple-metaio/. [Accessed: 20-Apr-2018]. [36] Sol, “OpenGL 101: Matrices - projection, view, model,” 2013. [Online]. Available: https://solarianprogrammer.com/2013/05/22/opengl-101- matrices-projection-view-model/. [Accessed: 20-Apr-2018]. [37] T. Akenine-Möller, E. Haines, and N. Hoffman, Real-Time Rendering 3rd Edition . Natick, MA, USA: A. K. Peters, Ltd., 2008. [38] W. R. Hamilton, “II. On quaternions; or on a new system of imaginaries in algebra,” London, Edinburgh, Dublin Philos. Mag. J. Sci. , vol. 25, no. 163, pp. 10–13, 1844. [39] J. van Oosten, “Understanding Quaternions,” 2012. [Online]. Available: https://www.3dgep.com/understanding- quaternions/. [Accessed: 07-May-2018]. [40] C. Hock-Chuan, “3D Graphics with OpenGL Basic Theory,” 251 Bibliography

2012. [Online]. Available: http://www.ntu.edu.sg/home/ehchua/programming/opengl/C G_BasicsTheory.html. [Accessed: 20-Apr-2018]. [41] OpenCV, “Camera Calibration and 3D Reconstruction,” 2018. [Online]. Available: https://docs.opencv.org/4.0.0/d9/d0c/group__calib3d.html. [Accessed: 28-Nov-2018]. [42] Scratchapixel, “3D Viewing: the Pinhole Camera Model,” 2018. [Online]. Available: https://www.scratchapixel.com/lessons/3d- basic-rendering/3d-viewing-pinhole-camera/how-pinhole- camera-works-part-2. [Accessed: 20-Apr-2018]. [43] Matlab, “What is Camera Calibration?,” 2016. [Online]. Available: http://de.mathworks.com/help/vision/ug/camera- calibration.html#buvr2qb-2. [Accessed: 20-Apr-2018]. [44] J. de Vries, “Transformations,” 2018. [Online]. Available: https://learnopengl.com/Getting-started/Transformations. [Accessed: 20-Apr-2018]. [45] D. Eberly, “Triangulation by ear clipping,” Magic Software, Inc , pp. 1–13, 2002. [46] H. Edelsbrunner et al. , “Smoothing and cleaning up slivers,” in Proceedings of the thirty-second annual ACM symposium on Theory of computing - STOC ’00 , 2000, pp. 273–277. [47] B. Delaunay, “Sur la sphere vide,” Izv. Akad. Nauk SSSR, Otd. Mat. i Estestv. Nauk , vol. 7, pp. 793–800, 1934. [48] F. Hurtado, M. Noy, and J. Urrutia, “Flipping edges in triangulations,” Discret. Comput. Geom. , vol. 22, no. 3, pp. 333–346, 1999. [49] A. Bowyer, “Computing dirichlet tessellations,” Comput. J. , vol. 24, no. 2, pp. 162–166, 1981. Bibliography 252

[50] D. F. Watson, “Computing the n-dimensional Delaunay tessellation with application to Voronoi polytopes,” Comput. J. , vol. 24, no. 2, pp. 167–172, 1981. [51] H. Si, “Constrained Delaunay Triangulations and Algorithms,” 2007. [52] J. R. Shewchuk, “General-dimensional constrained delaunay and constrained regular triangulations, I: Combinatorial properties,” Discret. Comput. Geom. , vol. 39, no. 1, pp. 580– 637, 2008. [53] OGC, “Geography Markup Language Specification,” 2003. [Online]. Available: http://portal.opengis.org/files/?artifact_id=4700. [Accessed: 11-Apr-2018]. [54] F. Biljecki, J. Stoter, H. Ledoux, S. Zlatanova, and A. Çöltekin, “Applications of 3D City Models: State of the Art Review,” ISPRS Int. J. Geo-Information , vol. 4, no. 4, pp. 2842–2889, 2015. [55] C. Blut, T. Blut, and J. Blankenbach, “CityGML goes mobile: application of large 3D CityGML models on smartphones,” Int. J. Digit. Earth , pp. 1–18, 2017. [56] Berlin Partner für Wirtschaft und Technologie GmbH, “Das 3D-Stadtmodell Berlin,” 2018. [Online]. Available: https://www.businesslocationcenter.de/de/WA/B/seite0.jsp. [Accessed: 24-Jul-2018]. [57] Arbeitsgemeinschaft der Vermessungsverwaltungen der Länder der Bundesrepublik Deutschland, “AFIS-ALKIS-ATKIS- Modell,” 2018. [Online]. Available: http://www.adv- online.de/AAA-Modell/. [Accessed: 11-Apr-2018]. [58] Geschäftsstelle des IMA GDI in Nordrhein-Westfalen, 253 Bibliography

“geoportal.nrw,” 2018. [Online]. Available: https://www.geoportal.nrw/. [Accessed: 11-Apr-2018]. [59] Federal Ministry of Transport and Digital Infrastructure, “Road Map for Digital Design and Construction,” p. 20, 2015. [60] ISO, “Industry Foundation Classes (IFC) for data sharing in the construction and facility management industries,” Iso 16739 , p. 23, 2013. [61] A. Borrmann, M. König, C. Koch, and J. Beetz, “Building Information Modeling - Technologische Grundlagen und industrielle Praxis,” p. 591, 2015. [62] J. D. Foley, A. Van Dam, S. K. Feiner, and J. F. Hughes, Computer Graphics: Principles and Practice in C . 1995. [63] W3C, “What is the Document Object Model?,” 2016. [Online]. Available: https://www.w3.org/TR/WD- DOM/introduction.html. [Accessed: 11-Apr-2018]. [64] E. F. Codd, “A relational model of data for large shared data banks,” Commun. ACM , vol. 13, no. 6, pp. 377–387, 1970. [65] A. Guttman, “R-trees: A dynamic index structure for spatial searching,” in Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data , 1984, pp. 47–57. [66] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, “The R*-tree: an efficient and robust access method for points and rectangles,” ACM SIGMOD Rec. , vol. 19, no. 2, pp. 322–331, 1990. [67] Q. Zhu, J. Gong, and Y. Zhang, “An efficient 3D R-tree spatial index method for virtual geographic environments,” ISPRS J. Photogramm. Remote Sens. , vol. 62, no. 3, pp. 217–224, 2007. [68] D. Meagher, “Geometric modeling using octree encoding,” Bibliography 254

Comput. Graph. Image Process. , vol. 19, no. 2, pp. 129–147, 1982. [69] B. Schön, A. S. M. Mosa, D. F. Laefer, and M. Bertolotto, “Octree-based Indexing for 3D Pointclouds Within an Oracle Spatial DBMS,” Comput. Geosci. , vol. 51, pp. 430–438, 2013. [70] R. Knippers, “Coordinate Systems,” 2009. [Online]. Available: https://kartoweb.itc.nl/geometrics/Coordinate systems/coordsys.html. [Accessed: 20-Apr-2018]. [71] Google LLC, “SensorManager,” 2018. [Online]. Available: https://developer.android.com/reference/android/hardware/Se nsorManager#getOrientation(float[], float[]). [Accessed: 17-Jul- 2018]. [72] Google LLC, “Best Practices for Accessing and Using Sensors,” 2018. [Online]. Available: https://developer.android.com/guide/topics/sensors/sensors_o verview.html#sensors-practices. [Accessed: 20-Apr-2018]. [73] Google Inc, “Sensor types,” Android Open Source Project , 2016. [Online]. Available: https://source.android.com/devices/sensors/sensor-types.html. [Accessed: 20-Apr-2018]. [74] R. Mautz, “Indoor Positioning Technologies,” ETH Zurich, 2012. [75] J. Pike, “GPS III Operational Control Segment (OCX),” GlobalSecurity.com , 2009. [Online]. Available: https://www.globalsecurity.org/space/systems/gps_3-ocx.htm. [Accessed: 24-Apr-2018]. [76] U.S. government, “Space Segment,” 2018. [Online]. Available: https://www.gps.gov/systems/gps/space/. [Accessed: 24-Apr- 2018]. 255 Bibliography

[77] U.S. Department of Homeland Security, “GPS Frequently Asked Questions (FAQ),” 2018. [Online]. Available: http://www.navcen.uscg.gov/?pageName=gpsFaq. [Accessed: 24-Apr-2018]. [78] Google LLC, “Motion Sensors,” 2018. [Online]. Available: https://developer.android.com/guide/topics/sensors/sensors_ motion.html. [Accessed: 20-Apr-2018]. [79] Y. Cai, Y. Zhao, X. Ding, and J. Fennelly, “Magnetometer basics for mobile phone applications,” Electron. Prod. (Garden City, New York) , vol. 54, no. 2, 2012. [80] R. Want, A. Hopper, V. Falcão, and J. Gibbons, “The active badge location system,” ACM Trans. Inf. Syst. , vol. 10, no. 1, pp. 91–102, 1992. [81] P. Bahl and V. N. Padmanabhan, “RADAR: An in-building RF based user location and tracking system,” Proc. IEEE INFOCOM 2000. Conf. Comput. Commun. Ninet. Annu. Jt. Conf. IEEE Comput. Commun. Soc. , vol. 2, pp. 775–784, 2000. [82] A. Ward, A. Jones, and A. Hopper, “A new location technique for the active office,” IEEE Pers. Commun. , vol. 4, no. 5, pp. 42–47, 1997. [83] C. Ziegler, “Entwicklung und Erprobung eines Positionierungssystems für den lokalen Anwendungsbereich,” Beck, München, 1996. [84] N. B. Priyantha, A. Chakraborty, and H. Balakrishnan, “The Cricket Location-Support System,” Proc. 6th Annu. Int. Conf. Mob. Comput. Netw. , pp. 32–43, 2000. [85] J. Blankenbach, Z. Kasmi, A. Norrdine, and H. Schlemmer, Indoor-Positionierung auf Basis von UWB – Ein Lokalisierungsprototyp zur Baufortschrittsdokumentation , Heft Bibliography 256

8-9/08. AVN, 2008. [86] J. Blankenbach, Präzise Positions- und Orientierungsbestimmung mit UWB . Wichmann Verlag, 2010. [87] J. Blankenbach, H. Sternberg, and S. Tilch, Indoor- Positionierung . Berlin, Heidelberg: Springer Berlin Heidelberg, 2015. [88] J. Blankenbach and A. Norrdine, Magnetic Indoor Local Positioning System . Kamini (eds.): Indoor Wayfinding and Navigation, CRC Press, 2015. [89] J. Blankenbach and A. Norrdine, “Position estimation using artificial generated magnetic fields,” in 2010 International Conference on Indoor Positioning and Indoor Navigation, IPIN 2010 - Conference Proceedings , 2010, pp. 1–5. [90] Camel, “IMU Maths – How To Calculate Orientation,” 2016. [Online]. Available: http://www.camelsoftware.com/2016/02/20/imu-maths/]. [Accessed: 24-Apr-2018]. [91] M. Pedley, “Tilt Sensing Using a Three-Axis Accelerometer,” Free. Semicond. Appl. notes , pp. 1–22, 2013. [92] T. Ozyagcilar, “Implementing a tilt-compensated eCompass using accelerometer and magnetometer sensors,” Free. Semicond. AN , pp. 1–21, 2012. [93] Geokov.com, “Magnetic Declination - Magnetic Inclination (Dip),” 2014. [Online]. Available: http://geokov.com/education/magnetic-declination- inclination.aspx. [Accessed: 11-Apr-2018]. [94] Magnetic-Declination, “Find the magnetic declination at your location.,” 2018. [Online]. Available: http://www.magnetic- declination.com/. [Accessed: 11-Apr-2018]. 257 Bibliography

[95] M. Xie and D. Pan, “Accelerometer Gesture Recognition,” 2014. [96] Google LLC, “SensorEvent,” 2018. [Online]. Available: https://developer.android.com/reference/android/hardware/Se nsorEvent.html#values. [Accessed: 24-Apr-2018]. [97] R. Szeliski, Computer Vision: Algorithms and Application . Springer, 2011. [98] “Morphological Image Processing.” [Online]. Available: https://www.cs.auckland.ac.nz/courses/compsci773s1c/lectures /ImageProcessing-html/topic4.htm. [Accessed: 11-Apr-2018]. [99] M. Nixon and A. Aguado, Feature Extraction and Image Processing , 1st ed. Academic Press, 2002. [100] I. Sobel and G. Feldman, “A 3x3 isotropic gradient operator for image processing.,” a talk Stanford Artif. Proj. , pp. 271–272, 1968. [101] J. Canny, “A Computational Approach to Edge Detection,” IEEE Trans. Pattern Anal. Mach. Intell. , no. 6, pp. 679–698, 1986. [102] C. Harris and M. Stephens, “A Combined Corner and Edge Detector,” in In Proc. of Fourth Alvey Vision Conference , 1988. [103] J. Shi and C. Tomasi, “Good features to track,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition , 1994, pp. 593–600. [104] X. Chen He and N. H. C. Yung, “Corner detector based on global and local curvature properties,” Opt. Eng. , vol. 47, no. 5, 2008. [105] J. A. Grunert, “Das Pothenotische Problem in erweiterter Gestalt über seine Anwendungen in der Geodäsie,” in Archiv der Mathematik und Physik, Band 1 , Greifswald: Verlag von Bibliography 258

C.A. Koch, 1841, pp. 238–248. [106] B. M. Haralick, C. N. Lee, K. Ottenberg, and M. Nölle, “Review and analysis of solutions of the three point perspective pose estimation problem,” Int. J. Comput. Vis. , vol. 13, no. 3, pp. 331–356, 1994. [107] O. Faugeras, Three-dimensional computer vision: a geometric viewpoint . MIT Press, 1993. [108] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision , 2nd ed. Cambridge University Press, 2004. [109] E. L. Merritt, “Explicit Three-Point Resection In Space,” Photogramm. Eng. , pp. 649–665, 1949. [110] S. Linnainmaa, D. Harwood, and L. S. Davis, “Pose Determination of a Three-Dimensional Object Using Triangle Pairs,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 10, no. 5, pp. 634–647, Sep. 1988. [111] D. F. DeMenthon and L. S. Davis, “Model-based object pose in 25 lines of code,” in European conference on computer vision , 1992, pp. 335–343. [112] X. S. Gao, X. R. Hou, J. Tang, and H. F. Cheng, “Complete solution classification for the perspective-three-point problem,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 25, no. 8, pp. 930–943, 2003. [113] L. Kneip, D. Scaramuzza, and R. Siegwart, “A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. , pp. 2969–2976, 2011. [114] L. Quan and Z. Lan, “Linear N-point camera pose determination,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 259 Bibliography

21, no. 8, pp. 774–780, 1999. [115] R. Horaud, “New Methods for Matching 3-D Objects with Single Perspective Views,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. PAMI-9, no. 3, pp. 401–412, 1987. [116] B. Triggs, “Camera pose and calibration from 4 or 5 known 3D points,” in 7th International Conference on Computer Vision (ICCV’99) , 1999, vol. 1, pp. 278–284. [117] A. Ansar and K. Daniilidis, “Linear pose estimation from points or lines,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 25, no. 5, pp. 578–589, 2003. [118] G. Schweighofer and A. Pinz, “Robust pose estimation from a planar target,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 28, no. 12, pp. 2024–2030, 2006. [119] G. Schweighofer and A. Pinz, “Globally Optimal O(n) Solution to the PnP Problem for General Camera Models,” in Procedings of the British Machine Vision Conference 2008 , 2008, pp. 1–10. [120] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” Int. J. Comput. Vis. , vol. 81, no. 2, pp. 155–166, Jul. 2009. [121] J. A. Hesch and S. I. Roumeliotis, “A Direct Least-Squares (DLS) method for PnP,” in Proceedings of the IEEE International Conference on Computer Vision , 2011, pp. 383– 390. [122] S. Li, C. Xu, and M. Xie, “A Robust O(n) Solution to the Perspective-n-Point Problem,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 34, no. 7, pp. 1444–1450, 2012. [123] V. Garro, F. Crosilla, and A. Fusiello, “Solving the PnP problem with anisotropic orthogonal procrustes analysis,” in Bibliography 260

Second International Conference on 3D Imaging, Modeling, Processing, Visualization Transmission , 2012, pp. 262–269. [124] A. Penate-Sanchez, J. Andrade-Cetto, and F. Moreno-Noguer, “Exhaustive linearization for robust camera pose and focal length estimation,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 35, no. 10, pp. 2387–2400, 2013. [125] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 22, no. 11, pp. 1330–1334, 2000. [126] C. F. Gauß, “Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium,” Hamburgi: Sumtibus Frid. Perthes et I. H. Besser . 1809. [127] K. Levenberg, “A method for the solution of certain non-linear problems in least squares,” Q. Appl. Math. , vol. 2, no. 2, pp. 164–168, 1944. [128] D. W. Marquardt, “An Algorithm for Least-Squares Estimation of Nonlinear Parameters,” J. Soc. Ind. Appl. Math. , vol. 11, no. 2, pp. 431–441, 1963. [129] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Commun. ACM , vol. 24, no. 6, pp. 381–395, 1981. [130] D. Wagner and D. Schmalstieg, “ARToolKitPlus for Pose Tracking on Mobile Devices,” Comput. Vis. Winter Work. Graz Tech. Univ. , 2007. [131] M. Fiala, “ARTag, a fiducial marker system using digital techniques,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition , 2005, vol. 2, pp. 590–596. 261 Bibliography

[132] L. Naimark and E. Foxlin, “Circular data matrix fiducial system and robust image processing for a wearable vision- inertial self-tracker,” in Proceedings - International Symposium on Mixed and Augmented Reality, ISMAR 2002 , 2002, pp. 27– 36. [133] OpenCV, “Detection of ArUco Markers,” 3.2.0 , 2015. [Online]. Available: https://docs.opencv.org/3.2.0/d5/dae/tutorial_aruco_detectio n.html. [Accessed: 20-Apr-2018]. [134] T. Lindeberg, “Scale selection properties of generalized scale- space interest point detectors,” J. Math. Imaging Vis. , vol. 46, no. 2, pp. 177–210, 2013. [135] H. Wuest, F. Vial, and D. Stricker, “Adaptive line tracking with multiple hypotheses for augmented reality,” in Proceedings - Fourth IEEE and ACM International Symposium on Symposium on Mixed and Augmented Reality, ISMAR 2005 , 2005, pp. 62–69. [136] I. Skrypnyk and D. G. Lowe, “Scene modelling, recognition and tracking with invariant image features,” in Proceedings of the Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2004) , 2004, pp. 110–119. [137] J. P. Lima, F. Simões, L. Figueiredo, and J. Kelner, “Model Based Markerless 3D Tracking Applied to Augmented Reality,” J. 3D Interact. Syst. , vol. 1, pp. 2–15, 2010. [138] C. Choi and H. I. Christensen, “3D Textureless Object Detection and Tracking an Edge Based Approach,” IEEE/RSJ Int. Conf. Intell. Robot. Syst. , pp. 3877–3884, 2012. [139] A. Petit, E. Marchand, and K. Kanani, “A robust model-based tracker combining geometrical and color edge information,” in Bibliography 262

IEEE International Conference on Intelligent Robots and Systems , 2013, pp. 3719–3724. [140] L. Vacchetti, V. Lepetit, and P. Fua, “Combining edge and texture information for real-time accurate 3D camera tracking,” in Proceedings of the Third IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2004) , 2004, pp. 48–56. [141] H. Wuest and D. Stricker, “Tracking of industrial objects by using CAD models,” J. Virtual Real. Broadcast. , vol. 4, no. 1, 2007. [142] C. Teulière, E. Marchand, and L. Eck, “Using multiple hypothesis in model-based tracking,” in Proceedings - IEEE International Conference on Robotics and Automation, 2010, pp. 4559–4565. [143] J. P. Lima, V. Teichrieb, J. Kelner, and R. W. Lindeman, “Standalone edge-based markerless tracking of fully 3- dimensional objects for handheld augmented reality,” in Proceedings of the 16th ACM Symposium on Virtual Reality Software and Technology - VRST ’09 , 2009, pp. 139–142. [144] A. Petit, E. Marchand, and K. Kanani, “Tracking complex targets for space rendezvous and debris removal applications,” in IEEE International Conference on Intelligent Robots and Systems , 2012, pp. 4483–4488. [145] B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2 , 1981, pp. 674–679. [146] F. Jurie and M. Dhome, “Hyperplane Approximation for Template Matching,” IEEE Trans. Pattern Anal. Mach. Intell. , 263 Bibliography

vol. 24, no. 7, pp. 996–1000, 2002. [147] M. La Cascia and S. Sclaroff, “Fast, Reliable Head Tracking under Varying Illumination,” in IEEE International Conference on Computer Vision and Pattern Recognition , 1999, pp. 604– 610. [148] M. Lima, P. Kurka, Y. Silva, and V. Lucena, “Indoor visual localization of a wheelchair using Shi-Tomasi and KLT,” in Canadian Conference on Electrical and Computer Engineering , 2017, pp. 1–4. [149] L. Vacchetti, V. Lepetit, and P. Fua, “Stable real-time 3D tracking using online and offline information,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 26, no. 10, pp. 1385–1391, 2004. [150] C. Wiedemann, M. Ulrich, and C. Steger, “Recognition and Tracking of 3D Objects,” in Proceedings of the 30th DAGM Symposium on Pattern Recognition , 2008, pp. 132–141. [151] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” Int. J. Comput. Vis. , vol. 60, no. 2, pp. 91–110, 2004. [152] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, “Speeded-Up Robust Features (SURF),” Comput. Vis. Image Underst. , vol. 110, no. 3, pp. 346–359, 2008. [153] NetMarketShare, “Operating system market share,” WebArticle , 2018. [Online]. Available: http://www.netmarketshare.com/operating-system-market- share.aspx?qprid=10&qpcustomd=0. [Accessed: 26-Apr-2018]. [154] ePHOTOzine, “Complete Guide To Image Sensor Pixel Size,” 2016. [Online]. Available: https://www.ephotozine.com/article/complete-guide-to-image- Bibliography 264

sensor-pixel-size-29652. [Accessed: 11-Apr-2018]. [155] T. Schiesser, “Know Your Smartphone: A Guide to Camera Hardware,” 2014. [Online]. Available: https://www.techspot.com/guides/850-smartphone-camera- hardware/page2.html. [Accessed: 11-Apr-2018]. [156] Business Location Center, “Berlin 3D - Download Portal,” 2016. [Online]. Available: http://www.businesslocationcenter.de/en/downloadportal. [Accessed: 11-Apr-2018]. [157] Landesbetrieb Geoinformation und Vermessung, “3D- Stadtmodell LoD2-DE Hamburg,” 2017. [Online]. Available: http://suche.transparenz.hamburg.de/dataset/3d-stadtmodell- lod2-de-hamburg2. [Accessed: 11-Apr-2018]. [158] Technische Universität München, “3D City Model of New York City,” 2016. [Online]. Available: http://www.gis.bgu.tum.de/projekte/new-york-city-3d/. [Accessed: 11-Apr-2018]. [159] NYC, “NYC 3-D Building Model,” 2017. [Online]. Available: http://www1.nyc.gov/site/doitt/initiatives/3d-building.page. [Accessed: 11-Apr-2018]. [160] Bezirksregierung Köln, “3D-Gebäudemodelle,” 2018. [Online]. Available: http://www.bezreg- koeln.nrw.de/brk_internet/geobasis/hoehenmodelle/3d_gebae udemodelle/index.html. [Accessed: 11-Apr-2018]. [161] Beauftragter der Landesregierung Nordrhein-Westfalen für Informationstechnik (CIO), “open nrw,” 2018. [Online]. Available: https://open.nrw/. [Accessed: 23-Apr-2018]. [162] G. Gesquière and A. Manin, “3D Visualization of Urban Data Based on CityGML with WebGL,” Int. J. 3-D Inf. Model. , vol. 265 Bibliography

1, no. 3, pp. 1–15, 2012. [163] J. Gaillard, A. Vienne, R. Baume, F. Pedrinis, A. Peytavie, and G. Gesquière, “Urban Data Visualisation in a Web Browser,” in Web3D ’15 Proceedings of the 20th International Conference on 3D Web Technology , 2015, no. 1, pp. 81–88. [164] I. Prieto, J. L. Izkara, and F. J. Delgado Del Hoyo, “Efficient visualization of the geometric information of CityGML: Application for the documentation of built heritage,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) , 2012, vol. 7333 LNCS, pp. 529–544. [165] L. Giovannini, S. Pezzi, U. di Staso, F. Prandi, and R. de Amicis, “Large-Scale Assessment and Visualization of the Energy Performance of Buildings with Ecomaps - Project SUNSHINE: Smart Urban Services for Higher Energy Efficiency,” in DATA 2014 - Proceedings of 3rd International Conference on Data Management Technologies and Applications , 2014, pp. 170–177. [166] B. Simões, F. Prandi, and R. De Amicis, “I-Scope: A CityGML Framework for Mobile Devices,” in Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems , 2015, pp. 52–55. [167] K. Chaturvedi, “Web based 3D analysis and visualization using HTML5 and WebGL Web based 3D analysis and visualization using,” University of Twente Faculty of Geo-Information and Earth Observation (ITC), 2014. [168] A. Schilling, J. Bolling, and C. Nagel, “Using glTF for Streaming CityGML 3D City Models,” in Proceedings of the 21st International Conference on Web3D Technology - Web3D Bibliography 266

’16 , 2016, pp. 109–116. [169] The Khronos Group, “Khronos Finalizes glTF 1.0 Specification,” 2015. [Online]. Available: https://www.khronos.org/news/press/khronos-finalizes-gltf-1.0- specification. [Accessed: 27-Apr-2018]. [170] P. Cozzi, “Introducing 3D Tiles,” 2015. [Online]. Available: https://cesium.com/blog/2015/08/10/introducing-3d-tiles/. [Accessed: 19-Apr-2018]. [171] Cesium Team, “3D Tiles,” 2018. [Online]. Available: https://github.com/AnalyticalGraphicsInc/3d-tiles. [Accessed: 27-Apr-2018]. [172] T. H. Kolbe and C. Nagel, “3D City Database for CityGML.” 2011. [173] N. Leenheer, “HTML5TEST,” 2018. [Online]. Available: https://html5test.com/index.html. [Accessed: 27-Apr-2018]. [174] Unity Technologies, “WebGL Browser Compatibility,” 2018. [Online]. Available: https://docs.unity3d.com/Manual/webgl- browsercompatibility.html. [Accessed: 27-Apr-2018]. [175] The SQLite Consortium, “SQLite,” 2018. [Online]. Available: http://sqlite.org/index.html. [Accessed: 11-Apr-2018]. [176] The PostGIS Development Group, “PostGIS Special Functions Index,” 2018. [Online]. Available: http://postgis.net/docs/manual- dev/PostGIS_Special_Functions_Index.html#PostGIS_3D_ Functions. [Accessed: 11-Apr-2018]. [177] P. Read and M.-P. Meyer, Restoration of Motion Picture Film . Butterworth-Heinemann, 2000. [178] T. Willemsen, “Fusionsalgorithmus zur autonomen Positionsschätzung im Gebäude, basierend auf MEMS- 267 Bibliography

Inertialsensoren im Smartphone,” HafenCity Universität Hamburg, 2016. [179] G. Schall et al. , “Global pose estimation using multi-sensor fusion for outdoor augmented reality,” in 8th IEEE International Symposium on Mixed and Augmented Reality , 2009, pp. 153–162. [180] V. Lepetit and P. Fua, “Monocular Model-Based 3D Tracking of Rigid Objects: A Survey,” Found. Trends® Comput. Graph. Vis. , vol. 1, no. 1, pp. 1–89, 2005. [181] M. Pupilli and A. Calway, “Real-time camera tracking using known 3D models and a particle filter,” in Proceedings - International Conference on Pattern Recognition , 2006, vol. 1, pp. 199–203. [182] G. Bleser, Y. Pastarmov, and D. Stricker, “Real-time 3D camera tracking for industrial augmented reality applications,” 13th Int. Conf. Cent. Eur. Comput. Graph. Vis. Comput. Vis. 2005, WSCG’2005 , pp. 47–54, 2005. [183] C. Choi and H. I. Christensen, “Robust 3D visual tracking using particle filtering on the SE(3) group,” in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 4384–4390. [184] K. Xu, K. W. Chia, and A. D. Cheok, “Real-time Camera Tracking for Marker-less and Unprepared Augmented Reality Environments,” Image Vis. Comput. , vol. 26, no. 5, pp. 673– 689, 2008. [185] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, “Real-time detection and tracking for augmented reality on mobile phones,” IEEE Trans. Vis. Comput. Graph. , vol. 16, no. 3, pp. 355–368, 2010. Bibliography 268

[186] H. Uchiyama and E. Marchand, “Object Detection and Pose Tracking for Augmented Reality: Recent Approaches,” 18th Korea-Japan Jt. Work. Front. Comput. Vis. , 2012. [187] G. Schall, H. Grabner, M. Grabner, P. Wohlhart, D. Schmalstieg, and H. Bischof, “3D tracking in unknown environments using on-line keypoint learning for mobile augmented reality,” in 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops , 2008, pp. 1–8. [188] J. Kwon and F. C. Park, “Visual Tracking via Particle Filtering on the Affine Group,” Int. J. Rob. Res. , vol. 29, no. 2–3, pp. 198–217, 2010. [189] K. Dorfmüller, “Robust tracking for augmented reality using retroreflective markers,” Comput. Graph. , vol. 23, no. 6, pp. 795–800, 1999. [190] G. Klein and D. W. Murray, “Full-3D Edge Tracking with a Particle Filter,” in Proceedings of the British Machine Vision Conference 2006 , 2006, pp. 1119–1128. [191] A. I. Comport, E. Marchand, M. Pressigout, and F. Chaumette, “Real-time markerless tracking for augmented reality: The virtual visual servoing framework,” IEEE Trans. Vis. Comput. Graph. , vol. 12, no. 4, pp. 615–628, 2006. [192] H. Bae, M. Golparvar-Fard, and J. White, “High-precision vision-based mobile augmented reality system for context- aware architectural, engineering, construction and facility management (AEC/FM) applications,” Vis. Eng. , vol. 1, no. 1, p. 3, 2013. [193] Y. Zheng, S. Sugimoto, and M. Okutomi, “ASPnP: An accurate and scalable solution to the perspective-n-point problem,” 269 Bibliography

IEICE Trans. Inf. Syst. , vol. E96.D, no. 7, pp. 1525–1535, 2013. [194] L. Ferraz, X. Binefa, and F. Moreno-Noguer, “Very fast solution to the PnP problem with algebraic outlier rejection,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition , 2014, pp. 501–508. [195] V. Händler, Konzeption eines bildbasierten Sensorsystems zur 3D- Indoorpositionierung sowie Analyse möglicher Anwendungen . Fachrichtung Geodäsie Fachbereich Bauingenieurwesen und Geodäsie Technische Universität Darmstadt, 2012. [196] Hough V and Paul C., “Method and means for recognizing complex patterns,” 3069654, 1962. [197] X. Yang and Y. Tian, “Robust door detection in unfamiliar environments by combining edge and corner features,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, CVPRW 2010 , 2010, pp. 57–64. [198] A. Furieri, “Internal BLOB-Geometry format,” 2018. [Online]. Available: http://www.gaia-gis.it/gaia-sins/BLOB- Geometry.html. [Accessed: 11-Apr-2018]. [199] C. Thorne, “Using a floating origin to improve fidelity and performance of large, distributed virtual worlds,” in Proceedings of the 2005 International Conference on Cyberworlds , 2005, pp. 263–270. [200] D. Schmalstieg and T. Höllerer, Augmented Reality - Principles and Practice . Addison-Wesley Professional, 2016. [201] A. Gerdelan, “Mouse Picking with Ray Casting,” 2016. [Online]. Available: http://antongerdelan.net/opengl/raycasting.html. Bibliography 270

[Accessed: 02-May-2018]. [202] J. Schaback, “OpenGL Picking in 3D,” Schabby’s Blog , 2013. [Online]. Available: http://schabby.de/picking-opengl-ray- tracing/. [Accessed: 02-May-2018]. [203] S. Woop, J. Schmittler, and P. Slusallek, “RPU: A Programmable Ray Processing Unit for Realtime Ray Tracing,” in ACM SIGGRAPH 2005 Papers , 2005, pp. 434–444. [204] Google LLC, “Location Strategies,” 2018. [Online]. Available: https://developer.android.com/guide/topics/location/strategies .html. [Accessed: 11-Apr-2018]. [205] C. R. Ehrlich, J. Blankenbach, and A. Sieprath, “Towards a robust smartphone-based 2,5D pedestrian localization,” in 2016 International Conference on Indoor Positioning and Indoor Navigation, IPIN 2016 , 2016. [206] L. van Berlo, K. A. Helmholt, and W. Hoekstra, “C2B: Augmented reality on the construction site,” in 9th International Conference on Construction Applications of Virtual Reality , 2009. [207] S. Zollmann, C. Hoppe, S. Kluckner, C. Poglitsch, H. Bischof, and G. Reitmayr, “Augmented reality for construction site monitoring and documentation,” Proc. IEEE , vol. 102, no. 2, pp. 137–154, 2014. [208] A. Kemmann, “Mobile Augmented Reality im Bauwesen – Untersuchung des Potentials von mobilen BIM-basierten AR Systemen in der Bauausführung,” RWTH Aachen University, 2016.