
Classification of Urban Point Clouds Using 3D CNNs in Combination with Reconstruction of Sidewalks

DIPLOMA THESIS

submitted in partial fulfillment of the requirements for the degree of

Diplom-Ingenieurin

in

Visual Computing

by

Lisa Maria Kellner, BSc
Registration Number 01428183

to the Faculty of Informatics at the TU Wien
Advisor: Univ.-Prof. Dipl.-Ing. Dr.techn. Dr.h.c. Werner Purgathofer
Assistance: Dipl.-Ing. Dr. Stefan Maierhofer

Vienna, 23rd February, 2021 Lisa Maria Kellner Werner Purgathofer


Declaration of Authorship

Lisa Maria Kellner, BSc

I hereby declare that I have written this thesis independently, that I have completely specified all sources and aids used, and that I have clearly marked all parts of this work, including tables, maps and figures, which were taken from other works or from the Internet, whether verbatim or in substance, as borrowed material by citing the source.

Vienna, 23rd February 2021 Lisa Maria Kellner


Acknowledgements

It has been a long journey from the beginning of my studies to this thesis and I have received a lot of support from family, friends and colleagues during this time. Therefore, I would like to take this opportunity to thank them.

First, I would like to thank VRVis Zentrum für Virtual Reality und Visualisierung Forschungs-GmbH for making this thesis possible. VRVis is funded by BMK, BMDW, Styria, SFG, Tyrol and the Vienna Business Agency in the scope of COMET - Competence Centers for Excellent Technologies (854174, 879730), which is managed by FFG.

I would also like to thank all my colleagues at VRVis for their professional help, discussions, suggestions and support during my studies and this thesis. Special thanks go to Stefan, Attila, Michi, Harri, Georg, Andi, Thomas, Lui, Rebecca, Janne, Maria and Dani.

I would also like to thank my parents, who supported me throughout my studies, and the rest of my family and friends, who stood by me during this time.

Furthermore, I would like to thank CycloMedia Deutschland, LiDAR PointCloud and Stadtentwässerungsbetriebe Köln, AöR for providing the data from Cologne.


Abstract

LiDAR devices are able to capture the physical world very accurately. Therefore, they are often used for 3D reconstruction. Unfortunately, such data can become extremely large very quickly and usually only a small part of the point cloud is of interest. Thus, the point cloud is filtered beforehand in order to apply algorithms only to those points that are relevant for them. Semantic information about the points can be used for such a filtering. Semantic segmentation of point clouds is a popular field of research and here, too, there has been a trend towards deep learning in recent years. However, contrary to images, point clouds are unstructured. Hence, point clouds are often rasterized, but this has to be done such that the underlying structure is represented well.

In this thesis, a 3D Convolutional Neural Network is developed and trained for a semantic segmentation of LiDAR point clouds. Thereby, a point cloud is represented with an octree data structure, which makes it easy to rasterize only relevant parts, since just dense parts of the point cloud, in which important information about the structure is located, are subdivided further. This allows to simply take nodes of a certain level of the octree and rasterize them as data samples.

There are many application areas for 3D reconstructions based on point clouds. In an urban scenario, these can be, for example, whole city models or buildings. In this thesis, however, the reconstruction of sidewalks is explored, since for flood simulations in cities an increase in height of a few centimeters can make a great difference, and information about the curb geometry helps to make these simulations more accurate. In the sidewalk reconstruction process, the point cloud is filtered first, based on a semantic segmentation of a 3D CNN, and then point cloud features are calculated to detect curb points. With these curb points, the geometry of the curb, sidewalk and street is computed.

Taken all together, this thesis develops a proof-of-concept prototype for semantic point cloud segmentation using 3D CNNs and, based on that, a curb detection and reconstruction algorithm.


Contents

Abstract

1 Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Aim of this Thesis
1.4 Structure of this Thesis

2 Related Work
2.1 Semantic Point Cloud Segmentation
2.2 Curb Detection

3 Semantic Point Cloud Segmentation
3.1 Data Extraction
3.2 Training Dataset
3.3 Network Learning Basics
3.4 Network Architecture
3.5 Point Cloud Segmentation

4 Curb Detection and Fitting
4.1 Point Cloud Prefiltering
4.2 Feature Extraction
4.3 False Positive Filtering
4.4 Curb Fitting

5 Neural Network Training
5.1 Datasets
5.2 Hyperparameter-Tuning
5.3 Network Using Small Grid Size
5.4 Network Using Middle Grid Size
5.5 Network Using Big Grid Size

6 Results
6.1 Semantic Segmentation of the Semantic3d Dataset
6.2 Semantic Segmentation and Curb Fitting of the Cologne Dataset
6.3 Critical Reflection

7 Conclusion and Future Work

A Hyperparameter Tuning

B Network Architectures
B.1 Network Using Small Grid Size
B.2 Network Using Middle Grid Size

List of Figures
List of Tables
Bibliography

CHAPTER 1

Introduction

1.1 Motivation

3D reconstruction is rebuilding the physical world based on captured data.

Reconstructions have a broad field of applications, like the preservation of cultural heritage such as statues by Michelangelo, relicts from Egypt or Buddha statues [GBS14], or serious games about it [MCB+14]. Further use cases are underwater robotic surveys [JRPWM10] or indoor scenes like KinectFusion, which can be used for augmented reality [IKH+11]. Reconstructions of cities are also of great interest, because they can be used for planning, city climate or noise propagation applications [Bre00] as well as emergency management, disaster control and for movies or computer games [MWA+13].

Such simple city models are used in Visdom [Vis], a simulation application for disaster management of floods. A screenshot of this software can be seen in Figure 1.1. It assists decision making in flood emergencies caused by heavy rainfall, tsunamis or other disasters. The impact of short-term protections, like sandbags, or long-term protections can be simulated in advance to support decisions about safety arrangements. Based on this, emergency personnel can be trained accordingly and possible evacuation plans can be explored beforehand. In addition, the sewer system of a city can be added to the simulation to take its capacity and potential bottlenecks into account.

In case of floods, an increase in height of a couple of centimetres or a house driveway with a slope down can make a great difference in safety arrangements. Geometric information about driveways or sidewalks would therefore be of great interest for the Visdom team to make their simulation even more accurate. Hence, this thesis addresses the detection and reconstruction of sidewalk surfaces.


Figure 1.1: Screenshot of the Visdom flood simulation software. (Source: [Vis])

1.2 Problem Statement

Acquiring respective data for a 3D reconstruction is nowadays probably easier than ever before. Various technical instruments are available. Mobile phones are able to take high-resolution photos with meta-data like camera parameters or GPS location. Moreover, photos are not limited to the ground, because drones allow us to capture the world from above too. LiDAR devices scan their surroundings with high precision and yield dense point clouds. This data can then be combined with other information like cadastre data or with data from OpenStreetMap, an open source map with further geospatial tags.

Each type of data has its own advantages and disadvantages. Everyone can take photos with their smartphone, but the quality of photos depends strongly on the lighting conditions and photos are therefore prone to over- as well as underexposure. Airborne photography is not able to capture small structures and has to cope with occlusions from trees; however, it is a good extension for terrestrial laser scanning data to close gaps. Photogrammetric point clouds are not as fine-grained as LiDAR data, but on the other hand, LiDAR data can become so huge that it can be challenging to process it efficiently. Geospatial data does not provide enough information for reconstruction, but is a good enhancement to improve algorithms. Therefore, to solve a problem it is essential to use the type of data as well as the data structure fitting the problem best and to combine different sources of data.

The reconstruction of sidewalks is based on LiDAR data, because laser scanners are able to capture subtle structures like an elevation of ten to fifteen centimetres. Laser scan data of a whole city has up to several billion points and curbs are just a tiny part of them. It is therefore not feasible to search for curb points in the whole point cloud. This would not only take many computational resources, but a satisfying curb detection result would also be hard to achieve, for the simple reason that a detector applied to the whole city is either over-modelled, which would result in many false positives, or under-modelled, which would miss many curb points.

Since sidewalks are located on the ground and beside streets, the majority of points can be filtered beforehand to reduce the search area. This could either be achieved with some kind of ground plane filtering or by using semantic information. With semantic information attached to points, they can be filtered more precisely. For example, a car is on the ground and at or nearby a street too, but during the detection process, those car points would lead to many false positive curb points. Therefore, it would be great to remove car points from the point cloud beforehand too. Another advantage is that this would not only help to limit the search area for this problem, since the semantic information could also be used for further reconstructions, like buildings, or for filtering vegetation and other unwelcome points.

Since sidewalks are located near roads, information about the street network is another valuable source of data. OpenStreetMap [Opea] provides it and much further geospatial information. However, the location of those tags is saved in a different coordinate system than the LiDAR data and has to be converted beforehand. Due to the semantic filtering of the point cloud and the combination with further data sources, the search area is restricted to such an extent that curb points can be detected with basic spatial features and geometry can be generated with simple algorithms.

1.3 Aim of this Thesis

The aim of this thesis is to develop a proof-of-concept prototype for semantic point cloud segmentation and, based on that, a curb detection and 3D reconstruction algorithm.

Convolutional neural networks are probably the most common deep learning method nowadays. They are used in a wide range of fields, such as image or speech recognition, face detection, text or video analysis and self-driving cars. They are outperforming other supervised classification methods by far, because during the convolutional operations not only low-level features are detected, but also high-level features and essential structures are distinguished. The essential processes like convolution or pooling can be easily adapted to three-dimensional space. Therefore, a semantic segmentation of point clouds is achieved by building and training 3D convolutional neural networks. To decrease memory consumption and training time, the point cloud will not just be represented in a uniform grid structure, but with an octree. The cells of the octree in a certain level are then voxelized and used for the segmentation.

For curb detection, just ground points are used, which can easily be selected based on the semantic information. Multiple geometric features are calculated for each point and if a point fulfils all features, it is labelled as "curb". Afterwards, a filtering step removes false positive "curb" points and finally geometry is generated for sidewalks, curbs and the street.


Therefore, the work of this thesis can be divided into two parts. First, for a semantic segmentation of a point cloud, a 3D convolutional neural network will be developed and trained. Secondly, curb points will be calculated with geometric point cloud features and based on them, geometry for the curb, sidewalk and street will be computed.

1.4 Structure of this Thesis

First, an overview of related work in point cloud segmentation as well as curb detection is given in Chapter 2, where Section 2.1 discusses semantic segmentation based on handcrafted features, artificial neural networks and convolutional neural networks. Convolutional neural networks are categorized into 2D image segmentation networks, where the point cloud is projected into images first, and 3D convolutional neural networks, which act directly on the point cloud. Section 2.2 describes the curb detection process in point clouds on which the curb detection steps in this thesis are based. In the course of this process, the point cloud is prefiltered, then spatial features are calculated and based on them a point is classified as curb or not. Afterwards, false positive labelled points are filtered and finally geometry is computed.

Chapter 3 describes semantic point cloud segmentation and thereby addresses data extraction in Section 3.1, the dataset used for training in Section 3.2, some network learning basics in Section 3.3, as well as the final network architecture in Section 3.4. Furthermore, network classification and segmentation of an unknown point cloud is shortly depicted in Section 3.5.

Chapter 4 is dedicated to the different steps of the curb detection and fitting process. Section 4.1 describes prefiltering of a point cloud and Section 4.2 explains and visualizes spatial point cloud features, including the integration of OpenStreetMap data. Section 4.3 addresses false positive filtering and finally, Section 4.4 discusses geometry fitting.

Chapter 5 addresses the network training process and its results in detail, with the composition of the training set in Section 5.1 and the hyperparameter tuning in Section 5.2. Furthermore, three networks with different grid sizes are evaluated: a network using a small grid size in Section 5.3, a network using a middle grid size in Section 5.4 and a network using a big grid size in Section 5.5.

Chapter 6 shows and discusses the results of the network training and semantic segmentation of the training set in Section 6.1 as well as the semantic segmentation and curb fitting results of real-world data in Section 6.2. A critical reflection on the work of this thesis is given in Section 6.3 and finally, Chapter 7 gives a conclusion and a short outlook into future work.

CHAPTER 2

Related Work

This chapter gives an overview of related work in point cloud segmentation and curb detection. Section 2.1 addresses semantic point cloud segmentation based on handcrafted features as well as artificial neural networks. Thereby a focus is given to 2-dimensional convolutional neural networks, which map the point cloud into images, 3-dimensional convolutional neural networks, which process the point cloud directly, as well as optimizations developed specifically for point clouds, like tree-based networks for example. Section 2.2 describes research in road boundary and curb detection for self-driving vehicles. Emphasis is given to the pipeline used in this thesis, where the point cloud is filtered first, then candidate points are detected and filtered for false positives, and finally geometry is fitted to the detected curb points.

2.1 Semantic Point Cloud Segmentation

A point cloud is a set of unstructured points, either calculated by photogrammetry from several photos or captured directly by a scanning device. The basic principle of generating photogrammetric point clouds is computing 2D image features and determining their 3D position by finding the same feature in multiple overlapping photos. The most commonly used feature is the Scale Invariant Feature Transform (SIFT), which is invariant to translation, scaling and rotation of the image [Low04]. Structure-from-Motion calculates the location and orientation of the photos as well as the scene geometry, without any prior knowledge about camera positions, by using bundle adjustment and such image features. 3D points, and thus the scene geometry, are computed from the knowledge acquired in this process about the same feature in several photos and by using triangulation [WBG+12]. Figure 2.1 shows such a photogrammetric point cloud.


Figure 2.1: A photogrammetric point cloud with location and orientation of the photos. (Source: adapted from [SKMW17])

Light Detection and Ranging (LiDAR) devices, colloquially called laser scanners, are able to capture the physical world by shooting out laser beams. When a laser beam hits an object, it is reflected and returns to the scanning device. The elapsed time until the return of the beam is measured and based on that, the distance from the object to the laser scanner is obtained and therefore the 3D position of the hit point as well. Point clouds captured with LiDAR devices are invariant to illumination, more accurate and denser than photogrammetric point clouds. Such devices are able to capture huge areas, but the resulting point clouds can become very large very quickly.

Semantic point cloud segmentation is the assignment of discrete class labels to each point in the point cloud individually. In this thesis the main focus is on supervised learning techniques, where a classifier learns the correlation of a set X := {x_1, x_2, ..., x_n} of n data samples to the corresponding set Y := {y_1, y_2, ..., y_n} of n ground truth labels. A trained classifier takes a data sample x_i as input and predicts a class label y_i, where i is the index of the data sample and its according ground truth value in the sets X and Y respectively. As mentioned above, the set X consists of n data samples, where each data sample is an m-dimensional feature vector by itself. Therefore, X is n × m-dimensional. The set Y encodes the affiliation of data samples to c classes and is commonly represented by a one-hot encoding of the categorical class values. Thus, Y is n × c-dimensional.

One-hot encoding is the mapping of categorical values to binary feature vectors. An example is given in Table 2.1, where the three categories "building", "vegetation" and "terrain" are each encoded into a three-dimensional binary feature vector, where every category is assigned to a fixed position in the vector. The values in a feature vector are zero everywhere, except for the position of the represented class, which has value one. The according feature vectors of our example are [1, 0, 0], [0, 1, 0] and [0, 0, 1] respectively. This method is not restricted to class labels; it is used to encode categorical values in data samples as well. A numerical encoding for categories, like 1, 2 and 3 in our example, would cause a machine learning algorithm to rate "terrain" higher than "building" because 3 > 1; one-hot encoding solves this problem.

categorical name    numerical encoding    one-hot encoding
building            1                     [1, 0, 0]
vegetation          2                     [0, 1, 0]
terrain             3                     [0, 0, 1]

Table 2.1: An example of mapping categories to numerical values and an according one-hot encoding of them.
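To make the encoding concrete, the mapping from Table 2.1 can be computed as in the following minimal numpy sketch; this snippet is not part of the original work, and the function name and class indices are only illustrative:

    import numpy as np

    def one_hot(labels, num_classes):
        # map integer class indices to binary one-hot feature vectors
        encoded = np.zeros((len(labels), num_classes), dtype=np.int64)
        encoded[np.arange(len(labels)), labels] = 1
        return encoded

    # "building" = 0, "vegetation" = 1, "terrain" = 2
    print(one_hot([0, 1, 2], 3))
    # [[1 0 0]
    #  [0 1 0]
    #  [0 0 1]]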

2.1.1 Handcrafted Features

Handcrafted features are calculated for each single point individually. They usually consider information about a point's neighbourhood, which can be defined by the number of the nearest k points or by a certain radius. Commonly used features describing the geometric shape of a point's neighbourhood are based on the eigenvalues λ1 ≥ λ2 ≥ λ3 of the covariance matrix. The covariance matrix Σ of a point p and its neighbourhood consisting of k points is defined by:

\Sigma = \frac{1}{k} \sum_{i=1}^{k} (p_i - \mu)(p_i - \mu)^T, \qquad \mu = \frac{1}{k} \sum_{i=1}^{k} p_i    (2.1)

Such features, amongst many others, are linearity, planarity, sphericity, omnivariance, sum of eigenvalues, curvature and eigenentropy, which are listed in Table 2.2 [WJHM15] [HWS16] [TGDM18].

Weinmann et al. [WUH+15] optimize their point cloud features by choosing the best neighbourhood size for a single point. This makes features better distinguishable and therefore improves the classification. They calculate the optimal amount of k nearest neighbours for a point p individually by building the covariance matrix of the neighbourhood around p with an altering k. The best k is chosen based on the minimum Shannon entropy [Sha48] of the normalized eigenvalues or features. A downside of this approach is the increased runtime, since the optimal neighbourhood is computed for each point individually. Hackel et al. [HWS16] approximate the neighbourhood by iteratively downsampling the point cloud to generate a multi-scale pyramid. The features are then computed per scale with a fixed k. This method computes features very fast and concurrently achieves high classification results. Thomas et al. [TGDM18] use a spherical neighbourhood based on a radius instead of a fixed number of neighbours, because features with such a neighbourhood have a consistent geometrical meaning and the number of points in the neighbourhood becomes a feature by itself. Besides, classifiers with those features perform better than with features computed with a fixed amount of points.

Linearity             (λ1 − λ2) / λ1
Planarity             (λ2 − λ3) / λ1
Sphericity            λ3 / λ1
Omnivariance          (λ1 λ2 λ3)^(1/3)
Sum of eigenvalues    λ1 + λ2 + λ3
Curvature             λ3 / (λ1 + λ2 + λ3)
Eigenentropy          − Σ_{i=1..3} λi ln λi

Table 2.2: Equations of some point cloud features.
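To illustrate Equation 2.1 and Table 2.2, the following minimal numpy sketch computes these features for a single neighbourhood; it is not code from the cited works and assumes a non-degenerate neighbourhood (all eigenvalues greater than zero):

    import numpy as np

    def eigen_features(neighbourhood):
        # neighbourhood: (k, 3) array of the k points around a query point
        mu = neighbourhood.mean(axis=0)
        centered = neighbourhood - mu
        cov = centered.T @ centered / len(neighbourhood)  # Equation 2.1
        l3, l2, l1 = np.sort(np.linalg.eigvalsh(cov))     # l1 >= l2 >= l3
        e = np.array([l1, l2, l3]) / (l1 + l2 + l3)       # normalized eigenvalues
        return {
            "linearity":    (l1 - l2) / l1,
            "planarity":    (l2 - l3) / l1,
            "sphericity":   l3 / l1,
            "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),
            "sum":          l1 + l2 + l3,
            "curvature":    l3 / (l1 + l2 + l3),
            "eigenentropy": -np.sum(e * np.log(e)),
        }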

2.1.2 Artificial Neural Networks

Besides descriptive features, a classifier is needed to map those features to class labels. Such classifiers could be decision trees [BFOS84], random forests [Bre01], support vector machines (SVM) [CV95] or artificial neural networks (ANN), for example. Artificial neural networks, or short neural networks (NN), are inspired by the mechanics of mammals' brains, where signals are transported between neurons.

The basis for neural networks was Frank Rosenblatt's perceptron algorithm [Ros58], which could linearly separate binary data. It calculates a decision boundary between two classes, which is an (n − 1)-dimensional hyperplane. A weight vector w together with a bias b represents this hyperplane, where w is normal to the plane and therefore defines its orientation and b determines its position. The class affiliation of a new data sample x is defined by y(x) = w^T x + b, where y(x) > 0 means that x lies in front of the separation plane and y(x) < 0 states that x is located behind it. Based on that, the classifier assigns x to one of the two classes [Bis06].

Feedforward neural networks, also known as multilayer perceptrons (MLPs), are directed acyclic graphs, where each edge is pointing forward. They are able to approximate a mapping function f* from an input x to its according class label y. This approximated mapping function is defined as y = f(x; θ), where θ is searched during the training, such that f becomes as similar to f* as possible. This is achieved by composing multiple functions, where the layerwise graph structure describes how those functions are assembled. For example, if there is a network with three layers and some functions representing each layer, say f1 for the first, f2 for the second and f3 for the third layer, the assembled functions would look like f(x) = f3(f2(f1(x))). If we want to model a linear function, θ would consist of w and b, resulting in f(x; w, b) = x^T w + b. A loss function, like the mean squared error (MSE) shown in Equation 2.2, which is adapted from [GBC16], is then minimized with respect to w and b [GBC16].

J(\theta) = \frac{1}{|X|} \sum_{x \in X} \left( f^*(x) - f(x; \theta) \right)^2    (2.2)

However, common problems are not linear. In order to be able to learn nonlinear models, an affine transformation and a nonlinear function, named activation function, are used. Consider a network with a single hidden layer consisting of h = f1(x; W, c), where W are weights and c are biases, and y = f2(h; w, b), resulting in f(x; W, c, w, b) = f2(f1(x)). By applying an activation function g to the affine transformation in h, this results in h = g(W^T x + c), and the final network is defined as f(x; W, c, w, b) = w^T g(W^T x + c) + b [GBC16]. This example is visualized in Figure 2.2.

Figure 2.2: An example of a feedforward neural network with a single hidden layer mapping an input x to an output y with an intermediate layer h, where W maps x to h and w maps h to y. (Source: adapted from [GBC16])
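As a small illustration, the single-hidden-layer network f(x) = w^T g(W^T x + c) + b can be written down directly; this sketch is not from [GBC16], uses ReLU as the activation g and random weights only as placeholders for trained parameters:

    import numpy as np

    def g(z):
        # ReLU activation
        return np.maximum(0.0, z)

    def forward(x, W, c, w, b):
        # f(x; W, c, w, b) = w^T g(W^T x + c) + b
        h = g(W.T @ x + c)  # hidden layer
        return w @ h + b    # output layer

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)        # input sample with 3 features
    W = rng.normal(size=(3, 4))   # maps the input to 4 hidden units
    c = np.zeros(4)
    w = rng.normal(size=4)        # maps the hidden units to a scalar output
    b = 0.0
    print(forward(x, W, c, w, b))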

As mentioned above, problems are commonly not linear and by using an activation function, neural networks are able to learn such a nonlinear mapping. There are many different activation functions; the most common ones are the binary step, linear, sigmoid, tanh, ReLU and softmax functions, which are shown in Table 2.3. There are so many different ones because each activation function has its advantages and disadvantages. Besides that, the problem a network should solve or the layer in which the activation function is used influences the choice of function too. For example, a classification problem performs better with sigmoid functions, but on the other hand, sigmoid as well as tanh functions tend to cause gradients to become zero, which we want to avoid. The binary step function can be used for binary classifiers, but its gradient is zero. Linear activation functions do not have zero gradients, but the network will not perform well on hard problems. ReLU is widely used and performs better in most cases than other functions, but it can cause gradients to become zero too. In addition to that, ReLU is only used in hidden layers, where sigmoid and tanh are not used. Furthermore, the sigmoid function is used for binary classification problems only; for multi-class problems, the softmax activation function is used instead [SSA20].

The basic graph structure of a feedforward neural network consists of an input layer, one or multiple hidden layers and one output layer. Figure 2.3 visualizes the simplest possible network with three layers: an input layer depicted with green nodes, one hidden layer in blue and the output layer in red.

Figure 2.3: A visualization of a feedforward neural network, also known as multilayer perceptron (MLP), with input layer x1 ... xn, one hidden layer and output layer y1 ... ym.

Pioneering work in 3D point cloud classification and segmentation using feedforward neural networks was accomplished by Qi et al. [QSMG17] with the PointNet architecture. They describe in their work that PointNet processes the whole point cloud directly at once and returns either a classification for the whole point cloud or a pointwise segmentation result. Its architecture consists of two transformation networks, which are subnetworks predicting and applying an affine transformation matrix to make the network invariant to transformations of the point cloud, each followed by an MLP, ensued by a max pooling layer (see Figure 2.5) to extract a global feature vector. For point cloud classification, this global feature is directly processed by an MLP. For segmentation, all point features are combined with the global feature by concatenation and new point features are extracted from this combination first. With this method, these new features encode local as well as global information. With their PointNet architecture they achieve equal or even better classification results than the state of the art at that time.
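The key ingredients, a shared per-point transformation, a symmetric max pooling and the concatenation of local and global features, can be sketched in a few lines; this toy numpy example is not the actual PointNet implementation and omits the transformation networks and all learned layers but the first:

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.normal(size=(1024, 3))           # unordered point cloud

    W = rng.normal(size=(3, 64))                  # one shared "MLP" layer,
    point_features = np.maximum(0.0, points @ W)  # applied to every point alike

    # max pooling is symmetric, so the result is invariant to point order
    global_feature = point_features.max(axis=0)   # shape (64,)

    # for segmentation, each point feature is concatenated with the global one
    combined = np.concatenate(
        [point_features, np.tile(global_feature, (len(points), 1))], axis=1)
    print(combined.shape)  # (1024, 128)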


Binary step function    f(x) = 0 if x < 0, 1 if x ≥ 0
Linear function         f(x) = ax
Sigmoid function        f(x) = 1 / (1 + e^(−x))
Tanh function           f(x) = tanh(x)
ReLU function           f(x) = max(0, x)
Softmax function        f(x)_i = e^(x_i) / Σ_{k=1..K} e^(x_k)

Table 2.3: Most commonly used activation functions: name (left) and function f(x) (right). (Source: [SSA20])
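The functions of Table 2.3 translate directly into code; the following sketch follows the table, with a standard numerical-stability shift added to the softmax that is not part of the table itself:

    import numpy as np

    def binary_step(x):
        return np.where(x < 0, 0.0, 1.0)

    def linear(x, a=1.0):
        return a * x

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def tanh(x):
        return np.tanh(x)

    def relu(x):
        return np.maximum(0.0, x)

    def softmax(x):
        e = np.exp(x - x.max())  # shifting by max(x) avoids overflow
        return e / e.sum()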


Qi et al. [QYSG17] improved this work, calling it PointNet++, by developing a hierarchical neural network, which is able to learn local structures as well. They describe that PointNet++ divides the point cloud into small overlapping regions and extracts local features by using PointNet. Local features are then grouped together to form higher-level features and this process is repeated until the whole point cloud is encoded. Furthermore, to address different densities within the point cloud they introduce a multi-scale grouping (MSG) and a multi-resolution grouping (MRG) by concatenating features from different scales or resolutions. Those features generalize over different point densities. Engelmann et al. [EKHL17] also extended PointNet to increase its classification performance by adding larger spatial context to the network. They achieve this by using either multi-scale or neighbouring regions in combination with either so-called Consolidation Units (CU) or Recurrent Consolidation Units (RCU). CUs and RCUs encode information about neighbouring features into a feature.

A further method using deep learning is the graph-based approach from Landrieu et al. [LS18]. They depict in their work that the vertices of their graph are called superpoints, which represent basic shapes of the point cloud, and the edges describe the relationship of those superpoints. For segmentation, the superpoints are converted to a descriptor using PointNet and then classified by a convolutional network, which is specialized for graphs. They chose a superpoint graph-based approach because of several advantages: objects are easier to classify than single points, relationships between those objects can be used for the classification as well, and the size of the graph is bounded by the amount of different shapes, which is rather small in comparison to the amount of points in a point cloud. Furthermore, with their approach they outperform the state of the art at that time.

2.1.3 Convolutional Neural Networks

Handcrafted features in combination with neural networks are a powerful tool for mapping data samples to their class labels. However, in image classification there exists a more sophisticated method called convolutional neural networks (CNNs). CNNs combine feature extraction and classification into a single neural network, where specific operations extract high-level features, which are then mapped by an appended neural network to class labels. Since CNNs are the current state of the art in image classification and segmentation, this approach is used in various ways for point cloud segmentation too.

2D Convolutional Neural Networks

The main parts of a CNN are convolution and pooling operations to extract features from an image, as well as a neural network, which maps those features to distinct classes or labels.

The convolution operation consists of a matrix called kernel, which shifts like a sliding window over the image matrix and computes new image values by performing a convolution. A convolution is the multiplication of each kernel matrix value with each according image value inside the sliding window, followed by summing up these values. Figure 2.4 shows an example of a convolution. The image part to be convoluted is coloured blue, the kernel pink and the resulting value green. The convolution is calculated as follows: 3 ∗ 1 + 1 ∗ 2 + 5 ∗ 0 + 9 ∗ 1 = 14.

Figure 2.4: An example of a convolution operation. Such an operation conv = m ∗ k consists of an image matrix m and a kernel k. The kernel shifts like a sliding window over the image matrix and computes new image values by multiplying and adding according values. The position of the sliding window is visualized in blue, the kernel in pink and the resulting value in green, which is obtained as follows: 3 ∗ 1 + 1 ∗ 2 + 5 ∗ 0 + 9 ∗ 1 = 14.

As mentioned above, the second main part of CNNs are pooling operations. The most commonly used pooling operation is max-pooling and its purpose is to reduce the image size. It uses a sliding window too, where the biggest value inside is transferred into the downsampled image matrix. An example with a pooling size of 2 × 2 is shown in Figure 2.5.

3 1 4 1
5 9 2 6        9 6
5 3 5 8   →    9 9
9 7 9 3

Figure 2.5: An example of a max-pooling operation. Such an operation downsamples the image matrix by using a sliding window, which only transfers the highest value inside into the reduced image matrix. (left) Image matrix with four positions of a 2 × 2 sliding window and (right) the downsampled image matrix with the transferred values in according colors.
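Both operations can be written down in a few lines; the following sketch, which is not from the thesis, reproduces the example values of Figures 2.4 and 2.5 (the kernel is applied as the element-wise multiply-and-sum the text describes, i.e. without the kernel flip of the mathematical convolution):

    import numpy as np

    def conv2d(m, k):
        # slide the kernel over the image and sum the element-wise products
        rows = m.shape[0] - k.shape[0] + 1
        cols = m.shape[1] - k.shape[1] + 1
        out = np.empty((rows, cols))
        for i in range(rows):
            for j in range(cols):
                out[i, j] = np.sum(m[i:i + k.shape[0], j:j + k.shape[1]] * k)
        return out

    def max_pool2d(m, size=2):
        # non-overlapping max pooling: keep the largest value per window
        out = np.empty((m.shape[0] // size, m.shape[1] // size))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = m[i * size:(i + 1) * size,
                              j * size:(j + 1) * size].max()
        return out

    image = np.array([[3, 1, 4, 1],
                      [5, 9, 2, 6],
                      [5, 3, 5, 8],
                      [9, 7, 9, 3]])
    kernel = np.array([[1, 2],
                       [0, 1]])
    print(conv2d(image, kernel)[0, 0])  # 3*1 + 1*2 + 5*0 + 9*1 = 14
    print(max_pool2d(image))            # [[9. 6.] [9. 9.]]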

CNNs assign a single class to the whole image, but for semantic segmentation, a pixel-wise labelling is needed. Segmentation networks like the U-Net from Ronneberger et al. [RFB15] use convolution and pooling operations to downsample the image first and so-called up-convolutions (sometimes also referred to as deconvolutions) to upsample the image again, whereby a high-resolution segmentation map is received. The down- and upsampling approach is often denoted as encoding and decoding, like in the SegNet from Badrinarayanan et al. [BKC17], which is visualized in Figure 2.6. There it can be seen that their encoder network consists of 13 convolutional as well as 5 pooling operations and the decoder network of the same amount of upsampling and convolutional operations. Those inverse pooling and convolution operations are also called unpooling and deconvolution respectively [ZTF+11].

[Figure 2.6: a convolutional encoder-decoder that maps an input RGB image to a segmentation map; the layers consist of convolution + batch normalisation + ReLU, pooling and upsampling operations, with pooling indices passed from encoder to decoder and a final softmax.]

Figure 2.6: Semantic segmentation network architecture with encoder-decoder structure. (Source: [BKC17])

Boulch et al. [BGSA18] use such image segmentation methods for semantic point cloud segmentation and introduce a novel and efficient network named SnapNet. First, they thin the point cloud and generate a mesh from it. After that, they generate RGB and depth composite image views in 2D, which they call snapshots, either in a random or multi-scale way. Then, they segment those images using their own implementations of SegNet and U-Net. The pixelwise classifications are back-projected to the mesh vertices, which can be done easily, because they saved the information whether the according mesh faces are visible from the snapshot. After that, the classifications are mapped back to the original point cloud by using the labels of the nearest mesh vertices. This method performs as well as or even outperforms the state of the art at that time, especially in indoor scenes.

Lawin et al. [LDT+17] have a similar approach; they create multiple views from a point cloud by rendering it into 2D images. They argue that a 2D image based segmentation approach results in a better accuracy, because images can have higher resolutions than voxelized point clouds, due to the increased memory consumption of voxelized data. As a further benefit, they state already existing labelled datasets for image segmentation. They use RGB, depth and normal images and project the point cloud into images by using a modified version of point splatting to eliminate noisy foreground and non-visible points. Convolutional neural networks process all different image types and provide a semantic segmentation. The final prediction is the sum of those segmentation outputs. These pixelwise predictions are mapped back into the point cloud and the sum over all views gives the final classification for the points. This approach outperforms the state of the art of that time.


Besides semantic point cloud segmentation, a further field of research is dedicated to 3D object classification using similar techniques. Su et al. [SMKLM15] introduce such a method to classify 3D objects by rendering a 3D shape into 2D images from multiple views and classifying those images using a CNN.

3D Convolutional Neural Networks

Since convolution and pooling operations can easily be adapted to 3-dimensional space, another field of research for point cloud segmentation as well as 3D object classification has emerged, based on 3D convolutional neural networks (3D CNNs).

Ji et al. [JXYY13] introduced 3D convolutions in 2013 for human action recognition in videos. They wanted to capture the motion of people over multiple frames to include not only spatial but temporal information too. Thus, they stacked multiple consecutive video frames and so time became the third dimension. For 3-dimensional convolution operations, the kernel is 3D as well and the convolution is executed over all three dimensions.

Wu et al. [WSK+15] introduced 3D ShapeNets as well as ModelNet, a 3D CNN and a large CAD model dataset respectively. They depict data with a 3D volumetric grid of size 30 × 30 × 30, where each voxel holds a binary value representing if it is inside or outside the mesh. To transform these binary values to a probability distribution, the representation of a 3D shape, Convolutional Deep Belief Networks are introduced. Furthermore, they use no pooling layers within this network, because they are not only classifying objects but also reconstructing them, and pooling operations would increase uncertainty regarding the reconstruction. Additionally, they pre-train the network first and fine-tune it afterwards to enhance the network's performance. Besides introducing a new model dataset, which has by far more categories and objects per category than datasets at that time, they surpass the state of the art in object classification at that time too.

A similar approach for object recognition presented by Maturana et al. [MS15] is VoxNet, which is also a 3D CNN working with voxelized point cloud data, but it is not restricted to CAD models. In their work they distinguish between free, occupied and unknown voxels and use ray tracing to calculate these three different types of occupancy values: a binary value, stating if the voxel is occupied or not; a density value, which is the probability that a voxel would block a ray; and a hit value, which does not distinguish between free and unknown space. The grid size of the voxelized data is 32 × 32 × 32 and the network consists of two convolutional, one pooling and two fully connected layers.

Garcia-Garcia et al. [GGG+16] build their work on ShapeNets [WSK+15] as well as VoxNet [MS15] and introduce a 3D CNN named PointNet¹ for object class recognition. As they [GGG+16] describe in their work, they use a volumetric occupancy grid to represent the data and simply map the points into voxels. The occupancy value of a voxel is the normalized number of points in it, and the network is composed of a convolutional followed by a pooling layer, again a convolutional as well as a pooling layer, and two fully connected layers. They are as good as the state of the art at that time, but their classification is faster, which enables real-time object classification. Qi et al. [QSN+16] generate different orientations of the 3D object as input for a 3D CNN and combine their information within the network to enhance the classification result.

¹ Despite the same name, the PointNet from [GGG+16] is a different PointNet than the one from [QSMG17].
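A voxelization in the spirit of [GGG+16] can be sketched as follows; this is not the authors' code, and the exact normalization they use may differ (here the counts are simply divided by the maximum count):

    import numpy as np

    def occupancy_grid(points, grid_size=32):
        # scale the point cloud into [0, grid_size) along each axis
        mins = points.min(axis=0)
        extent = np.maximum(points.max(axis=0) - mins, 1e-9)
        idx = ((points - mins) / extent * grid_size).astype(int)
        idx = np.clip(idx, 0, grid_size - 1)  # clamp the upper boundary
        # count the points falling into each voxel
        grid = np.zeros((grid_size, grid_size, grid_size))
        for i, j, k in idx:
            grid[i, j, k] += 1
        return grid / grid.max()  # normalized occupancy values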

Huang et al. [JS16] focus on the segmentation of large point clouds and use 3D CNNs with binary occupancy values and a resolution of 20 × 20 × 20 for this purpose. As they depict in their work, for training, the point cloud is voxelized at a set of center points within a certain radius. The center points are randomly selected, but balanced across all classes available in the dataset. For classification, the point cloud is first voxelized as a whole and those voxels are the input for the network. The class label of a voxel grid is a majority vote of the labels of all points within it. SEGCloud, introduced by Tchapmi et al. [TCA+17], also voxelizes the point cloud before a 3D fully convolutional neural network is used to classify each voxel. They describe that they apply a trilinear interpolation to interpolate the voxel prediction to the original point cloud. This interpolation is then combined with other point features using a fully connected CRF [KK11] to classify each point.

Improved Convolutional Neural Networks for Point Clouds

Conventional CNNs work with rasterized data; since 2D images are already rasterized, this does not cause any issues. However, 3D point clouds are not rasterized and vary much in point density, from regions with a very high density to regions with no points at all. Rasterizing point clouds not only extends the dimension from 2D to 3D, which is a cubic growth in memory consumption, but empty space is rasterized as well, which does not include any information and therefore unnecessarily allocates memory and slows down the network. Therefore, many different approaches have been developed to reduce memory consumption and computation time, as for example tree-based methods. Klokov et al. [KL17] use feedforward neural networks based on kd-trees instead of rasterized data, calling it Kd-network. The input to such a network is the point cloud's kd-tree and the network is implemented based on a kd-tree data structure instead of a grid structure. Riegler et al. [ROUG17] work with shallow unbalanced octrees and their encoding with a bit-string, stating if a node is split or not. Based on that data structure they implemented convolutional, pooling and unpooling operations. Wang et al. [WLG+17] introduced another approach based on octrees, where they use shuffle keys and labels to store the neighbour- and parenthood of the nodes, since this information is needed to perform convolution, pooling, unpooling and deconvolution operations based on octrees. Recent research tackles this problem by improving the convolution operation itself with regard to point clouds, like Thomas et al. [TQD+19] or Boulch et al. [Bou19].


2.2 Curb Detection

The detection of curbs and road boundaries has been a popular research topic in the field of autonomous driving vehicles for a long time, since knowledge about street borders is essential to ensure a safe ride. Over the years, various different sensors and techniques were explored to perceive the road and its properties. Hillel et al. [BHLLR14] described in their survey that monocular vision, stereo imaging, radar, the combination of GIS and GPS along with inertial measurement units, and LiDAR are among them. However, as they state further, camera-based approaches are prone to occlusions, e.g. from other vehicles, shadows, weather conditions and illumination. In addition to that, the calculated 3D information of stereo images is not as accurate as LiDAR data. Radar devices have a low resolution and are not able to detect fine structures; therefore, they are used for obstacle detection and on-/off-road differentiation. Positioning information is just used as additional data, since it is not accurate enough and would need precise as well as up-to-date maps. Since LiDAR does not depend on illumination and directly provides 3D data, it is often used for curb detection.

Already in 2010, Zhang et al. [Zha10] presented their work on road-edge detection based on 2D LiDAR data. Their algorithm contains five stages: candidate point-set selection, feature extraction, classification, false alarm mitigation and road curb detection. These steps, except for the classification, form the basic framework for the curb detection pipeline and can still be found in recent research as well as in this thesis.

2.2.1 Point Cloud Pre-Filtering

Since the following steps in the curb detection pipeline are applied pointwise, the point cloud is roughly filtered beforehand to reduce the amount of points to process and therefore the computation time. For this case, the filtering does not need to be highly accurate, because its purpose is to omit regions of the point cloud where definitely no curb is located.

Zhao et al. [ZY12] form a regular voxel grid from the point cloud and define the voxels with the lowest altitude as ground data. They use a voxel grid to quickly access points inside a voxel and to model neighbourhood relations between voxels. Then, they fit a plane to the points stored in these voxels and the fitting points are determined as ground points. Zhang et al. [ZWWD18] use a plane-based method too, which is applied on points below the scanning device. They use the RANSAC [FB81] algorithm to fit a plane and ground points are determined based on a loose threshold. Another plane-based method is proposed by Wang et al. [WWHY19]; they first split the point cloud into multiple segments and fit a plane in each segment, also using the RANSAC method. They use such a segment-based approach to be able to handle slopes on the road.
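The plane fitting used by [ZWWD18] and [WWHY19] can be sketched as follows; this is a minimal generic RANSAC plane fit, not code from the cited papers, and the iteration count and inlier threshold are only illustrative values:

    import numpy as np

    def ransac_ground_plane(points, iterations=200, threshold=0.05, rng=None):
        # repeatedly fit a plane to three random points and keep the plane
        # supported by the most inliers
        rng = rng or np.random.default_rng()
        best_inliers = np.zeros(len(points), dtype=bool)
        for _ in range(iterations):
            a, b, c = points[rng.choice(len(points), 3, replace=False)]
            normal = np.cross(b - a, c - a)
            norm = np.linalg.norm(normal)
            if norm < 1e-9:  # skip degenerate (collinear) samples
                continue
            normal /= norm
            distances = np.abs((points - a) @ normal)  # point-to-plane distance
            inliers = distances < threshold
            if inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        return best_inliers  # mask of ground point candidates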


2.2.2 Feature Extraction

Curbs have very distinctive spatial properties: they are between 10 and 15 centimetres high, roads and curbs are orthogonal as well as parallel to one another, and features based on the covariance in a certain neighbourhood can describe curbs too. Furthermore, the laser line of a LiDAR device has a different pattern when it hits an obstacle. These and many more properties are used to find curb points in the pre-filtered point cloud. Usually, several features are computed per point and if a point satisfies all these conditions, it is identified as a curb point. Some of these features are described below.

Height difference is probably the most commonly used spatial feature to describe curbs in autonomous driving. Zhang et al. [ZWW+15] use a sliding window approach to speed up the feature calculation. For each sliding window the maximum and minimum in height are searched and if the difference between them is greater than a certain threshold, the window probably contains curb points. Zhao et al. [ZY12] compute the difference in height for each voxel and restrain it with an upper and lower threshold too. Wang et al. [WWHY19] use, besides the difference between the highest and lowest height values, also the standard deviation of height values of points in the same laser line. They constrain the difference with an upper as well as a lower threshold and the standard deviation with an upper threshold. The spatial structure of curbs can also be described with the distance between two neighbouring points of the same laser line. Zhang et al. [ZWWD18] calculate the horizontal and vertical distance and bound them with a lower threshold. Wang et al. [WWHY19] use the same method, but they only verify the horizontal distance.

Since roads and curbs are usually orthogonal to each other, the angle between them is another descriptive feature. Therefore, Zhang et al. [ZWWD18] introduced a feature describing this property by calculating the angle between two vectors starting at the same point. They create those vectors by summing up the n previous or following points respectively, along the laser line from the starting point, and divide the result by n. If the starting point lies on a curb, the angle between those two vectors is smaller than the angle between vectors originating from a road point. Wang et al. [WWHY19] use this method too, but they do not normalize the sum with the amount of used points and define the upper threshold as 170 degrees.

Another set of features is based on the eigenvalue decomposition of the covariance matrix in a certain neighbourhood of points. Zhao et al. [ZY12] compute the normal vector of a point by using the eigenvector corresponding to the smallest eigenvalue. This normal is then checked for perpendicularity to the car and if so, the point is probably part of a curb. Fernández et al. [FILS14], on the other hand, use the eigenvalues of a weighted covariance matrix to estimate the curvature in the area around a point. They do this by dividing the smallest eigenvalue by the sum of all eigenvalues and bounding the result with a lower and upper threshold. Furthermore, they state that curbs can be categorized into very small, small, regular and big obstacles with increasing threshold values.
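Two of these features, the height difference within a window and the angle feature of [ZWWD18], could look as follows in a simplified form; this sketch is not from the cited papers, assumes the points of one laser line are ordered in an array, and the thresholds in the trailing comment are only examples:

    import numpy as np

    def height_difference(window_points):
        # difference between highest and lowest point in a sliding window
        z = window_points[:, 2]
        return z.max() - z.min()

    def curb_angle(line, i, n=5):
        # mean vectors to the n previous and n following points of the
        # same laser line; assumes n <= i < len(line) - n
        p = line[i]
        v_prev = (line[i - n:i] - p).mean(axis=0)
        v_next = (line[i + 1:i + 1 + n] - p).mean(axis=0)
        cos = v_prev @ v_next / (np.linalg.norm(v_prev) * np.linalg.norm(v_next))
        return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    # a point is kept as curb candidate only if all features pass their
    # thresholds, e.g. 0.10 < height difference < 0.15 and angle < 170 degrees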


Smoothness of points in a certain neighbourhood is a further curb descriptor. Wang et al. [WWHY19] describe how smooth a surface is by taking adjacent points on both sides of a point and building the norm of the sum of differences between that point and its neighbours. This norm is then divided by the norm of the point times the amount of neighbours. The resulting smoothness value is bounded by a lower threshold.
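Read literally, this verbal description corresponds to the following sketch; note that this is our interpretation of [WWHY19], not their code:

    import numpy as np

    def smoothness(line, i, n=5):
        # || sum_j (p - p_j) || / (|N| * ||p||), bounded by a lower threshold
        p = line[i]
        neighbours = np.vstack([line[i - n:i], line[i + 1:i + 1 + n]])
        return np.linalg.norm((p - neighbours).sum(axis=0)) / (
            len(neighbours) * np.linalg.norm(p))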

2.2.3 False Positive Filtering

In an urban scenario, besides the self-driving vehicle itself, other cars, pedestrians, buildings, vegetation and many more obstacles are inside or nearby the road too and therefore part of the LiDAR point cloud. Some of them have the same spatial properties as well as features as curbs and will thus be falsely detected as curb points in the previous steps. For this reason, a false positive filtering is sometimes applied before the actual curb fitting to remove those misclassified points. This eliminates possible sources of error for the following step and can therefore enhance the curb detection results.

Zhao et al. [ZY12] apply a short-term memory technique on predicted curbs to filter wrongly classified points. They incorporate curb points of the current frame with those of some previous frames. Their curb detection algorithm works on voxels and if a voxel has already been labelled as road, any vertical unsteadiness will not be predicted as curb anymore. Fernández et al. [FILS14] use Conditional Random Fields [LMP01] with a 4-connected neighbourhood for the filtering process. They use the previous result of the Conditional Random Field as prior and curbs found with spatial features as likelihood.

Another approach splits false positives into inside and outside the road and filters both individually using methods suitable for them. For example, Hata et al. [HOW14] detect curb points based on the distance between neighbouring laser rings, because when a laser hits an object this distance becomes smaller. Therefore, they use a differential filter to keep points with a height variance by using a convolution operation. Then, they apply a distance filter to omit curb-like points besides the road, because curbs are the nearest points to the car. Finally, they use a regression filter to omit obstacles inside the road, like other cars or pedestrians. This filter estimates the shape of the curb and removes points not fitting in. Wang et al. [WWHY19] also perform their filtering in two steps to remove false positives inside and outside the road. They apply a distance filter to eliminate wrongly classified points outside the road and a RANSAC filter for inside. The RANSAC filter calculates the parameters for a quadratic polynomial model, which fits the curb points as well as possible. All points further away from the model than a certain threshold are false positives and therefore removed from the curb point set.
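Such a RANSAC filter with a quadratic polynomial model could be sketched as follows; again this is an illustrative reconstruction, not the code of [WWHY19], it works in the horizontal plane (x, y) and assumes the candidates roughly follow a single curb line:

    import numpy as np

    def ransac_curb_filter(curb_points, iterations=100, threshold=0.2, rng=None):
        # robustly fit y = a*x^2 + b*x + c to the candidates and drop
        # every point further away from the model than the threshold
        rng = rng or np.random.default_rng()
        x, y = curb_points[:, 0], curb_points[:, 1]
        best = np.zeros(len(curb_points), dtype=bool)
        for _ in range(iterations):
            sample = rng.choice(len(curb_points), 3, replace=False)
            coefficients = np.polyfit(x[sample], y[sample], deg=2)
            inliers = np.abs(np.polyval(coefficients, x) - y) < threshold
            if inliers.sum() > best.sum():
                best = inliers
        return curb_points[best]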


2.2.4 Geometry Fitting and Tracking

Finally, a geometric model is generated from the calculated and filtered curb points. For this, several different approaches are used regarding representation and computation of the model, which are discussed below. For the representation of the curb as a geometric model either a parabolic model [ZY12] [ZWW+15] [WWHY19] or a B-spline [SZX+19] is used. Zhao et al. [ZY12] achieve the fitting with the combination of RANSAC followed by a linear least squares method. Zhang et al. [ZWW+15] use a root mean square error method to compute the geometric model and Wang et al. [WWHY19] a least squares method. To improve the curb further, it is tracked over multiple frames. Zhao et al. [ZY12] use a particle filter [DdFG01] for this, and Wang et al. [WWHY19] as well as Zhang et al. [ZWW+15] use a Kalman filtering method [Kal60]. Zhao et al. [ZY12] state that they use a particle filter and not a Kalman filter to avoid a bias towards the car's movement.

Nevertheless, not all research uses this pipeline or this kind of features. For instance, Han et al. [HKLS12] compute line segments in the untransformed polar coordinates of the LiDAR sensor to detect the borders of the road. Kumar et al. [KMLM13] use parametric active contour models to detect road boundaries in the converted 2D raster surfaces of elevation, reflectance and pulse width values of the LiDAR. A further method, introduced by Wang et al. [WLW+15], uses saliency maps and salient points to detect curbs, and Rodríguez-Cuenca et al. [RCGCOA15] base their curb detection algorithm on 2D raster images of the point cloud, holding the height difference and the amount of points in each cell.

CHAPTER 3

Semantic Point Cloud Segmentation

When humans look at a point cloud, they can instantly identify objects like houses, trees or cars, but for a computer a point cloud is just a mass of unstructured points with no semantic information. An urban point cloud consists of buildings, vegetation, pedestrians, scanning artefacts and a lot more. However, for most tasks, points of just a certain category are actually needed. Let us assume the goal is to reconstruct buildings: it would be great if we already knew where in the point cloud the buildings are located, to apply further reconstruction algorithms specifically there. This would decrease the reconstruction time, since many points do not need to be processed. On the other hand, in some applications it would be an advantage to filter points of a specific class beforehand, like scanning artefacts or vegetation, to enhance the stability of algorithms.

For this and many other reasons, semantic point cloud segmentation is a popular research topic with a long history. Since CNNs are a great tool to extract and learn high-level features and reach a very high accuracy in various applications as well, it seems natural to apply the knowledge and algorithms from image classification with CNNs to the 3D world of point clouds. Therefore, this thesis uses a 3D Convolutional Neural Network for the point cloud segmentation and this chapter is dedicated to how this is achieved. Section 3.1 explains the process of data extraction from a point cloud and Section 3.2 describes the used dataset. Then, Section 3.3 gives an overview about neural network learning basics and Section 3.4 discusses the network architecture. Finally, Section 3.5 shortly depicts the segmentation process of a point cloud.


3.1 Data Extraction

The convolution and pooling operations in conventional machine learning environments work on rasterized data, and since those operations can be easily adapted from 2D to 3D, they are already supported. Nevertheless, point clouds do not have a regular structure and the point density can vary from very dense to extremely sparse. In order for the neural network to process the data, it needs to be rasterized in a way that represents the underlying structure of the point cloud well. Furthermore, to train a well-generalized classifier a great amount of training data is needed, which requires a lot of memory, especially in the case of 3D data, but graphics cards do not have such an amount of memory available. Therefore, the data needs to be extracted such that it contains the important information, while regions with too few points should be omitted.

For these reasons, and because LiDAR data can become extremely large, an appropriate data structure is needed to visualize and process a point cloud in a proper way. Commonly used data structures for handling spatial data fast are spatial subdivision trees. Based on that, an octree is used as the underlying data structure. As the name indicates, an octree node can have up to eight child nodes. In the case of LiDAR data, the whole point cloud is the root node. The root node and recursively all other nodes are subdivided further, if they contain a feasible amount of points. Figure 3.1 shows a 2D visualization of an octree, where the root node is coloured black. Dense regions are subdivided more often and therefore located in lower levels of the tree, like the red nodes of Figure 3.1. Because of that, the size of those nodes is getting smaller with each subdivision. Contrary to that, sparse regions are not subdivided further and therefore located in bigger nodes on a higher level in the tree, like the blue nodes in Figure 3.1. Such data structures give information about point density and the location of relevant information in a point cloud, which is a good basis to extract training data.

Figure 3.1: 2D visualization of an octree.


The visualization and processing of point clouds is done with the algorithms and data structures from the Aardvark Platform [Aar]. The out-of-core octree implementation can handle point clouds very fast and is able to process humongous datasets. In this octree, the nodes are called cells and they have a property named "exponent" n, which indicates a fixed cell size of 2^n meters. That makes it easy to traverse a tree to the cells with the desired size in meters, particularly because all cells with the same exponent are on the same level of the tree. Since point clouds are stored with a fixed measurement (a meter always has the same unit) it is possible to extract different datasets of various sizes with the same procedure. This is important, because the size of the nodes is crucial to represent the underlying structure of a point cloud. The cells must not be too big, because then it would not be possible to represent fine structures after the rasterization, since they would be aggregated in the same grid cell. On the other hand, too small nodes would lose information about the greater context of an object and the classifier would not be able to distinguish them well. In this work, the exponent n = 0, which is one meter, was used, since this exponent established itself as the best cell size in prior attempts. During data extraction, the octree is traversed to the level with the desired exponent and the cells of this level are used for training. Nonetheless, if a cell is already a leaf, but still contains a feasible amount of points, it is subdivided further. For the training the minimum amount of points in a cell is ten; for the classification, on the other hand, all cells with any points inside are used. The cells are then rasterized with a fixed grid size. Used grid sizes are 16 and 32. For example, if a cell with exponent n = 0, which is one meter, is rasterized with a grid size of 32, then each grid cell has a size of 1/32 m = 0.03125 m = 3.125 cm. Each grid cell is assigned a value, which represents the information about the points occupying the cell. In this thesis, three different values were explored (a small rasterization sketch follows the list below):

• a binary occupancy value, where the grid cell contains 1 if there are any points inside and 0 if not,

• the absolute amount of all points inside the cell

• and the ratio of the number of points inside the grid cell to the total amount that can be inside a node.
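As a minimal sketch of this rasterization step, the following Python function voxelizes the points of one cell into a fixed grid and returns one of the three occupancy values from the list above. The function name rasterize_cell and its signature are hypothetical and only illustrate the procedure; the actual implementation lives in the Aardvark-based pipeline.

```python
import numpy as np

def rasterize_cell(points, cell_min, cell_size=1.0, grid_size=32, mode="binary"):
    """Rasterize the points of one octree cell into a grid_size^3 voxel grid.

    points: (N, 3) array of xyz positions, cell_min: minimum corner of the cell,
    mode: 'binary', 'count' or 'fraction' (the three values listed above).
    """
    voxel = cell_size / grid_size                      # e.g. 1 m / 32 = 3.125 cm
    idx = np.clip(np.floor((points - cell_min) / voxel).astype(int),
                  0, grid_size - 1)                    # guard points on the border
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)  # per-voxel counts
    if mode == "binary":
        return (grid > 0).astype(np.float32)
    if mode == "fraction":
        return grid / len(points)                      # counts relative to the node
    return grid                                        # absolute counts
```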

Besides the data itself, each training sample needs a ground truth to be learned by a neural network. Since not all points in a node must have the same class label, a majority vote on the class distribution is performed. By empirically exploring the class distribution for both used grid sizes, it transpired that either just one class is contained in a node, or one class is highly represented in comparison to the others. That shows that a majority vote is sufficient to portray the correct class of a node, and considering the rather small size of the nodes of just 1 meter, this seems reasonable. However, not only the nodes themselves contain important information, their neighbourhood can be revealing as well. Therefore, for all six faces of the node, adjacent cells are


created and rasterized as well. Figure 3.2 shows the middle-node (left) and the according six side-cells (right) in matching colours. These six side-cells have the same exponent and therefore the same size, but the grid size used for rasterization can differ. In this thesis, grid sizes of 8 and 16 were used for the side-cells.

Figure 3.2: Left: middle-node with coloured faces. Right: adjacent side-cells coloured according to the faces.

3.1.1 Data Augmentation

The generalization of a classifier is its ability to perform as well on unseen as on known data. That means that the classifier does not suffer from a performance loss if it is applied to new data. A classifier may be afflicted with a bad generalization if it under- or overfits the training data. The training of a classifier approximates a function y = f*(x), which should depict the distribution of the training data well, but in case of under- or overfitting this function poorly represents the data. Figure 3.3 visualizes under- and overfitting in the case of polynomial curve fitting from Bishop [Bis06], where the green curve shows the searched function, the blue dots are the training samples, the red curve is the fitted function and M denotes the order of the fitted polynomial. In case of M = 1 the fitted function does not capture the searched function at all (it is too simple); this is called underfitting. M = 9 models a function fitting exactly all training samples, but gives a poor approximation of the searched function; this is called overfitting. Overfitting will cause a high training accuracy, whereas the validation or test accuracy is low. To improve the generalization of a classifier, each class should have a nearly equal amount of training samples. Nevertheless, usually some classes are represented well, like vegetation or buildings, whereas others, like for instance natural terrain such as meadows, are not so common in an urban point cloud dataset. To create an equally distributed training set, the amount of data samples of the least frequent class could be used as the maximum size for all other classes too. However, for this method the dataset must have a certain size, because otherwise the training set would become too small


Figure 3.3: Example of under- and overfitting (red curves) in the case of polynomial curve fitting. M is the order of the fitted function. Left: underfitting. Middle: good representation. Right: overfitting. (Source: adapted from [Bis06])

to train a good classifier. Another method would be to create more data samples from underrepresented classes artificially. This is called data augmentation and is a common procedure in machine learning. Data augmentation increases the variety of different data samples inside a class, which leads to a better generalization. It is not restricted to some classes and can be applied to the whole training set to increase its size and diversity. However, the training set used in this work is highly unbalanced; some classes have more samples than could be used for training, while others are underrepresented. Therefore, data augmentation is used in this work to create an equally distributed training set by creating various data samples of those underrepresented classes. To create data samples artificially, random samples from the training set are taken and then rotated, scaled, horizontally as well as vertically flipped, translated, or a combination of some of them. Since the used augmentation methods depend on the use case of the classifier, a rotation around the z-axis of up to 360 degrees, an arbitrary translation in all directions of at most 75% of the node size, or a horizontal flip are applied in this work. These data sample augmentation methods are visualized in Figure 3.4 and sketched in code below.

Figure 3.4: Data sample augmentation methods: a) rotation around z-axis, b) translation in all directions, c) horizontal flip.
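The following Python function is a small sketch of how these three augmentations can be applied to one data sample; the function name augment_sample is hypothetical and the sample is assumed to be a point array centred on its node.

```python
import numpy as np

def augment_sample(points, node_size=1.0, rng=None):
    """Randomly rotate around z, translate and horizontally flip one sample."""
    rng = rng or np.random.default_rng()
    phi = rng.uniform(0.0, 2.0 * np.pi)            # rotation up to 360 degrees
    rot = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                    [np.sin(phi),  np.cos(phi), 0.0],
                    [0.0,          0.0,         1.0]])
    points = points @ rot.T
    # arbitrary translation in all directions, at most 75% of the node size
    points = points + rng.uniform(-0.75, 0.75, size=3) * node_size
    if rng.random() < 0.5:                         # horizontal flip
        points = points * np.array([-1.0, 1.0, 1.0])
    return points
```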


3.2 Training Dataset

As mentioned above, the training set needs to be of a certain size and the data samples must have a correct ground truth. For the purpose of this thesis, these would be point clouds of villages or cities with a certain amount of points, a certain point density and an according labelling. The labelling should represent classes like ground, vegetation, buildings, cars and suchlike. Therefore, the Semantic3d.net [HSL+17] dataset was used for the training of the classifier. This is a dataset with around four billion points of urban and rural scenes, which was captured with a laser scanner. This dataset comprises the following classes:

• man-made terrain

• natural terrain

• high vegetation

• low vegetation

• buildings

• hard scape

• cars

• scanning artefacts

The dataset contains 30 laser scans and is equally divided into a training and a test set, where only the training set is already labelled [HSL+17]. A point cloud of this training set, captured in a rural area, is shown in Figure 3.5. The class affiliation of the points is visualized with colours, where each colour represents one class. Unlabelled points are white, "man-made terrain" is grey, "vegetation" is coloured with different shades of green, "buildings" are blue, "hard scape" is orange, "cars" are pink and "artefacts" are visualized with a brown-pink colour. Figure 3.6 shows the dataset from Figure 3.5 coloured according to the class membership. There it can also be seen that "man-made terrain" is asphalted terrain and "natural terrain" consists of meadows. The difference between high and low vegetation is that trees are referred to as "high vegetation" and bushes or flowers as "low vegetation". "Hard scape" comprises man-made structures like low walls, posts or wells. Exporting the 15 Semantic3d training set point clouds with a cell size of one meter results in over 200 000 training samples. Since this is a dataset captured in the real world, the class distribution is very unbalanced. "Buildings" is the most common class with nearly 75 000 data samples, followed by "high vegetation" with around 56 000, and "natural terrain" and "man-made terrain" with 39 000 and 31 000 samples respectively.


Figure 3.5: A point cloud of the Semantic3d dataset, captured with a laser scanner in a rural area.

Figure 3.6: A point cloud of the Semantic3d dataset coloured according to the class affiliation of the points.


Whereas "hard scape", "low vegetation", "scanning artefacts" and in particular "cars" are extremely underrepresented. However, for a good generalization all classes should be equally distributed, otherwise the underrepresented labels are prone to misclassifications. For curb detection, especially cars need to be detected well, because they are usually placed in or nearby the street and are the most common cause for false positive curb classifications. Therefore, the classes with too few samples are artificially increased with data augmentation.

3.3 Network Learning Basics

As mentioned earlier, the training of a neural network approximates a function y = f*(x) by learning the mapping y = f(x; θ) with respect to θ [GBC16]. To achieve this, a loss function and an optimization algorithm are needed.

Loss Function

A loss function L(θ) measures the cost of misclassifications and is therefore also known as cost or objective function. It compares the network predictions with the ground truth. The probably best-known loss function is the mean squared error (MSE), where the squared distance between the predicted class probability ỹ and the ground truth y is calculated as follows:

L(\theta) = \frac{1}{N}\sum_{i=0}^{N}(y_i - \tilde{y}_i)^2 \qquad (3.1)

However, the probably most used loss function is the categorical or binary cross-entropy for multiple or two classes respectively. The entropy H was introduced by C. E. Shannon [Sha48] and is the measure for the average information content. Used as a loss function, the cross-entropy is defined as follows:

L(\theta) = -\sum_{n=1}^{N}\sum_{c=1}^{K} y_{n,c}\,\log \tilde{y}_{n,c} \qquad (3.2)

where N is the amount of data samples and K is the number of classes [Bis06]. Note, Equation 3.2 is adapted from [Bis06] to the notation used in this thesis.

The goal of the training is to minimize the loss function, because a smaller loss means a smaller classification error. This is achieved by using optimization algorithms like gradient descent.


Gradient Descent

Gradient descent was introduced by Cauchy in 1847 [Cau47] and is an iterative optimization algorithm, which searches for the minimum of a function. Starting at a point x_k of a function f(x), it looks where the greatest slope is and makes a step in the opposite direction. This results in a new point x_{k+1} and is one iteration of the algorithm. This process is repeated until a flat region is found. The greatest slope is the derivative f'(x), also known as the gradient ∇f(x) of the function f(x). The mathematical expression for one iteration is as follows:

x_{k+1} = x_k - \alpha f'(x_k) \qquad (3.3)

where α is the step size, usually known as the learning rate. When the gradient converges near zero, ∇f(x) = 0, the algorithm terminates, because a critical point is found. Many improvements of the gradient descent algorithm were developed; probably the best known is adding a momentum term to improve the speed of convergence [Qia99]. Further methods are Adam or AdaMax, which use adaptive learning rates [KB14].
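As a minimal sketch of the iteration rule in Equation 3.3 for a one-dimensional function, consider the following; the function name and the example are purely illustrative:

```python
def gradient_descent(f_grad, x0, alpha=0.01, tol=1e-8, max_iter=10000):
    """Step against the gradient until it (nearly) vanishes, see Equation 3.3."""
    x = x0
    for _ in range(max_iter):
        g = f_grad(x)              # gradient f'(x_k)
        if abs(g) < tol:           # critical point found, terminate
            break
        x = x - alpha * g          # x_{k+1} = x_k - alpha * f'(x_k)
    return x

# example: minimum of f(x) = (x - 3)^2, with f'(x) = 2(x - 3)
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0))  # converges to ~3.0
```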

Back-Propagation

The training of a network can be broken down into two parts: In the forward pass, an input x is propagated through the network to receive a prediction ỹ. Then, the cost between the output ỹ and the ground truth y is calculated. In the backward pass, the cost is propagated backwards through the network to compute the gradients for the gradient descent optimization. The back-propagation algorithm [RHW86] achieves this in a computationally inexpensive way [GBC16].

As Rumelhart et al. [RHW86] described in their work, the gradients are the partial derivatives of the total network error E with respect to the network's weights w, denoted as ∂E/∂w. Consider x_j = \sum_i y_i w_{ji}, where x_j is the input to a neuron j, the y_i are the outputs of the previous neurons i connected to j, and the w_{ji} are the according weights. The partial derivative of the error E with respect to the outputs y is computed first, resulting in ∂E/∂y. Then, the chain rule is used to compute the partial derivative of the error E with respect to the input x: ∂E/∂x = (∂E/∂y)(∂y/∂x). With this technique, the influence of an input x on an output y, and therefore on the error E, is known. Finally, ∂E/∂w = (∂E/∂x)(∂x/∂w) is determined by using the chain rule again. All this together results in:

\frac{\partial E}{\partial w} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial x}\,\frac{\partial x}{\partial w} \qquad (3.4)

The network weights are updated by using ∂E/∂w. As Rumelhart et al. [RHW86] stated further in their work, they aggregate all results until the update is performed and modify the weights by using Equation 3.5, where t is increased by one after each iteration and α ∈ [0,1] states the influence of the previous weight change on the update.

\Delta w(t) = -\frac{\partial E}{\partial w}(t) + \alpha\,\Delta w(t-1) \qquad (3.5)


3.3.1 Regularization Methods

Regularization prevents a neural network from overfitting. The most used methods are L1- or L2-regularization, dropout layers and early stopping, which are shortly discussed below.

L2 Parameter Regularization

The L2 parameter regularization method adds a norm penalty Ω(θ) = ½‖w‖₂² to the loss function L(θ). The regularized loss function L̃(θ) is therefore defined as shown in Equation 3.6, where α ∈ [0, ∞) controls the strength of the regularization. This method is also known as weight decay, because it shifts the weights closer to zero [GBC16]. Note, Equation 3.6 is adapted from [GBC16] to the notation used in this thesis.

\tilde{L}(\theta) = \frac{\alpha}{2}\,\lVert w \rVert_2^2 + L(\theta) \qquad (3.6)

L1 Regularization

The L1 regularization also adds a norm penalty term Ω(θ) = ‖w‖₁ to the loss function L(θ). The regularized loss function L̃(θ), including the weight factor α, is defined with this method as shown in Equation 3.7. The L1 regularization has the effect that parts of the weights become zero, which is why it has been used as a feature selection method [GBC16]. Note, Equation 3.7 is adapted from [GBC16] to the notation used in this thesis too.

\tilde{L}(\theta) = \alpha\,\lVert w \rVert_1 + L(\theta) \qquad (3.7)

Dropout

Another approach to reduce overfitting was introduced by Srivastava et al. [SHK+14] and is called dropout. As they describe in their work, this method randomly omits neurons as well as their connections from the network during training, which should hinder them from becoming too co-adapted. The probability that a neuron is dropped is defined by 1 - p. They stated that a dropout value of 0.5 ≤ p ≤ 0.8 is a good value for hidden neurons, whereas for input neurons p should be near 1. In their tests, they often used p = 0.5 and p = 0.8 for the hidden and input layers respectively.

Early Stopping

An overfitted classifier achieves a high accuracy on the training set, but generalizes poorly. This can already be seen during the training by observing the training and validation set losses. At some point the validation set loss will start to increase, while the training set loss still decreases: this is the beginning of overfitting. Early stopping traces those values and terminates the training if the validation set loss does not decrease over a pre-defined amount of epochs. In the course of this, the parameters of the network are saved at every improvement of the validation loss and reloaded after the training for classification.
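In Keras, which is used for the implementation in this thesis (see Chapter 5), early stopping and the saving of the best weights can be expressed with two callbacks. This is only a sketch; the patience of seven epochs is taken from the training setup described in Section 5.2 and the file name is arbitrary.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# stop if the validation loss has not improved over the last seven epochs and
# keep the weights of the best epoch for the later classification
callbacks = [
    EarlyStopping(monitor="val_loss", patience=7, restore_best_weights=True),
    ModelCheckpoint("best_weights.h5", monitor="val_loss", save_best_only=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=callbacks)
```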


3.4 Network Architecture

The architecture of a CNN can be broken down into two main blocks, which can be seen in Figure 3.7. The first block performs feature extraction as well as dimensionality reduction and is called the feature learning block. The second one is a neural network and is referred to as the classification block.

Figure 3.7: Exemplary structure of a CNN illustrating its two-stage structure with the feature learning and classification blocks. (Source: [Sah])

3.4.1 Feature Learning Block

The feature learning block is composed of multiple components: a 3D convolution operation (CONV), followed by a batch normalization (BN), a ReLU activation function and a 3D pooling operation (POOL). Batch normalization was introduced by Ioffe et al. [IS15] and is a regularization method, which conducts a normalization for each mini-batch during the training. They state that this method allows higher learning rates and that one does not have to be so careful about initialization. Their normalization transforms each activation x to mean 0 and variance 1 by using Equation 3.8.

\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]}} \qquad (3.8)

First, two convolutional operations, each followed by batch normalization and a ReLU, are stacked. Then, a pooling operation is applied. This sequence of operations is referred to as a feature learning component in this work and can be repeated multiple times to form a deeper network. Figure 3.8 visualizes the structure of such a component. The 3D convolution is performed with a kernel size of 3 × 3 and an L2-regularization for the kernel and bias. The 3D max-pooling is executed with a pooling size as well as a stride of 2 × 2. This block can be repeated multiple times, until a data sample is scaled down to the desired size. This is performed on the middle-cell as well as all six neighbouring side-cells individually. The results are then flattened by a global average pooling and concatenated afterwards. This final vector is fed into the classification block.

Figure 3.8: A feature learning component is composed of two convolution operations (CONV), batch normalizations (BN) and ReLU activation functions as well as a pooling operation (POOL).

3.4.2 Classification Block

The classification block is composed of multiple components too: It consists of a dropout layer (DO), a fully-connected layer (FC), a ReLU activation function and then again a dropout layer as well as a fully-connected layer and a ReLU activation function. This structure is visualized in Figure 3.9 and is called the neural network component in this thesis. The fully-connected layers regularize the kernel and the bias with an L2-regularizer. The dropout layers randomly exclude first 20% and then 50% of the neurons. After this block, the final output layer is appended, which is a fully-connected layer with a softmax activation function and the number of classes as its amount of neurons.

Figure 3.9: The neural network component of the network with fully-connected (FC) and dropout (DO) layers and ReLU activation functions.


Figure 3.10 shows the combination of all these parts, resulting in a neural network architecture with multiple convolution and pooling components, followed by fully-connected and dropout layers and the final layer with a softmax activation function.

Figure 3.10: The entire network architecture. The middle cell and the six side cells each pass through their feature learning components, are flattened by global average pooling, concatenated and fed through the neural network component to the softmax output.
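Reusing the feature_learning_component sketch from above, the whole architecture of Figure 3.10 can be assembled roughly as follows. The neuron counts and dropout rates are those reported in Section 5.2; everything else (names, input shapes with a single occupancy channel) is an illustrative assumption.

```python
from tensorflow.keras import Input, Model, layers

def build_network(num_classes, mid_grid=16, side_grid=8):
    """One feature learning component per cell, global average pooling,
    concatenation and the classification block with a softmax output."""
    mid_in = Input((mid_grid,) * 3 + (1,))
    side_ins = [Input((side_grid,) * 3 + (1,)) for _ in range(6)]
    feats = [layers.GlobalAveragePooling3D()(
        feature_learning_component(mid_in, filters=128))]
    feats += [layers.GlobalAveragePooling3D()(
        feature_learning_component(s, filters=64)) for s in side_ins]
    x = layers.Concatenate()(feats)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(2048, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(4096, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return Model([mid_in] + side_ins, out)
```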


3.5 Point Cloud Segmentation

To apply the trained network to a point cloud, the point cloud needs to be exported the same way as the training data. However, in this case the minimum amount of points inside a cell is one, because all points of the dataset must be represented in the exported data to become classified. The data samples are then fed into the neural network and a class label is applied. Since the octree nodes are axis-aligned, the min- and max-positions of the cells are exported too. By using them, a classified cell can be mapped back into the point cloud to its initial position. All points inside this cell are then labelled with the class prediction of the cell. This results in a pointwise segmentation of the point cloud.

CHAPTER 4

Curb Detection and Fitting

Due to the previous segmentation of the point cloud using a 3D CNN, semantic information about the individual points is available. This information can be used for various applications; we are interested in the extraction and geometry reconstruction of sidewalks. To achieve that, curb points are searched in the pre-filtered point cloud first and, based on them, polygons of the sidewalk including the street are calculated. For searching curb points, only points labelled as "terrain" are used, since curbs are located on the ground. This chapter addresses the curb detection and geometry fitting process. The curb detection contains multiple steps: First, a point cloud is roughly filtered in Section 4.1, then simple geometric point cloud features are calculated and based on them points are classified as "curb" or "no curb" in Section 4.2. Some of those features need information about the street; therefore, this section covers the data we use for that too. Since every machine learning method leads to some false positive classifications, which would impair the final curb fitting, they are filtered in Section 4.3. Finally, the actual polygons are generated in Section 4.4.

4.1 Point Cloud Prefiltering

To identify curb points in a point cloud, features are calculated for each single point. Since LiDAR datasets are usually very large, the area in a point cloud for the curb detection must be narrowed down, to be able to detect curbs in a reasonable time. An obvious approach would be limiting the search area to the ground, because curbs are usually located there. Unfortunately, this would still result in many false positive curb classifications, because houses, cars and other objects near or inside the street have similar geometric properties as curbs. However, with an understanding of the scene, it is possible to filter these specific points beforehand and restrict the search area even further. Filtering these high-risk false positive points will improve the curb detection result. Therefore, a point cloud is prefiltered according to the result of the point cloud segmentation and just points labelled as "terrain" are used.

4.2 Feature Extraction

Curbs have certain geometric characteristics: they are perpendicular to the street, follow the course of the road and have a particular elevation as well as shape. These properties are encoded into handcrafted features and computed for each single point of the (remaining) point cloud. To ensure a proper curb detection result, multiple different spatial features are used per point and only if a point fulfils all descriptors, it is classified as curb. Three features are used in this thesis: difference in height along with the standard deviation of the height, curvature, and the angle of the point normal to the road.

4.2.1 Height Difference and Standard Deviation

The height of curbs usually does not exceed a certain limit and the elevation changes within a particular range. Therefore, features describing the height, or the change of it within a certain neighbourhood, are commonly used. This feature was proposed by Wang et al. [WWHY19]; it calculates the absolute difference in height in a certain neighbourhood of the currently contemplated point p and checks if this value lies within a certain range. Moreover, it also computes the standard deviation of the height values and restricts it with a lower bound. If both conditions are fulfilled, this feature states the point p as a curb point. The nearest neighbours n within half a meter of the point p are used and the height difference is calculated between the absolute minimum h_min and maximum h_max height values of the neighbourhood. Hence, the height difference is defined as:

T_{lower} \leq h_{max} - h_{min} \leq T_{upper} \qquad (4.1)

where the range is bounded with a lower and an upper threshold, T_lower and T_upper respectively. Figure 4.1 depicts this feature. The standard deviation of the height is calculated as follows:

\mu = \frac{1}{N}\sum_{i=1}^{N} n_i, \qquad \sqrt{\frac{1}{N}\sum_{i=1}^{N}(n_i - \mu)^2} \geq T_{std} \qquad (4.2)

where N is the amount of nearest neighbours and T_std is the lower threshold for the standard deviation. Note, Equations 4.1 and 4.2 are adapted from [WWHY19] to the notation used in this work. The thresholds used for this feature are denoted in meters and have the following values: T_lower = 0.03, T_upper = 0.3, T_std = 0.01.

Figure 4.1: Height feature: The difference between h_max and h_min indicates whether the point is a curb or not.
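Expressed in code, this feature reduces to a few lines. The following sketch assumes the z-values of all neighbours within half a meter of p have already been gathered in a NumPy array:

```python
import numpy as np

T_LOWER, T_UPPER, T_STD = 0.03, 0.3, 0.01   # thresholds in meters

def is_curb_by_height(neighbour_heights):
    """Height feature of Equations 4.1 and 4.2 on one neighbourhood."""
    h_diff = neighbour_heights.max() - neighbour_heights.min()
    return (T_LOWER <= h_diff <= T_UPPER) and (neighbour_heights.std() >= T_STD)
```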

4.2.2 Curvature

The shape of curbs is distinctive and therefore another good feature to detect sidewalks in point clouds. The curb shape can be transformed into a feature by analysing the point distribution within a certain neighbourhood. This feature was introduced by Fernández et al. [FILS14]; it computes the curvature in the neighbourhood of a point p by creating a weighted covariance matrix with the nearest neighbours and applying an eigenvalue decomposition. The curvature is then computed by dividing the smallest eigenvalue by the sum of all eigenvalues. If this value lies within a certain range, the point is marked as "curb". The nearest neighbours n within half a meter of the point p are used in this thesis to create the weighted covariance matrix, which is computed as follows:

\sigma = \frac{1}{N}\sum_{i=1}^{N} n_i
\mu = \frac{1}{N}\sum_{i=1}^{N} \lvert \sigma - n_i \rvert
w_i = \exp\!\left(-\frac{\lvert p - n_i \rvert^2}{\mu^2}\right)
COV = \sum_{i=1}^{N} w_i\,(n_i - p)^T (n_i - p) \qquad (4.3)

where N is the amount of nearest neighbours n and w_i is the weight value for the neighbour n_i. Then an eigenvalue decomposition is applied to this weighted covariance matrix, which results in three eigenvalues, where λ0 ≤ λ1 ≤ λ2. Finally, the curvature value is defined as:

T_{lower} \leq \frac{\lambda_0}{\lambda_0 + \lambda_1 + \lambda_2} \leq T_{upper} \qquad (4.4)

where a lower and an upper threshold, T_lower and T_upper respectively, define the range within which a point is classified as "curb". Note, Equations 4.3 and 4.4 are adapted from [FILS14]. Visually spoken, the covariance matrix encodes the point distribution of the neighbourhood around the point of interest. The eigenvalue decomposition of this matrix computes a coordinate system representing the orientation of this distribution. The eigenvectors are the three orthogonal principal axes of this coordinate system and the eigenvalues are the lengths of these vectors, which encode the underlying point variance along an axis. The curvature of the neighbourhood can then be described by putting these eigenvalues in relation to each other. Figure 4.2 illustrates the eigenvectors of a neighbourhood in their actual length in 2D. Comparing the result from a neighbourhood of a curb point (left) to the result from a neighbourhood of a road point (right), it can be seen that the length of the eigenvectors, and therefore the variance along the z-axis, is much smaller for the flat neighbourhood.

Figure 4.2: Result of an eigenvalue decomposition, once for a curb point (left) and once for a street-point (right) in 2D. The eigenvalue λ0 of a curb point is larger than the eigenvalue λ0 of a street-point.

The threshold values used for this feature are the following: T_lower = 0.009 and T_upper = 1.
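A direct NumPy transcription of Equations 4.3 and 4.4 could look like the sketch below; the neighbourhood query itself is assumed to have happened already and |·| is read as the Euclidean norm:

```python
import numpy as np

def is_curb_by_curvature(p, neighbours, t_lower=0.009, t_upper=1.0):
    """Weighted covariance of the 0.5 m neighbourhood, eigenvalue
    decomposition, smallest eigenvalue over the eigenvalue sum."""
    sigma = neighbours.mean(axis=0)                         # mean position
    mu = np.linalg.norm(sigma - neighbours, axis=1).mean()  # mean spread
    w = np.exp(-np.sum((p - neighbours) ** 2, axis=1) / mu ** 2)
    d = neighbours - p
    cov = (w[:, None] * d).T @ d            # sum_i w_i (n_i - p)^T (n_i - p)
    lam = np.linalg.eigvalsh(cov)           # ascending: lambda0 <= lambda1 <= lambda2
    return t_lower <= lam[0] / lam.sum() <= t_upper
```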


4.2.3 Perpendicularity

Curbs are always perpendicular to the street. Therefore, this characteristic is used to detect curbs too. This feature was proposed by Zhao et al. [ZY12]; it verifies if the normal of the point is perpendicular to the road. In this work, the road is simplified with a continuous line-strip. Therefore, the direction of the nearest line-segment to the point is taken to represent the road. The dot product between the point normal n and the direction of the road d is calculated and bounded by an upper threshold. Hence, the perpendicularity feature is defined as:

n \cdot d \leq 0.2 \qquad (4.5)

This corresponds to a minimum angle of 78.46 degrees between the road direction and the point normal. Figure 4.3 illustrates this feature.

Figure 4.3: Perpendicularity feature, where the angle between the point normal n and the direction of the road d is computed.
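This feature is a single dot product once the nearest road segment is known; taking the absolute value to guard against flipped normals is an assumption of this sketch:

```python
import numpy as np

def is_curb_by_perpendicularity(normal, road_dir, t=0.2):
    """Perpendicularity feature of Equation 4.5."""
    n = normal / np.linalg.norm(normal)
    d = road_dir / np.linalg.norm(road_dir)
    return abs(np.dot(n, d)) <= t        # arccos(0.2) is about 78.46 degrees
```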

Open Street Maps

Since we are dealing with LiDAR point clouds, we have information about terrain points, but we do not know the course of the road. However, there is a great interest in collecting geodata, including street data, with projects like OpenStreetMap [Opea]. The geodata is depicted with tags assigned to nodes, ways or relations. Streets are ways, which are formed by a list of nodes and represented with a key-value tag, like for example highway=*, where highway is the key. The value describes the type of the road and can have various classifications, like for example primary, secondary, residential, pedestrian, footway or cycleway. They can have further tags, describing properties such as name, speed limit, number of lanes, or whether the road is in a tunnel or on a bridge [Opeb], and many more. Figure 4.4 shows an overview of Cologne in OpenStreetMap.


Figure 4.4: Overview of Cologne in OpenStreetMap. (Source: [Opea])

The OpenStreetMap geodata of an arbitrary area can be exported as an XML file. Figure 4.5 shows a small area of Cologne including the Christian-Gau-Straße, which is a way with five nodes, highlighted in red. The according XML representation of this street is as follows:


Figure 4.5: Christian-Gau-Strasse in Cologne with its OpenStreetMap tag coloured in red. (Source: adapted from [Opea])

This street is represented as a way and has an id as meta-information. Attributes of a way are the nodes nd with an id as reference ref to them, and key-value pairs called tag, stating the name of the road as Christian-Gau-Strasse, the type as residential, a speed limit of 30 and, among other tags, also that this street is a one-way. A referenced node can look like this:

The node also has an id and GPS coordinates, lat and lon respectively, as meta-attributes. The latitude and longitude are represented in WGS84, the World Geodetic System 1984, but LiDAR data usually have a local coordinate system, like, in case of the data used in this thesis, EPSG:31467, also known as 3-degree Gauss-Kruger zone 3. Therefore, the GPS coordinates of the OpenStreetMap nodes must be converted into the other coordinate system, which is achieved by using the library ProjNET4GeoAPI [Net].

To find streets in an exported OpenStreetMap XML, the file is searched for the tag way and this tag is then checked for the attribute highway. If the tag contains this attribute, the value of this key-value pair is inspected to filter out ways like footway, steps, cycleway, pedestrian or service, because such ways are not of interest for us since they are not streets. Assuming that the tag contains a highway-attribute with a permitted value, the referenced nodes of this tag are searched in the XML. Their GPS positions are extracted, converted into the 3-degree Gauss-Kruger zone 3 coordinate system and saved in a file named with the id of the way. When those roads are needed later on, the positions of the according file are loaded into the system, where they form the line-strip of the road. Figure 4.6 shows the LiDAR point cloud of the Christian-Gau-Straße with its imported OpenStreetMap road data in red. A code sketch of this filtering follows Figure 4.6.


Figure 4.6: Christian-Gau-Strasse in Cologne with its imported OpenStreetMap road data.
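The thesis implementation does this in the Aardvark/.NET environment with ProjNET4GeoAPI; as an illustration only, the same filtering can be sketched in Python with the standard XML parser (the coordinate conversion to Gauss-Kruger zone 3 is omitted here):

```python
import xml.etree.ElementTree as ET

EXCLUDED = {"footway", "steps", "cycleway", "pedestrian", "service"}

def extract_streets(osm_xml_path):
    """Map each permitted way id to the WGS84 positions of its nodes."""
    root = ET.parse(osm_xml_path).getroot()
    nodes = {n.get("id"): (float(n.get("lat")), float(n.get("lon")))
             for n in root.iter("node")}
    streets = {}
    for way in root.iter("way"):
        tags = {t.get("k"): t.get("v") for t in way.iter("tag")}
        if tags.get("highway") and tags["highway"] not in EXCLUDED:
            refs = [nd.get("ref") for nd in way.iter("nd")]
            streets[way.get("id")] = [nodes[r] for r in refs if r in nodes]
    return streets
```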

4.3 False Positive Filtering

Machine learning is not perfect; the classifier does not perform well on some data samples. Points misclassified as "curb", although they are not, are called false positives. Such false positives can originate from nearby cars, buildings and other objects with a geometric structure similar to curbs. Hence, this step tries to identify and filter such false positive classified curb points. The false positives are filtered by these three steps:

• First, the curb points are clustered with a density-based clustering algorithm.

• Then, a line-filter is applied on each cluster and

• finally, the remaining points per cluster are used to fit lines. These lines are checked for parallelism to the road.

The first two steps are inspired by the work of Wang et al. [WWHY19]. They use a density-based clustering to split the points into the right and the left side and a RANSAC filter to detect false positives inside the road.


4.3.1 DBSCAN-Clustering

The points classified as curbs are clustered with the density-based clustering algorithm DBSCAN [EKS+96], which is able to detect clusters of arbitrary shape and noise. The following paragraph describing this algorithm, including Definition 4.6, is cited from [EKS+96]. The DBSCAN algorithm differentiates between three types of points: core points, border points and noise. Noise are points p of the dataset D with no cluster C_i assigned, defined as {p ∈ D | ∀i: p ∉ C_i}. Core and border points both belong to a cluster, whereby the latter are located at the border of the cluster. The main difference between them is the amount of neighbours found within the ε-neighbourhood, which is specified as N_ε(p) = {q ∈ D | dist(p, q) ≤ ε}, where dist(p, q) is a distance function between two points p and q. Border points usually have fewer ε-neighbours than core points. To form a cluster some constraints must be fulfilled: in a cluster every point p must be part of the ε-neighbourhood of a point q that contains a minimum amount of points minPts. This direct density-reachability is formally denoted as p ∈ N_ε(q) and |N_ε(q)| ≥ minPts, where the second condition is the core point condition. The next constraint is the density-reachability of two points p and q, where p is reachable from q over the points p_1, ..., p_n, thus every step from p_i to p_{i+1} must be directly density-reachable. The last constraint is the density-connection between two points p and q, where both of them are density-reachable from another point g. Finally, they define a cluster C by using Definition 4.6.

\forall p, q:\ \text{if } p \in C \text{ and } q \text{ is density-reachable from } p\text{, then } q \in C
\forall p, q \in C:\ p \text{ is density-connected to } q \qquad (4.6)

The curb points are clustered by using this algorithm with ε = 0.2 and minPts = 50, which means that an ε-neighbourhood of 20 centimetres must contain 50 points. These parameters are set relatively low, because LiDAR data are usually very dense and therefore curbs have more than enough points within these ranges. It is possible that the DBSCAN algorithm splits a curb into multiple clusters. This can originate from "holes" in the point cloud caused by occlusions from other objects during the scanning process. For example, in an urban point cloud curbs are often occluded by cars parking besides the street, which causes a disconnected sidewalk. However, this algorithm is used to filter false positive classified curb points by identifying them as noise. Therefore, it does not matter if, or into how many clusters, a curb is split by this algorithm. The resulting clusters are used in the next steps of the false positive filtering, but discarded afterwards. Besides that, this step and the following ones are performed only on the xy-positions of the points, since the line-filtering and the validation of the parallelism to the road are performed in 2D, which simplifies the problem, but is as accurate as in 3D.
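With scikit-learn, which provides a DBSCAN implementation, this clustering step can be sketched as follows (label -1 marks the noise points that are filtered out):

```python
from sklearn.cluster import DBSCAN

def cluster_curb_points(xy):
    """Cluster candidate curb points in 2D and drop the noise."""
    labels = DBSCAN(eps=0.2, min_samples=50).fit_predict(xy)
    keep = labels != -1
    return xy[keep], labels[keep]
```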


4.3.2 Line-Filtering

The resulting clusters from the previous step are now filtered further with a line-filter. This filter uses the RANSAC algorithm [FB81], where a random set S of size N is selected from the dataset D and a model M is fitted with the samples of S. The size N depends on the minimum amount of points the model needs. Then, the model is verified by calculating an error function between all points of D and M, and all points within a certain tolerance range form the consensus set S_c. If the size of S_c is greater than some threshold, a new model is computed with the consensus set. If not, this process is repeated until another random subset S fulfils the condition. In case no random subset S exceeds the threshold, the algorithm terminates after a defined number of iterations, whereupon either the best model so far is returned or the algorithm fails. The RANSAC implementation of this thesis uses a 2D line as model, which needs two points to be formulated. The problem is transformed from the 3D to the 2D space by omitting the z-axis and just using the x- and y-coordinates of the points. Visually spoken, it is like looking at the problem from a bird's eye perspective. This dimensionality reduction simplifies the problem, but does not influence the overall result in a negative way. As error function, the distance from the points in D to the line is used and the points in the consensus set are called inliers. If the number of inliers is greater than the best result so far, a new model based on the consensus set and a score are computed. The score gives a measurement of the accuracy of the model and is based on the normalized squared error of inliers from Choi et al. [CKY97], denoted as seen in Equation 4.7, where I is the set of inliers and Err is an error function between a point p_i and the estimated model M or the true model M*.

NSE(M) = \frac{\sum_{p_i \in I} Err(p_i; M)^2}{\sum_{p_i \in I} Err(p_i; M^*)^2} \qquad (4.7)

The score is then calculated as in Equation 4.8, which originates from the RANSAC implementation of the scikit-learn Python library [PVG+11] [BLB+13].

NSE(M) = \frac{\sum_{p_i \in I} (p_i - M(p_i))^2}{\sum_{p_i \in I} (p_i - \mu)^2}
Score(M) = 1 - NSE(M) \qquad (4.8)

The RANSAC filtering process performs the following steps for each cluster. Considering that the course of the road is approximated with a line-strip, the cluster is split according to the nearest line-segments of the road. The reason is that, if the road bends, the points of the curve will be part of the same cluster, since the conditions of the DBSCAN algorithm are fulfilled, but a line cannot approximate a curve well. The RANSAC algorithm is applied to each sub-cluster with a distance threshold of 15 centimetres. If the resulting model is accurate enough, which is achieved with a Score(M) ≥ 0.8, the inliers and the model are saved. This again filters points not complying with the distance threshold of the RANSAC algorithm.
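Using the scikit-learn RANSAC implementation that Equation 4.8 originates from, the per-sub-cluster line fit can be sketched as below. Modelling the line as a regression y = f(x) is a simplification of this sketch (it cannot represent vertical lines), and the score on the inliers corresponds to 1 - NSE(M):

```python
from sklearn.linear_model import LinearRegression, RANSACRegressor

def fit_curb_line(xy, residual_threshold=0.15, min_score=0.8):
    """Fit a 2D line through one sub-cluster; keep it only if accurate enough."""
    x, y = xy[:, :1], xy[:, 1]
    ransac = RANSACRegressor(LinearRegression(),
                             residual_threshold=residual_threshold)
    ransac.fit(x, y)
    inliers = ransac.inlier_mask_
    score = ransac.score(x[inliers], y[inliers])   # Score(M) of Equation 4.8
    return (ransac, inliers) if score >= min_score else None
```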


4.3.3 Parallelism to the Road

All fitted lines of the sub-clusters are combined into a single continuous line-strip by interpolating their corresponding endpoints. This new cluster line-strip is then checked for parallelism to the road by using the dot product between them, which is 1 if two vectors are parallel. If any dot product for a line-segment in the cluster is ≥ 0.99, the remaining inlier points of this cluster are marked as true positives and not filtered. All remaining possible curb points are labelled as "curb". In the next and last step, the geometry is fitted through these points.

4.4 Curb Fitting

The curb fitting is done in multiple steps, where the process orients itself on the course of the road line-strip. First, each curb point is assigned to its nearest road line-segment and then allocated to the left or right side of it. Through both sides, a model of a mathematical function is fitted with the RANSAC algorithm, where it is represented by a Lagrange polynomial. The standard case uses a polynomial of second degree, but for streets with a stronger curve, a polynomial of a higher degree is applied. This step uses a polynomial of a higher degree and not a line, because we need a good representation of the curb here, since we are now computing the final geometry. The error function is the distance from the point to the model and the distance threshold is 0.1 meter. Furthermore, the function only uses the x- and y-coordinates, because this step calculates the pathway of the curb and not its height. However, a function of higher degree is not so handy to work with afterwards. Therefore, it is subdivided into a continuous line-strip with a fixed step size of two meters in the standard case. This depicts the actual course well, even with curves.
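Ignoring the RANSAC outlier handling and the Lagrange formulation, the core of this step, fitting a low-degree polynomial through the curb points of one road side and subdividing it into a line-strip, can be sketched like this (stepping along the x-axis instead of the arc length is a simplification of this sketch):

```python
import numpy as np

def fit_curb_line_strip(xy, degree=2, step=2.0):
    """Least-squares polynomial y = f(x) through one side's curb points,
    sampled into a continuous line-strip with a fixed step size in meters."""
    coeffs = np.polyfit(xy[:, 0], xy[:, 1], deg=degree)
    xs = np.arange(xy[:, 0].min(), xy[:, 0].max() + step, step)
    return np.column_stack([xs, np.polyval(coeffs, xs)])
```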

Figure 4.7: A curb segment with the fitted lower and upper edges. u0 and u1 denote the endpoints of the upper edge and l0 and l1 denote the endpoints of the lower edge.


To calculate the height of the curb, the best-fit lines themselves and their inlier points are used. For each curb line-segment, the nearest inliers are taken to find the minimum and maximum height values in the area of the ends of the line. The point cloud could have holes in the area of the curb, caused by parking cars for example, but we want to close those holes and create a continuous curb geometry. Therefore, we start searching within a neighbourhood of 0.1 meter and as long as we do not find any curb points, this area is enlarged by 0.1 meter each iteration. With the line-segments representing the course of the curb and the detected minimum as well as maximum height values of the points constituting the ends of the curb, the upper and lower edges of the curb are constructed, which can be seen in Figure 4.7.

The final geometry model includes polygons for the curbs, the sidewalk and the street. Figures 4.7 to 4.10 show the curb in green, the sidewalk in blue and the road in grey. For each side of the road, the polygon line-strip with the upper and lower edges, u and l respectively, is processed. The curb polygons are constructed from the edge points of the according lines. The endpoints of the lower edge as well as the endpoints of the upper edge form a curb polygon C_i with the vertices l0, l1, u0 and u1. Figure 4.8 visualizes a section of a curb with three final curb polygons C0, C1 and C2.


Figure 4.8: A section of a curb with the final polygons C0, C1 and C2 of three curb segments.

For each sidewalk polygon, the terrain points in a neighbourhood of two meters are split into pavement and street points by using the middle height of the lower and upper edges. The points above this height value are sidewalk points and the points below are street points. A plane is fitted through these points using the RANSAC algorithm with a distance threshold of 0.005 meter. The sidewalk polygons are constructed by using the endpoints of the upper edge, and positions two meters perpendicular to them, resulting in a polygon SW_i with vertices u0, u1, s0 and s1. To create a continuous sidewalk geometry, the perpendicular points are then interpolated accordingly. The resulting sidewalk polygons SW0, SW1 and SW2 of this step are visualized in Figure 4.9.



Figure 4.9: A section of a curb with the final polygons of three curb segments with the according sidewalk polygons SW0, SW1 and SW2.

Finally, the road polygon is created by using the endpoints of the lower edge-lines of the curbs. Figure 4.10 shows this polygon with vertices r_i in pink, created from the according endpoints of the line-segments l0 and l1 respectively.

Figure 4.10: The final polygons of a curb section from three curb as well as sidewalk segments and the road polygon, which is constituted by the vertices r_i.


CHAPTER 5

Neural Network Training

Training a Convolutional Neural Network needs a lot of fine-tuning. Therefore, a lot of work was put into finding the best dataset sizes, network architecture and hyperparameters. The results of the network training are shown in this chapter, whereby Section 5.1 gives a short overview of the assembly of the used dataset, Section 5.2 describes the used hyperparameters and why they were chosen, and finally Sections 5.3, 5.4 as well as 5.5 discuss the training results of three different networks respectively. The following system setup was used for the training and classification: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz, 64 GB RAM and GeForce RTX 2080 Ti. The network was implemented with Keras [C+15] and Tensorflow [AAB+15] in Python.

5.1 Datasets

The Semantic3d training set is used for the network training. It is split into an actual training, validation and test set, where the training set is used to learn the data distribution and the validation set to tune the network's hyperparameters. The test set is used to verify the performance of the resulting classifier on unseen data. Hence, for this chapter we refer to the Semantic3d training set as the dataset. The distribution of the classes should be equal in size. If there are not enough data samples of a class available, data augmentation is used to create more samples artificially. In the training set, the amount of data samples per class is 10 000, in the validation set 3 000 and in the test set 630. This results in a total size of 80 000 for the training set and 24 000 as well as 5 040 for the validation and test set respectively. The composition of these sets is


visualized in Figure 5.1 with the training set including data augmentations in blue, the validation set as well as its data augmentations in green, and the test set in red.


Figure 5.1: Class distribution of the training, validation and test set including data augmentations.

5.2 Hyperparameter-Tuning

A convolutional neural network includes many hyperparameters, which need to be tuned individually. This begins with the actual architecture of the network and ranges from the amount of neurons in each layer and the grid size up to dropout or learning rates. To get the very best out of a network, many different combinations were evaluated. Furthermore, in order to be statistically representative, the network was trained and evaluated ten times with each setting; the results are an average value of those runs. Note that the precise results of these hyperparameter-tuning tests are listed in Appendix A. However, some parameters were the same for all experiments: To improve the generalization of the network, early stopping was used. It stops the training process prematurely if the validation loss has not improved over the last seven epochs. Categorical cross-entropy is used as loss function and every time it improves, the weights of the network are saved. Therefore, once the training is finished, the weights of the best epoch

are used for the classification. As optimizer, AdaMax with its default parameters, 0.002 for the learning rate and β1 = 0.9 as well as β2 = 0.999, was used. The most comprehensive tests were made with a grid size of 16 for the middle node and 8 for the side cubes, for the simple reason that this is the smallest dataset according to memory consumption, and essential things, like the occupancy values, can be quickly tested and possibly discarded in further experiments. A feature learning component consists of convolutional and pooling operations and has the following structure: CONV - BN - ReLU - CONV - BN - ReLU - POOL. First, the impact of the amount of such components on the middle nodes as well as side cubes was explored. The pooling operation has a kernel size of 2 × 2; this scales the input down by half. That means for an input size of 16, one pooling operation results in a grid size of 8, two pooling operations result in a grid size of 4 and so on. To explore the impact of the amount of feature learning components on the middle nodes, the network was trained just on them. One component achieved a slightly better performance than two or three stacked components. In particular, it was about 0.86% better than two and 1.26% better than three. This is also the case for a different amount of feature learning components in combination with side cubes: two stacked components perform slightly worse than just one. The reason for that is probably that more components downscale the input too much and lose information in this process. However, the increase in accuracy between a network trained just on middle nodes and a network also including the neighbourhood information is 6.47%. In further tests, the amount of feature learning components was set to one for the middle cube and one for the side cubes, since this combination works best. Three different occupancy values were evaluated:

• A binary occupancy value, where the grid cell either contains 1 or 0, depending on whether there are any points inside the cell or not,

• the absolute amount of points inside a cell and

• the ratio of the number of points inside the grid cell to the total amount of points in the node.

Note that the previous tests were performed with binary occupancy values, as it was determined very early that this mode works best. The total amount of points as occupancy value performs about 3.75% worse on the test set and the fractional occupancy value does not work at all. The network is not able to minimize the loss and therefore cannot learn anything; the accuracy stays around 33%. The reason behind this is that the average amount of points inside a grid cell is 1.76 and 15.92 for middle and side cells respectively, while the average total amount of points is 7221.31 for middle and 8149.96 for side cells. Building the fraction with the according values results in values near zero, which are probably too small to learn a mapping from data samples to class membership.


Furthermore, additional features describing properties of the middle cell were concatenated to the resulting feature vector of the feature learning phase and evaluated as well. Those features involve the absolute height of a cell, the average height of all points inside a cell, the average point normal and the average point colour. The additional information about the height of the box center increases the accuracy by about 1.49% and the average height of points inside a box results in a better performance by about 1.68%. The average normal of points in a box does not increase the result much, just about 0.07%. However, the average point colour has a much higher impact on the accuracy of the network; it is increased by 2.37% by adding this feature.

Further hyperparameters are the dropout value and the amount of neurons in the classification part of the network. A dropout rate of 0.2 for the input and 0.5 for the hidden layer with 2048 and 4096 neurons works best. Furthermore, 128 neurons in the convolutional block work best for the middle cube and 64 neurons for the side cubes. L2-regularization with a value of 0.0001 is used. This value does not reduce overfitting of the network much, but on the other hand, it also does not decrease the accuracy too much (about 1%, instead of 7.5% with a regularization value of 0.01).

5.3 Network Using Small Grid Size

This network architecture has a grid size of 16 for the middle cube as well as 8 for the side cubes and consists of the following layers: CONV - BN - ReLU - CONV - BN - ReLU - POOL - DO - FC - DO - FC. The convolutional layers have 128 neurons for the middle and 64 for the side cubes. The first fully connected layer has 2048 neurons and the second one has 4096. The first dropout layer has a dropout rate of 0.2 and the second one of 0.5. Furthermore, the binary mode and the additional feature of average point colour were used. The architecture of this network is visualized in Appendix B.1.

The network performance was validated with the metric.py script from the Semantic3d dataset [Sem] on the test set. It calculates the Intersection over Union (IoU), the Overall Accuracy (OA) and also a confusion matrix. This network achieves an average IoU of 68.64% and an OA of 80.87%, which is shown together with the IoU of the eight single classes in Table 5.1. There it can be seen that the classes "man-made terrain" and "natural terrain" are classified quite well, but the other classes do not perform as well, especially "low vegetation" and "hard scape". The confusion matrix is shown in Table 5.2, where MMT is "man-made terrain", NT is "natural terrain", HV is "high vegetation", LV is "low vegetation", B is "buildings", HS is "hard scape", A is "artefacts" and C is "cars". By taking a closer look at this confusion matrix, it can be seen that a lot of misclassifications occur between "high vegetation" and "low vegetation", "buildings" and "hard scape" as well as "cars" and "hard scape". Even if the terrain classes perform quite well, most of their misclassifications are made between those two classes.


                          8 classes   6 classes
classes                         IoU in %
man-made terrain          83.29       90.6
natural terrain           84.55
high vegetation           67.10       82.04
low vegetation            54.14
buildings                 65.34       66.02
hard scape                51.52       53.4
artefacts                 74.93       74.55
cars                      68.24       66.09
average over all classes  68.64       72.12
                                OA in %
overall accuracy          80.87       85.69

Table 5.1: IoU of the single classes, average IoU and OA of the test set classified with the network using small grid size. Once with the original eight classes and once with the combined six classes.

       MMT    NT     HV     LV     B      HS     A      C
MMT    91.75  4.60   0.00   0.32   0.95   1.59   0.16   0.63
NT     3.17   92.06  0.00   3.49   0.63   0.48   0.00   0.16
HV     0.00   0.16   81.27  14.76  1.27   1.11   0.95   0.48
LV     0.79   2.70   15.40  70.63  2.70   4.44   0.95   2.38
B      1.59   0.16   0.95   1.90   78.41  9.84   3.65   3.49
HS     1.59   0.79   1.59   5.56   10.16  67.30  3.17   9.84
A      0.79   0.16   1.75   1.90   2.54   5.87   83.02  3.97
C      2.22   0.32   1.43   2.54   1.75   7.30   1.90   82.54

Table 5.2: Confusion matrix of the test set classified with the network using small grid size. (Note: values are in %.)

Since, for the purpose of this thesis, the distinction between the classes "man-made terrain" and "natural terrain" as well as "high vegetation" and "low vegetation" is not relevant, they are combined into the single classes "terrain" and "vegetation". Another network was trained on these six classes, which performs better than the network using eight classes. For this network the IoU is 72.12% and the OA is 85.69%, which are 3.48% as well as 4.82% better than with eight classes. The IoU values for the single classes are also shown in Table 5.1. Looking at this table, it can be seen that, regarding the IoU, the single class "terrain" performs 6.68% better than the average of the classes "man-made terrain" and "natural terrain" and the single class "vegetation" performs 21.42% better than the average of the classes "high vegetation" and "low vegetation". The other classes perform a little bit better or worse, but within an acceptable range. Looking at the confusion matrix of this network using six classes in Table 5.3, it can


be seen that the single class "terrain" performs 3.72% better than the average of the classes "man-made terrain" and "natural terrain". The single class "vegetation" performs 16.83% better than the average of the classes "high vegetation" and "low vegetation" too. The accuracy for the other classes is up to 3% better or up to 4% worse, but also this can be accepted due to the nearly 17% performance increase of the "vegetation" class. Furthermore, it should be mentioned that for this network using six classes the data samples of the terrain and vegetation classes were combined, which results in an amount of data samples twice as big as for the other classes. Moreover, a network using six classes with an equal class distribution as well as a network using six classes with an unequal class distribution, but with class weights, were trained. Nevertheless, those networks performed slightly worse. Therefore, the further networks using six classes in this thesis are trained with an unequal class distribution without class weights.

            terrain  vegetation  buildings  hard scape  artefacts  cars
terrain     95.63    2.22        0.56       0.87        0.08       0.63
vegetation  1.43     92.78       0.95       2.14        1.43       1.27
buildings   1.90     3.97        75.87      10.32       4.13       3.81
hard scape  3.17     8.25        7.46       68.57       5.40       7.14
artefacts   0.48     3.17        2.06       4.60        86.03      3.65
cars        2.70     6.35        2.38       7.46        2.86       78.25

Table 5.3: Confusion matrix of the test set classified with the network using small grid size and six classes. (Note: values are in %.)

5.4 Network Using Middle Grid Size

This network architecture has a grid size of 16 for the middle as well as the side cubes; apart from that it is the same as the network with grid sizes 16 and 8, respectively, except for the amount of neurons in the dense layers and the batch size, which were reduced due to memory limitations. They are decreased to 1024 for the first and 2048 for the second dense layer, and to 64 for the batch size. Moreover, the additional feature of average point colour was used in this network too. The architecture of this network is visualized in Appendix B.2.

For the network using eight classes, the overall accuracy is 82.14%, which is 1.27% better than with the network using small grid size, and the average IoU is 1.63% better with the increased grid size for the side cubes. The IoU values for each class are shown in Table 5.4; they are all better with this network except for the class “natural terrain”, which performs 1.27% worse. The confusion matrix for the network using eight classes is displayed in Table 5.5. There it can be seen that the classes “man-made terrain”, “high vegetation”, “hard scape”, “artefacts” and “cars” perform better than with the network using small grid size. Especially the two classes “high vegetation” and “artefacts” increased, by about 4.44% and 5.55% respectively. On the other hand, the classes “natural terrain”, “low vegetation” and “buildings” performed slightly worse, with a maximum of 2.69% for the class “natural terrain”. However, most misclassifications occur between the two terrain and the two vegetation classes as well as between the classes “buildings” and “hard scape”.

classes                     IoU in %, 8 classes   IoU in %, 6 classes
man-made terrain            83.67                 90.00 (terrain)
natural terrain             83.28
high vegetation             69.95                 83.76 (vegetation)
low vegetation              56.08
buildings                   66.90                 67.57
hard scape                  53.42                 53.91
artefacts                   78.59                 76.30
cars                        70.25                 72.06
average over all classes    70.27                 73.93
overall accuracy (OA in %)  82.14                 86.69

Table 5.4: IoU of the single classes, average IoU and OA of the test set classified with the network using middle grid size, once with the original eight classes and once with the combined six classes.

      MMT    NT     HV     LV     B      HS     A      C
MMT   92.70  3.65   0.00   0.00   0.32   1.90   0.32   1.11
NT    4.76   89.37  0.32   3.17   0.63   0.79   0.16   0.79
HV    0.00   0.00   85.71  11.43  0.32   0.79   0.48   1.27
LV    0.63   2.22   18.10  70.32  1.43   2.70   1.43   3.17
B     1.90   0.48   1.59   1.11   77.30  10.63  3.02   3.97
HS    1.27   0.48   0.95   5.87   9.52   68.10  5.08   8.73
A     0.95   0.16   0.63   1.75   2.06   3.81   88.57  2.06
C     1.27   0.32   0.95   2.06   1.27   6.83   2.22   85.08

Table 5.5: Confusion matrix of the test set classified with the network using middle grid size. (Note: values are in %.)

This network was also trained with six classes and performs 4.55% better in the OA and 3.66% better in the average IoU. The IoU of the single classes is also increased: by about 6.53% for the class “terrain” compared to the average of the classes “man-made terrain” and “natural terrain”, and by about 20.75% for the single class “vegetation” compared to the average IoU of the classes “high vegetation” and “low vegetation”. Moreover, “buildings”, “hard scape” and “cars” have a slightly better IoU, up to 1.81% higher than with the network using eight classes. The class “artefacts” performs 2.29% worse compared to the network using eight classes. However, the increase in IoU for all other classes outweighs this rather small decrease in the class “artefacts”. The precise values of the OA and IoU are shown in Table 5.4.

The confusion matrix of the network using six classes is displayed in Table 5.6. There it can be seen that this network performs 4.68% better with the combined class “terrain” and 14.52% better with the combined class “vegetation” compared to the network using eight classes. The other classes perform either slightly better or worse; to be precise, the classes “buildings” and “artefacts” both perform 2.22% worse, and the classes “hard scape” and “cars” have an increased accuracy of about 1.9% and 0.48% respectively. Most misclassifications of the network using six classes occur between the classes “buildings” and “hard scape”, which is probably caused by the similar structure of buildings and low walls. The misclassification rate between the classes “hard scape” and “cars” is also high, at about 9%.

Compared to the accuracies of the network using small grid size and six classes, the network using middle grid size shows an increase in the OA of about 1% and in the average IoU of about 1.81%. By comparing the confusion matrices of those two networks, it can be seen that the accuracy is also increased for most classes: the classes “terrain” and “artefacts” perform just slightly better than with the network using small grid size, by about 0.08% and 0.32% respectively. The class “hard scape” performs 1.43% better too, and the class “cars” has an increase of about 7.31% in correct classifications. The remaining two classes, “vegetation” and “buildings”, perform marginally worse than with the network using small grid size, namely by about 0.24% and 0.79% respectively. However, this decrease in accuracy is negligible in comparison to the increase for all other classes, especially the increase of about 7% for the class “cars”.

            terrain  vegetation  buildings  hard scape  artefacts  cars
terrain     95.71    1.90        0.32       1.19        0.24       0.63
vegetation  1.67     92.54       0.95       1.98        1.19       1.67
buildings   2.86     4.13        75.08      11.90       3.49       2.54
hard scape  2.86     6.67        6.19       70.00       5.24       9.05
artefacts   0.79     3.02        1.43       5.87        86.35      2.54
cars        2.86     3.33        0.95       5.71        1.59       85.56

Table 5.6: Confusion matrix of the test set classified with the network using middle grid size and six classes. (Note: values are in %.)

5.5 Network Using Big Grid Size

Here, the grid size for the middle cube is 32 and for the side cubes 16. This network is composed of two stacked feature learning components with 64 and 128 neurons, respectively, for the middle cube, and one feature learning component with 64 neurons for the side cubes. The first dense layer consists of 1024 and the second one of 2048 neurons, like in the network using middle grid size, and a batch size of 64 was used too. The rest of the network and the hyperparameters stayed as in the networks before, and the additional feature of average point colour was used as well.

The network using big grid size achieves about the same overall accuracy for the eight-class problem as the network using middle grid size; it is just 0.7% better. By looking at the IoU values, which are shown in Table 5.7, it can be seen that, in comparison to the network using middle grid size, this network performs slightly better for most classes, with up to 3.97% for the class “cars”. On the other hand, the classes “high vegetation” and “artefacts” perform worse, by 0.14% and 0.74% respectively. Furthermore, the average IoU over all classes for the network using big grid size is about 1.06% better than with the network using middle grid size.

classes                     IoU in %, 8 classes   IoU in %, 6 classes
man-made terrain            84.58                 90.26 (terrain)
natural terrain             84.09
high vegetation             69.81                 84.38 (vegetation)
low vegetation              57.99
buildings                   67.88                 66.19
hard scape                  54.20                 51.91
artefacts                   77.85                 77.42
cars                        74.22                 73.82
average over all classes    71.33                 74.00
overall accuracy (OA in %)  82.84                 86.81

Table 5.7: IoU of the single classes, average IoU and OA of the test set classified with the network using big grid size, once with the original eight classes and once with the combined six classes.

The confusion matrix of the network using big grid size and eight classes is shown in Table 5.8. There it can be seen that most misclassifications still occur between the classes “high vegetation” and “low vegetation” as well as between “buildings” and “hard scape”. In comparison to the network using middle grid size and eight classes, the accuracies of the single classes are slightly worse for “man-made terrain”, “high vegetation” and “buildings”, by about 1.27%, 2.38% and 0.16% respectively. On the other hand, the classes “natural terrain”, “low vegetation”, “hard scape”, “artefacts” and “cars” perform better with the network using big grid size, with at most 3.97% for the class “low vegetation”.

In addition, the network using big grid size was also trained with six classes instead of eight. As can be seen in Table 5.7, the network using six classes has an overall accuracy of 86.81%, which is 3.97% better than with the network using eight classes. The IoU value of the combined class “terrain” is 5.93% better than the average of the classes “man-made terrain” and “natural terrain”. The IoU value of the combined class “vegetation” is 20.48% better than the average of the classes “high vegetation” and “low vegetation”. The IoU values of the remaining classes are slightly worse than with the network using eight classes, with at most 2.29% for the class “hard scape”. However, the average over all IoU values is increased by 2.67% with the network using six classes. This can also be seen by comparing the two confusion matrices (see Table 5.8 for the network using eight classes and Table 5.9 for the network using six classes): the accuracy of the combined class “terrain” is 5.32% better than the average accuracy of the classes “man-made terrain” and “natural terrain”. The combined class “vegetation” has an increase in accuracy of 14.21% compared to the average of the classes “high vegetation” and “low vegetation”. Furthermore, the class “buildings” performs 3.34% better too, but the classes “hard scape”, “artefacts” and “cars” perform worse, with at most 5.08% for the class “hard scape”. Moreover, it can also be seen that misclassifications between the classes “buildings” and “hard scape” remain very high for both the eight- and six-class networks.

      MMT    NT     HV     LV     B      HS     A      C
MMT   91.43  4.44   0.00   0.00   0.79   1.59   0.63   1.11
NT    3.17   90.63  0.00   4.44   0.48   1.11   0.00   0.16
HV    0.00   0.00   83.33  13.02  0.63   1.43   1.59   0.00
LV    0.32   2.38   15.56  74.29  1.27   3.49   1.27   1.43
B     1.43   0.32   1.11   2.54   77.14  10.95  3.49   3.02
HS    1.27   0.32   1.27   5.24   7.94   69.68  5.71   8.57
A     0.95   0.16   0.32   1.11   1.27   4.29   89.84  2.06
C     0.95   0.16   1.11   1.75   1.27   5.71   2.70   86.35

Table 5.8: Confusion matrix of the test set classified with the network using big grid size. (Note: values are in %.)

            terrain  vegetation  buildings  hard scape  artefacts  cars
terrain     96.35    1.59        0.56       1.03        0.24       0.24
vegetation  2.14     93.02       1.43       1.51        1.11       0.79
buildings   2.70     4.13        80.48      7.30        3.81       1.59
hard scape  3.02     7.30        13.33      64.60       5.24       6.51
artefacts   1.27     2.06        3.02       4.13        88.73      0.79
cars        2.22     3.81        1.27       7.94        2.86       81.90

Table 5.9: Confusion matrix of the test set classified with the network using big grid size and six classes. (Note: values are in %.)

By comparing the network using big grid size and six classes with the network using middle grid size and six classes, it can be seen that the network using big grid size has an increase in accuracy of just 0.12%, and the average IoU is also just 0.06% better. The IoU values for the network using big grid size are slightly better for the classes “terrain”, “vegetation”, “artefacts” and “cars”, but the classes “buildings” and “hard scape” perform worse, with at most about 2% for the class “hard scape”. By comparing the accuracies of the single classes in the confusion matrices of Table 5.6 for the network using middle grid size and Table 5.9 for the network using big grid size, it can be seen that the classes “terrain”, “vegetation”, “buildings” and “artefacts” perform better with the network using big grid size, with at most 5.40% for the class “buildings”. The remaining classes “cars” and “hard scape” perform worse, likewise with at most 5.40%, for the class “hard scape”.

As can be seen in Table 5.10, the overall accuracy increases with an increased grid size, whereby the largest increase is between the networks using the small and the middle grid sizes. This is probably due to the increased grid size of the neighbouring cells. Besides that, the networks using six classes perform better than the networks using eight classes for all three grid sizes, whereby the greatest difference between the accuracies is again from the network using small grid size to the network using middle grid size. Since the network using big grid size performs better than the networks using small and middle grid sizes, it is used for the semantic segmentation of the point cloud. Figure 5.2 visualizes the architecture of this network in more detail.

                                8 classes          6 classes          difference
                                OA      increase   OA      increase
network using small grid size   80.87   -          85.69   -          4.82
network using middle grid size  82.14   1.27       86.69   1.00       4.55
network using big grid size     82.84   0.70       86.81   0.12       3.97

Table 5.10: Overall accuracies of the three different networks, the increase in OA between the networks, and the difference between the six- and eight-class networks. (Note: values are in %.)


Middle cube path (INPUT 32x32x32):
  feature learning component 1: CONV 64 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 64 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2)
  feature learning component 2: CONV 128 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 128 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2)
  -> GLOBAL AVG POOL

Side cube path (INPUT 16x16x16):
  feature learning component: CONV 64 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 64 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2)
  -> GLOBAL AVG POOL

COLOR FEATURE -> BN

All outputs -> CONCATENATED FEATURE VECTOR -> DO 0.2 -> FC 1024 -> DO 0.5 -> FC 2048 -> SOFTMAX -> OUTPUT

Figure 5.2: Architecture of the network using big grid size.
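Read from Figure 5.2, the selected architecture can be summarized in a short model sketch. The following Keras code is an illustration only, assuming occupancy-grid inputs and, for brevity, a single side-cube input standing in for the six neighbouring cubes; ReLU activations in the dense layers are likewise an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def feature_component(x, neurons):
    """One feature learning component: (CONV -> BN -> ReLU) x 2 -> max pooling."""
    for _ in range(2):
        x = layers.Conv3D(neurons, 3, strides=1, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return layers.MaxPool3D(pool_size=2, strides=2)(x)

def big_grid_network(num_classes=6):
    middle = tf.keras.Input((32, 32, 32, 1))   # middle cube occupancy grid
    side = tf.keras.Input((16, 16, 16, 1))     # one neighbouring cube (simplified)
    colour = tf.keras.Input((3,))              # average point colour feature

    m = feature_component(middle, 64)          # two stacked components
    m = feature_component(m, 128)              # for the middle cube
    m = layers.GlobalAveragePooling3D()(m)
    s = layers.GlobalAveragePooling3D()(feature_component(side, 64))
    c = layers.BatchNormalization()(colour)

    x = layers.Concatenate()([m, s, c])
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(2048, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model([middle, side, colour], out)
```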

CHAPTER 6

Results

This chapter is dedicated to the results of the semantic segmentation using 3D CNNs as well as the results of the curb fitting. Section 6.1 shows results of the semantic segmentation on the Semantic3d training and test sets. Section 6.2 presents the outcome of the semantic segmentation as well as of the curb fitting on a real-world dataset of a city in Germany. Finally, a critical reflection is given in Section 6.3, where both the 3D CNNs and the curb fitting are briefly discussed.

6.1 Semantic Segmentation of the Semantic3d Dataset

A segmented point cloud can either be evaluated numerically, if ground truth class labels are available, as for the Semantic3d dataset, or visually, if no ground truth class labels exist, as for the Cologne dataset. The numerical evaluation is performed with the same metrics as in Chapter 5, but pointwise instead of per box.

For visual evaluation, the points of the point cloud are coloured according to their class affiliation. “Man-made terrain” is grey, “natural terrain” is light green, “high vegetation” is green, “low vegetation” is dark green, “buildings” are blue, “hard scape” is orange, “cars” are pink and “artefacts” are brown-pink. In the case of six classes instead of eight, the combined “terrain” class is coloured grey and the combined “vegetation” class is coloured green. The classes and their colours are listed in Table 6.1, where the left column shows the classes for eight classes and the right column for six classes; a minimal lookup sketch follows the table.


8 classes           6 classes
man-made terrain    terrain
natural terrain
high vegetation     vegetation
low vegetation
buildings           buildings
hard scape          hard scape
artefacts           artefacts
cars                cars

Table 6.1: The points of the point cloud are coloured according to their assigned classes. Left: classes and colours for eight classes. Right: classes and colours for six classes with the combined “terrain” and “vegetation” classes.
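A hedged sketch of such a lookup for the six-class case; the concrete RGB values here are illustrative assumptions chosen to match the colour names above, not values from the thesis.

```python
# Illustrative class-to-colour lookup for visual evaluation (RGB in [0, 1]);
# the exact values are assumptions matching the colour names in the text.
CLASS_COLOURS_6 = {
    "terrain":    (0.5, 0.5, 0.5),  # grey
    "vegetation": (0.0, 0.6, 0.0),  # green
    "buildings":  (0.0, 0.0, 1.0),  # blue
    "hard scape": (1.0, 0.6, 0.0),  # orange
    "artefacts":  (0.7, 0.4, 0.5),  # brown-pink
    "cars":       (1.0, 0.4, 0.7),  # pink
}

def colour_points(labels):
    """Map per-point class labels to display colours."""
    return [CLASS_COLOURS_6[label] for label in labels]
```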

6.1.1 Semantic3d Training Set

The 15 point clouds of the Semantic3d training set were used for training the 3D CNN. However, not the entire point clouds were used for training, but only parts of them. Since the positions of the boxes that were used for training are known, the points of a point cloud can be split into two sets: used for training and not used for training. These two sets can then be evaluated numerically in separation.
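A minimal sketch of this split, assuming axis-aligned training cubes of one metre side length given by their minimum corners (the names and array shapes are illustrative):

```python
import numpy as np

def split_by_training_boxes(points, box_minima, box_size=1.0):
    """Split a point cloud into points inside / outside the training cubes.

    points:     (N, 3) array of xyz coordinates
    box_minima: (M, 3) array with the minimum corner of each training cube
    box_size:   cube side length in metres
    """
    used = np.zeros(len(points), dtype=bool)
    for corner in box_minima:
        inside = np.all((points >= corner) & (points < corner + box_size), axis=1)
        used |= inside                     # mark points inside this cube
    return points[used], points[~used]
```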

The results of this evaluation are shown in Table 6.2, where it can be seen that the network using eight classes has nearly the same accuracy for points used and not used for training, with about 90% and 91% respectively. Most classes perform similarly in those two sets, except for “low vegetation” and “hard scape”; they show a deviation of about 10% in favour of the points used for training. During the training, it became apparent that the network has problems distinguishing “high vegetation” and “low vegetation” due to their similarity. However, most misclassifications of “low vegetation” in points not used for training fall into “hard scape”. The class “hard scape” does not perform as well as the other classes in the training set either. Misclassifications occur in both directions, which is also evident in the IoU of this class of about 63%. For points not used for training this problem has stronger effects and the IoU is about 24%. This is probably caused by many different things being labelled as “hard scape”, such as walls, fences, flagpoles, street lights and fountains, to name a few. All these things have a diverse structure in the point cloud, which probably confuses the network. A targeted relearning and refinement of this class could probably solve this problem.

Furthermore, the classes “artefacts” and “cars” differ by about 4-5% between points used and not used for training, whereby “artefacts” perform better on points not used for training and “cars” on points used for training. “Cars” have an accuracy of around 98% and 93% in points used and not used for training respectively; this class may have been slightly overfitted during training. “Artefacts” do not perform well in either set, with an accuracy of approximately 33% for points used and 38% for points not used for training. In points used for training, most misclassifications occur between this class and “buildings” as well as “hard scape”. In points not used for training, most misclassifications occur between this class and “man-made terrain” as well as “hard scape”. The misclassifications between “buildings” and “hard scape” may be caused by the similar structure of exterior walls of buildings and low walls or fences. The misclassifications between “man-made terrain” and “hard scape” probably result from the cube data structure. By looking at the classified point cloud, it becomes apparent that many “artefacts” are caused by passing people and are thereby located on the ground. If a cube is classified as “man-made terrain” but has “artefacts” in it, those “artefacts” are wrongly labelled as “man-made terrain”.

                           used for training        not used for training
                           8 classes   6 classes    8 classes   6 classes
Accuracy in %
man-made terrain           94.05       94.33        96.76       97.71
natural terrain            85.43                    89.24
high vegetation            92.10       89.48        98.69       96.74
low vegetation             96.43                    87.27
buildings                  93.54       95.39        90.36       92.56
hard scape                 81.01       78.02        68.62       59.50
artefacts                  33.45       24.39        37.82       40.33
cars                       97.71       95.35        92.72       91.64
overall accuracy           90.25       93.73        91.07       95.51
IoU in %
man-made terrain           89.22       90.78        95.74       96.30
natural terrain            81.05                    84.54
high vegetation            87.32       89.05        93.86       92.39
low vegetation             88.11                    43.07
buildings                  80.43       76.12        88.90       90.89
hard scape                 63.87       65.02        24.27       28.95
artefacts                  30.26       22.85        20.38       24.99
cars                       65.35       68.60        6.84        8.99
average over all classes   73.20       70.36        55.98       57.09

Table 6.2: Final semantic segmentation results of the Semantic3d training set. (Note: values are in %; the six-class values in the terrain and vegetation rows refer to the combined “terrain” and “vegetation” classes.)

Furthermore, it can be seen in Table 6.2 that the network using six classes also has nearly the same accuracy for points used and not used for training, with about 94% and 96% respectively. “Terrain”, “vegetation”, “buildings” and “cars” perform almost equally in both sets, with about 1-4% difference. “Hard scape” and “artefacts” again show a big difference in accuracy, whereby “hard scape” performs about 19% better for points used for training and “artefacts” about 16% worse. “Hard scape” points are misclassified across all classes and have corresponding IoU values of about 65% and 29% for points used and not used for training respectively. The reason for this is most probably the same as for the network using eight classes: points labelled as “hard scape” belong to different objects with diverse structures, and this problem is amplified for unknown points. “Artefacts” perform badly in the network using six classes too, with just about 24% and 40% accuracy for points used and not used for training respectively. Most misclassified “artefacts” in points used for training occur between this class and “buildings” as well as “hard scape”. In points not used for training, however, most misclassifications occur between this class and “terrain”. The reasons for those misclassifications are most probably the same as for the network using eight classes.

In addition to that, Table 6.2 shows that the network using eight classes and the network using six classes have a similar overall accuracy, where the network using six classes performs a couple of percent better. In both sets, used and not used for training, the combined “terrain” and “vegetation” classes outperform the respective single classes. “Buildings” and “cars” perform approximately equally in those sets. “Hard scape” performs similarly for points used for training, but for points not used for training the network using six classes performs worse than the network using eight. “Artefacts” perform better for points not used for training with both networks, but with a greater difference for the network using six classes. This difference between the two networks in those classes may again be caused by the diversity of “hard scape” objects and the problem with “artefacts” in general.

The semantic segmentation of the Semantic3d datasets “bildstein station5” and “sg27 station 1” is shown in Figures 6.1 and 6.2 respectively. The top images display the scene, the second row shows the ground truth labels, the third row the segmentation predictions and the bottom row visualizes whether the semantic segmentation is correct (blue points) or not (orange points). The left column depicts a classification with eight classes and the right column with six classes. In Figure 6.1, it can be seen that the overall classification works quite well for eight as well as six classes, whereby the six-class classification is more accurate. Most misclassifications occur due to the representation of the data in the 3D CNN, since cubes with one metre side length are used as data samples. For example, if a cube contains mostly car points and some terrain points, it will be predicted as “cars” by the neural network. Thus, all points in the cube are labelled as “cars”, including the terrain points, which is incorrect. Another frequent source of misclassification is the confusion between the classes “buildings” and “hard scape”, which can be seen at the leftmost building and at the barn in Figure 6.2. Furthermore, Figure 6.2 also shows that misclassifications between the classes “high vegetation”, “low vegetation” and “natural terrain” occur commonly. In addition, it can again be determined that in this scenario a classification with six classes performs better than with eight classes.
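The described error mode follows directly from propagating one cube label to all contained points; a small sketch with assumed names and a one metre cube size:

```python
import numpy as np

def propagate_cube_labels(points, cube_labels, cube_size=1.0):
    """Assign every point the label of the cube it falls into.

    cube_labels: dict mapping integer cube indices (ix, iy, iz) to a class id.
    Minority points inside a cube inherit the majority label, which is
    exactly the misclassification discussed above.
    """
    idx = np.floor(points / cube_size).astype(int)
    return np.array([cube_labels.get(tuple(i), -1) for i in idx])
```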


Figure 6.1: Semantic3d training dataset “bildstein station5”. Second row: ground truth, third row: network predictions, bottom row: correctly predicted (blue) or not (orange). The left column depicts a classification with eight classes and the right with six.


Figure 6.2: Semantic3d training dataset “sg27 station 1”. Second row: ground truth, third row: network predictions, bottom row: correctly predicted (blue) or not (orange). The left column depicts a classification with eight classes and the right with six.


6.1.2 Semantic3d Test Set

By looking at the semantic segmentation of the Semantic3d test set with eight classes in Figure 6.3, it can be seen that the overall classification works quite well here too, with the same weaknesses as on the Semantic3d training set. Most misclassifications occur between the classes “buildings” and “hard scape”, and besides that, the vegetation classes as well as “natural terrain” are being mixed up. This can be seen quite well in the pictures of the second row. Furthermore, in the top row it can be observed that the shopping windows are labelled as “natural terrain” and not as part of the building; this is probably caused by reflections of the opposite meadow. The dataset of the third row is segmented quite well, but the class affiliation of the vegetation in the background might not be correct, and some misclassifications between the classes “natural terrain” and “man-made terrain” might occur due to the cubic representation of the point cloud in the 3D CNN. The bottom row may also suffer from this; here misclassifications between the classes “man-made terrain” and “artefacts” occur, and the classes “buildings” and “hard scape” are mixed up at the building on the right side. These classification results were also submitted to the Semantic3d benchmark, where an overall accuracy of 85.1% and an average IoU of 51.4% were achieved. By taking a look at the exact results, which are shown in Table 6.3, it can be seen that the classes “low vegetation”, “artefacts” and “hard scape” perform the least well. This had already become apparent in Table 6.2 for points not used for training, where those classes produced some of the weakest results too.

Accuracy in %
overall accuracy            85.1
IoU in %
man-made terrain            89.6
natural terrain             77.2
high vegetation             47.7
low vegetation              27.8
buildings                   85.5
hard scape                  24.7
artefacts                   24.5
cars                        34.2
average over all classes    51.4

Table 6.3: Final semantic segmentation results of the Semantic3d test set.


Figure 6.3: Semantic segmentation of datasets from the Semantic3d test set. First row: “sg27 station 3”, second row: “sg27 station 6”, third row: “st gallen cathedral station 1”, bottom row: “st gallen cathedral station 6”.


6.2 Semantic Segmentation and Curb Fitting of the Cologne Dataset

Cologne is a city in Germany with about 1.1 million inhabitants and an area of around 405 square kilometres, of which 66 square kilometres are public thoroughfare [Kö]. CycloMedia Deutschland, LiDAR Point Cloud and Stadtentwässerungsbetriebe Köln, AöR have scanned the streets of the whole city with LiDAR devices, resulting in over 1.5 TB of raw data. However, for this thesis not the entire point cloud was processed; only individual streets were evaluated. These streets were chosen carefully to provide representative urban scenarios and are named as follows:

• An derSchanz

• Auenweg

• Augustinerstraße

• ÄußereKanalstraße

• Kleiner Griechenmarkt.

As mentioned above, those streets were chosen carefully to provide meaningful use cases for a city. They include urban areas with many buildings and cars parked next to the streets, rural areas with lots of vegetation, and roads with curves as well as straight streets. With this selection, the semantic segmentation with a network using six classes can be tested thoroughly, because a lot of terrain, vegetation, buildings and cars are present. Furthermore, with this data the network is tested on a different dataset than the training data, which is evaluated visually, since no ground truth is available. Besides that, the selected streets cover the use cases of simple straight streets, curves and partly occluded sidewalks. This is a suitable set for evaluating how well the method works in the simplest case (straight streets) and in harder ones (curves), as well as how it manages missing information about curbs. For all five streets, the result of the semantic segmentation with a 3D CNN using the big grid size and six classes, the detection of curb points with geometric point cloud features, and the final reconstructed geometry of the street, curbs and sidewalks are shown and discussed below. Something that can be said in advance is that the segmentation with the trained 3D CNN works quite well, keeping in mind that the Cologne data differ from the training data. The detection of curb points on terrain points of the semantic segmentation is robust, and the final reconstructed geometry is quite accurate, with an error of a few centimetres.


6.2.1 An der Schanz

The first street of the Cologne dataset is called “An der Schanz”; it is a road with two separate driving lanes, which are architecturally divided by a railway track. However, both lanes were evaluated in this thesis, and the point clouds of those two lanes are shown in Figure 6.4.

Figure 6.4: The LiDAR point cloud of the two driving lanes of the street in Cologne named “An der Schanz”.

A semantic segmentation was applied to the point cloud with the network using big grid size and six classes. The result is visualized in Figure 6.5. There it can be seen that the segmentation works quite well for the classes “terrain”, “vegetation” and “buildings”. However, as the left image of Figure 6.5 shows, there are some misclassifications at the barriers and the rail wires in the middle of the two lanes.

Figure 6.5: The result of the semantic segmentation of the network using big grid size and six classes of the street named “An der Schanz”.


Next, the curb points are calculated using geometric features. Figure 6.6 shows the resulting curb points in magenta. In the left image not many curb points are detected, which might be caused by the rather low curb height. The curb point computation of the right image, on the other hand, works well, but it can be seen that points of the wall in the background, which are labelled as terrain, are also detected as curb points due to their similar geometric features.

Figure 6.6: Detected curb points, coloured magenta, of the geometric feature calculation of the street named “An der Schanz”.

However, the next steps of the algorithm nevertheless manage to compute reliable geometry. The geometry of the curbs, street and sidewalk has a mean error of 0.0169 m, 0.0146 m and 0.0251 m respectively. The mean error is the average distance from points of the point cloud to the final geometry within a distance of 25 cm. The points within this threshold are filtered according to their normal vector to avoid mixing street points into the calculation of the curb error and vice versa. If the normal is pointing up, the point belongs either to the street or the sidewalk; if not, the point belongs to the curb. The results can be seen in Figure 6.7, where the reconstructed street is visualized in green, the curb itself in red and the sidewalk in blue. Furthermore, Figure 6.8 shows the final calculated geometry of both lanes from a bird's eye perspective.
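A minimal sketch of this error measure, assuming precomputed point-to-geometry distances and unit normals; the 25 cm threshold is from the text, while the verticality threshold of 0.9 is an illustrative assumption:

```python
import numpy as np

def mean_reconstruction_error(dists, normals, vertical, max_dist=0.25):
    """Mean point-to-geometry distance within 25 cm.

    dists:    (N,) distances from each point to the reconstructed geometry
    normals:  (N, 3) unit point normals
    vertical: True for street/sidewalk errors (normal points up),
              False for curb errors (normal roughly horizontal)
    """
    points_up = np.abs(normals[:, 2]) > 0.9        # assumed threshold
    keep = (dists <= max_dist) & (points_up == vertical)
    return float(dists[keep].mean())
```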

Figure 6.7: The reconstructed geometry of the street named “An der Schanz”.


Figure 6.8: Top view of the reconstructed geometry of both lanes of the street named “An der Schanz”.

Figure 6.9: The LiDAR point cloud of the street named “Auenweg”.


6.2.2 Auenweg

The second street evaluated in this thesis is named “Auenweg”; it lies near a park, the river Rhein and big congress buildings. The point cloud of this street, captured with a LiDAR device, is shown in Figure 6.9. The semantic segmentation with the 3D CNN using big grid size and six classes, which can be seen in Figure 6.10 (a), works quite well for the class “terrain”, even if there are some chunks classified as “hard scape”, namely the orange squares on the street. The classification of the building beside the street is acceptable; there are some pieces predicted as “cars” (the pink squares) and some areas labelled as “vegetation” (visualized in green). However, the most important parts for curb detection, namely terrain and sidewalks, are predicted correctly. Thus, the calculation of the curb features works quite well, even though there is a big gap of undetected curb points, which is shown in Figure 6.10 (b).


Figure 6.10: (a) The result of the semantic segmentation with the network using big grid size and six classes. (b) The result of the geometrically calculated curb feature points of the street named “Auenweg”.

Nevertheless, the algorithm for the geometry reconstruction is able to handle such gaps and produces a continuous geometry for the street, sidewalks and curbs. The mean error for this dataset is 0.0322 m, 0.0314 m and 0.0157 m for the curbs, street and sidewalk respectively. The final resulting geometry can be seen in Figure 6.11.


Figure 6.11: The final reconstructed geometry of the street called “Auenweg”.

Figure 6.12: The LiDAR point cloud of the urban street named “Augustinerstraße”.


6.2.3 Augustinerstraße

An urban street named “Augustinerstraße” is the next street in the dataset; its LiDAR point cloud is visualized in Figure 6.12. The semantic segmentation with the 3D CNN, which is shown in Figure 6.13 (a), again works quite well for the classes “terrain” and “vegetation”, but some parts of the buildings are wrongly classified as either “vegetation” or “hard scape”. On closer inspection, it can be noticed that these misclassifications occur most frequently at shopping windows and glass doors. This is probably caused by reflections or the geometric properties of those windows. The detection of the curb points using geometric point cloud features works quite well, but in this dataset there are also some misclassifications due to incorrectly labelled terrain points and holes in the curb. The result of the curb point calculation can be seen in Figure 6.13 (b).


Figure 6.13: (a) The result of the semantic segmentation with the network using big grid size and six classes. (b) The result of the geometrically calculated curb feature points of the street named “Augustinerstraße”.

However, here too the method used in this thesis is able to cope with that and calculates a continuous geometry, which is shown in Figure 6.14. Furthermore, it can be seen that the street makes a bend. Nonetheless, the algorithm is able to compute an accurate geometry along this curve as well. The mean error is 0.0182 m for the curb, 0.0390 m for the street and 0.0246 m for the sidewalk. The resulting geometry of the whole curve of the road can be seen from above in Figure 6.15.


Figure 6.14: The final reconstructed geometry of the street named “Augustinerstraße”.

Figure 6.15: The final reconstructed geometry of the street named “Augustinerstraße” from above, where the curve of the street can be seen.


6.2.4 Äußere Kanalstraße

The second-to-last street evaluated is named “Äußere Kanalstraße”; its LiDAR point cloud is visualized in Figure 6.16. This dataset contains a lot of cars parked next to the street. It can be seen in Figure 6.17 (a) that their classification works acceptably for cars parked directly beside the street, but the cars in the second row are all classified as “vegetation”. These LiDAR point clouds are captured by driving by on the street, and by taking a closer look at the point cloud itself, it can be noticed that this part of the point cloud is not as dense as the parts directly beside the street. This is most probably the reason why those points are labelled as “vegetation”: there is no clear structure of a car in the data. Nevertheless, the terrain is segmented quite accurately and the vegetation is classified correctly.

Figure 6.16: The LiDAR point cloud of the street named “Äußere Kanalstraße”.

The calculation of the curb points works well in this dataset, as shown in Figure 6.17 (b). Some small holes are present here as well, but as already seen above, this is no problem for the rest of the algorithm. Furthermore, it should be mentioned that to the left and right of the parked cars there is an exit of the parking area, where the driving lane is flattened to the level of the street. This is the reason why no curb points are detected in this area.



Figure 6.17: (a) The result of the semantic segmentation with the network using big grid size and six classes. (b) The result of the geometrically calculated curb feature points of the street named “Äußere Kanalstraße”.

The final reconstructed geometry is visualized in Figure 6.18, where it can be seen that the algorithm detects the geometry of the street, curb and sidewalks accurately, with a mean error of 0.0112 m, 0.0261 m and 0.0253 m for the curbs, street and sidewalk respectively.

Figure 6.18: The final reconstructed geometry of the street named “Äußere Kanalstraße”.


6.2.5 Kleiner Griechenmarkt

The last street evaluated in this thesis is called “Kleiner Griechenmarkt”, a typical urban street with a lot of cars parked beside the road. The point cloud captured with a LiDAR device is shown in Figure 6.19, where it can be seen that the curb is covered by the cars.

Figure 6.19: The LiDAR point cloud of the street named “Kleiner Griechenmarkt”.

The semantic segmentation with the 3D CNN, which is shown in Figure 6.20 (a), works well for the class “terrain”, but some parts of the cars and buildings are misclassified as “terrain”. The cars are mostly classified correctly; only the leftmost one is partly labelled as “hard scape”. The segmentation of the building works quite well too, but many lower parts are misclassified as “vegetation”.

The resulting curb points computed with geometric point cloud features are visualized in Figure 6.20 (b), where it can be seen that those car and building points which were misclassified as “terrain” cause false positive curb points, because they have the same geometric features as curb points and lie inside the street. Nevertheless, the actual geometry is reconstructed correctly, with a mean error of 0.0117 m for the curbs, 0.0743 m for the street and 0.0250 m for the sidewalk. The street's mean error of about seven centimetres originates from its horizontal curvature: the calculated street polygon fits the street points well on the sides, where the curbs are, but the distance between the points and the polygon grows towards the middle of the street.



Figure 6.20: (a) The result of the semantic segmentation with the network using big grid size and six classes. (b) The result of the geometrically calculated curb feature points of the street named “Kleiner Griechenmarkt”.

What makes the detection of the curbs in this scene even more difficult is that no complete information about the curb is available, because the point cloud is captured from the street. A look at Figure 6.21, which shows the same scene from above, reveals that there are no points present behind the parked cars.

Figure 6.21: The LiDAR point cloud of the street named “Kleiner Griechenmarkt” from above.


However, the algorithm is able to compensate for both problems: the false positive curb points and the missing information about the sidewalk behind the parked cars. The resulting geometries of the street, the curb and the sidewalk are shown in Figure 6.22, and in Figure 6.23 from above, where it can be seen that the geometry is calculated completely behind the parked cars.

Figure 6.22: The final reconstructed geometry of the street, curb and sidewalk of the street named “Kleiner Griechenmarkt”.

The mean error of the final calculated geometry is a couple of centimetres for each dataset. The overall mean reconstruction error is ±2 cm for curbs and sidewalks and ±3 cm for streets. An overview of the exact values for all datasets is given in Table 6.4.

                          curbs   street   sidewalk
An der Schanz (1st side)  1.6     1.5      3.4
An der Schanz (2nd side)  1.8     1.4      1.6
Auenweg                   3.2     3.1      1.6
Augustinerstraße          1.8     3.9      2.5
Äußere Kanalstraße        1.1     2.6      2.5
Kleiner Griechenmarkt     1.2     7.4      2.5
Average                   1.8     3.3      2.3

Table 6.4: Overview of the mean reconstruction errors of all datasets. (Note: values are in centimetres.)


Figure 6.23: The final reconstructed geometry of the street, curb and sidewalk of the street named “Kleiner Griechenmarkt” from above.

6.3 Critical Reflection

In this section, a short discussion is conducted about what does not work so well and what could have been done better in this thesis. Section 6.3.1 focuses on the semantic segmentation with 3D CNNs and Section 6.3.2 on the curb reconstruction process.

6.3.1 3D CNNs

Voxelizing a point cloud needs a lot of memory, and the demand grows with an increased grid size. This is the reason why a maximum grid size of 32 and a batch size of 32 were used for the network using big grid size in this thesis; otherwise, the network training fails due to memory exhaustion. Therefore, recent research is dedicated to finding more efficient methods, but those approaches are not yet included in the Tensorflow library and were thus not used in this work. With more efficient 3D convolution and pooling operations, bigger grid sizes and deeper networks could be built, which would most probably result in a more accurate point cloud segmentation. Furthermore, the segmentation is done by labelling all points in a cube with its overall label. This causes some wrong classifications; for example, when some terrain points lie in a cell with mostly car points, the terrain points are misclassified as “cars”. A direct segmentation with an appropriate segmentation network would not lead to such misclassifications and would most probably increase the overall accuracy of the semantic segmentation.
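To make the memory argument concrete, a rough back-of-the-envelope estimate for the raw inputs of one batch (illustrative numbers, float32 occupancy grids; the activations and gradients of the 3D convolutions come on top and dominate):

```python
# Rough memory estimate for one batch of voxelized inputs (float32).
grid, side, batch = 32, 16, 32        # middle cube, side cubes, batch size

bytes_per_sample = 4 * (grid**3 + 6 * side**3)   # middle cube + 6 side cubes
print(f"{batch * bytes_per_sample / 2**20:.1f} MiB of raw inputs per batch")
# ~7.0 MiB just for the inputs; the intermediate feature maps of the
# convolution layers grow cubically with the grid size and dwarf this.
```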

6.3.2 Curb Fitting

Overall, the curb fitting works quite well, but the calculation of the curb points with geometric features sometimes produces many false positives, which can lead to inaccurate geometry. This could be fixed by exploring and using more geometric point cloud features as well as by enhancing the false positive filtering. Furthermore, the actual computation of the curb geometry could also be improved by automatically determining the best degree of the function fitted through the curb points. In most cases a parabola model is sufficient, but for streets with a stronger curve a higher degree is required to model the curb geometry well. Moreover, the algorithm is currently only able to calculate curbs parallel to the street, which is not always the case in real-world data. Besides that, only a handful of streets were tested for this thesis; even though the selection was made with care to cover as many special cases as possible, there are certainly more cases where the algorithm fails to reconstruct the geometry correctly. Therefore, a wider range of different street scenarios would still have to be tested to find and solve those cases.
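One way such an automatic degree selection could look is a simple hold-out comparison over candidate degrees; a sketch with assumed inputs (curb points projected to a 2D parameterization along the road):

```python
import numpy as np

def fit_curb_polynomial(x, y, max_degree=5, seed=0):
    """Choose the polynomial degree for the curb line by hold-out error:
    fit candidate degrees on half of the points and keep the degree with
    the lowest squared error on the other half."""
    rng = np.random.default_rng(seed)
    train = rng.random(len(x)) < 0.5
    best_err, best_coeffs = np.inf, None
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x[train], y[train], degree)
        err = np.mean((np.polyval(coeffs, x[~train]) - y[~train]) ** 2)
        if err < best_err:                        # keep the best degree,
            best_err = err                        # refit on all points
            best_coeffs = np.polyfit(x, y, degree)
    return best_coeffs
```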


CHAPTER 7

Conclusion and Future Work

This thesis has presented a proof-of-concept combining a semantic segmentation of LiDAR point clouds using 3D CNNs with a detection and reconstruction of sidewalk surfaces.

Current implementations of 3D CNNs work with rasterized data, and to voxelize a point cloud such that the underlying structure is represented well, an octree data structure was used. Therewith, too sparse regions of the point cloud can easily be identified and omitted in the training stage, which results in a better generalized classifier. Although the semantic segmentation leaves room for improvement, it works quite well, especially for some classes such as “vegetation”. Furthermore, a point cloud with a semantic segmentation could be useful in many applications, because a smart prefiltering that keeps only the points of interest could decrease the amount of points to process enormously, especially when working with big datasets. Because of this, computation time could be reduced and results could be enhanced by eliminating high-risk points for incorrect results beforehand.

Such a prefiltering was also applied before the sidewalk detection: only points labelled as “terrain” were used. This excludes a lot of high-risk false positive points in advance, like cars or other objects near the street with a geometric structure similar to sidewalks. Curb points are then found with traditional point cloud features, like the difference and standard deviation of height as well as the curvature in a certain neighbourhood. Perpendicularity to the street is used too, whereby information about the road was obtained from additional OpenStreetMap data [Opea]. False positives are then filtered with a combination of density-based clustering, approximated linearity and parallelism to the road. Based on those sidewalk points, the curb geometry is calculated by extracting the lower and upper edges. The upper sidewalk surface is then computed by fitting planes into terrain points above the mean height of those edges, and finally the street polygon is composed of the vertices of the lower edges.
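A compact sketch of these per-point curb features, under assumed inputs (a K-point neighbourhood, an estimated unit point normal and the OSM road direction); the thresholds and exact feature definitions in the thesis may differ:

```python
import numpy as np

def curb_features(neighbourhood, point_normal, road_direction):
    """Geometric curb features for one candidate point.

    neighbourhood:  (K, 3) neighbouring points
    point_normal:   (3,) estimated unit surface normal at the point
    road_direction: (3,) unit direction of the road (from OpenStreetMap)
    """
    heights = neighbourhood[:, 2]
    height_range = heights.max() - heights.min()   # curb step height
    height_std = heights.std()                     # spread of heights

    centred = neighbourhood - neighbourhood.mean(axis=0)
    eigvals = np.linalg.eigvalsh(centred.T @ centred / len(centred))
    curvature = eigvals[0] / eigvals.sum()         # surface variation

    # Cosine of the angle between the normal and the road direction;
    # the vertical face of a curb has a normal perpendicular to the road.
    perpendicularity = abs(float(point_normal @ road_direction))
    return height_range, height_std, curvature, perpendicularity
```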

The semantic segmentation as well as the detection of sidewalk surfaces are quite general approaches. Each point cloud can be stored in an octree and then segmented with a 3D CNN. Once a semantic segmentation is performed, a filtered version of the point cloud can be used for various tasks, not only for detecting curbs. Given this potential, it is worthwhile to improve the semantic segmentation, especially in those corner cases in which a data sample consists of, for example, mostly car points and also some terrain points but, due to the method used in this thesis, all points within the data sample are classified as “cars”, which leads to wrongly classified terrain points. This might not cause many misclassified points in a single data sample, but accumulated over all samples of the point cloud, the classification error lies in a considerable range. Such cases could probably be eliminated with a hierarchical classification approach or a segmentation network that is able to work directly on the points, like the U-Net from Ronneberger et al. [RFB15], which applies class labels directly to the pixels of an image. Therefore, such a segmentation network will be implemented and tested in the future to improve the semantic segmentation of point clouds. Another possible next step regarding the semantic segmentation will be to use the already detected curb points together with the trained 3D CNN and apply transfer learning to add a new “curb” class to the network. Thereby, the curb points could be detected automatically within the semantic segmentation step.

Looking at the calculation of the curb points and the final reconstructed geometry, every step could and will be improved in the future, beginning with testing and applying more point cloud features, enhancing the false positive filtering and automatically computing the degree of the fitted function according to the bending of the road. In addition, a deeper look will be taken at how to accurately fill even bigger gaps in the point cloud than the ones caused by parked cars. Furthermore, the algorithm is currently only able to create geometry parallel to the road, but sometimes parts of the curb are not parallel; the calculation of those parts will also be addressed in the future. Besides that, the sidewalk points are calculated with standard point cloud features, which can be adapted to other similar cases simply by changing some thresholds, for example for the detection of railway platforms, low walls, fences et cetera. Geometrically, those examples are just very large curbs, and by extending the thresholds of the height feature they could be detected too.

In conclusion, it can therefore be said that the development of a proof-of-concept prototype was successful and showed a lot of potential. Not only does the semantic segmentation with a 3D CNN work well after this first phase of development, but so do the detection of curb points and the reconstruction of the final geometry. We will therefore continue to work on these two research questions to improve them further.

APPENDIX A

Hyperparameter Tuning

This appendix shows the exact values of the hyperparameter tuning of Section 5.2 on the training, validation and test set. The Difference column in the tables gives the difference in the results on the according test sets. Each result of a test is the average of ten runs.

Amount of feature learning components for the middle cube: The amount of feature learning components for the middle cube with a grid size of 16 was explored first. Note that the network was not trained with any neighbourhood information at this point. The results are shown in Table A.1.

Amount of feature learning components for the side cubes: One feature learning component for the middle cube with grid size 16 was used. The difference to the network without any neighbourhood information was calculated, and the neighbouring cells had a grid size of 8. The results are listed in Table A.2, where it can be seen that the additional usage of the neighbouring cells gives a great increase in the classification accuracy of the network.

Amount Components   Training Set   Validation Set   Test Set   Difference
1                   80.76          76.86            80.64      -
2                   80.92          76.51            79.78      -0.86
3                   80.42          76.01            79.37      -1.26

Table A.1: Impact of the amount of feature learning components in the middle cube.

87 A. HyperparameterTuning

Amount Components   Training Set   Validation Set   Test Set   Difference
0                   80.76          76.86            80.64      -
1                   89.42          84.13            87.11      +6.47
2                   89.02          83.11            86.02      +5.38

Table A.2: Impact of the amount of feature learning components for the side cubes.

Occupancy Values   Training Set   Validation Set   Test Set   Difference
binary             89.42          84.13            87.11      -
total              82.89          80.26            83.36      -3.75
fraction           37.78          37.22            33.73      -53.38

Table A.3: Comparison of different occupancy values.

Additional Features    Training Set   Validation Set   Test Set   Difference
none                   89.42          84.13            87.11      -
cube center height     91.38          85.93            88.59      +1.49
average point height   91.31          85.86            88.78      +1.68
average point normal   91.09          84.65            87.18      +0.07
average point colour   92.34          87.88            89.47      +2.37

Table A.4: Impact of different additional features.

Occupancy values: As Section 5.2 states, three different occupancy values were tested, but as discussed there, the ratio of the amount of points in a cell to the total amount of points in the cube does not work well. In fact, it worked so badly that the training was terminated early; thus the results for the fraction come from a single run only. The results of the different occupancy values are shown in Table A.3.
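The three variants can be sketched in one small voxelization routine (an illustration with assumed names, not the original implementation):

```python
import numpy as np

def occupancy_grid(points, cube_min, cube_size=1.0, grid=16, mode="binary"):
    """Voxelize the points of one cube into a grid of occupancy values.

    mode: 'binary'   -> 1 if a cell contains any point, else 0
          'total'    -> number of points per cell
          'fraction' -> points per cell / total points in the cube
    """
    idx = np.floor((points - cube_min) / cube_size * grid).astype(int)
    idx = np.clip(idx, 0, grid - 1)
    counts = np.zeros((grid, grid, grid), dtype=np.float32)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)
    if mode == "binary":
        return (counts > 0).astype(np.float32)
    if mode == "fraction":
        return counts / max(len(points), 1)
    return counts
```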

Additional features: The tested additional features are the height of the center of the middle cube as well as the average height, average normal vector and average colour of all points inside a cube. The results are listed in Table A.4.
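These per-cube features are cheap to derive; a minimal sketch under assumed array inputs:

```python
import numpy as np

def cube_features(points, colours, normals, cube_min, cube_size=1.0):
    """Additional per-cube features: cube-centre height, average point
    height, average point normal and average point colour."""
    centre_height = cube_min[2] + cube_size / 2.0
    avg_height = float(points[:, 2].mean())
    avg_normal = normals.mean(axis=0)
    avg_normal /= np.linalg.norm(avg_normal)       # renormalize the mean
    avg_colour = colours.mean(axis=0)
    return centre_height, avg_height, avg_normal, avg_colour
```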

Middle Cube   Side Cubes   Training Set   Validation Set   Test Set
32            32           91.29          84.60            87.33
64            32           90.93          84.84            87.57
64            64           90.86          84.72            87.87
128           64           91.05          84.93            87.94
128           128          90.66          84.73            87.58

Table A.5: Comparison of the amount of neurons in the middle and side cubes.

Batch Size   Training Set   Validation Set   Test Set
256          90.77          84.90            87.62
128          91.05          84.93            87.94
64           90.72          84.89            87.68

Table A.6: Comparison of different batch sizes.

Amount of neurons in a feature learning component: The amount of neurons in a feature learning component was explored for the middle cube together with the side cubes; the results are shown in Table A.5.

Batch size: Furthermore, three different batch sizes were tested; the results are shown in Table A.6.

Dropout layers and amount of neurons in the fully connected layers: The dropout rates for the input, hidden and output layers as well as the amount of neurons in the fully connected layers of the classification part of the network were explored. Note that the amount of neurons is doubled for the second fully connected layer; an amount of 1024 therefore means 1024 neurons for the first and 2048 for the second layer, and an amount of 2048 means 2048 for the first and 4096 for the second fully connected layer. The results of the compared values are listed in Table A.7.

89 A. HyperparameterTuning

Dropout Rate                      Amount    Training   Validation   Test
Input L.   Hidden L.   Output L.  Neurons   Set        Set          Set
-          0.2         -          1024      89.73      84.13        86.85
-          0.5         -          1024      89.63      84.37        87.68
-          0.8         -          1024      88.29      83.49        86.95
0.2        0.5         -          1024      90.60      84.75        87.66
0.5        0.5         -          1024      90.41      84.40        87.30
0.8        0.5         -          1024      83.05      80.37        84.48
-          0.2         -          2048      89.66      84.15        87.18
-          0.5         -          2048      89.68      84.35        87.64
-          0.8         -          2048      88.45      83.69        87.48
0.2        0.5         -          2048      90.86      84.72        87.87
0.5        0.5         -          2048      90.73      84.55        87.24
0.8        0.5         -          2048      81.88      79.44        83.77
0.2        0.5         0.2        2048      90.58      84.66        87.29
0.2        0.5         0.5        2048      90.68      84.61        87.31
0.2        0.5         0.8        2048      89.91      84.36        87.48

Table A.7: Comparison of different dropout rates in different layers as well as the amount of neurons used in those layers.

L2 Factor   Training Set   Validation Set   Test Set   Overfitting   Difference
without     91.55          82.33            81.03      10.52         -
0.1         66.73          65.98            64.82      1.91          -16.21
0.01        76.13          74.50            73.51      2.62          -7.52
0.001       84.02          78.98            78.47      5.55          -2.56
0.0001      88.32          81.34            79.98      8.34          -1.05

Table A.8: Impact of L2-regularization factors on overfitting and accuracy.

L2-Regularization: To prevent overfitting and to achieve a good generalization of the network, L2-regularization is commonly used. The impact of regularization with different factors on overfitting as well as on the accuracy of the test set was explored. The results are listed in Table A.8; the Overfitting column shows the difference between the training and the test set of the same row, and the Difference column shows the difference between the test set accuracy without any regularization and with the used factor. Note that from this point on the tests were performed on a different assembly of the dataset than the tests above, because the size and composition of the training, validation and test set were updated in the course of the evaluation process.
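In Keras-style Tensorflow code, attaching such a penalty is a one-liner; a sketch (the concrete factor shown is one of the values compared in Table A.8, not necessarily the one finally used):

```python
from tensorflow.keras import layers, regularizers

# The same 3D convolution as in the feature learning components,
# with an L2 penalty on the kernel weights.
conv = layers.Conv3D(
    64, 3, strides=1, padding="same",
    kernel_regularizer=regularizers.l2(0.0001),
)
```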

Amount Components                Training Set   Validation Set   Test Set   Difference
1                                75.05          72.64            72.00      +0.69
2                                80.25          74.64            73.87      +2.56
3                                80.29          74.56            73.71      +2.40
4                                80.48          74.38            73.55      +2.24
network using middle grid size   75.94          71.54            71.31

Table A.9: Comparison of the amount of feature learning components for the middle cell of the network using big grid size.

First C.   Second C.   Training Set   Validation Set   Test Set   Difference
64         64          80.25          74.64            73.87      +2.56
64         128         79.75          74.77            73.97      +2.66
128        128         79.47          74.61            73.76      +2.45
network using middle grid size     75.94          71.54            71.31

Table A.10: Comparison of the amount of neurons for the two feature learning components of the middle cell in the network using big grid size.

Number of feature learning components for the middle cube of the network using big grid size: Since the grid size of the middle cube in the network using big grid size is twice as big as in the network using middle grid size, the amount of feature learning components for this grid size was explored. The amount of neurons in each feature learning component was 64. In order to have a comparison not only among the amounts of components themselves, the Difference between the test set accuracies of this network and the network using middle grid size was formed. Note that the networks in these tests were trained solely on the middle cube, and the network using middle grid size consisted of a single feature learning component with 128 neurons. The results are compared in Table A.9.

Number of neurons for the middle cube of the network using big grid size: The previous tests showed that two feature learning components work best for the network using big grid size. Therefore, the number of neurons for each component was explored and not only compared internally; the Difference in test set accuracy to the network using middle grid size was calculated as well. Note that these tests also used networks trained only on the middle cube. The results are shown in Table A.10.


APPENDIX B

Network Architectures

This appendix visualizes the architectures of the networks using small and middle grid sizes from Sections 5.3 and 5.4, respectively.

B.1 Network Using Small Grid Size

The network using small grid size takes as input a middle cube with grid size 16 and side cubes with a grid size of eight. The part of the network processing the middle cube consists of one feature learning component with 128 neurons, and the side cubes are processed with one feature learning component as well, but with 64 neurons. The flattened outputs of those components are then concatenated together with the feature vector of the average point colour in a data sample. The resulting feature vector is fed into the classification part of the network, which is composed of a dropout layer, a fully connected layer with 2048 neurons, again a dropout layer as well as a fully connected layer with 4096 neurons, and finally a softmax layer. The architecture of this network is visualized in Figure B.1.


Middle cube path (INPUT 16x16x16):
  feature learning component: CONV 128 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 128 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2) -> GLOBAL AVG POOL

Side cube path (INPUT 8x8x8):
  feature learning component: CONV 64 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 64 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2) -> GLOBAL AVG POOL

COLOR FEATURE -> BN

All outputs -> CONCATENATED FEATURE VECTOR -> DO 0.2 -> FC 2048 -> DO 0.5 -> FC 4096 -> SOFTMAX -> OUTPUT

Figure B.1: Architecture of the network using small grid size.


B.2 Network Using Middle Grid Size

The network using middle grid size takes as input a middle cube as well as side cubes with a grid size of 16. The part of the network processing the middle cube again consists of one feature learning component with 128 neurons; the side cubes are also processed with one feature learning component with 64 neurons. The flattened outputs of those components are again concatenated together with the colour feature. The classification part of this network consists of two dropout layers, each followed by a fully connected layer, where the fully connected layers have 1024 and 2048 neurons respectively. The last layer of this network is again a softmax layer.


Middle cube path (INPUT 16x16x16):
  feature learning component: CONV 128 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 128 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2) -> GLOBAL AVG POOL

Side cube path (INPUT 16x16x16):
  feature learning component: CONV 64 (k = 3x3x3, s = 1) -> BN -> ReLU -> CONV 64 -> BN -> ReLU -> MAX POOL (k = 2x2x2, s = 2) -> GLOBAL AVG POOL

COLOR FEATURE -> BN

All outputs -> CONCATENATED FEATURE VECTOR -> DO 0.2 -> FC 1024 -> DO 0.5 -> FC 2048 -> SOFTMAX -> OUTPUT

Figure B.2: Architecture of the network using middle grid size.

98 6.2 Semantic3d training dataset “sg27 station 1”.Secondrow:ground truth, third row: network predictions, bottomrow:correct predicted (blue) or not (orange). The leftcolumn depictsaclassificationwith eight classes and the right with six...... 66 6.3 Semanticsegmentation of datasetsfrom the Semantic3d test set.First row: “sg27 station 3”, second row: “sg27 station 6”, thirdrow:“st gallen cathedral station 1”, bottomrow:“st gallen cathedral station 6”...... 68 6.4 The LiDAR pointcloud of thetwo driving lanesofthe street in Cologne named“An derSchanz”...... 70 6.5 The result of thesemanticsegmentation of thenetwork using biggridsize and sixclassesofthe street named“An der Schanz”...... 70 6.6 Detected curbpointscolouredmagenta of thegeometricfeature calculation of thestreetnamed “An der Schanz”...... 71 6.7 The reconstructedgeometry of thestreet named “An derSchanz”...... 71 6.8 Topview of thereconstructed geometryofthe street named “An derSchanz” of both lanes...... 72 6.9 The LiDAR pointcloud of thestreet named “Auenweg”...... 72 6.10 (a)The result of thesemantic segmentation with thenetwork using biggrid size and six classes. (b) Theresultofthe geometricallycalculated curb feature points of the street named “Auenweg”...... 73 6.11 The final reconstructedgeometryofthe street called“Auenweg”...... 74 6.12 The LiDAR pointcloud of theurban streetnamed “Augustinerstraße”. .. 74 6.13 (a)The result of thesemanticsegmentation with thenetwork using biggrid size and six classes. (b) Theresultofthe geometricallycalculated curb feature points of the street named “Augustinerstraße”...... 75 6.14 The final reconstructedgeometryofthe streetnamed “Augustinerstraße”.76 6.15 The final reconstructed geometry of thestreet named “Augustinerstraße” from above, where thecurve of thestreet canbeseen...... 76 6.16 The LiDAR pointcloud of thestreet named “Äußere Kanalstraße”..... 77 6.17 (a)The result of thesemantic segmentation with thenetwork using biggrid size and six classes. (b) Theresultofthe geometricallycalculated curb feature points of the street named “Äußere Kanalstraße”...... 78 6.18 The final reconstructed geometry of thestreetnamed “Äußere Kanalstraße”. 78 6.19 The LiDAR pointcloud of thestreet named “Kleiner Griechenmarkt”...79 6.20 (a)The resultofthe semantic segmentation with thenetwork using biggrid size and six classes. (b) Theresultofthe geometricallycalculated curb feature points of the street named “Kleiner Griechenmarkt”...... 80 6.21 The LiDAR pointcloud of thestreet named “Kleiner Griechenmarkt”from above...... 80 6.22 The final reconstructed geometryofthe street,curb and sidewalk of thestreet named“KleinerGriechenmarkt”...... 81 6.23 The final reconstructed geometryofthe street,curb and sidewalk of thestreet named“KleinerGriechenmarkt” fromabove...... 82

99 B.1Architecture of thenetwork using small gridsize...... 94 B.2Architecture of thenetwork using middle gridsize...... 96

List of Tables

2.1 An example of mapping categories to numerical values and an according one-hot encoding of them.
2.2 Equations of some point cloud features.
2.3 Most commonly used activation functions. (left) Name of the function, (middle) the function f(x) itself and (right) plot of the function. (Source: [SSA20])

5.1 IoU of the single classes, average IoU and OA of the test set classified with the network using small grid size. Once with the original eight classes and once with the combined six classes.
5.2 Confusion matrix of the test set classified with the network using small grid size. (Note: values are in %.)
5.3 Confusion matrix of the test set classified with the network using small grid size and six classes.
5.4 IoU of the single classes, average IoU and OA of the test set classified with the network using middle grid size. Once with the original eight classes and once with the combined six classes.
5.5 Confusion matrix of the test set classified with the network using middle grid size. (Note: values are in %.)
5.6 Confusion matrix of the test set classified with the network using middle grid size with six classes.
5.7 IoU of the single classes, average IoU and OA of the test set classified with the network using big grid size. Once with the original eight classes and once with the combined six classes.
5.8 Confusion matrix of the test set classified with the network using big grid size. (Note: values are in %.)
5.9 Confusion matrix of the test set classified with the network using big grid size with six classes.
5.10 Overall accuracies of the three different networks, increase in the OA between the networks as well as the difference between the six and eight class networks. (Note: values are in %.)

6.1 The points of the point cloud are coloured according to their assigned classes. Left: colours for eight classes. Right: colours for six classes with combined "terrain" and "vegetation" classes.
6.2 Final semantic segmentation results of the Semantic3d training set.
6.3 Final semantic segmentation results of the Semantic3d test set.
6.4 Overview of mean reconstruction errors of all datasets. (Note: values are in centimetres.)

A.1 Impact of the amount of feature learning components in the middle cube.
A.2 Impact of the amount of feature learning components for the side cubes.
A.3 Comparison of different occupancy values.
A.4 Impact of different additional features.
A.5 Comparison of the amount of neurons in middle and side cubes.
A.6 Comparison of different batch sizes.
A.7 Comparison of different dropout rates in different layers as well as the amount of neurons used in those layers.
A.8 Impact of L2-regularization factors on overfitting and accuracy.
A.9 Comparison of the amount of feature learning components for the middle cell of the network using big grid size.
A.10 Comparison of the amount of neurons for the two feature learning components of the middle cell in the network using big grid size.

Bibliography

[AAB+15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[Aar] Aardvark. Aardvark platform. https://github.com/aardvark-platform. (Accessed on 04/10/2019).

[BFOS84] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.

[BGSA18] Alexandre Boulch, Joris Guerry, Bertrand Le Saux, and Nicolas Audebert. Snapnet: 3d point cloud semantic labeling with 2d deep segmentation networks. Computers & Graphics, 71:189–198, 2018.

[BHLLR14] Aharon Bar Hillel, Ronen Lerner, Dan Levi, and Guy Raz. Recent progress in road and lane detection: a survey. Machine Vision and Applications, 25(3):727–745, 2014.

[Bis06] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.

[BKC17] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[BLB+13] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013.

[Bou19] Alexandre Boulch. Generalizing Discrete Convolutions for Unstructured Point Clouds. In Eurographics Workshop on 3D Object Retrieval. The Eurographics Association, 2019.

[Bre00] Claus Brenner. Towards fully automatic generation of city models. International Archives of Photogrammetry and Remote Sensing, 33(B3/1; PART 3):84–92, 2000.

[Bre01] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

[C+15] François Chollet et al. Keras. https://keras.io, 2015. (Accessed on 10/10/2019).

[Cau47] Augustin Cauchy. Méthode générale pour la résolution des systemes d'équations simultanées. Comp. Rend. Sci. Paris, 25(1847):536–538, 1847.

[CKY97] Sunglok Choi, Taemin Kim, and Wonpil Yu. Performance evaluation of ransac family. Journal of Computer Vision, 24(3):271–300, 1997.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[DdFG01] Arnaud Doucet, Nando de Freitas, and Neil Gordon. Sequential Monte Carlo methods in practice. Springer Science & Business Media, 2001.

[EKHL17] Francis Engelmann, Theodora Kontogianni, Alexander Hermans, and Bastian Leibe. Exploring spatial context for 3d semantic segmentation of point clouds. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2017.

[EKS+96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.

[FB81] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.

[FILS14] C. Fernández, R. Izquierdo, D. F. Llorca, and M. A. Sotelo. Road curb and lanes detection for autonomous driving on urban scenarios. In 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), pages 1964–1969, 2014.

[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[GBS14] Leonardo Gomes, Olga Regina Pereira Bellon, and Luciano Silva. 3d reconstruction methods for digital preservation of cultural heritage: A survey. Pattern Recognition Letters, 50:3–14, 2014. Depth Image Analysis.

[GGG+16] A. Garcia-Garcia, F. Gomez-Donoso, J. Garcia-Rodriguez, S. Orts-Escolano, M. Cazorla, and J. Azorin-Lopez. Pointnet: A 3d convolutional neural network for real-time object class recognition. In 2016 International Joint Conference on Neural Networks (IJCNN), 2016.

[HKLS12] J. Han, D. Kim, M. Lee, and M. Sunwoo. Enhanced road boundary and obstacle detection using a downward-looking lidar sensor. IEEE Transactions on Vehicular Technology, 61(3):971–985, 2012.

[HOW14] A. Y. Hata, F. S. Osorio, and D. F. Wolf. Robust curb detection and vehicle localization in urban environments. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pages 1257–1262, 2014.

[HSL+17] Timo Hackel, N. Savinov, L. Ladicky, Jan D. Wegner, K. Schindler, and M. Pollefeys. Semantic3d.net: A new large-scale point cloud classification benchmark. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, volume IV-1-W1, pages 91–98, 2017.

[HWS16] T. Hackel, J. D. Wegner, and K. Schindler. Fast semantic segmentation of 3d point clouds with strongly varying density. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, III-3:177–184, 2016.

[IKH+11] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. Kinectfusion: Real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST '11, pages 559–568, 2011.

[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456. PMLR, 07–09 Jul 2015.

[JRPWM10] Matthew Johnson-Roberson, Oscar Pizarro, Stefan B. Williams, and Ian Mahon. Generation and visualization of large-scale three-dimensional reconstructions from underwater robotic surveys. Journal of Field Robotics, 27(1):21–51, 2010.

[JS16] Jing Huang and Suya You. Point cloud labeling using 3d convolutional neural network. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2670–2675, 2016.

[JXYY13] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221–231, 2013.

[Kal60] R. E. Kalman. A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[KB14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[KK11] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. Advances in Neural Information Processing Systems, 24:109–117, 2011.

[KL17] Roman Klokov and Victor Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[KMLM13] Pankaj Kumar, Conor P. McElhinney, Paul Lewis, and Timothy McCarthy. An automated algorithm for extracting road edges from terrestrial mobile lidar data. ISPRS Journal of Photogrammetry and Remote Sensing, 85:44–55, 2013.

[Kö] Stadt Köln. Zahlen und Statistik. https://www.stadt-koeln.de/politik-und-verwaltung/statistik/?schriftgroesse=normal. (Accessed on 24/04/2020).

[LDT+17] Felix Järemo Lawin, Martin Danelljan, Patrik Tosteberg, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Deep projective 3d semantic segmentation. In Michael Felsberg, Anders Heyden, and Norbert Krüger, editors, Computer Analysis of Images and Patterns, pages 95–107. Springer International Publishing, 2017.

[LMP01] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning 2001 (ICML 2001), pages 282–289, 2001.

[Low04] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, Nov 2004.

[LS18] Loic Landrieu and Martin Simonovsky. Large-scale point cloud semantic segmentation with superpoint graphs. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[MCB+14] Michela Mortara, Chiara Eva Catalano, Francesco Bellotti, Giusy Fiucci, Minica Houry-Panchetti, and Panagiotis Petridis. Learning cultural heritage by serious games. Journal of Cultural Heritage, 15(3):318–325, 2014.

[MS15] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928, 2015.

[MWA+13] Przemyslaw Musialski, Peter Wonka, Daniel G. Aliaga, Michael Wimmer, Luc van Gool, and Werner Purgathofer. A Survey of Urban Reconstruction. Computer Graphics Forum, 32(6):146–177, 2013.

[Net] NetTopologySuite. ProjNet4GeoAPI. https://github.com/NetTopologySuite/ProjNet4GeoAPI. (Accessed on 18/10/2019).

[Opea] OpenStreetMap. OpenStreetMap. https://www.openstreetmap.org. (Accessed on 17/10/2019).

[Opeb] OpenStreetMap. OpenStreetMap wiki. https://wiki.openstreetmap.org/wiki/Main_Page. (Accessed on 17/10/2019).

[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[Qia99] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

[QSMG17] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[QSN+16] Charles R. Qi, Hao Su, Matthias Niessner, Angela Dai, Mengyuan Yan, and Leonidas J. Guibas. Volumetric and multi-view cnns for object classification on 3d data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[QYSG17] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30, pages 5099–5108. Curran Associates, Inc., 2017.

[RCGCOA15] Borja Rodríguez-Cuenca, Silverio García-Cortés, Celestino Ordóñez, and Maria C. Alonso. An approach to detect and delineate street curbs from mls 3d point cloud data. Automation in Construction, 51:103–112, 2015.

[RFB15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.

[RHW86] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[Ros58] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.

[ROUG17] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[Sah] Sumit Saha. A comprehensive guide to convolutional neural networks - the eli5 way. https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53. (Accessed on 06/12/2019).

[Sem] Semantic3D. Semantic3d - results. http://semantic3d.net/view_results.php?chl=1. (Accessed on 04/07/2020).

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423, 1948.

[SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[SKMW17] Michael Schwärzler, Lisa-Maria Kellner, Stefan Maierhofer, and Michael Wimmer. Sketch-based guided modeling of 3d buildings from oriented photos. In Proceedings of the 21st ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, pages 9:1–9:8. ACM, February 2017.

[SMKLM15] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In The IEEE International Conference on Computer Vision (ICCV), 2015.

[SSA20] Siddharth Sharma, Simone Sharma, and Anidhya Athaiya. Activation functions in neural networks. International Journal of Engineering Applied Sciences and Technology, 04:310–316, 05 2020.

[SZX+19] P. Sun, X. Zhao, Z. Xu, R. Wang, and H. Min. A 3d lidar data-based dedicated road boundary detection algorithm for autonomous vehicles. IEEE Access, 7:29623–29638, 2019.

[TCA+17] L. Tchapmi, C. Choy, I. Armeni, J. Gwak, and S. Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 2017 International Conference on 3D Vision (3DV), pages 537–547, 2017.

[TGDM18] H. Thomas, F. Goulette, J. Deschaud, and B. Marcotegui. Semantic classification of 3d point clouds with multiscale spherical neighborhoods. In 2018 International Conference on 3D Vision (3DV), pages 390–398, Sep. 2018.

[TQD+19] Hugues Thomas, Charles R. Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, Francois Goulette, and Leonidas J. Guibas. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[Vis] Visdom. Visdom - combining simulation and visualization. http://visdom.at/. (Accessed on 05/12/2019).

[WBG+12] M.J. Westoby, J. Brasington, N.F. Glasser, M.J. Hambrey, and J.M. Reynolds. 'Structure-from-motion' photogrammetry: A low-cost, effective tool for geoscience applications. Geomorphology, 179:300–314, 2012.

[WJHM15] Martin Weinmann, Boris Jutzi, Stefan Hinz, and Clément Mallet. Semantic point cloud interpretation based on optimal neighborhoods, relevant features and efficient classifiers. ISPRS Journal of Photogrammetry and Remote Sensing, 105:286–304, 2015.

[WLG+17] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Trans. Graph., 36(4):72:1–72:11, 2017.

[WLW+15] H. Wang, H. Luo, C. Wen, J. Cheng, P. Li, Y. Chen, C. Wang, and J. Li. Road boundaries detection based on local normal saliency from mobile laser scanning data. IEEE Geoscience and Remote Sensing Letters, 12(10):2085–2089, 2015.

[WSK+15] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[WUH+15] M. Weinmann, S. Urban, S. Hinz, B. Jutzi, and C. Mallet. Distinctive 2d and 3d features for automated large-scale scene analysis in urban areas. Computers & Graphics, 49:47–57, 2015.

[WWHY19] G. Wang, J. Wu, R. He, and S. Yang. A point cloud-based robust road curb detection and tracking method. IEEE Access, 7:24611–24625, 2019.

[Zha10] W. Zhang. Lidar-based road and road-edge detection. In 2010 IEEE Intelligent Vehicles Symposium, pages 845–848, 2010.

[ZTF+11] Matthew D Zeiler, Graham W Taylor, Rob Fergus, et al. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, volume 1, page 6, 2011.

[ZWW+15] Y. Zhang, J. Wang, X. Wang, C. Li, and L. Wang. A real-time curb detection and tracking method for ugvs by using a 3d-lidar sensor. In 2015 IEEE Conference on Control Applications (CCA), pages 1020–1025, 2015.

[ZWWD18] Y. Zhang, J. Wang, X. Wang, and J. M. Dolan. Road-segmentation-based curb detection method for self-driving via a 3d-lidar sensor. IEEE Transactions on Intelligent Transportation Systems, 19(12):3981–3991, 2018.

[ZY12] G. Zhao and J. Yuan. Curb detection and tracking using 3d-lidar scanner. In 2012 19th IEEE International Conference on Image Processing, pages 437–440, 2012.
