Radar-detection based classification of moving objects using machine learning methods

VICTOR NORDENMARK ADAM FORSGREN

Master of Science Thesis Stockholm, Sweden 2015

Radar-detection based classification of moving objects using machine learning methods

Victor Nordenmark Adam Forsgren

Master of Science Thesis MMK 2015:77 MDA 520 KTH Industrial Engineering and Management Machine Design SE-100 44 STOCKHOLM


Examensarbete MMK 2015:77 MDA 520

Radar-detection based classification of moving objects using machine learning methods

Victor Nordenmark, Adam Forsgren
Godkänt (Approved): 2015-06-17
Examinator (Examiner): Martin Grimheden
Handledare (Supervisor): De-Jiu Chen
Uppdragsgivare (Commissioner): Scania Södertälje AB
Kontaktperson (Contact person): Kristian Lundh

Sammanfattning In this thesis, the possibility of classifying moving objects based on data from Doppler radar detections is investigated. The end goal is a system that uses inexpensive hardware and performs computations of low complexity. Scania, the company that has commissioned this project, is interested in the usage potential of such a system in autonomous vehicle applications. Specifically, Scania wants to use the class information to more easily track moving objects, an essential skill for an autonomously driving truck. The objects are divided into four classes: pedestrian, bicyclist, car and truck. The input to the system essentially consists of a platform of four mono-pulse Doppler radars operating at a wave frequency of 77 GHz. A classification system based on the machine learning concept of support vector machines has been created. This system has been trained and validated on a dataset gathered for the project, containing data points with class labels. A number of supporting functions for this system have also been created and tested. The classifier is shown to discern well between the four classes in the validation step. Simulations of the complete system, performed on recorded logs of radar data, show promising results, also in situations that are not represented in the training data. To further investigate the system, it has been implemented and tested on the prototype truck Astator, and its performance has been evaluated with regard to both real-time behaviour and classification accuracy. Overall, the system shows promising results in scenarios resembling the intended end-use environment. In more complex traffic situations, and when the truck travels at higher speeds, a higher occurrence of sensor noise degrades the system's performance.


Master of Science Thesis 2015:77 MDA 520

Klassificering av rörliga objekt baserat på radardetektioner med hjälp av maskininlärningsmetoder

Victor Nordenmark, Adam Forsgren
Approved: 2015-06-17
Examiner: Martin Grimheden
Supervisor: De-Jiu Chen
Commissioner: Scania Södertälje AB
Contact person: Kristian Lundh

Abstract In this MSc thesis, the possibility to classify moving objects based on radar detection data is investigated. The intention is a light-weight, low-level system that relies on cheap hardware and calculations of low complexity. Scania, the company that has commissioned this project, is interested in the usage potential of such a system in autonomous vehicle applications. Specifically, the class information is desired in order to enhance the moving object tracker, a subsystem that represents a crucial skillset of an autonomously driving truck. Objects are classified as belonging to one of four classes: pedestrian, bicyclist, personal vehicle and truck. The major system input consists of sensor data from a set of four short-range mono-pulse Doppler radars operating at 77 GHz. Using a set of training and validation data gathered and labeled within this project, a classification system based on the machine learning method of support vector machines is created. Several other supporting software structures are also created and evaluated. In the validation phase, the system is shown to discern well between the four classes. System simulations performed on logged radar data show promising performance also in situations not reflected within the labeled dataset. To further investigate the feasibility of the system, it has been implemented and tested on the prototype test vehicle Astator, and performance has been evaluated with regard to both real-time constraints and classification accuracy. Overall, the system shows promise in the scenarios for which it was intended, both with respect to real-time and classification performance. In more complex scenarios, however, sensor noise becomes increasingly apparent and degrades system performance; the noise is particularly apparent in heavy traffic and high-velocity scenarios.

List of Figures

1 Schematic overview of relevant target system architecture
2 Approximate placement and FOV of SRR sensors on Astator
3 Development approach schematic
4 Coordinate systems
5 Projection of radar EGO component on the range rate of a detection
6 Linear decision boundaries in two and three dimensions
7 An example of a non-linearly separable dataset
8 Illustration of the bias-variance dilemma
9 The support vector machine visualized in a two-dimensional feature space
10 Illustration of the kernel concept
11 Illustration of soft-margin SVM
12 DBSCAN clustering method
13 Development process and classification system overview
14 The eps tradeoff
15 Timing diagram
16 Biplot of features and training data projected onto first two PC
17 Training data and biplot projected onto first three PC
18 Lines of noise detections behind an object
19 Scattered noise detections behind an object
20 Clusters of noise detections with high velocity values
21 Fence detections and noise when moving at 60 km/h
22 Radar detections clustered with eps = 4 meters
23 Coarse grid search for SVM parameters
24 Fine grid search for SVM parameters
25 Typical frames from evaluation logs
26 Typical frame from highway log

List of Tables

1 Delphi SRR Midrange specifications
2 Binary classification confusion matrix
3 Multiclass confusion matrix
4 Classification performance measurements
5 Radar detection clusters gathered and labeled
6 Mean and variance of features used for object description
7 Cumulative variance explained per principal component
8 maxdR detection filter statistics
9 maxClusterVelVar filtering results
10 minClusterAmpVar filtering results
11 Offline evaluation of classification performance
12 Real-time simulation performance from Simulink Profiler
13 Real-time performance on the target system
14 Input/output comparison of the two systems for the different classes
15 Classification system evaluation on log with trailer
16 Fulfillment of functional requirements
17 Fulfillment of extra-functional requirements

Contents

1 Introduction

1.1 Project background
1.1.1 General project background
1.1.2 Problem description
1.1.3 Difference to other projects

1.2 Project goal
1.2.1 Project purpose
1.2.2 Target system
1.2.3 Project requirements
1.2.4 Research questions

1.3 Project development methodology and considerations
1.3.1 Development approach
1.3.2 Delimitations
1.3.3 Sustainable development considerations

2 Frame of reference

2.1 Previous work
2.1.1 Radar based vehicle perception
2.1.2 Doppler radar as input to learning systems

2.2 Doppler radar perception and integration of multiple sensors
2.2.1 Basic Doppler radar theory
2.2.2 Sensor fusion and integration

2.3 Theoretical overview of machine learning concepts and methods
2.3.1 Supervised learning, classification and overfitting
2.3.2 Support vector machines as a method for classification
2.3.3 Classification performance analysis

2.4 Extraction and analysis of object descriptions
2.4.1 Clustering of sensor data
2.4.2 Selecting and extracting features from data clusters
2.4.3 Principal component analysis for feature evaluation

2.5 Frame of reference conclusions
2.5.1 Current best-practice in vehicle perception
2.5.2 Theory and methods employed

3 Methods

3.1 Method overview and system introduction
3.1.1 Stages of system development
3.1.2 Classification system overview

3.2 Gathering radar detection data
3.2.1 Test-track data gathering
3.2.2 Labeling of gathered data

3.3 Practical selection and analysis of object descriptions
3.3.1 Description of features used
3.3.2 Analysis of features and data with PCA

3.4 Pre-classification signal processing
3.4.1 Filtering of radar detections
3.4.2 Clustering of radar detections using DBSCAN
3.4.3 Feature vector calculation
3.4.4 Filtering of radar detection clusters

3.5 Classification of processed objects
3.5.1 Implementation of support vector machine system
3.5.2 Multiclass, rejection and confidence structures
3.5.3 Evaluating classification performance on validation data

3.6 System implementation on target platform
3.6.1 Real-time implementation goals and restrictions
3.6.2 Timings and tasks
3.6.3 Validation of final system implementation

4 Results and discussion

4.1 Results of data gathering and labeling

4.2 Analysis of feature and data characteristics
4.2.1 Characteristics of selected features
4.2.2 Principal component analysis of features on training data
4.2.3 Feature and data analysis discussion

4.3 Signal processing and filtering performance
4.3.1 Common types of noise in the radar output
4.3.2 Results of developed filtering structures
4.3.3 DBSCAN clustering parameter evaluation

4.4 Classification-related results
4.4.1 Support vector machine model selection
4.4.2 Offline evaluation of classification performance

4.5 Complete system performance assessment
4.5.1 Real time performance
4.5.2 Classification performance

5 Conclusions and future work

5.1 Concluding discussion regarding research questions and requirements
5.1.1 Requirements
5.1.2 Research questions

5.2 Project-wide conclusions

5.3 Future work

Work division

In this section, the division of work during the thesis, and specifically during the writing of this report, is described.

The thesis has been a collaborative effort, and both authors have been involved in most, if not all, sections of the report.

However, each author has focused on different areas, and for this reason one author is regarded as mainly responsible for the sections originally written by that author.

In the frame of reference chapter, Victor has been mainly responsible for sections 2.1 and 2.4, while Adam has governed sections 2.2 and 2.3.

In the method chapter, Victor has been primarily responsible for sections 3.1, 3.3, 3.4.2, 3.5.2 and 3.6, and Adam sections 3.2, 3.4, 3.5.1 and 3.5.3.

As for the results chapter, Victor is mainly responsible for sections 4.2, 4.3.3 and 4.5, while Adam is primarily responsible for sections 4.1, 4.3.1, 4.3.2 and 4.4.

The conclusions chapter saw sections 5.2 and 5.3 written by Victor, while Adam was mainly responsible for section 5.1.

The introduction chapter was an entirely collaborative effort and no particular work division can be seen here, as all sections were cooperatively written by both authors.

This has been a natural division of work, as the knowledge gained in the literature review, as well as in writing the frame of reference sections, has been vital when writing the corresponding sections in the method and results chapters.

Part 1: Introduction

This chapter aims to present a background and to introduce the reader to the project. First, the background and motivation for the project will be presented, followed by a problem description and an introduction to the target system: the prototype vehicle Astator. The requirements and research questions considered will also be detailed here. Finally, the development approach used will be explained, together with the ethical considerations made.

1.1 Project background In this section, the background to the project will be presented together with a problem description. Brief details about the differences between this project and earlier work in similar areas are also presented.

1.1.1 General project background In the automotive industry today, a big effort is put into the development of intelligent vehicles, advanced driver assistance systems, and ultimately, autonomously driving vehicles.

The development of more intelligent systems within vehicles has wide implications. Early warning systems and advanced vehicle perception can lead to a large decrease in injuries resulting from accidents. Additionally, vehicle operations can be better optimized for fuel consumption, thus minimizing costs and reducing environmental impact. The need for human operators can be reduced, as in the case of platooning systems where a single person operates an entire fleet of vehicles, or a fully autonomous system, where the human operator is removed completely.

The advance of more complex systems in vehicles leads to higher requirements in the processing of large amounts of data, such as when analysing multiple sensor signals simultaneously. When it comes to processing large quantities of (often high dimensional) data, the ever-increasing computational power available has enabled the use of machine learning methods to tackle new problems. Machine learning methods can provide a means to analyze big quantities of data in ways that have previously not been possible.

For a company such as Scania, research in advanced driver assistance systems is at the core of an expected future suite of services to provide to customers. A crucial goal is the ability of a vehicle to detect and track surrounding objects. This complex problem of vehicle perception demands many system layers and has many feasible solution approaches.

At Scania REPA (the unit for development of advanced driver assistance systems), within the iQMatic project, research is conducted on the development of a truck that can autonomously carry goods to increase the efficiency of mining sites. The prototype vehicle Astator is a platform for testing new solutions developed in this project. The vehicle contains, among many other subsystems, an object tracking system. The purpose of this system is to keep track of objects surrounding the truck, and predict their future positions and movements. The tracker, together with other subsystems, builds up a perception system which allows the vehicle to sense and react to its surroundings.

1.1.2 Problem description The object tracking system of Astator uses motion models to predict how a detected object will move. Currently, motion models are interpolated from detection history, which, if a wrong assumption is made, can lead to estimation errors and unreliable performance. This problem is especially apparent when no history is available, as the choice of initial parameters has a large effect on future predictions.

For a suitable motion model to be selected by the object tracker, some initial information about a detected object is needed. Such information would preferably include an estimation about what category, or class, a detected object belongs to. For this reason, Scania has requested an investigation into the creation of a low-level classification system.

This system should deliver class data (object type) in order to provide additional information as input to the object tracker. The system will use machine learning methods to process and classify objects based on data received from radar sensors mounted on Astator.

1.1.3 Difference to other projects Classification of moving objects based on sensor data is not a new field. However, the classification process is usually done at a high level, combining input data gathered from several sensors such as LIDAR, stereo cameras and radars.

The use of multiple sensors and fusion of sensor data can provide a very accurate classification, but also requires heavy processing power. Additionally, the use of multiple sensors will lead to a higher cost for the sensor platform, and the moving parts that are found in LIDAR systems can lead to a reduced hardware robustness of the system.

Other methods of object classification based on radar input rely on a deeper knowledge of the radar signal characteristics in order to construct statistical models. Whereas these methods require radar data in its raw form, the radar sensors used in this project deliver heavily processed data in the form of detection points. Little prior research deals with radar data processed into this form.

The classification system concerned in this thesis project is intended to operate on a low level, based only on radar detection data, and before the tracking system in the signal chain. As such, it should not require large amounts of data as input, the calculations to be made should be of low complexity, and it should be executable in a real-time context.

1.2 Project goal In this section, the purpose and goals of this master thesis project are described. The Astator target system platform is also introduced. Additionally, the research questions considered are detailed here together with project requirements.

1.2.1 Project purpose This MSc thesis project is about researching, implementing and evaluating a classification system with the purpose of identifying moving objects based on sensor data mainly consisting of radar detections.

Objects shall be classified as belonging to one of four classes:

1. Pedestrian
2. Bicyclist
3. Personal vehicle
4. Truck

The system will be implemented and evaluated on the prototype vehicle platform Astator.

The end purpose of the classification system is to improve accuracy in the object tracking system by providing class data as an additional input.

1.2.2 Target system The Astator system (referred to as ”Astator” or the ”target system”) is the prototype platform used for development within the iQMatic project. It contains numerous subsystems and modules. Below in figure 1, a schematic overview of all relevant parts in the system architecture is shown, together with the autonomous vehicle context that this project operates in.


Figure 1: Schematic overview of relevant target system architecture

The yellow shaded area spans the parts of the target system that are of direct and indirect concern within this project. The boxes in red are the specific modules (software or hardware) that the classifier system developed in this project relies on for input data. Sensor modules are connected to the main ECU (Embedded control unit) through CAN (Controller area network). The green box is the intended product of this project. The purple box represents the receiving subsystem that the classification system output is intended for.

The blue shaded area provides some autonomous vehicle context and represents a higher abstraction level with concepts that are of particular concern within the iQMatic project (but not within the scope of this thesis). The moving object classification system that this project concerns is intended to be part of the Object Assessment skill. This in turn is part of the vehicle situational awareness (also called vehicle perception).

Relevant target system characteristics and specifics are presented below.

Short Range Radars Astator is equipped with six short range radars (SRR), two forward facing systems and one additional radar at each corner of the vehicle. The four corner radars (SRR 1-4) are the sensors that will be of interest to this project and they will provide the main data input to the classification system. The forward facing sensors will not be further considered as they are of a slightly different configuration. The corner radars are mounted to give a broad, partly overlapping field of view (FOV) of the vehicle side and rear surroundings. The approximate positions of these radars as well as their FOV can be seen in figure 2 below.

Figure 2: Approximate placement and FOV of SRR sensors on Astator

A condensed list of the SRR system specifications can be found below in table 1:

Table 1: Delphi SRR Midrange specifications

Frequency: 77 GHz
Field of view: 150 degrees
Range: 0.5-80 m
Sample time: 50 ms
Bandwidth: 250 MHz

The radar sensors deliver data in the form of detections. These are data points representing a detected object. Each detection contains positional information as well as additional parameters.

Each of the radars delivers a data package of up to 64 detections every 50 ms over CAN. Furthermore, since four different radars are used in the project, up to 256 detections can be received in any single frame. The number of detections received depends on external factors, such as how much movement and reflection is detected, as well as on the internal structure of the radar processing unit.

A data package from the radar sensors contains the following parameters:

• Amplitude: indicates the amount of energy reflected back from the detected surface. Exact unit and calculation unknown.
• Doppler velocity: the relative velocity of the detected surface in the radar radial direction [m/s].
• Distance: distance to the detected surface [m].
• Angle: detection ray angle relative to the normal of the radar [rad].

In addition, there is information about time, whether the delivered package contains updated data, detection ID and package size.
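To make the shape of this input concrete, a possible MATLAB representation of a single detection is sketched below. The field names and values are illustrative assumptions only; they are not the actual CAN signal names used on Astator.

% Hypothetical container for one radar detection. Field names are
% assumptions made for illustration, not the actual CAN signal names.
detection = struct( ...
    'amplitude', 42.0, ...   % reflected energy, exact unit unknown
    'distance',  12.3, ...   % distance to detected surface [m]
    'angle',     -0.35, ...  % angle relative to the radar normal [rad]
    'doppler',   -1.8, ...   % Doppler velocity [m/s]
    'sensorId',  2, ...      % which of the four corner SRRs (1-4)
    'updated',   true);      % whether the package contains new data

% A full frame then holds at most 4 radars x 64 detections = 256 entries.
frame = repmat(detection, 1, 256);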

IMU and GPS The IMU (Inertial measurement unit) and GPS (Global positioning system) sensors provide vehicle heading and velocity information, which is important for the interpretation and translation of radar data. These computations are made in the VEGO system, and will not be presented in further detail.

The ECU The Astator ECU has 16 GB of RAM and an Intel Core i7 quad-core processor, and runs Linux. Most of the autonomous framework software runs on this unit. The software system works on an update cycle of 10 ms.

Pre-processing In this software stage, the data on the CAN bus is converted into the LCM format used within the rest of the software framework.

Translate data The MEAS software step serves to translate sensor data into different coordinate systems, as well as provide estimations of measurement certainty. In particular, we are concerned with the EGO (Referring to the self vehicle) local Cartesian coordinate system, with origin located at the center of the rear axis of Astator. The data translation is necessary to determine absolute object velocities and to integrate the different radars into one system.

Based on the calculated velocity associated with a certain radar detection, and the estimated uncertainty of the corresponding radar in this particular angle, an index of certainty that the detection belongs to a moving object is produced.

• Movement index 3: 99.7 % certainty (three standard deviations).
• Movement index 2: 95.5 % certainty (two standard deviations).
• Movement index 1: 68.3 % certainty (one standard deviation).
• Movement index 0: Object is probably not moving.

It should be noted that since the exact standard deviations of the radar system are unknown, this method is based on estimated values. There could be dynamic behaviours regarding the measurement certainty of the radars, or phenomena unaccounted for. Hence, the certainties above do not express reality, but only a very rough approximation. A brief outline of the theory behind how radar detections can be transformed from sensor-specific polar coordinates to the local Cartesian system can be found in section 2.2.2, Sensor fusion and integration.

Moving Object Classifier This is the subsystem to be developed within this project, and which the rest of the report will concern in detail.

SMOT object tracker SMOT is the object tracking system of Astator. It uses combined sensor data in a Kalman filter structure to track objects and predict their future positions based on history. It is the recipient of the output from the moving object classifier.

Autonomous vehicles must consolidate many different skill sets in order to function. Within the iQMatic project, these skill sets have been abstracted into the following nomenclature:

Situational Awareness This can also be referred to as vehicle perception. It contains object assessment (of which the moving object classification system is part), sub-object assessment and situation assessment.

Artificial Intelligence This concerns the decision making of the autonomous vehicle and will not be further examined in this project.

Automatic Control This regulates the movements of the autonomous vehicle through the different actuators available. This will not be further examined in this project.

1.2.3 Project requirements These requirements have been developed by the authors together with supervisors from both Scania and KTH, and are also influenced by the literature survey conducted at the start of the project. They should be seen as a result of investigations conducted by the authors and not as requirements imposed by Scania. These requirements will serve as a foundation for the research questions asked within the project.

Real time The system shall detect and classify objects in a real-time environment with a predictable execution time. This execution time must be short enough that the output can be computed before the next sample. The sensors deliver data at 20 Hz, which means that the absolute maximum execution time for the classification system, in order to classify on each sample, is 50 ms. A reduction of the execution time to below 10 ms would be beneficial since other systems on the Scania test truck run at a frequency of 100 Hz.

Clustering The sensors deliver data in the form of detection points with a small number of parameters. It is assumed that these parameters in their raw form are not enough to reliably classify objects. Because of this, it is necessary to group the detections into objects. This clustering process will provide new information about the detected objects, enabling more features to be computed.

System output In order for the system to be useful to the tracker, the class output needs to have a certain degree of accuracy. A confidence output (with some sort of confidence measurement) would be useful to the system in order for the data to be more easily integrated into the tracker, especially if it can take on a probabilistic structure.

Programming language The reference programming language of this project is MATLAB, but in order to implement the system on the embedded system of Astator, a toolbox to generate C code will be used.

Below is a more condensed list of the requirements to be investigated within this project.

Functional requirements

• The system shall take the output of the sensors and cluster these data points into objects
• The system should filter out static objects and only be concerned with moving objects
• The system shall classify these moving objects as belonging to one or none of the classes
• The system should be able to provide a confidence output
• The system should have an execution time that never exceeds 50 ms

Extra-functional requirements and technological preferences

• The system shall use input data from four short range radars, combined with sensor data regarding EGO movements.
• The system shall classify on every sample separately, without the use of detection history or feedback loops.
• The system shall operate with the use of SVM as classification method.
• The system shall be programmed in MATLAB.
• The system shall be implemented on the embedded hardware of Astator.

1.2.4 Research questions To approach the task from a scientific viewpoint, the following research questions have been formulated:

1. How can existing machine learning theory be integrated into the embedded hardware of Astator with the purpose of creating a classification system?
2. How can this system be optimized for real time execution?
3. What can be done to improve the classification accuracy of this system with regards to robustness against noise and environmental factors?
4. What are the major obstacles in creating the system, and what can be done to overcome them?

The aim of this project is to answer these research questions through the development of a classification system. The overall development approach used to construct this system is presented below.

1.3 Project development methodology and considerations In this section, the specific methodology used for development within this project is presented, together with delimitations in the project scope. Additionally, social and ethical considerations with regards to sustainable development will be discussed here.

1.3.1 Development approach In the development of this project, an adaptation of the V-model of development is used. This is beneficial in that it provides a foundation on which to divide efforts, makes for a logical flow of work and implies certain approaches to validating results. A strict V-model protocol will not be discussed or followed, but the general development approach as adapted and understood within this project will be presented below.

The development approach used here divides development into three distinct phases, containing several layers each. The output of each phase corresponds to a major project deliverable, either parts of the report or the actual classification system. In figure 3 below, a schematic image of the development approach is shown:


Figure 3: Development approach schematic

Definition First, during the definition phase, we adopt a top-down approach: first defining the requirements of the total system, then defining how the system will operate at a functional level, and finally at the technical level. In this phase, theory studies will be performed and current best practice examined.

This phase also requires the gathering of pre-existing data that can be analyzed, in order to properly define necessary functionalities and gain sufficient domain knowledge.

The outputs of the definition phase constitute what is presented in chapter 2, Frame of reference, and chapter 3, Methods.

Implementation After the definition phase, the implementation phase is performed. In this phase, code is written and implemented. We use a model-based method of first developing an offline system that operates in a simulated environment. When this offline system is deemed to perform well, a real-time implementation of the same system is developed by means of code generation for implementation on the target system. This is continually checked against the offline version, using the same testing data, to ensure that the RT functions give the same output as the offline functions.

The output of this phase is the actual classification system.

Testing The third major phase then consists of testing the implementation against the respective levels of definitions and requirements. Here, a bottom-up approach is adopted: functionality is tested at the lowest useful level first, then put together and tested in groups, and finally as a complete system against the requirements defined in the first phase. The verification is done against the requirements and definitions constructed in the definition phase.

This testing phase requires the gathering of more data, to evaluate the complete system in a realistic environment or appropriate experimental setup.

The output of this phase constitutes chapter 4, Results and discussion.

Remarks on validation This project contains many sub-functions, the performance of which can be evaluated separately. But since the project scope is limited, it is clear that too much time cannot be spent trying to optimally assess the performance of every subsystem. In cases where there is theory or established practice available with regard to the assessment of performance, the suggested methods can and will be used. For subsystems where such methods are not clearly available, heuristic approaches will be used instead in order to reach appropriate performance.

1.3.2 Delimitations This project will not investigate which classification method is best for any particular purpose. Instead, one method will be chosen and focus will lie on the implementation of an integrated system on Astator. Based on conclusions reached in the background study, support vector machines were chosen as the classification method used throughout the project.

The system shall not include sensor data other than that delivered by the Doppler radars, and information about the EGO vehicle speed.

Detection history shall not be part of the system, meaning that there will be no feedback loops and that the system will perform processing and classification on each sample cycle separately.

The goal is to create and evaluate a methodology as well as to identify major obstacles. Constructing an end-user product is not within the scope of this project. As such, the performance requirements when it comes to classification are not very strict.

A mining site is generally off-limits for the public, and compared to for example an inner-city environment it contains few moving objects. These objects can also be held under strict supervision in such a controlled environment. This means that the scope of this project can be limited to areas sparsely populated by moving objects, without affecting the validity.

1.3.3 Sustainable development considerations Here, the sustainable development aspects of this project are discussed. The narrower area explicitly covered in this project raises few such issues on its own, but in a broader context there are interesting discussions to be had.

The broader area of machine autonomy is certainly an area subject to some controversy, with regards to ethics and how the area should be approached by legislation and such. Most of these discussions can also be applied to the automation of heavy vehicles.

For example, it has been said that 50 percent of Swedish jobs could be gone within 20 years [1]. It has always been the case that machines and new technology take over manual labour previously performed by humans, but perhaps the pace that is currently experienced is unprecedented.

This phenomenon has huge implications both economically and socially. For companies and particular businesses (such as Scania), this presents a huge opportunity. The human in the loop often represents a major part of the costs, and if this can be minimized, great profits can be made.

On a nation-wide level, however, the economic benefits can be more diffuse. If 50 percent of current jobs disappear without being replaced at the same pace by new ones, this will obviously place some strain on society. Such a scenario can cause economic and social vulnerability for many individuals. If, however, the benefits of heavy automation can be shared by the entire community (of a nation, a continent or world-wide), this could revolutionize the human experience.

Autonomy has the potential to eliminate hazardous and monotonous jobs (such as driving heavy vehicles for long periods), and lead to safer machine operations in general. It can also be beneficial for environmental reasons, by allowing increased optimization of resource usage.

Another important debate about autonomy is the dilution of responsibility. This subject has been heavily discussed, both in academia and in regular newspapers (see for example [2]). The basic question is who is to blame if an autonomous machine causes an accident? Currently, it seems that this is dealt with through extreme caution before introducing autonomous systems, but in the future things might be different.

Used with responsibility and forethought, autonomous vehicles will almost certainly provide benefits to most. As such, the broader context that this project operates in is compatible with sustainable development.

Part 2: Frame of reference

This chapter contains the theoretical framework on which the rest of the project is based. A brief overview of work done in similar fields is given. This is followed by a review of the theoretical foundations for the particular methods used in this project.

2.1 Previous work In this section, some of the previous work in the area is discussed. This is divided into the more general case of vehicle perception, and the more specific case of radar usage within learning systems. This provides a different background perspective than what was discussed in the introduction, and serves as a foundation for the exploration of a solution space.

2.1.1 Radar based vehicle perception There are several examples of projects that have attempted something similar to this one. Below, some of these are presented and their relevance to this project is discussed.

Vu 2010 In [3], Vu proposes a two-fold way of detecting and classifying moving objects. The work performed was done within the framework of the European project PReVENT ProFusion.

The first part consists of the usage of SLAM (Simultaneous Localization And Mapping) with Detection of Moving Objects. This approach uses odometry in conjunction with a laser scanner to sense the environment. An object is perceived as dynamic if it occupies a space that was previously unoccupied.

Vu also stores a dynamic map of the environment in conjunction with the static one. This serves to increase the likelihood that an object is dynamic if it is detected in an area with a history of containing several dynamic objects.

For the purpose of clustering detections into objects, a simple distance threshold of 0.3m is used.

In the context of moving object classification, the output of the dynamic SLAM can be seen as clustered laser detections that are hypothesized to be dynamic.

The second part of the method proposed by Vu is for classifying moving objects as well as estimating tracks, through the solution to the DATMO (detection and tracking of moving objects) problem. It operates by using a sliding window method of finding sequences of clusters through a time-series of frames. Class hypotheses are then calculated through fitting the box size and movement characteristics of four pre-defined classes to the detection sequences. Here, a data-driven Markov chain Monte Carlo (DDMCMC) method is used, and the maximum a posteriori estimation is calculated from the space of all possible hypotheses. This method can then be used to predict future positions of objects.

The pre-defined classes used are pedestrian, bike (including motorbike and bicycle), car and bus. The model for each class is derived from the averages of externally gathered statistics.

The work done by Vu has some points that are interesting to this project, but differs greatly on a few critical points:

The use of a laser scanner (a sensor that provides an almost complete surface map of the surrounding environment, but at a very high cost) as the main source of sensory input, and lack of usage of radar sensors.

The use of detection history (a static and a dynamic SLAM map) for any stage of the detection and classification to be conducted.

The application of detection and classification in later signal processing stages (after SLAM).

Since our project is not specifically concerned with tracking objects (this will occur at later stages), nor with using detection history, the usage of the DATMO nomenclature is not directly applicable (although the concepts might be similar).

Garcia 2014 In [4], a similar nomenclature is used, but more focus is placed on using several different sensors and through sensor fusion reaching better results. Here, the distinction is made between SLAM as modelling the static environment, and DATMO as covering the dynamic parts. This project was supported under the European Commission project interactIVe.

A fundamental insight for Garcia is that the object class information is useful for tracking (later stages), and thus classification is better performed at detection level (earlier stages). He assumes that the SLAM is solved, and concentrates on the DATMO.

Garcia uses radar, LIDAR and camera as sensory inputs. Two classification approaches are proposed:

The first approach uses camera images to classify objects. This method uses HOG (Histogram of Oriented Gradients) descriptors and integral images derived from the cameras, along with machine learning methods (discrete Adaboost) to construct a classifier.

The second method uses radar sensor input, but only to infer object velocity (relative velocity or estimated target velocity) as an input to a sensor fusion object representation.

Garcia uses a qualitative system evaluation that basically consists of showing system outputs for a few different scenarios and discussing whether they are good or not. A quantitative evaluation is done through the creation of truth data from several different driving scenarios, and comparing the classification results with the truth data.

The approach used by Garcia has more in common with the aims of this project, in that it uses radar sensors and machine learning methods for object classification. It also has useful methods for system evaluation. Critically, it differs in the use of several other sensors, of detection history, and of the choice of machine learning method.

Mercedes Bertha 2013-14 In [5] and [6], the team behind the Bertha project describes their use of a sensor platform consisting of several different radars in conjunction with stereoscopic cameras. They construct a light-weight object representation made from stixels and apply a mixture-of-experts machine learning method to classify objects. However, the reports concentrate heavily on the output produced by the cameras and do not specifically discuss the radar sensors' contribution. They also state a heavy reliance on pre-existing static maps and good vehicle localization.

PROUD-Car Test 2013 In [7], the team behind the Vislab PROUD Car Test 2013 demonstrates an autonomous vehicle platform. Their sensor platform consists of laser sensors and stereo cameras, and is thus of little specific interest to this project.

2.1.2 Doppler radar as input to learning systems Vehicle perception is not the only area where Doppler radars and machine learning methods have been used in conjunction. Below, some examples of other usages are presented.

Waske and Benediktsson 2007 The use of machine learning methods as opposed to statistical models can be beneficial in areas where little knowledge exists about the data. This is because one is not constrained to a priori assumptions on how the input data is distributed. Machine learning methods also allow for weighting of different features that might be more or less representative for different classes, which is hard to do with statistical models. For this reason, [8] uses machine learning methods for classification of land coverage using multiple sensors. It is concluded that a multi-layered support vector machine is the most accurate for classifying on the particular data type.

Although radars are used as one sensor type, beyond the use of general machine learning methods, [8] gives little advice with regard to the specifics of this project.

Cho et al. 2009 In [9], the authors explore a vehicle classification scheme involving radar data, support vector machines and k-means clustering. They use the frequency domain response of the signal, and extract two features to aid classification of vehicles into the classes small vehicle and big vehicle. The authors use FMCW (Frequency modulated continuous wave) radars, which differs from the mono-pulse radars used in this project. The authors also have access to the frequency response on which they compute features and classify. This data is very different from the data used in this project (consisting of mono-pulse Doppler radar detections, not frequency responses). Hence, the methods described in [9] are not directly applicable to this project.

Others There is an abundance of studies regarding classification of Doppler radar data (such as [10], [11] and [12]), but they either concern the Doppler frequency response, or do not use machine learning methods (or use other machine learning methods than support vector machines). Many other studies within a similar area have been read and discarded as irrelevant. No studies have been found that are thought to be more relevant than readily accessible machine learning and clustering theory.

The above research suggests that even though much research exists within the general topic, the specifics of this project differ from previous work on several critical points.

2.2 Doppler radar perception and integration of multiple sensors This section contains a brief outline of the basic theory of Doppler radars and how they can be used to enable perception. Also in this section is a description of how several radars can be integrated and combined with different types of sensors.

2.2.1 Basic Doppler radar theory Radars can provide different types of information about an object, such as the angle at which the object is detected, and the distance and speed of the object relative to the radar [13]. This is done by emitting radio waves in a certain direction and studying the properties of returning waves reflected by a target.

The fact that all electromagnetic waves travel at the speed of light makes it possible to calculate the distance to an object by measuring the time delay between a transmitted wave and the return of its reflection.

In [14], the range r to an object is expressed as:

r = \frac{c \, \Delta t}{2}, \quad \text{with} \quad c \approx \frac{c_0}{\sqrt{\varepsilon_r}} \qquad (1)

Here $c_0 \approx 3 \times 10^8$ m/s is the speed of light in vacuum, while $\varepsilon_r$ is the relative permittivity of the material.
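As an illustrative sanity check of equation (1) (an example added here, not taken from the thesis): a round-trip delay of $\Delta t = 0.5\ \mu\text{s}$ measured in air ($\varepsilon_r \approx 1$) corresponds to a range of

r = \frac{3 \times 10^{8} \cdot 0.5 \times 10^{-6}}{2} \approx 75 \text{ m}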

Due to the Doppler effect, an object that moves towards or away from the radar will cause the frequency of its reflected waves to be different from that of the transmitted waves. In Doppler radars, this phenomenon is used to obtain information about the velocity of an object relative to the radar.

This velocity $\dot{r}$ is usually called the range rate, radial speed or Doppler speed. In [14] it is expressed as follows:

\dot{r} = \frac{c f_d}{2 f_t} \quad \text{for } v \ll c \qquad (2)

Here $f_d = f_r - f_t$ is the Doppler frequency shift: the difference in frequency between the transmitted wave $f_t$ and the reflected wave $f_r$.
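To give a sense of scale (again an illustrative example, not from the thesis), rearranging equation (2) shows that a target approaching a 77 GHz radar at $\dot{r} = 10$ m/s produces a Doppler shift of

f_d = \frac{2 \dot{r} f_t}{c} = \frac{2 \cdot 10 \cdot 77 \times 10^{9}}{3 \times 10^{8}} \approx 5.1 \text{ kHz}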

In pulse-Doppler radars, the emitted waves are modulated by a pulse train, causing the radar signal to be emitted in short bursts [14].

Modern radar systems use signal processing techniques to modulate signals with different frequencies or polarization. This enables separation of waves originating from multiple different targets, and can also prevent radars from interfering with each other [13].

Doppler radars exist in a multitude of different configurations and with many types of outputs. For more detailed technical information about the radar system used in this particular project, see section 1.2.2, Target system.

The next section contains details of how multiple radar sensors can be integrated into one single system, and work together with other sensors such as IMU and GPS.

2.2.2 Sensor fusion and integration When using radar to enable vehicle perception, it is common to use several separate radar sensors mounted in different locations. The reason for this is that a single radar has a limited field of view whereas a radar system may be required to see in several directions.

In the case of multiple independent radars, each separate radar may deliver data in relation to its own inherent polar coordinate system. Here r is the distance to the detection, while φ is the angle between the central normal line n of the radar, and the line on which the detection is located.

In order to integrate separate radars into one single sensor system, the data from each radar must be translated into the same coordinate system.

Figure 4: Coordinate systems. (a) Radar coordinate system. (b) Local coordinate system.

An illustration of a radar-specific polar coordinate system $\mathbf{x}_{r,pol} = [r \;\; \varphi]^T$ can be seen in figure 4a. To fuse the radars, the data from each specific radar should be translated to a common coordinate system. In the case of this project, a suitable coordinate system to work in is a Cartesian system that moves with the truck itself. An example of such a coordinate system is the local system $\mathbf{x}_{local} = [x \;\; y]^T$, with origin located at the center of the rear axis of the truck. This local coordinate system and its relation to a radar-specific system can be seen in figure 4b.

In order to transform data from a radar-specific system to a local Cartesian system, it should first be transformed into a radar-specific Cartesian coordinate system $\mathbf{x}_{radar} = [x_r \;\; y_r]^T$. Such a system is seen in the lower right part of figure 4b. It has the same origin as the original polar system, but with its x-axis parallel, and y-axis perpendicular, to the radar normal axis n.

This first transform (from polar to Cartesian coordinates) can be formulated as below:

x_r = r \cos(\varphi), \quad y_r = r \sin(\varphi) \qquad (3)

Once a position in the radar-specific Cartesian system $\mathbf{x}_{radar}$ is known, a transform into the local Cartesian system can be made as follows:

\mathbf{x}_{local} = \begin{bmatrix} \cos(\psi) & \sin(\psi) \\ -\sin(\psi) & \cos(\psi) \end{bmatrix} \mathbf{x}_{radar} + \begin{bmatrix} x_{srr,local} \\ y_{srr,local} \end{bmatrix} \qquad (4)

Here, $x_{srr,local}$, $y_{srr,local}$ and $\psi$ represent the position and orientation of the radar-specific coordinate system relative to the local system (details of the exact radar mounting position are required here). These are the parameters seen in red in figure 4b.
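A minimal MATLAB sketch of the transform in equations (3) and (4) is given below. It assumes the mounting pose of each radar is known; the function and variable names are illustrative only, not those used in the actual implementation.

% Transform one detection from radar-specific polar coordinates to the
% local EGO-fixed Cartesian system (equations (3) and (4)).
%   r, phi            : detection range [m] and angle [rad]
%   x_srr, y_srr, psi : mounting position [m] and orientation [rad] of the
%                       radar relative to the local system (assumed known)
function x_local = radar_to_local(r, phi, x_srr, y_srr, psi)
    % Equation (3): polar to radar-specific Cartesian coordinates
    x_radar = [r * cos(phi); r * sin(phi)];

    % Equation (4): rotate into the local frame and add the mounting offset
    R = [cos(psi), sin(psi); -sin(psi), cos(psi)];
    x_local = R * x_radar + [x_srr; y_srr];
end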

The transformations in equations (3) and (4) can also be performed on the range rate $\dot{r}$ to derive the velocity components $\dot{\mathbf{x}}_{local} = [\dot{x} \;\; \dot{y}]^T$ in the local system. However, it is important to differentiate between the velocity component caused by a moving object, and the one caused by movement of the EGO system. The range rate $\dot{r}$ only contains information about the velocity with which a target is approaching, or moving away, in the normal direction of the radar. This does not yield any information on whether it is the target that is moving, or the radar itself. To get around this, the radar system can be combined with data from other sensors. Using data from position and acceleration sensors such as the IMU and GPS, accurate information about the vehicle's EGO velocity can be obtained. By combining this data with the known positions of each radar, the radar velocity $\dot{\mathbf{x}}_{SRR}$ can be derived. By projecting this derived velocity onto the Doppler speed of a detection, and subtracting it, the velocity component caused by the actual target movement is obtained.

Figure 5: Projection of radar EGO component on the range rate of a detection.

In figure 5 the concept is illustrated. An interesting case is when $\dot{\mathbf{x}}_{doppler} = \dot{\mathbf{x}}_{proj}$, which in fact means that the detection is stationary (in the normal direction of the radar). Mathematically, the projection is done as follows:

\dot{\mathbf{x}}_{proj} = \frac{\dot{\mathbf{x}}_{SRR} \cdot \dot{\mathbf{x}}_{doppler}}{\dot{\mathbf{x}}_{doppler} \cdot \dot{\mathbf{x}}_{doppler}} \, \dot{\mathbf{x}}_{doppler} \qquad (5)

The actual target velocity in the normal direction of the radar is then obtained by:

\dot{\mathbf{x}} = \dot{\mathbf{x}}_{doppler} - \dot{\mathbf{x}}_{proj} \qquad (6)
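The same EGO-motion compensation can be sketched in a few lines of MATLAB (a sketch only, assuming both velocities are given as 2-by-1 vectors in the local frame; the names are illustrative and not from the actual implementation):

% EGO-motion compensation of a detection's Doppler velocity
% (equations (5) and (6)).
%   v_srr     : velocity of the radar itself, derived from EGO motion (IMU/GPS)
%   v_doppler : measured Doppler velocity of the detection
function v_target = compensate_ego_motion(v_srr, v_doppler)
    % Equation (5): project the radar's own velocity onto the Doppler direction
    v_proj = (dot(v_srr, v_doppler) / dot(v_doppler, v_doppler)) * v_doppler;

    % Equation (6): subtract the EGO component; what remains is caused by
    % the target's own movement
    v_target = v_doppler - v_proj;
end

If v_proj comes out (approximately) equal to v_doppler, the target contribution vanishes, which corresponds exactly to the stationary-detection special case noted above.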

2.3 Theoretical overview of machine learning concepts and methods In this section, an introduction to machine learning is given together with a summary of some of the methods and problems that are of concern to this project.

A general explanation of some basic concepts of machine learning is followed by a more detailed description of the primary machine learning method used in this project: support vector machines.

The field of machine learning is usually divided into three main categories: supervised learning, unsupervised learning and reinforcement learning. Most of the methods that are of concern to this project fall into the category of supervised learning, that is, learning from examples.

2.3.1 Supervised learning, classification and overfitting In a supervised learning problem, pre-labeled input-output pairs must be available from which a system can learn [15]. The learning system uses the information in these training examples to adapt to patterns or trends in the data. If the learning is successful, the trained system can mimic behaviour, creating its own output when exposed to new input data. Supervised learning is characterized by the need for labeled examples or training data, that the operator or teacher uses to train the system.

Two classical problems in the field of supervised learning are the problems of classification and regression. Here, the objective is to create a mapping between one or several inputs and their corresponding outputs. In the classification case, the output takes the form of an integer, while regression methods can yield an output of any value [15].

The input is given as a vector of values. These values or features can take the form of continuous or discrete numbers that describe some property of the entity that is observed. In the classification case, the output integer expresses which class this input vector likely comes from.

Each training example is composed of a pair of input and output values. In the classification problem, each training example is given as a feature vector, and its corresponding output class. Each input can be seen as a point in a certain feature space. Here each element of the feature vector represents a position along an orthogonal axis in a Euclidean geometrical space.

Using supervised learning methods such as neural networks, support vector machines or Gaussian processes, the training examples can be used to create a complex structure that separates the feature space into areas of each class. When a new input point is given, the output class is determined depending on where in the feature space this point exists.

Binary classification In a binary classification problem, the goal is to predict, for a given data point in a considered feature space, which of two possible classes this data point belongs to. In the case of a 2D linear classifier, this will be done by finding a straight line (the decision boundary) that divides the feature space into two separate parts. All samples on one side of the line will be predicted as belonging to the first class, while samples on the other side will be predicted to belong to the second class. In the case of a higher dimensional feature space, the linear decision boundary will instead take the form of a plane (in 3D) or a hyperplane (in higher dimensions). [15]

Figure 6: Linear decision boundaries in two and three dimensions

A dataset is said to be linearly separable if a straight line, plane or hyperplane is enough to separate the dataset so that each point is on the correct side of the decision boundary. In many cases, this will not be possible. An example of a non-linearly separable case is given in figure 7.

Figure 7: An example of a non-linearly separable dataset

In order to handle non-linearly separable data sets, a non-linear classifier is needed. Some classification methods produce inherently non-linear decision boundaries, while others can be modified to enable non-linear separation. The usage of kernel methods in support vector machines, or the use of hidden layers in a neural network, are examples of such modifications. Using the right method, a decision boundary of arbitrary shape can be created.

The decision boundary is found by presenting labeled training points to a learning algorithm, which positions the boundary according to certain criteria. As can be seen in figure 6, there are infinitely many ways to place the decision boundary and still achieve separation of the two classes.

One way to determine which decision boundary is optimal is to maximize the distance between the decision boundary, and the closest data points on each side of the boundary. This distance is called the margin, and a classifier that determines its decision boundary by maximising this distance is called a maximum margin classifier.

Multiclass classification Binary classification is a very common problem to consider in supervised learning; however, not all problems are limited to just two classes.

Many classification methods such as support vector machines are inherently limited to binary classification only.

For the methods that cannot inherently produce multiclass classifiers, techniques exist with which to combine several binary classifiers into one multiclass model.

Studies of the most commonly used so-called ”binarization” techniques can be found in [16], [17], [18]. A brief summary is given below:

One vs All A quite straightforward method to produce a multiclass ensemble model is the ”One vs All” scheme (also called One against All, One against rest or OvA). Here a separate binary classifier is created for each class present in the original multiclass problem.

Thus an m-class classification problem is substituted by m binary classification problems. Each classifier is trained to separate one particular class from all other classes. The training of classifier c is conducted by letting the class label yc = 1 for all training points belonging to this certain class, while yc = −1 for the training points belonging to any other class. This process is repeated for each of the m classifiers.

To classify a new data point x, one output is obtained from each of the m classifiers, and the model output is chosen to be the class whose classifier gave the highest output score.

The OvA method has a downside in that it will be quite demanding in training time, due to the fact that m classifiers must be created, and every one of them uses the full training data set. Another problem with OvA is that each classifier will be trained on inherently unbalanced data sets, since generally there will be far fewer positive training examples than negative. This can make the resulting classifiers biased, since they could be prone to adapt more to the larger of the training sets. The strength of OvA lies in its simplicity, and the fact that it may be slightly faster in prediction than other techniques. This is because at prediction time, only m classifier scores need to be calculated, whereas with other techniques the number may be higher.

One vs One The ”One vs One” (also known as all pairs, OvO or one against one) method implies training one separate binary classifier for each pair of classes in the original multiclass problem. Each model is trained to differentiate between a single pair of classes, and is trained using only the subset of training points belonging to either of these two classes. For an m-class problem, this results in m(m−1)/2 different classifiers. In order to classify a new data point with the OvO scheme, the point is provided as input to each of the classifiers. The most common way to combine their outputs is to let each classifier vote for one of the two classes upon which it has been trained, and then choose the majority vote as the final prediction of the multiclass model. Since only a subset of the training data is used for each training, OvO is generally faster in training time than OvA even though a higher number of classifiers must be trained.

In prediction time however, OvO is generally slower due to the fact that m(m−1)/2 scores must be calculated for each new input point, compared to just m in the OvA method. Another problem is that since each classifier is only trained to differentiate between two of the classes in the original problem, the classifiers will often encounter data belonging to none of the classes upon which they have been trained. When this happens, the vote of these classifiers will be worthless in the final prediction, and they are sometimes referred to as incompetent predictors.

DAG and others Although OvA and OvO are the most commonly used binarization techniques, there exists a multitude of other techniques. The directed acyclic graph (DAG) and the binary tree of classifiers (BTC) are two other commonly used methods. They are similar to the OvO method in that one classifier is trained for each pair of classes in the data. However, they have an advantage in that not all classifier scores must be evaluated to classify a data point. Instead, a binary tree structure is traversed, where each node represents one classifier and every leaf represents one class to which the data point is finally assigned. In [19], it is argued that as long as the underlying binary classifiers are well made, the methods presented here are very similar when it comes to classification performance. For this reason, OvA could be the preferred method due to its simplicity. If time complexity proves to be a challenge however, it may be worth looking further into the other methods available.
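For illustration, a small Python sketch (not the implementation used in this project) of the two main binarization schemes is given below, using scikit-learn's One-vs-Rest and One-vs-One meta-estimators around a binary SVM. The data X and labels y are random placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Placeholder data: 200 feature vectors with 5 features, four classes 0..3
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 4, size=200)

ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # m binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # m(m-1)/2 binary classifiers

x_new = rng.normal(size=(1, 5))
print(ova.predict(x_new), ovo.predict(x_new))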

Overfitting When performing supervised learning, a common problem is that the learning system adapts too much to the training data.

Should this happen, the system will not generalize well, meaning that it will perform poorly when exposed to unseen data. This problem is known as overfitting [15].

The problem has a great effect on classification performance when noise is present in the training data. A model would be considered robust if it could look at the trend of the data, without adapting to individual noise observations. A less robust model would adapt to the noise points, giving a high score when evaluating performance on the training data, but showing drastically decreased performance when looking at new data. This is a sign of overfitting.

The problem of overfitting is tightly connected to what is called the bias-variance trade-off or dilemma [15]. As a predictive model grows more complex, its ability to adapt to training data will increase, resulting in a decreased bias. However, this will also increase the risk that the model adapts to noise or temporary components in the data. This causes the variance (the dependence on which particular set of training data is used) to increase. As such, a successful model will be neither too complex nor too simple, as a simple model may have difficulty adapting to the trend of the data (underfitting). An illustration of the bias-variance dilemma can be seen below in figure 8.

Figure 8: Illustration of the bias-variance dilemma

Dividing the dataset In order to evaluate a model's capability to generalize, its prediction performance must be tested on unseen data. This can be achieved by dividing the original labeled training data into two separate parts, one for training and one for validation.

The prediction model is created using only the training part of the data, and its prediction performance is evaluated on the validation part. This method can be

hard to use if the amount of labeled data is limited, as the division into training and validation sets will further limit the amount of data available.

Another downside of the method is that the resulting model, and the performance value, may be very dependent on the initial shuffling of the data: which observations end up in the training part, and which end up in the validation part.

Cross-validation A commonly used method to reduce problems with overfitting is cross-validation. In the k-fold cross validation method, the set of labeled data is first divided into k equal parts. Out of these k parts, k − 1 are used to train a classification model, while the last part is used for validation. By repeating this process k times, with a new validation part each time, and averaging the k different scores, the cross-validation score is obtained. [15]

By using this method, each observation in the data will be used for both validation and training purposes, making the cross-validation score less dependent on the initial shuffling of the data. Another advantage is that this method can efficiently be used even when the amount of labeled data is limited, as only a small part will be used for validation in any single iteration.
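As a hedged illustration (not the project's own code), k-fold cross-validation of a linear SVM can be computed with scikit-learn as sketched below; X and y are random placeholder data.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))      # placeholder feature vectors
y = rng.integers(0, 2, size=200)   # placeholder binary labels

# k = 5: train on four folds, validate on the remaining fold, repeat five times
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=5)
print(scores.mean())               # the cross-validation score is the average over the folds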

2.3.2 Support vector machines as a method for classification The support vector machine (SVM) is a learning method first introduced by Vladimir Vapnik in [20]. The method was further refined in [21] and [22], and has since become one of the most widely used machine learning methods.

The support vector machine is a maximum margin method that can be used for both classification and regression.

One of the benefits of SVM is that it is formulated as an optimization problem instead of an iterative method such as the neural network. This is beneficial since it provides a mathematical insight into the structure. Another advantage is that the training can be performed using optimization techniques such as quadratic programming [23].

Below, a linear support vector machine is shown. The name of the method is derived from the ”support vectors” - the data points that lie closest to the decision boundary, thus lying on the margin itself.

Figure 9: The support vector machine visualized in a two-dimensional feature space

In figure 9, the concept of support vectors and margin are visualised. More details about the mathematical formulation of SVM are presented below.

Mathematical formulation and primal problem Since a linear decision boundary takes the form of a hyperplane, it can be expressed as the set of points x that satisfies: ω · x + b = 0 (7) Here ω represents the normal vector of the hyperplane, while the bias b is the distance along ω to the origin.

The margin is defined by two separate hyperplanes both parallel to the decision boundary: ω · x + b = −1 and ω · x + b = 1 (8)

The size of the margin can be expressed as the geometrical distance between the two planes of equation (8): Msize = 1/‖ω‖ − (−1)/‖ω‖ = 2/‖ω‖ (9)

The objective of support vector training is to maximise Msize while ensuring that no data points exist between the two planes of equation (8). In fact only data points of class 1 should exist beyond the first hyperplane, and only points of class 2 beyond the other. Mathematically, this condition is formulated as two inequality constraints.

For all data points x of the first class:

ω · x + b ≤ −1 (10)

For all data points x belonging to the second class:

ω · x + b ≥ 1 (11)

Using the class label yi = ±1 the two constraints (10), (11) can be rewritten as a single inequality constraint:

For all N training data points 1 ≤ i ≤ N:

yi(ω · xi + b) ≥ 1 (12) From equation (9), it can be seen that in order to maximize the size of the margin, ‖ω‖ needs to be minimized.

Thus, the primal SVM optimization problem is stated as: min_{ω,b} (1/2)‖ω‖²

subject to yi(ω · xi + b) ≥ 1 i = 1,...,N. (13)

Since the evaluation of ‖ω‖ requires a square root, the reformulation into (1/2)‖ω‖² is made. This has the same minima and results in a problem suitable to be solved using quadratic programming techniques.

Dual SVM formulation In equation (13), the primal formulation of the SVM optimization problem was given. However, when implementing support vector machines, it is more common to make use of the dual problem.

In [23], the dual formulation is written as:
min_α (1/2) α · diag(y) · G · diag(y) · α − e · α
subject to α · y = 0
αi ≥ 0, i = 1,...,N. (14)

This formulation of the SVM problem is in a form readily solvable using convex optimization toolboxes. Here e is a vector containing only ones, while diag(y) is a

diagonal matrix containing all class labels yi = ±1. Also used is the Gram matrix of dot products G ≡ xi · xj.

Solving the optimization problem (14) yields the vector of Lagrange multipliers α. These can in turn be used to find the optimal separating hyperplane normal vector ω̂: ω̂ = Σ_{i=1}^{N} αi yi xi

Only a small subset of the training points xi will have corresponding αi ≠ 0. For these points, the inequality constraint in equation (12) will be an equality constraint: yi(ω · xi + b) = 1 (15) This means that they lie exactly on the edge of the margin; they are the support vectors [23].

Once the optimal hyperplane normal vector ω̂ is known, equation (15) can be used to obtain the hyperplane bias b:

b = yi − ω̂ · xi (16)

In [23], the bias value used is a weighted average value of b over all support vectors:

b̂ = Σ_i αi(yi − ω̂ · xi) / Σ_i αi (17)

When ω̂ and b̂ are known, the SVM classifier is constructed. The decision boundary is defined by equation (7) and thus any new datapoint xn can be classified according to class(xn) = sgn(ω̂ · xn + b̂) (18)
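To make equations (15)-(18) concrete, the following NumPy sketch recovers ω̂ and b̂ from a set of Lagrange multipliers and classifies a new point. The multipliers alpha are assumed to come from a QP solver (not shown), and the numbers are illustrative only; this is not the thesis implementation.

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 3.0], [4.0, 1.0], [5.0, 2.0]])  # training points xi
y = np.array([1, 1, -1, -1])                                    # class labels yi = +/-1
alpha = np.array([0.25, 0.0, 0.25, 0.0])                        # assumed QP solver output

w_hat = (alpha * y) @ X                          # w_hat = sum_i alpha_i y_i x_i
sv = alpha > 1e-8                                # support vectors have alpha_i > 0
b_hat = np.sum(alpha[sv] * (y[sv] - X[sv] @ w_hat)) / np.sum(alpha[sv])   # equation (17)

x_new = np.array([5.0, 1.0])
print(np.sign(w_hat @ x_new + b_hat))            # class(x_new) = sgn(w_hat . x_new + b_hat), equation (18)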

Kernels and soft margin The method on which SVM is based was first published in 1963 in [20]. However, this method, as well as the formulation given in equation (13), is unable to handle classification problems unless they are linearly separable.

In practice, very few data sets can be separated this way, and for this reason the method was not widely used. In [21], a way to enable nonlinear support vector classification using kernels was presented.

If the feature data can be transformed to a feature space of higher dimensionality, the classes may be linearly separable in that space. A linear decision boundary that separates the points in the high-dimensional space will be non-linear in the original feature space, enabling separation of non-linear data sets.

A simple illustration of the transformation and resulting decision boundaries is given in figure 10.

Figure 10: Illustration of the kernel concept

Mathematically, the transformation is described by the function φ. If x ∈ Rn then φ(x) ∈ RN , where N > n. To enable advanced decision boundaries in the original space, N is usually a very high number. With the primal formulation of SVM, this can lead to computing problems since the resulting optimization problem will be extremely big. With the dual formulation however, the size of the optimization problem will not change [23].

What makes the method even more powerful is that there is no need to compute any transform φ(x) directly, due to what is called the kernel trick. Since the dual formulation of SVM (see equation (14)) only contains data points x in the form of dot products within the Gram matrix, it is enough to compute the dot products of the points in the transformed space φ(xi) · φ(xj). The kernel function K is a way to calculate these dot products implicitly, without ever computing the data point coordinates in the higher-dimensional space.

Thus in order to implement non-linear classification with SVM, the only thing needed is to replace the Gram matrix G ≡ xi · xj with the kernel function K(xi, xj) ≡ φ(xi) · φ(xj).

The function K(xi, xj) can be chosen as any function that satisfies certain properties of an inner product. When implementing support vector machines however, it is common to use one of the following basic kernel functions [23], [24]:

• Linear kernel (the Gram matrix): K(xi, xj) = xiᵀxj
• Sigmoid kernel: K(xi, xj) = tanh(γxiᵀxj + r)
• Polynomial kernel: K(xi, xj) = (γxiᵀxj + r)^p
• Radial basis function (RBF) kernel: K(xi, xj) = exp(−γ‖xi − xj‖²)

Here γ, p and r are kernel parameters. A method to choose suitable values for these parameters is given in the section about model selection further below.
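Written out in NumPy, the kernels above take the following form for two feature vectors xi and xj; the parameter values are arbitrary examples, and this sketch is purely illustrative.

import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                      # ordinary dot product (Gram matrix entry)

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

def polynomial_kernel(xi, xj, gamma=1.0, r=1.0, p=3):
    return (gamma * (xi @ xj) + r) ** p

def rbf_kernel(xi, xj, gamma=0.5):
    return np.exp(-gamma * np.sum((xi - xj) ** 2))      # exp(-gamma * ||xi - xj||^2)

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(linear_kernel(xi, xj), sigmoid_kernel(xi, xj), polynomial_kernel(xi, xj), rbf_kernel(xi, xj))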

It should be noted that the usage of kernels may result in a slightly extended execution time in the prediction stage, since the Gram matrix only represents a dot product, while the different kernel functions may require additional computations.

If class separation is possible using the linear kernel, it should always be chosen, since this will minimize execution time and also reduce the risk of overfitting.

Soft margin To reduce the risk of overfitting, it is common in machine learning techniques to have what is called a regularization parameter. This parameter can be seen as a tool to control the complexity of the model, and to prevent it from adapting to noise in the training data.

In the SVM formulation given in the beginning of this section, it was stated that each data point should be on the correct side of the two planes that make up the margin. Mathematically, this was formulated as the constraint in equation (12).

yi(ω · xi + b) ≥ 1

Due to this constraint, the original formulation is very sensitive to noise in the training data. With linear SVM, a single noise point can lead to a margin of drastically reduced size, or an optimization problem that is not solvable at all.

In the case of non-linear SVM (using kernels), the constraint, in combination with noise in the training data, often leads to a model of unnecessary complexity. The decision boundary of such a model will curve and adapt to every single noise point in the data, resulting in problems of overfitting.

In [22], what is now called ”soft-margin” SVM was presented. With this modification, the primal optimization problem can be expressed as follows:

min_{ω,b} (1/2)‖ω‖² + C Σ_{i=1}^{N} ξi

subject to yi(ω · xi + b) ≥ 1 − ξi i = 1,...,N. (19)

As can be seen, the constraint from (12) has been slightly modified. The soft-margin formulation of SVM introduces slack variables ξi that permit data points to exist on the wrong side of the margin. By specifying the regularization parameter C, the user can decide how much ”slack” is allowed. The resulting SVM model is much more resistant to noise, since individual points are allowed to exist on the wrong side of the margin.

In the dual formulation of SVM, the soft margin implementation only leads to an additional constraint αi ≤ C.

Figure 11: Illustration of soft-margin SVM

In figure 11 the benefit of soft margin is clear; a much wider margin is created instead of adapting the decision boundary to the individual noise point.

Model selection The predictive performance of a support vector machine model is dependent on not only the training data from which it is constructed, but on SVM parameters as well.

A normal scenario is that there are two parameters to select: the soft-margin cost parameter C, and a kernel parameter in the case of kernel-SVM.

When C → ∞, the cost for slack in the model will be so high that it will in effect be a hard margin classifier. This means that each training observation must be placed on the correct side of the decision boundary. This will yield a model that always gets a 100% score on training data, but is extremely sensitive to overfitting and noise. As C gets smaller however, the error rate of the model will increase as more and more points will be placed on the wrong side of the decision boundary, and the model complexity is reduced.

The mathematical formulation of SVM, and the fact that this parameter can take any value, makes it hard to intuitively decide a suitable value for C. It is instead recommended to use some form of iterative method and choose the value which yields the best results for the dataset in question.

Grid-search A straightforward and very powerful method to choose parameters is the parameter grid search, suggested in [25], [24]. This is a brute force method, and as such it has a very high time complexity and may require a lot of processing power and time. In spite of this, the simplicity and performance of the method have led to it being very commonly used, such as in [16], [26].

In the case of linear SVM, with only one parameter to find, a region of possible values to assess is defined, and for each C value in this region, a cross-validation score of the classifier on this particular dataset is obtained and stored. The C value is then chosen as the one that yielded the highest score.

In the case of kernel SVM, there may be additional parameters to identify, such as the RBF-kernel parameter γ or the polynomial kernel parameter p. In these cases, a region of values to assess is defined for all parameters, and a grid composed of all possible parameter combinations is created. After looping through the grid and getting all the cross validation scores, the parameter combination that yielded the highest score is chosen.
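A minimal sketch of such a grid search, assuming scikit-learn's GridSearchCV and random placeholder data (again, not the code used in this project):

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

param_grid = {
    "C": [0.1, 1, 10, 100],        # soft-margin cost values to assess
    "gamma": [0.01, 0.1, 1, 10],   # RBF kernel parameter values to assess
}
# Every (C, gamma) combination is scored with 5-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)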

If several parameter choices yield similar cross validation scores, it is important to select one that yields good generalization in the model. For C and p, this means choosing as low a value as possible, since this will reduce the complexity of the model. For the RBF parameter γ, a low value should likewise be preferred, since γ → 0 makes the kernel behave increasingly like a linear kernel and gives a less complex model, while a large γ lets the decision boundary adapt to individual training points.

A less complex model is always desirable since it will reduce the risk of overfitting and improve generalization capability. See the part about overfitting in section 2.3.1.

A less complex model will also mean that fewer support vectors are required, and thus reduce the execution time of the classification stage.

SVM outputs and Platt-scaling When using an SVM model to predict the class of a new data point xn, the point is inserted into the hyperplane equation using the known hyperplane parameters: f(xn) = ω̂ · xn + b̂ (20)

The sign of f describes which side of the decision boundary the data point exists on, and thus which class it likely belongs to. The size of the output in turn gives an indication of how far from the decision boundary the point exists. In this sense, a large value of f would mean that the prediction is quite accurate, while a value close to zero means that the point is close to the decision boundary and could easily belong to the other class as well.

As such, the SVM output score is quite unintuitive, as it is hard to know what value the output should take before the prediction can be considered to have a good level of certainty.

In order to facilitate post-processing, methods have been developed that calibrate SVM scores to a probabilistic output. One such method is called Platt scaling.

The method works by fitting the outputs of the original SVM to a sigmoid function. In [27], the probability that the data point belongs to the positive class is calculated as: p(y = 1|f) = 1 / (1 + e^(Af + B)) (21)

A large positive value of f will yield a Platt score approaching 1, while a large negative value will give a Platt score approaching zero. Besides being bounded between 0 and 1, the probabilistic score has an advantage in that it gives a more intuitive idea of the certainty of a prediction.

The idea behind Platt scaling is the assumption that the output of a SVM is proportional to the log odds of positive examples. Using this assumption, Platt finds the parameters A and B using maximum likelihood estimations from a set of training data. It should be remembered that the Platt output is only probability-like (not a true probability but an estimation).
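For illustration, scikit-learn exposes Platt scaling as sigmoid calibration; the sketch below (placeholder data, not the thesis code) fits the A and B parameters of equation (21) on held-out folds and returns probability-like scores.

import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

platt = CalibratedClassifierCV(SVC(kernel="rbf", C=1.0), method="sigmoid", cv=5)
platt.fit(X, y)
print(platt.predict_proba(X[:1]))   # probability-like score for each class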

2.3.3 Classification performance analysis As is described in section 2.3.1, a common way to evaluate the performance of a classifier is to test the system on a set of validation data that was not used when training the system. By comparing the outputs of this test against the known class labels of each validation data point, several performance parameters can be obtained.

In a binary classification problem, the parameters below are commonly used to evaluate performance [28]:

• True positives - These are the positive examples correctly classified as positive.
• True negatives - These are the negative examples correctly classified as negative.
• False positives - These are the negative examples incorrectly classified as positive.
• False negatives - These are the positive examples incorrectly classified as negative.

To enable an easy overview, the parameters can be used to construct a confusion matrix or table of confusion:

Table 2: Binary classification confusion matrix

Actual class | Classified as 1        | Classified as -1
1            | nr of true positives   | nr of false negatives
-1           | nr of false positives  | nr of true negatives

Based on the four parameters, several measurements of classification performance can be calculated. The simplest way to measure performance is the accuracy measurement [29]:

Accuracy = (ntp + ntn) / ntot

Here, ntp and ntn are the number of true positives and true negatives respectively, while ntot = ntp + ntn + nfp + nfn is the total amount of validation data examples.

The accuracy measurement has a weakness in that the score does not necessarily prove good performance for unbalanced data sets. For example, a dataset containing 90% positive examples would yield a score of 90%, for a classifier that can only yield positive outputs.

The error measurement is tightly connected to accuracy: Error = (nfp + nfn) / ntot

The error measurement can be useful due to the fact that a small change in accuracy often reflects a big change in error. For example, an improvement in accuracy from 0.9 to 0.95 means a 50% reduction in error.

To provide further insight into the performance of a classifier, several other measurements can be calculated [28], [29]:

Precision = ntp / (ntp + nfp)

The precision or positive predictive value is a measurement of the fraction of examples classified as positive that were actually positive.

Sensitivity and specificity, also called true positive rate and true negative rate respectively, are two additional measurements: Sensitivity = ntp / (ntp + nfn), Specificity = ntn / (nfp + ntn)

Sensitivity is a measurement of the fraction of positive examples that were correctly classified as positive, while specificity is the fraction of negative examples that were correctly classified as negative. Another word for sensitivity is recall.

If a classifier is tuned with the purpose of getting a high precision value, the recall value has a tendency to decrease. If the aim is a high recall value, the precision may decrease.

F-Measure is a measurement that combines the two as the harmonic mean of precision and recall: Fmeas = (2 · precision · recall) / (recall + precision)
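The binary measurements above can be computed directly from counted outcomes, as in the following NumPy sketch with made-up labels (+1 / -1):

import numpy as np

y_true = np.array([ 1,  1, -1, -1,  1, -1,  1, -1])   # known validation labels
y_pred = np.array([ 1, -1, -1,  1,  1, -1,  1, -1])   # classifier outputs

n_tp = np.sum((y_pred ==  1) & (y_true ==  1))
n_tn = np.sum((y_pred == -1) & (y_true == -1))
n_fp = np.sum((y_pred ==  1) & (y_true == -1))
n_fn = np.sum((y_pred == -1) & (y_true ==  1))
n_tot = len(y_true)

accuracy    = (n_tp + n_tn) / n_tot
precision   = n_tp / (n_tp + n_fp)
recall      = n_tp / (n_tp + n_fn)            # sensitivity / true positive rate
specificity = n_tn / (n_fp + n_tn)
f_measure   = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, specificity, f_measure)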

For multiclass classification problems, the parameters true positive, true negative, false positive and false negative have a slightly different meaning, since there are more than just the positive and negative output. However, a class-specific variant of the measurements can be calculated by looking at each class separately.

In the multiclass case, for one specific class c, the true positives tpc are all validation data points correctly classified as belonging to c. The true negatives tnc are all the data points that did not belong to c, and were classified as not belonging to c (regardless of whether they were correctly classified or not).

The false positives fpc are all the points predicted as belonging to c, when in fact they did not. The false negatives fnc are all points that belong to c, but were classified as not belonging to c.

These new definitions enable class specific variants of the measurements mentioned earlier, such as precision and recall.

In [28] and [29], additional measurements are considered for multiclass problems:

Average_acc = (1/C) Σ_{c=1}^{C} (ntpc + ntnc) / ntotc
This is the average of the class-specific accuracy values over all C classes.

Average_err = (1/C) Σ_{c=1}^{C} (nfpc + nfnc) / ntotc
This is the average of the class-specific error over all C classes.

Mean_Fmeas = (1/C) Σ_{c=1}^{C} (2 · precisionc · recallc) / (recallc + precisionc)
Mean F-measure (MFM) is the average of F-measure values over all C classes.

2.4 Extraction and analysis of object descriptions In this section, means of extracting and analysing the relevance of descriptive data are discussed. As discussed in section 2.3.1, Supervised learning, classification and overfitting, a pre-condition to supervised learning is the existence of training examples. These examples, consisting of input-output mappings in the form of feature vectors with class labels, have to be constructed somehow.

Consider the specifics of this project: input data is delivered in the form of radar detections with spatial location and some other parameters. These detections should be used to describe a real-world object, and in the end determine which class this object belongs to. There are at least two necessary steps required to accomplish this: the clustering of data, and the subsequent feature extraction from clustered data.

In the clustering step, input data points are grouped together into clusters believed to come from the same real-world object. This is done using some heuristic or domain knowledge, such as spatial closeness. In a machine learning context, clustering can be seen as unsupervised learning. This is because it needs no training or labeled examples to extract category information from, or discriminate within, a set of data.

In the feature extraction step, computations are performed on the clustered subsets of data in order to extract descriptive values. Such a descriptive value can be the average or variance of a certain measurement type, like amplitude.

Closely connected to the feature extraction is the concept of feature selection. Since there is a virtually limitless space of feasible computations that can be made on a set of data (implying a nearly infinite set of different feature extractions), one needs a method to determine the relevance of a certain feature as well as some heuristic as to what is a good feature in general.

While feature extraction is straightforward and only implies performing computations, feature selection covers the more delicate considerations that have to be made in order for the extraction step to perform well.

The methods for feature selection and evaluation, specifically the use of principal component analysis, are also useful for general data exploration. This exploration is naturally an integral part of a project such as this, since domain knowledge regarding the data is necessary.

In the sections below, methods available for spatially clustering data, as well as means to perform feature selection are discussed.

2.4.1 Clustering of sensor data In this section, some of the relevant aspects of the theory of clustering of data are presented. As stated in section 1.2.3, Project requirements, clustering is considered

a soft requirement for the success of this project. Hence, some of the available theory regarding clustering of data is presented below.

There is no exact definition of what constitutes a ”cluster”. One needs some domain knowledge and pre-conception of what a cluster is in a particular context in order to apply an appropriate clustering method [30]. If one has little domain knowledge, and a data set consisting of densely populated regions separated by sparse areas containing noise points, a density-based clustering method can be preferable [31][32].

There are several such density-based clustering algorithms available, for example DBSCAN (Density Based Spatial Clustering of Applications with Noise) [32] that will be described in detail below, and OPTICS (an extension of DBSCAN that allows for a more automated choice of clustering parameters [33]). There are also several extensions for parallel computing of the algorithm, such as PDBSCAN [31].

Clustering of sensor data with DBSCAN The fundamental idea of density-based methods is to formalize the notion of density that comes intuitively for a human [32].

DBSCAN essentially works by defining a point as belonging to a cluster if it lies within distance Eps of another point belonging to said cluster; such a point is said to be (directly) density reachable. A cluster is formed if there are at least MinPts points that are density reachable to each other. The points of a cluster need not be directly density reachable; it is enough if they can be indirectly connected [32]. Figure 12 below illustrates these clustering requirements:

Figure 12: DBSCAN clustering method

In figure 12, the parameter Eps determines whether points are within density reachable distance from one another. Clustered points are displayed as green, points that are density reachable but not clustered due to too few points are red

and the black point in the middle is not density connected to any other points. The green points are all considered indirectly density reachable and thus become density connected. Since the number of points that are density connected is also at least MinPts, they form a cluster.

DBSCAN can incorporate any distance function, with the most common being the Euclidean distance (as is the case in figure 12 above). However, if one has domain knowledge that suggests clusters have a particular shape, other distance functions could be used [32].

When using DBSCAN, one can determine suitable clustering parameters with little domain knowledge, which is beneficial in many cases. The parameters Eps and MinPts do have to be specified, so some domain knowledge is required. However, if one has an idea of the characteristics of a typical data set, this is quite straightforward. For example, one can determine the smallest cluster that should be detectable by manually looking at data sets. One can also determine a reasonable Eps distance threshold by the same method.

Since the methods behind parameter selection are so heavily dependent on the data domain, general theory about this would be of little use. Instead, the specifics of parameter selection are discussed in the method chapter.

The time-complexity of DBSCAN on an un-indexed data set is O(n²). If one needs a faster computing time and can pre-partition the data, a time-complexity of O(n log n) can be reached [31].

There could be performance optimizations to be made from several different angles compared to just using DBSCAN with a Euclidean distance function. One could use a different distance measurement that takes advantage of some data domain knowledge (for example, if one knows that clusters always appear in elliptic shapes with a certain direction, one could use such a distance measurement). One could, if performance is too slow, try to incorporate a pre-partitioning of the data set to input into the clustering algorithm. And finally, one could use some more advanced algorithm like OPTICS instead of DBSCAN to automatically, even dynamically, determine clustering parameters and therefore have an easier time dealing with clusters of different sizes.

Algorithm description Consider the data set D containing N points pn, n = 1,...,N, to be clustered. The epsilon neighbourhood N(p) denotes the collection of points within distance ε (the Eps parameter) of a chosen point p. The minimum number of points a cluster should contain is denoted MinPts. A point is considered undecided if it has not been assigned to a cluster or labeled as noise. The algorithm goes:

Algorithm 1 DBSCAN

1: Initialize the state of each point pn in data set D to be undecided.
2: while there exist undecided points in D do
3:     choose an undecided point pn and compute N(pn)
4:     if |N(pn)| ≥ MinPts then
5:         form a new cluster C and insert pn into C
6:         form set C′ containing N(pn) − pn
7:         while there are undecided or noise points in C′ do
8:             for each undecided or noise point qn ∈ C′ do
9:                 insert qn in C and compute N(qn)
10:                if |N(qn)| ≥ MinPts then
11:                    expand C′ to contain N(qn)
12:                end if
13:            end for
14:        end while
15:    else
16:        label pn as noise
17:    end if
18: end while

When the algorithm has finished, one is left with the original collection of points, with a cluster ID (or a corresponding noise label) associated with each point.
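As an illustrative sketch (not the project implementation), scikit-learn's DBSCAN can be applied to 2D detection coordinates as below; the eps and min_samples values are arbitrary examples.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
detections = np.vstack([
    rng.normal(loc=[ 5.0,  2.0], scale=0.3, size=(20, 2)),   # dense group 1
    rng.normal(loc=[15.0, -3.0], scale=0.3, size=(15, 2)),   # dense group 2
    rng.uniform(low=-20, high=20, size=(10, 2)),             # scattered noise points
])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(detections)
print(labels)   # one cluster ID per detection; -1 marks points labeled as noise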

2.4.2 Selecting and extracting features from data clusters In this section, the methods and concepts regarding the selection of features relevant to this project are presented. The methods presented here include some theory behind feature selection, as well as some theory behind the method of principal component analysis which is a common way of looking at high-dimensional data.

Feature selection The concepts of feature selection are important for many machine learning applications. If there exists a large space of features available for a particular data set, some form of feature selection has to be performed unless computing power is of no concern. This can be done either in a formalized way, such as with a specific algorithm, or by using domain knowledge or a useful heuristic [34].

Data exploration is a closely connected theme, as the evaluation methods used for feature selection can also be seen as descriptors of the data set. Thus, the methods presented here can also be used for stating certain characteristics of the data set used.

Feature selection is done to reduce the dimensionality of the data set to be used in a particular machine learning application, with the intention of increasing

time-performance, increasing classification accuracy, or both. Feature selection is to be distinguished from feature extraction (the process of combining features from the original data set into a new set of features).

It is an area subject to much research, and there is ample theory available (see for example [35] and [36]).

According to [34], the process of feature selection (or feature evaluation) can be divided into three different approaches. These are the filtering, wrapper and embedded approaches. They will be briefly explained below.

Filtering The filtering approach works independently of which classification method is used. It uses parameters such as quality measurements from information theory (with regards to, for example, information gain) to generally determine whether a feature is useful or not. The upside of using the filter approach is its universality; it can be applied in any circumstance. The downside is that it could negatively affect performance, and also that it inherently lets through features that might provide very little (but still positive) information gain.

Wrapper The approach of using wrappers means evaluating the performance of a subset of features on the classification results (thus ”wrapping” the feature selection into the rest of the machine learning procedure). If classification accuracy is the most important performance characteristic, this approach is good. It has the downside of possibly inducing bias (in that it adapts the features used to the particulars of the rest of the machine learning methods). It can also be very computationally expensive during learning, and thus prove to be unfeasible.

Embedded approaches The usage of embedded approaches imply using machine learning methods that have inherent means of feature selection. Methods such as artificial neural networks applying pruning, and decision trees, are examples that have inherent mechanisms for ordering or selecting features.

The above distinctions can be useful from a theoretical perspective. However, when conducting feature selection (and often in conjunction with feature extraction), the usual and widely accepted method is to have some initial proposition of features to use, evaluate their performance and optimize the feature set for some criteria. To have an initial feature proposition, one can use some algorithmic methodology, but more common is the use of some heuristic or what can be referred to as expert knowledge [34].

Below, a commonly used method of evaluating features and of looking at high-dimensional data in general is presented.

2.4.3 Principal component analysis for feature evaluation Principal component analysis (PCA) is a useful tool for reducing the dimensionality of data sets that would otherwise be hard to visualise. The

main idea is to transform the data set in question into a new set of variables, the principal components (PC). These are uncorrelated with each other, and are ordered in such a way that the first variable contains most of the variance of the data set, the second contains second most and so on [37]. Mathematically, the first PC is defined as:

α1′x = α11x1 + ... + α1pxp = Σ_{j=1}^{p} α1jxj (22)

Where α1 is a vector of p constants, x is a vector of p random variables and the linear function α1′x is a line along which x has the highest variance. The second PC is the line α2′x, which is the line uncorrelated with α1′x having maximum variance, and so on. There are p possible PCs to find, although one rarely wants to use all of them (since that would defeat the purpose of dimensional reduction) [34].

The PCs correspond to the eigenvectors of the covariance matrix Σ (or in the more usual case of an unknown covariance matrix, the sample covariance matrix S is used) of the random variables x. The first PC corresponds to the eigenvector with the largest eigenvalue, the second PC with the second largest eigenvalue and so on. They are usually found by means of Lagrange multipliers, by solving: (Σ − λkIp)αk = 0 (23) so that αk′x is the k:th PC and var(αk′x) = λk, where λk is the k:th largest eigenvalue [34].

When deciding how many principal components to use, one can look at the cumulative percentage of total variation, defined as:

tm = (100/p) Σ_{i=1}^{m} li (24)
where tm is the cumulative percentage of variation, p is the total number of variables, m is the number of principal components used and li is the variation percentage of the i:th principal component. A sensible cutoff is between 70 and 90 percent, although an exact figure is hard to recommend [37].
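A small sketch of this selection criterion, assuming scikit-learn's PCA and placeholder data (illustrative only):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))                                 # placeholder feature vectors, p = 6

pca = PCA().fit(X)
cumulative = 100 * np.cumsum(pca.explained_variance_ratio_)   # cumulative percentage of total variation
m = int(np.searchsorted(cumulative, 80) + 1)                  # smallest m reaching an 80 percent cutoff
print(cumulative, m)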

2.5 Frame of reference conclusions In the sections above, current best practice and previous work have been examined. Existing theory behind concepts and methods of interest has been presented and discussed. In this section, we state some conclusions regarding the areas identified as having major relevance for what the final system solution looks like.

2.5.1 Current best-practice in vehicle perception When it comes to current best practice, a few critical issues that separate this project from other projects within similar fields have been identified:

• Others use other sensors as well as radar (and often no radar at all). The employment of laser sensors is extensive, and while this sensor system provides a very good source of input, it is tremendously expensive compared to radars. Also, since it contains moving parts, it is much less robust than the fixed electronics of a radar. The use of cameras is also extensively employed, and while cameras have a cost and robustness comparable to radars, they get easily obstructed and demand more computing power than radars. Also, in the cases where cameras are employed together with radars, it is often the camera data that is the main source of classification data. Radars are often just helping the system, instead of constituting a majority of it.

• Others always have history-dependent solutions (like Kalman filters or Markov chain Monte Carlo based methods). In our case, this is both impossible (due to requirements) and unwanted (due to wanting a low-level system). The accuracy of systems in previous work must be considered in the light of this when being compared to our project.

• In the cases where there is great similarity to this project, previous work does not divulge the specifics of the solutions, such as which algorithms are used for clustering or what features are used for the support vector machines. While there might be many reasons to avoid being specific (such as business secrets, focus being put elsewhere, uninterest/ignorance or academic fuzziness), this also means that there is little theory available with regards to state-of-the-art methods.

Previous work differs to the point that the nomenclature usually employed in vehicle perception is not really applicable here. Despite heavy sensor prevalence in both industry and research, the usage of radar sensors in this particular context is not well-explored. No real advice has been found in the more general cases either, in the articles described under 2.1.2, Doppler radar as input to learning systems. Critically, a lot of previous work in the area of Doppler radar data classification deals with raw data, and not the pre-processed radar detections that the sensors in this project deliver. It seems that the particulars of this project are not a well-explored area. There might exist previous cases where a similar usage of radar sensors is discussed; however, none have been found. Thus, although concepts and certain methods can be used within this project, the solution approach has to be different and has to be developed without a comprehensive foundation of previous projects to rest on. Consequently, our approach is based on lower-level research of the different steps necessary to complete the requirements of this project.

2.5.2 Theory and methods employed Here, some brief conclusions regarding the choice of methods in this project are discussed.

Two major areas of theory have been presented: the machine learning area, and the data extraction and analysis area. These two major areas have been deemed necessary and sufficient in order to gain a working system. This is excluding more practical considerations such as coding, filtering, sensor data fusion and hardware implementation.

In a machine learning context, these two areas can be seen as different approaches to the same problem, namely supervised versus unsupervised learning. Supervised learning constitutes training a classifier from labeled examples. Unsupervised learning constitutes extracting data and discriminating between it in a manner that does not need labeled training examples, such as clustering and filtering.

From a theoretical viewpoint, the major methods and structures of one area could be replaced without having to completely revisit the other area. The choice of methods here should be seen as initial approach suggestions, and not as being dependent or logically following one another.

Practically, the division into a pre-processing step (clustering and feature extraction) and a classification step (employing support vector machines) is beneficial since it allows parallelization of work flow. They are also associated with different types of results: extraction and analysis can be used if one desires an increased domain knowledge for other reasons than classification. Support vector machines can be used with a different set of input data.

Data extraction and analysis In order to extract a coherent and useful feature vector from a set of data inputs, two steps are performed. First, the data is spatially clustered using the algorithm DBSCAN. Then, feature extraction is performed. Which features to use is decided through feature selection, where a principal component analysis of the data is performed to gain domain knowledge and determine the usability of features.

Classification In order to classify a feature vector as belonging to one or none of the proposed classes, the method of choice is support vector machines. To provide a multiclass classification system from the binary nature of SVM, the multiclass ensemble method of One vs All can be employed.

To enable integration of the class output into the object tracker, Platt scaling can be used to provide a confidence value of probabilistic nature. This process will make the class output more valuable since an intuitive measurement of confidence is provided with each prediction.

Part 3: Methods

This chapter aims to explain the methods developed in order to achieve the project goal stated in chapter 1. The methods presented here are based upon the theoretical framework laid out in chapter 2 and will be evaluated in chapter 4.

3.1 Method overview and system introduction In this section, a brief introduction to the methods used in the development process and an overview of the complete system solution is presented. Below in figure 13, a schematic overview of the development process and the classification system is shown:

Figure 13: Development process and classification system overview
(The figure shows two parallel columns: the development process, consisting of gather data, data exploration and feature selection, system development and system implementation; and the classification system, consisting of sensor data input, detection filter, data clustering, feature extraction, classification, and class and confidence output.)

The figure above outlines the major components of the process required in order to gain a complete system. Each segment represents a major functional part of the system or a major development process. These are all presented more in detail below. How each method and implementation is verified and validated is discussed in its respective section, and the means of validating the complete system is discussed in the section 3.6.3, Validation of final system implementation.

Verifying whether the chosen system structure is optimal is not an easy task, and will not be attempted in this project. Verifying whether the structure is functional is useful, however, and this can be derived from the complete system validation.

3.1.1 Stages of system development The green box in figure 13 above represents the system development process. This process contains the pre-requisites to the classification system and also the system

implementation process. The major development and system input at this stage is the sensor data, symbolized by the blue arrow. The intended output is a working system, in turn with a reliable system output, symbolized by the orange arrow.

Sensor data input This is the input data stream, consisting of sensor data delivered by the radar sensors. The data is pre-processed, partly in the sensors themselves (they deliver radar detections and not raw radar signal data) and partly by pre-existing software structures (see section 1.2.2, Target system).

It should be specifically noted that all data used throughout the project comes from actual logged sensor input. So when simulations are performed, they are done using real sensor data in the simulated environment.

Gather data In order to perform supervised learning and to gain data domain knowledge, the first process needed is the gathering of data. How this is done is described in detail in section 3.2.1, Test-track data gathering.

Data exploration Data exploration in this sense implies exploring the usage of different features to describe objects, as well as using statistical methods to determine characteristics of the labeled data. It can be performed after sensor data has been gathered. In this process, different feasible features that can be extracted from the training data are examined. The best ones are selected to be part of the feature extraction. This is described more in detail in section 3.3, Practical selection and analysis of object descriptions.

System development In this step, the classification system is constructed and trained. This represents an umbrella stage, containing all of the processes and systems described below belonging to the classification system.

System implementation Finally, after the system is developed, it has to be implemented into the real-time environment of the Astator platform. Here, considerations such as timing constraints and time-complexity of algorithms need to be examined. Details of the system implementation process can be found in 3.6, System implementation on target platform.

The constituents of the classification system are presented briefly below.

3.1.2 Classification system overview The yellow box in figure 13 above symbolizes the actual classification system being developed. The components of this system, parameter choices and training of the classifier are developed using offline methods and continuous verification before being implemented to the real-time system (within the System implementation step).

Detection filter Here, an initial filtering step is performed, with the main purpose of removing detections belonging to stationary objects. This is discussed more in detail in section 3.4.1, Filtering of radar detections.

Clustering This is where detections are grouped into clusters, to allow for a better extraction of object characteristics. The methods used here are discussed more in detail in section 3.4.2, Clustering of radar detections using DBSCAN. Additionally, a post-clustering filter structure is implemented with the intention of removing ”unwanted” clusters.

Feature extraction In this step, computations are done on clusters in order to extract a feature vector that contains characteristic data of the presumed object that the detection cluster comes from. What features are extracted, why, and how this is done is discussed more in section 3.4.3, Feature vector calculation.

Classification This is where each feature vector is classified as belonging to one of the four classes. Many considerations are required here, discussed in section 3.5.1, Implementation of support vector machine system.

Class and confidence output Here, the final system output is constructed. The computations done in this step include ensemble methods for multiclass classification, discussed in section 3.5.2, Multiclass, rejection and confidence structures. A confidence measurement is also constructed here, also discussed in 3.5.2.

This section has outlined the rest of the contents of this chapter and provides an overview to much of the work performed in this project. In the sections below, the different subsystems and processes are explained in detail.

3.2 Gathering radar detection data In order to create a classifier using supervised learning, labeled training examples are required. Gathered data is also necessary for most parts of the system development process, such as system simulation and verification, analysis of data domain and the development of filtering processes. This section describes the process of gathering and labeling the data required in order to construct the classification system.

3.2.1 Test-track data gathering In order to acquire labeled data from each of the four classes, several test scenarios using the Astator platform have been realized. Below, the details of these test scenarios are presented. The tests were conducted at the Scania testing track. Several different tests were conducted in which a known object moved on the road while the Astator system continuously logged all radar input and saved it to file. In order to have more control over the test environment, the EGO vehicle was stationary, while the observed object was the only moving object in the vicinity. The following scenarios were constructed:

Pedestrian For the pedestrian class, it was decided that instead of moving on the road, the pedestrian should move close to Astator. This was partly because the radars have difficulty detecting pedestrians at range, and partly to avoid having the pedestrian on the road, where heavy vehicles could occasionally be driving.

The pedestrian scenarios are as follows: The pedestrian moves around Astator in a circle of approximately ten meters radius, walking slowly in the first test and using a jogging and a running gait in the second and third tests respectively. To investigate the ranges at which the radars can detect pedestrians, additional tests were performed. In the first, a pedestrian starts far from Astator, at a distance of approximately 60 meters, and moves straight towards the vehicle. In the second, the pedestrian moves towards the vehicle, but at an angle. Both of these tests were done in both a walking and a running gait.

In addition to these seven tests, a scenario where the pedestrian moves in a more chaotic manner was added. Here velocity, heading, and distance to EGO are all varied throughout the test in order to get a bigger spread in any considered feature space.

All pedestrian tests were conducted in two variants, one where the pedestrian is wearing black clothes, and one with a reflective safety vest. This was to get additional spread in the feature space.

Bicyclist For the bicycle class, the first test scenario was set up as follows:

Astator is parked parallel to, and at a distance of approximately 30 meters to the road. The bicyclist moves straight along the road at one velocity per test. The first is a relaxed cruising velocity, the second a more active, normal bicyclist velocity and the third is the maximal possible speed.

The second test scenario has the first three tests repeated, but with Astator placed in a 45 degree angle with respect to the road. This was to get a bigger spread in the headings of the detected object.

Lastly, a random speed and direction test was added where the bicyclist moves around Astator in a snakelike pattern. As with the pedestrian case, all bicyclist tests were repeated in two variants, the first in black clothing and the second with more reflective clothes.

Vehicles To gather data from the personal vehicle and truck classes, similar tests to the bicycle case were conducted. Three velocities were chosen: 10 km/h, 30 km/h and 50 km/h. A higher speed was avoided due to limitations on the test track, and the low speed of 10 km/h was added to ensure that the class velocities overlap in the training data.

Data from these three different velocities were gathered in both the parallel and angled scenario. An additional test with random direction and velocity was added for both personal vehicle and truck.

To increase spread in the data, two different vehicles were used for each class, the main difference being their size. For the personal vehicle tests, the first car used was the VW E-Up, a small electric car. The second was a VW Transport Crafter 35, a much bigger personal vehicle. For the truck class, a Scania truck was used. The tests were performed once with the truck standalone, and once with an added trailer.

3.2.2 Labeling of gathered data After data is gathered, it needs to be labeled according to class in order to be useful for supervised learning. Preferably, this would be purely scripted. However, this approach is problematic due to noise being present in the sensor signals, which could lead to a lot of mislabeled samples.

Instead, logs are gone through manually frame by frame. Detections belonging to a known object are labeled as such, and detections that clearly do not belong to any moving object are labeled as noise. This method is very time consuming, but does provide some benefits.

With manual labeling, noise data can be stored in addition to the wanted data, but with a separate noise label. This noise data can be used to research the properties of radar noise, to develop methods for handling it, and to evaluate filter structures.

These labeled samples, each containing a feature vector and a class label, provide a necessary pre-requisite for supervised learning but can also be used for many other purposes, such as analysing characteristics of the different classes.

3.3 Practical selection and analysis of object descriptions This section contains an overview of the methods used to choose and evaluate object representations. The same methods can also be used for general exploration of data characteristics, and how this is done will also be presented below.

In order for features to be computed, the sensor data first has to be clustered into groups thought to belong to the same object (as described in section 2.4, Extraction and analysis of object descriptions). The specifics of the methods used for clustering can be found in section 3.4.2, Clustering of radar detections using DBSCAN.

Feature selection method As discussed in section 2.4.2, Feature selection, there are several approaches that can be taken to find suitable features. In this project, an applied version of the filtering strategy is used. The reason for this is mostly practical, as the use of wrapper methods is deemed to be too

complex, both coding-wise and in time complexity, to be applied here. The use of embedded methods is also disregarded, due to the choice of support vector machines as classification method (which does not inherently contain a feature ordering mechanism).

The filtering strategy implies the usage of some external measurement to determine the usefulness of a feature.

One such measurement is the normal distribution of each feature with respect to each class. As long as the input data does not cause the particular feature to be biased, and the feature itself does not introduce bias, a feature can be considered descriptive (or "good", or "usable") if there is a significant difference in its distribution between the classes. A set of features can then be considered good if they fulfil the above, and also are somewhat uncorrelated with each other.

Another measurement is to do a PCA (discussed in section 2.4.3, Principal component analysis for feature evaluation). This is a more complex analysis than just comparing normal distributions, and provides a powerful tool to view correlation between features and the usefulness of a particular feature in explaining data set variance. If there is a very uneven distribution where many features explain very little, and one or two give much of the variance explanation, this is a sign of bad feature selection. It is however important to not over-interpret the results of a PCA when it comes to prediction, as features with a small variance still can hold good predictive value [35].

In this project, both measurements are used in order to provide both a robust and easy to grasp evaluation of the features selected.

3.3.1 Description of features used Despite the presence of good evaluation methods, an initial set of features has to be suggested, from knowledge about the data or from some heuristic, in order to have something to evaluate [34]. Below, a list of the features that are used within this project and the domain knowledge behind their usage is presented. The features below will, in the complete classification system, be calculated through a simple feature extraction process.

Each feature constitutes some function of one or several of the data types contained within each radar detection. They are meant to be simple and plentiful, in order to provide smaller pieces of valuable information from many sources rather than much information from few sources.

Number of detections This feature simply consists of the number of detections found in a specific cluster. This gives some sense of the size of the object the cluster belongs to. It is also possible that some characteristic of the radar sensors causes them to deliver more detections from certain surfaces or from certain types of movement. This could then amplify the separability of the different classes for this feature.

Minimum length This feature consists of the length of the diagonal of a rectangle drawn from the edge points of the cluster. This length is called the minimum length because, provided the cluster in question describes an actual object, the object is at least this long (but could be longer in reality). This feature is thought to distinguish classes based on average size, and will thus likely be correlated with the number of detections feature.

Area This feature is the area of the rectangle drawn around the edge points of the cluster. It is likely to be heavily correlated with the minimum length feature in particular, but could provide extra information from certain shape characteristics of specific classes.

Density This feature is calculated as one divided by the average distance between the detections in a cluster. Instead of just computing the density by dividing the number of detections by the area, this method gives a somewhat uncorrelated density value. This feature is thought to hold some additional descriptive value over just the area or the number of detections features. For example, a certain class might have the average characteristic of being very large, but have such a surface or such a movement characteristic that the number of detections is low.

Mean Doppler velocity This feature measures the mean Doppler velocity of the cluster. This is thought to distinguish classes well at their corresponding average speeds. However, since all classes overlap in the low end of the velocity range (a car can travel at very low velocity), this feature could introduce bias.

Variance of Doppler velocity This feature measures the variance of the Doppler velocities of the detections within a cluster. It measures the weighted mean of the velocity variance of the detections from each radar. This method avoids some bias (since two different radars can pick up the same object, but due to different placements register very different Doppler velocities), compared to just taking the variance of a complete cluster.

It is thought that this feature can provide useful information for separating classes with large difference between the slowest and fastest moving parts within the same object. For example, a truck has large wheels that, at the lowest and highest points, have vastly different velocities. If a radar sensor happens to detect these two points, the Doppler velocity variance will be large.

Amplitude per distance This feature measures the average detection amplitude in a cluster and divides it by the mean cluster distance. Exactly how the amplitude data type is measured is unknown, but it is thought that it varies with distance (possibly, or even likely, with distance squared). In order not to introduce a large bias, the amplitude is divided by the mean distance. This lowers the likelihood that a class (for example, pedestrians) that is generally detected close to the EGO vehicle (due to its small size) gets a higher average amplitude score than other classes.

Variance of amplitude This feature measures the variance in the amplitude of the detections of a cluster. The amplitude is thought to vary with surface differences. For example, a detection from the side of a truck trailer is thought to have a higher average amplitude than a detection from a pedestrian jacket (soft fabric). This implies that classes with few differences in surface material are likely to have a lower variance of amplitude than classes that can have many different surface materials.
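To make the above concrete, the sketch below shows how the eight features could be computed for a single cluster of detections. It is a minimal illustration only: the field names (x, y, doppler, amplitude, radar_id), the bounding-box convention and the per-radar weighting are assumptions, not the exact definitions used in the project.

```python
import numpy as np

def cluster_features(x, y, doppler, amplitude, radar_id):
    """Compute eight illustrative object-description features for one cluster.

    All inputs are 1-D arrays with one element per detection (positions in the
    EGO local coordinate system). Names and conventions are assumptions.
    """
    n = len(x)                                     # number of detections
    dx, dy = x.max() - x.min(), y.max() - y.min()  # axis-aligned bounding box
    min_length = np.hypot(dx, dy)                  # diagonal, the "minimum length"
    area = dx * dy                                 # bounding-box area

    # density: inverse of the average pairwise distance between detections
    pts = np.column_stack((x, y))
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    mean_dist = dists[np.triu_indices(n, k=1)].mean() if n > 1 else 0.0
    density = 1.0 / mean_dist if mean_dist > 0 else 0.0

    mean_dr = doppler.mean()                       # mean Doppler velocity

    # Doppler variance: mean of per-radar variances, weighted by detection count
    var_dr = sum((radar_id == r).sum() / n * doppler[radar_id == r].var()
                 for r in np.unique(radar_id))

    mean_range = np.hypot(x, y).mean()             # mean distance to the EGO vehicle
    amp_per_dist = amplitude.mean() / mean_range   # amplitude per distance
    var_amp = amplitude.var()                      # variance of amplitude

    return np.array([n, min_length, area, density,
                     mean_dr, var_dr, amp_per_dist, var_amp])
```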

A PCA of the usage of the above features, together with a mean and variance analysis of each feature, is presented in section 4.2.1, Characteristics of selected features together with a discussion about possible bias problems regarding the feature selection.

3.3.2 Analysis of features and data with PCA By performing clustering on the gathered and labeled radar data, and computing the features above for every cluster, a set of labeled feature vectors is obtained. Since 8 features are present, this dataset is 8-dimensional, making it hard to get an overview of the separability and distribution of the classes.

By projecting the data into a lower dimensional space consisting of the first two or three principal components, a better overview is obtained. Thus, the PCA method will be used in determining whether the final classifier performs as can be expected. If particular subsets of the data show low separability in this projection (for example, the pedestrian and the bicycle class), it can be expected that the classification system will have trouble distinguishing between these particular classes.
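As an illustration of this kind of projection, the sketch below uses scikit-learn (the analysis in this project was done in MATLAB); the file names and the standardisation step are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

X = np.load("labeled_features.npy")   # (n_samples, 8), hypothetical file
y = np.load("labels.npy")             # integer class labels, hypothetical file

X_std = StandardScaler().fit_transform(X)       # remove scale differences
pca = PCA(n_components=3).fit(X_std)
print("cumulative explained variance:", np.cumsum(pca.explained_variance_ratio_))

Z = pca.transform(X_std)                         # project onto the first PCs
for c in np.unique(y):
    plt.scatter(Z[y == c, 0], Z[y == c, 1], s=4, label=f"class {c}")
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.legend()
plt.show()
```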

3.4 Pre-classification signal processing As with all sensors, the radar system mounted on Astator is affected by noise. To reduce the risk of classifying false objects, a methodology to reduce the impact of noise is developed.

First, the properties of noise detections are studied visually by inspecting recorded radar data, and logging parameters from detections believed to be noise. If properties can be found that separate noise detections or detections with false parameters from "true" detections, these properties can be used to form a filter structure which will attempt to remove "bad" detections before the classification stage.

3.4.1 Filtering of radar detections Before getting further in the classification system signal chain, each radar detection passes through a filter structure. The purpose of this filter is to remove undesired

detections as well as noise detections in order to provide a less cluttered view of the vehicle surroundings. What the input data looks like and how it has been pre-processed is described in section 1.2.2, Target system.

The classification system is only concerned with moving objects, and for this reason the primary purpose of the detection filter is to remove stationary detections. The different detection level filters used are described below.

Detection filter 1 The first stage in the detection filter is a threshold minMoveInd on the movement index of the detection. This is to remove all detections believed to belong to stationary objects. When choosing this threshold, it has to be considered that the highest value (3) has the highest probability of removing stationary objects. However, objects such as pedestrians move slowly and are hard to detect in that they rarely produce many detections. For this reason, a lower value may be needed to avoid missing these objects.

Detection filter 2 The second stage in the detection filter is set up to remove detections where the calculated velocity value is above a certain threshold maxdR. The reason for this is that noise detections occasionally have higher velocity values than what is reasonable for any of the objects of interest to the classification system, for example exceeding 150 km/h.

Detection filter 3 The third stage in the filter is a threshold maxRange, and any detection outside this range is discarded. The reason for this is that both the number of detections on any given object, and the accuracy of these detections, tend to decrease with increasing range. Setting this parameter makes it possible to limit the range of the classification system, while possibly increasing accuracy.
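A minimal sketch of the three-stage detection filter is given below; the attribute names and the default threshold values are placeholders, not the values actually tuned in the project.

```python
def passes_detection_filter(det, min_move_ind=2, max_dr=150 / 3.6, max_range=80.0):
    """Return True if a single detection survives all three filter stages.

    det is assumed to expose move_ind (0-3), doppler (m/s) and range (m);
    the default thresholds are illustrative placeholders.
    """
    if det.move_ind < min_move_ind:   # stage 1: drop likely stationary detections
        return False
    if abs(det.doppler) > max_dr:     # stage 2: drop implausibly fast detections
        return False
    if det.range > max_range:         # stage 3: drop detections beyond maxRange
        return False
    return True

# filtered_frame = [d for d in frame if passes_detection_filter(d)]
```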

3.4.2 Clustering of radar detections using DBSCAN This section contains an overview of the methods concerned when clustering radar detection points into objects.

The clustering method chosen for this project is the DBSCAN algorithm, explained more in detail in 2.4.1, Clustering of sensor data. The main reason behind using DBSCAN and not one of its more advanced cousins (such as OPTICS, also briefly mentioned in 2.4.1) is simplicity. Due to the relatively small data sets to be handled simultaneously (256 radar detections), time-complexity is not seen as an issue and thus parallel extensions are disregarded. Neither is the choice of clustering parameters seen as particularly troublesome (although several considerations need to be taken here, discussed more in detail below). The wish for simplicity weighs more heavily than the potential performance increase of a dynamic parameter usage. However, one should be aware that there could exist both a performance and functionality gain to be had with the usage of a more complex clustering method.

In addition, no effort is put into the pre-partitioning of data. Since the system is to work in an RT environment, pre-partitioning would only serve to "move" the time complexity elsewhere (since it would still need to be executed within the same cycle).

As discussed in section 2.4.1, one can use any distance function within DBSCAN. In this project, the two dimensional Euclidean distance will be the distance function of choice.

In the offline implementation, open source functions from [38] were used to implement the clustering algorithm, while the online implementation was developed from the algorithm description in section 2.4.1.

Clustering Parameter Selection The choice of clustering parameters will have a big effect on the entire chain of calculations within the classification system. Clustering is an essential part of discerning objects from a lump of single detections within a frame, and the choice of cluster parameters greatly affects how an object is perceived. It is therefore important to have a well-grounded method of parameter selection.

As stated in 2.4.1, there are two parameters to decide on: MinPts and Eps. MinPts decides how many points a cluster has to contain, and Eps is the maximum distance between two points for them to be considered neighbors. What constitutes a good parameter choice depends both on how the data stream (the output from the radar sensors) looks and on real-world considerations.

Choice of the Eps Parameter When choosing the Eps parameter, there is a tradeoff between how far apart two objects have to be in order not to get clustered together, and how often a single object is perceived as several separate clusters.

For example, consider the case shown below in figure 14:

[Figure: the same detections clustered once with eps = X and once with eps = X′.]

Figure 14: The eps tradeoff

The figure above shows detections (crosses) belonging to two different objects (silhouetted by black borders). The sensors might deliver detections from an object at a particular distance with an average distance between detections of X meters, which is then used as Eps = X. However, there can be gaps present where there are no detections within the specified area. Thus, two clusters are formed (green) when there was only one object present. However, if one changes the parameter to Eps = X′, the green detections get clustered together, but the red detections belonging to another object are also put in the same cluster.

Within the iQmatic project, the aim is to perform autonomous driving in mining sites. This environment greatly differs from, for example, an inner city environment where pedestrians and bicycles mix with cars at very close distances. At a mining site, safety distances to vehicles are likely to be maintained, and situations where many moving objects are present at the same time in a small area are unlikely to arise. Thus, it is deemed more important in this project to have a sufficiently large Eps so that detections from a single object are clustered together.

The method of choosing the right Eps parameter can therefore be to observe data logs with large objects (for example trucks) that can provide enough detections in an area to form several clusters, and at long distances (between 60 and 80 meters), where detections are likely more spread out. Observing such data logs, one chooses the smallest Eps that allows all points that clearly belong to the same object to be clustered together.

This approach is very much a heuristic, but more thorough or optimal ways of verifying results are deemed to be outside of the scope of this project and are thus not considered.

Choice of the MinPts parameter The method of choosing the MinPts parameter is also a heuristic that needs to be suited to the data in question. In particular, since the same MinPts is used for all clusters, it has to be adapted to the smallest objects that are meant to be detected. In the current case, this means the pedestrian class provides the limiting factor. So, in order to choose a good MinPts parameter, one can observe data logs of pedestrians at maximum distance (as far away as they are detectable), and select a value that allows detections of pedestrians to be clustered.

There are also noise points to consider. Such is almost always the case with complex sensors, and the radar sensors used within this project have from previous experience been known to produce detections of unknown origin. These noise detections are of a very low density, but can appear anywhere in a frame. If one wants to avoid sending all these noise points further in the system structure, one has to choose the MinPts parameter to be sufficiently large to filter out these noise points.

The noise points could also appear close enough to a cluster (within the Eps distance) to be put in said cluster. This could lead to a warping of the cluster characteristics compared to if all detections came from real objects. If this proves to be a problem, the Eps parameter can be tuned to a lower value than otherwise preferable. Determining whether this is the case will also be done through heuristic approaches, as a more meticulous review is deemed outside the scope of this project.

One method of determining this could be to qualitatively assess a few difficult frames, and see whether a lower value causes problems.
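A minimal sketch of the clustering step is shown below, using scikit-learn's DBSCAN with a two-dimensional Euclidean distance as a stand-in for the implementations used in the project; the Eps and MinPts values are placeholders, not the heuristically tuned values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_frame(xy, eps=2.5, min_pts=3):
    """Cluster one frame of filtered detection positions, shape (n, 2).

    Returns one index array per cluster; DBSCAN's noise label (-1) is dropped.
    The eps and min_pts defaults are illustrative only.
    """
    labels = DBSCAN(eps=eps, min_samples=min_pts, metric="euclidean").fit_predict(xy)
    return [np.flatnonzero(labels == k) for k in np.unique(labels) if k != -1]
```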

3.4.3 Feature vector calculation After the clustering step, each cluster of detections is fed to the feature extraction stage. The purpose of the feature extraction is to calculate properties of each cluster that can be used for classification and possible further filtering.

The output from the feature extraction system is a feature vector, composed of the eight parameters described in section 3.3, Practical selection and analysis of object descriptions.

In order to prevent features with inherently large values from causing bias, each parameter is scaled to a value between -1 and 1. This is done both on the training data used to create the classifier and on new data points. When classifying new objects, it is important that each parameter in the feature vector is scaled by the correct scale factor (the same one used when the system was trained).
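A minimal sketch of the scaling step, assuming a training feature matrix is available, is given below; the important point is that the minima and maxima computed on the training data are stored and reused for every new feature vector.

```python
import numpy as np

X_train = np.load("train_features.npy")   # (n_samples, 8), hypothetical file
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

def scale_features(v):
    """Scale feature values to [-1, 1] using the training-data ranges."""
    return 2.0 * (v - lo) / (hi - lo) - 1.0

X_train_scaled = scale_features(X_train)   # used when training the SVM
# at run time, each new feature vector is scaled with the same lo and hi:
# x_scaled = scale_features(x_new)
```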

3.4.4 Filtering of radar detection clusters After the feature extraction step, several new properties are available that provide new information about a clustered object.

In addition to being used for pure classification purposes, these properties could possibly be used in a second filtering step. The purpose of this proposed filtering is to remove clusters that for any reason are undesirable to send further in the system. Reasons to discard clusters could be either that they are thought to belong to none of the classes specified, or that they are believed to be clusters of noise or corrupt detections. In order to remove undesired clusters, several filtering hypotheses are put forward. The ideas for these filters are developed by watching logged radar data with known objects and studying the properties of corrupted clusters encountered. Using labeled data from both real objects and noise objects, the filters can be evaluated by the number of noise clusters versus the number of real clusters that are removed.

Cluster filtering hypothesis 1: In false detection clusters, a tendency is seen in which the velocity values of individual detections are more spread out than in any real moving object cluster. For example, it may be common for false clusters to have velocity values in a greater span than detections that truly originate from a moving object. This proposed filtering step will look at the Variance of Doppler velocity feature, described in section 3.3.1, and remove all clusters where this value is above a certain threshold maxClusterVelVar.

Cluster filtering hypothesis 2: False clusters seemingly appear with a small spread in the amplitude values of individual detections, while detections belonging to an object such as a moving vehicle seem to have greater variance in this parameter. This proposed filter will be based on the Variance of amplitude feature, removing all clusters where the variance is below a certain threshold value minClusterAmpVar.

Cluster filtering hypothesis 3: It is believed that a large part of false detections are created through phenomena such as radar interference and double reflections. Clusters of these false detections may be especially hard to discern from the real object which caused the interference. A proposed filtering method is to look at where in the local coordinate system the closest detected cluster is positioned. From the EGO vehicle, two lines are drawn to the edges of this cluster, and then continue outwards from the vehicle. This creates a circle segment of a certain angle, and the proposed filter would remove all clusters that lie in the "shadow" of the closest cluster detected. This filter structure has validity in that it is improbable that any object could be detected straight behind another object, similar to how the human eye cannot see what is behind something else. However, it is known that some of these detections could be from real objects, for example if the radar wave passes under or over the first object, or through windows.
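The first two hypotheses amount to simple threshold checks on the extracted features, as in the sketch below (the shadow filter of hypothesis 3 is geometric and omitted here); the feature indices and threshold values are assumptions.

```python
# assumed feature vector layout from section 3.3.1:
# [nrOfDets, minLength, areaVal, densityVal, mean_dR, var_dR, ampPerDist, varAmp]
VAR_DR, VAR_AMP = 5, 7

def passes_cluster_filter(f, max_cluster_vel_var=50.0, min_cluster_amp_var=1.0):
    """Hypotheses 1 and 2 as threshold checks; thresholds are placeholders."""
    if f[VAR_DR] > max_cluster_vel_var:    # hypothesis 1: too spread-out velocities
        return False
    if f[VAR_AMP] < min_cluster_amp_var:   # hypothesis 2: suspiciously uniform amplitudes
        return False
    return True
```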

3.5 Classification of processed objects This section contains a description of the methods used in order to implement a classification stage in the system. The SVM implementation structure is described together with an outline of how parameters are chosen. The section is concluded with a description of how validation with respect to classification performance is conducted.

3.5.1 Implementation of support vector machine system The optimization problem (14) given in section 2.3.2, Support vector machines as a method for classification, is quite easily solved using any quadratic programming toolbox, such as quadprog in MATLAB, or cvxopt in Python.

When implementing SVM however, it can be practical to use one of the many open-source libraries available.

In this project, the open-source library LIBSVM is used since it provides a simple and efficient interface for training support vector machines. LIBSVM [39] is a library that is suitable both for beginners and advanced users and is currently one of the most used SVM applications.

LIBSVM has support for both classification and regression and has built-in features like kernels, soft margin and cross validation. In [39], the dual formulation of the SVM optimization problem is stated as follows:

$$\min_{\alpha} \quad \frac{1}{2}\alpha^{T} Q \alpha - e^{T}\alpha$$

$$\text{subject to} \quad y^{T}\alpha = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N. \tag{25}$$

Here, e is a vector of length N containing only ones, while Q is an N × N positive semi-definite matrix with elements:

$$Q_{i,j} \equiv y_i y_j K(x_i, x_j) \tag{26}$$

It can be seen that this implementation is identical to the dual formulation (14) given in section 2.3.2, with the addition of the constraint $\alpha_i \le C$ (soft margin), and with the dot products $x_i \cdot x_j$ replaced by $K(x_i, x_j)$ (the kernel function). To understand the meaning of these changes, see section 2.3.2, Kernels and soft margin.

The use of kernels leads to a slightly changed decision function:

$$\text{class}(x_n) = \operatorname{sgn}\big(\omega \cdot \phi(x_n) + b\big) = \operatorname{sgn}\Big(\sum_{i} y_i \alpha_i K(x_i, x_n) + b\Big) \tag{27}$$

LIBSVM provides scripts for both training and prediction, and stores all parameters needed for classification. This means that even if a model is trained using LIBSVM, prediction can be made manually using equation (27).

This could be worth considering in a real-time implementation, since it could provide benefits from an embedded systems perspective to have the prediction function as light as possible.
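As an illustration of such a light-weight prediction function, the sketch below evaluates equation (27) for an RBF kernel directly from stored model parameters; the variable names, and the assumption that dual_coef already contains the products y_i α_i, are illustrative rather than the project's implementation.

```python
import numpy as np

def svm_decision(x, support_vectors, dual_coef, b, gamma):
    """Evaluate sum_i y_i alpha_i K(x_i, x) + b for an RBF kernel.

    support_vectors: (n_sv, d); dual_coef: (n_sv,) holding y_i * alpha_i;
    b: bias term; gamma: RBF kernel parameter. All come from a trained model.
    """
    diff = support_vectors - x
    k = np.exp(-gamma * np.sum(diff * diff, axis=1))   # K(x_i, x) for all i
    return float(dual_coef @ k + b)

def svm_predict(x, *model):
    """Binary prediction: the sign of the decision value."""
    return 1 if svm_decision(x, *model) >= 0 else -1
```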

Model selection From section 2.3.2, Model selection, it is known that there are multiple parameters to choose before a support vector machine model can be constructed. These parameters have a huge effect on the performance of an SVM classifier, and if chosen poorly can lead to problems with overfitting.

In this project, the grid search method explained in 2.3.2 is used to identify suitable values for each parameter. The reason that the grid search method is used is partly because it is a straightforward, easy-to-understand method, and partly because it is so commonly used and shows good performance.

The only downside with the cross validation grid search is the processing complexity of the algorithm, which makes the process very time consuming. However, since the only timing requirements in this project are on the on-line classification tasks, as opposed to the offline training of the system, this is not a problem.

For the radial basis kernel SVM with soft margin, there are two parameters that need to be identified, the SVM slack parameter C, and the kernel parameter γ.

In [24], a grid of $\gamma = (2^{-15}, 2^{-13}, \ldots, 2^{3})$ and $C = (2^{-5}, 2^{-3}, \ldots, 2^{15})$ is suggested.

This is a very broad search, but with a large step size of $k = 2$ in the exponent. For this reason, it results in only 165 parameter combinations to test, and a quite simple search. It can be seen as an initial search with the purpose of getting a broad overview of the parameter space and seeing what range of values could be worth looking at in a finer search.

A second, more detailed grid search should always be conducted based on the results of the first search. This is a finer search in the areas that showed most potential in the first search.
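A minimal sketch of the coarse cross-validated grid search is shown below, using scikit-learn's LIBSVM-backed SVC as a stand-in for the LIBSVM scripts; the data files are hypothetical and the grid mirrors the coarse grid discussed above.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X_train = np.load("train_features_scaled.npy")   # hypothetical files
y_train = np.load("train_labels.npy")

param_grid = {
    "C":     2.0 ** np.arange(-5, 16, 2),    # exponent step of 2
    "gamma": 2.0 ** np.arange(-15, 4, 2),
}
coarse = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X_train, y_train)
print("best coarse parameters:", coarse.best_params_)
# a finer grid is then searched in a neighbourhood around coarse.best_params_
```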

3.5.2 Multiclass, rejection and confidence structures Aside from the choice of slack and kernel parameters, there are a number of architectural considerations to make before implementing a multiclass SVM. In the case of this particular project, the implications mostly concern how to deal with noise or borderline cases in the data, but the desired output format is also something to consider. If the SVM output scores can be scaled to a probability-like measurement, the ability for the system output to be integrated into other systems is increased.

When having a multiclass problem such as this, the common ways of turning the binary nature of support vector machines into multiclass (or ensemble) solvers are to use either one vs. one or one vs. all classification, as discussed in section 2.3.1, Multiclass classification.

Confidence output As discussed above, it is beneficial if the SVM output prediction can be accompanied by a score that provides an intuitive measurement of confidence. In section 2.3.2, SVM outputs and Platt-scaling, one method to convert the raw SVM outputs to probability-like measurements was provided. However, this method is inherently made for binary classifiers, and may be problematic to apply in a multiclass problem.

If using the OvA ensemble scheme, one SVM score will be provided for each class. These scores could be Platt-scaled independently to obtain 4 different probability-like scores. It must be noted that these scores will not reflect true probabilities, and will also not necessarily be reasonable in a multiclass perspective. For example, the Platt scores obtained for the different classes are not constrained to sum to one.

However, the Platt-scaled outputs would still be very valuable in deciding the worth of a prediction, enabling easy integration in a system-wide perspective.

If using the OvO ensemble scheme, more SVM outputs will be obtained than the number of classes. Additionally, the SVM outputs will not represent an actual confidence measure for any specific class, but rather how each class compares to the other class in each pair. These facts make it more complex to obtain a probabilistic output, as the original Platt-scaling method cannot be applied directly.

Handling of noise and borderline data Noise and borderline data (clusters of radar detections that share similarity with several classes) should preferably be treated differently. Noise should be discarded, while borderline data should be assigned the more likely class. An example of borderline data could be a large personal vehicle that borders on truck characteristics.

There are at least three main ways of dealing with these data categories: The usage of a noise-class, the usage of additional filtering, and the usage of a rejection threshold. Furthermore, the performance in any of these three may be affected by the choice of multiclass ensemble method. The pros and cons of each of the three concepts are discussed below with regards to the OvO and the OvA scheme respectively.

Noise-class One method is to create a fifth class (the noise class) and train the classification system with additional labeled noise data. This could be a sensible thing to do, if the noise class had any sort of homogeneity. However, due to the many types of different noise, the noise class will span a much larger space than the individual classes, and thus there is the risk of introducing a bias. Because

of the wide diversity of the supposed noise class, the risk is also high that real objects, especially borderline cases, are misclassified as noise.

The noise class method would have similar results with both OvO and OvA, although with OvO the time complexity would increase more when adding an additional class.

Filters Another method is to have a cluster filter that attempts to remove bad clusters before classification. Three ideas on how clusters could be filtered were presented in 3.4.4, Filtering of radar detection clusters.

The filtering method is cumbersome, because in order for it to be valid the different filter parameters would have to be chosen and verified in some structured way, preferably through a rigorous statistical evaluation. Additional filtering will also increase the time complexity of the system; however, it is independent of the choice of ensemble method.

Rejection threshold This method relies on the application of a rejection threshold, where one can label data objects as noise if the classifier reports a lower output than specified. This method avoids the problems of the noise class method, but may be hard to implement depending on the choice of ensemble method.

If using OvO, the SVM output scores are not dependent on the absolute characteristics of a class, but rather on how they compare to another class. Additionally, there will be more SVM scores than classes. These properties make the rejection method problematic as there is no natural way of choosing a rejection threshold. There may exist some method that suitably combines or scales the OvO outputs to a score that is more fitting for a rejection structure.

The OvA scheme makes it easier to apply a rejection scheme, since there is only one SVM score associated with each class. However, it is unclear whether the rejection approach might introduce some bias (different classes may have different average scores, and thus an absolute threshold might cause certain classes to be rejected more often).

One way of avoiding the bias of a noise class, as well as bias introduced by an arbitrary threshold, could be to use OvA and have a rejection threshold of 0. This would lead to a system which will invariably let through noise. However, it does allow for removing the most obvious noise (the clusters that no classifier deem as belonging to its own class). It also ensures that borderline data is classified as something, instead of being discarded, as a borderline point would be assigned several non-negative scores. This method also has the benefit of not needing any advanced methods in choosing rejection threshold.

Conclusions There are many different architectures and considerations to make when constructing a multiclass classification model. The choice of architecture may have an impact on time complexity as well as on the system's capability to handle noise and borderline cases.

Above, three methods in which these cases could be handled were discussed. However, it is hard to know which of the three methods is best, or if a combination is the suitable choice. It may very well be that the same objects that are filtered away in a cluster-filter or classified as belonging to a noise-class, are the same objects that would be rejected in a rejection structure.

To improve the usefulness of the classification output, a probabilistic output structure is desired. As discussed in section 2.3.2, Platt scaling is one method of getting a probability-like output. There are other feasible methods, but in this project the Platt method is used.

Furthermore, the One vs All method is used as it is easy to combine with both Platt scaling and a rejection structure. In this project, a rejection structure is implemented with a rejection threshold of 0.

To solve the problem of combining and choosing rejection threshold with probabilistic outputs, the rejection is performed on the raw OvA SVM scores. The conversion to probabilistic output (Platt scaling) is then performed after the rejection process, on the clusters that passed the rejection structure.
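A minimal sketch of this decision logic is given below: the raw OvA scores are first checked against the rejection threshold of 0, and Platt scaling is applied only to the winning class of a cluster that passes; the per-class sigmoid parameters A and B are assumed to have been fitted beforehand on held-out data.

```python
import numpy as np

def classify_cluster(raw_scores, platt_A, platt_B):
    """raw_scores: one OvA SVM score per class; platt_A, platt_B: per-class
    Platt sigmoid parameters (assumed fitted on held-out data).

    Returns (class_index, confidence), or (None, None) if the cluster is rejected."""
    if np.all(raw_scores < 0.0):                  # rejection threshold of 0
        return None, None                         # no classifier claims the cluster
    c = int(np.argmax(raw_scores))                # class with the highest raw score
    conf = 1.0 / (1.0 + np.exp(platt_A[c] * raw_scores[c] + platt_B[c]))
    return c, conf
```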

3.5.3 Evaluating classification performance on validation data To evaluate the performance of a finished classifier, outputs are computed for a set of validation data. By comparing these outputs with the known class labels existing for each validation data point, the performance measurements described in 2.3.3, Classification performance analysis can be calculated. The scores used in this project are found below.

The first is the total accuracy score:

$$\text{TotalAccuracy} = \frac{\text{correctly classified}}{\text{total number of validation data points}}$$

To gain further insight into the performance of a classifier on the different classes, a multiclass confusion matrix is constructed. This is a variant of the confusion matrix described in section 2.3.3 but containing all classes. Furthermore, the class specific values of precision and recall are calculated for each class and included in the matrix. An example can be seen below in table 3.

Table 3: Multiclass confusion matrix

              Class 1   Class 2   Class 3   Class 4   Recall
Class 1       ntp1                                    R1
Class 2                 ntp2                          R2
Class 3                           ntp3                R3
Class 4                                     ntp4      R4
Precision     P1        P2        P3        P4

Here, the values on the diagonal represent the true positive predictions for each class, while all the values outside the diagonal will be different types of misclassifications.

To provide even more insight into classification performance, the class specific and average values of accuracy, error and F-measure should be shown in a table. An example of such a table is seen below in table 4.

Table 4: Classification performance measurements

            Accuracy                       Error                          F-measure
Class 1     Acc1                           Err1                           Fmeas1
Class 2     Acc2                           Err2                           Fmeas2
Class 3     Acc3                           Err3                           Fmeas3
Class 4     Acc4                           Err4                           Fmeas4
Average     (1/C) Σ_{c=1..C} Acc_c         (1/C) Σ_{c=1..C} Err_c         MFM
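A minimal sketch of how the quantities in tables 3 and 4 can be computed from predicted and true validation labels is shown below; scikit-learn's confusion_matrix is used purely for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def validation_report(y_true, y_pred, n_classes=4):
    """Confusion matrix plus total accuracy, per-class precision, recall and F-measure."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    total_accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)       # R_c: true positives / actual class members
    precision = np.diag(cm) / cm.sum(axis=0)    # P_c: true positives / predicted class members
    f_measure = 2 * precision * recall / (precision + recall)
    return cm, total_accuracy, precision, recall, f_measure
```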

3.6 System implementation on target platform In this section, the methods related to the real-time system implementation on the Astator platform will be detailed. As stated in section 1.3.1, Development approach, the development and implementation of the classification system is conducted in two stages. First, an offline version of each system function is created. Details of these individual subsystems are found in the respective sections above.

When the offline system is verified to operate according to specifications, a real-time implementation is constructed. Both the offline and real-time systems can be tested and simulated in a desktop environment with logged radar data, using Matlab/Simulink as simulation and verification tools.

There are several considerations to be made in order for the system implementation to be functional. Many of these are purely practical and will not be discussed further. Some considerations are more relevant, and these are discussed below.

3.6.1 Real-time implementation goals and restrictions There are three main objectives during the real-time implementation. The first one is to have each RT function give the exact same output as the corresponding offline system function (as presented in section 1.3.1, Development approach).

The second one is for the real-time implementation to be executable within the time-frame allowed. These considerations are discussed more in detail below in section 3.6.2, Timings and tasks.

The third is for the real-time implementation to perform well as a complete system implemented on the test-vehicle. This is discussed in section 3.6.3, Validation of final system implementation.

The real-time implementation software is created in Simulink as a modification of the offline functions. This environment can then be used to simulate the real-time implementation. Using code-generation tools, embedded MATLAB blocks can be converted to C code, which is then transferred to the target system in order to produce a real implementation.

Frame-size considerations The frame size (meaning the length of time that is considered a single frame) is an important characteristic to consider. The radar sensors deliver data at a rate of 20 Hz. However, they do not update at the same time. Each radar delivers 64 detections every time it sends data. It flags a detection as updated if the data in that particular detection has not been sent by the radar before. The full set of detections is rarely (if ever) updated at the same time, so every radar delivers somewhere between 0 and 64 updated detections every 50 ms.

The real-time system operates on 10 ms execution cycles, and data is sent from the radar sensors every 10 ms, but new data can only arrive 50 ms after a radar last delivered updated detections.

There are at least two different ways of handling the frame size. One way is to see each unique time stamp as one frame. The benefit of this is that data is processed as quickly as possible and each detection within one frame is derived from the same ”true” time.

The other way is to keep each frame at a fixed size of 50 ms (same as the update time of the radars), and to collect all detections delivered within this time window into a single frame, which is delivered every 50 ms to the clustering algorithm. The benefit of this is that each frame is going to be a better representation of the environment, since it will contain data from all radars. This also gives more detections per moving object (if the object is within an overlapping field of view), which might improve classification accuracy. Another benefit is that it is easier to describe the timings of the real-time implementation, since they become more deterministic.

A drawback with this approach is that there is a possible time difference of 50 ms between the oldest data content and the newest within a single frame, which will negatively affect classification speed and might have more severe implications for objects moving very fast.

In this project, the method of considering frames to consist of 50 ms worth of data is used (due to the benefits mentioned above), but one could just as validly process data as soon as it arrives.
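A minimal sketch of this frame-gathering logic is given below: detections flagged as updated in each 10 ms cycle are buffered, and a complete frame is handed to the clustering step every fifth cycle; the class and method names are illustrative, not the names used on the target system.

```python
class FrameGatherer:
    """Collect updated detections over five 10 ms cycles into one 50 ms frame."""

    CYCLES_PER_FRAME = 5          # 50 ms frame / 10 ms system cycle

    def __init__(self):
        self._buffer = []
        self._cycle = 0

    def step(self, updated_detections):
        """Call once per 10 ms cycle; returns a full frame every 50 ms, else None."""
        self._buffer.extend(updated_detections)
        self._cycle += 1
        if self._cycle < self.CYCLES_PER_FRAME:
            return None
        frame, self._buffer, self._cycle = self._buffer, [], 0
        return frame
```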

3.6.2 Timings and tasks In this section, the timing and time complexity aspects of the real-time implementation are discussed. A simplified collection of tasks performed within the classification system is seen in the timing diagram below in figure 15.

[Figure: timing diagram of the tasks system, gatherFrame, makeFrame, cluster and classify, with t_system = 10 ms, t_frame = 50 ms and t_exec unknown.]

Figure 15: Timing Diagram

System parallelity is in this case assumed to be absent (so every task has to finish in order for consecutive ones to start). Since code generation tools are used and the hardware itself is powerful and runs on several cores, this is a simplification and some parallelity likely exists. However, for the purpose of real-time performance assessment, this view is helpful. If one would look to significantly increase performance in the future, applying methods of additional parallelization could be useful.

System is the inherent cycle of the target system with an execution time of 10 ms. This cycle time is the basic timeframe to relate to. Within the system task are all of the Astator functions. Amongst these are preprocessing and sensor data fusion as well as the functions that translate sensor data to different coordinate

systems. A brief overview of the different functions inherent to the target system can be found in 1.2.2, Target system.

gatherFrame is the task that collects the updated detections from the radar sensors, and when a complete frame cycle of 50 ms has been finished, it triggers the rest of the classification algorithm to execute.

makeFrame, cluster and classify are the tasks that make up the rest of the classification system. These have to be executed one after another, and also need the system task to be finished for the corresponding cycle in order to execute (since they need the data to be transformed to the EGO local coordinate system). They also wait for the gatherFrame task to be finished, so that they can operate on a coherent frame of data.

The total execution time of these tasks together with gatherFrame is t_exec, and it is this time that is of interest when it comes to evaluating the RT timing performance. Simply put, this is the execution time of the complete classification system cycle.

The total execution time t_exec + t_system shall never exceed 10 ms, since exceeding it would prevent the system task from completing every 10 ms. It is of course preferred to have as low an execution time as possible, as this will increase the ability of the complete object tracking system to be extended with further tasks.

In order to validate timing performance, the built-in Simulink tool Profiler is used. This tool checks how much time is spent within each function.

Using a model to evaluate real-time performance, on hardware that is different from the real target system, cannot directly prove adequate performance. However, the simulated environment is very likely to be slower than compiled C code given the same hardware. In addition to this, the target system hardware is more powerful than the processors on which the Simulink profiler assessment is made. This should mean that the profiler assessment is still valuable as an indication of RT performance.

3.6.3 Validation of final system implementation To evaluate complete system performance, the following steps will be taken:

First, it will be ensured that the real-time system produces the same output as the simulated system given the same input. This is because the simulated environment produces its own time-stamps that do not precisely match the actual run-time sample times. Hence, it cannot be guaranteed that the simulated classification receives all sensor data at precisely the same timestamps as the real-time implementation, and this will invariably produce slightly different results.

Once it is ensured that the same input yields the same output in both the implemented system and the simulation, continued system evaluation can be made in the simulated environment.

To add to the real-time evaluation made in the simulated environment using the Simulink profiler, an additional investigation will be made on the actual target platform, ensuring that the system cycle time is never exceeded. These two checks combined should provide confidence that the real time requirements are met.

When it comes to the actual classification system performance validation, a variant of the approach used by Garcia [4] and discussed in section 2.1.1, Radar based vehicle perception will be used. The first step is a qualitative analysis made by looking at the system output made for a set of radar data logs with known objects.

Then, a quantitative approach of looking through a log frame by frame and noting the number of correct versus misclassified clusters is performed. In this process, the "truth" data will be composed of the logged objects for which we know the type.

The results and analysis of the system evaluation are found in section 4.5, Complete system performance assessment.

Part 4: Results and discussion

In this part, the results of the project and the methods described in section 3 are presented and discussed. The chapter is a merge of results and discussion and as such, each section will contain the results of a particular method or subject, together with a discussion of said results. The chapter is concluded with the results of validation and verification of the classification system, both in offline context and in the real time implementation.

4.1 Results of data gathering and labeling Through the process described in section 3.2.1, Test-track data gathering, a total of 60 different data logs have been produced. The logs contain radar data from the different test scenarios described. In addition to these controlled test scenarios, several other logs have been produced and used in the evaluation of different functions.

Two scripts for labeling individual clusters in the recorded data have also been produced, and by using these scripts a total of 13 logs have been processed and labeled. These constitute the labeled data set used throughout this chapter.

The results of these processes are 2947 separate labeled radar detection-clusters belonging to either pedestrian, bicyclist, personal vehicle or truck.

Each of the clusters has a corresponding calculated feature vector; however, should the number or structure of the features used in the system be modified, new feature vectors can easily be calculated from the labeled data. The labeling has been made on each individual detection within the cluster, thus a change in clustering parameters or a new filter structure does not mean that the labeling process needs to be performed again. The labeled detections will simply pass through the new filtering, clustering and feature extraction stages in order to obtain new labeled feature vectors.

Table 5: Radar detection clusters gathered and labeled

Class       Pedestrian   Bicyclist   Car   Truck   Noise
Instances   479          533         734   1201    1709

Aside from the labeled class data, additional labeling has been performed on noise clusters. In table 5, the spread of labeled clusters over the different classes is shown.

Since the beginning of the project, it has been suspected that the task of acquiring labeled training data could pose a challenge. The reason for this is that data

gathering and labeling are both time consuming processes and it was unknown if resources such as access to the test track could be given.

Luckily, the chance to gather training data was presented, and a multitude of examples of radar signals coming from the different classes were recorded. However, with just one day of data gathering, it was inevitable that the data would be insufficient. It is a big challenge to avoid correlation in the training examples.

For example, in all data gathered within this project, only one type of weather environment exists. Also, only two different types of cars and trucks were used, which may not be enough to get a good coverage in the feature space. Despite the fact that two different types of clothing were used for the pedestrian and bicyclist classes, all the tests were performed on the same person, and with the same bike. In reality, data would be needed from persons and vehicles of additional different sizes and with additional materials.

Even though it was attempted to acquire test data of many different velocities for each class, with limited time it was only possible to get three different speeds per scenario and class. This is another example of high correlation in the training data.

To add to the challenge of data gathering, the labeling of said data is just as problematic. Of the sixty separate data logs that were gathered, only 13 have been processed and labeled. This is because labeling is a time consuming process and it was decided that the time was better spent on other tasks.

From a single data log consisting of a one minute long recording of radar data, it is possible to get thousands of labeled clusters. However, many of them will be heavily correlated, since the radar system yields 20 samples per second and any moving object considered here will not change much in such short time frames.

It is also important to note that the difference in number of labeled clusters for the different classes could cause bias in the later stages of machine learning. When used for training examples, a more equal distribution would be preferable, since it would eliminate this source of potential errors. Since focus in this project has not been on fine-tuning performance, this has not been applied. However, it is something to keep in mind for future work.

The labeled training data acquired within this project is sufficient to demonstrate the feasibility of the concept and that good classification results can be obtained within the requirements stated at the beginning of the project. However, to achieve the full potential of the system, a much bigger set of training data will be needed, with a bigger variation in velocity, heading, distance, environments and object types.

4.2 Analysis of feature and data characteristics In this section, results regarding the analysis of data and selection of features to use for classification are presented. First, the selected features are presented with mean and variance measurements. Then, a PCA is shown for the data and the chosen features. Finally, a discussion regarding these results and their implications is presented.

4.2.1 Characteristics of selected features Below, characteristics of the features from the feature selection process are presented. These results consist of a table of mean and variance measurements for the different features selected with respect to the different classes. The analysis was done using Matlab on object data extracted from the recorded data logs, and features were calculated as described in section 3.3, Practical selection and analysis of object descriptions.

In order to give a real-world sense of what is being presented, the data is not scaled. This means that values for a certain feature cannot be meaningfully compared to values of another feature. Only comparisons between classes, regarding the same feature, are useful in a classification context.

              Pedestrian         Bike               Car                Truck
Feature       mean     var       mean     var       mean     var       mean     var
nrOfDets      3.20     1.84      3.32     2.29      5.82     11.14     17.69    303.88
minLength     1.73     5.29      2.26     5.43      4.62     6.43      12.24    26.33
areaVal       2.87     49.03     2.91     33.76     10.63    160.44    76.91    5982.80
densityVal    4.05     26.00     1.37     3.68      0.55     0.17      0.23     0.01
mean dR       1.23     0.75      2.53     2.42      3.75     7.52      3.02     7.15
var dR        1.10     173.10    0.50     8.88      0.93     6.90      5.24     774.95
ampPerDist    -1.68    0.65      -0.33    0.49      -0.13    0.03      0.16     0.02
varAmp        15.34    376.19    20.43    816.42    30.69    1286.18   53.63    895.74

Table 6: Mean and variance of features used for object description

The table above clearly shows that most features have mean values that distinguish between most classes. Notable exceptions are nrOfDets and areaVal when comparing pedestrians and bicyclists: the mean values for these are very close to each other. This makes sense, since the detection surfaces of both these classes mainly consist of persons.

Disregarding the variance, this implies that the features chosen carry explanatory value. It is however important not to overinterpret the results. Which features are good for which distinctions, and to what degree, cannot be directly inferred from this particular analysis alone.

4.2.2 Principal component analysis of features on training data In this section, the results of the principal component analysis, discussed in section 2.4.3, Principal component analysis for feature evaluation above, are shown. This analysis was done on the labeled data presented in section 4.1, Results of data gathering and labeling above.

Below in table 7, the results of equation 24 are shown. This table shows how much each principal component adds to the variance explanation of the data set.

Principal component   Cumulative variance
1                     0.479
2                     0.653
3                     0.818
4                     0.912
5                     0.942
6                     0.970
7                     0.992
8                     1.00

Table 7: Table of the cumulative variance explanation per principal component

Below in figure 16, a biplot (16a) of the features used together with the training data projected onto the first two principal components (16b) is shown. The biplot is a way of visualizing a principal component analysis. The features used are displayed as projections onto the 3D principal component space. The length of the vector corresponding to each feature is analogous to the amount of "explanatory value" (its portion of total data set variance explanation) it carries compared to other features. Additionally, similarity in direction indicates high correlation. The projection of the training data onto the two first principal components gives a visual feeling for how separable the different classes are.

[Figure: panel (a) shows a biplot of the eight features in the space of the first three principal components (axes: Component 1, Component 2, Component 3); panel (b) shows the training data with 2947 samples projected onto the first two principal components (axes: princmp 1, princmp 2).]

Figure 16: Biplot of features and training data projected onto first two PC

In figure 17 below, the training data is projected onto a biplot using the first three principal components. It gives a visual indication of how the features used explain the variance of the training data.

[Figure: training data with 2947 samples projected onto the first three principal components, together with the feature biplot (axes: princmp 1, princmp 2, princmp 3).]

Figure 17: Training data and biplot projected onto first three PC

In the figures above, the pedestrian class is plotted in blue, the bicycle class in green, the car class in yellow and the truck class in red. It is apparent that the pedestrian and the bicycle class are hard to separate in the first three principal component dimensions.

It can also be seen in the figures above that certain features, such as amplitude per distance and minimum length, carry much of the explanatory value for principal components one and two.

It can also be seen that the features minimum length, number of detections and area are closely correlated with each other (they point in a similar direction). The mean dR and the var dR features are also correlated.

4.2.3 Feature and data analysis discussion Below, a discussion regarding the results of the specifics of the feature selection, and the general data exploration, is presented. A discussion about the existence and influence of bias in the selected features is also presented.

Feature usefulness Due to the nature of the PCA method and the difficulty in grasping high-dimensional data, it is hard to draw conclusions about the individual usability of features. Rather, they should be evaluated together.

A sign of good feature selection is that the variance explanation does not come from just one or two features. As seen in table 7 above, the cumulative variance is only slightly over 80 percent at three principal components. This also implies that what can be seen in the PCA plots is just part of the truth, since so much of the variance explanation occurs in higher dimensions. There is a clear contribution from each principal component, except for the last one. This implies that most features are useful in explaining variance in the data set. It should be stressed that, since the principal components do not correspond to any single feature, this is not the same as saying that the least useful feature does not add explanatory value.

One can determine from the figures that the features minLength, ampPerDist, varAmp and mean dR have the longest vectors, meaning they carry the most information for the first three PCs. It can also be seen in table 6 above that these particular features show a good distinction between the class mean values. This implies that these features are particularly useful for object classification.

It can also be seen that several features point in a similar direction as other features, but with much shorter vectors. For example, the features nrOfDets and areaVal both point in a very similar direction to minLength, but with much shorter vectors. This implies that these features are highly correlated, and that the minLength feature possibly carries most of the information needed. If it were necessary to speed up classification, it would probably be possible to exclude these less useful features and maintain classification accuracy.

Data analysis and class separability

As seen in figures 16b and 17 above, when projected onto the first two or three principal components, the classes are not obviously easy to separate. In particular, the bicycle class (green) and the car class (yellow) are very entwined, which can explain classification difficulties between these particular classes. However, one should bear in mind that these plots are only low-dimensional representations of the data, and that the classes may be more easily separated in the original eight-dimensional feature space.

Feature bias

Below, a discussion about the presence of bias in the selected features is presented.

In this context, bias present in features is to be understood as the tendency of a certain feature to introduce systematic errors in classification. This is separate from the beneficial distinction between classes that a feature is meant to induce, which could also be labeled as bias using a different nomenclature.

Striving to use features that possess no bias is preferable in most circumstances. There is however little possibility of accomplishing a total absence of bias, since it can be present in unimagined ways and it is hard to check for bias experimentally.

The presence of bias in both training data and features used is a potentially large source of error. Great care has to be applied when making the feature selection, and discussions about the presence of bias need to be comprehensive. Otherwise, there is a risk of seriously compromising the validity of the end results. Particular bias considerations with regards to specific features are presented below.

Variance of Doppler velocity

Since this feature is calculated as a weighted mean of the variance of velocity values from each specific radar, some bias induced by sensor placement is avoided.

There are however situations in which this feature can still induce bias. For example, consider the case of an object passing perpendicular to a radar sensor. If the object is a point, no Doppler velocity can be detected (since there is no detected point having a relative velocity in the radial direction of the sensor). But if the object is very long, there is a high probability that the sensor produces detections from points that have a velocity not strictly perpendicular to the sensor radial direction. Points from the back end of the object will appear as moving towards the sensor, while points from the front end will move away. Thus, the variance in Doppler velocity of such an object is likely large.

The bias in this case is that the same object, at a further distance, would produce a lower Doppler speed variance. Thus, this feature could introduce a dependency on distance to the EGO vehicle, or other unwanted behaviour that leads to systematic errors.
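As an illustration of the calculation described above, the following is a minimal sketch (field and variable names are hypothetical, not taken from the project code) of how the weighted per-radar variance could be computed for one cluster:

function v = dopplerVelVariance(dR, radarId)
% Sketch (hypothetical names): weighted mean of the per-radar variance of
% the Doppler velocities dR, where radarId gives the index (1..4) of the
% radar that produced each detection in the cluster.
ids = unique(radarId);
w = zeros(numel(ids), 1);     % weight = number of detections from each radar
s = zeros(numel(ids), 1);     % per-radar variance of dR
for k = 1:numel(ids)
    sel  = (radarId == ids(k));
    w(k) = nnz(sel);
    s(k) = var(dR(sel));
end
v = sum(w .* s) / sum(w);     % weighted mean of the per-radar variances
end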

Amplitude per distance

This feature is possibly biased by construction. Depending on what unit the amplitude is calculated in within the radar sensors, it could be that a division by distance squared or distance cubed would be more accurate. If the wrong exponent is used in the division, this can lead to a bias. Since the exact amplitude calculation is unknown, the division by distance to the power of one has been kept, but this should be changed if more exact knowledge about the calculations done within the radars can be gained.

This bias would introduce a dependency of the distance to the EGO vehicle for this feature. Clearly, the distance should not in itself be considered a feature: it is implausible that a certain class of moving objects would on average be located nearer or further away than another.
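To make the construction explicit, the sketch below (variable names are illustrative only) shows the feature as used in this project, together with the alternative normalizations mentioned above:

% Sketch (illustrative names): amp and dist are vectors with the amplitude
% and range [m] of each detection in one cluster.
ampPerDist  = mean(amp ./ dist);      % exponent 1, as used in this project
% If the radar-internal amplitude were to follow an inverse-square or
% inverse-cube behaviour, one of these would be the less biased choice:
ampPerDist2 = mean(amp ./ dist.^2);
ampPerDist3 = mean(amp ./ dist.^3);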

4.3 Signal processing and filtering performance

This section contains results related to the methods used for processing radar data, such as filtering and clustering. The section also contains some results and discussion related to common types of noise that have been detected in the radar sensors.

4.3.1 Common types of noise in the radar output

In order to create a suitable signal chain structure, an investigation has been made into the different types of noise coming from the radar system. The results of this investigation are found below.

It has been seen that when it comes to noise detections from the radar sensors, some forms occur more often than others and are repeatedly seen in almost all data logs.

Reflection lines

A common scenario in which noise detections appear is when a large object moves close to Astator at specific angles. The noise is seen as lines of detections behind the real object.

Examples of the phenomenon are given in figure 18, where a large object is moving close to Astator and lines of corrupt detections are seen behind the actual object.

[Figure 18, panels (a) and (b): bird's-eye plots of Astator with raw, filtered and clustered detections; lines of noise detections appear behind the real object.]

Figure 18: Lines of noise detections behind an object

Scatter

Similar to the reflection lines, this type of noise seemingly appears whenever there are objects moving close to EGO at certain angles. The difference is that instead of appearing in straight lines behind the object, the noise appears as sparse detections spread out in an arc behind the object. The detection velocity values also seem to be low in this case, as opposed to the reflection lines, where the velocity values are often amplified.

Examples of scattered noise detections are given below:

[Figure 19, panels (a) and (b): bird's-eye plots of Astator with raw, filtered and clustered detections; scattered noise detections appear in an arc behind the real object.]

Figure 19: Scattered noise detections behind an object

In both figures 19a and 19b, the only real moving object is the cluster seen closest to the EGO vehicle (centered in the figure).

The noise described in figures 18 and 19 is believed to be caused by radar waves bouncing multiple times on an object before being detected by the sensor.

Theoretically, this would cause detections to appear with range values r_det at multiples of the true range to the object:

r_det = n × r_true,  where n = 2, 3, …     (28)

Also, since the wave is reflected several times on the object, several Doppler shifts would be superimposed. Theoretically, this would show in the range rates of the detected points:

ṙ_det = n × ṙ_true,  where n = 2, 3, …     (29)

In reality, the noise is seen as lines and arcs of false detections appearing behind a detected object. These noise detections do not seem to be restricted to ranges in multiples of the true range, which could be a sign that some other phenomenon is the cause of this noise.

One theory is that the noise appears due to radar signals reflecting off an object but being detected by a different radar than the one from which they originated, causing the angle, the range and the velocity of the detection to be corrupted. However, since the radars are frequency modulated, such interference between the radars should not happen.

The reflection lines do not appear as commonly as the scattered noise (which appears just about every time an object passes by), but they are also harder to remove using the filter structures developed in this project. The reason is that the lines usually appear with detections very close to each other, causing them to be clustered into an object instead of being marked as noise in the DBSCAN stage.

False clusters

One type of noise detection seemingly appears regardless of whether or not there is an object close to Astator. These "false" detections often appear briefly, in a single sample only, but in high numbers and with high velocity values.

[Figure 20, panels (a) and (b): bird's-eye plots of Astator with raw, filtered and clustered detections and their velocity vectors; clusters of noise detections with high velocity values appear although no real objects are present.]

Figure 20: Clusters of noise detections with high velocity values

Figures 20a and 20b are examples of this type of noise. No real objects are present in either of the figures.

The false clusters are only apparent in a single frame and do not return in the same place. Also plotted in the figures are the velocity values of the detections (shown as arrows).

This is by far the most common type of noise seen in the radar data. The quantity and density of these false detections make them hard to filter using DBSCAN, since they usually end up being clustered, as is the case with the reflection lines.

One thing that can be seen is that this type of noise usually has a very high variation in velocity values among its detections, which can appear to move at unrealistic speeds and in opposing directions, even though they all seem to come from the same "ghost" object. Looking at figure 20, it can be seen that the velocity values within these false clusters are very high, often above 100 km/h.

Clutter

When the EGO vehicle is moving, an increased amount of corrupt detections is seen in the radar data. These detections are thought to originate from stationary objects that are interpreted as moving objects by the radar processing software.

In some cases, the amount of corrupt detections is so great that the points are clustered and sent onwards in the signal chain.

Below in figure 21, a frame from when the EGO vehicle is moving at around 60 km/h along a straight road with low road fences on both sides is shown.

[Figure 21: bird's-eye plot (in meters) of detections while the EGO vehicle moves at around 60 km/h; the two highlighted detections, at approximately (-18.3, 2.6) and (-16.2, 2.8), lie about 2 meters apart.]

Figure 21: Fence detections and noise when moving at 60 km/h

In this figure, no real moving objects are present; every detection is either from noise or stationary objects. The detections that lie on a straight line are from fences. Despite this, the radar processing software labels these detections as having a movement index of 3, meaning the system is certain they originate from moving targets.

Conclusions regarding noise

The types of noise shown above are extremely common, being present in some form throughout most logged data.

Since it is impossible to strictly guarantee a 100% visual overview of the test track, it is problematic to rule out the possibility that some moving object was actually there to cause the detections. However, great care was taken to prevent this by only recording data when no moving objects other than the object being studied could be seen in the vicinity.

Assuming that there were in fact no objects other than those deliberately moving on the test track, two possibilities remain which could cause false detections to appear:

1. A stationary object is registered as moving (corrupted detections)
2. Something within the radar sensor internal processor causes false detections to appear (pure noise)

It should be noted that even though these types of noise are very common in the radar data gathered within this project, this does not guarantee that they are representative of all situations. In a crowded traffic environment for example, the noise situation may be entirely different than what has been shown here.

The main bulk of data gathered within this project is from a large open part of the test track, with rarely more than one other object present at a time. This is thought to be somewhat representative of the mining context on which the iQMatic project is mainly focused, and for this reason the investigation of noise is considered valid.

The study of noise was done to gain more insight into the downsides of the radar system as well as to provide a basis for developing filter structures.

4.3.2 Results of developed filtering structures

As is known from the previous section, there is a considerable amount of noise coming from the radar system. The most common problem is that moving detections sporadically appear where there were in fact no moving objects. This section contains results of the different filter structures developed within this project.

Detection filtering

In order to remove unwanted detections, the three different radar detection filters described in section 3.4.1 have been implemented.

The movement index filter constitutes a crucial component of the overall system, since stationary detections contain no desired information. The weakness of this filter is that, as described above, detections coming from stationary objects can be corrupted and appear with nonzero movement indexes.

The detection velocity threshold is an attempt to reduce the amount of corrupt detections sent to the clustering stage. Since corrupt detections often appear with velocity values in a wider range than real objects, detections with unreasonably high velocities are discarded.

Below, results of the maxdR filter on two different data logs are shown.

Table 8: maxdR detection filter statistics

Datalog            maxdR [m/s]   maxdR [km/h]   Filtered
20150311 110218    25            90             1096/44426
20150311 110218    14            50.4           2166/44426
20150311 140531    25            90             575/2924
20150311 140531    14            50.4           1102/2924

In table 8 it can be seen that out of the detections that appear as moving, some are discarded by the filter. The first log contains radar data collected as a truck drives in a snakelike pattern close to EGO. The second log is a shorter log from a bicyclist test scenario. In both logs, the velocity of the studied object never exceeds 30 km/h. Despite this, a large number of detections have velocity values exceeding 50 km/h.

The results show that the number of corrupt velocity detections varies greatly throughout the logged data, but in both cases the maxdR threshold is useful. The validity of this filter is easily defended: even if the removed detections belong to real objects, their velocity values are corrupt and should therefore not be used for classification.

The maxRange threshold described in section 3.4.1 is currently set to 80 m, which is the maximum range of the radars. The performance of this particular filter step has not been evaluated, but reasonably this parameter could be useful.
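To summarize, the sketch below shows how the three detection filters could be combined in code (the struct fields moveIdx, dR and range are hypothetical names, not necessarily those used in the project software):

% Sketch of the three detection filters (hypothetical field names).
% det is a struct array with one element per radar detection.
maxdR    = 25;    % [m/s], corresponds to 90 km/h
maxRange = 80;    % [m], maximum range of the radars
keep = [det.moveIdx] > 0 ...              % movement index filter
     & abs([det.dR]) <= maxdR ...         % detection velocity threshold
     & [det.range]   <= maxRange;         % maximum range threshold
detFiltered = det(keep);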

Cluster filtering

The first two solution hypotheses of the cluster filtering structure described in section 3.4.4 have been implemented and evaluated.

The results presented below were obtained by testing different threshold values for minClusterAmpVar and maxClusterVelVar respectively, and studying the number of corrupt clusters (labeled noise clusters) that are removed versus the number of true clusters (labeled class objects) that are removed. This evaluation is made on the full set of labeled data.

Table 9: maxClusterVelVar filtering results

Threshold   Noise clusters removed   Real clusters removed
-5          1406 (100.0 %)           2933 (100.0 %)
0           1377 (97.9 %)            2812 (95.9 %)
5           772 (54.9 %)             99 (3.4 %)
10          727 (51.7 %)             62 (2.1 %)
15          699 (49.7 %)             54 (1.8 %)

Table 10: minClusterAmpVar filtering results

Threshold   Noise clusters removed   Real clusters removed
0           0 (0.0 %)                0 (0.0 %)
1           254 (18.1 %)             180 (6.1 %)
2           335 (23.8 %)             268 (9.1 %)
3           385 (27.4 %)             357 (12.2 %)
4           421 (29.9 %)             433 (14.8 %)
5           445 (31.7 %)             494 (16.8 %)

As can be seen in tables 9 and 10, the two first cluster filtering hypotheses both show potential. The filtering of clusters with highly varying velocities clearly removes more noise clusters than real clusters, and when studying logs manually, it has been seen that a significant part of the false clusters described in section 4.3.1 can be removed using this filter.

The amplitude variance filter is not as effective, since even at a small threshold a significant number of real clusters are discarded. However, if the right threshold is chosen, this filter could still be useful.
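A minimal sketch of the two cluster filters evaluated in tables 9 and 10 is shown below (the threshold values are examples from the tables; variable names are illustrative):

% Sketch (illustrative names): clusterDR and clusterAmp hold the Doppler
% velocities and amplitudes of the detections in one cluster.
maxClusterVelVar = 10;   % example threshold from table 9
minClusterAmpVar = 1;    % example threshold from table 10
rejectCluster = var(clusterDR)  > maxClusterVelVar ...  % false clusters, reflections
             || var(clusterAmp) < minClusterAmpVar;     % suspiciously uniform amplitudes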

It should be noted that, as described in the end of section 4.3.1, the labeled noise data gathered within this project does not necessarily represent all situations. For this reason, the filtering evaluation made above can only validly be applied to the situations considered within this project, and contained within the data gathered here.

Filtering conclusions

If a detection caused by a stationary object, such as a ground reflection or a building, appears to have movement in the sensor, it may pass the detection filter. If the velocity value of such a detection is unreasonably high, it will be discarded by the maxdR filter. However, many detections slip through, as their velocity values are just high enough to appear as moving, but not high enough to be discarded.

If several of these noise detections are close together, they will pass the clustering stage as well, thus reaching all the way to the classification stage of the signal chain.

A stricter rejection threshold in the multiclass ensemble could reduce the impact of this, as a larger number of noise clusters would be rejected. However, this would unavoidably lead to an increase in the number of true clusters being rejected. For this reason, a stricter rejection threshold has been avoided.

The cluster filter hypotheses were an investigation into the possibility of removing corrupt clusters before they reach the classification stage. However, even though they showed potential, the cluster filtering has been deactivated in the final system. The reason for this is described below.

Since the system concerned in this project aims to operate on the same cycle time as the radars, and use the data of every sample separately, it is very sensitive to noise which appears on a sample to sample basis. The usage of detection history, that is, to use data from earlier samples in a probabilistic filter structure, would greatly reduce the impact of noise detections that only exist in a single sample.

Even though the system developed in this project was specified to work without detection history, and thus may be sensitive to noise, the tracking system for which the class output is intended uses a probabilistic filter structure with detection history. This means that from a system-wide perspective, noise data in the class output will not have such a big impact.

This is also the reason why the cluster filter structure was disabled, and it was chosen to avoid focusing further on filter structures, such as the third filtering hypothesis (the object shadow filter) described in 3.4.4.

It is simply not worth the risk of real objects being filtered, just to remove noise that would have been removed in later system stages either way.

4.3.3 DBSCAN clustering parameter evaluation

In this section, results regarding the clustering of radar detections are presented. The focus is on the choice of the clustering parameters Eps and MinPts, through the methods described in section 3.4.2, Clustering of radar detections using DBSCAN. The overall performance of the clustering step is also discussed.

The Eps parameter

The Eps parameter was decided by looking through a data log where a truck passes the EGO vehicle. In figure 22 below, a typical difficult-to-cluster frame is shown. The difficulty consists of the object being quite far away (about 70 meters; the maximum range of the radar sensors is 80 meters), the object being large (a truck) and the detections belonging to the object being quite few and spread out (there are clear gaps between detections, and also between groups of detections).

[Figure 22: bird's-eye plot (in meters) of a single frame; all detections belonging to the truck at about 70 meters distance form one single cluster.]

Figure 22: Radar Detections Clustered with Eps = 4 meters

In the figure above, the red circles represent clustered radar detections, yellow circles are detections that are not considered moving, and the black rectangle is the EGO vehicle. Here, Eps = 4 meters, which results in all the detections belonging to the truck ending up in one single cluster. A lower value for Eps results in two clusters being formed. As discussed in section 3.4.2, a small Eps value is beneficial since it reduces the risk of noise points being clustered together with real objects. For this reason, the Eps value was set to 4.

The MinPts parameter

The choice of the MinPts parameter was made by looking at the pedestrian class and noting the typical number of detections a pedestrian produces. It was discovered that pedestrians often produce as few as a single detection, even at close distances to the EGO vehicle. A value of MinPts = 2, meaning at least two detections per cluster, was deemed necessary in order to avoid much of the clustering of noise (as discussed in 3.4.2, Clustering of radar detections using DBSCAN). Thus, the parameter value MinPts = 2 was chosen.
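The sketch below illustrates the clustering step with the chosen parameters, using the dbscan function available in newer MATLAB releases (the project used its own implementation, so this is illustrative rather than the original code):

% Illustrative sketch: P is an N-by-2 matrix with the (x, y) positions of
% the moving detections in one frame.
Eps    = 4;    % [m], neighbourhood radius chosen above
MinPts = 2;    % at least two detections per cluster
labels = dbscan(P, Eps, MinPts);          % label -1 marks noise points
nClusters = max([labels; 0]);             % number of clusters found
clusterPos = zeros(nClusters, 2);
for c = 1:nClusters
    clusterPos(c,:) = mean(P(labels == c, :), 1);   % cluster position = mean position
end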

Remarks and discussion about overall performance of the clustering step

The choice of both algorithm and clustering parameters is a delicate matter that greatly affects the performance of the entire classification system. Due to the presence of noise, having a non-dynamic set of parameters inevitably leads to issues. For example, there are problems when trying to filter out stationary objects while moving (especially at higher velocities, but also in general).

In figure 21, found in section 4.3.1 above, a frame from when the EGO vehicle is moving along a straight road with fences was shown. The frame is an example of noise caused by movement of the EGO vehicle. This type of noise is problematic, as there is no apparent way to filter it. Dealing with these kinds of noise detections in the clustering step (and avoiding sending the resulting clusters to the classifier step) would therefore be beneficial. The close detections lie about 2 meters apart, meaning that with the same Eps parameter as above, many clusters would be created. If, on the other hand, the Eps parameter were lowered, many detections that really come from a single object would be put into different clusters, making the classification more unreliable.

What also happens in the situation described above is that when an actual moving object appears too close to the fence, it will get clustered together with it. Almost certainly, the system will reject this large, oddly shaped cluster as noise, and the object is therefore not classified.

This illustrates the difficulty with choosing suitable fixed parameters for all situations.

These results imply that the choice of DBSCAN as clustering method is probably not enough to reach satisfactory performance in all situations. Instead, a more flexible option such as OPTICS could be considered, which would allow for dynamic parameter usage. This would however demand extensive analysis and work.

Another thing to consider is the usage of a different distance function. Since DBSCAN works with any distance function (not just the Euclidean distance, which is what has been used here), a function that better fits the data could provide better performance. For example, since it is known that the radars can only produce detections from physical surfaces, moving objects should tend to look more like ellipses than circles (or more like lines than boxes). A distance function that is stricter in the normal direction of the radar could have a better correlation with the physical reality, and therefore increase clustering performance.
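As an illustration of this idea (not something implemented in the project), the sketch below builds a distance matrix that penalizes separation along the radial direction of the sensor more heavily than separation along the tangential direction, and feeds it to DBSCAN as a precomputed distance:

% Illustrative sketch: P is N-by-2 with detection positions relative to the
% sensor origin; wRadial > 1 makes the distance stricter radially.
wRadial = 2.0;
N = size(P, 1);
D = zeros(N);
for i = 1:N
    u = P(i,:) / norm(P(i,:));            % radial unit vector at detection i
    for j = 1:N
        d  = P(j,:) - P(i,:);
        dr = dot(d, u);                   % radial component of the separation
        dt = norm(d - dr*u);              % tangential component
        D(i,j) = hypot(wRadial*dr, dt);   % stretched radially, lenient tangentially
    end
end
D = max(D, D.');                          % symmetrize the distance matrix
labels = dbscan(D, 4, 2, 'Distance', 'precomputed');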

4.4 Classification-related results

In this section, the results of classification-related tasks are presented. First, the choice of SVM parameters is explained via the outputs of the parameter grid searches conducted. This is followed by a complete evaluation of the classification performance on validation data.

4.4.1 Support vector machine model selection

To determine the SVM parameters, several cross-validation grid searches have been performed. The results of two grid searches are presented below.

The first grid search was conducted on the parameter grid

γ = (2^-4, 2^-2, ..., 2^8),  C = (2^0, 2^2, ..., 2^12)

with k-fold cross-validation, k = 5.

[Figure 23: heat map of the cross-validation score [%] over the coarse grid, with the slack parameter C on the horizontal axis and the kernel parameter gamma on the vertical axis.]

Figure 23: Coarse grid search for SVM parameters

In figure 23 it can be seen that the best cross-validation scores were achieved in the upper part of the grid. For this reason, the second grid search was extended to allow higher values of both C and γ. The exponent step size was also reduced from 2 to 0.25, resulting in a time-consuming but detailed grid.

This second grid search was conducted on the grid composed of:

γ = (2^-4, 2^-3.75, ..., 2^14),  C = (2^-4, 2^-3.75, ..., 2^16)

This search was also done with k = 5.

[Figure 24: heat map of the cross-validation score [%] over the fine grid, with the slack parameter C on the horizontal axis and the kernel parameter gamma on the vertical axis.]

Figure 24: Fine grid search for SVM parameters

In this grid search, a maximum cross-validation score of 89.80% was found for the parameter combination C = 2^14, γ = 26.9. However, as can be seen in figure 24, the central area contains a wide range of parameter combinations yielding similar results. In order to improve generalization in the model, a lower C value and a higher γ were chosen: C = 860, γ = 45. This combination resulted in a cross-validation score of 89.50%.

It should be pointed out that while it is apparent in figure 24 that the ridge of high cross-validation scores continues to the southeast, that direction also leads to increased overfitting. Since a global optimum cannot be guaranteed anyway (the grid search being a pure brute-force method), it is deemed unnecessary to extend the search further.
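A hedged sketch of such a cross-validated grid search is shown below. It uses the Statistics and Machine Learning Toolbox functions fitcecoc/templateSVM rather than the library used in the project, and assumes a feature matrix X and a class label vector Y; note that MATLAB parametrizes the RBF kernel through KernelScale, with gamma = 1/KernelScale^2.

% Hedged sketch (assumed inputs: X is the feature matrix, Y the class labels).
Cgrid     = 2.^(0:2:12);       % slack parameter C
gammaGrid = 2.^(-4:2:8);       % kernel parameter gamma
scores = zeros(numel(gammaGrid), numel(Cgrid));
for i = 1:numel(gammaGrid)
    for j = 1:numel(Cgrid)
        t = templateSVM('KernelFunction', 'rbf', ...
                        'BoxConstraint', Cgrid(j), ...
                        'KernelScale', 1/sqrt(gammaGrid(i)));
        cvmdl = fitcecoc(X, Y, 'Learners', t, 'KFold', 5);
        scores(i,j) = 100 * (1 - kfoldLoss(cvmdl));   % cross-validation score [%]
    end
end
[best, idx] = max(scores(:));
[iBest, jBest] = ind2sub(size(scores), idx);
fprintf('Best score %.2f %% at C = %g, gamma = %g\n', ...
        best, Cgrid(jBest), gammaGrid(iBest));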

4.4.2 Offline evaluation of classification performance

Here, the results of the classification system created within this project are evaluated in an offline context, meaning a simulated MATLAB/Simulink environment.

The evaluation is made by using 50% of labeled class data for training purposes, and verifying performance on the remaining 50%. The results are found below, in table 11.

Table 11: Offline evaluation of classification performance. Total accuracy: 1325/1473 ≈ 90.0%

(a) Confusion matrix

             Pedestrian   Bicyclist   Car    Truck   Recall
Pedestrian   252          3           2      1       0.98
Bicyclist    14           213         36     3       0.80
Car          4            54          284    12      0.80
Truck        1            2           16     576     0.97
Precision    0.93         0.78        0.84   0.97

(b) Performance measurements

Class        Accuracy   Error   F-measure
Pedestrian   0.98       0.02    0.95
Bicyclist    0.92       0.08    0.79
Car          0.92       0.08    0.82
Truck        0.98       0.02    0.97
Average      0.95       0.05    0.88
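The per-class measures in table 11 follow directly from the confusion matrix. A short sketch of how they are derived, using the numbers from table 11a, is given below:

% C(i,j) = number of class-i validation samples predicted as class j
% (rows/columns: pedestrian, bicyclist, car, truck), from table 11a.
C = [252   3   2   1;
      14 213  36   3;
       4  54 284  12;
       1   2  16 576];
recall    = diag(C) ./ sum(C, 2);             % per-class recall
precision = diag(C) ./ sum(C, 1)';            % per-class precision
fmeasure  = 2 * (precision .* recall) ./ (precision + recall);
totalAccuracy = sum(diag(C)) / sum(C(:));     % 1325/1473, approximately 0.90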

As expected, the lowest performance scores are those achieved for the bicycle class. This is due to these examples being easily misclassified as both pedestrians and personal vehicles. From section 4.2.2 it can be seen that these three classes overlap in the feature space, with the bicycle class placed in between and highly overlapping with the two other classes.

Physically, this is reasonable since bicyclists are similar in size to pedestrians, but can also move at higher speeds and have moving wheels. This brings them further towards the vehicle classes. The bicycle class simply does not have enough aspects that set it apart from the other classes considered in this project.

Fortunately, since the iQMatic project is a research project regarding autonomous driving in mines, the bicycle class is perhaps not the most important of the classes considered: a mine is a restricted zone with a low chance of encountering bicyclists. However, the inclusion of the bicyclist class has yielded much insight into the limits of the radar data and the capability of the classification system.

Overall, the results in table 11 show that the finished classification system is quite powerful and has the ability to separate the four classes. It should be noted however, that these results do not necessarily prove that the system has good generalization.

If the discussion made in section 4.1 is taken into account, it can further be argued that high correlation within the labeled data may also lead to an inflated score when evaluating classifier performance on validation data.

In order to get a better estimate of the generalization capability of the classification system, a more diverse set of labeled data is required for validation purposes. However, the results show that there is a spread between the classes in the feature space and that a good classification performance is definitely possible given the limited input data considered in this project.

4.5 Complete system performance assessment

In this section, the performance of the complete integrated system is shown and discussed, both via results obtained from the simulated environment and via the actual run-time implementation on the target system. To assess the system, we look at both classification and real-time performance.

4.5.1 Real time performance

As discussed in section 3.6.2, Timings and tasks, the classification system real-time performance was partly evaluated using Simulink Profiler. These simulations were run on an Intel Core i5 CPU @ 3.20 GHz with 6 GB RAM running Windows 7. Notably, this hardware is considerably less powerful than the hardware available on Astator (see section 1.2.2, Target system). The relevant results of the profiler analysis are shown below in table 12:

Table 12: Real-time simulation performance from Simulink Profiler

Name               Time [s]   Time [% of total]   Calls   Time/Call [s]
Total (clas sim)   50.98      100.0               1       50.981
clas sim/meas      10.28      20.2                1001    0.0103
clas sim/clas      2.76       5.4                 1001    0.0028
clas sim/vego      0.20       0.4                 1001    0.0002
other              37.74      74.0                -       -

The above table shows the proportion of total simulation time taken by the three major function blocks present in the simulation of the model clas sim. The clas function performs all the tasks of the classification system (gatherFrame, makeFrame, cluster and classify, as discussed in 3.6.2, Timings and tasks). The vego and meas functions constitute the system task.

In the simulation, the complete clas function is performed every 10 ms, but only contains new data every 50 ms. Thus, the time/call for the clas function shown in the table above would have a lower average if, as is the case in the real implementation, it conducted computations every 50 ms.

It can be seen in the table above that the clas function has an average execution time of around 2.8 ms, which is well within the required maximum execution time. For comparison, the meas function that computes transforms has more than triple the average execution time per call.

Since the Simulink Profiler evaluations were run in a simulated environment on a desktop with notably less powerful hardware than what is available on the Astator platform, the timing figures shown here are likely considerably worse than those of the implemented system, which runs compiled C code. Thus, these results strongly indicate that the classification system has adequate real-time performance.

It should be noted that the Simulink Profiler analysis was conducted using a typical data log. Such a log usually contains between 0 and 15 clusters per frame. Since it is theoretically possible (though implausible) for a single frame to contain 128 clusters (that is, every one of the 256 possible detections is present and clustered together with precisely one other detection), the worst-case scenario is considerably more computationally expensive than what has been evaluated here. For this reason, it is possible that the real-time performance could be inadequate in extreme load situations (although such situations are very unlikely to occur).

Real time performance on the target system

To further ensure that the real-time requirements are met, the real-time properties of the system were also studied while the system was running on the actual Astator hardware.

Below in table 13, the minimum and average execution times of the main software functions (the same as discussed above in table 12) are shown:

Table 13: Real-time performance on the target system

Function   Exec time min. [ms]   Exec time avg. [ms]   Log ID
clas       0.018                 0.349868              20150603 150621
vego       0.263                 0.344751              20150603 150621
meas       0.242                 1.61016               20150603 150621
clas       0.018                 0.317206              20150603 150521
vego       0.26                  0.341969              20150603 150521
meas       0.26                  1.3693                20150603 150521

The data above was extracted from data logs gathered in a highway scenario. Since such a scenario contains considerably more input data than the typical use case, these results can be seen as a heavy-load scenario. It is clear from the table above that the real-time performance of the classification system (the clas function) is well within the boundaries specified, averaging around 0.3 ms in execution time from start to finish.

It should also be noted that since the Astator ECU runs a real-time operating system with several threads, there is a possibility of the operating system interrupting the currently executing function. Thus, the actual function execution times may be lower than what is seen above.

4.5.2 Classification performance

In this part, an evaluation of the complete system classification performance is shown. First, an input/output comparison between the implemented and the simulated system is presented. Then, the simulated classification system performance is measured on two different logged radar data scenarios.

Input/output comparison between implementation and simulation

As discussed in section 3.6.3, the first step of the system validation is to ensure that the system gives the same output when running on the target system as in the simulated environment, provided the same input.

This was done by comparing outputs of the two systems and finding clusters that had the exact same position in both environments. Since the position of a cluster is calculated as the average position of all detections contained within it, clusters in the exact same positions should also contain the same detections. If two clusters contain the same radar detections, they provide the exact same input to the classification step. Thus, the classification outputs of the two different systems can be compared.

The different feature values are not directly available in the implemented system, since the code generation creates data structures that are very hard to follow logically. As a result, these values cannot be compared. Instead, we compare the probability outputs of the different implementations. If they are the same, it is very likely that all other parameters have been the same. Therefore, if the same input results in the same class probability output, then the two implementations are very likely functioning in the exact same way.

In table 14 below, the input and output of objects classified as belonging to one of each of the four different classes are compared for the real-time implemented system and the simulated implementation:

Table 14: Input output comparison of the two systems for the different classes

Syst   time    xpos     ypos    class   prob c1   prob c2   prob c3   prob c4
Sim    0.140   -24.90   57.70   4       0.0375    0.0065    0.0000    0.997
Impl   0.159   -24.90   57.70   4       0.0375    0.0065    0.0000    0.997
Sim    0.140   -1.35    43.09   3       0.0720    0.0000    0.9890    0.0068
Impl   0.159   -1.35    43.09   3       0.0720    0.0000    0.9890    0.0068
Sim    0.140   7.27     5.28    2       0.0650    0.8670    0.0000    0.0469
Impl   0.159   7.27     5.28    2       0.0650    0.8670    0.0000    0.0469
Sim    1.900   3.04     -6.91   1       0.7677    0.0000    0.0021    0.0047
Impl   0.200   3.04     -6.91   1       0.7677    0.0000    0.0021    0.0047
Sim    0.190   -6.491   26.34   -1      0.1135    0.0051    0.0000    0.1314
Impl   0.200   -6.491   26.34   -1      0.1135    0.0051    0.0000    0.1314

Above, the rows labeled Sim correspond to the simulated environment and Impl to the implementation on the target system. The Impl data was taken directly from a log created while running the classification system on Astator, while the Sim data was created using the stored radar signals of said log to simulate the system output in the Simulink environment. The class c1 corresponds to pedestrians, c2 to bicyclists, c3 to cars and c4 to trucks. The -1 class corresponds to a cluster being rejected (not likely belonging to any class). The different probabilities correspond to the Platt-scaled outputs of the classification system.

As can be seen in the table above, given the same cluster input the two systems respond identically. Thus, it can be concluded that the real-time system implementation very likely delivers the exact same output as the simulated system, given the same input.

It should be noted that the above results do not indicate whether the system predictions are correct or not, only that they give the same prediction output provided the same input, for each of the different classes.

It can be concluded that the implementation of the classification subsystem works and performs adequately from a real-time perspective, both with regards to timing and with regards to correct computations. Thus, further evaluation can be done in a simulated environment without negatively affecting validity.

Classification performance evaluation in the simulated environment

Here, results relating to the system performance in the simulated environment are presented. Results from two different logs, one easy and one more difficult, are presented and discussed. A frame from a scenario in which the EGO vehicle is traveling on a highway is also shown and discussed in relation to the other results.

Below in table 15, classification results for the two logs are shown. The first log is taken from a scenario where a truck with a trailer slowly (around 30 km/h) passes the stationary EGO vehicle. This can be considered an easy log, partly since trailers are very large and thus far from the other classes in the feature space, and partly because the EGO vehicle is stationary, which greatly reduces the presence of noise. Also, the log is taken in a big open area with no other moving targets.

The second log is from a scenario where the EGO vehicle is driving at around 20 km/h while being followed by a car. This is of medium difficulty because the EGO vehicle is moving, which causes a considerable amount of noise.

Table 15: Classification system evaluation on the two logged scenarios

Log   Frames   Clusters   Bad tot/obj   Rej   Bike   Car   Truck
1     218      300        85 / 60       79    26     22    173
2     140      585        428 / 50      62    292    136   95

For the first log, out of a total of 300 clusters, 85 were misclassifications. This yields an overall error rate of 85/300 = 0.283. In this context, misclassifications were counted as each cluster that clearly belonged to the truck, but was classified as something else, as well as the clusters not belonging to any real object that were not rejected. When looking at only the clusters coming from the truck (one per frame), and disregarding all noise clusters, 60 out of 218 clusters were misclassified. This yields an error rate of a more qualitative nature, calculated as 60/218 = 0.275.

As for the second log, the overall error rate was 428/585 = 0.732. This high value is due to the fact that so many noise clusters are present and the system does not reject many of them. If we look only at the clusters belonging to the car, the error rate is instead 50/140 = 0.357, considerably lower, but still higher than in the first log. The conclusion is that, as more noise is present in the radar signals, besides the misclassifications made on the noise itself, the system also performs worse on the real objects. One theory that explains this is that as noise increases, the clusters coming from real objects are also corrupted, as they may contain some noise and corrupt detections as well.

In figure 25 below, a typical frame from each of the two logs is shown (with the only cluster belonging to a real object marked):

[Figure 25, panels (a) and (b): bird's-eye plots (in meters) of typical frames; (a) log 1, clusters from the truck and noise, with the marked truck cluster at about (-69.2, 6.3); (b) log 2, clusters from the car and noise, with the marked car cluster at about (-55.5, -1.0).]

Figure 25: Typical frames from evaluation logs

As can be seen in the figures above, a considerable amount of noise clusters is almost always present. The case when the EGO vehicle is stationary (the first log) is seen to have less noise than the second log, where EGO was moving. As a further comparison, a typical frame from when the EGO vehicle is moving fast on a highway is shown below in figure 26.


Figure 26: Typical frame from highway log

In the figure above, the EGO vehicle is moving at around 80 km/h, and several vehicles of unknown classes are present around it. The detections that appear in lines are from stationary fences on both sides of the lane. Detections below the southernmost fence are likely from stationary objects such as trees and signposts perceived as moving objects by the system. These detections tend to appear whenever the truck is moving at higher speeds. The preprocessing software does not seem to be able to deal with this in a satisfactory way, so when looking at radar data frame by frame, the presence of noise is very common. The noise points in turn lead to clusters being corrupted, as detections belonging to real moving objects are grouped together with stationary things like fences or with other types of noise.

It should be noted that misclassifications of clusters actually belonging to real objects can also be the result of a lack of corresponding training data (the system is not trained on any object at these speeds), which is not reflected in the figure.

Overall, the ability of the preprocessing stage to separate out stationary detections is inadequate in certain situations (generally when the EGO vehicle is moving), which leads to low classification performance.

However, it is important to put this in the correct context. Since the end goal of the iQMatic project (of which this system is part) is mining applications, it might not be relevant at all for performance to be good on highways. Scenarios with lots of open space and few moving targets, with the EGO vehicle moving at relatively low speeds, are likely more relevant.

When looking at the classification performance (on clusters belonging to real objects), the error rate is between 28 and 36 percent in the evaluated logs. With additional parameter tuning and a wider dataset, this could probably be greatly improved. Since this project has not aimed to specifically produce as high classification accuracy as possible, but rather to provide a proof of concept, the system performance is considered adequate.

Part 5: Conclusions and future work

To conclude the thesis, we look back to the project goals stated in the beginning of the report in an effort to evaluate the work done. The report is also concluded with thoughts and observations together with a discussion and suggestions about work that could be done in the future.

5.1 Concluding discussion regarding research questions and requirements

In an effort to relate and connect the results of the project to the goals stated in the first chapter, this section contains a discussion regarding requirements and research questions.

5.1.1 Requirements

In section 1.2.3, a list of project requirements was presented. As stated, these requirements were developed together with Scania as well as KTH and are to be seen as guidelines for the project rather than something strictly imposed by Scania. Nevertheless, the requirements have been treated as real requirements during the development of the system.

Table 16: Fulfillment of functional requirements

Functional requirement                                                     Fulfilled
Cluster individual radar detections into objects                          X
Remove stationary objects and only be concerned with moving objects       X*
Classify moving objects as belonging to one or none of the four classes   X
Provide a confidence output                                               X
Never exceed the real-time execution time of 50 ms                        X

As can be seen in table 16, all of the functional requirements stated at the beginning of the project have been fully or partially fulfilled. One requirement that has been hard to strictly fulfill is the filtering of static objects. This is due to the fact that the radar output is easily corrupted (especially when the EGO vehicle is moving), causing stationary objects to yield detections with nonzero velocity values. This is discussed extensively throughout the report.

Several attempts to reduce the impact of such corrupt detections have been made in the project; however, this has not been enough to strictly ensure that no static objects pass all the way through the signal chain. The radar sensors are simply not accurate enough. The usage of detection history, as briefly discussed in 4.3.2, Results of developed filtering structures, could significantly reduce the impact of such corrupt radar data.

As for the real-time execution requirement, a significant evaluation has been made and the system is deemed to fulfill the real-time requirement by a large margin, with an average execution time of less than 3 ms in a non-optimized environment and less than 0.5 ms on the target system. However, since no worst-case scenario has been tested, there may exist cases where the execution time of the system is increased.

Table 17: Fulfillment of extra-functional requirements

Extra-functional requirement                                Fulfilled
Use only radar and velocity information                     X
Avoid using detection history and feedback loops            X
Use SVM as classification method                            X
Develop the system in a MATLAB environment                  X
Implement the system on the embedded hardware of Astator    X

As for the extra-functional requirements, shown above in table 17, they have all been followed and fulfilled.

5.1.2 Research questions

In section 1.2.4, a list of four research questions was presented. The research questions were based on the requirements and the original thesis formulation, and refined together with the KTH supervisor.

The aim has been to provide answers to each research question by developing and implementing an actual system.

Below, a recap of each research question is presented together with a discussion to serve as answer.

Research question 1: How can existing machine learning theory be integrated into the embedded hardware of Astator with the purpose of creating a classification system?

There exists a variety of ways in which machine learning theory can be used to achieve the goals stated in this project. In our system implementation, one successful way of doing so has been presented.

Using support vector machines is a mathematically intuitive way to solve the problem of classification. By utilizing the basics of SVM theory (see section 2.3.2), either via manual coding or using an open source library (both were done in this project), a classification system can be constructed in a quite simple software structure. When it comes to implementation, the usage of code-generation provided a powerful tool to convert m-code to lower level code implementable on Astator. However, should it be necessary, the SVM structure could just as well be implemented in C from the start.

In order to create the classification model, several prerequisites must first be fulfilled:

• Gather representative data from each of the classes
• Label this data according to class
• Find features that represent the data set and separate the classes

Research question 2: How can this system be optimized for real time execution?

Inherently, the SVM structure is quite light-weight, as only a subset of the training data is kept in the actual classification model. As for the actual prediction stage, all that is required in run-time is the evaluation of a dot product. This could be slightly extended if using kernel methods and/or a conversion to probabilistic output.
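As a concrete illustration of what the run-time prediction step amounts to for one binary RBF-kernel SVM (variable names are illustrative, not taken from the project code), consider the sketch below. With a linear kernel the support vectors collapse into a single weight vector, and the prediction reduces to one dot product.

% Illustrative sketch of the prediction step for one binary RBF-SVM.
% SV    : m-by-d matrix of support vectors
% alpha : m-by-1 signed dual coefficients (alpha_i * y_i)
% b     : bias term, gamma : kernel parameter
% x     : 1-by-d feature vector of the cluster to classify
k = exp(-gamma * sum((SV - x).^2, 2));   % RBF kernel against every support vector
f = alpha' * k + b;                      % decision value; sign(f) gives the class
% Linear kernel: precompute w = SV' * alpha, then f = x * w + b.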

Throughout the report, several ways have been discussed in which the software could be further optimized, below is a short overview:

• Switching to a linear kernel in the SVM structure will reduce the calculations required to make predictions and, provided the features are good enough, may result in a model of similar performance (see section 2.3.2, Kernels and soft margin).
• If using kernels, selecting a lower value of the SVM slack parameter C will allow more slack, leading to a model of lower complexity and a faster prediction stage (see section 2.3.2, Model selection).
• Utilizing the capability to perform parallel computations, for example by using PDBSCAN (a parallel variant of DBSCAN) in the clustering stage, as discussed in section 2.4.1, Clustering of sensor data.

• Using a complexity-efficient multiclass ensemble, such as BTC or DAG (briefly discussed in section 2.3.1, Supervised learning, classification and overfitting).
• Reducing the number of features will lead to a reduced prediction execution time. As shown in section 4.2, Analysis of feature and data characteristics, many of the features used within this project are correlated, and thus their number could probably be reduced without a big loss in performance.

Research question 3: What can be done to improve the classification accuracy of this system with regards to robustness against noise and environmental factors?

A common problem in this radar-based system is the fact that the radar sensors occasionally deliver corrupt detections (stationary detections perceived as moving) and false detections. Both these categories are seen as noise in the context of this project.

In order to improve the systems robustness against noise, several strategies have been tested. The main concepts are listed below:

• Discard detections believed to be false or corrupt. Details of the methods used to do this are found in section 3.4.1, Filtering of radar detections.
• Use a density-based clustering algorithm specifically constructed to handle noise. See section 2.4.1, Clustering of sensor data.
• Discard clusters believed to be corrupt. Several hypotheses on how to achieve this were put forward in section 3.4.4, Filtering of radar detection clusters.
• Reject objects if the classification output is below the rejection threshold. More information can be found in section 3.5.2, Multiclass, rejection and confidence structures.

In combination, the strategies listed above can remove a significant amount of noise data and reduce the amount of false outputs coming from the classification system. However, as the radar sensors deliver a substantial amount of noise detections, it is very hard to achieve complete robustness without the usage of detection history.

As discussed in section 4.3.2, Results of developed filtering structures, the most effective way to handle noise would be to combine data from earlier samples, and use this history to discard noise data that only appears sporadically in sampled data. However, since this is already done in later system stages, one requirement of the classification system was to operate without detection history.

As for environmental factors, radar sensors have an advantage in that they are not easily affected by outer conditions such as weather. However, as was shown in section 4.3.1, Common types of noise in the radar output, the radars used within this project tend to output more false data in certain situations, such as when several objects are present or when objects are close to EGO. Another situation that leads to more noise data is when the EGO vehicle itself is moving at high velocities.

These aspects are seen as a weakness in the radar system itself, and not in the classification system. To gain increased robustness in these conditions, another solution would be to combine the radar data with data from other sensors.

Research question 4: What are the major obstacles in creating the system, and what can be done to overcome them?

As discussed above, noise in the radar sensor output has been a big challenge and obstacle. As a standalone system, the classification structure developed in this project may be insufficient as noise has such a big impact. However, the intention has never been to create a standalone system but rather an integrated part of the object tracking system already present.

Aside from noise, one big obstacle has been the gathering and labeling of data. Since this is a supervised learning system, it is completely dependent on labeled training data in order to perform well. Thus, insufficient data will lead to problems in almost all stages of the project.

Since data gathering and labeling is manual work and very time consuming, it has not been a big focus of this project. Instead, the focus has been on providing a proof of concept and demonstrating the feasibility of the system.

As such, one way to overcome this particular problem is to dedicate more resources to data-gathering and labeling, as a bigger, more diverse set of data will improve all aspects of the system, including validation.

Another obstacle has been the difficulty in attaining in-depth knowledge regarding the radar hardware and software, as well as the software structures already implemented on Astator. Having such in-depth knowledge is of course beneficial. However, an upside to the methodology used in this project is that extensive sensor and platform knowledge is not needed in order to produce a working system.

5.2 Project-wide conclusions

In this project, we have developed a low-level, real-time moving object classification system based on Doppler radar detections and concepts familiar from machine learning, specifically support vector machines. The process can be summarized as follows:

• Gather data from sensors. This data should strive to be unbiased and as representative as possible.
• Develop a filtering structure to reduce the impact of unwanted sensor data and noise.
• If necessary or wanted, develop an algorithm that clusters data points together, based on for example spatial closeness. This can be seen as unsupervised learning.
• Perform a feature selection and extraction process to gain a feature vector that is capable of separating the different classes.
• Train a support vector machine model using an appropriate kernel and parameters. This step is considered supervised learning.
• Use code generation tools to convert the model from a simulated environment to software that can be implemented on the target system.

The methods used differ from projects with similar goals in being based solely on data from radars and in being located early in the total system signal chain. This implies that our solution is cheaper, more robust (less reliance on fast-moving parts such as laser detectors that are more prone to break) and less complex than what similar projects have accomplished. The trade-off is that the system described is not as useful in itself, but only as part of a larger object tracking system. Performance is also affected in a number of ways, being lackluster especially in scenarios where the EGO vehicle is moving at a higher velocity.

The methodology presented within this project has been proven to work, and it is our opinion that it has the potential to be used within an end-user application. Significant performance improvements are thought possible through more fine tuning of parameters, and gathering and labeling more training data.

The biggest problem with the current solution is considered to be the inadequate filtering of stationary detections and pure noise. Radar detections that belong to stationary targets or noise, and were never meant to reach the classification stage, are present and become more common the more complex a scenario becomes. If the filtering stages can be improved, our methodology becomes even more relevant.

At a higher abstraction level, the methods described in this project can be seen as a way of dealing with sensors that behave like black boxes, fitting a sensor model through the use of machine learning methods. This approach could become increasingly useful as sensors grow more complex, and thus more likely to be provided by a third-party manufacturer that might be reluctant to share explicit details of how a sensor operates. Instead of trying to reverse-engineer the sensor, one can apply the methods used here to create an accurate fit between input and output.

5.3 Future work

In order to improve the performance of the system developed in this project, several things can be done. We suggest investigating the following, ordered from most to least important for performance increase:

• First and foremost, working out a better way to distinguish between sensor data originating from stationary objects and from moving objects. The system performs very well when noise is absent, so this would greatly improve performance.
• Secondly, implementing a different clustering method that uses dynamic clustering parameters. Using the algorithm OPTICS instead of DBSCAN could accomplish this (see the sketch after this list).
• Thirdly, investigating the use of more complex features, which could significantly improve performance and functionality. Incorporating detection and classification history through such features is one possibility.

In addition, broader coverage in the labeled training and validation data would improve performance at almost every stage of the process. This is especially true for scenarios not covered by the dataset created in this project.
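As a hint of what the second item could look like, the fragment below contrasts DBSCAN, which uses a single global neighbourhood radius, with OPTICS, which orders points by reachability and can extract clusters of varying density. It uses scikit-learn and toy data purely for illustration; the parameter values are assumptions and not taken from the project.

```python
# Toy comparison of a fixed-eps DBSCAN run with OPTICS, which handles
# varying point density; values are illustrative only.
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(0)
near = rng.normal([5.0, 0.0], 0.2, size=(15, 2))   # dense cluster close to the sensor
far = rng.normal([40.0, 10.0], 1.5, size=(8, 2))   # sparse cluster at long range
points = np.vstack([near, far])

# A single global eps tuned for the near cluster tends to fragment or
# discard the sparse far-away cluster...
print("DBSCAN:", DBSCAN(eps=0.5, min_samples=3).fit_predict(points))

# ...whereas OPTICS adapts the extraction to the local density.
print("OPTICS:", OPTICS(min_samples=3).fit_predict(points))
```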
