
© Tobias Matts, Anton Sterner

Linköping University | Department of Science and Technology
Master's thesis, 30 ECTS | Medieteknik
2020 | LiU-ITN-TEK-A--20/011--SE

Vision-based Driver Assistance Systems for Teleoperation of On-Road Vehicles – Compensating for Impaired Visual Perception Capabilities Due to Degraded Video Quality

Visuella förarhjälpsystem för fjärrstyrning av fordon

Tobias Matts Anton Sterner

Supervisor: Karljohan Lundin Palmerius
Examiner: Daniel Jönsson

Linköpings universitet, SE–581 83 Linköping, +46 13 28 10 00, www.liu.se


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Tobias Matts, Anton Sterner

Abstract

Autonomous vehicles are going to be a part of the future transport of goods and people, but to make them usable in the unpredictable situations presented in real traffic, there is a need for backup systems for manual vehicle control. Teleoperation, where a driver controls the vehicle remotely, has been proposed as a backup system for this purpose. This technique is highly dependent on stable, high-bandwidth wireless networks to transmit high-resolution video from the vehicle to the driver station. A reduction in network bandwidth, resulting in a reduced level of detail in the video stream, could lead to a higher risk of driver error.

This thesis is a two-part investigation. The first part examines whether lower resolution and increased lossy compression of video at the operator station affect driver performance and safety of operation during teleoperation. The second part covers the implementation of two vision-based driver assistance systems: one which detects and highlights vehicles and pedestrians in front of the vehicle, and one which detects and highlights lane markings.

A driving test was performed on an asphalt track with white markings for track boundaries, with different levels of video quality presented to the driver. Reducing video quality had a negative effect on lap time and increased the number of times the track boundary was crossed. The test was performed with a small group of drivers, so the results can only be interpreted as an indication that reduced video quality can negatively affect driver performance.

The vision-based driver assistance system for detection and marking of pedestrians was tested by showing a test group pre-recorded video shot in traffic and having them react when they saw a pedestrian about to cross the road. The results of a one-way analysis of variance show that video quality significantly affects reaction times, with p = 0.02181 at significance level α = 0.05. A two-way analysis of variance was also conducted, accounting for video quality, the use of a driver assistance system marking pedestrians, and the interaction between these two. The results suggest that marking pedestrians in very low quality video does help reduce reaction times, but the results are not significant at significance level α = 0.05.

Acknowledgments

The authors would like to thank Voysys for letting us conduct our thesis there. Special thanks to our mentor Gabriella Rydenfors for putting up with us during our time at Voysys. We would also like to thank our supervisor Karljohan Lundin Palmerius for great advice during this process, as well as our examiner Daniel Jönsson.

Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables viii

1 Introduction 1
1.1 Motivation ...... 1
1.2 Aim ...... 2
1.3 Research questions ...... 2
1.4 Delimitations ...... 2

2 Background 3
2.1 Autonomous Vehicles ...... 3
2.2 Teleoperation ...... 5
2.3 Advanced Driver Assistance Systems ...... 5
2.4 Pilot study: ADAS Categorization ...... 5
2.5 Voysys Software Suite for Teleoperation ...... 6

3 Theory 7
3.1 Visual Acuity ...... 7
3.2 Focal and Ambient Vision ...... 10
3.3 Situation Awareness ...... 11
3.4 Video Quality & Compression ...... 11

4 Video Quality & Vision in Teleoperation 12
4.1 Testing Visual Acuity in Streamed Video ...... 12
4.2 Lane Keeping Test ...... 14
4.3 Discussion ...... 15
4.4 Analysis of the Test Results ...... 19

5 Vision-based Driver Assistance Systems 20
5.1 System Overview ...... 20
5.2 Inference Plugin Implementation ...... 23
5.3 Visualization Plugin Implementation ...... 32
5.4 Systems Results and Performance ...... 33
5.5 User test ...... 34
5.6 Discussion ...... 37

6 Conclusions 43

6.1 Could It Have Been Made Simpler? ...... 44
6.2 The Work In A Wider Context ...... 44
6.3 Future Work ...... 44
6.4 Source Criticism ...... 45

Bibliography 46

Appendices 50

A ADAS Categorization 51
A.1 Active Systems ...... 51
A.2 Passive Systems ...... 52

A User test video description 53

List of Figures

3.1 Snellen chart and ETDRS chart. ...... 8
3.2 Visual angle V - the angle spanned by the extremities A, B, of an object. ...... 8

4.1 Visual acuity measures of the combination of each setting. ...... 13
4.2 High resolution with different levels of compression. ...... 14
4.3 Medium resolution with different levels of compression. ...... 15
4.4 Low resolution with different levels of compression. ...... 16
4.5 Car lined up to start the test course. ...... 17

5.1 An overview of the system design. ...... 21
5.2 The front- and back camera feeds are aligned in the streamer to make up a frame to be sent to the player. ...... 21
5.3 ...... 22
5.4 Illustration of the packed RGB format. Pixel values are stored next to each other in memory. ...... 26
5.5 Illustration of how virtual camera data is traversed. ...... 27
5.6 a) Area under red line makes up the ROI. b) Masked binary image created from inference output. ...... 31
5.7 Illustration of how a line segment is drawn using triangles. ...... 33
5.8 Two examples of detection boxes drawn in front of video stream. ...... 34
5.9 Two examples of lane overlay drawn in front of video stream. ...... 34
5.10 The test subject presses a button on the steering wheel, which activates the symbol in the bottom left corner of the screen. ...... 36
5.11 Data distribution of reaction times for each scenario, sorted by scenario settings. Different scenarios denoted by color. Reaction time is relative to the moment a pedestrian steps out into the street (dashed line in figure). ...... 38
5.12 Data distribution of reaction times for each video, sorted by scenario settings. Reaction time is relative to the moment a pedestrian steps out into the street (dashed line in figure). ...... 38
5.13 Sampled reaction timestamps, the timestamp when a pedestrian steps into the road and the timestamp when the pedestrian was marked by the ADAS. All times are relative to the timestamp of a pedestrian coming into view. ...... 39
5.14 Q-Q plot of residuals of all observations, blue line is expected normal distribution. ...... 39

List of Tables

2.1 The different levels of automation in on-road motor vehicles proposed by SAE. . . 4

3.1 VA requirements in different countries in decimal notation according to the European Council of Optometry and Optics (ECOO) and International Council of Ophthalmology (ICO). ...... 9

4.1 Bitrates (Mbit/s) required at different resolutions to produce a similar level of compression. ...... 13
4.2 Video stream settings used in driving test. ...... 15
4.3 Times recorded for each test setting. ...... 16

5.1 Remote controlled car hardware specifications. ...... 21
5.2 Scenario settings. ...... 35
5.3 Scenario order for test subject 1-5. ...... 35
5.4 Univariate Type III Repeated-Measures ANOVA Assuming Sphericity. Data from scenario A, B and C. ...... 36
5.5 Mauchly Tests for Sphericity and Greenhouse-Geisser correction for departure from sphericity. ...... 36
5.6 Univariate Type III Repeated-Measures ANOVA Assuming Sphericity. Data from scenario B, C, D and E. ...... 37

A.1 Description of videos shown in user tests...... 54

1 Introduction

A future with autonomous vehicles occupying public roads is quickly approaching. But it seems unlikely that such systems could replace human operators in every possible traffic scenario. Teleoperation of these vehicles allows a human operator to temporarily take control of the vehicle from a distance, while having a full overview of the situation from the driving station.

1.1 Motivation

Several aspects can affect operator performance negatively in a teleoperation setting. Large or variable time delay (i.e. latency) and reduced video quality are two examples of possible issues [1]. The teleoperation system developed by Voysys lowers the resolution and/or increases the compression of the video stream in case of reduced bandwidth, to maintain low and stable latency throughout the system. The available bandwidth is generally high in Sweden, with 84.44% of the land area having at least 10 Mbit/s (via HSPA or LTE) and 12.53% of the area having at least 30 Mbit/s (via LTE) [2], but this is no guarantee that this level of service is always available. If this level of bandwidth is not available, the lower available bandwidth and thereby reduced video quality could pose serious safety concerns in teleoperation of vehicles in traffic.

Advanced Driver Assistance Systems (ADAS) can provide increased safety and added convenience in operating a vehicle. During teleoperation, it could be beneficial to have ADAS similar to what is available in modern cars. However, teleoperation differs from driving while in the vehicle: instead of a windshield and instrument cluster, there are screens or widescreen projections. This difference means that new or enhanced types of ADAS will be possible; e.g. it is expected that AR-type ADAS will be very useful for teleoperation. One example of a possible area of improvement over technology common in cars is Heads-Up Displays (HUD). HUDs are used to present vital information to the driver, available at a glance in the lower part of the windshield. In a teleoperation setting, the physical constraints of projecting onto a windshield are removed; the whole screen can be utilized to present information to the driver.

This brings a set of new problems and questions to be asked: could the reduction in video quality cause the operator to lose situational awareness and thereby lose the ability to handle the vehicle safely? Are there ADAS available in cars today, or novel systems, which could be

implemented to aid the driver during those situations? It is questions like these that will be further studied in this thesis.

1.2 Aim

The aim of this thesis is to investigate how a driver's visual perception through the screen of the driver station is affected when the quality of video transmitted from the teleoperated vehicle is degraded; whether driver performance is affected by it; and to propose and evaluate vision-based driver assistance systems that could guide the driver during such conditions. The process to investigate these questions will be divided into two parts. The first part will investigate whether reduced video quality changes how much detail can be perceived in streamed video content, comparing the results to standards of visual acuity required for driver licenses, and test whether it has any impact on driver performance in teleoperation. The second part will propose two vision-based driver assistance systems and test whether these systems are useful in a teleoperation setting.

1.3 Research questions

These are the questions this thesis aims to investigate and answer.

1. In a teleoperation setting, how does degradation of video quality (due to reduced resolution and increased compression) affect the level of perceivable detail, and what effect does it have on driver performance?

2. In a teleoperation setting, how well can impairments to a driver’s visual perception ability, caused by degraded video quality, be compensated for by vision-based driver assistance systems in the form of graphical elements superimposed onto the driver’s view?

1.4 Delimitations

This thesis aims to investigate whether the proposed vision-based driver assistance systems can help during less-than-excellent or poor visual feedback conditions. To accurately assess the systems, they would ideally be tested in the same environment as they would operate in. In the case of this thesis, that would mean integration into a street-legal teleoperated vehicle. The integrated systems could then be tested in different traffic scenarios in an enclosed area. The company at which the thesis was conducted (Voysys AB) is primarily oriented toward video streaming software and hardware, providing their customers with the tools and expertise needed for creating their own teleoperated systems. To test and demonstrate their product they have built a 1:5 scale radio-controlled (RC) car for teleoperation. The RC-car is usually driven on sidewalks in the proximity of the office location in central Norrköping, on an RC-car race track, or in a large indoor area; it cannot be driven on public roads. In other words, no designated testing grounds with the appropriate facilities needed for evaluating the proposed systems were available. As a result, the tests described in this thesis report focus primarily on testing visual perception rather than the driving task as a whole. All implemented systems will be tailored to the software and hardware made available to us, without consideration of cross-compatibility with similar systems (if any such exist).

2 Background

This chapter will provide an insight into how teleoperation can potentially bridge certain gaps in the process of seeing autonomous vehicles on future public roads, and how vision-based driver assistance systems can complement teleoperation.

2.1 Autonomous Vehicles

In the transportation sector, where human behavior is the cause of more than 90% of traffic accidents, the incentive to integrate automation systems is evident [3]. It is predicted that autonomous vehicles will have a considerable impact on the future transportation system. Not just traffic accidents, but emissions, congestion, and energy consumption are believed to be significantly reduced as well [4]. Furthermore, autonomous vehicles are believed to reduce car ownership through car-sharing programs and enable the elderly and disabled to be more mobile [5].

As of today, there are already several autonomous cars that have driven millions of kilometers on public freeways and streets [6]. Technology-focused companies Waymo (Google) [7] and Uber [8] are two actors at the forefront of the field. However, the technology is still at an early stage and has not yet been established enough to reach a human-out-of-the-loop level, that is, a point where no human interaction or supervision is required. Poor weather and lighting conditions are some aspects that can cause problems when perceiving the vehicle's surroundings and need further research [9].

Naturally, most car manufacturers are also looking into the automation of vehicles: BMW, Mercedes, Volvo, and Tesla are just a few. Even though full car automation is not yet commercially available, these car manufacturers provide autonomous features of varying degrees in their high-end cars [9]. These features can range from semi-automatic parking (controlling only steering) to autonomous freeway driving and lane changes. Tesla is even equipping its new Model S cars with hardware for full automation, which will be made available in the future by updating the software [10].

The Society of Automotive Engineers (SAE) proposes a standard taxonomy defining the different levels of vehicle automation [11]. One of the reasons why this standardization was proposed was to clarify the role of the human driver when an automated driving system is active. Table 2.1 illustrates how the different aspects of driving a vehicle are either performed by an automated system or a human driver. DDT (dynamic driving task) denotes all the real-time functions required to operate the vehicle in on-road traffic, where OEDR (object and event detection and response) is a collective term for monitoring the environment, preparing a response to objects and events in the environment, and performing an action accordingly [11]. The operational design domain (ODD) describes the domain in which the automated driving system is designed to function, defined by e.g. geographic areas, roads or environmental limitations. What can be concluded from Table 2.1 is that up until levels four and five, the automated driving system expects a driver to control all or some aspects of the DDT and to act as the ultimate system fallback. While there are automated driving systems operating at the fourth and fifth levels of automation, it is likely to be several years before these systems are fully operational. However, teleoperation will allow the fallback driver to operate from outside the vehicle, possibly supervising several vehicles at once, and thus (in a way) still utilizing the driverless aspect of autonomous vehicles.

Table 2.1: The different levels of automation in on-road motor vehicles proposed by SAE [11].

Despite recent advancements in automation technologies, automation systems capable of carrying out the entire driving task will not emerge overnight. The need for human supervision and intervention if automation systems fail will create a demand for backup systems. This creates a demand for new types of human-in-the-loop systems which do not require the physical presence of a human. This is where teleoperation systems are believed to be a vital complement to automated vehicles.

2.2 Teleoperation

Teleoperation, as a form of human-machine interaction, has emerged as an important means of detaching an operator from the environment in which a machine operates. This might be necessary to guarantee the operator's health, or in other cases simply a more efficient way of operating the machine. The first teleoperation systems are believed to have emerged in the mid-1940s for handling radioactive material through a slave arm mechanically coupled with a master arm, i.e. reflecting forces applied on the master [12].

An important task of a teleoperation system is to reflect the environment on the remote side such that it allows the operator to maneuver the remote machine accurately [13]. In a vehicular context this would mean providing the driver with enough visual feedback (through cameras mounted on the vehicle), at a rate that provides the operator with current information about the environment, allowing the driver to perform relevant actions.

Teleoperation is becoming a common complement to autonomous vehicles, allowing remote operators to take over when autonomous systems fail or encounter situations they cannot maneuver safely. During teleoperation, the user interface of the remote operation station replaces the glass of the regular vehicle cockpit. The user interface and visualizations used for teleoperation must be well-designed so as not to distract, compromise the driver's situational awareness or induce a higher mental workload [14].

2.3 Advanced Driver Assistance Systems

Modern cars and trucks are equipped with multiple advanced driver assistance systems (ADAS) which can increase road safety in terms of crash avoidance and crash severity mitigation, as well as provide driver convenience [15]. Although many cars today are equipped with an array of ADAS, only a handful of systems have been systematically proven to increase road safety. The development and implementation of new ADAS is market-driven, and there do not currently exist any standards to evaluate the road safety effects of new technologies.

Augmented reality solutions for cars today are normally limited to small-area HUDs located at the lower part of the windshield. There are previous studies aimed at creating ADAS using HUDs covering a larger part of the windshield to inform or warn the driver of upcoming dangers [16].

2.4 Pilot study: ADAS Categorization

Previous reviews of the ADAS field have been made, presenting a wide view of the state of the field and its possible future developments [17], a historic view of vehicle technology and human factors [18], and a focus on modern vision- and sensor-based detection technologies [19], but none of them present any categorization of systems which separates out systems suitable for implementation in a teleoperation context.

A pilot study was therefore conducted, where a categorization was defined to identify which kinds of systems are most likely to be successfully ported from a vehicle to a teleoperation interface. The categorization was partly inspired by previous works [20], but modified to fit a wider spectrum of different technologies, not only systems aimed toward safety. The ADAS available in today's car industry were documented partly by reading specification sheets from major car manufacturers and subcontractors [21], and partly by reading existing reviews on the subject [15].


ADAS currently available in cars and trucks can be divided into two main categories: active systems, with the two subcategories supportive and intervening; and passive systems, with the two subcategories informative and warning. Active systems are capable of controlling the vehicle itself to some degree, either by enhancing or reducing the vehicle's response to driver inputs or by taking complete control of the vehicle. Passive systems are used to present information or warnings to the driver, to increase safety, situational awareness, ease of operation and convenience. All systems investigated are listed in full in Appendix A. From the knowledge gained in the pilot study, it was concluded that this thesis would focus on systems that convey information for the driver to take action upon, not systems controlling the vehicle itself.

2.5 Voysys Software Suite for Teleoperation

Voysys is a company that has recognized the growing need for teleoperation solutions in the area of autonomous vehicles. Voysys has several highly customizable software products which allow for low-latency video streaming between different devices. Most products share the prefix Oden in their names. Common among these editors is the use of a scene graph where video elements and virtual cameras can be placed, to customize the viewpoint to be sent from the streamer and what will be shown on the screen for a driver. The editors are modular and depend on plugins for most of their functionality. Plugin entities placed in the scene graph can be moved and manipulated, or attached to other entities, to produce the desired viewpoint from all available cameras. Many plugins are delivered with the products, but new plugins can be built using Voysys' proprietary plugin API to customize functionality further.

The editors have settings to regulate streamed video quality, and automatic bandwidth control to keep latency low even when available bandwidth is reduced. Which resolution to use in a certain bandwidth range is set manually, and can then be regulated automatically by the streaming computer. If the bandwidth drops below a set threshold, the video quality is reduced so as not to exceed the available bandwidth and avoid additional delays in the video stream.
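As a minimal sketch of the kind of threshold-based adaptation described above, the Python snippet below steps down the stream quality when the measured bandwidth falls below a configured limit. The thresholds, resolutions and bitrates are invented examples (loosely based on the values in Tables 4.1 and 4.2) and do not describe the actual Oden implementation, where these steps are user-configurable settings.

# Minimal sketch of threshold-based stream-quality adaptation.
# The thresholds and quality steps below are hypothetical examples only.
QUALITY_STEPS = [
    # (minimum available bandwidth in Mbit/s, resolution, target bitrate in Mbit/s)
    (10.0, (2400, 2048), 14.0),   # high quality
    (3.0,  (1200, 1024), 2.5),    # medium quality
    (0.0,  (600, 512),   1.4),    # low-quality fallback
]

def select_quality(available_mbit_per_s):
    """Return the highest quality step whose bandwidth requirement is met."""
    for min_bw, resolution, bitrate in QUALITY_STEPS:
        if available_mbit_per_s >= min_bw:
            return resolution, bitrate
    return QUALITY_STEPS[-1][1:]

# Example: a drop to 4 Mbit/s selects the medium-quality step.
print(select_quality(4.0))   # ((1200, 1024), 2.5)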

3 Theory

This chapter will describe human visual function and perception and their role in driving, as well as lossy video compression and the different artifacts it may introduce in video.

3.1 Visual Acuity

In a teleoperation setting, the operator's main source of information often comes in the form of visual representations of the environment, e.g. a camera feed projected onto some kind of display. When it comes to driving a motor vehicle, such as a car, some claim that the information used for the driving task is 90% visual (while others are skeptical of that claim [22]). Regardless of the exact percentage of information being visual, the consensus is that vision plays a predominant role in the driving task. The human visual system and its role in driving is an enormous field of research, which would be far beyond the scope of this thesis to summarize. The following section will, therefore, focus on visual acuity and its role in driving.

Researchers tend to differentiate between driving safety and performance when assessing visual function and its impact on the driving task; safety is defined in terms of vehicle collision involvement, while performance is defined in terms of the behavior of the driver (e.g. speed regulation, object detection, lane positioning, etc.) [23].

3.1.1 Definition

Visual acuity (VA) is most commonly tested when evaluating visual function in conjunction with driver's license applications [23] [24]. VA can be described as the ability to resolve detail; in other words, the spatial resolution of the visual system. The visual system's spatial resolution is determined by the minimum angle of resolution (MAR), which is the angle spanned at the nodal point of the eye by two points or shapes which can just be identified as separate [25].

3.1.2 Testing Visual Acuity

Testing VA has traditionally been done using the Snellen chart, which was designed by the Dutch ophthalmologist Hermann Snellen (see Figure 3.1a) [26]. The chart consists of multiple lines of black letters printed on a white background, where each line of letters decreases in size. The chart shown in Figure 3.1a is constructed for a viewing distance of 20 feet.

(a) Traditional Snellen chart [27]. (b) The ETDRS chart [28].

Figure 3.1: The ETDRS chart (b) follows a logarithmic progression of letter sizes, as well as proportional line spacing, as opposed to the Snellen chart (a), which follows a more irregular progression. Note that the images are not of equal size and scale and can therefore not be directly compared.

Snellen concluded that the reference point for the lettering size should be five minutes of arc (1' equals one sixtieth of a degree), i.e. the width of the letters should subtend a visual angle of 5', and the same for the height of the letters. The visual angle is defined as the angle between the two lines going from the eye to the extremities of an object (see Figure 3.2) [25]. Subsequently, the stroke width of the letters should be 1', which implies that the reference MAR is 1'. In Figure 3.1a the eighth line represents the reference line, where the size of the letters spans a visual angle of 5' at a distance of 20 feet. The right-hand fraction follows the form VA = d/d_s, where d is the viewing distance and d_s the distance at which the letters subtend the visual angle 5' [26]. It follows that the lines up to the eighth represent longer distances at which the letters span a 5' visual angle, thus less-than-normal VA, and the ninth line and onward represent more-than-normal VA. In other words, if a person can at most read the letters on the sixth line (20/30), it translates to a "normally-sighted person" being able to read it from 30 feet.

Figure 3.2: Visual angle V - the angle spanned by the extremities A, B, of an object [29].
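As a worked example of the geometry in Figure 3.2 (the millimetre values are rounded, and 20 feet is taken as 6096 mm), the letter height h that subtends a visual angle V = 5' at viewing distance d is

h = 2 d \tan\left(\frac{V}{2}\right) = 2 \cdot 6096\,\mathrm{mm} \cdot \tan(2.5') \approx 8.9\,\mathrm{mm},

and the corresponding 1' stroke width is about 6096 mm · tan(1') ≈ 1.8 mm.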


Apart from the so-called Snellen fraction, VA is quantified in two different forms: the multiplicative inverse of the MAR, i.e. 1/MAR, referred to as the decimal form, and the base-ten logarithm of MAR, log10(MAR) [25]. The latter is used in conjunction with a newer type of VA chart - the ETDRS (Early Treatment of Diabetic Retinopathy Study) chart, or the logMAR chart. The logMAR chart has been established as the new standard for VA testing, improving on some of the drawbacks of the Snellen chart: the viewing distance can be varied, each line contains the same number of letters, the progression of lines is logarithmic, as opposed to the irregular progression in the Snellen chart, and the spacing between lines is proportionally equal [30]. Additionally, a new set of sans serif letters consisting of C, D, H, K, N, O, R, S, V, and Z is used. Proposed by Louise Sloan in 1959 [31], this set of letters has since become the standard for logMAR visual acuity charts [32]. VA using the ETDRS chart is determined by the last line where three out of five letters can be identified.
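To make the relationship between these notations concrete, the small Python sketch below converts a Snellen fraction to decimal VA and logMAR, and applies the three-out-of-five scoring rule described above. It is an illustration of the definitions only, not code used in this thesis, and the function names are our own.

import math

def snellen_to_decimal(d, ds):
    """Decimal VA from a Snellen fraction d/d_s, e.g. 20/30."""
    return d / ds

def decimal_to_logmar(decimal_va):
    """logMAR = log10(MAR), where MAR = 1 / decimal VA."""
    return math.log10(1.0 / decimal_va)

def etdrs_score(correct_per_line, logmar_per_line):
    """Return the logMAR of the last line where at least 3 of 5 letters were read."""
    score = None
    for correct, logmar in zip(correct_per_line, logmar_per_line):
        if correct >= 3:
            score = logmar
        else:
            break
    return score

# Example: 20/30 vision corresponds to decimal 0.67 and logMAR ~0.18.
decimal = snellen_to_decimal(20, 30)
print(round(decimal, 2), round(decimal_to_logmar(decimal), 2))

# Example chart reading: lines at 0.3, 0.2 and 0.1 logMAR with 5, 4 and 2 letters correct.
print(etdrs_score([5, 4, 2], [0.3, 0.2, 0.1]))   # 0.2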

3.1.3 Visual Acuity in Driving

As mentioned previously in 3.1.1, VA is the most common visual function tested in conjunction with driver's license applications; but what is considered an acceptable VA for driving? The answer differs depending on where in the world a license is issued. Table 3.1 shows the binocular (i.e. measured with both eyes) VA requirements in different countries. The presented data is a selection from two different reports by the International Council of Ophthalmology (ICO) [24] and the European Council of Optometry and Optics (ECOO), and is not presented in its entirety. However, the table illustrates how VA requirements can vary not only on the same continent but even within the same country.

Country                   Binocular VA              Source
Algeria                   0.8                       ICO
Canada                    0.4                       ICO
Czech Republic            0.7                       ECOO
France                    0.5                       ECOO
Germany                   0.5                       ECOO
Greece                    Sum of both eyes ≥ 1.0    ECOO
India                     0.33                      ICO
Japan                     0.7                       ICO
Mexico                    0.8                       ICO
The Netherlands           0.5                       ECOO
Spain                     0.5                       ECOO
Sweden                    0.5                       ECOO
Switzerland               0.63                      ECOO
Turkey                    1.0                       ECOO
The United Kingdom        0.5                       ECOO
The U.S. (California)     0.5                       ICO
The U.S. (New Jersey)     0.4                       ICO

Table 3.1: VA requirements in different countries in decimal notation according to the European Council of Optometry and Optics (ECOO) [33] and International Council of Ophthalmology (ICO) [24].

Since VA has become the universal test for driving fitness, one might draw the conclusion that VA is a reliable predictor of driving safety. However, there have been inconsistent findings in research during the last 50 years. One of the first (and often discussed) extensive studies is the one by Burg, presented in two consecutive reports in 1967 and 1968, where he carried out different visual performance tests on 17,500 drivers licensed in the state of California [34], [35]. He found that there is no significant relationship between VA and vehicle collision rates in young and middle-aged drivers. He did, however, find a weak correlation in the older age group. Subsequent research has reported both similar and conflicting results, where no significant relationship between collision involvement and VA has been found [23]. Given the lack of consistency in research results, and considering that only a weak correlation can be presented at best, VA cannot be considered an effective predictor of driver safety [36], [37].

Although VA and driver safety cannot be significantly associated, there is a relatively stronger correlation between VA and driver performance. In an article from 1995, Szlyk et al. presented a study on the effect of visual acuity deterioration, due to age-related macular degeneration (AMD), on driving skills [38]. The test subjects were compared to cohort subjects with normal vision in an interactive driving simulator and an on-road driving test. The results showed that the test subjects tended to drive too slowly and had difficulties maintaining proper lane position (including crossing lane boundaries) in the driving simulator test as well as the on-road test. Furthermore, the test subjects took a longer time to brake at stop signs and had more accidents in the driving simulator test.

In 1998, Higgins, Wood and Tait tested the effects of visual impairment, in terms of reduced VA, on different aspects of the driving task [39]. The tests were carried out in a closed-road environment where participants were fitted with blur-inducing goggles, degrading VA to similar levels for all, and tasked with completing a test course. Deteriorating VA correlated with a significant decline in road sign recognition and road hazard avoidance. Additionally, the time to finish the test course saw a significant increase as VA was degraded. However, the ability to navigate through gaps in a slalom course did not seem to be affected to a great extent. In a subsequent 2005 report by Higgins and Wood, the previous results were confirmed, finding a linear relationship between sign recognition, hazard avoidance, driving time, and degraded VA [40].

3.2 Focal and Ambient Vision

In section 3.1.3, we saw that when it comes to vehicle driving safety there is no significant correlation with VA. However, we saw that VA shows a stronger association with driving performance in terms of sign recognition, hazard avoidance, driving times, etc. The fact that some aspects of driving are impacted by reduced VA, while others are not, can be attributed to the notion that visual processing can be separated into two parallel systems: the focal and ambient subsystems [41]. The role of ambient vision can be described in terms of being concerned with the question "Where am I?" - the space around the body, while focal vision is concerned with the question "What is it?" - small details and objects.

Focal vision acts on the center of the visual field, where VA is the highest, and is characterized by being capable of resolving fine spatial details while being insensitive to high temporal variations (motion/flicker) and low contrast/low spatial frequency information [41]. Ambient vision is active in the peripheral visual field, characterized by being susceptible to motion/flicker and low contrast/low spatial frequency information while being insensitive to fine spatial details.

In 1978, Donges proposed a two-level model of driver steering [42], which has direct parallels to the notion of ambient-focal vision [41]. The two levels consist of (1) the guidance level, where central vision is used for gathering information about the road "far" ahead and preparing future steering adjustments, and (2) the stabilization level, where steering is adjusted, relative to how the current path deviates from the desired path, through peripheral vision of the "near" road ahead. In conjunction with the ambient-focal approach, the two driver steering levels can be expressed in terms of focal/far and ambient/near visual processes [41].


3.3 Situation Awareness

Gugerty defined situation awareness in driving as: "The updated, meaningful knowledge of an unpredictably-changing, multifaceted situation that operators use to guide choice and action when engaged in real-time multitasking" [43]. He further argues there are three levels of cognitive processing involved in maintaining situation awareness while driving:

• Automatic, pre-attentive processes that occur unconsciously.

• Recognition-primed decision processes that are active for brief moments of time.

• Conscious processes that place heavy demands on cognitive resources.

3.4 Video Quality & Compression

Video quality is typically a measure of degradation compared to a reference video. Resolution is directly correlated with the minimum level of detail which can be reproduced in video. Quality can be measured both objectively, by measuring the source material itself, and by subjective quality assessment. Digital video recorded in RAW format saves all pixel values as captured by the sensor. Compression of the video is often needed to reduce the amount of space required for storage or, in the case of video streaming, to adapt the bitrate so as not to exceed available bandwidth limits.

Advanced Video Coding (AVC), also referred to as H.264, is the encoding standard used for all video streaming in this project [44]. An H.264 video is basically a sequence of coded pictures (or frames). A picture is segmented into slices, which are collections of macroblocks (16x16 pixels). The size and shape of slices can be modified by a group map, assigning specific macroblocks to separate groups. Each slice is self-contained, meaning it is coded and decoded without the use of data from other slices.
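As a rough illustration of what macroblocks and bitrates mean for the streamed video, the sketch below computes the macroblock grid of a frame and the average bit budget per macroblock at a given bitrate. The 30 fps frame rate is assumed purely for the sake of the example, and the function names are our own.

import math

def macroblock_grid(width, height, block=16):
    """Number of 16x16 macroblocks needed to cover a frame."""
    return math.ceil(width / block) * math.ceil(height / block)

def bits_per_macroblock(bitrate_mbit, width, height, fps=30):
    """Average bit budget per macroblock per frame at a given bitrate."""
    bits_per_frame = bitrate_mbit * 1e6 / fps
    return bits_per_frame / macroblock_grid(width, height)

# Example: a 1200x1024 stream at 2.5 Mbit/s and 30 fps leaves roughly
# 17 bits per 16x16 macroblock per frame on average.
print(macroblock_grid(1200, 1024))                    # 4800
print(round(bits_per_macroblock(2.5, 1200, 1024)))    # 17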

3.4.1 Compression Artifacts

Lossy compression of video results in visually perceptible artifacts, increasing in intensity with the level of compression, because a larger part of the image consists of purely predicted values. Compression artifacts can generally be separated into two categories, depending on when they are visible:

• Spatial artifacts, which are location-based, such as blocking, blurring and ringing.

• Temporal artifacts, sequence-based errors visible only when the video is playing, like flickering and floating.

Previous experiments have shown that blurring, blocking, flickering and color distortion are the most easily identifiable artifacts for amateur viewers [45]. Blocking refers to the visible edges between transform block boundaries, often the most easily discernible artifact, and is the most widely studied artifact in block-based video coding [46]. It occurs (partly) because edge pixels of a macroblock are predicted with less accuracy than interior pixels. A de-blocking filter that decreases this effect is part of H.264 coding and is applied, if the absolute difference between samples near block edges is "relatively high", to smooth out transitions between blocks, but it can add some additional blurring to the image. Blurring, the loss of spatial detail on sharp edges, is also a consequence of the block-based predictions.

4 Video Quality & Vision in Teleoperation

In a teleoperation setting, the visual feedback of the system will be impaired, e.g. by reduced video resolution and high compression levels due to bandwidth limitations. This chapter aims to evaluate how visual functions and perception are impacted under such conditions. Furthermore, the results from the evaluation will form the grounds for the vision-based aid systems proposed later in Chapter 5.

4.1 Testing Visual Acuity in Streamed Video

To estimate how VA changes with different levels of video resolution and compression, a test was performed where a logMAR visual acuity chart was placed in front of a video of traffic captured by a camera on the test car. The chart was scaled to real-life size and placed 4 meters in front of the scene graph camera, as it would be if used for a regular VA test. A white background is normally used in VA tests with characters for maximum contrast, but it was removed here to ensure the letters would compress at a similar level as the background video. For simplicity, only spatial artifacts visible when the video is paused are accounted for in our test.

The video was streamed between two computers in H.264 format, using twelve different combinations of video resolution and approximate level of compression. Table 4.1 shows video bitrates (Mbit/s) which result in a similar level of visible compression artifacts, such as blocking and blurring, for the three resolutions. These numbers were gathered in a separate experiment where bitrates were lowered gradually for each resolution. The bitrate required for there to be no discernible artifacts was set to 30 Mbit/s for all resolution levels, and is omitted from the table. The grade of compression denoted "noticeable" is when compression artifacts are easily noticeable by the eye in large parts of the image, and is a subjective measure made by the authors, as exact control of the compression level was not possible in this experiment. The compression level denoted "medium" is the mean of the bitrates for "noticeable" and "maximum" compression. The "maximum" level of compression was achieved by lowering bitrates until the video could not be compressed further at the current resolution.

Results from the test can be seen in Figures 4.2 to 4.4, which show a logMAR VA chart included in video streams of differing resolutions and bitrates. The VA scores for each of the settings were determined by examining the smallest discernible line of letters in the charts shown in the images, according to the VA measurement standards described in section 3.1; see Figure 4.1. These scores enable control of VA in the tests presented in the following parts of this thesis.


                         Grade of compression
Resolution               Noticeable    Medium    Max
High (2400x2048)         7.0           4.2       1.4
Medium (1200x1024)       4.0           2.3       0.7
Low (600x512)            1.5           1.0       0.5

Table 4.1: Bitrates (Mbit/s) required at different resolutions to produce a similar level of compression. E.g. the level of visible compression artifacts using a bitrate of 7.0 Mbit/s at High resolution is comparable to using a bitrate of 4.0 Mbit/s at Medium resolution.
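Note that the per-pixel bit budget needed for a comparable level of visible artifacts is not constant across resolutions, which the sketch below makes explicit for the "noticeable" column of Table 4.1. The 30 fps frame rate is an assumption made only for this comparison.

# Bit budget per pixel and frame for the "noticeable" compression column in
# Table 4.1, assuming 30 fps (an assumption made only for this illustration).
settings = {
    "High (2400x2048)":   (2400, 2048, 7.0),
    "Medium (1200x1024)": (1200, 1024, 4.0),
    "Low (600x512)":      (600, 512, 1.5),
}

for name, (w, h, mbit) in settings.items():
    bits_per_pixel = mbit * 1e6 / (30 * w * h)
    print(f"{name}: {bits_per_pixel:.3f} bits/pixel")

# High:   0.047 bits/pixel
# Medium: 0.109 bits/pixel
# Low:    0.163 bits/pixel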

Figure 4.1: Visual acuity measures (logMAR, lower is better) for each combination of settings; the value is determined by the last line where at least three out of five letters are discernible. Note that the compression levels correspond to different bitrates for each resolution.


(a) Reference image. No compression. (b) Noticeable compression.

(c) Medium compression. (d) Maximum compression.

Figure 4.2: High resolution with different levels of compression.

4.2 Lane Keeping Test

Three participants took part in a driver user test, controlling an RC-car using the Voysys teleoperation system, i.e. using a wheel and pedals to control the car and watching the video stream projected on a curved wall. The purpose of the test was to examine whether reduced video quality affects driver performance, measured by track completion time and number of lane excursions.

A test track was marked with dashed lines on an asphalt walkway using white street crayons. Line segments, 30 centimeters long and five centimeters wide, were drawn 60 centimeters apart. The track consisted of one right turn, one left turn and a straight in between. The width of the track was 80 centimeters, which gave 15 centimeters of margin to the track limits from either side of the car, and the length was approximately 50 meters. Part of the track and the car can be seen in Figure 4.5.

The test subjects drove three laps on the track, each lap with different video quality settings equivalent to a specific logMAR VA score. A lap was defined as the track driven in both directions, with a total length of approximately 100 meters. The settings used are found in Table 4.2. The levels of video quality were chosen to represent full vision, the lower limit of required vision for driving (0.3 logMAR/0.5 decimal in Sweden) and one additional setting below, at 0.6 logMAR. Lap times were recorded as the sum of both runs, and a lane excursion was noted when the car's wheels crossed the track boundary. The participants did not have any previous training before starting the test, so learning effects are expected between runs.


(a) No compression. (b) Noticeable compression.

(c) Medium compression. (d) Maximum compression.

Figure 4.3: Medium resolution with different levels of compression.

Test Case    Resolution    Bitrate (Mbit/s)    Visual Acuity (logMAR)
A            High          14                  0.0
B            Medium        2.5                 0.3
C            Low           1.4                 0.6

Table 4.2: Video stream settings used in driving test.

Data collected from the driving test is shown in Table 4.3.

4.3 Discussion

This section interprets the results from the visual acuity and vehicle guidance tests, placing them in the context of previous research on vision and driving.


(a) No compression. (b) Noticeable compression.

(c) Medium compression. (d) Maximum compression.

Figure 4.4: Low resolution with different levels of compression.

                    Test Case A          Test Case B          Test Case C
Subject    Order    Time      Run-offs   Time      Run-offs   Time      Run-offs
1          ABC      112.23    2          108.03    3          108.54    4
2          BCA      107.88    1          123.82    3          119.62    4
3          CAB      108.41    0          109.37    0          120.42    1
Average             109.50    1          113.74    2          116.19    3

Table 4.3: Times recorded for each test setting.

4.3.1 Video Quality Effects on Visual Acuity

In Figure 4.1, VA decreases linearly as the compression level increases, up to a certain point toward the maximum level of compression where quality drops off more quickly. Comparing these measures to VA requirements for driving, only five of these settings are below 0.3 logMAR (0.5 decimal), and these are thus the only settings retaining a "legal" level of VA in Sweden. Two compression artifacts easily discernible in these sample images are blocking and blurring, which at higher levels of compression remove most details of any surface.


Figure 4.5: Car lined up to start the test course.

In section 3.1, VA was defined as the smallest visual angle the visual system can resolve. Since the ability to resolve detail is highly subjective between individuals, the VA scores shown in Figure 4.1 can be thought of as the best achievable VA through the teleoperation system at each setting. This notion was based on the assumption of the vehicle operator having perfect sight up until the displays of the driver station. VA scores are thus based purely on pixel representations of the letters being distorted beyond recognition. Such distinctions may also be subjective to the observer, but were in this context considered less significant.

As described in section 4.1, the white background of the logMAR VA chart was removed to ensure that the letters were compressed at similar levels as the camera image. However, different areas in the camera image differ in terms of brightness, color and detail. In e.g. Figure 4.4, the sky is bright and low in detail while the sidewalk is darker and contains a squared pattern. Letters that are placed over a bright area create high-contrast edges that are compressed less by the H.264 codec. This effect can be observed in Figure 4.4d, where the top-line letters placed over a bright area are discernible (H, Z, V), while the others are not. Thus, inconsistencies in the background of the VA chart might have given a recognition bias towards letters in brighter areas. Furthermore, reading of a VA chart should be done in high-contrast conditions to eliminate the results being affected by contrast sensitivity (i.e. how much contrast a person requires to see a target [47]).

The approximations of VA made from the streamed video test do not fully represent how much detail is discernible for an observer watching a video captured by the camera mounted on the car. The logMAR VA chart was imported into the scene as a digital image, rather than being captured by a physical camera. Since the image quality also depends on the camera characteristics and its settings at image capture, the results gathered do not reflect the camera's impact on VA.

The choice of a measure similar to a normal VA test may seem odd, considering there already exist many methods of measuring video quality objectively. After reviewing other methods, there seems to be a bias towards methods that provide a measure of viewer Quality of Experience in video entertainment. These kinds of measures were deemed not very useful, since we are only concerned with measuring the level of detail perceivable in a video to make it comparable to measures of VA. They could, however, be used in conjunction with VA measures to create a conversion table between these different units of measurement.

4.3.2 Lane Keeping Test

Comments made by the participants include difficulty discerning track boundary markings when video quality was reduced, as the surface details blended together. They also expressed that the lack of detail made it difficult to navigate, due to reference points in the surroundings disappearing.

The number of lane excursions increased when video quality was lowered. Furthermore, all cases where video quality was lowered from one run to the next (A to B, or B to C) saw an increase, while when video quality was increased (C to A) lane excursions decreased. The results suggest that lower VA did correlate with an increase in lane excursions. The results also indicate that a learning effect was at play during the tests. This is reflected in the lap times presented in Table 4.3, where times decreased between the first and second lap in all cases, and either continued to decrease or stayed practically the same between laps two and three. Average lap times show an increase as VA is degraded, which shows tendencies toward degraded VA counteracting the learning effect. However, considering the low number of test subjects, such tendencies should be taken with reservation.

Since the tests were carried out during wintertime in Sweden, where daylight hours are few, the participants were not given time to familiarize themselves with driving the car before the test started. The lack of practice before the tests most likely contributed to the learning effect observed in the results. Rain during the test day was another aspect which limited the available time for each subject, since the teleoperated RC car used for the tests was highly sensitive to moisture. Rain or accumulated water on the ground meant the tests needed to be put off. This also called for another approach to marking the track. Originally the plan was to mark track lines with white duct tape, which had been tested in a previous pilot study. In the pilot study it was concluded that white duct tape resembled real road markings well, with a bright, matte color. However, moisture from rain prevented the tape from sticking to the ground on test day. This led to street crayons being used instead, since they were more resistant to wet surfaces. The drawback of using crayons was the fact that they did not resemble road markings as well as the tape. Figure 4.5 shows how the lane markings are faded and do not have a solid surface as they would on a real road. This reduces their brightness and contrast against the asphalt, thus making them less visible to the driver. This could, however, emulate a road environment where the markings have been worn down during years of exposure to the elements.

The maximum speed of the car was set relatively low, which was demonstrated by the fact that test subjects seldom had to regulate their speed in turns. The car could stay on track with full throttle input as long as the steering input was correct. A higher top speed, or tighter turns, could have given greater variations in lap times and lane excursions between test cases. However, since the aim of the test was to investigate how steering and lane-keeping abilities were affected by degraded VA, less emphasis on speed regulation could mean isolation of the steering task to a greater extent. Such constant-speed steering tests have performed well in previous research [48]. Note that whether such isolation was present in the lane-keeping test is purely speculative and cannot be resolved from the results.


4.4 Analysis of the Test Results

The purpose of the tests described in this chapter was to determine potential visual impairments, and their effect on a driver of an on-road teleoperated vehicle, due to degraded video quality. The findings would determine whether there is a motivation for implementing vision-based driver assistance systems that aid under such conditions.

In the first part of this chapter, VA was determined for different combinations of compression and resolution. The results allowed VA to be gradually degraded in the subsequent lane-keeping driver test. In the lane-keeping test, we saw that both average lap times and lane excursions increased as VA was degraded, despite learning effects being observed in the results, and with reservations for the low number of test subjects.

Section 3.1.3 described how previous research has failed to find a significant correlation between driver safety and VA [23]. There is, however, a significant association between VA and driver performance, where driver performance is defined in terms of speed regulation, object detection, lane positioning, etc. [38][39]. In section 3.2 the notion of two separate visual subsystems, the focal system and the ambient system, was introduced. Focal vision was described as having the role of processing information concerning what is being seen by the eyes in the central part of the visual field, where VA is the highest. An important part of driving is to detect objects in and around the path of the vehicle which might pose a hazardous situation, which is a visual process highly dependent on the focal system. Therefore, the following parts of this thesis will be under the following premise: Degraded VA during teleoperation of an on-road vehicle will impair the focal vision, and thus the driver's ability to detect potentially hazardous objects.

We saw in section 3.2 a proposed two-level model of driver steering [42], which in conjunction with focal and ambient vision is referred to as the focal/far and ambient/near visual processes. A 2005 study by Brooks et al., investigating how lane keeping and steering were affected by reductions in VA, found that the driver performance variables mean lateral speed (i.e. motion along the width of the road) and number of lane excursions both significantly increased as VA was reduced [49], which was also found in a related study from 1999 [48]. These results could be explained by the focal/far and ambient/near processes, where the former gathers information about the road ahead and the latter adjusts the current path according to the road near the vehicle [41]. The elevated number of lane excursions suggests that reduced VA impairs the ability to plan steering, which is also reflected by the elevated lateral speeds, where heavy adjustments are needed to keep the vehicle within the lane.

The results from the lane-keeping test presented in this chapter are similar to those presented by Brooks [49] in terms of an increasing number of lane excursions. Considering these analogous results, an additional premise to the one presented earlier in this section is formed: Reduced VA during teleoperation of an on-road vehicle implies impairment of the focal/far visual processes and, thus, the lane-keeping performance of the driver.

5 Vision-based Driver Assistance Systems

Up until this point in the thesis, we have seen evidence that degraded video quality diminishes driver performance in a teleoperation setting. This chapter focuses on answering the second research question: can vision-based driver assistance systems make up for the information lost? To help answer this question, this chapter describes the implementation of vision-based driver assistance systems and evaluates their effect through a user test.

5.1 System Overview

This section gives an overview of the system and of the concepts vital to the design of the ADAS proposed in this thesis (the term ADAS here refers to the systems implemented). Further implementation details are explained in section 5.2.

The proposed ADAS are implemented as plugins to the Voysys 3D graphics engine. The functionality is separated into two plugins. One is used on the teleoperated vehicle to extract information from the uncompressed video feed from the mounted cameras and to transmit it to the driver station - the inference plugin. Inference refers to the process of using a trained machine learning model to make a prediction. The visualization plugin visualizes the information it receives on the display of the driver station. The plugins can also be used in the same instance of the Oden software, using prerecorded video instead of a live video stream, to allow for local testing of the systems.

On each end of the master-slave relationship between the teleoperated vehicle and the driver station, there is a version of the Oden software running - the one on the vehicle will henceforth be referred to as the streamer and the one on the driver station as the player. Figure 5.1 shows how the streamer sends video to the player and, in turn, the player transmits inputs from the steering controllers to the streamer.

The current hardware setup available for testing and demonstrations is a remote-controlled car with front- and rear-facing cameras and an on-board computer, see Table 5.1. Oden Streamer runs on this computer and sends video to the receiving end, which can be either a computer using regular screens or a wide-angle dome projection setup. The projector setup used in this work consists of three 1080p projectors projecting a stitched image onto a curved wall (i.e. partly overlapping projections which together make up one larger projection), which allows for a viewing angle exceeding 180 degrees. The term stitched is also used to describe videos mapped to hemispherical surfaces within the Oden editors.


Figure 5.1: An overview of the system design.

Camera   FLIR BFS-U3-63S4C-C: 6.3 MP, 59.6 FPS, Sony IMX178, Color
CPU      Intel Core i5-7500T (quad-core, 2.7 GHz, up to 3.3 GHz)
GPU      GeForce® GTX 1070, 8 GB GDDR5, 256-bit

Table 5.1: Remote controlled car hardware specifications.

5.1.1 Streamer

The streamer has two different scenes: one for sending video captured from the cameras, and one used for the ADAS. The former includes a 2D video entity for the front- and back-facing cameras respectively, which are aligned such that they make up one frame (see Figure 5.2). Since the back-facing camera is displayed in a small part of the driver display, as a rear-view mirror, its resolution can be kept lower. The frame is encoded with the H.264 codec and sent over a 4G/LTE network to the player.

Figure 5.2: The front- and back camera feeds are aligned in the streamer to make up a frame to be sent to the player.

The scene used for the ADAS is made up of the following entities:

• A stitched video of the front camera, mapped to a hemispherical surface and placed in front of the first-person camera.

• A virtual camera located at the same position as the first-person camera, facing the front stitched video.

• The inference plugin.

The stitched video entity uses parameters extracted from calibrating the cameras to project the video onto a hemispherical surface. Once the video has been projected, objects and shapes are no longer warped with a fisheye effect but are displayed as they appeared when they were captured by the camera, giving the driver a natural view of the surroundings. The 3D engine then renders an image of the scene with the virtual camera, using a horizontal field of view of 90 degrees. The image is preprocessed and passed to two different image classification networks: one for road segmentation (classifying the image pixels), and one for detecting pedestrians and vehicles in the image.

When the image has passed through the segmentation network, a set of operations is applied to identify lane markings, and a curve is fitted to both the left and right markings (described in detail in section 5.2.7). The pedestrian and vehicle detection network outputs rectangular areas in the image containing either a vehicle or a pedestrian, together with a confidence value. Only areas with a confidence level above a certain threshold are considered detections. Once an object has been detected, the position and size of the detection are derived from the corner points of the area.

To create a graphical overlay to display at the driver station, a conversion from R2 image coordinates to R3 (XYZ-space) coordinates is done with predefined ray casting methods included in the plugin API; principally, the points are projected onto the hemisphere, see Figure 5.3. When detection coordinates have been converted and curves have been fitted, they are sent to the visualization plugin.

Figure 5.3
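The projection in Figure 5.3 can be thought of as shooting a ray from the virtual camera through each image point and intersecting it with the hemisphere. The snippet below is a minimal, self-contained sketch of that idea, assuming a pinhole camera at the origin looking down the negative z-axis with a 90-degree horizontal field of view, and a hemisphere of radius R centered on the camera; it is not the plugin API's ray casting implementation, only an illustration of the underlying math.

#include <cmath>

struct Vec3 { float x, y, z; };

// Map an image pixel (u, v) to a point on a hemisphere of radius `radius`
// centered at the camera. Assumes a pinhole camera at the origin looking
// down the negative z-axis with a 90-degree horizontal field of view.
// Illustrative sketch, not the Oden plugin API's ray casting.
Vec3 imagePointToHemisphere(float u, float v, int width, int height, float radius)
{
    // Normalized device coordinates in [-1, 1], y flipped so that +y is up.
    const float ndcX = 2.0f * (u + 0.5f) / static_cast<float>(width) - 1.0f;
    const float ndcY = 1.0f - 2.0f * (v + 0.5f) / static_cast<float>(height);

    // With a 90-degree horizontal FOV, tan(fov/2) = 1, so the image plane at
    // z = -1 spans [-1, 1] horizontally. Vertical extent follows the aspect ratio.
    const float aspect = static_cast<float>(height) / static_cast<float>(width);
    Vec3 dir = { ndcX, ndcY * aspect, -1.0f };

    // Normalize the ray direction and scale it to the hemisphere radius.
    const float len = std::sqrt(dir.x * dir.x + dir.y * dir.y + dir.z * dir.z);
    return { radius * dir.x / len, radius * dir.y / len, radius * dir.z / len };
}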

5.1.2 Player

The player receives video frames, as illustrated in Figure 5.2, from the streamer. The 3D engine has an auto-crop function which separates the front and back cameras in the frame. Using the same calibration parameters as the streamer, the front camera feed is placed as a stitched video entity in front of the first-person camera, and the back camera feed is placed behind it.

In parallel, the visualization plugin receives detections from the inference plugin. In the case of pedestrian and vehicle detections, rectangles are simply drawn in the scene with the size and position found by the inference network. However, lane detections are delivered in the form of curve coefficients defined in the R2 image plane. Points on the curve are therefore first sampled in the image plane before being converted to R3-space coordinates in the scene, with the same projection method as illustrated in Figure 5.3. Once the left- and right-lane curves are defined as sets of R3 point coordinates, line segments are drawn between them.


The player is also responsible for transmitting steering control signals from the steering wheel and pedals connected to the driver station.

5.1.3 Plugin Structure

A plugin needs to follow a certain structure to work with the 3D engine. The following functions and constructs must be defined:

• A State data structure - all system state variables must be stored here, since a plugin may be instantiated more than once. Global variables will be shared among all instances of the plugin.

• The Init function, which is called once the plugin is instantiated. Here data is allocated for storing the state, and plugin settings can be made.

• The Update function, where computations and drawing in the scene are executed. The Update function is called once per program iteration.

• The Shutdown function which frees the data allocated during plugin initialization.

• The GUI function is used for generating graphical user interface (GUI) elements in the Oden sidebar, where variables can be inspected and changed. The function is called each program iteration, as long as a plugin GUI is visible in the sidebar.

• The Plugin registration function which informs the 3D engine about the new plugin and registers functions to be called by the engine. The plugins are registered to the Oden engine with a universally unique identifier (UUID) [50] and a string name.

Each function is called by the engine with a pointer to a function-specific object holding the State as well as API functions. These objects follow the naming convention OdenPluginEntityXParams, where X is specific to each function. A minimal sketch of this structure is given below.
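To make the structure above concrete, the following is a minimal sketch of a plugin skeleton following the conventions described in this section. The function bodies are placeholders, the State members are illustrative, and the registration mechanism and exact signatures of the Oden plugin API are not reproduced here; only names documented in this chapter (such as entityUserData and the OdenPluginEntityXParams objects) are assumed.

// Placeholder stand-ins for the Oden plugin API parameter objects; in the real
// plugin these are provided by the engine headers.
struct OdenPluginEntityInitParams     { void * entityUserData; };
struct OdenPluginEntityUpdateParams   { void * entityUserData; };
struct OdenPluginEntityGuiParams      { void * entityUserData; };
struct OdenPluginEntityShutdownParams { void * entityUserData; };

// Per-instance plugin state; member names here are hypothetical.
struct State
{
    bool adasActive = false;  // toggled via COM-channel messages
    int  frameCount = 0;      // example of per-instance state
    // ... network objects, InferRequests, worker threads, etc.
};

// Called once when the plugin entity is instantiated.
void inference_init(OdenPluginEntityInitParams * api)
{
    api->entityUserData = new State();  // allocate per-instance state
}

// Called once per program iteration; computations and drawing go here.
void inference_update(OdenPluginEntityUpdateParams * api)
{
    State * p = reinterpret_cast<State *>(api->entityUserData);
    if (p == nullptr) {
        return;
    }
    ++p->frameCount;
}

// Called each iteration while the plugin GUI is visible in the sidebar.
void inference_gui(OdenPluginEntityGuiParams * api)
{
    // Draw immediate-mode widgets and handle their events here.
}

// Frees the data allocated during plugin initialization.
void inference_shutdown(OdenPluginEntityShutdownParams * api)
{
    delete reinterpret_cast<State *>(api->entityUserData);
    api->entityUserData = nullptr;
}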

5.2 Inference Plugin Implementation

This section describes the implementation of the inference plugin introduced in section 5.1. Since the main objective was not to develop a novel object detection algorithm, models for detection and image segmentation from the Intel Distribution of OpenVINO Toolkit [51] were used. The OpenVINO Toolkit contains a collection of pre-trained inference models and API constructs for object detection and image segmentation. Since the OpenVINO Toolkit had been used previously by developers at Voysys, compatibility with the Oden plugin API was already established. Additionally, the available object detection networks presented good performance with acceptable accuracy [52]. Two separate classification networks are run in parallel on the images generated by the virtual camera: one network for pedestrian and vehicle detection [52] (henceforth referred to as the PVD network), and another for road segmentation [53].

5.2.1 State

The State is defined as a C++ struct, with a constructor and destructor. The constructor takes as arguments the target device (e.g. CPU or GPU) and paths to each of the two classification networks. All variables used in the plugin are declared in the body and then initialized in the constructor. Accessing the State follows the convention shown in Listing 5.1. If the variable p appears in code snippets in the following parts of this section, it can be assumed to refer to the plugin State, unless otherwise specified. Likewise, if the api variable appears, it refers to the object passed by the Oden engine for accessing API functions.


void inference_update(OdenPluginEntityUpdateParams * api)
{
    State * p = reinterpret_cast<State *>(api->entityUserData);
    if (p == nullptr) {
        return;
    }
    // The function continues here...
}

Listing 5.1: The code snippet illustrates how the plugin State is accessed, here in the update function.

5.2.2 Inference Engine Initialization

The State struct contains an inference engine initialization function, which takes as input a plugin/shared library object, a network path, a target device, and a network ID. The network ID specifies the type of classification network to be initialized, i.e. pedestrian and vehicle detection or road segmentation. The function is called in the body of the constructor, once for each network. The inference engine initialization function has the following structure:

1. The path to the shared library with implementation for inference on the specified device is configured.

2. The intermediate representation (IR) model of the network is read.

3. Inputs and outputs are configured - precision and memory layout. Input precision was set to 8-bit unsigned integer values and output precision was set to 32-bit floating-point values.

4. The model is compiled and loaded to the target device - creates an executable network object.

5. An InferRequest is created from the executable network object; it is used to access the input and output buffers.

Depending on the network type, input and output are configured differently. For example, the PVD network outputs rectangular bounding boxes representing detections, the number of bounding boxes, and a description of each detection. The description consists of an image ID, a label (the ID of the predicted class), and image coordinates for the bounding box. The segmentation network outputs a four-channel image where each channel represents a class, and pixel values represent the confidence with which they belong to each class.

Furthermore, the two networks have been trained on images of different resolutions and thus require inputs of different sizes. The PVD network requires input images of 384x672 (HxW) pixels, while the segmentation network requires images of 512x896 pixels. Instead of generating two differently sized images with the virtual camera, the segmentation network was reshaped with the inference engine API. The reshape method updates the input shapes and propagates them down to the model outputs through all hidden layers. The risk of reshaping a network is that accuracy can be significantly reduced. However, after testing inference with the reshaped network, it was concluded that any potential reduction in accuracy was negligible.

To store network-specific parameters and objects for the inference engine, a data structure called InferenceObject is defined. This data structure contains network objects, an InferRequest, as well as information about the input and output formats.


Two InferenceObjects are created, called pvd and seg, for the pedestrian-vehicle detection network and the segmentation network respectively. These are stored in the plugin State for executing inference and extracting classifications in a later stage of the process. Once the inference engine has been initialized with the correct network paths, shared libraries, and input/output configurations, a boolean in State is set to true, indicating that inference can be run on the networks.
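For reference, the following is a rough sketch of what an initialization along these lines can look like with the (2020-era) OpenVINO Inference Engine C++ API, covering steps 2-5 above as well as the reshape of the segmentation network. Variable names, the NCHW input layout, and the lack of error handling are illustrative; this is not the thesis implementation.

#include <inference_engine.hpp>
#include <string>

using namespace InferenceEngine;

// Illustrative sketch of inference engine initialization (OpenVINO ~2019/2020 API).
// Returns an InferRequest ready to be filled with input data.
InferRequest initNetwork(Core & ie,
                         const std::string & xmlPath,   // IR model description
                         const std::string & device,    // e.g. "CPU"
                         bool reshapeTo512x896)         // true for the segmentation net
{
    // Read the intermediate representation (IR) of the network.
    CNNNetwork network = ie.ReadNetwork(xmlPath);

    // Optionally reshape the network input (used for the segmentation network).
    if (reshapeTo512x896) {
        ICNNNetwork::InputShapes shapes = network.getInputShapes();
        SizeVector & dims = shapes.begin()->second;   // NCHW
        dims[2] = 512;                                // height
        dims[3] = 896;                                // width
        network.reshape(shapes);
    }

    // Configure input and output precision and layout.
    for (auto & item : network.getInputsInfo()) {
        item.second->setPrecision(Precision::U8);     // 8-bit unsigned input
        item.second->setLayout(Layout::NCHW);
    }
    for (auto & item : network.getOutputsInfo()) {
        item.second->setPrecision(Precision::FP32);   // 32-bit float output
    }

    // Compile and load the model onto the target device.
    ExecutableNetwork executable = ie.LoadNetwork(network, device);

    // Create an InferRequest used to access input and output buffers.
    return executable.CreateInferRequest();
}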

5.2.3 Init

Here, the paths to the classification networks are read from the sidebar GUI text input fields and passed as input arguments to the State constructor (see Listing 5.2). The memory location of the plugin State is held by the entityUserData variable in the OdenPluginEntityInitParams object, which is passed as a pointer to the Init function. The entityUserData can then be accessed by the other plugin functions through the OdenPluginEntityXParams objects described in Section 5.1.3.

api->entityUserData = new State(hardware, networkPath, networkPath2);

Listing 5.2: The plugin State is initialized with a target device and paths to the two classification networks.

5.2.4 GUI

The GUI function is called with the OdenPluginEntityGuiParams object, which provides API functions for drawing GUI elements such as check-boxes, drop-downs, buttons, etc. The Oden engine uses a paradigm called immediate mode for GUI and drawing functions. An immediate-mode GUI is drawn every frame, and event processing is done in the same code that draws the GUI.

In the GUI function, text input widgets for the paths to the classification network files were created. Listing 5.3 shows an example of the code for creating a network path input field in the GUI sidebar. The variable api is a pointer to the OdenPluginEntityGuiParams object, giving access to Oden API functions.

char networkPath[512] = {};
{
    // read saved network path
    api->readString("network_path", networkPath, ARRAY_COUNT(networkPath));
    // if text has been written into the field
    if (api->inputText("Network Path Car", networkPath, ARRAY_COUNT(networkPath))) {
        // write input text to string
        api->writeString("network_path", networkPath);
        // inference engine should be re-initialized with new network path
        reInitPed = true;
    }
}

Listing 5.3: Code snippet for creating a text input widget in the sidebar GUI.

Furthermore, a drop-down multiple-choice widget was created for setting the target device of the inference engine, the options being CPU or GPU. However, it was discovered that using the GPU resulted in longer inference times, since the GPU is already heavily occupied by the Oden engine. As a result, the CPU was used exclusively as the target device for the inference engine. Finally, toggle controls for activating/deactivating the ADAS, along with some system calibration functionality, were created. The latter will be further explained in Section 5.2.7.


5.2.5 Update

The Oden engine runs a loop in which it updates all scene entities each iteration. This is where computations, drawing of AR elements, and sending of COM-channel messages for the ADAS are done, and it can be considered the heart of the plugin. The following segments are not necessarily executed in the order they are mentioned during the update cycle, but are explained in this order for the sake of clarity.

Download Virtual Camera Image

At the beginning of each update, the function downloadVirtualCameraImage is called. As the name indicates, the function downloads an image from the virtual camera and returns true when it is available. The download is asynchronous, meaning the virtual camera frame is downloaded on a different thread and becomes available four frames after it is shown on screen. The function must be called each iteration until the download is completed. Listing 5.4 shows how the function is called. First, an unsigned char pointer is initialized for holding the memory location of the data, since the image returned by the downloadVirtualCameraImage function is in 8-bit packed RGB format. This means that the pixel values for the three channels of each pixel are stored next to each other (interleaved) in memory, see Figure 5.4. For an RGB image with a width of e.g. 672 pixels, every set of 3*672 values represents one row in the image.

Figure 5.4: Illustration of the packed RGB format. Pixel values are stored next to each other in memory.

The first argument passed to the downloadVirtualCameraImage function is an identifying name for the virtual camera entity in the scene graph, i.e. it specifies which virtual camera to download an image from. The second argument is the pointer to the data, and the last two store the virtual camera resolution, which has been set in the GUI of the editor.

const uint8_t * downloadedCameraData = nullptr;
int width = 0;
int height = 0;
bool hasDownloadedData;

hasDownloadedData = api->downloadVirtualCameraImage("vci", &downloadedCameraData,
                                                    &width, &height);

Listing 5.4: downloadVirtualCameraImage

Get Messages From COM-channels

After the virtual camera image has been downloaded, the COM-channels are checked for new messages containing information about whether the ADAS has been activated or deactivated in the visualization plugin. A COM-channel message is a simple data structure consisting of variables of different types, defined in a separate header file. The messages are sent as UDP packets between the plugins. A COM-channel message type is identified by a char ID, preferably describing what kind of information it holds. Listing 5.5 shows how a new data message with the type ID "Toggle Message" is initialized before being assigned the incoming message. The toggle data message consists of two booleans, one for each ADAS: true if the ADAS has been activated, false if it has been deactivated. These values are copied and saved in the plugin State.
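As an illustration, a toggle message of the kind described above could be defined roughly as follows; the field names are hypothetical, since the actual header is not reproduced in the thesis.

// Hypothetical definition of the toggle COM-channel message. The real type is
// defined in a separate header; only the fact that it holds two booleans
// (one per ADAS) is taken from the text.
struct DataMessageToggle
{
    bool pedestrianVehicleDetectionActive;  // true = ADAS activated, false = deactivated
    bool laneDetectionActive;
};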


const char * toggleMessageId = "Toggle Message";
DataMessageToggle rxMessage = {};

if (api->comChannelGetLastMessage(
        toggleMessageId,
        reinterpret_cast<u8 *>(&rxMessage),
        sizeof(rxMessage)))
{
    // If there is a new "Toggle Message", do something with the message data.
}

Listing 5.5: Code snippet for checking for COM-channel messages with the ID "Toggle Message".

Prepare Input for Inference

The classification networks require the channels of the input image to be planar in memory, i.e. all pixel values for red, followed by all values for green and then blue. Listing 5.6 displays the code used for rearranging the channels in memory. The virtual camera image is stored upside down in memory, which means that the last 3*width pixel values represent the first row of the image. Starting at this position, the data is traversed three steps at a time. Figure 5.5 illustrates the order in which the red pixel values are traversed in the virtual camera data. The same traversal is then applied to the green and blue pixel values, starting at Row 1, and so forth.

Figure 5.5: Illustration of how virtual camera data is traversed.

std::shared_ptr<InferenceEngine::Blob> input = p->pvd.inferRequest.GetBlob("data");

u8 * inputData = input->buffer().as<u8 *>();

for (int c = 0; c < p->pvd.channels; ++c) {  // expected number of image channels
    for (int h = 0; h < height; ++h) {
        for (int w = 0; w < width; ++w) {
            inputData[static_cast<size_t>(c * width * height + h * width + w)] =
                downloadedCameraData[static_cast<size_t>(3 * width * (height - h - 1) +
                                                         3 * w + c)];
        }
    }
}

Listing 5.6: The input image for the PVD network is prepared for inference.

Running Inference

When the input to the classification networks has been prepared, the next step is to execute the inference. Since running inference is a relatively expensive task, a worker thread is launched for each of the networks. This enables inference to be run in parallel on both networks while allowing execution of the update function to progress.

A function called run_inference takes as arguments the InferRequest created in the initialization of the inference engine along with an atomic boolean.


The std::atomic template provides well-defined behavior if one thread writes to a variable while another thread reads it [54]. Each thread has an atomic boolean stored in the State object, indicating whether inference is currently being executed on the thread in question.

Listing 5.7 shows how the run_inference function is called with the thread object pedInferenceThread, designated for executing inference on the PVD network. The thread object has been initialized in the State constructor along with another thread object for inference on the segmentation network. At the start of the function, the atomic boolean is set to true, i.e. inference is now running. Inference is then executed on the input data, which was prepared earlier in the update function, with the InferRequest object passed to the function (see line eight in the listing). When inference is completed, the atomic boolean is set to false just before the end of the function, marking that inference is no longer running.

During the execution of inference on the two classification networks, the update function has carried on executing code. In the following update iteration, the two atomic booleans are checked; if false, the process up until this point is carried out in the same way, otherwise it skips to the next part described below.

 1  static void run_inference(
 2      std::atomic_bool * running,
 3      InferenceEngine::InferRequest * inferRequest)
 4  {
 5      *running = true;
 6
 7      try {
 8          inferRequest->Infer();  // Inference started
 9      } catch (const std::exception & e) {
10          oden->logError("%s", e.what());
11      }
12
13      *running = false;  // end of function -> inference completed
14  }
15  // Code for calling the inference function --->
16  // p is a pointer to the plugin State
17  p->pedInferenceRunning = true;  // atomic bool
18  p->pedInferenceThread = std::thread(
19      run_inference,
20      &p->pedInferenceRunning,
21      &p->pvd.inferRequest);  // InferRequest object
22  // <---

Listing 5.7: Function for running inference on the PVD network, executed on a worker thread.

Get Output From Inference

For each update iteration, and for each of the ADAS, the following conditions need to be fulfilled in order to execute inference and get output from the classification network:

• A virtual camera image has been downloaded.

• The virtual camera resolution (width and height) matches the network input dimen- sions.

• Initialization of the classification network was successful.

• Inference is not currently being executed on the network.

• The ADAS has been activated in the visualization plugin.

Given that the above conditions are fulfilled, the first step is to get the output from the inference.


If inference has not yet been executed, which is the case for the first frame, the output will simply contain zero-confidence classifications. As described earlier in Section 5.2.2, the output from the two classification networks differs. Thus, two different functions for getting the output and applying some post-processing were implemented. These functions are described in detail in Sections 5.2.6 and 5.2.7.

Getting access to the output data is done the same way regardless of network. Listing 5.8 shows a generalized syntax for accessing the data. The inputOrOutputBlob variable gives access to the inference engine Blob class object, which is intended for reading and writing memory. Using the inputOrOutputBlob variable, the input/output data is retrieved through the buffer() method, which gives access to the memory allocated for input/output. The memory location is then stored in the inputOrOutputData variable.

The results from the output/processing functions are saved as individual data structures for the PVD network and the segmentation network respectively. More specifically, for the PVD network each detection is saved as a data structure containing an ID, a position (in R3 coordinates), a size, a confidence, and a label. The results from the road segmentation are saved as two sets of coefficients for the curves which model the road lanes detected in the virtual camera image.

std::shared_ptr<InferenceEngine::Blob> inputOrOutputBlob =
    p->inferenceObject.inferRequest.GetBlob("data");
const float * inputOrOutputData = inputOrOutputBlob->buffer().as<float *>();

Listing 5.8: Generalized commands for getting output data from an inference request.

Send Detection Data with COM-channels

The last part of the update function is executed every iteration. Here, detections extracted from inference on the two classification networks are sent to the visualization plugin via COM-channels. As described earlier in this section, detections are stored as simple data structures in the State object. This means that in iterations where inference is running, or the virtual camera image has not finished downloading, detections from the previous frame are sent.

Listing 5.9 shows an example of pedestrian and vehicle detections being sent, following the same principle as getting messages from the COM-channels (see Listing 5.5). A new message is initialized along with a message ID. The number of detections saved in the State is then stored in the message. Finally, the detections in State are copied to the message. Note that a limit has been set at 14 detections, since the COM-channels have a maximum packet size. The maximum packet size can be increased in the Oden editor. However, after testing the vehicle and pedestrian detection in different environments, it was observed that the number of detections extracted from one frame was rarely above five or six, peaking at around ten. Therefore a limit of 14 detections was deemed high enough.

const char * messageId = "PVD Message";
DataMessage message = {};
message.nrDetections = static_cast<int>(p->pvDetections.size());

for (size_t i = 0; i < p->pvDetections.size() && i < 14; ++i) {
    message.d[i] = p->pvDetections[i];
}
api->comChannelSendMessage(
    messageId,
    reinterpret_cast<const u8 *>(&message),
    sizeof(message));

Listing 5.9: Sending of pedestrian and vehicle detections to the visualization plugin.
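Listing 5.9 assumes a detection data structure and a message type roughly along the following lines. The field names and exact types are hypothetical; only the contents listed in the text (an ID, an R3 position, a size, a confidence, a label, and room for 14 detections) are taken from the thesis.

// Hypothetical sketch of the detection and message structures used in Listing 5.9.
struct Detection
{
    int   id;            // tracking ID assigned by the inference plugin
    float position[3];   // R3 (XYZ-space) position on the hemisphere
    float size[2];       // width and height of the detection box
    float confidence;    // confidence reported by the PVD network
    int   label;         // class label: pedestrian or vehicle
};

struct DataMessage
{
    int       nrDetections;  // number of valid entries in d
    Detection d[14];         // 14 chosen to fit within the COM-channel packet size
};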


5.2.6 Vehicle and Pedestrian Detections

Section 5.2.5 described how the output from the two classification networks is handled by two different functions. One of these functions, called PedestrianAndVehicleDetector, obtains and processes the output from the PVD network. Each detection contains an image ID, a label (vehicle or pedestrian), a confidence level, and image coordinates for the detection bounding box. For every inference on a frame, the inference engine gives hundreds of possible detections, referred to as proposals. Every proposal has a confidence level, and detections with low confidence are filtered out early.

Tracking of detected objects between frames is not handled by the object detection network. Algorithm 1 describes the process of handling detections made by the network. In short, each new detection is assigned a unique ID and then compared to detections made in previous frames by calculating the intersection-over-union (IOU), the area of overlap divided by the area of union. If the IOU score is larger than a set threshold, it is considered a match, and the old detection is updated with the new position and size.

Algorithm 1 Tracking of detections

if no new detections then
    increment counter disappeared, the number of frames an old detection has not been matched to a new one
else
    if no old detections then
        save all new detections
    else
        for all new detections do
            for all old detections do
                compare new and old detections (intersection-over-union)
            end for
        end for
        for all matching pairs do
            if old and new detection in pair not already assigned a match this round then
                update old detection with new detection data, mark both as assigned
            end if
            if old detection in pair already assigned this round and new detection not assigned then
                mark new detection to be skipped
            end if
        end for
        for all old detections do
            if not assigned this round then
                increment counter disappeared
            end if
            if disappeared for too long then
                remove detection
            else
                reset assignment in preparation for next frame
            end if
        end for
        for all new detections do
            if not already assigned or marked to be skipped then
                save detection
            end if
        end for
    end if
end if
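The intersection-over-union score used for matching in Algorithm 1 can be computed as in the sketch below, here for axis-aligned bounding boxes given by their top-left corner, width, and height; the struct layout is illustrative rather than taken from the implementation.

#include <algorithm>

// Axis-aligned bounding box in image coordinates (illustrative layout).
struct Box
{
    float x, y;  // top-left corner
    float w, h;  // width and height
};

// Intersection-over-union: area of overlap divided by area of union.
// Returns a value in [0, 1]; a match is declared when it exceeds a threshold.
float intersectionOverUnion(const Box & a, const Box & b)
{
    const float left   = std::max(a.x, b.x);
    const float top    = std::max(a.y, b.y);
    const float right  = std::min(a.x + a.w, b.x + b.w);
    const float bottom = std::min(a.y + a.h, b.y + b.h);

    const float interW = std::max(0.0f, right - left);
    const float interH = std::max(0.0f, bottom - top);
    const float inter  = interW * interH;

    const float uni = a.w * a.h + b.w * b.h - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}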


5.2.7 Road Segmentation & Lane Detection

As described in Section 5.2.2, the inference model for road segmentation returns, for each pixel in the input image, the probability of it belonging to one of four classes: background, road, curb, and mark. From this data, a binary image is created, with pixels having a "mark" probability above a certain threshold set to one and the rest set to zero. The only class of interest in this case was the pixels classified as "mark".

Region of Interest & Masking

On a road with several lanes, or with markings on both sides of the mid-line, the camera might pick up markings which are not part of the vehicle's active lane. To avoid such false markings, a region of interest (ROI) is defined, made up of four points in the image. In Figure 5.6a, the ROI is illustrated by the red lines superimposed on the image. The purpose of the ROI is to exclude all information which lies outside its borders, thus reducing "information pollution".

The challenging part of this approach is deciding which region of the image is actually "interesting". Some simple assumptions could give acceptable results: everything above the horizon will not contain any markings, the horizon lies approximately in the middle of the image, and perspective makes the lane markings converge towards its mid-point. Thus, the ROI could simply be defined as the triangle with points at the lower left and right corners of the image and at the image center. However, depending on the placement of the camera, its rotation, its field of view, as well as the road width, these assumptions might not be valid. In the implementation presented here, the problem of defining the ROI was solved by making it manually adjustable through slider widgets in the visualization plugin GUI sidebar.

For handling image data, the OpenCV basic image container Mat was used, which represents images as matrices instead of just a series of data. A sketch of the thresholding and masking steps is given after Figure 5.6.

(a) Road with ROI. (b) Binary road image.

Figure 5.6: a) Area under red line makes up the ROI. b) Masked binary image created from inference output.
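The following is a minimal sketch, using OpenCV, of how the binary "mark" image and the ROI mask described above could be produced; the variable names and the assumption that the segmentation output is available as a single-channel float image of per-pixel "mark" confidences are illustrative.

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Build a binary lane-marking image restricted to a four-point ROI.
// `markConfidence` is assumed to be a CV_32F image holding the per-pixel
// confidence of the "mark" class from the segmentation network.
cv::Mat maskedMarkImage(const cv::Mat & markConfidence,
                        const std::vector<cv::Point> & roiPolygon,  // four ROI corners
                        float threshold)
{
    // Threshold the confidence map into a binary image (255 = mark, 0 = rest).
    cv::Mat binary;
    cv::threshold(markConfidence, binary, threshold, 255.0, cv::THRESH_BINARY);
    binary.convertTo(binary, CV_8U);

    // Fill the ROI polygon into a mask and zero out everything outside it.
    cv::Mat mask = cv::Mat::zeros(binary.size(), CV_8U);
    cv::fillConvexPoly(mask, roiPolygon, cv::Scalar(255));

    cv::Mat masked;
    cv::bitwise_and(binary, mask, masked);
    return masked;
}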

Lane Separation

Separating the pixels of the binary image into two sets of markings was first done by simply splitting them depending on which side of the lane middle point they were on, with the lane middle point defined as the middle point of the region of interest. This works well under ideal circumstances, when the road markings are highly visible and the road does not contain sharp turns, which is the case on most larger roads.

Another method was then implemented, using a histogram and a sliding window search to find points belonging to the left and right lane markings respectively.

To determine the starting point of the search, a histogram was applied to the binary image to find the two horizontal regions with the highest value, i.e. the largest sum of pixels. Two sliding windows are centered at the points found by the histogram. The windows are then moved upwards in the image, each new window centered on the mean position of the lane marking pixels found in the previous window. The positions of all pixels found within these windows are added to the respective left and right lane. Since lane markings closer to the camera appear larger than markings further away, they make up a large portion of the pixels in the image. This way the two starting points for the sliding window search (should) align with the markings closest to the car. This assumption works well in most cases where there are normal lane markings and no other markings in the lane, e.g. arrows indicating lane merges. A sketch of the search is given below.
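As an illustration of the histogram and sliding window search, consider the following OpenCV-based sketch. It assumes a binary CV_8U image where lane-marking pixels are non-zero; the window count, the margins, and the split of the histogram into a left and right half are simplifications and not the tuned parameters of the implementation.

#include <algorithm>
#include <opencv2/core.hpp>
#include <vector>

// Collect candidate left- and right-lane pixels from a binary marking image
// using a column histogram over the lower half and a sliding window search.
void slidingWindowLaneSearch(const cv::Mat & binary,           // CV_8U, marks are non-zero
                             std::vector<cv::Point> & leftPts,
                             std::vector<cv::Point> & rightPts,
                             int numWindows = 9,
                             int margin = 50)
{
    const int w = binary.cols;
    const int h = binary.rows;

    // Histogram: sum of pixel values per column over the lower half of the image.
    cv::Mat lowerHalf = binary(cv::Rect(0, h / 2, w, h - h / 2));
    cv::Mat hist;
    cv::reduce(lowerHalf, hist, 0, cv::REDUCE_SUM, CV_32S);  // 1 x w row vector

    // Starting x-positions: histogram peaks in the left and right halves.
    cv::Point leftPeak, rightPeak;
    cv::minMaxLoc(hist.colRange(0, w / 2), nullptr, nullptr, nullptr, &leftPeak);
    cv::minMaxLoc(hist.colRange(w / 2, w), nullptr, nullptr, nullptr, &rightPeak);
    int leftX = leftPeak.x;
    int rightX = rightPeak.x + w / 2;

    const int windowHeight = h / numWindows;
    for (int i = 0; i < numWindows; ++i) {
        const int yTop = h - (i + 1) * windowHeight;

        // Collect non-zero pixels inside each window and recenter on their mean x.
        for (int * centerX : { &leftX, &rightX }) {
            const int xLeft = std::max(0, *centerX - margin);
            const int xRight = std::min(w, *centerX + margin);
            cv::Rect window(xLeft, yTop, xRight - xLeft, windowHeight);

            std::vector<cv::Point> found;
            cv::findNonZero(binary(window), found);

            long sumX = 0;
            for (cv::Point & pt : found) {
                pt.x += xLeft;          // convert back to full-image coordinates
                pt.y += yTop;
                sumX += pt.x;
                (centerX == &leftX ? leftPts : rightPts).push_back(pt);
            }
            if (!found.empty()) {
                *centerX = static_cast<int>(sumX / static_cast<long>(found.size()));
            }
        }
    }
}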

Curve Fitting Using Gauss-Newton Algorithm

To be able to draw a curve along the lane markings, the curve must first be approximated. The method used for this purpose is the Gauss-Newton algorithm, a method for solving non-linear least squares problems. Starting with a model function f(x), in this case a second-order polynomial y = a + bx + cx^2, and an initial estimate of the coefficients β^(0), the algorithm iteratively finds the coefficients β which minimize the sum of squared residuals between the real values and the approximated curve, S(β) = Σ_{i=1}^{m} r_i(β)^2 [55]. The coefficients are updated each iteration by β^(s+1) = β^(s) + (J_f^T J_f)^{-1} J_f^T r(β^(s)), where J_f is the Jacobian matrix of the function f(x) and (J_f^T J_f)^{-1} J_f^T is the left pseudoinverse of J_f. This is repeated until convergence or for a set number of iterations. The coefficients (a, b and c) are then sent to the visualization plugin. A sketch of this update step is given below.
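A compact sketch of the update step above, assuming the Eigen linear algebra library is available, could look as follows. For the quadratic model the Jacobian rows are simply (1, x_i, x_i^2), so the fit converges in a single step, but the loop mirrors the general iterative form described in the text.

#include <Eigen/Dense>
#include <vector>

// Fit y = a + b*x + c*x^2 to lane-marking points with Gauss-Newton updates.
// Returns the coefficient vector beta = (a, b, c).
Eigen::Vector3d fitQuadraticGaussNewton(const std::vector<double> & xs,
                                        const std::vector<double> & ys,
                                        int maxIterations = 10)
{
    const int m = static_cast<int>(xs.size());
    Eigen::Vector3d beta = Eigen::Vector3d::Zero();  // initial estimate beta^(0)

    for (int it = 0; it < maxIterations; ++it) {
        Eigen::MatrixXd J(m, 3);   // Jacobian of the model function f
        Eigen::VectorXd r(m);      // residuals r_i = y_i - f(x_i)

        for (int i = 0; i < m; ++i) {
            const double x = xs[i];
            J(i, 0) = 1.0;
            J(i, 1) = x;
            J(i, 2) = x * x;
            r(i) = ys[i] - (beta(0) + beta(1) * x + beta(2) * x * x);
        }

        // beta^(s+1) = beta^(s) + (J^T J)^(-1) J^T r(beta^(s))
        Eigen::Vector3d delta = (J.transpose() * J).ldlt().solve(J.transpose() * r);
        beta += delta;

        if (delta.norm() < 1e-9) {  // stop when the update is negligible
            break;
        }
    }
    return beta;
}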

5.3 Visualization Plugin Implementation

This section describes the implementation of the visualization plugin introduced in section 5.1. This plugin is used at the driver station to visualize the data computed by the inference plugin.

5.3.1 Drawing Detection Markers

The Oden plugin API includes a couple of ready-made methods for drawing simple shapes in the scene, such as lines, rectangles, and triangles. Detections of vehicles and pedestrians were simply marked by a rectangle, with different colors depending on the detection label. Because the video is projected on a spherical surface, the drawn shapes were rotated to align with the tangent of the sphere, to have the correct size and to avoid clipping with the sphere. Two examples can be seen in Figure 5.8.

5.3.2 Sampling a Curve

The coefficients sent from the inference plugin are used to sample values of the polynomial function y = a + bx + cx^2 for each x-value (0-671) in the image, creating a curve which follows the lane markings in the virtual camera image. To smooth the movement of the curve between frames, all points sampled from the curve are added to a moving average before being drawn. In the visualization GUI, the number of update cycles to average over can be adjusted using a slider. A sketch of this sampling and smoothing is given below.
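A minimal sketch of the sampling and moving-average smoothing could look as follows; the history buffer and its size are illustrative, and the conversion of the sampled points to R3 is left out since it reuses the projection described in section 5.1.1.

#include <deque>
#include <vector>

// Sample y = a + b*x + c*x^2 for every x in [0, width) and average the result
// over the most recent `historySize` frames to smooth the drawn curve.
std::vector<float> sampleAndSmoothCurve(float a, float b, float c,
                                        int width,
                                        std::deque<std::vector<float>> & history,
                                        size_t historySize)
{
    std::vector<float> ys(static_cast<size_t>(width));
    for (int x = 0; x < width; ++x) {
        const float fx = static_cast<float>(x);
        ys[static_cast<size_t>(x)] = a + b * fx + c * fx * fx;
    }

    history.push_back(ys);
    if (history.size() > historySize) {
        history.pop_front();
    }

    // Moving average over the stored frames, per x-value.
    std::vector<float> smoothed(static_cast<size_t>(width), 0.0f);
    for (const std::vector<float> & frame : history) {
        for (int x = 0; x < width; ++x) {
            smoothed[static_cast<size_t>(x)] += frame[static_cast<size_t>(x)];
        }
    }
    for (float & v : smoothed) {
        v /= static_cast<float>(history.size());
    }
    return smoothed;
}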

5.3.3 Drawing Lanes

The overlay for the road lanes is drawn using triangles, placed along the points sampled from the curve found by the Gauss-Newton algorithm. The curve defined in two dimensions (R2) is converted to three dimensions (R3) by the same method used for the detection positions.

To create a filled line between these points with triangles, additional points are created to serve as triangle vertices. Figure 5.7 illustrates how triangles are drawn along the sampled points (red crosses).

Figure 5.7: Illustration of how a line segment is drawn using triangles. The red crosses denoted 1-4 are the sampled curve points. In this example, the blue vectors are used to calculate the direction of the vectors extending from point 2. The right part of the figure represents the triangles calculated from these four points.

First, a vector orthogonal to the sphere's normal and to the z-axis, originating in the first point, is created (aligned with the sphere's tangent and the XZ-plane). Along the positive and negative directions of this vector, shown as the arrows extending from each point in Figure 5.7, two new points are created to serve as triangle vertices. The same calculation is done for the last point as well, so that both the start and the end of the line have edges aligned horizontally. The length of the vectors extending from each point is normalized and scaled to produce a line of the preferred width.

The direction of the orthogonal vectors, excluding the first and last point, is a cross product between two vectors: one vector from the camera position to the point in question (v_i(x, y, z) = point_i(x, y, z) - (0, 0, 0)), the other going from the previous point to the next point (v_prevToNext(x, y, z) = point_{i+1}(x, y, z) - point_{i-1}(x, y, z)). In the example in Figure 5.7, the direction of the vector extending from the second point (arrows in the figure) is the cross product of the vector between the camera and point 2 and the vector from point 1 to 3 (v_2,orthogonal = v_2 × v_1to3). This vector is orthogonal to both the hemisphere and the vector from point 1 to point 3. This ensures that the width of the curve made by the triangles is kept roughly the same along its whole length, and works well as long as the inner angle between two points (e.g. points 1 and 3 in Figure 5.7) does not get too small. The smaller the angle, the thinner the line becomes, since the scaling factor of the vectors is constant. To create a smooth line of approximately equal width, the points are sampled at short intervals, thus avoiding this problem. A sketch of this vertex construction is given below.
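A sketch of the vertex construction for one interior curve point could look as follows; the Vec3 helper type and the function names are illustrative, not the Oden API.

#include <cmath>

struct Vec3 { float x, y, z; };

static Vec3 cross(const Vec3 & a, const Vec3 & b)
{
    return { a.y * b.z - a.z * b.y,
             a.z * b.x - a.x * b.z,
             a.x * b.y - a.y * b.x };
}

static Vec3 scaleToLength(const Vec3 & v, float length)
{
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return { v.x / len * length, v.y / len * length, v.z / len * length };
}

// Compute the two triangle vertices offset from an interior curve point.
// `prev`, `point` and `next` are consecutive R3 curve points, the camera sits
// at the origin, and `halfWidth` is half the desired line width.
void lineVerticesAtPoint(const Vec3 & prev, const Vec3 & point, const Vec3 & next,
                         float halfWidth, Vec3 & leftVertex, Vec3 & rightVertex)
{
    // Vector from the camera (origin) to the point, and from prev to next.
    const Vec3 toPoint    = point;
    const Vec3 prevToNext = { next.x - prev.x, next.y - prev.y, next.z - prev.z };

    // Orthogonal to both the view direction and the local curve direction.
    const Vec3 offset = scaleToLength(cross(toPoint, prevToNext), halfWidth);

    leftVertex  = { point.x + offset.x, point.y + offset.y, point.z + offset.z };
    rightVertex = { point.x - offset.x, point.y - offset.y, point.z - offset.z };
}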

5.4 Systems Results and Performance

The final product of this work is two vision-based driver assistance systems applied in a teleoperation environment. Figure 5.8 depicts two scenes in traffic with active pedestrian detection, with green detection rectangles drawn to highlight a pedestrian. Figure 5.9 shows the lane detection in action, with red lines drawn on top of the road markings. The images shown were captured with a restricted field of view, while the full images shown to a driver would have at least a 180-degree horizontal field of view.


(a) City (b) Empty road

Figure 5.8: Two examples of detection boxes drawn in front of video stream.

(a) Solid lines (b) Dashed lines

Figure 5.9: Two examples of the lane overlay drawn in front of the video stream.

When running only the vehicle and pedestrian detection system, performance was over 50 fps, which is only a small decline from the inference engine's baseline performance [52]. The lane detection system ran above 40 fps. The road segmentation inference required more computation per update cycle than the other system, and more additional work was done with the data before visualizing the results. Running both ADAS simultaneously, the update frequency was above 30 fps. The performance of the systems did not pose any limitations in the user tests, as the videos used only ran at 30 frames per second.

5.5 User test

This section will describe the experiment design, the user test procedure, and analysis of the gathered data.

5.5.1 Experimental Design

Building upon results from the visual acuity tests made in chapter 4, an experimental user study was designed and performed to investigate two research questions [56]:

1. Whether visual acuity (video quality setting) affects reaction time performance.

2. Whether visual driver aids affect reaction time performance.

To investigate these questions, the following hypotheses were tested (H0 denotes the null hypothesis, H1 the alternate hypothesis):


1. Main effect of video quality setting
   H0: µhigh = µmedium = µlow against H1: at least one group mean differs.

2. Main effect of visual driver aids
   H0: µOFF = µON against H1: at least one group mean differs.

3. Interaction effect (Setting × Driver Aid)
   H0: There is no interaction effect between Setting and Driver Aid against H1: The interaction between Setting and Driver Aid is significant.

The interpretation of the null hypotheses is that reaction time is not changed. To test the hypotheses, five scenarios were constructed from three different settings for video quality, with and without visual driving aids activated; the combinations used for scenarios A-E can be seen in Table 5.2.

         Visual Acuity (LogMAR)
ADAS     0.0    0.3    0.6
OFF      A      B      C
ON       -      D      E

Table 5.2: Scenario settings.

Each scenario consists of three 10-20 second video sequences recorded from a car driving on roads in and outside of Norrköping, 15 videos in total. The videos were captured with a Panasonic GH4 with an iZugar MKX22 Fisheye Lens. The camera was mounted at the front of the car, on the hood. In each video, pedestrians are crossing the car's path, or are about to cross the road. A description of each video scenario can be found in appendix A. To negate learning effects that could arise from watching the videos in one particular order, the order of the scenarios was scrambled according to the Latin square [57] in Table 5.3. This design also limits the number of subjects required to complete the study, five people in this case.

Subject    Scenario Order
1          A B C D E
2          B A E C D
3          C D B E A
4          D E A B C
5          E C D A B

Table 5.3: Scenario order for test subjects 1-5.

5.5.2 Test Procedure

Each test subject was seated in front of a curved wall, onto which the videos were projected (see Figure 5.10). Their objective was to react, by pressing a button on a steering wheel, when they could see a pedestrian who was about to cross the road. Button presses were recorded by coupling the button to a symbol which was toggled on the screen. The subjects were shown the video sequences in the order shown in Table 5.3, three videos per scenario. Between scenarios the subjects were given a chance to briefly describe what they reacted to in the videos, to sort out possible faulty button presses or other errors.


Figure 5.10: The test subject presses a button on the steering wheel, which activates the symbol in the bottom left corner of the screen.

5.5.3 Analysis of Variance

The hypotheses presented in 5.5.1 were tested both separately and in conjunction. The first hypothesis, the main effect of video quality setting, was tested with a one-way ANOVA at significance level α = 0.05, with one dependent variable, reaction time, and one independent variable, video quality setting (high/medium/low). This test used data from scenarios A, B, and C, to limit the analysis to one independent variable.

              Sum Sq   num Df   Error SS   den Df   F value   Pr(>F)
(Intercept)   3.7146   1        1.32810    4        11.1877   0.028707
Setting       1.9945   2        0.91961    8        8.6752    0.009918

Table 5.4: Univariate Type III Repeated-Measures ANOVA Assuming Sphericity. Data from scenario A, B and C.

          Test statistic   p-value   corrected p-value
Setting   0.61934          0.48741   0.02181

Table 5.5: Mauchly Tests for Sphericity and Greenhouse-Geisser correction for departure from sphericity.

The first test shows a statistically significant difference between groups, as determined by a one-way ANOVA (F(2,8) = 8.6752, p = 0.0218).

The test for the second hypothesis was a factorial design (meaning several independent variables) with one dependent variable, reaction time, and two independent variables, video quality setting (high/medium/low) and ADAS (on/off). A two-way repeated-measures ANOVA was calculated as a test of statistical significance, with significance level α = 0.05. This test uses data from scenarios B, C, D and E. Data from scenario A was removed because ADAS was not tested in that scenario.


               Sum Sq    num Df   Error SS   den Df   F value   Pr(>F)
(Intercept)    2.22656   1        0.61028    4        14.5937   0.01878
Setting        0.59656   1        0.59714    4        3.9962    0.11624
ADAS           0.05438   1        0.81526    4        0.2668    0.63274
Setting:ADAS   1.05517   1        0.68110    4        6.1969    0.06753

Table 5.6: Univariate Type III Repeated-Measures ANOVA Assuming Sphericity. Data from scenario B, C, D and E.

The second test does not show a statistically significant difference between groups at significance level α = 0.05.

5.6 Discussion

Here follows a discussion of the results and of how delimitations, choices made, and limitations discovered in the development process affected the result.

5.6.1 Analysis of Test Data The boxplot in Figures 5.11 and 5.12 shows data distribution of experiment results. The out- lined boxes represent the interval between the first and third quartile, the interquartile range or IQR. The line inside boxes is the mean value. Points below or above the vertical lines are outliers, at a distance of at least 1.5IQR from the first or third quartile respectively, according to Tukey’s rule [58]. Four such outliers were removed from the data before further analysis. Results from one video in scenario B was removed post-test because the situation presented in the video differed much from the rest of the material. One concern with using different videos for each scenario is that the results may not be directly comparable between scenarios. The distribution shows a much greater spread of re- action times in scenarios B and C compared to scenario A. Looking at Figure 5.12, the videos in scenarios B and C have distributions which are very different from each other within the scenarios. This is likely caused by difference in situations presented in the videos, and not so much the actual settings used. More videos per setting could perhaps have balanced out the results. Those scenarios are also the only ones having several outliers, this could be be- cause of individual differences in difficulty perceiving moving objects in low-quality video, but with only five observations per video, there is not enough data to confidently draw any conclusions by visual examination. Examining the results from videos used in scenario D and E, where ADAS was active, there is less spread of reaction times compared to scenario B and C, where there are several outliers. This could point to that this method of marking pedestrians is not necessarily moving the threshold for detection earlier, but keeping reaction times more consistent because of the attention capture it creates. This consistency in reaction times is also seen among videos in scenario A, where video quality was at the highest setting. One variable which was measured in the videos is how long time the pedestrian was visible before stepping into the road. Looking at Figure 5.13, there is not much difference between the time of reaction and the time of pedestrian stepping into the road for any of the videos, except for when the pedestrian is not visible at all before stepping into the road. The interpretation made from this is that the time a pedestrian being visible is not a variable that must be accounted for, as there is no clear relationship between it and the time of reaction.

5.6.2 ANOVA Result Validity

Several assumptions about the data must be satisfied in order for the ANOVA results to be reliable [56]:


Figure 5.11: Data distribution of reaction times for each scenario, sorted by scenario settings. Different scenarios are denoted by color. Reaction time is relative to the moment a pedestrian steps out into the street (dashed line in figure).

Figure 5.12: Data distribution of reaction times for each video, sorted by scenario settings. Reaction time is relative to the moment a pedestrian steps out into the street (dashed line in figure).

1. Subjects should be randomly selected from the population of interest.

2. The observations made must be normally distributed.

3. Sphericity assumption, the variance of differences between all possible pairs of groups must be equal.

All subjects in the test are drivers with at least 3 years of driving experience. Normality of the data can be assumed, as the residuals of the observations approximately follow the diagonal line (expected normal distribution) in the Q-Q plot shown in Figure 5.14. The result of the first ANOVA is corrected for departures from sphericity, so all assumptions are deemed satisfied.


Figure 5.13: Sampled reaction timestamps, the timestamp when a pedestrian steps into the road, and the timestamp when the pedestrian was marked by the ADAS. All times are relative to the timestamp of the pedestrian coming into view.

Interpreting what this result implies is difficult, however. ANOVA only measures whether there is a significant difference between group means, but examining the data distributions, the differences in mean value are not large between groups. The large spread of reaction times among the videos in scenarios B and C makes it difficult to draw any conclusions from this ANOVA alone.

Figure 5.14: Q-Q plot of the residuals of all observations; the blue line is the expected normal distribution.

The results from the two-way ANOVA (testing all hypotheses simultaneously) were not statistically significant, so the null hypotheses could not be rejected. Therefore, testing the assumptions was not necessary.


5.6.3 Test Subject Comments

In between test scenarios, the subjects were free to comment on their experience. In scenarios where the video quality was set to high, they experienced little to no trouble recognizing pedestrians. At the medium setting, there were a few videos where the detection helped the subjects react faster by capturing their attention, but most of the time they expressed seeing the pedestrian before the detection marker appeared. The most difficult part, according to some subjects, was interpreting pedestrian behavior. They could see movements, but when driving in traffic you also have to be able to predict pedestrian movements based on their behavior, where they are looking, etc. In scenarios C and E, with the low video quality setting, several subjects commented on the quality of the video and how they would not feel comfortable driving a vehicle under such conditions. Some also said the detection markers made them feel somewhat more secure in being able to detect pedestrians in time. But the markers were also a bit too distracting, since they outline pedestrians with a bright green rectangle, making it hard to see that there was a person inside the box if it was not aligned properly with the pedestrian's position.

5.6.4 Experiment Design

Some learning effects were observed, where the subjects learned to look for a specific pedestrian crossing the road, as the videos all contained the same actor. Using different videos for each scenario was not optimal considering the small sample size. An alternative design could have been used, where each subject is only shown video in one setting and the results of different groups are compared. Adjusted for individual error, the results would then have been directly comparable. But this would require five times as many subjects to reach the same number of samples per video as now. That alternative was deemed not feasible, as the logistics behind such an extensive user study would be immense. Conducting a thorough qualitative study would perhaps have given a more solid ground for drawing conclusions.

Removing the driving task from this test did make it simpler compared to real driving, because the subjects could focus solely on trying to find pedestrians, without anything else diverting their attention. Oftentimes the subjects would pan the display in front of them in a manner that would not seem natural in a driving scenario. One could argue that the removal of the driving task was the most detrimental flaw of the experiment, but it was in this case a necessity. As mentioned in section 1.4, no testing ground with appropriate facilities was available during this work. As a result, testing pedestrian detection performance during the driving task, while keeping test scenarios consistent between subjects, was not feasible.

5.6.5 Vehicle & Pedestrian Detection

The biggest issue with this vehicle and pedestrian detection method is that the inference network output is not all that consistent, and there can at times be many frames in sequence where it cannot find an object that is clearly still in view. We did not want detection markings flickering on and off. To try to "fill the gap" between detections of an object gone missing for a few frames, the old detections were drawn on screen at the place they were last seen. Often, this could result in multiple detection boxes being drawn on top of each other, and detections moving out of view still being drawn at the spot they were last seen. To solve this, the algorithm for tracking objects across frames was implemented. The general logic of the algorithm is solid, but it has one weak spot: the step of determining whether a new detection can be matched to an earlier detection. For this step, several methods were tried, but none really solved the problem in full. Attempts were made using more advanced and proven methods (Simple Online Realtime Tracking [59]), but the time required to understand the algorithm and make it work within this context proved to be a very large undertaking, and it was abandoned.

Attempts were also made to include a feature description for each detection, constructed from the part of the image restricted by the detection box

(using the OpenCV ORB keypoint detector). Due to the sometimes very small size of these images (minimum size 60x120 and 40x30 pixels for pedestrians and vehicles respectively), the feature detection was not very effective, causing feature descriptors captured at different frames to not share any properties.

OpenVINO does have a network model made for re-identification of pedestrians. But using it would require another network model running in parallel with the existing ones, reducing the frame rate further. We set an aim to stay above 30 frames per second with the plugin running on the vehicle's on-board computer. This threshold was based on previous research showing that refresh rates below 25 frames per second affect motion perception [60]. Furthermore, variations in refresh rate are less likely to be detected above 30 frames per second, since motion above this threshold is perceived as smooth. To account for the lower single-core performance of the vehicle's on-board computer, we aimed to maintain at least 40 frames per second on our workstation computers. Two networks running simultaneously were already making the application drop towards this number, so we opted to try other solutions. Additionally, this solution would only provide re-identification of pedestrians; there was no ready-made model available for vehicles.

In the end, we resorted to using a simple intersection-over-union threshold for detection matching. This method works when observing objects moving slowly relative to the camera, but fails more frequently when fast movements occur, and is not really usable in a real traffic scenario. Perhaps this solution could work sufficiently if the video were recorded at a higher frame rate, so that objects do not move as much from frame to frame.

5.6.6 Lane Detection

The separation of lanes only works well under ideal conditions. The system cannot handle tight turns, where the left and right lane markings are both present at the same horizontal position. This was tested by creating a track marked with white gaffer tape to simulate road markings, and driving the track with the RC car. By fine-tuning the parameters of the sliding window technique, the performance could possibly be improved to handle more adverse road conditions.

An issue regarding the estimation of a curve from lane markings is that beyond the points found there are no limits to how the curve is fitted. This can result in the curve quickly bending away in unexpected directions if there are not enough markings visible at any one time; e.g. if only one mark is visible, a maximum or minimum of the polynomial function can potentially align with the mark, with the curve bending off in other directions just after the marking ends. A solution could be extrapolating additional points in the image, to artificially extend the lane markings beyond the limits of what the road segmentation could pick up. That solution was not further investigated. This issue was, however, partly limited by adding markers from several consecutive frames together, so that a road with dashed lines would behave more like one with solid lines.

Another issue with the curve estimation is handling outliers. This is partly done by defining a region of interest (ROI). However, this does not address potential outliers inside the ROI. Such outliers arise from markings inside the lanes, e.g. indicating upcoming turns or, in some cases, speed limits. By excluding points classified as mark which lie too far away from the curve estimated for the previous frame, false lane detections could be minimized.

5.6.7 Virtual Camera Issues

There are several issues arising from using the virtual camera image as input to the inference networks:


Delay

With the main objective of the Oden video streaming being low end-to-end latency, extracting video data from this workflow for additional computation is not an easy task. The workaround is to project the video feed onto a surface in the Oden editor, such as a 2D plane for video recorded with a standard field of view, or a hemispherical surface for video captured with wide-angle (fisheye) lenses. A virtual camera is added to the scene, pointed at the center of the projected video, or at any object one wishes to capture. While this is an easy solution, the virtual camera has one obvious flaw: the image it returns is at least four update cycles behind the video stream. At an update frequency of 60 frames per second, that amounts to a delay of 66.67 ms (see the worked figures below).

The delay in retrieving the camera image results in detections lagging behind, and the faster an object moves in relation to the camera, the more noticeable the lag becomes. When driving slowly, e.g. driving the RC car on the sidewalk while observing oncoming traffic and pedestrians, the lag is hardly noticeable. But when the object detection is applied to video of high-speed traffic, the detection markings lag behind, and boxes drawn in the wrong place are not of much use. We therefore decided not to perform any user tests that included detection of vehicles, as the delay of the system could not be compensated for. The delay was not as detrimental to the performance of the lane detection, as the direction of the road usually does not change drastically within a few successive frames.
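As a worked example of the figures above: four update cycles at 60 Hz give the stated 66.67 ms, and the corresponding spatial offset of a detection box grows with the relative speed of the object (the 50 km/h used below is an illustrative assumption, not a speed measured in the tests):

\[
  \Delta t = \frac{4}{60\,\mathrm{Hz}} \approx 66.7\,\mathrm{ms},
  \qquad
  \Delta x = v\,\Delta t \approx 13.9\,\mathrm{m/s} \times 0.0667\,\mathrm{s} \approx 0.93\,\mathrm{m}
  \quad \text{at } v = 50\,\mathrm{km/h}.
\]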

Level of Detail

The virtual camera captures a part of the video projected onto the hemispherical surface at a resolution of 672x384 pixels. The issue is that the original video quality, as a combination of sensor resolution and camera lens properties, leaves the image captured by the virtual camera less detailed than it should be for optimal inference network performance, especially at larger distances. A better performing solution would be to dedicate a separate front-facing camera with a narrower field of view for increased detail, and use it as input to the inference systems.

6 Conclusions

In this thesis, we investigated how video quality in a teleoperation setting relates to visual acuity in normal driving, whether a reduction in video quality has any negative effects on driver performance during teleoperation, and whether vision-based driver assistance systems can compensate for a diminished level of visual acuity.

The first part of the thesis was an investigation in multiple steps: a review of previous studies on visual acuity and driving, a coupling of visual acuity measures to video quality, and tests to find out whether there is any correlation between the two. It can be concluded that visual acuity does degrade as video quality degrades, and that this has a detrimental effect on driver performance.

The second part was the implementation of two vision-based driver assistance systems intended to assist a driver under low video quality conditions. The implementation of these systems proved to be a significant challenge, and the results were therefore not up to the standard we initially aimed for. The performance of the lane detection system was not good enough for the dynamic environment of a driving test. Nevertheless, the tests of the pedestrian detection system indicate that marking pedestrians in the driver's view provides a more consistent reaction pattern among tested subjects, possibly because of the additional attention capture it creates.

Constructing driving studies that are closely related to real-life situations, while not putting any people in harm's way, is often very difficult and costly. This is why most research on driving is either simulator-based or based on statistics from road accidents. To make our tests as close to reality as possible, within the delimitations made, we opted to include both a driving test and a test with video recorded in traffic. Together, the two tests were able to separate key variables in driving and vision, even though neither included both a driving task and traffic.

There is no single vision-based driver assistance system that fits all possible use cases. These kinds of driver aids must be customized for every unique platform to ensure the best performance; otherwise they could prove to be more distracting than helpful.

In a teleoperation setting, how does the degradation of video quality (due to reduced resolution and increased compression) affect the level of perceivable detail, and what effect does it have on driver performance?


Tests performed in this thesis showed that visual acuity, as a measure of perceivable detail, degrades as resolution is lowered and compression is increased. The detrimental effects of increased compression are potentially greater than the effect of lowered resolution, especially as compression reaches high levels. Reduced visual acuity during teleoperation of an on-road vehicle will impair focal vision, and thus the driver's ability to detect potentially hazardous objects. Furthermore, it implies an impairment of the focal/far visual processes and, thus, of the driver's lane-keeping performance.

In a teleoperation setting, how well can impairments to a driver’s visual percep- tion ability, caused by degraded video quality, be compensated for by vision-based driver assistance systems in the form of graphical elements superimposed onto the driver’s view?

The results from the test of the pedestrian detection system point towards the system being helpful for a driver of a teleoperated vehicle, but no significant change in reaction time performance could be measured.

6.1 Could It Have Been Made Simpler?

It can be argued that simpler methods would have sufficed to perform the tests made. If the only concern had been testing reaction times, the videos could have been annotated with markers in post-processing instead of implementing a detection system for this purpose. Although the method and performance of the implemented systems are not state-of-the-art, we argue that the result is more in line with the level of performance that can be expected from real systems. Additionally, testing a hypothetical best-case scenario would not have provided any insight into development within this specific teleoperation system, or tested the viability of ready-made object detection solutions.

6.2 The Work In A Wider Context

These kinds of driving aids are not to be seen as a solution by which further risk-taking in driving can be justified. They are only intended to give a driver additional help during adverse conditions, to help avoid accidents. Ideally, teleoperated vehicles in traffic should not be driven manually when high video quality cannot be guaranteed.

6.3 Future Work

There have been previous studies concerning visual acuity, driving and video resolution. However, as far as we know, there have been none that further investigate how compression and compression artifacts affect visual acuity in driving. This would be an interesting area of research, as compression does not just reduce spatial detail; it can also distort the visual information, producing percepts that do not correspond to the real environment. Here, aspects of vision other than visual acuity, such as contrast sensitivity, depth perception and motion perception, would need further evaluation.

The correlation between visual acuity and video quality in a teleoperation setting should be further investigated if vehicles with these kinds of systems are to occupy public roads. Should the same visual acuity standards as for normal driving apply to information perceived through a screen? Furthermore, should there be regulations regarding a minimum level of video quality when operating vehicles remotely in public spaces?


6.4 Source Criticism

The majority of sources cited throughout this thesis are peer-reviewed publications. Online sources are subject to change, but were available and relevant at the time of citation. The fields of research this thesis touches upon are very large, and it is not possible to get an all-encompassing overview in such a short time. There may therefore be previous work within these fields that conflicts with the theory and methods used in this thesis, or that would have led to different conclusions.

Bibliography

[1] Chen J.Y.C, Haas E.C, and Barnes M.J. "Human performance issues and user interface design for teleoperated robots". In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37.6 (2007), pp. 1231–1245.
[2] Post & Telestyrelsen. PTS Mobiltäcknings- och bredbandskartläggning 2018. URL: https://statistik.pts.se/mobiltaeckning-och-bredband/ (visited on 03/02/2020).
[3] Singh S. Critical reasons for crashes investigated in the national motor vehicle crash causation survey. Tech. rep. DOT HS 812 506. Washington, DC: National Highway Traffic Safety Administration, 2018.
[4] Bagloee S.A, Tavana M, Asadi M, and Oliver T. "Autonomous vehicles: challenges, opportunities, and future implications for transportation policies". In: Journal of Modern Transportation 24.4 (2016), pp. 284–303.
[5] Fagnant D.J and Kockelman K. "Preparing a nation for autonomous vehicles: opportunities, barriers and policy recommendations". In: Transportation Research Part A: Policy and Practice 77 (2015), pp. 167–181.
[6] Gandia R.M et al. "Autonomous vehicles: scientometric and bibliometric review". In: Transport Reviews 39.1 (2019), pp. 9–28.
[7] Waymo. On the Road to Fully Self-Driving: Waymo Safety Report. 2018. URL: https://waymo.com/safety/ (visited on 09/19/2019).
[8] Uber. Advanced Technologies Group. 2019. URL: https://www.uber.com/us/en/atg/ (visited on 09/19/2019).
[9] Van Brummelen J, O'Brien M, Gruyer D, and Najjaran H. "Autonomous vehicle perception: The technology of today and tomorrow". In: Transportation Research Part C: Emerging Technologies 89 (2018), pp. 384–406.
[10] Tesla. Autopilot. URL: https://www.tesla.com/sv_SE/autopilot (visited on 09/19/2019).
[11] SAE International. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. 2018.
[12] Sheridan T.B. "Telerobotics". In: Automatica 25.4 (1989), pp. 487–507.


[13] Hokayem P.F and Spong M.W. "Bilateral Teleoperation: An Historical Survey". In: Automatica 42.12 (2006), pp. 2035–2057. DOI: 10.1016/j.automatica.2006.06.027.
[14] Smiley A. "Mental workload and information management". In: Conference Record of Papers Presented at the First Vehicle Navigation and Information Systems Conference (VNIS '89). 1989, pp. 435–438. DOI: 10.1109/VNIS.1989.98805.
[15] European Road Safety Observatory. Advanced Driver Assistance Systems. 2018. URL: https://ec.europa.eu/transport/road_safety/specialist/observatory/analyses/traffic_safety_syntheses/safety_synthesies_en (visited on 09/19/2019).
[16] Park B, Lee J, Yoon C, and Kim K. "A Study on Augmented Reality of a Vehicle Information Using Head-Up Display". In: 2016 6th International Conference on IT Convergence and Security (ICITCS). 2016, pp. 1–2. DOI: 10.1109/ICITCS.2016.7740308.
[17] Bengler K, Dietmayer K, Farber B, Maurer M, Stiller C, and Winner H. "Three Decades of Driver Assistance Systems: Review and Future Perspectives". In: IEEE Intelligent Transportation Systems Magazine 6.4 (2014), pp. 6–22. DOI: 10.1109/MITS.2014.2336271.
[18] Akamatsu M, Green P, and Bengler K. "Automotive Technology and Human Factors Research: Past, Present, and Future". In: International Journal of Vehicular Technology 2013 (2013), 27 pages. DOI: 10.1155/2013/526180.
[19] Bila C, Sivrikaya F, Khan M.A, and Albayrak S. "Vehicles of the Future: A Survey of Research on Safety Issues". In: IEEE Transactions on Intelligent Transportation Systems 18.5 (2017), pp. 1046–1065. DOI: 10.1109/TITS.2016.2600300.
[20] Troppmann R, Höger A, and Bosch R. Tech Tutorial: Driver Assistance Systems, an Introduction to Adaptive Cruise Control: Part 1. June 2006.
[21] HELLA GmbH & Co. KGaA. Overview of driver assistance systems. URL: https://www.hella.com/techworld/au/Technical/Car-electronics-and-electrics/Overview-of-driver-assistance-systems-45184/ (visited on 09/24/2019).
[22] Sivak M. "The information that drivers use: Is it indeed 90 percent visual?" In: The UMTRI Research Review 29.1 (1998), p. 1.
[23] Owsley C and McGwin Jr G. "Vision and driving". In: Vision Research 50.23 (2010), pp. 2348–2361.
[24] Colenbrander A and De Laey J.J. Vision requirements for driving safety with emphasis on individual assessment. Report prepared for the International Council of Ophthalmology at the 30th World Ophthalmology Congress, Sao Paulo, Brazil, February 2006. 2010.
[25] Millodot M. Dictionary of Optometry and Visual Science E-Book. Elsevier Health Sciences, 2014.
[26] Bühren J. "Snellen Acuity". In: Encyclopedia of Ophthalmology. Ed. by Schmidt-Erfurth U and Kohnen T. Berlin, Heidelberg: Springer Berlin Heidelberg, 2018, pp. 1650–1650. ISBN: 978-3-540-69000-9. DOI: 10.1007/978-3-540-69000-9_668. URL: https://doi.org/10.1007/978-3-540-69000-9_668.
[27] Khex14. Sample Snellen chart.jpg. 2014. URL: https://sv.m.wikipedia.org/wiki/Fil:Sample_Snellen_chart.jpg (visited on 01/07/2020). License: Creative Commons BY-SA 3.0.
[28] National Institutes of Health National Eye Institute. 1606 Snellen Chart-02.jpg. 2012. URL: https://www.flickr.com/photos/nationaleyeinstitute/7544734768 (visited on 01/07/2020). License: Creative Commons BY 2.0.


[29] Ojosepa. EyeOpticsV400y.jpg. 2008. URL: https://commons.wikimedia.org/wiki/File:EyeOpticsV400y.jpg (visited on 01/06/2020). License: Creative Commons BY 3.0.
[30] Bühren J. "ETDRS Visual Acuity Chart". In: Encyclopedia of Ophthalmology. Ed. by Schmidt-Erfurth U and Kohnen T. Berlin, Heidelberg: Springer Berlin Heidelberg, 2018, pp. 741–742. ISBN: 978-3-540-69000-9. DOI: 10.1007/978-3-540-69000-9_617. URL: https://doi.org/10.1007/978-3-540-69000-9_617.
[31] Sloan L.L. "New test charts for the measurement of visual acuity at far and near distances". In: American Journal of Ophthalmology 48.6 (1959), pp. 807–813.
[32] Bühren J. "Sloan Letters". In: Encyclopedia of Ophthalmology. Ed. by Schmidt-Erfurth U and Kohnen T. Berlin, Heidelberg: Springer Berlin Heidelberg, 2018, pp. 1648–1648. ISBN: 978-3-540-69000-9. DOI: 10.1007/978-3-540-69000-9_667. URL: https://doi.org/10.1007/978-3-540-69000-9_667.
[33] Little J-A, Tromans C, Blackmore A, and O'Brien M. Visual Standards for Driving in Europe, a Consensus Paper. European Council of Optometry and Optics, 2017.
[34] Burg A. "The relationship between vision test scores and driving record: general findings". In: (1967).
[35] Burg A. "Vision test scores and driving record: Additional findings". In: (1968).
[36] Hu P.S, Trumble D, and Lu A. "Statistical relationships between vehicle crashes, driving cessation, and age-related physical or mental limitations: Final summary report". In: Washington, DC: National Highway Traffic Safety Administration, US Department of Transportation (1997).
[37] Charman W.N. "Vision and driving - a literature review and commentary". In: Ophthalmic and Physiological Optics 17.5 (1997), pp. 371–391.
[38] Szlyk J.P, Pizzimenti C.E, Fishman G.A, Kelsch R, Wetzel L.C, Kagan S, and Ho K. "A comparison of driving in older subjects with and without age-related macular degeneration". In: Archives of Ophthalmology 113.8 (1995), pp. 1033–1040.
[39] Higgins K.E, Wood J.M, and Tait A. "Vision and driving: Selective effect of optical blur on different driving tasks". In: Human Factors 40.2 (1998), pp. 224–232.
[40] Higgins K.E and Wood J.M. "Predicting components of closed road driving performance from vision tests". In: Optometry and Vision Science 82.8 (2005), pp. 647–656.
[41] Schieber F, Schlorholtz B, and McCall R. "Visual requirements of vehicular guidance". In: Human Factors of Visual and Cognitive Performance in Driving (2009), pp. 31–50.
[42] Donges E. "A two-level model of driver steering behavior". In: Human Factors 20.6 (1978), pp. 691–707.
[43] Gugerty L. "Situation awareness in driving". In: Fisher D.L, Rizzo M, Caird J.K, and Lee J.D (eds.), Handbook of Driving Simulation for Engineering, Medicine, and Psychology. CRC Press, 2011, pp. 298–306.
[44] Wiegand T, Sullivan G.J, Bjøntegaard G, and Luthra A. "Overview of the H.264/AVC video coding standard". In: IEEE Transactions on Circuits and Systems for Video Technology 13.7 (2003), pp. 560–576. ISSN: 10518215. DOI: 10.1109/TCSVT.2003.815165.
[45] Xia J, Shi Y, Teunissen K, and Heynderickx I. "Perceivable artifacts in compressed video and their relation to video quality". In: Signal Processing: Image Communication 24.7 (2009), pp. 548–556. ISSN: 0923-5965. DOI: 10.1016/j.image.2009.04.002. URL: http://www.sciencedirect.com/science/article/pii/S0923596509000423.


[46] Unterweger A. "Compression Artifacts in Modern Video Coding and State-of-the-Art Means of Compensation". In: Multimedia Networking and Coding. Ed. by Farrugia R and Debono C. IGI Global, 2013, pp. 28–49. DOI: 10.4018/978-1-4666-2660-7.ch002.
[47] Owsley C. "Contrast sensitivity". In: Ophthalmology Clinics of North America 16.2 (2003), pp. 171–177.
[48] Boemare N and Coudert F. COST 331: Requirements for Horizontal Road Marking - Final Report. European Cooperation in the Field of Scientific and Technical Research, 1999.
[49] Brooks J.O, Tyrrell R.A, and Frank T.A. "The effects of severe visual challenges on steering performance in visually healthy young drivers". In: Optometry and Vision Science 82.8 (2005), pp. 689–697.
[50] Leach P, Mealling M, and Salz R. "A universally unique identifier (UUID) URN namespace". In: (2005).
[51] Intel. Intel Distribution of OpenVINO Toolkit. URL: https://software.intel.com/en-us/openvino-toolkit (visited on 09/20/2019).
[52] Kozlov A and Osokin D. "Development of Real-time ADAS Object Detector for Deployment on CPU". In: CoRR abs/1811.05894 (2018). arXiv: 1811.05894. URL: http://arxiv.org/abs/1811.05894.
[53] Intel. road-segmentation-adas-0001. 2018. URL: https://docs.openvinotoolkit.org/2018_R5/_docs_Transportation_segmentation_curbs_release1_caffe_desc_road_segmentation_adas_0001.html (visited on 02/07/2020).
[54] cppreference.com. std::atomic. URL: https://en.cppreference.com/w/cpp/atomic/atomic (visited on 02/26/2020).
[55] Björck Å. Numerical Methods for Least Squares Problems. SIAM, Society for Industrial and Applied Mathematics, 1996.
[56] Verma J.P. Repeated Measures Design for Empirical Researchers. John Wiley & Sons, Incorporated, 2015.
[57] Latin Square. URL: https://en.wikipedia.org/wiki/Latin_square (visited on 02/10/2020).
[58] Tukey J.W. Exploratory Data Analysis. Addison-Wesley, 1977. URL: https://search.ebscohost.com/login.aspx?direct=true&AuthType=ip,uid&db=cat00115a&AN=lkp.60089&lang=sv&site=eds-live&scope=site (visited on 02/14/2020).
[59] Bewley A, Ge Z, Ott L, Ramos F, and Upcroft B. Simple Online and Realtime Tracking. 2016. arXiv: 1602.00763 [cs.CV].
[60] Baker Jr C.L and Braddick O.J. "Temporal properties of the short-range process in apparent motion". In: Perception 14.2 (1985), pp. 181–192.

Appendices

A ADAS Categorization

A.1 Active Systems

In this section, all the ADAS that can be labeled as Active are listed, i.e. all systems that control the vehicle's behavior to a varying degree:

• Supportive

– Safety
  * Electronic stability control (ESC)
  * Anti-lock braking system (ABS)
  * Cross wind stabilization (trucks)
  * Hill descent control
  * Hill hold assist
– Adaptive Drive
  * Chassis / Suspension
  * Engine Mapping
  * Steering
– Parking Assistance - Sonar / Radar / Camera
  * Automated parking
  * Trailer Backup Assist
  * Surround view system
– Vision enhancement - Adaptive Headlights
  * Adaptive bend lighting
  * Adaptive high beam
– Cruise Control - Radar / Stereo Camera
  * Adaptive
  * Stop & Go

• Intervening


– Lane keep assist (Level 2 autonomy, "hands off wheel")
– Collision avoidance system
  * Autonomous emergency braking system
    · Emergency brake assist
    · Cruise control with emergency brake
    · Maneuver Brake Assist (at low speeds)
    · Multi-collision brake
  * Pedestrian and Vehicle recognition
  * Lane change prevention
  * Rear End / Side Pre-Crash Assist
  * Safe Exit Assist

A.2 Passive Systems

In this section the ADAS that are labeled as Passive are listed, that is, systems that simply present information or warnings for the driver to act upon.

• Informative

– Navigation System
  * AR on top of front camera view (Mercedes)
– Night vision
– Surround View system
  * Driving
  * Parking
– Traffic Sign Detection
  * Intelligent Speed Adaptation (ISA)
– Heads Up Display (HUD)
– Voice control

• Warnings

– Blind Spot Monitor
– Lane Departure Warning
  * Sound warning
  * Visual feedback
  * Vibrations
    · Steering Wheel
    · Seat
– Driver Fatigue Detection
– Wrong way driving warning
– Cross traffic alert
– Tyre pressure monitoring
– Parking sensors
– Back-up camera
– Forward / Reverse collision warning

A User test video description

Scenario  Video name  Description
A  video2_4  Rural road, straight. Pedestrian coming from left-hand side, crosses road. No obstacles in view.
A  video4_1  City road, 4-way cross, no visibility around street corners. Pedestrian coming from right-hand side around corner, does not cross road.
A  video4_2  City road, T-cross. Pedestrian coming from left around corner. Car stops at intersection.
B  video2_3  Rural road, straight. Pedestrian coming from right-hand side, does not cross road. No obstacles in view.
B  video6_1  City road, intersection. Pedestrian coming from right-hand side, walking the same direction as car is moving. Car turns right, pedestrian goes straight on, crossing the road.
C  video5_3  City road. Pedestrian coming from left-hand side, partly hidden at start, does not cross road.
C  video3_2  City road. Pedestrian coming from left-hand side, hidden behind cars before entering road, crosses road.
C  video1_1  Rural road, left turn uphill. Pedestrian coming from right-hand side, does not cross road. Low contrast to background.
D  video3_1  City road, 4-way cross. Pedestrian coming from left-hand side, partly hidden behind cars before entering road, does not cross road.
D  video6_2  City road. Pedestrian coming from right-hand side, partly hidden behind cars before entering road, does not cross road.
D  video1_4  Rural road, right turn downhill. Pedestrian coming from left-hand side, does not cross road. Low contrast to background.
E  video2_2  Rural road, straight. Pedestrian coming from left-hand side, does not cross road. No obstacles in view.
E  video5_1  City road. Pedestrian coming from right-hand side, hidden behind cars and other pedestrians, briefly visible before close to road, does not cross road. Low contrast to background.
E  video6_3  City road. Pedestrian coming from left-hand side, hidden behind cars before entering road, crosses the road.

Table A.1: Description of the videos shown to user test subjects: what kind of environment the car is driving in, which direction the pedestrian is coming from, whether the pedestrian crosses the road before the car passes, and other details.
