
Article

May a Pair of 'Eyes' Be Optimal for Vehicles Too?

Ernst D. Dickmanns

Independent Researcher, 85649 Brunnthal, Germany; [email protected]

 Received: 13 March 2020; Accepted: 23 April 2020; Published: 5 May 2020 

Abstract: Following a very brief look at the human vision system, an extended summary of our own elemental steps towards future vision systems for ground vehicles is given, leading to the proposal made in the main part. The question is raised of why the predominant solution in biological vision systems, namely pairs of eyes (very often multi-focal and gaze-controllable), has not been found in technical systems up to now, though it may be a useful or even optimal solution for vehicles too. Two potential candidates with perception capabilities closer to the human sense of vision are discussed in some detail: one with all cameras mounted in a fixed way onto the body of the vehicle, and one with a multi-focal gaze-controllable set of cameras. Such compact systems are considered advantageous for many types of vehicles if a human level of performance in dynamic real-time vision and detailed scene understanding is the goal. Increasingly general realizations of these types of vision systems may take all of the 21st century to be developed. The big challenge for such systems with the capability of learning while seeing will be more on the software side than on the hardware side required for sensing and computing.

Keywords: automotive vehicles; autonomous driving; vision systems

1. Introduction

The acceptance of technical vision systems by the human driver depends much on the differences in performance the overall system shows in comparison to his own capabilities. Therefore, in Section 1.1, a brief survey is given on some characteristic parameters of the human vision system. Section 1.2 then presents a review of the elemental steps towards future vision systems of ground vehicles, leading to the proposal made in Section 4. In Section 2, a brief survey of the reasons behind the different developments in biology and technology is laid out. Section 3 contains some conclusions for further developments. An outlook on potential future paths of development in automotive vision systems is given in Sections 4 and 5. Based on observations made, both in many biological species and in technical developments, two (of many possible) designs are looked at in some more detail. Neural net approaches are not taken into account for various reasons. The author is convinced that, just as the 20th century saw a wide spread of designs for ground vehicles, the 21st century will see the development of a wide variety of technical vision systems for vehicles, including deep neural nets. These developments are not a matter of years but of decades.

1.1. Characteristics of the Human Eye and Brain

The larger part of the brain of highly developed vertebrates (including ourselves) is devoted to the processing of visual data for scene understanding, behavior decision and motion control [1–4]. Many biologists consider vision as one of the driving factors for developing intelligence, evolved by individuals with complex brains over millions of years [5]. Many different types and numbers of eyes are found per individual in a wide range of classes of animals. It is hard to believe that only chance is behind the fact that most animals with high-performance locomotion capabilities on land, in water and in the air have pairs of (mostly gaze-controllable) eyes with radial areas of different resolutions in the images they provide.


This modality of vision must have advantages (at least for carbon-based systems) that have evolved over many generations. Figure 1 shows some characteristics of one of these solutions, the human eye (from p. 5 in [6]). The total horizontal field of view (FoV) of a single eye is about 145°; the image resolution decreases from the center to the periphery. The central area, dubbed 'fovea', a region of slightly elliptical size, has a visual range of about 1.5° vertically to about 2° horizontally. There are two different types of sensor elements present in the eye: color-sensitive 'cones' and only-intensity-sensitive 'rods'. The fovea contains exclusively cones, about 7 million (blue curve in Figure 1) with a high density of about 150,000 receptors per mm²; this allows a 'Vernier acuity' in edge detection in the range of 0.06 to 0.25 mrad (individual variations). For angles radially away from the fovea > 15° (in the peripheral regions of the retina), the density of cones is about 1/30 of the peak value; that means that the distance between cones here is five to six times that between rods.

Figure 1. Distribution of rods and cones along a line passing through the fovea and the blind spot of a human eye (p. 5 in [6]).

The fovea is employed for accurate vision in the direction toward where it is pointed by the rotation of the eye. It comprises less than 1% of the retinal size, but uses over 50% of the visual cortex in the brain [7]. The ratio of ganglion cells to photoreceptors in the eye is about 2.5; almost every ganglion cell in the eye receives data from a single cone, and each cone feeds into one to three ganglion cells. Image preprocessing is performed in the eye so that the number of nerves feeding the primary visual cortex in the back side of the human brain is reduced by two orders of magnitude compared to the number of photoreceptors in the eye; edge orientations are coded directly in the eye to achieve this reduction [8]. The eye does not take a photographic image of the environment, but rather produces in parallel up to 100 extremely compressed full images of low precision for the peripheral FoV; in addition, three to four small areas with high resolution are generated by foveal perception [9]. These data are compared with existing images in the actual imagination process, and both together are transformed into the actual perception of the environment.

The human eye has three angular degrees of freedom: two for gaze directions of the line of sight and one for a rotation around it. The latter, called roll angle, will be neglected here. The downward looking angle (infraduction or depression) may go to −60°, the upward looking angle (supraduction or elevation) to +45°. The capability of the horizontal rotation may go on average up to ±50° both towards the nose and away from it; however, in general the angles hardly exceed ±20°. These changes in the viewing direction are supported by rotations of the head. Fast eye movements may go up to 600°/s in saccades; during the tracking of objects (fixation), the maximal rotational speed is about 100°/s.


Saccades are usually finished after 50 milliseconds (ms) [6]. Fixations and saccades make up the largest part of cognitive eye movements. Thus, intentional vision is a controlled cognitive activity by the subject.

The red curves in Figure 1 show the density of the (b/w) elements that measure intensity values only. There are no rods in the fovea; their density rises rapidly in the first 15° radially away from the fovea to similar values to those that cones have in the fovea; keep in mind that Figure 1 is a cut through the fovea and the 'blind spot', which is the circular area where visual nerves pass through the retina. For other cuts through the retina, the red curves at the left- and right-hand sides of the figure are connected by some kind of inverted parabola. At a radial distance of around 15° away from the fovea, rod spacing is comparable to cone spacing in the fovea (about 2 to 2.5 µm between elements). In a FoV of 120° (50° temple side to 70° nose side, see the inserted gray bar) the number of receptors per mm² is above 80,000, which is about half that of the fovea (and of the imaginary center); this value corresponds to a spacing of about 3.5 µm between the sensor elements. These numbers are, of course, specific to the carbon-based wetware of all biological systems. The basic distinctions to silicon-based hardware will be addressed in more detail in Section 2.
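As a rough plausibility check of the numbers quoted above (an illustrative back-of-envelope relation, not a value taken from [6]), an areal receptor density translates into a mean center-to-center spacing of about the inverse square root of that density:

```latex
% Mean receptor spacing s for an areal density \rho (square-lattice approximation)
s \approx \frac{1}{\sqrt{\rho}}, \qquad
\rho = 150{,}000~\mathrm{mm}^{-2} \;\Rightarrow\; s \approx 2.6~\mu\mathrm{m}, \qquad
\rho = 80{,}000~\mathrm{mm}^{-2} \;\Rightarrow\; s \approx 3.5~\mu\mathrm{m},
```

which is consistent with the spacings of roughly 2 to 2.5 µm in the fovea and about 3.5 µm in the 120° FoV quoted above.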

1.2. The Development of the Actual State of Technical Real-Time Vision Systems

In Japan, S. Tsugawa in the late 1970s investigated the lateral guidance of a road vehicle by using analog data processing in image analysis for stereo vision in vertical planes; single rows of the images of two cameras, mounted above each other (see Figure 2) and rotated by 90°, were evaluated in order to detect the guide rails at the side of the road [10]. All electronic devices that were needed were on board. Short distances were driven at a low speed. In the early 1980s, there was a gap with respect to vision-guided road vehicles in Japan. The activities were resumed when new results from the USA and Germany became public in the second half of the decade.

Figure 2. The first vision-guided road vehicle in the late 1970s, Japan [10].

1.2.1. Vision Systems on Digital Microprocessors with All Equipment on Board

The development of real-time vision systems based on digital microprocessors (µP), with all the equipment needed on board the vehicle, started in the USA and in Germany, independently of each other. In both countries, the vehicles chosen at the beginning were vans with weights from 5 to 8 tons. In the USA, the huge DARPA project "On Strategic Computing" started in 1982 [11]; the 'Autonomous Land Vehicle' (ALV) was one of three fields of application. Real-time computer vision was an essential element, since video sensors provide a much higher signal density and better capabilities for object recognition in comparison to other types of sensors. Laser-Range-Finders (LRF) have been incorporated right from the beginning for the easy determination of range to points in the environment and to objects. Carnegie Mellon University (CMU) had its own vehicle, NavLab, a van specially equipped for slow driving. At the core of the project were massively parallel computer systems for data analysis and dynamic scene understanding to be developed over the next decade or two.

In the meanwhile, several universities and research institutes were funded in order to develop software for interpreting and understanding image sequences on computers that were actually available. The typical cycle times for a single perception step were in the range of seconds to minutes. All these groups coming from Computer Science (CS) and Artificial Intelligence (AI) started by investigating rather complex vision tasks at a much slowed-down rate on available computing hardware, while at the same time massively parallel computer systems were investigated for analyzing sequences of digital images [11]. Temporal aspects were considered separately in a second step. Differences in the interpretation results of consecutive images had to be interpreted either as movements of objects in the mapped 3-D scene, or as changes due to ego-motion (or both in conjunction). A completely different approach was chosen by the author, coming from systems dynamics and control engineering, including optimal control. A detailed survey of the history of vision for ground vehicles may be found in [12].

The 4-D approach to dynamic vision: In Germany, the author had experience in working with horses in agriculture when in the 1950s a tractor replaced horses. He noted that the advantage of the tractor having multiple horsepower was sometimes offset by the disadvantage of having to pay attention to vehicle guidance all the time. Horses found their way home by themselves and even kept the learned distance to the road boundary quite well while moving. Since tractors had no sense of vision, the human driver had to be alert all the time. It became clear to him that developing a sense of vision for vehicles should be one important task for the future. After studies of mechanical and control engineering and a few years of working in flight mechanics, in 1975 he was appointed professor for control engineering at the newly founded University of the Federal Armed Forces of Germany in Neubiberg near Munich (UniBwM). He realized that the remaining 26 professional years corresponded to an increase in computing power by five to six orders of magnitude; this number was based on the observation over the recent past that computing power had increased by a factor of ten every four to five years. Trusting in further miniaturization and in the observed temporal increase in computing power, a real-time evaluation of video sequences at around 10 Hz should be achievable with a modest number of parallel µP within about 20 years. At that time, he did not know about the activities in the USA and in Japan. He used the initial funding of UniBwM to start his long-term development effort by building 'Hardware-In-the-Loop' (HIL) simulation facilities for machine vision systems similar to those on the market for training pilots [13]. When studying the visual guidance of aircraft landing approaches in the 1980s, it turned out that without the feedback of inertial sensor signals from the gaze platform to the image interpretation process, the approach that will be discussed below would not work under stronger winds and gusts. With this feedback, it has been shown to work satisfactorily [14]. All these experiences with real-world systems led the UniBwM group to include gaze control in the software development for dynamic vision from the beginning, whenever possible.
The group selected sufficiently simple tasks for the beginning, but wanted to do real-time interpretation right from the start by using standard tools for the state estimation of dynamical systems. Initially, the 'Luenberger observer' [15] had been chosen in connection with 3-D shape models for the objects perceived, and with scene models including the differential equations for the motion components of the object observed in 3-D space over time. Since models for motion and for perspective projection are in general nonlinear, linearized approximations for short periods of time had to be selected [16–20]. Due to the fact that in practical life almost every process is subjected to unknown perturbations, the transition to (Extended) Kalman Filtering (EKF) was soon made [17,19]. A lower limit of interpreting at least ten frames per second was introduced as a side constraint. In connection with the feedback of prediction errors as a basic correction step, this brought about a different kind of thinking for the overall approach. Note that this method leads to a direct representation of the scene observed in 3-D space and time, i.e., in 4-D; it might thus be seen as a first step towards consciousness.
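To make the recursive scheme described above concrete, the following minimal sketch shows one predict/correct cycle of an Extended Kalman Filter with prediction-error feedback at a 10 Hz video rate. The simple point-object motion model, the pinhole projection and all numerical values are illustrative assumptions for this sketch only; they are not the original UniBwM models.

```python
import numpy as np

DT = 0.1  # 10 Hz video cycle, the lower limit mentioned above

def f_predict(x):
    """Toy motion model: object state [X, Z, Vx, Vz] in the camera frame."""
    X, Z, Vx, Vz = x
    return np.array([X + Vx * DT, Z + Vz * DT, Vx, Vz])

def h_project(x, focal=750.0):
    """Perspective (pinhole) projection of the lateral position onto the image plane."""
    X, Z, _, _ = x
    return np.array([focal * X / Z])

def jacobian(func, x, eps=1e-5):
    """Numerical Jacobian, standing in for the analytic linearization."""
    y0 = func(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (func(x + dx) - y0) / eps
    return J

def ekf_step(x, P, z, Q, R):
    """One prediction/correction cycle of the Extended Kalman Filter."""
    # Predict state and covariance over one video cycle
    F = jacobian(f_predict, x)
    x_pred = f_predict(x)
    P_pred = F @ P @ F.T + Q
    # Prediction error: measured image feature minus predicted image feature
    H = jacobian(h_project, x_pred)
    y = z - h_project(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    # Feed the prediction error back into the state estimate (4-D representation)
    x_new = x_pred + K @ y
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new
```

Only the current best estimate (x, P) has to be carried from cycle to cycle, which is exactly the point made in the following paragraph.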

By exploiting the feedback of the prediction errors, only information from the last image of the sequence needed to be stored; the results of all the previous evaluations were stored in the best estimates for the shape- and state-values of the hypothesized (dynamical) objects. In terms of understanding road scenes, this approach had initially been tested since 1980 in a special Hardware-In-the-Loop (HIL) simulation [13]. Using road representation in terms of differential geometry made the 4-D approach particularly efficient [16]. The positive results for the autonomous visual guidance of road vehicles led in 1984 to its implementation in a 5-ton van (see Figure 3a). It had to be specially equipped with about 3 m-high standard industrial racks for all the electronic equipment like communication devices and computers (Figure 3c). The sensors for vision were two cameras (Figure 3b) on a gaze control platform (seen at right in Figure 3d from the other side). The test vehicle was dubbed "VaMoRs" (Versuchsfahrzeug für autonome Mobilität und Rechnersehen). The first results in real-world autonomous visual driving were achieved in 1986; in 1987, VaMoRs became the first autonomous vehicle capable of high-speed driving on roads when it achieved 60 mph (96 km/h, its maximum speed due to limited engine power) on a new stretch of the Autobahn not yet turned over to the public [20].

Figure 3. The first high-speed autonomous vehicle, VaMoRs, from 1986 to 2004 with three generations of vision systems [21] (sub-images from video 'VaMoRs Autobahn Dingolfing 1987').

These results motivated the company Daimler-Benz AG (DBAG) to join with UniBwM in order to develop the sense of vision for the autonomous guidance of road vehicles. This task became one of the activities in the common European framework 'EUREKA' for long-term research and developments. The Europe-wide 'PROgraMme for a European Traffic of Highest Efficiency and Unprecedented Safety' (PROMETHEUS) was modified by substituting 'machine vision' in exchange for lateral guidance of road vehicles based on electro-magnetic inductive fields generated by cables buried in the center of the lane. More than a dozen European automotive companies and about five times that number of research institutes and universities joined the sub-project, dubbed PRO-ART (derived from PROmetheus ART-ificial Intelligence). The program ran from 1987 until the end of 1994; the final demonstrations took place in regular traffic near Paris in October 1994. The open exchange between the European participants led to the fact that all major automotive companies had their own vehicle(s) equipped with body-fixed cameras and capable of lane-recognition and -following.

When in the early 1990s the England-based 'Transputers' (µP with four parallel links with a transfer rate of up to 20 megabit/s to their neighbors) appeared on the market, it allowed for the reduction of vehicle size due to less storage space being required. Two high-end sedans Mercedes SEL-500 were equipped with a new central visual perception system based on transputers and the 4-D approach [22–26]: the DBAG vehicle "Vision Information Technology Application" (VITA_2) and the UniBwM vehicle VaMoRs-PKW (in short VaMP, see Figure 4). As can be seen, the electronic devices needed for vision still occupied a relatively large volume.


Figure 4. Demonstration vehicle VaMP of UniBwM (top right, a twin to the Daimler-VITA_2); top left: components for autonomous driving. Bottom right: a view into the passenger cabin with the first small bifocal "vehicle eye" on a yaw platform, detailed at bottom left.

All central cognitive components based on the two 'vehicle eyes' with new, small 'finger cameras' of about 1 cm in diameter mounted on yaw platforms of about 12 cm in diameter (see Figure 4, lower left) have been developed by UniBwM. Since highways are rather planar and smooth, and the vehicles were comfortable with only minor perturbations in pitch, the vertical degree of freedom was left off in order to achieve a simple and small design. The system is described in [23,24].

During the tests it became apparent that for a human-like perceptual performance in the long run several additional features would be necessary: 1. to extend the image resolution to higher values for an earlier detection and finer perception of details; 2. to have more cameras available for perceiving the nearby environment; and 3. to realize the capability of perceiving areal features like homogeneous regions in black-and-white or even in color and texture. Exploiting the advantages of the 4-D approach, stereo vision would only be necessary nearby, where the point at which an object touches the ground is not visible. Further away, environmental objects allow performing range estimation based on recognized known objects (like road width, size in the image, etc.).

To handle these challenges, DBAG added, for VITA_2, fourteen additional cameras distributed around the vehicle for collecting additional information on the immediate environment (see Figure 5). Their role is described in [27]; they were not used by the perception system of UniBwM since the computing power was missing. In the main part of this article, another arrangement of cameras assembled into an 'eye', probably on gaze-controllable platforms at the top end of the structure holding the front windshield at both sides (A-frames) of a car, will be discussed, which could perform the task of the safety driver in the mentioned Paris demonstrations.

Figure 5. Camera arrangement in VITA_2 of Daimler 1994 to 1996. The rectangular insert in the roof-region shows the abstract scheme for positioning cameras (basic structure from [27]).
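The monocular range estimation from recognized known objects mentioned above reduces, under a pinhole-camera assumption (an illustrative relation, not a formula quoted from [23,24]), to

```latex
% Range Z from a known real-world width W (e.g., lane width) that appears
% with width w (in pixels) in the image of a camera with focal length f (in pixels):
Z \approx f \, \frac{W}{w}
```

so that, once an object class and its typical dimensions are recognized, a single image already yields an approximate range without stereo.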


At the same conference in 1994 [28], a second-generation two-axis platform for VaMoRs was presented. Its dynamic performance was tested by tracking a triangular traffic sign at the side of the road when approaching it. This platform was capable of executing a 20°-saccade in less than 150 ms. A video from 1996 showing the characteristics, also in slow motion, may be viewed in [29]. The resulting graph of the experiment is shown in Figure 6: curve 1.1 (green, left) shows the change of the azimuth angle during the approach. Saccade 1 brings the sign into the FoV of the high-resolution camera (red curve left). It takes five evaluation cycles until the sign is tracked in the high-resolution image (green curve 1.2, bottom and center). This image is then stored and sent to the evaluation process; afterwards, the second saccade (red curve at right) brings the viewing direction back to normal again. Similar gaze maneuvers are needed for observing traffic lights; however, they have to be tracked over time since they are time-variable. As stated in [30], this task may be a major challenge when driving in cities and on different types of state roads; the positioning of traffic lights on both sides of the road (for turn-offs) and even high above the center of the road makes the task difficult. Fixation by a high-resolution camera, while other cameras with a wider FoV keep the overall traffic situation imaged in parallel, seems to be a better approach than body-fixed cameras.

Figure 6. Detection and tracking of a traffic sign (curve 1.1); saccadic change of the gaze direction onto the sign (1), five image cycles for stable recognition, then reading the sign and storing the high-resolution image (curve 1.2); saccade 2 back to the original gaze direction (see [29]).
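The gaze maneuver of Figure 6 can be summarized as a small control sequence; the sketch below only illustrates the logic (saccade, stabilize for five cycles, store, saccade back). The 'platform' and 'tele_camera' interfaces and all timing constants are hypothetical placeholders, not the original VaMoRs software.

```python
import time

CYCLE_S = 0.04          # assumed 25 Hz image evaluation cycle (illustrative)
STABLE_CYCLES = 5       # five cycles until stable recognition (cf. Figure 6)

def read_sign_with_saccade(platform, tele_camera, sign_azimuth_deg):
    """Saccade onto a detected sign, fixate, store a high-resolution image, saccade back."""
    normal_gaze_deg = platform.current_azimuth()

    # Saccade 1: bring the sign into the small FoV of the high-resolution camera
    platform.saccade_to(sign_azimuth_deg)          # about 20 deg in < 150 ms

    # Fixation: track the sign until it has been recognized stably
    stable = 0
    while stable < STABLE_CYCLES:
        image = tele_camera.grab()
        stable = stable + 1 if platform.track_sign(image) else 0
        time.sleep(CYCLE_S)

    snapshot = tele_camera.grab()                  # store the high-resolution image
    platform.saccade_to(normal_gaze_deg)           # Saccade 2: back to normal gaze
    return snapshot
```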

The different approaches were presented in new journals like 'Machine Vision and Applications' since 1988 [20]. At IJCAI 1989 in Detroit, the 4-D approach and its applications were presented in an invited keynote address. Towards the end of the 1980s, research on autonomous ground vehicles in the USA was transferred to the Army Research Laboratory (ARL) in Aberdeen, Maryland. Its main research partner in autonomous driving was the National Institute of Standards and Technology (NIST) nearby. ARL and NIST favored a common research project with UniBwM in Germany (reviewed below). Proponents of the two approaches in the USA and Germany first met at a conference in 1986 [31] and at a high-level international symposium on robotics [32] in 1987. At this symposium, the new start in Japan in 1987 also became public. Afterwards, the development of the field of autonomous driving was promoted by frequent exchanges between the leading nations at international conferences and symposia. In 1996, ARL, along with NIST, proposed a joint US-American/German project for merging the best components developed so far on both sides.

The joint US-American/German project AutoNav: In the framework of an existing 'Memorandum of Understanding', a joint project on autonomous navigation, dubbed "AutoNav", was formulated. The goal was to bring together, on PC hardware now available on the market:

1. The overall software concept for integrating machine perception with the concepts for the planning and execution of missions on all levels, developed by the National Institute of Standards and Technology (NIST, group of J. Albus [33]);
2. The capabilities of stereo perception in real time, under development by the Princeton group under P. Burt of the Stanford Research Institute (SRI), running on dedicated hardware [34];
3. The 4-D approach of UniBwM, with active gaze control and the software packages for dynamic scene understanding and vehicle guidance [20,35].

The final purpose was to demonstrate, both with German and US test vehicles, the capability of performing a mission in 2001 on a network of major and minor roads (hardened and natural surfaces), including off-road sections prescribed by sequences of GPS-waypoints. During these trips, beside obstacles above the driven surface (designated as 'positive' obstacles), 'negative' ones (ditches and large holes below the driven surface) also had to be detected and avoided. In Germany, VaMoRs with a new, 3rd-generation vision system [36–40] was selected for the final demonstration. In the USA, a HMMWV of NIST and of an industrial group, plus a small UGV, were selected as demonstrators. Only the VaMoRs results, with its explorative 'eye' carrying five cameras, will be discussed here. Figure 7 shows the 'vehicle eye' for stereo vision built by UniBwM; it was not intended to be a first version of a practical eye for the more distant future, but a real-world system for testing the software for dynamic visual perception with multi-focal cameras and an intelligent on-board gaze control including very fast saccades.

Figure 7. Purely experimental, roof-mounted 2-dof vehicle eye for VaMoRs in project AutoNav: (a) basic structure with three cameras mounted; (b) view from the rear within VaMoRs with the five cameras as used in AutoNav.

In 1997, the third computer system selected for VaMoRs consisted for the first time of four 'Commercial-Off-The-Shelf' (COTS) Intel Dual-Pentium-IV PCs, interlinked with a new, also COTS, communication network (Scalable Coherent Interface). In 2001 and 2003, VaMoRs was capable of performing the tasks given on the right side in Figure 8 [36–40].

Figure 8. Final demo of project AutoNav: Mission performance ‘On- and Off-Road’ with active (including saccadic) gaze control for turn-offs left and right; detect and avoid negative obstacles.


Point 2 may be viewed in [41] and the rest in [42], a video summarizing the results of project AutoNav. The special 'stereo computer' of the US partner for recognizing the exact spatial geometry of a ditch had a volume of about 30 liters in 2001. In 2003, the same mission could be driven using a miniaturized real-time stereo system implemented in the meantime on a plug-in board for one of the PCs. The video film [42] gives an impression of the capabilities of gaze-controllable 'vehicle eyes' for performing the right-hand part of the mission. Again, it was noted that two gaze-controllable eyes, one each at the upper end of frame A of the vehicle, would help in improving flexibility for maneuvers. Between the computer systems used for these results and future µP for perception in the early 2020s, there will again be a difference in computing power of five to six orders of magnitude [43]. Therefore, besides developing the hardware of a 'vehicle eye', software development for the real-time perception of dynamic scenes will be the major task for the future. As a result of our experience with diverse scenes of autonomous ground vehicle guidance by vision, two cooperating eyes, one at each side above the upper outer corner of the front windshield, seem to be favorable (top of the A-frame).

1.2.2. Another Type of Vision System in This Century

Similar demonstrations as in Germany have been shown in the USA by our partners with their vehicles. The results of all American efforts up to the turn of the century led to the decision by Congress [44] that "It shall be a goal of the Armed Forces to achieve the fielding of unmanned, remotely controlled technology such that... by 2015, one-third of the operational ground combat vehicles are unmanned." This largely increased the funding possibilities for this new technology of partially autonomous driving. In addition to the activities of the ARL in the general field of research for autonomous driving in unknown environments, DARPA formulated a supply mission in well-known environments, widely exploiting the Global Positioning System (GPS) in connection with precise maps of routes to be driven. These routes were prescribed by sequences of densely positioned GPS-waypoints; the main autonomous part was avoiding obstacles above the surface to be driven. In 2002, DARPA defined such a first supply mission over 142 miles in a semi-arid area in the south of Nevada as the "Grand Challenge" [45]. Obstacle detection and avoidance with different types of lasers and with radar was the main issue, since path-following by GPS-waypoints was so tightly prescribed that even hairpin curves could be handled by local curve fitting. While the Grand Challenges in 2004 and 2005 focused on the development of autonomous vehicles that operate in an off-road environment with only limited interaction with other vehicles, the Urban Challenge of 2007 extended this concept to autonomous vehicles that safely execute missions in a complex urban environment with moving traffic [46].

This goal was promoted since a safe and effective operation in moving traffic is a basic requirement for all missions of autonomous ground vehicles. In November 2007, the CMU-vehicle "Boss" (see Figure 9a) gained the first prize. The arrangement of a large number of diverse sensors all around the vehicle was typical for most contenders (Figure 9b). The second winner, "Junior" of Stanford University, is seen in the center of the figure. All vehicles show that almost all vision-based sensors have been mounted outside the cabin, fixed onto the body of the vehicle.

Figure 9. A collection of images to yield an impression of the number of external sensors used in the Urban Challenge 2007 for perceiving the environment in short and medium range. Note the revolving Laser-Range-Finder on top of all vehicles.

After the Urban Challenge in the USA, some private companies like Google took the lead in developing capabilities for autonomous driving on roads. Google installed a special institute for the winner of the Grand Challenge 2005, S. Thrun, who also came second in the Urban Challenge with his car "Junior". The institute was provided sufficient funding for equipping up to 10 vehicles for test driving. The rotating Laser-Range-Finder (LRF) with 64 active units arranged above each other, which had been so successful in the Urban Challenge, became the main sensor for obstacle detection. All vehicles had rotating laser-range-finders on top for 360° obstacle detection. In addition, the system relied extensively on GPS data in conjunction with precise local maps. The vision part relied on the stored environmental data of visually predominant stationary objects; these data were not allowed to be older than two days. This set-up differs significantly, in terms of visual perception, from that developed at UniBwM in Germany. While the latter may be called 'scout-type vision', since it relies much less on stored data, the former type from Google, as well as most of the others, may be dubbed 'confirmation-type vision' [47]. This, of course, reduces the number of methods and algorithms needed for object recognition, but it requires a lot of preparatory work for each new application after heavy changes due to weather conditions (like snow) or after catastrophic events (like a hurricane or an earthquake). Most industrial developers of assistant systems for driver support around the globe followed the easier route of confirmation-type vision. The exception is the company Mobileye, founded in Israel in 1999, which in 2017 was taken over by Intel Corporation [48,49]. Contrary to UniBwM (with a bi- or tri-focal set of gaze-controlled cameras), Mobileye initially opted for a single camera, mounted and fixed onto the body as a primary sensor for enabling autonomous driving. Due to its capability of interpreting the shapes (like other vehicles and pedestrians) as well as textures and colors of structured objects and areas in perspective projection, it avoided LRF altogether. Early on, the company started developing simple vision systems on miniature hardware, i.e., proprietary computation cores as fully programmable accelerators. Their system series 'EyeQI', where "I" now runs from 1 (2008) to 5 (2020), covers a wide range of performance levels, also for several cameras and sensors in parallel. The system EyeQ2 allowed input from two cameras (up to 2048 × 2048 pixels) and had the computing devices needed for perception integrated on an electronic board (see Figure 10). In 2020, EyeQ5 is predicted to allow input from more than 16 multi-mega-pixel cameras, plus laser and radar; its computational power targets range from 15 to over 200 trillion operations per second, while drawing only moderate electrical power. Another company (NVIDIA) [43] predicts up to 300 tera-operations per second (3 × 10^14) in computing power on its boards in the early 2020s. In 2019, Mobileye claimed to have deployed vision safety technology in over 40 million vehicles in connection with many tier-1 automotive companies [48]. They expect ride-share vehicles without a driver (level 5) as a next step soon.

Figure 10. Computing power of specific vision systems [48].

Figure 10. Computing power of specific vision systems [48].

With respect to the arrangement of cameras on road vehicles, no new proposals have been noted. For driver assistance systems around the globe, they are mostly mounted at or behind the inside center top of the front windshield (similar to Figure 11).

The development in Europe in this century: After the Prometheus project, the European automotive companies concentrated on simple vision systems for road and lane recognition with the new µPs coming along steadily. For obstacle recognition, LRF and radar have been mainly favored. One of the most active industrial groups is that of Daimler. It realized the efficient stereo algorithm 'Semi-Global Matching' on an FPGA in 2008 [50]; Figure 11 shows one result with the demo-vehicle 'Bertha' [30], which after 125 years repeated in 2013—this time fully autonomously—the famous first long-distance drive of Mrs. Benz with an automobile in the year 1888 from Mannheim to Pforzheim.

Figure 11. Right: Mounting of stereo cameras in the demo-vehicle "Bertha" of Daimler 2013; left: Objects detected by stereo vision with the 'stixel' approach [30,50].

Almost all automotive companies and tier-1 suppliers in Europe and in Asia started activities in the field of autonomous driving, either on their own or in connection with research institutions. Radar and LRF were widely used for obstacle detection and range estimation, while GPS and precise maps were relied on for navigation; a group of German automotive companies even bought their own company for keeping maps up-to-date and precise. The revolving LRF of Velodyne also found many customers for precise range estimation all around the vehicle. Figure 12 shows, at the right, an LRF-image derived from 64 laser beams rotating at 10 revolutions per second and covering a range of up to 100 m. It has been taken from the test vehicle MuCAR-4 of UniBwM (a VW-Tiguan, at the center) [51]. The main sensors for the autonomous exploration of previously unknown, unstructured environments are the multi-focal cameras on the gaze control platform, active in yaw and pitch (at the left). Together, the two complementing sensors allow for the perception of traversable ground as well as for the detection and recognition of objects relevant for the current mission. This combination is probably the most powerful visual perception system available at the moment.

Figure 12. Agile multi-focal camera platform (left) as the main sensor, supported by a high-resolution 360° rotating laser scanner mounted on the roof (center). Right: Laser-image in perspective projection. UniBwM/LRT/TAS [51].

The interesting question is whether a vision system (as will be discussed in Section 4) allows one to get rid of the expensive LRF without losing the perception capabilities achieved up to now. It might even be possible to improve the perception capabilities of future vehicles up to the performance level of humans, especially in terms of the resolution of more distant details for the perception of color and texture.

2. Reasons behind the Differences between Biological and Technical Vision Systems

Biological systems have evolved over millions of generations in carbon wetware with the selection of the fittest with respect to the ecological environment encountered; these systems are based on completely different properties of the underlying material as compared to technical vision systems on silicon (chips). The biological electrochemical units have switching times in the millisecond (ms) range; the traveling speed of signals in the nerves is in the range of 10 to 100 m/s. Cross-connections between units exist in abundance (1000 to 10,000 per neuron). A single brain consists of up to 10^11 of these units. The main processing step is the summation of the weighted input signals, which contain widely (up to now) unknown (multiple?) feedback loops [9]. These systems need long learning times and adapt to new situations only slowly.

In contrast, technical substrates for sensors and µPs have switching times in the nanosecond range (a factor of about 10^6 as compared to biological systems). They can be programmed easily, and they may have various computational modes between which they can switch almost instantaneously; however, the direct cross-connections to other units are limited in number (one to six, usually) but may have very high bandwidths (in the GB/s range). Initially, digital processors were bulky and needed quite a bit of electrical power for running and cooling. Since the advent of the digital µP in the 1970s, the computing power of these units has increased by a factor of ten every four to five years. In [47] (Figure 1), a survey of the transistor count over time from 1980 to 2010 of such µPs on a semi-logarithmic scale has been cited; a linear extension of the interpolating line over time suggests that this number may reach the number of neurons in a human brain in the early 2020s. Of course, these numbers are not comparable directly, due to the very different properties of the units; however, they may indicate that, with respect to complexity, similar overall systems should soon be within reach on a technical basis. Up to now, the computing power of µP-systems has increased by more than a factor of 1 million since the early 1980s, while at the same time the volume and the need for electrical power have decreased dramatically.
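
To make the extrapolation above concrete, the following minimal sketch fits a straight line on a semi-logarithmic scale through two assumed anchor points (roughly 3 × 10^4 transistors per µP in 1980 and 2 × 10^9 in 2010; both values are illustrative assumptions of this sketch, not data taken from [47]) and asks when that line crosses about 10^11, the order of magnitude of the neuron count mentioned above.

```python
import math

# Illustrative anchor points (assumed here, not taken from [47]):
# transistor count per microprocessor in 1980 and in 2010.
t0, n0 = 1980, 3e4
t1, n1 = 2010, 2e9

# Straight line on a semi-logarithmic scale: log10(n) = a + b * year
b = (math.log10(n1) - math.log10(n0)) / (t1 - t0)
a = math.log10(n0) - b * t0

target = 1e11  # order of magnitude of the number of neurons in a human brain
year_reached = (math.log10(target) - a) / b
print(f"growth: x10 every {1.0 / b:.1f} years; 1e11 reached around {year_reached:.0f}")
# -> about a factor of ten every 6 years, crossing 1e11 in the early 2020s
```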

Thus, computing power does not seem to be a limiting factor any more for the future. Therefore, we will concentrate on system design and computer software, taking the facts mentioned in the introduction into account.

3. Conclusions for the Design of a Technical Eye of High Performance for Automotive Applications

The partially separated processing of image data both in the eye and in the visual cortex of the human vision system based on carbon molecules is not mandatory for technical vision systems based on silicon molecules. Sensors and computers may be kept separate. In fact, several high-resolution images may be communicated in parallel from the sensors to the computing elements; however, the multi-focal, saccadic type of image sequence analysis may be of advantage in technical systems too. Requirements that need to be fulfilled are outlined next.

In the near environment of an autonomous vehicle, a large FoV should be covered. If the road being driven and a road crossing have to be viewed simultaneously, 110° to 120° seem to be adequate. With the number of sensing elements limited by their size and the space available in an eye (or on a camera chip), the resolution of the imaging sensor is bounded. The highest resolution required for object-detection, -recognition and -tracking at larger distances may lead to a large number of sensor elements. Therefore, the locations and the potential viewing directions of all cameras on the vehicle are essential design parameters. Are many cameras with a small neighboring FoV, possibly collected group-wise together as 'eyes' and mounted in a fixed way onto the body of the vehicle, a good solution? In the frontal range from −115° to +115° relative to the vehicle heading, the highest resolution, corresponding to 0.2 mrad per pixel in the image, should be available. {General remark: With respect to the localization of intensity edges, the high-resolution results should correspond to the capability of the human eye (0.2 mrad in all directions at all times).} This leads to about 20,000 pixels covering the frontal angular range of 230°. However, this high resolution is not required all the time in all directions. Usually, in road traffic, the highest resolution is needed down the road ahead in the driving direction. Approaching cross-roads, it is desirable to have this high resolution available for small periods in time over the complete FoV, while less resolved images may suffice between these periods. This will be discussed in Section 4.1 for two frontal 'vehicle eyes' mounted in a fixed way at the upper end of the left and right A-frame of the vehicle.

Since highly resolved images are only needed for very small regions of the total FoV, the question arises of whether a better approach is to collect these huge amounts of data only when and where they are actually needed. This has led to multi-focal saccadic vision in biological systems. Of course, this complicates the software development for technical dynamic scene understanding. However, including spatiotemporal models for the interpretation of image sequences right from the beginning in feature extraction, as done in the 4-D approach since the early 1980s [16], has been shown to be very efficient. The volume of data to be handled may be reduced by orders of magnitude. Experience with Hardware-In-the-Loop (HIL) simulations and with the UniBwM test vehicles VaMoRs and VaMP in the 1990s has shown that a joint evaluation of multi-focal image sequences with saccadic changes of the gaze direction is well suited for technical dynamic vision too. As one of many alternative designs for active gaze control, a tri-focal vision system with focal lengths separated by factors of 3 and 4 will be discussed in Section 4.2.
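
As a quick check of these numbers, the short sketch below uses the same simple relation as the rough estimates in this article (covered angle ≈ number of pixels × angular resolution, i.e., a pinhole-type model without distortion): it computes how many pixels are needed to cover the frontal range of 230° at 0.2 mrad/pixel, and how large an angle a single 4000-pixel line covers (a number used again in Section 4.1). The function names are only for illustration.

```python
import math

RES_MRAD = 0.2  # requested angular resolution: 0.2 mrad per pixel

def pixels_for_fov(fov_deg: float, res_mrad: float = RES_MRAD) -> float:
    """Pixels per line needed to cover fov_deg at res_mrad per pixel
    (simple linear relation, as used for the rough estimates here)."""
    return math.radians(fov_deg) * 1e3 / res_mrad

def fov_of_line(pixels: int, res_mrad: float = RES_MRAD) -> float:
    """Field of view in degrees covered by one image line of given width."""
    return math.degrees(pixels * res_mrad * 1e-3)

print(round(pixels_for_fov(230)))   # ~20071 -> about 20,000 pixels for 230 deg
print(round(fov_of_line(4000), 1))  # ~45.8  -> one 4000-pixel line covers about 46 deg
```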

4. Alternative Lines of Development

The answer to the question "which is the preferable approach to follow" depends on the field of application. Are the vehicles exposed to fast and heavy angular perturbations leading to motion blur in the images under unfavorable lighting conditions? If so, gaze control for image stabilization is necessary anyway; the question is then: what is the best angular range of the gaze control (amplitudes and rotational rates)? Today, the prevalent opinion in the automotive industry is that motion blur is no issue in modern imaging sensors for today's applications. Therefore, version-1 of a technical eye is considered here, with all cameras mounted in a fixed way onto the body of the vehicle. The cameras that are actually available may then be very small and light-sensitive; they may also have a high dynamic range with respect to the light intensity. Whether very small cameras of the future will have sufficiently high capabilities in these areas at only slightly higher costs than standard cameras will be a deciding factor between the two types of technical vision systems considered next.

4.1. Eyes with Cameras Mounted in a Fixed Way onto the Body of the Vehicle

The simplest extreme solution would be mounting many cameras in a fixed way onto the vehicle body wherever required (see Figure 5). If average human visual acuity is the goal, a camera with 4000 pixels per line will cover about 46°; five cameras will be required to cover a FoV of 230° horizontally (see Figure 13). With a requested vertical FoV of about 45°, a set of five cameras is needed. Two eyes are chosen here, one each at the upper end of the A-frame at each side of the front windshield (see Figure 14). In order to allow coverage of the sides of the vehicle, the axis of symmetry of each eye is rotated by 46° away from the center line of the vehicle. Thus, the left and the right eye are identical with respect to their structure, but are mounted differently. Since only the near environment is of interest to the oblique sides (away from 0° (ahead) and 90° across the vehicle), the two outer cameras of the arrangement may have a reduced resolution, shown with light-gray longer-dashed lines in Figure 13.

Figure 13. A potential technical eye mounted in a fixed way on the vehicle body; it has seven cameras with three resolutions.

The arrangement chosen allows a single circular arc of sensors around a vertical axis, mounted in a fixed way onto the body of the vehicle. If cameras with a 2000 × 2000 pixel resolution should be much less expensive or smaller, the same FoV as in Figure 13 may be covered by 20 cameras (instead of the five here). Vertical pairs would be arranged around the vertical axis, each pair rotated by 23° relative to its neighbors. This would yield less distortion in the areas of transition between adjacent images.

For an efficient understanding of automotive scenes it is important to start the computational analysis with low-resolution images from the regions nearby (at the bottom of the image). Therefore, when using only high-resolution sensors (without the ones painted in blue in Figure 13) it makes sense to compute several (2 × 2)-pyramid levels to start with. The next higher pyramid level is computed by forming one pixel on this level by averaging four pixels in a square on the lower level. This reduces the amount of image data on the second pyramid level by a factor of 16. Starting from 0.2 mrad/pixel, on level 2 the resolution will be 0.8, and on level 4 it will then be 3.2 mrad/pixel. Level 4 means that at a 10 m viewing range, a lane marking of 12 cm width will be covered by three to four pixels, a reasonable number for detecting the orientation of an edge precisely. A car with a body-width of 1.8 m will be covered by about 60 pixels, sufficient for recognizing some details, e.g., the black tires underneath being four to eight pixels wide (12 to 25 cm). The number of pixels on this pyramid level, however, is less than 0.4% (a factor of 1/256) of the original high-resolution level.

Since the computation of pyramid images requires quite a bit of computing power, it seems worth considering whether directly sensed low-resolution images, as sketched in blue color in Figure 13, could be of advantage; in any case they would provide redundant images, increasing safety. With 1 K pixels per row, an angular range of 115° can be covered with a resolution of about 2 mrad/pixel. Thus, two cameras with a side ratio of 16:9 will cover a FoV of about 2 × (115° × 65°). The number of pixels per frame for the two cameras is then 1.125 megapixel (MP). At 40 m distance, one pixel of this camera covers about 8 cm normal to the line of sight.
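
A minimal sketch of this (2 × 2)-pyramid bookkeeping is given below; it only reproduces the numbers quoted above (angular resolution per level, remaining share of the data, and the ground coverage of one pixel at a 10 m range) and is not an image-processing implementation. The helper name and level indexing are mine.

```python
def pyramid_level(level: int, base_res_mrad: float = 0.2):
    """Resolution, data fraction and example ground coverage for a (2 x 2)-average
    pyramid: each level halves the pixel count per axis (level 0 = original image)."""
    res = base_res_mrad * 2 ** level              # mrad per pixel on this level
    data_fraction = 1.0 / 4 ** level              # remaining share of the pixels
    coverage_cm_at_10m = res * 1e-3 * 10.0 * 100  # one pixel at 10 m range, in cm
    return res, data_fraction, coverage_cm_at_10m

for lvl in (0, 2, 4):
    res, frac, cov = pyramid_level(lvl)
    print(f"level {lvl}: {res:.1f} mrad/pixel, {frac:.2%} of the data, "
          f"{cov:.1f} cm per pixel at 10 m")
# level 0: 0.2 mrad/pixel, 100.00% of the data, 0.2 cm per pixel at 10 m
# level 2: 0.8 mrad/pixel, 6.25% of the data, 0.8 cm per pixel at 10 m
# level 4: 3.2 mrad/pixel, 0.39% of the data, 3.2 cm per pixel at 10 m
```
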
The rear view of the lower body of a car of size 1.8 × 0.9 m at that distance will be covered by about 22 × 11 pixels. This is reasonable for alerting interest in special sets of features in road scenes. Compared to the total number of pixels resulting from a design as shown in Figure 13 on the high-, medium- (the second), and low (the fourth) pyramid levels of higher-resolution original images (53.6 MP), this means a reduction in the data volume by a factor of almost 50.

Under standard conditions, only this reduced data volume is analyzed. The medium resolution of 0.8 mrad/pixel (on the 2nd pyramid level) means that at a 50 m distance one pixel covers an area of 4 × 4 cm normal to the line of sight. This is sufficient for lane detection up to about a 70 m range. At the fourth pyramid level, one pixel covers 16 by 16 cm at a 50 m distance. This may be sufficient for detecting larger objects with contrasting surfaces; even four to six pixels often suffice for detecting dominant features of objects, e.g., a large difference in brightness. In case more information on such a set of features and its environment is required, the higher-resolution image data of this region (say two low-resolution pixels added to each side around the center of the point of interest) need only be transferred for a more extensive analysis.

Thus, an object of size 6 × 3 pixels on the fourth pyramid level would trigger the transfer and analysis of a sub-image of size 1120 pixels of the medium-resolution image around the detected center. Here, the feature analysis will indicate whether an even finer resolution is needed and in which subarea.
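
The sub-image size quoted above follows from simple bookkeeping: enlarge the detected region by two low-resolution pixels on each side and scale by the factor of four per axis between two pyramid stages. A brief sketch (the function name is assumed for illustration):

```python
def medium_res_subimage(n_obj: int, m_obj: int, border: int = 2, scale: int = 4) -> int:
    """Pixels to transfer from the medium-resolution image (two pyramid stages finer)
    for an object detected with size n_obj x m_obj pixels on pyramid level 4."""
    return (n_obj + 2 * border) * (m_obj + 2 * border) * scale * scale

print(medium_res_subimage(6, 3))  # -> 1120 pixels around the detected center
```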

For the four medium-resolution images of both eyes with indices 1 and 5 in Figure 14 (dashed circular double-arrows), the data reduction per object (set of features) is two to three orders of magnitude when using two pyramid stages. For the entire front hemisphere from −115° to +115°, high-resolution image data are stored and selectively available on demand.

The total FoV covered horizontally by both eyes ranges from −161° to +161°. In the frontal sub-region with index 2 (marked in magenta in Figure 14), a binocular stereo interpretation is possible. Using the 4-D approach for dynamic vision [16], these regions could be adapted in size and predicted over time for the next image in order to directly track the object in the high-resolution image sequence. Computing any pyramid data on higher levels does not need to be done in this case.

In order to sum this up: Figure 14 shows three cameras "Bi" looking to the rear (in brown color), in addition to the multiple fields of view Li and Ri of the five cameras at the front of a car. Two eyes, one each at the left- (L in blue) and right-hand side (R in red), are shown. In the figure, a high-resolution camera B2 for the back side has been chosen for the center (from −23° to +23°). Medium-resolution image data are available to the rear sides (indices 1 and 3). Thus, the rear hemisphere is covered from −69° to +69° if the same cameras are used as in the front (for simplicity of service); of course, B1 and B3 could be selected with a larger FoV in order to cover the entire rear hemisphere. However, the gain would only be the two small white triangles to the side.

Figure 14. Car with a pair of body-fix eyes in the front at the left (Li) and right side (Ri) of the roof, and a set of three cameras looking to the rear (Bi).

What amount of data and computational operations is needed to visually track ten objects in parallel, following either the pyramid approach or separately sensed image data on three focal levels? A rough estimate shows that only a load reduced by two to three orders of magnitude is required to provide the proper high-resolution image data to the processors for the scene evaluation. This indicates that an early transition to multiple hypothesized single objects moving in the observed scene considerably enhances the efficiency of dynamic scene understanding. If the two additional low-resolution cameras are chosen to avoid the need for computing image pyramid levels, and if one of the focal spacings is reduced to 3 (yielding a total of 12 instead of 16), this lessens the number of operations to about 0.2% of those for all higher-resolution images. The positions of the corresponding regions for ten objects may, of course, change from frame to frame within an FoV of 230° × 46° (see Figure 14, regions with indices 2 to 4). By accepting small delay times for saccadic gaze control, the reduction in data volume can be further improved, even with higher capabilities in resolution.

4.2. Gaze-Controllable Eyes with Tri-Focal Saccadic Vision

Assuming that developments in the fields of tiny video cameras and corresponding gaze control platforms will continue to deliver smaller units, the camera specifications chosen for the design considered are more likely to be conservative for the long-term developments in automotive sensing. About 5 cm (or less) seems reasonable as the diameter of a 'vehicle eye'. The capability of very fast changes in gaze direction to sets of interesting features allows for the reduction of the data rates, as discussed above. This may be achieved with gaze control at the expense of small delay times until scene understanding is achieved on the interpretation level. For humans, this delay time is in the order of 0.3 to 1 s, i.e., >7 video frames at 25 Hz (>9 at 33⅓ Hz). This is well achievable with technical perception systems [16].

If the maximal resolution is requested to be the same as that assumed above (0.2 mrad/pixel), and if the lateral range to be covered at a 200 m distance is 32 m (about the width of a four-lane highway in both directions), the number of pixels needed per line is 800; they cover an angle of about 9°. A lane marking of 12 cm width is covered by three pixels, and the lower body of a car of size 1.8 × 0.9 m is covered by about 1000 pixels in total. This resolution also allows for good recognition of the lower parts of the wheels (16 to 28 cm wide) seen from the back [52,53]. These features at a specific distance to each other below a uniform area allow for the recognition of a car as opposed to a big box.

With the active control of the gaze direction in the range of 150° horizontally (90° away from the car, 60° over the car body) and 30° vertically, a tri-focal set of four cameras (see Figure 15) allows the same high-resolution FoV as discussed in Section 4.1, even in a larger total region. While for automotive applications the low and medium resolution cameras may have rectangular image sizes, for the high-resolution camera a quadratic image with about 800 pixels in each direction may be sufficient. The corresponding data volumes and rates are given in Table 1. Here, the data rate for one object tracked at all resolutions is 0.21 GB/s (for one eye with full redundancy on three levels of resolution). Using the same set of two low-resolution cameras for the gaze-controllable eye, their data rates are 0.054 GB/s (at 25 Hz; 0.072 GB/s for 33⅓ Hz). These cameras are mounted such that at a gaze direction of 90°, their FoV covers the entire side of the vehicle; this yields their gaze directions of ±32.5° to both sides of the main gaze direction of the eye. As a consequence, the front region is covered in a multiple-redundant way by two low-resolution cameras and one medium-resolution camera of each eye in a FoV of 50° (marked orange in Figure 15).

Figure 15. 'Vehicle eye' with a tri-focal set of four cameras: two times 115° FoV with low resolution, 55° with medium and 9.2° with high resolution.

Table 1. Parameters of a gaze-controllable 'vehicle eye'.

Type (resolution)          Low                Medium             High         Remark
Fields of view (in °)      115 × 62           55 × 31            9.2 × 9.2    left (−) and right (+) camera for 'low'
Resolution (mrad/pixel)    2.4 (¼ of med.)    0.6 (⅓ of high)    0.2          acuity of edge localization
Pixels/line                800                1600               800          rough estimates according to a pinhole model
Number of lines            450                900                800
Data volume/frame          2.16 MB            4.32 MB            1.92 MB      3 bytes/pixel; sum = 8.4 MB/cycle
Data rate at 25 Hz         0.054 GB/s         0.108 GB/s         0.048 GB/s   sum = 0.210 GB/s
Data rate at 33⅓ Hz        0.072 GB/s         0.144 GB/s         0.064 GB/s   sum = 0.280 GB/s
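
The volumes and rates in Table 1 can be reproduced with the short sketch below (3 bytes per pixel, two low-resolution cameras per eye, 1 GB taken as 10^9 bytes); the camera list is simply a transcription of the table.

```python
# (pixels per line, number of lines, cameras of this type per eye)
cameras = {
    "low":    (800, 450, 2),   # two wide-angle cameras per eye
    "medium": (1600, 900, 1),
    "high":   (800, 800, 1),
}
BYTES_PER_PIXEL = 3

for rate_hz in (25.0, 100.0 / 3.0):            # 25 Hz and 33 1/3 Hz
    total_gb_per_s = 0.0
    for name, (px, lines, count) in cameras.items():
        mb_per_frame = px * lines * count * BYTES_PER_PIXEL / 1e6
        gb_per_s = mb_per_frame * rate_hz / 1e3
        total_gb_per_s += gb_per_s
        print(f"{name:6s}: {mb_per_frame:.2f} MB/frame, {gb_per_s:.3f} GB/s at {rate_hz:.1f} Hz")
    print(f"sum: {total_gb_per_s:.3f} GB/s")
# 25 Hz:   2.16 / 4.32 / 1.92 MB per frame -> 0.054 + 0.108 + 0.048 = 0.210 GB/s
# 33.3 Hz: -> 0.072 + 0.144 + 0.064 = 0.280 GB/s
```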

Choosing a high-sensitivity black-and-white camera for low resolution (large field of view) and color cameras for medium and high resolution would be closer to the capabilities of the human eye. Delay times for saccades of up to 40° are in the range of a few standard video cycle times (around two to three tenths of a second). In summary, by choosing properly designed gaze-controllable 'vehicle eyes', the amount of image data to be handled may be reduced drastically. Figure 16 shows typical images scaled at a factor of three (upper two images) and four (lower two images) apart; they were taken towards the end of the last century with VaMP and with the camera set inserted in the upper image. Here, color cameras had been selected for the low and medium resolution, while a highly sensitive b/w-camera was used for the high resolution. The details resolved in the upper image are impressive when comparing this region with the rectangular sub-images marked in the lower two images. Medium resolution is good for object recognition that is not too far away, and low resolution fits best for recognizing the entire situation. High resolution is needed in a few smaller regions only. Notice that the license number of the second vehicle in front and even a phone number written on its body are well readable.

For an object of size n_ob × m_ob in a low-resolution image, extending the search region by two pixels to each side, the resulting region in the high-resolution image is (n_ob + 4) × (m_ob + 4) × 12 × 12 pixels around the center. An example with size 6 × 3 low-resolution pixels for an arbitrary object yields a data volume of 10,080 pixels/frame and a data rate from a high-resolution search region of 0.756 MB/s.
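
The 10,080 pixels and 0.756 MB/s quoted above follow from the same kind of bookkeeping, now with the factor of 12 per axis between low and high resolution (2.4 vs. 0.2 mrad/pixel), 3 bytes per pixel and 25 Hz; a brief sketch (the function name is assumed):

```python
def high_res_search_region(n_obj: int, m_obj: int, border: int = 2, scale: int = 12,
                           bytes_per_pixel: int = 3, rate_hz: float = 25.0):
    """Pixel count and data rate of the high-resolution search region belonging to
    an object of n_obj x m_obj pixels detected in the low-resolution image."""
    pixels = (n_obj + 2 * border) * (m_obj + 2 * border) * scale * scale
    mb_per_s = pixels * bytes_per_pixel * rate_hz / 1e6
    return pixels, mb_per_s

print(high_res_search_region(6, 3))  # -> (10080, 0.756): 10,080 pixels, 0.756 MB/s
```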


Figure 16. Tri-focal images of a traffic scene; VaMP 1999 (cameras as insert).

InsteadFor an object of losing of size performance nob × mob in in perception a low-resolution compared image, to cameras extending mounted the search in a fixedregion way, by gazetwo controlpixels to allows each side, for athe direct resulting compensation region in ofthe rotatory high-resolution perturbations image by is (n feedbackob + 4) x from(mob + signals 4) × 12 from × 12 inexpensivepixels around inertial the center. sensors An on example the pointing with platform.size 6 × 3 Figurelow-resolution 17 shows pixels a reduction for an inarbitrary amplitude object by moreyields than a data an volume order of of magnitude 10,080 pixels/frame in pitch for and a brakinga data rate maneuver from a high of VaMoRs.-resolution In search addition region to this, of via0.756 angular MB/s. rate commands to the gaze control, the tracking of fast-moving objects can be done for a reductionInstead of motionof losing blur; performance this is especially in perception valuable compared under low to lightcameras conditions. mounted In in general, a fixed it way increases, gaze thecontrol competence allows for in a perceiving direct compensation dynamic scenes. of rotatory In special perturbations application by areas, feedback this advantagefrom signals may from be decisive;inexpensive whether inertial it issensors favorable on the for pointing road vehicles platform. as well Figure has yet 17 toshows be seen. a reduction in amplitude by more than an order of magnitude in pitch for a braking maneuver of VaMoRs. In addition to this, via

Figure 17. Inertial gaze stabilization for a braking maneuver with VaMoRs.

Even small angular perturbations on the vehicle of 1° per frame (40 ms at 25 Hz) shift the corresponding input-ray of a body-fix sensor to a location about 3.5 m perpendicular off to the side at 200 m distance; this corresponds to the same point in the real world being shifted by about 87 pixels laterally in the high-resolution image. Due to the small aspect angles on the longitudinal axis, the effect on the depth perception is more pronounced. This would cause large search activities in the next image in order to understand the image sequence. With the 4-D approach, good predictions for gaze control can be made by taking both object motion and ego-motion systematically into account.
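
These two figures follow directly from the geometry (small-angle relation for the pixel shift at 0.2 mrad/pixel):

```python
import math

perturbation_deg = 1.0        # angular perturbation per 40 ms frame
distance_m = 200.0
shift_m = distance_m * math.tan(math.radians(perturbation_deg))
shift_pixels = math.radians(perturbation_deg) * 1e3 / 0.2   # high resolution: 0.2 mrad/pixel
print(f"{shift_m:.1f} m off to the side, about {shift_pixels:.0f} pixels")  # ~3.5 m, ~87 pixels
```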

4.2.1. Considerations with Respect to Levels of Resolution for 'Vehicle Eyes'

Though animals with more than two eyes can be found, two eyes with the capability of fast gaze control predominate by far in the biological realm. This allows stereo vision in the near range and redundancy in case one eye fails. In combination with active gaze control, the contradictory requirements of resolution on the one side, and the number of cameras needed on the other side, are resolved by choosing more complex imaging sensors (see Figure 1 for the biological realm). Vertebrate eyes have areas of lower resolution over wider fields of view and (very) high resolution in the central area. Usually, the transition is gradual. In the technical example discussed here, three discrete levels of resolution are chosen in order to use three types of cameras with a constant resolution in their FoV. The corresponding object discovered with a low resolution may then be analyzed with an image resolution four or even 12 times higher than needed just for detection. The discrete values of the camera parameters chosen here are not meant to be the optimal ones; these depend on the criteria used. The values chosen are just reasonable numbers based on our experience with multi-focal technical eyes. It has been discovered that the recognition of objects after a saccade is more efficient if the focal spacing is not larger than a factor of three to four. Via feedback of prediction errors for optical features to the control of the gaze direction, the interesting set of features may be automatically tracked, thereby reducing motion blur and allowing for better object recognition and tracking.

Let us assume that a highly sensitive video sensor to be used will have 800 pixels per row. If the requested total FoV of the low-resolution part of the eye is 115°, one pixel in this array will have a lateral coverage of ≈2.4 mrad; the corresponding lateral ranges in cm at seven typical distances for automotive applications are given in row 3 of Table 2.

Table 2. One pixel covering δ mrad of angle means a lateral range in cm at a distance Lx m.

Distance Lx in m:              2       6.25    12.5    25      50      100     200
angle δ in mrad                (lateral coverage in cm of 1 pixel normal to the line of sight)
low       2.4                  0.48    1.5     3       6       12      24      48
medium    0.6                  0.12    0.375   0.75    1.5     3       6       12
high      0.2                  0.04    0.125   0.25    0.5     1.0     2.0     4
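
The rows of Table 2 are generated by the simple relation coverage = δ · L (with δ in rad), as the sketch below reproduces:

```python
distances_m = (2, 6.25, 12.5, 25, 50, 100, 200)
resolutions_mrad = {"low": 2.4, "medium": 0.6, "high": 0.2}

for name, delta_mrad in resolutions_mrad.items():
    # lateral coverage of one pixel normal to the line of sight, in cm
    row = [round(delta_mrad * 1e-3 * L * 100, 3) for L in distances_m]
    print(name, row)
# low    [0.48, 1.5, 3.0, 6.0, 12.0, 24.0, 48.0]
# medium [0.12, 0.375, 0.75, 1.5, 3.0, 6.0, 12.0]
# high   [0.04, 0.125, 0.25, 0.5, 1.0, 2.0, 4.0]
```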

Perpendicular to the line of sight, one low-resolution pixel will cover about 48 cm at a 200 m distance; road vehicles seen from the back will thus be covered by about 6 to 12 pixels. If their brightness is quite different from the surroundings, this is sufficient for detecting that there may be something of interest; however, no recognition is possible with this low resolution. If the highest resolution is requested to be 4 cm (0.04 m) per pixel perpendicular to the line of sight at the distance L0.04 = 200 m, the focal partitioning given in the second column of the table results.

The medium-resolution camera with 1600 pixels per row (column 3 in Table 1, and row 4 in Table 2) will be able to see a traffic light located 2.5 m to the side at distances above about 5 m when looking along the centerline of the first lane of a road. It allows for the detection of a traffic light nearer than 100 m in range. The low-resolution camera covers a single traffic light of 0.2 m diameter by about 9 pixels below a range of about 25 m. If the high-resolution camera tracks the traffic light, the two low-resolution cameras remain capable of mapping the entire traffic situation in front of the vehicle nearby. The gaze direction during this fixation will change similar to the upper curve shown in Figure 6. However, traffic lights on the different types of roads in cities and around rural roads may be found in a much larger environment of the road (on both sides and even high above the road). When moving towards the lights, this has proven to be a challenge for cameras mounted in a fixed way onto the body of the vehicle. This capability of fixation onto an object, either moving itself or due to ego-motion (or both together), is one of the big advantages of active gaze control.
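
The range thresholds above follow from the camera geometry; the short sketch below checks at which distance a traffic light 2.5 m to the side of the gaze direction enters the 55° FoV of the medium-resolution camera, and roughly how many pixels a light of 0.2 m diameter covers (pinhole-type relations; the helper names are mine, and the pixel counts are only an approximate plausibility check of the 25 m and 100 m figures).

```python
import math

def min_distance_in_fov(lateral_offset_m: float, fov_deg: float) -> float:
    """Distance beyond which a point lateral_offset_m to the side of the
    gaze direction lies inside a horizontal FoV of fov_deg."""
    return lateral_offset_m / math.tan(math.radians(fov_deg / 2))

def pixels_on_light(diameter_m: float, range_m: float, res_mrad: float) -> float:
    """Approximate number of pixels covering a circular light of given diameter."""
    d_pix = diameter_m / (res_mrad * 1e-3 * range_m)   # diameter in pixels
    return math.pi / 4.0 * d_pix ** 2                  # area of the disc in pixels

print(round(min_distance_in_fov(2.5, 55), 1))   # ~4.8 m  -> visible above about 5 m
print(round(pixels_on_light(0.2, 25, 2.4), 1))  # ~8.7    -> about 9 pixels (low res, 25 m)
print(round(pixels_on_light(0.2, 100, 0.6), 1)) # ~8.7    -> medium res at 100 m range
```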

4.2.2. Number of Gaze-Controlled Eyes Needed

With a horizontal field of view that is two times 115° for the low-resolution cameras of a single eye (see Figure 15), and a range of potential gaze directions in azimuth from −90° (left side) to +60° (right side of Figure 18) for the left eye (−60° to +90° for the right eye), most of the interesting environmental regions around a vehicle can be covered with two eyes when properly located. In the central range of ±60° around the heading direction of the vehicle, even redundant coverage with a high resolution is possible by controlling each eye separately (called vergence). The part (H) at the center is the standard orientation straight ahead, as given in Figure 18; the part ('Side', at the left) shows the extreme yaw angle away from the body for the left eye. For this gaze direction, the low-resolution (wide-angle) camera to the outward side of the eye covers the entire side of the vehicle (shown in blue color, also in Figure 19).

Figure 18. Trifocal gaze-controlled (left-side) 'vehicle eye' with four cameras: (Side, at left) looking −90° to the left (relative to the heading direction of the vehicle); (H) miniaturized Figure 15 looking straight ahead; (CB, at right) extreme gaze direction in azimuth across the body is +60°. The eye on the right-hand side is vertically mirrored (+90° and −60°).

Figure 19. Coverage of the environment with two gaze-controlled tri-focal eyes for the front hemisphere and two body-fix cameras for the rear one.


As the wide-angle cameras are mounted so that the outer side of the image ends just at 180°, both cameras have a region of central overlap of 50° in the gaze direction of the eye (orange in Figure 18). The two eyes should be positioned at the upper end and outside corner of the front windshield of a car; the potential FoV are shown in Figure 19 together with two FoV of cameras mounted body-fixed near the upper outside end of the rear windshield. Due to the curved shape of the roof of the car, the gaze directions to the front across the body are generally limited to about 60°. The large FoV of the wide-angle cameras (115°), mounted with their line of sight under an angle of ΨW = ±32.5° to the side of the gaze direction Ψg of the eye, allows for coverage of the entire side of the vehicle with the corresponding eye at a gaze direction of 90° (see Figure 19). These large angles are only of interest nearby for the low-resolution cameras in order to perceive objects in the neighboring lane. Objects further away on a crossroad may be imaged by the medium-resolution (standard) camera up to a maximum of 117.5° (green lines in Figure 19), and with the high-resolution camera up to 94.6°. The medium-resolution camera is sufficient for perceiving objects that are not too far away on a crossroad in greater detail.

By installing one of these eyes at the upper end at each side of the front windshield, the front hemisphere will be steadily covered with low-resolution imaging when looking straight ahead. With gaze control, a region of 235° may be perceived with medium resolution with at least one eye (separately controllable). That means that each interesting set of features discovered in a FoV of ±90° at low resolution in a coordinated "surveillance mode" with the gaze direction straight ahead may then be analyzed by the medium-resolution camera in a range of 235° within a small fraction of a second due to the saccadic gaze control.

For smaller gaze angles (say Ψg = ±45°), both eyes may be directed towards the same real-world object, thereby allowing 4-D stereo perception by exploiting the extended vergence control over some time. With a stereo base of about 1.2 to 1.7 m (width of the front windshield), these depth perceptions should be reasonably good up to distances of at least 20 m. However, in the very near range the large stereo base may cause some difficulties due to the different aspect conditions for parts of the object.

With an individually controllable tri-focal eye on each upper side of the front windshield, only the rear hemisphere in the central part is not perceivable (white area in Figure 19). For the very limited types of maneuvers performed when driving backwards, body-fixed cameras, widely in use today for backing up, may also be sufficient in the long run, especially when additional radar sensors are used for detecting distant vehicles closing in. Two medium-resolution cameras mounted in a fixed way at each upper side of the rear windshield would be preferable for more precise observations.

To summarize, it is emphasized again that in the region of ±117.5° the gaze-controllable eyes allow the tracking of objects of special interest for the precise perception of a situation by reducing motion blur. In addition, these objects may be analyzed simultaneously at different levels of resolution, which may allow for a more stable interpretation. Additionally, keep in mind that monocular stereo perception is possible with the 4-D approach when moving [16].
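
The limiting angles quoted in this subsection all result from adding half of a camera's FoV (and, for the wide-angle cameras, their ±32.5° mounting offset) to the maximum gaze angle of 90°; the sketch below reproduces them from the fields of view in Table 1.

```python
GAZE_MAX_DEG = 90.0          # maximum gaze angle away from the car
WIDE_OFFSET_DEG = 32.5       # mounting offset of each wide-angle camera
FOV = {"low": 115.0, "medium": 55.0, "high": 9.2}   # horizontal FoV from Table 1

outer_edge_wide = GAZE_MAX_DEG + WIDE_OFFSET_DEG + FOV["low"] / 2
central_overlap = 2 * (FOV["low"] / 2 - WIDE_OFFSET_DEG)
max_medium = GAZE_MAX_DEG + FOV["medium"] / 2
max_high = GAZE_MAX_DEG + FOV["high"] / 2

print(outer_edge_wide)   # 180.0 -> outer image edge of the wide-angle camera ends at 180 deg
print(central_overlap)   # 50.0  -> central overlap of the two wide-angle cameras
print(max_medium)        # 117.5 -> limit of medium-resolution coverage (235 deg for both signs)
print(max_high)          # 94.6  -> limit of high-resolution coverage
```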

4.3. Car Design for Both Types of Eyes

The two types of ‘vehicle eyes’ discussed above will most likely not be developed in the near future, since they are not needed for the low capabilities required for driving in well-known and smooth environments (mainly: known roads) with support from GPS as well as from detailed high-precision maps that are kept up to date. However, after some catastrophic event (e.g., earthquakes or weather extremes) or in unknown environments, vehicles with more capabilities would be advantageous. This is also valid for vehicles exploring other planets or moons in astronautic missions. Scout vehicles in the defense realm or for fire brigades are other examples. Once these high-performance vision systems are available and can be realized at relatively low costs, their migration into the market of general vehicles cannot be excluded. Therefore, they should be conceived as a separately marketable sub-system right from the beginning.


Since one can hardly predict which of the approaches discussed here will be superior from a cost–performance point of view in a few decades, it seems reasonable to design the changes necessary at the upper end of the A-frame of a car so that it fits both types. Figure 20 shows one possible design of the upper end of the A-frame of a car, providing a potential standard solution for mounting front eyes.

Figure 20. One potential design of the upper end of the A-frame for a standard solution for mounting front eyes.

A section of a cylindrical glass tube for protecting the cameras, together with a wiper for clearing the view (see top of Figure 21), and a lid for covering the glass cylinder are part of the adapted design of the roof. The position for the wiper at rest is in the back corner of the glass cylinder. Of course, a water supply for cleaning the glass cylinder has to be included. The eye itself, with all connections needed for mechanical and electronic functioning, should be mountable from the inside of the car, if possible as a compact unit (indicated by diagonal arrows, top right in Figure 20). The hardware of the electronics (possibly with computing power for feature extraction from the wide-angle images) may be located near the lower end of the A-frame to keep the cable lengths short. Eventually, when interesting features are detected, a quick change in the gaze direction may be initiated directly from here with little delay time and with corresponding information sent to the central unit for perception.

The cables to the eye may run well protected on the inner side of the A-frame.
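As a minimal illustration of the division of labor described above (local feature extraction near the eye, a quickly triggered saccade, and forwarding of results to the central perception unit), the following Python sketch may help; all names, the saliency threshold and the callable interfaces are hypothetical and not taken from the actual system software.

```python
# Illustrative sketch only; all names and thresholds are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class FeatureCluster:
    azimuth_deg: float     # direction of the detected feature cluster
    elevation_deg: float
    saliency: float        # crude interest measure in [0, 1]

SALIENCY_THRESHOLD = 0.7   # assumed tuning value

def preprocess_cycle(image,
                     detect: Callable[[object], List[FeatureCluster]],
                     command_saccade: Callable[[float, float], None],
                     send_to_central_unit: Callable[[FeatureCluster], None]) -> None:
    """One video cycle of the local electronics near the lower end of the A-frame:
    extract features from the wide-angle image, trigger a saccade locally with
    little delay, and forward the results to the central perception unit."""
    for cluster in detect(image):
        if cluster.saliency > SALIENCY_THRESHOLD:
            command_saccade(cluster.azimuth_deg, cluster.elevation_deg)  # quick local reaction
        send_to_central_unit(cluster)                                    # full data goes upstream
```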


Figure 21. Body-fixed left-side eye: The specific FoV as given in Figure 13 are shown in the lower part (by making the roof transparent here). The two low-resolution cameras (in blue) cover the same FoV as all higher-resolution cameras together (230°, from −161° to +69°); their images meet at the gaze direction of −46°. The top part sketches a view from the left-hand side; the cameras are inside a cylindrical (semi-transparent) glass tube that is kept clean by a wiper, whose rest position is at the right corner.

4.4. Sketch of an Eye Mounted in a Fixed Way onto the Body of a Car

One possibility for the arrangement of cameras in such an eye has been discussed in Section 4.1, with many cameras collected in one unit and mounted in a fixed way onto the vehicle (Figures 13 and 14). For best viewing conditions, it should be mounted as far as possible above the ground. Figure 21 shows a favorable configuration for the left eye of a car. With the vanishing costs of cameras, it seems reasonable to add the two cameras with low resolution but with the same FoV as those needed for high resolution (see the blue parts of Figure 19 and the bottom of Figure 21). This provides redundancy in environmental mapping and thus safety. In the case of a failure of one or more of the high-resolution cameras, these low-resolution images will suffice for slow autonomous driving to the next service station, if the corresponding software is available.

Figure 21 shows a sketch of such a vehicle eye without gaze control positioned at the upper end of the left A-frame of a car: The lower part outlines the arrangement of three high-resolution cameras (L2 to L4), two medium-resolution cameras (L1 and L5) and two low-resolution cameras in blue color. In the FoV of the cameras L1 and L5, only nearby objects are of interest, so that the overall data volume can be kept low (a reduction from 32 to 2 mega-pixels, or a saving of 90 MB of data per cycle; this reduces the data flow rate of the five cameras by about 37%).

The eye at the right-hand side of the car is mirrored about a vertical plane through the longitudinal axis of the vehicle, enabling binocular vision in the frontal range from −69° to +69° only. Both eyes together allow a binocular stereo image interpretation with a relatively large stereo base (LStB) of 1.2 to 1.7 m (depending on the width of the vehicle). Crossroads and the nearby lateral vicinity (up to ±161°) can be seen by one eye (left or right) only. The low-resolution cameras, however, provide redundancy for the perception of lower details. Assuming a sufficient accuracy in the distance estimation of up to about 15 × LStB, a separate sensor (e.g., LRF) for good viewing conditions does not seem necessary; a radar is needed for all weather conditions anyway.
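The data-rate and stereo figures quoted above can be checked with a short calculation; the Python sketch below assumes 16-megapixel frames per camera, 3 bytes per pixel and a 0.5-pixel disparity accuracy (values chosen so that the numbers in the text are reproduced, not taken from the paper).

```python
import math

BYTES_PER_PIXEL = 3        # assumed (e.g., 8 bit per color channel)
FULL_FRAME_MPIX = 16       # assumed full resolution per camera

# Reducing the two outer medium-resolution cameras (L1, L5) to 1 Mpix each:
saved_mpix = 2 * FULL_FRAME_MPIX - 2 * 1                 # 32 -> 2 Mpix, i.e., 30 Mpix saved
saved_mb   = saved_mpix * BYTES_PER_PIXEL                # 90 MB per cycle
reduction  = saved_mpix / (5 * FULL_FRAME_MPIX)          # 0.375, "about 37%" of five cameras

# Stereo with the wide base between the two eyes:
base_m = 1.5                                             # stereo base, 1.2 ... 1.7 m
range_m = 15 * base_m                                    # quoted useful range, ~22 m
fov_deg, width_px = 55.0, 2000                           # assumed medium-resolution camera
f_px = width_px / (2 * math.tan(math.radians(fov_deg / 2)))  # focal length in pixels
disparity_px = f_px * base_m / range_m                   # = f_px / 15, roughly 128 px
rel_depth_error = 0.5 / disparity_px                     # ~0.4% for 0.5 px matching accuracy

print(f"saved {saved_mb} MB/cycle, data rate -{reduction:.1%}, "
      f"disparity {disparity_px:.0f} px, depth error {rel_depth_error:.1%}")
```

With roughly 128 pixels of disparity still available at 15 × LStB, relative depth errors stay well below one percent under these assumptions, which supports the statement that a separate range sensor is not needed under good viewing conditions.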


A glass structure for protecting the cameras is mandatory; a section of a cylindrical tube is chosen here. One design possibility is sketched with the position of the wiper in action in the front (top left); at rest, the wiper would be parked in the rear outer corner. When the eye is not needed, a lid may cover the glass tube. These supplementary devices will have to be part of the adapted roof design (see Figure 21, top).

4.5. Sketch of a Gaze-Controllable Eye with Tri-Focal Saccadic Vision

In contrast to a body-fixed eye, a tri-focal, gaze-controllable eye provides redundancy on three levels of resolution for the same region by separate cameras with considerably reduced data rates (−94.4%, i.e., a factor of about 1/18). Image stabilization on all three levels is achieved by inertial feedback of external perturbations onto the gaze control platform. Figure 22 shows a simple-minded design as an extension of the line of development of the larger platforms tested with the vehicles of UniBwM. They had separate degrees of freedom in pitch, marked here by (1) at the top, and in yaw, marked in red by (2) for the motor in the bottom of the unit. The presentation is given here for the maximum gaze direction to the outside (−90°, to the left). In this design, the cables for power supply and for transferring the sensor signals have to allow angles of rotation of 150° in yaw and of about 50° in pitch. These cables and the bearings for the axes of rotation may be causes of wear and tear, needing service every now and then.
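To illustrate the inertial stabilization mentioned above, here is a deliberately simple single-axis rate-feedback loop in Python; the gains, the cycle time and the imperfect feedforward factor are assumed values for illustration and do not describe the actual UniBwM platform controller.

```python
# Illustrative single-axis stabilization loop (assumed gains and timing).
DT = 0.01      # control cycle time in s (assumed, 100 Hz)
K_RATE = 0.9   # feedforward of the measured body rate (deliberately imperfect here)
K_POS = 5.0    # proportional correction of the remaining pointing error

def stabilize_step(platform_angle, commanded_gaze, body_rate):
    """Rate command for the platform relative to the vehicle body (one axis).

    platform_angle -- current angle of the line of sight w.r.t. inertial space
    commanded_gaze -- desired inertial gaze direction
    body_rate      -- vehicle angular rate measured by a gyro on the platform
    """
    error = commanded_gaze - platform_angle
    return -K_RATE * body_rate + K_POS * error   # counter-rotate, then correct residual

# Tiny simulation: the body pitches at 0.2 rad/s while the gaze should stay at 0 rad.
angle, body_rate = 0.0, 0.2
for _ in range(100):
    platform_rate = stabilize_step(angle, 0.0, body_rate)
    angle += (body_rate + platform_rate) * DT    # inertial rate = body rate + platform rate
print(f"residual pointing error after 1 s: {angle:.4f} rad")
```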

Figure 22. Gaze-controlled ‘Vehicle Eye’ (left side) at the upper end of the A-frame, capable of providing tri-focal images in the same FoV as the body-fixed eye in Figure 21; however, it has a data rate lower by two orders of magnitude, but with a delay time of a few video-cycles.

An ideal design for such an eye would include (magnetic) levitation and torque application in the two axes of motion by electromagnetic fields, avoiding mechanical contacts. For sensor data transmission from the eye to the world outside and for power supply to the sensors, a solution without material links would also be ideal. Whether this, too, will remain impossible in the long run (especially the combination of both) is hard to predict. Levitation and control of motion alone, without a mechanical touch, could make a gaze-controllable eye very attractive. According to the data given above, data rates are one to two orders of magnitude lower than for the eye mounted in a fixed way onto the vehicle. Thus, a gaze-controlled tri-focal eye may deliver all the visual information required, with the needed computing power much reduced and, in addition, with the information redundantly obtained from different cameras. Future eye designs may reduce the delay time for saccades. This would increase their biggest advantage, namely that all the redundant images of a gaze-controlled eye may be taken while fixating a fast-moving object, using both inertial sensor data from perturbations of the vehicle itself (measured by tiny devices on the gaze platform) and the estimated 3-D motion of the object of special interest.
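The last point, fixating a fast-moving object by combining inertial data with the object's estimated 3-D motion, can be sketched as a one-line control law; again, the gain is an assumed value and the function is only an illustration of the principle, not the controller used in the cited work.

```python
K_ERR = 4.0   # assumed gain on the remaining line-of-sight error

def fixation_rate_command(object_rate_pred, body_rate, los_error):
    """Platform rate command relative to the vehicle body (one axis):
    feedforward of the object's predicted angular rate (e.g., from 4-D state
    estimation), compensation of the measured body rate, plus error correction."""
    return object_rate_pred - body_rate + K_ERR * los_error
```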

5. Conclusions and Outlook

Based on four decades of experience with real-time dynamic machine vision systems, two types of ‘vehicle eyes’ have been sketched here. In the last two decades of the last century, seven test vehicles (two of our institute [54] and five of our partners in industry) were equipped with the latest vision and computer hardware. Taking the developments in cameras and µP in this century into account, the present state allows completely new types of high-performance eyes for vehicles: one with all cameras closely collected together and mounted in a fixed way onto the body of the vehicle, and one with a set of five smaller cameras mounted onto a gaze-controllable platform.

For this gaze-controllable eye, according to Figures 18, 19 and 22, no pixel computations need to be done at all in order to have tri-focal images of the momentary region of interest available. A medium-resolution image is available with a FoV of ±27.5° laterally and ±15.5° vertically around the actual gaze direction. Two low-resolution images from each eye provide multiple redundant image data for the entire frontal hemisphere laterally and ±31° vertically. Each side of the vehicle beyond 90° can be covered entirely by one low-resolution image. The medium-resolution image may cover lateral angles of up to ±117.5°, while high resolution may be available up to ±94.6°; these two angular ranges are of special interest when approaching crossroads.

A FoV of ±25° around the line of sight is covered in a triple-redundant way by each of the eyes (from two low- and one medium-resolution camera), thereby improving safety aspects in the case of a failure of cameras. This compactness, in combination with object tracking and dynamic scene understanding exploiting the 4-D approach based on spatiotemporal models for motion processes in the real 3-D world, yields an extremely efficient method for real-time vision. It may be the direct basis for some kind of consciousness.

In the other setup, the body-fixed eye, which includes two low-resolution cameras similar to the ones used with gaze control, the increase in data volume for covering the same FoV corresponds to about 0.054%. Evaluating this relatively very small amount of data for features indicating potential objects of interest may considerably reduce the needed access to regions of the high-resolution images. For a single object tracked at high resolution in the corresponding image, only about 0.1% of computing operations is needed as compared to the pyramid approach. Thus, the additional low-resolution images and their coarse evaluation for indications of potential objects of interest seem favorable. A complication for cameras mounted in a fixed way onto the body may be the transition when objects move from the image of one high-resolution camera into that of another. However, from a data volume and data rate point of view, the type of body-fixed eye including redundant low-resolution coverage may, as discussed, be competitive with the gaze-controlled eye if motion blur should turn out not to be a challenge in machine vision (in contrast to biological vision systems). Since this can only be judged properly by comparing actually built devices, this question has to be left open for now.

The disadvantages of gaze-controlled vision systems are twofold: first, the mechanisms for precisely pointing the eye in pan and tilt (yaw and pitch); and second, the need for the transfer of the image data from the sensor elements to the vehicle body.
If both challenges can be solved by non-mechanical links in the future (say, magnetic torques and electromagnetic waves between the eye and its mounting in the vehicle body), this would favor gaze control.

The expectation is that the development of a variety of eyes for vehicles will take up the entire 21st century, similar to the development of the different types of ground vehicles in the 20th century. The requirements depend heavily on the type of mission to be performed. For planetary exploration, a reliable, multi-focal eye with its redundancy will have advantages. On Earth, this may also be true for missions after some catastrophe.

Funding: This research received no external funding.

Conflicts of Interest: The author declares no conflict of interest.

References

1. Handbook of Physiology; American Physiological Society: Bethesda, MD, USA, 1960.
2. Motor Control. Handbook of Physiology; American Physiological Society: Bethesda, MD, USA, 1987; Volume II.
3. Sensory Processes. Handbook of Physiology; American Physiological Society: Bethesda, MD, USA, 1984; Volume III.
4. Higher Functions of the Brain. Handbook of Physiology; American Physiological Society: Bethesda, MD, USA, 1987; Volume V.
5. Evolution of the Eye. Available online: https://en.wikipedia.org/wiki/Evolution_of_the_eye (accessed on 26 April 2020).
6. Human Eye from Wikipedia. Available online: https://en.wikipedia.org/wiki/Fovea_centralis (accessed on 26 April 2020).
7. Krantz, J.H. Experiencing Sensation and Perception. Chapter 3: The Stimulus and Anatomy of the Visual System. Available online: http://psych.hanover.edu/classes/sensation/chapters/Chapter%203.pdf (accessed on 26 April 2020).
8. Hubel, D.H.; Wiesel, T. Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. J. Physiol. 1962, 160, 106–154. [CrossRef] [PubMed]
9. Hunziker, H.-W. Im Auge des Lesers: Foveale und periphere Wahrnehmung—Vom Buchstabieren zur Lesefreude. In Transmedia; Stäubli Verlag: Zürich, Switzerland, 2006.
10. Tsugawa, S.; Yatabe, T.; Hirose, T.; Matsumoto, S. An automobile with artificial intelligence. In Proceedings of the 6th IJCAI, Tokyo, Japan, 20–23 August 1979.
11. Roland, A.; Shiman, P. Strategic Computing: DARPA and the Quest for Machine Intelligence, 1983–1993; MIT Press: Cambridge, MA, USA, 2002.
12. Dickmanns, E.D. Vision for ground vehicles: History and prospects. Int. J. Veh. Auton. Syst. 2002, 1, 1–44. [CrossRef]
13. Dickmanns, E.D.; Zapp, A.; Otto, K.-D. Ein Simulationskreis zur Entwicklung einer automatischen Fahrzeugführung mit bildhaften und inertialen Signalen. In Simulationstechnik, Informatik-Fachberichte 85; Springer-Verlag: London, UK, 1984; pp. 554–558.
14. Schell, F.R.; Dickmanns, E.D. Autonomous Landing of Airplanes by Dynamic Machine Vision. In Proceedings of the IEEE Workshop on Applications of Computer Vision, Palm Springs, CA, USA, 30 November–2 December 1992.
15. Luenberger, D.G. Observers for Multivariable Systems. IEEE Trans. Autom. Control 1966, 11, 190–197. [CrossRef]
16. Dickmanns, E.D. Dynamic Vision for Perception and Control of Motion; Springer: London, UK, 2007; p. 474.
17. Dickmanns, E.D.; Wünsche, H.-J. Satellite Rendezvous Maneuvers by Means of Computer Vision. In Jahrestagung der DGLR, München, October 1986; DGLR: Bonn, Germany, 1986; pp. 251–259.
18. Mysliwetz, B.; Dickmanns, E.D. A Vision System with Active Gaze Control for Real-Time Interpretation of Well Structured Dynamic Scenes. In Proceedings of the 1st Conference on Intelligent Autonomous Systems (IAS-1), Amsterdam, The Netherlands, 8–11 December 1986; pp. 477–483.
19. Wünsche, H.-J. Erfassung und Steuerung von Bewegungen durch Rechnersehen; VDI Verlag GmbH: Düsseldorf, Germany, 1987.
20. Dickmanns, E.D.; Graefe, V. Dynamic monocular machine vision. Mach. Vis. Appl. 1988, 1, 223–240. [CrossRef]

21. Van VaMoRs (1985–2004). Available online: www.dyna-vision.de/main/H.2.3%20Van%20VaMoRs.html (accessed on 3 May 2020).
22. Ulmer, B. VITA-II—Active Collision Avoidance in Real Traffic. In Proceedings of the Intelligent Vehicles ’94 Symposium, Paris, France, 24–26 October 1994.
23. Dickmanns, E.D.; Behringer, R.; Dickmanns, D.; Hildebrandt, T.; Maurer, M.; Thomanek, F.; Schiehlen, J. The Seeing Passenger Car ‘VaMoRs-P’. In Proceedings of the Intelligent Vehicles ’94 Symposium, Paris, France, 24–26 October 1994; pp. 68–73.
24. Thomanek, F.; Dickmanns, E.D.; Dickmanns, D. Multiple Object Recognition and Scene Interpretation for Autonomous Road Vehicle Guidance. In Proceedings of the Intelligent Vehicles ’94 Symposium, Paris, France, 24–26 October 1994; pp. 231–236.
25. Von Holt, V. Tracking and Classification of Overtaking Vehicles on Autobahnen. In Proceedings of the Intelligent Vehicles ’94 Symposium, Paris, France, 24–26 October 1994; pp. 314–319.
26. Schiehlen, J. Kameraplattformen für Aktiv Sehende Fahrzeuge. Ph.D. Thesis, UniBwM/LRT, 1995; also as Fortschrittsberichte VDI, Reihe 8, Nr. 514; VDI Verlag GmbH: Düsseldorf, Germany, 1995.
27. Reichardt, D. Kontinuierliche Verhaltenssteuerung eines Autonomen Fahrzeugs in Dynamischer Umgebung. Ph.D. Thesis, Informatik, Universität Kaiserslautern, January 1996.
28. Schiehlen, J.; Dickmanns, E.D. A Camera Platform for Intelligent Vehicles. In Proceedings of the Intelligent Vehicles ’94 Symposium, Paris, France, 24–26 October 1994; pp. 393–398.
29. Gaze Control, Including Saccades. 1994. Available online: www.dyna-vision.de/main/WebQualVideo/17%20GazeControlTrafficSignSaccade1994.wmv (accessed on 1 May 2020).
30. Franke, U.; Pfeiffer, D.; Rabe, C.; Knoeppel, C.; Enzweiler, M.; Stein, F.; Herrtwich, R.G. Making Bertha See. In Proceedings of the ICCV Workshop on Computer Vision for Autonomous Vehicles, Sydney, Australia, 2–8 December 2013.
31. Dickmanns, E.D.; Zapp, A. A Curvature-based Scheme for Improving Road Vehicle Guidance by Computer Vision. In Mobile Robots I; Marquina, N., Wolfe, W.J., Eds.; SPIE: Bellingham, WA, USA, 1987; Volume 727, pp. 161–168 (only available on the SPIE Digital Library).
32. Dickmanns, E.D. 4-D-Dynamic Scene Analysis with Integral Spatio-Temporal Models. In Robotics Research; Bolles, R.C., Roth, B., Eds.; MIT Press: Cambridge, MA, USA, 1988; pp. 311–318.
33. Albus, J.S. 4-D/RCS reference model architecture for unmanned ground vehicles. In Proceedings of the ICRA, San Francisco, CA, USA, 24–27 April 2000.
34. Baten, S.; Lützeler, M.; Dickmanns, E.D.; Mandelbaum, R.; Burt, P. Techniques for Autonomous, Off-Road Navigation. IEEE Intell. Syst. 1998, 13, 57. [CrossRef]
35. Dickmanns, E.D. Vehicles Capable of Dynamic Vision. In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI-97), Nagoya, Japan, 23–29 August 1997; Volume 2, pp. 1577–1592.
36. Gregor, R.; Lützeler, M.; Pellkofer, M.; Siedersberger, K.-H.; Dickmanns, E.D. EMS-Vision: A Perceptual System for Autonomous Vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, Dearborn, MI, USA, 3–5 October 2000; pp. 52–57.
37. Gregor, R.; Dickmanns, E.D. EMS-Vision: Mission Performance on Road Networks. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, Dearborn, MI, USA, 3–5 October 2000; pp. 140–145.
38. Siedersberger, K.-H.; Dickmanns, E.D. EMS-Vision: Enhanced Abilities for Locomotion. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, Dearborn, MI, USA, 3–5 October 2000; pp. 146–151.
39. Pellkofer, M.; Dickmanns, E.D. EMS-Vision: Gaze Control in Autonomous Vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, Dearborn, MI, USA, 3–5 October 2000; pp. 296–301.
40. Lützeler, M.; Dickmanns, E.D. EMS-Vision: Recognition of Intersections on Unmarked Road Networks. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, Dearborn, MI, USA, 3–5 October 2000; pp. 302–307.
41. Turn Left on Unknown Dirt Road Crossing. Available online: www.dyna-vision.de/main/WebQualVideo/30%20TurnLeftSacc%20DirtRoad%202001.wmv (accessed on 3 May 2020).
42. AutoNav Final Demo 2003. Available online: www.dyna-vision.de/main/WebQualVideo/33%20EMS-On-Offroad%20Driving2003VaMoRs.wmv (accessed on 3 May 2020).
43. NVIDIA DRIVE AGX ORIN. Available online: https://www.nvidia.com/en-us/self-driving-cars/ (accessed on 28 January 2020).

44. National Defense Authorization Act for Fiscal Year 2001; Public Law 106–398 (Sect. 220, pp. 40–42). Available online: https://www.congress.gov/106/plaws/publ398/PLAW-106publ398.pdf (accessed on 1 May 2020).
45. Grand Challenge. Available online: https://www.darpa.mil/about-us/timeline/-grand-challenge-for-autonomous-vehicles (accessed on 17 February 2020).
46. DARPA: Urban Challenge, Proposer Information Pamphlet (PIP), Broad Agency Announcement (BAA) 06-36, 1 May 2006. Available online: https://www.darpa.mil/about-us/timeline/darpa-urban-challenge (accessed on 17 February 2020).
47. Dickmanns, E.D. Developing the Sense of Vision for Autonomous Road Vehicles at the UniBwM. IEEE Comput. 2017, 50, 24–31. [CrossRef]
48. Mobileye. Available online: https://www.mobileye.com/our-technology/evolution-eyeq-chip/ (accessed on 28 January 2020).
49. Intel-Announcement. Available online: https://newsroom.intel.com/news-releases/intel-mobileye-acquisition/#gs.usmbiy (accessed on 28 January 2020).
50. Gehrig, S.; Eberli, F.; Meyer, T. A Real-time Low-Power Stereo Vision Engine Using Semi-Global Matching. In Proceedings of the International Conference on Computer Vision Systems, ICVS 2009, Liege, Belgium, 13–15 October 2009.
51. UniBwM/LRT/TAS. Available online: https://www.unibw.de/tas (accessed on 11 February 2020).
52. Hofmann, U. Zur Visuellen Umfeldwahrnehmung Autonomer Fahrzeuge; VDI Verlag GmbH: Düsseldorf, Germany, 2004.
53. Hofmann, U.; Rieder, A.; Dickmanns, E.D. EMS-Vision: Application to Hybrid Adaptive Cruise Control. In Proceedings of the IEEE Intelligent Vehicles Symposium 2000, Dearborn, MI, USA, 3–5 October 2000; pp. 468–473. Available online: www.dyna-vision.de/main/WebQualVideo/27%20Hybrid%20Adaptive%20Cruise%20Control%20VaMP%202000.wmv (accessed on 2 May 2020).
54. Summary of Contributions to Vision-based Autonomous Vehicles. Available online: www.dyna-vision.de/main/Review2016%20of%20ContribVisBasAutVeh.html#_1._Time_line_1 (accessed on 4 May 2020).

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).