Explainability of vision-based autonomous driving systems: Review and challenges
Éloi Zablocki*,1 · Hédi Ben-Younes*,1 · Patrick Pérez1 · Matthieu Cord1,2
Abstract This survey reviews explainability methods for vision-based self-driving systems. The concept of explainability has several facets, and the need for explainability is strong in driving, a safety-critical application. Gathering contributions from several research fields, namely computer vision, deep learning, autonomous driving, and explainable AI (X-AI), this survey tackles several points. First, it discusses definitions, context, and motivation for gaining more interpretability and explainability from self-driving systems. Second, major recent state-of-the-art approaches to develop self-driving systems are briefly presented. Third, methods providing explanations to a black-box self-driving system in a post-hoc fashion are comprehensively organized and detailed. Fourth, approaches from the literature that aim at building more interpretable self-driving systems by design are presented and discussed in detail. Finally, remaining open challenges and potential future research directions are identified and examined.

Keywords Autonomous driving · Explainability · Interpretability · Black-box · Post-hoc interpretability

arXiv:2101.05307v1 [cs.CV] 13 Jan 2021

Éloi Zablocki (E-mail: [email protected]) · Hédi Ben-Younes (E-mail: [email protected]) · Patrick Pérez (E-mail: [email protected]) · Matthieu Cord (E-mail: [email protected])
* Equal contribution. 1 Valeo.ai. 2 Sorbonne Université.

1 Introduction

1.1 Self-driving systems

Research on autonomous vehicles is blooming thanks to recent advances in deep learning and computer vision (Krizhevsky et al, 2012; LeCun et al, 2015), as well as the development of autonomous driving datasets and simulators (Geiger et al, 2013; Dosovitskiy et al, 2017; Yu et al, 2020). The number of academic publications on this subject is rising in most machine learning, computer vision, robotics, and transportation conferences and journals. On the industry side, several manufacturers already produce cars equipped with advanced computer vision technologies for automatic lane following, assisted parking, or collision detection, among other things. Meanwhile, constructors are working on and designing prototypes with level 4 and 5 autonomy. The development of autonomous vehicles has the potential to reduce congestion, fuel consumption, and crashes, and it can increase personal mobility and save lives, given that nowadays the vast majority of car crashes are caused by human error (Anderson et al, 2014).

The first steps in the development of autonomous driving systems were taken with the collaborative European project PROMETHEUS (Program for a European Traffic with Highest Efficiency and Unprecedented Safety) (Xie et al, 1993) at the end of the '80s, and with the DARPA Grand Challenges in the late 2000s. At that time, systems were heavily-engineered pipelines (Urmson et al, 2008; Thrun et al, 2006) whose modular design decomposes the task of driving into several smaller tasks, from perception to planning, which has the advantage of offering interpretability and transparency to the processing. Nevertheless, modular pipelines also have known limitations, such as the lack of flexibility, the need for handcrafted representations, and the risk of error propagation. In the 2010s, interest grew in approaches that train driving systems, usually in the form of neural networks, either by leveraging large quantities of expert recordings (Bojarski et al, 2016; Codevilla et al, 2018; Ly and Akhloufi, 2020) or through simulation (Espié et al, 2005; Toromanoff et al, 2020; Dosovitskiy et al, 2017). In both cases, these systems learn a highly complex transformation that operates over input sensor data and produces end-commands (steering angle, throttle). While these neural driving models overcome some of the limitations of the modular pipeline stack, they are sometimes described as black boxes for their critical lack of transparency and interpretability.

1.2 Need for explainability

The need for explainability is multi-factorial and depends on the concerned people, whether they are end-users, legal authorities, or self-driving car designers. End-users and citizens need to trust the autonomous system and to be reassured (Choi and Ji, 2015). Moreover, designers of self-driving models need to understand the limitations of current models to validate them and improve future versions (Tian et al, 2018). Besides, legal and regulatory bodies need access to explanations of the system for liability purposes, especially in the case of accidents (Rathi, 2019; Li et al, 2018c).

The fact that autonomous self-driving systems are not inherently interpretable has two main origins. On the one hand, models are designed and trained within the deep learning paradigm, which has known explainability-related limitations: datasets contain numerous biases and are generally not precisely curated; the learning and generalization capacity remains empirical, in the sense that the system may learn from spurious correlations and overfit on common situations; and the final trained model represents a highly non-linear function that is not robust to slight changes in the input space. On the other hand, self-driving systems have to simultaneously solve intertwined tasks of very different natures: perception tasks with the detection of lanes and objects, planning and reasoning tasks with motion forecasting of surrounding objects and of the ego-vehicle, and control tasks to produce the driving end-commands. Explaining a self-driving system thus means disentangling the predictions of each implicit task and making them human-interpretable.

1.3 Research questions and focus of the survey

Two complementary questions are the focus of this survey and guide its organization:

1. Given a trained self-driving model, coming as a black box, how can we explain its behavior?
2. How can we design learning-based self-driving models which are more interpretable?

Regardless of driving considerations, these questions are asked and answered in many generic machine learning papers. Besides, some papers from the vision-based autonomous driving literature propose interpretable driving systems. In this survey, we bridge the gap between general X-AI methods that can be applied to the self-driving literature and driving-based approaches claiming explainability. In practice, we reorganize and cast the autonomous driving literature into an X-AI taxonomy that we introduce. Moreover, we detail generic X-AI approaches, some of which have not yet been used in the autonomous driving context, that can be leveraged to increase the explainability of self-driving models.

1.4 Positioning

Many works advocate for the need for explainable driving models (Ly and Akhloufi, 2020), and published reviews about explainability often mention autonomous driving as an important application for X-AI methods. However, there are only a few works on interpretable autonomous driving systems and, to the best of our knowledge, there exists no survey focusing on the interpretability of autonomous driving systems. Our goal is to bridge this gap, to organize and detail existing methods, and to present challenges and perspectives for building more interpretable self-driving systems.

This survey is the first to organize and review self-driving models in the light of explainability. The scope is thus different from papers that review self-driving models in general. For example, Janai et al (2020) review vision-based problems arising in self-driving research, Di and Shi (2020) provide a high-level review on the link between human and automated driving, Ly and Akhloufi (2020) review imitation-based self-driving models, Manzo et al (2020) survey deep learning models for predicting the steering angle, and Kiran et al (2020) review self-driving models based on deep reinforcement learning.

Besides, there exist reviews on X-AI, interpretability, and explainability in machine learning in general (Beaudouin et al, 2020; Gilpin et al, 2018; Adadi and Berrada, 2018; Das and Rad, 2020). Among others, Xie et al (2020) give a pedagogic review for non-expert readers, while Vilone and Longo (2020) offer the most exhaustive and complete review of the X-AI field. Moraffah et al (2020) focus on causal interpretability in machine learning. Moreover, there also exist reviews on explainability applied to decision-critical fields other than driving, including interpretable machine learning for medical applications (Tjoa and Guan, 2019; Fellous et al, 2019).

Overall, the goal of this survey is diverse, and we hope that it contributes to the following:

– Interpretability and explainability notions are clarified in the context of autonomous driving, depending on the type of explanations and how they are computed;
– Legal and regulatory bodies, engineers, and technical and business stakeholders can learn more about explainability methods and approach them with caution regarding the presented limitations;
– Self-driving researchers are encouraged to explore new directions from the X-AI literature, such as causality, to foster the explainability and reliability of self-driving systems;
– The quest for interpretable models can contribute to other related topics such as fairness, privacy, and causality, by making sure that models are taking good decisions for good reasons.

1.5 Contributions and outline

Throughout the survey, we review explainability-related definitions from the X-AI literature, we gather a large number of papers proposing self-driving models that are explainable or interpretable to some extent, and we organize them within an explainability taxonomy we define. Moreover, we identify limitations and shortcomings of X-AI methods and propose several future research directions towards potentially more transparent, richer, and more faithful explanations for upcoming generations of self-driving models.

This survey is organized as follows: Section 2 contextualizes and motivates the need for interpretable autonomous driving models and presents a taxonomy of explainability methods suitable for self-driving systems; Section 3 gives an overview of neural driving systems and explores reasons why it is challenging to explain them; Section 4 presents post-hoc methods providing explanations to any black-box self-driving model; Section 5 turns to approaches providing more transparency to self-driving models by adding explainability constraints in the design of the systems; this section also presents potential future directions to further increase the explainability of self-driving systems. Section 6 presents the particular use-case of explaining a self-driving system by means of natural language justifications.

2 Explainability in the context of autonomous driving

This section contextualizes the need for interpretable driving models. In particular, we present the main motivations for requiring increased explainability in Section 2.1, we define and organize explainability-related terms in Section 2.2, and, in Section 2.3, we answer questions such as: who needs explanations? what kind? for what reasons? when?

2.1 Call for explainable autonomous driving

The need to explain self-driving behaviors is multi-factorial. To begin with, autonomous driving is a high-stake and safety-critical application. It is thus natural, from a societal point of view, to ask for performance guarantees. However, self-driving models are not completely testable under all scenarios, as it is not possible to exhaustively list and evaluate every situation the model may possibly encounter. As a fallback solution, this motivates the need for explanations of driving decisions.

Moreover, explainability is also desirable for various reasons depending on the performance of the system to be explained. For example, as detailed by Selvaraju et al (2020), when the system works poorly, explanations can help engineers and researchers improve future versions by gaining more information on corner cases, pitfalls, and potential failure modes (Tian et al, 2018; Hecker et al, 2020). Moreover, when the system's performance matches human performance, explanations are needed to increase users' trust and enable the adoption of this technology (Lee and Moray, 1992; Choi and Ji, 2015; Shen et al, 2020; Zhang et al, 2020). In the future, if self-driving models largely outperform humans, the produced explanations could be used to teach humans to drive better and to make better decisions with machine teaching (Mac Aodha et al, 2018).

Besides, from a machine learning perspective, it is also argued that the need for explainability in machine learning stems from a mismatch between training objectives on the one hand, and the more complex real-life goal on the other hand, i.e. driving (Lipton, 2018; Doshi-Velez and Kim, 2017). Indeed, the predictive performance on test sets does not perfectly represent the performance an actual car would have when deployed in the real world. For example, this may be due to the fact that the environment is not stationary and the i.i.d. assumption does not hold, as actions made by the model alter the environment. In other words, Doshi-Velez and Kim (2017) argue that the need for explainability arises from incompleteness in the problem formalization: machine learning objectives are flawed proxy functions towards the ultimate goal of driving. Prediction metrics alone are not sufficient to fully characterize the learned system (Lipton, 2018): extra information is needed, namely explanations. Explanations thus provide a way to check whether the hand-designed objectives being optimized enable the trained system to drive as a by-product.

2.2 Explainability: Taxonomy of terms

Many terms are related to the explainability concept, and several definitions have been proposed for each of these terms. The boundaries between concepts are fuzzy and constantly evolving. To clarify and narrow the scope of the survey, we detail here common definitions of key concepts related to explainable AI, and how they relate to one another, as illustrated in Figure 1.

Fig. 1: Taxonomy of explainability terms adopted in this survey. Explainability is the combination of interpretability (= comprehensible by humans) and completeness (= exhaustivity of the explanation). There are two approaches to obtain interpretable systems: approaches intrinsic to the design of the system, which increase its transparency, and post-hoc approaches that justify decisions afterwards for any black-box system.

In human-machine interactions, explainability is defined as the ability for the human user to understand the agent's logic (Rosenfeld and Richardson, 2019). The explanation is based on how the human user understands the connections between the inputs and outputs of the model. According to Doshi-Velez and Kortz (2017), an explanation is a human-interpretable description of the process by which a decision-maker took a particular set of inputs and reached a particular conclusion. In practice, Doshi-Velez and Kortz (2017) state that an explanation should answer at least one of the three following questions: What were the main factors in the decision? Would changing a certain factor have changed the decision? Why did two similar-looking cases get different decisions, or vice versa?

The term explainability often co-occurs with the concept of interpretability. While some recent work (Beaudouin et al, 2020) advocates that the two are synonyms, Gilpin et al (2018) use the term interpretability to designate the extent to which an explanation is understandable by a human. For example, an exhaustive and completely faithful explanation is a description of the system itself and all its processing: this is a complete explanation, although the exhaustive description of the processing may be incomprehensible. Gilpin et al (2018) state that an explanation should be designed and assessed in a trade-off between its interpretability and its completeness, which measures how accurately the explanation describes the inner workings of the system. The whole challenge in explaining neural networks is to provide explanations that are both interpretable and complete.

Interpretability may refer to different concepts, as explained by Lipton (2018). In particular, interpretability regroups two main concepts: model transparency and post-hoc interpretability. Increasing model transparency amounts to gaining an understanding of how the model works. For example, Guidotti et al (2018) explain that a decision model is transparent if its decision-making process can be directly understood without any additional information; if an external tool or model is used to explain the decision-making process, the provided explanation is not transparent according to Rosenfeld and Richardson (2019). For Choi and Ji (2015), system transparency can be measured as the degree to which users can understand and predict the way autonomous vehicles operate. On the other hand, gaining post-hoc interpretability amounts to acquiring extra information in addition to the model metric, generally after the driving decision is made. This can be done for a specific instance, i.e. local interpretability, or, more globally, to explain the whole model and/or its processing and representations.

An important aspect of explanations is the notion of correctness, or fidelity. It designates whether the provided explanation accurately depicts the internal process leading to the output/decision (Xie et al, 2020). In the case of transparent systems, explanations are faithful by design; however, this is not guaranteed with post-hoc explanations, which may be chosen and optimized for their capacity to persuade users instead of accurately unveiling the system's inner workings.

Besides, it is worth mentioning that explainability in general, and interpretability and transparency in particular, serve and assist broader concepts such as traceability, auditability, liability, and accountability (Beaudouin et al, 2020).

2.3 Contextual elements of an explanation

The relation with autonomous vehicles differs a lot depending on who is interacting with the system: surrounding pedestrians and end-users of the ego-car put their lives in the hands of the driving system and thus need to gain trust in the system; designers of self-driving systems seek to understand the limitations and shortcomings of the developed models to improve future versions; insurance companies and certification organizations need guarantees about the autonomous system. These categories of stakeholders have varying expectations, and thus the need for explanations has different motivations. The discussions of this subsection are summarized in Table 1.

2.3.1 Car users, citizens and trust

There is a long and dense line of research trying to define, characterize, evaluate, and increase the trust between an individual and a machine (Lee and Moray, 1992, 1994; Lee and See, 2004; Choi and Ji, 2015; Shariff et al, 2017; Du et al, 2019; Shen et al, 2020; Zhang et al, 2020). Importantly, trust is a major factor in users' acceptance of automation, as was shown in the empirical study of Choi and Ji (2015). Lee and See (2004) define trust between a human and a machine as "the attitude that an agent will help achieve an individual's goal, in a situation characterized with uncertainty and vulnerability". According to Lee and Moray (1992), human-machine trust depends on three main factors. First, performance-based trust is built relative to how well the system performs at its task. Second, process-based trust is a function of how well the human understands the methods used by the system to complete its task. Finally, purpose-based trust reflects the designer's intention in creating the system.

In the more specific case of autonomous driving, Choi and Ji (2015) define three dimensions of trust in an autonomous vehicle. The first one is system transparency, which refers to the extent to which the individual can predict and understand the operating of the vehicle. The second one is technical competence, i.e. the perception by the human of the vehicle's performance. The third dimension is situation management, which is the belief that the user can take control whenever desired. As a consequence of these three dimensions of trust, Zhang et al (2020) propose several key factors to positively influence human trust in autonomous vehicles. For example, improving the system performance is a straightforward way to gain more trust. Another possibility is to increase system transparency by providing information that will help the user understand how the system functions. Therefore, it appears that the capacity to explain the decisions of an autonomous vehicle has a significant impact on user trust, which is crucial for the broad adoption of this technology. Besides, as argued by Haspiel et al (2018), explanations are especially needed when users' expectations have been violated, as a way to mitigate the damage.

Research on human-computer interactions argues that the timing of explanations is important for trust. Haspiel et al (2018) and Du et al (2019) conducted a user study showing that, to promote trust in the autonomous vehicle, explanations should be provided before the vehicle takes action rather than after. Apart from the moment when the explanation should appear, Rosenfeld and Richardson (2019) advocate that users are not expected to spend a lot of time processing the explanation, which is why it should be concise and direct. This is in line with other findings of Shariff et al (2017) and Koo et al (2015), who show that although transparency can improve trust, providing too much information to the human end-user may cause anxiety by overwhelming the passenger and thus decrease trust.

2.3.2 System designers, certification, debugging and improvement of models

Driving is a high-stake critical application with strong safety requirements. The concept of Operational Design Domain (ODD) is often used by carmakers to designate the conditions under which the car is expected to behave safely. Thus, whenever a machine learning model is built to address the task of driving, it is crucial to know and understand its failure modes, i.e. in the case of accidents (Chan et al, 2016; Zeng et al, 2017; Suzuki et al, 2018; Kim et al, 2019; You and Han, 2020), and to verify that these situations do not overlap with the ODD. To this end, explanations can provide technical information about the current limitations and shortcomings of a model.

The first step is to characterize the performance of the model. While performance is often measured as an averaged metric on a test set, this may not be enough to reflect the strengths and weaknesses of the system. A common practice is to stratify the evaluation into
situations, so that failure modes can be highlighted. This type of method is used by the European New Car Assessment Programme (Euro NCAP) to test and assess the assisted driving functionalities of new vehicles. Such an evaluation method can also be used at the development stage, as in (Bansal et al, 2019), where the authors build a real-world driving simulator to evaluate their system on controlled scenarios. When these failure modes are found in the behavior of the system, the designers of the model can augment the training set with these situations and re-train the model (Pei et al, 2019).

Table 1: The four W's of explainable driving AI. Who needs explanations? What kind? For what reasons? When?

| Who? | Why? | What? | When? |
| End user, citizen | Trust, situation management | Intrinsic explanations, post-hoc explanations, persuasive explanations | Before/After |
| Designer, certification body | Debug, understand limitations and shortcomings, improve future versions, machine teaching | Stratified evaluation, corner cases, intrinsic explanations, post-hoc explanations | Before/After |
| Justice, regulator, insurance | Liability, accountability | Exhaustive and precise explanations, complete explanations, post-hoc explanations, training and validation data | After |

However, even if these global performance-based explanations are helpful to improve the model's performance, this virtuous circle may stagnate and not be sufficient to solve some types of mistakes. It is thus necessary to delve deeper into the inner workings of the model and to understand why it makes those errors. Practitioners will look for explanations that provide insights into the network's processing. Researchers may be interested in the regions of the image that were the most useful for the model's decision (Bojarski et al, 2018), the number of activated neurons for a given input (Tian et al, 2018), the measure of bias in the training data (Torralba and Efros, 2011), etc.

This being said, conducting a rigorous validation of a machine learning-based system is a hard problem, mainly because it is not trivial to specify the requirements a neural network should meet (Borg et al, 2019).
2.3.3 Regulators and legal considerations

In the European General Data Protection Regulation (GDPR)1, it is stated that users have the right to obtain explanations from automated decision-making systems. These explanations should provide "meaningful information about the logic involved" in the decision-making process. Algorithms are expected to be available for scrutiny of their inner workings (possibly through counterfactual interventions (Rathi, 2019; Wachter et al, 2017)), and their decisions should be available for contesting and contradiction. This should prevent unfair and/or unethical behaviors of algorithms. Even though these questions are crucial for the broad machine learning community in general, the field of autonomous driving is not directly impacted by such problems, as systems do not use personal data.

1 https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN

Legal institutions are interested in explanations for liability and accountability purposes, especially when a self-driving system is involved in a car accident. As noted in (Beaudouin et al, 2020), detailed explanations of all aspects of the decision process could be required to identify the reasons for a malfunction. This aligns with the guidelines towards algorithmic transparency and accountability published by the Association for Computing Machinery (ACM), which state that system auditability requires logging and record keeping (Garfinkel et al, 2017). In contrast with this local form of explanation, a more global explanation of the system's functioning could be required in a lawsuit. It consists in the full or partial disclosure of source code, training or validation data, or a thorough performance analysis. It may also be important to provide understandable information about the system's general logic, such as the goals of the loss function.

Notably, explanations generated for legal or regulatory institutions are likely to be different from those addressed to the end-user. Here, explanations are expected to be exhaustive and precise, as the goal is to delve deep into the inner workings of the system. These explanations are directed towards experts who will likely spend large amounts of time studying the system (Rosenfeld and Richardson, 2019), and who are thus inclined to receive rich explanations with great amounts of detail.
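The logging and record keeping required for auditability can be pictured with a small sketch: each driving decision is recorded together with a digest of its inputs and the model version, so that it can later be retrieved and contested. All field names and the versioning scheme below are hypothetical illustrations, not taken from the ACM guidelines or any deployed system.

```python
# Hypothetical per-decision record keeping for auditability: store when a
# decision was made, by which model version, on which (digested) inputs,
# and what command was issued.

import hashlib
import json
import time

class DecisionLog:
    def __init__(self):
        self.records = []

    def log(self, model_version, sensor_snapshot, command):
        record = {
            "timestamp": time.time(),
            "model_version": model_version,
            # A digest instead of raw data keeps records small while still
            # allowing auditors to check inputs against recorded sensor data.
            "input_digest": hashlib.sha256(
                json.dumps(sensor_snapshot, sort_keys=True).encode()
            ).hexdigest(),
            "command": command,
        }
        self.records.append(record)
        return record

log = DecisionLog()
entry = log.log("v1.2-hypothetical",
                {"speed": 12.4, "camera_id": 0},
                {"steering": -0.05, "throttle": 0.3})
```

A production system would of course persist such records in tamper-evident storage rather than an in-memory list.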
ing based models (Section 3.1.1); the main architectures used in modern driving systems are presented (Section 3.1.2), as well as how they are trained and optimized (Section 3.1.3). Finally, the main public datasets used for training self-driving models are presented in Section 3.1.4.

3.1.1 From historical modular pipelines to end-to-end learning

The history of autonomous driving systems started in the late '80s and early '90s with the European Eureka project called Prometheus (Dickmanns, 2002). It was later followed by driving challenges proposed by the Defense Advanced Research Projects Agency (DARPA). In 2005, STANLEY (Thrun et al, 2006) became the first autonomous vehicle to complete a Grand Challenge, a race of 142 miles in a desert area. Two years later, DARPA held the Urban Challenge, where autonomous vehicles had to drive in an urban environment, taking into account other vehicles and obeying traffic rules. BOSS won the challenge (Urmson et al, 2008), driving 97 km in an urban area at speeds of up to 48 km/h. The common point between STANLEY, BOSS, and the vast majority of the other approaches at the time (Leonard et al, 2008) is their modularity. Leveraging strong suites of sensors, these systems are composed of several sub-modules, each completing a very specific task. Broadly speaking, these sub-tasks deal with sensing the environment, forecasting future events, planning, taking high-level decisions, and controlling the vehicle.

3.1.2 Driving architecture

We now present the different components constituting most of the existing learning-based driving systems. As illustrated in Figure 2, we can distinguish four key elements involved in the design of a neural driving system: input sensors, input representations, output type, and learning paradigm.

Sensors. Sensors are the hardware interface through which the neural network perceives its environment. Typical neural driving systems rely on sensors from two families: proprioceptive sensors and exteroceptive sensors. Proprioceptive sensors provide information about the internal vehicle state, such as speed, acceleration, yaw, change of position, and velocity. They are measured through tachometers, inertial measurement units (IMU), and odometers. All these sensors communicate through the controller area network (CAN) bus, which makes their signals easily accessible. In contrast, exteroceptive sensors acquire information about the surrounding environment. They include cameras, radars, LiDARs, and GPS:

– Cameras are passive sensors that acquire a color signal from the environment. They provide RGB videos that can be analyzed using the vast and growing computer vision literature treating video signals. Despite being very cheap and rich sensors, there are two major downsides to their use. First, they are sensitive to illumination changes. It implies that
[Figure 2 diagram: a deep driving model connects Sensors (camera, RADAR, LiDAR, IMU, GPS) to Inputs (local history, point clouds, RGB video, object detections, semantic segmentations, depth maps, bird's-eye view), a Learning paradigm (imitation learning with a dataset, or reinforcement learning with a simulator), and Outputs (vehicle controls, future trajectory).]
Fig. 2: Overview of neural network-based autonomous driving systems.
day/night changes, in particular, have a strong impact on the performance of downstream algorithms, even if this phenomenon is tackled by some recent work on domain adaptation (Romera et al, 2019). Second, they perceive the 3D world through a 2D projection, making depth sensing with a single view challenging. This is an important research problem in which deep learning has shown promising results (Godard et al, 2017, 2019; Guizilini et al, 2020), but it is still not robust enough.
– Radars are active sensors that emit radio waves and measure the travel time and frequency shift of the received reflected waves. They can provide information about the distance and speed of other vehicles at long range, and are not sensitive to weather conditions. However, their accuracy can be quite poor.
– LiDARs work similarly to radars but emit light waves instead of radio waves. They are much more accurate than radars and can be used to construct a 3D representation of the surrounding scene. However, contrary to radars, they do not measure the relative speed of objects and are affected by bad weather (snow and heavy fog in particular). Also, the price and bulk of high-end LiDARs have so far made them unsuited for the majority of the car market.
– GPS receivers can estimate precise geolocation, within an error range of 30 centimeters, by monitoring multiple satellites to determine the precise position of the receivers.

For a more thorough review of driving sensors, we refer the reader to (Yurtsever et al, 2020).

Input representation. Once sensory inputs are acquired by the system, they are processed before being passed to the neural driving architecture. Approaches differ in the way they process the raw signals before feeding them to the network, and this step constitutes an active research topic. Focusing on cameras, recent work has proposed to use the raw image pixels directly (Bojarski et al, 2016; Codevilla et al, 2018), but most successful methods build a structured representation of the scene using computer vision models. This type of approach is referred to as mediated perception (Ullman, 1980): several perception systems provide their understanding of the world, and their outputs are aggregated to build an input for the driving model. An example of such vision tasks is object detection, which aims at finding and classifying relevant objects in a scene (cars, bicycles, pedestrians, stop signs, etc.). Popular object detectors such as Faster-RCNN (Ren et al, 2015) and YOLO (Redmon et al, 2016; Redmon and Farhadi, 2017, 2018) operate at the image level, and the temporality of the video can be leveraged to jointly detect and track objects (Behrendt et al, 2017; Li et al, 2018a; Fernandes et al, 2021). See (Feng et al, 2019) for a comprehensive survey on object detection and semantic segmentation for autonomous driving, including datasets, methods using multiple sensors, and challenges. In addition to detecting and tracking objects, understanding the vehicle's environment involves extracting depth information, i.e. knowing the distance that separates the vehicle from each point in space. Approaches to depth estimation vary depending on the sensors that are available: direct LiDAR measurements (Xu et al, 2019; Tang et al,
2019; Jaritz et al, 2018; Park et al, 2020), stereo cameras (Chang and Chen, 2018; Kendall et al, 2017) or even single monocular cameras (Fu et al, 2018; Kuznietsov et al, 2017; Amiri et al, 2019; Godard et al, 2017; Zhou et al, 2017; Casser et al, 2019; Godard et al, 2019; Guizilini et al, 2020). Other types of semantic information can be used to complement and enrich inputs, such as the recognition of pedestrian intent (Abughalieh and Alawneh, 2020; Rasouli et al, 2019).

Mediated perception contrasts with the direct perception approach (Gibson, 1979), which instead extracts visual affordances from an image. Affordances are scalar indicators that describe the road situation, such as curvature, deviation to neighboring lanes, or distances between the ego vehicle and other vehicles. These human-interpretable features are usually recognized using neural networks (Chen et al, 2015; Sauer et al, 2018; Xiao et al, 2020). They are then passed as input to a driving controller which is usually hard-coded, even if some recent approaches use affordance recognition to provide compact inputs to learning-based driving systems (Toromanoff et al, 2020).

Outputs. Ultimately, the goal is to generate vehicle controls. Some approaches, called end-to-end, tackle this problem by training the deep network to directly output the commands (Pomerleau, 1988; Bojarski et al, 2016; Codevilla et al, 2018). In practice, however, most methods instead predict the future trajectory of the autonomous vehicle; they are called end-to-mid methods. The trajectory is then expected to be followed by a low-level controller, such as the proportional–integral–derivative (PID) controller. The different choices for the network output, and their link with explainability, are reviewed and discussed in Section 5.3.

3.1.3 Learning

Two families of methods coexist for training self-driving neural models: behavior cloning approaches, which leverage datasets of human driving sessions (Section 3.1.3), and reinforcement learning approaches, which train models through trial-and-error simulation (Section 3.1.3).

Behavior cloning (BC). These approaches leverage huge quantities of recorded human driving sessions to learn the input-output driving mapping by imitation. In this setting, the network is trained to mimic the commands applied by the expert driver (end-to-end models), or the future trajectory (end-to-mid models), in a supervised fashion. The objective function is defined in the output space (vehicle controls, future trajectories, etc.) and minimized on a training set composed of human driving sessions. An initial attempt at behavior cloning of vehicle controls was made in (Pomerleau, 1988), and continued later in (Chen et al, 2015; Bojarski et al, 2016; Codevilla et al, 2018). For example, DESIRE (Lee et al, 2017) is the first neural trajectory prediction model based on behavior cloning.

Even if it seems satisfactory to train a neural network on easy-to-acquire expert driving videos, imitation learning methods suffer from several drawbacks. First, in the autoregressive setting, the test distribution differs from the train distribution due to the distributional shift (Ross et al, 2011) between expert training data and online behavior (Zeng et al, 2019; Codevilla et al, 2019). At train time, the model learns to make its decision from a state which is a consequence of previous decisions of the expert driver. As there is a strong correlation between consecutive expert decisions, the network finds and relies on this signal to predict future decisions. At deployment, the loop between previous prediction and current input is closed, and the model can no longer rely on previous expert decisions to take an action. This phenomenon gives low train and test errors, but very bad behavior at deployment. Second, supervised training is harmed by biases in datasets: a large part of real-world driving consists of a few simple behaviors, and only rare cases require complex reasoning. Also, systems trained with supervised behavior cloning suffer from causal confusion (de Haan et al, 2019), such that spurious correlations cannot be distinguished from true causal relations between input elements and outputs. Besides, behavior cloning methods are known to poorly explore the environment, and they are data-hungry, requiring massive amounts of data to generalize. Finally, behavior cloning methods are unable to learn in situations that are not contained in driving datasets: these approaches have difficulties dealing with dangerous situations that are never demonstrated by experts (Chen et al, 2020a).

Reinforcement learning (RL). Alternatively, researchers have explored using RL to train neural driving systems (Kiran et al, 2020; Toromanoff et al, 2020). This paradigm learns a policy by balancing self-exploration and reinforcement (Chen et al, 2020a). It does not require a training set of expert driving but relies instead on a simulator. In (Dosovitskiy et al, 2017), the autonomous vehicle evolves in the CARLA simulator, where it is asked to reach a high-level goal. As soon as it reaches the goal, collides with an object, or gets stuck for too long, the agent receives a reward, positive or negative, which it tries to maximize. This reward is a scalar value that combines speed, distance traveled towards the goal, collision damage, overlap with the sidewalk, and overlap with the opposite lane.

In contrast with BC, RL methods do not require any annotations and have the potential to achieve superhuman performance through exploration. However, these methods are inefficient to train, they necessitate a simulator, and the design of the reward function is delicate. Besides, as shown in (Dosovitskiy et al, 2017), RL-based systems achieve lower performance than behavior cloning training. More importantly, even if driving in simulation can provide insights about system design, the ultimate goal is to drive in the real world. Promising results have been reported in (Kendall et al, 2019) on training an RL driving system in the real world, but the problem is not solved yet. A detailed review of reinforcement learning models is provided in (Kiran et al, 2020).

It is also worth mentioning the family of Inverse Reinforcement Learning (IRL) methods, which use both expert driving data and simulation. IRL is based on the assumption that humans drive optimally. These techniques aim at discovering the unknown reward function justifying human driving behavior (Ng and Russell, 2000; Sharifzadeh et al, 2016; Kiran et al, 2020). On standard control tasks, IRL approaches are particularly efficient in the low data regime, i.e. when few expert trajectories are available (Ho and Ermon, 2016). In the context of autonomous driving, IRL has been mostly employed for learning on driving-related sub-tasks such as highway driving (Abbeel and Ng, 2004; Syed and Schapire, 2007), automatic parking lot navigation (Abbeel et al, 2008), urban driving (Ziebart et al, 2008), lane changing (Sharifzadeh et al, 2016) and comfortable driving (Kuderer et al, 2015). Unfortunately, IRL algorithms are expensive to train as they involve a reinforcement learning step between cost estimation and policy training and evaluation (Kiran et al, 2020).

3.1.4 Driving datasets

We list here public datasets used for training self-driving models. We do not exhaustively cover all of them and refer the reader to (Janai et al, 2020) for more datasets. Instead, we focus on datasets that can be used for designing transparent driving systems thanks to extra annotations, or that can be used to learn to provide post-hoc explanations. Table 2 summarizes the main characteristics of these datasets.

Geiger et al (2013) have pioneered the work on multi-modal driving datasets with KITTI, which contains 1.5 hours of human driving acquired through stereo cameras and LiDAR sensors. The dataset offers 15k frames annotated with 3D bounding boxes and semantic segmentation maps. More recently, Caesar et al (2020) released the nuScenes dataset, composed of one thousand clips of 20 seconds each. The acquisition was done through 6 cameras for a 360° field of view, 5 radars, and one LiDAR. Keyframes are sampled at 2Hz and fully annotated with 3D bounding boxes of 23 object classes. Besides, a human-annotated semantic map of 11 classes (e.g. traffic light, stop line, drivable area) is associated to the clips on keyframes, and can be used in combination with the precise localization data (with errors below 10 cm). Other multi-modal driving datasets have been released (e.g., Waymo Open Dataset (Sun et al, 2020), ArgoVerse (Chang et al, 2019a), Lyft L5 (Houston et al, 2020)) with a varying number of recorded hours, type and number of sensors, and semantic annotations. Contrasting with these datasets using a calibrated camera, in BDDV (Xu et al, 2017), the authors have collected a large quantity of dash-cam driving videos and explored the use of this low-quality data to learn driving models.

3.2 Challenges for explainable autonomous vehicles

Introducing explainability in the design of learning-based self-driving systems is a challenging task. The concerns arise from two aspects: modern self-driving systems are deep learning models, which brings the known shortcomings associated with these trained architectures, as detailed in Section 3.2.1; besides, these systems implicitly solve several heterogeneous subtasks at the same time, as explained in Section 3.2.2.

3.2.1 Autonomous vehicles are machine learning models

The explainability hurdles of self-driving models are shared with most deep learning models, across many application domains. Indeed, decisions of deep systems are intrinsically hard to explain, as the functions these systems represent, mapping from inputs to outputs, are not transparent. In particular, although it may be possible for an expert to broadly understand the structure of the model, the parameter values, which have been learned, are yet to be explained.

From a machine learning perspective, there are several factors giving rise to interpretability problems for self-driving systems, as machine learning researchers do not perfectly master the dataset, the trained model, and the learning phase. These barriers to explainability are reported in Figure 3.
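As a concrete illustration of the reward shaping discussed in Section 3.1.3, a CARLA-style scalar reward combining speed, progress towards the goal, collision damage, and lane/sidewalk infractions might be sketched as follows. The field names and weights below are illustrative assumptions for this sketch, not the actual values used by Dosovitskiy et al (2017):

```python
from dataclasses import dataclass

@dataclass
class StepMeasurements:
    # Quantities a driving simulator could expose at each time step (hypothetical names).
    speed_kmh: float              # current speed of the ego vehicle
    dist_to_goal_m: float         # remaining distance to the high-level goal
    collision_damage: float       # accumulated collision intensity
    sidewalk_overlap: float       # fraction of the vehicle footprint on the sidewalk, in [0, 1]
    opposite_lane_overlap: float  # fraction on the opposite lane, in [0, 1]

def reward(prev: StepMeasurements, curr: StepMeasurements) -> float:
    """Scalar reward mixing progress and infraction terms, CARLA-style.

    The weights are arbitrary illustrative values: positive terms for moving
    fast and getting closer to the goal, negative terms for new collision
    damage and for leaving the drivable lane.
    """
    progress = prev.dist_to_goal_m - curr.dist_to_goal_m       # > 0 when approaching the goal
    collision = curr.collision_damage - prev.collision_damage  # new damage this step
    return (
        0.05 * curr.speed_kmh
        + 1.0 * progress
        - 2.0 * collision
        - 2.0 * curr.sidewalk_overlap
        - 2.0 * curr.opposite_lane_overlap
    )
```

Such a hand-tuned mixture is precisely what makes the design of the reward function delicate: small changes to the weights can favor, for instance, speed over lane discipline.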
– KITTI (Geiger et al, 2013): 1.5 hours; 2 RGB + 2 grayscale cameras; 2D/3D bounding boxes, tracking, pixel-level.
– Cityscapes (Cordts et al, 2016): 20K frames; 2 RGB cameras; pixel-level.
– SYNTHIA (Ros et al, 2016): 200K frames; 2 multi-cameras; pixel-level, depth.
– HDD (Ramanishka et al, 2018): 104 hours; 3 cameras; driver behavior annotations (labels).
– BDDV (Xu et al, 2017): 10K hours; dash-cam.
– BDD100K (Yu et al, 2020): 100K × 40s; dash-cam; 2D bounding boxes, tracking, pixel-level.
– BDD-A (Xia et al, 2018): 1232 × 10s; dash-cam; human gaze.
– BDD-X (Kim et al, 2018): 7K × 40s; dash-cam; textual explanations associated to video segments.
– BDD-OIA (Xu et al, 2020): 23K × 5s; dash-cam; authorized actions, explanations (classification).
– BDD-A extended (Shen et al, 2020): 1103 × 10s; dash-cam; human gaze, human desire-for-an-explanation score.
– Brain4Cars (Jain et al, 2016): 1180 miles; road + cabin cameras.
– nuScenes (Caesar et al, 2020): 1000 × 20s; 6 cameras; 2D/3D bounding boxes, tracking, maps.
– ApolloScape (Huang et al, 2018): 100 hours; 6 cameras; fitted 3D models of vehicles, pixel-level.
– Lyft L5 (Houston et al, 2020): 1K hours; 7 cameras; 2D aerial boxes, HD maps.
– Waymo Open Dataset (Sun et al, 2020): 1150 × 20s; 5 cameras; 2D/3D bounding boxes, tracking.
– ArgoVerse (Chang et al, 2019a): 300K × 5s; 360° + stereo cameras; 2D/3D bounding boxes, tracking, maps.
– DoTA (Yao et al, 2020): 4677 videos; dash-cam; temporal and spatial (tracking) anomaly detection.
– Road Scene Graph (Tian et al, 2020): 506 videos; 6 cameras; relationships.
– CTA (You and Han, 2020): 1935 videos; dash-cam; accidents labeled with causes and effects, temporal segmentation.

Table 2: Summary of driving datasets (volume, cameras, and annotations; availability of LiDAR, radar, GPS/IMU and CAN signals varies per dataset). The most used driving datasets for training learning-based driving models are presented in Section 3.1.4; in addition, datasets that specifically provide explanation information are presented throughout Section 5.2.1.
First, the dataset used for training brings interpretability problems, with questions such as: has the model encountered situations like X? Indeed, a finite training dataset cannot exhaustively cover all possible driving situations, and it will likely under- and over-represent some specific ones (Tommasi et al, 2017). Moreover, datasets contain numerous biases of various natures (omitted-variable bias, cause-effect bias, sampling bias), which also gives rise to explainability issues related to fairness (Mehrabi et al, 2019).

Second, the trained model, and the mapping function it represents, is poorly understood and is considered as a black-box. The model is highly non-linear and does not provide any robustness guarantee, as small input changes may dramatically change the output behavior. Also, these models are known to be prone to adversarial attacks (Morgulis et al, 2019; Deng et al, 2020). Explainability issues thus occur regarding the generalizability and robustness aspects: how will the model behave under new scenarios?

Third, the learning phase is not perfectly understood. Among other things, there are no guarantees that the model will settle at a minimum point that generalizes well to new situations, and that the model does
[Figure 3 diagram. Dataset column (hurdles: thousands of driving sessions; various biases; under/over-represented situations. Question: is a situation like "X" encountered in the dataset?). Model column (hurdles: black-box; millions of parameters; highly non-linear; robustness issues; prone to adversarial attacks. Questions: how will the model behave in a new scenario? Can the model generalize to unseen situations? Is the model robust to slightly perturbed inputs?). Learning column (hurdles: spurious correlations; underfit or overfit on some situations; misspecified objective. Questions: did the model correctly learn on situations that rarely occur? Did the model learn to make decisions for the right reasons?)]
Fig. 3: Explainability hurdles and questions for autonomous driving models, as seen from a machine learning point of view.

not underfit on some situations and overfit on others. Besides, the model may learn to ground its decisions on spurious correlations during training instead of leveraging causal signals (Codevilla et al, 2019; de Haan et al, 2019). We aim at finding answers to questions like: which factors caused this decision to be taken?

These known issues related to training deep models apply beyond autonomous driving applications. There is a strong research trend trying to tackle these problems through the prism of explainability, to characterize the problems, and to try to mitigate them. In Section 4 and Section 5, we review selected works that link to the self-driving literature.

3.2.2 Autonomous vehicles are heterogeneous systems

For humans, the complex task of driving involves solving many intermediate sub-problems, at different levels of hierarchy (Michon, 1984). In the effort towards building an autonomous driving system, researchers aim at providing the machine with these intermediate capabilities. Thus, explaining the general behavior of an autonomous vehicle inevitably requires understanding how each of these intermediate steps is carried out and how it interacts with the others, as illustrated in Figure 4. We can categorize these capabilities into three types:

– Perception: information about the system's understanding of its local environment. This includes the objects that have been recognized and assigned to a semantic label (persons, cars, urban furniture, driveable area, crosswalks, traffic lights), their localization, properties of their motion (velocity, acceleration), intentions of other agents, etc.;
– Reasoning: information about how the different components of the perceived environment are organized and assembled by the system. This includes global explanations about the rules that are learned by the model, instance-wise explanations showing which objects are relevant in a given scene (Bojarski et al, 2018), traffic pattern recognition (Zhang et al, 2013), and object occlusion reasoning (Wojek et al, 2011, 2013);
– Decision: information about how the system processes the perceived environment and its associated reasoning to produce a decision. This decision can be a high-level goal such as "the car should turn right", a prediction of the ego vehicle's trajectory, its low-level relative motion, or even the raw controls.

While the separation between perception, reasoning, and decision is clear in modular driving systems, some recent end-to-end neural networks blur the lines and perform these steps simultaneously (Bojarski et al, 2016). However, despite the efficiency and flexibility of end-to-end approaches, they leave little room for structured modeling of explanations, which would give the end-user a thorough understanding of how each step is achieved. Indeed, when an explanation method is developed for a neural driving system, it is often not clear whether it attempts to explain the perception, the reasoning, or the decision step. Considering the nature of neural network architectures and training, disentangling perception, reasoning, and decision in neural driving systems constitutes a non-trivial challenge.

[Figure 4 diagram. Perception (hurdles: high-dimensional space; many sensor types; input space not semantic. Questions: what did the model perceive? Did the model see "X" and "Y"? Which part of the input is more important?). Reasoning (hurdles: many latent rules; spurious correlations. Questions: how did the model reason about that partially occluded object? Did the model keep track of the pedestrian?). Decision (hurdles: several possible futures. Questions: why was a lane change decided? Where will the car go in the near future?)]

Fig. 4: Explainability hurdles and questions for autonomous driving models, as seen from an autonomous driving point of view.

3.2.3 Organization of the rest of the survey

As explained in the previous section, there are many aspects to be explained in a self-driving model. Several orthogonal dimensions can be identified to organize the X-AI literature, regarding for example whether or not the explanation is provided in a post-hoc fashion, or whether it globally explains the model or just a specific instance, depending on the type of input/output/model. At this point, we want to emphasize that the intention of our article is not to exhaustively review the literature on X-AI, which was comprehensively covered in many surveys (Gilpin et al, 2018; Adadi and Berrada, 2018; Xie et al, 2020; Vilone and Longo, 2020; Moraffah et al, 2020; Beaudouin et al, 2020), but to cover existing work at the intersection of explainability and driving systems. For the sake of simplicity and with autonomous driving research in mind, we classify the methods into two main categories. Methods that belong to the first category (Section 4) are applied to an already-trained deep network and are designed to provide post-hoc explanations. The second category (Section 5) contains intrinsically explainable systems, where the model is designed to provide upfront some degree of interpretability of its processing. This organization choice is close to the one made in (Gilpin et al, 2018; Xie et al, 2020).

4 Explaining a deep driving model

When a deep learning model in general — or a self-driving model more specifically — comes as an opaque black-box, as it has not been designed with a specific explainability constraint, post-hoc methods have been proposed to gain interpretability from the network processing and its representations. Post-hoc explanations have the advantage of giving an interpretation to black-box models without conceding any predictive performance. In this section, we assume that we have a pre-trained model f. Two main categories of post-hoc methods can be distinguished to explain f: local methods, which explain the prediction of the model for a specific instance (Section 4.1), and global methods, which seek to explain the model in its entirety (Section 4.2), i.e. by gaining a finer understanding of learned representations and activations. Besides, we also make a connection with the system validation literature, which aims at automatically making a stratified evaluation of deep models on various scenarios and discovering failure situations, in Section 4.3. Selected references from this section are reported in Table 3.

4.1 Local explanations

Given an input image x, a local explanation aims at justifying why the model f gives its specific prediction y = f(x). In particular, we distinguish three types of approaches: saliency methods, which determine the regions of image x that influence the decision the most (Section 4.1.1); local approximations, which approach the behavior of the black-box model f locally around the instance x (Section 4.1.2); and counterfactual analysis, which aims to find the cause in x that made the model predict f(x) (Section 4.1.3).
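All of these local methods probe the black-box f only through its input/output behavior. A minimal perturbation-based probe, in the spirit of occlusion-style importance estimation, can be sketched on a toy stand-in model; the model, feature layout, and weights below are illustrative assumptions, not taken from any cited work:

```python
def occlusion_importance(f, x, baseline=0.0):
    """Estimate the influence of each input region on f(x) by masking it.

    f: black-box model mapping a list of feature values to a scalar prediction.
    x: input instance (list of floats), e.g. pooled image-region intensities.
    Returns one score per region: the drop in prediction when that region is
    replaced by a neutral baseline value (a large drop means the region is
    influential for this particular prediction).
    """
    y = f(x)
    scores = []
    for i in range(len(x)):
        x_masked = list(x)
        x_masked[i] = baseline            # occlude region i
        scores.append(y - f(x_masked))    # contribution of region i to f(x)
    return scores

# Toy stand-in for a driving model: a "brake score" dominated by region 2
# (say, the image patch containing a lead vehicle). Purely illustrative.
def toy_model(x):
    weights = [0.1, 0.0, 0.8, 0.1]
    return sum(w * v for w, v in zip(weights, x))

saliency = occlusion_importance(toy_model, [1.0, 1.0, 1.0, 1.0])
# Region 2 receives the highest score, since the toy model relies on it most.
```

On a real network, the same loop runs over pixels or superpixels with many forward passes, which is exactly the computational cost discussed for perturbation-based methods in Section 4.1.1.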
Local explanations:
– Saliency map (Section 4.1.1): VisualBackprop (Bojarski et al, 2018, 2017); Causal filtering (Kim and Canny, 2017); Grad-CAM (Sauer et al, 2018); Meaningful Perturbations (Liu et al, 2020).
– Local approximation (Section 4.1.2): ∅.
– Counterfactual interventions (Section 4.1.3): Shifting objects (Bojarski et al, 2017); Removing objects (Li et al, 2020c); Causal factor identification (Bansal et al, 2019).
Global explanations:
– Model translation (Section 4.2.1): ∅.
– Representations (Section 4.2.2): Neuron coverage (Tian et al, 2018).
– Prototypes and Criticisms (Section 4.2.3): ∅.
Evaluation (Section 4.3): Specific test cases (Bansal et al, 2019); Subset filtering (Hecker et al, 2020); Automatic finding of corner cases (Tian et al, 2018).
Table 3: Key references aiming at explaining a learning-based driving model.

4.1.1 Saliency methods

A saliency method aims at explaining which regions of the input image influence the output of the model the most. These methods produce a saliency map (a.k.a. heat map) that highlights the regions on which the model relied the most for its decision. There are two main lines of methods to obtain a saliency map for a trained network, namely back-propagation methods and perturbation-based methods. Back-propagation methods propagate output information back into the network and evaluate the gradient of the output with respect to the input, or to intermediate feature maps, to generate a heat map of the most contributing regions. These methods include DeConvNet (Zeiler and Fergus, 2014) and its generalized version (Simonyan et al, 2014), Guided Backprop (Mahendran and Vedaldi, 2016), Class Activation Mapping (CAM) (Zhou et al, 2016), Grad-CAM (Selvaraju et al, 2020), Layer-Wise Relevance Propagation (LRP) (Bach et al, 2015), DeepLIFT (Shrikumar et al, 2017) and Integrated Gradients (Sundararajan et al, 2017). Perturbation-based methods estimate the importance of an input region by observing how modifications of this region impact the prediction. These modifications include editing methods such as pixel (Zeiler and Fergus, 2014) or super-pixel (Ribeiro et al, 2016) occlusion, and greying out (Zhou et al, 2015a) or blurring (Fong and Vedaldi, 2017) image regions.

In the autonomous driving literature, saliency methods have been employed to highlight the image regions that influence driving decisions the most. By doing so, these methods mostly explain the perception part of the driving architectures. The first saliency method to visualize input influence in the context of autonomous driving was developed by Bojarski et al (2018). The VisualBackprop method they propose identifies sets of pixels by backpropagating activations both from late layers, which contain relevant information for the task but have a coarse resolution, and from early layers, which have a finer resolution. The algorithm runs in real-time and can be embedded in a self-driving car. This method has been used by Bojarski et al (2017) to explain PilotNet (Bojarski et al, 2016), a deep end-to-end opaque self-driving architecture. They qualitatively validate that the model correctly grounds its decisions on lane markings, edges of the road (delimited by grass or parked cars), and surrounding cars.

The VisualBackprop procedure has also been employed by Mohseni et al (2019) to gain more insight into the PilotNet architecture, and into its failures in particular. They use saliency maps to predict model failures by training a student model that operates over saliency maps and tries to predict the error made by PilotNet. They find that the saliency maps given by VisualBackprop are better suited than raw input images to predict model failure, especially in adverse conditions. Kim and Canny (2017) propose a saliency visualization method for self-driving models built with an attention mechanism. They explain that attention maps comprise "blobs" and argue that while some input blobs have a true causal influence on the output, others are spurious. Thus, they propose to segment and filter out about 60% of spurious blobs to produce simpler causal saliency maps, derived from attention maps in a post-hoc analysis. To do so, they measure the decrease in performance when a local visual blob from a raw input image is masked out. Qualitatively, they find that the network cues on features that are also used by humans while driving, including surrounding cars and lane markings for example. Recently, Sauer et al (2018) proposed to condition the saliency visualization on a variety of driving features, namely driving "affordances". They employ the Grad-CAM saliency technique (Selvaraju et al, 2020) on an end-to-mid self-driving model trained to predict driving affordances on a dataset recorded from the CARLA simulator (Dosovitskiy et al, 2017). They argue that saliency methods are particularly well suited for this type of architecture, contrary to end-to-end models: for the latter, all of the perception (e.g. detection of speed limits, red lights, cars, etc.) is mapped to a single control output, whereas in their case the saliency in the input image can be analyzed for each affordance, e.g. "hazard stop" or "red light". Still in the context of driving scenes, although not properly for explaining a self-driving model, it is worth mentioning that Liu et al (2020) use the perturbation-based masking strategy of Fong and Vedaldi (2017) to obtain saliency maps for a driving scene classification model trained on the HDD dataset (Ramanishka et al, 2018).

While saliency methods enable visual explanations for deep black-box models, they come with some limitations. First, they are hard to evaluate. For example, human evaluation can be employed (Ribeiro et al, 2016), but this comes with the risk of selecting methods that are more persuasive, i.e. plausible and convincing, and not necessarily faithful. Another possibility to evaluate saliency methods is to use additional annotations provided by humans, which can be costly to acquire, to be matched with the produced saliency map (Fong and Vedaldi, 2017). Second, Adebayo et al (2018) indicate that the generated heat maps may be misleading, as some saliency methods are independent both of the model and of the data. Indeed, they show that some saliency methods behave like edge detectors even when they are applied to a randomly initialized model. Besides, Ghorbani et al (2019) show that it is possible to attack visual saliency methods so that the generated heat maps no longer highlight important regions, while the predicted class remains unchanged. Lastly, different saliency methods produce different results, and it is not obvious to know which one is correct, or better than the others. In that respect, a potential research direction is to learn to combine explanations coming from various explanation methods.

4.1.2 Local approximation methods

The idea of a local approximation method is to approach the behavior of the black-box model in the vicinity of the instance to be explained with a simpler model. In practice, a separate model, inherently interpretable, is built to act as a proxy for the input/output mapping of the main model locally around the instance. Such methods include the Local Interpretable Model-agnostic Explanations (LIME) approach (Ribeiro et al, 2016), which learns an interpretable-by-design input/output mapping mimicking the behavior of the main model in the neighborhood of an input. In practice, such a mapping can be instantiated by a decision tree or a linear model. To constitute a dataset to learn the surrogate model, data points are sampled around the input of interest and the corresponding predictions are computed by the black-box model. This forms the training set on which the interpretable model learns. Note that in the case of LIME, the interpretable student model does not necessarily use the raw instance data but rather an interpretable input, such as a binary vector indicating the presence or absence of a superpixel in an image. The SHapley Additive exPlanations (SHAP) approach (Lundberg and Lee, 2017) was later introduced to generalize LIME, as well as other additive feature attribution methods, and provides more consistent results. In (Ribeiro et al, 2018), anchors are introduced to provide local explanations of complex black-box models. They consist of high-precision if-then rules, which constitute sufficient conditions for the prediction. Similarly to LIME, perturbations are applied to the example of interest to create a local dataset. Anchors are then found from this local distribution, consisting of input chunks which, when present, almost surely preserve the prediction made by the model.

In the autonomous driving literature, we are not aware of any work that aims to explain a self-driving model by locally approximating it with an interpretable model. Some relevant work, though, is that of Ponn et al (2020), which leverages the SHAP approach to investigate the performance of object detection algorithms in the context of autonomous driving. The fact that almost no paper explains self-driving models with local approximation methods is likely due to the cost of local approximation strategies, as a set of perturbed inputs must be sampled and forwarded through the main model to collect their corresponding labels. For example, in the case of SHAP, the number of forward passes required to explain the model is exponential in the number of features, which is prohibitive when it comes to explaining computer vision models with input pixels. Sampling strategies need to be carefully designed to reduce the complexity of these explanation models. Besides, those methods operate on a simplified input representation instead of the raw input. This interpretable semantic basis should be chosen wisely, as it constitutes the vocabulary that can be used by the explanation system. Finally, these techniques were shown to be highly sensitive to hyper-parameter choices (Bansal et al, 2020).

4.1.3 Counterfactual explanation

Recently, a lot of attention has been put on counterfactual analysis, a field from the causal inference literature (Pearl, 2009; Moraffah et al, 2020). A counterfactual analysis aims at finding the features X within the input x that caused the decision y = f(x) to be taken, by imagining a new input instance x′ where X is changed and a different outcome y′ is observed. The new imaginary scenario x′ is called a counterfactual example, and the different output y′ is a contrastive class. The new counterfactual example, and the change in X between x and x′, constitute counterfactual explanations.

(…, 2018; Harradon et al, 2018). To tackle this task, the general idea is to model the deep network as a Functional Causal Model (FCM), on which the causal effect of a model component can be computed with causal reasoning (Pearl, 2009). For example, this has been employed to gain an understanding of the latent space learned by a variational autoencoder (VAE) or a generative adversarial network (GAN) (Besserve et al, 2020), or in RL to explain an agent's behavior with counterfactual examples by modeling them with an SCM (Madumal et al, 2020). Counterfactual explanations have the advantage that they require access neither to the dataset nor to the model to be computed. This aspect is important for automotive stakeholders who own datasets and the industrial property of their models, and who may lose a competitive advantage by being forced to disclose them. Besides, counterfactual explanations are GDPR compliant (Wachter et al, 2017). A potential limit of counterfactual explanations is that they are not unique: distinct explanations can explain the same situation equally well while contradicting each other.
In other words, a counterfactual example is a When dealing with a high-dimensional input space modified version of the input, in a minimal way, that — as it is the case with images and videos — coun- changes the prediction of the model to the predefined terfactual explanations are very challenging to obtain output y0. For instance, in an autonomous driving con- as naively producing examples under the requirements 0 text, it corresponds to questions like “What should be specified above leads to new instances x that are imper- different in this scene, such that the car would have ceptibly changed with respect to x while having output 0 0 stopped instead of moving forward?” Several require- y = f(x ) dramatically different from y = f(x). This ments should be imposed to find counterfactual exam- can be explained given that the problem of adversarial ples. First, the prediction f(x0) of the counterfactual perturbations arises with high dimensional input space example must be close to the desired contrastive class of machine learning models, neural networks in partic- y0. Second, the counterfactual change must be minimal, ular (Szegedy et al, 2014). To mitigate this issue in the i.e. the new counterfactual example x0 must be as sim- case of image classification, Goyal et al(2019) use a ilar as possible to x, either by making sparse changes specific instance, called a distractor image, from the or in the sense of some distance. Third, the counterfac- predefined target class and identify the spatial region tual change relevant, i.e. new counterfactual instances in the original input such that replacing them with spe- must be likely in the underlying input data distribution. cific regions from the distractor image would lead the The simplest strategy to find counterfactual examples is system to classify the image as the target class. 
Besides, the naive trial-and-error strategy, which finds counter- Hendricks et al(2018) provide counterfactual explana- factual instances by randomly changing input features. tions by staying at the attribute level and by augment- More advanced protocols have been proposed, for ex- ing the training data with negative examples created ample Wachter et al(2017) propose to minimize both with hand-crafted rules. the distance between the model prediction f(x0) for the Regarding the autonomous driving literature, there counterfactual x0 and the contrastive output y0 and the only exists a limited number of approaches involving distance between x and x0. Traditionally, counterfac- counterfactual interventions. When the input space has tual explanations have been developed for classification semantic dimensions and can thus be easily manipu- tasks, with a low-dimensional semantic input space, lated, it is easy to check for the causality of input such as the credit application prediction task (Wachter factors by intervening on them (removing or adding). et al, 2017). It is worth mentioning that there also ex- For example, Bansal et al(2019) investigate the causal ist model-based counterfactual explanations which aim factors for specific outputs: they test the Chauffeur- at answering questions like “What decision would have Net model under hand-designed inputs where some ob- been taken if this model component was not part of jects have been removed. With a high-dimensional in- the model or designed differently?” (Narendra et al, put space (e.g. pixels), Bojarski et al(2017) propose to Explainability of vision-based autonomous driving systems: Review and challenges 17
check the causal effect that image parts have, with a saliency visualization method. In particular, they measure the effect of shifting the image regions that were found salient by VisualBackProp on the PilotNet architecture. They observe that translating only these image regions, while maintaining the position of other non-salient pixels, leads to a significant change in the steering angle output. Moreover, translating non-salient image regions, while maintaining salient ones, leads to almost no change in the output of PilotNet. This analysis indicates a causal effect of the salient image regions. More recently, Li et al (2020c) introduce a causal inference strategy for the identification of “risk-objects”, i.e. objects that have a causal impact on the driver's behavior (see Figure 5). The task is formalized with an FCM and objects are removed in the input stream to simulate causal effects, the underlying idea being that removing non-causal objects will not affect the behavior of the ego vehicle. Under this setting, they do not require strong supervision about the localization of risk-objects, but only the high-level behavior label (‘go’ or ‘stop’), as provided in the HDD dataset (Ramanishka et al, 2018) for example. They propose a training algorithm with interventions, where some objects are randomly removed in scenes where the output is ‘go’. The object removal is instantiated with partial convolutions (Liu et al, 2018). At inference, in a sequence where the car predicts ‘stop’, the risk-object is found as the one which gives the highest score to the ‘go’ class.

Fig. 5: Removing a pedestrian induces a change in the driver's decision from Stop to Go, which indicates that the pedestrian is a risk-object. Credits to (Li et al, 2020c).

We call the reader's attention to the fact that analyzing driving scenes and building driving models using causality is far from trivial, as it requires the capacity to intervene on the model's inputs. This, in the context of driving, is a highly complex problem to solve for three main reasons. First, the data is composed of high-dimensional tensors of raw sensor inputs (such as the camera or LiDAR signals) and scalar-valued signals that represent the current physical state of the vehicle (velocity, yaw rate, acceleration, etc.). Performing controlled interventions on these input spaces requires the capacity to modify the content of raw high-dimensional inputs (e.g. videos) realistically: changes in the input space such that counterfactual examples still belong to the data distribution, without producing meaningless perturbations akin to adversarial ones. Even though some recent works explore realistic alterations of visual content (Gao et al, 2020), this is yet to be applied in the context of self-driving, and this open challenge, shared by other interpretability methods, is discussed in more detail in Section 5.1.2. Interestingly, as more and more neural driving systems rely on semantic representations (see Section 3.1.2), alterations of the input space are simplified as the realism requirement is removed, and synthetic examples can be passed to the model, as has been done in (Bansal et al, 2019). Second, modified inputs must be coherent and respect the underlying causal structure of the data generation process. Indeed, the different variables that constitute the input space are interdependent, and performing an intervention on one of these variables implies that we can simulate accordingly the reaction of the other variables. As an example, we may be provided with a driving scene that depicts a green light, pedestrians waiting, and vehicles passing. A simple intervention consisting of changing the state of the light to red would imply massive changes on the other variables to be coherent: pedestrians should start crossing the street and vehicles should stop at the red light. The very recent and promising work of Li et al (2020d) tackles the issue of unsupervised causal discovery in videos. They discover a structural causal model in the form of a graph that describes the relational dependencies between variables. Interestingly, this causal graph can be leveraged to perform interventions on the data (e.g. specify the state of one of the variables), leading to an evolution of the system that is coherent with this inferred graph. We believe that the adaptation of this type of approach to real driving data is crucial for the development of causal explainability. Finally, even if we are able to perform realistic and coherent interventions on the input space, we would need to have annotations for these new examples. Indeed, whether we use those altered examples to train a driving model or to perform exhaustive and controlled evaluations, expert annotations would be required. Considering the nature of the driving data, it might be hard for a human to provide these annotations: they would need to imagine the decision they would have taken (control values or future trajectory) in this newly generated situation.
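The optimization view of counterfactual search sketched earlier in this section (Wachter et al, 2017) — pull f(x') toward the contrastive output while keeping x' close to x — can be illustrated on a toy differentiable model. The logistic “stop vs. go” scorer, its weights, and all hyper-parameters below are illustrative assumptions standing in for a real driving model, not taken from any cited system:

```python
import numpy as np

# Toy differentiable "stop vs. go" scorer standing in for a driving model.
w = np.array([2.0, -1.0, 0.5])
b = -0.2

def f(x):
    """Probability of the 'stop' class under a logistic model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def counterfactual(x, y_target, lam=0.05, lr=0.1, steps=2000):
    """Gradient search for x' minimizing
        (f(x') - y_target)**2 + lam * ||x' - x||**2,
    i.e. the two competing terms of Wachter et al (2017): reach the
    contrastive output while staying close to the original input."""
    x_cf = x.copy()
    for _ in range(steps):
        p = f(x_cf)
        # Chain rule through the squared loss and the sigmoid.
        grad = 2.0 * (p - y_target) * p * (1.0 - p) * w + 2.0 * lam * (x_cf - x)
        x_cf = x_cf - lr * grad
    return x_cf

x = np.array([-1.0, 0.5, 0.0])           # the model says "go": f(x) is low
x_cf = counterfactual(x, y_target=1.0)   # what minimal change makes it "stop"?
```

In a driving context, x would ideally be a semantic scene description rather than raw pixels: on high-dimensional images this naive search degenerates into imperceptible adversarial noise, which is precisely the limitation discussed above.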
4.2 Global explanations

Global explanations contrast with local explanation methods as they attempt to explain the behavior of a model in general, by summarizing the information it contains. We cover three families of methods to provide global explanations: model translation techniques, which aim at transforming an opaque neural network into a more interpretable model (Section 4.2.1), representation explanations, which analyze the knowledge contained in the data structures of the model (Section 4.2.2), and prototype-based methods, which provide global explanations by selecting and aggregating multiple local explanations (Section 4.2.3).

4.2.1 Model translation

The idea of model translation is to transfer the knowledge contained in the main opaque model into a separate machine learning model that is inherently interpretable. Concretely, this involves training an explainable model to mimic the input-output mapping of the black-box function. Despite sharing the same spirit as the local approximation methods presented in Section 4.1.2, model translation methods are different as they should approximate the main function globally, across the data distribution. In the work of Zhang et al (2018), an explanatory graph is built from a pre-trained convolutional neural net to understand how the patterns memorized by its filters are related to object parts. This graph aims at providing a global view of how visual knowledge is organized within the hierarchy of convolutional layers in the network. Deep neural networks have also been translated into soft decision trees (Frosst and Hinton, 2017) or rule-based systems (Zilke et al, 2016; Sato and Tsukimoto, 2001). The recent work of Harradon et al (2018) presents a causal model used to explain the computation of a deep neural network. Human-understandable concepts are first extracted from the neural network of interest, using auto-encoders with sparsity losses. Then, the causal model is built using those discovered human-understandable concepts and can quantify the effect of each concept on the network's output.

To the best of our knowledge, such strategies have not been used in the autonomous driving literature to visualize and interpret the rules learned by a neural driving system. Indeed, one of the limits of such a strategy lies in the disagreements between the interpretable translated model and the main self-driving model. These disagreements are inevitable, as rule-based models or soft decision trees have a lower capacity than deep neural networks. Moreover, these methods are typically designed to explain deep networks that perform a classification task, which is usually not the case for self-driving models.

4.2.2 Explaining representations

Representations in deep networks take various forms, as they are organized in a hierarchy that encompasses individual units (neuron activations), vectors, and layers (Gilpin et al, 2018). The aim of explaining representations is to provide insights into what is captured by the internal data structures of the model, at different granularities. Representations are of practical importance in transfer learning scenarios, i.e. when they are extracted from a deep network trained on one task and transferred to bootstrap the training of a new network optimizing a different task. In practice, the quality of intermediate representations can be evaluated, and thus made partially interpretable, with a proxy transfer learning task (Razavian et al, 2014). At another scale, some works attempt to gain insights into what is captured at the level of an individual neuron (Zhang and Zhu, 2018). For example, a neuron's activation can be interpreted by accessing input patterns which maximize its activation, for example by sampling such input images (Zhou et al, 2015b; Castrejón et al, 2016), with gradient ascent (Erhan et al, 2009; Simonyan et al, 2014), or with a generative network (Nguyen et al, 2016). To gain more understanding of the content of vector activations, the t-Distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008) has been proposed to project high-dimensional data into a space of lower dimension (usually 2d or 3d). This algorithm aims at preserving the distances between points in the new space where points are projected. t-SNE has been widely employed to visualize and gain more interpretability from representations, by producing scatter plots as explanations. This has for example been employed for video representations (Tran et al, 2015), or deep Q-networks (Zahavy et al, 2016).

In the autonomous driving literature, such approaches have not been widely used, to the best of our knowledge. The only example we can find is reported in (Tian et al, 2018), which uses the neuron coverage concept from (Pei et al, 2019). The neuron coverage is a testing metric for deep networks that estimates the amount of logic explored by a set of test inputs: more formally, the neuron coverage of a set of test inputs is the proportion of unique activated neurons, among all the network's neurons, for all test inputs. Tian et al (2018) use this value to partition the input space: to increase the neuron coverage of the model, they automatically generate corner cases where the self-driving model fails. This approach is presented in more detail in Section 4.3. Overall, we encourage researchers to provide more insights on what is learned in intermediate representations of self-driving models through methods explaining representations.

4.2.3 Prototypes/Criticism and submodular picks

A prototype is a specific data instance that represents the data well. Prototypes are chosen simultaneously to represent the data distribution in a non-redundant way. Clustering methods, such as partitioning around medoids (Kaufmann, 1987), can be used to automatically find prototypes. As another example, the MMD-critic algorithm (Kim et al, 2016) selects prototypes such that their distribution matches the distribution of the data, as measured with the Maximum Mean Discrepancy (MMD) metric. Once prototypes are found, criticisms — instances that are not well represented by the set of prototypes — can be chosen where the distribution of the data differs from the one of the prototypes. Despite describing the data, prototypes and criticisms can be used to make a black-box model interpretable. Indeed, looking at the predictions made on these prototypes can provide insight and save time for users who cannot examine a large number of explanations and rather prefer judiciously chosen data instances. Ribeiro et al (2016) propose a similar idea to select representative data instances, which they call submodular picks. Using the LIME algorithm (see Section 4.1.2), they provide a local explanation for every instance of the dataset and use the obtained feature importances to find the set of examples that best describe the data in terms of diversity and non-redundancy.

This type of approach has not been employed as an explanation strategy in the autonomous driving literature. Indeed, the selection of prototypes and criticisms heavily depends on the kernel used to measure the matching of distributions, which has no trivial design in the case of high-dimensional inputs such as video or LiDAR frames.

4.3 Fine-grain evaluation and stratified performances

System validation is closely connected to the need for model explanation. One of the links between these two fields is made of methods that automatically evaluate deep models on a wide variety of scenarios and that seek rare corner cases where the model fails. Not only are these methods essential for validating models, but they can also provide a feedback loop to improve future versions with the learned insights. In the computer science and embedded systems literature, validation and performance analysis is related to the software and security fields. However, we are dealing here with learned models, and methods from these fields of research poorly apply. Even if several attempts have been made to formally verify the safety properties of deep models, these techniques do not scale to large-scale networks such as the ones used for self-driving (Huang et al, 2017; Katz et al, 2017). We thus review in this subsection some methods that are used to precisely evaluate the behavior of neural driving systems.

A popular way of analyzing and validating self-driving models is stratified evaluation. Bansal et al (2019) present a model ablation test for the ChauffeurNet model, and they specifically evaluate the self-driving model against a variety of scenarios. For example, they define a series of simple test cases, such as stopping for stop signs or red lights or lane following, as well as more complex situations. Besides, since their model works on structured semantic inputs, they also evaluate ChauffeurNet against modified inputs where objects can be added or removed, as explained in Section 4.1.3. Moreover, Hecker et al (2020) argue that augmenting the input space with semantic maps enables the filtering of a subset of driving scenarios (e.g. sessions with a red light), either for the training or the testing, and thus gaining a finer understanding of the potential performance of the self-driving model, a concept they coin as “performance interpretability”. With the idea of detecting erroneous behaviors of deep self-driving models that could lead to potential accidents, Tian et al (2018) develop an automatic testing tool. They partition the input space according to the neuron coverage concept from (Pei et al, 2019), by assuming that the model decision is the same for inputs that have the same neuron coverage. With the aim of increasing the neuron coverage of the model, they compose a variety of transformations of the input image stream, each corresponding to a synthetic but realistic editing of the scene: linear (e.g. change of luminosity/contrast), affine (e.g. camera rotation) and convolutional (e.g. rain or fog) transformations. This enables them to automatically discover many — synthetic but realistic — scenarios where the car predictions are incorrect. Interestingly, they show that the insights obtained on erroneous corner cases can be leveraged to successfully retrain the driving model on the synthetic data to obtain an accuracy boost. Despite not giving explicit explanations about the self-driving model, such predictions help to understand the model's limitations. In the same vein, Ponn et al (2020) use a SHAP approach (Lundberg and Lee, 2017) to find that the relative rotation of objects and their position with respect to the camera influence the predictions of the model. Their model can be used to create challenging scenarios by deriving corner cases.

Some limits exist in this branch of the literature, as manually creating the system's specifications, to automatically evaluate the performance of deep self-driving models, remains costly and essentially amounts to recreating the logic of a real driver.

5 Designing an explainable driving model

In the previous section, we saw that it is possible to explain the behavior of a machine learning model locally or globally, using post-hoc tools that make little to no assumption about the model. Interestingly, these tools operate on models whose design may have completely ignored the requirement of explainability. A good example of such models is PilotNet (Bojarski et al, 2016, 2020), presented in Section 3.1.2, which consists in a convolutional neural network operating over a raw video stream and producing the vehicle controls at every time step. Understanding the behavior of this system is only possible through external tools, such as the ones presented in Section 4, but cannot be done directly by observing the model itself.

Drawing inspiration from modular systems, recent architectures place a particular emphasis on conveying understandable information about their inner workings, in addition to their performance imperatives. As was advocated in (Xu et al, 2020), the modularity of pipelined architectures allows for forensic analysis, by studying the quantities that are transferred between modules (e.g. semantic and depth maps, forecasts of surrounding agents' future trajectories, etc.). Moreover, finding the right balance between modular and end-to-end systems can encourage the use of simulation, for example by training the perception and driving modules separately (Müller et al, 2018). These modularity-inspired models exhibit some forms of interpretability, which can be enforced at three different levels in the design of the driving system. We first review input-level explanations (Section 5.1), which aim at communicating which perceptual information is used by the model. Secondly, we study intermediate-level explanations (Section 5.2), which force the network to produce supplementary information as it drives. Then, we consider output-level explanations (Section 5.3), which seek to unveil the high-level objectives of the driving system. Selected references from this section are reported in Table 4.

5.1 Input

Input-level explanations aim at enlightening the user on which perceptual information is used by the model to take its decisions. We identified two families of approaches that ease interpretation at the input level: attention-based models (Section 5.1.1) and models that use semantic inputs (Section 5.1.2).

5.1.1 Attention-based models

Attention mechanisms, initially designed for NLP applications (Bahdanau et al, 2015), learn a function that scores different regions of the input depending on whether or not they should be considered in the decision process. This scoring is often performed based on some contextual information that helps the model decide which part of the input is relevant to the task at hand. Xu et al (2015) are the first to use an attention mechanism for a computer vision problem, namely image captioning. In this work, the attention mechanism uses the internal state of the language decoder to condition the visual masking. The network knows which words have already been decoded, and seeks the next relevant information inside the image. Many such attention models have been developed for other applications since then, for example in Visual Question Answering (VQA) (Xu and Saenko, 2016; Lu et al, 2016; Yang et al, 2016). These systems, designed to answer questions about images, use a representation of the question as a context for the visual attention module. Intuitively, the question tells the VQA model where to look to answer the question correctly. Not only do attention mechanisms boost the performance of machine learning models, but they also provide insights into the inner workings of the system. Indeed, by visualizing the attention weight associated with each input region, it is possible to know which part of the image was deemed relevant to make the decision.

Attention-based models have recently stimulated interest in the self-driving community, as they supposedly give a hint about the internal reasoning of the neural network. In (Kim and Canny, 2017), an attention mechanism is used to weight each region of an image, using information about previous frames as a context. A different version of attention mechanisms is used in (Mori et al, 2019), where the model outputs a steering angle and a throttle command prediction for each region of the image. These local predictions are used as attention maps for visualization and are combined through a linear combination with learned parameters to provide the final decision. Visual attention can also be used to select objects defined by bounding boxes, as
Approach                     | Explanation type                              | Section | Selected references
Input interpretability       | Attention maps                                | 5.1.1   | Visual attention (Kim and Canny, 2017); Object-centric (Wang et al, 2019); Attentional Bottleneck (Kim and Bansal, 2020)
Input interpretability       | Semantic inputs                               | 5.1.2   | DESIRE (Lee et al, 2017); ChauffeurNet (Bansal et al, 2019); MTP (Djuric et al, 2020; Cui et al, 2019)
Intermediate representations | Auxiliary branch                              | 5.2.1   | Affordances/action primitives (Mehta et al, 2018); Detection/forecast of vehicles (Zeng et al, 2019); Multiple auxiliary losses (Bansal et al, 2019)
Intermediate representations | Natural language                              |         | (Kim et al, 2018; Mori et al, 2019)
Output interpretability      | Sequences of points                           | 5.3     | (Lee et al, 2017)
Output interpretability      | Sets of points                                | 5.3     | (Cui et al, 2019)
Output interpretability      | Classes                                       | 5.3     | (Phan-Minh et al, 2020)
Output interpretability      | Auto-regressive likelihood map                | 5.3     | (Srikanth et al, 2019; Bansal et al, 2019)
Output interpretability      | Segmentation of future track in bird-eye-view | 5.3     | (Caltagirone et al, 2017)
Output interpretability      | Cost-volume                                   | 5.3     | (Zeng et al, 2019)

Table 4: Key references to design an explainable driving model.
in (Wang et al, 2019). In this work, a pre-trained object detector provides regions of interest (RoIs), which are weighted using the global visual context, and aggregated to decide which action to take; their approach is validated on both the simulated GTAV (Krähenbühl, 2018) and real-world BDDV (Xu et al, 2017) datasets. Cultrera et al (2020) also use attention on RoIs in a slightly different setup with the CARLA simulator (Dosovitskiy et al, 2017), as they directly predict a steering angle instead of a high-level action. Recently, Kim and Bansal (2020) extended the ChauffeurNet (Bansal et al, 2019) architecture by building a visual attention module that operates on a bird-eye-view semantic scene representation. Interestingly, as shown in Figure 6, combining visual attention with an information bottleneck results in sparser saliency maps, making them more interpretable.

Fig. 6: Comparison of attention maps from classical visual attention and from attention bottleneck. Attention bottleneck seems to provide tighter modes, focused on objects of interest. Credits to (Kim and Bansal, 2020).

While these attention mechanisms are often thought to make neural networks more transparent, the recent work of Jain and Wallace (2019) mitigates this assumption. Indeed, they show, in the context of natural language, that learned attention weights poorly correlate with multiple measures of feature importance. Besides, they show that randomly permuting the attention weights usually does not change the outcome of the model. They even show that it is possible to find adversarial attention weights that keep the same prediction while weighting the input words very differently. Even though some works attempt to tackle these issues by learning to align attention weights with gradient-based explanations (Patro et al, 2020), all these findings cast some doubts on the faithfulness of explanations based on attention maps.
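Stripped of architectural detail, the context-conditioned soft attention used by these driving models, and the permutation probe of Jain and Wallace (2019), can be sketched in a few lines. This is a minimal numpy illustration: the bilinear scoring, the dimensions, and all names are assumptions, not the exact formulation of any cited model.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attend(regions, context, W):
    """Score each input region against a context vector (bilinear scoring,
    chosen for illustration), normalize into attention weights, and pool."""
    scores = regions @ W @ context   # one relevance score per region
    alpha = softmax(scores)          # attention weights: non-negative, sum to 1
    pooled = alpha @ regions         # context-dependent summary of the scene
    return pooled, alpha

# 6 image regions with 4-d features; the context could encode past frames.
regions = rng.normal(size=(6, 4))
context = rng.normal(size=4)
W = rng.normal(size=(4, 4))
pooled, alpha = attend(regions, context, W)  # alpha renders as a heat map

# Faithfulness probe in the spirit of Jain and Wallace (2019): re-pool with
# permuted weights; if downstream predictions barely move, the heat map is
# a questionable explanation.
pooled_perm = alpha[rng.permutation(len(alpha))] @ regions
```

The heat map over regions comes for free from alpha, which is why attention is appealing as an input-level explanation; the permutation probe is the kind of sanity check the findings above call for.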
5.1.2 Semantic inputs
Some traditional machine learning models such as linear and logistic regressions, decision trees, or generalized additive models are considered interpretable by practitioners (Molnar, 2019). However, as was remarked by Alvarez-Melis and Jaakkola (2018), these models tend to consider each input dimension as the fundamental unit on which explanations are built. Consequently, the input space must have a semantic nature such that explanations become interpretable. Intuitively, each input dimension should mean something independently of other dimensions. In general machine learning, this condition is often met, for example with categorical and tabular data. However, in computer vision, when dealing with images, videos, and 3D point clouds, the input space does not have an interpretable structure. Overall, in self-driving systems, the lack of semantic nature of inputs impacts the interpretability of machine learning systems.

This observation has motivated researchers to design, build, and use more interpretable input spaces, for example by enforcing more structure or by imposing dimensions to have an underlying high-level meaning. The promise of a more interpretable input space towards increased explainability is diverse. First, the visualization of the network's attention or saliency heat maps in a semantic input space is more interpretable, as it does not apply to individual pixels but rather to higher-level object representations. Second, counterfactual analysis is simplified, as the input can be manipulated more easily without the risk of generating meaningless imperceptible perturbations, akin to adversarial attacks.

Using semantic inputs. Besides camera inputs processed with deep CNNs in (Bojarski et al, 2016; Codevilla et al, 2018), different approaches have been developed to use semantic inputs in a self-driving model, depending on the types of signals at hand. 3D point clouds, provided by LiDAR sensors, can be processed to form a top-view representation of the car surroundings. For instance, Caltagirone et al (2017) propose to flatten the scene along the vertical dimension to form a top-down map, where each pixel in the bird-eye-view corresponds to a 10cm×10cm square of the environment. While this input representation provides information about the presence or absence of an obstacle at a certain location, it crucially lacks semantics, as it ignores the nature of the obstacles (sidewalks, cars, pedestrians, etc.). This lack of high-level scene information is attenuated in DESIRE (Lee et al, 2017), where the output of an image semantic segmentation model is projected to obtain labels in the top-down view generated from the LiDAR point cloud. In DESIRE, static scene components are projected within the top-down view image (e.g. road, sidewalk, vegetation), and moving agents are represented along with their tracked present and past positions. The ChauffeurNet model (Bansal et al, 2019) relies on a similar top-down scene representation; however, instead of originating from a LiDAR point cloud, the bird-eye-view is obtained from city map data (such as speed limits, lane positions, and crosswalks), traffic light state recognition, and detection of surrounding cars. These diverse inputs of the network are gathered into a stack of several images, where each channel corresponds to a rendering of a specific semantic attribute. This contrasts with more recent approaches that aggregate all information into a single RGB top-view image, where different semantic components correspond to different color channels (Djuric et al, 2020; Cui et al, 2019). While the information is still semantic, having a 3-channel RGB image allows leveraging the power of pre-trained convolutional networks. An example RGB semantic image is shown in Figure 7.

Fig. 7: RGB image of the perceived environment in bird-eye-view, that will be used as an input to the CNN. Credits to (Djuric et al, 2020).

Towards more control on the input space. Having a manipulable input space where we can play on semantic dimensions (e.g. controlling objects' attributes, changing the weather, removing a specific car) is a very desirable feature for increased explainability of self-driving models. First, this can make the input space more interpretable by having dimensions we can play on. Importantly, such a feature would nicely synergize with many of the post-hoc explainability methods presented in Section 4. For example, to learn counterfactual examples without producing adversarial meaningless perturbations, it is desirable to have an input space on which we can apply semantic modifications at a pixel-level. As other examples, local approximation methods such as LIME (Ribeiro et al, 2016) would highly benefit from having a controllable input space as a way to ease the sampling of locally similar scenes.

Manipulating inputs can be done at different semantic levels.
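To make the idea of a semantic, manipulable input space concrete, the sketch below rasterizes a ChauffeurNet-style stack of binary semantic channels and applies a counterfactual edit, removing one vehicle, at the object level rather than the pixel level. Grid size, cell resolution, and box coordinates are all illustrative and not taken from any of the cited systems.

```python
import numpy as np

H = W = 64  # bird-eye-view grid; cell resolution is illustrative

def rasterize(boxes, shape=(H, W)):
    """Render axis-aligned boxes (r0, r1, c0, c1) into one binary channel."""
    channel = np.zeros(shape, dtype=np.float32)
    for r0, r1, c0, c1 in boxes:
        channel[r0:r1, c0:c1] = 1.0
    return channel

# One channel per semantic attribute, stacked into the network input.
road = rasterize([(0, 64, 24, 40)])                       # vertical road band
cars = rasterize([(10, 14, 28, 32), (40, 44, 32, 36)])    # two other vehicles
ego = rasterize([(54, 58, 28, 32)])                       # ego-vehicle
scene = np.stack([road, cars, ego])                       # (3, H, W)

# Counterfactual edit at the *semantic* level: remove the lead vehicle by
# re-rasterizing without its box -- no adversarial pixel noise involved.
scene_cf = scene.copy()
scene_cf[1] = rasterize([(40, 44, 32, 36)])               # keep only the rear car

removed = int(scene[1].sum() - scene_cf[1].sum())
print(f"cells freed by removing one car: {removed}")
```

Feeding `scene` and `scene_cf` to the same driving model and comparing its outputs is exactly the kind of counterfactual probe that raw-pixel inputs make difficult.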
First, at a global level, changes can include the scene lighting (night/day) and the weather (sun/rain/fog/snow) of the driving scene (Tian et al, 2018), and more generally any change that separately treats style and texture from content and semantics (Geng et al, 2020); such global changes can be done with video translation models (Tulyakov et al, 2018; Bansal et al, 2018; Chen et al, 2020b). At a more local level, possible modifications include adding or removing objects (Li et al, 2020c; Chang et al, 2019b; Yang et al, 2020), or changing attributes of some objects (Lample et al, 2017). Recent video inpainting works (Gao et al, 2020) can be used to remove objects from videos. Finally, at an intermediate level, we can think of other semantic changes to be applied to images, such as controlling the proportion of classes in an image (Zhao et al, 2020). Manipulations could be done by playing on attributes (Lample et al, 2017), by inserting virtual objects in real scenes (Alhaija et al, 2018), or by the use of textual inputs with GANs (Li et al, 2020a,b).

We note that having a semantically controllable input space can have many implications for areas connected with interpretability. For example, to validate models, and towards having a framework to certify models, we can have a fine-grain stratified evaluation of self-driving models. This can also be used to automatically find failures and corner cases by easing the task of exploring the input space with manipulable inputs (Tian et al, 2018). Finally, we can even use these augmented input spaces to train more robust models, as a way of data augmentation with synthetically generated data (Bowles et al, 2018; Bailo et al, 2019).

5.2 Intermediate representations

A neural network makes its decisions by automatically constructing intermediate representations of the data. One way of creating interpretable driving models is to enforce that some information, different from the one directly needed for driving, is present in these features. A first class of methods, presented in Section 5.2.1, uses supervised learning to specify the content of those representation spaces. Doing so, the prediction of a driving decision can be accompanied by an auxiliary output that provides a human-understandable view of the information contained in the intermediate features. In the second class of methods, detailed in Section 5.2.2, this representation space is constrained in an unsupervised fashion, where a structure can be enforced so that the features automatically recognize and differentiate high-level latent concepts.

5.2.1 Supervising intermediate representations

As was stated in (Zhou et al, 2019), sensorimotor agents benefit from predicting explicit intermediate scene representations in parallel to their main task. But besides this objective of model accuracy, predicting scene elements may give some insights about the information contained in the intermediate features. In (Mehta et al, 2018), a neural network learns to predict control outputs from input images. Its training is helped with auxiliary tasks that aim at recognizing high-level action primitives (e.g. "stop", "slow down", "turn left", etc.) and visual affordances (see Section 3.1.2) in the CARLA simulator (Dosovitskiy et al, 2017). In (Zeng et al, 2019), a neural network predicts the future trajectory of the ego-vehicle using a top-view LiDAR point cloud. In parallel to this main objective, they learn to produce an interpretable intermediate representation composed of 3D detections and future trajectory predictions. Multi-task learning in self-driving has been explored deeply in (Bansal et al, 2019), where the authors design a system with ten losses that, besides learning to drive, also forces internal representations to contain information about on-road/off-road zones and future positions of other objects.

Instead of supervising intermediate representations with scene information, other approaches propose to directly use explanation annotations as an auxiliary branch. The driving model is trained to simultaneously decide and explain its behavior. In the work of Xu et al (2020), the BDD-OIA dataset was introduced, where clips are manually annotated with authorized actions and their associated explanation. Action and explanation predictions are expressed as multi-label classification problems, which means that multiple actions and explanations are possible for a single example. While this system is not properly a driving model (no control or trajectory prediction here, but only high-level classes such as "stop", "move forward" or "turn left"), Xu et al (2020) were able to increase the performance of action decision making by learning to predict explanations as well. Very recently, Ben-Younes et al (2020) propose to explain the behavior of a driving system by fusing high-level decisions with mid-level perceptual features.
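A minimal sketch of such joint action-and-explanation supervision is given below: two sigmoid heads on shared features, trained with a summed binary cross-entropy objective, as befits a multi-label problem. The 4-action and 21-explanation label sizes follow the BDD-OIA setup of Xu et al (2020) to the best of our reading; the features, weights, and targets are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all labels (multi-label setting)."""
    p = np.clip(probs, eps, 1 - eps)
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

# Shared backbone features for a batch of 2 scenes (dim 16, illustrative).
feats = rng.normal(size=(2, 16))

# Two heads on the same features: authorized actions and explanation tags.
W_act = rng.normal(size=(16, 4))
W_exp = rng.normal(size=(16, 21))
p_act = sigmoid(feats @ W_act)   # (2, 4)  -- several actions may be valid
p_exp = sigmoid(feats @ W_exp)   # (2, 21) -- several explanations may apply

# Multi-label targets: independent binary labels, not one-hot classes.
y_act = np.array([[1, 0, 0, 1], [0, 1, 0, 0]], dtype=float)
y_exp = np.zeros((2, 21))
y_exp[0, 3] = 1.0
y_exp[1, 7] = 1.0

# Joint objective: the explanation head acts as auxiliary supervision
# on the shared representation.
lam = 1.0
loss = bce(p_act, y_act) + lam * bce(p_exp, y_exp)
print(f"joint action+explanation loss: {loss:.3f}")
```

Because both heads read the same features, gradients from the explanation loss shape the shared representation, which is the mechanism credited for the accuracy gains reported above.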