Exploring Applications of Deep Reinforcement Learning for Real-world Autonomous Driving Systems

Victor Talpaert1,2, Ibrahim Sobh3, B. Ravi Kiran2, Patrick Mannion4, Senthil Yogamani5, Ahmad El-Sallab3 and Patrick Perez6
1U2IS, ENSTA ParisTech, Palaiseau, France
2AKKA Technologies, Guyancourt, France
3Valeo Egypt, Cairo
4Galway-Mayo Institute of Technology, Ireland
5Valeo Vision Systems, Ireland
6Valeo.ai, France

Keywords: Autonomous Driving, Deep Reinforcement Learning, Visual Perception.

Abstract: Deep Reinforcement Learning (DRL) has become increasingly powerful in recent years, with notable achievements such as DeepMind's AlphaGo. It has been successfully deployed in commercial vehicles like Mobileye's path planning system. However, a vast majority of work on DRL is focused on toy examples in controlled synthetic car simulator environments such as TORCS and CARLA. In general, DRL is still in its infancy in terms of usability in real-world applications. Our goal in this paper is to encourage real-world deployment of DRL in various autonomous driving (AD) applications. We first provide an overview of the tasks in autonomous driving systems, reinforcement learning algorithms and applications of DRL to AD systems. We then discuss the challenges which must be addressed to enable further progress towards real-world deployment.

1 INTRODUCTION

Autonomous driving (AD) is a challenging application domain for machine learning (ML). Since the task of driving "well" is already open to subjective definitions, it is not easy to specify the correct behavioural outputs for an autonomous driving system, nor is it simple to choose the right input features to learn with. Correct driving behaviour is only loosely defined, as different responses to similar situations may be equally acceptable. ML-based control policies have to overcome the lack of a dense metric evaluating the driving quality over time, and the lack of a strong signal for expert imitation. Supervised learning methods do not learn the dynamics of the environment nor that of the agent (Sutton and Barto, 2018), while reinforcement learning (RL) is formulated to handle sequential decision processes, making it a natural approach for learning AD control policies.

In this article, we aim to outline the underlying principles of DRL for applications in AD. Passive perception which feeds into a control system does not scale to handle complex situations. A DRL setting would enable active perception optimized for the specific control task. The rest of the paper is structured as follows. Section 2 provides background on the various modules of AD, an overview of reinforcement learning and a summary of applications of DRL to AD. Section 3 discusses challenges and open problems in applying DRL to AD. Finally, Section 4 summarizes the paper and provides key future directions.

2 BACKGROUND

The software architecture for autonomous driving systems comprises the following high-level tasks: Sensing, Perception, Planning and Control. While some works (Bojarski et al., 2016) achieve all tasks in one unique module, others follow the logic of one task - one module (Paden et al., 2016). Behind this separation rests the idea of information transmission from the sensors to the final stage of actuators and motors, as illustrated in Figure 1. While an End-to-End approach would enable one generic encapsulated solution, the modular approach provides granularity for this multi-discipline problem, as separating the tasks into modules is the divide and conquer engineering approach.
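To make the modular decomposition concrete, the following minimal Python sketch (module names, fields and the toy planner/controller logic are illustrative assumptions, not components of any system discussed in this paper) shows how typed interfaces could connect perception output to planning and control:

from dataclasses import dataclass
from typing import List, Tuple

# Illustrative module boundaries for a modular AD stack (hypothetical names and fields).

@dataclass
class SceneState:            # output of Perception / Scene understanding
    ego_lane_offset_m: float
    obstacles: List[Tuple[float, float]]  # (x, y) positions in the ego frame
    traffic_light_red: bool

@dataclass
class Plan:                  # output of Path Planning
    waypoints: List[Tuple[float, float]]
    target_speed_mps: float

@dataclass
class Control:               # output of Vehicle Control
    steer: float             # in [-1, 1]
    throttle: float          # in [0, 1]
    brake: float             # in [0, 1]

def plan(state: SceneState) -> Plan:
    # Placeholder planner: stop at red lights, otherwise keep the lane at 10 m/s.
    speed = 0.0 if state.traffic_light_red else 10.0
    return Plan(waypoints=[(5.0, -state.ego_lane_offset_m)], target_speed_mps=speed)

def control(state: SceneState, p: Plan, current_speed: float) -> Control:
    # Placeholder proportional controllers for steering and speed tracking.
    steer = max(-1.0, min(1.0, -0.5 * state.ego_lane_offset_m))
    err = p.target_speed_mps - current_speed
    return Control(steer=steer,
                   throttle=max(0.0, min(1.0, 0.1 * err)),
                   brake=max(0.0, min(1.0, -0.1 * err)))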




Figure 1: Fixed modules in a modern autonomous driving system. The sensor infrastructure notably includes cameras, radars, LiDARs (the laser equivalent of radars) and GPS-IMUs (GPS and Inertial Measurement Units providing an instantaneous position); their raw data is considered low level. This dynamic data is often processed into higher-level descriptions as part of the Perception module. Perception estimates positions of: location in lane, cars, pedestrians, traffic lights and other semantic objects, among other descriptors. It also provides road occupation over a larger spatial & temporal scale by combining maps. Mapping: localised high definition maps (HD maps) provide centimetre-level reconstruction of buildings, static obstacles and dynamic objects, and are frequently crowdsourced. Scene understanding provides the high-level scene comprehension that includes detection, classification & localisation tasks, feeding the driving policy/planning module. Path Planning predicts the future actor trajectories and manoeuvres; a static shortest path from point A to point B with dynamic traffic information constraints is employed to calculate the path. Vehicle control orchestrates the high-level orders for motion planning using simple closed-loop systems based on sensors from the Perception task.

2.1 Review of Reinforcement Learning

Reinforcement Learning (RL) (Sutton and Barto, 2018) is a family of algorithms which allow agents to learn how to act in different situations - in other words, how to establish a map, or a policy, from situations (states) to actions which maximize a numerical reward signal. RL has been successfully applied to many different fields such as helicopter control (Naik and Mammone, 1992), traffic signal control (Mannion et al., 2016a), electricity generator scheduling (Mannion et al., 2016b), water resource management (Mason et al., 2016), playing relatively simple Atari games (Mnih et al., 2015), mastering the much more complex game of Go (Silver et al., 2016), simulated continuous control problems (Lillicrap et al., 2015), (Schulman et al., 2015), and controlling robots in real environments (Levine et al., 2016).

RL algorithms may learn estimates of state values, environment models or policies. In real-world applications, simple tabular representations of estimates are not scalable: each additional feature tracked in the state leads to an exponential growth in the number of estimates that must be stored (Sutton and Barto, 2018).

Deep Neural Networks (DNNs) have recently been applied as function approximators for RL agents, allowing agents to generalise knowledge to new unseen situations, along with new algorithms for problems with continuous state and action spaces. Deep RL agents can be value-based: DQN (Mnih et al., 2013), Double DQN (Van Hasselt et al., 2016); policy-based: TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017); or actor-critic: DPG (Silver et al., 2014), DDPG (Lillicrap et al., 2015), A3C (Mnih et al., 2016). Model-based RL agents attempt to build an environment model, e.g. Dyna-Q (Sutton, 1990), Dyna-2 (Silver et al., 2008). Inverse reinforcement learning (IRL) aims to estimate the reward function given examples of an agent's actions, the sensory input to the agent (state), and the model of the environment (Abbeel and Ng, 2004a), (Ziebart et al., 2008), (Finn et al., 2016), (Ho and Ermon, 2016).
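As a concrete reference point for the value-based methods listed above, the following minimal Python sketch implements standard tabular Q-learning against a generic episodic environment (the reset/step interface and discrete action set are assumed here for illustration); it also makes the scalability issue explicit, since the Q-table grows with the number of discretised states:

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    `env` is assumed to expose reset() -> state and step(a) -> (state, reward, done),
    with hashable states and a discrete action set env.actions.
    """
    Q = defaultdict(lambda: {a: 0.0 for a in env.actions})
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration over the current Q estimates
            a = random.choice(env.actions) if random.random() < eps \
                else max(Q[s], key=Q[s].get)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * max(Q[s_next].values()))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q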
2.2 DRL for Autonomous Driving

In this section, we visit the different modules of the autonomous driving system shown in Figure 1 and describe how they are achieved using classical RL and Deep RL methods. A list of datasets and simulators for AD tasks is presented in Table 1.


Table 1: A collection of datasets and simulators to evaluate AD algorithms.

Datasets:
- Berkeley Driving Dataset (Xu et al., 2017): learn driving policy from demonstrations.
- Baidu's ApolloScape: multiple sensors & driver behaviour profiles.
- Honda Driving Dataset (Ramanishka et al., 2018): 4-level annotation (stimulus, action, cause, and attention objects) for driver behaviour profiling.

Simulators:
- CARLA (Dosovitskiy et al., 2017): urban driving simulator with camera, LiDAR, depth & semantic segmentation.
- TORCS (Wymann et al., 2000): racing simulator for testing vehicle control policies.
- AIRSIM (Shah et al., 2018): resembles CARLA, with support for drones.
- GAZEBO (Koenig and Howard, 2004): multi-robot simulator for planning & control.
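The simulators in Table 1 expose different native APIs, so for RL experiments they are usually wrapped behind a common episodic interface. The sketch below shows one possible wrapper contract in Python; the method names and the (steer, throttle, brake) action convention are assumptions for illustration and do not reflect the simulators' actual APIs:

from abc import ABC, abstractmethod
from typing import Any, Tuple

class DrivingEnv(ABC):
    """Minimal episodic interface an RL agent can train against,
    regardless of which simulator (TORCS, CARLA, AirSim, ...) sits behind it."""

    @abstractmethod
    def reset(self) -> Any:
        """Start a new episode and return the initial observation
        (e.g. a camera frame, a LiDAR sweep or a low-dimensional track state)."""

    @abstractmethod
    def step(self, action) -> Tuple[Any, float, bool, dict]:
        """Apply an action (e.g. steer, throttle, brake), advance the simulation one tick
        and return (observation, reward, episode_done, diagnostics)."""

def run_episode(env: DrivingEnv, policy) -> float:
    # Roll out one episode and return the undiscounted return.
    obs, done, ret = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        ret += reward
    return ret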

Vehicle Control: Vehicle control has classically been achieved with predictive control approaches such as Model Predictive Control (MPC) (Paden et al., 2016); a recent review on the motion planning and control task can be found in (Schwarting et al., 2018). Classical RL methods are used to perform optimal control in stochastic settings, for example the Linear Quadratic Regulator (LQR) in linear regimes and iterative LQR (iLQR) for non-linear regimes. More recently, it has been shown that random search over the parameters of a policy network can perform as well as LQR (Mania et al., 2018).

One can note recent work on DQN, which is used in (Yu et al., 2016) for simulated autonomous vehicle control, where different reward functions are examined to produce specific driving behaviour. The agent successfully learned turning actions and navigation without crashing. In (Sallab et al., 2016) a DRL system for lane keeping assist is introduced for discrete actions (DQN) and continuous actions (DDAC), using the TORCS car simulator. The authors conclude that, as expected, continuous actions provide smoother trajectories, and that the more restricted the termination conditions, the slower the convergence during learning. Wayve, a recent startup, has demonstrated an application of DRL (DDPG) for AD using a full-sized autonomous vehicle (Kendall et al., 2018). The system was first trained in simulation, before being trained in real time using onboard computers, and was able to learn to follow a lane, successfully completing a real-world trial on a 250 metre section of road.

DQN for Ramp Merging: The AD problem of ramp merging is tackled in (Wang and Chan, 2017), where DRL is applied to find an optimal driving policy, using an LSTM to produce an internal state containing historical driving information and a DQN for Q-function approximation.

Q-function for Lane Change: A reinforcement learning approach is proposed in (Wang et al., 2018) to train the vehicle agent to learn an automated lane change in a smooth and efficient manner, where the coefficients of the Q-function are learned from neural networks.

IRL for Driving Styles: Learning individual perceptions of comfort from demonstration is proposed in (Kuderer et al., 2015), where individual driving styles are modelled in terms of a cost function and feature-based inverse reinforcement learning is used to compute trajectories in autonomous mode. Using Deep Q-Networks as the refinement step in IRL is proposed in (Sharifzadeh et al., 2016) to extract the rewards. While evaluated in a simulated autonomous driving environment, it is shown that the agent performs a human-like lane change behaviour.

Multiple-goal RL for Overtaking: In (Ngai and Yung, 2011) a multiple-goal reinforcement learning (MGRL) framework is used to solve the vehicle overtaking problem. This work is found to be able to take correct actions for overtaking while avoiding collisions and keeping an almost steady speed.
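Several of the works above, such as the lane-keeping and ramp-merging agents, rely on a DQN over a small set of discrete driving actions. The PyTorch sketch below shows the core ingredients of such an agent, namely epsilon-greedy action selection and the one-step temporal-difference loss against a target network; the network size, action set and batch format are illustrative assumptions rather than the cited authors' configurations:

import random
import torch
import torch.nn as nn

ACTIONS = ["steer_left", "straight", "steer_right", "brake"]   # illustrative discrete action set

class QNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

def select_action(qnet: QNet, state: torch.Tensor, eps: float) -> int:
    # epsilon-greedy over the predicted Q-values
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    with torch.no_grad():
        return int(qnet(state.unsqueeze(0)).argmax(dim=1).item())

def td_loss(qnet: QNet, target_net: QNet, batch, gamma: float = 0.99):
    # batch = tensors sampled from a replay buffer
    states, actions, rewards, next_states, dones = batch
    q = qnet(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * q_next
    return nn.functional.mse_loss(q, target)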


Hierarchical Reinforcement Learning (HRL): Contrary to conventional or flat RL, HRL refers to the decomposition of complex agent behaviour using temporal abstraction, such as the options framework (Barto and Mahadevan, 2003). The problem of sequential decision making for autonomous driving with distinct behaviours is tackled in (Chen et al., 2018). A hierarchical neural network policy is proposed where the network is trained with the Semi-Markov Decision Process (SMDP) through the proposed hierarchical policy gradient method. The method is applied to a traffic light passing scenario, and it is shown that it is able to select correct decisions, providing better performance compared to a non-hierarchical reinforcement learning approach. An RL-based hierarchical framework for autonomous multi-lane cruising is proposed in (Nosrati et al., 2018), where it is shown that the hierarchical design enables significantly better learning performance than a flat design for both DQN and PPO methods.

Frameworks: A framework for an end-to-end Deep Reinforcement Learning pipeline for autonomous driving is proposed in (Sallab et al., 2017), where the inputs are the states of the environment and their aggregations over time, and the output is the driving actions. The framework integrates RNNs and an attention glimpse network, and is tested on a lane keeping assist algorithm.

In this section we reviewed in brief the applications of DRL and classical RL methods to different AD tasks.

2.3 Predictive Perception

In this section, we review some examples of applications of IOC and IRL to predictive perception tasks. Given a certain instance where the autonomous agent is driving in the scene, the goal of predictive perception algorithms is to predict the trajectories or intention of movement of other actors in the environment. The authors of (Djuric et al., 2018) trained a deep convolutional neural network (CNN) to predict short-term vehicle trajectories, while accounting for the inherent uncertainty of vehicle motion in road traffic. Deep Stochastic IOC (inverse optimal control) RNN Encoder-Decoder (DESIRE) (Lee et al., 2017) is a framework used to estimate a distribution, rather than a single prediction, of an agent's future positions, based on the context (intersection, relative position of other agents).
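A minimal version of such trajectory prediction treats the problem as regression from a short history of observed positions to a sequence of future positions. The sketch below uses plain least squares on flattened position histories, purely to make the input/output structure of the task concrete; it is not the CNN or RNN encoder-decoder architecture of the cited works:

import numpy as np

def fit_trajectory_predictor(histories: np.ndarray, futures: np.ndarray) -> np.ndarray:
    """Learn a linear map from H past (x, y) positions to F future (x, y) positions.

    histories: (N, H, 2) observed tracks; futures: (N, F, 2) ground-truth continuations.
    Returns a weight matrix W of shape (2H + 1, 2F) fit by least squares.
    """
    N = histories.shape[0]
    X = np.hstack([histories.reshape(N, -1), np.ones((N, 1))])   # flatten history and add a bias term
    Y = futures.reshape(N, -1)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def predict(W: np.ndarray, history: np.ndarray) -> np.ndarray:
    # history: (H, 2) observed positions of one actor; returns (F, 2) predicted positions.
    x = np.hstack([history.reshape(-1), [1.0]])
    out_dim = W.shape[1]
    return (x @ W).reshape(out_dim // 2, 2)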

Pedestrian Intention: Predictive perception also covers pedestrian intention: is the pedestrian going to cross the road, board another vehicle, or is the driver of a parked car about to open the door (Erran and Scheider, 2017)? The authors of (Ziebart et al., 2009) perform maximum entropy inverse optimal control to learn a generic cost function for a robot to avoid pedestrians, while (Kitani et al., 2012) used inverse optimal control to predict pedestrian paths by considering scene semantics.

Traffic Negotiation: In traffic scenarios involving multiple agents, learned policies require agents to negotiate movement in densely populated areas with continuous movement. MobilEye demonstrated the use of the options framework for this setting (Shalev-Shwartz et al., 2016).

3 PRACTICAL CHALLENGES

Deep reinforcement learning is a rapidly growing field. We summarize the frequent challenges, such as sample complexity and reward formulation, that could be encountered in the design of such methods.

3.1 Bootstrapping RL with Imitation

Learning by imitation is used by humans to teach other humans new skills. Demonstrations usually focus on the essential areas of the state space from the expert's point of view. Learning from Demonstrations (LfD) is significant especially in domains where rewards are sparse. In imitation learning, an agent learns to perform a task from demonstrations, without any feedback rewards, by learning a policy as a supervised learning process over state-action pairs. However, high quality demonstrations are hard to collect, leading to sub-optimal policies (Atkeson and Schaal, 1997). Accordingly, LfD can be used to initialize the learning agent with a policy inspired by the performance of an expert; RL can then be conducted to discover a better policy by interacting with the environment. Learning a model of the environment combining LfD and RL is presented in (Abbeel and Ng, 2005). Measuring the divergence between the current policy and the expert for policy optimization is proposed in (Kang et al., 2018). DQfD (Hester et al., 2017) pre-trains the agent using demonstrations by adding them to the replay buffer of the DQN and giving them additional priority. More recently, a training framework that combines LfD and RL for fast learning asynchronous agents was proposed in (Sobh and Darwish, 2018).

3.2 Exploration Issues with Imitation

In some cases, demonstrations from experts are not available, or do not cover the state space, leading to a poor learned policy. One solution consists in using the Data Aggregation (DAgger) (Ross and Bagnell, 2010) method, where the end-to-end learned policy is run, the extracted observation-action pairs are again labelled by the expert, and these are aggregated with the original expert observation-action dataset. Iteratively collecting training examples from both the reference and the trained policies in this way explores more states and addresses the lack of exploration. Following work on Search-based Structured Prediction (SEARN) (Ross and Bagnell, 2010), Stochastic Mixing Iterative Learning (SMILe) trains a stochastic stationary policy over several iterations and then makes use of a geometric stochastic mixing of the policies trained. In a standard imitation learning scenario, the demonstrator is required to cover sufficient states so as to avoid unseen states during testing. This constraint is costly and requires frequent human intervention. Hierarchical imitation learning methods reduce the sample complexity of standard imitation learning by performing data aggregation and organizing the action spaces in a hierarchy (Le et al., 2018).
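The data-aggregation idea behind DAgger can be summarised compactly: roll out the current learned policy, have the expert relabel the visited states, and retrain on the union of all collected data. In the sketch below, the rollout, expert and train_supervised callables are generic placeholders rather than components defined in the cited works:

def dagger(env, expert, train_supervised, rollout, n_iters=10):
    """Iteratively aggregate expert-labelled data from the learner's own state distribution.

    expert(state) -> action, train_supervised(dataset) -> policy,
    rollout(env, policy) -> list of visited states.
    """
    dataset = [(s, expert(s)) for s in rollout(env, expert)]   # seed with expert demonstrations
    policy = train_supervised(dataset)
    for _ in range(n_iters):
        visited = rollout(env, policy)                 # states induced by the *learned* policy
        dataset += [(s, expert(s)) for s in visited]   # expert relabels the visited states
        policy = train_supervised(dataset)             # retrain on the aggregated dataset
    return policy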


3.3 Intrinsic Reward Functions

In controlled simulated environments such as games, an explicit reward signal is given to the agent along with its sensor stream. In real-world robotics and autonomous driving, however, designing a good reward function is essential so that the desired behaviour may be learned. The most common solution has been reward shaping (Ng et al., 1999), which consists in supplying additional rewards to the agent along with those provided by the underlying MDP. As pointed out earlier in the paper, rewards could also be estimated by inverse RL (IRL) (Abbeel and Ng, 2004b), which depends on expert demonstrations.

In the absence of explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation (Chentanez et al., 2005) to evaluate whether their actions were good. The authors of (Pathak et al., 2017) define curiosity as the error in an agent's ability to predict the consequence of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. In (Burda et al., 2018) the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. This enables the agent to determine what could be a useful behaviour even without extrinsic rewards.
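The prediction-error notion of curiosity can be written down in a few lines: a learned forward model predicts the next state, and the agent's intrinsic reward is the model's error. The sketch below uses a single linear forward model as a stand-in for the learned visual feature spaces of the cited works:

import numpy as np

class ForwardModelCuriosity:
    """Intrinsic reward = squared error of a learned forward model f(s, a) -> s'.
    A single online linear model is used here purely for illustration."""

    def __init__(self, state_dim: int, action_dim: int, lr: float = 1e-2):
        self.W = np.zeros((state_dim + action_dim, state_dim))
        self.lr = lr

    def intrinsic_reward(self, s, a, s_next) -> float:
        x = np.concatenate([s, a])
        err = s_next - x @ self.W             # prediction error of the forward model
        self.W += self.lr * np.outer(x, err)  # online gradient step on the squared error
        return float(np.mean(err ** 2))       # larger error -> more "surprising" -> more reward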
3.4 Bridging the Simulator-reality Gap

Training deep networks requires collecting and annotating a lot of data, which is usually costly in terms of time and effort. Using simulation environments enables the collection of large training datasets. However, simulated data do not have the same distribution as real data; accordingly, models trained in simulated environments often fail to generalise to real environments. Domain adaptation allows a machine learning model trained on samples from a source domain to generalise to a target domain. Feature-level domain adaptation focuses on learning domain-invariant features: in the work of (Ganin et al., 2016), the decisions made by deep neural networks are based on features that are both discriminative and invariant to the change of domain. Pixel-level domain adaptation focuses on stylizing images from the source domain to make them similar to images of the target domain, based on image-conditioned generative adversarial networks (GANs). In (Bousmalis et al., 2017b), the model learns a transformation in pixel space from one domain to the other in an unsupervised way; a GAN is used to adapt simulated images to look as if drawn from the real domain. Feature-level and pixel-level domain adaptation are combined in (Bousmalis et al., 2017a), where the results indicate that including simulated data can improve a vision-based grasping system, achieving comparable performance with 50 times fewer real-world samples. Another, relatively simpler, method is introduced in (Peng et al., 2017): by randomizing the dynamics of the simulator during training, policies become capable of generalising to different dynamics without any training on the real system.

RL with Sim2Real: A model trained in a virtual environment is shown to be workable in a real environment in (Pan et al., 2017). Virtual images rendered by a simulator are first segmented to a scene parsing representation and then translated to synthetic realistic images by the proposed image translation network. Accordingly, the driving policy trained by reinforcement learning can be more easily adapted to the real environment.

World Models, proposed in (Ha and Schmidhuber, 2018b; Ha and Schmidhuber, 2018a), are trained quickly in an unsupervised way, via a Variational AutoEncoder (VAE), to learn a compressed spatial and temporal representation of the environment, leading to a compact policy. Moreover, the agent can train inside its own dream and transfer the policy back into the actual environment.

3.5 Data Efficient and Fast Adapting RL

Depending on the task being solved, RL requires a lot of observations to cover the state space. Efficiency is usually achieved with imitation learning, reward shaping and transfer learning; readers are directed towards the survey on transfer learning in RL in (Taylor and Stone, 2009). The primary motivation of transfer learning in RL is to reuse previously trained policies/models for a source task, so as to reduce the current target task's training time. The authors of (Liaw et al., 2017) study the policy composition problem, where previously learned basis policies, e.g. for driving in different conditions or over different terrain types, are reused for a novel task that is a mixture of previously seen dynamics by learning a meta-policy that maximises the reward. Their results show that learning a new policy for a new task takes longer than learning a meta-policy over basis policies.
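The policy-composition setting above can be sketched as a meta-policy that blends a fixed library of pre-trained basis controllers, with the mixing weights being the only parameters learned for the novel task. The softmax-mixture form below is an illustrative choice, not the formulation used in the cited work:

import numpy as np

class MetaPolicy:
    """Combine pre-trained basis policies pi_1..pi_K with learnable mixture weights.

    Each basis policy maps a state to a continuous action vector (e.g. steer, throttle);
    only the K mixing logits are trained on the new task.
    """

    def __init__(self, basis_policies):
        self.basis = basis_policies
        self.logits = np.zeros(len(basis_policies))   # new-task parameters

    def weights(self) -> np.ndarray:
        z = np.exp(self.logits - self.logits.max())   # softmax over basis policies
        return z / z.sum()

    def act(self, state) -> np.ndarray:
        actions = np.stack([pi(state) for pi in self.basis])   # (K, action_dim)
        return self.weights() @ actions                        # weighted blend of basis actions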


Meta-learning, or learning to learn, is an important approach towards versatile agents that can adapt quickly to new tasks. The idea of having one neural network interact with another for meta-learning has been applied in (Duan et al., 2016) and (Wang et al., 2016). More recently, Model-Agnostic Meta-Learning (MAML) was proposed in (Finn et al., 2017), where the meta-learner seeks an initialization of the parameters of a neural network that can be adapted quickly to a new task using only a few examples. Continuous adaptation in dynamically changing and adversarial scenarios is presented in (Al-Shedivat et al., 2017) via a simple gradient-based meta-learning algorithm. Additionally, Reptile (Nichol et al., 2018) is mathematically similar to first-order MAML, making it consume less computation and memory than MAML.

3.6 Incorporating Safety in DRL for AD

Deploying an autonomous vehicle directly after training could be dangerous. We review different approaches to incorporate safety into DRL algorithms.

SafeDAgger (Zhang and Cho, 2017) introduces a safety policy that learns to predict the error made by a primary policy without querying a reference policy. The safety policy takes both the partial observation of a state and the primary policy as inputs, and returns a binary label indicating whether the primary policy is likely to deviate from the reference policy.

Multi-agent RL for Comfort Driving and Safety: In (Shalev-Shwartz et al., 2016), autonomous driving is addressed as a multi-agent setting where the host vehicle negotiates in different scenarios, balancing between the unexpected behaviour of other drivers and not being too defensive. The problem is decomposed into a policy with learned desires to enable driving comfort, and trajectory planning with hard constraints for driving safety.

DDPG and Safety Based Control: Deep reinforcement learning (DDPG) and safety-based control are combined in (Xiong et al., 2016), including the artificial potential field method that is widely used for robot path planning. Using the TORCS environment, DDPG is first used to learn a driving policy in a stable and familiar environment, then the policy network and safety-based control are combined to avoid collisions. It was found that the combination of DRL and safety-based control performs well in most scenarios.

Negative-Avoidance for Safety: In order to enable DRL to escape local optima, speed up the training process and avoid dangerous conditions or accidents, the Survival-Oriented Reinforcement Learning (SORL) model is proposed in (Ye et al., 2017), where survival is favoured over maximizing total reward, by modelling the autonomous driving problem as a constrained MDP and introducing a Negative-Avoidance Function to learn from previous failures. The SORL model is found to be insensitive to the reward function and can be used with different DRL algorithms such as DDPG.
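A common thread in these safety approaches is a separate module that can veto or adjust the action proposed by the learned policy. The sketch below uses a deliberately simple headway rule and a dictionary-valued observation as placeholders for the safety-based controllers and hard constraints of the cited works:

from typing import Callable, Dict

Action = Dict[str, float]   # e.g. {"steer": 0.1, "throttle": 0.4, "brake": 0.0}

def safe_act(state: Dict[str, float], rl_policy: Callable, min_headway_m: float = 8.0) -> Action:
    """Let the RL policy drive, but override it when a simple safety rule fires."""
    action = rl_policy(state)
    headway = state.get("lead_vehicle_distance_m", float("inf"))
    if headway < min_headway_m:
        # Safety controller takes over: release the throttle and brake proportionally.
        deficit = (min_headway_m - headway) / min_headway_m
        action = {"steer": action["steer"],
                  "throttle": 0.0,
                  "brake": min(1.0, max(action["brake"], deficit))}
    return action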
3.7 Other Challenges

Multimodal Sensor Policies: Modern autonomous driving systems use multiple modalities (Sobh et al., 2018), for example camera RGB, depth, LiDAR and other sensors. The authors of (Liu et al., 2017) propose end-to-end learning of policies that leverage sensor fusion to reduce performance drops in noisy environments, and even in the face of partial sensor failure, by using Sensor Dropout to reduce sensitivity to any sensor subset.

Reproducibility: State-of-the-art deep RL methods are seldom reproducible. Non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret; the authors of (Henderson et al., 2018) discuss common issues and challenges. Therefore, in the future it would be helpful to develop standardized benchmarks for evaluating autonomous vehicle control algorithms, similar to benchmarks such as KITTI (Geiger et al., 2012) which are already available for AD perception tasks.

4 CONCLUSION

AD systems present a challenging environment for tasks such as perception, prediction and control. DRL is a promising candidate for the future development of AD systems, potentially allowing the required behaviours to be learned first in simulation and further refined on real datasets, instead of being explicitly programmed.


In this article, we provide an overview of AD system components, DRL algorithms and applications of DRL for AD. We discuss the main challenges which must be addressed to enable practical and wide-spread use of DRL in AD applications. Although most of the work surveyed in this paper was conducted in simulated environments, it is encouraging that applications on real vehicles are beginning to appear, e.g. (Kendall et al., 2018). Constructing a complete real-world system would require resolving key challenges such as safety in RL, improving data efficiency and enabling transfer learning using simulated environments. In addition, environment dynamics are better modelled by considering the predictive perception of other actors. We hope that this work inspires future research and development on DRL for AD, leading to increased real-world deployments in AD systems.

REFERENCES

Abbeel, P. and Ng, A. Y. (2004a). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, page 1. ACM.
Abbeel, P. and Ng, A. Y. (2004b). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning, ICML '04, New York, NY, USA. ACM.
Abbeel, P. and Ng, A. Y. (2005). Exploration and apprenticeship learning in reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 1–8, Bonn, Germany.
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. (2017). Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641.
Atkeson, C. G. and Schaal, S. (1997). Robot learning from demonstration. In Proceedings of the 14th International Conference on Machine Learning, volume 97, pages 12–20, Nashville, USA.
Barto, A. G. and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379.
Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., Zhang, X., Zhao, J., and Zieba, K. (2016). End to end learning for self-driving cars. CoRR, abs/1604.07316.
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M., Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Konolige, K., et al. (2017a). Using simulation and domain adaptation to improve efficiency of deep robotic grasping. arXiv preprint arXiv:1709.07857.
Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., and Krishnan, D. (2017b). Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 7.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. (2018). Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355.
Chen, J., Wang, Z., and Tomizuka, M. (2018). Deep hierarchical reinforcement learning for autonomous driving with distinct behaviors. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1239–1244. IEEE.
Chentanez, N., Barto, A. G., and Singh, S. P. (2005). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 1281–1288.
Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou, F.-C., Lin, T.-H., and Schneider, J. (2018). Motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Koltun, V. (2017). CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. (2016). Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
Erran, L. and Scheider, J. (2017). Machine learning for autonomous vehicles. In ICML Workshop on Machine Learning for Autonomous Vehicles 2017.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400.
Finn, C., Levine, S., and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3354–3361. IEEE.
Ha, D. and Schmidhuber, J. (2018a). Recurrent world models facilitate policy evolution. arXiv preprint arXiv:1809.01999. Accepted at NIPS 2018.
Ha, D. and Schmidhuber, J. (2018b). World models. arXiv preprint arXiv:1803.10122.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep reinforcement learning that matters. In McIlraith, S. A. and Weinberger, K. Q., editors, AAAI-18, New Orleans, Louisiana, USA, February 2-7, 2018, pages 3207–3214. AAAI Press.
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., Horgan, D., Quan, J., Sendonaris, A., Dulac-Arnold, G., et al. (2017). Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732.


Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573.
Kang, B., Jie, Z., and Feng, J. (2018). Policy optimization with demonstrations. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 2469–2478, Stockholm, Sweden.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D., Allen, J.-M., Lam, V.-D., Bewley, A., and Shah, A. (2018). Learning to drive in a day. arXiv preprint arXiv:1807.00412.
Kitani, K. M., Ziebart, B. D., Bagnell, J. A., and Hebert, M. (2012). Activity forecasting. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., and Schmid, C., editors, Computer Vision – ECCV 2012, pages 201–214, Berlin, Heidelberg. Springer Berlin Heidelberg.
Koenig, N. and Howard, A. (2004). Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2149–2154.
Kuderer, M., Gulati, S., and Burgard, W. (2015). Learning driving styles for autonomous vehicles from demonstration. In Robotics and Automation (ICRA), 2015 IEEE International Conference on, pages 2641–2646. IEEE.
Le, H. M., Jiang, N., Agarwal, A., Dudík, M., Yue, Y., and Daumé III, H. (2018). Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590.
Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H., and Chandraker, M. (2017). Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 336–345.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373.
Liaw, R., Krishnan, S., Garg, A., Crankshaw, D., Gonzalez, J. E., and Goldberg, K. (2017). Composing meta-policies for autonomous driving using hierarchical deep reinforcement learning. arXiv preprint arXiv:1711.01503.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Liu, G.-H., Siravuru, A., Prabhakar, S., Veloso, M., and Kantor, G. (2017). Learning end-to-end multimodal sensor policies for autonomous navigation. In Levine, S., Vanhoucke, V., and Goldberg, K., editors, Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of Machine Learning Research, pages 249–261. PMLR.
Mania, H., Guy, A., and Recht, B. (2018). Simple random search provides a competitive approach to reinforcement learning. arXiv preprint arXiv:1803.07055.
Mannion, P., Duggan, J., and Howley, E. (2016a). An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In McCluskey, L. T., Kotsialos, A., Müller, P. J., Klügl, F., Rana, O., and Schumann, R., editors, Autonomic Road Transport Support Systems, pages 47–66. Springer International Publishing.
Mannion, P., Mason, K., Devlin, S., Duggan, J., and Howley, E. (2016b). Multi-objective dynamic dispatch optimisation using multi-agent reinforcement learning. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 1345–1346.
Mason, K., Mannion, P., Duggan, J., and Howley, E. (2016). Applying multi-agent reinforcement learning to watershed management. In Proceedings of the Adaptive and Learning Agents workshop (at AAMAS 2016).
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
Naik, D. K. and Mammone, R. (1992). Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pages 437–442. IEEE.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287.
Ngai, D. C. K. and Yung, N. H. C. (2011). A multiple-goal reinforcement learning method for complex vehicle overtaking maneuvers. IEEE Transactions on Intelligent Transportation Systems, 12(2):509–522.
Nichol, A., Achiam, J., and Schulman, J. (2018). On first-order meta-learning algorithms. CoRR, abs/1803.02999.
Nosrati, M. S., Abolfathi, E. A., Elmahgiubi, M., Yadmellat, P., Luo, J., Zhang, Y., Yao, H., Zhang, H., and Jamil, A. (2018). Towards practical hierarchical reinforcement learning for multi-lane autonomous driving.
Paden, B., Cap, M., Yong, S. Z., Yershov, D. S., and Frazzoli, E. (2016). A survey of motion planning and control techniques for self-driving urban vehicles. CoRR, abs/1604.07446.
Pan, X., You, Y., Wang, Z., and Lu, C. (2017). Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017.
Peng, X. B., Andrychowicz, M., Zaremba, W., and Abbeel, P. (2017). Sim-to-real transfer of robotic control with dynamics randomization. arXiv preprint arXiv:1710.06537.


Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K. (2018). Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7699–7707.
Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668.
Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S. (2016). End-to-end deep reinforcement learning for lane keeping assist. arXiv preprint arXiv:1612.04340.
Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S. (2017). Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Schwarting, W., Alonso-Mora, J., and Rus, D. (2018). Planning and decision-making for autonomous vehicles. Annual Review of Control, Robotics, and Autonomous Systems.
Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pages 621–635. Springer.
Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
Sharifzadeh, S., Chiotellis, I., Triebel, R., and Cremers, D. (2016). Learning to drive using inverse reinforcement learning and deep q-networks. arXiv preprint arXiv:1612.03653.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML.
Silver, D., Sutton, R. S., and Müller, M. (2008). Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pages 968–975. ACM.
Sobh, I., Amin, L., Abdelkarim, S., Elmadawy, K., Saeed, M., Abdeltawab, O., Gamal, M., and El Sallab, A. (2018). End-to-end multi-modal sensors fusion system for urban automated driving. In NIPS Workshop on Machine Learning for Intelligent Transportation Systems.
Sobh, I. and Darwish, N. (2018). End-to-end framework for fast learning asynchronous agents. In Proceedings of the 32nd Conference on Neural Information Processing Systems, Imitation Learning and its Challenges in Robotics workshop.
Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pages 216–224. Elsevier.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double q-learning. In AAAI, volume 16, pages 2094–2100.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. (2016). Learning to reinforcement learn. arXiv preprint arXiv:1611.05763.
Wang, P. and Chan, C.-Y. (2017). Formulation of deep reinforcement learning architecture toward autonomous driving for on-ramp merge. In Intelligent Transportation Systems (ITSC), 2017 IEEE 20th International Conference on, pages 1–6. IEEE.
Wang, P., Chan, C.-Y., and de La Fortelle, A. (2018). A reinforcement learning based approach for automated lane change maneuvers. arXiv preprint arXiv:1804.07871.
Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. (2000). TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net, 4.
Xiong, X., Wang, J., Zhang, F., and Li, K. (2016). Combining deep reinforcement learning and safety based control for autonomous driving. arXiv preprint arXiv:1612.00147.
Xu, H., Gao, Y., Yu, F., and Darrell, T. (2017). End-to-end learning of driving models from large-scale video datasets. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3530–3538.
Ye, C., Ma, H., Zhang, X., Zhang, K., and You, S. (2017). Survival-oriented reinforcement learning model: An efficient and robust deep reinforcement learning algorithm for autonomous driving problem. In International Conference on Image and Graphics, pages 417–429. Springer.
Yu, A., Palefsky-Smith, R., and Bedi, R. (2016). Deep reinforcement learning for simulated autonomous vehicle control. Course Project Reports: Winter 2016 (CS231n: Convolutional Neural Networks for Visual Recognition), pages 1–7.
Zhang, J. and Cho, K. (2017). Query-efficient imitation learning for end-to-end simulated driving. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, pages 2891–2897.
Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pages 1433–1438. Chicago, IL, USA.
Ziebart, B. D., Ratliff, N., Gallagher, G., Mertz, C., Peterson, K., Bagnell, J. A., Hebert, M., Dey, A. K., and Srinivasa, S. (2009). Planning-based prediction for pedestrians. In Proc. IROS.
