



AUTONOMOUS SELF-DRIVING VEHICLE USING DEEP Q-LEARNING

M. Sangeetha1, K. Nimala1, D. Saveetha1, P. Suresh2*, S. Sivaperumal2
1Dept. of Information Technology, SRM Institute of Science and Technology, Kattankulathur 603203, Tamil Nadu, India.
2Dept. of ECE, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology.
*Corresponding author: [email protected]

ABSTRACT

Most modern self-driving cars lack the ability to make a quick judgement based on the objects that appear in front of them. In this paper we propose a hierarchy-based method by which a vehicle can decide how to respond in a deadlock situation, i.e. when the car has no option other than crashing. Our goal is a self-driving car model that learns to keep itself on the road, avoid obstacles in front of it, and assign priorities to different objects according to how valuable each object is.

Keywords: Deep Q-Learning, self-driving, autonomous systems

1. INTRODUCTION

Transportation improves constantly in modern times. Growing development can be seen across the industry, with new automobile manufacturers entering the market at an ever-increasing pace.

Progress is also being made on road safety. Today, road accidents claim numerous human lives and are at a peak [1 - 3]. At the end of the day, most mishaps occur because of driver error. It is therefore natural to look for a solution that increases road safety by removing human effort completely, or at least partially, to reduce risks on the road.

However, a key reason this was long not viable in day-to-day life was the slow processing power of computers, with differences of up to seven hundredths of a second between the decision making of a computer and that of the human brain [4]. Furthermore, computers do not possess the rationality that a human brings to the table. By now, however, computers have caught up in processing speed and can judge fairly rationally in certain scenarios [5, 6]. The trust in these new systems can be seen even in the consumer market, where manufacturers like Tesla and Porsche have already started selling mass-market cars with self-driving capabilities built in. A few isolated incidents aside, the track record of such systems has been impeccable, giving us more and more reason to take this field seriously.

So, with the above in mind, we decided to create a project in which the model (vehicle) can judge for itself, with some accuracy, how a human mind would have decided in a deadlock situation where avoiding a mishap is not a choice.

Using open source tools such as Deep Q-Learning, reinforcement learning, TensorFlow and OpenCV, we tried to create a simulation that represents such a scenario and performs fairly well [8].


We have implemented a model with a simulated camera that captures frames from in front of the vehicle and processes them into the expected results so that the car can proceed with maximum accuracy [9]. Furthermore, the same camera is used to analyse potential obstacles and judge them against a pre-existing database, which gives us values based on the severity of the damage that a collision might cause.

2. PROPOSED SYSTEM

We propose a system that uses Deep Q-Learning to train a model which can detect and avoid obstacles. In case of a deadlock, the model is able to decide how to resolve the deadlock with minimum damage and penalty.

2.1 Deep Learning

Deep Learning is a subset of Machine Learning and Artificial Intelligence. It is the key technology behind self-driving cars, facial recognition, text translation and much more. Its core structure, the Artificial Neural Network, is the foundation of Deep Learning. Neural networks were inspired by the working of the human brain, adapting its information processing and the interplay between neurons. Artificial neural networks, however, are static and limited in their computation, whereas the human brain is extremely dynamic in its neural computation.

The intention was always for Deep Learning to achieve performance similar to the human brain itself. In practice it is applied to specific tasks, and neural networks are designed accordingly, which is a deviation from biological studies.

Artificial neural networks were intended to produce results that conventional algorithms cannot reach with a comparable level of accuracy or efficiency. Such complex problems required attention, as they paved the way towards the automation we see today.

2.2 Q-Learning

Reinforcement learning is a branch of Machine Learning. It contains methods for maximizing the reward obtained by taking suitable actions, which eventually contributes to solving the problem. Q-Learning is an algorithm in which the main goal of the system is to learn a Q function, represented as Q(s, a), where "s" is the current state of the system and "a" is the next action that the system must take in order to change its current state.

Figure 1. Q-Learning Process

In Q-learning, we build a memory table Q[s, a] that stores the Q function value for every possible combination of "s" and "a", where "s" denotes the current state and "a" denotes the action. The following algorithm is used to fit the Q value to the sampled reward, using a learning rate α and a discount factor γ. If the discount factor γ is less than one, it is easier for the Q value to converge.

Algorithm:

Start with Q0(s, a) for all s and a
Get the initial state s
For k = 1, 2, 3, … until Q converges:
    Sample an action a and get the next state s'
    If s' is a terminal state:
        target = R(s, a, s')
        Sample a new initial state s'
    Else:
        target = R(s, a, s') + γ · max_a' Qk(s', a')
    Qk+1(s, a) = (1 − α) Qk(s, a) + α · target
    s = s'
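As a minimal illustration, the update above can be written in Python as follows. This is a hedged sketch, assuming a small discrete environment object with reset() and step() methods (our own hypothetical interface, not part of the project code):

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning. `env` is a hypothetical environment exposing
    reset() -> state and step(action) -> (next_state, reward, done)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy sampling of the next action
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # sampled reward plus the discounted best future value
            target = r if done else r + gamma * np.max(Q[s_next])
            # blend the old estimate and the target with the learning rate
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```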

A common problem with the algorithm above is that if the set of state-action combinations is too large, the memory and processing power required for Q become too high. To solve this problem, we use a deep Q network (DQN) to approximate Q(s, a).

Our reason for using deep Q networks in this project is their ability to convert high-dimensional sensory data, such as vision, directly into decisions the agent can implement. A DQN uses a convolutional network that can extract high-level features from raw sensory data.
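As an illustration of such a network, the sketch below builds a small convolutional Q-network with tf.keras. The layer sizes and the single-channel 16 x 40 input are our own illustrative assumptions, not the exact architecture used in the project:

```python
import tensorflow as tf

def build_q_network(frame_height=40, frame_width=16, n_actions=5):
    """Map a raw camera frame to one Q-value per discrete action."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(frame_height, frame_width, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),   # extract low-level features
        tf.keras.layers.Conv2D(32, 3, activation="relu"),   # extract higher-level features
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_actions),                   # Q(s, a) for each action
    ])

q_net = build_q_network()
q_net.compile(optimizer="adam", loss="mse")  # Q-values are regressed toward the targets
```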

2.3 Action Space

The action space is discrete. Discrete action spaces are easier for the DQN to predict and are therefore an advantage in terms of processing. There are two types of actions to consider for the movement of the vehicle: longitudinal and lateral.

In terms of longitudinal motion, three kinds of actions need to be considered:
1. Cruising at a higher speed, calculated as v + vc, where vc is the additional target speed that lets the vehicle accelerate on an empty path with no obstacles. vc is set to 2 units/h.
2. Maintaining the current speed.
3. Reducing the longitudinal velocity by the same value vc, i.e. v − vc.

In terms of lateral movement, we have the following actions:
1. Staying in the current lane of motion
2. Changing lane to the left
3. Changing lane to the right

Since an autonomous vehicle has to control both longitudinal and lateral motion at the same time to avoid hitting objects, we define 5 actions, as sketched in the code after this list:
• No action
• Accelerate
• Decelerate
• Change lane to left
• Change lane to right
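A minimal mapping from this 5-action discrete space back to the longitudinal and lateral commands is sketched below. The action indices, function name and lane encoding are our own assumptions; the value of vc follows the description above:

```python
from enum import IntEnum

VC = 2.0  # additional target speed vc described above

class Action(IntEnum):
    NO_ACTION = 0
    ACCELERATE = 1
    DECELERATE = 2
    LANE_LEFT = 3
    LANE_RIGHT = 4

def apply_action(action, v, lane):
    """Return the new target speed and lane index for a chosen discrete action."""
    if action == Action.ACCELERATE:
        v = v + VC          # cruise at v + vc
    elif action == Action.DECELERATE:
        v = v - VC          # slow down to v - vc
    elif action == Action.LANE_LEFT:
        lane -= 1
    elif action == Action.LANE_RIGHT:
        lane += 1
    return v, lane          # NO_ACTION keeps both unchanged
```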

2.4 Reward Function

When an action is selected in reinforcement learning, a certain value is assigned to it, called the reward. For the vehicle to find the ideal driving policy, it has to maximize its expected future reward. It follows that the final policy the system learns can vary depending on how the reward function is designed. It is therefore important to choose a reward function well suited to the task so that the system learns the right driving policy for the vehicle.

Since we implement a hierarchy for the objects present on the streets, the reward function has to be designed so that when the vehicle hits an object at the top of the hierarchy it is penalized more than when it hits an object lower in the hierarchy. By doing this we can train the vehicle to avoid the most important objects on the road when there is a deadlock, the vehicle cannot stop, and a collision is imminent.

We have designed a function that satisfies these conditions, given as follows.

For a constant speed:

r_v = (v − v_min) / (v_max − v_min)    (1)

For a collision:

r_col = −r_collision − r_object    (2)

where r_v is the reward for the vehicle travelling at a constant speed, v is the velocity, r_col is the total reward for a collision, r_collision is a constant value for any collision, and r_object is the penalty for hitting the object according to the hierarchy table of objects.

Table 1 below shows the penalties for the common objects used in the project:

Table 1. Common Object Penalty

Object      Penalty
Human       -0.95
Dog, cat    -0.75
Vehicle     -0.50
Wall        -0.30
Barrels     -0.10
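Equations (1) and (2) and Table 1 can be combined as in the hedged sketch below. The table entries are treated as penalty magnitudes whose sign is applied by Equation (2), and the constant r_collision value and dictionary keys are our own assumptions:

```python
# Penalty magnitudes from Table 1 (keys are our own labels).
OBJECT_PENALTY = {
    "human": 0.95,
    "dog": 0.75,
    "cat": 0.75,
    "vehicle": 0.50,
    "wall": 0.30,
    "barrel": 0.10,
}
R_COLLISION = 1.0  # assumed constant term applied to any collision

def speed_reward(v, v_min, v_max):
    """Equation (1): reward for travelling at a constant speed, scaled to [0, 1]."""
    return (v - v_min) / (v_max - v_min)

def collision_reward(obj):
    """Equation (2): constant collision penalty plus the object-specific penalty."""
    return -R_COLLISION - OBJECT_PENALTY[obj]
```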

2.5 Deadlock

In a deadlock situation, i.e. where the vehicle has no option but to hit an obstacle, the vehicle, having identified the various obstacles, prioritises them and tries to hit the obstacle with the lowest penalty while slowing itself down. So each obstacle identified by the vehicle has a danger coefficient, which is calculated from the distance of the object from the vehicle, the current speed of the vehicle, and the priority and penalty associated with the object.
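The paper does not give an explicit formula for the danger coefficient, so the sketch below is purely a hypothetical illustration of the idea: the score grows with the vehicle speed and the object's penalty, shrinks with the distance to the object, and the vehicle aims for the obstacle with the lowest score:

```python
def danger_coefficient(distance, speed, penalty, eps=1e-3):
    """Hypothetical danger score: higher for close, high-penalty obstacles at high speed."""
    return penalty * speed / (distance + eps)

def choose_deadlock_target(obstacles, speed):
    """Among unavoidable obstacles, pick the one with the lowest danger coefficient.
    Each obstacle is assumed to be a dict with 'distance' and 'penalty' entries."""
    return min(obstacles,
               key=lambda o: danger_coefficient(o["distance"], speed, o["penalty"]))
```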

3. EXPERIMENTAL SETUP

We decided to implement the proposed idea in a simulation and observe the results. The project was built in 3 different stages so that we could examine the different kinds of results obtained at each stage.

In the first stage, our training agent was trained to stay on a fixed road and reach a goal at the end of the road. The purpose was to see whether the agent is able to stay on the road without going off track.

In the second stage, the agent was given the task of staying on the road for as long as possible. So instead of a fixed road that the agent could memorize, the road was long and had many turns. The purpose of this stage was to see whether the agent is able to adapt to the road and follow it for long durations without going off course.

In the third and final stage, our training agent was given the problem of staying on the road, avoiding obstacles, making deadlock decisions, and doing all this for as long as possible without going off course.

The simulation is done in the Unity game engine. Unity offers a Python-based API called Unity ML-Agents, which makes it possible to connect the Unity environment to TensorFlow, where the reinforcement learning algorithm takes control of the Unity environment and trains a model.
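Connecting to a Unity build from Python typically looks like the sketch below. This assumes a recent mlagents_envs release and a hypothetical build name; the exact API details vary between ML-Agents versions:

```python
from mlagents_envs.environment import UnityEnvironment

# "CarSimulation" is a placeholder name for the exported Unity build.
env = UnityEnvironment(file_name="CarSimulation")
env.reset()
behavior_name = list(env.behavior_specs)[0]   # the agent behavior defined in Unity

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    camera_frames = decision_steps.obs[0]     # camera observations for agents awaiting actions
    # ...feed the frames to the Q-network, then send actions back with
    # env.set_actions(...) before advancing the simulation.
    env.step()

env.close()
```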

4. TECHNOLOGIES UTILISED

4.1 TensorFlow

TensorFlow was developed by the Google Brain team, first for internal use only, and then released to the public in 2015 under the Apache License 2.0. It is mainly used by developers and researchers for dataflow and differentiable programming across a broad range of tasks. It is a symbolic math library that is heavily used in the field of machine learning, specifically neural networks. TensorFlow can also be run underneath high-level libraries such as Keras.

It has extensive CUDA and SYCL capabilities for Graphics Processing Unit programming. Thanks to this flexibility, deployment on CPU, GPU and Tensor Processing Unit is available. Computations are expressed as stateful dataflow graphs, which remember the previous interactions that have taken place inside the system. Neural networks perform their operations as calculations on multidimensional arrays called tensors.
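For instance, a TensorFlow computation is just an operation over such tensors; a trivial example (values chosen arbitrarily) is the matrix product a dense layer performs:

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])        # a 2x3 tensor
w = tf.constant([[0.5], [0.25], [0.1]])   # a 3x1 tensor of weights
y = tf.matmul(x, w)                       # result has shape (2, 1)
print(y.numpy())
```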

4.2 Unity ML-Agents

Unity Machine Learning Agents is a software development kit for use with the Unity software. It allows agents to be trained using Reinforcement Learning, Deep Q-Learning and Imitation Learning. Extensive machine learning algorithms can be used through the Python API. This collection of ML tools helps researchers design solutions and make advances in game development as well as robotics.

Figure 2. High-level architecture diagram of how ML-Agents interacts with Unity and TensorFlow


4.3 OpenCV

OpenCV (Open Source Computer Vision Library) is a library for computer vision tasks and the most widely used tool for them; it was originally developed by Intel. The library is cross-platform and free to use under an open source (BSD) license.

OpenCV accepts deep learning models (in ONNX format) from frameworks such as TensorFlow and Caffe and maps them onto a defined list of supported layers. Portable formats ease the import process. The library itself performs well thanks to its underlying C++ implementation, which increases overall efficiency, making it one of the fastest options for these purposes.
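Loading an ONNX model with OpenCV's dnn module looks like the sketch below; the model file, input size and frame path are placeholders, not artefacts of this project:

```python
import cv2
import numpy as np

# Hypothetical ONNX export of an obstacle-classification network.
net = cv2.dnn.readNetFromONNX("obstacle_detector.onnx")

frame = cv2.imread("camera_frame.png")                     # one simulated camera frame
blob = cv2.dnn.blobFromImage(frame, scalefactor=1.0 / 255, size=(64, 64))
net.setInput(blob)
scores = net.forward()                                     # class scores for the frame
print("predicted class:", int(np.argmax(scores)))
```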

5. RESULT AND ANALYSIS

The project was produced in 3 stages. In the 1st stage, the agent was told to move to a certain target, and the main goal was for the agent to stay on the road.

In the 2nd stage, the agent was told to stay on the road. The main objective was to train the agent to take any kind of turn the road presents. After this stage, the agent is able to stay on the road and move along it.

In the 3rd stage, we introduced obstacles for the agent to avoid, in the form of 3D human models instantiated randomly in front of the agent. The main goal was to detect, identify and avoid obstacles. In each stage, the model learns something and builds upon it.

Stage 1:

Information about the Stage:

State-size: 16x40 + 1
• The camera has a resolution of 16 x 40. It sees part of the sky and the track.
• The forward velocity of the car (defined as the velocity dot product with the normalized forward vector of the car)

Action-size: 2, continuous
• Up or Down (1, -1)
• Left or Right (-1, 1)

Rewards and done:
• Every step: the norm of the forward velocity / 5. This is done to encourage forward movement (see the sketch after this list). If it moves backwards, multiply the reward by -4.
• When the car falls off the track, set the reward to -50 and done.
• When the car reaches the goal, set the reward to 100 and done.

Training period: 200 episodes with an episode length of 8000 frames.
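The per-step reward and termination logic described above can be summarised in the following sketch; the function and variable names are ours, and the forward speed is the velocity projected onto the car's normalized forward vector as defined above:

```python
def stage1_reward(forward_speed, off_track, reached_goal):
    """Return (reward, done) for one simulation step in stage 1."""
    if off_track:
        return -50.0, True             # fell off the road: penalize and end the episode
    if reached_goal:
        return 100.0, True             # reached the goal: reward and end the episode
    reward = abs(forward_speed) / 5.0  # encourage movement
    if forward_speed < 0:              # moving backwards
        reward *= -4.0
    return reward, False
```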


Figure 3. Reward vs Episode – Stage 1

The above graph shows the plot of reward vs episode for the 1st stage. The number below the plot is the mean of the rewards from all the episodes.

We can infer that the agent learns fairly easily that it has to avoid falling off the road.

Stage 2:

Information about the Stage:

State-size: 16x40 + 1
• The camera has a resolution of 16 x 40. It sees part of the sky and the track.
• The forward velocity of the car (defined as the velocity dot product with the normalized forward vector of the car)

Action-size: 2, continuous
• Up or Down (1, -1)
• Left or Right (-1, 1)

Rewards and done:
• Every step: the norm of the forward velocity / 5. This is done to encourage forward movement. If it moves backwards, multiply the reward by -4.
• When the car falls off the track, set the reward to -50 and done.

Training period: 200 episodes with an episode length of 8000 frames.


Figure 4. Reward vs Episode – Stage 2

The above graph shows the reward vs episode plot for the 2nd stage. The number below the plot is the mean of the rewards from all the episodes. We can infer that the agent makes numerous mistakes but is able to traverse and follow the road for long stretches of time.

Stage 3:

Information about the Stage:

State-size: 16x40 + 1
• The camera has a resolution of 16 x 40. It sees part of the sky and the track.
• The forward velocity of the car (defined as the velocity dot product with the normalized forward vector of the car)

Action-size: 2, continuous
• Up or Down (1, -1)
• Left or Right (-1, 1)

Rewards and done:
• Every step: the norm of the forward velocity / 5. This is done to encourage forward movement. If it moves backwards, multiply the reward by -4.
• When the car falls off the track, set the reward to -50 and done.

Training period: 200 episodes with an episode length of 8000 frames.


Figure 5. Frozen TensorFlow graph file logs

The above image is a screenshot of the logs from the frozen graph file, which shows how at each step the agent identifies the object, assigns a danger coefficient, and then takes an action to move to a more favourable state.

Figure 6. Accuracy of pedestrian estimation

We can infer that the agent is able to identify and avoid obstacles. The agent is able to traverse for long periods and does not hesitate to stop itself, even though there is a negative penalty for no movement, so that a collision between the agent and an obstacle is avoided.


6. CONCLUSION

The project has been a massive learning experience for all the team members, who have gained some level of expertise in the fields of Machine Learning and 3D modelling, as well as knowledge of the existing models for self-driven automobiles. The solution that we provided to our problem statement has been effective, and we have actively tested and developed the project into a robust and reliable model for reference by the automotive industry as well as for future developments. We have tried to develop a scalable, modular solution which can be attached to existing systems with little to no modification required. Real-life scenarios, however, are quite different from simulated ones, and for any practical implementation we would have to go through the entire cycle of testing and tweaking the algorithms, albeit with little to no change to the basis of the project, to obtain reliable real-world performance. We have been successful in getting promising results and the expected behaviour from our model, and have continually improved upon our efforts.

REFERENCES

1. P. A. Viola and M. J. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," in 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8-14 December 2001, pp. 511-518.
2. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based Learning Applied to Document Recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998.
3. M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., "End to End Learning for Self-Driving Cars," arXiv preprint arXiv:1604.07316, 2016.
4. D. Silver, A. Huang, et al., "Mastering the Game of Go with Deep Neural Networks and Tree Search."
5. P. Suresh, U. Saravanakumar, V. Karthikeyan, G. Vamsi Krishna, K. Sreekanth, and G. Kumar Reddy, "Design and Implementation of Real Time Data Acquisition System in all Programmable System on Chip," International Journal of Innovative Technology and Exploring Engineering, vol. 8, issue 10, 2019, pp. 3680-3684.
6. X. Zhang, F. Chen, and R. Huang, "A Combination of RNN and CNN for Attention-based Relation Classification," vol. 131, pp. 911-917, 2018.
7. S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A Survey of Deep Learning Techniques for Autonomous Driving."
8. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level Control through Deep Reinforcement Learning," vol. 518, 2015.
9. P. Suresh, U. Saravanakumar, C. Iwendi, S. Mohan, and G. Srivastav, "Field-Programmable Gate Arrays with Low Power Vision System Using Dynamic Switching," Computers & Electrical Engineering, vol. 90, 2021, 106996.
