DEPARTMENT OF GEOINFORMATICS

UNIVERSITY OF APPLIED SCIENCES MUNICH

PROF. DR. THOMAS ABMAYR

Research on new Artificial Intelligence based Path Planning Algorithms with Focus on Autonomous Driving

Christoph Oberndorfer

Master Thesis

Research on new Artificial Intelligence based Path Planning Algorithms with Focus on Autonomous Driving

Master Thesis

University of Applied Sciences Munich Department of Geoinformatics

Executed at Audi Electronics Venture GmbH

Advisors: Prof. Dr. Thomas Abmayr (University of Applied Sciences Munich) Gordon Taft (Audi Electronics Venture GmbH)

Author: Christoph Oberndorfer

Course of study: Geomatics (M. Eng.)

Realized in summer semester 2017, submitted in September 2017

Abstract

Keywords: Artificial Intelligence, Neural Networks, Reinforcement Learning, Evolutionary Algorithms, Routing Algorithms, Path Planning, Autonomous Driving

Artificial intelligence based technologies have strongly gained importance during the last years in various fields of research. They are likely to significantly influence the future developments of the automotive industry. Autonomous driving in dynamic, intra-urban environments is thus one of the most challenging areas for automobile manufacturers at this juncture. One of the corresponding key technologies is intelligent path planning. Within this field of research, classical routing algorithms emerged as not flexible enough to handle complex situations in road traffic. Building on these challenges, the main goal of this work is to develop an artificial intelligence based approach to path planning algorithms that provides a better applicability than classical routing algorithms. Additionally, a comparison of both approaches gives a review of their assets and drawbacks. Therefore, the developed approaches should broaden the viewpoint of classical path planning methods. The present work targets the realization of two different artificial intelligence based approaches. The first implemented approach uses a neural network as well as reinforcement learning methods. In the second approach, an evolutionary algorithm is applied to evolve a neural network. The resulting algorithms are tested in a grid based environment to carry out an accuracy analysis. Various experiments furthermore help to identify the most fitting parameter sets for the particular algorithms. Both algorithms turned out to exhibit sufficient functionality. The reinforcement learning based approach is able to find its way from the starting point to the target point with a probability of 98.74%. The comparison to a classical state of the art path planning algorithm (A*) showed that the reinforcement learning based approach has the ability to take the shortest path in 99.52% of all successful testings. These findings clearly show that this algorithm is highly suitable for the challenge of path planning. The second, evolutionary algorithm based approach was only able to provide insufficient results with a probability of 69.96%. To improve these results, several potential extensions are recommended. Thereby, the present work shows the efficiency of artificial intelligence based methods in path planning tasks by realizing a holistic and powerful path planning algorithm.

Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
  1.1 Artificial Intelligence today
  1.2 Motivation
  1.3 Related work
  1.4 Scope of the work
  1.5 Structure of the work

2 Fundamental Methods
  2.1 Routing algorithms
    2.1.1 Dijkstra Algorithm
    2.1.2 A* Algorithm
    2.1.3 Ant Colony Optimisation Algorithm
  2.2 Artificial Intelligence
    2.2.1 Reinforcement Learning
    2.2.2 Neural Network
    2.2.3 Evolutionary Algorithm

3 Approach
  3.1 A* Algorithm
  3.2 Q-Learning Algorithm
  3.3 Multilayer Perceptron
  3.4 Neuroevolution of Augmenting Topologies

4 Experiments and Results
  4.1 Multilayer Perceptron
    4.1.1 Independent runs
    4.1.2 Five epochs
    4.1.3 Different ratio of data sets
    4.1.4 Different optimisers
    4.1.5 Different alpha values
    4.1.6 One hidden layer
    4.1.7 Two hidden layers
    4.1.8 One to five hidden layers
    4.1.9 Reduction of unsolved grids in training
    4.1.10 Reduction of the number of steps for final result
  4.2 Neuroevolution of Augmenting Topologies
    4.2.1 Independent runs
    4.2.2 Different number of training sets
    4.2.3 Different number of neurons
    4.2.4 Different MutateAddNeuronProb values
    4.2.5 Different ActivationLevel values
    4.2.6 Different SurvivalRate values
    4.2.7 Different YoungAgeFitnessBoost values
    4.2.8 Other parameters
    4.2.9 Experiments with final parameters
    4.2.10 Analysis of the final result

5 Discussion and Outlook
  5.1 Evaluation of Multilayer Perceptron
  5.2 Evaluation of Neuroevolution of Augmenting Topologies
  5.3 Comparison of algorithms
  5.4 Summary
  5.5 Outlook

Appendix
  A XOR function example step-by-step
  B NEAT parameter list

Bibliography

List of Figures

2.1 Visualisation of a graph with nodes and links
2.2 Development of a search tree with two different search algorithms
2.3 Two subgraphs with Dijkstra Algorithm
2.4 Two subgraphs with A* Algorithm
2.5 Comparison of Dijkstra and A* algorithms
2.6 Development of the Ant Colony Optimisation algorithm
2.7 The agent interacts with the environment
2.8 Reinforcement Learning framework
2.9 Map of five rooms connected through doors
2.10 Graph representation of the rooms with rewards for each link
2.11 R-matrix includes the initial rewards for all links
2.12 Q-matrix as memory of the algorithm
2.13 Biological and artificial neuron
2.14 Schematic representation of an ANN
2.15 Stochastic Gradient Descent with a two-dimensional error function
2.16 Overview of the most common activation functions
2.17 Examples of underfitting, fitting and overfitting
2.18 Crossover and mutation as genetic operators

3.1 Graph and shortest path created by A*
3.2 Graph and shortest path created by Q-Learning
3.3 Grid with player, walls, positive and negative items
3.4 Process of calculating an action to a state
3.5 Workflow of the learning mechanism in the Neural Network
3.6 Workflow of the NEAT algorithm

4.1 MSE development over five epochs
4.2 MSE development with different numbers of hidden layers
4.3 Development of the result value
4.4 Analysis of the quantity of neurons in each layer
4.5 Visualisation of the final Neural Network

List of Tables

1.1 Different levels of automation for self-driving cars

2.1 All possible combinations of input and output of the XOR function
2.2 Results after learning the XOR function

4.1 Results of the MSE development over five epochs
4.2 Results with different numbers of training and test sets
4.3 Results of different optimisers
4.4 Results of different alpha values of the SGD
4.5 Results of one hidden layer
4.6 Results of two hidden layers
4.7 Run time with two hidden layers
4.8 Results of two hidden layers after five epochs
4.9 Results of one to five hidden layers
4.10 Detailed results of one to five hidden layers
4.11 Results of unsolved grids in training phase
4.12 Results of different numbers of maximum repetitions
4.13 Results of minimised steps
4.14 Different numbers of training sets
4.15 Different neurons of initial hidden layer
4.16 Different MutateAddNeuronProb values
4.17 Different ActivationLevel values
4.18 Different SurvivalRate values
4.19 Different YoungAgeFitnessBoost values
4.20 Best parameters for the final evaluation
4.21 Results of single parameter
4.22 Results of double parameter combinations
4.23 Results of triple and fourfold parameter combinations
4.24 Final results of best parameter combinations
4.25 Number of neurons in final Neural Network

List of Abbreviations

ADAS Advanced Driver Assistance Systems
ASCII American Standard Code for Information Interchange
AI Artificial Intelligence
ANN Artificial Neural Network
ACO Ant Colony Optimization
BFS Breadth-First Search
CNN Convolutional Neural Network
CVRP Capacitated Vehicle Routing Problem
DFS Depth-First Search
DNC Differential Neural Computer
EA Evolutionary Algorithm
EC Evolutionary Computation
ES Evolution Strategy
EP Evolutionary Programming
GA Genetic Algorithm
GP Genetic Programming
LSTM Long Short-Term Memory
MLP Multilayer Perceptron
MSE Mean Squared Error
NEAT Neuroevolution of Augmenting Topologies
NN Neural Network
RMS Root Mean Square
RL Reinforcement Learning
SGD Stochastic Gradient Descent
TSP Travelling-Salesman-Problem


1 Introduction

In the following chapter, an introduction to and overview of the topic of the present work is given. The first section 1.1 illustrates the present situation of artificial intelligence with some examples. Subsequently, an overview of the automation levels in the development of intelligent cars follows and the objective of the work is defined. The second section 1.2 deals with the motivation for focusing on the domain of artificial intelligence and autonomous driving. The following section 1.3 gives an overview of related work in the field of artificial intelligence (AI) based path planning algorithms. Afterwards, section 1.4 shows how the present work can contribute to the research field of AI. The last section 1.5 finally summarises all chapters and provides an overall overview of the structure of the work.

1.1 Artificial Intelligence today

Algorithms which can paint images [38], write news stories for websites [52], read the body language of people [9], [70] or recommend recipes based on a photo of food [80] are only a few examples of systems based on AI. This list can be extended to a large extent, as almost every day new applications, ideas or concepts appear. Additionally, these algorithms are not limited to any special field of industry. Manufacturing, transportation, healthcare, customer service or finance are only a subset of the possible fields of research and development of AI. Most of the algorithms need a large data basis for learning. With regard to big data, good solutions to collect, store, connect and analyse huge amounts of data sets exist in the meantime. Through the further development of existing algorithms and the increase of available processing power, it became possible to create such intelligent systems. The advantages of AI are obvious. Instead of pre-defining a model with all conceivable cases for an application, the intelligent machine itself discovers the best way to the desired target. Anyone thinking that all these trends are still a plan for the future is wrong. AI has already found its way into the normal course of life. This does not mean that every house already has an intelligent, self-driving vacuum cleaning robot. But almost everybody owns a personal computer or a smartphone nowadays. All the integrated personal assistants like Siri, Google Now or Cortana are based on a highly developed AI engine. The suggestions of a search engine for the next word following the one already typed in while writing a message, or for the next product to be purchased in an online shop, are all chosen by AI methods. Another intelligent system will be widespread and well-known by almost everybody in a few years: the autonomous car. This field of research is one of the great challenges

for car manufacturers at the moment. Advanced driver assistance systems (ADAS) are developed to support the driver in specified driving situations. It started with simple ADAS like the automatic distance alert for parking assistance. Meanwhile, systems with traffic sign recognition, adaptive cruise control or lane departure warning systems are available. These systems are being developed further and some self-driving prototypes already exist. But some time will still pass until the safe and completely autonomous driving of cars in any urban environment is achievable. Different gradings of automation, each with their own definitions, exist. The organisation SAE International (first established as the Society of Automotive Engineers) [79], the German Federal Highway Research Institute (Bundesanstalt für Straßenwesen, BASt) [6] and the National Highway Traffic Safety Administration (NHTSA) [69] use different notations but with similar definitions. Table 1.1 gives an overview of the levels of automation for self-driving cars.

Human driver monitors the driving environment:

SAE Level 0 - No automation (BASt: Driver only, NHTSA: 0). Narrative definition: full-time performance by the human driver of all aspects of the dynamic driving task. Execution of steering and acceleration/deceleration: human driver; monitoring of driving environment: human driver; fallback performance of dynamic driving task: human driver; system capability (driving modes): n/a.

SAE Level 1 - Driver assistance (BASt: Assisted, NHTSA: 1). Narrative definition: driving mode-specific execution by a driver assistance system of either steering or acceleration/deceleration. Execution: human driver and system; monitoring: human driver; fallback: human driver; system capability: some driving modes.

SAE Level 2 - Partial automation (BASt: Partially automated, NHTSA: 2). Narrative definition: driving mode-specific execution by one or more driver assistance systems of both steering and acceleration/deceleration. Execution: system; monitoring: human driver; fallback: human driver; system capability: some driving modes.

Automated driving system monitors the driving environment:

SAE Level 3 - Conditional automation (BASt: Highly automated, NHTSA: 3). Narrative definition: driving mode-specific performance by an automated driving system; the human driver does intervene. Execution: system; monitoring: system; fallback: human driver; system capability: some driving modes.

SAE Level 4 - High automation (BASt: Fully automated, NHTSA: 3/4). Narrative definition: driving mode-specific performance by an automated driving system; the human driver does not intervene. Execution: system; monitoring: system; fallback: system; system capability: some driving modes.

SAE Level 5 - Full automation (BASt: -, NHTSA: 3/4). Narrative definition: full-time performance by an automated driving system of all aspects of the dynamic driving task. Execution: system; monitoring: system; fallback: system; system capability: all driving modes.

Table 1.1: Different levels of automation of self-driving cars. The human driver monitors the driving environment in the lower levels (coloured blue), in the higher levels an automated driving system takes over this task (coloured green). The black line separates the responsibility of the human driver and the system. The levels of automation of the three organisations SAE, BASt and NHTSA are coloured red [20].

The human driver (coloured blue) and the automated driving system (coloured green), which monitors the driving environment, are separated by the black line. The main tasks can be divided into four aspects: who executes the steering and acceleration/deceleration, who monitors the driving environment, who provides the fallback performance and who decides on the driving mode. The SAE narrative definition includes a short explanation of each level. The levels of automation of the three organisations SAE, BASt and NHTSA are coloured red [20].

Intelligent path planning is thereby one of the key technologies of autonomous driving. Established path planning algorithms are mostly not flexible enough to handle complex and dynamic situations in urban environments. This work highlights a new approach to path planning algorithms based on AI.

1.2 Motivation

AI based methods with Neural Networks (NN) have made great strides in recent years. AlphaGo is an AI based software which won the board game Go against the best players in the world [24]. With this success, AI is on everyone's lips whenever a discussion turns to intelligent systems. As already mentioned in section 1.1, a lot of different fields of activity where the technology can be applied exist. The current study examines whether AI can also be used in the area of geoinformatics. Classical path planning algorithms have not changed significantly in the last years, but with these new AI methods new developments are possible. Additionally, this may also be an impulse to find further possibilities to use AI in the field of geoinformatics. The Gartner Hype Cycle for Emerging Technologies 2017 forecasts that deep learning and neural networks are some of the top trends in the next two to five years [37]. Many leading technology companies like Amazon, Apple, Facebook, Google, IBM or Intel have already been dealing with AI for some years [22]. They run their own AI research departments and are aware of the high importance of AI in the future. The market for AI is growing fast and many startups arise [18]. The market research company Gartner predicts that by 2019 startups will overtake the big technology companies with innovative AI technologies [74]. Recently this was also shown by the company DeepL (previously called Linguee). Their AI based online translation service provides faster and more accurate solutions than Google or Microsoft applications [14]. Nowadays these big companies experience a run on the innovative startups dealing with AI, as can be seen by the fact that Google has acquired 12 startups since 2012 [19]. This is likely to be a rising trend. Andrew Ng, former Baidu chief scientist, Coursera co-founder and Stanford adjunct professor, is one of the leading AI experts in the world. He compared the changes that artificial intelligence will create with the electrical revolution a hundred years ago [90]. AI based methods are also very important for the development of autonomous driving. With this technology, traffic fatalities could be reduced by up to 90% [7]. In the US alone, almost 300,000 lives could be saved each year. Many accidents are caused by human error, like driving at the wrong speed, keeping too short a distance to other road users or ignoring the right of way [25]. These causes can be eliminated if an intelligent system takes over the driving functions. At the same time, the development of this technology has to be done with caution and the implementation of sufficient safety mechanisms seems necessary. In their study, Evtimov et al. have shown that traffic signs can be modified with stickers to provoke an incorrect interpretation (Evtimov et al., 2017). Challenges like this will also have to be mastered.

There are also more critical voices such as Elon Musk, Stephen Hawking and Bill Gates, who warned against potentially tragic consequences of AI [65]. They demand enhanced monitoring and more decisive government interventions, especially regarding the development of AI for military purposes. Researchers of the Facebook AI Research Lab developed chat bots with a novel AI engine [55]. These bots are digital agents which automatically communicate with other participants via text messages. Thereby, intelligent bots should learn reasoning, conversation and negotiation on the way to a personalised digital assistant. The case became famous as these bots learned to communicate in their own unique language which could not be understood by humans [56]. This example shows the impressive and simultaneously frightening potential of AI. In general, this was only an individual case and the useful applications predominate. This technology enables groundbreaking developments and it can be assumed that numerous new developments in the field of AI will appear in the next few years.

1.3 Related work

As already mentioned before, autonomous driving is a major goal in the research field of automobile manufacturers and has therefore been the topic of different studies focusing on path planning algorithms and AI. A short summary of this research is given in the following section. The A* algorithm belongs to the classical methods and is discussed in the work of Yao et al. [98]. They have shown that the algorithm can be used for path planning in known and unknown environments. Additionally, they presented an improvement of speed regarding the run time. The findings concerning the improvement of the classical A* algorithm in terms of speed, efficiency or memory usage are confirmed by various researchers [42], [27], [12]. Another modification of the A*, which is called Hyperstar, was developed by Bell [10]. Many different A* paths are calculated and the decision which one is currently used is based on events during the trip like congestion (risk-averse strategy). This approach is similar to the D* algorithm (dynamic A*) developed by Stentz [86]. The shortest path is recalculated in each time step and the algorithm can react very fast in dynamic environments. The general approach of the swarm intelligence based ant colony optimisation (ACO) algorithm for path planning is described by Blum in [13]. A modified algorithm was introduced by Gambardella et al. [34]. Thereby, the Capacitated Vehicle Routing Problem (CVRP), similar to the travelling salesman problem (TSP), is solved by using two ant colonies with different goals. The colonies can exchange information and find a solution for the multiple objective functions in cooperation. In order to find a target with Reinforcement Learning (RL), Heidrich et al. describe the so-called Q-Learning algorithm [46]. This is the basic approach with a static grid and without dynamic elements. Mnih et al. extend the RL approach by combining it with a NN. This development is called Deep Reinforcement Learning or Deep Q-Learning and was tested in different Atari games. Such games can also be seen as multidimensional optimisation problems similar to those of path finding algorithms. The same approach

was used to train a vehicle to drive in a simulated environment. Thereby, the algorithm learned to drive on a lane without crashing on the basis of the images from the simulation [99], [36]. NNs are limited in storing data over a long timescale [41]. The method of the differential neural computer (DNC) extends the NN by using an external memory. This algorithm was tested successfully for a shortest path problem based on the network of the London Underground.

Ericsson et al. used a genetic algorithm (GA) to set the weights of the connections of a computer network [29]. The optimised weights are used by a further algorithm, called Open Shortest Path First (OSPF), which is a common algorithm for the routing of internet traffic. This concept can be transferred to calculate the weights of connections in general networks, which can finally be solved by algorithms like A*. The approach of finding the shortest path in a static road network based on a GA was described by Behzadi and Alesheikh [8]. Therefore, different mutation operators, which extend the standard GA operators, were developed. Another example was introduced by Ismail et al. [49]. The network consisted of a map, where each pixel was a node and a possible position of the robot. Some obstacles were placed on the grid and the GA should then evolve the best path. Achour and Chaalal described a similar approach [3]. Instead of using all possible positions, only a few positions are randomly generated. This method reduces the search space and can reach a good result faster. In these three types of algorithms, the GA is applied directly to different suggestions of possible paths. This means that a node is represented by a neuron and a link by a connection between two neurons. A different approach was introduced by Weeks, using the GA for interplanetary trajectory optimisation [92]. Through the main equations, the trajectory has been fixed, but the variables like position of the spacecraft, flight path angle and angular position were evolved by the algorithm. The goal was to find an optimised configuration of the parameters for the shortest path. Weeks mentioned that the result can be seen as a first estimate for more detailed investigations. The general operating principle of the GA based algorithm Neuroevolution of Augmenting Topologies (NEAT) by Stanley and Miikkulainen is described in the works [85] and [83]. They also developed a method to use the NEAT algorithm for playing the popular board game Go with a small board size [84]. Hausknecht et al. used the extension of the NEAT algorithm, which is called HyperNEAT, as a basis for their approach to learn to play Atari games [44]. In this case, the NN represents the knowledge of the algorithm and the output will be the direct action, which can be forwarded to the game.

Another path planning approach is based on an artificial potential field [32], [51]. This method became popular in the field of robotic navigation. In this approach, a potential field is created over the environment, based on an attractive force coming from the target and a repulsive force coming from all obstacles. With these different forces, a robot is able to calculate the path from its position to the target. The last introduced sampling-based algorithm uses the rapidly exploring random tree (RRT) and is called the Informed RRT* algorithm [35], [89]. The main idea is that a self-developing space-filling tree searches for the optimal paths in the multidimensional search space.

1.4 Scope of the work

This work focuses on different path planning algorithms. The classical approaches like the Dijkstra or A* algorithms were already developed some decades ago. A new field of methods with the AI based approach of NNs arose in the last years. On this basis, two new algorithms could be developed. The first one uses a NN which is trained with reinforcement learning methods. The second approach is based purely on a NN which is evolved with an evolutionary process. These two methods and the A* algorithm will be implemented and compared to each other. This work explains two modern AI based algorithms from scratch, which also makes it understandable for people without any knowledge of AI. The main goal is to extend the classical point of view of traditional path finding methods. Thereby, a good start into the subject of artificial intelligence can be given.

1.5 Structure of the work

The work is divided into the following sections. The preceding chapter gives an introduction to the field of AI and autonomous driving. Chapter 2 includes the fundamental methods which are used in this work. Thereby, a distinction is made between classical routing methods and AI based methods. Chapter 3 describes the implementation of the selected algorithms. In chapter 4, different experiments are carried out and the results are presented. The last chapter 5 discusses the methods and results and gives an outlook.

2 Fundamental Methods

The following chapter presents basic information about the fundamental methods which belong to this topic. The first section 2.1 describes some classical, well-known routing algorithms. An overview of the field of AI is given in section 2.2. Additionally, some specific AI methods which are used for the new algorithm will be discussed.

2.1 Routing algorithms

A routing algorithm solves the task of finding a path in a given graph. A graph consists of an arbitrary number of nodes which are connected by links. Figure 2.1 shows an example of a graph with eight nodes (circles) and eleven links (lines). There exist different data formats, like the adjacency matrix or the adjacency list, to store a graph. The adjacency list contains all nodes, which in turn include the links to the next connected nodes. This data format is used for the following algorithms. A good example of a graph is a part of a street network. The crossroads are the nodes and the roads are the links. In addition, the links can have different properties like directed, undirected, weighted or unweighted. A path is a part of the graph that shows the way from the start to the target node. Figure 2.2 C shows a path from node A to node E, written as P(A,E). The most common task in the field of path planning is to find the 'best' path from a start node to a target node. Depending on the definition of what 'best' means, it could be the shortest path or the path with minimal costs. Another question of path planning goes back to the 18th century and is called the 'Königsberg Bridge Problem'. The challenge was to find a path in a town over seven different bridges by crossing each bridge only once. The start and end point should also be the same. Leonhard Euler dealt with this question and found a solution in 1736. This finding represents the beginning of graph theory [95]. Another, more current example is named the Tube Challenge. It pursues the objective to visit all London Underground stations as fast as possible. The first trial was


Figure 2.1: Visualisation of a graph with nodes (circles) and links (lines).

documented in the year 1960. The current record is from 2016, when Steve Wilson and Andy James visited all 270 stations in 15 hours and 45 minutes [81]. This type of tour planning problem is called the Travelling-Salesman-Problem. The goal is to find a path where each node is visited at least once. In general, depending on the size of the graph, there exists a huge number of possible combinations to find the best path. To calculate and compare all possible path solutions would take a long time and is inefficient. The algorithms described in the following subsections offer a more efficient solution to the shortest path problem.
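To make the adjacency-list representation mentioned above concrete, the following minimal Python sketch stores a small weighted, undirected graph similar to the one in figure 2.1. The node names and edge weights are chosen for illustration only and do not reproduce the figure exactly.

```python
# A weighted, undirected graph stored as an adjacency list:
# every node maps to its neighbouring nodes and the corresponding edge weights.
graph = {
    "A": {"B": 2, "C": 1, "D": 4},
    "B": {"A": 2, "C": 3},
    "C": {"A": 1, "B": 3, "F": 5},
    "D": {"A": 4, "E": 1},
    "E": {"D": 1, "H": 2},
    "F": {"C": 5, "G": 2},
    "G": {"F": 2, "H": 3},
    "H": {"E": 2, "G": 3},
}

# The links of a node (and therefore the possible next steps of a path)
# can be looked up directly:
print(graph["A"])        # {'B': 2, 'C': 1, 'D': 4}
```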

2.1.1 Dijkstra Algorithm

The Dijkstra algorithm was published by E. W. Dijkstra in the year 1959 [58]. The goal is to find the optimal path with a minimum of analysed nodes, i.e. with a minimal search tree. Basically, there exist two different approaches to create the search tree, the Breadth-First Search (BFS) and the Depth-First Search (DFS) [58]. Examples of the different developments of the search trees are visualised in figure 2.2. Figure A shows the initial graph, figure B represents the first steps of the BFS and figure C visualises the first steps of the DFS.


Figure 2.2: Development of a search tree with two different search algorithms. A: initialisation, B: Breadth-First Search (BFS), all connected nodes are analysed, C: Depth-First Search (DFS), only one of the next nodes is analysed.

The BFS and DFS algorithms are explained in more detail in this section. Generally, the algorithm uses three different lists: the unknown list contains all nodes, the open list all active nodes and the closed list all passive nodes which are already explored. Compared to figure 2.2 A, all blue nodes are placed in the unknown list. The open and closed lists are empty at the beginning. The algorithm then moves the start node from the unknown list to the open list and starts analysing it (example node A). The BFS moves all nodes connected to the start node from the unknown list to the end of the open list (example nodes B, C, D). The start node itself is moved to the closed list. This procedure of analysing the first element of the open list is repeated until the open list is empty. This search algorithm analyses the nodes directly connected to the start node first (example nodes B, C, D), then the nodes two steps away, and

so on. The DFS also begins by moving the start node from the unknown list to the open list and analysing it. Afterwards, the DFS moves only the first connected node from the unknown to the open list (example node D). The single nodes in the open list are analysed step by step (example nodes D, F, H, E). As soon as this process is completed and there exist no more connected nodes from the analysed one, the algorithm goes backwards (backtracking), checking the previous nodes again (example node F will be analysed next). Both algorithms create a full search tree with all existing nodes but with different structures. The Dijkstra algorithm is based on the BFS and contains an extension which finds the optimal path faster and before the full search tree is developed [58]. It uses the same three lists to store nodes which were already defined for the BFS and DFS algorithms, namely the unknown list, the open list and the closed list. An example of this situation is visualised in figure 2.3.


Figure 2.3: Graph with two subgraphs. The first one (A-B-C-X) was already passed, the second one (X-E-F-G) is unknown.

Additionally, a definition of the target node is needed (example node G). The algorithm starts by moving the start node A to the open list and analysing it. The Dijkstra algorithm contains a cost function g(X) which is calculated for each node when it moves from the unknown list to the open list. The node X is the one on top of the open list (in the example the fourth step is visualised). The recursive g function, written as

$$ g(X) = g(C) + v(C, X) \qquad (2.1) $$

contains the length of the path P(A,X). Usually all nodes contain a coordinate which can be relative or absolute and two or three dimensional. The distance between two nodes can be calculated by the Euclidean distance

$$ d_{\mathrm{Euclidean}}(C, X) = \sqrt{\sum_{i=1}^{n} (C_i - X_i)^2} \qquad (2.2) $$

which is also called the v function [94]. The distance of a far connected node will be calculated by the recursive call, or rather the sum of all single distances before. For this example the equation

$$ g(X) = v(A, B) + v(B, C) + v(C, X) \qquad (2.3) $$

can be used. The second change is that the unsorted open list is replaced by a list sorted by the g value of each node. With this improvement, the search tree will be extended at the node with the shortest distance to the start node. The algorithm is finished when the target node is the first element in the open list. The minimal search tree is saved in the closed list, from which the optimal path can be extracted. This algorithm forms the basis for the A* algorithm which will be described in the next subsection.
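As a hedged illustration, the following Python sketch condenses the described procedure (unknown, open and closed lists plus the recursive cost function g) into a compact implementation; a binary heap takes the role of the sorted open list, which is a common and equivalent realisation. The function name and the use of the adjacency list from the sketch in section 2.1 are assumptions made here, not part of the thesis implementation.

```python
import heapq

def dijkstra(graph, start, target):
    """Shortest path on a weighted adjacency list (see the sketch in section 2.1).

    The heap plays the role of the sorted open list: the node with the smallest
    cost g is always expanded next. 'closed' collects the passive nodes whose
    optimal cost is already known.
    """
    open_list = [(0.0, start, [start])]          # entries: (g, node, path so far)
    closed = set()

    while open_list:
        g, node, path = heapq.heappop(open_list)
        if node in closed:
            continue
        if node == target:                        # target is the first element -> finished
            return g, path
        closed.add(node)
        for neighbour, v in graph[node].items():  # v(node, neighbour) = edge length
            if neighbour not in closed:
                heapq.heappush(open_list, (g + v, neighbour, path + [neighbour]))

    return float("inf"), []                       # no path exists

# Example call with the adjacency list shown above:
# cost, path = dijkstra(graph, "A", "H")
```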

2.1.2 A* Algorithm

The A* algorithm (pronounced 'A star') was released in 1968 by Hart et al. and extends the Dijkstra algorithm by using a heuristic [43]. In general, a heuristic is a trick or a simplification. In reference to the routing algorithm, it means steering the search into the right area. Thereby, the algorithm changes from a blind search to an informed search. The type of the heuristic should be specific to the individual search problem. Overall, there exist three approaches for the heuristic [58]:

Best node first search: For this case the heuristic determines on which node the search tree should be extended. It shows which node should be used next to achieve the best success instead of a strict BFS or DFS search.

Heuristic selection of links: The heuristic determines which outgoing link from a node should be chosen. This is only possible if there exists more than one link to nodes which were not explored yet.

Pruning: The third type of heuristic can cut parts of the graph. If a subgraph does not contain the target node it can be deleted to reduce the search space.

For this algorithm, the best node first search heuristic is applied. It is realised by using a new cost function

$$ f(X) = g(X) + h(X) \qquad (2.4) $$

which extends the g function from equation 2.1. The newly added heuristic function h estimates the distance to the target node, as shown in figure 2.4.


Figure 2.4: Graph with two subgraphs. The first one (A-B-C-X) was already passed, the second one (X-E-F-G) will be estimated through the new cost function including the heuristic.

The correct distance of the path P(X,G) cannot be calculated because the exact graph to the target node is unknown at this point in time. How the h function should be defined depends on the problem. Up to this point, the described algorithm is called the A algorithm because a bad heuristic could produce an unsatisfactory result. Only if we add the definition that the h function estimates the optimal distance between the nodes X and G is the algorithm specified as A*. In a few cases, and especially if the nodes represent points on the earth's surface, the Euclidean distance of equation 2.2 (comparable to the air line distance) is a good estimator for the h function. If the h function is set to zero, the A* algorithm shows similar results to the Dijkstra algorithm. A city with heavy traffic can also serve as an example. In this scenario, the h function could include a complex traffic model with real time data to avoid traffic jams. Figure 2.5 shows the difference between the Dijkstra (left) and the A* (right) algorithm. This representation uses a constant grid as graph. In the initial state, all nodes are grey and stored in the unknown list. The green node is the start point and the yellow node the target one. The nodes which are already analysed are saved in the closed list and illustrated as blue circles in the figure. The number inside a circle gives information about the distance to the start node. The blue nodes without a number belong to the open list. The left figure shows the Dijkstra algorithm after 24 iterations (24 nodes in the closed list); it has not found the target node yet. The A* algorithm on the right side runs seven iterations and finds the target in the 8th run. With the help of the heuristic, the A* algorithm finds the target with fewer iterations. Accordingly, the A* algorithm has to analyse fewer nodes than the Dijkstra algorithm and is faster. With regard to more complex scenarios and limited computer performance, this is a big advantage of the A* algorithm.


Figure 2.5: This figure shows the difference between the Dijkstra (A) and the A* (B) algorithm. The way is explored from the start node (green) over the analysed nodes (blue) to the target node (yellow). With the help of the heuristic, the A* algorithm finds the target with fewer iterations.
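Only one change is needed to turn the Dijkstra sketch from section 2.1.1 into A*: the open list is ordered by f(X) = g(X) + h(X) instead of g(X) alone. The following sketch assumes that every node has a coordinate (a hypothetical coords dictionary) so that the Euclidean distance of equation 2.2 can serve as the heuristic h; with h = 0 it falls back to the Dijkstra behaviour.

```python
import heapq
import math

def euclidean(p, q):
    """Euclidean distance between two coordinate tuples (equation 2.2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def a_star(graph, coords, start, target):
    """Like the Dijkstra sketch, but the open list is sorted by f = g + h."""
    open_list = [(euclidean(coords[start], coords[target]), 0.0, start, [start])]
    closed = set()

    while open_list:
        f, g, node, path = heapq.heappop(open_list)
        if node in closed:
            continue
        if node == target:
            return g, path
        closed.add(node)
        for neighbour, v in graph[node].items():
            if neighbour not in closed:
                g_new = g + v                                         # exact cost so far
                h_new = euclidean(coords[neighbour], coords[target])  # estimated rest
                heapq.heappush(open_list, (g_new + h_new, g_new, neighbour, path + [neighbour]))

    return float("inf"), []
```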

11 2.1.3 Ant Colony Optimisation Algorithm

The Ant Colony Optimization algorithm was published by Dorigo and Di Caro in the year 1999 [26]. It belongs to the field of swarm intelligence and is derived from social characteristics of animals in nature. Especially swarms of birds, bees, fishes or ants feature dynamic interaction mechanisms [54]. Kroll further describes that a global goal can be reached without a central instance which controls each individual. This natural phenomenon of ants is used as a paradigm for the ACO algorithm and is specified as follows.

The ants swarm out in high numbers from the anthill to search for food. This situation can be transferred to the classic path finding problem as discussed at the beginning of section 2.1. At the beginning, the ants choose a random path on the graph. To communicate and coordinate with each other, ants secrete pheromones all along their way. These pheromones evaporate slowly. Usually the ants prefer the way with a higher concentration of pheromones, but there still exists a small probability of not choosing the way with the highest concentration. In dynamic environments, this is helpful if a shorter way comes into existence after some time. The probability p includes an α (weight of the pheromone concentration) and a β (route attractiveness) value, which give the choice between fast convergence and good exploration of the search space. The highest frequency of ants can be found on the shortest path because they can return quickly and restart again. In this case, the pheromone concentration on the ideal path is intensified again (self-reinforcing effect). After some time, most ants choose the same way and the ideal path is found. Figure 2.6 shows three steps of the development of this algorithm. The basis is a graph with the start node A (anthill, coloured green) and the target node H (food, coloured yellow). The ants (black dots) spread out the pheromones (coloured light pink). In the first step A, the ants walk randomly. Step B shows a light pheromone concentration on the shortest path A-C-F-H. In the last step C, the concentration of pheromones is very high on the shortest path.



Figure 2.6: Development of the Ant Colony Optimisation algorithm. A: initial random walk, B: light pheromone concentration on the shortest path, C: high pheromone concentration on the shortest path.
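The core of the ACO algorithm is the probabilistic edge choice, which in the standard textbook form is proportional to τ^α · η^β, where τ is the pheromone level on an edge and η its attractiveness (for example the inverse edge length). The following minimal sketch shows this selection rule together with evaporation and pheromone deposit; the helper names, parameter values and the exact update form are illustrative assumptions and not necessarily the variant used by Dorigo and Di Caro.

```python
import random

def choose_next(node, graph, pheromone, visited, alpha=1.0, beta=2.0):
    """Pick the next node with probability proportional to tau^alpha * eta^beta."""
    candidates = [n for n in graph[node] if n not in visited]
    if not candidates:
        return None
    weights = [
        pheromone[(node, n)] ** alpha * (1.0 / graph[node][n]) ** beta
        for n in candidates
    ]
    return random.choices(candidates, weights=weights, k=1)[0]

def update_pheromone(pheromone, path, length, rho=0.1, q=1.0):
    """Evaporate all trails slowly, then deposit pheromone along the found path."""
    for edge in pheromone:                   # pheromone: dict of (node, node) -> level,
        pheromone[edge] *= (1.0 - rho)       # initialised with a small positive value
    for a, b in zip(path, path[1:]):
        pheromone[(a, b)] += q / length      # shorter paths receive more pheromone
        pheromone[(b, a)] += q / length
```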

12 2.2 Artificial Intelligence

This section gives a summary of artificial intelligence. The first part includes the history, the present situation, the definition and the methods of AI. Reinforcement learning will be explained in detail in section 2.2.1. Afterwards, section 2.2.2 contains an introduction to neural networks. Evolutionary algorithms in section 2.2.3 complete this chapter. The history of artificial intelligence goes back to the year 1947, when the English mathematician Alan Turing gave a lecture on that topic [63]. Turing also determined that AI is not about building machines but is researched by programming computers. The Turing test (named after Alan Turing) was the first concept to define a machine as intelligent. The test consisted of a chat by teletype between a person and a knowledge based machine. If the machine can pretend to be a human, it would certainly be intelligent. The phrase artificial intelligence was first used by the computer scientist John McCarthy in 1955 [62]. He and his colleagues used this term in the proposal for the Dartmouth Summer Research Project on artificial intelligence. This research project marks the beginning of academic research in the field of AI. Today, AI is one of the most important research fields in computer science. Famous universities like the Massachusetts Institute of Technology [60] or Stanford University [82] maintain their own artificial intelligence laboratories. The Association for the Advancement of Artificial Intelligence (AAAI) was founded in the year 1979 [1]. It will host the 32nd AAAI Conference on Artificial Intelligence in February 2018. Additionally, the leading technology companies like Amazon, Apple, DeepMind, Google, Facebook, IBM and Microsoft [75] cooperate in a consortium called Partnership on AI. Some examples of current AI technologies are listed in section 1.1. To clarify the phrase 'artificial intelligence', it is good to know that there exists no universally accepted definition. McCarthy described it with the sentence "It is the science and engineering of making intelligent machines, especially intelligent computer programs" [62]. As McCarthy already mentioned, AI research is mostly placed in the field of computer science. But there are also interdisciplinary relationships to Mathematics, Philosophy, Psychology, (Computational) Linguistics, Biology and Engineering Sciences [16]. Another definition from a philosophical point of view differentiates between the so-called weak AI and the strong AI [17]. The weak AI attends to a special problem like speech recognition or navigation systems and is also described as only the simulation of intelligence [17], [23]. On the other side, the strong AI could be described as really intelligent as it is not specialised in one field but has general intelligence like humans. It should own competences like logical thinking, decision making, learning, communication in natural language, and use all these skills to solve a superior goal. In general, every AI pursues a target it wants to achieve. Therefore, it can be imagined as an agent which is coupled with perception, reasoning and acting and acts in an environment [4]. Figure 2.7 shows an agent interacting with an environment. The first agent input is the observation of the environment.


Figure 2.7: The agent interacts with the environment [4]. It receives observations and holds past experiences. With the prior knowledge, goals and abilities, it will decide to take some action back to the environment.

The past experiences of observations and actions which have already been taken can be used as a second input. The prior knowledge means some general information about the environment. The next inputs are the goals it should achieve. The abilities are the last input, where the agent has the choice between primitive actions it can execute. Based on this information, the agent has to decide which action will be applied to the environment. This is a general view of an intelligent agent, which can feature any level of complexity.

Different AI methods are presented in the following subsection. The thought of the intelligent agent raises the question of which methods it needs to solve the given problem. Burgard lists some of these methods: solving and searching problems, knowledge representation and processing, action planning, machine learning, handling uncertain knowledge and neural networks [16]. He also mentions that many other methods exist in different interdisciplinary fields today. This work will concentrate on the two main methods of machine learning and neural networks. Neural networks will be discussed in a separate section 2.2.2. Murphy describes machine learning as "a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data" [68]. This set of methods can be divided into three types of learning, which Abu-Mostafa et al. describe as [2]:

• Supervised Learning: This method is applied if the input data including the corresponding output is available. The input is called 'training data' and the known output 'labels'. A good example of input data could be a set of images with hand written digits (random digits from 0 - 9) together with the information associating the right digit to each image. The algorithm is able to learn the assignment of inputs to outputs and, after the training, to use its knowledge to analyse a new picture and calculate a probability for each digit. The highest probability should match the right digit. This process is called classification and is reliable if enough training data is available. This particular example is a well-known problem and therefore a big data set called MNIST exists. It includes 60,000 training pictures with corresponding labels [97]. This package can be downloaded and used to develop one's own digit recognition algorithm (a short illustrative sketch follows at the end of this overview).

• Unsupervised Learning is the setting where no information about the output is available. In this case only the plain data but no labels exist. The goal is to find patterns or structures in the input data. Abu-Mostafa specifies it as "a way to create higher-level representation of the data" [2]. E-commerce is a good example of unsupervised learning [68]. The idea is to cluster users into different groups depending on their purchasing behaviour or web-surfing characteristics. Individual advertisements can be sent to each group of customers to raise the probability of buying a product. For this type of learning only the input data is needed.

• Reinforcement Learning also has no information about the output but can grade the actions which are made. Better actions will be rated higher and the algorithm is able to perform the same actions in similar situations again. This method is often used to learn to play games like backgammon. For every turn, different possible actions have to be chosen and the situations differ from each other. In the case of a good action (like bringing a stone to the target or hitting a stone of the other player) the grade/reward will be higher. For the training, different samples are needed to learn the best strategy for the game. This method is used in this work to find a good solution for a navigation task. It is explained in more detail in the next section 2.2.1.

There exist some more views on learning. Abu-Mostafa mentions data mining as a field of its own with the focus on finding patterns, correlations or anomalies in large databases [2]. The method is called semi-supervised learning if the labels exist only for a subset of the input data [15]. This section represents an overview of the main learning styles. Three of them will be explained in the next sections.
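As a small, hedged illustration of the supervised setting described in the first item of this list, the following sketch trains a simple classifier on scikit-learn's built-in digits data set, which serves here only as a compact stand-in for the MNIST set mentioned above; it is not part of the thesis implementation.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labelled input data: 8x8 images of handwritten digits plus the correct digit.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# The classifier learns the assignment of inputs (images) to outputs (labels) ...
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ... and is afterwards evaluated on pictures it has never seen.
print("accuracy on unseen digits:", classifier.score(X_test, y_test))
```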

2.2.1 Reinforcement Learning

Sutton and Barto describe the basic idea as follows: "Reinforcement learning is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal" [87]. They mention that the learner has to decide on its own which action to take. By trying different decisions, it can discover which action scores the highest reward. In more challenging cases, an action does not affect the reward immediately but in subsequent situations. This yields the two most important features of reinforcement learning, "trial-and-error search and delayed reward", according to Sutton and Barto [87]. Figure 2.8 represents the reinforcement learning framework. It is similar to the basic representation in figure 2.7. The agent interacts with the environment in each time step t = 0, 1, 2, 3, .... At each step the agent receives a state s_t ∈ S, with S as the set of all possible states, and selects an action a_t ∈ A(s_t), where A(s_t) is the set of possible actions in the particular state s_t. A state can, for example, be interpreted as the current situation on a playing field. One time step later, the agent receives a numerical reward r_{t+1} ∈ R, where R is the reward function, as the consequence of the action it took. Caused by the action, the state is also forwarded to s_{t+1}.


Figure 2.8: Reinforcement Learning framework [87].

The mapping from a state to the probabilities of each possible action at each time step is called the policy π_t. The probability π_t(s, a) is the probability that a_t = a if s_t = s. The agent has to change its policy depending on its experience to reach the goal of maximising the total sum of rewards over the long run [87]. Q-Learning is the name of a specific algorithm based on this concept. Sutton and Barto mention: "One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning" [87]. This algorithm was first introduced by Watkins in 1989 [91]. Off-policy means that the algorithm learns the optimal policy without knowing which policy it is following [5]. The Q-function is defined as

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (2.5) $$

with α ∈ [0, 1] as learning rate and γ ∈ [0, 1] as discount factor [87], [28]. Setting the learning rate α to zero does not update the Q-function and nothing can be learned. Heidrich-Meisner et al. mention that the learning rate decreases over time and can be interpreted as stochastic gradient descent [45]. Learning rises faster if the value is around 0.9. It is similar with the discount factor, which controls how strongly new rewards are weighted [28]. The Q-function Q(s_t, a_t) can be interpreted similarly to the general policy π_t(s, a) from the section before and is generally called the action-value function. A simplified representation is the Bellman equation

$$ Q_{i+1}(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right] \qquad (2.6) $$

as an iterative update [67] (E denotes the expectation). For i → ∞ the algorithm converges to the optimal value function Q_i → Q* [87]. Equation 2.6 is used in the following example from McCullock, which shows how the process of the Q-Learning algorithm operates [64]. It describes an agent which wants to find the shortest path in an unknown environment. Figure 2.9 shows five rooms (numbers 0-4) connected through doors. The area outside (number 5) can be accessed through room 1 and room 4. Otherwise, room 5 has no special features. In general, there exist six different locations. These rooms can be represented by a graph similar to the explanation in chapter 2.1. In this example, a room is represented by a node and a door by a link. As a target room has to be defined, room number 5 is selected. In this case, all connections to this room get a reward of 100. All other connections have the reward 0. This graph is shown in figure 2.10 A.

16 5 0 1 2

3 4

Figure 2.9: Map of five rooms connected through doors [64].


Figure 2.10: Graph representation of the rooms with rewards for each link. A (initial state): only the links to the target node 5 contain a reward; B (final state): all links have a reward. The shortest path from node 2 to 5 can be read out as 2-3-1-5 or 2-3-4-5, which is coloured red.

Each room can be defined as a state and the links between them as actions. In this model, the actions are limited because the agent can only choose actions that are connected to its current state. For example, if the agent is in state 1, it can only choose actions to move to state 3 or 5. The initial graph diagram can be stored together with the reward values in a reward table, called the R-matrix, shown in figure 2.11.

R-matrix (rows = states 0-5, columns = actions 0-5):

state 0:  -1   -1   -1   -1    0   -1
state 1:  -1   -1   -1    0   -1  100
state 2:  -1   -1   -1    0   -1   -1
state 3:  -1    0    0   -1    0   -1
state 4:   0   -1   -1    0   -1  100
state 5:  -1    0   -1   -1    0  100

Figure 2.11: The R-matrix includes the initial rewards for all links. The rows represent states and the columns represent actions.

An entry of -1 means there is no connection. All other numbers indicate that there is a connection, and the number shows the reward. Additionally, there exists a matrix Q, shown in figure 2.12. It can be understood as a memory where all learned experiences of the agent are saved. The Q-matrix is initialised with zeros at the beginning. The learning rule is similar to equation 2.6:

$$ Q(\mathrm{state}, \mathrm{action}) = R(\mathrm{state}, \mathrm{action}) + \gamma \cdot \max\left[ Q(\mathrm{next\ state}, \mathrm{all\ actions}) \right] \qquad (2.7) $$

with, for example, γ = 0.8. Then the iteration starts over the episodes by selecting a random initial state. The following process runs until the state reaches the goal:

1. Select one possible random action for the current state
2. Go to the next state according to the action
3. Compute Q as shown in equation 2.7, using the maximum Q value over all possible actions
4. Set the next state as the current state

This algorithm runs until the target state is reached and can be applied to the example. It will be explained step by step:

Episode 1: Select random state 1 and analyse in row 1 of the R-matrix which actions are possible (state 3 or 5) and choose one of them randomly; go to state 5 and calculate Q: Q(1, 5) = R(1, 5) + 0.8 * max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100 and set Q(1, 5) = 100 (compare figure 2.12 A).

Episode 2: Select random state 3 and analyse row 3 of the R-matrix which actions are possible (state 1, 2 or 4) and choose one of them randomly; go to state 1 and calculate Q: Q(3, 1) = R(3, 1) + 0.8 * max[Q(1, 3), Q(1, 5)] = 0 + 0.8 * max(0, 100) = 80 (compare figure 2.12 B). The goal is not reached yet, so analyse row 1 of the R-matrix which actions are possible (state 3 or 5) and choose one of them randomly; go to state 5 and calculate Q: Q(1, 5) = R(1, 5) + 0.8 * max[Q(5, 1), Q(5, 4), Q(5, 5)] = 100 + 0.8 * 0 = 100. This value is already set and the Q-matrix does not change. The goal is reached and the process can be stopped.

After some more episodes the Q-matrix gets close to convergence. The stop conditions can be defined differently, which will be discussed in chapter 3.2. The final Q-matrix is shown in figure 2.12 C. These values can be normalised (divided by the highest number) and visualised in the graph structure of figure 2.10 B.

Q-matrix A (after the first episode): only Q(1, 5) = 100, all other entries are 0.
Q-matrix B (after the second episode): Q(1, 5) = 100 and Q(3, 1) = 80, all other entries are 0.
Q-matrix C (final state, rows = states 0-5, columns = actions 0-5):

state 0:    0    0    0    0  400    0
state 1:    0    0    0  320    0  500
state 2:    0    0    0  320    0    0
state 3:    0  400  256    0  400    0
state 4:  320    0    0  320    0  500
state 5:    0  400    0    0  400  500

Figure 2.12: Q-matrix as memory of the algorithm. A: after the first episode, B: after the second episode, C: final state after some more episodes.

After the training task, the algorithm can be tested. The problem of the shortest way from each position to the target can now be solved with the learned Q-matrix. For example, starting in state 2, the agent chooses the action to state 3 (Q(2, 3)), then to state 1 (Q(3, 1)) or alternatively to state 4 (Q(3, 4)), and finally to state

5 (Q(1, 5) or Q(4, 5)). This path is coloured red in figure 2.10 B. With this example, the section on reinforcement learning ends.
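The room example can be reproduced in a few lines of Python. The sketch below applies the update rule of equation 2.7 to the R-matrix of figure 2.11; it is a minimal illustration assuming γ = 0.8, purely random action selection and a fixed number of episodes, not the implementation developed later in this work.

```python
import random

GAMMA = 0.8   # discount factor, as in the example
GOAL = 5      # target room

# R-matrix of figure 2.11: rows are states, columns are actions, -1 = no door.
R = [
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
]

# Q-matrix, the memory of the agent, initialised with zeros.
Q = [[0.0] * 6 for _ in range(6)]

for episode in range(1000):
    state = random.randrange(6)                       # random initial state
    while state != GOAL:
        actions = [a for a in range(6) if R[state][a] != -1]
        action = random.choice(actions)               # 1. select a possible action
        next_actions = [a for a in range(6) if R[action][a] != -1]
        # 2./3. go to the next state and apply the update rule of equation 2.7
        Q[state][action] = R[state][action] + GAMMA * max(Q[action][a] for a in next_actions)
        state = action                                # 4. the next state becomes current

# Greedy read-out of the learned policy, starting for example in room 2:
state, path = 2, [2]
while state != GOAL:
    valid = [a for a in range(6) if R[state][a] != -1]
    state = max(valid, key=lambda a: Q[state][a])
    path.append(state)
print(path)   # typically [2, 3, 1, 5], one of the two shortest paths of figure 2.10 B
```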

2.2.2 Neural Network

A neural network, also called an Artificial Neural Network (ANN), is a computational model that is inspired by the central nervous system of humans [11]. The history of NNs started in the year 1943, when W. Pitts and W. S. McCulloch introduced a model of neurological networks. Based on this idea, Frank Rosenblatt described different types of perceptrons in 1957 [53]. The basic element of an ANN is an artificial neuron, which is also modelled after the paradigm of a neuron in a human brain. As described by Berger, figure 2.13 shows the similarity between the biological and the artificial neuron [11].


Figure 2.13: Figure A shows a biological neuron, figure B shows an artificial neuron [11].

The biological neuron in figure 2.13 A has dendrites to receive signals, a cell body with a nucleus to process them, and an axon including axon terminals to forward signals to other neurons. The artificial neuron in figure 2.13 B has n input channels, a processing unit and an output channel, from which it can forward the signal to other neurons. Inside the neuron the processing unit can be divided into three parts [11]:

1. Weigh: Each input is multiplied with a weight which belongs to the specified input channel. If the neuron has three inputs, it also has three weight values. The weights are adjusted during the learning process based on the error of the last run.
2. Sum up: All weighted input values are summed up together with an offset called bias, which is also adjusted during the learning process.
3. Activation: The resulting single value is passed into an activation function where the final output signal is calculated. One example activation function is the binary function, which outputs only ‘1’ or ‘0’. More activation functions are shown in figure 2.16.

An artificial neuron is also called a perceptron. An ANN contains several perceptrons which are connected in different layers as shown in figure 2.14. This ANN contains two neurons in the input layer (coloured green), three neurons in the hidden layer (coloured blue) and two neurons in the output layer (coloured yellow).


Figure 2.14: Schematic representation of an ANN with two neurons in the input layer (coloured green), three neurons in the hidden layer (coloured blue) and two neurons in the output layer (coloured yellow) [11].

The input neurons receive signals and forward them to the hidden neurons. In this example only one hidden layer exists, but an ANN can in general include more hidden layers. The output neurons forward the signal back to the outside world. If an ANN has at least one hidden layer it is called a Multilayer Perceptron (MLP). If the network is a large, deep neural net with several hidden layers, this sub field is called Deep Learning [59]. There also exists another type of neuron called the bias unit or on-neuron [77]. These neurons have no input and output the value ‘+1’ in every step to the connected neurons in the hidden layer. In this way, the connected neurons can still be activated even if there is no other input.
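As a small illustration of the weigh, sum up and activate steps described above, the following C++ sketch computes the output of a single artificial neuron; the input values, weights and bias are made-up examples.

```cpp
// Minimal sketch of one artificial neuron (weigh, sum up, activate) as in figure 2.13 B.
#include <vector>
#include <cmath>
#include <iostream>

double neuron(const std::vector<double>& in, const std::vector<double>& w, double bias) {
    double sum = bias;                            // 2. sum up (including the bias)
    for (std::size_t i = 0; i < in.size(); ++i)
        sum += in[i] * w[i];                      // 1. weigh each input
    return 1.0 / (1.0 + std::exp(-sum));          // 3. activate (logistic function)
}

int main() {
    std::vector<double> inputs  = {0.5, -1.0, 2.0};   // example values only
    std::vector<double> weights = {0.8,  0.2, -0.4};
    std::cout << neuron(inputs, weights, 0.1) << "\n";
}
```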

There are many different types of neural networks, some of which are briefly mentioned in this section [40]. A convolutional neural network (CNN) is characterised by using convolutional and pooling operations and is optimised to analyse data with grid-like topologies, for example images. Recurrent neural networks (RNN) are specialised for processing sequences of data and their architecture includes recurrent connections between perceptrons. The concept of Long Short-Term Memory (LSTM) is based on the RNN and includes self-loops. This is a newer model, published by Hochreiter and Schmidhuber in 1997 [48], which “has been found extremely successful in many applications” as Goodfellow et al. describe [40]. These types of NNs are not explained in more detail because they are not the focus of this work.

The process of passing data through a NN is called forward propagation. The important process of adjusting the weights is called back-propagation or the backprop algorithm [40]. There also exist other learning rules like the Hebb rule or the delta rule, but they can only be applied to NNs without hidden layers [77]. Caused by this restriction they cannot be used for complex networks, and the backprop algorithm is more powerful. At the beginning all perceptrons are initialised with random weights and one training sample is propagated forward through the network. After the first run, the output of the output units is compared with the expected value and the network error is calculated. One established possibility of calculating this error is to compute the mean squared error (MSE) [40] with

$Err_{mse} = \frac{1}{m}\sum_{i=1}^{m}(output_i - output_{expected})^2$.  (2.8)

As an alternative, the root mean square error (RMS) can be used, which was mentioned by Kriesel as

$Err_{rms} = \sqrt{Err_{mse}}$  (2.9)

or the already defined Euclidean distance from equation 2.2 [53]. This error is propagated backwards through the network to shift the weights of each perceptron a little bit closer to the optimal value. This process will be explained in more detail. The errors of the weights in the perceptrons can be defined as a global error function. The main goal is to minimise this error function and find its global minimum. Goodfellow et al. note that back-propagation is only the procedure of computing the gradient [40] of the loss function. The real learning process is done by optimisation algorithms such as stochastic gradient descent (SGD). In general, this method can be used to find the minimum or maximum of an n-dimensional function. A vector g called the gradient can be placed everywhere on the function and points in the direction of the steepest ascent. It can be calculated as the derivative of the n-dimensional function and is defined through the nabla operator (∇) as

$g(x_1, x_2, ..., x_n) = \nabla f(x_1, x_2, ..., x_n)$.  (2.10)

If the goal is to find the minimum, the negative gradient is used to follow the steepest descent [53]. The norm of the gradient |g| is proportional to the step size. This means that the steps are longer where the descent is steeper. Figure 2.15 A shows the 3D environment and figure 2.15 B shows the way of the SGD to the minimum of the function in 2D. Where the descent gets flatter, the step size also gets shorter.


Figure 2.15: Stochastic gradient descent with a two-dimensional error function [53]. Figure A shows the 3D environment. Figure B shows the way to the minimum with shrinking step size in 2D.
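The following minimal C++ sketch illustrates the descent along the negative gradient for a simple made-up error function f(x, y) = x² + y², which stands in for the error surface of figure 2.15; it is not the optimiser used later in this work.

```cpp
// Minimal sketch of gradient descent on f(x, y) = x^2 + y^2.
#include <cstdio>

int main() {
    double x = 3.0, y = -2.0;       // arbitrary start position
    const double eta = 0.1;         // learning rate
    for (int step = 0; step < 50; ++step) {
        // gradient of f: g = (df/dx, df/dy) = (2x, 2y), see equation 2.10
        double gx = 2.0 * x, gy = 2.0 * y;
        // move against the gradient (steepest descent); the step shrinks
        // automatically because the gradient gets smaller near the minimum
        x -= eta * gx;
        y -= eta * gy;
    }
    std::printf("minimum found near (%f, %f)\n", x, y);
}
```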

Nevertheless, the SGD procedure is not free of problems. Possible issues are getting stuck in bad local minima, standing still in flat areas, oscillating directly or indirectly in canyons, or

missing narrow local minima by jumping over them [53]. Rey and Wender note some solutions for these problems [77]. The first suggestion is to vary the initial weights of the perceptrons. This yields a new start position for the SGD. It is also possible to change the mean variation of the weights. Another suggestion is to vary the learning rate of the network. This makes the step size longer (by increasing the learning rate) or shorter (by decreasing the learning rate), which can prevent some of the named problems. The learning rate η controls the speed and accuracy of the learning process and should be in the range 0.01 ≤ η ≤ 0.9 [53]. If the learning rate is very small, the correction of the weights through the backprop will be low. Alternatively, there exist other optimisation algorithms like momentum, adagrad, rmsprop or adam, but they will not be explained in detail [39]. As already mentioned, there exist different activation functions, also called transfer functions, for the neurons. After summing up all network inputs, the neuron chooses an activation level on the basis of the activation function. Typically, all neurons of one layer, sometimes even all neurons of the whole network, use the same function [77]. Figure 2.16 gives an overview of the most common activation functions, which are explained in the following section [76], [50].


Figure 2.16: Overview of the most common activation functions: identity, binary step, ReLU, logistic and TanH [76], [50]. The y-axis represents the activation level and the x-axis the network output.

The first function is called linear or identity function and is defined as

$f(x) = x$  (2.11)

and visualised in figure 2.16 A. The next function is called binary step, unit step or Heaviside and is defined as

$f(x) = \begin{cases} -1 & \text{for } x < 0 \\ \phantom{-}0 & \text{for } x = 0 \\ \phantom{-}1 & \text{for } x > 0 \end{cases}$  (2.12)

and shown in figure 2.16 B. The rectified linear unit (ReLU), also called ramp function, is defined as

$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \geq 0 \end{cases}$  (2.13)

and shown in figure 2.16 C. The logistic or soft step function (a kind of sigmoid function) is defined as

$f(x) = \frac{1}{1 + e^{-x}}$  (2.14)

and displayed in figure 2.16 D. Figure 2.16 E shows the hyperbolic tangent (TanH, a kind of sigmoid function), which is defined as

$f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$.  (2.15)
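For illustration, the five activation functions of figure 2.16 can be written directly as short C++ functions following equations 2.11 to 2.15:

```cpp
// The five activation functions of figure 2.16 (equations 2.11 - 2.15).
#include <cmath>

double identity(double x) { return x; }                            // eq. 2.11
double binary_step(double x) {                                     // eq. 2.12
    if (x < 0.0) return -1.0;
    if (x > 0.0) return  1.0;
    return 0.0;
}
double relu(double x)     { return x < 0.0 ? 0.0 : x; }            // eq. 2.13
double logistic(double x) { return 1.0 / (1.0 + std::exp(-x)); }   // eq. 2.14
double tanh_act(double x) { return std::tanh(x); }                 // eq. 2.15
```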

The central challenge for a neural network is to perform well on new, unseen input patterns and not only on the inputs which were used for the training [40]. This property is called generalisation. Moreover, it is important to know that there exist three different learning phases [77]. The training phase takes some grids as learning material and trains on them by changing the weights inside the network. In the test phase, the weights are not changed anymore. The network gets some grids as input and it can be tested whether the output is correct. The success of the learning process can be observed during the training phase. In the test phase some unused grids are forward propagated through the network. The result can be used to check whether the network generalises, which is very important for a good performance. In the application phase the network is used for a concrete use case, but it can still be trained simultaneously. Similar to the training and test phase, a training and a test error can be calculated [40]. Goodfellow et al. further describe that a machine learning algorithm performs well if the training error and the gap between training and test error are small. These two facts can be assigned to the central challenges of underfitting and overfitting. In which direction a model tends can be measured by its capacity, which depends on the hypothesis space. This means that the algorithm performs best if the capacity is appropriate for the complexity of the problem [40]. These properties are visualised in figure 2.17.


Figure 2.17: The red curves represent three different functions which are trained with training data consisting of the five blue points. The function in figure A is a linear function and shows the underfitting effect. Figure B includes a quadratic function which fits the points very well, and figure C shows the overfitting effect [40].

The figure shows three models (red curves) trained on the same training set (blue points). The models are based on different functions (A: linear, B: quadratic, C: polynomial of degree 9) and were trained with the set of points. The points were generated randomly by sampling x values and calculating y values with a quadratic function. Figure 2.17 A shows a linear function which underfits because it does not pass through the points. Figure 2.17 B represents a quadratic function which fits well and generalises to all points on the graph. Additionally, the capacity of this function is appropriate to the original one. Figure 2.17 C demonstrates the overfitting effect. The function is a polynomial that passes through the points but cannot extract the structure of the original function. Moreover, it has a too high capacity and it is possible to find many other functions of degree nine that pass through the points, so the chance of choosing the right one is very small. The XOR function example finally shows the process and functionality of a NN [57], [66]. The XOR or ‘exclusive or’ function gets an input of two binary values and returns one binary output. The result is 1 if exactly one of the inputs has the value 1, not both. With this rule there exist four possible combinations, as shown in table 2.1.

Input 1   Input 2   Output
0         0         0
0         1         1
1         0         1
1         1         0

Table 2.1: All possible combinations of input and output of the XOR function.

These functionalities should be learned by the NN. The network is defined as an MLP similar to the example in figure 2.14, but with only one output neuron. The process can be divided into different tasks. The first step is to initialise all weights with random values. Then the iterative loop starts by randomly choosing a data set (containing input and output) and starting the forward propagation process. The difference between the result and the target output from the data set is known as the network error. In the last step, this error is back-propagated through the network to optimise the weights of each connection. This process is repeated until the network error approaches a minimum and consequently a good approximation of the XOR function is found. Lämmel and Cleve calculated 1,000 steps for the same problem and compared the results. Their network architecture is similar to the configuration in this example and shows the same characteristic after enough learning steps. These results are shown in table 2.2. They can be interpreted as follows. After 1,000 iterations a good approximation is already found. Compared to the four target outputs in table 2.1 there exists a trend in the right direction, but a residual error is still visible. That is no problem for this example as the final result should be a binary value. Therefore, it is necessary to define a threshold like ‘0.5’ to decide whether the output belongs more to ‘0’ or to ‘1’. By applying this threshold the network already returns correct results after 600 iterations, but a higher distance to the threshold gives more safety and stability, like the result after 1,000 iterations.

           Output after
Input      600      700      800      900      1,000 iterations
0 0        0.37     0.27     0.21     0.17     0.14
0 1        0.54     0.62     0.70     0.75     0.79
1 0        0.55     0.63     0.71     0.76     0.79
1 1        0.45     0.38     0.31     0.26     0.22

Table 2.2: Results after learning the XOR function with back-propagation after different numbers of iterations [57].

A detailed calculation of a forward and back-propagation process is given in appendix A. It uses the same XOR example and contains all needed equations. If a neural network is to be used for a problem statement, it is in most cases helpful to resort to a neural network framework (listed in section 3.3). Methods like forward and back-propagation are already included and can be used easily. In this case, it is not mandatory to know the detailed back-propagation procedure. But to understand the fundamental functioning of neural networks it is necessary to comprehend the back-propagation method.
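As an illustration of the whole training loop, the following self-contained C++ sketch trains a small 2-2-1 MLP on the XOR patterns with plain back-propagation and the logistic activation function. It is only a hand-written example and not the tiny-dnn based implementation used in chapter 3; depending on the random initialisation it may occasionally get stuck in a local minimum.

```cpp
// Illustrative 2-2-1 MLP that learns XOR with plain back-propagation.
#include <cmath>
#include <cstdio>
#include <cstdlib>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }
double rnd() { return (std::rand() / (double)RAND_MAX) * 2.0 - 1.0; }   // [-1, 1]

int main() {
    const double in[4][2]  = {{0,0},{0,1},{1,0},{1,1}};
    const double target[4] = { 0,    1,    1,    0  };
    double wh[2][2], bh[2], wo[2], bo = rnd();          // random initial weights
    for (int j = 0; j < 2; ++j) { bh[j] = rnd(); wo[j] = rnd();
        for (int i = 0; i < 2; ++i) wh[j][i] = rnd(); }

    const double eta = 0.5;                             // learning rate
    for (int it = 0; it < 20000; ++it) {
        int p = std::rand() % 4;                        // random training pattern
        // forward propagation
        double h[2];
        for (int j = 0; j < 2; ++j)
            h[j] = sigmoid(wh[j][0]*in[p][0] + wh[j][1]*in[p][1] + bh[j]);
        double out = sigmoid(wo[0]*h[0] + wo[1]*h[1] + bo);
        // back-propagation: delta terms of the output and hidden layer
        double d_out = (out - target[p]) * out * (1.0 - out);
        for (int j = 0; j < 2; ++j) {
            double d_h = d_out * wo[j] * h[j] * (1.0 - h[j]);
            wo[j] -= eta * d_out * h[j];
            for (int i = 0; i < 2; ++i) wh[j][i] -= eta * d_h * in[p][i];
            bh[j] -= eta * d_h;
        }
        bo -= eta * d_out;
    }
    for (int p = 0; p < 4; ++p) {                       // test the trained network
        double h0 = sigmoid(wh[0][0]*in[p][0] + wh[0][1]*in[p][1] + bh[0]);
        double h1 = sigmoid(wh[1][0]*in[p][0] + wh[1][1]*in[p][1] + bh[1]);
        std::printf("%g XOR %g -> %.2f\n", in[p][0], in[p][1],
                    sigmoid(wo[0]*h0 + wo[1]*h1 + bo));
    }
}
```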

2.2.3 Evolutionary Algorithm

Evolutionary algorithms (EA) use the biological evolution of organisms as a paradigm. This concept goes back to the theory of evolution by Charles Darwin, which was published in the book “On the Origin of Species” in 1859 [93]. Weicker further explains from the technical point of view that these algorithms simulate an artificial evolution process to find a good approximation for an optimisation problem. EA represents the field of algorithms, and evolutionary computation (EC) is used as a term for the complete area of research. Historically, EA can be separated into the fields of genetic algorithms (GA), evolution strategies (ES), evolutionary programming (EP) and genetic programming (GP). These methods differ in their properties and in the usage of the operators. Nowadays the term genetic algorithm has become popular and is sometimes used for the whole field of methods [57], [78] and [77]. Moreover, a modern algorithm can use different operators and should be designed especially for the problem statement [93]. The components of an EA are described in the following section and refer to the biological paradigm of an organism [93], [72]. A genome is a complete set of genetic material. It contains several chromosomes which consist of genes from a big gene pool. Genotype is the name of a specific set of genes. The phenotype is the characteristic after the development of a genome; in this distinct state, it is also called an individual. A group of individuals is called a population. Different genetic operators are another important functionality of the EA. These operations are used to generate new individuals and to evolve a population to the next generation [57]. A generation is a step in the evolutionary process. The selection operator identifies good individuals by using a fitness function. These individuals are passed to the next generation and are used as a paradigm for new individuals.

A) Crossover:  Parent A: 1 1 1 1 1   Parent B: 0 0 0 0 0   →   Child A: 1 1 1 0 0   Child B: 0 0 0 1 1
B) Mutation:   Parent C: 0 0 1 0 1   →   Child C: 0 0 1 2 1

Figure 2.18: Figure A shows the crossover operator, which combines the genes of two parent genomes and creates two new children. Figure B presents the mutation operator, which can change the value of a gene randomly.

This concept is also based on the law of the evolution theory ‘Survival of the fittest’ [78]. The other two operators are used to create new individuals. The crossover operator produces children by cutting across two parent genomes. Figure 2.18 A shows an example of a single point crossover operator, which means that there exists only one cut in the parent genomes. Another type is the two-point crossover, which uses two cuts, or the generalised n-point crossover operator. The uniform crossover operator creates a child by randomly choosing genes or bits from the parents. The second type of operator is called mutation and changes the value of a gene at random, as shown in figure 2.18 B. All operators can be controlled by parameters. Furthermore, the number of individuals for the selection, the rate of crossover or mutation and the combination of both operators can be chosen [57]. One possibility is to use the GA directly as a path finding algorithm as described in [8]. The basis is the graph with the nodes and the connections. All individuals contain the start node as the first gene and the target node as the last gene. The other genes are chosen randomly but with the knowledge of the graph. After this initialisation, each individual contains its own suggestion for a solution. The selection operator chooses the individuals with the shorter paths. They are modified by crossover and mutation until the shortest path is found. This method adapts the GA to the network of the graph. Another method is to apply the GA to neural networks to find an optimal network configuration. This configuration can be the number of hidden layers, the number of neurons in each hidden layer, the type of activation function or the weights of the connections between neurons, etc. [77]. For the initial population, a lot of different networks are generated. In the selection phase the best networks are chosen by the fitness function and all others are deleted. The network error after propagating some data sets, as explained in section 2.2.2, is used as the result of the fitness function [57]. The free spots are refilled with newly created networks based on crossover and mutation. These operations develop new combinations of neural networks. This iterative process can be continued until a defined number of cycles is reached or a network with a small error is found. Neural networks which have been created through a GA process can also be called compositional pattern-producing networks (CPPN) [31].
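The two operators of figure 2.18 can be sketched in a few lines of C++; the gene representation as an integer vector and the parameter names are chosen only for this illustration.

```cpp
// Sketch of a single point crossover and a simple mutation operator on gene strings.
#include <vector>
#include <cstdlib>
#include <utility>

using Genome = std::vector<int>;

// single point crossover: cut both parents at the same position and swap the tails
std::pair<Genome, Genome> crossover(const Genome& a, const Genome& b) {
    std::size_t cut = 1 + std::rand() % (a.size() - 1);
    Genome childA(a.begin(), a.begin() + cut), childB(b.begin(), b.begin() + cut);
    childA.insert(childA.end(), b.begin() + cut, b.end());
    childB.insert(childB.end(), a.begin() + cut, a.end());
    return {childA, childB};
}

// mutation: change one randomly chosen gene to a random value
void mutate(Genome& g, int maxValue) {
    g[std::rand() % g.size()] = std::rand() % (maxValue + 1);
}

int main() {
    Genome parentA = {1, 1, 1, 1, 1}, parentB = {0, 0, 0, 0, 0};
    auto children = crossover(parentA, parentB);   // e.g. 1 1 1 0 0 and 0 0 0 1 1
    mutate(children.first, 2);                     // one gene gets a new random value
}
```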

3 Approach

This chapter includes the implementation of the different approaches. The A* algorithm is realised in section 3.1, followed by section 3.2 with the Q-Learning algorithm. Both implementations use the same graph and contain a graphical interface for visualisation. For the next two sections the graph has been simplified. Section 3.3 shows the application with a multilayer perceptron. The last part, section 3.4, includes the implementation of an evolutionary algorithm which is called neuroevolution of augmenting topologies. The fundamental methods of all algorithms were already introduced in chapter 2. The development of the applications was done on an HP zBook 15 G3 (32 GB RAM, i7-6820HQ 2.70 GHz) with Ubuntu 16.04. All algorithms are implemented in C++ version 11 (C++11). The graphical interfaces in the first two sections are realised with the SDL (Simple DirectMedia Layer) library version 2.0. The visualisations of the networks in the last two sections were created by Python scripts (executed with Python version 3.5.2).

3.1 A* Algorithm

The basis of the A* algorithm is a graph. Therefore, an application was developed to create a graph on the basis of a picture in the background, as shown in figure 3.1 A. The used picture is a map of the Audi driving experience center in Neuburg on the Danube from HERE maps [47]. The graph is saved in two text files which are used as input for the algorithm. The first file is the node list and contains for each of the 747 nodes an x and y value as pixel coordinates in the picture. In this example there exists no geographical reference, but the algorithm also works with latitude and longitude values. The second file is the adjacency matrix. It contains 747 rows and 747 columns, and each position represents a possible connection between two nodes. The value ‘1’ represents the existence of a connection and the value ‘0’ means that there is no connection. Most of the nodes only have a connection to their direct neighbours. In this case most entries are ‘0’ and the matrix can be called a sparse matrix. The algorithm works as already explained in chapter 2.1 and maintains three lists of nodes. At the beginning, it is necessary to choose a start node, which is moved from the unknown list to the open list. The selection of the target node starts the algorithm. The result is a list of nodes which represents the shortest path. Figure 3.1 B shows a detail of the map with the shortest path (coloured turquoise) from the start node (coloured red) to the target node (coloured blue). All other nodes are invisible. After the calculation, a red car starts driving from the start to the target node along the shortest path. The driving angle of the car is calculated for every frame regarding the position of the following node.


Figure 3.1: Figure A shows the full graph with 747 nodes (coloured green) and a map in the background. Figure B shows a detail of the map with the shortest path (coloured turquoise) from the start node (coloured red) to the target node (coloured blue) and a red car which is driving on the path.

Additionally, it is possible to place obstacles on the graph, which are represented by black nodes in figure 3.1 B. After placing an obstacle on the current path, all connections to this node are deleted in the adjacency matrix and the algorithm restarts the calculation. Instead of the original start node, the current position of the car is used as the start node. This recalculation is very fast and the car immediately starts to drive on the new path, including a turnaround if necessary. Figure 3.1 B shows some street direction labels which mark one-way streets. These labels have been ignored because they would reduce the possible solutions for finding a connection between start and target after placing some obstacles. It would be possible to add this information by correcting the initial adjacency matrix.
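The following simplified C++ sketch shows how the described node list and adjacency matrix can be searched with A*, using the Euclidean distance as heuristic. Data structures and names are illustrative and do not reproduce the exact implementation of this work (for example, the three node lists are reduced to two flag arrays here).

```cpp
// Simplified A* sketch over a node list and a 0/1 adjacency matrix.
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>

struct Node { double x, y; };   // pixel coordinates from the node list

std::vector<int> aStar(const std::vector<Node>& nodes,
                       const std::vector<std::vector<int>>& adj,
                       int start, int target) {
    const int n = (int)nodes.size();
    const double INF = std::numeric_limits<double>::infinity();
    auto dist = [&](int a, int b) {
        return std::hypot(nodes[a].x - nodes[b].x, nodes[a].y - nodes[b].y);
    };
    std::vector<double> g(n, INF);            // cost from the start node
    std::vector<int> parent(n, -1);
    std::vector<bool> open(n, false), closed(n, false);
    g[start] = 0.0; open[start] = true;

    while (true) {
        // pick the open node with the smallest f = g + heuristic
        int current = -1; double best = INF;
        for (int i = 0; i < n; ++i)
            if (open[i] && g[i] + dist(i, target) < best) {
                best = g[i] + dist(i, target); current = i;
            }
        if (current == -1) return {};         // no connection between start and target
        if (current == target) break;
        open[current] = false; closed[current] = true;

        for (int nb = 0; nb < n; ++nb) {      // expand all connected neighbours
            if (!adj[current][nb] || closed[nb]) continue;
            double tentative = g[current] + dist(current, nb);
            if (tentative < g[nb]) { g[nb] = tentative; parent[nb] = current; open[nb] = true; }
        }
    }
    std::vector<int> path;                    // walk the parent pointers back to the start
    for (int v = target; v != -1; v = parent[v]) path.push_back(v);
    std::reverse(path.begin(), path.end());
    return path;
}
```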

3.2 Q-Learning Algorithm

The Q-Learning algorithm was implemented on the basis of the same graph as in section 3.1. In the initial state the R- and Q-matrix are empty. In this case, the first required step is to define a target node. This node gets an entry in the R-matrix with the positive reward of 100. This information has to propagate through the network. The graph in the example of figure 2.10 was very small and contained only six nodes; in this graph, there exist 747 nodes. Choosing every node at least once can prolong the learning process because the nodes are chosen randomly. Caused by this characteristic, another strategy is chosen: each node is used exactly once for the learning process, whereby another difficulty occurs. The standard Q-Learning process can propagate the positive reward only one node backwards. If a node is selected for learning at the beginning of the process and

the reward is far away, the propagation never reaches it. This issue can be solved by another trick: the randomly chosen way of the algorithm is stored and walked backwards after the target was found. In each step, the Q-value is calculated and the reward is thereby propagated back to the original start node. This is more expensive, but it is a safe approach to update the network completely. Propagating a new reward through the whole network takes about 1-3 seconds. It is also possible to place successively more reward nodes on the graph, as shown in figure 3.2. All blue nodes are target nodes. If a global start node (coloured red) is chosen in the graph, the algorithm finds a way to the closest target node. This is very easy because it only has to check in the Q-matrix which next node has the best value. The turquoise coloured nodes show the shortest path.

Figure 3.2: This figure shows a detail of the map with the shortest path (coloured turquoise) from the start node (coloured red) to the next target node (coloured blue) and a red car which is driving on the path.
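A minimal sketch of this backward update could look as follows; the matrix layout and the function name are assumptions made only for this illustration.

```cpp
// Sketch of the backward update: the stored random walk to the target is walked
// backwards and equation 2.7 is applied in each step, so the reward reaches the
// original start node in one pass.
#include <vector>
#include <algorithm>

void propagateBackwards(std::vector<std::vector<double>>& Q,
                        const std::vector<std::vector<double>>& R,
                        const std::vector<int>& path,     // nodes from start ... target
                        double gamma) {
    for (int i = (int)path.size() - 2; i >= 0; --i) {
        int s = path[i], next = path[i + 1];
        double maxQ = *std::max_element(Q[next].begin(), Q[next].end());
        Q[s][next] = R[s][next] + gamma * maxQ;           // equation 2.7
    }
}
```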

If there is only one target node, the Q-Learning algorithm outputs a similar result as the A* algorithm. The stop conditions for the propagation can be defined differently. In this implementation, the propagation stops when each node has been selected exactly once. Alternatively, a fixed number of random nodes could be used (e.g. 1,000 nodes). The third possibility is to check whether the normalised Q-matrix converges to the optimal values after a fixed number of steps (for example after 10); if there are no essential changes anymore, it can be stopped. To make sure that the algorithm does not run into an infinite loop caused by the random forward steps, a counter (e.g. 500 steps) was implemented which stops the search for the target node. In this case the node found no reward to propagate back. It is still possible that one of the next nodes finds its way over that skipped node and then calculates the Q-value. The option to place an obstacle on the graph was not implemented, but it would be possible to propagate such items with a negative reward through the network.

29 3.3 Multilayer Perceptron

For the implementation of the multilayer perceptron a deep learning framework has to be chosen. To implement an own network would cost too much time; besides, most frameworks are open source and can be used for free. The following passage gives a short overview of the most popular deep learning frameworks with their producers and the available programming languages [71], [88]:

• Caffe was developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors and is available in C, C++, Python and Matlab.
• CNTK (or Microsoft Cognitive Toolkit) was developed by Microsoft Research and is available in C++, C# and Python.
• Tensorflow was developed by Google’s Machine Intelligence research organisation and is available in C++ and Python.
• Theano is available in Python.
• Keras runs on top of Tensorflow or Theano and is available in Python.
• Torch is available in C, C++ and Lua.
• Deeplearning4j is the first library for Java and Scala.
• The Neural Network Toolbox was developed by MathWorks and is available in Matlab [61].

All the frameworks have different advantages and disadvantages. For the implementation of the MLP another, smaller C++ library called tiny-dnn was used. This library only contains header files and is a dependency-free framework for deep learning. It is published on GitHub and can be downloaded freely [39]. The advantage of a smaller library is the easier understanding and the short training period for a developer until the first application runs. For this section a simplified graph in the form of a grid was used as environment. Figure 3.3 shows a grid with different symbols. The symbol P marks the player position, x marks


Figure 3.3: Grid with player, walls, positive and negative items. The blue arrows in figure A show the four possible actions the player can choose. After two steps right (figure B and C) and one step up the player reaches the target item (figure D).

walls, the minus item marks a negative reward and the plus item marks the target, i.e. the positive reward. A grid is also called an environment because it is an image of the current state. The main goal is to find a way from the player position to the target position. The blue arrows in figure 3.3 A show the four possible moves the player can choose. In this example the player moved two times right (figures 3.3 B and C) and one time up (figure 3.3 D) and successfully reached the target position. If the grid were in the same initial state every time, the solution would be very easy: the Q-learning algorithm could update the Q-matrix once and find a solution from every position. But there exists not only one grid. A small application was developed which randomly creates a lot of different grids. The surrounding wall exists in every grid, but the player, the single wall, the minus and the plus item are at different positions. A few special constellations are excluded: no solution can be found if the player or the target (plus item) is enclosed in a corner by the wall and the minus item. The count of all possible grids can be calculated with

$c_{grids} = \frac{n!}{(n-k)!}$  (3.1)

with n as the count of possible positions (= 16) and k as the count of items (= 4) [33]. This yields 16 · 15 · 14 · 13 = 43,680 possible combinations. Caused by the enclosed items, 208 combinations (4 corners · 2 different enclosed items [player, plus] · 2 arrangements of the blocking items [wall, minus] · 13 possible positions for the remaining item = 208) have to be subtracted. The final count of possible grids is 43,680 - 208 = 43,472 combinations. All grids were saved in a text file and can be read by the main applications. In the first approach only 10,000 different grids were used, 7,500 for the training phase and 2,500 for the test phase. Subsequently, some parameters of the neural network are defined. The input layer consists of 144 binary neurons. This is the amount of data for one state of a grid. The graphical representation has to be transformed into a vector format. The vector consists of 4 · 36 binary values. The four items are decoded in four different layers; each layer consists of 36 values and the rows are stacked behind each other. All values are set to zero except the one of the used item. The first layer saves the position of the player, the second the wall, the third the minus and the last layer the plus, as shown in figure 3.4. The parameters of the hidden layers are evaluated in chapter 4. The standard configuration runs with two hidden layers including 150 and 200 neurons and uses the ReLU activation function. The SGD is used as the global optimiser. The output layer consists of four neurons which belong to the actions up, down, left and right and use the identity activation function. Figure 3.4 additionally shows the forward propagation step: the result of the neural network is used as a new action on the grid. The whole algorithm is explained in the next section. The implemented algorithm is similar to the algorithm from Mnih et al. which learned to play Atari games [67]. The only difference is that the extension of the experience replay is not implemented. They named the algorithm deep Q-Learning as a combination of deep learning and Q-Learning.


Figure 3.4: A grid is decoded as a binary vector which is used as an input of the NN. After forward propagation, the result is used as a new action on the grid.

The process of the algorithm will now be explained step by step with the sequence diagram in figure 3.5. The single steps are marked with numbers in turquoise circles. In order to start, the NN has to be initialised as described before; all weights are chosen randomly.

1. A grid from the text file is used as the current state. This state is forward propagated through the network and outputs the Q-value (Qval).
2. It is decided randomly whether the action proposal from Qval is used (by taking the action with the maximal value) or a randomly chosen action. In this decision, the ε value is integrated; the two possibilities do not have the same probability. At the beginning ε is set to 1.0, which means that in the first step the action is chosen randomly (1.0 = 100%). With each iteration ε is reduced and the probability of taking the action from the network increases (see step 10). The value of ε never falls below 0.1, so a small probability of a random action exists at every point of time.
3. The chosen action is applied to the current state and the player makes the move. The new state' is created. The player can now be placed on any of the possible items.
4. The reward is calculated for state'. The position of the player is analysed: if it matches the position of the target, the reward is ‘10’. If it matches a negative item like the wall (21 cells) or the minus, it is ‘-10’, and if the player only moved to another free field, the reward is ‘-1’.
5. The reward is analysed and if it is ‘10’ or ‘-10’ (which means an item was found) a stop flag is set. This does not mean that the whole algorithm stops, only the training with this single grid. Another stop criterion is the count of iterations on a grid: if the algorithm does not find any item after ten steps, it also stops to avoid an infinite loop.
6. The state' is forward propagated through the network. From this output, the new Q-value (newQ), the maximal value is possibly needed in the next step.
7. Depending on the reward, the update is calculated. If the reward was ‘10’ or ‘-10’, the value is used directly as the update. If the reward was ‘-1’, the update is the sum of the reward and the multiplication of γ and newQ. This means that the Q-value of the next step also influences the learning step before.
8. In this step, trainQ is created, which would be the optimal output of the network for the action. The trainQ consists of Qval, but the value at the action position is replaced by the update value.
9. The defined trainQ is used together with the current state to train the network. In this training step the SGD method is used as optimiser. Additionally, the MSE between the input (state) and the output (trainQ) is calculated to get feedback on the training process.
10. In the last step, state' is used as the new state and the ε value is reduced by the inverse of the number of iterations (= 1/7,500). If no item was found, the algorithm starts again with step 1. If the stop flag was set in step 5, the next grid from the text file is used as the current state instead of state'.


Figure 3.5: Workflow of the learning mechanism in the neural network. The turquoise circles with numbers describe the single steps one after another. The green objects are used for the different states of a grid. The forward propagation process consists of the green and additionally the yellow objects. The red objects describe the parts of the back-propagation.

The algorithm ends when all 7,500 training sets have passed through. Afterwards the trained network has to be tested: the 2,500 test grids are forward propagated through the network until an item is found or ten steps are done. The result, i.e. which item was found, can be

analysed. 100% means that all 2,500 grids are solved within ten steps with the result of 10. One step of the forward process is visualised in figure 3.4 and in the sequence diagram 3.5, where the steps 1, 2, 3 and 10 can be observed. The learning process is also represented in a more formal way as pseudo code in algorithm 1. The first loop iterates over all training grids and the second loop over the steps inside a grid until it reaches an item. The function φ(s_t) can be seen as a forward propagation through the NN with the output φ_t at iteration t. After an action was chosen and the reward was executed, the update is calculated to get to the next state s_{t+1}. In the last step the back-propagation is performed by the SGD method.

Algorithm 1 Neural network based on Q-Learning
 1: initialise action-value function Q with random weights
 2: for i = 1 to maxgrids
 3:     initialise state s_1 = grid_i
 4:     for t = 1 to maxsteps
 5:         process φ_t = φ(s_t)
 6:         select a random action a_t with probability ε
 7:         otherwise select a_t = maxid(φ_t)
 8:         execute action a_t and observe reward r_t and state s_{t+1}
 9:         process φ_{t+1} = φ(s_{t+1})
10:         set update_t = r_t                              for terminal φ_{t+1}
                         = r_t + γ · maxval(φ_{t+1})         for non-terminal φ_{t+1}
11:         perform a SGD step on Q with φ_t(a_t), update_t and s_t
12:     end
13: end
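The ε-greedy action selection (step 2) and the construction of the training target trainQ (steps 7 and 8) can be sketched in C++ as follows; the function names and the fixed number of four actions are simplifications for illustration.

```cpp
// Sketch of steps 2, 7 and 8 of the learning loop; the network calls themselves
// (forward propagation, SGD training step) are left to the framework.
#include <vector>
#include <cstdlib>
#include <algorithm>

int chooseAction(const std::vector<float>& Qval, double epsilon) {
    // with probability epsilon take a random action, otherwise the best one (step 2)
    if ((std::rand() / (double)RAND_MAX) < epsilon)
        return std::rand() % 4;
    return (int)(std::max_element(Qval.begin(), Qval.end()) - Qval.begin());
}

std::vector<float> buildTrainQ(const std::vector<float>& Qval, int action,
                               float reward, const std::vector<float>& newQ,
                               float gamma, bool terminal) {
    // steps 7 and 8: replace the value at the chosen action by the update value
    float update = terminal
        ? reward
        : reward + gamma * *std::max_element(newQ.begin(), newQ.end());
    std::vector<float> trainQ = Qval;
    trainQ[action] = update;
    return trainQ;
}
```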

3.4 Neuroevolution of Augmenting Topologies

This section describes the implementation of the genetic algorithm. The concrete algorithm is called Neuroevolution of Augmenting Topologies (NEAT) and was released by Stanley and Miikkulainen [85]. Different libraries which can be used for an implementation are available in a lot of different programming languages like C#, C++, Python, Java, Delphi, Go, Lua, Ruby or Matlab [30]. For this work the package MultiNEAT by Chervenski, which is written in C++, is used [21]. It is licenced under the GNU Lesser General Public License v3.0 and thereby free to use. The library includes the NEAT package and the extensions HyperNEAT and ES-HyperNEAT, which could be interesting to continue this work. The algorithm included in this package by default uses only one single thread, which leads to a prolonged run time. Due to this, several experiments were executed in parallel by starting them in separate command lines. Furthermore, the environment with the grids from section 3.3 is also used for this algorithm. The advantage of a C++ library is that some already implemented

functions like readGrids, makeMove or getReward can be used again. The basis of NEAT is an extensive set of parameters which have to be defined at the beginning. One can differentiate between the following subcategories: basic parameters, GA parameters, phased search parameters, novelty search parameters, mutation parameters, genome properties parameters and speciation parameters. Each of the 107 single parameters has a usable default value. The detailed parameter set with explanations and default values is listed in appendix B. The evaluation of different parameters is described in section 4.2. In the following part the process of the NEAT algorithm is explained. Figure 3.6 visualises the single steps of the implementation. Before the main process can be started, a genome has to be defined. It has 144 input neurons, no hidden layer and four output neurons and uses the ReLU activation function for all neurons. The hidden layers should be developed by the algorithm. This genome is the model for all other genomes in the population. After the number of epochs is defined (10,000 iterations) the algorithm can start.


Figure 3.6: Workflow of the NEAT algorithm. The turquoise circles with numbers describe the single steps one after another. The green objects are used for the different states of a grid. The forward propagation process consists of the green and additionally the yellow objects. The red object describes the evolution process of the population. The blue object represents the genomes (genotypes) or individuals.

1. All genomes of the population are analysed by the runGrid function. The genome (or genotype) is converted to a phenotype which is usable as a NN.
2. The first grid from the training data set is chosen as the first state. This state is propagated through the NN and an action is chosen from the output. The action generates the new state' with the makeMove function. The reward of state' is calculated and summed up in each step to the overall fitness value.

3. The reward is analysed, whereby two possibilities arise. If the reward is ‘-1’, continue with step 4. If the reward is ‘-10’ or ‘10’, continue with step 5.
4. The state' is used as the new state and the iteration runs again. These steps continue until the reward is different from ‘-1’ or until the maximum step count is reached.
5. In this case the player in the grid has reached an item and stops. The next grid from the training set is used as the new state.
6. If all training sets have passed through, the fitness value (the sum of the rewards divided by the maximum reward of all grids) is assigned to the genome (a small sketch of this fitness computation follows after this list).
7. The testNet function evaluates the best genome and yields a value in percent. This result shows how well the genome performs. For this, the 2,500 test grids are used. The process is similar to the runGrid function. This test step is not done in each iteration but in every 100th loop.
8. With the epoch function the algorithm starts to evolve new genomes. The genomes with the best fitness values are used as models. By using the crossover and mutation operators, new genomes are created. The epoch function is provided by the library and the evolving process can only be controlled by the parameter set defined at the beginning. The new population is analysed again by the runGrid function. This process runs until the defined number of generations is reached.
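The fitness evaluation of a single genome (steps 2 to 6) can be sketched as follows. Grid, forwardPropagate, makeMove and getReward are simplified stand-ins for the corresponding parts of the implementation, so that the listing is self-contained.

```cpp
// Sketch of the fitness evaluation over all training grids (runGrid-like loop).
#include <vector>
#include <algorithm>

struct Grid { int player, wall, minus, plus; };            // simplified grid state

// dummy stand-ins so the sketch compiles on its own
std::vector<float> forwardPropagate(const Grid&) { return {0.f, 0.f, 0.f, 0.f}; }
Grid   makeMove(const Grid& g, int /*action*/)   { return g; }
double getReward(const Grid&)                    { return -1.0; }

double evaluateGenome(const std::vector<Grid>& trainingGrids,
                      int maxSteps, double maxReward) {
    double sum = 0.0;
    for (const Grid& start : trainingGrids) {
        Grid state = start;
        for (int t = 0; t < maxSteps; ++t) {
            std::vector<float> out = forwardPropagate(state);   // propagate the state
            int action = (int)(std::max_element(out.begin(), out.end()) - out.begin());
            Grid next  = makeMove(state, action);
            double reward = getReward(next);
            sum += reward;                                      // sum up the rewards
            if (reward == 10.0 || reward == -10.0) break;       // an item was reached
            state = next;
        }
    }
    return sum / maxReward;   // fitness value assigned to the genome (step 6)
}
```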

This process can also be translated into pseudo code, which describes the same steps as shown in figure 3.6. The first two loops iterate over the populations (= epochs) and over each single genome. The inner loops represent the training grids and the step count which is needed by the player to find an item. The result of the fitness function is assigned to the fitness property of the genome and is analysed in the evolution step in line 17.

Algorithm 2 NEAT
 1: initialise population p with genomes
 2: for e = 1 to maxepochs
 3:     for g = 1 to maxgenomes
 4:         convert g to g_phenotype
 5:         for i = 1 to maxgrids
 6:             for t = 1 to maxsteps
 7:                 use g_phenotype and process φ_t = φ(s_t)
 8:                 select action a_t = maxid(φ_t) and
 9:                 execute action a_t and observe reward r_t and state s_{t+1}
10:                 calc reward and sum to fitness value
11:                 if not terminate set s_t = s_{t+1}
12:             end
13:         end
14:         return Σ fitness
15:     end
16:     testNet(p)
17:     p.Epoch()
18: end

4 Experiments and Results

This chapter shows different experiments with the MLP in section 4.1 and the NEAT algorithm in section 4.2. Different parameters and settings are tested. The best working configuration is used to present the results. To compare all tests and methods with each other, it is necessary to apply the same final test for both algorithms: the 2,500 test grids are forward propagated through the network and the result value in percent is returned. Thereby 100% stands for the solution of all grids with a reward of 10.

4.1 Multilayer Perceptron

The following sections show different test cases together with the final results. The neural network can be designed in different ways. The number of neurons in the input and output layers is fixed, as the size of the grids and the number of actions do not vary. The hidden layers give a big scope for evaluation. The default configuration runs with two hidden layers containing 150 and 200 neurons.

4.1.1 Independent runs

This experiment contains 100 independent runs without changing any parameter to get a guideline for the stability of the results. For each run a new NN is initialised. Therefore, the arithmetic mean

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$  (4.1)

with n as the number of runs (= 100) is used. The standard deviation is calculated as

$\sigma_x = \sqrt{\sigma_x^2} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$  (4.2)

and gives a value for the scattering of the results [96]. Over the results of the runs the arithmetic mean is 83.76% and the standard deviation is 2.77%. The change of a parameter therefore has to improve the result by more than 3% to make sure that the improvement was not caused by the standard deviation. For one run the arithmetic mean time is 100.88 seconds.

39 4.1.2 Training five epochs

In this part the NN is trained for five epochs. Therefore, the training run is repeated five times with the 7,500 grids. The NN is not reset between the runs. The MLP returns the result value and additionally the MSE. The development of the mean squared error is displayed in figure 4.1. Additionally, the results are shown in table 4.1 with the MSE at 100% of the 7,500 grids, the MSE mean and the result value. In both representations, the MSE decreases in each epoch and the percentage of solved grids rises.


[Plot: MSE (y-axis, 0-6) over the percentage of the 7,500 training runs (x-axis, 0-100%), one curve per epoch 1-5]

Figure 4.1: MSE development in the training phase over five epochs.

Epoch          1        2        3        4        5
MSE at 100%    0.50     0.17     0.21     0.04     0.12
MSE mean       1.31     0.35     0.20     0.17     0.16
Result value   86.08%   95.12%   96.92%   97.20%   98.12%

Table 4.1: Results of the MSE development over five epochs with MSE at 100%, MSE mean and the result value. The colours belong to the different epochs in figure 4.1.

4.1.3 Different ratio of data sets

This section analyses different numbers of training and test sets. As calculated in section 3.3, the grid data set consists of more than 43,000 examples. In this test, different ratios of training and test data sets are used. Table 4.2 gives an overview of the test cases. The combination of 7,500 training and 2,500 test sets is used for the following experiments.

Train grids    75       750      7,500    15,000   22,500   30,000
Test grids     25       250      2,500    5,000    7,500    10,000
Result value   12.00%   29.60%   85.68%   94.00%   95.45%   96.88%
Time in sec    1        13       100      188      296      395

Table 4.2: Results with different numbers of training and test sets. The combination of 7,500 training and 2,500 test data sets is shown in bold. Additionally, the result value and the time in seconds are displayed.

4.1.4 Different optimisers

Instead of the SGD explained in section 2.2.2, other optimisers can be used. Table 4.3 shows the results of the RMSprop, adagrad, adam and momentum optimisers in comparison to the SGD. The description and the mathematical calculation of each can be found in the tiny-dnn package in the optimizer.h header file [39]. Only the momentum optimiser works as well as the SGD. For that reason the SGD should be used preferably.

Optimiser      grad_dec   RMSprop   adagrad   adam     momentum
Result value   85.68%     23.00%    17.64%    24.36%   84.16%
Time in sec    100        140       138       143      108

Table 4.3: Results of different optimisers with the result value and the time in seconds. The stochastic gradient descent (grad_dec) achieves the best results.

4.1.5 Different alpha values

In this experiment, different alpha values for the SGD optimiser are used. This value controls the learning rate of the optimiser. Table 4.4 shows the alpha value, the result value and the time in seconds. The default value is highlighted in bold. The other values only show an improvement of less than 3%.

SGD alpha      1.0      0.1     0.02     0.011    0.01     0.009    0.008    0.007    0.006    0.005
Result value   31.88%   9.08%   68.48%   79.56%   85.24%   88.28%   87.80%   88.28%   81.92%   83.88%
Time in sec    118      77      111      102      98       98       96       98       99       101

Table 4.4: Results of different alpha values of the SGD. Additionally, the result value and the time in seconds are shown. The default value is highlighted in bold.

4.1.6 One hidden layer

For this experiment, only one hidden layer with different numbers of neurons is used. Table 4.5 shows the numbers of neurons, which are graded from 10 to 1,000, the result value and the time in seconds. It can be seen that the time for learning increases with the number of neurons. The best value, achieved with 500 neurons, is highlighted in bold.

Neurons layer H1   10       20       50       100      150      200      500      1000
Result value       47.12%   69.36%   69.80%   81.32%   81.80%   83.16%   87.32%   80.76%
Time in sec        6        9        19       32       45       55       133      265

Table 4.5: Results of one hidden layer and different numbers of neurons. Additionally, the result value and the time in seconds are shown. The best result is achieved with 500 neurons highlighted in bold.

4.1.7 Two hidden layers

This experiment compares two hidden layers with different numbers of neurons. Thus there exist 64 combinations, which are shown in table 4.6. The rows represent the number of neurons in the first hidden layer (H1) and the columns the number of neurons in the second hidden layer (H2). The result value is calculated for each combination. Combinations with a very small number of neurons like 10-10 to 50-50 do not yield good results. The accuracy rises with the number of neurons in the layers. The fields with good values lie very close together (85-90%). The colour gradient varies from red for the lowest value over yellow to green for the highest value.

Result value                        Neurons layer H2
Neurons layer H1    10       20       50       100      150      200      500      1000
10                  47.80%   43.72%   65.28%   71.12%   61.32%   65.88%   67.72%   63.52%
20                  55.08%   64.20%   70.16%   71.12%   72.68%   71.76%   78.32%   69.44%
50                  72.72%   69.20%   78.12%   78.36%   64.28%   80.96%   82.44%   75.96%
100                 78.04%   78.36%   76.84%   83.96%   85.24%   83.04%   84.64%   86.76%
150                 73.72%   76.12%   81.44%   82.96%   83.24%   84.32%   82.48%   79.20%
200                 76.40%   83.12%   87.00%   79.36%   86.68%   78.32%   88.36%   85.36%
500                 76.92%   86.56%   88.32%   84.44%   88.16%   85.96%   90.24%   83.68%
1000                81.28%   87.24%   85.92%   87.96%   84.48%   90.84%   88.32%   88.96%

Table 4.6: Results of two hidden layers and different combinations of numbers of neurons. The result value is calculated for each combination. The colour gradient varies from red for the lowest value over yellow to green for the highest value. The default combination of number of neurons is highlighted in bold.

The run time of the single steps is displayed in table 4.7 and relates to the sum of the neurons. The combination with two times 1,000 neurons (2,000 neurons) takes, with 1,999 seconds (= 33:19 min), the longest time. The trend from the fastest run to the slowest is linear but with a shift: a lower number of neurons in the first hidden layer is faster than the reverse combination with the lower number in the second layer. This experiment can also be done over five epochs for all combinations. Table 4.8 shows the results. The default configuration reaches 97.76% and the configuration with two hidden layers of 1,000 neurons each reaches 99.00% of solved grids.

Time in sec                         Neurons layer H2
Neurons layer H1    10     20     50     100    150    200    500    1,000
10                  5      5      5      8      10     11     19     35
20                  8      9      10     11     14     17     30     55
50                  16     18     21     25     30     36     62     115
100                 32     33     38     50     57     66     121    215
150                 47     48     56     69     82     96     173    316
200                 60     61     72     90     109    124    229    403
500                 146    146    168    215    264    303    553    1016
1,000               279    288    337    426    521    607    1088   1999

Table 4.7: Run time in seconds with two hidden layers and different combinations of numbers of neurons. The colour gradient varies from red for the slowest run over yellow to green for the fastest. The default combination of numbers of neurons is highlighted in bold.

Result value                        Neurons layer H2
Neurons layer H1    10       20       50       100      150      200      500      1,000
10                  57.04%   80.60%   93.12%   95.48%   94.08%   94.60%   96.40%   96.44%
20                  65.12%   80.08%   94.80%   97.00%   95.80%   95.92%   97.56%   98.72%
50                  77.48%   85.60%   95.96%   97.80%   97.80%   98.32%   97.84%   98.72%
100                 67.28%   92.20%   95.84%   96.60%   98.48%   97.72%   98.40%   97.96%
150                 76.84%   93.08%   96.36%   97.56%   97.64%   97.76%   98.52%   98.08%
200                 78.52%   90.16%   96.80%   97.76%   98.20%   97.88%   98.76%   98.92%
500                 75.64%   93.64%   95.00%   96.72%   98.20%   97.60%   98.96%   98.16%
1,000               74.28%   87.40%   97.28%   98.12%   97.96%   98.52%   98.40%   99.00%

Table 4.8: Results of two hidden layers and different combinations of numbers of neurons after five epochs. The result value is calculated for each combination. The colour gradient varies from red for the lowest value over yellow to green for the highest value. The default combination of numbers of neurons is highlighted in bold.

4.1.8 One to five hidden layers

This section compares NNs with one to five hidden layers and different numbers of neurons. To try out all possible combinations of neurons would have gone beyond the scope; in this case, all layers have the same number of neurons. Table 4.9 shows the different combinations.

The rows of the table show the number of hidden layers and the columns show the number of neurons in each layer. The configuration with two layers achieves the best results. The results are still good when using one or three hidden layers, but especially with five layers the result values are very low. The training of the configuration with five layers and 1,000 neurons per layer takes 8,781 seconds (2:26:21 h). To analyse this effect in more detail, the data set (= column) with 150 neurons per layer is used. Figure 4.2 shows the development of the MSE in the five different cases with one to five layers.

Result value                        Neurons per HL
Number of HL        10       20       50       100      150      200      500      1,000
1                   49.60%   71.08%   75.20%   78.16%   75.92%   75.40%   84.96%   83.64%
2                   49.40%   63.44%   70.52%   80.68%   86.96%   90.00%   90.76%   89.40%
3                   35.24%   38.52%   64.56%   80.40%   68.76%   80.48%   82.80%   82.44%
4                   15.00%   37.00%   28.12%   48.64%   53.80%   57.48%   68.76%   65.16%
5                   10.12%   12.88%   16.56%   20.64%   23.32%   25.60%   21.00%   37.24%

Table 4.9: Results of one to five hidden layers with different numbers of neurons. The rows show the number of hidden layers and the columns the number of neurons.

The MSE of the configurations with one and two layers decreases well, but the values of the other three configurations are not improving. Table 4.10 clarifies this rating. It shows the MSE value at the last training step and the MSE mean. Both values are distinctly higher for three, four or five hidden layers. A high MSE also leads to a lower result value.


[Plot: MSE (y-axis, 0-6) over the percentage of the 7,500 training runs (x-axis, 0-100%), one curve per configuration with 1 to 5 hidden layers of 150 neurons each]

Figure 4.2: MSE development with different numbers of hidden layers. Each layer contains 150 neurons.

Number of HL    1        2        3        4        5
MSE at 100%     0.39     0.53     1.85     3.58     1.91
MSE mean        1.71     1.32     2.38     2.93     3.06
Result value    75.92%   86.96%   68.76%   53.80%   23.32%

Table 4.10: Detailed results of one to five hidden layers with the MSE at 100%, the MSE mean and the result value. The best results are highlighted in bold.

44 4.1.9 Reduction of unsolved grids in training

As can be seen in table 4.1, the results are better when using more epochs, i.e. when training each training grid more than once. After more epochs, the network can decide on the basis of more experience and can solve more grids than in the first epoch. The main idea of this test is to have more repetitions for a grid if it fails. To reduce the number of unsolved grids in a run, it is necessary to know what happens in a normal run. Table 4.11 shows five default runs with the number of unsolved grids in the training phase, the result value and the time in seconds. These runs use the same default configuration as the first experiment with the 100 runs, but the values for unsolved grids were added. Additionally, the mean values for these five runs are calculated with equation 4.1. It shows that a little more than half of all grids are not solved in the first run.

                             Run 1    Run 2    Run 3    Run 4    Run 5    Mean
Unsolved grids in training   4418     4351     4375     4346     4367     4371
Result value                 78.48%   86.36%   81.04%   76.44%   82.60%   80.98%
Time in sec                  105      112      107      106      99       105.8

Table 4.11: Results of unsolved grids in training phase based on five similar runs. The mean value is calculated for the unsolved grids, the result value and the time in seconds.

The results of the newly implemented repetition functionality are shown in table 4.12. Five different configurations were created with 5, 10, 50, 100 or 500 repetitions. For example, 5 repetitions means that each unsolved grid is repeated at most 5 times. With this additional function the number of unsolved grids decreases whereas the run time increases. Especially with 500 repetitions the run time is eight times higher than in the default configuration.

Max repetitions              5        10       50       100      500
Unsolved grids in training   1054     475      98       58       18
Result value                 93.12%   94.84%   94.64%   95.04%   93.76%
Time in sec                  187      261      385      482      887

Table 4.12: Results with different numbers of maximum repetitions in the training phase. It shows the maximum repetitions, the unsolved grids in the training phase, the result value and the time in seconds.

Another idea is to use the backward steps as described in section 3.2. The path of the player is saved after each action until the player finds the target. This saved path is then walked backwards with an increasingly reduced reward after every step. This function was also tested, but with 3,061 unsolved grids and 83.28% solved grids it produced results comparable to the default configuration.

4.1.10 Reduction of the number of steps for the final result

This experiment compares the number of steps with the shortest path calculated by A*. Until now, the result value only registered whether a grid was solved, but the main goal is to find the shortest path. A function to count and save the number of steps for each grid was added. The ground truth was calculated by executing the A* algorithm on each grid. Additionally, an alternative reward function was implemented. The normal reward for finding the target is the value '10'. The new reward function uses the default value '11', and '1' is subtracted for each step the algorithm takes until the target is found. If the target is directly on the field next to the player and the algorithm chooses the action to this field, the reward is also '10' (= 11 - 1). This new reward function rates shorter paths higher than longer paths within the same grid. Different runs of the new functions were defined to analyse the improvement:

• R120 uses the default configuration with no changes.
• R121 tests a configuration with the new reward function.
• R122 uses the maximum repetitions algorithm with five iterations.
• R123 uses the new reward function and a maximum of five repetitions.
• R1001 is the final run and uses a maximum of five repetitions and additionally 10 epochs for training.

Table 4.13 shows the results of the five configurations. The upper part of the table analyses the training phase and the lower part the test phase. Both parts include the following results. The number of unsolved grids is the count of all grids which are not solved in the process. The result value is used in the same way as in the sections before. The 'Times more steps' value counts all grids which are not solved with the same number of steps as the ground truth (based on the A* algorithm). The sum of the steps which are needed additionally to solve these grids is represented by the 'Difference steps' value. The test phase further contains the 'Optimal steps' value, which shows the percentage of all solved grids with the shortest path. It can be seen that the new reward function gives no essentially better results, whereas using the maximum repetition functionality leads to a faster learning effect and improves the result value. Additionally, figure 4.3 shows the development of the result value for the different test runs. The runs R120 and R121 approach 100% more slowly than the other runs; the new reward function does not improve the results. The runs R122 and R123 use the maximum repetition function and converge faster to 100%. After the 7,500 grids for training, the optimal steps value decreases again slightly because the maximum repetition function is only applied to the training data set. The configuration from run R1001 can be seen as a very good result; its result value is near 100% all the time. For this run, only the last epoch is visualised. It solved 98.48% of the grids and 99.52% of them with the shortest path. Based on 2,500 grids in the test set, there are 38 grids which could not be solved and 12 grids which failed to find the shortest path. The result from table 4.8 with 1,000 neurons in both hidden layers was 99.00%, which corresponds to only 25 unsolved grids.

                        R120     R121     R122     R123    R1001
Training
  Unsolved grids       4,355    4,486      427      490       13
  Result value        41.93%   40.19%   94.31%   93.47%   99.83%
  Times more steps       855      778    1,421    1,647      880
  Difference steps     2,660    2,364    3,948    4,648    2,000
Test
  Unsolved grids         443      370      185      166       38
  Result value        82.28%   85.20%   92.60%   93.36%   98.48%
  Times more steps        41       19       21       49       12
  Difference steps        84       40       42      102       26
  Optimal steps       98.36%   99.24%   99.16%   98.04%   99.52%

Table 4.13: Results of the runs minimising the number of steps towards the shortest path calculated by the A* algorithm. The colours indicate the different runs in figure 4.3.
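The alternative reward function tested in runs R121 and R123 can be summarised in a short sketch. This only illustrates the rule described above; the function name and the handling of unsolved grids are assumptions, not the thesis implementation.

def step_dependent_reward(steps_taken, found_target):
    # Start from 11 and subtract 1 per step, so a target reached in a single
    # step yields 10 (= 11 - 1) and every additional step lowers the reward.
    if not found_target:
        return 0  # assumption: no terminal reward if the target is not found
    return 11 - steps_taken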

[Plot: result value in % (y-axis, 0 to 100) versus number of grids (x-axis, 0 to 10,000), one curve per run R120, R121, R122, R123 and R1001.]

Figure 4.3: Development of the result value for the five configurations over all grids. The dashed line separates the training and test phase.

Averaging these two results from different configurations, the algorithm can finally be rated with an accuracy of 98.74%.

4.2 Neuroevolution of Augmenting Topologies

This section contains the experiments and results of the NEAT algorithm. The first three subsections cover some different basic configurations; afterwards, the variations of different parameters of the evolutionary algorithm follow. As already described in section 3.4, there are over 100 different parameters available, and testing all of them would have gone beyond the scope of this study. The final experiment together with its results is shown in the last sections and is obtained by combining the best working parameters. A good

configuration of parameters only constitutes the basis for the genomes. The genome itself develops over the generations towards a good solution. The highest result values are those of the best genome in the corresponding generation. When changing the value of a parameter improves the result, this means that the evolutionary process has been changed such that a new genome which yields better results can develop.

4.2.1 Independent runs

To get a guideline for the mean and the standard deviation, five runs with a default configuration were analysed after 1,000 generations. For the evolution, 7,500 training sets are used and 2,500 sets for the test. The mean value is calculated by equation 4.1 and yields 45.50%; with equation 4.2 a standard deviation of 2.21% is obtained. The mean run time is 5:48:48 hours with a standard deviation of 7:02 minutes. This long run time could be caused by the fact that the algorithm works by default with only one single thread; this setting is not changed over the evaluation.
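Such a guideline can be computed with a few lines of Python; this is only a sketch with placeholder values, and whether equation 4.2 uses the sample or the population standard deviation determines which statistics function applies.

import statistics

result_values = [0.44, 0.47, 0.45, 0.43, 0.48]   # placeholder for the five run results
mean = statistics.mean(result_values)
std_dev = statistics.stdev(result_values)        # sample standard deviation
print(f"mean = {mean:.2%}, standard deviation = {std_dev:.2%}")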

4.2.2 Different number of training sets

This experiment uses different numbers of training sets from 10 to 7,500. Furthermore, the number of generations is increased to 10,000 and 2,500 grids are used for the test. The result increases with a higher number of training sets, as shown in table 4.14. The same number of training data sets as for the neural network approach (7,500) is used subsequently so that the results of both approaches can be compared.

Trained grids            10       50      100    1,000       2,500       5,000       7,500
Generations
100                  13.20%   13.56%   15.96%   22.36%      27.88%      27.84%      27.64%
500                  12.96%   15.24%   17.52%   27.88%      40.80%      37.64%      41.96%
1,000                13.04%   16.92%   17.52%   31.20%      49.40%      50.72%      46.48%
2,000                13.04%   16.48%   17.80%   32.92%      51.88%      55.00%      52.28%
5,000                13.04%   16.00%   17.80%   37.60%      55.72%      62.64%      62.72%
10,000               13.04%   16.00%   18.56%   43.60%      59.28%      64.52%      67.80%
Time [d h:mm:ss]    1:20:25  0:53:58  1:16:39  8:48:58  1d 0:26:27  2d 4:36:28  3d 1:43:39

Table 4.14: Different numbers of training sets from 10 to 7,500 over 10,000 generations. The colour gradient varies from red for the lowest value over yellow to green for the highest value. The results for the number of training grids used in the following are highlighted in bold.

4.2.3 Different number of neurons

The NEAT algorithm starts in its initial state without any hidden layers but with 144 input and four output neurons. The parameters offer the possibility to set up one hidden layer for the initialisation. This parameter is tested with 4, 10, 36 and 144 neurons and compared with the default value. The results are shown in table 4.15. By using neurons in the hidden layer, the results get worse. Additionally, the run time rises sharply from roughly six hours to over three days. Since it yields the best result, the setting without a hidden layer is used for the next steps.

Neurons in hidden layer       0        4        10        36          144
Generations
100                      27.72%   19.60%    12.12%    11.24%       10.04%
250                      36.44%   21.44%    13.80%    14.08%       10.24%
500                      40.28%   26.40%    16.88%    19.80%       10.76%
750                      40.92%   26.24%    18.76%    21.20%       12.52%
1,000                    42.72%   27.24%    21.80%    21.20%       13.64%
Time [d h:mm:ss]         6:52:25  5:19:58   7:55:38   23:50:59  3d 21:28:58

Table 4.15: Different numbers of neurons in the hidden layer from 0 to 144 over 1,000 generations. In addition the time is listed in the last row. The colour gradient varies from red for the lowest value over yellow to green for the highest value (highlighted in bold).

4.2.4 Different MutateAddNeuronProb values

The next parameter to be analysed is called MutateAddNeuronProb. It is the probability that a new genome is changed by mutation using the add-neuron operator. In this process, an existing link between two neurons is broken, a new neuron is inserted and the two neurons are reconnected through it. Table 4.16 compares different MutateAddNeuronProb values. The default value is highlighted in bold and the best value is highlighted in blue. The improvement achieved by changing this parameter is 13%, as a higher value enables the neural network to grow faster.

MutateAddNeuronProb    0.01     0.03     0.05     0.10     0.30     0.50     0.75     1.00
Generations
100                  25.84%   27.48%   23.64%   25.12%   20.04%   19.56%   20.36%   17.16%
250                  37.96%   41.48%   37.00%   37.64%   34.32%   26.84%   31.84%   27.80%
500                  40.64%   41.16%   44.20%   46.80%   43.56%   42.56%   42.88%   44.92%
750                  43.32%   41.56%   47.56%   56.40%   54.16%   55.00%   56.04%   54.28%
1,000                45.08%   43.08%   48.72%   58.16%   58.68%   66.28%   62.64%   57.24%

Table 4.16: Different MutateAddNeuronProb values of 1,000 generations. The default value is highlighted in bold and the best value is highlighted in blue.
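The add-neuron mutation described above can be illustrated with a small, generic sketch. It follows the usual NEAT formulation (the incoming link gets weight 1, the outgoing link keeps the old weight); the data structures are simplified and do not correspond to MultiNEAT's internal API.

import random

def mutate_add_neuron(links, next_neuron_id, probability=0.5):
    # 'links' is a list of [source, target, weight] entries of one genome.
    if not links or random.random() >= probability:
        return next_neuron_id
    link = random.choice(links)
    links.remove(link)                          # break the existing link
    source, target, weight = link
    new_neuron = next_neuron_id                 # insert a new hidden neuron
    links.append([source, new_neuron, 1.0])     # incoming link gets weight 1
    links.append([new_neuron, target, weight])  # outgoing link keeps the old weight
    return next_neuron_id + 1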

4.2.5 Different ActivationLevel values

This parameter comprises the two values MinActivationA and MaxActivationA. It is an attribute of the genome and is used to change the slope of the activation function by a scalar value; some neuron activation functions make use of these values. The results are shown in table 4.17. The default value is highlighted in bold and the best value in blue. An improvement of 10% can be observed.

MinActivationA       4.9      1.0      4.9      1.0
MaxActivationA       4.9      4.9      6.0      6.0
Generations
100               27.48%   23.00%   27.80%   25.28%
250               41.48%   32.48%   37.80%   33.24%
500               41.16%   45.92%   44.52%   34.76%
750               41.56%   46.44%   48.56%   41.80%
1,000             43.08%   47.80%   53.84%   46.44%

Table 4.17: Different MinActivationA and MaxActivationA values of 1,000 generations. The default value is highlighted in bold and the best value is highlighted in blue.

4.2.6 Different SurvivalRate values

The SurvivalRate is an important property of an evolutionary algorithm. It controls the percentage of best individuals which survive until the next evolutionary epoch. These individuals serve as parents for the next epoch, as described in section 3.4. The value 1.0 means that 100% of all individuals are chosen for the next epoch; this value is not recommended, as in this case no new genomes will arise. As displayed in table 4.18, the default value of 0.2 (highlighted in bold) can be improved with the value 0.4 (highlighted in blue). Changing this parameter leads to an improvement of 11%.

SurvivalRate        0.1      0.2      0.3      0.4      0.5
Generations
100              21.88%   25.72%   26.20%   25.48%   28.84%
250              35.48%   37.88%   37.08%   40.80%   37.28%
500              43.00%   44.16%   45.60%   47.52%   51.28%
750              44.28%   47.56%   48.32%   58.36%   53.84%
1,000            46.72%   49.08%   53.48%   60.04%   56.40%

Table 4.18: Different SurvivalRate values of 1,000 generations. The default value is high- lighted in bold and the best value is highlighted in blue.
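The effect of the SurvivalRate can be illustrated with a generic truncation selection sketch; it is not MultiNEAT's implementation, which additionally applies this rate per species.

def select_survivors(population, fitness_of, survival_rate=0.4):
    # Keep the best fraction of individuals as parents for the next epoch.
    ranked = sorted(population, key=fitness_of, reverse=True)
    keep = max(1, round(survival_rate * len(ranked)))
    return ranked[:keep]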

4.2.7 Different YoungAgeFitnessBoost values

The results of the experiment with different YoungAgeFitnessBoost parameters are shown in table 4.19. This value is used as a multiplier to raise the fitness value of young genomes. The YoungAgeTreshold, which defines up to which age a genome counts as young, is initialised by default with five and is not changed; all genomes younger than 5 epochs therefore get this fitness boost. A value of 1.0 for the YoungAgeFitnessBoost means that no boost is applied. The improvement achieved by changing this parameter from 1.1 (highlighted in bold) to 1.5 (highlighted in blue) is 9%. New genomes are thus not discarded in the first epochs of their existence and have a higher chance to survive in the evolutionary process.

YoungAgeFitnessBoost    1.0      1.1      1.2      1.3      1.5      2.0
Generations
100                  24.44%   23.00%   22.32%   25.24%   24.92%   24.52%
250                  35.40%   33.76%   35.36%   36.16%   39.24%   35.48%
500                  40.56%   38.88%   42.36%   42.52%   43.92%   38.68%
750                  43.52%   42.72%   45.52%   45.04%   50.56%   41.24%
1,000                44.44%   44.24%   49.08%   47.80%   53.44%   45.24%

Table 4.19: Different YoungAgeFitnessBoost values of 1,000 generations. The default value is highlighted in bold and the best value is highlighted in blue.
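The boost itself amounts to one line: genomes below the age threshold have their fitness multiplied by the boost factor. A minimal sketch with the values used here (threshold 5, boost 1.5); the function is illustrative only.

def boosted_fitness(raw_fitness, age, young_age_threshold=5, young_age_fitness_boost=1.5):
    # Genomes younger than the threshold are boosted so that they are not
    # discarded in the first epochs of their existence.
    if age < young_age_threshold:
        return raw_fitness * young_age_fitness_boost
    return raw_fitness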

4.2.8 Other parameters

This section includes all other parameters that were analysed. They either achieved no better results or their improvements did not clearly exceed the standard deviation of roughly 3%. The parameters are named and explained shortly, but no detailed experiments are reported for them:

• Different activation functions of the output and hidden neurons (used values a_OutputActType = ReLU, a_HiddenActType = ReLU).
• Different population sizes, controlling the total number of genomes (used value PopulationSize = 150).
• Different numbers of minimum and maximum species; species are groups of genomes with similar properties (used values MinSpecies = 5, MaxSpecies = 25).
• Different probabilities for mutating recurrent links, i.e. the probability that a link mutation creates a recurrent link (used value RecurrentProb = 0).
• Different probabilities for the overall mutation rate, which controls the mutation rate after a baby was produced by the crossover operator (used value OverallMutationRate = 0.8).
• Different probabilities for add-link mutations, controlling the add-link mutation operator of a baby (used value MutateAddLinkProb = 0.05).
• Different probabilities for using the crossover operator to produce a baby (used value CrossoverRate = 0.75).
• Different thresholds which exclude the existence of links (used value LeoThreshold = 0.1).

The alternative values for the comparisons with the default values come from two applications given as examples in the MultiNEAT package [21]. The files lunar_lander.py and TestNEAT_xor.py are written in Python, but the parameter sets can be extracted easily.
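Assuming the Python bindings are used in the same way as in those example scripts, a parameter set like the one used in this chapter could be assembled roughly as follows. The attribute names correspond to the parameter list in appendix B; the import convention is taken from the MultiNEAT examples and is not prescribed by the thesis.

import MultiNEAT as NEAT

params = NEAT.Parameters()
# Values kept as listed in this section (4.2.8).
params.PopulationSize = 150
params.MinSpecies = 5
params.MaxSpecies = 25
params.RecurrentProb = 0.0
params.OverallMutationRate = 0.8
params.MutateAddLinkProb = 0.05
params.CrossoverRate = 0.75
# Best working values from sections 4.2.4 to 4.2.7.
params.MutateAddNeuronProb = 0.5
params.MinActivationA = 4.9
params.MaxActivationA = 6.0
params.SurvivalRate = 0.4
params.YoungAgeFitnessBoost = 1.5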

4.2.9 Experiments with final parameters

In this section the parameters which improved the results are tested and combined. These four parameters are the MutateAddNeuronProb from section 4.2.4, the ActivationLevel from section 4.2.5, the SurvivalRate from section 4.2.6 and the YoungAgeFitnessBoost from section 4.2.7. Each parameter gets a unique ID together with its default value and its best working value, as shown in table 4.20.

ID   Parameter               Default   Best
1    MutateAddNeuronProb        0.03    0.5
2    MinActivationA              4.9    4.9
     MaxActivationA              4.9    6.0
3    SurvivalRate                0.2    0.4
4    YoungAgeFitnessBoost        1.1    1.5

Table 4.20: The four best parameters with ID, parameter name, default and best value.

In total, 16 possible combinations exist, which are separated into three groups. All cases run over 1,000 generations and the best five results are highlighted in blue. At first, all default values and then the best working values are tested independently, as shown in table 4.21. Table 4.22 shows the results of the second group, which combines two best working values each. The last group combines three or all four best working values and is represented in table 4.23; it includes the best working combinations. The five selected cases are run again in a final experiment over 10,000 generations, as shown in table 4.24.

Combination     default        1        2        3        4
Generations
100              24.24%   19.60%   26.76%   23.04%   24.32%
250              41.48%   30.20%   37.44%   34.28%   37.36%
500              45.00%   45.36%   43.92%   44.12%   42.92%
750              49.56%   54.16%   46.92%   48.08%   49.92%
1,000            51.52%   56.28%   48.44%   48.80%   52.52%

Table 4.21: Results of the default and single parameters of 1,000 generations.

The best combination uses the IDs 1 (MutateAddNeuronProb), 2 (MinActivationA, MaxActivationA) and 3 (SurvivalRate). The final result of the best genome in this population is 69.96% of solved grids. This evolutionary process takes longer than nine days. The genome was saved and is analysed in the next section.

Combination        1-2      1-3      1-4      2-3      2-4      3-4
Generations
100             20.68%   21.72%   21.32%   26.72%   30.00%   25.64%
250             28.08%   28.96%   28.20%   37.92%   41.96%   37.76%
500             43.72%   42.60%   41.96%   48.80%   46.16%   46.04%
750             48.84%   49.76%   56.12%   56.36%   46.88%   52.36%
1,000           54.48%   54.24%   59.28%   56.72%   49.28%   56.36%

Table 4.22: Results of double parameter combinations of 1,000 generations. The best configurations are highlighted in blue.

Combination      1-2-3    1-2-4    1-3-4    2-3-4  1-2-3-4
Generations
100             19.64%   22.24%   21.44%   24.80%   20.68%
250             30.68%   35.96%   32.24%   36.72%   30.60%
500             40.56%   52.04%   47.84%   45.96%   46.92%
750             57.40%   57.88%   58.72%   50.80%   50.28%
1,000           62.64%   58.80%   58.72%   54.40%   59.16%

Table 4.23: Results of triple and fourfold parameter combinations over 1,000 generations. The best configurations are highlighted in blue.

Combination              1-4        1-2-3        1-2-4         1-3-4       1-2-3-4
Generations
100                   19.60%       21.24%       20.88%        22.16%        23.24%
250                   41.44%       45.28%       41.76%        43.84%        48.92%
500                   56.32%       61.00%       62.04%        55.56%        59.08%
750                   58.16%       63.92%       65.60%        62.68%        61.20%
1,000                 58.16%       65.68%       67.76%        65.20%        62.96%
10,000                58.16%       69.96%       68.00%        66.76%        64.04%
Time [d h:mm:ss]  8d 17:58:07  9d 22:19:51  9d 16:57:09  10d 14:48:05  10d 18:40:08

Table 4.24: Final results of best parameter combinations. This experiment runs for 10,000 generations. The best configurations are highlighted in blue.

4.2.10 Analysis of the final result

The best configuration provides the basis for the evolutionary process that generates the best genome. This genome was saved in a text file in ASCII format, which can subsequently be analysed. It includes all neurons with a unique ID, the neuron type, the activation function and some more properties. Additionally, all links are included with the information of the two connected neurons (IDs), the weight and other properties. The number of neurons per type is summed up and shown in table 4.25. This neural network contains in total 1,784 neurons and 2,427 links. The type 'between' also describes hidden neurons, but these cannot be assigned to a specific hidden layer.

Type     Input   Hidden   Between   Output     Sum   Links
Count      144    1,293       343        4   1,784   2,427

Table 4.25: Number of neurons in the final neural network summed up by layer types, the sum of all neurons and the number of links between the neurons.

Figure 4.4 analyses the positions of the neurons and arranges them into layers. During the evolutionary process, seven main hidden layers (coloured light green) arise. Between these main hidden layers, some single scattered hidden neurons (coloured dark green) exist. The used MultiNEAT package includes a Python script that can visualise genomes; it was modified to read saved genomes from a text file, so that all genomes saved over the evolutionary process can be visualised.

[Bar chart: number of neurons (y-axis) per layer type (x-axis), from 144 input neurons over the hidden layers to 4 output neurons.]

Figure 4.4: Analysis of the quantity of neurons in each layer. It contains the input layer (turquoise), seven main hidden layers (light green), eight scattered neurons between the hidden layers (dark green) and the output layer (orange).

Figure 4.5 shows the neural network in the initial state (A) and after 5,000 (B) and 10,000 generations (C). The seven main hidden layers can be recognised well in part C. The links between the neurons are coloured red for negative and blue for positive weights, and the thickness of the lines increases with the weight of the link. These results will also be discussed in the following chapter.

A) Generation 0        B) Generation 5,000        C) Generation 10,000

Figure 4.5: Visualisation of the final neural network in three states after different genera- tions with input (turquoise), hidden (light green) and output neurons (orange). The links can be negative (red) or positive (blue) and the thickness of the line increases with the weight of the link.

5 Discussion and Outlook

This chapter discusses the results of the algorithms analysed before in order to find the best possible solution for path planning problems. Section 5.1 evaluates the achieved results of the MLP algorithm and section 5.2 those of the NEAT algorithm. The result values of all experiments are given as percentages; 100% signifies the best solution and means that the target item has been reached in all grids. Afterwards, a comparison of all used algorithms is made in section 5.3. Additionally, their advantages and disadvantages are pointed out and some suggestions for improvements of the algorithms are given. Section 5.4 gives a short summary of the results of the work. Some ideas giving an outlook on potential future research fields are listed in section 5.5.

5.1 Evaluation of Multilayer Perceptron

The experiments and results of the MLP algorithm are evaluated in this section; only the significant results are discussed. In the first experiment in section 4.1.1, a standard deviation of 2.77% was calculated. The smaller the value of the standard deviation, the more precise the result; in this study the value can be interpreted as acceptable. The standard deviation as well as the arithmetic mean of the default result values of 83.76% are used as reference for interpreting the results of the following experiments. The second experiment in section 4.1.2 showed the results of five training epochs. Especially in the second epoch (first repetition), an improvement of the result value of 10% can be observed. In the following epochs, the result still rose by about 1% each, and the improvement turned out to be smaller the closer the result gets to 100%. These results indicate that repeated training epochs are a well working mechanism to obtain a higher result value. Section 4.1.3 compared different ratios of data sets. The selected combination of 7,500 training sets and 2,500 test sets was used as it can still be optimised and offers sufficient training samples to enable the generalisation of the NN. In this environment the number of possible examples is limited, but in the case of unlimited data sets the results can be improved by using a higher number of training data. The experiment in section 4.1.7 examined the usage of different numbers of neurons in a network with two hidden layers. For one epoch it can be seen that the results for layers with only 10 or 20 neurons are worse. Especially for the results after five epochs, the standard deviation should be taken into account. The final results of NNs with well

working configurations are close together. All combinations with more than 50 or 100 neurons revealed good results of 98% ± 1.00%. The differences between the single values were sometimes only 0.04%; in this case of similarity a comparison is very difficult. In summary, the results got slightly higher with more neurons in the hidden layers. The experiment in section 4.1.8 investigated NNs with different numbers of hidden layers. With a smaller number of neurons, like 10, 20 or 50, the best results were achieved with only one layer, as shown in table 4.9. With a higher number of neurons, like 100 up to 1,000, the best results were obtained by the configuration with two layers; these were the best results in this experiment. Especially in the case of 150 or 200 neurons, the results with two layers are about 10% higher than with one or three layers. Figure 4.2 shows the detailed run with 150 neurons. As already mentioned, the MSE of the runs with one or two hidden layers showed a greater decrease than the runs with three, four or five hidden layers. Additionally, the standard deviation of the MSE is higher with three to five layers, which can be seen by comparing the range of the y-axis of each run. All these observations indicate that an overfitting effect appears in runs with more than two hidden layers. As a consequence, two layers are recommended for the configuration. The goal to reduce the number of unsolved grids during the training process was analysed in section 4.1.9. By using the feature of maximally five repetitions, the number of unsolved grids was reduced from 4,371 to 1,054. With this improvement, the result value rose by about 10%. The faster approach towards 100% can be seen in figure 4.3 by comparing the runs R120 and R122. If a higher number of repetitions was chosen, the number of unsolved grids decreased, but the result value only improved by about 1 to 2%. In the case of maximally 500 repetitions, the number of unsolved grids decreased to 18, but the result value also shrank by 1.5%. These values should be treated with caution as all changes are in the range of the standard deviation. A possible explanation could be that some grids which cannot be solved ran through 500 iterations and had a higher impact on the learning process. The feature with five repetitions was also used in the final configuration. The final run, which used 10 epochs and maximally five repetitions, was very stable and yielded a result of 98.48%. The result of the default configuration, shown in table 4.8, was 99.00%. Finally, with a collective accuracy of 98.74% solved grids, the MLP algorithm reached a very good result. Different parameter combinations were capable of developing a good approximation of the objective function. In general, these types of problems are functions in a high-dimensional space and it is not trivial to make a generalisation, as a lot of dependencies exist. For this example, the suggestion of using two hidden layers with at least 150 neurons in each layer can be given. The best results were achieved by increasing the number of neurons or the number of epochs used for training. This shows very plainly that more than one well working configuration exists. In any case, this type of algorithm is suitable for the given problem. To improve this algorithm, the first recommendation would be to enhance the complexity of the environment. An accuracy of 99% already means a good approximation; improving the value to 100% is not strictly necessary and can require a high effort. There

still remains the uncertainty of the standard deviation: improvements below 3% cannot reliably be attributed to the change of a parameter. By raising the complexity, the general accuracy will fall and improvements can be observed more clearly; for example, it is easier to identify an improvement of 20% than one of 0.01%. A higher complexity also means that the grid can be enlarged or that a different number of items can be placed on the grid. It would also be possible to add dynamic objects which have their own logic for moving across the grid. This topic is also discussed in section 5.3.

5.2 Evaluation of Neuroevolution of Augmenting Topologies

This section evaluates the experiments and results of the NEAT algorithm. The standard deviation of 2.21%, which was calculated in section 4.2.1, can be rated as acceptable; the mean result value of 45.50% is rather low. By evaluating the results of the single sections, the following observations can be made. The experiments in sections 4.2.4 to 4.2.7 pointed out some improvements achieved by changing the values of the parameters. The parameter variation mentioned in section 4.2.7 did not improve the result and thus was not used any further. The different parameter combinations were analysed in section 4.2.9. The first notable result value was the one of the default configuration: with 51.52% it was 6.00% better than the mean result, although the standard deviation of only 2.21% only allows for a range of 4.42%. For the calculation of the mean, more examples should therefore be examined in future research. This effect can be seen more clearly in the comparison of the IDs 1 and 3. In the experiment in section 4.2.4 the result was 66.28% and in the evaluation only 56.28%, which means a decrease of 10.00%. For ID 3 the difference between the two experiments with the same parameters was 11.24%, thus clearly higher than the standard deviation of 2.21%, which means the results spread over a larger range and are not easily reproducible. The best results of the combinations with more IDs are very close together: the values from 58.72% to 62.64% have a smaller range of only 4%. The final result of 69.96% was achieved by raising the number of generations from 1,000 to 10,000. It should be noted that the default configuration with 10,000 generations also yields a result value of 67.80%, as shown in table 4.14. The improvements from combining different values thus turned out to be very small. More runs with the final configuration would have been helpful to make a statement regarding the spread of the results; however, as one run of this configuration takes more than 9 days and 22 hours, it would take a very long time to evaluate different results. The final neural network was analysed in section 4.2.10. In figure 4.5 it can be seen that the algorithm creates more and more hidden layers. In part B, the existence of three main hidden layers can be observed, and part C already shows seven main hidden layers. If the algorithm runs longer, more hidden layers will probably arise. This does not represent a sufficiently working network architecture for the given problem. No structure

which matches the individual objective function can be detected. To interpret the environment in the right way, it is very important to adjust the weights correctly. In every state, the surrounding wall consists of the first six input neurons, some blocks of 12 input neurons and the last six input neurons. It might be expected that links belonging to those neurons receive a very small or negative weight; this effect cannot be seen by comparing the links. In the initial state, the 576 (= 144 · 4) links get different random weights, and the probability that these work well is very small. During the evolutionary process the weights are modified, but at the same time new neurons are added to the network. One possible improvement would be to find a good modification of the weights first and to add neurons only after some generations. The direct modification of the initial parameters is not possible with the used MultiNEAT package, but it might be possible with another framework. An alternative would be to allow only the weight modification and to perform a lot of different runs with random initial values. The best working network could then be saved and its weight values could be used in the next process instead of the randomly chosen initial weights. The first suggestion for an improvement is to make the algorithm executable with more than one thread. The calculation of the fitness values of each genome by propagating all 7,500 training grids is the bottleneck of the algorithm. This process has no dependencies to other parts of the application, so it would be possible to calculate eight fitness values in parallel. This was already tested with the OpenMP API, but without successful results [73]. There also exist other multi-threading libraries which could work better. Another suggestion is to use fewer input values: the surrounding wall can be abstracted and the grid gets a new size of 4x4 values. Maybe it would also be possible to use different numbers for the items player, target, minus and plus; then the input could be reduced to 16 values. The third improvement is to use the extension of NEAT called HyperNEAT, which is briefly explained in section 5.5.
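A minimal sketch of the suggested parallelisation with Python's multiprocessing module is given below. It assumes that the fitness evaluation of a genome can be expressed as a self-contained, picklable function; whether this holds for MultiNEAT genome objects would have to be verified.

from multiprocessing import Pool

def evaluate_genome(genome):
    # Placeholder: build the network from the genome, propagate all 7,500
    # training grids and return the number of solved grids as fitness.
    solved_grids = 0
    return float(solved_grids)

def evaluate_population(genomes, workers=8):
    # The per-genome evaluations are independent of each other, so they can
    # be distributed over several worker processes. On some platforms this
    # must be called from within an "if __name__ == '__main__':" guard.
    with Pool(processes=workers) as pool:
        return pool.map(evaluate_genome, genomes)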

5.3 Comparison of algorithms

This section compares the different used algorithms with each other. At first, the MLP and the NEAT algorithm can be compared. With a value of 98.74%, the final result of the MLP was significantly higher than the result of NEAT, which was only 69.96%. The MLP thus creates a better approximation of the objective function. The processes of the algorithms cannot be compared directly because of their different bases. The MLP algorithm creates only one NN, which is trained on the basis of the training data sets and the back-propagation mechanism. The NEAT algorithm generates a lot of different NNs with an evolutionary process, and all of them are tested with the training data sets. The development of these networks can differ between runs, but the final NN looks similar. In the last epochs, a lot of different solutions exist, but only the best one is used as the final result. The run time of the final result of the NEAT algorithm (9 days 22:19:51 hours) is much longer than that of the MLP algorithm (22:55 minutes). The MultiNEAT library internally uses the OpenMP API for parallelisation and the processing can be distributed to eight threads to calculate the results. Even if this effect is included, the

run time of the NEAT algorithm is up to 75 times longer than that of the MLP algorithm.

These two algorithms can also be compared with the A* algorithm, on the one hand by directly comparing the results, on the other hand by comparing the type of the algorithm. At first, the results of the MLP can be compared with those of the A*, which was tested in the experiments in section 4.1.10. The standard configuration already reached 98.36% of optimal steps (with a result value of 82.28%). This is a good value compared to the final evaluated value of 99.52% (with a result value of 98.48%). These results show that the probability of solving the grid with the shortest path is very high if it was solved successfully. The NEAT algorithm was not compared with the shortest path because its results were generally worse. The second issue is the comparison of the types of algorithm. The A* algorithm can calculate the shortest path from a start position to a target position. The shortest path has to be calculated on the basis of one state of the environment at a specific moment; in the next time step the environment could have changed and it is not guaranteed that the found path is still the shortest one. A continuous restart of the calculation and the creation of a new graph are necessary for finding the optimal path in every time step. In general, the A* algorithm can only find a solution in a fixed structure (model based). Models can be very complex and defining them is a lot of work. The great advantage of the RL approach is that no knowledge of the model is necessary. Instead of the model, the RL algorithm uses a policy which defines the goal (policy based). Thereby, the algorithm is much more flexible and can even solve highly complex problems. The RL algorithm can be used, for example, to find correct decisions in a decision tree with a high complexity; as already mentioned in section 1.3, this algorithm was also used in AlphaGo. The A* algorithm can only be used to find the shortest path based on different costs of a link, but it is not able to solve such complex problems without building a model. Another advantage of the approach with the NN is that images can be used as input. Mnih et al. showed this with different Atari games and used the current state of the screen as input of the NN [67]. For this it is necessary to use convolutional layers which analyse the image with the help of filter kernels. If an image should be used as input of the A* algorithm, it has to be converted to a graph structure first. In all cases, the RL algorithms are more flexible and more useful in complex environments. Another perspective is to compare the type of output of both algorithms. The A* algorithm outputs the shortest path to the target node as a list of nodes which should be visited next; the steering command has to be derived from it. The MLP algorithm directly outputs the steering commands like up, down, left and right; in this example the action with the highest value is executed. An improvement of this configuration would be to change the discrete actions (up, down, left, right) to continuously performed actions (forward, backward, turn left, turn right). Thus the player could move continuously without the underlying grid. Additionally, the magnitude of the output value becomes more important as it controls the intensity of the action (for example 5 m/s forward and turn left at 2°/s). This extended feature cannot be realised by the A* algorithm.

The last comparison concerns the A* and the Q-learning algorithm from section 2.2.1. For this discussion, both algorithms should be applied to the grid example. As already mentioned, 43,472 constellations of the four items on the grid exist. The A* algorithm is applied to only one of these grids and finds the shortest path. The goal of the Q-learning algorithm is to find the optimal action (up, down, left, right) for all player positions in all possible grids. The solutions are saved in the Q-matrix; for the grid example, the Q-matrix has 43,472 rows for all situations and four columns for the possible actions. After the full Q-matrix has been calculated with enough training samples, an action can be found for each state. Thereby, the full knowledge of all possible paths is saved in the Q-matrix. The A* algorithm can only calculate a solution for one state at a time and is not able to save past paths. If the problem gets more complex and the Q-matrix grows increasingly larger, it becomes a huge effort to calculate all the values. Therefore, the Q-matrix is replaced with a NN which estimates the Q-values. This is the final algorithm as implemented in section 3.3: the NN approximates the objective function which outputs the optimal actions for the given state.
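The tabular Q-learning update sketched here can be written down in a few lines; the learning rate and discount factor are illustrative values, not parameters taken from the implementation in section 3.3.

import numpy as np

N_STATES, N_ACTIONS = 43472, 4      # grid constellations x actions (up, down, left, right)
Q = np.zeros((N_STATES, N_ACTIONS))

def q_learning_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Move Q(s, a) towards the observed reward plus the discounted best
    # Q-value of the successor state.
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])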

5.4 Summary

This section summarises the results and discussions of the used algorithms. The current study has shown that the reinforcement learning based multilayer perceptron approach achieves a far better result (98.74%) than the Neuroevolution of Augmenting Topologies algorithm (69.96%). Both algorithms use some modifications to reach their results. The MLP algorithm solved most of its successfully solved grids (99.52%) with the same number of steps as the A* algorithm. Regarding complexity, the MLP approach can be used in many more problem settings: it does not need a complex model of the problem and proceeds with a policy (policy based). On the other hand, the A* algorithm certainly finds the shortest path in a defined model (model based). Thus it is not as flexible as the MLP and cannot be applied to problems of the same complexity. In a plain path planning problem with a fixed model and no dynamic changes, A* can still be considered state of the art. If the problem exhibits an increasing complexity and external effects emerge, the RL based MLP approach turns out to be the better choice; a lot of different complex control situations can be solved with it. In the field of autonomous driving, two levels of path planning can be distinguished. The navigation level plans the whole path from the start to the target position. It is based on the map and includes information like turns at crossroads. The tactical level only calculates the path for a shorter distance starting at the current car position (for example 10 to 30 m in cities or 50 to 300 m on highways). Different sensor data enable the perception of the environment. It includes the standard driving functions and additionally actions like lane changes or overtaking manoeuvres. In general, the navigation level can be seen as a corridor where the main way is already defined; the detailed decisions have to be made by the tactical level and happen inside the corridor. The A* algorithm is able to perform the navigation level very well, whereas the RL based MLP

algorithm succeeds in executing the tactical level. This leads to the conclusion that both algorithms could be helpful for navigation tasks of autonomously driving cars in the future; with this division, the benefits of both algorithms can be used. Regarding new AI based methods, however, it can be questioned whether a dedicated planning module will still be needed in a few years.

5.5 Outlook

This section outlines some ideas for possible future research in the field of this study, although some smaller improvements and outlooks have already been listed in the sections before. The extension of the NEAT algorithm is called 'Hypercube-based NeuroEvolution of Augmenting Topologies', or HyperNEAT for short. The word hypercube is used because the inputs of the NN (or CPPN) use a connectivity pattern with at least four dimensions [31]. The geometric layout of the connectivity pattern is called substrate. Each connection of the substrate is used as input of the CPPN. This CPPN is evolved with the NEAT algorithm and yields a value which is set as the weight of the corresponding connection. A detailed explanation of this algorithm can be found on the HyperNEAT user page [31]. An advantage of the HyperNEAT algorithm is its capability of dealing with a large number of inputs. Additionally, the geometric relationship between input and output values can be included in the evolutionary process. It could be an important improvement to use this algorithm for the grid problem. The policies defined in the grid used in this study showed no high complexity, as only three types of rewards exist (found target, step, found negative item). In the case of a larger number of complex policies, the definition and organisation of all policies would be challenging. Different policies could be, for example, to find the target, to hold the lane, to drive fast, to drive safely, etc. Some policies, like holding the right distance to an object, can be used in different cases. Therefore, different possible architectures for different cases could emerge. Another possible question is the accuracy of the network performance in a real scenario when the network has only been trained in a simulated environment. For example, the algorithm could be trained in a car simulation environment using the images from the cockpit's viewpoint onto the street as input. After learning the basic driving functions, it could be tested with real images to examine the ability to transfer the knowledge to similar environments. Another issue is the safety of the NN. Generally, different NNs can exist which are trained with different training sets but produce similar results. For example, one NN could be trained by rewarding high speed and another NN by rewarding safety. It could be analysed whether particular advantages emerge from combining the information in one NN or from keeping it separated in different NNs. Additionally, it could be tested whether it makes sense to use a second NN as a safeguard which is trained for the same goal with different reward functions or different policies.

The last point addresses the classical path planning applications where A* is still in use. Through an AI based approach, traffic flow effects can be integrated better into the planning part. External live services like weather services, the current traffic situation, problems of the public transport etc. can be used. Additionally, the knowledge about the fastest driving routes at particular times of day, acquired from past drives, could be included. If there is a possibility to use the data set of a fleet of cars, more information like traffic-light control could be learned. Intelligent Car-to-X approaches could also be integrated into the path planning process. To sum up, there are a lot of interesting issues in the field of AI based algorithms in the development of autonomously driving cars, which should be part of future research.

Appendix

A XOR function example step-by-step

The goal of this example is to understand the function of the back-propagation algorithm, which is the basic method in the learning process of a neural network. At first a forward propagation is needed to calculate the network error. The second step is to back-propagate this error which changes the weights of the network. To see the result of the weight change, one more forward propagation will be calculated. These three steps will be calculated in detail for a clear understanding. This section uses parts of the examples by Lämmel and Cleve [57] and Miller [66]. As a quick reminder, table A.1 shows the different combinations of input and output of the XOR function (see 2.1).

Input 1   Input 2   Output
      0         0        0
      0         1        1
      1         0        1
      1         1        0

Table A.1: All possible combinations of input and output of the XOR function.

The neural network is defined as an MLP with two input neurons, three hidden neurons and one output neuron, as shown in figure A.1. The input neurons are labelled i1 and i2, the hidden neurons h1, h2, h3 and the output neuron o. The weights of the connections are initialised with random values between 0 and 1 and labelled w1 to w9.

Forward propagation 1

Now the first forward propagation step can be started. As the first input, the last data set with i1 = 1 and i2 = 1 is used. The input is forwarded to the hidden neurons: all inputs have to be multiplied with the weights and summed up, and to get the output of a neuron, this sum has to pass through the activation function. This has to be calculated two times, firstly from the input to the hidden layer and secondly from the hidden to the output layer. The calculation steps for all neurons in the same layer can be done together by using matrix multiplications

\begin{pmatrix} h_{1,\mathrm{input}} \\ h_{2,\mathrm{input}} \\ h_{3,\mathrm{input}} \end{pmatrix}
= \begin{pmatrix} w_1 & w_2 \\ w_3 & w_4 \\ w_5 & w_6 \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \end{pmatrix}
= \begin{pmatrix} 0.8 & 0.2 \\ 0.4 & 0.9 \\ 0.3 & 0.5 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 1.0 \\ 1.3 \\ 0.8 \end{pmatrix}.    (A.1)

These input values have to pass through the activation function to get the neuron output. For this example all neurons should use the logistic function

f(x) = \frac{1}{1 + e^{-x}}    (A.2)

which was already highlighted in equation 2.14. The output of the hidden neurons can be calculated by using the equation

\begin{pmatrix} h_{1,\mathrm{output}} \\ h_{2,\mathrm{output}} \\ h_{3,\mathrm{output}} \end{pmatrix}
= f\!\begin{pmatrix} h_{1,\mathrm{input}} \\ h_{2,\mathrm{input}} \\ h_{3,\mathrm{input}} \end{pmatrix}
= f\!\begin{pmatrix} 1.0 \\ 1.3 \\ 0.8 \end{pmatrix}
= \begin{pmatrix} 0.73 \\ 0.79 \\ 0.69 \end{pmatrix}.    (A.3)

This step can be repeated for the output neuron with

o_{\mathrm{input}} = \begin{pmatrix} w_7 \\ w_8 \\ w_9 \end{pmatrix}^{T}
\begin{pmatrix} h_{1,\mathrm{output}} \\ h_{2,\mathrm{output}} \\ h_{3,\mathrm{output}} \end{pmatrix}
= \begin{pmatrix} 0.3 \\ 0.5 \\ 0.9 \end{pmatrix}^{T}
\begin{pmatrix} 0.73 \\ 0.79 \\ 0.69 \end{pmatrix} = 1.2    (A.4)

and the result passes through f with

o_{\mathrm{output}} = f(o_{\mathrm{input}}) = f(1.2) = 0.77.    (A.5)

The results of the calculations are visualised in figure A.2 A and are coloured red. The input is displayed in smaller digits in the upper half of a neuron and the output in bigger digits in the middle of a neuron.


Figure A.1: NN with two neurons in the input layer (coloured green), three neurons in the hidden layer (coloured blue) and one neuron in the output layer (coloured yellow) with randomly initialised weights (coloured red) [66].

[Figure A.2, panels: A) Forward Propagation 1   B) Back-propagation   C) Forward Propagation 2]


Figure A.2: Neural network with different weights in the three states of forward propagation 1, back-propagation and forward propagation 2. The updated values in each state are coloured red [66].

Back-propagation

The target result of the XOR function should be 0 but the calculated result is 0.77. The error can now be scaled down with the back-propagation method. This calculation is done backwards, at first between the output and the hidden layer and then between the hidden and the input layer. The new weight between a neuron i and neuron j can be calculated by using the equation

w_{ij}' = w_{ij} + \Delta w_{ij}    (A.6)

with

\Delta w_{ij} = \alpha \, o_i \, \delta_j    (A.7)

as difference between the old and the new weight, with α as learning rate and o_i as output of neuron i. The learning rate α is set to 1 in this example to simplify the calculations. The error signal δ_j is defined as

\delta_j = \begin{cases}
f'(o_{\mathrm{input}}) \,(o_{\mathrm{target}} - o_{\mathrm{calculated}}), & \text{if } j \text{ is an output neuron} \\
f'(h_{j,\mathrm{input}}) \sum_k \delta_k \, w_{jk}, & \text{if } j \text{ is a hidden neuron}
\end{cases}    (A.8)

with k as index of the neurons behind neuron j and the derivative of f(x) as

f'(x) = f(x)\,(1 - f(x)).    (A.9)

With this equation the whole back-prop procedure can be calculated. The results of the forward propagation are used to calculate the error signal of the output neuron with

\delta_o = f'(o_{\mathrm{input}}) \,(o_{\mathrm{target}} - o_{\mathrm{calculated}}) = f'(1.2)\,(0 - 0.77) = -0.14.    (A.10)

The result is used in the calculation of the δ values of the hidden neurons with

\delta_{h1} = f'(h_{1,\mathrm{input}}) \,\delta_o\, w_7 = f'(1.0) \cdot (-0.14) \cdot 0.3 = -0.01,    (A.11)

\delta_{h2} = f'(h_{2,\mathrm{input}}) \,\delta_o\, w_8 = f'(1.3) \cdot (-0.14) \cdot 0.5 = -0.01,    (A.12)

\delta_{h3} = f'(h_{3,\mathrm{input}}) \,\delta_o\, w_9 = f'(0.8) \cdot (-0.14) \cdot 0.9 = -0.03.    (A.13)

With these results it is possible to calculate the weight changes between the hidden and the output layer with

\begin{pmatrix} \Delta w_7 \\ \Delta w_8 \\ \Delta w_9 \end{pmatrix}
= \alpha \begin{pmatrix} h_{1,\mathrm{output}} \\ h_{2,\mathrm{output}} \\ h_{3,\mathrm{output}} \end{pmatrix} \delta_o
= 1 \cdot \begin{pmatrix} 0.73 \\ 0.79 \\ 0.69 \end{pmatrix} \cdot (-0.14)
= \begin{pmatrix} -0.10 \\ -0.11 \\ -0.09 \end{pmatrix}    (A.14)

and between the input and the hidden layer with

\begin{pmatrix} \Delta w_1 & \Delta w_3 & \Delta w_5 \\ \Delta w_2 & \Delta w_4 & \Delta w_6 \end{pmatrix}
= \alpha \begin{pmatrix} i_1 \\ i_2 \end{pmatrix}
\begin{pmatrix} \delta_{h1} & \delta_{h2} & \delta_{h3} \end{pmatrix}
= 1 \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix}
\begin{pmatrix} -0.01 & -0.01 & -0.03 \end{pmatrix}
= \begin{pmatrix} -0.01 & -0.01 & -0.03 \\ -0.01 & -0.01 & -0.03 \end{pmatrix}.    (A.15)

These values represent the change of each internal weight and have to be summed up with the original weights. This can be calculated in vector notation as

\begin{pmatrix} w_1' \\ w_2' \\ w_3' \\ w_4' \\ w_5' \\ w_6' \\ w_7' \\ w_8' \\ w_9' \end{pmatrix}
= \begin{pmatrix} w_1 + \Delta w_1 \\ w_2 + \Delta w_2 \\ w_3 + \Delta w_3 \\ w_4 + \Delta w_4 \\ w_5 + \Delta w_5 \\ w_6 + \Delta w_6 \\ w_7 + \Delta w_7 \\ w_8 + \Delta w_8 \\ w_9 + \Delta w_9 \end{pmatrix}
= \begin{pmatrix} 0.8 - 0.01 \\ 0.2 - 0.01 \\ 0.4 - 0.01 \\ 0.9 - 0.01 \\ 0.3 - 0.03 \\ 0.5 - 0.03 \\ 0.3 - 0.10 \\ 0.5 - 0.11 \\ 0.9 - 0.09 \end{pmatrix}
= \begin{pmatrix} 0.79 \\ 0.19 \\ 0.39 \\ 0.89 \\ 0.27 \\ 0.47 \\ 0.20 \\ 0.39 \\ 0.81 \end{pmatrix}.    (A.16)

The new weights are displayed in figure A.2 B. After this the back-propagation process is finished.

Forward propagation 2

One more forward propagation is required to see the changes in the output. It is similar to equations A.1 to A.5 and is done with the new weights. The new inputs of the hidden layer can be calculated as

\begin{pmatrix} h_{1,\mathrm{input}} \\ h_{2,\mathrm{input}} \\ h_{3,\mathrm{input}} \end{pmatrix}
= \begin{pmatrix} w_1' & w_2' \\ w_3' & w_4' \\ w_5' & w_6' \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \end{pmatrix}
= \begin{pmatrix} 0.79 & 0.19 \\ 0.39 & 0.89 \\ 0.27 & 0.47 \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
= \begin{pmatrix} 0.98 \\ 1.28 \\ 0.74 \end{pmatrix}    (A.17)

and the output as

\begin{pmatrix} h_{1,\mathrm{output}} \\ h_{2,\mathrm{output}} \\ h_{3,\mathrm{output}} \end{pmatrix}
= f\!\begin{pmatrix} h_{1,\mathrm{input}} \\ h_{2,\mathrm{input}} \\ h_{3,\mathrm{input}} \end{pmatrix}
= f\!\begin{pmatrix} 0.98 \\ 1.28 \\ 0.74 \end{pmatrix}
= \begin{pmatrix} 0.73 \\ 0.78 \\ 0.67 \end{pmatrix}.    (A.18)

The same calculations can be done for the output neuron with

o_{\mathrm{input}} = \begin{pmatrix} w_7' \\ w_8' \\ w_9' \end{pmatrix}^{T}
\begin{pmatrix} h_{1,\mathrm{output}} \\ h_{2,\mathrm{output}} \\ h_{3,\mathrm{output}} \end{pmatrix}
= \begin{pmatrix} 0.20 \\ 0.39 \\ 0.81 \end{pmatrix}^{T}
\begin{pmatrix} 0.73 \\ 0.78 \\ 0.67 \end{pmatrix} = 0.99    (A.19)

and passing through f with

o_{\mathrm{output}} = f(o_{\mathrm{input}}) = f(0.99) = 0.73.    (A.20)
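The three calculation steps of this example can be reproduced with a few lines of NumPy. The sketch below mirrors the worked example (logistic activation, learning rate 1, no bias units); the variable names are chosen freely and are not part of the thesis implementation.

import numpy as np

def f(x):          # logistic function (equation A.2)
    return 1.0 / (1.0 + np.exp(-x))

def df(x):         # its derivative (equation A.9)
    return f(x) * (1.0 - f(x))

# Weights of the 2-3-1 network: w1..w6 from input to hidden, w7..w9 to output.
W_ih = np.array([[0.8, 0.2],    # w1, w2
                 [0.4, 0.9],    # w3, w4
                 [0.3, 0.5]])   # w5, w6
w_ho = np.array([0.3, 0.5, 0.9])  # w7, w8, w9

i = np.array([1.0, 1.0])   # input pattern i1 = i2 = 1
target = 0.0               # expected XOR output
alpha = 1.0                # learning rate

# Forward propagation 1 (A.1 to A.5).
h_in = W_ih @ i
h_out = f(h_in)
o_in = w_ho @ h_out
o_out = f(o_in)                           # approx. 0.77

# Back-propagation (A.6 to A.16); delta_h uses the old hidden-to-output weights.
delta_o = df(o_in) * (target - o_out)     # approx. -0.14
delta_h = df(h_in) * delta_o * w_ho       # approx. (-0.01, -0.01, -0.03)
w_ho += alpha * h_out * delta_o
W_ih += alpha * np.outer(delta_h, i)

# Forward propagation 2 (A.17 to A.20): the output moves towards the target.
print(f(w_ho @ f(W_ih @ i)))              # approx. 0.73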

The result of the second forward propagation is also visualised in figure A.2 C. It can be seen that the calculated result has moved a bit in the direction of the target value (from 0.77 to 0.73; the target is 0). This example shows that the change from one step to the next is very small and that more iterations are needed to train the network until it works properly; the training in this case can take up to 1,000 steps. Table 2.2 shows the development of this value after different numbers of iterations. It is important for the learning process that each step is done with different inputs. The four different cases should be chosen in alternating random order. Another possibility is to use the batch procedure, which propagates several data sets through the network and performs the weight correction based on all errors. In this example no bias units were used because the calculations would have been more time-consuming. If an input is 0, the back-propagation calculates 0 for Δw, as defined in equation A.15, and the corresponding weights will not change. This negative effect can be avoided by inserting a bias unit which is used as an additional input for all neurons in the hidden layer. In this case, the hidden neurons get the input 0 from the input neuron and a value different from 0 from the bias neuron.

B NEAT parameter list

////////////////////// // Basic parameters // //////////////////////

// Size of population PopulationSize = 300;

// If true, this enables dynamic compatibility thresholding // It will keep the number of species between MinSpecies and MaxSpecies DynamicCompatibility = true;

// Minimum number of species MinSpecies =5;

// Maximum number of species MaxSpecies = 10;

// Don’t wipe the innovation database each generation? InnovationsForever = true;

// Allow clones or nearly identical genomes to exist simultaneously in // the population. This is useful for non-deterministic environments, // as the same individual will get more than one chance to prove himself, // also there will be more chances the same individual to mutate in // different ways. The drawback is greatly increased time for . // If you want to search quickly, yet less efficient, leave this to true. AllowClones = true;

/////////////////// // GA Parameters // ///////////////////

// Age treshold, meaning if a species is below it, it is considered young YoungAgeTreshold =5;

// Fitness boost multiplier for young species (1.0 means no boost) // Make sure it is >= 1.0 to avoid confusion YoungAgeFitnessBoost = 1.1;

// Number of generations without improvement (stagnation) allowed for a species SpeciesMaxStagnation = 50;

// Minimum jump in fitness necessary to be considered as improvement. // Setting this value to 0.0 makes the system to behave like regular NEAT. StagnationDelta = 0.0;

VI // Age threshold, meaning if a species is above it, it is considered old OldAgeTreshold = 30;

// Multiplier that penalizes old species. // Make sure it is <= 1.0 to avoid confusion. OldAgePenalty = 0.5;

// Detect competitive coevolution stagnation // This kills the worst species of age >N (each X generations) DetectCompetetiveCoevolutionStagnation = false;

// Each X generation.. KillWorstSpeciesEach = 15;

// Of age above.. KillWorstAge = 10;

// Percent of best individuals that are allowed to reproduce. 1.0 =100% SurvivalRate = 0.25;

// Probability for a baby to result from sexual reproduction // (crossover/mating). 1.0 = 100% // If asexual reproduction is chosen, the baby will be mutated 100% CrossoverRate = 0.7;

// If a baby results from sexual reproduction, this probability determines // if mutation will be performed after crossover. // 1.0 = 100% (always mutate after crossover) OverallMutationRate = 0.25;

// Probability for a baby to result from inter-species mating. InterspeciesCrossoverRate = 0.0001;

// Probability for a baby to result from Multipoint Crossover when mating. // 1.0 = 100%. The default is the Average mating. MultipointCrossoverRate = 0.75;

// Performing roulette wheel selection or not? RouletteWheelSelection = false;

// For tournament selection TournamentSize =4;

// Fraction of individuals to be copied unchanged EliteFraction = 0.01; ////////////////////////////// // Phased Search parameters // //////////////////////////////

// Using phased search or not PhasedSearching = false;

// Using delta coding or not DeltaCoding = false;

// What is the MPC + base MPC needed to begin simplifying phase SimplifyingPhaseMPCTreshold = 20;

// How many generations of global stagnation should have passed to // enter simplifying phase SimplifyingPhaseStagnationTreshold = 30;

// How many generations of MPC stagnation are needed to turn back // on complexifying ComplexityFloorGenerations = 40;

/////////////////////////////// // Novelty Search parameters // ///////////////////////////////

// the K constant NoveltySearch_K = 15;

// Sparseness treshold. Add to the archive if above NoveltySearch_P_min = 0.5;

// Dynamic Pmin? NoveltySearch_Dynamic_Pmin = true;

// How many evaluations should pass without adding to the archive // in order to lower Pmin NoveltySearch_No_Archiving_Stagnation_Treshold = 150;

// How should it be multiplied (make it less than 1.0) NoveltySearch_Pmin_lowering_multiplier = 0.9;

// Not lower than this value NoveltySearch_Pmin_min = 0.05;

// How many one-after-another additions to the archive should // pass in order to raise Pmin NoveltySearch_Quick_Archiving_Min_Evaluations =8;

// How should it be multiplied (make it more than 1.0) NoveltySearch_Pmin_raising_multiplier = 1.1;

// Per how many evaluations to recompute the sparseness of the population NoveltySearch_Recompute_Sparseness_Each = 25;

//////////////////////////////////// // Structural Mutation parameters // ////////////////////////////////////

// Probability for a baby to be mutated with the Add-Neuron mutation. MutateAddNeuronProb = 0.01;

// Allow splitting of any recurrent links SplitRecurrent = true;

// Allow splitting of looped recurrent links SplitLoopedRecurrent = true;

// Probability for a baby to be mutated with the Add-Link mutation MutateAddLinkProb = 0.03;

// Probability for a new incoming link to be from the bias neuron; // This enforces it. A value of 0.0 doesn’t mean there will not be such links MutateAddLinkFromBiasProb = 0.0;

// Probability for a baby to be mutated with the Remove-Link mutation MutateRemLinkProb = 0.0;

// Probability for a baby that a simple neuron will be replaced with a link MutateRemSimpleNeuronProb = 0.0;

// Maximum number of tries to find 2 neurons to add/remove a link LinkTries = 31;

// Probability that a link mutation will be made recurrent RecurrentProb = 0.25;

// Probability that a recurrent link mutation will be looped RecurrentLoopProb = 0.25; /////////////////////////////////// // Parameter Mutation parameters // ///////////////////////////////////

///////////////////////////////////
// Parameter Mutation parameters //
///////////////////////////////////

// Probability for a baby’s weights to be mutated
MutateWeightsProb = 0.90;

// Probability for a severe (shaking) weight mutation
MutateWeightsSevereProb = 0.25;

// Probability for a particular gene’s weight to be mutated. 1.0 = 100%
WeightMutationRate = 1.0;

// Maximum perturbation for a weight mutation
WeightMutationMaxPower = 1.0;

// Maximum magnitude of a replaced weight
WeightReplacementMaxPower = 1.0;

// Maximum absolute magnitude of a weight
MaxWeight = 8.0;
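// ----------------------------------------------------------------------------
// Illustrative sketch only (not MultiNEAT source code): per-gene weight mutation
// as the parameters above suggest. The exact meaning of a "severe" mutation
// differs between implementations; here it is read as replacing more weights
// outright instead of only perturbing them.
#include <algorithm>
#include <random>
#include <vector>

inline void mutate_weights(std::vector<double>& weights, std::mt19937& rng)
{
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) >= 0.90)                         // MutateWeightsProb
        return;

    const bool severe = coin(rng) < 0.25;          // MutateWeightsSevereProb
    std::uniform_real_distribution<double> perturb(-1.0, 1.0);   // WeightMutationMaxPower
    std::uniform_real_distribution<double> replace(-1.0, 1.0);   // WeightReplacementMaxPower

    for (double& w : weights) {
        // WeightMutationRate = 1.0: every gene of the selected genome is touched.
        if (severe && coin(rng) < 0.5)
            w = replace(rng);                      // draw a completely new weight
        else
            w += perturb(rng);                     // small perturbation
        w = std::clamp(w, -8.0, 8.0);              // MaxWeight
    }
}
// ----------------------------------------------------------------------------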

// Probability for a baby’s A activation function parameters to be perturbed
MutateActivationAProb = 0.0;

// Probability for a baby’s B activation function parameters to be perturbed
MutateActivationBProb = 0.0;

// Maximum magnitude for the A parameter perturbation
ActivationAMutationMaxPower = 0.0;

// Maximum magnitude for the B parameter perturbation
ActivationBMutationMaxPower = 0.0;

// Activation parameter A min/max
MinActivationA = 1.0;
MaxActivationA = 1.0;

// Activation parameter B min/max
MinActivationB = 0.0;
MaxActivationB = 0.0;

// Maximum magnitude for time constants perturbation
TimeConstantMutationMaxPower = 0.0;

// Maximum magnitude for biases perturbation
BiasMutationMaxPower = WeightMutationMaxPower;

// Probability for a baby’s neuron time constant values to be mutated
MutateNeuronTimeConstantsProb = 0.0;

// Probability for a baby’s neuron bias values to be mutated
MutateNeuronBiasesProb = 0.0;

// Time constant range
MinNeuronTimeConstant = 0.0;
MaxNeuronTimeConstant = 0.0;

// Bias range
MinNeuronBias = 0.0;
MaxNeuronBias = 0.0;

// Probability for a baby that an activation function type will be changed
// for a single neuron; considered a structural mutation because of the
// large impact on fitness
MutateNeuronActivationTypeProb = 0.0;

// Probabilities for a particular activation function appearance
ActivationFunction_SignedSigmoid_Prob = 0.0;
ActivationFunction_UnsignedSigmoid_Prob = 1.0;
ActivationFunction_Tanh_Prob = 0.0;
ActivationFunction_TanhCubic_Prob = 0.0;
ActivationFunction_SignedStep_Prob = 0.0;
ActivationFunction_UnsignedStep_Prob = 0.0;
ActivationFunction_SignedGauss_Prob = 0.0;
ActivationFunction_UnsignedGauss_Prob = 0.0;
ActivationFunction_Abs_Prob = 0.0;
ActivationFunction_SignedSine_Prob = 0.0;
ActivationFunction_UnsignedSine_Prob = 0.0;
ActivationFunction_Linear_Prob = 0.0;
ActivationFunction_Relu_Prob = 0.0;
ActivationFunction_Softplus_Prob = 0.0;
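// ----------------------------------------------------------------------------
// Illustrative sketch only (not MultiNEAT source code): choosing a new
// activation function type with the relative weights listed above. With this
// configuration only the unsigned sigmoid can ever be drawn; the enum lists
// just a few of the types for brevity.
#include <random>
#include <vector>

enum class Activation { SignedSigmoid, UnsignedSigmoid, Tanh, Linear, Relu };

inline Activation pick_activation(std::mt19937& rng)
{
    // Order matches the enum above; values taken from the probabilities above.
    const std::vector<double> weights = { 0.0, 1.0, 0.0, 0.0, 0.0 };
    std::discrete_distribution<int> roulette(weights.begin(), weights.end());
    return static_cast<Activation>(roulette(rng));
}
// ----------------------------------------------------------------------------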

// Trait mutation probabilities
MutateNeuronTraitsProb = 1.0;
MutateLinkTraitsProb = 1.0;
MutateGenomeTraitsProb = 1.0;

//////////////////////////////////
// Genome properties parameters //
//////////////////////////////////

// When true, don’t have a special bias neuron and treat all inputs equal
DontUseBiasNeuron = false;

// When false, this prevents any recurrent pathways in the genomes from forming
AllowLoops = true;

///////////////////////////
// Speciation parameters //
///////////////////////////

// Percent of disjoint genes importance
DisjointCoeff = 1.0;

// Percent of excess genes importance
ExcessCoeff = 1.0;

// Average weight difference importance
WeightDiffCoeff = 0.5;

// Node-specific activation parameter A difference importance
ActivationADiffCoeff = 0.0;

// Node-specific activation parameter B difference importance
ActivationBDiffCoeff = 0.0;

// Average time constant difference importance
TimeConstantDiffCoeff = 0.0;

// Average bias difference importance
BiasDiffCoeff = 0.0;

// Activation function type difference importance
ActivationFunctionDiffCoeff = 0.0;
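// ----------------------------------------------------------------------------
// Illustrative sketch only (not MultiNEAT source code): the NEAT compatibility
// distance that the coefficients above weight (cf. Stanley and Miikkulainen [85]):
//   delta = DisjointCoeff * D / N + ExcessCoeff * E / N + WeightDiffCoeff * W
// where D and E are the numbers of disjoint and excess genes, N the size of the
// larger genome and W the average weight difference of matching genes. The
// remaining coefficients add analogous difference terms for activation
// parameters, time constants, biases and activation function types.
inline double compatibility(int disjoint, int excess, int larger_genome_size,
                            double avg_weight_diff)
{
    const double n = static_cast<double>(larger_genome_size);
    return 1.0 * disjoint / n        // DisjointCoeff
         + 1.0 * excess   / n        // ExcessCoeff
         + 0.5 * avg_weight_diff;    // WeightDiffCoeff
}
// ----------------------------------------------------------------------------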

// Compatibility threshold
CompatTreshold = 5.0;

// Minimal value of the compatibility threshold
MinCompatTreshold = 0.1;

// Modifier per generation for keeping the species stable
CompatTresholdModifier = 0.3;
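// ----------------------------------------------------------------------------
// Illustrative sketch only (not MultiNEAT source code): the usual way such a
// compatibility threshold is kept adaptive. Every CompatTreshChangeInterval
// generation (or evaluation) the threshold is nudged by CompatTresholdModifier
// so that the species count drifts towards a target, but never falls below
// MinCompatTreshold. The target species count is an assumed external setting.
#include <algorithm>

inline double adjust_compat_threshold(double threshold, int num_species, int target_species)
{
    const double modifier = 0.3;                  // CompatTresholdModifier
    if (num_species > target_species)
        threshold += modifier;                    // fewer, larger species
    else if (num_species < target_species)
        threshold -= modifier;                    // more, smaller species
    return std::max(threshold, 0.1);              // MinCompatTreshold
}
// ----------------------------------------------------------------------------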

// Per how many generations to change the threshold (used in generational mode)
CompatTreshChangeInterval_Generations = 1;

// Per how many evaluations to change the threshold (used in steady state mode)
CompatTreshChangeInterval_Evaluations = 10;
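// ----------------------------------------------------------------------------
// Minimal usage sketch, assuming the MultiNEAT C++ interface [21] exposes a
// NEAT::Parameters struct whose public members use the same names as the
// entries in this listing; the header name and member names are taken from the
// listing and the repository, not verified signatures.
#include "Parameters.h"   // from the MultiNEAT sources

inline NEAT::Parameters make_thesis_parameters()
{
    NEAT::Parameters params;
    params.SurvivalRate   = 0.25;
    params.CrossoverRate  = 0.7;
    params.CompatTreshold = 5.0;
    // ... remaining values as listed above
    return params;
}
// ----------------------------------------------------------------------------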

Bibliography

[1] AAAI. Association for the Advancement of Artificial Intelligence. https://www.aaai.org/home.html. Accessed on 20.07.2017.
[2] Abu-Mostafa, Yaser S., Magdon-Ismail, Malik, and Lin, Hsuan-Tien. Learning from Data. AMLbook, 2012. isbn: 978-1-60049-006-4.
[3] Achour, Nouara and Chaalal, Mohamed. Mobile robots path planning using genetic algorithms. In: The Seventh International Conference on Autonomic and Autonomous Systems, 2011.
[4] Artificial Intelligence Foundation of computational agents. Agents Situated in Environments. http://artint.info/html/ArtInt_7.html. Accessed on 21.07.2017.
[5] Artificial Intelligence Foundation of computational agents. Q-learning. http://artint.info/html/ArtInt_265.html. Accessed on 23.07.2017.
[6] BASt. BASt Bundesanstalt für Straßenwesen. http://www.bast.de/DE/Home/home_node.html. Accessed on 24.08.2017.
[7] Bec Crew (ScienceAlert). Driverless Cars Could Reduce Traffic Fatalities by Up to 90%, Says Report. https://www.sciencealert.com/driverless-cars-could-reduce-traffic-fatalities-by-up-to-90-says-report. Accessed on 28.08.2017.
[8] Behzadi, Saeed and Alesheikh, Ali A. Developing a Genetic Algorithm for Solving Shortest Path Problem. In: WSEAS International Conference on URBAN PLANNING and TRANSPORTATION UPT07, 2008.
[9] Belagiannis, Vasileios and Zisserman, Andrew. Recurrent human pose estimation. In: IEEE, 2017.
[10] Bell, Michael G.H. Hyperstar: A multi-path Astar algorithm for risk averse vehicle navigation. In: Elsevier, 2009.
[11] Berger, Christoph. Perceptron - the most basic form of a neural network. https://appliedgo.net/perceptron/. Accessed on 25.07.2017.
[12] Björnsson, Yngvi, Enzenberger, Markus, Holte, Robert C, and Schaeffer, Jonathan. Fringe Search: Beating A* at Pathfinding on Game Maps. In: CIG 5:125-132, 2005.
[13] Blum, Christian. Ant colony optimization: Introduction and recent trends. In: Elsevier, 2005.
[14] Brücken, Timo (Gründerszene). The Disruptive Power of Artificial Intelligence. https://www.gruenderszene.de/allgemein/deepl-maschineller-uebersetzer-linguee-google-translate. Accessed on 01.09.2017.
[15] Brownlee, Jason. A Tour of Machine Learning Algorithms. http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms. Accessed on 23.07.2017.
[16] Burgard, Wolfram and Nebel, Bernhard. Foundations on AI, Albert-Ludwigs-Universität Freiburg. http://gki.informatik.uni-freiburg.de/teaching/ss08/gki/ai01.pdf. Accessed on 20.07.2017.
[17] Carl von Ossietzky Universität Oldenburg. Künstliche Intelligenz. http://www.informatik.uni-oldenburg.de/~iug08/ki/Grundlagen_Starke_KI_vs._Schwache_KI.html. Accessed on 21.07.2017.
[18] CBinsights. AI 100: The Artificial Intelligence Startups Redefining Industries. https://www.cbinsights.com/research/artificial-intelligence-top-startups. Accessed on 28.08.2017.
[19] CBinsights. The Race For AI: Google, Baidu, Intel, Apple In A Rush To Grab Artificial Intelligence Startups. https://www.cbinsights.com/research/top-acquirers-ai-startups-ma-timeline/. Accessed on 01.09.2017.
[20] CEDR Conference of European Directors of Roads. Transnational Road Research Programme, Mobility and ITS. http://www.bast.de/DE/BASt/Forschung/Forschungsfoerderung/Downloads/cedr_call_2014_2.pdf. Accessed on 24.08.2017.
[21] Chervenski, Peter. Github MultiNEAT. https://github.com/peter-ch/MultiNEAT. Accessed on 09.08.2017.
[22] Datamation. Top 20 Artificial Intelligence Companies. http://www.datamation.com/applications/top-20-artificial-intelligence-companies.html. Accessed on 28.08.2017.
[23] David Senior (TechCrunch). Narrow AI: Automating The Future Of Information Retrieval. https://techcrunch.com/2015/01/31/narrow-ai-cant-do-that-or-can-it. Accessed on 30.08.2017.
[24] DeepMind. The story of AlphaGo so far. https://deepmind.com/research/alphago/. Accessed on 28.08.2017.
[25] Destatis, Deutsches Statistisches Bundesamt. Unfallursachen. http://www.gdv.de/2013/07/menschliches-fehlverhalten-ist-haeufigste-ursache-fuer-verkehrsunfaelle/. Accessed on 30.08.2017.
[26] Dorigo, Marco and Caro, Gianni Di. Ant Colony Optimization: A New Meta-Heuristic. In: IEEE, 1999.
[27] Duchoň, František, Babinec, Andrej, Kajan, Martin, Beňo, Peter, Florek, Martin, Fico, Tomáš, and Jurišica, Ladislav. Path planning with modified a star algorithm for a mobile robot. In: Elsevier, 2014.
[28] Eden, Tim, Knittel, Anthony, and Uffelen, Raphael van. Q-Learning. http://www.cse.unsw.edu.au/~cs9417ml/RL1/algorithms.html. Accessed on 23.07.2017.
[29] Ericsson, M, Resende, Mauricio GC, and Pardalos, Panos M. A genetic algorithm for the weight setting problem in OSPF routing. In: Springer, 2002.
[30] Evolutionary Complexity (EPlex) Research Group at the University of Central Florida. Find the Right Version of NEAT for Your Needs. http://eplex.cs.ucf.edu/neat_software/. Accessed on 09.08.2017.
[31] Evolutionary Complexity (EPlex) Research Group at the University of Central Florida. The Hypercube-based NeuroEvolution of Augmenting Topologies (HyperNEAT) Users Page. http://eplex.cs.ucf.edu/hyperNEATpage/. Accessed on 23.08.2017.
[32] Fakoor, Mahdi, Kosari, Amirreza, and Jafarzadeh, Mohsen. Revision on fuzzy artificial potential field for humanoid robot path planning in unknown environment. In: Inderscience Publishers (IEL), 2015.
[33] Fernuni Hagen. Wahrscheinlichkeitsrechnung Kombinatorik. http://www.fernuni-hagen.de/ksw/neuestatistik/content/MOD_27351/html/comp_27410.html. Accessed on 02.08.2017.
[34] Gambardella, Luca Maria, Taillard, Éric, and Agazzi, Giovanni. MACS-VRPTW: A multiple ant colony system for vehicle routing problems with time windows. In: Istituto Dalle Molle Di Studi Sull'Intelligenza Artificiale, 1999.
[35] Gammell, Jonathan D, Srinivasa, Siddhartha S, and Barfoot, Timothy D. Informed RRT*: Optimal sampling-based path planning focused via direct sampling of an admissible ellipsoidal heuristic. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2014, 2014.
[36] Ganesh, Adithya, Charalel, Joe, Sarma, Matthew Das, and Xu, Nancy. Deep Reinforcement Learning for Simulated Autonomous Driving. In: Stanford University, 2016.
[37] Gartner Inc. Top Trends in the Gartner Hype Cycle for Emerging Technologies, 2017. http://www.gartner.com/smarterwithgartner/top-trends-in-the-gartner-hype-cycle-for-emerging-technologies-2017. Accessed on 28.08.2017.
[38] Gatys, L. A., Ecker, A. S., and Bethge, M. A Neural Algorithm of Artistic Style. In: arXiv, 2015.
[39] Github. tiny-dnn. https://github.com/tiny-dnn/tiny-dnn. Accessed on 01.08.2017.
[40] Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. The MIT Press, 2016. isbn: 978-0-26203-561-3.
[41] Graves, Alex et al. Hybrid computing using a neural network with dynamic external memory. In: Nature Research 538(7626):471-476, 2016.
[42] Guruji, Akshay Kumar, Agarwal, Himansh, and Parsediya, DK. Time-Efficient A* Algorithm for Robot Path Planning. In: Elsevier, 2016.
[43] Hart, Peter E., Nilsson, Nils J., and Raphael, Bertram. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. In: IEEE, 1968.
[44] Hausknecht, Matthew, Lehman, Joel, Miikkulainen, Risto, and Stone, Peter. A Neuroevolution Approach to General Atari Game Playing. In: IEEE Transactions on Computational Intelligence and AI in Games, 2013.
[45] Heidrich-Meisner, V., Lauer, M., Igel, C., and Riedmiller, M. Reinforcement Learning in a Nutshell. In: Institut für Neuroinformatik, Ruhr-Universität Bochum, Neuroinformatics Group, University of Osnabrück, 2007.
[46] Heidrich-Meisner, Verena, Lauer, Martin, Igel, Christian, and Riedmiller, Martin A. Reinforcement learning in a nutshell. In: Institut für Neuroinformatik, Ruhr-Universität Bochum, Neuroinformatics Group, University of Osnabrück, Germany, 2007.
[47] Here. Map from Audi driving experience center. https://wego.here.com/?map=48.72497,11.2526,16,normal. Accessed on 31.07.2017.
[48] Hochreiter, S. and Schmidhuber, J. Long short-term memory. In: Neural Computation 9(8):1735-1780, 1997.
[49] Ismail, AT, Sheta, Alaa, and Al-Weshah, Mohammed. A mobile robot path planning using genetic algorithm in static environment. In: Journal of Computer Science, 4(4):341-344, 2008.
[50] Jain, Sruti. Designing neural network: What activation function you plan to implement. http://www.srutisj.in/blog/research/designing-neural-network-what-activation-function-you-plan-to-implement/. Accessed on 03.08.2017.
[51] Khatib, Oussama. Real-time obstacle avoidance for manipulators and mobile robots. In: Sage Publications Sage CA: Thousand Oaks, CA, 1986.
[52] Klint Finley. This News-Writing Bot Is Now Free for Everyone. https://www.wired.com/2015/10/this-news-writing-bot-is-now-free-for-everyone/. Accessed on 24.08.2017.
[53] Kriesel, David. A Brief Introduction to Neural Networks. 2007. Available at http://www.dkriesel.com.
[54] Kroll, Andreas. Computational Intelligence. De Gruyter Oldenbourg, 2016. isbn: 978-3-11-040066-3.
[55] Lewis, Mike, Yarats, Denis, Dauphin, Yann N, Parikh, Devi, and Batra, Dhruv. Deal or No Deal? End-to-End Learning for Negotiation Dialogues. In: arXiv, 2017.
[56] Lewis, Mike, Yarats, Denis, Dauphin, Yann N, Parikh, Devi, and Batra, Dhruv. Deal or no deal? Training AI bots to negotiate. https://code.facebook.com/posts/1686672014972296/deal-or-no-deal-training-ai-bots-to-negotiate/. Accessed on 30.08.2017.
[57] Lämmel, Uwe and Cleve, Jürgen. Künstliche Intelligenz. Hanser, 2008. isbn: 978-3-446-41398-6.
[58] Lunze, Jan. Künstliche Intelligenz für Ingenieure. De Gruyter Oldenbourg, 2016. isbn: 978-3-11-044896-2.
[59] Machine Learning Mastery. What is Deep Learning? http://machinelearningmastery.com/what-is-deep-learning/. Accessed on 25.07.2017.
[60] Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory. https://www.csail.mit.edu. Accessed on 20.07.2017.
[61] MathWorks. Neural Network Toolbox. https://de.mathworks.com/products/neural-network.html. Accessed on 01.08.2017.
[62] McCarthy, John. A Proposal for the Dartmouth Summer Research Project on Artificial Intelligence. http://jmc.stanford.edu/articles/dartmouth.html. Accessed on 19.07.2017.
[63] McCarthy, John. What is artificial intelligence? In: Stanford University, 2007.
[64] McCullock, John. Q-Learning Tutorial. http://mnemstudio.org/path-finding-q-learning-tutorial.htm. Accessed on 23.07.2017.
[65] Michael Sainato (Observer). Stephen Hawking, Elon Musk, and Bill Gates Warn About Artificial Intelligence. http://observer.com/2015/08/stephen-hawking-elon-musk-and-bill-gates-warn-about-artificial-intelligence/. Accessed on 28.08.2017.
[66] Miller, Steven. Mind: How to Build a Neural Network (Part One). https://stevenmiller888.github.io/mind-how-to-build-a-neural-network/. Accessed on 04.08.2017.
[67] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, and Riedmiller, Martin. Playing Atari with Deep Reinforcement Learning. In: DeepMind Technologies, 2013.
[68] Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012. isbn: 978-0-262-01802-9.
[69] National Highway Traffic Safety Administration. The National Highway Traffic Safety Administration is responsible for keeping people safe on America's roadways. https://www.nhtsa.gov/about-nhtsa. Accessed on 24.08.2017.
[70] News, Carnegie Mellon. A Computer That Reads Body Language. https://cacm.acm.org/news/219282-a-computer-that-reads-body-language/fulltext. Accessed on 01.09.2017.
[71] Nvidia. Deep Learning Frameworks. https://developer.nvidia.com/deep-learning-frameworks. Accessed on 01.08.2017.
[72] Obitko, Marek. Introduction to Genetic Algorithms, Biological Background. http://www.obitko.com/tutorials/genetic-algorithms/biological-background.php. Accessed on 04.08.2017.
[73] OpenMP. The OpenMP API specification for parallel programming. http://www.openmp.org/. Accessed on 22.08.2017.
[74] Panetta Kasey (Gartner). The Disruptive Power of Artificial Intelligence. http://www.gartner.com/smarterwithgartner/the-disruptive-power-of-artificial-intelligence. Accessed on 01.09.2017.
[75] Partnership on AI. To benefit people and society. https://www.partnershiponai.org. Accessed on 20.07.2017.
[76] Raschka, Sascha. What is the Role of the Activation Function in a Neural Network? http://www.kdnuggets.com/2016/08/role-activation-function-neural-network.html. Accessed on 03.08.2017.
[77] Rey, Günter Daniel and Wender, Karl F. Neuronale Netze: eine Einführung in die Grundlagen, Anwendungen und Datenauswertung. Huber, 2011. isbn: 978-3-456-84513-5.
[78] Rimscha, Markus. Algorithmen kompakt und verständlich. Springer, 2010. isbn: 978-3-8348-9635-3.
[79] SAE International. An Abridged History of SAE. http://www.sae.org/about/general/history/. Accessed on 24.08.2017.
[80] Salvador, Amaia, Hynes, Nicholas, Aytar, Yusuf, Marin, Javier, Ofli, Ferda, Weber, Ingmar, and Torralba, Antonio. Learning Cross-modal Embeddings for Cooking Recipes and Food Images. In: CVF/IEEE, 2017.
[81] Standard. Tube challenger claws back world record for fastest journey around all 270 stations. http://www.standard.co.uk/news/london/tube-challenger-claws-back-world-record-for-fastest-journey-around-all-270-stations-a3149961.html. Accessed on 17.06.2017.
[82] Stanford University. Stanford Artificial Intelligence Laboratory. http://www.cs.stanford.edu/research/ai. Accessed on 20.07.2017.
[83] Stanley, Kenneth O. and Miikkulainen, Risto. Efficient Evolution of Neural Network Topologies. In: Proceedings of the 2002 Congress on Evolutionary Computation, CEC2002, 2002.
[84] Stanley, Kenneth O. and Miikkulainen, Risto. Evolving a roving eye for go. In: Springer, 2004.
[85] Stanley, Kenneth O. and Miikkulainen, Risto. Evolving Neural Networks Through Augmenting Topologies. In: Evolutionary Computation 10(2):99-127, 2002.
[86] Stentz, Anthony. Optimal and efficient path planning for unknown and dynamic environments. In: Carnegie-Mellon University Pittsburgh PA Robotics Institute, 1993.
[87] Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998. isbn: 978-0-26219-398-6.
[88] Teglor. Deep Learning Libraries by Language. http://www.teglor.com/b/deep-learning-libraries-language-cm569. Accessed on 01.08.2017.
[89] Varricchio, Valerio, Chaudhari, Pratik, and Frazzoli, Emilio. Sampling-based algorithms for optimal motion planning using process algebra specifications. In: IEEE, 2014.
[90] Vinod Mahanta (ETtech). AI can make an impact like electricity: Coursera's co-founder Andrew Ng. http://tech.economictimes.indiatimes.com/news/people/ai-can-make-an-impact-like-electricity-courseras-co-founder-andrew-ng/60229635. Accessed on 28.08.2017.
[91] Watkins, Christopher John Cornish Hellaby. Learning from Delayed Rewards. http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf. Accessed on 23.07.2017.
[92] Weeks, Abby. Interplanetary trajectory optimization using a genetic algorithm. In: Aerospace Engineering Dept, Pennsylvania State University, 1994.
[93] Weicker, Karsten. Evolutionäre Algorithmen. Springer, 2015. isbn: 978-3-658-09957-2.
[94] Wolfram. Geometry Distance. http://mathworld.wolfram.com/Distance.html. Accessed on 18.06.2017.
[95] Wolfram. Königsberg Bridge Problem. http://mathworld.wolfram.com/KoenigsbergBridgeProblem.html. Accessed on 09.06.2017.
[96] Yale University, Department of Statistics and Data Science. Mean and Variance of Random Variables. http://www.stat.yale.edu/Courses/1997-98/101/rvmnvar.htm. Accessed on 10.08.2017.
[97] Yann LeCun, New York University. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Accessed on 22.07.2017.
[98] Yao, Junfeng, Lin, Chao, Xie, Xiaobiao, Wang, Andy JuAn, and Hung, Chih-Cheng. Path planning for virtual human motion using improved A* star algorithm. In: IEEE, 2010.
[99] Yu, April, Palefsky-Smith, Raphael, and Bedi, Rishi. Deep Reinforcement Learning for Simulated Autonomous Vehicle Control. In: Stanford University, 2016.