
Article

Enhanced Reinforcement Learning Method Combining One-Hot Encoding-Based Vectors for CNN-Based Alternative High-Level Decisions

Bonwoo Gu 1 and Yunsick Sung 2,*

1 Department of M&S, SIMNET Cooperation, Daejeon 34127, Korea; [email protected]
2 Department of Multimedia Engineering, Dongguk University-Seoul, Seoul 04620, Korea
* Correspondence: [email protected]

Abstract: Gomoku is a two-player board game that originated in ancient China. There are various cases of developing Gomoku using artificial intelligence, such as a genetic algorithm and a tree search algorithm. Alpha-Gomoku, a Gomoku AI built with Alpha-Go's algorithm, defines all possible situations on the Gomoku board using Monte-Carlo tree search (MCTS) and minimizes the probability of learning other correct answers in duplicated Gomoku board situations. However, in the tree search algorithm, the accuracy drops because the classification criteria are set manually. In this paper, we propose an improved reinforcement learning-based high-level decision approach using convolutional neural networks (CNN). The proposed algorithm expresses each state as one-hot-encoding-based vectors and determines the state of the Gomoku board by combining similar one-hot-encoding-based state vectors. Thus, in a case where a stone that is determined by the CNN has already been placed or cannot be placed, we suggest a method for selecting an alternative. We verify the proposed Gomoku AI method in GuPyEngine, a Python-based 3D simulation platform.

Keywords: gomoku; game artificial intelligence; convolutional neural-networks; one-hot encoding; reinforcement learning

1. Introduction

Gomoku is a two-player board game that originated in ancient China. Two players alternate in placing a stone of their choice of color, and the player who first completes five-in-a-row horizontally, vertically, or diagonally wins the game. Various studies have attempted to develop Gomoku using artificial intelligence (AI), for example, using a genetic algorithm and a tree search algorithm [1,2]. Deep learning algorithms have been actively employed in recent years after Alpha-Go showed its outstanding performance. For example, efficient Gomoku board recognition and decision-making were made possible while using a convolution layer of the deep-learning convolutional neural network (CNN) algorithm [3]. However, as CNN-based decision-making determines a single optimal position in the same recognized Gomoku board state, some cases arise where a stone cannot be placed in the relevant position. In order to address this problem, Alpha-Gomoku (a Gomoku AI built with Alpha-Go's algorithm) defines all possible situations on the Gomoku board while using Monte-Carlo tree search (MCTS) and minimizes the probability of learning other correct answers in a duplicated Gomoku board situation [4]. However, in the tree search algorithm, the accuracy drops because the classification criteria are set manually.

In this paper, we propose an improved reinforcement learning-based high-level decision algorithm while using a CNN. The proposed algorithm expresses each state as one-hot-encoding-based vectors and determines the state of the Gomoku board by combining the similar states of one-hot-encoding-based vectors. Thus, in a case where a stone that is determined by the CNN has already been placed or cannot be placed, we suggest a method for selecting an alternative place.


As similar states are combined, the amount of learning data does not increase exponentially, which is advantageous. We verify the performance of the proposed reinforcement learning algorithm by applying it to Gomoku. We also verify the proposed Gomoku AI method in GuPyEngine, a Python-based 3D simulation platform. This paper is also expected to contribute to the field of incremental reinforcement learning algorithms and deep learning-based 3D simulation by introducing the functions and performance of GuPyEngine.

The rest of this paper is organized as follows. Section 2 describes related work on Gomoku AI, while Section 3 describes the proposed self-teaching Gomoku framework. Section 4 explains GuPyEngine and the experiments that were conducted in this paper using the proposed method on Gomoku. Finally, Section 5 concludes this study, summarizing the main findings and future research directions.

2. Related Works

Several researchers have attempted to develop Gomoku AI models using various algorithms. A genetic algorithm is an evolutionary-type algorithm that is used in various game AI models, and it has been used in Gomoku AI models [2]. Shah et al. [5] built a Gomoku AI model by applying the genetic algorithm. Wang et al. solved the bottleneck [6] that arises during a deep search in a tree-based Gomoku AI using the genetic algorithm [7].

In addition to the genetic algorithm, game AI models have been built using various tree search-based algorithms [1]. Typically, behavior trees (BT) allow for the efficient structuring of the behavior of artificial agents [8]. BT is widely used in computer games and robotics because the approach is efficient for building complex AI models that are modular as well as reactive. Allis et al. described a Gomoku AI model using the threat-space search [9] and proof-number search [10,11]. This algorithm was implemented in the Gomoku program called Victoria. When Victoria moved a stone first, the algorithm yielded a winning result without fail. Cao et al. presented a Gomoku AI model using an algorithm that combined the upper confidence bounds applied to trees (UCT) [12] and adaptive dynamic programming (ADP) [13,14]. This algorithm could solve the search depth defect more accurately and efficiently than the case where only a single UCT was used, thereby improving the performance of Gomoku AI.

Gomoku models using CNNs have also been developed, as Deep-Mind's Alpha-Go demonstrated outstanding deep learning performance. Li et al. built a Gomoku model using MCTS and deep learning [15,16]. An expert-level AI model was developed without the use of human expert knowledge by integrating the effective depth exploration of MCTS and deep neural networks (DNN). Yan et al. proposed a hybrid Gomoku AI model that improved the performance by including a hard-coded convolution-based Gomoku evaluation function [17]. Shao et al. predicted the next stone position with an accuracy of 42% while using only the current stone state of the board through the use of various DNNs with different hyper-parameters [18].

Various studies have developed expert-level game AI models using reinforcement learning. For instance, Deep-Mind used the Deep Q-Networks (DQN) algorithm of reinforcement learning and applied it to seven Atari 2600 games, thereby enabling them to be played at an expert level [19]. Oh et al. considered a real-time fighting AI game while using deep reinforcement learning [20]. The AI model achieved a winning rate of 62% when competing with five professional gamers. Zhentao et al. presented an efficient Gomoku AI by integrating MCTS and ADP [21]. This algorithm could effectively eliminate the "short-sighted" defect of the neural network evaluation function, yielding results exceeding the Gomoku model adopting only ADP. Zhao et al. designed a Gomoku AI model that is capable of self-teaching through ADP [22]. Self-teaching is a method that penalizes the final action of the loser while rewarding the last action of the winner. The present study also uses "Record" for self-teaching, inspired by this basic idea.

With the help of reinforcement learning, a model-free algorithm can represent the next state of the AI as a probability when the AI acts in a specific state [23]. In Q-Learning, the next Q state is calculated using Q-values [24]. In this regard, de Lope et al. designed an efficient KNN-TD algorithm for a model-free algorithm by integrating the K-nearest neighbor (KNN) algorithm [25] and the temporal difference (TD) [26,27].

In this paper, by combining one-hot-encoding-based vectors, we build a model-free Gomoku AI that is capable of self-teaching. After generating the state vectors by performing convolution operations on the relationship between the stones that were placed on the Gomoku board, we determine similar situations for each situation vector inputted with an approximate nearest neighbor (ANN) [28]. In the case of a similar situation, the one-hot-encoding-based vectors are combined; subsequently, a CNN is utilized. The row values of the one-hot-encoding-based vectors are independent of each other, but they are converted into related rows that are not independent of each other by combining the one-hot-encoding-based vectors. Following this approach, when applying self-teaching, if a stone cannot be placed in an optimal position, then the next optimal position can be considered. We also develop 'GuPyEngine' for Python and 3D programming education. For the proposed method, we conduct experiments by implementing a Gomoku system in GuPyEngine.

3. Advanced CNN-Based Decision-Making

This paper proposes an enhanced reinforcement learning model to improve the training data of a CNN, in order to enable optimal decision-making and alternative selection. By selecting an alternative, it solves the problem of the CNN-based decision-making method allowing only the optimal solution to be selected, and it minimizes the forgetting problem. The forgetting problem is the phenomenon that the weights obtained through previous learning in neural network algorithms are changed, and thus forgotten, by new learning [29]; it is generally invoked when the inputs are similar and the labels are different.

3.1. Framework Overview

Figure 1 shows the proposed framework. As can be seen, the framework is based on multiple encoding-based vector combinations, containing the following stages: "State Vector Generation", "CNN-based Estimation", "Decision Making", "One-Hot-Encoding-based Vector Generation", "ANN-based One-Hot Encoding Vector Combination", and "Enhanced CNN Training".

Figure 1. Proposed self-teaching Gomoku framework.

The state vector generation stage generates the vectors for simulating the states. When $f^1(s_t)$ is the state vector generation function, $f^1(s_t) = \langle s'_{t,1}, s'_{t,2}, \ldots, s'_{t,n} \rangle$, where $s_{t,i}$ is the $i$-th element of $s_t$ and $s'_{t,i}$ is the converted value of $s_{t,i}$, which is defined while considering both the environment and the CNN.

In the CNN-based estimation stage, alternative actions are derived by applying the state vector to a CNN. Moreover, in this stage, the proposed CNN uses convolution with a kernel size of 3 × 3 [18]. As the position of stones must be determined accurately in the game of Gomoku, the minimum kernel size should be used. In the decision-making stage, a case might arise where the action corresponding to the highest probability in the CNN result cannot be performed. Such a case is marked as −1, which makes its selection impossible, so that an alternative action with the next highest probability is selected and performed.

The one-hot-encoding-based vector generation stage generates the vectors for simulating the actions. $f^2(s_t)$ returns $o_t$, the one-hot-encoding-based vector of $s_t$. The ANN-based one-hot encoding vector combination stage receives $\langle s'_t, o_t \rangle$ and then returns $\langle s'_t, o''_t \rangle$ as training data, where $o'_t$ corresponds to the most similar vector in the training data. Moreover, $o''_t$ combines $o_t$ with $o'_t$. The following section describes how these two vectors are combined. The CNN training stage makes the CNN learn with $\langle s'_t, o''_t \rangle$.
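To make the decision-making stage concrete, the following is a minimal sketch of alternative selection under our assumptions: the 7 × 7 board is flattened to 49 cells, the CNN outputs one probability per cell, and occupied cells are treated as −1 so that the highest-probability legal cell is chosen. The names `select_action`, `cnn_probs`, and `board` are illustrative and not taken from the authors' implementation.

```python
import numpy as np

def select_action(cnn_probs, board):
    """Pick the legal intersection with the highest CNN probability.

    cnn_probs: length-49 vector of CNN output probabilities, one per intersection.
    board:     length-49 vector; 0 = empty, 1 = black stone, 2 = white stone.
    """
    masked = np.where(board == 0, cnn_probs, -1.0)   # occupied cells are treated as -1
    if masked.max() < 0:                             # no legal move remains
        return None
    return int(np.argmax(masked))                    # falls back to the next-best legal cell

# Example: the top-ranked cell (24) is already occupied, so cell 25 is chosen instead.
probs = np.zeros(49)
probs[24], probs[25] = 0.6, 0.3
board = np.zeros(49, dtype=int)
board[24] = 1
print(select_action(probs, board))  # -> 25
```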

3.2. ANN-Based One-Hot Encoding Vector Combination Stage

In this paper, we define "Record" as the action of going back several moves at the end of each game and placing a stone in the position where the opponent's stone is likely to be placed, while using the proposed method. There are two considerations when updating rewards, as below. First, when the player wins, no reward is received. If the rewards of wins were reflected, a specific row in the combined one-hot-encoding-based vector would increase rapidly. If so, the diverse kinds of stone positions would not be considered. As a result, Gomoku would fall into one way of playing, which is the local optimum problem [30]. Next, the reward is updated only when the player loses. For example, in the case of one record, the game goes back by one move from the end, the reward is updated by 1 considering the situation after replacing the last position of the opponent's stone with the player's own stone, and then the encoding-based vectors are combined into one vector. In the case of three records, this is repeated three times, replacing the black stone with the white stone while going back three moves.

Figure 2 shows the combination of one-hot-encoding-based vectors for the proposed Gomoku system. In Gomoku, relationships between similar stone moves occur frequently. Depending on the player's tendency, the moves can differ completely from one another, even in a similar situation. The proposed Gomoku system divides similar situations for each feature vector while using the ANN algorithm. In the divided similar situations, one-hot encoding-based vectors are combined in order to calculate the probability. This solves the problem of an exponential increase in the number of training data.
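The "Record" procedure is only described in prose above, so the code below shows one plausible reading of it: after a lost game, walk back through the last k opponent moves, and for each one pair the board as it was just before that move with a one-hot target at the move's position. `make_records`, `OPPONENT`, and the data layout are our assumptions, not the authors' code.

```python
import numpy as np

OPPONENT = 1   # assumed encoding: 1 = opponent (black), 2 = player (white), 0 = empty

def make_records(move_history, final_board, k):
    """Generate up to k (state, one-hot target) pairs from a lost game.

    move_history: list of (position, color) tuples in play order.
    final_board:  length-49 array of the finished game.
    k:            number of Records, i.e., how many opponent moves to go back.
    """
    records = []
    board = final_board.copy()
    for pos, color in reversed(move_history):
        board[pos] = 0                        # undo this move
        if color == OPPONENT:                 # the position the opponent actually chose
            target = np.zeros(49)
            target[pos] = 1.0                 # reward of 1 at the position to be learned
            records.append((board.copy(), target))
            if len(records) == k:
                break
    return records
```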

Figure 2. Approximate nearest neighbor (ANN)-based one-hot encoding vector combination.

ANN algorithms are diversely utilized for large-scale image searching and classification problems [28]. In this paper, the similarity among one-hot-encoding-based vectors is calculated by the Euclidean distance, a representative ANN algorithm. If the similarity is smaller than a specific threshold, then the corresponding vectors are combined.


A new vector that contains the corresponding vector is inserted if the similarity is larger than the threshold, that is, when there is no similar vector.

A one-hot encoding-based vector is a binary vector that sets the value at the index corresponding to the labeled data to 1 and sets the others to 0 [31]. When learning with a neural network using labeled data as plain numbers, the learning results can be deduced as decimal values; if so, categorical data cannot be accurately classified. However, if one-hot encoding-based vectors are applied, categorical data can be classified accurately. Given that only one stone can be placed at a specific place in Gomoku, the possible positions of stones are treated as 49 positions. One-hot-encoding-based vector expression is a widely discussed approach for classifying categorical data with deep neural networks (DNN) [31–34]. In this paper, the positions of stones considering states are expressed by one-hot-encoding-based vectors.

By combining one-hot encodings, the 49 row relationships that are independent of each other are made non-independent. The CNN result of learning the non-independent one-hot encoding outputs a result that completely resembles the learned one-hot-encoding vectors. Thus, a CNN result with each non-independent row is outputted. Even if there is a black or white stone in the position with the highest probability in the CNN result, the position with the next highest probability is non-independently related to the current situation. Therefore, it becomes the next best position for the current situation. In addition, as one-hot encodings are continuously combined for each game in similar situations, the learning result varies according to the user's tendency.
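A minimal sketch of the ANN-based combination, under two assumptions that the text does not pin down exactly: states are compared with the plain Euclidean distance against each stored node, and "combining" simply accumulates the one-hot vectors, which are later normalized into the CNN's soft targets. The class and method names are ours.

```python
import numpy as np

class AnnNodeStore:
    """Stores (state vector, combined one-hot vector) nodes and merges similar states."""

    def __init__(self, alpha):
        self.alpha = alpha            # distance threshold below which states count as similar
        self.nodes = []               # each node is a [state_vector, combined_one_hot] pair

    def add(self, state_vec, one_hot):
        for node in self.nodes:
            if np.linalg.norm(node[0] - state_vec) < self.alpha:   # similar situation found
                node[1] = node[1] + one_hot                        # combine the one-hot vectors
                return
        self.nodes.append([state_vec.astype(float).copy(),
                           one_hot.astype(float).copy()])          # otherwise insert a new node

    def training_data(self):
        # Normalized combined vectors act as soft 49-way targets for the CNN.
        return [(s, o / o.sum()) for s, o in self.nodes]
```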

3.3. CNN Training Stage

The record data generated for each game are combined for each similar situation through the ANN. Nodes are generated for each different situation in the ANN, and each node has one-hot-encoding-based vectors that are combined (Figure 3).

Figure 3. Enhanced convolutional neural networks (CNN) training.

The vector for each ANN node frequently varies every time that learning is performed with the record data. Hence, in the 'Enhanced CNN Training' stage of each game, the existing CNN model is removed, and a new CNN model is regenerated based on the learned ANN result. In addition, the training data of the ANN are regenerated with the latest ANN nodes. As similar situations are combined through the ANN, the amount of training data does not increase incrementally, even though CNN training is performed for each game. This also mitigates the forgetting problem by reducing the amount of useless data learned during CNN training.
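The retraining step can be pictured as in the sketch below: after each game the previous CNN is discarded and a fresh model is fitted only to the current ANN nodes. Here `node_data` is a list of (state, combined one-hot target) pairs and `build_cnn` a factory for a freshly initialized Keras model; both names, the epoch count, and the reshaping into a 7 × 7 single-channel image are our assumptions.

```python
import numpy as np

def enhanced_cnn_training(node_data, build_cnn, epochs=10):
    """Rebuild the CNN from scratch from the current ANN nodes.

    node_data:  list of (length-49 state vector, length-49 combined one-hot target) pairs.
    build_cnn:  factory returning a freshly initialized Keras model.
    """
    states, targets = zip(*node_data)
    x = np.stack(states).reshape(-1, 7, 7, 1)    # board states as 7x7 single-channel images
    y = np.stack(targets)                        # 49-way targets, one per intersection
    model = build_cnn()                          # old weights discarded, new model regenerated
    model.fit(x, y, epochs=epochs, verbose=0)    # relearn only the combined node data
    return model
```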



4. Experiments

We verify the performance of the proposed reinforcement learning model by applying it to a Gomoku game. Section 4.1 describes the learning and development environments, whereas Section 4.2 introduces GuPyEngine. Sections 4.3–4.5 present the performance verification of the proposed reinforcement learning Gomoku AI model while considering various criteria.

4.1. Experimental Environment

We used several verification methods in order to confirm the performance of the Gomoku system proposed in this paper. We performed experiments on "the number of wins", "the number of ANN nodes", and "the next best solution" of the Gomoku game, and we verified the difference in performance according to the number of records generated for each game. The experiment was developed with C/C++, Python 3.0, and TensorFlow; an Intel i7 CPU, 16 GB RAM, and a GTX-1050 GPU were used in the experimental environment.

In the experiments, 600 Gomoku games were iterated for each experimental case. The genetic algorithm was applied to the black stones. Identical moves in each game were prevented by randomly initializing the X and Y starting positions between 3 and 5. As the performance of the ANN varies depending on the defined distance, we verified the performance according to various distances in the experiments. The proposed method was applied to the white stones. The goal of the white stones was to prevent the black stones from completing five-in-a-row by predicting the position of the black stone. The proposed Gomoku system considered the white stone as the winner if it completed the five-in-a-row first or if all 49 intersections were filled. If the white stone wins, the system moves to the next game, whereas, if the black stone wins, it relearns and proceeds to the next game.

In the CNN-based estimation stage, a total of three hidden layers were considered, and the hidden layers had 49, 128, and 516 neurons, respectively. The output layer and input layer were designed to have 49 (7 × 7) neurons, which corresponds to the total number of intersections on the board. The CNN result of the proposed Gomoku system was outputted in the form of a vector with 49 rows, in order to predict the position of the next white stone by matching 1:1 to the 49 intersections of the board.
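One possible Keras reading of the network described above: a 49-unit input and output matching the 7 × 7 board, a 3 × 3 convolution as in Section 3.1, and hidden dense layers of 49, 128, and 516 units. The number of convolution filters, the activations, and the optimizer are not stated in the paper and are assumptions here.

```python
import tensorflow as tf

def build_cnn():
    """Sketch of the CNN sized as reported in Section 4.1 (assumed activations and optimizer)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(7, 7, 1)),                      # 49 intersections as a 7x7 grid
        tf.keras.layers.Conv2D(32, kernel_size=3, padding="same",
                               activation="relu"),                   # 3x3 kernel, as in the paper
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(49, activation="relu"),                # hidden layers: 49, 128, 516 units
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(516, activation="relu"),
        tf.keras.layers.Dense(49, activation="softmax"),             # one output per intersection
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```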

4.2. GuPyEngine

GuPyEngine is a Python-based simulation platform. Figure 4 depicts screenshots of GuPyEngine, which aims to easily and simply implement a three-dimensional (3D) simulation environment and simulate Python-based math and physics, among others. OpenGL 4.0 was used for 3D rendering, and the GUI supports high-speed rendering using GDI+ and OpenCV. State changes in 3D objects, such as position, rotation, and scale, are supported, as well as a camera, an FBX file loader, and bone key animation.

The Python code can be linked for each 3D object and written by linking with the PyCharm tool, as when using the Unity 3D tool (Figure 5). Currently, in GuPyEngine, we can change the object state, perform creation and deletion, and modify the object texture using Python code. In addition, it is possible to install and link the code with deep learning tools, such as TensorFlow and Keras, through PIP, as both 32-bit and 64-bit versions have been developed. GuPyEngine supports PIP versions of up to 3.0, and errors in Python code are displayed in the console (cmd.exe) window to facilitate debugging.



Figure 4. Screenshots of GuPyEngine.


Figure 5. Proposed Gomoku system's Python code in GuPyEngine with PyCharm.

GuPyEngine rebuilds and uses Python code in C/C++. Hence, Python can be used with a high processing speed, and external libraries are easily incorporated. A function for converting Python code into DLL/SO files will be developed for GuPyEngine in the future; this will ensure that content written in Python code in GuPyEngine can be easily linked with the Unity 3D Engine, the Unreal Engine, and other external programs. The Gomoku system that is proposed in this paper was developed by installing and using TensorFlow in GuPyEngine.

4.3. Number of Winning Games

We calculated the winning rate by grouping the 600 experimental data into five categories. Figure 6 shows the experimental results regarding the winning rate for four records.

Figure 6a shows that the winning rate is low, owing to the frequent forgetting problems, as only the CNN was used without relearning from the records. Figure 6b confirms that the winning rate gradually increases when the records were relearned. However, new moves of the GA AI continue to be created, which results in frequent variations in the winning rate. Figure 6c shows that the winning rate gradually increases, and all wins were observed from the 525th data point, as seen in the green box area depicted in Figure 6c. The figure also shows the result of learning all the possible moves that the GA AI can perform to place a stone, as the ANN minimizes the forgetting problem. Gomoku is a game that requires the exact stone position to be determined; therefore, the ideal result is derived when the distance α of the ANN is five. Figure 6d–h show the results of gradually increasing the α of the ANN. The figures show that, when the α of the ANN is higher than required, more errors occur, which hinders the learning. Figure 6d,e display similar shapes; here, learning with the ANN aggravates the forgetting problem, which results in a decrease in the winning rate, as seen in the red box areas in Figure 6d,e. Figure 6h shows a lower winning rate than Figure 6a.

Figure 6. Result of win rate when record count was four. Each graph's y-axis refers to the number of wins from 0 to 5, and the x-axis refers to the number of games from 0 to 120. The figures show the experimental result using: (a) Only CNN without Record relearning; (b) CNN with Record relearning without ANN; (c) Proposed Gomoku system with ANN α:5; (d) Proposed Gomoku system with ANN α:10; (e) Proposed Gomoku system with ANN α:15; (f) Proposed Gomoku system with ANN α:20; (g) Proposed Gomoku system with ANN α:25; and (h) Proposed Gomoku system with ANN α:30.

Figure 7 shows the experimental results regarding the winning rate for three records. The results in Figure 7 show that, overall, the winning rate is lower than in Figure 6. The proposed Gomoku system performs Enhanced CNN Training for each game; therefore, the results of this experiment show that the winning rate increases toward the end rather than at the beginning. However, in Figure 7b–e, the red box areas show fewer learned records than in the previous experiment. In particular, Figure 7c confirms that the winning rate in the all-winning section, as shown in Figure 6c, also drops, owing to the lack of records. However, as in Figure 6, Figure 7c shows a very high winning rate when compared to the other experimental cases presented in Figure 7, showing that an ANN α of five corresponds to the optimized case in the proposed Gomoku system.



Figure 7. Result of win rate when record count was three. Each graph's y-axis refers to the number of wins from 0 to 5, and the x-axis refers to the number of games from 0 to 120. The figures show the experimental result using: (a) Only CNN without Record relearning; (b) CNN with Record relearning without ANN; (c) Proposed Gomoku system with ANN α:5; (d) Proposed Gomoku system with ANN α:10; (e) Proposed Gomoku system with ANN α:15; (f) Proposed Gomoku system with ANN α:20; (g) Proposed Gomoku system with ANN α:25; and (h) Proposed Gomoku system with ANN α:30.

Figures 8 and 9 show the experimental results regarding the winning rates for two records and a single record, respectively. These results confirm that the number of records to be learned is insufficient and, therefore, the winning rate drops in all of the sections and "Self-Teaching" occurs frequently. The experiments confirmed the satisfactory functioning of the ANN in the proposed Gomoku system when characterizing similar situations. We could also confirm that the optimal result was derived when the α of the ANN was five. Furthermore, when α was greater than required, for example, in the range 25–30, it was observed that the CNN learning was rather hindered.


Figure 8. Results of win rate when the record count was two. Each graph's y-axis refers to the number of wins from 0 to 5, and the x-axis refers to the number of games from 0 to 120. The figures show the experimental result using: (a) only CNN without Record relearning; (b) CNN with Record relearning without ANN; (c) Proposed Gomoku system with ANN α:5; (d) Proposed Gomoku system with ANN α:10; (e) Proposed Gomoku system with ANN α:15; (f) Proposed Gomoku system with ANN α:20; (g) Proposed Gomoku system with ANN α:25; and (h) Proposed Gomoku system with ANN α:30.


Figure 9. Result of win rate when record count was one. Each graph's y-axis refers to the number of wins from 0 to 5, and the x-axis refers to the number of games from 0 to 120. The figures show the experimental result using: (a) only CNN without Record relearning; (b) CNN with Record relearning without ANN; (c) Proposed Gomoku system with ANN α:5; (d) Proposed Gomoku system with ANN α:10; (e) Proposed Gomoku system with ANN α:15; (f) Proposed Gomoku system with ANN α:20; (g) Proposed Gomoku system with ANN α:25; and (h) Proposed Gomoku system with ANN α:30.

4.4. Number of ANN Nodes

The number of ANN nodes refers to the number of cases grouped by similar situations in Gomoku, which indicates the amount of CNN training data. Figure 10a shows the results of saving four Records data for each game without the ANN. During 600 games, 1000 record data were saved, except for the winning games. Figure 10b shows the ANN results; because records with α less than five are combined into a similar situation, the number of record data is reduced when compared to Figure 10a. Moreover, 874 record data were saved. Figure 10c–g show the results of saving 633, 448, 217, 78, and 50 record data, respectively, from 600 games.

Appl. Sci. 2021, 11, x FOR PEER REVIEW 13 of 17


Figure 10. Number of ANN Nodes when Record was four.

With an increase in the α of the ANN, the saved record data become smaller. Figure 10b shows less training data when compared to Figure 10a, and Figure 6b,c show that the winning rate is also higher than that when using only the existing CNN. Thus, it is considered to be an ideal learning result. However, with an increase in α, records that are not similar are also merged, as seen in Figure 10g,h. Thus, CNN learning is hindered, and the winning rate drops, as in Figure 6g,h. Figure 11 shows similar results to those presented in Figure 10.

Figure 11. Number of ANN Nodes when Record was three.

Figures 12 and 13 confirm that, as the number of Record data to be learned is minimal, overall, the number of ANN nodes is less than required. These experiments also confirm that the amount of CNN training data does not increase linearly when the ANN is combined according to a similar situation. When the α of the ANN is 5, the amount of training data was reduced by 12% on average, while still ensuring an effective performance.


Figure 12. Number of ANN Nodes when Record was two.

Figure 13. Number of ANN Nodes when Record was one.




4.5. Next Best Answer Selection

The combination of one-hot encoding showed that the rows have a dependent relationship. Thus, we also experimented in order to verify the number of times the next best answer was selected in the proposed Gomoku system. The experimental criteria considered four records and the ANN α of five, thereby yielding the most optimal results.

sults.The combination of one-hot encoding showed that the rows have a dependent rela- tionship.Figure Thus, 14 shows we also the experimented number of times in order that to the verify next the best number answer of was times selected the next in bestthe proposedanswer was Gomoku selected system. in the proposed As can Gomoku be seen, system. the highest The experimental rank was selected criteria 3483 considered times. Whenfour records a new and case the was ANN addedα of through five, thereby the ANN, yielding the the proposed most optimal CNN results. Gomoku system performedFigure the 14 greatestshows the number number of ofselections times that, because the next there best was answer no other was one selected-hot encod- in the ingproposed to merge. Gomoku If a stone system. was As already can be seen,placed the in highest the highest rankwas rank selected or could 3483 not times. be placed, When thena new the case second was added-best resultthrough was the selected. ANN, the The proposed number CNN of Gomokuthe second system-best performed was 1693, thirdthe greatest-best was number 751, fourth of selections,-best was because 472, and there fifth was-best no was other 322. one-hot The number encoding of to times merge. that If thea stone other was rankings already were placed selected in the highestwas 1377. rank Through or could this not experiment be placed, then we theconfirmed second-best that theresult proposed was selected. Gomoku The system number provides of the second-best the next best was answer 1693, third-best by combining was 751, the fourth- inde- pendentbest was one 472,-hot and encoding fifth-best. was 322. The number of times that the other rankings were selected was 1377. Through this experiment we confirmed that the proposed Gomoku system provides the next best answer by combining the independent one-hot encoding.

Figure 14. Result of selecting the next best answer.

5. Conclusions

We proposed a new reinforcement learning algorithm based on a CNN. We verified the performance of the proposed algorithm by applying it to Gomoku. We enabled the CNN to determine the correct answers according to the situation by itself by determining similar situations with an ANN and combining the rows of one-hot-encoding-based vectors into the determined similar situations. As the rows of the independent one-hot-encoding-based vectors become non-independent relationships through the proposed algorithm, the next best answer can be selected if there exists a case where it is impossible to place a stone for the correct answer with the highest probability. This minimizes the forgetting problem, which arises when indiscriminately learning the data. Thus, it was possible to obtain a result of all wins. Further, the average amount of training data was reduced by 12% when the distance α of the ANN was five, as the amount of data did not increase incrementally for each game.

We developed the proposed reinforcement-learning-based Gomoku AI and verified its performance in GuPyEngine. We also introduced the functions and performance of GuPyEngine. In the future, we will perform various experiments and optimizations on the linkage with deep learning tools for GuPyEngine. In this way, we expect to contribute to 3D simulation research by improving the functions of the 3D simulation platform. Further, we will apply the proposed reinforcement learning algorithm to other games and conduct relevant experiments.

Author Contributions: Conceptualization, B.G. and Y.S.; writing—review and editing, B.G. and Y.S.; validation, B.G.; supervision, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding: This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (2018R1D1A1B07049990).

Informed Consent Statement: Not applicable.

Data Availability Statement: The data are not publicly available.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Zheng, P.M.; He, L. Design of gomoku ai based on machine game. Comput. Knowl. Technol. 2016, 2016, 33.
2. Marks, R.E. Playing games with genetic algorithms. In Evolutionary Computation in Economics and Finance; Physica: Heidelberg, Germany, 2002; pp. 31–44.
3. O'Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
4. Browne, C.B.; Powley, E.; Whitehouse, D.; Lucas, S.M.; Cowling, P.I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; Colton, S. A survey of monte carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 2012, 4, 1–43. [CrossRef]
5. Shah, S.M.; Singh, D.; Shah, J.S. Using Genetic Algorithm to Solve Game of Go-Moku. IJCA Spec. Issue Optim. On-Chip Commun. 2012, 6, 28–31.
6. Chen, Y.H.; Tang, C.Y. On the bottleneck tree alignment problems. Inf. Sci. 2010, 180, 2134–2141. [CrossRef]
7. Wang, J.; Huang, L. Evolving Gomoku solver by genetic algorithm. In Proceedings of the 2014 IEEE Workshop on Advanced Research and Technology in Industry Applications (WARTIA), Ottawa, ON, Canada, 29–30 September 2014; pp. 1064–1067.
8. Colledanchise, M.; Ögren, P. Behavior trees in robotics and AI: An introduction. arXiv 2017, arXiv:1709.00084.
9. Yen, S.-J.; Yang, J.-K. Two-stage Monte Carlo tree search for Connect6. IEEE Trans. Comput. Intell. AI Games 2011, 3, 100–118.
10. Herik, J.V.D.; Winands, M.H.M. Proof-Number Search and Its Variants. In Flexible and Generalized Uncertainty Optimization; Springer: Berlin/Heidelberg, Germany, 2008; pp. 91–118.
11. Allis, L.V.; Herik, H.J.; Huntjens, M.P. Go-Moku and Threat-Space Search; University of Limburg, Department of Computer Science: Maastricht, The Netherlands, 1993.
12. Coquelin, P.A.; Munos, R. Bandit algorithms for tree search. arXiv 2007, arXiv:cs/0703062.
13. Wang, F.-Y.; Zhang, H.; Liu, D. Adaptive Dynamic Programming: An Introduction. IEEE Comput. Intell. Mag. 2009, 4, 39–47. [CrossRef]
14. Cao, X.; Lin, Y. UCT-ADP Progressive Bias Algorithm for Solving Gomoku. In Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 6–9 December 2019.
15. Li, X.; He, S.; Wu, L.; Chen, D.; Zhao, Y. A Game Model for Gomoku Based on Deep Learning and Monte Carlo Tree Search. In Proceedings of the 2019 Chinese Intelligent Automation Conference, Zhenjiang, China, 20–22 September 2019; Springer: Singapore, 2019; pp. 88–97.
16. Xie, Z.; Fu, X.; Yu, J. AlphaGomoku: An AlphaGo-based Gomoku Artificial Intelligence using Curriculum Learning. arXiv 2018, arXiv:1809.10595.
17. Yan, P.; Feng, Y. A Hybrid Gomoku Deep Learning Artificial Intelligence. In Proceedings of the 2018 Artificial Intelligence and Cloud Computing Conference, Tokyo, Japan, 21–23 December 2018.
18. Shao, K.; Zhao, D.; Tang, Z.; Zhu, Y. Move prediction in Gomoku using deep learning. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), Wuhan, China, 11–13 November 2016.
19. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
20. Oh, I.; Rho, S.; Moon, S.; Son, S.; Lee, H.; Chung, J. Creating Pro-Level AI for a Real-Time Fighting Game Using Deep Reinforcement Learning. arXiv 2019, arXiv:1904.03821.
21. Tang, Z.; Zhao, D.; Shao, K.; Le, L.V. ADP with MCTS algorithm for Gomoku. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016.

22. Zhao, D.; Zhang, Z.; Dai, Y. Self-teaching adaptive dynamic programming for Gomoku. Neurocomputing 2012, 78, 23–29. [CrossRef]
23. Geffner, H. Model-free, Model-based, and General Intelligence. arXiv 2018, arXiv:1806.02308.
24. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [CrossRef]
25. Peterson, L.E. K-nearest neighbor. Scholarpedia 2009, 4, 1883. [CrossRef]
26. Bradtke, S.J.; Barto, A.G. Linear least-squares algorithms for temporal difference learning. Mach. Learn. 1996, 22, 33–57. [CrossRef]
27. De Lope, J.; Maravall, D. The knn-td reinforcement learning algorithm. In Proceedings of the International Work-Conference on the Interplay between Natural and Artificial Computation, Santiago de Compostela, Spain, 22–26 June 2009; Springer: Berlin/Heidelberg, Germany, 2009.
28. O'Hara, S.; Draper, B.A. Are you using the right approximate nearest neighbor algorithm? In Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), Tampa, FL, USA, 15–17 January 2013.
29. Liu, H.; Yang, Y.; Wang, X. Overcoming Catastrophic Forgetting in Graph Neural Networks. arXiv 2020, arXiv:2012.06002.
30. Conti, E.; Madhavan, V.; Such, F.P.; Lehman, J.; Stanley, K.O.; Clune, J. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. arXiv 2017, arXiv:1712.06560.
31. Hancock, J.T.; Khoshgoftaar, T.M. Survey on categorical data for neural networks. J. Big Data 2020, 7, 1–41. [CrossRef]
32. Potdar, K.; Pardawala, T.S.; Pai, C.D. A comparative study of categorical variable encoding techniques for neural network classifiers. Int. J. Comput. Appl. 2017, 175, 7–9. [CrossRef]
33. Brett, L. Machine Learning with R; Packt Publishing Limited: Birmingham, UK, 2013; ISBN 978-1782162148.
34. Cerda, P.; Varoquaux, G.; Kégl, B. Similarity encoding for learning with dirty categorical variables. Mach. Learn. 2018, 107, 1477–1494. [CrossRef]