
Article

Using a Reinforcement Q-Learning-Based Deep Neural Network for Playing Video Games

Cheng-Jian Lin 1,*, Jyun-Yu Jhang 2, Hsueh-Yi Lin 1, Chin-Ling Lee 3 and Kuu-Young Young 2

1 Department of Science and Engineering, National Chin-Yi University of Technology, Taichung 411, Taiwan; [email protected]
2 Institute of Electrical and Control Engineering, National Chiao Tung University, Hsinchu 300, Taiwan; [email protected] (J.-Y.J.); [email protected] (K.-Y.Y.)
3 Department of International Business, National Taichung University of Science and Technology, Taichung City 40401, Taiwan; [email protected]
* Correspondence: [email protected]

 Received: 23 September 2019; Accepted: 1 October 2019; Published: 7 October 2019 

Abstract: This study proposed a reinforcement Q-learning-based deep neural network (RQDNN) that combined a deep principal component analysis network (DPCANet) and Q-learning to determine a playing strategy for video games. Video game images were used as the inputs. The proposed DPCANet was used to initialize the parameters of the kernel and capture the image features automatically. It performs as a deep neural network and requires less computational complexity than traditional convolution neural networks. A reinforcement Q-learning method was used to implement a strategy for playing the video game. Both Flappy Bird and Atari Breakout games were implemented to verify the proposed method in this study. Experimental results showed that the scores of our proposed RQDNN were better than those of human players and other methods. In addition, the training time of the proposed RQDNN was also far less than other methods.

Keywords: convolution neural network; deep principal component analysis network; image sensor; Q-learning; video game

1. Introduction

Reinforcement learning was first used to play video games at the Mario AI Competition, which was hosted in 2009 by the Institute of Electrical and Electronics Engineers (IEEE) Games Innovation Conference and the IEEE Symposium on Computational Intelligence and Games [1]. The top three researchers in the competition adopted a range of 20 × 20, centered on the manipulated roles, as the input, and used an A* algorithm matched with a reinforcement learning algorithm. In 2013, Mnih et al. [2] proposed a convolution neural network based on the deep reinforcement learning algorithm, called Deep Q-Learning (DQN). It is an end-to-end reinforcement learning algorithm. The game's control strategies were learned with good effects. The algorithm can be directly applied in all games by modifying the input and output dimensions. Moreover, Mnih et al. [3] proposed an improved DQN by adding a replay memory mechanism. This method stores all learned states and, in each update, randomly selects a number of empirical values from the stored experience. In such a way, the continuity between states is broken and the algorithm learns more than one fixed strategy. Schaul et al. [4] improved the replay memory mechanism by removing useless experience and strengthening the selection of important experience in order to enhance the learning speed of the algorithm while consuming less memory. In addition to improving the reinforcement learning mechanism, some researchers have modified the network structure of DQN. Hasselt et al. [5] used double DQN to estimate Q values. This algorithm can stably converge parameters from both networks through synchronization at regular intervals.

Electronics 2019, 8, 1128; doi:10.3390/electronics8101128

Wang et al. [6] proposed a duel DQN to decompose Q values into the value function and the advantage function. By limiting the advantage function, the algorithm can focus more on the learning strategy throughout the game.

Moreover, strategies in playing video games have been applied in many fields, such as autonomous driving, automatic flight control, and so on. In autonomous driving, Li et al. [7] adopted reinforcement learning to realize lateral control for autonomous driving in an open racing car simulator. The experiments demonstrated that the reinforcement learning controller outperformed linear quadratic regulator (LQR) and model predictive control (MPC) methods. Martinez et al. [8] used the game Grand Theft Auto V (GTA-V) to gather training data to enhance the autonomous driving system. To achieve an even more accurate model, GTA-V allows researchers to easily create a large dataset for testing and training neural networks. In flight control, Yu et al. [9] designed a controller for a quadrotor. To evaluate the effectiveness of their method, a virtual maze was created using the AirSim software platform. The simulation results demonstrated that the quadrotor could complete the flight task. Kersandt et al. [10] proposed deep reinforcement learning for the autonomous operation of drones. The drone does not require a pilot for the entire period from flight to completion of the final mission. Experiments showed that the drone could be trained completely autonomously and obtained results similar to human performance. These studies show that training data are easier and less costly to obtain in a virtual environment.

Although deep learning has been widely used in various fields, it requires a lot of time and computing resources, and it requires expensive equipment to train the network. Therefore, in this study, a new network architecture, named the reinforcement Q-learning-based deep neural network (RQDNN), was proposed to improve the above-mentioned shortcomings. The major contributions of this study are described as follows:

• A new RQDNN, which combines a deep principal component analysis network (DPCANet) and Q-learning, is proposed to determine the strategy in playing a video game;
• The proposed approach greatly reduces computational complexity compared to traditional deep neural network architecture; and
• The trained RQDNN only uses a CPU and decreases computing resource costs.

The rest of this paper is organized as follows: Section 2 introduces reinforcement learning and deep reinforcement learning, Section 3 introduces the proposed RQDNN, and Section 4 illustrates the experimental results. Two games, Flappy Bird and Breakout, are used as the real testing environment. Section 5 is the conclusion and describes future work.

2. Overview of Deep Reinforcement Learning

Reinforcement2. Overview of learningDeep Reinforcement consists of Learning an agent and the environment (see Figure1). When the algorithm starts,Reinforcement the agent learning will firstly consists produce of an theagent initial and actionthe environment (At, where (seet represents Figure 1). When the present the time point) andalgorithm input starts, to the the environment. agent will firstly Then, produce the environmentthe initial action will (𝐴, feedwhere back t represents a new the state present (St+1 ) and a new rewardtime point) (Rt+1 ),and and, input based to the on environment. observation Then of state, the environment (St) and reward will feed (Rt )back on thea new new state time (𝑆 point,) the and a new reward (𝑅 ), and, based on observation of state (𝑆 ) and reward (𝑅 ) on the new time agent will select a new action (At). Through this repeated interactive process, the agent will learn a set of optimalpoint, strategies the agent (thiswill select is similar a new action to when (𝐴). a Through kid learns this repeated how to rideinteractive a bike: process, he needs the agent to keep will a good learn a set of optimal strategies (this is similar to when a kid learns how to ride a bike: he needs to balancekeep and practicea good balance not falling and practice down. not Then, falling after down making. Then, mistakes, after making he learnsmistakes, the he strategy learns the needed to ride thestrategy bike.). needed to ride the bike.).

Figure 1. The structure of reinforcement learning.


Reinforcement learning comprises two major branches: model-based and model-free methods. The difference between these two branches is the need for a thorough understanding of the environment. The model-based method is effective in training and learning optimal strategies, but, in reality, a complex environment is hard to model, which causes difficulty in developing the model-based method. Hence, most researchers now focus on model-free methods, where the strategies are learned not by modeling the environment directly, but through continuous interactions with the environment. Q-learning is the most commonly used model-free method, and its parameter update is given in Equation (1). In original Q-learning, tables are used to record Q values, where the environmental states and actions shall be discretized, as shown in Table 1. After learning the parameters, decisions can be made by directly looking up values in the table.

Q(S, A) ← Q(S, A) + α[R + γ max_{A'} Q(S', A') − Q(S, A)]. (1)

Table 1. Schematic diagram of a Q-table.

States \ Actions | South (0) | North (1) | East (2) | West (3) | Pickup (4) | Dropoff (5)
0    | 0 | 0 | 0 | 0 | 0 | 0
...  | ... | ... | ... | ... | ... | ...
328  | −2.30108 | −1.97092 | −2.30357 | −2.20591 | −10.3607 | −8.55830
...  | ... | ... | ... | ... | ... | ...
499  | 9.96984 | 4.02706 | 12.9602 | 29 | 3.32877 | 3.38230
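
As a concrete illustration of the tabular update in Equation (1), the short Python sketch below performs one Q-table update. The state and action counts, learning rate α, and discount factor γ are illustrative values, not parameters taken from this paper.

```python
import numpy as np

n_states, n_actions = 500, 6      # illustrative sizes for a small discretized task
alpha, gamma = 0.1, 0.99          # learning rate and discount factor (example values)

Q = np.zeros((n_states, n_actions))  # the Q-table of Table 1, initialized to zero

def q_learning_update(s, a, r, s_next):
    """One application of Equation (1): Q(S,A) <- Q(S,A) + alpha*[R + gamma*max_a' Q(S',A') - Q(S,A)]."""
    td_target = r + gamma * np.max(Q[s_next])   # best achievable value from the next state
    Q[s, a] += alpha * (td_target - Q[s, a])    # move the current estimate toward the target

# example: after taking action 2 in state 328, receiving reward -1 and landing in state 329
q_learning_update(s=328, a=2, r=-1.0, s_next=329)
```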

In most cases, for a complicated control problem, a simple Q-table cannot be used to store complicated states and actions. Some scholars have therefore introduced artificial neural networks into Q-learning to fit complicated Q-tables. Equation (2) shows the parameter update of such artificial neural networks. Q-learning can learn more complex states and actions by adding artificial neural networks, but, to avoid excessive data dimensions, manually selected features or few input sensors are used.

Loss = Σ[(R + γ max_{A'} Q(S', A', θ⁻) − Q(S, A, θ))²]. (2)

This problem was not solved until DQN was proposed, which includes the following important improvements:

(1) Original images are used as the input states for the convolution neural networks;
(2) A replay memory mechanism is added to enhance the learning efficiency of the algorithm; and
(3) Independent networks are introduced to estimate time difference errors more accurately and to stabilize algorithm training.

The outstanding effects of DQN have attracted much research into DQN improvement, including improved methods such as Nature DQN, Double DQN, and Dueling DQN. However, no DQN improvement strategy has strayed from the architecture based on convolution neural networks; therefore, the problems of long training times and heavy computing resource consumption have not been resolved. Convolution neural networks were proposed by LeCun [11] for handwriting recognition, in which the accuracy reached 99%. However, due to insufficient computing resources at the time, convolution neural networks were not taken seriously by other researchers. They were not greatly developed until 2012, when AlexNet was proposed by Krizhevsky [12], in which graphics card acceleration was introduced to solve slow training in addition to building deeper and larger network architectures. Since then, deeper and more complicated network architectures have been proposed, such as VGGNet [13], GoogleNet [14], and ResNet [15].

The architecture of convolution neural networks comprises feature selection in the first half and classifiers in the second half, as shown in Figure 2. However, the main difference is that traditional methods are conducted through manually designed feature selection mechanisms, such as Histogram of Orientation Gradient (HOG), Local Binary Pattern (LBP), and Scale-Invariant Feature Transform (SIFT). For convolution neural networks, feature selection is conducted based on the learned parameters and offsets of the convolutional, pooling, and activation function layers. The effects of the convolution kernels gained from learning are significantly better than those gained from the traditional methods for feature selection.

Convolution neural networks are divided into the convolutional layer, pooling layer, and activation function layer. These layers are introduced as follows.

Figure 2. Schematic diagram of convolution neural networks.

2.1. Convolutional Layer

The convolutional layer is the most important part in convolution neural networks. The features are extracted through convolution kernels after learning, and then mapping from low-level features to high-level features is completed through superposition of multi-layer convolution kernels to achieve good effects. Figure 3 is a schematic diagram with a step (s) of 1 and a kernel (K_h × K_w) of 3 × 3 for a convolution operation on a 4 × 4 feature map (f_h × f_w), and the operation is shown in Equations (3) and (4). The convolution result is shown as follows:

Y_IJ = Σ_{i=0}^{K_w} Σ_{j=0}^{K_h} x_{(I+i−1)(J+j−1)} · k_{ij}, and (3)

h × w = ((f_h − k_h)/s + 1) × ((f_w − k_w)/s + 1). (4)

Figure 3. Schematic diagram of convolution kernel operation.
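
To make Equations (3) and (4) concrete, the following NumPy sketch slides a kernel over a feature map and derives the output size from Equation (4); it is a minimal illustration, not the implementation used in this paper.

```python
import numpy as np

def conv2d(feature_map, kernel, s=1):
    """Naive valid convolution implementing Equations (3) and (4)."""
    f_h, f_w = feature_map.shape
    k_h, k_w = kernel.shape
    # Equation (4): output size h x w
    h = (f_h - k_h) // s + 1
    w = (f_w - k_w) // s + 1
    out = np.zeros((h, w))
    for I in range(h):
        for J in range(w):
            # Equation (3): sum of element-wise products over the kernel window
            window = feature_map[I * s:I * s + k_h, J * s:J * s + k_w]
            out[I, J] = np.sum(window * kernel)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # a 4 x 4 feature map, as in Figure 3
k = np.ones((3, 3))                            # a 3 x 3 kernel with step 1
print(conv2d(x, k).shape)                      # -> (2, 2), matching Equation (4)
```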

2.2. Pooling Layer

In general, two pooling operations are commonly used: average pooling and max pooling.


Figure 4 shows the average pooling operation, which adds up the elements within the specified range and takes the average. Max pooling takes the maximum element within the range as a new value, which is shown in Figure 5.

Figure 4. Schematic diagram of average pooling.

Figure 5. Schematic diagram of max pooling.
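
The two pooling operations in Figures 4 and 5 can be written in a few lines. The sketch below is a generic illustration with a 2 × 2 window and a stride of 2 (example values), not code from the paper.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Average or max pooling over windows of the given size (Figures 4 and 5)."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 2., 3.],
              [4., 5., 6., 7.]])
print(pool2d(x, mode="avg"))  # average pooling: each value is the mean of a 2 x 2 block
print(pool2d(x, mode="max"))  # max pooling: each value is the largest element of the block
```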

2.3. Activation Function Layer

In order to increase the fitting ability of neural networks, an activation function layer is added to make the original linear neural networks fit nonlinear functions. Early researchers used sigmoid as the activation function layer. In training deep neural networks, due to features of the sigmoid function, the gradient approaches 0 in the functional saturation region, which results in the disappearance of the gradient and bottlenecks the training. Currently, a commonly used activation function layer is Rectified Linear Unit (ReLU), a piecewise function, where the gradient is kept at 1 in the region greater than 0 and is kept at 0 in the region smaller than 0. In addition to solving the disappearance of the gradient, ReLU can increase the sparsity of neural networks and prevent overfitting. Current common activation functions are shown in Table 2.

Table 2. Common activation functions.

Name | Functions | Derivatives
Sigmoid | σ(x) = 1 / (1 + e^(−x)) | f'(x) = f(x)(1 − f(x))
tanh | σ(x) = (e^x − e^(−x)) / (e^x + e^(−x)) | f'(x) = 1 − f(x)²
ReLU | f(x) = 0 if x < 0; x if x ≥ 0 | f'(x) = 0 if x < 0; 1 if x ≥ 0
Leaky ReLU | f(x) = 0.01x if x < 0; x if x ≥ 0 | f'(x) = 0.01 if x < 0; 1 if x ≥ 0
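
The functions and derivatives in Table 2 translate directly into code. The following sketch is a generic illustration and is not part of the RQDNN implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # f'(x) = f(x)(1 - f(x))

def relu(x):
    return np.where(x < 0, 0.0, x)    # 0 for x < 0, x otherwise

def relu_grad(x):
    return np.where(x < 0, 0.0, 1.0)  # gradient is 0 for x < 0 and 1 for x >= 0

def leaky_relu(x, slope=0.01):
    return np.where(x < 0, slope * x, x)

def leaky_relu_grad(x, slope=0.01):
    return np.where(x < 0, slope, 1.0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), relu_grad(x))          # ReLU keeps the gradient at 1 in the positive region
```

3. The Proposed RQDNN

To reduce the hardware computing resources and training time of deep reinforcement learning, this study proposed the reinforcement Q-learning-based deep neural network (RQDNN), as shown in Figure 6. St and St+1 are the input and output of the neural network, respectively, taking a reverse transfer to update the neural network parameters, in which L(θ) is the loss function of the RQDNN.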


Figure 6. Diagram of the reinforcement Q-learning-based deep neural network (RQDNN) algorithm.

This section introduces the network architecture and operation flow of the RQDNN. Firstly, in order to train the algorithm stably, by reference to the Nature DQN of Mnih et al. [3], this paper used a double-network architecture to estimate the loss function; the loss function of the original DQN is shown in Equation (5). After each parameter update, the original neural network will be changed when fitting the target, which leads to the algorithm being unable to converge stably.

Loss = Σ[(R + γ max_{A'} Q(S', A', θ) − Q(S, A, θ))²]. (5)

Traditional DQN uses two groups of neural networks with identical architectures: the Q-Network and the Target Q-Network. The parameter of the Q-Network is θ, while the parameter of the Target Q-Network is θ⁻. The loss function is then changed to Equation (6). In every training step, only the parameters of the Q-Network are updated, and, after every 100 updates, the parameters of the Q-Network are copied to the Target Q-Network. In such a way, loss function calculations will be stable and accurate and will reduce the convergence time of the algorithm.

Loss = Σ[(R + γ max_{A'} Q(S', A', θ⁻) − Q(S, A, θ))²]. (6)

DQN adopts a convolution neural network architecture, which contains three convolutional layers and two full connection layers. The architecture is simple, but, because of the features of convolution neural networks, the computing resources and time for training are greatly increased. Moreover, in the early training of the algorithm, the convolution kernels cannot extract effective features, which greatly increases the training time of DQN. To solve this problem, a new network architecture, RQDNN, which combines DPCANet and Q-learning, was proposed in this study.
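
The only difference between Equations (5) and (6) is which parameters produce the bootstrap target. The PyTorch-style sketch below illustrates the target-network variant of Equation (6) together with the copy of θ into θ⁻ every 100 updates; the layer sizes, optimizer, and learning rate are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(8192, 512), nn.ReLU(), nn.Linear(512, 2))       # parameters theta
target_net = nn.Sequential(nn.Linear(8192, 512), nn.ReLU(), nn.Linear(512, 2))  # parameters theta^-
target_net.load_state_dict(q_net.state_dict())     # start both networks from the same weights
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)
gamma, update_count = 0.99, 0

def td_loss(s, a, r, s_next):
    """Equation (6): the target is computed with the frozen parameters theta^-."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)            # Q(S, A; theta)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values   # R + gamma * max_a' Q(S', A'; theta^-)
    return ((target - q_sa) ** 2).sum()

def train_step(batch):
    global update_count
    loss = td_loss(*batch)            # batch = (states, actions, rewards, next_states) tensors
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_count += 1
    if update_count % 100 == 0:       # copy theta into theta^- every 100 updates
        target_net.load_state_dict(q_net.state_dict())
```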

3.1. The Proposed DPCANet

The proposed DPCANet is an 8-layer neural network. The network architecture is shown in Figure 7, and the operation in each layer is described as follows:


Figure 7. Diagram of the proposed DPCANet.

• The first layer (input layer):

This layer has no operations, but it is responsible for pre-treatment of the input game pictures, which involves magnifying images to 80 × 80 and recording the time it takes for the convolution operation.

• The second and third layers (convolutional layers):

Both of these layers are convolutional layers. There are 8 convolution kernels on the second layer and 4 on the third layer. The size of the convolution kernels is 5 × 5 with a step of 1. But, different from the parameters of the convolution kernels in traditional convolution neural networks, which are learned by the algorithm, the parameters of the convolution kernels on both layers are gained from principal component analysis calculations. This calculation flow is shown in Figure 8.

Figure 8. Flowchart of parameter calculations of convolution kernels.

Before calculating the parameters of the convolution kernels, images were first collected through interaction with the games, and models of random actions served as the training samples. There were a total of n training samples, which is indicated by X = {x_1, x_2, x_3, ..., x_n}.

The first step is to sample x_1 in the range of 5 × 5, and then flatten all blocks gained from sampling and splice them into a matrix x̃_1 with a size of 25 × 5776, calculated as in Equation (7), where h and w are, respectively, the length and width of x_1,

k_h and k_w are, respectively, the length and width of the convolution kernel, and s is the step. Next, in the same way, the other samples are sampled to get the training set X̃ = {x̃_1, x̃_2, x̃_3, ..., x̃_n}.

(k_h × k_w) × (((h − k_h)/s + 1) × ((w − k_w)/s + 1)). (7)

The second step is to calculate the covariance matrix of x̃_1. Firstly, the mean m of X̃ is calculated with a size of 1 × 5776, m is subtracted from x̃_1 to get a new x̃_1, and then x̃_1 is multiplied by its own transpose to get x̃_1 x̃_1ᵀ. The same operation is conducted for the other training samples, and then X̃X̃ᵀ = {x̃_1 x̃_1ᵀ, x̃_2 x̃_2ᵀ, x̃_3 x̃_3ᵀ, ..., x̃_n x̃_nᵀ} is obtained based on the calculation shown in Equation (8). The size of the covariance matrix σ is 25 × 25.

σ = (Σ_{i=1}^{n} x̃_i x̃_iᵀ) / (((h − k_h)/s + 1) × ((w − k_w)/s + 1)). (8)

Lastly, σ is entered into the Lagrange equation, and an eigenvector matrix of size 25 × 25 is obtained. Then, the first 8 dimensions are taken as the parameters of the 8 convolution kernels on the second layer. The parameters of the 4 convolution kernels on the third layer can also be obtained through the above-mentioned steps. But, before calculating the convolution kernels on the third layer, a 2-layer convolution operation is conducted on X = {x_1, x_2, x_3, ..., x_n} to get XL_2 = {x_11, x_12, ..., x_18, x_21, x_22, ..., x_28, ..., x_n8}, and then the parameters of the third-layer convolution kernels are calculated with XL_2 as the new training sample set.

• The fourth layer (block-wise histograms layer):

After performing the convolution operations on the second and third layers, the original input image will produce 32 feature maps (1 × 8 × 4) in total, and effects similar to multi-layer convolution neural networks can be obtained through the superposition of the second and third layers. For the activation function layer, however, we used block-wise histograms to achieve the nonlinear transformation of the original activation function layer. The specific operation process is shown below. After the third layer is convolved, the feature map is used for binarization and is expressed as XL_3 = {x_11, x_12, ..., x_14, x_21, x_22, ..., x_24, ..., x_84}. Firstly, XL_3 is grouped every 4 pieces, and n groups of feature maps are gained in total, x_n = {x_n1, x_n2, x_n3, x_n4}, where n is the number of convolution kernels on the second layer. Based on the operation shown in Equation (9), the size of x̃_n is 72 × 72. Next, x̃_n is divided into four blocks 18 × 18 in size, expressed as x̃_1n, x̃_2n, x̃_3n, x̃_4n. The four blocks are used for histograms and spliced into a 1024-dimensional vector. Next, the other groups of feature maps are operated on in the same way, and lastly, all of them are spliced into an 8192-dimensional eigenvector as the output of this layer.

x̃_n = Σ_{i=1}^{4} 2^{4−i} x_{ni}. (9)

• The fifth, sixth, and seventh layers (full connection layers):

The fifth, sixth, and seventh layers are full connection layers, and the dimensions are, respectively, 8192, 4096, and 512, with operations the same as those of other normal neural networks.

• The eighth layer (output layer):

The eighth and last layer is the output layer. The output dimension is the number of actions that can be controlled in the games and was set as 2 in this paper.
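
As a rough sketch of the PCA-based kernel computation described for the second and third layers (Equations (7) and (8)), the NumPy code below extracts 5 × 5 patches, removes the per-patch mean, accumulates the 25 × 25 covariance, and keeps the leading eigenvectors as kernels. It is a simplified illustration under the assumptions noted in the comments, not the authors' code.

```python
import numpy as np

def pca_kernels(images, k=5, s=1, n_kernels=8):
    """Estimate n_kernels convolution kernels of size k x k by PCA over image patches."""
    cov = np.zeros((k * k, k * k))
    for img in images:                        # images: list of 2-D arrays, e.g., 80 x 80 game frames
        h, w = img.shape
        patches = []
        for i in range(0, h - k + 1, s):      # Equation (7): ((h-k)/s+1)*((w-k)/s+1) patches of k*k pixels
            for j in range(0, w - k + 1, s):
                patches.append(img[i:i + k, j:j + k].ravel())
        x = np.array(patches).T               # shape (k*k, number of patches)
        x = x - x.mean(axis=0, keepdims=True) # subtract the per-patch mean
        cov += x @ x.T / x.shape[1]           # Equation (8): accumulate the 25 x 25 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigen-decomposition of the symmetric covariance
    order = np.argsort(eigvals)[::-1]         # sort eigenvectors by decreasing eigenvalue
    return [eigvecs[:, order[i]].reshape(k, k) for i in range(n_kernels)]

frames = [np.random.rand(80, 80) for _ in range(16)]  # placeholder frames gathered by random play
kernels = pca_kernels(frames)                          # 8 kernels for the second layer of DPCANet
```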

Next, the operation flow of DPCANet will be explained. The training flow comprises an interaction stage, a storage and selection stage, and an update stage, as shown in Figure 9.

Figure 9. Flowchart of deep principal component analysis network (DPCANet).

3.2. The Proposed RQDNN

Before starting the major stage, initialization is conducted as the flow shown in Figure 10. In the beginning, an experience pool will be initialized. The experience pool D is responsible for all states and actions explored by RQDNN during training, and, at the update stage, experience will be randomly extracted from the experience pool to add into the parameter update. Next, the Q-Network and Target Q-Network will be initialized, and the architectures of both networks are the same, as shown in Figure 7. At the update stage, parameter θ of the Q-Network is the only one updated, and parameter θ⁻ of the Target Q-Network is synchronized with parameter θ of the Q-Network upon every 100 update stages. Lastly, the environment will be initialized, and the initial state s_0 will be the output. Next is the circulation of the interaction stage, storage and selection stage, and update stage, until the stopping condition is met. In this study, the stopping condition was to reach the specified survival time. The interaction stage, storage and selection stage, and update stage operate as below:

Figure 10. Flowchart of initialization.

Upon completion of initialization, next comes the interaction stage with the flow, as shown in Figure 11.

In the beginning, the states are taken from the environment, and the Q-values of all actions under s_t shall be calculated by the Q-Network. Next, ε-greedy is used to select the action a_t that will interact with the environment. If the probability is greater than ε, the action a_t will be selected randomly. In the end, a new state s_{t+1} will be gained through the interaction between a_t and the environment.



Figure 11. Flowchart of the interaction stage.
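
The ε-greedy choice made in the interaction stage can be written compactly, as in the generic sketch below; the exploration rate is a placeholder value, and the Q-value function is assumed to be any callable returning one value per action.

```python
import random
import numpy as np

epsilon = 0.1   # exploration rate (placeholder value, not from the paper)

def select_action(q_values_fn, state, n_actions=2):
    """Interaction stage: act greedily on the Q-values, but explore with probability epsilon."""
    if random.random() < epsilon:
        return random.randrange(n_actions)          # random exploratory action
    q_values = np.asarray(q_values_fn(state))       # Q-values of all actions under s_t
    return int(np.argmax(q_values))                 # greedy action a_t
```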

Next is the storage and selection stage. Algorithms related to Q-learning are basically updated in single steps; namely, the update is conducted for each interaction with the environment. Its advantages are its extremely high efficiency and that the update can be done without completion of the entire flow. The disadvantage is that the learned strategies are tailored to a single target, and, once the game flow is changed, the effects of the learned strategies will be very poor. In order to solve this problem, DQN proposes replay memory, which stores all the experienced interactions with the environment in the experience pool D. At the update stage, the experience most recently used in the interaction stage will not be directly used. Instead, a batch of experience randomly selected from the experience pool will be used. In this paper, the batch was set as 32, and the flow of storage and selection is shown in Figure 12.
In this paper, the batch was set as 32, and the flow of storage and selection is shown in Figure 12. Start Start Examine the capacity No Select a batch of Start Examineof experience the capacity pool is No Selectexperience a batch of ofExamine experienceexceeded the orcapacity pool not is No Selectexperience a batch of Store (st, rt, αt, st+1) ofexceeded experience or pool not is Storein experience (st, rt, αt, spoolt+1) D experience Yesexceeded or not in experienceStore (st, r poolt, αt, s tD+1) in experience pool D Yes DeleteYes the oldest End Deleteexperience the oldest End Deleteexperience the oldest End experience Figure 12. Flowchart of the storage and selection stage.

The last is the update stage, and the flow is as shown in Figure 13. Upon selection from the experience pool, Equation (6) will be used to calculate the loss function, and the format of the experience is ⟨s_t, r_t, a_t, s_{t+1}⟩. Q(s_t, a_t; θ) and max Q(s_{t+1}, a_{t+1}; θ⁻) are calculated through the Q-Network and the Target Q-Network and, after that, the state s_{t+1} is examined and r is given as the reward. When the barrier is successfully passed, r = 1; otherwise, r = −1; and in the state of persistent existence, r = 0.1. Next, the loss function L(θ) can be calculated. Stochastic gradient descent is used to update the parameter θ of the Q-Network, and then the parameter θ⁻ of the Target Q-Network is synchronized with the parameter θ of the Q-Network upon every 100 update stages.

Figure 13. Flowchart of the update stage.
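
Putting the stages of Figures 10 to 13 together, one training episode looks roughly like the schematic sketch below. Here env, choose_action, buffer, and train_step are hypothetical helpers standing in for the interaction, storage, and update stages, and their signatures are illustrative assumptions; only the reward values of 1, −1, and 0.1 follow the text.

```python
# Schematic sketch of one training episode of the RQDNN flow.
# env, choose_action, buffer, and train_step are hypothetical helpers; their
# interfaces are assumptions made for this illustration.

def run_episode(env, choose_action, buffer, train_step, batch_size=32, max_steps=10000):
    s = env.reset()                              # initialization: obtain the initial state s_0
    for _ in range(max_steps):
        a = choose_action(s)                     # interaction stage (epsilon-greedy selection)
        s_next, passed, done = env.step(a)       # the environment feeds back the new state
        if passed:                               # the barrier was successfully passed
            r = 1.0
        elif done:                               # failure ends the episode
            r = -1.0
        else:                                    # persistent existence
            r = 0.1
        buffer.store(s, r, a, s_next)            # storage stage: keep the experience in pool D
        if len(buffer.pool) >= batch_size:
            train_step(buffer.sample(batch_size))  # update stage: SGD step on Equation (6)
        s = s_next
        if done:
            break
```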



Based on the above-mentioned stages, the proposed RQDNN can effectively learn to determine the playing strategy. The training time and hardware resources consumed in RQDNN are much less than those of DQN.

4. Experimental Results

In order to evaluate the effectiveness of the method proposed in this study, performance comparisons of different convolution layers and of the two games (i.e., Flappy Bird and Atari Breakout) are implemented.

4.1. Evaluation of Different Convolution Layers in DPCANet

In this study, the proposed DPCANet replaced the traditional convolutional neural network (CNN) for extracting image features. DPCANet adopted a principal component analysis to calculate the convolution kernel parameters, which can significantly reduce the computation time. In this subsection, the Modified National Institute of Standards and Technology (MNIST) database was used to test the network performance of different convolution layers. The MNIST dataset consists of grayscale images of handwritten digits 0–9. One thousand samples were used for training and testing in DPCANet. The experimental results are shown in Table 3. In this table, the performance of network 2 was better than that of the other networks. In addition, more convolutional layers increased the computing time and could not improve the accuracy during fixed iterations. Therefore, the architecture of network 2 was used to extract the features of the game images.

Table 3. Performance evaluation of different convolution layers.

          | 1st Layer Dimension of Convolution | 2nd Layer Dimension of Convolution | 3rd Layer Dimension of Convolution | 4th Layer Dimension of Convolution | Testing Error Rate | Computation Time per Image (s)
Network 1 | 8 | NaN | NaN | NaN | 5.5% | 0.12
Network 2 | 8 | 4   | NaN | NaN | 4.1% | 0.17
Network 3 | 8 | 4   | 4   | NaN | 4.6% | 0.22
Network 4 | 8 | 4   | 4   | 4   | 4.8% | 0.34

4.2. The Flappy Bird Game

The Flappy Bird game is shown in Figure 14. In this game, a player can manipulate the bird in the picture to continuously fly past water pipes. This game seems easy, as there are only two actions: flying or not flying. But the position and length of the water pipes in the picture are completely random; thus, the state space is very large. Learning such a relationship was used to verify the effectiveness of RQDNN.

Figure 14. The Flappy Bird game.

Next is the comparison of training and test results between the proposed RQDNN and DQN [2]. Firstly, the hardware configurations used for training are introduced, as shown in Table 4. The training results are shown in Figure 15. This figure shows that the training time of RQDNN was shorter than that of DQN [2]. Moreover, the proposed RQDNN did not require the accelerated computing of a Graphics Processing Unit (GPU) as an external computing resource.

Table 4. Hardware configurations of RQDNN and DQN [2]

              | RQDNN              | DQN [2]
CPU           | Intel Xeon E3-1225 | Intel Xeon E3-1225
GPU           | None               | GTX 1080Ti
Training time | 2 hours            | 5 hours

Figure 15. Comparison results of RQDNN and Deep Q-learning Network (DQN) [2] training with the Flappy Bird game.

Here, we set passing 200 water pipes as the condition for successful training. In Figure 15, the rewards were given at the 400th and the 1200th trials using RQDNN and DQN, respectively. In addition, the proposed RQDNN required only 1213 trials to achieve successful training, while DQN required 3145 trials. Next, the testing results were compared. With ε set as 0, the results gained over 10 executions are shown in Table 5. The scores of the human player were gained from tests done by human players. In Table 5, the score gained by the proposed method was better than that gained by the DQN proposed by Mnih [2] and significantly better than that gained by human players.










Table 5. Testing results of the Flappy Bird game using various methods.

                | Min | Max | Mean
Proposed method | 223 | 287 | 254.2
V. Mnih [2]     | 176 | 234 | 213.4
Human player    | 10  | 41  | 21.6

4.3. The Atari Breakout Game

In Breakout, six rows of bricks are arranged on the first third of the screen. A ball moves straight around the screen and bounces off the top and sides of the screen. When a brick is hit, the ball bounces, and the brick is destroyed. The player has a paddle that moves horizontally to reflect the ball and destroy all the bricks. There are five chances in a game. If the ball passes the paddle, the player will lose a chance. The Atari Breakout game is shown in Figure 16.

Figure 16. The Atari Breakout game.
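To make the episode structure of this experiment concrete, the listing below models the five-chances rule with a gym-like interface. It is an illustrative sketch only: the environment object, the policy callable, and the life_lost flag are assumptions of this example, not details of the original implementation.

def play_one_game(env, policy):
    """Play one Breakout game: the player has five chances, and losing
    the ball past the paddle costs one chance.

    env is a hypothetical gym-like environment; policy(state) returns an
    action (e.g., the greedy action of a trained Q-network)."""
    state = env.reset()
    lives, total_score, done = 5, 0.0, False
    while lives > 0 and not done:
        action = policy(state)
        state, reward, done, info = env.step(action)
        total_score += reward
        if info.get("life_lost", False):  # assumed flag: ball passed the paddle
            lives -= 1
    return total_score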

In this game, 3500 trials were set as the condition for completing the training process. The training results of the proposed RQDNN and DQN are shown in Figure 17. This figure shows that the proposed method had a faster convergence rate than DQN in playing the Breakout game. After 3500 trials, the proposed RQDNN sustained play in Breakout for 1179 time steps, whereas DQN sustained only 570 time steps. The experimental results showed that the proposed RQDNN can keep a longer playing time than DQN in the Breakout game.

Figure 17. Comparison results of RQDNN and DQN [2] in the Breakout game.
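The convergence comparison of Figure 17 can be reconstructed from per-trial logs with a simple moving average. The sketch below is an illustrative example under stated assumptions: steps_per_trial is a hypothetical list holding the number of time steps survived in each of the 3500 training trials.

import numpy as np

def smoothed_curve(steps_per_trial, window=100):
    """Moving average of the time steps kept per trial, used to visualize
    how quickly an agent's playing time grows during training."""
    steps = np.asarray(steps_per_trial, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(steps, kernel, mode="valid")

# Example usage with hypothetical logged data:
# rqdnn_curve = smoothed_curve(rqdnn_steps)
# dqn_curve = smoothed_curve(dqn_steps)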

5. Conclusions and Future Work

In this study, a reinforcement Q-learning-based deep neural network (RQDNN) was proposed for playing video games. The proposed RQDNN combines DPCANet and Q-learning to determine the playing strategy of video games. Video game images were used as the inputs. DPCANet was used to initialize the parameters of the convolution kernels and to capture image features automatically, which yielded good rewards and shortened the convergence time of the training process. The reinforcement Q-learning method was used to implement the game control strategies. Experimental results showed that the scores of the proposed RQDNN were better than those of human players and other deep reinforcement learning algorithms. In addition, the training time of the proposed RQDNN was also far less than that of other deep reinforcement learning algorithms. Moreover, the proposed architecture did not require a GPU to accelerate computation and needed fewer computing resources than traditional deep reinforcement learning algorithms. Although the architecture of the proposed method replaces the convolutional and pooling layers of the original convolution neural network, the block-wise histogram layer used as the activation function layer is not a good architecture: its output dimension is large, and the way the extracted features are combined can also be improved. In future work, we will focus on using kernel principal component analysis and other nonlinear methods to generate convolution kernels and improve the performance of deep reinforcement learning.
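To illustrate the kernel-initialization idea summarized above, and the component that future work on kernel principal component analysis would replace, the following sketch derives convolution kernels from the leading principal components of randomly sampled image patches. The patch size, number of kernels, and function name are illustrative choices for a simplified, PCANet-style example rather than the exact DPCANet procedure.

import numpy as np

def pca_convolution_kernels(images, patch_size=7, num_kernels=8, num_patches=10000):
    """Build convolution kernels from the leading principal components of
    image patches. images has shape (N, H, W) with grayscale game frames."""
    rng = np.random.default_rng(0)
    n, h, w = images.shape
    patches = np.empty((num_patches, patch_size * patch_size))
    for i in range(num_patches):
        img = images[rng.integers(n)]
        y = rng.integers(h - patch_size + 1)
        x = rng.integers(w - patch_size + 1)
        patches[i] = img[y:y + patch_size, x:x + patch_size].ravel()
    patches -= patches.mean(axis=0)                # center the patch data
    cov = patches.T @ patches / (num_patches - 1)  # patch covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:num_kernels]]
    return top.T.reshape(num_kernels, patch_size, patch_size)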

Author Contributions: Conceptualization and methodology, C.-J.L., J.-Y.J. and K.-Y.Y.; software, J.-Y.J. and H.-Y.L.; writing—original draft preparation, C.-L.L. and H.-Y.L.; writing—review and editing, C.-J.L. and C.-L.L.

Funding: This research was funded by the Ministry of Science and Technology of the Republic of China, Taiwan, grant number MOST 106-2221-E-167-016.

Conflicts of Interest: The authors declare no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References


1. Togelius, J.; Karakovskiy, S.; Baumgarten, R. The 2009 Mario AI Competition. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010. [CrossRef]
2. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M.A. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.A.; Fidjeland, A.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [CrossRef] [PubMed]
4. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized Experience Replay. arXiv 2016, arXiv:1511.05952.
5. Hasselt, H.v.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16), Phoenix, AZ, USA, 12–17 February 2016.
6. Wang, Z.; Schaul, T.; Hessel, M.; Hasselt, H.V.; Lanctot, M.; Freitas, N.D. Dueling Network Architectures for Deep Reinforcement Learning. arXiv 2015, arXiv:1511.06581.
7. Li, D.; Zhao, D.; Zhang, Q.; Chen, Y. Reinforcement Learning and Deep Learning Based Lateral Control for Autonomous Driving. IEEE Comput. Intell. Mag. 2019, 14, 83–98. [CrossRef]
8. Martinez, M.; Sitawarin, C.; Finch, K.; Meincke, L. Beyond Grand Theft Auto V for Training, Testing and Enhancing Deep Learning in Self Driving Cars. arXiv 2017, arXiv:1712.01397.
9. Yu, X.; Lv, Z.; Wu, Y.; Sun, X. Neural Network Modeling and Backstepping Control for Quadrotor. In Proceedings of the 2018 Chinese Automation Congress (CAC), Xi’an, China, 30 November–2 December 2018. [CrossRef]
10. Kersandt, K.; Muñoz, G.; Barrado, C. Self-training by Reinforcement Learning for Full-autonomous Drones of the Future. In Proceedings of the 2018 IEEE/AIAA 37th Digital Avionics Systems Conference (DASC), London, UK, 23–27 September 2018. [CrossRef]
11. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [CrossRef]
12. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012.
13. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556.

14. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [CrossRef]
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).