Reinforcement Learning with Convolutional Reservoir Computing
Hanten Chang, Katsuya Futagami
Graduate School of Systems and Information Engineering, University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
s1820554, [email protected]

Abstract

Recently, reinforcement learning models have achieved great success, mastering complex tasks such as Go and other games with higher scores than human players. Many of these models store considerable data on the tasks and achieve high performance by extracting visual and time-series features using convolutional neural networks (CNNs) and recurrent neural networks, respectively. However, these networks have very high computational costs because they need to be trained by repeatedly using the stored data. In this study, we propose a novel practical approach called the reinforcement learning with convolutional reservoir computing (RCRC) model. The RCRC model uses a fixed random-weight CNN and a reservoir computing model to extract visual and time-series features. Using these extracted features, it decides actions with an evolution strategy method. Thereby, the RCRC model has several desirable features: (1) there is no need to train the feature extractor, (2) there is no need to store training data, (3) it can take a wide range of actions, and (4) there is only a single task-dependent weight parameter to be trained. Furthermore, we show that the RCRC model can solve multiple reinforcement learning tasks with a completely identical feature extractor.

Introduction

Recently, reinforcement learning (RL) models have achieved great success, mastering complex tasks such as Go (Silver et al. 2016) and other games (Mnih et al. 2013; Horgan et al. 2018; Kapturowski et al. 2019) with higher scores than human players. Many of these models use convolutional neural networks (CNNs) to extract visual features directly from the environment state images (Arulkumaran et al. 2017). Some models also use recurrent neural networks (RNNs) to extract time-series features and achieve higher scores (Hausknecht and Stone 2015).

However, these deep neural network (DNN) based models are often very computationally expensive in that they train network weights by repeatedly using a large volume of past playing data and task-rewards. Certain techniques can alleviate these costs, such as the distributed approach (Mnih et al. 2016; Horgan et al. 2018), which efficiently uses multiple agents, and prioritized experience replay (Schaul et al. 2015), which selects samples that facilitate training. However, the cost of the whole series of computations, from data collection to action determination, remains high.

The world model (Ha and Schmidhuber 2018) can also reduce computational costs, by completely separating the training processes of the feature extraction model and the action decision model. The world model trains the feature extraction model in a reward-independent manner using a variational auto-encoder (VAE) (Kingma and Welling 2013; Jimenez Rezende, Mohamed, and Wierstra 2014) and a mixture density network combined with an RNN (MDN-RNN) (Graves 2013). After extracting the environment state features, it uses an evolution strategy method called the covariance matrix adaptation evolution strategy (CMA-ES) (Hansen and Ostermeier 2001; Hansen 2016) to train the action decision model. The world model achieves outstanding scores in famous RL tasks. The separation of these two models results in the stabilization of feature extraction and a reduction of the parameters that must be trained on task-rewards.

The success of the world model implies that, in the RL feature extraction process, it is more important to extract features that sufficiently express the environment state than to train features for higher rewards. Adopting this idea, we propose a new method called "reinforcement learning with convolutional reservoir computing (RCRC)". The RCRC model is inspired by reservoir computing.

Reservoir computing (Lukoševičius and Jaeger 2009) is a kind of RNN in which the model weights are set randomly. One of the reservoir computing models, the echo state network (ESN) (Jaeger 2001; Jaeger and Haas 2004), is used to solve time-series tasks such as future value prediction. For this, the ESN extracts features of the input signal based on the dot product of the input signal with fixed random-weight matrices that are generated without training. Surprisingly, features obtained in this manner are expressive enough to capture the input, and complex tasks such as chaotic time-series prediction can be solved by passing them to a linear model. In addition, the ESN has solved various tasks in multiple fields, such as time-series classification (Tanisaro and Heidemann 2016; Ma et al. 2016) and Q-learning-based RL (Szita, Gyenes, and Lőrincz 2006). Similarly, in image classification, a model that uses features extracted by a fixed random-weight CNN as the ESN input achieves highly accurate classification with a smaller number of parameters (Tong and Tanaka 2018).

Based on the success of these fixed random-weight models, the RCRC model extracts the visual features of the environment state using a fixed random-weight CNN and, using these features as the ESN input, extracts time-series features of the environment state transitions. After extracting the environment state features, we use CMA-ES (Hansen and Ostermeier 2001; Hansen 2016) to train a linear transformation from the extracted features to the actions, as in the world model. This architecture removes the training process of the feature extractor and reduces computational costs; there is also no need to store past playing data. Furthermore, we show that the RCRC model can solve multiple RL tasks with a feature extractor whose structure and weights are completely identical across tasks.
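As a rough sketch of this first stage, the following PyTorch snippet builds a CNN whose weights are randomly initialized once and then frozen, and uses its flattened activations as the visual features fed to the ESN. The layer widths, kernel sizes, and input resolution are our own illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FixedRandomCNN(nn.Module):
    """CNN feature extractor whose weights are randomly initialized once
    and then frozen; it is never trained. Layer widths and kernel sizes
    here are illustrative, not the paper's exact configuration."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2), nn.ReLU(),
        )
        for p in self.parameters():        # freeze all weights
            p.requires_grad = False

    def forward(self, frame):              # frame: (batch, 3, H, W) in [0, 1]
        with torch.no_grad():
            h = self.features(frame)
        return h.flatten(start_dim=1)      # flattened visual features u(t)

extractor = FixedRandomCNN()
u_t = extractor(torch.rand(1, 3, 64, 64))  # e.g. one 64x64 observation frame
```

Because the weights are never updated, the same extractor can be reused across tasks without retraining.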
Our contributions in this study are as follows:
• We developed a novel and widely applicable approach that extracts visual and time-series features of an RL environment state using a fixed random-weight network feature extractor, with no training.
• We developed the RCRC model, which needs neither to store training data nor to train the feature extractor.
• We showed that the RCRC model can solve different tasks by training only a single task-dependent weight matrix, using a completely identical feature extractor.

Related Work

Reservoir Computing

Reservoir computing is a kind of RNN that extracts features of the input without any training of the feature extraction process. In this study, we focus on one reservoir computing model, the ESN (Jaeger 2001; Jaeger and Haas 2004). The ESN was initially proposed to solve time-series tasks (Jaeger 2001) and is regarded as an RNN model (Lukoševičius and Jaeger 2009; Lukoševičius 2012).

Let the $N$-length, $D_u$-dimensional input signal be $u = \{u(1), u(2), \dots, u(t), \dots, u(N)\} \in \mathbb{R}^{N \times D_u}$, and let the signal obtained by appending one bias term be $U = [u; 1] = \{U(1), U(2), \dots, U(t), \dots, U(N)\} \in \mathbb{R}^{N \times (D_u + 1)}$, where $[\,;\,]$ denotes vector concatenation. The ESN obtains features called the reservoir state $X = \{X(1), \dots, X(t), \dots, X(N)\} \in \mathbb{R}^{N \times D_x}$ as follows:

$\tilde{X}(t+1) = f(W^{\mathrm{in}} U(t) + W X(t))$  (1)
$X(t+1) = (1 - \alpha) X(t) + \alpha \tilde{X}(t+1)$  (2)

where the matrices $W^{\mathrm{in}} \in \mathbb{R}^{(D_u+1) \times D_x}$ and $W \in \mathbb{R}^{D_x \times D_x}$ are sampled from a probability distribution such as a Gaussian distribution, and $f$ is the activation function, applied element-wise. The leakage rate $\alpha$ is a hyperparameter that tunes the ratio between the current and the previous values, and $W$ has two major hyperparameters called sparsity and spectral radius. The sparsity is the ratio of zero elements in the matrix $W$, and the spectral radius is a memory-capacity parameter computed as the maximal absolute eigenvalue of $W$.

Finally, the ESN estimates the target signal $y = \{y(1), y(2), \dots, y(t), \dots, y(N)\} \in \mathbb{R}^{N \times D_y}$ as

$y(t) = W^{\mathrm{out}} [X(t); U(t); 1]$  (3)

The weight matrix $W^{\mathrm{out}} \in \mathbb{R}^{D_y \times (D_x + D_u + 1)}$ is estimated by a linear model such as ridge regression. An overview of reservoir computing is shown in Figure 1.

[Figure 1: Reservoir computing overview for the time-series prediction task.]
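To make equations (1)-(3) concrete, here is a minimal NumPy sketch of the ESN, taking $f = \tanh$ (a common choice) and Gaussian random weights as above. The dimensions and hyperparameter values are illustrative, and W_in is stored transposed relative to the text's convention so the matrix-vector products line up.

```python
import numpy as np

rng = np.random.default_rng(0)
D_u, D_x = 64, 512                 # input/reservoir sizes (illustrative)
alpha, sparsity, rho = 0.8, 0.9, 0.95

# Fixed random weights, generated once and never trained. W_in is stored
# as (D_x, D_u + 1), i.e. transposed relative to the text's convention.
W_in = rng.normal(size=(D_x, D_u + 1))
W = rng.normal(size=(D_x, D_x))
W[rng.random((D_x, D_x)) < sparsity] = 0.0            # sparsity: ratio of zeros
W *= rho / np.max(np.abs(np.linalg.eigvals(W)))       # set spectral radius to rho

def esn_step(X, u):
    """One reservoir update, equations (1) and (2), with f = tanh."""
    U = np.concatenate([u, [1.0]])                     # append the bias term
    X_tilde = np.tanh(W_in @ U + W @ X)                # eq. (1)
    return (1 - alpha) * X + alpha * X_tilde           # eq. (2)

def fit_readout(Z, Y, beta=1e-6):
    """Ridge-regression estimate of W_out for eq. (3), where each row of Z
    is [X(t); U(t); 1] and each row of Y is the target y(t)."""
    return np.linalg.solve(Z.T @ Z + beta * np.eye(Z.shape[1]), Z.T @ Y).T

# Drive the reservoir with a dummy input series.
X = np.zeros(D_x)
for u in rng.random((100, D_u)):
    X = esn_step(X, u)
```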
The unique feature of the ESN is that the two matrices $W^{\mathrm{in}}$ and $W$ are randomly generated from a probability distribution and then fixed. Therefore, the training process of the ESN consists only of fitting a linear model to estimate $W^{\mathrm{out}}$; hence, the ESN has a very low computational cost. In addition, the reservoir state reflects complex dynamics despite being obtained by a random matrix transformation, and it can be used to predict complex time-series through a simple linear transformation (Jaeger 2001; Verstraeten et al. 2007; Goudarzi et al. 2014). Because of the low computational cost and the high expressiveness of the extracted features, the ESN is also used to solve other tasks such as time-series classification (Tanisaro and Heidemann 2016; Ma et al. 2016), Q-learning-based RL (Szita, Gyenes, and Lőrincz 2006), and image classification (Tong and Tanaka 2018).

World Models

The world model (Ha and Schmidhuber 2018) is an RL model that separates the training of the feature extraction model from that of the action decision model in order to train more efficiently. It uses a VAE (Kingma and Welling 2013; Jimenez Rezende, Mohamed, and Wierstra 2014) and an MDN-RNN (Graves 2013) as feature extractors. They are trained in a reward-independent manner on data from 10,000 randomly played episodes. As a result, reward-based parameters are omitted from the feature extraction process, and there remains only a single weight parameter to be trained, which decides the actions.
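The last step of this pipeline, training the single linear weight matrix with CMA-ES, can be sketched with the pycma package's ask/tell interface. The feature and action dimensions and the episode_return stub below are our own illustrative assumptions; a real implementation would roll out episodes in the environment instead of the placeholder objective.

```python
import cma                                   # the pycma package
import numpy as np

D_feat, D_act = 64, 3                        # feature/action sizes (illustrative)

def episode_return(flat_w):
    """Score one candidate controller. A real implementation would roll out
    an episode, taking action = tanh(W_c @ [features; 1]) at each step, and
    return the accumulated task reward; here a smooth placeholder objective
    stands in for the environment loop."""
    W_c = flat_w.reshape(D_act, D_feat + 1)  # the single trainable weight matrix
    features = np.ones(D_feat + 1)           # placeholder for extracted features
    return float(np.tanh(W_c @ features).sum())

# CMA-ES minimizes, so pass the negated return to maximize the score.
es = cma.CMAEvolutionStrategy(np.zeros(D_act * (D_feat + 1)), 0.5)
while not es.stop():
    candidates = es.ask()                    # sample a population of weights
    es.tell(candidates, [-episode_return(np.asarray(w)) for w in candidates])
es.result_pretty()                           # report the best candidate found
```

In the RCRC model, this flattened matrix is the only task-dependent parameter that is trained.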