Investigating Recurrent Neural Network Memory Structures Using Neuro-Evolution
Travis Desell ([email protected]), Associate Professor, Department of Software Engineering
Co-Authors: AbdElRahman ElSaid (PhD GRA), Alex Ororbia (Assistant Professor, RIT Computer Science)
Collaborators: Steven Benson, Shuchita Patwardhan, David Stadem (Microbeam Technologies, Inc.); James Higgins, Mark Dusenbury, Brandon Wild (University of North Dakota)
The 2019 Genetic and Evolutionary Computation Conference, Prague, Czech Republic, July 13th-17th, 2019

Overview
• What is Neuro-Evolution?
• Background:
  • Recurrent Neural Networks for Time Series Prediction
  • Recurrent Memory Cells
• EXAMM:
  • NEAT Innovations
  • Edge and Node Mutations
  • Crossover
• Distributed Neuro-Evolution
• Results
• Future Work
• Discussion

Motivation

Neuro-Evolution
• Applying evolutionary strategies to artificial neural networks (ANNs):
  • EAs to train ANNs (weight selection)
  • EAs to design ANNs (what architecture is best?)
  • Hyperparameter optimization (what parameters do we use for our backpropagation algorithm?)
• Using NE to better understand and guide ML: what structures and architectures are evolutionarily selected, and why?

Neuro-Evolution for Recurrent Neural Networks
• Most people use human-designed ANNs, selecting from a few architectures that have done well in the literature.
• There are no guarantees these are optimal (e.g., in size, predictive ability, generalizability, robustness, etc.).
• Recurrent edges can go back farther in time than the previous time step -- this dramatically increases the search space for RNN architectures.
• With so many memory cell structures and architectures to choose from, which are best? Which cells and architectures perform best, does this change across data sets, and why?

Background

Recurrent Neural Networks
(Figure: a simple RNN with one input node, two hidden nodes, and one output node, unrolled over time steps 0 through 3; blue arrows are forward connections, red arrows are recurrent connections.)
Recurrent Neural Networks can be extremely challenging to train due to the exploding/vanishing gradients problem. In short, when training an RNN over a time series (via backpropagation through time), it needs to be completely unrolled over the time series. For the simple example above, backpropagating the error from time 3 reaches all the way back to the input at time 0. Even with this extremely simple RNN, we end up with an extremely deep network to train. A minimal sketch of this unrolling is given below.
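To make the unrolling concrete, here is a minimal Python/NumPy sketch of the simple RNN above run over four time steps. The weight names (w_in, w_hh, w_out), shapes, and values are illustrative assumptions, not the talk's or EXAMM's implementation.

    import numpy as np

    # A tiny RNN matching the figure: one input node, two hidden nodes, one output node.
    rng = np.random.default_rng(42)
    w_in = rng.normal(size=(1, 2))    # input -> hidden (forward, "blue" edges)
    w_hh = rng.normal(size=(2, 2))    # hidden(t-1) -> hidden(t) (recurrent, "red" edges)
    w_out = rng.normal(size=(2, 1))   # hidden -> output (forward)

    def unrolled_forward(inputs):
        """Run the RNN over the whole series; every time step reuses the same weights."""
        h = np.zeros((1, 2))
        outputs = []
        for x in inputs:                       # one iteration per time step
            h = np.tanh(x @ w_in + h @ w_hh)   # hidden state depends on all earlier steps
            outputs.append(h @ w_out)
        return outputs

    series = [np.array([[0.1 * t]]) for t in range(4)]   # time 0 .. time 3
    outputs = unrolled_forward(series)

    # Backpropagation through time treats this loop as one deep network: the error at
    # time 3 flows back through every reuse of w_hh, so its gradient contains a product
    # of one Jacobian per time step. Those repeated products are what make gradients
    # explode or vanish as the series (and therefore the unrolled depth) grows.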
Recurrent Neural Networks
(Figure: the same simple RNN unrolled over time steps 0 through 3; green arrows show recurrent connections that skip back multiple time steps, orange arrows show recurrent connections that go forward in the network.)
Traditionally, RNN connections go back a single time step and to the same or a previous node in the network. This is not a requirement - they can go back multiple time steps (green) or forward in the network (orange). However, this is not as well studied, due to the additional complexity and the dramatic increase in the architectural search space.

Classification vs. Time Series Data Prediction
• RNNs are perhaps more commonly used for classification (and have been mixed with CNNs for image identification). This involves outputs being fed through a softmax layer, which results in probabilities for the input being a particular class. The error minimized is for the output being an incorrect class.
• RNNs can also be used for time series data prediction. In this case the RNN predicts an exact value of a time series some number of time steps in the future. The error being minimized is typically the mean squared error (1) or the mean absolute error (2). This is an important distinction. A short sketch of both error measures is included below.

Excerpted from "Using ACO to Optimize LSTM Recurrent Neural Networks" (GECCO '18, July 15-19, 2018, Kyoto, Japan):

Table 1: K-Fold Cross Validation Results, Prediction Errors (MAE)
                LSTM      NOE       NARX      NBJ       ACO
Subsample 1     8.34%     10.6%     8.13%     8.40%     7.80%
Subsample 2     4.05%     6.96%     6.08%     7.34%     3.70%
Subsample 3     6.76%     16.8%     11.2%     13.6%     3.49%
Mean            0.0638    0.1145    0.0847    0.0977    0.0501
Std. Dev.       0.0217    0.0497    0.0258    0.0333    0.0245

Figure 4: Nonlinear Output Error (NOE) inputs neural network. This network was updated to utilize 10 seconds of input data.

5.3 Error Function. For all the networks studied in this work, Mean Squared Error (MSE) (shown in Equation 1) was used as the error measure for training, as it provides a smoother optimization surface for backpropagation than mean absolute error. Mean Absolute Error (MAE) (shown in Equation 2) was used as the final measure of accuracy for the three architectures; because the parameters were normalized between 0 and 1, the MAE is also the percentage error.

Error = \frac{0.5 \sum (\text{Actual}_{vib} - \text{Predicted}_{vib})^2}{\text{Testing Seconds}} \quad (1)

Error = \frac{\sum \left| \text{Actual}_{vib} - \text{Predicted}_{vib} \right|}{\text{Testing Seconds}} \quad (2)

Figure 5: Nonlinear AutoRegressive with eXogenous inputs (NARX) neural network. This network was updated to utilize 10 seconds of input data, along with the previous 10 predicted output values.

5.4 Machine Specifications. Python's Theano library [23] was used to implement the neural networks, and MPI for Python [3] was used to run the ACO optimization on a high performance computing cluster. The cluster's operating system was Red Hat Enterprise Linux (RHEL) 7.2, and it had 31 nodes, each with 8 cores (248 cores in total) and 64 GB of RAM (1948 GB in total). The interconnect was 10 gigabit InfiniBand.
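Returning to the error measures referenced on the slide and in Equations 1 and 2 above, here is a minimal Python/NumPy sketch; it is not the paper's Theano code, and the vibration arrays are made-up example values rather than the flight data.

    import numpy as np

    # Made-up normalized vibration values, one per test second (assumed inputs).
    actual_vib = np.array([0.40, 0.42, 0.45, 0.43, 0.41])
    predicted_vib = np.array([0.38, 0.44, 0.41, 0.47, 0.40])
    test_seconds = len(actual_vib)

    # Equation 1: MSE, used during training for its smoother optimization surface.
    mse = 0.5 * np.sum((actual_vib - predicted_vib) ** 2) / test_seconds

    # Equation 2: MAE, used as the final accuracy measure; because the values are
    # normalized between 0 and 1, the MAE can also be read as a percentage error.
    mae = np.sum(np.abs(actual_vib - predicted_vib)) / test_seconds

    print(f"MSE = {mse:.4f}, MAE = {mae:.4f} ({mae * 100:.2f}%)")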
Excerpt, continued:

Figure 6: Nonlinear Box-Jenkins (NBJ) inputs neural network. This network was updated to utilize 10 seconds of input data, along with the future output and error values. Due to requiring future knowledge, it is not possible to utilize this network in an online fashion.

... indicative of a case of vanishing gradients. Accordingly, the study allowed for the recurrent weights to be considered in the gradient calculations in order to update the weights with respect to the cost function output.

5.2.3 Nonlinear Box-Jenkins (NBJ) Inputs Neural Network. The structure of the NBJ is depicted in Figure 6. As previously noted, this network is not feasible for prediction more than one time step into the future in an online manner, as it requires the actual value and the error between it and the predicted value to be fed back into the network. However, as this work dealt with offline data, the actual future vibration values, error, and output were all fed to the network along with the current instance parameters and lag inputs. As in the other networks, the values for the previous 10 time steps were also utilized.

6 Results. The ACO algorithm was run for 1000 iterations using 200 ants. The networks were allowed to train for 575 epochs, allowing them to learn and the error curve to flatten. The minimum value for the pheromones was 1 and the maximum was 20. The population size was equal to the number of iterations in the ACO process, i.e., the population size was also 1000. Each run took approximately 4 days.

A dataset of 57 flights was divided into 3 subsamples, each consisting of 19 flights. The subsamples were used to cross validate the results, by examining combinations utilizing two of the subsamples as the training data set and the third as the testing set. Subsamples 1, 2 and 3 consisted of 23,371, 31,207 and 25,011 seconds of flight data, respectively. These subsamples were used to train the NOE, NARX, NBJ, base architecture, and the ACO optimized architecture. Figure 7 shows predictions for the different models over a selection of test flights, and Figure 8 shows predictions on a single uncompressed (higher resolution) test flight. Table 1 compares these models to the base architecture (LSTM) and the ACO optimized architecture (ACO).

6.1 NOE, NARX, and NBJ Results. Somewhat expectedly, the NOE model performed the worst, with a mean error of 11.45% (σ = 0.0497). The NBJ model performed better than the NOE model, with a mean error of 9.77% (σ = 0.0333), ...

Memory Cell Structures - Simple Neuron
(Figure: a simple neuron; the summed inputs pass through an activation function such as tanh or sigmoid, with an optional recurrent edge shown in red.)
• "Simple" neurons are the basic neural network building block. Inputs are summed, an activation function (tanh, sigmoid, ...) is applied, and the output is fed forward to other neurons.
• Simple neurons (and any other memory cell structure) can also have recurrent edges (in red) added through the evolutionary process, as sketched below.
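Here is a minimal Python/NumPy sketch of a simple neuron with an optional recurrent edge. The class, weight initialization, and names are illustrative assumptions, not EXAMM's actual implementation.

    import numpy as np

    class SimpleNeuron:
        """Weighted inputs are summed, an activation is applied, and the result is fed
        forward. A recurrent edge (the kind the evolutionary process can add) feeds the
        neuron's previous output back in at the next time step."""

        def __init__(self, n_inputs, activation=np.tanh, recurrent=False):
            rng = np.random.default_rng(0)
            self.weights = rng.normal(size=n_inputs)
            self.recurrent_weight = rng.normal() if recurrent else 0.0
            self.activation = activation
            self.previous_output = 0.0   # state carried across time steps

        def forward(self, inputs):
            # Sum of weighted inputs plus the recurrent contribution (if any).
            total = np.dot(self.weights, inputs) + self.recurrent_weight * self.previous_output
            output = self.activation(total)
            self.previous_output = output
            return output

    # Usage: feed the neuron one input vector per time step.
    neuron = SimpleNeuron(n_inputs=3, recurrent=True)
    for x in ([0.1, 0.2, 0.3], [0.2, 0.1, 0.0]):
        print(neuron.forward(np.array(x)))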